> The data= mode was not part of the past discussion, that's why I
> brought this up now. However, reiserfs or ext3fs with data=writeback
> only journal the fsync() metadata involved, not the order of data
> (file contents) versus directory contents, so you can end up with a
> "crash - journal replay - file with bogus contents" scenario.
This should not happen with a properly written application. fsync()
flushes a bunch of stuff to disk, but it normally makes no promise
about the ORDER in which that stuff goes out. fsync() itself is how
application authors can enforce an ordering on disk operations.
For example, a typical MTA might follow this paradigm:
write temp file
fsync()
rename temp file to destination
fsync()
report success
(Yes, I know, "link/unlink" is more common in practice than rename().
But the principle is the same.)
Or, in the case of Postfix:
write message file
fsync()
chmod +x message file
fsync()
report success
The first paradigm uses the presence of a directory entry to represent
"committed" data. The second uses a mode bit on the file.
Both of these paradigms work fine with data=writeback. Yes, they
require calling fsync() twice, but that is exactly what you need to
enforce the ordering constraints!
An MTA has two ordering constraints:
1) Data must be flushed to disk before it is marked on disk as
"committed". This is to ensure that, after a crash, the MTA does
not read a corrupted mail file.
2) Data must be marked on disk as "committed" before a success code
is reported to the remote MTA. This is to ensure that no mail is
lost.
The ext3 data=ordered mode enforces the first constraint for mailers
using the "rename" paradigm, eliminating the need for the first
fsync() call. But any MTA which relies on data=ordered semantics is
not only Linux-specific, but ext3/reiserfs specific!
Synchronous directory updates, a la FFS, enforce the second constraint
(again for the "rename" paradigm), eliminating the need for the second
fsync().
But to be robust across platforms and file systems, a mailer needs
both fsync() calls. (On Linux, you actually need to fsync() the
*directory*, not the file, for the "rename" paradigm. It would be
nice if we could convince MTA authors to do this.)
> I don't think so. They'd rather declare ReiserFS unsupported and go with
> chattr +S. Seen that.
>
> New implementations (Courier's maildrop) still rely on BSD FFS
> "synchronous directory" semantics.
Are you sure? Because that is ridiculous... Modern BSDs like to use
"soft updates", which need that second fsync() to commit the metadata.
So as long as fsync() commits the journal, either paradigm above
should work fine under any journaling mode.
Summary: *All* MTAs should call fsync() twice. The only issue is what
descriptors they should call it on, exactly :-).
- Pat
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/