We're working on that; see the "[PATCH] 64 bit scsi read/write" thread
on linux-fsdevel. About half of it is devoted to investigating the
detailed semantics of physical write completion.
> 2. The OS _MUST_ _NOT_ acknowledge the (assumedly synchronous
> operation) any earlier. (This may well include switching off drive
> write buffering.)
Yes, for now that's how you have to do it.
> If the OS cannot fulfill these two basic requirements, I can save all
> the log or FS atomicity efforts because they don't get me anywhere.
>
> The problem is not that the operation can fail, the problem IS
> premature acknowledgement. Even with atomic updates, as shown above.
Right now the interface for determining that the operation has actually
completed is "sync". Yes, that sucks, but with journalling or atomic
commit it's not nearly as expensive as you might think. My early flush
patch does nearly the equivalent of sync 10 times a second, and it
actually improves performance (it does not attempt this under high
load, of course).
We *should* have something like sys_sync_dev(majorminor) or
sys_sync_fs(mountpoint) (whatever that would look like). For
phase-tree the semantics are that the call doesn't return until the
metaroot of the then-current "branching" tree is known to be safely on
disk. (Side note: it's ok to allow subsequent updates on the same
filesystem to proceed while an outstanding sync call is waiting for
confirmation from the block layer, because these don't affect the
filesystem state that call is waiting on.)
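
For illustration only, here is a minimal user-space sketch of what a
per-filesystem sync call might look like. It assumes a Linux system
whose C library provides syncfs(2), which did not exist when this was
written; the name sync_fs() and its mountpoint argument are
hypothetical, taken from the suggestion above. It does not model the
phase-tree metaroot semantics, only the "don't return until this one
filesystem is flushed" part.

```c
#define _GNU_SOURCE             /* for syncfs() and O_DIRECTORY */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the proposed sys_sync_fs(mountpoint).
 * Flushes only the filesystem containing `mountpoint`.
 * Returns 0 on success, -1 with errno set on failure.
 */
int sync_fs(const char *mountpoint)
{
    assert(mountpoint != NULL);

    /* Any descriptor on the filesystem will do; use the directory itself. */
    int fd = open(mountpoint, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    int rc = syncfs(fd);        /* Linux-specific, per-filesystem sync */
    int saved = errno;          /* keep syncfs's errno across close() */
    close(fd);
    errno = saved;
    return rc;
}
```

A caller would use sync_fs("/var/spool/mail") (path chosen only as an
example) in place of a system-wide sync.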
As I understand it, Ext3 allows much the same semantics. While we do
need to do something about exposing a more elegant interface, with Ext3
you should be ok with +S and a "sync" just before you report to the
world that the mail transaction is complete. Ext3 does *not* leave a
lot of dirty blocks hanging around in normal operation, so sync is not
nearly as slow as it is with good old Ext2.
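
The "sync just before you report" pattern can be sketched as follows.
This is an assumption-laden illustration, not code from any real MTA:
the helper name deliver_then_sync() and the temp-file-plus-rename
layout are hypothetical, and the sketch relies on sync() behaving as
described above, i.e. not returning until the data is actually on
disk.

```c
#define _XOPEN_SOURCE 700       /* for fsync(), sync(), ssize_t */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical delivery helper: write the message to a temporary file,
 * rename it into place, then sync before the caller acknowledges the
 * mail transaction. Returns 0 on success, -1 on any failure.
 */
int deliver_then_sync(const char *tmp_path, const char *final_path,
                      const char *msg, size_t len)
{
    assert(tmp_path != NULL && final_path != NULL && msg != NULL);

    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Write the whole message, allowing for short writes. */
    size_t off = 0;
    while (off < len) {
        ssize_t n = write(fd, msg + off, len - off);
        if (n < 0) {
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        off += (size_t)n;
    }

    /* Flush the file data itself before making the name visible. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp_path);
        return -1;
    }

    if (rename(tmp_path, final_path) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* The step discussed above: sync before reporting success. */
    sync();
    return 0;
}
```

Only after deliver_then_sync() returns 0 would the MTA tell the peer
that the message has been accepted.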
> Note, of course there is no premature acknowledgement for the
> Linux-default asynchronous directory update. There IS for -o sync or
> chattr +S -- and that's what MTAs do to guarantee data integrity, and
> that's why I'm still suggesting dirsync or something to remedy the
> negative data write performance of full-sync.
>
> If the OS tells me "write completed" when it means "I queued your data
> for writing", it is BROKEN.
>
> That's my point.
>
> And since the common POSIX OS lacks a dedicated notification feature
> for e. g. rename, MTAs have no other choice than to rely on "has
> completed when the syscall returns".
>
> BTW, my Linux rename(2) man page doesn't document EIO condition,
> FreeBSD 4.3-STABLE and SUS v2 do.
Sounds like a man page bug.
--
Daniel