correct sorry, fsync/msync cannot check that bit of course.
> You have the same problem with your code, so I guess its better to just
> remove the I_DIRTY_PAGES check.
The point you have to clarify before claming we should remove the check
is if the munmap flush needs to be synchronous or not. In general munmap
doesn't need to be synchronous. If you want to commit the writes an
explicit msync(MS_SYNC) or fsync on the file is required. Otherwise the
updates will hit the platter in a rasonable amount of finite time
asynchronously. If somebody just intiated the fdatasync he will have to
finish before we can collect away the inode and in turn drop all its
cache, so those dirty pages cannot get lost in iput if somebody started
doing the flush under us either, and the guy doing the fdatasync under
us will have to wait synchronously for the stuff to be committed before
it can return.
If some page wasn't yet visible in the dirty_pages list by the time
__sync_one started, we'll find I_DIRTY_PAGES set. This is enforced by
the locking order (sync_one first clears the I_DIRTY_PAGES and then
it starts browsing the dirty_pages list while set_page_dirty first make the
page visible and then marks the inode dirty).
So the I_DIRTY_PAGES check guarantees that those dirty pages cannot be
lost in iput, that was the _only_ object of the patch and that is
certainly enough to fix the nfs fs data corruption reported.
Now if you claim that munmap needs to be synchronous for nfs that's a
completly different matter. I didn't even tried to make it synchronous.
It is possible it has to be synchronous, even write(2) (in theory ;) has
to behave like O_SYNC with nfs, but I'm not sure.
Another thing (completly unrelated to the above issues) that I noticed
while looking over this nfs code is that the __sync_one() for example
called by generic_file_write(O_SYNC) will recall fdatasync but no nfs_wb_all
is put before the fdatawait, and I'm not sure that the nfs_sync_page
called by the fdatawait is enough to rapidly flush the writepaged stuff
to the nfs server. nfs_sync_page apparently only cares about speculative
reads, not at all about committing writebacks. It would look much saner
to me if nfs_sync_page also does a nfs_wb_all() on the inode, so that
the ->sync_page callback gets the same semantics it has for the real
filesystems.
Comments?
Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/