Re: O_DIRECT wierd behavior..

Andrea Arcangeli (andrea@suse.de)
Mon, 17 Dec 2001 18:18:40 +0100


On Sun, Dec 16, 2001 at 05:17:33PM +0900, GOTO Masanori wrote:
> At Sat, 15 Dec 2001 21:59:06 -0800,
> Andrew Morton wrote:
> > Then the kernel screws up the error handling, and ends up
> > setting the file size to -EINVAL (ie: rather large).
> >
> > 1: We're testing `written >= 0', but it is unsigned (!). In two
> > places.
> >
> > This one, IMO is a gcc shortcoming. The compiler is capable of warning
> > about expressions which always evaluate to true or false in `if' statements,
> > but turning this on also enables lots of things you don't want it to warn about.
> > gcc needs to provide finer control of its warning capabilities. I patched
> > gcc-2.7.2.3 to do this ages back and it was very useful.
> >
> > 2: If generic_osync_inode() returns an error, we fail to report it. In
> > two places.
> >
> > Here's a quick fix. It needs a review.
> >
> > --- linux-2.4.17-rc1/mm/filemap.c Thu Dec 13 14:07:55 2001
> > +++ linux-akpm/mm/filemap.c Sat Dec 15 21:52:06 2001
> > @@ -3038,8 +3038,11 @@ unlock:
> > /* For now, when the user asks for O_SYNC, we'll actually
> > * provide O_DSYNC. */
> > if (status >= 0) {
> > - if ((file->f_flags & O_SYNC) || IS_SYNC(inode))
> > + if ((file->f_flags & O_SYNC) || IS_SYNC(inode)) {
> > status = generic_osync_inode(inode, OSYNC_METADATA|OSYNC_DATA);
> > + if (status < 0)
> > + written = 0; /* Return the right thing */
> > + }
> > }
>
> Right. If generic_osync_inode returns error, it must be needed.
> This patch seems ok than my patch...

The above have nothing with the O_DIRECT changes, the above was present
in 2.4.9 too.

my worry is that failing with -EIO or whatever if we written something can
screwup the app, the app will think the pos is still at the start of our
writes. the "ingore" of the osync failure (that can be generated only by
an I/O error) was on the lines of the ignore of a failure in
prepare_write/commit_write if we just written something. So to me it
looked quite intentional, not just a thinko. In those cases if we wrote
something we report "written" with a short-write (infact a short write
from kernel just indicates something is not strightforward) otherwise
only if nothing was written yet, we report -EIO. So the app, will know
something is been written and the "pos" of the fd is been updated, then
it will try again to write the remaining part and it will get the -EIO
next time.

But I see with common sense that a failing O_SYNC should be somehow
reported even if we just written something, or it could be silenty
ignored, the app at the very least should try again or to notify a
failure rather than losing the data journaling due the I/O errors in the
data/metadata flushing. At least this osync failure is something that
shouldn't happen in production. If an osync fails it means there's a bad
sector or at the very least some other unrelated software bug.

I'm unsure (it's basically a matter of API, not something a kernel
developer can choose liberally), and the SuSv2 is not saying anything about
O_SYNC failures in the write(2) manapge, but I guess it would be at
least saner to put the "pos" backwards if we fail osync but we just
written something (so if we previously advanced pos).

Comments? Andrew?

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/