Call me crazy, but IMHO handling write errors at the filesystem level
is clear, logical, and easy. I've been thinking about this for the
"ibu fs" I am hacking on... have your own writepage and your own
async I/O completion routine. If M of N bh's indicate write failure
when bh->b_end_io is called, queue those for the filesystem so it can
add those blocks to the bad block list and allocate some new blocks
for writing. Seems straightforward to me, except in the worst case
where all remaining sectors are bad.
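To make that concrete, here is roughly what I have in mind for the
completion routine, written against the 2.4 b_end_io interface. The
"ibufs_" names are placeholders and ibufs_queue_bad_write() is a
hypothetical helper; locking and buffer-refile details are omitted:

#include <linux/fs.h>
#include <linux/locks.h>

/*
 * Filesystem-private write completion.  On failure, instead of
 * retrying blindly, hand the buffer back to the filesystem:
 * ibufs_queue_bad_write() (hypothetical) puts the bh on a
 * per-superblock list, and a worker later adds bh->b_blocknr to the
 * bad block list, allocates a replacement block, and resubmits.
 */
static void ibufs_end_buffer_write(struct buffer_head *bh, int uptodate)
{
        mark_buffer_uptodate(bh, uptodate);

        if (!uptodate)
                ibufs_queue_bad_write(bh);

        unlock_buffer(bh);
}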
A side effect of this is that I am taking a cue from "NTFS TNG"
[which is nice, 100% page-cache-based code] and simply making the
hard sector size be the logical block size. That way, not only can
write errors be handled at a fine-grained (hard sector) level, but
there are also no limitations on what the filesystem's blocksize is.
Right now we cannot have a block size larger than PAGE_CACHE_SIZE
AFAICS, but when blocksize == hard sector size, you can simply fake
any blocksize you want (32K, 64K, ...). You have a bit more overhead
from an increased number of bh's, but since bh->b_data points into a
page, that's all the overhead is... a buffer head.
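The writepage side then just walks the per-sector bh's hanging off the
page (page->buffers, chained via b_this_page) and submits each one
with the fs's own completion routine. A rough sketch, assuming the 2.4
submit_bh() interface and skipping the block mapping, page locking,
and the rest of the bookkeeping:

static int ibufs_write_page_sectors(struct page *page)
{
        struct buffer_head *head = page->buffers;
        struct buffer_head *bh = head;

        do {
                if (buffer_dirty(bh)) {
                        lock_buffer(bh);
                        clear_bit(BH_Dirty, &bh->b_state);
                        bh->b_end_io = ibufs_end_buffer_write;
                        /* one request per hard sector, so a failure
                           only costs us (and relocates) that sector */
                        submit_bh(WRITE, bh);
                }
                bh = bh->b_this_page;
        } while (bh != head);

        return 0;
}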
Right now, no filesystems really support a big blocksize in this way,
because fragment handling hasn't been thought through [again, AFAICS].
> For reads, sufficient state information is already there ("uptodate" bit
> - just add a counter for retries), but for writes we only have the dirty
> bit that gets cleared when the request gets sent off. So for writes
> we'd need to add a new bit ("write in progress", and then clear it on
> successful completion, and set the "dirty" bit again on error).
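Something along those lines seems easy enough to sketch. Here
BH_WriteInProgress is a made-up name (just pick an unused b_state
bit); this is only to make the proposed state transitions concrete:

#define BH_WriteInProgress      16      /* hypothetical: an unused b_state bit */

static void end_buffer_write_once(struct buffer_head *bh, int uptodate)
{
        clear_bit(BH_WriteInProgress, &bh->b_state);

        if (uptodate) {
                mark_buffer_uptodate(bh, 1);
        } else {
                /* the single try failed: hand the buffer back dirty
                   so the upper layers can decide whether to retry */
                mark_buffer_dirty(bh);
        }
        unlock_buffer(bh);
}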
Handling read errors has always seemed uglier to me than handling
write errors, but I haven't thought read errors through yet...
> So I'd actually _like_ for all IO requests to be clearly "try just
> once", and it being up to th eupper layers to retry on error.
I agree this is a good direction.
Jeff