Really, I don't think we can lose page->buffers for *enough* users
of address_spaces to make it worthwhile.
If it were only being used for, say, blockdev inodes then we could
perhaps take it out and hash for it, but there are a ton of
filesystems out there...
The main problem I see with this patch series is that it introduces
a new way of performing writeback while leaving the old way in place.
The new way is better, I think - it's just a_ops->write_many_pages().
But at present, there are some address_spaces which support write_many_pages(),
and others which still use ->writepage() and sync_page_buffers().
This will make VM development harder, because the VM now needs to cope
with the nice, uniform, does-clustering-for-you writeback as well as
the crufty old write-little-bits-of-crap-all-over-the-disk writeback :)
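For illustration, the fork which the VM ends up carrying looks something
like this (a sketch only - vm_writeback_mapping() is a made-up name and
the signatures are only approximate for this tree):

#include <linux/fs.h>
#include <linux/mm.h>

/*
 * Sketch only: the fork the VM has to carry while both writeback
 * styles exist.  vm_writeback_mapping() is a made-up name and the
 * operation signatures are approximate.
 */
static int vm_writeback_mapping(struct address_space *mapping,
                                struct page *page, int nr_to_write)
{
        if (mapping->a_ops->write_many_pages)
                /* new way: the address_space does the clustering for us */
                return mapping->a_ops->write_many_pages(mapping, nr_to_write);

        /* old way: one page, its buffers, and whatever the LRU gives us */
        return mapping->a_ops->writepage(page);
}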
I need to give the VM a uniform way of performing writeback for
all address_spaces. My current thinking there is that all
address_spaces (even the non-delalloc, buffer_head-backed ones)
need to be taught to perform multipage clustered writeback
based on the address_space, not the dirty buffer LRU.
This is pretty deep surgery. If it can be made to work, it'll
be nice - it will heavily deprecate the buffer_head layer and will
unify the current two-or-three different ways of performing
writeback (I've already unified all ways of performing writeback
for delalloc filesystems - my version of kupdate writeback, bdflush
writeback, vm-writeback and write(2) writeback are all unified).
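Very roughly, the per-address_space clustering could look something like
the sketch below (submit_run() is just a placeholder for building and
submitting the real I/O, and all the locking and page-state handling is
left out):

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* placeholder: lock, map and submit I/O for the pages on @run,
 * emptying the list as it goes */
static void submit_run(struct address_space *mapping, struct list_head *run);

/*
 * Sketch of address_space-based clustering: pull dirty pages off
 * mapping->dirty_pages and push out runs of file-contiguous pages,
 * rather than letting the dirty buffer LRU dribble them out one
 * buffer_head at a time.
 */
static void cluster_writeback(struct address_space *mapping, int nr_pages)
{
        LIST_HEAD(run);
        unsigned long next_index = 0;
        int run_len = 0;

        while (nr_pages > 0 && !list_empty(&mapping->dirty_pages)) {
                struct page *page = list_entry(mapping->dirty_pages.next,
                                               struct page, list);

                if (run_len && page->index != next_index) {
                        /* run of contiguous file offsets broken: push it out */
                        submit_run(mapping, &run);
                        run_len = 0;
                }
                list_del(&page->list);
                list_add_tail(&page->list, &run);
                next_index = page->index + 1;
                run_len++;
                nr_pages--;
        }
        if (run_len)
                submit_run(mapping, &run);
}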
> I've been playing with the idea of caching the physical block in the radix
> tree, which imposes the cost only on cache pages. This forces you to do a
> tree probe at IO time, but that cost is probably insignificant against the
> cost of the IO. This arrangement could make it quite convenient for the
> filesystem to exploit the structure by doing opportunistic map-ahead, i.e.,
> when ->get_block consults the metadata to fill in one physical address, why
> not fill in several more, if it's convenient?
That would be fairly easy to do. My current writeback interface
into the filesystem is, basically, "write back N pages from your
mapping->dirty_pages list" [1]. The address_space could quite simply
whizz through that list and map all the required pages in a batched
manner.
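Something along these lines, purely for illustration (map_batch() is
hypothetical and the calling convention is approximate - the point is
only that the logical->physical lookups all happen in one pass, so a
map-ahead ->get_block gets to amortise its metadata reads):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/string.h>

/*
 * Hypothetical sketch: map a whole batch of dirty pages up front,
 * before any I/O is assembled.  A ->get_block which opportunistically
 * maps ahead (filling in several physical addresses while it already
 * has the metadata in hand) makes the later iterations of this loop
 * nearly free.
 */
static int map_batch(struct inode *inode, struct page **pages, int nr,
                     get_block_t *get_block)
{
        int i, err;

        for (i = 0; i < nr; i++) {
                struct buffer_head bh;
                long block = pages[i]->index <<
                        (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);

                memset(&bh, 0, sizeof(bh));
                /* one logical->physical lookup per page (per block, really) */
                err = get_block(inode, block, &bh, 1);
                if (err)
                        return err;
                /*
                 * Remember bh.b_blocknr for the page - e.g. cached in
                 * the radix tree, as suggested above.
                 */
        }
        return 0;
}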
[1] Problem with the current implementation is that I've taken
out the guarantee that the page which the VM wanted to free
actually has I/O started against it. So if the VM wants to
free something from ZONE_NORMAL, the address_space may just
go and start writeback against 1000 ZONE_HIGHMEM pages instead.
In practice, I suspect this doesn't matter much. But it needs
fixing.
(Our current behaviour in this scenario is terrible. Suppose
a mapping has a mixture of dirty pages from two or more zones,
and the VM is trying to free up a particular zone: the VM will
*selectively* perform writepage against *some* of the dirty
pages, and will skip writeback of pages from other zones.
This means that we're submitting great chunks of discontiguous
I/O. It'll fragment the layout of sparse files and will
greatly decrease writeout bandwidth. We should be opportunistically
submitting writeback against disk-contiguous and file-offset-contiguous
pages from other zones at the same time! I'm doing that now, but
with the present VM design [2] I do need to provide a way to
ensure that writeback has commenced against the target page).
[2] The more I think about it, the less I like it. I have a feeling
that I'll end up having to, umm, redesign the VM. Damn.
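For completeness, one way the missing guarantee from [1] could be
provided (hypothetical sketch - cluster_around() is a placeholder for
clustering like the above, started at the target page's offset):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* placeholder: cluster and submit writeback starting at @index */
static int cluster_around(struct address_space *mapping,
                          unsigned long index, int nr_pages);

/*
 * Hypothetical sketch of the missing guarantee from [1]: when the VM
 * wants a particular page freed, that page is always in the batch,
 * and the run is grown outwards from its file offset - neighbouring
 * dirty pages come along for contiguity, whatever zone they sit in.
 */
static int writeback_for_reclaim(struct page *target, int nr_pages)
{
        struct address_space *mapping = target->mapping;

        if (!PageDirty(target))
                return 0;       /* no I/O to start against it */

        return cluster_around(mapping, target->index, nr_pages);
}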