I'm starting to implement a generic write clustering scheme and I would
like to receive comments and suggestions.
The write clustering issue has already been discussed (mainly at Miami)
and the agreement, AFAIK, was to implement the write clustering at the
per-address-space writepage() operation.
IMO there are some problems if we implement the write clustering in this
level:
- The filesystem does not have information (and should not have) about
limiting cluster size depending on memory shortage.
- By doing the write clustering at a higher level, we avoid a ton of
filesystems duplicating the code.
So what I suggest is to add a "cluster" operation to struct address_space
which can be used by the VM code to know the optimal IO transfer unit in
the storage device. Something like this (maybe we need an async flag but
thats a minor detail now):
int (*cluster)(struct page *, unsigned long *boffset,
unsigned long *poffset);
"page" is from where the filesystem code should start its search for
contiguous pages. boffset and poffset are passed by the VM code to know
the logical "backwards offset" (number of contiguous pages going backwards
from "page") and "forward offset" (cont pages going forward from
"page") in the inode.
The idea is to work with delayed allocated pages, too. A filesystem which
has this feature can, at its "cluster" operation, allocate delayed pages
contiguously on disk, and then return to the VM code which now can
potentially write a bunch of dirty pages in a few big IO operations.
I'm sure that a bit of tuning to know the optimal cluster size will be
needed. Also some fs locking problems will appear.
But it seems worth to me since we're avoiding a lot of code replication in
the future and also the performance gain will be _nice_.
Comments, thoughts?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/