I concluded I needed some kind of abstraction, but what? Instead of
reverting to the original buffer cache version, I converted my
unreliable page cache version to buffer operations, and in the process
backported some of Al Viro's streamlined page cache code to work in the
buffer cache. I noticed that only deletions and name changes were
required, so the code got somewhat shorter. Better, the converted
version worked reliably on the first try.
I didn't entirely discard the page cache code - I kept enough of the
extra mechanism around to make it pretty easy, I hoped, to take another
run at the page cache later, after I'd brought the code to a more mature
state. At San Jose last month I mentioned my problem to Stephen Tweedie
and he suggested an entirely different approach, where I'd work with the
buffers underlying the page cache pages instead of directly with the
pages themselves. I said I'd try it, it sounded good.
This week I finally got time to try that. My plan was to try to keep
the interface as close as possible to the traditional
bread/geblk/mark_dirty etc. interface which I found to be quite robust
and easy to work with. I first verified that mark_buffer_dirty and
brelse can be used with page cache buffers, and they can - there are
examples in the buffer.c. So I figured the quickest thing to do would
be to modify ext2_getblk to work through the page cache instead of the
buffer cache. The new ext2_bread would then just call the new
ext2_getblk, and would be otherwise unchanged.
So for the new ext2_getblk, given an inode and a block number, the steps
are:
- find or create the blocks's page cache page
- add buffers to the page if necessary
- select the correct buffer from the page's buffer list
- call ext2_get_block to map the block to disk
- for a newly created block, clear its buffer
Here's the code:
struct buffer_head *ext2_getblk_pcache (struct inode *inode,
u32 block, int create, int *err)
{
unsigned blockshift = inode->i_sb->s_blocksize_bits;
unsigned blocksize = 1 << blockshift;
unsigned pshift = PAGE_CACHE_SHIFT - blockshift;
struct page *page = grab_cache_page(inode->i_mapping, block >> pshift);
struct buffer_head *bh;
if (!page->buffers)
create_empty_buffers (page, inode->i_dev, blocksize);
bh = page_buffer (page, block & ((1 << pshift) - 1));
atomic_inc(&bh->b_count);
UnlockPage(page);
page_cache_release(page);
if ((*err = ext2_get_block(inode, block, bh, create)))
return NULL;
if (!buffer_mapped(bh))
return NULL;
if (buffer_new(bh))
{
if (!buffer_uptodate(bh))
wait_on_buffer(bh);
memset(bh->b_data, 0, blocksize);
mark_buffer_uptodate(bh, 1);
mark_buffer_dirty_inode(bh, inode);
clear_bit(BH_New, &bh->b_state);
}
return bh;
}
In testing this code I actually found a bug in the original ext2_getlk,
to wit, the BH_New bit was never being cleared. How we've survived that
to this point isn't at all clear to me, but Al probably has it patched
by now. With that bug out of the way I substituted the new functions
throughout the directory code and lo! everything worked the first try.
I then went on to clean up the code and structures to take advantage of
the fact that, with buffers you only need to keep a single pointer vs
two when you are working with a data region inside a page.
So now I have the directory index running in the page cache, and to do
that I just changed the names of two functions. Well, it wasn't that
easy, but almost :-)
But what about performance? In theory we should be running a little
faster because with the page cache, we don't have to look up each data
block via its index block(s): the page cache hash takes us straight to
the page we're interested in and we need to look into the index blocks
only if we find an unmapped block there. When we need to access the
same block many times as with the directory code that adds up to a lot
of CPU saved, and did I mention that the directory operations are
CPU-bound?
What about performance versus full-page page-cache operations? This
requires some speculation, because I never did implement the additional
mechanism that would have been necessary to make it work well when block
size does not match page size. When page size does match block size it
looks like the two methods will be roughly comparable in execution
speed, since the sequence of operations is very similar. When there are
several blocks on a page it looks like we might be doing more
hash lookups with the getblk-style code, but there is only one place in
the directory code - readdir - where we'd see the kind of sequential
access that might expose this overhead, and we can optimize that away
if we want to (the sequence brelse; ext2_bread is unecessary if the
next buffer is on the same page).
As far as code size goes, this didn't cost anything. The existing
ext2_getblk/bread functions are used only in the directory code, meaning
that the originals can be discarded after we've had a chance to evaluate
the relative performance.
What about compatibility with other page cache code? As far as I can
see we should be ok there - this style of access should mix and match
nicely with the full-page style of access, but caveat, I have not tried
this or analyzed it in great detail. The obvious thing to do would be
to try mixing Al Viro's full-page style readdir with my getblk-style
entry creation and lookup.
Status
------
Today's patch is meant for testing and benchmarking. There are
compilation switches to enable/disable:
- indexing in general (CONFIG_EXT2_INDEX)
- page cache vs buffer cache (CONFIG_EXT2_DIR_PAGE_CACHE)
- deferred index creation (DX_DEFER)
The default is: indexed, deferred creates, page cache. All these
options, except perhaps the first, will be removed in the finished
version, so if you are interested in doing performance measurements...
(hint hint)
I mentioned some cache-thrashing issues in a previous post - these come
up at directory sizes that were previously impractical with Ext2, and
are of two kinds:
- prune_icache problems (MM)
- Inode table thrashing issues
The MM problems were fixed by Marcelo Tosatti and by now are in the
2.4.4-pre trees - to be sure, you might want to ask Marcelo, or I can
forward you a copy of his patch for 2.4.3.
I've analyzed the inode table thrashing issues and now feel confident
that the thrashing can be pretty much eliminated by introducing
goal-directed inode allocation analogous to the current goal-directed
block allocation. (The thrashing we see is not specific to the
particular indexing method I've used, but would occur with any
block-wise index on inodes.) I plan to write up my findings on that
soon, and if you have read this far you'll probably find it quite
interesting. In any event, now that the MM issues have been cleaned up
performance is decent even with the thrashing. Adding more memory is an
option - I'd suggest using at least 256 Meg for 2,000,000 file
directories.
This patch remains pre-alpha, until the hash function is finalized.
Changes
-------
- optionally run in page cache
- use lock_super instead of lock_kernel for sb update
- cleaned up redundant data pointers
To Do
-----
- Finalize hash function
- Test/debug big endian
- Fail gracefully on random data
Future Work
-----------
- Generalize to n level index
- Coalesce on delete
- Goal-directed inode allocation
Testers wanted
--------------
- Big endian
- SMP
- Performance testing
The Patch is Available At
-------------------------
http://kernelnewbies.org/~phillips/htree/dx.testme-2.4.3-3
To apply:
cd /your/kernel/tree
patch -p0 </the/patch
To create an indexed directory:
mount /dev/hdxxx /test -o index
mkdir /test/foo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/