Nice. I posted a similar demonstration some time ago (could it *really* be
two years?), but I sorted the dirents in user space, which meant it was just
a demonstration, not a practical solution.
The problem I see with your approach is that the traversal is no longer in
hash order, so a leaf split in the middle of a directory traversal could
result in a lot of duplicate dirents. I'm not sure there's a way around that.
Another approach is to have the inode number correspond approximately to the
hash value. That's not as hard as it sounds: it simply means that the goal
for ext3_new_inode should be influenced by the hash value. (But watch out
for interaction with the new Orlov allocator.) It also depends on some
a priori estimate of the size of the directory, so that increasing hash values
can be distributed more or less uniformly and monotonically across a
corresponding range of inode numbers. This isn't too hard either: just round
the current size of the directory up to a power of two and use that as the
size estimate. The size estimate would change from time to time over the
life of the directory, but there are only log N different sizes, and that is
roughly how heavy the load on the inode cache would be during a directory
traversal. Finally, a nice property of this approach is that it stays stable
over many creates and deletes.
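
To make that concrete, here's a rough sketch of the goal calculation I have
in mind. It's not a patch: the function and parameter names are all made up,
and I'm glossing over how ext3_new_inode actually takes and uses its goal.
It just shows the hash-to-inode mapping and the power-of-two size estimate.

/*
 * Sketch only: map a dirent's hash onto an inode allocation goal so
 * that increasing hash values land, roughly monotonically, on a range
 * of inode numbers sized by a power-of-two estimate of the directory.
 */
static unsigned long hash_to_ino_goal(unsigned long group_first_ino,
                                      unsigned long inodes_per_group,
                                      unsigned long dir_entries,
                                      unsigned int hash)
{
        /* Round the current directory size up to a power of two. */
        unsigned long estimate = 1;
        while (estimate < dir_entries)
                estimate <<= 1;

        /* The goal range can't usefully exceed one inode table group. */
        if (estimate > inodes_per_group)
                estimate = inodes_per_group;

        /*
         * Treat the hash as a 32-bit fraction and scale it into the
         * estimated range, so hash order roughly becomes inode order.
         */
        return group_first_ino +
               (unsigned long)(((unsigned long long)hash * estimate) >> 32);
}

With something like this feeding the goal, a traversal in hash order would
walk the inode table roughly in order too, at least within each
size-estimate generation.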
We've apparently got a simpler problem though: inode numbers aren't allocated
densely enough in the inode table blocks. This is the only thing I can think
of that could cause the amount of thrashing we're seeing. Spreading inode
number allocations out all over a volume is ok, but sparse allocation within
each table block is not ok because it multiplies the pressure on the cache.
We want a 30,000-entry directory to touch 1,000 to 2,000 inode table blocks,
not the worst-case 30,000. As above, fixing this comes down to tweaking the
ext3_new_inode allocation goal.
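
For the record, the arithmetic behind those numbers, assuming the usual
128-byte on-disk inodes:

    4K blocks:  4096 / 128 = 32 inodes per table block, 30,000 / 32 ~= 940 blocks
    2K blocks:  2048 / 128 = 16 inodes per table block, 30,000 / 16 ~= 1,900 blocks

versus the degenerate case of one touched table block per inode, i.e.
30,000 blocks.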
Regards,
Daniel