I've looked at both splay trees and the page lock patch by Ingo & Dave, and
personally I disagree with both approaches as they fail to address some of
the very common cases that are actually causing the underlying problem.
First off, if we take a look at why the page cache lock is being contended
a few obvious problems pop out immediately:
1. potentially long hash chains are walked with the page cache
lock held for the entire duration of the operation
2. multiple cache lines are touched while holding the page cache
lock
3. sequential lookups involve reaquiring the page cache lock
4. the page cache hash is too large, virtually assuring that
lookups will cause a cache miss
Neither proposed patch addresses all of these problems to the degree that
I think is possible and would like to attempt. What I've got in mind is
a hybrid hash + lock + b*tree scheme, where the lock is per-bucket. With
that in place, each bucket would be treated like the top of a b*tree
(I might have the name wrong, someone please correct me if what I'm
describing has a different name), where a number of {index, page} pairs
are stored, plus pointers to the next level of tree. The whole thing
would be sized at 1 cacheline per bucket and hold around 5-6 page pointers
per bucket (64 byte cacheline size). The lock would be a rw lock to
allow simultaneous readers (the most common operation).
This directly helps with #1 by allowing us to avoid pulling in extra
cache lines during the initial lookup phase since the page index #
is already in the cache line we've retrieved (we may need a pointer
to the address space, but the hash function can be chosen to make
such collisions irrelevant) as well as reducing the number of cache
lines accessed during a lookup. Since a tree is used for the worst
case, the length of a chain is reduced, with a very low likelyhood
of ever exceeding 2 levels (remember, we're working on buckets in
each level of the tree, not individual page cache entries). Item 3
is addressed by improving cache locality for sequential lookups.
Item 4 can be addressed by changing the hash function to cause some
grouping for address spaces.
I haven't coded up this idea yet, but I'll get moving on it. If we're
going to change the locking again, I'd like to go all the way and do
it Right. To this end I'd like to see some appropriate benchmarks made
to look at the behaviour of the proposed changes under sequential access
patterns as well as random. Thoughts?
-ben
-- Fish. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/