Here's a status report on this:
- I replicated your results using Rik's rmap12h patch for 2.4.19-pre7.
  The overhead of rmap setup/teardown is a little higher even than with
  the minimal rmap patch in 2.5.recent, around a 20% bottom-line cost for
  this (slightly unrealistic) test.  Rik seems to take the pte_chain_locks
  more often than necessary, which is a likely explanation for the higher
  overhead vs 2.5.recent.
- I'm not doing this in 2.5 because dac960 is broken, and that's what my
  only dual-processor machine has.  I'll return to messing with that
  pretty soon (and Jens has offered help) but right now I'm focused on
  this setup/teardown slowness question.
- I've implemented the locking strategy described above and it is
  apparently stable.  It seems to reduce the overhead by a percent or
  two.  On x86 it still isn't a solution, though it may well be quite
  nice for ppc, judging from earlier results you mentioned.  I'll
  continue to fiddle with anon page assignments to page->index and see
  if I can get another couple of percent.  It's possible that putting
  the spinlocks in separate cachelines may help as well (see the sketch
  after this list).
- It's clear that I need to move on to batching up the pte chain creates
  to get the kind of improvement we want.
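
For the cacheline idea mentioned above, this is roughly the shape I have
in mind.  It's only a sketch; rmap_lock_t, padded_rmap_locks and
RMAP_LOCK_COUNT are placeholder names, not anything in the patch:

/*
 * Sketch only: pad each rmap lock out to its own cacheline so CPUs
 * spinning on different locks don't bounce the same line around.
 * Every name here is a placeholder.
 */
#include <linux/cache.h>
#include <linux/spinlock.h>

#define RMAP_LOCK_COUNT 256		/* power of two, for the mask */

typedef struct {
        spinlock_t lock;
} ____cacheline_aligned rmap_lock_t;

static rmap_lock_t padded_rmap_locks[RMAP_LOCK_COUNT];
/* each lock still needs spin_lock_init() at setup time */

/*
 * lock_rmap would then hash to
 * &padded_rmap_locks[index & (RMAP_LOCK_COUNT - 1)].lock
 */

The tradeoff is the obvious one: one lock per cacheline means no false
sharing between locks, at the cost of a bigger array for the same number
of locks.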
I settled on a variant of the pte_chain_lock interface, and since I had to
update every use of it I adopted a more sensible name as well:
    spinlock_t *lock_rmap(struct page *page);
    void unlock_rmap(spinlock_t *lock);
(The functions formerly known as pte_chain_lock and pte_chain_unlock.) The
former acquires a lock indexed via page->index and returns it; the latter
is just spin_unlock. This interface is more efficient than looking up the
lock twice, and also accommodates my wrapper for resetting page->index, which
can't use page->index for the unlock, since it just changed:
static inline void set_page_index(struct page *page, unsigned long index)
{
        spinlock_t *lock = lock_rmap(page);
        page->index = index;
        unlock_rmap(lock);
}
Lock_rmap loops to handle the possibility of page->index changing while
waiting on the lock:
static inline spinlock_t *lock_rmap(struct page *page)
{
        while (1) {
                unsigned long index = page->index;
                spinlock_t *lock = rmap_locks + (index & (num_rmap_locks - 1));

                spin_lock(lock);
                if (index == page->index)
                        return lock;
                /* page->index changed while we waited; retry with the new value */
                spin_unlock(lock);
        }
}
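
For what it's worth, a caller ends up bracketing its chain manipulation
like this (the function name is made up for illustration, not the actual
helper from the patch):

static void pte_chain_add_example(struct page *page, pte_t *ptep)
{
        spinlock_t *lock = lock_rmap(page);     /* hashes on page->index */

        /* ... link ptep into the pte chain hanging off page here ... */

        unlock_rmap(lock);
}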
The normal case is nearly as efficient as a raw spin_lock. I probably have
to invalidate page->index for non-cache-coherent architectures. I'll worry
about that detail more if I get somewhere with the optimization on i386.
No doubt there are cases where the set_page_index doesn't have to be covered
by the rmap lock but I didn't spend time hunting for them - they're different
in 2.5 and this overhead doesn't seem to move the needle anyway.
I think this much is fairly solid. Now I'll see if I can take advantage of
this infrastructure by batching up the setup/teardown in copy_page_range,
which is apparently the main cause of the suckage.
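
To give an idea of the direction, the batching would look something like
the sketch below.  None of this is the actual copy_page_range code;
copy_rmap_batch and add_pte_chain_locked are placeholders:

/*
 * Sketch only: take the rmap lock once per run of pages that hash to the
 * same lock, instead of once per pte.  Glosses over the page->index-
 * changing race that lock_rmap handles.
 */
static void copy_rmap_batch(struct page **pages, int count)
{
        int i = 0;

        while (i < count) {
                spinlock_t *lock = lock_rmap(pages[i]);

                do {
                        /* add_pte_chain_locked(pages[i], ...); -- placeholder */
                        i++;
                } while (i < count &&
                         rmap_locks + (pages[i]->index & (num_rmap_locks - 1)) == lock);

                unlock_rmap(lock);
        }
}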
--
Daniel