Sadly, it turns out that there are no possibilities for page table sharing
when forking from bash. It turns out there are only about 200 pages being
shared amonst three page tables (stack, text and .interp) and at least one
page in each of these gets written during the exec, so all are unshared and
hence there is no reduction in the number of pte chains that have to be
created. For forking from a larger parent, page table sharing has a
measurable benefit, but not from these little guys.
But there is something massively wrong with this whole picture. Your kickass
4 way is managing to set up and tear down only one pte chain node per
microsecond, if I'm reading your benchmark results correctly. That's really
horrible. I think we need to revisit the locking.
The idea I'm playing with now is to address an array of locks based on
something like:
spin_lock(pte_chain_locks + ((page->index >> 4) & 0xff));
so that 16 consecutive filemap pages use the same lock and there is a limited
total number of locks to keep cache pressure down. Since we are creating the
vast majority of the pte chain nodes while walking across page tables, this
should give nice locality.
For this to work, anon pages will need to have something in page->index.
This isn't too much of a challenge. A reasonable value to put in there is
the creator's virtual address, shifted right, and perhaps mangled a little to
reduce contention.
We can also look at batching the pte chain node creation by preallocating 16
nodes, taking the lock, and walking through the 16 nodes filling in the
pointers. If the page index changes to a different lock we drop the one we
have and acquire the new one.
-- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/