Here are some uniprocessor numbers:
up, 2.5.30+rmap-lock-speedup:
./daniel.sh 28.32s user 42.59s system 90% cpu 1:18.20 total
./daniel.sh 29.25s user 38.62s system 91% cpu 1:14.34 total
./daniel.sh 29.13s user 38.70s system 91% cpu 1:14.50 total
c01cdc88 149 0.965276 strnlen_user
c01341f4 181 1.17258 __page_add_rmap
c012d364 195 1.26328 rmqueue
c0147680 197 1.27624 __d_lookup
c010bb28 229 1.48354 timer_interrupt
c013f3b0 235 1.52242 link_path_walk
c01122cc 261 1.69085 do_page_fault
c0111fd0 291 1.8852 pte_alloc_one
c0124be4 292 1.89168 do_anonymous_page
c0123478 304 1.96942 clear_page_tables
c01236c8 369 2.39052 copy_page_range
c01078dc 520 3.36875 page_fault
c012b620 552 3.57606 kmem_cache_alloc
c0124d58 637 4.12672 do_no_page
c0123960 648 4.19798 zap_pte_range
c012b80c 686 4.44416 kmem_cache_free
c0134298 2077 13.4556 __page_remove_rmap
c0124540 2661 17.2389 do_wp_page
up, 2.5.26:
./daniel.sh 27.90s user 31.28s system 90% cpu 1:05.25 total
./daniel.sh 31.41s user 35.30s system 100% cpu 1:06.71 total
./daniel.sh 28.54s user 32.01s system 91% cpu 1:06.41 total
c0124f2c 167 1.21155 find_vma
c0131ea8 183 1.32763 do_page_cache_readahead
c012c07c 186 1.34939 rmqueue
c01c7dc8 192 1.39292 strnlen_user
c010ba78 210 1.52351 timer_interrupt
c0144c50 222 1.61056 __d_lookup
c01120b8 250 1.8137 do_page_fault
c013cc40 260 1.88624 link_path_walk
c0122cd0 282 2.04585 clear_page_tables
c0124128 337 2.44486 do_anonymous_page
c0122e7c 347 2.51741 copy_page_range
c0111e50 363 2.63349 pte_alloc_one
c01c94ac 429 3.1123 radix_tree_lookup
c01077cc 571 4.14248 page_fault
c0123070 620 4.49797 zap_pte_range
c0124280 715 5.18717 do_no_page
c0123afc 2957 21.4524 do_wp_page
So the pte_chain stuff seems to be costing 20% system time here.
But note that I made the do_page_cache_readahead and radix_tree_lookup
costs go away in 2.5.29: they show up in the 2.5.26 profile but not in
the 2.5.30 one. So it's more like 30%.
And it's really all in __page_remove_rmap and kmem_cache_alloc/free.
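As a sanity check on the 20% figure, here is a minimal sketch that averages the three system times for each kernel and compares them. The helper names (mean, overhead_pct, rmap_overhead_pct) are made up for illustration; only the timings come from the runs above. It covers the raw 20% comparison only, not the adjustment towards 30%.

```c
#include <assert.h>
#include <stddef.h>

/* System times in seconds, copied from the three runs of each kernel. */
static const double sys_2_5_30[] = { 42.59, 38.62, 38.70 };
static const double sys_2_5_26[] = { 31.28, 35.30, 32.01 };

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Percentage by which `other` exceeds `base`. */
static double overhead_pct(double base, double other)
{
	return (other - base) / base * 100.0;
}

/* Mean system-time overhead of 2.5.30 relative to 2.5.26, in percent. */
static double rmap_overhead_pct(void)
{
	return overhead_pct(mean(sys_2_5_26, 3), mean(sys_2_5_30, 3));
}
```

Calling rmap_overhead_pct() gives a bit under 22%, which is in line with the rough 20% quoted.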
If we convert the pte_chain structure to

struct pte_chain {
	struct pte_chain *next;
	pte_t *ptes[L1_CACHE_BYTES / sizeof(pte_t *) - 1];
};

and take care to keep them compacted, we shall reduce the overhead
of both __page_remove_rmap and the slab functions by up to 7-, 15-
or 31-fold, depending on the L1 cache line size (32, 64 or 128 bytes
with 4-byte pointers). page_referenced() wins as well.
Plus we almost halve the memory consumption of the pte_chains
in the high sharing case. And if we have to kmap these suckers
we reduce the frequency of that by 7x, 15x, 31x, etc.
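To make "keep them compacted" concrete, here is a userspace sketch, not the kernel code. It assumes one invariant: only the head node may be partly full, with its used slots packed at the low indices. The names NRPTE, head_used, chain_add, chain_remove and chain_nodes are hypothetical, pte_t is a stand-in typedef, and L1_CACHE_BYTES is hard-wired to 32 for the example. Removal back-fills the hole with the last entry of the head node, so a node is only freed (one slab call) when a whole cache line's worth of entries has gone away. On the memory claim: assuming the current chain spends one next pointer plus one pte pointer per mapping, that is 8 bytes per pte on 32-bit, against 32/7 (about 4.6) bytes here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define L1_CACHE_BYTES 32	/* assumed line size; the kernel sets this per arch */

typedef unsigned long pte_t;	/* stand-in for the kernel type */

/* Slots left in one cache line after the next pointer.
 * With 4-byte pointers this is 7, 15 or 31 for 32, 64 or 128-byte lines. */
#define NRPTE ((L1_CACHE_BYTES - sizeof(void *)) / sizeof(pte_t *))

struct pte_chain {
	struct pte_chain *next;
	pte_t *ptes[NRPTE];
};

/* Used slots in the head node; all other nodes are always full. */
static size_t head_used(const struct pte_chain *pc)
{
	size_t n = 0;

	while (n < NRPTE && pc->ptes[n])
		n++;
	return n;
}

/* Add a mapping. Returns 0 on success, -1 if allocation fails. */
static int chain_add(struct pte_chain **head, pte_t *ptep)
{
	struct pte_chain *pc = *head;
	size_t n = pc ? head_used(pc) : NRPTE;

	assert(ptep != NULL);
	if (n == NRPTE) {		/* no head yet, or head is full */
		pc = calloc(1, sizeof(*pc));
		if (!pc)
			return -1;
		pc->next = *head;
		*head = pc;
		n = 0;
	}
	pc->ptes[n] = ptep;
	return 0;
}

/* Remove a mapping, back-filling the hole from the head node.
 * Returns 0 on success, -1 if ptep is not in the chain. */
static int chain_remove(struct pte_chain **head, pte_t *ptep)
{
	struct pte_chain *first = *head;

	for (struct pte_chain *pc = first; pc; pc = pc->next) {
		for (size_t i = 0; i < NRPTE; i++) {
			if (pc->ptes[i] != ptep)
				continue;
			size_t last = head_used(first) - 1;
			pc->ptes[i] = first->ptes[last];
			first->ptes[last] = NULL;
			if (last == 0) {	/* head is now empty */
				*head = first->next;
				free(first);
			}
			return 0;
		}
	}
	return -1;
}

/* Number of nodes in the chain, i.e. cache lines touched on a full walk. */
static size_t chain_nodes(const struct pte_chain *pc)
{
	size_t n = 0;

	for (; pc; pc = pc->next)
		n++;
	return n;
}
```

The search in chain_remove is still linear in the number of mappings, but it touches one cache line per NRPTE entries instead of one per entry, which is where the 7x/15x/31x comes from.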
I'll code it tomorrow.