Since Alan's code looks like it prevents the possibility of prefetching too much data,
it seems something else must be the culprit. Arjan posted a link to a user-space program
that benchmarks various *_clear_page and *_copy_page schemes. I spent yesterday evening
playing with this.
On a whim, I added some NOP statements to the even_faster copy_page routine. Imagine
my surprise when I found that the NOP-modified routine showed the highest
throughput of all the *_copy_page routines, consistently beating the other optimized
routines by sometimes 10% or more (but generally between 2-5%). On my two Athlon machines
(both socket-A thunderbirds), I found that I got the best scores if I grouped the MOVQ and
MOVNTQ operations into sets of 4.
It's interesting to note that I don't run into any problems running any of the *_copy_page
schemes in user-space but if I try in kernel space, I get the notorious crash inside
fast_copy_page. (If there was some sort of fundamental hardware problem associated with
prefetch or streaming, wouldn't it also show up in user-space?) Note: I've yet to try the
NOP-modified routines in kernel-space.
I've tried this benchmark on 4 Athlon/Duron machines now (2 Socket-A Thunderbirds, 1
Socket-A Duron and 1 Slot-A Athlon). Each of the Socket-A machines showed improvement
using the NOP-modified routines while the Slot-A machine performed better with the original
3DNow routine.
Arjan's original code is at: http://www.fenrus.demon.nl/athlon.c
My modifications are at: http://sackheads.org/~mayfield/jrm_athlon.c
Example test runs:
copy_page() tests
copy_page function 'warm up run' took 21350 cycles per page
copy_page function '2.4 non MMX' took 27706 cycles per page
copy_page function '2.4 MMX fallback' took 28600 cycles per page
copy_page function '2.4 MMX version' took 21370 cycles per page
copy_page function 'faster_copy' took 13119 cycles per page
copy_page function 'even_faster' took 14767 cycles per page
copy_page function 'jrm_copy_page_8nop' took 12774 cycles per page
copy_page function 'jrm_copy_page_10nop' took 12746 cycles per page
copy_page function 'jrm_copy_page_12nop' took 12740 cycles per page
copy_page() tests
copy_page function 'warm up run' took 22499 cycles per page
copy_page function '2.4 non MMX' took 27769 cycles per page
copy_page function '2.4 MMX fallback' took 27696 cycles per page
copy_page function '2.4 MMX version' took 22666 cycles per page
copy_page function 'faster_copy' took 13058 cycles per page
copy_page function 'even_faster' took 13169 cycles per page
copy_page function 'jrm_copy_page_8nop' took 12691 cycles per page
copy_page function 'jrm_copy_page_10nop' took 12750 cycles per page
copy_page function 'jrm_copy_page_12nop' took 14786 cycles per page
The values obviously fluctuate depending on system activity but the jrm_* routines were
faster in 13 out of 15 trials.
- Jimmie
-- Jimmie Mayfield http://www.sackheads.org/mayfield email: mayfield+kernel@sackheads.org My mail provider does not welcome UCE -- http://www.sackheads.org/uce- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/