Nope. There are distinct alignment problems with movsl-based
memcpy on PII and (at least) "Pentium III (Coppermine)", which is
tested here:
copy_32 uses movsl. copy_duff just uses a stream of "movl"s
Time uncached-to-uncached memcpy, source and dest are 8-byte-aligned:
akpm:/usr/src/cptimer> ./cptimer -d -s
nbytes=10240 from_align=0, to_align=0
copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec
OK, movsl wins. But now give the source address 8+1 alignment:
akpm:/usr/src/cptimer> ./cptimer -d -s -f 1
nbytes=10240 from_align=1, to_align=0
copy_32: copied 19.1 Mbytes in 0.158 seconds at 120.8 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.091 seconds at 210.3 Mbytes/sec
The "movl"-based copy wins. By miles.
Make the source 8+4 aligned:
akpm:/usr/src/cptimer> ./cptimer -d -s -f 4
nbytes=10240 from_align=4, to_align=0
copy_32: copied 19.1 Mbytes in 0.134 seconds at 142.1 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.089 seconds at 214.0 Mbytes/sec
So movl still beats movsl, by lots.
I have various scriptlets which generate the entire matrix.
I think I ended up deciding that we should use movsl _only_
when both src and dsc are 8-byte-aligned. And that when you
multiply the gain from that by the frequency*size with which
funny alignments are used by TCP the net gain was 2% or something.
It needs redoing. These differences are really big, and this
is the kernel's most expensive function.
A little project for someone.
The tools are at http://www.zip.com.au/~akpm/linux/cptimer.tar.gz
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/