On what CPU?
I checked the Athlon4 optimization manual: fxrstor is listed as 68/108
cycles (I guess depending on whether there is XMM state or not, so 68 cycles
probably applies here) and fninit as 91 cycles. It doesn't list the SSE1
timings, but I guess those instructions don't take more than 3 cycles
(MMX instructions take that long). So Andrea's way should cost
91+16*3 = 139 cycles plus some cycles for emms (or 107 if SSE ops take only
a single cycle), vs 68 or 108 for fxrstor. So fxrstor clearly wins.
On x86-64 the difference is even bigger because it has 16 XMM registers instead
of 8.
> In short, your "fast" code isn't actually any faster than doing it right.
At least on Athlon it should be slower.
-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/