> > Besides, I seriously doubt it is any faster than what is there already.
> >
> > Time it, and notice how:
> >
> > - fninit takes about 200 cycles
> > - fxrstor takes about 215 cycles
>
> On what CPU?
Sorry, should have mentioned. That's a P4.
> I checked the Athlon4 optimization manual and fxrstor is listed as 68/108
> cycles (i guess depending on whether there is XMM state or not so 68 cycles
> probably apply here) and fninit as 91 cycles. It doesn't list the SSE1
> timings, but i guess the instructions don't take more than 3 cycles
> (MMX instructions take that long). So Andrea's way should be
> 91+16*3=139+some cycles for emms (or 107 if sse ops take only a single cycle)
> vs 68 or 108. So the fxrstor wins well.
The athlon should be able to do two MMX / cycle (the throughput is much
lower, but there are no data dependencies here).
> > In short, your "fast" code isn't actually any faster than doing it right.
>
> At least on Athlon it should be slower.
I suspect that the real answer is "the speed difference is _way_ in the
noise".
Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/