pxor+xorps is definitely faster than fxrestor on athlon-mp.
fxrestor on athlon-mp 1600, on cold cache (the "default fpu state" will
be cold most of the time, it's only ever used at the first math fault of
a task):
dualathlon:/home/andrea # for i in 1 2 3 4 5 6 7 8 9 10 ; do ./a.out ;
done
cycles 3507
cycles 3293
cycles 3179
cycles 3267
cycles 3374
cycles 3213
cycles 2575
cycles 3306
cycles 2915
cycles 3078
pxor + xorps on the same hardware cold cache:
dualathlon:/home/andrea # for i in 1 2 3 4 5 6 7 8 9 10 ; do ./a.out ;
done
cycles 2444
cycles 1619
cycles 2396
cycles 2644
cycles 2322
cycles 2404
cycles 2266
cycles 1657
cycles 2182
cycles 2077
And here it is again on PIII 800mhz for confirmation on true Intel CPU:
fxrestor:
for i in 1 2 3 4 5 6 7 8 9 10 ; do ./a.out ; done
cycles 3461
cycles 3294
cycles 3339
cycles 3276
cycles 3620
cycles 3443
cycles 3401
cycles 3455
cycles 3515
cycles 3327
xor+xorps:
for i in 1 2 3 4 5 6 7 8 9 10 ; do ./a.out ; done
cycles 2288
cycles 2181
cycles 2430
cycles 2486
cycles 2360
cycles 2534
cycles 2243
cycles 2199
cycles 2153
cycles 2396
As said in an earlier email it is a matter of memory bandwith, 59 bytes
of icache and zero data, against 7 bytes of icache and 512bytes of data.
the 512bytes of data are visibly slower, period. Saving mem bandwith is
much more important than reducing the number of instructions, even more
on SMP!
On the x86-64 it will be exactly the same, the mem bandwith is much
higher, but it's still a major bottleneck compared to a few more
instructions to execute (also consider the current ia32 isn't probably
reading the xmm8-15 slots, so on x86-64 fxrestor will be potentially
even slower compared to the in-core xorps that never do any I/O to
memory).
Go ahead and try yourself setting the #if to 0 or 1. The benchmark is
very accurate.
#include <sys/mman.h>
#include <asm/msr.h>
struct i387_fxsave_struct {
unsigned short cwd;
unsigned short swd;
unsigned short twd;
unsigned short fop;
long fip;
long fcs;
long foo;
long fos;
long mxcsr;
long reserved;
long st_space[32]; /* 8*16 bytes for each FP-reg = 128 bytes */
long xmm_space[32]; /* 8*16 bytes for each XMM-reg = 128 bytes */
long padding[56];
} __attribute__ ((aligned (16)));
#define LOOPS 1
#define load_mxcsr( val ) do { \
unsigned long __mxcsr = ((unsigned long)(val) & 0xffbf); \
asm volatile( "ldmxcsr %0" : : "m" (__mxcsr) ); \
} while (0)
struct i387_fxsave_struct i387;
char buf[1024*1024*40];
static void cold_dcache(void)
{
memset(buf, 0, 1024*1024*40);
}
main()
{
unsigned long before, after;
int i;
iopl(3);
mlockall(MCL_FUTURE);
#if 1
cold_dcache();
asm("cli");
rdtscl(before);
__asm__("fninit");
asm volatile(
"pxor %mm0, %mm0 \n"
"xorps %xmm0, %xmm0 \n"
"pxor %mm1, %mm1 \n"
"xorps %xmm1, %xmm1 \n"
"pxor %mm2, %mm2 \n"
"xorps %xmm2, %xmm2 \n"
"pxor %mm3, %mm3 \n"
"xorps %xmm3, %xmm3 \n"
"pxor %mm4, %mm4 \n"
"xorps %xmm4, %xmm4 \n"
"pxor %mm5, %mm5 \n"
"xorps %xmm5, %xmm5 \n"
"pxor %mm6, %mm6 \n"
"xorps %xmm6, %xmm6 \n"
"pxor %mm7, %mm7 \n"
"xorps %xmm7, %xmm7 \n"
"emms \n");
load_mxcsr(0x1f80);
rdtscl(after);
#else
asm volatile("fxsave %0" : : "m" (i387));
cold_dcache();
asm("cli");
rdtscl(before);
asm volatile("fxrstor %0" : "=m" (i387));
rdtscl(after);
#endif
printf("cycles %lu\n", after-before);
}
As said for SSE3 a kernel modification will be required anyways if xorps
isn't enough to initialize the new fpu, so the current 2.4 kernel cannot
cope with it anyways regardless of fxrestor or not.
So I don't buy that fxrestor is better, nor more generic.
Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/