No, modern CPUs do it well enough - starting from the Pentium, Intel does
all locking internally in the caches, and depends on the cache coherency
protocol to show the atomicity to the rest of the world. Only an i486 will
actually show the locked cycles on the bus, if I remember correctly.
In fact, I think the CPU will do an unaligned non-cacheline-crossing store
as fast as an aligned store. The cacheline-crossing case is noticeably
slower, at least on a PPro (the Pentium had some optimizations where it
would pair the two cacheline accesses, and could do two cacheline accesses
in one cycle - so the cacheline crosser could execute at full speed, but
it would hurt pairing with _other_ memory instructions).
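
To make the "cacheline-crossing" distinction concrete, here is a minimal sketch of how one might check whether a given access straddles two cache lines. The helper name crosses_cacheline() is made up for illustration, and the line size is a parameter because it differs between CPUs; the 32-byte value in the usage note is an assumption, not something measured here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if an access of `size` bytes starting
 * at address `addr` touches two different cache lines, 0 otherwise.
 * `line_size` must be a power of two (an assumption of this sketch).
 */
static int crosses_cacheline(uintptr_t addr, size_t size, size_t line_size)
{
    assert(size > 0);
    assert(line_size > 0 && (line_size & (line_size - 1)) == 0);

    /* Round the first and the last byte address down to a line start. */
    uintptr_t mask  = ~(uintptr_t)(line_size - 1);
    uintptr_t first = addr & mask;
    uintptr_t last  = (addr + size - 1) & mask;

    return first != last;
}
```

With an assumed 32-byte line, a 4-byte store at offset 1 is unaligned but stays inside one line (the cheap case), while the same store at offset 30 spills into the next line (the case that needs two cache accesses).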

Testing shows:

 - PPro core:
	single-cycle stores, whether aligned or not, within a
	cacheline.
	8 cycles for cacheline-crossing stores.

 - Athlon:
	single cycle for unaligned, whether cacheline-crosser or not.

(And as mentioned, I think Pentiums act the same as Athlons).

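
For anyone who wants to try this sort of measurement themselves, a rough sketch of a store loop follows. This is not the test that produced the numbers above; the names bench_buf, store_u32(), load_u32() and time_stores() are invented for the example, the compiler barrier is GCC/Clang-specific, and clock() is far too coarse for cycle counts, so treat it only as a starting point.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Buffer aligned to 64 bytes so that offsets map predictably onto
 * cache lines (the 64 is an assumption about the line size). */
static _Alignas(64) unsigned char bench_buf[128];

#if defined(__GNUC__)
/* Keeps the compiler from dropping or merging the stores in the loop. */
#define COMPILER_BARRIER() __asm__ volatile("" ::: "memory")
#else
#define COMPILER_BARRIER() ((void)0)
#endif

/* memcpy is the portable way to express an unaligned access in C;
 * on x86 compilers normally turn it into a single mov. */
static void store_u32(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}

static uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Performs `iters` 4-byte stores to `p` and returns the elapsed
 * CPU time in seconds as reported by clock(). */
static double time_stores(unsigned char *p, unsigned long iters)
{
    clock_t start = clock();
    for (unsigned long i = 0; i < iters; i++) {
        store_u32(p, (uint32_t)i);
        COMPILER_BARRIER();
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_stores() with bench_buf + 0, bench_buf + 1 and bench_buf + 62 would compare the aligned, unaligned-within-a-line and line-crossing cases. For real per-store cycle counts you would want the CPU's cycle counter and a large iteration count rather than clock().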
In short, unaligned integer ops are not affected very much at all. They do
take more resources internally (i.e. they use two write-ports to the cache
when cacheline-crossing), so even when they run at the "same" speed, they
pair etc. better if aligned, but x86 is very very good at unaligned handling.
One of the advantages of a legacy of crap: x86 never had the choice to be
designed for "the good case". In order to run fast, an x86 has to run fast
even on bad code.
Because in real life, it doesn't matter how well you do on spec benchmarks
with good compilers.
Linus