A lock in a SMP system also needs to synchronize the instruction stream,
and not let stores move "out" from the locked region.
On a UP system, this all happens automatically (well, getting it to happen
right is obviously one of the big issues in an out-of-order CPU core, but
it's a very fundamental part of the core, so it's "free" in the sense that
if it isn't done, the CPU simply doesn't work).
On SMP, it's a memory barrier. This is why a "lock ; decl" is more
expensive than a "decl" - it's the implied memory ordering constraints (on
other architectures they are explicit). On an intel CPU, this basically
means that the pipeline is drained, so a locked instruction takes roughly
12 cycles on a PPro core (AMD's K7 core seems to be rather more graceful
about this one). I haven't timed a P4 lately, I think it's worse.
Other architectures do the memory ordering explicitly, and some are
better, some are worse. But it always costs you _something_.
Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/