Interesting. There seems to be three peaks: a big 4/1 split at 19-20 vs
35-36 cycles, which is probably just the L1 cache (8 bytes per entry,
32-byte cachelines on the EV5 gives 4 entries per cache load), while the
much smaller peak at 105 cycles might possibly be due to the virtual
lookup miss, causing a double TLB miss and a real walk every 8kB entries
(actually, much more often than that, since there's TLB pressure and the
virtual PTE mappings get thrown out faster than the theoretical numbers
would indicate)
It also shows how pretty studly it is to take a sw TLB miss quite that
quickly. Getting in and out of PAL-mode that fast is rather fast.
> ev6 @ 500MHz:
> 2.43: 78
> 72.13: 84
> 2.55: 89
> 5.87: 90
> 1.38: 105
> 5.94: 108
> 1.36: 112
>
> I wonder how much of that ev6 slowdown is due to an SRM that's
> has to handle both 3 and 4 level page tables, and how much is
> due to the more expensive syncing of the OOO pipeline...
The multi-level page table shouldn't hurt at all for the common case (ie
the virtual PTE lookup success), so my money would be on the pipeline
flush.
The other profile difference seems to be due to the 64-byte cacheline (ie
a cacheline now holds 8 entries, so 7/8th can be filled that way).
However, I doubt whether that third peak could be a double PTE fault, it
seems too big and too close in cycles to the others. So maybe the third
peak at 108 cycles is something else... As it seems to balance out very
nicely with the second peak, I wonder if there might not be something
making every other cache fill faster - like a 128-byte prefetch or an
external 128-byte line on the L2/L3? (Ie the third peak would be really
just the "other half" of the second peak).
Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/