SSE has prefetchnta; 3DNow! has something similar (prefetch and prefetchw).
In addition, you can use movnti* for stores. These should be faster
because they use write combining and avoid the latency of fetching
the cache line of the destination just to overwrite it.
The tricky bit is to avoid prefetching past the end of your copy.
Prefetching from an uncached or write-combined area (like the
AGP GART, which could start in the next page) triggers hardware bugs on
various boxes. This unfortunately complicates the prefetching loops
a bit.
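
To make that concrete, here is a rough userspace sketch in C of a copy loop
that prefetches the source with prefetchnta, stores with movnti, and never
issues a prefetch for an address outside the source buffer. It is not the
kernel's copy*user code. The name copy_nt and the PREFETCH_AHEAD distance of
256 bytes are made up for illustration, and it uses the compiler intrinsics
_mm_prefetch, _mm_stream_si32 and _mm_sfence where SSE2 is available, falling
back to a plain byte loop elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_si32 (movnti), _mm_prefetch, _mm_sfence */
#endif

#define PREFETCH_AHEAD 256  /* bytes; assumed value, needs tuning per CPU */

/* Hypothetical helper: copy n bytes, dst and src must not overlap. */
static void copy_nt(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;

#if defined(__SSE2__)
    /* Byte-copy the head until the destination is 4-byte aligned. */
    while (i < n && ((uintptr_t)(d + i) & 3) != 0) {
        d[i] = s[i];
        i++;
    }

    for (; i + 64 <= n; i += 64) {
        /*
         * Only prefetch addresses that are still inside the source
         * buffer. A cache line never crosses a page, so this keeps
         * the prefetch out of whatever is mapped after the buffer.
         */
        if (i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)(s + i + PREFETCH_AHEAD),
                         _MM_HINT_NTA);

        for (size_t j = 0; j < 64; j += 4) {
            int32_t w;
            memcpy(&w, s + i + j, sizeof w);         /* unaligned-safe load */
            _mm_stream_si32((int *)(d + i + j), w);  /* non-temporal store */
        }
    }
    _mm_sfence();  /* order the write-combined stores before later stores */
#endif

    /* Tail (or the whole copy when SSE2 is not available). */
    for (; i < n; i++)
        d[i] = s[i];
}
```

The bounds check costs one compare per cache line. A real implementation
would more likely split the loop into a prefetching body and a
non-prefetching tail so that the check drops out of the hot path.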
>
> > The rep ; movsl loop used in copy*user isn't
> > very good on modern x86 anyways (it is ok on PPro, but loses on Athlon
> > and P4)
>
> On PII and PIII, rep;movsl is slower than an open-coded
> duff-device copy for all src/dest alignments except for
> the case where both are eight-byte-aligned. By up to
> 20%, iirc. four-byte-aligned to four-byte-aligned isn't
> too bad.
That's surprising. AFAIK on PPro rep ; movs does magic prefetch
tricks in microcode, so for bigger copies it should eventually be
faster if you do not use explicit prefetching and the data is not
cache hot (for smaller copies the setup overhead may dominate).
On Athlon, rep ; movs clearly loses to an unrolled loop.
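
By "unrolled loop" I mean something along these lines, shown here as a C
sketch rather than the assembly you would actually use. The name
copy_words_unrolled and the unroll factor of eight are just for illustration,
and a compiler is free to turn this into something else entirely, so it shows
the shape of the loop, not its timing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy nwords 32-bit words, no overlap allowed. */
static void copy_words_unrolled(uint32_t *dst, const uint32_t *src,
                                size_t nwords)
{
    /* Main body: eight loads followed by eight stores per iteration. */
    while (nwords >= 8) {
        uint32_t w0 = src[0], w1 = src[1], w2 = src[2], w3 = src[3];
        uint32_t w4 = src[4], w5 = src[5], w6 = src[6], w7 = src[7];

        dst[0] = w0; dst[1] = w1; dst[2] = w2; dst[3] = w3;
        dst[4] = w4; dst[5] = w5; dst[6] = w6; dst[7] = w7;

        src += 8;
        dst += 8;
        nwords -= 8;
    }

    /* Tail: up to seven remaining words, one at a time. */
    while (nwords > 0) {
        *dst++ = *src++;
        nwords--;
    }
}
```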
-Andi