Looking at the assembly output of both optimised and unoptimised, we
see quite startling differences in the way the loops are done..
The unoptimised case..
    movl    $0, -12(%ebp)
.L75:
    cmpl    $63, -12(%ebp)
    jle .L78
    jmp .L76
	...
	movntq/movq inline asm bits
	...
    leal    12(%ebp), %eax
    addl    $64, (%eax)
    addl    $64, 8(%ebp)
    leal    -12(%ebp), %eax
    incl    (%eax)
    jmp .L75
Note it uses -12(%ebp) to keep track of how much its copied.
The optimised version is much more sensible..
    movl    $63, %ebx
    .p2align 2
.L98:
	...
    movntq/movq inline asm bits
    ...
    addl    $64, %ecx
    addl    $64, %edx
    decl    %ebx
    jns .L98
Keeping track of the count in an register, no indirect memory references,
leaving the only memory references to be the actual memory copies, which
let it achieve the full bandwidth of the memory bus.
Quite surprising. I doubt going over the top with CFLAGS buys you much.
The above optimisation comes in with just -O2.
		Dave
-- | Dave Jones. http://www.codemonkey.org.uk - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/