Re: Intel P6 vs P7 system call performance

Linus Torvalds (torvalds@transmeta.com)
Fri, 20 Dec 2002 08:47:05 -0800 (PST)

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Bjorn Helgaas: "Re: [PATCH] Fix CPU bitmask truncation"
Previous message: James Bottomley: "[PATCH 2.4] Enable checking of reap of in-use slab"

On Fri, 20 Dec 2002, Jamie Lokier wrote:
>
> Ulrich Drepper wrote:
> > int $0x80 -> call *%gs:0x18
>
> The calling convention has been (slightly) changed - i.e. 6 argument
> calls don't work, so why not go a bit further: allow the vsyscall entry
> point to clobber more GPRs?

Actually, six-argument syscalls _do_ work. I admit that the way to make
them work is "interesting", but it's also extremely simple.

> The benefit is that this allows Glibc to do a wholesale replacement of
> "int $0x80" -> "single call instruction". Otherwise, those pushes are
> completely unnecessary. It could be this short instead:
>
> vsyscall:
> movl %esp,%ebp
> sysenter
> jmp vsyscall
> ret

Yes, we could have changed the implementation to clobber more registers,
but if we want to support all system calls it would still have to save
%ebp, so the minimal approach would have been

vsyscall:
pushl %ebp
0:
movl %esp,%ebp
sysenter
jmp 0b /* only done for restarting */
popl %ebp
ret

which is all of 4 (simple) instructions cheaper than the one we have now.

And if the caller cannot depend on registers being saved, the caller may
actually end up being more complicated. For example, with the current
setup, you can have

getpid():
movl $__NR_getpid,%eax
jmp *%gs:0x18

but if system calls clobber registers, the caller needs to be

getpid():
pushl %ebx
pushl %esi
pushl %edi
pushl %ebp
movl $__NR_getpid,%eax
call *%gs:0x18
popl %ebp
popl %edi
popl %esi
popl %ebx
ret

and notice how the _real_ code sequence actually got much _worse_ from the
fact that you tried to save time by not saving registers.

> It is nice to be able to use the _exact_ same convention in glibc, for
> getting a patch out of the door quickly. But it is just as easy to do
> that putting the pushes and pops into the library itself:
>
> Instead of
>
> int $0x80 -> call *%gs:0x18
>
> Write
>
> int $0x80 -> pushl %ebp
> pushl %ecx
> pushl %edx
> call *%gs:0x18
> popl %edx
> popl %ecx
> popl %ebp

But where's the advantage then? You use the same number of instructions
dynamically, and you use _more_ icache space than if you have the pushes
and pops in just one place?

> It has exactly the same cost as the current patches, but provides
> userspace with more optimisation flexibility, using an asm clobber
> list instead of explicit instructions for inline syscalls, etc.

In practice, there is nothing around the call. And you have to realize
that the pushes and pops you added in your version are _wasted_ for other
cases. If the system call ends up being int 0x80, you just wasted time. If
the system call ended up being AMD's x86-64 version of syscall, you just
wsted time.

The advantage of putting all the register save in the trampoline is that
user mode literally doesn't have to _know_ what it is calling. It only
needs to know two simple rules:

- registers are preserved (except for %eax which is the return value, of
course)
- it should fill in arguments in %ebx, %ecx ... (but the things that
aren't arguments can just be left untouched)

And then depending on what the real low-level calling convention is, the
trampoline will save the _minimum_ number of registers (ie some calling
conventions might clobber different registers than %ecx/%edx - you have to
remember that "sysenter" is just _one_ out of at least three calling
conventions available).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Bjorn Helgaas: "Re: [PATCH] Fix CPU bitmask truncation"
Previous message: James Bottomley: "[PATCH 2.4] Enable checking of reap of in-use slab"