Ok, if we're just disagreeing about the best api,
then I can live with that. But it appears we're
talking at cross-purposes, so I want to try this one
more time. I'll lay my thought processes out in detail,
and you can tell me at which step I'm going wrong:
Normally, you'd spend most of your time sitting in
ioctl(EP_POLL) waiting for something to happen. So
that's one syscall.
If you get an event that indicates you can accept()
a new connection, then you do an accept(). Assume it
succeeds. That's two syscalls. Then you register
interest in the fd with a write to /dev/epoll, that's
three.
With the current /dev/epoll, you must try to read()
the new socket before you go back to ioctl(EP_POLL),
just in case there is data available. You expect
there isn't, but you have to try. This is the step
I'm talking about. That's four.
Assume data was not available, so you loop back
to ioctl(EP_POLL) and wait for an event. That's five
syscalls. The event comes in, you do another read()
on the socket, and probably get some data. That's
six syscalls to finally get your data.

    ioctl(kpfd, EP_POLL)    1  wait for events
    s = accept()            2  accept a new socket
    write(kpfd, s)          3  register interest
    n = read(s)             4  <-- annoying test-read
    ioctl(kpfd, EP_POLL)    5  wait for events
    n = read(s)             6  get some data

You have a similar problem with writes, but I'm
guessing it's safe to assume the first write will
always succeed, so it's awkward but not a big
problem.
If /dev/epoll tested the initial state of the socket,
then there would be no need for the test read:

    ioctl(kpfd, EP_POLL)    1  wait for events
    s = accept()            2  accept a new socket
    write(kpfd, s)          3  register interest
    ioctl(kpfd, EP_POLL)    4  wait for events
    n = read(s)             5  get some data

So, we've saved a syscall and, perhaps more importantly,
we don't have to keep a list of to-be-read-just-in-case
fds sitting around. I would rather make this a "clean
api" argument than a performance argument, since it's
unclear that there is really any significant speed
difference in practice.
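
To make the difference concrete, here is a toy model of the two
registration behaviours. It is only a sketch of my argument, not the
/dev/epoll code: the ToyPoller class, its method names and the fd
number are all made up for illustration, and poll() returns
immediately instead of blocking.

```python
class ToyPoller:
    """Toy model of an event queue that only reports state changes.

    Hypothetical stand-in for illustration; not the /dev/epoll code.
    """

    def __init__(self, check_initial_state):
        self.check_initial_state = check_initial_state
        self.registered = set()
        self.buffered = {}   # fd -> bytes already received
        self.events = []     # fds with an undelivered "readable" event

    def data_arrives(self, fd, payload):
        self.buffered[fd] = self.buffered.get(fd, b"") + payload
        # Only fds that are already registered generate events.
        if fd in self.registered and fd not in self.events:
            self.events.append(fd)

    def register(self, fd):
        """Models write(kpfd, s)."""
        self.registered.add(fd)
        # Proposed behaviour: also report data that is already buffered.
        if self.check_initial_state and self.buffered.get(fd):
            if fd not in self.events:
                self.events.append(fd)

    def poll(self):
        """Models ioctl(kpfd, EP_POLL), but returns instead of blocking."""
        ready, self.events = self.events, []
        return ready


def early_data_reported(check_initial_state):
    poller = ToyPoller(check_initial_state)
    fd = 7                               # arbitrary fd number for the new socket
    poller.data_arrives(fd, b"GET /")    # lands after accept(), before registration
    poller.register(fd)
    return poller.poll() == [fd]


print(early_data_reported(False))   # False: only a test read would find the data
print(early_data_reported(True))    # True: no test read, no just-in-case list
```

In the first case the application has to remember the fd and read it
"just in case"; in the second the event queue alone is enough.
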
Note that the number of unnecessary syscalls is much
greater than 20%, since on a heavily loaded server, you
could be doing thousands of unnecessary reads for every
ioctl(EP_POLL).
On a fast local network you'd expect the test reads
to mostly return something, so it's no big deal. But
if you've got 10k very slow connections...
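
A back-of-the-envelope way to see this, assuming every socket
accepted since the last ioctl(EP_POLL) gets one test read. The helper
name and the sample figures are mine and purely illustrative; they
are not measurements.

```python
def wasted_test_reads(new_connections, already_readable):
    """Test reads that return no data during one pass of the event loop.

    new_connections:  sockets accepted since the last ioctl(EP_POLL)
    already_readable: how many of those had data by the time of the test read
    """
    if not 0 <= already_readable <= new_connections:
        raise ValueError("already_readable must be between 0 and new_connections")
    return new_connections - already_readable


# Illustrative numbers only, not measurements.
print(wasted_test_reads(1000, 950))   # fast LAN: most test reads find data -> 50
print(wasted_test_reads(1000, 10))    # slow clients: nearly all are wasted -> 990
```
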
There's a good summary of the problem in the Banga,
Mogul and Druschel[1] paper at:
http://citeseer.nj.nec.com/banga99scalable.html
Page 5, right hand column, third paragraph.
By the way, thanks for the patch. I know I've been
complaining about it, but I wouldn't have bothered
unless I thought it was a good thing. I appreciate
your taking the time to write and release it.
--
Christopher St. John cks@distributopia.com
DistribuTopia http://www.distributopia.com