We've seen this bug (or a *very* similar one) with 2.2.16 and 2.4.[01]. I
haven't tried a newer 2.2.x or 2.4.2 or -acX.
One system is a P2/400, the other is a P3/800. The two boxes are
communicating over a mostly idle Ethernet, through 3 switches. One end is
a EEPro 100, the other end is an Acenic, although that shouldn't matter.
During a stall, the sending end shows a lot of data stuck in the Recv-Q:
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 72848 0 ref.lab.ocp.interna:840 ref-0.sys.pnap.net:ssh ESTABLISHED
The receiving end shows a similar problem, but on the sending queue:
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 28960 ref-0.sys.pnap.net:ssh ref.lab.ocp.interna:840 ESTABLISHED
Like I said, I don't believe that this is a network issue, because I can
un-stall the rsync by either stracing the *sending* ssh process, or by
putting the sending rsync into the background with ^Z and then popping it
back into the foreground. I have tcpdumps that I can send, but they look
pretty straightforward to me -- the window fills, so data stops flowing.
Strace doesn't seem to be particularly informative:
<blocked, strace starts>
select(4, [0], [1], NULL, NULL) = 1 (out [1])
write(1, "xxxxxxxxxxxxx"..., 66156) = 66156
...
select(4, [0], [1 3], NULL, NULL) = 2 (out [1 3])
write(1, "\0\0\0\0\274\2\0\0\0\0\0\0\271\30\0\0\0\0\0\0\274\2\0\0"..., 69526
<blocked again>
Strace on the receiving end shows the obvious -- it's sitting in select
waiting for data to arrive.
According to 'ps l', the ssh process is waiting in 'sock_wait_for_wmem'.
We've tried changing versions of rsync and ssh without any success. FWIW,
this kernel was compiled with GCC 2.95.2, from Debian potato.
Scott
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/