We've been pulling our hair out lately trying to figure out why
certain connections of ours have been stalling and dying. It turns
out that the problem occurs when TCP segments are lost and then
coalesced into a single segment for retransmission. When that
happens, a bad checksum is computed, and the connection dies while
one end keeps retransmitting the same packet with the bad checksum.

Here's a patch against 2.4.5pre5 which should fix it:

--- tcp_output.c~	Thu Apr 12 15:11:39 2001
+++ tcp_output.c	Fri Jun 29 13:58:31 2001
@@ -722,7 +722,7 @@
 	if (skb->ip_summed != CHECKSUM_HW) {
 		memcpy(skb_put(skb, next_skb_size), next_skb->data, next_skb_size);
-		skb->csum = csum_block_add(skb->csum, next_skb->csum, skb->len);
+		skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size);
 	}
 	/* Update sequence range on original skb. */

Hopefully the problem is obvious in retrospect. skb_put(skb, ...)
modifies skb->len, so the new value was being passed to
csum_block_add as the offset instead of the original len. We're
testing the patch now, but it seems fairly obvious, and apparently
other people have been reporting similar problems, so I wanted to
get this out there...
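
For anyone who wants to see why the offset matters, here is a rough
userspace sketch of the arithmetic. It is not the kernel code: sum16(),
block_add(), and self_check() are hypothetical stand-ins I made up for
illustration, working on folded 16-bit sums, and the byte values are
arbitrary. The idea is that a block starting at an odd offset has its
bytes in the opposite halves of each 16-bit word, so its partial sum
has to be byte-swapped before being added in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
static uint16_t fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Ones' complement sum of a buffer, taken as big-endian 16-bit words
 * starting at an even offset; an odd trailing byte is zero-padded. */
static uint16_t sum16(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (i < len)
        sum += (uint32_t)p[i] << 8;
    return fold(sum);
}

/* Hypothetical analogue of csum_block_add(): combine the sum `a` of
 * the first block with the sum `b` of a block that begins at byte
 * `offset` of the combined buffer. Only the parity of offset matters. */
static uint16_t block_add(uint16_t a, uint16_t b, size_t offset)
{
    if (offset & 1)
        b = (uint16_t)((b << 8) | (b >> 8));
    return fold((uint32_t)a + b);
}

/* Append a 3-byte block to a 3-byte block, mimicking the collapse. */
static void self_check(void)
{
    const uint8_t first[]  = { 0x01, 0x02, 0x03 };
    const uint8_t second[] = { 0x04, 0x05, 0x06 };
    const uint8_t whole[]  = { 0x01, 0x02, 0x03, 0x04, 0x05, 0x06 };
    size_t old_len = sizeof(first);            /* length before append */
    size_t new_len = old_len + sizeof(second); /* length after append  */
    uint16_t a = sum16(first, sizeof(first));
    uint16_t b = sum16(second, sizeof(second));

    /* Original length as offset: matches a sum over the whole buffer. */
    assert(block_add(a, b, old_len) == sum16(whole, sizeof(whole)));
    /* Updated length as offset: parity differs, so the sum is wrong. */
    assert(block_add(a, b, new_len) != sum16(whole, sizeof(whole)));
}
```

If this sketch reflects the kernel behaviour, the two offsets only
disagree when the appended segment has an odd length, since that is
what flips the parity of the length. That may be why only some
connections were affected.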
Todd
p.s. If there's followup, please Cc me directly, as I'm not
subscribed to lkml.
--
Todd Sabin <tas@webspan.net>
BindView RAZOR Team <tsabin@razor.bindview.com>