NMI Watchdog detected LOCKUP on CPU1

Martin Josefsson (gandalf@wlug.westbo.se)
Tue, 9 Oct 2001 21:09:29 +0200 (CEST)


Hi

(Sorry for the crosspost but I don't know on which list the people capable
of helping are subscribed :) (it's kernel, networkrelated and possibly
netfilter related)

As you can see (see topic) I have a small problem.

I got this while stresstesting an SMP router here.

hardware:
2 x pIII 600
440BX chipset motherboard
2 x 32MB pc100 ram
128MB flashdisk
D-Link DFE570-TX (quad DECchip 21143) NIC
Berkshire PCI watchdogcard.

Kernel: 2.4.8-ac12 (got a hang on 2.4.9-ac18 too but didn't have a monitor
or serial console attached to it and the watchdogcard rebooted it)

The tulipdriver I'm using isn't the "vanilla" one, it's the optimized
polling version by Jamal and Robert (ANK was involved here too I think?)

I can't rule this driver out but I don't think it's at fault here.

I was running a quite heavy stresstest between 3 ports on the NIC.
All was going just fine until I heard the watchdog go *beep* as it
rebooted the router. And then I found this on the serial-console:

"NMI Watchdog detected LOCKUP on CPU1, registers:" and a stackdump
(decoded below)

loaded modules:

ipt_MASQUERADE 1360 1 (autoclean)
ip_nat_irc 2768 0 (unused)
ip_conntrack_irc 2496 0 [ip_nat_irc]
ip_nat_ftp 3184 0 (unused)
ip_conntrack_ftp 3280 0 [ip_nat_ftp]
iptable_nat 13424 7 [ipt_MASQUERADE ip_nat_irc ip_nat_ftp]
ip_conntrack 14224 8 [ipt_MASQUERADE ip_nat_irc ip_conntrack_irc ip_nat_ftp ip_conntrack_ftp iptable_nat]
iptable_filter 1728 0 (unused)
ip_tables 10592 5 [ipt_MASQUERADE iptable_nat iptable_filter]
tulip 42224 3

/proc/version:
Linux version 2.4.8-ac12 (root@tux) (gcc version 2.95.4 20010604 (Debian
prerelease)) #7 SMP Sun Aug 26 22:02:31 CEST 2001

the router is set up to SNAT packets going out of eth0 and then I have a
portforward leading in to a machine on another port in the NIC.

I had one machine on each of eth0, eth1 and eth3 generating and recieving
traffic. (the traffic was generated by 'nc ip port < /dev/zero' and
revieced by 'nc -l -p port > /dev/null' so there's some pipeing going on
too.)

two streams coming in on eth1 and eth3 where SNAT'd out via eth0 and one
stream coming in on eth0 was DNAT'd to the machine on eth1.
And then there where two streams between eth1 and eth3, one in each
direction.

= there where quite a lot of packets.

The router had been running this test for over 36 hours when it died.
As I wrote earlier, 2.4.9-ac18 died too, within a few hours but I had no
chance of getting the output :(

Decoded Oops:

ksymoops 2.4.1 on i686 2.4.9-ac5. Options used
-V (default)
-K (specified)
-L (specified)
-O (specified)
-m System.map-2.4.8-ac12 (specified)

NMI Watchdog detected LOCKUP on CPU1, registers:
CPU: 1
EIP: 0010:[<c0105c5f>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00000087
eax: c483f2a4 ebx: c1e8c200 ecx: c1e8c29c edx: c1120000
esi: c1e8c2d0 edi: c11b2158 ebp: 5671b6f9 esp: c1121d58
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c1121000)
Stack: c483e8b5 c1e8c200 c11b2140 c4816734 c1e8c200 00000000 c022394e c1e8c200
00000000 c11b2158 c48065b1 c1970c80 c1679320 c26a3740 c11b2000 c02482d0
000000c8 c11b2158 00000000 5671b6f9 00000006 00000000 00000282 c11b2140
Call Trace: [<c483e8b5>] [<c4816734>] [<c022394e>] [<c48065b1>] [<c02482d0>]
[<c022e71b>] [<c0226e9d>] [<c0248390>] [<c022d408>] [<c0245950>] [<c02482b2>]
[<c02482d0>] [<c02459aa>] [<c022d408>] [<c02458e8>] [<c0245950>] [<c0244aed>]
[<c0244910>] [<c022d408>] [<c0244776>] [<c0244910>] [<c022771e>] [<c011956f>]
[<c010861b>] [<c01051b0>] [<c01051b0>] [<c010a714>] [<c01051b0>] [<c01051b0>]
[<c01051dd>] [<c0105242>] [<c011577b>]
Code: 81 38 00 00 00 01 75 f8 f0 81 28 00 00 00 01 0f 85 e4 ff ff

>>EIP; c0105c5f <__write_lock_failed+7/20> <=====
Trace; c483e8b5 <END_OF_CODE+4466b65/????>
Trace; c4816734 <END_OF_CODE+443e9e4/????>
Trace; c022394e <__kfree_skb+96/134>
Trace; c48065b1 <END_OF_CODE+442e861/????>
Trace; c02482d0 <ip_finish_output2+0/f8>
Trace; c022e71b <qdisc_restart+6b/17c>
Trace; c0226e9d <dev_queue_xmit+111/2ec>
Trace; c0248390 <ip_finish_output2+c0/f8>
Trace; c022d408 <nf_hook_slow+110/194>
Trace; c0245950 <ip_forward_finish+0/60>
Trace; c02482b2 <ip_finish_output+12a/134>
Trace; c02482d0 <ip_finish_output2+0/f8>
Trace; c02459aa <ip_forward_finish+5a/60>
Trace; c022d408 <nf_hook_slow+110/194>
Trace; c02458e8 <ip_forward+1f8/260>
Trace; c0245950 <ip_forward_finish+0/60>
Trace; c0244aed <ip_rcv_finish+1dd/220>
Trace; c0244910 <ip_rcv_finish+0/220>
Trace; c022d408 <nf_hook_slow+110/194>
Trace; c0244776 <ip_rcv+3b6/400>
Trace; c0244910 <ip_rcv_finish+0/220>
Trace; c022771e <net_rx_action+1da/334>
Trace; c011956f <do_softirq+6f/cc>
Trace; c010861b <do_IRQ+db/ec>
Trace; c01051b0 <default_idle+0/34>
Trace; c01051b0 <default_idle+0/34>
Trace; c010a714 <call_do_IRQ+5/d>
Trace; c01051b0 <default_idle+0/34>
Trace; c01051b0 <default_idle+0/34>
Trace; c01051dd <default_idle+2d/34>
Trace; c0105242 <cpu_idle+3e/54>
Trace; c011577b <_call_console_drivers+53/58>
Code; c0105c5f <__write_lock_failed+7/20>
00000000 <_EIP>:
Code; c0105c5f <__write_lock_failed+7/20> <=====
0: 81 38 00 00 00 01 cmpl $0x1000000,(%eax) <=====
Code; c0105c65 <__write_lock_failed+d/20>
6: 75 f8 jne 0 <_EIP>
Code; c0105c67 <__write_lock_failed+f/20>
8: f0 81 28 00 00 00 01 lock subl $0x1000000,(%eax)
Code; c0105c6e <__write_lock_failed+16/20>
f: 0f 85 e4 ff ff 00 jne fffff9 <_EIP+0xfffff9> c1105c58 <END_OF_CODE+d2df08/????>

To me it looks like ip_finish_output2 freaked and did something nasty.

I hope someone who knows the networking a little better than me can help
with this one. My goal is to make linux survive at least a month of
stresstesting :)

/Martin

Never argue with an idiot. They drag you down to their level, then beat you with experience.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/