On Sun, May 25, 2003 at 12:58:11PM +0200, Stephan von Krawczynski wrote:
> it did not take really long for rc3+aic20030520 to freeze - exactly one day.
Well, in some ways, it will be easier to debug it than when it took 14 days, if
it's the same bug, of course.
> Though I used nmi_watchdog there are no presentable outputs. As I expected the
> screen simply is black and no messages are in any logfiles.
> Again it froze while tar-ing about 80 GB of data onto an aic-driven SDLT. Data
> is coming from IDE drive connected to a 3ware 7500-8 (though no raid
> configuration).
OK, so there's a high probability that the problem is related to either SCSI or
IDE (or both), and less likely implies any other parts.
> Ah yes, one more thing: I can ping the box, but keyboard, mouse, display is
> dead and usually working processes stopped (like snmp).
that's surprizing, mine was completely dead IIRC. It's like it doesn't schedule
anymore but still processes interrupts. I don't know if a deadlock can cause
this behaviour.
> Willy: I am willing to try a serial console setup (as it does not interfere
> with X). I have tried this before with no luck. Can you provide some hints how
> you got that working (yes, I read Documentation/serial-console.txt, but I could
> not manage any output on the serial line).
I had to try several times, because the freeze was so sudden that I often
caught only a few chars. Justin even didn't believe me. First, you have to
check that CONFIG_SERIAL_CONSOLE is enabled. After that, you'll need a remote
console which can work at high speeds (I could get interesting results at 38400
bps). Surprizingly, above I had mangled output. Perhaps my cable wasn't good
enough (flat cisco RJ45 console cable). I also disabled hard and soft flow
control. But as I already stated, in my case it was easier because it froze
every 2-3 boots, and when it didn't I only had to start a "make -j dep" to get
it. So if I got frozen with no messages, I simply hit the reset button and tried
again. It seems more complicated in your case (although your big tar may be
helping).
When your setup seems OK, you should test it to be sure. I often use "mdir"
with nothing in the drive, or AltGr-SysRq-P to get console messages. If you
don't see anything on your serial console, then your setup is not ready yet
for a test.
Oh and by the way, if you're using modules, you may find interesting to keep
copies of lsmod output, and /proc/ksyms to get a more accurate decoding with a
further ksymoops.
If you really cannot catch anything, I suggest one of these solutions :
- apply the netconsole patch and have a linux box on the same lan with the
netconsole server. You can find it in -aa kernels for example.
- apply the kmsgdump patch, only if you have a floppy drive or a parallel
printer. It will try to reset the system after a panic, and use bios calls
to write the kernel messages buffer on the media. This usually works, but
there are some corner cases where it doesn't. But it's easy to try with
AltGr-SysRq-D. Download it from http://w.ods.org/tools/kmsgdump/
Good luck !
Willy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/