I've seen three variations of symptoms:
1) Almost complete lockout - machine responds to interrupts (indeed,
it can even complete a TCP connection) but no userspace code gets
executed. Alt-SysRq-* still works, console scrollback does not;
2) Partial lockout - lock_kernel() seems to be getting called without
a corresponding unlock_kernel(). This manifested as programs such
as 'ps' and 'top' getting stuck in kernel space;
3) Unkillable programs - a test program that allocates 512M of memory
and touches every page; running two copies of this simultaneously
repeatedly results in at least one of the copies getting stuck
in 'raid1_alloc_r1bh'.
Symptom number 1 was present in 2.4.2-ac20 as well; symptoms 2 and 3
were observed under 2.4.5-ac13 only. I never get any PANICs, only
these variety of deadlocks. A reboot is the only way to resolve the
problem.
There seem to be two ways to manifest the problem. As alluded to in
(3), running two copies of the memory eater simultaneously along with
calls to 'ps' and 'top' trigger the bug fairly quickly (within a minute
or two). Another method to manifest the problem is to run multiple
copies of this script (I run 10 simultaneous copies):
#!/bin/sh
while /bin/true; do
ssh remote-machine 'sleep 1'
done
This script causes (1) in about a day or two.
If anyone has any suggestions about how to proceed to figure out what
the problem is (or if there is already a fix), please let me know.
I would be more than willing to provide a wide range of cooperation on
this problem. I don't have a feel for where to go from here, and I'm
hoping that someone with more experience can give me some
assistance..
-Bob
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/