The clients are ASUS CUR_DLS FC-PGA motherboards with dual 1GHz PIII CPUs,
768MB registered SDRAM, and onboard eepro100.
They boot nfsroot from a 1GHz PIII master running 2.4.6-xfs with NFSv3. The
diskless clients' root is on an ext2 filesystem. The master is stable, but
one or more of the clients will always either lock up or reboot under
moderate load (a kernel compile across all of them, a cfdrc software run...).
I started with 2.4.6, tried 2.4.5, and am now running 2.4.3-12 (Red Hat
sources). 2.4.3-12 seems to be more stable, but some nodes now simply
lock up instead of rebooting, while at other times they reboot. The only
information I can find about the problem is that "APIC error in CPUx
blah(blah)" messages occasionally appear on the console (but not in
syslog, and I'm logging *.*), where blah, blah is replaced by various APIC
error codes (I haven't been able to determine the frequency or pattern).
I have just rebooted them all with "noapic" and am testing again, while
also collecting tcpdump output.
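To pin down the frequency and pattern of those APIC messages, a small
script along these lines could tally them from a captured console log (for
example, one taken over a serial console, since they never reach syslog).
This is only a sketch: the message pattern is an assumption modeled on the
lines described above and needs adjusting to the real console output, the
function names are made up for illustration, and the bit names follow the
local APIC Error Status Register layout for P6-family CPUs.

```python
import re
from collections import Counter

# ASSUMPTION: message shape modeled on "APIC error in CPUx blah(blah)";
# adjust this pattern to whatever the console really prints.
APIC_RE = re.compile(
    r"APIC error (?:on|in) CPU(\d+):?\s*([0-9a-f]{1,2})\(([0-9a-f]{1,2})\)",
    re.IGNORECASE,
)

# Local APIC Error Status Register bits (bit 4 is reserved).
ESR_BITS = {
    0: "send checksum error",
    1: "receive checksum error",
    2: "send accept error",
    3: "receive accept error",
    5: "send illegal vector",
    6: "receive illegal vector",
    7: "illegal register address",
}


def decode_esr(value):
    """Return the names of the ESR bits that are set in value."""
    return [name for bit, name in ESR_BITS.items() if value & (1 << bit)]


def tally_apic_errors(lines):
    """Count each (cpu, first_code, second_code) triple found in lines."""
    counts = Counter()
    for line in lines:
        m = APIC_RE.search(line)
        if m:
            key = (int(m.group(1)), int(m.group(2), 16), int(m.group(3), 16))
            counts[key] += 1
    return counts


def report(lines):
    """Return one summary string per distinct (cpu, codes) combination."""
    out = []
    for (cpu, first, second), n in sorted(tally_apic_errors(lines).items()):
        # Decode the union of both printed codes.
        names = ", ".join(decode_esr(first | second)) or "no known bits"
        out.append(f"CPU{cpu} {first:02x}({second:02x}) x{n}: {names}")
    return out
```

Given a capture saved to a file (the name console.log here is hypothetical),
`print("\n".join(report(open("console.log"))))` would list how often each
CPU/code combination occurred and which error bits were set.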
When the machines lock up, the SysRq key doesn't do anything. Of course,
lkcd doesn't help either.
In the meantime, searching the archives, I can see a few mentions of
similar problems, but I haven't been able to see any threads that reached
a useful conclusion (except for going back to 2.2.x). Is there a previous
discussion that is applicable and I just haven't understood it?
Can anyone suggest what additional information I can gather/provide in
order to debug the problem?
For what it is worth, I have another cluster running the same motherboard
with 2.4.1, local disks (onboard Symbios SCSI controller), and slower
(866MHz) CPUs that doesn't show this problem.