>>Do you agree that this is likely to be a kernel problem? Is upgrading
>>the kernel my best course of action?
>>
>
>Almost every other report I have ever seen that looked like that one has always
>turned out to be hardware related. The randomness in paticular tends to be
>a pointer to thinks like cache faults.
>
>You do have ECC main memory which is good.
>
>What other hardware is in the machine ?
>
>
To recap from an earlier email, my nightly build & regression tests (a
roughly 2 hour process involving Sun's JDK, GNU make, gcc, g++, g77 &
Bourne shell scripts) has been failing intermittently on a dual-Xeon
system usually with an "Illegal intruction" signal. I've tried removing
the sound card and disabling the X11 server to avoid loading the nVidia
kernel mod. The intermittent failures disappear when I run non-SMP. I've
tried swapping the processors on the motherboard, and both processors
appear to work fine individually. Most of the boxes I've run on have >=
512MB ECC RAM. I've run Dell's hardware diagnostics (especially the
memory ones) twice. The diagnostics don't seem to have SMP tests where
both CPUs are being stressed.
FYI when I upgraded to the 2.4.18-1smp kernel, the failure rate went
from 20% to 100%. I have tried running the nightly build & regression on
roughly 6 different dual processors Pentium III or better machines
(cylcing it over and over), and they all have intermittent failures of
one kind or another. All these machines are made by Dell, but they
provide some evidence that it is not a hardware problem.
Tom
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/