Well, I just finished an overnight test on the server in question. If it's
an overheating CPU, then I don't understand how replacing the RAM stick
fixes the problem so that it runs without failure for 12 hours. An
overheating RAM stick could do it, too.
In tracing through my collection of computers, I have found others with
this flaky RAM that show intermittent failures using md5sum that other
tests don't find (including kernel compiles). All the failing RAM comes
from the same manufacturer and batch. I think it's just barely marginal
memory, myself. Once I replace it all, I should stop seeing the
every-other-fortnight problems that have plagued me for the past year.
It also means I need to check the junk pile to see if the failed systems
have RAM trouble at their heart.
(N.B.: This little exercise has opened my eyes some little bit. Easter
was a great time to bring down the servers one by one, blow the crap out of
them with compressed air, and check the RAM. And replace questionable
RAM. I'm also going to have a little talk with the computer store that
sold me a "white box" computer with PC66 SDRAM when PC100 SDRAM was called
for.)
Here's how bad some of this stuff was: I took an ASUS P5A-B board with an
AMD K6-II, slowed the bus down to 87 MHz from 100 MHz, and the damn memory
fails at the slower bus speed. Every loose stick from that batch of RAM
shows the failure. (memtest86 caught the errors, by the way, so
diagnostics aren't completely worthless.)
Kingston, Viking, and Micron are getting more of my business.
-- X -> unknown; Spurt -> drip of water under pressure Expert -> X-Spurt -> Unknown drip under pressure.- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/