You really cannot say that bad memory is involved ONLY when fsck's fail and
kernel compiled fail. No way.
Think about it: that failing bit can well be in a place that kernel never
touches, and gcc usually does not touch. Moreover the bit flip usually does
not happen every time; you have to stress the memory for hours and sometimes
use a specific bit pattern to trigger the problem.
I had one machine that compiled kernel just fine, and ran pretty smoothly
overall, but experienced weird hickups like dying apps, failing oracle
install etc. Not too much though, I was attributing them to buggy software.
Then I tried to take a large backup. Bzip failed (internal error) one third
of the time, and once produced a different result. I quickly hacked up an
user space memory tester, and sure enough it reported an error after five
hours. (The machine was already in production, so I couldn't just take it
down and launch memtest86.) I verified the problem with memtest86 during the
next night, and applied the marvellous badmem patch to the kernel. After
marking the problematic place unuable, all problem disappeared. I just lost
2MB out of 256.
What I learned is that spotting meory error can be difficult, and the
symptoms can be stealthy.
-- v --
v@iki.fi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/