RE: [STATUS 2.5] October 30, 2002

Ed Vance (EdV@macrolink.com)
Fri, 1 Nov 2002 10:17:45 -0800


On Fri, November 01, 2002 at 9:00 AM, Richard B. Johnson wrote:
> [...]
> The correctable ECC is supposed to be just that (correctable). It's
> supposed to be entirely transparent to the CPU/Software. An additional
> read of the affected error produces the same correction so the CPU
> will never even know. The x86 CPU/Software is only notified on an
> uncorrectable error. I don't know of any SDRAM controller that
> generates an interrupt upon a correctable error. Some store "logging"
> information internally, very difficult to get at on a running system.
>
> Given that, "scrubbing" RAM seems to be somewhat useless on a
> running system. The next write to the affected area will fix the
> ECC bits, that's what is supposed to clear up the condition.
>

Scrubbing has nothing whatever to do with reporting of correctable errors to
the CPU, even if it does the scrubbing.

Scrubbing does not happen on the basis of chance detection of correctable
errors from normal activity, because that would sometimes be too late.
Remember, the hardware only finds out about an error when the word is
accessed. There is no detection of the bit cell getting its charge altered,
and the errors are cumulative between corrections.

Scrubbing is intended to lower the probability that any given memory word
will be hit by a second error causing event (such as an alpha particle
emitted from a ceramic case) without having been accessed and corrected. The
scrub just continuously rolls through all of physical memory (at low
priority) again and again doing whatever level of access is necessary to
cause correction. This limits the maximum time between correction of any
memory word. Some memory systems automatically correct and rewrite
(atomically) on a read of a word with a single bit error. Some mainframe
memory systems do the whole ECC scrub/correction operation in hardware,
simultaneously in each bank.

The primary benefit of logging is to catch deteriorating memory cells during
periodic maintenance that either do not correct at all (single stuck bit,
single hits become uncorrectable) or that repeatedly fail over time, perhaps
due to charge leaks from long term diffusion of contaminants.

Cheers,
Ed
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/