Re: [STATUS 2.5] October 30, 2002

Richard B. Johnson (root@chaos.analogic.com)
Fri, 1 Nov 2002 12:00:28 -0500 (EST)


On 1 Nov 2002, Alan Cox wrote:

> On Fri, 2002-11-01 at 14:05, Eric W. Biederman wrote:
> > When you have a correctable ECC error on a page you need to rewrite the
> > memory to remove the error. This prevents the correctable error from becoming
> > an uncorrectable error if another bit goes bad. Also if you have a
> > working software memory scrub routine you can be certain multiple
> > errors from the same address are actually distinct. As opposed to
> > multiple reports of the same error.
>
> Note that this area has some extremely "interesting" properties. For one
> you have to be very careful what operation you use to scrub and it's
> platform specific. On x86 for example you want to do something like lock
> addl $0, mem. A simple read/write isn't safe because if the memory area
> is a DMA target your read then write just corrupted data and made the
> problem worse not better!
>
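
For reference, the kind of locked no-op write Alan mentions could be
sketched roughly as below. This is only an illustration under some
assumptions: the names scrub_word() and scrub_range() are made up here,
the 32-bit word granularity is arbitrary, and the non-x86 branch is a
placeholder, since the right operation is platform specific. The intent
is that the read and the write-back happen as one locked operation
rather than as a separate read followed by a write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: atomically add zero to one 32-bit word.
 * The value is unchanged, but the locked read-modify-write makes
 * the CPU write the word back. */
static inline void scrub_word(volatile uint32_t *p)
{
#if defined(__i386__) || defined(__x86_64__)
    /* The x86 form quoted above: lock addl $0, mem */
    __asm__ __volatile__("lock; addl $0, %0"
                         : "+m" (*p)
                         :
                         : "memory", "cc");
#else
    /* Assumption: placeholder for other platforms. A compiler is free
     * to implement this without an actual store, so it may not scrub
     * anything; a real port needs its own platform-specific operation. */
    (void)__atomic_fetch_add(p, 0, __ATOMIC_SEQ_CST);
#endif
}

/* Hypothetical helper: scrub 'words' consecutive 32-bit words
 * starting at 'base'. The contents must be unchanged afterwards. */
static void scrub_range(volatile uint32_t *base, size_t words)
{
    size_t i;

    assert(base != NULL);
    for (i = 0; i < words; i++)
        scrub_word(&base[i]);
}
```

A caller would pass a mapped, 4-byte-aligned region, for example
scrub_range(buf, 1024) for a 4 KB page. How a kernel would map the
affected physical page for this is outside the sketch.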

The correctable ECC is supposed to be just that (correctable). It's
supposed to be entirely transparent to the CPU/Software. An additional
read of the affected location produces the same correction, so the CPU
will never even know. The x86 CPU/Software is only notified on an
uncorrectable error. I don't know of any SDRAM controller that
generates an interrupt upon a correctable error. Some store "logging"
information internally, but it is very difficult to get at on a
running system.

Given that, "scrubbing" RAM seems to be somewhat useless on a
running system. The next write to the affected area will fix the
ECC bits; that's what is supposed to clear up the condition.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Bush : The Fourth Reich of America
