> On Fri, 2002-11-01 at 14:05, Eric W. Biederman wrote:
> > When you have a correctable ECC error on a page you need to rewrite the
> > memory to remove the error. This prevents the correctable error from becoming
> > an uncorrectable error if another bit goes bad. Also, if you have a
> > working software memory scrub routine, you can be certain that multiple
> > errors from the same address are actually distinct, as opposed to
> > multiple reports of the same error.
>
> Note that this area has some extremely "interesting" properties. For one,
> you have to be very careful what operation you use to scrub, and it's
> platform specific. On x86, for example, you want to do something like lock
> addl $0, mem. A simple read/write isn't safe, because if the memory area
> is a DMA target, your read then write just corrupted data and made the
> problem worse, not better!
>
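To make the quoted suggestion concrete, here is a minimal sketch in C of a
scrub done with a locked add of zero. The names scrub_word and scrub_range
are hypothetical, the inline assembly assumes GCC or Clang on x86, and the
buffer is assumed to be already mapped and 4-byte aligned. A real kernel
routine would also have to map the physical page that reported the error.

```c
#include <stddef.h>
#include <stdint.h>

/* Scrub one 32-bit word: atomically add 0, so the value is unchanged
 * but the location is read and written back in one locked operation. */
static inline void scrub_word(volatile uint32_t *p)
{
#if defined(__i386__) || defined(__x86_64__)
    /* The "lock addl $0, mem" form mentioned in the quoted text. */
    __asm__ __volatile__("lock; addl $0, %0" : "+m"(*p) : : "cc", "memory");
#else
    /* Assumption: placeholder only. Other platforms need their own
     * primitive, and a compiler may turn an atomic add of zero into a
     * fenced load, which would not rewrite the memory at all. */
    (void)__atomic_fetch_add(p, 0, __ATOMIC_SEQ_CST);
#endif
}

/* Scrub nwords consecutive 32-bit words starting at start. */
static void scrub_range(volatile uint32_t *start, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        scrub_word(&start[i]);
}
```

The point of the locked add is that the read and the write happen as one
atomic operation, so there is no window in which another writer's data can be
replaced by a stale value, which is the hazard the quoted text describes for a
plain read followed by a write.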
The correctable ECC is supposed to be just that (correctable). It's
supposed to be entirely transparent to the CPU and software. An additional
read of the affected location produces the same correction, so the CPU
will never even know. On x86, the CPU and software are only notified of an
uncorrectable error. I don't know of any SDRAM controller that
generates an interrupt upon a correctable error. Some store "logging"
information internally, which is very difficult to get at on a running
system. Given that, "scrubbing" RAM seems to be somewhat useless on a
running system. The next write to the affected area will fix the
ECC bits; that's what is supposed to clear up the condition.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Bush : The Fourth Reich of America