> On 1 Nov 2002, Alan Cox wrote:
>
> > On Fri, 2002-11-01 at 14:05, Eric W. Biederman wrote:
> > > When you have a correctable ECC error on a page you need to rewrite the
> > > memory to remove the error. This prevents the correctable error from
> becoming
>
> > > an uncorrectable error if another bit goes bad. Also if you have a
> > > working software memory scrub routine you can be certain multiple
> > > errors from the same address are actually distinct. As opposed to
> > > multiple reports of the same error.
> >
> > Note that this area has some extremely "interesting" properties. For one
> > you have to be very careful what operation you use to scrub and its
> > platform specific. On x86 for example you want to do something like lock
> > addl $0, mem. A simple read/write isnt safe because if the memory area
> > is a DMA target your read then write just corrupted data and made the
> > problem worse not better!
yep lock addl $0, mem with the appropriate kmaps so it will work on any system
I use. It isn't rocket science but since it is using kmap_atomic that function
at least should probably get in the kernel.
> The correctable ECC is supposed to be just that (correctable). It's
> supposed to be entirely transparent to the CPU/Software. An additional
> read of the affected error produces the same correction so the CPU
> will never even know. The x86 CPU/Software is only notified on an
> uncorrectable error. I don't know of any SDRAM controller that
> generates an interrupt upon a correctable error. Some store "logging"
> information internally, very difficult to get at on a running system.
Polling the memory controller periodically isn't hard, and you can usually
get an interrupt as well. Though I have not explored the whole interrupt
territory. Finding out when you have a corrected error is extremely useful
as it gives a warning that your memory is going bad. Just like with a disk
getting a bunch of errors means it is time to be replaced, but you still
have a little time left.
> Given that, "scrubbing" RAM seems to be somewhat useless on a
> running system. The next write to the affected area will fix the
> ECC bits, that't what is supposed to clear up the condition.
If it is your kernel text space that is getting the error there will
be no next write.
Beyond that if you are trying to see if the multiple correctable errors
you have are a single error, or an actual problem software scrubbing helps.
Because then you know the second report was because the problem reoccured.
Making it likely you have a bad bit in your DIMM.
Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/