I've been trying to track down why i seem to get disk corruption on my
harddrives after some good amount of usage all the time. It's been
happening for a long time across a number of different kernel versions.
I believe this is because i stick to the same board manufacturer, Abit
and use via chipsets.
I ran cerberus with dma enabled at UDMA4 and UDMA2, at udma4 cerberus
reports MEMORY errors and BBidehost2bus0target0lun0discN1 errors, but
mostly MEMORY errors before the kernel panics after a minute or two. At
udma2 the cerberus reports no errors but panics after a minute or two.
I ran cerberus a couple times on each, with UDMA4 it began to error
about 30 seconds into the test with MEMORY errors.
I thought, well this could be ram errors, so i ran memtest for a couple
hours. Nothing reported as being bad. I then thought, my hardware
could be the problem, so I ran e2fsck -c on the partition I was running
cerberus on with dma disabled via hdparm -d0 and it completed with no
errors found. I then rebooted, enabled udma2 and the kernel panic'd
with the same test after a few minutes.
The rest of this email is just information regarding the setup
First off the way my fs's are setup are as follows:
swap + files are now all on my primary master ide drive on the
motherboard ide controller. Swap on my primary master promise controller
seemed too problematic because of corruption, but i'm not sure if the
corruption i've seen is related only to the promise controller or if
it's not controller specific. I'll have to run the test without swap on
the promise drive and then run the test on my primary motherboard hdd
and again without swap.
cerberus version : 1.3.0pre4
dmesg info : http://signal-lost.homeip.net/lkml/dmesg
hdparm info : http://signal-lost.homeip.net/lkml/hdparm
pci info : http://signal-lost.homeip.net/lkml/lspci
tests completed before escaping in pio mode:
http://signal-lost.homeip.net/lkml/tests_passed
Errors during last test that caused kernel panic (udma2)
http://signal-lost.homeip.net/lkml/memory
Errors during test of udma4 (first test)
http://signal-lost.homeip.net/lkml/memory2
http://signal-lost.homeip.net/lkml/dmesg2
various segfaults of badblocks of BBidehost tests.
I ran memtest for an extensive amount of time after the first test
reported memory errors and go absolutely no errors (wasn't using dma
mode at the time either). And since these errors aren't produced when
not using DMA on my drives I find it very unlikely that it's "System
Ram" as the cause of them. I'm going to rerun the test on my
motherboard primary drive after posting this in case something happens
and i hose everything.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/