Oops I'll go fix that small detail. It should have been forwarded to me.
> 2) What is causing the hang and are there any hopes to
> fix it in software this time? Last year when I came to the kernel
> list with problems very similar, the consensus was that this
> is actually broken hardware in the OSB4 chipset...but obviously
> it is possible for at least some kernels to run quasi-normally
> on this hardware... what changed between 2.4.9 and 2.4.18 so
> it doesn't anymore?
The code traps out when it sees the I/O complete and it turns out that
the DMA engine flags say the engine is still running. In this state we
kill the box because we know the next I/O will be written 4 bytes skewed
with the last 4 bytes of the previous I/O apparently repeated at the
start.
I took it up with the Serverworks guys at the time, but they were not
able to duplicate the problem and provide advice. Since we could verify
this across an entire rendering farm it was clearly not a weird one off
bug. It also doesn't appear to be a Linux bug (but maybe one day I'll be
proved wrong).
If you drop the drives to MWDMA2 you'll see only slightly lower
performance and solid behaviour
Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/