Re: RAID problems

jeff millar (wa1hco@adelphia.net)
Tue, 30 Jul 2002 23:18:58 -0400


----- Original Message -----
From: "Bill Davidsen" <davidsen@tmr.com>
To: "Jakob Oestergaard" <jakob@unthought.net>
Cc: "Kernel mailing list" <linux-kernel@vger.kernel.org>
Sent: Tuesday, July 30, 2002 10:46 PM
Subject: Re: RAID problems

> On Tue, 30 Jul 2002, Jakob Oestergaard wrote:
>
> > On Tue, Jul 30, 2002 at 01:25:37PM -0400, Bill Davidsen wrote:
>
> > > Why does no one seem surprised that one drive failed and the system
> > > marked four bad instead of using the spare in the first place? That's
> > > a more interesting question.
> >
> > As stated in the above URL (no shit, really!), this can happen for
> > example if some device hangs the SCSI bus.
>
> I think you misread my comment; it was not "why doesn't the documentation
> say this" but rather "why does software RAID have this problem?" I know
> this can happen in theory, but it seems that the docs imply that this
> isn't a surprise in practice. I've been running systems with SCSI RAID and
> hardware controllers since 1989 (or maybe 1990); I've got EMC and HDS
> boxes, aaraid and ipc controllers, Promise and CMC(??) on system boards,
> and a number of systems running software RAID. And of all the drive
> failures I've had, exactly one had more than one drive fail, and that was
> in a power {something} issue which took out multiple drives and system
> boards on many systems.
>
> I'm just surprised that the software RAID doesn't have better luck with
> this; I don't see any magic other than maybe a bus reset the firmware
> would be doing, and I'm wondering why this seems to be common with Linux.
> Or am I misreading the frequency with which it happens?
>
> > Did *anyone* read that section?!? ;)
> >
> > If someone feels the explanation there could be better, please just send
> > the better explanation to me and it will get in. Really, this section
> > is one of the few sections that I did *not* update in the HOWTO, because
> > I really felt that it was still both adequate (since no-one has demanded
> > elaboration) and correct.
>
> The words are clear; I'm surprised at the behaviour. Yes, I know that's
> not your thing.
>
> --
> bill davidsen <davidsen@tmr.com>
> CTO, TMR Associates, Inc
> Doing interesting things with little computers since 1979.

In the 3 weeks since installing Linux software RAID-5 (3 x 80 GB), I have had
to reinitialize the RAID devices twice (mkraid --force): once due to an abrupt
shutdown, and once due to some weird ATA/ATAPI/drive problem that caused a
disk to begin "clicking" spasmodically and left the RAID array all out of
whack.
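For context, the array is described by /etc/raidtab; a minimal raidtab for a
3-disk RAID-5 like this might look roughly like the sketch below (the /dev/hd*
device names and the chunk size are just placeholders, not my actual setup):

    raiddev /dev/md0
        raid-level            5
        nr-raid-disks         3
        nr-spare-disks        0
        persistent-superblock 1
        chunk-size            32
        device                /dev/hde1
        raid-disk             0
        device                /dev/hdg1
        raid-disk             1
        device                /dev/hdi1
        raid-disk             2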

Linux software RAID seems very fragile and very scary to recover. I feel a
much stronger need for backups with RAID than without it.

RAID needs an automatic way to maintain device synchronization. Why should
I have to do all of the following by hand (a rough command sketch follows
the list)?
 - manually examine the device data (lsraid)
 - find two devices that match
 - mark the others failed in /etc/raidtab
 - reinitialize the raid devices, putting all data at risk
 - hot-add the "failed" device
 - wait for it to recover (hours)
 - change /etc/raidtab again
 - retest everything
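
Roughly, with raidtools, that sequence looks something like this. This is a
sketch only: the device names are placeholders, and which disk gets marked as
failed-disk depends entirely on what lsraid actually reports.

    # 1. inspect the RAID superblocks to see which disks still agree
    lsraid -a /dev/md0

    # 2. in /etc/raidtab, change the out-of-sync disk's "raid-disk N"
    #    entry to "failed-disk N" so mkraid will leave it out

    # 3. rewrite the array descriptor from the disks that still match
    #    (this is the scary step: all data is at risk here)
    mkraid --force /dev/md0

    # 4. hot-add the disk that was marked failed and let it resync
    raidhotadd /dev/md0 /dev/hde1

    # 5. watch the reconstruction progress (can take hours)
    cat /proc/mdstat

    # 6. once the resync finishes, change failed-disk back to
    #    raid-disk in /etc/raidtab and retest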

This is 10 times worse than e2fsck and much more error-prone. The file
system gurus worked hard on journalling to minimize this kind of risk.

jeff
