Re: master.kernel.org situation update

Larry McVoy (lm@bitmover.com)
Tue, 29 Jan 2002 19:11:28 -0800

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Arnaldo Carvalho de Melo: "Re: A modest proposal -- We need a patch penguin"
Previous message: Daniel Phillips: "Re: [PATCH] Radix-tree pagecache for 2.5"

On Tue, Jan 29, 2002 at 06:31:32PM -0800, H. Peter Anvin wrote:
> The situation on master.kernel.org is looking pretty grim. We were
> trying to add disk capacity (the host uses a DAC960PRL RAID
> controller) and the end result seems to be that a Mylex utility called
> "ezsetup" has completely trashed the RAID configuration information.
> What makes matters worse is that an MIS screwup here means no backups
> have been running for a month or so.

This doesn't help you now, but what you just hit is why we take a
different approach to backups. For any data we care about, we
stick in a 3ware 6410 controller, run it in JBOD mode, and have
4 drives mounted as
/home
/nightly
/weekly
/monthly
and we copy all the data once a day to the appropriate spot. On top
of that, we run the gzip checksum over all the data and save a database
of
pathname, size, mtime, chksum

tuples and for all where path, size, mtime match we compare the chksum
which had better be the same, otherwise the disk, filesystem, or memory
has corrupted your data. That way we get warned before all the backups
are gone.

Using a RAID is a losing proposition - it means you still have exactly
one copy of the data and no way to verify that it is correct. The RAID
just does what the fs/block layer tells it to do and if the upper layers
are handing down bad data, the RAID will faithfully store that bad data.
And you never know until you need it.

The other nice thing about the 4 way mirror is that you can do stuff like

diff foo.c /nightly/$PWD

and try and figure out what you were smoking when you made that change
before the coffee started working.

I know this doesn't help right now and is probably unwelcome advice, but
I'd encourage you to consider this approach in the future. It's brute
force but has huge advantages over tapes, RAID, etc. You'd be back on
line right now, albeit with maybe 12 hour old data, if you had this.
We have all our scripts in a BitKeeper repository and I'll happily give
them to you if you want them.

-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/






 Next message: Arnaldo Carvalho de Melo: "Re: A modest proposal -- We need a patch penguin"
 Previous message: Daniel Phillips: "Re: [PATCH] Radix-tree pagecache for 2.5"