Re: [patch 12/16] fix race between writeback and unlink

Andrew Morton (akpm@zip.com.au)
Tue, 04 Jun 2002 15:08:45 -0700


Linus Torvalds wrote:
>
> On Tue, 4 Jun 2002, Andrew Morton wrote:
> >
> > There's a patch at
> > http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.19-pre10/ext3-reloc-page.patch
> > which provides a simple `relocate page' ioctl for ext3 files.
>
> That's a good start, but before even getting that far there is some need
> for a way to get a picture of the FS layout in a reasonably fs-independent
> way.
>
> Sure, bmap() actually does part of this (the "where are my blocks" part),
> but right now there is no way to query the FS for the "where can I put
> blocks" part.

Jeff Garzik was working on that a while back - a separate filesystem
which provides a "metadata view" of a real filesystem. So you can
poke around and find all these things out. In theory, different
filesystems should be able to offer the same view.
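
For reference, the "where are my blocks" half is already reachable
from userspace through the FIBMAP ioctl, which sits on top of bmap().
A rough sketch, not a finished tool: `file_blocks' and `blocks_needed'
are names made up for this example, the ioctl numbers are the ones
from <linux/fs.h>, and FIBMAP needs CAP_SYS_RAWIO so in practice this
runs as root.

```python
import fcntl
import os
import struct

FIBMAP = 1    # _IO(0x00, 1) in <linux/fs.h>: logical block -> physical block
FIGETBSZ = 2  # _IO(0x00, 2) in <linux/fs.h>: filesystem block size


def blocks_needed(size, blocksize):
    """Number of filesystem blocks covering `size` bytes (rounded up)."""
    return (size + blocksize - 1) // blocksize


def file_blocks(path):
    """Return the physical block number of each logical block of `path`.

    A physical block of 0 means a hole (nothing allocated there).
    Hypothetical helper; requires CAP_SYS_RAWIO for the FIBMAP calls.
    """
    with open(path, "rb") as f:
        fd = f.fileno()
        # Both ioctls read and write a single C int.
        raw = fcntl.ioctl(fd, FIGETBSZ, struct.pack("i", 0))
        blocksize = struct.unpack("i", raw)[0]
        count = blocks_needed(os.fstat(fd).st_size, blocksize)
        result = []
        for logical in range(count):
            raw = fcntl.ioctl(fd, FIBMAP, struct.pack("i", logical))
            result.append(struct.unpack("i", raw)[0])
        return result
```

Doing this for every file is the slow path, and it says nothing about
blocks which don't belong to any file.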

> You can do it with direct disk access and knowledge of the FS internals,
> but it should not be all that hard to add some simple interface to get a
> "block usage byte array" kind of thing (more efficient than doing bmap on
> all files, _and_ can tell about blocks reserved for inodes, superblocks
> and other special uses), which together with a user-level interface to
> "preallocate" and your "relocate page" should actually make it possible to
> make a fairly FS-independent defragmenter.

The e2fsprogs package includes a `libext2fs' library which offers
APIs for accessing the fs internals. It's exactly what you
say - direct disk access and knowledge of internals. So
that plus the try_to_relocate_page() ioctl is a shortest-path
route to a defragmenter for ext3, and only ext3. I wasn't
aiming very high here ;)
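
To make the fs-independent version a bit more concrete, here is a toy
sketch of what a userspace defragmenter could do with a "block usage
byte array" plus a per-file block list. Everything in it is an
assumption for illustration: the encoding (0 = free, nonzero = in use
for any reason), the function names, and the idea that each planned
move would be handed to some relocate call. No such interface exists.

```python
from typing import List, Optional, Sequence, Tuple

FREE = 0  # assumed encoding: 0 = free, nonzero = data, inode table, superblock, ...


def find_free_run(usage: bytes, length: int) -> Optional[int]:
    """Return the first block number that starts `length` free blocks in a row."""
    run_start, run_len = 0, 0
    for blk, state in enumerate(usage):
        if state == FREE:
            if run_len == 0:
                run_start = blk
            run_len += 1
            if run_len == length:
                return run_start
        else:
            run_len = 0
    return None


def is_contiguous(blocks: Sequence[int]) -> bool:
    """True if each block directly follows the previous one on disk."""
    return all(a + 1 == b for a, b in zip(blocks, blocks[1:]))


def plan_relocation(usage: bytes, file_blocks: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Plan moves as (logical index, old block, new block).

    Returns an empty list when the file is empty, already contiguous,
    or when no free run is large enough to hold it.
    """
    if not file_blocks or is_contiguous(file_blocks):
        return []
    start = find_free_run(usage, len(file_blocks))
    if start is None:
        return []
    return [(i, old, start + i) for i, old in enumerate(file_blocks)]
```

With a usage map of bytes([1,1,1,0,1,0,0,0,0,1]) and a two-block file
sitting at blocks 2 and 4, the plan is to move it to blocks 5 and 6.
A real tool would also have to cope with the map going stale while
it works, which is where the preallocate interface comes in.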

A totally different way of performing defrag could be to
copy the entire fs from one partition to a different one,
with kernel support for providing coherency while the copy
is in progress. It's basically a union/translucent mount
with COW. Swizzle the backing blockdev, drop the disk
mappings from all incore pages, renumber the inode without
breaking stuff... (OK, I've talked myself out of it ;/) It's
not super efficient, and it does require the provisioning of a
bounce disk, but it would use infrastructure which would be
useful for other stuff and it is fs-agnostic.
