Re: [Ext2-devel] disk throughput

Alexander Viro (viro@math.psu.edu)
Mon, 5 Nov 2001 18:36:09 -0500 (EST)


On Mon, 5 Nov 2001, Linus Torvalds wrote:

> I don't particularly like behaviour that changes over time, so I would
> much rather just state clearly that the current inode allocation
> strategy is obviously complete crap. Proof: simple real-world
> benchmarks, along with some trivial thinking about seek latencies.
>
> In particular, the way it works now, it will on purpose try to spread
> out inodes over the whole disk. Every new directory will be allocated in
> the group that has the most free inodes, which obviously on average
> means that you try to fill up all groups equally.

> Which makes _no_ sense. There is no advantage to trying to spread things
> out, only clear disadvantages.

Wrong. Trivial example: create skeleton homedirs for 50 new users.
You _really_ don't want all of them in one cylinder group, because they
will slowly fill up with files while the directory structure is very
likely to stay more or less stable. You want the preferred group for a
file's inode to be the same as its parent directory's. _And_ you want
data close to the inode if we can afford that. Worse yet, for data
allocation we use a quadratic hash, which works nicely _unless_ the
starting points for all of them sit in the same group.
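
To make that concrete, here's a rough userspace sketch of the policy in
question, loosely modeled on find_group_dir()/find_group_other() in
fs/ext2/ialloc.c. The structures and counters below are simplified
stand-ins, not the real ones:

#include <stdio.h>

#define NGROUPS 16

static struct { int free_inodes; } groups[NGROUPS];

/* New directory: spread out - take the group with the most free inodes. */
static int find_group_dir(void)
{
	int g, best = -1;

	for (g = 0; g < NGROUPS; g++)
		if (groups[g].free_inodes > 0 &&
		    (best < 0 ||
		     groups[g].free_inodes > groups[best].free_inodes))
			best = g;
	return best;
}

/*
 * New file: keep locality - start at the parent directory's group and
 * probe quadratically (+1, +2, +4, ...) if it is full.  If all the
 * starting points sit in one group, every allocation walks the same
 * probe sequence and piles up - the collision described above.
 */
static int find_group_other(int parent_group)
{
	int g = parent_group, i;

	if (groups[g].free_inodes > 0)
		return g;
	for (i = 1; i < NGROUPS; i <<= 1) {
		g = (g + i) % NGROUPS;
		if (groups[g].free_inodes > 0)
			return g;
	}
	for (g = 0; g < NGROUPS; g++)	/* last resort: linear scan */
		if (groups[g].free_inodes > 0)
			return g;
	return -1;
}

int main(void)
{
	int g, i;

	for (g = 0; g < NGROUPS; g++)
		groups[g].free_inodes = 128;

	/* 50 skeleton homedirs: watch them spread across the groups. */
	for (i = 0; i < 50; i++) {
		g = find_group_dir();
		groups[g].free_inodes--;
		printf("home%02d -> group %d\n", i, g);
	}

	/* Files created later stay near their parent (home00 is in group 0). */
	for (i = 0; i < 3; i++) {
		g = find_group_other(0);
		groups[g].free_inodes--;
		printf("home00/file%d -> group %d\n", i, g);
	}
	return 0;
}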

See where it's going? The real issue is the ratio between the
frequencies of directory and file creation. The "time-dependent" part
is ugly, but the thing it tries to address is very, very real. The
allocation policy for a tree created all at once is different from the
allocation policy for normal use.
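
Purely illustrative - nothing like this is in any tree, and the
threshold is made up - but one way to make that ratio concrete,
reusing the two helpers from the sketch above:

static int recent_dirs, recent_files;	/* bumped on each mkdir()/creat() */

/*
 * Hypothetical adaptive policy: a burst of mkdir()s with few file
 * creations looks like "fast growth" (tar x, cvs co), so keep new
 * directories near the parent; otherwise assume slow growth and
 * spread them out as before.
 */
static int find_group_dir_adaptive(int parent_group)
{
	if (recent_files < 4 * recent_dirs)
		return find_group_other(parent_group);	/* cluster */
	return find_group_dir();			/* spread */
}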

Ideally we would need to predict how many (and how large) files will
go into a directory. We can't - we have no time machines. But the
heuristic you've mentioned is clearly broken: it will end up with
mostly empty trees squeezed into a single cylinder group, and when
they start to get populated that will be pure hell.

And yes, it's a more than realistic scenario. Your strategy would make
sense if all directories were created by untarring a large archive.
That may be fairly accurate for your boxen (or mine, for that matter -
most of the time), but it's not universal.

Benchmarks that try to stress that code tend to be something like
cvs co, tar x, yadda, yadda. _All_ of them deal only with the
"fast-growth" pattern. And yes, the FFS inode allocator sucks for that
scenario - no arguments here. Unfortunately, the variant you propose
will suck for the slow-growth one, and that is going to hurt a lot.

The fact that Linux has become a huge directory tree means that we
tend to deal with the fast-growth scenario quite often. Not everyone
works on the kernel, though ;-)
