Re: [PATCH] Radix-tree pagecache for 2.5

Andrea Arcangeli (andrea@suse.de)
Fri, 1 Feb 2002 15:44:33 +0100


On Fri, Feb 01, 2002 at 10:04:50AM +0100, Ingo Molnar wrote:
>
> On Fri, 1 Feb 2002, Anton Blanchard wrote:
>
> > There were a few solutions (from davem and ingo) to allocate a larger
> > hash but with the radix patch we no longer have to worry about this.
>
> there is one big issue we forgot to consider.
>
> in the case of radix trees it's not only search depth that gets worse with
> big files. The thing i'm worried about is the 'big pagecache lock' being
> reintroduced again. If eg. a database application puts lots of data into a
> single file (multiple gigabytes - why not), then the
> mapping->i_shared_lock becomes a 'big pagecache lock' again, causing
> serious SMP contention for even the read() case. Benchmarks show that it's
> the distribution of locks that matters on big boxes.

exactly, this is the same point I raised in a past email. Having per-inode
data structures does not solve the locking completely: a DBMS typically
stores everything in a single file. And of course with a structure like a
radix tree it would be painful to make lookups scale within the same file,
unlike the hashtable, where each bucket is independent of the others.
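
To make the contrast concrete, here is a minimal sketch of the two lookup
paths (not the actual patch code; pagecache_hash_lock[], page_hashfn() and
__find_page_in_bucket() are only illustrative names, and I'm assuming the
per-bucket locking scheme being discussed rather than the current global
pagecache_lock):

	/* hashed pagecache with per-bucket locks: lookups at different
	 * offsets of the same file usually hit different locks */
	struct page *hash_lookup(struct address_space *mapping,
				 unsigned long index)
	{
		unsigned long bucket = page_hashfn(mapping, index);
		struct page *page;

		spin_lock(&pagecache_hash_lock[bucket]);
		page = __find_page_in_bucket(bucket, mapping, index);
		spin_unlock(&pagecache_hash_lock[bucket]);
		return page;
	}

	/* radix-tree pagecache with one lock per mapping: every lookup
	 * in the same (possibly multi-gigabyte) file serializes on
	 * mapping->i_shared_lock */
	struct page *radix_lookup(struct address_space *mapping,
				  unsigned long index)
	{
		struct page *page;

		spin_lock(&mapping->i_shared_lock);
		page = radix_tree_lookup(&mapping->page_tree, index);
		spin_unlock(&mapping->i_shared_lock);
		return page;
	}

With the hash, two CPUs reading different offsets of the same file usually
take different bucket locks; with the per-mapping tree they always bounce
the same lock cacheline between them.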

>
> dbench hides this issue, because it uses many temporary files, so the

Indeed, a lot of workloads would benefit from the per-inode data structure
and locking, but not all of them, and some important ones would not.

> locking overhead is distributed. Would you be willing to run benchmarks
> that measure the scalability of reading from one bigger file, from
> multiple CPUs?

Agreed. Also note that with DaveM's patch applied, sizing the hash properly
so that it has a mean distribution of about one entry per bucket will shrink
the window for spinlock collisions as well, btw.
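
Something along these lines for the sizing (just a sketch of the idea, not
DaveM's actual code; the function name is made up):

	/*
	 * Pick enough buckets that the expected chain length stays
	 * around one page per bucket.
	 */
	unsigned long pagecache_hash_size(unsigned long nr_pages)
	{
		unsigned long buckets = 1;

		/* round up to the next power of two >= nr_pages */
		while (buckets < nr_pages)
			buckets <<= 1;

		/* mean load factor = nr_pages / buckets <= 1 */
		return buckets;
	}

With roughly one page per bucket a lookup almost never walks a chain, and
two CPUs rarely contend on the same bucket lock in the first place.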

>
> with hash based locking, the locking overhead is *always* distributed.
>
> with radix trees the locking overhead is distributed only if multiple
> files are used. With one big file (or a few big files), the i_shared_lock
> will always bounce between CPUs wildly in read() workloads, degrading
> scalability just as much as it is degraded with the pagecache_lock now.
>
> Ingo

Andrea