Re: [BK PATCH 2.5] Introduce 64-bit versions of PAGE_{CACHE_,}{MASK,ALIGN}

Andrea Arcangeli (andrea@suse.de)
Tue, 30 Jul 2002 00:18:52 +0200

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Linus Torvalds: "Re: [PATCH] vfs_read/vfs_write small bug fix (2.5.29)"
Previous message: Hugh Dickins: "[PATCH] vmacct9/9 remove acct arg from do_munmap"

On Mon, Jul 29, 2002 at 02:46:07PM -0700, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > On Mon, Jul 29, 2002 at 02:01:15PM -0700, Andrew Morton wrote:
> > > Andrea Arcangeli wrote:
> > > >
> > > > On Sun, Jul 28, 2002 at 07:05:19PM -0700, Andrew Morton wrote:
> > > > > But yes, all of this is a straight speed/space tradeoff. Probably
> > > > > some of it should be ifdeffed.
> > > >
> > > > I would say so. recalculating page_address in cpu core with no cacheline
> > > > access is one thing, deriving the index is a different thing.
> > > >
> > > > > The cost of the tree walk doesn't worry me much - generally we
> > > > > walk the tree with good locality of reference, so most everything is
> > > > > in cache anyway.
> > > >
> > > > well, the rbtree showedup heavily when it started growing more than a
> > > > few steps, it has less locality of reference though.
> > > >
> > > > > Good luck setting up a testcase which does this ;)
> > > >
> > > > a gigabit will trigger it in a millisecond. of course nobody tested it
> > > > either I guess (I guess not many people tested the 800Gbyte offset
> > > > either in the first place).
> > >
> > > There's still the mempool.
> >
> > that's hiding the problem at the moment, it's global, it doesn't provide
> > any real guarantee.
>
> Sizing the mempool to max_cpus * max tree depth provides a guarantee,
> provided you take care of context switches, which is pretty easy.

I guess I still prefer the GFP_KERNEL fallback because it avoids to
waste/reserve lots of ram, but I only care about correctness, the
current code isn't correct, doing max_cpus * max tree depth would
satisfy me completely too (saving ram is a lower prio), so it's up to
you as far as it cannot fail unless it's truly oom (i.e. you need a
GFP_KERNEL in your way).

>
> > ...
> >
> > so it's not too bad in terms of stack because there's not going to be
> > more than one walk at time, thanks for doing the math btw. You'd
> > basically need a second radix tree for the dirty pages (using the same
> > radix tree is not an option because it would increase pdflush complexity
> > too much with terabytes of clean pages in the tree).
>
> Not sure. If each ratnode has a 64-bit bitmap which represents
> dirty pages if it's a leaf node, or nodes which have dirty pages
> if it's a higher node then the "find the next 16 dirty pages above index
> N" is a pretty efficient thing.

You will have """only""" 18 layers, but scanning through 2**(6*18)
entries will take too long time even if only entry takes 1 nanosecond to
scan. Of course that's the extreme case, but still it should be too much
in practice. I doubt you can avoid at least an additional infrastructure
that tells you if any of the underlying ratnodes has any dirty page,
which will probably save ram at least because it can be coded as a
bitflag in each node, but that will force you an up-walk of the tree
every time you mark a page dirty (but of course also a second tree would
force you to do some tree every time you mark a page dirty/clean). The
second tree probably allows you not to go into the radix-tree
implementation details to provide the "underlying node dirty page" info,
and it would be faster if for example only the start of the inode has
dirty pages, that would allow the dirty page flushing to walk only a few
levels instead of potential 18 of them even to reach the first few
pages. But I don't think it's a common case, so probably the best
(but not simpler) approch is to mark each ratnode with a dirty
cumulative information.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Linus Torvalds: "Re: [PATCH] vfs_read/vfs_write small bug fix (2.5.29)"
Previous message: Hugh Dickins: "[PATCH] vmacct9/9 remove acct arg from do_munmap"