Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]

Andreas Dilger (adilger@clusterfs.com)
Fri, 21 Feb 2003 12:41:09 -0700

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Andrea Arcangeli: "Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]"
Previous message: Thomas Molina: "Re: [PATCH 2.5.x, 2.4.x ...] very small"

On Feb 21, 2003 10:46 +0100, Andrea Arcangeli wrote:
> On Thu, Feb 20, 2003 at 04:15:36PM -0700, Andreas Dilger wrote:
> > What we did was set up a "kmap reservation", which used an atomic_dec()
> > + wait_event() to reschedule the task until it could get enough kmaps
> > to satisfy the request without deadlocking (i.e. exceeding the kmap cap
> > which we conservitavely set at 3/4 of all kmap space).
>
> Your approch was fragile (every arch is free to give you just 1 kmap in
> the pool and you still must not deadlock) and it's not capable of using
> the whole kmap pool at the same time. the only robust and efficient way
> to fix it is the kmap_nonblock IMHO

So (says the person who only ever uses i386 and ia64), does an arch exist
which needs highmem/kmap, but only ever gives 1 kmap in the pool?

> > This works for us because we are the only consumer of huge amounts of kmaps
> > on our systems, but it would be nice to have a generic interface to do that
> > so that multiple apps don't deadlock against each other (e.g. NFS + Lustre).
>
> This isn't the problem, if NFS wouldn't be broken it couldn't deadlock
> against Lustre even with your design (assuming you don't fall in the two
> problems mentioned above). But still your design is more fragile and
> less scalable, especially for a generic implementation where you don't
> know how many pages you'll reserve in mean, and you don't know how many
> kmaps entries the architecture can provide to you. But of course with
> kmap_nonblock you'll have to fallback submitting single pages if it
> fails, it's a bit more difficult but it's more robust and optimized IMHO.

In our case, Lustre (well Portals really, the underlying network protocol)
always knows in advance the number of pages that it will need to kmap
because the client needs to tell the server in advance how much bulk data
is going to send. This is required for being able to do RDMA. It might
be possible to have the server do the transfer in multiple parts if
kmap_nonblock() failed, but that is not how things are currently set up,
which is why we block in advance until we know we can get enough pages.

This is very similar to ext3 journaling, which requests in advance the
maximum number of journal blocks it might need, and blocks until it can
get them all.

The only problem happens when other parts of the kernel start acquiring
multiple kmaps without using the same reservation/accounting system as us.
Each works fine in isolation, but in combination it fails.

Cheers, Andreas

-- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://www-mddsp.enel.ucalgary.ca/People/adilger/

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/

Next message: Andrea Arcangeli: "Re: xdr nfs highmem deadlock fix [Re: filesystem access slowing system to a crawl]"
Previous message: Thomas Molina: "Re: [PATCH 2.5.x, 2.4.x ...] very small"