Because the code is correct ;). It is in fact a fix, and it took some time
to fix that bug in mid 2.2.x.
>
> Anyway, the first problem which HJ pointed out is in
> try_to_free_inodes() which attempts to implement a mutual exclusion with
> respect to itself as follows....
>
> if (block)
> {
> struct wait_queue __wait;
>
> __wait.task = current;
> add_wait_queue(&wait_inode_freeing, &__wait);
> for (;;)
> {
> /* NOTE: we rely only on the inode_lock to be sure
> to not miss the unblock event */
> current->state = TASK_UNINTERRUPTIBLE;
> spin_unlock(&inode_lock);
^^^^^^^^^^^^^^^^^^^^^^^^
> schedule();
> spin_lock(&inode_lock);
^^^^^^^^^^^^^^^^^^^^^^^^
> if (!block)
> break;
> }
> remove_wait_queue(&wait_inode_freeing, &__wait);
> current->state = TASK_RUNNING;
> }
> block = 1;
>
> Of course, this is utterly unsafe on an SMP machine, since access to
> the "block" variable isn't protected at all. So the first question is
Wrong: it's obviously protected by the inode_lock. And even if it weren't,
in 2.2.x inode.c and dcache.c run globally serialized by the BKL (but it is
obviously protected regardless of the BKL).
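For reference, the unblock side pairs with the quoted wait loop under the
same lock; a minimal sketch of how the flag gets cleared (reconstructed from
the loop above, not copied from 2.2.x fs/inode.c):

	/* still under inode_lock: clearing the flag and waking the
	   waiters while holding the same spinlock is exactly why the
	   plain "block" variable is safe on SMP */
	block = 0;
	wake_up(&wait_inode_freeing);
	spin_unlock(&inode_lock);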
> why did whoever implemented this do it in this horribly complicated way
> above, instead of something simple, say like this:
>
> static struct semaphore block = MUTEX;
> if (down_trylock(&block)) {
> spin_unlock(&inode_lock);
> down(&block);
> spin_lock(&inode_lock);
> }
The above is overkill: there's no need to use a further atomic API when we
can rely on the inode_lock for the locking. It's overcomplex and slower.
> (with the appropriate unlocking code at the end of the function).
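For completeness, the unlocking code meant there would presumably be a
single up() before returning; a sketch of the idea, not code from any tree:

	/* end of try_to_free_inodes() in the semaphore variant:
	   release the mutex so the next caller can enter */
	up(&block);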
>
>
> Next question.... why was this there in the first place? After all,
To fix the "inode-max limit reached" faliures that you could reproduce on
earlier 2.2.x. (the freed inodes was re-used before the task that freed them
had a chance to allocate them for itself)
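The race, schematically (an illustrative interleaving, not the exact 2.2.x
call names):

	/*
	 * task A                          task B
	 * ------                          ------
	 * frees some inodes to make
	 * room for its own allocation
	 * (drops inode_lock to sleep)
	 *                                 allocates the freshly
	 *                                 freed inodes for itself
	 * retakes inode_lock, finds
	 * nothing free again
	 * -> "inode-max limit reached"
	 */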
> most of the time try_to_free_inodes() is called with the inode_lock
> spinlock held, which would act as an automatic mutual exclusion anyway.
> The only time this doesn't happen is when we call prune_dcache(), where
> inode_lock is temporarily dropped.
Wrong:
	spin_unlock(&inode_lock);
	prune_dcache(0, goal);
	spin_lock(&inode_lock);
	sync_all_inodes();
	__free_inodes(&throw_away);
The above code obviously drops the spinlock to do things like flushing the
dirty inodes to the buffer cache, which can block in balance_dirty() etc.
That's the real problem: because it sleeps, the BKL gets released as well
(while prune_dcache in practice could not race, because of the BKL).
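Sleeping releasing the BKL is the standard 2.2.x semantics: the big kernel
lock is dropped and re-taken by the scheduler around a context switch, so
any code that blocks while holding it lets other lock_kernel() users run in
the meantime. A minimal sketch of the effect (illustrative, not a quote from
the tree):

	lock_kernel();
	...
	/* schedule() transparently drops the BKL for the duration of
	   the sleep and re-takes it before returning to us, so another
	   lock_kernel() holder can run while we block */
	schedule();
	...
	unlock_kernel();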
> So I took a look at prune_dcache(), and discovered that (a) it's called
> from multiple places, and (b) it and shrink_dcache_sb() both iterate over
> dentry_unused and, among other things, try to free dcache structures
> without any kind of locking to prevent two kernel threads
> from mucking with the contents of dentry_unused at the same time, and
> possibly having prune_one_dentry() being called by two processes on the
> same dentry structure. This should almost certainly cause problems.
We're running under the BKL all over the place in 2.2.x, so they can't race.
> So the following patch I think is definitely necessary, assuming that
The patch is definitely not necessary.
Andrea