Oops from local semaphore race condition

Ron Niles (Ron.Niles@falconstor.com)
Mon, 20 May 2002 20:11:57 -0400


Many places in the kernel (mostly drivers, like in scsi_error.c,
scsi_sleep()) have the following construct:

void function(void)
{
DECLARE_MUTEX_LOCKED(sem);

/* let another thread do the work and up() when done */
notify_handler(&sem);
down(&sem);
}

I have used this construct very often and under stress testing my driver got
a few Oops in __up_wakeup, as if the semaphore went corrupt. Then I realized
it can possibly go corrupt, due to a race condition which lets down()
continue before up() is complete:

down decrements sem->counter to -1
up increments sem->counter to 0
down enters __down, but does not sleep since sem->counter has already been
incremented
up enters __up, tries to wake_up(&sem->wait) but sem is now garbage since
function has exited.

I think this race is possible only on SMP. This problem seems to show up
more if I have heavy irq activity which I guess will interrupt down()
between the --sem->counter and __down(), and also interrupt up() between
++sem->counter and __up().

I am interested in some opinions:
(1) does the following up_atomic() function instead of up() indeed fix the
race condition?
(2) it is worthwile to put it into semaphore.c and semaphore.h?
(3) is it worthwhile to fix the condition in scsi modules and other drivers
that do this?

void up_atomic(struct semaphore *sem)
{
spin_lock_irq(&semaphore_lock);
up(sem);
spin_unlock_irq(&semaphore_lock);
}

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/