I agree that the race is small. I 'found' this while playing
with some experimental code that does the same thing as the
fast path in set_cpus_allowed: it sets the cpu field while
holding the rq lock if the task is not on the rq. This code
runs at a much higher frequency than would be expected of
set_cpus_allowed, which is how it found this hole.
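For reference, the fast path I am mimicking looks roughly like
this. (Paraphrased from memory, not quoted from the tree, so the
exact field names and lock helpers may be slightly off.)

void set_cpus_allowed(task_t *p, unsigned long new_mask)
{
	unsigned long flags;
	runqueue_t *rq;

	rq = task_rq_lock(p, &flags);
	p->cpus_allowed = new_mask;

	/*
	 * Fast path: if the task is not on a runqueue, just point
	 * its cpu field at an allowed CPU.  Note that "not on the
	 * rq" is silently assumed to also mean "not running",
	 * which is exactly the hole described above.
	 */
	if (!p->array) {
		p->cpu = __ffs(p->cpus_allowed);
		task_rq_unlock(rq, &flags);
		return;
	}

	/* slow path (migrate via the migration thread) not shown */
	task_rq_unlock(rq, &flags);
}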
> > My first thought was to simply add a check to the
> > above code to ensure that p was not currently running
> > (p != rq->curr). However, even with this change I
> > 'think' the same problem exists later on in schedule()
> > where we drop the runqueue lock before doing a context
> > switch. At this point, p is not on the runqueue and
> > p != rq->curr, yet we are still running in the context
> > of p until we actually do the context switch. To tell
> > the truth, I'm not exactly sure what 'rq->frozen' lock
> > is for. Also, the routine 'schedule_tail' does not
> > appear to be used anywhere.
>
> I agree here, too.
>
> Fyi, schedule_tail is called from assembly code on SMP machines. See
> entry.S
>
> rq->frozen is admittedly a bit confusing. Dave Miller added it - on
> some architectures mm->page_table_lock is grabbed during switch_mm().
> Additionally, swap_out() and others grab page_table_lock during wakeups
> which also grab rq->lock. Bad locking... Dave's solution was to make
> another lock, the "frozen state lock", which is held around context
> switches. This way we can protect the "not switched out yet" task
> without grabbing the whole runqueue lock.
>
> Anyhow, with this issue, I guess we need to fix it... I'll send a patch
> to Linus.
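To make the window concrete, the tail of schedule() reads roughly
like this as I understand it (paraphrased, so the exact lock and
barrier calls may not match the current tree):

	if (prev != next) {
		rq->nr_switches++;
		rq->curr = next;

		spin_lock(&rq->frozen);
		spin_unlock(&rq->lock);

		context_switch(prev, next);

		/*
		 * When this task is scheduled back in it may be on
		 * a different CPU, so the local rq pointer may be
		 * stale; re-load it before dropping the frozen lock.
		 */
		mb();
		rq = this_rq();
		spin_unlock_irq(&rq->frozen);
	} else {
		spin_unlock_irq(&rq->lock);
	}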
What about the code in schedule before calling context_switch?
Consider the case where the task relinquishing the CPU is still
runnable. Before dropping the rq lock, we set rq->curr = next
(where next != current). After dropping the rq lock, can't we
race with the code in load_balance? load_balance will steal
'runnable' tasks as long as the task is not the one pointed to
by rq->curr. Therefore, theoretically, load_balance could steal
a currently running task.
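The check I am worried about is (again, paraphrased) the test
load_balance uses to decide whether a task may be pulled from
the busiest runqueue:

/*
 * Paraphrased; the real check also considers cache-decay time.
 * The only thing protecting a task that is still physically
 * running on the remote CPU is the rq->curr comparison, and
 * schedule() has already set rq->curr = next before dropping
 * the lock.
 */
#define CAN_MIGRATE_TASK(p, rq, this_cpu)			\
	(((p) != (rq)->curr) &&					\
	 ((p)->cpus_allowed & (1UL << (this_cpu))))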
Now, the window for this race is only from the time we drop the
rq lock to the time we do the context switch. During this time
we are running with interrupts disabled, so nothing should delay
us. In addition, the code path on the racing CPU would be much
longer: it has to move the task via load_balance and then
schedule it on another CPU. I believe there is a race, but it is
a race we will never lose. Does that make it a race at all? :)
In the old scheduler both tasks (prev and next) were marked as
'have_cpu' during the context switch. As such they were 'hands
off' to other CPUs. Should we implement something like this
here?
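Something like the fragment below is what I have in mind. This
is purely illustrative: 'cpu_owned' is a made-up field standing
in for the old has_cpu idea, and the schedule_tail()/stack-switch
and irq-enable details are glossed over.

	/* in schedule(), before dropping the rq lock: */
	prev->cpu_owned = 1;	/* still on this CPU */
	next->cpu_owned = 1;	/* stays set while it runs */
	rq->curr = next;
	spin_unlock(&rq->lock);
	context_switch(prev, next);

	/* once the switch has completed (e.g. in schedule_tail): */
	prev->cpu_owned = 0;	/* prev may be stolen again */

	/* in load_balance(), while scanning the source runqueue: */
	if (tmp->cpu_owned)
		continue;	/* hands off: running or mid-switch */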
-- Mike