> This has shown some good results on our testing [...]
(do you have any numeric results?)
> [...] but it yielded a hang under some circumstances. I think I've
> located the problem.
>
> When you decide to call resched_task(), you're holding a runqueue lock.
> I think this gets us into the same mess the mm stuff did a couple of
> months ago with IPIs.
It's perfectly safe to _generate_ an IPI from under the runqueue lock. We
do it in a number of cases, and did it for years. What precise hardware
did you run this on?
> proc A proc B
>
> grabs rq lock tries to grab rq lock
> rescheds a task on that rq spins with interrupts disabled
> sends IPI to all processors <hangs>
> <hangs>
> releases lock
firstly, the IPI is not sent to all processors, it's sent to a single
target CPU. Secondly, where and why would the hang happen?
we do loop in one place in the x86 APIC code, apic_wait_icr_idle(). I
believe this should never loop indefinitely, unless the hw is broken.
> Holding the lock does two things: it allows us to set p's state to
> RUNNING, and insures our task_running() check is valid for more than an
> instant. In the case of the call to resched_task(), the small window
> between unlocking the runqueue and calling resched_task() should be ok.
> If p is no longer running, there's no real harm done that I can see.
i dont disagree in theory with offloading the SMP-message sending outside
the critical section (it makes things more parallel, and the APIC message
itself goes asynchronously anyway), but i'd like to understand the precise
reason for the hang first. The current code should not hang.
Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/