Re: [PATCH] migration thread fix

Erich Focht (focht@ess.nec.de)
Fri, 19 Apr 2002 00:24:14 +0200 (CEST)


Bill,

thanks for sending me your patch. No, I didn't see it before.

rl> Eric, I am interested in your opinion of wli's patch, too. I really
rl> liked his approach.

Hmm, as far as I can see, it probably fixes the problems but keeps the
original philosophy of depending on the load balancer to get the
migration tasks scheduled on every CPU.

rl> You seem to remove a lot of code since, after starting the first thread,
rl> you rely on set_cpus_allowed and the existing migration_thread to push
rl> the task to the correct place. I suppose this will work .. but it may
rl> depend implicitly on behavior of the migration_threads and load_balance
rl> (not that the current code doesn't rely on load_balance - it does).

The first thread starts up on the boot CPU with its cpus_allowed mask set
to the boot CPU. It writes itself into cpu_rq(0)->migration_thread
and becomes fully functional, because all other existing threads (and
migration_init) call yield() or schedule(). The other
migration threads can also run only on CPU#0, and they do so once
migration_CPU0 is completely initialized. Then they move themselves
to their designated CPU by calling set_cpus_allowed(). Since their initial
CPU mask is 1<<0, that request is handled by migration_CPU0, which is fully
functional at this point and simply does the job it was designed
for. There is no dependency on the load balancer any more.
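To make the ordering concrete, here is a rough C sketch of that startup
sequence. It is not the actual patch: helper names like migration_thread(),
cpu_rq(), set_cpus_allowed() and yield() follow the 2.5 O(1) scheduler, but
the argument types and the synchronization details are simplified guesses.

/*
 * Rough sketch of the startup order described above -- not the actual
 * patch.  Helper names follow the 2.5-era O(1) scheduler; details are
 * simplified.
 */
static int migration_thread(void *bind_cpu)
{
	int cpu = (long) bind_cpu;

	/* Every migration thread is created with cpus_allowed = 1UL << 0,
	 * i.e. it can initially run only on the boot CPU. */

	if (cpu == 0) {
		/* migration_CPU0 registers itself and is fully functional
		 * from here on. */
		cpu_rq(0)->migration_thread = current;
	} else {
		/* Wait (by yielding) until migration_CPU0 is up ... */
		while (!cpu_rq(0)->migration_thread)
			yield();
		/* ... then let migration_CPU0 push us to our own CPU. */
		set_cpus_allowed(current, 1UL << cpu);
		cpu_rq(cpu)->migration_thread = current;
	}

	for (;;) {
		/* serve migration requests for this CPU's runqueue */
	}
}

The point of the sketch is only the ordering: every secondary thread is
confined to CPU#0 until migration_CPU0 is registered, so the load balancer
never has to place anything.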

rl> What happens if a migration_thread comes up on a CPU without a migration
rl> thread and then you call set_cpus_allowed?

It doesn't, because the initial cpus_allowed mask is 0x0001, so every
migration thread starts out on CPU#0, where migration_CPU0 is already running.

rl> I am also curious what causes #1 you mention. Do you see it in the
rl> _normal_ code or just with your patch? I cannot see what we race
rl> against wrt interrupts ... disabling interrupts, however, would disable
rl> load_balance and that is a potential pitfall with using
rl> migration_threads to migrate migration_threads as noted above.

I saw problem #1 when testing an interface to
userspace which migrates tasks from one CPU to another. It happens
pretty easily that the timer interrupt occurs while the migration
thread is doing its job and holds the current runqueue
lock. scheduler_tick() then spins on that same rq->lock, which is never
released because the migration thread holding it can no longer run.
Maybe this occurs rather seldom on IA32 with 10ms ticks; on IA64 it is
easy to reproduce.
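For illustration, a simplified sketch of the locking pattern involved,
assuming the O(1) scheduler's per-runqueue lock; migrate_one_task() and its
body are made up, only the irq handling around rq->lock matters:

/* Simplified illustration of problem #1; migrate_one_task() is a
 * hypothetical name, only the locking pattern is the point. */
static void migrate_one_task(void)
{
	runqueue_t *rq = this_rq();
	unsigned long flags;

	/*
	 * A plain spin_lock(&rq->lock) is not enough here: if the timer
	 * interrupt fires on this CPU while we hold the lock,
	 * scheduler_tick() spins on the same rq->lock, and since we can
	 * never run again to release it, the CPU deadlocks.
	 */
	spin_lock_irqsave(&rq->lock, flags);
	/* ... move the task between runqueues ... */
	spin_unlock_irqrestore(&rq->lock, flags);
}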

Best regards,
Erich
