We do that already. In fact, tasks always stay where they were forked unless
they're explicitly migrated (in both the NUMA and SMP cases) by the rebalancer.
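Roughly this shape (toy sketch only, the struct and helper names are made
up for illustration, not the real scheduler code): the child just lands on
the parent's runqueue at fork time, so it inherits the node, and nothing
moves it cross-node until the rebalancer decides to.

/* Simplified model, not the actual kernel structures. */
struct rq { int cpu; int node; int nr_running; };
struct task { struct rq *rq; };

/* fork path: child starts on the parent's runqueue, hence its node */
static void sketch_fork_placement(struct task *parent, struct task *child)
{
	child->rq = parent->rq;		/* same CPU, same node */
	child->rq->nr_running++;	/* queued locally, no migration */
}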
> I think the ideal semantics would probably be something along the lines of:
>
> - a newly fork()ed thread executes on the same node as the creating thread
> - calling exec() sets a "feel free to shuffle me elsewhere" flag
> - threads are otherwise only shuffled to other nodes when a certain load ratio
> is exceeded (current-node:idle-node)
Read the code. It does pretty much exactly this already ;-)
Especially look at sched_balance_exec() and balance_node().
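The rough shape of the two, if you don't want to dig through the source
right away (heavily simplified sketch, made-up helpers, not the real
implementation): exec() is the "feel free to shuffle me" point, so we pick
the emptiest node there; the periodic cross-node balancer only acts once
the imbalance is clearly past a threshold.

#define NR_NODES 4

/* toy per-node load: just a runnable-task count per node */
static int node_nr_running[NR_NODES];

static int node_load(int node)
{
	return node_nr_running[node];
}

/* exec(): the task has no cache or memory footprint yet, so this is
 * the cheap moment to move it to the least loaded node */
static int sketch_sched_balance_exec(int this_node)
{
	int node, best = this_node;

	for (node = 0; node < NR_NODES; node++)
		if (node_load(node) < node_load(best))
			best = node;

	return best;	/* caller migrates if best != this_node */
}

/* periodic cross-node balance: steal only if the busiest node is
 * more than 25% ahead of us (threshold purely illustrative) */
static int sketch_balance_node(int this_node)
{
	int node, busiest = this_node;

	for (node = 0; node < NR_NODES; node++)
		if (node_load(node) > node_load(busiest))
			busiest = node;

	return node_load(busiest) * 4 > node_load(this_node) * 5
		? busiest : -1;
}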
> Unfortunately the whole idea would seem to fall apart in the case of a
> fast-spawning thread pool type load. Perhaps there's a way to handle that
> automatically, or perhaps it would best be left as a scheduler tunable,
> I don't know.
It does, kind of ... in fact they all stay on the same runqueue. We need
to train the NUMA rebalancer to move multiple tasks in the case of heavy
imbalances, but it needs some care to avoid migrating short load spikes.
Moving from nr_running snapshots to load averages will probably fix it.
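Something along these lines (again just a sketch with invented names, not
a patch): keep a decayed load average per runqueue instead of sampling
nr_running, so a burst of short-lived threads decays away before the
cross-node balancer ever sees a "lasting" imbalance.

struct rq_sketch {
	int nr_running;		/* instantaneous runnable count */
	long load_avg;		/* fixed-point (<< 10) decayed average */
};

/* called every tick: avg moves 1/8th toward the current value */
static void sketch_update_load_avg(struct rq_sketch *rq)
{
	long now = (long)rq->nr_running << 10;

	rq->load_avg += (now - rq->load_avg) / 8;
}

/* balance on the smoothed figure, not the snapshot, so short load
 * spikes don't trigger cross-node migration */
static int sketch_should_pull(struct rq_sketch *busiest,
			      struct rq_sketch *local)
{
	return busiest->load_avg * 4 > local->load_avg * 5;
}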
> I seem to recall SGI found great benefit in writing the scheduler in IRIX to
> work somewhat like this, though the loads on most SGI machines tend to be slow
> spawning, long running and big memory - the textbook case for reduced node
> shuffling. Linux would tend to have a much greater variety of load profiles,
> some of which would be less pleasantly affected.
Right - that's why it's hard ;-) Fixing any one case is easy ... fixing
all of them simultaneously is extremely difficult ;-)
M.