This seems like the right approach to me, apart from the trigger to
do the cross-node rebalance. I don't believe that has anything to do
with when we're internally balanced within a node or not, it's
whether the nodes are balanced relative to each other. I think we should
just check that every N ticks, looking at node load averages, and do
a cross-node rebalance if they're "significantly out".
The definintion of "N ticks" and "significantly out" would be a tunable
number, defined by each platform; roughly speaking, the lower the NUMA
ratio, the lower these numbers would be. That also allows us to wedge
all sorts of smarts in the NUMA rebalance part of the scheduler, such
as moving the tasks with the smallest RSS off node. The NUMA rebalancer
is obviously completely missing from the current implementation, and
I expect we'd use mainly Erich's current code to implement that.
However, it's suprising how well we do with no rebalancer at all,
apart from the exec-time initial load balance code.
Another big advantage of this approach is that it *obviously* changes
nothing at all for standard SMP systems (whereas your current patch does),
so it should be much easier to get it accepted into mainline ....
M.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/