Summary: [perf][sdet]sched_best_cpu does not pick best cpu
Kernel Version: 2.6.65
Status: NEW
Severity: normal
Owner: rml@tech9.net
Submitter: habanero@us.ibm.com
Hardware Environment:
pSeries NUMA
Software Environment:
Linux 2.5
Steps to Reproduce:
1.Boot Linux 2.5 kernel with NUMA support
2.Run any workload with fork+exec
Actual Results:
sched_best_cpu does not pick best cpu
Expected Results:
sched_best_cpu does pick best cpu
This occurs when you have NUMA nodes with no cpus, like on p670: node0 has 8
cpus, node1 has 0 cpus, node2 has 8 cpus. sched_best_cpu() uses
node_nr_running, which is initialized with "0" for each node. Since there
are no cpus in node1, no tasks are mograted there, and node_nr_running[1]
does not go above 0. So, with at least one task running, node1 is always
picked as the least loaded node to place a newly exec'd task.
Eventually the parent's task_cpu will be picked, making sched_best_cpu
ineffective.
Also, ideally this code needs some attention in node load. node load is
currently computed by summing up all the running tasks in a node. This
does not compensate for nodes that have different cpu counts, which may be
a likely situation with cpu hot plug, dynamic partitions, etc. This stuff
is probably a bit in the future and not highest priority, be we should
probably think about it.
Anyway, to fix the basic problem, add a new function, node_online, which
functions simialr to cpu_online, and add a statement "if
(!node_online(node) continue;" anywhere we do for(node=0; node<maxnumnodes;
node++)
Node_online for pSereis is easy, use numa_node_exists[] array.
I guess we need a asm-generic one, too.
I have a rough version of this running right now, I'll send a patch asap.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/