Thanks for the many experiments, they help give some more insight. It's
quite clear that we'll have to tune the knobs a bit to make this fit
well.
First, a comment on the "target" of the tuning:
> 26 0.0 0.0 100.0 0.0 | 2 2 | 56.28
> 27 15.3 0.0 0.0 84.7 | 3 3 | 57.12
> 28 100.0 0.0 0.0 0.0 | 0 0 | 53.85
> 29 32.7 67.2 0.0 0.0 | 0 1 *| 62.66
> 30 100.0 0.0 0.0 0.0 | 0 0 | 53.86
> 31 100.0 0.0 0.0 0.0 | 0 0 | 53.94
> 32 100.0 0.0 0.0 0.0 | 0 0 | 55.36
> AverageUserTime 56.38 seconds
> ElapsedTime 124.09
> TotalUserTime 1804.67
> TotalSysTime 7.71
>
> Ideally, there would be nothing but 100.0 in all non-zero entries.
> I'll try adding in the 05 patch, and if that does not help, will
> try a few other adjustments.
The numa_test is written such that you MUST get numbers below 100%
with this kind of scheduler! There is a "disturbing" process started 3
seconds after the parallel tasks are up and running; its aim is to
"shake" the system a bit, and the scheduler ends up with some of the
parallel tasks moved away from their homenode. This is a realistic
situation for long-running parallel background jobs. In the min-sched
case this cannot happen, which is why you see 100% all over!
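Just to picture the setup: the scheme is roughly the following (only a
sketch of the idea, not the numa_test source; the "hackbench 20"
command is taken from the log further down):

	#include <stdlib.h>
	#include <unistd.h>

	int main(void)
	{
		/* the parallel ./randupdt jobs are already running and
		 * have settled on their homenodes */
		sleep(3);
		/* kick off the disturbing background load, forcing the
		 * scheduler to move some tasks around */
		system("hackbench 20");
		return 0;
	}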
Raising the barrier for moving a task away from its node further and
further won't necessarily help the scheduler, as we just end up with
the min-sched limitations. The aim will be to introduce node affinity
for processes such that they can return to their homenode when the
exceptional load situation is gone. We should not try to reach the
min-sched numbers without having node-affine tasks.
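To make the node affinity idea a bit more concrete, here is a very
rough sketch (nothing of this exists yet; the field and helper names
are invented): the homenode chosen at initial balancing would be
remembered in the task_struct, and when stealing, tasks that would
thereby return home would be preferred:

	/* in task_struct: int node;  -- homenode set at initial balancing */

	/* when picking a task to steal in load_balance(), prefer one
	 * whose homenode is the stealing CPU's node, so tasks pushed
	 * away during a load peak drift back home afterwards */
	static inline int task_returns_home(task_t *p, int this_node)
	{
		return p->node == this_node;
	}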
> * Make the check in find_busiest_node into a macro that is defined
> in the arch specific topology header file. Then different NUMA
> architectures can tune this appropriately.
I still think that the load balancing condition should be clear and
the same for every architecture, but tunable. I understand that the
added constant helps, but I don't have a clear motivation for it. We
could keep the imbalance ratio as a condition for searching the
maximum:
> + if (load > maxload && (4*load > ((5*4*this_load)/4))) {
(with a tunable barrier height: say, replace 5/4 by 11/8 or similar,
i.e. (8*load > ((11*8*this_load)/8)))
and rule out any special unwanted cases after the loop. These would be
cases where it is obvious that we don't have to steal anything, e.g.
when the load of the remote node is <= the number of CPUs in that
node.
To make this tunable I suggest something like:
#define CROSS_NODE_BAL 125
(100*load > ((CROSS_NODE_BAL*100*this_load)/100))
and let platforms define their own value. Using 100 makes it easy to
understand.
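Roughly, find_busiest_node() could then look like this (just a sketch,
not a tested patch; node_nr_running[] and numnodes are as in the NUMA
patches, nr_cpus_node() stands for whatever the platform provides to
count the CPUs of a node):

	#define CROSS_NODE_BAL 125	/* platforms may override this */

	static int find_busiest_node(int this_node)
	{
		int i, node = -1, load, maxload;
		int this_load = atomic_read(&node_nr_running[this_node]);

		maxload = this_load;
		for (i = 0; i < numnodes; i++) {
			if (i == this_node)
				continue;
			load = atomic_read(&node_nr_running[i]);
			/* steal only if the remote load exceeds ours by
			 * CROSS_NODE_BAL percent */
			if (load > maxload &&
			    (100*load > ((CROSS_NODE_BAL*100*this_load)/100))) {
				maxload = load;
				node = i;
			}
		}
		/* rule out the obviously uninteresting cases after the
		 * loop, e.g. a node with no more runnable tasks than CPUs */
		if (node >= 0 && maxload <= nr_cpus_node(node))
			node = -1;

		return node;
	}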
> * In find_busiest_queue change:
>
> cpumask |= __node_to_cpu_mask(node);
> to:
> cpumask = __node_to_cpu_mask(node) | (1UL << (this_cpu));
>
>
> There is no reason to iterate over the runqueues on the current
> node, which is what the code currently does.
Here I thought I'd give the load balancer the chance to balance
internally, just in case the node's own load jumped up in the meantime.
But this should ideally be caught by find_busiest_node(), so I agree
with your change.
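Just to illustrate the effect of the change (not the actual diff): with
"|=" the scan mask still contains every CPU of the local node, so the
loop in find_busiest_queue() revisits runqueues that
find_busiest_node() already decided against; with "=" only the remote
node's CPUs (plus this_cpu, so the common code still works) are walked:

	cpumask = __node_to_cpu_mask(node) | (1UL << this_cpu);

	for (i = 0; i < NR_CPUS; i++) {
		if (!(cpumask & (1UL << i)))
			continue;	/* CPU not in the scan mask */
		/* ... inspect cpu_rq(i) as find_busiest_queue() does ... */
	}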
> Some numbers for anyone interested. Kernbench:
>
> All numbers are based on a 2.5.55 kernel with the cputime stats patch:
> * stock55 = no additional patches
> * mini+rebal-55 = patches 01, 02, and 03
> * rebal+4+fix = patches 01, 02, 03, and the cpumask change described
> above, and a +4 constant added to the check in find_busiest_node
> * rebal+4+fix+04 = above with the 04 patch added
>
>                      Elapsed    User       System    CPU
> rebal+4+fix+04-55    29.302s    285.136s   82.106s   1253%
> rebal+4+fix-55       30.498s    286.586s   88.176s   1228.6%
> mini+rebal-55        30.756s    287.646s   85.512s   1212.8%
> stock55              31.018s    303.084s   86.626s   1256.2%
The first line is VERY good!
The results below (for Schedbench) are not really meaningful without
statistics... But the combination rebal+4+fix+04-55 is the clear
winner for me. Still, I'd be curious to know the rebal+fix+04-55
numbers ;-)
> Schedbench 4:
> AvgUser Elapsed TotalUser TotalSys
> rebal+4+fix+04-55 27.34 40.49 109.39 0.88
> rebal+4+fix-55 24.73 38.50 98.94 0.84
> mini+rebal-55 25.18 43.23 100.76 0.68
> stock55 31.38 41.55 125.54 1.24
>
> Schedbench 8:
> AvgUser Elapsed TotalUser TotalSys
> rebal+4+fix+04-55 30.05 44.15 240.48 2.50
> rebal+4+fix-55 34.33 46.40 274.73 2.31
> mini+rebal-55 32.99 52.42 264.00 2.08
> stock55 44.63 61.28 357.11 2.22
>
> Schedbench 16:
> AvgUser Elapsed TotalUser TotalSys
> rebal+4+fix+04-55 52.13 57.68 834.23 3.55
> rebal+4+fix-55 52.72 65.16 843.70 4.55
> mini+rebal-55 57.29 71.51 916.84 5.10
> stock55 66.91 85.08 1070.72 6.05
>
> Schedbench 32:
> AvgUser Elapsed TotalUser TotalSys
> rebal+4+fix+04-55 56.38 124.09 1804.67 7.71
> rebal+4+fix-55 55.13 115.18 1764.46 8.86
> mini+rebal-55 57.83 125.80 1850.84 10.19
> stock55 80.38 181.80 2572.70 13.22
>
> Schedbench 64:
> AvgUser Elapsed TotalUser TotalSys
> rebal+4+fix+04-55 57.42 238.32 3675.77 17.68
> rebal+4+fix-55 60.06 252.96 3844.62 18.88
> mini+rebal-55 58.15 245.30 3722.38 19.64
> stock55 123.96 513.66 7934.07 26.39
>
>
> And here are the results from running numa_test 32 on rebal+4+fix+04:
>
> Executing 32 times: ./randupdt 1000000
> Running 'hackbench 20' in the background: Time: 8.383
> Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
> 1 100.0 0.0 0.0 0.0 | 0 0 | 56.19
> 2 100.0 0.0 0.0 0.0 | 0 0 | 53.80
> 3 0.0 0.0 100.0 0.0 | 2 2 | 55.61
> 4 100.0 0.0 0.0 0.0 | 0 0 | 54.13
> 5 3.7 0.0 0.0 96.3 | 3 3 | 56.48
> 6 0.0 0.0 100.0 0.0 | 2 2 | 55.11
> 7 0.0 0.0 100.0 0.0 | 2 2 | 55.94
> 8 0.0 0.0 100.0 0.0 | 2 2 | 55.69
> 9 80.6 19.4 0.0 0.0 | 1 0 *| 56.53
> 10 0.0 0.0 0.0 100.0 | 3 3 | 53.00
> 11 0.0 99.2 0.0 0.8 | 1 1 | 56.72
> 12 0.0 0.0 0.0 100.0 | 3 3 | 54.58
> 13 0.0 100.0 0.0 0.0 | 1 1 | 59.38
> 14 0.0 55.6 0.0 44.4 | 3 1 *| 63.06
> 15 0.0 100.0 0.0 0.0 | 1 1 | 56.02
> 16 0.0 19.2 0.0 80.8 | 1 3 *| 58.07
> 17 0.0 100.0 0.0 0.0 | 1 1 | 53.78
> 18 0.0 0.0 100.0 0.0 | 2 2 | 55.28
> 19 0.0 78.6 0.0 21.4 | 3 1 *| 63.20
> 20 0.0 100.0 0.0 0.0 | 1 1 | 53.27
> 21 0.0 0.0 100.0 0.0 | 2 2 | 55.79
> 22 0.0 0.0 0.0 100.0 | 3 3 | 57.23
> 23 12.4 19.1 0.0 68.5 | 1 3 *| 61.05
> 24 0.0 0.0 100.0 0.0 | 2 2 | 54.50
> 25 0.0 0.0 0.0 100.0 | 3 3 | 56.82
> 26 0.0 0.0 100.0 0.0 | 2 2 | 56.28
> 27 15.3 0.0 0.0 84.7 | 3 3 | 57.12
> 28 100.0 0.0 0.0 0.0 | 0 0 | 53.85
> 29 32.7 67.2 0.0 0.0 | 0 1 *| 62.66
> 30 100.0 0.0 0.0 0.0 | 0 0 | 53.86
> 31 100.0 0.0 0.0 0.0 | 0 0 | 53.94
> 32 100.0 0.0 0.0 0.0 | 0 0 | 55.36
> AverageUserTime 56.38 seconds
> ElapsedTime 124.09
> TotalUserTime 1804.67
> TotalSysTime 7.71
>
> Ideally, there would be nothing but 100.0 in all non-zero entries.
> I'll try adding in the 05 patch, and if that does not help, will
> try a few other adjustments.
>
> Thanks for the quick effort on putting together the node rebalance
> code. I'll also get some hackbench numbers soon.
Best regards,
Erich