> Alright, I'll try running some other numbers too, what can you
> recommend other than aim and kernel compiles?
Hmmm, take any ISV MPI code. I tried StarCD.
You can also try a small test which is sensitive to both memory latency
and bandwidth. The core is an indirect update loop
    for (i = 0; i < n; i++)
            a[ix[i]] = a[ix[i]] + b[i];
with random ix[]. This is quite typical for simulations using unstructured
grids. You can find it at:
http://home.arcor.de/efocht/linux/affinity_test.tgz
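In case that link doesn't work for you, here is a minimal sketch of what
such a test can look like (this is not the original affinity.c; the
default array size, the repetition count and the use of rand() for ix[]
are just my assumptions):

/* sketch.c - indirect update test, sketch only (not the original
 * affinity.c). Compile with e.g. "gcc -O2 sketch.c -o sketch". */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        long n = (argc > 1) ? atol(argv[1]) : 1000000;
        long i, rep;
        double *a = malloc(n * sizeof(*a));
        double *b = malloc(n * sizeof(*b));
        long *ix  = malloc(n * sizeof(*ix));

        if (!a || !b || !ix)
                return 1;

        srand(1);
        for (i = 0; i < n; i++) {
                a[i]  = 0.0;
                b[i]  = 1.0;
                ix[i] = rand() % n;     /* random indirection */
        }

        /* the latency/bandwidth sensitive core loop */
        for (rep = 0; rep < 100; rep++)
                for (i = 0; i < n; i++)
                        a[ix[i]] = a[ix[i]] + b[i];

        printf("checksum: %f\n", a[0]);  /* prevent dead code elimination */
        free(a); free(b); free(ix);
        return 0;
}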
There are two similar perl scripts included which can be used for
submitting several such processes in parallel and formatting the
output. They are hardwired for 4 nodes (sorry) and need a file describing
the cpu to node assignment. It should contain the node numbers of the
logical CPUs separated by _one_ blank (sorry for the inconvenience, I was
using a patch for accessing this info in /proc...).
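For example, on an 8 CPU box with two CPUs per node (just an assumed
layout, adjust it to your machine) the file would contain a single line
like:
    0 0 1 1 2 2 3 3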
Try calling:
time ./disp 8 ./affinity 1000000
This will start 8 copies of ./affinity and collect some statistics: the
percentage of time spent on each node, the maximum percentage spent on
any node, the node on which the job spent most of its time, the initial
node, and the user time. The elapsed time returned by the "time" command
is also interesting. The output looks like this:
[focht@azusa ~/affinity]> ./dispn 4 ./affinity 1000000
Executing 4 times: ./affinity 1000000
-----------------------------------------------------------------------
        % Time on Node            |  Scheduled node |
Job      0      1      2      3   |   Max (m , i)   |  Time (s)
-----------------------------------------------------------------------
 1   100.0    0.0    0.0    0.0   |  100.0 (0 = 0)  |   29.5
 2     0.0    0.0    0.0  100.0   |  100.0 (3 = 3)  |   26.5
 3   100.0    0.0    0.0    0.0   |  100.0 (0 = 0)  |   29.5
 4     0.0  100.0    0.0    0.0   |  100.0 (1 = 1)  |   26.6
-----------------------------------------------------------------------
Average user time = 28.00 seconds
Normally you should see:
1: on kernels with the old scheduler:
   - for #jobs < #cpus: poor balancing among nodes => bad times due
     to bandwidth limitations
   - for #jobs > #cpus: jobs are hopping from node to node: poor
     latency
2: on kernels with Ingo's scheduler:
   - less hopping across the nodes, but poor balance among nodes
     (as in case 1).
   - short additional load peaks can shift jobs away from their
     memory to another node, from where they have no reason to
     return.
3: on kernels with Ingo's O(1) + node affine extension:
   - hopefully better balanced nodes (at least on average) for
     #jobs < #cpus, therefore better bandwidth available.
   - better latency due to node affinity (though there is a problem
     when a single job is running on a remote node: it can't be taken
     back to its home node even if that one is empty, because a
     running job can't be stolen).
Without memory affinity (but you have it, thanks to DISCONTIG) the
timings are much worse...
> Sounds good, I'll have to update those macros later too (Jack
> reminded me that physical node numbers aren't always the same as
> logical node numbers).
Ah, ok... I'd prefer to use
#define CPU_TO_NODE(i) node_number[i]
The macro SAPICID_TO_NODE() is mainly used to build this one and its speed
doesn't matter. So make it a function, or whatever you need...
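Just to illustrate what I mean, roughly (cpu_physical_id() and the
__init placement are assumptions about your tree, not something I
verified):

int node_number[NR_CPUS];

#define CPU_TO_NODE(i)  node_number[i]

/* build the cpu -> node table once at boot; after that the
 * fast path is a plain array lookup */
static void __init build_cpu_to_node_map(void)
{
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                node_number[cpu] = SAPICID_TO_NODE(cpu_physical_id(cpu));
}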
Regards,
Erich
PS: I heard that on your big systems you currently have 32 nodes, maybe
that's a better choice for NR_NODES?