Re: [Lse-tech] NUMA scheduling

Larry McVoy (lm@bitmover.com)
Mon, 25 Feb 2002 11:03:27 -0800


On Mon, Feb 25, 2002 at 10:55:03AM -0800, Martin J. Bligh wrote:
> > - The load_balancing() concept is different:
> > - there are no special time intervals for balancing across pool
> > boundaries, the need for this can occur very quickly and I
> > have the feeling that 2*250ms is a long time for keeping the
> > nodes unbalanced. This means: each time load_balance() is called
> > it _can_ balance across pool boundaries (but doesn't have to).
>
> Imagine for a moment that there's a short spike in workload on one node.
> By aggressively balancing across nodes, won't you incur a high cost
> in terms of migrating all the cache data to the remote node (destroying
> the cache on both the remote and local node), when it would be cheaper
> to wait for a few more ms, and run on the local node?

Great question! The answer is that you are absolutely right. SGI tried
a pile of things in this area, both on NUMA and on traditional SMPs (the
NUMA stuff was more page migration and the SMP stuff was more process
migration, but the problem is the same: you screw up the cache). They
never got the page migration to give them better performance while I was
there and I doubt they have today. And the process "migration" from CPU
to CPU didn't work either; people tended to lock processes to processors
for exactly the reason you alluded to.
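
To make the tradeoff in the question concrete, here is a rough sketch of
the break-even test. The function name, its parameters, and the numbers
in the usage note are all made up for illustration; nothing here is
measured or taken from any real scheduler.

```python
def should_migrate(local_wait_ms, cache_refill_ms, remote_wait_ms=0.0):
    """Decide whether moving a process to another CPU or node pays off.

    All three arguments are hypothetical estimates, in milliseconds:
      local_wait_ms   - how long the process would wait to run where it is
      cache_refill_ms - time lost rebuilding its cache state after a move
      remote_wait_ms  - how long it would wait on the target CPU
    """
    cost_of_moving = remote_wait_ms + cache_refill_ms
    # Migrate only when moving is strictly cheaper than staying put.
    return cost_of_moving < local_wait_ms
```

With a 10 ms refill cost, a 2 ms local wait says stay and a 50 ms local
wait says move. The hard part in practice is estimating the refill cost,
which this sketch simply takes as an input.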

If you read the early hardware papers on SMP, they all claim "Symmetric
Multi Processor", i.e., you can run any process on any CPU. Skip forward
3 years, now read the cache affinity papers from the same hardware people.
You have to step back and squint but what you'll see is that these papers
could be summarized in one sentence:

"Oops, we lied, it's not really symmetric at all"

You should treat each CPU as a mini system, think of rescheduling a process
someplace else as a checkpoint/restart, and assume that is heavyweight. In
fact, I'd love to see the scheduler code forcibly sleep the process for
500 milliseconds each time it lands on a different CPU. Tune the system
to work well with that, then take out the sleep, and you'll have the right
answer.
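
A toy model of that experiment, as a sketch only: this is not kernel
code. The names pick_cpu, MIGRATION_PENALTY_MS, and TIMESLICE_MS are
invented for illustration, and the 10 ms timeslice is an assumption. The
500 ms penalty is the figure suggested above.

```python
MIGRATION_PENALTY_MS = 500.0  # forced sleep per CPU change, from the text above
TIMESLICE_MS = 10.0           # assumption: each queued task runs this long


def pick_cpu(last_cpu, queue_lengths,
             penalty_ms=MIGRATION_PENALTY_MS, slice_ms=TIMESLICE_MS):
    """Pick the CPU where the task is estimated to start running soonest.

    queue_lengths[i] is the number of tasks already waiting on CPU i.
    Landing on any CPU other than last_cpu adds penalty_ms to the estimate.
    Ties are broken in favor of staying on last_cpu.
    """
    def estimated_delay(cpu):
        wait = queue_lengths[cpu] * slice_ms
        return wait + (0.0 if cpu == last_cpu else penalty_ms)

    return min(range(len(queue_lengths)),
               key=lambda cpu: (estimated_delay(cpu), cpu != last_cpu))
```

With the penalty in place, a task last run on CPU 0 stays there even when
CPU 1 is idle and CPU 0 has five tasks queued, and it only moves once the
local queue is long enough to cost more than the penalty. Setting
penalty_ms=0 gives the naive behavior of always chasing the shortest
queue.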

-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 