Re: a quest for a better scheduler

Andrea Arcangeli (andrea@suse.de)
Wed, 4 Apr 2001 17:08:47 +0200

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: christophe barbe: "Re: uninteruptable sleep (D state => load_avrg++)"
Previous message: Mark Lehrer: ""linux" terminal type"

On Wed, Apr 04, 2001 at 03:34:22PM +0200, Ingo Molnar wrote:
>
> On Wed, 4 Apr 2001, Hubertus Franke wrote:
>
> > Another point to raise is that the current scheduler does a exhaustive
> > search for the "best" task to run. It touches every process in the
> > runqueue. this is ok if the runqueue length is limited to a very small
> > multiple of the #cpus. [...]
>
> indeed. The current scheduler handles UP and SMP systems, up to 32
> (perhaps 64) CPUs efficiently. Agressively NUMA systems need a different
> approach anyway in many other subsystems too, Kanoj is doing some
> scheduler work in that area.

I didn't seen anything from Kanoj but I did something myself for the wildfire:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1

this is mostly an userspace issue, not really intended as a kernel optimization
(however it's also partly a kernel optimization). Basically it splits the load
of the numa machine into per-node load, there can be unbalanced load across the
nodes but fairness is guaranteed inside each node. It's not extremely well
tested but benchmarks were ok and it is at least certainly stable.

However Ingo consider that in a 32-way if you don't have at least 32 tasks
running all the time _always_ you're really stupid paying such big money for
nothing ;). So the fact the scheduler is optimized for 1/2 tasks running all
the time is not nearly enough for those machines (and of course also the
scheduling rate automatically increases linearly with the increase of the
number of cpus). Now it's perfectly fine that we don't ask the embedded and
desktop guys to pay for that, but a kernel configuration option to select an
algorithm that scales would be a good idea IMHO. The above patch just adds a
CONFIG_NUMA_SCHED. The scalable algorithm can fit into it and nobody will be
hurted by that (CONFIG_NUMA_SCHED cannot even be selected by x86 compiles).

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: christophe barbe: "Re: uninteruptable sleep (D state => load_avrg++)"
Previous message: Mark Lehrer: ""linux" terminal type"