I wouldn't bother using RedHat's kernel for this at the moment;
Andrea's tree is where the development work for this area has all
been happening recently. He's working on integrating O(1) sched
right now, which will get rid of the biggest issue I have with -aa
at the moment (the issue being that I'm too idle^H^H^H^Hbusy to
merge it ;-)).
> The catastrophic failures are still happening, in fact, the last
> lse-tech conference call a week or two ago was dedicated at least in
> part to them. The number of different ways in which these failures
> occur is large, so it's taking a while for the iterations of whack-a-mole
> game to converge to kernel stability. Andrea has probably been doing the
> most visible stuff on this front with the recent bh/inode exhaustion
> patches, with due credit to akpm as well for the unconditional bh
> stripping patch.
The problems we're seeing are mainly KVA (kernel virtual address
space) exhaustion. The top of my hitlist at the moment is:
1. PTEs (5000 tasks sharing a 2GB SGA = 10GB of PTEs).
	We have two different implementations of highpte. Andrea's
	latest seems to work fairly well, and is much more scalable
	than earlier versions. We need to have shared PTEs as well.
	I'd encourage people to benchmark the hell out of each
	solution, and help us come down to one, or a hybrid of both.
2. rmap pte_chains. 
	As far as I can see, these consume twice as much space as
	the PTEs (i.e. 20GB in the case above).
3. buffer_heads
	I have over 1GB of buffer_heads in (an enlarged) ZONE_NORMAL
	right now. akpm has given me a patch to prune them pretty
	viciously on an ongoing basis; Andrea has a patch to prune
	them under memory pressure. I have slight concerns about
	fragmentation under Andrea's approach, but both patches seem
	to work fine. The performance impact still needs to be
	worked out.
4. struct page
	Bill Irwin has already done great things in shrinking this
	somewhat, but I think we need to be even more drastic at 
	some point, and only map the PTEs we need for each process,
	into a task (well, address-space) specific KVA area, which
	I call user-kernel address space or UKVA (search back for
	my proposal to do this a couple of months ago).
5. kmap
	Persistent kmap sucks, and the global systemwide TLB flushes
	make it scale as O(1/N^2) with the number of CPUs. Enlarging
	the kmap area helps a little, but really we need to stop doing
	this to ourselves. I will have a patch (hopefully within a
	week) to do per-task kmap, based on the UKVA patch that Dave
	McCracken has already implemented.
6. vmalloc
	Vmalloc space gets exhausted quickly. I think a large part of
	that is threads allocating 64K LDTs ... 2.5 has a recent fix
	for this, which we need to backport.
There are various other general scalability problems (e.g. I'd like to
see Ingo's scheduler put into mainline 2.4 sometime soon; both 2.5 and
our benchmarking teams have kicked the hell out of it, and it stands
up well), but the list above covers the things I can think of at the
moment that are specific to 32-bit machines (though some of those
would also help 64-bit).
M.