time tells all about kernel VM's

safemode (safemode@speakeasy.net)
Mon, 22 Oct 2001 23:04:25 -0400


We've all seen benchmarks, "load tests" and "real world runthroughs" of
the rik and aa kernel VMs. But time does tell all. I've had
2.4.12-ac3-hogstop up and running for over 5 days. The first hiccup I found
was a day or so ago, when I tried defragging an ext2 fs on a hard drive just
for the hell of it. I have 770MB of RAM and 128MB of swap (my other 128MB
of swap was on the drive I was defragging, so I had swapoff'd it). First,
the kernel created about 600MB of buffer in addition to the 128MB of buffer
I had told the application to use (e2defrag -p 16384). This brought
the system to a crawl. In some twisted reality that may be considered
normal kernel behavior, so I let it pass. Then I created an insanely large
PostScript file, tried loading it in ghostview and magnified it a couple of
times in kghostview. And what happened? I wish I could tell you, but I can't,
because the system immediately went unresponsive and started swapping at a
turtle's pace. I can tell you what didn't happen, though.

A. OOM did not kick in and kill kghostview. Why, you may ask? Read on to B.
B. The VM has this need to redistribute cache and buffer so that an OOM
situation doesn't take place until basically all the RAM is in use. The
problem is that the VM currently swaps out whatever it isn't using, and
without buffer it must read from the drive (which is also being used for
swap). That takes more CPU, which isn't there because the app is locking the
kernel up trying to allocate memory (see why dbench causes mp3 skips). So
the kernel can't swap, because hard drive I/O is being strangled by the
process that's going out of control (kghostview). That leaves the VM stuck
doing this redistribution at a snail's pace, and the OOM situation never
occurs (or occurs many days later, when you've died of starvation). You end
up deadlocked on a kernel whose VM is supposed to conquer this situation and
make it a thing of the past.
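
One way to watch the behavior described in A and B is to sample
/proc/meminfo while the load runs. What follows is only a minimal sketch and
was not used for this report. It assumes a Linux /proc/meminfo that has
"Name: value kB" lines, and the function names (parse_meminfo,
format_sample, watch) are made up for illustration.

```python
import time

# Fields relevant to the report: free RAM, buffer, cache and swap.
FIELDS = ("MemFree", "Buffers", "Cached", "SwapTotal", "SwapFree")


def parse_meminfo(text):
    """Return {field: kB} for the 'Name:  value kB' lines of /proc/meminfo.

    Lines in any other shape (for example the byte-count table that 2.4
    kernels print at the top) are skipped.
    """
    info = {}
    for line in text.splitlines():
        parts = line.split()
        if (len(parts) == 3 and parts[0].endswith(":")
                and parts[1].isdigit() and parts[2] == "kB"):
            info[parts[0][:-1]] = int(parts[1])
    return info


def format_sample(info):
    """Build one log line, in whole MB, for the fields listed in FIELDS."""
    return " ".join("%s=%dMB" % (name, info.get(name, 0) // 1024)
                    for name in FIELDS)


def watch(path="/proc/meminfo", interval=5.0, samples=3):
    """Print a few samples, `interval` seconds apart (values are assumptions)."""
    for _ in range(samples):
        with open(path) as f:
            print(format_sample(parse_meminfo(f.read())))
        time.sleep(interval)
```

Calling watch() from another terminal before starting the load should show
Buffers and Cached growing while MemFree and SwapFree shrink, if the system
stays responsive enough to run it at all.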

So what happens after a few days of uptime is that we see where the VM has
slight weaknesses that magnify over time and aren't apparent in the normal run
of tests done on each new release to decide whether it's good or not.
Perhaps if I had not had any swap enabled at all, this situation would have
been avoided.
I see this as a pretty serious bug.