Re: OOM killer???

Stephen Satchell (satch@fluent-access.com)
Thu, 29 Mar 2001 09:21:06 -0800


At 07:41 AM 3/29/01 -0800, David Konerding wrote:
>Now, if you're going to implement OOM, when it is absolutely necessary, at
>the very least, move the policy implementation out of the kernel. One of
>the general philosophies of Linux has been to move policy out of the
>kernel. In this case, you'd just have a root owned process with locked
>pages that can't be pre-empted, which implemented the policy. You'll
>never come up with an OOM policy that will fit everybody's needs unless
>it can be tuned for particular system's usage, and it's going to be far
>easier to come up with that policy if it's not in the kernel.

SUMMARY OF COMMENT: We need kernel support for such a userland
process. At a minimum, I believe we would need a means to steer term
signals and kill signals to the correct process in a process tree, a means
for processes to receive notification that the system is in trouble, and
for the policy to be set and the OOM killer implemented in a daemon that
accepts input directly from the admin in a config file, from the memory
system via suitable interfaces, and from the processes via communications
(probably through the process control block). SIGDANGER needs to be
defined, but never raised by the kernel. There also needs to be a means
for the OOM daemon to request the release of non-critical cache buffers.

Comment follows:

I'm in basic agreement with your sentiments, but I'm concerned about how a
userland policy system will work without some support within the
kernel. The support also needs to be well enough defined that applications
can be written to work with the policy manager.

Let's start with Oracle, as one of the examples that keeps being brought
up. How does Oracle deal with the problem? Part of the problem is that we
have a late-commit policy and no way to clean up when the processes that
are running exceed the capacity of the machine they are running on. The
AIX solution at first glance seems reasonable: give the running processes
a chance to "become part of the solution" by freeing memory they have
reserved but are not using, or that can be decommissioned [cached
information] without destroying the process's work. The policy
implementation can by default reward cooperating processes by lowering
their chances of being killed should the condition persist.
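To make this concrete, here is a minimal sketch, in C, of how a
cooperating process might respond to such a warning. Linux defines no
low-memory signal today, so a real-time signal stands in for one; glibc's
malloc_trim() performs the "give back what you aren't using" step, and the
handler only sets a flag because malloc_trim() is not safe to call from
signal context.

#include <malloc.h>     /* malloc_trim() -- glibc extension */
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* ASSUMPTION: a real-time signal stands in for the low-memory
 * warning; Linux assigns no SIGDANGER-style signal today. */
#define SIGLOWMEM (SIGRTMIN + 1)

static volatile sig_atomic_t low_memory = 0;

static void lowmem_handler(int sig)
{
    (void)sig;
    low_memory = 1;     /* just note the event; work happens below */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = lowmem_handler;
    sigaction(SIGLOWMEM, &sa, NULL);

    for (;;) {
        pause();        /* stand-in for the real work loop */
        if (low_memory) {
            low_memory = 0;
            /* Drop application-level caches here, then hand
             * freed heap pages back to the kernel. */
            malloc_trim(0);
        }
    }
}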

Another characteristic of Oracle (shared by other mission-critical systems)
is that the thing is not implemented as a single process, but as a
collection of running processes working together. Arbitrarily killing one
process can corrupt the process work of several processes. Currently there
is no mechanism for a process to inform the system that any kills should
really be directed at *this* parent, so that the whole thing shuts down
reasonably. If such a mechanism were provided, any signals to ANY
process in the set would be steered to the "top" process. To prevent
subtle attacks, we would have to define bounds on which process may be
identified as the "top" process. (A non-root process in the process tree
would probably be fine for non-root subprocesses; any process in the
process tree except init would be suitable for root subprocesses.)
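To give the steering mechanism a concrete shape, it might look something
like a prctl() call. The PR_SET_KILL_LEADER operation below is invented
for this sketch -- no such prctl() option exists in Linux:

#include <sys/prctl.h>

/* HYPOTHETICAL: no such prctl() operation exists.  The idea is
 * that a process declares itself the target for any OOM-motivated
 * signals aimed at its descendants, subject to the "top process"
 * bounds described above. */
#define PR_SET_KILL_LEADER 0x4b4c44     /* made-up option value */

int become_kill_leader(void)
{
    return prctl(PR_SET_KILL_LEADER, 1, 0, 0, 0);
}

An Oracle-style master process would call this once at startup; the kernel
would then redirect any OOM-motivated signal aimed at a worker to the
master, which can shut the whole set down in order.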

Several people have pointed out that one of the current versions of the OOM
killer doesn't cause the release of cache buffers within the Linux
kernel. I've seen mention of patches to correct this problem, but if you
move the implementation of memory overcommitment recovery from the kernel
to userland, you will need some way for that userland daemon to tell the
various subsystems to release cached information that can be safely
released. This lets the daemon use a structured recovery technique: do
(A), check whether that opens up enough room, do (B), check again, and so
forth. Note that some method is needed to tell malloc() to fail all
subsequent memory requests, so that when the daemon takes a corrective
action there isn't further overcommitment.
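A sketch of that loop, as the userland daemon might run it: reading
/proc/meminfo is real, but the release_cache() request is hypothetical --
it is precisely the kernel interface being argued for here.

#include <stdio.h>

/* Crude free-memory estimate from /proc/meminfo, in kB. */
static long free_kb(void)
{
    char line[128];
    long kb = -1;
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "MemFree: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

/* HYPOTHETICAL: ask a kernel subsystem to shed releasable cache.
 * No such interface exists; the text argues the kernel must
 * provide one for a userland daemon to work at all. */
static void release_cache(const char *subsystem)
{
    fprintf(stderr, "would ask %s to release clean pages\n", subsystem);
}

int main(void)
{
    static const char *step[] = { "dcache", "icache", "pagecache" };
    const long target_kb = 16 * 1024;   /* policy knob: 16 MB headroom */
    int i;

    /* Try the cheapest recovery action first, and stop as soon
     * as enough room has opened up. */
    for (i = 0; i < 3; i++) {
        if (free_kb() >= target_kb)
            return 0;   /* enough recovered; no kill needed */
        release_cache(step[i]);
    }
    /* Still short of the target: escalate to SIGTERM, then
     * SIGKILL, per the configured policy. */
    return 1;
}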

Which brings up another point: why SIGKILL? SIGTERM would appear to be
the proper signal at a "yellow alert," so that the process has a chance to
go through an orderly shutdown -- which might include check-pointing
that week-long calculation of the 6,839,294,763,900,034th prime
number. This is especially important for process sets -- should the first
hint of trouble the top-level ORACLE module sees really be a SIGCHLD
signal from a troop when it isn't expecting one?
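For a single long-running process, honoring that "yellow alert" SIGTERM is
the familiar flag-and-checkpoint pattern; a minimal sketch:

#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t terminating = 0;

static void on_term(int sig)
{
    (void)sig;
    terminating = 1;    /* handler stays minimal; cleanup runs below */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_term;
    sigaction(SIGTERM, &sa, NULL);

    while (!terminating) {
        /* ... one unit of the week-long calculation ... */
        sleep(1);       /* stand-in for real work */
    }
    /* Checkpoint partial results here before exiting, so the
     * calculation can resume after the memory crisis passes. */
    return 0;
}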

Finally, I want to bring up a sore subject: beancounting. When I was
taking CS 306 (Operating System Principles) at UIUC in 1972, one issue that
came up was the management of overcommitment. Our term project, a system
resource manager, had to deal with a load that included exactly the behavior
that has been discussed here: programs that reserve resources they never
use. In our resource monitors, we were expected to keep track of
allocations, deallocations, and actual usage such that we could CONTROL the
overcommitment of resources, and therefore avoid deadlock. From my
reading of the threads and a glance at the source, the problem is that
processes can ask for and receive resources virtually without limit...as
long as nobody actually uses those resources. It's only when the processes
try to use the resources that the system has promised the processes it can
use that the problem rears its ugly head.

Not only should there be beancounting, but there needs to be policy input
to the kernel on when to fail malloc() et al. If I want to avoid all
overcommitment, I should be able to set a value in a file in the /proc
filesystem to zero to say "0 percent overcommitment" -- which means failing
malloc() calls when a calculated high-water mark is reached. Higher values
stored in this file mean higher levels of overcommitment are allowed in
memory allocation calls. The default at boot would be zero; the
distributions could then decide how to set the overcommitment value in
their start-up scripts. The userland policy process could even tweak this
overcommitment value on the fly if so desired, to tune the system to
current demand and to the admin's inputs.
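If the proposed knob existed, setting it would be a one-line write at
boot. The path below is the hypothetical file this paragraph proposes, not
a current /proc entry:

#include <stdio.h>

/* HYPOTHETICAL path: the percentage-style knob proposed above
 * does not exist in the kernel. */
#define OVERCOMMIT_KNOB "/proc/sys/vm/overcommit_percent"

/* 0 = no overcommitment: fail malloc() at the high-water mark. */
int set_overcommit(int percent)
{
    FILE *f = fopen(OVERCOMMIT_KNOB, "w");
    if (!f)
        return -1;
    fprintf(f, "%d\n", percent);
    return fclose(f);
}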

This helps separate the prevention measure (failing malloc()s) from the
recovery measure (killing processes).

I see no way that the beancounting can be relegated to a userland process
-- it needs to be in the kernel. To avoid excess bloat in the kernel, the
kernel should only count the beans and trigger the userland process when
thresholds are exceeded by the system. In this manner, no OOM killer code
need be in the kernel at all. No OOM killer registered? We then revert to
Version 7 action: panic.
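One plausible shape for that trigger is a blocking read on a device that
the kernel completes only when a threshold trips. The device name below is
invented for this sketch:

#include <fcntl.h>
#include <unistd.h>

/* HYPOTHETICAL device: the kernel counts the beans and completes
 * this read only when a threshold is crossed. */
#define OOM_TRIGGER_DEV "/dev/oom-events"

int main(void)
{
    char event;
    int fd = open(OOM_TRIGGER_DEV, O_RDONLY);
    if (fd < 0)
        return 1;   /* cannot register: kernel falls back to panic */

    /* Block until the kernel reports trouble, then run the
     * recovery policy sketched earlier. */
    while (read(fd, &event, 1) == 1) {
        /* run_recovery_policy(); */
    }
    return 0;
}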

Which leads to my final point: I believe that the SIGDANGER signal should
be defined in Linux. The signal would not be raised by the kernel in any
way -- that's left to the userland OOM daemon. The expected response to
SIGDANGER would be documented, and the default action would be to ignore
the signal. One objection is that a denial of service could be launched by
a process installing a SIGDANGER handler that calls malloc() -- I've
already mentioned the
requirement that the userland daemon have a way of causing all calls to
malloc() to fail. The Linux definition would differ from the AIX
definition, but the net result would be the same and I believe that the
Linux definition can be written such that existing AIX-based handlers will
work with minimum modification.
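Since no number is assigned yet, an AIX-style handler can be written
against a placeholder today; the SIGRTMIN-based stand-in below is an
assumption, not an existing Linux definition:

#include <signal.h>
#include <string.h>

/* PLACEHOLDER: Linux assigns no number to SIGDANGER; an AIX-style
 * handler ports over once one is defined. */
#ifndef SIGDANGER
#define SIGDANGER (SIGRTMIN + 2)
#endif

static void danger_handler(int sig)
{
    (void)sig;
    /* Shed reserved-but-unused memory here.  Never call malloc()
     * from this handler -- as noted above, a handler that
     * allocates could deliberately worsen the shortage. */
}

void install_danger_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = danger_handler;
    /* Only processes that opt in by installing a handler ever
     * react; the default action is to ignore the signal. */
    sigaction(SIGDANGER, &sa, NULL);
}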

I submit this as a strawman suggestion -- I'm not married to any of the
ideas. Feel free to suggest alternatives that solve the problem.

Stephen Satchell
