Seems to me the correct solution is to start writing out
things long before memory gets so full that we need to
throttle the producer.
This will result in somewhat smaller chunks and a somewhat
steadier stream of data. It will work better, but not because
the chunks are smaller. (Block devices are supposed to handle
enormous chunks with no problems, and bandwidth utilization
is generally better the bigger the chunks you can get. 50M to
disk in one go isn't "pushing" anything - it is "nice".)
The reason an earlier start works better is that memory never fills up
to the point where the producer is throttled, assuming
the I/O system can keep up with the producer forever.
Throttling will _always_ happen when that isn't the case.
The tricky part here is knowing the bandwidth of the output
device, and starting to write at such a time that memory
won't have time to fill up in the case where the
producer is almost as fast as the output device.
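As a back-of-envelope illustration (all rates, sizes and the model
itself are invented here, and far cruder than real kernel writeback),
a tiny simulation shows why an early start keeps dirty memory bounded
while a late start drives it toward the throttle point:

```python
def simulate(total, producer, device, start_threshold, steps=1000):
    """Simulate dirty-memory growth where writeback only runs once
    dirty bytes exceed start_threshold * total.
    Returns the peak dirty fraction reached.  Illustrative only."""
    dirty = 0.0
    peak = 0.0
    for _ in range(steps):
        dirty += producer                      # producer dirties memory
        if dirty > start_threshold * total:    # writeback is running
            dirty = max(0.0, dirty - device)   # device drains dirty pages
        dirty = min(dirty, total)              # can't exceed total memory
        peak = max(peak, dirty / total)
    return peak
```

With a producer running at 90% of device speed, a 10% start
threshold keeps peak dirty memory near 10%, while a 90% start
threshold lets it climb to 90% - right where throttling would begin.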
The problem is that this depends on several things:
1. How much more memory is there (varies a lot, but
the kernel knows this one.)
2. How fast is the output device? (Varies a lot: different
   areas on a disk have different speeds, and different
   disks have different speeds. The speed of nfs depends
   on network speed, network congestion,
   roundtrip time, server load, and server disk speed.
   You probably cannot get good estimates for all cases,
   particularly not nfs on a shared net.
   To get this right we need both bandwidth and latency.)
3. How fast is data produced? A global estimate may
be possible, looking at how fast memory is dirtied.
I have no idea if such an estimate is possible per
block device.
4. The big problem is that there may be several unrelated
processes dirtying memory to be written to several
very different block devices.
For this to work automatically we need a low estimate
of the bandwidth for each block device/filesystem,
and the memory dirtying rate for each.
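The per-device rule the list points toward might look like this
(a sketch only; the function name and all inputs are invented, and
no such kernel interface exists - the point is just comparing the
time to drain against the time to fill):

```python
def needs_early_writeback(free_bytes, dirty_bytes, dirty_rate, bandwidth_low):
    """Decide whether writeback should start now for one device.

    Compares the time needed to drain the dirty data at a
    pessimistic (low) bandwidth estimate against the time the
    producer needs to dirty the remaining free memory.  Estimates
    are per block device/filesystem, as the text suggests."""
    if dirty_rate <= 0:
        return False                      # nothing being dirtied: no rush
    time_to_fill = free_bytes / dirty_rate          # until memory is full
    time_to_drain = dirty_bytes / max(bandwidth_low, 1e-9)
    # start early enough that draining can finish before memory fills
    return time_to_drain >= time_to_fill
```

The hard part, as noted above, is that `dirty_rate` and
`bandwidth_low` are exactly the numbers we don't have good
estimates for, especially for nfs on a shared net.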
This seems hard to solve automatically. A specific
case of a realtime program writing near disk speed is solvable
by having an extra thread that issues an fsync whenever the
amount of written but unsynced data gets near the point
where the time needed to write it out is long enough
for the producer to fill memory in the meantime.
Of course one wants a substantial safety margin here,
perhaps an assumption that only one third or so of memory
actually will be available for caching the important stuff.
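The trigger condition for that extra fsync thread could be sketched
like so (the one-third margin is from the text; the names and the
rate estimates are made up and would have to come from the
application itself):

```python
def should_fsync(unsynced_bytes, total_mem, producer_rate, device_rate,
                 margin=3):
    """Trigger for the extra fsync thread described above.

    Only total_mem/margin is assumed usable for caching (the safety
    margin).  While the device flushes the unsynced data, the
    producer keeps dirtying memory; fsync before the two together
    would exhaust the usable cache.  Illustrative only."""
    usable = total_mem / margin
    flush_time = unsynced_bytes / device_rate        # seconds to write it out
    dirtied_meanwhile = flush_time * producer_rate   # produced during flush
    return unsynced_bytes + dirtied_meanwhile >= usable
```

E.g. with 300M of memory, a 100M/s disk and a 90M/s producer,
10M of unsynced data is still comfortable, but 60M already
crosses the one-third line once you count what the producer
dirties during the flush.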
A manual solution is possible if we can have two "knobs"
for this:
1. Threshold for when to start writing out stuff
2. Threshold for when to throttle processes.
The latter may or may not be necessary, the point is that the former
should kick in long before throttling is necessary.
This is usually expressed as the percentage of memory that is dirty, but
I'm not sure that is the right thing. It assumes that 100% will be
available after cleaning, which may be way off.
Something like the percentage of memory that is still available (free,
or instantly freeable by reclaiming clean unpinned cache) might be better.
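A sketch of the two knobs, measured against available rather than
dirty memory as suggested (the knob values and names are invented):

```python
def writeback_action(free, reclaimable, total,
                     start_knob=0.5, throttle_knob=0.1):
    """Two 'knobs' as fractions of memory still available (free plus
    instantly-reclaimable clean unpinned cache), rather than the
    usual percent-dirty.  Returns 'idle', 'writeback' or 'throttle'."""
    available = (free + reclaimable) / total
    if available < throttle_knob:
        return "throttle"      # last resort: block the producer
    if available < start_knob:
        return "writeback"     # start writing long before throttling
    return "idle"
```

The point of the spread between the two knobs is exactly the
argument above: writeback kicks in long before throttling is
necessary, so with a fast enough device the throttle knob is
never reached.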
Helge Hafting