Seems to me the correct solution is to start writing out
things long before memory gets so full that we need to
throttle the producer.
This will result in somewhat smaller chunks and a somewhat
steadier stream of data. It will work better, but not because
the chunks are smaller. (Block devices are supposed to handle
enormous chunks with no problems, and bandwidth utilization
is generally better the bigger the chunks you can get. 50M to
disk in one go isn't "pushing" anything - it is "nice".)
The reason an earlier start works better is that memory never fills up
to the point where the producer is throttled, assuming
the I/O system can keep up with the producer forever.
Throttling will _always_ happen when that isn't the case.
The tricky part here is knowing the bandwidth of the output
device, and starting to write at such a time that memory
won't have time to fill up in the case where the
producer is almost as fast as the output device.
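As a back-of-envelope illustration (all rates, sizes and the model
itself are invented here, and far cruder than real kernel writeback),
a tiny simulation shows why an early start keeps dirty memory bounded
while a late start drives it toward the throttle point:

```python
def simulate(total, producer, device, start_threshold, steps=1000):
    """Simulate dirty-memory growth where writeback only runs once
    dirty bytes exceed start_threshold * total.
    Returns the peak dirty fraction reached.  Illustrative only."""
    dirty = 0.0
    peak = 0.0
    for _ in range(steps):
        dirty += producer                      # producer dirties memory
        if dirty > start_threshold * total:    # writeback is running
            dirty = max(0.0, dirty - device)   # device drains dirty pages
        dirty = min(dirty, total)              # can't exceed total memory
        peak = max(peak, dirty / total)
    return peak
```

With a producer running at 90% of device speed, a 10% start
threshold keeps peak dirty memory near 10%, while a 90% start
threshold lets it climb to 90% - right where throttling would begin.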
The problem is that this depends on several things:
1. How much more memory is there (varies a lot, but
the kernel knows this one.)
2. How fast is the output device? (Varies a lot: different
   areas on a disk have different speeds, and different
   disks have different speeds. The speed of nfs depends
   on network speed, network congestion,
   roundtrip time, server load, and server disk speed.
   You probably cannot get good estimates for all cases,
   particularly not nfs on a shared net.
   To get this right we need both bandwidth and latency.)
3. How fast is data produced? A global estimate may
be possible, looking at how fast memory is dirtied.
I have no idea if such an estimate is possible per
block device.
4. The big problem is that there may be several unrelated
processes dirtying memory to be written to several
very different block devices.
For this to work automatically we need a low estimate
of the bandwidth for each block device/filesystem,
and the memory dirtying rate for each.
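The per-device rule the list points toward might look like this
(a sketch only; the function name and all inputs are invented, and
no such kernel interface exists - the point is just comparing the
time to drain against the time to fill):

```python
def needs_early_writeback(free_bytes, dirty_bytes, dirty_rate, bandwidth_low):
    """Decide whether writeback should start now for one device.

    Compares the time needed to drain the dirty data at a
    pessimistic (low) bandwidth estimate against the time the
    producer needs to dirty the remaining free memory.  Estimates
    are per block device/filesystem, as the text suggests."""
    if dirty_rate <= 0:
        return False                      # nothing being dirtied: no rush
    time_to_fill = free_bytes / dirty_rate          # until memory is full
    time_to_drain = dirty_bytes / max(bandwidth_low, 1e-9)
    # start early enough that draining can finish before memory fills
    return time_to_drain >= time_to_fill
```

The hard part, as noted above, is that `dirty_rate` and
`bandwidth_low` are exactly the numbers we don't have good
estimates for, especially for nfs on a shared net.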
This seems hard to solve automatically. A specific
case of a realtime program writing near disk speed is solvable
by having an extra thread that issues an fsync whenever the
amount of written but unsynced data gets near the point
where the time needed to write it out is long enough
for the producer to fill memory in the meantime.
Of course one wants a substantial safety margin here,
perhaps an assumption that only one third or so of memory
actually will be available for caching the important stuff.
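The trigger condition for that extra fsync thread could be sketched
like so (the one-third margin is from the text; the names and the
rate estimates are made up and would have to come from the
application itself):

```python
def should_fsync(unsynced_bytes, total_mem, producer_rate, device_rate,
                 margin=3):
    """Trigger for the extra fsync thread described above.

    Only total_mem/margin is assumed usable for caching (the safety
    margin).  While the device flushes the unsynced data, the
    producer keeps dirtying memory; fsync before the two together
    would exhaust the usable cache.  Illustrative only."""
    usable = total_mem / margin
    flush_time = unsynced_bytes / device_rate        # seconds to write it out
    dirtied_meanwhile = flush_time * producer_rate   # produced during flush
    return unsynced_bytes + dirtied_meanwhile >= usable
```

E.g. with 300M of memory, a 100M/s disk and a 90M/s producer,
10M of unsynced data is still comfortable, but 60M already
crosses the one-third line once you count what the producer
dirties during the flush.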
A manual solution is possible if we can have two "knobs"
for this:
1. Threshold for when to start writing out stuff
2. Threshold for when to throttle processes.
The latter may or may not be necessary, the point is that the former
should kick in long before throttling is necessary.
This is usually expressed as the percentage of memory that is dirty, but
I'm not sure that is the right thing. It assumes that 100% will be
available after cleaning, which may be way off.
Something like the percentage of memory that is still available (free,
or instantly freeable by reclaiming clean unpinned cache) might be better.
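A sketch of the two knobs, measured against available rather than
dirty memory as suggested (the knob values and names are invented):

```python
def writeback_action(free, reclaimable, total,
                     start_knob=0.5, throttle_knob=0.1):
    """Two 'knobs' as fractions of memory still available (free plus
    instantly-reclaimable clean unpinned cache), rather than the
    usual percent-dirty.  Returns 'idle', 'writeback' or 'throttle'."""
    available = (free + reclaimable) / total
    if available < throttle_knob:
        return "throttle"      # last resort: block the producer
    if available < start_knob:
        return "writeback"     # start writing long before throttling
    return "idle"
```

The point of the spread between the two knobs is exactly the
argument above: writeback kicks in long before throttling is
necessary, so with a fast enough device the throttle knob is
never reached.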
Helge Hafting