It's a start. The important thing is that it supports my theory of what is
going on here. What I did there is probably a good thing; it seems quite
effective for combating order-0 atomic failures. In this case you have a
driver that uses a fallback allocation strategy, starting with an order-3
allocation attempt and dropping down to the next lower size on failure. If
the order-0 allocation fails the whole operation fails, and maybe you will
lose a packet. So order-0 allocations are important; we really want them
to succeed.
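
For concreteness, the fallback path I'm describing would look something
like the sketch below. It isn't taken from any real driver -- the function
name alloc_rx_block() is made up -- it just assumes the usual
__get_free_pages()/GFP_ATOMIC interface from <linux/mm.h>.

	/*
	 * Sketch of the fallback strategy: try order 3 first and step
	 * down one order on each failure.  Only when the final order-0
	 * attempt fails does the caller have to drop the packet.
	 */
	static unsigned long alloc_rx_block(int *got_order)
	{
		int order;

		for (order = 3; order >= 0; order--) {
			unsigned long block = __get_free_pages(GFP_ATOMIC, order);
			if (block) {
				*got_order = order;
				return block;
			}
		}
		return 0;	/* even order 0 failed */
	}
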
The next part of the theory says that the higher order allocations are
failing because of fragmentation. I put considerable thought into this
today while wandering around in a dungeon in Berlin watching bats (really)
and I will post an article to lkml tomorrow with my findings. To summarize
briefly here: a Linux system in steady state operation *is* going to show
physical fragmentation so that the chance of a higher order allocation
succeeding becomes very small. The chance of failure increases
exponentially (or worse) with a) the allocation order and b) the ratio of
allocated to free memory. The second of these you can control: the higher
you set zone->pages_min, the better the chance your higher order
allocations have of succeeding. Do you want a patch for that, to see if
this works in practice?
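
To put a crude number on that (assuming, unrealistically, that allocated
pages land independently at random): if a fraction f of a zone is free,
an aligned order-k block is entirely free with probability about f^(2^k).
With 20% free that is 0.2 for order 0, but 0.2^8, roughly 2.6 in a
million, for order 3. Raising zone->pages_min is exactly a way of
raising f.
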
Of course it would be much better if we had some positive mechanism for
defragging physical memory instead of just relying on chance and hoping
for the best the way we do now. IMHO, such a mechanism can be built
practically and I'm willing to give it a try. Basically, kswapd would try
to restore a depleted zone order list by scanning mem_map to find buddy
partners for free blocks of the next lower order. This strategy, together
with the one used in the patch above, could largely eliminate atomic
allocation failures. (Although as I mentioned some time ago, getting rid
of them entirely is an impossible problem.)
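
To sketch the kind of pass I mean (pure pseudocode: for_each_free_block()
and try_to_evict_page() don't exist, locking is ignored, and the real
thing would be considerably more careful):

	/*
	 * Hypothetical kswapd pass: when the order k+1 free list is
	 * depleted, walk the order k free list, locate each free
	 * block's buddy in mem_map, and try to evict the buddy's 2^k
	 * pages so the pair can coalesce into an order k+1 block.
	 */
	static void rebuild_order(zone_t *zone, int k)
	{
		struct page *block, *buddy;
		int i;

		for_each_free_block(block, zone, k) {		/* invented */
			buddy = mem_map + ((block - mem_map) ^ (1UL << k));
			for (i = 0; i < (1 << k); i++)
				if (!try_to_evict_page(buddy + i))	/* invented */
					break;	/* buddy not reclaimable */
		}
	}
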
The question remains why we suddenly started seeing more atomic allocation
failures in the recent Linus trees. I'll guess that the changes in
scanning strategy have caused the system to spend more time close to the
zone->pages_min amount of free memory. This idea seems to be supported by
your memstat listings. In some sense, it's been good to have the issue
forced so that we must come up with ways to make atomic and higher order
allocations less fragile.
--
Daniel