The flaw is in doing the put_page_testzero() outside of any locking
which would prevent other CPUs from finding and "rescuing" the zero-recount
page.
CPUA:
if (put_page_testzero()) {
/* Here's the window */
spin_lock(lru_lock);
list_del(page->lru);
CPUB:
spin_lock(lru_lock);
page = list_entry(lru);
page_cache_get(page); /* If this goes from 0->1, we die */
...
page_cache_release(page); /* double free */
2.5.31-mm1 has tests which make this race enormously improbable [1],
but it's still there.
It's that `put' outside the lock which is the culprit. Normally, we
handle that with atomic_dec_and_lock() (inodes) or by manipulating
the refcount inside an area which has exclusion (page presence in
pagecache).
The sane, sensible and sucky way is to always take the lock:
page_cache_release(page)
{
spin_lock(lru_lock);
if (put_page_testzero(page)) {
lru_cache_del(page);
__free_pages_ok(page, 0);
}
spin_unlock(lru_lock);
}
Because this provides exclusion from another CPU discovering the page
via the LRU.
So taking the above as the design principle, how can we speed it up?
How to avoid taking the lock in every page_cache_release()? Maybe:
page_cache_release(page)
{
if (page_count(page) == 1) {
spin_lock(lru_lock);
if (put_page_testzero(page)) {
if (PageLRU(page))
__lru_cache_del(page);
__free_pages_ok(page);
}
spin_unlock(lru_lock);
} else {
atomic_dec(&page->count);
}
}
This is nice and quick, but racy. Two concurrent page_cache_releases
will create a zero-ref unfreed page which is on the LRU. These are
rare, and can be mopped up in page reclaim.
The above code will also work for pages which aren't on the LRU. It will
take the lock unnecessarily for (say) slab pages. But if we put slab pages
on the LRU then I suspect there are so few non-LRU pages left that it isn't
worth bothering about this.
[1] The race requires that the CPU running page_cache_release find a
five instruction window against the CPU running shrink_cache. And
that they be operating against the same page. And that the CPU
running __page_cache_release() then take an interrupt in a 3-4
instruction window. And that the interrupt take longer than the
runtime for shrink_list. And that the page be the first page in
the pagevec.
It's a heat-death-of-the-universe-race, but even if it were to be
ignored, the current code is too complex.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/