It really makes the cricial difference (deadlock avoidance is the only
reason there is the serie thing in the kmaps). see the thread with Ingo
about mempool, it's exactly the same problem.
in short: the same task cannot get two entries from the same serie
(/mempool) or the system can deadlock.
think of this, assume 3 entries in the pool and a single pool, all tasks
are running fork(2) so they need two kmaps, assume we have a single pool:
task1 task2 task3 task4
kmap kmap kmap
kmap <- (hangs)
(hangs)->kmap kmap kmap kmap
deadlock
thanks to the kmap series we cannot deadlock instead:
task1 task2 task3 task4
kmap kmap kmap
kmap (hangs)
kmap2 kmap2 kmap2
kunmap .....
kunmap2......
kmap2
kunmap
kunmap2
> > in fork the pte_offset kmap could be an atomic one, but atomic are more
> > costly with the invlpg I believe, to do it in a raw the 2 variant with a
> > different serie should be faster for fork(2).
>
> Yes, maybe, maybe not.
I guess it depends on the pagetable overhead and on the arch details
(invlpg overhead). If there are less than 1024 pagetables kmap to copy
kmap_serie is going to run faster than kmap_atomic because it may not
need any tlb flush at all, but if there are more we may need to do a
strong global tlb flush, so it is not obvious what is better. I usually
try to avoid the atomic kmaps unless we can't sleep, that's the main
reason for that choice. But I'm open to suggestions, the best would be
to benchmark it first with a few pagetables, and then with an huge
number of pagetables to get some number out of it. At least it is
something that we can change freely anytime as soon as we have some
number and in the meantime it is not a critical thing to sort out.
>
> > > 4. You've lifted the PAE restriction to LAST_PKMAP 512 in i386/highmem.h,
> > > and use pkmap_page_table as one long array in mm/highmem.c, but I
> > > don't see where you enforce the contiguity of page table pages in
> > > i386/mm/init.c. (I do already have a patch for lifting the 1024,512
> > > kmaps limit, simplifying i386/mm/init.c, we've been using for months:
> > > I can update that from 2.4.9 if you'd like it.)
> >
> > correct, currently it works because the bootmem tends to return
> > physically contigous pages but it is not enforced and it may trigger
> > with a weird e820 layout. If you've a patch very feel free to post it!! :)
> > thanks for the review.
>
> Okay, I've attached the existing patch against 2.4.9 below. It originated
> from my Large PAGE_SIZE patch, where it became clear that the fixmaps are
> appropriate for MMUPAGE_SIZE maps but the kmaps for PAGE_SIZE maps (later,
> with Ben's work on Large PAGE_CACHE_SIZE, it became clear that actually
> kmaps should be PAGE_CACHE_SIZE, but I've not added in that complication).
>
> So the atomic kmaps are no longer done by fixmap.h, but come after the
> vmallocs, before the regular kmaps, in highmem.h. It also doesn't waste
> so much virtual space either end of the vmalloc area. The mm/init.c diff
> looks a lot, mainly because the result is a lot simpler. If it would
> help you for me to update to a release of your choice, let me know
> (perhaps the next aa, with your mods in).
the below patch looks overkill, I don't want to change all that code,
I'd just need the few liner to have an array of pte pointers in function
of PTRS_PER_PTE and LAST_KMAP. Either that or even much better just
allocate all the pagetables at once with a single bootmem call, that is
certainly a safe approch too, just count the number of pte needed for
the persistente kmaps and then do a single bootmem call to allocate all
of them at once, this second approch sounds the most efficient one. If
you want to make a patch against pte-highmem-5 feel free to go ahead. It
should be a few liner. I didn't checked what other good things are
getting changed by the big patch, but I'd prefer to have only the
necessary changes at least for the 2.4 version of the patch.
> > > 5. Shouldn't mm/vmscan.c be in the patch?
> >
> > can you elaborate?
>
> swap_out_pmd() uses pte_offset() on user page tables,
> so doesn't it need to pte_kunmap() afterwards?
Ouch, I completly overlooked vmscan.c!! So this definitely explains
hangs during swapout activities. Many thanks, that was an important one
to catch!
Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/