Here is a summary of my experiments with different per-cpu allocator
methods. The following methods were compared:
1. Static per-cpu areas
2. kmalloc_percpu with NR_CPUS pointers and one extra dereference -- the
current implementation, no interlacing (kmalloc_percpu_current)
3. kmalloc_percpu with pointer arithmetic, but no interlacing
(kmalloc_percpu_new) -- methods 2 and 3 are sketched just after this list
4. alloc_percpu using Rusty's block allocator and the shared offset table
(alloc_percpu_block)
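To make the addressing difference between methods 2 and 3 concrete, here
is a rough sketch (my own illustration; the struct, macro and stride
names are made up, not the actual patch code):

/* Method 2 (kmalloc_percpu_current): the handle is a table of
 * NR_CPUS pointers, one allocation per CPU, so reaching a CPU's
 * copy costs one extra memory dereference. */
struct percpu_data {
	void *ptrs[NR_CPUS];
};
#define per_cpu_ptr_current(ptr, cpu) \
	(((struct percpu_data *)(ptr))->ptrs[(cpu)])

/* Method 3 (kmalloc_percpu_new): each CPU's copy sits at a fixed
 * stride from the returned base, so the per-cpu pointer is pure
 * arithmetic -- no extra load on the hot path. */
#define per_cpu_ptr_new(ptr, cpu) \
	((void *)((char *)(ptr) + (cpu) * PCPU_STRIDE))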
The application used was speeding up vm_enough_memory with per-cpu
counters, reducing atomic operations. The benchmark was kernbench, and
profile ticks on vm_enough_memory were used to compare the allocator
methods (vm_acct_memory was made inline). All runs were on a 4-processor
PIII Xeon.
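To show what the counter change does on the hot path, here is a minimal
sketch of the idea, written with the static per-cpu variant (the variable
name and threshold are illustrative, not the exact patch): each CPU
batches deltas locally and folds them into the global atomic only when
the local count grows large, so the common case does no atomic operation
and touches no shared cacheline.

static DEFINE_PER_CPU(long, committed_space);
static atomic_t vm_committed_space;

#define ACCT_THRESHOLD	16	/* illustrative batching threshold */

static inline void vm_acct_memory(long pages)
{
	long *local = &get_cpu_var(committed_space);

	*local += pages;
	if (*local > ACCT_THRESHOLD || *local < -ACCT_THRESHOLD) {
		/* fold the batched delta into the global counter */
		atomic_add(*local, &vm_committed_space);
		*local = 0;
	}
	put_cpu_var(committed_space);
}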
To summarise:
1. Static per-cpu areas were 6.5% better than kmalloc_percpu_current
2. kmalloc_percpu_new and static per-cpu areas had similar results.
3. alloc_percpu results were similar to static per-cpu areas and
kmalloc_percpu_new
4. The extra dereference in alloc_percpu was not significant, but note
that alloc_percpu was interlaced and kmalloc_percpu_new wasn't. The insn
profile seemed to indicate that the extra cost of alloc_percpu's memory
dereference was offset by interlacing, i.e. objects sharing the same
cacheline (see the layout sketch below) -- but then insn profiles are
only indicative, not accurate.
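To make the interlacing point concrete, here is my reading of the two
layouts (an illustration only; __per_cpu_block and the offset macro are
assumed names, not the allocator's actual internals):

/*
 * Two small per-cpu objects A and B on 4 CPUs:
 *
 *   not interlaced:  [A0 A1 A2 A3] [B0 B1 B2 B3]
 *   interlaced:      [A0 B0 ..] [A1 B1 ..] [A2 B2 ..] [A3 B3 ..]
 *
 * Interlaced, CPU n's copies of A and B live in the same per-cpu
 * block and can share a cacheline, which is what appears to pay
 * back alloc_percpu's extra dereference.
 */
void *__per_cpu_block[NR_CPUS];		/* one block per CPU */

/* The shared offset table means an object is addressed by the same
 * offset within every CPU's block: */
#define per_cpu_ptr_interlaced(offset, cpu) \
	((void *)((char *)__per_cpu_block[(cpu)] + (offset)))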
Todo:
I still have to see how an interlaced kmalloc_percpu with pointer
arithmetic fares in these tests (once I have it working); that should
hopefully make the performance picture of the per-cpu allocators clear.
Thanks,
Kiran