Re: scalable kmap (was Re: vm lock contention reduction) (fwd)

Andrew Morton (akpm@zip.com.au)
Wed, 10 Jul 2002 20:06:22 -0700


Hanna Linder wrote:
>
> ...
> Andrew and Martin,
>
> I ran this updated patch on 2.5.25 with dbench on
> the 8-way with 4 Gb of memory compared to clean 2.5.25.
> I saw a significant improvement in throughput about 15%
> (averaged over 5 runs each).

Thanks, Hanna.

The kernel compile test isn't a particularly heavy user of
copy_to_user(), whereas with RAM-only dbench, copy_*_user()
is almost the only thing it does. So that makes sense.

Tried dbench on the 2.5G 4xPIII Xeon: no improvement at all.
This thing seems to have quite poor memory bandwidth - maybe
250 megabyte/sec downhill with the wind at its tail.

> Included is the pretty picture (akpm-2525.png) the
> data that picture came from (akpm-2525.data) and the raw
> results of the runs with min/max and timing results
> (2525akpmkmaphi and 2525clnhi).
> I believe the drop at around 64 clients is caused by
> memory swapping leading to increased disk accesses since the
> time increased by 200% in direct correlation with the decreased
> throughput.

Yes. The test went to disk. There are two reasons why
it will do this:

1: Some dirty data was in memory for more than 30-35 seconds or

2: More than 40% of memory is dirty.

In your case, the 64-client run was taking 32 seconds. After that
the disks lit up. Once that happens, dbench isn't a very good
benchmark. It's an excellent benchmark when it's RAM-only
though. Very repeatable and hits lots of code paths which matter.

You can run more clients before the disk I/O cuts in by
increasing /proc/sys/vm/dirty_expire_centisecs and
/proc/sys/vm/dirty_*_ratio.

The patch you tested only uses the atomic kmap across generic_file_read.
It is reasonable to hope that another 15% or morecan be gained by holding
an atomic kmap across writes as well. On your machine ;)

Here's what oprofile says about `dbench 40' with that patch:

c0140f1c 402 0.609543 __block_commit_write
c013dfd4 413 0.626222 vfs_write
c01402cc 431 0.653515 __find_get_block
c013a895 472 0.715683 .text.lock.highmem
c017fe30 494 0.749041 ext2_get_block
c012cef0 564 0.85518 unlock_page
c013ee80 564 0.85518 fget
c01079f4 571 0.865794 apic_timer_interrupt
c01e8ecc 594 0.900669 radix_tree_lookup
c013da90 597 0.905218 generic_file_llseek
c01514b4 607 0.92038 __d_lookup
c0106ff8 687 1.04168 system_call
c013a02c 874 1.32523 kunmap_high
c0148388 922 1.39801 link_path_walk
c0140b00 1097 1.66336 __block_prepare_write
c01346d0 1138 1.72552 rmqueue
c01127ac 1243 1.88473 smp_apic_timer_interrupt
c0139eb8 1514 2.29564 kmap_high
c0105368 6188 9.38272 poll_idle
c012d8a8 9564 14.5017 file_read_actor
c012ea70 21326 32.3361 generic_file_write

Not taking a kmap in generic_file_write is a biggish patch - it
means changing the prepare_write/commit_write API and visiting
all filesystems. The API change would be: core kernel no longer
holds a kmap across prepare/commit. If the filesystem wants one
for its own purposes then it gets to do it for itself, possibly in
its prepare_write().

I think I'd prefer to get some additional testing and understanding
before undertaking that work. It arguably makes sense as a small
cleanup/speedup anyway, but that's not a burning issue.

hmm. I'll do just ext2, and we can take another look then.

-
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/