LTT benchmarks and patch update

Karim Yaghmour (karim@opersys.com)
Sun, 27 Oct 2002 11:46:57 -0500


First, here's the latest LTT patch:
http://opersys.com/ftp/pub/LTT/ExtraPatches/patch-ltt-linux-2.5.44-vanilla-021026-2.2.bz2

We've run the latest LTT through a series of stress tests. The tests
demonstrate that LTT has negligible impact when it is compiled out, or when
it is compiled in with the trace daemon off. Even under the most stressful
LMbench tests, the impact in these two configurations is minimal.

We ran 2 sets of tests:
1- Measuring the overall execution time of 3 tasks: a complete 2.5 kernel
build, a bzip2 on a 2.5 kernel tar archive, and the total time to run
LMbench.
2- A complete LMbench run.
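
Test set #1 measures plain wall-clock time. Something along the following lines would do for timing a task and averaging it over several runs. This is a minimal sketch that assumes each task can be launched as a single shell command. The helper name `time_command` is hypothetical, and this is not the script that produced the numbers in this message.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=10):
    """Run shell command `cmd` `runs` times; return mean wall-clock seconds.

    Hypothetical helper for illustration only (not part of LTT or LMbench).
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError on a non-zero exit status,
        # so a failed run is never silently averaged in.
        subprocess.run(cmd, shell=True, check=True)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)
```

For example, `time_command("bzip2 -k -f linux-2.5.44.tar")` would average 10 runs of a compression task similar to the Compress column. The file name is only illustrative.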

Each set was run in 4 different configurations (configuration B was
only run for test set #2, since its micro-benchmarks already show
no difference from vanilla):
A- A vanilla 2.5.44 kernel
B- A patched 2.5.44 kernel with tracing off (compiled out)
C- A patched 2.5.44 kernel with tracing on (compiled in), daemon off
D- A patched 2.5.44 kernel with tracing on, daemon running

All tests were run using the lockless scheme with TSC timestamping.

With the LTT patch applied but tracing disabled, the results show no
measurable impact on kernel performance. Some numbers even suggest that the
patched kernel is faster by fractions of a percent, which reinforces our
belief that the measured differences are within the noise.

Even with tracing built in, the difference is minimal, if measurable at
all. Test set #1 shows a slowdown of 0.5% or less, while test set #2 shows
almost no difference for most operations, including null syscalls; the
select latencies are the clearest exception.

As expected, there is nevertheless a cost to having the trace daemon
running, with all kernel events being traced and logged to disk. Even then,
the impact on the real-life workloads (test set #1) is at most around 2.0%,
which is quite low given the quantity of data being collected. Some
micro-benchmarks do show a relatively large impact. On the UP system, for
example, pipe latency rises from about 6.5 to about 485 microseconds.
Because the large majority of applications behave much more like test set
#1 than test set #2, however, we believe the results are acceptable.
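
To see which micro-benchmarks account for the larger impact, you can divide each configuration D latency by the vanilla (A) latency. The sketch below does this for the UP *Local* Communication latencies rows reported further down. The function name `slowdown_factors` is hypothetical.

```python
def slowdown_factors(baseline, measured):
    """Return {benchmark: measured / baseline}, largest slowdown first.

    Hypothetical helper. Both arguments map benchmark names to latencies
    (smaller is better), so a factor above 1.0 means "measured" is slower.
    """
    factors = {name: measured[name] / baseline[name] for name in baseline}
    return dict(sorted(factors.items(), key=lambda kv: kv[1], reverse=True))


# UP *Local* Communication latencies in microseconds (rows A and D).
UP_LAT_A = {"Pipe": 6.509, "AF/Unix": 11.130, "UDP": 18.3123, "RPC/UDP": 45.6148,
            "TCP": 25.7486, "RPC/TCP": 63.1761, "TCPconn": 102.210}
UP_LAT_D = {"Pipe": 484.605, "AF/Unix": 168.968, "UDP": 116.732, "RPC/UDP": 1222.69,
            "TCP": 269.400, "RPC/TCP": 556.922, "TCPconn": 121.171}

for name, factor in slowdown_factors(UP_LAT_A, UP_LAT_D).items():
    print(f"{name:8s} {factor:6.1f}x")
```

On these rows, pipe latency stands out at roughly 74 times the vanilla value. TCP connection setup, by contrast, rises by less than 20%.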

Here's a summary of test set #1:
-------------------------------------------------------------
| Kernel    | Compile | Compress | LMbench SMP | LMbench UP |
-------------------------------------------------------------
| A (secs)  |     638 |      200 |         867 |     272.05 |
-------------------------------------------------------------
| C (secs)  |     640 |      201 |         871 |     270.87 |
| delta (%) |    0.3% |     0.5% |        0.5% |     -0.43% |
-------------------------------------------------------------
| D (secs)  |     651 |      204 |         872 |     275.08 |
| delta (%) |    2.0% |     2.0% |        0.6% |      1.11% |
-------------------------------------------------------------
[The Compile and Compress columns are averages of 10 runs on the SMP system;
the LMbench SMP column is from a single run; the LMbench UP column is from 5
runs without the disk tests.]
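
Each delta row gives the relative change in elapsed time against the vanilla kernel: delta = (T_config - T_A) / T_A * 100. The small sketch below recomputes the deltas from the seconds in the table. The helper name `delta_pct` is hypothetical.

```python
def delta_pct(baseline, measured):
    """Relative change of `measured` vs `baseline` in percent (positive = slower).

    Hypothetical helper for illustration only.
    """
    return (measured - baseline) / baseline * 100.0


COLUMNS = ("Compile", "Compress", "LMbench SMP", "LMbench UP")
# Elapsed seconds from the test set #1 summary table.
SECONDS = {
    "A": (638, 200, 867, 272.05),   # vanilla 2.5.44
    "C": (640, 201, 871, 270.87),   # tracing on, daemon off
    "D": (651, 204, 872, 275.08),   # tracing on, daemon running
}

for config in ("C", "D"):
    deltas = (delta_pct(a, t) for a, t in zip(SECONDS["A"], SECONDS[config]))
    print(config, "  ".join(f"{col} {d:+.2f}%" for col, d in zip(COLUMNS, deltas)))
```

A positive value means the patched kernel took longer. For the LMbench UP column under configuration C, the seconds above give about -0.43%.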

Test set #2 was run on both a UP and an SMP system.

On the UP system, test set #2 was run 10 times, and the results below are the
average of these runs (the averages were computed with the tools available from:
http://home.earthlink.net/~rwhron/kernel/lmbench_comparison.html). This is a
complete LMbench run, including the disk tests. It is not the same run as the
5-run LMbench UP measurement in test set #1 above.
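
The averaging step itself is simple. The tools linked above are not reproduced here. The following is a hypothetical stand-in that assumes each LMbench run has already been parsed into a mapping from benchmark name to value. The function name `average_runs` is ours.

```python
from statistics import mean


def average_runs(runs):
    """Average each benchmark across several parsed LMbench runs.

    Hypothetical helper. `runs` is a non-empty list of dicts mapping
    benchmark name -> value; all runs must report the same benchmarks.
    """
    if not runs:
        raise ValueError("need at least one run")
    names = runs[0].keys()
    if any(run.keys() != names for run in runs):
        raise ValueError("runs report different benchmarks")
    return {name: mean(run[name] for run in runs) for name in names}
```

Rejecting runs with mismatched benchmark sets is a design choice of this sketch. Without it, a run that aborted partway would bias some columns but not others.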
#######################################################################
L M B E N C H 2 . 0 S U M M A R Y
------------------------------------

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
           null    null                    open  signal  signal    fork  execve /bin/sh
kernel     call     I/O    stat   fstat   close install  handle process process process
------- ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
A         0.346 0.50314   4.193   0.730   5.186   0.843   3.201   292.3  1014.7  4985.2
B         0.346 0.50274   4.441   0.730   5.449   0.843   3.180   288.1  1002.0  5024.6
C         0.352 0.56244   4.241   0.735   5.432   0.849   3.156   301.0  1038.3  5134.7
D         0.963  1.5442   6.282   1.367   9.079   1.662   8.903   345.2  2472.6  6622.2

File select - times in microseconds - smaller is better
-------------------------------------------------------
         select  select  select  select  select  select  select  select
kernel    10 fd  100 fd  250 fd  500 fd  10 tcp 100 tcp 250 tcp 500 tcp
------- ------- ------- ------- ------- ------- ------- ------- -------
A         1.992   8.518  19.693  37.959   3.557 33.3852 51.2701 113.087
B         1.979   8.522  19.684  37.944   2.708 15.7315 37.7146 73.9382
C         2.594  14.305  34.586  67.002   3.265 22.8394 50.5247 100.358
D         6.015  39.079  94.875 232.318   8.283 85.7979  111.75  285.22

Context switching with 0K - times in microseconds - smaller is better
---------------------------------------------------------------------
         2proc/0k  4proc/0k  8proc/0k 16proc/0k 32proc/0k 64proc/0k 96proc/0k
kernel  ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------- --------- --------- --------- --------- --------- --------- ---------
A           2.413     2.348     2.508     2.745     3.430     4.707     5.860
B           2.681     2.455     2.559     2.904     3.600     4.912     6.034
C           3.048     2.852     2.954     3.176     3.865     5.285     6.558
D           3.572     3.434     3.691     3.895     4.999     6.881     8.229

Context switching with 4K - times in microseconds - smaller is better
---------------------------------------------------------------------
         2proc/4k  4proc/4k  8proc/4k 16proc/4k 32proc/4k 64proc/4k 96proc/4k
kernel  ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------- --------- --------- --------- --------- --------- --------- ---------
A           2.624     4.423     4.458     5.413     8.061    12.465    15.230
B           2.549     3.932     4.001     5.690     9.042    12.923    15.309
C           3.185     4.882     5.174     6.220     8.887    13.114    15.681
D           4.244     5.326     5.310     6.713     9.837    15.242    18.726

Context switching with 8K - times in microseconds - smaller is better
---------------------------------------------------------------------
         2proc/8k  4proc/8k  8proc/8k 16proc/8k 32proc/8k 64proc/8k 96proc/8k
kernel  ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------- --------- --------- --------- --------- --------- --------- ---------
A           2.947     5.219     5.536     8.439    13.366    21.788    25.691
B           2.904     5.689     6.158     8.515    14.220    22.450    28.341
C           3.655     5.809     6.368     9.488    15.176    22.614    26.432
D           4.908     6.618     6.813     9.620    15.692    23.165    26.995

Context switching with 16K - times in microseconds - smaller is better
----------------------------------------------------------------------
        2proc/16k 4proc/16k 8proc/16k 16prc/16k 32prc/16k 64prc/16k 96prc/16k
kernel  ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------- --------- --------- --------- --------- --------- --------- ---------
A           3.696     8.210    12.007    19.677    32.724    43.699    46.730
B           3.614     8.830    12.493    20.421    32.938    43.577    46.612
C           4.006     8.649    11.366    20.512    34.509    44.386    47.505
D           5.297     9.768    12.313    19.213    35.063    46.570    49.722

Context switching with 32K - times in microseconds - smaller is better
----------------------------------------------------------------------
        2proc/32k 4proc/32k 8proc/32k 16prc/32k 32prc/32k 64prc/32k 96prc/32k
kernel  ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------- --------- --------- --------- --------- --------- --------- ---------
A           5.137    15.498    28.268    54.907    77.701    88.583    84.718
B           6.458    17.387    27.123    52.752    76.295    84.247    84.857
C           6.404    17.144    32.665    54.851    76.668    84.841    85.616
D           6.979    15.674    25.008    53.044    79.204    87.115    87.739

Context switching with 64K - times in microseconds - smaller is better
----------------------------------------------------------------------
        2proc/64k 4proc/64k 8proc/64k 16prc/64k 32prc/64k 64prc/64k 96prc/64k
kernel  ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------- --------- --------- --------- --------- --------- --------- ---------
A          25.314    41.236    92.619   141.911   157.626   159.468   159.455
B          31.503    47.503    96.400   141.198   157.873   159.476   159.537
C          27.062    45.031    93.894   142.703   158.984   160.380   160.481
D          27.648    36.102    81.103   147.793   162.497   163.387   163.377

File create/delete and VM system latencies in microseconds - smaller is better
----------------------------------------------------------------------------
             0K      0K      1K      1K      4K      4K     10K     10K    Mmap   Prot   Page
kernel   Create  Delete  Create  Delete  Create  Delete  Create  Delete Latency  Fault  Fault
------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------ ------
A         59.18   19.39  100.00   35.14  102.86   36.10  169.11   42.11  1371.1  1.045   4.00
B         59.47   20.97  100.92   36.18  102.06   35.84  168.89   42.61  1502.3  0.867   4.00
C         59.17   20.04  101.04   33.85  102.10   33.76  169.36   41.13  1368.5  0.948   5.80
D         67.19   23.82  112.33   40.62  115.90   41.30  187.37   50.20  1907.2  6.283   5.70

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
kernel     Pipe AF/Unix     UDP RPC/UDP     TCP RPC/TCP TCPconn
------- ------- ------- ------- ------- ------- ------- -------
A         6.509  11.130 18.3123 45.6148 25.7486 63.1761 102.210
B         6.443   9.948 17.0914 46.7833 25.7519 61.3685 102.340
C         6.206  11.718 18.8520 47.5403 25.2025 67.8464 100.232
D       484.605 168.968 116.732 1222.69 269.400 556.922 121.171

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
                                   File    Mmap   Bcopy   Bcopy  Memory  Memory
kernel     Pipe AF/Unix     TCP  reread  reread  (libc)  (hand)    read   write
------- ------- ------- ------- ------- ------- ------- ------- ------- -------
A        323.57  221.12   95.28  203.32  386.49  191.25  190.68  411.29  293.50
B        324.01  205.28  129.15  208.22  386.53  191.13  190.76  414.87  293.78
C        324.79  200.83  111.97  206.33  385.56  191.11  190.70  414.64  293.52
D        145.22  158.78   98.09  198.87  384.65  190.64  188.14  413.70  292.82

*Local* More Communication bandwidths in MB/s - bigger is better
----------------------------------------------------------------
           File    Mmap Aligned Partial Partial Partial Partial
OS         open    open   Bcopy   Bcopy    Mmap    Mmap    Mmap   Bzero
          close   close  (libc)  (hand)    read   write  rd/wrt    copy    HTTP
------- ------- ------- ------- ------- ------- ------- ------- ------- -------
A        206.83  293.34  189.77  230.56  674.92  330.34  312.93  293.96   8.969
B        208.91  298.47  190.07  231.22  674.97  333.76  313.55  294.28   8.867
C        207.15  295.53  189.75  230.87  674.32  332.87  313.11  293.99   8.890
D        208.27  273.29  188.14  229.86  672.54  329.64  312.23  293.30   7.122

Memory latencies in nanoseconds - smaller is better
---------------------------------------------------
kernel    Mhz    L1 $    L2 $  Main mem
------- ----- ------- ------- ---------
A         800   3.783  44.180    172.92
B         800   3.781  38.313    172.94
C         800   3.788  48.874    173.18
D         800   3.789  38.492    173.51
#######################################################################

On a 4-way SMP system, test set #2 was run 5 times, and the results below are
the average of those runs (the same tools as above were used to extract this
data):
#######################################################################
L M B E N C H 2 . 0 S U M M A R Y
------------------------------------

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
           null    null                    open  signal  signal    fork  execve /bin/sh
kernel     call     I/O    stat   fstat   close install  handle process process process
------- ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
A         0.501 0.92795   5.270   1.216   6.773   1.293   4.494   233.0   716.8  2912.9
B         0.501 0.93778   5.213   1.212   6.717   1.292   4.481   232.0   716.7  2901.2
C         0.504 1.08863   4.982   1.206   6.970   1.293   4.442   242.8   747.9  3053.7
D         1.916 3.12815   7.734   3.698  11.709   2.858   7.559   277.0   925.5  3959.3

File select - times in microseconds - smaller is better
-------------------------------------------------------
         select  select  select  select  select  select  select  select
kernel    10 fd  100 fd  250 fd  500 fd  10 tcp 100 tcp 250 tcp 500 tcp
------- ------- ------- ------- ------- ------- ------- ------- -------
A         4.411  27.406  66.021 130.107   5.246 35.1106 87.3160 168.771
B         4.424  27.378  65.973 130.098   5.170 35.0629 85.2186 172.147
C         5.754  42.016 100.761 196.230   6.670 50.1556 121.856 242.663
D        13.142 102.394 249.196 500.868  14.271 113.352  283.88 546.154

Context switching with 0K - times in microseconds - smaller is better
---------------------------------------------------------------------
         2proc/0k  4proc/0k  8proc/0k 16proc/0k 32proc/0k 64proc/0k 96proc/0k
kernel  ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------- --------- --------- --------- --------- --------- --------- ---------
A           1.778     1.944     2.370     2.926     3.450     3.522     4.338
B           1.724     1.910     2.028     2.614     3.010     3.412     4.748
C           1.906     2.074     2.348     2.978     3.576     3.326     4.110
D           6.046     5.448     6.466     5.654     5.800     5.488     5.280

Context switching with 4K - times in microseconds - smaller is better
---------------------------------------------------------------------
         2proc/4k  4proc/4k  8proc/4k 16proc/4k 32proc/4k 64proc/4k 96proc/4k
kernel  ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------- --------- --------- --------- --------- --------- --------- ---------
A           2.988     5.778     7.428     8.454     8.496     8.394     7.934
B           6.818     6.644     7.614     8.698     8.548     8.598     7.906
C           5.528     6.526     7.810     8.820     8.422     8.724     8.348
D           8.064     8.208     8.606     9.034     8.774     8.482     8.906

Context switching with 8K - times in microseconds - smaller is better
---------------------------------------------------------------------
         2proc/8k  4proc/8k  8proc/8k 16proc/8k 32proc/8k 64proc/8k 96proc/8k
kernel  ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------- --------- --------- --------- --------- --------- --------- ---------
A           8.388     8.816     9.804     9.546     8.758     9.364    14.334
B           8.402     8.098     8.990     9.326     8.204     9.286    14.350
C           6.636     6.758     6.884     7.170     7.416    10.052    13.452
D           8.462     8.994     9.986     9.284     9.380     9.726     9.978

Context switching with 16K - times in microseconds - smaller is better
----------------------------------------------------------------------
        2proc/16k 4proc/16k 8proc/16k 16prc/16k 32prc/16k 64prc/16k 96prc/16k
kernel  ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------- --------- --------- --------- --------- --------- --------- ---------
A           9.358     8.954     9.740    10.228    11.780    23.400    27.402
B          11.792    11.962    12.350    11.138    11.658    26.948    36.402
C          11.742    12.130    12.420    12.922    12.088    21.380    31.318
D          12.266    12.572    12.712    12.906    12.720    13.940    20.930

Context switching with 32K - times in microseconds - smaller is better
----------------------------------------------------------------------
        2proc/32k 4proc/32k 8proc/32k 16prc/32k 32prc/32k 64prc/32k 96prc/32k
kernel  ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------- --------- --------- --------- --------- --------- --------- ---------
A          16.122    17.358    17.826    17.946    22.550    48.360    67.204
B          18.504    17.958    17.784    17.574    20.966    49.100    71.354
C          16.342    17.150    17.372    17.746    24.338    45.454    66.822
D          19.286    18.684    18.322    18.990    24.352    51.198    68.866

Context switching with 64K - times in microseconds - smaller is better
----------------------------------------------------------------------
        2proc/64k 4proc/64k 8proc/64k 16prc/64k 32prc/64k 64prc/64k 96prc/64k
kernel  ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch ctx swtch
------- --------- --------- --------- --------- --------- --------- ---------
A          27.460    31.226    29.184    31.742    50.472   135.096   176.156
B          28.674    28.366    27.690    29.234    53.842   130.884   178.004
C          28.816    28.060    28.566    29.886    51.270   141.374   178.096
D          28.772    28.816    28.736    42.250    67.816   159.820   180.094

File create/delete and VM system latencies in microseconds - smaller is better
----------------------------------------------------------------------------
             0K      0K      1K      1K      4K      4K     10K     10K    Mmap   Prot   Page
kernel   Create  Delete  Create  Delete  Create  Delete  Create  Delete Latency  Fault  Fault
------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------ ------
A         88.76   33.00  162.68   58.29  187.13   58.17  268.37   70.43  4653.0  1.086   4.00
B         90.26   32.96  164.18   58.25  166.26   58.42  262.70   70.80  4638.6  1.053   4.00
C         91.75   31.96  164.47   57.70  168.66   58.05  261.68   71.00  5020.2  1.359   4.00
D        101.59   35.90  154.57   63.19  155.20   63.26  244.55   77.66  6931.2  2.012   7.00

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
kernel     Pipe AF/Unix     UDP RPC/UDP     TCP RPC/TCP TCPconn
------- ------- ------- ------- ------- ------- ------- -------
A         8.381  20.217  33.332 59.6795 41.2206 74.5685 117.646
B         8.223  15.084 33.4689 55.7983 38.1489 71.8043 121.885
C         9.135  16.638 39.5581 67.6097 47.3104 82.4835 118.360
D        19.460  29.729 49.8840 465.325 61.6741 131.809 131.579

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
                                   File    Mmap   Bcopy   Bcopy  Memory  Memory
kernel     Pipe AF/Unix     TCP  reread  reread  (libc)  (hand)    read   write
------- ------- ------- ------- ------- ------- ------- ------- ------- -------
A        509.65  504.18  172.52  290.36  330.88  193.43  157.06  330.87  201.57
B        514.71  503.90  154.21  289.53  330.19  193.63  156.88  330.18  201.73
C        474.63  496.20  144.88  289.44  330.55  194.23  158.10  330.58  200.94
D        360.74  344.55  137.47  284.20  327.97  195.28  159.18  328.65  203.26

*Local* More Communication bandwidths in MB/s - bigger is better
----------------------------------------------------------------
           File    Mmap Aligned Partial Partial Partial Partial
OS         open    open   Bcopy   Bcopy    Mmap    Mmap    Mmap   Bzero
          close   close  (libc)  (hand)    read   write  rd/wrt    copy    HTTP
------- ------- ------- ------- ------- ------- ------- ------- ------- -------
A        292.56  269.89  192.90  167.37  785.96  202.14  202.73  350.89  10.298
B        291.84  270.12  191.63  165.61  784.42  202.30  202.98  350.96  10.310
C        291.89  265.37  192.08  165.79  785.20  201.56  202.06  350.30   9.952
D        285.84  244.90  194.22  169.39  781.87  203.51  205.93  349.57   7.180

Memory latencies in nanoseconds - smaller is better
---------------------------------------------------
kernel    Mhz    L1 $    L2 $  Main mem
------- ----- ------- ------- ---------
A         700   4.301  12.907    182.16
B         700   4.301  12.908    182.13
C         700   4.303  12.915    182.30
D         700   4.326  12.989    183.39
#######################################################################

Karim