OK, a closer look. This is on a dual 1.7GHz P4, with HT disabled (involuntarily,
grr.) Looks like an 8-10% hit on context-switch-intensive stuff.
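For reference, the lmbench 2p/0K ctxsw column below comes from two processes
handing a token back and forth. A rough stand-in for that measurement (my own
sketch, not the lmbench code; lmbench additionally subtracts the pipe overhead,
this doesn't) looks like:

/*
 * Two processes pass one byte back and forth through a pair of pipes;
 * each round trip costs two context switches.  Rough stand-in for the
 * lmbench 2p/0K case, not the lmbench code itself.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

#define LOOPS 100000

int main(void)
{
	int p1[2], p2[2], i;
	char c = 0;
	struct timeval t0, t1;

	pipe(p1);
	pipe(p2);

	if (fork() == 0) {
		/* child: echo every byte straight back */
		for (i = 0; i < LOOPS; i++) {
			read(p1[0], &c, 1);
			write(p2[1], &c, 1);
		}
		exit(0);
	}

	gettimeofday(&t0, NULL);
	for (i = 0; i < LOOPS; i++) {
		write(p1[1], &c, 1);
		read(p2[0], &c, 1);
	}
	gettimeofday(&t1, NULL);
	wait(NULL);

	printf("%.2f usec per switch\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e6 +
		(t1.tv_usec - t0.tv_usec)) / (2.0 * LOOPS));
	return 0;
}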
2.5.54+BK
=========
Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
i686-linu Linux 2.5.54      3      4     11      6     48      12      53
*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host                 OS 2p/0K  Pipe AF    UDP  RPC/   TCP  RPC/ TCP
                        ctxsw       UNIX         UDP         TCP conn
tbench 32: (85k switches/sec)
Throughput 114.633 MB/sec (NB=143.291 MB/sec 1146.33 MBit/sec)
Throughput 114.157 MB/sec (NB=142.696 MB/sec 1141.57 MBit/sec)
Throughput 115.095 MB/sec (NB=143.869 MB/sec 1150.95 MBit/sec)
pollbench 1 100 5000 (118k switches/sec)
result with handles 1 processes 100 loops 5000:time 8.371942 sec.
result with handles 1 processes 100 loops 5000:time 8.381814 sec.
result with handles 1 processes 100 loops 5000:time 8.367576 sec.
pollbench 2 100 2000 (105k switches/sec)
result with handles 2 processes 100 loops 2000:time 3.694412 sec.
result with handles 2 processes 100 loops 2000:time 3.672226 sec.
result with handles 2 processes 100 loops 2000:time 3.657455 sec.
pollbench 5 100 2000 (79k switches/sec)
result with handles 5 processes 100 loops 2000:time 4.564727 sec.
result with handles 5 processes 100 loops 2000:time 4.783192 sec.
result with handles 5 processes 100 loops 2000:time 4.561067 sec.
2.5.54+BK+broken-wrmsr-backout-patch:
=====================================
Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
i686-linu Linux 2.5.54      3      4     11      6     48      12      53
i686-linu Linux 2.5.54      1      3      8      4     40      10      51
*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host                 OS 2p/0K  Pipe AF    UDP  RPC/   TCP  RPC/ TCP
                        ctxsw       UNIX         UDP         TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
i686-linu Linux 2.5.54      3    14   22    26          30         57
i686-linu Linux 2.5.54      1    12   28    22          32         58
tbench 32:
Throughput 121.701 MB/sec (NB=152.126 MB/sec 1217.01 MBit/sec)
Throughput 124.958 MB/sec (NB=156.197 MB/sec 1249.58 MBit/sec)
Throughput 124.086 MB/sec (NB=155.107 MB/sec 1240.86 MBit/sec)
pollbench 1 100 5000
result with handles 1 processes 100 loops 5000:time 7.306432 sec.
result with handles 1 processes 100 loops 5000:time 7.352913 sec.
result with handles 1 processes 100 loops 5000:time 7.337019 sec.
pollbench 2 100 2000
result with handles 2 processes 100 loops 2000:time 3.184550 sec.
result with handles 2 processes 100 loops 2000:time 3.251854 sec.
result with handles 2 processes 100 loops 2000:time 3.209147 sec.
pollbench 5 100 2000
result with handles 5 processes 100 loops 2000:time 4.135773 sec.
result with handles 5 processes 100 loops 2000:time 4.117304 sec.
result with handles 5 processes 100 loops 2000:time 4.119047 sec.
The tbench changes should probably be ignored. After profiling tbench
I can say that this throughput difference is _not_ due to the task-switcher
change (__switch_to accounts for only 1% of the profile). I left the numbers
in to show what effect simply relinking and rebooting the kernel can have.
BTW, the pollbench numbers are not stunningly better than those from the 500MHz PIII:
pollbench 1 100 5000
result with handles 1 processes 100 loops 5000:time 9.609487 sec.
pollbench 2 100 2000
result with handles 2 processes 100 loops 2000:time 4.016496 sec.
pollbench 5 100 2000
result with handles 5 processes 100 loops 2000:time 4.917921 sec.
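For anyone without Manfred's source handy: I won't swear to the details, but
the load is basically poll() over some pipe handles plus a one-byte ping-pong
between processes, which is where all the switches come from. A rough stand-in
(my own sketch, not the real pollbench) that takes the same
handles/processes/loops arguments and generates about processes*loops*2
switches, which matches the switch rates quoted above:

/*
 * Not Manfred's pollbench - just my guess at the shape of the load.
 * "procs" pairs of processes ping-pong one byte through two pipes,
 * with the echoing side sitting in poll() over "handles" pollfd
 * slots before each read.  Every round trip is two context switches,
 * so a run generates roughly procs * loops * 2 switches.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <poll.h>
#include <sys/time.h>
#include <sys/wait.h>

static void echoer(int rfd, int wfd, int handles, int loops)
{
	struct pollfd *pfd = calloc(handles, sizeof(*pfd));
	char c;
	int i, j;

	for (j = 0; j < handles; j++) {
		pfd[j].fd = rfd;	/* every slot watches the same pipe */
		pfd[j].events = POLLIN;
	}
	for (i = 0; i < loops; i++) {
		poll(pfd, handles, -1);
		read(rfd, &c, 1);
		write(wfd, &c, 1);
	}
}

int main(int argc, char **argv)
{
	int handles, procs, loops;
	struct timeval t0, t1;
	int i, j;

	if (argc != 4) {
		fprintf(stderr, "usage: %s handles processes loops\n", argv[0]);
		exit(1);
	}
	handles = atoi(argv[1]);
	procs = atoi(argv[2]);
	loops = atoi(argv[3]);

	gettimeofday(&t0, NULL);
	for (i = 0; i < procs; i++) {
		int a[2], b[2];
		char c = 0;

		pipe(a);
		pipe(b);
		if (fork() == 0) {	/* child 1: poll, read, echo back */
			echoer(a[0], b[1], handles, loops);
			exit(0);
		}
		if (fork() == 0) {	/* child 2: send, wait for the echo */
			for (j = 0; j < loops; j++) {
				write(a[1], &c, 1);
				read(b[0], &c, 1);
			}
			exit(0);
		}
	}
	for (i = 0; i < 2 * procs; i++)
		wait(NULL);
	gettimeofday(&t1, NULL);

	printf("handles %d processes %d loops %d: time %f sec\n",
	       handles, procs, loops,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
	return 0;
}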
I didn't profile the P4. John has promised P4 oprofile support for
next week, which will be nice.
I did profile Manfred's pollbench on the PIII (uniprocessor build). Note
that there is only a 5% throughput difference on this machine, and it is all
in __switch_to(). Here the PIII is doing 70k switches/sec.
2.5.54+BK:
c012abbc 534 2.69888 buffered_rmqueue
c0116714 617 3.11837 __wake_up_common
c010a606 635 3.20934 restore_all
c014b038 745 3.76529 do_poll
c013d4dc 757 3.82594 fget
c014551c 766 3.87142 pipe_write
c010a5c4 1249 6.31254 system_call
c014b0f0 1273 6.43384 sys_poll
c01090a4 1775 8.97099 __switch_to
c0116484 1922 9.71394 schedule
2.5.54+BK+backout-patch:
c012abbc 768 3.1024 buffered_rmqueue
c0116714 790 3.19127 __wake_up_common
c010a5e6 809 3.26803 restore_all
c013d4dc 918 3.70834 fget
c014551c 936 3.78105 pipe_write
c014b038 977 3.94668 do_poll
c01090a4 1070 4.32236 __switch_to
c014b0f0 1606 6.48758 sys_poll
c010a5a4 1678 6.77843 system_call
c0116484 2542 10.2686 schedule
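If someone wants a direct number for the raw wrmsr cost on the P4, rather than
inferring it from __switch_to's share of the profile, something like the module
below should do it. Untested sketch: I'm assuming the MSR being rewritten on
every switch is MSR_IA32_SYSENTER_ESP (the load_esp0() path), and it just
writes the current value straight back so nothing changes.

/*
 * Untested sketch: time a burst of wrmsr's with rdtsc.  Reads
 * MSR_IA32_SYSENTER_ESP and writes the same value straight back, so
 * the MSR contents don't change.  Assumes the 2.5-era <asm/msr.h>
 * macros (rdmsr/wrmsr/rdtscll) and a CPU with SEP (a P4 has it).
 */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <asm/msr.h>
#include <asm/system.h>

#define LOOPS 10000

static int __init wrmsr_cost_init(void)
{
	unsigned long lo, hi, flags;
	unsigned long long t0, t1;
	int i;

	rdmsr(MSR_IA32_SYSENTER_ESP, lo, hi);

	local_irq_save(flags);
	rdtscll(t0);
	for (i = 0; i < LOOPS; i++)
		wrmsr(MSR_IA32_SYSENTER_ESP, lo, hi);
	rdtscll(t1);
	local_irq_restore(flags);

	/* cast avoids a 64-bit divide; the count easily fits in a long */
	printk(KERN_INFO "wrmsr cost: ~%lu cycles each\n",
	       (unsigned long)(t1 - t0) / LOOPS);
	return 0;
}

static void __exit wrmsr_cost_exit(void)
{
}

module_init(wrmsr_cost_init);
module_exit(wrmsr_cost_exit);
MODULE_LICENSE("GPL");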