> With TSO enabled:
>
> ftp> get big.iso /dev/null
> local: /dev/null remote: big.iso
> 200 PORT command successful.
> 150 Opening BINARY mode data connection for 'big.iso' (2038628352 bytes).
> 226 Transfer complete.
> 2038628352 bytes received in 21.16 secs (94070.2 kB/s)
>
> ftp server CPU utilization: ~ 15%
>
> So we get almost a 15% throughput drop. This was with plain "netkit
> ftpd". AFAIK, it does a simple read/write loop (not sendfile()).
When we use TSO for non-sendfile() applications, it really stresses
memory allocation.  We do these 64K+ kmalloc()'s for each packet we
construct.

But I don't think that's what is happening here; rather, the PCI
controller is "talking" to the CPU's L2 cache with coherency
transactions on all the data of every packet going to the chip.

Whereas with a sendfile() type setup, the PCI controller goes
straight to main memory for the data part of the packets, since the
CPU is unlikely to have each page cache page in its L2 cache.  In
the sendmsg() case, it's virtually guaranteed that the CPU will have
all the packet data in its L2 cache in an unshared-modified state.
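
To make the two send paths concrete, here is a minimal userspace
sketch (the function names, descriptors and 64K buffer size are
illustrative, not netkit ftpd's actual code): the read()/write() loop
drags every byte through the CPU's caches before the NIC sees it,
while sendfile() lets the NIC DMA straight from page cache pages the
CPU has most likely never touched.

#include <sys/types.h>
#include <unistd.h>
#include <sys/sendfile.h>

/* Copy loop, as in a plain ftpd: every byte is read into a user buffer
 * and written back out, so by the time the NIC sees the data the CPU's
 * caches own all of it in a modified state. */
ssize_t send_with_copy_loop(int sock_fd, int file_fd)
{
	char buf[64 * 1024];	/* hypothetical 64K user buffer */
	ssize_t n;
	ssize_t total = 0;

	while ((n = read(file_fd, buf, sizeof(buf))) > 0) {
		if (write(sock_fd, buf, n) != n)
			return -1;
		total += n;
	}
	return n < 0 ? -1 : total;
}

/* sendfile(): the data never passes through user space, so the NIC can
 * DMA straight from page cache pages that are most likely cold in the
 * CPU's caches. */
ssize_t send_with_sendfile(int sock_fd, int file_fd, size_t count)
{
	off_t off = 0;
	size_t left = count;
	ssize_t n;

	while (left > 0) {
		n = sendfile(sock_fd, file_fd, &off, left);
		if (n <= 0)
			return n < 0 ? -1 : (ssize_t)(count - left);
		left -= (size_t)n;
	}
	return (ssize_t)count;
}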

I know how this can be fixed: can you use L2-bypassing stores in
your csum_and_copy_from_user() and copy_from_user() implementations,
like we do on sparc64?  That would eliminate exactly this situation
where the card is talking to the CPU's L2 cache for all the data
during the PCI DMA transaction on the send side.
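
As a rough userspace illustration of what "L2-bypassing stores" look
like, here is a sketch using x86 SSE2 non-temporal store intrinsics.
This is not the sparc64 kernel code, and it skips the checksum half
and any alignment/tail handling; it only shows the idea of a copy
whose destination writes go around the cache.

#include <emmintrin.h>	/* SSE2 intrinsics */
#include <stddef.h>

/* Sketch of a cache-bypassing copy in the spirit of an L2-bypassing
 * copy_from_user(); assumes a 16-byte-aligned destination and a
 * length that is a multiple of 16. */
void copy_nocache(void *dst, const void *src, size_t len)
{
	__m128i *d = (__m128i *)dst;
	const __m128i *s = (const __m128i *)src;
	size_t i;

	for (i = 0; i < len / 16; i++) {
		/* Normal (cached) load of the source, then a non-temporal
		 * store: the data goes out through write-combining buffers
		 * to main memory instead of landing in L2 as a dirty line. */
		_mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
	}

	/* Order the streamed stores before anything (e.g. a device doing
	 * DMA for transmit) reads the buffer. */
	_mm_sfence();
}

The point of the technique is just that the destination bytes never
end up as modified lines in the sender's L2, so the later DMA read on
transmit is satisfied from main memory, as in the sendfile() case.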