If you surf on over to
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.15/ you'll see
some code which performs 64k I/Os, reading directly into the pagecache.
It reduces the cost of reading from disk by 25% in my testing.
(That code is ready to go - just waiting for Linus to rematerialise).
The remaining profile is interesting. The workload is simply
`cat large_file > /dev/null':
address  samples  percent  symbol
c012b448 33 0.200877 kmem_cache_free
c0131af8 33 0.200877 flush_all_zero_pkmaps
c01e51bc 33 0.200877 blk_recount_segments
c01f9aec 34 0.206964 hpt374_udma_stop
c016eb80 36 0.219138 ext2_get_block
c0133320 37 0.225225 page_cache_readahead
c013740c 37 0.225225 __getblk
c0131ba0 41 0.249574 kmap_high
c01fa1c4 41 0.249574 ata_start_dma
c016e7dc 46 0.28001 ext2_block_to_path
c01e5320 48 0.292184 blk_rq_map_sg
c01c65d0 50 0.304358 radix_tree_reserve
c014bfb0 53 0.32262 do_mpage_bio_readpage
c01f4d88 54 0.328707 ata_irq_request
c0136b34 64 0.389579 __get_hash_table
c0126a00 72 0.438276 do_generic_file_read
c016e910 82 0.499148 ext2_get_branch
c0126610 88 0.535671 unlock_page
c0106df4 91 0.553932 system_call
c012b04c 94 0.572194 kmem_cache_alloc
c01f2494 126 0.766983 ata_taskfile
c01c66e8 163 0.992208 radix_tree_lookup
c012d250 165 1.00438 rmqueue
c0105274 2781 16.9284 default_idle
c0126e48 11009 67.0136 file_read_actor
That's a single 500MHz PIII Xeon, reading at 35 megabytes/sec.
There's 17% "overhead" here. Going to a larger filesystem
blocksize would provide almost zero benefit in the I/O layers.
Savings from larger blocks and larger pages would show up in
the radix tree operations, get_block and a few other places.
At a guess, 8k blocks would cut the overhead to 10-12%.
And larger block size significantly penalises bandwidth for
the many-small-file case. The larger the blocks, the worse
it gets. You end up having to implement complexities such
as tail-merging to get around the inefficiency which the
workaround for your other inefficiency caused.
And larger pages with small blocks aren't an answer - the CPU load
and seek costs from 2 blocks-per-page are measurable. At
4 blocks-per-page it's getting serious.
Small pages and pagesize=blocksize are good. I see no point in
going to larger pages or blocks until the current scheme is
working efficiently and has been *proven* to still be unfixably
inadequate.
The current code sucks. Simply amortising that suckiness across
larger blocks is not the right thing to do.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/