Re: [Patch 2/2] Retry based aio read - filesystem read changes

Janet Morgan (janetmor@us.ibm.com)
Mon, 31 Mar 2003 10:32:20 -0800


> On Wed, Mar 05, 2003 at 02:42:54AM -0800, Andrew Morton wrote:
> > Suparna Bhattacharya <suparna@in.ibm.com> wrote:
> >
> > +extern int FASTCALL(aio_wait_on_page_bit(struct page *page, int bit_nr));
> > +static inline int aio_wait_on_page_locked(struct page *page)
>
> Oh boy.
>
> There are soooo many more places where we can block:
>
> - write() -> down(&inode->i_sem)
>
> - read()/write() -> read indirect block -> wait_on_buffer()
>
> - read()/write() -> read bitmap block -> wait_on_buffer()
>
> - write() -> allocate block -> mark_buffer_dirty() ->
> balance_dirty_pages() -> writer throttling
>
> - write() -> allocate block -> journal_get_write_access()
>
> - read()/write() -> update_a/b/ctime() -> journal_get_write_access()
>
> - ditto for other journalling filesystems
>
> - read()/write() -> anything -> get_request_wait()
> (This one can be avoided by polling the backing_dev_info congestion
> flags)
>
> - read()/write() -> page allocator -> blk_congestion_wait()
>
> - write() -> balance_dirty_pages() -> writer throttling
>
> - probably others.
>
> Now, I assume that what you're looking for here is an 80% solution, but it
> seems that a lot more changes will be needed to get even that far.

I'm trying to identify significant blocking points in the filesystem read
path. I'm not sure what the best approach is, so I figured I'd just track
callers of schedule() under a heavy dd workload.
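
One way to collect such a histogram is to record the return address at the
top of schedule() and count hits per caller. The sketch below is just the
idea: the names are made up, there is no locking, and it isn't necessarily
the exact instrumentation behind the numbers further down.

/*
 * Sketch only: histogram callers of schedule() by return address.
 * A real version needs locking and some way to dump the table
 * (e.g. a /proc file, not shown).
 */
#define MAX_SCHED_CALLERS 256

static struct sched_caller {
        void *addr;
        unsigned long count;
} sched_callers[MAX_SCHED_CALLERS];

static void note_sched_caller(void *addr)
{
        int i;

        for (i = 0; i < MAX_SCHED_CALLERS; i++) {
                if (sched_callers[i].addr == addr || !sched_callers[i].addr) {
                        sched_callers[i].addr = addr;
                        sched_callers[i].count++;
                        break;
                }
        }
}

/* then, at the top of schedule():
 *      note_sched_caller(__builtin_return_address(0));
 */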

I collected data while running 1000 processes, where each process was
using dd to sequentially read from a 1GB file. I used 10 target files
in all, so 100 processes read from file1, 100 processes read from file2,
etc. All these numbers were pretty much arbitrary. I used stock 2.5.62
to test.

The top 3 callers of schedule based on my dd read workload are listed
below. Together they accounted for 92% of all calls to schedule.
Suparna's filesystem aio patch already modifies 2 of these 3 callpoints
to be "retryable". So the remaining candidate is the call to cond_resched
from do_generic_mapping_read. The question is whether that qualifies as the
sort of thing that should trigger a retry, i.e., whether cond_resched is
just a voluntary yield/preemption point that may not even result in a
context switch if there is nothing more preferable to run. And even if the
call to cond_resched is not cause for a retry, the profile data seems to
indicate that Suparna's patch is roughly an 80% solution (unless I'm missing
something here, which is entirely possible ;-).
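
For reference, cond_resched() in these kernels boils down to something like
the following (paraphrased, so the exact 2.5.62 definitions may differ
slightly); it only ends up in schedule() when a reschedule is already
pending:

static inline void cond_resched(void)
{
        if (need_resched())
                __cond_resched();
}

void __cond_resched(void)
{
        set_current_state(TASK_RUNNING);
        schedule();
}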

So here are the call statistics:

70% of all calls to schedule were from __lock_page:
Based on the profile data, almost all calls to __lock_page were
from do_generic_mapping_read (see filemap.c/Line 101 below).
Suparna's patch already retries here (a rough sketch of what a
retry at this sort of callpoint looks like follows after these
statistics).

15% of all calls to schedule were from __cond_resched:
do_generic_mapping_read -> cond_resched -> __cond_resched
(see filemap.c/Line 41 below).
Should this be made retryable ???

7% of all calls to schedule were from wait_on_page_bit:
Based on the profile data, almost all calls to wait_on_page_bit were
from do_generic_mapping_read->wait_on_page_locked -> wait_on_page_bit
(see filemap.c/Line 123 below). Suparna's patch covers this
callpoint, too.
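
To make "retryable" a bit more concrete, the general shape of a retry at
one of these callpoints is something like the sketch below. This is
illustrative only, not the code from Suparna's patch, and the helper name
is made up; the point is just that the aio path returns a retry indication
(I'm using -EIOCBRETRY here) instead of sleeping, and the iocb is re-driven
once the page becomes available.

/*
 * Illustrative only, not the actual patch.  Instead of blocking in
 * lock_page(), an aio read gives up and asks to be retried once the
 * page is unlocked.
 */
static inline int aio_try_lock_page(struct page *page)
{
        if (TestSetPageLocked(page))
                return -EIOCBRETRY;     /* re-drive the iocb later */
        return 0;
}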

Here's some detail....

Callers of schedule()

7768210 Total calls to schedule
------------
5440195 __lock_page+176
1143862 do_generic_mapping_read+d6
569139 wait_on_page_bit+17e
488273 work_resched+5
43414 __wait_on_buffer+14a
23594 blk_congestion_wait+99
21171 worker_thread+14f
9838 cpu_idle+6d
6308 do_select+29b
5178 schedule_timeout+b8
4136 sys_sched_yield+d9
2096 kswapd+122
1796 sleep_on+55
1767 sys_nanosleep+f8
1655 do_exit+4eb
1643 wait_for_completion+d2
1365 interruptible_sleep_on+55
1034 pipe_wait+98
1004 balanced_irq+6d
975 ksoftirqd+108
971 journal_stop+107
884 sys_wait4+2ff
780
658 unix_wait_for_peer+d6
576 read_chan+52c
580 __pdflush+d2
105 do_poll+ec
102 __down+9e
71 tty_wait_until_sent+f2
59 ksoftirqd+b5
55 __down_interruptible+d8
41
16 migration_thread+cc
10 uart_wait_until_sent+d6
9 unix_stream_data_wait+109
8 tcp_data_wait+16c
4 wait_for_packet+140
3 wait_til_done+110
2 write_chan+280
2 generic_file_aio_write_nolock+bf4
2 journal_commit_transaction+97a
1 usb_hub_thread+c0
1 jfs_lazycommit+1c6
1 serio_thread+e0
1 jfs_sync+259
1 jfsIOWait+15d
1 _lock_fdc+12f
1 acpi_ex_system_do_suspend+3d
1 init_pcmcia_ds+25e

readprofile:

10058989 total 2.9520
5982060 __copy_to_user_ll 41542.0833
928000 do_generic_mapping_read 698.7952
447968 poll_idle 4666.3333
301263 file_read_actor 1176.8086
265660 radix_tree_lookup 1844.8611
209991 page_cache_readahead 410.1387
129655 vfs_read 311.6707
117666 kmap_atomic 919.2656
111874 fget 1165.3542
110996 __generic_file_aio_read 182.5592
109402 kunmap_atomic 3418.8125
95092 vfs_write 228.5865
76119 mark_page_accessed 679.6339
69564 current_kernel_time 724.6250
52569 fput 1095.1875
51824 update_atime 231.3571
51375 activate_page 267.5781
46577 scsi_request_fn 61.9375
41731 generic_file_read 237.1080
39300 shrink_cache 38.9881
37658 schedule 26.1514
37501 refill_inactive_zone 22.3220
37325 scsi_dispatch_cmd 70.6913
32762 unlock_page 292.5179
31924 buffered_rmqueue 86.7500
30454 __make_request 21.3862
28532 do_page_cache_readahead 54.0379
27783 delay_tsc 434.1094
27723 ext2_get_branch 69.3075
25726 shrink_list 13.1793
25246 __find_get_block 63.1150
24443 mpage_end_io_read 169.7431
23800 add_to_page_cache 87.5000
20669 do_softirq 71.7674
19895 system_call 452.1591
18906 do_mpage_readpage 14.4101
18783 page_referenced 73.3711
18137 free_hot_cold_page 59.6612
18015 __wake_up 225.1875
13790 radix_tree_insert 53.8672
11227 radix_tree_delete 31.8949
11188 __alloc_pages 11.4631
10885 __brelse 136.0625
10699 write_null 668.6875
10299 __pagevec_lru_add 35.7604
10100 page_address 42.0833
9968 ext2_get_block 7.5976

2.5.62 filemap.c/do_generic_mapping_read:
1 /*
2 * This is a generic file read routine, and uses the
3 * inode->i_op->readpage() function for the actual low-level
4 * stuff.
5 *
6 * This is really ugly. But the goto's actually try to clarify some
7 * of the logic when it comes to error handling etc.
8 * - note the struct file * is only passed for the use of readpage
9 */
10 void do_generic_mapping_read(struct address_space *mapping,
11 struct file_ra_state *ra,
12 struct file * filp,
13 loff_t *ppos,
14 read_descriptor_t * desc,
15 read_actor_t actor)
16 {
17 struct inode *inode = mapping->host;
18 unsigned long index, offset;
19 struct page *cached_page;
20 int error;
21
22 cached_page = NULL;
23 index = *ppos >> PAGE_CACHE_SHIFT;
24 offset = *ppos & ~PAGE_CACHE_MASK;
25
26 for (;;) {
27 struct page *page;
28 unsigned long end_index, nr, ret;
29
30 end_index = inode->i_size >> PAGE_CACHE_SHIFT;
31
32 if (index > end_index)
33 break;
34 nr = PAGE_CACHE_SIZE;
35 if (index == end_index) {
36 nr = inode->i_size & ~PAGE_CACHE_MASK;
37 if (nr <= offset)
38 break;
39 }
40
41 cond_resched();
42 page_cache_readahead(mapping, ra, filp, index);
43
44 nr = nr - offset;
45
46 /*
47 * Try to find the data in the page cache..
48 */
49 find_page:
50 read_lock(&mapping->page_lock);
51 page = radix_tree_lookup(&mapping->page_tree, index);
52 if (!page) {
53 read_unlock(&mapping->page_lock);
54 handle_ra_miss(mapping,ra);
55 goto no_cached_page;
56 }
57 page_cache_get(page);
58 read_unlock(&mapping->page_lock);
59
60 if (!PageUptodate(page))
61 goto page_not_up_to_date;
62 page_ok:
63 /* If users can be writing to this page using arbitrary
64 * virtual addresses, take care about potential aliasing
65 * before reading the page on the kernel side.
66 */
67 if (!list_empty(&mapping->i_mmap_shared))
68 flush_dcache_page(page);
69
70 /*
71 * Mark the page accessed if we read the beginning.
72 */
73 if (!offset)
74 mark_page_accessed(page);
75
76 /*
77 * Ok, we have the page, and it's up-to-date, so
78 * now we can copy it to user space...
79 *
80 * The actor routine returns how many bytes were actually used..
81 * NOTE! This may not be the same as how much of a user buffer
82 * we filled up (we may be padding etc), so we can only update
83 * "pos" here (the actor routine has to update the user buffer
84 * pointers and the remaining count).
85 */
86 ret = actor(desc, page, offset, nr);
87 offset += ret;
88 index += offset >> PAGE_CACHE_SHIFT;
89 offset &= ~PAGE_CACHE_MASK;
90
91 page_cache_release(page);
92 if (ret == nr && desc->count)
93 continue;
94 break;
95
96 page_not_up_to_date:
97 if (PageUptodate(page))
98 goto page_ok;
99
100 /* Get exclusive access to the page ... */
101 lock_page(page);
102
103 /* Did it get unhashed before we got the lock? */
104 if (!page->mapping) {
105 unlock_page(page);
106 page_cache_release(page);
107 continue;
108 }
109
110 /* Did somebody else fill it already? */
111 if (PageUptodate(page)) {
112 unlock_page(page);
113 goto page_ok;
114 }
115
116 readpage:
117 /* ... and start the actual read. The read will unlock the page. */
118 error = mapping->a_ops->readpage(filp, page);
119
120 if (!error) {
121 if (PageUptodate(page))
122 goto page_ok;
123 wait_on_page_locked(page);
124 if (PageUptodate(page))
125 goto page_ok;
126 error = -EIO;
127 }
128
129 /* UHHUH! A synchronous read error occurred. Report it */
130 desc->error = error;
131 page_cache_release(page);
132 break;
133
134 no_cached_page:
135 /*
136 * Ok, it wasn't cached, so we need to create a new
137 * page..
138 */
139 if (!cached_page) {
140 cached_page = page_cache_alloc_cold(mapping);
141 if (!cached_page) {
142 desc->error = -ENOMEM;
143 break;
144 }
145 }
146 error = add_to_page_cache_lru(cached_page, mapping,
147 index, GFP_KERNEL);
148 if (error) {
149 if (error == -EEXIST)
150 goto find_page;
151 desc->error = error;
152 break;
153 }
154 page = cached_page;
155 cached_page = NULL;
156 goto readpage;
157 }
158
159 *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
160 if (cached_page)
161 page_cache_release(cached_page);
162 UPDATE_ATIME(inode);
163 }
