Bonnie on NBD w/ memory pressure deadlocks (problem in wait_for_tcp_memory?)

Jeff Raubitschek (raubitsj@crib.corepower.com)
Mon, 15 Jan 2001 16:20:23 -0500 (EST)

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Juergen E Fischer: "Re: SCSI scanner problem with all kernels since 2.3.42"
Previous message: Jens Axboe: "Re: vmware 2.0.3, kernel 2.4.0 and a cdrom"

running 2.4.0 with kdb patch

[1.] Bonnie on NBD w/ memory pressure deadlocks (problem in wait_for_tcp_memory?)
[2.] Full description
This bug appears to be totally reproducable on different hardware and kernel versions.

The conditions that create the problem:
2 machines (client, server) (both p4 1.4G) networked with 100Mb
the server runs: ./nbd-server 8899 /dev/hda6 1937804k
the client runs: ./nbd-client serverip 8899 /dev/nb5
mke2fs /dev/nb5
mount /dev/nb5 -t ext2 /mnt/nb5
./Bonnie -d /mnt/nb5 -s 100

(nbd-client and nbd-server from http://atrey.karlin.mff.cuni.cz/~pavel/nbd/nbd.html)

NBD seems to do fine with normal disk use, but when Bonnie is run with large file sizes
it causes memory pressure and this triggers the problem being seen.

Bonnie output:
FILE '/mnt/nbd5/Bonnie.8776', ssize: 104857600
Writing with putc()...done
Rewriting... [and here it hangs]

What I think is going on:
I compiled kdb into the kernel after unsuccessfully being able to figure it out by just
looking at the source. Doing this seemed to confirm my suspicions about the cause but
I was unable to figure out the exact problem.

using kdb I found the backtraces of important processes in the client and server:

Client
------
pid 5: bdflush
schedule+0x2d8
schedule+timeout+0x17
wait_for_tcp_memory+0x12e
tcp_sendmsg+0x666
inet_sendmsg+0x40
sock_sendmsg+0x7a
[nbd]nbd_xmit+0xda
[nbd]nbd_send_req+0x8f
[nbd]do_nbd_request+0x104
__make_request+0x5be
generic_make_request+0xd7
submit_bh+0x58
ll_rw_block+0x12f
flush_dirty_buffers+0x81
bdflush+0x7b
kernel_thread+0x23
pid 8740: nbd_client
schedule+0x2d8
__down+0x61
__down_failed+0xb
[nbd].text.lock+0x19
[nbd]nbd_do_it+0x41
[nbd]nbd_ioctl+0x316
blkdev_ioctl+0x2c
sys_ioctl+0x174
system_call+0x33
pid 8753: Bonnie
schedule+0x2d8
__lock_page+0x8b
lock_page+0x18
do_generic_file_read+0x29b
generic_file_read+0x5d
sys_read+0x91
system_call+0x33

Server
------
pid 8431: nbd-server
schedule+0x2d8
schedule_timeout+0x17
wait_for_tcp_memory+0x12e(0xc6ebe400, 0x7fffffff)
tcp_sendmsg+0x666(0xc6ebe400,0xc60bdf7c, 0x1010)
inet_sendmsg+0x40(0xc640aa04,0xc60bdf7c,0x1010, 0xc60bdf44, 0xc640aa04)
sock_sendmsg+0x7a(0xc640aa04,0xc60bdf7c,0x1010)
sock_write+0x8f(0xc7ebd00, 0xbfffa548, 0x1010, 0xc70ebd20)
sys_write+0x95(0x4,0xbfffa548, 0x1010, 0x1010, 0xbfffa548)
system_call+0x33

both machines are low in memory but have buffer memory still:
server: mem total=126216 used=124504 free=1712 shared=0 buffers=109844 cached=4912
-/+ buffers/cache: 9748 116468
client: mem total=126216 used=124264 free=1952 shared=0 buffers= 29940 cached=15568
-/+ buffers/cache: 78756 47460

What I think is going on is the client is busy reading blocks from the server over nbd
and dirtying them, eventually the buffer cache consumes all memory. This memory pressure
causes bdflush to try to flush dirty buffers which requires it to send the blocks to the
server. This does not complete because wait_for_tcp_memory never succeeds ?? (I am still
a bit unsure of what is going on with wait_for_tcp_memory)

Thus the client can not send any more requests because nbd is locked by bdflush which is
trying to flush dirty buffers but appearently cannot.

Also the server seems to be in the same wait_for_tcp_memory loop. I think if I understand
better what is going on in wait_for_tcp_memory I will be closer to figureing out how to
solve this problem.

Any help would be appreciated, I have much more info if anything more is
needed.

Thank you very much,
-jeff

[3.] keywords: nbd, networking, low memory
[4.] Linux version 2.4.0-kdb (raubitsj@jr-lnx) (gcc version egcs-2.91.66 19990314/Linux
(egcs-1.1.2 release)) #2 Thu Jan 11 21:00:11 PST 2001
[5.] no oops
[6.] this problem is 100% reproducable, seems to be reproducable on different hardware,
w/ different kernel versions too
[7.] Environment (this listing is limited, but can be provided if needed)
cpuinfo: P4 1.4GHz 128MB ram
modules: acenic, nbd

-------------------------------------------------------------------------------
Jeff Raubitschek
Computer Engineer
raubitsj@writeme.com
-------------------------------------------------------------------------------

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

Next message: Juergen E Fischer: "Re: SCSI scanner problem with all kernels since 2.3.42"
Previous message: Jens Axboe: "Re: vmware 2.0.3, kernel 2.4.0 and a cdrom"