We have a Dell PowerEdge 6600, 4x 1.6GHz Xeon, 8GB RAM, PERC3/DCL
(MegaRAID) controller with 128MB cache, driving two channels, each with
three 36GB 15krpm UltraSCSI/LVD 160 drives. I currently have 8GB of swap
space set up (four 2GB swap areas, in descending priority). It's a real
beast.
When it's been running for a couple of hours, doing a `find /` will drag
it to its knees, as evidenced by this scraping from top during one of
its bad moments:
--->8--[ Cut Here ]--->8--
1:17pm up 14:34, 4 users, load average: 0.83, 0.98, 1.11
78 processes: 76 sleeping, 1 running, 1 zombie, 0 stopped
CPU0 states: 0.0% user, 53.0% system, 0.0% nice, 46.0% idle
CPU1 states: 0.1% user, 41.1% system, 0.0% nice, 57.0% idle
CPU2 states: 0.1% user, 42.0% system, 0.0% nice, 56.1% idle
CPU3 states: 0.0% user, 41.1% system, 0.0% nice, 58.0% idle
Mem: 7762204K av, 7756320K used, 5884K free, 0K shrd, 236704K buff
Swap: 8388544K av, 432K used, 8388112K free 7260064K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
8595 gregory 9 0 664 664 528 D 41.3 0.0 0:02 find /usr
7195 root 9 0 2080 2080 1476 S 40.2 0.0 0:02 /usr/sbin/sshd
333 root 9 0 15252 14M 15072 D 35.0 0.1 0:38 /usr/sbin/MegaSer
7115 root 10 0 2092 2092 1476 S 28.1 0.0 0:25 /usr/sbin/sshd
7 root 15 0 0 0 0 SW 22.9 0.0 3:06 kswapd
7300 root 10 0 928 928 720 R 14.9 0.0 1:19 top
1 root 8 0 224 224 184 S 0.0 0.0 1:21 init [3]
2 root 9 0 0 0 0 SW 0.0 0.0 0:01 keventd
3 root 19 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd_CPU0
4 root 18 19 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd_CPU1
--->8--[ Cut Here ]--->8--
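As a rough sanity check on the memory figures in that snapshot, here is
a minimal Python sketch of the arithmetic. The numbers are copied from
the Mem: and Swap: lines above; the function name is only illustrative,
and "used minus buffers and cache" is the usual back-of-the-envelope
estimate, not something top reports directly.

```python
def memory_breakdown(total_kb, used_kb, buffers_kb, cached_kb):
    """Split 'used' memory into buffers/page cache and everything else."""
    # Memory that is in use but is neither buffers nor page cache.
    non_cache_kb = used_kb - buffers_kb - cached_kb
    return {
        "non_cache_kb": non_cache_kb,
        # Share of total RAM currently holding page cache.
        "cached_pct": 100.0 * cached_kb / total_kb,
    }


# Figures taken from the top snapshot (all in KB).
snapshot = memory_breakdown(7762204, 7756320, 236704, 7260064)
print("used minus buffers/cache: %dK" % snapshot["non_cache_kb"])
print("page cache share of RAM:  %.1f%%" % snapshot["cached_pct"])
```

That works out to about 260MB in use outside buffers and cache, with
roughly 93.5% of RAM holding page cache and swap essentially untouched,
so the machine does not look short of memory in the usual sense.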
This was still on its way to total oblivion. Once it's _at_ oblivion,
all 4 CPUs are at 100% system, with find, kswapd and kupdated usually
the top three processes; find and kswapd are always at the top.
Something as simple as a find should not do this, and doesn't, even on
much wimpier machines (a single 300MHz P-III with a single IDE drive and
128MB RAM doesn't even break a sweat).
Software: Kernel v. 2.4.19 (from kernel.org), patched with LVM 1.0.5 and
Dell's Broadcom Gigabit Ethernet drivers. Userland LVM tools are from
the same 1.0.5 sources. Using the kernel megaraid driver for the RAID
controller.
So far, on the advice of one of the Dell developers for the megaraid
driver, I've tried flashing the PERC card back to firmware 1.61O (it
shipped with 1.72). This didn't seem to have much positive effect.
I've since flashed up to 1.73, and am letting the system warm up a bit
to see whether it has any effect.
The problem never manifests itself within an hour or so of startup,
but after a certain amount of time, it becomes progressively more
noticeable. After 21 days of uptime, the system became so unresponsive
on a Thursday morning (our busiest day) that we had to restart.
So, what other information can I provide to make the problem more
diagnosable, and where else can I look for an explanation?
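One thing that could be collected, as a minimal sketch only: periodic
snapshots of /proc/meminfo and /proc/slabinfo, so that the state shortly
after boot can be compared with the state once things have degraded.
This assumes both files are readable and that a Python interpreter is
available on the box; the log path, the interval and all function names
are hypothetical, and the parser deliberately accepts only plain
"Key: <number> kB" lines and skips anything else.

```python
import time


def parse_meminfo(text):
    """Parse 'Key:   12345 kB' lines into {key: int}; skip other lines."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Accept only 'Key: <number>' or 'Key: <number> kB'.
        if sep and fields and fields[0].isdigit() and fields[1:] in ([], ["kB"]):
            values[key.strip()] = int(fields[0])
    return values


def snapshot(paths=("/proc/meminfo", "/proc/slabinfo")):
    """Return {path: contents}, skipping files that cannot be read."""
    data = {}
    for path in paths:
        try:
            with open(path) as handle:
                data[path] = handle.read()
        except OSError:
            pass
    return data


def log_forever(logfile="/tmp/vm-snapshots.log", interval=60):
    """Append a timestamped snapshot every `interval` seconds."""
    while True:
        with open(logfile, "a") as out:
            out.write("=== %s ===\n" % time.ctime())
            for path, text in snapshot().items():
                out.write("--- %s ---\n%s\n" % (path, text))
        time.sleep(interval)
```

Running log_forever() in the background from boot until the slowdown
shows up would give a before/after record, and parse_meminfo() can be
applied to two saved /proc/meminfo dumps to see which fields moved.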
I have been unable to replicate this on any other hardware. I have at
least three other systems running the same version of SuSE Linux (7.2),
all with the same version of the kernel and LVM as mentioned above.
Unfortunately, this Dell PE6600 is the only one of its kind that we
own...
Thanks in advance for any help. Feel free to reply to the list, as I am
subscribed.
-- 
Gregory K. Ade <gregory@castandcrew.com>
Sr. Systems Administrator
Cast & Crew Entertainment Services, Inc.