Brian Tinsley wrote:
> Out of curiosity, which RH kernel are you using? I moved on to 2.4.19
> and 2.4.20 primarily because the RH 2.4.18 series of kernels
> apparently has a scheduler bug (at least one) that causes the
> heartbeat software from Linux-HA to lose heartbeat signals and
> fail over. Not a good scenario when you are trying to provide HA
> systems to hospitals!
>
>
> Russell Leighton wrote:
>
>>
>> I can't help, but I can echo a "me too".
>>
>> We only see it when I have 2 file-I/O-intensive processes... they both
>> will just stop for a few seconds while the system seems idle... then
>> they just start again. RH7.3 SMP, Dual PIII, 4GB RAM, 3com RAID
>> Controller.
>>
>> Brian Tinsley wrote:
>>
>>> We have been having terrible problems with long stalls, meaning from
>>> a couple of minutes to an hour, happening when filesystem I/O load
>>> gets high. The system time as reported by vmstat or sar will
>>> increase up to 99% and as it spreads to each processor, the system
>>> becomes completely unresponsive (except that it responds to pings
>>> just fine - interesting!). When the system finally returns to the
>>> world of the living, the only evidence that something bad has
>>> happened is the runtime for kswapd is abnormally high. I have seen
>>> this happen with the stock 2.4.17, 2.4.19, and 2.4.20 kernels on SMP
>>> PIII and PIV machines (either 4GB or 8GB RAM, all SCSI disks, dual
>>> GigE NICs). I've searched the lkml archives and Google and have
>>> found several similar postings, but there is never an explanation or
>>> resolution. Any help would be *very* much appreciated! If any info
>>> from the system in question is desired, I will be glad to provide it.
>>>
>>>
>>>
>>
>