
Re: [hobbit] RHEL5 and status-board not available bug?



I'm not completely sure if you believe there is a bug in Xymon,
or in the Linux kernel of your RHEL system ... But I have a few
comments.

On Tue, Feb 10, 2009 at 07:35:24AM +0000, Flyzone Micky wrote:
> Well... We think it's a big bug, where 'we' is me and RedHat support.
> Of course I'm speaking of Linux and not about the Solaris bug,
> and my kernel parameters are OK.
> 
> I moved from RHEL 4.5 with kernel 2.6.9-55 to RHEL 5.3 with
> kernel 2.6.18-128, with bonded (active-passive) gigabit ethernet
> and the xymon data stored on NFS in a Veritas cluster.
> The xymon server handles 3000 hosts and about 17093 status messages.
> The problem is the timeouts: the hobbit status page goes green,
> and the pages are sometimes slow to load or give a "Status not
> available"

3000 hosts is a fairly large setup. I assume you're doing data
collection for graphs for all of these servers, and that you're
running version 4.2.x of Xymon.

I would guess that your problems - at least in part - stem from 
the amount of I/O you're doing for updating all of the RRD-files.
I know from personal experience that heavy disk I/O can cause
network connections in Xymon to time out. I have not tried keeping
the data on a network filesystem myself, but it could make this
problem worse, because the I/O is now entirely
handled by the Linux kernel, whereas with a local disk for storage
at least some of the I/O is handled by the disk controller.

What you could try - at least for a short period - would be to
stop the [rrdstatus] and [rrddata] tasks in hobbitlaunch.cfg.
This stops data from being collected into the graphs, but it
will also reduce your disk I/O to practically nil. If your system
then starts behaving properly, then we need to look at reducing
the load from your RRD updates (I have a couple of suggestions).
If the problem persists, then some other explanation must be found.
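
As a rough illustration of that test, the small Python sketch below
marks the two RRD tasks as disabled in the text of a hobbitlaunch.cfg.
It assumes that your version of hobbitlaunch honours a DISABLED keyword
inside a task section (check the hobbitlaunch.cfg man page first). The
sample sections and their CMD lines are made-up placeholders, not the
real task definitions, and disable_tasks is a hypothetical helper name.

```python
# Hypothetical sample; real hobbitlaunch.cfg task sections look different.
SAMPLE = """[rrdstatus]
\tNEEDS hobbitd
\tCMD example-status-command

[rrddata]
\tNEEDS hobbitd
\tCMD example-data-command

[bbdisplay]
\tCMD example-display-command
"""


def disable_tasks(cfg_text, tasks):
    """Return cfg_text with a DISABLED line added under each named [task].

    Assumption: hobbitlaunch skips any task whose section contains the
    DISABLED keyword. Sections that already contain it are left alone.
    """
    wanted = set(tasks)
    lines = cfg_text.splitlines()
    out = []
    for idx, line in enumerate(lines):
        out.append(line)
        header = line.strip()
        if not (header.startswith("[") and header.endswith("]")):
            continue
        if header[1:-1] not in wanted:
            continue
        # Collect this section's body (up to the next header) so that
        # running the function twice does not add a second DISABLED line.
        body = []
        for nxt in lines[idx + 1:]:
            if nxt.strip().startswith("["):
                break
            body.append(nxt.strip())
        if "DISABLED" not in body:
            out.append("\tDISABLED")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(disable_tasks(SAMPLE, ["rrdstatus", "rrddata"]))
```

To undo the test, remove the two DISABLED lines again. Work on a copy
of the file and keep a backup of the original; how quickly hobbitlaunch
picks up the change depends on your installation.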


> Speaking with RedHat premium support, I sent them a trace of the
> error (about 40MB gzipped...) and for them the cause is a bug in the
> thread management, because in RHEL5 it is no longer possible to use
> the old POSIX implementation of threading; one has to use just
> the Linux Threading "version". Of course I have lost some of the
> details... sorry, but I'm not a programmer.

I don't know how the change in "POSIX threading" plays into this.
Hobbit is not a threaded application; it is a plain and simple
single-task application all the way through. The threading change
may have some relevance to NFS.



Regards,
Henrik