[hobbit] RHEL5 and status-board not available bug?

Henrik Størner henrik at hswn.dk
Thu Feb 12 23:31:50 CET 2009


On Thu, Feb 12, 2009 at 06:06:48PM +0000, Flyzone Micky wrote:
> On Tue, 10 Feb 2009 16:39:35 +0100, Henrik wrote:
> 
> >I'm not completely sure if you believe there is a bug in Xymon,
> >or in the Linux kernel of your RHEL system ... 
> 
> I think it is in Hobbit. And I have news about it; I'll write more below.
> 
> >I would guess that your problems - at least in part - stem from 
> >the amount of I/O you're doing for updating all of the RRD-files.
> 
> No, that is ruled out entirely; I already tried disabling all the ext tests.
> I also tried moving the data to local SCSI disks,
> and iostat indicates a really low I/O wait.

"really low" as in ... how much ? If you're looking at the vmstat 
output, check the "vmstat1" graph and see how much I/O wait
takes up of your cpu time. AND - remember that disk I/O in
Linux is a single-processor task, so on a dual-CPU box
your I/O system is saturated when your vmstat1 graph shows
50% of the time is spent in I/O wait.

On a quad-cpu box the limit is 25%, obviously.
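
If you want a quick look at the raw numbers on the server itself,
plain vmstat will do, and iostat - from the sysstat package, so this
part assumes you have it installed - breaks it down per device:

     # Sample once per second for 10 seconds; the "wa" column is the
     # percentage of CPU time spent waiting on disk I/O.
     vmstat 1 10

     # Per-device utilization and wait times:
     iostat -x 1 10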

I also have my RRD files on 10k RPM SCSI disks, hardware RAID
controller etc. Without the caching in Xymon 4.3, it couldn't
keep up with the amount of RRD updates I was feeding it. That
also shows in the fact that flushing the cache - which
essentially does the same amount of disk I/O as a full update
of all the RRD files - takes about 8 minutes. No chance, then,
of keeping up with 5-minute update cycles.

I really think you should try shutting off the hobbitd_rrd tasks,
just to see what happens.
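
In a stock install that means adding a DISABLED line to the two RRD
task sections in hobbitlaunch.cfg - section names can differ if your
setup is customized, so take this as a sketch. Leave the existing
ENVFILE/CMD lines in place:

     [rrdstatus]
             DISABLED

     [rrddata]
             DISABLED

Then restart Hobbit so hobbitlaunch picks up the change.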

> > If the problem persists, then some other explanation must be found.
> 
> Must be, for sure... it's big trouble to see 3000 hosts turning purple,
> then green, then purple :)

For hosts to go purple they have to go more than 30 minutes without
an update - they don't go purple just because they miss a single
update.

I suppose you have checked the kernel logs ('dmesg' output) for
anything odd?
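
Assuming a standard setup, these are the messages I would grep for
first:

     # Recent kernel messages:
     dmesg | tail -50

     # An overflowing ARP cache (more on that below) logs this
     # specific complaint:
     dmesg | grep -i 'neighbour table overflow'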

I'm wondering if maybe you're running out of ports (there are only
64K of them, and only about half can be used by normal apps). How
many ports do you have in TIME_WAIT state?
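
That is easy to check:

     # Count TCP sockets currently stuck in TIME_WAIT:
     netstat -ant | grep -c TIME_WAIT

     # Or a breakdown of all TCP socket states:
     netstat -ant | awk 'NR>2 {print $6}' | sort | uniq -c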

Another thing is the size of the ARP cache, if your hosts are
all on the same IP network or your router/firewall is doing
proxy-arp. This could be a problem - I've seen Hobbit break
on a system with ~1200 hosts, because the network test would
ping all of them, overflowing the ARP cache. This is tunable
with
     sysctl net.ipv4.neigh.default.gc_thresh1=3072
     sysctl net.ipv4.neigh.default.gc_thresh2=4096
(see the arp(7) man-page for what these do).
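
You can see how close you are to those limits before bumping them
(this uses the iproute tools, which RHEL installs by default):

     # Current number of ARP/neighbour cache entries:
     ip -4 neigh show | wc -l

And to make the new thresholds survive a reboot, put them in
/etc/sysctl.conf:

     net.ipv4.neigh.default.gc_thresh1 = 3072
     net.ipv4.neigh.default.gc_thresh2 = 4096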

Is this server also running the network tests ?

Network-wise, it makes sense to tune a busy Hobbit server in the
same manner that you would a very busy webserver (which also 
has to handle lots of short-lived connections). Another possible
tuning parameter would be 
     sysctl net.ipv4.tcp_tw_reuse=1
which enables the kernel to re-use ports that are in a TIME_WAIT
state for new connections. It goes against the recommended way
of doing TCP, but unless you're running Hobbit over high-latency
networks it should not cause any problems.
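
If it turns out to help, make it permanent in /etc/sysctl.conf:

     # Allow re-use of TIME_WAIT sockets for new outgoing connections
     net.ipv4.tcp_tw_reuse = 1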


> >I don't know how the change in "POSIX threading" plays into this.
> >Hobbit is not a threaded application, it is plain and simple 
> >single-task application all the way through. It may have some
> >meaning in relation to NFS.
> 
> Oops... it is not multithreaded? I'm not a programmer, but... how can
> it keep up with 3000 hosts sending data without multithreading?

By avoiding all the overhead of using threads :-)

Seriously, 3000 hosts on a 5-minute cycle is only 10 hosts/second.
Each host triggers perhaps 5-10 connections (e.g. an old client
reporting cpu, disk, memory, msgs, procs, conn), and since the core
daemon isn't doing any disk I/O, handling 50-100 connections per
second isn't that big a deal.

> However, here is the news: the problem persists only on RHEL5 with
> the x86_64 architecture, with all kinds of 2.6 kernels.
> With RHEL5 on x86 (32-bit) the bug is not there.

It's quite odd that there is a problem on x86-64, but not on x86-32.
One (I) would expect the 64-bit systems to have a bit more "oomph",
so they should be the ones that work best.

A datapoint here: I'm also running Hobbit on a 64-bit Linux
platform, but it is using SPARC (Sun) hardware. The kernel is
2.6.18-6-sparc64. This hardware is *ancient* (about 10 years old),
but it handles twice the number of hosts and statuses that you have.
I do have the RRDs on a different server, though.

> However, the problem also exists in our Hobbit lab (also 64-bit)
> when stressing Hobbit with more than 20 "virtual hosts"

So you're saying that on a RHEL 5.3 64-bit Intel server, setting
up Hobbit and feeding it with data from ~20 clients will make
the system break?

I think I would have heard about it before if this was a general
problem.


Regards,
Henrik



