Re: [hobbit] RHEL5 and status-board not available bug?
- To: hobbit (at) hswn.dk
- Subject: Re: [hobbit] RHEL5 and status-board not available bug?
- From: Henrik Størner <henrik (at) hswn.dk>
- Date: Thu, 12 Feb 2009 23:31:50 +0100
- References: <20090212180648.F3E12BE407D (at) ws1-9.us4.outblaze.com>
- User-agent: Mutt/1.5.18 (2008-05-17)
On Thu, Feb 12, 2009 at 06:06:48PM +0000, Flyzone Micky wrote:
> On Tue, 10 Feb 2009 16:39:35 +0100, Henrik wrote:
>
> >I'm not completely sure if you believe there is a bug in Xymon,
> >or in the Linux kernel of your RHEL system ...
>
> I think it is in Hobbit. And I have news about it - I'll write more below.
>
> >I would guess that your problems - at least in part - stem from
> >the amount of I/O you're doing for updating all of the RRD-files.
>
> No, that's ruled out completely - I already tried disabling all the ext tests.
> I also tried moving the data to local SCSI disks,
> and iostat indicates a really low I/O wait.
"really low" as in ... how much ? If you're looking at the vmstat
output, check the "vmstat1" graph and see how much I/O wait
takes up of your cpu time. AND - remember that disk I/O in
Linux is a single-processor task, so on a dual-CPU box
your I/O system is saturated when your vmstat1 graph shows
50% of the time is spent in I/O wait.
On a quad-cpu box the limit it 25%, obviously.
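If in doubt, a quick look directly on the box tells you the same
thing as the graph - plain vmstat/iostat, nothing Hobbit-specific
(iostat assumes the sysstat package is installed):

    # watch the "wa" column - that's the I/O wait percentage
    vmstat 5
    # per-device utilisation and service times, 5-second intervals
    iostat -x 5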
I also have my RRD files on 10k RPM SCSI disks, hardware raid
controller etc. Without the caching in Xymon 4.3, it couldn't
keep up with the amount of RRD updates I was feeding it.
Which also shows in the fact that flushing the cache - which
essentially does the same amount of disk I/O as a full update
of all the RRD files - takes about 8 minutes. No chance at all
then of keeping up with 5-minute update cycles.
I really think you should try shutting off the hobbitd_rrd tasks,
just to see what happens.
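One way to do that is to mark the RRD tasks as DISABLED in
hobbitlaunch.cfg - the section names below are the ones from the
stock config, so check what yours are actually called:

    [rrdstatus]
        DISABLED
        # keep the existing ENVFILE/NEEDS/CMD lines as they are

    [rrddata]
        DISABLED
        # keep the existing ENVFILE/NEEDS/CMD lines as they are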
> > If the problem persists, then some other explanation must be found.
>
> It must, for sure... it's big trouble seeing 3000 hosts becoming purple,
> then green, then purple :)
For hosts to go purple they have to go more than 30 minutes without
an update - they don't go purple just because they miss a single
update.
I suppose you have checked the kernel logs ('dmesg' output) for
anything odd ?
I'm wondering if maybe you're running out of ports (there are only
64K of them, and only about half can be used by normal apps). How
many ports do you have in TIME_WAIT state ?
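Counting them is easy enough - something like this (netstat syntax,
so it assumes the usual net-tools are installed):

    # sockets currently stuck in TIME_WAIT
    netstat -ant | grep -c TIME_WAIT
    # the local port range available for outgoing connections
    cat /proc/sys/net/ipv4/ip_local_port_range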
Another thing is the size of the ARP cache, if your hosts are
all on the same IP network or your router/firewall is doing
proxy-arp. This could be a problem - I've seen Hobbit break
on a system with ~1200 hosts, because the network test would
ping all of them, overflowing the ARP cache. This is tunable
with
sysctl net.ipv4.neigh.default.gc_thresh1=3072
sysctl net.ipv4.neigh.default.gc_thresh2=4096
(see the arp(7) man-page for what these do).
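You can check how close you are to those limits with something
along these lines (the overflow message is what the IPv4 stack logs
when the cache fills up):

    # number of entries currently in the ARP cache (minus 1 header line)
    cat /proc/net/arp | wc -l
    # the kernel complains like this when the ARP cache overflows
    dmesg | grep -i 'neighbour table overflow'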
Is this server also running the network tests ?
Network-wise, it makes sense to tune a busy Hobbit server in the
same manner that you would a very busy webserver (which also
has to handle lots of short-lived connections). Another possible
tuning parameter would be
sysctl net.ipv4.tcp_tw_reuse=1
which enables the kernel to re-use ports that are in a TIME_WAIT
state for new connections. It goes against the recommended way
of doing TCP, but unless you're running Hobbit over high-latency
networks it should not cause any problems.
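Note that sysctl settings made on the command line are lost on a
reboot - if any of them help, put them in /etc/sysctl.conf as well
(loaded at boot, or on demand with 'sysctl -p'):

    # /etc/sysctl.conf
    net.ipv4.neigh.default.gc_thresh1 = 3072
    net.ipv4.neigh.default.gc_thresh2 = 4096
    net.ipv4.tcp_tw_reuse = 1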
> >I don't know how the change in "POSIX threading" plays into this.
> >Hobbit is not a threaded application, it is plain and simple
> >single-task application all the way through. It may have some
> >meaning in relation to NFS.
>
> Oops... it's not multithreaded? I'm not a programmer, but... how can
> it follow 3000 hosts sending data without multithreading?
By avoiding all the overhead of using threads :-)
Seriously, 3000 hosts on a 5-minute cycle is only 10 hosts/second.
Each host triggers perhaps 5-10 connections (e.g. an old client
reporting cpu,disk,memory,msgs,procs,conn), and since the core
daemon isn't doing any disk I/O handling 50-100 connections per
second isn't that big a deal.
> However, here's the news: the problem persists only with RHEL5 on
> the x86_64 architecture, with all kinds of 2.6 kernels.
> With RHEL5 on x86 (32-bit) the bug doesn't appear.
It's quite odd that there is a problem on x86-64, but not on x86-32.
One (I) would expect the 64-bit systems to have a bit more "oomph",
so they should be the ones that work best.
A datapoint here. I'm also running Hobbit on a 64-bit Linux
platform, but it is using SPARC (Sun) hardware. Kernel is
2.6.18-6-sparc64. This hardware is *ancient* (about 10 years old),
but handles twice the number of hosts and statuses that you have.
I do have the RRD's on a different server, though.
> However, the problem also exists in our Hobbit lab (also 64-bit)
> when stressing Hobbit with more than 20 "virtual hosts"
So you're saying that on a RHEL 5.3 64-bit Intel server, setting
up Hobbit and feeding it with data from ~20 clients will make
the system break?
I think I would have heard about it before if this was a general
problem.
Regards,
Henrik