[hobbit] RHEL5 and status-board not available bug?

Buchan Milne bgmilne at staff.telkomsa.net
Mon Feb 16 14:55:26 CET 2009


On Monday 16 February 2009 13:35:51 Flyzone Micky wrote:
> On Thu, Feb 12, 2009 at 06:06:48PM +0000, Flyzone Micky wrote:
> >"really low" as in ... how much ?
>
> Output of iostat command:
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            2.22    0.00    0.91    3.62    0.00   93.26
>
> This is the output of iostat about nfs:
> Device:              rBlk_nor/s   wBlk_nor/s   rBlk_dir/s   wBlk_dir/s   rBlk_svr/s   wBlk_svr/s    rops/s    wops/s
> vnetapp:/vol/hobbit     1631.11       373.97         0.00         0.00      1170.83       825.22    840.76    840.76
>

Unfortunately, this doesn't show anything about how the underlying IO system 
is performing. The load average for this host would be relevant, as well as 
iostat-type data for the NFS server, and any stats available from the actual 
disks.

E.g., 1 NFS bulk operation could translate to 16 IOPS on the "spindle", so you 
could be doing 25000 IOPS, which is quite serious IO (you probably need at 
least 160 fast spindles to manage that). Or, it could translate to less. So, 
you need to check your storage system.
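
For reference, the back-of-envelope arithmetic above can be sketched as follows. The numbers are assumptions taken from the iostat output earlier in the thread (rops/s + wops/s of roughly 1680, a worst-case amplification of 16 back-end IOs per NFS op, ~160 random IOPS per fast spindle), not measurements:

```shell
# Illustrative only: worst-case NFS-to-spindle IO amplification.
nfs_ops=1680        # ~rops/s + wops/s from the iostat output above
amplification=16    # assumed worst-case back-end IOs per NFS op
per_spindle=160     # rough random-IOPS capacity of one fast spindle

backend_iops=$((nfs_ops * amplification))
# Round up to whole spindles
spindles=$(( (backend_iops + per_spindle - 1) / per_spindle ))
echo "back-end IOPS: $backend_iops, spindles needed: ~$spindles"
```

With the best-case amplification (1:1) the same load is trivial, which is why the storage system itself has to be measured.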

> This last iostat output also includes rsync activity, because I was
> running an rsync to the local disk of the hobbit server.
>
> Unfortunately nfsstat doesn't show
>
> >of all the RRD files - takes about 8 minutes. No chance at all
> >then of keeping up with 5-minute update cycles.
>
> But wouldn't a warning like this appear in that case (which I don't have)?
> WARNING: Runtime 110 longer than BBSLEEP
>
> >I really think you should try shutting off the hobbitd_rrd tasks,
> >just to see what happens.
>
> Maybe I missed mentioning it in my last post, but I have already done
> that, and it didn't solve the problem.
>
> >For hosts to go purple they have to go more than 30 minutes without
> >an update - they don't go purple just because they miss a single
> >update.
>
> Right...but it doesn't always appear. I also remember an old patch in
> all-in-one about dirty data, but it was already applied.
>
> >I suppose you have check the kernel logs ('dmesg' output) for
> >anything odd ?
>
> Done, along with all the system and hobbit logs. No message that
> could help.
>
> >I'm wondering if maybe you're running out of ports (there's only
> >64K of them, only about half can be used by normal apps). How
> >many ports do you have in TIME_WAIT state ?
>
> Excluded; the ports in use are 235-300 at maximum. For the kernel
> parameters I also tried (as for Oracle):
> net.ipv4.ip_local_port_range = 1024 65000
> but nothing changes with or without it.
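
For reference, TIME_WAIT usage and the ephemeral port range can be checked like this (diagnostic commands, not output from the affected host; 'ss -tan state time-wait' replaces netstat on newer kernels):

```shell
# Count sockets currently in TIME_WAIT
netstat -ant | awk '$6 == "TIME_WAIT"' | wc -l

# Show the current ephemeral port range
cat /proc/sys/net/ipv4/ip_local_port_range

# Widen it as in the post (immediate; add to /etc/sysctl.conf to persist)
sysctl -w net.ipv4.ip_local_port_range="1024 65000"
```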
>
> >Another thing is the size of the ARP cache, if your hosts are
> >all on the same IP network or your router/firewall is doing
> >proxy-arp.
>
> There are about 4 different networks.
> And anyway, remember my test with just 20 clients.
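
For reference, the ARP-cache suggestion above can be checked with something like this (a diagnostic sketch, not from the original thread; if the entry count approaches the gc thresholds, ARP churn can stall new connections):

```shell
# Number of entries currently in the ARP cache
arp -an | wc -l

# Kernel garbage-collection thresholds for the neighbour table
sysctl net.ipv4.neigh.default.gc_thresh1 \
       net.ipv4.neigh.default.gc_thresh2 \
       net.ipv4.neigh.default.gc_thresh3
```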
>
> >Is this server also running the network tests ?
> > ...
> >     sysctl net.ipv4.tcp_tw_reuse=1
> >which enables the kernel to re-use ports that are in a TIME_WAIT
>
> Yes, but as before...it also appears with just 20 clients, so I would
> exclude a problem related to the number of clients.
> However, I also tried:
> net.ipv4.tcp_fin_timeout = 30
> instead of the RHEL5 default of 120 seconds for leaving a port in
> TIME_WAIT state.
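
For reference, the two TIME_WAIT tunings mentioned in the thread can be applied together like this (a sketch; to persist across reboots, put the `key = value` lines in /etc/sysctl.conf and run 'sysctl -p'):

```shell
# Allow the kernel to re-use sockets in TIME_WAIT for new connections
sysctl -w net.ipv4.tcp_tw_reuse=1

# Shorten the FIN timeout, as tried above
sysctl -w net.ipv4.tcp_fin_timeout=30
```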
>
> >One (I) would expect the 64-bit systems to have a bit more "oomph"
> >so they should be the ones that worked best.
>
> Ahm...what is "oomph"? :-S
>
> >A datapoint here. I'm also running Hobbit on a 64-bit Linux
> >platform, but it is using SPARC (Sun) hardware.
>
> We are trying to shut down all our SPARC machines and move to Linux.. :)
>
> >So you're saying that on a RHEL 5.3 64-bit Intel server, setting
> >up Hobbit and feeding it with data from ~20 clients will make
> >the system break?
>
> Yes, that is the point: RHEL > 5.0 and 64-bit (AMD)...
> I still need to try Fedora 10 64-bit

My workstation is running RHEL 5.2 on a Sun Ultra 40, and Hobbit (well, 
devmon) is polling about 10 network devices, and getting client reports from 
about 4 VMs (hobbitd gets 1.7 messages/sec), updating 2300 RRD files, and I've 
never seen this.

In the production environment, my hobbit on RHEL5 x86_64 is only doing 
polling/testing/proxying (the display is on a RHEL4 i386).

Regards,
Buchan
