Re: [hobbit] Re: hobbit_rrd stops working after about 1 hour
Hi,
It happened a third time for me last night (3 times in 3 weeks).
Symptoms: hobbitd seems to slow down and stops graphing.
I think Naeem and I are hitting a bug.
I looked closer last night and saw that hobbitd_rrd was running at 100% on
the CPU it was on.
I tried to strace the process, but strace wouldn't give me any output!
I finally killed hobbitd_rrd, and everything went back to normal.
hobbitd.log has : Task rrdstatus terminated, status 1
rrd_status.log has : Worker process died with exit code 1, terminating
During normal running, vmstat shows an i/o wait of 25%.
My problems always happened at night, exactly at the time legato starts
-> something strange is happening with hobbitd_rrd when the server is under
very heavy i/o.
Henrik, could this be an OS issue, or more of a hobbitd_rrd problem?
Olivier
Quoting Naeem.Maqsud (at) sybase.com:
> Well, as nobody has suggested anything to my problem I guess that I'm the
> only one having this issue. I have managed to find the root cause. The
> hobbitd_rrd process was showing to be in "uninterruptible sleep" state most
> of the time with high iowait associated with the CPU it was running on. I
> suspected that the problem may be due to disk IO while updating rrds for
> the 2000 hosts.
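
For anyone who wants to check the same thing on their own server: on Linux the process state reported by ps and top is also available in /proc/<pid>/stat, where "D" means uninterruptible sleep. A minimal sketch of reading it follows. The helper names are made up for illustration and are not part of hobbit.

```python
def parse_proc_state(stat_line: str) -> str:
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def read_proc_state(pid: int) -> str:
    """Read the current state of a running process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_proc_state(f.read())


def is_uninterruptible(state: str) -> bool:
    """True when the state letter is 'D' (uninterruptible sleep, usually disk I/O)."""
    return state == "D"
```

Calling read_proc_state() on the hobbitd_rrd pid every few seconds and counting how often is_uninterruptible() is true gives a rough idea of how much of the time the process is blocked on I/O.
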
> I created a tmpfs filesystem and copied the rrd directory into it. Since
> then (48 hours ago) my rrd graphs have been updating continuously. I do
> however need to write back to disk periodically to avoid loss of data after
> a reboot.
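
The periodic write-back described above could look roughly like the sketch below, run from cron. It is only an illustration: the function name and both directory arguments are hypothetical, and it assumes the RRD files end in ".rrd". A file copied while it is being updated may still be an inconsistent snapshot, so this limits data loss after a reboot rather than guaranteeing a clean copy.

```python
import os
import shutil
import tempfile


def flush_rrds(src_dir: str, dst_dir: str) -> int:
    """Copy every .rrd file under src_dir to the same relative path under dst_dir.

    Returns the number of files copied.
    """
    copied = 0
    for root, _dirs, files in os.walk(src_dir):
        rel = os.path.relpath(root, src_dir)
        target_root = os.path.normpath(os.path.join(dst_dir, rel))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            if not name.endswith(".rrd"):
                continue
            src = os.path.join(root, name)
            dst = os.path.join(target_root, name)
            # Copy to a temporary file in the target directory, then rename it
            # into place, so an interrupted copy never leaves a truncated .rrd
            # under the final name.
            fd, tmp = tempfile.mkstemp(dir=target_root, suffix=".tmp")
            os.close(fd)
            shutil.copy2(src, tmp)
            os.replace(tmp, dst)
            copied += 1
    return copied
```

Here src_dir would be the rrd directory on the tmpfs mount and dst_dir its copy on disk. After a reboot the same function can be called with the arguments swapped to repopulate the tmpfs before hobbit starts.
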
>
> This is OK as a temporary fix but I would like to have a permanent
> solution. I would like to hear from other hobbit users who have more than
> 1000 hosts monitored. What type of servers and disk subsystems are they
> using? Perhaps my problem is to do with RedHat and Dell server combination.
> Perhaps I need to stripe over multiple spindles.
>
> -Naeem
>
> Naeem Maqsud/SYBASE
> 08/18/2005 05:02 PM
>
> To: hobbit (at) hswn.dk
> cc:
> Subject: hobbit_rrd stops working after about 1 hour
>
> Hi,
>
> I'm testing out hobbit 4.1.1 for possible migration from big brother (with
> bbgen). I suspected scalability issues with BB as my rrd graphs were
> updated intermittently. However, hobbit is exhibiting similar problems.
> After about 1 hr of restarting hobbit, the rrd graphs stop updating except
> for the cpu utilization for the hobbit server itself.
>
> The hobbit server is running RedHat Linux AS 3.0. It has 2 x 2.4 GHz Xeon
> processors and 1GB of memory. About 800 servers are sending updates to the
> hobbit server. Another 1200 servers are getting remote tests.
>
> Load average has stayed below 1 most of the time. CPU usage has been low
> with 75% idle. 4 CPUs show up due to hyperthreading and I've noticed that
> after the restart of hobbit server, hobbitd_rrd process stays on CPU3 with
> 100% utilization for the one hour that it is busy.
>
> I hope someone can shed some light on this.
>
> Thanks,
> Naeem
--
Olivier Beau