
Re: [hobbit] Re: hobbit_rrd stops working after about 1 hour



Hi,

It happened a third time for me last night (3 times in 3 weeks).
Symptoms: hobbitd seems to slow down and stops graphing.


I think Naeem and I are hitting a bug.


I looked closer last night and saw that hobbitd_rrd was running at 100% on
the CPU it was on.
I tried to strace the process, but strace wouldn't give me any output!
I finally killed hobbitd_rrd, and everything went back to normal.
hobbitd.log has: Task rrdstatus terminated, status 1
rrd_status.log has: Worker process died with exit code 1, terminating
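
When strace stays silent on a process that is burning CPU, the state letter and CPU counters in /proc/<pid>/stat can help tell a userspace spin from a process stuck on I/O. A minimal Python sketch, assuming a Linux /proc filesystem; the function names are illustrative and not part of hobbit:

```python
def parse_proc_stat(text):
    """Parse the contents of /proc/<pid>/stat into a small dict."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces, so split on the last ')' rather than on whitespace.
    head, _, tail = text.rpartition(")")
    pid_part, comm = head.split("(", 1)
    fields = tail.split()
    # fields[0] is the state (field 3 of the file);
    # utime and stime are fields 14 and 15, i.e. indexes 11 and 12 here.
    return {
        "pid": int(pid_part),
        "comm": comm,
        "state": fields[0],
        "utime": int(fields[11]),  # clock ticks spent in user mode
        "stime": int(fields[12]),  # clock ticks spent in kernel mode
    }


def read_proc_stat(pid):
    """Read and parse /proc/<pid>/stat for a live process."""
    with open("/proc/%d/stat" % pid) as f:
        return parse_proc_stat(f.read())
```

Sampling this twice a few seconds apart is usually enough: state R with utime climbing and stime flat suggests a loop that makes no system calls (which would also explain the empty strace), while state D is the "uninterruptible sleep" Naeem describes below.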


During normal running, vmstat shows an i/o wait of 25%.
My problems always happened at night, exactly at the time Legato starts.


-> Something strange is happening with hobbitd_rrd when the server is under
very heavy i/o.
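
To line up the failures with the i/o load, the iowait share that vmstat reports can also be logged from /proc/stat. A rough sketch, assuming a kernel whose aggregate "cpu" line includes an iowait column (the fifth value); the function names are hypothetical:

```python
def iowait_percent(before, after):
    """Return the iowait share (in percent) between two 'cpu' lines."""
    # Aggregate line layout: cpu user nice system idle iowait irq softirq steal ...
    # Only the first eight counters are summed, since the later guest
    # counters are already included in user time.
    a = [int(x) for x in before.split()[1:9]]
    b = [int(x) for x in after.split()[1:9]]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


def read_cpu_line():
    """Return the aggregate 'cpu' line, which is the first line of /proc/stat."""
    with open("/proc/stat") as f:
        return f.readline()
```

Calling read_cpu_line() once a minute and feeding consecutive samples to iowait_percent() would show whether the i/o wait jumps at the moment the graphs stop.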


Henrik, could this be an OS issue, or is it more likely a hobbitd_rrd problem?




Olivier


Quoting Naeem.Maqsud (at) sybase.com:

> Well, as nobody has suggested anything to my problem I guess that I'm the
> only one having this issue. I have managed to find the root cause. The
> hobbitd_rrd process was showing to be in "uninterruptible sleep" state most
> of the time with high iowait associated with the CPU it was running on. I
> suspected that the problem may be due to disk IO while updating rrds for
> the 2000 hosts.
> I created a tmpfs filesystem and copied the rrd directory into it. Since
> then (48 hours ago) my rrd graphs have been updating continuously. I do
> however need to write back to disk periodically to avoid loss of data after
> a reboot.
> 
> This is OK as a temporary fix but I would like to have a permanent
> solution. I would like to hear from other hobbit users who have more than
> 1000 hosts monitored. What type of servers and disk subsystems are they
> using? Perhaps my problem is to do with RedHat and Dell server combination.
> Perhaps I need to stripe over multiple spindles.
> 
> -Naeem
> 
> 
> 
> From:    Naeem Maqsud/SYBASE
> Date:    08/18/2005 05:02 PM
> To:      hobbit (at) hswn.dk
> Subject: hobbit_rrd stops working after about 1 hour
> 
> 
> 
> 
> Hi,
> 
> I'm testing out hobbit 4.1.1 for possible migration from big brother (with
> bbgen). I suspected scalability issues with BB as my rrd graphs were
> updated intermittently. However, hobbit is exhibiting similar problems.
> After about 1 hr of restarting hobbit, the rrd graphs stop updating except
> for the cpu utilization for the hobbit server itself.
> 
> The hobbit server is running RedHat Linux AS 3.0. It has 2 x 2.4 GHz Xeon
> processors and 1GB of memory. About 800 servers are sending updates to the
> hobbit server. Another 1200 servers are getting remote tests.
> 
> Load average has stayed below 1 most of the time. CPU usage has been low
> with 75% idle. 4 CPUs show up due to hyperthreading and I've noticed that
> after the restart of hobbit server, hobbitd_rrd process stays on CPU3 with
> 100% utilization for the one hour that it is busy.
> 
> I hope someone can shed some light on this.
> 
> Thanks,
> Naeem
> 
> 
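
The periodic write-back from tmpfs that Naeem mentions above could be done along these lines. This is only a sketch under assumptions: both directory paths are supplied by the caller, only files ending in .rrd are copied, and nothing here is part of hobbit itself.

```python
import os
import shutil


def sync_rrds(src_root, dst_root):
    """Copy new or changed .rrd files from src_root to dst_root.

    Returns the number of files copied.
    """
    copied = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.join(dst_root, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            if not name.endswith(".rrd"):
                continue
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            # Skip files whose on-disk copy is at least as new as the tmpfs copy.
            if os.path.exists(dst) and os.path.getmtime(dst) >= os.path.getmtime(src):
                continue
            # Copy to a temporary name first, then rename, so a crash in
            # mid-copy never leaves a truncated file under the real name.
            tmp = dst + ".tmp"
            shutil.copy2(src, tmp)
            os.replace(tmp, dst)
            copied += 1
    return copied
```

Run from cron every few minutes, this limits the data lost after a reboot to one interval. A file copied while hobbitd_rrd is updating it may still be caught half-written, so the on-disk copy is best treated as a backup and not as a live mirror.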


--
Olivier Beau