
Re: [hobbit] Re: hobbit_rrd stops working after about 1 hour



Olivier,

Why don't you try putting your RRD files on a tmpfs filesystem? This seems
to have resolved my RRD problem. At the very least, you can try it to see
whether it resolves your issue; if it does, you will know for sure that the
problem is related to disk I/O. This is what I did:

1. mkdir /usr/local/bbvar/rrd_orig
2. mv /usr/local/bbvar/rrd /usr/local/bbvar/rrd_orig
3. mkdir /usr/local/bbvar/rrd
4. Add the following line to /etc/fstab:
      tmpfs /usr/local/bbvar/rrd tmpfs mode=755,rw,size=2G 0 0

5. mount /usr/local/bbvar/rrd; chown <id of bb user> /usr/local/bbvar/rrd

6. cp -pr /usr/local/bbvar/rrd_orig/rrd/* /usr/local/bbvar/rrd

7. Start Hobbit
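Steps 1-6 can be collected into a small script. This is a minimal sketch, not the exact procedure from this post: the user name `hobbit`, the `BBVAR`/`BBUSER` variables and the `DRY_RUN` switch are assumptions added for illustration. It expects Hobbit to be stopped, the fstab line from step 4 to be in place already, and it must run as root.

```shell
#!/bin/sh
# Sketch of steps 1-6: move the Hobbit RRD files onto a tmpfs mount.
# Assumptions (not from the post): the BBVAR and BBUSER defaults below,
# Hobbit is stopped, and the fstab line from step 4 already exists.
# With DRY_RUN=1 the commands are only printed, not executed.

BBVAR=${BBVAR:-/usr/local/bbvar}
BBUSER=${BBUSER:-hobbit}    # hypothetical; use the id of your BB user

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Each step runs only if the previous one succeeded.
rrd_to_tmpfs() {
    run mkdir "$BBVAR/rrd_orig" &&                     # step 1
    run mv "$BBVAR/rrd" "$BBVAR/rrd_orig" &&           # step 2: data ends up in rrd_orig/rrd
    run mkdir "$BBVAR/rrd" &&                          # step 3: empty mount point
    run mount "$BBVAR/rrd" &&                          # step 5: mounts via the fstab entry
    run chown "$BBUSER" "$BBVAR/rrd" &&                # chown after mounting, as in step 5
    run cp -pr "$BBVAR/rrd_orig/rrd/." "$BBVAR/rrd"    # step 6; "/." also copies dot-files
}
```

Running `DRY_RUN=1 rrd_to_tmpfs` first shows the six commands without touching anything; step 7 (starting Hobbit) is left to you.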

If you want to keep this as a permanent solution, you will need to set up a
cron job to periodically copy the RRD files from the tmpfs filesystem back
to disk, because everything on a tmpfs filesystem is lost when it is
unmounted (for example, at a reboot). The crontab line below runs the copy
daily at 8:30 PM:

      30 20 * * *  rsync -av /usr/local/bbvar/rrd /usr/local/bbvar/rrd_orig
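A slightly more defensive version of that cron job might wrap the rsync in a function that skips the run while a previous copy is still in progress. This is a sketch under assumptions: the function name and the lock directory name are hypothetical, and it uses `rsync -a` rather than `-av` so that cron does not mail a file list every night. It does not stop a copy from catching an RRD file in the middle of an update.

```shell
#!/bin/sh
# Sketch of a wrapper for the nightly cron job that copies the tmpfs RRD
# directory back to disk. The lock directory name is hypothetical.
#   $1: tmpfs directory         (default /usr/local/bbvar/rrd)
#   $2: on-disk parent directory (default /usr/local/bbvar/rrd_orig)
sync_rrd_to_disk() (
    src=${1:-/usr/local/bbvar/rrd}
    dst=${2:-/usr/local/bbvar/rrd_orig}
    lock="$dst/.rrd-sync.lock"

    if [ ! -d "$dst" ]; then
        echo "target directory $dst does not exist" >&2
        exit 1
    fi
    # mkdir either creates the lock atomically or fails, so two
    # overlapping runs cannot both get past this point.
    if ! mkdir "$lock" 2>/dev/null; then
        echo "previous RRD sync still running, skipping" >&2
        exit 0
    fi

    # No trailing slash on $src: rsync recreates it as $dst/rrd,
    # matching the rrd_orig/rrd layout from steps 2 and 6.
    rsync -a "$src" "$dst"
    status=$?

    # Release the lock and pass on rsync's exit status.
    rmdir "$lock"
    exit "$status"
)
```

Saved as a script that ends with a call to `sync_rrd_to_disk`, it could be run from the crontab line in place of the bare rsync command.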

Remember that every time you reboot, you will need to copy the files from
disk back into the tmpfs filesystem. You can put a line in /etc/rc.local to
do this for you.
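The rc.local step could look roughly like the function below. It is a sketch, not Naeem's actual rc.local line: it assumes fstab has already mounted the empty tmpfs at boot, that Hobbit is started only after this has run, and that the Hobbit user is called `hobbit` (a hypothetical name). The restored data is only as recent as the last cron copy.

```shell
#!/bin/sh
# Sketch of the boot-time restore for /etc/rc.local. Assumes fstab has
# already mounted the (empty) tmpfs and that Hobbit starts only after
# this has run. The owner name "hobbit" is hypothetical.
#   $1: on-disk copy (default /usr/local/bbvar/rrd_orig/rrd)
#   $2: tmpfs mount  (default /usr/local/bbvar/rrd)
#   $3: Hobbit user  (default hobbit)
restore_rrd_from_disk() (
    disk=${1:-/usr/local/bbvar/rrd_orig/rrd}
    mnt=${2:-/usr/local/bbvar/rrd}
    owner=${3:-hobbit}

    # Refuse to run if there is no on-disk copy to restore from.
    if [ ! -d "$disk" ]; then
        echo "no RRD copy found in $disk" >&2
        exit 1
    fi
    # A fresh tmpfs is owned by root, so hand it back to the Hobbit user
    # (as in step 5), then copy the files in (as in step 6).
    chown "$owner" "$mnt" &&
    cp -pr "$disk/." "$mnt"
)
```

In /etc/rc.local the function would be defined and then called with no arguments, `restore_rrd_from_disk`.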

Hope this helps.

-Naeem



                                                                           
From:    Olivier Beau <olivier (at) qalpit.com>
To:      hobbit (at) hswn.dk
Date:    08/22/2005 11:28 PM
Subject: Re: [hobbit] Re: hobbit_rrd stops working after about 1 hour
Please respond to hobbit (at) hswn.dk

Hi,

It happened a third time for me tonight (3 times in 3 weeks).
Symptoms: hobbitd seems to slow down and stops graphing.


I think Naeem and I are hitting a bug.


I looked closer tonight, and I saw that hobbitd_rrd was running at 100% on
the CPU it was on. I tried to strace the process, but strace wouldn't give
me any output! I finally killed hobbitd_rrd, and everything went back to
normal.
hobbitd.log has: Task rrdstatus terminated, status 1
rrd_status.log has: Worker process died with exit code 1, terminating


During normal running, vmstat shows an I/O wait of 25%.
My problems have always happened at night, exactly when Legato starts.


-> Something strange happens with hobbitd_rrd when the server is under
very heavy I/O.
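The symptoms described here (a busy hobbitd_rrd, high I/O wait) can be checked with standard Linux tools. The helper below is a hypothetical sketch that assumes the procps version of `ps`: a `D` in the STAT column means the process is in uninterruptible sleep (typically waiting on disk I/O), and PSR is the CPU it last ran on. The `wa` column of `vmstat 5` shows the I/O wait percentage mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: show the scheduler state of every process with
# the given command name (default hobbitd_rrd). Needs procps "ps" (-C, -o).
#   STAT  R = running, S = sleeping, D = uninterruptible sleep (disk I/O)
#   PSR   CPU the process last ran on
#   WCHAN kernel function the process is waiting in, if any
show_proc_state() {
    ps -o pid,stat,psr,pcpu,wchan:20,comm -C "${1:-hobbitd_rrd}"
}
```

Running `show_proc_state` a few times while the backup is active, next to `vmstat 5`, would show whether hobbitd_rrd is stuck in `D` state rather than actually computing.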


Henrik, could this be an OS issue, or is it more of a hobbitd_rrd problem?




Olivier


Quoting Naeem.Maqsud (at) sybase.com:

> Well, as nobody has suggested anything for my problem, I guess that I'm the
> only one having this issue. I have managed to find the root cause. The
> hobbitd_rrd process was in "uninterruptible sleep" state most of the time,
> with high iowait on the CPU it was running on. I suspected that the
> problem might be due to disk I/O while updating RRDs for the 2000 hosts.
> I created a tmpfs filesystem and copied the rrd directory into it. Since
> then (48 hours ago) my RRD graphs have been updating continuously. I do,
> however, need to write back to disk periodically to avoid losing data
> after a reboot.
>
> This is OK as a temporary fix, but I would like to have a permanent
> solution. I would like to hear from other Hobbit users who monitor more
> than 1000 hosts. What type of servers and disk subsystems are they using?
> Perhaps my problem has to do with the Red Hat and Dell server combination.
> Perhaps I need to stripe over multiple spindles.
>
> -Naeem
>
> From:    Naeem Maqsud/SYBASE
> To:      hobbit (at) hswn.dk
> Date:    08/18/2005 05:02 PM
> Subject: hobbit_rrd stops working after about 1 hour
>
> Hi,
>
> I'm testing Hobbit 4.1.1 for possible migration from Big Brother (with
> bbgen). I suspected scalability issues with BB, as my RRD graphs were
> updated intermittently. However, Hobbit is exhibiting similar problems.
> About 1 hour after restarting Hobbit, the RRD graphs stop updating, except
> for the CPU utilization of the Hobbit server itself.
>
> The Hobbit server is running Red Hat Linux AS 3.0. It has 2 x 2.4 GHz Xeon
> processors and 1 GB of memory. About 800 servers are sending updates to the
> Hobbit server. Another 1200 servers are getting remote tests.
>
> Load average has stayed below 1 most of the time. CPU usage has been low,
> with 75% idle. 4 CPUs show up due to hyperthreading, and I've noticed that
> after a restart of the Hobbit server, the hobbitd_rrd process stays on CPU3
> with 100% utilization for the one hour that it is busy.
>
> I hope someone can shed some light on this.
>
> Thanks,
> Naeem
>
> To unsubscribe from the hobbit list, send an e-mail to
> hobbit-unsubscribe (at) hswn.dk


--
Olivier Beau
