
Re: [hobbit] RRD crashing high availability hobbit



On Thursday, 20 August 2009 11:06:30 j.sansford (at) ntlworld.com wrote:
> Hi again all,
>
> I need some help configuring/debugging why our hobbit servers are crashing
> (due to rrd, which I shall explain shortly) and how to get around this. We
> have 3 hobbit servers with proxies, however I will simplify this
> explanation with just 2 hobbits and no proxies (as we discovered the same
> thing happens).
>
> Detail of theoretical setup:
>
> 1) 2 datacentres. Each datacentre contains a single hobbit server instance.
> 2) Each client reports to their local datacentre hobbit server.
> 3) Each hobbit server is configured such that they know about the other
> hobbit (through BBDISPLAYS).
>
>
> The issue is that for what looks like most server-side tests, such as
> vmstat, we are getting feedback loops between the hobbit servers.
>
> For instance: A hobbit server in DC1 tests a client in DC1 using vmstat.
> The client reports back to hobbit in DC1 and hobbit then also reports this
> data to the hobbit in DC2. The hobbit in DC2 however is configured to
> report to DC1 and so bounces the message back (I think). Therefore the
> server tries to update the rrd twice within a second resulting in errors.
> Eventually this will crash the server.

How did you determine that this is what is "crashing" the server?

> An example of the rrd error
> messages:
>
> 2009-08-20 11:04:04 RRD error updating
> /export/home/hobbit/data/rrd/h3-avm-dbx/ifstat.mac.rrd from 10.6.60.1:
> illegal attempt to update using time 1250762644 when last update time is
> 1250762644 (minimum one second step)
> 2009-08-20 11:04:06 RRD error updating
> /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
> illegal attempt to update using time 1250762646 when last update time is
> 1250762646 (minimum one second step)
> 2009-08-20 11:04:06 RRD error updating
> /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
> illegal attempt to update using time 1250762646 when last update time is
> 1250762646 (minimum one second step)
> 2009-08-20 11:04:06 RRD error updating
> /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
> illegal attempt to update using time 1250762646 when last update time is
> 1250762646 (minimum one second step)

I have a number of setups where messages like this are common, due to running 
network tests and SNMP polling at intervals smaller than 5 minutes (without 
adjusting all the RRD files to cater to this), and I have not seen hobbit 
"crash" due to this.

What behaviour do you see when it "crashes the server"? Does hobbitd_rrd 
die and leave a status message? Or, does something else occur? Does the server 
reboot? Does the OS hang? How often does this occur?
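
To help answer that, a tally of the rejected updates by RRD file and sender 
would show how often they occur and whether they all arrive from the other 
hobbit server. Below is a rough sketch in Python. It assumes the log format 
matches the error lines quoted earlier; the function name and the sample text 
are only illustrative and are not part of hobbit.

```python
import re
from collections import Counter

# Pattern derived from the quoted error lines; the exact log format is an
# assumption based on that excerpt only.
RRD_ERROR = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) RRD error updating\s+"
    r"(\S+) from (\S+):\s+illegal attempt to update using time (\d+)"
)

# Sample input copied from the quoted messages (quote markers removed).
SAMPLE = """\
2009-08-20 11:04:04 RRD error updating
/export/home/hobbit/data/rrd/h3-avm-dbx/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762644 when last update time is
1250762644 (minimum one second step)
2009-08-20 11:04:06 RRD error updating
/export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762646 when last update time is
1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating
/export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762646 when last update time is
1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating
/export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
illegal attempt to update using time 1250762646 when last update time is
1250762646 (minimum one second step)
"""


def count_duplicate_updates(log_text):
    """Count rejected updates per (rrd file, sender, timestamp)."""
    # Entries may be wrapped over several lines, so collapse whitespace first.
    flat = " ".join(log_text.split())
    counts = Counter()
    for _logged_at, rrd_file, sender, ts in RRD_ERROR.findall(flat):
        counts[(rrd_file, sender, int(ts))] += 1
    return counts


if __name__ == "__main__":
    for (rrd_file, sender, ts), n in count_duplicate_updates(SAMPLE).items():
        print(f"{n} rejected update(s) for {rrd_file} from {sender} at {ts}")
```

If every rejected update comes from the address of the other hobbit server, 
that would support the feedback-loop theory; if they come from elsewhere, the 
cause is more likely the polling interval.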

> My question is - how can we stop this happening?

You would first need to tell us what is happening ...

> Also, why is this
> happening? Is there a way we can disable rrd graphing on one server so just
> one hobbit server handles the graphing?
>
> I hope that makes sense. If you need further clarification please let me
> know.


If hobbitd or hobbitd_rrd or some other process actually crashes, you should 
be able to get a core file, from which you can get a backtrace (e.g. with gdb), 
which would allow someone to see why it is crashing, and possibly fix it.
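
As a rough illustration of getting that backtrace, the sketch below builds a 
non-interactive gdb command line. The two paths are placeholders, and the 
helper name is made up for this example. You have to substitute the real 
binary and core file locations from your own system.

```python
import shlex


def gdb_backtrace_command(binary_path, core_path):
    """Build a gdb command that prints a backtrace of every thread and exits."""
    return [
        "gdb",
        "--batch",                      # run the given commands, then quit
        "-ex", "thread apply all bt",   # backtrace for all threads
        binary_path,
        core_path,
    ]


if __name__ == "__main__":
    # Placeholder paths: replace with the crashed binary and its core file.
    cmd = gdb_backtrace_command("/path/to/hobbitd_rrd", "/path/to/core")
    print(" ".join(shlex.quote(part) for part in cmd))
```

Note that core files are often disabled by default, so the core size limit 
(ulimit -c) for the hobbit user may need raising before a core is written.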

Regards,
Buchan