RRD crashing high availability hobbit

j.sansford at ntlworld.com
Thu Aug 20 12:06:30 CEST 2009


Hi again all,

I need some help working out why our hobbit servers are crashing (due to rrd, which I shall explain shortly) and how to get around it. We have 3 hobbit servers with proxies; however, I will simplify this explanation to just 2 hobbits and no proxies (we discovered the same thing happens in that setup).

Detail of theoretical setup:

1) 2 datacentres. Each datacentre contains a single hobbit server instance.
2) Each client reports to their local datacentre hobbit server.
3) Each hobbit server is configured so that it knows about the other hobbit (through BBDISPLAYS).


The issue is that for what looks like most server-side tests, such as vmstat, we are getting feedback loops between the hobbit servers.

For instance: a hobbit server in DC1 tests a client in DC1 using vmstat. The client reports back to the hobbit in DC1, which then also forwards this data to the hobbit in DC2. The hobbit in DC2, however, is configured to report to DC1 and so bounces the message back (I think). The server therefore tries to update the rrd twice within the same second, resulting in errors. Eventually this crashes the server. An example of the rrd error messages:

2009-08-20 11:04:04 RRD error updating /export/home/hobbit/data/rrd/h3-avm-dbx/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762644 when last update time is 1250762644 (minimum one second step)
2009-08-20 11:04:06 RRD error updating /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762646 when last update time is 1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762646 when last update time is 1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762646 when last update time is 1250762646 (minimum one second step)
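
To narrow down which rrd files and which sending address are involved, the error lines above can be tallied with a small script. This is only a rough sketch, not anything that ships with hobbit: the regular expression is inferred purely from the sample lines quoted here, and the names count_duplicate_updates and SAMPLE are made up for illustration. It counts errors where the attempted update time equals the last update time, grouped by rrd file and sender.

```python
import re
from collections import Counter

# Pattern inferred from the error lines quoted in this post (an assumption,
# not a documented log format). \s+ before "when" tolerates a wrapped line.
ERROR_RE = re.compile(
    r"RRD error updating (?P<path>\S+) from (?P<source>\S+): "
    r"illegal attempt to update using time (?P<attempt>\d+)\s+"
    r"when last update time is (?P<last>\d+)"
)


def count_duplicate_updates(log_text):
    """Count same-timestamp update errors per (rrd file, sender address)."""
    counts = Counter()
    for match in ERROR_RE.finditer(log_text):
        # Only count the case seen above: attempted time equals last update time.
        if match.group("attempt") == match.group("last"):
            counts[(match.group("path"), match.group("source"))] += 1
    return counts


# Hypothetical input: the four error lines from this post, pasted verbatim.
SAMPLE = """\
2009-08-20 11:04:04 RRD error updating /export/home/hobbit/data/rrd/h3-avm-dbx/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762644
 when last update time is 1250762644 (minimum one second step)
2009-08-20 11:04:06 RRD error updating /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762646 when last update time is 1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762646 when last update time is 1250762646 (minimum one second step)
2009-08-20 11:04:06 RRD error updating /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1: illegal attempt to update using time 1250762646 when last update time is 1250762646 (minimum one second step)
"""

if __name__ == "__main__":
    for (path, source), n in count_duplicate_updates(SAMPLE).most_common():
        print(f"{n:4d}  {source}  {path}")
```

Run against the real log instead of SAMPLE, a sender address that always turns out to be the other hobbit server would support the bounce-back theory; if it is the client or the local server itself, the cause is probably elsewhere.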


My question is: how can we stop this happening, and why is it happening in the first place? Is there a way to disable rrd graphing on one server so that just one hobbit server handles the graphing?

I hope that makes sense. If you need further clarification please let me know.

Cheers
James


