[hobbit] RRD crashing high availability hobbit

Fri Aug 21 17:26:18 CEST 2009

Hi,
I saw that the problem is in the creation of the RRD for xtstats for NetApp filers.
Which version of the netapp.pl package do you have installed? Have you applied the latest patch included in the hobbit_perl_client distribution to the Hobbit 4.2.3 server?

The latest version of hobbit_perl_client (v1.21) contains a correction in the netapp.pl code, and also a patch to be applied to a clean 4.2.3 that should solve a hobbitd_rrd crash in the xtstats function, caused by the different kinds of data sent by different storage software versions.

If your hobbitd_rrd still crashes after applying the patch, can you run hobbitd_rrd with -debug as suggested and try to extract the xtstats data that makes the server crash (or send me the last 5-6 minutes of those logs), so I can analyze what the module is receiving and what is going wrong?
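
As a rough sketch of how the last few minutes of a log could be pulled out, something like the following might help. It assumes each log entry starts with a "YYYY-MM-DD HH:MM:SS" timestamp, as the RRD error lines quoted further down do. The function name last_minutes and the "xtstats" keyword filter are illustrative assumptions, not part of Hobbit, and the actual debug output may look different.

```python
from datetime import datetime, timedelta

# Assumed timestamp prefix, matching the RRD error lines quoted in this thread.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def last_minutes(lines, minutes=6, keyword=None):
    """Return the log lines from the final `minutes` of the log.

    Lines without a leading timestamp are treated as continuations of the
    previous timestamped line. If `keyword` is given, only lines containing
    it are returned.
    """
    stamped = []
    current = None
    for line in lines:
        try:
            current = datetime.strptime(line[:19], TS_FORMAT)
        except ValueError:
            pass  # continuation line: keep the previous timestamp
        if current is not None:
            stamped.append((current, line.rstrip("\n")))
    if not stamped:
        return []
    # Measure the window back from the newest entry, not from "now".
    cutoff = max(ts for ts, _ in stamped) - timedelta(minutes=minutes)
    return [text for ts, text in stamped
            if ts >= cutoff and (keyword is None or keyword in text)]
```

The window is measured back from the newest entry in the log rather than from the current time, so it also works on a log copied off the server after the crash.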

Thanks
Francesco

-----Original Message-----
From: j.sansford at ntlworld.com [mailto:j.sansford at ntlworld.com] 
Sent: Thursday, 20 August 2009 18:34
To: hobbit at hswn.dk; Buchan Milne
Subject: Re: [hobbit] RRD crashing high availability hobbit

Hi Buchan,

We get a core dump; running pstack on it gives the following info:

core 'core' of 11142:   hobbitd_rrd --rrddir=/export/home/hobbit/data/rrd
 fed28a17 _lwp_kill (1, 6) + 7
 fecd1d63 raise    (6) + 1f
 fecb1bad abort    (806fe88, fecd55f6, 8768eb0, 806a6ca, fed901c0, 0) + cd
 08060291 xstrdup  (0, 806a6ca, 87d9d1c, 8081cc0, 84ed451, 0) + 31
 0805bf7c do_netapp_extratest_rrd (84ec4ff, 806af10, 84ec8fa, 4a8b1bbf, 8081a00, 8081cc0) + 200
 0805c1c9 do_netapp_extrastats_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 4a8b1bbf) + e1
 0805e0ea update_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 0) + 7d6
 08054044 main     (2, 804613c, 8046148) + 4dc
 080539fc _start   (2, 8046484, 8046490, 0, 80464b6, 80464f6) + 80
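
Reading the trace, abort is reached from xstrdup, and the first value displayed for xstrdup is 0. That would be consistent with a NULL string being duplicated inside do_netapp_extratest_rrd, although this is an interpretation of the trace rather than something confirmed from the source, and the values pstack prints are not guaranteed to be the real arguments. A minimal sketch for picking such frames out of pstack output could look like this (parse_frames and null_first_arg are hypothetical helper names):

```python
import re

# Matches frames such as " 08060291 xstrdup  (0, 806a6ca, ...) + 31".
FRAME_RE = re.compile(r"^\s*([0-9a-f]+)\s+(\S+)\s*\(([^)]*)\)")


def parse_frames(pstack_text):
    """Return a list of (function_name, [args]) from pstack-style output."""
    frames = []
    for line in pstack_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            args = [a.strip() for a in m.group(3).split(",") if a.strip()]
            frames.append((m.group(2), args))
    return frames


def null_first_arg(frames):
    """Names of the functions whose first displayed argument is 0."""
    return [name for name, args in frames if args and args[0] == "0"]
```

Run against the trace above, this should flag only the xstrdup frame, with do_netapp_extratest_rrd as its caller.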

Note that as of 5.30pm today rrd-status.log is 127MB and full of errors, spanning 607625 lines (this is just for today; we roll the logs each night). This seems abnormally large to me, and I think this is what eventually crashes the server.

Hope this helps. I will try to take a deeper look at the logs next time it happens... it seems to happen around once or twice a week.
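
For that deeper look, a small sketch like the one below could show which RRD files account for most of the error lines. It assumes each message sits on a single line in rrd-status.log, in the "RRD error updating <file> from <sender>:" form quoted further down; count_errors is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed message form: "RRD error updating <rrd file> from <sender>: ..."
ERROR_RE = re.compile(r"RRD error updating\s+(\S+)\s+from\s+(\S+?):")


def count_errors(lines):
    """Count 'RRD error updating' messages per RRD file."""
    counts = Counter()
    for line in lines:
        m = ERROR_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


def report(path, top=10):
    """Print the `top` RRD files with the most errors in the log at `path`."""
    # The log is read line by line, so a 127MB file is not loaded at once.
    with open(path, errors="replace") as log:
        for rrd_file, n in count_errors(log).most_common(top):
            print(n, rrd_file)
```

Calling report() with the path to rrd-status.log would list the files with the most errors first, which may help tell whether the errors come from a few tests or from everything that is forwarded between the servers.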

Cheers
James.

---- Buchan Milne <bgmilne at staff.telkomsa.net> wrote: 
> On Thursday, 20 August 2009 11:06:30 j.sansford at ntlworld.com wrote:
> > Hi again all,
> >
> > I need some help configuring/debugging why our hobbit servers are crashing
> > (due to rrd, which I shall explain shortly) and how to get around this. We
> > have 3 hobbit servers with proxies, however I will simplify this
> > explanation with just 2 hobbits and no proxies (as we discovered the same
> > thing happens).
> >
> > Detail of theoretical setup:
> >
> > 1) 2 datacentres. Each datacentre contains a single hobbit server instance.
> > 2) Each client reports to their local datacentre hobbit server.
> > 3) Each hobbit server is configured such that they know about the other
> > hobbit (through BBDISPLAYS).
> >
> >
> > The issue is that for what looks like most server side tests, such as
> > vmstat etc, we are getting feedback loops between the hobbit servers.
> >
> > For instance: A hobbit server in DC1 tests a client in DC1 using vmstat.
> > The client reports back to hobbit in DC1 and hobbit then also reports this
> > data to the hobbit in DC2. The hobbit in DC2 however is configured to
> > report to DC1 and so bounces the message back (I think). Therefore the
> > server tries to update the rrd twice within a second resulting in errors.
> > Eventually this will crash the server.
> 
> How did you determine that this is what is "crashing" the server?
> 
> > An example of the rrd error
> > messages:
> >
> > 2009-08-20 11:04:04 RRD error updating
> > /export/home/hobbit/data/rrd/h3-avm-dbx/ifstat.mac.rrd from 10.6.60.1:
> > illegal attempt to update using time 1250762644 when last update time is
> > 1250762644 (minimum one second step)
> > 2009-08-20 11:04:06 RRD error updating
> > /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
> > illegal attempt to update using time 1250762646 when last update time is
> > 1250762646 (minimum one second step)
> > 2009-08-20 11:04:06 RRD error updating
> > /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
> > illegal attempt to update using time 1250762646 when last update time is
> > 1250762646 (minimum one second step)
> > 2009-08-20 11:04:06 RRD error updating
> > /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
> > illegal attempt to update using time 1250762646 when last update time is
> > 1250762646 (minimum one second step)
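
The "minimum one second step" rule behind these messages can be illustrated with a small model: an update is rejected when its timestamp is not later than the last update recorded for the same RRD file, so the same report arriving twice within one second produces an error for every copy after the first. This is only a sketch of the timestamp rule, not of rrdtool or hobbitd_rrd themselves, and accept_updates is a hypothetical name.

```python
def accept_updates(updates):
    """Split (rrd_file, timestamp) pairs into accepted and rejected lists.

    An update is rejected when its timestamp is less than or equal to the
    last accepted timestamp for the same file.
    """
    last = {}
    accepted, rejected = [], []
    for rrd_file, ts in updates:
        if rrd_file in last and ts <= last[rrd_file]:
            rejected.append((rrd_file, ts))
        else:
            last[rrd_file] = ts
            accepted.append((rrd_file, ts))
    return accepted, rejected


# Three copies of the same report in one second, then a later one.
updates = [
    ("h2-emu13/ifstat.mac.rrd", 1250762646),
    ("h2-emu13/ifstat.mac.rrd", 1250762646),
    ("h2-emu13/ifstat.mac.rrd", 1250762646),
    ("h2-emu13/ifstat.mac.rrd", 1250762647),
]
accepted, rejected = accept_updates(updates)
```

In this model the duplicates are simply dropped, which matches the errors above being logged per rejected update. It says nothing about whether they could also lead to a crash.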
> 
> I have a number of setups where messages like this are common, due to running 
> network tests and SNMP polling at intervals smaller than 5 minutes (without 
> adjusting all the RRD files to cater to this), and I have not seen hobbit 
> "crash" due to this.
> 
> What is the behaviour you see when it "crashes the server" ? Does hobbitd_rrd 
> die and leave a status message? Or, does something else occur? Does the server 
> reboot? Does the OS hang? How often does this occur?
> 
> > My question is - how can we stop this happening?
> 
> You would first need to tell us what is happening ...
> 
> > Also, why is this
> > happening? Is there a way we can disable rrd graphing on one server so just
> > one hobbit server handles the graphing?
> >
> > I hope that makes sense. If you need further clarification please let me
> > know.
> 
> 
> If hobbitd or hobbitd_rrd or some other process actually crashes, you should 
> be able to get a core file, from which you can get a backtrace (e.g. with gdb), 
> which would allow someone to see why it is crashing, and possibly fix it.
> 
> Regards,
> Buchan
