
Re: [hobbit] RRD crashing high availability hobbit



On Friday, 21 August 2009 00:42:59 David Baldwin wrote:
> j.sansford (at) ntlworld.com wrote:
> > Hi Buchan,
> >
> > We get a core dump, running a pstack gives the following info:
> >
> > core 'core' of 11142:   hobbitd_rrd --rrddir=/export/home/hobbit/data/rrd
> >  fed28a17 _lwp_kill (1, 6) + 7
> >  fecd1d63 raise    (6) + 1f
> >  fecb1bad abort    (806fe88, fecd55f6, 8768eb0, 806a6ca, fed901c0, 0) + cd
> >  08060291 xstrdup  (0, 806a6ca, 87d9d1c, 8081cc0, 84ed451, 0) + 31
> >  0805bf7c do_netapp_extratest_rrd (84ec4ff, 806af10, 84ec8fa, 4a8b1bbf, 8081a00, 8081cc0) + 200
> >  0805c1c9 do_netapp_extrastats_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 4a8b1bbf) + e1
> >  0805e0ea update_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 0) + 7d6
> >  08054044 main     (2, 804613c, 8046148) + 4dc
> >  080539fc _start   (2, 8046484, 8046490, 0, 80464b6, 80464f6) + 80


OK, so it crashed in do_netapp_extratest_rrd, from hobbitd/rrd/do_netapp.c.
I'm not familiar with pstack, but it looks like this may be from a stripped 
binary (or you may be able to get more information out of pstack).

If pstack can't show the values, then you may want to consider running 
hobbitd_rrd with the --debug flag, which should result in some logging of what 
it has received just before it crashes.

> That looks like you are running extratest for a netapp which from what I
> can see in hobbitd/do_rrd.c is what handles the xtstats column reported
> by netapp.pl - just from a cursory glance at the code - I don't use it
> myself. You really need to look at the C code to check it's doing the
> right thing. You have 2 choices - quick fix is to disable just that test
> in netapp.pl - other option is to work out what format it should be and
> fix the test.
>
> In 4.2.3 for example, the do_devmon.c RRD code doesn't actually
> implement what is documented

What is not implemented?

Where do you see this documented?

There is one fix that I have committed in svn (Xymon 4.2 branch, Xymon 4.3 
branch, devmon svn). I am not aware of any other requests or bugs filed on the 
devmon rrd collector.

> and I use a perl script with --extra-script
> instead

Is this the one shipped with devmon, or would you like to contribute a better 
one?

> Various RRD handlers are in hobbitd/rrd/do_*.c
> Looking at the code for xstrdup in lib/memory.c as below you should
> check your logs - it's probably getting called with a NULL pointer
> (unlikely you're out of memory), but the logs should tell you.
>
> char *xstrdup(const char *s)
> {
>         char *result;
>
>         if (s == NULL) {
>                 errprintf("xstrdup: Cannot dup NULL string\n");
>                 abort();
>         }
>
>         result = strdup(s);
>         if (result == NULL) {
>                 errprintf("xstrdup: Out of memory\n");
>                 abort();
>         }
>
> #ifdef MEMORY_DEBUG
>         add_to_memlist(result, strlen(result)+1);
> #endif
>
>         return result;
> }

xstrdup is called twice in do_netapp_extratest_rrd, but seeing the string that 
it's aborting on would help narrow it down. If you can provide the status 
message that made hobbitd_rrd crash (retrieve it using: bb localhost 
'hobbitdlog hostname.testname') it can be used to reproduce this by someone 
trying to fix the bug.
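
For illustration, here is a minimal sketch of the kind of guard that would 
turn this abort into a skipped update, assuming the handler looks up a field 
in the status message and then duplicates it. The function names 
(find_field, dup_field_or_skip) and the "volume: " marker are hypothetical; 
they are not taken from do_netapp.c.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical lookup: return a pointer to the text right after `marker`
 * in `msg`, or NULL when the marker (or either argument) is missing. */
const char *find_field(const char *msg, const char *marker)
{
        const char *p;

        if (msg == NULL || marker == NULL) return NULL;

        p = strstr(msg, marker);
        return (p != NULL) ? p + strlen(marker) : NULL;
}

/* Copy the field up to the end of its line. Unlike passing the lookup
 * result straight to xstrdup(), a missing field is logged and reported
 * to the caller as NULL instead of aborting the process. */
char *dup_field_or_skip(const char *msg, const char *marker)
{
        const char *field, *eol;
        char *result;
        size_t len;

        field = find_field(msg, marker);
        if (field == NULL) {
                fprintf(stderr, "Skipping update: field '%s' not found\n",
                        (marker != NULL) ? marker : "(null)");
                return NULL;
        }

        /* The field ends at the next newline, or at the end of the message */
        eol = strchr(field, '\n');
        len = (eol != NULL) ? (size_t)(eol - field) : strlen(field);

        result = malloc(len + 1);
        if (result == NULL) return NULL;

        memcpy(result, field, len);
        result[len] = '\0';
        return result;
}
```

The caller has to check for NULL and free() the result. Whether skipping the 
update is the right behaviour for the netapp handler depends on what the 
status message actually contains, which is why seeing it would help.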

> > Note that as of 5.30pm today the rrd-status.log file is 127MB, full of
> > errors spanning 607625 lines (this is just for today; we roll the
> > logs each night). This seems abnormally large to me and I think
> > eventually this is crashing the server.

It is still unlikely that this has anything to do with hobbitd_rrd crashing.

Regards,
Buchan