Re: [hobbit] RRD crashing high availability hobbit
- To: hobbit (at) hswn.dk
- Subject: Re: [hobbit] RRD crashing high availability hobbit
- From: Buchan Milne <bgmilne (at) staff.telkomsa.net>
- Date: Fri, 21 Aug 2009 14:27:06 +0100
- Cc: David Baldwin <david.baldwin (at) ausport.gov.au>
- References: <20090820173344.IT1PI.402075.root (at) web05-winn.ispmail.private.ntl.com> <4A8DDF83.4070501 (at) ausport.gov.au>
- User-agent: KMail/1.11.4 (Linux/2.6.27.23-xen-3.4.0-1mdv; KDE/4.2.4; x86_64; ; )
On Friday, 21 August 2009 00:42:59 David Baldwin wrote:
> j.sansford (at) ntlworld.com wrote:
> > Hi Buchan,
> >
> > We get a core dump, running a pstack gives the following info:
> >
> > core 'core' of 11142: hobbitd_rrd --rrddir=/export/home/hobbit/data/rrd
> > fed28a17 _lwp_kill (1, 6) + 7
> > fecd1d63 raise (6) + 1f
> > fecb1bad abort (806fe88, fecd55f6, 8768eb0, 806a6ca, fed901c0, 0) + cd
> > 08060291 xstrdup (0, 806a6ca, 87d9d1c, 8081cc0, 84ed451, 0) + 31
> > 0805bf7c do_netapp_extratest_rrd (84ec4ff, 806af10, 84ec8fa, 4a8b1bbf, 8081a00, 8081cc0) + 200
> > 0805c1c9 do_netapp_extrastats_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 4a8b1bbf) + e1
> > 0805e0ea update_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 0) + 7d6
> > 08054044 main (2, 804613c, 8046148) + 4dc
> > 080539fc _start (2, 8046484, 8046490, 0, 80464b6, 80464f6) + 80
OK, so it crashed in do_netapp_extratest_rrd from hobbitd/rrd/do_netapp.c.
I'm not familiar with pstack, but it looks like this may be from a stripped
binary (or you may be able to get more information from pstack).
If pstack can't show the values, then you may want to consider running
hobbitd_rrd with the --debug flag, which should result in some logging of what
it has received just before it crashes.
> That looks like you are running extratest for a netapp, which, from a
> cursory glance at hobbitd/do_rrd.c, is what handles the xtstats column
> reported by netapp.pl. I don't use it myself. You really need to look at
> the C code to check it's doing the right thing. You have two choices: the
> quick fix is to disable just that test in netapp.pl; the other option is
> to work out what format it should be and fix the test.
>
> In 4.2.3 for example, the do_devmon.c RRD code doesn't actually
> implement what is documented
What is not implemented?
Where do you see this documented?
There is one fix that I have committed in svn (Xymon 4.2 branch, Xymon 4.3
branch, devmon svn). I am not aware of any other requests or bugs filed on the
devmon rrd collector.
> and I use a perl script with --extra-script
> instead
Is this the one shipped with devmon, or would you like to contribute a better
one?
> Various RRD handlers are in hobbitd/rrd/do_*.c
> Looking at the code for xstrdup in lib/memory.c as below you should
> check your logs - it's probably getting called with a NULL pointer
> (unlikely you're out of memory), but the logs should tell you.
>
> char *xstrdup(const char *s)
> {
>     char *result;
>
>     if (s == NULL) {
>         errprintf("xstrdup: Cannot dup NULL string\n");
>         abort();
>     }
>
>     result = strdup(s);
>     if (result == NULL) {
>         errprintf("xstrdup: Out of memory\n");
>         abort();
>     }
>
> #ifdef MEMORY_DEBUG
>     add_to_memlist(result, strlen(result)+1);
> #endif
>
>     return result;
> }
xstrdup is called twice in do_netapp_extratest_rrd, so seeing the string that
it's aborting on would help narrow it down. If you can provide the status
message that made hobbitd_rrd crash (retrieve it using: bb localhost
'hobbitdlog hostname.testname'), someone trying to fix the bug can use it to
reproduce the crash.
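To illustrate the failure mode only (this is not the code in do_netapp.c,
which I have not reproduced here), below is a minimal C sketch of checking a
parsed field for NULL before duplicating it, instead of letting an
xstrdup-style wrapper abort. The names find_value and dup_or_null are
hypothetical, and the "key: value" line layout is an assumption made purely
for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parser: return a pointer to the text that follows key in
 * line, or NULL when key does not occur. The "key: value" layout is an
 * assumption for this sketch, not the real netapp status format. */
static const char *find_value(const char *line, const char *key)
{
    const char *p;

    assert(line != NULL && key != NULL);
    p = strstr(line, key);
    if (p == NULL) return NULL;

    p += strlen(key);
    while (*p == ' ' || *p == '\t') p++;   /* skip blanks after the key */
    return p;
}

/* Hypothetical guard: copy s, or log and return NULL when s is NULL, so
 * the caller can skip the update instead of aborting the whole process. */
static char *dup_or_null(const char *s, const char *what)
{
    size_t len;
    char *copy;

    if (s == NULL) {
        fprintf(stderr, "skipping %s: field not found in status message\n", what);
        return NULL;
    }

    len = strlen(s) + 1;                   /* include the terminating NUL */
    copy = malloc(len);
    if (copy != NULL) memcpy(copy, s, len);
    return copy;
}
```

A caller would test the result of dup_or_null and skip that dataset when it
is NULL, which turns a malformed status message into a log line rather than a
core dump.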
> > Note that as of 5.30pm today, rrd-status.log is 127MB and full of
> > errors, spanning 607625 lines (this is just for today; we roll the
> > logs each night). This seems abnormally large to me, and I think
> > eventually this is crashing the server.
It is still unlikely that this has anything to do with hobbitd_rrd crashing.
Regards,
Buchan