
Re: [hobbit] Suspected bug in External Script handling

To: <hobbit (at) hswn.dk>
Subject: Re: [hobbit] Suspected bug in External Script handling
From: "Graham Nayler" <graham.nayler (at) hallmarq.net>
Date: Fri, 26 Sep 2008 20:35:53 +0100
References: <2A2874E4052E4D489CE0899C0CA6EBF1 (at) gnaylerwork>

Sorry about the absence of word wrapping in that first attempt - hopefully this one will be easier to read!

----- Original Message -----
From: Graham Nayler
To: hobbit (at) hswn.dk
Sent: Friday, September 26, 2008 7:17 PM
Subject: [hobbit] Suspected bug in External Script handling


Hi guys,

(Bug report and fix submitted here as SourceForge looks 'not particularly' active)

I've just started setting up a Hobbit system to monitor a load of Windows boxes (using BBWin), and am implementing our custom tests using the external script mechanism. Once I finally got my head around the Hobbit/BBWin interface and worked out that it's really simple to implement, just very confusing to find the right document to look at, the test columns and graphs were displayed fine, but with dodgy data in the graphs.

The problem though is that RRD only intermittently gets its updates - maybe once every 15-20 minutes. I eventually realised that it's a problem with the update caching implemented in hobbitd_rrd. This is using the snapshot version as of 2008/Aug/02. The bug does not apply to version 4.2.0 from SourceForge, which was prior to the implementation of write-behind caching.

The majority of the internal tests cache RRD updates using static data held in do_rrd.c (v1.61 2008/04/02). External scripts though are handled by forking to a child of the hobbitd_rrd process in do_external.c (v1.22 2008/03/22) - I assume to avoid a misbehaving user script from snarling up the whole system. Once the data is collected from the script it then passes it on to RRD in the normal way. However, the forked process uses a copy of the static data, so this goes into a different cache to that in the main process. And once done the child process goes away - without forcing the cache to empty, so loses the data that were just created. Following this logic, the cache never fills up enough to flush itself, and so the data don't make it to RRD (which rather begs the question of how I got anything in the graph at all - but then that's a side issue).
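The effect described above can be reproduced outside Hobbit with a few lines of C. This is only a sketch: cached_updates and cache_update() are made-up stand-ins for the static cache in do_rrd.c, not the real data structures. It shows that an update queued in a forked child never reaches the parent's copy of the cache.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Toy stand-in for the static update cache in do_rrd.c (not the real structure). */
static int cached_updates = 0;

/* Hypothetical helper: queue one update in this process's copy of the cache. */
static void cache_update(void)
{
    cached_updates++;
}

/*
 * Fork a child that queues one update and exits without flushing.
 * Returns the number of queued updates the parent sees afterwards,
 * or -1 if fork() or waitpid() fails.
 */
int updates_visible_to_parent_after_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        cache_update();   /* changes the child's private copy only */
        _exit(0);         /* child goes away; its queued update is discarded */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    assert(WIFEXITED(status));

    return cached_updates;   /* the parent's copy was never touched */
}
```

Calling updates_visible_to_parent_after_child() from a test program returns 0: the child's increment lives and dies in the child's address space.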


The obvious solutions appear to be:

1) don't fork to a child - but that would allow misbehaving scripts to hang the system
2) fork, but pass the data back to the parent process once it's done - possible, but not a trivial fix
3) fork as currently, but flush the cache before closing the child process - not particularly elegant, but simple to implement.
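For comparison, option 2 might look roughly like the following in isolation. This is a sketch under heavy assumptions, not a proposed patch: collect_from_child() is a hypothetical name, the "N:42" string is a made-up value standing in for script output, and none of it is wired into Hobbit's own message handling, which is where the non-trivial part would lie.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that produces one update string and sends it back over a pipe.
 * On success buf holds the NUL-terminated string and 0 is returned; -1 on error.
 */
int collect_from_child(char *buf, size_t buflen)
{
    int fds[2];
    if (buflen == 0 || pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        const char *update = "N:42";   /* made-up value, stands in for script output */
        close(fds[0]);
        ssize_t w = write(fds[1], update, strlen(update));
        _exit(w < 0 ? 1 : 0);
    }

    /* Parent: close the write end so read() sees EOF when the child exits. */
    close(fds[1]);
    size_t used = 0;
    ssize_t n;
    while (used < buflen - 1 &&
           (n = read(fds[0], buf + used, buflen - 1 - used)) > 0)
        used += (size_t)n;
    buf[used] = '\0';
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

With this shape the parent would be the only process that ever touches the cache, which is why it avoids the problem, at the cost of extra plumbing.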

I've implemented a fix of type 3. It's important to only flush what is handled by the external script handler, as the parent process will still have its copy of the cache at the time of the fork, and will flush that itself in the normal course of events. There is a function in do_rrd.c that allows a partial flush of the cache - rrdcacheflushhost(). This flushes everything that matches the supplied "hostname", which can be the full path to the RRD archive, or a leading substring thereof. If it is only external scripts supplying the test data, then no keys matching that test name will ever be held in the parent process cache, so this path can be used as a key to flush the cache prior to exiting the child. The name of the repository (RRD file) is held within the do_rrd module context as the static string "rrdfn", which is accessible to the worker functions. This is used in the following fix to generate the match string - it's a bit ugly but it works.
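To illustrate the leading-substring matching described above, here is a toy version of a partial flush. It is not the code of rrdcacheflushhost(); struct cache_entry, flush_matching() and the key strings used with it are all invented for the illustration, and "flushing" here just zeroes a counter where the real thing would write to RRD.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy cache entry: key is the RRD path, pending counts unflushed updates. */
struct cache_entry {
    const char *key;
    int pending;
};

/*
 * Flush every entry with pending updates whose key begins with prefix.
 * Returns the number of entries flushed.
 */
int flush_matching(struct cache_entry *cache, size_t n, const char *prefix)
{
    assert(cache != NULL && prefix != NULL);

    size_t plen = strlen(prefix);
    int flushed = 0;

    for (size_t i = 0; i < n; i++) {
        if (cache[i].pending > 0 &&
            strncmp(cache[i].key, prefix, plen) == 0) {
            cache[i].pending = 0;   /* stand-in for writing the updates out */
            flushed++;
        }
    }
    return flushed;
}
```

Given entries keyed "/host1/custom.rrd", "/host1/la.rrd" and "/host2/custom.rrd", a prefix of "/host1/custom" flushes only the first, while "/host1" would match both host1 entries. That is why the key needs to be as specific as the child can make it.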


So, in do_external.c,v 1.22 2008/03/22 07:48:55
in function do_external_rrd()
   declare a char * variable called extkey

then after line 106 (within the R_DATA case):

   create_and_update_rrd(hostname, testname, classname, pagepaths, params, NULL);

insert

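   /* room for the two '/' separators plus the trailing NUL */
   extkey = (char *)malloc(strlen(hostname) + strlen(rrdfn) + 3*sizeof(char));

   if( extkey ) {
       sprintf(extkey, "/%s/%s", hostname, rrdfn);
       /* %09d needs its own argument; the pid is assumed to be what was meant */
       dbgprintf("%09d : Forcing flush of '%s'\n", (int)getpid(), extkey );
       rrdcacheflushhost(extkey);
       xfree(extkey);
   }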

This is now working reliably for me.

If the external script is used to feed additional data into one of the internal test repositories this fix will fail - with that internally generated data being written both by the parent and the child. A work-around for that would be to make a similar rrdcacheflushhost() call prior to the fork, so clearing out any such entries from the parent, and then the child can write out only the data it generated itself EXCEPT for the fact that we haven't worked out what rrdfn is by that time. Another alternative would be to put in a switch to temporarily prevent the caching mechanism inside create...rrd. The simplest though is...just don't do it!

As an additional observation, the rrd-status.log shows that at or around the termination of the child process the message pipe receives an EINTR completion, then loops around and restarts the message wait. I've no idea whether this is to be expected - although it looks a bit odd. I've not done much *NIX IPC development though, so I'll leave that one to the experts.
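For context, a blocking call failing with EINTR around the time a child exits is ordinary POSIX behaviour when a SIGCHLD handler is installed without SA_RESTART, and looping to restart the call is the usual response. Whether that is what hobbitd_rrd is doing here is not confirmed; the sketch below only shows the generic idiom, and read_retry() is a hypothetical helper name.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: read() that restarts when interrupted by a signal. */
ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;

    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);   /* interrupted before any data: try again */

    return n;   /* bytes read, 0 on EOF, or -1 with errno set to a real error */
}
```

A log line on each EINTR followed by a fresh wait would look exactly like the "loops around and restarts" pattern, so it may well be harmless.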


Graham Nayler
www.hallmarq.net
