[hobbit] Suspected bug in External Script handling

Graham Nayler graham.nayler at hallmarq.net
Fri Sep 26 21:35:53 CEST 2008


Sorry about the absence of word wrapping in that first attempt - hopefully 
this one will be easier to read!


----- Original Message ----- 
From: Graham Nayler
To: hobbit at hswn.dk
Sent: Friday, September 26, 2008 7:17 PM
Subject: [hobbit] Suspected bug in External Script handling


Hi guys,

(Bug report and fix submitted here as Sourceforge looks 'not particularly' 
active)

I've just started setting up a Hobbit system to monitor a load of Windows 
boxes (using BBWin), and am implementing our custom tests using the external 
script mechanism. Once I finally got my head around the Hobbit/BBWin 
interface (it's really simple to implement, just very confusing to find the 
right document to look at), the test columns and graphs were displayed fine, 
but with dodgy data in the graphs.

The problem though is that RRD only intermittently gets its updates - maybe 
once every 15-20 minutes. I eventually realised that it's a problem with the 
update caching implemented in hobbitd_rrd. This is using the snapshot 
version as of 2008/Aug/02. The bug does not apply to version 4.2.0 from 
SourceForge, which was prior to the implementation of write-behind caching.

The majority of the internal tests cache RRD updates using static data held 
in do_rrd.c (v1.61 2008/04/02). External scripts, though, are handled by 
forking a child of the hobbitd_rrd process in do_external.c (v1.22 
2008/03/22) - I assume to avoid a misbehaving user script snarling up 
the whole system. Once the data is collected from the script, the child 
passes it on to RRD in the normal way. However, the forked process uses a 
copy of the static data, so the update goes into a different cache from the 
one in the main process. And once done, the child process goes away without 
forcing the cache to empty, so it loses the data that were just created. 
Following this logic, the child's cache never fills up enough to flush 
itself, and so the data don't make it to RRD (which rather raises the 
question of how I got anything in the graph at all - but that's a side 
issue).
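To illustrate the mechanism described above, here is a minimal sketch in plain C. It is not Hobbit code: cached_updates and cache_update() are made-up stand-ins for the static cache in do_rrd.c, and the only point is that a change made after fork() lands in the child's private copy and vanishes when the child exits.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the static write-behind cache in do_rrd.c. */
static int cached_updates = 0;

/* Queue one update in this process's copy of the cache. */
static void cache_update(void)
{
    cached_updates++;
}

/*
 * Fork a child that queues one update and exits without flushing.
 * Returns the number of queued updates the parent sees afterwards,
 * or -1 if fork() or waitpid() fails.
 */
static int updates_visible_after_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        cache_update();   /* changes the child's private copy only */
        _exit(0);         /* the copy is discarded here, nothing is written */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return cached_updates; /* the parent's copy is unchanged by the child */
}
```

Whatever the parent had queued before the fork is still there afterwards, but the child's addition never shows up, which matches the symptom of updates going missing.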

The obvious solutions appear to be:
1) don't fork to a child - but that would allow misbehaving scripts to hang 
the system
2) fork, but pass the data back to the parent process once it's done - 
possible, but not a trivial fix
3) fork as currently, but flush the cache before closing the child process - 
not particularly elegant, but simple to implement.
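For comparison, option 2 could look roughly like the sketch below, where the child hands its result back through a pipe and the parent would then put it into its own cache. This is only an outline under my own assumptions: collect_from_child() and the msg argument (standing in for the script output) are hypothetical, and nothing like this exists in do_external.c as far as I know.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Do the "script" work in a child and return its output to the parent
 * through a pipe. msg is a hypothetical stand-in for the script output.
 * Returns the number of bytes received, or -1 on error.
 */
static ssize_t collect_from_child(const char *msg, char *buf, size_t buflen)
{
    int fds[2];
    if (buflen == 0 || pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: send the result to the parent, then exit. */
        close(fds[0]);
        size_t len = strlen(msg);
        ssize_t w = write(fds[1], msg, len);
        _exit(w == (ssize_t)len ? 0 : 1);
    }

    /* Parent: read until the child closes its end of the pipe. */
    close(fds[1]);
    size_t total = 0;
    while (total < buflen - 1) {
        ssize_t n = read(fds[0], buf + total, buflen - 1 - total);
        if (n <= 0)
            break;            /* EOF or error: stop reading */
        total += (size_t)n;
    }
    buf[total] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);

    return (ssize_t)total;
}
```

The parent still blocks on the read here, so a real version would need a timeout or non-blocking handling to keep the protection against hung scripts, which is part of why this is not a trivial fix.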

I've implemented a fix of type 3. It's important to only flush what is 
handled by the external script handler, as the parent process will still 
have its copy of the cache as of the time of the fork, and will flush that 
itself in the normal course of events. There is a function in do_rrd.c that 
allows a partial flush of the cache - rrdcacheflushhost(). This flushes 
everything that matches the supplied "hostname", which can be the full path 
to the RRD archive, or a leading substring thereof. If it is only external 
scripts supplying the test data, then no keys matching that test name will 
ever be held in the parent process cache, so this path can be used as a key 
to flush the cache prior to exiting the child. The name of the repository 
(RRD file) is held within the do_rrd module context as the static string 
"rrdfn", which is accessible to the worker functions. This is used in the 
following fix to generate the match string - it's a bit ugly but it works.
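The leading-substring matching described above can be pictured with the following sketch. It is not the do_rrd.c implementation: struct cache_entry and flush_matching() are hypothetical, and the "flush" just zeroes a counter where the real code would write to the RRD file.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cache entry: key is the path of the RRD file. */
struct cache_entry {
    const char *key;
    int pending;              /* number of queued updates */
};

/*
 * Flush every entry whose key starts with prefix.
 * Returns the number of entries that had something to flush.
 */
static int flush_matching(struct cache_entry *entries, size_t count,
                          const char *prefix)
{
    size_t plen = strlen(prefix);
    int flushed = 0;

    for (size_t i = 0; i < count; i++) {
        if (entries[i].pending > 0 &&
            strncmp(entries[i].key, prefix, plen) == 0) {
            entries[i].pending = 0;   /* real code would write to RRD here */
            flushed++;
        }
    }
    return flushed;
}
```

With this kind of matching, a key such as "/hostA/custom.rrd" touches only that one file, whereas a bare "/hostA" would also match a host called "hostAB", so the more specific the key the better.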

So, in do_external.c,v 1.22 2008/03/22 07:48:55
in function do_external_rrd()
    declare a char * variable called extkey
    then after line 106 (within the R_DATA case) : 
create_and_update_rrd(hostname, testname, classname, pagepaths, params, 
NULL);
insert
    /* 3 extra bytes: the two '/' separators plus the terminating NUL */
    extkey = (char *)malloc(strlen(hostname) + strlen(rrdfn) + 3);
    if( extkey ) {
        sprintf(extkey, "/%s/%s", hostname, rrdfn);
        /* the original format had a %09d with no matching argument */
        dbgprintf("Forcing flush of '%s'\n", extkey);
        rrdcacheflushhost(extkey);
        xfree(extkey);
    }

This is now working reliably for me.

If the external script is used to feed additional data into one of the 
internal test repositories this fix will fail - with that internally 
generated data being written both by the parent and the child. A work-around 
for that would be to make a similar rrdcacheflushhost() call prior to the 
fork, so clearing out any such entries from the parent, and then the child 
can write out only the data it generated itself EXCEPT for the fact that we 
haven't worked out what rrdfn is by that time. Another alternative would be 
to put in a switch to temporarily prevent the caching mechanism inside 
create...rrd. The simplest though is...just don't do it!

As an additional observation, the rrd-status.log shows that at or around the 
termination of the child process the message pipe receives an EINTR 
completion, then loops around and restarts the message wait. I've no idea 
whether this is to be expected - although it looks a bit odd. I've not done 
much *NIX IPC development though, so I'll leave that one to the experts.
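For what it's worth, a blocking call returning EINTR when a child exits is normal POSIX behaviour if a SIGCHLD handler is installed without SA_RESTART, and the usual response is simply to retry. A minimal sketch of that pattern, assuming the wait is a plain read() (I have not checked what the Hobbit code actually calls, and read_retry() is a made-up name):

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* read() wrapper that restarts the call when a signal interrupts it. */
static ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;

    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);   /* interrupted: just try again */

    return n;
}
```

If the loop in the log is doing the equivalent of this, then restarting the message wait after EINTR would be the expected behaviour rather than a fault.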

Graham Nayler
www.hallmarq.net

 



