[Xymon] rrd logs and graphs

Jeremy Laidman jlaidman at rebel-it.com.au
Wed Mar 4 07:03:02 CET 2015


On 4 March 2015 at 12:40, Vernon Everett <everett.vernon at gmail.com> wrote:

> Here's what I ran, with error output.
> ./xymoncmd xymond_channel  --channel=data --filter=e-series cat >
> /var/tmp/xymon.out
> 2015-03-04 08:45:22 Using default environment file
> /opt/local/xymon/server/etc/xymonserver.cfg
> 2015-03-04 08:45:58 Peer not up, flushing message queue
> 2015-03-04 09:05:21 Gave up waiting for GOCLIENT to go low.
>
> What is that GOCLIENT thing?
>

From what I can understand, it's a semaphore shared between xymond and all
of the xymond_channel instances.  When there are several channel readers,
they all get sent the message address, and as each one accepts the message,
it decrements GOCLIENT.  When GOCLIENT is zero, it means all readers have
received (and probably copied) the message, and the memory can be freed.
Each reader waits until GOCLIENT goes back to zero before waiting for the
next message.

There's a timeout of 1 second that xymond_channel waits for GOCLIENT to go
back to zero.  If the time is exceeded in a channel reader, it means
another reader is taking too long to handle a message, and so the first
reader gives up, logs the error you saw, and carries on with the next
message loop.  I'm not sure whether this is a sign of trouble.  It might be
normal when you're running your own instance of xymond_channel, or it
might be a side-effect of the "cat" command blocking when writing to your
output file due to a high message rate and contention on whatever
filesystem has /var/tmp/.
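
To make the countdown-and-timeout idea concrete, here is a toy model in
Python.  It is only a sketch of the behaviour as I described it above, not
Xymon's code: the class and function names are made up for illustration,
and a thread condition variable stands in for the real shared semaphore.

```python
import threading
import time


class GoClientCounter:
    """Toy stand-in for GOCLIENT: a shared counter that readers decrement."""

    def __init__(self):
        self._cond = threading.Condition()
        self._value = 0

    def post(self, readers):
        # The daemon announces a message: one count per channel reader.
        with self._cond:
            self._value = readers
            self._cond.notify_all()

    def accept(self):
        # A reader has taken the message: decrement and wake any waiters.
        with self._cond:
            self._value -= 1
            self._cond.notify_all()

    def wait_for_zero(self, timeout):
        # True if the count reached zero in time, False if we gave up.
        with self._cond:
            return self._cond.wait_for(lambda: self._value == 0, timeout=timeout)


def reader(counter, delay, timeout, results, name):
    time.sleep(delay)  # time this reader spends handling the message
    counter.accept()
    results[name] = counter.wait_for_zero(timeout)


def simulate(delays, timeout):
    """Run one message through len(delays) readers; report who gave up."""
    counter = GoClientCounter()
    counter.post(len(delays))
    results = {}
    threads = [
        threading.Thread(
            target=reader, args=(counter, d, timeout, results, f"reader{i}")
        )
        for i, d in enumerate(delays)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With simulate([0.0, 0.5], timeout=0.1), the fast reader gets False (its
"gave up waiting" case) because the slow reader has not decremented the
counter yet, while the slow reader itself sees zero straight away.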

There's a description of how GOCLIENT works in the file new-daemon.txt, in
the source code.


> In the output file, /var/tmp/xymon.out from
> ./xymoncmd xymond_channel  --channel=data --filter=e-series cat >
> /var/tmp/xymon.out
> there is no mention of the subversion or energise stuff either.
>

Does it mention the correct data set names?  We can't draw any
conclusions if it's not collecting the data we expect.
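
If it helps, a quick way to check is to count how many lines of the
capture mention each name you care about, both the expected ones and the
bogus ones.  A minimal sketch follows; the function name is my own, and it
makes no assumption about the message format beyond plain substring
matching.

```python
def count_mentions(lines, names):
    """Count how many lines contain each name (plain substring match)."""
    counts = {name: 0 for name in names}
    for line in lines:
        for name in names:
            if name in line:
                counts[name] += 1
    return counts


# Example use against the capture file from the quoted command:
#   with open("/var/tmp/xymon.out", errors="replace") as fh:
#       print(count_mentions(fh, ["subversion", "energise"]))
```

A zero for an expected name would suggest the filter isn't capturing what
we think it is, in which case the absence of the bogus names tells us
nothing.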

Did any of the RRD files skip an update at the time the new rogue files
were created?  Do these files match up with entries in xymon.out?  Did
anything interesting happen at the same time the rogue entries were created?
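
One rough way to spot skipped updates is to compare file modification
times under the RRD directory.  The sketch below is an assumption-laden
helper, not anything shipped with Xymon: you pass in the directory
yourself, and the age threshold is up to you (something like 600 seconds
if your updates arrive every five minutes).  Bear in mind that if
xymond_rrd is caching updates, modification times can lag behind the
data, so treat the result as a hint only.

```python
import os
import time


def rrd_mtimes(rrd_dir):
    """Return (mtime, path) for every .rrd file under rrd_dir, oldest first."""
    found = []
    for root, _dirs, files in os.walk(rrd_dir):
        for name in files:
            if name.endswith(".rrd"):
                path = os.path.join(root, name)
                found.append((os.path.getmtime(path), path))
    return sorted(found)


def stale_files(rrd_dir, max_age_seconds, now=None):
    """List .rrd files not modified within the last max_age_seconds."""
    now = time.time() if now is None else now
    return [
        path
        for mtime, path in rrd_mtimes(rrd_dir)
        if now - mtime > max_age_seconds
    ]
```

The newest entries from rrd_mtimes() also give you the timestamps of the
rogue files, which you can then line up against xymon.out and the
xymond_rrd log.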

If you're seeing correct entries in xymon.out, but not the bogus entries,
then I'm inclined to agree that xymond_rrd is at fault, and is possibly
using memory it's not supposed to.  I wonder if running xymond_rrd with
"--no-cache" might have an effect.  Obviously, it's better if you can cache
updates to the RRD files, but it might narrow down the region of code
that's responsible.

This is not conclusive.  It's possible that when you have two instances of
xymond_channel, only one is corrupting data names, and it just so happened
that it was the one being used by xymond_rrd.  Could be that another time
you would see your extra reader getting the bogus entries.  That's the
problem with using a second instance for analysis, rather than somehow
getting the analysis happening on the one that writes to the RRD files.

On the other hand, if you ran two instances of xymond_rrd, both on the same
data channel, and if both instances create the bogus RRD files, then you
know that you can probably use the second instance to narrow down the
fault, without impacting the creation of RRD files for real work.

Are you still running xymond_rrd with "--debug"?  Did this show anything
interesting when the bogus RRD files were created?

What version of Xymon are you running?  Did this start happening after an
upgrade?  I wonder if it's a bug with some versions but not others.

J