[Xymon] rrd logs and graphs

Vernon Everett everett.vernon at gmail.com
Wed Mar 4 08:02:55 CET 2015


Thanks for your assistance Jeremy.
Quite a bit to digest.

Everything in the xymon.out file I am collecting looks exactly like I would
expect it to look.
There is a line, prefixed by @@data#<sequence> (or at least I think it's a
sequence number), followed by pipe-separated data that looks like the host
name, time stamp, IP address, an empty field, host name again, the rrd-file
prefix, another blank field and the last field is other.
I then get a blank line, and the data looks like what I am trying to send.
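
For checking a capture like this in bulk, here is a minimal Python sketch
that splits the header lines described above and flags any whose rrd-file
prefix is not one you expect. The field names are guesses taken from the
description above, not official Xymon names, and the sample line is made
up. The character between the sequence number and the first field is an
assumption too, so the pattern accepts "/", "|" or nothing.

```python
import re

# Field names are guesses based on the description above,
# not official Xymon names.
FIELDS = ["host", "timestamp", "sender_ip", "unknown_1",
          "host_again", "rrd_prefix", "unknown_2", "last"]

# Separator after the sequence number is assumed; accept "/", "|" or nothing.
HEADER = re.compile(r"^@@data#(\d+)[/|]?(.*)$")


def parse_header(line):
    """Return a dict for a @@data header line, or None for any other line."""
    m = HEADER.match(line.rstrip("\n"))
    if m is None:
        return None
    seq, rest = m.groups()
    record = {"sequence": int(seq)}
    record.update(zip(FIELDS, rest.split("|")))
    return record


def unexpected_prefixes(lines, expected):
    """Return header records whose rrd_prefix is not in the expected set."""
    bad = []
    for line in lines:
        rec = parse_header(line)
        if rec is not None and rec.get("rrd_prefix") not in expected:
            bad.append(rec)
    return bad


# Hypothetical sample line, for illustration only.
sample = "@@data#42/myhost|1425452575.123456|10.0.0.5||myhost|e-series||other"
```

Running unexpected_prefixes() over the lines of /var/tmp/xymon.out with the
set of data set names you actually send should return an empty list if the
channel data is clean.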


Looks like we might need to check with JC for more on that GOCLIENT thing.
I just find it odd that it happened about the same time as the corruption.
I haven't seen it again today, and haven't seen any other corruption either.

At this site, we are running Xymon 4.3.12, but I have seen similar
behaviour (although not to such an extent) elsewhere with 4.3.17, and I
think I also saw it with 4.3.18, but I no longer have access to that site.

I am not seeing any lost data points in the other graphs. But that could be
difficult to spot.
Will run a few rrdtool dumps, and look for gaps at that timestamp. Let you
know what I find.
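
A rough Python sketch of that gap check, assuming the usual "rrdtool dump"
XML layout, where each row is preceded by a comment ending in the epoch
timestamp and unknown values are written as NaN. The helper names and the
10-minute window are my own choices.

```python
import re
import subprocess

# One archive row in "rrdtool dump" output looks like:
#   <!-- 2015-03-04 08:00:00 CET / 1425452400 --> <row><v>NaN</v></row>
ROW = re.compile(r"<!--[^>]*?/\s*(\d+)\s*-->\s*<row>(.*?)</row>")
VALUE = re.compile(r"<v>\s*([^<]*?)\s*</v>")


def dump_rrd(path):
    """Return the XML text printed by 'rrdtool dump <path>'."""
    result = subprocess.run(["rrdtool", "dump", path],
                            capture_output=True, text=True, check=True)
    return result.stdout


def find_gaps(dump_text, around, window=600):
    """Return sorted timestamps within +/- window seconds of 'around'
    where at least one data source value is NaN."""
    gaps = set()
    for m in ROW.finditer(dump_text):
        ts = int(m.group(1))
        if abs(ts - around) > window:
            continue
        values = VALUE.findall(m.group(2))
        if any(v.lower() == "nan" for v in values):
            gaps.add(ts)
    return sorted(gaps)
```

Usage would be along the lines of find_gaps(dump_rrd("some.rrd"), around=T),
with T being the epoch time at which the rogue files appeared. An RRD that
was legitimately idle at that time will also show NaN rows, so a hit is a
hint, not proof.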

As for the --debug option, it caused xymond_rrd to crash and burn, dumping
cores as it went.
It gets ugly.
Earlier in this thread, John Thurston mentioned this behaviour too.
It also creates a red xymond_rrd button on the xymon server, but the
results are not overly helpful.
- Program crashed
- Fatal signal caught!

Don't think it started after an upgrade.
Something I did notice: the problem appears to be limited to data-only
messages, the ones used just to display graphs in trends.
I am not seeing this for data when there is a status and data component.
Or at least I haven't seen it yet.

What are the implications of running with "--no-cache"?
I have implemented this by adding "--no-cache" but if it's going to have a
long-term impact, I don't want to leave it that way indefinitely.

Regards
Vernon




On 4 March 2015 at 14:03, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:

> On 4 March 2015 at 12:40, Vernon Everett <everett.vernon at gmail.com> wrote:
>
>> Here's what I ran, with error output.
>> ./xymoncmd xymond_channel  --channel=data --filter=e-series cat >
>> /var/tmp/xymon.out
>> 2015-03-04 08:45:22 Using default environment file
>> /opt/local/xymon/server/etc/xymonserver.cfg
>> 2015-03-04 08:45:58 Peer not up, flushing message queue
>> 2015-03-04 09:05:21 Gave up waiting for GOCLIENT to go low.
>>
>> What is that GOCLIENT thing?
>>
>
> From what I can understand, it's a semaphore shared between xymond and all
> of the xymond_channel instances.  When there are several channel readers,
> they all get sent the message address, and as each one accepts the message,
> it decrements GOCLIENT.  When GOCLIENT is zero, it means all readers have
> received (and probably copied) the message, and the memory can be freed.
> Each reader waits until GOCLIENT goes back to zero before waiting for the
> next message.
>
> There's a timeout of 1 second that xymond_channel waits for GOCLIENT to go
> back to zero.  If the time is exceeded in a channel reader, it means
> another reader is taking too long to handle a message, and so the first
> reader gives up, logs the error you saw, and carries on with the next
> message loop.  I'm not sure if this is a sign of trouble.  Or it might be
> normal when you're running your own instance of xymond_channel.  Or it
> might be a side-effect of the "cat" command blocking when writing to your
> output file due to a high message rate and contention on whatever
> filesystem has /var/tmp/.
>
> There's a description of how GOCLIENT works in the file new-daemon.txt, in
> the source code.
>
>
>> In the output file, /var/tmp/xymon.out from
>> ./xymoncmd xymond_channel  --channel=data --filter=e-series cat >
>> /var/tmp/xymon.out
>> there is no mention of the subversion or energise stuff either.
>>
>
> Does it have mention of the correct data set names?  We can't draw any
> conclusions if it's not collecting the data we expect.
>
> Did any of the RRD files skip an update at the time the new rogue files
> were created?  Do these files match up with entries in xymon.out?  Or
> anything interesting at the same time as the rogue entries were created?
>
> If you're seeing correct entries in xymon.out, but not the bogus entries,
> then I'm inclined to agree that xymond_rrd is at fault, and is possibly
> using memory it's not supposed to.  I wonder if running xymond_rrd with
> "--no-cache" might have an effect.  Obviously, it's better if you can cache
> updates to the RRD files, but it might narrow down the region of code
> that's responsible.
>
> This is not conclusive.  It's possible that when you have two instances of
> xymond_channel, only one is corrupting data names, and it just so happened
> that it was the one being used by xymond_rrd.  Could be that another time
> you would see your extra reader getting the bogus entries.  That's the
> problem with using a second instance for analysis, rather than somehow
> getting the analysis happening on the one that writes to the RRD files.
>
> On the other hand, if you ran two instances of xymond_rrd, both on the
> same data channel, and if both instances create the bogus RRD files, then
> you know that you can probably use the second instance to narrow down the
> fault, without impacting the creation of RRD files for real work.
>
> Are you still running xymond_rrd with "--debug"?  Did this show anything
> interesting when the bogus RRD files were created?
>
> What version of Xymon are you running?  Did this start happening after an
> upgrade?  I wonder if it's a bug with some versions but not others.
>
> J
>
>


-- 
"Accept the challenges so that you can feel the exhilaration of victory"
- General George Patton