[Xymon] All Xymon rrd graphs suddenly haywire

Ralph Mitchell ralphmitchell at gmail.com
Tue Jul 7 22:50:24 CEST 2015


I had the exact same thing happen a couple of months ago, with
xymon-4.3.12.  I don't know what triggered it; it was only a
short-duration spike, and then everything returned to normal.

The majority of my client systems are real machines.  There are some VMs
running in qemu-kvm on RHEL5, and some other VMs in VMware.

My xymon server is also a real machine.

Ralph Mitchell

On Tue, Jul 7, 2015 at 4:36 PM, Steve B <rectifier at gmail.com> wrote:

> It is extremely odd, J.C., and thanks very much for your reply; it has given
> me something to think about. I am not at the office now, but before I left,
> after copying over the rrd files from Friday, everything looked ok: graphs
> were being generated properly from xymon, from bbwin on the hosts, and from
> devmon. Then an hour later, just as I was leaving, I saw a few checks having
> the issue again; it was slowly starting over. I had decided I had to do the
> *restore* of the rrds anyway, just for peace of mind that the network
> intervention at the weekend had nothing to do with this whole issue (which
> would be very unlikely in the first place), so now I can be sure that is not
> the culprit.
>
> To answer your question, it is all types of RRD, bar none (conn, disk,
> memory, devmon, etc.).  I did not see anything unusual in the status history
> around that time, but now that it has happened again today after the
> restore, I have some good timestamps to check through the log files
> tomorrow.  Perhaps not all hosts/checks will be affected by the time I
> arrive at the office tomorrow.
>
> At the VM level I have not checked yet (it is handled by another team) but
> will do so tomorrow. I did check the server from within RHEL, and
> cpu/memory/disk seemed fine today and over the last few days.
>
> I still think it's (our) Xymon that's having some difficulties somewhere,
> although general host memory corruption is something I will look at.
>
> Thanks again, will post more when I make some discoveries.
>
> Steve
>
> On Tue, Jul 7, 2015 at 6:02 PM, J.C. Cleaver <cleaver at terabithia.org>
> wrote:
>
>>
>>
>> On Tue, July 7, 2015 5:13 am, Steve B wrote:
>> > Hi all,
>> >
>> > This weekend, something happened with all our graphs. Every host's graphs
>> > are either corrupted or distorted, and the history is unusable. I have
>> > checked all the usual places for graph logging (rrd-data.log,
>> > rrd-status.log and other system log files) but I am stumped as to where
>> > to start fixing this.  We are looking at restoring the rrds from a
>> > previous snapshot, which may or may not work, but I would still like to
>> > solve this mystery.
>> >
>> > I have attached 2 screenshots, but I do not know if these are viewable on
>> > the mailing list.  It is hard to explain without them, but essentially
>> > there are huge numbers in our graphs, such as
>> > 3945789385793485793847593847593847593847593847593845793485739, and lots
>> > of '?', and there is no usable history, just a straight line along the
>> > base with one peak (or two) around the time this all happened (give or
>> > take a day or two).  If you try to zoom in, you get to a screen that just
>> > says 'zoom source image' and is black, but if you hover your mouse over
>> > it you can find an area that is selectable, and this shows a close-up of
>> > the zoom area.
>> >
>> > rrdtool info example (for the same host and test as in the screenshots):
>> >
>> > filename = "disk,C.rrd"
>> > rrd_version = "0003"
>> > step = 300
>> > last_update = 1436270189
>> > ds[pct].type = "GAUGE"
>> > ds[pct].minimal_heartbeat = 600
>> > ds[pct].min = 0.0000000000e+00
>> > ds[pct].max = 1.0000000000e+02
>> > ds[pct].last_ds = "89"
>> > ds[pct].value = 7.9210000000e+03
>> > ds[pct].unknown_sec = 0
>> > ds[used].type = "GAUGE"
>> > ds[used].minimal_heartbeat = 600
>> > ds[used].min = 0.0000000000e+00
>> > ds[used].max = NaN
>> > ds[used].last_ds = "28436524"
>> > ds[used].value = 2.5308506360e+09
>> > ds[used].unknown_sec = 0
>> > rra[0].cf = "AVERAGE"
>> > rra[0].rows = 576
>> > rra[0].pdp_per_row = 1
>> > rra[0].xff = 5.0000000000e-01
>> > rra[0].cdp_prep[0].value = NaN
>> > rra[0].cdp_prep[0].unknown_datapoints = 0
>> > rra[0].cdp_prep[1].value = NaN
>> > rra[0].cdp_prep[1].unknown_datapoints = 0
>> > rra[1].cf = "AVERAGE"
>> > rra[1].rows = 576
>> > rra[1].pdp_per_row = 6
>> > rra[1].xff = 5.0000000000e-01
>> > rra[1].cdp_prep[0].value = 4.4500000000e+02
>> > rra[1].cdp_prep[0].unknown_datapoints = 0
>> > rra[1].cdp_prep[1].value = 1.4218146600e+08
>> > rra[1].cdp_prep[1].unknown_datapoints = 0
>> > rra[2].cf = "AVERAGE"
>> > rra[2].rows = 576
>> > rra[2].pdp_per_row = 24
>> > rra[2].xff = 5.0000000000e-01
>> > rra[2].cdp_prep[0].value = 2.0470000000e+03
>> > rra[2].cdp_prep[0].unknown_datapoints = 0
>> > rra[2].cdp_prep[1].value = 6.5402986560e+08
>> > rra[2].cdp_prep[1].unknown_datapoints = 0
>> > rra[3].cf = "AVERAGE"
>> > rra[3].rows = 576
>> > rra[3].pdp_per_row = 288
>> > rra[3].xff = 5.0000000000e-01
>> > rra[3].cdp_prep[0].value = 1.2727000000e+04
>> > rra[3].cdp_prep[0].unknown_datapoints = 0
>> > rra[3].cdp_prep[1].value = 4.0657944878e+09
>> > rra[3].cdp_prep[1].unknown_datapoints = 0
>> >
>> > This weekend we had a network intervention: we moved some network
>> > connections in one of our two data centers, but there was no downtime, as
>> > we switched the network connectivity to the other data room.  Our Xymon
>> > server is running on a virtual server (RHEL5) and the version we are
>> > using is 4.3.19.
>> >
>> > All graphs were fine until this point.  Any ideas?
>>
>>
>> This is quite odd.
>>
>> There aren't too many things that could concertedly affect all RRDs like
>> that within the code path. Is it the same type of RRD (e.g., disk) for all
>> hosts, or all RRDs for all hosts? Did you see anything unusual in the
>> status history snapshots (if any) taken around this time?
>>
>> If it happened to RRDs on both the 'data' and 'status' channels at once,
>> that narrows down the possibilities even further. I'm assuming you've
>> checked syslog for host-level events for the VM, but did anything odd
>> happen with the hypervisor around this time? General host memory
>> corruption is about the only thing I can think of that might cause this --
>> I haven't run into it before.
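>>
>> For reference (paths trimmed, and assuming a stock 4.3.x layout), the RRD
>> updates on those two channels come from two separate worker entries in
>> tasks.cfg, each feeding xymond_rrd and each writing one of the log files
>> mentioned above (rrd-status.log / rrd-data.log):
>>
>>   [rrdstatus]
>>           CMD xymond_channel --channel=status xymond_rrd --rrddir=$XYMONVAR/rrd
>>           LOGFILE $XYMONSERVERLOGS/rrd-status.log
>>
>>   [rrddata]
>>           CMD xymond_channel --channel=data xymond_rrd --rrddir=$XYMONVAR/rrd
>>           LOGFILE $XYMONSERVERLOGS/rrd-data.log
>>
>> If both logs show the same bad values at the same moments, the common
>> points are xymond itself and the xymond_rrd code shared by both workers,
>> rather than anything specific to one channel.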
>>
>>
>> Regarding fixing the issue, restoring from backups might be the easiest
>> option. If you want to save the surrounding data, your best bet might be
>> to export/reimport the RRD to remove the "spike". I've used
>> http://www.serveradminblog.com/2010/11/remove-spikes-from-rrd-graphs-howto/
>> in the past for this. It's easiest to script this across the various types
>> of RRD file, using a similar max setting for all "la" graphs, for example.
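>>
>> As a rough illustration of that dump/edit/restore cycle (a sketch only,
>> not a tested recipe; the script name and threshold below are made up, and
>> you would pick a sane ceiling per RRD type, e.g. 100 for the "pct" gauge
>> shown above):
>>
>>   #!/usr/bin/env python
>>   # cap_rrd_spikes.py (hypothetical name): filters "rrdtool dump" XML and
>>   # replaces every stored sample above a threshold with NaN ("unknown").
>>   #
>>   # Suggested use, on a copy, with the xymond_rrd workers stopped:
>>   #   rrdtool dump disk,C.rrd > disk,C.xml
>>   #   python cap_rrd_spikes.py 100 < disk,C.xml > disk,C.clean.xml
>>   #   rrdtool restore disk,C.clean.xml disk,C.fixed.rrd
>>   import re
>>   import sys
>>
>>   threshold = 100.0                 # default ceiling; override on the command line
>>   if len(sys.argv) > 1:
>>       threshold = float(sys.argv[1])
>>
>>   def cap(match):
>>       text = match.group(1)
>>       try:
>>           value = float(text)       # "NaN" parses to NaN and compares false below
>>       except ValueError:
>>           return match.group(0)     # leave anything non-numeric untouched
>>       if value > threshold:
>>           return "<v>NaN</v>"       # turn the spike into an unknown sample
>>       return match.group(0)
>>
>>   for line in sys.stdin:
>>       sys.stdout.write(re.sub(r"<v>([^<]+)</v>", cap, line))
>>
>> Diff the XML afterwards to confirm that only the spike rows changed before
>> restoring over (a backup of) the original file.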
>>
>> I seem to recall someone posting a script they had used for this in the
>> past, but a search of the list archives hasn't revealed anything for me.
>>
>>
>> HTH,
>>
>> -jc
>>
>>
>>
>
> _______________________________________________
> Xymon mailing list
> Xymon at xymon.com
> http://lists.xymon.com/mailman/listinfo/xymon
>
>

