[Xymon] Xymon server (4.3.27) occasionally assigns reports to the wrong host
Axel Beckert
beckert at phys.ethz.ch
Wed Jun 15 17:58:16 CEST 2016
Hi,
in the past few months I have found more and more indications of a
strange bug in (at least) Xymon 4.3.27 which occasionally mixes up hosts
when handling reports:
One of our two active Xymon servers occasionally assigns test reports
to the wrong hosts, which also causes false positives in alerting.
(I actually had the impression that this had been happening for quite a
while, but today we found a false positive caused by it and could nail
down the real origin of the report to clearly be a different host.)
Not sure where I should start with describing the details, so here's
just a bunch of facts:
* Happens at least for the tests "disk", "ports", "cpu" and our
self-written test "raid", but maybe more.
* Some of the wrong status pages show "Status message received from
127.0.0.1" even though they should have come from a client IP address
("raid" test); others show the expected IP of the host the test
supposedly belongs to as the sender ("disk" test).
* Machines with a single disk (e.g. VMs) occasionally report the status
of a "raid" test which is not deployed to them -- and then (for obvious
reasons) go purple on it. Among the hosts reporting to that server,
only one machine has a RAID, but its "raid" reports have been
misassigned to at least three other hosts, all of them hosts with
rather many tests (compared to a bunch of sensors which send in only
very few tests per host).
* All the wrongly assigned raid reports noticed so far originated from
the same machine, and all the wrongly assigned disk reports noticed so
far originated from a different, but always the same, machine.
* For about a month, one machine's service status page (as well as
non-green.html) has occasionally (approx. 1-3 times a day) shown the
disk report of another machine. At least all the cases we looked at
(recognizable by the different "yellow" thresholds for disk usage
levels) came from the same other machine.
Examples taken from the history of the "disk" test of the same host:
+ An expected, "green" report:
Wed Jun 15 17:39:49 CEST 2016 - Filesystems ok
Filesystem 1024-blocks Used Available Capacity Mounted on
udev 2013012 0 2013012 0% /dev
tmpfs 405736 46412 359324 12% /run
/dev/sda1 75620456 2599176 69942852 4% /
tmpfs 2028668 0 2028668 0% /dev/shm
tmpfs 5120 4 5116 1% /run/lock
tmpfs 2028668 0 2028668 0% /sys/fs/cgroup
cgmfs 100 0 100 0% /run/cgmanager/fs
+ One of the misassigned false positive reports:
Tue Jun 14 10:35:13 CEST 2016 - Filesystems NOT ok
yellow / (94% used) has reached the WARNING level (90%)
Filesystem 1024-blocks Used Available Capacity Mounted on
udev 505668 0 505668 0% /dev
tmpfs 102980 884 102096 1% /run
/dev/sda1 155570664 137658232 10003236 94% /
tmpfs 5120 4 5116 1% /run/lock
none 1132776 516 1132260 1% /tmp
cgroup 12 0 12 0% /sys/fs/cgroup
+ Excerpt from the "Client data" link from the status page of the
item directly above this one:
[df]
Filesystem 1024-blocks Used Available Capacity Mounted on
udev 2013012 0 2013012 0% /dev
tmpfs 405736 46464 359272 12% /run
/dev/sda1 75620456 2591688 69950340 4% /
tmpfs 2028668 0 2028668 0% /dev/shm
tmpfs 5120 4 5116 1% /run/lock
tmpfs 2028668 0 2028668 0% /sys/fs/cgroup
cgmfs 100 0 100 0% /run/cgmanager/fs
tmpfs 405736 0 405736 0% /run/user/0
Please note
+ the different set of file systems -- because the reports actually
come from systems with different init systems.
+ the different overall size of /dev/sda1. The system has not been
rebooted for about 10 days and only has an 80 GB disk, while the
yellow report comes from a system with a 160 GB disk.
* The "Client data" link always have the correct "df" and "netstat"
output even in cases where the service status page shows the "df" or
"netstat" output from the wrong host. So I assume it's a bug in
report handling and neither on the client side nor on the network
side of the of the daemon.
* The misassigned reporting of the "raid" test started on the 22nd of
April 2016. That fits very well with the fact that I upgraded that
server on the 20th of April 2016 from 4.3.26 (plus
CGI-regression fixes) to 4.3.27.
The misassigned reporting of the "disk" test was first seen in our
(IRC) logs on the 15th of May 2016, i.e. weeks after the upgrade of
Xymon to 4.3.27. But the misassigned "ports" test seems to have
started already on the 13th of April 2016 (and that test had been green
for more than a year beforehand), i.e. while 4.3.26 (plus
CGI-regression fixes) was still running.
I can't say when the misassigned cpu tests started, since I didn't find
them via obvious false positives but because an actual report
listed xen daemons on a web server.
* There are hosts whose reports get assigned to other hosts while they
themselves are assigned reports from yet another host: the procs test
of host 1 (a web server with a nearly full disk) is (occasionally)
reported as belonging to host 2 (a Xen server, easily spotted by
several daemons containing "xen" in their name), the cpu test of
host 2 is (occasionally) reported as belonging to host 3 (another web
server, but with very low disk usage), and the disk test of host 1 is
(occasionally) reported as belonging to host 3, too.
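For reference, the comparison between the displayed status and the
stored client data mentioned above can also be done from the command
line on the Xymon server itself; the host name below is just a
placeholder:

  # What the status channel delivered (what the status web page shows):
  xymon 127.0.0.1 "xymondlog examplehost.disk"
  # What the client actually sent (the "Client data" link):
  xymon 127.0.0.1 "clientlog examplehost section=df"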
So far we've noticed this only on one of our two Xymon servers, let's
call it "Server B". Even explicitly grepping the log history of
our Xymon IRC bot for potential cases on Server A (like ports showing
up on servers where they never were and shouldn't be) didn't bring up
anything similar.
Both run on Debian 8 Jessie ("Stable") on the amd64 architecture with
backported xymon packages from Debian Unstable/Testing (i.e. always
the newest packages -- as I test the packages there under live
conditions before uploading them to Debian). Both use Apache 2.4 as
webserver.
Server A:
* real hardware
* about 850 hosts
* configured "critical view" setup
* only reports to itself.
* permanently monitored by our group, i.e. I expect that we would
notice such misassignments rather quickly, especially if they
happen in the "critical view".
* Also runs some other monitoring-like tools like e.g. Smokeping.
Server B:
* virtual machine
* about 35 hosts
* Some hosts (actually network-enabled environment sensors) report
once per minute. Those reports get passed through to a separate task
("CMD xymond_channel --channel=status ...", see the sketch below this
list) which logs _all_ their values and archives them month by month
for later reference.
* no "critical view"
* reports to itself and to Server A.
* Also tries to report to Server C (which was a test installation of
the Xymon 5 branch, now offline)
* mostly monitors customer machines (call it "Xymon as a Service" ;-),
i.e. it is not looked at that much by our group.
* Writes "Flushed 1 stale messages for 0.0.0.0:0" every 5 minutes into
/var/log/xymon/clientdata.log, probably related to Server C being
offline.
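For completeness, the entry driving that logging task in tasks.cfg on
Server B looks roughly like the following sketch; the section name and
the path of the logging script are placeholders, not the real ones:

  [sensorlogger]
        ENVFILE /etc/xymon/xymonserver.cfg
        CMD xymond_channel --channel=status /usr/local/bin/log-sensor-status

xymond_channel attaches to the "status" channel and feeds every
incoming status message to the given worker program on its stdin.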
So far we have only noticed that strange behaviour on Server B -- which
is the less busy of the two servers (despite that bunch of hosts
reporting every minute).
Actually, the first occurrence we noticed was when the VM "Server B"
reported to Server A that its (non-existent) "raid" test had gone
purple.
And then today one of our customers complained about his disk test
going yellow several times a day, even though the partition in question
is at 4% most of the time according to Xymon -- except when it's at 94%
for 5 minutes and then back at 4% again... (And I can't imagine an old
80 GB disk being stuffed with 75 GB of data within 5 minutes.) One
single _other_ host reporting to Server B is permanently at 94% disk
usage, and its (and only its) yellow threshold is set to 95%, hence its
reports are green for the same set of data.
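To illustrate the threshold situation: the per-host setting lives in
analysis.cfg on Server B and looks roughly like this (the host name and
the red/panic levels are placeholders; only the 90%/95% yellow levels
match our setup):

  # Default for all hosts: yellow at 90% usage, red at 95%
  DEFAULT
        DISK * 90 95

  # The one host that is permanently at 94%: yellow only at 95%
  HOST=fullhost.example.com
        DISK / 95 98

So a genuine 94% report from that host stays green, while the same data
misassigned to the customer's host trips his 90% yellow threshold.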
So does anyone have an idea what goes wrong on our Server B so that it
mixes up reports from multiple hosts while Server A doesn't show these
symptoms?
Kind regards, Axel Beckert
--
Axel Beckert <beckert at phys.ethz.ch> support: +41 44 633 26 68
IT Services Group, HPT H 6 voice: +41 44 633 41 89
Department of Physics, ETH Zurich
CH-8093 Zurich, Switzerland http://nic.phys.ethz.ch/