[Xymon] Thoughts on Usefulness/Reliability of Purple Alerts

Matt Vander Werf mvanderw at nd.edu
Mon Aug 24 19:01:34 CEST 2015

This is primarily for Henrik and J.C., but anyone else is free to chime in
with their thoughts on this as well!

We have a Xymon server (the latest Terabithia RPM on RHEL 7) in production
that monitors around 1950 hosts (and steadily growing).
About a week ago, we experienced some pretty bad purple alert storms in the
middle of the night that were all false-positive alerts (over 300 alerts
one night). For most of the tests that went purple, they went back to green
at the next update interval. At this point, we've been unable to figure out
a root cause behind this issue, but it hasn't happened again since early
last week (all the easy, understandable possible causes have been ruled out:
network load/bandwidth, CPU load of Xymon server and of Xymon clients
affected, etc.).

We have been using purple alerts for some time now, find them fairly
reliable for the most part, and think they are useful: when a machine hangs
or something similar happens (so that the Xymon client on the machine can no
longer report to the Xymon server), we often don't get any red or yellow
alerts for any other tests (we have found that a machine can sometimes hang
but still have a network connection that Xymon can ping successfully). We
haven't had any major issues with false-positive purple alerts
(for the most part), or any purple alert storms, since we started using
them consistently for all our machines a couple years ago.

I understand that when Xymon was first forked from Big Brother a long while
back, it may have been noted that one big change from Big Brother was that
you didn't need to do purple alerts (or something like that) and that it
was discouraged to use purple alerts, as they were seen as widely
unreliable. (I'm hearing this from a coworker of mine, who set up our
original Xymon server some 5 years ago, but I have been unable to find what
he's referring to.) But from what I can see from the current documentation
and the mailing list archives, I'm not seeing any place where the use of
purple alerts is discouraged due to them being unreliable.

So, I wanted to see what the current thinking/view regarding purple alerts
and the use of purple alerts was by both the original main maintainer,
Henrik, and the more current main maintainer, J.C. (at least of the current
release). Are purple alerts still considered wholly unreliable, or even
somewhat unreliable (or were they ever)? Are they discouraged in any way or
fashion from being used? Have they caused issues for any of you on this
list? Or vice versa: Have they worked well for you? I'm fully aware that
the purple alert storm we had may be just a one-off occurrence and that we
may have no further issues with purple alerts in the future.

I understand that purple alerts are different from other alerts, like red
and yellow alerts, in that they indicate that the Xymon client has
stopped working/reporting (on a per-test basis) to the Xymon server for
some reason, rather than an issue with a specific test (e.g. with the CPU
load, memory, etc.).

*(Possible) Feature Request:*
In addition, I'd be interested if there was a way that you could only get
one alert for a machine if say all the tests for that machine go purple,
instead of an alert for each purple test. I don't believe this is possible
currently, correct? Is this something that could possibly be implemented in
the future? I understand if it's not or if it wouldn't be very easy.
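
As a rough illustration of the grouping described above, here is a minimal
sketch in Python. It is not an existing Xymon feature or API: the function
name collapse_purple_alerts, the input layout (host name mapped to a
dictionary of test name and color), and the message wording are all
hypothetical, chosen only to show the intended behavior.

```python
from typing import Dict, List


def collapse_purple_alerts(statuses: Dict[str, Dict[str, str]]) -> List[str]:
    """Build alert messages from per-host test colors.

    Hypothetical helper: emits ONE message for a host when every one of its
    tests is purple, and one message per purple test otherwise.
    """
    alerts: List[str] = []
    for host in sorted(statuses):
        tests = statuses[host]
        purple = sorted(name for name, color in tests.items() if color == "purple")
        if not purple:
            continue  # nothing purple on this host
        if len(purple) == len(tests):
            # Every test is stale: report the host once instead of per test.
            alerts.append(f"{host}: all {len(tests)} tests purple")
        else:
            alerts.extend(f"{host}.{name}: purple" for name in purple)
    return alerts


if __name__ == "__main__":
    example = {
        "web01": {"cpu": "purple", "disk": "purple", "memory": "purple"},
        "db01": {"cpu": "green", "disk": "purple"},
    }
    for line in collapse_purple_alerts(example):
        print(line)
```

With the example data, web01 would produce a single host-level message,
while db01 would still produce a per-test message for disk only.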

I appreciate your time in answering my questions and look forward to your
input! (And apologies for the long-winded e-mail!)

Thanks very much in advance!!

Matt Vander Werf
