[Xymon] Thoughts on Usefulness/Reliability of Purple Alerts

Wed Aug 26 08:38:24 CEST 2015

On Mon, August 24, 2015 10:15 am, Sean MacGuire wrote:
> OK, I'll chime in... I wrote Big Brother so know something about
> purple alerts.
>
> The purple alert was something no other monitoring system did, and
> it took care of the problem of a BB (and Xymon) client dropping dead
> and/or machines being in a zombie state (i.e. responding to pings but
> otherwise hung).
>
> They're useful as an indication of something being wrong, vs a red or
> yellow alert, which provides a clear and actionable problem status.
>
> So purple alerts are awesome, unless your server has lost contact
> with the clients reporting in and everyone goes purple at the same
> time resulting in the "Massive Purple Explosion".
>
> Background explanation - the idea was to timestamp reports into the
> future, and if a client doesn't report in by then, the validity of
> the last report is in question - it works exactly the same way as
> the expiration date on a carton of milk - the milk might not have
> gone bad, yet, but you might want to check it before drinking it.
>
>
> Matt Vander Werf wrote:
>> This is primarily for Henrik and J.C., but anyone else is free to chime
>> in their thoughts on this as well!
>>
>> *Background:
>> We have a Xymon server (the latest Terabithia RPM on RHEL 7) in
>> production that monitors around 1950 hosts (and consistently growing).
>> About a week ago, we experienced some pretty bad purple alert storms in
>> the middle of the night that were all false-positive alerts (over 300
>> alerts one night). For most of the tests that went purple, they went
>> back to green at the next update interval. At this point, we've been
>> unable to figure out a root cause behind this issue, but it hasn't
>> happened again since early last week (all the easy, understandable
>> possible causes have been ruled out: network load/bandwidth, CPU load of
>> Xymon server and of Xymon clients affected, etc.).
>>
>> We have been using purple alerts for some time now, find them fairly
>> reliable for the most part, and think they are useful, as machines hang
>> or something similar (causing the Xymon client on the machine to stop
>> being able to report to the Xymon server) and we don't get any red or
>> yellow alerts for any other tests (sometimes a machine can hang but
>> still have a network connection that can be successfully pinged by
>> Xymon, we have found). We haven't had any major issues with
>> false-positive purple alerts (for the most part), or any purple alert
>> storms, since we started using them consistently for all our machines a
>> couple years ago.
>>
>> I understand that when Xymon was first forked from Big Brother a long
>> while back, it may have been noted that one big change from Big Brother
>> was that you didn't need to do purple alerts (or something like that)
>> and that it was discouraged to use purple alerts, as they were seen as
>> widely unreliable. (I'm hearing this from a coworker of mine, who set up
>> our original Xymon server some 5 years ago, but have been unable to find
>> what he's referring to.) But from what I can see from the current
>> documentation and the mailing list archives, I'm not seeing any place
>> where the use of purple alerts is discouraged due to them being
>> unreliable.
>>
>> *Question(s):
>> So, I wanted to see what the current thinking/view regarding purple
>> alerts and the use of purple alerts was by both the original main
>> maintainer, Henrik, and the more current main maintainer, J.C. (at least
>> of the current release). Are purple alerts still considered wholly
>> unreliable, or even somewhat unreliable (or were they ever)? Are they
>> discouraged in any way or fashion from being used? Have they caused
>> issues for any of you on this list? Or vice versa: Have they worked well
>> for you? I'm fully aware that this purple alert storm issue we had is
>> just a one-off occurrence and we could have no additional issues
>> in the future with purple alerts.
>>
>> I understand that purple alerts are different than other alerts, like
>> red and yellow alerts, in that it is an indication that the Xymon client
>> has stopped working/reporting (on a per-test basis) to the Xymon server
>> for some reason, rather than an issue from a specific test (e.g. with
>> the CPU load, memory, etc.).
>>
>> *(Possible) Feature Request:
>> In addition, I'd be interested if there was a way that you could only
>> get one alert for a machine if say all the tests for that machine go
>> purple, instead of an alert for each purple test. I don't believe this
>> is possible currently, correct? Is this something that could possibly be
>> implemented in the future? I understand if it's not or if it wouldn't be
>> very easy.
>>
>>
>> I appreciate your time in answering my questions and look forward to
>> your input! (And apologies for the long-winded e-mail!)
>>
>>
>> Thanks very much in advance!!
>>
>> --
>> Matt Vander Werf

Matt,

Generally speaking, a purple alert should be seen first and foremost as an
indication of a failure in the monitoring *system*... where "system"
includes the client pushing data up from the various servers you're paying
attention to.

By having a calculation made on each message's receipt of how long that
message is good for (receipt time + [default, or specified]), we have a
"fail safe" for an unknown issue occurring that requires attention. The
proximate cause of the purple is the failure to receive a message. Whether
that's caused by a hang or death of the usual originator, a bug in a
xymonproxy, a cut network cable, or xymond being unable to handle all of
the traffic sent to it before it times out, is left somewhat as an
exercise for the administrator.

Because purple alerts are generated from xymond's own view of its internal
state (calculated once a minute) and are never sent IN to xymond, purple
alerts should be a reliable indicator that... some other type of
unreliability is going on :)
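
To make the validity calculation and the once-a-minute check a bit more
concrete, here is a minimal Python sketch. It is only an illustration and
not xymond's actual code: the names Status, receive and sweep, and the
30-minute DEFAULT_LIFETIME, are assumptions made up for this example.

```python
from dataclasses import dataclass

# Assumed default lifetime of a status message, in seconds (illustrative).
DEFAULT_LIFETIME = 30 * 60


@dataclass
class Status:
    host: str
    test: str
    color: str
    valid_until: float  # receipt time + lifetime


def receive(store, host, test, color, now, lifetime=DEFAULT_LIFETIME):
    """Record an incoming status and stamp it with an expiry in the future."""
    store[(host, test)] = Status(host, test, color, now + lifetime)


def sweep(store, now):
    """Periodic check (think: once a minute) over the server's own state.

    Any status whose validity has lapsed is turned purple. Purple is never
    received from a client; it is only ever derived here.
    """
    expired = []
    for key, status in store.items():
        if status.color != "purple" and now > status.valid_until:
            status.color = "purple"
            expired.append(key)
    return expired
```

Calling receive() again for the same host and test simply replaces the
entry, which is how a test would go back to green at the next update.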

Because of the wide possibility of different configurations, it's a little
dangerous to create a one-size-fits-all strategy for purples. In a typical
xymon installation with xymonnet and xymond_client running locally on the
same machine, with no proxies or network segments in the middle, and with
clients reporting directly in as well, you really shouldn't see any purple
alerts outside of clients dying... And if the client is dying because the
box is dying, by default you'll only get the 'conn' test red alert instead
of the various xymond_client and xymonnet-generated ones (unless you're
using the 'noclear' line in hosts.cfg).

Your suggestion to have only a single 'purple' come through would
*typically* work, but you'd have to ask yourself which test would be the
representative one. In our case, we found it easiest to nominate a
specific xymond_client test -- "memory" -- and only send purple
notifications for that out to our alert team. That takes care of
xymond_client, while leaving esoteric situations caused by the failure of
different sharded xymonnet's, xymonproxy's, or custom independent tests
free to fail in their own way.
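
The decision logic behind that choice is roughly the following. This is a
sketch only: in practice we express it in the alert rules rather than in
code, and should_notify and REPRESENTATIVE_TEST are hypothetical names
used purely for illustration.

```python
# The one xymond_client test nominated to stand in for "client stopped
# reporting" (assumption for this example: "memory").
REPRESENTATIVE_TEST = "memory"


def should_notify(test, color):
    """Decide whether a status change is worth sending to the alert team."""
    if color == "purple":
        # Only the representative test is allowed to page for purple.
        return test == REPRESENTATIVE_TEST
    # Red and yellow are actionable on their own, whatever the test.
    return color in ("red", "yellow")
```

With this, a client that dies produces one purple notification (for
"memory") instead of one per test.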

Again, it's the non-typical cases where it gets tricky. What about custom
tests that aren't being generated by xymond_client that are still
functioning? Perhaps you have xymonnet running on a different machine
that's reporting back to xymond (or to a xymonproxy that's reporting back
to xymond!) that has failed in some way. And of course, it could be that
xymond is under heavy load and is unable to keep up with incoming messages
generally (something we experienced in both the TCP and BFQ configs as we
were scaling out).

Sort of along these lines, however, I'd been considering having a more
"host-wide" way of defining certain failure states directly within xymond,
which would allow some of this override logic to happen centrally (and
more reliably). Imagine a 'conn' being red optionally causing *all* tests
to fail-to-clear, removing the need for this calculation from the
remainder of xymonnet tests. Or a true host-wide "disable" that gets
applied to all tests, even new ones, as a xymond flag. A host-wide
"purple-state" could be conceptualized as well.

That's just a thought, though, and it kind of depends on whether people
would find such a feature useful.
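
For what it's worth, a host-wide "purple-state" might look something like
the sketch below. Nothing like this exists in xymond today as far as this
thread is concerned; rollup and the "*" marker are hypothetical and only
meant to show the idea of one alert per host when every test has gone
purple.

```python
def rollup(statuses):
    """Collapse per-test purples into one host-wide entry where possible.

    statuses maps (host, test) -> color. The result is a list of
    (host, test) pairs to alert on, where test == "*" means "the whole
    host is purple".
    """
    by_host = {}
    for (host, test), color in statuses.items():
        by_host.setdefault(host, {})[test] = color

    alerts = []
    for host in sorted(by_host):
        tests = by_host[host]
        purple = sorted(t for t, c in tests.items() if c == "purple")
        if purple and len(purple) == len(tests):
            # Every known test for this host is purple: one alert only.
            alerts.append((host, "*"))
        else:
            # Partial purple: keep per-test alerts so nothing is hidden.
            alerts.extend((host, t) for t in purple)
    return alerts
```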

Anyway, I hope that's answered some of your questions!

Regards,

-jc