[Xymon] Thoughts on Usefulness/Reliability of Purple Alerts
john.r.rothlisberger at accenture.com
Wed Aug 26 15:20:01 CEST 2015
My 2 cents…
As you already know, purple alerts may or may not indicate a real issue from the perspective of the client server fulfilling its intended role. Example: you are monitoring a webserver, the client service terminates, the tests go purple – YET the website continues to function.
Then there are those occasions where a server hangs but still responds to a ping.
There is one other occasion when some of the tests go purple but not all, and that is usually when the eventlogs fill the data file. This results in tests such as procs, svcs, who, etc. going purple.
My solution has been to send alerts only when the disk test has gone purple. This reduces the number of purple alerts being sent out and also limits those alerts to a single alert per client.
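To illustrate the idea, here is a rough Python sketch of that filtering. It is not Xymon's alerting code; the (host, test, color) tuples and the function name are assumptions made up for the example.

```python
from typing import Iterable, List, Tuple

# Hypothetical shape for a status that would normally trigger an alert.
Status = Tuple[str, str, str]  # (host, test, color)


def purple_alerts_disk_only(
    statuses: Iterable[Status], gate_test: str = "disk"
) -> List[Status]:
    """Pass non-purple alerts through; keep a purple alert only for gate_test."""
    kept = []
    for host, test, color in statuses:
        # Red/yellow alerts are untouched. Purple is reported only for the
        # one test chosen to represent "the client stopped reporting".
        if color != "purple" or test == gate_test:
            kept.append((host, test, color))
    return kept


if __name__ == "__main__":
    sample = [
        ("web01", "disk", "purple"),
        ("web01", "procs", "purple"),
        ("web01", "cpu", "red"),
    ]
    for status in purple_alerts_disk_only(sample):
        print(status)
```

Because every client reports a disk test, gating on it yields at most one purple alert per client, while partially purple hosts (procs, svcs, who) stay quiet.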
IT Strategy, Infrastructure & Security - Technology Growth Platform
TGP for Business Process Outsourcing
From: Xymon [mailto:xymon-bounces at xymon.com] On Behalf Of Matt Vander Werf
Sent: Monday, August 24, 2015 12:02 PM
To: cleaver at terabithia.org; henrik at hswn.dk
Cc: xymon at xymon.com; Rich Sudlow <rich at nd.edu>
Subject: [Xymon] Thoughts on Usefulness/Reliability of Purple Alerts
This is primarily for Henrik and J.C., but anyone else is free to chime in their thoughts on this as well!
We have a Xymon server (the latest Terabithia RPM on RHEL 7) in production that monitors around 1950 hosts (and consistently growing).
About a week ago, we experienced some pretty bad purple alert storms in the middle of the night that were all false-positive alerts (over 300 alerts one night). Most of the tests that went purple went back to green at the next update interval. At this point, we've been unable to figure out a root cause behind this issue, but it hasn't happened again since early last week (all the easy, understandable possible causes have been ruled out: network load/bandwidth, CPU load of the Xymon server and of the affected Xymon clients, etc.).
We have been using purple alerts for some time now and find them fairly reliable and useful for the most part. When a machine hangs or something similar happens (causing the Xymon client on the machine to stop being able to report to the Xymon server), we don't get any red or yellow alerts for any other tests. (We have found that a machine can sometimes hang but still have a network connection that Xymon can ping successfully.) We haven't had any major issues with false-positive purple alerts (for the most part), or any purple alert storms, since we started using them consistently for all our machines a couple of years ago.
I understand that when Xymon was first forked from Big Brother a long while back, it may have been noted that one big change from Big Brother was that you didn't need to do purple alerts (or something like that) and that it was discouraged to use purple alerts, as they were seen as widely unreliable. (I'm hearing this from a coworker of mine, who set up our original Xymon server some 5 years ago, but have been unable to find what he's referring to.) But from what I can see from the current documentation and the mailing list archives, I'm not seeing any place where the use of purple alerts is discouraged due to them being unreliable.
So, I wanted to see what the current thinking regarding purple alerts and their use is from both the original main maintainer, Henrik, and the more current main maintainer, J.C. (at least of the current release). Are purple alerts still considered wholly unreliable, or even somewhat unreliable (or were they ever)? Are they discouraged in any way from being used? Have they caused issues for any of you on this list? Or vice versa: Have they worked well for you? I'm fully aware that this purple alert storm issue we had may be just a one-off occurrence and that we could have no additional issues in the future with purple alerts.
I understand that purple alerts are different than other alerts, like red and yellow alerts, in that it is an indication that the Xymon client has stopped working/reporting (on a per-test basis) to the Xymon server for some reason, rather than an issue from a specific test (e.g. with the CPU load, memory, etc.).
(Possible) Feature Request:
In addition, I'd be interested to know whether there is a way to get only one alert for a machine if, say, all the tests for that machine go purple, instead of an alert for each purple test. I don't believe this is possible currently, correct? Is this something that could possibly be implemented in the future? I understand if it's not or if it wouldn't be very easy.
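To make the request concrete, the grouping could look roughly like the following. This is only a Python sketch of the desired behavior, not Xymon code; the (host, test, color) tuples, the function name, and the message wording are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical shape for the current status of one test on one host.
Status = Tuple[str, str, str]  # (host, test, color)


def collapse_purple(statuses: Iterable[Status]) -> List[str]:
    """Emit one message per host when all of its tests are purple,
    otherwise one message per purple test."""
    by_host: Dict[str, Dict[str, str]] = defaultdict(dict)
    for host, test, color in statuses:
        by_host[host][test] = color

    alerts: List[str] = []
    for host in sorted(by_host):
        tests = by_host[host]
        if all(color == "purple" for color in tests.values()):
            # The whole client has gone silent: a single alert is enough.
            alerts.append(f"{host}: all {len(tests)} tests purple")
        else:
            # Only some tests are stale: report them individually.
            for test in sorted(tests):
                if tests[test] == "purple":
                    alerts.append(f"{host}.{test}: purple")
    return alerts


if __name__ == "__main__":
    sample = [
        ("node01", "cpu", "purple"),
        ("node01", "disk", "purple"),
        ("node02", "cpu", "green"),
        ("node02", "disk", "purple"),
    ]
    for line in collapse_purple(sample):
        print(line)
```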
I appreciate your time in answering my questions and look forward to your input! (And apologies for the long-winded e-mail!)
Thanks very much in advance!!
Matt Vander Werf