[Xymon] Thoughts on Usefulness/Reliability of Purple Alerts
Phil.Crooker at orix.com.au
Wed Aug 26 03:12:29 CEST 2015
My two bits:
Big Brother had problems with network tests: if you ran too many tests (e.g. ssh, smtp, etc.), or if they took too long, they wouldn't complete before the next round of tests began. This caused everything to go purple, and it was certainly one of the reasons purple alerts weren't considered 'reliable'. In my experience (with not a huge number of hosts and tests) this doesn't occur with Xymon, presumably because the tests are run in parallel rather than sequentially, as was the case with BB.
About your feature request - have you tried using the 'depends=' parameter in hosts.cfg?
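For illustration, here is a rough sketch of what such entries might look like. The IP addresses and the host names (foo, web1, router1) are made up; the form depends=(testA:host1/test1) is the one described in the hosts.cfg man page:

```
# hosts.cfg (hypothetical hosts and addresses, for illustration only)

# If both http and webmin fail on foo, webmin is reported as clear
1.2.3.4     foo      # http://foo/ webmin depends=(webmin:foo/http)

# If router1 fails its conn test, a failing ssh test on web1 is reported as clear
10.0.0.20   web1     # ssh depends=(ssh:router1/conn)
```

As far as I know, depends= is evaluated by xymonnet while it runs the network tests, so it can only refer to other network tests. It may therefore not help with purple statuses from client-side tests such as cpu or disk.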
From: Xymon <xymon-bounces at xymon.com> on behalf of Matt Vander Werf <mvanderw at nd.edu>
Sent: Tuesday, 25 August 2015 2:31 AM
To: cleaver at terabithia.org; henrik at hswn.dk
Cc: xymon at xymon.com; Rich Sudlow
Subject: [Xymon] Thoughts on Usefulness/Reliability of Purple Alerts
This is primarily for Henrik and J.C., but anyone else is free to chime in their thoughts on this as well!
We have a Xymon server (the latest Terabithia RPM on RHEL 7) in production that monitors around 1950 hosts (and consistently growing).
About a week ago, we experienced some pretty bad purple alert storms in the middle of the night that were all false-positive alerts (over 300 alerts one night). Most of the tests that went purple went back to green at the next update interval. At this point, we've been unable to figure out a root cause behind this issue, but it hasn't happened again since early last week (all the easy, understandable possible causes have been ruled out: network load/bandwidth, CPU load of the Xymon server and of the affected Xymon clients, etc.).
We have been using purple alerts for some time now, find them fairly reliable for the most part, and think they are useful: when a machine hangs or something similar happens (so that the Xymon client on the machine can no longer report to the Xymon server), we don't get any red or yellow alerts for any other tests (we have found that a machine can sometimes hang but still have a network connection that Xymon can ping successfully). We haven't had any major issues with false-positive purple alerts (for the most part), or any purple alert storms, since we started using them consistently for all our machines a couple of years ago.
I understand that when Xymon was first forked from Big Brother a long while back, it may have been noted that one big change from Big Brother was that you didn't need to do purple alerts (or something like that) and that it was discouraged to use purple alerts, as they were seen as widely unreliable. (I'm hearing this from a coworker of mine, who set up our original Xymon server some 5 years ago, but I have been unable to find what he's referring to.) From what I can see in the current documentation and the mailing list archives, there is no place where the use of purple alerts is discouraged due to them being unreliable.
So, I wanted to see what the current thinking/view regarding purple alerts and their use is from both the original main maintainer, Henrik, and the more current main maintainer, J.C. (at least of the current release). Are purple alerts still considered wholly unreliable, or even somewhat unreliable (or were they ever)? Are they discouraged in any way or fashion from being used? Have they caused issues for any of you on this list? Or vice versa: Have they worked well for you? I'm fully aware that this purple alert storm issue we had may be just a one-off occurrence and that we may have no further issues with purple alerts in the future.
I understand that purple alerts are different from other alerts, like red and yellow alerts, in that they indicate that the Xymon client has stopped working/reporting (on a per-test basis) to the Xymon server for some reason, rather than an issue found by a specific test (e.g. with the CPU load, memory, etc.).
(Possible) Feature Request:
In addition, I'd be interested if there was a way that you could only get one alert for a machine if say all the tests for that machine go purple, instead of an alert for each purple test. I don't believe this is possible currently, correct? Is this something that could possibly be implemented in the future? I understand if it's not or if it wouldn't be very easy.
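To make the requested behaviour concrete, here is a minimal sketch of the grouping logic in Python. It is not part of Xymon and does not use any Xymon API; the function name, the (host, test) input format and the alert strings are all hypothetical, and it assumes "all tests" means every status known for that host:

```python
from collections import defaultdict


def collapse_purple_alerts(statuses):
    """Return one alert per host when every test on it is purple,
    otherwise one alert per purple test.

    statuses: dict mapping (host, test) -> color string (hypothetical format).
    """
    # Group the per-test colors by host.
    by_host = defaultdict(dict)
    for (host, test), color in statuses.items():
        by_host[host][test] = color

    alerts = []
    for host in sorted(by_host):
        tests = by_host[host]
        purple = sorted(t for t, c in tests.items() if c == "purple")
        if not purple:
            continue  # nothing to report for this host
        if len(purple) == len(tests):
            # Every known test is purple: send a single host-level alert.
            alerts.append(f"{host}: all {len(tests)} tests purple")
        else:
            # Only some tests are purple: keep one alert per test.
            alerts.extend(f"{host}.{t}: purple" for t in purple)
    return alerts
```

With two purple tests on one host and one purple test out of two on another, this would produce a single alert for the first host and a per-test alert for the second.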
I appreciate your time in answering my questions and look forward to your input! (And apologies for the long-winded e-mail!)
Thanks very much in advance!!
Matt Vander Werf