[Xymon] Thoughts on Usefulness/Reliability of Purple Alerts
Matt Vander Werf
mvanderw at nd.edu
Wed Aug 26 16:01:38 CEST 2015
Thanks for your input!
Just curious, but what would be your definition of "a huge number of hosts
and tests"? This might be kind of subjective, as people (including myself)
might interpret this differently. Would around 1950 hosts, most of them
with just the standard tests (conn, cpu, disk, memory) set up for alerts, with
maybe 50 or so additional alerts set up for various other "custom"
tests, count as huge?
Just curious where you come down on a "huge number of hosts and tests".
Matt Vander Werf
HPC System Administrator
University of Notre Dame
Center for Research Computing - Union Station
506 W. South Street
South Bend, IN 46601
Phone: (574) 631-0692
On Tue, Aug 25, 2015 at 9:12 PM, Phil Crooker <Phil.Crooker at orix.com.au> wrote:
> My two bits:
> Big Brother had problems with network tests: if you ran too many
> tests (e.g. ssh, smtp, etc.) or if they took too long, they wouldn't complete
> before the next round of tests began. This caused everything to go purple
> and was certainly one of the reasons purple alerts weren't considered
> 'reliable'. In my experience (with not a huge number of hosts and
> tests) this doesn't occur with Xymon, presumably because the tests are run
> in parallel rather than sequentially, as was the case with BB.
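As a rough illustration of the point above about a round of tests overrunning its interval, here is a minimal Python sketch. It does not reflect how Big Brother or Xymon actually schedule tests: the test durations, the 300-second interval, the worker count and the helper name `round_time` are all assumptions made up for the example.

```python
import heapq

def round_time(durations, workers=1):
    """Seconds needed to finish one round of tests with `workers` running at once."""
    # Each heap entry is the time at which one worker becomes free.
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)      # earliest available worker
        heapq.heappush(free_at, start + d)  # busy until this test finishes
    return max(free_at)

# Hypothetical numbers: 400 network tests of 1 second each, 300-second interval.
durations = [1.0] * 400
interval = 300.0

sequential = round_time(durations, workers=1)   # one test at a time
parallel = round_time(durations, workers=50)    # 50 tests in flight at once

print("sequential round overruns interval:", sequential > interval)
print("parallel round overruns interval:", parallel > interval)
```

With these invented numbers the sequential round takes longer than the interval while the parallel round does not, which is the effect described in the paragraph above.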
> About your feature request - have you tried using the 'depends=' parameter
> in hosts.cfg?
> *From:* Xymon <xymon-bounces at xymon.com> on behalf of Matt Vander Werf <
> mvanderw at nd.edu>
> *Sent:* Tuesday, 25 August 2015 2:31 AM
> *To:* cleaver at terabithia.org; henrik at hswn.dk
> *Cc:* xymon at xymon.com; Rich Sudlow
> *Subject:* [Xymon] Thoughts on Usefulness/Reliability of Purple Alerts
> This is primarily for Henrik and J.C., but anyone else is free to chime in
> with their thoughts on this as well!
> We have a Xymon server (the latest Terabithia RPM on RHEL 7) in production
> that monitors around 1950 hosts (and consistently growing).
> About a week ago, we experienced some pretty bad purple alert storms in
> the middle of the night that were all false-positive alerts (over 300
> alerts one night). For most of the tests that went purple, they went back
> to green at the next update interval. At this point, we've been unable to
> figure out a root cause behind this issue, but it hasn't happened again
> since early last week (all the easy, understandable possible causes have
> been ruled out: network load/bandwidth, CPU load of Xymon server and of
> Xymon clients affected, etc.).
> We have been using purple alerts for some time now, find them fairly
> reliable for the most part, and think they are useful, as machines hang or
> something similar (causing the Xymon client on the machine to stop being
> able to report to the Xymon server) and we don't get any red or yellow
> alerts for any other tests (sometimes a machine can hang but still have a
> network connection that can be successfully pinged by Xymon, we have
> found). We haven't had any major issues with false-positive purple alerts
> (for the most part), or any purple alert storms, since we started using
> them consistently for all our machines a couple years ago.
> I understand that when Xymon was first forked from Big Brother a long
> while back, it may have been noted that one big change from Big Brother was
> that you didn't need to do purple alerts (or something like that) and that
> it was discouraged to use purple alerts, as they were seen as widely
> unreliable. (I'm hearing this from a coworker of mine, who set up our
> original Xymon server some 5 years ago, but I have been unable to find what
> he's referring to.) But from what I can see from the current documentation
> and the mailing list archives, I'm not seeing any place where the use of
> purple alerts is discouraged due to them being unreliable.
> So, I wanted to see what the current thinking/view regarding purple alerts
> and the use of purple alerts was by both the original main maintainer,
> Henrik, and the more current main maintainer, J.C. (at least of the current
> release). Are purple alerts still considered wholly unreliable, or even
> somewhat unreliable (or were they ever)? Are they discouraged in any way or
> fashion from being used? Have they caused issues for any of you on this
> list? Or vice versa: Have they worked well for you? I'm fully aware that
> this purple alert storm may have been just a one-off occurrence and that we
> might have no further issues with purple alerts in the future.
> I understand that purple alerts are different from other alerts, like red
> and yellow alerts, in that they indicate that the Xymon client has
> stopped working/reporting (on a per-test basis) to the Xymon server for
> some reason, rather than an issue from a specific test (e.g. with the CPU
> load, memory, etc.).
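To make that distinction concrete, here is a minimal Python sketch of the idea. It is not Xymon's actual code: a status keeps its reported color only while it is fresh, and is shown as purple once no update has arrived within its validity period. The function name `displayed_color` and the 30-minute validity are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

def displayed_color(reported_color, last_update, now,
                    validity=timedelta(minutes=30)):
    """Return the color to show for one host/test status."""
    # Assumption: a status is stale once `validity` passes without an update.
    if now - last_update > validity:
        return "purple"
    return reported_color

# Hypothetical timestamps for two statuses last reported as green.
now = datetime(2015, 8, 25, 3, 0)
fresh = displayed_color("green", now - timedelta(minutes=5), now)
stale = displayed_color("green", now - timedelta(minutes=45), now)
print(fresh, stale)
```

Note that the last reported color plays no part once the status is stale, which is why purple says nothing about CPU load, memory and so on.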
> *(Possible) Feature Request:*
> In addition, I'd be interested if there was a way that you could only get
> one alert for a machine if say all the tests for that machine go purple,
> instead of an alert for each purple test. I don't believe this is possible
> currently, correct? Is this something that could possibly be implemented in
> the future? I understand if it's not or if it wouldn't be very easy.
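One way to picture the request above is as a post-processing step on a list of statuses. The Python sketch below is purely hypothetical and does not use any Xymon interface: the function `collapse_purple`, the `statuses` layout (host -> test -> color) and the host names are invented for illustration.

```python
def collapse_purple(statuses):
    """Build alert lines from a {host: {test: color}} mapping.

    If every test on a host is purple, emit one host-level alert;
    otherwise emit one alert per purple test.
    """
    alerts = []
    for host, tests in sorted(statuses.items()):
        purple = sorted(t for t, color in tests.items() if color == "purple")
        if purple and len(purple) == len(tests):
            alerts.append(f"{host}: all tests purple")
        else:
            alerts.extend(f"{host}.{t}: purple" for t in purple)
    return alerts

# Hypothetical hosts: one fully purple, one with a single purple test.
statuses = {
    "node001": {"cpu": "purple", "disk": "purple", "memory": "purple"},
    "node002": {"cpu": "green", "disk": "purple", "memory": "green"},
}
alerts = collapse_purple(statuses)
print(alerts)
```

Here the fully purple host produces a single line, while the host with one stale test still gets a per-test alert.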
> I appreciate your time in answering my questions and look forward to your
> input! (And apologies for the long-winded e-mail!)
> Thanks very much in advance!!
> Matt Vander Werf