<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On 14 May 2015 at 06:32, John Thurston <span dir="ltr">&lt;<a href="mailto:john.thurston@alaska.gov" target="_blank">john.thurston@alaska.gov</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

To me, xymon/hobbit/BB are alerting tools. Their purpose is to tell me "A threshold you defined has been exceeded. You'd better go figure out if there is a problem brewing!" When Xymon has done this, its job is done. I don't expect it to do much more.<br></blockquote></div><br></div><div class="gmail_extra">For me, Xymon is much more than an alerting tool.  It's also critical for forensics.  Once a fault has been detected, the graphs and snapshot reports are extremely valuable for working out which historical factors may be relevant to it.<br><br>Two ways I use Xymon for forensics:<br><br>1) If an event has a history, there might be a pattern that sheds light on the cause (eg disk space problems at the start of every month) or a coincident event (eg packet loss concurrent with a spike in disk I/O).<br><br>2) If a monitored metric shows a short-term spike or a slow long-term increase, then identifying when it started to climb can help pin down the change or event that caused it.<br><br></div><div class="gmail_extra">"Go fix it" helps with the immediate problem, and its purpose is tactical and short-term.  But looking to the past can help prevent recurrence in the future.<br><br></div><div class="gmail_extra">In the specific case of a CPU load fault, it can be valuable to know which processes are new - in other words, what wasn't running 5 minutes before the event but was running after it.  In some cases a new process's start time can be gleaned from the STIME column in the output of "ps -ef".  In other cases, it might be a process that is run from cron or inetd, or in a while loop, and doesn't have a very long lifetime.  Or there might be a situation where a clean-up process has crashed, and you want to know what was running that no longer is. Admittedly, these are somewhat contrived scenarios, and I have no concrete examples to prove that they can happen.  
But in your own words, it's "silly to think [we] can predict all the information [we'll] need", and so in my opinion (and experience) the more data, the better.<br><br></div><div class="gmail_extra">If security is the problem, then secure the data.  Suppressing the data is only one way to secure it, and doing so can have downsides.<br></div><div class="gmail_extra"><br></div><div class="gmail_extra">In my deployment, I limit unauthenticated access to Apache, so those who don't need to see my log files and process listings don't get to see them, while those who might benefit can.<br><br></div><div class="gmail_extra">Cheers<br></div><div class="gmail_extra">Jeremy<br><br></div></div>