depends tag questions, possible feature request
Charles Jones
jonescr at cisco.com
Mon Oct 9 21:56:12 CEST 2006
First a quick summary of the pertinent part of my bb-hosts:
subpage PRODDB Prod DB
1.2.3.4 prod-db1 # ssh pulldata TRENDS:*,disk:disk|disk1,vmstat:vmstat1
#
subpage PRODURLS Prod URLS
1.2.3.5 URL-SomePortal # cont;http://username:password@vhost.com;SUCCESSFUL
1.2.3.5 URL-SomeApp # cont;http://someapp.com/monitor.php;SUCCESS
1.2.3.6 URL-SomeOtherApp # cont;http://someotherapp.com/monitor.php;SUCCESS
(In case you are wondering, I monitor the URLs that way because they
are load balanced across many servers. I have other tests for the httpd
processes on those specific servers, but the PRODURLS entries are for
alerting when the external URLs are not responding.)
That being said, today we had a problem with prod-db1. Basically Oracle
went nuts and the system load went to 110+, and as a result all of the
PRODURLS alerts went off.
Now, no problem so far, since this is by design. The problem is that,
because the DB was able to handle a request here and there, the PRODURLS
tests were "flapping" (changing status from red to green to red to
green). As a result, acks had no effect, and "Disable until OK" had no
effect either.
I was tasked with finding a way to reduce the amount of pager spam the
next time this happens. The obvious way is to just go in and disable the
affected hosts/services for a specific time period, but this is easier
said than done when the world is on fire and you are on a conference
bridge and have people standing around you waiting for things to be
fixed. In short, the guys who were on call didn't have time to log into
Hobbit and do the disables. Meanwhile their pagers were going nuts,
which added to their frustration.
*It would be nice if Hobbit had "flap detection"*, where if a service
changes state more than X times in Y minutes or seconds, it turns clear
or blue (or maybe even a new color). I am reminded that Nagios has this
feature, and Hobbit is totally better than Nagios, so we shouldn't have
that feature missing, right? ;-)
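
To make the idea concrete, here is a rough sketch of the kind of
sliding-window check I have in mind. This is not Hobbit code; the
FlapDetector class, the effective_color helper, and the choice of "blue"
as the flapping color are all made up for illustration.

```python
from collections import deque


class FlapDetector:
    """Hypothetical helper: flags a test as flapping when its color
    changes more than max_changes times within window_seconds."""

    def __init__(self, max_changes, window_seconds):
        self.max_changes = max_changes
        self.window_seconds = window_seconds
        self.last_color = None
        self.change_times = deque()  # timestamps of recent color changes

    def update(self, color, now):
        """Record one status report; return True if currently flapping."""
        if self.last_color is not None and color != self.last_color:
            self.change_times.append(now)
        self.last_color = color
        # Forget changes that have fallen out of the time window.
        while (self.change_times
               and now - self.change_times[0] > self.window_seconds):
            self.change_times.popleft()
        return len(self.change_times) > self.max_changes


def effective_color(detector, color, now, flap_color="blue"):
    """Return flap_color while the test is flapping, else the real color.
    Using "blue" here is an assumption, not existing Hobbit behaviour."""
    return flap_color if detector.update(color, now) else color
```

With max_changes=3 and window_seconds=300, a test that goes red, green,
red, green, red at one-minute intervals would be reported in the
flapping color on the fifth report, and would go back to its real color
once it has held steady long enough for the old changes to age out.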
*It would be nice if the depends tag worked for any column/test type*.
I looked at using the "depends" tag, but it appears that *depends only
works for network checks*. In other words, I cannot do:
1.2.3.5 URL-SomeApp # cont;http://someapp.com/monitor.php;SUCCESS depends=(http:prod-db1/procs,prod-db1/cpu)
If I have misunderstood about the depends tag, let me know, but it
appears from the man page that it only works for network tests:
"The 'depends' tag is evaluated on the BBNET server while running the
network tests. It can therefore only refer to other network tests that
are handled by the same BBNET server - there is currently no way to use
the e.g. the status of locally run tests (disk, cpu, msgs) or network
tests from other BBNET servers in a dependency definition. Such
dependencies are silently ignored."
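
What I am asking for amounts to something like the sketch below,
evaluated somewhere that can see every column rather than only on the
BBNET server. Again, this is illustration only: apply_depends and its
arguments are hypothetical names, and reporting "clear" when a
prerequisite is red is my assumption about the desired result.

```python
def apply_depends(own_color, depends_on, current_status):
    """Hypothetical dependency check across any column type.

    own_color:      color the test would report by itself
    depends_on:     list of (host, test) tuples this test depends on
    current_status: dict mapping (host, test) to its latest color
    """
    if own_color != "red":
        return own_color
    for key in depends_on:
        # Unknown dependencies are skipped, mirroring "silently ignored".
        if current_status.get(key) == "red":
            return "clear"  # assumed color for a suppressed alert
    return own_color


# Example using the hosts from my bb-hosts above.
status = {("prod-db1", "procs"): "green", ("prod-db1", "cpu"): "red"}
deps = [("prod-db1", "procs"), ("prod-db1", "cpu")]
suppressed = apply_depends("red", deps, status)
```

In the example, URL-SomeApp going red while prod-db1/cpu is red would
come out as "clear", so it would not page anyone.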