depends tag questions, possible feature request

Charles Jones jonescr at cisco.com
Mon Oct 9 21:56:12 CEST 2006


First a quick summary of the pertinent part of my bb-hosts:

subpage PRODDB Prod DB
1.2.3.4 prod-db1              # ssh pulldata TRENDS:*,disk:disk|disk1,vmstat:vmstat1
#
subpage PRODURLS Prod URLS
1.2.3.5 URL-SomePortal        # cont;http://username:password@vhost.com;SUCCESSFUL
1.2.3.5 URL-SomeApp           # cont;http://someapp.com/monitor.php;SUCCESS
1.2.3.6 URL-SomeOtherApp      # cont;http://someotherapp.com/monitor.php;SUCCESS

(In case you are wondering, I monitor the URLs that way because they 
are load balanced across many servers. I have other tests for the httpd 
processes on those specific servers, but the PRODURLS entries are for 
alerting when the external URLs themselves are not responding.)

That being said, today we had a problem with prod-db1. Basically Oracle 
went nuts and the system load went to 110+, and as a result all of the 
PRODURLS alerts went off.

No problem so far, since this is by design. The problem is that, as 
the DB was able to handle a request here and there, the PRODURLS tests 
were "flapping" (changing status from red to green to red to green). So 
Acks had no effect, and "Disable until OK" had no effect.

I was tasked with working out how to reduce the amount of pager spam 
the next time this happens. The obvious way is to just go in and disable 
the affected hosts/services for a specific time period, but this is 
easier said than done when the world is on fire and you are on a 
conference bridge with people standing around you waiting for things to 
be fixed. In short, the guys who were on call didn't have time to log 
into Hobbit and do the disables, and meanwhile their pagers were going 
nuts, which added to their frustration.

*It would be nice if Hobbit had "flap detection"*, where if a service 
changes state more than X times in Y minutes or seconds, it turns clear 
or blue (or maybe even a new color). I am reminded that Nagios has this 
feature, and Hobbit is totally better than Nagios, so we shouldn't have 
that feature missing, right? ;-)
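
To make the idea concrete, here is a minimal Python sketch of the kind 
of check I have in mind. It is not Hobbit code: the FlapDetector class, 
its parameter names, and the thresholds are all made up for 
illustration. It counts color changes for one host/test inside a 
sliding time window and reports flapping once the count exceeds a limit.

```python
from collections import deque


class FlapDetector:
    """Hypothetical helper: tracks color changes for one host/test pair."""

    def __init__(self, max_changes, window_seconds):
        self.max_changes = max_changes        # the "X times" threshold
        self.window_seconds = window_seconds  # the "Y seconds" window
        self.last_color = None
        self.change_times = deque()

    def update(self, color, now):
        """Record one status report; return True while the test is flapping."""
        if self.last_color is not None and color != self.last_color:
            self.change_times.append(now)
        self.last_color = color
        # Forget changes that have aged out of the window.
        while self.change_times and now - self.change_times[0] > self.window_seconds:
            self.change_times.popleft()
        return len(self.change_times) > self.max_changes


# Made-up sequence of (timestamp in seconds, color) reports.
detector = FlapDetector(max_changes=3, window_seconds=300)
reports = [(0, "green"), (60, "red"), (120, "green"),
           (180, "red"), (240, "green"), (900, "green")]
flags = [detector.update(color, t) for t, color in reports]
```

With these made-up numbers, the fifth report is the fourth change 
within 300 seconds, so that is where the test would be marked as 
flapping; once it has stayed green long enough for the old changes to 
age out, the flag drops again.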

*It would be nice if the depends tag worked for any column/test type*.
I looked at using the "depends" tag, but it appears that *depends only 
works for network checks*. In other words, I cannot do:
1.2.3.5 URL-SomeApp           # cont;http://someapp.com/monitor.php;SUCCESS depends=(http:prod-db1/procs,prod-db1/cpu)

If I have misunderstood about the depends tag, let me know, but it 
appears from the man page that it only works for network tests:
"The 'depends' tag is evaluated on the BBNET server while running the 
network tests. It can therefore only refer to other network tests that 
are handled by the same BBNET server - there is currently no way to use 
the e.g. the status of locally run tests (disk, cpu, msgs) or network 
tests from other BBNET servers in a dependency definition. Such 
dependencies are silently ignored."
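
For illustration, the behaviour I am asking for could be sketched 
roughly like this in Python. This is not how Hobbit implements depends; 
parse_depends, should_suppress, and the statuses dictionary are 
hypothetical names, and the status colors are made-up sample data. The 
point is only that the lookup would accept any host/column pair, not 
just network tests.

```python
import re


def parse_depends(tag):
    """Parse 'depends=(test:host/col,host/col),(...)' into {test: [(host, col), ...]}."""
    if not tag.startswith("depends="):
        raise ValueError("not a depends tag")
    result = {}
    for group in re.findall(r"\(([^)]*)\)", tag):
        test, _, targets = group.partition(":")
        result[test] = [tuple(t.split("/", 1)) for t in targets.split(",") if t]
    return result


def should_suppress(test, depends, statuses):
    """True if any column that `test` depends on is currently red."""
    return any(statuses.get(dep) == "red" for dep in depends.get(test, []))


# The tag from the bb-hosts line above, plus made-up current statuses
# that include client-side columns (procs, cpu).
deps = parse_depends("depends=(http:prod-db1/procs,prod-db1/cpu)")
statuses = {("prod-db1", "procs"): "green", ("prod-db1", "cpu"): "red"}
suppress = should_suppress("http", deps, statuses)
```

Here the cpu column on prod-db1 is red, so the dependent test would be 
held back instead of paging anyone.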