
depends tag questions, possible feature request



First, a quick summary of the pertinent part of my bb-hosts:

subpage PRODDB Prod DB
1.2.3.4 prod-db1 # ssh pulldata TRENDS:*,disk:disk|disk1,vmstat:vmstat1
#
subpage PRODURLS Prod URLS
1.2.3.5 URL-SomePortal # cont;http://username:password@vhost.com;SUCCESSFUL
1.2.3.5 URL-SomeApp # cont;http://someapp.com/monitor.php;SUCCESS
1.2.3.6 URL-SomeOtherApp # cont;http://someotherapp.com/monitor.php;SUCCESS


(In case you are wondering, I monitor the URLs this way because they are load balanced across many servers. I have other tests for the httpd processes on those specific servers, but the PRODURLS entries are for alerting when the external URLs are not responding.)

That being said, today we had a problem with prod-db1. Basically Oracle went nuts and the system load went to 110+, and as a result all of the PRODURLS alerts went off.

Now, no problem so far, since this is by design. The problem is that, as the DB was still able to handle a request here and there, the PRODURLS tests were "flapping" (changing status from red to green to red to green). So acks had no effect, and "Disable until OK" had no effect.

I was tasked with figuring out how to reduce the amount of pager spam the next time this happens. The obvious way is to just go in and disable the affected hosts/services for a specific time period, but this is easier said than done when the world is on fire, you are on a conference bridge, and people are standing around you waiting for things to be fixed. In short, the guys who were on call didn't have time to log into Hobbit and do the disables, and meanwhile their pagers were going nuts, which added to their frustration.

*It would be nice if Hobbit had "flap detection"*, where if a service changes state more than X times in Y minutes or seconds, it turns clear or blue (or maybe even a new color). I am reminded that Nagios has this feature, and Hobbit is totally better than Nagios, so we shouldn't have that feature missing, right? ;-)
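
To make the flap detection idea concrete, here is a rough sketch in Python. It is not Hobbit code and does not use any Hobbit API. The class name FlapDetector, the parameters max_changes and window_seconds, and the choice of "blue" as the damped color are all assumptions made for illustration.

```python
import time
from collections import deque


class FlapDetector:
    """Count color changes per (host, test) inside a sliding time window.

    Hypothetical helper for illustration only; not part of Hobbit.
    """

    def __init__(self, max_changes=4, window_seconds=600):
        self.max_changes = max_changes        # the "X times" threshold
        self.window_seconds = window_seconds  # the "Y minutes" window, in seconds
        self._last_color = {}                 # (host, test) -> last reported color
        self._changes = {}                    # (host, test) -> deque of change times

    def update(self, host, test, color, now=None):
        """Record a status report and return the color to display and alert on."""
        now = time.time() if now is None else now
        key = (host, test)
        changes = self._changes.setdefault(key, deque())

        # Only an actual change of color counts toward flapping.
        previous = self._last_color.get(key)
        if previous is not None and previous != color:
            changes.append(now)
        self._last_color[key] = color

        # Forget changes that have fallen out of the window.
        while changes and now - changes[0] > self.window_seconds:
            changes.popleft()

        # More than max_changes changes inside the window: damp the status.
        if len(changes) > self.max_changes:
            return "blue"
        return color
```

With max_changes=3 and window_seconds=300, a test that goes green, red, green, red, green at one-minute intervals is reported as blue on the fifth report (the fourth change), and goes back to its real color once the changes age out of the window.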

*It would be nice if the depends tag worked for any column/test type*.
I looked at using the "depends" tag, but it appears that *depends only works for network checks*. In other words, I cannot do:
1.2.3.5 URL-SomeApp # cont;http://someapp.com/monitor.php;SUCCESS depends=(http:prod-db1/procs,prod-db1/cpu)
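
To spell out what a line like that is asking for, here is a rough Python sketch of a dependency check that can look at any column. It is not Hobbit code. The function names parse_depends and effective_color, the statuses dictionary, and the choice of turning a red test "clear" when a dependency is red are assumptions for illustration.

```python
import re


def parse_depends(tag):
    """Parse 'depends=(test:host/col,host/col),(test2:...)' into a dict.

    Returns {test: [(host, column), ...]}. Hypothetical helper, not Hobbit's parser.
    """
    rules = {}
    for test, targets in re.findall(r"\(([^:()]+):([^()]*)\)", tag):
        rules[test] = [
            tuple(t.strip().split("/", 1))
            for t in targets.split(",")
            if "/" in t
        ]
    return rules


def effective_color(test, color, rules, statuses):
    """Return 'clear' instead of 'red' if any dependency of the test is red.

    statuses maps (host, column) to the current color of that column,
    whether it comes from a network test or a locally run test.
    """
    if color != "red":
        return color
    for host, column in rules.get(test, []):
        if statuses.get((host, column)) == "red":
            return "clear"
    return color
```

With the tag above and a statuses dictionary in which ("prod-db1", "cpu") is red, a red http test on URL-SomeApp would come back as clear, so it would not page anyone while the database host is the real problem.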


If I have misunderstood the depends tag, let me know, but it appears from the man page that it only works for network tests:
"The 'depends' tag is evaluated on the BBNET server while running the network tests. It can therefore only refer to other network tests that are handled by the same BBNET server - there is currently no way to use the e.g. the status of locally run tests (disk, cpu, msgs) or network tests from other BBNET servers in a dependency definition. Such dependencies are silently ignored."