[hobbit] grouping methods

Mon Jun 16 20:07:37 CEST 2008

Sloan [mailto:joe at tmsusa.com] wrote:

> We've not had a bb server go down in all the years we've been 
> using it, but sometimes wan connectivity goes away due to 
> circumstances beyond our control

This is by far the biggest annoyance we have with all system monitoring
- when networks go down.  It's a problem with every monitoring tool
there is and I can't think of any way to solve it: the monitoring system
has no way of knowing whether a system is down because it crashed or if
it's down because the network went down.  All it knows is that it can't
talk to the system anymore and something is wrong, so it generates an
alert.  When a whole network goes down, it can become hundreds of
simultaneous alerts.  And that's annoying enough when it's just email
alerts.  When you use Hobbit to generate cases in your trouble ticket
system, that can be hundreds of new, useless cases to manually close.

We don't want to raise the amount of time a system has to be down before
Hobbit generates an alert, because we want to know as soon as possible.
But if we keep that number too low, then when the network has a brief
hiccup, we get hundreds of redundant cases.  This is especially a
problem with overseas networks on the WAN.

I think the only possible solution would be for Hobbit to have some kind
of flood-detection routine built in, where it could tell how rapidly it
was sending alerts about connection problems for machines all on the
same network, and was smart enough to think "Whoa, I'm about to send 100
connection alarms about systems on the same network.... Instead of
sending 100 of them, maybe I'll just send ONE alert saying "You got a
big problem here."

Doug Linder