[hobbit] the all or nothing nature of hobbit
Buchan Milne
bgmilne at staff.telkomsa.net
Fri Dec 8 08:10:08 CET 2006
On Thursday 07 December 2006 23:05, Henrik Stoerner wrote:
> On Thu, Dec 07, 2006 at 11:59:30AM -0800, Dan Simoes wrote:
> > I love hobbit and have been using it (and BB) for many years, so take
> > this as constructive criticism.
> >
> > One of my biggest headaches with BB (and now hobbit) has been the
> > all-or-nothing nature of alerts.
> > By this I mean that if your main network link is down, everything goes
> > red for network status.
> >
> > Something happened on my monitoring box (probably DNS) that caused a
> > cascade of http errors. http was not truly down on all these N hosts on
> > various networks, it was the network test that was failing on the
> > monitoring box.
>
> It's a valid point - but it is also very, very difficult to handle. Not
> so much because it is difficult to suppress alerts; the $1bn question is
> how to decide when to suppress an alert, and which issue is the root
> cause of all the problems we're seeing.
>
> Heck, sometimes it can be difficult even for intelligent humans to
> figure out what is really going on ...
>
> I think what this really boils down to is some form of event correlation
> mechanism,
Event correlation seems to be the current buzzword from all the monitoring
tool vendors whose presentations I have seen recently ...
> on top of which you then apply some heuristics (that's a
> fancy word for "guessing") to decide what is the core issue. E.g. if we
> have 200 tests reporting a failure because of a DNS lookup that timed
> out, then we probably have an issue with the DNS server we used. But it
> could also be a firewall mis-configuration that blocks our outbound DNS
> queries, or an IP address conflict that causes our DNS lookups to go to
> a server which doesn't handle DNS - it is really hard for any machine to
> figure that out by itself.
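To make the "200 tests failing on the same DNS timeout" heuristic concrete, here is a rough sketch in Python. It is not how Hobbit works, and every name in it (correlate, min_share, the failure tuples) is made up for illustration. It assumes each failed test carries a free-text reason, and it collapses the failures into one "suspected common cause" alert only when a single reason accounts for a large share of all tests. Anything else still alerts individually, in line with preferring one alert too many over one too few.

```python
from collections import Counter


def correlate(failures, total_tests, min_share=0.5):
    """Collapse failures that share one reason into a single alert.

    failures    -- list of (host, test, reason) tuples for failing tests
                   (a hypothetical format, not Hobbit's own)
    total_tests -- number of tests run in this cycle
    min_share   -- assumed threshold: fraction of all tests that must fail
                   with the same reason before we guess at a common cause
    """
    if not failures:
        return []

    # Find the single most frequent failure reason.
    reason, count = Counter(r for _, _, r in failures).most_common(1)[0]

    alerts = []
    if count / total_tests >= min_share:
        # Heuristic guess: one shared cause, so send one alert for the group.
        alerts.append(
            f"SUSPECTED COMMON CAUSE: {count} tests failed with '{reason}'"
        )
        remaining = [f for f in failures if f[2] != reason]
    else:
        # Not enough evidence of a shared cause: alert on everything.
        remaining = failures

    # Unrelated failures are never suppressed.
    alerts.extend(f"{host}.{test}: {r}" for host, test, r in remaining)
    return alerts


if __name__ == "__main__":
    failing = [(f"host{i}", "http", "DNS lookup timed out") for i in range(200)]
    failing.append(("db1", "disk", "95% full"))
    for line in correlate(failing, total_tests=250):
        print(line)
```

Note that this only groups symptoms. It cannot tell a dead DNS server from a firewall blocking outbound queries or an IP address conflict, which is exactly the hard part described above.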
>
> The current implementation is not ideal, I'll be the first to admit
> that. Any ideas for improving it are welcome, but please consider the
> possibilities for the system making wrong decisions. I'd rather send out
> one alert too many than one too few.
>
> > I'm unaware of a solution to this issue, and I'm considering moving to
> > another product because of it.
>
> If you know of any products that are really good at handling this, I'd
> be interested to hear about them.
I can list some (proprietary ones) that are punting this, but I've never seen
them in action.
Regards,
Buchan
--
Buchan Milne
ISP Systems Specialist - Monitoring/Authentication Team Leader
B.Eng,RHCE(803004789010797),LPIC-2(LPI000074592)