[xymon] xymon-4.3.0-RC1: alerting question

Buchan Milne bgmilne at staff.telkomsa.net
Mon Feb 7 22:31:06 CET 2011


On Monday, 7 February 2011 16:37:14 Dominique Frise wrote:
> Hi Henrik,
> 
> Thanks for replying.
> 
> On 02/ 7/11 01:10 PM, Henrik Størner wrote:
> > In <4D4C0F83.8080204 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:
> >> What is the minimum time for the same alert status to stay up to be
> >> processed correctly by Xymon ?
> > 
> > I am not sure I understand the question - are you saying that
> > Xymon does not generate the notifications you expect it to ?
> 
> Sort of...
> 
> We have SNMP trap handling configured (thanks Andy Farrior)

It is an ugly hack, and we need a better solution. I didn't implement this one 
in my own environments, as I was not willing to settle for it (one issue being 
the chain of separate parts: snmptrapd -> snmptt -> sec -> perl script). But I 
haven't yet finished the better solution I had in mind: a perl 
NetSNMP::TrapReceiver running inside snmptrapd that does all of the tasks above.

> but are not
> completely happy with how it handles the alerting.
> When a bad trap from a given host is received, an alert status is
> generated for Xymon (yellow or red). So far, so good.

Actually, IMHO, no. The BB model works on monitoring a status, and generating 
an event when the status changes. The problem comes when you listen for events 
(traps), and the only way to handle them is to create a status, so you can 
generate an event.
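
To make the mismatch concrete, here is a rough Python sketch of the 
status-change model. It is only an illustration, not Xymon's actual code: the 
class, method and host names are all made up, and the real alert rules are 
more involved.

```python
# Hypothetical model of status-change alerting; NOT Xymon's real implementation.
ALERT_COLORS = {"yellow", "red"}


class StatusAlerter:
    """Emits a notification only when the color of a (host, test) changes."""

    def __init__(self):
        self.last_color = {}      # (host, test) -> last reported color
        self.notifications = []   # notifications generated so far

    def report(self, host, test, color, text):
        key = (host, test)
        previous = self.last_color.get(key, "green")
        self.last_color[key] = color
        if color == previous:
            return  # status unchanged: no event, whatever the text says
        if color in ALERT_COLORS:
            self.notifications.append(("alert", host, test, color, text))
        elif previous in ALERT_COLORS:
            self.notifications.append(("recovered", host, test, color, text))


alerter = StatusAlerter()
alerter.report("router1", "trap", "red", "linkDown on ge-0/0/1")
alerter.report("router1", "trap", "red", "fan failure")       # same color: no alert
alerter.report("router1", "trap", "green", "periodic reset")  # "recovered"
```

The second trap never produces a notification because the color did not 
change, and the "recovered" message comes from the periodic reset rather than 
from anything the device reported.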

I think event-based monitoring should not go via 'status' messages, but go 
into a separate channel, which handles events as events, and possibly alerts 
directly instead of via the status channel.

> Then, before this status's validity expires (before it turns purple), a
> periodic launch of a script will reset its color to green, thus
> generating a recovered message independently of the real status of the
> service reported by the trap. Furthermore, while a <host>.trap status
> is in alert state, other bad traps from the same host and of the same
> level will not generate any alerts (they are ignored).

This is a generic problem, and applies to some extent to other tests as well. 
Even if different types of traps were reported to different tests, there is 
still no component-level ack/alert/recover/disable etc. So, for example, if a 
non-critical filesystem goes yellow and this is ack'ed or disabled, and then a 
critical filesystem goes red, there will be no new notification and it won't 
appear on the critical systems view. In the same way, a trap for a non-critical 
router interface will be lumped together with a critical one.
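
The filesystem example can be sketched roughly as follows. This is a 
simplified model written for illustration only: it assumes a test's color is 
the worst color of its components and that an ack applies to the whole test, 
and the names (worst_color, DiskTest, the mount points) are invented.

```python
# Hypothetical sketch of test-level (not component-level) acknowledgement.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def worst_color(components):
    """Collapse per-component colors into the single color of the test."""
    return max(components.values(), key=SEVERITY.__getitem__, default="green")


class DiskTest:
    """One status per host: the ack flag covers every filesystem at once."""

    def __init__(self):
        self.acked = False

    def should_notify(self, components):
        return worst_color(components) != "green" and not self.acked


disk = DiskTest()
# A non-critical filesystem goes yellow: a notification is sent.
first = disk.should_notify({"/scratch": "yellow", "/var": "green"})
disk.acked = True  # the operator acks the yellow status
# A critical filesystem then goes red, but the ack hides it.
second = disk.should_notify({"/scratch": "yellow", "/var": "red"})
```

Here first is True and second is False: the ack given for /scratch also 
silences the later problem on /var, because there is only one status to ack.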

> Here follows a description of what we are trying to implement in order to
> improve this handling:
> 
> ****
> 1. a bad <host>trap is detected.
> 2. generate a yellow/red <host>.trap status for Xymon.
> 3. after a short delay (ideally 1 sec.), generate a clear <host>.trap
> status for Xymon.

So the status page for the host is useless; the only thing you get is 
alerting. It would be much better (IMHO) to go:

1) snmptrapd running NetSNMP::TrapReceiver, which does MIB parsing etc., prunes 
duplicate traps itself, stores some trap details, and sends an 'event' 
message to hobbitd.
2) A hobbit worker listening on the event channel and deciding when to send 
page or ack messages to hobbitd for hobbitd_alert to act on. In some cases, it 
might be desirable for it to do something besides alert (e.g. trigger a 
configuration update for a network device on a device configuration save trap).
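
As a rough sketch of the logic in those two steps (in Python rather than 
perl, and with no real snmptrapd or hobbitd integration): everything below is 
hypothetical, including the TrapEvent fields, the 60-second window, the action 
names and the EXAMPLE-MIB::configSaved identifier. No 'event' channel like 
this exists today.

```python
# Hypothetical sketch of duplicate pruning plus an event-channel worker.
import time
from dataclasses import dataclass

CONFIG_SAVED_TRAP = "EXAMPLE-MIB::configSaved"  # made-up trap name


@dataclass(frozen=True)
class TrapEvent:
    host: str
    trap: str       # symbolic trap name after MIB parsing
    severity: str   # "green", "yellow" or "red"
    text: str


class TrapDeduplicator:
    """Step 1: drop a trap seen from the same host within the window."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self.last_seen = {}  # (host, trap) -> time the trap was last seen

    def accept(self, event):
        key = (event.host, event.trap)
        now = self.clock()
        previous = self.last_seen.get(key)
        # Updated on every trap, so a steady stream of duplicates stays
        # suppressed until the source has been quiet for a full window.
        self.last_seen[key] = now
        return previous is None or now - previous >= self.window


def decide_action(event):
    """Step 2: the worker decides what to do with an accepted event."""
    if event.trap == CONFIG_SAVED_TRAP:
        return "update-config"   # do something other than alert
    if event.severity in ("yellow", "red"):
        return "page"
    return "ignore"


dedup = TrapDeduplicator()
event = TrapEvent("router1", "IF-MIB::linkDown", "red", "ge-0/0/1 down")
if dedup.accept(event):
    action = decide_action(event)
```

Each trap is handled as an event in its own right, so a second, different 
trap from the same host is not hidden behind an existing status.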

> All trap statuses except those in alert state are periodically set to clear.
> The red/yellow -> clear transition should not generate a recovered
> message. This should be achieved by removing "clear" from "OKCOLORS" in
> xymonserver.cfg but this does not work without modifying xymond_alert.c.
> A good <host>.trap should generate a green message and thus a recovered
> message.

This is mostly just going to result in disk churn that you don't even want to 
look at, just to send some mails. If you didn't have Xymon in the picture, 
snmptrapd and traptoemail would do most of what you get ...

> We know that a 100% handling of traps in Xymon is not possible because
> we are misusing a single status (trap) to report many others, but this
> scenario would allow:
> 
> - a better alerting of all bad traps from the same host and of same level.

Well, it is slightly better, but I don't see how traps for different reasons 
in different orders are going to be handled well.

> - the recovered status is a real recover (the text of the trap explains
> what recovered)

This is about the only advantage, and I think there is more that could be 
improved with fewer disadvantages.

Regards,
Buchan


