[xymon] xymon-4.3.0-RC1: alerting question

Tue Feb 8 11:02:35 CET 2011

Hi Buchan,

On 02/ 7/11 10:31 PM, Buchan Milne wrote:
> On Monday, 7 February 2011 16:37:14 Dominique Frise wrote:
>> Hi Henrik,
>>
>> Thanks for replying.
>>
>> On 02/ 7/11 01:10 PM, Henrik Størner wrote:
>>> In <4D4C0F83.8080204 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:
>>>> What is the minimum time for the same alert status to stay up to be
>>>> processed correctly by Xymon ?
>>>
>>> I am not sure I understand the question - are you saying that
>>> Xymon does not generate the notifications you expect it to ?
>>
>> Sort of...
>>
>> We have SNMP trap handling configured (thanks Andy Farrior)
>
> It is an ugly hack. We need a better solution. I didn't implement this one for
> my own environments, as I was not willing to settle for it (one issue being
> the multiple parts, snmptrapd->snmptt->sec->perl script), but I haven't
> finished the work I wanted to do (a perl NetSNMP::TrapReceiver running in
> snmptrapd that does all the tasks above) to have a better solution.
>
Well, Andy's work is advertised as "A very elegant method of feeding
traps into Xymon" ;-)
(http://www.xymon.com/xymon/help/xymon-tips.html#snmptraps)
This is also the kind of approach that is used for Nagios, but there
alerting is better supported by the "volatile" service
(http://nagios.sourceforge.net/docs/2_0/volatileservices.html).

>> but are not
>> completely happy with how it handles the alerting.
>> When a bad trap from a given host is received, an alert status is
>> generated for Xymon (yellow or red). So far, so good.
>
> Actually, IMHO, no. The BB model works on monitoring a status, and generating
> an event when the status changes. The problem comes when you listen for events
> (traps), and the only way to handle them is to create a status, so you can
> generate an event.
>
> I think event-based monitoring should not go via 'status' messages, but go
> into a separate channel, which handles events as events, and possibly alerts
> directly instead of via the status channel.
>
Agree

>> Then, before this status's validity expires (before it turns purple), a
>> periodic launch of a script will reset its color to green, thus
>> generating a recovered message independently of the real status of the
>> service reported by the trap. Furthermore, while a <host>.trap status
>> is in alert state, other bad traps from the same host and of the same
>> level will not generate any alerts (they are ignored).
>
> This is a generic problem, and applies to some extent to other tests as well.
> Even if different types of traps were reported to different tests, there is
> the issue of no component-level ack/alert/recover/disable etc. So, for
> example, if a non-critical filesystem goes yellow, and this is ack'ed or
> disabled, then a critical filesystem goes red, there will be no new
> notification, it won't appear on the critical systems view, just as a trap for
> a non-critical router interface will be lumped together with a critical one.
>
Not trivial to solve
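
To make the problem concrete, here is a rough Python sketch. It is not Xymon code: the function names and filesystem names are made up for illustration, and it assumes each test keeps one color per component with acks stored per component rather than per test.

```python
# Hypothetical model: one color per component inside a single test.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def worst_color(components):
    """Color the test as a whole would show: the worst component color."""
    return max(components.values(), key=SEVERITY.__getitem__, default="green")


def unacked_problems(components, acked):
    """Names of non-green components that nobody has acked or disabled."""
    return sorted(
        name
        for name, color in components.items()
        if color != "green" and name not in acked
    )


disk = {"/scratch": "yellow", "/var": "red", "/home": "green"}
acked = {"/scratch"}  # only the non-critical filesystem was acked

print(worst_color(disk))              # red
print(unacked_problems(disk, acked))  # ['/var']
```

With a single test-level ack the whole disk status would stay silent; with per-component state, /var still shows up as needing a notification.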

>> Here follows a description of what we are trying to implement in order to
>> improve this handling:
>>
>> ****
>> 1. a bad <host> trap is detected.
>> 2. generate a yellow/red <host>.trap status for Xymon.
>> 3. after a short delay (ideally 1 sec.), generate a clear <host>.trap
>> status for Xymon.
>
> So, the status page for the host is useless, the only thing you get is
> alerting, it would be much better (IMHO) to go:
>
> 1) snmptrapd running NetSNMP::TrapReceiver which does MIB parsing etc., pruning
> of duplicate traps itself, storing some trap details, and sends an 'event'
> message to hobbitd.
> 2) A hobbit worker listening on the event channel and deciding when to send
> page or ack messages to hobbitd for hobbitd_alert to act on. In some cases, it
> might be desirable for it to do something besides alert (e.g. trigger a
> configuration update for a network device on a device configuration save trap)
>
Solid concept indeed
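
Just to check that I read the idea right, here is a minimal Python sketch of the pruning and dispatch part. The real thing would presumably be Perl inside snmptrapd. The class, the rule table, the trap names and the action names below are placeholders invented for illustration, and nothing here talks to hobbitd.

```python
class TrapDeduplicator:
    """Drop repeats of the same (host, trap) seen within a time window."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self._last_accepted = {}  # (host, trap) -> time of last accepted trap

    def is_new(self, host, trap, now):
        key = (host, trap)
        last = self._last_accepted.get(key)
        if last is not None and now - last < self.window:
            return False  # duplicate inside the window: prune it
        self._last_accepted[key] = now
        return True


# Hypothetical rules: what the event worker does for each kind of trap.
RULES = {
    "linkDown": "page",
    "configSaved": "update-config",  # an action other than alerting
}


def handle_trap(dedup, rules, host, trap, now):
    """Return the action for this trap, or None if pruned or unmatched."""
    if not dedup.is_new(host, trap, now):
        return None
    return rules.get(trap)


dedup = TrapDeduplicator(window_seconds=60.0)
print(handle_trap(dedup, RULES, "r1", "linkDown", now=0.0))
print(handle_trap(dedup, RULES, "r1", "linkDown", now=10.0))
print(handle_trap(dedup, RULES, "r1", "configSaved", now=11.0))
```

The point is that the decision is taken per event, so a second kind of trap from the same host is not hidden behind the first one.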

>> All traps status except those in alert state are periodically set to clear.
>> The red/yellow -> clear transition should not generate a recovered
>> message. This should be achieved by removing "clear" from "OKCOLORS" in
>> xymonserver.cfg but this does not work without modifying xymond_alert.c.
>> A good <host>.trap should generate a green message and thus a recovered
>> message.
>
> This is mostly just going to result in disk churn that you don't even want to
> look at, just to send some mails. If you didn't have Xymon in the picture,
> snmptrapd and traptoemail would do most of what you get ...
>
The database history fed by snmptt is quite useful too
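
For reference, the message sequence in the scenario quoted above could be sketched like this in Python. It only builds the status strings; delivering them to xymond and waiting between the two messages are left out. The helper names and the "critical"/"warning"/"ok" levels are assumptions for illustration.

```python
def status_message(host, test, color, text):
    """Build a Xymon-style status line: 'status HOST.TEST COLOR text'."""
    # Host and test are separated by a dot, so dots inside the host name
    # are written as commas.
    return f"status {host.replace('.', ',')}.{test} {color} {text}"


def messages_for_trap(host, level, text):
    """Messages to send for one trap, in order (levels are hypothetical)."""
    if level == "ok":
        # A good trap goes green, which should produce a real recovery.
        return [status_message(host, "trap", "green", text)]
    color = "red" if level == "critical" else "yellow"
    return [
        status_message(host, "trap", color, text),            # step 2: alert
        status_message(host, "trap", "clear", "trap reset"),  # step 3: after the delay
    ]


for line in messages_for_trap("sw1.example.com", "critical", "linkDown on port 3"):
    print(line)
```

Because every bad trap goes through the alert color again after a clear, a second bad trap of the same level is no longer swallowed by an already-red status.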

>> We know that a 100% handling of traps in Xymon is not possible because
>> we are misusing a single status (trap) to report many others, but this
>> scenario would allow:
>>
>> - a better alerting of all bad traps from the same host and of same level.
>
> Well, it is slightly better, but I don't see how traps for different reasons
> in different orders are going to be handled well.
>
Not covered at all :-(

>> - the recovered status is a real recover (the text of the trap explains
>> what recovered)
>
> This is about the only advantage, and I think there is more that could be
> improved with fewer disadvantages.
>
Eager to test your solution...
Don't forget to drop us a mail when it's ready for testing!

Regards,
Dominique

> Regards,
> Buchan