[xymon] xymon_4.3.0-RC1: possible lost alerts

Henrik Størner henrik at hswn.dk
Mon Feb 14 13:46:30 CET 2011


In <4D59102A.2000507 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:

>On 02/14/11 11:00 AM, Henrik Størner wrote:
>> In <4D556C14.5060207 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:
>>
>>> I think I found a bug in xymond_alert.c.
>>
>>> Let's say there is a page msg for hostA.serviceA and this alert will not
>>> be processed immediately because of this part of code:
>>
>>>     /*
>>>      * When a burst of alerts happen, we get lots of alert messages
>>>      * coming in quickly. So lets handle them in bunches and only
>>>      * do the full alert handling once every 10 secs - that lets us
>>>      * combine a bunch of alerts into one transmission process.
>>>      */
>>>     if (nowtimer < (lastxmit+10)) continue;
>>>     lastxmit = nowtimer;
>>
>>
>>> The main loop will then wait for a new msg from xymond (Want msg<num>,
>>> startpos... etc).
>>
>>> Now if the next msg is a page recovery from the same hostA.serviceA,
>>> the next processing of the active alerts (for loop) will then cleanup
>>> the alert for hostA.serviceA without sending any alert.
>>
>> I haven't tested your diagnosis, but it is probably correct
>> (from how I remember that this code works).
>>
>> But is it a problem ?
>>
>> If you get an alert that clears a few seconds later (that is why there
>> is a recovery message), then what is the point of sending an alert ?
>> The notification would be for data that is no longer valid, and
>> personally I would rather NOT be alerted at 3 AM if the problem no
>> longer exists.
>>
>> So I am tempted to invoke the old "this is not a bug, it's a feature!"
>> meme :-)
>>

>I think the problem is rather that the behaviour is not deterministic.
>Some alert/recovered transitions will get through (if the alert goes 
>into the alerts loop processing without waiting) or can get lost (if 
>alert and recovery are processed in the same loop).

But it is "deterministic enough" that you will either get both of
them (alert + recovery), or neither. You will not get an alert
and then lose the recovery-message, or get a recovery-message
without the alert having been sent.
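
To make the "both or neither" behaviour concrete, here is a rough
stand-alone sketch of the throttled loop for a single host.service.
It is not the code from xymond_alert.c: the names struct sim,
on_message() and the MSG_* values are invented for this illustration,
only nowtimer, lastxmit and the 10-second window come from the
snippet quoted above, and repeat alerts are ignored entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one host.service entry; not the real Xymon types. */
enum alert_state { IDLE, PAGING, RECOVERED };
enum msg_type { MSG_PAGE, MSG_RECOVERY, MSG_TICK /* timeout, no status change */ };

struct sim {
    enum alert_state state;
    int notified;         /* 1 once an alert has actually been sent */
    long lastxmit;        /* time of the last full alert pass */
    int alerts_sent;
    int recoveries_sent;
};

/* One iteration of the (simplified) main loop at time nowtimer. */
void on_message(struct sim *s, long nowtimer, enum msg_type msg)
{
    assert(s != NULL);

    /* Step 1: only record the new status, nothing is sent here. */
    if (msg == MSG_PAGE)
        s->state = PAGING;
    else if (msg == MSG_RECOVERY && s->state == PAGING)
        s->state = RECOVERED;

    /* Step 2: the throttle from the quoted snippet. */
    if (nowtimer < (s->lastxmit + 10)) return;
    s->lastxmit = nowtimer;

    /* Step 3: full pass over the active alert. */
    if (s->state == PAGING && !s->notified) {
        s->alerts_sent++;
        s->notified = 1;
    } else if (s->state == RECOVERED) {
        /* A recovery is only announced if the alert itself went out. */
        if (s->notified) s->recoveries_sent++;
        s->state = IDLE;
        s->notified = 0;
    }
}
```

With lastxmit at 100, a page at 103 followed by a recovery at 105 are
both recorded inside the throttle window, so the pass at 111 cleans up
the entry without sending anything. A page at 111 goes straight
through the full pass, the alert is sent, and the recovery at 113 is
then announced by the next pass. In this model a recovery is never
sent without its alert, which is the property described above.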


Regards,
Henrik