[xymon] xymon_4.3.0-RC1: possible lost alerts

Dominique Frise dominique.frise at unil.ch
Mon Feb 14 14:38:08 CET 2011


On 02/14/11 01:46 PM, Henrik Størner wrote:
> In <4D59102A.2000507 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:
>
>> On 02/14/11 11:00 AM, Henrik Størner wrote:
>>> In <4D556C14.5060207 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:
>>>
>>>> I think I found a bug in xymond_alert.c.
>>>
>>>> Let's say there is a page msg for hostA.serviceA and this alert will not
>>>> be processed immediately because of this part of the code:
>>>
>>>>     /*
>>>>      * When a burst of alerts happen, we get lots of alert messages
>>>>      * coming in quickly. So lets handle them in bunches and only
>>>>      * do the full alert handling once every 10 secs - that lets us
>>>>      * combine a bunch of alerts into one transmission process.
>>>>      */
>>>>     if (nowtimer < (lastxmit+10)) continue;
>>>>     lastxmit = nowtimer;
>>>
>>>
>>>> The main loop will then wait for a new msg from xymond (Want msg<num>,
>>>> startpos... etc).
>>>
>>>> Now if the next msg is a page recovery from the same hostA.serviceA,
>>>> the next processing of the active alerts (for loop) will then clean up
>>>> the alert for hostA.serviceA without sending any alert.
>>>
>>> I haven't tested your diagnosis, but it is probably correct
>>> (from how I remember that this code works).
>>>
>>> But is it a problem ?
>>>
>>> If you get an alert that clears a few seconds later (that is why there
>>> is a recovery message), then what is the point of sending an alert ?
>>> The notification would be for data that is no longer valid, and
>>> personally I would rather NOT be alerted at 3 AM if the problem no
>>> longer exists.
>>>
>>> So I am tempted to invoke the old "this is not a bug, it's a feature!"
>>> meme :-)
>>>
>
>> I think the problem is rather that the behaviour is not deterministic.
>> Some alert/recovered transitions will get through (if the alert goes
>> into the alerts loop processing without waiting) or can get lost (if
>> alert and recovery are processed in the same loop).
>
> But it is "deterministic enough" that you will either get both of
> them (alert + recovery), or neither. You will not get an alert
> and then lose the recovery-message, or get a recovery-message
> without the alert having been sent.
>
>
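
To make the two cases concrete, here is a minimal sketch of the batching
behaviour as I understand it. It is a simplified model, not the actual
xymond_alert.c logic: the names replay, msg_kind, MSG_TICK, alert_state
and XMIT_INTERVAL are hypothetical, it tracks a single host.service, and
it assumes that a recovery is only sent when an alert went out before.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message kinds, not the real xymond channel messages. */
enum msg_kind { MSG_PAGE, MSG_RECOVERED, MSG_TICK };

struct msg {
    long when;            /* arrival time in seconds */
    enum msg_kind kind;   /* MSG_TICK: anything else that wakes the main loop */
};

enum alert_state { ST_NONE, ST_PAGING, ST_RECOVERED };

#define XMIT_INTERVAL 10  /* the 10-second window from the excerpt above */

/*
 * Replay messages for one host.service and return how many notifications
 * (alerts plus recoveries) would go out in this simplified model.
 */
int replay(const struct msg *msgs, size_t count, long lastxmit)
{
    enum alert_state state = ST_NONE;
    int alerted = 0;   /* has an alert actually been sent? */
    int sent = 0;

    assert(msgs != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        long nowtimer = msgs[i].when;

        /* Record the new state as soon as the message arrives. */
        if (msgs[i].kind == MSG_PAGE)
            state = ST_PAGING;
        else if (msgs[i].kind == MSG_RECOVERED && state != ST_NONE)
            state = ST_RECOVERED;

        /* Full alert handling happens at most once per window. */
        if (nowtimer < (lastxmit + XMIT_INTERVAL)) continue;
        lastxmit = nowtimer;

        if (state == ST_PAGING && !alerted) {
            sent++;                /* the alert goes out */
            alerted = 1;
        } else if (state == ST_RECOVERED) {
            if (alerted) sent++;   /* recovery only follows a sent alert */
            state = ST_NONE;       /* clean up the alert record */
            alerted = 0;
        }
    }
    return sent;
}
```

Starting with lastxmit = 0, a page at t=2, a recovery at t=5 and a wake-up
at t=12 give 0 notifications in this model, while a page at t=12, a
recovery at t=15 and a wake-up at t=25 give 2. The same alert/recovery
pair is either fully delivered or fully dropped, depending only on where
it falls relative to the 10-second window.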

This leads me to another question that never got answered:
what is supposed to happen if you remove the "clear" color from OKCOLORS
in xymonserver.cfg?
We would expect that no recovery message is sent when a status
goes from yellow/red to clear. Only the repeat interval should be reset.
Does this make sense?
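
A minimal sketch of the behaviour we would expect, to make the question
precise. This is not how xymond_alert.c is known to work: classify,
color_in_list and the action names are hypothetical, and the color lists
are assumed to be comma-separated strings.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical outcomes for a status change. */
enum action { ACT_ALERT, ACT_RECOVER, ACT_RESET_REPEAT };

/* Return 1 if color is a whole token in a comma-separated list. */
static int color_in_list(const char *color, const char *list)
{
    assert(color != NULL && list != NULL);

    size_t len = strlen(color);
    const char *p = list;

    while (*p) {
        const char *end = strchr(p, ',');
        size_t toklen = end ? (size_t)(end - p) : strlen(p);

        if (toklen == len && strncmp(p, color, len) == 0) return 1;
        if (!end) break;
        p = end + 1;
    }
    return 0;
}

/*
 * Expected handling: a color in the alert list pages, a color in the OK
 * list sends a recovery, and any other color (such as "clear" once it is
 * removed from the OK list) only resets the repeat interval.
 */
enum action classify(const char *color, const char *alertcolors,
                     const char *okcolors)
{
    if (color_in_list(color, alertcolors)) return ACT_ALERT;
    if (color_in_list(color, okcolors)) return ACT_RECOVER;
    return ACT_RESET_REPEAT;
}
```

With an OK list of "green,blue", classify("clear", ...) would return
ACT_RESET_REPEAT. With "green,blue,clear" it would return ACT_RECOVER.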

Dominique


