[xymon] xymon_4.3.0-RC1: possible lost alerts

Dominique Frise dominique.frise at unil.ch
Mon Feb 14 12:21:14 CET 2011


On 02/14/11 11:00 AM, Henrik Størner wrote:
> In <4D556C14.5060207 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:
>
>> I think I found a bug in xymond_alert.c.
>
>> Let's say there is a page msg for hostA.serviceA and this alert will not
>> be processed immediately because of this part of code:
>
>>     /*
>>      * When a burst of alerts happen, we get lots of alert messages
>>      * coming in quickly. So lets handle them in bunches and only
>>      * do the full alert handling once every 10 secs - that lets us
>>      * combine a bunch of alerts into one transmission process.
>>      */
>>     if (nowtimer < (lastxmit+10)) continue;
>>     lastxmit = nowtimer;
>
>
>> The main loop will then wait for a new msg from xymond (Want msg <num>,
>> startpos... etc).
>
>> Now if the next msg is a page recovery for the same hostA.serviceA,
>> the next processing of the active alerts (for loop) will then clean up
>> the alert for hostA.serviceA without sending any alert.
>
> I haven't tested your diagnosis, but it is probably correct
> (from how I remember that this code works).
>
> But is it a problem?
>
> If you get an alert that clears a few seconds later (that is why there
> is a recovery message), then what is the point of sending an alert?
> The notification would be for data that is no longer valid, and
> personally I would rather NOT be alerted at 3 AM if the problem no
> longer exists.
>
> So I am tempted to invoke the old "this is not a bug, it's a feature!"
> meme :-)
>

I think the problem is rather that the behaviour is not deterministic.
The same alert/recovery transition will sometimes get through (if the
alert reaches the alert processing loop without having to wait) and
sometimes get lost (if the alert and its recovery are processed in the
same pass of the loop).
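
To illustrate, here is a minimal sketch of the timing issue. It is not the
real xymond_alert.c code: the struct, the enum and the function
count_page_notifications() are hypothetical names made up for this example,
and it assumes the loop only runs when a message arrives, as described
above. Only the throttle test is taken from the quoted code.

```c
#include <stddef.h>

/* Hypothetical, simplified model; NOT the real xymond_alert.c structures. */
enum msg_type { MSG_PAGE, MSG_RECOVERED };

struct msg {
    long when;            /* arrival time in seconds */
    enum msg_type type;
};

/*
 * Replays the messages for a single host.service and returns how many
 * page notifications would be sent. "lastxmit" is the time of the last
 * full alert-handling pass before the first message arrives.
 */
int count_page_notifications(const struct msg *msgs, size_t n, long lastxmit)
{
    int paging = 0;   /* an alert is active and not yet notified */
    int sent = 0;

    for (size_t i = 0; i < n; i++) {
        long nowtimer = msgs[i].when;

        /* Update the active-alert state from the incoming message. */
        if (msgs[i].type == MSG_PAGE) paging = 1;
        else paging = 0;  /* recovery: the alert is cleaned up */

        /* Same throttle as in the quoted code. */
        if (nowtimer < (lastxmit + 10)) continue;
        lastxmit = nowtimer;

        /* Full alert handling, only reached when not throttled. */
        if (paging) { sent++; paging = 0; }
    }
    return sent;
}
```

With a page at t=12 and a recovery at t=15, this model sends nothing when
the previous pass ran at lastxmit=10 (both messages fall inside the 10 second
window), but sends one page when lastxmit=0. So whether the alert goes out
depends only on when the previous pass happened to run.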

Dominique


