[xymon] xymon_4.3.0-RC1: possible lost alerts
Henrik Størner
henrik at hswn.dk
Mon Feb 14 13:46:30 CET 2011
In <4D59102A.2000507 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:
>On 02/14/11 11:00 AM, Henrik Størner wrote:
>> In <4D556C14.5060207 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:
>>
>>> I think I found a bug in xymond_alert.c.
>>
>>> Let's say there is a page msg for hostA.serviceA and this alert will not
>>> be processed immediately because of this part of code:
>>
>>> /*
>>>  * When a burst of alerts happen, we get lots of alert messages
>>>  * coming in quickly. So lets handle them in bunches and only
>>>  * do the full alert handling once every 10 secs - that lets us
>>>  * combine a bunch of alerts into one transmission process.
>>>  */
>>> if (nowtimer < (lastxmit+10)) continue;
>>> lastxmit = nowtimer;
>>
>>
>>> The main loop will then wait for a new msg from xymond (Want msg<num>,
>>> startpos... etc).
>>
>>> Now if the next msg is a page recovery from the same hostA.serviceA,
>>> the next processing of the active alerts (for loop) will then cleanup
>>> the alert for hostA.serviceA without sending any alert.
>>
>> I haven't tested your diagnosis, but it is probably correct
>> (from how I remember that this code works).
>>
>> But is it a problem ?
>>
>> If you get an alert that clears a few seconds later (that is why there
>> is a recovery message), then what is the point of sending an alert ?
>> The notification would be for data that is no longer valid, and
>> personally I would rather NOT be alerted at 3 AM if the problem no
>> longer exists.
>>
>> So I am tempted to invoke the old "this is not a bug, it's a feature!"
>> meme :-)
>>
>I think the problem is rather that the behaviour is not deterministic.
>Some alert/recovered transitions will get through (if the alert goes
>into the alerts loop processing without waiting) or can get lost (if
>alert and recovery are processed in the same loop).
But it is "deterministic enough" that you will either get both of
them (alert + recovery), or neither. You will not get an alert
and then lose the recovery-message, or get a recovery-message
without the alert having been sent.
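
To illustrate the "both or neither" behaviour, here is a minimal sketch in C. It is not the actual xymond_alert.c code: the types event_t and result_t, the function replay() and the four-state model are all hypothetical, made up for this example. The only thing taken from the real code is the 10-second gate quoted above. The sketch also assumes that a later handling pass always happens eventually, which it models as one final pass after the last event.

```c
#include <stddef.h>

/* Hypothetical types for illustration; not the real xymond structures. */
typedef enum { EV_ALERT, EV_RECOVERY } evtype_t;

typedef struct {
    long when;       /* arrival time of the message, in seconds */
    evtype_t type;
} event_t;

typedef struct {
    int alerts_sent;
    int recoveries_sent;
} result_t;

/* State of one host.service pair in this simplified model. */
typedef enum { ST_IDLE, ST_QUEUED, ST_SENT, ST_RECOVERED } state_t;

#define XMIT_INTERVAL 10   /* same 10-second gate as in the quoted code */

/* One full alert-handling pass: send whatever is waiting. */
static void handle(state_t *st, result_t *res)
{
    if (*st == ST_QUEUED) {
        res->alerts_sent++;       /* alert notification goes out */
        *st = ST_SENT;
    } else if (*st == ST_RECOVERED) {
        res->recoveries_sent++;   /* recovery notification goes out */
        *st = ST_IDLE;
    }
}

/*
 * Replay n messages for a single host.service. Full handling only runs
 * when at least XMIT_INTERVAL seconds have passed since lastxmit.
 */
result_t replay(const event_t *ev, size_t n, long lastxmit)
{
    result_t res = { 0, 0 };
    state_t st = ST_IDLE;
    size_t i;

    for (i = 0; i < n; i++) {
        long nowtimer = ev[i].when;

        if (ev[i].type == EV_ALERT) {
            if (st == ST_IDLE) st = ST_QUEUED;
        } else {
            /* Recovery before the alert was sent: drop both silently. */
            if (st == ST_QUEUED) st = ST_IDLE;
            else if (st == ST_SENT) st = ST_RECOVERED;
        }

        if (nowtimer < (lastxmit + XMIT_INTERVAL)) continue;
        lastxmit = nowtimer;
        handle(&st, &res);
    }

    /* Assumption: a later pass always runs; model it as one final pass. */
    handle(&st, &res);
    return res;
}
```

With lastxmit = 0, an alert at t=3 followed by a recovery at t=6 sends nothing at all, while an alert at t=12 followed by a recovery at t=15 sends one alert and one recovery. In this model a completed alert/recovery pair never ends with the two counters differing.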
Regards,
Henrik