[hobbit] RECOVERED alerts red->yellow
Alan Sparks
asparks at doublesparks.net
Wed Jul 23 23:39:38 CEST 2008
Anyone have any other ideas how to fix this bug? Thanks...
-Alan
Alan Sparks wrote:
> After a day of running in trace and debug modes on the alerts module,
> I think I understand how this is broken. But I'm unsure anything but
> hacking the code can fix the issue. It appears to be an unfortunate
> interaction between some of the features, including the "flap
> detection" stuff.
>
> So: If I have the rule:
> MAIL me at whereever.com TEST=disk COLOR=RED RECOVERED
> and ALERTCOLORS="red,yellow,purple"
>
> The traces show Hobbit going through the following "thought process":
> * Say the disk goes yellow. That's in Hobbit's alert color list, so
> it triggers alert processing. But, no rule matches that color, so no
> alert is sent.
> * Say the disk now goes red. Now, Hobbit sees that as a transition
> from an alert state to another alert state. Normally, it would
> suppress this, but there is logic to special-case going red, and the
> alert processing is triggered. This time, a rule matches, and an
> alert is sent.
> * Say now the disk goes yellow. This is seen by Hobbit as a
> transition from an alert state to another alert state (due to both
> colors in ALERTCOLORS). No alert processing is done -- it is
> suppressed since it is NOT a recovery (it's flapping between two alert
> states). BUT, Hobbit now remembers the current color (alert state) as
> yellow.
> * Finally, the disk goes green. This is a recovery, since it is a
> transition from the ALERTCOLORS to the OKCOLORS. And, this triggers
> alert rule processing. HOWEVER, now, the alert code scans for a rule
> for the last state of the alert -- yellow. And, of course, no such
> rule exists, and the rule that would trigger the recovery page is not
> used, and no recovery page is sent.
>
> The RECOVERED keyword is only a flag on the rule that says if you
> match this rule during recovery processing, this recip does want a
> recovery page. But, Hobbit keeps no memory about which rule triggered
> an alert, it seems. It has to go back through the ruleset during
> recovery processing to find a rule to use. And because the colors
> change, no such rule can exist.
>
> So I think you can call it a bug, or an unfortunate side effect of
> adding yellow to the ALERTCOLORS list. If you do, you'll compromise
> your recovery paging. If you don't, you can't send alerts on warning
> (yellow) conditions. Short of changing the code to eliminate the
> alert state suppression (i.e., flap detection), I'm not certain how
> this can be fixed or worked around.
> -Alan
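
To make the sequence quoted above easier to reproduce, here is a rough Python model of the behaviour as I read it from the traces. It is not Hobbit's code: the names (RULES, matching_rules, simulate) and the data layout are made up for illustration, and the only logic modelled is what is described above (alert-to-alert changes are suppressed except when going red, and recovery looks up rules by the last remembered colour).

```python
# Hypothetical model of the behaviour described above; not Hobbit source code.
ALERTCOLORS = {"red", "yellow", "purple"}

# One rule, mirroring: MAIL <recipient> TEST=disk COLOR=RED RECOVERED
RULES = [{"recipient": "me", "test": "disk", "color": "red", "recovered": True}]


def matching_rules(test, color):
    """Return the rules that match this test name and colour."""
    return [r for r in RULES if r["test"] == test and r["color"] == color]


def simulate(test, colors):
    """Feed a sequence of colours and return the (color, kind, recipient) messages sent."""
    sent = []
    last_alert_color = None  # colour remembered for the currently active alert
    for color in colors:
        if color in ALERTCOLORS:
            entering_alert = last_alert_color is None
            going_red = color == "red" and last_alert_color != "red"
            if entering_alert or going_red:
                # Rule scan happens only here; other alert-to-alert
                # transitions are suppressed.
                for rule in matching_rules(test, color):
                    sent.append((color, "alert", rule["recipient"]))
            # The remembered colour is updated even when nothing was sent.
            last_alert_color = color
        elif last_alert_color is not None:
            # Recovery: rules are looked up by the LAST alert colour only.
            for rule in matching_rules(test, last_alert_color):
                if rule["recovered"]:
                    sent.append((color, "recovery", rule["recipient"]))
            last_alert_color = None
    return sent


if __name__ == "__main__":
    print(simulate("disk", ["yellow", "red", "yellow", "green"]))
    print(simulate("disk", ["red", "green"]))
```

In this model the yellow, red, yellow, green sequence produces the red alert but no recovery message, while a direct red to green change produces both.
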
>
>
> Mark Hinkle wrote:
>> Yes, I see the same thing as Alan and maybe that is why his
>> description makes sense to me.
>>
>> The real questions are: what triggers a recovery message to be sent
>> and who gets them? Is it when a test goes from any color to green? Or
>> is it any "down-grade" in alert state (i.e. red->yellow, or
>> yellow->green)? It appears to be the former - any color to green. And
>> that makes sense - "recovery" means everything is ok, and that is
>> what "green" means.
>>
>> But that does leave an open question about that state change from
>> red->yellow. In my environment, different notification methods are
>> used for "red" than are used for "yellow", specifically sms text for
>> red vs. emails for yellow.
>>
>> *And that is where the problem comes in*: if a "red" failed test
>> first goes to "yellow" before then going to "green", the recovery
>> message (upon going green) is only sent to the notification
>> destinations configured for the *yellow state*, not the red state.
>>
>> I certainly understand how this logically occurs - red->yellow is not
>> a recovery so nothing would be sent there at all. But hobbit does not
>> seem to save a complete list of who has been notified for each
>> "event", so it basically forgets about those folks sent notifications
>> at the red level as soon as it transitions to yellow. When the test
>> finally goes green, hobbit checks the alerts config for who would
>> have been notified at *the state just before green* (in this case
>> yellow) and sends recovery messages to those destinations. But it has
>> lost the fact that it was actually at a red level previous to the
>> yellow and should have sent recovery to those destinations as well.
>>
>> I believe that BB keeps track of who has been notified for each event
>> via the "np_user at host.com_host1.disk" type of entries in the tmp dir.
>> This allows it to have a complete list of notification destinations
>> that it could/can use for recoveries. I am not saying hobbit should
>> use the same mechanism, but hobbit does *appear* to be losing some
>> rather important state info.
>>
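
One possible direction, along the lines of what Mark describes for BB, would be to remember who was actually notified during an event and send recoveries to that list rather than re-scanning the rules by the last colour. The sketch below only illustrates that idea under the same assumptions as before; the function name, the rule dictionaries and the "notified" list are hypothetical and do not correspond to anything in Hobbit.

```python
# Hypothetical sketch of per-event recipient memory; not Hobbit source code.
def simulate_with_memory(test, colors, rules, alert_colors):
    """Like the described behaviour, but recoveries go to everyone alerted in this event."""
    sent = []
    notified = []            # recipients alerted during this event who asked for RECOVERED
    last_alert_color = None  # colour remembered for the currently active alert
    for color in colors:
        if color in alert_colors:
            entering_alert = last_alert_color is None
            going_red = color == "red" and last_alert_color != "red"
            if entering_alert or going_red:
                for rule in rules:
                    if rule["test"] == test and rule["color"] == color:
                        sent.append((color, "alert", rule["recipient"]))
                        wants_recovery = rule["recovered"]
                        if wants_recovery and rule["recipient"] not in notified:
                            notified.append(rule["recipient"])
            last_alert_color = color
        elif last_alert_color is not None:
            # Recovery: use the remembered recipients, not a rule lookup by colour.
            for recipient in notified:
                sent.append((color, "recovery", recipient))
            notified = []
            last_alert_color = None
    return sent


if __name__ == "__main__":
    rules = [{"recipient": "me", "test": "disk", "color": "red", "recovered": True}]
    colors = ["yellow", "red", "yellow", "green"]
    print(simulate_with_memory("disk", colors, rules, {"red", "yellow", "purple"}))
```

With the single red rule from the example, the yellow, red, yellow, green sequence now ends with a recovery message to the recipient who got the red alert. Whether something like this fits into the real alert module is a separate question.
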