[hobbit] RECOVERED alerts red->yellow
Alan Sparks
asparks at doublesparks.net
Thu Jul 24 01:11:39 CEST 2008
Had already considered that, but it doesn't work. But thanks for the
suggestion!
-Alan
Hubbard, Greg L wrote:
> You might try having a separate rule for each color. Then maybe the
> rule would fire when the test transitions into that color. It may not
> fire when it transitions from one color to another in the same rule.
> But I am just guessing!
>
> GLH
>
> -----Original Message-----
> From: Alan Sparks [mailto:asparks at doublesparks.net]
> Sent: Wednesday, July 23, 2008 4:40 PM
> To: hobbit at hswn.dk
> Subject: Re: [hobbit] RECOVERED alerts red->yellow
>
> Anyone have any other ideas how to fix this bug? Thanks...
> -Alan
>
> Alan Sparks wrote:
>
>> After a day of running in trace and debug modes on the alerts module,
>> I think I understand how this is broken, but I doubt anything short of
>> hacking the code can fix the issue. The cause appears to be an
>> unfortunate interaction between some of the features, including the
>> "flap detection" stuff.
>>
>> So: If I have the rule:
>> MAIL me at whereever.com TEST=disk COLOR=RED RECOVERED
>> and
>> ALERTCOLORS="red,yellow,purple"
>>
>> The traces show Hobbit going through the following "thought process":
>> * Say the disk goes yellow. That's in Hobbit's alert color list, so
>> it triggers alert processing. But, no rule matches that color, so no
>> alert is sent.
>> * Say the disk now goes red. Now, Hobbit sees that as a transition
>> from an alert state to another alert state. Normally, it would
>> suppress this, but there is logic to special-case going red, and the
>> alert processing is triggered. This time, a rule matches, and an
>> alert is sent.
>> * Say now the disk goes yellow. This is seen by Hobbit as a
>> transition from an alert state to another alert state (due to both
>> colors being in ALERTCOLORS). No alert processing is done -- it is
>> suppressed since it is NOT a recovery (it's flapping between two alert
>> states). BUT, Hobbit now remembers the current color (alert state) as
>> yellow.
>> * Finally, the disk goes green. This is a recovery, since it is a
>> transition from the ALERTCOLORS to the OKCOLORS. And, this triggers
>> alert rule processing. HOWEVER, now, the alert code scans for a rule
>> for the last state of the alert -- yellow. And, of course, no such
>> rule exists, and the rule that would trigger the recovery page is not
>> used, and no recovery page is sent.
>>
>> The RECOVERED keyword is only a flag on the rule that says: if you
>> match this rule during recovery processing, this recipient does want
>> a recovery page. But Hobbit keeps no memory of which rule triggered
>> an alert, it seems. It has to go back through the ruleset during
>> recovery processing to find a rule to use. And because the color has
>> changed since the alert was sent, no matching rule is found.
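
The sequence described above can be modeled in a few lines of Python. This is only a sketch of the behavior as inferred from the traces, not Hobbit's actual code: the function names, the rule dictionary layout, and the me@example.com address are all made up for illustration, and the input is assumed to be a list of color changes for one test.

```python
# Hypothetical model of the behavior described in this thread.
# None of these names come from the Hobbit source.
ALERTCOLORS = {"red", "yellow", "purple"}

# One rule, mirroring: MAIL ... TEST=disk COLOR=RED RECOVERED
RULES = [
    {"recipient": "me@example.com", "test": "disk",
     "colors": {"red"}, "recovered": True},
]


def matching_rules(test, color):
    """Rules whose TEST and COLOR settings match the given status."""
    return [r for r in RULES if r["test"] == test and color in r["colors"]]


def simulate(test, colors):
    """Return the (kind, recipient, color) messages sent for a color sequence."""
    sent = []
    last_alert_color = None  # the only state kept between color changes
    for color in colors:
        if color in ALERTCOLORS:
            if last_alert_color is None or color == "red":
                # New alert, or the special case for going red: scan the rules.
                for rule in matching_rules(test, color):
                    sent.append(("alert", rule["recipient"], color))
            # Other alert-to-alert changes are suppressed, but the
            # remembered color is still overwritten.
            last_alert_color = color
        elif last_alert_color is not None:
            # Recovery: the rules are rescanned using the LAST alert color.
            for rule in matching_rules(test, last_alert_color):
                if rule["recovered"]:
                    sent.append(("recovered", rule["recipient"], color))
            last_alert_color = None
    return sent
```

Under these assumptions, `simulate("disk", ["yellow", "red", "yellow", "green"])` yields only the red alert and no recovery message, while `simulate("disk", ["red", "green"])` yields both the alert and the recovery.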
>>
>> So I think you can call it a bug, or an unfortunate side effect of
>> adding yellow to the ALERTCOLORS list. If you do, you'll compromise
>> your recovery paging. If you don't, you can't send alerts on warning
>> (yellow) conditions. Short of changing the code to eliminate the
>> alert state suppression (i.e., flap detection), I'm not certain how
>> this can be fixed or worked around.
>> -Alan
>>
>>
>> Mark Hinkle wrote:
>>
>>> Yes, I see the same thing as Alan and maybe that is why his
>>> description makes sense to me.
>>>
>>> The real questions are: what triggers a recovery message to be sent
>>> and who gets them? Is it when a test goes from any color to green? Or
>>> is it any "down-grade" in alert state (i.e. red->yellow, or
>>> yellow->green)? It appears to be the former - any color to green. And
>>> that makes sense - "recovery" means everything is ok, and that is
>>> what "green" means.
>>>
>>> But that does leave an open question about that state change from
>>> red->yellow. In my environment, different notification methods are
>>> used for "red" than are used for "yellow", specifically sms text for
>>> red vs. emails for yellow.
>>>
>>> *And that is where the problem comes in*: if a "red" failed test
>>> first goes to "yellow" before then going to "green", the recovery
>>> message (upon going green) is only sent to the notification
>>> destinations configured for the *yellow state*, not the red state.
>>>
>>> I certainly understand how this logically occurs - red->yellow is not
>>> a recovery so nothing would be sent there at all. But hobbit does not
>>> seem to save a complete list of who has been notified for each
>>> "event", so it basically forgets about those folks sent notifications
>>> at the red level as soon as it transitions to yellow. When the test
>>> finally goes green, hobbit checks the alerts config for who would
>>> have been notified at *the state just before green* (in this case
>>> yellow) and sends recovery messages to those destinations. But it has
>>> lost the fact that it was actually at a red level previous to the
>>> yellow and should have sent recovery to those destinations as well.
>>>
>>> I believe that BB keeps track of who has been notified for each event
>>> via the "np_user at host.com_host1.disk" type of entries in the tmp dir.
>>> This allows it to have a complete list of notification destinations
>>> that it could/can use for recoveries. I am not saying hobbit should
>>> use the same mechanism, but hobbit does *appear* to be losing some
>>> rather important state info.
>>>
>>>
>>
>> To unsubscribe from the hobbit list, send an e-mail to
>> hobbit-unsubscribe at hswn.dk