[hobbit] RECOVERED alerts red->yellow

Thu Jul 24 01:11:39 CEST 2008

Had already considered that, but it doesn't work.  But thanks for the 
suggestion!
-Alan

Hubbard, Greg L wrote:
> You might try having a separate rule for each color.  Then maybe the
> rule would fire when the test transitions into that color.  It may not
> fire when it transitions from one color to another in the same rule.
> But I am just guessing!
>
> GLH
>
> -----Original Message-----
> From: Alan Sparks [mailto:asparks at doublesparks.net] 
> Sent: Wednesday, July 23, 2008 4:40 PM
> To: hobbit at hswn.dk
> Subject: Re: [hobbit] RECOVERED alerts red->yellow
>
> Anyone have any other ideas how to fix this bug?  Thanks...
> -Alan
>
> Alan Sparks wrote:
>> After a day of running in trace and debug modes on the alerts module, 
>> I think I understand how this is broken.  But I'm not sure anything short 
>> of hacking the code can fix the issue.  It appears to be an unfortunate 
>> interaction between some of the features, including the "flap detection"
>> stuff.
>>
>> So: If I have the rule:
>>   MAIL me at whereever.com TEST=disk COLOR=RED RECOVERED
>> and the setting:
>>   ALERTCOLORS="red,yellow,purple"
>>
>> The traces show Hobbit going through the following "thought process":
>> * Say the disk goes yellow.  That's in Hobbit's alert color list, so 
>> it triggers alert processing.  But, no rule matches that color, so no 
>> alert is sent.
>> * Say the disk now goes red.  Now, Hobbit sees that as a transition 
>> from an alert state to another alert state.  Normally, it would 
>> suppress this, but there is logic to special-case going red, and the 
>> alert processing is triggered.  This time, a rule matches, and an 
>> alert is sent.
>> * Say now the disk goes yellow.  This is seen by Hobbit as a 
>> transition from an alert state to another alert state (since both 
>> colors are in ALERTCOLORS).  No alert processing is done -- it is 
>> suppressed since it is NOT a recovery (it's flapping between two alert 
>> states).  BUT, Hobbit now remembers the current color (alert state) as 
>> yellow.
>> * Finally, the disk goes green.  This is a recovery, since it is a 
>> transition from the ALERTCOLORS to the OKCOLORS.  And, this triggers 
>> alert rule processing.  HOWEVER, now, the alert code scans for a rule 
>> for the last state of the alert -- yellow.  And, of course, no such 
>> rule exists, and the rule that would trigger the recovery page is not 
>> used, and no recovery page is sent.
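
The four steps above can be replayed with a small toy model. This is only a sketch of the behaviour as described in the traces, not Hobbit's actual code: the names `ALERT_COLORS`, `RULES`, `matching_rules` and `simulate` are made up for illustration, and the suppression logic is reduced to the two cases mentioned (entering an alert state, and the special case for going red).

```python
# Toy model of the behaviour described in the thread; NOT Hobbit's real code.
ALERT_COLORS = {"red", "yellow", "purple"}  # ALERTCOLORS from the example
# The single example rule: TEST=disk COLOR=RED RECOVERED
RULES = [{"test": "disk", "color": "red", "recovered": True}]


def matching_rules(test, color):
    """Return the rules whose TEST and COLOR both match."""
    return [r for r in RULES if r["test"] == test and r["color"] == color]


def simulate(test, colors):
    """Replay a sequence of colors and return the messages that would go out."""
    sent = []
    remembered = None  # last alert color seen; None while the test is OK
    for color in colors:
        if color in ALERT_COLORS:
            entering = remembered is None
            # Alert-to-alert changes are suppressed, except when going red.
            if (entering or color == "red") and matching_rules(test, color):
                sent.append(f"alert:{color}")
            # The remembered color is updated even when the alert is suppressed.
            remembered = color
        elif remembered is not None:
            # Recovery: the ruleset is searched using the remembered color,
            # not the color that originally triggered the alert.
            if any(r["recovered"] for r in matching_rules(test, remembered)):
                sent.append("recovered")
            remembered = None
    return sent


print(simulate("disk", ["yellow", "red", "yellow", "green"]))
print(simulate("disk", ["red", "green"]))
```

In this model the yellow, red, yellow, green sequence produces only the red alert and no recovery, while a plain red, green sequence produces both.
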
>>
>> The RECOVERED keyword is only a flag on the rule that says: if you 
>> match this rule during recovery processing, this recipient does want a 
>> recovery page.  But Hobbit keeps no memory of which rule triggered 
>> an alert, it seems.  It has to go back through the ruleset during 
>> recovery processing to find a rule to use.  And because the color 
>> has changed since the alert was sent, no such rule can be found.
>>
>> So I think you can call it a bug, or an unfortunate side effect of 
>> adding yellow to the ALERTCOLORS list.  If you do, you'll compromise 
>> your recovery paging.  If you don't, you can't send alerts on warning
>> (yellow) conditions.  Short of changing the code to eliminate the 
>> alert state suppression (i.e., flap detection), I'm not certain how 
>> this can be fixed or worked around.
>> -Alan
>>
>>
>> Mark Hinkle wrote:
>>> Yes, I see the same thing as Alan and maybe that is why his 
>>> description makes sense to me.
>>>
>>> The real questions are: what triggers a recovery message to be sent 
>>> and who gets them? Is it when a test goes from any color to green? Or 
>>> is it any "down-grade" in alert state (i.e. red->yellow, or
>>> yellow->green)? It appears to be the former - any color to green. And
>>> that makes sense - "recovery" means everything is ok, and that is 
>>> what "green" means.
>>>
>>> But that does leave an open question about that state change from
>>> red->yellow. In my environment, different notification methods are
>>> used for "red" than are used for "yellow", specifically sms text for 
>>> red vs. emails for yellow.
>>>
>>> *And that is where the problem comes in*: if a "red" failed test 
>>> first goes to "yellow" before then going to "green", the recovery 
>>> message (upon going green) is only sent to the notification 
>>> destinations configured for the *yellow state*, not the red state.
>>>
>>> I certainly understand how this logically occurs - red->yellow is not 
>>> a recovery so nothing would be sent there at all. But hobbit does not 
>>> seem to save a complete list of who has been notified for each 
>>> "event", so it basically forgets about those folks sent notifications 
>>> at the red level as soon as it transitions to yellow. When the test 
>>> finally goes green, hobbit checks the alerts config for who would 
>>> have been notified at *the state just before green* (in this case
>>> yellow) and sends recovery messages to those destinations. But it has 
>>> lost the fact that it was actually at a red level previous to the 
>>> yellow and should have sent recovery to those destinations as well.
>>>
>>> I believe that BB keeps track of who has been notified for each event 
>>> via the "np_user at host.com_host1.disk" type of entries in the tmp dir.
>>> This allows it to have a complete list of notification destinations 
>>> that it could/can use for recoveries. I am not saying hobbit should 
>>> use the same mechanism, but hobbit does *appear* to be losing some 
>>> rather important state info.
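
The missing state Mark describes could look something like the sketch below: a per-event set of everyone notified so far, which is used for the recovery message and then cleared. This is a hypothetical illustration only, not how Hobbit or BB is implemented. The `RECIPIENTS` table, the recipient strings and the `EventTracker` class are invented for the example, and the alert-to-alert suppression discussed earlier is deliberately left out.

```python
from collections import defaultdict

ALERT_COLORS = {"red", "yellow", "purple"}

# Hypothetical recipients per color ("sms text for red vs. emails for yellow").
RECIPIENTS = {"red": {"sms:oncall"}, "yellow": {"mail:admins"}}


class EventTracker:
    """Remember everyone notified during one event, keyed by (host, test)."""

    def __init__(self):
        self.notified = defaultdict(set)

    def update(self, host, test, color):
        key = (host, test)
        if color in ALERT_COLORS:
            # Notify only recipients who have not yet heard about this event,
            # and remember them for the eventual recovery message.
            new = RECIPIENTS.get(color, set()) - self.notified[key]
            self.notified[key] |= new
            return {"alert": sorted(new)}
        # Recovery: tell everyone notified at any point during the event,
        # then forget the event.
        return {"recovered": sorted(self.notified.pop(key, set()))}


tracker = EventTracker()
print(tracker.update("host1", "disk", "red"))
print(tracker.update("host1", "disk", "yellow"))
print(tracker.update("host1", "disk", "green"))
```

With this kind of bookkeeping, the recovery on going green reaches the red-level recipients as well as the yellow-level ones, regardless of which color came last.
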
>>>
>>>       
>>
>> To unsubscribe from the hobbit list, send an e-mail to 
>> hobbit-unsubscribe at hswn.dk