[hobbit] RECOVERED alerts red->yellow
    Alan Sparks 
    asparks at doublesparks.net
       
    Wed Jul 23 23:39:38 CEST 2008
    
    
  
Anyone have any other ideas how to fix this bug?  Thanks...
-Alan
Alan Sparks wrote:
> After a day of running in trace and debug modes on the alerts module, 
> I think I understand how this is broken.  But I'm unsure anything but 
> hacking the code can fix the issue.  It appears to come from 
> unfortunate interactions between some of the features, including the 
> "flap detection" stuff.
>
> So: If I have the rule:
> MAIL me at whereever.com TEST=disk COLOR=RED RECOVERED
> and ALERTCOLORS="red,yellow,purple"
>
> The traces show Hobbit going through the following "thought process":
> * Say the disk goes yellow.  That's in Hobbit's alert color list, so 
> it triggers alert processing.  But, no rule matches that color, so no 
> alert is sent.
> * Say the disk now goes red.  Now, Hobbit sees that as a transition 
> from an alert state to another alert state.  Normally, it would 
> suppress this, but there is logic to special-case going red, and the 
> alert processing is triggered.  This time, a rule matches, and an 
> alert is sent.
> * Say now the disk goes yellow.  This is seen by Hobbit as a 
> transition from an alert state to another alert state (due to both 
> colors in ALERTCOLORS).  No alert processing is done -- it is 
> suppressed since it is NOT a recovery (it's flapping between two alert 
> states).  BUT, Hobbit now remembers the current color (alert state) as 
> yellow.
> * Finally, the disk goes green.  This is a recovery, since it is a 
> transition from the ALERTCOLORS to the OKCOLORS.  And, this triggers 
> alert rule processing.  HOWEVER, now, the alert code scans for a rule 
> for the last state of the alert -- yellow.  And, of course, no such 
> rule exists, and the rule that would trigger the recovery page is not 
> used, and no recovery page is sent.
>
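The sequence traced above can be replayed with a small model. This is only a sketch of the behaviour as described in this thread, not Hobbit's actual code: the rule tuple layout and the names `matching_rules` and `simulate` are hypothetical, and the special case for red is modelled simply as "always run rule processing when the new color is red".

```python
ALERTCOLORS = {"red", "yellow", "purple"}

# Hypothetical model of one alert rule:
# (recipient, test, color, wants_recovery)
RULES = [("me at whereever.com", "disk", "red", True)]


def matching_rules(test, color):
    """Return the rules that match this test and color."""
    return [r for r in RULES if r[1] == test and r[2] == color]


def simulate(test, colors):
    """Replay a sequence of color changes and return the messages sent."""
    sent = []
    last_color = "green"
    for color in colors:
        was_alert = last_color in ALERTCOLORS
        is_alert = color in ALERTCOLORS
        if is_alert and (not was_alert or color == "red"):
            # New alert, or the special-cased change to red: scan the rules.
            for recip, _, _, _ in matching_rules(test, color):
                sent.append(("ALERT", recip, color))
        elif was_alert and not is_alert:
            # Recovery: rules are looked up by the last remembered color.
            for recip, _, _, wants_recovery in matching_rules(test, last_color):
                if wants_recovery:
                    sent.append(("RECOVERED", recip, color))
        # Any other alert -> alert change is suppressed, but the new
        # color is still remembered as the current alert state.
        last_color = color
    return sent


if __name__ == "__main__":
    # Alert on red, but no recovery, because the last color was yellow.
    print(simulate("disk", ["yellow", "red", "yellow", "green"]))
    # Going straight from red to green does produce the recovery message.
    print(simulate("disk", ["red", "green"]))
```

In this model the recovery message is lost only because the lookup on recovery uses the remembered color (yellow) rather than the color that originally matched the rule (red).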
> The RECOVERED keyword is only a flag on the rule that says: if you 
> match this rule during recovery processing, this recipient does want 
> a recovery page.  But, Hobbit keeps no memory about which rule 
> triggered an alert, it seems.  It has to go back through the ruleset 
> during recovery processing to find a rule to use.  And because the 
> color has changed since the alert was sent, no matching rule is found.
>
> So I think you can call it a bug, or an unfortunate side effect of 
> adding yellow to the ALERTCOLORS list.  If you do, you'll compromise 
> your recovery paging.  If you don't, you can't send alerts on warning 
> (yellow) conditions.  Short of changing the code to eliminate the 
> alert state suppression (i.e., flap detection), I'm not certain how 
> this can be fixed or worked around.
> -Alan
>
>
> Mark Hinkle wrote:
>> Yes, I see the same thing as Alan and maybe that is why his 
>> description makes sense to me.
>>
>> The real questions are: what triggers a recovery message to be sent 
>> and who gets them? Is it when a test goes from any color to green? Or 
>> is it any "down-grade" in alert state (i.e. red->yellow, or 
>> yellow->green)? It appears to be the former - any color to green. And 
>> that makes sense - "recovery" means everything is ok, and that is 
>> what "green" means.
>>
>> But that does leave an open question about that state change from 
>> red->yellow. In my environment, different notification methods are 
>> used for "red" than are used for "yellow", specifically sms text for 
>> red vs. emails for yellow.
>>
>> *And that is where the problem comes in*: if a "red" failed test 
>> first goes to "yellow" before then going to "green", the recovery 
>> message (upon going green) is only sent to the notification 
>> destinations configured for the *yellow state*, not the red state.
>>
>> I certainly understand how this logically occurs - red->yellow is not 
>> a recovery so nothing would be sent there at all. But hobbit does not 
>> seem to save a complete list of who has been notified for each 
>> "event", so it basically forgets about those folks sent notifications 
>> at the red level as soon as it transitions to yellow. When the test 
>> finally goes green, hobbit checks the alerts config for who would 
>> have been notified at *the state just before green* (in this case 
>> yellow) and sends recovery messages to those destinations. But it has 
>> lost the fact that it was actually at a red level previous to the 
>> yellow and should have sent recovery to those destinations as well.
>>
>> I believe that BB keeps track of who has been notified for each event 
>> via the "np_user at host.com_host1.disk" type of entries in the tmp dir. 
>> This allows it to have a complete list of notification destinations 
>> that it could/can use for recoveries. I am not saying hobbit should 
>> use the same mechanism, but hobbit does *appear* to be losing some 
>> rather important state info.
>>
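One way to picture the missing state is to remember, per event, every recipient that was alerted and wants a recovery message, and to notify that whole set when the test goes green. The sketch below only illustrates that idea. It is not how Hobbit or BB implement it, and the recipient names, the rule tuple layout, and `simulate_with_memory` are all hypothetical.

```python
ALERTCOLORS = {"red", "yellow", "purple"}

# Hypothetical rules: (recipient, test, color, wants_recovery)
RULES = [
    ("sms-oncall", "disk", "red", True),
    ("mail-team", "disk", "yellow", True),
]


def simulate_with_memory(test, colors):
    """Replay color changes, remembering who was alerted during the event."""
    sent = []
    notified = set()  # recipients alerted in this event who want a recovery
    last_color = "green"
    for color in colors:
        if color in ALERTCOLORS:
            # Same suppression as described in the thread: only a new
            # alert or a change to red triggers rule processing.
            if last_color not in ALERTCOLORS or color == "red":
                for recip, rtest, rcolor, wants_recovery in RULES:
                    if rtest == test and rcolor == color:
                        sent.append(("ALERT", recip, color))
                        if wants_recovery:
                            notified.add(recip)
        elif last_color in ALERTCOLORS:
            # Recovery: use the remembered recipients, not a rule lookup
            # keyed on the last color.
            for recip in sorted(notified):
                sent.append(("RECOVERED", recip, color))
            notified.clear()
        last_color = color
    return sent


if __name__ == "__main__":
    print(simulate_with_memory("disk", ["red", "yellow", "green"]))
    print(simulate_with_memory("disk", ["yellow", "red", "yellow", "green"]))
```

With the recipients remembered for the whole event, the intermediate red->yellow change no longer affects who receives the recovery message.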