
Re: [hobbit] RECOVERED alerts red->yellow

To: hobbit (at) hswn.dk
Subject: Re: [hobbit] RECOVERED alerts red->yellow
From: Alan Sparks <asparks (at) doublesparks.net>
Date: Wed, 23 Jul 2008 15:39:38 -0600
References: <1F7B01020EC4D04DA17703634B9E888E0561A012 (at) ULPGCTMVMAI003.EU.COLT> <4874B9EF.3030205 (at) doublesparks.net> <487548FF.9020001 (at) hinkman.com> <48760FB1.7050907 (at) doublesparks.net>
User-agent: Thunderbird 2.0.0.14 (Windows/20080421)

Anyone have any other ideas how to fix this bug?  Thanks...
-Alan

Alan Sparks wrote:

After a day of running in trace and debug modes on the alerts module, I think I understand how this is broken. But I'm unsure anything but hacking the code can fix the issue. It appears to be unfortunate interactions in some of the features, including the "flap detection" stuff.
So: If I have the rule:
MAIL me (at) whereever.com TEST=disk COLOR=RED RECOVERED
and ALERTCOLORS="red,yellow,purple"

The traces show Hobbit going through the following "thought process":
* Say the disk goes yellow. That's in Hobbit's alert color list, so it triggers alert processing. But no rule matches that color, so no alert is sent.
* Say the disk now goes red. Now Hobbit sees that as a transition from an alert state to another alert state. Normally it would suppress this, but there is logic to special-case going red, and the alert processing is triggered. This time a rule matches, and an alert is sent.
* Say now the disk goes yellow. This is seen by Hobbit as a transition from an alert state to another alert state (due to both colors being in ALERTCOLORS). No alert processing is done -- it is suppressed since it is NOT a recovery (it's flapping between two alert states). BUT, Hobbit now remembers the current color (alert state) as yellow.
* Finally, the disk goes green. This is a recovery, since it is a transition from the ALERTCOLORS to the OKCOLORS, and it triggers alert rule processing. HOWEVER, the alert code now scans for a rule matching the last state of the alert -- yellow. Of course, no such rule exists, so the rule that would trigger the recovery page is not used, and no recovery page is sent.
The RECOVERED keyword is only a flag on the rule that says: if you match this rule during recovery processing, this recipient does want a recovery page. But Hobbit keeps no memory of which rule triggered an alert, it seems. It has to go back through the ruleset during recovery processing to find a rule to use. And because the colors change, no such rule can exist.
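To make the sequence above concrete, here is a minimal Python sketch of the behavior as the traces suggest it. It is not Hobbit's actual code: the rule dictionary, the function names, and the me@example.com address are all hypothetical stand-ins, and the suppression logic is only my reading of the trace output.

```python
# Toy model of the alert processing described above. NOT Hobbit's real code:
# every name here is a hypothetical stand-in.
ALERT_COLORS = {"red", "yellow", "purple"}

# Stand-in for: MAIL <recipient> TEST=disk COLOR=RED RECOVERED
RULES = [
    {"recipient": "me@example.com", "test": "disk", "color": "red", "recovered": True},
]


def matching_rules(test, color):
    """Return the rules that match this test name and color."""
    return [r for r in RULES if r["test"] == test and r["color"] == color]


def simulate(test, colors):
    """Feed a sequence of status colors through the model; return messages sent."""
    sent = []
    last_alert_color = None  # the only state the model keeps per test
    for color in colors:
        if color in ALERT_COLORS:
            was_alerting = last_alert_color is not None
            last_alert_color = color  # remembered even if the alert is suppressed
            if was_alerting and color != "red":
                continue  # alert-to-alert change is suppressed, except going red
            for rule in matching_rules(test, color):
                sent.append(("alert", rule["recipient"], color))
        elif last_alert_color is not None:
            # Recovery: rules are looked up by the LAST alert color, not by
            # whichever rule actually fired earlier.
            for rule in matching_rules(test, last_alert_color):
                if rule["recovered"]:
                    sent.append(("recovered", rule["recipient"], color))
            last_alert_color = None
    return sent


if __name__ == "__main__":
    print(simulate("disk", ["yellow", "red", "yellow", "green"]))
    print(simulate("disk", ["red", "green"]))
```

In this model the yellow, red, yellow, green sequence produces the red alert but no recovery message, while a direct red to green transition produces both.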
So I think you can call it a bug, or an unfortunate side effect of adding yellow to the ALERTCOLORS list. If you do, you'll compromise your recovery paging. If you don't, you can't send alerts on warning (yellow) conditions. Short of changing the code to eliminate the alert state suppression (i.e., flap detection), I'm not certain how this can be fixed or worked around.
-Alan


Mark Hinkle wrote:
Yes, I see the same thing as Alan, and maybe that is why his description makes sense to me.
The real questions are: what triggers a recovery message to be sent, and who gets them? Is it when a test goes from any color to green? Or is it any "down-grade" in alert state (i.e. red->yellow, or yellow->green)? It appears to be the former - any color to green. And that makes sense - "recovery" means everything is ok, and that is what "green" means.
But that does leave an open question about that state change from red->yellow. In my environment, different notification methods are used for "red" than are used for "yellow", specifically sms text for red vs. emails for yellow.
*And that is where the problem comes in*: if a "red" failed test first goes to "yellow" before then going to "green", the recovery message (upon going green) is only sent to the notification destinations configured for the *yellow state*, not the red state.
I certainly understand how this logically occurs - red->yellow is not a recovery, so nothing would be sent there at all. But hobbit does not seem to save a complete list of who has been notified for each "event", so it basically forgets about those folks sent notifications at the red level as soon as it transitions to yellow. When the test finally goes green, hobbit checks the alerts config for who would have been notified at *the state just before green* (in this case yellow) and sends recovery messages to those destinations. But it has lost the fact that it was actually at a red level previous to the yellow and should have sent recovery to those destinations as well.
I believe that BB keeps track of who has been notified for each event via the "np_user (at) host.com_host1.disk" type of entries in the tmp dir. This allows it to have a complete list of notification destinations that it could/can use for recoveries. I am not saying hobbit should use the same mechanism, but hobbit does *appear* to be losing some rather important state info.
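As an illustration of the kind of state that seems to be missing, here is a minimal Python sketch that remembers every destination notified during an event and sends the recovery to all of them. It is neither hobbit's nor BB's implementation: the EventTracker class, the color-to-destination table, and the destination names are hypothetical, and flap suppression is deliberately left out to keep the idea visible.

```python
from collections import defaultdict

# Hypothetical routing table: sms for red, email for yellow.
RECIPIENTS_BY_COLOR = {
    "red": ["sms:oncall"],
    "yellow": ["mail:team"],
}


class EventTracker:
    """Remembers who was notified for each (host, test) event."""

    def __init__(self):
        # (host, test) -> set of destinations notified since the event began
        self.notified = defaultdict(set)

    def alert(self, host, test, color):
        """Record and return the destinations alerted for this color."""
        recipients = RECIPIENTS_BY_COLOR.get(color, [])
        self.notified[(host, test)].update(recipients)
        return sorted(recipients)

    def recover(self, host, test):
        """Return everyone notified during the event, then forget the event."""
        return sorted(self.notified.pop((host, test), set()))


if __name__ == "__main__":
    tracker = EventTracker()
    tracker.alert("host1", "disk", "red")
    tracker.alert("host1", "disk", "yellow")
    print(tracker.recover("host1", "disk"))
```

With the list kept per event, a red->yellow->green sequence sends the recovery to both the sms and the email destinations, regardless of which color came last.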
