
Re: [hobbit] RECOVERED alerts red->yellow



Yes, I see the same thing as Alan and maybe that is why his description makes sense to me.

The real questions are: what triggers a recovery message to be sent, and who gets it? Is it when a test goes from any color to green? Or is it any downgrade in alert state (e.g. red->yellow, or yellow->green)? It appears to be the former - any color to green. And that makes sense - "recovery" means everything is ok, and that is what "green" means.

But that does leave an open question about the state change from red->yellow. In my environment, different notification methods are used for "red" than for "yellow": specifically, SMS text messages for red vs. emails for yellow.

*And that is where the problem comes in*: if a "red" failed test first goes to "yellow" before then going to "green", the recovery message (upon going green) is only sent to the notification destinations configured for the *yellow state*, not the red state.

I certainly understand how this logically occurs - red->yellow is not a recovery, so nothing would be sent there at all. But hobbit does not seem to save a complete list of who has been notified for each "event", so it basically forgets about the destinations that were notified at the red level as soon as the test transitions to yellow. When the test finally goes green, hobbit checks the alerts config for who would have been notified at *the state just before green* (in this case yellow) and sends recovery messages to those destinations. But it has lost the fact that the test was actually at a red level before the yellow, so the red-level destinations never get a recovery message even though they should.
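
To make that concrete, here is a toy Python model of the behavior I think I am seeing. It is only a sketch of my guess, not hobbit's actual code: the RULES table, the function name and the destination strings are all made up for illustration.

```python
# Hypothetical alert rules: which destinations get notified at each color.
# (Invented for this example; not a real hobbit config.)
RULES = {
    "red": {"sms:oncall"},
    "yellow": {"mail:admins"},
}


def recovery_recipients_observed(history):
    """Return who gets the recovery message under the behavior described above.

    Only the color immediately before green is consulted, so any
    destinations notified at an earlier, different color are forgotten.
    """
    if len(history) < 2 or history[-1] != "green":
        return set()  # no recovery unless the test has just gone green
    return set(RULES.get(history[-2], set()))


via_yellow = recovery_recipients_observed(["red", "yellow", "green"])
direct = recovery_recipients_observed(["red", "green"])
```

With these made-up rules, the red->yellow->green path yields only the email destination, while red->green yields the SMS destination. That difference is exactly the inconsistency I am describing.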

I believe that BB keeps track of who has been notified for each event via the "np_user (at) host.com_host1.disk" type of entries in the tmp dir. This allows it to have a complete list of notification destinations that it could/can use for recoveries. I am not saying hobbit should use the same mechanism, but hobbit does *appear* to be losing some rather important state info.
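
For comparison, here is a minimal Python sketch of the kind of per-event bookkeeping I have in mind. It is not how BB or hobbit is actually implemented; the EventTracker class, the rule table and the destination strings are hypothetical, and the only idea taken from above is "remember everyone notified during the event, then send recovery to all of them".

```python
# Hypothetical alert rules, invented for this example.
RULES = {
    "red": {"sms:oncall"},
    "yellow": {"mail:admins"},
}


class EventTracker:
    """Remember every destination notified while a test is non-green."""

    def __init__(self, rules):
        self.rules = rules
        self.notified = {}  # (host, test) -> set of destinations notified so far

    def update(self, host, test, color):
        """Record a status change; return the recovery recipients, if any."""
        key = (host, test)
        if color == "green":
            # Recovery: everyone notified during this event, then forget it.
            return sorted(self.notified.pop(key, set()))
        # Non-green: add this color's destinations to the event's list.
        self.notified.setdefault(key, set()).update(self.rules.get(color, set()))
        return []


tracker = EventTracker(RULES)
tracker.update("host1", "disk", "red")
tracker.update("host1", "disk", "yellow")
recovered = tracker.update("host1", "disk", "green")
```

Here the recovery list for host1.disk contains both the SMS and the email destination, because the red-level notification is remembered across the yellow state and only cleared once the test goes green.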

--
Mark L. Hinkle
hinkman (at) hinkman.com