[hobbit] Use hobbit in operation center with critical systems view

Fri Nov 9 00:26:42 CET 2007

On Mon, Oct 15, 2007 at 09:29:53AM +0200, Gräub Roland wrote:
> In our environment the Operation Center always calls when an alert shows 
> up on their Event Console and acknowledges the alert. With this action 
> the alert is no longer visible to the operators.
>
> Now the following questions/thoughts came up when we looked closer:
> 
> Acknowledge; 
> If an alert is acknowledged by the operators in critical systems, 
> this is a fixed acknowledgement for the given time, even when there is a 
> status change.
> When a problem is fixed and goes red/yellow again, it will not show 
> up in the critical view until the acked time has expired.
> There should be an option to ack an alert until a status change (like in 
> disable until ok).

I decided against the "ack-until-ok" method, because in my experience
systems often go briefly ok while being fixed, and then they crash
again. (E.g. you'd reboot a server and all the processes start up, but
one process that is being monitored dies after a few minutes). So the
monitoring reports OK for a few minutes, and then goes red - if you did
use an "ack-until-ok" it would show up on the critical systems view
again, triggering a new ticket.

What happens now is that when the status goes green, a timer kicks off
in Hobbit which lasts 12 minutes (i.e. 2 normal test cycles, plus a bit
for good measure). If the test has been OK throughout those 12 minutes
then the ack is cleared; if it goes non-green during that time the timer
is reset and the ack persists (at least until it eventually expires).
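
As a rough illustration of the timer logic described above, here is a
small sketch in Python. It is not Hobbit's actual code (Hobbit is written
in C); the names Ack, update and GRACE_SECONDS are made up for the
example, and treating "exactly 12 minutes green" as enough to clear the
ack is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

GRACE_SECONDS = 12 * 60  # 12 minutes, as described above


@dataclass
class Ack:
    """Hypothetical model of one acknowledgement on a status."""
    expires_at: float                    # time at which the ack expires anyway
    green_since: Optional[float] = None  # start of the current green run
    active: bool = True

    def update(self, color: str, now: float) -> bool:
        """Feed one status report; return True while the ack is in force."""
        if not self.active:
            return False
        if now >= self.expires_at:
            self.active = False          # the ack ran out on its own
        elif color != "green":
            self.green_since = None      # non-green: reset the timer, ack persists
        else:
            if self.green_since is None:
                self.green_since = now   # status just went green: start the timer
            if now - self.green_since >= GRACE_SECONDS:
                self.active = False      # green throughout the grace period
        return self.active
```

With reports every 5 minutes, a status that goes green at 300s and red
again at 600s keeps its ack; only an uninterrupted green run of 12
minutes (or the ack's own expiry) clears it.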

> The Host-ack option seems to be broken: on my system only one test is 
> acknowledged although the Host-ack checkbox is selected.

A quick test says you're right. Will have to look into that.

> Log;
> A log/report from the Critical view is missing. A report with information 
> about the alerts and the acknowledgements that were made in Critical systems 
> would be helpful.

Right now it isn't even being logged, except inside the Hobbit daemon. A
reporting tool is needed, I agree.

> Definition (Edit Critical Systems);
> Easiest way for us: make standard definitions and add hosts to these templates. Works fine.
> But I miss a connection between alerts and the critical view definition. Something like an option in hobbit-alerts.cfg to define that this rule is also valid for the critical view. 
> Send an email when an alert shows up in the critical view, with all the possibilities from hobbit-alerts.cfg. 

Wouldn't these two do the same thing?
Using the alert definitions to control the critical view is an interesting 
idea, I hadn't thought of that.

> Special case: missed or belated messages at the Operation Center;
> Currently some applications/scripts send alerts to the Console View, and the Operation Center makes an alert call for each event. 
> A problem in Hobbit/BB is that when changes happen in red messages, the Operation Center doesn't notice until the acknowledge time runs out, and then they make the alert call again.
> This can happen, for example, in the disk status test (a second filesystem goes red) or with nested tests/logfiles. With the Event Console they get two messages (one for each filesystem).

This is a problem with all of the tests that have multiple ways of going
red: disk, procs, msgs and http are the common ones. I don't have a
solution to that right now. The way Hobbit works right now assumes that
when you get an alert about the "disk" status, you keep on fixing it
until the status goes green - and then the Operations Center won't need
to raise a ticket for the second event.
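
To make the problem concrete, here is a small Python sketch of the disk
case. It only illustrates why an acked red status hides a second failure;
it is not how Hobbit is implemented, and the function names and
filesystem paths are invented for the example. The assumption is that
the "disk" status shows the worst colour among its filesystems.

```python
from typing import Dict, Set

SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def overall(filesystems: Dict[str, str]) -> str:
    """Colour of the whole 'disk' status: the worst filesystem wins."""
    return max(filesystems.values(), key=SEVERITY.__getitem__, default="green")


def newly_red(before: Dict[str, str], after: Dict[str, str]) -> Set[str]:
    """Filesystems that are red now but were not red in the previous report."""
    return {fs for fs, color in after.items()
            if color == "red" and before.get(fs) != "red"}


# Hypothetical example data: /var is already red and acked, then /home fills up.
before = {"/": "green", "/var": "red", "/home": "green"}
after = {"/": "green", "/var": "red", "/home": "red"}
```

Here overall() gives "red" for both reports, so the status never changes
colour and the existing ack keeps covering it, while newly_red(before,
after) contains "/home", which is the second event that a per-filesystem
Event Console would have shown separately.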

Regards,
Henrik