[hobbit] Use hobbit in operation center with critical systems view

Eduard Michels emichels at quicksoft.com.br
Fri Nov 9 11:38:09 CET 2007


 

> -----Original Message-----
> From: Henrik Stoerner [mailto:henrik at hswn.dk] 
> Sent: Thursday, November 8, 2007 21:27
> To: hobbit at hswn.dk
> Subject: Re: [hobbit] Use hobbit in operation center with 
> critical systems view
> 
> On Mon, Oct 15, 2007 at 09:29:53AM +0200, Gräub Roland wrote:
> > In our environment the Operation Center always calls when an alert
> > shows up on their Event Console and acknowledges the alert. With this
> > action the alert is no longer visible to the operators.
> >
> > Now the following questions/thoughts came up when we looked closer;
> >
> > Acknowledge;
> > If an alert is acknowledged by the operators in critical systems, this
> > is a fixed acknowledgement for the given time, even when there is a
> > status change.
> > When a problem is fixed and goes red/yellow again, it will not show up
> > in the critical view until the acked time has expired.
> > There should be an option to ack an alert until a status change (like
> > in disable until ok).
> 
> I decided against the "ack-until-ok" method, because in my 
> experience systems often go briefly ok while being fixed, and 
> then they crash again. (E.g. you'd reboot a server and all 
> the processes start up, but one process that is being 
> monitored dies after a few minutes). So the monitoring 
> reports OK for a few minutes, and then goes red - if you did 
> use an "ack-until-ok" it would show up on the critical 
> systems view again, triggering a new ticket.
> 
> What happens now is that when the status goes green, a timer 
> kicks off in Hobbit which lasts 12 minutes (i.e. 2 normal 
> test cycles, plus a bit for good measure). If the test has 
> been OK throughout those 12 minutes then the ack is cleared; 
> if it goes non-green during that time the timer is reset and 
> the ack persists (at least until it eventually expires).
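
The timer behaviour described above can be sketched roughly as follows. This is only an illustration of the stated rules, not Hobbit's actual code: the names Ack, update_ack and GREEN_HOLD_SECONDS are hypothetical, and the sketch assumes the function is called once per incoming status report.

```python
from dataclasses import dataclass
from typing import Optional

GREEN_HOLD_SECONDS = 12 * 60  # 12 minutes, as described in the mail


@dataclass
class Ack:
    expires_at: float                    # time at which the ack expires regardless
    green_since: Optional[float] = None  # start of the current unbroken green run


def update_ack(ack: Optional[Ack], color: str, now: float) -> Optional[Ack]:
    """Return the ack that remains after a status report, or None if it is gone."""
    if ack is None:
        return None
    if now >= ack.expires_at:
        return None                      # the ack simply ran out
    if color != "green":
        ack.green_since = None           # non-green resets the timer, ack persists
        return ack
    if ack.green_since is None:
        ack.green_since = now            # first green report starts the timer
    if now - ack.green_since >= GREEN_HOLD_SECONDS:
        return None                      # green throughout the hold period: clear
    return ack
```

With this shape, a server that reports green for a few minutes after a reboot and then goes red again keeps its ack, which is the case the 12-minute timer is meant to cover.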
> 
> > The option Host-ack seems to be broken; on my system only one test is
> > acknowledged although the Host-ack checkbox is selected.
> 
> A quick test says you're right. Will have to look into that.
> 
> > Log;
> > A log/report from the critical view is missing. A report with
> > information about the alerts and acknowledgements that were made in
> > critical systems would be helpful.
> 
> Right now it isn't even being logged, except inside the 
> Hobbit daemon. A reporting tool is needed, I agree.
> 
> > Definition (Edit Critical Systems);
> > Easiest way for us: make standard definitions and add hosts to these
> > templates. Works fine.
> > But I miss a connection between alerts and the critical view
> > definition. Something like an option in hobbit-alerts.cfg to define
> > that this rule is also valid for the critical view.
> > Send an email when an alert shows up in the critical view, with all
> > the possibilities from hobbit-alerts.cfg.
> 
> Wouldn't these two do the same thing ?
> Using the alert definitions to control the critical view is 
> an interesting idea, I hadn't thought of that.
> 
> > Special case: messages missed or noticed belatedly by the Operation
> > Center; Now some applications/scripts send alerts to the Console View
> > and the Operation Center makes an alert call for each event.
> > A problem in Hobbit/BB is that when changes happen in red messages, the
> > Operation Center doesn't notice until the acknowledge time runs out
> > and they make the alert call again.
> > This can happen for example in the disk status test (a second
> > filesystem goes red) or with nested tests/logfiles. With the Event
> > Console they get two messages (one for each filesystem).
> 
> This is a problem with all of the tests that have multiple ways of 
> going red: disk, procs, msgs and http are the common ones. I don't 
> have a solution to that right now. The way Hobbit works right 
> now assumes that when you get an alert about the "disk" 
> status, you keep on fixing it until the status goes green - 
> and then the Operations Center won't need to raise a ticket 
> for the second event.
> 
As a solution to this problem, I count the alerts within each test. If the number of alerts has changed, a new alert is generated with the status of the test.
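
A minimal sketch of that counting idea, under stated assumptions: it assumes each individual problem appears in the status text as a line starting with "&red" or "&yellow", and the names count_problems and needs_new_alert are hypothetical helpers, not part of Hobbit.

```python
import re
from typing import Dict, Tuple

# Assumption: one problem per status line that begins with "&red" or "&yellow".
PROBLEM_LINE = re.compile(r"^&(red|yellow)\b", re.MULTILINE)

# Last problem count seen per (host, test); kept in memory for this sketch.
_last_count: Dict[Tuple[str, str], int] = {}


def count_problems(status_text: str) -> int:
    """Count the individual red/yellow lines inside one test's status text."""
    return len(PROBLEM_LINE.findall(status_text))


def needs_new_alert(host: str, test: str, status_text: str) -> bool:
    """True when the number of problems differs from the previous report."""
    key = (host, test)
    current = count_problems(status_text)
    previous = _last_count.get(key)
    _last_count[key] = current
    # The first report for a key only records the count; the normal alert
    # for the status change is assumed to be raised elsewhere.
    return previous is not None and current != previous
```

For the disk example, a status with one red filesystem followed by a status with two red filesystems makes needs_new_alert return True on the second report, even though the test colour stays red. Note that a pure count misses the case where one problem clears at the same moment another appears.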

> 
> Regards,
> Henrik
> 
> 



