AW: [hobbit] Use hobbit in operation center with critical systems view

Gräub Roland roland.graeub at rtc.ch
Mon Nov 12 10:01:08 CET 2007


> > 
> > Acknowledge;
> > If an alert is acknowledged by the operators in critical systems,
> > this is a fixed acknowledgement for the given time, even when there
> > is a status change.
> > When a problem is fixed and goes red/yellow again, it will not show
> > up in the critical view until the acked time has expired.
> > There should be an option to ack an alert until a status change
> > (like in disable-until-ok).
> 
> I decided against the "ack-until-ok" method, because in my experience
> systems often go briefly ok while being fixed, and then they crash
> again. (E.g. you'd reboot a server and all the processes start up, but
> one process that is being monitored dies after a few minutes). So the
> monitoring reports OK for a few minutes, and then goes red - if you did
> use an "ack-until-ok" it would show up on the critical systems view
> again, triggering a new ticket.
> 
> What happens now is that when the status goes green, a timer kicks off
> in Hobbit which lasts 12 minutes (i.e. 2 normal test cycles, plus a bit
> for good measure). If the test has been OK throughout those 12 minutes
> then the ack is cleared; if it goes non-green during that time the timer
> is reset and the ack persists (at least until it eventually expires).

I agree with you: ack-until-ok could end up causing a lot more unneeded alerts, so it is unnecessary.
The clear time of 12 minutes is a good choice; it might be made an option in hobbitserver.cfg.
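
To make the clear timer described above concrete, here is a minimal Python sketch. It is only an illustration of the behaviour as described in this thread, not Hobbit's actual code: the class name Ack, the method update() and the clear_delay parameter are all made up for this example, and clear_delay merely stands in for the configurable option suggested above, which is an assumption and not an existing hobbitserver.cfg setting.

```python
CLEAR_DELAY = 12 * 60  # seconds; the 12 minutes mentioned in the thread


class Ack:
    """Hypothetical model of an acknowledgement with a clear timer."""

    def __init__(self, expires_at, clear_delay=CLEAR_DELAY):
        self.expires_at = expires_at    # time at which the ack runs out anyway
        self.clear_delay = clear_delay  # assumed option, not a real setting
        self.green_since = None         # start of the current green streak
        self.cleared = False

    def update(self, color, now):
        """Feed one status report; return True while the ack still applies."""
        if self.cleared or now >= self.expires_at:
            self.cleared = True
            return False
        if color != "green":
            # Non-green during the wait: reset the timer, the ack persists.
            self.green_since = None
            return True
        if self.green_since is None:
            # Status just went green: start the timer.
            self.green_since = now
        if now - self.green_since >= self.clear_delay:
            # Green for the whole delay: the ack is cleared.
            self.cleared = True
            return False
        return True
```

With this model, a status that goes green for five minutes and then red again stays acknowledged, which is the reboot case from the quoted text; only twelve uninterrupted green minutes (or the ack expiring) make the next red show up again.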

> 
> > Definition (Edit Critical Systems);
> > The easiest way for us: we made standard definitions and add hosts
> > to these templates. Works fine.
> > But I miss a connection between alerts and the critical view
> > definition. Something like an option in hobbit-alerts.cfg to
> > define that this rule is also valid for the critical view.
> > Send an email when an alert shows up in the critical view, with
> > all the possibilities from hobbit-alerts.cfg.
> 
> Wouldn't these two do the same thing ? 

Currently, during the daytime the recovery group gets alerts on the in-house pager.
These are the same systems as in the operator view, but the definition is in hobbit-alerts.cfg.

By the way, in the page.log I get this message from my custom pager script:
2007-11-09 09:05:00 hobbitd_alert: Got message 52634, expected 52615
Maybe the reason is the long runtime of the script, which sends the message through a slow analog modem connection on another server; this takes 30 seconds to finish.
But I don't know what this message really means; everything seems to work as expected.

> Using the alert definitions to control the critical view is an
> interesting idea, I hadn't thought of that.
> 
> > Special case: missed or belated messages at the Operation Center;
> > Currently some applications/scripts send alerts to the Console
> > View, and the Operation Center makes an alert call for each event.
> > A problem in Hobbit/BB is that when changes happen within red
> > messages, the Operation Center doesn't notice until the
> > acknowledge time runs out and they make the alert call again.
> > This can happen, for example, in the disk status test (a
> > second filesystem goes red) or with nested tests/logfiles.
> > With the Event Console they get two messages (one for each
> > filesystem).
> 
> This is a problem with all of the tests that have multiple ways of going
> red: disk, procs, msgs and http are the common ones. I don't have a
> solution to that right now. The way Hobbit works right now assumes that
> when you get an alert about the "disk" status, you keep on fixing it
> until the status goes green - and then the Operations Center won't need
> to raise a ticket for the second event.
> 

It's as you say: when it's red, I have to keep fixing it until the test is green again.
Maybe we will split up some tests (for example, make a separate test for important procs, or split custom tests).
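
As a rough sketch of what splitting a test could look like, the Python below reports each important process under its own status and lumps the rest into one combined status, so a second failure is visible as a separate red. Everything here is hypothetical and only for illustration: the function name split_status, the "proc_" prefix for the extra status names and the example process names are assumptions, not Hobbit conventions.

```python
def split_status(procs, important):
    """Return one status per important process plus a combined 'procs' status.

    procs:     dict mapping process name -> True if it is running
    important: set of process names that get their own status
    """
    statuses = {}
    rest = {}
    for name, running in procs.items():
        if name in important:
            # Hypothetical naming: one separate status per important process.
            statuses["proc_" + name] = "green" if running else "red"
        else:
            rest[name] = running
    # The remaining processes still share a single combined status.
    statuses["procs"] = "green" if all(rest.values()) else "red"
    return statuses


# Example with made-up process names: httpd is down, the others are running.
example = split_status({"httpd": False, "sshd": True, "cron": True}, {"httpd"})
```

Here the failed httpd turns only its own status red, and a later failure of cron would turn the combined status red as a new, separately visible event.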

Roland

 


