[hobbit] Use hobbit in operation center with critcal systems view
Eduard Michels
emichels at quicksoft.com.br
Fri Nov 9 11:38:09 CET 2007
> -----Original Message-----
> From: Henrik Stoerner [mailto:henrik at hswn.dk]
> Sent: Thursday, November 8, 2007 21:27
> To: hobbit at hswn.dk
> Subject: Re: [hobbit] Use hobbit in operation center with
> critcal systems view
>
> On Mon, Oct 15, 2007 at 09:29:53AM +0200, Gräub Roland wrote:
> > In our environment the Operation Center always calls when an alert
> > shows up on their Event Console and acknowledges the alert. With this
> > action the alert is no longer visible to the operators.
> >
> > Now the following questions/thoughts came up when we looked closer;
> >
> > Acknowledge;
> > If an alert is acknowledged by the operators in critical systems, this
> > is a fixed acknowledgement for the given time, even when there is a
> > status change.
> > When a problem is fixed and goes red/yellow again, it will not show up
> > in the critical view until the acked time has expired.
> > There should be an option to ack an alert until a status change (like
> > in disable until ok).
>
> I decided against the "ack-until-ok" method, because in my
> experience systems often go briefly ok while being fixed, and
> then they crash again. (E.g. you'd reboot a server and all
> the processes start up, but one process that is being
> monitored dies after a few minutes.) So the monitoring
> reports OK for a few minutes, and then goes red - if you did
> use an "ack-until-ok" it would show up on the critical
> systems view again, triggering a new ticket.
>
> What happens now is that when the status goes green, a timer
> kicks off in Hobbit which lasts 12 minutes (i.e. 2 normal
> test cycles, plus a bit for good measure). If the test has
> been OK throughout those 12 minutes then the ack is cleared;
> if it goes non-green during that time the timer is reset and
> the ack persists (at least until it eventually expires).
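
The timer behaviour described above can be sketched roughly as follows.
This is only an illustration in Python, not Hobbit's actual code: the
class name Ack, its fields and the update() method are all made up for
the example, and the only value taken from the description is the
12-minute hold.

```python
GREEN_HOLD_SECONDS = 12 * 60  # from the description: 2 test cycles plus a bit


class Ack:
    """Hypothetical model of an acknowledgement (not Hobbit's real structure)."""

    def __init__(self, expires_at):
        self.expires_at = expires_at  # time at which the ack expires anyway
        self.green_since = None       # start of the current green streak
        self.active = True

    def update(self, color, now):
        """Feed one status report; return True while the ack is still in force."""
        if not self.active:
            return False
        if now >= self.expires_at:
            # The ack always ends when its own time runs out.
            self.active = False
        elif color == "green":
            if self.green_since is None:
                # Status just went green: start the timer.
                self.green_since = now
            elif now - self.green_since >= GREEN_HOLD_SECONDS:
                # Green throughout the whole hold period: clear the ack.
                self.active = False
        else:
            # Went non-green again: reset the timer, the ack persists.
            self.green_since = None
        return self.active
```

With this sketch, a status that goes green at t=300, red at t=600 and
green again at t=900 keeps its ack until it has stayed green for a full
12 minutes counted from t=900 (or until expires_at, whichever is first).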
>
> > The option Host-ack seems to be broken; on my system only one test is
> > acknowledged although the Host-ack checkbox is selected.
>
> A quick test says you're right. Will have to look into that.
>
> > Log;
> > A log/report from the critical view is missing. A report with
> > information about the alerts and the acknowledgements that were made
> > in Critical systems would be helpful.
>
> Right now it isn't even being logged, except inside the
> Hobbit daemon. A reporting tool is needed, I agree.
>
> > Definition (Edit Critical Systems);
> > Easiest way for us: make standard definitions and add hosts to these
> > templates. Works fine.
> > But I miss a connection between alerts and the critical view
> > definition. Something like an option in hobbit-alerts.cfg to define
> > that this rule is also valid for the critical view.
> > Send an email when an alert shows up in the critical view, with all
> > the possibilities from hobbit-alerts.cfg.
>
> Wouldn't these two do the same thing ?
> Using the alert definitions to control the critical view is
> an interesting idea, I hadn't thought of that.
>
> > Special case: messages missed or noticed late by the Operation Center;
> > Currently some applications/scripts send alerts to the Console View,
> > and the Operation Center makes an alert call for each event.
> > A problem in Hobbit/BB is that when changes happen in red messages,
> > the Operation Center doesn't notice until the acknowledge time runs
> > out and they make the alert call again.
> > This can happen for example in the disk status test (a second
> > filesystem goes red) or with nested tests/logfiles. With the Event
> > Console they get two messages (one for each filesystem).
>
> This is a problem with all of the tests that have multiple
> ways of going red: disk, procs, msgs and http are the common
> ones. I don't have a solution to that right now. The way Hobbit
> works right now assumes that when you get an alert about the
> "disk" status, you keep on fixing it until the status goes
> green - and then the Operations Center won't need to raise a
> ticket for the second event.
>
As a solution to this problem, I count the alerts within each test: if the
number of alerts has changed, a new alert is generated with the current
status of the test.
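
A minimal sketch of the counting idea, for illustration only. It is
written in Python and is not taken from any actual script; the class
name AlertCountTracker, the should_realert() method and the idea of
passing in a list of red items (e.g. filesystems) are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class AlertCountTracker:
    """Remembers how many red items each (host, test) reported last time."""

    last_counts: dict = field(default_factory=dict)

    def should_realert(self, host, test, red_items):
        """Return True when the number of red items differs from the last report."""
        key = (host, test)
        count = len(red_items)
        previous = self.last_counts.get(key)
        self.last_counts[key] = count
        if previous is None:
            # First report for this test: alert only if something is red.
            return count > 0
        # Any change in the count (up or down) generates a new alert.
        return count != previous
```

For the disk example from the thread: a report with only /var red gives
an alert, the same report again gives none, and a later report with
/var and /home red gives a new alert even though the test was already
red and acknowledged.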
>
> Regards,
> Henrik