Re: [hobbit] purple page grouping & alert acknowledgment
On Mon, Feb 28, 2005 at 01:28:18PM -0500, Tom Georgoulias wrote:
>
> While we're on the topic of purple status messages... Hobbit is config'd
> to turn a host purple if it hasn't heard from it in 30 mins. I want
> mine to go purple after 15, so I changed the PURPLEDELAY from "30" to
> "15" in hobbitserver.cfg, but that doesn't seem to make a difference.
> What else needs to be changed?
The program that generates the status message also determines how
long it is valid. So this is something you set on each
BB client or extension script. You actually cannot set it anywhere for
the network tests performed by bbtest-net (I just checked and was a
bit surprised that I had not provided some way of changing this).
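
As a rough illustration of the point above, the sketch below builds a status message that carries its own validity period. It assumes the `status+LIFETIME HOSTNAME.TESTNAME COLOR text` message form, with the lifetime given in minutes and the dots in the hostname written as commas. The function name `build_status_message` and the sample host are hypothetical, and the exact syntax should be checked against the documentation for your Hobbit version.

```python
def build_status_message(host: str, test: str, color: str,
                         text: str, lifetime_minutes: int) -> str:
    """Build a status message that is valid for `lifetime_minutes`.

    Assumption: the server accepts "status+LIFETIME" and expects the
    dots in the hostname to be replaced by commas.
    """
    if lifetime_minutes <= 0:
        raise ValueError("lifetime must be a positive number of minutes")
    wire_host = host.replace(".", ",")
    return f"status+{lifetime_minutes} {wire_host}.{test} {color} {text}"


if __name__ == "__main__":
    # Hypothetical client reporting a disk status valid for 15 minutes,
    # so the status would go purple if no update arrives within that time.
    print(build_status_message("www.example.com", "disk", "green",
                               "Disk OK", 15))
```

Sending the message to the server is left out here; a client or extension script would pass the resulting string to whatever transport it already uses for status reports.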
> I think I found a loophole that may cause problems in certain
> circumstances: Say I get a red alert for something, give an estimate of
> 120 mins to fix it, and the host goes purple 45 mins later (i.e. it
> crashes), before the ack clears. That ack stays in the red state and I
> won't get a page for the red -> purple transition until after the 120
> mins passed and paging resumes (presumably because the ack wasn't
> cleared because it never went green before going purple). This could be
> bad news if I have a system that crashes when the support tech is busy
> with other things or if a system is brought back online after a purple
> status and returns to something non-green (i.e. disk is the only thing
> that is monitored on the system, and it immediately goes to red after
> boot up and stays that way for a while).
There are lots of ways you can outsmart the system. And you needn't
have a purple status in-between:
1) Disk fills up and goes red
2) Clueless admin acks the disk alert for 60 minutes, then reboots
the server because that "usually fixes things"
3) Disk stays red and no alerts go out until an hour has passed
In such cases there is little Hobbit can do. When you ack an alert,
you take over the responsibility for that status for the time the ack
is valid. If you "fix" something without checking that it actually did
solve the problem, you're asking for trouble.
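
The behaviour described above can be modelled in a few lines. This is only a simplified sketch of the rule "an unexpired ack suppresses alerts", not Hobbit's actual alert code. The names `Ack` and `should_page` are hypothetical, times are minutes since the first alert, and treating red and purple as the paging colors is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: only these colors trigger a page in this model.
ALERT_COLORS = {"red", "purple"}


@dataclass
class Ack:
    valid_until: float  # minute at which the acknowledgment expires


def should_page(color: str, now: float, ack: Optional[Ack]) -> bool:
    """Return True if a page would go out for `color` at minute `now`."""
    if color not in ALERT_COLORS:
        return False
    # An unexpired ack suppresses paging no matter what happened to the
    # host in the meantime (reboot, purple, back to red, ...).
    if ack is not None and now < ack.valid_until:
        return False
    return True


if __name__ == "__main__":
    ack = Ack(valid_until=60)  # admin acks the disk alert for 60 minutes
    for minute, color in [(10, "red"), (45, "purple"), (59, "red"), (60, "red")]:
        print(minute, color, should_page(color, minute, ack))
```

In this model nothing is paged at minutes 10, 45, or 59, and paging resumes at minute 60, which matches the three-step scenario: the disk stays red and no alert goes out until the hour has passed.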
If you really want it, it's not a big problem to implement an
"de-acknowledge" function. It might even be worthwhile for reporting
purposes, to keep track of how much time your admins spend on
troubleshooting. I'm open to suggestions.
Regards,
Henrik