[hobbit] No pages when going from yellow to red

Henrik Stoerner henrik at hswn.dk
Mon Nov 7 22:27:21 CET 2005


On Mon, Nov 07, 2005 at 03:56:37PM -0500, Pat Vaughan wrote:
> 
> > No, and that might be something that could change. The repeat-
> > checking code currently identifies an alert by the combination
> > of hostname, servicename and recipient; I could easily change
> > that so a separate line in the config-file would result in a new
> > set of repeat-checks.
> 
> Is this something that might make it into the next version?  I'm almost
> ready to take a snapshot if I have to.  This bit me again today.

I did some work on this yesterday - while working on it, I found
a bug in the current version. From my
Changes file (http://www.hswn.dk/beta/Changes):

* The handling of alerts was counting the duration of an event
  based on when the color last changed. This meant that each
  time the color changed, any DURATION counters were reset.
  This would cause alerts to not go out if a status was changing
  between yellow and red faster than any DURATION setting.
  Changed this to count the event start as the *first* time the
  status went into an alert state (yellow or red, usually).
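
The change described above can be sketched roughly as follows. This is
an illustration in Python, not Hobbit's actual C code; the class name
EventTracker, the ALERT_COLORS set and the use of plain seconds for
timestamps are all assumptions made for the example.

```python
# Hypothetical sketch: count event duration from the FIRST time a status
# entered an alert state, not from the most recent color change.

ALERT_COLORS = {"yellow", "red"}  # assumption: the colors that count as alert states


class EventTracker:
    def __init__(self):
        self.event_start = None  # time the status first entered an alert state

    def update(self, color, now):
        """Record a status update seen at time `now` (seconds)."""
        if color in ALERT_COLORS:
            # Only set the start time once; yellow<->red flapping keeps it.
            if self.event_start is None:
                self.event_start = now
        else:
            # Status recovered: the event is over.
            self.event_start = None

    def duration(self, now):
        """Seconds the event has lasted, or None if no event is active."""
        if self.event_start is None:
            return None
        return now - self.event_start


tracker = EventTracker()
tracker.update("yellow", 0)
tracker.update("red", 300)     # color change no longer resets the counter
tracker.update("yellow", 600)
print(tracker.duration(900))   # duration is measured from time 0
```

With the old behaviour the last line would have reported 300 seconds
(time since the last color change), so a DURATION threshold longer than
the flapping period would never be reached.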

I then also implemented the following change:

* When a status goes yellow->red, the repeat-interval is
  now cleared for any alerts. This makes sure you get an
  alert immediately for the most severe state seen. This
  only affects the first such transition; if the status
  later changes between yellow/red, the normal REPEAT
  interval applies.

So you'll now get an alert when it goes yellow, and another
when it goes red (if your configuration includes alerts for 
these colors, obviously).
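
A rough sketch of the yellow->red behaviour described above, again in
Python and not taken from the Hobbit source. RepeatState, should_alert
and the seen_red flag are hypothetical names, and the sketch assumes a
single recipient that is configured to get alerts for both colors.

```python
# Hypothetical sketch: clear the repeat interval on the first transition
# to red, then fall back to the normal REPEAT interval.

class RepeatState:
    def __init__(self, repeat_seconds):
        self.repeat = repeat_seconds
        self.next_alert = 0      # earliest time the next alert may be sent
        self.seen_red = False    # has this event already reached red?

    def should_alert(self, color, now):
        """Return True if an alert should go out for this status update."""
        if color not in ("yellow", "red"):
            # Recovered: forget everything about the previous event.
            self.next_alert = 0
            self.seen_red = False
            return False
        if color == "red" and not self.seen_red:
            # First time this event is red: clear the repeat interval
            # so the alert goes out immediately.
            self.seen_red = True
            self.next_alert = now
        if now >= self.next_alert:
            self.next_alert = now + self.repeat
            return True
        return False


state = RepeatState(repeat_seconds=30 * 60)
for minute, color in [(0, "yellow"), (5, "red"), (10, "yellow"),
                      (15, "red"), (35, "red")]:
    print(minute, color, state.should_alert(color, minute * 60))
```

In this run alerts go out at minutes 0 (yellow), 5 (first red) and 35
(normal repeat after the red alert); the flapping at minutes 10 and 15
is suppressed.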

This is in the current snapshot, and will also be in the next
release. I am tempted to do a 4.1.3 release fairly soon - this
problem is fairly serious. And the disk graph problem that is
also fixed in the current snapshot annoys quite a few people.


> It seems
> to me that the most intelligent change would be to generate a new
> repeat-check for every line in the hobbit-alerts file or, and I haven't
> looked at the code at all, to reset the repeat timer every time a test
> changes color (possibly using a different keyword to keep current setups
> working as anticipated).

I'd rather not have the REPEAT handling tied to the physical layout
of the configuration file - it makes it a lot harder to handle when
the file is changed while alerts are active. I know I wrote something
different in the message you've quoted, but after looking some more
at the problem I've changed my mind.
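
To illustrate the concern, here is a small Python sketch (hypothetical,
not Hobbit code) of repeat state keyed by hostname, servicename and
recipient, as the quoted message describes. The function carry_over and
the example host names and addresses are made up for the illustration.

```python
# Hypothetical sketch: repeat state keyed by (hostname, servicename,
# recipient) survives a config reload even if lines are added or moved.

def carry_over(old_state, rules):
    """Keep repeat timers for every alert that is still configured.

    old_state: dict mapping (hostname, servicename, recipient) -> next alert time
    rules:     list of (hostname, servicename, recipient) in config-file order
    """
    still_configured = set(rules)
    return {key: t for key, t in old_state.items() if key in still_configured}


# Repeat timers for two active alerts before the config file is edited.
state = {
    ("www1", "disk", "admin@example.com"): 1800,
    ("db1", "cpu", "dba@example.com"): 2400,
}

# After the edit: a new rule was inserted at the top and the others moved down.
new_rules = [
    ("mail1", "smtp", "admin@example.com"),
    ("db1", "cpu", "dba@example.com"),
    ("www1", "disk", "admin@example.com"),
]

state = carry_over(state, new_rules)
print(sorted(state.items()))
```

Both timers survive the reload because the key does not depend on where
a rule sits in the file. A key based on the line number would point at
a different rule after the same edit.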

I think the new code strikes a sensible balance between getting
the necessary alerts and not being flooded with them. The current
version works the way it does because I did not want to be
flooded with alerts by a state that kept on changing between
yellow and red - e.g. a disk that is filled right up to the
limit between the warning and panic levels. The new code will
give you that one extra alert telling you that the situation
is critical, but once it has done that it will obey the
REPEAT setting and only send you an alert every 30 minutes
(or whatever your REPEAT interval is).


Regards,
Henrik



