[Xymon] Yellow->red escalation, bug or feature?

Henrik Størner henrik at hswn.dk
Wed Jan 11 22:39:02 CET 2012


On 11-01-2012 20:53, Gore, David W (David) wrote:
> Since it has been argued that it is not exactly a bug I would only
> humbly request that the current behavior is not changed but enhanced for
> those who want it to work differently.   If an alert has been alarming
> for x time and then goes red do you want to wait even longer to be
> alerted.  Yellow time + red time or yellow time and now its red so
> alert, provided the yellow time exceeds the red threshold.

If I understand it correctly, the unhappiness with the current
setup is that the DURATION setting in alerts.cfg counts both yellow and
red time. So when a status goes yellow and stays there for a few hours
before going red, a rule such as

    MAIL cio at example.com COLOR=RED DURATION>3h

will trigger immediately.


Some would argue that if you haven't fixed a problem before it goes 
critical, then your CIO *should* be notified.


The other school of thought argues that this rule means the CIO only 
wants to be informed when something has been really hosed for at least 
three hours. So the yellow warning-time shouldn't count when evaluating 
the DURATION setting for that rule - only the critical time counts.


Is that a correct understanding of the arguments here?


Let's say I implement the 3-hour delay before sending an escalation
notice. What should happen if the status is yellow for two hours, then
goes red for 2h50m, dips back into yellow for 10 minutes and then goes
back to red? Should the earlier 2h50m of red time still count after the
brief yellow interval? Or does a 10-minute yellow status completely
reset the duration counter for the almost-3-hours red status?

I'm not trying to be too pedantic here, but it is the sort of thing
that does happen. So let's discuss how it can best be handled.
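
To make the possible readings concrete, here is a rough sketch that runs
the scenario above through three counting policies. The timeline format,
the function names and the 15 minutes assumed for the second red period
are all made up for illustration; this is not how the alert module works
internally.

```shell
#!/bin/sh
# Hypothetical timeline: "color:minutes" pairs, oldest first.
# The last entry is the current red period, assumed to be 15 minutes old.
TIMELINE="yellow:120 red:170 yellow:10 red:15"

# Policy 1: every non-green minute counts (yellow and red together).
total_nongreen() {
    total=0
    for entry in $1; do
        color=${entry%%:*}
        mins=${entry##*:}
        if [ "$color" != "green" ]; then
            total=$((total + mins))
        fi
    done
    echo "$total"
}

# Policy 2: only red minutes count, accumulated across yellow dips.
cumulative_red() {
    total=0
    for entry in $1; do
        color=${entry%%:*}
        mins=${entry##*:}
        if [ "$color" = "red" ]; then
            total=$((total + mins))
        fi
    done
    echo "$total"
}

# Policy 3: any color change resets the counter, so only the
# current (last) period counts.
current_period() {
    last=0
    for entry in $1; do
        last=${entry##*:}
    done
    echo "$last"
}

echo "all non-green:   $(total_nongreen "$TIMELINE") min"   # 315
echo "cumulative red:  $(cumulative_red "$TIMELINE") min"   # 185
echo "current period:  $(current_period "$TIMELINE") min"   # 15
```

Against a 180-minute threshold, the first two policies would send the
escalation right away, while the third would hold it back for another
2h45m.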


I think Josh is right that changing this will require some sort of 
additional configuration setting to indicate that "this duration value 
applies to the time it's been red only". It's for curbing escalation 
notices. And therefore it is obviously only an issue for those statuses 
that can be yellow - not those that can only be red or green.

It's been quite some time since I last dug into the alert-module code, 
so I cannot say how much effort it will take to add this. Right now I am 
not sure if the alert module has enough information about an alert to be 
able to implement it.


Meanwhile, may I draw your attention to the "SCRIPT" way of sending 
alerts. It's not an ideal solution, but I think it's a usable 
work-around for this problem:

The alert script gets triggered just the same as your MAIL alerts do. 
But your script can query xymond to see when the status last changed (to 
red, presumably) - it's the "lastchange" field stored for a status. So 
you could put something like this in your alert script:

#!/bin/sh

# This script only handles red
if test "$BBCOLORLEVEL" != "red"
then
    exit 0
fi

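# Ask xymond when the status last changed (first line of the reply).
# The xymon command must stay on one line.
REDSTART=`xymon 127.0.0.1 "xymondlog $BBHOSTNAME.$BBSVCNAME fields=lastchange" | head -n 1`
NOW=`date +%s`
REDDURATION=`expr "$NOW" - "$REDSTART"`
if test "$REDDURATION" -lt 10800	# 3-hour (10800 secs) delay
then
    exit 0
fi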

... send the alert ...

(the "head -n 1" is needed, because xymondlog also sends you the full 
status message. On the other hand, that might be useful when generating 
the alert message).
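
If you do want the status text in the alert, one way is to run the query
once and split the reply, rather than throwing the rest away. The sketch
below assumes the reply has the requested field on the first line and the
status message on the lines after it, as described above. XYMON_REPLY
holds a made-up reply so the example runs without a Xymon server; in the
real script it would come from the xymon command shown earlier.

```shell
#!/bin/sh
# Made-up reply for illustration only: first line is the lastchange
# value, the remaining lines stand in for the status message.
XYMON_REPLY="1326300000
red disk usage critical
/var 97% full"

# First line: the time of the last status change.
REDSTART=$(printf '%s\n' "$XYMON_REPLY" | head -n 1)

# Everything after the first line: the status message text.
STATUSMSG=$(printf '%s\n' "$XYMON_REPLY" | tail -n +2)

echo "last change: $REDSTART"
echo "message:"
echo "$STATUSMSG"
```

STATUSMSG can then be used as the body of whatever notice the script
sends.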


Regards,
Henrik


