[Xymon] Yellow->red escalation, bug or feature?
Henrik Størner
henrik at hswn.dk
Wed Jan 11 22:39:02 CET 2012
On 11-01-2012 20:53, Gore, David W (David) wrote:
> Since it has been argued that it is not exactly a bug, I would only
> humbly request that the current behavior is not changed but enhanced for
> those who want it to work differently. If an alert has been alarming
> for x time and then goes red, do you want to wait even longer to be
> alerted? Yellow time + red time, or yellow time and now it's red so
> alert, provided the yellow time exceeds the red threshold.
If I understand it correctly, the unhappiness with the current
setup is that the DURATION setting in alerts.cfg counts both yellow and
red time. So when a status goes yellow and stays there for a few hours
before going red, a rule such as
MAIL cio@example.com COLOR=RED DURATION>3h
will trigger immediately.
Some would argue that if you haven't fixed a problem before it goes
critical, then your CIO *should* be notified.
The other school of thought argues that this rule means the CIO only
wants to be informed when something has been really hosed for at least
three hours. So the yellow warning time shouldn't count when evaluating
the DURATION setting for that rule - only the critical time counts.
Is that a correct understanding of the arguments here?
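To make the difference between the two readings concrete, here is a toy model in shell. It is not taken from Xymon's alert module. The function names counted_seconds and rule_fires are made up for illustration, and the timestamps are invented seconds. A status goes yellow at t=0 and turns red four hours later. The rule is evaluated at the moment it turns red.

```shell
#!/bin/sh
# Toy comparison of two readings of "COLOR=RED DURATION>3h".
# Not Xymon code: names and timestamps are hypothetical.

LIMIT=10800   # 3 hours in seconds

# Seconds counted toward DURATION.
#   mode "all": from when the status first left green (yellow + red),
#               i.e. the current behaviour as described in this thread
#   mode "red": only from when the status turned red
counted_seconds() {   # args: mode nongreen_since red_since now
    if [ "$1" = "all" ]; then
        echo $(( $4 - $2 ))
    else
        echo $(( $4 - $3 ))
    fi
}

# True if "DURATION>3h" would match under the given mode.
rule_fires() {        # same args as counted_seconds
    [ "$(counted_seconds "$@")" -gt "$LIMIT" ]
}

# Yellow at t=0, red at t=14400 (4h), evaluated the moment it turns red.
for mode in all red; do
    if rule_fires "$mode" 0 14400 14400; then
        echo "$mode: alert the CIO now"
    else
        echo "$mode: keep waiting"
    fi
done
```

Under "all" the rule matches at once (14400 > 10800). Under "red" it counts 0 seconds, so nothing is sent until the status has been red for more than three hours.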
Let's say I implement the 3-hour delay before sending an escalation
notice. What should happen if the status is yellow for two hours, then
goes red for 2h50m, dips back into yellow for 10 minutes and then goes
back to red? Should the 2h50m of red still count once the status has been
yellow for a while? Or does a 10-minute yellow status completely reset the
duration counter for the almost-3-hour red period?
I'm not trying to be too pedantic here, but this is the sort of thing
that does happen. So let's discuss how it can best be handled.
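The two possible answers to the yellow-dip question can be modelled as two counting policies. This is a hypothetical sketch, not a proposal for Xymon's actual implementation. The status history is written as COLOR:MINUTES segments, oldest first, and the last segment is the current state.

```shell
#!/bin/sh
# Toy model: how much red time counts toward a 3-hour escalation delay?
# History format (hypothetical): COLOR:MINUTES segments, oldest first.

# Policy 1: any non-red period resets the counter, so only the current
# unbroken red run counts.
red_minutes_reset() {
    run=0
    for seg in "$@"; do
        color=${seg%%:*}
        mins=${seg#*:}
        if [ "$color" = "red" ]; then
            run=$((run + mins))
        else
            run=0
        fi
    done
    echo "$run"
}

# Policy 2: yellow dips only pause the counter; red time is summed
# until the status returns to green.
red_minutes_accumulate() {
    total=0
    for seg in "$@"; do
        color=${seg%%:*}
        mins=${seg#*:}
        case "$color" in
            red)   total=$((total + mins)) ;;
            green) total=0 ;;
        esac
    done
    echo "$total"
}

# The scenario above, at the moment the status turns red again:
HISTORY="yellow:120 red:170 yellow:10 red:0"
echo "reset:      $(red_minutes_reset $HISTORY) min"
echo "accumulate: $(red_minutes_accumulate $HISTORY) min"
```

With a 180-minute threshold, the reset policy needs another full three hours of red. The accumulate policy escalates after just 10 more minutes.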
I think Josh is right that changing this will require some sort of
additional configuration setting to indicate that "this duration value
applies to the time it's been red only". It's for curbing escalation
notices. And therefore it is obviously only an issue for those statuses
that can be yellow - not those that can only be red or green.
It's been quite some time since I last dug into the alert-module code,
so I cannot say how much effort it will take to add this. Right now I am
not sure if the alert module has enough information about an alert to be
able to implement it.
Meanwhile, may I draw your attention to the "SCRIPT" way of sending
alerts. It's not an ideal solution, but I think it's a usable
work-around for this problem:
The alert script gets triggered just the same as your MAIL alerts do.
But your script can query xymond to see when the status last changed (to
red, presumably) - it's the "lastchange" field stored for a status. So
you could put something like this in your alert script:
#!/bin/sh
# This script only handles red alerts
if test "$BBCOLORLEVEL" != "red"
then
    exit 0
fi

# Ask xymond when this status last changed color (Unix epoch seconds)
REDSTART=`xymon 127.0.0.1 "xymondlog $BBHOSTNAME.$BBSVCNAME fields=lastchange" | head -n 1`
NOW=`date +%s`
REDDURATION=`expr $NOW - $REDSTART`

# 3-hour (10800 secs) delay before escalating
if test $REDDURATION -lt 10800
then
    exit 0
fi

# ... send the alert ...
(The "head -n 1" is needed because xymondlog also sends you the full
status message. On the other hand, that might be useful when generating
the alert message.)
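If you want to reuse the status message, one option is to query xymond once and split the reply. Based on the note above, the first line holds the lastchange value and the remaining lines hold the status text. The helper names below are hypothetical. The commented usage lines assume a mail(1) command that accepts -s, and they assume Xymon passes the recipient to the script in $RCPT. Check both against your installation.

```shell
#!/bin/sh
# Split a "xymondlog HOST.TEST fields=lastchange" reply (read on stdin):
# line 1 is the lastchange timestamp, the rest is the status message.

reply_lastchange() {   # stdin: xymondlog reply
    head -n 1
}

reply_statustext() {   # stdin: xymondlog reply
    tail -n +2
}

# Possible use inside the alert script (assumptions noted above):
#   REPLY=`xymon 127.0.0.1 "xymondlog $BBHOSTNAME.$BBSVCNAME fields=lastchange"`
#   REDSTART=`printf '%s\n' "$REPLY" | reply_lastchange`
#   printf '%s\n' "$REPLY" | reply_statustext | mail -s "$BBHOSTNAME.$BBSVCNAME is red" "$RCPT"
```

This costs one query per alert instead of two, and the status text comes from the same snapshot as the timestamp.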
Regards,
Henrik