[Xymon] Yellow->red escalation, bug or feature?

SebA spah at syntec.co.uk
Thu Jan 12 13:07:14 CET 2012


xymon-bounces at xymon.com wrote:
> On 11-01-2012 20:53, Gore, David W (David) wrote:
>> Since it has been argued that it is not exactly a bug I would only
>> humbly request that the current behavior is not changed but enhanced
>> for those who want it to work differently.   If an alert has been
>> alarming for x time and then goes red, do you want to wait even
>> longer to be alerted?  Should it be yellow time + red time, or yellow
>> time and now it's red so alert, provided the yellow time exceeds the
>> red threshold?

Yes, I do want to wait even longer.  I want to wait for the duration that
was specified in the alert rule, for the colour that was specified in the
alert rule.  And I think this is how one would expect xymond_alert to behave
given the syntax of the rule, with no prior knowledge of Xymon (and not
having read the documentation).

> If I understand it correctly, then the unhappiness with the current
> setup is that the DURATION setting in alerts.cfg counts both yellow and
> red time. So when a status goes yellow, stays there for a few hours time
> before going red - then a rule such as
> 
>     MAIL cio at example.com COLOR=RED DURATION>3h
> 
> will trigger immediately.
> 
> 
> Some would argue that if you haven't fixed a problem before it goes
> critical, then your CIO *should* be notified.

Sounds like, for people who want that behaviour, they need a (yet to be
implemented) WARNINGDURATION> rule.  This implies that tier1 support
probably get alerts on yellows, which I expect could result in a lot of
false positive alerts for them!  But if that's how they want it, that's
their affair.

> The other school of thought argues that this rule means the CIO only
> wants to be informed when something has been really hosed for at least
> three hours. So the yellow warning-time shouldn't count when
> evaluating the DURATION setting for that rule - only the critical
> time counts.
> 
> 
> Is that a correct understanding of the arguments here ?

Yes.
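
To make the two readings concrete, here is a minimal Python sketch.  It is
not taken from xymond_alert; the function name alert_due and the history
format (a list of colour/length segments, oldest first) are assumptions
made purely for illustration.

```python
from datetime import timedelta

def alert_due(history, threshold, count_colors):
    """Return True if the trailing, uninterrupted time spent in
    count_colors exceeds threshold.

    history is a hypothetical list of (color, timedelta) segments,
    oldest first, with the last segment ending "now".
    """
    elapsed = timedelta()
    for color, length in reversed(history):
        if color not in count_colors:
            break  # a colour outside the set ends the run
        elapsed += length
    return elapsed > threshold

# Yellow for four hours, then red for the last five minutes.
history = [("yellow", timedelta(hours=4)), ("red", timedelta(minutes=5))]
threshold = timedelta(hours=3)

# Reading 1: DURATION counts yellow and red time together.
counts_both = alert_due(history, threshold, {"yellow", "red"})

# Reading 2: DURATION counts only the colour named in the rule.
counts_red_only = alert_due(history, threshold, {"red"})
```

Under the first reading the CIO rule fires the moment the status turns red;
under the second it stays quiet until the status has been red for more than
three hours.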

> Let's say I implement the 3-hour delay before sending an escalation
> notice. What should happen if the status is yellow for two hours, then
> goes red for 2h50m, dips back into yellow for 10 minutes and then goes
> back to red? Should the 2h50m count after the status was yellow for a
> while? Or does a 10 minute yellow status completely reset the duration
> counter for the almost-3-hours red status?

I already responded to this issue in my old post here:
http://lists.xymon.com/oldarchive/2009/02/msg00145.html, but I'll quote the
relevant part:

"...since this test can flap between yellow and red and I consider
yellow to be a sufficient degree of recovery that I don't want another alert
as soon as it goes red again. If we look at disk in particular though,
surely if it is flapping between yellow and red the problem isn't too
serious. If one does want an alert for this, one can eliminate the DURATION
rule. If one does not, the DURATION rule should be a way of preventing
getting alerts for the flapping behaviour. This is what I've always
considered the use of the DURATION rule (although I was wrong given the way
it is currently working)."
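
Henrik's flapping scenario can be worked through with a small Python
sketch.  The function red_time and its yellow_resets switch are
hypothetical names for illustration only, not anything Xymon provides, and
the 15 minutes of red at the end is an assumed value for "goes back to
red".

```python
from datetime import timedelta

def red_time(history, yellow_resets):
    """Sum the red time counted towards a red-only DURATION.

    history is a hypothetical list of (color, timedelta) segments,
    oldest first.  If yellow_resets is True, any yellow segment ends
    the count; otherwise yellow is skipped and only green ends it.
    """
    total = timedelta()
    for color, length in reversed(history):
        if color == "red":
            total += length
        elif color == "yellow" and not yellow_resets:
            continue  # yellow is ignored, keep looking further back
        else:
            break  # green, or a yellow that counts as recovery
    return total

# Yellow 2h, red 2h50m, yellow 10m, then red again for (assumed) 15m.
history = [
    ("yellow", timedelta(hours=2)),
    ("red", timedelta(hours=2, minutes=50)),
    ("yellow", timedelta(minutes=10)),
    ("red", timedelta(minutes=15)),
]

reset_total = red_time(history, yellow_resets=True)      # 15 minutes
carried_total = red_time(history, yellow_resets=False)   # 3 hours 5 minutes
```

With yellow treated as a recovery, which is the behaviour argued for in the
quote above, the counter restarts and a DURATION>3h rule does not fire.  If
the earlier red time is carried over, the same rule fires 10 minutes after
the status returns to red.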

> I'm not trying to be too pedantic here, but it is the sort of thing
> that does happen. So let's discuss how it can best be handled.
> 
> 
> I think Josh is right that changing this will require some sort of
> additional configuration setting to indicate that "this duration value
> applies to the time it's been red only". It's for curbing escalation
> notices. And therefore it is obviously only an issue for those statuses
> that can be yellow - not those that can only be red or green.

Continuing my quote from my old post:
"Perhaps a more flexible and useful solution, while
still remaining easy to use, is to incorporate the change you suggest
[which was (quote Henrik): "What would probably be best was for Xymon to
calculate the duration based on the COLOR-settings defined for the alert"]
with a RECOVERY= rule in the alerts. So each rule can specify what colour
constitutes a recovery. This means that some tests can have yellow while
others have green, allowing for different alerting behaviour for flapping
depending on the test, and it also allows those who get notified of
recoveries to have this information when they want. :)"

<snip>
> 
> Regards,
> Henrik

And, at the risk of dirtying this thread, a closely related issue is my
original post in the same thread:
http://lists.xymon.com/oldarchive/2009/01/msg00364.html
Quote:
"It seems the combination of TIME=W:0845:2355 and DURATION>15 in
hobbit-alerts.cfg means the earliest an alert can be sent out is 9 am.  Is
this what you would expect?  I would have expected these two rules to mean
the test should be in an alarm colour for more than 15 minutes and be
between the times of 08:45 and 23:55, weekdays.  Instead it seems to be
relating the DURATION with the time such that the DURATION only applies
_during_ the TIME."

So, if the CIO has a DURATION > 3 hours for a particular alert and a global
TIME=W:0845:2355 (to retain their beauty sleep) he (or she) will only get
the alert after 11:45 am.  Might not be what they want.
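
The arithmetic can be sketched in Python.  This is only a model of the two
readings on a single weekday: it ignores the end of the TIME window and
weekends, the 02:00 start time is an assumed example, and first_eligible
is a hypothetical helper, not part of Xymon.

```python
from datetime import datetime, timedelta

def first_eligible(problem_start, window_open, duration, count_only_in_window):
    """Return the moment at which the DURATION threshold is reached.

    count_only_in_window=True models the behaviour described above,
    where the duration clock only runs while TIME matches.
    count_only_in_window=False models the reading where DURATION is
    measured from the start of the problem and TIME merely gates
    delivery.
    """
    if count_only_in_window:
        return max(problem_start, window_open) + duration
    return max(problem_start + duration, window_open)

# Assumed example: the problem starts at 02:00, TIME opens at 08:45.
problem_start = datetime(2012, 1, 12, 2, 0)
window_open = datetime(2012, 1, 12, 8, 45)

cio_observed = first_eligible(problem_start, window_open, timedelta(hours=3), True)
cio_expected = first_eligible(problem_start, window_open, timedelta(hours=3), False)
short_observed = first_eligible(problem_start, window_open, timedelta(minutes=15), True)
```

Here cio_observed is 11:45 and cio_expected is 08:45, while short_observed
is 09:00, matching the 9 am seen with DURATION>15.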

Kind regards,

SebA



