[Xymon] Yellow->red escalation, bug or feature?

Elizabeth Schwartz betsy.schwartz at gmail.com
Sat Jan 14 23:49:47 CET 2012


Exactly. If something is yellow, by definition, we've said it's NOT critical.

Our most frequent example is disk space. A disk which fills up 100%
will cause a critical disruption to production. On many disks we go
yellow at 80%, to give ourselves plenty of warning, and red at 95%.
Now when a disk goes red, I do want someone to look at it immediately,
but it doesn't really matter that it's been yellow for a long time. In
fact, the LONGER it's been yellow the LESS urgent it is, because it's
not filling up very quickly. Our senior team does NOT want to be paged
for this!
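
To put rough numbers on that, here is a minimal Python sketch. It is
not anything Xymon ships. The 80% and 95% thresholds are the ones
above; the function names are made up for illustration. A disk that is
still yellow after N hours has risen fewer than 15 percentage points in
that time, so its average fill rate is bounded by 15/N points per hour.

```python
YELLOW_PCT = 80.0  # warning threshold from the example above
RED_PCT = 95.0     # critical threshold from the example above


def disk_color(used_pct):
    """Map a disk usage percentage to a status color (hypothetical helper)."""
    if used_pct >= RED_PCT:
        return "red"
    if used_pct >= YELLOW_PCT:
        return "yellow"
    return "green"


def max_fill_rate(hours_yellow):
    """Upper bound on the average fill rate, in percentage points per hour,
    for a disk that went yellow at or above 80% and is still below 95%
    after hours_yellow hours."""
    return (RED_PCT - YELLOW_PCT) / hours_yellow


if __name__ == "__main__":
    for hours in (3, 24, 72):
        print(hours, "h yellow: under", round(max_fill_rate(hours), 2), "points/hour")
```

The bound shrinks as the yellow period grows, which is the point: a
long-running yellow disk is a slow-filling disk.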


If I wanted something to page when it's been yellow for three hours,
I've already got the capability of paging after it's been yellow for
three hours.

When something turns red, I want to follow the rules and timing for reds.

> Let's say I implement the 3-hour delay before sending an escalation
> notice. What should happen if the status is yellow for two hours,
> then goes red for 2h50m, dips back into yellow for 10 minutes and
> then goes back to red? Should the 2h50m count after the status was
> yellow for a while? Or does a 10 minute yellow status completely
> reset the duration counter for the almost-3-hours red status?

This case doesn't make a lot of sense to me. If something's been red
for 2h50m, I've probably already escalated it up to the hilt. The above
scenario is only a problem in the case where a red alert is set to be
ignored for the first three hours. I don't think that's a common
scenario. Anything we could ignore for 3 hours is probably a yellow.
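
A small Python sketch of that timeline, under stated assumptions: this
is not Xymon's actual alert logic, just an illustration of "each color
follows its own rules and timing", with the duration counter restarting
on every color change. The function name, the rule dictionaries and the
60-minute length of the final red period are all made up for the
example.

```python
def page_times(history, delay_by_color):
    """Return (minute, color) pairs at which a page would be sent.

    history is an ordered list of (color, minutes) periods, with
    adjacent periods assumed to have different colors. delay_by_color
    maps a color to how many minutes it must persist before paging.
    The counter restarts whenever the color changes.
    """
    pages = []
    clock = 0  # minutes since the start of the timeline
    for color, minutes in history:
        delay = delay_by_color.get(color)
        if delay is not None and minutes >= delay:
            pages.append((clock + delay, color))
        clock += minutes
    return pages


# Yellow 2h, red 2h50m, yellow 10m, then red again (length assumed: 60m).
timeline = [("yellow", 120), ("red", 170), ("yellow", 10), ("red", 60)]

# Reds page immediately; yellows page after three hours.
print(page_times(timeline, {"red": 0, "yellow": 180}))

# Reds are also held back for three hours.
print(page_times(timeline, {"red": 180, "yellow": 180}))
```

With reds paging immediately, the first red period pages at the
two-hour mark and the 10-minute dip into yellow changes nothing
important. Only when reds are held back for three hours does the dip
suppress the page entirely.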

Having to write a custom test for every single red in our environment
doesn't seem like a good alternative, especially for the built-in
tests.


