[xymon] xymon_4.3.0-RC1: possible lost alerts

Dominique Frise dominique.frise at unil.ch
Mon Feb 14 17:08:51 CET 2011


On 02/14/11 02:51 PM, Henrik Størner wrote:
> In <4D593040.6090808 at unil.ch> Dominique Frise <dominique.frise at unil.ch> writes:
>
>> what is supposed to happen if you remove the "clear" color from OKCOLORS
>> in xymonserver.cfg ?
>
> Then a "clear" status would trigger alerts, i.e. the xymond_alert
> module would begin to see alert-messages for a clear status (same
> as for yellow, red, purple).
>
> I don't think you would actually see any alerts being sent, unless
> you also change ALERTCOLORS to include the "clear" status.
>
> But that would be a bad idea, since "clear" is also used for
> e.g. "noping" hosts, or for client-side statuses (cpu, disk, ...)
> when the server is down ("conn" status is red means client-side
> tests will not go purple - they go clear).
>
>> We would expect that no recovery message should be sent when a status
>> goes from yellow/red to clear. Only the repeat interval should be reset.
>> Does this make sense ?
>
> Kind of, yes. I don't recall if it was actually tested.
>

(Sorry, same reply was sent before with garbage as top post.)

I don't think it was ;-)
Below are the small changes we made to xymond_alert.c (the version before
your latest changes) to achieve this:


[super at iris xymond]# diff -u xymond_alert.c.dist xymond_alert.c
--- xymond_alert.c.dist Sun Nov 14 18:21:19 2010
+++ xymond_alert.c      Mon Feb 14 15:02:24 2011
@@ -355,7 +355,7 @@
         char *msg;
         int seq;
         int argi;
-       int alertcolors, alertinterval;
+       int alertcolors, alertinterval, okcolors;
         char *configfn = NULL;
         char *checkfn = NULL;
         int checkpointinterval = 900;
@@ -377,6 +377,7 @@
         /* Load alert config */
         alertcolors = colorset(xgetenv("ALERTCOLORS"), ((1 << COL_GREEN) | (1 << COL_BLUE)));
         alertinterval = 60*atoi(xgetenv("ALERTREPEAT"));
+       okcolors = colorset(xgetenv("OKCOLORS"), (1 << COL_RED));

         /* Create our loookup-trees */
         hostnames = rbtNew(name_compare);
@@ -656,7 +657,7 @@
                                         awalk->maxcolor = newcolor;
                                 }
                         }
-                       else {
+                       else if ((okcolors & (1 << newcolor)) != 0) {
                                 /*
                                  * Send one "recovered" message out now, then go to A_DEAD.
                                  * Dont update the color here - we want recoveries to go out
@@ -663,6 +664,11 @@
                                  * only if the alert color triggered an alert
                                  */
                                 awalk->state = A_RECOVERED;
+                       } else {
+                               /*
+                                * This color should not trigger "recovered" messages.
+                                */
+                               awalk->state = A_NORECIP;
                         }


With this in place we can better support alerting for SNMP traps (see 
previous discussion with Buchan 
http://www.xymon.com/archive/2011/02/msg00062.html), but then we want 
all short transitions from an alert state to a clear status to be 
processed by Xymon (not ignored).

Dominique


