[Xymon] Serious trouble, red after yellow didn't page at all tonight

Elizabeth Schwartz betsy.schwartz at gmail.com
Wed Apr 6 05:22:54 CEST 2011


Yesterday we had a red after yellow page all the way up the hierarchy
immediately. Today we had a red after yellow not page at ALL.
It did page in BB (the test is going to both servers during this test
period). We are running Xymon 4.3.0 and really hoping to go live ASAP.

Here are the hist log entries; see it go red at 22:07 for five minutes:

Tue Apr  5 17:32:35 2011 red 1302039155 300
Tue Apr  5 17:37:35 2011 green 1302039455 900
Tue Apr  5 17:52:35 2011 yellow 1302040355 599
Tue Apr  5 18:02:34 2011 red 1302040954 601
Tue Apr  5 18:12:35 2011 yellow 1302041555 1199
Tue Apr  5 18:32:34 2011 red 1302042754 900
Tue Apr  5 18:47:34 2011 yellow 1302043654 12002
Tue Apr  5 22:07:36 2011 red 1302055656 300
Tue Apr  5 22:12:36 2011 yellow 1302055956
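
For anyone checking the timeline, here is a minimal Python sketch that parses the entries above and lists the red periods. It assumes the last two columns are the epoch start time and the duration in seconds (the arithmetic in the log is consistent with that reading), and that local time is EDT (UTC-4). The names `parse_hist` and `red_periods` are made up for illustration and are not part of Xymon.

```python
from datetime import datetime, timedelta, timezone

# History entries copied from the post above.
HIST = """\
Tue Apr  5 17:32:35 2011 red 1302039155 300
Tue Apr  5 17:37:35 2011 green 1302039455 900
Tue Apr  5 17:52:35 2011 yellow 1302040355 599
Tue Apr  5 18:02:34 2011 red 1302040954 601
Tue Apr  5 18:12:35 2011 yellow 1302041555 1199
Tue Apr  5 18:32:34 2011 red 1302042754 900
Tue Apr  5 18:47:34 2011 yellow 1302043654 12002
Tue Apr  5 22:07:36 2011 red 1302055656 300
Tue Apr  5 22:12:36 2011 yellow 1302055956
"""

EDT = timezone(timedelta(hours=-4))  # assumption: server clock is US Eastern (EDT)


def parse_hist(text):
    """Return (color, start_epoch, duration_seconds_or_None) per line."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # not a history entry
        color = fields[5]
        start = int(fields[6])
        # The still-open last entry has no duration column yet.
        duration = int(fields[7]) if len(fields) > 7 else None
        entries.append((color, start, duration))
    return entries


def red_periods(entries):
    """Return (start_epoch, end_epoch) for every completed red period."""
    return [(s, s + d) for c, s, d in entries if c == "red" and d is not None]


def hhmmss(epoch):
    return datetime.fromtimestamp(epoch, EDT).strftime("%H:%M:%S")


for start, end in red_periods(parse_hist(HIST)):
    print(hhmmss(start), "->", hhmmss(end), f"({(end - start) // 60} min)")
```

This reports four red periods, the last one running from 22:07:36 to 22:12:36.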

History shows critical status:
Tue Apr 5 22:07:36 EDT 2011 OTHER Applications ( "mysqle1" ): CRITICAL

And it paged and emailed earlier in the evening (domain name elided).
It paged correctly at 6:34 and 6:45 but sent nothing at 10:07:

Tue Apr  5 17:34:28 2011 db0.other (10.100.4.51) techops[160] 1302039268 0
Tue Apr  5 17:34:28 2011 db0.com.other (10.100.4.51) alert1[162] 1302039268 0
Tue Apr  5 17:37:35 2011 db0.other (10.100.4.51) techops[160] 1302039455 0 300
Tue Apr  5 17:52:35 2011 db0.other (10.100.4.51) techops[160] 1302040355 0
Tue Apr  5 17:52:35 2011 db0.other (10.100.4.51) ticket[161] 1302040355 0
Tue Apr  5 18:04:18 2011 db0.other (10.100.4.51) techops[160] 1302041058 0
Tue Apr  5 18:04:18 2011 db0.other (10.100.4.51) alert1[162] 1302041058 0
Tue Apr  5 18:34:18 2011 db0.other (10.100.4.51) alert1[162] 1302042858 0
Tue Apr  5 18:34:18 2011 db0.other (10.100.4.51) alert2[163] 1302042858 0
Tue Apr  5 18:34:18 2011 db0.other (10.100.4.51) alert3[164] 1302042858 0
Tue Apr  5 18:45:02 2011 db0.other (10.100.4.51) alert1[162] 1302043502 0
Tue Apr  5 18:45:02 2011 db0.other (10.100.4.51) alert2[163] 1302043502 0
Tue Apr  5 18:45:02 2011 db0.other (10.100.4.51) alert3[164] 1302043502 0
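
To make the gap explicit, the sketch below cross-checks the red periods from the history log against the notification entries above. It is only an illustration: the epoch values are copied from the two logs, the column after the recipient is assumed to be the epoch time of the notification, and `unpaged_periods` is a hypothetical helper, not a Xymon tool.

```python
# (start_epoch, end_epoch) of each red period, from the history log.
RED_PERIODS = [
    (1302039155, 1302039455),  # 17:32:35, 5 min
    (1302040954, 1302041555),  # 18:02:34, 10 min
    (1302042754, 1302043654),  # 18:32:34, 15 min
    (1302055656, 1302055956),  # 22:07:36, 5 min
]

# (epoch, recipient) pairs, from the notification log.
NOTIFICATIONS = [
    (1302039268, "techops"), (1302039268, "alert1"),
    (1302039455, "techops"),
    (1302040355, "techops"), (1302040355, "ticket"),
    (1302041058, "techops"), (1302041058, "alert1"),
    (1302042858, "alert1"), (1302042858, "alert2"), (1302042858, "alert3"),
    (1302043502, "alert1"), (1302043502, "alert2"), (1302043502, "alert3"),
]


def unpaged_periods(periods, notifications, prefix="alert"):
    """Return the red periods during which no pager recipient was notified."""
    missing = []
    for start, end in periods:
        paged = any(
            start <= when < end and who.startswith(prefix)
            for when, who in notifications
        )
        if not paged:
            missing.append((start, end))
    return missing


print(unpaged_periods(RED_PERIODS, NOTIFICATIONS))
```

Only the 22:07:36 period (1302055656 to 1302055956) comes back without a page; the three earlier red periods each have at least one alert1 entry.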


And here are lines 159-165 in the hobbit-alerts.cfg:
HOST=%^db EXHOST=%.*dl2.example* SERVICE=other
   MAIL techops REPEAT=1d  RECOVERED
   MAIL ticket REPEAT=1d COLOR=yellow                           # open ticket email
   MAIL alert1 REPEAT=10 COLOR=red,purple FORMAT=SMS# page onshift or oncall at start RED, rep every 10 minutes
   MAIL alert2 DURATION>20 REPEAT=10 COLOR=red,purple FORMAT=SMS# page secondary after 20 mins RED . Repevery 10 minutes
   MAIL alert3 DURATION>40 REPEAT=10 COLOR=red,purple FORMAT=SMS# page tertiary after 40 mins RED. Rep every 10mins
   MAIL alert4 DURATION>60 REPEAT=10 COLOR=red,purple FORMAT=SMS# page team after 60 mins RED. Rpt every 10mins
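
As a rough aid for reasoning about which recipients should have matched at 22:07, here is a simplified Python model of the DURATION thresholds in the rules above. It is not Xymon's rule engine: it ignores HOST, SERVICE, COLOR, and REPEAT, assumes DURATION values are minutes, and `matching_recipients` is a hypothetical helper. The notification log shows alert3 being paged at 18:34:18, about two minutes after the status went red but about 42 minutes after it left green at 17:52:35, which appears consistent with the duration being counted from when the status left green; both readings are shown.

```python
# Minimum event duration in minutes per recipient; None means no DURATION
# condition on that rule.
RULES = {"alert1": None, "alert2": 20, "alert3": 40, "alert4": 60}


def matching_recipients(elapsed_seconds, rules=RULES):
    """Return recipients whose DURATION> threshold is exceeded."""
    minutes = elapsed_seconds / 60
    return [
        name
        for name, threshold in rules.items()
        if threshold is None or minutes > threshold
    ]


LEFT_GREEN = 1302040355  # 17:52:35, status went yellow
PAGE_1834 = 1302042858   # 18:34:18, alert1..alert3 were paged
RED_2207 = 1302055656    # 22:07:36, the red that did not page

# Check the model against a page that did go out.
print(matching_recipients(PAGE_1834 - LEFT_GREEN))

# Reading 1: duration counted from the start of the 22:07 red.
print(matching_recipients(0))

# Reading 2: duration counted from when the status left green.
print(matching_recipients(RED_2207 - LEFT_GREEN))
```

Under either reading alert1 matches at 22:07, since that rule has no DURATION condition.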


I don't believe it was acked or signed out. It's a complex custom test.


