[Xymon] Serious trouble, red after yellow didn't page at all tonight
Elizabeth Schwartz
betsy.schwartz at gmail.com
Wed Apr 6 05:22:54 CEST 2011
Yesterday we had a red after yellow page all the way up the hierarchy
immediately. Today we had a red after yellow not page at ALL.
It did page in BB (the test is going to both servers during this test
period). We are running Xymon 4.3.0 and really hoping to go live ASAP.
Here are the hist log entries; see it go red at 22:07 for five minutes:
Tue Apr 5 17:32:35 2011 red 1302039155 300
Tue Apr 5 17:37:35 2011 green 1302039455 900
Tue Apr 5 17:52:35 2011 yellow 1302040355 599
Tue Apr 5 18:02:34 2011 red 1302040954 601
Tue Apr 5 18:12:35 2011 yellow 1302041555 1199
Tue Apr 5 18:32:34 2011 red 1302042754 900
Tue Apr 5 18:47:34 2011 yellow 1302043654 12002
Tue Apr 5 22:07:36 2011 red 1302055656 300
Tue Apr 5 22:12:36 2011 yellow 1302055956
History shows critical status:
Tue Apr 5 22:07:36 EDT 2011 OTHER Applications ( "mysqle1" ): CRITICAL
And it paged and emailed earlier in the evening: (domain name elided).
It paged correctly at 18:34 and 18:45 but sent nothing at 22:07:
Tue Apr 5 17:34:28 2011 db0.other (10.100.4.51) techops[160] 1302039268 0
Tue Apr 5 17:34:28 2011 db0.com.other (10.100.4.51) alert1[162] 1302039268 0
Tue Apr 5 17:37:35 2011 db0.other (10.100.4.51) techops[160] 1302039455 0 300
Tue Apr 5 17:52:35 2011 db0.other (10.100.4.51) techops[160] 1302040355 0
Tue Apr 5 17:52:35 2011 db0.other (10.100.4.51) ticket[161] 1302040355 0
Tue Apr 5 18:04:18 2011 db0.other (10.100.4.51) techops[160] 1302041058 0
Tue Apr 5 18:04:18 2011 db0.other (10.100.4.51) alert1[162] 1302041058 0
Tue Apr 5 18:34:18 2011 db0.other (10.100.4.51) alert1[162] 1302042858 0
Tue Apr 5 18:34:18 2011 db0.other (10.100.4.51) alert2[163] 1302042858 0
Tue Apr 5 18:34:18 2011 db0.other (10.100.4.51) alert3[164] 1302042858 0
Tue Apr 5 18:45:02 2011 db0.other (10.100.4.51) alert1[162] 1302043502 0
Tue Apr 5 18:45:02 2011 db0.other (10.100.4.51) alert2[163] 1302043502 0
Tue Apr 5 18:45:02 2011 db0.other (10.100.4.51) alert3[164] 1302043502 0
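
To make the gap easier to see, here is a rough Python sketch that cross-checks the two log excerpts above and lists red intervals with no page inside them. The field positions are inferred from these excerpts only, not from any Xymon format documentation, and treating recipients whose names start with "alert" as the pagers is an assumption based on the config in this message. The function names are made up for illustration.

```python
# Excerpts copied from this message (red lines of the hist log, alert lines
# of the notification log).
HIST = """\
Tue Apr 5 17:32:35 2011 red 1302039155 300
Tue Apr 5 18:02:34 2011 red 1302040954 601
Tue Apr 5 18:32:34 2011 red 1302042754 900
Tue Apr 5 22:07:36 2011 red 1302055656 300
"""

NOTIFY = """\
Tue Apr 5 17:34:28 2011 db0.com.other (10.100.4.51) alert1[162] 1302039268 0
Tue Apr 5 18:04:18 2011 db0.other (10.100.4.51) alert1[162] 1302041058 0
Tue Apr 5 18:34:18 2011 db0.other (10.100.4.51) alert1[162] 1302042858 0
Tue Apr 5 18:45:02 2011 db0.other (10.100.4.51) alert1[162] 1302043502 0
"""


def red_intervals(hist):
    """Return (start, end) epoch pairs for each red line that has a duration."""
    intervals = []
    for line in hist.splitlines():
        parts = line.split()
        # Assumed layout: 5 date fields, color, start epoch, duration in seconds.
        if len(parts) >= 8 and parts[5] == "red":
            start = int(parts[6])
            intervals.append((start, start + int(parts[7])))
    return intervals


def page_times(notify):
    """Return epoch times of notifications sent to alert* recipients."""
    times = []
    for line in notify.splitlines():
        parts = line.split()
        # Assumed layout: 5 date fields, host.test, (ip), recipient[line], epoch.
        if len(parts) >= 9 and parts[7].startswith("alert"):
            times.append(int(parts[8]))
    return times


def unpaged_reds(hist, notify):
    """Red intervals during which no alert* notification was logged."""
    pages = page_times(notify)
    return [(s, e) for s, e in red_intervals(hist)
            if not any(s <= t < e for t in pages)]


print(unpaged_reds(HIST, NOTIFY))  # [(1302055656, 1302055956)]
```

The only interval it reports is the 22:07:36 red, which matches what I see by eye.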
And here are lines 159-165 in the hobbit-alerts.cfg:
HOST=%^db EXHOST=%.*dl2.example* SERVICE=other
MAIL techops REPEAT=1d RECOVERED
MAIL ticket REPEAT=1d COLOR=yellow # open ticket email
MAIL alert1 REPEAT=10 COLOR=red,purple FORMAT=SMS# page onshift or oncall at start RED, rep every 10 minutes
MAIL alert2 DURATION>20 REPEAT=10 COLOR=red,purple FORMAT=SMS# page secondary after 20 mins RED . Repevery 10 minutes
MAIL alert3 DURATION>40 REPEAT=10 COLOR=red,purple FORMAT=SMS# page tertiary after 40 mins RED. Rep every 10mins
MAIL alert4 DURATION>60 REPEAT=10 COLOR=red,purple FORMAT=SMS# page team after 60 mins RED. Rpt every 10mins
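
As a sanity check on the DURATION thresholds, here is a small Python sketch of the arithmetic for the 18:34:18 page, using only timestamps from the logs above. It is a guess at how the thresholds line up with the data, not a statement of how Xymon actually evaluates DURATION. The helper name and the threshold table are made up for illustration, and treating 17:52:35 (the first yellow after the green at 17:37:35) as the start of the event is an assumption.

```python
# Thresholds in minutes, taken from the DURATION> settings in the rules above.
# alert1 has no DURATION, so it is modelled here as a threshold of 0.
RULES = [("alert1", 0), ("alert2", 20), ("alert3", 40), ("alert4", 60)]

EPISODE_START = 1302040355  # 17:52:35, status left green (assumed event start)
RED_START = 1302042754      # 18:32:34, status went red
PAGE_TIME = 1302042858      # 18:34:18, alert1/alert2/alert3 were all paged


def eligible(elapsed_minutes):
    """Recipients whose DURATION threshold is exceeded after elapsed_minutes."""
    return [name for name, threshold in RULES if elapsed_minutes > threshold]


since_red = (PAGE_TIME - RED_START) / 60          # about 1.7 minutes
since_episode = (PAGE_TIME - EPISODE_START) / 60  # about 41.7 minutes

print(eligible(since_red))      # ['alert1']
print(eligible(since_episode))  # ['alert1', 'alert2', 'alert3']
```

The second result is the one that matches the 18:34:18 entries in the notification log, so the durations look like they are counted from when the status first left green rather than from when it turned red. Under that same reading, the 22:07:36 red came about 255 minutes into the event (1302055656 - 1302040355 = 15301 seconds), which is past every threshold in these rules.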
I don't believe it was acked or signed out. It's a complex custom test.