[Xymon] alerting issue?

Gavin Stone-Tolcher g.stone-tolcher at its.uq.edu.au
Tue Jan 21 07:16:18 CET 2014


I have recently migrated from a large BigBrother/bbgen installation (hosts.cfg has 5300 lines) to Xymon 4.3.12.
Surprisingly, there have been very few issues. Performance is very good compared to BigBrother/bbgen.
However, we have just experienced a potentially major issue with alerting.

Our issue seems to be with alerts not being generated for a rule if the initial event transition to red is not within the "TIME" range for an alerting rule.

An example follows of the behaviour experienced:
The "http" service on the system "butterfly.soe.uq.edu.au" went down at 03:07 and recovered three days later:

Mon Jan 20 10:14:31 2014    green    1 days 4:50:51
Fri Jan 17 03:07:44 2014    red      3 days 7:06:47

Alerting for this test is as follows:
============
alerts.cfg:

$AISMAILSVCS=cifs,cont,cpu,disk,fping,http,inode,login,loginc,memory,ssh,sslcert,rtmpe,rtmps,rtmpt,svcs,xfer_proxy_c,xfer_proxy_e,xfer_proxy_k
$AISSMSSVCS=cifs,cont,cpu,disk,fping,http,inode,login,loginc,memory,ssh,sslcert,rtmpe,rtmps,rtmpt,svcs
$AISTFHSVCS=fping,http,login

#
# web/proxy/other/cert alerts
#
PAGE=%its-ais/ais-(web|proxy|other).*
        MAIL ais-web at domain SERVICE=$AISMAILSVCS DURATION>2m COLOR=red REPEAT=1w FORMAT=PLAIN RECOVERED
        MAIL ais-web-sms at domain SERVICE=$AISSMSSVCS DURATION>6m TIME=*:0701:2159 COLOR=red REPEAT=1w FORMAT=SMS RECOVERED
============

The "info" test output displays alerting rules as:

Alerting:
Service Recipient             1st Delay   Stop after   Repeat   Time of Day   Colors
ais-web at domain (R)         2m 1s       -            1w       -             red
ais-web-sms at domain (R)     6m 1s       -            1w       *:0701:2159   red

============

The notification log displays only email alert/recovery for "ais-web at domain", nothing for "ais-web-sms at domain" recipient:

Time                        Host                      Service   Recipient
Mon Jan 20 10:14:47 2014    butterfly.soe.uq.edu.au   http      ais-web at domain
Fri Jan 17 03:10:29 2014    butterfly.soe.uq.edu.au   http      ais-web at domain

No notification was sent to "ais-web-sms at domain" by the second "MAIL" rule above once its window opened at 07:01 on the morning following the failure, even though the "http" test remained red for 3 days.
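
To make the expected timing concrete, here is a rough Python sketch of when the second rule would first become eligible under the behaviour expected above. It is only a simplified model, not Xymon's actual alerting logic: the names parse_hhmm and first_eligible are hypothetical, only the "*:HHMM:HHMM" form of TIME is handled, and the window is assumed not to cross midnight.

```python
from datetime import datetime, time, timedelta

def parse_hhmm(s):
    """Turn 'HHMM' into a datetime.time (hypothetical helper)."""
    return time(int(s[:2]), int(s[2:]))

def first_eligible(event_start, min_duration, spec):
    """Earliest moment at which the event has lasted min_duration AND
    the clock is inside the TIME window. Assumed model, not Xymon code."""
    days, start, end = spec.split(":")
    if days != "*":
        raise ValueError("this sketch only handles '*' (every day)")
    win_start, win_end = parse_hhmm(start), parse_hhmm(end)

    candidate = event_start + min_duration
    if win_start <= candidate.time() <= win_end:
        return candidate  # already inside the window
    # Otherwise wait for the next window opening.
    opening = datetime.combine(candidate.date(), win_start)
    if candidate.time() > win_end:
        opening += timedelta(days=1)  # today's window has already closed
    return opening

# Values taken from the event log and the SMS rule above.
went_red = datetime(2014, 1, 17, 3, 7, 44)
print(first_eligible(went_red, timedelta(minutes=6), "*:0701:2159"))
# prints 2014-01-17 07:01:00
```

Under this model the SMS notification would have been due at about 07:01 on Fri Jan 17, which is the alert that never appeared in the notification log.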

Manually testing the alerting rules with:
~/server/bin/xymoncmd xymond_alert --test butterfly.soe.uq.edu.au http --duration=362

indicates that the syntax is OK and that both emails are sent when the test is run during the 0701:2159 TIME window of the second rule:

00029580 2014-01-17 11:31:30 Matching host:service:dgroup:page 'butterfly.soe.uq.edu.au:http: Linux Servers:its-usg/usg-linux,its-ais/ais-other' against rule line 1002
00029580 2014-01-17 11:31:30 *** Match with 'PAGE=%its-ais/ais-(web|proxy|other).*' ***
00029580 2014-01-17 11:31:30 Matching host:service:dgroup:page 'butterfly.soe.uq.edu.au:http: Linux Servers:its-usg/usg-linux,its-ais/ais-other' against rule line 1003
00029580 2014-01-17 11:31:30 *** Match with 'MAIL ais-web at domain SERVICE=$AISMAILSVCS DURATION>2m COLOR=red REPEAT=1w FORMAT=PLAIN RECOVERED' ***
00029580 2014-01-17 11:31:30 Mail alert with command '/usr/bin/mutt -s "Xymon [12345] butterfly.soe.uq.edu.au:http CRITICAL (RED)" ais-web at domain'
00029580 2014-01-17 11:31:30 Matching host:service:dgroup:page 'butterfly.soe.uq.edu.au:http: Linux Servers:its-usg/usg-linux,its-ais/ais-other' against rule line 1004
00029580 2014-01-17 11:31:30 *** Match with 'MAIL ais-web-sms at domain SERVICE=$AISSMSSVCS DURATION>6m TIME=*:0701:2159 COLOR=red REPEAT=1w FORMAT=SMS RECOVERED' ***
00029580 2014-01-17 11:31:30 Mail alert with command '/usr/bin/mutt ais-web-sms at domain'

Is there anything wrong with the alerting logic I have used in alerts.cfg, or am I misunderstanding how it works?

In BigBrother, if an event began before a rule's alerting time window opened, the alert would have been sent once the window opened and the rule's settle time had passed.


I then contrived a dummy test in hosts.cfg and alerts.cfg for an unpingable host "dummy.alerting.test", using the "fping" test.
Event log for "dummy.alerting.test" "fping":
Tue Jan 21 15:47:43 2014    red    0:16:12

alerts.cfg:
HOST=dummy.alerting.test
        MAIL g.stone-tolcher at its.uq.edu.au DURATION>2m TIME=*:1600:1700 COLOR=red REPEAT=1w FORMAT=PLAIN RECOVERED

Notification:
Tue Jan 21 16:00:36 2014    dummy.alerting.test    fping    g.stone-tolcher at its.uq.edu.au

This seems to indicate that alerting works much as expected, i.e. a notification is sent at the start of the TIME window if the event is still current (ignoring the duration/settle time, unlike BigBrother)?
I do not understand why the other alert would not have occurred.
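
The dummy test can be read as a point-in-time check: at any given moment a rule is due if the event has been red for longer than DURATION and the clock is inside the TIME window. The Python sketch below models that reading. It is an assumption about the behaviour, not Xymon's implementation; in_window and is_due are hypothetical names, only "*:HHMM:HHMM" is handled, and the window end is treated as inclusive to the minute.

```python
from datetime import datetime, timedelta

def in_window(now, spec):
    """True if now's clock time is inside a '*:HHMM:HHMM' window."""
    days, start, end = spec.split(":")
    if days != "*":
        raise ValueError("this sketch only handles '*' (every day)")
    # Zero-padded HHMM strings compare correctly as text.
    return start <= now.strftime("%H%M") <= end

def is_due(now, event_start, min_duration, spec):
    """Assumed rule: red for longer than min_duration and inside the window."""
    return (now - event_start) > min_duration and in_window(now, spec)

# Values taken from the dummy.alerting.test event log and rule above.
went_red = datetime(2014, 1, 21, 15, 47, 43)
rule = {"min_duration": timedelta(minutes=2), "spec": "*:1600:1700"}

print(is_due(datetime(2014, 1, 21, 15, 59, 59), went_red, **rule))  # False
print(is_due(datetime(2014, 1, 21, 16, 0, 36), went_red, **rule))   # True
```

This matches the notification logged at 16:00:36 for the dummy host. Applied to the butterfly.soe.uq.edu.au event, the same check would be true from 07:01 on Jan 17 onwards, which is why the missing SMS alert is puzzling.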

Any help with this issue would be appreciated.


Cheers,
Gavin Stone-Tolcher, IT Support Officer, Network Operations and Incident Response
Information Technology Services
The University of Queensland
Level 4, Prentice Building, St Lucia 4072
T: +61 7 334 66645, M: +61 401 140 838
E: g.stone-tolcher at its.uq.edu.au W: www.its.uq.edu.au

ITS: Service. Team. Accountability. Results.

