[Xymon] possible xymon 4.3.21 holiday alerting bug?

Gavin Stone-Tolcher g.stone-tolcher at its.uq.edu.au
Mon Oct 12 06:17:10 CEST 2015


> I simulated "REPEAT=1h" on our alternate production server (same config/polling/clients but no alerting), and it exhibited the same behaviour.

I am getting a bit confused. 
I re-ran my test with "REPEAT=1w", after putting the test host on a different page and adding a new regex for that page to the alerts.cfg file. 
Finally, I issued a "drop hostname" to the server for that host.

Now it seems to work as desired, no repeating alerts! I then put the test host entry back onto the original page, uncommented the original alerts.cfg regex and issued a drop again for that host. It is still working as it should (no repeats)!

I have no idea what might be going on.

From --trace log/alert-trace.log 
Recurring alerts every xymonnet poll? :

00013136 2015-10-09 11:09:57 Matching host:service:dgroup:page 'zeus-test.router.edu.au:fping:(NULL):its-un/un-rtmon' against rule line 168
00013136 2015-10-09 11:09:57 *** Match with 'MAIL gavin at sms.xx.yy.edu.au SERVICE=$UNSMSSVCS DURATION>6m TIME=W:0600:0759,W:1731:2200,60:0600:2200 COLOR=red REPEAT=1h FORMAT=SMS RECOVERED' ***
00013136 2015-10-09 11:09:57 Mail alert with command '/usr/bin/mutt gavin at sms.xx.yy.edu.au'

00023404 2015-10-09 11:11:00 Matching host:service:dgroup:page 'zeus-test.router.edu.au:fping:(NULL):its-un/un-rtmon' against rule line 168
00023404 2015-10-09 11:11:00 *** Match with 'MAIL gavin at sms.xx.yy.edu.au SERVICE=$UNSMSSVCS DURATION>6m TIME=W:0600:0759,W:1731:2200,60:0600:2200 COLOR=red REPEAT=1h FORMAT=SMS RECOVERED' ***
00023404 2015-10-09 11:11:00 Mail alert with command '/usr/bin/mutt gavin at sms.xx.yy.edu.au'

00029253 2015-10-09 11:12:00 Matching host:service:dgroup:page 'zeus-test.router.edu.au:fping:(NULL):its-un/un-rtmon' against rule line 168
00029253 2015-10-09 11:12:00 *** Match with 'MAIL gavin at sms.xx.yy.edu.au SERVICE=$UNSMSSVCS DURATION>6m TIME=W:0600:0759,W:1731:2200,60:0600:2200 COLOR=red REPEAT=1h FORMAT=SMS RECOVERED' ***
00029253 2015-10-09 11:12:00 Mail alert with command '/usr/bin/mutt gavin at sms.xx.yy.edu.au'

Now correct behaviour after issuing a "drop" for "zeus-test.router.edu.au" and changing back to "REPEAT=1w" instead of "REPEAT=1h":

00009373 2015-10-12 14:01:43 Matching host:service:dgroup:page 'zeus-test.router.edu.au:fping:Test  Hosts:its-un/un-rtmon' against rule line 169
00009373 2015-10-12 14:01:43 *** Match with 'MAIL gavin at sms.xx.yy.edu.au SERVICE=$UNSMSSVCS DURATION>6m TIME=W:0600:0759,W:1731:2200,60:0600:2200 COLOR=red REPEAT=1w FORMAT=SMS RECOVERED' ***
00009373 2015-10-12 14:01:43 Recipient 'zeus-test.router.edu.au|fping|mail|gavin at sms.xx.yy.edu.au' dropped, next alert due at 1445226920 > 1444622503
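The two epoch values in the "dropped" line above can be decoded to check that the REPEAT=1w window is being honoured. This is only a rough sketch: it assumes the second number is the time of the alert pass (it matches the 14:01:43 log timestamp if the server runs on Brisbane time, UTC+10), and it assumes that the due time is the previous alert time plus the REPEAT interval, which the trace does not state. The names below are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the "dropped" trace line.
NEXT_ALERT_DUE = 1445226920
NOW = 1444622503

# Assumption: the server logs in Brisbane local time (UTC+10, no DST).
BRISBANE = timezone(timedelta(hours=10))


def fmt(ts):
    """Render an epoch timestamp in the assumed local time zone."""
    return datetime.fromtimestamp(ts, tz=BRISBANE).strftime("%Y-%m-%d %H:%M:%S")


one_week = int(timedelta(weeks=1).total_seconds())  # REPEAT=1w in seconds
remaining = NEXT_ALERT_DUE - NOW                    # seconds until next alert
since_last = one_week - remaining                   # assumes due = last + 1w

print("alert pass at :", fmt(NOW))
print("next alert due:", fmt(NEXT_ALERT_DUE))
print("remaining     :", remaining, "seconds")
print("since last    :", since_last, "seconds")
```

If those assumptions hold, the previous alert went out a little over six minutes before this pass and the next one is not due for almost a full week, which is what REPEAT=1w should produce.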

I can't just issue a "drop" to hosts that have this issue during a public holiday on our production server.

Any thoughts on what might be occurring?


Cheers,
Gavin Stone-Tolcher, IT Support Officer, Network Operations and Incident Response
Information Technology Services
The University of Queensland
Level 4, Prentice Building, St Lucia 4072
T: +61 7 334 66645, M: +61 401 140 838
E: g.stone-tolcher at its.uq.edu.au W: www.its.uq.edu.au

ITS: Service. Team. Accountability. Results.

IMPORTANT: This email and any attachments are intended solely for the addressee(s), contain copyright material and are confidential. We do not waive any legal privilege or rights in respect of copyright or confidentiality. Except as intended addressees are otherwise permitted, you do not have permission to use, disclose, reproduce or communicate any part of this email or its attachments. Statements, opinions and information not related to the official business of The University of Queensland are neither given nor endorsed by us. By using this email (including accessing any attachments or links) you agree we are not liable for any loss or damage of any kind arising in connection with any electronic defect, virus or other malicious code we did not intentionally include.

Please consider the environment before printing this email.

CRICOS Code 00025B

-----Original Message-----
From: Gavin Stone-Tolcher 
Sent: Friday, 9 October 2015 11:16 AM
To: 'J.C. Cleaver' <cleaver at terabithia.org>
Cc: xymon at xymon.com
Subject: RE: [Xymon] possible xymon 4.3.21 holiday alerting bug?

> Hmm. Does the REPEAT value work with a smaller interval (such as 1d or 1h)? And what type of system are you running on?
> I'm curious if there's a REPEAT over/underflow going on instead of something specific to the TIME exclusion back and forth.

All the alerting rules we use have a "REPEAT=1w", and they do seem to work as intended during non holiday times.
I simulated "REPEAT=1h" on our alternate production server (same config/polling/clients but no alerting), and it exhibited the same behaviour.

The system is Oracle Linux 6, which as I understand it, is really a RHEL 6 variant. We are running vanilla 4.3.21 compiled from source, not the rpm version.

# uname -a
Linux xx.yy.edu.au 2.6.32-504.30.3.el6.x86_64 #1 SMP Tue Jul 14 08:51:44 PDT 2015 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.6 (Santiago)
# cat /etc/oracle-release
Oracle Linux Server release 6.6

> Is the test persistently red with no spurious recoveries being generated during the period in question?

Test is hard red in history.


Cheers,
Gavin Stone-Tolcher, IT Support Officer, Network Operations and Incident Response
Information Technology Services
The University of Queensland
Level 4, Prentice Building, St Lucia 4072
T: +61 7 334 66645, M: +61 401 140 838
E: g.stone-tolcher at its.uq.edu.au W: www.its.uq.edu.au


-----Original Message-----
From: J.C. Cleaver [mailto:cleaver at terabithia.org]
Sent: Friday, 9 October 2015 1:14 AM
To: Gavin Stone-Tolcher <g.stone-tolcher at its.uq.edu.au>
Cc: xymon at xymon.com
Subject: Re: [Xymon] possible xymon 4.3.21 holiday alerting bug?



On Wed, October 7, 2015 11:58 pm, Gavin Stone-Tolcher wrote:
> Hi, We are seeing unusual alerting behaviour with Xymon 4.3.21 server 
> using a "holidays.cfg"  with HOLIDAYLIKEWEEKDAY=0.
>
> We have a network operations team (uqnoc-sms) that gets alerts during 
> business hours (TIME=W:0800:1700), and a data networks team (dn-sms) 
> that gets out-of-business-hours alerts in certain windows 
> (TIME=W:0600:0759,W:1701:2200,60:0600:2200).
>
> Rules are like:
>
> PAGE=$UNSMSREGEX EXHOST=$UNEXCLUDE
>         MAIL uqnoc-sms at xx.yy.edu.au SERVICE=$UNSMSSVCS DURATION>6m TIME=W:0800:1700 COLOR=red REPEAT=1w FORMAT=SMS RECOVERED
>         MAIL dn-sms at xx.yy.edu.au SERVICE=$UNSMSSVCS DURATION>6m TIME=W:0600:0759,W:1701:2200,60:0600:2200 COLOR=red REPEAT=1w FORMAT=SMS RECOVERED
>
> For a "red" conn test covered by the rule on a weekday public holiday, 
> it seems to correctly identify not to send an alert to "uqnoc-sms"
> (TIME=W:0800:1700 ) and instead correctly generates an alert to "dn-sms"
> (TIME=60:0600:2200 component), but then keeps sending the same alert 
> approximately every minute (my xymonnet poll cycle). Ignores REPEAT=1w?
>
> Before I try and debug much further, I thought I would ask if anyone 
> else has seen similar behaviour?
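The TIME selection described in the quoted message can be modelled roughly as follows. This is not Xymon's code, just a sketch under stated assumptions: day characters are "W" for Monday to Friday, "*" for every day and digits 0-6 for Sunday to Saturday; the start and end times are inclusive; and HOLIDAYLIKEWEEKDAY=0 makes a holiday behave like day 0 (Sunday). The function names and the example date are hypothetical.

```python
from datetime import datetime


def days_in_spec(day_field):
    """Expand a day field such as 'W' or '60' into a set of day numbers (0=Sunday)."""
    days = set()
    for ch in day_field:
        if ch == "*":
            days.update(range(7))
        elif ch == "W":
            days.update(range(1, 6))  # Monday..Friday
        elif ch.isdigit():
            days.add(int(ch))
    return days


def time_matches(spec, when, is_holiday=False, holiday_like_weekday=0):
    """Return True if 'when' falls inside any window of a TIME-style spec."""
    # Python's weekday() has Monday=0; convert to Sunday=0 numbering.
    day = (when.weekday() + 1) % 7
    if is_holiday:
        day = holiday_like_weekday  # assumption: holiday is treated as this day
    hhmm = when.strftime("%H%M")
    for part in spec.split(","):
        day_field, start, end = part.split(":")
        # Zero-padded HHMM strings compare correctly as text.
        if day in days_in_spec(day_field) and start <= hhmm <= end:
            return True
    return False


BUSINESS = "W:0800:1700"
AFTER_HOURS = "W:0600:0759,W:1701:2200,60:0600:2200"
monday_11am = datetime(2015, 10, 5, 11, 0)  # a Monday, used as a hypothetical holiday

print(time_matches(BUSINESS, monday_11am))                       # ordinary Monday
print(time_matches(AFTER_HOURS, monday_11am))
print(time_matches(BUSINESS, monday_11am, is_holiday=True))      # same day as a holiday
print(time_matches(AFTER_HOURS, monday_11am, is_holiday=True))
```

Under this model the holiday flips which recipient matches, consistent with the first alert going to dn-sms. It says nothing about why REPEAT would then be ignored.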

Hmm. Does the REPEAT value work with a smaller interval (such as 1d or 1h)? And what type of system are you running on?

I'm curious if there's a REPEAT over/underflow going on instead of something specific to the TIME exclusion back and forth.

Is the test persistently red with no spurious recoveries being generated during the period in question?


-jc




