[Xymon] Trouble shooting alerts

Barrie Parker Barrie.Parker at uregina.ca
Fri Nov 18 18:22:21 CET 2011


Folks:

I've been working on this problem for the last couple of days and it's likely an easy fix, but I can't see it. 

I have a Linux test box, ix-test, on which I've shut down ntpd. The status page shows the red "critical" icon for procs. When I drill down into procs, the page is red with ntpd shown in red, as I expect.

But I never get an email alert. I can manually email out, so postfix is okay on the xymon server.

The snippet of code from alerts.cfg looks like:

HOST=ix-test    SERVICE=ntpd
                MAIL barrie DOT parker AT uregina DOT ca REPEAT=10 COLOR=RED


I've gone through literally hundreds of emails in the archives (there are over 7200 that mention alerts and troubleshooting ;). I've tried what I found in them, but I'm still no closer to a solution.

When I do: bin/xymoncmd xymond_alert --dump-config
2011-11-18 09:43:40 Using default environment file /usr/local/xymon/server/etc/xymonserver.cfg
  <SNIP>

  125   HOST=ix-test SERVICE=ntpd
        MAIL barrie DOT parker AT uregina DOT ca FORMAT=TEXT REPEAT=10 COLOR=red

<SNIP>

I gather that's what I should see. 

When I do:  bin/xymoncmd xymond_alert --test ix-test ntpd
2011-11-18 09:44:24 Using default environment file /usr/local/xymon/server/etc/xymonserver.cfg
00016177 2011-11-18 09:44:24 send_alert ix-test:ntpd state Paging
00016177 2011-11-18 09:44:24 Matching host:service:dgroup:page 'ix-test:ntpd:Math:' against rule line 122
00016177 2011-11-18 09:44:24 Failed 'HOST=rss03 SERVICE=syslog-ng' (hostname not in include list)
00016177 2011-11-18 09:44:24 Matching host:service:dgroup:page 'ix-test:ntpd:Math:' against rule line 125
00016177 2011-11-18 09:44:24 *** Match with 'HOST=ix-test    SERVICE=ntpd' ***
00016177 2011-11-18 09:44:24 Matching host:service:dgroup:page 'ix-test:ntpd:Math:' against rule line 126
00016177 2011-11-18 09:44:24 *** Match with 'MAIL barrie DOT parker AT uregina DOT ca REPEAT=10 COLOR=RED' ***
00016177 2011-11-18 09:44:24 Mail alert with command 'mail -s "Xymon [12345] ix-test:ntpd CRITICAL (RED)" barrie DOT parker AT uregina DOT ca' 
<SNIP>

That would seem to indicate success reading and parsing the alerts file.
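
To see at a glance which rule lines the --test run matched, the output can be filtered for its "*** Match with" lines. This is a minimal Python sketch, not part of Xymon; the names matched_rules and sample are made up for illustration, and the log format is assumed to be exactly as pasted above.

```python
import re

# Assumed format, taken from the pasted --test output:
#   <pid> <date> <time> *** Match with '<rule text>' ***
MATCH_RE = re.compile(r"\*\*\* Match with '(.*)' \*\*\*")

def matched_rules(log_text):
    """Return the rule texts that the --test output reported as matching."""
    return [m.group(1) for m in MATCH_RE.finditer(log_text)]

# Three lines copied from the output above.
sample = """\
00016177 2011-11-18 09:44:24 Failed 'HOST=rss03 SERVICE=syslog-ng' (hostname not in include list)
00016177 2011-11-18 09:44:24 *** Match with 'HOST=ix-test    SERVICE=ntpd' ***
00016177 2011-11-18 09:44:24 *** Match with 'MAIL barrie DOT parker AT uregina DOT ca REPEAT=10 COLOR=RED' ***
"""

for rule in matched_rules(sample):
    print(rule)
```

Note that this only shows what matched for the host and service named on the command line (ix-test and ntpd here), not what a live status message would match.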

I've tried to debug by capturing the output of: ./bbcmd --env=../etc/xymonserver.cfg xymond_channel --channel=page cat
and then taking the text between
@@page#339/ix-test|1321631963.933541|142.3.156.124|ix-test|procs|142.3.156.124|1321633763|red|red|1321625351||647263|linux|linux||

and

@@


and then feeding it to:

bin/xymoncmd
2011-11-18 10:48:43 Using default environment file /usr/local/xymon/server/etc/xymonserver.cfg
rss03:/usr/local/xymon/server> bin/xymond_alert --debug <input.txt
17777 2011-11-18 10:49:04 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=266239
17777 2011-11-18 10:49:04 Got 8542 bytes
17777 2011-11-18 10:49:04 xymond_alert: Got message 339 @@page#339/ix-test|1321631963.933541|142.3.156.124|ix-test|procs|142.3.156.124|1321633763|red|red|1321625351||647263|linux|linux||
17777 2011-11-18 10:49:04 startpos 8541, fillpos 8542, endpos -1
17777 2011-11-18 10:49:04 Got page message from ix-test:procs
17777 2011-11-18 10:49:04 Alert status changed from 0 to 1
17777 2011-11-18 10:49:04 Found no first matching rule
17777 2011-11-18 10:49:04 Opening file /usr/local/xymon/server/etc/alerts.cfg
17777 2011-11-18 10:49:04 Opening file /usr/local/xymon/server/etc/holidays.cfg
17777 2011-11-18 10:49:04 Transport setup is:
17777 2011-11-18 10:49:04 xymondportnumber = 1984
17777 2011-11-18 10:49:04 xymonproxyhost = NONE
17777 2011-11-18 10:49:04 xymonproxyport = 0
17777 2011-11-18 10:49:04 Recipient listed as '142.3.156.39'
17777 2011-11-18 10:49:04 Standard protocol on port 1984
17777 2011-11-18 10:49:04 Will connect to address 142.3.156.39 port 1984
17777 2011-11-18 10:49:04 Connect status is 0
17777 2011-11-18 10:49:04 Sent 16 bytes
17777 2011-11-18 10:49:04 Read 1437 bytes
17777 2011-11-18 10:49:04 Closing connection
17777 2011-11-18 10:49:04 Found no first matching rule
17777 2011-11-18 10:49:04 cleanup_alert called for host ix-test, test procs
17777 2011-11-18 10:49:04 0 alerts to go
2011-11-18 10:49:04 Bad data in channel, skipping it
2011-11-18 10:49:04 Buffer sync lost, flushing data
17777 2011-11-18 10:49:04 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=266239
17777 2011-11-18 10:49:04 get_xymond_message: Returning NULL due to EOF
rss03:/usr/local/xymon/server>

I see the line (twice) "Found no first matching rule"  but I'm not sure how to interpret this.
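
One way to approach this is to compare the test name carried in the captured channel message with the SERVICE value in the rule. The debug output reports the message as ix-test:procs, while the rule in alerts.cfg is written for SERVICE=ntpd. The Python sketch below only makes that comparison explicit; it is not Xymon code. The field positions (host in the fourth field, test name in the fifth, color in the eighth) are an assumption read off the pasted message and the "Got page message from ix-test:procs" line, and parse_page_header, rule_host and rule_service are hypothetical names.

```python
def parse_page_header(line):
    """Split an '@@page' header line into host, test name and color.

    Assumption: fields are '|'-separated, with the host name in the
    fourth field, the test name in the fifth and the color in the
    eighth, as in the message pasted above.
    """
    fields = line.split("|")
    return {"host": fields[3], "test": fields[4], "color": fields[7]}

# Header line of the captured message, copied from the debug output.
header = ("@@page#339/ix-test|1321631963.933541|142.3.156.124|ix-test|procs|"
          "142.3.156.124|1321633763|red|red|1321625351||647263|linux|linux||")

rule_host = "ix-test"    # from HOST=ix-test in alerts.cfg
rule_service = "ntpd"    # from SERVICE=ntpd in alerts.cfg

msg = parse_page_header(header)
print("message is for %s:%s (%s)" % (msg["host"], msg["test"], msg["color"]))
print("host matches rule:   ", msg["host"] == rule_host)
print("service matches rule:", msg["test"] == rule_service)
```

If that comparison is the relevant one, the rule would only apply to a status column actually named ntpd, whereas ntpd here is reported inside the procs column. That is a reading of the logs above, not something confirmed.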

Oh, and Xymon is 4.3.5 on the server and should be 4.3.2 on the monitored host.

Thank you.
Regards,
Barrie.



