[Xymon] alert/hostname loading
John Thurston
john.thurston at alaska.gov
Mon Dec 14 21:27:05 CET 2015
On 12/1/2015 12:03 PM, John Thurston wrote:
> On 12/1/2015 11:48 AM, J.C. Cleaver wrote:
> - snip -
>>
>> Hmm. This seems to be fundamentally a different issue than the "hostdata
>> module going rogue" thing, which was about zombies never being picked up.
>>
>> AFAICT, somehow the hosts tree structure is getting clobbered as a result
>> of the drop (assuming all of those hosts are expected to be existing).
- snip -
> I haven't yet found a way to induce this failure, so I haven't yet
> identified the minimal recovery steps. I'm working on it, though.
I think I might be able to reproduce the failure :) Start with the
following stable server arrangement:
+ x.bar.com is running xymon 4.3.22 on Solaris 10 SPARC
+ The following is defined in tasks.cfg:
    CMD xymond_channel --channel=page --log=$XYMONSERVERLOGS/alert.log \
        xymond_alert --debug --checkpoint-file=$XYMONTMP/alert.chk \
        --checkpoint-interval=600
+ Host foo.bar.com is defined in DNS, does not permit ICMP traffic,
  and does not have a xymon client installed on it
Throw a spanner in the works with the following actions:
+ Add host foo.bar.com to an existing page and group in hosts.cfg
+ ~/server/bin/xymoncmd ~/server/bin/xymonnet foo.bar.com
And see the trouble commence in alert.log:
> 6690 2015-12-14 10:52:06.859998 Got 415 bytes
> 6690 2015-12-14 10:52:06.860110 xymond_alert: Got message 95 @@page#95/foo.bar.com|1450122726.859873|10.10.10.55|foo.bar.com|conn|0.0.0.0|1450124526|red|none|1450122726|Page/Subpage|65234||||
> 6690 2015-12-14 10:52:06.860140 startpos 5659, fillpos 5659, endpos -1
> 6690 2015-12-14 10:52:06.860172 Got page message from foo.bar.com:conn
> 6690 2015-12-14 10:52:06.860249 Alert status changed from 0 to 1
> 6690 2015-12-14 10:52:06.860285 Checking criteria for host 'foo.bar.com', which is not defined
> 6690 2015-12-14 10:52:06.861674 Checking criteria for host 'foo.bar.com', which is not defined
> 6690 2015-12-14 10:52:06.861728 Checking criteria for host 'foo.bar.com', which is not defined
> 6690 2015-12-14 10:52:06.861761 Found no first matching rule
> 6690 2015-12-14 10:52:06.861813 No files modified, skipping reload of /opt/xymon/server/etc/alerts.cfg
> 6690 2015-12-14 10:52:06.861861 No files modified, skipping reload of /opt/xymon/server/etc/holidays.cfg
> 6690 2015-12-14 10:52:06.861891 Checking criteria for host 'zebra.bar.com', which is not defined
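For anyone trying to read the page message in the excerpt above, here is a rough Python sketch that splits it on the pipe character. The field positions are an assumption read off this one sample line and the "Got page message from foo.bar.com:conn" line that follows it, not taken from the Xymon documentation. The names PageMessage and parse_page_message are made up for illustration.

```python
from typing import List, NamedTuple


class PageMessage(NamedTuple):
    sequence: int      # the number after "@@page#"
    host: str          # assumed position 3, matches "foo.bar.com" in the sample
    test: str          # assumed position 4, matches "conn" in the sample
    color: str         # assumed position 7, matches "red" in the sample
    fields: List[str]  # every pipe-separated field, untouched


def parse_page_message(raw: str) -> PageMessage:
    """Split one '@@page#...' line into the few fields the log corroborates."""
    prefix = "@@page#"
    fields = raw.strip().split("|")
    header = fields[0]  # e.g. "@@page#95/foo.bar.com"
    if not header.startswith(prefix):
        raise ValueError("not a page-channel message: %r" % header)
    if len(fields) < 8:
        raise ValueError("too few fields: %d" % len(fields))
    sequence_text = header[len(prefix):].partition("/")[0]
    return PageMessage(int(sequence_text), fields[3], fields[4], fields[7], fields)
```

Feeding it the message from the excerpt yields sequence 95, host foo.bar.com, test conn, and color red. The remaining fields are left unnamed because the log does not say what they mean.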
After killing the "xymond_channel --channel=page" process, a new one is
created as a child of xymonlaunch and everything behaves normally again.
I currently have a tail on my alert.log to warn me of the appearance of
the string, "which is not defined". When that appears, I know it is time
to HUP the "page" channel. This is a rather crude hammer to leave lying
on the table next to my production server, but it keeps us running :)
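The log watch described above could look something like the following Python sketch. It is an illustration only, not the script actually in use: the function names undefined_hosts and follow are invented here, the regular expression is copied from the wording in the alert.log excerpt, and log rotation is not handled.

```python
import re
import sys
import time
from typing import Iterable, Iterator

# Wording copied from the alert.log excerpt; adjust if your version differs.
UNDEFINED = re.compile(r"Checking criteria for host '([^']+)', which is not defined")


def undefined_hosts(lines: Iterable[str]) -> Iterator[str]:
    """Yield the host name from every line that reports an undefined host."""
    for line in lines:
        match = UNDEFINED.search(line)
        if match:
            yield match.group(1)


def follow(path: str, poll_seconds: float = 1.0) -> Iterator[str]:
    """Yield complete lines appended to path, roughly like 'tail -f'."""
    with open(path, "r", errors="replace") as handle:
        handle.seek(0, 2)  # start at the current end of the file
        pending = ""
        while True:
            chunk = handle.readline()
            if not chunk:
                time.sleep(poll_seconds)
                continue
            pending += chunk
            if pending.endswith("\n"):  # only hand out finished lines
                yield pending
                pending = ""


if __name__ == "__main__" and len(sys.argv) == 2:
    for host in undefined_hosts(follow(sys.argv[1])):
        print("alert module no longer knows %s; restart the page channel" % host)
```

Run it with the path to alert.log as the only argument. It only prints a warning; restarting the channel is left to the operator, since an automatic kill would hide the underlying fault.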
I have a core file from the xymond_channel process, but its stack
contains only:
> feee041c _syscall6 (1, 1, 0, 1, 7d0, 3a0f4) + 20
> 00013c90 _start (0, 0, 0, 0, 0, 0) + 5c
I have a core file from the xymond_alert process, but its stack contains
only:
> feede7d8 __pollsys (ffbfcd50, 1, ffbfcdc0, 0, 0, 0) + 8
> fee79b8c pselect (ffbfcd50, fef56790, fef56790, 40, ffbfcdc0, 0) + 1c8
> fee79f04 select (1, ffbfce58, 0, 0, ffbfce48, ffbfced8) + a0
> 00015fa4 get_xymond_message (4b400, 4b14c, 4b148, ffbfcf88, 4b16c, 35d50) + 270
> 0003293c main (1, 566f245d, 0, 33b00, 4b000, 33bb8) + 378
> 00014a34 _start (0, 0, 0, 0, 0, 0) + 5c
which is whatever it was happily processing when I killed it, not the
stack at the time it ended up at line 815 of loadalerts.c.
What can I do, and what information can I gather, to help narrow
the fault domain?
--
Do things because you should, not just because you can.
John Thurston 907-465-8591
John.Thurston at alaska.gov
Enterprise Technology Services
Department of Administration
State of Alaska