[Xymon] alert/hostname loading

John Thurston john.thurston at alaska.gov
Mon Dec 14 21:27:05 CET 2015


On 12/1/2015 12:03 PM, John Thurston wrote:
> On 12/1/2015 11:48 AM, J.C. Cleaver wrote:
> - snip -
>>
>> Hmm. This seems to be fundamentally a different issue than the "hostdata
>> module going rogue" thing, which was about zombies never being picked up.
>>
>> AFAICT, somehow the hosts tree structure is getting clobbered as a result
>> of the drop (assuming all of those hosts are expected to be existing).

- snip -

> I haven't yet found a way to induce this failure, so I haven't yet
> identified the minimal recovery steps. I'm working on it, though.

I think I might be able to reproduce the failure :)  Start with the 
following, stable server arrangement:

+ x.bar.com is running xymon 4.3.22 on Solaris 10 SPARC
+ The following is defined in tasks.cfg:
   CMD xymond_channel --channel=page  --log=$XYMONSERVERLOGS/alert.log \
   xymond_alert --debug --checkpoint-file=$XYMONTMP/alert.chk \
   --checkpoint-interval=600
+ Host foo.bar.com is defined in DNS, does not permit ICMP traffic, 
and does not have a xymon client installed on it

Throw a spanner in the works with the following actions:

+ Add host foo.bar.com to an existing page and group in hosts.cfg
+ ~/server/bin/xymoncmd ~/server/bin/xymonnet foo.bar.com

And see the trouble commence in alert.log:

> 6690 2015-12-14 10:52:06.859998 Got 415 bytes
> 6690 2015-12-14 10:52:06.860110 xymond_alert: Got message 95 @@page#95/foo.bar.com|1450122726.859873|10.10.10.55|foo.bar.com|conn|0.0.0.0|1450124526|red|none|1450122726|Page/Subpage|65234||||
> 6690 2015-12-14 10:52:06.860140 startpos 5659, fillpos 5659, endpos -1
> 6690 2015-12-14 10:52:06.860172 Got page message from foo.bar.com:conn
> 6690 2015-12-14 10:52:06.860249 Alert status changed from 0 to 1
> 6690 2015-12-14 10:52:06.860285 Checking criteria for host 'foo.bar.com', which is not defined
> 6690 2015-12-14 10:52:06.861674 Checking criteria for host 'foo.bar.com', which is not defined
> 6690 2015-12-14 10:52:06.861728 Checking criteria for host 'foo.bar.com', which is not defined
> 6690 2015-12-14 10:52:06.861761 Found no first matching rule
> 6690 2015-12-14 10:52:06.861813 No files modified, skipping reload of /opt/xymon/server/etc/alerts.cfg
> 6690 2015-12-14 10:52:06.861861 No files modified, skipping reload of /opt/xymon/server/etc/holidays.cfg
> 6690 2015-12-14 10:52:06.861891 Checking criteria for host 'zebra.bar.com', which is not defined

After killing the "xymond_channel --channel=page" process, a new one is 
created as a child of xymonlaunch and everything behaves normally again.

I currently have a tail on my alert.log to warn me of the appearance of 
the string "which is not defined". When that appears, I know it is time 
to HUP the "page" channel. This is a rather crude hammer to leave lying 
on the table next to my production server, but it keeps us running :)
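
For anyone wanting to do the same kind of watch, here is a minimal sketch 
in Python of the matching step only. It is an illustration, not what I 
actually run: the function name undefined_hosts and the sample list are 
made up for this example, and the pattern assumes the debug lines keep the 
exact wording shown in the alert.log excerpt above. What to do once a 
match is found (for example signalling the "page" channel) is left out on 
purpose.

```python
import re

# The phrase xymond_alert logs (with --debug) when a host lookup fails.
MARKER = "which is not defined"

# Assumed line shape, taken from the alert.log excerpt above.
HOST_RE = re.compile(
    r"Checking criteria for host '([^']+)', " + re.escape(MARKER)
)


def undefined_hosts(lines):
    """Return the distinct host names flagged as undefined, in first-seen order."""
    seen = []
    for line in lines:
        match = HOST_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Hypothetical sample: three lines copied from the excerpt above.
sample = [
    "6690 2015-12-14 10:52:06.860285 Checking criteria for host "
    "'foo.bar.com', which is not defined",
    "6690 2015-12-14 10:52:06.861761 Found no first matching rule",
    "6690 2015-12-14 10:52:06.861891 Checking criteria for host "
    "'zebra.bar.com', which is not defined",
]

if __name__ == "__main__":
    for host in undefined_hosts(sample):
        print("undefined host seen:", host)
```

Feeding it the lines from a live tail instead of the sample list would 
give the same warning I get today, plus the list of affected host names.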

I have a core file from the xymond_channel process, but its stack 
contains only:
>  feee041c _syscall6 (1, 1, 0, 1, 7d0, 3a0f4) + 20
>  00013c90 _start   (0, 0, 0, 0, 0, 0) + 5c

I have a core file from the xymond_alert process, but its stack contains 
only:
>  feede7d8 __pollsys (ffbfcd50, 1, ffbfcdc0, 0, 0, 0) + 8
>  fee79b8c pselect  (ffbfcd50, fef56790, fef56790, 40, ffbfcdc0, 0) + 1c8
>  fee79f04 select   (1, ffbfce58, 0, 0, ffbfce48, ffbfced8) + a0
>  00015fa4 get_xymond_message (4b400, 4b14c, 4b148, ffbfcf88, 4b16c, 35d50) + 270
>  0003293c main     (1, 566f245d, 0, 33b00, 4b000, 33bb8) + 378
>  00014a34 _start   (0, 0, 0, 0, 0, 0) + 5c
which is whatever it was happily processing when I killed it, not the 
stack at the time it ended up at line 815 of loadalerts.c.

What can I do and what information can I gather which will help narrow 
the fault domain?

-- 
    Do things because you should, not just because you can.

John Thurston    907-465-8591
John.Thurston at alaska.gov
Enterprise Technology Services
Department of Administration
State of Alaska


