[Xymon] alert/hostname loading (was Re: xymon hostdata module going rogue)
J.C. Cleaver
cleaver at terabithia.org
Tue Dec 1 21:48:14 CET 2015
On Tue, December 1, 2015 9:14 am, John Thurston wrote:
> How embarrassing. I was composing a note to mention a problem with the
> list archives not capturing all messages . . . when I discovered that
> the message for which I was searching was never sent to the list.
>
> I composed the following message back in early October and then sent it
> only to myself :p No wonder it didn't generate any chatter.
>
> On 8/28/2015 3:12 PM, J.C. Cleaver wrote:
>> On Fri, August 28, 2015 3:16 pm, John Thurston wrote:
>>> On 8/28/2015 12:45 PM, John Thurston wrote:
>>>> On 6/10/2015 9:01 AM, Scot Kreienkamp wrote:
>>>>> I have a xymon server running 4.3.21 that seems to be accumulating
>>>>> processes like these:
>>>>>
>>>>> hobbit 28430 0.0 0.0 0 0 ? Z 12:50 0:00
>>>>> [xymond_hostdata] <defunct>
> . . .
>>>>>
>>>>> It seemed related to drop messages . . .
>>>>
>>>> Hey, I think I'm seeing the same thing on Solaris with 4.3.21
>>>>
>>>> I've ended up here after a customer let me know that email alerts were
>>>> not working as expected. After a few hours of digging around, I decided
>>>> that the alert daemon was failing to retrieve hostnames and failing
>>>> miserably.
>>>>
>>>> Have other people seen this behavior?
>>>
>>> I have duplicated this behavior on another xymon server on Solaris. It
>>> certainly looks like this behavior breaks the alert daemon. Fortunately,
>>> I "drop" hosts in batches so can restart Xymon at that time, but this is
>>> still pretty icky.
>>>
>>> J.C., do you know if your patch made it into the code-base?
>>>
>>> Has anyone else tested this patch? If so, on what operating systems?
>
> This patch took care of the defunct/zombie processes on "drop" events,
> but I've just discovered that it does not solve the underlying problem.
> It still appears that xymond_hostdata does not behave correctly
> following a "drop" command. The effect is that alerts fail to be
> delivered for _some_ messages because hostnames can no longer be
> retrieved.
>
> Example:
>
> My xymon server is humming along. I have the alert module debug-logging
> to alerts.log. Immediately after issuing a "drop" command of the sort:
>
> #xymon localhost "drop foo.bar.com sslcert"
>
> lines like the following appear in alerts.log. After this, some messages
> may result in alert emails being sent, but most quietly disappear.
> Currently, my resolution is to "xymon.sh restart" but that is much too
> heavy handed for long term use.
>
>> 21178 2015-10-05 16:39:43.257559 get_xymond_message: Interrupted
>> 21178 2015-10-05 16:39:43.257624 No files modified, skipping reload of
>> /opt/xymon/server/etc/alerts.cfg
>> 21178 2015-10-05 16:39:43.257680 No files modified, skipping reload of
>> /opt/xymon/server/etc/holidays.cfg
>> 21178 2015-10-05 16:39:43.257718 Checking criteria for host
>> 'doadrbjnu-sp.bar.com', which is not defined
>> 21178 2015-10-05 16:39:43.257773 Found a first matching rule
>> 21178 2015-10-05 16:39:43.257802 Checking criteria for host
>> 'doadrbjnu-sp.bar.com', which is not defined
>> 21178 2015-10-05 16:39:43.257830 Checking criteria for host
>> 'doadrbjnu-sp.bar.com', which is not defined
>> 21178 2015-10-05 16:39:43.257854 Found a first matching rule
>> 21178 2015-10-05 16:39:43.257879 Checking criteria for host
>> 'doadrbjnu-sp.bar.com', which is not defined
>> 21178 2015-10-05 16:39:43.257910 Checking criteria for host
>> 'steam.bar.com', which is not defined
>> 21178 2015-10-05 16:39:43.257935 Found a first matching rule
>> 21178 2015-10-05 16:39:43.257960 Checking criteria for host
>> 'steam.bar.com', which is not defined
>> 21178 2015-10-05 16:39:43.257986 Checking criteria for host
>> 'steam.bar.com', which is not defined
>> 21178 2015-10-05 16:39:43.258010 Found a first matching rule
>> 21178 2015-10-05 16:39:43.258035 Checking criteria for host
>> 'steam.bar.com', which is not defined
>> 21178 2015-10-05 16:39:43.258061 Checking criteria for host
>> 'upsjdc.bar.com', which is not defined
>> 21178 2015-10-05 16:39:43.258088 Found a first matching rule
>> 21178 2015-10-05 16:39:43.258113 Checking criteria for host
>> 'upsjdc.bar.com', which is not defined
>> 21178 2015-10-05 16:39:43.258140 Checking criteria for host
>> 'upsjdc.bar.com', which is not defined
>> 21178 2015-10-05 16:39:43.258164 Found a first matching rule
>> 21178 2015-10-05 16:39:43.258188 Checking criteria for host
>> 'upsjdc.bar.com', which is not defined
>> 21178 2015-10-05 16:39:43.258211 0 alerts to go
>> 21178 2015-10-05 16:39:43.258270 Want msg 5039, startpos 134769, fillpos
>> 134769, endpos -1, usedbytes=0, bufleft=131470
>> 21178 2015-10-05 16:39:47.962032 Got 2831 bytes
>> 21178 2015-10-05 16:39:47.962143 xymond_alert: Got message 5039
>> @@page#5039/soajnuexhs1.bar.com|1444091987.961845|10.2.3.40|soajnuexhs1.bar.com|msgs|0.0.0.0|1444093787|red|red|1444088306|ETS/MsgDir|540754||||
>> 21178 2015-10-05 16:39:47.962171 startpos 137600, fillpos 137600, endpos
>> -1
>> 21178 2015-10-05 16:39:47.962204 Got page message from
>> soajnuexhs1.bar.com:msgs
>> 21178 2015-10-05 16:39:47.962252 Want msg 5040, startpos 137600, fillpos
>> 137600, endpos -1, usedbytes=0, bufleft=128639
>> 21178 2015-10-05 16:39:58.022397 Got 297 bytes
>> 21178 2015-10-05 16:39:58.022526 xymond_alert: Got message 5040
>> @@page#5040/doadofjdc-ea05p.bar.com|1444091998.022274|10.2.167.44|doadofjdc-ea05p.bar.com|msgs|0.0.0.0|1444093798|green|red|1444091998|DOA/IRIS|||||
>> 21178 2015-10-05 16:39:58.022558 startpos 137897, fillpos 137897, endpos
>> -1
>> 21178 2015-10-05 16:39:58.022593 Got page message from
>> doadofjdc-ea05p.bar.com:msgs
>> 21178 2015-10-05 16:39:58.022630 Alert status changed from 1 to 0
>> 21178 2015-10-05 16:39:58.022666 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.022706 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.022739 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.022776 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.022808 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.022841 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.022873 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.022904 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.022935 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.022967 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.022998 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.023028 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.023059 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.023089 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.023120 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.023151 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.023187 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.023221 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.023252 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.023282 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.023313 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.023342 Checking criteria for host
>> 'doadofjdc-ea05p.bar.com', which is not defined
>> 21178 2015-10-05 16:39:58.023369 Found no first matching rule
>> 21178 2015-10-05 16:39:58.023402 Want msg 5041, startpos 137897, fillpos
>> 137897, endpos -1, usedbytes=0, bufleft=128342
>> 21178 2015-10-05 16:40:10.109262 get_xymond_message: Returning NULL due
>> to EOF
Hmm. This seems to be fundamentally a different issue than the "hostdata
module going rogue" thing, which was about zombies never being picked up.
AFAICT, somehow the hosts tree structure is getting clobbered as a result
of the drop (assuming all of those hosts are expected to exist).
There were a few patches for things in xymond.c at one point, and more
error checking when going to POSIX btrees generally, but I hadn't
encountered this in other intermittent hostlist readers.
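
One way to check that assumption is to tally which hostnames xymond_alert
reports as "not defined" and compare the list against hosts.cfg. What follows
is only a minimal Python sketch, assuming the debug lines keep the wording
shown in the excerpt above; the function name undefined_hosts and the script
name in the usage comment are hypothetical and not part of Xymon.

```python
import re
import sys
from collections import Counter

# Matches the debug line seen in the quoted excerpt. \s+ tolerates the
# line wrap that mail clients insert after the word "host".
UNDEFINED_HOST = re.compile(
    r"Checking criteria for host\s+'([^']+)', which is not defined"
)


def undefined_hosts(log_text):
    """Count how often each host is reported as 'not defined'."""
    # Strip email quote markers ("> ", ">> ") so pasted excerpts parse too.
    lines = [re.sub(r"^(?:>\s?)+", "", line) for line in log_text.splitlines()]
    return Counter(UNDEFINED_HOST.findall("\n".join(lines)))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python undefined_hosts.py /path/to/alerts.log
    with open(sys.argv[1]) as fh:
        for host, count in undefined_hosts(fh.read()).most_common():
            print(f"{count:6d}  {host}")
```

Any host in that tally that is still present in hosts.cfg would support the
idea that the in-memory host list, rather than the configuration, is what got
damaged by the drop.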
1) Which version of Solaris is this?
2) Have you experienced this in other workers for xymon? (i.e.,
xymond_client not being able to look up hostnames after a drop -- would
probably lead to random purples)
3) Does issuing a "reload" command or -HUP to xymond_alert re-sync things?
-jc