[Xymon] xymon hostdata module going rogue

John Thurston john.thurston at alaska.gov
Tue Dec 1 18:14:20 CET 2015


How embarrassing. I was composing a note to mention a problem with the 
list archives not capturing all messages . . . when I discovered that 
the message for which I was searching was never sent to the list.

I composed the following message back in early October and then sent it 
only to myself :p  No wonder it didn't generate any chatter.

On 8/28/2015 3:12 PM, J.C. Cleaver wrote:
> On Fri, August 28, 2015 3:16 pm, John Thurston wrote:
>> On 8/28/2015 12:45 PM, John Thurston wrote:
>>> On 6/10/2015 9:01 AM, Scot Kreienkamp wrote:
>>>> I have a xymon server running 4.3.21 that seems to be accumulating
>>>> processes like these:
>>>>
>>>> hobbit   28430  0.0  0.0      0     0 ?        Z    12:50   0:00
>>>> [xymond_hostdata] <defunct>
  . . .
>>>>
>>>> It seemed related to drop messages . . .
>>>
>>> Hey, I think I'm seeing the same thing on Solaris with 4.3.21
>>>
>>> I've ended up here after a customer let me know that email alerts were
>>> not working as expected. After a few hours of digging around, I decided
>>> that the alert daemon was failing to retrieve hostnames and failing
>>> miserably.
>>>
>>> Have other people seen this behavior?
>>
>> I have duplicated this behavior on another xymon server on Solaris. It
>> certainly looks like this behavior breaks the alert daemon. Fortunately,
>> I "drop" hosts in batches so can restart Xymon at that time, but this is
>> still pretty icky.
>>
>> J.C., do you know if your patch made it into the code-base?
>>
>> Has anyone else tested this patch? If so, on what operating systems?

This patch took care of the defunct/zombie processes on "drop" events, 
but I've just discovered that it does not solve the underlying problem. 
It still appears that xymond_hostdata does not behave correctly 
following a "drop" command. The effect is that alerts fail to be 
delivered for _some_ messages because hostnames can no longer be retrieved.

Example:

My xymon server is humming along. I have the alert module debug-logging 
to alerts.log.  Immediately after issuing a "drop" command of the sort:

#xymon localhost "drop foo.bar.com sslcert"

messages of the following sort appear in alerts.log. After this, some 
status changes still result in alert emails being sent, but most quietly 
disappear. Currently, my workaround is "xymon.sh restart", but that is 
much too heavy-handed for long-term use.

> 21178 2015-10-05 16:39:43.257559 get_xymond_message: Interrupted
> 21178 2015-10-05 16:39:43.257624 No files modified, skipping reload of /opt/xymon/server/etc/alerts.cfg
> 21178 2015-10-05 16:39:43.257680 No files modified, skipping reload of /opt/xymon/server/etc/holidays.cfg
> 21178 2015-10-05 16:39:43.257718 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined
> 21178 2015-10-05 16:39:43.257773 Found a first matching rule
> 21178 2015-10-05 16:39:43.257802 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined
> 21178 2015-10-05 16:39:43.257830 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined
> 21178 2015-10-05 16:39:43.257854 Found a first matching rule
> 21178 2015-10-05 16:39:43.257879 Checking criteria for host 'doadrbjnu-sp.bar.com', which is not defined
> 21178 2015-10-05 16:39:43.257910 Checking criteria for host 'steam.bar.com', which is not defined
> 21178 2015-10-05 16:39:43.257935 Found a first matching rule
> 21178 2015-10-05 16:39:43.257960 Checking criteria for host 'steam.bar.com', which is not defined
> 21178 2015-10-05 16:39:43.257986 Checking criteria for host 'steam.bar.com', which is not defined
> 21178 2015-10-05 16:39:43.258010 Found a first matching rule
> 21178 2015-10-05 16:39:43.258035 Checking criteria for host 'steam.bar.com', which is not defined
> 21178 2015-10-05 16:39:43.258061 Checking criteria for host 'upsjdc.bar.com', which is not defined
> 21178 2015-10-05 16:39:43.258088 Found a first matching rule
> 21178 2015-10-05 16:39:43.258113 Checking criteria for host 'upsjdc.bar.com', which is not defined
> 21178 2015-10-05 16:39:43.258140 Checking criteria for host 'upsjdc.bar.com', which is not defined
> 21178 2015-10-05 16:39:43.258164 Found a first matching rule
> 21178 2015-10-05 16:39:43.258188 Checking criteria for host 'upsjdc.bar.com', which is not defined
> 21178 2015-10-05 16:39:43.258211 0 alerts to go
> 21178 2015-10-05 16:39:43.258270 Want msg 5039, startpos 134769, fillpos 134769, endpos -1, usedbytes=0, bufleft=131470
> 21178 2015-10-05 16:39:47.962032 Got 2831 bytes
> 21178 2015-10-05 16:39:47.962143 xymond_alert: Got message 5039 @@page#5039/soajnuexhs1.bar.com|1444091987.961845|10.2.3.40|soajnuexhs1.bar.com|msgs|0.0.0.0|1444093787|red|red|1444088306|ETS/MsgDir|540754||||
> 21178 2015-10-05 16:39:47.962171 startpos 137600, fillpos 137600, endpos -1
> 21178 2015-10-05 16:39:47.962204 Got page message from soajnuexhs1.bar.com:msgs
> 21178 2015-10-05 16:39:47.962252 Want msg 5040, startpos 137600, fillpos 137600, endpos -1, usedbytes=0, bufleft=128639
> 21178 2015-10-05 16:39:58.022397 Got 297 bytes
> 21178 2015-10-05 16:39:58.022526 xymond_alert: Got message 5040 @@page#5040/doadofjdc-ea05p.bar.com|1444091998.022274|10.2.167.44|doadofjdc-ea05p.bar.com|msgs|0.0.0.0|1444093798|green|red|1444091998|DOA/IRIS|||||
> 21178 2015-10-05 16:39:58.022558 startpos 137897, fillpos 137897, endpos -1
> 21178 2015-10-05 16:39:58.022593 Got page message from doadofjdc-ea05p.bar.com:msgs
> 21178 2015-10-05 16:39:58.022630 Alert status changed from 1 to 0
> 21178 2015-10-05 16:39:58.022666 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.022706 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.022739 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.022776 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.022808 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.022841 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.022873 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.022904 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.022935 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.022967 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.022998 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.023028 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.023059 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.023089 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.023120 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.023151 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.023187 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.023221 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.023252 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.023282 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.023313 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.023342 Checking criteria for host 'doadofjdc-ea05p.bar.com', which is not defined
> 21178 2015-10-05 16:39:58.023369 Found no first matching rule
> 21178 2015-10-05 16:39:58.023402 Want msg 5041, startpos 137897, fillpos 137897, endpos -1, usedbytes=0, bufleft=128342
> 21178 2015-10-05 16:40:10.109262 get_xymond_message: Returning NULL due to EOF
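
To see how many hosts the alert module has lost track of, the debug log can be tallied by hostname. What follows is only a rough sketch, assuming the debug lines keep the exact wording "Checking criteria for host '...', which is not defined" shown in the excerpt; the function names and the idea of passing in a log path are my own illustration, not part of Xymon.

```python
import re
from collections import Counter

# Wording copied from the debug excerpt above; adjust if your version logs it differently.
UNDEFINED_HOST = re.compile(
    r"Checking criteria for host '([^']+)', which is not defined"
)


def count_undefined_hosts(lines):
    """Return a Counter mapping hostname -> number of 'not defined' log lines."""
    counts = Counter()
    for line in lines:
        match = UNDEFINED_HOST.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def report(path):
    """Print the per-host tally for a log file (path supplied by the caller)."""
    with open(path, errors="replace") as handle:
        for host, hits in count_undefined_hosts(handle).most_common():
            print(f"{hits:6d}  {host}")
```

Calling report() with the location of alerts.log before and after a "drop" should show whether the list of "not defined" hosts appears only after the drop.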




-- 
    Do things because you should, not just because you can.

John Thurston    907-465-8591
John.Thurston at alaska.gov
Enterprise Technology Services
Department of Administration
State of Alaska
