[Xymon] xymon hostdata module going rogue

J.C. Cleaver cleaver at terabithia.org
Tue Dec 1 22:53:48 CET 2015


On Tue, December 1, 2015 1:41 pm, John Thurston wrote:
> On 12/1/2015 11:51 AM, J.C. Cleaver wrote:
>> On Tue, December 1, 2015 9:32 am, John Thurston wrote:
>> *snip*
>>
>>> In this occurrence, it does not appear to be related to a "drop"
>>> message. My last recorded "drop" was at 20151103-0846 and the alert
>>> process didn't start logging "which is not defined" until 20151120-0007
>>
>> Hmm. Okay, that does change things slightly. Fortunately, that means it's
>> probably specifically caused by drops per se. Were there any other errors
>> that occurred with other components around this time?
>
> I have several instances of "Oversize status msg from " in the
> xymond.log, but those are appearing six hours before the bad behavior
> appeared in xymon_alert. I have difficulty believing they are related.

Ack. Yeah, that should have been 'NOT specifically' :)


>> Perhaps the system
>> being low enough on memory that some re-allocations might have failed?
>
> I think this is unlikely. The system has 256GB of RAM, and there are no
> memory caps placed on the non-global zone in which xymon is running. I
> don't have information of its size on Nov 20, but today it is using about
> 400MB of RAM. All of the zones on the system are consuming less than
> 10GB of the 256GB and it wouldn't have been significantly different a
> few weeks ago.
>
> I've been doing some 'drops' today to try to break it, but haven't
> succeeded. I'll continue to beat on it and see if I can find a
> repeatable failure scenario.
>
> fwiw, this is under 4.3.22


Hmm.
This is an area where it's possible that glibc/NULL issues might be
causing subtle problems too. I could easily see the btree getting hosed by
re-insertion of a key we weren't really expecting.
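
To illustrate the kind of guard I have in mind, here is a minimal sketch.
It is not Xymon's actual tree code: node_t, key_cmp, tree_insert and
tree_find are hypothetical names, and a plain unbalanced binary tree stands
in for the real structure. The point is only that a NULL key never reaches
strcmp(), and that re-inserting an existing key updates the node in place
instead of linking a second one.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical node type for illustration; not Xymon's real tree. */
typedef struct node {
    const char *key;            /* borrowed pointer, never NULL once stored */
    void *value;
    struct node *left, *right;
} node_t;

/* Compare two keys without ever passing NULL to strcmp(), which is
 * undefined behaviour. NULL sorts before any non-NULL string. */
static int key_cmp(const char *a, const char *b)
{
    if (a == NULL || b == NULL)
        return (a != NULL) - (b != NULL);
    return strcmp(a, b);
}

/* Insert key/value.
 * Returns  1 if a new node was added,
 *          0 if the key already existed (its value is replaced in place),
 *         -1 on a NULL key or allocation failure (tree left untouched). */
static int tree_insert(node_t **root, const char *key, void *value)
{
    if (key == NULL)
        return -1;

    while (*root != NULL) {
        int c = key_cmp(key, (*root)->key);
        if (c == 0) {
            /* Unexpected re-insertion: update, do not add a duplicate. */
            (*root)->value = value;
            return 0;
        }
        root = (c < 0) ? &(*root)->left : &(*root)->right;
    }

    node_t *n = calloc(1, sizeof(*n));
    if (n == NULL)
        return -1;
    n->key = key;
    n->value = value;
    *root = n;
    return 1;
}

/* Look up a key; returns the stored value, or NULL if absent or key is NULL. */
static void *tree_find(node_t *root, const char *key)
{
    if (key == NULL)
        return NULL;
    while (root != NULL) {
        int c = key_cmp(key, root->key);
        if (c == 0)
            return root->value;
        root = (c < 0) ? root->left : root->right;
    }
    return NULL;
}
```

With guards like these, a second tree_insert() for the same hostname
returns 0 and leaves the tree shape alone, and a NULL key is rejected up
front rather than being compared against existing entries.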


-jc



