[Xymon] Possible defect in rrd handler causing divide-by-zero crashes

John Thurston john.thurston at alaska.gov
Wed Apr 22 20:28:43 CEST 2015


On 4/21/2015 4:40 PM, J.C. Cleaver wrote:
> On Tue, April 21, 2015 2:04 pm, John Thurston wrote:
>> It has been a long road, but I may have uncovered a defect in the rrd
>> handler. I'm currently running xymon 4.3.17 (somewhat patched) on
>> Solaris 10 on SPARC.
>>
>> :: Symptom ::
>> The xymond_rrd process crashes. It leaves footprints in the log like:
>>> 2015-04-20 19:09:18 Child process 23929 died: Signal 8
>>> 2015-04-20 19:09:18 Peer at 0.0.0.0:0 failed: Broken pipe
>>> 2015-04-20 19:09:18 Peer not up, flushing message queue
>> It also leaves a pid file behind.
>> It also leaves gaps in the rrd data.
- snip -
>> :: Hypothesis ::
>>
>> The message handling code is accepting messages from clients stating 0MB
>> total physical memory, but such information is making its way into the
>> RRD handler and causing a divide by zero.
>>
>> Can anyone else test this hypothesis?
>>
>> Can someone with more C-skills look at do_la_rrd and see if a zero
>> really can find its way into its division statements?

> Yep, seems exactly like that's the case! I believe the following patch
> should fix it for you. Can you try it out?

Thank you!

I created a script with which I can semi-reliably induce a crash by
feeding in a message claiming 0MB of physical memory. It isn't 100%
reliable; I suspect there is a timing dependency I haven't
deciphered. But if I wait five or ten minutes between attempts, I can
crash the unpatched process with my message.

After applying your patch, I am _unable_ to crash the process with my 
message. I also found the "report had 0 total physical/pagefile memory 
listed" text in my rrd-status log.

Now I want to try to grasp the possible consequences of using this 
patch. Am I correct that by responding to this condition with "return 
0", there will not be a call made to do_memory_rrd_update for this 
host/message combination? And that the worst consequence of this will be 
a possible gap in the data stored in the rrd for this host?
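
To make the question concrete, here is a minimal sketch in C of the kind of guard being discussed. It is not the actual patch, which is not reproduced in this message. The names handle_memory_report and fake_rrd_update, the parameters, and the percentage calculation are all hypothetical stand-ins; only the "return 0" and the log wording come from the thread.

```c
#include <stdio.h>

/* Hypothetical stand-in for the real RRD update call: it only records
 * what it was asked to store, so the effect of the guard is visible. */
static int updates_made = 0;
static int last_pct = -1;

static void fake_rrd_update(const char *host, int pct_used)
{
    (void)host;               /* unused in this sketch */
    last_pct = pct_used;
    updates_made++;
}

/* Handle one memory report (hypothetical signature).
 * Returns 0 in both cases, mirroring the "return 0" in the question. */
static int handle_memory_report(const char *host, long total_mb, long used_mb)
{
    if (total_mb <= 0) {
        /* Assumed log format; the wording is taken from the thread. */
        fprintf(stderr,
                "%s: report had 0 total physical/pagefile memory listed\n",
                host);
        return 0;             /* bail out before the division; no update */
    }

    /* Safe: total_mb is known to be non-zero here. */
    fake_rrd_update(host, (int)((100 * used_mb) / total_mb));
    return 0;
}
```

Under these assumptions, a report with total_mb of 0 is logged and dropped without reaching the update call, so the only visible effect would be a missing data point for that host and interval. Whether the real handler behaves the same way depends on the patch itself.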

-- 
    Do things because you should, not just because you can.

John Thurston    907-465-8591
John.Thurston at alaska.gov
Enterprise Technology Services
Department of Administration
State of Alaska


