[Xymon] Possible defect in rrd handler causing divide-by-zero crashes

J.C. Cleaver cleaver at terabithia.org
Wed Apr 22 21:04:25 CEST 2015


On Wed, April 22, 2015 11:28 am, John Thurston wrote:
> On 4/21/2015 4:40 PM, J.C. Cleaver wrote:
>> On Tue, April 21, 2015 2:04 pm, John Thurston wrote:
>>> It has been a long road, but I may have uncovered a defect in the rrd
>>> handler. I'm currently running xymon 4.3.17 (somewhat patched) on
>>> Solaris 10 on SPARC.
>>>
>>> :: Symptom ::
>>> The xymond_rrd process crashes. It leaves footprints in the log like:
>>>> 2015-04-20 19:09:18 Child process 23929 died: Signal 8
>>>> 2015-04-20 19:09:18 Peer at 0.0.0.0:0 failed: Broken pipe
>>>> 2015-04-20 19:09:18 Peer not up, flushing message queue
>>> It also leaves a pid file behind.
>>> It also leaves gaps in the rrd data.
> - snip -
>>> :: Hypothesis ::
>>>
>>> The message handling code is accepting messages from clients stating
>>> 0MB
>>> total physical memory, but such information is making its way into the
>>> RRD handler and causing a divide by zero.
>>>
>>> Can anyone else test this hypothesis?
>>>
>>> Can someone with more C-skills look at do_la_rrd and see if a zero
>>> really can find its way into its division statements?
>
>> Yep, seems exactly like that's the case! I believe the following patch
>> should fix it for you. Can you try it out?
>
> Thank you!
>
> I created a script with which I could semi-reliably induce a crash by
> feeding a message claiming 0MB of physical memory. It isn't 100%
> reliable because I think there is some magic timing I haven't
> deciphered. But if I wait five or ten minutes between attempts, I can
> crash the unpatched process with my message.
>
> After applying your patch, I am _unable_ to crash the process with my
> message. I also found the "report had 0 total physical/pagefile memory
> listed" text in my rrd-status log.


Great to hear! Unfortunately, it also means a search for other
unvalidated zero divisions is probably warranted.
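
For anyone following along, the fix is just the usual "check before you
divide" guard. A minimal sketch of the idea, standing alone (the variable
and function names below are illustrative, not the actual ones in
do_la_rrd):

#include <stdio.h>

/*
 * Illustrative guard against a client report claiming 0 total memory.
 * Integer division by zero raises SIGFPE -- signal 8 on Solaris and
 * Linux -- which matches the "Child process 23929 died: Signal 8"
 * line in the log above.
 */
static int memory_pcts(long phystotal, long physused,
                       long swaptotal, long swapused,
                       long *physpct, long *swappct)
{
    if ((phystotal == 0) || (swaptotal == 0)) {
        /* Skip this update instead of crashing the child process */
        fprintf(stderr, "report had 0 total physical/pagefile memory listed\n");
        return 0;
    }

    *physpct = (100 * physused) / phystotal;
    *swappct = (100 * swapused) / swaptotal;
    return 1;
}

int main(void)
{
    long ppct, spct;

    /* A report of 0MB total physical memory is rejected, not divided */
    if (!memory_pcts(0, 0, 2048, 512, &ppct, &spct))
        fprintf(stderr, "update skipped\n");
    return 0;
}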


>
> Now I want to try to grasp the possible consequences of using this
> patch. Am I correct that by responding to this condition with "return
> 0", there will not be a call made to do_memory_rrd_update for this
> host/message combination? And that the worst consequence of this will be
> a possible gap in the data stored in the rrd for this host?


This is correct. Since no update is fed to the RRD for that report, the
RRD will eventually see 'NaN' instead of zeroes.
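
If you fetch the affected window afterwards, the skipped updates show up
as unknowns, e.g. (file name, DS name and values here are made up for
illustration):

  $ rrdtool fetch memory.real.rrd AVERAGE -s -900
                      realmempct
  1429729200: 4.2000000000e+01
  1429729500: nan
  1429729800: nan

Graphing tools render 'nan' as a gap in the line, whereas a stored 0
would draw a misleading drop to zero.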

The actual memory *status*, for better or for worse, is computed via a
different calculation (for Solaris, unix_memory_report() in
xymond_client.c). It appears that a phystotal of 0 there will cause the
Physical memory usage to be listed as '0', which would probably not
trigger anything on MEMPHYS alerts in analysis.cfg. (Not sure that's the
safest approach, but if there are clients that regularly report 0 total
RAM, the alternative might be worse.)
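
One calculation shape consistent with that (0% reported rather than a
crash) would be a guarded percentage; purely a hypothetical sketch, not
the actual unix_memory_report() code:

#include <stdio.h>

int main(void)
{
    /* What a client report of 0MB total RAM would yield */
    long phystotal = 0, physused = 0;

    /* Guarded percentage: a 0 total short-circuits to 0% instead of
       dividing by zero, so the status shows 0 and MEMPHYS rules in
       analysis.cfg have nothing to fire on. */
    long physpct = (phystotal > 0) ? (100 * physused) / phystotal : 0;

    printf("Physical memory used: %ld%%\n", physpct);
    return 0;
}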


Regards,

-jc

