[Xymon] Possible defect in rrd handler causing divide-by-zero crashes

J.C. Cleaver cleaver at terabithia.org
Wed Apr 22 02:40:15 CEST 2015


On Tue, April 21, 2015 2:04 pm, John Thurston wrote:
> It has been a long road, but I may have uncovered a defect in the rrd
> handler. I'm currently running xymon 4.3.17 (somewhat patched) on
> Solaris 10 on SPARC.
>
> :: Symptom ::
> The xymond_rrd process crashes. It leaves footprints in the log like:
>> 2015-04-20 19:09:18 Child process 23929 died: Signal 8
>> 2015-04-20 19:09:18 Peer at 0.0.0.0:0 failed: Broken pipe
>> 2015-04-20 19:09:18 Peer not up, flushing message queue
> It also leaves a pid file behind.
> It also leaves gaps in the rrd data.
>
> :: Tracing ::
> By collecting core dumps and using pstack, I was able to see:
>> __divdi3+0x164(0, 0, 0, 0, 0, 0)
>> do_la_rrd+0x2d4(1017c8, 1017e4, 10181d, 101829, 1, 55369956)
>> update_rrd+0x76c(1017c8, 1017e4, 10182c, 55369956, 1017c0, 69000)
>> main+0xa40(10182c, ffbfdc3c, 101829, 68800, 3, 49538)
>> _start+0x5c(0, 0, 0, 0, 0, 0)
>
> By enabling --debug on the "xymond_channel --channel=status" task (after
> applying jc's patch of March 13 so that the debug code didn't crash
> xymond - Thank you!), I was able to see that the last messages received
> prior to each crash were always a sequence of status messages containing
> "mem" followed by "cpu".
>
> A "Signal 8" (SIGFPE) is commonly associated with a "divide by zero"
> event. Not being very C-aware, I had to consult Google to learn that
> __divdi3 is a libgcc routine used for signed 64-bit integer division. So
> if I'm reading the stack trace correctly, update_rrd called do_la_rrd,
> which tried to do a division that failed.
>
> :: Smoking gun? ::
> One of the hosts (running a BBPE 4.3 client), whose status message
> triggered a crash, has the following in its last CPU message:
>> Memory Statistics
>> Total Physical memory:                    0 bytes (1.00MB)
>> Available Physical memory:                0 bytes (1.00MB)
>> Total PageFile size:            17397915648 bytes (16.20GB)
>> Available PageFile size:        12473839616 bytes (11.62GB)
>> Total Virtual memory size:      17397915648 bytes (16.20GB)
>> Available Virtual memory size:  12487081984 bytes (11.63GB)
>
> And has the following in its last MEM message:
>> Physical Memory: 0.00MB/1.00MB (-1.#J%)
>> Commit Charge: 4.59GB/16.20GB (28.30%)
>>
>> Memory Statistics (Used/Available/Total in bytes)
>> Physical Memory:            0 (0.00MB) /            0 (1.00MB) /            0 (1.00MB)
>> Commit Charge:     4924112896 (4.59GB) /  12473839616 (11.62GB) /  17397915648 (16.20GB)
>> Virtual memory:    4910481408 (4.57GB) /  12487081984 (11.63GB) /  17397915648 (16.20GB)
>
> It isn't reasonable for a system to actually have 0MB of memory, but
> that is what the client has reported.
>
> :: Hypothesis ::
>
> The message handling code accepts messages from clients stating 0MB
> total physical memory, and that value makes its way into the RRD
> handler, where it causes a divide by zero.
>
> Can anyone else test this hypothesis?
>
> Can someone with more C-skills look at do_la_rrd and see if a zero
> really can find its way into its division statements?
>



Yikes.

Yep, that seems to be exactly the case! I believe the following patch
should fix it for you. Can you try it out?
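
Since the list archive scrubs attachments, here is a minimal sketch of the
kind of guard that avoids this class of crash. It is NOT the contents of
xymon.rrd_divzero.patch and is not taken from the Xymon source: the function
name percent_used and its parameters are hypothetical, and it assumes the
handler computes a percentage from a client-reported total and available
byte count.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not from the Xymon source): compute the percentage
 * of memory in use from client-reported totals.
 * Returns -1 when the totals are unusable, so the caller can skip the RRD
 * update instead of dividing by zero.
 */
static int percent_used(long long total, long long avail)
{
    /* A total of 0 (as reported by the client above) would trap on divide. */
    if (total <= 0) return -1;

    /* Reject inconsistent reports as well. */
    if (avail < 0 || avail > total) return -1;

    return (int)(((total - avail) * 100) / total);
}

/* Hypothetical caller: only report a value when the totals were sane. */
static void report_memory(const char *label, long long total, long long avail)
{
    int pct = percent_used(total, avail);

    if (pct < 0)
        printf("%s: bogus totals from client, skipping update\n", label);
    else
        printf("%s: %d%% used\n", label, pct);
}
```

With the figures from the quoted report, report_memory("physical", 0, 0)
takes the skip path, while the PageFile numbers (17397915648 total,
12473839616 available) still produce a normal percentage.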


HTH,

-jc
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xymon.rrd_divzero.patch
Type: application/octet-stream
Size: 475 bytes
Desc: not available
URL: <http://lists.xymon.com/pipermail/xymon/attachments/20150421/eb717aef/attachment.obj>

