[Xymon] xymond_rrd - Program crashed after fresh install of Xymon 4.3.30 and data from Xymon 4.3.17

Andrey Chervonets A.Chervonets at cominder.eu
Thu Oct 17 13:37:47 CEST 2019


To get more information I have enabled "--debug" on both channels (status 
and data).
This gives a bit more detail in rrd-status.log:
....
2019-10-17 13:40:02.376153 Host 'synologyhost.domain.eu' reports netstat 
for an unknown OS
408 2019-10-17 13:40:02.376181 Flush, but xymonmsg is empty
408 2019-10-17 13:40:02.376185 0 status messages merged into 1 
transmissions
408 2019-10-17 13:40:02.376203 xymond_rrd: Got message 612 
@@status#612/synologyhost.domain.eu|1571308802.357389|83.99.221.6||synologyhost.domain.eu|procs|1571326802|green||green|1570620002|0||0||1571051696||p_cominder|0|
408 2019-10-17 13:40:02.376210 startpos 95710, fillpos 99309, endpos 97006
408 2019-10-17 13:40:02.376227 Flush, but xymonmsg is empty
408 2019-10-17 13:40:02.376233 0 status messages merged into 1 
transmissions
408 2019-10-17 13:40:02.376244 xymond_rrd: Got message 613 
@@status#613/synologyhost.domain.eu|1571308802.357673|83.99.221.6||synologyhost.domain.eu|raid|1571326802|green||green|1570620002|0||0||1571051696||p_cominder|0|
408 2019-10-17 13:40:02.376251 startpos 97010, fillpos 99309, endpos 97945
408 2019-10-17 13:40:02.376269 Flush, but xymonmsg is empty
408 2019-10-17 13:40:02.376276 0 status messages merged into 1 
transmissions
408 2019-10-17 13:40:02.376288 xymond_rrd: Got message 614 
@@status#614/synologyhost.domain.eu|1571308802.368308|83.99.221.6||synologyhost.domain.eu|temperature|1571326802|green||green|1570620002|0||0||1571051696||p_cominder|0|
408 2019-10-17 13:40:02.376294 startpos 97949, fillpos 99309, endpos 98645
2019-10-17 13:40:02.381339 Child process 408 died: Signal 6
2019-10-17 13:40:04.432302 Peer at 0.0.0.0:0 failed: Broken pipe
2019-10-17 13:40:04.452708 Peer not up, flushing message queue
13920 2019-10-17 13:40:04.557656  setup_feedback_queue: got ID -1 for key 
0xA03EB91
13920 2019-10-17 13:40:04.558141 Opening file 
/u01/app/xymon/product/xymon4.3.30/server/etc/rrddefinitions.cfg
13920 2019-10-17 13:40:04.558326 Want msg 1, startpos 0, fillpos 0, endpos 
-1, usedbytes=0, bufleft=1052671
13920 2019-10-17 13:40:04.558359 Got 6716 bytes
...
Here we can see the processing of data from our Synology NAS, which has 
Synology Monitoring Tool 1.4.8 (http://www.sysco.ch/synomon/) enabled.
Note that despite the RRD crash, the "temperature" status itself is still 
shown correctly, with a good status and text like:
--
Device             Temp(C)   Temp(F)
---------------------------------------
green    system         52      125
green    /dev/sda       36      96
green    /dev/sdb       38      100
green    /dev/sdd       36      96
---------------------------------------

Synology Monitoring Tool 1.4.8, http://www.sysco.ch/synomon/
Model: RS812+ (synologyhost,domain.eu)
Processor: Intel(R) Atom(TM) CPU D2701   @ 2.13GHz
System temperature: 52°C
Serial number: serialnumberdata-replaced
Firmware: 6.2-24922
MAC address(s): number-replaced, number-replaced
Linux version 3.10.105 (root@build10) (gcc version 4.9.3 20150311 
(prerelease) (crosstool-NG 1.20.0) ) #24922 SMP Fri May 10 02:51:01 CST 
2019
--

After stopping the plugin on the Synology we received no more data from it 
and xymond_rrd no longer crashed (red changed to purple, as expected).
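
The link between the last message handled and the crash can also be read 
out of rrd-status.log mechanically. Below is a minimal sketch in Python, 
assuming the debug log keeps the format of the excerpt above (a PID prefix 
on the "Got message" lines, an "@@status#..." header either on the same 
line or on the next one, and a "Child process N died: Signal S" line). The 
function name and the field positions are my own reading of the excerpt, 
not something taken from the Xymon sources.

```python
import re

# Line shapes taken from the rrd-status.log excerpt (assumed to be stable).
GOT_MSG = re.compile(r"^(\d+) .*xymond_rrd: Got message \d+")
DIED = re.compile(r"Child process (\d+) died: Signal (\d+)")


def last_message_before_crash(lines):
    """Return one (pid, signal, host, test) tuple per 'died' line.

    host and test come from the most recent @@status header logged by
    that pid; both are None if no header was seen before the crash.
    """
    last_seen = {}      # pid -> (host, test)
    pending_pid = None  # pid of the most recent "Got message" line
    crashes = []
    for line in lines:
        line = line.rstrip("\n")
        m = GOT_MSG.match(line)
        if m:
            pending_pid = m.group(1)
        # The header may follow "Got message" on the same or the next line.
        pos = line.find("@@status#")
        if pos != -1 and pending_pid is not None:
            fields = line[pos:].split("|")
            if len(fields) > 5:
                # In the excerpt, index 4 is the host and index 5 the test.
                last_seen[pending_pid] = (fields[4], fields[5])
        m = DIED.search(line)
        if m:
            host, test = last_seen.get(m.group(1), (None, None))
            crashes.append((m.group(1), int(m.group(2)), host, test))
    return crashes
```

It can be fed an open log file directly, e.g. 
last_message_before_crash(open(path)) with path pointing at 
rrd-status.log. If the same test name shows up for every crash, that test 
is the first thing to look at.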

I am not sure where the problem/bug is, so I have added the e-mail address 
of the Synology Monitoring Tool developers to our communication.

Please review and give us a hint on how we can fix the problem - monitoring 
the state of our NAS is quite critical for us.

The suspicion has also been confirmed by the GDB output (as instructed at: 
http://www.robertandrobert.com/xymon/help/known-issues.html ):
--
[xymon@synologyhost server]$ /bin/gdb 
/u01/app/xymon/product/xymon4.3.30/server/bin/xymond_rrd  tmp/core.408
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7
... copyright...
...
Reading symbols from 
/u01/app/xymon/product/xymon4.3.30/server/bin/xymond_rrd...done.
[New LWP 408]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `xymond_rrd 
--rrddir=/u01/app/xymon/product/xymon4.3.30/data/rrd --debug'.
Program terminated with signal 6, Aborted.
#0  0x00007f62fcd85337 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install 
bzip2-libs-1.0.6-13.el7.x86_64 cairo-1.15.12-4.el7.x86_64 
expat-2.1.0-10.el7_3.x86_64 fontconfig-2.13.0-4.3.el7.x86_64 
freetype-2.8-14.el7.x86_64 fribidi-1.0.2-1.el7.x86_64 
glib2-2.56.1-5.el7.x86_64 glibc-2.17-292.el7.x86_64 
graphite2-1.3.10-1.el7_3.x86_64 harfbuzz-1.7.5-2.el7.x86_64 
keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_7.2.x86_64 
libX11-1.6.7-2.el7.x86_64 libXau-1.0.8-2.1.el7.x86_64 
libXext-1.3.3-3.el7.x86_64 libXrender-0.9.10-1.el7.x86_64 
libcom_err-1.42.9-16.el7.x86_64 libffi-3.0.13-18.el7.x86_64 
libgcc-4.8.5-39.el7.x86_64 libglvnd-1.0.1-0.8.git5baa1e5.el7.x86_64 
libglvnd-egl-1.0.1-0.8.git5baa1e5.el7.x86_64 
libglvnd-glx-1.0.1-0.8.git5baa1e5.el7.x86_64 libpng-1.5.13-7.el7_2.x86_64 
libselinux-2.5-14.1.el7.x86_64 libthai-0.1.14-9.el7.x86_64 
libtirpc-0.2.4-0.16.el7.x86_64 libuuid-2.23.2-61.el7.x86_64 
libxcb-1.13-1.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 
openssl-libs-1.0.2k-19.el7.x86_64 pango-1.42.4-4.el7_7.x86_64 
pcre-8.32-17.el7.x86_64 pixman-0.34.0-1.el7.x86_64 
rrdtool-1.4.8-9.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 
zlib-1.2.7-18.el7.x86_64
(gdb)
(gdb)
(gdb) bt
#0  0x00007f62fcd85337 in raise () at /lib64/libc.so.6
#1  0x00007f62fcd86a28 in abort () at /lib64/libc.so.6
#2  0x0000000000428e63 in sigsegv_handler (signum=<optimized out>) at 
sig.c:57
#3  0x00007f62fcd853b0 in <signal handler called> () at /lib64/libc.so.6
#4  0x00007f62fcd89f97 in ____strtoll_l_internal () at /lib64/libc.so.6
#5  0x000000000040f9c2 in do_temperature_rrd (__nptr=0x0) at 
/usr/include/stdlib.h:280
#6  0x000000000040f9c2 in do_temperature_rrd 
(hostname=hostname@entry=0x7f62fdfceb43 "synologyhost.domain.eu", 
testname=testname@entry=0x7f62fdfceb58 "temperature", 
classname=classname@entry=0x7f62fdfceb99 "p_cominder", 
pagepaths=pagepaths@entry=0x7f62fdfceba4 "0", msg=msg@entry=0x7f62fdfceba7 
"status+300 synologyhost,domain.eu.temperature green 2019-10-17 13:40:01 
[synologyhost.domain.eu] - temperature\nDevice", ' ' <repeats 13 times>, 
"Temp(C)   Temp(F)\n", '-' <repeats 39 times>, "\n&green    system"..., 
tstamp=tstamp@entry=1571308802) at rrd/do_temperature.c:100
#7  0x000000000041316b in update_rrd 
(hostname=hostname@entry=0x7f62fdfceb43 "synologyhost.domain.eu", 
testname=<optimized out>,
    testname@entry=0x7f62fdfceb58 "temperature", 
msg=msg@entry=0x7f62fdfceba7 "status+300 
synologyhost,domain.eu.temperature green 2019-10-17 13:40:01 
[synologyhost.domain.eu] - temperature\nDevice", ' ' <repeats 13 times>, 
"Temp(C)   Temp(F)\n", '-' <repeats 39 times>, "\n&green    system"..., 
tstamp=tstamp@entry=1571308802, sender=sender@entry=0x7f62fdfceb36 
"83.99.221.6", ldef=<optimized out>, 
classname=classname@entry=0x7f62fdfceb99 "p_cominder", 
pagepaths=pagepaths@entry=0x7f62fdfceba4 "0") at do_rrd.c:714
#8  0x0000000000403434 in main (argc=<optimized out>, argv=0x7ffffb4bd4b8) 
at xymond_rrd.c:391
(gdb)
--

So, we know which metric causes the RRD crash and we have a workaround (which 
keeps RRD working so that graphs for the other metrics are generated),
but we need a better solution to make everything work as expected.
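
Frames #4 and #5 of the backtrace show strtoll() being called with a null 
pointer (__nptr=0x0) from do_temperature_rrd at rrd/do_temperature.c:100, 
which suggests that a numeric field was expected somewhere in the status 
text but not found. The parsing code itself is not shown here, so the 
following is only a rough pre-check in Python and not a reimplementation 
of Xymon's parser. It assumes device lines look like those in the 
"temperature" status above, i.e. "&<colour> <device> <Temp(C)> <Temp(F)>", 
and it must be given the raw text with the "&green" markers still in 
place (as in the msg argument of the backtrace). The function name is made 
up for this example.

```python
import re

# A device line starts with "&" and a colour word; the rest is the payload.
DEVICE_LINE = re.compile(r"^&(\w+)(.*)$")
INTEGER = re.compile(r"-?\d+")


def check_temperature_lines(status_text):
    """Return (line_number, line, problem) for each suspicious device line."""
    problems = []
    for number, line in enumerate(status_text.splitlines(), start=1):
        m = DEVICE_LINE.match(line)
        if not m:
            continue  # header, separator or footer text, not a device line
        tokens = m.group(2).split()
        # Assumed layout after the colour: <device> <temp C> <temp F>.
        if len(tokens) < 3:
            problems.append((number, line, "fewer than 3 fields after the colour"))
            continue
        for label, value in (("Temp(C)", tokens[-2]), ("Temp(F)", tokens[-1])):
            if not INTEGER.fullmatch(value):
                problems.append((number, line, f"{label} is not an integer: {value!r}"))
    return problems
```

The four device lines shown in this mail would pass this check, so an empty 
result does not rule the message out; it only means the trigger is not a 
missing number on a device line and has to be looked for elsewhere in the 
message text.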

P.S. Note: the real hostname is replaced in all output submitted in this 
e-mail (in case any checksums are involved).


Best regards,

Andrey Chervonets
----------------------
CoMinder Support
http://www.cominder.eu/
mobile: +371 26517848

 


"Xymon" <xymon-bounces at xymon.com> wrote on 15.10.2019 13:00:01:

> From: xymon-request at xymon.com
> To: xymon at xymon.com
> Date: 15.10.2019 13:00
> Subject: Xymon Digest, Vol 105, Issue 9
> Sent by: "Xymon" <xymon-bounces at xymon.com>
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Mon, 14 Oct 2019 15:09:53 +0300
> From: Andrey Chervonets <A.Chervonets at cominder.eu>
> To: xymon at xymon.com
> Subject: [Xymon] xymond_rrd - Program crashed after fresh install of
>    Xymon 4.3.30 and data from Xymon 4.3.17
> Message-ID:
> <OFD5D1CD2D.3E1D4B14-ONC2258493.00408D6C-C2258493.0042D300 at cominder.eu>
> 
> Content-Type: text/plain; charset="us-ascii"
> 
> Good day!
> 
> Recently we have installed Xymon 4.3.30 on a new VM (CentOS Linux release 
> 7.7.1908 (Core)), a guest under KVM.
> Guest Kernel:   3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 
> 2019 x86_64 x86_64 x86_64 GNU/Linux
> 
> All OK, except xymond_rrd is crashing frequently - the "xymond_rrd" metric 
> is always red (was never green) with the message:
>  - Program crashed
> Fatal signal caught!
> 
> In rrd-status.log we can find frequent messages like:
> 
> 2019-10-14 14:35:03.609265 Child process 2997 died: Signal 6
> 2019-10-14 14:35:04.239677 Peer at 0.0.0.0:0 failed: Broken pipe
> 2019-10-14 14:35:08.886124 Peer not up, flushing message queue
> 2019-10-14 14:36:45.883398 Host 'synologyhost.domain.eu' reports netstat 
> for an unknown OS
> 2019-10-14 14:36:45.888875 Child process 21622 died: Signal 6
> 2019-10-14 14:36:52.510319 Peer at 0.0.0.0:0 failed: Broken pipe
> 2019-10-14 14:36:52.510720 Peer not up, flushing message queue
> 2019-10-14 14:40:02.689062 Host 'synologyhost.domain.eu' reports netstat 
> for an unknown OS
> 2019-10-14 14:40:02.694320 Child process 28158 died: Signal 6
> 2019-10-14 14:40:05.119354 Peer at 0.0.0.0:0 failed: Broken pipe
> 2019-10-14 14:40:05.250422 Peer not up, flushing message queue
> 
> Note: lines like "Host 'synologyhost.domain.eu' reports netstat for an 
> unknown OS" are coming from the Synology NAS with the Monitoring package 
> installed.
> I am sure it is not related - it was working on the old Xymon 4.3.17 
> (CentOS 6.6)
> 
> After the fresh installation we just remapped (with a symbolic link) the 
> data directory to continue using the old data logs and rra.
> 
> There are plenty of core files under server/tmp/
> srw-rw-rw- 1 xymon monitor       0 Oct 14 14:40 rrdctl.572
> -rw------- 1 xymon monitor 3252224 Oct 14 14:45 core.572
> srw-rw-rw- 1 xymon monitor       0 Oct 14 14:45 rrdctl.17027
> -rw------- 1 xymon monitor 3248128 Oct 14 14:50 core.17027
> srw-rw-rw- 1 xymon monitor       0 Oct 14 14:50 rrdctl.30574
> -rw------- 1 xymon monitor 3248128 Oct 14 14:55 core.30574
> srw-rw-rw- 1 xymon monitor       0 Oct 14 14:55 rrdctl.13275
> -rw------- 1 xymon monitor 3239936 Oct 14 15:00 core.13275
> -rw-r--r-- 1 xymon monitor 1887355 Oct 14 15:02 xymond.chk
> -rw-r--r-- 1 xymon monitor       0 Oct 14 15:02 alert.chk.sub
> -rw-r--r-- 1 xymon monitor   70921 Oct 14 15:02 alert.chk
> srw-rw-rw- 1 xymon monitor       0 Oct 14 15:02 rrdctl.5887
> srw-rw-rw- 1 xymon monitor       0 Oct 14 15:02 rrdctl.5954
> -rw------- 1 xymon monitor 3764224 Oct 14 15:05 core.5887
> srw-rw-rw- 1 xymon monitor       0 Oct 14 15:05 rrdctl.10234
> 
> 
> Question: How can we diagnose the cause of the problem?
> 
> 
> 
> Best regards,
> 
> Andrey Chervonets
> ----------------------
> SIA CoMinder
> http://www.cominder.eu/
> mobile: +371 26517848