[Xymon] xymongen crashes in 4.3.29
Japheth Cleaver
cleaver at terabithia.org
Tue Aug 27 22:15:01 CEST 2019
Alright, I believe I have fixed the issue here: there were multiple issues
within the availability code (fixed in
https://sourceforge.net/p/xymon/code/8081/), plus a typo in a Terabithia
patch. Please try out 4.3.30-0.6 from the /testing/ repo if possible.
You can also trigger a run manually by executing `xymoncmd
xymonreports.sh daily` as the xymon user. That should give a
reproducible crash on 4.3.30-0.5 and run cleanly on 4.3.30-0.6.
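For anyone curious about what the backtraces quoted below actually show:
frames #5-#9 have sprintf() (via the glibc _chk wrappers) trying to parse
a "format string" at address 0xfce, i.e. a small integer ended up where
the format pointer belongs. Purely as an illustration -- this is NOT the
real availability.c code or the exact fix, and the helper name and paths
are invented -- here is the kind of call-site mistake (a leftover size
argument from an snprintf/sprintf conversion) that produces exactly that
crash signature:

/*
 * Illustration only -- not the actual availability.c code and not the
 * confirmed fix. It just demonstrates the failure mode visible in the
 * backtrace: a leftover size argument becomes sprintf()'s "format"
 * pointer (e.g. 0xfce), glibc's printf parser dereferences it, the
 * SIGSEGV handler runs, and abort() produces the signal 6 core.
 */
#include <stdio.h>

#define MAXPATH 4096

/* Hypothetical helper: build the path of a history-log file. */
static void build_histlog_path(char *buf, size_t bufsz,
                               const char *topdir, const char *hostname,
                               const char *service, const char *timespec)
{
        /* Correct: the size argument belongs to snprintf, then the format. */
        snprintf(buf, bufsz, "%s/%s/%s/%s", topdir, hostname, service, timespec);

        /*
         * The broken pattern looks like this (do NOT do this):
         *
         *     sprintf(buf, bufsz - used, "%s/%s/%s/%s", ...);
         *
         * i.e. the call was switched back to sprintf() but the size
         * argument stayed behind, so a value such as 0xfce gets handed
         * to sprintf() as its format string -- matching frames #5-#9.
         */
}

int main(void)
{
        char fn[MAXPATH];

        build_histlog_path(fn, sizeof(fn), "/var/lib/xymon/histlogs",
                           "examplehost", "procs", "Wed_Sep_2_19:34:55_2015");
        printf("%s\n", fn);
        return 0;
}

A compiler will normally warn about an integer being passed where a char
pointer is expected, but in a large patched build that warning is easy to
miss in the noise.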
I've also built the EL7 packages on a CentOS 7 box, which should provide
proper compatibility while we're in a mixed 7.6/7.7 state in the ecosystem.
Regards,
-jc
On 8/23/2019 9:00 AM, Matt Vander Werf wrote:
> Hi JC,
>
> Unfortunately, this didn't seem to fix the crashes. Today I got the
> e-mail at 1:05 AM, though the core file has a timestamp of 1:04 AM.
> This time frame still matches up with it being the dailyreport task
> that is triggering the crashes (since there are no crashes any other
> time of the day).
>
> [root@<xymon server> ~]# xymoncmd --version
> Xymon version 4.3.30-0.5.el7.terabithia
> [root@<xymon server> ~]# cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 7.7 (Maipo)
>
> The latest core backtrace looks to be the same as previously (same
> client and service too), but I'm including it here [1] just for
> completeness.
>
> Let me know if there's anything else I can provide to debug this.
>
> Thanks!
>
>
> [1]
> [root@<xymon server> ~]# gdb -q /usr/libexec/xymon/xymongen core.1312
> Reading symbols from /usr/libexec/xymon/xymongen...Reading symbols
> from /usr/lib/debug/usr/libexec/xymon/xymongen.debug...done.
> done.
> [New LWP 1312]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `/usr/libexec/xymon/xymongen
> --reportopts=1566446400:1566532799:0:nongr --recent'.
> Program terminated with signal 6, Aborted.
> #0 0x00007fc746f7e377 in __GI_raise (sig=sig@entry=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:55
> 55 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
> (gdb) bt
> #0 0x00007fc746f7e377 in __GI_raise (sig=sig@entry=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:55
> #1 0x00007fc746f7fa68 in __GI_abort () at abort.c:90
> #2 0x00005559fc1ce4f5 in sigsegv_handler (signum=<optimized out>) at
> sig.c:57
> #3 <signal handler called>
> #4 strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
> #5 0x00007fc746f90681 in __find_specmb (format=0xfce <Address 0xfce
> out of bounds>) at printf-parse.h:109
> #6 _IO_vfprintf_internal (s=s@entry=0x7ffeb3e6d340,
> format=format@entry=0xfce <Address 0xfce out of bounds>,
> ap=ap@entry=0x7ffeb3e6d478) at vfprintf.c:1308
> #7 0x00007fc74705dc78 in ___vsprintf_chk (s=0x7ffeb3e6d6c2 "",
> flags=1, slen=18446744073709551615,
> format=0xfce <Address 0xfce out of bounds>,
> args=args@entry=0x7ffeb3e6d478) at vsprintf_chk.c:83
> #8 0x00007fc74705dbcd in ___sprintf_chk (s=<optimized out>,
> flags=flags@entry=1,
> slen=slen@entry=18446744073709551615, format=<optimized out>) at
> sprintf_chk.c:32
> #9 0x00005559fc1bf96a in sprintf (__fmt=<optimized out>,
> __s=<optimized out>)
> at /usr/include/bits/stdio2.h:33
> #10 parse_histlogfile (starttime=1566446400,
> timespec=0x5559fc431f50 <timespec.7157> "Wed_Sep_2_19:34:55_2015",
> servicename=0x5559fd61b2d0 "procs",
> hostname=0x5559fdc94520 "<client hostname>") at availability.c:174
> #11 parse_historyfile (fd=fd@entry=0x5559fdc9be00, repinfo=<optimized
> out>,
> hostname=0x5559fdc94520 "<client hostname>",
> servicename=0x5559fd61b2d0 "procs",
> fromtime=<optimized out>, totime=1566532799,
> for_history=for_history@entry=0, warnlevel=97,
> greenlevel=99.995000000000005, warnstops=-1, reporttime=0x0) at
> availability.c:475
> #12 0x00005559fc1b496c in init_state (filename=<optimized out>,
> filename@entry=0x7ffeb3e7f950 "<client hostname>.procs",
> log=log@entry=0x7ffeb3e7f860)
> at loaddata.c:275
> #13 0x00005559fc1b568e in load_state
> (sumhead=sumhead@entry=0x5559fc3fad48 <dispsums>) at loaddata.c:626
> #14 0x00005559fc1af794 in main (argc=5, argv=0x7ffeb3e84b58) at
> xymongen.c:599
>
>
> --
> Matt Vander Werf
>
>
> On Thu, Aug 22, 2019 at 5:29 PM Matt Vander Werf <matt1299 at gmail.com> wrote:
>
> Hi JC,
>
> Ah ha! That is one place I did not look and the timing certainly
> matches up!
>
> I have installed that new version on my Xymon server (running
> actual RHEL 7) and we'll see how it fares tomorrow morning....
>
> Thanks!
>
> --
> Matt Vander Werf
>
>
> On Thu, Aug 22, 2019 at 5:12 PM Japheth Cleaver <cleaver at terabithia.org> wrote:
>
> Hi,
>
> I think this might be xymongen in report mode from the
> "dailyreport" file in /tasks.d/; the timing would check out.
> I believe the problem here is one of the Terabithia patches
> now doing the wrong thing after some of the string-handling
> changes in 4.3.29 -- causing core dumps in certain situations.
>
> If you're running actual RHEL7 on this (not CentOS, which
> hasn't released 7.7 yet), would you mind checking the
> xymon-4.3.30-0.5 package in the EL7 Terabithia testing repo
> and seeing if it helps?
> https://repo.terabithia.org/rpms/xymon/testing/el7/x86_64/
>
> Regards,
> -jc
>
> On 8/22/2019 11:34 AM, Matt Vander Werf wrote:
>> Hi Torsten,
>>
>> No, there wasn't anything running from cron or anything else
>> around that time, let alone anything that restarts the
>> network or Xymon.
>>
>> Thanks.
>>
>> --
>> Matt Vander Werf
>>
>>
>> On Wed, Aug 21, 2019 at 5:43 AM Torsten Richter <bb4 at richter-it.net> wrote:
>>
>> Hi Matt,
>>
>> Dumb question: is there any cron job running at that time
>> that is restarting Xymon or fiddling with the network, like
>> restarting the network for some reason?
>>
>> Regards,
>> Torsten
>>
>>> Matt Vander Werf <matt1299 at gmail.com> wrote on 20 August
>>> 2019 at 17:10:
>>>
>>> Hi all,
>>>
>>> Every day since we updated our Xymon server to 4.3.29
>>> (from 4.3.28), I've gotten an e-mail alert due to xymond
>>> turning red that reads:
>>>
>>> red xymongen program crashed
>>>
>>> Fatal signal caught!
>>>
>>> The strange thing is that this has happened at 1:04 AM
>>> every day... like clockwork. I have xymongen set to run
>>> every minute, and it has no problems running at any other
>>> time of the day. We are using the Terabithia RPMs and
>>> the Xymon server is running RHEL 7.
>>>
>>> I've scoured the system to find anything that is set to
>>> run at/around that time via cron, etc. and haven't found
>>> anything. The system logs don't show anything
>>> happening around that time either.
>>>
>>> I turned on debug logging for xymond and xymongen and
>>> haven't been able to find anything unusual in either
>>> log around that time. But it is dumping core files for
>>> xymongen every time it crashes.
>>>
>>> I used gdb to get the backtrace from all of the core files
>>> (so far), and they all show the same thing. The same host
>>> shows up in every backtrace too (although I'm fairly
>>> confident the problem isn't specific or isolated to that
>>> host; it's probably just the first one xymongen runs into
>>> trouble with while processing).
>>>
>>> I've included an example gdb output below (the most
>>> recent one) [1].
>>>
>>> Is anyone else running into this by chance? Or any idea
>>> what might be the cause?
>>>
>>> Thanks!
>>>
>>>
>>> [1]
>>> # gdb -q /usr/libexec/xymon/xymongen core.16327
>>> Reading symbols from
>>> /usr/libexec/xymon/xymongen...Reading symbols from
>>> /usr/lib/debug/usr/libexec/xymon/xymongen.debug...done.
>>> done.
>>> [New LWP 16327]
>>> [Thread debugging using libthread_db enabled]
>>> Using host libthread_db library "/lib64/libthread_db.so.1".
>>> Core was generated by `/usr/libexec/xymon/xymongen
>>> --reportopts=1566187200:1566273599:0:nongr --recent'.
>>> Program terminated with signal 6, Aborted.
>>> #0 0x00007f4657c49377 in __GI_raise (sig=sig@entry=6)
>>> at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
>>> 55 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
>>> (gdb) bt
>>> #0 0x00007f4657c49377 in __GI_raise (sig=sig@entry=6)
>>> at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
>>> #1 0x00007f4657c4aa68 in __GI_abort () at abort.c:90
>>> #2 0x00005589375dd455 in sigsegv_handler
>>> (signum=<optimized out>) at sig.c:57
>>> #3 <signal handler called>
>>> #4 strchrnul () at ../sysdeps/x86_64/strchrnul.S:33
>>> #5 0x00007f4657c5b681 in __find_specmb (format=0xfce
>>> <Address 0xfce out of bounds>) at printf-parse.h:109
>>> #6 _IO_vfprintf_internal (s=s@entry=0x7ffd5dabcc00,
>>> format=format@entry=0xfce <Address 0xfce out of
>>> bounds>, ap=ap@entry=0x7ffd5dabcd38) at vfprintf.c:1308
>>> #7 0x00007f4657d28c78 in ___vsprintf_chk
>>> (s=0x7ffd5dabcf82 "", flags=1, slen=18446744073709551615,
>>> format=0xfce <Address 0xfce out of bounds>,
>>> args=args@entry=0x7ffd5dabcd38) at vsprintf_chk.c:83
>>> #8 0x00007f4657d28bcd in ___sprintf_chk (s=<optimized
>>> out>, flags=flags@entry=1,
>>> slen=slen@entry=18446744073709551615, format=<optimized
>>> out>) at sprintf_chk.c:32
>>> #9 0x00005589375ce8ca in sprintf (__fmt=<optimized
>>> out>, __s=<optimized out>)
>>> at /usr/include/bits/stdio2.h:33
>>> #10 parse_histlogfile (starttime=1566187200,
>>> timespec=0x558937840f50 <timespec.7157>
>>> "Wed_Sep_2_19:34:55_2015", servicename=0x5589383b6d70
>>> "procs",
>>> hostname=0x558938a335d0 "<client hostname>") at
>>> availability.c:174
>>> #11 parse_historyfile (fd=fd@entry=0x558938a3aea0,
>>> repinfo=<optimized out>,
>>> hostname=0x558938a335d0 "<client hostname>",
>>> servicename=0x5589383b6d70 "procs",
>>> fromtime=<optimized out>, totime=1566273599,
>>> for_history=for_history@entry=0, warnlevel=97,
>>> greenlevel=99.995000000000005, warnstops=-1,
>>> reporttime=0x0) at availability.c:475
>>> #12 0x00005589375c38cc in init_state
>>> (filename=<optimized out>,
>>> filename@entry=0x7ffd5dacf210 "<client
>>> hostname>.procs", log=log@entry=0x7ffd5dacf120)
>>> at loaddata.c:275
>>> #13 0x00005589375c45ee in load_state
>>> (sumhead=sumhead@entry=0x558937809d48 <dispsums>) at
>>> loaddata.c:626
>>> #14 0x00005589375be6f4 in main (argc=5,
>>> argv=0x7ffd5dad4418) at xymongen.c:599
>>>
>>>
>>> --
>>> Matt Vander Werf