[Xymon] xymond crashing! -- Please help!

Sat Jan 30 19:45:44 CET 2016

Hi J.C.,

So it appears that moving the checkpoint file out of the way only fixed it
temporarily.

If I stop the service and start it back up again, it crashes again.

I think I figured out how to read the core file and get a backtrace for you.

Here's what I got from the most recent crash (with some host names
obfuscated):

[New LWP 13283]
Reading symbols from /usr/sbin/xymond...Reading symbols from
/usr/lib/debug/usr/sbin/xymond.debug...done.
done.
Missing separate debuginfo for
Try: yum --enablerepo='*debug*' install
/usr/lib/debug/.build-id/33/97b0d696701dbd7c09eb4bf023f7f4eebec9ed
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `xymond --restart=/var/lib/xymon/tmp/xymond.chk
--checkpoint-file=/var/lib/xymon'.
Program terminated with signal 6, Aborted.
#0  0x00007f570e29a5f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install
glibc-2.17-106.el7_2.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64
krb5-libs-1.13.2-10.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64
libselinux-2.2.2-6.el7.x86_64 lz4-r131-1.el7.x86_64
openssl-libs-1.0.1e-51.el7_2.2.x86_64 pcre-8.32-15.el7.x86_64
xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) backtrace
#0  0x00007f570e29a5f7 in raise () from /lib64/libc.so.6
#1  0x00007f570e29bce8 in abort () from /lib64/libc.so.6
#2  0x00007f570f53cdf5 in sigsegv_handler (signum=<optimized out>) at sig.c:57
#3  <signal handler called>
#4  0x00007f570f5403b4 in xtree_i_compare (pa=0x7ffead8cb9a0, pb=0x2020202020202020) at tree.c:47
#5  0x00007f570e3574c0 in tfind () from /lib64/libc.so.6
#6  0x00007f570f5405d4 in xtreeFind (treehandle=<optimized out>, key=key@entry=0x7f57142cb320 "*<client hostname>*") at tree.c:140
#7  0x00007f570f5386bd in get_clientconfig (hostname=hostname@entry=0x7f57142cb320 "*<client hostname>*", hostclass=hostclass@entry=0x7f57208e4612 "linux",
    hostos=hostos@entry=0x7f57208e460c "linux") at clientlocal.c:192
#8  0x00007f570f535dec in do_message (msg=msg@entry=0x7f572064c300, origin=origin@entry=0x7f570f550e97 "", can_respond=can_respond@entry=1) at xymond.c:4955
#9  0x00007f570f5282c7 in main (argc=<optimized out>, argv=<optimized out>) at xymond.c:6288
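
For reference, a backtrace like the one above can also be produced
non-interactively. The following is only a rough sketch: the core file
naming (core or core.<pid>), the helper names newest_core and backtrace_cmd,
and the CORE_DIR variable are assumptions for illustration, not something
taken from the Xymon packaging. It picks the most recently written core file
and prints the gdb command to run against it.

```shell
#!/bin/sh
# Print the most recently modified core file in a directory (empty if none).
# ASSUMPTION: core files are named "core" or "core.<pid>".
newest_core() {
    dir=$1
    ls -t "$dir"/core* 2>/dev/null | head -n 1
}

# Print a gdb command line that dumps a backtrace and exits.
# $1 = path to the binary, $2 = path to the core file.
backtrace_cmd() {
    printf 'gdb --batch -ex backtrace %s %s\n' "$1" "$2"
}

# CORE_DIR is a hypothetical override; the default is the directory
# mentioned in this thread.
core=$(newest_core "${CORE_DIR:-/var/lib/xymon/tmp}")
if [ -n "$core" ]; then
    backtrace_cmd /usr/sbin/xymond "$core"
fi
```

The printed command can be run as-is; having the xymon debuginfo package
installed is what makes the file and line numbers show up in the frames.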

Is this what you wanted? Do you want me to install the debug package for
glibc or other packages?

Let me know what I can do.
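
On the journalctl warning quoted further down ("Ignoring invalid environment
assignment 'export DAEMON_COREFILE_LIMIT=unlimited'"): that message suggests
systemd is reading /etc/sysconfig/xymonlaunch as a plain environment file
rather than sourcing it in a shell, in which case "export" and "ulimit" lines
would not take effect. A possible alternative, sketched below under
assumptions, is a systemd drop-in that sets LimitCORE=. The helper name
write_core_dropin, the file name corelimit.conf, and the unit name
xymon.service are hypothetical and would need checking against the actual
package.

```shell
#!/bin/sh
# Write a systemd drop-in that lifts the core file size limit for a unit.
# ASSUMPTIONS (not from this thread): the file name corelimit.conf and the
# unit name xymon.service in the usage comment below are hypothetical.
write_core_dropin() {
    dropin_dir=$1
    mkdir -p "$dropin_dir" || return 1
    # LimitCORE= is the systemd counterpart of "ulimit -c".
    printf '[Service]\nLimitCORE=infinity\n' > "$dropin_dir/corelimit.conf"
}

# Usage (as root), followed by a reload and a restart of the service:
#   write_core_dropin /etc/systemd/system/xymon.service.d
#   systemctl daemon-reload
#   systemctl restart xymon
```

Since core files are already being generated here, this may not be needed at
all; it would mainly matter on a system where they are not.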

Thanks!!

--
Matt Vander Werf

On Sat, Jan 30, 2016 at 1:10 PM, Matt Vander Werf <matt1299 at gmail.com>
wrote:

> Hi J.C.,
>
> Moving the xymond.chk checkpoint file out of the way after it was stopped
> seemed to fix this (at least so far).
>
> I see that I lost all record of disabled tests (getting alerts for things
> that were disabled).
>
> What data exactly did I lose by moving that checkpoint file out of the
> way?
>
> Is there any way to get the data back? Or maybe figure out what is
> corrupted in the checkpoint file and then move the file back in place?
>
> Also, see my most recent e-mail with the xymonlaunch log (if you haven't
> already). Looks like this has happened in the past but resolved itself....
>
> Regarding the backtrace....
>
> I put those lines in /etc/sysconfig/xymonlaunch and I see the core files
> being generated now.
> I feel embarrassed to admit this, but how exactly do I get the backtrace
> out of the binary core files, besides trying to read the files with an
> editor? Any way to know which core file had the backtrace?
>
> Also, I see this in journalctl:
>
> Ignoring invalid environment assignment 'export
> DAEMON_COREFILE_LIMIT=unlimited': /etc/sysconfig/xymonlaunch
>
>
> Thanks for your help!!
>
> --
> Matt Vander Werf
>
> On Sat, Jan 30, 2016 at 12:39 PM, J.C. Cleaver <cleaver at terabithia.org>
> wrote:
>
>> Hi Matt,
>>
>> The log lines you're seeing are actually from the new xymond process
>> trying to start up, then failing because the port is already in use. I
>> think the timeout right below it is from the previous process's signal
>> handler giving up, based on the timestamps.
>>
>> Can you get a backtrace from xymond's core file? It should be left in
>> /var/lib/xymon/tmp/, or in the (*shudder*) systemd journal somewhere...
>>
>> If your system is set not to keep them by default, add
>> ''
>> export DAEMON_COREFILE_LIMIT="unlimited"
>> ulimit -c unlimited
>> ''
>> to /etc/sysconfig/xymonlaunch
>>
>> I suspect there might be something corrupted in the xymond checkpoint file.
>> First, do a 'service xymon stop' and make sure all xymon processes are
>> completely gone, including any xymond's still pending, then start xymon
>> back up. If it crashes again, do the same, but move the
>> /var/lib/xymon/xymond.chk checkpoint file out of the way after it's off,
>> and let it come back up.
>>
>> If it *still* doesn't come up, there's something else going on. Either
>> way, a full backtrace will help let us see where exactly it's dying.
>>
>>
>> HTH,
>> -jc
>>
>>
>> On Sat, January 30, 2016 8:28 am, Matt Vander Werf wrote:
>> > As a followup, xymond seems to try to start itself up again after a
>> > while (probably because xymonlaunch is still running), works fine for a
>> > short while, and then crashes again with the same messages and results.
>> >
>> > --
>> > Matt Vander Werf
>> >
>> > On Sat, Jan 30, 2016 at 11:21 AM, Matt Vander Werf <matt1299 at gmail.com>
>> > wrote:
>> >
>> >> Hello,
>> >>
>> >> I'm having a major issue with xymond crashing shortly after the service
>> >> starts.
>> >>
>> >> I'm using the latest Terabithia RPM for RHEL 7
>> >> (4.3.24-3.el7.terabithia).
>> >>
>> >> When I check the status of the xymon service, it shows it as up but
>> >> with only the xymonlaunch parent process and vmstat processes. Upon
>> >> restarting the service, I see it start normally (all the normal channel
>> >> processes, etc.) and then after a while they all go away, leaving the
>> >> following process behind:
>> >>
>> >>            ├─2760 xymon-signal 0.0.0.0 status+1d/group:signal <server hostname>.xymond red (Check time of report) - xymond program crashed Fatal signal caught!
>> >>
>> >> along with the xymonlaunch process and some vmstat processes. After a
>> >> while that process goes away. Sometimes a single xymond_rrd will show
>> >> up alongside the xymonlaunch and vmstat processes as well after a
>> >> little while.
>> >>
>> >> I'm already running xymond in --debug mode.
>> >>
>> >> This is what I see in the xymond log around the time of the crash:
>> >>
>> >> 2773 2016-01-30 11:02:32.515505 Status: Host=<host>, test=ntp
>> >> 2773 2016-01-30 11:02:32.515507  -- create_hostlist_t for <host> (<client IP address>)
>> >> 2773 2016-01-30 11:02:32.515513 Status: Host=<host>, test=conn
>> >> 2773 2016-01-30 11:02:32.515520 Status: Host=<host>, test=raid
>> >> 2773 2016-01-30 11:02:32.515529 Status: Host=<host>, test=memory
>> >> 2773 2016-01-30 11:02:32.515534 Status: Host=<host>, test=files
>> >> 2773 2016-01-30 11:02:32.515670 Status: Host=<host>, test=procs
>> >> 2773 2016-01-30 11:02:32.515879 Status: Host=<host>, test=inode
>> >> 2773 2016-01-30 11:02:32.515891 Status: Host=<host>, test=disk
>> >> 2773 2016-01-30 11:02:32.516004 Status: Host=<host>, test=cpu
>> >> 2773 2016-01-30 11:02:32.516605 Loaded 14419 status logs
>> >> 2016-01-30 11:02:32 Setting up network listener on 0.0.0.0:1984
>> >> 2016-01-30 11:02:32.516677 Cannot bind to listen socket (Address already in use)
>> >> 2016-01-30 11:02:59.538906 Whoops ! Failed to send message (Timeout)
>> >> 2016-01-30 11:02:59.539020 ->
>> >> 2016-01-30 11:02:59.539023 ->  Recipient '<server IP address>', timeout 50
>> >> 2016-01-30 11:02:59.539024 ->  1st line: 'status+1d/group:signal <server hostname>.xymond red (Check time of report) - xymond program crashed'
>> >>
>> >> It seems to finish loading all the hosts and then it crashes (the last
>> >> host before it crashes is the last client I have alphabetically).
>> >>
>> >> I've tried stopping the service, killing off any remaining xymon-owned
>> >> processes, and starting the service again, with the same results. I've
>> >> also tried restarting the xymon server machine itself, with the same
>> >> crash happening when the service starts the first time.
>> >>
>> >> This just started happening out of the blue a couple of hours ago...
>> >>
>> >> Looking in netstat, there are no active connections using port 1984 on
>> >> the local side, just a bunch of clients trying to connect to the server
>> >> with 1984 in the foreign address.
>> >>
>> >> ANY help would be much appreciated as currently our Xymon server is not
>> >> working!!
>> >>
>> >> Thanks!!
>> >>
>> >> --
>> >> Matt Vander Werf
>> >>
>> > _______________________________________________
>> > Xymon mailing list
>> > Xymon at xymon.com
>> > http://lists.xymon.com/mailman/listinfo/xymon
>> >
>>
>>
>>
>