[Xymon] xymond crashing! -- Please help!

Sun Jan 31 21:00:19 CET 2016

Hi J.C.,

First of all, thanks for your continued assistance on this! It's greatly
appreciated!! :)

I was looking to see if there were any clients in common across the
different crash backtraces, and I think I may have figured out the issue here.

I've been trying to confirm that this is in fact the cause of the crashing
(which is why you haven't heard back from me for so long), and I think it's
safe to say that it was.

I noticed one client in particular that was showing up in several of the
core dump backtraces (let's call this Client A). This client stood out to
me because I've been seeing it often in error messages from xymond
regarding oversized client messages and truncated status messages (see PS
below).

So, I went to Client A and stopped the xymon-client service. Then I went
and took out all entries for Client A from the checkpoint file (after
stopping the xymon service, of course).

I started up the xymon service again and voila... no crashing! I confirmed
this by starting up the xymon-client service on Client A again and then
restarting the xymon server service; after a short while, it crashed (as
expected).
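
For anyone who needs to repeat the clean-up step described above, here is a minimal Python sketch of removing one host's entries. It assumes, without confirmation from the Xymon source, that the checkpoint file holds one record per line with pipe-delimited fields and that the hostname appears as a whole field. The function name `drop_host_records` and the sample records are hypothetical. Something like this should only be run against a copy of the file while xymond is stopped.

```python
def drop_host_records(lines, hostname):
    """Split records into (kept, dropped) lists.

    A record is dropped when `hostname` appears as a whole field.
    ASSUMPTION: one record per line, fields separated by '|'.
    """
    kept, dropped = [], []
    for line in lines:
        fields = line.rstrip("\n").split("|")
        (dropped if hostname in fields else kept).append(line)
    return kept, dropped


if __name__ == "__main__":
    # Made-up sample records, not real checkpoint contents.
    sample = [
        "rec|clienta.example.com|cpu|green\n",
        "rec|clientb.example.com|cpu|green\n",
        "rec|clienta.example.com|disk|red\n",
    ]
    kept, dropped = drop_host_records(sample, "clienta.example.com")
    print(len(kept), "kept,", len(dropped), "dropped")  # 1 kept, 2 dropped
```

Matching on whole fields rather than substrings avoids also removing a host whose name merely contains the target name. Keeping the dropped records makes it easy to save them for later inspection.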

Now, while the client messages were very strange and large and unusual,
there definitely wasn't any binary data. I will be sending you the
problematic entries from the checkpoint file in a separate e-mail (as I'd
rather not send the contents to the entire list...). Hopefully you can make
sense of them...
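
As a rough illustration of how the "no binary data" check could be done programmatically rather than by eye, here is a small Python sketch. It only flags bytes outside printable ASCII, so legitimate UTF-8 text would also be reported. The function name `find_suspect_lines` and the sample bytes are hypothetical.

```python
import string

# Bytes treated as ordinary text: printable ASCII plus ASCII whitespace.
TEXT_BYTES = frozenset(string.printable.encode("ascii"))


def find_suspect_lines(data):
    """Return (line_number, bad_byte_count) for every line of `data`
    (a bytes object) that contains bytes outside printable ASCII."""
    hits = []
    for number, line in enumerate(data.splitlines(), start=1):
        bad = sum(1 for byte in line if byte not in TEXT_BYTES)
        if bad:
            hits.append((number, bad))
    return hits


if __name__ == "__main__":
    # Made-up sample; the second line carries a NUL and a 0xff byte.
    sample = b"host1|ok\nhost2|\x00\xff bad\nhost3|ok\n"
    print(find_suspect_lines(sample))  # [(2, 2)]
```

To check a real file, read it in binary mode (for example `open(path, "rb").read()`) and pass the result to the function.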

I'm not sure what in the client entries made it crash here or if it was
supposed to crash at all. Could it have really been from too large of
client/status messages?

Do you know if it's possible to address this so it doesn't crash in the
future? For now, I'm just going to have to keep the xymon-client service
turned off, at least until the client in question calms down.

(PS: I can't really control what is done on my client machines very much,
so I was more or less ignoring the oversized client and status messages
from the client in question (knowing they eventually would go away).
I already tried increasing the MAXMSG size values, but I didn't want to
have to increase them as high as I would have needed to to satisfy the
client in question. I never thought that they would actually ever cause
Xymon to crash...)
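
To get a feel for how large the biggest records actually are before deciding on a size limit, a sketch like the following could help. It assumes one record per line, which is not confirmed for the checkpoint format, and the function name `largest_records` is hypothetical.

```python
import heapq


def largest_records(lines, top=3):
    """Return up to `top` (size_in_bytes, line_number) pairs, largest first.

    ASSUMPTION: each element of `lines` is one complete record (bytes).
    """
    sizes = ((len(line), number) for number, line in enumerate(lines, start=1))
    return heapq.nlargest(top, sizes)


if __name__ == "__main__":
    # Made-up records of 10, 500 and 20 bytes.
    sample = [b"a" * 10, b"b" * 500, b"c" * 20]
    print(largest_records(sample, top=2))  # [(500, 2), (20, 3)]
```

The line numbers can then be used to pull out the offending records and see which host and test they belong to.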

Anyways, thanks as always for your excellent help J.C.!!

--
Matt Vander Werf

On Sun, Jan 31, 2016 at 12:16 PM, J.C. Cleaver <cleaver at terabithia.org>
wrote:

>
> On Sat, January 30, 2016 3:32 pm, Matt Vander Werf wrote:
> > Oops...somehow sent too soon there...
> >
> > No, I haven't made any recent changes to client-local.cfg. I don't
> > actually use that config for anything.
> >
> > It seems to work just fine when you're starting off with no xymond.chk
> > file (like when the file is moved out of the way), but once the service
> > gets restarted (or stopped and started), then the crashes start again
> > and it becomes basically unusable. So maybe it has to do with reading
> > the current state from the xymond.chk file? Or loading all the statuses?
> >
> > It seems to load all the statuses and then tries to set up a network
> > listener and then crashes.
>
>
> This is most likely the previous xymond instance still taking the network
> port. After startup, xymond never re-reads the checkpoint file. If crashes
> are eventually occurring even after it's started up without a checkpoint
> file in place then whatever it is is occurring "live" and it's not the
> checkpoint itself that's the problem.
>
>
> >
> > No, I'm not seeing any other error messages from xymond's startup that
> > would seem related. Just that "Cannot bind to listen socket (Address
> > already in use)" you saw earlier when it crashes.
> >
> > Are you saying I could pull data from the old xymond.chk file and
> > manually put it in the current xymond.chk file when xymond is stopped? Or?
>
> This is correct. The checkpoint file is a simple text file written out.
> Actually, it might be worth a quick scan with grep or just eyeballing a
> 'cat' to see if there's any obviously corrupt data in there. Initial raw
> messages are not binary safe, and are decompressed by xymond if needed
> before internal processing, so everything there should be plain. If you
> see binary garbage, something unusual has happened.
>
> >
> > Any other ideas? I'm sort of in a rut here...  :/ Not entirely sure what
> > I can do to get my Xymon instance working again..
> >
> > Any other details I can provide that might shine a light on this issue?
> >
>
>
> - Can you send a copy of your client-local.cfg? Or, if not using it much,
> revert it to the standard one?
>
> - When did the issue first start?
>
> - Based on the single backtrace, there's something strange about a client
> record being pulled in, or an underlying issue with posix btrees, and/or
> memory management.
>
> Are all the crashes occurring at the same area? If so, for the same client
> host message/report?
>
> Is the main xymond server under any sort of memory pressure, or has there
> been a recent glibc update or change in libraries that might require a
> reboot to fully take effect?
>
>
> -jc
>
> > Thanks!!
> >
> > --
> > Matt Vander Werf
> >
> > On Sat, Jan 30, 2016 at 6:05 PM, Matt Vander Werf <matt1299 at gmail.com>
> > wrote:
> >
> >> Hi J.C.,
> >>
> >> No,
> >>
> >>
> >>
> >> --
> >> Matt Vander Werf
> >>
> >> On Sat, Jan 30, 2016 at 5:46 PM, J.C. Cleaver <cleaver at terabithia.org>
> >> wrote:
> >>
> >>>
> >>> On Sat, January 30, 2016 10:45 am, Matt Vander Werf wrote:
> >>> > Hi J.C.,
> >>> >
> >>> > So it appears that only fixed it temporarily.
> >>> >
> >>> > If I stop the service and start it back up again, it crashes again.
> >>> >
> >>> > I think I figured out how to read the core file and get a backtrace
> >>> > for you (I think).
> >>> >
> >>> > Here's what I got from the most recent crash (with some host names
> >>> > obfuscated):
> >>> >
> >>> > [New LWP 13283]
> >>> > Reading symbols from /usr/sbin/xymond...Reading symbols from
> >>> > /usr/lib/debug/usr/sbin/xymond.debug...done.
> >>> > done.
> >>> > Missing separate debuginfo for
> >>> > Try: yum --enablerepo='*debug*' install
> >>> > /usr/lib/debug/.build-id/33/97b0d696701dbd7c09eb4bf023f7f4eebec9ed
> >>> > [Thread debugging using libthread_db enabled]
> >>> > Using host libthread_db library "/lib64/libthread_db.so.1".
> >>> > Core was generated by `xymond --restart=/var/lib/xymon/tmp/xymond.chk
> >>> > --checkpoint-file=/var/lib/xymon'.
> >>> > Program terminated with signal 6, Aborted.
> >>> > #0  0x00007f570e29a5f7 in raise () from /lib64/libc.so.6
> >>> > Missing separate debuginfos, use: debuginfo-install
> >>> > glibc-2.17-106.el7_2.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64
> >>> > krb5-libs-1.13.2-10.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64
> >>> > libselinux-2.2.2-6.el7.x86_64 lz4-r131-1.el7.x86_64
> >>> > openssl-libs-1.0.1e-51.el7_2.2.x86_64 pcre-8.32-15.el7.x86_64
> >>> > xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
> >>> > (gdb) backtrace
> >>> > #0  0x00007f570e29a5f7 in raise () from /lib64/libc.so.6
> >>> > #1  0x00007f570e29bce8 in abort () from /lib64/libc.so.6
> >>> > #2  0x00007f570f53cdf5 in sigsegv_handler (signum=<optimized out>) at
> >>> > sig.c:57
> >>> > #3  <signal handler called>
> >>> > #4  0x00007f570f5403b4 in xtree_i_compare (pa=0x7ffead8cb9a0, pb=0x2020202020202020) at tree.c:47
> >>> > #5  0x00007f570e3574c0 in tfind () from /lib64/libc.so.6
> >>> > #6  0x00007f570f5405d4 in xtreeFind (treehandle=<optimized out>, key=key@entry=0x7f57142cb320 "*<client hostname>*") at tree.c:140
> >>> > #7  0x00007f570f5386bd in get_clientconfig (hostname=hostname@entry=0x7f57142cb320 "*<client hostname>*", hostclass=hostclass@entry=0x7f57208e4612 "linux", hostos=hostos@entry=0x7f57208e460c "linux") at clientlocal.c:192
> >>> > #8  0x00007f570f535dec in do_message (msg=msg@entry=0x7f572064c300, origin=origin@entry=0x7f570f550e97 "", can_respond=can_respond@entry=1) at xymond.c:4955
> >>> > #9  0x00007f570f5282c7 in main (argc=<optimized out>, argv=<optimized out>) at xymond.c:6288
> >>> >
> >>> >
> >>> > Is this what you wanted? Do you want me to install the debug package
> >>> > for glibc or other packages?
> >>> >
> >>> > Let me know what I can do.
> >>> >
> >>> > Thanks!!
> >>>
> >>> This works. It's strange in that it points to a problem with the
> >>> client-local configs, but I'm not sure how the tree would get into a
> >>> corrupt state.
> >>>
> >>> Were any changes made recently to the client-local file? Any other
> >>> errors seen during xymond's startup that might seem related?
> >>>
> >>> It's probably *not* an issue with a status message, if they're all
> >>> crashing at the same spot. This was an incoming client message that was
> >>> either garbled or accessing garbled data somehow.
> >>>
> >>>
> >>> >
> >>> > --
> >>> > Matt Vander Werf
> >>> >
> >>> > On Sat, Jan 30, 2016 at 1:10 PM, Matt Vander Werf <matt1299 at gmail.com>
> >>> > wrote:
> >>> >
> >>> >> Hi J.C.,
> >>> >>
> >>> >> Moving the xymond.chk checkpoint file out of the way after it was
> >>> >> stopped seemed to fix this (at least so far).
> >>> >>
> >>> >> I see that I lost all record of disabled tests (getting alerts for
> >>> >> things that were disabled).
> >>> >>
> >>> >> What all data exactly did I lose with moving that checkpoint file
> >>> >> out of the way?
> >>> >>
> >>> >> Is there any way to get the data back? Or maybe figure out the
> >>> >> corruption in the checkpoint file and then move the file back in place?
> >>>
> >>> There are several different bits in there, including scheduled tasks,
> >>> disable states, and the current status messages. You can manually copy
> >>> the file back at this point while xymond is off and it will load state
> >>> back from it (along with the old status messages, but they'll get
> >>> overwritten as soon as the next cycle comes through).
> >>>
> >>>
> >>>
> >>> >>
> >>> >> Also, see my most recent e-mail with the xymonlaunch log (if you
> >>> >> haven't already). Looks like this has happened in the past but
> >>> >> resolved itself....
> >>> >>
> >>> >> Regarding the backtrace....
> >>> >>
> >>> >> I put those lines in /etc/sysconfig/xymonlaunch and I see the core
> >>> >> files being generated now.
> >>> >> I feel embarrassed to admit this, but how exactly do I get the
> >>> >> backtrace out of the binary core files, besides trying to read the
> >>> >> files with an editor? Any way to know which core file had the backtrace?
> >>> >>
> >>> >> Also, I see this in journalctl:
> >>> >>
> >>> >> Ignoring invalid environment assignment 'export
> >>> >> DAEMON_COREFILE_LIMIT=unlimited': /etc/sysconfig/xymonlaunch
> >>>
> >>> Ugh. systemd :( I forgot that that's not a real shell file any more.
> >>> Looks like you found a way though!
> >>>
> >>>
> >>> -jc
> >>>
> >>>
> >>>
> >>
> >
>
>
>