[Xymon] Critical System Page -- HTTP 500 Error

EDSchminke at Hormel.com
Fri Aug 4 17:53:21 CEST 2017


I think I can point to a specific cause for this issue.  It seems to be a
combination of the "uptime" test being in an alert condition and the same
test failing during an exclusion window on the Critical Systems Page.

I have a number of Windows systems monitored for uptime.

In analysis.cfg:
UP 10m 37d yellow
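
For readers less familiar with the UP rule, here is a minimal C sketch of what it expresses: the status turns yellow when a host has been up for less than the first limit (it just rebooted) or for more than the second. The function name, the enum and the exact boundary handling (`<` versus `<=`) are assumptions made for illustration; this is not Xymon's own code.

```c
#include <assert.h>

enum color { GREEN, YELLOW };

/* Hypothetical helper: classify an uptime against a boot limit and a
 * "too long" limit, all in seconds. Boundary handling is an assumption. */
enum color uptime_color(long uptime_sec, long bootlimit_sec, long toolong_sec)
{
    assert(bootlimit_sec >= 0 && toolong_sec >= bootlimit_sec);
    if (uptime_sec < bootlimit_sec) return YELLOW;  /* recently rebooted */
    if (uptime_sec > toolong_sec) return YELLOW;    /* up for too long */
    return GREEN;
}
```

With the rule above, the limits would be 600 seconds (10m) and 37 * 86400 seconds (37d), so a host that rebooted two minutes ago is classified as yellow.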

In critical.cfg:
CTX_Template|uptime|||*:0400:2400|1|EPD|System has rebooted|rchicks 2017-08-04 07:58:11

I also set Xymon to send me alerts for ALL systems between 2:30AM and
3:30AM, the window in which the Critical Systems Page usually goes down.

In alerts.cfg:
HOST=%.*
    MAIL edschminke at hormel.com FORMAT=text REPEAT=1h TIME=*:0230:0330 FORMAT=text
    MAIL edschminke at hormel.com FORMAT=text TIME=*:0230:0330 FORMAT=text RECOVERED


Last night, around 2:45, 4 of these systems were rebooted.  As soon as the
first email was sent that a system went yellow for uptime, I got the alert
that http went red for the Critical Systems Page.  When the last email was
sent that uptime recovered, I got the alert that http recovered.

This morning, I rebooted a different Windows host.  I watched the test go
yellow, but the Critical Systems Page was fine.  In this case, the
condition was within the "Monitoring Time" window.  I then went into the
Critical Systems Editor and modified the "Monitoring Time" and put it
outside the window (e.g. current time 8AM, window: 12PM-12AM).  As soon as
I refreshed the Critical Systems Page, it crashed.  When I changed the
"Monitoring Time" so that the condition was back inside the window (e.g.
starting at 4AM) and refreshed, it loaded fine.
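
To make the inside/outside distinction concrete, here is a minimal C sketch of a time-of-day check against a window written like the one in critical.cfg. The function name `in_window`, the treatment of the end time as exclusive, and the restriction to the "*" (every day) form are all assumptions for illustration; Xymon's real parser handles more cases and may differ in detail.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if hhmm (e.g. 245 for 02:45) falls inside
 * a "*:HHMM:HHMM" window, 0 if it falls outside, -1 if the spec does not
 * parse. Only the "*" day specification is handled in this sketch. */
int in_window(const char *spec, int hhmm)
{
    int start, end;

    assert(spec != NULL);
    if (sscanf(spec, "*:%4d:%4d", &start, &end) != 2) return -1;
    return (hhmm >= start && hhmm < end) ? 1 : 0;
}
```

Under these assumptions, `in_window("*:0400:2400", 245)` is 0 (the 2:45AM reboots fall outside the window), `in_window("*:0400:2400", 800)` is 1, and `in_window("*:1200:2400", 800)` is 0, which matches the case where the page crashed.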

I tested the same process with a few other tests: disk, memory, cpu.  I could
not duplicate the problem with those.  I think the problem is limited to
uptime, but it could well affect other tests.  It also does not seem to matter
whether it is the actual host config, or a "cloned" host config.  The crash
happens with both.

If it matters, here's my environment:

I'm currently running Xymon v4.3.27.  The OS is Red Hat Enterprise Linux
v6.8.  Kernel is 2.6.32-431.el6.  Architecture is x86_64.  glibc version is
2.12-1.192.el6; for what it's worth, both the i686 and x86_64 packages are
installed.

A gdb backtrace shows that the crash occurs in a "strncmp" call in
lib/loadcriticalconf.c on line 249:

(gdb) backtrace
#0  0x0000003603729420 in __strncmp_sse42 () from /lib64/libc.so.6
#1  0x000000000040fa40 in get_critconfig (key=<value optimized out>, flags=<value optimized out>, resultkey=<value optimized out>) at loadcriticalconf.c:249
#2  0x00000000004030eb in loadstatus (maxprio=3, maxage=31536000, mincolor=3, wantacked=0) at criticalview.c:115
#3  0x00000000004036f0 in main (argc=<value optimized out>, argv=<value optimized out>) at criticalview.c:513
(gdb) frame 1
#1  0x000000000040fa40 in get_critconfig (key=<value optimized out>, flags=<value optimized out>, resultkey=<value optimized out>) at loadcriticalconf.c:249
249					if (strncmp(realkey, rec->key, strlen(realkey)) != 0) handle=xtreeEnd(rbconf);
(gdb) print realkey
$1 = 0x1c20c80 "CTX_Template|uptime"
(gdb) print *rec
$2 = {key = 0x435f6c65746e6957 <Address 0x435f6c65746e6957 out of bounds>,
  priority = 1769236850, starttime = 7310575213499737428, endtime = 0,
  crittime = 0x1c1d8e0 "Wintel_Critical_Template",
  ttgroup = 0x21 <Address 0x21 out of bounds>,
  ttextra = 0x6364727673737763 <Address 0x6364727673737763 out of bounds>,
  updinfo = 0x3603003d31 <Address 0x3603003d31 out of bounds>}
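
One way to read the odd values in that dump is to decode them as raw bytes. The C sketch below does this: it unpacks an integer least significant byte first, which is the order an x86_64 machine stores it in memory, and prints the result as text. The helper name `decode_le` is made up for this example, and the interpretation that follows is a reading of the dump, not a confirmed diagnosis.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy the low nbytes of value into out, least
 * significant byte first, and NUL-terminate. Non-printable bytes are
 * shown as '.'. The caller must supply room for nbytes + 1 chars. */
void decode_le(uint64_t value, size_t nbytes, char *out)
{
    size_t i;

    assert(nbytes <= 8 && out != NULL);
    for (i = 0; i < nbytes; i++) {
        unsigned char c = (unsigned char)(value >> (8 * i));
        out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
    }
    out[nbytes] = '\0';
}
```

Decoding key (0x435f6c65746e6957, 8 bytes), priority (1769236850, 4 bytes) and starttime (7310575213499737428, 8 bytes) this way gives "Wintel_C", "riti" and "Template".  Those pieces line up with the "Wintel_Critical_Template" string that gdb shows for crittime, if the four bytes between priority and starttime are taken to be struct padding.  That hints that rec may be pointing at string data rather than at a real record, though the coredump would be needed to confirm it.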

All of the crash details are still in my GitHub repo at
https://github.com/edschminke/xymon  ...including the coredump file.  I
suspect better C developers than myself can put that to a lot better use.

Thanks!

Erik D. Schminke | Associate Systems Programmer
Hormel Foods Corporation | One Hormel Place | Austin, MN 55912
Phone: (507) 434-6817
edschminke at hormel.com | www.hormelfoods.com
