[Xymon] Critical System Page -- HTTP 500 Error
EDSchminke at Hormel.com
Fri Aug 4 17:53:21 CEST 2017
I think I can point to a specific cause for this issue. It seems to be a
combination of the "uptime" test being in an alert condition and the same
test failing during an exclusion window on the Critical Systems Page.
I have a number of Windows systems monitored for uptime.
In analysis.cfg:
UP 10m 37d yellow
In critical.cfg:
CTX_Template|uptime|||*:0400:2400|1|EPD|System has rebooted|rchicks 2017-08-04 07:58:11
I also set Xymon to send me alerts for ALL systems between 2:30AM and
3:30AM, which is roughly the time window in which the Critical Systems Page usually goes down.
In alerts.cfg:
HOST=%.*
        MAIL edschminke at hormel.com FORMAT=text REPEAT=1h TIME=*:0230:0330 FORMAT=text
        MAIL edschminke at hormel.com FORMAT=text TIME=*:0230:0330 FORMAT=text RECOVERED
Last night, around 2:45AM, four of these systems were rebooted. As soon as the
first email arrived saying that a system had gone yellow for uptime, I got the alert
that http had gone red for the Critical Systems Page. When the last email arrived
saying that uptime had recovered, I got the alert that http had recovered.
This morning, I rebooted a different Windows host. I watched the test go
yellow, but the Critical Systems Page was fine; in this case, the
condition was within the "Monitoring Time" window. I then went into the
Critical Systems Editor and changed the "Monitoring Time" so that the current
time fell outside the window (e.g. current time 8AM, window 12PM-12AM). As soon as
I refreshed the Critical Systems Page, it crashed. When I changed the "Monitoring
Time" so that the condition was back inside the window (e.g. starting at 4AM) and refreshed,
the page loaded fine.
I tried the same procedure with a few other tests (disk, memory, cpu) and could not
reproduce the problem with any of them. I think the problem is limited to
uptime, but it could very well affect other tests too. It also does not seem to matter
whether the entry is the actual host config or a "cloned" host config; the crash
happens with both.
If it matters, here's my environment:
I'm currently running Xymon v4.3.27. The OS is Red Hat Enterprise Linux
v6.8, the kernel is 2.6.32-431.el6, and the architecture is x86_64. The glibc version is
2.12-1.192.el6; for what it's worth, both the i686 and x86_64 packages are
installed.
A gdb backtrace shows that the crash occurs inside strncmp(), called from
get_critconfig() in lib/loadcriticalconf.c at line 249:
(gdb) backtrace
#0 0x0000003603729420 in __strncmp_sse42 () from /lib64/libc.so.6
#1 0x000000000040fa40 in get_critconfig (key=<value optimized out>, flags=<value optimized out>, resultkey=<value optimized out>) at loadcriticalconf.c:249
#2 0x00000000004030eb in loadstatus (maxprio=3, maxage=31536000, mincolor=3, wantacked=0) at criticalview.c:115
#3 0x00000000004036f0 in main (argc=<value optimized out>, argv=<value optimized out>) at criticalview.c:513
(gdb) frame 1
#1 0x000000000040fa40 in get_critconfig (key=<value optimized out>, flags=<value optimized out>, resultkey=<value optimized out>) at loadcriticalconf.c:249
249             if (strncmp(realkey, rec->key, strlen(realkey)) != 0) handle=xtreeEnd(rbconf);
(gdb) print realkey
$1 = 0x1c20c80 "CTX_Template|uptime"
(gdb) print *rec
$2 = {key = 0x435f6c65746e6957 <Address 0x435f6c65746e6957 out of bounds>, priority = 1769236850, starttime = 7310575213499737428, endtime = 0, crittime = 0x1c1d8e0 "Wintel_Critical_Template", ttgroup = 0x21 <Address 0x21 out of bounds>, ttextra = 0x6364727673737763 <Address 0x6364727673737763 out of bounds>, updinfo = 0x3603003d31 <Address 0x3603003d31 out of bounds>}
All of the crash details, including the core dump file, are still in my GitHub repo at
https://github.com/edschminke/xymon. I suspect better C developers than myself
can put them to much better use.
Thanks!
Erik D. Schminke | Associate Systems Programmer
Hormel Foods Corporation | One Hormel Place | Austin, MN 55912
Phone: (507) 434-6817
edschminke at hormel.com | www.hormelfoods.com