[Xymon] Xymon server post-migration blues pt. 1 - "brown-outs"
Greg Earle
earle at isolar.DynDNS.ORG
Wed Apr 24 06:50:00 CEST 2019
[Apologies in advance for the too-wordy message. Part 1 of 2.]
Recently I violated one of the prime rules of being a SysAdmin - don't
ever change two things at once.
At my work we were forced to migrate our data center to a new facility,
so we bought a new monitoring PC to replace a RHEL 6.10 system that ran
Xymon 4.3.12. The new monitoring PC runs RHEL 7.6 and Xymon 4.3.28
(using the Terabithia RPMs).
Ever since then, I have been running into two big problems that did not
exist before. This is the first problem ("brown-outs"); I'll describe
the second in part 2.
Randomly, a group of systems (or some subset of the group) will report
in as CRITICAL/RED due to failed xymonnet tests. Mostly SSH, but some
SMTP and FTP as well. The hosts/services are all actually fine and the
red alerts are incorrect/false positives.
The problem is getting worse - I'm now seeing several hundred red alerts
a day from these "brown-outs". The hosts involved are more-or-less
random - different buildings/OSes, etc. Sometimes all of them provoke
alerts; most of the time it's just a subset of the list.
When they fail, the alert message is always
--
Service <service> on <host> is not OK : Service listening but
unavailable (connect timeout)
--
To try and catch it in the act, I ran this test in a loop:
[root@mgmt xymon]# while true; do ( echo "["`date`"]" ; \
    xymonnet --report --ping --checkresponse --timing --debug --no-update \
    2>&1 > /tmp/xymonnet.out ; \
    grep 'err=[^0]' /tmp/xymonnet.out ); done
A couple of times I think I did catch it; here's an example:
--
Address=192.168.1.26:22, open=1, res=0, err=1, connecttime=0.004546,
totaltime=11.653128,
Address=192.168.1.25:22, open=1, res=0, err=1, connecttime=0.004510,
totaltime=11.653092,
Address=192.168.1.219:22, open=1, res=0, err=1, connecttime=0.003163,
totaltime=11.651745,
Address=192.168.1.151:22, open=1, res=0, err=1, connecttime=0.002923,
totaltime=11.651505,
Address=192.168.1.50:22, open=1, res=0, err=1, connecttime=0.002906,
totaltime=11.651488,
Address=137.78.80.38:22, open=1, res=0, err=1, connecttime=0.002819,
totaltime=11.651401,
[... another 10 elided ...]
Address=192.168.1.184:22, open=1, res=0, err=1, connecttime=0.001098,
totaltime=12.393879,
Address=192.168.1.174:25, open=1, res=0, err=1, connecttime=0.000426,
totaltime=12.364234,
Address=192.168.1.182:25, open=1, res=0, err=1, connecttime=0.000418,
totaltime=12.364226,
Address=192.168.1.25:25, open=1, res=0, err=1, connecttime=0.000411,
totaltime=12.364219,
Address=192.168.1.25:21, open=1, res=0, err=1, connecttime=0.022773,
totaltime=12.364044,
--
Notice the millisecond-scale connecttime values (the TCP connects
succeed quickly), but the totaltime values of 11-12 seconds, which
exceed the timeout.
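
A minimal sketch for pulling the failed probes out of a saved run and
showing how long each one sat idle after its connect completed. It
assumes every Address=... record occupies a single line in
/tmp/xymonnet.out (the records above are wrapped only by the mail
client) and that the field names are exactly those shown in the excerpt.
The function name summarize_errs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: summarize failed probes from saved xymonnet --debug output.
# Assumes one record per line, in the form seen in the excerpt:
#   Address=IP:PORT, open=N, res=N, err=N, connecttime=S, totaltime=S,
summarize_errs() {
    awk -F', *' '
        /^Address=/ {
            split("", f)                      # clear fields from the previous record
            for (i = 1; i <= NF; i++) {
                n = index($i, "=")
                if (n > 0) f[substr($i, 1, n - 1)] = substr($i, n + 1)
            }
            if (f["err"] + 0 != 0) {
                # Time spent after the TCP connect finished, until the probe gave up
                idle = f["totaltime"] - f["connecttime"]
                printf "%s connect=%.3fs total=%.3fs after_connect=%.3fs\n",
                       f["Address"], f["connecttime"], f["totaltime"], idle
            }
        }
    ' "$@"
}
```

Running "summarize_errs /tmp/xymonnet.out" after a pass prints one line
per failed probe; with no file argument it reads standard input.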
The services always immediately recover in the next test pass.
Are there any knobs I can turn to help debug this problem? I'm
assuming it's network/router/switch-related, but I need a smoking gun.
Failing that, is there a .cfg file setting that would turn these
particular "Service listening but unavailable" statuses into a Yellow
alert rather than Red? (I'd rather not have to resort to this, but as a
stop-gap, I would.)
- Greg