[Xymon] Xymon server post-migration blues pt. 1 - "brown-outs"

Wed Apr 24 06:50:00 CEST 2019

[Apologies in advance for the too-wordy message.  Part 1 of 2.]

Recently I violated one of the prime rules of being a SysAdmin - don't 
ever change two things at once.

At my work we were forced to migrate our data center to a new facility, 
so we bought a new monitoring PC to replace a RHEL 6.10 system that ran 
Xymon 4.3.12.  The new monitoring PC runs RHEL 7.6 and Xymon 4.3.28 
(using the Terabithia RPMs).

Ever since then, I have been running into two big problems that did not 
exist before.  This is the first problem ("brown-outs"); I'll describe 
the 2nd in part 2.

Randomly, a group of systems (or some subset of the group) will report 
in as CRITICAL/RED due to failed xymonnet tests.  Mostly SSH, but some 
SMTP and FTP as well.  The hosts/services are all actually fine and the 
red alerts are incorrect/false positives.

The problem is getting worse - I'm now seeing several hundred red alerts 
a day from these "brown-outs".  The hosts involved are more-or-less 
random - different buildings/OSes, etc.  Sometimes all of them provoke 
alerts; most of the time it's just a subset of the list.

When they fail, the alert message is always

--
Service <service> on <host> is not OK : Service listening but 
unavailable (connect timeout)
--

To try and catch it in the act, I ran this test in a loop:

[root@mgmt xymon]# while true; do ( echo "["`date`"]" ; xymonnet 
--report --ping --checkresponse --timing --debug --no-update > 
/tmp/xymonnet.out 2>&1 ; grep 'err=[^0]' /tmp/xymonnet.out ); done

A couple of times I think I did catch it; here's an example:

--
Address=192.168.1.26:22, open=1, res=0, err=1, connecttime=0.004546, 
totaltime=11.653128,
Address=192.168.1.25:22, open=1, res=0, err=1, connecttime=0.004510, 
totaltime=11.653092,
Address=192.168.1.219:22, open=1, res=0, err=1, connecttime=0.003163, 
totaltime=11.651745,
Address=192.168.1.151:22, open=1, res=0, err=1, connecttime=0.002923, 
totaltime=11.651505,
Address=192.168.1.50:22, open=1, res=0, err=1, connecttime=0.002906, 
totaltime=11.651488,
Address=137.78.80.38:22, open=1, res=0, err=1, connecttime=0.002819, 
totaltime=11.651401,

[... another 10 elided ...]

Address=192.168.1.184:22, open=1, res=0, err=1, connecttime=0.001098, 
totaltime=12.393879,
Address=192.168.1.174:25, open=1, res=0, err=1, connecttime=0.000426, 
totaltime=12.364234,
Address=192.168.1.182:25, open=1, res=0, err=1, connecttime=0.000418, 
totaltime=12.364226,
Address=192.168.1.25:25, open=1, res=0, err=1, connecttime=0.000411, 
totaltime=12.364219,
Address=192.168.1.25:21, open=1, res=0, err=1, connecttime=0.022773, 
totaltime=12.364044,
--

Notice that connecttime is only a few milliseconds (so the TCP connect 
itself succeeded quickly), yet the totaltime values all exceed the 
timeout.
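
To pull just those entries out of a captured pass, something like the 
following might help.  It is only a sketch: it assumes each record sits 
on a single line in the form shown above (i.e. that the wrapping comes 
from the mail client), and the function name and both thresholds (0.5 s 
connect, 10 s total) are made up for illustration; they are not Xymon 
settings.

```shell
#!/bin/sh
# Hypothetical helper: print xymonnet debug records whose connect was
# fast but whose total time ran long.
# Assumed input, one record per line:
#   Address=IP:PORT, open=1, res=0, err=1, connecttime=S, totaltime=S,
# $1 = max connect time in seconds (default 0.5, an illustrative guess)
# $2 = min total time in seconds   (default 10, an illustrative guess)
slow_after_connect() {
    awk -v max_connect="${1:-0.5}" -v min_total="${2:-10}" '
    /^Address=/ {
        addr = ""; err = 0; ct = -1; tt = -1
        # Break the record into key=value fields.
        n = split($0, fields, /, */)
        for (i = 1; i <= n; i++) {
            split(fields[i], kv, "=")
            if (kv[1] == "Address")     addr = kv[2]
            if (kv[1] == "err")         err  = kv[2] + 0
            if (kv[1] == "connecttime") ct   = kv[2] + 0
            if (kv[1] == "totaltime")   tt   = kv[2] + 0
        }
        # Keep only failed checks that connected quickly but finished late.
        if (err != 0 && ct >= 0 && ct <= max_connect && tt >= min_total)
            printf "%s connect=%.6f total=%.6f\n", addr, ct, tt
    }'
}
```

Usage would be along the lines of 
`slow_after_connect 0.5 10 < /tmp/xymonnet.out`.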

The services always immediately recover in the next test pass.

Are there any knobs I can turn to help debug this problem?  I'm 
assuming it's network/router/switch-related, but I need a smoking gun.
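
In the meantime, one low-tech way to look for a pattern is to tally the 
failures per port for each captured pass and see whether the mix stays 
the same from one brown-out to the next.  The helper below is a rough 
sketch under the same assumption as before (one record per line, IPv4 
addresses only); the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count failed checks (err != 0) per destination
# port in xymonnet debug output read from stdin.
# Assumes IPv4 records of the form "Address=IP:PORT, ... err=N, ...".
failures_by_port() {
    awk '
    /^Address=/ && /err=[^0]/ {
        # $1 is "Address=IP:PORT," so a[2] is the IP and a[3] the port.
        split($1, a, /[=:,]/)
        count[a[3]]++
    }
    END {
        for (p in count)
            printf "port %s: %d failed\n", p, count[p]
    }' | sort
}
```

For example: `failures_by_port < /tmp/xymonnet.out`.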

Failing that, is there a setting in one of the .cfg files that would 
turn these particular "Service listening but unavailable" statuses into 
a Yellow alert rather than Red?  (I'd rather not have to resort to this, 
but as a stop-gap I would.)

		- Greg