[Xymon] Xymon disruption every night!

J.C. Cleaver cleaver at terabithia.org
Tue Feb 16 21:50:28 CET 2016



On Tue, February 16, 2016 1:44 am, L-M-J wrote:
>
> Hi,
>
>   I'm still running into troubles every night between ~0h30 and ~2h40 :-(
>   1) I checked the backup on my physical Xymon server : it starts around
> 9pm and runs for 4:45 min.
>   2) We cross-monitored the DNS server from another monitoring tool : no
> DNS outage detected.
>   3) I monitored the Xymon server network link state with "mii-tool" every
> second : no trouble detected
>   4) I pinged my Xymon servers from 2 different network locations all night
> long : no trouble detected.
>   5) No firewalls between my Xymon server and the monitored hosts
>   6) Over 500 hosts, only ~30 are in trouble every night and mostly the
> same
>   7) Hosts are VM, physical servers, public internet website
>
>
>   Here is what I've found in the xymond.log today :
> 	2016-02-16 02:02:57 Flapping detected for www.foo1.com:http - 5 changes
> in 1708 seconds
> 	2016-02-16 02:02:57 Flapping detected for www.foo2.com:http - 5 changes
> in 1708 seconds
> 	2016-02-16 02:02:57 Flapping detected for www.microsoft.com:http - 5
> changes in 1708 seconds
> 	2016-02-16 02:06:14 Flapping detected for server01:http - 5 changes in
> 1678 seconds
> 	2016-02-16 02:06:14 Flapping detected for server02:http - 5 changes in
> 1678 seconds
> 	2016-02-16 02:06:29 Flapping detected for server03:conn - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for server04:ldap - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for server06:ssh - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for server05:http - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for server07:http - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for server08:http - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for server09:http - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for foo.bar1.com:http - 5 changes
> in 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for foo.bar2.com:http - 5 changes
> in 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for foo.bar3.fr:http - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for server10:http - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for server11-t:http - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for server12:http - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for server13:http - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for server14:http - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for server15:http - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for server16:http - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for server17:http - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for server18:http - 5 changes in
> 1745 seconds
> 	2016-02-16 02:07:21 Flapping detected for server19:http - 5 changes in
> 1745 seconds
>
>
>   Here is a part of the configuration + errors displayed in the XYmon HTTP
> interface :
> 	hosts.cfg : 0.0.0.0	server03	# conn	NAME:"server03" DESCR:"VM FOO BAR"
> 	Error :		conn NOT ok : DNS lookup failed / Unable to resolve hostname
> server03
> 				System unreachable for 2 poll periods (86 seconds)
>
> 	Everything looks like the DNS resolution failed.
>
> 	hosts.cfg : 10.X.Y.188 server05 # conn tse NAME:"Server 05" DESCR:"My
> comment" http://server05/
> 	Error : DNS error  red http://server05/ - DNS error
>
>   - Why do I have a "DNS error" here ? I set the IP for this host yesterday
> to solve the issue. The "conn" error has disappeared since yesterday
> evening but the http error still remains.
>
>

All signs do point to an issue with DNS resolution here.

Was this a custom compile or are you using a package? If custom, what
version of c-ares is on your system? That's the underlying resolver
library that xymonnet uses by default for DNS lookups. The fact that the
'conn' test has been good since you set the IP for the host, while 'http'
is still red, matches that: HTTP tests perform their own secondary DNS
lookup on the hostname in the URL (to deal with vhosts, etc.) unless the
IP is specified there as well.

Xymon otherwise does not cache DNS records (or anything else) between
network polls like this, since xymonnet is started as a brand new process
for each run.

Try adding the '--dnslog=' option to xymonnet during this period to get a
log of exactly what's happening with DNS resolution, and --debug as well
(but just for one or two runs). You can also try testing with '--no-ares';
however, the system resolver is (normally) much slower and less
predictable than c-ares.
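
Separately from xymonnet's own logging, an independent probe left running
on the Xymon server through the 0h30-2h40 window can show whether lookups
fail or slow down outside of Xymon too. The following is only a rough
sketch, not anything shipped with Xymon: the function names (resolve_once,
format_line, probe) are made up for illustration. It goes through the
system resolver (socket.getaddrinfo), so it exercises roughly the same
path as '--no-ares', not c-ares.

```python
import socket
import time
from datetime import datetime


def resolve_once(hostname, resolver=socket.getaddrinfo):
    """Resolve hostname once; return (ok, elapsed_seconds, detail).

    'resolver' is injectable only so the function can be tested offline.
    """
    start = time.monotonic()
    try:
        infos = resolver(hostname, None)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # the address string is the first element of sockaddr.
        detail = ",".join(sorted({info[4][0] for info in infos}))
        ok = True
    except socket.gaierror as exc:
        detail = f"error: {exc}"
        ok = False
    return ok, time.monotonic() - start, detail


def format_line(hostname, ok, elapsed, detail, now=None):
    """Build one timestamped log line for a lookup result."""
    now = now or datetime.now()
    status = "OK" if ok else "FAIL"
    return f"{now:%Y-%m-%d %H:%M:%S} {hostname} {status} {elapsed:.3f}s {detail}"


def probe(hostnames, rounds, interval, out=print):
    """Resolve every hostname once per round, sleeping between rounds."""
    for i in range(rounds):
        for name in hostnames:
            out(format_line(name, *resolve_once(name)))
        if i + 1 < rounds:
            time.sleep(interval)
```

For example, probe(["server03", "server05"], rounds=180, interval=60)
would log one lookup per host per minute for about three hours. If these
lookups also fail or stall during the window, the problem is not specific
to xymonnet or c-ares.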

It might also help to lower your --concurrency=N setting to something
below the system default (which will typically be 256).


There's clearly *something* going on that's specific to that period, but
the signs point to something on the Xymon host itself. That's especially
likely if you add a local DNS cache and you're still seeing the problem.
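
To see how the nightly failures cluster (which tests, which hosts, which
minutes), the "Flapping detected" lines in xymond.log can be tallied. This
is just a sketch that assumes each entry sits on a single line in the
format quoted above; parse_flaps and count_by_test are hypothetical helper
names, not part of Xymon.

```python
import re
from collections import Counter

# Matches lines such as:
# 2016-02-16 02:06:29 Flapping detected for server03:conn - 5 changes in 1745 seconds
FLAP_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) Flapping detected for "
    r"(\S+):(\S+) - (\d+) changes in (\d+) seconds"
)


def parse_flaps(lines):
    """Return (timestamp, host, test, changes, seconds) for each flap line."""
    events = []
    for line in lines:
        match = FLAP_RE.match(line.strip())
        if match:
            ts, host, test, changes, secs = match.groups()
            events.append((ts, host, test, int(changes), int(secs)))
    return events


def count_by_test(events):
    """Count flap events per test name (http, conn, ldap, ...)."""
    return Counter(test for _, _, test, _, _ in events)
```

Feeding it the log, e.g. count_by_test(parse_flaps(open(path))) with path
pointing at your xymond.log, gives a per-test count. A list dominated by
name-based tests would fit the DNS theory.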


HTH,
-jc




