[Xymon] Purple storm

Henrik Størner henrik at hswn.dk
Tue Mar 20 08:10:45 CET 2012


On 19-03-2012 19:15, Poppy, Ben wrote:
> I have an interesting problem that happened last night. We are working
> on a DR test. Part of that test includes shutting down some DC’s in our
> DR datacenter. When that happened, most tests that are initiated from
> the xymon servers (http, dns, ssh, ftp, etc) to the monitored server
> went purple. The servers that went purple were not all in our DR
> datacenter, it was at all of our sites, and even included some tests to
> the xymon server itself (we monitor the HTTP web page of xymon itself as
> well).
>
> Both of our xymon servers point to 2 windows DC’s in our production
> datacenter in /etc/resolv.conf for DNS lookups.

Check the "xymonnet" status history. I expect this status will show
some yellow events during that period, caused by the network tests
taking too long to run.

The status will tell you which part of the network tests is taking
too long.

This should also show up in the xymonnet.log file.

One likely culprit would be if you are doing "ntp" tests or custom DNS 
queries from Xymon against the DC's that are down. "ntp" tests use an 
external program (ntpdate) to perform the query, and it has a very long 
timeout when servers are not responding. DNS queries use the C-ARES
library, and because I misunderstood how the timeout handling works in
this library it can take several minutes *per test* to time out.
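
To see for yourself whether a resolver is answering or silently dropping
queries, you can time one lookup per nameserver with a short timeout. The
following is only a rough sketch and not part of Xymon: the function names
(parse_nameservers, build_query, time_lookup, report) are made up for this
example, the 2-second timeout is an arbitrary choice, and "example.com" is
just a placeholder name to look up. It sends a single plain UDP query for an
A record and does not retry, so it does not reproduce the C-ARES behaviour.

```python
import socket
import struct
import time


def parse_nameservers(resolv_conf_text):
    """Return the addresses found on 'nameserver' lines of resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet for an A record (RFC 1035 layout)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack(">HH", 1, 1)


def time_lookup(server, hostname, timeout=2.0):
    """Send one UDP query; return elapsed seconds, or None on timeout or error."""
    family = socket.AF_INET6 if ":" in server else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(build_query(hostname), (server, 53))
            sock.recvfrom(512)
        except OSError:  # includes socket timeouts
            return None
        return time.monotonic() - start


def report(resolv_conf_text, hostname="example.com", timeout=2.0):
    """Print one line per nameserver with the response time or 'no reply'."""
    for server in parse_nameservers(resolv_conf_text):
        elapsed = time_lookup(server, hostname, timeout)
        status = "no reply" if elapsed is None else f"{elapsed:.3f}s"
        print(f"{server}: {status}")
```

On the Xymon server you could run it as report(open("/etc/resolv.conf").read()).
A nameserver that shows "no reply" here is a candidate for the slow lookups
described above.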

Fixes for both of these issues are "in the pipeline" for the next major 
Xymon version.


Regards,
Henrik


