[Xymon] Purple storm

Poppy, Ben poppy.ben at marshfieldclinic.org
Thu Apr 12 08:27:01 CEST 2012


I may have missed this in a past post, but how do I apply this patch?

I definitely do test DNS on servers at our DR site (many of them). Is the test you suggest below meant to simulate the purple storm? Should it essentially turn purple if I begin dropping all packets to a few of the DNS servers I'm testing?

Could I run these same iptables rules on my backup Xymon server at our primary site, against a few of the servers in our DR site that it checks DNS on? Should that effectively cause the purple storm?

Thanks for your help. 
________________________________________
From: xymon-bounces at xymon.com [xymon-bounces at xymon.com] on behalf of Henrik Størner [henrik at hswn.dk]
Sent: Thursday, April 12, 2012 12:47 AM
To: xymon at xymon.com
Subject: Re: [Xymon] Purple storm

On 12-04-2012 05:28, Poppy, Ben wrote:
> To be honest, I'm not 100% sure what the cause is.

Me neither.

> The setup we have should not have any dependencies on our DR site.
> Our Xymon servers at our primary site use DNS servers in our primary
> site. They monitor a bunch of servers at our DR site, but the
> dependency ends there (and all that should mean is the servers show
> RED when the DR site is down).

Aha - but you DO have tests in each setup that check systems on the
other site? Would that happen to include any DNS or NTP checks?

I suspect that you have each of your Xymon servers set up to test
availability of the DNS servers on both the primary and the DR site.
That could be a problem.

> Another bit of information: during this 5 hour outage, both of our
> Xymon servers went from showing properly (where DR servers were
> showing RED conn as they weren't reachable, but the servers we
> monitor in our primary site were up), to everything going purple in
> conn (and other tests). It would alternate back and forth over the
> course of the outage (I didn't detect a regular timeframe of when it
> switched from RED to PURPLE).

The interesting thing is that they switch to purple, indicating that
something is stalled.

I have seen something like this happen when we had a number of DNS
checks in the Xymon servers, and network access to these failed (broken
switch to a customer network). This caused xymon to stall on these DNS
checks, and all of the network tests went purple.

I know that this is difficult to test, because obviously you cannot just
cut the connection between the two sites to try it out. But you could
try applying this patch, which changes the DNS lookup code to use the
same kind of timeout settings as the development version. The 4.3.x
versions suffer from a common misunderstanding about how the C-ARES
library handles timeouts, which makes DNS timeouts take much too long.

One possible way of testing it would be if you can firewall access from
e.g. your DR site Xymon server to the primary site's DNS server. If you
are running Xymon on a Linux server, then "iptables" can do that for
you. If your primary site DNS server is 10.1.2.3, then

   iptables -I OUTPUT 1 -d 10.1.2.3 -j DROP
   iptables -I INPUT 1 -s 10.1.2.3 -j DROP

will cause all traffic to/from this server to be dropped.
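
Since rules inserted this way stay in place until they are deleted, it may
help to wrap the block and the matching cleanup in a small script. The
following is only a sketch: the DNS_IP and DRY_RUN variables and the
function names are made up for illustration and are not part of Xymon.
With DRY_RUN=1 (the default here) the commands are only printed, so they
can be reviewed before being run as root.

```shell
#!/bin/sh
# Sketch: block, and later unblock, all traffic to/from one DNS server.
# DNS_IP, DRY_RUN and the function names are illustrative assumptions.

DNS_IP="${DNS_IP:-10.1.2.3}"   # example address used in the message above
DRY_RUN="${DRY_RUN:-1}"        # 1 = only print the commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it (needs root).
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Insert DROP rules at the top of the OUTPUT and INPUT chains.
block_dns() {
    run iptables -I OUTPUT 1 -d "$DNS_IP" -j DROP
    run iptables -I INPUT 1 -s "$DNS_IP" -j DROP
}

# Delete the rules again; -D removes the first rule matching the spec.
unblock_dns() {
    run iptables -D OUTPUT -d "$DNS_IP" -j DROP
    run iptables -D INPUT -s "$DNS_IP" -j DROP
}
```

After sourcing the script, call block_dns, wait a few test cycles to see
whether the network tests go purple, then call unblock_dns to restore
normal traffic.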


Regards,
Henrik
