[Xymon] Purple storm

Poppy, Ben poppy.ben at marshfieldclinic.org
Thu Apr 12 00:23:12 CEST 2012


And a fiber cut to our DR datacenter caused another 5-hour purple storm.

The traces on our DCs showed that the DCs at our production datacenter were receiving and responding to all DNS lookups. None were being forwarded down to our other datacenter.

While this was happening, we changed our secondary xymon server to point to our Linux BIND DNS servers (so that xymon1 was pointing to dc1/2, and xymon2 was pointing to binddns1/2), and we still had the purple storms on both xymon servers.

At this point, I'm not sure what to do. Upgrade to the latest xymon in the hope that a bug that's causing this has been fixed? Downgrade back to xymon 4.2.3 or even hobbit 4.2?

Another idea they are suggesting is changing all the shortname entries (in hosts.cfg) to FQDN entries. The problem is that there are over 1700 entries, and I'd essentially have to find out which domain each one is in and then run a rename command as well. Not to mention that we don't have this issue unless our DR-DC goes down.
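
If the FQDN route is taken, the bulk rewrite of hosts.cfg could be scripted rather than done by hand. The following is only a minimal sketch, assuming host lines have the usual "IP-address hostname # tags" shape and that the system resolver can qualify each short name. The function names (qualify_line, system_lookup) are hypothetical, and the sketch does not touch historical data, so the rename step mentioned above would still be needed.

```python
import re
import socket

# Matches a hosts.cfg host line: "IP-address hostname [# tags]".
HOST_LINE = re.compile(r"^(\s*\d{1,3}(?:\.\d{1,3}){3}\s+)(\S+)(.*)$")


def system_lookup(short_name):
    """Ask the system resolver for the FQDN; return None if it stays unqualified."""
    fqdn = socket.getfqdn(short_name)
    return fqdn if "." in fqdn else None


def qualify_line(line, lookup=system_lookup):
    """Return the line with a short hostname replaced by its FQDN, if one is found."""
    body = line.rstrip("\n")
    ending = line[len(body):]  # preserve the original line ending
    match = HOST_LINE.match(body)
    if not match:
        return line  # comments, page/group directives, blank lines
    prefix, name, rest = match.groups()
    if "." in name:
        return line  # already fully qualified
    fqdn = lookup(name)
    if not fqdn:
        return line  # leave unresolved names for manual review
    return prefix + fqdn + rest + ending
```

Running qualify_line over every line of a copy of hosts.cfg and diffing the result against the original would show which entries changed and which short names could not be resolved, before anything is put into production.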

Any other ideas from the list by chance?

-----Original Message-----
From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf Of Jamison Maxwell
Sent: Tuesday, March 20, 2012 5:11 PM
To: xymon at xymon.com
Subject: Re: [Xymon] Purple storm

I think the interesting sniffer would be on the DC's that remain up.  Just to make sure I've got this straight, you've got two DC's on the LAN with Xymon and two DC's in your DR site.  You shut down the DC's in the DR site and now queries are timing out (or something) to the DC's on the LAN.

If that's the case, I would first look at DNS on the DC's.  If you do a packet capture on the DC's filtered to UDP 53 and to only packets from or to your Xymon server, this would show whether the queries are making it to your remaining DC's and whether there is any delay in the response.  It wouldn't surprise me if Windows was so worried about the missing domain controllers that it forgot to respond to DNS queries.  If there's no delay, then refer to the tcpdump you have going on your Xymon server to make sure the packets are making it back with acceptable latency.

I'm making the assumption that your DNS configuration has the zone in question as a primary, Active Directory integrated zone.  Also, are you running a caching DNS server on your Xymon system?  I've seen some odd results with DNS caching.  What order are the name servers in /etc/resolv.conf?  I was playing with Debian this weekend and for some reason, no matter what I did, it would not use any name server other than the first one in the list for regular queries, even though dig and nslookup did.
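
Since dig and nslookup carry their own resolver code, a manual dig that works does not prove that regular queries through the system resolver are healthy. The following is only a rough sketch of how the two points above could be checked from the Xymon server: it lists the nameserver order from resolv.conf text and times a lookup through the same system resolver path that ordinary programs use. The function names are hypothetical.

```python
import socket
import time


def nameservers(resolv_conf_text):
    """Return nameserver addresses in the order they appear in resolv.conf text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def time_lookup(hostname, resolve=socket.getaddrinfo):
    """Time one lookup through the system resolver; return (seconds, error or None)."""
    start = time.monotonic()
    try:
        resolve(hostname, None)
        error = None
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        error = str(exc)
    return time.monotonic() - start, error
```

Passing the contents of /etc/resolv.conf to nameservers() and calling time_lookup() on a handful of short names from hosts.cfg, once with the DR DC's up and once with them down, would show whether the slow path is the system resolver itself rather than the DNS servers.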

Of course you could always just install your favorite DNS server on your Xymon system and transfer the zones as secondary zones from your Windows boxes.  That'll definitely solve the problem and remove the prerequisite of your DC's staying up so Xymon works.  ...matter of fact, I think I'll do that myself....




Jamison Maxwell

-----Original Message-----
From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf Of Don Kuhlman
Sent: Tuesday, March 20, 2012 3:05 PM
To: xymon at xymon.com
Subject: Re: [Xymon] Purple storm

Would you be able to run a tcpdump or use a network sniffer to see what the server is doing when you're getting the long response times?

Maybe that will help you see what it is trying to reach when that is happening.


On 3/20/12 1:51 PM, "Poppy, Ben" <poppy.ben at marshfieldclinic.org> wrote:

>Yes, that's the strange part, we can still manually do digs and 
>nslookups from the xymon server to other DNS servers.
>
>-----Original Message-----
>From: Phil Crooker [mailto:Phil.Crooker at orix.com.au]
>Sent: Tuesday, March 20, 2012 12:41 AM
>To: Poppy, Ben
>Cc: xymon at xymon.com
>Subject: Re: [Xymon] Purple storm
>
>So, can you do DNS queries from the xymon server when DC3 & 4 are down?
>
>
>>>> "Poppy, Ben"  03/20/12 11:50 AM >>>
>So they are pointing to 2 DC's that stay up this entire time; we'll 
>call them DC1 and DC2. Then we shut down DR-DC3 and DR-DC4. When those 
>servers are down, we begin to have issues.
>
>-----Original Message-----
>From: Jeremy Laidman [mailto:jlaidman at rebel-it.com.au]
>Sent: Monday, March 19, 2012 7:46 PM
>To: Poppy, Ben
>Cc: xymon at xymon.com
>Subject: Re: [Xymon] Purple storm
>
>On Tue, Mar 20, 2012 at 5:15 AM, Poppy, Ben  wrote:
>> I have an interesting problem that happened last night. We are working
>> on a DR test. Part of that test includes shutting down some DC's in
>> our DR datacenter. When that happened, most tests that are initiated
>> from the xymon servers (http, dns, ssh, ftp, etc) to the monitored
>> server went purple.
>
>For network tests, Xymon resolves the IP address from the servername 
>(typically using DNS), and then uses that IP address to perform the test.
>The IP address in the hosts.cfg file is not normally used for network 
>tests.  So if your DNS fails, Xymon's network tests fail also.
>
>You can prevent this, and use the IP address supplied in hosts.cfg, by 
>adding "testip" to each hosts.cfg entry that requires it.  You can add 
>it to a ".default." entry so that it applies to all hosts.
>
>J
>
>______________________________________________________________________
>The contents of this message may contain private, protected and/or 
>privileged information.  If you received this message in error, you 
>should destroy the e-mail message and any attachments or copies, and 
>you are prohibited from retaining, distributing, disclosing or using 
>any information contained within.  Please contact the sender and advise 
>of the erroneous delivery by return e-mail or telephone.  Thank you for 
>your cooperation.
>_______________________________________________
>Xymon mailing list
>Xymon at xymon.com
>http://lists.xymon.com/mailman/listinfo/xymon
>
>--
>
>This message from ORIX Australia might contain confidential and/or 
>privileged information. If you are not the intended recipient, any use, 
>disclosure or copying of this message (or of any attachments to it) is 
>not authorised.
>
>If you have received this message in error, please notify the sender 
>immediately and delete the message and any attachments from your system.
>Please inform the sender if you do not wish to receive future 
>communications by email.
>
>ORIX handles personal information according to a Privacy Policy that is 
>consistent with the National Privacy Principles. Please let us know if 
>you would like a copy. It is also available at http://www.orix.com.au .
>

