[Xymon] Xymon flapping: network slowness reality or delusion?
Mills, David (IS)
David.Mills at ngc.com
Thu Mar 21 16:00:44 CET 2013
Resolution: unfortunately, in this case it turned out that after stopping and restarting the Xymon server ("xymon.sh restart"), everything became docile again. The response time graph for "xymonnet" on the server looks like a really bad hair day, followed by a nearly straight blue line after the restart.
We're running 4.3.3 (and, yes, we're trying to migrate to 4.3.10). Has anyone heard of related bugs in this version of the code, or other theories?
From: Mills, David (IS)
Sent: Friday, March 15, 2013 5:18 PM
To: 'cleaver at terabithia.org'
Cc: xymon at xymon.com
Subject: RE: EXT :Re: [Xymon] Xymon flapping: network slowness reality or delusion?
From: cleaver at terabithia.org [mailto:cleaver at terabithia.org]
Sent: Friday, March 15, 2013 1:31 PM
To: Mills, David (IS)
Cc: xymon at xymon.com
Subject: EXT :Re: [Xymon] Xymon flapping: network slowness reality or delusion?
> Hi, All ...
> The other day, our Xymon (4.3.3) started sending out notifications due
> to flapping on various hosts, various network-based tests which lasted
> for a rather sharply-defined period. It caused a fair bit of angst and
> I was on the hot-seat to prove Xymon was functioning properly.
> Here are some of the summary facts:
Assuming you're saving status results in history (the default), can you look at the status messages from the down periods? Were they DNS timeouts or timeout timeouts? I'd start with the ping checks, since that's pretty cut-and-dried...
- Has anything like this occurred before?
- Even if no threshold was crossed on the Xymon server itself, take a look at the 'trends' page for the polling host for that period and see whether anything unusual happened around the same time.
Thanks! After poking around on the Xymonnet history dumps, I found some very interesting stuff I don't know what to make of:
- For the top 20 worst times in a 24 hour period, the three categories of networking that had significantly elevated levels were "TCP tests completed", "DNS tests executed" and "NTP tests executed".
- Oddly, after graphing the respective times for these categories in a spreadsheet, it became obvious that the DNS and TCP tests were roughly inversions of each other: when one was super-high, the other would go low.
- Even weirder, the PING tests were ... NORMAL!! While the rest of the Xymon network tests were jumping off a cliff, good old 'ping' was chugging along without (mostly) mishap. This last datum seems to blow a hole in the theory that this is truly a network problem (vs. a Xymon server/host problem).
Any other thoughts?
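For anyone wanting to check the DNS/TCP "inversion" without a spreadsheet, here is a minimal sketch that computes the Pearson correlation between two timing series. The numbers are made-up placeholders, not values from our server, and the names (pearson, dns_secs, tcp_secs) are hypothetical; you would substitute the per-run times pulled from your own xymonnet history. A result near -1 would support the "one goes up when the other goes down" observation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-run durations in seconds (placeholder data only)
dns_secs = [2.0, 30.0, 3.0, 28.0, 2.5, 31.0]
tcp_secs = [40.0, 5.0, 38.0, 6.0, 41.0, 4.0]

r = pearson(dns_secs, tcp_secs)
print(f"DNS vs TCP correlation: {r:.2f}")  # strongly negative for this sample
```

Running the same function against the ping times should give a value near zero if ping really was unaffected while the other tests swung around.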