[Xymon] Xymon flapping: network slowness reality or delusion?

cleaver at terabithia.org
Fri Mar 15 19:30:57 CET 2013


> Hi, All ...
>
> The other day, our Xymon (4.3.3) started sending out notifications due to
> flapping on various hosts, various network-based tests which lasted for a
> rather sharply-defined period. It caused a fair bit of angst and I was on
> the hot-seat to prove Xymon was functioning properly.
>
> Here are some of the summary facts:
>
> -       The flapping is pretty well documented in Xymon as occurring due
> to connection times exceeding our 10-second threshold - most of the , as
> configured in tasks.cfg
>
>            CMD xymonnet --report --ping --checkresponse \
>                         --timeout=10 --dns-timeout=2 \
>                         --dnslog=/var/log/xymon-4.3.3/dns.log \
>                         --concurrency=5
>    INTERVAL 3m
>
> -       Output from the "xymonnet" report (currently - not captured during
> the "storm") shows:
>
>             xymonnet version 4.3.3
>             SSL library : OpenSSL 0.9.8l 5 Nov 2009
>             LDAP library: OpenLDAP 20416
>
>             Statistics:
>              Hosts total           :     2081
>              Hosts with no tests   :        2
>              Total test count      :     2864
>              Status messages       :     2856
>              Alert status msgs     :        0
>              Transmissions         :       30
>
>             DNS statistics:
>              # hostnames resolved  :     3337
>              # succesful           :      921
>              # failed              :     1266
>              # calls to dnsresolve :     2850
>
>             TCP test statistics:
>              # TCP tests total     :     1769
>              # HTTP tests          :     1244
>              # Simple TCP tests    :      525
>              # Connection attempts :     1767
>              # bytes written       :   235845
>              # bytes read          :  2514747
>
>
>             TIME SPENT
>             Event                                       Start time        Duration
>             xymonnet startup                            1040654.310651           -
>             Service definitions loaded                  1040654.319152    0.008501
>             Tests loaded                                1040655.696733    1.377581
>             DNS lookups completed                       1040656.213268    0.516534
>             Test engine setup completed                 1040657.416739    1.203470
>             TCP tests completed                         1040675.444183   18.027443
>             PING test completed (923 hosts)             1040699.991467   24.547283
>             PING test results sent                      1040700.080247    0.088780
>             Test result collection completed            1040700.144033    0.063785
>             LDAP test engine setup completed            1040700.152852    0.008819
>             LDAP tests executed                         1040700.360821    0.207968
>             LDAP tests result collection completed      1040700.360829    0.000007
>             DNS tests executed                          1040700.441820    0.080991
>             NTP tests executed                          1040722.413523   21.971702
>             Test results transmitted                    1040723.295458    0.881935
>             xymonnet completed                          1040723.313935    0.018476
>             TIME TOTAL                                                   69.003284
>
> -       Rather sharply defined start-up / cut-off for the "storm": I can
> point to the 5-minute segment when it started / stopped
> -       The Xymon server OS/NIC hardware check out diagnostically
> -       According to our network team's records, the network connection
> bandwidth utilization coming in / out of the Xymon server was < 1%
> capacity (i.e. we have lots of bandwidth)
> -       According to our network team there was no significant loss of
> packets or congestion at the switch level (there's only one hop between
> the Xymon server and the rest of the hosts)
> -       The types of services affected seemed pretty random: mostly HTTP
> tests, but LOTs of SSH/ping/NTP/LDAP, etc. as well.
>
> Any initial thoughts?
>
> Thanks!
>
> david
>
> ~~~~~~~~~~~~~~~~~~~
> David Mills
> Systems Administrator
> Northrop Grumman
> 512-595-1238
> david.mills at ngc.com
>



Assuming you're saving status results in history (the default), can you
look at the status messages from the down periods? Were they DNS timeouts
or actual connection timeouts? I'd start with the ping checks, since
they're pretty cut-and-dried...
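
As a rough aid to the DNS-versus-connection question, the DNS counters in
the quoted xymonnet report can be tallied. This is only a sketch: the
numbers are copied from the report above, the variable names are made up
for illustration, and how xymonnet counts a "resolved", "successful" or
"failed" lookup is an assumption, so read the result as a hint rather than
a diagnosis.

```python
# Counters copied from the "DNS statistics" block of the quoted report.
# The variable names are illustrative, not xymonnet identifiers.
resolved = 3337     # "# hostnames resolved"
successful = 921    # "# succesful" (spelled as in the report)
failed = 1266       # "# failed"

# Lookups that the report explicitly marks as a success or a failure.
accounted = successful + failed

# The remainder is not explained by the report itself (assumption:
# possibly cached or otherwise uncounted lookups).
unaccounted = resolved - accounted

# Share of failures among the lookups that were marked either way.
failed_share = failed / accounted

print(f"accounted for: {accounted}, unaccounted: {unaccounted}")
print(f"failed share of accounted lookups: {failed_share:.1%}")
```

Note that this report was taken outside the "storm", so a sizeable failed
count here would be a baseline condition worth understanding on its own.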

- Has anything like this occurred before?
- Even if no threshold was crossed on the Xymon server itself, take a look
at the 'trends' page for the polling host for that period and see whether
anything unusual happened around the same time.
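
For a baseline to compare the trends against, the TIME SPENT figures from
the quoted report can be summed to see which phases dominate a normal run
and how much headroom remains in the 3-minute interval. A minimal sketch,
with the durations copied from the report; the names `phases`,
`REPORTED_TOTAL` and `INTERVAL_SECONDS` are made up for illustration:

```python
# (event, duration in seconds) pairs copied from the quoted TIME SPENT report.
phases = [
    ("Service definitions loaded", 0.008501),
    ("Tests loaded", 1.377581),
    ("DNS lookups completed", 0.516534),
    ("Test engine setup completed", 1.203470),
    ("TCP tests completed", 18.027443),
    ("PING test completed (923 hosts)", 24.547283),
    ("PING test results sent", 0.088780),
    ("Test result collection completed", 0.063785),
    ("LDAP test engine setup completed", 0.008819),
    ("LDAP tests executed", 0.207968),
    ("LDAP tests result collection completed", 0.000007),
    ("DNS tests executed", 0.080991),
    ("NTP tests executed", 21.971702),
    ("Test results transmitted", 0.881935),
    ("xymonnet completed", 0.018476),
]

REPORTED_TOTAL = 69.003284   # "TIME TOTAL" line of the report
INTERVAL_SECONDS = 3 * 60    # "INTERVAL 3m" from the tasks.cfg excerpt

total = sum(duration for _, duration in phases)

# The three longest phases, longest first.
top3 = sorted(phases, key=lambda p: p[1], reverse=True)[:3]
top3_share = sum(duration for _, duration in top3) / total

print(f"summed total: {total:.6f}s (reported {REPORTED_TOTAL}s)")
for name, duration in top3:
    print(f"{duration:10.6f}s  {name}")
print(f"top three phases: {top3_share:.0%} of the run")
print(f"headroom in interval: {INTERVAL_SECONDS - total:.1f}s")
```

In this quiet-period run the ping, NTP and TCP phases account for nearly
all of the run time. Repeating the same tally on a report captured during
a storm, if one can be obtained, would show which phase stretched.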


HTH,
-jc




