[hobbit] fping tuning

Hubbard, Greg L greg.hubbard at eds.com
Thu Apr 27 00:07:22 CEST 2006


Eric,

If it takes twice as long to ping 600 things as it does 300 things,
isn't that to be expected?  After all, you are pinging twice as many
items.  I don't think this is "geometric" but linear.  I don't know how
many parallel paths fping has, but I suspect it is far less than either
300 or 600, so some queuing is going to occur.
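
One way to check the linear-versus-geometric question is to divide each elapsed time by the number of targets. The sketch below uses only the elapsed times from the fping runs and the bbtest report quoted further down in this message; the helper name is made up for illustration.

```python
# (targets, elapsed seconds) for each run quoted in this thread.
RUNS = [
    (300, 8.856),       # fping -f 300ips
    (600, 16.144),      # fping -f 600ips
    (1434, 40.194600),  # PING test duration from the bbtest report
]

def ms_per_host(targets, elapsed_seconds):
    """Average wall-clock milliseconds spent per target (illustrative helper)."""
    return elapsed_seconds / targets * 1000.0

for targets, elapsed in RUNS:
    print(f"{targets:5d} targets: {ms_per_host(targets, elapsed):6.2f} ms per host")

# Going from 300 to 600 hosts multiplies the runtime by this factor.
# A value near 2 means linear growth; quadratic growth would be near 4.
growth = 16.144 / 8.856
print(f"runtime growth for 2x hosts: {growth:.2f}x")
```

All three runs come out at roughly 27 to 30 ms per host, which is what linear scaling looks like.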

But it is hard to calculate, too, because some of the IP addresses do
not respond.  If a host responds, then the pinger is free to move to the
next one in the list.  Otherwise it has to go through the "time out and
retry" dance.  So a non-responsive host could cause 5 or 6 seconds in
delay until the pinger decides it is down and moves on.  Since fping
runs things in parallel, it is the luck of the draw regarding which
stream might get bogged down.
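
The cost of the "time out and retry" dance for one dead host can be estimated from the timeout, retry, and backoff settings. This is only a sketch: it assumes the first wait equals the -t value and each retry waits the previous wait multiplied by the -B factor, and the "default" values in the second call are an assumption about fping builds of this era, not something stated in this thread.

```python
def worst_case_wait_ms(timeout_ms, retries, backoff):
    """Total time spent waiting on a host that never answers.

    Assumes the first wait is timeout_ms and every retry waits the
    previous wait multiplied by backoff (illustrative model only).
    """
    total = 0.0
    wait = float(timeout_ms)
    for _ in range(retries + 1):  # the initial attempt plus each retry
        total += wait
        wait *= backoff
    return total

# Flags from the test command in this thread: -t250 -r1 -B2
print(worst_case_wait_ms(250, 1, 2.0))   # 750.0

# Assumed defaults (-t500 -r3 -B1.5); check your own fping build.
print(worst_case_wait_ms(500, 3, 1.5))   # 4062.5
```

Under this model a dead host costs well under a second with the flags used in the test runs, and several seconds with the assumed defaults.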

What is your ping interval?  Pinging 1400 IPs in 40 seconds sounds
pretty good to me -- you have a lot of room for growth.  (Big Brother
can't do this without some help.)  That is about 35 IPs per second.
Round down to 30 per second, multiply by the 300 seconds in a five
minute window, and you could possibly monitor 9,000 IPs with this one
server!  After all, you only need to get to them all within the cycle
before circling around and hitting them again.  Some of the payware
management systems actually try to space out their activity through the
polling cycle so they don't hog the network themselves.
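
The arithmetic above, written out as a small sketch. The function names are illustrative; the numbers are the ones from this thread.

```python
def ping_rate(hosts, elapsed_seconds):
    """Hosts pinged per second."""
    return hosts / elapsed_seconds

def capacity(rate_per_second, cycle_seconds):
    """How many hosts fit into one polling cycle at the given rate."""
    return int(rate_per_second * cycle_seconds)

print(ping_rate(1400, 40))   # 35.0 hosts per second
print(capacity(30, 300))     # 9000 hosts in a five minute cycle
print(capacity(30, 60))      # 1800 hosts in a 60 second cycle
```

The last line matters if the polling cycle is 60 seconds rather than five minutes: at the same rate the ceiling drops to about 1,800 hosts, so the headroom is much smaller.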

GLH
 

-----Original Message-----
From: Schwimmer, Eric E *HS [mailto:EES2Y at hscmail.mcc.virginia.edu] 
Sent: Wednesday, April 26, 2006 4:50 PM
To: hobbit at hswn.dk
Subject: RE: [hobbit] fping tuning

> > 
> > We're monitoring 1420 IPs in hobbit, and it takes fping ~40 seconds 
> > to go through them all:
> 
> Is that a number you get from the "bbtest" status or from running 
> fping by hand?

Both.  The values are fairly consistent, falling somewhere in the
39-42 second range.
 
> Are you doing other network tests in Hobbit than just ping?
> Hobbit does the ping tests in parallel with the other tests.

We are doing other tests, but not many.  Here are the relevant lines from
our server's bbtest report:

TIME SPENT
Event                                            Starttime      Duration
TCP tests completed                      1146086231.293585      1.211963
PING test completed (1434 hosts)         1146086271.488185     40.194600
PING test results sent                   1146086271.523332      0.035147
TIME TOTAL                                                     41.549643

> > <snip>
> > [root at hobbit fping]# fping -i5 -b12 -f ips -r1 -t250 -B2 -q -s
> 
> Are you using those parameters also on the FPING command in 
> hobbitserver.cfg? Or is it just for your testing ?

This is just what I've been using for testing (the -f flag is root only
and wouldn't work very well when used from within hobbit).  The value of
my FPING envvar in hobbitserver.cfg is "/usr/sbin/fping -i10 -b12".
However, the average difference in polling time between the two is only 1
or 2 seconds.
 
> > Now, this seems a bit lengthy to me.  I mean, if the avg round trip 
> > time is 5.83 ms, and there are 1430 hosts, shouldn't the total time in 
> > transit for all hosts be 8336 ms, or 8 seconds... right?
> 
> No, it should be less - because fping pings several hosts in parallel.
> 
> You have "-i5" which causes a 5 ms delay between each ping.
> So that's (5/1000)*1430 = 7.15 seconds where it does nothing.
> The default setting is "-i25" - i.e. 5 times higher - which would 
> actually match your ~40 seconds nicely.

Using the default delay interval (i.e. not specifying the -i flag when
calling fping) causes the test to take much longer, on the order of 60 -
70 seconds.  However, values of 15 or less passed to -i don't make much
of a difference in polling time.
(FWIW, fping doesn't let you specify a value for -i less than 10 unless
you are root.  I hacked the fping code to get around this so I could run
it under hobbit with -i1, but I saw no difference in polling times using
-i1 vs -i15).
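
The interval arithmetic quoted above generalizes to a simple lower bound: if fping waits -i milliseconds between packets, the run cannot finish faster than hosts times interval. The sketch below assumes one packet per host and ignores retries and reply waits; the function name is illustrative.

```python
def send_floor_seconds(hosts, interval_ms, packets_per_host=1):
    """Minimum run time imposed by the inter-packet interval alone."""
    return hosts * packets_per_host * interval_ms / 1000.0

for interval_ms in (1, 5, 10, 15, 25):
    floor = send_floor_seconds(1430, interval_ms)
    print(f"-i{interval_ms:<3d} -> at least {floor:5.2f} s for 1430 hosts")
```

For 1430 hosts the floor is 7.15 s at -i5 and 35.75 s at -i25. The observed ~40 seconds at -i15 and below is far above the corresponding floors, which fits the observation that lowering -i further makes no difference: something other than the interval is the limit there.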


> Don't forget that there is probably also some time spent doing ARP 
> lookups for all of these IP's. Unless you have "testip"
> on all of the entries in bb-hosts (or run bbtest-net with "--dns=ip"),
> you'll also spend some time on DNS lookups
> (hint: use a local caching DNS server on the Hobbit server).

Yep, I have --dns=ip in the bbtest-net stanza of my hobbitlaunch.cfg
(that makes a BIG difference).  In the testing fping command above, the
-f flag specifies a file that lists all the IP addresses from my
bb-hosts file, with no DNS names included, so I don't think it's a DNS
resolution problem.
I feel like it's some sort of concurrency issue within fping, since I can
reproduce this latency completely outside of hobbit.

As a complete aside, we use a caching DNS server for things outside of
hobbit, and I've written a little script that monitors the bb-hosts file
(and all files included from bb-hosts); when it detects any changes, it
writes a bind9 zone file to somewhere on disk.  It's handy for making
sure your bb-hosts is synced with your DNS.  If anybody is interested in
it, drop me a line (I'll have to 'pretty it up' first) and I'll post it
on my hobbit tools page for people to use.

> > Even when I remove the hosts that aren't responding, the results 
> > are on par with those above.
> > 
> > Our polling interval is once every 60 seconds (which we want to 
> > maintain, because we like to know ASAP when something drops even one
> > ping), so it's not a problem yet. We add hosts on a daily basis, 
> > however, so it will be a problem some time in the future and I'd 
> > like to fix it before it becomes a problem.
> 
> Well, the good news is that it probably won't become a problem.
> Because fping pings multiple hosts in parallel, the runtime doesn't 
> change very much when you add more hosts.

Ah, so you would think ;)  However, our graph in our bbtest column says
otherwise;  it has been climbing slowly but steadily since it started
graphing data.  You can also reproduce this by using a newline delimited
list of IP addresses in a file, like I did above, and feeding it to
fping.  As you increase the number of IPs in the file, the poll time
increases geometrically.  For instance, when I poll 300 hosts:

<snip>
[root at hobbit fpingtest]# fping -i5 -b12 -f 300ips -r1 -t250 -B2 -q -s

     300 targets
     298 alive
       2 unreachable
       0 unknown addresses

      12 timeouts (waiting for response)
     310 ICMP Echos sent
     298 ICMP Echo Replies received
       0 other ICMP received

 0.24 ms (min round trip time)
 2.38 ms (avg round trip time)
 101 ms (max round trip time)
        8.856 sec (elapsed real time)
</snip>

vs when I poll 600 hosts:
<snip>
[root at hobbit fpingtest]# fping -i5 -b12 -f 600ips -r1 -t250 -B2 -q -s

     600 targets
     597 alive
       3 unreachable
       0 unknown addresses

      14 timeouts (waiting for response)
     611 ICMP Echos sent
     597 ICMP Echo Replies received
       0 other ICMP received

 0.21 ms (min round trip time)
 2.48 ms (avg round trip time)
 100 ms (max round trip time)
       16.144 sec (elapsed real time)
</snip>

You can see that the ping time roughly doubles.  This is bad :(

> If it does become an issue, spread the load. Set up an extra server to 
> do half the network tests, and configure your bb-hosts file with 
> "NET:net-a" and "NET:net-b" tags on the hosts. Then you set 
> BBLOCATION="net-a" on one box, and BBLOCATION="net-b" on the other. 
> Then they'll only test those hosts where the NET:...
> setting matches. Unless it's an OS limitation, you could probably do 
> that on a single box and just have two instances of the [bbnet] task 
> in hobbitlaunch.cfg - instead of running bbtest-net directly, they 
> would run a shell-script which sets the BBLOCATION environment just 
> before running bbtest-net.
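
The wrapper described in the quoted text is a shell script; the same idea is sketched here in Python instead. It assumes bbtest-net is on the PATH and that any options it needs are passed through by the caller. The function names are hypothetical.

```python
import os
import subprocess

def bbtest_env(location, base_env=None):
    """Return a copy of the environment with BBLOCATION set to location."""
    env = dict(os.environ if base_env is None else base_env)
    env["BBLOCATION"] = location
    return env

def run_bbtest_net(location, extra_args=()):
    """Run bbtest-net (assumed to be on PATH) for one network location."""
    cmd = ["bbtest-net", *extra_args]
    return subprocess.run(cmd, env=bbtest_env(location)).returncode
```

Each [bbnet] task would then call something like run_bbtest_net("net-a", ["--dns=ip"]) or run_bbtest_net("net-b", ["--dns=ip"]).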

I was thinking of doing something along these lines; however, the
bb-hosts file is maintained mostly by the (non-unix-savvy) staff here,
using the bb-hostedit CGI script, and I'd rather not have them have to
keep track of which host needed which NET tag, etc.  

I've tested the network capabilities of this box using iperf as well as
several concurrent ping floods, and it can send upwards of 10000 ICMP
packets per second (with successful replies from another host on the
same 1000BASE-T switch).

So this leads me to believe that it is a problem solely with fping;  if
they had a public forum or a mailing list, I'd be whining there instead
of here. :)  I can't say that I was expecting to find the 'magic bullet'
for this problem here, but I was hoping that there might be some fping
wizard out there with some magic bullets to spare.  Anywho, thanks for
your thoughts, Henrik.  I'll poke some more at the fping code and see if
I can figure out what's going on (I doubt it);  if not, I'll start
working towards hacking together a load balancing script that will
auto-add NET: tags to bb-hosts entries, or something along those lines.
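
A rough idea of what such an auto-tagging script could look like. This is a hypothetical sketch, not an existing tool: it assumes host lines in bb-hosts start with an IP address and carry their tags after a "#", and it reuses the net-a and net-b names from the quoted suggestion.

```python
def assign_net_tags(lines, locations=("net-a", "net-b")):
    """Append a NET: tag, round-robin, to host lines that lack one.

    lines is a list of bb-hosts lines without trailing newlines.
    Blank lines, comments, directives, and already tagged hosts are
    returned unchanged.
    """
    tagged = []
    next_slot = 0
    for line in lines:
        stripped = line.strip()
        is_host = bool(stripped) and stripped[0].isdigit()
        if not is_host or "NET:" in line:
            tagged.append(line)
            continue
        tag = "NET:" + locations[next_slot % len(locations)]
        next_slot += 1
        # Tags live after the "#"; add the separator if the line has none.
        separator = " " if "#" in line else " # "
        tagged.append(line.rstrip() + separator + tag)
    return tagged

sample = [
    "page servers Servers",
    "10.0.0.1 web1 # http://web1/",
    "10.0.0.2 web2",
    "10.0.0.3 db1 # NET:net-a",
]
for line in assign_net_tags(sample):
    print(line)
```

Round-robin keeps the two groups the same size, but tags would shift if hosts were reordered; hashing the IP address instead would keep each host in a stable group.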

Thanks,
-Eric
