Debugging help: bbtest-net gets http test timing wrong

Alan Sparks asparks at doublesparks.net
Sat Jun 14 02:20:57 CEST 2008


I have a new install of Hobbit (4.2; I tried a 4.3 snapshot as well) on 
a fresh install of CentOS 4.6 x86_64, up to date on patches.  I have a 
problem with HTTP tests on seemingly random web servers that I just 
can't figure out.

I have about 64 of my hosts in bb-hosts on this server, with http tests 
defined for them.  For most of these servers, Hobbit reports the 
"Seconds:" value for the response as 3 seconds.  The set of affected 
servers is not consistent: from one cycle to the next, the 3-second 
response may move to a different set of servers.

The http: tests are defined using the IP address of the server - no 
server name (so no DNS lookup).

I've run a loop of tests on the same URL using wget and curl, and also 
used my browser and telnet to connect to the same URL.  I consistently 
get a response time of about 0.2 seconds maximum from the servers.
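
For comparison with the figures bbtest-net logs, a timing loop like the 
one described above can be sketched in Python.  This is only an 
illustration, not what wget or curl do internally: the function name 
time_http_get, the "/" path and the 10-second timeout are assumptions, 
and it issues a plain HTTP/1.0 GET so that the server closes the 
connection when the response is complete.

```python
import socket
import time


def time_http_get(host, port=80, path="/", timeout=10.0):
    """Return (connect_seconds, total_seconds) for one HTTP/1.0 GET.

    Hypothetical helper for illustration; host is expected to be an IP
    address so that no DNS lookup is included in the timing.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connected = time.monotonic()
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        # HTTP/1.0 without keep-alive: read until the server closes.
        while sock.recv(4096):
            pass
        done = time.monotonic()
    return connected - start, done - start
```

Calling time_http_get("10.1.5.17") repeatedly and printing both values 
gives a connect time and a total time that can be set side by side with 
the connecttime= and totaltime= fields in the debug log.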

The bbnet entry in hobbitlaunch.cfg looks like:
CMD bbtest-net --report --ping --checkresponse --debug

With the debugging turned on, I see the following entries periodically 
in the network test log:
Address=10.1.5.17:80, open=1, res=0, err=0, connecttime=0.002965, totaltime=3.006810,
Address=10.1.5.18:80, open=1, res=0, err=0, connecttime=0.002956, totaltime=3.007413,
Address=10.1.24.67:80, open=1, res=0, err=0, connecttime=0.002860, totaltime=3.007120,
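
To see which hosts are affected in each cycle, the debug entries above 
can be filtered for a large totaltime.  The following is a minimal 
sketch that assumes the entries look exactly like the three quoted 
here; the name slow_entries and the 1-second threshold are illustrative 
choices, not part of Hobbit.

```python
import re

# One "Address=..." debug entry.  The \s* before totaltime tolerates a
# line break inserted by a mail client or terminal wrap.
ENTRY = re.compile(
    r"Address=(?P<host>[\d.]+):(?P<port>\d+), open=\d+, res=\d+, err=\d+, "
    r"connecttime=(?P<connect>[\d.]+),\s*totaltime=(?P<total>[\d.]+)"
)


def slow_entries(log_text, threshold=1.0):
    """Return (host, port, connecttime, totaltime) for entries whose
    totaltime exceeds threshold (in seconds)."""
    hits = []
    for match in ENTRY.finditer(log_text):
        total = float(match.group("total"))
        if total > threshold:
            hits.append((
                match.group("host"),
                int(match.group("port")),
                float(match.group("connect")),
                total,
            ))
    return hits


# The three entries quoted in this message, wrapped as they appear here.
SAMPLE = """\
Address=10.1.5.17:80, open=1, res=0, err=0, connecttime=0.002965,
totaltime=3.006810,
Address=10.1.5.18:80, open=1, res=0, err=0, connecttime=0.002956,
totaltime=3.007413,
Address=10.1.24.67:80, open=1, res=0, err=0, connecttime=0.002860,
totaltime=3.007120,
"""

for host, port, connect, total in slow_entries(SAMPLE):
    print(f"{host}:{port} connect={connect:.6f} total={total:.6f}")
```

Running this over the log from several consecutive cycles would show 
whether the affected set really changes from cycle to cycle.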

The problem does not affect the same hosts each time.  The number of 
affected hosts usually differs from cycle to cycle; sometimes the same 
servers are hit, but often different ones.

I've tried the following to see if anything will help:
* Reducing the number of hosts.  If I only have two or three in 
bb-hosts, the problem doesn't manifest.
* Recompiling.  Doesn't help.
* Changing the test URL.  Doesn't help.
* Adding a --concurrency= option to the launch.  If I use a concurrency 
of 1, the problem does not manifest.

Running with a concurrency of 1 isn't an option as a permanent fix, but 
the fact that it helps makes me think something is getting really mixed 
up in the select() processing in bbtest-net.
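
One way to check whether concurrency alone produces the delay is to 
open all the connections at once from a small stand-alone script and 
time each one.  The sketch below is NOT bbtest-net's code; it is a 
hypothetical select()-driven probe (probe_concurrently is an invented 
name) that sends an HTTP/1.0 GET to every target in parallel and 
records how long each takes until the server closes the connection.

```python
import select
import socket
import time


def probe_concurrently(targets, path="/", timeout=10.0):
    """Fetch path from all (host, port) targets at the same time.

    Returns {(host, port): seconds or None}; None means the probe
    failed or did not finish within timeout.  Hosts should be IP
    addresses so that no blocking DNS lookup is involved.
    """
    pending = {}  # socket -> [target, start time, unsent request bytes]
    results = {}
    for host, port in targets:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)
        started = time.monotonic()
        sock.connect_ex((host, port))  # non-blocking: finishes in background
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
        pending[sock] = [(host, port), started, request]

    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        # Only sockets with request bytes left need to be writable.
        writers = [s for s, state in pending.items() if state[2]]
        readable, writable, _ = select.select(list(pending), writers, [], 0.5)
        for sock in set(readable) | set(writable):
            target, started, unsent = pending[sock]
            try:
                if unsent:
                    sent = sock.send(unsent)
                    pending[sock][2] = unsent[sent:]
                    continue
                if sock.recv(4096):
                    continue  # more response data may follow
                # Empty read: the server closed, so the response is complete.
                results[target] = time.monotonic() - started
            except BlockingIOError:
                continue  # not actually ready yet; try again next round
            except OSError:
                results[target] = None  # refused, reset or otherwise failed
            sock.close()
            del pending[sock]

    for sock, (target, _, _) in pending.items():  # whatever is left timed out
        results[target] = None
        sock.close()
    return results
```

If a run against all 64 hosts also shows some at about 3 seconds, the 
delay is more likely in the network or the servers than in Hobbit; if 
every host stays near 0.2 seconds, that points back at bbtest-net.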

Does anyone have any ideas how to diagnose where Hobbit is coming up 
with a 3-second latency, when none of my test tools running on the same 
server can reproduce that timing?

Thanks for any ideas, this is really baffling me.
-Alan




