[hobbit] Debugging help: bbtest-net gets http test timing wrong

Alan Sparks asparks at doublesparks.net
Mon Jun 16 06:42:46 CEST 2008


On the Apache servers, they are configured with HostNameLookups Off.  My 
own measurements using "conventional" tools (like a loop of curls and 
wgets) consistently show responses <0.2s.  The Hobbit problem is that a) 
it randomly affects servers in the list (I have about 64 servers 
configured on the test server), and b) no server that Hobbit suddenly 
reports as "slow" ever appears to be slow in other external tests.
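
For anyone wanting to repeat the "conventional" measurement without curl 
or wget, here is a minimal sketch of that kind of probe loop in Python.  
It is not how bbtest-net measures anything; the function names 
(time_http_get, probe_loop) and the HTTP/1.0 request are my own 
assumptions for illustration.  It records connect time, time to first 
byte, and time until the server closes the connection, so a slow close 
can be told apart from a slow response.

```python
import socket
import time


def time_http_get(host, port=80, path="/", timeout=10.0):
    """Return (connect_s, first_byte_s, total_s) for one HTTP/1.0 GET.

    Hypothetical helper: all three values are seconds since the probe
    started; total_s is taken when the server closes the connection.
    """
    start = time.monotonic()
    sock = socket.create_connection((host, port), timeout=timeout)
    connect_s = time.monotonic() - start
    first_byte_s = None
    try:
        request = (
            "GET {} HTTP/1.0\r\n"
            "Host: {}\r\n"
            "Connection: close\r\n\r\n"
        ).format(path, host)
        sock.sendall(request.encode("ascii"))
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break  # server closed the connection
            if first_byte_s is None:
                first_byte_s = time.monotonic() - start
    finally:
        sock.close()
    total_s = time.monotonic() - start
    return connect_s, first_byte_s, total_s


def probe_loop(host, port=80, path="/", count=10):
    """Run `count` sequential probes and return the list of timings."""
    return [time_http_get(host, port, path) for _ in range(count)]
```

Calling probe_loop("10.1.17.251", count=50) and looking at the largest 
total_s would show whether a plain sequential client ever sees anything 
near 3 seconds.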

Configurations are matched as closely as possible, accounting for module 
and other differences between Apache 2.0 and 2.2.  I'm very sure DNS 
lookups on the Web server end cannot account for this, as it /should/ 
show in logs, and affect non-Hobbit probes as well.

It appears that Hobbit implements HTTP testing itself, in the bbtest-net 
codebase.  No external tools are used.  Yes, resolver and NSS configs 
are the same.  And the HTTP tests are specifically targeted at IP 
addresses, not host names, so there /should/ not be a DNS lookup 
involved in the test connection, as far as I can tell from the code...
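
A quick way to double-check the "IP addresses, not host names" part is 
to run every test URL through a check like the sketch below.  The 
function name needs_dns_lookup is hypothetical, and this only inspects 
the URL itself; it says nothing about whether the client or the server 
does a reverse lookup later.

```python
import ipaddress
from urllib.parse import urlsplit


def needs_dns_lookup(url):
    """Return True if the URL's host is a name rather than an IP literal."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError("no host in URL: {!r}".format(url))
    try:
        # Accepts both IPv4 and IPv6 literals; raises ValueError for names.
        ipaddress.ip_address(host)
    except ValueError:
        return True
    return False
```

Any URL for which this returns True would need a forward lookup before 
the connection can be opened.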

And yeah, the tcpdump on both ends is planned for Monday.  I want to 
somehow prove the response is actually showing up sooner than the test 
says...
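
To pick which hosts and timestamps to look for in the tcpdump output, 
the --debug lines quoted further down in this thread can be filtered 
for probes where totaltime is far larger than connecttime.  This is a 
rough sketch: the regular expression is based only on the sample lines 
in this thread (Address=..., connecttime=..., totaltime=...), and the 
function name slow_probes and the 1-second threshold are assumptions.

```python
import re

# Matches one debug entry, tolerating a line wrap between the fields.
# The (?!Address=) guard stops a match from running into the next entry.
ENTRY_RE = re.compile(
    r"Address=(?P<addr>[\d.]+):(?P<port>\d+),"
    r"(?:(?!Address=).)*?"
    r"connecttime=(?P<connect>\d+\.\d+),\s*"
    r"totaltime=(?P<total>\d+\.\d+)",
    re.DOTALL,
)


def slow_probes(log_text, threshold=1.0):
    """Return (addr, port, connecttime, totaltime) for entries where
    totaltime exceeds connecttime by at least `threshold` seconds."""
    slow = []
    for match in ENTRY_RE.finditer(log_text):
        connect = float(match.group("connect"))
        total = float(match.group("total"))
        if total - connect >= threshold:
            slow.append(
                (match.group("addr"), int(match.group("port")), connect, total)
            )
    return slow
```

Feeding it the contents of the network test log gives a list of the 
probes worth matching against the packet capture.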
-Alan

Tim McCloskey wrote:
> I'll take a stab at this but you may have already looked at the things 
> I wonder about or they may not help in any way. I've only thought 
> about this for a couple of minutes so I could be way off.
>
> You have stated there are a couple servers that seem to respond 
> differently.  Can you, 100% consistently, recreate the proper|improper 
> response from the web boxes?  If so, look at the changelog between 
> apache 2.2 and 2.0 (assuming that those servers - 2.0, 2.2 - are on 
> the same network and that all of the media settings match, i.e. 
> 100Fdx/100Fdx, etc.).
>
> Are you _sure_ you are not using DNS in some fashion, perhaps reverse 
> lookups, or perhaps a lookup setting in the newer Apache config file?  
> (Can you get away with using the same httpd.conf on the 2.0 and 2.2 
> boxes?)
>
> What does the network traffic and connections look like on each of the 
> servers?   Have you tried running tcpdump on one of the web boxes to 
> see if there are any clues there?
>
> I don't recall if Hobbit uses wget for its HTTP GETs.  Are all the 
> servers using the same settings in resolv.conf and nsswitch.conf?
>
> There are probably other things to check but start with making sure 
> DNS is not involved, even if you think it is not.
>
>
>
> Alan Sparks wrote:
>> Continuing to try to debug this problem, have tried about everything 
>> I can to resolve the issues with http probes.  Including:
>> * Complete rebuild of the server with CentOS 4.6 and recompile of 
>> Hobbit.  Same issues.
>> * Removing everything from /etc/sysctl.conf, rebooting.  Same issues.
>> * Manipulating the httpd.conf configs on remote servers, forcing 
>> HTTP/1.0, removing ETags, creating a very simple index page to test 
>> against.  Same issues.
>> * Upgrading Apache on sample remote server to Apache 2.2.8 (most are 
>> 2.2.4).  Same issue.
>> * Recompiling Hobbit with debugging flags, to make sure the optimizer 
>> is not applied.  Same issue.
>>
>> The only two servers I have that seem to work consistently well are a 
>> pair of Apache 2.0.52 servers.  The 2.2.4+ servers all seem to give 
>> Hobbit issues.  Although, again, repeated curl or wget probe cycles 
>> against the servers from the Hobbit server never show more than a 
>> 0.2s response time.
>>
>> But, Hobbit continues to report things like:
>>
>> http://10.1.17.251/ - OK
>>
>> HTTP/1.1 200 OK
>> Date: Mon, 16 Jun 2008 00:54:03 GMT
>> Server: Apache/2.2.8 (EL)
>> Last-Modified: Sun, 15 Jun 2008 00:37:00 GMT
>> ETag: "7c809f-9b-44fa9b5806300"
>> Accept-Ranges: bytes
>> Content-Length: 155
>> Connection: close
>> Content-Type: text/html; charset=UTF-8
>>
>> Seconds:     3.00
>>
>>
>> I can't come up with anything other than Hobbit as a cause. But is 
>> there anything I can do to trace what is happening internally to get 
>> past this problem?  Any ideas at all would really be appreciated.  
>> Thanks in advance.
>> -Alan
>>
>>
>> Alan Sparks wrote:
>>> Have a new install of Hobbit (4.2, tried 4.3 snap as well) on a 
>>> fresh install of CentOS 4.6 x86_64, up to date on patches.  I have a 
>>> problem with HTTP tests on "random" web servers that I just can't 
>>> figure out.
>>>
>>> I have about 64 of my hosts in the bb-hosts on this server, and have 
>>> http tests defined for these servers.  On most of these servers, 
>>> Hobbit is reporting the "Seconds:" for the response at 3 seconds.  
>>> It seems that it is inconsistent -- one cycle to the next, the 
>>> 3-second response may move to a different set of servers.
>>>
>>> The http: tests are defined using the IP address of the server - no 
>>> server name (so no DNS lookup).
>>>
>>> I've run a loop of tests on the same URL using wget and with curl, 
>>> and used my browser and Telnet to connect to the same URL.  I 
>>> consistently get a response time of about 0.2 seconds maximum from 
>>> the servers.
>>>
>>> The bbnet entry in hobbitlaunch.cfg looks like:
>>> CMD bbtest-net --report --ping --checkresponse --debug
>>>
>>> With the debugging turned on, I see the following entries 
>>> periodically in the network test log:
>>> Address=10.1.5.17:80, open=1, res=0, err=0, connecttime=0.002965, 
>>> totaltime=3.006810,
>>> Address=10.1.5.18:80, open=1, res=0, err=0, connecttime=0.002956, 
>>> totaltime=3.007413,
>>> Address=10.1.24.67:80, open=1, res=0, err=0, connecttime=0.002860, 
>>> totaltime=3.007120,
>>>
>>> The problem does not affect the same hosts each time.  The number of 
>>> affected hosts usually differs each cycle, sometimes on the same 
>>> servers, but often on different ones.
>>>
>>> I've tried the following to see if anything will help:
>>> * Reducing the number of hosts.  If I only have a couple or three in 
>>> the bb-hosts, the problem doesn't manifest.
>>> * Recompiling.  Doesn't help.
>>> * Changing the test URL.  Doesn't help.
>>> * Adding a --concurrency= option to the launch.  If I use a 
>>> concurrency of 1, the problem does not manifest.
>>>
>>> Setting the concurrency to 1 to fix the problem isn't an option, but 
>>> makes me think something is getting really mixed up in the select() 
>>> processing in bbnet.
>>>
>>> Does anyone have any ideas how to diagnose where Hobbit is coming up 
>>> with a 3-second latency, when none of my test tools running off the 
>>> same server can duplicate the same timing?
>>>
>>> Thanks for any ideas, this is really baffling me.
>>> -Alan
>>>
>>>
>>>
>>> To unsubscribe from the hobbit list, send an e-mail to
>>> hobbit-unsubscribe at hswn.dk
>>>
>>>
>>>




