[hobbit] Help! bbtest-net gets http test timing wrong

Alan Sparks asparks at doublesparks.net
Wed Jul 23 23:52:50 CEST 2008


So I've been trying for a month to get this working, to no avail, and
have pretty much exhausted everything to figure out why Hobbit randomly
gets 3-second return times on HTTP tests.

I'm even willing to call it a kernel problem, a problem with select() -- 
but it happens in multiple reasonably-contemporary kernels.
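
One way to test the select() theory outside of bbtest-net is to time a
single select() call directly. The following is a minimal sketch, not
Hobbit code: wait_readable() is a hypothetical helper that reports how
long select() blocked before the descriptor became readable.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not part of Hobbit): block in select() until fd
 * is readable or timeout_sec expires.  Returns the time spent inside
 * select() in seconds, or -1.0 on timeout or error. */
static double wait_readable(int fd, int timeout_sec)
{
    struct timespec start, end;
    struct timeval tv;
    fd_set rfds;
    int n;

    assert(fd >= 0 && fd < FD_SETSIZE);

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    /* Monotonic clock, so wall-clock adjustments do not skew the result */
    clock_gettime(CLOCK_MONOTONIC, &start);
    n = select(fd + 1, &rfds, NULL, NULL, &tv);
    clock_gettime(CLOCK_MONOTONIC, &end);

    if (n <= 0)
        return -1.0;

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling it on a connected socket right after sending the HTTP request,
with a timeout longer than 3 seconds, would give a number to compare
against the tcpdump timestamps. If a standalone probe like this also
reports about 3 seconds when the server answered in milliseconds, that
would suggest looking below Hobbit rather than at it.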

I've tried:
* CentOS 4.6 (my standard build), CentOS 4.6 out of the box, CentOS 5.1
out of the box -- all have the same problem.
* Disabled ARES.  Checked my DNS servers, they are answering fast.  
Besides, the tests are against IP addresses, not host names.
* Removed all sysctl settings, let them default.  No change.
* Experimented with concurrency settings on [bbnet].  Doesn't help.
* Ran tcpdumps between the web servers and Hobbit server.  The tcpdumps 
indicate the web server is always answering immediately and sending the 
response... but the FIN packet (when Hobbit completes the test) is 
delayed 3 seconds.

This issue tends to move around from host to host.  It seems to affect 
Web servers that are sending static HTML pages, and all of them less 
than 4000 bytes (many about 155 bytes).

As before, testing with other tools on the box shows no issues with
network connectivity or responses against the same servers.

The /only/ thing that has come close to helping is adding a setsockopt()
to bbtest-net, to set the receive buffer to 1024 bytes.  This seems to
override something that helps the select() call return better somehow.  
It's not a reliable or even sensible solution.  It also does not work on 
the 2.6.18 kernel on CentOS 5.1...
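
For reference, the receive-buffer experiment can be reproduced outside of
bbtest-net. This is a minimal sketch under stated assumptions:
set_rcvbuf() is a hypothetical helper, not Hobbit code. It requests a
size with SO_RCVBUF and reads back what the kernel actually applied. On
Linux, socket(7) documents that the kernel doubles the requested value
and enforces a minimum, so the value read back normally differs from the
value requested.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not part of bbtest-net): request a receive
 * buffer of `requested` bytes on fd and return the size the kernel
 * reports afterwards, or -1 on error. */
static int set_rcvbuf(int fd, int requested)
{
    int actual = 0;
    socklen_t len = sizeof(actual);

    assert(fd >= 0);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof(requested)) < 0)
        return -1;

    /* Read the value back; it may not equal `requested` */
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) < 0)
        return -1;

    return actual;
}
```

Calling set_rcvbuf(fd, 1024) on a fresh TCP socket before connect() and
logging the result shows which buffer size each kernel really used,
which may help when comparing the 2.6.9 and 2.6.18 behaviour.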

I'm really stumped and at the end of the rope on this.  Has anyone had 
anything that looks like this?
-Alan

Alan Sparks wrote:
> Does exactly the same thing on a fresh install of CentOS 5, x86_64. 
> All built by hand.
> -Alan
>
> Alan Sparks wrote:
>> I see where the problem seems to be occurring.  But for my life I 
>> can't understand why.
>>
>> Packet traces from the Hobbit server and the Web servers showing the
>> 3-second delays show that Hobbit connects, and gets an immediate
>> answer from the server (milliseconds).  But the servers show that
>> Hobbit does not close the connection (the FIN packet and its ACK) for
>> 3 seconds.
>>
>> Looking at the bb-network debugging logging, I see that the select() 
>> call sleeps for 3 seconds before returning in these cases.  So the 
>> only conclusion I can arrive at is that select() doesn't return with 
>> the active file descriptors on schedule for some bizarre reason.
>>
>> For a desperation test, I forced the receive buffer on the sockets to 
>> a small number (1024 bytes):
>>     if (sockok) {
>>         int size = 1024;
>>         res = setsockopt(nextinqueue->fd, SOL_SOCKET, SO_RCVBUF,
>>                          &size, sizeof(size));
>>
>> This sort of works.  The select() no longer hangs, and the HTTP tests
>> start returning "normal"-ish results, i.e. numbers that match curl
>> and wget statistics.
>>
>> But, it messes with numbers for other Web servers, the ones that 
>> return a page significantly larger than 1024 bytes.
>>
>> Like I said, I just can't get it.  Hobbit or CentOS?  There's nothing
>> odd about this build, a generic CentOS 4.6 x86_64 build, Hobbit 4.2
>> with allinone patch, built for x86_64.
>>
>> Any suggestions at all?  If this isn't the right place to ask, where 
>> would be?  I can't get my hands around why the only thing that I 
>> can't get to work here is Hobbit...
>>
>> Thanks for your indulgence.  I really wish I could fix this.
>> -Alan
>>
>>
>> Alan Sparks wrote:
>>> After some Googling, I have added "AcceptFilter http none" 
>>> directives to the Apache 2.2 servers, which hasn't really helped 
>>> anything...
>>>
>>> Perhaps I should ask:  Can anyone verify Hobbit works correctly on a
>>> 64-bit system?  Not should, but does, on a CentOS 4 or RHEL 4 x86_64
>>> install?
>>>
>>> I see a lot of debugging trace stuff (dbgprint calls) in the contest 
>>> and httptest code.  Can anyone tell me how to enable it to trace 
>>> what Hobbit is doing?
>>>
>>> Am really at a loss.  This can't be rocket science to get it to 
>>> probe HTTP correctly.  But a week later, I still cannot get it to 
>>> match any other monitoring tool's results.
>>> -Alan
>>>
>>> Alan Sparks wrote:
>>>> tcpdumps show a couple of interesting points.
>>>>
>>>> 1) There are definitely no DNS lookups occurring as a consequence 
>>>> of the Hobbit probes.  No port 53 traffic out.
>>>>
>>>> 2) The packets from the Hobbit server, and the incoming packets to 
>>>> the Apache server, sometimes look like:
>>>>
>>>> 15:20:01.160095 IP (tos 0x0, ttl  62, id 31129, offset 0, flags 
>>>> [DF], proto 6, length: 60) hobbit.45116 > target.http: S [tcp sum 
>>>> ok] 265769416:265769416(0) win 17520 <mss 8760,sackOK,timestamp 
>>>> 143665233 0,nop,wscale 2>
>>>>
>>>> 15:20:04.159715 IP (tos 0x0, ttl  62, id 31131, offset 0, flags 
>>>> [DF], proto 6, length: 60) hobbit.45116 > target.http: S [tcp sum 
>>>> ok] 265769416:265769416(0) win 17520 <mss 8760,sackOK,timestamp 
>>>> 143668233 0,nop,wscale 2>
>>>>
>>>> 15:20:04.160223 IP (tos 0x0, ttl  62, id 31133, offset 0, flags 
>>>> [DF], proto 6, length: 40) hobbit.45116 > target.http: . [tcp sum 
>>>> ok] 265769417:265769417(0) ack 1051782089 win 17520
>>>>
>>>> So that accounts for three seconds... it appears there are 2 SYN 
>>>> packets, but the first isn't getting processed and there's a 
>>>> 3-second delay to the next SYN (which gets ACKed).  I don't know 
>>>> why this happens only with the Hobbit connections... and I don't 
>>>> know why the first SYN seems to be getting ignored.  Server is not 
>>>> at all busy.
>>>>
>>>> -Alan
>>>> Tim McCloskey wrote:
>>>>> I get that wget/curl always work.  Not sure what resolver settings 
>>>>> may be implemented differently for hobbit.
>>>>>
>>>>> Still thinking this may be unrelated to hobbit (even though 
>>>>> wget/curl work fine for you).  We have many apache boxes spanning 
>>>>> multiple networks running httpd versions 1.3, 2.0 and 2.2 that 
>>>>> hobbit (4.2 with allinone patch) likes just fine and reports
>>>>> accurate times (Seconds: 0.nn).  We also have fairly proper 
>>>>> forward and reverse DNS records for the systems involved.
>>>>>
>>>>> I can't imagine hobbit parsing the wrong response times, but if 
>>>>> that is the case I wonder what external libraries are used (not 
>>>>> hobbit provided libs, as ours parse fine and are likely the same 
>>>>> as yours).
>>>>>
>>>>> Anyway, good luck with the tcpdump.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tim
>>>>>
>>>>>
>>>>> Alan Sparks wrote:
>>>>>> UseCanonicalName is off, and HostnameLookups is off, on every
>>>>>> server, regardless of version.
>>>>>> -Alan
>>>>>>
>>>>>> Tim McCloskey wrote:
>>>>>>> What do you have for
>>>>>>> UseCanonicalName
>>>>>>> in the apache 2.0 boxes?
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> To unsubscribe from the hobbit list, send an e-mail to
>>>>> hobbit-unsubscribe at hswn.dk