[hobbit] server fails to receive all of client message

Adam Goryachev mailinglists at websitemanagers.com.au
Fri May 9 07:22:20 CEST 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Adam Goryachev wrote:
> Adam Goryachev wrote:
>> Anyway, the problem is that approximately since then, a number of client
>> reports are not completely received. Sometimes some of the ps output is
>> truncated, sometimes the ports section is truncated, etc. This leads to
>> false positive alerts (ie, procs goes red because some monitored procs
>> are not running since they were after the truncated section).
> 
>> I've increased the timeout on the hobbitd (--timeout=60) but this
>> doesn't seem to have helped. The only common factor between the clients
>> which have this problem are:
> 
>> 1) Most of them are running bbproxy and passing status messages from a
>> number of clients.
>> 2) The rest of them are on very slow connections, or frequently very
>> busy connections.

I've made some possible progress, but I still don't really know how to
approach this problem or how to solve it.

Basically, I used tcpdump to capture all traffic sent to port 1984 on my
local server. I then used wireshark to analyse the data and find the
specific stream of packets that led to hobbit raising a red alert due
to a truncated client report.

It now seems to point toward some sort of transport 'problem' in that I
get a number of 'errors' such as "TCP Previous segment lost" and "TCP
Dup ACK" and "TCP Retransmission" and the final packet is a "RST" which
I assume is when you would normally get a "Connection reset by peer"
type error.
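
For anyone who wants to check their own capture without clicking through
wireshark, here is a rough sketch of how the RST packets could be picked
out of a capture file. It is only an illustration, not anything that
ships with hobbit: the function name find_resets is made up, and it
assumes a classic pcap file (not pcapng, not the nanosecond variant),
an Ethernet link layer, IPv4, and no VLAN tags.

```python
import struct

TCP_RST = 0x04  # RST bit in the TCP flags byte


def find_resets(pcap_bytes, port=1984):
    """Return 1-based packet numbers of TCP RST segments to or from `port`.

    Assumes classic pcap, Ethernet framing, IPv4, and no VLAN tags.
    """
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # capture written on a little-endian host
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # capture written on a big-endian host
    else:
        raise ValueError("not a classic pcap file")

    offset = 24  # skip the 24-byte global header
    number = 0
    resets = []
    while offset + 16 <= len(pcap_bytes):
        # Per-packet header: ts_sec, ts_usec, captured length, original length
        _sec, _usec, caplen, _origlen = struct.unpack_from(
            endian + "IIII", pcap_bytes, offset)
        offset += 16
        frame = pcap_bytes[offset:offset + caplen]
        offset += caplen
        number += 1

        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue  # too short, or EtherType is not IPv4
        if frame[14 + 9] != 6:
            continue  # IP protocol field is not TCP
        ip_header_len = (frame[14] & 0x0F) * 4
        tcp = frame[14 + ip_header_len:]
        if len(tcp) < 14:
            continue  # TCP header truncated before the flags byte
        sport, dport = struct.unpack_from("!HH", tcp, 0)
        if port in (sport, dport) and tcp[13] & TCP_RST:
            resets.append(number)
    return resets
```

It would be called with the raw bytes of the capture, for example
find_resets(open("hobbit.pcap", "rb").read()), where hobbit.pcap is a
placeholder file name. The numbers it returns count packets from 1 in
file order, so they should line up with wireshark's packet numbers as
long as no display filter is hiding packets.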

I would love to publish the trace, but I don't know how to obfuscate its
contents to conceal some of the details (ie, the contents of the hobbit
client status that was being reported).

However, I do have the following questions:
1) If the connection died due to an error, why does hobbit still use the
contents of what it received? (Is it a case of better to know half the
information than none, or can we not tell the difference between a
connection closed due to an error and a connection closed at the end of
the transmission?)

2) From what I know, TCP is meant to be fairly robust in the face of
lost packets and other errors. The fact that I am seeing these sorts of
failures makes me worry that my network is unhappy in some way. Yet,
from a user experience point of view, everything seems to be working
perfectly (ie, web browsing, ssh connections, etc).

BTW, the network connection is quite busy during the times when these
errors happen due to remote backups being done at those times. Could
that be the cause of the problem?
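
On question 1, my understanding of sockets in general is that a reader
can tell the two cases apart: an orderly close shows up as a zero-length
read, while a reset shows up as an error from the read call. I don't
know what hobbitd itself does here, so the following is only a sketch of
the general technique. The names read_message and send_then_close are
invented for the illustration, and the SO_LINGER trick used to force a
RST assumes a POSIX-style socket layer.

```python
import socket
import struct


def read_message(conn):
    """Read until the peer closes. Returns (data, clean_close)."""
    chunks = []
    try:
        while True:
            data = conn.recv(4096)
            if not data:
                # Zero-length read: the peer sent FIN, so the message is complete.
                return b"".join(chunks), True
            chunks.append(data)
    except ConnectionResetError:
        # The peer (or something in between) sent RST: whatever was read
        # so far may be only part of the message.
        return b"".join(chunks), False


def send_then_close(abort):
    """Hypothetical test harness: send one message over loopback, then
    close the sending side either cleanly (FIN) or abruptly (RST)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    conn, _addr = listener.accept()
    client.sendall(b"client report part 1\n")
    if abort:
        # Linger enabled with a zero timeout makes close() send RST, not FIN.
        client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.pack("ii", 1, 0))
    client.close()
    try:
        return read_message(conn)
    finally:
        conn.close()
        listener.close()
```

With something like this, a server could choose to discard or flag a
message whose second return value is False instead of treating it as a
complete report. Whether data received before the reset is still
readable depends on the operating system, so the sketch only relies on
the clean/not-clean flag.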

Any comments, suggestions, etc, would be greatly appreciated.

Regards,
Adam

- --
Adam Goryachev
Website Managers
www.websitemanagers.com.au
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFII9+MGyoxogrTyiURAoZ5AJ4uwxQMAIuEvF32XWxZuBPqBU3bYQCfYtVy
T4RIJ40hdntCZtTIXRouCtY=
=Begp
-----END PGP SIGNATURE-----


