[hobbit] Serious hobbit problem, client data truncated by server

Wed Jun 6 14:01:36 CEST 2007

> -----Original Message-----
> From: Gore, David W (David) 
> Sent: Monday, June 04, 2007 17:52
> To: hobbit at hswn.dk
> Subject: [hobbit] Serious hobbit problem, client data truncated by server
> 
> I have a very serious hobbit problem.  Our hobbit has been working very
> well for more than a year.  I have rolled back some config files
> (bb-hosts, client-local.cfg, and hobbit-clients.cfg) in the hope that one
> of them had a typo causing hobbit to act erratically.  Unfortunately, no
> luck.
> 
> So what is the problem?  As some of you may know, the client sends
> msg.<host>.txt, and you can see this file on the server or on the web
> page via the 'Client data' link.  Unfortunately, the hobbit server is
> truncating the '[ps]' listing, which means you lose all the other entries
> after '[ps]' and also start alarming on missing processes.
> 
> Alarming and paging out the on-call on missing processes in the middle
> of the night and creating bogus tickets is very bad.  There isn't too
> much in the logs, but we do have something.
> 
> Starting on June 02 we got this in bb-display.log:
> 
> 2007-06-04 12:21:07 Whoops ! bb failed to send message - timeout
> 2007-06-04 12:21:07 hobbitd status-board not available
> 2007-06-04 14:21:47 Whoops ! bb failed to send message - timeout
> 2007-06-04 14:21:47 hobbitd status-board not available
> 2007-06-04 15:02:02 Whoops ! bb failed to send message - timeout
> 2007-06-04 15:02:02 hobbitd status-board not available
> 
> Any ideas?  Henrik?  
> 
> 
> Oh and of course the message size is more than adequate to handle the
> data. We have many hosts that send 2-3 times more data on average and
> nothing has changed on the client.
>  

Missing info from my first post: 

Dual hobbit servers running Hobbit 4.2.0 w/allinone patch on Fedora Core
5 on Dell Optiplex GX620 dual core P4 3.2GHz w/1G of memory each

Here is the solution:

Here is what my netstat looks like on the problem host (hobbit1):

[root@hobbit1 ~]# netstat -in
Kernel Interface table
Iface   MTU Met    RX-OK   RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 39784213 17711963    383      0 41615768      0      0      0 BMRU
lo    16436   0 10299557        0      0      0 10299557      0      0      0 LRU

Obviously not good: eth0 shows 17,711,963 receive errors (RX-ERR) against
39,784,213 packets received OK.  This host WAS our primary hobbit server
with a 10Mb/s half-duplex connection.  Our secondary hobbit server runs at
100Mb/s full-duplex.
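
For anyone who wants to watch for this kind of thing, here is a rough
sketch of how the error counters could be checked from a script.  It is
not part of Hobbit; the function name rx_error_ratios and the 1%
threshold are my own assumptions, and it expects each interface on a
single unwrapped line, as Linux `netstat -in` prints it.

```python
# Sketch only: computes RX-ERR per RX-OK packet from `netstat -in` text.
# The function name and the 1% threshold are illustrative assumptions.

SAMPLE = """\
Kernel Interface table
Iface   MTU Met    RX-OK   RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 39784213 17711963    383      0 41615768      0      0      0 BMRU
lo    16436   0 10299557        0      0      0 10299557      0      0      0 LRU
"""


def rx_error_ratios(netstat_output):
    """Return {interface: RX-ERR / RX-OK} for every row under the header."""
    ratios = {}
    header = None
    for line in netstat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Iface":
            # Remember the column names so values can be looked up by name.
            header = fields
            continue
        if header is None or len(fields) != len(header):
            # Skip the title line and anything that is not a full data row.
            continue
        row = dict(zip(header, fields))
        rx_ok = int(row["RX-OK"])
        rx_err = int(row["RX-ERR"])
        ratios[row["Iface"]] = rx_err / rx_ok if rx_ok else 0.0
    return ratios


if __name__ == "__main__":
    for iface, ratio in rx_error_ratios(SAMPLE).items():
        flag = "  <-- check speed/duplex" if ratio > 0.01 else ""
        print(f"{iface}: {ratio:.1%} RX errors per good packet{flag}")
```

In practice you would feed it the output of `netstat -in` from the host
itself rather than the sample pasted above.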

Additionally, I had this setting in hobbitserver.cfg:

BBDISP="0.0.0.0"               # IP of a single hobbit/bbd server
BBDISPLAYS="hobbit1 hobbit2"   # IP of multiple hobbit/bbd servers.
                               # If used, BBDISP must be 0.0.0.0

I am not sure what that does with parallel servers, but it does appear
to tie them together, and perhaps this causes problems when one is down.
This setting was probably made more than a year ago, and perhaps it just
took time to catch up with us, combined with the slow network connection.
I am really not sure why you would do what I did or what its effect is,
but I have changed both servers to this setting:

BBDISP="$BBSERVERIP"           # IP of a single hobbit/bbd server
BBDISPLAYS=""                  # IP of multiple hobbit/bbd servers.
                               # If used, BBDISP must be 0.0.0.0
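
The one rule the comments in hobbitserver.cfg do spell out is that BBDISP
must be 0.0.0.0 whenever BBDISPLAYS is used.  A small sketch of a sanity
check for that rule follows.  The helper names parse_settings and
check_display_settings are hypothetical, the parser only understands
simple NAME="value" lines, and the second check (no server configured at
all) is my own assumption rather than something the comments state.

```python
import re

# Sketch only: checks the BBDISP/BBDISPLAYS rule quoted in the config comments.
# parse_settings and check_display_settings are hypothetical helper names.


def parse_settings(text):
    """Collect NAME="value" assignments; trailing comments are ignored."""
    settings = {}
    for line in text.splitlines():
        match = re.match(r'\s*([A-Z_]+)="([^"]*)"', line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def check_display_settings(settings):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    bbdisp = settings.get("BBDISP", "")
    displays = settings.get("BBDISPLAYS", "").split()
    # Rule taken from the config comment: multiple servers need BBDISP=0.0.0.0.
    if displays and bbdisp != "0.0.0.0":
        problems.append("BBDISPLAYS is set, so BBDISP must be 0.0.0.0")
    # Assumption (not from the config comment): one of the two must name a server.
    if not displays and bbdisp in ("", "0.0.0.0"):
        problems.append("neither BBDISP nor BBDISPLAYS names a server")
    return problems


if __name__ == "__main__":
    new_config = 'BBDISP="$BBSERVERIP"\nBBDISPLAYS=""\n'
    print(check_display_settings(parse_settings(new_config)))
```

Both the old and the new settings shown above pass this check, so it
would not have caught my problem; it only guards against mixing the two
styles.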

I have also moved the primary server to the 100Mb/s full-duplex
connection and have plans to move the secondary to a 100Mb/s full-duplex
connection.

So far everything appears to be running fine.  Perhaps now when I do
multi-host clientupdates it will work as expected.  This slow connection
could easily explain why some messages from the client were being
truncated on the server and why clientupdates also caused problems.  I
should have also found the problem a lot sooner.

Thanks go to Henrik and Mike Rowell for helping.

~David