[hobbit] server fails to receive all of client message
Adam Goryachev
adam at websitemanagers.com.au
Tue Dec 16 06:54:03 CET 2008
Rodolfo Pilas wrote:
> Adam, take a look at:
>
> http://en.wikibooks.org/wiki/System_Monitoring_with_Hobbit/FAQ#Q._How_do_I_fix_.22Oversize_status_msg_from_192.168.1.31_for_test.my.com:ports_truncated_.28n.3D508634.2C_limit.3D262144.29.22
I've checked that, and it does not seem to be the problem.
All the original info is below. A quick recap: my hobbit server doesn't
receive the complete client data, and so procs (and sometimes ports) goes
red because the tail of the data is missing and the monitored procs are
not found.
Currently, the bbproxy server (which is also running a hobbit server on
127.0.0.1) shows green for procs for itself (the bbclient running
locally), which means the bbclient passed a message to 10.30.10.9
(bbproxy), which passed the message to 127.0.0.1 (hobbit), which
displayed it correctly.
The same bbproxy failed to send that message intact to the remote hobbit
server, which shows the following as the "Client Data" when I click at
the bottom of the red procs page:
client mail,servername,com,au.linux linux
[date]
Tue Dec 16 15:56:14 EST 2008
[uname]
Linux mail 2.6.18-6-686 i686
[osversion]
Debian 4.0
[uptime]
15:56:14 up 11 days, 6:53, 1 user, load average: 0.22, 0.39, 0.43
[who]
userag pts/0 Dec 16 13:29 (123.123.123.12.static.net.au)
[df]
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/md0 9775120 7792092 1983028 80% /
/dev/md1 146002196 23881268 122120928 17% /home
[mount]
/dev/md0 on / type reiserfs (rw,notail)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
procbususb on /proc/bus/usb type usbfs (rw)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
/dev/md1 on /home type reiserfs (rw)
//ptserver/Shared$ on /mnt/ptserver/shared type smbfs (ro)
//ptserver/Shared$ on /mnt/ptserver/sharedrw type smbfs (rw)
[free]
total used free shared buffers cached
Mem: 3112548 2735760 376788 0 451828 1188956
-/+ buffers/cache: 1094976 2017572
Swap: 1012072 88 1011984
[ifconfig]
eth0 Link encap:Ethernet HWaddr 00:13:20:5F:EC:F3
inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::213:20ff:fe5f:ecf3/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:14847337 errors:0 dropped:0 overruns:0 frame:0
TX packets:15440612 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:795753357 (758.8 MiB) TX bytes:4174004073 (3.8 GiB)
eth1 Link encap:Ethernet HWaddr 00:09:5B:1A:16:26
inet addr:10.30.10.9 Bcast:10.30.15.255 Mask:255.255.240.0
inet6 addr: fe80::209:5bff:fe1a:1626/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:18232874 errors:0 dropped:0 overruns:0 frame:0
TX packets:25222742 errors:7 dropped:0 overruns:7 carrier:7
collisions:0 txqueuelen:1000
RX bytes:771706320 (735.9 MiB) TX bytes:2584475027 (2.4 GiB)
Interrupt:74 Base address:0xc000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:7545664 errors:0 dropped:0 overruns:0 frame:0
TX packets:7545664 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3978669335 (3.7 GiB) TX bytes:3978669335 (3.7 GiB)
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:10.30.99.1 P-t-P:10.30.99.2 Mask:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:1500 Metric:1
RX packets:767727 errors:0 dropped:0 overruns:0 frame:0
TX packets:873992 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:57714836 (55.0 MiB) TX bytes:210654878 (200.8 MiB)
[route]
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
10.30.99.2 0.0.0.0 255.255.255.255 UH 0 0 0 tun0
10.30.99.0 10.30.99.2 255.255.255.0 UG 0 0 0 tun0
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
10.30.0.0 0.0.0.0 255.255.240.0 U 0 0 0 eth1
10.30.0.0 10.30.10.254 255.255.0.0 UG 0 0 0 eth1
0.0.0.0 192.168.1.1 0.0.0.0 UG 0 0 0 eth0
[netstat]
Ip:
40747731 total packets received
138429 with invalid addresses
1633704 forwarded
1 with unknown protocol
0 incoming packets discarded
38950970 incoming packets delivered
50359204 requests sent out
15 outgoing packets dropped
1 fragments dropped after timeout
1831 reassemblies required
915 packets reassembled ok
1 packet reassembles failed
Icmp:
1657334 ICMP messages received
3902 input ICMP message failed.
ICMP input histogram:
destination unreachable: 1566573
timeout in transit: 54
redirects: 123
echo requests: 26880
echo replies: 61402
139635 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 112755
echo replies: 26880
Tcp:
1669446 active connections openings
1247657 passive connection openings
361872 failed connection attempts
132167 connection resets received
96 connections established
33096061 segments received
40931113 segments send out
715004 segments retransmited
0 bad segments received.
125842 resets sent
Udp:
4173585 packets received
23298 packets to unknown port received.
62 packet receive errors
6868647 packets sent
TcpExt:
482 resets received for embryonic SYN_RECV sockets
268 packets pruned from receive queue because of socket buffer overrun
14 ICMP packets dropped because they were out-of-window
1084404 TCP sockets finished time wait in fast timer
2236 time wait sockets recycled by time stamp
354 packets rejects in established connections because of timestamp
366541 delayed acks sent
179 delayed acks further delayed because of locked socket
Quick ack mode was activated 111492 times
2510467 packets directly queued to recvmsg prequeue.
3003729 of bytes directly received from backlog
380000212 of bytes directly received from prequeue
7194390 packet headers predicted
724319 packets header predicted and directly queued to user
8998070 acknowledgments not containing data received
5065326 predicted acknowledgments
1472 times recovered from packet loss due to fast retransmit
42465 times recovered from packet loss due to SACK data
161 bad SACKs received
Detected reordering 165 times using FACK
Detected reordering 181 times using SACK
Detected reordering 444 times using reno fast retransmit
Detected reordering 3197 times using time stamp
1530 congestion windows fully recovered
17437 congestion windows partially recovered using Hoe heuristic
TCPDSACKUndo: 155
73943 congestion windows recovered after partial ack
29880 TCP data loss events
TCPLostRetransmit: 42
541 timeouts after reno fast retransmit
60499 timeouts after SACK recovery
11423 timeouts in loss state
84462 fast retransmits
6777 forward retransmits
45329 retransmits in slow start
261506 other TCP timeouts
TCPRenoRecoveryFail: 203
14386 sack retransmits failed
2738 times receiver scheduled too late for direct processing
13320 packets collapsed in receive queue due to low socket buffer
89544 DSACKs sent for old packets
1294 DSACKs sent for out of order packets
21665 DSACKs received
446 DSACKs for out of order packets received
34120 connections reset due to unexpected data
3110 connections reset due to early user close
34312 connections aborted due to timeout
[ports]
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:20000 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:1984 0.0.0.0:* LISTEN
tcp 0 0 10.30.10.9:1984 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:20002 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:55555 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:37 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:389 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:21000 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:873 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:9 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:139 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:13 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:783 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN
tcp 0 0 10.30.10.9:8080 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:113 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:53 0.0.0.0:* LISTEN
tcp 0 0 10.30.10.9:1080 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:5432 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:25 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:445 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:14238 0.0.0.0:* LISTEN
tcp 0 0 192.168.1.2:56464 123.123.32.80:80 ESTABLISHED
tcp 0 0 192.168.1.2:51389 123.123.176.189:80 ESTABLISHED
tcp 0 0 192.168.1.2:52865 123.123.28.123:80 ESTABLISHED
tcp 0 0 10.30.10.9:8080 10.30.10.16:1028 TIME_WAIT
tcp 0 0 192.168.1.2:40408 123.213.88.51:80 ESTABLISHED
tcp 0 0 192.168.1.2:53207 123.123.50.28:443 ESTABLISHED
tcp 0 0 10.30.10.9:33905 10.30.10.1:139 ESTABLISHED
tcp 0 0 192.168.1.2:49885 123.123.28.124:80 ESTABLISHED
tcp 0 0 192.168.1.2:53071 123.123.176.176:80 ESTABLISHED
tcp 0 0 127.0.0.1:5432 127.0.0.1:53017 ESTABLISHED
tcp 0 0 192.168.1.2:25 123.123.132.183:57226 TIME_WAIT
tcp 0 0 10.30.10.9:8080 10.30.10.14:2826 ESTABLISHED
tcp 0 0 127.0.0.1:54646 127.0.0.1:1984 TIME_WAIT
tcp 0 0 127.0.0.1:54645 127.0.0.1:1984 TIME_WAIT
tcp 0 0 127.0.0.1:54655 127.0.0.1:1984 TIME_WAIT
tcp 0 0 127.0.0.1:54639 127.0.0.1:1984 TIME_WAIT
tcp 0 0 10.30.10.9:8080 10.30.10.14:2829 ESTABLISHED
tcp 0 0 127.0.0.1:54664 127.0.0.1:1984 TIME_WAIT
tcp 0 0 127.0.0.1:54665 127.0.0.1:1984 TIME_WAIT
tcp 0 0 10.30.10.9:8080 10.30.10.14:2830 ESTABLISHED
tcp 0 0 192.168.1.2:41020 123.123.38.94:80 ESTABLISHED
tcp 0 0 10.30.10.9:8080 10.30.10.16:1050 ESTABLISHED
tcp 0 0 192.168.1.2:35851 123.123.88.59:80 TIME_WAIT
tcp 0 0 10.30.10.9:8080 10.30.10.16:1051 ESTABLISHED
tcp 0 0 192.168.1.2:47514 123.123.88.51:80 ESTABLISHED
tcp 0 0 192.168.1.2:37393 123.123.176.45:443 ESTABLISHED
tcp 0 0 10.30.10.9:49509 10.30.10.9:995 ESTABLISHED
tcp 0 0 192.168.1.2:38569 123.123.15.124:80 ESTABLISHED
tcp 0 0 10.30.10.9:8080 10.30.10.16:1068 ESTABLISHED
tcp 0 1640 192.168.1.2:40677 123.213.176.45:443 ESTABLISHED
tcp 0 1 192.168.1.2:59642 123.123.176.176:80 LAST_ACK
tcp 0 0 10.30.10.9:8080 10.30.10.16:1066 ESTABLISHED
tcp 0 0 192.168.1.2:46183 123.123.88.51:80 ESTABLISHED
tcp 0 0 10.30.10.9:8080 10.30.10.16:1067 ESTABLISHED
tcp 0 1640 192.168.1.2:32966 123.123.176.45:443 ESTABLISHED
tcp 0 0 192.168.1.2:41246 123.123.88.56:80 ESTABLISHED
tcp 0 0 192.168.1.2:35872 123.123.38.94:80 TIME_WAIT
tcp 0 0 10.30.10.9:139 10.30.10.1:3133 ESTABLISHED
tcp 0 0 10.30.10.9:47825 10.30.10.1:139 TIME_WAIT
tcp 0 0 10.30.10.9:47828 10.30.10.1:139 TIME_WAIT
tcp 0 0 192.168.1.2:33758 123.123.38.94:80 TIME_WAIT
tcp 0 0 10.30.10.9:8080 10.30.10.17:3132 ESTABLISHED
tcp
Note: this is clearly truncated mid-line and mid-report!
So, the procs test on the remote hobbit shows red, and alerts, etc.
The interesting thing to note this time is that the actual procs report
showed most of the procs were actually found, with some (truncated) ps
output data. That means hobbit actually had more data than is displayed
on the Client Data page...
If anyone can advise how to resolve this, I would be exceptionally keen
to hear about it.
An upgrade to the bandwidth at one end is scheduled in two weeks' time,
which may help this site, but I am still seeing the same problem with
other hosts in other locations.
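To convince myself the bytes really do go missing in transit, one quick check
would be to compare byte counts at each hop: how much the client/proxy sends
versus how much the receiving socket actually gets before the connection
closes. A purely illustrative stand-in (not a Hobbit tool; the port and
payload size below are arbitrary):

```python
import socket
import threading

received = {"n": 0}

def listener(srv):
    """Accept one connection and count the bytes that actually arrive."""
    conn, _ = srv.accept()
    while True:
        chunk = conn.recv(65536)
        if not chunk:  # sender closed the connection
            break
        received["n"] += len(chunk)
    conn.close()

# Stand-in for the hobbitd listener on the server side.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))  # any free port; real hobbitd uses 1984
srv.listen(1)
t = threading.Thread(target=listener, args=(srv,))
t.start()

payload = b"x" * 500_000  # roughly the size of a large client report
cli = socket.create_connection(srv.getsockname())
cli.sendall(payload)
cli.close()
t.join()
srv.close()

print(f"sent {len(payload)} bytes, received {received['n']} bytes")
```

Over loopback the two counts match; pointed at the real server across the
slow link, a shortfall on the receiving side would confirm the truncation
happens in transit rather than in hobbitd's parsing.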
Two additional questions:
1) Could some sort of checksum/verification marker be added to the end
of the client data, so that if the server doesn't see the end marker it
discards the entire message?
2) Could the client data be compressed?
Never mind, I'll post this as a separate email so it can get a little
better visibility...
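To illustrate what I mean by question 1 (nothing like this exists in the
Hobbit protocol today; the [sha1] and [end] section names below are made
up for the sketch), a trailing checksum plus an end marker would let the
server detect and discard a truncated message, and zlib shows how well
this kind of text report would compress (question 2):

```python
import hashlib
import zlib

MARKER = b"\n[end]\n"  # hypothetical end-of-message marker

def seal(msg: bytes) -> bytes:
    """Append a checksum section and an end marker so truncation is detectable."""
    digest = hashlib.sha1(msg).hexdigest().encode()
    return msg + b"\n[sha1]\n" + digest + MARKER

def verify(wire: bytes) -> bytes:
    """Return the original message, or raise if it was truncated/corrupted."""
    if not wire.endswith(MARKER):
        raise ValueError("end marker missing: message truncated, discard it")
    body, sep, digest = wire[: -len(MARKER)].rpartition(b"\n[sha1]\n")
    if not sep or hashlib.sha1(body).hexdigest().encode() != digest:
        raise ValueError("checksum mismatch: message corrupted, discard it")
    return body

# A stand-in for a large client report.
report = b"client mail.example.linux linux\n[date]\n..." * 1000
assert verify(seal(report)) == report

# A truncated message is now rejected instead of silently turning procs red:
try:
    verify(seal(report)[:1000])
except ValueError as e:
    print(e)

# Question 2: these reports are repetitive text and compress very well.
print(f"{len(report)} -> {len(zlib.compress(report))} bytes")
```

The receiver discards anything without a valid trailer, so a partial
message produces a stale-data condition rather than a false red.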
Thanks,
Adam
> Adam Goryachev escribió:
>> Adam Goryachev wrote:
>>> Anyway, the problem is that, approximately since then, a number of client
>>> reports are not completely received. Sometimes some of the ps output is
>>> truncated, sometimes the ports section is truncated, etc. This leads to
>>> false positive alerts (i.e., procs goes red because some monitored procs
>>> appear not to be running, since they were listed after the truncated section).
>>
>>> I've increased the timeout on hobbitd (--timeout=60), but this
>>> doesn't seem to have helped. The only common factors among the clients
>>> which have this problem are:
>>
>>> 1) Most of them are running bbproxy and passing status messages from a
>>> number of clients.
>>> 2) The rest of them are on very slow connections, or frequently very
>>> busy connections.
>>
>>
>> I have made some 'progress' of sorts.
>>
>> I've increased the MAX values as I was getting some "Oversize ...
>> truncated" messages in my log file. I then went home thinking "Great, I
>> managed to solve this one thing today at least". Except, I started
>> getting messages a few hours later.
>>
>> So after further investigation, I've decided I really can't work out
>> what is happening, and why it isn't working. I've enabled debug output
>> from bbproxy, but I don't really know what it all means.
>>
>> I can see that if I set bbproxy to only forward messages to 127.0.0.1
>> the local hobbit server gets all the data correctly. If I add the remote
>> server, then some things don't work properly. Since it is likely all a
>> big jumbled mess by now, I'll post a few sections of config files, and
>> hopefully someone will notice my stupid mistake (or multiple mistakes)...
>>
>> I have a network 10.x.x.x which has a hobbit server at 10.30.10.9; all
>> client machines report to 10.30.10.9 as the BBDISPLAY/BBPAGER (most are
>> Windows PCs using the BB Windows client), one is a linux hobbit-client,
>> and of course 10.30.10.9 is a hobbit client itself (plus a couple of old
>> ext scripts using the old BB env). I think all of this is working fine,
>> since nothing goes randomly purple/red.
>>
>> 10.30.10.9 is behind NAT but has complete access to the internet.
>>
>> I have a remote server behind a NAT router which has port 1984
>> forwarded to it. It is receiving reports from around 20 other hobbit
>> client machines perfectly, so I don't suspect the NAT router or the
>> hobbit config itself.
>>
>> Some config from 10.30.10.9:
>>
>> hobbitserver.cfg:
>> BBSERVERIP="127.0.0.1"
>> BBDISP="127.0.0.1"
>> BBDISPLAYS=""
>> MAXLINE="32768"
>>
>> hobbitclient.cfg
>> BBDISP="10.30.10.9"
>> BBDISPLAYS=""
>> BB="$BBHOME/bin/bb --debug --timeout=60"
>> MAXLINE="32768"
>>
>> hobbitlaunch.cfg
>> [hobbitd]
>> ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
>> CMD hobbitd --pidfile=$BBSERVERLOGS/hobbitd.pid
>> --restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk
>> --checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log
>> --admin-senders=127.0.0.1,$BBSERVERIP --store-clientlogs=!msgs
>> --listen=127.0.0.1
>>
>>
>> [bbproxy]
>> ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
>> CMD $BBHOME/bin/bbproxy --hobbitd
>> --bbdisplay=123.234.456.567,127.0.0.1 --listen=10.30.10.9
>> --report=$MACHINE.bbproxy --no-daemon --timeout=30
>> --pidfile=$BBSERVERLOGS/bbproxy.pid --debug --log-details
>> CMD $BBHOME/bin/bbproxy --hobbitd --bbdisplay=127.0.0.1
>> --listen=10.30.10.9 --report=$MACHINE.bbproxy --no-daemon --timeout=30
>> --pidfile=$BBSERVERLOGS/bbproxy.pid --debug --log-details
>> LOGFILE $BBSERVERLOGS/bbproxy.log
>>
>> [hobbitclient]
>> ENVFILE /usr/lib/hobbit/client/etc/hobbitclient.cfg
>> NEEDS hobbitd
>> CMD /usr/lib/hobbit/client/bin/hobbitclient.sh
>> LOGFILE $BBSERVERLOGS/hobbitclient.log
>> INTERVAL 5m
>>
>>
>> On the remote hobbit server with the public IP I have:
>> hobbitserver.cfg
>> BBSERVERIP="192.168.2.6"
>> BBDISP="192.168.2.6"
>> BBDISPLAYS=""
>> MAXLINE="32768"
>> MAXMSG_STATUS="1024"
>> MAXMSG_CLIENT="1024"
>> MAXMSG_DATA="512"
>>
>> hobbitlaunch.cfg
>> [hobbitd]
>> HEARTBEAT
>> ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
>> CMD hobbitd --pidfile=$BBSERVERLOGS/hobbitd.pid
>> --restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk
>> --checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log
>> --admin-senders=127.0.0.1,$BBSERVERIP
>> --maint-senders=127.0.0.1,$BBSERVERIP --www-senders=127.0.0.1,$BBSERVERIP
>> --store-clientlogs=!msgs --timeout=60
>>
>> Any suggestions as to what is going wrong would be really appreciated.
>>
>> BTW, bbnet tests from the 10.30.10.9 host are not submitted to the
>> bbproxy at all because of the BBDISP setting in hobbitserver.cfg, but
>> if I change that to point to 10.30.10.9 it seems to break the web
>> interface. I'm not really too concerned about this right now though...
>>
>> Thanks for any tips/pointers/etc
>>
>> Regards,
>> Adam
--
Adam Goryachev
Website Managers
Ph: +61 2 8304 0000 adam at websitemanagers.com.au
Fax: +61 2 8304 0001 www.websitemanagers.com.au