[hobbit] server fails to receive all of client message
Adam Goryachev
adam at websitemanagers.com.au
Tue Dec 16 06:54:03 CET 2008
Rodolfo Pilas wrote:
> Adam, take a look at:
>
> http://en.wikibooks.org/wiki/System_Monitoring_with_Hobbit/FAQ#Q._How_do_I_fix_.22Oversize_status_msg_from_192.168.1.31_for_test.my.com:ports_truncated_.28n.3D508634.2C_limit.3D262144.29.22
I've checked that, and it does not seem to be the problem.
All the original info is below. A quick recap: my hobbit server doesn't
receive the complete client data, and so procs (and sometimes ports) goes
red because the tail of the data is missing and the monitored procs are
not found.
Currently, the bbproxy server (which is also running a hobbit server on
127.0.0.1) shows green for procs for itself (the bbclient running
locally), which means the bbclient passed a message to 10.30.10.9
(bbproxy), which passed the message to 127.0.0.1 (hobbit), which
displayed it correctly.
The same bbproxy failed to send that message intact to the remote hobbit
server, which shows the following as the "Client Data" when I click at
the bottom of the red procs page:
client mail,servername,com,au.linux linux
[date]
Tue Dec 16 15:56:14 EST 2008
[uname]
Linux mail 2.6.18-6-686 i686
[osversion]
Debian 4.0
[uptime]
15:56:14 up 11 days, 6:53, 1 user, load average: 0.22, 0.39, 0.43
[who]
userag pts/0 Dec 16 13:29 (123.123.123.12.static.net.au)
[df]
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/md0 9775120 7792092 1983028 80% /
/dev/md1 146002196 23881268 122120928 17% /home
[mount]
/dev/md0 on / type reiserfs (rw,notail)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
procbususb on /proc/bus/usb type usbfs (rw)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
/dev/md1 on /home type reiserfs (rw)
//ptserver/Shared$ on /mnt/ptserver/shared type smbfs (ro)
//ptserver/Shared$ on /mnt/ptserver/sharedrw type smbfs (rw)
[free]
total used free shared buffers cached
Mem: 3112548 2735760 376788 0 451828 1188956
-/+ buffers/cache: 1094976 2017572
Swap: 1012072 88 1011984
[ifconfig]
eth0 Link encap:Ethernet HWaddr 00:13:20:5F:EC:F3
inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::213:20ff:fe5f:ecf3/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:14847337 errors:0 dropped:0 overruns:0 frame:0
TX packets:15440612 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:795753357 (758.8 MiB) TX bytes:4174004073 (3.8 GiB)
eth1 Link encap:Ethernet HWaddr 00:09:5B:1A:16:26
inet addr:10.30.10.9 Bcast:10.30.15.255 Mask:255.255.240.0
inet6 addr: fe80::209:5bff:fe1a:1626/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:18232874 errors:0 dropped:0 overruns:0 frame:0
TX packets:25222742 errors:7 dropped:0 overruns:7 carrier:7
collisions:0 txqueuelen:1000
RX bytes:771706320 (735.9 MiB) TX bytes:2584475027 (2.4 GiB)
Interrupt:74 Base address:0xc000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:7545664 errors:0 dropped:0 overruns:0 frame:0
TX packets:7545664 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3978669335 (3.7 GiB) TX bytes:3978669335 (3.7 GiB)
tun0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
inet addr:10.30.99.1 P-t-P:10.30.99.2 Mask:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:1500 Metric:1
RX packets:767727 errors:0 dropped:0 overruns:0 frame:0
TX packets:873992 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:57714836 (55.0 MiB) TX bytes:210654878 (200.8 MiB)
[route]
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
10.30.99.2 0.0.0.0 255.255.255.255 UH 0 0 0 tun0
10.30.99.0 10.30.99.2 255.255.255.0 UG 0 0 0 tun0
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
10.30.0.0 0.0.0.0 255.255.240.0 U 0 0 0 eth1
10.30.0.0 10.30.10.254 255.255.0.0 UG 0 0 0 eth1
0.0.0.0 192.168.1.1 0.0.0.0 UG 0 0 0 eth0
[netstat]
Ip:
40747731 total packets received
138429 with invalid addresses
1633704 forwarded
1 with unknown protocol
0 incoming packets discarded
38950970 incoming packets delivered
50359204 requests sent out
15 outgoing packets dropped
1 fragments dropped after timeout
1831 reassemblies required
915 packets reassembled ok
1 packet reassembles failed
Icmp:
1657334 ICMP messages received
3902 input ICMP message failed.
ICMP input histogram:
destination unreachable: 1566573
timeout in transit: 54
redirects: 123
echo requests: 26880
echo replies: 61402
139635 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 112755
echo replies: 26880
Tcp:
1669446 active connections openings
1247657 passive connection openings
361872 failed connection attempts
132167 connection resets received
96 connections established
33096061 segments received
40931113 segments send out
715004 segments retransmited
0 bad segments received.
125842 resets sent
Udp:
4173585 packets received
23298 packets to unknown port received.
62 packet receive errors
6868647 packets sent
TcpExt:
482 resets received for embryonic SYN_RECV sockets
268 packets pruned from receive queue because of socket buffer overrun
14 ICMP packets dropped because they were out-of-window
1084404 TCP sockets finished time wait in fast timer
2236 time wait sockets recycled by time stamp
354 packets rejects in established connections because of timestamp
366541 delayed acks sent
179 delayed acks further delayed because of locked socket
Quick ack mode was activated 111492 times
2510467 packets directly queued to recvmsg prequeue.
3003729 of bytes directly received from backlog
380000212 of bytes directly received from prequeue
7194390 packet headers predicted
724319 packets header predicted and directly queued to user
8998070 acknowledgments not containing data received
5065326 predicted acknowledgments
1472 times recovered from packet loss due to fast retransmit
42465 times recovered from packet loss due to SACK data
161 bad SACKs received
Detected reordering 165 times using FACK
Detected reordering 181 times using SACK
Detected reordering 444 times using reno fast retransmit
Detected reordering 3197 times using time stamp
1530 congestion windows fully recovered
17437 congestion windows partially recovered using Hoe heuristic
TCPDSACKUndo: 155
73943 congestion windows recovered after partial ack
29880 TCP data loss events
TCPLostRetransmit: 42
541 timeouts after reno fast retransmit
60499 timeouts after SACK recovery
11423 timeouts in loss state
84462 fast retransmits
6777 forward retransmits
45329 retransmits in slow start
261506 other TCP timeouts
TCPRenoRecoveryFail: 203
14386 sack retransmits failed
2738 times receiver scheduled too late for direct processing
13320 packets collapsed in receive queue due to low socket buffer
89544 DSACKs sent for old packets
1294 DSACKs sent for out of order packets
21665 DSACKs received
446 DSACKs for out of order packets received
34120 connections reset due to unexpected data
3110 connections reset due to early user close
34312 connections aborted due to timeout
[ports]
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:20000 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:1984 0.0.0.0:* LISTEN
tcp 0 0 10.30.10.9:1984 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:20002 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:55555 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:37 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:389 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:21000 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:873 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:9 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:139 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:13 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:783 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN
tcp 0 0 10.30.10.9:8080 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:113 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:53 0.0.0.0:* LISTEN
tcp 0 0 10.30.10.9:1080 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:5432 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:25 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:445 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:14238 0.0.0.0:* LISTEN
tcp 0 0 192.168.1.2:56464 123.123.32.80:80 ESTABLISHED
tcp 0 0 192.168.1.2:51389 123.123.176.189:80 ESTABLISHED
tcp 0 0 192.168.1.2:52865 123.123.28.123:80 ESTABLISHED
tcp 0 0 10.30.10.9:8080 10.30.10.16:1028 TIME_WAIT
tcp 0 0 192.168.1.2:40408 123.213.88.51:80 ESTABLISHED
tcp 0 0 192.168.1.2:53207 123.123.50.28:443 ESTABLISHED
tcp 0 0 10.30.10.9:33905 10.30.10.1:139 ESTABLISHED
tcp 0 0 192.168.1.2:49885 123.123.28.124:80 ESTABLISHED
tcp 0 0 192.168.1.2:53071 123.123.176.176:80 ESTABLISHED
tcp 0 0 127.0.0.1:5432 127.0.0.1:53017 ESTABLISHED
tcp 0 0 192.168.1.2:25 123.123.132.183:57226 TIME_WAIT
tcp 0 0 10.30.10.9:8080 10.30.10.14:2826 ESTABLISHED
tcp 0 0 127.0.0.1:54646 127.0.0.1:1984 TIME_WAIT
tcp 0 0 127.0.0.1:54645 127.0.0.1:1984 TIME_WAIT
tcp 0 0 127.0.0.1:54655 127.0.0.1:1984 TIME_WAIT
tcp 0 0 127.0.0.1:54639 127.0.0.1:1984 TIME_WAIT
tcp 0 0 10.30.10.9:8080 10.30.10.14:2829 ESTABLISHED
tcp 0 0 127.0.0.1:54664 127.0.0.1:1984 TIME_WAIT
tcp 0 0 127.0.0.1:54665 127.0.0.1:1984 TIME_WAIT
tcp 0 0 10.30.10.9:8080 10.30.10.14:2830 ESTABLISHED
tcp 0 0 192.168.1.2:41020 123.123.38.94:80 ESTABLISHED
tcp 0 0 10.30.10.9:8080 10.30.10.16:1050 ESTABLISHED
tcp 0 0 192.168.1.2:35851 123.123.88.59:80 TIME_WAIT
tcp 0 0 10.30.10.9:8080 10.30.10.16:1051 ESTABLISHED
tcp 0 0 192.168.1.2:47514 123.123.88.51:80 ESTABLISHED
tcp 0 0 192.168.1.2:37393 123.123.176.45:443 ESTABLISHED
tcp 0 0 10.30.10.9:49509 10.30.10.9:995 ESTABLISHED
tcp 0 0 192.168.1.2:38569 123.123.15.124:80 ESTABLISHED
tcp 0 0 10.30.10.9:8080 10.30.10.16:1068 ESTABLISHED
tcp 0 1640 192.168.1.2:40677 123.213.176.45:443 ESTABLISHED
tcp 0 1 192.168.1.2:59642 123.123.176.176:80 LAST_ACK
tcp 0 0 10.30.10.9:8080 10.30.10.16:1066 ESTABLISHED
tcp 0 0 192.168.1.2:46183 123.123.88.51:80 ESTABLISHED
tcp 0 0 10.30.10.9:8080 10.30.10.16:1067 ESTABLISHED
tcp 0 1640 192.168.1.2:32966 123.123.176.45:443 ESTABLISHED
tcp 0 0 192.168.1.2:41246 123.123.88.56:80 ESTABLISHED
tcp 0 0 192.168.1.2:35872 123.123.38.94:80 TIME_WAIT
tcp 0 0 10.30.10.9:139 10.30.10.1:3133 ESTABLISHED
tcp 0 0 10.30.10.9:47825 10.30.10.1:139 TIME_WAIT
tcp 0 0 10.30.10.9:47828 10.30.10.1:139 TIME_WAIT
tcp 0 0 192.168.1.2:33758 123.123.38.94:80 TIME_WAIT
tcp 0 0 10.30.10.9:8080 10.30.10.17:3132 ESTABLISHED
tcp
Note: this is clearly truncated mid-line and mid-report!
So, the procs test on the remote hobbit shows red, and alerts, etc.
The interesting thing to note this time is that the actual procs report
showed most of the procs were actually found, with some (truncated) ps
output data. That means hobbit actually had more data than is displayed
on the Client Data page...
If anyone can advise how to resolve this, I would be exceptionally keen
to hear about it.
An upgrade to the bandwidth at one end is scheduled in two weeks' time,
which may help this site, but I am still seeing the same problem with
other hosts in other locations.
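To convince myself the bytes really do go missing in transit, one quick check
would be to compare byte counts at each hop: how much the client/proxy sends
versus how much the receiving socket actually gets before the connection
closes. A purely illustrative stand-in (not a Hobbit tool; the port and
payload size below are arbitrary):

```python
import socket
import threading

received = {"n": 0}

def listener(srv):
    """Accept one connection and count the bytes that actually arrive."""
    conn, _ = srv.accept()
    while True:
        chunk = conn.recv(65536)
        if not chunk:  # sender closed the connection
            break
        received["n"] += len(chunk)
    conn.close()

# Stand-in for the hobbitd listener on the server side.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))  # any free port; real hobbitd uses 1984
srv.listen(1)
t = threading.Thread(target=listener, args=(srv,))
t.start()

payload = b"x" * 500_000  # roughly the size of a large client report
cli = socket.create_connection(srv.getsockname())
cli.sendall(payload)
cli.close()
t.join()
srv.close()

print(f"sent {len(payload)} bytes, received {received['n']} bytes")
```

Over loopback the two counts match; pointed at the real server across the
slow link, a shortfall on the receiving side would confirm the truncation
happens in transit rather than in hobbitd's parsing.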
Two additional questions:
1) Could some sort of checksum/verification marker be added to the end
of the client data, so that if the server doesn't see the end marker it
discards the entire message?
2) Could the client data be compressed?
Never mind, I'll post this as a separate email so it can get a little
better visibility...
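To illustrate what I mean by question 1 (nothing like this exists in the
Hobbit protocol today; the [sha1] and [end] section names below are made
up for the sketch), a trailing checksum plus an end marker would let the
server detect and discard a truncated message, and zlib shows how well
this kind of text report would compress (question 2):

```python
import hashlib
import zlib

MARKER = b"\n[end]\n"  # hypothetical end-of-message marker

def seal(msg: bytes) -> bytes:
    """Append a checksum section and an end marker so truncation is detectable."""
    digest = hashlib.sha1(msg).hexdigest().encode()
    return msg + b"\n[sha1]\n" + digest + MARKER

def verify(wire: bytes) -> bytes:
    """Return the original message, or raise if it was truncated/corrupted."""
    if not wire.endswith(MARKER):
        raise ValueError("end marker missing: message truncated, discard it")
    body, sep, digest = wire[: -len(MARKER)].rpartition(b"\n[sha1]\n")
    if not sep or hashlib.sha1(body).hexdigest().encode() != digest:
        raise ValueError("checksum mismatch: message corrupted, discard it")
    return body

# A stand-in for a large client report.
report = b"client mail.example.linux linux\n[date]\n..." * 1000
assert verify(seal(report)) == report

# A truncated message is now rejected instead of silently turning procs red:
try:
    verify(seal(report)[:1000])
except ValueError as e:
    print(e)

# Question 2: these reports are repetitive text and compress very well.
print(f"{len(report)} -> {len(zlib.compress(report))} bytes")
```

The receiver discards anything without a valid trailer, so a partial
message produces a stale-data condition rather than a false red.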
Thanks,
Adam
> Adam Goryachev escribió:
>> Adam Goryachev wrote:
>>> Anyway, the problem is that, approximately since then, a number of client
>>> reports are not completely received. Sometimes some of the ps output is
>>> truncated, sometimes the ports section is truncated, etc. This leads to
>>> false positive alerts (i.e., procs goes red because some monitored procs
>>> appear not to be running, since they were listed after the truncated section).
>>
>>> I've increased the timeout on hobbitd (--timeout=60), but this
>>> doesn't seem to have helped. The only common factors among the clients
>>> which have this problem are:
>>
>>> 1) Most of them are running bbproxy and passing status messages from a
>>> number of clients.
>>> 2) The rest of them are on very slow connections, or frequently very
>>> busy connections.
>>
>>
>> I have made some 'progress' of sorts.
>>
>> I've increased the MAX values as I was getting some "Oversize ...
>> truncated" messages in my log file. I then went home thinking "Great, I
>> managed to solve this one thing today at least". Except, I started
>> getting messages a few hours later.
>>
>> So after further investigation, I've decided I really can't work out
>> what is happening, and why it isn't working. I've enabled debug output
>> from bbproxy, but I don't really know what it all means.
>>
>> I can see that if I set bbproxy to only forward messages to 127.0.0.1
>> the local hobbit server gets all the data correctly. If I add the remote
>> server, then some things don't work properly. Since it is likely all a
>> big jumbled mess by now, I'll post a few sections of config files, and
>> hopefully someone will notice my stupid mistake (or multiple mistakes)...
>>
>> I have a network 10.x.x.x which has a hobbit server at 10.30.10.9; all
>> client machines report to 10.30.10.9 as the BBDISPLAY/BBPAGER (most are
>> Windows PCs using the BB Windows client), one is a linux hobbit-client,
>> and of course 10.30.10.9 is a hobbit client itself (plus a couple of old
>> ext scripts using the old BB env). I think all of this is working fine,
>> since nothing goes randomly purple/red.
>>
>> 10.30.10.9 is behind NAT but has complete access to the internet.
>>
>> I have a remote server behind a NAT router which has port 1984
>> forwarded to it. It is receiving reports from around 20 other hobbit
>> client machines perfectly, so I don't suspect the NAT router or the
>> hobbit config itself.
>>
>> Some config from 10.30.10.9:
>>
>> hobbitserver.cfg:
>> BBSERVERIP="127.0.0.1"
>> BBDISP="127.0.0.1"
>> BBDISPLAYS=""
>> MAXLINE="32768"
>>
>> hobbitclient.cfg
>> BBDISP="10.30.10.9"
>> BBDISPLAYS=""
>> BB="$BBHOME/bin/bb --debug --timeout=60"
>> MAXLINE="32768"
>>
>> hobbitlaunch.cfg
>> [hobbitd]
>> ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
>> CMD hobbitd --pidfile=$BBSERVERLOGS/hobbitd.pid
>> --restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk
>> --checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log
>> --admin-senders=127.0.0.1,$BBSERVERIP --store-clientlogs=!msgs
>> --listen=127.0.0.1
>>
>>
>> [bbproxy]
>> ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
>> CMD $BBHOME/bin/bbproxy --hobbitd
>> --bbdisplay=123.234.456.567,127.0.0.1 --listen=10.30.10.9
>> --report=$MACHINE.bbproxy --no-daemon --timeout=30
>> --pidfile=$BBSERVERLOGS/bbproxy.pid --debug --log-details
>> CMD $BBHOME/bin/bbproxy --hobbitd --bbdisplay=127.0.0.1
>> --listen=10.30.10.9 --report=$MACHINE.bbproxy --no-daemon --timeout=30
>> --pidfile=$BBSERVERLOGS/bbproxy.pid --debug --log-details
>> LOGFILE $BBSERVERLOGS/bbproxy.log
>>
>> [hobbitclient]
>> ENVFILE /usr/lib/hobbit/client/etc/hobbitclient.cfg
>> NEEDS hobbitd
>> CMD /usr/lib/hobbit/client/bin/hobbitclient.sh
>> LOGFILE $BBSERVERLOGS/hobbitclient.log
>> INTERVAL 5m
>>
>>
>> On the remote hobbit server with the public IP I have:
>> hobbitserver.cfg
>> BBSERVERIP="192.168.2.6"
>> BBDISP="192.168.2.6"
>> BBDISPLAYS=""
>> MAXLINE="32768"
>> MAXMSG_STATUS="1024"
>> MAXMSG_CLIENT="1024"
>> MAXMSG_DATA="512"
>>
>> hobbitlaunch.cfg
>> [hobbitd]
>> HEARTBEAT
>> ENVFILE /usr/lib/hobbit/server/etc/hobbitserver.cfg
>> CMD hobbitd --pidfile=$BBSERVERLOGS/hobbitd.pid
>> --restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk
>> --checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log
>> --admin-senders=127.0.0.1,$BBSERVERIP
>> --maint-senders=127.0.0.1,$BBSERVERIP --www-senders=127.0.0.1,$BBSERVERIP
>> --store-clientlogs=!msgs --timeout=60
>>
>> Any suggestions as to what is going wrong would be really appreciated.
>>
>> BTW, bbnet tests from the 10.30.10.9 host are not submitted to the
>> bbproxy at all because of the BBDISP setting in hobbitserver.cfg, but
>> if I change that to point to 10.30.10.9 it seems to break the web
>> interface. I'm not really too concerned about this right now though...
>>
>> Thanks for any tips/pointers/etc
>>
>> Regards,
>> Adam
--
Adam Goryachev
Website Managers
Ph: +61 2 8304 0000 adam at websitemanagers.com.au
Fax: +61 2 8304 0001 www.websitemanagers.com.au