
Re: [hobbit] Network test dying



> On Mon, Mar 13, 2006 at 12:45:27PM -0500, James B Horwath wrote:
> > I have been running hobbit for several months now without incident.  I am
> > running hobbit 4.1.2p1 on Redhat Enterprise 3 on IBM pseries hardware.  I
> > haven't had any issues until this morning.  Now it appears that after about
> > one hour of running, the system flat out dies.  I am sent a notification for
> > every system connected.  Then it appears the network process dies.  I was
> > running tcpdump to see what was wrong.  I see the completion of a network
> > test about 30 minutes ago to a machine on the same subnet.  I am not
> > running iptables/ipchains.  I am not experienced at hard-core hobbit
> > debugging.  I looked in /var/log/hobbit and don't see anything strange.
> > There are no core files in the hobbit directory.
> > 
> > Any advice on where to start?  All my network tests are now purple.
> 
> Is there a "bbtest-net" and/or "fping" process which hangs ? If there
> is, it would be interesting to attach to it with "gdb" and see what
> it is doing. Alternatively, kill it with a "kill -6" which will trigger 
> a core dump in ~hobbit/data/tmp/ - you can run the core dump through 
> gdb, which might give me an idea what it is doing.
> 
> 
> You can also try su'ing to the hobbit user and run the command
> 
>    bbcmd bbtest-net --debug host1 host2
> 
> (replace "host1" and "host2" with a couple of the hosts in your
>  bb-hosts file).
> 
> 
> Are DNS lookups working on this box? That is one of the few things that
> can cause the network tests to slow down dramatically. But they ought to
> time out automatically. The same goes for the other commands that run as
> part of the network tests (rpc and ntp queries).
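
The gdb suggestion above can be scripted. The following is only a sketch, assuming gdb and pgrep are installed and that it is run as root or as the hobbit user; the function name backtrace_proc is made up for illustration and is not part of hobbit.

```shell
#!/bin/sh
# Sketch: print a gdb backtrace for every running process with the given name.
# backtrace_proc is a hypothetical helper name, not a hobbit tool.
backtrace_proc() {
    name=$1
    # pgrep -x matches the exact process name and prints one PID per line
    pids=$(pgrep -x "$name")
    if [ -z "$pids" ]; then
        echo "no $name process found"
        return 1
    fi
    for pid in $pids; do
        echo "== $name pid $pid =="
        # attach, print the call stack, then detach and exit
        gdb -p "$pid" -batch -ex "bt" 2>&1
    done
}
```

Calling "backtrace_proc bbtest-net" and then "backtrace_proc fping" should show where each process is sitting. If attaching is not possible, "kill -6 PID" sends SIGABRT, which is the core-dump route described above.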

Henrik,

Thanks for the tips.  DNS works fine and the network tests seem to work 
fine when I run them by hand.  My system is pretty much idle and I don't see 
anything nasty in the system logs.  I have included the process table and 
the lsof output for the bbtest-net process.  When I did the kill -6 on the 
network process, it worked once and then stopped again.  I ran strings on 
the core and may have found a machine with slow DNS resolution.  I am 
keeping my fingers crossed.

Regards,
Jim

[root (at) bigbrother etc]# ps -ef | grep hobbit

hobbit   18470     1  0 15:10 ?        00:00:00 /usr/local/hobbit/server/bin/hobbitlaunch --config=/usr/local/hobbit/server/etc/hobbitlaunch.cfg --env=/usr/local/hobbit/server/etc/hobbitserver.cfg --log=/var/log/hobbit/hobbitlaunch.log --pidfile=/var/log/hobbit/hobbitlaunch.pid
hobbit   18471 18470  0 15:10 ?        00:00:05 hobbitd --pidfile=/var/log/hobbit/hobbitd.pid --restart=/usr/local/hobbit/server/tmp/hobbitd.chk --checkpoint-file=/usr/local/hobbit/server/tmp/hobbitd.chk --checkpoint-interval=600 --log=/var/log/hobbit/hobbitd.log --admin-senders=127.0.0.1 10.98.200.46
hobbit   18473 18470  0 15:10 ?        00:00:00 hobbitd_channel --channel=stachg --log=/var/log/hobbit/history.log hobbitd_history
hobbit   18474 18473  0 15:10 ?        00:00:00 hobbitd_history
hobbit   18475 18470  0 15:10 ?        00:00:01 hobbitd_channel --channel=page --log=/var/log/hobbit/page.log hobbitd_alert --checkpoint-file=/usr/local/hobbit/server/tmp/alert.chk --checkpoint-interval=600
hobbit   18476 18475  0 15:10 ?        00:00:00 hobbitd_alert --checkpoint-file=/usr/local/hobbit/server/tmp/alert.chk --checkpoint-interval=600
hobbit   18477 18470  0 15:10 ?        00:00:19 hobbitd_channel --channel=status --log=/var/log/hobbit/rrd-status.log hobbitd_rrd --rrddir=/usr/local/hobbit/rrd
hobbit   18478 18470  0 15:10 ?        00:00:00 hobbitd_channel --channel=data --log=/var/log/hobbit/rrd-data.log hobbitd_rrd --rrddir=/usr/local/hobbit/rrd
hobbit   18479 18470  0 15:10 ?        00:00:00 hobbitd_channel --channel=client --log=/var/log/hobbit/clientdata.log hobbitd_client
hobbit   18480 18478  0 15:10 ?        00:00:00 hobbitd_rrd --rrddir=/usr/local/hobbit/rrd
hobbit   18481 18477  0 15:10 ?        00:00:14 hobbitd_rrd --rrddir=/usr/local/hobbit/rrd
hobbit   18482 18479  0 15:10 ?        00:00:00 hobbitd_client
hobbit   18634 18470  0 15:20 ?        00:00:00 bbtest-net --report --ping --checkresponse --timeout=60 --debug
hobbit   21820     1  0 22:02 ?        00:00:00 sh -c vmstat 300 2 1>/usr/local/hobbit/client/tmp/hobbit_vmstat.21809 2>&1; mv /usr/local/hobbit/client/tmp/hobbit_vmstat.21809 /usr/local/hobbit/client/tmp/hobbit_vmstat
hobbit   21821 21820  0 22:02 ?        00:00:00 vmstat 300 2
root     21861 21698  0 22:06 pts/0    00:00:00 grep hobbit

[root (at) bigbrother etc]# lsof -p 18634
COMMAND     PID   USER   FD   TYPE DEVICE    SIZE   NODE NAME
bbtest-ne 18634 hobbit  cwd    DIR    8,3    4096 376833 /usr/local/hobbit/server
bbtest-ne 18634 hobbit  rtd    DIR   8,11    4096      2 /
bbtest-ne 18634 hobbit  txt    REG    8,3  170076 393236 /usr/local/hobbit/server/bin/bbtest-net
bbtest-ne 18634 hobbit  mem    REG   8,11   61504  80026 /lib/libnss_files-2.3.2.so
bbtest-ne 18634 hobbit  mem    REG   8,11   14592  80056 /lib/liblaus.so.1.0.0
bbtest-ne 18634 hobbit  mem    REG   8,11   39468  82638 /lib/libpam.so.0.75
bbtest-ne 18634 hobbit  mem    REG   8,11   29100  80014 /lib/libcrypt-2.3.2.so
bbtest-ne 18634 hobbit  mem    REG    8,7   28672 144356 /usr/lib/libgdbm.so.2.0.0
bbtest-ne 18634 hobbit  mem    REG    8,7   59608 144390 /usr/lib/libz.so.1.1.4
bbtest-ne 18634 hobbit  mem    REG   8,11   19992  80016 /lib/libdl-2.3.2.so
bbtest-ne 18634 hobbit  mem    REG   8,11   79916  80036 /lib/libresolv-2.3.2.so
bbtest-ne 18634 hobbit  mem    REG    8,7   78360 272188 /usr/kerberos/lib/libk5crypto.so.3.0
bbtest-ne 18634 hobbit  mem    REG    8,7   11072 272178 /usr/kerberos/lib/libcom_err.so.3.0
bbtest-ne 18634 hobbit  mem    REG    8,7  391564 272198 /usr/kerberos/lib/libkrb5.so.3.1
bbtest-ne 18634 hobbit  mem    REG    8,7   77448 272184 /usr/kerberos/lib/libgssapi_krb5.so.2.2
bbtest-ne 18634 hobbit  mem    REG    8,7   57768 144429 /usr/lib/libsasl.so.7.1.11
bbtest-ne 18634 hobbit  mem    REG   8,11 1608896  32013 /lib/tls/libc-2.3.2.so
bbtest-ne 18634 hobbit  mem    REG   8,11 1104580  80070 /lib/libcrypto.so.0.9.7a
bbtest-ne 18634 hobbit  mem    REG   8,11  220772  80071 /lib/libssl.so.0.9.7a
bbtest-ne 18634 hobbit  mem    REG    8,7   49304 144433 /usr/lib/liblber.so.2.0.17
bbtest-ne 18634 hobbit  mem    REG    8,7  186348 144435 /usr/lib/libldap.so.2.0.17
bbtest-ne 18634 hobbit  mem    REG   8,11  115228  80005 /lib/ld-2.3.2.so
bbtest-ne 18634 hobbit    0r   CHR    1,3          65675 /dev/null
bbtest-ne 18634 hobbit    1w   REG    8,6 5775484 432036 /var/log/hobbit/bb-network.log
bbtest-ne 18634 hobbit    2w   REG    8,6 5775484 432036 /var/log/hobbit/bb-network.log
bbtest-ne 18634 hobbit    3u  IPv4 219456            UDP bigbrother:35123->n9000sd1.nro.glic.com:domain
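
The last lsof line shows file descriptor 3 as a UDP socket to the "domain" (DNS) port, which would be consistent with the process waiting on a name lookup. One way to look for a slow name is to time a lookup for each host. This is a rough sketch only: check_dns and SLOW_SECS are names invented for this example, and getent goes through the system resolver, which may not be the same path bbtest-net uses.

```shell
#!/bin/sh
# Sketch: time a name lookup for each host and flag the slow ones.
# SLOW_SECS is an arbitrary threshold chosen for illustration.
SLOW_SECS=${SLOW_SECS:-2}

check_dns() {
    for host in "$@"; do
        start=$(date +%s)
        # getent resolves through the system resolver (nsswitch.conf)
        if getent hosts "$host" >/dev/null 2>&1; then
            result=ok
        else
            result=FAILED
        fi
        end=$(date +%s)
        elapsed=$((end - start))
        # whole-second resolution is coarse, but enough to spot multi-second stalls
        if [ "$elapsed" -ge "$SLOW_SECS" ]; then
            result="$result (slow)"
        fi
        echo "$host ${elapsed}s $result"
    done
}
```

For example, "check_dns host1 host2" with names taken from the bb-hosts file prints one line per host. Any host marked slow or FAILED is a candidate for the name seen in the core file.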

