[hobbit] Network test dying
James B Horwath
JamesHorwath at glic.com
Tue Mar 14 04:25:51 CET 2006
> On Mon, Mar 13, 2006 at 12:45:27PM -0500, James B Horwath wrote:
> > I have been running hobbit for several months now without incident. I am
> > running hobbit 4.1.2p1 on Redhat Enterprise 3 on IBM pseries hardware. I
> > haven't had any issues until this morning. Now it appears that after about
> > one hour of running, the system flat out dies. I am sent a notification for
> > every system connected. Then it appears the network process dies. I was
> > running tcpdump to see what was wrong. I see the completion of a network
> > test about 30 minutes ago to a machine on the same subnet. I am not
> > running iptables/ipchains. I am not experienced at hard-core hobbit
> > debugging. I looked in /var/log/hobbit and don't see anything strange.
> > There are no core files in the hobbit directory.
> >
> > Any advice on where to start? All my network tests are now purple.
>
> Is there a "bbtest-net" and/or "fping" process which hangs ? If there
> is, it would be interesting to attach to it with "gdb" and see what
> it is doing. Alternatively, kill it with a "kill -6" which will trigger
> a core dump in ~hobbit/data/tmp/ - you can run the core dump through
> gdb, which might give me an idea what it is doing.
>
>
> You can also try su'ing to the hobbit user and run the command
>
> bbcmd bbtest-net --debug host1 host2
>
> (replace "host1" and "host2" with a couple of the hosts in your
> bb-hosts file).
>
>
> Are DNS lookups working on this box? That is one of the few things that
> can cause the network tests to slow down dramatically. But they ought to
> time out automatically. Same goes for the other commands that run as
> part of the network tests (rpc and ntp queries).
Henrik,
Thanks for the tips. DNS works fine, and the network tests seem to work
fine when I try them. My system is pretty much idle and I don't see
anything nasty in the system logs. I have included the process table and
the lsof output for the bbtest-net process. When I did the kill -6 on the
network process, it worked once and then stopped again. I ran strings on
the core and may have found a machine with slow DNS resolution. I am
keeping my fingers crossed.
Regards,
Jim
[root@bigbrother etc]# ps -ef | grep hobbit
hobbit 18470 1 0 15:10 ? 00:00:00 /usr/local/hobbit/server/bin/hobbitlaunch --config=/usr/local/hobbit/server/etc/hobbitlaunch.cfg --env=/usr/local/hobbit/server/etc/hobbitserver.cfg --log=/var/log/hobbit/hobbitlaunch.log --pidfile=/var/log/hobbit/hobbitlaunch.pid
hobbit 18471 18470 0 15:10 ? 00:00:05 hobbitd --pidfile=/var/log/hobbit/hobbitd.pid --restart=/usr/local/hobbit/server/tmp/hobbitd.chk --checkpoint-file=/usr/local/hobbit/server/tmp/hobbitd.chk --checkpoint-interval=600 --log=/var/log/hobbit/hobbitd.log --admin-senders=127.0.0.1 10.98.200.46
hobbit 18473 18470 0 15:10 ? 00:00:00 hobbitd_channel --channel=stachg --log=/var/log/hobbit/history.log hobbitd_history
hobbit 18474 18473 0 15:10 ? 00:00:00 hobbitd_history
hobbit 18475 18470 0 15:10 ? 00:00:01 hobbitd_channel --channel=page --log=/var/log/hobbit/page.log hobbitd_alert --checkpoint-file=/usr/local/hobbit/server/tmp/alert.chk --checkpoint-interval=600
hobbit 18476 18475 0 15:10 ? 00:00:00 hobbitd_alert --checkpoint-file=/usr/local/hobbit/server/tmp/alert.chk --checkpoint-interval=600
hobbit 18477 18470 0 15:10 ? 00:00:19 hobbitd_channel --channel=status --log=/var/log/hobbit/rrd-status.log hobbitd_rrd --rrddir=/usr/local/hobbit/rrd
hobbit 18478 18470 0 15:10 ? 00:00:00 hobbitd_channel --channel=data --log=/var/log/hobbit/rrd-data.log hobbitd_rrd --rrddir=/usr/local/hobbit/rrd
hobbit 18479 18470 0 15:10 ? 00:00:00 hobbitd_channel --channel=client --log=/var/log/hobbit/clientdata.log hobbitd_client
hobbit 18480 18478 0 15:10 ? 00:00:00 hobbitd_rrd --rrddir=/usr/local/hobbit/rrd
hobbit 18481 18477 0 15:10 ? 00:00:14 hobbitd_rrd --rrddir=/usr/local/hobbit/rrd
hobbit 18482 18479 0 15:10 ? 00:00:00 hobbitd_client
hobbit 18634 18470 0 15:20 ? 00:00:00 bbtest-net --report --ping --checkresponse --timeout=60 --debug
hobbit 21820 1 0 22:02 ? 00:00:00 sh -c vmstat 300 2 1>/usr/local/hobbit/client/tmp/hobbit_vmstat.21809 2>&1; mv /usr/local/hobbit/client/tmp/hobbit_vmstat.21809 /usr/local/hobbit/client/tmp/hobbit_vmstat
hobbit 21821 21820 0 22:02 ? 00:00:00 vmstat 300 2
root 21861 21698 0 22:06 pts/0 00:00:00 grep hobbit
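
In the process table above, bbtest-net (PID 18634) started at 15:20 and is still present when the grep runs at 22:06, even though it was launched with --timeout=60. A minimal sketch for spotting such a long-lived bbtest-net automatically might look like the following. The function names and the 600-second default limit are assumptions for illustration, not part of Hobbit.

```shell
#!/bin/sh
# Sketch only: function names and the 600-second default are assumptions.

# Convert a ps "etime" value ([[dd-]hh:]mm:ss) to seconds.
etime_to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        n = NF
        s = $n + 60 * $(n - 1)              # seconds and minutes are always present
        if (n >= 3) s += 3600 * $(n - 2)    # hours, when present
        if (n >= 4) s += 86400 * $(n - 3)   # days, when present
        print s
    }'
}

# Print "<pid> <age in seconds>" for every bbtest-net process that has
# been running longer than $1 seconds (default 600).
stale_bbtest_net() {
    limit=${1:-600}
    ps -eo pid=,etime=,comm= | while read -r pid etime comm; do
        case $comm in
            bbtest-net*) ;;
            *) continue ;;
        esac
        age=$(etime_to_seconds "$etime")
        if [ "$age" -gt "$limit" ]; then
            echo "$pid $age"
        fi
    done
}
```

Sourcing the file and running `stale_bbtest_net 600` as root or as the hobbit user would list candidates to inspect with lsof or gdb before killing them.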
[root@bigbrother etc]# lsof -p 18634
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
bbtest-ne 18634 hobbit cwd DIR 8,3 4096 376833 /usr/local/hobbit/server
bbtest-ne 18634 hobbit rtd DIR 8,11 4096 2 /
bbtest-ne 18634 hobbit txt REG 8,3 170076 393236 /usr/local/hobbit/server/bin/bbtest-net
bbtest-ne 18634 hobbit mem REG 8,11 61504 80026 /lib/libnss_files-2.3.2.so
bbtest-ne 18634 hobbit mem REG 8,11 14592 80056 /lib/liblaus.so.1.0.0
bbtest-ne 18634 hobbit mem REG 8,11 39468 82638 /lib/libpam.so.0.75
bbtest-ne 18634 hobbit mem REG 8,11 29100 80014 /lib/libcrypt-2.3.2.so
bbtest-ne 18634 hobbit mem REG 8,7 28672 144356 /usr/lib/libgdbm.so.2.0.0
bbtest-ne 18634 hobbit mem REG 8,7 59608 144390 /usr/lib/libz.so.1.1.4
bbtest-ne 18634 hobbit mem REG 8,11 19992 80016 /lib/libdl-2.3.2.so
bbtest-ne 18634 hobbit mem REG 8,11 79916 80036 /lib/libresolv-2.3.2.so
bbtest-ne 18634 hobbit mem REG 8,7 78360 272188 /usr/kerberos/lib/libk5crypto.so.3.0
bbtest-ne 18634 hobbit mem REG 8,7 11072 272178 /usr/kerberos/lib/libcom_err.so.3.0
bbtest-ne 18634 hobbit mem REG 8,7 391564 272198 /usr/kerberos/lib/libkrb5.so.3.1
bbtest-ne 18634 hobbit mem REG 8,7 77448 272184 /usr/kerberos/lib/libgssapi_krb5.so.2.2
bbtest-ne 18634 hobbit mem REG 8,7 57768 144429 /usr/lib/libsasl.so.7.1.11
bbtest-ne 18634 hobbit mem REG 8,11 1608896 32013 /lib/tls/libc-2.3.2.so
bbtest-ne 18634 hobbit mem REG 8,11 1104580 80070 /lib/libcrypto.so.0.9.7a
bbtest-ne 18634 hobbit mem REG 8,11 220772 80071 /lib/libssl.so.0.9.7a
bbtest-ne 18634 hobbit mem REG 8,7 49304 144433 /usr/lib/liblber.so.2.0.17
bbtest-ne 18634 hobbit mem REG 8,7 186348 144435 /usr/lib/libldap.so.2.0.17
bbtest-ne 18634 hobbit mem REG 8,11 115228 80005 /lib/ld-2.3.2.so
bbtest-ne 18634 hobbit 0r CHR 1,3 65675 /dev/null
bbtest-ne 18634 hobbit 1w REG 8,6 5775484 432036 /var/log/hobbit/bb-network.log
bbtest-ne 18634 hobbit 2w REG 8,6 5775484 432036 /var/log/hobbit/bb-network.log
bbtest-ne 18634 hobbit 3u IPv4 219456 UDP bigbrother:35123->n9000sd1.nro.glic.com:domain
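
The only network socket in the lsof listing is file descriptor 3, a UDP connection to n9000sd1.nro.glic.com on the domain port, which is consistent with bbtest-net waiting on a DNS reply. One way to look for the slow host is to time a name lookup for every entry in bb-hosts. The sketch below assumes bb-hosts host lines have the layout "IP-address hostname # tags"; the function names and the 2-second default threshold are illustrative assumptions. Timing uses `date +%s`, so resolution is whole seconds only.

```shell
#!/bin/sh
# Sketch only: function names, file layout, and threshold are assumptions.

# Print the hostname column of a bb-hosts style file. Comments, blank
# lines, and directive lines are skipped because they do not start with
# a dotted-quad IP address.
list_hosts() {
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && NF >= 2 { print $2 }' "$1"
}

# Print "<seconds> <hostname>" for each host in file $1 whose lookup
# takes at least $2 seconds (default 2).
slow_lookups() {
    hosts_file=$1
    threshold=${2:-2}
    list_hosts "$hosts_file" | while read -r host; do
        start=$(date +%s)
        # getent uses the system resolver order (files, DNS, ...);
        # a failed lookup is ignored, only its duration matters here.
        getent hosts "$host" >/dev/null 2>&1
        end=$(date +%s)
        elapsed=$((end - start))
        if [ "$elapsed" -ge "$threshold" ]; then
            echo "$elapsed $host"
        fi
    done
}
```

For example, after sourcing the file, `slow_lookups /usr/local/hobbit/server/etc/bb-hosts 2` would report hosts that take two seconds or more to resolve. The bb-hosts path shown is inferred from the install prefix in the process table and may differ on other installations.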