
Re: [hobbit] still crashing



Here's (attached as plaintext) an offending report ("client status"). Note that under [df] we have what looks like ps output (huh?!), while the df output shows up under [top], and hobbit complains, quite rightly, that it can't make head or tail (so to speak) of disk space from that.

Rich Smrcina wrote:
Also, if possible try to capture the offending disk report. Check the good report and the bad one to see if the reporting IP addresses are different. It is possible that two machines are reporting with the same hostname.

I've seen the 'Worker process died' message when I really screwed up something in the client coding. It likely means that something in the client message is out of place, which makes sense given the message you see about the disk report.

Rob Munsch wrote:
Henrik,

I haven't been able to pinpoint a specific message at the same time the hobbitd_client dies. What I am seeing are blocks of things like this:

2007-02-26 09:56:52 Worker process died with exit code 134, terminating
2007-02-26 10:16:54 Worker process died with exit code 134, terminating
2007-02-26 10:16:55 Worker process died with exit code 134, terminating
2007-02-26 10:26:56 Worker process died with exit code 134, terminating
2007-02-26 10:26:56 Worker process died with exit code 134, terminating
2007-02-26 12:17:07 Worker process died with exit code 134, terminating
2007-02-26 12:17:11 Worker process died with exit code 134, terminating
2007-02-26 12:42:10 Worker process died with exit code 134, terminating
2007-02-26 12:42:14 Worker process died with exit code 134, terminating
2007-02-26 13:02:13 Worker process died with exit code 134, terminating
2007-02-26 13:02:17 Worker process died with exit code 134, terminating
2007-02-26 13:07:13 Worker process died with exit code 134, terminating
2007-02-26 13:07:18 Worker process died with exit code 134, terminating
2007-02-26 13:17:19 Worker process died with exit code 134, terminating
2007-02-26 13:22:20 Worker process died with exit code 134, terminating
2007-02-26 13:22:20 Worker process died with exit code 134, terminating
2007-02-26 13:27:20 Worker process died with exit code 134, terminating
2007-02-26 13:27:20 Worker process died with exit code 134, terminating
2007-02-26 13:32:21 Worker process died with exit code 134, terminating
2007-02-26 13:42:22 Worker process died with exit code 134, terminating
2007-02-26 13:42:22 Worker process died with exit code 134, terminating
2007-02-26 13:52:24 Worker process died with exit code 134, terminating
2007-02-26 13:52:24 Worker process died with exit code 134, terminating
2007-02-26 14:07:26 Worker process died with exit code 134, terminating
2007-02-26 14:07:26 Worker process died with exit code 134, terminating
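A note on the number itself: when a process is killed by a signal, shells conventionally report 128 plus the signal number, so 134 would correspond to signal 6, the same signal named in the core dump quoted further down the thread. A minimal Python sketch of that arithmetic, assuming the 134 in the log follows this shell convention (the function name is made up for illustration):

```python
import signal

def describe_exit_code(code: int) -> str:
    """Interpret an exit status using the shell's 128+N convention."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the platform's name for this signal number.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited normally with status {code}"

print(describe_exit_code(134))
```

On Linux, signal 6 is SIGABRT, which is what abort() raises, for instance after a failed assertion.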

I have it running in --debug mode as per your suggestion, and am getting a ton of output: I have a feeling it's a little more than I'm capable of sorting through well :(.

The only other oddity is that it occasionally barfs on Disk tests. For no apparent reason I get

2007-02-26 09:31:49 Host grape (linux) sent incomprehensible disk report - missing columnheaders 'Capacity' and 'Mounted'

but by the next poll, it's figured it out again. I don't know if these are related, but it's all I've got right now.
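One way to catch such a report in the act is to split a saved client message at its [section] markers and test which section actually carries the df column headers. The sketch below is only an illustration in Python, not how hobbitd_client parses messages; the function names are hypothetical, and the sample is a trimmed excerpt of the report attached to this thread.

```python
def split_sections(message: str) -> dict:
    """Map each [name] marker line to the lines that follow it."""
    sections = {}
    current = None
    for line in message.splitlines():
        stripped = line.strip()
        # A marker is a line consisting only of [name].
        if stripped.startswith("[") and stripped.endswith("]") and len(stripped) > 2:
            current = stripped[1:-1]
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections

def has_df_header(lines) -> bool:
    """True if the first line has the 'Capacity' and 'Mounted' columns."""
    header = lines[0].split() if lines else []
    return "Capacity" in header and "Mounted" in header

# Trimmed excerpt of the attached report.
sample = """[df]
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0  2006 ?        00:00:24 init [2]
[top]
Filesystem         1024-blocks      Used Available Capacity Mounted on
/dev/hda3               979928     74584    905344       8% /
"""

for name, lines in split_sections(sample).items():
    print(name, has_df_header(lines))
# df False
# top True
```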

I'll keep trying to correlate a specific message with the crash time and let you know what I find out.
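For sorting through a large --debug log, the correlation could be scripted roughly as follows. This is a sketch under assumptions: it supposes every log line starts with the same "YYYY-MM-DD HH:MM:SS" timestamp seen in the crash messages, and the "Client report from host" lines in the sample are invented placeholders, not real output.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"
CRASH_TEXT = "Worker process died"

def parse_line(line):
    """Return (timestamp, text) for timestamped lines, else None."""
    try:
        stamp = datetime.strptime(line[:19], TS_FORMAT)
    except ValueError:
        return None
    return stamp, line[19:].strip()

def messages_before_crashes(lines, window_seconds=5):
    """For each crash line, collect other messages logged shortly before it."""
    parsed = [p for p in map(parse_line, lines) if p]
    window = timedelta(seconds=window_seconds)
    result = {}
    for crash_time, text in parsed:
        if not text.startswith(CRASH_TEXT):
            continue
        result[crash_time] = [
            other for t, other in parsed
            if crash_time - window <= t <= crash_time
            and not other.startswith(CRASH_TEXT)
        ]
    return result

# Hypothetical log excerpt: only the crash line is taken from the thread.
log = [
    "2007-02-26 09:56:40 Client report from host hostA",
    "2007-02-26 09:56:51 Client report from host hostB",
    "2007-02-26 09:56:52 Worker process died with exit code 134, terminating",
]

for when, suspects in messages_before_crashes(log).items():
    print(when, suspects)
```

A host that keeps turning up in the window before a crash would be the first one to look at.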

Rob Munsch wrote:
Rich Smrcina wrote:
Go back a level (cd ..) and try it again. It happens to me a lot! :)

Marvelously embarrassing. Thanks, proceeding with requested tests... sigh


Rob Munsch wrote:
Henrik Stoerner wrote:
On Thu, Feb 08, 2007 at 04:00:47PM -0500, Rob Munsch wrote:
I still have a constantly red-then-purple hobbitd_client on my hobbit server.

It's gotten to the point where I have a cron job dropping the test continuously. I would appreciate any insight as to why this started happening and what is causing it.

Core was generated by `hobbitd_client'.
Program terminated with signal 6, Aborted.
#0  0xffffe410 in __kernel_vsyscall ()

Unfortunately this doesn't give a clue about what actually happened, except that it jumped to some wild address and crashed.

Could you add this line to hobbitd/hobbitd_client.c
    dbgprintf("Client report from host %s\n", (hostname ? hostname : "<unknown>"));
around line 1754, just after the
    enum ostype_t os;
    namelist_t *hinfo = NULL;
lines. Then run "make" to rebuild hobbitd_client, copy the

I tried doing this. The make bombed terribly; pages and pages of errors. It started like this:


-----
root@randomaccess ~/hobbit-4.2.0/hobbitd # make
cc -c -o hobbitd_client.o hobbitd_client.c
hobbitd_client.c:26:22: error: libbbgen.h: No such file or directory
In file included from hobbitd_client.c:28:
client_config.h:23: error: expected ')' before '*' token
client_config.h:27: error: expected ')' before '*' token
client_config.h:33: error: expected ')' before '*' token
client_config.h:38: error: expected ')' before '*' token
client_config.h:40: error: expected ')' before '*' token
client_config.h:43: error: expected ')' before '*' token
client_config.h:47: error: expected ')' before '*' token
client_config.h:51: error: expected ')' before '*' token
client_config.h:55: error: expected ')' before '*' token
hobbitd_client.c:46: error: 'COL_CLEAR' undeclared here (not in a function)
hobbitd_client.c:132: error: expected ')' before '*' token
hobbitd_client.c:165: error: expected declaration specifiers or '...' before 'namelist_t'
-----


I copied the line you gave me from this email, where specified, so I don't think it's that.

rob


To unsubscribe from the hobbit list, send an e-mail to hobbit-unsubscribe (at) hswn.dk




client doisneau.linux
[date]
Wed Feb 28 15:02:02 EST 2007
[uname]
Linux doisneau.office.solutionsforprogress.com 2.6.11 #1 SMP Wed Nov 16 14:07:49 EST 2005 i686 GNU/Linux
[uptime]
 15:02:02 up 296 days, 19:03,  0 users,  load average: 0.00, 0.00, 0.00
[who]
[df]
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0  2006 ?        00:00:24 init [2]         
root         2     1  0  2006 ?        00:00:00 [migration/0]
root         3     1  0  2006 ?        00:00:00 [ksoftirqd/0]
root         4     1  0  2006 ?        00:02:36 [events/0]
root         5     1  0  2006 ?        00:00:00 [khelper]
root        10     1  0  2006 ?        00:00:00 [kthread]
root        18    10  0  2006 ?        00:00:00 [kacpid]
root        63    10  0  2006 ?        00:03:32 [kblockd/0]
root       128    10  0  2006 ?        02:04:41 [pdflush]
root       131    10  0  2006 ?        00:00:00 [aio/0]
root       130     1  0  2006 ?        00:00:01 [kswapd0]
root       132     1  0  2006 ?        00:00:00 [cifsoplockd]
root       717     1  0  2006 ?        00:00:00 [kseriod]
root       782    10  0  2006 ?        00:00:00 [ata/0]
root       799    10  0  2006 ?        00:27:33 [reiserfs/0]
root      1122     1  0  2006 ?        02:30:05 /sbin/syslogd
root      1125     1  0  2006 ?        00:00:00 /sbin/klogd
mail      1148     1  0  2006 ?        00:00:00 /usr/lib/exim/exim3 -bd -q30m
uucp      1223     1  0  2006 ?        00:00:00 /usr/sbin/faxq
uucp      1225     1  0  2006 ?        00:00:06 /usr/sbin/hfaxd -i 4559
root      1240     1  0  2006 ?        00:00:00 /usr/sbin/inetd
daemon    1350     1  0  2006 ?        00:00:00 /usr/sbin/atd
root      1353     1  0  2006 ?        00:00:03 /usr/sbin/cron
root      1362     1  0  2006 tty2     00:00:00 /sbin/getty 38400 tty2
root      1363     1  0  2006 tty3     00:00:00 /sbin/getty 38400 tty3
root      1364     1  0  2006 tty4     00:00:00 /sbin/getty 38400 tty4
root      1365     1  0  2006 tty5     00:00:00 /sbin/getty 38400 tty5
root      1366     1  0  2006 tty6     00:00:00 /sbin/getty 38400 tty6
uucp      1367     1  0  2006 ?        00:01:46 /usr/sbin/faxgetty ttyS0
root      5268     1  0  2006 ?        00:00:28 /home/hobbit/client/bin/hobbitlaunch --config=/home/hobbit/client/etc/clientlaunch.cfg --log=/home/hobbit/client/logs/clientlaunch.log --pidfile=/home/hobbit/client/logs/clientlaunch.pid
root     19815     1  0  2006 ?        00:00:00 /usr/local/sbin/osirisd -r /usr/local/osiris
osiris   19816 19815  0  2006 ?        00:00:02 /usr/local/sbin/osirisd -r /usr/local/osiris
root     31765     1  0  2006 ?        00:00:23 /usr/sbin/ntpd -p /var/run/ntpd.pid
postgres 27103     1  0  2006 ?        00:00:13 /usr/lib/postgresql/bin/postmaster -D /var/lib/postgres/data
postgres 27109 27103  0  2006 ?        00:07:27 postgres: stats buffer process                              
postgres 27110 27109  0  2006 ?        00:07:24 postgres: stats collector process                           
root     29971     1  0  2006 tty1     00:00:00 /sbin/getty 38400 tty1
root     12487    10  0  2006 ?        00:35:10 [pdflush]
root     14714     1  0  2006 ?        00:00:19 /usr/sbin/sshd
root     19790     1  0 Jan06 ?        00:15:22 /usr/sbin/slapd -h ldap://127.0.0.1:389/ ldaps:/// ldapi:///
root     20416     1  0 Feb26 ?        00:00:00 ntpd
root     15365  1353  0 06:25 ?        00:00:00 /USR/SBIN/CRON
root     15366 15365  0 06:25 ?        00:00:00 /bin/sh -c test -x /usr/sbin/anacron || run-parts --report /etc/cron.daily
root     15367 15366  0 06:25 ?        00:00:00 run-parts --report /etc/cron.daily
root     15698 15367  0 06:25 ?        00:00:00 [jabber-restart] <defunct>
mail     15804 15365  0 06:25 ?        00:00:00 /usr/sbin/sendmail -i -FCronDaemon -oem root
jabber   15964     1  0 06:40 ?        00:00:02 perl -w -x /usr/local/jabberd2/bin/jabberd
jabber   15966     1  0 06:40 ?        00:00:01 /usr/local/jabberd2/bin/mu-conference -c /etc/jabberd/muc-conf.xml
jabber   15967 15964  0 06:40 ?        00:00:09 /usr/local/jabberd2/bin/router -c /usr/local/jabberd2/etc/jabberd/router.xml
jabber   15968 15964  0 06:40 ?        00:00:00 /usr/local/jabberd2/bin/resolver -c /usr/local/jabberd2/etc/jabberd/resolver.xml
jabber   15969 15964  0 06:40 ?        00:00:15 /usr/local/jabberd2/bin/sm -c /usr/local/jabberd2/etc/jabberd/sm.xml
jabber   15970 15964  0 06:40 ?        00:00:00 /usr/local/jabberd2/bin/s2s -c /usr/local/jabberd2/etc/jabberd/s2s.xml
postgres 15971 27103  0 06:40 ?        00:00:26 postgres: jabber jabberd2 127.0.0.1 idle                    
jabber   15972 15964  0 06:40 ?        00:00:13 /usr/local/jabberd2/bin/c2s -c /usr/local/jabberd2/etc/jabberd/c2s.xml
root     21205     1  0 14:17 ?        00:00:00 /home/hobbit/client/bin/hobbitlaunch --config=/home/hobbit/client/etc/clientlaunch.cfg --log=/home/hobbit/client/logs/clientlaunch.log --pidfile=/home/hobbit/client/logs/clientlaunch.pid
root     21706 21205  0 15:02 ?        00:00:00 /bin/sh /home/hobbit/client/bin/hobbitclient.sh
root     21707 21706  0 15:02 ?        00:00:00 /bin/sh /home/hobbit/client/bin/hobbitclient-linux.sh
root     21716 21707  0 15:02 ?        00:00:00 ps -efw
root     21719  5268  0 15:02 ?        00:00:00 /bin/sh /home/hobbit/client/bin/hobbitclient.sh
root     21720 21719  0 15:02 ?        00:00:00 /bin/sh /home/hobbit/client/bin/hobbitclient-linux.sh
root     21725 21720  0 15:02 ?        00:00:00 df -Pl -x none -x tmpfs -x shmfs -x unknown
root     21726 21720  0 15:02 ?        00:00:00 [sed]
[top]
Filesystem         1024-blocks      Used Available Capacity Mounted on
/dev/hda3               979928     74584    905344       8% /
/dev/hda1                64220     41272     22948      65% /boot
/dev/mapper/root_vg-usr   2097084    771940   1325144      37% /usr
/dev/mapper/root_vg-home  23067964     70744  22997220       1% /home
/dev/mapper/root_vg-var   2097084   1582168    514916      76% /var
[meminfo]
[free]
             total       used       free     shared    buffers     cached
Mem:       1033484     991744      41740          0     191108     688472
-/+ buffers/cache:     112164     921320
Swap:      2000084          0    2000084
[netstat]
Ip:
    85417010 total packets received
    0 forwarded
    0 incoming packets discarded
    85416799 incoming packets delivered
    86471394 requests sent out
Icmp:
    158140 ICMP messages received
    511 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 4914
        timeout in transit: 204
        source quenches: 1
        redirects: 1
        echo requests: 153018
        echo replies: 1
    153092 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 74
        echo replies: 153018
Tcp:
    182959 active connections openings
    199005 passive connection openings
    1 failed connection attempts
    66529 connection resets received
    65 connections established
    71891630 segments received
    72925768 segments send out
    73548 segments retransmited
    3 bad segments received.
    493641 resets sent
Udp:
    13366871 packets received
    73 packets to unknown port received.
    0 packet receive errors
    13392531 packets sent
TcpExt:
    50 resets received for embryonic SYN_RECV sockets
    1341 packets pruned from receive queue because of socket buffer overrun
    2257 ICMP packets dropped because they were out-of-window
    124455 TCP sockets finished time wait in fast timer
    41 time wait sockets recycled by time stamp
    14 packets rejects in established connections because of timestamp
    3115808 delayed acks sent
    4714 delayed acks further delayed because of locked socket
    Quick ack mode was activated 10841 times
    2189784 packets directly queued to recvmsg prequeue.
    352892 of bytes directly received from backlog
    365882019 of bytes directly received from prequeue
    12389655 packet headers predicted
    1740831 packets header predicted and directly queued to user
    8273781 acknowledgments not containing data received
    14982537 predicted acknowledgments
    14 times recovered from packet loss due to fast retransmit
    261 times recovered from packet loss due to SACK data
    TCPDSACKUndo: 10
    1123 congestion windows recovered after partial ack
    190 TCP data loss events
    36 timeouts after reno fast retransmit
    779 timeouts after SACK recovery
    146 timeouts in loss state
    361 fast retransmits
    6 forward retransmits
    406 retransmits in slow start
    22631 other TCP timeouts
    TCPRenoRecoveryFail: 3
    67 sack retransmits failed
    42 times receiver scheduled too late for direct processing
    30743 packets collapsed in receive queue due to low socket buffer
    10796 DSACKs sent for old packets
    1 DSACKs sent for out of order packets
    1170 DSACKs received
    77733 connections reset due to unexpected data
    1063 connections reset due to early user close
    2843 connections aborted due to timeout
[ps]
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0  2006 ?        00:00:24 init [2]         
root         2     1  0  2006 ?        00:00:00 [migration/0]
root         3     1  0  2006 ?        00:00:00 [ksoftirqd/0]
root         4     1  0  2006 ?        00:02:36 [events/0]
root         5     1  0  2006 ?        00:00:00 [khelper]
root        10     1  0  2006 ?        00:00:00 [kthread]
root        18    10  0  2006 ?        00:00:00 [kacpid]
root        63    10  0  2006 ?        00:03:32 [kblockd/0]
root       128    10  0  2006 ?        02:04:41 [pdflush]
root       131    10  0  2006 ?        00:00:00 [aio/0]
root       130     1  0  2006 ?        00:00:01 [kswapd0]
root       132     1  0  2006 ?        00:00:00 [cifsoplockd]
root       717     1  0  2006 ?        00:00:00 [kseriod]
root       782    10  0  2006 ?        00:00:00 [ata/0]
root       799    10  0  2006 ?        00:27:33 [reiserfs/0]
root      1122     1  0  2006 ?        02:30:05 /sbin/syslogd
root      1125     1  0  2006 ?        00:00:00 /sbin/klogd
mail      1148     1  0  2006 ?        00:00:00 /usr/lib/exim/exim3 -bd -q30m
uucp      1223     1  0  2006 ?        00:00:00 /usr/sbin/faxq
uucp      1225     1  0  2006 ?        00:00:06 /usr/sbin/hfaxd -i 4559
root      1240     1  0  2006 ?        00:00:00 /usr/sbin/inetd
daemon    1350     1  0  2006 ?        00:00:00 /usr/sbin/atd
root      1353     1  0  2006 ?        00:00:03 /usr/sbin/cron
root      1362     1  0  2006 tty2     00:00:00 /sbin/getty 38400 tty2
root      1363     1  0  2006 tty3     00:00:00 /sbin/getty 38400 tty3
root      1364     1  0  2006 tty4     00:00:00 /sbin/getty 38400 tty4
root      1365     1  0  2006 tty5     00:00:00 /sbin/getty 38400 tty5
root      1366     1  0  2006 tty6     00:00:00 /sbin/getty 38400 tty6
uucp      1367     1  0  2006 ?        00:01:46 /usr/sbin/faxgetty ttyS0
root      5268     1  0  2006 ?        00:00:28 /home/hobbit/client/bin/hobbitlaunch --config=/home/hobbit/client/etc/clientlaunch.cfg --log=/home/hobbit/client/logs/clientlaunch.log --pidfile=/home/hobbit/client/logs/clientlaunch.pid
root     19815     1  0  2006 ?        00:00:00 /usr/local/sbin/osirisd -r /usr/local/osiris
osiris   19816 19815  0  2006 ?        00:00:02 /usr/local/sbin/osirisd -r /usr/local/osiris
root     31765     1  0  2006 ?        00:00:23 /usr/sbin/ntpd -p /var/run/ntpd.pid
postgres 27103     1  0  2006 ?        00:00:13 /usr/lib/postgresql/bin/postmaster -D /var/lib/postgres/data
postgres 27109 27103  0  2006 ?        00:07:27 postgres: stats buffer process                              
postgres 27110 27109  0  2006 ?        00:07:24 postgres: stats collector process                           
root     29971     1  0  2006 tty1     00:00:00 /sbin/getty 38400 tty1
root     12487    10  0  2006 ?        00:35:10 [pdflush]
root     14714     1  0  2006 ?        00:00:19 /usr/sbin/sshd
root     19790     1  0 Jan06 ?        00:15:22 /usr/sbin/slapd -h ldap://127.0.0.1:389/ ldaps:/// ldapi:///
root     20416     1  0 Feb26 ?        00:00:00 ntpd
root     15365  1353  0 06:25 ?        00:00:00 /USR/SBIN/CRON
root     15366 15365  0 06:25 ?        00:00:00 /bin/sh -c test -x /usr/sbin/anacron || run-parts --report /etc/cron.daily
root     15367 15366  0 06:25 ?        00:00:00 run-parts --report /etc/cron.daily
root     15698 15367  0 06:25 ?        00:00:00 [jabber-restart] <defunct>
mail     15804 15365  0 06:25 ?        00:00:00 /usr/sbin/sendmail -i -FCronDaemon -oem root
jabber   15964     1  0 06:40 ?        00:00:02 perl -w -x /usr/local/jabberd2/bin/jabberd
jabber   15966     1  0 06:40 ?        00:00:01 /usr/local/jabberd2/bin/mu-conference -c /etc/jabberd/muc-conf.xml
jabber   15967 15964  0 06:40 ?        00:00:09 /usr/local/jabberd2/bin/router -c /usr/local/jabberd2/etc/jabberd/router.xml
jabber   15968 15964  0 06:40 ?        00:00:00 /usr/local/jabberd2/bin/resolver -c /usr/local/jabberd2/etc/jabberd/resolver.xml
jabber   15969 15964  0 06:40 ?        00:00:15 /usr/local/jabberd2/bin/sm -c /usr/local/jabberd2/etc/jabberd/sm.xml
jabber   15970 15964  0 06:40 ?        00:00:00 /usr/local/jabberd2/bin/s2s -c /usr/local/jabberd2/etc/jabberd/s2s.xml
postgres 15971 27103  0 06:40 ?        00:00:26 postgres: jabber jabberd2 127.0.0.1 idle                    
jabber   15972 15964  0 06:40 ?        00:00:13 /usr/local/jabberd2/bin/c2s -c /usr/local/jabberd2/etc/jabberd/c2s.xml
root     21205     1  0 14:17 ?        00:00:00 /home/hobbit/client/bin/hobbitlaunch --config=/home/hobbit/client/etc/clientlaunch.cfg --log=/home/hobbit/client/logs/clientlaunch.log --pidfile=/home/hobbit/client/logs/clientlaunch.pid
root     21706 21205  0 15:02 ?        00:00:00 /bin/sh /home/hobbit/client/bin/hobbitclient.sh
root     21707 21706  0 15:02 ?        00:00:00 /bin/sh /home/hobbit/client/bin/hobbitclient-linux.sh
root     21719  5268  0 15:02 ?        00:00:00 /bin/sh /home/hobbit/client/bin/hobbitclient.sh
root     21720 21719  0 15:02 ?        00:00:00 /bin/sh /home/hobbit/client/bin/hobbitclient-linux.sh
root     21729 21707  0 15:02 ?        00:00:00 sh -c vmstat 300 2 1>/home/hobbit/client/tmp/hobbit_vmstat.21707 2>&1; mv /home/hobbit/client/tmp/hobbit_vmstat.21707 /home/hobbit/client/tmp/hobbit_vmstat
root     21730 21707  0 15:02 ?        00:00:00 sleep 5
root     21731 21729  0 15:02 ?        00:00:00 vmstat 300 2
root     21734 21720  0 15:02 ?        00:00:00 ps -efw
[top]
[vmstat]
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 3  0      0  41636 191104 688216    0    0     0     0    1     2  1  1 98  0
 2  0      0  38220 191108 688472    0    0     0     7 1007     0  0  0 99  0
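One detail in the report above may be worth a look: the [ps] section lists two hobbitlaunch processes (PIDs 5268 and 21205), each with its own hobbitclient.sh child started at 15:02. Whether that overlap is what mixed up the [df] and [top] sections is only a guess, but it is cheap to check for. A minimal Python sketch, assuming `ps -ef` style rows with the command in the eighth column; `find_processes` is a hypothetical helper, and the sample rows are copied from the report:

```python
import os

def find_processes(ps_lines, program):
    """Return (pid, ppid) for `ps -ef` rows whose command is `program`."""
    matches = []
    for line in ps_lines:
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8 or fields[0] == "UID":
            continue
        command = fields[7].split()[0]
        if os.path.basename(command) == program:
            matches.append((int(fields[1]), int(fields[2])))
    return matches

# Rows copied from the [ps] section of the attached report.
ps_rows = [
    "UID        PID  PPID  C STIME TTY          TIME CMD",
    "root      5268     1  0  2006 ?        00:00:28 /home/hobbit/client/bin/hobbitlaunch --config=/home/hobbit/client/etc/clientlaunch.cfg --log=/home/hobbit/client/logs/clientlaunch.log --pidfile=/home/hobbit/client/logs/clientlaunch.pid",
    "root     21205     1  0 14:17 ?        00:00:00 /home/hobbit/client/bin/hobbitlaunch --config=/home/hobbit/client/etc/clientlaunch.cfg --log=/home/hobbit/client/logs/clientlaunch.log --pidfile=/home/hobbit/client/logs/clientlaunch.pid",
    "root     21706 21205  0 15:02 ?        00:00:00 /bin/sh /home/hobbit/client/bin/hobbitclient.sh",
    "root     21719  5268  0 15:02 ?        00:00:00 /bin/sh /home/hobbit/client/bin/hobbitclient.sh",
]

launchers = find_processes(ps_rows, "hobbitlaunch")
print(launchers)
# [(5268, 1), (21205, 1)]
if len(launchers) > 1:
    print("more than one hobbitlaunch instance is running")
```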