David Gore wrote:
Henrik Stoerner wrote:
On Thu, Jul 13, 2006 at 07:09:11PM +0000, David Gore wrote:
We have seen this with recent snapshots and the latest release
candidate client. logfetch hangs, which causes the client to hang
and go purple for all the tests. It can be resolved by killing
logfetch and deleting all the entries in ~/client/tmp. We could
try to be more surgical about which files we delete. This has
happened on two entirely independent hosts running Solaris 8, one being
a SunFire 880 and the other an E4500/E5500.
Suggestions? It can run for many days before hanging.
That's obviously interesting.
When it hangs, is it just dead? Or is it hogging the CPU (as it would
do if it were in a tight loop somewhere in the code)?
CPU hogging, yes.
The hosts you monitor where this happens ... what kind of entries in
client-local.cfg do you have for them? Any "dir" entries, for
instance?
Those do run an external program (du), which is always something that
is harder to control.
No "dir" entries, just "file" and "log".
When it happens again, could you please try to kill it with
"kill -ABRT <logfetchPID>"? That should cause it to dump core,
and it will be much easier to see where it hangs with a core
dump. Once you have the core dump, running it through gdb as described
in Help->Known Problems->How to report bugs will give me much
more to work on.
It might take a few days, but we will certainly do that and see what it
shows. As always, thank you for the hard work!
Sooner than I expected, here is the backtrace:
GNU gdb 6.0
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "sparc-sun-solaris2.8"...
Core was generated by `/export/home/nmsbb/client/bin/logfetch /export/home/nmsbb/client/tmp/logfetch.o'.
Program terminated with signal 6, Aborted.
Reading symbols from /usr/lib/libc.so.1...done.
Loaded symbols for /usr/lib/libc.so.1
Reading symbols from /usr/lib/libdl.so.1...done.
Loaded symbols for /usr/lib/libdl.so.1
Reading symbols from /usr/platform/SUNW,Sun-Fire-880/lib/libc_psr.so.1...done.
Loaded symbols for /usr/platform/SUNW,Sun-Fire-880/lib/libc_psr.so.1
#0 0xff3906e8 in memcpy () from /usr/platform/SUNW,Sun-Fire-880/lib/libc_psr.so.1
(gdb) bt
#0 0xff3906e8 in memcpy () from /usr/platform/SUNW,Sun-Fire-880/lib/libc_psr.so.1
#1 0x00012e10 in logdata (filename=0xffbef5a0 "", logdef=0x38738, truncated=0xffbef6c4) at logfetch.c:192
#2 0x000142f4 in main (argc=215040, argv=0x34c00) at logfetch.c:844
I took a look at one of my co-worker's entries in client-local.cfg:
ignore DEBUG|WARN|^at.*)$
I put a backslash in front of the closing paren:
ignore DEBUG|WARN|^at.*\)$
Perhaps that may have been why it was hanging?
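
One way to sanity-check "ignore" patterns before they go into
client-local.cfg is to try compiling them first. Below is a minimal sketch
in Python. It assumes Python's "re" module as a stand-in, which is not
necessarily the regex engine logfetch uses, so it only flags syntax problems
such as an unbalanced parenthesis; it does not show whether the pattern was
what sent logfetch into its loop. check_ignore_pattern is a hypothetical
helper name, not part of the client.

```python
import re


def check_ignore_pattern(pattern):
    """Return None if the pattern compiles, else the compile error text.

    Assumption: Python's re syntax is used here purely for illustration;
    the regex library inside logfetch may accept or reject different input.
    """
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # The two patterns quoted in this thread: before and after the fix.
    patterns = [
        r"DEBUG|WARN|^at.*)$",
        r"DEBUG|WARN|^at.*\)$",
    ]
    for p in patterns:
        problem = check_ignore_pattern(p)
        status = "ok" if problem is None else "rejected: " + problem
        print(f"{p!r}: {status}")
```

Run against the two patterns above, the unescaped version is reported as
rejected and the escaped one as ok, which at least matches the suspicion
that the stray ")" was malformed input.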