[hobbit] False Process Down Alerts

Chris Naude chris.naude.0 at gmail.com
Tue Jan 19 01:46:54 CET 2010


I never received any alerts about messages being truncated. After disabling
the non-prod clients, I started receiving alerts about the messages being
truncated. I adjusted the values as specified below and they are good now.
Tomorrow I'll enable the non-prod servers again and see whether this was
the original culprit. Thanks!



On Mon, Jan 18, 2010 at 12:41 PM, Williams, Doug (Consultant-RIC) <
Doug.Williams at rhd.com> wrote:

> Seems to me your clients' data is being truncated.  Try modifying this in
> your hobbitserver.cfg.  You may want to set these to an appropriate size
> for your Xymon server.  I have Xymon running on pretty beefy servers, so
> I set them incredibly high; they may exceed what Xymon actually allows,
> but it is not hurting me.  Restart the hobbit server after making the
> change to hobbitserver.cfg.
>
>
>
> MAXMSG_STATUS=30000000
> MAXMSG_CLIENT=30000000
> MAXMSG_DATA=30000000
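A minimal sketch of applying the change above. The config path here is a
stand-in (point CFG at your real hobbitserver.cfg), and the sample contents
are invented for illustration; remember to restart the hobbit server
afterwards.

```shell
# Raise the MAXMSG_* limits in hobbitserver.cfg.
CFG=$(mktemp)                        # stand-in for /path/to/hobbitserver.cfg
printf 'MAXMSG_STATUS=256\nMAXMSG_CLIENT=512\n' > "$CFG"   # sample contents
for v in MAXMSG_STATUS MAXMSG_CLIENT MAXMSG_DATA; do
  if grep -q "^${v}=" "$CFG"; then
    sed -i "s/^${v}=.*/${v}=30000000/" "$CFG"   # raise an existing setting
  else
    echo "${v}=30000000" >> "$CFG"              # add it if absent
  fi
done
grep 'MAXMSG' "$CFG"
```

On this sample input, all three variables end up set to 30000000, whether
they were present before or not.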
>
>
> -----Original Message-----
> From: Chris Naude [mailto:chris.naude.0 at gmail.com]
> Sent: Monday, January 18, 2010 2:21 PM
> To: hobbit at hswn.dk
> Subject: Re: [hobbit] False Process Down Alerts
>
> I've managed to stop the flood of false alerts. I removed all of my
> non-prod clients from the bb-hosts and shut off their client processes.
> The problem seems to be somehow related to the amount of data the Xymon
> server is trying to process.
>
>
> On Sun, Jan 17, 2010 at 5:08 PM, Chris Naude <chris.naude.0 at gmail.com>
> wrote:
>
>
>        I have 7 clients running. Each client has a different name. They
> are all sending data to the primary Xymon server. The alerts report
> missing processes, full file systems, and msgs errors. Here is another
> sample of an unusual error. You can see the process list has a funky
> break in it.
>
>
>         Sun Jan 17 15:40:18 MST 2010 - Processes NOT ok
>
>          yellow<http://unixadmin.bestwestern.com/xymon/gifs/yellow.gif>
> Expected string COMMAND not found in ps output header
>
>          PID  PPID USER
>          STIM] S PRI  %CPU     TIME     VSZ COMMAND
>            0     0 root      Dec 14  S 127  0.16 00:40:00       0
> swapper
>            1     0 root      Dec 14  R 152  0.09 00:01:21    2064 init
>           48     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>           45     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>           42     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>           31     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>           30     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>           29     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>           28     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>           26     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>            5     0 root      Dec 14  R 152  0.00 00:00:02       0
> signald
>            6     0 root      Dec 14  R 152  0.00 00:00:03       0
> kmemdaemon
>           17     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>           16     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>           15     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>           14     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>           13     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>           12     0 root      Dec 14  S 152  0.00 00:00:00       0
> usbhubd
>           11     0 root      Dec 14  R 152  0.00 00:01:11       0
> escsid
>           10     0 root      Dec 14  S -32  0.00 00:00:00       0 ttisr
>            9     0 root      Dec 14  R 152  0.00 00:01:27       0
> ksyncer_daemon
>
>        7     0]root      Dec 14  R 152
>         0.00 00:]0:00       0 kai_daemon
>           50     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>           47     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>           44     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>           41     0 root      Dec 14  S 152  0.00 00:00:00       0
> net_str_cached
>
>        On Sun, Jan 17, 2010 at 4:21 PM, Josh Luthman
> <josh at imaginenetworksllc.com> wrote:
>
>
>                Is there only one client sending data as this name?  I
> don't think you answered Lars' email.
>
>                What does the alert read and what does the data say?
> Missing process?  Too high of a load?
>
>                Josh Luthman
>                Office: 937-552-2340
>                Direct: 937-552-2343
>                1100 Wayne St
>                Suite 1337
>                Troy, OH 45373
>
>                "The secret to creativity is knowing how to hide your
> sources."
>                --- Albert Einstein
>
>
>
>                On Sun, Jan 17, 2010 at 6:11 PM, Chris Naude
> <chris.naude.0 at gmail.com> wrote:
>
>
>                        The problem has suddenly become much, much worse.
> I verified with tcpdump that the data coming from the client is 100%
> correct. It seems something on the Xymon server side is not handling the
> client data correctly. Anyone have any other ideas?
>
>                         red 89%     /testdb3 (37771472% used) has
> reached the PANIC level (95%)
>
>                        Filesystem            1024-blocks  Used
> Available Capacity Mounted on
>                        /dev/vgtestdb1/lvol1    107844344 70901816
> 36942528    66%     /testdb1
>                        /dev/vgtestdb2/lvol1    35962064 25453128
> 10508936    71%     /testdb2
>                        /dev/vgtestdb4/lvol1    970909400 825006344
> 145903056    85%     /testdb4
>                        /dev/vgtestdb3/lv
>                        l1 ]  338788224 301016752 37771472    89%
> /testdb3
>                        /dev/vgtestdb5/lvol1    179789048 150553912
> 29235136    84%     /testdb5
>                        /dev/vg00/lvol8       24580711    74501 24506210
> 1%     /home
>                        /dev/vg00/lvol4       10226680  6339283  3887397
> 62%     /opt
>
>
>                        On Sat, Jan 16, 2010 at 10:44 AM, Chris Naude
> <chris.naude.0 at gmail.com> wrote:
>
>
>                                That makes a lot of sense. I did have
> some issues with the startup scripts on HP-UX. I'll check it out later
> tonight. Hopefully I can get it fixed before it goes live tonight.
> Thanks!
>
>
>                                On Sat, Jan 16, 2010 at 7:56 AM, Lars
> Ebeling <lars.ebeling at leopg9.no-ip.org> wrote:
>
>
>                                        It looks like two instances of
> the client are writing to the file at the same time or almost ;)
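One quick way to test Lars' hypothesis is to count client instances on the
box. The process name below is an assumption based on the ps output earlier
in the thread; more than one hobbitlaunch process usually means the startup
script ran twice.

```shell
# Count hobbitlaunch entries in ps output; duplicates suggest two clients.
count_clients() {
  grep -c '[h]obbitlaunch'   # bracket trick keeps grep out of its own match
}
# Demo against a captured ps snippet; on a live box, pipe `ps -ef` instead.
printf '%s\n' \
  'xymon  3547 1 hobbitlaunch --config=/opt/xymon/client/etc/clientlaunch.cfg' \
  'xymon  3712 1 hobbitlaunch --config=/opt/xymon/client/etc/clientlaunch.cfg' \
  | count_clients
```

The demo input deliberately contains two instances, so the count printed is 2.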
>
>
>                                        Lars
>
>                                                ----- Original Message
> -----
>                                                From: Chris Naude
> <mailto:chris.naude.0 at gmail.com>
>                                                 To: hobbit at hswn.dk
>                                                Sent: Saturday, January
> 16, 2010 4:59 AM
>                                                Subject: [hobbit] False
> Process Down Alerts
>
>                                                I've run into a strange
> problem with my Xymon server. I noticed today that I'm receiving random
> false alerts for processes being down. When I look at the process list
> output in the alert it looks as if the data coming from the clients
> isn't correct. Here is an example. Has anyone seen anything like this?
>
>                                                 9613  1944 root
> Jan 11  S 154  0.00 00:00:00    6128 cmclconfd -c
>                                                10389  1944 root
> Jan 11  S 154  0.00 00:00:00    6128 cmclconfd -c
>                                                 9794     1 oracle
> 10:55:57 S 154  0.00 00:00:0
>                                                  217600]oracleTEST
> (LOCAL=NO)
>                                                 1592     1 oracle
> Jan 11  S 154  0.00 00:00:11  217136 ora_mman_TEST
>                                                12751  1944 root
> Jan 11  S 154  0.00 00:00:00    6128 cmclconfd -c
>                                                 8965  1944 root
> Jan 11  S 154  0.00 00:00:00    6128 cmclconfd -c
>
>                                                11819     1 oracle
> Jan 12  S 154  0.00 00:00:07  217280 ora_j015_TEST
>                                                 2711     1 roo
>                                                      ]ec  4  S 120
> 0.04 00:02:16     868 /usr/sbin/xntpd
>                                                 3547     1 xymon
> Dec  4  S 168  0.00 00:00:43     268 /opt/xymon/client/bin/hobbitlaunch
> --config=/opt/xymon/client/etc/clientlaunch.cfg
> --log=/opt/xymon/client/logs/clientlaunch.log
> --pidfile=/opt/xymon/client/logs/clientlaunch.101.example.com.pid
>                                                 3728     1 root
> Dec  4  R 152  0.00 00:00:37    4208
> /usr/sbin/stm/uut/bin/tools/monitor/WbemWrapperMonitor
>
>
>                                                Xymon version:
> 4.3.0-0.beta2
>                                                Xymon server: CentOS 5.4
> 32 bit
>
>                                                Client: HP-UX 11.31
> Itanium
>
>                                                --
>                                                Chris Naude
>
>
>
>
>
>                                --
>                                Chris Naude
>
>
>
>
>
>                        --
>                        Chris Naude
>
>
>
>
>
>
>        --
>        Chris Naude
>
>
>
>
>
> --
> Chris Naude
>
>
> To unsubscribe from the hobbit list, send an e-mail to
> hobbit-unsubscribe at hswn.dk
>
>
>


-- 
Chris Naude

