[hobbit] False Process Down Alerts
Odinn
odinn_asgaard at yahoo.com
Mon Jan 18 21:03:37 CET 2010
My Xymon server monitors over 1500 clients with no issues. Whenever I have seen false alerts, the cause has always been a configuration mistake on my part: two servers in my bb-hosts file using the same name on different IPs.
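
A minimal sketch of how one might check for that kind of clash, assuming the usual "IP hostname # tags" layout of bb-hosts entries. The function names and the sample entries are made up for illustration and are not part of Xymon:

```python
from collections import defaultdict


def _looks_like_ipv4(text):
    """True if text is a dotted-quad IPv4 address."""
    parts = text.split(".")
    return len(parts) == 4 and all(p.isdigit() and 0 <= int(p) <= 255 for p in parts)


def find_duplicate_hosts(lines):
    """Return {hostname: sorted IPs} for hostnames listed with more than one IP."""
    ips_by_host = defaultdict(set)
    for raw in lines:
        # Drop everything after '#' (tags and comments), then split into fields.
        fields = raw.split("#", 1)[0].split()
        # Host entries start with an IP; skip page/group/include directives.
        if len(fields) < 2 or not _looks_like_ipv4(fields[0]):
            continue
        ips_by_host[fields[1].lower()].add(fields[0])
    return {host: sorted(ips) for host, ips in ips_by_host.items() if len(ips) > 1}


# Hypothetical bb-hosts content, for illustration only.
sample = [
    "page prod Production",
    "10.0.0.1 db01 # ssh",
    "10.0.0.2 web01 # http://web01/",
    "10.0.0.3 db01 # ssh",
]
print(find_duplicate_hosts(sample))
```

To run it against a real file, pass the open file object instead of the sample list. The same name appearing twice with the same IP is not flagged, since that only matters here when the IPs differ.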
--
Jim Sloan
Just remember, today is the day you thought tomorrow was going to be yesterday.
________________________________
From: Chris Naude <chris.naude.0 at gmail.com>
To: hobbit at hswn.dk
Sent: Mon, January 18, 2010 2:20:43 PM
Subject: Re: [hobbit] False Process Down Alerts
I've managed to stop the flood of false alerts. I removed all of my non-prod clients from the bb-hosts and shut off their client processes. The problem seems to be somehow related to the amount of data the Xymon server is trying to process.
On Sun, Jan 17, 2010 at 5:08 PM, Chris Naude <chris.naude.0 at gmail.com> wrote:
>I have 7 clients running, each with a different name, and all of them send data to the primary Xymon server. The alerts report missing processes, full file systems, and msgs errors. Here is another sample of an unusual error; you can see the process list has a funky break in it.
>
> Sun Jan 17 15:40:18 MST 2010 - Processes NOT ok
> Expected string COMMAND not found in ps output header
>
> PID PPID USER
> STIM] S PRI %CPU TIME VSZ COMMAND
> 0 0 root Dec 14 S 127 0.16 00:40:00 0 swapper
> 1 0 root Dec 14 R 152 0.09 00:01:21 2064 init
> 48 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
> 45 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
> 42 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
> 31 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
> 30 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
> 29 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
> 28 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
> 26 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
> 5 0 root Dec 14 R 152 0.00 00:00:02 0 signald
> 6 0 root Dec 14 R 152 0.00 00:00:03 0 kmemdaemon
> 17 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
> 16 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
> 15 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
> 14 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
> 13 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
> 12 0 root Dec 14 S 152 0.00 00:00:00 0 usbhubd
> 11 0 root Dec 14 R 152 0.00 00:01:11 0 escsid
> 10 0 root Dec 14 S -32 0.00 00:00:00 0 ttisr
> 9 0 root Dec 14 R 152 0.00 00:01:27 0 ksyncer_daemon
>
>7 0]root Dec 14 R 152
> 0.00 00:]0:00 0 kai_daemon
> 50 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
> 47 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
> 44 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
> 41 0 root Dec 14 S 152 0.00 00:00:00 0 net_str_cached
>
>
>
>On Sun, Jan 17, 2010 at 4:21 PM, Josh Luthman <josh at imaginenetworksllc.com> wrote:
>
>>Is there only one client sending data as this name? I don't think you answered Lars' email.
>>
>>What does the alert read and what does the data say? Missing process? Too high of a load?
>>
>>Josh Luthman
>>
>>
>>Office: 937-552-2340
>>Direct: 937-552-2343
>>1100 Wayne St
>>Suite 1337
>>Troy, OH 45373
>>
>>"The secret to creativity is knowing how to hide your sources."
>>--- Albert Einstein
>>
>>
>>
>>
>>On Sun, Jan 17, 2010 at 6:11 PM, Chris Naude <chris.naude.0 at gmail.com> wrote:
>>
>>>
>>>
>>>The problem has suddenly become much, much worse. I verified with tcpdump that the data coming from the client is 100% correct, so it seems something on the Xymon server side is not handling the client data correctly. Does anyone have any other ideas?
>>>
>>>
>>>
>>>
>>> 89% /testdb3 (37771472% used) has reached the PANIC level (95%)
>>>
>>>Filesystem 1024-blocks Used Available Capacity Mounted on
>>>/dev/vgtestdb1/lvol1 107844344 70901816 36942528 66% /testdb1
>>>/dev/vgtestdb2/lvol1 35962064 25453128 10508936 71% /testdb2
>>>/dev/vgtestdb4/lvol1 970909400 825006344 145903056 85% /testdb4
>>>/dev/vgtestdb3/lv
>>>l1 ] 338788224 301016752 37771472 89% /testdb3
>>>/dev/vgtestdb5/lvol1 179789048 150553912 29235136 84% /testdb5
>>>/dev/vg00/lvol8 24580711 74501 24506210 1% /home
>>>/dev/vg00/lvol4 10226680 6339283 3887397 62% /opt
>>>
>>>
>>>
>>>
>>>On Sat, Jan 16, 2010 at 10:44 AM, Chris Naude <chris.naude.0 at gmail.com> wrote:
>>>
>>>>That makes a lot of sense. I did have some issues with the startup scripts on HP-UX. I'll check it out later tonight. Hopefully I can get it fixed before it goes live tonight. Thanks!
>>>>
>>>>
>>>>
>>>>On Sat, Jan 16, 2010 at 7:56 AM, Lars Ebeling <lars.ebeling at leopg9.no-ip.org> wrote:
>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>It looks like two instances of the client are writing to the file at the same time, or almost ;)
>>>>> Lars
>>>>>----- Original Message -----
>>>>>>
>>>>>>From: Chris Naude
>>>>>>To: hobbit at hswn.dk
>>>>>>Sent: Saturday, January 16, 2010 4:59 AM
>>>>>>Subject: [hobbit] False Process Down Alerts
>>>>>>
>>>>>>I've run into a strange problem with my Xymon server. I noticed
>>>>>> today that I'm receiving random false alerts for processes being down. When I
>>>>>> look at the process list output in the alert it looks as if the data coming
>>>>>> from the clients isn't correct. Here is an example. Has anyone seen anything
>>>>>> like this?
>>>>>>
>>>>>>
>>>>>>
>>>>>> 9613 1944 root Jan 11 S 154 0.00 00:00:00 6128 cmclconfd -c
>>>>>>10389 1944 root Jan 11 S 154 0.00 00:00:00 6128 cmclconfd -c
>>>>>> 9794 1 oracle 10:55:57 S 154 0.00 00:00:0
>>>>>> 217600]oracleTEST (LOCAL=NO)
>>>>>> 1592 1 oracle Jan 11 S 154 0.00 00:00:11 217136 ora_mman_TEST
>>>>>>12751 1944 root Jan 11 S 154 0.00 00:00:00 6128 cmclconfd -c
>>>>>> 8965 1944 root Jan 11 S 154 0.00 00:00:00 6128 cmclconfd -c
>>>>>>
>>>>>>
>>>>>>
>>>>>>11819 1 oracle Jan 12 S 154 0.00 00:00:07 217280 ora_j015_TEST
>>>>>> 2711 1 roo
>>>>>> ]ec 4 S 120 0.04 00:02:16 868 /usr/sbin/xntpd
>>>>>> 3547 1 xymon Dec 4 S 168 0.00 00:00:43 268 /opt/xymon/client/bin/hobbitlaunch --config=/opt/xymon/client/etc/clientlaunch.cfg --log=/opt/xymon/client/logs/clientlaunch.log --pidfile=/opt/xymon/client/logs/clientlaunch.101.example.com.pid
>>>>>> 3728 1 root Dec 4 R 152 0.00 00:00:37 4208 /usr/sbin/stm/uut/bin/tools/monitor/WbemWrapperMonitor
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>Xymon version: 4.3.0-0.beta2
>>>>>>Xymon server: CentOS 5.4 32 bit
>>>>>>
>>>>>>
>>>>>>Client: HP-UX 11.31 Itanium
>>>>>>
>>>>>>--
>>>>>>Chris Naude
>>>>>>
>>>>
>>>>
>>>>--
>>>>Chris Naude
>>>>
>>>
>>>
>>>--
>>>Chris Naude
>>>
>>
>
>
>--
>Chris Naude
>
--
Chris Naude