[hobbit] False Process Down Alerts

Williams, Doug (Consultant-RIC) Doug.Williams at rhd.com
Mon Jan 18 20:41:23 CET 2010


It seems to me that your clients' data is being truncated.  Try changing
the settings below in your hobbitserver.cfg.  You may want to set them to
a size appropriate for your Xymon server.  I have Xymon running on pretty
beefy servers, so I set these incredibly high.  They may even exceed what
Xymon actually allows, but it is not hurting me.  Restart the Hobbit
server after changing hobbitserver.cfg.



MAXMSG_STATUS=30000000
MAXMSG_CLIENT=30000000
MAXMSG_DATA=30000000 
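
For anyone who wants to sanity-check these limits against the size of a
client message, here is a minimal Python sketch. It is not part of Xymon.
The function names are made up for illustration, and it assumes the
MAXMSG_* values are expressed in kB, which is my understanding of the
hobbitserver.cfg man page; verify that against your own version before
relying on it.

```python
import re

# Matches lines such as MAXMSG_CLIENT=30000000 or MAXMSG_CLIENT="512".
# Commented-out lines (starting with '#') do not match.
_SETTING = re.compile(r'^\s*(MAXMSG_[A-Z]+)\s*=\s*"?(\d+)"?\s*(?:#.*)?$')


def read_maxmsg(cfg_text):
    """Return {setting_name: int_value} for every MAXMSG_* line in cfg_text."""
    limits = {}
    for line in cfg_text.splitlines():
        match = _SETTING.match(line)
        if match:
            limits[match.group(1)] = int(match.group(2))
    return limits


def exceeds_limit(message_bytes, limit_kb):
    """Return True if a message of message_bytes is larger than limit_kb.

    ASSUMPTION: the limit is in kB, taken here as 1024 bytes per kB.
    """
    return message_bytes > limit_kb * 1024


# The three settings from this message, as they would appear in the file.
sample_cfg = """\
MAXMSG_STATUS=30000000
MAXMSG_CLIENT=30000000
MAXMSG_DATA=30000000
"""
limits = read_maxmsg(sample_cfg)
```

Comparing the size of a captured client message (for example from the
tcpdump capture mentioned further down) with limits["MAXMSG_CLIENT"]
would show whether the configured limit is even in play.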


-----Original Message-----
From: Chris Naude [mailto:chris.naude.0 at gmail.com] 
Sent: Monday, January 18, 2010 2:21 PM
To: hobbit at hswn.dk
Subject: Re: [hobbit] False Process Down Alerts

I've managed to stop the flood of false alerts. I removed all of my
non-prod clients from the bb-hosts and shut off their client processes.
The problem seems to be somehow related to the amount of data the Xymon
server is trying to process. 


On Sun, Jan 17, 2010 at 5:08 PM, Chris Naude <chris.naude.0 at gmail.com>
wrote:


	I have 7 clients running. Each client has a different name. They
are all sending data to the primary Xymon server. The alerts are reading
missing processes, full file systems, and msgs errors. Here is another
sample of an unusual error. You can see the process list has a funky
break in it. 


	 Sun Jan 17 15:40:18 MST 2010 - Processes NOT ok

	 yellow Expected string COMMAND not found in ps output header
	
	  PID  PPID USER     
	  STIM] S PRI  %CPU     TIME     VSZ COMMAND
	    0     0 root      Dec 14  S 127  0.16 00:40:00       0 swapper
	    1     0 root      Dec 14  R 152  0.09 00:01:21    2064 init
	   48     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
	   45     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
	   42     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
	   31     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
	   30     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
	   29     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
	   28     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
	   26     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
	    5     0 root      Dec 14  R 152  0.00 00:00:02       0 signald
	    6     0 root      Dec 14  R 152  0.00 00:00:03       0 kmemdaemon
	   17     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
	   16     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
	   15     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
	   14     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
	   13     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
	   12     0 root      Dec 14  S 152  0.00 00:00:00       0 usbhubd
	   11     0 root      Dec 14  R 152  0.00 00:01:11       0 escsid
	   10     0 root      Dec 14  S -32  0.00 00:00:00       0 ttisr
	    9     0 root      Dec 14  R 152  0.00 00:01:27       0 ksyncer_daemon
	   
	7     0]root      Dec 14  R 152
	 0.00 00:]0:00       0 kai_daemon
	   50     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
	   47     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
	   44     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
	   41     0 root      Dec 14  S 152  0.00 00:00:00       0 net_str_cached
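
The "Expected string COMMAND not found in ps output header" alert is
consistent with the header line having been split in two, so that the
first line no longer contains COMMAND. A minimal Python sketch of that
kind of check follows. It is an illustration only, not Xymon's actual
code, and header_has_column is a hypothetical helper name.

```python
def header_has_column(ps_text, column="COMMAND"):
    """Return True if the first non-blank line of ps_text has `column` as a word."""
    for line in ps_text.splitlines():
        if line.strip():
            # Only the first non-blank line is treated as the header.
            return column in line.split()
    return False  # no non-blank lines at all


# Header exactly as it appears in the alert: broken across two lines.
broken = "  PID  PPID USER     \n  STIM] S PRI  %CPU     TIME     VSZ COMMAND"

# Simplified, hypothetical intact header, for comparison only.
intact = "PID PPID USER COMMAND"

broken_ok = header_has_column(broken)
intact_ok = header_has_column(intact)
```

With the broken sample the check fails because the first line ends after
USER, even though COMMAND is present one line further down.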

	On Sun, Jan 17, 2010 at 4:21 PM, Josh Luthman
<josh at imaginenetworksllc.com> wrote:
	

		Is there only one client sending data as this name?  I
don't think you answered Lars' email.
		
		What does the alert read and what does the data say?
Missing process?  Too high of a load?
		
		Josh Luthman



		On Sun, Jan 17, 2010 at 6:11 PM, Chris Naude
<chris.naude.0 at gmail.com> wrote:
		

			The problem has suddenly become much much worse.
I verified with tcpdump that the data coming from the client is 100%
correct. It seems something on the Xymon server side is not handling the
client data correctly. Anyone have any other ideas?

			red 89%     /testdb3 (37771472% used) has reached the PANIC level (95%)

			Filesystem            1024-blocks  Used Available Capacity Mounted on
			/dev/vgtestdb1/lvol1    107844344 70901816 36942528    66%     /testdb1
			/dev/vgtestdb2/lvol1    35962064 25453128 10508936    71%     /testdb2
			/dev/vgtestdb4/lvol1    970909400 825006344 145903056    85%     /testdb4
			/dev/vgtestdb3/lv
			l1 ]  338788224 301016752 37771472    89%
/testdb3
			/dev/vgtestdb5/lvol1    179789048 150553912 29235136    84%     /testdb5
			/dev/vg00/lvol8       24580711    74501 24506210 1%     /home
			/dev/vg00/lvol4       10226680  6339283  3887397 62%     /opt
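
The odd "37771472% used" figure matches what you get if the broken
/testdb3 row is split on whitespace and the fifth field is read as the
Capacity column: the stray "l1 ]" fragment adds an extra field and
shifts everything one position to the right. The Python sketch below
only illustrates that shift. It assumes a purely positional parser,
which may not be how Xymon actually reads df output, and capacity_field
is a hypothetical name.

```python
def capacity_field(df_line, index=4):
    """Return the whitespace-separated field at `index`.

    In an intact row (filesystem, blocks, used, available, capacity,
    mount point) index 4 is the Capacity column.
    """
    return df_line.split()[index]


# An intact row and the broken row, both taken from the df output above.
intact = "/dev/vgtestdb1/lvol1    107844344 70901816 36942528    66%     /testdb1"
broken = "l1 ]  338788224 301016752 37771472    89%     /testdb3"

intact_capacity = capacity_field(intact)
broken_capacity = capacity_field(broken)
```

For the intact row the fifth field is the percentage. For the broken row
it is the Available value, which is the number quoted in the red alert.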


			On Sat, Jan 16, 2010 at 10:44 AM, Chris Naude
<chris.naude.0 at gmail.com> wrote:
			

				That makes a lot of sense. I did have
some issues with the startup scripts on HP-UX. I'll check it out later
tonight. Hopefully I can get it fixed before it goes live tonight.
Thanks!


				On Sat, Jan 16, 2010 at 7:56 AM, Lars
Ebeling <lars.ebeling at leopg9.no-ip.org> wrote:
				

					It looks like two instances of
the client are writing to the file at the same time or almost ;)
					 
					
					Lars

						----- Original Message
----- 
						From: Chris Naude
<mailto:chris.naude.0 at gmail.com>  
						To: hobbit at hswn.dk 
						Sent: Saturday, January
16, 2010 4:59 AM
						Subject: [hobbit] False
Process Down Alerts

						I've run into a strange
problem with my Xymon server. I noticed today that I'm receiving random
false alerts for processes being down. When I look at the process list
output in the alert it looks as if the data coming from the clients
isn't correct. Here is an example. Has anyone seen anything like this? 

						 9613  1944 root     Jan 11  S 154  0.00 00:00:00    6128 cmclconfd -c
						10389  1944 root     Jan 11  S 154  0.00 00:00:00    6128 cmclconfd -c
						 9794     1 oracle
10:55:57 S 154  0.00 00:00:0
						  217600]oracleTEST
(LOCAL=NO)
						 1592     1 oracle   Jan 11  S 154  0.00 00:00:11  217136 ora_mman_TEST
						12751  1944 root     Jan 11  S 154  0.00 00:00:00    6128 cmclconfd -c
						 8965  1944 root     Jan 11  S 154  0.00 00:00:00    6128 cmclconfd -c

						11819     1 oracle   Jan 12  S 154  0.00 00:00:07  217280 ora_j015_TEST
						 2711     1 roo
						      ]ec  4  S 120
0.04 00:02:16     868 /usr/sbin/xntpd
						 3547     1 xymon    Dec  4  S 168  0.00 00:00:43     268 /opt/xymon/client/bin/hobbitlaunch --config=/opt/xymon/client/etc/clientlaunch.cfg --log=/opt/xymon/client/logs/clientlaunch.log --pidfile=/opt/xymon/client/logs/clientlaunch.101.example.com.pid
						 3728     1 root     Dec  4  R 152  0.00 00:00:37    4208 /usr/sbin/stm/uut/bin/tools/monitor/WbemWrapperMonitor


						Xymon version: 4.3.0-0.beta2
						Xymon server: CentOS 5.4 32 bit

						Client: HP-UX 11.31 Itanium

						-- 
						Chris Naude
						







