Weird disk and cpu alert with bad data

Fri Jul 16 23:18:15 CEST 2010

Hi All:

(This is very similar to a problem reported by Craig Cook back on Oct. 30,
2008 in this forum, but I don't see a resolution for it)

In my Xymon setup, I have a number of Linux and Solaris clients.  The Linux
clients are doing fine, but I am having a problem with the Solaris ones (the
Solaris hosts are both x86 and SPARC based, running Solaris 10 update 8).  I am
getting alarms like this on various Solaris hosts at random times:

****************************************************************************************************
(example from failing test for CPU)
]Fri Jul 16 12:41:14 CDT 2010

uname]]SunOS lsslogin1 5.10 Generic_142900-13 sun4v sparc SUNW,
PARC-En]erprise-T5220 up: 6 days, 374 users, 2402 procs, load=
[image: red] Load is CRITICAL
System clock is 0 seconds off

load averages:  4.08,  4.30,  4.25;                    up 6+02:35:20   12:41:20
2400 processes: 2242 sleeping, 2 zombie, 152 stopped, 4 on cpu
CPU states: 93.6% idle,  3.9% user,  2.5% kernel,  0.0% iowait,  0.0% swap
Memory: 64G phys mem, 30G free mem, 2048M total swap, 1996M free swap

(example from successful test for CPU)
Fri Jul 16 12:46:15 CDT 2010 up: 6 days, 374 users, 2420 procs, load=4.14

System clock is 3 seconds off

load averages:  4.34,  4.15,  4.18;                    up 6+02:40:21   12:46:21
2395 processes: 2238 sleeping, 2 zombie, 152 stopped, 3 on cpu
CPU states: 94.3% idle,  3.6% user,  2.1% kernel,  0.0% iowait,  0.0% swap
Memory: 64G phys mem, 31G free mem, 2048M total swap, 1996M free swap

****************************************************************************************************
Note, the ] in front of the failing test is part of the real output.

I am seeing similar corruption in the disk output.  Note, I normally only
see one failing test, and then the next test is fine.

So, my questions:
1) Any suggestions for a fix?
2) But more importantly, how do I debug a problem like this?  I looked on
both the client and the server and didn't find any corresponding core
files, but what else can I do?  On the client, in
~/client/logs/hobbitclient.log, on machines that are zones, I do see an
error message "prtconf: devinfo facility not available", but I don't think
this is related to this problem (I am seeing the above corruption failures
on machines which are not zones).
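
As a possible starting point for question 2, here is a minimal sketch in
Python (not part of Xymon) that flags the two symptoms visible in the failing
example above: a stray "]" with no matching "[", and a "load=" field with no
numeric value.  The function name find_suspect_lines is hypothetical, and the
sketch assumes the status text has already been saved to a string or file by
some other means.

```python
import re

# "load=" field as it appears on the summary line of the examples above,
# e.g. "... 2420 procs, load=4.14"
LOAD_RE = re.compile(r"load=(\S*)")


def find_suspect_lines(report):
    """Return (line_number, reason) pairs for lines that look corrupted."""
    suspects = []
    for number, line in enumerate(report.splitlines(), start=1):
        # More "]" than "[" on a line matches the stray brackets seen
        # in the failing CPU report.
        if line.count("]") > line.count("["):
            suspects.append((number, "unbalanced ']'"))
        # The failing report ended with "load=" and no value after it.
        match = LOAD_RE.search(line)
        if match and not re.fullmatch(r"\d+(\.\d+)?", match.group(1)):
            suspects.append((number, "missing or non-numeric load value"))
    return suspects


if __name__ == "__main__":
    sample = "]Fri Jul 16 12:41:14 CDT 2010\n2402 procs, load=\n"
    for number, reason in find_suspect_lines(sample):
        print(number, reason)
```

Running something like this over each saved report, on both the client and
the server, might at least show on which side the corruption first appears.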

thanks,

Paul