Weird disk and cpu alert with bad data
Paul Jochum
hobbit at paul.jemail.info
Fri Jul 16 23:18:15 CEST 2010
Hi All:
(This is very similar to a problem reported by Craig Cook back on Oct. 30,
2008 on this forum, but I don't see a resolution for it.)
In my xymon setup, I have a number of Linux and Solaris clients. The Linux
clients are doing fine, but I am having a problem with the Solaris ones
(both x86 and SPARC based, running Solaris 10 update 8). I am getting
alarms like this on various Solaris hosts at random times:
****************************************************************************************************
(example from failing test for CPU)
]Fri Jul 16 12:41:14 CDT 2010
uname]]SunOS lsslogin1 5.10 Generic_142900-13 sun4v sparc SUNW,
PARC-En]erprise-T5220 up: 6 days, 374 users, 2402 procs, load=
[image: red] Load is CRITICAL
System clock is 0 seconds off
load averages: 4.08, 4.30, 4.25; up 6+02:35:20 12:41:20
2400 processes: 2242 sleeping, 2 zombie, 152 stopped, 4 on cpu
CPU states: 93.6% idle, 3.9% user, 2.5% kernel, 0.0% iowait, 0.0% swap
Memory: 64G phys mem, 30G free mem, 2048M total swap, 1996M free swap
(example from successful test for CPU)
Fri Jul 16 12:46:15 CDT 2010 up: 6 days, 374 users, 2420 procs, load=4.14
System clock is 3 seconds off
load averages: 4.34, 4.15, 4.18; up 6+02:40:21 12:46:21
2395 processes: 2238 sleeping, 2 zombie, 152 stopped, 3 on cpu
CPU states: 94.3% idle, 3.6% user, 2.1% kernel, 0.0% iowait, 0.0% swap
Memory: 64G phys mem, 31G free mem, 2048M total swap, 1996M free swap
****************************************************************************************************
Note, the ] in front of the failing test is part of the real output.
I am seeing similar corruption in the disk outputs. Note, I normally only
see one failing test, and then the next test is fine.
So, my questions:
1) Any suggestions for a fix?
2) But more importantly, how do I debug a problem like this? I looked on
both the client and the server and didn't find any corresponding core
files, but what else can I do? On the client, in
~/client/logs/hobbitclient.log, on machines that are zones, I do see an
error message "prtconf: devinfo facility not available", but I don't think
this is related to this problem (I am seeing the above corruption failures
on machines which are not zones).
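One possible starting point for question 2, offered only as a rough sketch: save a copy of the raw client message from a failing run and scan it for lines whose square brackets do not balance, since the corrupted samples above show stray "]" characters. This assumes the client message marks its sections with header lines of the form [name]; the function name find_suspect_lines is made up for illustration, and the check is a heuristic that could also flag legitimate lines containing a lone bracket.

```python
def find_suspect_lines(message):
    """Return (line_number, line) pairs whose square brackets do not balance.

    Heuristic only: assumes well-formed section headers look like "[name]",
    so a line with a stray "[" or "]" is worth a closer look.
    """
    suspects = []
    for number, line in enumerate(message.splitlines(), start=1):
        # A balanced header such as "[date]" has one "[" and one "]".
        if line.count("[") != line.count("]"):
            suspects.append((number, line))
    return suspects
```

Running this over messages saved from both a failing and a passing poll would show whether the damage is already present in what the client sends or only appears later on the server side.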
thanks,
Paul