[Xymon] external script problem - question

Steve Holmes sholmes42 at mac.com
Wed Oct 5 17:41:43 CEST 2011


Xymonphiles:

[Disclaimer: I don't think this is a Xymon problem but my boss thinks it
might be and has directed me to ask the list for advice.]

We are running Xymon 4.2.3. The display server is a Solaris box monitoring
a few hundred servers, mostly RHEL 5; many of them are VMs hosted on VMware
ESX.

The test is an external script that does a 'sudo touch foo' on each file
system and reports one of three outcomes: the touch returns with no problem,
it returns with an error indicating that the file system is read-only, or it
has not returned after 60 seconds and the script declares the file system
'hung'. File systems going read-only had been a particular problem on the
VMs, which is what prompted the implementation of this test.
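
For readers who want to see the logic of such a test spelled out, here is a
minimal Python sketch. It is not the script we actually run, which is not
included in this message; the names probe and classify, the command
parameter, and the check for the "Read-only file system" text on stderr are
all assumptions made for illustration.

```python
import os
import subprocess
import tempfile
import time


def classify(returncode, stderr):
    """Map a finished touch to a status string (hypothetical helper)."""
    if returncode == 0:
        return "ok"
    # Assumption: with LC_ALL=C, an EROFS failure is reported with this text.
    if "Read-only file system" in stderr:
        return "read-only"
    return "error"


def probe(path, timeout=60.0, command=("touch",)):
    """Touch `path`; return (status, elapsed_seconds).

    `command` is a parameter so the sketch can run without sudo; the test
    described above would correspond to command=("sudo", "touch").
    """
    env = {**os.environ, "LC_ALL": "C"}  # keep error messages untranslated
    start = time.monotonic()
    try:
        result = subprocess.run(
            [*command, path],
            capture_output=True,
            text=True,
            timeout=timeout,
            env=env,
        )
    except subprocess.TimeoutExpired:
        # No answer within the timeout: report the file system as hung.
        return "hung", time.monotonic() - start
    return classify(result.returncode, result.stderr), time.monotonic() - start


if __name__ == "__main__":
    # Demonstration against a scratch directory instead of a real mount point.
    with tempfile.TemporaryDirectory() as scratch:
        status, elapsed = probe(os.path.join(scratch, "foo"), timeout=5.0)
        print(f"{scratch}: {status} ({elapsed:.2f}s)")
```

Recording the elapsed time alongside the status is what would let a test
report a slow but successful touch separately from a hung one. One caveat:
subprocess.run kills the child on timeout and then waits for it, so a child
stuck in uninterruptible I/O could keep the call blocked past the timeout.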

The problem with the test is that once or twice a day we get a flurry of
alerts from a dozen or so servers, all at about the same time, each reporting
a hung file system. Other file systems on the same servers report that the
touch takes longer than we think it should (e.g. 12 to 25 or even 60
seconds). The alerts all go away on the next test cycle. The file systems
are on local Fibre Channel disks (i.e. not NFS mounted). The servers getting
the alerts are not all VMs, and it is not always the same set of servers
that shows up.

Occasionally we do catch a file system that is read-only, so we are going to
modify the test to stop sending a panic on the hung condition; that way the
read-only alerts will not get lost in the noise. We would still really like
to figure out why we are getting these hung, or at least very slow, writes
on file systems, and why they come in bursts.
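
The planned change to the alerting can be sketched in Python as follows.
This is only an illustration under assumptions: the status names, the choice
of yellow for a hung file system, the test name "fsrw", and the helper names
are all hypothetical. The message follows the usual Xymon form of "status
host.test color text", and nothing here actually sends it to the display
server.

```python
# Hypothetical mapping: only a read-only file system pages anyone (red);
# a hung or slow touch is downgraded to a warning (yellow).
COLOR_FOR = {"ok": "green", "hung": "yellow", "read-only": "red"}
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def overall_color(results):
    """Return the worst color over a dict of mount point -> status."""
    # Unknown statuses are treated as yellow in this sketch.
    colors = [COLOR_FOR.get(status, "yellow") for status in results.values()]
    return max(colors, key=SEVERITY.__getitem__, default="green")


def status_message(host, test, results):
    """Build the text of a Xymon-style status message (not sent anywhere)."""
    lines = [f"status {host}.{test} {overall_color(results)} file system write check"]
    for mount, status in sorted(results.items()):
        lines.append(f"{mount}: {status}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = {"/": "ok", "/var": "hung", "/data": "read-only"}
    print(status_message("host1", "fsrw", sample))
```

With a mapping like this, a burst of hung results would turn the column
yellow without paging anyone, while a single read-only file system would
still turn it red.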

Thanks for any insight you might have.
Steve
Purdue University/ITaP

-- 
If they give you ruled paper, write the other way. -Juan Ramon Jimenez,
poet, Nobel Prize in literature (1881-1958)

Truth never damages a cause that is just. -Mohandas Karamchand Gandhi
(1869-1948)