[Xymon] autofixing

Alan Sparks asparks at doublesparks.net
Fri Apr 6 23:44:31 CEST 2012


I'd generally agree that fixing the root cause whenever possible, so the
problem doesn't occur, is preferable.  In a past life, we did do some of
this - of course, we did whatever we could to prevent the problem in the
first place... but web server instances crash, and sometimes traffic
irregularities cause logs to fill faster than usual.

I had a hack going that involved cfengine, with cfrun callable from a
paging script.  The premise was to have cfengine invoked on the remote
node before pages actually went out (e.g., a DURATION delay on real
pages), to see if cfengine could fix the simpler problems (like a
process dying or whatnot).  If it could, we could sleep.  If not, the
second-level page went out for human intervention.
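
As a rough illustration of that flow (not the original setup, which used
cfengine and cfrun), a minimal Python sketch might look like the following.
All function and parameter names here are hypothetical; attempt_fix stands
in for whatever wrapper would invoke the remote fix, and grace_seconds
plays the role of the delay before the real page goes out.

```python
import time
from typing import Callable


def handle_alert(
    host: str,
    attempt_fix: Callable[[str], bool],
    is_healthy: Callable[[str], bool],
    send_page: Callable[[str], None],
    grace_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try an automated fix first; page a human only if it does not help.

    attempt_fix, is_healthy and send_page are assumed callables supplied
    by the caller (hypothetical names, not part of Xymon or cfengine).
    """
    fix_ran = attempt_fix(host)  # e.g. a wrapper that runs the remote fix
    if fix_ran:
        sleep(grace_seconds)     # give the fix time to take effect
        if is_healthy(host):
            return "fixed"       # problem cleared, nobody gets woken up
    send_page(host)              # second-level page for human intervention
    return "paged"
```

Passing the fix, health check and pager in as callables keeps the
escalation logic separate from any particular tool, so the same function
could sit behind a paging script regardless of what does the fixing.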

We didn't do much autofixing... there wasn't a lot in the environment
that lent itself to such.  Either we engineered an HA environment
(clustered) where a dead machine didn't affect the service... or the
problem was probably not simple to fix, and we needed real eyes/brains
on it.
-Alan

On 4/6/2012 3:31 PM, Larry Barber wrote:
> Resending to the list, Gmail seems to be hiding the "reply to all".
> 
> Thanks,
> Larry Barber
> 
> On Fri, Apr 6, 2012 at 4:28 PM, Larry Barber <lebarber at gmail.com> wrote:
> 
>     The kind of things that you can automate should be handled
>     routinely, not be triggered by an alert from your monitoring tool.
>     If you have logs growing so fast that they are filling up your file
>     system you should find out what is filling them up and why and then
>     fix that. Automatic log rotation and compression should be done by a
>     tool like logrotate, not Xymon or any other monitoring tool. You
>     shouldn't be using a monitoring tool to trigger routine maintenance;
>     it simply causes unnecessary alerts that cause problems in other areas.
> 
>     Thanks,
>     Larry Barber
> 
> 
>     On Fri, Apr 6, 2012 at 4:06 PM, KING, KEVIN <KK1051 at att.com> wrote:
> 
>         Larry,
> 
>         Some auto correcting is not bad.  Back in the Big Brother days I
>         had a datacenter and team of folks. We managed to the “yellow”
>         alerts. I had folks correct and build scripts to address the
>         things that brought on the yellow so we never saw the red.  This
>         made it so very little red was ever seen.
> 
>         Now the things you can automate are the disk full kind of
>         things. If that happens you can run a script to clean up and
>         compress logs and that sort of thing.  This was usually handled by
>         managing the yellow. There would be a script in place to keep the
>         space below the yellow trigger. So if you got a red it was usually
>         a big temp file or something that would get cleaned shortly. So
>         say on the red alert you could have it run the cleanup script
>         rather than waiting for your cron to do the normal cleanup.
> 
>         Now on other issues it really depends on what the alert is
>         about. You cannot automate everything economically. At some
>         point it is cheaper and faster to put a human in the loop. I did
>         have a script that would take the e-mail response from the alert
>         and we could have it parse the message and do the work. This was
>         back in the day with the RIM pagers. So you got an alert and you
>         replied to the alert with “run clean script on host”. The reply
>         e-mail was parsed by the same script we were using to
>         acknowledge the alert. It would parse and run a clean script.
>         This let my admins work issues while away from a PC
>         or network connection.
> 
>         I do hear and agree with your concerns. A blanket statement from
>         managers who do not have a full understanding of all the
>         elements is a rough thing to swallow. But their heart is in the
>         right spot :-)
> 
>         I guess in a rather long rambling way I am saying that you learn
>         and tune your systems. Address recurring issues so they do
>         not recur. Then watch for the next thing to be addressed.
> 
>         -Kevin
> 
>         *From:* xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com]
>         *On Behalf Of* Larry Barber
>         *Sent:* Friday, April 06, 2012 1:43 PM
>         *To:* xymon at xymon.com
>         *Subject:* [Xymon] autofixing
> 
>         My management has gotten the idea that we should be automating
>         the repair processes on our servers. They want things set up so
>         that when a fault is detected a script is run that attempts to
>         repair it. I've tried to convince them that this is a profoundly
>         wrong-headed idea, but I'm not having much luck. Do any of you
>         know of any articles or resources that might help convince them?
> 
>         Thanks,
>         Larry Barber
> 
> _______________________________________________
> Xymon mailing list
> Xymon at xymon.com
> http://lists.xymon.com/mailman/listinfo/xymon
