[Xymon] autofixing
Alan Sparks
asparks at doublesparks.net
Fri Apr 6 23:44:31 CEST 2012
I'd generally agree that fixing the root cause whenever possible, so the
problem doesn't occur, is preferable. In a past life, we did do some of
this. Of course, we did whatever we could to prevent the problem in the
first place... but web server instances crash, and sometimes traffic
irregularities cause logs to fill faster than usual.
I had a hack going that involved cfengine, with cfrun callable from a
paging script. The premise was to have cfengine invoked on the remote
node before pages actually went out (e.g., a DURATION delay on real
pages), to see if cfengine could fix the simpler problems (like a
process dying or whatnot). If it could, we could sleep. If not, the
second-level page went out for human intervention.
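A minimal sketch of that flow in Python, for illustration only. The
function and parameter names here (handle_alert, attempt_fix, is_healthy,
page_human, wait) are hypothetical and were not part of the original
setup; attempt_fix stands in for whatever wrapper would invoke cfrun
against the node, and wait stands in for the delay before the real page.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Outcome:
    fixed: bool  # True if the check passed after the automated attempt
    paged: bool  # True if a human was paged


def handle_alert(
    host: str,
    attempt_fix: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    page_human: Callable[[str], None],
    wait: Callable[[], None] = lambda: None,
) -> Outcome:
    """Try the automated fix first; page a human only if it did not help."""
    try:
        # Hypothetical hook: e.g. a wrapper that runs cfengine on this host.
        attempt_fix(host)
    except Exception:
        # A failed fix attempt must never prevent the page from going out.
        pass
    # Grace period before the real page (assumed to mirror the DURATION delay).
    wait()
    if is_healthy(host):
        return Outcome(fixed=True, paged=False)
    # Second-level page for human intervention.
    page_human(host)
    return Outcome(fixed=False, paged=True)
```

Passing the fix, health check, and pager in as callables keeps the
escalation logic separate from the tooling, so the same flow works
whether the fix is cfengine or a plain restart script.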
We didn't do much autofixing... there wasn't a lot in the environment
that lent itself to such. Either we engineered an HA environment
(clustered) where a dead machine didn't affect the service... or the
problem was probably not simple to fix, and we needed real eyes/brains
on it.
-Alan
On 4/6/2012 3:31 PM, Larry Barber wrote:
> Resending to the list, Gmail seems to be hiding the "reply to all".
>
> Thanks,
> Larry Barber
>
> On Fri, Apr 6, 2012 at 4:28 PM, Larry Barber <lebarber at gmail.com> wrote:
>
> The kind of things that you can automate should be handled
> routinely, not be triggered by an alert from your monitoring tool.
> If you have logs growing so fast that they are filling up your file
> system you should find out what is filling them up and why and then
> fix that. Automatic log rotation and compression should be done by a
> tool like logrotate, not Xymon or any other monitoring tool. You
> shouldn't be using a monitoring tool to trigger routine maintenance;
> it simply causes unnecessary alerts that cause problems in other areas.
>
> Thanks,
> Larry Barber
>
>
> On Fri, Apr 6, 2012 at 4:06 PM, KING, KEVIN <KK1051 at att.com> wrote:
>
> Larry,
>
> Some auto correcting is not bad. Back in the Big Brother days I
> had a datacenter and team of folks. We managed to the “yellow”
> alerts. I had folks correct and build scripts to address the
> things that brought on the yellow so we never saw the red. This
> made it so very little red was ever seen.
>
> Now the things you can automate are the disk full kind of
> things. If that happens you can run a script to clean up logs,
> compress them, and that sort of thing. This was usually handled by
> managing the yellow. There would be a script in place to keep the
> space below the yellow trigger. So if you got a red it was usually
> a big temp file or something that would get cleaned shortly. So
> say on the red alert you could have it run the cleanup script
> rather than waiting for your cron to do the normal cleanup.
>
> Now on other issues it really depends on what the alert is
> about. You cannot automate everything economically. At some
> point it is cheaper and faster to put a human in the loop. I did
> have a script that would take the e-mail response to the alert,
> parse the message, and do the work. This was
> back in the day with the RIM pagers. When you got an alert, you
> replied to it with “run clean script on host”. The reply
> e-mail was parsed by the same script we were using to
> acknowledge the alert. It would parse the reply and run a clean
> script. This let my admins work issues while away from a PC
> or network connection.
>
> I do hear and agree with your concerns. A blanket statement from
> managers who do not have a full understanding of all the
> elements is a rough thing to swallow. But their heart is in the
> right spot :)
>
> I guess in a rather long rambling way I am saying that you learn
> and tune your systems. Address recurring issues so they do
> not recur. Then watch for the next thing to be addressed.
>
> -Kevin
>
> From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com]
> On Behalf Of Larry Barber
> Sent: Friday, April 06, 2012 1:43 PM
> To: xymon at xymon.com
> Subject: [Xymon] autofixing
>
> My management has gotten the idea that we should be automating
> the repair processes on our servers. They want things set up so
> that when a fault is detected a script is run that attempts to
> repair it. I've tried to convince them that this is a profoundly
> wrong-headed idea, but I'm not having much luck. Do any of you
> know of any articles or resources that might help convince them?
>
> Thanks,
> Larry Barber