[hobbit] Highlights of the 4.3.0 version

Buchan Milne bgmilne at staff.telkomsa.net
Wed Aug 8 18:20:04 CEST 2007


On Friday 03 August 2007 19:15:27 Scott Walters wrote:
> I am definitely in the "monitor only" camp.  As appealing as
> "self-healing" may seem, I've seen attempts go horribly wrong too many
> times.  For example, shutting down Oracle for upgrades and then being
> restarted in the middle of the upgrade.  Not good.

How about the easy example of a web server not responding? Do you restart it?
In the case I am thinking of, no, because the reason it is not responding is
that the database server it (and another 4 web servers) is waiting for is
having problems. Restarting the web server would drop the >1000 existing
(working) sessions, causing a full-blown outage, and migrate the problem to
the other 4 web servers that sit behind the same load balancer.
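
To make the dependency point concrete, here is a minimal sketch of how a
monitor could label such an alert instead of acting on it. Everything in it
(the Check type, classify_web_alert, the "upstream:"/"local" labels) is
hypothetical and not part of Hobbit; it only illustrates checking what the
web server is waiting for before deciding where the fault lies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Check:
    """Result of one monitoring test (hypothetical structure)."""
    name: str
    ok: bool


def classify_web_alert(web: Check, dependencies: List[Check]) -> str:
    """Label a web server alert without taking any corrective action.

    Returns "ok" if the web server responds, "upstream:<names>" if it is
    down while something it depends on is also failing, and "local" if
    only the web server itself looks broken.
    """
    if web.ok:
        return "ok"
    failing = [dep.name for dep in dependencies if not dep.ok]
    if failing:
        # The web server is probably just waiting on these services.
        return "upstream:" + ",".join(failing)
    return "local"
```

With the database check failing, the alert is labelled "upstream:db", which
tells whoever is paged to look at the database first and leave the web
servers (and their sessions) alone.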

> I also agree that "self-healing" lends itself to band-aids that avoid
> root-cause determination.

Or *prevent* the root-cause determination. For example, I had a problem on an
LDAP server that appeared once every 2 or 3 weeks. I started it under a
debugger, and when the problem next occurred, some online debugging (after
taking it out of the pool) with a developer found and fixed the bug within one
hour (and allowed me to understand the cause so I could work around it). A
restart here would have meant waiting some more and suffering another few
outages.

> I don't think this requires "baby-sitting," 
> but a commitment to fixing things once.  I have also had the
> displeasure of making permanent band-aids, but I cannot condone it.

We do have some applications that require supervision ... but for them we use
daemontools or supervise-scripts (a re-implementation of daemontools), as
these are *much* better at supervision than a monitoring system. If you
really need a baby-sitter, the monitoring system isn't the best one ...

> All of those "operational" aspects aside, I've convinced myself from a
> security point of view, corrective action from monitoring is bad-- a
> clear violation of the separation of duties.  You don't want your
> auditors "cleaning up" the numbers as they go over your books.
>
> You know what's better than your webserver being automatically
> restarted when it crashes?  Your webserver not crashing.
>
> I completely support the absence of corrective actions from monitor
> triggers.  The question I have yet to answer satisfactorily is, "Should
> the monitoring system perform additional data collection after
> specific errors?"  For example, running a particular "find" command
> when disk usage increases to try and identify which files are causing
> the partition to fill.

Or attach a debugger to the hung process and get a backtrace?
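
Both kinds of collection are read-only, so they could be sketched roughly as
below. This is only an illustration under assumptions: largest_files and
gdb_backtrace_command are hypothetical helper names, not Hobbit features,
and the second one only builds a gdb command line (using gdb's -p, -batch
and -ex options) without running it.

```python
import heapq
import os
from typing import List, Tuple


def largest_files(root: str, count: int = 10) -> List[Tuple[int, str]]:
    """Return up to `count` (size_in_bytes, path) pairs, largest first.

    Unlike "find -xdev", this walk does not stop at filesystem
    boundaries; that restriction is left out to keep the sketch short.
    """
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat: report a symlink's own size, do not follow it.
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    return heapq.nlargest(count, sizes)


def gdb_backtrace_command(pid: int) -> List[str]:
    """Build (but do not run) a gdb command that dumps all thread
    backtraces of a running process and then exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]
```

The output of either could be attached to the alert, so the evidence is
captured while the problem is still happening, without the monitor changing
anything on the box.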

Regards,
Buchan


