[hobbit] Advice on how to handle HA monitoring

Buchan Milne bgmilne at staff.telkomsa.net
Tue Sep 25 12:17:40 CEST 2007


On Friday 21 September 2007 21:11:42 Charles Jones wrote:
> We have 2 hosts, HostA and HostB. They are part of an HA cluster via HP
> ServiceGuard. There is a virtual IP and DNS name of "virtual" that
> automatically goes to whichever of HostA and HostB is the primary at the
> time.
>
> I am currently monitoring both HostA and HostB via Hobbit.  Currently
> HostA is the primary, and I am doing various PROC checks. Currently on
> HostB, I am not doing process checks.
>
> My problem is, how do I smoothly handle a failover scenario (HostB
> becoming the primary)?  When a failover occurs, all of the procs on
> HostA are stopped (either by the server crashing, or manually by
> ServiceGuard), and the same procs are started up on HostB.
>
> I'm trying to think of ways to monitor both hosts, but only monitor
> procs on the one that is primary. So far the best I can come up with is
> to run the hobbit clients in local mode, and maybe have the ServiceGuard
> scripts swap out the config files and restart the Hobbit clients when
> there is a failover. That would probably work, BUT in this case the
> Hobbit homedir is also the same (SAN mount) on both machines, so moving
> or editing a file on one does the same on the other :(
>
> Simply shutting down the hobbit client on the non-primary is not an
> option, as then it would no longer be monitored at all.

What I do is run all the infrastructure tests on both nodes, with the procs 
checks covering only the cluster infrastructure (besides local processes like 
say syslog), then add a check that uses the cluster software on both nodes, and 
finally monitor the network services (and only those) on the clustered 
service's IP.
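For illustration only (the hostnames hosta, hostb and virtual, the IP 
addresses, and the ServiceGuard daemon name cmcld are assumptions for the 
example, not taken from your setup), the bb-hosts and hobbit-clients.cfg side 
of that could look roughly like this:

    # bb-hosts: both physical nodes, plus the clustered IP which carries
    # only the network tests for the service that follows the virtual IP
    10.0.0.1    hosta    # conn ssh
    10.0.0.2    hostb    # conn ssh
    10.0.0.10   virtual  # conn http smtp

    # hobbit-clients.cfg: PROC checks on the physical nodes cover only
    # processes that must run on both, not the clustered application
    HOST=hosta,hostb
            PROC syslogd 1
            PROC cmcld 1 1 red

That way neither physical node goes red when the package moves; only the 
tests on "virtual" care about where the service is actually running.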

And there are a few reasons I avoid installing software on the SAN; this is 
one of them ... (another is that if you screw up upgrading your clustered 
software you have an outage, whereas if it's installed locally on each box 
you could just fail back ... or in most cases the cluster middleware would 
have done it already).
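For the "check using the cluster software" part, a minimal client-side ext 
script (which, in line with the above, I would install locally on each node 
rather than on the SAN) could look roughly like the sketch below. The package 
name pkg1, the cmviewcl output matching and the column name are assumptions 
for the example; adjust them to your environment and ServiceGuard version:

    #!/bin/sh
    # Sketch of a "cluster" column test, run from the Hobbit client's ext/
    # directory and scheduled via clientlaunch.cfg (hypothetical names
    # throughout). Reports whether the package runs on this node.
    COLUMN=cluster

    # If cmviewcl itself fails, the cluster software has a problem on this
    # node, and that is worth an alarm regardless of where the package is.
    OUTPUT=`cmviewcl -p pkg1 2>&1`
    if [ $? -ne 0 ]; then
        COLOR=red
        MSG="cmviewcl failed:
$OUTPUT"
    elif echo "$OUTPUT" | grep -q "running"; then
        COLOR=green
        MSG="pkg1 is running on this node"
    else
        COLOR=green
        MSG="pkg1 is not running on this node (standby)"
    fi

    $BB $BBDISP "status $MACHINE.$COLUMN $COLOR `date`

    $MSG
    "

Running the same script on both nodes means each node always reports 
something, so the standby stays monitored without alarming merely because it 
isn't the active one.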

Regards,
Buchan
