HA Hobbit / Load Balanced Hobbit / Failover Hobbit

Fri May 25 23:09:24 CEST 2007

I've been thinking of implementing a fault-tolerant Hobbit setup. I've 
got a few ideas on how to do it, so I thought I would throw them out to 
see what Henrik and anyone else thinks...for all I know someone already 
has it set up*.

Setup 1 - Using bbproxy:
Clients configured to send to DNS name "hobbit.mydomain.com"
HOST-A, aka "hobbit.mydomain.com"
* Running bbproxy, configured to send all received data to HOST-B 
(hobbit1) and HOST-C (hobbit2)
* Apache virtual host redirects HTTP requests for hobbit.mydomain.com to 
HOST-B (hobbit1)
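
To make the fan-out on HOST-A concrete, here is a rough Python sketch of the idea: every message received is delivered to both Hobbit servers, and a failure to reach one does not stop delivery to the other. This is only an illustration of the behaviour I want, not how bbproxy is implemented. The fully qualified hostnames and the send_tcp helper are placeholders I made up, and port 1984 is assumed as the usual BB/Hobbit port.

```python
import socket

# Hypothetical targets: the two Hobbit servers from Setup 1.
TARGETS = [("hobbit1.mydomain.com", 1984), ("hobbit2.mydomain.com", 1984)]


def send_tcp(target, message, timeout=5.0):
    """Deliver one message to one server over a short-lived TCP connection."""
    with socket.create_connection(target, timeout=timeout) as conn:
        conn.sendall(message)


def fan_out(message, targets=TARGETS, send=send_tcp):
    """Send the message to every target and return the targets that failed."""
    failed = []
    for target in targets:
        try:
            send(target, message)
        except OSError:
            # One server being down must not block delivery to the other.
            failed.append(target)
    return failed
```

The returned list of failed targets is what a real proxy would log or retry; here it just makes the behaviour easy to check.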

HOST-B aka "hobbit1". This is the main Hobbit server.
* Configured to do network tests
* Configured to send alerts
* Configured to save checkpoint file every 5 minutes (maybe even every 
minute)
* Also monitors the bbproxy host and alerts if it goes down
* Also monitors HOST-C (hobbit2) and alerts if it goes down
* Web interface accessible via a DNS virtual host 
(monitor.mydomain.com/hobbit)

HOST-C aka "hobbit2". This is the secondary/standby server.
* Only runs network tests against HOST-B and alerts if it goes down
* Rsync process runs every minute and mirrors HOST-B config files 
(bb-hosts, hobbit-alerts, the checkpoint file, etc) to a failover directory.
* Failover is accomplished via a SCRIPT directive in hobbit-alerts.cfg
* When HOST-B goes down, an alert is sent, and an additional SCRIPT 
directive kicks off a script which will:  
1. Swap out HOST-C's config files (bb-hosts, hobbit-alerts.cfg, 
checkpoint file, etc) with the saved mirrored failover ones
2. Restart Hobbit to load the new checkpoint data
3. Update the apache config on HOST-A so that the web interface now 
redirects to HOST-C instead of HOST-B, and HUP apache to activate the 
new config.
4. HOST-C is now performing everything that HOST-B was. It is doing the 
same network tests, has the same alert setup, and is receiving the same 
data from bbproxy.  Anyone who goes to 
http://monitor.mydomain.com/hobbit will correctly be directed to the 
secondary server.
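
The failover script kicked off by the SCRIPT directive could look roughly like the Python sketch below. It only implements step 1 for real (swapping in the mirrored files while keeping HOST-C's own copies for failback). Steps 2 and 3 are returned as placeholder command lines, because the restart wrapper and the apache-switching script are hypothetical names, not real Hobbit tools. The checkpoint file name "hobbitd.chk" is also an assumption, and for simplicity the sketch pretends all files live in one directory.

```python
import shutil
from pathlib import Path

# Hypothetical file names; "hobbitd.chk" is an assumed checkpoint file name.
CONFIG_FILES = ["bb-hosts", "hobbit-alerts.cfg", "hobbitd.chk"]


def swap_configs(failover_dir, live_dir, names=CONFIG_FILES):
    """Step 1: replace live files with the mirrored copies, keeping a backup."""
    failover_dir, live_dir = Path(failover_dir), Path(live_dir)
    swapped = []
    for name in names:
        mirrored = failover_dir / name
        if not mirrored.exists():
            continue  # nothing mirrored for this file; leave the live copy alone
        live = live_dir / name
        if live.exists():
            # Keep HOST-C's standby version so a manual failback can restore it.
            shutil.copy2(live, live_dir / (name + ".standby"))
        shutil.copy2(mirrored, live)
        swapped.append(name)
    return swapped


def remaining_commands(proxy_host="HOST-A"):
    """Steps 2 and 3 as placeholder command lines (hypothetical scripts)."""
    return [
        ["/usr/local/bin/restart-hobbit"],
        ["ssh", proxy_host, "/usr/local/bin/point-apache-at", "hobbit2"],
    ]
```

Keeping the ".standby" copies means reverting HOST-C to its pre-fail configuration is just a matter of copying them back.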

Thoughts: 
This is an automated failover, but failing back to the primary server 
when it comes back up is not so easily done. Eh, I suppose with another 
SCRIPT directive it could be done, but usually when a host fails it's not 
ready to return to full service the first time it boots back up, so the 
failback is probably best done manually. Failback would include rsyncing 
the data directory and checkpoint file back to the primary host, and 
reverting HOST-C and HOST-A (apache) back to their pre-fail configurations.
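
For the manual failback, the rsync part could be prepared with something like the Python sketch below. It only builds the command lines so they can be reviewed before anything is run. The paths and the checkpoint file name are assumptions, not the real Hobbit layout; only the rsync options -a and --delete are standard.

```python
def failback_commands(primary="hobbit1",
                      data_dir="/var/lib/hobbit/data/",
                      checkpoint="/var/lib/hobbit/tmp/hobbitd.chk"):
    """Build rsync command lines that push state from HOST-C back to the primary.

    All paths are hypothetical defaults; pass the real ones for your install.
    """
    return [
        # -a preserves permissions and timestamps; --delete removes files on
        # the primary that no longer exist on the standby. The trailing slash
        # on data_dir makes rsync copy the directory contents, not the directory.
        ["rsync", "-a", "--delete", data_dir, f"{primary}:{data_dir}"],
        ["rsync", "-a", checkpoint, f"{primary}:{checkpoint}"],
    ]
```

Printing each command with " ".join(cmd) and running it by hand fits the idea that failback should stay a manual step.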

I suppose this could be done without bbproxy, by just having the 
secondary server constantly rsync the first server (data and etc 
directory), but then you would lose the ability to have the SCRIPT 
directive do the failover stuff for you (hmm, or maybe not: if the same 
SCRIPT rule were also in HOST-B's config, it would never trigger there, 
so it should be safe to mirror that config). It's still nice to use 
bbproxy though, because it rate-limits the incoming messages and 
combines them into combo messages when it is able, which reduces the 
load on the Hobbit server.

This only provides failover, no load balancing. This could change if 
Henrik ever does the code changes to allow distributing worker modules.

* I discovered while searching that someone has created a patch (or 
patches) for rrdtool that allows it to store and retrieve data from MySQL 
or Postgres instead of/in addition to RRD files. This is very interesting as it 
would be transparent to Hobbit, and MySQL has excellent load balancing 
and replication built-in. The author claimed that he had "re-written Big 
Brother" using these rrdtool patches to create a load balanced fault 
tolerant monitoring cluster. I wonder if he hacked it so that all the 
configuration was stored in MySQL as well...I guess it doesn't matter, as 
I got the impression that he didn't want to share his work (except for the 
rrdtool patches) :-)