[hobbit] failover?

Henrik Stoerner henrik at hswn.dk
Mon Dec 4 17:03:06 CET 2006


On Sun, Dec 03, 2006 at 12:10:03AM +0100, Henrik Stoerner wrote:

> Besides, "fail over" means lot of different things. For a true fail over
> setup, you'll need some hardware support on top of Hobbit - providing
> a virtual IP for your resilient hosts, and probably some sort of shared
> storage. Most of that is handled outside Hobbit.
> 
> So what exactly do you have in mind ?

I'm replying to my own mail to pick up all the responses that have come
in about this.

Trever Noggle:
>I would like to do like you can with BB..  The master and the backups
>will be on completely different networks [...] I want to have a main 
>monitor server at location 1 monitoring devices at both locations.  
>I then want location 2 to take over if location 1 goes down.

Anton Burkhalter:
>I use two independent Hobbit servers; each client reports to both
>servers. The question is how to synchronize the two servers after 
>an outage of a server.

Ralph Mitchell:
>The thing that concerns me is that I can't be running the same checks
>from two servers at the same time.  People around here get irritated
>when their webserver stats are artificially inflated

Daniel J McDonald:
>I'd like hobbit-alerts to only run on one box at a time.  Displays and
>tests can all run independently


For myself, I might add that having access to the historical data - both
graphs and history logs - is also a requirement.

The simple "do it like BB does" approach is inadequate: it cannot keep
the historical data up-to-date on both servers, and it also fails to
carry over the alerts that are currently active. If the master server
sent out an alert for something before it crashed, and the next alert
should go out 12 hours later, then this repeat setting isn't transferred
to the slave server. So when the master server drops off the net and the
slave server takes over, it will immediately start by sending out alerts
for everything that is down. Not good.



The current state of a Hobbit server can easily be shared among servers.
The checkpoint files that go into ~hobbit/server/tmp/ can be copied
across to another server, and if you do that often enough then starting
up Hobbit on the other server will pick up all of the current status.
So that part is easy - for convenience I might want to implement some
sort of internal Hobbit protocol for distributing the checkpoint files,
but you can already today just use scp, rsync or similar to copy those
files over. 
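
As a rough sketch of that copy step, assuming rsync over ssh is available
on both machines: the script below builds and runs an rsync command that
pushes the checkpoint directory to a standby server. The standby hostname,
the expanded path /home/hobbit/server/tmp/ and the function names are
placeholders for illustration, not something Hobbit provides.

```python
import subprocess

# Assumed locations: adjust to where ~hobbit/server/tmp/ really lives.
CHECKPOINT_DIR = "/home/hobbit/server/tmp/"
STANDBY = "hobbit@hobbit2.example.com"  # hypothetical standby server


def build_rsync_command(src_dir, remote, dest_dir):
    """Return the rsync argument list that copies src_dir to remote:dest_dir."""
    # A trailing slash on the source makes rsync copy the directory's
    # contents rather than the directory itself.
    if not src_dir.endswith("/"):
        src_dir += "/"
    # -a preserves timestamps and permissions; -e ssh runs the copy over ssh.
    return ["rsync", "-a", "-e", "ssh", src_dir, f"{remote}:{dest_dir}"]


def sync_checkpoints():
    """Run one copy; raises CalledProcessError if rsync fails."""
    cmd = build_rsync_command(CHECKPOINT_DIR, STANDBY, CHECKPOINT_DIR)
    subprocess.run(cmd, check=True)
```

Calling sync_checkpoints() from a cron-driven script every minute or so
would be one way to do it; how often is "often enough" depends on how
stale a status you can tolerate after a switch.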

The downside of this of course is that something has to recognize when
the primary site is gone, and start up Hobbit on the secondary server.
That is not very attractive; I would rather have Hobbit running on both
servers all the time - this would require some work. But let's assume
for now that this is possible.


Then there are the on-disk files: History logs and graphs. This has
changed recently: it is now possible to distribute these over multiple
servers, and also to have more than one site perform all of the updates
of those files. So instead of periodically copying the files from a
master server to a slave, you just copy them once and then mirror all of
the updates to the relevant servers. The code for this is in the current
snapshots; it isn't documented yet, and hasn't had much testing. I
currently use it for another purpose: Load-sharing of the updates.


Finally there are the various Hobbit tasks: The display, the network
tests, the alerts. 

The display tasks are very easily distributed to multiple servers. It
is somewhat inconvenient that static webpages are built for the overview
pages; I want to eliminate those and have all of the webpages generated
dynamically. But the web display does not have to be on the same
physical server as the rest of Hobbit, so doing failover for the web
interface is relatively simple.

Alerts - the code is *almost* ready. It is based on the same principle as
what is used for distributing the history and RRD files across multiple
servers: the hobbitd_alert module runs on all of the servers, so each of
them keeps track of the repeat times etc., but the alerts are only
actually sent from one of the servers at any time.
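
To illustrate the principle (a conceptual sketch only, not the
hobbitd_alert code): every server feeds the same status events into its
own tracker, so each one knows when an alert was last due, but only the
server currently marked active delivers anything. The class name, the
"active" flag and the host/test names are assumptions made for the
example; the 12-hour interval is the repeat setting mentioned earlier.

```python
REPEAT_SECONDS = 12 * 3600  # the 12-hour repeat setting from the example


class AlertTracker:
    """Tracks repeat times on every server; delivers only when active."""

    def __init__(self, active):
        self.active = active
        self.last_alert = {}  # (host, test) -> time the last alert was due
        self.sent = []        # alerts this server actually delivered

    def handle_down(self, host, test, now):
        key = (host, test)
        last = self.last_alert.get(key)
        if last is not None and now - last < REPEAT_SECONDS:
            return  # still inside the repeat interval: nothing is due
        # An alert is due. Every server records the time ...
        self.last_alert[key] = now
        # ... but only the active one sends it.
        if self.active:
            self.sent.append((host, test, now))


primary = AlertTracker(active=True)
standby = AlertTracker(active=False)

# Both servers see the same "down" status at t=0.
for server in (primary, standby):
    server.handle_down("www1", "http", now=0)

# One hour later the primary dies and the standby takes over.
standby.active = True
standby.handle_down("www1", "http", now=3600)       # no alert: not due yet
standby.handle_down("www1", "http", now=12 * 3600)  # repeat alert goes out
```

Because the standby recorded the alert time even while it was not
sending, it does not flood everyone with alerts at the moment it takes
over; it just continues the existing repeat schedule.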

Network tests - I've heard arguments going both ways as to whether one
should run network tests on all servers ("it is interesting to see if
the site is down when tested from all of our locations, or only from the
primary location"), or on just a single server ("we want to minimize
traffic from the monitoring systems towards the webservers"). I'm still
thinking about how to handle this - if they run on all Hobbit servers,
there has to be some way of choosing which test result should be used;
if they run only on a single server I will probably use the same method
for choosing which server runs the tests as I use to decide who gets to
send out alerts.
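
Both options can be sketched briefly. None of this is decided or
implemented; the function names, the priority-ordered server list and
the "red only if every location sees red" rule are assumptions made
purely for illustration.

```python
def pick_active(servers, alive):
    """Return the first server in priority order that is alive, else None.

    The same choice could decide who runs the network tests and who
    sends out alerts.
    """
    for name in servers:
        if name in alive:
            return name
    return None


def combine_results(results):
    """Merge per-location test colors ('green' or 'red') into one status.

    Assumed rule: report red only when every location sees red; if the
    locations disagree, report yellow so the difference is visible.
    """
    if not results:
        raise ValueError("no test results to combine")
    colors = set(results.values())
    if colors == {"red"}:
        return "red"
    if "red" in colors:
        return "yellow"
    return "green"
```

With servers listed as ["site1", "site2"], pick_active returns "site1"
while it is alive and "site2" once it is not. combine_results only
matters in the variant where every server runs the tests.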



Regards,
Henrik



