[hobbit] failover?

Mon Dec 4 18:45:45 CET 2006

On Mon, December 4, 2006 10:03 am, Henrik Stoerner wrote:
> On Sun, Dec 03, 2006 at 12:10:03AM +0100, Henrik Stoerner wrote:
>
>
> >> Besides, "fail over" means a lot of different things. For a true fail
>> over setup, you'll need some hardware support on top of Hobbit -
>> providing a virtual IP for your resilient hosts, and probably some sort
>> of shared storage. Most of that is handled outside Hobbit.
>>
>> So what exactly do you have in mind ?
>>
>
> I'm replying to my own mail to pick up all the responses that have come
> about this.
>
> Trever Noggle:
>
> >> I would like to do what you can do with BB. The master and the backups
>> will be on completely different networks [...] I want to have a main
>> monitor server at location 1 monitoring devices at both locations. I
>> then want location 2 to take over if location 1 goes down.
>
> Anton Burkhalter:
>
>> I use two independent Hobbit servers; each client reports to both
>> servers. The question is how to synchronize the two servers after an
>> outage of a server.
>
> Ralph Mitchell:
>
>> The thing that concerns me is that I can't be running the same checks
>> from two servers at the same time.  People around here get irritated when
>> their webserver stats are artificially inflated
>
> Daniel J McDonald:
>
>> I'd like hobbit-alerts to only run on one box at a time.  Displays and
>> tests can all run independently
>
>
> For myself, I might add that having access to the historical data - both
> graphs and history logs - is also a requirement.
>
> The simple "do it like BB does" approach is inadequate - it cannot keep
> the historical data up-to-date on both servers, and it also fails to carry
> over the current alerts that are active: If the master server sent out an
> alert for something before it crashed, and the next alert should go out 12
> hours later, then this repeat-setting isn't transferred to the slave
> server. So when the master server drops off the net and the slave server
> takes over, it will immediately start by sending out alerts for everything
> that is down. Not good.
>
>
>
> The current state of a Hobbit server can easily be shared among servers.
> The checkpoint files that go into ~hobbit/server/tmp/ can be copied
> across to another server, and if you do that often enough then starting up
> Hobbit on the other server will pick up all of the current status.
> So that part is easy - for convenience I might want to implement some
> sort of internal Hobbit protocol for distributing the checkpoint files, but
> you can already today just use scp, rsync or similar to copy those files
> over.
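
[Editor's sketch of the scp/rsync approach described above. This is not
part of Hobbit. The standby host name and the absolute checkpoint path
are assumptions made for illustration; only the rsync options -a and -e
are relied upon.]

```python
import subprocess


def build_sync_command(src_dir, dest_host, dest_dir):
    """Return the rsync argument list that copies src_dir to dest_host:dest_dir."""
    # A trailing slash on the source makes rsync copy the directory's
    # contents rather than the directory itself.
    src = src_dir.rstrip("/") + "/"
    dest = f"{dest_host}:{dest_dir.rstrip('/')}/"
    # -a keeps permissions and timestamps; -e ssh selects ssh as transport.
    return ["rsync", "-a", "-e", "ssh", src, dest]


def sync_checkpoints(src_dir, dest_host, dest_dir):
    """Run the copy once; raises CalledProcessError if rsync fails."""
    subprocess.run(build_sync_command(src_dir, dest_host, dest_dir), check=True)


# Hypothetical values: adjust to wherever ~hobbit/server/tmp/ really lives.
EXAMPLE_COMMAND = build_sync_command(
    "/home/hobbit/server/tmp", "standby.example.com", "/home/hobbit/server/tmp"
)
```

Calling sync_checkpoints() from cron on the primary every minute or so
would be one way to copy the files "often enough"; how often is enough
depends on how stale a status the standby may start from.
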
>
> The downside of this, of course, is that something has to recognize when
> the primary site is gone, and start up Hobbit on the secondary server. That
> is not very attractive; I would rather have Hobbit running on both servers
> all the time - this would require some work. But let's assume for now that
> this is possible.
>
>
> Then there are the on-disk files: History logs and graphs. This has
> changed recently: it is now possible to distribute these over
> multiple servers - and also to have more than one site perform all of the
> updates of those files. So instead of periodically copying the files from
> a master server to a slave, you just copy them once and then mirror all of
> the updates to the relevant servers. The code for this is in the current
> snapshots; it isn't documented yet, and hasn't had much testing. I use it
> currently for another purpose: Load-sharing of the updates.
>
>
> Finally there are the various Hobbit tasks: The display, the network
> tests, the alerts.
>
> The display tasks are very easily distributed to multiple servers. It
> is somewhat inconvenient that static webpages are built for the
> overview pages; I want to eliminate those and have all of the webpages
> generated dynamically. But the web display does not have to be on the
> same physical server as the rest of Hobbit, so doing failover for the web
> interface is relatively simple.
>
> Alerts - the code is *almost* ready. It is based on the same principle as
> what is used for distributing the history- and RRD-files across multiple
> servers; the hobbitd_alert module runs on all of the servers - so it
> keeps track of the repeat times etc - but it only actually sends the
> alerts from one of the servers at any time.
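
[Editor's sketch of the principle described above: every server tracks
repeat times, but only one actually sends. This is not the hobbitd_alert
code. The AlertTracker class, its fields, and the "www.http" status key
are hypothetical; the 12-hour repeat comes from the example earlier in
this mail.]

```python
from dataclasses import dataclass, field

REPEAT_SECONDS = 12 * 60 * 60  # 12-hour repeat interval


@dataclass
class AlertTracker:
    is_active_sender: bool
    next_due: dict = field(default_factory=dict)  # status key -> next alert time
    sent: list = field(default_factory=list)      # (status key, time) actually sent

    def handle(self, status_key, now):
        """Process a 'still down' status; return True if an alert was sent."""
        if now < self.next_due.get(status_key, 0):
            return False  # repeat interval has not elapsed yet
        # Every server records the repeat time, whether or not it sends.
        self.next_due[status_key] = now + REPEAT_SECONDS
        if self.is_active_sender:
            self.sent.append((status_key, now))
            return True
        return False


primary = AlertTracker(is_active_sender=True)
standby = AlertTracker(is_active_sender=False)

# Both servers see the failure at t=0; only the primary sends.
primary.handle("www.http", 0)
standby.handle("www.http", 0)

# The primary drops off the net one hour later and the standby takes over.
standby.is_active_sender = True
resent_on_takeover = standby.handle("www.http", 3600)
sent_after_repeat = standby.handle("www.http", REPEAT_SECONDS)
```

Because the standby already holds the repeat time, taking over does not
trigger an immediate alert for everything that is down; the next one goes
out when the original 12-hour interval expires.
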
>
> Network tests - I've heard arguments going both ways as to whether one
> should run network tests on all servers ("it is interesting to see if the
> site is down when tested from all of our locations, or only from the
> primary location"), or on just a single server ("we want to minimize
> traffic from the monitoring systems towards the webservers"). I'm still
> thinking about how to handle this - if they run on all Hobbit servers,
> there has to be some way of choosing which test result should be used; if
> they run only on a single server I will probably use the same method for
> choosing which server runs the tests as I use to decide who gets to send
> out alerts.
>
>
>
> Regards,
> Henrik

Personally, for what I want to do, I do not care about history data.
It would be nice, but it is not required.  I simply want the
box at the remote site to take over on all of the display, testing and
paging if the master server (or network) is down.  This way I will still
get alerted if there is a problem at site 1.  And since site 2 will also
be monitored by site 1 I will be alerted if there is a problem at either
site.

It would be nice in the future to have the historical data in sync on
both boxes, but that is not important to me at this point.

-Trever