[hobbit] dual hobbit system config question

Henrik Stoerner henrik at hswn.dk
Thu Jan 27 17:42:04 CET 2005

On Thu, Jan 27, 2005 at 10:21:50AM -0500, Kauffman, Tom wrote:
> I currently run two BBDisplay systems, with the second serving as a
> BBNET and BBPAGER failover system. It also serves as a document server
> and our computer room X server. We recover this system at our D/R
> hotsite, and part of the recovery scripting in place doctors the
> bb-hosts file on this and all other recovered systems to set this up as
> the D/R BB system.
> I see failover (as such) isn't in the initial cut of hobbit. What
> problems am I setting myself up for if I allow the network tests to run
> on both systems? I can handle the redundant paging by kludging my own
> 'fallover' script to rename my paging script on the fallover, but it
> looks like network testing would be harder to fake out. (I've only got
> about 200 devices I'm testing)
> And I want both systems to have matching LARRD graphs :-) which is the
> reason I set things up as is.

As I understand your description, the two BBDISPLAY servers are
normally running in parallel, each with their own set of data for
webpages, RRD files, history logs etc. - correct?

Also, is the primary BBNET server running on the same box as the
primary BBDISPLAY server, or a separate system?

If BBDISPLAY and BBNET functions are combined on the same server both
at the main site and the D/R site, then my suggestion is pretty
simple: Just run the two sites completely in parallel, with each of
the BBNET servers reporting to "their own" BBDISPLAY server. The only
downside of this is that the measurements on the two sites might
disagree! So there is no guarantee that you'll have identical
displays at the two sites.

On the D/R server you just disable the "[bbpage]" task in
hobbitlaunch.cfg - that way, no alarms get sent.

You can sync the bb-hosts and hobbit-alerts.cfg files between the two
servers without any problems; Hobbit doesn't use the "BBDISPLAY",
"BBNET" or "BBPAGER" tags in the bb-hosts file at all. 

Handling failover in this situation means somehow getting the
D/R server to detect that Hobbit is down on the primary server, and
then starting up the hobbitd_alert task to process alerts.
You'll probably need to write a script that periodically
either checks the primary server itself, or runs
   bb "query primaryserver.bbd"
and takes action when that status changes.
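
As a rough illustration of the polling approach, here is a minimal Python
sketch. It rests on several assumptions that are not in the message above:
that the first word of the query response is the status color, that "red"
means the primary is down, that the bb client is given a recipient address
(127.0.0.1 is only a placeholder), and that alerts should be switched off
again once the primary recovers. The function names are hypothetical, and
actually starting or stopping the alert task is left to the caller.

```python
import subprocess

# Assumption: the query response starts with the status color.
DOWN_COLORS = {"red"}
UP_COLORS = {"green"}


def primary_state(query_output):
    """Classify the response as 'down', 'up' or 'unknown'."""
    words = query_output.split()
    if not words:
        return "unknown"          # no response: do not guess
    color = words[0].lower()
    if color in DOWN_COLORS:
        return "down"
    if color in UP_COLORS:
        return "up"
    return "unknown"              # yellow, purple, etc. are left alone here


def next_action(query_output, alerts_enabled):
    """Decide what the D/R server should do with its alert task."""
    state = primary_state(query_output)
    if state == "down" and not alerts_enabled:
        return "start-alerts"
    if state == "up" and alerts_enabled:
        return "stop-alerts"      # fail-back is an assumption of this sketch
    return "none"


def query_primary(command=("bb", "127.0.0.1", "query primaryserver.bbd")):
    """Run the query command; return its output, or '' if it fails."""
    try:
        result = subprocess.run(list(command), capture_output=True,
                                text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return result.stdout
```

A cron job or a small loop could call next_action(query_primary(), ...)
every few minutes and enable or disable the alert task accordingly.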

Another option is to set up a hobbitd worker module that picks up
events from the "stachg" (status change) channel, and reacts to
changes in the state of the primary server. That might be a more
elegant solution - see the "hobbitd_sample.c" file in the Hobbit
sources for an example of how to write a worker module.

If you only want the BBNET function running on one server at a time
(e.g. because you must have identical displays at the two sites), 
or you have the BBNET server running on a different system at the
primary site, then the situation becomes a bit more complex.

The main site where BBNET normally runs would of course be configured
to send the results to both of the BBDISPLAY servers. As described
above, you can handle failover of the alert function by having the
hobbitd_alert module turned off normally, and enabling it if the
primary BBPAGER goes down.

The problem as I see it would be how the D/R server detects that the
primary BBNET server has failed. One possibility would be to enable
the bbtest-net "--report" option; this makes bbtest-net send in a
report about itself as the last status report from one cycle of the
network tests. Something on the D/R server could monitor when this
status was last received, and if it becomes too old it would then fire
up the network testing task on the D/R server. If the lifetime of the
status report that bbtest-net generates were set to e.g. 10 minutes,
then this could trigger when the status report turned purple - so you
could handle it with the same module that keeps an eye on changes in
the primary BBDISPLAY server status.
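
The age check on the bbtest-net self-report could look roughly like the
Python sketch below. How the D/R server learns when the report was last
received is not covered; the function names are hypothetical, the 10-minute
default simply mirrors the example lifetime above, and treating a report
that has never been seen as stale is an assumption.

```python
from datetime import datetime, timedelta


def report_is_stale(last_received, now, lifetime=timedelta(minutes=10)):
    """Return True if the primary's bbtest-net report is too old.

    last_received may be None when no report has ever been seen;
    this sketch treats that case as stale (an assumption).
    """
    if last_received is None:
        return True
    return now - last_received > lifetime


def should_start_network_tests(last_received, now, tests_running,
                               lifetime=timedelta(minutes=10)):
    """Start local network tests only if the report is stale
    and the tests are not already running on the D/R server."""
    return report_is_stale(last_received, now, lifetime) and not tests_running
```

For example, with a report last received at 17:00, the check stays false
at 17:05 and becomes true at 17:11.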

I haven't yet decided how to handle failover in Hobbit. I imagine
that it means implementing a special "heartbeat" message that the
servers send to each other - the BBDISPLAY servers exchange one,
and the BBNET servers send one to the BBDISPLAY servers to inform them
that they are alive. One of the BBDISPLAY servers then acts as an
arbitrator, and decides which tasks must run where, and announces this
to all of the servers, who then start or stop tasks as needed. If the
arbitrator crashes, another BBDISPLAY server will have to take over.
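
Since this design is undecided, the following Python sketch is only one
possible reading of it, not how Hobbit works. It assumes each server's last
heartbeat time is known, that servers are listed in a fixed priority order,
and that the first live BBDISPLAY server becomes the arbitrator and also
handles alerts. All names and the 120-second timeout are hypothetical.

```python
def alive(last_seen, now, timeout):
    """Return the set of servers whose heartbeat is recent enough."""
    return {name for name, ts in last_seen.items() if now - ts <= timeout}


def assign_tasks(display_servers, net_servers, last_seen, now, timeout=120.0):
    """Pick the arbitrator and decide where tasks should run.

    display_servers and net_servers are lists in priority order;
    last_seen maps server name to the time (in seconds) of its last
    heartbeat. A value of None means no live candidate was found.
    """
    up = alive(last_seen, now, timeout)
    arbitrator = next((s for s in display_servers if s in up), None)
    net_tester = next((s for s in net_servers if s in up), None)
    return {
        "arbitrator": arbitrator,
        "alerts": arbitrator,          # assumption: arbitrator sends alerts
        "network_tests": net_tester,
    }
```

With this rule, the D/R server takes over as soon as the main site's
heartbeat is older than the timeout, and control returns to the main site
when its heartbeats resume.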

> (We monitor connectivity to RF antennas used for bar-code scanning
> applications and our motley assortment of oracle and SAP instances on a
> mob of AIX systems).

Sounds like a lot of custom scripts are in use :-)

