[hobbit] dual hobbit system config question

Kauffman, Tom KauffmanT at nibco.com
Thu Jan 27 18:44:49 CET 2005


Yes - my config has bbnet/bbpager/bbdisplay all on the same system, and
the D/R system is bbdisplay with bbnet/bbpager failover running.

I would like to just have one BBNET test system running at a time.

I was planning on detecting the failure of the primary BBNET by having a
script on the D/R system look in the system log for the connections made
by the ftp or sshd tests. If I don't see entries like these:

Jan 27 12:25:06 elkhart sshd[19099]: Did not receive identification string from ::ffff:10.8.254.11
Jan 27 12:25:06 elkhart pure-ftpd: (?@whq-bbd.nibco.com) [INFO] New connection from whq-bbd.nibco.com
Jan 27 12:25:06 elkhart pure-ftpd: (?@whq-bbd.nibco.com) [INFO] Logout.

Then the network tests aren't running (elkhart is the failover, whq-bbd
is the primary). The nice part about pulling these from the log is that
they are time-stamped for my convenience.
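
A minimal sketch of such a log check, in Python. The function names, the
10-minute window, and the idea of matching on the primary's hostname are
assumptions for illustration, not something from the setup described
above. Syslog timestamps carry no year, so the sketch borrows the year
from the current time and will misjudge entries across a year boundary.

```python
import re
from datetime import datetime, timedelta

# Matches syslog lines such as:
#   Jan 27 12:25:06 elkhart pure-ftpd: (?@whq-bbd.nibco.com) [INFO] Logout.
SYSLOG_LINE = re.compile(r"^(\w{3})\s+(\d{1,2}) (\d{2}:\d{2}:\d{2}) \S+ (.*)$")


def last_probe_time(lines, primary, year):
    """Return the timestamp of the newest line mentioning `primary`, or None."""
    newest = None
    for line in lines:
        m = SYSLOG_LINE.match(line)
        if not m or primary not in m.group(4):
            continue
        # Month names are parsed with %b, so an English/C locale is assumed.
        stamp = datetime.strptime(
            f"{year} {m.group(1)} {m.group(2)} {m.group(3)}",
            "%Y %b %d %H:%M:%S",
        )
        if newest is None or stamp > newest:
            newest = stamp
    return newest


def primary_bbnet_alive(lines, primary, now, max_age=timedelta(minutes=10)):
    """True if `primary` left a log entry within `max_age` before `now`."""
    last = last_probe_time(lines, primary, now.year)
    return last is not None and now - last <= max_age
```

Called with the lines of the log file, the string "whq-bbd.nibco.com",
and datetime.now(), it returns False once the primary has stopped
probing for longer than the window.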



-----Original Message-----
From: Henrik Stoerner [mailto:henrik at hswn.dk] 
Sent: Thursday, January 27, 2005 11:42 AM
To: hobbit at hswn.dk
Subject: Re: [hobbit] dual hobbit system config question

On Thu, Jan 27, 2005 at 10:21:50AM -0500, Kauffman, Tom wrote:
> I currently run two BBDisplay systems, with the second serving as a
> BBNET and BBPAGER failover system. It also serves as a document server
> and our computer room X server. We recover this system at our D/R
> hotsite, and part of the recovery scripting in place doctors the
> bb-hosts file on this and all other recovered systems to set this up as
> the D/R BB system.
> 
> I see failover (as such) isn't in the initial cut of hobbit. What
> problems am I setting myself up for if I allow the network tests to run
> on both systems? I can handle the redundant paging by kludging my own
> 'fallover' script to rename my paging script on the fallover, but it
> looks like network testing would be harder to fake out. (I've only got
> about 200 devices I'm testing)
>
> And I want both systems to have matching LARRD graphs :-) which is the
> reason I set things up as is.


As I understand your description, the two BBDISPLAY servers are
normally running in parallel, each with its own set of data for
webpages, RRD files, history logs etc. - correct?

Also, is the primary BBNET server running on the same box as the
primary BBDISPLAY server, or on a separate system?


If BBDISPLAY and BBNET functions are combined on the same server both
at the main site and the D/R site, then my suggestion is pretty
simple: Just run the two sites completely in parallel, with each of
the BBNET servers reporting to "their own" BBDISPLAY server. The only
downside of this is that the measurements on the two sites might
disagree! So there is no guarantee that you'll have identical
displays at the two sites.

On the D/R server you just disable the "[bbpage]" task in
hobbitlaunch.cfg - that way, no alarms get sent.

You can sync the bb-hosts and hobbit-alerts.cfg files between the two
servers without any problems; Hobbit doesn't use the "BBDISPLAY",
"BBNET" or "BBPAGER" tags in the bb-hosts file at all. 

Handling failover in this situation means somehow getting the
D/R server to detect that Hobbit is down on the primary server, and
then starting up the hobbitd_alert task to process alerts.
You'll probably need to write a script that periodically
either checks the primary server directly, or runs
   bb 127.0.0.1 "query primaryserver.bbd"
and takes action when that status changes. 
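
A rough Python sketch of that polling approach. It assumes the "bb"
client is on the PATH and that the reply to a "query" message is the
first line of the status log, with the color as its first word; the
names query_color and StatusWatcher are made up for this example.

```python
import subprocess


def query_color(host_test, server="127.0.0.1", run=subprocess.run):
    """Run `bb SERVER "query HOST.TEST"` and return the first word of the reply.

    Assumption: the first word of the reply is the status color.
    Returns None if the reply is empty.
    """
    result = run(
        ["bb", server, f"query {host_test}"],
        capture_output=True,
        text=True,
        timeout=30,
    )
    words = result.stdout.split()
    return words[0].lower() if words else None


class StatusWatcher:
    """Remember the last color seen and call `on_change` when it differs."""

    def __init__(self, on_change):
        self.last = None
        self.on_change = on_change

    def update(self, color):
        if color != self.last:
            previous, self.last = self.last, color
            # The very first reading only sets the baseline.
            if previous is not None:
                self.on_change(previous, color)
```

A driver would call watcher.update(query_color("primaryserver.bbd"))
every minute or so, with on_change deciding whether to start or stop the
alert task. Because the first reading is only a baseline, a D/R server
that starts up while the primary is already down needs a separate check.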

Another option is to set up a hobbitd worker module that picks up
events from the "stachg" (status change) channel, and reacts to
changes in the state of the primary server. That might be a more
elegant solution - see the "hobbitd_sample.c" file in the Hobbit
sources for an example of how to write a worker module.


If you only want the BBNET function running on one server at a time
(e.g. because you must have identical displays at the two sites), 
or you have the BBNET server running on a different system at the
primary site, then the situation becomes a bit more complex.

The main site where BBNET normally runs would of course be configured
to send the results to both of the BBDISPLAY servers. As described
above, you can handle failover of the alert function by having the
hobbitd_alert module turned off normally, and enabling it if the
primary BBPAGER goes down.

The problem as I see it would be how the D/R server detects that the
primary BBNET server has failed. One possibility would be to enable
the bbtest-net "--report" option; this makes bbtest-net send in a
report about itself as the last status report from one cycle of the
network tests. Something on the D/R server could monitor when this
status was last received, and if it becomes too old it would then fire
up the network testing task on the D/R server. If the status-report
lifetime that bbtest-net generates was set to e.g. 10 minutes, then
this could trigger when the status-report turned purple - so you could
handle it with the same module that keeps an eye on changes in the
primary BBDISPLAY server status.
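
The decision itself could be as small as the following Python sketch.
The task labels "bbpage" and "bbnet" and the choice of which colors
count as "down" are assumptions made for the example; how the tasks
actually get started on the D/R server is left out.

```python
# Colors treated as "primary hobbitd is not answering"; this is an assumption.
DOWN_COLORS = {"red", "purple"}


def tasks_to_enable(primary_bbd_color, bbtest_report_color):
    """Return the D/R tasks to switch on, as a set of task labels."""
    tasks = set()
    if primary_bbd_color in DOWN_COLORS:
        # Primary display/pager looks dead: take over alerting.
        tasks.add("bbpage")
    if bbtest_report_color == "purple":
        # The bbtest-net self-report has expired: take over network tests.
        tasks.add("bbnet")
    return tasks
```

Keeping the two conditions separate means the D/R server can take over
network testing alone when only the primary BBNET has stopped.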



I haven't yet settled on how to handle failover in Hobbit. I imagine
that it means implementing a special "heartbeat" message that the
servers send to each other - the BBDISPLAY servers exchange one,
and the BBNET servers send one to the BBDISPLAY servers to inform them
that they are alive. One of the BBDISPLAY servers then acts as an
arbitrator, and decides which tasks must run where, and announces this
to all of the servers, who then start or stop tasks as needed. If the
arbitrator crashes, another BBDISPLAY server will have to take over.
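
To make the idea concrete, here is a purely hypothetical Python sketch
of the arbitrator's bookkeeping. Nothing like this exists in Hobbit; the
class, the timeout, and the per-task preference lists are all invented
for illustration, and election of a new arbitrator is not covered.

```python
from dataclasses import dataclass, field


@dataclass
class Arbitrator:
    # Seconds without a heartbeat before a server counts as dead.
    timeout: float
    # Server name -> time of the last heartbeat received from it.
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, server, now):
        """Record a heartbeat from `server` at time `now`."""
        self.last_seen[server] = now

    def alive(self, now):
        """Return the sorted names of servers heard from within the timeout."""
        return sorted(
            s for s, t in self.last_seen.items() if now - t <= self.timeout
        )

    def assign(self, preferences, now):
        """Map each task to the first live server in its preference list.

        A task maps to None when none of its candidate servers is alive.
        """
        live = set(self.alive(now))
        return {
            task: next((s for s in prefs if s in live), None)
            for task, prefs in preferences.items()
        }
```

With preferences such as {"bbnet": ["whq-bbd", "elkhart"]}, the task
stays on whq-bbd while its heartbeats arrive and moves to elkhart once
they stop for longer than the timeout.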


> (We monitor connectivity to RF antennas used for bar-code scanning
> applications and our motley assortment of oracle and SAP instances on a
> mob of AIX systems).

Sounds like a lot of custom scripts are in use :-)


Henrik





