[hobbit] Loadbalancing Hobbit Server

Mon Feb 19 23:33:42 CET 2007

Hi Scott,

you're always asking interesting questions :-)

On Mon, Feb 19, 2007 at 03:41:47PM -0500, Scott Walters wrote:
> >* History logs and RRD files - i.e. the Hobbit modules that need to
> >store data on disk - can be distributed among multiple servers.
> >Hobbit will automatically send updates to the correct server, and
> >fetch data from the server holding it when generating webpages.
> >This also applies to the client-data logs that are stored when
> >a critical event occurs. (4.3.0)
> 
> Is this leading the way to a hot-cold or hot-hot HA setup?  I understand how
> one server could distribute jobs to a farm.  But what if the central server
> goes down?  If data is sync'ed throughout the environment how could the
> 'freshest' be guaranteed through failures?
> 
> I am looking to implement a HA Hobbit solution where 10 minutes of recovery
> is acceptable while preserving historical data.
> 
> Are you just writing 'load sharing logic' or do you plan on developing
> failover/recovery logic as well?

The immediate need I had was load sharing. But I believe it can be used
to implement failover as well. Let me explain.

A Hobbit server consists of one core daemon which has all of the current
state information, and a bunch of more-or-less stateless "task"
handlers. There's an "update the RRD files" task, an "analyze the client
data" task, a "send out the alerts" task, and a "run the network tests"
task. Plus some more, but you get the picture.

My plan is that you can have multiple servers running each of these
tasks, and you can duplicate the tasks so they run on multiple servers.
When each task is initialized, it tells Hobbit that "hey, I'm here and I
can do alerts" - and then it basically just goes to sleep until it is 
notified that now it should actually do something. So, whenever the
Hobbit server needs to hand off some action to a task, it checks what
servers can handle it and just picks one that is available.

The information about what servers are available for handling the
various tasks is contained in a small daemon running on the Hobbit
server; think of it as a kind of "Hobbit-DNS" except that it is updated
automatically.
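
To make the registration idea concrete, here is a rough sketch. It is
Python rather than Hobbit's C, it is not Hobbit code, and every name in
it (TaskRegistry, register, mark_down, pick) is made up for illustration.
Picking a random available server is just one possible policy.

```python
import random


class TaskRegistry:
    """Hypothetical in-memory directory of which servers offer which tasks."""

    def __init__(self):
        # task name -> {server name: True if currently available}
        self._servers = {}

    def register(self, task, server):
        """A task handler announces: 'I'm here and I can do <task>'."""
        self._servers.setdefault(task, {})[server] = True

    def mark_down(self, task, server):
        """Record that a server stopped answering for this task."""
        if server in self._servers.get(task, {}):
            self._servers[task][server] = False

    def available(self, task):
        """Return the sorted list of servers currently able to run the task."""
        return sorted(s for s, up in self._servers.get(task, {}).items() if up)

    def pick(self, task):
        """Pick any available server for the task (assumed policy: random)."""
        candidates = self.available(task)
        if not candidates:
            raise LookupError(f"no server available for task {task!r}")
        return random.choice(candidates)


registry = TaskRegistry()
registry.register("alerts", "serverA")
registry.register("client", "serverA")
registry.register("client", "serverB")
registry.mark_down("client", "serverA")
chosen = registry.pick("client")  # only serverB is left for this task
```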

Some tasks can run on any of the available servers. E.g. analyzing the
client data can be done on any server running the hobbitd_client module;
so it doesn't matter which of the available "client task" servers is
invoked. (Obviously, the config files must be kept in sync on the
servers, but that's why we have tools like rsync).

Some tasks store data - e.g. the RRD files. Those tasks can run on
multiple servers, BUT: For any given host, there will be only one server
holding the data. It's no good feeding the RRD updates to server A at
10:00 AM, and server B at 10:05 - because that would break the RRD data.
So if the RRD files for "www.foo.com" live on server A, and that server
crashes, then you will lose access to the RRD files for www.foo.com -
but RRD files for hosts on the other servers will still be available.
History logs are handled like RRD files. Now, you can argue that it
would be nice if you could replicate the RRD- or history-updates to
multiple servers so you would have a complete failover where you
wouldn't lose access to some of the data. If there are enough requests
it can be added - there's nothing in the design that prevents it. But
perhaps it would just be simpler to mirror those files between the
servers at regular intervals through some other program.
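
The "one server per host" rule could look roughly like the sketch below.
It is Python, not Hobbit code, and the DataLocator class, its methods,
and the "fewest hosts first" placement rule are all assumptions of mine;
the only point is that once a host is placed it stays on that server,
and its data is simply unavailable while that server is down.

```python
class DataLocator:
    """Hypothetical map from monitored host to the server holding its data."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.owner = {}    # host -> server holding its RRD/history files
        self.down = set()  # servers currently known to be unreachable

    def _place(self, host):
        """First time we see a host: give it to the least-loaded live server."""
        up = [s for s in self.servers if s not in self.down]
        if not up:
            raise LookupError("no data server available")
        counts = {s: 0 for s in up}
        for server in self.owner.values():
            if server in counts:
                counts[server] += 1
        # Ties are broken by server name so the choice is deterministic.
        self.owner[host] = min(up, key=lambda s: (counts[s], s))

    def locate(self, host):
        """Return the server for this host, or None if that server is down."""
        if host not in self.owner:
            self._place(host)
        server = self.owner[host]
        # Never re-route updates to another server: that would split the data.
        return None if server in self.down else server


locator = DataLocator(["serverA", "serverB"])
foo_server = locator.locate("www.foo.com")
bar_server = locator.locate("www.bar.com")
locator.down.add("serverA")
foo_after_crash = locator.locate("www.foo.com")
bar_after_crash = locator.locate("www.bar.com")
```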

There are some tasks that can only run on one server at a time: E.g.
the "send out alerts" task is one you wouldn't want to duplicate. So 
for this type of task, Hobbit will initially pick one server to handle 
it, and only if that server fails will it switch to another server.
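
In sketch form (Python, not Hobbit code; SingleActiveTask and the
is_alive callback are hypothetical names), the "stick with one server
until it fails" rule might be:

```python
class SingleActiveTask:
    """Hypothetical selector for a task that must run on one server at a time."""

    def __init__(self, candidates):
        self.candidates = list(candidates)
        # Initially pick the first candidate, if there is one.
        self.active = self.candidates[0] if self.candidates else None

    def current(self, is_alive):
        """Return the server to use, switching only if the active one failed."""
        if self.active is not None and is_alive(self.active):
            return self.active
        for server in self.candidates:
            if is_alive(server):
                self.active = server
                return server
        self.active = None
        return None


alive = {"serverA", "serverB"}
alerts = SingleActiveTask(["serverA", "serverB"])
first = alerts.current(alive.__contains__)
alive.discard("serverA")                    # serverA crashes
second = alerts.current(alive.__contains__)
alive.add("serverA")                        # serverA comes back
third = alerts.current(alive.__contains__)  # stays on serverB
```

Note that this sketch does not switch back when the original server
returns; whether it should is an open design choice.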

So now there's a mechanism in place for having fail-over servers for the
critical tasks, and load-balancing tasks among multiple servers. The
missing piece is to duplicate the core Hobbit server, and replicate the
information that is stored there (the current state of the system, and
the what-servers-run-what-tasks info). That's the part I haven't quite
worked out yet.

Replicating the data is fairly straightforward. Hobbit already has a
mechanism in place for saving the current state in a "checkpoint" file,
so it can restart without losing the current state info. So replication
can be done by putting in some method for requesting the checkpoint
data. Sure, you'd lose a few minutes worth of updates in case of a
failover - depending upon how often you update the standby-servers'
data from the checkpoint - but since Hobbit updates everything every 5
minutes, I don't think that will be a major issue.

The tricky part is deciding when to do the failover. My current plan is
to have a "standby" option for the backup Hobbit daemon where it just 
loads and picks up the checkpoint data from the master server at regular
intervals; once that fails it goes on-line and starts behaving like a 
regular Hobbit daemon. That would suffice for a 2-server/hot-cold setup, 
and makes matters a lot less complicated (e.g. I won't have to deal with 
the issue of deciding who has the most recent data).
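
A minimal sketch of such a standby loop follows. It is Python, not
Hobbit code: run_standby, fetch_checkpoint, and the max_failures knob
are all hypothetical, and how the checkpoint would actually be requested
from the master is left open. With max_failures=1 it matches the plan
above (go on-line on the first failed fetch).

```python
import time


def run_standby(fetch_checkpoint, interval=60.0, max_failures=1, sleep=time.sleep):
    """Poll the master for checkpoint data until fetching fails.

    Returns the last checkpoint successfully fetched (or None), which the
    caller would then use to start behaving like a regular daemon.
    """
    state = None
    failures = 0
    while True:
        try:
            state = fetch_checkpoint()
            failures = 0
        except OSError:
            # Covers connection errors and timeouts from a real fetch.
            failures += 1
            if failures >= max_failures:
                return state
        sleep(interval)


# Demo with a fake master that serves two checkpoints and then disappears.
_checkpoints = iter(["checkpoint-1", "checkpoint-2"])


def fake_fetch():
    try:
        return next(_checkpoints)
    except StopIteration:
        raise ConnectionError("master unreachable")


last_state = run_standby(fake_fetch, interval=0, sleep=lambda seconds: None)
```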

There are still a couple of murky details, like how do you get the
clients to send their data to the server that is up? One way would be
to send them a list of the available Hobbit servers whenever they send
their client reports, so they always (except the first time) have a list
of the current servers. If sending data to the first server fails, they
must try the next server in the list - if that works, then they'll get
a new list back with the new Hobbit server as the first one to try.
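
On the client side that could be sketched as below (Python, not the
actual client code; send_report and the send callback, which is assumed
to return the server list handed back by whichever server accepted the
report, are hypothetical):

```python
def send_report(report, servers, send):
    """Try each known server in order; return the new list from the first
    server that accepts the report."""
    for server in servers:
        try:
            return send(server, report)
        except OSError:
            continue  # this server is unreachable, try the next one
    raise ConnectionError("no Hobbit server reachable")


# Demo: hobbit1 is down, hobbit2 answers and puts itself first in the list.
def fake_send(server, report):
    if server == "hobbit1":
        raise ConnectionError("connection refused")
    return ["hobbit2", "hobbit1"]


known_servers = ["hobbit1", "hobbit2"]
known_servers = send_report("client data", known_servers, fake_send)
```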

Those are my ideas. Feedback is very welcome from anyone; this is a
relatively new area for me to be working with (at least from a
programmer's perspective), so any input will be appreciated.

Regards,
Henrik