[Xymon] xymon master-slave server

J.C. Cleaver cleaver at terabithia.org
Thu Apr 7 00:04:39 CEST 2016



On Tue, April 5, 2016 10:59 am, eli wrote:
> I am planning to build a secondary xymon server as a backup. Is there a good
> method to sync between them so that they don't both send alerts at the same
> time? If anyone has implemented this already, I would like to hear feedback.
> Thanks, Eli

There are a few different strategies one can use here, all depending on
what kind of internal SLA you're expecting, how much bandwidth you're able
to use, and whether you need identical systems or not.


The simplest solution (but the most bandwidth-intensive) is to run the two
servers as stacks in parallel and simply not alert on the secondary one.
You can use Linux-HA or any of the more advanced cluster packages to
determine up/down status between the two boxes and take over when the
other isn't reachable. You can configure your clients to send reports to
both xymon servers at the same time ($XYMONSERVERS), and you've in effect
got two complete systems. (xymond_distribute can be used to pass
disable/enable messages over to the other server.) The drawbacks are a)
double the bandwidth use, and b) losing your acknowledgements and alert
suppression when the failover occurs.
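To make the parallel-stacks idea concrete, here is a minimal Python sketch of what "send every report to both servers" amounts to. In practice the Xymon client does this for you when $XYMONSERVERS lists both servers, so this is only an illustration. It assumes the standard Xymon port 1984, a plain-text message delivered over a single TCP connection, and the classic convention of writing dots in the hostname as commas. The helper names (status_message, send_tcp, send_to_all) are hypothetical, not part of Xymon.

```python
import socket
from typing import Callable, Dict, Iterable

XYMON_PORT = 1984  # assumed: the standard Xymon daemon port


def status_message(host: str, test: str, color: str, text: str) -> str:
    """Build a Xymon-style 'status' message; dots in the hostname become commas."""
    return f"status {host.replace('.', ',')}.{test} {color} {text}\n"


def send_tcp(server: str, message: str, timeout: float = 5.0) -> None:
    """Deliver one message to one server over a plain TCP connection."""
    with socket.create_connection((server, XYMON_PORT), timeout=timeout) as sock:
        sock.sendall(message.encode())
        sock.shutdown(socket.SHUT_WR)  # tell the server we're done writing


def send_to_all(
    servers: Iterable[str],
    message: str,
    send: Callable[[str, str], None] = send_tcp,
) -> Dict[str, bool]:
    """Send the same message to every server; one failure doesn't stop the rest."""
    results: Dict[str, bool] = {}
    for server in servers:
        try:
            send(server, message)
            results[server] = True
        except OSError:  # connection refused, unreachable, timeout, ...
            results[server] = False
    return results
```

Collecting a per-server result instead of stopping at the first error matters here: a dead primary shouldn't keep reports from reaching the secondary, which is the whole reason for running two.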

Alternatively, you can keep the second server on cold standby, regularly
rsyncing the checkpoint files for both xymond and xymond_alert over from
the primary. This has the advantage of the secondary system not being in
use when it doesn't need to be. When failover happens, you start up xymond
on the slave; it reads the last checkpoint you copied over (I'd advise
increasing the frequency to something like every few minutes, depending on
your needs) and starts from there. The drawback is that you don't have
graphing/history on the standby at all, and you're missing the last few
minutes of changes.
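A rough Python sketch of the cold-standby sync, assuming the standby pulls from the primary with rsync over ssh and that pull_checkpoints() is run from cron every few minutes. The checkpoint paths and all function names are hypothetical; look up where your own xymond and xymond_alert write their checkpoint files. checkpoint_age() gives a quick answer to "how many minutes of changes would I lose if I failed over right now?"

```python
import os
import subprocess
import time
from typing import List, Optional

# HYPOTHETICAL paths: check where your xymond and xymond_alert
# actually write their checkpoint files before using this.
CHECKPOINT_FILES = [
    "/var/lib/xymon/tmp/xymond.chk",
    "/var/lib/xymon/tmp/alert.chk",
]


def rsync_command(primary: str, remote_path: str, local_dir: str) -> List[str]:
    """Build an rsync command that pulls one checkpoint file from the primary."""
    # -a (archive) keeps the primary's modification time on the local copy
    return ["rsync", "-a", f"{primary}:{remote_path}", local_dir]


def pull_checkpoints(primary: str, local_dir: str) -> None:
    """Copy every checkpoint file; raises CalledProcessError if rsync fails."""
    for path in CHECKPOINT_FILES:
        subprocess.run(rsync_command(primary, path, local_dir), check=True)


def checkpoint_age(path: str, now: Optional[float] = None) -> float:
    """Seconds since the checkpoint was last written (roughly, the state you'd lose)."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)
```

Because rsync -a preserves modification times, the age reflects when the primary last wrote the checkpoint rather than when the copy arrived, which is the number that matters for the data-loss window.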

If you have plenty of network bandwidth available to you, something like
DRBD can be used to perform a complete synchronization of *saved* data
between the servers.


Henrik had proposed a Xymon Swarm concept at
http://lists.xymon.com/pipermail/xymon/2015-November/042684.html , which
may also help you evaluate your site's needs.


Really, there are lots of different ways to conceptualize "high
availability" for your monitoring system... I'd advise keeping things as
simple as possible so as to eliminate points of failure. In our case,
we've had two live stacks running in parallel that (mostly) submit into a
ticket system. The ticket system de-dupes incoming host+svc alerts
automatically, which mostly defines the problem out of existence. The
things emailed directly to us were lower-frequency, so we were fine with
duplicate emails at first. When that got to be too annoying, we left
xymond_alert enabled on the second system but used Linux-HA to simply
disable Postfix on whichever node wasn't master. The startup script was
also modified to clear out the local outbound queue before starting
Postfix, so that when a failover occurred, the new master didn't send out
the mail that had piled up while it was secondary. Xymon thus never had
to care about whether it was primary or secondary at all.
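The ticket-system de-dupe can be pictured with a small Python sketch. It assumes a simple time-window rule (pass on the first alert for a given host+svc pair, drop repeats of that pair for the next window_s seconds); real ticket systems more often de-dupe against an already-open ticket, so treat the window as a stand-in. AlertDeduper is a hypothetical name, not part of Xymon or any ticketing product.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class AlertDeduper:
    """Collapse duplicate host+svc alerts arriving from parallel Xymon stacks.

    An alert is passed on only if no alert for the same (host, svc) pair
    has been passed on within the last `window_s` seconds.
    """

    window_s: float = 600.0
    _last_sent: Dict[Tuple[str, str], float] = field(default_factory=dict, init=False)

    def accept(self, host: str, svc: str, now: float) -> bool:
        """Return True if this alert should be passed on, False if it's a duplicate."""
        key = (host, svc)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_s:
            return False  # e.g. the second stack's copy of the same alert
        self._last_sent[key] = now
        return True
```

Since the key is only (host, svc), it doesn't matter which of the two Xymon stacks an alert came from, which is what lets both stacks alert freely without either one knowing its role.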


Hope that helps a little bit.


Regards,
-jc



