[Xymon] Xymon swarm proposal

Sat Nov 28 12:37:11 CET 2015

Hi,

the recent talk on xymon-developer about rewriting xymonproxy to support 
TLS, IPv6 and other good stuff made me think about other ways of scaling 
Xymon across large installations.

Which led me to the idea of having multiple independent Xymon servers - 
a swarm, because no one Xymon server depends on the others, but they can 
cooperate.

Simply put, you have a number of independent Xymon installations. Each 
of them handles a group of servers - it could be one in each of your 
datacentres, one for each organisational unit, one in each network 
segment, or just because you have such a large installation that a 
single Xymon server cannot cope with the load (and that would be a 
really big installation, judging by the numbers I hear). This all works 
just like the Xymon you have today.

The only thing needed to make all of these independent Xymon servers 
show up as a single (virtual) Xymon installation is for the Xymon 
webpages - generated by xymongen - to show the status of all of the 
Xymon servers in the swarm. When you 
click on the detailed status log, you are transparently sent to the 
Xymon server that holds the data about that server (the URL points to 
the Xymon server handling the particular server you want to check on).

The nice thing about this is that I think it can be implemented fairly 
easily, i.e. without having to change anything fundamental in the way 
the various Xymon programs work. Which means it will also be easy to 
adapt into an existing Xymon installation, and with a good chance of not 
introducing difficult-to-troubleshoot bugs (difficult because bugs 
involving remote systems are always a headache to reproduce).

There are of course a few nitty-gritty details, e.g. "Find host" really 
should be able to search across all of the servers in the swarm. But 
those cases are few and isolated enough that they should not be much of 
a headache.

        Multiple independent Xymon servers

  * Each site runs just like today.
  * A new sites.cfg file lists the other sites (just a site ID and how
    to contact xymond there)
  * Each site UI (the static webpages from xymongen) merges data from
    all sites
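
To make the sites.cfg idea concrete, here is a minimal sketch of a parser. The file format is not decided anywhere in this proposal, so the "SITE-ID host[:port]" layout, the Site type and the parse_sites_cfg() name are all assumptions made up for illustration. The fallback to port 1984 is the standard xymond port.

```python
from typing import NamedTuple


class Site(NamedTuple):
    site_id: str
    host: str
    port: int


def parse_sites_cfg(text: str) -> list[Site]:
    """Parse a hypothetical sites.cfg: one 'SITE-ID host[:port]' per line.

    '#' starts a comment. IPv6 literals are not handled in this sketch.
    """
    sites = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        site_id, address = line.split(None, 1)
        host, _, port = address.strip().partition(":")
        # Fall back to the standard xymond port when none is given.
        sites.append(Site(site_id, host, int(port) if port else 1984))
    return sites
```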

        Advantages

  * More resilient - if one site dies, the others will remain operational
  * Less cross-site traffic (local data remain local except when needed)
  * Less load on each site (updates only go to one Xymon server)
  * Horizontally scalable

        Limitations

  * Hostnames must be unique globally. Probably not a significant problem.
  * Functions that fetch data directly from disk-files cannot be
    cross-site (rrd-files, history-logs), unless you can retrieve the
    data via a network request. In a standard Xymon installation that
    would be:
      o Availability reports
      o Event log reports (but see below)
      o Multi-host graphs, unless all of the hosts are local
  * Alerts are always handled locally

        xymongen

  * hosts.cfg file for the page layout must be merged from all sites.
    Can be a simple append-one-after-the-other (built-in) or perhaps
    allow for an externally generated hosts.cfg - if you want to have
    servers from multiple locations on one page.
  * How do we handle non-unique pagenames? Transparently prefix them
    with the remote site-ID?
  * xymondboard data is fetched from multiple sites and combined
    (appended) - handled in sendmessage()
  * CGI URLs are generated with a prefix of /SITE/ - no change
    otherwise. The local webserver then proxies /SITE/ requests to the
    remote site.
  * Should there be both a local and a global "all non-green" page?
    Maybe even a full set of local and global webpages? That would be
    easy by running xymongen twice - one for the local and one for the
    global set of pages.
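
As a rough illustration of the /SITE/ URL prefix and of one possible answer to the page-name question, the sketch below prefixes a CGI URL with the site ID and renames a remote page only when its name is already taken. Both function names are hypothetical, and the "SITE-pagename" form is just one way the prefixing could look. A prefixed name could in principle collide again; that is ignored here.

```python
def prefixed_cgi_url(site_id: str, cgi_url: str) -> str:
    """Put /SITE/ in front of an otherwise unchanged CGI URL."""
    return "/" + site_id + "/" + cgi_url.lstrip("/")


def merged_page_names(local_pages: list[str],
                      remote_pages: dict[str, list[str]]) -> list[str]:
    """Append remote page names after the local ones.

    A remote name that clashes with one already in use gets the remote
    site ID as a prefix; unique names are kept as they are.
    """
    merged = list(local_pages)
    taken = set(local_pages)
    for site_id, pages in remote_pages.items():
        for page in pages:
            name = page if page not in taken else f"{site_id}-{page}"
            taken.add(name)
            merged.append(name)
    return merged
```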

        sendmessage() function

  * No changes for sending status- or data-updates (status, combo,
    extcombo, client, data, modify)
  * Option to fetch data from multiple sites. This is already in place
    for sending to multiple Xymon servers, so we just need to combine
    the output response from multiple sites.
  * When processing host-related requests, we learn where the host is
    located. Cache this for use by various tools. Must be disk-based
    (e.g. SQLite file) so it can be shared.
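
The disk-based host location cache could be as small as a single table. This is only a sketch: the table name, column names and function names are invented here, and nothing about the schema has been decided.

```python
import sqlite3
from typing import Optional


def open_cache(path: str) -> sqlite3.Connection:
    """Open (and if needed create) the hypothetical host-to-site cache."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hostsite ("
        " hostname TEXT PRIMARY KEY,"
        " site_id  TEXT NOT NULL)"
    )
    return conn


def remember(conn: sqlite3.Connection, hostname: str, site_id: str) -> None:
    """Record which site answered for a host, replacing any older entry."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO hostsite (hostname, site_id) VALUES (?, ?)",
            (hostname, site_id),
        )


def lookup(conn: sqlite3.Connection, hostname: str) -> Optional[str]:
    """Return the cached site ID for a host, or None if it is unknown."""
    row = conn.execute(
        "SELECT site_id FROM hostsite WHERE hostname = ?", (hostname,)
    ).fetchone()
    return row[0] if row else None
```

Since the cache is an ordinary file, the CGI programs and command-line tools could all open the same path.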

        xymond

  * hostinfo requests should only answer for the local hosts. No need to
    consult the SQLite cache - no changes.

        CGI programs

  * "Find host" must be cross-site
  * Ack-alert: Suggest making it local-only. Since alerts are only
    generated locally, it makes sense to also only ack the local alerts.
  * Enable/disable only on the local site? Use the "info" page
    enable/disable (automatically local). Global enable/disable needs
    some more looking into.
  * Critical systems - would probably be nice to be able to do both a
    local and a global version.
  * Eventlog - would be nice to have both local and global, even though
    that means fetching a (large) remote logfile. Will probably require
    a new "eventlog" CGI interface for retrieving a remote logfile. It
    is probably not something we want to do on every
    critical-systems/all-nongreen webpage update. So those could keep
    the local eventlog display (as-is), and then the eventlog CGI could
    have the option of combining logs from all sites (or maybe a
    selection of sites).
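
If each site returns its eventlog already sorted by time, combining them is a plain sorted merge. The sketch below assumes the remote logs have been fetched and parsed into (timestamp, site, text) records; that record layout and the merge_eventlogs() name are assumptions for illustration, not the actual eventlog format.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    timestamp: int   # seconds since the epoch
    site_id: str
    text: str        # rest of the event line, uninterpreted


def merge_eventlogs(*logs: Iterable[Event]) -> Iterator[Event]:
    """Merge per-site event streams, each already sorted by timestamp."""
    return heapq.merge(*logs, key=lambda event: event.timestamp)
```

The merge is lazy, so a large remote logfile does not have to be held in memory all at once.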

        xymon commands

_Commands re. specific hosts_
First check via hostinfo cache (see below) if we know where the host is 
(performance optimization). If not, then simply broadcast the message to 
all sites and combine any data that is returned - there will only be 
data from one server.

  * notify
  * disable
  * enable
  * query
  * xymondlog, xymondxlog, clientlog
  * hostinfo - sendmessage() will fetch the data for us, whether from
    the local xymond or from the SQLite cache.
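
The lookup-then-broadcast logic for host-specific commands might look roughly like this. The send callable stands in for whatever sendmessage() ends up doing per site, and a plain dict stands in for the disk-based cache; both, along with the function name, are placeholders.

```python
from typing import Callable


def route_host_command(message: str, hostname: str, sites: list[str],
                       cache: dict[str, str],
                       send: Callable[[str, str], str]) -> str:
    """Send a host-specific command to the right site.

    send(site_id, message) returns the site's response, or '' if that
    site has nothing to say about the host.
    """
    site = cache.get(hostname)
    if site is not None:
        # Known location: talk to that one site only.
        return send(site, message)

    # Unknown location: ask everybody and combine whatever comes back.
    # Only the site that owns the host should return any data.
    replies = []
    for candidate in sites:
        reply = send(candidate, message)
        if reply:
            cache[hostname] = candidate
            replies.append(reply)
    return "".join(replies)
```

After the first broadcast the cache is populated, so later commands for the same host go to a single site.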

_Commands that collect data on multiple hosts_

  * xymondboard, xymondxboard - option from user whether to fetch local
    or global info. Handled in sendmessage()
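
For the board commands, the local/global option and the "combine by appending" step could be sketched as below. Again send is a stand-in for the per-site transport, and the function names and the global_scope flag are made up for the example.

```python
from typing import Callable, Iterable


def combine_responses(responses: Iterable[str]) -> str:
    """Append line-based responses, dropping empty lines."""
    lines = []
    for response in responses:
        lines.extend(line for line in response.splitlines() if line)
    return "\n".join(lines) + ("\n" if lines else "")


def fetch_board(message: str, local_site: str, sites: list[str],
                send: Callable[[str, str], str],
                global_scope: bool = False) -> str:
    """Fetch board data from the local site only, or from all sites."""
    targets = sites if global_scope else [local_site]
    return combine_responses(send(site, message) for site in targets)
```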

_Commands that only work locally_

  * ghostlist
  * drop
  * rename
  * schedule. If done via web i/f it becomes automatically transparent,
    but not for scripts. Probably only used for
    disable/enable/drop/rename so makes most sense to do it locally.
    Doing global would have to parse the message to detect which host it
    is about.

Comments are very welcome.

Regards,
Henrik
