[Xymon] dropping/making blue checks not persistent when restarting

Sven Schuster Schuster.Sven at gmx.de
Tue May 23 11:31:56 CEST 2017


Hi Japheth,

> Hi Sven, This behavior would seem to point in the direction of the
> checkpoint file not being written out properly on shutdown, especially if
> it's working fine during the normal checkpointing process (eg, waiting 600
> seconds before the restart) and could be a latent bug (or at least a
> missing error message).

that was exactly my thought when taking a look at the source code. The
routine for writing the checkpoint file should be called at shutdown, too...

> Can you set xymond to --debug mode (or send it a -USR2 signal) and then
> shutdown/restart the process after this change? If shutting down, you can
> take a quick poke at the checkpoint file to see that it's been updated at
> the moment of shutdown? Depending on the host in question, you can also
> search for the test that should "no longer be there" (it's just a simple
> text file format).

...and indeed, it *is* called:

10410 2017-05-23 08:00:48.870364 -> save_checkpoint
10410 2017-05-23 08:00:48.963874 <- save_checkpoint

These were the last lines of the logfile when stopping xymon. Note that in
this case, I *stopped* the xymon service (to be able to take a look at the
checkpoint file while xymon is not running). The timestamp of the checkpoint
file was updated, and the test I had disabled was still disabled when I
started xymon again. Strange.

So I did some further testing. It revealed that on Debian with systemd being
used for starting/stopping services, the restart option of the default SysV
initscript isn't used. Instead, systemd calls the initscript with option
stop (which TERMs the xymonlaunch process), waits some amount of time (which
is probably given by the RestartSec or RestartUSec parameter, see
systemd.service(5)), and then calls the initscript again with option start.

It seems the time between stop and start (100ms in the local environment,
probably the default value) is not long enough for the old, terminating
xymond process to finish writing the checkpoint file (roughly 35 MB here,
with config changes and disabling/dropping of tests happening quite often
and independently). In xymond.c/save_checkpoint it turns out that the
checkpoint is first written to a temporary file with a timestamp in the
filename. That temp file is renamed to the real checkpoint file afterwards.
With that short amount of time between stopping and starting, the new
xymond process, which is starting in the meantime, apparently just reads an
old version of the checkpoint file.
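
To make the suspected race easier to picture, here is a minimal shell
sketch of the write-to-temp-then-rename pattern. It is not xymond code: the
directory, the file name xymond.chk and the file contents are made up for
illustration. It only shows that anything reading the real file before the
rename still gets the old contents.

```shell
#!/bin/sh
# Illustration only: hypothetical paths and contents, not taken from xymond.
dir=$(mktemp -d)
ckpt="$dir/xymond.chk"            # stands in for the real checkpoint file
echo "old state" > "$ckpt"        # checkpoint left over from an earlier save

# Writer side: dump the new state into a temp file with a timestamp in
# its name. The real checkpoint file is not touched yet.
tmp="$ckpt.$(date +%s)"
echo "new state" > "$tmp"

# Reader side: a process starting at this moment still sees the old state.
seen_before=$(cat "$ckpt")

# Writer side: the rename makes the new state visible under the real name.
mv "$tmp" "$ckpt"
seen_after=$(cat "$ckpt")

echo "read before rename: $seen_before"
echo "read after rename:  $seen_after"
rm -rf "$dir"
```

If the new xymond reads its checkpoint in the window before the rename, it
would behave exactly like the "before" case here.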

To solve this issue, on Linux systems using systemd one might (and of course
should ;)) use a real systemd service file with RestartSec set to a sane
amount (e.g. 1s like in the old SysV initscript).
As a quick fix I added a "sleep 1" in the initscript:

--- xymon.orig  2012-06-27 21:14:29.000000000 +0200
+++ xymon       2017-05-23 10:28:51.983171661 +0200
@@ -49,6 +49,7 @@
    "stop")
        log_daemon_msg "Stopping $DESC" "$NAME"
        start-stop-daemon --exec $DAEMON --pidfile $PIDFILE --stop --retry 5
+       sleep 1
        log_end_msg $?
        ;;


That way restarting xymon works as expected for me.
Yet that still leaves a (small) chance of that timespan not being long
enough in big installations under high load. Which in turn could just be a
hypothetical problem, as that behaviour didn't occur with the old
initscript (or at least no one noticed).
A clean solution would be to provide a way to do a clean shutdown of the
xymon server which does not return before the old processes really have
exited (however that might be implemented), so the asynchronous nature of
the current stop (sending a TERM to xymonlaunch) is not a concern anymore.
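
One way such a blocking stop could look in the initscript is a small helper
that polls until a given PID is gone, instead of sleeping for a fixed time.
This is only a sketch: the function name wait_for_exit and the 30 second
default are my own assumptions, and how to obtain the PID of the old xymond
process is left open here.

```shell
#!/bin/sh
# Sketch of a helper for an initscript; the name and default timeout are
# hypothetical. Returns 0 once the process is gone, 1 on timeout.
wait_for_exit() {
    pid=$1
    tries=${2:-30}                # maximum number of seconds to wait
    # kill -0 sends no signal, it only checks that the PID can be signalled
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$tries" -le 0 ]; then
            return 1              # still running after the timeout
        fi
        sleep 1
        tries=$((tries - 1))
    done
    return 0                      # process has exited
}
```

In the "stop" branch this would be called after start-stop-daemon with the
PID of the old xymond, and the script would only return once that call
succeeds or times out. Unlike the fixed "sleep 1", the wait is as long as the
checkpoint write actually takes.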

That's at least an explanation for the behaviour, plus possible ways of
solving it, that seems to make sense based on some tests and a few short
looks at the source, so please correct me if I'm wrong ;)


Kind regards,
Sven


> The same routine is called at shutdown as is called during the periodic
> interval checkpointing, except for the fact that we wait synchronously for
> it to complete -- precisely to avoid this type of concern, but that
> doesn't mean there isn't an issue there still.
> Regards, -jc


