[Xymon] How many times does xymonnet retry?
J.C. Cleaver
cleaver at terabithia.org
Fri Jan 8 21:13:33 CET 2016
This is correct. In some other monitoring systems (Nagios/Icinga come to
mind), there's a notion of a Hard Fail vs Soft Fail and the scheduling
system can run checks several times before a "Hard Fail" is recorded.
Because there's no discrete scheduling system (or dispatcher) within
xymon, it doesn't really have that same model, and the built-in tools like
xymonnet don't conceptualize it.
Fundamentally, you have any number of testers, each running at whatever
frequency and with whatever decision process it chooses, and xymond
simply accepts their reports (and displays/handles them) as needed.
As xymonnet runs at intervals, each run is distinct. If a target is
down/slow/hung/whatever, it's marked as such and is not tested again
during that execution.
If you add that together, though, it provides other options for
administrator-defined recurrence, such as the "xymonnet-again.sh" script,
as you've seen.
When we were migrating from a system that had been configured to retry 3
times before alerting, we realized that xymon was so much more efficient
(shameless plug ;) ) that we could lower our xymonnet interval greatly
and just make sure that 3 entire runs would complete before the "red"
alert was sent (using the DURATION value in alerts.cfg(5)).
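
The arithmetic behind that is simple. A minimal sketch, assuming runs start
on a fixed interval and the alert clock starts when the first failing run
reports; the function name and the 2-minute interval are made up for
illustration, and the DURATION> clause it prints should be checked against
alerts.cfg(5) before use:

```python
def min_duration_minutes(interval_minutes, failed_runs):
    """Minutes between the first failing run and the Nth consecutive one.

    Assumption: runs start every `interval_minutes`, and the alert clock
    starts when the first failing run reports.
    """
    if failed_runs < 1:
        raise ValueError("failed_runs must be at least 1")
    return (failed_runs - 1) * interval_minutes


# Hypothetical numbers: xymonnet every 2 minutes, alert after 3 failed runs.
minutes = min_duration_minutes(2, 3)
print(f"DURATION>{minutes}")  # candidate clause for a rule in alerts.cfg
```

In practice you would probably pad the result by however long a run takes,
so the last run has actually reported before the alert goes out.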
xymonnet-again.sh itself is somewhat basic, but you can script up any
number of additional ways of dispatching with the same concept. I have a
script on another server that queries xymond for any non-green 'dns' tests
every 10s and re-scans just those hosts with lower --timeout values.
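
A sketch of that idea, not the actual script. It assumes the xymon client
and xymonnet binaries are on the PATH with the Xymon environment already
set up (for example via xymoncmd), that xymond listens on 127.0.0.1, and
that "non-green" means red or yellow. The helper names, the 5-second
timeout and the exact xymondboard filter are illustrative and should be
checked against xymon(1) and xymonnet(1):

```python
import subprocess
import time

XYMOND = "127.0.0.1"  # assumption: xymond runs on the local host
# Assumed filter: dns tests that are currently red or yellow, hostname only.
BOARD_QUERY = "xymondboard test=dns color=red,yellow fields=hostname"


def parse_hosts(board_output):
    """Return the unique hostnames from xymondboard output, one per line."""
    hosts = []
    for line in board_output.splitlines():
        name = line.split("|")[0].strip()
        if name and name not in hosts:
            hosts.append(name)
    return hosts


def build_rescan_command(hosts, timeout_seconds=5):
    """Build an xymonnet command line limited to the given hosts."""
    return ["xymonnet", f"--timeout={timeout_seconds}"] + list(hosts)


def rescan_once():
    """Ask xymond for non-green dns tests and re-test only those hosts."""
    result = subprocess.run(
        ["xymon", XYMOND, BOARD_QUERY],
        capture_output=True, text=True, check=True,
    )
    hosts = parse_hosts(result.stdout)
    if hosts:
        subprocess.run(build_rescan_command(hosts), check=False)
    return hosts


def main(poll_seconds=10):
    """Poll forever; nothing runs until a wrapper calls main()."""
    while True:
        rescan_once()
        time.sleep(poll_seconds)
```

Keeping the parsing and command building separate from the subprocess calls
makes it easy to dry-run the logic without a live xymond.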
As above, I've found interval scanning plus an adjusted DURATION to be
simpler conceptually and to handle most of the cases that are needed. It
also sidesteps the problem of an overloaded scheduler during a crisis;
the only added cost is the extra time that failing TCP tests take to
time out anyway.
HTH,
-jc
On Fri, January 8, 2016 11:59 am, Ribeiro, Glauber wrote:
> Q: Is the number of retries significant in your business case?
>
> A: Not really, I was just trying to understand how this works to see if it
> would provide precedent for one of our custom tests, which we are adding
> retries to.
>
>
> I think I have a good idea how the retries work now. When a test fails,
> xymonnet writes information to a text file.
>
> Xymonnet-again is a simple script, which is kicked off once a minute, to
> look for that text file - if it's present, it feeds it into xymonnet. The
> file (frequenttests) is simply the command line options for the xymonnet
> run, including the names of the hosts that had failed tests (but not which
> tests failed).
>
> So theoretically, things could be retried up to 30 times.
>
>
>
>
> -----Original Message-----
> From: Xymon [mailto:xymon-bounces at xymon.com] On Behalf Of John Thurston
> Sent: Friday, January 08, 2016 12:41
> To: xymon at xymon.com
> Subject: Re: [Xymon] How many times does xymonnet retry?
>
> On 1/8/2016 9:04 AM, Ribeiro, Glauber wrote:
>> Thanks, I got that, so there is no set number of repetitions? I.e. it
>> will keep trying for 30 minutes?
>
> I see no reference to the _number_ of retries, only to the _duration_ of
> the effort.
>
> The number of retries will depend on how frequently the attempt is made
> and how long each attempt takes to fail. The first is probably
> controlled in code (and may be configurable at run time). The second is
> dependent on the protocol being tested, the behavior of the network, and
> the form of the failure.
>
> An ICMP test, for example, may reliably fail and timeout in 4 seconds.
>
> An SSH test (also handled by xymonnet) may fail in 4 seconds when it
> can't initiate a TCP connection. It may also be able to linger on for
> several minutes if a TCP connection can be established but not
> maintained.
>
> Is the number of retries significant in your business case?
>