[Xymon] Re-trigger alert

J.C. Cleaver cleaver at terabithia.org
Thu Oct 20 19:38:14 CEST 2016



On Wed, October 19, 2016 10:24 am, Scot Kreienkamp wrote:
> Hi everyone,
>
> I'm creating a test for failed jobs in one of our systems.  The monitor
> will turn red on the failed job, then stay red until the failed job is
> resolved.  The problem is that if a second job were to fail no additional
> notification would go out from xymon for the second job failure because
> the test is already red.  Is there any way to force a resend of the alert
> short of going red-green-red again?  I send recovery alerts, so I don't
> want to go the red-green-red route as that could give some users receiving
> the alert the impression that it was already resolved.
>
> Thanks!
>
> Scot Kreienkamp | Senior Systems Engineer | La-Z-Boy Corporate


This is, ultimately, one of the larger problems with the current paradigm.
We've actually got three different ways of conceptualizing "sub-test"
status issues like this, but none of them alone covers everything that's
needed.

- "status/group:foobar" tags on status messages, parsed by xymond_alert
- 1 or more "&red foobar - Foobar is DOWN" lines at the beginnings of
status messages
- "modify hostname.testname red foobar Please go check Foobar" messages
which can be sent after the fact
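
For anyone who hasn't used all three, here's a rough sketch of what the
message text looks like in each case. The helper names, the hostname and
the "jobs" test name are made up for illustration; only the message
shapes come from the list above. I'm also assuming the usual convention
of writing dots in the hostname as commas, including for "modify".

```python
def xymon_name(hostname, testname):
    # Assumption: dots in the hostname are written as commas on the wire.
    return hostname.replace(".", ",") + "." + testname


def status_with_group(hostname, testname, color, group, text):
    # Form 1: a "/group:NAME" tag on the status keyword, for xymond_alert.
    return f"status/group:{group} {xymon_name(hostname, testname)} {color} {text}"


def status_with_subtest_lines(hostname, testname, color, summary, subtests):
    # Form 2: one "&color name - detail" line per sub-test, placed right
    # after the first line of the status message.
    lines = [f"&{c} {name} - {detail}" for c, name, detail in subtests]
    header = f"status {xymon_name(hostname, testname)} {color} {summary}"
    return "\n".join([header] + lines)


def modify(hostname, testname, color, source, cause):
    # Form 3: a "modify" message sent after the fact; "source" identifies
    # the sender and "cause" is the one-line explanation.
    return f"modify {xymon_name(hostname, testname)} {color} {source} {cause}"


if __name__ == "__main__":
    print(status_with_group("app1.example.com", "jobs", "red",
                            "foobar", "Foobar is DOWN"))
    print(modify("app1.example.com", "jobs", "red",
                 "foobar", "Please go check Foobar"))
```

The group tag only does anything if the alert rules in alerts.cfg match
on that group name.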

Long term, I'd been considering that it might make the most sense to
combine all of these into the same namespace. A parallel issue occurs
with, say, free memory and swap usage both being in the memory test, or a
vital partition filling up after a partition you don't care about that
much has already gone red.


In 4.4 via https://sourceforge.net/p/xymon/code/7834/ (and the Terabithia
RPMs), sending a spurious "modify" message, or the expiration of an
existing one, will cause another page message to be transmitted to
xymond_alert even if the color hasn't actually changed. However, if you're
still within the REPEAT interval, that won't clear the time delay.
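
A minimal sketch of what that could look like from the job checker's
side: when an additional job fails while the test is already red, send a
"modify" keyed on the job. The function names, the "job-ID" source naming
and the server name are hypothetical. The sender just writes the message
to the Xymon port (1984 by default) and closes its side of the
connection. Whether an alert actually goes out still depends on running a
build with the change above and on the REPEAT setting.

```python
import socket


def build_job_modify(hostname, testname, job_id, cause):
    # Hypothetical convention: one "modify" source per failed job, so each
    # new failure reaches xymond as its own modification.
    host = hostname.replace(".", ",")  # assumed comma form of the hostname
    return f"modify {host}.{testname} red job-{job_id} {cause}"


def send_xymon(message, server, port=1984, timeout=10.0):
    # Plain TCP send: write the message, then shut down the write side so
    # the server sees end-of-message.
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(message.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)


if __name__ == "__main__":
    msg = build_job_modify("app1.example.com", "jobs", "4711",
                           "Job 4711 failed")
    print(msg)
    # send_xymon(msg, "xymon.example.com")  # uncomment with a real server
```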


There's definitely a need for further improvements in deciding how to
handle "sub-status" configs like this when the things that could go wrong
are handled by different groups or have different "let it stay red for a
while" tolerances.

In the very short term, I'd explore the GROUP configurations in alerts.cfg
and in the message you're returning. Barring that, it might be easiest to
simply transmit two separate test results.
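
As a rough illustration of the separate-results route, something like
this would report each job as its own test, so a second failure turns a
different column red and alerts on its own. The "job_" prefix, the job
names and the hostname are invented for the example.

```python
def per_job_statuses(hostname, jobs):
    """Build one status message per job.

    jobs maps a job name to True (succeeded) or False (failed).
    """
    host = hostname.replace(".", ",")  # assumed comma form of the hostname
    messages = []
    for name, ok in sorted(jobs.items()):
        color = "green" if ok else "red"
        state = "OK" if ok else "FAILED"
        # Hypothetical test naming: one "job_<name>" column per job.
        messages.append(f"status {host}.job_{name} {color} Job {name} {state}")
    return messages


if __name__ == "__main__":
    results = {"nightly": False, "billing": True}
    for message in per_job_statuses("app1.example.com", results):
        print(message)
```

The obvious cost is one column per job, which gets unwieldy if the job
list is long or changes often.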


HTH,
-jc



