[Xymon] Xymon server (4.3.27) occasionally assigns reports to the wrong host
Axel Beckert
beckert at phys.ethz.ch
Thu Jun 16 17:02:43 CEST 2016
Hi,
On Wed, Jun 15, 2016 at 02:36:14PM -0700, J.C. Cleaver wrote:
> On Wed, June 15, 2016 9:30 am, Christoph Berg wrote:
> > Fwiw, I've seen instances of such behavior ever since I've started
> > taking care of a hobbit installation at a customer site in late
> > 2007.
Oops. I never knowingly ran into that before -- and I've been running
Hobbit/Xymon servers since about 2007, too.
> > Symptoms are randomly mixed up hosts. I can say if there are tests
> > that are hit more than others, the problem is mostly visible through
> > disk tests by finding rrd files on disk for partitions that do not
> > exist on this host.
I remember having seen misnamed rrd files in the past, but that was more
like bit flips in the test names. (We get some data from some
embedded devices which were buggy in the beginning and occasionally
sent garbled messages to our Xymon server.) But I never noticed device
or path names which don't fit the machine. So probably not the same
thing.
> > It doesn't seem to happen constantly, but rather in bursts,
That explains why it seemed to coincide with my upgrade to 4.3.27,
although I then found cases from a few days earlier, too.
> > but I don't have hard data on that. My impression was that it only
> > happens during busy periods, but that could be totally wrong.
Hrm, according to xymon, that xymon server has an average load of 0.2
and only a few peaks which go over 1.0 (highest load in the RRD is
1.5).
So I wouldn't say that this server is often "busy".
I also checked the past two days: The load peaks of up to 1.4 occurred
at different times than the ones where disk tests got
misassigned. :-(
> In some cases, I've seen this and tracked it down to malformed messages
> resulting from incomplete client reports.
I can imagine that. Were the incomplete client reports from the host
they were assigned to or from the one they should have been
assigned to?
> Unfortunately, I wasn't able to track down all of them from that,
> but many correlated with periods of intense load.
Fits with what Christoph experienced, but I doubt that it's related to
load on my server. Load on the clients might be possible, though.
> The client message (well, all messages, really, but client messages might
> be more noticeable since they're the largest on a plain system) doesn't
> have an EOM indicator, so it's impossible to see if something's gotten
> truncated.
>
> This will be solved in V5 style messages (which have a size
> indicator)
Nice!
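To illustrate why a size indicator helps: below is a minimal sketch of generic length-prefixed framing. It is not the V5 wire format, which isn't described in this thread; the function names frame() and unframe() and the "ASCII length plus newline" header are assumptions made purely for illustration.

```python
def frame(payload: bytes) -> bytes:
    """Prefix the payload with its length in bytes (ASCII digits + newline).

    Hypothetical framing for illustration only, not the Xymon V5 format.
    """
    return str(len(payload)).encode("ascii") + b"\n" + payload


def unframe(data: bytes) -> bytes:
    """Return the payload, or raise ValueError if its size is not as announced."""
    header, sep, payload = data.partition(b"\n")
    if not sep:
        raise ValueError("missing length header")
    expected = int(header)  # raises ValueError if the header is not a number
    if len(payload) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(payload)}")
    return payload


message = b"[date]\nThu Jun 16 16:31:51 CEST 2016\n[clock]\nepoch: 1466087516.425434\n"
framed = frame(message)
assert unframe(framed) == message

try:
    unframe(framed[:-10])  # simulate a report that was cut off in transit
except ValueError as err:
    print("truncated:", err)
```

With such a header the receiver can reject a short message outright instead of guessing from its content whether it is complete.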
> One work-around is to add --filter=\[clock\] to:
> xymond_channel --channel=client --filter=\[clock\] xymond_client (etc)
>
> This will block partial client messages from getting further into xymond
> when they happen, at the expense of some increased CPU load on
> xymond_channel, with potential back-pressure into xymond if the message
> load is high enough.
Hrm, I'm a bit reluctant to add this since the man page says:
    --filter=EXPRESSION
        EXPRESSION is a Perl-compatible regular expression.
        xymond_channel will match the first line of each message
        against this expression, and silently drops any message that
        does not match the expression.
If I download the client data of an arbitrary host, the first line is
always empty and the second line reads "[collector:]". "[clock]" only
shows up at the very end:
---8<---
[collector:]
client <hostname>.linux linux
[date]
Thu Jun 16 16:31:51 CEST 2016
[uname]
Linux <hostname> 3.16.0-4-amd64 x86_64
[osversion]
Debian 8.5
Distributor ID: Debian
Description: Debian GNU/Linux 8.5 (jessie)
Release: 8.5
Codename: jessie
[uptime]
16:31:51 up 365 days, 28 min, 0 users, load average: 8.43, 8.46, 8.23
[who]
[df]
Filesystem 1024-blocks Used Available Capacity Mounted on
[...]
[clientversion]
Xymon version 4.3.17
[clock]
epoch: 1466087516.425434
local: 2016-06-16 16:31:56 CEST
UTC: 2016-06-16 14:31:56 GMT
--->8---
So filtering for messages containing "[clock]" seems to make sense as
the message needs to be nearly complete to contain that string.
OTOH the xymond_channel(8) man page says it only matches the first
line of the message. What's considered to be a "message"? Each block
starting with "[something]" (but then the man page would claim that it
drops all other blocks) or the whole set of data linked as "Client
data" on service status pages?
Seems to me that either way something's wrong in the man page.
Or are those data blocks shown in reverse order on the web?
Can you confirm that adding "--filter=\[clock\]" won't drop nearly all
of the valid messages?
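To make the two readings concrete, here is a small sketch of what each would do with a message shaped like the one above. It uses Python's re module rather than PCRE (the expression \[clock\] behaves the same in both), and it assumes the message looks like what the web interface shows, which may not be what xymond_channel actually sees on the channel. The function names are made up for illustration.

```python
import re

# Abridged client message modelled on the sample above; the leading empty
# line and the section order are copied from it, "myhost" is a placeholder.
SAMPLE = """
[collector:]
client myhost.linux linux
[date]
Thu Jun 16 16:31:51 CEST 2016
[df]
Filesystem 1024-blocks Used Available Capacity Mounted on
[clock]
epoch: 1466087516.425434
"""

FILTER = re.compile(r"\[clock\]")  # same expression as --filter=\[clock\]


def passes_first_line_only(message: str) -> bool:
    """Reading A: only the first line is matched (the man page wording)."""
    first_line = message.split("\n", 1)[0]
    return FILTER.search(first_line) is not None


def passes_whole_message(message: str) -> bool:
    """Reading B: the expression is searched for in the entire message."""
    return FILTER.search(message) is not None


# Simulate a truncated report: everything from "[clock]" onwards is missing.
truncated = SAMPLE[: SAMPLE.index("[clock]")]

print(passes_first_line_only(SAMPLE))   # False: a valid message would be dropped
print(passes_whole_message(SAMPLE))     # True: a valid message passes
print(passes_whole_message(truncated))  # False: the truncated message is dropped
```

Under reading A the filter would throw away every valid message, while under reading B it does what the work-around intends. Which of the two xymond_channel implements is exactly what I'd like to have confirmed.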
> Of course, not having truncated messages in the first place would be
> nice :)
:-)
Kind regards, Axel Beckert
--
Axel Beckert <beckert at phys.ethz.ch> support: +41 44 633 26 68
IT Services Group, HPT H 6 voice: +41 44 633 41 89
Departement of Physics, ETH Zurich
CH-8093 Zurich, Switzerland http://nic.phys.ethz.ch/