[Xymon] Xymon server (4.3.27) occasionally assigns reports to the wrong host

J.C. Cleaver cleaver at terabithia.org
Wed Jun 15 23:36:14 CEST 2016


On Wed, June 15, 2016 9:30 am, Christoph Berg wrote:
> Re: Axel Beckert 2016-06-15 <20160615155816.GD29167 at phys.ethz.ch>
>> in the past few months I found more and more indications of a strange bug
>> in (at least) Xymon 4.3.27 which occasionally mixes up hosts when
>> handling reports:
>
>> * Machines with a single disk (e.g. VMs) occasionally report status of
>>   a "raid" test which is not deployed to them -- and then (for obvious
>>   reasons) went purple on it. On that server, there's only one machine
>>   having a RAID, but its "raid" reports have been misassigned to at
>>   least three other hosts, all hosts which have rather many tests
>>   (compared to a bunch of sensors which send in only very few tests
>>   per host).
> [...]
>
> Fwiw, I've seen instances of such behavior ever since I've started
> taking care of a hobbit installation at a customer site in late 2007.
> Symptoms are randomly mixed up hosts. I can't say if there are tests
> that are hit more than others; the problem is mostly visible through
> disk tests by finding rrd files on disk for partitions that do not
> exist on this host.
>
> It doesn't seem to happen constantly, but rather in bursts, but I
> don't have hard data on that. My impression was that it only happens
> during busy periods, but that could be totally wrong.
>
> We've been on 4.3.0 for a long time until finally upgrading about two
> years ago, and I thought the problem was gone then, but what Axel is
> describing is exactly what we were (are?) seeing there.
>
> Christoph

I've seen this too, and in some cases tracked it down to malformed messages
resulting from incomplete client reports. Unfortunately, I wasn't able to
trace every occurrence to that cause, but many correlated with periods of
intense load.

The client message (well, all messages, really, but client messages might
be more noticeable since they're the largest on a plain system) doesn't
have an EOM indicator, so the receiver has no way to tell whether a
message has been truncated.

This will be solved by V5-style messages (which carry a size indicator) or
by combining messages into an extcombo.
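
To illustrate why a size indicator helps, here is a minimal Python sketch of
length-prefixed framing. The framing (ASCII byte count, newline, payload) and
the names frame/unframe are made up for this example and are NOT the actual
V5 wire format; the sample message is also invented.

```python
# Hypothetical framing for illustration only: "<byte count>\n<payload>".
# This is not the Xymon V5 message format.

def frame(payload: bytes) -> bytes:
    """Prefix the payload with its length in bytes, as ASCII digits."""
    return str(len(payload)).encode("ascii") + b"\n" + payload


def unframe(data: bytes) -> bytes:
    """Return the payload, or raise ValueError if the size does not match."""
    header, sep, payload = data.partition(b"\n")
    if not sep or not header.isdigit():
        raise ValueError("missing or malformed size header")
    declared = int(header.decode("ascii"))
    if len(payload) != declared:
        raise ValueError(f"size header says {declared} bytes, got {len(payload)}")
    return payload


if __name__ == "__main__":
    # Invented sample resembling a client report.
    message = b"client host1.linux linux\n[df]\n...\n[clock]\nepoch: 1466026574\n"
    framed = frame(message)
    assert unframe(framed) == message

    # Simulate a report that lost its tail in transit.
    truncated = framed[:-20]
    try:
        unframe(truncated)
    except ValueError as err:
        print("rejected:", err)
```

With a declared size the receiver can reject a short message outright
instead of parsing whatever arrived.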

One work-around is to add --filter=\[clock\] to:
 xymond_channel --channel=client --filter=\[clock\] xymond_client (etc)

This will block partial client messages from getting further into xymond
when they happen, at the expense of some increased CPU load on
xymond_channel, with potential back-pressure into xymond if the message
load is high enough.
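
The idea behind the filter can be modelled in a few lines of Python. This is
only a sketch of the concept, not xymond_channel's code: it assumes the
filter is a regular expression that a message must match in order to be
passed on, and that a complete client report contains a [clock] section
while a report cut off earlier does not. The function names and sample
messages are invented.

```python
import re

# Same pattern as in the --filter argument: \[clock\] matches the
# literal text "[clock]".
CLOCK_SECTION = re.compile(r"\[clock\]")


def looks_complete(message: str) -> bool:
    """True if the message contains a [clock] section marker."""
    return CLOCK_SECTION.search(message) is not None


def forward_complete(messages):
    """Keep only messages that pass the filter; drop the rest."""
    return [m for m in messages if looks_complete(m)]


if __name__ == "__main__":
    complete = "client host1.linux linux\n[df]\n...\n[clock]\nepoch: 1466026574\n"
    partial = "client host2.linux linux\n[df]\n..."  # cut off before [clock]
    for msg in forward_complete([complete, partial]):
        print("forwarded:", msg.splitlines()[0])
```

Note that this is a heuristic: a message truncated after the [clock] marker
would still pass.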


Of course, not having truncated messages in the first place would be nice :)


HTH,
-jc



