[Xymon] Spurious purple messages

Colin Coe colin.coe at gmail.com
Wed Sep 16 08:26:07 CEST 2015


Hi all

The date/time is set correctly:
---
timedatectl
      Local time: Wed 2015-09-16 14:23:45 AWST
  Universal time: Wed 2015-09-16 06:23:45 UTC
        RTC time: Wed 2015-09-16 06:23:42
        Timezone: Australia/Perth (AWST, +0800)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: n/a
---
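
As a quick sanity check on those numbers, here is a minimal sketch (plain Python, not a Xymon tool) that compares the three timestamps copied by hand from the output above. The variable names are just for illustration.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Values copied by hand from the timedatectl output above.
local = datetime.strptime("2015-09-16 14:23:45", FMT)
utc = datetime.strptime("2015-09-16 06:23:45", FMT)
rtc = datetime.strptime("2015-09-16 06:23:42", FMT)

tz_offset = local - utc   # should match the +0800 shown for Australia/Perth
rtc_drift = utc - rtc     # how far the hardware clock lags the system clock

print(tz_offset)  # 8:00:00
print(rtc_drift)  # 0:00:03
```

An 8 hour offset and a 3 second RTC lag both look unremarkable, which fits with the clock not being the obvious culprit.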

fping reports "host is alive", and ping returns its normal successful
output.
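
Going by Vernon's note quoted below (a purple needs about 30 minutes of "no report", so look 30-35 minutes before the emails), the period worth checking in the logs can be worked out from the 13:45 alert time. A rough sketch, assuming those figures; the function name and the date are made up for illustration:

```python
from datetime import datetime, timedelta

def purple_window(alert_time, min_silence=30, max_silence=35):
    """Return (start, end) of the period to inspect before a purple alert.

    min_silence/max_silence are minutes, taken from the 30-35 minute
    rule of thumb in the thread, not from any Xymon configuration.
    """
    start = alert_time - timedelta(minutes=max_silence)
    end = alert_time - timedelta(minutes=min_silence)
    return start, end

# The time of day comes from the thread; the date is only illustrative.
alert = datetime(2015, 9, 16, 13, 45)
start, end = purple_window(alert)
print(start.strftime("%H:%M"), "to", end.strftime("%H:%M"))  # 13:10 to 13:15
```

So for alerts at 13:45, whatever goes wrong would be expected to start somewhere around 13:10 to 13:15.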


Does anyone else have any ideas on this? I really don't want to have to
blow this server away and start again...

Thanks

On Tue, Sep 15, 2015 at 11:44 PM, Ribeiro, Glauber
<glauber.ribeiro at experian.com> wrote:
> Could it be something with the clock on the xymon server? Maybe some cron process to synchronize to a time server?
>
> -----Original Message-----
> From: Xymon [mailto:xymon-bounces at xymon.com] On Behalf Of Colin Coe
> Sent: Monday, September 14, 2015 22:29
> To: Vernon Everett
> Cc: xymon at xymon.com
> Subject: Re: [Xymon] Spurious purple messages
>
> Hi Vernon,
>
> Yep, very interesting.  The purple messages come through every day at
> about the same time, give or take a minute or so.
>
> Yep, pings work and the normal "main view" and "all non-green view" work fine.
>
> The logs look fine.  I'd really like to get to the bottom of this...
>
> Thanks
>
> CC
>
> On Tue, Sep 15, 2015 at 10:06 AM, Vernon Everett
> <everett.vernon at gmail.com> wrote:
>> That's interesting.
>> No idea what it means, or where to go from here, but it's certainly
>> interesting.
>>
>> Does it happen the exact same time every day?
>> Have you tried a ping from the Xymon host to the client at or around the
>> time of the issue, to see if there are any oddities?
>>
>> Is there anything in the logs?
>>
>>
>> On 14 September 2015 at 15:17, Colin Coe <colin.coe at gmail.com> wrote:
>>>
>>> OK, looking at this again.  The main view looks fine, but the 'conn'
>>> test on every host is a yellow circle with a question mark (unknown)
>>> in the snapshot report view since September 4, 2015 at 13:32:42.
>>>
>>> September 4, 2015 at 13:32:41 and earlier look fine.
>>>
>>> Thanks
>>>
>>> On Sat, Sep 12, 2015 at 5:48 PM, Vernon Everett
>>> <everett.vernon at gmail.com> wrote:
>>> > Good to know it's not just me that fights with SELinux. :-)
>>> >
>>> > Now that it works, what does the snapshot report reveal at the time the
>>> > purple alerts go out?
>>> >
>>> > Purples require a "no report" for 30 minutes to trigger.
>>> > You might want to check all your logs at around 30-35 minutes before the
>>> > emails.
>>> >
>>> >
>>> >
>>> >
>>> > On 11 September 2015 at 18:13, Colin Coe <colin.coe at gmail.com> wrote:
>>> >>
>>> >> Almost...
>>> >>
>>> >> Turned out to be SELinux, my old nemesis.  :)
>>> >>
>>> >>
>>> >>
>>> >> On Tue, Sep 8, 2015 at 5:37 PM, Vernon Everett
>>> >> <everett.vernon at gmail.com>
>>> >> wrote:
>>> >> > That might be a permissions thing.
>>> >> >
>>> >> >
>>> >> >
>>> >> > On 8 September 2015 at 19:15, Colin Coe <colin.coe at gmail.com> wrote:
>>> >> >>
>>> >> >> Hi Vernon
>>> >> >>
>>> >> >> Thanks for the really good info.  The message serial numbers are
>>> >> >> different every day but the messages are sent at the same time
>>> >> >> (13:45) daily for all tests on all hosts.
>>> >> >>
>>> >> >> The network is not congested nor is the SAN under any kind of
>>> >> >> pressure.
>>> >> >>
>>> >> >> Interestingly, trying to do the snapshot report gave me "Cannot
>>> >> >> create output directory".
>>> >> >>
>>> >> >> Thanks again
>>> >> >>
>>> >> >> CC
>>> >> >>
>>> >> >> On Tue, Sep 8, 2015 at 3:56 PM, Vernon Everett
>>> >> >> <everett.vernon at gmail.com>
>>> >> >> wrote:
>>> >> >> > Hi Colin
>>> >> >> >
>>> >> >> > What do the client hosts share in common?
>>> >> >> > In the past I have seen a client overloading their storage system:
>>> >> >> > they were overflowing buffers and exceeding the storage array's
>>> >> >> > ability to process IO requests. Of course this caused general disk
>>> >> >> > latency, which slowed things down to the point of a purple flood.
>>> >> >> > There was no simple solution to that one, except to buy more
>>> >> >> > storage, which they did.
>>> >> >> >
>>> >> >> > Also, check the "serial numbers" on the messages. Is this a repeat
>>> >> >> > of an older message, in which case Xymon might have something fishy
>>> >> >> > going on, or are they new messages every day, as in it really
>>> >> >> > thinks there is a problem?
>>> >> >> >
>>> >> >> > Xymon only updates pages every 2 and 5 minutes, depending on the
>>> >> >> > page you are looking at. Meaning you could wait up to 7 minutes for
>>> >> >> > the real status to appear.
>>> >> >> > A purple takes 30 minutes to trigger.
>>> >> >> > With some unfortunate, and highly improbable timing on whatever is
>>> >> >> > triggering these events, it's possible you might not see the
>>> >> >> > purple.
>>> >> >> > Have you pulled up a "snapshot report" for the exact time of the
>>> >> >> > messages?
>>> >> >> >
>>> >> >> > Something else unlikely, but possible, is the network.
>>> >> >> > The conn test uses ping, which is ICMP.
>>> >> >> > The Xymon agent sends using TCP.
>>> >> >> > Is there anything interesting happening on the network at the
>>> >> >> > time?
>>> >> >> > Regards
>>> >> >> > Vernon
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > On 8 September 2015 at 11:39, Colin Coe <colin.coe at gmail.com>
>>> >> >> > wrote:
>>> >> >> >>
>>> >> >> >> Hi all
>>> >> >> >>
>>> >> >> >> Since Friday September 4, I've started receiving "stopped
>>> >> >> >> reporting (PURPLE)" messages for all tests on all hosts from one
>>> >> >> >> of our Xymon servers.
>>> >> >> >>
>>> >> >> >> The host status, as shown in the Main View, is green for all hosts
>>> >> >> >> and tests.  No purple at all.
>>> >> >> >>
>>> >> >> >> The "stopped reporting (PURPLE)" messages are being sent at the
>>> >> >> >> same time every day, 1:45PM.
>>> >> >> >>
>>> >> >> >> Any advice on how I should track this down?
>>> >> >> >>
>>> >> >> >> Thanks
>>> >> >> >> _______________________________________________
>>> >> >> >> Xymon mailing list
>>> >> >> >> Xymon at xymon.com
>>> >> >> >> http://lists.xymon.com/mailman/listinfo/xymon
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > --
>>> >> >> > "Accept the challenges so that you can feel the exhilaration of
>>> >> >> > victory"
>>> >> >> > - General George Patton