[Xymon] purple problems

Martin Lenko lenko99 at gmail.com
Tue Sep 1 21:17:04 CEST 2015


Hi Walter,
the purple color means that the server didn't receive any status update for
that test within its LIFETIME interval (configurable; usually 30
minutes).
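As an aside, a script can extend that lifetime for an individual status by
appending a validity period to the "status" keyword. A minimal sketch (the
hostname/column pair is made up, and 'echo' stands in for the xymon binary
so it can be run by hand; xymonlaunch normally exports $XYMON and $XYMSRV
for external scripts):

```shell
#!/bin/sh
# $XYMON (path to the xymon binary) and $XYMSRV (server address) are
# normally exported by xymonlaunch; fall back to harmless stand-ins so
# this sketch runs standalone.
XYMON=${XYMON:-echo}
XYMSRV=${XYMSRV:-127.0.0.1}

# "status+2h" keeps this report fresh for 2 hours instead of the
# default 30 minutes; once that expires the column goes purple.
$XYMON "$XYMSRV" "status+2h myhost.raid green $(date) RAID ALL devices OK"
```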
There are a number of reasons why you might get that for external tests:
- The external script is not executed - I would check whether the config
file for the test contains the right paths to the script and log file,
and whether the script is executable by the user that xymon runs as.
- The external script runs but fails, so it never sends the status
message to the xymon server - check the log file for any errors. If it is
a shell script, you can print something to STDOUT at the beginning of the
script just to confirm that it runs at all. If nothing gets written, check
the permissions on the log file. Configuring a separate log file per
external test helps separate its messages from other scripts and the
xymon client itself.
- The external script doesn't contain the right path to the xymon
executable (or bb, if it is an older version from the Hobbit days), so it
fails to send the status message.
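Pulling those points together, a minimal external test script might look
something like this. This is only a sketch: the column name, log path and
the degraded-array check are my assumptions, not taken from your setup, and
'echo' stands in for the real xymon binary so the sketch runs standalone
(xymonlaunch normally exports $XYMON, $XYMSRV and $MACHINE for the scripts
it launches):

```shell
#!/bin/sh
# Fallbacks so the sketch runs outside xymonlaunch; in a real setup,
# point XYMON at the actual xymon (or bb) binary.
XYMON=${XYMON:-echo}
XYMSRV=${XYMSRV:-127.0.0.1}
MACHINE=${MACHINE:-$(hostname | tr . ,)}

COLUMN=raid
LOG=/tmp/xymon-${COLUMN}.log      # one log per test eases debugging

# Write something right away: if this line never appears, the script
# was never run (wrong path or permissions in the task config).
echo "$(date): ${COLUMN} check starting" >> "$LOG"

# A degraded Linux md array shows an underscore in its [UU]-style
# status field in /proc/mdstat; treat anything else as healthy.
if grep -q '\[U*_U*\]' /proc/mdstat 2>>"$LOG"; then
    COLOR=red   MSG="RAID device degraded"
else
    COLOR=green MSG="RAID ALL devices OK"
fi

$XYMON "$XYMSRV" "status ${MACHINE}.${COLUMN} ${COLOR} $(date) ${MSG}" >>"$LOG" 2>&1
```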

If this doesn't help you track down the issue, could you send the test
config and script?

Regards,
Martin


On 31 August 2015 at 22:09, Walter Rutherford <wlrutherford at alaska.edu>
wrote:

> Spoke too soon. Some of the systems actually have client.d/raid and
> they still aren't reporting. At least one didn't even have the directories.
>
> I guess that's one of the hazards of inheriting systems that were installed
> and/or modified by multiple people over time.
>
>
> On Mon, Aug 31, 2015 at 12:58 PM, Walter Rutherford <
> wlrutherford at alaska.edu> wrote:
>
>> Found it!
>>
>> Besides the "raid.sh" script in ext/ I needed a raid configuration in
>> etc/client.d/. I thought that was defined in another file but apparently
>> not.
>>
>> On Mon, Aug 31, 2015 at 10:53 AM, Walter Rutherford <
>> wlrutherford at alaska.edu> wrote:
>>
>>> All good questions. Hunting for the answers helped me to see some
>>> patterns I'd missed before.
>>>
>>> The xymon server hostname and IP seem to be consistent, but that's about
>>> all that is consistent.
>>> There is a separate column for 'disks' on the main webpage and it
>>> correctly shows the output from
>>> a 'df' command. The script running on the clients' sides is called
>>> "raid.sh", the comments at the top
>>> of the script indicate it is over a decade old; bb-mdstat.h based on
>>> bb-raid.sh. There's a link from
>>> /home/xymon-client/ext to /usr/share/xymon-client/ext on most systems.
>>> The directory and the
>>> scripts in it are owned by either root or xymon. Changing location,
>>> ownership, and perms to match
>>> one of the working systems hasn't helped.
>>>
>>> The broken raid reports are all from Linux boxes. The working reports
>>> look like this:
>>>
>>> *          Mon Aug 31 09:38:49 AKDT 2015 RAID ALL devices OK*
>>>
>>>
>>> *             green md0 Status OK*
>>> *             green md1 Status OK*
>>> *             green md2 Status OK*
>>>
>>> *          ============================ /proc/mdstat
>>> ===========================*
>>>
>>> *          Personalities : [raid1] *
>>> *          md0 : active raid1 sdc1[1] sda1[0]*
>>> *                511988 blocks super 1.0 [2/2] [UU]*
>>>
>>> *          md2 : active raid1 sdd[3] sdb[2]*
>>> *                536869888 blocks super 1.2 [2/2] [UU]*
>>>
>>> *          md1 : active raid1 sdc2[1] sda2[2]*
>>> *                41428924 blocks super 1.1 [2/2] [UU]*
>>> *                bitmap: 1/1 pages [4KB], 65536KB chunk*
>>>
>>> *          unused devices: *
>>>
>>> *          Run /sbin/mdadm -D /dev/md* for more info*
>>>
>>> The non-working systems either show nothing at all (that's better than
>>> purple) OR show the same
>>> three green md[0-2] devices (whether it has three raid devices or not)
>>> on a blue disabled background.
>>> So, I'm almost positive someone copied a working system incorrectly to
>>> other clients without cleaning
>>> up the foreign logs. The working systems overwrote or just aged out the
>>> incorrect information while the
>>> non-working ones just keep reporting it. I have found logs but none for
>>> this raid information. Perhaps the
>>> logs are compressed or otherwise rendered humanly unreadable.
>>>
>>> So, I copied the /usr/share/xymon-client/ext scripts from a working
>>> system to several that were reporting
>>> nothing and restarted xymon-client. Most did nothing; one is showing a
>>> "no data" indicator. The raid output looks normal except the device is
>>> md127 - perhaps the high number is confusing the script. But the
>>> wbinfo.sh script I copied at the same time to/from the same directory is
>>> now showing green. Argh!
>>>
>>> I don't even know where the xymon-client scripts running here came from
>>> so I'm reluctant (but motivated)
>>> to just rip them all out by the roots and start over from a known
>>> baseline.
>>>
>>>   WLR
>>>
>>>
>>>
>>> ==================================================================================
>>>
>>> Phil Crooker <Phil.Crooker at orix.com.au>
>>> 3:57 PM (17 hours ago)
>>>
>>> Is the hostname wrong somewhere? I'm thinking maybe the script is sending
>>> the wrong hostname, somehow....
>>>
>>>
>>>
>>> ==================================================================================
>>>
>>>
>>> Jeremy Laidman <jlaidman at rebel-it.com.au>
>>>
>>> 7:07 PM (14 hours ago)
>>>
>>>
>>> On 30 August 2015 at 14:22, Walter Rutherford <wlrutherford at alaska.edu>
>>> wrote:
>>> This is probably an old issue but I didn't see a way to search the
>>> archives.
>>>
>>> https://www.google.com/?q=site:lists.xymon.com+purple+raid
>>>
>>> Our xymon server is showing purple indicators for two of our custom
>>> scripts
>>> but only on a handful of systems.
>>>
>>> The scripts are running client-side and/or server-side?  Can you
>>> describe how the scripts work?  Are they locally-written scripts or did you
>>> get them from somewhere online?
>>>
>>> RAID checks are not standard for most Xymon clients.  I've never used or
>>> seen RAID checks.  A quick look at the source code indicates built-in
>>> support for only Linux, where "md" devices are identified in /proc/mdstat.
>>>
>>> At the bottom of the incorrect raid report page there is a
>>> link to "client data". If I follow the link I get a full report
>>> including the correct,
>>> current raid information!
>>>
>>> How is the RAID information getting into the client data?  This might
>>> not be used by your custom scripts, and so might be a red herring.  More
>>> detail is required about the raid scripts.  Or whether you're using the
>>> built-in support for Linux RAID meta-devices reporting with client data in
>>> the [mdstat] section.  If the latter, perhaps you could show the [mdstat]
>>> section of client data?
>>>
>>> Cheers
>>>
>>>
>>> ====================================================================================
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Walter Rutherford <wlrutherford at alaska.edu>
>>> Date: Sat, Aug 29, 2015 at 8:22 PM
>>> Subject: purple problems
>>> To: Xymon at xymon.com
>>>
>>>
>>> Hey all,
>>>
>>> This is probably an old issue but I didn't see a way to search the
>>> archives.
>>>
>>> Our xymon server is showing purple indicators for two of our custom
>>> scripts
>>> but only on a handful of systems. I've found differences in file
>>> location, file ownership, UID, GID, etc., but so far none of that
>>> seems to be the problem.
>>>
>>> The custom script checks raids. Strangely, all of the stagnant hosts show
>>> the same three disks entries from mid-July no matter how many disks they
>>> really have. Unfortunately I don't know what may've happened in July;
>>> that
>>> was before I started working here. I suspect the xymon-client software
>>> was
>>> copied from a live system, including the old status reports, but in so
>>> doing
>>> something wasn't re-configured correctly for the new systems.
>>>
>>> Even stranger, at my urging the Lead SA undisabled the purple
>>> notifications.
>>> I was expecting the page to go purple but it remains green even though
>>> the
>>> page isn't updating. At the bottom of the incorrect raid report page
>>> there is a
>>> link to "client data". If I follow the link I get a full report *including
>>> the correct,*
>>> *current raid information*!
>>>
>>> I think this means that the client is capturing the correct data and
>>> sending
>>> it to the server, the server is actually receiving the report, but after
>>> that the
>>> raid report isn't being handled correctly. Other systems display as
>>> expected.
>>> So far I haven't found anywhere on the server that the purple systems
>>> are configured or handled differently.
>>>
>>> I doubt we're the first to experience this problem. Does this sound
>>> familiar?
>>>
>>> Thanks in advance for any hints you can provide for where to look next.
>>>
>>>    WLR
>>>
>>>
>>>
>>>
>>
>
> _______________________________________________
> Xymon mailing list
> Xymon at xymon.com
> http://lists.xymon.com/mailman/listinfo/xymon
>
>