[Xymon] purple problems

Walter Rutherford wlrutherford at alaska.edu
Mon Aug 31 23:09:08 CEST 2015


Spoke too soon. Some of the systems actually have client.d/raid and
they still aren't reporting. At least one didn't even have the directories.

I guess that's one of the hazards of inheriting systems that were installed
and/or modified by multiple people over time.


On Mon, Aug 31, 2015 at 12:58 PM, Walter Rutherford <wlrutherford at alaska.edu
> wrote:

> Found it!
>
> Besides the "raid.sh" script in ext/ I needed a raid configuration in
> etc/client.d/. I thought that was defined in another file but apparently
> not.
>
> On Mon, Aug 31, 2015 at 10:53 AM, Walter Rutherford <
> wlrutherford at alaska.edu> wrote:
>
>> All good questions. Hunting for the answers helped me to see some
>> patterns I'd missed before.
>>
>> The xymon server hostname and IP seem to be consistent, but that's about
>> all that is consistent.
>> There is a separate column for 'disks' on the main webpage and it
>> correctly shows the output from
>> a 'df' command. The script running on the clients' sides is called
>> "raid.sh", the comments at the top
>> of the script indicate it is over a decade old; bb-mdstat.h based on
>> bb-raid.sh. There's a link from
>> /home/xymon-client/ext to /usr/share/xymon-client/ext on most systems.
>> The directory and the
>> scripts in it are owned by either root or xymon. Changing location,
>> ownership, and perms to match
>> one of the working systems hasn't helped.
>>
>> The broken raid reports are all from Linux boxes. The working reports
>> look like this:
>>
>> *          Mon Aug 31 09:38:49 AKDT 2015 RAID ALL devices OK*
>>
>>
>> *             green md0 Status OK*
>> *             green md1 Status OK*
>> *             green md2 Status OK*
>>
>> *          ============================ /proc/mdstat
>> ===========================*
>>
>> *          Personalities : [raid1] *
>> *          md0 : active raid1 sdc1[1] sda1[0]*
>> *                511988 blocks super 1.0 [2/2] [UU]*
>>
>> *          md2 : active raid1 sdd[3] sdb[2]*
>> *                536869888 blocks super 1.2 [2/2] [UU]*
>>
>> *          md1 : active raid1 sdc2[1] sda2[2]*
>> *                41428924 blocks super 1.1 [2/2] [UU]*
>> *                bitmap: 1/1 pages [4KB], 65536KB chunk*
>>
>> *          unused devices: *
>>
>> *          Run /sbin/mdadm -D /dev/md* for more info*
>>
>> The non-working systems either show nothing at all (that's better than
>> purple) OR show the same
>> three green md[0-2] devices (whether it has three raid devices or not) on
>> a blue disabled background.
>> So, I'm almost positive someone copied a working system incorrectly to
>> other clients without cleaning
>> up the foreign logs. The working systems overwrote or just aged out the
>> incorrect information while the
>> non-working ones just keep reporting it. I have found logs but none for
>> this raid information. Perhaps the
>> logs are compressed or otherwise rendered humanly unreadable.
>>
>> So, I copied the /usr/share/xymon-client/ext scripts from a working
>> system to several that were reporting
>> nothing and restarted xymon-client. Most did nothing, one is showing a
>> "no data" indicator. The raid out-
>> put looks normal except the device is md127 - perhaps the high number is
>> confusing the script.  But the
>> wbinfo.sh script I copied at the same time to/from the same directory is
>> now showing green. Argh!
>>
>> I don't even know where the xymon-client scripts running here came from
>> so I'm reluctant (but motivated)
>> to just rip them all out by the roots and start over from a known
>> baseline.
>>
>>   WLR
>>
>>
>>
>> ==================================================================================
>>
>> Phil Crooker <Phil.Crooker at orix.com.au>
>> 3:57 PM (17 hours ago)
>>
>> Is the hostname wrong somewhere? I'm thinking maybe the scipt is sending
>> the wrong hostname, somehow....
>>
>>
>>
>> ==================================================================================
>>
>>
>> Jeremy Laidman <jlaidman at rebel-it.com.au>
>>
>> 7:07 PM (14 hours ago)
>>
>>
>> On 30 August 2015 at 14:22, Walter Rutherford <wlrutherford at alaska.edu>
>> wrote:
>> This is probably an old issue but I didn't see a way to search the
>> archives.
>>
>> https://www.google.com/?q=site:lists.xymon.com+purple+raid
>>
>> Our xymon server is showing purple indicators for two of our custom
>> scripts
>> but only on a handful of systems.
>>
>> The scripts are running client-side and/or server-side?  Can you describe
>> how the scripts work?  Are they locally-written scripts or did you get them
>> from somewhere online?
>>
>> RAID checks are not standard for most Xymon clients.  I've never used or
>> seen RAID checks.  A quick look at the source code indicates built-in
>> support for only Linux, where "md" devices are identified in /proc/mdstat.
>>
>> At the bottom of the incorrect raid report page there is a
>> link to "client data". If I follow the link I get a full report including
>> the correct,
>> current raid information!
>>
>> How is the RAID information getting into the client data?  This might not
>> be used by your custom scripts, and so might be a red herring.  More detail
>> is required about the raid scripts.  Or whether you're using the built-in
>> support for Linux RAID meta-devices reporting with client data in the
>> [mdstat] section.  If the latter, perhaps you could show the [mdstat]
>> section of client data?
>>
>> Cheers
>>
>>
>> ====================================================================================
>>
>>
>> ---------- Forwarded message ----------
>> From: Walter Rutherford <wlrutherford at alaska.edu>
>> Date: Sat, Aug 29, 2015 at 8:22 PM
>> Subject: purple problems
>> To: Xymon at xymon.com
>>
>>
>> Hey all,
>>
>> This is probably an old issue but I didn't see a way to search the
>> archives.
>>
>> Our xymon server is showing purple indicators for two of our custom
>> scripts
>> but only on a handful of systems. I've found differences in file
>> location, file
>> ownership, UID, GID, etc.. but so far none of that seems to be the
>> problem.
>>
>> The custom script checks raids. Strangely, all of the stagnant hosts show
>> the same three disks entries from mid-July no matter how many disks they
>> really have. Unfortunately I don't know what may've happened in July; that
>> was before I started working here. I suspect the xymon-client software was
>> copied from a live system, including the old status reports, but in so
>> doing
>> something wasn't re-configured correctly for the new systems.
>>
>> Even stranger, at my urging the Lead SA undisabled the purple
>> notifications.
>> I was expecting the page to go purple but it remains green even though the
>> page isn't updating. At the bottom of the incorrect raid report page
>> there is a
>> link to "client data". If I follow the link I get a full report *including
>> the correct,*
>> *current raid information*!
>>
>> I think this means that the client is capturing the correct data and
>> sending
>> it to the server, the server is actually receiving the report, but after
>> that the
>> raid report isn't being handled correctly. Other systems display as
>> expected.
>> So far I haven't found anywhere on the server that  the purple systems are
>> configured or handled differently.
>>
>> I doubt we're the first to experience this problem. Does this sound
>> familiar?
>>
>> Thanks in advance for any hints you can provide for where to look next.
>>
>>    WLR
>>
>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xymon.com/pipermail/xymon/attachments/20150831/0a0eb341/attachment.html>


More information about the Xymon mailing list