[Xymon] disk graph page limits total file systems
J.C. Cleaver
cleaver at terabithia.org
Sun May 29 23:20:27 CEST 2016
Hi Ed,
Apologies for the delay, there've been some RL issues getting in the way
here.
Thank you for the analysis below; I think you're near the issue here.
Looking at lib/htmllog.c:422 et seq, there's even a comment on the
possible issues with the line parsing logic. The storage-of-previous-info
might be a red herring, in that I'm not seeing a way that info actually
gets stored in the first place. On the other hand, the graphs *could* be
affected by something similar: the HG_WITHOUT_STALE_RRDS value.
The line counting looks like it's "reasonable enough", but I could also
see complications from unusually-named or unusually-wrapped partitions
confusing it about the real number.
I don't have access to an AIX system at the moment, but is there a
POSIX-mode or guaranteed-no-line-wrap option for its 'df' command? If so,
the lack of it in $OS.sh is a problem.
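For comparison, most df implementations accept '-P' for the POSIX
"portable" output format, which guarantees one line per filesystem;
whether AIX's df honors it the same way is an assumption worth checking:

    df -kP   # POSIX portable format: no wrapped lines, 1024-byte blocks
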
Two other ways to test here:
1) Can you take an existing disk status report and reinject it, including
the HTML comment <!-- linecount=XX --> with the proper number in XX? (See
the sketch after this list.) Per line 431, that value should be used
instead of a figure calculated at display time. (This seems like something
xymond_client.c might/should include at status-generation time, since
we're already going through the values anyway, but it's not at the moment.
Probably should be added.)
2) Can you add '&nostale' to the RRD graph page loads? That should ensure
that partitions are *always* displayed even if the underlying RRD file
hasn't been updated recently.
HTH,
-jc
On Wed, May 25, 2016 11:40 am, EDSchminke at Hormel.com wrote:
> JC,
>
> I think I'm starting to see a pattern emerge, and a theory develop, here.
> Hope everyone is able to follow this... here goes:
>
> I think there may be a disconnect between how the disk page determines
> how many filesystems SHOULD be graphed and the number of RRD files that
> are available TO BE graphed. I think the reason the trends page seems to
> work OK is that it just graphs all the data it has available, without
> condition. The disk page seems to determine 1) the number of filesystems
> to graph and, 2) based on that number, the number of filesystems per
> image. These numbers seem to be determined BEFORE it generates the HTML
> that produces the link HREFs and image SRCs. It then seems to produce
> just enough graphs to satisfy the predetermined number, plus enough to
> round out the predetermined "multiple".
>
> The predetermined number of filesystems seems to come from the number of
> filesystems reported in the previous message from the client. I believe
> the assumption was made that those numbers should always match, and for
> the most part they do. It's not every day that sys admins remove
> filesystems from their systems, so until now it may not have been so
> easy to spot. It's a little more obvious to me, being primarily an AIX
> administrator. We have a daily process that creates an "alt_disk_copy"
> of our rootvg so that we always have a hot backup of the OS. This
> process causes a lot of transient filesystems to be created. Those
> filesystems get reported and recorded during the brief window that this
> process is running. On closer examination of my AIX systems, it is not
> just the one with 85 filesystems getting truncated... it's all of them.
> When I view the HTML source of the disk page, I see that just ahead of
> the HTML code that displays the graph images and links, there is an HTML
> comment line: "<!-- linecount=x -->", where x equals the number of
> filesystems that were reported in the previous message from the client
> (the number of lines, excluding the header, in the [df] section). I went
> through each of my systems, Linux and AIX, and found that to always be
> the case. There must also be some threshold at which it determines the
> number "y" of filesystems to display on each graph. It appears that for
> x<80, y=4, and for x>=80, y=5. (If y changes to 6 at some point, I
> haven't done enough testing to determine where that threshold is.)
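>
> (For anyone wanting to double-check that count against a saved client
> message, something like this should match the comment; the file name is
> hypothetical:
>
>     awk '/^\[df\]/{f=1;next} /^\[/{f=0} f' clientmsg.txt | tail -n +2 | wc -l
>
> i.e. the lines of the [df] section minus its header.)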
>
> The request to showgraph.cgi includes the parameters first=z and
> count=y. If there are no more RRD files to graph, it stops, and the
> graph shows fewer filesystems than the count parameter. But if you have
> more data available than the predetermined number of filesystems, it
> will continue to graph them.
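>
> As an illustration (the URL shape here is inferred from the page source,
> and the host/service values are placeholders), the second page of graphs
> gets requested roughly like:
>
>     showgraph.cgi?host=aixhost&service=disk&graph=hourly&action=view&first=6&count=5
>
> and fetching such a URL by hand with a "first" value beyond the
> predetermined number still returns a graph as long as RRD files exist.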
>
> On the system that previously seemed limited to 85 file systems, I
> modified the "hobbitclient-$os.sh" script and grepped out a certain
> number of file systems. After doing this, I had 77 filesystems reported.
> That number was reflected in the "linecount=" HTML comment, and I also
> began seeing 4 filesystems per graph (instead of 5, previously) and 20
> graphs being displayed (instead of 17, previously), for a total of 80
> filesystems being graphed. It graphed 80 because it still had enough
> data from RRD files to round out the last graph. Also, the filesystems
> that were grepped out of the message from the client were still graphed.
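>
> (The edit itself was trivial: where the script emits the [df] section, I
> filtered the output along these lines, the grep pattern being whatever
> matched the filesystems I wanted dropped.
>
>     df -kP | grep -v '/test/fs'
>
> The exact df invocation varies per platform, so treat this as a sketch.)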
>
> I also went back and checked my Linux systems, the ones where I added
> 100 filesystems. On those systems, I created enough filesystems to push
> past that 85-filesystem "limit". Since those all graphed successfully, I
> had previously thought that it was a difference between AIX and Linux.
> That no longer seems to be the case. Now that I have removed all of
> those test file systems, and since the client is only reporting 10
> filesystems, only 10 filesystems are being graphed. File systems like /,
> /boot, and /home are graphed... but the test ones that I removed are
> still being graphed, and filesystems that you would expect to see at the
> end alphabetically (e.g. /usr, /var, /opt, /tmp) are not displayed.
>
> A lot of speculation, I realize, but the theory seems to fit reality in
> all cases. I haven't examined the code to prove it out since, as I've
> said before, my C skills are rubbish. But if my theory proves true, the
> improvement I would suggest is: make sure that at least every file
> system from the most recent message is represented, plus any additional
> file systems that have data available in the requested time period,
> between "graph_start" and "graph_end".
>
>
> Erik D. Schminke | Associate Systems Programmer
> Hormel Foods Corporation | One Hormel Place | Austin, MN 55912
> Phone: (507) 434-6817
> edschminke at hormel.com | www.hormelfoods.com
>
> From: "J.C. Cleaver" <cleaver at terabithia.org>
> To: EDSchminke at Hormel.com
> Cc: "Xymon Mailing List" <xymon at xymon.com>
> Date: 05/23/2016 10:32 PM
> Subject: Re: [Xymon] disk graph page limits total file systems
>
>
> Hi Erik,
>
> This actually helps a great deal, as it implies there's a distinction in
> the parsing code ... and potentially not an issue on the display side at
> all (which I've been poring over with little success).
>
> Can you confirm whether the RRD files themselves are being properly
> updated for both the AIX and Linux systems? (It might help to disable
> caching in xymond_rrd during this process, if your system has enough
> spare I/O capacity.) In theory all partitions that are coming in should
> have their .rrd files updated continually, but if there's a parsing
> issue then that might explain one aspect of the failure.
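>
> A quick way to check is the file modification times (the path is per a
> default install, with $XYMONVAR from xymonserver.cfg, and the host name
> is a placeholder; slashes in mount points become commas in the file
> names):
>
>     ls -lt $XYMONVAR/rrd/aixhost/disk*.rrd | head
>
> If some partitions' timestamps lag, the parsing/update side is suspect
> rather than the display side.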
>
> Alternatively, can you try adding and removing partition values in the
> client report and see if going above and below the 85-partition value
> reliably enables the 86th?
>
> It might be helpful to manually edit the xymonclient-$OS.sh script to grep
> out (or include additional) lines of the 'df' output.
>
> Can you also confirm that the remainder of the client report
> (CPU/memory/etc.) is being handled OK, even on the AIX system?
>
>
> So far I've been unable to duplicate this, but I was primarily testing on
> x86_64 Linux VMs.
>
>
> Regards,
> -jc