[Xymon] are bar well scaled through categories.

Tue Sep 22 05:09:15 CEST 2015

On 22 September 2015 at 06:04, Randall Badilla <rbadillarx at gmail.com> wrote:

> c) does the problem resides on how rrdtool plots or internal manipulation
> of the Solaris scripts?
>

Yes, this.

RRD, by definition, is a round-robin database that "consolidates" in a
lossy way.  Combining this with the widespread use of counter stats
produces the effect you see.  Let me try to explain by a simple example.
But first, a caveat that I'm not an expert in RRD, and my understanding is
partly from my own guess at what's happening based on how I would implement
things.

Let's say that at 10:05am, the interface bit counter is 10,500,000 bits
(meaning that the interface has transmitted 10.5 million bits since reboot,
a "clear counters" command, or a counter roll-over).  The router is polled
every 5 minutes for its interface statistics.  At 10:10am, 5 minutes later,
the counter has incremented to 12,900,000 bits.  The difference between the
two samples is 2,400,000 bits.  So rrdgraph (and hence Xymon) will show a
5-minute average value of 2,400,000 bits / 300 seconds = 8,000 bps (8kbps).
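
As a minimal sketch of that arithmetic (in Python, purely for illustration;
the function name is my own and this is not how rrdtool itself is written):

```python
def average_bps(prev_count, curr_count, interval_seconds):
    """Average bits per second between two counter readings.

    Assumes the counter did not reset or roll over between the readings.
    """
    return (curr_count - prev_count) / interval_seconds


# The 10:05 and 10:10 readings from the example, 300 seconds apart.
rate = average_bps(10_500_000, 12_900_000, 5 * 60)
print(rate)  # 8000.0 bits per second, i.e. 8kbps
```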

Now, RRD doesn't store 8kbps.  Nor does it store 12,900,000.  Instead, it
stores:
 10:05=10,500,000
 10:10=12,900,000
 10:15=...
 (etc)
In other words, only the absolute counter values get stored (along with the
timestamps for each).  These are the "primary data points".

To store 5-minute counter values for years and years would require a huge
database file that would take lots of CPU power to calculate and produce ad
hoc long-term views of the data.  Generally we only care about fine-grained
(primary) data point samples when they're recent, and as the data points
get older, we care more about hourly, daily or weekly trends instead.  RRD
solves this problem by reducing resolution for older samples.

Back to our example.  After 1 day, RRD "consolidates" the 5-minute values
into longer intervals so that they don't take up as much space.  The
consolidation parameters are configurable, but for our example let's say it
keeps 5-minute samples around for up to 24 hours, and after that it turns
them into hourly samples.  How does it do this?  Well, all it needs to do is
forget 11 out of 12 samples in an hour.  So now RRD is storing:
  10:05=10,500,000
  11:05=27,320,000
  12:05=34,150,000
  13:05=...

Note that it's still storing new 5-minute primary data points verbatim.
The above list is only showing data points that are 24 hours old, from the
time we started our sampling.
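
Here is a rough sketch of that thinning step, following my simplified
description rather than rrdtool's real internals.  The 5-minute increments
are made up for illustration; only the first increment and the 10:05 and
11:05 counter values come from the example.

```python
from itertools import accumulate

# Hypothetical bits transferred in each 5-minute interval from 10:05 to 11:05.
increments = [2_400_000, 600_000, 600_000, 600_000, 6_000_000, 600_000,
              600_000, 600_000, 600_000, 600_000, 600_000, 3_020_000]

# Thirteen counter readings: 10:05, 10:10, ..., 11:05.
counters = list(accumulate(increments, initial=10_500_000))

# "Forget 11 out of 12 samples": keep only every 12th reading.
hourly = counters[::12]

# All that can be recovered afterwards is the average rate over the hour.
hourly_avg_bps = (hourly[1] - hourly[0]) / 3600

print(hourly)          # the 10:05 and 11:05 counter values
print(hourly_avg_bps)  # roughly 4672 bps
```

Note that the 6,000,000-bit burst in the middle of the hour (20kbps over
those 5 minutes) can no longer be seen once only the two end points remain.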

The same consolidation process occurs when the hourly samples get older
than (say) 12 days, and they might be turned into daily samples by
forgetting all but one sample per day.

Again, let me stress that the timeframes used above are tuneable per RRD
file.  In fact, I arbitrarily chose 5-minute, hourly, daily and 12-day
time periods, for illustration purposes only, and typical deployments are
usually not exactly as I have described.  But the principle still applies.
You can view the parameters of an RRD file with "rrdtool info
<filename.rrd>".

Now, back to the phenomenon you're seeing, which is an apparent reduction
in the magnitude of samples.  The reason this happens is that RRD is
always averaging when it reads back a *COUNTER* data source.  (Other data
source types are available, such as GAUGE and DERIVE, but most routers
provide interface statistics as counters.)  Even when you ask RRD
to graph the most recent, 5-minute samples, you should realise that those
samples are averaged over 5 minutes.  There was almost certainly a
fluctuation during the 5-minute interval that went higher than the
calculated value, but the best RRD can do is show the average over that
time, by subtracting the two counter values and dividing by the time
period, to get average bps.

When RRD generates 12-day graphs, it uses (in our example) hourly samples
because for most of the 12 days, the 5-minute samples are now gone.  So to
produce the numbers for the time from 10:05 to 11:05, it can't show the
peaks and troughs that used to show in the 5-minute samples, and instead
can only show the average for the hour, because now all that it has are the
two counter values for time periods 1 hour apart.  This flattening gets
worse as the resolution gets coarser.
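
To illustrate how the visible peak shrinks as the averaging window grows,
here is a small sketch with made-up 5-minute rates for one hour.  The
15-minute window is only there to show an intermediate step; it is not one
of the intervals in my example, and the helper name is my own.

```python
def window_averages(rates, steps):
    """Average consecutive groups of `steps` equally spaced rate samples."""
    return [sum(rates[i:i + steps]) / steps
            for i in range(0, len(rates), steps)]


# Hypothetical 5-minute average rates (bps) for one hour, with one burst.
five_min_bps = [8000, 2000, 2000, 2000, 20000, 2000,
                2000, 2000, 2000, 2000, 2000, 2000]

peak_5min = max(window_averages(five_min_bps, 1))    # 5-minute resolution
peak_15min = max(window_averages(five_min_bps, 3))   # 15-minute resolution
peak_hourly = max(window_averages(five_min_bps, 12)) # hourly resolution

print(peak_5min, peak_15min, peak_hourly)  # 20000.0 8000.0 4000.0
```

The traffic is identical in all three cases; only the width of the
averaging window changes.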

For Xymon, this is pretty much it.  However, in some cases, it's actually a
little bit more complicated than this.  RRD has specific "consolidation
functions" that it uses when moving data from each sample rate to the next
(eg from 5-minute to hourly samples).  For example, a typical RRD file can
store consolidated samples for MIN, MAX, AVERAGE and LAST, although RRD
files created by Xymon only have AVERAGE (defined in rrddefinitions.cfg).
I think for GAUGE sample types, RRD has to calculate the consolidated
average of 5-minute primary data points when it consolidates them to hourly
samples, rather than just forgetting the intermediate samples, because
GAUGE is different.  Similarly, even for COUNTER, if the RRD is configured
to use a MAX consolidation function, it calculates the hourly maximum as
the maximum value of the 5-minute samples.  When Xymon shows "max" and
"min" values, but the data set only has AVERAGE samples, it has to
calculate the max and min as simply the highest/lowest average over the
time period.  If the RRD file was created to use MAX and/or MIN
consolidation functions, then my understanding is that the longer-term
values for MAX and MIN will be the actual max and min of the 5-minute
samples.
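
A sketch of the difference, again with made-up numbers and my own helper
names.  It ignores details of real RRD files such as unknown samples, so
treat it as an illustration of the idea only.

```python
def consolidate(values, steps, cf):
    """Apply consolidation function `cf` to each group of `steps` samples."""
    return [cf(values[i:i + steps]) for i in range(0, len(values), steps)]


def average(chunk):
    return sum(chunk) / len(chunk)


# Two hours of hypothetical 5-minute rates (bps), each hour with one burst.
hour1 = [8000, 2000, 2000, 2000, 20000] + [2000] * 7
hour2 = [1000] * 11 + [13000]
five_min_bps = hour1 + hour2

hourly_avg = consolidate(five_min_bps, 12, average)  # AVERAGE only
hourly_max = consolidate(five_min_bps, 12, max)      # with a MAX CF
hourly_min = consolidate(five_min_bps, 12, min)      # with a MIN CF

# With only AVERAGE stored, a graph's "max" is the highest hourly average.
print(max(hourly_avg))  # 4000.0
# With MAX stored as well, the real 5-minute peak survives consolidation.
print(max(hourly_max))  # 20000
print(min(hourly_min))  # 1000
```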

To solve your problem, you can simply explain that longer-term views are
averaged from short-term views.  But if you want more accurate maxima and
minima on your longer-term views, then I think you can adjust
rrddefinitions.cfg to include MAX and MIN consolidation functions, but this
will only apply to newly created RRD files.  Alternatively, you can use
"rrdtool tune" to add new CFs to an existing file, but note that it will
really only help with new samples.  I've never done this, so I don't know how well
it works, or how to do it.  However, there used to be a TRACKMAX option for
this purpose, and this post describes how the same effect can be achieved
with an update to rrddefinitions.cfg:
http://lists.xymon.com/archive/2010-November/029960.html

Hope that helps.

Cheers
Jeremy