[hobbit] trying to get netapp filer data into larrd graphs

Thu Feb 17 19:21:42 CET 2005

Henrik Storner wrote:

> The best way of working with the RRD data that Hobbit handles is to
> snoop on the data that is sent from hobbitd to the hobbitd_larrd
> program. You can do that by listening on the hobbit "status" channel:
> 
>     ~/server/bin/bbcmd sh
>     hobbitd_channel --channel=status cat

> 
> The first line with "@@status..." is the beginning of a message - it
> has some information that hobbitd picks out from all messages, like
> the hostname, test-name, color etc. The important thing here is to see
> that hobbitd does see that it is a "cpu" status - there's "|cpu|" in
> the first line. That means hobbitd_larrd will send this message
> through the "cpu" handler in hobbitd/larrd/do_la.c.

This was extremely useful to learn.  Thanks for sharing it.
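
The routing rule described above (the test name appears between "|" characters in the "@@status" line) can be sketched as a simple substring check. This is only an illustration, not the actual hobbitd_larrd code: the function name header_names_test is made up, and the sample header in the usage note uses placeholder fields because the full header layout is not shown here.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of Hobbit.
 * Returns 1 if the header line contains the test name enclosed in '|'
 * delimiters (e.g. "|cpu|"), otherwise 0. */
int header_names_test(const char *header, const char *test)
{
    char needle[64];
    int n = snprintf(needle, sizeof(needle), "|%s|", test);

    /* Reject test names that do not fit in the buffer. */
    if (n < 0 || (size_t)n >= sizeof(needle)) return 0;

    return strstr(header, needle) != NULL;
}
```

With a placeholder header such as "@@status|<fields>|cpu|<fields>", header_names_test(header, "cpu") returns 1 and header_names_test(header, "disk") returns 0.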

> So the first fix is to change those lines above to handle a report
> with the keyword "Uptime:" - e.g. like this:
> 
>         p = strstr(msg, "up: ");
>         if (!p) p = strstr(msg, "Uptime:");
>         if (p) {
> 
> 
> Just one line added. But in this case, I think it makes all the
> difference - because the rest of the report looks like it will be
> handled just fine by the current code in do_la.c.
> 
> I've added this fix to my sources.
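
For anyone who wants to try the two-keyword lookup outside of do_la.c, here is a minimal standalone sketch. The wrapper function find_uptime_marker and the sample messages in the usage note are invented for illustration; only the two strstr() calls come from the fix quoted above.

```c
#include <stddef.h>
#include <string.h>

/* Returns a pointer to the uptime marker inside msg, or NULL if neither
 * keyword is present. "up: " is tried first; "Uptime:" is the fallback. */
const char *find_uptime_marker(const char *msg)
{
    const char *p = strstr(msg, "up: ");
    if (!p) p = strstr(msg, "Uptime:");
    return p;
}
```

A made-up message like "host up: 12 days" matches on the first keyword, "Uptime: 12 days" matches only through the fallback, and a message with neither keyword yields NULL.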

I added the line to do_la.c, and an rrd file is now being created for la, 
but the data used in the graph was being converted or truncated somewhere 
on its way from the status report message to the rrd file.  The 
"load average" collected by this script is actually the %CPU 
utilization, not a true Unix load average.  I suspected that it was 
being altered by the operation that converts load averages when 
DISPREALLOADAVG=FALSE, so I added a line to the Perl script that appends 
two decimal places when returning the CPU load avg to hobbit.  Now a 
CPU utilization of 11% is reported as "load=11.00", which seems to be 
working better.
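
The formatting change amounts to printing the percentage with two decimal places. The real change was made in the Perl script; the sketch below shows the same idea in C, and the function name format_load is made up for the example.

```c
#include <stdio.h>

/* Writes "load=NN.NN" into buf (two digits after the decimal point).
 * Returns the number of characters snprintf wanted to write. */
int format_load(char *buf, size_t len, double cpu_percent)
{
    return snprintf(buf, len, "load=%.2f", cpu_percent);
}
```

Calling format_load(buf, sizeof(buf), 11.0) leaves "load=11.00" in buf.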

So as it stands now, the trend charting works, and I've found a new 
problem while pulling my hair out on this one:  The CPU utilization data 
obtained by SNMP is not always accurate (netapp bug #145119).  In my 
experience, it seems to be about 5-10% off.  That's not something that I 
can fix, so I'm just going to have to live with it for now.  It certainly 
didn't make troubleshooting this hobbit graphing any easier!  ;)

Coincidence or not, it seems that after I applied the fix above and 
rebuilt hobbit, sometime later a hobbitd_larrd column appeared and 
stayed red then purple for a very long time.  The error message was 
"fatal signal caught" or something like that.  I ended up using the bb 
127.0.0.1 "drop servername hobbitd_larrd" command just to get rid of it, 
with the intention of adding it back later once I was sure it wasn't a 
bogus message.  I'm beginning to regret that, since in my haste I may 
have thrown out perfectly good data.  Was that a new feature that was 
added in RC2?  How would I get it back?  Add hobbitd_larrd to bb-hosts?

> PS: If you want me to look at that Netapp disk-report that isn't being
> graphed, just send me an example of what such a report looks like.

Sure thing.  See below, sorry about the line wrap.  After seeing what 
you looked at in the CPU case, I think I know what the problem could be.  
The rest of my systems use the phrase "Disk partitions" while the 
filer uses "NetAPP Volumes".  I poked at the do_disk.c code but was 
clearly out of my league when it came to fixing it.  The column ordering 
is different too, although I can reorder it in the Perl script to match 
the other Linux-style systems if needed.

  Thu Feb 17 08:12:36 EST 2005 - NetAPP Volumes on filerA.nandomedia.com OK

Volume:	Size:	Used:	Avail:	%Used
green /vol/test01/                        382G 92915122176      296G     22.63%
green /vol/test01/.snapshot                96G 27266535424       70G     26.56%
green /vol/test01/total                   478G 120181657600      366G     23.41%
green /vol/vol0/                           96G 193298432       95G      0.19%
green /vol/vol0/.snapshot                  24G 129028096       24G      0.50%
green /vol/vol0/total                     120G 322326528      119G      0.25%