Monitoring disk space problems (was: RE: [hobbit] Highlights of the 4.3.0 version)

Haertig, David F (Dave) haertig at avaya.com
Mon Aug 6 21:25:46 CEST 2007


I try to identify filesystem "space hogs" via custom scripts I wrote a
long time ago when using BB.  99% of my custom stuff is done in Perl.
 
I use 'du -k' to get the size of all directories in the filesystem.  I
then cut those results down to only the first and second level
directories (but you could go as deep as you want).  I store the size of
each subdirectory in a small "database".  I did this ages ago and my
code uses Perl's "Storable" module to store the accumulated data in a
file (called my "database").  These days I'd just use Hobbit's easily
accessed RRD files.  I then use Perl's
Statistics::Descriptive::least_squares_fit() to calculate the slope and
linear correlation coefficient of the "best fit line".  This allows me
to see how fast each subdirectory is growing/shrinking, and how linear
that growth/reduction is.  I trigger yellow/red conditions based on rate
of growth and predicted fill time at current growth rate, in addition to
the standard "95% full = red" test.
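
A rough sketch of the trimming and trend steps described above, in
Python rather than the Perl I actually use.  The function names, the
tab-separated 'du -k' parsing, and the 24h/72h/0.8 thresholds are
assumptions made up for illustration, not values from my scripts.

```python
import math

def trim_du(du_output, root, max_depth=2):
    """Keep only directories 1..max_depth levels below root from 'du -k' output."""
    prefix = root.rstrip("/") + "/"
    sizes = {}
    for line in du_output.splitlines():
        size, _, path = line.partition("\t")
        if not size.isdigit() or not path.startswith(prefix):
            continue
        rel = path[len(prefix):].strip("/")
        if rel and rel.count("/") + 1 <= max_depth:
            sizes[path] = int(size)  # size in KB, as reported by du -k
    return sizes

def least_squares_fit(xs, ys):
    """Return (intercept, slope, r) of the best-fit line through the samples."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r is undefined for a perfectly flat series; report 0.0 in that case.
    r = sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0
    return intercept, slope, r

def hours_until_full(free_kb, slope_kb_per_hour):
    """Predicted hours until the filesystem fills, or None if not growing."""
    if slope_kb_per_hour <= 0:
        return None
    return free_kb / slope_kb_per_hour

def colour(free_kb, slope_kb_per_hour, r,
           red_hours=24, yellow_hours=72, min_r=0.8):
    """Map predicted fill time to a status colour (hypothetical thresholds)."""
    eta = hours_until_full(free_kb, slope_kb_per_hour)
    if eta is None or abs(r) < min_r:
        return "green"  # shrinking, flat, or too erratic to extrapolate
    if eta <= red_hours:
        return "red"
    if eta <= yellow_hours:
        return "yellow"
    return "green"
```

Here x is the sample time in hours and y is the directory size in KB,
so the slope comes out in KB per hour.  This check would run alongside
the usual percent-full test, not replace it.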
 
The above makes it fairly easy to identify which subdirectory is your
problem, which is often good enough to identify the file/process
that is killing you.  When it's not, I have a separate test that tries
to identify problem files a different way.  BB/Hobbit uses 'top' to
identify cpu-hogging processes.  Many times the files hogging space
are directly tied to processes hogging cpu (runaway process = runaway
file in many cases).  'top' identifies the process(es), then "lsof -p
<pid>" is used to identify the files that the suspect process has open.
Finding a cpu-hogger that has a filespace-hogger open is usually the
holy grail you seek.
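
A minimal sketch of the "which files does the suspect process have
open" step.  Note that this is not the lsof call itself: as a
substitute it reads /proc/<pid>/fd directly, so it is Linux-only, and
the function name is hypothetical.  The pid would come from whatever
'top' flagged.

```python
import os
import stat

def open_files_by_size(pid):
    """Return [(size_bytes, path)] for regular files a process has open,
    largest first.  Returns [] if /proc or the process is unavailable."""
    fd_dir = f"/proc/{pid}/fd"
    try:
        names = os.listdir(fd_dir)
    except OSError:
        return []  # no /proc, no such process, or no permission
    found = {}
    for name in names:
        link = os.path.join(fd_dir, name)
        try:
            target = os.readlink(link)  # path the descriptor points at
            st = os.stat(link)          # follows the link to the open file
        except OSError:
            continue  # descriptor was closed while we were looking
        if stat.S_ISREG(st.st_mode):
            found[target] = st.st_size
    return sorted(((size, path) for path, size in found.items()),
                  reverse=True)
```

Files that were deleted while still open show up with " (deleted)"
appended to the path, which is worth watching for since they keep
consuming space without appearing in 'du'.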
 
As a "repair" action for Hobbit, I squirreled away 2GB of disk space in
100MB chunks for critical filesystems.  "dd if=/dev/zero
of=/filesystem/DiskSpaceReserve/reserve01 bs=1024 count=102400", then
"cp reserve01 reserve02", etc. to build up the reserve.  A separate
Hobbit "notification script" is used to simply delete files from this
reserve under dire circumstances, after normal email/pager notifications
have failed to trigger action by developers/production support people.
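
The reserve idea can be sketched roughly as below.  This is a Python
illustration, not my actual notification script; the function names,
the default of 20 chunks (20 x 100MB is roughly the 2GB mentioned
above), and the "release the highest-numbered file first" rule are all
assumptions.

```python
import os

CHUNK_BYTES = 100 * 1024 * 1024  # same size as dd bs=1024 count=102400

def build_reserve(reserve_dir, chunks=20, chunk_bytes=CHUNK_BYTES):
    """Create reserve01..reserveNN filled with real zero bytes.
    Zeros are written out (no seek/truncate tricks) so the files are not
    sparse and actually occupy disk blocks."""
    os.makedirs(reserve_dir, exist_ok=True)
    block = b"\0" * (1024 * 1024)
    for i in range(1, chunks + 1):
        path = os.path.join(reserve_dir, f"reserve{i:02d}")
        if os.path.exists(path):
            continue  # keep chunks that are already in place
        with open(path, "wb") as fh:
            remaining = chunk_bytes
            while remaining > 0:
                n = min(remaining, len(block))
                fh.write(block[:n])
                remaining -= n

def release_chunk(reserve_dir):
    """Delete one reserve file and return its path, or None if none remain."""
    try:
        names = sorted(n for n in os.listdir(reserve_dir)
                       if n.startswith("reserve"))
    except FileNotFoundError:
        return None
    if not names:
        return None
    victim = os.path.join(reserve_dir, names[-1])
    os.remove(victim)
    return victim
```

The alert script would call release_chunk() once per escalation, so
each unanswered alarm frees one more chunk until the reserve is gone.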
 
My BB/Hobbit custom scripts tend to get quite involved.  Probably too
much so, but they're fun for me to write!

________________________________

From: Gary Baluha [mailto:gumby3203 at gmail.com] 
Sent: Monday, August 06, 2007 7:29 AM
To: hobbit at hswn.dk
Subject: Re: [hobbit] Highlights of the 4.3.0 version


 < ... snip ... >
 
One example is disk space.  A full filesystem would shut many things down.
Apps should not fill a filesystem, but sometimes they do.  So my custom
Hobbit scripts first scream and scream about low disk space, even
analysing things down to specific subdirectories and fast growing files
and doing trend analysis.  But if their call is not answered, they start
freeing up space from a "private reserve" I have set aside to deal with 
emergencies.  So if we experience a sudden unexpected blowup in a
filesystem at 3am, Hobbit keeps things running in production until the
appropriate people can look into and diagnose the problem.  This may not
be Utopian behavior, but it sure is practical at 3am!

What sort of trend analysis do your scripts perform?  We have a few
boxes that are notorious for filling up their disk space, and I haven't
yet come up with an idea of how to neatly track exactly what it is that
keeps filling up the disk.  
 
< ... snip ...> 