[hobbit] grouping methods
Haertig, David F (Dave)
haertig at avaya.com
Tue Jun 17 01:16:37 CEST 2008
I wrote a custom alert script to handle this. The first alert is sent
immediately, then the rest are spooled up and sent later, as a batch.
(1) The alert script first checks if its spool file exists. If so, the
current alert is appended to that file and the custom alert script
exits. There is one spool file per recipient address.
(2) If the spool file does not exist, the custom alert script sends out
the current alert as normal and then creates a zero length spool file.
It also creates an "at" job. The "at" job will mail the spool file to
its normal recipient after a one-hour wait, and then delete it. This
spoolfile deletion resets the spooling. You can vary the one hour
setting to suit your needs.
(3) When the "at" job fires it mails the spoolfile if it is non-zero
length. Then it deletes the spoolfile (reset).
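The three steps above could be sketched roughly like this. This is my own minimal reconstruction, not the poster's script: the directory, function name, and the way the recipient and message arrive are all assumptions (a real Hobbit alert script would pull them from Hobbit's alert environment variables).

```shell
#!/bin/sh
# Hypothetical sketch of the spool-or-send logic described above.
# SPOOLDIR and spool_or_send are my assumptions, not the poster's code.
SPOOLDIR=${SPOOLDIR:-/var/tmp/hobbit-spool}

spool_or_send() {   # args: recipient, alert text
    rcpt=$1; msg=$2
    spool="$SPOOLDIR/$rcpt"            # one spool file per recipient
    mkdir -p "$SPOOLDIR"

    if [ -f "$spool" ]; then
        # (1) Spool file exists: append this alert and exit quietly.
        printf '%s\n' "$msg" >> "$spool"
        return 0
    fi

    # (2) No spool file: send the current alert out as normal...
    printf '%s\n' "$msg" | mail -s "Hobbit alert" "$rcpt"

    # ...then open a new spooling window: a zero-length spool file plus
    # an "at" job that mails and deletes it (the reset) an hour from now.
    : > "$spool"
    echo "[ -s $spool ] && mail -s 'Spooled Hobbit alerts' $rcpt < $spool; rm -f $spool" \
        | at now + 1 hour
}
```

Varying the `now + 1 hour` argument to `at` is where the spool duration would be tuned.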
I found that if the server reboots while the spool file is spooling, the
"at" job gets killed and you end up endlessly spooling forever and ever.
To work around this:
(1) The custom alert script was modified to check the age of the
spoolfile as its first step. If it's "too old" (in my example, over 1
hour 15 minutes old), the alert script mails it immediately, deletes it,
and then starts from the beginning with the current alert.
(2) Additionally, a cronjob was added to check for stale spoolfiles.
The job runs every 15 minutes and looks for spoolfiles over 1 hour 15
minutes old. If any are found, the cronjob does the mailing and
deletion itself, resetting the spooling.
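The cron sweep could look something like the sketch below. Again this is my reconstruction, not the actual script; the directory layout matches my earlier sketch, and 75 minutes is the "1 hour 15 minutes" threshold mentioned above.

```shell
#!/bin/sh
# Hypothetical stale-spoolfile sweep, run from cron every 15 minutes.
SPOOLDIR=${SPOOLDIR:-/var/tmp/hobbit-spool}

sweep_stale_spools() {
    # A spool file older than 75 minutes means its "at" job never fired
    # (e.g. the server rebooted): mail it if non-empty, then delete it,
    # which resets the spooling for that recipient.
    find "$SPOOLDIR" -type f -mmin +75 | while IFS= read -r spool; do
        rcpt=$(basename "$spool")
        [ -s "$spool" ] && mail -s "Spooled Hobbit alerts" "$rcpt" < "$spool"
        rm -f "$spool"
    done
}
```

A crontab entry such as `*/15 * * * * /usr/local/bin/sweep_stale_spools` (path hypothetical) would drive it.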
Those are the basics. I enhanced it further so that different alert
types could be grouped together into different spoolfiles and spooling
could be for different lengths of time. I did this by symlinking the
alert script to different names. The name of the symlink was structured
and the script looked at how it was invoked and parsed out the spooling
group and length of time from its invocation name. The specific
spoolfile was then named based on recipient, spool duration, and group.
It is more complex to describe what I did than to actually code it!
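To illustrate the symlink trick: the sketch below parses a group and a spool duration out of the script's own invocation name. The naming convention alert_<group>_<minutes> is my invention; the post only says the symlink name was structured.

```shell
#!/bin/sh
# Hypothetical parsing of spool group and duration from the name the
# script was invoked under (its $0), e.g. a symlink alert_network_60.
parse_invocation() {   # arg: the script's $0
    name=$(basename "$1")
    group=$(echo "$name" | cut -d_ -f2)
    minutes=$(echo "$name" | cut -d_ -f3)
    # The spool file would then be named from recipient, duration, and
    # group, e.g. "$SPOOLDIR/$rcpt.$minutes.$group".
    echo "$group $minutes"
}
```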
Unfortunately I cannot post the script. It does a bunch more than just
this spooling function, some of that being company proprietary. It
would take quite a bit of work for me to strip out the proprietary stuff
to create a generic demonstration script for posting.
This script also does a function similar to spooling, but not quite,
implemented as a symlink to yet another name. I call it a "consolidate"
function. It works pretty much the same as spooling, but instead of
sending the spoolfile after an hour, it only waits 5 minutes, deletes
the spoolfile without mailing it, and then basically does a "screen
scrape" of the bb2.html page and lists all the non-green lights it finds
there. This works well for pagers. Rather than getting a whole bunch
of pages, you get one page that lists all the current light statuses.
As part of the consolidation during the screen scrape (actually I open
the actual html file, so I'm dependent on consistent file structure,
unfortunately) I heavily abbreviate things so they will fit in the tight
SMS message length limits. A consolidate message might look like this
cryptic example, but I know what it means! "!BB! R:testa:srv1
R:testc:srv7 Y:testf:srv2 P:testq:srv3" I list things in order of
importance (reds before yellows, etc.) so if the message does get
truncated, the most important parts make it through.
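The consolidate step might be sketched like this. The HTML pattern below is entirely invented for illustration (real bb2.html markup differs, which is exactly the fragility the poster mentions); only the output format follows the "!BB! R:test:host ..." example above, with reds emitted before yellows and purples.

```shell
#!/bin/sh
# Hypothetical scrape of bb2.html: collect non-green entries into one
# short SMS line, most important colors first. Markup is assumed to
# contain lines like: <img src="gifs/red.gif" alt="srv1:testa:red">
consolidate() {   # arg: path to bb2.html
    for pair in R:red Y:yellow P:purple; do
        code=${pair%%:*}; color=${pair#*:}
        # Emit code:test:host for every light of this color, in file order.
        sed -n "s/.*alt=\"\([^:]*\):\([^:]*\):$color\".*/$code:\2:\1/p" "$1"
    done | tr '\n' ' ' | sed 's/ $//; s/^/!BB! /'
}
```

Sorting by severity means that if the SMS gateway truncates the message, the reds survive.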
From: Linder, Doug (SABIC Innovative Plastics, consultant)
[mailto:Doug.Linder at sabic-ip.com]
Sent: Monday, June 16, 2008 12:08 PM
To: hobbit at hswn.dk
Subject: RE: [hobbit] grouping methods
Sloan [mailto:joe at tmsusa.com] wrote:
> We've not had a bb server go down in all the years we've been using
> it, but sometimes wan connectivity goes away due to circumstances
> beyond our control
This is by far the biggest annoyance we have with all system monitoring
- when networks go down. It's a problem with every monitoring tool
there is and I can't think of any way to solve it: the monitoring system
has no way of knowing whether a system is down because it crashed or if
it's down because the network went down. All it knows is that it can't
talk to the system anymore and something is wrong, so it generates an
alert. When a whole network goes down, it can become hundreds of
simultaneous alerts. And that's annoying enough when it's just email
alerts. When you use Hobbit to generate cases in your trouble ticket
system, that can be hundreds of new, useless cases to manually close.
We don't want to raise the amount of time a system has to be down before
Hobbit generates an alert, because we want to know as soon as possible.
But if we keep that number too low, then when the network has a brief
hiccup, we get hundreds of redundant cases. This is especially a
problem with overseas networks on the WAN.
I think the only possible solution would be for Hobbit to have some kind
of flood-detection routine built in, where it could tell how rapidly it
was sending alerts about connection problems for machines all on the
same network, and was smart enough to think "Whoa, I'm about to send 100
connection alarms about systems on the same network.... Instead of
sending 100 of them, maybe I'll just send ONE alert saying 'You got a
big problem here.'"
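A crude version of that flood detection could even live in an external alert script rather than in Hobbit itself. The sketch below tallies connection alerts per network and collapses a burst into one summary; the threshold, the /24 grouping, and the counter-file layout are all my assumptions, and a cron job clearing FLOODDIR would end each counting window.

```shell
#!/bin/sh
# Hypothetical flood-detection sketch: count alerts per network prefix
# and replace a burst with a single summary alert.
FLOODDIR=${FLOODDIR:-/var/tmp/hobbit-flood}
THRESHOLD=${THRESHOLD:-100}

flood_check() {   # args: hostname, IP; returns 0 = alert normally, 1 = suppress
    net=$(echo "$2" | cut -d. -f1-3)       # crude /24 "same network" key
    mkdir -p "$FLOODDIR"
    echo "$1" >> "$FLOODDIR/$net"
    count=$(wc -l < "$FLOODDIR/$net")
    if [ "$count" -eq "$THRESHOLD" ]; then
        # Crossing the threshold: emit the one summary alert...
        echo "You got a big problem on network $net.0/24: $count hosts down"
        return 1
    elif [ "$count" -gt "$THRESHOLD" ]; then
        return 1                           # ...and swallow the rest
    fi
    return 0                               # below threshold: alert as usual
}
```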
To unsubscribe from the hobbit list, send an e-mail to
hobbit-unsubscribe at hswn.dk