RE: [xymon] Managing who gets alerts - shifts and rotations
- To: "xymon (at) xymon.com" <xymon (at) xymon.com>
- Subject: RE: [xymon] Managing who gets alerts - shifts and rotations
- From: Tim McCloskey <tm (at) freedom.com>
- Date: Sat, 9 Oct 2010 12:28:13 -0700
Hi,
Might not be what you were hoping to hear but I'm going to share it just the same.
If you think through your rules and come up with a standard format, it will help. I know it seems endless, but once you've set up a standard it's not as bad as you think. It's tedious in the initial setup, but worth it in the long run.
One of the things I did was to separate the email/pager addresses from the actual alert rules. Example below.
On whatever schedule you need, cron runs a simple perl "pie" script (perl -p -i -e) to change the addresses in mail-primary.sh (and page-primary.sh, etc.). This is an extra layer that you need to come up with yourself, but it's not overly complex. You may end up with four or more perl scripts, but you maintain the variables in those scripts outside of the hobbit system, thus avoiding the typo in your alerts.cfg that breaks things. (The variables will likely be a list or array of people/pagers/email addresses. The people will probably change from time to time, but the alert for ICMP on your nameserver will always be required.)
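As a rough illustration of keeping the people list outside the alert rules, here is a minimal bash sketch that picks whose turn it is from a roster. The roster addresses and the function name oncall_for_week are made up for the example; they are not part of hobbit/xymon.

```shell
#!/bin/bash
# Hypothetical roster, maintained here rather than in the alert rules.
ROSTER=(alice@example.com bob@example.com carol@example.com)

# oncall_for_week WEEK: print the roster entry whose turn it is.
# WEEK is any non-negative integer, e.g. the output of `date +%V`.
oncall_for_week() {
    local week=$1
    # 10# forces base 10, so a value like "08" is not read as bad octal.
    local idx=$(( 10#$week % ${#ROSTER[@]} ))
    echo "${ROSTER[$idx]}"
}
```

A cron job could call something like oncall_for_week "$(date +%V)" and write the result into mail-primary.sh. Note that week numbers restart each year, so the rotation may skip or repeat a person at the year boundary.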
cat hobbit-alerts.cfg
...
$alertdir=/usr/local/tolkien/server/alert-scripts/sys_admin
$alertdir2=/usr/local/tolkien/server/alert-scripts/dev_app
include /usr/local/tolkien/server/etc/inc/alerts/tm-mu
...
cat tm-mu
...
## ALL OTHER SERVICES : SERVERS ON WEB SIDE
# mail primary on every red level service failure that has been red for over 6 minutes.
# send mail once an hour and do not send a recovery email.
# this excludes conn, for which they have already been paged.
# mail secondary after an hour and once an hour thereafter.
#
#
PAGE=%^web/(linux|other|windows|solaris) EXSERVICE=conn COLOR=red,purple DURATION>6m
SCRIPT $alertdir/tm-mu/mail-primary.sh mail-web-prim FORMAT=sms REPEAT=1h
PAGE=%^web/(linux|other|windows|solaris) EXSERVICE=conn COLOR=red,purple DURATION>1h
SCRIPT $alertdir/tm-mu/mail-secondary.sh mail-web-sec FORMAT=sms REPEAT=1h
...
cat mail-primary.sh:
#!/bin/bash
/bin/mail -s "$BBHOSTSVC" tm (at) f...redacted....com < /dev/null
I can't share the complete perl script today, but it's fairly simple. Example stanza:
...
elif $GREP $mailb $wd/$mp > /dev/null
then
$PERL -p -i -e "s:$mailb:$maila:" $wd/$mp
$PERL -p -i -e "s:$maila:$mailb:" $wd/$ms
else
$MAIL -s 'failed on call change' $sysadmin < $wd/$msg
fi
...
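For anyone who wants a starting point, the swap in the stanza above could be sketched roughly like this. This is not the script I run: the function names swap_oncall and replace_addr and their arguments are invented for the example, and it uses plain bash string replacement instead of perl -p -i -e so that the addresses are matched literally (no need to escape "." or "@").

```shell
#!/bin/bash
# replace_addr OLD NEW FILE: replace every literal OLD with NEW in FILE.
replace_addr() {
    local old=$1 new=$2 file=$3 tmp line
    tmp=$(mktemp) || return 1
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "${line//"$old"/$new}"
    done < "$file" > "$tmp"
    # Write back through cat so FILE keeps its mode (e.g. the execute bit).
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# swap_oncall ADDR_A ADDR_B PRIMARY_FILE SECONDARY_FILE
# Whichever address is currently in PRIMARY_FILE trades places with the
# other one in SECONDARY_FILE. Returns 1 if neither address is found.
swap_oncall() {
    local a=$1 b=$2 primary=$3 secondary=$4
    if grep -qF -- "$a" "$primary"; then
        replace_addr "$a" "$b" "$primary"
        replace_addr "$b" "$a" "$secondary"
    elif grep -qF -- "$b" "$primary"; then
        replace_addr "$b" "$a" "$primary"
        replace_addr "$a" "$b" "$secondary"
    else
        return 1
    fi
}
```

The non-zero return is where you would hook in the "failed on call change" mail, e.g. swap_oncall "$maila" "$mailb" "$wd/$mp" "$wd/$ms" || $MAIL -s 'failed on call change' $sysadmin < $wd/$msg.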
I know it seems like you're banging your head on the wall looking for simplicity; that's the part you may need to create. If something already exists, I'm sure someone on the group will let you know.
Good luck.
-t