[hobbit] Thoughts

Thu May 3 13:55:45 CEST 2007

I think that whatever solution is decided on should work for all other tests 
(in some form or another). My need for this would be the Disk report. I have a 
good number of Database Servers whose disks fill up regularly. These 
disks are located on SAN, and we either clean up the disk or put in a request 
for the SAN storage to be expanded. A storage expansion request could take 
2-6 weeks to be fulfilled, so Ack'ing disk for that long 'blinds' you to any 
other disk issue that may crop up. A way to ack just one volume would be 
very desirable. I have actually written an ext test module to do this. I am 
still in the process of bringing it over from BigBrother.

Another disk scenario I have is similar to the point raised with ports: a 
server that is shared between 2 groups (or more). Being able to have 
multiple disk reports would be very welcome, so that groupA has a dedicated 
report for their volumes and so does groupB. I realize alerting can already 
be split up this way, but a way to split the reports would be nice to have 
also. We do use Alternative Pagesets a lot, so that could be the reason that 
I like the idea of being able to split up a report into multiple reports. We 
create a Pageset for GroupA, with just the devices & reports that GroupA cares 
about. But with a report like disk, the status could be red due to a 
GroupB volume. And this sometimes confuses GroupA :(

I think the simplest solution would be to have a parameter in the 
hobbit-clients.cfg:

DISK    %(/mnt/Vol1|/mnt/Vol3|/mnt/Vol4)  90 95 REPORTALIAS=disk_a
DISK    %(/mnt/Vol2|/mnt/Vol5|/mnt/Vol6)  80 95 REPORTALIAS=disk_b
DISK    %!(disk_a|disk_b) 96 98

The last DISK rule sets alert values for all other volumes, i.e. those not 
covered by disk_a & disk_b. The same REPORTALIAS feature could be used for 
MSGS, PORTS, PROCS, FILES, etc. And these alias names could be used in the 
alert rules, instead of GROUP=.
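
To make the REPORTALIAS idea a bit more concrete, here is a minimal Python 
sketch of how volumes might be split into separate reports. It is only an 
illustration, not how Hobbit works today: REPORTALIAS is the proposed 
parameter, and RULES, DEFAULT_RULE, classify() and split_reports() are 
hypothetical names. I am also assuming that a threshold is met when usage is 
at or above it, and that volumes not claimed by an alias stay in the plain 
"disk" report.

```python
import re

# Hypothetical rule table mirroring the proposed hobbit-clients.cfg lines:
# (volume regex, warning %, panic %, report alias)
RULES = [
    (re.compile(r"^(/mnt/Vol1|/mnt/Vol3|/mnt/Vol4)$"), 90, 95, "disk_a"),
    (re.compile(r"^(/mnt/Vol2|/mnt/Vol5|/mnt/Vol6)$"), 80, 95, "disk_b"),
]
# Assumption: every volume not claimed by an alias stays in the "disk" report.
DEFAULT_RULE = (96, 98, "disk")

SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def classify(volume, used_pct):
    """Return (report name, color) for a single volume."""
    for pattern, warn, panic, alias in RULES:
        if pattern.match(volume):
            break
    else:
        warn, panic, alias = DEFAULT_RULE
    # Assumption: a threshold is met when usage is at or above it.
    if used_pct >= panic:
        color = "red"
    elif used_pct >= warn:
        color = "yellow"
    else:
        color = "green"
    return alias, color


def split_reports(usage):
    """Group volumes by report; each report takes its worst volume color."""
    reports = {}
    for volume, used_pct in usage.items():
        alias, color = classify(volume, used_pct)
        report = reports.setdefault(alias, {"color": "green", "volumes": {}})
        report["volumes"][volume] = color
        if SEVERITY[color] > SEVERITY[report["color"]]:
            report["color"] = color
    return reports
```

With something like this, a full GroupB volume would only turn disk_b red, 
and GroupA's disk_a report would stay green.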

Now the above suggestion still does not help when a report already has an 
alert status (red|yellow) and alert items are added or removed. I would love 
to be alerted when a report has more or fewer alert items in it than it 
did previously. The simplest way I see to do that is by including an 
alertstate field when the status is sent in to hobbit. I would imagine that 
this could be added to the first line of the status report, e.g.
bin/bb 127.0.0.1 "status server1.disk red (red:/mnt/Vol1:/mnt/Vol2 yellow:/mnt/vol3)
<rest of disk report>"

So in the above example there are 2 volumes with a red status & one with a 
yellow. When the next status report comes in with (red:/mnt/Vol1 
yellow:/mnt/vol3), hobbit would be able to determine that the report had a 
state change, even though the disk report would still have a red status. If 
reports do not provide this extra 'alertstate' field, it really shouldn't 
break anything. Hobbit would just behave as it does presently. Also a new 
alert parameter could be added, UPDATES, so that people who want to receive 
emails whenever a report's alertstate changes can. And people who just want 
alerts when a report enters an alert status or recovers still can. The update 
alert emails can be as simple as "server1's disk alert status has changed.", 
or can be more complicated/informative: "server1's disk /mnt/Vol2 alert 
status has cleared, but there are still disks that have met alert 
thresholds."

Something else to consider is how this would affect acknowledgments. When 
acknowledging reports, I think a new option would be needed: Ack for the 
alert status, or Ack for the present alertstate. It all depends on how you 
want to implement it.
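
As a rough sketch of the alertstate comparison, the Python below parses the 
proposed field from the first line of two consecutive status messages and 
reports what changed. None of this exists in Hobbit: the field format is 
just the one suggested above, and parse_alertstate() and describe_change() 
are hypothetical names. It assumes item names contain no ':' or whitespace, 
and that a first line ending in some other parenthesized text never occurs.

```python
import re

# Matches a trailing "(...)" group at the end of the status first line.
ALERTSTATE_RE = re.compile(r"\(([^)]*)\)\s*$")


def parse_alertstate(first_line):
    """Return a set of (color, item) pairs, or None if the field is absent."""
    match = ALERTSTATE_RE.search(first_line)
    if match is None:
        return None
    state = set()
    for group in match.group(1).split():
        # Each group looks like "red:/mnt/Vol1:/mnt/Vol2".
        color, _, items = group.partition(":")
        for item in items.split(":"):
            if item:
                state.add((color, item))
    return state


def describe_change(old_line, new_line):
    """Return the cleared/added alert items, or None if nothing to report."""
    old = parse_alertstate(old_line)
    new = parse_alertstate(new_line)
    if old is None or new is None:
        return None  # no alertstate field: behave as at present
    if old == new:
        return None
    return {"cleared": sorted(old - new), "added": sorted(new - old)}
```

For the two example reports above, describe_change() would list /mnt/Vol2 
under "cleared" even though the overall color stays red. That is the kind of 
information an UPDATES alert could carry.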

Sorry for the very long-winded email, just trying to do a braindump of my 
thoughts.
 ~Steve

On Wednesday 02 May 2007 17:24, Kruse, Jason K. wrote:
> Actually, you just indirectly mentioned something that feels like a fairly
> elegant solution.  What would be nice in this particular case would be to
> be able to attach a service label to the PROCS tests for groups of
> processes.  The service could then be monitored without custom tests being
> created for each one.  New columns can be created from the service tag
> without really cluttering the lines.
>
> I'll have to think about how the log files are processed to see if
> something like that works or not.
>
> Jason
>
> ________________________________
>
> From: Dan Vande More [mailto:bigdan at gmail.com]
> Sent: Wed 5/2/2007 4:09 PM
> To: hobbit at hswn.dk
> Subject: Re: [hobbit] Thoughts
>
>
> Indeed, it seems to me that the whole group concept is a good way to work
> with us humans but breaks down wildly when dealing with computers. This is
> fine because most of us use the groups to save space on the screens, and
> configuration in the conf files.
>
> If you want tests for each process and ultimately different behaviours for
> each process, you need to be prepared to do the work and make the tests for
> each process.
>
> Please don't overcomplicate hobbit for this - it's a corner case and will
> ultimately make the program more unwieldy.
>
>
> On 5/2/07, Henrik Stoerner <henrik at hswn.dk> wrote:
>
> 	On Wed, May 02, 2007 at 02:06:34PM -0500, Kruse, Jason K. wrote:
> 	> Grouped items, such as the process check and log monitors, are issues.
> 	> A single process down causes the whole check to go red.  A process
> 	> listed as alerting only operators can then mask another process on the
> 	> same system from notifying the DBA's.  Setting the alert repeat interval
> 	> to 0 shows the other problem, a recovery message is not generated for
> 	> each process that recovers, only when the whole group of processes
> 	> recovers.
>
> 	This will be difficult to handle - it's a very basic thing in the Hobbit
> 	design that it only tracks the color of each status, not the details of
> 	which rule (out of many) causes e.g. the "procs" column to go red.
>
> 	To do that, you would need to associate some "event ID" with each of the
> 	settings that can cause a red/yellow status; e.g. you'd have
>
> 	   HOST=myhost
> 	       PROC tnslistener 1 ID=100
> 	       PROC httpd 4 ID=200
>
> 	The "procs" status would then store the set of ID's that had been
> 	triggered for a status, and whenever there was a change in the set of
> 	triggered rules it would pass this information to some process.
>
> 	It can be done, but I am not particularly happy with it; it seems a bit
> 	too complex for my taste. If anyone has a better idea, please speak up.
>
> 	(And just in case you wonder why I've used a new "event ID" instead of
> 	re-using the existing "group" definition: I can easily imagine a
> 	scenario where you have e.g. multiple processes monitored with alerts
> 	going to one group of people (i.e. several PROC rules have the same
> 	GROUP setting), but you still want to track exactly which processes are
> 	up or down - and then you need a unique ID for each PROC rule).
>
>
> 	Regards,
> 	Henrik