[hobbit] Client interval question

Thu Dec 15 07:09:39 CET 2005

In message <F098822C-19A2-42CC-B6BA-2AB4E71D18BA at PacketPushers.com>, Scott Walters writes:
}>
}> We run pretty much all of our big brother tests every minute.  On
}> our new hobbit servers, we're running them at the default intervals.
}>
}> BB shows us that our primary name server is going out for less than
}> a minute, about every 62 minutes.
}> Hobbit is missing most of those
}> outages, although the longer "xxxx events received in the last xxx
}> minutes" is what helped us spot the problem, as a whole bunch of
}> machines' services don't respond well when our primary name server
}> is out, and having a mass of servers go yellow then green, in
}> unison, is sort of eye catching.
}
}So hobbit with the xxx events (running every 5m) did provide enough  
}information to indicate an intermittent problem with DNS?

Hobbit's non-green page, with last xxx events, gave us a large
enough view that we could see all the machine services going yellow
at the same time dns went red.  We're monitoring a bit over 260
machines with a whole lot of different services, so there's often
something going red or yellow.  With BB's older default of the last
25 events, there wasn't ever that much on screen to notice a group
of swings to yellow, then back to green.

}Things running every 5m will collide with a problem that happens for  
}a minute frequently enough to 'show up on the radar'

Sure, but with hobbit we'd see up to 13 hours between dns 'red'
events, while BB would catch several in that period.
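
As a rough illustration of that sampling arithmetic, here is a small
simulation sketch. It is not how BB or hobbit schedule their tests: it
assumes checks fire at exact multiples of the interval, and the
45 second outage length and 20 second drift per cycle are made-up
stand-ins for "less than a minute" and "about every 62 minutes".

```python
import random


def count_detections(check_interval_s, outage_period_s=62 * 60,
                     outage_length_s=45, hours=24, jitter_s=20, seed=0):
    """Return (outages_seen, outages_total) for one simulated run.

    All defaults are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)
    horizon_s = hours * 3600

    # Build the outage start times: roughly one per period, with a
    # little random drift so the phase against the checks wanders.
    starts = []
    t = rng.uniform(0, outage_period_s)
    while t < horizon_s:
        starts.append(t)
        t += outage_period_s + rng.uniform(-jitter_s, jitter_s)

    seen = 0
    for start in starts:
        # First check at or after the outage begins (ceiling division).
        next_check = -(-start // check_interval_s) * check_interval_s
        # The outage is noticed only if that check lands before it ends.
        if next_check < start + outage_length_s:
            seen += 1
    return seen, len(starts)


if __name__ == "__main__":
    for interval in (60, 300):
        seen, total = count_detections(interval)
        print(f"{interval:>3}s checks: saw {seen} of {total} outages")
```

Because every 5 minute check time is also a 1 minute check time in this
model, the 1 minute run can never see fewer outages than the 5 minute
run for the same seed. How large the gap is depends on the assumed
outage length and drift.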

I haven't changed hobbit yet to 1 minute checks.  I even explained
explicitly that I wasn't planning to shorten them to 1 minute
checks when we officially switched over, and that was agreed to.
However, since the 1 minute checks actually did make a difference
in tracking down and solving the problem with DNS, I may yet have
to work on that change.  We'll see what kind of feedback I get
after today.  Even then, the only things I'd really be willing to
shorten to that frequency are the remote checks, over the network.

}But every site has different requirements.  It's just been my  
}experience that sampling more frequently than 5m hits the knee-bend  
}of diminishing returns.  It also increases the potential for state  
}changes, which chews up the filesystem with the history info.

I thought it was unnecessary when I originally brought BB into
production years ago, but it was one of the requirements I ended
up with to sell switching to BB.  Some things can't be checked
every minute; I have RAID checks that can take more than a
minute to run.

Tracy J. Di Marco White
Information Technology Services
Iowa State University