[hobbit] SLA 99.9999 question?

Scott Walters scott at PacketPushers.com
Fri Jan 6 16:30:47 CET 2006

On Fri, 6 Jan 2006, mario andre wrote:

> The great bbretest-net tool reduced the interval for the TCP tests, so now,
> in some cases a 3 or 4 digit accuracy could be shown.

I agree the bbretest-net is great.  I disagree that it increases the
accuracy of availability measurements to 3 or 4 digits.

The re-test only affects the frequency when a failure is detected.  When
tests pass, the interval is still the standard 5 minutes.

A service could be down for 2 minutes before the 're-test' kicks in.  That
two minutes of 'missed' downtime is roughly 0.0004% of the year.

With 5 minute intervals, only 4 significant digits should be used or
'rounding errors' will be compounded.
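
A rough sketch of that arithmetic in Python, assuming a 365-day year.  The
function name and the two sample durations are illustrative only and are not
part of Hobbit or bbretest-net:

```python
# Express a stretch of unobserved downtime as a percentage of one year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def percent_of_year(minutes: float) -> float:
    """Return the given number of minutes as a percentage of one year."""
    return 100.0 * minutes / MINUTES_PER_YEAR


if __name__ == "__main__":
    samples = [
        ("2 minutes missed before the re-test", 2),
        ("one full 5 minute polling interval", 5),
    ]
    for label, minutes in samples:
        print(f"{label}: {percent_of_year(minutes):.5f}% of the year")
```

Each outage that falls between polls can go unmeasured by about this much,
and the error adds up with every outage.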

If it is because management wants it, fine, but mathematically you're
making it up.

And from a business perspective, I've found availability statistics an
extremely poor way of managing expectations for SLAs.  They are barely good
for measuring them.

There are two 'million dollar questions':

1)  When does the service need to be available?

2)  If it is down, what is the longest outage you can tolerate?

	* And be prepared to offer the cost differences between 1, 4, 12, and 24
hour recovery windows.  Customers will change their tune quickly when they
see the costs associated with 'zero downtime' environments.  Obviously,
stock exchanges, 911 call centers, and eBay environments are prepared to
pay ... and charge an extra gazillion dollars if you can never get a
maintenance window.

The answers to those two questions will make clear what the infrastructure
requirements (technical, staffing, etc.) are.

"Three kinds of lies: Lies, damned lies, and statistics. "
- Mark Twain

Scott Walters
