
Owl River Company

 

Examining High Availability, Outages, 99.9%

Clients often state, as a goal, that a complex computer network be fully available, all of the time. While there are techniques to increase the probability of being 'available' all of the time, let's examine the issue by defining some outage percentages, and then working through the implications. Customarily, this is done using a string of '9's, so that the permitted outage shrinks by an order of magnitude at each step.

Some useful numbers to know
 Seconds in a day    86,400
 Seconds in a year   31,536,000

  1.      99% - Two Nines - 87 hours a year
  2.    99.9% - Three Nines - 8 hours a year
  3.   99.99% - Four Nines - 52 minutes a year
  4.  99.999% - Five Nines - 5 minutes a year
  5. 99.9999% - Six Nines - Under one minute a year

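The outage figures in the list above follow directly from the seconds-in-a-year number. A minimal sketch of that arithmetic, assuming a 365 day year (the function names are illustrative, not from the original page):

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # 31,536,000; leap years ignored


def downtime_seconds_per_year(nines: int) -> float:
    """Permitted outage per year, in seconds, for an availability of N nines."""
    unavailable_fraction = 10.0 ** -nines  # two nines -> 0.01, three -> 0.001
    return SECONDS_PER_YEAR * unavailable_fraction


def describe(seconds: float) -> str:
    """Render a duration in the largest convenient unit."""
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.1f} minutes"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for nines in range(2, 7):
        budget = downtime_seconds_per_year(nines)
        print(f"{nines} nines: {describe(budget)} a year")
```

The exact values (87.6 hours, 8.76 hours, 52.56 minutes and so on) are slightly larger than the rounded-down figures quoted in the list.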
1. Service level agreements are customarily stated in terms of response times, not times to repair. Garden variety service level agreements cover only 'business hours'. Telephone carrier and data circuit provider service level agreements typically promise a four hour response. Some carry no express timeframe, and provide only for 'reasonable' or 'best efforts' responses.

2. A site will often attain Three Nines (that is, two half day outages a year) with proper systems provisioning, administration, and other infrastructure. A power outage, a failed hard drive with adequate backups, and so forth (assuming suitable backup hardware is available at site) can be recovered from this rapidly. Note that routine maintenance counts as well: running a disk maintenance tool, or applying virus scanner and operating system updates which require a 5 minute install followed by a 5 minute reboot phase each week, adds up to 520 minutes a year -- roughly our predicted eight hours.
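The weekly maintenance arithmetic can be checked against the Three Nines allowance. This is a small sketch; the function names are hypothetical, and the 5 minute install and reboot figures are the ones used in the text:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def yearly_maintenance_minutes(install_min: float, reboot_min: float,
                               runs_per_year: int = 52) -> float:
    """Planned outage per year from a recurring install-and-reboot cycle."""
    return (install_min + reboot_min) * runs_per_year


def availability(downtime_min: float) -> float:
    """Fraction of the year the system is up, given total minutes of outage."""
    return 1 - downtime_min / MINUTES_PER_YEAR


planned = yearly_maintenance_minutes(5, 5)      # 520 minutes
budget_three_nines = MINUTES_PER_YEAR * 0.001   # 525.6 minutes

if __name__ == "__main__":
    print(f"planned maintenance: {planned} min a year")
    print(f"three nines budget:  {budget_three_nines:.1f} min a year")
    print(f"left for unplanned outages: {budget_three_nines - planned:.1f} min")
    print(f"availability from maintenance alone: {availability(planned):.5f}")
```

Under these assumptions the weekly updates alone use nearly all of the Three Nines allowance, leaving only a few minutes a year for unplanned outages.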

In our common experience, sites running consumer grade operating systems do not attain Three Nines.

3. To reach Four Nines, a main unit and at least a 'Cold Spare' are customarily needed. Better is a main unit and a 'Hot Spare'. With a Cold Spare, a pre-configured reserve unit is maintained at site, powered off most of the time, but available on short notice. If the main unit is unavailable for some reason, the Cold Spare is started and placed into service. This should permit reaching Four Nines.
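To see why a Cold Spare can move a site from Three Nines to Four Nines, it may help to hold the number of failures fixed and vary only the time to restore service. The sketch below assumes two outages a year (as in the Three Nines note), a four hour rebuild without a spare, and a purely hypothetical 25 minutes to start a Cold Spare:

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760


def unavailability(outages_per_year: float, hours_to_restore: float) -> float:
    """Fraction of the year lost to outages."""
    return outages_per_year * hours_to_restore / HOURS_PER_YEAR


def whole_nines(unavailable_fraction: float) -> int:
    """Count of leading nines in the availability, e.g. 0.0005 down -> 3 nines."""
    if unavailable_fraction <= 0:
        raise ValueError("no downtime recorded; the count of nines is unbounded")
    # The tiny epsilon guards against rounding at exact powers of ten.
    return math.floor(-math.log10(unavailable_fraction) + 1e-9)


no_spare = unavailability(2, 4.0)        # two half day outages a year
cold_spare = unavailability(2, 25 / 60)  # assumed 25 minute switch-over

if __name__ == "__main__":
    print(f"no spare:   {whole_nines(no_spare)} nines")
    print(f"cold spare: {whole_nines(cold_spare)} nines")
```

The switch-over time is the assumption that matters here: two outages a year only fit the 52 minute allowance if each is resolved in under about 26 minutes.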

4. As we move to Five Nines, things shift and become a lot more demanding -- a Hot Spare needs to be on, and a 'heartbeat' or other automated 'failover' scheme is needed. When an outage occurs on either the main or the Hot Spare unit, a reserve Cold Spare needs to be brought on line quickly to maintain coverage. This extra hardware and software requires more system administration, and represents the limit which most ordinary commercial enterprises are willing to build for.
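A heartbeat scheme can be sketched in a few lines. The following is an illustrative toy, not a description of any particular failover product: the class name, the unit names, and the three second timeout are all assumptions, and timestamps are passed in explicitly so that the logic can be followed without real clocks or networking.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverMonitor:
    """Tracks heartbeats and promotes the standby when the active unit goes silent."""
    timeout_s: float = 3.0          # assumed: silence longer than this means 'down'
    active: str = "main"
    standby: str = "hot-spare"
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, unit: str, now: float) -> None:
        """Record that a unit reported in at time 'now' (seconds)."""
        self.last_seen[unit] = now

    def is_alive(self, unit: str, now: float) -> bool:
        seen = self.last_seen.get(unit)
        return seen is not None and now - seen <= self.timeout_s

    def check(self, now: float) -> str:
        """Return the unit that should serve, swapping roles on a failure."""
        if not self.is_alive(self.active, now) and self.is_alive(self.standby, now):
            self.active, self.standby = self.standby, self.active
        return self.active

    def needs_cold_spare(self, now: float) -> bool:
        """True when the standby is silent, so a Cold Spare should be started."""
        return not self.is_alive(self.standby, now)


if __name__ == "__main__":
    monitor = FailoverMonitor()
    monitor.heartbeat("main", 0.0)
    monitor.heartbeat("hot-spare", 0.0)
    print(monitor.check(1.0))             # both units are reporting
    monitor.heartbeat("hot-spare", 4.0)   # main has stopped reporting
    print(monitor.check(5.0))             # roles are swapped
    print(monitor.needs_cold_spare(5.0))  # the failed main is now the standby
```

In this sketch the failed unit becomes the (dead) standby after the swap, which is exactly the situation in which the text calls for a Cold Spare to be brought on line.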

(By the way, this is not irrational -- a cost/benefit analysis needs to be performed, and tolerating an hour's outage a year is probably acceptable to most entities.)
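The cost/benefit comparison can be made concrete with a rough calculation. Every figure below is hypothetical (the hourly cost of an outage and the yearly cost of the extra hardware and administration are invented for illustration); only the outage allowances come from the list of nines:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def downtime_hours(nines: int) -> float:
    """Permitted outage per year, in hours, for an availability of N nines."""
    return HOURS_PER_YEAR * 10.0 ** -nines


def yearly_saving(cost_per_hour: float, nines_now: int, nines_target: int) -> float:
    """Outage cost avoided each year by moving to the higher availability level."""
    return cost_per_hour * (downtime_hours(nines_now) - downtime_hours(nines_target))


def worth_upgrading(cost_per_hour: float, nines_now: int, nines_target: int,
                    yearly_upgrade_cost: float) -> bool:
    """True when the avoided outage cost exceeds the cost of the upgrade."""
    return yearly_saving(cost_per_hour, nines_now, nines_target) > yearly_upgrade_cost


if __name__ == "__main__":
    # Hypothetical inputs: an outage costs 10,000 an hour, and moving from
    # Four Nines to Five Nines costs 50,000 a year in hardware and staff time.
    saving = yearly_saving(10_000, 4, 5)
    print(f"avoided outage cost: {saving:,.0f} a year")
    print("upgrade pays for itself:", worth_upgrading(10_000, 4, 5, 50_000))
```

With these made-up inputs the upgrade does not pay for itself, which matches the observation that most enterprises stop short of Five Nines; a business with a much higher hourly outage cost would reach the opposite conclusion.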

5. Six Nines is reserved for functions like controlling a highly hazardous system, where a failure can result in loss of life, or great damage to property. A medical device connected to a living person, a chemical refinery dealing with volatile elements, or a nuclear power plant come to mind. Such systems are rare to find, and very expensive to attain.

The Recap
 99%     99.9%   99.99%   99.999%   99.9999%
 87 hr   8 hr    52 min   5 min     30 sec

Copyright © 2008 Owl River Company
All rights reserved.

Last modified: Mon, 09 Oct 2006 16:02:16 -0400
http://www.owlriver.com/support/highavailability/index.php