Clients often state, as a goal, that a complex computer network be
fully available, all of the time. While there are techniques to
increase the probability of being 'available' all of the time, let's
examine the issue by defining some outage percentages, and then working
through the implications. Customarily, this is done using a string of
'9's to examine the outages by orders of magnitude.

Some useful numbers to know

Seconds in a day     86,400
Seconds in a year    31,536,000

99%       - Two Nines   - 87 hours a year
99.9%     - Three Nines - 8 hours a year
99.99%    - Four Nines  - 52 minutes a year
99.999%   - Five Nines  - 5 minutes a year
99.9999%  - Six Nines   - Under one minute a year

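
The figures in the table follow directly from the seconds-in-a-year number above. As a minimal sketch (the function names `downtime_seconds_per_year` and `describe` are made up for this illustration, and a 365-day year is assumed), the arithmetic can be reproduced like this:

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # 31,536,000; leap years ignored


def downtime_seconds_per_year(nines: int) -> float:
    """Permitted outage per year, in seconds, for a given count of nines.

    Each additional nine divides the permitted outage by ten:
    two nines (99%) allows 1/100 of the year, three nines 1/1000, and so on.
    """
    return SECONDS_PER_YEAR / 10 ** nines


def describe(seconds: float) -> str:
    """Render a duration in the largest unit that suits it."""
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.1f} minutes"
    return f"{seconds:.1f} seconds"


for n in range(2, 7):
    print(f"{n} nines: {describe(downtime_seconds_per_year(n))} a year")
```

The unrounded values come out at 87.6 hours, 8.76 hours, 52.56 minutes, 5.26 minutes, and about 31.5 seconds, which the table above shortens to whole numbers.
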
1. Service level agreements are most customarily stated in terms of
response times, not times to repair. Garden variety service level
agreements cover only 'business hours'. Telephone carrier and data circuit
provider service level agreements are for a four hour response. Some carry
no express timeframe, and provide for 'reasonable' or 'best efforts'
responses.
2. A site will often attain Three Nines (that is, two half day
outages a year) with proper systems provisioning, administration, and
other infrastructure. Recovery from a power outage, a failed hard drive
with adequate backups, and so forth can be this rapid (assuming suitable
backup hardware is available at site). Note that the incremental outages of
running a disk maintenance tool, or applying virus scanner and
operating system updates which require a 5 minute install followed
by a 5 minute reboot phase weekly, add up to 520 minutes a year -- roughly
our predicted eight hours.
In the common experience with consumer grade operating systems, we do not
attain Three Nines.
3. To reach Four Nines, customarily a main unit and at least a 'Cold Spare'
are needed. Better is a main unit and a 'Hot Spare'. With a Cold Spare,
a reserve pre-configured unit is maintained at site, powered off most of the
time, but available on short notice. If the main unit is unavailable for
some reason, the Cold Spare is started and placed into service. This should
permit reaching Four Nines.
4. As we move to Five Nines, things shift and become a lot more demanding
-- a Hot Spare needs to be on, and a 'heartbeat', or an automated
'failover' scheme is needed. When an outage occurs on either the main or
the Hot Spare units, a reserve Cold Spare needs to be brought on line
quickly to maintain coverage. This extra hardware and software requires
more system administration, and represents the limit which
most ordinary commercial enterprises are willing to build for.
(By the way, this is not irrational -- a cost/benefit analysis
needs to be performed, and tolerating an hour's outage a year is probably
acceptable to most entities.)
5. Six Nines is reserved for functions like controlling a highly hazardous
system, where a failure can result in loss of life, or great damage to
property. A medical device connected to a living person, or a chemical
refinery dealing with volatile elements, or a nuclear power plant come
to mind. Such systems are rare to find, and very expensive to attain.
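
The weekly maintenance arithmetic in item 2 can be checked against the yearly budgets in the same way. The sketch below is only illustrative: `yearly_downtime_minutes` and `nines_achieved` are hypothetical helper names, a 365-day year is assumed, and the only input taken from the text is the 5 minute install plus 5 minute reboot, once a week.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def yearly_downtime_minutes(minutes_per_event: float, events_per_year: int) -> float:
    """Total outage per year from a recurring event of fixed length."""
    return minutes_per_event * events_per_year


def nines_achieved(downtime_minutes: float, max_nines: int = 6) -> int:
    """Highest count of nines whose yearly budget still covers the outage.

    The budget for N nines is 1/10**N of the year. Returns 0 when even
    the one-nine (90%) budget is exceeded.
    """
    achieved = 0
    for n in range(1, max_nines + 1):
        budget = MINUTES_PER_YEAR / 10 ** n
        if downtime_minutes > budget:
            break
        achieved = n
    return achieved


# Item 2: a 5 minute install plus a 5 minute reboot, every week of the year.
maintenance = yearly_downtime_minutes(minutes_per_event=10, events_per_year=52)
three_nines_budget = MINUTES_PER_YEAR / 10 ** 3

print(f"Planned maintenance: {maintenance} minutes a year")
print(f"Nines achieved by maintenance alone: {nines_achieved(maintenance)}")
print(f"Left for unplanned outages: {three_nines_budget - maintenance:.1f} minutes")
```

Under these assumptions the 520 minutes of planned maintenance fits inside the Three Nines budget of 525.6 minutes, but leaves only about 5.6 minutes a year for anything unplanned. That is one way to read the remark that consumer grade systems do not, in practice, attain Three Nines.
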