21 | 05 | 2012
Down time
Written by R2Launch   

Downtime or outage refers to a period of time or a percentage of a timespan that a system is unavailable or offline. This is usually a result of the system failing to function because of an unplanned event, or because of routine maintenance.

The term is commonly applied to networks and servers, but also to production plants, airport radar systems and similar installations. The common reasons for unplanned outages are system failures (such as a crash), communications failures and operator errors.

The opposite of downtime is uptime.

Characteristics

Downtime may be the result of a software bug, human error, equipment failure, malfunction, power failure, overload, etc.

Impact

Outages caused by system failures can have a serious impact, in particular on industries that rely on near-continuous, 24-hour service:
  • medical informatics
  • nuclear power and other infrastructure
  • production plants
  • banks and other financial institutions
  • aeronautics, airlines
  • news reporting
  • e-commerce and online transaction processing
  • persistent online games
The users of an ISP and other customers of a telecommunication network can also be affected.

Corporations can lose business because of an outage, or they may default on a contract, resulting in financial losses.

People and organizations affected by downtime can be sensitive to different aspects of it:
  • some are more affected by the length of an outage: it matters to them how much time it takes to recover from a problem
  • others are sensitive to the timing of an outage: outages during peak hours affect them the most

The most demanding users are those that require high availability.
 

Service levels

In Service Level Agreements it is common to specify a downtime percentage (per month or per year), calculated by dividing the sum of all downtime periods by the total length of the reference period (e.g. a month). A downtime of 0% means that the server was available the whole time.
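The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard SLA formula: the function name `downtime_percentage` and the representation of outages as (start, end) pairs are assumptions made for the example, and the outages are assumed not to overlap one another.

```python
from datetime import datetime, timedelta


def downtime_percentage(outages, period_start, period_end):
    """Return downtime as a percentage of the reference period.

    `outages` is a list of (start, end) datetime pairs, assumed not to
    overlap each other. Parts of an outage that fall outside the
    reference period are clipped and not counted.
    """
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("reference period must have positive length")
    down = 0.0
    for start, end in outages:
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 100.0 * down / total


# Hypothetical example: a 30-day month with two outages totalling 7.2 hours.
month_start = datetime(2012, 4, 1)
month_end = datetime(2012, 5, 1)
outages = [
    (datetime(2012, 4, 3, 2, 0), datetime(2012, 4, 3, 5, 0)),      # 3 hours
    (datetime(2012, 4, 20, 14, 0), datetime(2012, 4, 20, 18, 12)),  # 4 h 12 min
]
april_downtime = downtime_percentage(outages, month_start, month_end)  # 1.0 (%)
```

Clipping each outage to the reference period matters when an outage spans a month boundary: only the part inside the month being reported is counted.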

For Internet servers, downtime above 1% per year is generally regarded as unacceptable, since it amounts to more than 3 days per year (1% of 365 days is 3.65 days). For e-commerce and other industrial use, any value above 0.1% (roughly 8.8 hours per year) is usually considered unacceptable.
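To make these thresholds concrete, the following sketch converts a downtime percentage into a time budget. The helper name `allowed_downtime_hours` is hypothetical, and a 365-day year is assumed (leap years are ignored).

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored in this sketch


def allowed_downtime_hours(downtime_percent, period_hours=HOURS_PER_YEAR):
    """Convert a downtime percentage into hours of downtime per period."""
    if not 0 <= downtime_percent <= 100:
        raise ValueError("downtime_percent must be between 0 and 100")
    return period_hours * downtime_percent / 100.0


one_percent_days = allowed_downtime_hours(1.0) / 24   # about 3.65 days per year
tenth_percent_hours = allowed_downtime_hours(0.1)     # about 8.76 hours per year
```

The same function works for a monthly reference period by passing, for example, `period_hours=30 * 24`.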

Response and reduction of impact

It is the duty of the system designer to make outages as unlikely as possible. When an outage does happen, a well-designed system reduces its effects by keeping the outage localized, so that it can be detected and fixed as soon as possible.

A process needs to be in place to detect a malfunction (system monitoring) and to restore the system. Restoration generally involves an engineering and operator team that can troubleshoot and solve the problem.
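As an illustration of the detection side, here is a minimal monitoring sketch. It assumes a hypothetical `probe` callable that returns True when the system responds correctly; the class name `OutageDetector` and the threshold of three consecutive failures are choices made for this example, not a prescribed method.

```python
from typing import Callable, Optional


class OutageDetector:
    """Flags an outage after several consecutive failed probes."""

    def __init__(self, probe: Callable[[], bool], failure_threshold: int = 3):
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.probe = probe
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.is_down = False

    def check(self) -> Optional[str]:
        """Run one probe; return "down" or "recovered" on a state change."""
        try:
            healthy = bool(self.probe())
        except Exception:
            healthy = False  # a probe that raises counts as a failure
        if healthy:
            self.consecutive_failures = 0
            if self.is_down:
                self.is_down = False
                return "recovered"
            return None
        self.consecutive_failures += 1
        if not self.is_down and self.consecutive_failures >= self.failure_threshold:
            self.is_down = True
            return "down"
        return None


# Simulated probe results stand in for a real health check.
results = iter([True, False, False, False, False, True])
detector = OutageDetector(lambda: next(results), failure_threshold=3)
events = [detector.check() for _ in range(6)]
```

Requiring several consecutive failures before reporting an outage avoids alerting the operator team on a single dropped check. In practice `check()` would be called on a timer, and a "down" event would notify the team responsible for restoring the system.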

Risk management techniques can be used to determine the impact of network outages on an organisation and what actions may be required to minimise risk. Risk may be minimised by using reliable components, by performing maintenance, such as upgrades, by using redundant systems or by having a contingency plan or business continuity plan.

Planning

A planned outage is the result of a planned activity such as maintenance, a change or an upgrade.

Maintenance downtimes have to be carefully scheduled. In many cases, system-wide downtime can be avoided by using what is called a "rolling upgrade": the process of incrementally taking down parts of the system for upgrade, without affecting the overall functionality.
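The idea of a rolling upgrade can be sketched as follows. This is a simplified illustration under stated assumptions: `upgrade` and `is_healthy` are hypothetical callables supplied by the caller, nodes are upgraded in fixed-size batches, and the procedure simply stops at the first node that fails its health check.

```python
def rolling_upgrade(nodes, upgrade, is_healthy, batch_size=1):
    """Upgrade nodes batch by batch so that the rest keep serving.

    Returns the list of nodes upgraded and verified healthy. Stops early
    if an upgraded node fails its health check, leaving the remaining
    nodes untouched on the old version.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if nodes and batch_size >= len(nodes):
        # Taking every node down at once would be a system-wide outage.
        raise ValueError("batch_size must be smaller than the number of nodes")
    upgraded = []
    for i in range(0, len(nodes), batch_size):
        batch = nodes[i:i + batch_size]
        for node in batch:
            upgrade(node)  # the node is out of service while this runs
        for node in batch:
            if not is_healthy(node):
                return upgraded  # halt: do not touch the remaining nodes
            upgraded.append(node)
    return upgraded


# Hypothetical four-node system in which node "c" fails after its upgrade.
upgrade_log = []
done = rolling_upgrade(
    ["a", "b", "c", "d"],
    upgrade=upgrade_log.append,
    is_healthy=lambda node: node != "c",
)
```

In this run the upgrade is attempted on "a", "b" and "c", only "a" and "b" are reported as done, and "d" is left on the old version, so at most one node is out of service at any time.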