What is Reliability
In computing, reliability can be measured (it has a metric).
One of the targets that a typical system administrator will have is the reliability of the system they are responsible for.
The 'metric' tends to be an average measure of some kind. For example the ones below can be used to assess reliability.
AVAIL (availability) is the percentage of time that the system is available to the user or customer, not counting any planned downtime, for example for maintenance.
For example, if a computer system is available for 950 hours out of a planned uptime period of 1000 hours, the system AVAIL figure is 95%, i.e. 950/1000 * 100%.
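The AVAIL arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `availability_percent` and its parameter names are made up for this example, not taken from any standard tool.

```python
def availability_percent(available_hours: float, planned_uptime_hours: float) -> float:
    """Return availability as a percentage of the planned uptime period."""
    if planned_uptime_hours <= 0:
        raise ValueError("planned uptime must be greater than zero")
    # Multiply first so that whole-number inputs give an exact result.
    return available_hours * 100 / planned_uptime_hours


# The figures from the example above: 950 hours available out of 1000 planned.
print(availability_percent(950, 1000))  # 95.0
```

Planned maintenance is assumed to be already excluded from `planned_uptime_hours`, as in the definition above.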
MTTF stands for 'Mean Time To Failure'.
This is the average number of hours that the system runs for before something breaks.
This tends to be a measure for hardware such as a server. For example if a hard disk fails on average after 500 hours, then its MTTF figure is 500 hours.
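As a rough sketch, MTTF can be estimated by averaging the running times observed before each failure. The function name `mean_time_to_failure` and the sample figures are invented for illustration; they are not real measurements.

```python
from typing import Sequence


def mean_time_to_failure(hours_until_failure: Sequence[float]) -> float:
    """Average the running hours recorded before each observed failure."""
    if not hours_until_failure:
        raise ValueError("at least one observed failure is needed")
    return sum(hours_until_failure) / len(hours_until_failure)


# Hypothetical test data: four identical disks ran this long before failing.
observed_hours = [450, 520, 480, 550]
print(mean_time_to_failure(observed_hours))  # 500.0
```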
There are a number of other metrics of reliability as well, such as 'probability of failure on demand'.
Effects of failure and redundancy
Reliability is extremely important because the effects of a system failure can be disastrous.
If a business loses all its data, the results may be catastrophic. Such a scenario can destroy a company, so 'disaster recovery' plans need to be in place.
Everything eventually breaks down, so there need to be methods in place to deal with it. An excellent method is 'redundancy'.
This means that critical parts of the system are duplicated so if a failure occurs, the other component or sub-system takes over.
For example, the power supply to a rack server may be duplicated. If the power begins to sag due to component failure, the other power supply kicks in. The same can be done for an entire data centre: if the normal mains supply fails, a set of diesel generators kicks in to take over.
Personal computers can be protected with an 'uninterruptible power supply' or UPS. This has a large battery in it that will keep the computer going for a few minutes while it is shut down in an orderly manner.
Data redundancy means duplicating the data in more than one place within a computer system.
This is a good idea because if one component fails then the data is not lost.
This can be achieved by running two or more hard disks in parallel, each storing the same data.
This arrangement is called a RAID 1 array, otherwise known as 'disk mirroring'.
If one disk fails, then an alert is issued but the other hard disk just carries on working. No data is lost.
A technician then removes the broken disk from the server and replaces it with a new one.
This is called 'hot swapping' because the power remains on.
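The mirroring and hot swapping described above can be modelled with a toy Python class. This is a simplified sketch, not how a real RAID controller works: each "disk" is just a dictionary, and the class and method names (`MirroredArray`, `fail`, `hot_swap`) are made up for the example. Copying the data onto the replacement disk is an assumption added here so that the new disk becomes a full mirror again.

```python
class MirroredArray:
    """Toy model of disk mirroring: every write goes to all working disks."""

    def __init__(self, disk_count=2):
        # Each "disk" is a dict mapping a block number to its data.
        # A failed disk is represented by None.
        self.disks = [{} for _ in range(disk_count)]

    def write(self, block, data):
        for disk in self.disks:
            if disk is not None:
                disk[block] = data

    def read(self, block):
        # Read from the first disk that is still working.
        for disk in self.disks:
            if disk is not None:
                return disk[block]
        raise OSError("all disks have failed")

    def fail(self, index):
        self.disks[index] = None
        print(f"ALERT: disk {index} has failed")

    def hot_swap(self, index):
        # Assumption: the new disk is filled by copying from a working disk.
        survivor = next(disk for disk in self.disks if disk is not None)
        self.disks[index] = dict(survivor)


array = MirroredArray()
array.write(0, b"customer records")
array.fail(0)          # one disk breaks; an alert is issued
print(array.read(0))   # the other disk carries on: b'customer records'
array.hot_swap(0)      # the broken disk is replaced while the system runs
```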
Software redundancy is much rarer, but it is sometimes used for safety-critical applications.
The reason for software redundancy is that it is very difficult to create perfect code. Obscure software faults may lie in wait until a rare event happens.
So for critical software, where failure is unacceptable, the idea is to have three software routines, each written by independent coding teams, producing the same output given the same input.
A kind of voting then takes place within the system. If there is no fault, all three modules produce identical outputs at all times. If one of the three routines differs, then either that routine has an undiscovered bug or the hardware it is running on has failed. In that case its output is ignored as a control input and an error signal is sent.
This is the kind of effort that goes into writing code for absolutely critical applications:
- Medical equipment such as patient monitoring
- Transport - anti-lock brakes, railway safety control
- Aviation - fly-by-wire systems
- Nuclear power station control software
- Space - On-board satellite software
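The three-way voting used in software redundancy can be sketched in Python as follows. This is a minimal illustration only: the three routines here are invented stand-ins (one is deliberately given a bug that shows up on a rare input), and a real safety-critical system would be far more elaborate.

```python
from collections import Counter


def vote(outputs):
    """Return the majority output and the positions of routines that disagreed."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: the routines all disagree")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


# Three hypothetical routines, imagined as written by independent teams.
def routine_a(x):
    return x * x


def routine_b(x):
    return x ** 2


def routine_c(x):
    # Deliberate bug for illustration: wrong answer for one rare input.
    return x * x + (1 if x == 7 else 0)


def run_voted(x):
    outputs = [routine(x) for routine in (routine_a, routine_b, routine_c)]
    value, faulty = vote(outputs)
    for index in faulty:
        print(f"ERROR: routine {index} disagreed and was ignored")
    return value


print(run_voted(3))  # all three agree: 9
print(run_voted(7))  # routine 2 is outvoted and an error is reported: 49
```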
As well as redundancy, in order to recover from a failure it is essential that all the data is also stored elsewhere. This is called 'data backup'.
For the individual or small office running non-server personal computers, it is a good idea to have an attached external hard drive and a good backup application running on each machine.
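To give an idea of what a backup application does at its simplest, here is a sketch in Python that copies every file in a folder to a backup location and checks each copy against the original. The function names `back_up` and `sha256_of` are made up for this example, and it assumes the backup folder is not inside the folder being backed up. A real backup application does much more, such as scheduling and keeping older versions.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return a checksum of the file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def back_up(source_dir: Path, backup_dir: Path) -> int:
    """Copy every file under source_dir into backup_dir; return the number copied."""
    copied = 0
    for source in source_dir.rglob("*"):
        if not source.is_file():
            continue
        # Keep the same folder layout inside the backup location.
        target = backup_dir / source.relative_to(source_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        # Verify the copy so that a bad backup is noticed straight away.
        if sha256_of(source) != sha256_of(target):
            raise OSError(f"backup copy of {source} does not match the original")
        copied += 1
    return copied
```

It could be called as `back_up(Path("Documents"), Path("E:/backup"))`, where both paths are placeholders for the user's own folders.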
For a file server, it is standard practice to back up to digital tape machines or to replicate the data on a remote server.
For large organisations, such as the building society mentioned earlier, an entire data centre may be duplicated in a separate location, so a fire or flood in the main centre will not result in data loss.
Another attractive option is to hire an external specialist in data storage. The business (or individual) then uploads its data every night to the remote data centre managed by the other company. 'Cloud' computing and backup is becoming popular because it is a cost-effective solution, especially for small to medium-sized companies (SMEs).
- Reliability is an important factor in computer systems
- Reliability can be measured in a number of ways - AVAIL and MTTF
- The effect of system failure can be catastrophic unless disaster recovery plans are in place
- Redundancy is a key feature of improving reliability
- Redundancy can include hardware, data and software redundancy
- Backup is also a key feature of recovering from a problem
- There are a number of ways to back up data depending on the circumstances, from individual external drives to cloud backup