Seminar 11: Fault Tolerance
Goal:
- Become acquainted with the concept of fault tolerance
- Understand the differences between a fault, an error, and a failure
- Implement a program that demonstrates fault tolerance and failure
Definitions:
- Background: Read the article "Fault Tolerance by Design Diversity: Concepts and Experiments" by Algirdas Avizienis and John P. J. Kelly.
- Fault Tolerance:
- Fault tolerance is the survival attribute of computer architectures; when a system is able to recover automatically from fault-caused errors, and to eliminate faults without suffering an externally perceivable failure, the system is said to be fault tolerant.
- Fault tolerance is closely related to the notion of dependable systems.
- Dependability is a term that covers a number of useful requirements for distributed systems including the following:
- Availability is defined as the property that a system is ready to be used immediately.
- Reliability refers to the property that a system can run continuously without failure.
- Safety refers to the situation in which a system temporarily fails to operate correctly, yet no catastrophic event happens.
- Maintainability refers to how easily a failed system can be repaired.
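Availability and reliability are easy to confuse, so a small numeric sketch may help. It is not part of the seminar material: the MTTF (mean time to failure) and MTTR (mean time to repair) figures are made up, and the reliability formula assumes a constant failure rate (an exponential model).

```python
import math


def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is ready for use."""
    return mttf_hours / (mttf_hours + mttr_hours)


def reliability(t_hours, mttf_hours):
    """Probability of running t_hours without failure.

    Assumption: failures occur at a constant rate of 1 / mttf_hours.
    """
    return math.exp(-t_hours / mttf_hours)


# Hypothetical system: fails about once an hour, but recovers in 1 ms.
mttf = 1.0            # hours between failures (assumed value)
mttr = 0.001 / 3600   # 1 millisecond, expressed in hours (assumed value)

print(f"availability: {availability(mttf, mttr):.7f}")
print(f"reliability over 24 h: {reliability(24.0, mttf):.2e}")
```

Under these assumptions the system is almost always available, yet it has practically no chance of running a full day without a failure, which is why the two properties are listed separately.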
- Fault, Error, or Failure: Assume a system named resource (r) is delivering an expected service (s) to a system or a person named user (u):
- Failures: A failure occurs when the user perceives that the resource ceases to deliver the expected service. Examples of failures are:
- The CPU (u) perceives that the memory (r) has delivered a word (s) with the wrong parity.
- Transistor B (u) perceives that the output of transistor A (r) does not change (s) after a test input is applied to A by B.
- Errors: An error occurs when some part of the resource assumes an undesired state. Such a state is contrary to the specification of the resource or the expectation (requirement) of the user. Examples of errors are:
- Parity error - All words are stored in a memory with odd parity, but the "read" operation delivers a word that has even parity.
- Comparison error - Two identical adders receive the same operands and simultaneously deliver sums to a comparator that are not identical in every bit position.
- Faults: A fault is detected when either a failure of the resource occurs, or an error is observed within the resource. The cause of the failure or error is said to be a fault. In most cases the fault can be identified; in some it remains a hypothesis that cannot be adequately verified. Examples of faults are:
- A permanent physical fault - The output of an AND gate is stuck on logic one.
- A transient physical fault - An alpha particle impact changes the state from one to zero in a dynamic MOSFET memory cell.
- Latent Fault: A fault is latent as long as it has not caused any errors, but exists in the resource as a potential cause.
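The parity example from the definitions can be sketched in a few lines of Python. This is only an illustration under assumed names: the `Memory` class, `inject_bit_flip`, and `ParityError` are hypothetical and do not come from the article. The injected bit flip plays the role of the transient fault, the stored word with even parity is the error, and the exception that reaches the reader of the memory is the failure perceived by the user.

```python
class ParityError(Exception):
    """Raised when a word read from memory does not have odd parity."""


def odd_parity_bit(data):
    """Return the parity bit that makes the total number of 1 bits odd."""
    return 0 if bin(data).count("1") % 2 == 1 else 1


def has_odd_parity(data, parity):
    """Check that the data bits plus the parity bit contain an odd number of 1s."""
    return (bin(data).count("1") + parity) % 2 == 1


class Memory:
    """Hypothetical memory (the resource) storing (data, parity) pairs."""

    def __init__(self):
        self.cells = {}

    def write(self, addr, data):
        self.cells[addr] = (data, odd_parity_bit(data))

    def inject_bit_flip(self, addr, bit):
        """Simulate a transient fault: flip one data bit, leave the parity bit alone."""
        data, parity = self.cells[addr]
        self.cells[addr] = (data ^ (1 << bit), parity)

    def read(self, addr):
        data, parity = self.cells[addr]
        if not has_odd_parity(data, parity):
            # The error (undesired state) is observed during the read.
            raise ParityError(f"word at address {addr} has even parity")
        return data


mem = Memory()
mem.write(0, 0b1011)
mem.inject_bit_flip(0, 2)  # fault: one stored bit changes state

try:
    mem.read(0)
except ParityError as exc:
    # The user of the memory perceives that the service was not delivered.
    print("failure perceived by the user:", exc)
```

A single parity bit only detects an odd number of flipped bits; tolerating the fault would need extra redundancy, such as a second copy of the word to fall back on.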
Exercise: Write an example program that has three methods, each called by the main program. Each method must contain a fault that is intentionally introduced into the program code. The main program will be partially fault-tolerant:
- One of the methods introduces a fault that remains invisible (latent) to the main program.
- The fault in the second method manifests as an error, which the main program detects and handles (tolerates) at the method invocation.
- The third fault should be left unhandled by the main program and propagate as a failure to the user.
Requirements: In the program, each fault must be carefully documented, as well as the execution sequence that causes the faults to propagate as errors or failures. The purpose of the exercise is to show that you have understood the threefold model of malfunctions presented in the article.
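One possible shape for such a program is sketched below. It is not the provided code layout: the function names, the sensor scenario, and the input values are all invented for illustration, and other choices of faults would satisfy the exercise equally well.

```python
def average(values):
    # FAULT 1 (latent): no guard against an empty list, which would raise
    # ZeroDivisionError. main() only passes non-empty lists, so this fault
    # never produces an error and stays invisible.
    return sum(values) / len(values)


def parse_reading(text):
    # FAULT 2: assumes "." as the decimal separator. A reading such as "3,5"
    # puts the call into an erroneous state (ValueError), which main() detects.
    return float(text)


def lookup_unit(sensor_id):
    # FAULT 3: the table has no entry for sensor "s3". The resulting KeyError
    # is not handled anywhere, so it propagates to the user as a failure.
    units = {"s1": "C", "s2": "F"}
    return units[sensor_id]


def main():
    # Execution sequence 1: fault 1 is never activated.
    print("average:", average([1.0, 2.0, 3.0]))

    # Execution sequence 2: fault 2 is activated, detected, and tolerated by
    # invoking the method again with a normalized input.
    raw = "3,5"
    try:
        reading = parse_reading(raw)
    except ValueError:
        reading = parse_reading(raw.replace(",", "."))
    print("reading:", reading)

    # Execution sequence 3: fault 3 is activated and left unhandled.
    print("unit:", lookup_unit("s3"))
```

Calling `main()` prints the first two results and then terminates with an uncaught `KeyError`, which is the failure the user perceives. The comments at each fault show the kind of documentation the requirements ask for.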
Deliverables of the practical session: A zip file containing the code file for the exercise above.
Code Layout (.ipynb file): Download the code layout file (CODE LAYOUT). The first two methods have been implemented for you. Add the code for Method 3, whose fault should be left unhandled by the main program and propagate as a failure to the user.
References:
- Distributed Systems, Third edition, Version 3.02 (2018) by Maarten van Steen and Andrew S. Tanenbaum
- Fault Tolerance by Design Diversity: Concepts and Experiments by Algirdas Avizienis and John P. J. Kelly