Computer Systems Reliability
Lecturers
Course Description
Study Programmes
University graduate
General Competencies
Students will develop a systematic understanding of the basic concepts, methods and techniques for the design, implementation and evaluation of reliability, availability and fault tolerance in hardware and software systems. They will gain an understanding of computer system failure models, fault detection, fault masking, fault recovery strategies and testing. Students will be able to apply different approaches to improve and evaluate reliability, availability and fault tolerance, draw independent conclusions, and apply fault-tolerant techniques to different problem areas. They will also become capable of expanding their theoretical and practical knowledge by studying new methodologies and developing critiques of them.
Learning Outcomes
- Describe the principles and theory of computer hardware and software reliability.
- Predict computer system faults.
- Predict computer system dependability.
- Apply probabilistic dependability analysis of fault-tolerant computer systems.
- Apply software reliability techniques.
- Design and evaluate system architectures for fault-tolerant computer systems.
Forms of Teaching
This course will consist of three 45-minute lectures per week. Lectures will emphasize main concepts illustrated with examples, solutions and topic discussions.
Exams: There will be two exams: a mid-term (20% of the final grade) and a final (45%). Homework assignments, short quizzes and class participation are graded as well.
Consultations: Consultation with the instructor will be available at scheduled times and through the e-learning system.
Internship visits: Students will visit a computing centre and be introduced to specific approaches to implementing dependability.
Grading Method
Type | Continuous Assessment: Threshold | Continuous Assessment: Percent of Grade | Exam: Threshold | Exam: Percent of Grade
---|---|---|---|---
Homeworks | 60 % | 25 % | 60 % | 25 %
Quizzes | 0 % | 4 % | 0 % | 0 %
Class participation | 0 % | 6 % | 0 % | 0 %
Mid Term Exam: Written | 50 % | 20 % | 0 % |
Final Exam: Written | 50 % | 45 % | |
Exam: Written | | | 50 % | 55 %
Exam: Oral | | | | 20 %
Week by Week Schedule
- Introduction. Motivation for the course. Basic principles, examples and terminology. Dependability, Reliability, Availability definitions. Faults, Errors, and Failure.
- Fault and Error Models. Failure process. Fault handling.
- Digital system testing. Simulations. Design for Testability. Built-In Test, Built-In Self-Test.
- Reliability Theory. Reliability Evaluation Methods. Failure rate, Mean Time to Failure, Mean Time to Repair. Combinatorial Modeling. Reliability block diagrams (RBD). Monte Carlo simulation.
- Reliability, Availability, and Safety modeling using Markov models. Failure Mode and Effects Analysis.
- Reliability improvement techniques. Fault tolerant design techniques. Hardware redundancy approaches.
- Midterm exam
- Repairable Systems. Standby Systems. Discussions.
- Time redundancy. Detecting and tolerating transient and permanent faults. Information redundancy. Error Detecting and Correcting Codes.
- Software Redundancy. Software Error Models. N-version programming, Recovery blocks.
- Software failure models, prediction of software failure intensities, impact of software failures on system behaviour.
- Fault-tolerance in distributed systems. Byzantine failure model.
- High availability computer systems and services. Maintenance models.
- Experimental analysis of systems reliability and availability. Design methodology. Discussions.
- Final exam