Computer Systems Reliability

Data is displayed for academic year: 2023./2024.

Course Description

The widespread use of computer systems and our dependence on their services make them an unavoidable part of our lives, so it is of great importance to limit the damage caused by their failures to acceptable levels. This course studies concepts, methods and techniques for the design, implementation and analysis of the reliability, availability and fault tolerance of computer system hardware and software. The objective is to develop an understanding of the impairments to computer systems' dependability, of the means to improve systems and their attributes, and of how to evaluate them. Emphasis is placed on the study of reliability and fault tolerance of computer systems, on developing the ability to apply basic principles to building improved real systems, and on introducing tools for the analysis and evaluation of system attributes.

Study Programmes

University graduate
[FER2-HR] Computer Engineering - profile
Theoretical Course (2nd semester)

General Competencies

Students will develop a systematic understanding of basic concepts, methods and techniques for the design, implementation and evaluation of the reliability, availability and fault tolerance of hardware and software systems. They will gain an understanding of computer system failure models, fault detection, fault masking, fault recovery strategies and testing. Students will be able to apply different approaches to improve and evaluate reliability, availability and fault tolerance, to draw independent conclusions, and to apply fault-tolerant techniques to different problem areas. They will also become capable of expanding their theoretical and practical knowledge by studying new methodologies and developing critiques of them.

Learning Outcomes

  1. Describe the principles and theory of computer hardware and software reliability.
  2. Predict computer system faults.
  3. Predict computer system dependability.
  4. Apply probabilistic dependability analysis of fault-tolerant computer systems.
  5. Apply software reliability techniques.
  6. Design and evaluate system architectures for fault-tolerant computer systems.

Forms of Teaching

Lectures

This course will consist of three 45-minute lectures per week. Lectures will emphasize main concepts illustrated with examples, solutions and topic discussions.

Exams

There will be two exams: a mid-term (20% of the final grade) and a final (45%). Homework assignments, short quizzes and class participation are graded as well.

Consultations

Consultations with the instructor will be available at scheduled times and through the e-learning system.

Internship visits

Students will visit a computing centre and will be introduced to specific approaches to implementing dependability.

Grading Method

                          Continuous Assessment          Exam
Type                      Threshold  Percent of Grade    Threshold  Percent of Grade
Homeworks                 60 %       25 %                60 %       25 %
Quizzes                   0 %        4 %                 0 %        0 %
Class participation       0 %        6 %                 0 %        0 %
Mid Term Exam: Written    50 %       20 %                0 %        -
Final Exam: Written       50 %       45 %                -          -
Exam: Written             -          -                   50 %       55 %
Exam: Oral                -          -                   -          20 %

Week by Week Schedule

  1. Introduction. Motivation for the course. Basic principles, examples and terminology. Definitions of dependability, reliability and availability. Faults, errors and failures.
  2. Fault and Error Models. Failure process. Fault handling.
  3. Digital system testing. Simulations. Design for Testability. Built-In Test, Built-In Self-Test.
  4. Reliability Theory. Reliability Evaluation Methods. Failure rate, Mean Time to Failure, Mean Time to Repair. Combinatorial Modeling. Reliability block diagrams (RBD). Monte Carlo simulation.
  5. Reliability, Availability, and Safety modeling using Markov models. Failure Mode and Effects Analysis.
  6. Reliability improvement techniques. Fault-tolerant design techniques. Hardware redundancy approaches.
  7. Midterm exam
  8. Repairable Systems. Standby Systems. Discussions.
  9. Time redundancy. Detecting and tolerating transient and permanent faults. Information redundancy. Error Detecting and Correcting Codes.
  10. Software Redundancy. Software Error Models. N-version programming, Recovery blocks.
  11. Software failure models, prediction of software failure intensities, impact of software failures on systems behaviour.
  12. Fault-tolerance in distributed systems. Byzantine failure model.
  13. High availability computer systems and services. Maintenance models.
  14. Experimental analysis of systems reliability and availability. Design methodology. Discussions.
  15. Final exam

Literature

D.P. Siewiorek, R.S. Swarz (1998.), Reliable Computer Systems: Design and Evaluation, AK Peters, Ltd.
M.L. Shooman (2002.), Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design, J. Wiley & Sons
H. Pham (2000.), Software Reliability, Springer
M. Xie, J.S. Dai, K.L. Poh (2004.), Computing System Reliability: Models and Analysis, Kluwer Academic
M. Rausand, A. Hoyland (2004.), System Reliability Theory: Models, Statistical Methods, and Applications, J. Wiley & Sons

For students

General

ID 34505
  Summer semester
5 ECTS
L2 English Level
L1 e-Learning
45 Lectures
0 Seminar
0 Exercises
0 Laboratory exercises
0 Project laboratory
0 Physical education exercises

Grading System

85 Excellent
72 Very Good
60 Good
50 Sufficient