Dependable Computing

Ravishankar K. Iyer
Description

Covering dependability from software and hardware perspectives, Dependable Computing: Design and Assessment looks at both the software and hardware aspects of dependability. This book:

* Provides an in-depth examination of dependability/fault tolerance topics
* Describes dependability taxonomy, and briefly contrasts classical techniques with their modern counterparts or extensions
* Walks up the system stack from the hardware logic via operating systems up to software applications with respect to how they are hardened for dependability
* Describes the use of measurement-based analysis of computing systems
* Illustrates technology through real-life applications
* Discusses security attacks and unique dependability requirements for emerging applications, e.g., smart electric power grids and cloud computing
* Finally, using critical societal applications such as autonomous vehicles, large-scale clouds, and engineering solutions for healthcare, illustrates the emerging challenges faced in making artificial intelligence (AI) and its applications dependable and trustworthy

This book is suitable for those studying in the fields of computer engineering and computer science. Professionals who are working within the new reality to ensure dependable computing will find helpful information to support their efforts. With the support of practical case studies and use cases from both academia and real-world deployments, the book provides a journey of developments that include the impact of artificial intelligence and machine learning on this ever-growing field. This book offers a single compendium that spans the myriad areas in which dependability has been applied, providing theoretical concepts and applied knowledge with content that will excite a beginner, and rigor that will satisfy an expert. Accompanying the book is an online repository of problem sets and solutions, as well as slides for instructors, that span the chapters of the book.


Page count: 1556

Publication year: 2024




Table of Contents

Cover

Table of Contents

Series Page

Title Page

Copyright Page

Dedication Page

About the Authors

Preface

Acknowledgments

About the Companion Website

1 Dependability Concepts and Taxonomy

1.1 Introduction

1.2 Placing Classical Dependability Techniques in Perspective

1.3 Taxonomy of Dependable Computing

1.4 Fault Classes

1.5 The Fault Cycle and Dependability Measures

1.6 Fault and Error Classification

1.7 Mean Time Between Failures

1.8 User‐perceived System Dependability

1.9 Technology Trends and Failure Behavior

1.10 Issues at the Hardware Level

1.11 Issues at the Platform Level

1.12 What is Unique About this Book?

1.13 Overview of the Book

References

2 Classical Dependability Techniques and Modern Computing Systems

2.1 Illustrative Case Studies of Design for Dependability

2.2 Cloud Computing: A Rapidly Expanding Computing Paradigm

2.3 New Application Domains

2.4 Insights

References

3 Hardware Error Detection and Recovery Through Hardware‐Implemented Techniques

3.1 Introduction

3.2 Redundancy Techniques

3.3 Watchdog Timers

3.4 Information Redundancy

3.5 Capability and Consistency Checking

3.6 Insights

References

4 Processor Level Error Detection and Recovery

4.1 Introduction

4.2 Logic‐level Techniques

4.3 Error Protection in the Processors

4.4 Academic Research on Hardware‐level Error Protection

4.5 Insights

References

5 Hardware Error Detection Through Software‐Implemented Techniques

5.1 Introduction

5.2 Duplication‐based Software Detection Techniques

5.3 Control‐Flow Checking

5.4 Heartbeats

5.5 Assertions

5.6 Insights

References

6 Software Error Detection and Recovery Through Software Analysis

6.1 Introduction

6.2 Diverse Programming

6.3 Static Analysis Techniques

6.4 Error Detection Based on Dynamic Program Analysis

6.5 Processor‐Level Selective Replication

6.6 Runtime Checking for Residual Software Bugs

6.7 Data Audit

6.8 Application of Data Audit Techniques

6.9 Insights

References

7 Measurement‐Based Analysis of System Software

7.1 Introduction

7.2 MVS (Multiple Virtual Storage)

7.3 Experimental Analysis of OS Dependability

7.4 Behavior of the Linux Operating System in the Presence of Errors

7.5 Evaluation of Process Pairs in Tandem GUARDIAN

7.6 Benchmarking Multiple Operating Systems: A Case Study Using Linux on Pentium, Solaris on SPARC, and AIX on POWER

7.7 Dependability Overview of the Cisco Nexus Operating System

7.8 Evaluating Operating Systems: Related Studies

7.9 Insights

References

8 Reliable Networked and Distributed Systems

8.1 Introduction

8.2 System Model

8.3 Failure Models

8.4 Agreement Protocols

8.5 Reliable Broadcast

8.6 Reliable Group Communication

8.7 Replication

8.8 Replication of Multithreaded Applications

8.9 Atomic Commit

8.10 Opportunities and Challenges in Resource‐Disaggregated Cloud Data Centers

References

9 Checkpointing and Rollback Error Recovery

9.1 Introduction

9.2 Hardware‐Implemented Cache‐Based Checkpointing Schemes

9.3 Memory‐Based Schemes

9.4 Operating‐System‐Level Checkpointing

9.5 Compiler‐Assisted Checkpointing

9.6 Error Detection and Recovery in Distributed Systems

9.7 Checkpointing Latency Modeling

9.8 Checkpointing in Main Memory Database Systems (MMDB)

9.9 Checkpointing in Distributed Database Systems

9.10 Multithreaded Checkpointing

References

10 Checkpointing Large‐Scale Systems

10.1 Introduction

10.2 Checkpointing Techniques

10.3 Checkpointing in Selected Existing Systems

10.4 Modeling Coordinated Checkpointing for Large‐Scale Supercomputers

10.5 Checkpointing in Large‐Scale Systems: A Simulation Study

10.6 Cooperative Checkpointing

References

11 Internals of Fault Injection Techniques

11.1 Introduction

11.2 Historical View of Software Fault Injection

11.3 Fault Model Attributes

11.4 Compile‐Time Fault Injection

11.5 Runtime Fault Injection

11.6 Simulation‐Based Fault Injection

11.7 Dependability Benchmark Attributes

11.8 Architecture of a Fault Injection Environment: NFTAPE Fault/Error Injection Framework Configured to Evaluate Linux OS

11.9 ML‐Based Fault Injection: Evaluating Modern Autonomous Vehicles

11.10 Insights and Concluding Remarks

References 

12 Measurement‐Based Analysis of Large‐Scale Clusters

12.1 Introduction

12.2 Related Research

12.3 Steps in Field Failure Data Analysis

12.4 Failure Event Monitoring and Logging

12.5 Data Processing

12.6 Data Analysis

12.7 Estimation of Empirical Distributions

12.8 Dependency Analysis

References

13 Measurement‐Based Analysis of Large Systems

13.1 Introduction

13.2 Case Study I: Failure Characterization of a Production Software‐as‐a‐Service Cloud Platform

13.3 Case Study II: Analysis of Blue Waters System Failures

13.4 Case Study III: Autonomous Vehicles: Analysis of Human‐Generated Data

References

14 The Future: Dependable and Trustworthy AI Systems

14.1 Introduction

14.2 Building Trustworthy AI Systems

14.3 Offline Identification of Deficiencies

14.4 Online Detection and Mitigation

14.5 Trust Model Formulation

14.6 Modeling the Trustworthiness of Critical Applications

14.7 Conclusion: How Can We Make AI Systems Trustworthy?

References 

Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1 Error classes for Tandem GUARDIAN90.

Table 1.2 Error classes for IBM‐MVS and IBM database management systems.

Table 1.3 MTTF, MTTR, and availability data for various real‐world systems....

Table 1.4 Fault sources, levels of integration, users, and user sophisticat...

Chapter 3

Table 3.1 Comparison of parity codes.

Table 3.2 Berger code’s capabilities in detecting unidirectional errors.

Table 3.3 Error detection efficiency with respect to location and number of...

Chapter 4

Table 4.1 Summary of some of the architectural techniques for reliability....

Chapter 5

Table 5.1 Survey of representative control‐flow error detection techniques....

Table 5.2 Transient error models.

Table 5.3 Error detection coverage for espresso without ECCA.

Table 5.4 Error detection coverage for espresso with ECCA.

Table 5.5 Cumulative results from directed injection to control‐flow instru...

Table 5.6 Cumulative results from injection to random instructions from the...

Table 5.7 Performance measurements for DHCP instrumented with PECOS.

Chapter 6

Table 6.1 Classification of software‐based detection techniques.

Table 6.2 Benchmark programs and characteristics.

Table 6.3 Coverage with five critical variables per function.

Table 6.4 Generic rule classes.

Table 6.5 Probability values for computing tightness.

Table 6.6 Probability values for computing tightness of detector “Bounded‐R...

Table 6.7 Benchmarks and their descriptions.

Table 6.8 Types of errors detected by simulator and their consequences.

Table 6.9 Range of detection coverage for 100 detectors.

Table 6.10 Examples of database API functions.

Table 6.11 Breakdown of inserted and detected errors.

Chapter 7

Table 7.1 Summary of the data (period of study: March 1982–May 1983) [5].

Table 7.2 Distribution of error categories.

Table 7.3 Effect of recovery routines [5].

Table 7.4 Specification of recovery routines for HW/SW errors [5].

Table 7.5 Experiment setup summary [16] / with permission of IEEE.

Table 7.6 Outcome categories [16] / with permission of IEEE.

Table 7.7 Crash cause categories: Pentium (P4) [16] / with permission of IE...

Table 7.8 Crash cause categories: PPC (G4) [16] / with permission of IEEE....

Table 7.9 Statistics on error activation and failure distribution on P4 pro...

Table 7.10 Statistics on error activation and failure distribution on G4 pr...

Table 7.11 Summary of most severe crashes [22] / with permission of IEEE.

Table 7.12 Severity of software failures [29] / with permission of IEEE.

Table 7.13 Reasons for multiple‐processor halts [29] / with permission of I...

Table 7.14 Reasons for software fault tolerance [29] / with permission of I...

Table 7.15 Severity of software failures by failure type [29] / with permis...

Table 7.16 Estimated loss of service.

Table 7.17 Target systems [36] / with permission of IEEE.

Table 7.18 New crash cause categories [36] / with permission of IEEE.

Table 7.19 Crash cause mapping [36] / with permission of IEEE.

Table 7.20 Injection outcome summary [36] / with permission of IEEE.

Chapter 8

Table 8.1 Replica behavior under a faulty leader.

Table 8.2 Replica behavior under a faulty follower.

Table 8.3 Error models used in injection experiments.

Table 8.4 Outcome categories.

Table 8.5 Fault injection results.

Chapter 9

Table 9.1 Comparison of forward and backward error recovery.

Table 9.2 RMK pins in the RMK prototype [15] / with permission of IEEE.

Table 9.3 Comparison of the incremental and delta checkpointing algorithms ...

Chapter 10

Table 10.1 Classification of checkpointing approaches in existing systems....

Table 10.2 Checkpoint characterization on Blue Gene.

Table 10.3 Parameters for modeling generic correlated failures [17] / with ...

Table 10.4 Different configurations for a notional PF/s system [7] / with p...

Table 10.5 Effect of parameter variation on useful work factor/fraction (U)...

Chapter 11

Table 11.1 Characteristics of software fault injection techniques.

Table 11.2 Most frequently used kernel functions when UNIXBench is run.

Table 11.3 Function distribution among kernel modules.

Table 11.4 Instrumented kernel crashes in IA‐32.

Table 11.5 Examples of SLI‐supported ADS module outputs.

Table 11.6 Fault injection experiments.

Table 11.7 Summary of PGM‐based fault injection.

Chapter 12

Table 12.1 Classification of related research.

Table 12.2 Levels of severity for Syslog.

Table 12.3 Snippets of logs collected from Cray XE‐6/XK‐7 Blue Waters at NC...

Table 12.4 Log types provided by IBM Z/OS.

Table 12.5 Example of IBM Syslog Logs.

Table 12.6 Example IBM system dump message.

Table 12.7 Fields in EVTX Windows Event Logs.

Table 12.8 Example failure report from LANL [43].

Table 12.9 Example output of coalescence analysis [168].

Table 12.10 Error/failure statistics for a VAXcluster [29].

Table 12.11 MTBE, MTTER, MTBF, and MTTFR by type for VAXcluster [29].

Table 12.12 MTBF (days) for individual machines [29].

Table 12.13 Overview of systems. Systems 1–18 were SMP‐based, and systems 1...

Table 12.14 Detailed root‐cause breakdown of LANL data [43].

Table 12.15 Failure frequency distribution with respect to failure types an...

Table 12.16 Network‐related and machine‐related problems in a LAN of Window...

Table 12.17 Breakdown of reboots based on prominent events [102].

Table 12.18 Machine up‐time statistics [102].

Table 12.19 Machine down‐time statistics [102].

Table 12.20 Machine availability [102].

Table 12.21 Average correlation coefficients for VAXcluster errors [60].

Table 12.22 Sensitivity study of the unavailability computed for systems wi...

Table 12.23 Effect of correlation on unavailability for systems with two to...

Table 12.24 Hardware‐related software errors/failures [76].

Chapter 13

Table 13.1 Summary of the considered logs [1] / with permission from Associ...

Table 13.2 Summary of the coalescence process [1] / with permission from Ass...

Table 13.3 Example of a complex failure as reported after the coalescence....

Table 13.4 Size (% of total LOC count) and failure density (number of failu...

Table 13.5 Breakdown of synthetic statistics per month [1] / with permissio...

Table 13.6 Breakdown of the workload and failures by day of the week [1] / ...

Table 13.7 Example Blue Waters incident report.

Table 13.8 Failure statistics.

Table 13.9 Breakdown of the counts of the top three hardware and software f...

Table 13.10 Breakdown of the machine‐check errors [2] / with permission fro...

Table 13.11 Breakdown of the count of memory errors [2] / with permission f...

Table 13.12 Breakdown of the uncorrectable memory errors (UE) [2] / with pe...

Table 13.13 AFR, MTBF, and FIT for the top five hardware root causes [2] / ...

Table 13.14 Systemwide outage statistics [2] / with permission from IEEE.

Table 13.15 Parameters for the estimated systemwide outage TBF distribution...

Table 13.16 Summary of fleet size, autonomous miles driven, and failure inc...

Table 13.17 Sample of disengagement reports from the CA DMV datasets [3] / ...

Table 13.18 Definition of fault tags and categories that are assigned to di...

Table 13.19 Disengagements across manufacturers (as percentages) categorize...

Table 13.20 Distribution of disengagements across manufacturers (as percent...

Table 13.21 Summary of accidents reported by manufacturers [3] / with permis...

Table 13.22 Reliability of AVs compared to human drivers [3] / with permiss...

Table 13.23 Reliability of AVs compared to other safety‐critical autonomous...

Chapter 14

Table 14.1 Sample incidents involving trustworthy AI systems.

Table 14.2 Summary and analysis of work related to adversarial learning in ...

Table 14.3 Parameters.

Table 14.4 Parameters for extended formulation.

Table 14.5 Additional notation.

List of Illustrations

Chapter 1

Figure 1.1 A refined dependability and security tree.

Figure 1.2 Dependability measures in relation to the fault cycle.

Figure 1.3 Dependability requirements at various system levels.

Figure 1.4 Techniques available at each system level.

Chapter 2

Figure 2.1 Architecture of the Tandem Integrity system.

Figure 2.2 Blue Waters blades: (a) compute (Cray XE6), (b) GPU (Cray XK7).

Figure 2.3 Cloud computing's layered architecture.

Figure 2.4 A typical scenario in smart power grids.

Figure 2.5 IT infrastructure of an example business application.

Figure 2.6 Embedded health monitoring and diagnostics.

Figure 2.7 Generic representation of an AI system/application.

Chapter 3

Figure 3.1 Standby redundancy.

Figure 3.2 Comparative reliability of TMR and simplex systems.

Figure 3.3 Interrupt synchronization in a TMR system.

Figure 3.4 Logic of error masking/detecting code.

Figure 3.5 Determining parity/check bits for Hamming code.

Figure 3.6 Error detection and correction using SEC‐DED code.

Figure 3.7 Ethernet frame format.

Figure 3.8 Example representations of circuits to generate cyclic code. (a) ...

Figure 3.9 Error scenario not detected by single‐precision checksum. An erro...

Figure 3.10 Reed‐Solomon codes.

Figure 3.11 IBM two‐level interleaved coding scheme [52].

Figure 3.12 Mirroring RAID.

Figure 3.13 Hamming‐coded RAID.

Figure 3.14 Bit‐interleaved parity.

Figure 3.15 P + Q redundancy.

Figure 3.16 Array controller architecture, interfaces, and data flow.

Chapter 4

Figure 4.1 SEU‐hardened latches. (a) D‐latch, (b) RS‐latch.

Figure 4.2 SEU effects in logic: three‐input NAND and NOR gates. (a) Combina...

Figure 4.3 Typical topologies of latches and domino cells used in high‐perfo...

Figure 4.4 Switched capacitor technique.

Figure 4.5 Stack node capacitor technique.

Figure 4.6 DICE memory cell's principle of operation.

Figure 4.7 Standard path‐exclusive latch.

Figure 4.8 SER‐tolerant path‐exclusive latch.

Figure 4.9 Pipeline stage augmented with Razor latches and control lines.

Figure 4.10 SER‐tolerant scan flip‐flop using C‐element.

Figure 4.11 NonStop system architecture.

Figure 4.12 Processor pipeline with extensions for SRTR.

Figure 4.13 Generation of prediction stream by the core processor.

Figure 4.14 Checker processor performs the check on the predication stream....

Figure 4.15 Microarchitecture support for the backlog buffer.

Figure 4.16 Classification of design defects in each processor.

Chapter 5

Figure 5.1 Original code segment.

Figure 5.2 Duplicated code segment [1]. © IEEE.

Figure 5.3 Duplicated code with performance optimizations.

Figure 5.4 Original code segment.

Figure 5.5 EDDI code segment after performance optimizations.

Figure 5.6 Change in the code structure due to insertion of PECOS assertion ...

Figure 5.7 Assembly code for branch assertion block. (i) Application assembl...

Figure 5.8 Protocol for adaptive heartbeat.

Figure 5.9 Adaptive heartbeat with load generator.

Figure 5.10 Effectiveness of smart heartbeat.

Chapter 6

Figure 6.1 The N‐version software model.

Figure 6.2 N‐self‐checking programming.

Figure 6.3 Airbus computer architecture.

Figure 6.4 Boeing computer architecture.

Figure 6.5 Diagram of recovery block scheme.

Figure 6.6 Diagram of a conversation scheme.

Figure 6.7 Example code fragment to illustrate feasible path problem faced b...

Figure 6.8 Code sample and property FSM.

Figure 6.9 Example showing the limitation of property simulation.

Figure 6.10 A function‐pair rule in PostgreSQL.

Figure 6.11 A complex programming rule in PostgreSQL.

Figure 6.12 A programming rule for variable correlation in Linux.

Figure 6.13 Flowchart of PR‐Miner.

Figure 6.14 Interface definition for a stack class.

Figure 6.15 Invariants derived by DAIKON for the stack program.

Figure 6.16 Buggy implementation of the push function.

Figure 6.17 Example code fragment with detectors inserted.

Figure 6.18 Example of a memory corruption error.

Figure 6.19 Example for race condition detection.

Figure 6.20 Performance overhead (five critical variables are chosen per fun...

Figure 6.21 Example code fragment.

Figure 6.22 Crash coverage of derived detectors.

Figure 6.24 Hang coverage of derived detectors [11].

Figure 6.25 Total error coverage for derived detectors [11].

Figure 6.26 Percentage of false positives for 1000 inputs.

Figure 6.27 Modifications to pipeline for selective replication.

Figure 6.28 Mechanism for register renaming of multiple instructions.

Figure 6.29 Target system with embedded audit and control flow checking.

Figure 6.30 Call‐processing phases emulated in the client program.

Chapter 7

Figure 7.1 Software handling of software errors on MVS

Figure 7.2 Error injection environment [16] / with permission of IEEE.

Figure 7.3 Automated process of injecting errors [16] / with permission of I...

Figure 7.4 Definition of cycles‐to‐crash [16] / with permission ...

Figure 7.5 Overall distribution of crash causes (Known Crash category) on P4...

Figure 7.6 Overall distribution of crash causes (Known Crash category) on G4...

Figure 7.7 Crash causes for kernel stack injection [16] / with permission of...

Figure 7.8 Crash causes for system register injection [16] / with permission...

Figure 7.9 Crash causes for code injection [16] / with permission of IEEE.

Figure 7.10 Crash causes for kernel data injection [16] / with permission of...

Figure 7.11 Distribution of cycles‐to‐crash [16] / with permissi...

Figure 7.12 Shadow processor in Tandem.

Figure 7.13 Differences between the primary and backup executions [29] / wit...

Figure 7.14 Faults exposed by non‐process pairs [29] / with permission of IE...

Figure 7.15 Measurement‐based Markov model.

Figure 7.16 Experimental setup [36] / with permission of IEEE.

Figure 7.17 Linux text injection crash data [36] / with permission of IEEE....

Figure 7.19 AIX text injection crash data [36] / with permission of IEEE.

Figure 7.20 Linux stack injection crash data [36] / with permission of IEEE....

Figure 7.22 AIX stack injection crash data [36] / with permission of IEEE.

Figure 7.23 Linux system register crash rate [36] / with permission of IEEE....

Figure 7.24 Solaris system register crash rate [36] / with permission of IEE...

Figure 7.25 AIX system register crash rate [36] / with permission of IEEE.

Figure 7.26 Hierarchical reliability in Cisco Nexus switches for production‐...

Chapter 8

Figure 8.1 Process failure models.

Figure 8.2 P0 is correct.

Figure 8.3 P0 is faulty.

Figure 8.4 Sample execution of OM(1), Step 1. P2 is faulty.

Figure 8.5 Sample execution of OM(1), Step 2. P2 is faulty.

Figure 8.6 Sample execution of OM(1). P0 is faulty.

Figure 8.7 Interactive consistency example.

Figure 8.8 Layering of application/broadcast communication primitives.

Figure 8.9 Reliable broadcast by message diffusion.

Figure 8.10 FIFO broadcast using reliable broadcast.

Figure 8.11 Causal broadcast using FIFO broadcast.

Figure 8.12 Group communication system.

Figure 8.13 Active replication architecture.

Figure 8.14 LSA sample execution.

Figure 8.15 Stages in PDS‐1 first sample execution (a), (b), (c), and (d)....

Figure 8.16 Stages in PDS‐1 second sample execution (a), (b), (c), and (d)....

Figure 8.17 Improving concurrency of PDS algorithm. (a) PDS‐1 and (b) PDS‐2....

Figure 8.18 Replication framework.

Figure 8.19 Evaluation results for triplicated server (a), (b), (c), and (d)...

Figure 8.20 Closing the gap between multi‐core memory bandwidth and network ...

Figure 8.21 Overview of a resource‐disaggregated cloud data center.

Figure 8.22 Overview of INDIGO’s workflow.

Chapter 9

Figure 9.1 Active and checkpoint states in processor‐based checkpointing and...

Figure 9.2 Dual memory banks in the Sequoia multiprocessor.

Figure 9.3 Case I: First reference after the checkpoint.

Figure 9.4 Case II: Page previously referenced.

Figure 9.5 Example of basic twist structure.

Figure 9.6 Example of a modified page.

Figure 9.7 Effect of Flashback primitives on process execution state.

Figure 9.8 Reliability MicroKernel architecture [15] / with permission of IE...

Figure 9.9 User/kernel state of a process in Linux [15] / with permission of...

Figure 9.10 Variation in checkpoint size for LUDCMP [16] / with permission o...

Figure 9.11 Adaptive checkpointing [16] / with permission of IEEE.

Figure 9.12 Domino effect in recovering cooperating processes [18].

Figure 9.13 Strongly consistent set (left) and consistent set (right) of che...

Figure 9.14 Operation of the message logging in the absence of transmission ...

Figure 9.15 Snapshot of the logs table.

Figure 9.16 Message sequence showing the addition of nodes to the network.

Figure 9.17 Checkpointing message sequence.

Figure 9.18 Messages in identifying and removing a crashed node.

Figure 9.19 FailingNode sending a recovery alert.

Figure 9.20 Message retransmissions to FailingNode.

Figure 9.21 Recovery completion.

Figure 9.22 A variety of state changes in the server that could occur becaus...

Figure 9.23 Sequential checkpointing [30] / with permission of IEEE.

Figure 9.24 Forked checkpointing (a) parent and child processes (b) interlea...

Figure 9.25 Markov chain to evaluate Γ [30].

Figure 9.26 Data structures in the Dali main‐memory DBMS [33] / with permiss...

Figure 9.27 Example control structures (SysDB) [34] / with permission of IEE...

Figure 9.28 Basic checkpointing architecture [34] / with permission of IEEE....

Figure 9.29 Control structure images [34] / with permission of IEEE.

Figure 9.30 Structure of the image keeper [34] / with permission of IEEE.

Chapter 10

Figure 10.1 Use of message acknowledgments to solve the consistent checkpoin...

Figure 10.2 Staggering approach to alleviate the I/O bottleneck while checkp...

Figure 10.3 Main steps in recovering from various types of failures.

Figure 10.4 Pictorial view of the split‐merge algorithm (a) memory divided i...

Figure 10.5 The overall composition of the model [17] / with permission of I...

Figure 10.6 Submodels for computing and checkpointing [17] / with permission...

Figure 10.7 Birth‐death Markov process of correlated failures [17] / with pe...

Figure 10.8 State transition diagram for the supercomputing system.

Chapter 11

Figure 11.1 Categories of software fault injection techniques.

Figure 11.2 Injecting faults into combinational circuits by adding injection...

Figure 11.3 Error injection environment.

Figure 11.4 Automated process of injecting errors.

Figure 11.5 Distribution of profiled functions.

Figure 11.6 Control host hierarchy.

Figure 11.7 Kernel injector components.

Figure 11.8 Sample of setting a breakpoint in IA‐32.

Figure 11.9 Starting performance registers to count cycles‐to‐crash.

Figure 11.10 Locating the target process stack.

Figure 11.11 Component interactions for kernel crash.

Figure 11.12 Component collaborations for unmanifested injection.

Figure 11.13 Component interactions for an inactivated injection.

Figure 11.14 A high‐level overview of an AV’s autonomous and mechanical syst...

Figure 11.15 Definition of d_stop, d_safe, and δ for lateral and longitud...

Figure 11.16 Bayesian FI.

Figure 11.17 Illustration of Tesla Autopilot problem similar to a Bayesian f...

Figure 11.18 Orientation of the EV when in motion.

Figure 11.19 3‐Temporal Bayesian network modeling the ADS.

Figure 11.20 ADS architecture.

Figure 11.21 BN MLE inference is executed offline for every simulated time p...

Figure 11.22 Driving scenarios supported by simulation engine.

Figure 11.23 DriveFI architecture.

Figure 11.24 Fault/error impact characterization using FI campaigns. (a) and...

Figure 11.25 Impact of 30 continuous faults on ζ in DriveAV. (a) ζ...

Chapter 12

Figure 12.1 Conceptual framework and steps for field failure data analysis (...

Figure 12.2 Example Blue Gene/P RAS log [159].

Figure 12.3 Example of details of the RAS message in Figure 12.2 [159].

Figure 12.4 Snapshot of the Microsoft EVTX Event Viewer from a Windows 8 mac...

Figure 12.5 Workflow for the processing of failure data.

Figure 12.6 Data processing workflow for analyzing safety‐critical computer‐...

Figure 12.7 Multiple event reporting phenomenon.

Figure 12.8 Examples of temporal data coalescence heuristics.

Figure 12.9 Effect of the value of the coalescence time windows on the tuple...

Figure 12.10 Example of wrong grouping: truncations (a) and collisions (c) w...

Figure 12.11 Number of failures for each midplane after the spatial coalesce...

Figure 12.12 Unavailability distribution [102]/with permission of Springer N...

Figure 12.13 Distribution of Andrew file system disk errors. Data taken from...

Figure 12.14 Analytic software TTE (or TTH) distributions. (a) IBM MVS softw...

Figure 12.15 Error hazard for (a) Europa, (b) VAXcluster [29]/with permissio...

Figure 12.16 Failure hazard for (a) Europa, (b) VAXcluster [29]/with permiss...

Figure 12.17 Example of the TBFs and the hazard rate for a Software‐as‐a‐Ser...

Figure 12.18 Hazard plots for three selected load parameters in the IBM 370 ...

Chapter 13

Figure 13.1 Distribution of the (a) tuple type after the coalescence process...

Figure 13.2 TBF density function distribution for (a) all failure types, (b)...

Figure 13.3 Distribution of the number of platform failure entries per day: ...

Figure 13.4 Laplace trend test [1] / with permission from Association for Co...

Figure 13.5 Workload: (a) intensity (number of files per day); (b) volume (a...

Figure 13.6 Cumulative absolute failure rate obtained by dividing the daily ...

Figure 13.7 Failure rate vs. workload per day [1] / with permission from Ass...

Figure 13.8 The Hardware Supervisor System: Resiliency features.

Figure 13.9 (a) Breakdown of the failure categories, and (b) distribution of...

Figure 13.10 Breakdown of the Lustre failures [2] / with permission from IEE...

Figure 13.11 Arithmetic mean of the TBF for failures due to (a) hardware, (b...

Figure 13.12 Distribution fitting for systemwide outages' TBF: (a) PDF, (b) ...

Figure 13.13 Breakdown of (a) SWO root causes and (b) repair times [2] / wit...

Figure 13.14 The end‐to‐end data collection, processing, and analysis pipeli...

Figure 13.15 Accident scenarios [3] / with permission from IEEE.

Figure 13.16 Autonomous vehicle hierarchical control structure drawn based o...

Figure 13.17 Comparison of the distributions of DPM per car across manufactu...

Figure 13.18 Disengagements reported per cumulative miles driven across manu...

Figure 13.19 Categorization (in terms of fault tags) of faults that led to d...

Figure 13.20 Evolution of DPMs per car with cumulative miles driven (all man...

Figure 13.21 Linear statistical relationship between DPM per car and the cum...

Figure 13.22 Evolution of DPM (per car) with the number of cumulative autono...

Figure 13.23 Distribution of reaction times for drivers following disengagem...

Figure 13.24 Distribution of reaction times for: (a) Mercedes‐Benz and (b) W...

Figure 13.25 Distribution of vehicular speeds for all reported accidents: (a...

Chapter 14

Figure 14.1 Generic representation of an AI system/application.

Figure 14.2 Architecture of a generative adversarial network (GAN).

Figure 14.3 Autonomous vehicle, with an array of error detection recovery me...

Figure 14.4 A generic representation of an AV stack showing where challenges...

Figure 14.5 Generic representation of a compute infrastructure showing where...

Figure 14.6 Generic representation of a healthcare system, showing where cha...

Figure 14.7 Operation of an experimental surgical robot [96].

Figure 14.8 An example of a surgical robot with learning malware installed o...

Figure 14.9 System view of dependable computing, incorporating AI/ML challen...

Figure 14.10 How to achieve the objectives: interfacing classical solution t...



IEEE Press
445 Hoes Lane
Piscataway, NJ 08854

IEEE Press Editorial Board
Sarah Spurgeon, Editor in Chief

Moeness Amin

Ekram Hossain

Desineni Subbaram Naidu

Jón Atli Benediktsson

Brian Johnson

Tony Q. S. Quek

Adam Drobot

Hai Li

Behzad Razavi

James Duncan

James Lyke

Thomas Robertazzi

Joydeep Mitra

Diomidis Spinellis

IEEE/Wiley Partnership

The IEEE Computer Society and Wiley partnership allows the CS Press authored book program to produce a number of exciting new titles in areas of computer science, computing, and networking with a special focus on software engineering. IEEE Computer Society members receive a 35% discount on Wiley titles by using their member discount code. Please contact IEEE Press for details.

To submit questions about the program or send proposals, please contact Mary Hatcher, Editor, Wiley‐IEEE Press: Email: [email protected], John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030‐5774.

Dependable Computing

Design and Assessment

Ravishankar K. Iyer

Department of Electrical and Computer Engineering and Coordinated Science Laboratory, University of Illinois at Urbana‐Champaign, Urbana, Illinois, USA

Zbigniew T. Kalbarczyk

Department of Electrical and Computer Engineering and Coordinated Science Laboratory, University of Illinois at Urbana‐Champaign, Urbana, Illinois, USA

Nithin M. Nakka

Cisco Networking Engineering group, Cisco Systems, Inc., San Jose, California, USA

© 2024 John Wiley & Sons, Inc. Published 2024 by John Wiley & Sons, Inc.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging‐in‐Publication Data applied for:

Hardback ISBN: 9781118709443

Cover Design: Wiley
Cover Image: © Yuichiro Chino/Getty Images

This book is dedicated to my wife Pamela for her unwavering support and encouragement in my academic pursuits, from my PhD to the present day. – Ravishankar K. Iyer

This book is dedicated to my wife, whose unfailing support kept me moving forward to the completion of this work. – Zbigniew T. Kalbarczyk

This book is dedicated to my parents, whose sacrifices have shaped what I am. – Nithin M. Nakka

About the Authors

Professor Ravishankar K. Iyer is George and Ann Fisher Distinguished Professor of Engineering at the University of Illinois Urbana‐Champaign. He holds joint appointments in the Departments of Electrical and Computer Engineering (ECE) and Computer Science, the Coordinated Science Laboratory (CSL), the National Center for Supercomputing Applications (NCSA), the Carle Illinois College of Medicine, and the Carl R. Woese Institute for Genomic Biology. He is also a faculty Research Affiliate at the Mayo Clinic. Professor Iyer was the founding Chief Scientist of the Information Trust Institute at UIUC – a campus‐wide research center addressing security, reliability, and safety issues in critical infrastructures. He leads the DEPEND Group at CSL/ECE at Illinois, with a multidisciplinary focus on systems and software that combine deep measurement‐driven analytics and machine learning with applications in two important domains: (i) management and control of large infrastructures, including autonomous systems, spanning resilience, safety, security, and performance; and (ii) health and personalized medicine, spanning computational genomics and health analytics focused on neurological disorders, pharmacogenomics, and predicting cancer metastases. His group has developed a rich AI analytics framework that has been deployed in real‐world applications in collaboration with industry, health providers, and government agencies including the National Science Foundation, the National Institutes of Health, the Department of Energy, the Defense Advanced Research Projects Agency, and the Department of Defense.

Professor Iyer is a Fellow of the American Association for the Advancement of Science (AAAS), the Institute of Electrical and Electronics Engineers (IEEE), and the Association for Computing Machinery (ACM). He has received several awards, including the Jean‐Claude Laprie Award, the IEEE Emanuel R. Piore Award, and the 2011 Outstanding Contributions Award from the Association for Computing Machinery. Professor Iyer is also the recipient of the degree of Doctor Honoris Causa from Toulouse Sabatier University in France.

Dr. Zbigniew T. Kalbarczyk is Research Professor in the Department of Electrical and Computer Engineering and the Coordinated Science Laboratory at the University of Illinois Urbana‐Champaign (UIUC). Dr. Kalbarczyk’s research interests are in the design and validation of reliable and secure computing systems. His current work explores: (i) emerging computing technologies, such as resource virtualization to provide redundancy and assure system resiliency to accidental errors and malicious attacks; (ii) machine learning‐based methods for early detection of security attacks, including defense against smart malware; (iii) analysis of data on failures and security attacks in large computing systems to characterize system resiliency and guide development of methods for rapid diagnosis and runtime detection of problems; and (iv) development of techniques for automated validation and benchmarking of dependable and secure computing systems using formal (e.g., model checking) and experimental (e.g., fault/attack injection) methods. Dr. Kalbarczyk led the design and commercialization of (i) the ARMOR high‐availability software middleware to support resilient distributed applications and (ii) the NFTAPE software framework to support fault injection‐based resiliency assessment. He served as Program Chair of the Dependable Computing and Communication Symposium (DCCS), a track of the International Conference on Dependable Systems and Networks (DSN) 2007, and as Program Co‐Chair of the Computer Performance and Dependability Symposium, a track of DSN 2002. He was Associate Editor of IEEE Transactions on Dependable and Secure Computing. He has published over 230 technical papers and is regularly invited to give tutorials and lectures on issues related to the design and assessment of complex computing systems. He is a member of the IEEE, the IEEE Computer Society, and IFIP Working Group 10.4 on Dependable Computing and Fault Tolerance.

Dr. Nithin M. Nakka received his BTech (hons.) degree from the Indian Institute of Technology, Kharagpur, and his MS and PhD degrees from the University of Illinois Urbana‐Champaign (UIUC). Currently, he is a Technical Leader at Cisco Systems. While at Cisco, he has worked on most layers of the networking software stack, from network data‐plane hardware and the control plane at Layer‐2 and Layer‐3 to network controllers, up to and including network fabric monitoring, alongside doing what he enjoys: sharing his expertise by mentoring many incoming employees. He has been leading the development of solutions for network day‐2 operations to monitor fabrics; to analyze, understand, and troubleshoot network issues; and possibly to predict impending network failures. His innovative approximation heuristics in algorithms for memory array redundancy analysis during his work at Nextest Systems Corporation brought the company to world‐class excellence in this niche computational problem domain, generally known to be NP‐complete. He also worked for Motorola’s mobile devices group on pioneering efforts in Bluetooth stereo audio transmission (A2DP) and Bluetooth security. His areas of research interest include systems reliability, network telemetry, and hardware‐implemented fault tolerance. Dr. Nakka has previously held research faculty positions at UIUC and at Northwestern University in Evanston, Illinois, and contributed to the area of dependability in high‐performance computing systems.

Preface

Dependability of systems has transitioned over the years from a feature to a necessity for end users, and from an add‐on to a core design principle for those who are designing and implementing computing or computer‐based systems. The need for dependability has grown not just in breadth, in terms of the areas where it is applicable, but also in depth: in any one of the many systems where dependability techniques are applied, their relevance is seen in every layer of the system stack. The aim of this book is to help readers navigate the evolution of dependability, from taxonomy, mathematical concepts, and fundamental theory to design, implementation, validation, deployment, measurement, and monitoring. Finally, the book brings its audience up to the current state of the field by looking at critical societal applications such as autonomous vehicles, large‐scale clouds, and engineering solutions for healthcare, illustrating the emerging challenges faced in making artificial intelligence (AI) and its applications dependable and trustworthy.

Sections of the book are dense and deeply technical. However, with the support of practical case studies and use cases from both academia and real‐world deployments, we have attempted to guide our audience through their journey in fathoming the developments in this ever‐growing field. For a beginner, a systematic study from the beginning will help in building strong foundations, but we encourage all readers to whet their appetite with any of the case studies that spark their interest. For seasoned designers and academicians in the area, we attempt to provide a near‐current reference for dependability research and development.

The prerequisites for the content of this book are a basic understanding of statistical concepts, computer systems and organization, and, preferably, a course on distributed systems. Above all, a keen interest in delving into this exciting field to unravel and possibly discover new techniques will maintain a reader’s enthusiasm, as it has done ours over the past years. Certainly, well‐written texts are already available in this area. However, we felt that the field lacked a single compendium spanning the myriad areas in which dependability has been applied, providing theoretical concepts and applied knowledge with content that would excite a beginner yet rigor that would satisfy an expert. That feeling led us to embark on the long journey of bringing forth this book.

Chapters 1 and 2 describe dependability taxonomy and briefly compare and contrast classical techniques with their modern counterparts or extensions. Chapters 3–7 help the reader walk up the system stack, from the hardware logic via operating systems up to software applications, with respect to how those layers are hardened for dependability. Chapters 8–12 expand into the domain of distributed systems to explore the techniques and applications therein. Those chapters also delve deeply into a measurement‐based understanding of the systems being studied, an aspect to which the authors feel honored to have had the opportunity to contribute significantly. Chapter 13 focuses on the most recent and upcoming trends that are shaping developments in dependability. Finally, looking into the future, Chapter 14 delves deeper into the novel challenges that are being faced in making AI systems dependable and trustworthy.

In summary, with the support of practical case studies and use cases from both academia and real‐world deployments, we guide our audience through a journey of developments, including the impact of AI and machine learning on this ever‐growing field.

Acknowledgments

In writing this book, we were inspired by Professor Dan Siewiorek’s groundbreaking research and the unmatched book The Theory and Practice of Reliable System Design by Siewiorek and Swarz, now in its third edition, as well as the foundational work of Professors Ed McCluskey and Al Avižienis, which continues to impact the field today.

We are indebted to many of our current and former students, postdoctoral associates, and academic and industry colleagues whose research contributed in important ways to material in this book, including Karthik Pattabiraman, Lelio DiMartino, Bob Horst, Saurabh Bagchi, Homa Alemzadeh, Long Wang, Tim Tsai, Saurabh Jha, Phuong Cao, Keywhan Chung, Shengkun Cui, and Archit Patke. Some of our colleagues have also adopted a draft version of this book in teaching dependability courses to graduate and senior students at their respective institutions, which has bolstered our confidence in the usefulness of this content. The administrative and technical proofreading staff members, including Carol Bosley, Heidi Leerkamp, Jenny Applequist, and Kathleen Atchley, have contributed immensely to this effort through their critical linguistic polish of this technical content and through their logistical work in keeping the authors and publishers in synchrony to accomplish this massive task. We are grateful to all of them, as well as to many others who shared their insights.

Special thanks to our colleagues at the University of Illinois Urbana‐Champaign who provided a rich, supportive environment that allowed us to pursue this project.

The research presented in this book was supported by numerous funding agencies and industry partners, including NSF, NIH, NASA, DoD, DARPA, DOE, IBM, Sandia National Lab, Nvidia, the Mayo Clinic, Infosys, and Xilinx.

Apart from the immense technical support we have received, we are very grateful to our families, who have been ever so patient in supporting us. They have transformed their “Are we there yet?” into “Looks like we are getting close” to keep our enthusiasm alive on the emotional front while we gave all we could to tame this mammoth. In spite of all the support that we have received both professionally and personally, and despite our over 100 years of combined experience in this area, we feel that our attempts to gather all that we could in this ever‐expanding and interesting field may have fallen short in some application domains, not done justice to others, or even at times made unintentional errors in comprehending and explaining the content. A significant portion of our time was spent in making sure that we kept the content current and relevant for our audience. However, as the field is growing at the rate that it is, we had to reconcile ourselves to the hope that we may offer more in a future edition! We invite readers to send us their feedback on the content or any errors that may have escaped our scrupulous efforts to maintain relevance and correctness.

About the Companion Website

This book is accompanied by a companion website:

www.wiley.com/go/iyer/dependablecomputing1 

The website includes PDFs of slides describing material in the book; problem sets and selected solutions, organized by chapter; as well as in‐class and semester‐long projects that students can undertake.

1 Dependability Concepts and Taxonomy

1.1 Introduction

Every single failure in any computing device is a potential cause for concern. Reliable computing and fault tolerance, or, to use a more current term, dependable computing, is a longstanding area of research and practical implementation. This broad area of study started in the mid‐fifties with John von Neumann's work on constructing reliable systems from unreliable components. Over the years, significant advancements and deployments have been made in commercial telecommunications, defense, and business applications that address a wide range of potential failures. Today, an explosion in the complexity of systems, applications, and operating systems has resulted in ever‐expanding failure sources. That, combined with explosive growth in computing as an enterprise in all areas of human endeavor, has brought forth new challenges and opportunities in designing dependable systems. Further, early detection, rapid concurrent/online diagnosis, and efficient and complete recovery are key to the design of systems that continue to operate in the event of errors. They must be complemented by ongoing analysis and monitoring of failures, supported by strong statistical models. In dependability, an understanding of real failures is critical to the design, implementation, deployment, and validation of reliability techniques. Design and validation must go hand in hand in developing new systems. While dependability techniques protect systems against known faults, their greatest efficacy comes from their ability to safeguard against unanticipated failures due to accidental errors or malicious attacks.

This chapter sets the theme of the book by first placing classic work on dependability techniques in perspective and relating their importance for current computing systems. That assessment is followed by a description of the complexity of systems built using present‐day hardware designs, architectures, and software technologies, which pose compelling challenges in providing continuous availability against a vast array of potential failures. Examples are provided of the evolving trends in these areas that motivate the need for a fresh perspective on dependability. The purpose of this chapter is to bring forth the recent challenges and opportunities in the reliability domain. (Possible solutions and techniques for fault tolerance and security will be explained as the book unfolds in the remaining chapters.) The discussion concludes with an introduction of dependability concepts, definitions, a taxonomy of failures, and a sample set of measurements from real systems, in preparation for the next chapter's description of basic techniques.

The entire book follows the theme set by this chapter: it introduces the fundamentals of each technique along with examples of the technique's prior deployment in systems currently in use, with the goal of educating the reader on its applicability and on any modifications or adaptations needed for modern and upcoming systems.

1.2 Placing Classical Dependability Techniques in Perspective

The earliest diagnostic techniques were developed for testing and failure recovery in the ILLIAC machine at the University of Illinois [1, 2] in the 1950s. When ILLIAC I (1950) and ILLIAC II (1961) were built at Illinois, fault diagnosis consisted of a battery of programs that exercised different sections of the machine. Typically, the test programs compared answers computed in two different ways, or stressed what was suspected to be a vulnerable part. In the ILLIAC II, the arithmetic and control units were designed to operate asynchronously, using a double handshake for each control signal and its acknowledgment. That protocol simplified the fault diagnosis, as it was used as an automatic fault detection mechanism. Most faults caused the control to wait for the next step in the asynchronous handshake protocol; that next step was identified using indicator lights for the flip‐flops.
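To make the flavor of those early diagnostics concrete, here is a minimal Python sketch (our illustration only; ILLIAC diagnostics were machine‐level test programs, and all function names below are ours) of the "compute the answer two different ways and compare" idea:

def sum_iterative(values):
    # First method: straightforward accumulation.
    total = 0
    for v in values:
        total += v
    return total

def sum_closed_form(n):
    # Second, independent method: closed-form sum of 1..n.
    return n * (n + 1) // 2

def self_test(n=1000):
    # A disagreement between the two methods suggests a fault in the
    # machinery exercised by one of them.
    a = sum_iterative(range(1, n + 1))
    b = sum_closed_form(n)
    if a != b:
        raise RuntimeError(f"self-test failed: {a} != {b}")

self_test()  # passes silently on a fault-free machine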

Spaceborne computing systems were one of the earliest avenues for dependability design. Early work on dependability in space‐mission systems was performed on the JPL‐STAR (Jet Propulsion Laboratory Self‐Testing and Repair) computer (1971) [3] and on Voyager [4], leading to work on the Boeing 777 [5]. Although the craft carrying the JPL‐STAR computer never went into space, its development resulted in the design and implementation of a range of techniques that are considered standard today. The Voyager computer (launched in 1977) used block redundancy (a form of standby redundancy whereby redundancy is provided at the subsystem level, e.g., at the attitude control subsystem, rather than internally in each subsystem) for fault tolerance. Heartbeat‐based hardware‐ and software‐implemented techniques were used for error detection. For example, an error would be detected in hardware if a command for the primary (in the dual‐redundant configuration) arrived before the current command had been completely processed, and in software if the output unit in the primary remained unavailable for more than 14 seconds. Further developments in dependability in aviation were used in the design of the Boeing 777 fly‐by‐wire system, which used triple modular redundancy for all hardware resources, including the computing system, airplane electrical power, hydraulic power, and communication paths.
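Both detection mechanisms in the preceding paragraph can be sketched compactly. The hypothetical Python fragment below (the class, method, and parameter names are ours, not Voyager's or Boeing's) shows a heartbeat watchdog that flags an error when the monitored unit stays silent beyond a timeout, with the 14‐second threshold mentioned above as the default, together with the majority voter at the heart of triple modular redundancy:

import time

class HeartbeatWatchdog:
    # Flags an error if the monitored unit shows no activity within
    # the timeout window (14 s in the Voyager example above).
    def __init__(self, timeout_s=14.0):
        self.timeout_s = timeout_s
        self.last_beat = time.monotonic()

    def beat(self):
        # Called on every sign of life from the primary's output unit.
        self.last_beat = time.monotonic()

    def error_detected(self):
        # Polled by the monitor; True would trigger switchover to the backup.
        return (time.monotonic() - self.last_beat) > self.timeout_s

def tmr_vote(a, b, c):
    # Majority vote over three redundant channels: a single faulty
    # channel is outvoted, so its error is masked.
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: more than one channel disagrees")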

The basic techniques established for hardware redundancy, for software‐based fault and failure management (including exceptions and their handling in software), and for the use of error codes in memory, transmission, and disk systems have been the mainstay of practical and commercial systems such as the AT&T No. 5 ESS [6] and the IBM S/360 and S/370 [7]. These systems included a combination of hardware and software techniques and diagnostics that significantly advanced the theory and practice of dependable computing. The methods have since been augmented with computational algorithms and protocols to achieve consistency and reliable operation in distributed systems [8].

While parity, error‐correcting codes (ECC), and redundant arrays of independent disks (RAID) have been widely used in commodity systems, the use of massive redundancy in hardware and software has led to high overheads in performance, hardware components, and software development costs. For example, the IBM MVS operating system devotes 50% of its software code base to fault management [9], while the IBM G5 processor dedicates 35% of its processor silicon area to fault detection and tolerance hardware [10].
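As a toy illustration of the coding techniques named above, the Python sketch below (ours, for exposition; real memory and disk systems implement such checks in hardware and use stronger codes, e.g., SEC‐DED Hamming or Reed‐Solomon) shows how a single even‐parity bit detects any single‐bit error in a data word:

def parity_bit(word, width=8):
    # Even parity: the check bit makes the total count of 1-bits even.
    return bin(word & ((1 << width) - 1)).count("1") & 1

def parity_ok(word, stored_parity, width=8):
    # True means no error detected; flipping any odd number of bits
    # (in particular, any single bit) changes the parity and is caught.
    return parity_bit(word, width) == stored_parity

data = 0b10110010
p = parity_bit(data)
assert parity_ok(data, p)                  # fault-free read passes
assert not parity_ok(data ^ 0b1000, p)     # a single flipped bit is detected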