Patterns for Fault Tolerant Software

Robert Hanmer



Contents

Cover

Half Title page

Title page

Copyright page

Dedication

Preface

Who this Book is For

How to Use this Book

Patterns

A Pattern Language for Fault Tolerance

A Word about Examples

Acknowledgments

Pattern Origins and Earlier Versions

Introduction

An Imperfect World

Chapter 1: Introduction to Fault Tolerance

Fault -> Error -> Failure

Failure Perception [Lap91][Kop97]

Single Faults

Examples of How Vocabulary Makes a Difference

Coverage

Reliability

Availability

Dependability

Hardware Reliability

Reliability Engineering and Analysis

Performance

Chapter 2: Fault Tolerant Mindset

Fault Tolerant Mindset

Design Tradeoffs

Quality v. Fault Tolerance

Keep It Simple

Incremental Additions of Reliability

Defensive Programming Techniques

The Role of Verification

Fault Insertion Testing

Fault Tolerant Design Methodology

Chapter 3: Introduction to the Patterns

Shared Context for These Patterns

Terminology

Chapter 4: Architectural Patterns

1. Units of Mitigation

2. Correcting Audits

3. Redundancy

4. Recovery Blocks

5. Minimize Human Intervention

6. Maximize Human Participation

7. Maintenance Interface

8. Someone in Charge

9. Escalation

10. Fault Observer

11. Software Update

Chapter 5: Detection Patterns

12. Fault Correlation

13. Error Containment Barrier

14. Complete Parameter Checking

15. System Monitor

16. Heartbeat

17. Acknowledgement

18. Watchdog

19. Realistic Threshold

20. Existing Metrics

21. Voting

22. Routine Maintenance

23. Routine Exercises

24. Routine Audits

25. Checksum

26. Riding Over Transients

27. Leaky Bucket Counter

Chapter 6: Error Recovery Patterns

28. Quarantine

29. Concentrated Recovery

30. Error Handler

31. Restart

32. Rollback

33. Roll-Forward

34. Return to Reference Point

35. Limit Retries

36. Failover

37. Checkpoint

38. What to Save

39. Remote Storage

40. Individuals Decide Timing

41. Data Reset

Chapter 7: Error Mitigation Patterns

42. Overload Toolboxes

43. Deferrable Work

44. Reassess Overload Decision

45. Equitable Resource Allocation

46. Queue for Resources

47. Expansive Automatic Controls

48. Protective Automatic Controls

49. Shed Load

50. Final Handling

51. Share the Load

52. Shed Work at Periphery

53. Slow it Down

54. Finish Work in Progress

55. Fresh Work Before Stale

56. Marked Data

57. Error Correcting Code

Chapter 8: Fault Treatment Patterns

58. Let Sleeping Dogs Lie

59. Reintegration

60. Reproducible Error

61. Small Patches

62. Root Cause Analysis

63. Revise Procedure

Conclusion

A Pattern Language for Fault Tolerant Software

A Presence Server Example

Designing for Fault Tolerance

Software Structure

References and Bibliography

Appendices

Patterns for Fault Tolerant Software Thumbnails

External Pattern Thumbnails

Index

Pattern Index

Patterns for Fault Tolerant Software

Copyright © 2007 Alcatel-Lucent. All Rights Reserved.

Published by John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England Telephone (+44) 1243 779777

Email (for orders and customer service enquiries): [email protected]

Visit our Home Page on www.wiley.com

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to [email protected], or faxed to (+44) 1243 770620.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The Publisher is not associated with any product or vendor mentioned in this book.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Other Wiley Editorial Offices

John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA

Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA

Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany

John Wiley & Sons Australia Ltd, 42 McDougall Street, Milton, Queensland 4064, Australia

John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809

John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada, M9W 1L1

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Anniversary Logo Design: Richard J. Pacifico

Library of Congress Cataloging-in-Publication Data

Hanmer, Robert S.
    Patterns for fault tolerant systems / Robert S. Hanmer.
        p. cm.
    Includes bibliographical references and index.
    ISBN 978-0-470-31979-6 (cloth : alk. paper)
    1. Fault-tolerant computing. I. Title.
    QA76.9.F38H35 2007
    004.2–dc22
    2007029096

British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library

ISBN-13: 978-0-470-31979-6

For my best friends, Karen and Bud

Preface

Who this Book is For

This book is for both the novice at fault tolerant programming and the experienced practitioner. Both will find it useful because it explains the key tradeoffs involved in a number of fault tolerant programming and system design techniques.

Most other books on the subject of reliable or fault tolerant software are concerned with aspects of Software Quality or Software Reliability Engineering. Those works are about fault prevention or the modeling and analysis of systems to predict reliability. This book’s goal is to provide proven techniques, in the form of patterns, that can make programs less failure prone when executing.

With this book you will be able to understand key fault tolerance techniques and how to include them into your designs.

How to Use this Book

This book can be used in several ways. Beginners can read Chapters 1 and 2 to get an understanding of the principles and key processes involved in designing fault tolerant systems and then skim the remainder of the book.

Fault tolerance practitioners can use this book as a reference, referring to the patterns in Chapters 4 through 8 for refreshers on the key principles as they confront familiar problems. To gain an overview you can just read the chapter introductions and skim the patterns. Skimming Chapters 1 through 3 is useful to grasp the way some of the key words are used in this book. Turning to the Appendices first will provide a quick reference to the patterns with their intents all collected together in one place.

Patterns

Patterns are the medium used here to convey what you need to make your software tolerate errors and continue processing. Software patterns are an effective way to capture proven design information and to communicate this information to the reader. Capturing the reasons for a particular design allows the design to be reused, eliminating the need to reinvent it. Unlike other forms of documentation, a pattern explains to you the reasons why the solution is an appropriate choice.

Pattern History

Software patterns have been discussed since the 1990s. A number of luminaries in the Object Oriented (OO) design community started talking about the work of Christopher Alexander and his catalog of patterns for the built world, A Pattern Language [AIS+77]. In that book, Alexander and his colleagues describe 253 different techniques for making the physical world in which mankind lives more livable, and in particular more full of the Quality Without a Name (QWAN). In an accompanying book, The Timeless Way of Building [Ale79], Alexander describes QWAN and the nature of the pattern form.

The OO community started talking about patterns as a way to document and facilitate reuse of the techniques that they saw recurring across the OO design space. The most famous pattern book is Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, the so-called Gang of Four [GHJ+95]. Many other books have followed, covering topics ranging from business process analysis to real-time programming to programming language specific techniques.

This book adds to this literature the patterns of systems that keep working even though software fails.

What is a Pattern?

Problems that you face when designing don’t exist in a vacuum. There are some things that can be changed and some that cannot. Examples of things that cannot be changed within a given project are programming language, programming paradigm, hardware platform, interface specifications, memory limitations, real-time deadline requirements, and so on. These things that you can’t change will vary from one problem to another and from one project to another and are referred to as the context of the problem.

The best solution to a design problem must take into account the things that can't be changed and must also consider the various aspects over which you do have choice. These are the things that you can trade off against each other to achieve the best resolution. Examples of things that you might be able to change, and hence find the best tradeoff for, include memory consumption, processing time, subsystem boundaries, data design, and so on. These things that can be traded off against each other are called the forces, a reference to the forces, such as gravity, that act on the buildings that Alexander was designing. Continuing the analogy to building architecture, we talk about a good pattern resolving the forces. We haven't solved a problem; rather, we've balanced the forces to achieve the best resolution possible given the context and forces.

The best solution is the one that introduces the most QWAN into the design. This is something that you, the designer, need to decide. In one context a particular solution will present the best possible resolution of the forces, and in another, similar context the same solution might produce unfavorable results. For example, consider two Voice over IP 'phones'. One is a soft phone that runs inside a general purpose computer with ample memory and processor power. The other is a stand-alone desk telephone. The designers of the soft phone don't need to worry about memory or processor speed, because the user has already purchased more than enough for their other applications. The designers of the desk telephone must carefully consider all the tradeoffs, because the customer pays explicitly for all of these capabilities; those designers will be trying to keep costs to a minimum and hence won't have unlimited memory. Intelligence and good taste are required to determine if a particular resolution is right for your problem, context, and forces.

By now you’ve probably noticed the references to you. Patterns, and this book, are written to allow designers to resolve their problems. Patterns give the reader insight into the problem that they are facing that they might not have had in other ways. They aren’t meant to be used blindly or automatically. They require you, the reader, to understand and apply the principles to your problem. Since they reflect knowledge that is needed in the midst of your design problems, they make reference to you.

Reading a Pattern

There are many different forms, or styles, in which patterns are written. Some are very informal, such as those in Advanced C++ Programming Styles and Idioms [Cop92], and some use well-defined sections for the various parts of the patterns. Examples of the latter style can be found in the Pattern Oriented Software Architecture series, beginning with [BMR+96], and in the Gang of Four (or GOF) style found in [GHJ+95].

The patterns in this book are written in the Alexandrian style, the style used in A Pattern Language. This style does not have enumerated sections, but once you understand the flow of the patterns you will know where to look for the different kinds of information that are enumerated in the other pattern forms mentioned above.

Pattern names, when you encounter them in the text, will be in SMALL CAPS to set them off from the rest of the text. Pattern names will also be followed by a number in parentheses. For example CHECKPOINT (37), FAILOVER (36) and DEFERRABLE WORK (43) are all patterns. The number is a sequential reference number to help you find a pattern in this book. If a pattern name is followed by square brackets, e.g. FACADE [GHJ+95], it is a pattern that is not in this book, but a thumbnail explanation of it appears in the Appendices and the reference can be found in the References section.

A Pattern Language for Fault Tolerance

Individual patterns are very useful for resolving individual design problems. The true power of patterns comes when several patterns are combined. A pattern language is a collection of patterns that is designed to work together [HK04].

As discussed above, patterns resolve problems within a certain context, and after resolution they leave the system in a new context. In the new context there are new problems to be resolved. Patterns generally adhere to the model of small nuggets of information that work together rather than trying to ‘Resolve the World’ in one pattern, so the solution of one pattern flows into the problem of the next.

The patterns in this book are divided into different groupings reflecting four main phases of fault tolerance. These are Error Detection (Chapter 5), Error Recovery (Chapter 6) and Error Mitigation (Chapter 7) which combined deal with Error Processing, and Fault Treatment (Chapter 8). Chapter 3 will explain these phases in more detail. In addition to these phases, Chapter 4 contains architectural patterns that cut across all aspects of the system, and have wider scope than the patterns in Chapters 5 through 8. Chapter 3 contains an introduction to the pattern chapters and defines the common context and terminology that will be used in the actual patterns. Chapters 1 and 2 discuss fault tolerance in general but do not contain patterns.

Within each of these pattern chapters, the patterns work together to address the problems related to data and execution time faults, errors and failures. Within the larger context of this book, all of the patterns in all the chapters work together with other patterns to help you create systems of software that will tolerate faults and errors, and as a result have fewer failures.

Using a pattern language consists of four steps:

1. First you should be familiar with all of the patterns in the target design space. This familiarity is at the level of 'I remember seeing something about that in that particular reference'; you don't need to remember all the details, but you need to be able to find the pattern.
2. You should then consider the scope of the design problems that you are confronting. This helps you better understand your problem space.
3. Then look through all of the patterns that you remember that you think can help with your problem space and collect them in one place. It is sufficient for you to know where they are and how to find them.
4. The final step is to perform your normal design steps using the collection of patterns that you assembled in step 3 to guide your decision-making process.

A pattern language is complete and suitable for use in a design process if it meets two properties. Functional completeness says that the language contains patterns to resolve any new problems that result from applying a pattern in the language. The language is morphologically complete if the language covers the design space by building a complete solution to the systemic design problem without leaving any gaps.

When applying patterns to a design problem it is common to resolve one problem but have some new problem become exposed. Generally the new problem is ‘smaller’ and easier to resolve. If the solutions to all of the new problems that are introduced by the application of a pattern are contained in the language then the language can be said to be functionally complete. The pattern CHECKPOINT (37) provides an illustration of this principle. Once you have decided to add CHECKPOINTS to your system, new problems arise: What do you save in a CHECKPOINT? And how often should CHECKPOINTS be taken? Both of these new problems are addressed by subsequent patterns in the language.

A pattern language is morphologically complete if the solution space that it describes has no gaps that are not addressed. If this book contained only patterns for fault and error detection and did not cover the topic of error processing it would not be morphologically complete because there would be a major gap in its coverage.

The patterns in this book represent a complete pattern language. You will, however, find references to other patterns outside this book because they might address some element of the design space that isn’t directly involved with fault tolerance. An appendix contains the intent and a reference for each of these external patterns.

Figure 1 is an example of the style used to represent a pattern language. The pattern’s representation is a directed acyclic graph that shows the relationship between the patterns. From this diagram the arrow pointing from MINIMIZE HUMAN INTERVENTION (5) to MAXIMIZE HUMAN PARTICIPATION (6) indicates that MAXIMIZE HUMAN PARTICIPATION resolves some problem that MINIMIZE HUMAN INTERVENTION introduces, for example the problem of ‘how do we make sure that the outside world knows how the system is performing?’

Figure 1 Example language map

In the pattern chapters, language maps such as this one are presented to show a meaningful progression through the patterns. A language map of the whole language is found at the back of the book.

A Word about Examples

This book draws many examples from telecommunications or space programs. These environments have stringent reliability and availability requirements. The next few paragraphs provide background on these examples.

Telecommunications Systems

Telephone switching systems, such as the 4ESS™ Switch and the 5ESS® Switch from Alcatel-Lucent (formerly Lucent Technologies and AT&T), provide the automatic routing and connection of telephone calls. The 4ESS Switch is a long distance, or toll, switch that first entered service in 1976. Its availability requirement allows no more than three minutes of unplanned downtime per system per year. The 4ESS Switch consists of a pair of proprietary central processors that operate in lock-step, providing a high level of redundancy. Surrounding the central control processors are many other computers of varying power performing general functions such as maintaining billing information, providing interfaces to external data protocols, or controlling the electronics required to make the telephone connections.

The 5ESS Switch is a local and toll switch, also from Alcatel-Lucent. The 5ESS Switch has a different set of features from the 4ESS Switch, which allow it to be used as the first switch in the network that can connect to analog phone lines. The 5ESS Switch has a record of achieving an average of only one-half minute of downtime per system per year during the early 2000s. The 5ESS Switch is a distributed complex of processors that are assigned to specific processing roles. The system employs a high level of redundancy that allows it to achieve its high level of availability.

Space Programs

The world's space programs provide many examples of fault tolerant software design. Examples will be drawn from several of them. They have a number of common attributes that make them interesting examples. The first is the harsh environment in which they operate: the systems must be self-contained and able to withstand a wide range of complications. Their missions are long-lived, frequently measured in years rather than days or hours, and some of the spacecraft have greatly exceeded their designers' expectations. For example, the Mars Exploration Rovers, Spirit and Opportunity, were expected to survive and continue operating for 90 Martian days. They each operated successfully for more than 1000 Martian days.

Once their mission has begun, changing the system software becomes either impossible or very difficult. The software must be designed and built correctly initially and any changes that are made must be carefully designed and built to ensure that they do not cause the mission to fail.

The type and design of software used in spacecraft have also evolved over time. Early spacecraft had very short programs hardcoded into their memory stores. Programs grew, but still honored the principle of Keep It Simple by being designed small and concise. Modern techniques for redundancy management were employed by the Space Shuttle. The European Space Agency's Ariane launch vehicle reused software from one model to the next.

Acknowledgements

There are many people who helped make this book and the patterns, not the least of whom is Karen Hanmer, who supported the effort and assisted greatly with the photographs. Henry Maron drew many of the solution illustrations. The ChiliPLoP 2007 Hot Topic group that reviewed this book consisted of Paul Adamczyk, Richard P. Gabriel, and Ricardo Lopez. Their insightful and thought provoking comments were instrumental to the final shape of this book.

The University of Illinois – Urbana Champaign Software Architecture Group under the leadership of Professor Ralph Johnson reviewed the manuscript and offered very useful comments and suggestions.

Veena Mendiratta, John Letourneau, Doug Kimber, Eric Bauer, Phil Scarff, Shawa Tam, Amir Raveh, Amr Elssamadisy, and Lee Ayres all contributed useful comments to help this book take shape.

Lucent and Alcatel-Lucent managers have supported this project from the beginning, including John McManus, Doug Wittig, Jan Fertig, Michael Massetti, Shawa Tam, Jon Heard, Joe Carson and Thierry Paul-Dauphin. A special thank you goes to Alicja Kawecki, who has always been very helpful with publication clearance.

Thank you to the people at John Wiley & Sons, Rosie Kemp, Sally Tickner, Drew Kennerley and Hannah Clement, for patiently answering all the questions that go along with a first book.

Pattern Origins and Earlier Versions

LEAKY BUCKET COUNTERS (27) was originally written by Robert Gamoke; the original version, edited by James O. Coplien, was published in [ACG+96]. It is very similar to Leaky Bucket of Credit by Gerard Meszaros, published in [Mes96], which describes using the same concept as a resource allocation mechanism. The Leaky Bucket Counter strategy was alluded to on pp. 2003–4 of the Bell System Technical Journal, Volume XLIII 5(10), Sept. 1964.

COMPLETE PARAMETER CHECKING (14) was suggested by Kopetz in [Kop79], pp 75–76.

REASSESS OVERLOAD DECISION (44) was alluded to in [GHH+77], p. 1177.

DEFERRABLE WORK (43) was alluded to in [GHH+77], p. 1177.

QUEUE FOR RESOURCES (46) is related to [WWF96].

EXISTING METRICS (20) was alluded to in [CCR+77], p. 1116.

Earlier versions of the patterns FINISH WORK IN PROGRESS (54), FRESH WORK BEFORE STALE (55), SHARE THE LOAD (51), SHED LOAD (49) and SHED WORK AT PERIPHERY (52) were written by Gerard Meszaros and published in [Mes96].

Earlier versions of EQUITABLE RESOURCE ALLOCATION (45), DEFERRABLE WORK (43) (If It’s Working Hard Don’t Fix It), EXISTING METRICS (20) (Overload Elastics), OVERLOAD TOOLBOXES (42), QUEUE FOR RESOURCES (46), and REASSESS OVERLOAD DECISION (44) appear in [Han06a] in [MVN06].

Mike Adams was a co-author on previous versions of EQUITABLE RESOURCE ALLOCATION (45), OVERLOAD TOOLBOXES (42), and SLOW IT DOWN (53).

Titos Saridakis in [Sar02] has versions of ACKNOWLEDGEMENT (17), HEARTBEAT (16), ROLLBACK (32) and ROLL-FORWARD (33).

Michael Wu was co-author on previous versions of EXPANSIVE AUTOMATIC CONTROLS (47), PROTECTIVE AUTOMATIC CONTROLS (48), and FINAL HANDLING (50).

MINIMIZE HUMAN INTERVENTION (5) was originally written by James O. Coplien and Mike Adams.

An earlier version of RIDING OVER TRANSIENTS (26) was written by James O. Coplien.

Thank you to my PLoP shepherds for these patterns:

Our PLoP 95 shepherd was Gerard Meszaros, who worked with James Coplien, lead author of [ACG+96], on the pre-conference drafts.
Michael Pont, as shepherd, offered many valuable comments and suggestions that significantly improved the organization of WATCHDOG (18), HEARTBEAT (16), SYSTEM MONITOR (15), ACKNOWLEDGEMENT (17), and REALISTIC THRESHOLD (19). Mark Bradac and Lars Grunske also reviewed drafts.
Ward Cunningham was PLoP 2000 shepherd for the patterns that also appeared in [Han06a].
For the patterns CHECKPOINT (37), CONCENTRATED RECOVERY (29), INDIVIDUALS DECIDE TIMING (40), LIMIT RETRIES (35), REMOTE STORAGE (39), and WHAT TO SAVE (38), thanks to Titos Saridakis, Tim Parks for insight on Killer Messages, Mark Bradac, and to shepherd Toni Marinucci.
David DeLano provided many useful comments on EXPANSIVE AUTOMATIC CONTROLS (47), PROTECTIVE AUTOMATIC CONTROLS (48), and FINAL HANDLING (50).
Thanks to Dirk Riehle who shepherded ERROR CONTAINMENT BARRIER (13) and MARKED DATA (56) through many changes of direction for PLoP 2006 [Han06b]. Ralph Johnson provided valuable feedback on the patterns.

Thank you to all of my Writers’ Workshops at PLoP conferences:

Thanks to the PLoP2K Writers' Workshop group for their valuable comments. Bill Opdyke, Carlos O'Ryan, Brian Foote, Rossana Andrade, Todd Coram, Brian Marick, Juha Pärssinen and Terunobu Fujino were members of this group, entitled 'Network of Learning', and of the 2003 DRE workshop group.

The PLoP 2004 Writers’ Workshop group of Bob Blakey, Craig Heath, Eduardo Fernandez, Weerasak Witthawaskul, Paul Adamczyk, Crutcher Dunnavant, Joel Jones, Manawar Hafiz, Nelly Delessy-Gassant, and Halina Kaminski offered many valuable suggestions.

The PLoP 2006 Intimacy Gradient workshop group that reviewed [Han06b] included Mirko Raner, Erol Thompson, Anders Janmyr, Philipp Bachmann, Daniel Vainsencher, Maurice Rabb, Andrew Black, Sachin Bammi, Kanwardeep Singh Ahluwalia, Ward Cunningham, Brian Foote, Ademar Aguiar, and Ricardo Lopez.

And to all those unnamed members of my other PLoP Writers’ Workshop groups.

Introduction

An Imperfect World

We live in an imperfect world. The things we make break when we least expect them to. This includes computer programs and the systems that we build from computers. Even the things that we think of as being the most reliable are occasionally unavailable because they’ve broken. This book is about how to make these systems of software (and hardware) work even though they might break occasionally.

Consider the United States' manned space flight program: Apollo 13 had a dramatic failure that almost killed the three-person crew as they were heading toward the moon. Think also of the failures detected during space shuttle assembly that delay a launch for a period of days. These space systems, which are highly complicated systems of hardware and software components, were designed to operate flawlessly, and yet failures happened.

Consider also the WYSIWYG document editing program that just won't let you number the first page of a document, such as this book's manuscript, as page one. Page numbering is a feature that the program's creators and users expect to work flawlessly.

Or consider systems such as telephone switching equipment or web-based e-commerce systems or automatic teller machines (ATMs). These are expected to work flawlessly and continuously. They are built of combinations of hardware and software components that work together to provide the desired service.

This book is about what to design into software to make these complicated systems of software (and hardware) tolerate an occasional error in the software so that they can provide service without failures being perceived.

CHAPTER 1

Introduction to Fault Tolerance

Like any subject of study, fault tolerance has a specialized language associated with it. This chapter introduces these terms.

The focus of this book is on ‘Fault Tolerance’ in general and in particular on things that can be done during the design of software to support fault tolerant operation. A system of software or hardware and software that is fault tolerant is able to operate even though some part is no longer performing correctly. Thus the focus of this book is on the software structures and mechanisms that can be designed into a system to enable its continued operation, even though a different part isn’t working correctly. This book describes practices to improve the reliability and availability of software systems. These practices are currently in use in a variety of software application domains.

The next few sections define the vocabulary needed to discuss fault tolerance.

Fault -> Error -> Failure

The terms fault, error and failure have very specific meanings.

A system failure occurs when the delivered service no longer complies with the specification, the latter being an agreed description of the system’s expected function and/or service. An error is that part of the system state that is liable to lead to subsequent failure; an error affecting the service is an indication that a failure occurs or has occurred. The adjudged or hypothesized cause of an error is a fault. [Lap91, p. 4]

Every fault tolerant system composed of software and hardware must have a specification that describes what it means for that system to operate without failure. The system's specification defines its expected behavior, such as being available 99.999% of the time. When the system doesn't behave in the manner specified in its requirements, it has failed. The term failure refers to system behavior that does not conform to the system's specification.

These are examples of failures: the system crashes to a stop when it shouldn't, the system computes an incorrect result, the system is not available for service, or the system is unable to respond to user interaction. Whenever the system does the wrong thing, it has failed.

Failures are detected by the observers and users of the system.

Failures are dependent upon the requirements and the definition of agreed-upon correct operation of the system. If there is no specification of what the system should do, there cannot be a failure.

Failures are caused by errors.

An error is the incorrect system behavior from which a failure may occur. Errors can be categorized into two types: timing errors and value errors. Errors that manifest as value errors might be incorrect discrete values or incorrect system state. Timing errors can include total non-performance (the time was infinite).

Some common examples of errors include:

Timing or Race conditions: communicating processes get out of synchronization and a race for resources occurs.
Infinite Loops: continuous execution of a tight loop without pausing and without acknowledging the requests of others for shared resources.
Protocol Error: errors in the messaging stream because of non-conformance with the protocol in use: unexpected messages sent to other parts of the system, messages sent at inappropriate times, or messages sent out of sequence.
Data inconsistency: Data may be different between two locations, for example memory and disk, or between different elements in a network.
Failure to Handle Overload conditions: the system is unable to handle the workload.
Wild Transfer or Wild Write: because of a fault in the system, data is written to an incorrect location of memory, or a transfer of control is made to an incorrect location (a minimal code sketch follows this list).
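To make the last item concrete, here is a minimal, hypothetical C++ sketch of a wild write; the loop bound is the fault, and the out-of-bounds store is the resulting error (the names and values are illustrative only):

    #include <cstdio>

    int main() {
        int counts[4] = {0, 0, 0, 0};
        int limit = 100;                  // neighboring data

        // Fault: the bound should be i < 4. The final iteration is a
        // wild write that lands on whatever happens to follow the array.
        for (int i = 0; i <= 4; ++i)
            counts[i] = -1;

        std::printf("limit = %d\n", limit);  // may now erroneously print -1
    }

Whether the stray store corrupts limit, some other data, or nothing visible depends on the platform, which is exactly why such errors are hard to detect.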

Any of these example errors could be failures if they deviate from the system’s specification.

Errors are important when talking about fault tolerant systems because errors can be detected before they become failures. Errors are the manifestation of faults, and errors are the way that we can look into the system to discover if faults are present.

A fault is the defect present in the system that can cause an error; it is the actual deviation from correctness. In a computer program it is the misplaced comma or period, or the missing break statement in a C++ switch statement. Colloquially the fault is often called a 'bug', but that word will not appear elsewhere in this book.

The fault might be a latent software defect, or it might be a garbled message received on a communications channel, or a variety of other things. In general, neither the software nor the observers are aware of the presence of a fault until an error occurs.

A number of causes lead to the introduction of a fault into software. These include:

Incorrect Requirement Specification: Sometimes the software designers and coders were told to build the wrong thing.
Incorrect Designs: Translating system requirements into a working software design is a complicated process that sometimes results in incorrect designs. The design might not be workable from a pure software standpoint, or it might not be an accurate translation of the requirements. In either case it is faulty.
Coding Errors: Translating the design into working code can also introduce faults into the system. The compiler, interpreter or code examination tool can catch some faults, but a fault can still produce syntactically correct code that just does not perform the specified task.

Faults are present in every system. When a fault is lying dormant and not causing any mischief it is said to be latent. When circumstances arise in which the latent fault causes something incorrect to happen, it is said to become active. A fault's activation results in an error.
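For example, in the following hypothetical C++ fragment the missing break mentioned above is the fault. It is latent as long as only restart and report requests arrive; the first audit request activates it, and the erroneously scheduled restart is the error:

    #include <iostream>

    void runAudit()        { std::cout << "running audit\n"; }
    void scheduleRestart() { std::cout << "restart scheduled\n"; }
    void printReport()     { std::cout << "printing report\n"; }

    enum Request { kAudit, kRestart, kReport };

    void dispatch(Request r) {
        switch (r) {
        case kAudit:
            runAudit();
            // Fault: 'break' is missing here, so control falls through.
        case kRestart:
            scheduleRestart();   // erroneously executed for kAudit too
            break;
        case kReport:
            printReport();
            break;
        }
    }

    int main() {
        dispatch(kReport);   // fault stays latent
        dispatch(kAudit);    // fault activates: an unwanted restart is scheduled
    }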

Examples of Fault -> Error -> Failure

To help make these very important definitions clear, here are a few examples.

A misrouted telephone call is an example of a failure. Telephone system requirements specify that calls should be delivered to the correct recipient. When a faulty system prevents them from being delivered correctly, the system has failed. In this case the fault might have been incorrect call routing data stored in the system. The error occurs when the incorrect data is accessed and an incorrect network path is computed with that incorrect data.

A robotic arm used to drill a part in a manufacturing environment provides another example. Consider the fault of a misplaced decimal point in a data constant used in the computation of the rotation of the robot's arm. The data constant might be the number of steps required to rotate the robotic arm one degree. The error might be that the arm rotates in the wrong direction because of the erroneous computation made with the faulty constant. The arm fails by lowering its drill at the wrong location.

The preparation of an incorrect bill for service is another example of a failure. The system requirements specify that the customer will be accurately charged for service received. A faulty identification received in a message by a billing system can result in the charges being erroneously applied to the wrong account. The fault in this case might have been in the communications channel (a garbled message), or in the system component that prepares the message for transmission. The error was applying the charges to the wrong account. The fact that the customer receives an incorrect charge is the failure, since they agreed with the carrier to pay for the service that they used and not for unused service.

Consider a spacecraft that is given an updated set of program instructions by the Earth station controlling it. An error occurs because someone designing the update incorrectly computed the memory range to be updated. The new program was written into this incorrect range, which corrupted another part of the programming. The corrupted instructions caused the spacecraft's antenna to point away from Earth, breaking off communications between Earth and the spacecraft, which led to the mission being considered a failure. The initial fault was the computation of the incorrect memory range.

Banking systems fail when they do not safeguard funds. An example of failure is when a bank's automatic teller machine (ATM) dispenses too much cash to a customer. Several errors might lead to this failure. One error is that the machine counted out more bills than it should have; in this case the fault might be an incorrect computation module or a faulty currency sorting mechanism. A different error that can result in the same failure is that the bills were loaded incorrectly into the ATM; the fault was that the courier that loaded the machine put money in the wrong dispensers, i.e. $20 bills were placed in the $5 storage location and vice versa.

The last example illustrates how the same failure might result from different faults as shown in Figure 2.

Figure 2 Multiple faults create the same error

Another example is the failure of the first Ariane 5 rocket from the European Space Agency. Flight 501 veered off its intended course, broke up and exploded shortly after liftoff. The inertial reference system for the Ariane 5 was reused from the Ariane 4. During the initial period of flight, the Ariane 5's flight path differed enough from the Ariane 4's for the inertial reference system to encounter errors in the horizontal velocity calculations. These errors resulted in the failure of the backup inertial reference system, followed by a failure of the active inertial reference system. The loss of the inertial reference systems resulted in a large deviation from the desired flight path, which resulted in a mechanical failure that triggered the self-destruct circuitry. The fault in this case can be traced to a change in the requirements between Ariane 4 and Ariane 5 that allowed a more rapid buildup of horizontal velocities in Ariane 5. The error that resulted from the horizontal velocity increasing too rapidly led to the failure. [ESA96]

Failure Perception [Lap91][Kop97]

A fail-silent failure is one in which the failing unit either presents the correct result or no result at all. A crash failure is one where the unit stops after the first fail-silent failure. When a crash failure is visible to the rest of the system, it is called a fail-stop failure.

A set-top entertainment system computer fails quietly, without announcing to the world that it has failed; when it fails it just stops providing service. The computer in the Voyager spacecraft fails in a crash failure mode: it stops after it detects its first failure. The backup computer detects the crash and assumes primary control. [Tom88]

Failures can be categorized as either consistent or inconsistent. Consistency refers to whether the failure appears the same each time it is observed. The failure is examined from the viewpoint of the user: the person or other system that determines that the failing system did not conform to its specification. Consistent failures are seen as the same kind of failure by all users or observers of a system. An example of failing consistently is reporting '1' in response to all questions that the system is asked.

Inconsistent failures are ones that appear different to different observers. These are sometimes called two-faced failures, malicious failures or Byzantine failures. These are the most difficult to isolate and correct because the failure is presenting multiple faces to the error detection, processing, and fault treatment phases of recovery.

An example of an inconsistent failure is to respond with '1' to questions asked by one peer and '2' to questions from all other peers. Another example is when the failing system misroutes all network traffic to a certain network address and not to the others. The observers of the system, the network peers, see one of two behaviors: either a complete absence of network traffic, or a flood of traffic, most of which is incorrect and should not have been received. This failure is inconsistent because the perception of whether the system is sending traffic or not depends on which peer is the observer.

Inconsistent failures are very hard to detect and to correct because they appear different to each observer. In particular, they might appear correct to the part that would detect a failure and incorrect to all other parts of the system. To counter the risk of the failure appearing differently to different observers, fault tolerant design attempts to turn potentially inconsistent failures into consistent failures. This is accomplished by creating boundaries around the failing functionality and transforming all failures into fail-silent failures.

Fail-silent failures are the easiest type of failure to tolerate, because the observed failure is simply that the failing unit has stopped working. The reason for the failure is unclear, but the failing element is identified and the failure is contained, not spreading throughout the system.
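As a minimal sketch of this transformation (hypothetical code, assuming errors surface as C++ exceptions), a containment boundary can convert whatever goes wrong inside a unit into uniform fail-stop behavior:

    #include <functional>
    #include <optional>
    #include <utility>

    // The wrapped unit either returns a correct result or nothing at all
    // (fail-silent). After the first detected error it stops permanently
    // and reports that it has stopped (fail-stop), so every observer sees
    // the same consistent, silent failure.
    class FailStopUnit {
    public:
        explicit FailStopUnit(std::function<int(int)> worker)
            : worker_(std::move(worker)) {}

        std::optional<int> process(int request) {
            if (stopped_) return std::nullopt;   // silent: no result at all
            try {
                return worker_(request);
            } catch (...) {                      // any detected error
                stopped_ = true;                 // crash: stop after first error
                return std::nullopt;
            }
        }

        bool stopped() const { return stopped_; }  // visible to the system

    private:
        std::function<int(int)> worker_;
        bool stopped_ = false;
    };

    int main() {
        FailStopUnit unit([](int x) {
            if (x < 0) throw x;    // simulated internal error
            return x * 2;
        });
        unit.process(21);          // returns 42
        unit.process(-1);          // error detected: the unit silently stops
        return unit.stopped() ? 0 : 1;
    }

Making the stopped state queryable is what turns the crash failure into a fail-stop failure: the rest of the system can observe that the unit has stopped rather than guessing from its silence.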

Single Faults

Much of the fault tolerant design over the years has been created to handle only one error at a time. The assumption is that only one error will occur at a time, and that recovery from it completes before another error occurs. A further assumption is that errors are independent of each other.

While this is a common design principle in real life, many failures have occurred when this assumption has been invalid.

To understand why this is a valuable assumption, consider Table 1.1. It shows the theoretical results indicating how many redundant units are required to tolerate independent faults of three kinds: fail-silent, consistent and malicious (inconsistent). The type of failure tolerated influences the number of components required. From this table, most designers will conclude that the most desirable situation is for a failing unit to fail silently, because that requires only two units (n + 1 with n = 1) to tolerate the failure.

Table 1.1 Minimum number of components to tolerate failures [Kop97, p. 121]

TYPE OF FAILURE         MINIMUM NUMBER OF COMPONENTS TO TOLERATE n FAILURES
Fail-silent failures    n + 1
Consistent failures     2n + 1
Malicious failures      3n + 1

To gain perspective on the ramifications of Table 1.1: the computer control system in the Space Shuttle is designed to tolerate two simultaneous failures, which must be consistent but need not be silent; tolerating n = 2 consistent failures requires 2n + 1 = 5 components, and as a result it has five general purpose computers. [Skl76] A typical telephone switching system is designed to tolerate single failures. Many components are duplicated, because two units are all that is required to tolerate a single fail-silent failure.
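A brief sketch of why 2n + 1 units suffice for consistent failures: with three units (n = 1), one unit that consistently computes a wrong value is simply outvoted. The following hypothetical C++ voter illustrates the comparison step:

    #include <algorithm>
    #include <cstddef>
    #include <optional>
    #include <vector>

    // Majority voter over the results of 2n + 1 redundant units. A value
    // is accepted only if a strict majority of the units agree on it;
    // otherwise an error has been detected but cannot be masked.
    std::optional<int> majorityVote(std::vector<int> results) {
        std::sort(results.begin(), results.end());
        int candidate = results[results.size() / 2];  // median of the values
        std::ptrdiff_t agree =
            std::count(results.begin(), results.end(), candidate);
        if (2 * agree > static_cast<std::ptrdiff_t>(results.size()))
            return candidate;   // a strict majority agrees
        return std::nullopt;    // no majority: detected, unmasked error
    }

With results {7, 7, 9} the vote returns 7, masking the one consistently failing unit. Malicious failures are harder precisely because a failing unit need not show every voter the same value.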

Examples of How Vocabulary Makes a Difference

When debugging failures it is very useful to determine what is the fault, what is the error and what is the failure. Here are a few examples. These also show that the terms, while specific, depend on the viewpoint and the depth of examination.

Consider the robotic arm failure presented above. Was the fault that the arm software rotated the arm in the wrong direction, or was it the incorrect data that drove that behavior? Knowing which was the fault helps us know what to fix.

As another example, consider the Ariane 5 failure mentioned earlier. Was the fault that the specification didn’t reflect the expected flight path? Or was the fault that the reused component was insufficiently tested to detect the fault? Was the error that the incorrect specification was used, or was the error that the flight path deviated from the Ariane 4 flight path? Identifying and correctly labeling faults and errors simplifies the fault treatment.

Coverage

The coverage factor is an important metric of a system’s fault tolerance. Highly reliable and highly available systems strive for high coverage factors, 95% or higher.

The coverage is the conditional probability that the system will recover automatically within the required time interval given that an error has occurred.

In the Space Shuttle avionics nearly perfect coverage is attained in a complex of four off-the-shelf processors by comparing the output of simultaneous computations in each of the processors. Each Shuttle processor is equipped with a small amount of redundancy management hardware to manage the receipt of the values to be compared. Through the use of this hardware the processor can identify with certainty which of its peers computed an incorrect value. The coverage was increased to 100% through the additional technique of placing a timer on the buses used to communicate between the processors. [Skl76]

Coverage can be computed from the probabilities associated with detection and recovery: the probability that an error is detected multiplied by the probability that recovery succeeds once the error has been detected.

Obtaining the probabilities used to compute the coverage factor is difficult. Extensive stability testing and fault insertion testing are required to obtain these values.

Reliability

A system's reliability is the probability that it will perform without deviations from agreed-upon behavior for a specific period of time; in other words, that there will be no failures during that time.

The parameters used to describe reliability are Mean Time To Failure (MTTF) and Mean Time To Repair (MTTR). The Mean Time To Failure is the average time from the start of operation until the time when the first failure occurs. The Mean Time To Repair is a measure of the average time required to restore a failing component to operation; for hardware this includes the time to travel to the site in addition to the time to replace the faulty component. The Mean Time Between Failures, or MTBF, is similar to MTTF but reflects the time from the start of operation until the component is restored to operation after repair: MTBF is the sum of MTTF and MTTR. MTBF is used in situations where the system is repairable, and MTTF is used when it cannot be repaired. The start of operation for both MTTF and MTBF refers to when normal operations resume, either after initial startup or after recovery has completed. Assuming a constant failure rate, the reliability over a time period t can be computed with the following equation:

R(t) = e^(-t/MTTF)

Failure rate is the inverse of MTTF. A commonly used measurement of failure rate is FITs, or Failures in Time: the number of failures in 1 × 10^9 hours. For example, an MTTF of 100,000 hours corresponds to 10^9 / 100,000 = 10,000 FITs.

Reliability Examples

Mars Landers

The Mars Exploration Rovers, Spirit and Opportunity, had a design duration of 90 days. The reliability of these two Mars explorers has been so good that they lasted more than 1000 days. Note, however, that this refers only to complete system failures; there have been partial failures requiring workarounds or fault treatment, such as finding a way to keep the rover Spirit operating on only five of its six wheels. [NASA04][NASA06]

Airplane Navigation System

Many modern airplanes rely extensively on computers to control critical systems. While the aircraft is in the air, the navigational computers must operate failure-free. On a flight from Chicago to Los Angeles, the navigation system must be failure-free for between four and five hours; the MTTF during the operational phase must therefore be greater than five hours, for if it were less the flight crew could expect at least one failure per flight. If the navigational system fails while the airplane is at the gate on the ground, repairs can return it to operational status before its next flight. A failure before or after a flight is still a failure, but it might not be counted in the system's reliability computations. The MTTR must be low, because airlines require their planes to be highly available in order to maximize their return on investment.

Measuring Reliability

There are two primary methods of determining the reliability of a system. The first is to watch the system for a long time and calculate the probability of failure from the observed behavior. The other is to predict the number of faults and, from that number, predict the probability of failures (both the number of failures and their durations). Software Reliability Engineering focuses on measuring and predicting reliability.

Availability

A system's availability is the percentage of time that it is able to perform its designed function. Uptime is when the system is available; downtime is when it is not. A common way to express availability is in terms of a number of nines, as indicated in Table 1.2.

Table 1.2 Availability as a number of nines

EXPRESSION         AVAILABILITY   MINUTES PER YEAR OF DOWNTIME
100%               100%           0
Three 9s           99.9%          525.6
Four 9s            99.99%         52.56
Four 9s and a 5    99.995%        26.28
Five 9s            99.999%        5.256
Six 9s             99.9999%       0.5256

Availability is computed as the proportion of time that the system is up:

Availability = MTTF / (MTTF + MTTR) = uptime / (uptime + downtime)
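As a quick illustration (hypothetical values, not from the book), the rows of Table 1.2 follow directly from this equation:

    #include <cstdio>

    int main() {
        // Availability = MTTF / (MTTF + MTTR); the unavailable fraction
        // times 525,600 (minutes in a 365-day year) gives annual downtime.
        const double mttfHours = 10000.0;  // assumed example value
        const double mttrHours = 0.1;      // assumed example value (6 minutes)

        const double availability = mttfHours / (mttfHours + mttrHours);
        const double downtime = (1.0 - availability) * 525600.0;

        std::printf("availability      = %.4f%%\n", availability * 100.0);
        std::printf("downtime per year = %.3f minutes\n", downtime);
    }

With these assumed values the availability is 99.999%, landing on the five 9s row of the table.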

Availability and reliability are two concepts that are easy to confuse. Availability is concerned with the percentage of time that the system can perform its function. Reliability is concerned with the probability that the system will perform failure-free for a specified period of time.

Availability Examples

The 4ESS™ Switch from Alcatel-Lucent had an explicit requirement, when it was designed in the 1970s, of no more than two hours of downtime every 40 years. This equates to an unavailability of three minutes per year, which is slightly better than five 9s. The 5ESS® Switch from Alcatel-Lucent has achieved six 9s of availability for a number of years.

Dependability

Dependability is a measure of a system's trustworthiness: the degree to which it can be relied upon to perform the desired function. The attributes of dependability are reliability, availability, safety and security. Safety refers to the non-occurrence of catastrophic failures, whose consequences are much greater than the potential benefit. Security refers to the prevention of unauthorized access to or handling of information. Since dependability includes both reliability and availability, the correctness of the result is important. [Lap91]

Hardware Reliability

Unlike software faults, hardware faults can be analyzed statistically, based upon behavior and occurrence as well as the physics of materials. The reliability of hardware has been studied for a long time and is covered in great depth. Hardware reliability includes the study of the physics and the materials involved, as well as the ways that things wear out. An array of technical conferences and journals address this topic, such as the International Reliability Physics Symposium, the Electronic Components Technology Conference, and the IEEE journals Device and Materials Reliability, Advanced Packaging and Solid-State Circuits.

Reliability Engineering and Analysis

Software Reliability Engineering is the practice of monitoring and managing the reliability of a system. By collecting fault, error, and failure statistics during development, testing, and field operation, it is possible to monitor and manage the parameters of reliability and availability. The Handbook of Software Reliability Engineering [Lyu96] contains a number of articles on topics related to Software Reliability Engineering.

A widely used technique is Reliability Growth Modeling, which graphs the cumulative number of faults corrected versus time. Prediction methods calculate the cumulative number of faults expected, which enables comparison with the measured results and, in turn, an estimate of the number of faults remaining in the system.
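The text does not prescribe a particular growth model; as one concrete sketch, the widely used Goel-Okumoto model predicts the cumulative number of faults found by time t as m(t) = a(1 - e^(-bt)), where a is the total expected number of faults and b is the fault detection rate. Comparing the prediction with the measured curve estimates the faults remaining (the parameter values below are assumed for illustration):

    #include <cmath>
    #include <cstdio>

    int main() {
        // Goel-Okumoto reliability growth model (an assumed choice):
        // m(t) = a * (1 - exp(-b * t)) is the expected cumulative fault
        // count found by time t.
        const double a = 120.0;  // assumed total expected faults
        const double b = 0.05;   // assumed detection rate, per week
        const double t = 26.0;   // weeks of testing so far

        const double found = a * (1.0 - std::exp(-b * t));
        std::printf("expected faults found by week %.0f: %.1f\n", t, found);
        std::printf("expected faults remaining:         %.1f\n", a - found);
    }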

Markov modeling of systems (including software components) is another technique useful for predicting the reliability of a system. These models enable analysis of redundancy techniques and prediction of MTTF.

Markov models are constructed by defining the possible system states. Transitions between the states are defined and assigned a probability indicating the likelihood that the transition will occur. An important aspect of the model is that the probability of a state transition depends only on the current state; history is not considered. Figure 3
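As a minimal, hypothetical sketch of such a model (assumed transition values, not from the book): a two-state up/down Markov chain for a single repairable unit, stepped in discrete time. The steady-state probability of being in the up state is the unit's availability.

    #include <cstdio>

    int main() {
        // Two-state discrete-time Markov model. Per-step transition
        // probabilities (assumed values) depend only on the current state.
        const double pFail   = 0.001;  // P(up -> down) in one step
        const double pRepair = 0.1;    // P(down -> up) in one step

        double up = 1.0, down = 0.0;   // start in the working state
        for (int step = 0; step < 100000; ++step) {
            const double nextUp   = up * (1.0 - pFail) + down * pRepair;
            const double nextDown = up * pFail + down * (1.0 - pRepair);
            up = nextUp;
            down = nextDown;
        }
        // Analytic steady state: pRepair / (pFail + pRepair) ~= 0.9901
        std::printf("steady-state availability ~= %.4f\n", up);
    }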