A one-volume guide to the most essential techniques for designing and building dependable distributed systems. Instead of covering a broad range of research works for each dependability strategy, this useful reference focuses on only a selected few (usually the most seminal works, the most practical approaches, or the first publication of each approach), explaining each in depth, usually with a comprehensive set of examples. Each technique is dissected thoroughly enough so that readers who are not familiar with dependable distributed computing can actually grasp the technique after studying the book. Building Dependable Distributed Systems consists of eight chapters. The first introduces the basic concepts and terminology of dependable distributed computing, and also provides an overview of the primary means of achieving dependability. Checkpointing and logging mechanisms, which are the most commonly used means of achieving a limited degree of fault tolerance, are described in the second chapter. Works on recovery-oriented computing, focusing on the practical techniques that reduce the fault detection and recovery times for Internet-based applications, are covered in chapter three. Chapter four outlines the replication techniques for data and service fault tolerance. This chapter also pays particular attention to optimistic replication and the CAP theorem. Chapter five explains a few seminal works on group communication systems. Chapter six introduces the distributed consensus problem and covers a number of Paxos family algorithms in depth. The Byzantine generals problem and its latest solutions, including the seminal Practical Byzantine Fault Tolerance (PBFT) algorithm and a number of its derivatives, are introduced in chapter seven. The final chapter details the latest research results surrounding application-aware Byzantine fault tolerance, which represents an important step forward in the practical use of Byzantine fault tolerance techniques.
Page count: 535
Year of publication: 2014
Contents
Cover
Half Title page
Title page
Copyright page
Dedication
List of Figures
List of Tables
Acknowledgments
Preface
References
Chapter 1: Introduction to Dependable Distributed Computing
1.1 Basic Concepts and Terminologies
1.2 Means to Achieve Dependability
References
Chapter 2: Logging and Checkpointing
2.1 System Model
2.2 Checkpoint-Based Protocols
2.3 Log Based Protocols
References
Chapter 3: Recovery-Oriented Computing
3.1 System Model
3.2 Fault Detection and Localization
3.3 Microreboot
3.4 Overcoming Operator Errors
References
Chapter 4: Data and Service Replication
4.1 Service Replication
4.2 Data Replication
4.3 Optimistic Replication
4.4 CAP Theorem
References
Chapter 5: Group Communication Systems
5.1 System Model
5.2 Sequencer Based Group Communication System
5.3 Sender Based Group Communication System
5.4 Vector Clock Based Group Communication System
References
Chapter 6: Consensus and the Paxos Algorithms
6.1 The Consensus Problem
6.2 The Paxos Algorithm
6.3 Multi-Paxos
6.4 Dynamic Paxos
6.5 Fast Paxos
6.6 Implementations of the Paxos Family Algorithms
References
Chapter 7: Byzantine Fault Tolerance
7.1 The Byzantine Generals Problem
7.2 Practical Byzantine Fault Tolerance
7.3 Fast Byzantine Agreement
7.4 Speculative Byzantine Fault Tolerance
References
Chapter 8: Application-Aware Byzantine Fault Tolerance
8.1 High Throughput BFT Systems: Networked File Systems
8.2 Exploiting Deep Application Semantics: Web Services Coordination
8.3 Application Nondeterminism Control
References
Index
Also of Interest
Building Dependable Distributed Systems
Scrivener Publishing 100 Cummings Center, Suite 541J Beverly, MA 01915-6106
Performability Engineering Series Series Editors: Krishna B. Misra ([email protected]) and John Andrews ([email protected])
Scope: A true performance of a product, or system, or service must be judged over the entire life cycle activities connected with design, manufacture, use and disposal in relation to the economics of maximization of dependability, and minimizing its impact on the environment. The concept of performability allows us to take a holistic assessment of performance and provides an aggregate attribute that reflects an entire engineering effort of a product, system, or service designer in achieving dependability and sustainability. Performance should not just be indicative of achieving quality, reliability, maintainability and safety for a product, system, or service, but achieving sustainability as well. The conventional perspective of dependability ignores the environmental impact considerations that accompany the development of products, systems, and services. However, any industrial activity in creating a product, system, or service is always associated with certain environmental impacts that follow at each phase of development. These considerations have become all the more necessary in the 21st century as the world resources continue to become scarce and the cost of materials and energy keep rising. It is not difficult to visualize that by employing the strategy of dematerialization, minimum energy and minimum waste, while maximizing the yield and developing economically viable and safe processes (clean production and clean technologies), we will create minimal adverse effect on the environment during production and disposal at the end of the life. This is basically the goal of performability engineering.
It may be observed that the above-mentioned performance attributes are interrelated and should not be considered in isolation for optimization of performance. Each book in the series should endeavor to include most, if not all, of the attributes of this web of interrelationship and have the objective to help create optimal and sustainable products, systems, and services.
Publishers at Scrivener Martin Scrivener ([email protected]) Phillip Carmical ([email protected])
Copyright © 2014 by Scrivener Publishing LLC. All rights reserved.
Co-published by John Wiley & Sons, Inc. Hoboken, New Jersey, and Scrivener Publishing LLC, Salem, Massachusetts. Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
For more information about Scrivener products please visit www.scrivenerpublishing.com.
Cover design by Exeter Premedia Services Private Ltd., Chennai, India
Library of Congress Cataloging-in-Publication Data:
ISBN 978-1-118-54943-8
To Michael and Louise
List of Figures
1.1 An example of a chain of threats with two levels of recursion.
1.2 Rollback recovery is enabled by periodically taking checkpoints and, usually, logging the requests received.
1.3 With redundant instances in the system, the failure of a replica can in some cases be masked and the system continues providing services to its clients without any disruption.
2.1 An example distributed system.
2.2 Consistent and inconsistent global state examples.
2.3 An example of the domino effect in recovery with uncoordinated checkpointing.
2.4 Finite state machine specification for the coordinator in the Tamir and Sequin checkpointing protocol.
2.5 Finite state machine specification for the participant in the Tamir and Sequin checkpointing protocol.
2.6 Normal operation of the Tamir and Sequin checkpointing protocol in an example three-process distributed system.
2.7 Finite state machine specification for the Chandy and Lamport distributed snapshot protocol.
2.8 Normal operation of the Chandy and Lamport global snapshot protocol in an example three-process distributed system.
2.9 A comparison of the channel state definition between (a) the Chandy and Lamport distributed snapshot protocol and (b) the Tamir and Sequin global checkpointing protocol.
2.10 Example state intervals.
2.11 An example for pessimistic logging.
2.12 Transport level (a) and application level (b) reliable messaging.
2.13 Optimization of pessimistic logging: (a) concurrent message logging and execution (b) logging batched messages.
2.14 Probability density function of the logging latency.
2.15 A summary of the mean logging latency and mean end-to-end latency under various conditions.
2.16 Probability density function of the end-to-end latency.
2.17 Normal operation of the sender-based logging protocol
2.18 An example normal operation of the sender-based logging protocol
2.19 Two concurrent failures could result in the loss of determinant information for regular messages
3.1 The three-tier architecture
3.2 The Java EE architecture
3.3 An example runtime path of an end-user request
3.4 Component class and component instances
3.5 The chi-square cumulative distribution function for degree of freedom of 1, 2, 3, 4, 5.
3.6 The path shape of the example runtime path shown in Figure 3.3.
3.7 Component class and component instances.
3.8 Dependency templates for nodes, processes, network paths, and the neighbor sets.
3.9 A partial dependency graph for an example system.
3.10 The error function.
3.11 A hypothetical dependency graph with abnormality for each component and the weight for each edge labeled.
3.12 The components that form a cycle in the f-map are reduced to a single unit in the r-map for recursive recovery.
3.13 The architecture of an Operator Undo framework [5, 6].
4.1 The replication algorithm is typically implemented in a fault tolerance middleware framework.
4.2 Active replication, without (top) and with (bottom) voting at the client.
4.3 Passive replication.
4.4 Semi-active replication.
4.5 A write-all algorithm for data replication.
4.6 The problem of the write-all-available algorithm for data replication.
4.7 Preventing a transaction from accessing a not-fully-recovered replica is not sufficient to ensure one-copy serializable execution of transactions.
4.8 An example run of the quorum consensus algorithm on a single data item.
4.9 Basic steps for optimistic data replication for an operation-transfer system.
4.10 An example run of a system with three sites that uses Lamport clocks.
4.11 An example run of a system with three sites that uses vector clocks.
4.12 An example for the determination of the new version vector value after reconciling a conflict.
4.13 An example operation propagation using vector clocks in a system with three replicas.
4.14 An example for operation propagation using timestamp matrices in a system with three replicas.
4.15 Update commit using ack vectors in a system with three replicas.
4.16 Update commit using timestamp matrices in a system with three replicas.
4.17 An illustration of the CAP theorem.
4.18 Partition mode and partition recovery.
5.1 Examples of systems that ensure uniform total ordering and nonuniform total ordering.
5.2 In the sequencer based approach, a general system is structured into a combination of two subsystems, one with a single receiver and the other with a single sender of broadcast messages.
5.3 An example rotation sequencer based system in normal operation.
5.4 Normal operation of the membership view change protocol.
5.5 Membership change scenario: competing originators.
5.6 Membership change scenario: premature timeout.
5.7 Membership change scenario: temporary network partitioning.
5.8 A simplified finite state machine specification for Totem.
5.9 A successful run of the Totem Membership Protocol.
5.10 Membership changes due to a premature timeout by N2.
5.11 Messages sent before N1 fails in an example scenario.
5.12 Messages delivered during recovery for the example scenario.
5.13 Message sent before the network partitions into two groups, one with {N1,N2}, and the other with {N3,N4,N5}.
5.14 Messages delivered during recovery in the two different partitions for the example scenario.
5.15 Causal ordering using vector clocks.
6.1 Normal operation of the Paxos algorithm.
6.2 A deadlock scenario with two competing proposers in the Paxos algorithm.
6.3 If the system has already chosen a value, the safety property for consensus would hold even without the promise-not-to-accept-older-proposal requirement.
6.4 If two competing proposers propose concurrently, the system might end up choosing two different values without the promise-not-to-accept-older-proposal requirement.
6.5 With the promise-not-to-accept-older-proposal requirement in place, even if two competing proposers propose concurrently, only a single value may be chosen by the system.
6.6 Normal operation of Multi-Paxos in a client-server system with 3 server replicas and a single client.
6.7 View change algorithm for Multi-Paxos.
6.8 With reconfigurations, a group of 7 replicas (initially 5 active and 2 spare replicas) can tolerate up to 5 single faults (without reconfigurations, only up to 3 faults can be tolerated)
6.9 The Primary and secondary quorums formation for a system with 3 main replicas and 2 auxiliary replicas
6.10 The Primary and secondary quorums formation as the system reconfigures due to the failures of main replicas
6.11 Normal operation of Cheap Paxos in a system with 3 main replicas and 1 auxiliary replica
6.12 The Primary and secondary quorums formation for a system with 3 main replicas and 2 auxiliary replicas
6.13 Normal operation of (Multi-) Fast Paxos in a client-server system
6.14 Collision recovery in an example system
6.15 Expansion of the membership by adding two replicas in method 1
6.16 Expansion of the membership by adding two replicas in method 2
6.17 Reduction of the membership by removing two replicas one after another
7.1 Two scenarios that highlight why it is impossible to use 3 generals to solve the Byzantine generals problem
7.2 The message flow and the basic steps of the OM(1) algorithms
7.3 The message flow and the basic steps of the OM(2) algorithms
7.4 Normal operation of the PBFT algorithm
7.5 PBFT view change protocol
7.6 A worst case scenario for tentative execution
7.7 Normal operation of Fast Byzantine fault tolerance
7.8 Zyzzyva agreement protocol (case 1)
7.9 Zyzzyva agreement protocol (case 2)
7.10 A corner case in view change in Zyzzyva
8.1 A deadlock scenario if all requests are to be executed sequentially at the replicated cloud service A and at replicated cloud service B
8.2 By exploiting the application semantics, it is possible to concurrently execute several requests if they are independent (bottom) instead of sequentially (top)
8.3 The parallelizer and its interaction with other components.
8.4 The sequence diagram for an example WS-AT application
8.5 The sequence diagram for an example WS-BA application.
8.6 Normal operation of the lightweight algorithm for reaching a Byzantine agreement on request identity
8.7 The normal operation of Byzantine fault tolerant transaction propagation and registration protocol
8.8 The normal operation of Byzantine fault tolerant transaction commitment and completion protocol
8.9 A classification of common types of application non-determinism
8.10 Pseudo code for a remote method that involves the VPRE type of nondeterminism
8.11 Pseudo code for a remote method that involves the NPRE type of nondeterminism
8.12 Pseudo code executed by a multithreaded server with active timer threads
8.13 Pseudo code executed by two different threads for a remote method that involves the NPOST type of nondeterminism
8.14 Normal operation of the modified BFT algorithm in handling VPRE type of nondeterminism
8.15 Normal operation of the modified BFT algorithm in handling NPRE type of nondeterminism
8.16 Normal operations of the modified BFT algorithm in handling VPOST type of nondeterminism
8.17 Normal operations of the modified BFT algorithm in handling NPOST type of nondeterminism
List of Tables
7.1 Messages received and final decisions in two cases for OM(1,4).
7.2 Messages received and step (3) calculation in two cases for instances of OM(1) at G1.
7.3 Messages received and step (3) calculation in two cases for instances of OM(1) at G2.
7.4 Messages received and step (3) calculation in two cases for instances of OM(1) at G3.
7.5 Messages received and step (3) calculation in two cases for instances of OM(1) at G4.
7.6 Messages received and step (3) calculation in two cases for instances of OM(1) at G5.
7.7 Final decision made at each lieutenant in step (3) of OM(2).
8.1 WS-AT services
Acknowledgments
I would like to thank Professor Krishna Misra and Professor John Andrews for organizing the book series on performability engineering. Writing this book has been a very enjoyable journey. I also would like to thank my beautiful wife, Hao, and my lovely children Dorothy, Emily, and Arthur. They make my life so enjoyable and meaningful.
This book is dedicated to my Ph.D. advisors Dr. Michael Melliar-Smith and Dr. Louise Moser, who introduced me to the fascinating world of dependable distributed computing. Their passion for doing research in this field is contagious and inspiring even long after I graduated.
I also would like to thank the following students who have helped me proof-read an earlier draft of this book: Sainath Reddy Adula, Kamlesh Kumar Banala, Prasanth Devarapalli, Robert Fiske, Jonathan Gurary, Anvesh Kamatham, Baburao Chowdary Kandula, Shanthan Rao Kasuganti, Manasa Rao Kodipalli, Chaitanya Muppidi, John Oyster, Venkat Rama Raju Pallapu, Srujana Parelly, Keshav Pasumarthy, Meher Vikramadhitya Varma Penmetsa, Vamshidhar Reddy Pitta, Naga Venkata Saiviswanth Sadam, Tajasvi Sanka, Vittal Shreemale, Mounica Thanneeru, Vishnu Teja Yalamanchili, and Amuktamalyada Yarlagadda. Their help is very much appreciated.
W.Z.
Preface
Distributed computing systems are playing an increasingly important role in all aspects of our society, for governments, businesses, and individuals alike. Such systems are behind many services on which we depend on a daily basis, such as finance (e.g., online banking and stock trading), e-commerce (e.g., online shopping), civil infrastructure (e.g., the electric power grid and traffic control), entertainment (e.g., online gaming and multimedia streaming), and personal data storage (e.g., cloud services such as Dropbox, Google Drive, and SkyDrive). The dependability of these systems no longer matters only to businesses; it matters to every one of us as well.
Of course, ensuring high availability of distributed systems is not cheap. In [4], the cost of a data center is estimated to range from $450 per square foot for 99.671% availability (i.e., 28.8 hours of downtime per year) to $1,100 per square foot for 99.995% availability (i.e., 0.4 hours of downtime per year). That is perhaps one reason why about 59% of Fortune 500 companies suffer 1.6 hours or more of downtime per week [1]. To reduce the cost of building and maintaining highly dependable systems, I believe that an effective way is to train more experts who know how to design, implement, and maintain dependable distributed systems. I hope that this book helps achieve this goal.
In this book, I cover the most essential techniques for designing dependable distributed systems (according to my subjective judgement, of course). To keep the book concise, I chose not to cover a broad range of research work for each dependability technique. Instead, only a selected few (usually the most well-known works, or the first publication of each approach) are included and explained in depth, usually with a comprehensive set of examples. The goal is to dissect each technique thoroughly so that readers who are not familiar with dependable distributed computing can actually grasp the technique after studying the book. Should I have missed any important work that has immediate practical implications (almost inevitable), I would love to hear from the readers and will be happy to include the work in the next edition of the book.
In Chapter 1, we introduce the basic concepts and terminologies of dependable distributed computing, as well as the primary means to achieve dependability.
In Chapter 2, we describe the checkpointing and logging mechanisms, which are widely used in practice to achieve some form of fault tolerance (they enable the recoverability of the application but do not prevent service disruption). The biggest advantages of this approach are that it is relatively simple to implement and understand, and that it incurs minimum runtime overhead while demanding very modest extra resources (only stable storage). Furthermore, checkpointing and logging also serve as the foundation for more sophisticated dependability techniques. The disadvantage of this approach, if used alone, is that it cannot prevent service disruption from happening. Hence, it is not suitable to be used alone for applications that demand high reliability.
In Chapter 3, we cover research works on recovery-oriented computing, including fault detection and diagnosis, microreboot, and system-level undo and redo. Recovery-oriented computing aims to facilitate faster recovery after a system failure and thereby improve the availability of the system. Similar to checkpointing and logging, the mechanisms for recovery-oriented computing do not prevent service disruption; hence, recovery-oriented computing is a promising approach for many e-commerce applications, but not suitable for applications that require high reliability.
In Chapter 4, we outline the replication techniques for data and service fault tolerance. Replication is the fundamental technique for ensuring high reliability. Through active replication (i.e., the use of multiple redundant copies of the application processes), the system is able to mask the failure of a replica and continue to process clients’ requests (this is not entirely true in all cases; as we will show in later chapters, some failures may cause an extended period of unavailability of the system). With replication comes the issue of replica consistency. Ideally, the replicas should always remain consistent with each other. However, doing so might incur too much runtime overhead to be acceptable for some applications, or may cause extended periods of system unavailability. Hence, strict consistency may have to be compromised either for better performance or for better availability.
In Chapter 5, we explain group communication systems, which can be used to implement active replication. A group communication system typically offers a totally ordered reliable multicast service for messages, a membership service, and a view synchrony service. This set of services helps the replicas maintain consistency even in the presence of failures, which reduces the development cost of building dependable systems with active replication. In this chapter, we describe in detail several well-known research works that construct group communication systems using different approaches.
In Chapter 6, we discuss the consensus problem and describe several Paxos algorithms, including Classic Paxos, Dynamic Paxos, Cheap Paxos, and Fast Paxos. Distributed consensus is perhaps the most fundamental problem in distributed computing. It is easy for a group of processes to agree on the same value if all processes can communicate with each other promptly and none of them fails. However, distributed consensus becomes an incredibly hard problem when processes might fail and when there might be extended delays in sending or receiving a message. The classical Paxos algorithm solves the consensus problem (under the non-malicious fault model) in a very elegant and efficient manner by separating the safety concern from the liveness concern [5]. Additional Paxos algorithms have been developed to minimize the resources required (Cheap Paxos) and to reduce the latency of achieving consensus by using a higher redundancy level (Fast Paxos).
In Chapter 7, we introduce the problem of Byzantine fault tolerance. A Byzantine fault is synonymous with a malicious fault. Because a maliciously faulty component may choose to behave like a component with any of the non-malicious faults, the Byzantine fault model encompasses any arbitrary fault. The distributed consensus problem under the Byzantine fault model was first studied several decades ago by Lamport, Shostak, and Pease [6]. A much more efficient algorithm for achieving fault tolerance under the Byzantine fault model (referred to as Byzantine fault tolerance) was proposed by Castro and Liskov in 1999 [2]. Since then, research on Byzantine fault tolerance has exploded. With the pervasiveness of cyber attacks and espionage, tolerating malicious faults has become an urgent concern rather than the far-fetched problem it seemed several decades ago. In this chapter, we explain in detail several seminal works on this topic.
In Chapter 8, we document a few research works on the design of customized Byzantine fault tolerance solutions that exploit application semantics. In a general-purpose Byzantine fault tolerance algorithm, all requests are totally ordered and executed sequentially in the total order. This imposes severe restrictions on the types of applications that the algorithm can support. By exploiting application semantics, the general-purpose algorithm can be customized to enable the partitioning of requests and the identification of independent, read-only, and commutative requests, all of which facilitate concurrent execution of multiple requests. Furthermore, by enabling concurrent execution of selected requests based on the application semantics, potential deadlocks can be prevented.
1. A. Arnold. Assessing the financial impact of downtime, April 2010. http://www.businesscomputingworld.co.uk/assessing-the-financial-impactof-downtime/.
2. M. Castro and B. Liskov. Practical byzantine fault tolerance. In Proceedings of the third symposium on Operating systems design and implementation, OSDI ’99, pages 173–186, Berkeley, CA, USA, 1999. USENIX Association.
3. Channel Insider. Unplanned IT outages cost more than $5,000 per minute: Report. http://www.channelinsider.com/c/a/Spotlight/Unplanned-ITOutages-Cost-More-than-5000-per-Minute-Report-105393/, May 2011.
4. J. Clark. The price of data center availability, October 2011. http://www.datacenterjournal.com/design/the-price-of-data-centeravailability/.
5. L. Lamport. Paxos made simple. ACM SIGACT News (Distributed Computing Column), 32(4):18–25, December 2001.
6. L. Lamport, R. Shostak, and M. Pease. The byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4:382–401, 1982.
7. T. Pisello and B. Quirk. How to quantify downtime, January 2004. http://www.networkworld.com/careers/2004/0105man.html.
Distributed systems bring many benefits to us. For example, we can share resources such as data storage and processing cycles much more easily; we can collaborate on projects efficiently even if the team members are spread across the planet; and we can solve challenging problems by utilizing the vast aggregated computing power of large-scale distributed systems. However, if not designed properly, distributed systems may appear to be less dependable than standalone systems. As Leslie Lamport pointed out: “You know you have one (a distributed system) when the crash of a computer you’ve never heard of stops you from getting any work done” [9]. In this book, we introduce various dependability techniques that can be used to address the issue brought up by Lamport. In fact, with sufficient redundancy in the system, a distributed system can be made significantly more dependable than a standalone system, because such a distributed system can continue providing services to its users even when some of its nodes have failed.
In this chapter, we introduce the basic concepts and terminologies of dependable distributed computing, and outline the primary approaches to achieving dependability.
The term “dependable systems” has been used widely in many different contexts and often means different things. In the context of distributed computing, dependability refers to the ability of a distributed system to provide correct services to its users despite various threats to the system such as undetected software defects, hardware failures, and malicious attacks.
To reason about the dependability of a distributed system, we need to clearly model the system itself as well as the threats to the system [2]. We also define common attributes of dependable distributed systems and metrics for evaluating the dependability of a distributed system.
A system is designed to provide a set of services to its users (often referred to as clients). Each service has an interface that a client could use to request the service. What the system should do for each service is defined as a set of functions according to a functional specification for the system. The status of a system is determined by its state. The state of a practical system is usually very complicated. A system may consist of one or more processes spanning over one or more nodes, and each process might consist of one or more threads. The state of the system is determined collectively by the state of the processes and threads in the system. The state of a process typically consists of the values of its registers, stack, heap, file descriptors, and the kernel state. Part of the state might become visible to the users of the system via information contained in the responses to the users’ requests. Such state is referred to as external state and is normally an abstract state defined in the functional specification of the system. The remaining part of the state that is not visible to users is referred to as internal state. A system can be recovered to where it was before a failure if its state was captured and not lost due to the failure (for example, if the state is serialized and written to stable storage).
From the structural perspective, a system consists of one or more components (such as nodes or processes), and a system always has a boundary that separates the system from its environment. Here, environment refers to all other systems that the current system interacts with. Note that what we refer to as a system is always relative to the current context. A component of a (larger) system is itself a system when we want to study its behavior, and it may in turn have its own internal structure.
Whether or not a system is providing correct services is judged by whether or not the system is performing the functions defined in the functional specification for the system. When a system is not functioning according to its functional specification, we say a service failure (or simply failure) has occurred. The failure of a system is caused by part of its state taking on wrong values, i.e., by errors in its state. We hypothesize that the errors are caused by some faults [6]. Therefore, the threats to the dependability of a system are modeled as various faults.
A fault might not always exhibit itself and cause an error. In particular, a software defect (often referred to as a software bug) might not be revealed until the code that contains the defect is exercised under a certain condition. For example, if a shared variable is not protected by a lock in a multithreaded application, the fault (often referred to as a race condition) does not exhibit itself unless two or more threads try to update the shared variable concurrently. As another example, if there is no bounds check on access to an array, the fault does not show up until a process tries to access the array with an out-of-bounds index. When a fault does not exhibit itself, we say the fault is dormant. When the triggering condition is met, the fault is activated.
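As a minimal illustration (a hypothetical Java sketch, not taken from the book), the following counter contains a dormant race-condition fault: the defect is always present, but it is activated only when two threads happen to update the shared variable at nearly the same time.

```java
// Hypothetical example of a dormant fault: the missing synchronization on
// `count` is a defect, but it causes no error until two threads update the
// shared variable concurrently.
public class Counter {
    private int count = 0;

    // Defective: not synchronized, so concurrent increments can be lost.
    public void increment() {
        count = count + 1;   // read-modify-write is not atomic
    }

    public int get() {
        return count;
    }

    public static void main(String[] args) throws InterruptedException {
        Counter c = new Counter();
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                c.increment();
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Expected 200000; any smaller value means the dormant fault was activated.
        System.out.println("count = " + c.get());
    }
}
```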
When a fault is activated, it initially causes an error in the component that encompasses the defective area of the code. When that component interacts with other components of the system, the error may propagate to other components. When the errors propagate to the interface of the system and cause the service provided to a client to deviate from the specification, a service failure occurs. Due to the recursive nature of common system composition, the failure of one system may cause a fault in a larger system when the former constitutes a component of the latter, as shown in Figure 1.1. This relationship between fault, error, and failure is referred to as the “chain of threats” in [2]. Hence, in the literature, the terms “fault” and “failure” are often used interchangeably.
Figure 1.1 An example of a chain of threats with two levels of recursion.
Of course, not all failures can be analyzed with the above chain of threats. For example, a power outage of the entire system would immediately cause the failure of the system.
Faults can be classified based on different criteria; the most common classifications include:
When a system fails, it is desirable to avoid catastrophic consequences, such as the loss of life. The consequences of the failure of a system can be alleviated by incorporating dependability mechanisms into the system such that when it fails, it stops responding to requests (such systems are referred to as fail-stop systems) or, if this is impossible, it returns consistent wrong values, instead of inconsistent values, to all components that it may interact with. If the failure of a system does not cause great harm either to human life or to the environment, we call such a system a fail-safe system. Usually, a fail-safe system defines a set of safe states. When a fail-safe system can no longer operate according to its specification due to faults, it transitions to one of the predefined safe states. For example, the computer system that is used to control a nuclear power plant must be a fail-safe system.
Perhaps counterintuitively, it is often desirable for a system to halt its operation immediately when it is in an error state or encounters an unexpected condition. The software engineering practice that ensures such behavior is called fail fast [8]. The benefits of the fail-fast practice are that it enables early detection of software faults and makes their diagnosis easier. Once an error has propagated to many other components, it is much harder to pinpoint the source of the problem.
A dependable system has a number of desirable attributes, and some of these attributes can be used as evaluation metrics for the system. We classify these attributes into two categories: (1) those that are fundamental to, and an immediate concern of, all distributed systems, including availability, reliability, and integrity; and (2) those that are secondary and may not be of immediate concern to, or applicable to, all systems, such as maintainability and safety.
The availability and reliability of a system can be used as evaluation metrics. Other attributes are normally not used as evaluation metrics because it is difficult to quantify the integrity, maintainability, and safety of a distributed system.
Availability is a measure of the readiness of a dependable system at a point in time, i.e., the probability that, when a client needs to use a service provided by the system, the system is there to provide the service to the client. The availability of a system is determined by two factors: the mean time to failure (MTTF) and the mean time to repair (MTTR).
Availability is defined as MTTF/(MTTF + MTTR). Hence, the larger the MTTF, the higher the availability of a system. Similarly, the smaller the MTTR, the higher the availability of the system.
The availability of a system is typically expressed in terms of the number of 9s. For example, if a system is claimed to offer five 9s availability, it means that the system will be available with a probability of 99.999%, i.e., the system has a 10^-5 probability of being unavailable when a client wants to access the service offered by the system at any time, which means that the system may have at most 5.256 minutes of downtime per year.
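As a quick check on the five 9s figure (assuming a 365-day year):

\[
(1 - 0.99999) \times 365 \times 24 \times 60 \;=\; 10^{-5} \times 525{,}600 \;\approx\; 5.256 \text{ minutes of downtime per year}.
\]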
Integrity refers to the capability of a system to protect its state from being compromised under various threats. In dependable computing research, integrity is typically translated into the consistency of server replicas, if redundancy is employed. As long as the number of faulty replicas does not exceed a pre-defined threshold, the consistency of the remaining replicas would naturally imply the integrity of the system.
Maintainability refers to the capability of a system to evolve after it is deployed. Once a software fault is detected, it is desirable to be able to apply a patch that repairs the system without having to uninstall the existing system and then reinstall an updated system. The same patching/software update mechanism may be used to add new features or improve the performance of the existing system. Ideally, we want to be able to perform the software update without having to shut down the running system (often referred to as live upgrade or live update), which is already a standard feature of many operating systems for patching non-kernel components. Live upgrade has also been achieved via replication in some distributed systems [10].
Safety means that when a system fails, it does not cause catastrophic consequences, i.e., the system must be fail-safe. Systems that are used to control operations that may cause catastrophic consequences, such as nuclear power plants, or endanger human lives, such as hospital operation rooms, must bear the safety attribute. The safety attribute is not important for systems that are not operating in such environments, such as for e-commerce.
There are three primary approaches to improving the dependability of distributed systems: (1) fault avoidance: build and use high quality software components and hardware that are less prone to failures; (2) fault detection and diagnosis: while crash faults are trivial to detect, components in a practical system might fail in various ways other than crashing, and if such faults are not detected, the integrity of the system cannot be guaranteed; and (3) fault tolerance: a system is able to recover from various faults without service interruption if it employs sufficient redundancy to mask the failures of a portion of its components, or with minimal service interruption if it uses less costly dependability means such as logging and checkpointing.
For software components, fault avoidance aims to ensure correct design specification and correct implementation before a distributed system is released. This objective can be achieved by employing standard software engineering practices, for example:
Fault detection is a crucial step in ensuring the dependability of a system. Crash faults are relatively trivial to detect; for example, we can periodically probe each component to check on its health. If no response is received after several consecutive probes, the component may be declared as having crashed. However, components in a system might fail in various other ways and might still respond promptly to each probe after they have failed. It is nontrivial to detect such faults, especially in a large distributed system. Diagnosis is required to determine that a fault has indeed occurred and to localize the source of the fault (i.e., pinpoint the faulty component). To accomplish this, the distributed system is modeled, and sophisticated statistical tools are often used [3]. Some of the approaches to fault detection and diagnosis are introduced in Chapter 3.
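A minimal sketch of the probe-based crash detection just described (the isAlive() probe is a hypothetical placeholder; a real system might use a ping, an RPC, or an HTTP health check):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of crash-fault detection by periodic probing: if several consecutive
// probes go unanswered, the monitored component is declared crashed.
public class ProbeDetector {
    private static final int MAX_MISSED_PROBES = 3;
    private static final long PROBE_INTERVAL_MS = 1000;

    private int missedProbes = 0;

    // Placeholder probe: returns true if the monitored component responds.
    private boolean isAlive() {
        return true; // replace with an actual health check
    }

    public void monitor() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            if (isAlive()) {
                missedProbes = 0;
            } else if (++missedProbes >= MAX_MISSED_PROBES) {
                // Several consecutive probes failed: declare the component crashed.
                System.out.println("Component declared crashed");
                scheduler.shutdown();
            }
        }, 0, PROBE_INTERVAL_MS, TimeUnit.MILLISECONDS);
    }
}
```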
Much progress has been made in modern programming language design to include some form of detection and handling of software faults, such as unexpected inputs or states. The most notable example is exception handling. A block of code can be enclosed in a try-catch construct. If an error condition occurs during the execution of the code, the catch block is executed automatically. Exceptions may also be propagated upward through the call chain. If an exception occurs and it is not handled by any developer-supplied code, the language runtime usually terminates the process.
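The following hypothetical Java fragment illustrates the try-catch construct described above (the parsePort routine and its bounds are illustrative only):

```java
// Illustration of detecting and handling an unexpected input with try-catch.
public class ExceptionHandlingExample {
    static int parsePort(String value) {
        // An unexpected input raises an exception instead of being silently ignored.
        int port = Integer.parseInt(value);
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        return port;
    }

    public static void main(String[] args) {
        try {
            int port = parsePort("99999");
            System.out.println("using port " + port);
        } catch (IllegalArgumentException e) {
            // The error is detected close to its source instead of silently
            // propagating to other components.
            System.err.println("invalid configuration: " + e.getMessage());
        }
        // An exception not caught by any developer-supplied handler would
        // terminate the process (fail fast).
    }
}
```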
The recovery block method, which is designed for software fault tolerance [7], may be considered an extension of the programming language exception handling mechanism. An important step in recovery blocks is acceptance testing, which is a form of fault detection. A developer is supposed to supply an acceptance test for each module of the system. When the acceptance test fails, a software fault is detected. Subsequently, an alternate block of code is executed, after which the acceptance test is evaluated again. Multiple alternate blocks of code may be provided to increase the robustness of the system.
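A minimal sketch of the recovery block pattern, assuming a hypothetical sorting module: a primary block, one alternate block, and a developer-supplied acceptance test (the routines and the test are illustrative placeholders, not from the book).

```java
import java.util.Arrays;

// Recovery block sketch: run the primary block, check the acceptance test,
// and fall back to an alternate block if the test fails.
public class RecoveryBlock {
    // Acceptance test: result must have the same length as the input and be nondecreasing.
    static boolean acceptanceTest(int[] input, int[] result) {
        if (result == null || result.length != input.length) return false;
        for (int i = 1; i < result.length; i++) {
            if (result[i - 1] > result[i]) return false;
        }
        return true;
    }

    static int[] primarySort(int[] input) {
        int[] copy = Arrays.copyOf(input, input.length);
        Arrays.sort(copy);            // primary (preferred) implementation
        return copy;
    }

    static int[] alternateSort(int[] input) {
        int[] copy = Arrays.copyOf(input, input.length);
        // Alternate implementation: simple insertion sort.
        for (int i = 1; i < copy.length; i++) {
            int key = copy[i], j = i - 1;
            while (j >= 0 && copy[j] > key) { copy[j + 1] = copy[j]; j--; }
            copy[j + 1] = key;
        }
        return copy;
    }

    static int[] sortWithRecoveryBlock(int[] input) {
        int[] result = primarySort(input);
        if (acceptanceTest(input, result)) return result;
        result = alternateSort(input);   // try the alternate block
        if (acceptanceTest(input, result)) return result;
        // All alternates exhausted: the module reports an error.
        throw new IllegalStateException("all alternates exhausted");
    }
}
```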
Once a fault is detected and localized, it should be isolated and removed from the system. Subsequently, the faulty component is either repaired or replaced. A repaired or replaced component can be readmitted to the system. To accommodate these steps, the system often needs to be reconfigured. In a distributed system, it is often necessary to have a notion of membership, i.e., each component is aware of a list of components that are considered part of the system and their roles. When a faulty component is removed from the system, a reconfiguration is carried out and a new membership is formed with the faulty component excluded. When the component is repaired or replaced, and readmitted to the system, it becomes part of the membership again.
A special case of fault removal is software patching and updates. Software faults and vulnerabilities may be removed via a software update when the original system is patched. Virtually all modern operating systems and software packages include the software update capability.
Robust software by itself is normally insufficient to deliver high dependability because of the possibility of hardware failures. Unless a distributed system is strictly stateless, simply restarting the system after a failure would not automatically restore its state to what it was before the failure. Hence, fault tolerance techniques are essential to raise the dependability of distributed systems to the next level.
There are different fault tolerance techniques that cater to different levels of dependability requirements. For applications that need high availability, but not necessarily high reliability, logging and checkpointing (the topic of Chapter 2), which incur minimal runtime overhead and use minimal extra resources, might be sufficient. More demanding applications could adopt the recovery-oriented computing techniques (the topic of Chapter 3). Both types of fault tolerance techniques rely on rollback recovery. After a failed system is restarted, the most recent correct state of the system (referred to as a checkpoint) is located in the log and the system is restored to this correct state.
An example scenario of rollback recovery is illustrated in Figure 1.2. When a system fails, it takes some time to detect the failure. Subsequently, the system is restarted and the most recent checkpoint in the log is used to recover the system back to that checkpoint. If there are logged requests, these requests are re-executed by the system, after which the recovery is completed. The system then resumes handling new requests.
Figure 1.2 Rollback recovery is enabled by periodically taking checkpoints and, usually, logging the requests received.
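The recovery sequence just described can be summarized in a short conceptual sketch (the Checkpoint, Request, and Application types are hypothetical placeholders standing in for application-specific state and logic):

```java
import java.util.List;

// Conceptual outline of rollback recovery: restore the latest checkpoint,
// replay the logged requests, then resume normal processing.
public class RollbackRecovery {
    interface Checkpoint {}
    interface Request {}
    interface StableStorage {
        Checkpoint latestCheckpoint();
        List<Request> requestsLoggedAfter(Checkpoint c);
    }
    interface Application {
        void restoreState(Checkpoint c); // roll the state back to the checkpoint
        void execute(Request r);         // re-execute a logged request
    }

    static void recover(Application app, StableStorage log) {
        Checkpoint c = log.latestCheckpoint();
        app.restoreState(c);                         // roll back to the last saved state
        for (Request r : log.requestsLoggedAfter(c)) {
            app.execute(r);                          // replay logged requests in order
        }
        // Recovery complete: the system can now resume handling new requests.
    }
}
```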
For a distributed system that requires high reliability, i.e., continuous correct service, redundant instances of the system must be used so that the system can continue operating correctly even if a portion of the redundant instances (referred to as replicas) fail. Using replicas also makes it possible to tolerate malicious faults, provided that the replicas fail independently. When a failed replica is repaired, it can be incorporated back into the system by rolling its state forward to the current state of the other replicas. This recovery strategy is called rollforward recovery.
An example scenario of rollforward recovery is shown in Figure 1.3. When the failure of a replica is detected, the replica is restarted (possibly after being repaired). To readmit the restarted replica into the system, a nonfaulty replica takes a checkpoint of its state and transfers the checkpoint to the recovering replica. The restarted replica can then roll its state forward using the received checkpoint, which represents the latest state of the system.
Figure 1.3 With redundant instances in the system, the failure of a replica can in some cases be masked and the system continues providing services to its clients without any disruption.
To avoid common mode failures (i.e., correlated faults), it helps if each replica executes a different version of the system code. This strategy is referred to as n-version programming [1]. Program transformation may also be used to achieve diversified replicas at a lower software development cost [4]. A special form of n-version programming appears in the recovery block method for software fault tolerance [7]. Instead of using different versions of the software in different replicas, each module of the system is equipped with a main version and one or more alternate versions. At the end of the execution of the main version, an acceptance test is evaluated. If the test fails, the first alternate version is executed and the acceptance test is evaluated again. This goes on until all alternate versions have been exhausted, in which case the module returns an error.
1. A. Avizienis and L. Chen. On the implementation of n-version programming for software fault tolerance during execution. In Proceedings of the IEEE International Computer Software and Applications Conference, pages 149–155, 1977.
2. A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33, 2004.
3. M. Y. Chen, E. Kidman, E. Fratkin, A. Fox, and E. Brewer. Pinpoint: Problem determination in large, dynamic internet services. In Proceedings of the 2002 International Conference on Dependable Systems and Networks, DSN ’02, pages 595–604, Washington, DC, USA, 2002. IEEE Computer Society.
4. M. Franz. Understanding and countering insider threats in software development. In Proceedings of the International MCETECH Conference on e-Technologies, pages 81–90, January 2008.
5. L. Lamport, R. Shostak, and M. Pease. The byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4:382–401, 1982.
6. P. M. Melliar-Smith and B. Randell. Software reliability: The role of programmed exception handling. In Proceedings of an ACM conference on Language design for reliable software, pages 95–100, New York, NY, USA, 1977. ACM.
7. B. Randell and J. Xu. The evolution of the recovery block concept. In Software Fault Tolerance, pages 1–22. John Wiley & Sons Ltd, 1994.
8. J. Shore. Fail fast. IEEE Software, pages 21–25, September/October 2004.
9. A. S. Tanenbaum and M. V. Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2nd edition, 2006.
10. L. Tewksbury, L. Moser, and P. Melliar-Smith. Live upgrade techniques for corba applications. In New Developments in Distributed Applications and Interoperable Systems, volume 70 of IFIP International Federation for Information Processing, pages 257–271. Springer US, 2002.
Checkpointing and logging are the most essential techniques for achieving dependability in distributed systems [7]. By themselves, they provide a form of fault tolerance that is relatively easy to implement and incurs low runtime overhead. Although some information could be lost when a fault occurs (if only checkpointing is used) and the recovery time after a fault is typically longer than that of more sophisticated fault tolerance approaches, this approach may be sufficient for many applications. Furthermore, checkpointing and logging are used in dependability mechanisms at all levels.
A checkpoint of a distributed system refers to a copy of the system state [7]. If the checkpoint is available after the system fails, it can be used to recover the system to the state at the time the checkpoint was taken. Checkpointing refers to the action of taking a copy of the system state (periodically) and saving the checkpoint to stable storage that can survive the faults tolerated.
To recover the system to the point right before it failed, other recovery information must be logged in addition to periodic checkpointing. Typically, all incoming messages to the system are logged. Other nondeterministic events may have to be logged as well to ensure proper recovery.
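As a minimal sketch of the checkpointing step, assuming (for illustration only) that the interesting part of the process state is a serializable Java object; the temporary-file scheme and names are hypothetical, not part of any particular checkpointing library:

```java
import java.io.*;

// Take a checkpoint by serializing the process state to stable storage, and
// restore it later by deserializing the saved file.
public class Checkpointer {
    public static void takeCheckpoint(Serializable state, File stableStorage)
            throws IOException {
        // Write to a temporary file first, then rename, so that a crash during
        // checkpointing does not corrupt the previous checkpoint.
        File tmp = new File(stableStorage.getPath() + ".tmp");
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(tmp))) {
            out.writeObject(state);
            out.flush();
        }
        if (!tmp.renameTo(stableStorage)) {
            throw new IOException("failed to install new checkpoint");
        }
    }

    @SuppressWarnings("unchecked")
    public static <T> T restoreCheckpoint(File stableStorage)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(stableStorage))) {
            return (T) in.readObject();
        }
    }
}
```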
Checkpointing and logging provide a form of rollback recovery [7] because they can recover the system to a state prior to the failure. In contrast, there exist other approaches that accomplish roll-forward recovery, that is, a failed process can be recovered to the current state by incorporating process redundancy into the system. However, roll-forward recovery protocols typically incur significantly higher runtime overhead and demand more physical resources.
In this section, we define the system model used by the checkpointing and logging algorithms introduced in this chapter. The algorithms are executed in a distributed system that consists of N processes. Processes within the system interact with each other by sending and receiving messages. These processes may also interact with the outside world through message exchanges. An input message to the distributed system from the outside world is typically a request message sent by a user of the system, and the output message from the system is the corresponding response message. An example distributed system consisting of 4 processes is shown in Figure 2.1.
Figure 2.1 An example distributed system.
In such a distributed system, a failure could occur at any process. However, it is assumed that when a process fails, it simply stops execution and loses all its volatile state (i.e., the fail-stop model [18] is used). In addition, it is assumed that any two processes can establish a reliable connection (such as a TCP connection) for communication. Even though the network may lose messages, the reliable channel can effectively mask such losses. Naturally, the reliable connection ensures the first-in first-out (FIFO) property between its two endpoints. This assumption also implies that the network does not partition, i.e., it does not prevent two or more processes in the system from interacting with each other for an extended period of time.
The state of an individual process is defined by its entire address space in an operating system. A generic checkpointing library (such as Condor [23]) normally saves the entire address space as a checkpoint of the process. Of course, not everything in the address space is interesting based on the application semantics. As such, the checkpoint of a process can be potentially made much smaller by exploiting application semantics.
The state of a distributed system is usually referred to as the global state of the system [5]. It is not a simple aggregation of the states of the processes in the distributed system, because the processes exchange messages with each other, which means that a process may causally depend on some other processes. Such dependency must be preserved in a global state. Assume that each process in the distributed system takes checkpoints periodically. This implies that we may not be able to use the latest set of checkpoints for proper recovery should a process fail, unless the checkpointing at different processes is coordinated [5]. To see why, consider the three scenarios illustrated in Figure 2.2, where the global state is constructed using the three checkpoints, C0, C1, C2, taken at processes P0, P1, and P2, respectively.
Figure 2.2 Consistent and inconsistent global state examples.
Figure 2.2(a) shows a scenario in which the checkpoints taken by different processes are incompatible, and hence cannot be used to recover the system upon a failure. Let’s see why. In this scenario, P0 sends a message m0 to P1, and P1 subsequently sends a message m1 to P2. Therefore, the state of P2 potentially depends on the state of P1 after it has received m1, and the state of P1 may depend on that of P0 once it receives m0. The checkpoint C0 is taken before P0 sends the message m0 to P1, whereas the checkpoint C1 is taken after P1 has received m0. The checkpoints are not compatible because C1 reflects the receiving of m0 while C0 does not reflect the sending of m0, that is, the dependency is broken. Similarly, C2 reflects the receiving of m1 while C1 does not reflect the sending of m1.
To understand the problem better, consider the following example. Assume that P0 and P1 represent two bank accounts, A and B, respectively. The purpose of m0 is to deposit $100 to account B after P0 has debited account A. P0 takes a checkpoint C0 before the debit operation, and P1 takes a checkpoint C1 after it has received and processed the deposit request (i.e., m0), as illustrated in Figure 2.2(a). If P0 crashes after sending the deposit request (m0), and P1 crashes after taking the checkpoint C1, then upon recovery, P1’s state would reflect a deposit of $100 (from account A) while P0’s state would not reflect the corresponding debit operation. Consequently, $100 would appear to have come from nowhere, which obviously is not what happened. In essence, the global state constructed using the wrong set of checkpoints does not correspond to a state that could have happened starting from the initial state of the distributed system. Such a global state is referred to as an inconsistent global state.
Next, let’s look at a scenario (shown in Figure 2.2(b)) in which the set of checkpoints can be used to properly recover the system to an earlier state prior to the failure. The checkpoint (C0) taken by P0 reflects the sending event of m0. The checkpoint C1 is taken by P1 after it has received m0; therefore, the dependency on P0 is captured by C1. Similarly, the dependency of P2 on P1 is preserved by the checkpoint C2 taken by P2. Such a global state is an example of a consistent global state. Of course, the execution after the checkpoints, such as the sending and receiving of m2 and m3, will be lost upon recovery.
The scenario described in Figure 2.2(c) is the most subtle one. In this scenario, P0 takes a checkpoint after it has sent message m0, while P1 takes a checkpoint before it receives m0 but after it has sent m1, and P2 takes a checkpoint before it receives m1. This means that the checkpoint C0 reflects the state change resulting from sending m0, whereas C1 does not incorporate the state change caused by the receiving of m0. Consequently, this set of checkpoints alone cannot be used to recover the system after a failure because m0 and m1 would be lost. However, the global state reconstructed using such a set of checkpoints would still qualify as a consistent global state because it is a state that could have happened, i.e., one in which messages m0 and m1 are still in transit to their destinations. To accommodate this scenario, an additional type of state, referred to as the channel state, is introduced as part of the distributed system state [5].
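The consistency condition illustrated by these three scenarios can be stated compactly: a set of checkpoints is inconsistent if some checkpoint records the receipt of a message whose sending is not recorded by the sender’s checkpoint; messages that are sent but not yet received are allowed, since they simply correspond to nonempty channel states. A hypothetical sketch of this check follows (the per-process sent/received bookkeeping is illustrative only, not part of any particular checkpointing protocol):

```java
import java.util.Map;
import java.util.Set;

// Check whether a set of checkpoints forms a consistent global state:
// every message recorded as received must also be recorded as sent.
public class ConsistentCut {
    public static boolean isConsistent(
            Map<String, Set<String>> sentAtCheckpoint,      // process -> ids of messages sent before its checkpoint
            Map<String, Set<String>> receivedAtCheckpoint,  // process -> ids of messages received before its checkpoint
            Map<String, String> senderOf) {                 // message id -> sending process
        for (Set<String> received : receivedAtCheckpoint.values()) {
            for (String msg : received) {
                String sender = senderOf.get(msg);
                // Received but not sent in the recorded cut: dependency broken (Figure 2.2(a)).
                if (!sentAtCheckpoint.getOrDefault(sender, Set.of()).contains(msg)) {
                    return false;
                }
            }
        }
        // Messages sent but not yet received are allowed: they correspond to
        // in-transit messages, i.e., nonempty channel states (Figure 2.2(c)).
        return true;
    }
}
```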
To define the channel state properly, it is necessary to provide a more rigorous (and abstract) definition of a distributed system. A distributed system consists of two types of components [5]: processes, and unidirectional channels that connect pairs of processes.
A pair of neighboring processes are always connected by a pair of channels, one in each direction. An event (such as the sending or receiving of a message) at a process may change the state of the process and the state of the channel it is associated with, if any. For example, the injection of a message into a channel may change the state of the channel from empty to one that contains the message itself.
Using this revised definition, the channel states in the third scenario would consist of the two in-transit messages m0 and m1. If the channel states can be properly recorded in addition to the checkpoints in this scenario, the recovery can be made possible (i.e., m0 will be delivered to P1 and m1 will be delivered to P2 during recovery).
Checkpoint-based protocols can only recover the system to the most recent consistent global state that has been recorded; any execution that happened afterwards is lost. Logging can be used to recover the system to the state right before the failure, provided that all events that could potentially change the state of the processes are logged and the log is available upon recovery. This relies on what is referred to as the piecewise deterministic assumption [21]. According to this assumption, all nondeterministic events can be identified, and sufficient information (referred to as a determinant [1]) must be logged for each such event. The most obvious example of a nondeterministic event is the receiving of a message. Other examples include system calls, timeouts, and the receipt of interrupts. In this chapter, we typically assume that the only nondeterministic events are the receiving of messages. Note that the sending of a message is not a nondeterministic event, i.e., it is determined by the nondeterministic events that precede it and by the initial state of the process [7].
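A minimal sketch of logging the determinant of a message-receipt event under the piecewise deterministic assumption (the Determinant fields chosen here, sender, receive sequence number, and payload, are one plausible choice for illustration; the exact contents depend on the logging protocol being used):

```java
import java.io.*;

// Log the determinant of each message-receipt event to stable storage so that
// the delivery order can be reproduced during recovery.
public class ReceiveLogger {
    static class Determinant implements Serializable {
        final String sender;
        final long receiveSeqNo;   // position of this receive in the delivery order
        final byte[] payload;
        Determinant(String sender, long receiveSeqNo, byte[] payload) {
            this.sender = sender;
            this.receiveSeqNo = receiveSeqNo;
            this.payload = payload;
        }
    }

    private final ObjectOutputStream log;
    private long nextSeqNo = 0;

    ReceiveLogger(File stableStorage) throws IOException {
        this.log = new ObjectOutputStream(new FileOutputStream(stableStorage, true));
    }

    // In a pessimistic logging scheme, the determinant is forced to stable
    // storage before the message is delivered to the application.
    synchronized void logReceive(String sender, byte[] payload) throws IOException {
        log.writeObject(new Determinant(sender, nextSeqNo++, payload));
        log.flush();
    }
}
```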
