Designing High Availability Systems

Zachary Taylor

Description

A practical, step-by-step guide to designing world-class, high availability systems using both classical and DFSS reliability techniques

Whether designing telecom, aerospace, automotive, medical, financial, or public safety systems, every engineer aims for the utmost reliability and availability in the systems he or she designs. But between the dream of world-class performance and reality falls the shadow of complexities that can bedevil even the most rigorous design process. While there is an array of robust predictive engineering tools, there has been no single-source guide to understanding and using them ... until now.

Offering a case-based approach to designing, predicting, and deploying world-class high-availability systems from the ground up, this book brings together the best classical and DFSS reliability techniques. Although it focuses on technical aspects, this guide considers the business and market constraints that require that systems be designed right the first time.

Written in plain English and following a step-by-step "cookbook" format, Designing High Availability Systems:

  • Shows how to integrate an array of design/analysis tools, including Six Sigma, Failure Analysis, and Reliability Analysis
  • Features many real-life examples and case studies describing predictive design methods, tradeoffs, risk priorities, "what-if" scenarios, and more
  • Delivers numerous high-impact takeaways that you can apply to your current projects immediately
  • Provides access to MATLAB programs for simulating problem sets presented, along with PowerPoint slides to assist in outlining the problem-solving process

Designing High Availability Systems is an indispensable working resource for system engineers, software/hardware architects, and project teams working in all industries.

You can read this e-book in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 569

Year of publication: 2013




Table of Contents

IEEE Press

Title page

Copyright page

Dedication

Preface

MATLAB®

Acknowledgments

List of Abbreviations

Chapter 1: Introduction

Chapter 2: Initial Considerations for Reliability Design

2.1 The Challenge

2.2 Initial Data Collection

2.3 Where Do We Get MTBF Information?

2.4 MTTR and Identifying Failures

2.5 Summary

Chapter 3: A Game of Dice: An Introduction to Probability

3.1 Introduction

3.2 A Game of Dice

3.3 Mutually Exclusive and Independent Events

3.4 Dice Paradox Problem and Conditional Probability

3.5 Flip a Coin

3.6 Dice Paradox Revisited

3.7 Probabilities for Multiple Dice Throws

3.8 Conditional Probability Revisited

3.9 Summary

Chapter 4: Discrete Random Variables

4.1 Introduction

4.2 Random Variables

4.3 Discrete Probability Distributions

4.4 Bernoulli Distribution

4.5 Geometric Distribution

4.6 Binomial Coefficients

4.7 Binomial Distribution

4.8 Poisson Distribution

4.9 Negative Binomial Random Variable

4.10 Summary

Chapter 5: Continuous Random Variables

5.1 Introduction

5.2 Uniform Random Variables

5.3 Exponential Random Variables

5.4 Weibull Random Variables

5.5 Gamma Random Variables

5.6 Chi-Square Random Variables

5.7 Normal Random Variables

5.8 Relationship between Random Variables

5.9 Summary

Chapter 6: Random Processes

6.1 Introduction

6.2 Markov Process

6.3 Poisson Process

6.4 Deriving the Poisson Distribution

6.5 Poisson Interarrival Times

6.6 Summary

Chapter 7: Modeling and Reliability Basics

7.1 Introduction

7.2 Modeling

7.3 Failure Probability and Failure Density

7.4 Unreliability, F(t)

7.5 Reliability, R(t)

7.6 MTTF

7.7 MTBF

7.8 Repairable System

7.9 Nonrepairable System

7.10 MTTR

7.11 Failure Rate

7.12 Maintainability

7.13 Operability

7.14 Availability

7.15 Unavailability

7.16 Five 9s Availability

7.17 Downtime

7.18 Constant Failure Rate Model

7.19 Conditional Failure Rate

7.20 Bayes's Theorem

7.21 Reliability Block Diagrams

7.22 Summary

Chapter 8: Discrete-Time Markov Analysis

8.1 Introduction

8.2 Markov Process Defined

8.3 Dynamic Modeling

8.4 Discrete Time Markov Chains

8.5 Absorbing Markov Chains

8.6 Nonrepairable Reliability Models

8.7 Summary

Chapter 9: Continuous-Time Markov Systems

9.1 Introduction

9.2 Continuous-Time Markov Processes

9.3 Two-State Derivation

9.4 Steps to Create a Markov Reliability Model

9.5 Asymptotic Behavior (Steady-State Behavior)

9.6 Limitations of Markov Modeling

9.7 Markov Reward Models

9.8 Summary

Chapter 10: Markov Analysis: Nonrepairable Systems

10.1 Introduction

10.2 One Component, No Repair

10.3 Nonrepairable Systems: Parallel System with No Repair

10.4 Series System with No Repair: Two Identical Components

10.5 Parallel System with Partial Repair: Identical Components

10.6 Parallel System with No Repair: Nonidentical Components

10.7 Summary

Chapter 11: Markov Analysis: Repairable Systems

11.1 Repairable Systems

11.2 One Component with Repair

11.3 Parallel System with Repair: Identical Component Failure and Repair Rates

11.4 Parallel System with Repair: Different Failure and Repair Rates

11.5 Summary

Chapter 12: Analyzing Confidence Levels

12.1 Introduction

12.2 pdf of a Squared Normal Random Variable

12.3 pdf of the Sum of Two Random Variables

12.4 pdf of the Sum of Two Gamma Random Variables

12.5 pdf of the Sum of n Gamma Random Variables

12.6 Goodness-of-Fit Test Using Chi-Square

12.7 Confidence Levels

12.8 Summary

Chapter 13: Estimating Reliability Parameters

13.1 Introduction

13.2 Bayes's Estimation

13.3 Example of Estimating Hardware MTBF

13.4 Estimating Software MTBF

13.5 Revising Initial MTBF Estimates and Trade-offs

13.6 Summary

Chapter 14: Six Sigma Tools for Predictive Engineering

14.1 Introduction

14.2 Gathering Voice of Customer (VOC)

14.3 Processing Voice of Customer

14.4 Kano Analysis

14.5 Analysis of Technical Risks

14.6 Quality Function Deployment (QFD) or House of Quality

14.7 Program Level Transparency of Critical Parameters

14.8 Mapping DFSS Techniques to Critical Parameters

14.9 Critical Parameter Management (CPM)

14.10 First Principles Modeling

14.11 Design of Experiments (DOE)

14.12 Design Failure Modes and Effects Analysis (DFMEA)

14.13 Fault Tree Analysis

14.14 Pugh Matrix

14.15 Monte Carlo Simulation

14.16 Commercial DFSS Tools

14.17 Mathematical Prediction of System Capability Instead of “Gut Feel”

14.18 Visualizing System Behavior Early in the Life Cycle

14.19 Critical Parameter Scorecard

14.20 Applying DFSS in Third-Party Intensive Programs

14.21 Summary

Chapter 15: Design Failure Modes and Effects Analysis

15.1 Introduction

15.2 What Is Design Failure Modes and Effects Analysis (DFMEA)?

15.3 Definitions

15.4 Business Case for DFMEA

15.5 Why Conduct DFMEA?

15.6 When to Perform DFMEA

15.7 Applicability of DFMEA

15.8 DFMEA Template

15.9 DFMEA Life Cycle

15.10 The DFMEA Team

15.11 DFMEA Advantages and Disadvantages

15.12 Limitations of DFMEA

15.13 DFMEAs, FTAs, and Reliability Analysis

15.14 Summary

Chapter 16: Fault Tree Analysis

16.1 What Is Fault Tree Analysis?

16.2 Events

16.3 Logic Gates

16.4 Creating a Fault Tree

16.5 Fault Tree Limitations

16.6 Summary

Chapter 17: Monte Carlo Simulation Models

17.1 Introduction

17.2 System Behavior Over Mission Time

17.3 Reliability Parameter Analysis

17.4 A Worked Example

17.5 Component and System Failure Times Using Monte Carlo Simulations

17.6 Limitations of Using Nontime-Based Monte Carlo Simulations

17.7 Summary

Chapter 18: Updating Reliability Estimates: Case Study

18.1 Introduction

18.2 Overview of the Base Station Controller—Data Only (BSC-DO) System

18.3 Downtime Calculation

18.4 Calculating Availability from Field Data Only

18.5 Assumptions Behind Using the Chi-Square Methodology

18.6 Fault Tree Updates from Field Data

18.7 Summary

Chapter 19: Fault Management Architectures

19.1 Introduction

19.2 Faults, Errors, and Failures

19.3 Fault Management Design

19.4 Repair versus Recovery

19.5 Design Considerations for Reliability Modeling

19.6 Architecture Techniques to Improve Availability

19.7 Redundancy Schemes

19.8 Summary

Chapter 20: Application of DFMEA to Real-Life Example

20.1 Introduction

20.2 Cage Failover Architecture Description

20.3 Cage Failover DFMEA Example

20.4 DFMEA Scorecard

20.5 Lessons Learned

20.6 Summary

Chapter 21: Application of FTA to Real-Life Example

21.1 Introduction

21.2 Calculating Availability Using Fault Tree Analysis

21.3 Building the Basic Events

21.4 Building the Fault Tree

21.5 Steps for Creating and Estimating the Availability Using FTA

21.6 Summary

Chapter 22: Complex High Availability System Analysis

22.1 Introduction

22.2 Markov Analysis of the Hardware Components

22.3 Building a Fault Tree from the Hardware Markov Model

22.4 Markov Analysis of the Software Components

22.5 Markov Analysis of the Combined Hardware and Software Components

22.6 Techniques for Simplifying Markov Analysis

22.7 Summary

References

Index

IEEE Press

445 Hoes Lane

Piscataway, NJ 08854

IEEE Press Editorial Board 2013

John Anderson, Editor in Chief

Linda Shafer, Saeid Nahavandi, George Zobrist, George W. Arnold, Om P. Malik, Tariq Samad, Ekram Hossain, Mary Lanzerotti, Dmitry Goldgof

Kenneth Moore, Director of IEEE Book and Information Services (BIS)

Technical Reviewers

Thomas Garrity

Michael D. Givot

Olli Salmela

Copyright © 2014 by The Institute of Electrical and Electronics Engineers, Inc.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved

Published simultaneously in Canada

MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See www.mathworks.com/trademarks for a list of additional trademarks. The MathWorks Publisher Logo identifies books that contain MATLAB® content. Used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book or in the software downloadable from http://www.wiley.com/WileyCDA/WileyTitle/productCd-047064477X.html and http://www.mathworks.com/matlabcentral/fileexchange/?term=authored%3A80973. The book’s or downloadable software’s use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular use of the MATLAB® software or related products.

For MATLAB® and Simulink® product information, or information on other related products, please contact:

The MathWorks, Inc.

3 Apple Hill Drive

Natick, MA 01760-2098 USA

Tel: 508-647-7000

Fax: 508-647-7001

E-mail: [email protected]

Web: www.mathworks.com

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Taylor, Zachary, 1959–

Designing high availability systems : design for Six Sigma and classical reliability techniques with practical real-life examples / Zachary Taylor, Subramanyam Ranganathan.

pages cm

ISBN 978-1-118-55112-7 (cloth)

1. Reliability (Engineering) 2. Systems engineering–Case studies. 3. Six sigma (Quality control standard) I. Ranganathan, Subramanyam. II. Title.

TA169.T39 2013

658.4'013–dc23

2013011388

To my wife Soheila, for her unwavering support

Zachary Taylor

To the Lotus Feet of Sringeri Jagadguru Sri. Abhinava Vidyatheertha Mahaswamigal, To the Lotus Feet of Sri. R. M. Umesh, whose divine grace brought me into the world of engineering, and to my parents Smt. Shantha Ranganathan and Sri. V. Ranganathan

Subramanyam Ranganathan

Preface

Even as you begin to browse this book, you may be wondering: What's in it for me? Will it help solve my design challenges? Will it give my product a more competitive edge? Will I have immediate takeaways I can apply at work? How much will it improve my capability to develop highly available and reliable systems? Can my team benefit practically from this book? Can I immediately adopt some of the techniques as best practices? Can it help deliver higher quality faster?

Whether you are a student, designer, architect, system engineer, mid-level manager, senior manager, director, VP, CEO, entrepreneur, or just someone with an intellectual curiosity, the techniques described in this book will provide you with a winning edge for designing world-class high-availability solutions.

The intent of this book is to bring you a straightforward, crisp, and practical approach to designing high-availability systems from the ground up, that is, systems in which high availability is an integral critical design element and differentiator, as well as a customer requirement. Typical business segments for these systems include telecom, automotive, medical, manufacturing, aerospace, financial, defense, and public safety. These systems typically consist of high reliability hardware, embedded and off-the-shelf software, multisite, multithreaded distributed processing environments, complex real-time applications, and high performance capabilities.

Though high availability and reliability are typically “must-haves” and taken for granted, designing such systems is usually complex and difficult for a variety of reasons. The design can take many iterations, involving significant time, cost, and effort. This book attempts to bring together different practical techniques used in the industry to successfully design, predict, and deploy high availability systems with maximum productivity and reduced total costs.

Our intent is to enable readers to quickly apply practical tools and techniques to projects within their organizations to realize potential benefits regardless of the current phase of their development project. Benefits include, but are not limited to, higher customer satisfaction, superior product differentiation, and delivery of high reliability and high performance systems to customers within shorter cycle times at lower overall cost.

Having worked in the telecommunication and aerospace industries for many years developing mission-critical and safety-critical embedded systems using these proven techniques, the authors strongly feel that practitioners will be able to employ this as a practical guide to designing high availability systems in a systematic and methodical fashion.

System engineers, architects, and designers who are driven to design high availability systems that are best in class can benefit immensely by employing the classical and Six Sigma tools in this book for predictive engineering. While the book focuses in-depth on the technical aspects, it is also sensitive to the underlying business frameworks, which demand designing a system right the first time in the face of market constraints. We use real-life examples throughout this book to explore predictive design methods, trade-offs, risk analysis, and “what-if” scenarios, so that the architect can realize the most effective impact to system design.

Designing high availability systems also requires some skill in the more sophisticated arts of probability theory. A system is available until something goes wrong. When something bad happens, we did not want it to happen, and we certainly did not expect it to happen. So not only do we need to consider how to design a system that minimizes the probability of something going seriously wrong and disrupting operation, but we also need to quantify, to some degree of confidence, the likelihood of failure events occurring and how we can proactively minimize their impact.

Reliability and availability are intimately intertwined with probability theory, and probability theory itself can be a difficult subject. But to fully understand how to design and manage highly available systems, we need to delve into the world of probability theory.

Many books on probability theory exist, and the reader is encouraged to explore some of the books recommended in the references. The authors believe that a firm understanding of the key aspects of theory is critical to understanding the application, as well as the limitations, of practical applications of theory. Therefore, we focus on those topics of probability and reliability theory that have the most influence on the practical applications of reliability theory. This includes exploring some concepts in depth, including proofs when needed for clarity.

Many approaches to presenting probability theory have been employed. Some make use of the typical dice, card, and coin problems that help the reader understand probability theory, but these often fall short on the question of how to make the quantum leap from dice games to complex computer system redundancy strategies. Other texts take the approach that dice games are too simplistic and do not relate to the real-world problems we are after; dice and card games are dispensed with, and we are soon immersed in a set of complex equations that at first glance can be intimidating and difficult to follow for the uninitiated. In this case, we must somehow relate the theory back to the general problems we face in architecting a highly available system. Another technique is to present the basic set of classical probability equations and ask the reader to accept these equations as a matter of faith and move on. Finally, some texts take a purely theoretical approach in which each equation is painstakingly derived from a set of axioms and theorems previously laid out; practical real-life examples are often left as an exercise for the reader, and the derivation process inevitably collapses a few steps due to space constraints or assumptions about the reader's knowledge.

This book takes a very different approach. We have uniquely blended classical and more recent Design for Six Sigma (DFSS) techniques, providing the practitioner with a broader repertoire of tools from which the designer or analyst can choose. We derive many of the equations that form the foundation of reliability theory. We believe these derivations are important for understanding the application of the theory, and we tackle the relevant mathematical foundation with some rigor. It is our firm belief that these derivations will be valuable to the reader by providing better foundational insight into the underpinnings of reliability theory. We then follow up with increasingly practical applications. The formalistic approach is avoided, since our goal is to help the practitioner not only apply a certain equation but also understand why it is valid. This is important, since the practitioner who wishes to become an expert in the field needs to know why a technique works, what its limitations are, what assumptions it requires, and at what point it should be discarded for another approach.

Readers who have a strong background in reliability and probability theory may choose to skip those foundation chapters in which probability and reliability theory are introduced along with derivations of pertinent formulas, and instead jump straight to the practical applications and techniques. Practitioners will find many relevant examples of applications of classical reliability theory and DFSS techniques throughout the book.

The authors' goal is to deliver the following:

  • Application to a Broad Range of Industries: Telecom, automotive, medical, manufacturing, aerospace, financial, and information systems, such as fly-by-wire avionics, large telecommunication networks, life-saving medical devices, critical retail or financial transaction management servers, or other systems that involve innovative cutting-edge design and technology relevant to everyday quality of life.
  • Practical Examples and Lucid Explanations: Complex concepts are described in simple, easy-to-understand language, blending real-life examples and including step-by-step procedures and case studies.
  • Relevant Topics: We bring together topics that are relevant for high availability design while remaining sensitive to demands on the reader's time.
  • Comprehensive yet Focused Diagrams: A wealth of illustrations and diagrams for in-depth understanding is made available to the reader. We attempt to bridge theory with practice and have confined the derivations to the key theoretical aspects most likely to be applicable in practice.
  • Immediate Takeaways with High Impact: Readers can start applying techniques immediately to their projects at work. This will also enable them to quickly see results, communicate success stories, and share best practices.

The authors hope that the book serves both the student and professional community to enrich their understanding as well as help them realize their objectives for designing and deploying superior high availability products.

MATLAB®

MATLAB is one of the most popular applications for mathematical calculation and plotting. The MATLAB programs used in several of the examples in this book are available on the book's website: http://booksupport.wiley.com. Enter the ISBN 9781118551127 to access these files.

Acknowledgments

The authors would like to thank the following reviewers who provided valuable comments on the manuscript: Dr. Thomas Garrity, Department of Mathematics at Williams College; Olli Salmela, Nokia Siemens Networks; and Michael D. Givot, Microsoft Corporation.

Numerous people, including present and past colleagues, have directly or indirectly contributed to enriching the content of this book. Greg Freeland worked on the session capacity design improvement illustrated in this book. Andy Moreno worked on the Paging Retries design improvement. Tim Klandrud worked on a complex critical parameter design prediction effort. Several teams have been part of the DFSS strategy and rollout, including DFMEA and Fault Tree Analysis. Thanks also to mentors and colleagues, including Eric Maass, Richard Riemer, and Cvetan Redzic, especially in the DFSS area.

Many late nights and weekends were spent working on this book; the authors would like to thank their families for adjusting their schedules and being there with a smile. Special thanks to Hema Ramaswamy for her persistent encouragement and support throughout the creation of the book.

Finally, thanks are due to Motorola Solutions Inc. and Nokia Siemens Networks.

Zachary Taylor

Subramanyam Ranganathan

List of Abbreviations

ATCA  Advanced Telecommunications Computing Architecture
BIT  Built-In Test
BSC-DO  Base Station Controller—Data Only
BTS  Base Transceiver System
Cpk  Capability Metric
CDF  Cumulative Distribution Function
CDMA  Code Division Multiple Access
CPM  Critical Parameter Management
CTMC  Continuous Time Markov Chain
DF  Degrees of Freedom
DFMEA  Design Failure Modes and Effects Analysis
DFSS  Design for Six Sigma
DMAIC  Define Measure Analyze Improve Control
DOE  Design of Experiments
DTMC  Discrete Time Markov Chain
EVDO  EVolution Data Only
FIT  Failure in Time
FM  Fault Management
FMEA  Failure Modes and Effects Analysis
FRU  Field Replaceable Unit
FTA  Fault Tree Analysis
GOF  Goodness of Fit
GUI  Graphical User Interface
IEC  International Electrotechnical Commission
IP  Internet Protocol
KJ  Kawakita Jiro
LSL  Lower Specification Limit
MDT  Mean Downtime
MOL  Maintenance on Line
MRM  Markov Reward Model
MTBF  Mean Time between Failures
MTBV  Mean Time between Visits
MTFF  Mean Time to First Failure
MTTF  Mean Time to Failure
MTTR  Mean Time to Repair
NASA  National Aeronautics and Space Administration
NUD  New, Unique, Difficult
O&M  Operations and Maintenance
ODE  Ordinary Differential Equation
OOS  Out of Service
OS  Operating System
PAM  Process Advanced Mezzanine
PAND  Priority AND Gate
PDF  Probability Density Function
PMF  Probability Mass Function
QFD  Quality Function Deployment
RAM  Random Access Memory
RBAC  Role-Based Access Control
RBD  Reliability Block Diagram
RCA  Root Cause Analysis
RF  Radio Frequency
ROI  Return on Investment
RPN  Risk Priority Number
SCI  Slot Cycle Index
SSPD  Six Sigma Product Development
USL  Upper Specification Limit
VOC  Voice of Customer

Chapter 1

Introduction

We live in a complex and uncertain world. Need we say more? However, we can say quite a bit about some aspects of the randomness that governs the behavior of systems, in particular failure events. How can we predict failures? When will they occur? How will the system we are designing react to unexpected failures? Our task is to help identify possible failure modes, predict failure frequencies and system behavior when failures occur, and prevent those failures from recurring. Determining how to model failures and build the model that represents our system can be a daunting task. If our model becomes too complex as we attempt to capture a variety of behaviors and failure modes, we risk making it difficult to understand and maintain, and we may end up modeling aspects of the system that provide only minimal useful information. On the other hand, if our model becomes too simple, we may leave out critical system behavior, dramatically reducing its effectiveness. A model of a real system or natural process represents only certain aspects of reality and cannot capture the complete behavior of the real physical system. A good model should reflect the key aspects of the system we are analyzing when constrained to certain conditions. The information extracted from a good model can be applied to making the design of the system more robust and reliable.

No easy solutions exist for modeling uncertainty. We must make simplifying assumptions to keep the solutions we obtain tractable. These assumptions and simplifications should be identified and documented, since any model will be useful only for those constrained scenarios. Used outside of these constraints, the model will tend to degrade and provide us with less usable information. That being the case, what type of model is best suited for our project?

When designing a high availability system, we should carefully analyze the system for critical failure modes and attempt to prevent these failures by incorporating specific high availability features directly in the system architecture and design.

However, from a practical standpoint, we know unexpected failures can and will occur at any time despite our best intentions. Given that, we add a layer of defense, known as fault management, that mitigates the impacts of a failure mode on the system functionality. Multiple failures and/or failure modes not previously identified may cause system performance degradation or complete system failure. It is important to characterize these failures and determine the expected overall availability of the system over its lifetime of operation.

Stochastic models are used to capture and constrain the randomness inherent in all physical processes. The more we know about the underlying stochastic process, the better we will be able to model that process and constrain the impacts of random failures on the system we are analyzing. For example, if we can assume that certain system components have constant failure rates, a wealth of tools and techniques are available to assist us in this analysis. This will allow us to design a system with a known confidence level of meeting our reliability and availability goals. Unfortunately, two major impediments stand in our way: (1) the failure rates of many of the components that comprise our system are not constant, that is, independent of time over the life of the system being built or analyzed, but rather follow a more complicated trajectory over the lifetime of the system; and (2) exact component failure rates, especially for new hardware and software, are not known and cannot be exactly determined until after all built and deployed systems reach the end of their useful lives.
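
For intuition on the first impediment, it helps to look ahead to the Weibull distribution (Section 5.4), whose hazard rate falls, stays flat, or rises with time depending on a single shape parameter k; the constant-failure-rate model is exactly the special case k = 1. The following MATLAB sketch plots the three regimes with illustrative parameter values (it relies on implicit expansion, so it assumes R2016b or later):

    % Weibull hazard rate h(t) = (k/s)*(t/s)^(k-1) for shape k, scale s.
    % k < 1: decreasing rate (early-life failures); k = 1: constant rate
    % (the exponential case); k > 1: increasing rate (wear-out).
    t = linspace(1, 1e4, 200);      % operating time, hours
    s = 5e3;                        % scale parameter, hours (illustrative)
    k = [0.5; 1; 2];                % three shape parameters, one per row
    h = (k./s) .* (t./s).^(k - 1);  % 3 x 200 matrix of hazard values
    semilogy(t, h);
    legend('k = 0.5 (early life)', 'k = 1 (constant)', 'k = 2 (wear-out)');
    xlabel('t (hours)'); ylabel('hazard rate h(t)');

The bathtub curve often cited for hardware is simply these three regimes stitched together over a component's life.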

So, where do we start? What model can we use for high availability design and analysis? How useful will this model be? Where will it fail to correctly predict system behavior? Fortunately, many techniques have already been successfully used to model system behavior. In this book, we will cover several of the more useful and practical models. We will explore techniques that will address reliability concerns, identify their limitations and assumptions that are inherent in any model, and provide methods that in spite of the significant hurdles we face, will allow us to effectively design systems that meet high availability requirements.

Our first step in this seemingly unpredictable world of failures is to understand and characterize the nature of randomness itself. We will begin our journey by reviewing important concepts in probability. These concepts are the building blocks for understanding reliability engineering. Once we have a firm grasp on key probability concepts, we will be ready to explore a wide variety of classical reliability and Design for Six Sigma (DFSS) tools and models that will enable us to design and analyze high availability systems, as well as to predict the behavior of these systems.

Chapter 2

Initial Considerations for Reliability Design

2.1 The Challenge

One of the biggest challenges we face is to predict the reliability and availability of a particular system or design with incomplete information. Incomplete information includes lack of reliability data, partial historical data, inaccuracies in data obtained from third parties, and uncertainty concerning what to model. Inaccuracies in data can also stem from internal organizational measurement errors or reporting issues. Although well-developed techniques can be applied, reliability attributes, such as predictive product or component MTBF (Mean Time between Failures), cannot be precisely predicted; they can only be estimated. Even if the MTBF of a system is accurately estimated, we will still not be able to predict when any particular system will fail. The application of reliability theory works well when scaled to a large number of systems over a period of time that is long relative to the MTBF. The smaller the sample and the shorter the time frame, the less precise the predictions will be. The challenge is to use appropriate information and tools to accomplish two goals: (1) predict the availability and reliability of the end product to ensure customer requirements are met, and (2) determine the weak points in the product architecture so that these problem areas can be addressed prior to production and deployment of the product.

A model is typically created to predict the problems we encounter in the field, such as return rates, and to identify weak areas of system design that need to be improved. A good model can be continually updated and refined based on new information and field data to improve its predictive accuracy.

2.2 Initial Data Collection

How do we get started? Typically, for the initial availability or reliability analysis, we should have access to (1) the initial system architecture, (2) the availability and reliability requirements for our product, and (3) reliability data for individual components (albeit in many cases data are incomplete or nonexistent).

For reliability purposes, a system or product is decomposed into several components and reliability information is associated with these components. How do we determine these components? Many components can be extracted from the system architecture block diagram (Fig. 2.1). For hardware components, one natural division is to identify the Field Replaceable Units (FRU) and their reliability data estimates or measurements. FRUs are components, such as power supplies, fan trays, processors, memory, controllers, server blades, and routers, that can be replaced at the customer site by the customer or contracted field support staff. For software components, several factors need to be considered, such as system architecture, application layers, third party vendors, fault zones, and existing reliability information.

Figure 2.1 System Block Diagram

Let us say you have a system with several cages and lots of cards and have implemented fault tolerance mechanisms. If the design of the system is such that the customer's maintenance plan includes procedures for the repair of the system by replacing failed cards (FRUs), then at a minimum, those individual cards should be uniquely identified along with certain reliability data associated with them—in particular, MTBF and MTTR (Mean Time to Repair) data. This basic information is required to calculate system availability.
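
As a quick illustration of why these two numbers are the minimum required, the steady-state availability of a repairable component is A = MTBF/(MTBF + MTTR), a relationship derived formally in Chapter 7. A minimal MATLAB sketch with illustrative values (not data for any particular card):

    % Steady-state availability from MTBF and MTTR (illustrative values).
    mtbf = 50000;              % mean time between failures, hours
    mttr = 4;                  % mean time to repair, hours
    A = mtbf / (mtbf + mttr);  % steady-state availability

    % Expected downtime per year of continuous operation.
    hoursPerYear = 8766;       % average year length, including leap years
    downtimeMin = (1 - A) * hoursPerYear * 60;
    fprintf('Availability = %.6f (%.1f min/year downtime)\n', A, downtimeMin);

With these numbers the card is available 99.992% of the time, about 42 minutes of downtime per year; halving the MTTR halves the downtime, which is why repair speed matters as much as failure rate.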

The MTTR depends on how quickly a problem is identified and how quickly it can be repaired. In the telecom industry, a maintenance window typically exists during off-peak, low-traffic times of the day, during which maintenance activities take place. If a failed component does not affect the system significantly, then its repair may be delayed until the maintenance window, depending on the customer's maintenance plan, ease of access to equipment, and so on. However, if the system functionality is significantly impacted by a failure event, then the system will require immediate repair to recover full service.

In addition to hardware and software component failures, we should also take into consideration other possible failures and events that impact the availability of our system. These may include operator-initiated planned events (e.g., software upgrade), environment failures, security failures, external customer network failures, and operator errors. The objective is to identify those events and failures that can affect the ability of our system to provide full service, and create a model that incorporates these potential failures. We need to determine which of these events are significant (characterized by severity and likelihood of occurrence) and are within the scope of the system we are modeling. If we group certain failure modes into a fault zone, we can modularize the model for better analysis and maintainability.

2.3 Where Do We Get MTBF Information?

For hardware, we may be able to obtain MTBF information from industry standard data or from manufacturer-published reliability data. If a particular component does not have a published or known MTBF, then the next step is to look for parts or components that are similar and estimate the MTBF based on their data under similar operating conditions. Another method is to extrapolate the MTBF by mining data from past similar projects. In the worst case, if a totally new component or technology is to be employed with little reliability data available, then a good rule of thumb is to look at a previous generation of products of a similar nature that do have some reliability data and use that information to estimate the MTBF. Use engineering judgment to decrease the MTBF by a reasonable factor, for example, to x/2, to account for uncertainty and typical early failures in the hardware cycle. It is better to err on the side of caution.

Once we have made this initial assessment, the MTBF becomes the baseline for our initial model.

Let us consider a hypothetical communication system that consists of a chassis with several slots in which processing cards can be inserted. One of these cards is a new RF (radio frequency) carrier card. We start with the original data provided from the manufacturer, which indicate an MTBF of 550,000 hours. We then sanity-check these data by looking at MTBFs of similar cards successfully deployed over a number of years from different manufacturers. This analysis reveals an average MTBF of 75,000 hours. How might we reconcile the difference? Which estimate do we use? It turns out that the right answer may be both! In Chapter 13, we will explore techniques for combining reliability data to obtain an updated MTBF estimate that may be more accurate than any single source of MTBF data.
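
One informal way to appreciate the size of that gap, before applying the formal machinery of Chapter 13, is to translate each MTBF into the probability of at least one failure during a year of operation, assuming a constant failure rate (i.e., exponentially distributed time to failure). A MATLAB sketch using the two estimates above:

    % Implied probability of failure within one year under each MTBF
    % estimate, assuming a constant failure rate.
    mtbfVendor = 550000;   % hours, manufacturer's figure
    mtbfField  = 75000;    % hours, average of similar deployed cards
    t = 8766;              % one year of continuous operation, hours

    pFail = @(mtbf) 1 - exp(-t ./ mtbf);
    fprintf('P(failure in 1 yr): vendor %.3f, field %.3f\n', ...
            pFail(mtbfVendor), pFail(mtbfField));

Under the vendor figure, a card has roughly a 1.6% chance of failing in its first year; under the field figure, about 11%. A difference of this size can easily change sparing levels and redundancy decisions, which is why reconciling the estimates is worth the effort.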

On the software side, if the software is being built in-house, we can derive the MTBF from similar projects that have been done in the past and at the same level of maturity. Release 1 always has the most bugs! We can also extrapolate information from field data in the current system release or previous releases. This can get more complicated if the software is written by multiple vendors (which is typical). We may also consider software complexity, risk, schedules, maturity of the organization developing the software, nonconstant failure rates, and so on as factors that affect MTBF.

Are you reusing existing software that has already been working in the field? For example, if we build on software from a previous project, and then we add additional functionality on top of this, we can extract software failure rate information from the previous project. We can also identify failure rates from industry information for common software, such as the Linux operating system. It is also quite possible that we identify MTBF information from third-party suppliers of off-the-shelf standard software.

We need to take into account as many relevant factors as possible when assigning MTBFs, MTTRs, and other reliability data. We should also note that MTBFs of products generally tend to improve over time—as the system gets more field exposure and software fixes are made, the MTBF numbers will generally increase.

2.4 MTTR and Identifying Failures

An important part of our architectural considerations is the detectability of the problem. How likely is it that when we have a problem, we will be able to automatically detect it? Fault Management includes designing detection mechanisms that are capable of picking up those failures.

Mechanisms, such as heartbeat messaging, checkpointing, process monitoring, checksums, and watchdog timers, help identify problems and potential failures. If a particular failure is automatically detected and isolated to a card, then recovery mechanisms can be invoked, for example, failover to a standby card.

Autorecovery is a useful technique for transient software failures, such as buffer overflows, memory leaks, locks that have not been released, and unstable or absorbing states that cause the system to hang. For these failures, a reboot of the card on which the problem was detected may repair the problem. If the failure reoccurs, more sophisticated repair actions may need to be employed. The bottom line is that we want to maximize the time the system is available by leveraging the simplest recovery options that restore the required service.

In addition to detectable failures, a portion of the failures will be undetected or “silent” failures that should be accounted for as part of reliability calculations. These undetected failures can eventually manifest themselves as an impact on the functionality of the system. Since these failures may remain undetected by the system, the last line of defense is manual detection and resolution of the problem. Our goal is to reduce the number of undetected failures to an absolute minimum, since they potentially have a much larger impact on system functionality due to the length of time the problem remains undiscovered and uncorrected.

In situations where the problem cannot be recovered by an automatic card reset, for example, the MTTR becomes much larger. Instead of an MTTR of a few minutes to account for the reboot of a card, the MTTR could be on the order of several hours if we need to manually replace the card or system, or revert to a previous software version. So the more robust the system and fault management architecture, the more quickly and reliably we can identify and repair a failure. This is part of controllability: once we know the nature of the problem, we know the method that can be employed to recover from it.
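
One common way to quantify this effect is a detection-coverage split: assume a fraction c of failures is detected and recovered automatically, with an MTTR of minutes, while the remaining 1 - c must be found and repaired manually, with an MTTR of hours. When failure rates are small relative to repair times, the unavailability contributions of the two classes simply add. A MATLAB sketch with assumed values:

    % Effect of fault detection coverage on availability (assumed values).
    mtbf     = 50000;   % hours between failures of the card
    coverage = 0.95;    % fraction of failures detected automatically
    mttrFast = 5/60;    % automatic reboot/failover: ~5 minutes, in hours
    mttrSlow = 4;       % manual detection and replacement: hours

    lambda = 1 / mtbf;  % constant failure rate assumption
    % Unavailability ~ rate * repair time, summed over failure classes.
    U = lambda * (coverage * mttrFast + (1 - coverage) * mttrSlow);
    fprintf('Availability = %.7f\n', 1 - U);

Even at 95% coverage, the 5% of silent failures dominate the downtime budget here, which is the quantitative argument for driving undetected failures toward zero.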

There are several ways to reduce the MTTR. In situations where the software fails and is not responsive to external commands, we look at independent paths to increase the chances of recovering that card. In ATCA (Advanced Telecommunications Computing Architecture) architectures, a dedicated hardware line can trigger a reboot independent of the software. Other methods include internal watchdog time-outs that will trigger a card reboot. The more repair actions we have built into the system, the more robust it becomes. There is, of course, a trade-off in the number and complexity of such mechanisms that we build into a system. We must be careful that our fault management mechanisms are not so complex that the fault management software itself begins to negatively impact the reliability and response time of the system! In the authors' experience, redundancy management tends to be the most problematic of fault management techniques, due to the inherent complexity needed to minimize outages and downtime across a large number of possible known (and unknown) failures.

2.5 Summary

Prior to designing a high availability system, we must have a set of availability requirements or goals for our system. We then set out to design a system that meets these requirements. By decomposing the system into appropriate components, we can create a system reliability model and mechanisms for ensuring high availability. For each component in this model, we allocate specific MTBF, MTTR, and other reliability information. We described a few general methods that can be used to estimate and improve this reliability information. The more knowledge we have regarding the reliability of these components, the maintenance plan for the system, and the number of systems we expect to deploy, the more accurate our model will be in predicting actual system reliability and availability once deployed to the field.

Now that we have introduced the mechanics of obtaining initial reliability information, the next several chapters develop the basic mathematical concepts that set the foundation for more advanced techniques and applications used to build, predict, and optimize high availability systems.

Chapter 3

A Game of Dice: An Introduction to Probability

3.1 Introduction

All systems are subject to failures. We need powerful techniques to analyze and predict the reliability of high availability systems. Standard techniques have been developed and put into practice that can assist us in designing these systems. Many of the techniques necessary for this analysis are rooted in probability theory. This chapter introduces important concepts of probability theory that will be applied to the practical application of reliability in later chapters. We begin our discussion by posing the following question:

“How can an exponential failure distribution represent a constant failure rate?”

The vast majority of analysis techniques we employ are based on the concept of constant random failure rates. That is, the probability of a failure occurring at any time remains the same over the lifetime of the system. In other words, the rate of these random failures is independent of time. With this assumption regarding the nature of the failures, a wealth of techniques become available—including the exponential failure probability distribution.

An exponential function is not constant, right? In fact, it is not even a simple linear equation of the form y = mx + b. If we know anything about the exponential function e^(λt), we know that it either gets bigger faster when λ is positive and the magnitude of λ becomes larger, or gets smaller quicker when λ is negative and the magnitude of λ becomes larger.
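
The short answer, developed formally as the conditional failure rate in Chapter 7 (Section 7.19), is that the constant quantity is not the distribution itself but the hazard rate h(t) = f(t)/R(t): for the exponential, f(t) = λe^(-λt) and R(t) = e^(-λt), so the ratio is simply λ at every t. A quick numerical check of that identity in MATLAB:

    % The hazard rate of the exponential distribution is constant:
    % h(t) = f(t)/R(t) = lambda*exp(-lambda*t)/exp(-lambda*t) = lambda.
    lambda = 2e-5;                 % failures per hour (illustrative)
    t = linspace(0, 1e5, 5);       % sample times across the component life
    f = lambda * exp(-lambda * t); % failure density f(t)
    R = exp(-lambda * t);          % reliability R(t)
    h = f ./ R;                    % hazard rate
    disp(h);                       % every entry equals lambda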

Continue reading in the full edition!