A unique, design-based approach to reliability engineering
Design for Reliability provides engineers and managers with a range of tools and techniques for incorporating reliability into the design process for complex systems. It clearly explains how to design for zero failure of critical system functions, leading to enormous savings in product life-cycle costs and a dramatic improvement in the ability to compete in global markets. Readers will find a wealth of design practices not covered in typical engineering books, allowing them to think outside the box when developing reliability requirements. They will learn to address high failure rates associated with systems that are not properly designed for reliability, avoiding expensive and time-consuming engineering changes, such as excessive testing, repairs, maintenance, inspection, and logistics.
Special features of this book include:
* A unified approach that integrates ideas from computer science and reliability engineering
* Techniques applicable to reliability as well as safety, maintainability, system integration, and logistic engineering
* Chapters on design for extreme environments, developing reliable software, design for trustworthiness, and HALT influence on design
Design for Reliability is a must-have guide for engineers and managers in R&D, product development, reliability engineering, product safety, and quality assurance, as well as anyone who needs to deliver high product performance at a lower cost while minimizing system failure.
Page count: 499
Year of publication: 2012
WILEY SERIES IN QUALITY & RELIABILITY ENGINEERING
and related titles*
Electronic Component Reliability:
Fundamentals, Modelling, Evaluation and Assurance
Finn Jensen
Measurement and Calibration Requirements
For Quality Assurance to ISO 9000
Alan S. Morris
Integrated Circuit Failure Analysis:
A Guide to Preparation Techniques
Friedrich Beck
Test Engineering
Patrick D. T. O’Connor
Six Sigma: Advanced Tools for Black Belts and Master Black Belts*
Loon Ching Tang, Thong Ngee Goh, Hong See Yam, Timothy Yoap
Secure Computer and Network Systems: Modeling, Analysis and Design*
Nong Ye
Failure Analysis:
A Practical Guide for Manufacturers of Electronic Components and Systems
Marius Bâzu and Titu Băjenescu
Reliability Technology:
Principles and Practice of Failure Prevention in Electronic Systems
Norman Pascoe
Effective FMEAs: Achieving Safe, Reliable, and Economical Products and Processes Using Failure Mode and Effects Analysis
Carl Carlson
Design for Reliability
Dev Raheja and Louis J. Gullo (Editors)
Copyright © 2012 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Raheja, Dev.
Design for reliability / Dev Raheja & Louis J. Gullo.
p. cm.
ISBN 978-0-470-48675-7 (hardback)
1. Reliability (Engineering) I. Gullo, Louis J. II. Title.
TA169.R348 2011
620’.00452-dc23
2011042405
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
To my wife, Hema, and my children, Gauri, Pramod, and Preeti
Dev Raheja
To my wife, Diane, and my children, Louis, Jr., Stephanie, Catherine, Christina, and Nicholas
Louis J. Gullo
Contents
Contributors
Foreword
Preface
Introduction: What You Will Learn
1 Design for Reliability Paradigms
Dev Raheja
Why Design for Reliability?
Reflections on the Current State of the Art
The Paradigms for Design for Reliability
Summary
References
2 Reliability Design Tools
Joseph A. Childs
Introduction
Reliability Tools
Test Data Analysis
Summary
References
3 Developing Reliable Software
Samuel Keene
Introduction and Background
Software Reliability: Definitions and Basic Concepts
Software Reliability Design Considerations
Operational Reliability Requires Effective Change Management
Execution-Time Software Reliability Models
Software Reliability Prediction Tools Prior to Testing
References
4 Reliability Models
Louis J. Gullo
Introduction
Reliability Block Diagram: System Modeling
Example of System Reliability Models Using RBDs
Reliability Growth Model
Similarity Analysis and Categories of a Physical Model
Monte Carlo Models
Markov Models
References
5 Design Failure Modes, Effects, and Criticality Analysis
Louis J. Gullo
Introduction to FMEA and FMECA
Design FMECA
Principles of FMECA-MA
Design FMECA Approaches
Example of a Design FMECA Process
Risk Priority Number
Final Thoughts
References
6 Process Failure Modes, Effects, and Criticality Analysis
Joseph A. Childs
Introduction
Principles of P-FMECA
Use of P-FMECA
What Is Required Before Starting
Performing P-FMECA Step by Step
Improvement Actions
Reporting Results
Suggestions for Additional Reading
7 FMECA Applied to Software Development
Robert W. Stoddard
Introduction
Scoping an FMECA for Software Development
FMECA Steps for Software Development
Important Notes on Roles and Responsibilities with Software FMECA
Lessons Learned from Conducting Software FMECA
Conclusions
References
8 Six Sigma Approach to Requirements Development
Samuel Keene
Early Experiences with Design of Experiments
Six Sigma Foundations
The Six Sigma Three-Pronged Initiative
The RASCI Tool
Design for Six Sigma
Requirements Development: The Principal Challenge to System Reliability
The GQM Tool
The Mind Mapping Tool
References
9 Human Factors in Reliable Design
Jack Dixon
Human Factors Engineering
A Design Engineer’s Interest in Human Factors
Human-Centered Design
Human Factors Analysis Process
Human Factors and Risk
Human Error
Design for Error Tolerance
Checklists
Testing to Validate Human Factors in Design
References
10 Stress Analysis During Design to Eliminate Failures
Louis J. Gullo
Principles of Stress Analysis
Mechanical Stress Analysis or Durability Analysis
Finite Element Analysis
Probabilistic vs. Deterministic Methods and Failures
How Stress Analysis Aids Design for Reliability
Derating and Stress Analysis
Stress vs. Strength Curves
Software Stress Analysis and Testing
Structural Reinforcement to Improve Structural Integrity
References
11 Highly Accelerated Life Testing
Louis J. Gullo
Introduction
Time Compression
Test Coverage
Environmental Stresses of HALT
Sensitivity to Stresses
Design Margin
Sample Size
Conclusions
Reference
12 Design for Extreme Environments
Steven S. Austin
Overview
Designing for Extreme Environments
Designing for Cold
Designing for Heat
References
13 Design for Trustworthiness
Lawrence Bernstein and C. M. Yuhas
Introduction
Modules and Components
Politics of Reuse
Design Principles
Design Constraints That Make Systems Trustworthy
Conclusions
References and Notes
14 Prognostics and Health Management Capabilities to Improve Reliability
Louis J. Gullo
Introduction
PHM Is Department of Defense Policy
Condition-Based Maintenance vs. Time-Based Maintenance
Monitoring and Reasoning of Failure Precursors
Monitoring Environmental and Usage Loads for Damage Modeling
Fault Detection, Fault Isolation, and Prognostics
Sensors for Automatic Stress Monitoring
References
15 Reliability Management
Joseph A. Childs
Introduction
Planning, Execution, and Documentation
Closing the Feedback Loop: Reliability Assessment, Problem Solving, and Growth
References
16 Risk Management, Exception Handling, and Change Management
Jack Dixon
Introduction to Risk
Importance of Risk Management
Why Many Risks Are Overlooked
Program Risk
Design Risk
Risk Assessment
Risk Identification
Risk Estimation
Risk Evaluation
Risk Mitigation
Risk Communication
Risk and Competitiveness
Risk Management in the Change Process
Configuration Management
References
17 Integrating Design for Reliability with Design for Safety
Brian Moriarty
Introduction
Start of Safety Design
Reliability in System Safety Design
Safety Analysis Techniques
Establishing Safety Assessment Using the Risk Assessment Code Matrix
Design and Development Process for Detailed Safety Design
Verification of Design for Safety Includes Reliability
Examples of Design for Safety with Reliability Data
Final Thoughts
References
18 Organizational Reliability Capability Assessment
Louis J. Gullo
Introduction
The Benefits of IEEE 1624-2008
Organizational Reliability Capability
Reliability Capability Assessment
Design Capability and Performability
IEEE 1624 Scoring Guidelines
SEI CMMI Scoring Guidelines
Organizational Reliability Capability Assessment Process
Advantages of High Reliability
Conclusions
References
Index
Contributors
Foreword
The importance of quality and reliability to a system cannot be disputed. Product failures in the field inevitably lead to losses in the form of repair cost, warranty claims, customer dissatisfaction, product recalls, loss of sales, and in extreme cases, loss of life. Thus, quality and reliability play a critical role in modern science and engineering and so enjoy various opportunities and face a number of challenges.
As quality and reliability science evolves, it reflects the trends and transformations of technological support. A device utilizing a new technology, whether it be a solar power panel, a stealth aircraft, or a state-of-the-art medical device, needs to function properly and without failure throughout its mission life. New technologies bring about new failure mechanisms (chemical, electrical, physical, mechanical, structural, etc.), new failure sites, and new failure modes. Therefore, continuous advancement of the physics of failure, combined with a multi-disciplinary approach, is essential to our ability to address those challenges in the future.
In addition to the transformations associated with changes in technology, the field of quality and reliability engineering has been going through its own evolution: developing new techniques and methodologies aimed at process improvement and reduction of the number of design- and manufacturing-related failures.
The concept of design for reliability (DFR) has been gaining popularity in recent years and its development is expected to continue for years to come. DFR methods shift the focus from reliability demonstration and the outdated “test-analyze-fix” philosophy to designing reliability into products and processes using the best available science-based methods. These concepts intertwine with probabilistic design and design for six sigma (DFSS) methods, focusing on reducing variability at the design and manufacturing levels. As such, the industry is expected to increase the use of simulation techniques, enhance the applications of reliability modeling, and integrate reliability engineering earlier and earlier in the design process. DFR also transforms the role of the reliability engineer from being focused primarily on product test and analysis to being a mentor to the design team, which is responsible for finding and applying the best design methods to achieve reliability. A properly applied DFR process ensures that pursuit of reliability is an enterprise-wide activity.
Several other emerging and continuing trends in quality and reliability engineering are also worth mentioning here. For an increasing number of applications, risk assessment will enhance reliability analysis, addressing not only the probability of failure but also the quantitative consequences of that failure. Life-cycle engineering concepts are expected to find wider applications in reducing life-cycle risks and minimizing the combined cost of design, manufacturing, quality, warranty, and service. Advances in prognostics and health management will bring about the development of new models and algorithms that can predict the future reliability of a product by assessing the extent of degradation from its expected operating conditions. Other advancing areas include human and software reliability analysis.
Additionally, continuous globalization and outsourcing affect most industries and complicate the work of quality and reliability professionals. Having various engineering functions distributed around the globe adds a layer of complexity to design coordination and logistics. Moving design and production into regions with little knowledge depth regarding design and manufacturing processes, with a less robust quality system in place and where low cost is often the primary driver of product development, affects a company's ability to produce reliable and defect-free parts.
Despite its obvious importance, quality and reliability education is paradoxically lacking in today's engineering curriculum. Few engineering schools offer degree programs or even a sufficient variety of courses in quality or reliability methods. Therefore, a majority of quality and reliability practitioners receive their professional training from colleagues, professional seminars, and from a variety of publications and technical books. The lack of formal education opportunities in this field greatly emphasizes the importance of technical publications for professional development.
The real objective of the Wiley Series in Quality & Reliability Engineering is to provide a solid educational foundation for both practitioners and researchers in quality and reliability and to expand the reader's knowledge base to include the latest developments in this field. This series continues Wiley's tradition of excellence in technical publishing and provides a lasting and positive contribution to the teaching and practice of engineering.
Andre Kleyner
Editor
Wiley Series in Quality & Reliability Engineering
Preface
Design for reliability (DFR) has become a worldwide goal, regardless of the industry and market. The best organizations around the world have become increasingly intent on harvesting the value proposition for competing globally while significantly lowering life-cycle costs. The DFR principles and methods aim to prevent faults, failures, and product malfunctions proactively, which results in cheaper, faster, and better products. In Japan, this tool is used to gain customer loyalty and customer trust. However, we still face some challenges. Very few engineering managers and design engineers understand the value added by design for reliability; they often fail to see the savings in warranty costs, the increased customer satisfaction, and the gain in market share.
These facts, combined with the current worldwide economic challenges, have created perfect conditions for this science of engineering. It is also an art, because many decisions have to be based not only on evidence-based data but also on engineering creativity to design out failure at lower cost. Readers will be delighted with the wealth of knowledge because all contributors to this book have at least 20 years of hands-on experience with these methods.
The idea for this book was conceived during our participation in the IEEE Design for Reliability Technical Committee. We saw the need for a DFR volume not only for hardware engineers, but also for software and system engineers. The traditional books on reliability engineering are written for reliability engineers who rely more on statistical analysis than on improvements in inherent design to mitigate hardware and software failures. Our book attempts to fill a gap in the published body of knowledge by communicating the tremendous advantages of designing for reliability during the very early development phase of a new product or system. This volume fulfills the needs of entry-level design engineers, experienced design engineers, and engineering managers, as well as reliability engineers and managers who are looking for hands-on knowledge on how to work collaboratively on design engineering teams.
ACKNOWLEDGMENTS
We would like to thank the IEEE Reliability Society for sowing the seed for this book, especially the encouragement from a former society president, Dr. Samuel Keene, who also contributed chapters in the book. We would like to recognize a few of the authors for conducting peer reviews of several chapters: Joe Childs, Jack Dixon, Larry Bernstein, and Sam Keene. We also thank the guest editors—Tim Adams, at NASA Kennedy Center, and Dr. Nat Jambulingam, at NASA Goddard Space Flight Center—who helped edit several chapters. We are grateful to Diana Gialo, at Wiley, who has always been gracious in helping and guiding us.
We acknowledge the contributions of the following:
Steve Austin (Chapter 12)
Larry Bernstein (Chapter 13)
Joe Childs (Chapters 2, 6, and 15)
Jack Dixon (Chapters 9 and 16)
Lou Gullo (Chapters 4, 5, 10, 11, 14, and 18)
Sam Keene (Chapters 3 and 8)
Brian Moriarty (Chapter 17)
Dev Raheja (Chapter 1)
Bob Stoddard (Chapter 7)
C. M. Yuhas (Chapter 13)
Dev Raheja
Louis J. Gullo
Introduction: What You Will Learn
This chapter introduces what it means to design for reliability. It shows the technical gaps between the current state of the art and what it takes to design reliability as a value proposition for new products. It gives real examples of how to get a high return on investment to understand the art of design for reliability. The chapter introduces readers to the deeper-level topics with eight practical paradigms for best practices.
This chapter summarizes reliability tools that exist throughout the product's life cycle from creation, requirements, development, design, production, testing, use and end of life. The need for tools in understanding and communicating reliability performance is also explained. Many of these tools are explained in further detail in the chapters that follow.
This chapter describes good design practices for developing the reliable software embedded in most high-technology products. It shows how to prevent software faults and failures often inherent in the design by applying evidence-based reliability tools to software, such as FMEA, capability maturity modeling, and software reliability modeling. It introduces the most popular software reliability estimation tool, CASRE (Computer Aided Software Reliability Estimation).
This chapter is on reliability modeling, one of the most important tools for design for reliability in the early stages of design, to determine the strategy for overall reliability. The chapter covers models for system reliability and component reliability, and shows the use of block diagrams in modeling. It discusses the reliability growth process, similarity analysis used for physical modeling, and widely used models for simulation.
This chapter on FMECA contains the core knowledge for reliability analysis at the system level, subsystem level, and component level. The chapter shows how to perform risk assessment using a risk index called the Risk Priority Number and shows how to eliminate single-point failures, making a design significantly less vulnerable. It explains the difference between FMEA and FMECA and how to use them for improving product performance and maintenance effectiveness.
The last chapter showed how to make a design more robust. This chapter applies the FMEA tool to analyze a process for robustness so that manufacturing defects are eliminated before they show up in production. The end result is improved product reliability with lower manufacturing costs. It covers a step-by-step procedure to perform the analysis, including the risk assessment using the Risk Priority Number.
The FMEA tool is just as applicable to software design. There is very little literature on how to apply it to software. This chapter shows the details of how to use it to improve the software reliability. It covers the lessons learned and shows different ways of integrating the FMECA into the most widely used software development model known as “V” model. The chapter describes roles and responsibilities for proper use of this tool.
In this chapter the author explains why Design of Experiments (DOE) is a sweet spot for identifying the key input variables to a Six Sigma program. The chapter covers the origin of this program, the meaning of Six Sigma measurements, and how it is applied to improve the design. It then proceeds to cover the tools for designing the product for Six Sigma performance to reduce failure rates as close to zero as possible.
Humans are often blamed for many product failures when in fact the fault lies in insufficient attention to human factors engineering. This chapter covers the principles of human-centered design to make the man-machine interface robust and error-tolerant. It covers how to perform the human factors analysis, and how to integrate it to make the product design user-friendly.
This chapter explains why it is critical to reduce the design stress to improve durability as well as reliability. It introduces the concept of derating as a design tool. The author includes examples on electrical and mechanical stress analysis including how to apply this theory to software design. The chapter also shows how to apply Finite Element Analysis, a numerical technique, to solve specific design problems.
Usually designers cannot predict what failures will occur for a new design. This chapter shows how highly accelerated life tests and highly accelerated stress tests can reveal the failure modes quickly. It covers how to design these tests and how to estimate the design margin from the test results. It shows different methods of accelerating the stresses.
When a product is used in extreme cold or extreme heat, such as in Alaska or in a desert in Arizona, we must design for such environments to ensure that the product lasts long enough. This chapter shows what factors need to be considered and how to design for each condition. It shows how lessons learned from space programs and overseas experience can help make products durable, reliable, and safe.
This is a very important chapter because software design methods for reliability are not standardized yet. This chapter goes beyond reliability to design software such that it is also safe, and secure from errors in engineering changes, which are very frequent. This chapter covers design methods and offers suggestions for improving the architecture, modules, and interfaces, and for using the right policies for reusing the software. The chapter offers good design practices.
Design for reliability practices should include detecting an impending malfunction before the product actually malfunctions. This chapter covers prognostics and product health monitoring principles that can be designed into the product. The result is enhanced system reliability. The chapter includes condition-based maintenance and time-based maintenance, use of failure precursors to signal an imminent failure event, and automatic stress monitoring to enhance prognosis.
This chapter provides both motivation and guidance in outlining the importance of good reliability management. Management participation is the key to any successful reliability in design. It shows how to manage, plan, execute, and document the needs of the program during early design. It describes the important tasks, and closing the feedback loops after reliability assessment, problem solving, and reliability growth testing.
Many risks are overlooked in a product design. This chapter defines what risk is in engineering terms, and how to predict risk, assess risk, and mitigate it. It highlights the role of a risk management culture in mitigating risks and the critical role of configuration management in avoiding new risks from design changes. Included in this chapter is how to minimize oversights and omissions, including requirements creep.
This chapter integrates reliability with safety, including how to design for safety. It covers several safety analysis techniques that equally apply to reliability. It shows how a risk assessment code matrix is used widely in aerospace and many commercial products to make risk management decisions. It includes examples of risk reduction.
This chapter describes the benefits of using the IEEE 1624-2008 standard to describe how the reliability capability of an organizational entity is determined by assessing eight key reliability practices and associated metrics. Management should know the capability of an organization to deliver a reliable product, which is defined as organizational reliability capability. It describes the process in detail with case studies.
Chapter 1
Design for Reliability Paradigms
Dev Raheja
The science of reliability has not kept pace with user expectations. Many corporations still use MTBF (mean time between failures) as a measure of reliability, which, depending on the statistical distribution of failure data, implies acceptance of roughly 50 to 70% failures during the time indicated by the MTBF. No user today can tolerate such a high number of failures. Ideally, a user does not want any failures for the entire expected life! The life expected is determined by the life inferred by users, such as 100,000 miles or 10 years for an automobile, at least 10 years for kitchen appliances, and at least 20 years for a commercial airliner. Most commercial companies, such as automotive and medical device manufacturers, have stopped using the MTBF measure and aim at 1 to 10% failures during a self-defined time. This is still not in line with users' dreams. The real question is: Why not design for zero failures if we can increase profits and gain more market share? Zero failures implies zero mission-critical failures or zero safety-critical system failures. As a minimum, systems in which failures can lead to catastrophic consequences must be designed for zero failures. There are companies that are able to do this. Toyota, Apple, Gillette, Honda, Boeing, Johnson & Johnson, Corning, and Hewlett-Packard are a few examples.
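As a rough illustration of the 50 to 70% figure above, the fraction of units expected to fail by the time t = MTBF can be computed directly from an assumed life distribution. The sketch below is not from the book: it assumes a Weibull life model (shape 1 is the exponential case) and non-repairable units, so that the MTBF is simply the mean life; the function name and the shape values are illustrative choices.

```python
import math


def fraction_failed_at_mean(shape: float) -> float:
    """Fraction of units failed by t = mean life, assuming Weibull-distributed lives.

    Weibull CDF: F(t) = 1 - exp(-(t / scale) ** shape)
    Mean life:   scale * Gamma(1 + 1 / shape)
    Evaluating F at the mean cancels the scale, so only the shape matters.
    """
    mean_over_scale = math.gamma(1.0 + 1.0 / shape)
    return 1.0 - math.exp(-(mean_over_scale ** shape))


if __name__ == "__main__":
    # Illustrative shapes (assumed, not from the book):
    # 1.0 = exponential (constant failure rate), 2.0 and 3.5 = wear-out behavior.
    for shape in (1.0, 2.0, 3.5):
        pct = 100.0 * fraction_failed_at_mean(shape)
        print(f"Weibull shape {shape}: {pct:.1f}% failed by the MTBF")
```

Under these assumptions the exponential case gives about 63% failed by the MTBF, while a wear-out shape near 3.5 (close to a normal distribution) gives about 50%, which is consistent with the range quoted in the text.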
The aim of design for reliability (DFR) is to design-out failures of critical system functions in a system. The number of such failures should be zero for the expected life of the product. Some components may be allowed to fail, such as in redundant systems. For example, in aerospace, as long as a system can function at least for the duration of the mission and the failed components are replaced prior to the next mission to maintain redundancy, certain failures can be tolerated. This is, however, insufficient for complex systems where thousands of software interactions, hundreds of wiring connections, and hundreds of human factors affect the systems' reliability. Then there are issues of compatibility [1] among components and materials, among subsystems, and among hardware and software interactions. Therefore, for complex systems we may find it impossible to have zero failures, but we must at least prevent the potential failures we know about. Since failures can come from unknown and unexpected interactions, we should try to design-in fallback modes for unexpected events. A “what-if” analysis usually points to some events of this type. To minimize failures in complex systems, in this book we describe techniques for improving software and interface reliability.
As indicated earlier, some companies have built a strong and long-lasting reputation for reliability based on aiming at zero failures. Toyota and Sony built their world leadership mostly on high reliability; and Hyundai has been offering a 10-year warranty and increasing its market share steadily. In 1974, when nobody in the world gave a warranty longer than one year, Cooper Industries gave a 15-year warranty to electric power utilities on high-voltage transformer components and stood out as the leader in profitability among all Fortune 500 electrical companies. Progress has been made since then. Raytheon has established a culture at the highest level in the corporation of providing customers with mission assurance through a “no doubt” mindset. Says Bill Swanson, chairman and CEO of Raytheon: “[T]here must be no doubt that our products will work in the field when they are needed” (Raytheon Company, Technology Today, 2005, Issue 4). Similarly, with its new lifetime power train warranty, Chrysler is creating new standards for reliability.
Reliability is defined as the probability of performing all the functions (including safety functions) satisfactorily for a specified time and specified use conditions. The functions and use conditions come from the specification. If a specification misses or is vague 60% or more of the time, the reliability predictions are of very little value. This is usually the case [2]. The second big issue is: How many failures should be tolerable? Some readers may not agree that we can design for zero critical failures, but the evidence supports the contrary conclusion. We may not be able to prevent failures that we did not foresee, but we can design out all the critical failure modes that we discover during the requirements analysis and in the failure mode and effects analysis (FMEA). In over 30 years' experience, I have yet to encounter a failure mode that cannot be designed-out. The cost is usually not an issue if the FMEA is conducted and the improvements are made during the early design stage. The time specified for critical failures in the reliability definition should be the entire lifetime expected.
In this chapter we address how to write a good system specification and how to design so as not to fail. We make it clear that the design for reliability should concentrate on the critical and major failures. This prevents us from solving easy problems and ignoring the complex ones. The following incident raises issues that are central to designing for reliability.
The lessons learned from the collapse of the Interstate 35 bridge in Minnesota into the Mississippi River on August 1, 2007, which killed 13 people, give us some clues about what needs to be done. Similar failure mechanisms can be found in many large electrical and mechanical systems, such as aircraft and electric power plants.
The bridge was expanded from four lanes to six, and eventually to eight. Some wonder whether that might have played a role in its collapse. Investigators said the failure resulted from a flaw in its design. The designers had specified a metal plate that was too thin to serve as a junction of several girders.
Like many products, it gradually got exposed to higher loads, adding strain to the weak spot. At the time of the collapse, the maintenance crews had brought tons of equipment and material onto the deck for a repair job. The bridge was of a design known as a nonredundant structure, meaning that if a single part failed, the entire structure could collapse. Experts say that the pigeon dung all over the steel could have caused faster corrosion than was predicted.
This case history challenges the fundamentals of engineering taught in the universities.
Should the design margin be 100% or 800%? How does the designer determine the design margin?
Should we design for pigeons doing their dirty job?
What about designing for all the other environmental stressors, such as chemicals sprayed during snow emergencies, tornados, and earthquakes?
Should we design-in redundancy on large mechanical systems to avoid disasters?
Conventional wisdom says that redundancy delays failures but may not avoid disasters. A failure can occur in all the redundant paths at once, as in an aircraft accident where flying debris cut through all three redundant hydraulic lines.
Should we design for sudden shocks experienced by the bridge during repair and maintenance?
These concerns apply to any product, such as electronics, electrical power systems, and even a complex software design. In software, corrosion can be a metaphor for applying too many patches without understanding their interactions. Call it “software corrosion.”
The answers to the questions above should be a resounding “yes.” An engineering team should foresee all these and many more failure scenarios before starting to design. The obvious strategy is to write a good system specification by first predicting all major potential failures and avoiding them by writing robust requirements. Oversights and omissions in specifications are the biggest weakness in the design for reliability. Typically, 200 to 300 requirements are missing or vague for a reasonably complex system such as an automotive transmission.
The analysis techniques covered in this book for hardware and software help us discover many missing requirements, and a good brainstorming session for overlooked requirements always results in discovering many more. What we really need, perhaps, are paradigms based on lessons learned.
Reliability is a process. If the right process is followed, results are likely to be right. The opposite is also true in the absence of the right process. There is a saying: “If we don't know where we are going, that's where we will go.” It is difficult enough to do the right things, but it is even more difficult to know what the right things are!
Knowledge of the right things comes from practicing the use of lessons learned. Just having all the facts at your fingertips does not work. One must utilize the accumulated knowledge for arriving at correct decisions. Theory is not enough. One must keep becoming better by practicing. Take the example of swimming. One cannot learn to swim from books alone; one must practice swimming. It is okay to fail as long as mistakes are the stepping stones to failure prevention. Thomas Edison was reminded that he failed 2000 times before the success of the light bulb. His answer, “I never failed. There were 2000 steps in this process.”
One of the best techniques is to use lessons learned in the form of paradigms. They are easy to remember and they make good topics for brainstorming during design reviews.
When engineers say that a component's life is five years, they usually imply the calculation of the mean value, which says that there is a 50% chance of failure during the five years. In other words, either the supplier or the customer has to pay for 50% failures during the product cycle. This is expensive for both: a lose–lose situation. Besides, there are many indirect expenses: for warranties, production testing, and more inventories to replace failed parts. This is mean management. It has a negative return on investment. It is mean to the supplier because of loss of future business and mean to the customer in putting up with the frustrations of downtime and the cost of business interruptions. Therefore, our failure rate goal should be as lean as possible. Engineers should promise minimum life to customers, not mean life. Never use averages in reliability; they are of no use to anyone.
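The gap between a mean life and a minimum-life promise can be made concrete with a short calculation. The sketch below assumes a two-parameter Weibull life distribution, which the text does not specify; the shape and scale values and the function names are hypothetical, chosen only for illustration. It reports the fraction of units expected to fail before the mean life, and the B10 life (the age by which 10% of units have failed), which is one common way to state a minimum life.

```python
import math


def weibull_mean(shape: float, scale: float) -> float:
    """Mean life of a Weibull(shape, scale) distribution."""
    return scale * math.gamma(1.0 + 1.0 / shape)


def fraction_failed(t: float, shape: float, scale: float) -> float:
    """Weibull CDF: fraction of units that have failed by age t."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def b_life(p: float, shape: float, scale: float) -> float:
    """Age by which a fraction p of units has failed (B10 life for p = 0.10)."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


if __name__ == "__main__":
    # Hypothetical parameters giving a mean life of roughly five years.
    shape, scale = 2.0, 5.64
    mean = weibull_mean(shape, scale)
    print(f"mean life: {mean:.2f} years")
    print(f"failed by mean life: {fraction_failed(mean, shape, scale):.0%}")
    print(f"B10 life: {b_life(0.10, shape, scale):.2f} years")
```

With these assumed values, more than half of the units fail before the five-year mean, while the B10 life is under two years. A supplier quoting the mean and a supplier quoting a minimum life are making very different promises.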
It is worth repeating that the sources of most failures are incomplete, ambiguous, and poorly defined requirements. That is why we introduce unnecessary design changes and write deviations when we are in a hurry to ship a product. Look particularly for missing functions in the specifications. There is often practically nothing in a specification about modularity, reliability, safety, serviceability, logistics, human factors, reduction of “no faults found,” diagnostics capability, and prevention of warranty failures. Very few specifications address even obvious requirements, such as internal interface, external interface, user–hardware interface, user–software interface, and how the product should behave if and when a sneak failure occurs. Developing a good specification is an iterative process with inputs from the customer and the entities that are downstream in the process. Those who are trying to build reliability around a faulty specification should only expect a faulty product. Unfortunately, most companies think of reliability when the design is already approved. At this stage there is no budget and no time for major design changes. The only thing a company can do is to hope for reasonable reliability and commit to do better the next time.
To identify missing functions, a cross-functional team is necessary. At least one member from each discipline should be present, such as manufacturing, field service, and marketing, as well as a customer representative. If the specification contains only 50% of the necessary features, how can one even think of reliability? Reliability is not possible without accurate and comprehensive specifications. Therefore, writing accurate performance specifications is a prerequisite for reliability. Such specifications should aim at zero failures for the modes that result in product recalls, high downtime, and inability to diagnose. My interviews with those attending my reliability courses reveal that the dealers are unable to diagnose about 65% of the problems (no faults found). Obviously, fault isolation requirements in the specifications are necessary to reduce downtime.
To ensure the accuracy and completeness of a specification, only those who have knowledge of what makes a good specification should approve it. They must ensure that the specification is clear on what the product should never do, however stupid it may sound. For example: “There shall be no sudden acceleration during landing” for an aircraft. In addition, the marketing and sales experts should participate in writing the specification to make sure that old warranty problems “shall not” be in the new product and that there is enough gain in reliability to give the product a competitive edge.
The “shall not” specification is not limited to failures. That would be too simple. We must be able to see the complexity in this simplicity. This is called interconnectedness. We need to know that reliability is intertwined with many elements of life-cycle costs. The costs of downtime, repairs, preventive maintenance, amount of logistics support required, safety, diagnostics, and serviceability are dependent on the level of reliability. In the same spirit, we should also analyze product friendliness and modularity, which are interconnected with reliability. For example, General Motors is designing its hydrogen cars to have a single chassis for all models instead of 80 different chassis as is the case with current production. This action influences reliability in many ways. Similarly, an analysis of downtime should be conducted by service engineering staff to ensure that each fault will be diagnosed in a timely manner, repairs will be quick, and life-cycle costs will be reduced by extending the maintenance cycles or eliminating the need for maintenance altogether. The specification should be critiqued for quick serviceability and ease of access. Until the specification is written thoroughly and approved, no design work should begin. An example of the need to identify missing requirements is that nearly 1000 people around the world lost their lives while the kinks were being removed from the 290-ton McDonnell Douglas DC-10 during the 1970s. Blown-out cargo doors, shredded hydraulic lines, and engines dropped during the flight were just a few of the behemoth's early problems. It is obvious that the company did not have the right system performance specification. We rely on customers to tell us what they want, but they themselves don't know many requirements until there is a breakdown. Customers are not going to tell us that the cargo doors should not blow out during a crowded flight. It is the design team's responsibility to figure out what the customers did not say.
To find the design flaws early, a team has to view the system from various angles. You would not buy a house by just looking at the front view. You want to see it from all sides. Similarly, a product concept has to be viewed from at least the following perspectives:
Functions of the product
Range of applications
Range of environments
Active safety
Duty cycles during life
Reliability
Robustness for user or servicing mistakes
Logistics requirements
Manufacturability requirements
Internal interface requirements
External interface requirements
Installation requirements
Shipping and handling capabilities
Serviceability and diagnostics capabilities
Prognostics health monitoring
Usability on other products
Sustainability
There is a need to explain a sustainable design in the list above. Good product design is about meeting current needs without compromising the needs of future generations, such as by pollution or global warming. Current electronics and computers are not designed for sustainability. They should have been designed for reuse; the ability to recycle is not enough. Not everyone makes an effort to recycle. According to NBC News on October 4, 2007, there are over 3 billion such devices and only 15% are recycled. About 200 million tons, with mercury in the monitors and lead in the solder, wind up in landfills and often in drinking water.
Most designers are likely to miss many of the requirements noted above. This knowledge is not new. It can be included by inviting experts in these areas to brainstorm. There is no mechanism for customers to specify all of these. Suppliers that want to do productive work will teach customers how to develop good requirements as a team member. This makes the customer understand what needs to be in the contract. The point here is that if we have to fix many mistakes later (expensively), we cannot be proud of reliability, as craftsmen once were.
It is wrong to measure reliability in terms of failure rates alone. Such a negative index with unknown impact does not get much attention from management, except when there is a crisis. It is the cost of failures that is important. It should be measured by reduction in life-cycle costs. The fewer the failures, the lower is the life-cycle cost. The costs should be measured over the expected life. They are not just warranty costs; they include the cost of downtime, repairs, logistics, human errors, and product liability. When I was in charge of the reliability of the Baltimore Rapid Transit Train system design, the reliability performance was measured in terms of cost per track mile. Similarly, at Baltimore Gas & Electric, reliability is measured in terms of cost per circuit mile. Smart customers look for only one performance feature: the life-cycle cost per unit of use. Those who approve the specification should concentrate on this measure. Reliability must result in cheaper, faster, and better products.
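The measure described above, life-cycle cost per unit of use, is simple arithmetic once the cost categories are collected in one place. The following is a minimal sketch with hypothetical names and numbers; the cost categories come from the paragraph above, except the acquisition cost, which is added here as an assumption so that the total covers the whole life.

```python
from dataclasses import dataclass


@dataclass
class LifeCycleCosts:
    """Costs over the expected life, all in the same currency (values are hypothetical)."""
    acquisition: float   # assumed category, not listed in the text
    warranty: float
    downtime: float
    repairs: float
    logistics: float
    human_error: float
    liability: float

    def total(self) -> float:
        """Sum of all cost categories over the expected life."""
        return (self.acquisition + self.warranty + self.downtime + self.repairs
                + self.logistics + self.human_error + self.liability)


def cost_per_unit_of_use(costs: LifeCycleCosts, units_of_use: float) -> float:
    """Life-cycle cost divided by usage, e.g., track miles or circuit miles."""
    if units_of_use <= 0:
        raise ValueError("units_of_use must be positive")
    return costs.total() / units_of_use


if __name__ == "__main__":
    # Illustrative figures only; they do not come from any real system.
    costs = LifeCycleCosts(acquisition=100.0, warranty=10.0, downtime=20.0,
                           repairs=30.0, logistics=5.0, human_error=5.0,
                           liability=30.0)
    print(cost_per_unit_of_use(costs, units_of_use=50.0))
```

Two candidate designs can then be compared on this single number rather than on purchase price or failure rate alone.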
Why twice the life? The simple answer is that it is the fundamental taught in Engineering 101, which seems to have been forgotten. Remember 100% design margin? Second, it is cheaper than designing for one life if we measure reliability by the life-cycle cost savings. A division of Eaton Corporation requires twice-the-life at 500% return on investment [3]. It actually turns the situation into a positive cash flow, since there is nothing to be monitored if the failures occur beyond the first life. The 50% failure rate is now shifted to the second life, when the product is going to be obsolete. Engineers try to design transmission components without increasing the size or weight, using alternative means such as heat treating in a different way or eliminating joints in the assemblies. Occasionally, they may increase the size by a very minor amount, such as on wires or connectors, to expedite the solution. This is acceptable as long as the return on investment is at least 500%.
Another reason for twice the life is the need to avoid engineering changes, which seems to be obvious. Imagine a bridge designed for 20-ton trucks and a 30-year life. It may have no problems in the beginning. But the bridge degrades over time. After 10 years it may not be strong enough to take even 15 tons, and it is very likely to collapse. If it had been designed for twice the load (for 40 tons) or for a 60-year life, it should not fail at all during 30 years. It should be noted that designing for twice the load also results in twice the life most of the time, but one must still use some engineering judgment. This is similar to a 100% design margin. For the same reason, the electronic components in the aerospace industry are derated 50%. In one assembly the load-bearing capability was more than doubled by using a cheaper round key instead of a rectangular key. The round key has practically no stress concentration points. In another design, twice the life as well as twice the load capability were achieved by molding two parts as a single piece, preventing stresses at the joint. The cost was lower because no assembly was required, there were fewer part numbers in the inventory, no failures, and no downtime for customers.
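The bridge example can be turned into a rough calculation. The sketch below assumes, purely for illustration, that load capacity falls by a fixed amount each year. The text gives no degradation law, and the 0.6 ton/year rate and the function names are hypothetical, picked so that the numbers loosely echo the example.

```python
def remaining_capacity(initial_tons: float, loss_per_year: float, years: float) -> float:
    """Load capacity after `years`, assuming a constant (hypothetical) yearly loss."""
    return max(initial_tons - loss_per_year * years, 0.0)


def survives(initial_tons: float, service_load_tons: float,
             loss_per_year: float, service_years: float) -> bool:
    """True if capacity still meets the service load at the end of the service life."""
    return remaining_capacity(initial_tons, loss_per_year, service_years) >= service_load_tons


if __name__ == "__main__":
    LOSS = 0.6  # tons of capacity lost per year: an assumed figure, not from the text
    for design in (20.0, 40.0):
        for year in (0, 10, 30):
            cap = remaining_capacity(design, LOSS, year)
            print(f"design {design:.0f} t, year {year:2d}: capacity {cap:.1f} t")
        print("  carries 20 t for 30 years:", survives(design, 20.0, LOSS, 30))
```

Under these assumptions the 20-ton design is down to 14 tons after 10 years, while the 40-ton design still carries 22 tons at 30 years. A real assessment would need a measured degradation model and, as the text says, engineering judgment.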
What if we cannot design for twice the life? There are times when we cannot think of a proper solution for twice the life. Then one can go to other options, such as:
Providing redundancy on the weakest links, such as bolts, corroded joints, cables, and weak components.
Designing to fail safely such that no one is injured. For automobiles a safe mode can be that the car can switch to a degraded performance with enough time left to reach home or a repair facility.
Designing-in early prognostics-type warnings so that the user still has sufficient time to correct the situation—before failure occurs. One of the purposes of prognostics is to predict the remaining life.
The rule of thumb in aerospace for safety-related components is to design for four times the life. A U.S. Navy policy (NAVAIR) is to design safety-critical components for four times the life and conduct a test for a minimum of twice the life. The expected life should include future increases in load. Many airlines use their aircraft beyond the design life by performing more maintenance. This indirectly exposes many components to work beyond the normal one life. This is the main reason for designing for four times the life, to maintain 100% design margin all the time. Similarly, many consumers drive cars far beyond the expected 10-year life.
We should also design for peak loads, not the usual mean load. When a high-voltage cable used in power lines broke easily, engineers could not duplicate the failure with average loads. When they applied the peak loads, they could.
Designing for four times the life does not mean overdesigning. It is the art of choosing the right concept. If the attention is placed on innovation rather than marginal improvements, engineers can design for multiple lives with little or no investment, as shown earlier by several examples. They must encourage themselves to think differently rather than latching on to outdated traditional methods of increasing the size or weight. Engineers who talk of costs when solving problems usually block out creativity. They draw the boundary around the solution. Their first thought is to increase the size or weight to design for high loads. This is very common defective thinking. This is where the universities need to be more knowledgeable. We need to balance logic with creativity and should still be able to show a high return on investment.
Most engineers are of the opinion that high reliability costs more. World-class organizations embrace the paradox of increasing reliability and lowering costs simultaneously. Trade-off between reliability and cost is not always necessary. Toyota has mastered this paradigm, where high reliability and lower life-cycle costs are a way of life. Toyota has learned over the years that preventing failures is always cheaper than fixing them if the failure prevention process starts early in the design. If we capture the potential failures during the requirements analysis, we can include design for reliability without making wasteful engineering changes later. Similarly, during detailed design reviews, such tools as design failure modes, effects, and criticality analysis (FMECA), process FMECA, and fault tree analysis, if used early, can help us discover many missing, vague, and incomplete requirements. Engineering changes are the biggest source of waste in organizations, because most of them can be prevented. Here are some examples of achieving high reliability with very little or no investment. Since high reliability reduces life-cycle costs, the insignificant amount of investment does not negatively affect the win–win scenario.
A company in Brazil had designed a large warning light bulb on a control console, with a plastic cover to reduce glare. They told me that they tried all kinds of plastics for the cap but that all of them melted after a few months. Someone suggested using a glass cover. We received the usual stupid answer: “Glass will cost three times as much as plastic. The cost of the product will be high.” The bad part is that many engineers look only at the cost of the component and completely ignore the cost of losing customers and the warranty costs to the employer. They are unaware that the cost of getting a new customer is at least five times the cost of retaining a current customer. When the team calculated the life-cycle costs of plastic versus the glass cap, the return on investment (ROI) turned out to be 300% in favor of the glass material. The author requested them to put a hold on the solution because we had agreed on an ROI goal of at least 500%. The author advised the entire team to take long showers for three weeks in the hope that someone would come up with a better idea. Why? Because when you take a long shower, your brain is calmed. In this state it is able to use over 1000 billion neurons that you have never used.
It so happened that the present author (the facilitator) was the one taking the long shower. Suddenly I began to feel that the engineers were giving me a snow job! They said that they tried all the plastics and they all melted. This could not be true. There are fundamentally two types of plastics: thermoplastics, which melt with heat, and thermoset plastics, which harden with the heat. I sent them an e-mail suggesting that they try thermoset plastic. It worked. They could not melt it, no matter how much heat they put in. They sent a nice e-mail: “Thanks for the research you did for us.” The cost of the new plastic was almost the same. Zero investment. One hundredfold life. One million percent ROI!
The original European jet aircraft Comets were cracking around the windows. They were taken out of service for two years. The engineers, as usual, started to design thicker fuselage walls and proposed an enormous cost increase. Then someone suggested examining the failures and discovered that all the failures were around the corners of the windows. He suggested increasing the radius at the corners. Problem solved quickly, with hardly any investment. The ROI was at least 100,000% if you consider the ratio of the cost of thickening the fuselage to the investment in changing the radius on the corners of the windows.
At a General Motors facility, the headlamps were failing after about 1000 hours of use. The supplier was going to raise the price 100% to design for twice the life. An engineer turned the filament in the headlamp 90° to avoid harmful vibration and the life increased at least sixfold. Practically zero investment.
A dent in a Caterpillar tractor spring was causing premature breakdowns. The reason for the dent was that the spring under the tractor occasionally hit rocks on the ground. The engineers reduced the diameter of the spring such that it wouldn't hit rocks and replaced it with a tougher spring. With a very small investment they got a better than 10,000% ROI.
We can design for reliability as much as we want, but if manufacturing processes are subject to operator error and to wide swings in variability, a good design is bound to have premature failures. We need to identify manufacturing features such as the correct torque for fasteners, vulnerability to installing components backward, or vulnerability to using the wrong components. These features could be certain dimensions, alignment, proper fit of mating parts, property of a lubricant, workmanship, and so on. A product should be designed to avoid such vulnerabilities or should be testable during manufacturing to detect abnormalities. For lack of current terminology, we can call it design to avoid latent manufacturing flaws.
Let's look at an example of designing to reduce vulnerability to manufacturing variations. A new motorcycle design involved over 50 different fasteners. Following process FMEA, the production operators discovered that a separate torque was required for each fastener joint. They approached design engineers to ask if they could choose about 20 different fasteners instead of 50. This would allow them to concentrate on fewer fasteners and fewer fastening standards. Engineers were flabbergasted: Such advice coming from the hourly workers was an aha! moment for them. They standardized on a few fasteners.
Another example is from Delco Electronics (now Delphi). A plastic panel required a plating process to give it a conductive surface. The plating had been peeling off in two to three years, and Six Sigma team efforts failed to control the plating durability. Someone came up with the bright idea of adding carbon particles to the plastic to make it conductive. The entire plating process was eliminated. The cost went down by 70%. The reliability of the conductivity was now 100%! A good example of over 100,000% ROI.
The secret of controlling manufacturing flaws is to identify where inspection is needed and to design the process such that no inspection is required—if such a solution is possible.
One more example may help. In this case, the process is the focus. Assume that we want to design a dinner table with four legs such that the legs must be equal. If we cut one leg at a time, we cannot get them all equal because of the variability in the cutting process. But if we take all four legs together, and cut all of them with a single cut, they will all be equal.
In complex systems such as telecommunications and fly-by-wire systems, most system failures are not from component failures. They are from very complex interactions and sneak circuits. Failure rates are very difficult to predict. The sudden acceleration experienced by Audi 5000 users during the 1980s was a result of a software sneak failure. A bit in the integrated circuit register got stuck at zero value, which rapidly increased the speed when the gear was engaged in reverse mode. One way to prevent system failures is to monitor the health of critical features such as “stuck at” faults, critical functions, and critical inputs to the system. A possible solution is to develop a software program to determine prognostics, diagnostics, and possible fallback modes.
The following data on a major airline, announced at a Federal Aviation Administration (FAA) and National Aeronautics and Space Administration (NASA) workshop [4], show the extent of unpredicted failures:
Problems reported confidentially by airline employees: about 13,000
Number actually in airline files: about 2%, or 260
Number known to the FAA: about 1%, or 130
The sneak failures are more likely to be in embedded software, where it is impractical to do a thorough analysis. Frequently, the software requirements are faulty because they are not derived completely from the system requirements. Peter Neumann, a computer scientist at SRI International, highlights the nature of damage from software defects in the last 15 years [5]:
Wrecked a European satellite launch
Delayed the opening of the new Denver airport by one year
Destroyed a NASA Mars mission
Induced a U.S. Navy ship to destroy an airliner
Shut down ambulance systems in London, leading to several deaths
To counter such risks, we need an early warning, early enough to prevent a major mishap. This tool is prognostics health monitoring. It consists of tracking all the possible unusual events, such as signal rates, the quality of the inputs to the system, or unexpected outputs from the system, and designing in intelligence to detect unusual system behavior. The intelligence may consist of measuring important features and making a decision as to their impact. For example, a sensor input occasionally occurs after 30 milliseconds instead of 20 milliseconds as the timing requirement states. The question is: Is this an indication of a disaster? If so, the sensor calibration may be required before the failure manifests as a mishap.
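The sensor timing example can be sketched as a small monitor that watches for a pattern of late inputs instead of reacting to a single one. This is only an illustration: the 20-millisecond limit comes from the example above, while the class name, the window size, and the warning threshold are hypothetical choices.

```python
from collections import deque


class TimingMonitor:
    """Flags a possible developing fault when late sensor inputs accumulate."""

    def __init__(self, limit_ms: float = 20.0, window: int = 100, max_late: int = 3):
        self.limit_ms = limit_ms      # timing requirement from the specification
        self.max_late = max_late      # assumed threshold for raising a warning
        self._recent = deque(maxlen=window)  # True for each late sample in the window

    def record(self, interval_ms: float) -> bool:
        """Record one input arrival time; return True if a warning should be raised."""
        self._recent.append(interval_ms > self.limit_ms)
        return sum(self._recent) >= self.max_late


if __name__ == "__main__":
    monitor = TimingMonitor(limit_ms=20.0, window=10, max_late=2)
    for interval in (20, 20, 30, 20, 30):
        if monitor.record(interval):
            print(f"warning: repeated late inputs (latest {interval} ms)")
```

A sliding window is used so that old late samples eventually drop out; whether an occasional late input is harmless or an early sign of a mishap remains a judgment the design team has to encode in the threshold.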
In summary we can say that we need to define functions correctly. We need to design not to fail, and we need to implement all the paradigms covered in this chapter, including designing to avoid manufacturing problems. Once I was at a company meeting where the customers were asked to describe the warranty they would wish to have. One of them said (and others agreed): No warranty is the best warranty. Very few understood the paradox—the best warranty would be one that would never experience a claim. In other words, the customers wanted a failure-free design for reliability.
[1] Kuo, W., Compatibility and simplicity: the fundamentals of reliability, IEEE Trans. Reliab., vol. 56, Dec. 2007.
[2] Raheja, D. G., Product Assurance Technologies, Design for Competitiveness, Inc., 2002.
[3] Raheja, D. G., and Allocco, M., Assurance Technologies Principles and Practices: A Product, Process, and System Safety Perspective, 2nd ed., Wiley, Hoboken, NJ, 2006, Appendix.
[4] Farrow, D. R., presented at the Fifth International Workshop on Risk Analysis and Performance Measurement in Aviation, sponsored by FAA and NASA, Baltimore, Aug. 19–21, 2003.
[5] Mann, C. C., Why software is so bad, Technol. Rev., July-Aug. 2002.
Chapter 2
Reliability Design Tools
Joseph A. Childs
The importance of designing reliability into a product was the focus of Chapter 1. As technology continues to advance, products continue to increase in complexity. Their ability to perform when needed and to last longer are becoming increasingly important. Similarly, it is becoming more and more critical to be able to predict failure occurrences for today's products more effectively and more thoroughly. This means that reliability engineers must be increasingly effective at understanding what is at stake, assessing reliability, and assuring that product reliability maturity is at the level required. To assure this effectiveness, tools have been developed in the reliability engineering discipline. This chapter is a summary of such tools that exist in all aspects of a product's life: from invention, design, production, and testing, to its use and end of life.
The automation of reliability methods into tools is important for the repeatability of the process and results, for value-added benefits in terms of cost savings during the application of design analysis methods, and for achieving desired results faster, improving design cycle times. As design processes evolve, the tools should evolve. Innovation in the current electrical and mechanical design tool suite should include interfacing to the current design reliability tool suite.
One important thing about reliability engineering as a discipline is that it is involved in all parts of a product's life: from product inception, its manufacture and use, to its end of life. This is because reliability is an intrinsic part of a product's essence, whether it is a “throwaway” coffee cup or a sophisticated spacecraft intended to last 10 years in outer space. As an intrinsic parameter, it must be taken into account in the definition, design, building, test, and use (and abuse) of the product. For each program phase, tools have been devised to enable engineers to gain insight into the requirements and status of reliability. Figure 1 provides a generalized flow, representing any product's life cycle and how reliability mirrors those phases throughout a development program. Figure 2 notes key activities and events throughout a product's life cycle.
Figure 1 Reliability involvement in program and product life.
Figure 2 Program and product life tasks.
The reliability tools are designed to help the reliability function to assess and enhance the design so that the product is capable of meeting and exceeding its goals.
In this chapter we provide an overview of many of the tools used in the design life of a product: what they are, how they are performed, and how their results are used by the various design disciplines—reliability, electrical, mechanical, and software design, test, and manufacturing engineering. Figure 3 illustrates the reliability tools that are discussed here, when they might be used in a product's life cycle, and how these tools match the actions and events in each phase of a product's lifetime.
Figure 3 Program and product life tasks, tied to reliability tasks.
