NEXT GENERATION HALT AND HASS ROBUST DESIGN OF ELECTRONICS AND SYSTEMS
A NEW APPROACH TO DISCOVERING AND CORRECTING SYSTEMS RELIABILITY RISKS
Next Generation HALT and HASS presents a major paradigm shift from reliability prediction-based methods to discovery of electronic systems reliability risks. This is achieved by integrating highly accelerated life test (HALT) and highly accelerated stress screen (HASS) into a physics of failure based robust product and process development methodology. The new methodologies challenge misleading and sometimes costly misapplication of probabilistic failure prediction methods (FPM) and provide a new deterministic map for reliability development. The authors clearly explain the new approach with a logical progression of problem statement and solutions.
The book helps engineers employ HALT and HASS by demonstrating why the misleading assumptions used for FPM are invalid. Next, the application of HALT and HASS empirical discovery methods to quickly find unreliable elements in electronics systems gives readers practical insight into the techniques.
The physics of HALT and HASS methodologies are highlighted, illustrating how they uncover and isolate software failures due to hardware–software interactions in digital systems. The use of empirical operational stress limits for the development of future tools and reliability discriminators is described.
Page count: 394
Year of publication: 2016
Cover
Title Page
Series Editor’s Foreword
Preface
List of Acronyms
Introduction
1 Basis and Limitations of Typical Current Reliability Methods and Metrics
1.1 The Life Cycle Bathtub Curve
1.2 HALT and HASS Approach
1.3 The Future of Electronics: Higher Density and Speed and Lower Power
1.4 Use of MTBF as a Reliability Metric
1.5 MTBF: What is it Good For?
1.6 Reliability of Systems is Complex
1.7 Reliability Testing
1.8 Traditional Reliability Development
Bibliography
2 The Need for Reliability Assurance Reference Metrics to Change
2.1 Wear-Out and Technology Obsolescence of Electronics
2.2 Semiconductor Life Limiting Mechanisms
2.3 Lack of Root Cause Field Unreliability Data
2.4 Predicting Reliability
2.5 Reliability Predictions – Continued Reliance on a Misleading Approach
2.6 Stress–Strength Diagram and Electronics Capability
2.7 Testing to Discover Reliability Risks
2.8 Stress–Strength Normal Assumption
2.9 A Major Challenge – Distributions Data
2.10 HALT Maximizes the Design’s Mean Strength
2.11 What Does the Term HALT Actually Mean?
Bibliography
3 Challenges to Advancing Electronics Reliability Engineering
3.1 Disclosure of Real Failure Data is Rare
3.2 Electronics Materials and Manufacturing Evolution
Bibliography
4 A New Deterministic Reliability Development Paradigm
4.1 Introduction
4.2 Understanding Customer Needs and Expectations
4.3 Anticipating Risks and Potential Failure Modes
4.4 Robust Design for Reliability
4.5 Diagnostic and Prognostic Considerations and Features
4.6 Knowledge Capture for Reuse
4.7 Accelerated Test to Failure to Find Empirical Design Limits
4.8 Design Confirmation Testing: Quantitative Accelerated Life Test
4.9 Limitations of Success Based Compliance Test
4.10 Production Validation Testing
4.11 Failure Analysis and Design Review Based on Test Results
Bibliography
5 Common Understanding of HALT Approach is Critical for Success
5.1 HALT – Now a Very Common Term
5.2 HALT – Change from Failure Prediction to Failure Discovery
5.3 Serial Education of HALT May Increase Fear, Uncertainty and Doubt
6 The Fundamentals of HALT
6.1 Discovering System Stress Limits
6.2 HALT is a Simple Concept – Adaptation is the Challenge
6.3 Cost of Reliable vs Unreliable Design
6.4 HALT Stress Limits and Estimates of Failure Rates
6.5 Defining Operational Limit and Destruct Limits
6.6 Efficient Cooling and Heating in HALT
6.7 Applying HALT
6.8 Thermal HALT Process
6.9 Random Vibration HALT
6.10 Product Configurations for HALT
6.11 Lessons Learned from HALT
6.12 Failure Analysis after HALT
7 Highly Accelerated Stress Screening (HASS) and Audits (HASA)
7.1 The Use of Stress Screening on Electronics
7.2 ‘Infant Mortality’ Failures are Reliability Issues
7.3 Developing a HASS
7.4 Unique Pneumatic Multi-axis RS Vibration Characteristics
7.5 HALT and HASS Case History
Bibliography
7.6 Benefits of HALT and HASS with Prognostics and Health Management (PHM)
Bibliography
8 HALT Benefits for Software/Firmware Performance and Reliability
8.1 Software – Hardware Interactions and Operational Reliability
8.2 Stimulation of Systematic Parametric Variations
Bibliography
9 Design Confirmation Test
9.1 Introduction to Accelerated Life Test
9.2 Accelerated Degradation Testing
9.3 Accelerated Life Test Planning
9.4 Pitfalls of Accelerated Life Testing
9.5 Analysis Considerations
Bibliography
10 Failure Analysis and Corrective Action
10.1 Failure Analysis and Knowledge Capture
10.2 Review of Test Results and Failure Analysis
10.3 Capture Test and Failure Analysis Results for Access on Follow-on Projects
10.4 Analyzing Production and Field Return Failures
Bibliography
11 Additional Applications of HALT Methods
11.1 Future of Reliability Engineering and HALT Methodology
11.2 Winning the Hearts and Minds of the HALT Skeptics
11.3 Test of No Fault Found Units
11.4 HALT for Reliable Supplier Selection
11.5 Comparisons of Stress Limits for Reliability Assessments
11.6 Multiple Stress Limit Boundary Maps
11.7 Robustness Indicator Figures
11.8 Focusing on Deterministic Weakness Discovery Will Lead to New Tools
11.9 Application of Limit Tests, AST and HALT Methodology to Products Other Than Electronics
Bibliography
Appendix: HALT and Reliability Case Histories
A.1 HALT Program at Space Systems Loral
A.2 Software Fault Isolation Using HALT and HASS
A.3 Watlow HALT and HASS Application
A.4 HALT and HASS Application in Electric Motor Control Electronics
A.5 A HALT to HASS Case Study – Power Conversion Systems
Index
Wiley Series in Quality and Reliability Engineering
End User License Agreement
Chapter 1
Table 1.1 Reliability at 100 Hours for item A, item B and item C
Table 1.2 Estimated parameters for item D, item E and item F
Table 1.3 Reliability at 50 hours for item D, item E, and item F
Chapter 2
Table 2.1 Semiconductor wear-out mechanisms activation energies
Table 2.2 Reported activation energy for silicon semiconductor wear-out mechanisms.
Table 2.3 Causes of Failure
Table 2.4 Results of the 1987 SINCGARS NDI candidate test
Chapter 7
Table 7.1 Five-year estimate of failures prevented by HASS testing
Table 7.2 Cumulative cost avoidance of failures using HASS
Chapter 9
Table 9.1 ALT test plan table to capture key factors
Appendix
Table A4.1 Test sequence of ten sample modules
Table A4.2 Example of cold step stress test
Table A4.3 Summary of cold step stress HALT results
Table A4.4 Temperature Profile in Hot Step Stress HALT
Table A4.5 Results of hot step stress test of PWB modules
Table A4.6 Vibration profile for combined temperature cycle and vibration HALT
Table A4.7 Comparison of HALT and HASS results with field failure data
Chapter 1
Figure 1.1 Dilbert, management and reliability.
Figure 1.2 The life cycle bathtub curve
Figure 1.3 Realistic field life cycle bathtub curve
Figure 1.4 The ‘drain’ of technological obsolescence in the life cycle bathtub curve
Figure 1.5 Reliability functions for item A, item B and item C
Figure 1.6 Hazard functions for item A, item B and item C
Figure 1.7 Item D: Graphical analysis of survival data
Figure 1.8 Item E: Graphical analysis of survival data
Figure 1.9 Item F: Graphical analysis of survival data
Figure 1.10 Reliability functions for item D, item E, and item F
Figure 1.11 Hazard functions for item D, item E, and item F
Figure 1.12 Examples of where latent defects are introduced during assembly fabrication
Figure 1.13 Impact of reliability tasks on electronics.
Figure 1.14 The ESS vibration power spectral density spectrum guideline from NAVMAT 9492 (US Navy)
Chapter 2
Figure 2.1 Burned battery assembly after suffering a thermal runaway.
Figure 2.2 Chargeability of failures based upon test data
Figure 2.3 Comparison of vibration displacement
Figure 2.4 Comparison of vibration response and resistor location
Figure 2.5 Comparison of various handbook methodologies
Figure 2.6 Comparison of predicted versus demonstrated values for DoD systems
Figure 2.7 The stress–strength diagram for reliability
Figure 2.8 The stress–strength diagram and the effect of fatigue damage
Figure 2.9 The intersection of the stress and strength curves resulting in failure PDFs
Figure 2.10 Reliability margin in the stress–strength diagram
Figure 2.11 The stress–strength curves in a reliable system
Figure 2.12 The stress–strength curves overlap results in failure PDF
Figure 2.13 Fixed and known strength, but random variable for stress, X
Figure 2.14 Fixed and known stress but random variable for strength, Y
Figure 2.15 Both stress and strength are random variables
Figure 2.16 The distributions of a system’s strength is a sum of individual components and subsystems, each with its own distribution
Figure 2.17 A latent defect subpopulation resulting from a manufacturing process excursion
Figure 2.18 Cisco normalized return rate versus thermal operating margin.
Figure 2.19 Cisco normalized RMA rate versus active components count.
Chapter 4
Figure 4.1 New product and process development flow
Figure 4.2 Lean quality function deployment chart to translate needs to design features [1].
Figure 4.3 Parameter diagram elements [2,5].
Figure 4.4 Boundary diagram derived from functional block diagram [2,5]
Figure 4.5 Good design, good discussion, good dissection integration [5,8,10]
Figure 4.6 Basic design review based on failure modes (DRBFM) format [5,8,10]
Figure 4.7 Levels of DRBFM in complex system development [3,5,8]
Figure 4.8 Typical process flow diagram
Figure 4.9 Typical process DRBFM format [5,8,10]
Figure 4.10 Timing and application of DRBFM in product and process development [3,8]
Figure 4.11 Summary of typical failure mechanisms to consider in design
Figure 4.12 Reducing stress–strength interference [6].
Figure 4.13 Effect of aging or wear on product strength and probability of failure [6].
Figure 4.14 Illustration of a probabilistic solution using reliability based design optimization [4,11].
Figure 4.15 Phased DOE approach to optimize design choices [9].
Figure 4.16 Illustration of degradation analysis.
Figure 4.17 Step stress limit test profile
Figure 4.18 Step stress limit test profile for electronic modules with increasing temperature deltas
Figure 4.19 Robustness indicator diagram [7].
Figure 4.20 Stress boundary map
Figure 4.21 Design review based on test results (DRBTR) process flow [8].
Figure 4.22 Design review based on test results (DRBTR) format [8].
Chapter 5
Figure 5.1 The change in orientation for the designed strength of electronics
Chapter 6
Figure 6.1 Typical mounting of a circuit board with aluminum ducts directing air flow across the UUT.
Figure 6.2 Low mass triaxial and single axis accelerometers.
Figure 6.3 PSD showing peak vibration resonant frequency shift of PCB at −35°C and 70°C.
Figure 6.4 SEM picture of the BGA solder joint shows cracks on the top with some connection.
Figure 6.5 Typical HALT chamber.
Figure 6.6 Decreasing temperature HALT profile and thermal lag of UUT
Figure 6.7 Increasing temperature HALT profile and thermal lag of UUT
Figure 6.8 Vibration HALT profile
Figure 6.9 A thermal isolation chamber inside a HALT chamber
Figure 6.10 Good, better, and best component placements learned from HALT
Chapter 7
Figure 7.1 Stress/strength diagram with subsystems strength distributions
Figure 7.2 Stress strength diagram with subsystems latent defect distributions
Figure 7.3 HASS uses some fatigue life to precipitate latent defects
Figure 7.4 Stress levels for HALT and HASS
Figure 7.5 HASS precipitation and detection screens
Figure 7.6 Stress regime of a typical HASS process
Figure 7.7 A PSD for the top of a multi-axis RS vibration table
Figure 7.8 Correlation between air pressure and hammer frequency (Courtesy of Charles Felkins)
Figure 7.9 Otis field and HALT failures comparison
Figure 7.10 The effectiveness of a leaky capacitor introduced after HALT was completed
Figure 7.11 Comparisons of Weibull analysis of tested and untested populations.
Figure 7.12 Cardiac stress test.
Figure 7.13 Generic parametric signature for reliability discriminators during HASS
Chapter 8
Figure 8.1 Measured low to high propagation delay versus case temperature of a Fairchild octal buffer.
Figure 8.2 Potential distribution signal propagation delay in mass production.
Figure 8.3 Applied temperatures skewing of timing distributions in semiconductors
Figure 8.4 Propagation delays for short PCB trace.
Figure 8.5 Thermograph of an operating circuit board showing thermal gradients across board.
Figure 8.6 Potential contributors to poor signal integrity in electronics
Figure 8.7 Cross-section of simple circuit board
Figure 8.8 Potential distributions of parametric timing variations during prototype/pilot production builds
Figure 8.9 Mass production and the wider distributions causing marginal operation
Figure 8.10 Colder temperatures skew signal speeds higher in a sample of prototype hardware
Figure 8.11 High temperatures skew signal speeds lower in a sample of prototype hardware
Figure 8.12 Rapid thermal gradients shift dimensions and parametrics of active devices
Chapter 9
Figure 9.1 Life–stress relationship plot for quantitative accelerated life test
Figure 9.2 Degradation analysis plot with failure threshold [6].
Chapter 10
Figure 10.1 Design review based on test results (DRBTR) results format [5].
Figure 10.2 Robustness indicator figure showing test results margin for DRBTR discussion
Chapter 11
Figure 11.1 Temperature and voltage four corner test
Figure 11.2 Voltages and temperature empirical operational boundaries
Figure 11.3 Empirical operational boundaries showing stress/strength distributions
Figure 11.4 Two-dimensional safe stress margins for ongoing reliability monitoring
Figure 11.5 Robustness indicator figure
Appendix
Figure A1.1 Statistical nature of stress vs. strength
Figure A1.2 Probabilities of not finding issues during qualification and acceptance temperature testing
Figure A1.3 Stress testing principle.
Figure A1.4 Effect of time on strength.
Figure A2.1 Freeze spray is applied to a suspect component
Figure A2.2 Power resistors are applied to a suspect circuit
Figure A2.3 Fault type summary by percentage
Figure A2.4 Fault type summary by failure type
Figure A3.1 Typical HASS and HASA profile used at Watlow
Figure A5.1 New upper operating limits for model B units with high current component
Kirk A. Gray
Accelerated Reliability Solutions, LLC, Colorado, USA
John J. Paschkewitz
Product Assurance Engineering, LLC, Missouri, USA
This edition first published 2016
© 2016 John Wiley & Sons, Ltd
Registered office: John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the authors to be identified as the authors of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and authors have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the authors shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
Names: Gray, Kirk, author. | Paschkewitz, John James, author.
Title: Next generation HALT and HASS : robust design of electronics and systems / by Kirk Gray, John James Paschkewitz.
Description: Chichester, UK ; Hoboken, NJ : John Wiley & Sons, 2016. | Includes bibliographical references and index.
Identifiers: LCCN 2015044935 | ISBN 9781118700235 (cloth)
Subjects: LCSH: Accelerated life testing. | Electronic systems–Design and construction. | Electronic systems–Testing.
Classification: LCC TA169.3 .G73 2016 | DDC 621.381028/7–dc23
LC record available at http://lccn.loc.gov/2015044935
A catalogue record for this book is available from the British Library.
The Wiley Series in Quality & Reliability Engineering aims to provide a solid educational foundation for researchers and practitioners in the field of quality and reliability engineering and to expand their knowledge base by including the latest developments in these disciplines.
The importance of quality and reliability to a system can hardly be disputed. Product failures in the field inevitably lead to losses in the form of repair costs, warranty claims, customer dissatisfaction, product recalls, loss of sales and, in extreme cases, loss of life.
Engineering systems are becoming increasingly complex, with added functions and capabilities; however, the reliability requirements remain the same or are even growing more stringent. Also the rapid development of functional safety standards increases pressure to achieve ever higher reliability as it applies to system safety. These challenges are being met with design and manufacturing improvements and, to no lesser extent, by advancements in testing and validation methods.
Since its introduction in the early 1980s, the concept and practice of highly accelerated life testing has undergone significant evolution. This book, Next Generation HALT and HASS, written by Kirk Gray and John Paschkewitz, both of whom I have the privilege to know personally, takes the concept of rapid product development to a new level. Both authors have lifelong experience in product testing, validation and applications of HALT to product development processes. HALT and HASS have quickly become mainstream product development tools, and this book is the next step in cementing their place as an integral part of the design process. It offers an excellent mix of theory, practice, useful applications and common sense engineering, making it a perfect addition to the Wiley Series in Quality and Reliability Engineering.
The purpose of this Wiley book series is not only to capture the latest trends and advancements in quality and reliability engineering but also to influence future developments in these disciplines. As quality and reliability science evolves, it reflects the trends and transformations of the technologies it supports. A device utilizing a new technology, whether it be a solar panel, a stealth aircraft or a state-of-the-art medical device, needs to function properly and without failures throughout its mission life. New technologies bring about new failure mechanisms, new failure sites and new failure modes, and HALT has proven to be an excellent tool in discovering those weaknesses, especially where new technologies are concerned. It also promotes the advanced study of the physics of failure, which improves our ability to address those technological and engineering challenges.
In addition to the transformations associated with changes in technology, the field of quality and reliability engineering has been going through its own evolution, developing new techniques and methodologies aimed at process improvement and at reducing the number of design and manufacturing related failures. Here again, HALT and HASS form an integral part of that transformation.
Among the current reliability engineering trends, life cycle engineering concepts have also been steadily gaining momentum by finding wider applications to life cycle risk reduction and minimization of the combined cost of design, manufacturing, warranty and service. Life cycle engineering promotes a holistic approach to the product design in general and quality and reliability in particular.
Despite its obvious importance, quality and reliability education is paradoxically lacking in today’s engineering curriculum. Very few engineering schools offer degree programs, or even a sufficient variety of courses, in quality or reliability methods; and the topic of HALT and HASS receives almost no coverage in today’s engineering student curriculum. Therefore, the majority of the quality and reliability practitioners receive their professional training from colleagues, professional seminars, publications and technical books. The lack of opportunities for formal education in this field emphasizes too well the importance of technical publications for professional development.
We are confident that this book, as well as this entire book series, will continue Wiley’s tradition of excellence in technical publishing and provide a lasting and positive contribution to the teaching and practice of reliability and quality engineering.
Dr. Andre Kleyner, Editor of the Wiley Series in Quality & Reliability Engineering
This book is written for practicing engineers and managers working in new product development, product testing or sustaining engineering to improve existing products. It can also be used as a textbook in courses in reliability engineering or product testing. It is focused on incorporating empirical limit determination with accelerated stress testing into a physics of failure approach for new product and process development. It overcomes the limitations, weaknesses and assumptions prevalent in prediction based reliability methods that have prevailed in many industries for decades.
We are especially grateful to the late Dr Gregg Hobbs for being the creator of HALT and HASS and a teacher and mentor.
We especially appreciate Dr Michael Pecht, the founder of CALCE at the University of Maryland, for encouraging us to write this book and for sharing CALCE material.
We would like to indicate our gratitude to our colleagues who provided support, input, review and feedback that helped us create this book. We thank Andrew Roland for permission to use his article MTBF – What Is It Good For? We would also like to thank Charlie Felkins for the pictures and drawings he provided and Andrew Riddle of Allied Telesis Labs for use of their case history. We are also grateful for the assistance of Fred Schenkelberg in providing support, contributions and promotion of this book.
We would like to thank Mark Morelli for material used in the book and for our early work together implementing HALT and HASS at Otis Elevator, and Michael Beck for his support in implementing HALT and HASS and for access to information on DRBFM. We are grateful to Bill Haughey for introducing us to GD3, DRBFM and DRBTR, as well as to James McLeish for his support and work on Robust Design, Failure Analysis and GD3 source information.
We want to acknowledge Watlow, and in particular Chris Lanham, for providing the opportunity to expand and apply our reliability knowledge, as well as Mark Wagner for his case history contribution to the Appendix.
Reliasoft granted us permission to use material in this book and we appreciate the support and encouragement from Lisa Hacker. We thank Linda Ofshe for her technical editing of early chapters, Richard Savage for his support and encouragement and Monica Nogueira at SAE International for her review of manuscript sections and resolving questions on copyrighted material.
Ella Mitchell, Liz Wingett and Pascal Raj Francois, who are our contacts at John Wiley & Sons, have guided us through the process of writing a technical book and all the details of manuscript development and preparation for publication.
ALT: Accelerated Life Testing
AMSAA: Army Materiel Systems Analysis Activity
AST: Accelerated Stress Tests
CALT: Calibrated Accelerated Life Test
CDF: Cumulative Distribution Function
CHC: Channel Hot Carrier
CND: Can Not Duplicate
CRE: Certified Reliability Engineer
DoD: Department of Defense
DFX: Design for X (Test, Cost, Manufacture & Assembly, etc.)
DFR: Design for Reliability
DFSS: Design for Six Sigma
DOE: Design of Experiments
DRBFM: Design Review Based on Failure Modes
DRBTR: Design Review Based on Test Results
DVT: Design Verification Test
ED: Electrodynamic (Shaker)
EM: Electromigration
ESS: Environmental Stress Screening
FEA: Finite Element Analysis
FIT: Failure in Time
FLT: Fundamental Limit of Technology
FMEA: Failure Modes and Effects Analysis
FMECA: Failure Modes, Effects & Criticality Analysis
FRACAS: Failure Reporting, Analysis, & Corrective Action System
GD3: Good Design, Good Discussion, Good Dissection
HALT: Highly Accelerated Life Test
HASS: Highly Accelerated Stress Screening
HASA: Highly Accelerated Stress Audit (Sampling)
HTOL: High Temperature Operating Life
HCI: Hot Carrier Injection
ICs: Integrated Circuits
LCD: Liquid Crystal Display
LCEP: Life Cycle Environmental Profile
MSM: Matrix Stressing Method
MTBF: Mean Time Between Failures
MTTF: Mean Time To Failure
MWD: Measurement While Drilling
NBTI: Negative Bias Temperature Instability
NDI: Non Developmental Item
NFF: No Fault Found
NPF: No Problem Found
OEM: Original Equipment Manufacturer
ORT: Ongoing Reliability Test
PoF: Physics of Failure
PRAT: Production Reliability Acceptance Test
PTH: Plated Through Holes
PWBA: Printed Wiring Board Assembly
QFD: Quality Function Deployment
RoHS: Restriction of Hazardous Substances
RMA: Returned Material Authorization
RMS: Reliability, Maintainability, Supportability
RDT: Reliability Demonstration Test
SINCGARS: Single Channel Ground and Airborne Radio System
SPC: Statistical Process Control
TDDB: Time Dependent Dielectric Breakdown
VOC: Voice of the Customer
WCA: Worst Case Analysis
This book presents a new paradigm for reliability practitioners. It is focused on incorporating empirical limit determination with accelerated stress testing into a physics of failure approach for new product and process development. This extends the basics of highly accelerated life test (HALT) and highly accelerated stress screens (HASS) presented in earlier books and contrasts this new approach with the limitations, weaknesses, and assumptions in prediction based reliability methods that have prevailed in many industries for decades. It addresses the lack of understanding of why most systems fail, which has led to reliance on reliability predictions.
Chapters 1, 2 and 3 examine the basis and limitations of statistical reliability prediction methods and show why they fail to provide useful estimates of reliability in new products, even when those products are derivatives of previous ones. They also address the prevailing focus on estimating life or reliability with metrics such as MTBF (mean time between failures) and MTTF (mean time to failure) and the misleading aspects of using these metrics in reliability programs. This includes difficulties and limitations in using field return data on previous products or results of reliability demonstration tests to derive an MTBF or MTTF estimate on new products. The section concludes with an assessment of practices in many reliability programs and shows how they can be inadequate, resulting in warranty claims, customer dissatisfaction and increased cost to correct field problems. These typical practices include reactive reliability efforts conducted too late in product development to influence the design, success based testing that fails to find product weaknesses, and a focus on deliverable data to meet the customer’s qualification requirements.
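One way to see why MTBF can mislead: under the constant failure rate (exponential) model that MTBF-based predictions commonly assume, the probability of surviving to the MTBF itself is only about 37%. The following is a minimal sketch of that arithmetic, assuming the exponential model; the function name and the 50,000-hour figure are illustrative and not taken from the book.

```python
import math


def reliability_exponential(t_hours: float, mtbf_hours: float) -> float:
    """Survival probability R(t) = exp(-t / MTBF) under a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return math.exp(-t_hours / mtbf_hours)


# Hypothetical MTBF chosen only for illustration.
mtbf = 50_000.0

# Probability that a unit is still working when its age equals the MTBF.
r_at_mtbf = reliability_exponential(mtbf, mtbf)

# Probability of surviving one year of continuous operation (8,760 hours).
r_one_year = reliability_exponential(8_760.0, mtbf)

print(f"R(t = MTBF)   = {r_at_mtbf:.3f}")
print(f"R(t = 1 year) = {r_one_year:.3f}")
```

Under this assumption, a large MTBF says nothing about when wear-out begins or which failure mechanisms dominate; it only fixes the single parameter of a model that may not describe the product at all.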
Chapter 4 proposes a new approach to ensuring product reliability. This begins with a focused risk assessment to anticipate potential failure modes and weaknesses based on changes from the current product knowledge base as well as new components and materials needed to meet customer needs. This assessment draws on knowledge of subject matter experts and tools to identify likely failure mechanisms and causes. These risks are then addressed with robust design to ensure sufficient margin to withstand the variability of anticipated operating environments and production strength variability. The robust design also considers prognostics and health management to detect degradation and wear out by monitoring key parameters during operation. This design approach is followed by phased robustness testing of prototypes using accelerated stress tests, including HALT, to find product limits and design margins as well as to identify design weaknesses. After the weaknesses have been identified, design changes to overcome the issues are completed and verified in HALT or accelerated stress tests.
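The idea of design margin against variable stress and variable strength can be illustrated with the classic stress–strength interference calculation. The sketch below assumes that stress and strength are independent and normally distributed, a simplification that frequently does not hold for real electronics; every number in it is hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interference_probability(mu_stress: float, sd_stress: float,
                             mu_strength: float, sd_strength: float) -> float:
    """P(stress > strength) for independent, normally distributed stress and strength."""
    margin = mu_strength - mu_stress
    spread = math.sqrt(sd_stress ** 2 + sd_strength ** 2)
    return normal_cdf(-margin / spread)


# Hypothetical thermal example, all values in degrees Celsius.
baseline = interference_probability(mu_stress=70.0, sd_stress=10.0,
                                    mu_strength=100.0, sd_strength=10.0)

# Same stress distribution, but the mean strength is raised by 30 degrees.
improved = interference_probability(mu_stress=70.0, sd_stress=10.0,
                                    mu_strength=130.0, sd_strength=10.0)

print(f"Overlap before design change: {baseline:.2e}")
print(f"Overlap after design change:  {improved:.2e}")
```

Raising the mean strength shrinks the overlap by orders of magnitude in this toy case, which is the intuition behind seeking margin. The absolute probabilities should not be read as predictions, because the true distributions are rarely known.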
With the empirical limits determined and weaknesses corrected, quantitative accelerated life test (ALT) can be used to estimate the reliability of selected components or assemblies where the operating environment stresses can be determined and applied. ALT provides an indication of expected reliability in the reduced time available with today’s shorter product development schedules. On systems with higher levels of integration, correctly identifying the combined stresses and accelerating them in a test becomes very difficult, so validation testing at system level in the actual application may be needed to assess reliability and evaluate interfaces, which are often the source of reliability issues. Finally, production variability, process issues and supplier component variability need to be addressed with production screening tests and corrective action on the issues discovered.
Chapters 5 and 6 detail the Highly Accelerated Life Test (HALT), from concept through process and planning to a description of how to apply it. They also cover how to conduct failure analysis and ensure corrective action for the product weaknesses that are discovered. Topics include selecting the stresses to apply in HALT, configuring the product for test, applying thermal, vibration and power variation stresses, monitoring product operation, detecting failures, and performing failure analysis after HALT.
Chapter 7 covers the use of production screening for electronics using Highly Accelerated Stress Screening (HASS) to find infant mortality issues and ensure the consistency and control of production processes. The HASS process is covered in detail, including precipitation and detection screens, the stresses applied in HASS, the safety of the screen process and verification of the HASS process. The effectiveness of HASS is discussed, followed by the transition to Highly Accelerated Stress Audit (HASA) sampling and cost avoidance.
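The step stress idea behind thermal HALT can be sketched as a simple profile generator: the chamber setpoint moves away from ambient in fixed increments, dwelling at each step while the unit is functionally tested, until the unit stops operating correctly. The start temperature, step size, dwell time, stop temperature and the pass/fail rule below are hypothetical placeholders, not values recommended by the authors.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Step:
    setpoint_c: float  # chamber setpoint in degrees Celsius
    dwell_min: float   # dwell time at the setpoint in minutes


def step_stress_profile(start_c: float, stop_c: float,
                        step_c: float, dwell_min: float) -> List[Step]:
    """Build a stepped profile from start_c toward stop_c (inclusive)."""
    if step_c <= 0:
        raise ValueError("step_c must be positive")
    direction = 1 if stop_c >= start_c else -1
    steps: List[Step] = []
    n = 0
    while True:
        setpoint = start_c + direction * n * step_c
        if direction * (setpoint - stop_c) > 0:  # stepped past the stop value
            break
        steps.append(Step(setpoint, dwell_min))
        n += 1
    return steps


def last_passing_setpoint(profile: List[Step],
                          unit_passes: Callable[[float], bool]) -> Optional[float]:
    """Return the last setpoint at which the functional test still passed."""
    last_ok: Optional[float] = None
    for step in profile:
        if not unit_passes(step.setpoint_c):
            break
        last_ok = step.setpoint_c
    return last_ok


# Hypothetical cold step stress: 20 C down to -60 C in 10 C steps, 10 min dwell.
cold_profile = step_stress_profile(20.0, -60.0, 10.0, 10.0)

# Stand-in for a real functional test: this imaginary unit fails below -45 C.
lowest_ok = last_passing_setpoint(cold_profile, lambda temp_c: temp_c > -45.0)
print(f"Steps planned: {len(cold_profile)}, lowest setpoint passed: {lowest_ok} C")
```

In practice the pass/fail callback would be replaced by monitored functional tests, and the profile would be adapted as limits are discovered rather than fixed in advance.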
Chapter 8 includes HALT and HASS examples to illustrate the application and effectiveness of discovering empirical limits, correcting design weaknesses and ensuring repeatable production processes. The section concludes with the benefits of HALT for software and firmware performance and reliability.
Chapter 9 covers the application of quantitative Accelerated Life Test (ALT) at component and subassembly levels when stresses can be correlated to the application environment and accelerated to levels between the operational level and the empirical limit of the product under test for the selected stresses used in the test. At higher levels of assembly, the combined stresses encountered in application become more difficult to apply and control to appropriate levels in an accelerated test. For these assemblies, validation testing in the application system at the prototype stage becomes necessary to evaluate interfaces and find potential problems that could not be discovered at the component or subassembly level.
Chapter 10 examines failure analysis, managing corrective action and capturing learning in the knowledge base for access by follow-on project teams, allowing them to build on previous work rather than relearn it. This includes Design Review Based on Test Results (DRBTR) as a method for reviewing test results, deciding on corrective actions and tracking progress to completion and closure. Follow-up with production screening, ongoing reliability test during production and analysis of field data conclude the chapter.
Chapter 11 covers additional applications of the HALT methodology. These topics include:
future of reliability engineering and the HALT methodology
winning the hearts and minds of the HALT skeptics
analysis of field failures in HALT
test of no defect found units in HALT
HALT for reliable supplier selection
comparisons of stress limits for reliability assessments
multiple stress limit boundary maps and robustness indicator figures
focusing on deterministic weakness discovery will lead to new tools
application of empirical limit test, AST and HALT concepts to products other than electronics
These areas help the reliability practitioner apply the HALT methodology and tools to solve problems they often face in both product development and sustaining engineering of current products.
The appendix includes data from case studies that illustrate the effectiveness of the HALT methods in improving product reliability.
Reliability cannot be achieved by adhering to detailed specifications. Reliability cannot be achieved by formula or by analysis. Some of these may help to some extent, but there is only one road to reliability. Build it, test it and fix the things that go wrong. Repeat the process until the desired reliability is achieved. It is a feedback process and there is no other way.
David Packard, 1972
In the field of electronics reliability, it is still very much a Dilbert world as we see in the comic from Scott Adams, Figure 1.1. Reliability Engineers are still making reliability predictions based on dubious assumptions about the future and management not really caring if they are valid. Management just needs a ‘number’ for reliability, regardless of the fact it may have no basis in reality.
Figure 1.1 Dilbert, management and reliability.
Source: DILBERT © 2010 Scott Adams. Reproduced with permission of UNIVERSAL UCLICK
The classical definition of reliability is the probability that a component, subassembly, instrument, or system will perform its specified function for a specified period of time under specified environmental and use conditions. In the history of electronics reliability engineering, a central activity and deliverable from reliability engineers has been to make reliability predictions that provide a quantification of the lifetime of an electronics system.
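The classical definition above can be written compactly. The following is a sketch in standard reliability notation, not notation taken from this book: $T$ (time to failure under the specified conditions), $F$, $f$ and $h$ are symbols assumed here for illustration.

```latex
% T = time to failure under the specified environmental and use conditions
R(t) = \Pr(T > t) = 1 - F(t), \qquad
h(t) = \frac{f(t)}{R(t)}, \qquad
R(t) = \exp\!\left(-\int_0^t h(u)\,du\right)
```

Here $R(t)$ is the reliability at time $t$, $F$ and $f$ are the cumulative distribution and density of $T$, and $h(t)$ is the instantaneous failure rate. Every term is conditional on the specified environment and use, so a prediction of $R(t)$ can be no better than the knowledge of those conditions.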
Even though the assumed causes of unreliability used to make reliability predictions have not been shown to be based on data from common causes of field failures, and there has been no data showing a correlation to field failure rates, the practice continues at many electronics systems companies due to the sheer momentum of decades of belief. Many traditional reliability engineers argue that even though the predictions do not provide an accurate estimate of life, they can be used for comparisons of alternative designs. Unfortunately, prediction models that are not based on valid causes of field failures, or valid models, cannot provide valid comparisons of reliability predictions.
Of course there is a value if predictions, valid or invalid, are required to retain one’s employment as a reliability engineer, but the benefit of continued employment pales in comparison to the potential harm of misleading assumptions that force invalid design changes, which may result in higher field failure rates and warranty costs.
For most electronics systems the specific environments and use conditions are widely distributed. It is very difficult if not impossible to know specific values and distributions of the environmental conditions and use conditions that future electronics systems will be subjected to. Compounding the challenge of not knowing the distribution of stresses in the end-use environments, the potential physical interactions and the strengths or weaknesses of potential failure mechanisms in systems of hundreds or thousands of components are phenomenologically complex.
Tracing back to the first electronics prediction guide, we find the RCA release of TR-1100 titled Reliability Stress Analysis for Electronic Equipment, in 1956, which presented models for computing rates of component failures. It was the first of the electronics prediction ‘cookbooks’ that became formalized with the publishing of reliability handbook MIL-HDBK-217A and continued to 1991, with the last version MIL-HDBK-217F released in December of that year. It was formally removed as a government reference document in 1995.
A classic diagram used to show the life cycle of electronics devices is the life cycle bathtub curve. The bathtub curve is a graph of time versus the number of units failing.
Just as medical science has done much to extend our lives in the past century, electronic components and assemblies have also had a significant increase in expected life since the beginning of electronics when vacuum tube technologies were used. Vacuum tubes had inherent wear-out failure modes, such as filaments burning out and vacuum seal leakage, that were a significant limiting factor in the life of an electronics system.
Figure 1.2 The life cycle bathtub curve
The life cycle bathtub curve, which is modeled after human life cycle death rates and is shown in Figure 1.2, is actually a combination of two curves. The first curve is the initial declining failure rate, traditionally referred to as the period of ‘infant mortality’, and the second curve is the increasing failure rate from wear-out failures. The intersection of the two curves is a more or less flat area of the curve, which may appear to be a constant failure rate region. It is actually very rare that electronics components fail at a constant rate, and so the ‘flat’ portion of the curve is not really flat but instead a low rate of failure with some peaks and valleys due to variations in use and manufacturing quality.
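The idea of a bathtub curve built from two curves can be sketched numerically. The example below is a minimal illustration, assuming each curve is a Weibull failure rate (shape below 1 for the declining early-life curve, shape above 1 for wear-out). The Weibull form, the function names and every parameter value are assumptions chosen for illustration; they are not data from any product, and the smooth result is exactly the idealized picture that real field data rarely follows.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a Weibull distribution at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t, early=(0.5, 50.0), wearout=(5.0, 20.0)):
    """Idealized bathtub curve: sum of a declining and a rising failure rate.

    early and wearout are hypothetical (shape, scale) pairs, with time in years.
    """
    early_rate = weibull_hazard(t, *early)      # shape < 1: rate declines
    wearout_rate = weibull_hazard(t, *wearout)  # shape > 1: rate increases
    return early_rate + wearout_rate


if __name__ == "__main__":
    for years in (0.1, 1.0, 5.0, 10.0, 25.0):
        print(f"t = {years:5.1f} y  failure rate = {bathtub_hazard(years):.4f} per year")
```

The summed rate is high at first, passes through a low region and rises again, but at no point is it truly constant, which is consistent with the observation that the ‘flat’ portion is not really flat.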
The electronics life cycle bathtub curve was derived from human life cycle curves and may have been more relevant back in the day of vacuum tube electronics systems. In human life cycles we have a high rate of death due to the risks of birth and the fragility of life during human infancy. As we age, the rates of death decline to a steady state level until we age and our bodies start to fail. Human infant mortality is defined as the number of deaths in the first year of life. Infant mortality in electronics has been the term used for the failures that occur after shipping or in the first months or first year of use.
The term ‘infant mortality’ applied to the life of electronics is a misnomer. The vast majority of human infant mortality occurs in poorer third world countries, and the main cause is dehydration from diarrhea, which is a preventable disease. There are many other factors that contribute to the rate of infant deaths, such as limited access to health services, education of the mother and access to clean drinking water. The lack of healthcare facilities or skilled health workers is also a contributing factor.
An electronic component or system is not weaker when fabricated; instead, if manufactured correctly, components have their highest inherent life and strength when manufactured, and then decline in strength, or total fatigue life, during use.
The term ‘infant mortality’, which is used to describe failures of electronics or systems that occur in the early part of the use life cycle, seems to imply that the failure of some devices and systems is intrinsic to the manufacturing process and should be expected. Many traditional reliability engineers dismiss these early life failures, or ‘infant mortality’ failures, as due to ‘quality control’ and therefore do not see them as the responsibility of the reliability engineering department. Manufacturing quality variations are likely to be the largest cause of early life failures, especially for designs with narrow environmental stress capabilities that could be found in HALT. But it makes little difference to the customer or end-user: they lose use of the product, and the company whose name is on it is ultimately to blame.
So why use the dismissive term infant mortality to describe failures from latent defects in electronics as if they were intrinsic to manufacturing? The time period that is used to define the region of infant mortality in electronics is arbitrary. It could be the first 30 days or the first 18 months or longer. Since the vast majority of latent (hidden) defects are from unintentional process excursions or misapplications, and since they are not controlled, they are likely to have a wide distribution of times to failure. Often the same failure mechanism that causes the weakest units to fail within 30 to 90 days will continue, through the stronger latent defects, to contribute to the failure rate throughout the entire period of use before technological obsolescence.
Of course the life cycle bathtub curves are represented as idealistic and simplistic smooth curves. In reality, monitoring the field reliability would result in a dynamically changing curve with many variations in the failure rates for each type of electronics system over time as shown in Figure 1.3. As failing units are removed from the population, the remaining field population failure rate decreases and may appear to reach a low steady state or appear as a constant or steady state failure rate in a large population.
Figure 1.3 Realistic field life cycle bathtub curve
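The way a field population's failure rate falls as failing units are removed can be shown with a small calculation. The sketch below assumes a population made of a small fraction of units with latent defects and a majority without, and, purely to keep the arithmetic simple, gives each group a constant failure rate. That constant-rate simplification, the function name and all numbers are illustrative assumptions, not a claim about how real components fail.

```python
import math


def population_hazard(t, weak_fraction=0.05, weak_rate=2.0, strong_rate=0.01):
    """Failure rate of the surviving field population at time t (years).

    weak_fraction: hypothetical share of shipped units with a latent defect.
    weak_rate, strong_rate: hypothetical failures per unit-year in each group.
    """
    # Share of the original population still operating in each group.
    weak_alive = weak_fraction * math.exp(-weak_rate * t)
    strong_alive = (1.0 - weak_fraction) * math.exp(-strong_rate * t)
    # Failures per year divided by units still in the field.
    failing_now = weak_rate * weak_alive + strong_rate * strong_alive
    return failing_now / (weak_alive + strong_alive)


if __name__ == "__main__":
    for years in (0.0, 0.5, 1.0, 2.0, 5.0):
        print(f"t = {years:3.1f} y  population failure rate = {population_hazard(years):.4f}")
```

Nothing about any individual unit improves with time in this sketch. The population rate declines only because the defective units leave the field, which is why the apparent low steady state says little about the intrinsic life of the units that remain.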
In the real tracking of failure rates, the peaks and valleys of the curve extend to the wear-out portion of the life cycle curve. For most electronics, the wear-out portion of the curve extends well beyond technological obsolescence and will never actually contribute significantly to the unreliability of the product.
Without detailed root cause analysis of failures that make up the peaks of the middle portion of the bathtub curve, or what is termed the useful life period, any increase in failure rates can be mistaken for the intrinsic wear-out phase of a system’s life cycle. Failure analysis may reveal that what at first appears to be a wear-out mode in a component is actually due to overstress from misapplication in the circuit or from unknown high voltage transients.
The traditional approach to electronics reliability engineering has been to focus on probabilistic wear-out mode of electronics. Failures that are due to the wear-out mode are represented by the exponentially increasing failure rate or back end of the bathtub curve.
Mathematical models of intrinsic wear-out mechanisms in components and assemblies must assume that all the manufacturing processes – from IC die fabrication to packaging, mounting on a printed wiring board assembly (PWBA) and then final assembly in a system – are in control and are consistent through the production life cycle.
Mathematical models must also include specific values of environmental stress cycles that drive the inherent device degradation mechanisms for each device, which may include voltage and temperature cycles and shock and vibration, which can interact to modify rates of degradation. The sum of all the stresses that a whole product is expected to be subjected to during its use is the life cycle environmental profile (LCEP).
The cost of failures for a company introducing a new electronics product to market is much more significant at the front end of the bathtub curve, the ‘infant mortality’ period, than in the ‘useful life’ or ‘wear-out’ period. This includes the tangible and quantifiable cost of service and warranty replacements, and less tangible but real costs in lost sales due to perceptions of poor reliability in a competitive market.
There is little data or supporting evidence that, in general, the intrinsic life of electronics systems can be modeled and predicted, and this is especially true for the early life failures. The misleading approach of using traditional reliability predictions for reliability development will be discussed further in Chapter 2.
The frame of reference for the HALT and HASS approach to reliability testing is as simple as the old adage that ‘a chain is only as strong as its weakest link’. A complex electronics system is only as strong as its weakest, least tolerant or least capable component or subsystem. Just like pulling on a chain until the weakest link breaks, HALT methods apply a wide range of relevant stresses, both individually and in combinations, at increasing levels in order to expose the least capable element in the system. If the failure mechanism causes catastrophic damage to a component when a destruct limit is reached in HALT, the weak link is easier to isolate. Operational weaknesses causing soft failures can be more challenging to isolate.
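The weakest-link idea can be sketched as a simple step-stress loop: raise one stress in increments until something fails, correct that element, then repeat to find the next weakest. The element names, temperature limits and step sizes below are hypothetical. In a real HALT the limits are not known in advance; they are what the test discovers. The table of limits here only stands in for the physical product.

```python
def step_stress(limits, start, step, max_level):
    """Raise the stress in steps until at least one element's limit is exceeded.

    limits: hypothetical map of element name -> highest stress level it tolerates.
    Returns (level, failed_names), or (None, []) if nothing fails up to max_level.
    """
    level = start
    while level <= max_level:
        failed = sorted(name for name, limit in limits.items() if level > limit)
        if failed:
            return level, failed
        level += step
    return None, []


# Hypothetical upper temperature limits in degrees C (illustration only).
product = {"regulator": 95, "oscillator": 110, "connector": 130}

first = step_stress(product, start=60, step=10, max_level=140)

# Simulate a design change that strengthens the first weak link, then retest.
improved = dict(product, regulator=140)
second = step_stress(improved, start=60, step=10, max_level=140)

if __name__ == "__main__":
    print("first weak link:", first)
    print("after corrective action:", second)
```

Each pass exposes only the least capable element at that point, so the margin of the whole product grows one corrective action at a time. Actual HALT also combines stresses and has to separate soft operational failures from hard ones, which this single-stress sketch does not attempt.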
HALT (highly accelerated life test) is a process that requires specific adaptation when it is applied to almost any system and assembly. Because HALT is a highly adaptive process, the information given in this book will be general guidelines on how to apply HALT. How HALT is adapted to each type of product or assembly is unique to each, and presents a learning process for each different type of electronic and electromechanical system. A company that plans to adopt HALT as a new process, or a new user of HALT, will have significantly faster adoption and more success in implementation with the guidance of an experienced HALT consultant. As in any newly introduced adoption of new test methods and techniques, there are
