Next Generation HALT and HASS - Kirk A. Gray - E-Book

Next Generation HALT and HASS E-Book

Kirk A. Gray

Description

NEXT GENERATION HALT AND HASS: ROBUST DESIGN OF ELECTRONICS AND SYSTEMS

A NEW APPROACH TO DISCOVERING AND CORRECTING SYSTEMS RELIABILITY RISKS

Next Generation HALT and HASS presents a major paradigm shift from reliability prediction-based methods to discovery of electronic systems reliability risks. This is achieved by integrating highly accelerated life test (HALT) and highly accelerated stress screen (HASS) into a physics of failure based robust product and process development methodology. The new methodologies challenge misleading and sometimes costly misapplication of probabilistic failure prediction methods (FPM) and provide a new deterministic map for reliability development. The authors clearly explain the new approach with a logical progression of problem statement and solutions.

The book helps engineers employ HALT and HASS by demonstrating why the misleading assumptions used for FPM are invalid. Next, the application of HALT and HASS empirical discovery methods to quickly find unreliable elements in electronics systems gives readers practical insight into the techniques.

The physics of HALT and HASS methodologies are highlighted, illustrating how they uncover and isolate software failures due to hardware–software interactions in digital systems. The use of empirical operational stress limits for the development of future tools and reliability discriminators is described.

Key features:

  • Provides a clear basis for moving from statistical reliability prediction models to practical methods of ensuring and improving reliability.
  • Challenges existing failure prediction methodologies by highlighting their limitations using real field data.
  • Explains a practical approach to why and how HALT and HASS are applied to electronics and electromechanical systems.
  • Presents opportunities to develop reliability test discriminators for prognostics using empirical stress limits.
  • Guides engineers and managers on the benefits of the deterministic and more efficient methods of HALT and HASS.
  • Integrates the empirical limit discovery methods of HALT and HASS into a physics of failure based robust product and process development methodology.

Number of pages: 394

Year of publication: 2016

Table of Contents

Cover

Title Page

Series Editor’s Foreword

Preface

List of Acronyms

Introduction

1 Basis and Limitations of Typical Current Reliability Methods and Metrics

1.1 The Life Cycle Bathtub Curve

1.2 HALT and HASS Approach

1.3 The Future of Electronics: Higher Density and Speed and Lower Power

1.4 Use of MTBF as a Reliability Metric

1.5 MTBF: What is it Good For?

1.6 Reliability of Systems is Complex

1.7 Reliability Testing

1.8 Traditional Reliability Development

Bibliography

2 The Need for Reliability Assurance Reference Metrics to Change

2.1 Wear-Out and Technology Obsolescence of Electronics

2.2 Semiconductor Life Limiting Mechanisms

2.3 Lack of Root Cause Field Unreliability Data

2.4 Predicting Reliability

2.5 Reliability Predictions – Continued Reliance on a Misleading Approach

2.6 Stress–Strength Diagram and Electronics Capability

2.7 Testing to Discover Reliability Risks

2.8 Stress–Strength Normal Assumption

2.9 A Major Challenge – Distributions Data

2.10 HALT Maximizes the Design’s Mean Strength

2.11 What Does the Term HALT Actually Mean?

Bibliography

3 Challenges to Advancing Electronics Reliability Engineering

3.1 Disclosure of Real Failure Data is Rare

3.2 Electronics Materials and Manufacturing Evolution

Bibliography

4 A New Deterministic Reliability Development Paradigm

4.1 Introduction

4.2 Understanding Customer Needs and Expectations

4.3 Anticipating Risks and Potential Failure Modes

4.4 Robust Design for Reliability

4.5 Diagnostic and Prognostic Considerations and Features

4.6 Knowledge Capture for Reuse

4.7 Accelerated Test to Failure to Find Empirical Design Limits

4.8 Design Confirmation Testing: Quantitative Accelerated Life Test

4.9 Limitations of Success Based Compliance Test

4.10 Production Validation Testing

4.11 Failure Analysis and Design Review Based on Test Results

Bibliography

5 Common Understanding of HALT Approach is Critical for Success

5.1 HALT – Now a Very Common Term

5.2 HALT – Change from Failure Prediction to Failure Discovery

5.3 Serial Education of HALT May Increase Fear, Uncertainty and Doubt

6 The Fundamentals of HALT

6.1 Discovering System Stress Limits

6.2 HALT is a Simple Concept – Adaptation is the Challenge

6.3 Cost of Reliable vs Unreliable Design

6.4 HALT Stress Limits and Estimates of Failure Rates

6.5 Defining Operational Limit and Destruct Limits

6.6 Efficient Cooling and Heating in HALT

6.7 Applying HALT

6.8 Thermal HALT Process

6.9 Random Vibration HALT

6.10 Product Configurations for HALT

6.11 Lessons Learned from HALT

6.12 Failure Analysis after HALT

7 Highly Accelerated Stress Screening (HASS) and Audits (HASA)

7.1 The Use of Stress Screening on Electronics

7.2 ‘Infant Mortality’ Failures are Reliability Issues

7.3 Developing a HASS

7.4 Unique Pneumatic Multi-axis RS Vibration Characteristics

7.5 HALT and HASS Case History

Bibliography

7.6 Benefits of HALT and HASS with Prognostics and Health Management (PHM)

Bibliography

8 HALT Benefits for Software/Firmware Performance and Reliability

8.1 Software – Hardware Interactions and Operational Reliability

8.2 Stimulation of Systematic Parametric Variations

Bibliography

9 Design Confirmation Test

9.1 Introduction to Accelerated Life Test

9.2 Accelerated Degradation Testing

9.3 Accelerated Life Test Planning

9.4 Pitfalls of Accelerated Life Testing

9.5 Analysis Considerations

Bibliography

10 Failure Analysis and Corrective Action

10.1 Failure Analysis and Knowledge Capture

10.2 Review of Test Results and Failure Analysis

10.3 Capture Test and Failure Analysis Results for Access on Follow-on Projects

10.4 Analyzing Production and Field Return Failures

Bibliography

11 Additional Applications of HALT Methods

11.1 Future of Reliability Engineering and HALT Methodology

11.2 Winning the Hearts and Minds of the HALT Skeptics

11.3 Test of No Fault Found Units

11.4 HALT for Reliable Supplier Selection

11.5 Comparisons of Stress Limits for Reliability Assessments

11.6 Multiple Stress Limit Boundary Maps

11.7 Robustness Indicator Figures

11.8 Focusing on Deterministic Weakness Discovery Will Lead to New Tools

11.9 Application of Limit Tests, AST and HALT Methodology to Products Other Than Electronics

Bibliography

Appendix: HALT and Reliability Case Histories

A.1 HALT Program at Space Systems Loral

A.2 Software Fault Isolation Using HALT and HASS

A.3 Watlow HALT and HASS Application

A.4 HALT and HASS Application in Electric Motor Control Electronics

A.5 A HALT to HASS Case Study – Power Conversion Systems

Index

Wiley Series in Quality and Reliability Engineering

End User License Agreement

List of Tables

Chapter 1

Table 1.1 Reliability at 100 Hours for item A, item B and item C

Table 1.2 Estimated parameters for item D, item E and item F

Table 1.3 Reliability at 50 hours for item D, item E, and item F

Chapter 2

Table 2.1 Semiconductor wear-out mechanisms activation energies

Table 2.2 Reported activation energy for silicon semiconductor wear-out mechanisms.

Table 2.3 Causes of Failure

Table 2.4 Results of the 1987 SINCGARS NDI candidate test

Chapter 7

Table 7.1 Five-year estimate of failures prevented by HASS testing

Table 7.2 Cumulative cost avoidance of failures using HASS

Chapter 9

Table 9.1 ALT test plan table to capture key factors

Appendix

Table A4.1 Test sequence of ten sample modules

Table A4.2 Example of cold step stress test

Table A4.3 Summary of cold step stress HALT results

Table A4.4 Temperature Profile in Hot Step Stress HALT

Table A4.5 Results of hot step stress test of PWB modules

Table A4.6 Vibration profile for combined temperature cycle and vibration HALT

Table A4.7 Comparison of HALT and HASS results with field failure data

List of Illustrations

Chapter 1

Figure 1.1 Dilbert, management and reliability.

Figure 1.2 The life cycle bathtub curve

Figure 1.3 Realistic field life cycle bathtub curve

Figure 1.4 The ‘drain’ of technological obsolescence in the life cycle bathtub curve

Figure 1.5 Reliability functions for item A, item B and item C

Figure 1.6 Hazard functions for item A, item B and item C

Figure 1.7 Item D: Graphical analysis of survival data

Figure 1.8 Item E: Graphical analysis of survival data

Figure 1.9 Item F: Graphical analysis of survival data

Figure 1.10 Reliability functions for item D, item E, and item F

Figure 1.11 Hazard functions for item D, item E, and item F

Figure 1.12 Examples of where latent defects are introduced during assembly fabrication

Figure 1.13 Impact of reliability tasks on electronics.

Figure 1.14 The ESS vibration power spectral density spectrum guideline from NAVMAT 9492 (US Navy)

Chapter 2

Figure 2.1 Burned battery assembly after suffering a thermal runaway.

Figure 2.2 Chargeability of failures based upon test data

Figure 2.3 Comparison of vibration displacement

Figure 2.4 Comparison of vibration response and resistor location

Figure 2.5 Comparison of various handbook methodologies

Figure 2.6 Comparison of predicted versus demonstrated values for DoD systems

Figure 2.7 The stress–strength diagram for reliability

Figure 2.8 The stress–strength diagram and the effect of fatigue damage

Figure 2.9 The intersection of the stress and strength curves resulting in failure PDFs

Figure 2.10 Reliability margin in the stress–strength diagram

Figure 2.11 The stress–strength curves in a reliable system

Figure 2.12 The stress–strength curves overlap results in failure PDF

Figure 2.13 Fixed and known strength, but random variable for stress, X

Figure 2.14 Fixed and known stress but random variable for strength, Y

Figure 2.15 Both stress and strength are random variables

Figure 2.16 The distributions of a system’s strength is a sum of individual components and subsystems, each with its own distribution

Figure 2.17 A latent defect subpopulation resulting from a manufacturing process excursion

Figure 2.18 Cisco normalized return rate versus thermal operating margin.

Figure 2.19 Cisco normalized RMA rate versus active components count.

Chapter 4

Figure 4.1 New product and process development flow

Figure 4.2 Lean quality function deployment chart to translate needs to design features [1].

Figure 4.3 Parameter diagram elements [2,5].

Figure 4.4 Boundary diagram derived from functional block diagram [2,5]

Figure 4.5 Good design, good discussion, good dissection integration [5,8,10]

Figure 4.6 Basic design review based on failure modes (DRBFM) format [5,8,10]

Figure 4.7 Levels of DRBFM in complex system development [3,5,8]

Figure 4.8 Typical process flow diagram

Figure 4.9 Typical process DRBFM format [5,8,10]

Figure 4.10 Timing and application of DRBFM in product and process development [3,8]

Figure 4.11 Summary of typical failure mechanisms to consider in design

Figure 4.12 Reducing stress–strength interference [6].

Figure 4.13 Effect of aging or wear on product strength and probability of failure [6].

Figure 4.14 Illustration of a probabilistic solution using reliability based design optimization [4,11].

Figure 4.15 Phased DOE approach to optimize design choices [9].

Figure 4.16 Illustration of degradation analysis.

Figure 4.17 Step stress limit test profile

Figure 4.18 Step stress limit test profile for electronic modules with increasing temperature deltas

Figure 4.19 Robustness indicator diagram [7].

Figure 4.20 Stress boundary map

Figure 4.21 Design review based on test results (DRBTR) process flow [8].

Figure 4.22 Design review based on test results (DRBTR) format [8].

Chapter 5

Figure 5.1 The change in orientation for the designed strength of electronics

Chapter 6

Figure 6.1 Typical mounting of a circuit board with aluminum ducts directing air flow across the UUT.

Figure 6.2 Low mass triaxial and single axis accelerometers.

Figure 6.3 PSD showing peak vibration resonant frequency shift of PCB at −35°C and 70°C.

Figure 6.4 SEM picture of the BGA solder joint shows cracks on the top with some connection.

Figure 6.5 Typical HALT chamber.

Figure 6.6 Decreasing temperature HALT profile and thermal lag of UUT

Figure 6.7 Increasing temperature HALT profile and thermal lag of UUT

Figure 6.8 Vibration HALT profile

Figure 6.9 A thermal isolation chamber inside a HALT chamber

Figure 6.10 Good, better, and best component placements learned from HALT

Chapter 7

Figure 7.1 Stress/strength diagram with subsystems strength distributions

Figure 7.2 Stress strength diagram with subsystems latent defect distributions

Figure 7.3 HASS uses some fatigue life to precipitate latent defects

Figure 7.4 Stress levels for HALT and HASS

Figure 7.5 HASS precipitation and detection screens

Figure 7.6 Stress regime of a typical HASS process

Figure 7.7 A PSD for the top of a multi-axis RS vibration table

Figure 7.8 Correlation between air pressure and hammer frequency (Courtesy of Charles Felkins)

Figure 7.9 Otis field and HALT failures comparison

Figure 7.10 The effectiveness of a leaky capacitor introduced after HALT was completed

Figure 7.11 Comparisons of Weibull analysis of tested and untested populations.

Figure 7.12 Cardiac stress test.

Figure 7.13 Generic parametric signature for reliability discriminators during HASS

Chapter 8

Figure 8.1 Measured low to high propagation delay versus case temperature of a Fairchild octal buffer.

Figure 8.2 Potential distribution signal propagation delay in mass production.

Figure 8.3 Applied temperatures skewing of timing distributions in semiconductors

Figure 8.4 Propagation delays for short PCB trace.

Figure 8.5 Thermograph of an operating circuit board showing thermal gradients across board.

Figure 8.6 Potential contributors to poor signal integrity in electronics

Figure 8.7 Cross-section of simple circuit board

Figure 8.8 Potential distributions of parametric timing variations during prototype/pilot production builds

Figure 8.9 Mass production and the wider distributions causing marginal operation

Figure 8.10 Colder temperatures skew signal speeds higher in a sample of prototype hardware

Figure 8.11 High temperatures skew signal speeds lower in a sample of prototype hardware

Figure 8.12 Rapid thermal gradients shift dimensions and parametrics of active devices

Chapter 9

Figure 9.1 Life–stress relationship plot for quantitative accelerated life test

Figure 9.2 Degradation analysis plot with failure threshold [6].

Chapter 10

Figure 10.1 Design review by failure modes (DRBTR) results format [5].

Figure 10.2 Robustness indicator figure showing test results margin for DRBTR discussion

Chapter 11

Figure 11.1 Temperature and voltage four corner test

Figure 11.2 Voltages and temperature empirical operational boundaries

Figure 11.3 Empirical operational boundaries showing stress/strength distributions

Figure 11.4 Two-dimensional safe stress margins for ongoing reliability monitoring

Figure 11.5 Robustness indicator figure

Appendix

Figure A1.1 Statistical nature of stress vs. strength

Figure A1.2 Probabilities of not finding issues during qualification and acceptance temperature testing

Figure A1.3 Stress testing principle.

Figure A1.4 Effect of time on strength.

Figure A2.1 Freeze spray is applied to a suspect component

Figure A2.2 Power resistors are applied to a suspect circuit

Figure A2.3 Fault type summary by percentage

Figure A2.4 Fault type summary by failure type

Figure A3.1 Typical HASS and HASA profile used at Watlow

Figure A5.1 New upper operating limits for model B units with high current component

Next Generation HALT and HASS

Robust Design of Electronics and Systems

Kirk A. Gray

Accelerated Reliability Solutions, LLC, Colorado, USA

John J. Paschkewitz

Product Assurance Engineering, LLC, Missouri, USA

This edition first published 2016
© 2016 John Wiley & Sons, Ltd

Registered office: John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the authors to be identified as the authors of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and authors have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the authors shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data

Names: Gray, Kirk, author. | Paschkewitz, John James, author.
Title: Next generation HALT and HASS : robust design of electronics and systems / by Kirk Gray, John James Paschkewitz.
Description: Chichester, UK ; Hoboken, NJ : John Wiley & Sons, 2016. | Includes bibliographical references and index.
Identifiers: LCCN 2015044935 | ISBN 9781118700235 (cloth)
Subjects: LCSH: Accelerated life testing. | Electronic systems–Design and construction. | Electronic systems–Testing.
Classification: LCC TA169.3 .G73 2016 | DDC 621.381028/7–dc23
LC record available at http://lccn.loc.gov/2015044935

A catalogue record for this book is available from the British Library.

Series Editor’s Foreword

The Wiley Series in Quality & Reliability Engineering aims to provide a solid educational foundation for researchers and practitioners in the field of quality and reliability engineering and to expand their knowledge base by including the latest developments in these disciplines.

The importance of quality and reliability to a system can hardly be disputed. Product failures in the field inevitably lead to losses in the form of repair costs, warranty claims, customer dissatisfaction, product recalls, loss of sales and, in extreme cases, loss of life.

Engineering systems are becoming increasingly complex, with added functions and capabilities; however, the reliability requirements remain the same or are even growing more stringent. Also the rapid development of functional safety standards increases pressure to achieve ever higher reliability as it applies to system safety. These challenges are being met with design and manufacturing improvements and, to no lesser extent, by advancements in testing and validation methods.

Since its introduction in the early 1980s, the concept and practice of highly accelerated life testing have undergone significant evolution. This book, Next Generation HALT and HASS, written by Kirk Gray and John Paschkewitz, both of whom I have the privilege to know personally, takes the concept of rapid product development to a new level. Both authors have lifelong experience in product testing, validation and applications of HALT to product development processes. HALT and HASS have quickly become mainstream product development tools, and this book is the next step in cementing their place as an integral part of the design process; it offers an excellent mix of theory, practice, useful applications and common sense engineering, making it a perfect addition to the Wiley Series in Quality and Reliability Engineering.

The purpose of this Wiley book series is not only to capture the latest trends and advancements in quality and reliability engineering but also to influence future developments in these disciplines. As quality and reliability science evolves, it reflects the trends and transformations of the technologies it supports. A device utilizing a new technology, whether it be a solar panel, a stealth aircraft or a state-of-the-art medical device, needs to function properly and without failures throughout its mission life. New technologies bring about new failure mechanisms, new failure sites and new failure modes, and HALT has proven to be an excellent tool in discovering those weaknesses, especially where new technologies are concerned. It also promotes the advanced study of the physics of failure, which improves our ability to address those technological and engineering challenges.

In addition to the transformations associated with changes in technology the field of quality and reliability engineering has been going through its own evolution by developing new techniques and methodologies aimed at process improvement and reduction of the number of design and manufacturing related failures. And again, HALT and HASS form an integral part of that transformation.

Among the current reliability engineering trends, life cycle engineering concepts have also been steadily gaining momentum by finding wider applications to life cycle risk reduction and minimization of the combined cost of design, manufacturing, warranty and service. Life cycle engineering promotes a holistic approach to the product design in general and quality and reliability in particular.

Despite its obvious importance, quality and reliability education is paradoxically lacking in today’s engineering curriculum. Very few engineering schools offer degree programs, or even a sufficient variety of courses, in quality or reliability methods; and the topic of HALT and HASS receives almost no coverage in today’s engineering student curriculum. Therefore, the majority of the quality and reliability practitioners receive their professional training from colleagues, professional seminars, publications and technical books. The lack of opportunities for formal education in this field emphasizes too well the importance of technical publications for professional development.

We are confident that this book, as well as this entire book series, will continue Wiley’s tradition of excellence in technical publishing and provide a lasting and positive contribution to the teaching and practice of reliability and quality engineering.

Dr. Andre Kleyner, Editor of the Wiley Series in Quality & Reliability Engineering

Preface

This book is written for practicing engineers and managers working in new product development, product testing or sustaining engineering to improve existing products. It can also be used as a textbook in courses in reliability engineering or product testing. It is focused on incorporating empirical limit determination with accelerated stress testing into a physics of failure approach for new product and process development. It overcomes the limitations, weaknesses and assumptions prevalent in prediction based reliability methods that have prevailed in many industries for decades.

We are especially grateful to the late Dr Gregg Hobbs for being the creator of HALT and HASS and a teacher and mentor.

We especially appreciate Dr Michael Pecht, the founder of CALCE at the University of Maryland, for encouraging us to write this book and for sharing CALCE material.

We would like to indicate our gratitude to our colleagues who provided support, input, review and feedback that helped us create this book. We thank Andrew Roland for permission to use his article MTBF – What Is It Good For? We would also like to thank Charlie Felkins for the pictures and drawings he provided and Andrew Riddle of Allied Telesis Labs for use of their case history. We are also grateful for the assistance of Fred Schenkelberg in providing support, contributions and promotion of this book.

We would like to thank Mark Morelli for material used in the book, as well as working with him early on implementing HALT and HASS at Otis Elevator, and Michael Beck for his support on implementing HALT and HASS, and access to information on DRBFM. We are grateful to Bill Haughey for introducing us to GD3, DRBFM and DRBTR, as well as to James McLeish for his support and work on Robust Design, Failure Analysis and GD3 source information.

We want to acknowledge Watlow, and in particular Chris Lanham, for providing the opportunity to expand and apply our reliability knowledge, as well as Mark Wagner for his case history contribution to the Appendix.

Reliasoft granted us permission to use material in this book and we appreciate the support and encouragement from Lisa Hacker. We thank Linda Ofshe for her technical editing of early chapters, Richard Savage for his support and encouragement and Monica Nogueira at SAE International for her review of manuscript sections and resolving questions on copyrighted material.

Ella Mitchell, Liz Wingett and Pascal Raj Francois, who are our contacts at John Wiley & Sons, have guided us through the process of writing a technical book and all the details of manuscript development and preparation for publication.

List of Acronyms

ALT       Accelerated Life Testing
AMSAA     Army Material Systems Analysis Activity
AST       Accelerated Stress Tests
CALT      Calibrated Accelerated Life Test
CDF       Cumulative Distribution Function
CHC       Channel Hot Carrier
CND       Can Not Duplicate
CRE       Certified Reliability Engineer
DoD       Department of Defense
DFX       Design for X (Test, Cost, Manufacture & Assembly, etc.)
DFR       Design for Reliability
DFSS      Design for Six Sigma
DOE       Design of Experiments
DRBFM     Design Review Based on Failure Modes
DRBTR     Design Review Based on Test Results
DVT       Design Verification Test
ED        Electrodynamic (Shaker)
EM        Electromigration
ESS       Environmental Stress Screening
FEA       Finite Element Analysis
FIT       Failure in Time
FLT       Fundamental Limit of Technology
FMEA      Failure Modes and Effects Analysis
FMECA     Failure Modes, Effects & Criticality Analysis
FRACAS    Failure Reporting, Analysis, & Corrective Action System
GD3       Good Design, Good Discussion, Good Dissection
HALT      Highly Accelerated Life Test
HASS      Highly Accelerated Stress Screening
HASA      Highly Accelerated Stress Audit (Sampling)
HTOL      High Temperature Operating Life
HCI       Hot Carrier Injection
ICs       Integrated Circuits
LCD       Liquid Crystal Display
LCEP      Life Cycle Environmental Profile
MSM       Matrix Stressing Method
MTBF      Mean Time between Failures
MTTF      Mean Time To Failure
MWD       Measurement While Drilling
NBTI      Negative Bias Temperature Instability
NDI       Non Developmental Item
NFF       No Fault Found
NPF       No Problem Found
OEM       Original Equipment Manufacturer
ORT       Ongoing Reliability Test
PoF       Physics of Failure
PRAT      Production Reliability Acceptance Test
PTH       Plated Through Holes
PWBA      Printed Wiring Board Assembly
QFD       Quality Function Deployment
RoHS      Restriction of Hazardous Substances
RMA       Returned Material Authorization
RMS       Reliability, Maintainability, Supportability
RDT       Reliability Demonstration Test
SINCGARS  Single Channel Ground Air Radio Set
SPC       Statistical Process Control
TDDB      Time Dependent Dielectric Breakdown
VOC       Voice of the Customer
WCA       Worst Case Analysis

Introduction

This book presents a new paradigm for reliability practitioners. It is focused on incorporating empirical limit determination with accelerated stress testing into a physics of failure approach for new product and process development. This extends the basics of highly accelerated life test (HALT) and highly accelerated stress screens (HASS) presented in earlier books and contrasts this new approach with the limitations, weaknesses, and assumptions in prediction based reliability methods that have prevailed in many industries for decades. It addresses the lack of understanding of why most systems fail, which has led to reliance on reliability predictions.

Chapters 1, 2 and 3 examine the basis and limitations of statistical reliability prediction methods and show why they fail to provide useful estimates of reliability in new products, even when those products are derivatives of previous ones. They also address the prevailing focus on estimating life or reliability with metrics such as MTBF (mean time between failures) and MTTF (mean time to failure) and the misleading aspects of using these metrics in reliability programs. This includes the difficulties and limitations of using field return data on previous products, or the results of reliability demonstration tests, to derive an MTBF or MTTF estimate for new products. The section concludes with an assessment of practices in many reliability programs and shows how they can be inadequate, resulting in warranty claims, customer dissatisfaction and increased cost to correct field problems. These typical practices include reactive reliability efforts conducted too late in product development to influence the design, success based testing that fails to find product weaknesses, and a focus on deliverable data to meet the customer’s qualification requirements.
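As a small illustration of why a single mean-life figure can mislead, the following sketch (not taken from the book) compares three hypothetical items that share the same MTTF but follow Weibull life distributions with different shape parameters. The 1000-hour mean life, the 100-hour evaluation time, the shape values and all function names are assumptions chosen only for illustration.

```python
import math


def weibull_scale_for_mean(mean_life, shape):
    """Scale parameter (eta) that gives a Weibull distribution the requested mean."""
    # Weibull mean = eta * Gamma(1 + 1/shape)
    return mean_life / math.gamma(1.0 + 1.0 / shape)


def weibull_reliability(t, shape, scale):
    """Probability that an item survives beyond time t."""
    return math.exp(-((t / scale) ** shape))


def reliability_at(t, mean_life, shape):
    """Reliability at time t for a Weibull item with the given mean life and shape."""
    scale = weibull_scale_for_mean(mean_life, shape)
    return weibull_reliability(t, shape, scale)


if __name__ == "__main__":
    mean_life = 1000.0  # hypothetical MTTF in hours, identical for all three items
    for shape in (0.5, 1.0, 3.0):  # hypothetical shape parameters
        r = reliability_at(100.0, mean_life, shape)
        print(f"shape={shape}: R(100 h) = {r:.3f}")
```

Under these assumed values the reliability at 100 hours is roughly 0.64, 0.90 and 0.999 respectively, even though the mean life is identical in all three cases. A shape of 1.0 corresponds to the constant failure rate assumption that usually accompanies an MTBF figure.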

Chapter 4 proposes a new approach to ensuring product reliability. This begins with a focused risk assessment to anticipate potential failure modes and weaknesses based on changes from the current product knowledge base as well as new components and materials needed to meet customer needs. This assessment draws on knowledge of subject matter experts and tools to identify likely failure mechanisms and causes. These risks are then addressed with robust design to ensure sufficient margin to withstand the variability of anticipated operating environments and production strength variability. The robust design also considers prognostics and health management to detect degradation and wear out by monitoring key parameters during operation. This design approach is followed by phased robustness testing of prototypes using accelerated stress tests, including HALT, to find product limits and design margins as well as to identify design weaknesses. After the weaknesses have been identified, design changes to overcome the issues are completed and verified in HALT or accelerated stress tests.
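The idea of designing in margin between the applied stress and the product strength can be sketched numerically. The example below is an illustration only, not the authors' method: it assumes that stress and strength are independent and normally distributed, which real products may not satisfy, and every number and function name in it is hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def interference_probability(stress_mean, stress_sd, strength_mean, strength_sd):
    """P(stress > strength), assuming independent normal stress and strength."""
    margin = strength_mean - stress_mean
    spread = math.sqrt(stress_sd ** 2 + strength_sd ** 2)
    return normal_cdf(-margin / spread)


if __name__ == "__main__":
    # Hypothetical temperatures in degrees C: field stress versus product strength
    baseline = interference_probability(60.0, 10.0, 90.0, 10.0)
    improved = interference_probability(60.0, 10.0, 120.0, 10.0)
    print(f"baseline margin: P(stress > strength) = {baseline:.2e}")
    print(f"larger margin:   P(stress > strength) = {improved:.2e}")
```

Raising the mean strength while the spreads stay the same shrinks the overlap between the two distributions. In practice the distributions are rarely known in advance, so the sketch shows the direction of the effect, not a prediction for any real product.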

With the empirical limits determined and weaknesses corrected, quantitative accelerated life test (ALT) can be used to estimate reliability of selected components or assemblies where the operating environment stresses can be determined and applied. ALT provides an indication of expected reliability in the reduced time available with today’s shorter product development schedules. On systems with higher levels of integration, correctly identifying the combined stresses and accelerating them in a test becomes very difficult. So, validation testing at system level in the actual application may be needed to assess reliability and evaluate interfaces, which are often the source of reliability issues. Finally, production variability, process issues and supplier component variability need to be addressed with production screening tests and corrective action on the issues discovered.

Chapters 5 and 6 detail the Highly Accelerated Life Test (HALT) from concept through process and planning to a description of how to apply HALT. They also cover how to conduct failure analysis and ensure corrective action for the product weaknesses that are discovered. This includes selecting the stresses to apply in HALT, configuring the product for test, applying thermal, vibration and power variation stresses, monitoring product operation, detecting failures and performing failure analysis after HALT.

Chapter 7 covers the use of production screening for electronics using Highly Accelerated Stress Screening (HASS) to find infant mortality issues and ensure the consistency and control of production processes. The HASS process is covered in detail, including precipitation and detection screens, stresses applied in HASS, the safety of the screen process and verification of the HASS process. The effectiveness of HASS is discussed, and the transition to Highly Accelerated Stress Audit (HASA) sampling and cost avoidance are then covered.

Chapter 8 includes HALT and HASS examples to illustrate the application and effectiveness of discovering empirical limits, correcting design weaknesses and ensuring repeatable production processes. The section concludes with the benefits of HALT for software and firmware performance and reliability.

Chapter 9 covers the application of quantitative Accelerated Life Test (ALT) at component and subassembly levels when stresses can be correlated to the application environment and accelerated to levels between the operational level and the empirical limit of the product under test for the selected stresses used in the test. At higher levels of assembly, the combined stresses encountered in application become more difficult to apply and control to appropriate levels in an accelerated test. For these assemblies, validation testing in the application system at the prototype stage becomes necessary to evaluate interfaces and find potential problems that could not be discovered at the component or subassembly level.

Chapter 10 examines failure analysis, managing corrective action and capturing learning in the knowledge base for access by follow-on project teams, allowing them to build on previous work rather than relearn it. This includes Design Review Based on Test Results (DRBTR) as a method for reviewing test results, deciding on corrective actions and tracking progress to completion and closure. Follow-up with production screening, ongoing reliability test during production and analysis of field data conclude the section.

Chapter 11 covers additional applications of the HALT methodology. These topics include:

future of reliability engineering and the HALT methodology

winning the hearts and minds of the HALT skeptics

analysis of field failures in HALT

test of no defect found units in HALT

HALT for reliable supplier selection

comparisons of stress limits for reliability assessments

multiple stress limit boundary maps and robustness indicator figures

focusing on deterministic weakness discovery will lead to new tools

application of empirical limit test, AST and HALT concepts to products other than electronics

These areas help the reliability practitioner apply the HALT methodology and tools to solve problems they often face in both product development and sustaining engineering of current products.

The appendix includes data from case studies that illustrate the effectiveness of the HALT methods in improving product reliability.

1 Basis and Limitations of Typical Current Reliability Methods and Metrics

Reliability cannot be achieved by adhering to detailed specifications. Reliability cannot be achieved by formula or by analysis. Some of these may help to some extent, but there is only one road to reliability. Build it, test it and fix the things that go wrong. Repeat the process until the desired reliability is achieved. It is a feedback process and there is no other way.

David Packard, 1972

In the field of electronics reliability, it is still very much a Dilbert world as we see in the comic from Scott Adams, Figure 1.1. Reliability Engineers are still making reliability predictions based on dubious assumptions about the future and management not really caring if they are valid. Management just needs a ‘number’ for reliability, regardless of the fact it may have no basis in reality.

Figure 1.1 Dilbert, management and reliability.

Source: DILBERT © 2010 Scott Adams. Reproduced with permission of UNIVERSAL UCLICK

The classical definition of reliability is the probability that a component, subassembly, instrument, or system will perform its specified function for a specified period of time under specified environmental and use conditions. In the history of electronics reliability engineering, a central activity and deliverable from reliability engineers has been to make reliability predictions that provide a quantification of the lifetime of an electronics system.
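The classical definition above can be written compactly. This is a notational sketch only; the symbols T (time to failure), t (the specified period) and C (the specified environmental and use conditions) are assumptions introduced here and are not taken from the text.

```latex
R(t) = P\left( T > t \mid C \right), \qquad R(0) = 1, \qquad R(t) \text{ is non-increasing in } t
```

Written this way, the definition makes clear that any numerical value of R(t) is conditional on C. If the actual environments and use conditions are not known, the number has no defined meaning.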

The assumptions about the causes of unreliability that are used to make reliability predictions have not been shown to be based on data from common causes of field failures, and there has been no data showing a correlation between predictions and field failure rates. Even so, the practice continues at many electronics systems companies due to the sheer momentum of decades of belief. Many traditional reliability engineers argue that even though the predictions do not provide an accurate estimate of life, they can be used for comparisons of alternative designs. Unfortunately, prediction models that are not based on valid causes of field failures, or valid models, cannot provide valid comparisons of reliability predictions.

Of course there is a value if predictions, valid or invalid, are required to retain one’s employment as a reliability engineer. But the benefit of continued employment pales in comparison to the potential harm from misleading assumptions, which may force invalid design changes that result in higher field failure rates and warranty costs.

For most electronics systems the specific environments and use conditions are widely distributed. It is very difficult, if not impossible, to know the specific values and distributions of the environmental and use conditions that future electronics systems will be subjected to. Compounding the challenge of not knowing the distribution of stresses in the end-use environments, the number of potential physical interactions and the strengths or weaknesses of potential failure mechanisms in systems of hundreds or thousands of components are phenomenologically complex.

Tracing back to the first electronics prediction guide, we find the RCA release of TR-1100, titled Reliability Stress Analysis for Electronic Equipment, in 1956, which presented models for computing rates of component failures. It was the first of the electronics prediction ‘cookbooks’ that became formalized with the publishing of reliability handbook MIL-HDBK-217A and continued to 1991, with the last version MIL-HDBK-217F released in December of that year. It was formally removed as a government reference document in 1995.

1.1 The Life Cycle Bathtub Curve

A classic diagram used to show the life cycle of electronics devices is the life cycle bathtub curve. The bathtub curve is a graph of time versus the number of units failing.

Just as medical science has done much to extend our lives in the past century, electronic components and assemblies have also had a significant increase in expected life since the beginning of electronics when vacuum tube technologies were used. Vacuum tubes had inherent wear-out failure modes, such as filaments burning out and vacuum seal leakage, that were a significant limiting factor in the life of an electronics system.

Figure 1.2 The life cycle bathtub curve

The life cycle bathtub curve, which is modeled after human life cycle death rates and is shown in Figure 1.2, is actually a combination of two curves. The first curve is the initial declining failure rate, traditionally referred to as the period of ‘infant mortality’, and the second curve is the increasing failure rate from wear-out failures. The intersection of the two curves is a more or less flat area of the curve, which may appear to be a constant failure rate region. It is actually very rare that electronics components fail at a constant rate, and so the ‘flat’ portion of the curve is not really flat but instead a low rate of failure with some peaks and valleys due to variations in use and manufacturing quality.
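The idea of a bathtub curve as the sum of a declining curve and a rising curve can be sketched numerically. The example below is only an illustration under stated assumptions: it uses two Weibull hazard functions, and every shape and scale value is invented for the sketch rather than taken from the text or from any real product. It reproduces the idealized smooth curve, not the peaks and valleys seen in real field data.

```python
def weibull_hazard(t, beta, eta):
    """Weibull hazard rate h(t) = (beta / eta) * (t / eta) ** (beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)


def bathtub_hazard(t):
    """Sum of a declining early-life curve and a rising wear-out curve.

    All shape (beta) and scale (eta) values are illustrative assumptions.
    """
    early_life = weibull_hazard(t, beta=0.5, eta=2000.0)   # beta < 1: declining rate
    wear_out = weibull_hazard(t, beta=5.0, eta=100000.0)   # beta > 1: increasing rate
    return early_life + wear_out


if __name__ == "__main__":
    for hours in (10, 100, 1000, 10000, 50000, 100000, 150000):
        print(f"{hours:>7} h  hazard = {bathtub_hazard(hours):.2e} per hour")
```

With these assumed parameters the combined rate is high at first, falls to a low region where the two curves cross, and then rises again. Nowhere in that region is the rate exactly constant.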

The electronics life cycle bathtub curve was derived from human life cycle curves and may have been more relevant back in the day of vacuum tube electronics systems. In human life cycles we have a high rate of death due to the risks of birth and the fragility of life during human infancy. As we age, the rates of death decline to a steady state level until we grow old and our bodies start to fail. Human infant mortality is defined as the number of deaths in the first year of life. Infant mortality in electronics has been the term used for the failures that occur after shipping or in the first months or first year of use.

The term ‘infant mortality’ applied to the life of electronics is a misnomer. The vast majority of human infant mortality occurs in poorer third world countries, and the main cause is dehydration from diarrhea, which is a preventable disease. There are many other factors that contribute to the rate of infant deaths, such as limited access to health services, education of the mother and access to clean drinking water. The lack of healthcare facilities or skilled health workers is also a contributing factor.

An electronic component or system is not weaker when fabricated. Instead, if manufactured correctly, components have their highest inherent life and strength when manufactured, and they then decline in strength, or total fatigue life, during use.

The term ‘infant mortality’, which is used to describe failures of electronics or systems that occur in the early part of the use life cycle, seems to imply that the failure of some devices and systems is intrinsic to the manufacturing process and should be expected. Many traditional reliability engineers dismiss these early life failures, or ‘infant mortality’ failures, as due to ‘quality control’ and therefore do not see them as the responsibility of the reliability engineering department. Manufacturing quality variations are likely to be the largest cause of early life failures, especially for designs with narrow environmental stress capabilities that could be found in HALT. But it makes little difference to the customer or end-user: they lose use of the product, and the company whose name is on it is ultimately to blame.

So why use the dismissive term infant mortality to describe failures from latent defects in electronics as if they were intrinsic to manufacturing? The time period that is used to define the region of infant mortality in electronics is arbitrary. It could be the first 30 days or the first 18 months or longer. Since the vast majority of latent (hidden) defects are from unintentional process excursions or misapplications, and since they are not controlled, they are likely to have a wide distribution of times to failure. Often the same failure mechanism that causes the weakest units to fail within 30 to 90 days will, through the stronger latent defects, continue to contribute to the failure rate throughout the entire period of use before technological obsolescence.

1.1.1 Real Electronics Life Cycle Curves

Of course the life cycle bathtub curves are represented as idealistic and simplistic smooth curves. In reality, monitoring the field reliability would result in a dynamically changing curve with many variations in the failure rates for each type of electronics system over time as shown in Figure 1.3. As failing units are removed from the population, the remaining field population failure rate decreases and may appear to reach a low steady state or appear as a constant or steady state failure rate in a large population.
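The way a population failure rate falls as failing units are removed can be sketched with a two-group population. This is a deliberately simplified illustration: the split into units with and without a latent defect, the constant failure rate given to each group, and all numerical values are assumptions made for the sketch, not data from the text.

```python
import math


def population_hazard(t, weak_fraction, weak_rate, strong_rate):
    """Failure rate of the surviving field population at time t (hours).

    The population mixes units with a latent defect (weak) and units
    without one (strong). Each group is given a constant failure rate
    purely to keep the sketch short; that is an assumption.
    """
    weak_alive = weak_fraction * math.exp(-weak_rate * t)
    strong_alive = (1.0 - weak_fraction) * math.exp(-strong_rate * t)
    failing_now = weak_rate * weak_alive + strong_rate * strong_alive
    return failing_now / (weak_alive + strong_alive)


if __name__ == "__main__":
    # Hypothetical values: 5% of shipped units carry a latent defect.
    for hours in (0, 500, 1000, 5000, 10000, 50000):
        rate = population_hazard(hours, weak_fraction=0.05,
                                 weak_rate=1e-3, strong_rate=1e-6)
        print(f"{hours:>6} h  population failure rate = {rate:.2e} per hour")
```

In this sketch no individual unit improves with age. The observed rate drops only because the weak units leave the population, which is why the curve can appear to settle at a low steady state level.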

Figure 1.3 Realistic field life cycle bathtub curve

In the real tracking of failure rates, the peaks and valleys of the curve extend to the wear-out portion of the life cycle curve. For most electronics, the wear-out portion of the curve extends well beyond technological obsolescence and will never actually contribute significantly to the unreliability of the product.

Without detailed root cause analysis of the failures that make up the peaks of the middle portion of the bathtub curve, or what is termed the useful life period, any increase in failure rates can be mistaken for the intrinsic wear-out phase of a system’s life cycle. It may be discovered in failure analysis that what at first appears to be a wear-out mode in a component is actually due to the component being overstressed by a misapplication in the circuit or by unknown high voltage transients.

The traditional approach to electronics reliability engineering has been to focus on the probabilistic wear-out mode of electronics. Failures that are due to the wear-out mode are represented by the exponentially increasing failure rate at the back end of the bathtub curve.

Mathematical models of intrinsic wear-out mechanisms in components and assemblies must assume that all the manufacturing processes – from IC die fabrication to packaging, mounting on a printed wiring board assembly (PWBA) and then final assembly in a system – are in control and are consistent through the production life cycle.

Mathematical models must also include specific values of environmental stress cycles that drive the inherent device degradation mechanisms for each device, which may include voltage and temperature cycles and shock and vibration, which can interact to modify rates of degradation. The sum of all the stresses that a whole product is expected to be subjected to during its use is the life cycle environmental profile (LCEP).

The cost of failures for a company introducing a new electronics product to market is much more significant at the front end of the bathtub curve, the ‘infant mortality’ period, than in the ‘useful life’ or ‘wear-out’ period. This includes the tangible and quantifiable cost of service and warranty replacements, and the less tangible but real costs of lost sales due to perceptions of poor reliability in a competitive market.

There is little data or supporting evidence that the intrinsic life of electronics systems in general can be modeled and predicted, and this is especially true for early life failures. The misleading approach of using traditional reliability predictions for reliability development will be discussed further in Chapter 2.

1.2 HALT and HASS Approach

The frame of reference for the HALT and HASS approach to reliability testing is as simple as the old adage that ‘a chain is only as strong as its weakest link’. A complex electronics system is only as strong as its weakest or least tolerant or capable component or subsystem. Just like pulling on a chain until the weakest link breaks, HALT methods apply a wide range of relevant stresses, both individually and in combinations, at increasing levels in order to expose the least capable element in the system. If the failure mechanism causes catastrophic damage to a component when a destruct limit is reached in HALT, the weak link is easier to isolate. Operational weaknesses causing soft failures can be more challenging to isolate.
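The weakest link idea can be sketched as a step stress search. The example below is a toy illustration, not a HALT procedure: it applies a single stress, assumes each element has one fixed limit, and uses a hypothetical function name, element names and limit values. Real HALT combines stresses and depends on monitoring the product to detect when it stops operating.

```python
def find_weakest_link(limits, start, step, max_level):
    """Raise one stress in steps until the first element stops working.

    limits maps an element name to the highest stress level at which it
    still operates (hypothetical values). Returns (level, failed_names),
    or (None, []) if nothing fails up to max_level.
    """
    level = start
    while level <= max_level:
        failed = sorted(name for name, limit in limits.items() if level > limit)
        if failed:
            return level, failed
        level += step
    return None, []


# Hypothetical upper operating temperature limits in degrees C.
upper_temperature_limits = {
    "power supply": 105,
    "processor board": 92,
    "display driver": 118,
}

if __name__ == "__main__":
    level, failed = find_weakest_link(
        upper_temperature_limits, start=60, step=10, max_level=150
    )
    print(f"First failure at {level} C: {failed}")
```

In this sketch the system limit is set entirely by the least capable element. Raising the limit of that element, then repeating the search, exposes the next weakest link.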

HALT (highly accelerated life test) is a process that requires specific adaptation when it is applied to almost any system and assembly. Because HALT is a highly adaptive process, the information given in this book consists of general guidelines on how to apply HALT. How HALT is adapted is unique to each type of product or assembly, and presents a learning process for each different type of electronic and electromechanical system. A company that plans to adopt HALT as a new process, or a new user of HALT, will have significantly faster adoption and success in implementation with the guidance of an experienced HALT consultant. As with any newly introduced test methods and techniques, there are