Software Reliability Techniques for Real-World Applications E-Book

Roger K. Youree

Description

Authoritative resource providing step-by-step guidance for producing reliable software, which can be tailored to specific projects

Software Reliability Techniques for Real-World Applications is a practical, up-to-date, go-to source that can be referenced repeatedly to efficiently prevent software defects, find and correct defects if they occur, and create a higher level of confidence in software products. From content development to software support and maintenance, the author depicts each phase in a project, such as design and coding, operation and maintenance, management, product production, and concept development, and describes the activities and products needed for each.

The book introduces clear ways to understand each process of software reliability and explains how it can be managed effectively and reliably. It is supported by many detailed examples and systematic approaches, and it covers analogies between hardware and software reliability to ensure a clear understanding.

In Software Reliability Techniques for Real-World Applications, readers will find specific information on:

* Defects, including where defects enter the project system; the effects, detection, and causes of defects; and how to handle defects
* Project phases, including concept development and planning; requirements and interfaces; design and coding; and integration, verification, and validation
* Roadmap and practical guidelines, including at the start of a project, as a member of an organization, and how to handle troubled projects
* Techniques, including an introduction to techniques in general, plus techniques by organization (systems engineering, software, and reliability engineering)

Software Reliability Techniques for Real-World Applications is a practical text on software reliability, providing over sixty-five different techniques and step-by-step guidance for producing reliable software. It is an essential and complete resource on the subject for software developers, software maintainers, and producers of software.

Page count: 772

Year of publication: 2022

Table of Contents

Cover

Title Page

Copyright

Dedication

Preface

Series Editor's Foreword by Dr. Andre Kleyner

Acronyms

Glossary

References

1 Introduction

1.1 Description of the Problem

1.2 Implications for Software Reliability

References

2 Understanding Defects

2.1 Where Defects Enter the Project System

2.2 Effects of Defects

2.3 Detection of Defects

2.4 Causes of Defects

References

3 Handling Defects

3.1 Strategy for Handling Defects

3.2 Objectives

3.3 Plan

3.4 Implementation, Monitoring, and Feedback

3.5 Analogies Between Hardware and Software Reliability Engineering

References

4 Project Phases

4.1 Introduction to Project Phases

4.2 Concept Development and Planning

4.3 Requirements and Interfaces

4.4 Design and Coding

4.5 Integration, Verification, and Validation

4.6 Product Production and Release

4.7 Operation and Maintenance

4.8 Management

References

5 Roadmap and Practical Guidelines

5.1 Summary and Roadmap

5.2 Guidelines

References

6 Techniques

6.1 Introduction to the Techniques

6.2 Techniques for Systems Engineering

6.3 Techniques for Software

6.4 Techniques for Reliability Engineering

6.5 Project-Wide Techniques and Techniques for Quality Assurance

References

Index

End User License Agreement

List of Tables

Chapter 4

Table 4.1 Defects per Activity (%).

Chapter 6

Table 6.1 Techniques for Software Reliability.

Table 6.2 Techniques for Systems Engineering.

Table 6.3 Techniques Related to Systems Engineering.

Table 6.4 Techniques for Software.

Table 6.5 Techniques Related to Software.

Table 6.6 Confidence Interval for Exponentially Distributed Failure-Terminat...

Table 6.7 Confidence Interval for Exponentially Distributed Time-Terminated ...

Table 6.8 Test Duration Example.

Table 6.9 Techniques for Reliability Engineering.

Table 6.10 Techniques Related to Reliability Engineering.

Table 6.11 Example of Operational Profile, A1.

Table 6.12 Example of Operational Profile, A2.

Table 6.13 Software FMEA.

Table 6.14 Software FMEA (cont.).

Table 6.15 Predicted Critical Failure Rate Values.

Table 6.16 Characteristics of the Software LRUs.

Table 6.17 Predicted Defect Density Values for Software LRUs.

Table 6.18 Predicted Defects, Failure Rate, MTBF, MTBCF, and Reliability.

Table 6.19 Approximate Kolmogorov–Smirnov Statistics Critical Values – One-S...

Table 6.20 Software Fault Data.

Table 6.21 Sequential Probability Ratio Test.

Table 6.22 Project-Wide Techniques and Techniques for Quality Assurance.

Table 6.23 Measurement Selection Matrix.

Table 6.24 Software Failures.

Table 6.25 Process FMEA.

Table 6.26 Process FMEA (cont.).

Table 6.27 Code Review Metrics.

Table 6.28 Laplace Test Statistic Example.

Table 6.29 Mann–Kendall Test Statistic Example.

Table 6.30 Spearman's Rank Correlation Coefficient Example.

List of Illustrations

Chapter 3

Figure 3.1 Overall Process.

Figure 3.2 Designing and Running a Project.

Chapter 6

Figure 6.1 Reliability Block Diagram.

Figure 6.2 Example Communication Network.

Wiley Series in Quality & Reliability Engineering

Dr. Andre V. Kleyner, Series Editor

The Wiley Series in Quality & Reliability Engineering aims to provide a solid educational foundation for both practitioners and researchers in the Q&R field and to expand the reader's knowledge base to include the latest developments in this field. The series will provide a lasting and positive contribution to the teaching and practice of engineering.

The series coverage will contain, but is not exclusive to,

Statistical methods

Physics of failure

Reliability modeling

Functional safety

Six-sigma methods

Lead-free electronics

Warranty analysis/management

Risk and safety analysis

Wiley Series in Quality & Reliability Engineering

Software Reliability Techniques for Real-World Applications

by Roger K. Youree

December 2022

System Reliability Assessment and Optimization: Methods and Applications

by Yan-Fu Li, Enrico Zio

April 2022

Design for Excellence in Electronics Manufacturing

by Cheryl Tulkoff, Greg Caswell

April 2021

Design for Maintainability

by Louis J. Gullo (Editor), Jack Dixon (Editor)

March 2021

Reliability Culture: How Leaders can Create Organizations that Create Reliable Products

by Adam P. Bahret

February 2021

Lead-free Soldering Process Development and Reliability

by Jasbir Bath (Editor)

August 2020

Automotive System Safety: Critical Considerations for Engineering and Effective Management

by Joseph D. Miller

February 2020

Prognostics and Health Management: A Practical Approach to Improving System Reliability Using Condition-Based Data

by Douglas Goodman, James P. Hofmeister, Ferenc Szidarovszky

April 2019

Improving Product Reliability and Software Quality: Strategies, Tools, Process and Implementation, 2nd Edition

by Mark A. Levin, Ted T. Kalal, Jonathan Rodin

April 2019

Practical Applications of Bayesian Reliability

by Yan Liu, Athula I. Abeyratne

April 2019

Dynamic System Reliability: Modeling and Analysis of Dynamic and Dependent Behaviors

by Liudong Xing, Gregory Levitin, Chaonan Wang

March 2019

Reliability Engineering and Services

by Tongdan Jin

March 2019

Design for Safety

by Louis J. Gullo, Jack Dixon

February 2018

Thermodynamic Degradation Science: Physics of Failure, Accelerated Testing, Fatigue and Reliability

by Alec Feinberg

October 2016

Next Generation HALT and HASS: Robust Design of Electronics and Systems

by Kirk A. Gray, John J. Paschkewitz

May 2016

Reliability and Risk Models: Setting Reliability Requirements, 2nd Edition

by Michael Todinov

November 2015

Applied Reliability Engineering and Risk Analysis: Probabilistic Models and Statistical Inference

by Ilia B. Frenkel (Editor), Alex Karagrigoriou (Editor), Anatoly Lisnianski (Editor), Andre V. Kleyner (Editor)

October 2013

Design for Reliability

by Dev G. Raheja (Editor), Louis J. Gullo (Editor)

July 2012

Effective FMEAs: Achieving Safe, Reliable, and Economical Products and Processes Using Failure Modes and Effects Analysis

by Carl Carlson

April 2012

Failure Analysis: A Practical Guide for Manufacturers of Electronic Components and Systems

by Marius Bazu, Titu Bajenescu

April 2011

Reliability Technology: Principles and Practice of Failure Prevention in Electronic Systems

by Norman Pascoe

April 2011

Improving Product Reliability: Strategies and Implementation

by Mark A. Levin, Ted T. Kalal

March 2003

Test Engineering: A Concise Guide to Cost-Effective Design, Development and Manufacture

by Patrick O'Connor

April 2001

Integrated Circuit Failure Analysis: A Guide to Preparation Techniques

by Friedrich Beck

January 1998

Measurement and Calibration Requirements for Quality Assurance to ISO 9000

by Alan S. Morris

October 1997

Electronic Component Reliability: Fundamentals, Modelling, Evaluation, and Assurance

by Finn Jensen

November 1995

Software Reliability Techniques for Real-World Applications

Roger K. Youree

Instrumental Sciences Incorporated, Huntsville, USA

This edition first published 2023

© 2023 John Wiley and Sons Ltd

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Roger K. Youree to be identified as the author of this work has been asserted in accordance with law.

Registered Offices

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office

The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty

In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data Applied for

Hardback ISBN: 9781119931829

Cover Design: Wiley

Cover Image: © Titima Ongkantong/Shutterstock

This book is dedicated to my wife, Susan.

Preface

Software reliability as a discipline started later than hardware reliability but has grown rapidly. Software reliability is an active area of research with new results, both theoretical and practical, being published regularly. As is often the case with a relatively young discipline, much of the material may seem unconnected, making it difficult to determine how to choose the right techniques for a given job and organization.

This book is a survey of techniques and approaches that can be used to produce reliable software in a cost- and schedule-efficient manner. It focuses on practical techniques and is tailored for practitioners and not academics. Software reliability is not in any one organization's domain, and this book takes a broad approach in that it considers all activities that affect the software, such as conceptual design and requirements development, even though they generally occur well before any coding takes place. Preventing or removing defects from these early activities will pay significant dividends.

The early chapters of this book are intended to provide an overall understanding of the nature of the problem, followed by more practical suggestions in later chapters. Chapter 2 covers some definitions and useful information about defects. Chapter 3 outlines an overall approach for developing reliable software. Chapter 4 describes the different project phases or stages, how defects can enter the project in each phase, ways to mitigate these defects, and ways to monitor for defects applicable to the phase. Chapter 5 provides a summary and roadmap, along with some practical guidelines. The book concludes with Chapter 6, which gives more details on some of the techniques mentioned in Chapters 3 and 4.

Roger K. Youree

Instrumental Sciences Incorporated

Huntsville, Alabama, USA

Series Editor's Foreword by Dr. Andre Kleyner

The Wiley Series in Quality & Reliability Engineering aims to provide a solid educational foundation for researchers and practitioners in the field of quality and reliability engineering and to expand the knowledge base by including the latest developments in these disciplines.

The importance of quality and reliability to a system can hardly be disputed. Product failures in the field inevitably lead to losses in the form of repair cost, warranty claims, customer dissatisfaction, product recalls, loss of sale, and in extreme cases, loss of life.

Engineering systems are becoming increasingly complex with added functions and capabilities; however, the reliability requirements remain the same or are even growing more stringent. Increasing integration of hardware and software is making these systems even more complex and challenging to design. For example, in autonomous driving vehicles, software may play an even more important role than the hardware. All this brings ever-increasing attention to the topic of software quality and reliability.

The book you are about to read has been written by an expert and state-of-the-art practitioner in the field of software reliability. It covers a variety of topics critical to producing high-quality, malfunction-free software in a timely manner.

At present, despite its obvious importance, quality and reliability education is paradoxically lacking in today's engineering curriculum. Very few engineering schools offer degree programs or even a sufficient variety of courses in quality or reliability methods. The topics of reliability analysis, accelerated testing, reliability modeling and simulation, warranty data analysis, reliability growth programs, and other practical applications of reliability engineering receive very little coverage in today's engineering student curriculum. Therefore, the majority of the quality and reliability practitioners receive their professional training from colleagues, professional seminars, and professional publications. This book is intended to close some of these gaps and provide additional educational opportunities for a wide range of readers from graduate-level students to seasoned reliability professionals.

We are confident that this book as well as this entire book series will continue Wiley's tradition of excellence in technical publishing and provide a lasting and positive contribution to the teaching and practice of reliability and quality engineering.

Acronyms

There are many commonly used acronyms in software reliability, but some may have different meanings to different people. The following list of acronyms is used in this book:

AADL: Architecture Analysis and Design Language
ATAM: Architecture Tradeoff Analysis Method℠
BDD: Behavior-Driven Development
BOM: Bill of Materials
CDP: Concept Development and Planning
CIL: Critical Items List
CONOPS: Concept of Operations
DC: Design and Coding
DOE: Design of Experiments
DRACAS: Defect Reporting, Analysis and Corrective Action System
DRB: Defect Review Board
DRE: Defect Removal Efficiency
EKSLOC: Effective thousand (Kilo) Source Lines of Code
FDSC: Failure Definition and Scoring Criteria
FEF: Fix Effectiveness Factor
FMEA: Failure Modes, Effects, and Analysis
FMECA: Failure Modes, Effects, and Criticality Analysis
FRACAS: Failure Reporting, Analysis and Corrective Action System
FRB: Failure Review Board
FTA: Fault Tree Analysis
IV&V: Integration Verification and Validation
LRU: Line Replaceable Unit
MOE: Measure of Effectiveness
MS: Management Strategy
MTBCF: Mean Time Between Critical Failures
MTBEFF: Mean Time Between Essential Function Failures
MTBF: Mean Time Between Failures
MTBSA: Mean Time Between System Aborts
MTSWR: Mean Time to SoftWare Restore
MTTF: Mean Time to Failure
NVA: Non-Value Added
ODC: Orthogonal Defect Classification
OM: Operation and Maintenance
OP: Operational Profile
PFMEA: Process Failure Modes Effects and Analysis
QA: Quality Assurance
QFD: Quality Function Deployment
RCA: Root Cause Analysis
RPN: Risk Priority Number
SDD: Software Design Document
SFMEA: Software Failure Modes Effects and Analysis
SFTA: Software Fault Tree Analysis
SLOC: Source Lines of Code
SPC: Statistical Process Control
SRE: Software Reliability Engineering
SRGP: Software Reliability Growth Plan
SRPP: Software Reliability Program Plan
SRS: Software Requirements Specification
SysML: Systems Modeling Language
TBA: To Be Added
TBD: To Be Determined
TBR: To Be Reviewed
TBS: To Be Supplied
TEMP: Test and Evaluation Master Plan
TDD: Test-Driven Development
UML: Unified Modeling Language
WBS: Work Breakdown Structure

Glossary

Defect: A defect is a problem that, if not corrected, could cause an application or product either to fail or to produce incorrect or unsatisfactory results.

Defect precursor: A defect precursor is an event that does not directly result in a defect being placed in the software but makes the introduction of a defect into the software more likely.

Error: An error is a human action that produces an incorrect result. Note that the word “error” is also a standard part of some software terms, such as “runtime errors” and “memory errors.”

Essential function failure: An essential function failure is any incident or incorrect function that causes (or could cause) the loss of an essential function or the degradation of an essential function below a specified level. Essential functions are the minimum operational tasks that the system must perform to accomplish its mission or achieve acceptable customer satisfaction. A system abort (SA) is an essential function failure, but not all essential function failures are SAs.

Failure: There are two definitions of failure that are typically used:

* A failure is the inability of a system or system component to perform a required function within specified limits.
* A failure is the termination of the ability of a product to perform a required function or its inability to perform within previously specified limits.

Fault: Again, there are two definitions of fault that are typically used:

* A fault is a defect in the software code that can be the cause of one or more failures.
* A fault is a manifestation of an error in the software.

Nonessential function failure: A nonessential function failure is any incident or incorrect function that causes (or could cause) the loss of a nonessential function or the degradation of an essential function but not to an unacceptable level.

Operational profile: An operational profile (OP) is a set of relative frequencies (or probabilities) of occurrences of disjoint software operations during operational use.

Phase: A phase, or project phase, is a period in the life cycle of the project dedicated to a certain set of tasks and products. Other terms sometimes used are stages, increments, or sprints. Phases typically overlap.

Project: For this book, a project is defined as an organized undertaking to produce one or more products. We use the term “project” rather than “program” to avoid confusion with the use of “program” to refer to a software program.

Project system: For purposes of this book, we define a project system to be the finished product, all of the intermediate products, tools, services, and documentation used to develop the finished product, and all of the processes used in the project. When the risk of confusion is small, “system” may be used in the place of “project system.”

Root cause: The root cause of a defect, also called a primary cause, is the initial causal event or chain of events that results in a defect. The root cause is the fundamental reason for the defect and, if corrected, will prevent recurrence of the defect and of similar defects.

System abort: An SA, sometimes known as a mission abort or operational mission failure, is an essential function failure that occurs during a mission or critical operations and that prevents critical aspects of system performance. It usually results in terminating the mission or operations. A software crash is an example of an SA.

Software reliability: Software reliability is the probability that the software will not cause a system failure for a specified time period under specified conditions.

Software reliability engineering: Software reliability engineering (SRE) is defined in [1] as “the quantitative study of the operational behavior of software-based systems with respect to user requirements concerning reliability.” It includes the following:

* Software reliability prediction and estimation.
* The use of attributes and metrics of the product design, development process, and operational environment to assess and improve software reliability.
* The application of this knowledge to specify and guide design, development, testing, acquisition, use, and maintenance.

Software reliability estimation: There are two definitions of software reliability estimation in frequent use:

* Reference [2] defines software reliability estimation as “The application of statistical techniques to observed failure data collected during system testing and operation to assess the reliability of the software.”
* Reference [1] defines software reliability estimation as the activity that “… determines current software reliability by applying statistical inference techniques to failure data obtained during system test or during system operation. This is a measure regarding the achieved reliability from the past until the current point.”

Software reliability prediction: There are two definitions of software reliability prediction often used:

* Reference [2] defines software reliability predictions as “A forecast or assessment of the reliability of the software based on parameters associated with the software product and its development environment.”
* Reference [1] defines software reliability predictions as the activity that “… determines future software reliability based on the available software metrics and measures.”

Validation: Validation of a product answers the question of whether the product meets the needs that prompted its creation.

Verification: Verification of a product answers the question of whether the product satisfies its requirements.
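
Two of the glossary terms, operational profile and software reliability, are quantitative and can be illustrated with a short sketch. The Python below is a hedged illustration and not a method taken from this book: the function names, the operation names, and the usage log are hypothetical, and the reliability calculation assumes a constant failure rate (an exponential model), which is only one possible modeling choice.

```python
import math
from collections import Counter


def operational_profile(observed_operations):
    """Return the relative frequency of each operation in a usage log.

    Each log entry names exactly one operation, so the operations are
    disjoint and the relative frequencies sum to 1.
    """
    if not observed_operations:
        raise ValueError("at least one observed operation is required")
    counts = Counter(observed_operations)
    total = sum(counts.values())
    return {operation: count / total for operation, count in counts.items()}


def reliability(mission_time, mtbf):
    """Probability of no failure during mission_time.

    Assumption for this sketch only: a constant failure rate, so that
    R(t) = exp(-t / MTBF). Both arguments use the same time unit.
    """
    if mission_time < 0 or mtbf <= 0:
        raise ValueError("mission_time must be >= 0 and mtbf must be > 0")
    return math.exp(-mission_time / mtbf)


# Hypothetical usage log; the operation names are invented for illustration.
usage_log = ["login"] * 50 + ["search"] * 30 + ["checkout"] * 20
profile = operational_profile(usage_log)
```

Here `profile` maps each operation to its share of observed usage, and `reliability(t, mtbf)` gives the probability of surviving a period `t` without failure under the stated assumption. Real projects may need a different failure model.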

References

1   Lyu M, editor. Handbook of software reliability engineering. Computer Society Press and McGraw-Hill Book Company, New York, 1996.

2   IEEE Standard 1633. Recommended practice on software reliability, 2017. Software Engineering Technical Committee of the IEEE Computer Society.

1 Introduction

Software is ubiquitous in today's world. It controls our home appliances, automobiles, phones, and many of our forms of entertainment. It increases our productivity at work, speeds our communications, and improves our medical care. It affects nearly every aspect of modern life. Software is also becoming more complicated for several reasons, such as the growing number and diversity of software applications, the more varied types of platforms on which software runs, and the increased reliance on “third-party” software. Because of this, it is critical to produce reliable software. Software that fails often may mean only that an entertainment application is less entertaining than intended, or it could result in a life-or-death situation in a hospital or a mass transit system.

1.1 Description of the Problem

As mentioned above, software is everywhere and is becoming more and more complicated. It is largely “handmade” and subject to human errors. Also, most software contains, or at least interfaces with, software developed independently by other companies. As a result, software defects can be subtle and difficult to find, sometimes only manifesting themselves under very specific conditions. Unfortunately, when these conditions occur, the effects of a defect may be very serious, including loss of life. Even if lives do not depend on the software, litigation can seriously damage a company.

Software reliability tasks are often assigned to reliability engineering personnel. Many times, these people are more familiar with hardware reliability than they are with software reliability. Hardware reliability and software reliability are different, and hardware reliability engineers are frequently uncomfortable with software reliability.

There is more to the problem than just producing reliable software. There are budgets and schedules to meet. Whatever is done to produce reliable software must meet these constraints. Another consideration is the highly dynamic business environment typical of modern software products. Customer needs and wants are always changing, and if one company does not respond to them, another will.

Many times, software reliability is treated as an added task to be performed after the software has been developed and is in the process of being tested. The importance of software reliability and the seriousness of the constraints that must be adhered to mean that there are often issues affecting software reliability that should be addressed early in the development process. Producing reliable software within the budget and schedule constraints requires embedding a software reliability mindset into the project from its start.

1.2 Implications for Software Reliability

Software reliability is ultimately about achieving customer satisfaction with a profitable product. This goal requires many things other than reliability, but it is unlikely to be achieved with a seriously defective product. The importance of software reliability, along with the complexity of the problem and the budget and schedule constraints inherent to the problem, means that a software reliability program should be planned and implemented early in the development effort and monitored and adjusted as needed. A company that does this successfully has a huge competitive advantage over a company that is unsuccessful at it.

Good software reliability practices are about doing things right the first time, and this effort starts at the beginning of the development effort. It is often said that doing a job right takes less time than doing it over, and this advice often holds. It is particularly applicable to software reliability given how difficult it can be to find and remove some types of software defects. Not all software defects are coding issues. Many are due to defects in products produced much earlier in the effort, and preventing them, or finding and removing them early before they become deeply embedded in downstream products, can be very cost-effective and schedule-effective. Most people recognize the importance of software reliability for critical software, but many do not understand that good software reliability practices can reduce the cost of development and maintenance of the software. When properly planned and implemented, a software reliability program can significantly reduce the amount of rework required. Rework costs money and can result in schedule impacts. One of the more obvious examples of reducing rework is with software testing. Software testing is expensive, and applying good software reliability techniques from early in the effort can mean significantly fewer faults found during software testing, resulting in less re-testing and shorter test cycles.

Choosing a good set of reliability techniques for a software project requires anticipating the types of defects and errors that are likely to occur in that project. However, our knowledge of the future is not perfect. It is said that in war, a general's plan for a battle never survives first contact with the enemy. Unfortunately, the same can often be said for plans for developing and supporting software. Things do not always go as planned, particularly in our highly dynamic and interdependent world. While starting the effort with a good set of software reliability techniques is important, monitoring results and then making appropriate changes are also a necessary part of the process. We need to accept that unexpected events will occur. We must continuously monitor and adapt while always trying to learn from events and see if we can do better next time. Managing for software reliability involves identifying and managing risks in an ever-changing environment.

There is no one set of techniques that is best for all software development efforts. The approach to software reliability should depend on the product, the software team, the company, and often the customer. This book therefore starts with a general understanding of defects that can affect software and what can be done about them and then progresses to more specific project areas. This book is also designed to be beneficial to a wide audience, including software developers and software maintainers, producers and users of software, and those working on software for government and for commercial customers. More on the importance and implications of software reliability may be found in [1–3].

References

1 Lyu M. Software reliability engineering: A roadmap. Future of software engineering, pp. 153–170, IEEE Computer Society, 2007. Available via https://www.researchgate.net/publication/4250863_Software_Reliability_Engineering_A_Roadmap. Accessed 22 Aug 2020.

2 Musa J. Software reliability engineering: More reliable software faster and cheaper. AuthorHouse, 2004.

3 Neufelder A. Ensuring software reliability. Marcel Dekker, Inc., New York, 1993.

2 Understanding Defects

To prevent and control software defects, we need to understand them. This chapter explains the nature of software defects, including where they enter into the system, what effects they can have, how to detect them, and what causes them.

To reduce the number and impact of defects in our software, it is important to understand the nature of errors and defects. Almost any error on a project can affect the reliability of the software. Anything that makes it more difficult for project personnel to perform their tasks can negatively impact reliability, even if it does not directly result in placing a defect in the software code. A frustrated, angry, or confused programmer is more likely to make an error resulting in a software defect than a motivated, generally happy, and well-informed programmer. A poor work environment and a lack of good software development tools are examples of defect precursors. Defect precursors do not directly cause a software defect, but they make defects more likely and so are considerations for software reliability. Projects that produce high-quality software tend to be well-run projects. Not all errors or defect precursors result in defects, but reducing errors and precursors reduces the likelihood of defects. Similarly, not all defects produce software faults, and not all software faults result in software failures, but again, reducing them improves our chances of reliable software.

As we want to produce reliable software, our understanding of software defects needs to be tailored to that purpose. To this end, we consider the following:

Where defects enter the project system

Effects defects can have on the project system

How we can detect defects

What causes defects

How we can handle defects

The first four of these are addressed in Sections 2.1–2.4, while the fifth is covered in Chapter 3. Chapter 4 covers the material in more detail by addressing it for specific phases of a project.

2.1 Where Defects Enter the Project System

Knowing where defects can enter a project system is important because we can use this information to design mechanisms to prevent or detect them. When we think of software defects, we typically think of specific types of errors, such as typographical errors, logical errors, synchronization errors, resource errors, or interface errors, to name just a few, and the software defects that may result from them. These types of errors are obviously important, and we must be able to handle them; however, defects affecting the software can enter a system in almost any phase and through almost anything used to design or produce the software product. Processes and products in one phase are used by later phases to produce the final product, so defects in an early phase may propagate to the final product.

In Chapter 4, we describe six phases that are typical for a project. They are as follows:

Concept Development and Planning

Requirements and Interfaces

Design and Coding

Integration, Verification, and Validation

Product Production and Release

Operation and Maintenance

We also consider management impacts. All of these use processes and produce products that create opportunities to introduce defects. Examples of potential defect sources include a poor understanding of customer needs, imprecise requirements, and not following good configuration control processes. The first two examples are typically from the Concept Development and Planning phase and the Requirements and Interfaces phase, respectively, while the last example can be from any phase. It is also important to realize that even software with a low defect density can have defects introduced into it, and these defects may have very serious consequences. Also, correcting a detected defect or adding a feature to mature software may introduce defects. Chapter 4 takes each of these phases and describes it, outlining what defects are typical for each phase and how they can enter the project system. It describes techniques and processes to mitigate these defects and lists some metrics to help monitor progress in each phase.

2.2 Effects of Defects

Software defects manifest themselves in many ways, and understanding this helps us produce more reliable software. Of course, a defect may never manifest itself. For example, if the defective part of the code is never executed, the defect never causes a fault or failure. As we generally try not to write unused code, we will assume that defects have some likelihood of being executed.

We commonly think of software defects as causing software crashes, infinite loops, or incorrect software results. Crashes and infinite loops tend to be readily visible. Incorrect results may be obvious or may be subtle. Other types of defects, such as memory leaks, may manifest themselves even more subtly. Software defects, or “bugs,” are sometimes classified into two types:

Mandelbugs: A mandelbug is a software defect whose activation and subsequent behavior are complex and appear chaotic. An example of a mandelbug is a type of defect jovially referred to as a “heisenbug.” Heisenbugs are altered by the attempts to find them. They may be affected by the timing of the execution, by the memory addresses used, by having debugging tools connected to the system, or by any of a large number of other factors. Once introduced into the software, heisenbugs, and mandelbugs in general, can be notoriously difficult to find.

Bohrbugs: A bohrbug is a software defect whose behavior is repeatable and predictable. Although the cause of the incorrect behavior may be unknown, the behavior can be reproduced if the right conditions are found and applied.
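The distinction can be made concrete with a small sketch. The two functions below are hypothetical and are not taken from any real project. The first contains a bohrbug that fails identically whenever it receives an empty list. The second contains a simplified stand-in for a mandelbug: hidden state makes its result depend on the history of earlier calls, so it appears erratic to a tester who looks only at the current input.

```python
def mean_bohrbug(values):
    # Bohrbug: fails the same way every time the input is empty.
    return sum(values) / len(values)


def make_running_mean():
    # Mandelbug-like defect (illustrative assumption): state is shared across
    # calls, so the result depends on call history, not just the current input.
    history = []

    def running_mean(values):
        history.extend(values)  # defect: earlier inputs are never cleared
        return sum(history) / len(history)

    return running_mean


running_mean = make_running_mean()
first = running_mean([2, 4])  # looks correct when tested in isolation
second = running_mean([10])   # differs from the mean of [10] because of the earlier call
```

A test that exercises each call in a fresh process would never see the second defect, which is one reason testing alone may not reveal defects of this kind.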

Knowing about these various types of defects helps us plan, carry out, and analyze software tests. However, the possible existence of these subtle and hard-to-find defects is one of the reasons why we should not rely solely on software testing to detect defects. It also adds emphasis to the fact that software testing can only show the existence of defects in software, not the absence of defects. Ultimately, it supports the idea that we need to put an emphasis on defect prevention.

If the only defects that we consider are defects in the software, we are missing opportunities to prevent defects from being introduced into the project system. As previously mentioned, almost any error or defect can increase the likelihood of software defects. For example, a poorly worded requirement may be interpreted differently by different software developers. If two developers are writing different software modules affected by this requirement, the different interpretations may mean that these modules do not work together correctly. Furthermore, the effects may be subtle and difficult to find, meaning that the most cost-effective and schedule-effective way to deal with the defect is by ensuring that the requirements are as clear and precise as possible.

Finally, not all defect effects are equally important. Defects that never manifest themselves are less important than defects that cause critical failures. Improving the reliability of software involves focusing on the defects that are most likely to occur and also on the defects that have the most serious consequences if they do occur.
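One simple way to combine likelihood and consequence is to rank candidate defects by their product. The sketch below assumes this multiplicative score, and the rating scales and sample entries are invented for illustration. A project may prefer a different weighting, for example one that always places critical failures first regardless of likelihood.

```python
from dataclasses import dataclass


@dataclass
class DefectRisk:
    name: str
    likelihood: float  # estimated probability of occurring, 0.0 to 1.0 (assumed scale)
    severity: int      # 1 (cosmetic) to 10 (critical failure) (assumed scale)

    @property
    def exposure(self) -> float:
        # Assumed scoring rule: likelihood multiplied by severity.
        return self.likelihood * self.severity


def prioritize(risks):
    """Return the risks sorted from highest to lowest exposure."""
    return sorted(risks, key=lambda r: r.exposure, reverse=True)


# Hypothetical entries for illustration only.
risks = [
    DefectRisk("typo in help text", 0.60, 1),
    DefectRisk("unit mismatch on interface", 0.10, 10),
    DefectRisk("slow report generation", 0.30, 3),
]
ranked = [r.name for r in prioritize(risks)]
```

In this example, the unlikely but severe interface defect ranks above the likely but cosmetic one.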

2.3 Detection of Defects

An effective and efficient software reliability effort requires well-thought-out defect detection and monitoring. Good defect detection and monitoring should:

Find errors and defects early when it is most cost-effective and schedule-effective to correct them.

Be as complete as practical, finding a high percentage of the errors and defects, and finding them in all processes and products that can significantly affect the software product.

Be reliable by not missing too many errors and defects while also not creating too many false alarms and the ensuing unproductive effort.

Be cost-efficient and schedule-efficient to perform.
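The completeness and reliability criteria above can be tracked with two simple ratios once the true defects are known, which is usually only in retrospect. The sketch below assumes defects are tracked by identifier. The function name and sample identifiers are hypothetical.

```python
def detection_quality(true_defects, reported):
    """Compare what a detection activity reported against the defects later confirmed."""
    true_defects, reported = set(true_defects), set(reported)
    found = true_defects & reported
    missed = true_defects - reported
    false_alarms = reported - true_defects
    # Fraction of real defects that were found (completeness).
    recall = len(found) / len(true_defects) if true_defects else 1.0
    # Fraction of reports that were real defects (few false alarms).
    precision = len(found) / len(reported) if reported else 1.0
    return {
        "recall": recall,
        "precision": precision,
        "missed": sorted(missed),
        "false_alarms": sorted(false_alarms),
    }


# Hypothetical data: four confirmed defects, four reports, one of them a false alarm.
quality = detection_quality({"D1", "D2", "D3", "D4"}, {"D1", "D2", "D3", "X9"})
```

A detection activity with high recall but low precision finds most defects at the cost of unproductive effort, while the reverse misses defects but wastes little time.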

Good defect detection and monitoring should also add confidence in the software and related products. It should provide evidence that it is working, and project personnel should be able to trust the detection and monitoring processes and execution enough that the results can be used as a part of the final sign-off of the software.

Recognizing defect precursors is critical for preventing and removing defects efficiently. For example, knowing that a software defect may be due to a requirement defect informs us that we need to detect requirement issues and therefore institute appropriate processes for doing this. Process and product monitoring is important at each phase of the project, and Chapter 4 covers each in more detail.

There are many ways to identify an error, defect precursor, or defect. Some ways identify weaknesses or problems with the processes that produce a product and others identify issues with a product. Techniques to detect process defects and weaknesses include the following:

Use a process failure modes and effects analysis (FMEA)/failure modes, effects, and criticality analysis (FMECA).

Use process reviews, inspections, and independent assessors.

Use error brainstorming sessions. Those responsible for a task brainstorm on what errors could occur while performing the task. The list can be used to develop checklists for the errors, and the brainstorming process sensitizes the task performers to the errors.

Use a software reliability advocate to continually assess project processes for potential software reliability impacts.

Perform a premortem on the process to anticipate process defects.

Some techniques to detect defects in products are as follows:

Use product peer reviews, inspections, and independent assessors.

As with process defects, we can use error brainstorming and premortem sessions for the product.

Perform tests of code.

Use checklists of process steps to ensure that each step is followed when producing the product.

As with process defects, use a software reliability advocate to continually assess project products for potential software reliability impacts.

Use a software reliability casebook to assess if all processes are correctly followed and, if not, to push for corrections and improvements.

Use requirements traceability analysis of a specification as a means of detecting potential requirement defects.

Also for requirements, let several people independently assess what would constitute verification of a specific requirement. Make the assessments specific enough that if certain criteria are met, the requirement passes, and if they are not met, it fails. Failure to agree on these criteria indicates the potential for confusion and for an inconsistent use of the requirement.
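As one illustration of requirements traceability analysis, the sketch below checks a trace between requirements and the tests that verify them. It assumes this is the trace of interest, although a project may also trace requirements to parent requirements or design elements. The identifiers are hypothetical.

```python
def trace_gaps(requirements, trace):
    """Find requirements with no test and tests that trace to no known requirement.

    `trace` maps a test ID to the list of requirement IDs it verifies.
    """
    known = set(requirements)
    covered = {req for reqs in trace.values() for req in reqs}
    untested = sorted(known - covered)
    orphan_tests = sorted(t for t, reqs in trace.items() if not set(reqs) & known)
    return untested, orphan_tests


# Hypothetical specification and trace for illustration only.
requirements = ["REQ-1", "REQ-2", "REQ-3"]
trace = {"T-10": ["REQ-1"], "T-11": ["REQ-1", "REQ-2"], "T-12": ["REQ-9"]}
untested, orphan_tests = trace_gaps(requirements, trace)
```

An untested requirement may indicate a requirement that cannot be verified as written, and an orphan test may indicate a missing or renumbered requirement.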

While detection of defects is important, we ultimately want to anticipate the chain of events that results in a defect and use this information to prevent the defect. Ideally, we prevent the first precursor, but realistically, we should also monitor for most if not all of the known precursors in the chain. We should also use “triggers.” These are indicators that additional action is required for a monitored event. These triggers may at times be subjective, but early intervention increases the likelihood that a problem will be contained and will not spread damage to later phases of the project where it is increasingly difficult to handle. Chapter 4 lists metrics and monitoring activities applicable to each phase.
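A quantitative trigger can be as simple as a threshold on a monitored metric. The following sketch assumes each metric has a single upper limit. The metric names and limits are hypothetical, and subjective triggers would need a different mechanism.

```python
def check_triggers(metrics, thresholds):
    """Return the names of monitored metrics whose value exceeds its threshold."""
    return sorted(
        name
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    )


# Hypothetical thresholds and current readings for illustration only.
thresholds = {"open_requirement_issues": 5, "review_defects_per_page": 0.5}
metrics = {"open_requirement_issues": 8, "review_defects_per_page": 0.2}
fired = check_triggers(metrics, thresholds)
```

Here only the count of open requirement issues exceeds its limit, so only that trigger calls for additional action.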

Finally, the project should continuously assess how effective its defect detection processes are and always try to improve them. Avoid change for the sake of change as project changes can be disruptive. However, monitor the effectiveness of the detection and be willing to change a process if there is reason to believe that it will make a significant improvement.

The next section considers causes of defects. Knowing defect causes helps us prevent and remove defects. It also enables us to monitor events that trigger the creation of defects and therefore potentially detect defects earlier. For example, a defect may be caused by not following the processes used to create requirements, and not following a process may be caused by inadequate training. This information tells us that we should use skilled requirements developers or institute adequate training for requirements development and that we should also monitor training completions and adequacy.

2.4 Causes of Defects

To prevent or eliminate a defect, it is important to know the causes of the defect. Knowing defect causes helps us predict defects, reduce their likelihood, and manage resources more efficiently. This strategy is analogous to the use of “Physics of Failure” techniques for hardware reliability. Defects usually have a causal chain, a sequence of events that ultimately results in the given defect. In this chain of causes, it may be that only a few of the causes are readily detectable. To choose the best place and approach to correct the problem, we need to understand this chain. It is also important to know that there may be more than one causal chain for a given defect, i.e. the confluence of two or more such chains results in the defect.

Consistent with the idea of causal chains, we distinguish between primary and secondary causes of defects. For purposes of this book, a primary cause of a defect is a root cause of the defect. Successfully addressing a primary cause not only addresses the specific instance of the defect in question but also prevents other similar defects from occurring and therefore improves the running of the entire project. Addressing a secondary cause may remove the current defect and may in some cases prevent other similar defects, but it does not address the more fundamental cause of the problem and therefore risks problem reoccurrence.

Examples of secondary causes include inadequate project objectives, unclear requirements, and excessively complex software code. Each of these causes provides useful information but is not the root cause of the defect. For each, we can constructively ask for additional information. For example, a requirement may be unclear causing unintended behavior from the resulting software code. The requirement can be clarified, and the code can be changed to address the clarified version of the requirement, thereby eliminating the defect. However, we need to ask if there is a way to prevent or reduce the likelihood of unclear requirements. We should ask what caused the unclear requirement and how we can improve the way that we produce requirements. Secondary causes are useful for helping us detect and analyze defects. They are covered for specific project phases in Chapter 4, but we need to understand defects at a deeper level to more effectively prevent or remove them.

Root cause analysis is the process of finding the primary cause of a defect or problem. At a high level, root cause analysis usually follows steps similar to the following taken from [1]:

Identify the problem.

Determine the significance of the problem.

Identify the causes (conditions or actions) directly preceding and surrounding the problem.

Identify the reasons why the causes in the previous step exist and work backward to the root cause.

A critical part of this analysis is to systematically work our way back to the root cause, and there are various techniques that can be used in this process. Several are listed below:

Five whys

Fault tree analysis (FTA)

Fishbone diagrams (cause/effect or Ishikawa diagrams)

Scatter plots and correlation analysis: These can be used to determine if two factors correlate with one another and aid in finding a causal relation.

FMEA/FMECA

Event and causal factor analysis

Barrier analysis

Change analysis

Human performance evaluation

See the topic on root cause analysis in Section 6.5 for more on these and other root cause analysis techniques.
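For the correlation analysis item in the list above, a Pearson correlation coefficient is one common measure. The sketch below computes it directly from its definition. The data are invented, and a strong correlation only suggests where to look for a causal relation and does not establish one.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one factor never varies")
    return covariance / (spread_x * spread_y)


# Hypothetical weekly data: overtime hours worked and defects found in review.
overtime_hours = [0, 5, 10, 15, 20]
defects_found = [1, 2, 2, 4, 5]
r = pearson_r(overtime_hours, defects_found)
```

A value of `r` near 1 or -1 indicates a strong linear relationship between the two factors, and a value near 0 indicates little linear relationship.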

In finding root causes, it can be useful knowing the categories applicable to most defects. Although the following list is not necessarily complete, most defects in software production or monitoring can be traced to these high-level issues:

Not producing or monitoring the right things: For example, we may have a software project with a significant number of interfaces, but we are not producing any interface documentation to specify them.

Poor processes for producing or monitoring a product or process: An example of this type of issue is having a product release process that does not ensure correct configuration control of the product, potentially resulting in the wrong product being released.

Not following the processes: This issue could occur if the process for creating software code is adequate, but because of schedule pressures and staffing issues, certain steps are not performed.

Following the process or monitoring poorly: With this issue, we use the procedure but do it poorly or intermittently. An example of this issue is having an adequate process for creating software code but using an inexperienced software developer who is unfamiliar with the process or is unable to follow the steps properly.

Non-human factors: The first four of these categories of failures are largely due to human errors, and humans make mistakes in spite of excellent processes, resources, ability, and training. However, some errors cannot reasonably be attributed to human error. For example, externally imposed constraints may make defects more likely. A sudden change in legal requirements may require a project change that negatively impacts software reliability. Errors from this type of situation may appear at almost any time in any process or product.

As stated above, the first four of these categories of failures are largely due to human errors. Causes of human errors include the following:

Insufficient knowledge: This type of error is due to one or more task performers not knowing or not having access to relevant information.

Cognitive failure: A cognitive error occurs when a task performer is unable to correctly process the required task information.

Lack of needed skills: As the name indicates, this error is due to a task performer not having the correct skill set to perform the task.

Attention failure: Attention failures occur because of carelessness or loss of focus and a task that otherwise would be performed correctly is adversely affected.

Overload: Overload is due to too much work or too much multitasking.

Contradictory tasks: Sometimes, a task performer is assigned tasks or conditions that cannot all be satisfied, such as writing software code for contradictory requirements.

Lack of motivation: Lack of motivation is typically due to a lack of interest or a “bad attitude” and can be exacerbated by a poor work environment.

Misunderstanding: Poor communication between two or more employees can result in misunderstandings.

Using our example of a defect because of an unclear requirement, suppose that we have traced the original problem to an unclear requirement, giving us a secondary cause. With further analysis, we find that the requirement is unclear because there was a misunderstanding of who was responsible for the requirement and a “placeholder” was put into the specification until the issue was resolved. The issue was forgotten about because the requirement had no identification as being a placeholder. At this point, we need to know if this cause of confusion is an isolated incident or more systemic. If it is systemic, we may have other major issues to address. We also need to address how a “placeholder” mistakenly became a requirement and why it was forgotten. These mistakes could be due to the requirements processes not addressing placeholders, or someone not following the procedure, or perhaps other issues. With further analysis, we find that the process does address the placeholder situation, and the person who performed it incorrectly was temporarily assigned to the project to relieve a budget issue on a different project and had not been trained for the task. This information enables us to direct our efforts toward the root cause of the problem. For example, we could add training for temporarily assigned personnel or, if this is impractical because of schedule constraints, add monitoring of products produced by these personnel.
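The chain in this example can be recorded as a set of "caused by" links and followed back to the root, much as a five whys session would. The sketch below is a simplification that assumes a single chain, whereas a real defect may arise from the confluence of several. The wording of each link paraphrases the example above.

```python
def trace_to_root(causes, problem):
    """Follow 'caused by' links from a problem back to a cause with no recorded cause."""
    chain = [problem]
    seen = {problem}
    while chain[-1] in causes:
        next_cause = causes[chain[-1]]
        if next_cause in seen:
            raise ValueError("cycle in causal chain")
        chain.append(next_cause)
        seen.add(next_cause)
    return chain


# Each key is an effect and each value is the cause found by asking "why?".
causes = {
    "unintended software behavior": "unclear requirement",
    "unclear requirement": "unmarked placeholder left in specification",
    "unmarked placeholder left in specification": "placeholder process not followed",
    "placeholder process not followed": "temporary staff member not trained",
}
chain = trace_to_root(causes, "unintended software behavior")
root_cause = chain[-1]
```

The intermediate entries in `chain` are the secondary causes, and the last entry is the candidate primary cause toward which mitigation should be directed.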

Causes often suggest possible mitigations that may prevent or reduce the likelihood that such defects will occur in the future. For example, if a defect is caused by someone not having the right skill set, the person could be trained or supported by another employee with a stronger skill set or moved to tasks better suited to the employee's current skills. Non-human errors need to be addressed on a case-by-case basis. Chapter 3 covers mitigation of defects at a high level, while Chapter 4 covers mitigation techniques and processes in more detail. Chapter 29 of [2] contains more information on human errors and reliability. See [3] and [4] for more details on the causes of software defects.

As a final note, knowing the causes of defects enables us to better predict defects and plan ways to avoid creating them. In Chapter 3, we cover planning the steps and processes needed to achieve our software reliability objectives within the given resources. Part of this plan is to create a list of what can go wrong and to use this list to institute ways of preventing or detecting these problems. A sound knowledge of potential root causes can therefore help us prevent defects in a timely and cost-effective manner.

References

1 DOE. DOE-NE-STD-1004-92, root cause analysis, 1992. Available via http://everyspec.com/DOE/DOE-PUBS/DOE_NE_STD_1004_92_262/. Accessed 22 Aug 2020.

2 Pham H., editor. Handbook of reliability engineering. Springer-Verlag, London, 2003.

3 Neufelder A. Ensuring software reliability. Marcel Dekker, Inc., New York, 1993.

4 Musa J. Software reliability engineering: More reliable software faster and cheaper. AuthorHouse, 2004.

3 Handling Defects

To produce reliable software under cost and schedule constraints, we need to carefully plan our project activities and ensure that the plan is implementable by the team put together to do the tasks. This chapter outlines how to develop an overall strategy for software reliability. It then covers the nature of our software reliability objectives and provides details on how to plan the project to build reliability into the software with each project activity. We also discuss how to make the plan implementable. Finally, we discuss analogies between hardware reliability and software reliability engineering (SRE). As most practitioners are more familiar with hardware reliability, it is hoped that these analogies will help them better understand and more effectively implement software reliability practices.

3.1 Strategy for Handling Defects

In Chapter 2, we learn about errors, defect precursors, and defects, and in this chapter, we use this information to construct processes to handle these defects and to produce reliable software. To handle defects, we use four complementary approaches:

Prevent errors and defects by anticipating likely causes and providing mitigations.

Remove defects by monitoring and detecting defects and errors, preferably early when removal is more cost- and schedule-effective.

Design the system to be fault tolerant to reduce the impact of defects that are in the system.

Forecast defects and faults to manage project resources and to gain confidence in the reliability of the product.
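The third approach, fault tolerance, can be illustrated with a sketch in the spirit of a recovery block: run a primary routine, check its result with an acceptance test, and fall back to an alternate routine if the check fails. All names below are hypothetical, and the primary routine is deliberately defective to show the mechanism.

```python
def recovery_block(primary, alternate, acceptance_test):
    """Wrap `primary` so that a failed or rejected result falls back to `alternate`."""

    def guarded(x):
        try:
            result = primary(x)
            if acceptance_test(x, result):
                return result
        except Exception:
            pass  # a real system would also record the fault for later removal
        return alternate(x)

    return guarded


def fast_sqrt(x):
    # Hypothetical defective routine: correct only for x == 0 and x == 4.
    return x / 2


def slow_sqrt(x):
    return x ** 0.5


def close_enough(x, result):
    # Acceptance test: squaring the result should reproduce the input.
    return abs(result * result - x) < 1e-9


safe_sqrt = recovery_block(fast_sqrt, slow_sqrt, close_enough)
```

Calling `safe_sqrt(9)` returns the alternate routine's answer because the primary result fails the acceptance test. The defect remains in the system, but its impact is contained.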

Producing highly reliable software within project constraints requires clear goals, careful planning, and good execution. A standard overall process for achieving almost any goal is shown in Figure 3.1. This process is closely related to the “Plan–Do–Check–Act” Deming cycle. In more detail, this overall or high-level process consists of the following steps:

Determine objectives: Decide on reasonable objectives for software reliability consistent with the needs of the customer and the project resources.

Figure 3.1 Overall Process.

Plan: Determine the steps and processes needed to achieve these objectives within the given resources.

Implementation and monitoring: Decide how to perform the plan and what the signs of success and of trouble are.

Feedback: Decide when feedback indicates that changes are needed, and if so indicated, determine what changes are appropriate and when to make them.

The next sections consider each of these steps in more detail.

3.2 Objectives

At a high level, a project wants to produce a profitable product with a high level of customer satisfaction. Both customer satisfaction and product profitability relate to software reliability. Not only is high software reliability important, but we also need to have some level of assurance of its reliability. Expanding on this, we consider two main objectives for software reliability along with typical sub-objectives for each:

Objective 1: Create a highly reliable software product on schedule and within budget constraints. This objective can be further broken down into the following:

Sub-objective 1a: Prevent defects from entering into the product.

Sub-objective 1b: If a defect is in the product, design the product to perform adequately in spite of the defect.

Sub-objective 1c: If a defect is introduced into the product, find and remove it as soon and as economically as possible.

Objective 2: Know with a high level of assurance that the software is sufficiently reliable. The sub-objectives include the following:

Sub-objective 2a: Determine metrics and criteria for assessing the reliability of the software product.

Sub-objective 2b: Design methods to collect and analyze the information required to make the assessment.

Sub-objective 2c: Monitor the product, processes, and implementation of the processes to determine if the software is at risk of not being sufficiently reliable.

Ultimately, we want a satisfied customer; therefore, customer inputs and feedback throughout the project, but particularly with project objectives, can prove highly beneficial. Also, determining objectives is typically an iterative process. As the project progresses, objectives should be made more precise. Ideally, we have quantitative objectives that can be monitored and used to indicate when we are on track and when corrective actions are needed. However, we should consider our objectives carefully. Our objectives guide our plan and therefore our implementation of the plan. These can suffer if the objectives are not clear, motivated, and well accepted.
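As an example of a quantitative objective that can be monitored, the sketch below compares an observed failure intensity (failures per operating hour) with a target. The choice of metric, the function names, and the numbers are assumptions for illustration. A point estimate like this does not by itself provide a stated level of assurance.

```python
def failure_intensity(failure_count, operating_hours):
    """Observed failures per operating hour."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return failure_count / operating_hours


def meets_objective(failure_count, operating_hours, max_failures_per_hour):
    """True when the observed failure intensity is at or below the objective."""
    return failure_intensity(failure_count, operating_hours) <= max_failures_per_hour


# Hypothetical test campaign: 3 failures observed in 1500 operating hours.
observed = failure_intensity(3, 1500)
on_track = meets_objective(3, 1500, max_failures_per_hour=0.005)
```

When `on_track` is false, the result can be used to indicate that corrective action is needed.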

3.3 Plan

After determining our objectives, we need a plan to coordinate the efforts used to achieve them. This plan should provide guidance on the following:

How to prevent or reduce the impact of each type of anticipated error or defect affecting the software.

How to monitor for errors and defects, anticipated or not, in products, in process compliance, and in process effectiveness.

How to determine when the monitoring should trigger some form of action and what that action should be.

When and how to perform root cause analysis and how to use the analysis results and other monitoring information to make changes to the products, processes, and implementation of the processes.

The typical steps for designing the software reliability activities for a project are as follows:

Step 1: List steps that the project will perform to produce the software product.

Step 2: List what can go wrong in each of these steps.

Step 3: List how we can prevent these defects and errors or at least significantly reduce their likelihood and impact.

Step 4: List ways that we can quickly know if something goes wrong, i.e. list what monitoring is needed.

Step 5: List when the information from the monitoring indicates that we should do something different and what it should be.

Step 6: List how we will know if our processes and corrective actions are effective, and if they are not, list what we should do. We need to know how confident we can justifiably be in our product.
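The six steps can be captured in a simple record per project step, which makes it easy to see what is still unplanned. The structure below is only one possible layout, and the field names and sample entry are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PlanEntry:
    step: str                                                  # Step 1: project step
    failure_modes: list = field(default_factory=list)          # Step 2: what can go wrong
    preventions: list = field(default_factory=list)            # Step 3: how to prevent it
    monitors: list = field(default_factory=list)               # Step 4: how we will know
    triggers: list = field(default_factory=list)               # Step 5: when to act, and how
    effectiveness_checks: list = field(default_factory=list)   # Step 6: is it working


def missing_items(entry):
    """Return the names of plan fields that are still empty for this project step."""
    names = ["failure_modes", "preventions", "monitors", "triggers", "effectiveness_checks"]
    return [name for name in names if not getattr(entry, name)]


# Hypothetical, partially completed entry for illustration only.
entry = PlanEntry(
    step="Write interface specification",
    failure_modes=["units not stated for a field"],
    preventions=["specification template with a units column"],
    monitors=["peer review checklist"],
)
gaps = missing_items(entry)
```

Here `gaps` shows that triggers and effectiveness checks have not yet been planned for this step.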

We elaborate on these steps below. It is important to note, however, that our plan and how we implement it depend on our objectives, on the staff and resources available to implement the plan, and on the nature of the project. For example, we should consider the chosen software development process, such as spiral, incremental, or cleanroom software development, when planning these activities. The results of following these steps can then form the basis of the Software Reliability Program Plan (SRPP).

Step 1: List steps to produce the product: To effectively reduce the impact of software defects, we need to identify where defects can be created, and this means understanding the processes and products used to produce and maintain the software. As a result, this first step requires that we create a list of project phases and the products that each produces. The list should also include the processes used to produce the products and where these processes are documented. If there are important processes that are not documented, the plan should note this and encourage the project to suitably document them.

Section 2.1 addresses this need at a high level. Chapter 4