Authoritative resource providing step-by-step guidance for producing reliable software to be tailored for specific projects
Software Reliability Techniques for Real-World Applications is a practical, up-to-date, go-to source that can be referenced repeatedly to efficiently prevent software defects, find and correct defects if they occur, and create a higher level of confidence in software products. From concept development to software support and maintenance, the author describes each phase of a project, such as concept development, design and coding, product production, operation and maintenance, and management, along with the activities and products needed for each. Software Reliability Techniques for Real-World Applications introduces clear ways to understand each process of software reliability and explains how it can be managed effectively and reliably. The book is supported by detailed examples and systematic approaches, and it covers analogies between hardware and software reliability to ensure a clear understanding.
In Software Reliability Techniques for Real-World Applications, readers will find specific information on:
* Defects, including where defects enter the project system, effects, detection, and causes of defects, and how to handle defects
* Project phases, including concept development and planning, requirements and interfaces, design and coding, and integration, verification, and validation
* Roadmap and practical guidelines, including at the start of a project, as a member of an organization, and how to handle troubled projects
* Techniques, including an introduction to techniques in general, plus techniques by organization (systems engineering, software, and reliability engineering)
Software Reliability Techniques for Real-World Applications is a practical text on software reliability, providing over sixty-five different techniques and step-by-step guidance for producing reliable software. It is an essential and complete resource on the subject for software developers, software maintainers, and producers of software.
Page count: 772
Year of publication: 2022
Cover
Title Page
Copyright
Dedication
Preface
Series Editor's Foreword by Dr. Andre Kleyner
Acronyms
Glossary
References
1 Introduction
1.1 Description of the Problem
1.2 Implications for Software Reliability
References
2 Understanding Defects
2.1 Where Defects Enter the Project System
2.2 Effects of Defects
2.3 Detection of Defects
2.4 Causes of Defects
References
3 Handling Defects
3.1 Strategy for Handling Defects
3.2 Objectives
3.3 Plan
3.4 Implementation, Monitoring, and Feedback
3.5 Analogies Between Hardware and Software Reliability Engineering
References
4 Project Phases
4.1 Introduction to Project Phases
4.2 Concept Development and Planning
4.3 Requirements and Interfaces
4.4 Design and Coding
4.5 Integration, Verification, and Validation
4.6 Product Production and Release
4.7 Operation and Maintenance
4.8 Management
References
5 Roadmap and Practical Guidelines
5.1 Summary and Roadmap
5.2 Guidelines
References
6 Techniques
6.1 Introduction to the Techniques
6.2 Techniques for Systems Engineering
6.3 Techniques for Software
6.4 Techniques for Reliability Engineering
6.5 Project-Wide Techniques and Techniques for Quality Assurance
References
Index
End User License Agreement
Chapter 4
Table 4.1 Defects per Activity (%).
Chapter 6
Table 6.1 Techniques for Software Reliability.
Table 6.2 Techniques for Systems Engineering.
Table 6.3 Techniques Related to Systems Engineering.
Table 6.4 Techniques for Software.
Table 6.5 Techniques Related to Software.
Table 6.6 Confidence Interval for Exponentially Distributed Failure-Terminat...
Table 6.7 Confidence Interval for Exponentially Distributed Time-Terminated ...
Table 6.8 Test Duration Example.
Table 6.9 Techniques for Reliability Engineering.
Table 6.10 Techniques Related to Reliability Engineering.
Table 6.11 Example of Operational Profile, A1.
Table 6.12 Example of Operational Profile, A2.
Table 6.13 Software FMEA.
Table 6.14 Software FMEA (cont.).
Table 6.15 Predicted Critical Failure Rate Values.
Table 6.16 Characteristics of the Software LRUs.
Table 6.17 Predicted Defect Density Values for Software LRUs.
Table 6.18 Predicted Defects, Failure Rate, MTBF, MTBCF, and Reliability.
Table 6.19 Approximate Kolmogorov–Smirnov Statistics Critical Values – One-S...
Table 6.20 Software Fault Data.
Table 6.21 Sequential Probability Ratio Test.
Table 6.22 Project-Wide Techniques and Techniques for Quality Assurance.
Table 6.23 Measurement Selection Matrix.
Table 6.24 Software Failures.
Table 6.25 Process FMEA.
Table 6.26 Process FMEA (cont.).
Table 6.27 Code Review Metrics.
Table 6.28 Laplace Test Statistic Example.
Table 6.29 Mann–Kendall Test Statistic Example.
Table 6.30 Spearman's Rank Correlation Coefficient Example.
Chapter 3
Figure 3.1 Overall Process.
Figure 3.2 Designing and Running a Project.
Chapter 6
Figure 6.1 Reliability Block Diagram.
Figure 6.2 Example Communication Network.
Dr. Andre V. Kleyner
Series Editor
The Wiley Series in Quality & Reliability Engineering aims to provide a solid educational foundation for both practitioners and researchers in the Q&R field and to expand the reader's knowledge base to include the latest developments in this field. The series will provide a lasting and positive contribution to the teaching and practice of engineering.
The series coverage will include, but is not limited to:
Statistical methods
Physics of failure
Reliability modeling
Functional safety
Six-sigma methods
Lead-free electronics
Warranty analysis/management
Risk and safety analysis
Wiley Series in Quality & Reliability Engineering
Software Reliability Techniques for Real-World Applications
by Roger K. Youree
December 2022
System Reliability Assessment and Optimization: Methods and Applications
by Yan-Fu Li, Enrico Zio
April 2022
Design for Excellence in Electronics Manufacturing
by Cheryl Tulkoff, Greg Caswell
April 2021
Design for Maintainability
by Louis J. Gullo (Editor), Jack Dixon (Editor)
March 2021
Reliability Culture: How Leaders can Create Organizations that Create Reliable Products
by Adam P. Bahret
February 2021
Lead-free Soldering Process Development and Reliability
by Jasbir Bath (Editor)
August 2020
Automotive System Safety: Critical Considerations for Engineering and Effective Management
by Joseph D. Miller
February 2020
Prognostics and Health Management: A Practical Approach to Improving System Reliability Using Condition-Based Data
by Douglas Goodman, James P. Hofmeister, Ferenc Szidarovszky
April 2019
Improving Product Reliability and Software Quality: Strategies, Tools, Process and Implementation, 2nd Edition
by Mark A. Levin, Ted T. Kalal, Jonathan Rodin
April 2019
Practical Applications of Bayesian Reliability
by Yan Liu, Athula I. Abeyratne
April 2019
Dynamic System Reliability: Modeling and Analysis of Dynamic and Dependent Behaviors
by Liudong Xing, Gregory Levitin, Chaonan Wang
March 2019
Reliability Engineering and Services
by Tongdan Jin
March 2019
Design for Safety
by Louis J. Gullo, Jack Dixon
February 2018
Thermodynamic Degradation Science: Physics of Failure, Accelerated Testing, Fatigue and Reliability
by Alec Feinberg
October 2016
Next Generation HALT and HASS: Robust Design of Electronics and Systems
by Kirk A. Gray, John J. Paschkewitz
May 2016
Reliability and Risk Models: Setting Reliability Requirements, 2nd Edition
by Michael Todinov
November 2015
Applied Reliability Engineering and Risk Analysis: Probabilistic Models and Statistical Inference
by Ilia B. Frenkel (Editor), Alex Karagrigoriou (Editor), Anatoly Lisnianski (Editor), Andre V. Kleyner (Editor)
October 2013
Design for Reliability
by Dev G. Raheja (Editor), Louis J. Gullo (Editor)
July 2012
Effective FMEAs: Achieving Safe, Reliable, and Economical Products and Processes Using Failure Modes and Effects Analysis
by Carl Carlson
April 2012
Failure Analysis: A Practical Guide for Manufacturers of Electronic Components and Systems
by Marius Bazu, Titu Bajenescu
April 2011
Reliability Technology: Principles and Practice of Failure Prevention in Electronic Systems
by Norman Pascoe
April 2011
Improving Product Reliability: Strategies and Implementation
by Mark A. Levin, Ted T. Kalal
March 2003
Test Engineering: A Concise Guide to Cost-Effective Design, Development and Manufacture
by Patrick O'Connor
April 2001
Integrated Circuit Failure Analysis: A Guide to Preparation Techniques
by Friedrich Beck
January 1998
Measurement and Calibration Requirements for Quality Assurance to ISO 9000
by Alan S. Morris
October 1997
Electronic Component Reliability: Fundamentals, Modelling, Evaluation, and Assurance
by Finn Jensen
November 1995
Roger K. Youree
Instrumental Sciences Incorporated
Huntsville, USA
This edition first published 2023
© 2023 John Wiley and Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Roger K. Youree to be identified as the author of this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty
In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data Applied for
Hardback ISBN: 9781119931829
Cover Design: Wiley
Cover Image: © Titima Ongkantong/Shutterstock
This book is dedicated to my wife, Susan.
Software reliability as a discipline started later than hardware reliability but has grown rapidly. Software reliability is an active area of research with new results, both theoretical and practical, being published regularly. As is often the case with a relatively young discipline, much of the material may seem unconnected, making it difficult to determine how to choose the right techniques for a given job and organization.
This book is a survey of techniques and approaches that can be used to produce reliable software in a cost- and schedule-efficient manner. It focuses on practical techniques and is tailored for practitioners and not academics. Software reliability is not in any one organization's domain, and this book takes a broad approach in that it considers all activities that affect the software, such as conceptual design and requirements development, even though they generally occur well before any coding takes place. Preventing or removing defects from these early activities will pay significant dividends.
The early chapters of this book are intended to provide an overall understanding of the nature of the problem, followed by more practical suggestions in later chapters. Chapter 2 covers some definitions and useful information about defects. Chapter 3 outlines an overall approach for developing reliable software, followed by Chapter 4, which describes different project phases or stages, how defects can enter the project in each phase, ways to mitigate these defects, and ways to monitor for defects applicable to the phase. Chapter 5 provides a summary and roadmap, along with some practical guidelines. The book concludes with Chapter 6, which gives more details on some of the techniques mentioned in Chapters 3 and 4.
Roger K. Youree
Instrumental Sciences Incorporated
Huntsville, Alabama, USA
The Wiley Series in Quality & Reliability Engineering aims to provide a solid educational foundation for researchers and practitioners in the field of quality and reliability engineering and to expand the knowledge base by including the latest developments in these disciplines.
The importance of quality and reliability to a system can hardly be disputed. Product failures in the field inevitably lead to losses in the form of repair costs, warranty claims, customer dissatisfaction, product recalls, loss of sales, and in extreme cases, loss of life.
Engineering systems are becoming increasingly complex with added functions and capabilities; however, the reliability requirements remain the same or are even growing more stringent. Increasing integration of hardware and software is making these systems even more complex and challenging to design. For example, in autonomous driving vehicles, software may play an even more important role than the hardware. All this brings ever-increasing attention to the topic of software quality and reliability.
The book you are about to read has been written by an expert and state-of-the-art practitioner in the field of software reliability. It covers a variety of topics critical to producing high-quality, malfunction-free software in a timely manner.
At present, despite its obvious importance, quality and reliability education is paradoxically lacking in today's engineering curriculum. Very few engineering schools offer degree programs or even a sufficient variety of courses in quality or reliability methods. The topics of reliability analysis, accelerated testing, reliability modeling and simulation, warranty data analysis, reliability growth programs, and other practical applications of reliability engineering receive very little coverage in today's engineering student curriculum. Therefore, the majority of the quality and reliability practitioners receive their professional training from colleagues, professional seminars, and professional publications. This book is intended to close some of these gaps and provide additional educational opportunities for a wide range of readers from graduate-level students to seasoned reliability professionals.
We are confident that this book as well as this entire book series will continue Wiley's tradition of excellence in technical publishing and provide a lasting and positive contribution to the teaching and practice of reliability and quality engineering.
There are many commonly used acronyms in software reliability, but some may have different meanings to different people. The following list of acronyms is used in this book:
AADL: Architecture Analysis and Design Language
ATAM: Architecture Tradeoff Analysis Method℠
BDD: Behavior-Driven Development
BOM: Bill of Materials
CDP: Concept Development and Planning
CIL: Critical Items List
CONOPS: Concept of Operations
DC: Design and Coding
DOE: Design of Experiments
DRACAS: Defect Reporting, Analysis, and Corrective Action System
DRB: Defect Review Board
DRE: Defect Removal Efficiency
EKSLOC: Effective thousand (Kilo) Source Lines of Code
FDSC: Failure Definition and Scoring Criteria
FEF: Fix Effectiveness Factor
FMEA: Failure Modes and Effects Analysis
FMECA: Failure Modes, Effects, and Criticality Analysis
FRACAS: Failure Reporting, Analysis, and Corrective Action System
FRB: Failure Review Board
FTA: Fault Tree Analysis
IV&V: Integration, Verification, and Validation
LRU: Line Replaceable Unit
MOE: Measure of Effectiveness
MS: Management Strategy
MTBCF: Mean Time Between Critical Failures
MTBEFF: Mean Time Between Essential Function Failures
MTBF: Mean Time Between Failures
MTBSA: Mean Time Between System Aborts
MTSWR: Mean Time to SoftWare Restore
MTTF: Mean Time to Failure
NVA: Non-Value Added
ODC: Orthogonal Defect Classification
OM: Operation and Maintenance
OP: Operational Profile
PFMEA: Process Failure Modes and Effects Analysis
QA: Quality Assurance
QFD: Quality Function Deployment
RCA: Root Cause Analysis
RPN: Risk Priority Number
SDD: Software Design Document
SFMEA: Software Failure Modes and Effects Analysis
SFTA: Software Fault Tree Analysis
SLOC: Source Lines of Code
SPC: Statistical Process Control
SRE: Software Reliability Engineering
SRGP: Software Reliability Growth Plan
SRPP: Software Reliability Program Plan
SRS: Software Requirements Specification
SysML: Systems Modeling Language
TBA: To Be Added
TBD: To Be Determined
TBR: To Be Reviewed
TBS: To Be Supplied
TEMP: Test and Evaluation Master Plan
TDD: Test-Driven Development
UML: Unified Modeling Language
WBS: Work Breakdown Structure
Defect: A defect is a problem that, if not corrected, could cause an application or product to either fail or to produce incorrect or unsatisfactory results.
Defect precursor: A defect precursor is an event that does not directly result in a defect being placed in the software but makes the introduction of a defect into the software more likely.
Error: An error is a human action that produces an incorrect result. Note that the word “error” is also a standard part of some software terms, such as “runtime errors” and “memory errors.”
Essential function failure: An essential function failure is any incident or incorrect function that causes (or could cause) the loss of an essential function or the degradation of an essential function below a specified level. Essential functions are the minimum operational tasks that the system must perform to accomplish its mission or achieve acceptable customer satisfaction. A system abort (SA) is an essential function failure, but not all essential function failures are SAs.
Failure: There are two definitions of failure that are typically used:
* A failure is the inability of a system or system component to perform a required function within specified limits.
* A failure is the termination of the ability of a product to perform a required function or its inability to perform within previously specified limits.
Fault: Again, there are two definitions of fault that are typically used:
* A fault is a defect in the software code that can be the cause of one or more failures.
* A fault is a manifestation of an error in the software.
Nonessential function failure: A nonessential function failure is any incident or incorrect function that causes (or could cause) the loss of a nonessential function or the degradation of an essential function but not to an unacceptable level.
Operational profile: An operational profile (OP) is a set of relative frequencies (or probabilities) of occurrences of disjoint software operations during operational use.
Phase: A phase, or project phase, is a period in the life cycle of the project dedicated to a certain set of tasks and products. Other terms sometimes used are stages, increments, or sprints. Phases typically overlap.
Project: For this book, a project is defined as an organized undertaking to produce one or more products. We use the term “project” rather than “program” to avoid confusion with the use of “program” to refer to a software program.
Project system: For purposes of this book, we define a project system to be the finished product, all of the intermediate products, tools, services, and documentation used to develop the finished product, and all of the processes used in the project. When the risk of confusion is small, “system” may be used in the place of “project system.”
Root cause: The root cause of a defect, also called a primary cause, is the initial causal event or chain of events that results in a defect. The root cause is the fundamental reason for the defect and, if corrected, will prevent recurrence of this and similar defects.
System abort: An SA, sometimes known as a mission abort or operational mission failure, is an essential function failure that occurs during a mission or critical operations and that prevents critical aspects of system performance. It usually results in terminating the mission or operations. A software crash is an example of an SA.
Software reliability: Software reliability is the probability that the software will not cause a system failure for a specified time period under specified conditions.
Software reliability engineering: Software reliability engineering (SRE) is defined in [1] as “the quantitative study of the operational behavior of software-based systems with respect to user requirements concerning reliability.” It includes the following:
* Software reliability prediction and estimation.
* The use of attributes and metrics of the product design, development process, and operational environment to assess and improve software reliability.
* The application of this knowledge to specify and guide design, development, testing, acquisition, use, and maintenance.
Software reliability estimation: There are two definitions of software reliability estimation in frequent use:
* Reference [2] defines software reliability estimation as “The application of statistical techniques to observed failure data collected during system testing and operation to assess the reliability of the software.”
* Reference [1] defines software reliability estimation as the activity that “… determines current software reliability by applying statistical inference techniques to failure data obtained during system test or during system operation. This is a measure regarding the achieved reliability from the past until the current point.”
Software reliability prediction: There are two definitions of software reliability prediction often used:
* Reference [2] defines software reliability predictions as “A forecast or assessment of the reliability of the software based on parameters associated with the software product and its development environment.”
* Reference [1] defines software reliability predictions as the activity that “… determines future software reliability based on the available software metrics and measures.”
Validation: Validation of a product answers the question of whether the product meets the needs that prompted its creation.
Verification: Verification of a product answers the question of whether the product satisfies its requirements.
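Two of the glossary definitions lend themselves to a small numerical illustration. The sketch below is not taken from the book: it turns hypothetical usage counts into an operational profile (relative frequencies of disjoint operations), and it evaluates software reliability as the probability of no failure over a specified time period under the common simplifying assumption of a constant failure rate, R(t) = exp(-λt). The operation names, counts, failure rate, and function names are all invented for illustration.

```python
import math


def operational_profile(operation_counts):
    """Convert observed counts of disjoint operations into relative frequencies."""
    total = sum(operation_counts.values())
    if total <= 0:
        raise ValueError("at least one operation must have been observed")
    return {name: count / total for name, count in operation_counts.items()}


def reliability(mission_time, failure_rate):
    """Probability of no failure during mission_time.

    Assumes a constant failure rate (exponential model); this is a modeling
    assumption for the sketch, not a property of software in general.
    """
    if mission_time < 0 or failure_rate < 0:
        raise ValueError("mission_time and failure_rate must be non-negative")
    return math.exp(-failure_rate * mission_time)


# Hypothetical usage counts for three disjoint operations.
counts = {"login": 600, "search": 300, "checkout": 100}
profile = operational_profile(counts)

# Hypothetical failure rate of 0.001 failures per hour over a 100-hour period.
r = reliability(mission_time=100.0, failure_rate=0.001)

print(profile)
print(round(r, 4))
```

The profile values sum to one because the operations are treated as disjoint, which matches the definition above. Under the constant-rate assumption, the mean time between failures is the reciprocal of the failure rate, so the same function can be called with `failure_rate=1 / mtbf`.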
1. Lyu M, editor. Handbook of software reliability engineering. Computer Society Press and McGraw-Hill Book Company, New York, 1996.
2. IEEE Standard 1633. Recommended practice on software reliability, 2017. Software Engineering Technical Committee of the IEEE Computer Society.
Software is ubiquitous in today's world. It controls our home appliances, automobiles, phones, and many of our forms of entertainment. It increases our productivity at work, speeds our communications, and improves our medical care. It affects nearly every aspect of modern life. Software is also getting more complicated for a number of reasons, such as an increase in the number and diversity of software applications, the more varied types of platforms for the software, and the increased reliance on other “third-party” software. Because of this, it is critical to produce reliable software. Software that fails often may mean that some entertainment application is not as entertaining as intended, or it could result in a life-or-death situation in a hospital or a mass transit system.
As mentioned above, software is everywhere and is becoming more and more complicated. It is largely “handmade” and subject to human errors. Also, most software contains, or at least interfaces with, software developed independently by other companies. As a result, software defects can be subtle and difficult to find, sometimes only manifesting themselves under very specific conditions. Unfortunately, when these conditions occur, the effects of a defect may be very serious, including loss of life. Even if lives do not depend on the software, litigation can seriously damage a company.
Software reliability tasks are often assigned to reliability engineering personnel. Many times, these people are more familiar with hardware reliability than they are with software reliability. Hardware reliability and software reliability are different, and hardware reliability engineers are frequently uncomfortable with software reliability.
There is more to the problem than just producing reliable software. There are budgets and schedules to meet. Whatever is done to produce reliable software must meet these constraints. Another consideration is the highly dynamic business environment typical of modern software products. Customer needs and wants are always changing, and if one company does not respond to them, another will.
Many times, software reliability is treated as an added task to be performed after the software has been developed and is in the process of being tested. The importance of software reliability and the seriousness of the constraints that must be adhered to mean that there are often issues affecting software reliability that should be addressed early in the development process. Producing reliable software within the budget and schedule constraints requires embedding a software reliability mindset into the project from its start.
Software reliability is ultimately about achieving customer satisfaction with a profitable product. This goal requires many things other than reliability, but it is unlikely to be achieved with a seriously defective product. The importance of software reliability, along with the complexity of the problem and the budget and schedule constraints inherent to the problem, means that a software reliability program should be planned and implemented early in the development effort and monitored and adjusted as needed. A company that does this successfully has a huge competitive advantage over a company that is unsuccessful at it.
Good software reliability practices are about doing things right the first time, and this effort starts at the beginning of the development effort. It is often said that doing a job right takes less time than doing it over, and this advice often holds. It is particularly applicable to software reliability given how difficult it can be to find and remove some types of software defects. Not all software defects are coding issues. Many are due to defects in products produced much earlier in the effort, and preventing or finding and removing them early, before they become deeply embedded in downstream products, can be very cost-effective and schedule-effective. Most people recognize the importance of software reliability for critical software, but many do not understand that good software reliability practices can reduce the cost of development and maintenance of the software. When properly planned and implemented, a software reliability program can significantly reduce the amount of rework required, and rework costs money and can affect the schedule. One of the more obvious examples of reducing rework is with software testing. Software testing is expensive, and applying good software reliability techniques from early in the effort can mean significantly fewer faults found during software testing, resulting in less re-testing and shorter test cycles.
Choosing a good set of reliability techniques for a software project requires anticipating the types of defects and errors that are likely to occur in that project. However, our knowledge of the future is not perfect. It is said that in war, a general's plan for a battle never survives first contact with the enemy. Unfortunately, the same can often be said for plans for developing and supporting software. Things do not always go as planned, particularly in our highly dynamic and interdependent world. While starting the effort with a good set of software reliability techniques is important, monitoring results and then making appropriate changes are also a necessary part of the process. We live in a very dynamic world and need to get used to the fact that unexpected events will occur. We must continuously monitor and adapt while always trying to learn from events and see if we can do better next time. Managing for software reliability involves identifying and managing risks in an ever-changing environment.
There is no one set of techniques that is best for all software development efforts. The approach to software reliability should depend on the product, the software team, the company, and often the customer. This book therefore starts with a general understanding of defects that can affect software and what can be done about them and then progresses to more specific project areas. This book is also designed to be beneficial to a wide audience, including software developers and software maintainers, producers and users of software, and those working on software for government and for commercial customers. More on the importance and implications of software reliability may be found in [1–3].
1. Lyu M. Software reliability engineering: A roadmap. Future of software engineering, pp. 153–170, IEEE Computer Society, 2007. https://www.researchgate.net/publication/4250863_Software_Reliability_Engineering_A_Roadmap. Accessed 22 Aug 2020.
2. Musa J. Software reliability engineering: More reliable software faster and cheaper. AuthorHouse, 2004.
3. Neufelder A. Ensuring software reliability. Marcel Dekker, Inc., New York, 1993.
To prevent and control software defects, we need to understand them. This chapter explains the nature of software defects, including where they enter into the system, what effects they can have, how to detect them, and what causes them.
To reduce the number and impact of defects in our software, it is important to understand the nature of errors and defects. Almost any error on a project can affect the reliability of the software. Anything that makes it more difficult for project personnel to perform their tasks can negatively impact reliability, even if it does not directly result in placing a defect in the software code. A frustrated, angry, or confused programmer is more likely to make an error resulting in a software defect than a motivated, generally happy, and well-informed programmer. A poor work environment and a lack of good software development tools are examples of defect precursors. Defect precursors do not directly cause a software defect, but they make defects more likely and so are considerations for software reliability. Projects that produce high-quality software tend to be well-run projects. Not all errors or defect precursors result in defects, but reducing errors and precursors reduces the likelihood of defects. Similarly, not all defects produce software faults, and not all software faults result in software failures, but again, reducing them improves our chances of reliable software.
As we want to produce reliable software, our understanding of software defects needs to be tailored to that purpose. To this end, we consider the following:
Where defects enter the project system
Effects defects can have on the project system
How we can detect defects
What causes defects
How we can handle defects
The first four of these are addressed in Sections 2.1–2.4, while the fifth is covered in Chapter 3. Chapter 4 covers the material in more detail by addressing it for specific phases of a project.
Knowing where defects can enter a project system is important because we can use this information to design mechanisms to prevent or detect them. When we think of software defects, we typically think of specific types of errors, such as typographical errors, logical errors, synchronization errors, resource errors, or interface errors, to name just a few, and the software defects that may result from them. These types of errors are obviously important, and we must be able to handle them; however, defects affecting the software can enter a system in almost any phase and through almost anything used to design or produce the software product. Processes and products in one phase are used by later phases to produce the final product, so defects in an early phase may propagate to the final product.
In Chapter 4, we describe six phases that are typical for a project. They are as follows:
Concept Development and Planning
Requirements and Interfaces
Design and Coding
Integration, Verification, and Validation
Product Production and Release
Operation and Maintenance
We also consider management impacts. All of these use processes and produce products that create opportunities to introduce defects. Examples of potential defect sources include a poor understanding of customer needs, imprecise requirements, and not following good configuration control processes. The first two examples are typically from the Concept Development and Planning phase and the Requirements and Interfaces phase, respectively, while the last example can be from any phase. It is also important to realize that defects can be introduced into software that has a low defect density, but these defects may have very serious consequences. Also, correcting a detected defect or adding a feature to mature software may introduce defects. Chapter 4 takes each of these phases and describes it, outlining what defects are typical for each phase and how they can enter the project system. It describes techniques and processes to mitigate these defects and lists some metrics to help monitor progress in each phase.
Software defects manifest themselves in many ways, and understanding this helps us produce more reliable software. Of course, a defect may never manifest itself. For example, if the defective part of the code is never executed, the defect never causes a fault or failure. As we generally try not to write unused code, we will assume that defects have some likelihood of being executed.
We commonly think of software defects as causing software crashes, infinite loops, or incorrect software results. Crashes and infinite loops tend to be readily visible. Incorrect results may be obvious or may be subtle. Other types of defects, such as memory leaks, may manifest themselves even more subtly. Software defects, or “bugs,” are sometimes classified into two types:
Mandelbugs: A mandelbug is a software defect whose activation and subsequent behavior are so complex that the behavior appears chaotic. An example of a mandelbug is a type of defect jovially referred to as a “heisenbug.” Heisenbugs are altered by the attempts to find them. They may be affected by the timing of the execution, by the memory addresses used, by having debugging tools connected to the system, or by any of a large number of other factors. Once introduced into the software, heisenbugs, and mandelbugs in general, can be notoriously difficult to find.
Bohrbugs: A bohrbug is a software defect whose behavior is repeatable and predictable. Although the cause of the incorrect behavior may be unknown, they are repeatable if the right conditions are found and applied.
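As a toy illustration of the distinction, the Python sketch below contrasts a defect that fails the same way on every run with one whose result depends on hidden state left behind by earlier calls. The function names and the scenario are hypothetical and are not taken from the text. Real mandelbugs typically involve timing, memory layout, or interactions with the environment and are far harder to reproduce; the second function only mimics their dependence on history.

```python
def average(values):
    """Bohrbug-like defect: fails every time the input is empty."""
    return sum(values) / len(values)  # raises ZeroDivisionError for []


def record_event(event, log=[]):
    """History-dependent defect: the default list is shared across calls.

    The mutable default argument is created once, so the result of a call
    depends on every earlier call that relied on the default.
    """
    log.append(event)
    return log


def bohrbug_is_repeatable(attempts=3):
    """Return True if the empty-input failure occurs on every attempt."""
    failures = 0
    for _ in range(attempts):
        try:
            average([])
        except ZeroDivisionError:
            failures += 1
    return failures == attempts
```

Calling `bohrbug_is_repeatable()` exercises the first defect under identical conditions each time, which is what makes it easy to capture in a regression test. Two successive calls to `record_event` with one argument return lists of different lengths, so a test of a single call in isolation may pass while the same call fails later in a long-running system.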
Knowing about these various types of defects helps us plan, carry out, and analyze software tests. However, the possible existence of these subtle and hard-to-find defects is one of the reasons why we should not rely solely on software testing to detect defects. It also emphasizes that software testing can only show the existence of defects in software, not their absence. Ultimately, it supports the idea that we need to put an emphasis on defect prevention.
If the only defects that we consider are defects in the software, we are missing opportunities to prevent defects from being introduced into the project system. As previously mentioned, almost any error or defect can increase the likelihood of software defects. For example, a poorly worded requirement may be interpreted differently by different software developers. If two developers are writing different software modules affected by this requirement, the different interpretations may mean that these modules do not work together correctly. Furthermore, the effects may be subtle and difficult to find, meaning that the most cost-effective and schedule-effective way to deal with the defect is by ensuring that the requirements are as clear and precise as possible.
Finally, not all defect effects are equally important. Defects that never manifest themselves are less important than defects that cause critical failures. Improving the reliability of software involves focusing on the defects that are most likely to occur and also on the defects that have the most serious consequences if they do occur.
An effective and efficient software reliability effort requires well-thought-out defect detection and monitoring. Good defect detection and monitoring should:
Find errors and defects early when it is most cost-effective and schedule-effective to correct them.
Be as complete as practical, finding a high percentage of the errors and defects, and finding them in all processes and products that can significantly affect the software product.
Be reliable by not missing too many errors and defects while also not creating too many false alarms and the ensuing unproductive effort.
Be cost-efficient and schedule-efficient to perform.
Good defect detection and monitoring should also add confidence in the software and related products. It should provide evidence that it is working, and project personnel should be able to trust the detection and monitoring processes and execution enough that the results can be used as a part of the final sign-off of the software.
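The reliability criterion above (few missed defects, few false alarms) can be tracked with two simple ratios. The sketch below is one possible way to compute them; the function name, the argument names, and the idea of counting "missed" defects as those found later by other means are assumptions made for illustration, not definitions from the text.

```python
def detection_metrics(true_found, missed, false_alarms):
    """Summarize how reliable a detection activity was.

    true_found:   reported items that were real defects
    missed:       real defects that escaped and were found later by other means
    false_alarms: reported items that turned out not to be defects
    Returns (detection_rate, false_alarm_share); a value is None when its
    denominator is zero.
    """
    real_defects = true_found + missed
    reports = true_found + false_alarms
    detection_rate = true_found / real_defects if real_defects else None
    false_alarm_share = false_alarms / reports if reports else None
    return detection_rate, false_alarm_share


# Hypothetical counts for one inspection activity.
rate, share = detection_metrics(true_found=18, missed=2, false_alarms=5)
print(f"detection rate {rate:.2f}, false-alarm share {share:.2f}")
```

A falling detection rate suggests the activity is missing too much, while a rising false-alarm share suggests it is generating unproductive effort. Either trend is a reason to review the detection process.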
Recognizing defect precursors is critical for preventing and removing defects efficiently. For example, knowing that a software defect may be due to a requirement defect informs us that we need to detect requirement issues and therefore institute appropriate processes for doing this. Process and product monitoring is important at each phase of the project, and Chapter 4 covers each in more detail.
There are many ways to identify an error, defect precursor, or defect. Some ways identify weaknesses or problems with the processes that produce a product and others identify issues with a product. Techniques to detect process defects and weakness include the following:
Use a process failure modes and effects analysis (FMEA)/failure modes, effects, and criticality analysis (FMECA).
Use process reviews, inspections, and independent assessors.
Use error brainstorming sessions. Those responsible for a task brainstorm on what errors could occur while performing the task. The list can be used to develop checklists for the errors, and the brainstorming process sensitizes the task performers to the errors.
Use a software reliability advocate to continually assess project processes for potential software reliability impacts.
Perform a premortem on the process to anticipate process defects.
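To make the process FMEA item concrete, the sketch below ranks process failure modes by the risk priority number (RPN) commonly used in FMEA, the product of severity, occurrence, and detection ratings. The class and function names, the 1 to 10 rating scales, and the example ratings are illustrative assumptions; a project would substitute its own scales and entries.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    process_step: str
    failure_mode: str
    severity: int    # 1 (negligible) to 10 (critical)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (almost always caught) to 10 (rarely caught)

    @property
    def rpn(self):
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical entries and ratings, for illustration only.
modes = [
    FailureMode("Product release", "Wrong configuration released", 9, 3, 4),
    FailureMode("Requirements", "Placeholder left unmarked", 7, 4, 6),
    FailureMode("Coding", "Review step skipped under schedule pressure", 6, 5, 3),
]
for m in rank_by_rpn(modes):
    print(m.rpn, m.process_step, "-", m.failure_mode)
```

The ranking is only a prioritization aid. A failure mode with a critical severity rating may deserve attention even when its RPN is modest, which is the kind of consideration a criticality analysis (FMECA) adds.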
Some techniques to detect defects in products are as follows:
Use product peer reviews, inspections, and independent assessors.
As with process defects, we can use error brainstorming and premortem sessions for the product.
Perform tests of code.
Use checklists of process steps to ensure that each step is followed when producing the product.
As with process defects, use a Software Reliability Advocate to continually assess project products for potential software reliability impacts.
Use a software reliability casebook to assess if all processes are correctly followed and if not to push for corrections and improvements.
Use requirements traceability analysis of a specification as a means of detecting potential requirement defects.
Also for requirements, let several people independently assess what would constitute verification of a specific requirement. Make the assessments specific enough that if certain criteria are met, the requirement passes, and if they are not met, it fails. Failure to agree on these criteria indicates the potential for confusion and for an inconsistent use of the requirement.
While detection of defects is important, we ultimately want to anticipate the chain of events that results in a defect and use this information to prevent the defect. Ideally, we prevent the first precursor, but realistically, we should also monitor for most if not all of the known precursors in the chain. We should also use “triggers.” These are indicators that additional action is required for a monitored event. These triggers may at times be subjective, but early intervention increases the likelihood that a problem will be contained and will not spread damage to later phases of the project, where it is increasingly difficult to handle. Chapter 4 lists metrics and monitoring activities applicable to each phase.
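Where a trigger can be expressed as a numeric limit, checking it can be automated. The following is a minimal sketch under that assumption; the metric names and limits are hypothetical, and subjective triggers would still need human judgment.

```python
def check_triggers(metrics, thresholds):
    """Return the names of monitored metrics whose value exceeds its limit.

    metrics:    mapping of metric name to its current value
    thresholds: mapping of metric name to the limit that triggers action
    A metric with no reported value is treated as 0.
    """
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0) > limit
    )


# Hypothetical metrics and limits, for illustration only.
thresholds = {
    "unresolved_requirement_placeholders": 0,
    "overdue_training_completions": 2,
    "open_review_findings": 10,
}
current = {
    "unresolved_requirement_placeholders": 1,
    "overdue_training_completions": 1,
    "open_review_findings": 14,
}
print(check_triggers(current, thresholds))
```

Each name returned identifies a monitored event for which the plan should already state what additional action to take.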
Finally, the project should continuously assess how effective its defect detection processes are and always try to improve them. Avoid change for the sake of change as project changes can be disruptive. However, monitor the effectiveness of the detection and be willing to change a process if there is reason to believe that it will make a significant improvement.
The next section considers causes of defects. Knowing defect causes helps us prevent and remove defects. It also enables us to monitor events that trigger the creation of defects and therefore potentially detect defects earlier. For example, a defect may be caused by not following the processes used to create requirements, and not following a process may be caused by inadequate training. This information tells us that we should use skilled requirements developers or institute adequate training for requirements development and that we should also monitor training completions and adequacy.
To prevent or eliminate a defect, it is important to know the causes of the defect. Knowing defect causes helps us predict defects, reduce their likelihood, and manage resources more efficiently. This strategy is analogous to the use of “Physics of Failure” techniques for hardware reliability. Defects usually have a causal chain, a sequence of events that ultimately results in the given defect. In this chain of causes, it may be that only a few of the causes are readily detectable. To choose the best place and approach to correct the problem, we need to understand this chain. It is also important to know that there may be more than one causal chain for a given defect, i.e. the confluence of two or more such chains results in the defect.
Consistent with the idea of causal chains, we distinguish between primary and secondary causes of defects. For purposes of this book, a primary cause of a defect is a root cause of the defect. Successfully addressing a primary cause not only addresses the specific instance of the defect in question but also prevents other similar defects from occurring and therefore improves the running of the entire project. Addressing a secondary cause may remove the current defect and may in some cases prevent other similar defects, but it does not address the more fundamental cause of the problem and therefore risks problem reoccurrence.
Examples of secondary causes include inadequate project objectives, unclear requirements, and excessively complex software code. Each of these causes provides useful information but is not the root cause of the defect. For each, we can constructively ask for additional information. For example, a requirement may be unclear causing unintended behavior from the resulting software code. The requirement can be clarified, and the code can be changed to address the clarified version of the requirement, thereby eliminating the defect. However, we need to ask if there is a way to prevent or reduce the likelihood of unclear requirements. We should ask what caused the unclear requirement and how we can improve the way that we produce requirements. Secondary causes are useful for helping us detect and analyze defects. They are covered for specific project phases in Chapter 4, but we need to understand defects at a deeper level to more effectively prevent or remove them.
Root cause analysis is the process of finding the primary cause of a defect or problem. At a high level, root cause analysis usually follows steps similar to the following taken from [1]:
Identify the problem.
Determine the significance of the problem.
Identify the causes (conditions or actions) directly preceding and surrounding the problem.
Identify the reasons why the causes in the previous step exist and work backward to the root cause.
A critical part of this analysis is to systematically work our way back to the root cause, and there are various techniques that can be used in this process. Several are listed below:
Five whys
Fault tree analysis (FTA)
Fishbone diagrams (cause/effect or Ishikawa diagrams)
Scatter plots and correlation analysis: These can be used to determine if two factors correlate with one another and aid in finding a causal relation.
FMEA/FMECA
Event and causal factor analysis
Barrier analysis
Change analysis
Human performance evaluation
See the topic on root cause analysis in Section 6.5 for more on these and other root cause analysis techniques.
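As a small illustration of the correlation analysis item in the list above, the sketch below computes the Pearson correlation coefficient between two project measurements. The data values and variable names are invented for the example. A strong correlation is only a pointer for further investigation and does not by itself establish a causal relation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-module data: requirement changes vs. defects found.
requirement_changes = [1, 3, 4, 6, 8]
defects_found = [2, 3, 6, 7, 11]
print(round(pearson(requirement_changes, defects_found), 2))
```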
In finding root causes, it can be useful knowing the categories applicable to most defects. Although the following list is not necessarily complete, most defects in software production or monitoring can be traced to these high-level issues:
Not producing or monitoring the right things: For example, we may have a software project with a significant number of interfaces, but we are not producing any interface documentation to specify them.
Poor processes for producing or monitoring a product or process: An example of this type of issue is having a product release process that does not ensure correct configuration control of the product, potentially resulting in the wrong product being released.
Not following the processes: This issue could occur if the process for creating software code is adequate, but because of schedule pressures and staffing issues, certain steps are not performed.
Following the process or monitoring poorly: With this issue, we use the procedure but do it poorly or intermittently. An example of this issue is having an adequate process for creating software code but using an inexperienced software developer who is unfamiliar with the process or is unable to follow the steps properly.
Non-human factors: The first four of these categories of failures are largely due to human errors, and humans make mistakes in spite of excellent processes, resources, ability, and training. However, some errors cannot reasonably be attributed to human error. For example, externally imposed constraints may make defects more likely. A sudden change in legal requirements may require a project change that negatively impacts software reliability. Errors from this type of situation may appear at almost any time in any process or product.
As stated above, the first four of these categories of failures are largely due to human errors. Causes of human errors include the following:
Insufficient knowledge: This type of error is due to one or more task performers not knowing or not having access to relevant information.
Cognitive failure: A cognitive error occurs when a task performer is unable to correctly process the required task information.
Lack of needed skills: As the name indicates, this error is due to a task performer not having the correct skill set to perform the task.
Attention failure: Attention failures occur because of carelessness or loss of focus and a task that otherwise would be performed correctly is adversely affected.
Overload: Overload is due to too much work or too much multitasking.
Contradictory tasks: Sometimes, a task performer is assigned tasks or conditions that cannot all be satisfied, such as writing software code for contradictory requirements.
Lack of motivation: Lack of motivation is typically due to a lack of interest or a “bad attitude” and can be exacerbated by a poor work environment.
Misunderstanding: Poor communication between two or more employees can result in misunderstandings.
Returning to our example of a defect caused by an unclear requirement, suppose that we have traced the original problem to that requirement, giving us a secondary cause. With further analysis, we find that the requirement is unclear because there was a misunderstanding of who was responsible for the requirement and a “placeholder” was put into the specification until the issue was resolved. The issue was forgotten because nothing identified the requirement as a placeholder. At this point, we need to know if this cause of confusion is an isolated incident or more systemic. If it is systemic, we may have other major issues to address. We also need to address how a “placeholder” mistakenly became a requirement and why it was forgotten. These mistakes could be due to the requirements processes not addressing placeholders, to someone not following the procedure, or perhaps to other issues. With further analysis, we find that the process does address the placeholder situation, and the person who performed it incorrectly was temporarily assigned to the project to relieve a budget issue on a different project and had not been trained for the task. This information enables us to direct our efforts toward the root cause of the problem. For example, we could add training for temporarily assigned personnel or, if this is impractical because of schedule constraints, add monitoring of the products produced by these personnel.
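The placeholder example can be written down as a causal chain and walked back to its root, in the spirit of the "five whys." The sketch below is a simplified illustration: the wording of each cause is paraphrased from the example, the names are hypothetical, and it assumes a single chain even though a real defect may have several converging chains.

```python
# Each key is an observed effect; its value is the cause found by asking "why?"
CAUSES = {
    "software behaves incorrectly": "requirement is unclear",
    "requirement is unclear": "placeholder was left in the specification",
    "placeholder was left in the specification":
        "placeholder process was not followed",
    "placeholder process was not followed":
        "temporarily assigned person was not trained",
}


def trace_to_root(problem, causes):
    """Follow cause links from a problem until no further cause is recorded."""
    chain = [problem]
    seen = {problem}
    while chain[-1] in causes:
        next_cause = causes[chain[-1]]
        if next_cause in seen:  # guard against a circular record
            break
        chain.append(next_cause)
        seen.add(next_cause)
    return chain


for depth, cause in enumerate(trace_to_root("software behaves incorrectly", CAUSES)):
    print("  " * depth + cause)
```

The last entry of the returned chain is the candidate primary cause. The intermediate entries are secondary causes, which remain useful as points where the problem can be detected.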
Causes often suggest possible mitigations that may prevent or reduce the likelihood that such defects will occur in the future. For example, if a defect is caused by someone not having the right skill set, the person could be trained, supported by another employee with a stronger skill set, or moved to tasks better suited to the employee's current skills. Non-human errors need to be addressed on a case-by-case basis. Chapter 3 covers mitigation of defects at a high level, while Chapter 4 covers mitigation techniques and processes in more detail. Chapter 29 of [2] contains more information on human errors and reliability. See [3] and [4] for more details on the causes of software defects.
As a final note, knowing the causes of defects enables us to better predict defects and plan ways to avoid creating them. In Chapter 3, we cover planning the steps and processes needed to achieve our software reliability objectives within the given resources. Part of this plan is to create a list of what can go wrong and to use this list to institute ways of preventing or detecting these problems. A sound knowledge of potential root causes can therefore help us prevent defects in a timely and cost-effective manner.
1. DOE. DOE-NE-STD-1004-92, root cause analysis, 1992. Available via http://everyspec.com/DOE/DOE-PUBS/DOE_NE_STD_1004_92_262/. Accessed 22 Aug 2020.
2. H. Pham, editor. Handbook of reliability engineering. Springer-Verlag, London, 2003.
3. Neufelder A. Ensuring software reliability. Marcel Dekker, Inc., New York, 1993.
4. Musa J. Software reliability engineering: More reliable software faster and cheaper. AuthorHouse, 2004.
To produce reliable software under cost and schedule constraints, we need to carefully plan our project activities and ensure that the plan is implementable by the team put together to do the tasks. This chapter outlines how to develop an overall strategy for software reliability. It then covers the nature of our software reliability objectives and provides details on how to plan the project to build reliability into the software with each project activity. We also discuss how to make the plan implementable. Finally, we discuss analogies between hardware reliability and software reliability engineering (SRE). As most practitioners are more familiar with hardware reliability, it is hoped that these analogies will help them better understand and more effectively implement software reliability practices.
In Chapter 2, we learn about errors, defect precursors, and defects, and in this chapter, we use this information to construct processes to handle these defects and to produce reliable software. To handle defects, we use four complementary approaches:
Prevent errors and defects by anticipating likely causes and providing mitigations.
Remove defects by monitoring and detecting defects and errors, preferably early when removal is more cost- and schedule-effective.
Design the system to be fault tolerant to reduce the impact of defects that are in the system.
Forecast defects and faults to manage project resources and to gain confidence in the reliability of the product.
Producing highly reliable software within project constraints requires clear goals, careful planning, and good execution. A standard overall process for achieving almost any goal is as shown in the accompanying Figure 3.1. This process is closely related to the “Plan–Do–Check–Act” Deming cycle. In more detail, this overall or high-level process consists of the following steps:
Figure 3.1 Overall Process.
Determine objectives: Decide on reasonable objectives for software reliability consistent with the needs of the customer and the project resources.
Plan: Determine the steps and processes needed to achieve these objectives within the given resources.
Implementation and monitoring: Decide how to perform the plan and what the signs of success and of trouble are.
Feedback: Decide when feedback indicates that changes are needed, and if so indicated, determine what changes are appropriate and when to make them.
The next sections consider each of these steps in more detail.
At a high level, a project wants to produce a profitable product with a high level of customer satisfaction. Both customer satisfaction and product profitability relate to software reliability. Not only is high software reliability important, we also need to have some level of assurance of its reliability. Expanding on this, we consider two main objectives for software reliability along with typical sub-objectives for each:
Objective 1: Create a highly reliable software product on schedule and within budget constraints. This objective can be further broken down into the following:
Sub-objective 1a: Prevent defects from entering into the product.
Sub-objective 1b: If a defect is in the product, design the product to perform adequately in spite of the defect.
Sub-objective 1c: If a defect is introduced into the product, find and remove it as soon and as economically as possible.
Objective 2: Know with a high level of assurance that the software is sufficiently reliable. The sub-objectives include the following:
Sub-objective 2a: Determine metrics and criteria for assessing the reliability of the software product.
Sub-objective 2b: Design methods to collect and analyze the information required to make the assessment.
Sub-objective 2c: Monitor the product, processes, and implementation of the processes to determine if the software is at risk of not being sufficiently reliable.
Ultimately, we want a satisfied customer; therefore, customer inputs and feedback throughout the project, but particularly with project objectives, can prove highly beneficial. Also, determining objectives is typically an iterative process. As the project progresses, objectives should be made more precise. Ideally, we have quantitative objectives that can be monitored and used to indicate when we are on track and when corrective actions are needed. However, we should consider our objectives carefully. Our objectives guide our plan and therefore our implementation of the plan. These can suffer if the objectives are not clear, motivated, and well accepted.
After determining our objectives, we need a plan to coordinate the efforts used to achieve them. This plan should provide guidance on the following:
How to prevent or reduce the impact of each type of anticipated error or defect affecting the software.
How to monitor for errors and defects, anticipated or not, in products, in process compliance, and in process effectiveness.
How to determine when the monitoring should trigger some form of action and what that action should be.
When and how to perform root cause analysis and how to use the analysis results and other monitoring information to make changes to the products, processes, and implementation of the processes.
The typical steps for designing the software reliability activities for a project are as follows:
Step 1: List steps that the project will perform to produce the software product.
Step 2: List what can go wrong in each of these steps.
Step 3: List how we can prevent these defects and errors or at least significantly reduce their likelihood and impact.
Step 4: List ways that we can quickly know if something goes wrong, i.e. list what monitoring is needed.
Step 5: List when the information from the monitoring indicates that we should do something different and what it should be.
Step 6: List how we will know if our processes and corrective actions are effective, and if they are not, list what we should do. We need to know how confident we can justifiably be in our product.
We elaborate on these steps below. It is important to note, however, that our plan and how we implement it depend on our objectives, on the staff and resources available to implement the plan, and on the nature of the project. For example, we should consider the chosen software development process, such as spiral, incremental, or cleanroom software development, when planning these activities. The results of following these steps can then form the basis of the Software Reliability Program Plan (SRPP).
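One way to keep the results of the six steps together is a simple record per project step, with a check for entries that are still incomplete. The sketch below is only an illustration of that idea; the class name, the field names, and the example content are assumptions and do not represent a prescribed SRPP format.

```python
from dataclasses import dataclass, field

# Field names for Steps 2 to 6; Step 1 is the project step itself.
PLAN_FIELDS = (
    "what_can_go_wrong",     # Step 2
    "prevention",            # Step 3
    "monitoring",            # Step 4
    "triggers_and_actions",  # Step 5
    "effectiveness_checks",  # Step 6
)


@dataclass
class PlanEntry:
    project_step: str  # Step 1
    what_can_go_wrong: list = field(default_factory=list)
    prevention: list = field(default_factory=list)
    monitoring: list = field(default_factory=list)
    triggers_and_actions: list = field(default_factory=list)
    effectiveness_checks: list = field(default_factory=list)


def find_gaps(entries):
    """Map each project step to the planning fields that are still empty."""
    gaps = {}
    for entry in entries:
        missing = [name for name in PLAN_FIELDS if not getattr(entry, name)]
        if missing:
            gaps[entry.project_step] = missing
    return gaps


# Hypothetical, partially completed plan entry.
plan = [
    PlanEntry(
        project_step="Write requirements",
        what_can_go_wrong=["Placeholder treated as a requirement"],
        prevention=["Mark and track all placeholders"],
        monitoring=["Count unresolved placeholders at each review"],
    ),
]
print(find_gaps(plan))
```

Running the gap check before the plan is finalized highlights steps for which triggers or effectiveness checks have not yet been defined.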
Step 1: List steps to produce the product: To effectively reduce the impact of software defects, we need to identify where defects can be created, and this means understanding the processes and products used to produce and maintain the software. As a result, this first step requires that we create a list of project phases and the products that each produces. The list should also include the processes used to produce the products and where these processes are documented. If there are important processes that are not documented, the plan should note this and encourage the project to suitably document them.
Section 2.1 addresses this need at a high level. Chapter 4