Fault-Tolerance Techniques for Spacecraft Control Computers

Mengfei Yang

Description

Comprehensive coverage of all aspects of space-application-oriented fault-tolerance techniques. Written by an experienced expert who has worked on fault tolerance for the Chinese space program for almost three decades. Provides the first systematic text on cutting-edge fault-tolerance techniques for spacecraft control computers, with emphasis on practical engineering knowledge. Presents fundamental and advanced theories and technologies in a logical and easy-to-understand manner. Beneficial to readers inside and outside the area of space applications.

Page count: 545

Publication year: 2017




Table of Contents

Cover

Title Page

Brief Introduction

Preface

1 Introduction

1.1 Fundamental Concepts and Principles of Fault‐tolerance Techniques

1.2 The Space Environment and Its Hazards for the Spacecraft Control Computer

1.3 Development Status and Prospects of Fault Tolerance Techniques

References

2 Fault‐Tolerance Architectures and Key Techniques

2.1 Fault‐tolerance Architecture

2.2 Synchronization Techniques

2.3 Fault‐tolerance Design with Hardware Redundancy

References

3 Fault Detection Techniques

3.1 Fault Model

3.2 Fault Detection Techniques

References

4 Bus Techniques

4.1 Introduction to Space‐borne Bus

4.2 The MIL‐STD‐1553B Bus

4.3 The CAN Bus

4.4 The SpaceWire Bus

4.5 Other Buses

References

5 Software Fault‐Tolerance Techniques

5.1 Software Fault‐tolerance Concepts and Principles

5.2 Single‐version Software Fault‐tolerance Techniques

5.3 Multiple‐version Software Fault‐tolerance Techniques

5.4 Data Diversity Based Software Fault‐tolerance Techniques

References

6 Fault‐Tolerance Techniques for FPGA

6.1 Effect of the Space Environment on FPGAs

6.2 Fault Modes of SRAM‐based FPGAs

6.3 Fault‐tolerance Techniques for SRAM‐based FPGAs

6.4 Typical Fault‐tolerance Design of SRAM‐based FPGA

6.5 Fault‐tolerance Techniques of Anti‐fuse Based FPGA

References

7 Fault‐Injection Techniques

7.1 Basic Concepts

7.2 Classification of Fault‐injection Techniques

7.3 Fault‐injection System Evaluation and Application

7.4 Fault‐injection Platform and Tools

References

8 Intelligent Fault‐Tolerance Techniques

8.1 Evolvable Hardware Fault‐tolerance

8.2 Artificial Immune Hardware Fault‐tolerance

References

Acronyms

Index

End User License Agreement

List of Tables

Chapter 02

Table 2.1 Design reference by signal type.

Table 2.2 IPU elements and application references.

Table 2.3 Application scope and the advantages and disadvantages of a power supply isolation protection circuit.

Chapter 05

Table 5.1 Backward recovery vs. forward recovery.

Table 5.2 Classification of software fault‐tolerance techniques.

Chapter 06

Table 6.1 Analysis of fault modes of SRAM‐based FPGA.

Table 6.2 SRAM‐based FPGA fault‐tolerance techniques.

Table 6.3 Maximum data bandwidth for various configuration modes.

Table 6.4 ICAP read‐back operation commands.

Table 6.5 Description of Frame_ECC port definition.

Chapter 07

Table 7.1 Ground simulation test accelerators for SEEs.

Table 7.2 Various fault‐injection approaches.

Table 7.3 Various typical fault‐injection tools.

Chapter 08

Table 8.1 Basic structure, manufacturing technique, and applicable evolution methods of various PLDs.

Table 8.2 Comparison of evolution methods in various hardware layers.

Table 8.3 Relationship of immune system to traditional fault‐tolerance.

Table 8.4 Foundation of an organism’s immune system.

Table 8.5 Mapping relationship between biological immune system and artificial immune system.

Table 8.6 Mapping relationship between biological immune system and hardware immune system.

Table 8.7 Mapping relationship between FPGA structure and an organism’s structure.

Table 8.8 Responsibility and function of artificial immune system applicable to FPGA fault‐tolerance.

Table 8.9 Function of artificial immune system hardware component based on the SoPC technique.

Table 8.10 Execution times of immune algorithm critical functions.

List of Illustrations

Chapter 01

Figure 1.1 Fault categorization.

Figure 1.2 Serial connection model.

Figure 1.3 Parallel connection model.

Figure 1.4 The r/n(G) model.

Figure 1.5 Magnetic layer of the earth and radiation.

Figure 1.6 Energy spectrum of protons.

Figure 1.7 Electron distribution above the equator.

Figure 1.8 Space radiation environment.

Figure 1.9 PNPN component: (a) NOT gate. (b) Equivalent circuit.

Figure 1.10 Physical essence of SEU damage.

Figure 1.11 MOSFET parasitic BJT structure that leads to SEB.

Figure 1.12 Physical mechanism of SEGR damage.

Chapter 02

Figure 2.1 Structure of module‐level redundancy.

Figure 2.2 Organization of modules in the control computer.

Figure 2.3 Reliability model of the control computer.

Figure 2.4 Dual‐computer cold‐backup fault‐tolerance structure.

Figure 2.5 Dual‐computer hot‐backup FT system structure.

Figure 2.6 TMT fault‐tolerance structure.

Figure 2.7 Truth table for the hardware voter.

Figure 2.8 TMR/S Tri‐computer FT structure.

Figure 2.9 Schematic diagram of the phase‐locked loop (PLL) clock circuit.

Figure 2.10 Clock module receiver circuit.

Figure 2.11 A clock module in the average algorithm.

Figure 2.12 Tri‐computer system data exchange.

Figure 2.13 Fault‐tolerance processing procedure.

Figure 2.14 Fault‐tolerance processing procedure with waiting time inserted.

Figure 2.15 Data input structure of a multi‐computer system.

Figure 2.16 Sequence of the computers.

Figure 2.17 Tri‐computer execution scenario.

Figure 2.18 Common logic model of redundancy design.

Figure 2.19 Example of system fault caused by FDMU.

Figure 2.20 The alienation principle among the BFUs.

Figure 2.21 Dual‐point‐dual‐line redundancy design.

Figure 2.22 RAS demonstration.

Figure 2.23 Satellite temperature measure redundancy circuits.

Figure 2.24 Demonstration of sneak circuit in the redundancy design.

Chapter 03

Figure 3.1 Internal structure of the 80C86 CPU.

Figure 3.2 Blocked storage space structure.

Figure 3.3 Typical memory test process.

Figure 3.4 I/O test structure.

Chapter 04

Figure 4.1 Structure of the 1553B Bus.

Figure 4.2 1553B Bus system function module.

Figure 4.3 System level fault distribution structure of the 1553B bus.

Figure 4.4 Bus‐level structure.

Figure 4.5 Terminal bus interface.

Figure 4.6 Faulty data internal influence.

Figure 4.7 Longest message on the 1553B bus.

Figure 4.8 Bus controller dual modular redundancy.

Figure 4.9 N‐modular redundancy of the bus controller.

Figure 4.10 Distributed redundancy mode of the bus controller.

Figure 4.11 Structure of the control computer system of a spacecraft.

Figure 4.12 Node relation in CAN networking.

Figure 4.13 CAN Protocol Layer and OSI.

Figure 4.14 Node structure of the CAN bus.

Figure 4.15 Structure of the CAN bus controller.

Figure 4.16 Nine fault scenarios in the CAN bus physical layer.

Figure 4.17 CAN bus communication process.

Figure 4.18 CAN bus coding.

Figure 4.19 CAN bus frame formats.

Figure 4.20 Relationship between the three fault modes.

Figure 4.21 CAN bus structure used on a space mission.

Figure 4.22 LVDS signal transmission voltage.

Figure 4.23 Operating principle of LVDS.

Figure 4.24 Data filter coding.

Figure 4.25 SpaceWire data character.

Figure 4.26 SpaceWire control character and control code.

Figure 4.27 Coverage area of SpaceWire parity check.

Figure 4.28 SpaceWire encoder/decoder state machine.

Figure 4.29 SpaceWire communication link initialization procedure.

Figure 4.30 Header deletion in SpaceWire.

Figure 4.31 Integration solution for an onboard SpaceWire electrical system.

Figure 4.32 Address mapping of node internal addressing.

Figure 4.33 Relationship between layers of the IEEE 1394 protocol.

Figure 4.34 Application of the IEEE 1394 bus in the telemetry acquisition system of JPLX2000.

Figure 4.35 Structure of the cabin electrical system in the ESA Columbus.

Figure 4.36 Electrical system network of the NASA X2000 program based on the I²C bus.

Chapter 05

Figure 5.1 Backward recovery.

Figure 5.2 Forward recovery using redundant processes.

Figure 5.3 Checkpoint and restart.

Figure 5.4 Example process used in SIHFT.

Figure 5.5 Workflow of CFCSS.

Figure 5.6 Runtime adjusting signature, D.

Figure 5.7 Recovery block structure and operation.

Figure 5.8 Recovery block structure with two alternates.

Figure 5.9 N‐version programming structure.

Figure 5.10 Distributed recovery block structure.

Figure 5.11 A fault‐free PSP task execution cycle.

Figure 5.12 A PSP station task execution cycle involving a failure.

Figure 5.13 NSCP using acceptance tests (ATs).

Figure 5.14 NSCP using comparison.

Figure 5.15 Consensus recovery block structure and operation.

Figure 5.16 Acceptance voting technique structure and operation.

Figure 5.17 Retry block structure and operation.

Figure 5.18 N‐copy programming structure.

Figure 5.19 Two‐pass adjudicator structure and operation.

Chapter 06

Figure 6.1 Radiation effect induced by charged particle bombardment.

Figure 6.2 SET of combinational logic and SEU of sequential logic.

Figure 6.3 SRAM configuration memory cell of an FPGA.

Figure 6.4 SEE induced multiple bits upset.

Figure 6.5 Internal structure of the Virtex FPGA.

Figure 6.6 Internal configuration information storage memory unit. [M] denotes configuration storage unit inside FPGA.

Figure 6.7 Upset of sequential and combinational logic.

Figure 6.8 Half‐latch structure of the Virtex series FPGA.

Figure 6.9 Structure of the traditional TMR.

Figure 6.10 TMR functional module.

Figure 6.11 Output of TMR.

Figure 6.12 TMR protection for block RAM.

Figure 6.13 TMR fault‐tolerance method to protect sequential logic against SEU.

Figure 6.14 Avoiding SET occurrence in combinational logic and sequential logic with the TMR fault‐tolerance method.

Figure 6.15 EDAC information redundancy protection solution.

Figure 6.16 Hamming EDAC system diagram.

Figure 6.17 Structure of Hamming encoder/decoder.

Figure 6.18 Fault detection based on DMR and fault isolation based on tristate gate.

Figure 6.19 Structure of the HWICP module.

Figure 6.20 ICAP interface configuration read‐back, verification, and reconfiguration.

Figure 6.21 Format of an I type packet.

Figure 6.22 Format of an II type packet.

Figure 6.23 Virtex 4 Frame_ECC module.

Figure 6.24 Design process based on ICAP configuration read‐back + RS fault‐tolerant coding.

Figure 6.25 RS encoding process.

Figure 6.26 RS encoder realization based on FPGA.

Figure 6.27 RS decoding process.

Figure 6.28 Module‐level dynamic reconfiguration fault recovery.

Figure 6.29 Synthesis realization of top layer module and sub‐modules.

Figure 6.30 Configuration file creation process.

Figure 6.31 Design hierarchy and reconfiguration area layout.

Figure 6.32 Physical module division of the PlanAhead module.

Figure 6.33 Hardware checkpoint setup process.

Figure 6.34 Structure of FPGA Capture module.

Figure 6.35 ICAP automatic scrubbing, read‐back, and reconfiguration prototype system.

Figure 6.36 Module position constraint in PlanAhead.

Figure 6.37 Signal observation and control in ChipScope.

Figure 6.38 LEON3 microprocessor configurable architecture.

Figure 6.39 LEON3 microprocessor architecture.

Figure 6.40 Dynamic reconfiguration time verification circuit. (a) Combinational function verification circuit structure (b) Sequential function verification circuit structure.

Figure 6.41 Structure of Actel FPGA. (a) Combinational ACTI (C‐cell) and sequential ACTI (R‐cell). (b) Detailed layout of C‐cell. (c) R‐cell: description of latch.

Figure 6.42 Memory unit internal TMR of radiation resistant FX or AX structured Actel FPGA.

Chapter 07

Figure 7.1 Fault‐injection procedures.

Figure 7.2 Classification of fault‐injection technology.

Figure 7.3 Example of injection validity.

Figure 7.4 Fault‐injection platform in the EDA environment.

Figure 7.5 Fault‐injection system functional modules.

Figure 7.6 Data flow in the fault‐injection system.

Figure 7.7 Uncapped FPGA.

Figure 7.8 Uncapped 1553B interface chip.

Figure 7.9 Test system environment.

Figure 7.10 Single‐particle test principle schematic diagram.

Chapter 08

Figure 8.1 Evolvable hardware fault‐tolerance implementation flow.

Figure 8.2 Evolvable hardware fault‐tolerance implementation methods.

Figure 8.3 Evolution‐based fault‐tolerance process in a PLD.

Figure 8.4 Standard genetic algorithm flowchart.

Figure 8.5 Two‐point crossover process.

Figure 8.6 ROM structure.

Figure 8.7 PAL basic structure.

Figure 8.8 Internal structure of a Xilinx FPGA.

Figure 8.9 Virtex serial FPGAs with two‐slice CLB structure.

Figure 8.10 A slice inside a Virtex series FPGA.

Figure 8.11 VRC structure adopted by Sekanina et al.

Figure 8.12 Genotype and phenotype.

Figure 8.13 Relationship between chromosome code, configuration data, and circuit function in hardware evolution system.

Figure 8.14 Hardware evolution system model and implementation. (a) Hardware evolution system model. (b) Hardware evolution system implementation.

Figure 8.15 Global dynamic reconfigurable system.

Figure 8.16 Partial dynamic reconfigurable system.

Figure 8.17 Evolvable hardware fault‐tolerance structure with FPGA implementation.

Figure 8.18 Hardware evolution implementation methods. (a) Extrinsic evolution. (b) Intrinsic evolution implemented with host computer. (c) Intrinsic evolution implemented with embedded system.

Figure 8.19 Evolution methods in various hardware layers.

Figure 8.20 Implementation methods for various hardware layers’ evolution. (a) Bit stream‐level evolution. (b) Netlist‐level evolution. (c) Design‐level evolution. (d) VRC‐level evolution.

Figure 8.21 Internal structure of a PAL.

Figure 8.22 Evolutional hardware fault‐tolerant capability of a simple PLD. (a) Point A is normal. (b) There is a stuck‐at‐0 fault in point A.

Figure 8.23 Internal evolution realized through implementation of JBits on FPGA.

Figure 8.24 T cell screening and maturing process.

Figure 8.25 Proliferation and differentiation after cell combination with antigens of high affinity.

Figure 8.26 Fast immune response in which T cells participate.

Figure 8.27 Flowchart of standard negative selection algorithm.

Figure 8.28 System structure of artificial immune system.

Figure 8.29 Flowchart of the immune system.

Figure 8.30 FSM of immune hardware.

Figure 8.31 Diagram of artificial immune fault‐tolerance system.

Figure 8.32 FPGA fault‐tolerant system hierarchy based on artificial immune system.

Figure 8.33 Structure of a basic unit in eCell.

Figure 8.34 eCell Structure.

Figure 8.35 Immune control system based on human adaptive immunity model.

Figure 8.36 eCell state transfer string encoding. Key: clk: clock signal; input: eCell input; present: eCell present state user logic output; past: eCell user logic former state output; trans: eCell state transfer string created through the assembly of circuit state acquisition module.

Figure 8.37 Flowchart of the standard PSA.

Figure 8.38 Procedure utilized in the learning phase.

Figure 8.39 Procedure utilized in the fault detection phase.

Figure 8.40 Fault recovery phase procedure.

Figure 8.41 eCell hardware design principle diagram.

Figure 8.42 Artificial immune system hardware structure based on the SoPC technique.

Figure 8.43 Simplified selection algorithm procedure. (a) Construction of PSA detector. (b) Fault detection and recovery of PSA.

Figure 8.44 Flowchart for eCell configuration software.


Fault‐Tolerance Techniques for Spacecraft Control Computers

Mengfei Yang, Gengxin Hua, Yanjun Feng, Jian Gong

This edition first published 2017

© 2017 National Defense Industry Press. All rights reserved.

Published by John Wiley & Sons Singapore Pte. Ltd., 1 Fusionopolis Walk, #07‐01 Solaris South Tower, Singapore 138628, under exclusive license granted by National Defense Industry Press for all media and languages excluding Simplified and Traditional Chinese and throughout the world excluding Mainland China, and with non‐exclusive license for electronic versions in Mainland China.

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as expressly permitted by law, without either the prior written permission of the Publisher, or authorization through payment of the appropriate photocopy fee to the Copyright Clearance Center. Requests for permission should be addressed to the Publisher, John Wiley & Sons Singapore Pte. Ltd., 1 Fusionopolis Walk, #07‐01 Solaris South Tower, Singapore 138628, tel: 65‐66438000, fax: 65‐66438008, email: [email protected].

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The Publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging‐in‐Publication Data

Names: Yang, Mengfei, author. | Hua, Gengxin, 1965– author. | Feng, Yanjun, 1969– author. | Gong, Jian, 1975– author.
Title: Fault‐tolerance techniques for spacecraft control computers / Mengfei Yang, Gengxin Hua, Yanjun Feng, Jian Gong.
Other titles: Hang tian qi kong zhi ji suan ji rong cuo ji shu. English
Description: Singapore : John Wiley & Sons, Inc., 2017. | Translation of: Hang tian qi kong zhi ji suan ji rong cuo ji shu. | Includes bibliographical references and index.
Identifiers: LCCN 2016038233 (print) | LCCN 2016051493 (ebook) | ISBN 9781119107279 (cloth) | ISBN 9781119107408 (pdf) | ISBN 9781119107415 (epub)
Subjects: LCSH: Space vehicles–Control systems. | Fault‐tolerant computing.
Classification: LCC TL3250 .Y36513 2017 (print) | LCC TL3250 (ebook) | DDC 629.47/42–dc23
LC record available at https://lccn.loc.gov/2016038233

Cover design by Wiley

Cover image: pixelparticle/Gettyimages

Brief Introduction

In this book, fault tolerance techniques are systematically presented for spacecraft control computers.

The contents of this book are as follows:

space environment where spacecraft control computers operate, and fault models of control computers;

fault‐tolerance architecture and clock synchronization techniques;

fault detection techniques;

space bus fault‐tolerance techniques;

software fault‐tolerance techniques, including single version and N‐version programming;

SRAM‐based FPGA fault‐tolerance techniques with redundancy and reconfiguration;

fault‐injection techniques;

intelligent fault‐tolerance techniques, such as evolvable hardware fault‐tolerance and artificial immune hardware fault‐tolerance.

This book can function as a reference for persons engaged in the research and design of high‐reliability computers, especially spacecraft computers and electronics, and also as a textbook for graduates engaged in research work in this field.

Preface

The control computer is one of the key pieces of equipment in a spacecraft control system. Advances in space technology have made the functionality of the control computer increasingly complex. In addition, the control computer used in space is affected by the harsh elements of the space environment, especially radiation, so stringent requirements must be satisfied to ensure the control computer's reliability. Consequently, multiple fault‐tolerance techniques are used in spacecraft design to improve the reliability of the control computer.

NASA (in the United States) has been using fault‐tolerant computer systems in its spacecraft – for example, the self‐testing and repairing (STAR) fault‐tolerant computer – since the 1960s. China began to develop fault‐tolerant computers for spacecraft in the 1970s. We utilized a fault‐tolerant control computer in a satellite for the first time at the Beijing Institute of Control Engineering in the 1980s, and realized a successful on‐orbit flight. Fault‐tolerance techniques have subsequently been predominantly incorporated into control computers, and have contributed significantly to the success of spacecraft projects and missions.

The significance of fault‐tolerance techniques in space technology has prompted us to publish this book, introducing the techniques that we use in spacecraft control computer research and design in China. The content of this book covers not only the fundamental principles, but also methods and case studies in practical engineering.

There are a total of eight chapters. Chapter 1 summarizes fundamental concepts and principles of fault‐tolerance techniques, analyzes the characteristics of a spacecraft control computer and the influences of the space environment, and reviews the course of development of fault‐tolerance techniques and development perspectives expected in the future. Chapter 2 introduces the typical architecture of a fault‐tolerant computer and its key techniques, based on China’s spacecraft projects and engineering practices. Chapter 3 presents frequently used fault models, based upon which, fault detection techniques of computer key components are discussed. Chapter 4 introduces the fault‐tolerance techniques of several frequently used spacecraft control computer buses, with special focus on buses such as 1553B bus, CAN bus, and SpaceWire bus.

Chapter 5 outlines the fundamental concepts and principles underlying software fault‐tolerance and emphatically discusses several concrete software fault‐tolerance techniques, including single‐version fault tolerance, N‐version fault tolerance, and data diversity‐based fault tolerance. Chapter 6 discusses the effect that space radiation has on field programmable gate arrays (FPGAs), and the fault models and dynamic fault‐tolerance methods used in static random access memory (SRAM)‐based FPGAs. Chapter 7 presents fault‐injection relevant techniques based on practical engineering, primarily involving fault‐injection methods, evaluation methods, and tools. Chapter 8 discusses the fundamental concepts, principles, and concrete implementation methods of state‐of‐the‐art intelligent fault‐tolerance techniques, and introduces two representative intelligent fault‐tolerance techniques – specifically, evolvable hardware fault tolerance and artificial immune hardware fault tolerance.

All the authors listed in this book – Yang Mengfei, Hua Gengxin, Feng Yanjun, and Gong Jian – helped to plan it. Yang Mengfei, Gong Jian, and Feng Yanjun wrote Chapter 1; Yang Mengfei, Feng Yanjun, and Gong Jian wrote Chapter 2; Yang Mengfei and Gong Jian wrote Chapter 3; Hua Gengxin, Yang Mengfei, Feng Yanjun, and Gong Jian wrote Chapter 4; Feng Yanjun and Yang Mengfei wrote Chapter 5; Liu Hongjin, Yang Mengfei, and Gong Jian wrote Chapter 6; Hua Gengxin and Gong Jinggang wrote Chapter 7; and Gong Jian, Yang Mengfei, and Dong Yangyang wrote Chapter 8. Gong Jian and Feng Yanjun were responsible for formatting, while Yang Mengfei approved, proofread, and finalized the book.

Throughout the process of writing this book, we received significant support and assistance from leaders, experts, and colleagues at the Beijing Institute of Control Engineering, to whom we express our sincere thanks. We also wish to express sincere thanks to Wu Hongxin, academician at the China Academy of Science, for his encouragement and support. We also wish to express our sincere thanks to leaders and colleagues Zhang Duzhou, Yuan Li, Wang Dayi, Ding Cheng, Gu Bin, Wu Yifan, Yang Hua, Liu Bo, Chen Zhaohui, Liu Shufen, Lu Xiaoye, Wang Lei, Zhao Weihua, Wang Rong, Yuan Yi, Zhang Shaolin, and Wu Jun for their support. Publication of this book was made possible by financial aid from the National Science and Technology Book Publication Fund, to whom we express our sincerest thanks.

This book contains not only a summary of our practical work, but also our research experience, fully reflecting the present status and level of China's spacecraft control computer fault‐tolerance techniques. This book combines theory with practice, and is highly specialized. As a result, it can function as a reference for persons engaged in the research and design of high‐reliability computers, especially spacecraft computers and electronics, and also as a textbook for graduates engaged in research work in this field.

We are fully aware that our expertise is limited, and that inconsistencies and errors may be present in this book. If any such instances are found, please do not hesitate to point them out.

1Introduction

A control computer is a key piece of equipment in a spacecraft control system. Its reliability is critical to the operations of the spacecraft, and the success of a space mission hinges on failure‐free operation of the control computer. During a mission, a spacecraft operates for long periods in a hostile space environment without maintenance, and therefore requires a highly reliable control computer, which usually employs multiple fault‐tolerance techniques introduced in the design phase. With focus on the spacecraft control computer's characteristics and reliability requirements, this chapter provides an overview of fundamental fault‐tolerance concepts and principles, analyzes the space environment, emphasizes the importance of fault‐tolerance techniques in the spacecraft control computer, and summarizes the current status and future development direction of fault‐tolerance technology.

1.1 Fundamental Concepts and Principles of Fault‐tolerance Techniques

Fault‐tolerance technology is an important approach to guarantee the dependability of a spacecraft control computer. It improves system reliability through implementation of multiple redundancies. This section briefly introduces its fundamental concepts and principles.

1.1.1 Fundamental Concepts

“Fault‐tolerance” refers to “a system’s ability to function properly in the event of one or more component faults,” which means the failure of a component or a subsystem should not result in failure of the system. The essential idea is to achieve a highly reliable system using components that may have only standard reliability [1]. A fault‐tolerant computer system is defined as a system that is designed to continue fulfilling assigned tasks even in the event of hardware faults and/or software errors. The techniques used to design and analyze fault‐tolerant computer systems are called fault‐tolerance techniques. The combination of theories and research related to fault‐tolerant computer techniques is termed fault‐tolerant computing [2–4].

System reliability assurance depends on the implementation of fault‐tolerance technology. Before the discussion of fault‐tolerance, it is necessary to clarify the following concepts [4,5]:

Fault: a physical defect in hardware, imperfection in design and manufacturing, or bugs in software.

Error: information inaccuracy or incorrect status resulting from a fault.

Failure: a system’s inability to provide the target service.

A fault can either be explicit or implicit. An error is a consequence and manifestation of a fault. A failure is defined as a system’s inability to function. A system error may or may not result in system failure – that is, a system with a fault or error may still be able to complete its inherent function, which serves as the foundation of fault‐tolerance theory. Because there are no clearly defined boundaries, concepts 1, 2, and 3 above are usually collectively known as “fault” (failure).
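The fault → error → failure chain can be illustrated with a toy sketch (a hypothetical Python example, not from the book): a flipped storage bit is the fault, the corrupted word read back is the error, and because the parity check detects it and a clean copy is used instead, the system still delivers its service, so no failure occurs.

```python
# Illustrative sketch (hypothetical, not from the book): a bit flip in
# storage is the *fault*; the corrupted word is the *error*; because the
# parity check catches it and a clean copy is used, the system still
# provides its service -- no *failure* results.

def with_parity(word: int):
    """Store a word together with its even-parity bit."""
    return word, bin(word).count("1") % 2

def parity_ok(word: int, parity: int) -> bool:
    """Check a stored word against its recorded parity bit."""
    return bin(word).count("1") % 2 == parity

word, p = with_parity(0b1011001)
corrupted = word ^ (1 << 4)          # fault: one bit flips in storage

assert corrupted != word             # error: the stored value is wrong
assert not parity_ok(corrupted, p)   # the error is detected...
value = word if not parity_ok(corrupted, p) else corrupted
assert value == word                 # ...and masked: no failure observed
```

A single parity bit only detects single-bit errors; real spacecraft memories use stronger error-correcting codes, as discussed later in the book.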

Faults can be divided into five categories on the basis of their pattern of manifestation, as shown in Figure 1.1.

Figure 1.1 Fault categorization.

A “permanent fault” is a component failure that persists once it occurs. A “transient fault” is a component failure that occurs at some moment and then disappears. An “intermittent fault” is a recurring component failure – sometimes the failure occurs, sometimes it does not: when there is no fault, the system operates properly; when there is a fault, the component fails. A “benign fault” only results in the failure of a component, and is relatively easy to handle. A “malicious fault” causes the failed component to appear normal, or to transmit inaccurate values to different receivers as a result of the malfunction – hence, it is more hostile.

Currently, the following three fault‐tolerant strategies are utilized [4–6]:

Fault masking. This strategy prevents faults from entering the system through redundancy design, so that faults are transparent to the system, having no influence. It is mainly applied in systems that require high reliability and real‐time performance. The major methods include memory correction code and majority voting. This type of method is also called static redundancy.
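As a minimal illustration of the majority-voting idea (a sketch, not code from this book), a bitwise two-out-of-three vote over three redundant copies of a data word can be written as follows; `majority_vote` is a hypothetical helper name:

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote over three redundant copies of a word.

    A fault in any single copy is masked: each output bit takes the value
    held by the two agreeing copies, so the fault never "enters" the system.
    """
    return (a & b) | (b & c) | (a & c)

# A transient fault flips bit 3 in one copy; the vote masks it completely.
good = 0b10101100
faulty = good ^ 0b00001000
assert majority_vote(good, good, faulty) == good
```

Memory error-correcting codes achieve the same masking effect for stored data, as discussed under information redundancy below.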

Reconfiguration. This strategy recovers system operation through fault removal. It includes the following steps:

Fault detection – fault determination, which is a necessary condition for system recovery;

Fault location – used to determine the position of the fault;

Fault isolation – used to isolate the fault to prevent its propagation to other parts of the system;

Fault recovery – used to recover system operation through reconfiguration.

This method is also defined as dynamic redundancy.
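The four steps above can be sketched as a minimal detect/locate/isolate/recover loop. This is an illustrative sketch only; `Unit`, `self_test`, and `reconfigure` are hypothetical names, and a real system would use hardware health signals rather than a method call:

```python
class Unit:
    """A hypothetical redundant hardware unit with a built-in self-test."""
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.isolated = False

    def self_test(self):               # fault detection
        return self.healthy

def reconfigure(units, active):
    """Run one detect/locate/isolate/recover cycle; return the unit to use."""
    if units[active].self_test():
        return active                  # no fault detected: keep current unit
    # Fault location: the failed self-test pinpoints the active unit.
    units[active].isolated = True      # fault isolation: never select it again
    for i, u in enumerate(units):      # fault recovery: switch to a spare
        if u.self_test() and not u.isolated:
            return i
    raise RuntimeError("no healthy unit left")

units = [Unit("primary"), Unit("spare")]
units[0].healthy = False               # inject a permanent fault in the primary
assert reconfigure(units, 0) == 1      # the system recovers on the spare
```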

Integration of fault masking and reconfiguration. This integration realizes system fault‐tolerance through the combination of static redundancy and dynamic redundancy, also called hybrid redundancy.

In addition to the three strategies above, analysis shows that, in certain scenarios, it is also possible to achieve fault tolerance through degraded redundancy. Since degraded redundancy implements system function only partially, this book does not discuss it further.

The key to fault tolerance is redundancy – no redundancy, no fault‐tolerance. Computer system fault‐tolerance consists of two types of redundancies: time redundancy and space redundancy. In time redundancy, the computation and transmission of data are repeated, and the result is compared to a stored copy of the previous result. In space redundancy, additional resources, such as components, functions or data items, are provided for a fault‐free operation.

Redundancy necessitates additional resources for fault‐tolerance. The redundancies in the above two categories can be further divided into four types of redundancies: hardware redundancy, software redundancy, information redundancy, and time redundancy. In general, hardware failure is solved with hardware redundancy, information redundancy, and time redundancy, while software failure is solved with software redundancy and time redundancy.

Hardware redundancy: In this type of redundancy, the effect of a fault is obviated through extra hardware resources (e.g., using two CPUs to achieve the same function). In this scenario, the failure of one CPU can be detected through comparison of the two results. If there are triple CPUs, masking of one CPU’s failure is achieved through majority voting – a typical static redundancy strategy. It is possible to set up a dynamic fault‐tolerant system through multiple hardware redundancies, such that backup components replace the ones that fail. Hybrid redundancy incorporates static and dynamic redundancy. Hardware redundancy, which ranges from simple backup to complex fault tolerance structures, is the most widely used and basic redundancy method, and is related to the other three because they all need extra resources.

Software redundancy: In this type of redundancy, faults are detected and fault tolerance achieved by using extra software. On the rationale that different people will not make the same mistake, fault tolerance is achieved by having different teams develop different versions of the same software, so that the same input does not induce the same error in every version.
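This N-version idea can be sketched as follows, with three independently written versions of the same function (integer square root) voted at run time. The version names and the injected design fault are hypothetical, chosen only to show how a majority vote outvotes one faulty version:

```python
import math
from collections import Counter

def isqrt_v1(n):
    return math.isqrt(n)               # version 1: library routine

def isqrt_v2(n):                       # version 2: independent Newton iteration
    r = n
    while r * r > n:
        r = (r + n // r) // 2
    return r

def isqrt_v3_buggy(n):                 # version 3 carries a design fault:
    return round(n ** 0.5)             # it rounds up for inputs such as n = 15

def n_version(n, versions):
    """Run all versions on the same input and return the majority result."""
    results = [v(n) for v in versions]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: versions disagree completely")
    return value

# The design fault in version 3 is outvoted by the two correct versions.
assert n_version(15, [isqrt_v1, isqrt_v2, isqrt_v3_buggy]) == 3
```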

Information redundancy: This type of redundancy achieves fault‐tolerance through extra information (e.g., error correcting code is a typical information redundancy method). Information redundancy needs the support of hardware redundancy to complete error detection and correction.
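As a sketch of the error-correcting-code idea, the classic Hamming(7,4) code stores 4 data bits with 3 parity bits and corrects any single-bit error. The function names are hypothetical, and flight memories typically use wider SEC-DED codes implemented in hardware:

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]   # bit positions 1..7

def hamming74_correct(c):
    """Return (corrected codeword, error position); position 0 means no error."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]        # parity check over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]        # parity check over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]        # parity check over positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3       # binary position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1              # correct the single-bit error
    return c, syndrome

word = hamming74_encode([1, 0, 1, 1])
corrupted = list(word)
corrupted[4] ^= 1                         # a single event flips one stored bit
fixed, pos = hamming74_correct(corrupted)
assert fixed == word and pos == 5         # the flip is located and corrected
```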

Time redundancy: In this type of redundancy, fault detection and fault‐tolerance are achieved over time – for example, a user may repetitively execute certain program on certain hardware, or adopt a two‐out‐of‐three strategy with the result for an important program.
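The repeat-and-vote idea can be sketched as follows: the same computation runs three times in sequence on the same hardware, and a two-out-of-three vote over the results masks a transient fault that corrupts one run. The fault injection is simulated, and all names are hypothetical:

```python
from collections import Counter

def compute_with_transient_fault(x, fail_on_run, run):
    """The 'important program' (here x*x), with a simulated transient fault."""
    result = x * x
    if run == fail_on_run:           # a transient fault corrupts one run only
        result ^= 0x40
    return result

def two_out_of_three(x, fail_on_run=1):
    """Execute the computation three times over time and vote on the results."""
    results = [compute_with_transient_fault(x, fail_on_run, run)
               for run in range(3)]
    value, votes = Counter(results).most_common(1)[0]
    assert votes >= 2, "more than one run corrupted"
    return value

# The run corrupted by the transient fault is outvoted by the other two.
assert two_out_of_three(12) == 144
```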

Because of the extra resources involved, redundancy inevitably affects system performance, size, weight, function, and reliability. In the design phase of a computer system with high‐reliability requirement, it is necessary to balance all application requirements to select the appropriate redundancy method and fault tolerance structure. In order to reflect all aspects of a fault‐tolerant computer system’s implementation and research, this book covers system architecture, fault detection, bus, software, FPGA, and fault injection, and introduces intelligence fault tolerance technology.

1.1.2 Reliability Principles

1.1.2.1 Reliability Metrics

Qualitative and quantitative analysis and estimation are essential in the design of fault‐tolerant computer systems. The major features involved are reliability, availability, maintainability, safety, performability, and testability, with each feature having its own qualitative and quantitative specifications [4,5,7].

Reliability and its measurement (R(t))

Reliability is the ability of a system to function under stated conditions for a stated time. Assume that the system is operating normally at t0. The conditional probability that the system operates normally throughout [t0, t] is defined as the system's reliability degree at time t, denoted R(t). Conversely, the conditional probability that the system fails within [t0, t] is defined as the system's unreliability degree at time t, denoted F(t). Reliability and unreliability are related as follows:

R(t) + F(t) = 1

The failure probability density function can be calculated from the system's unreliability, i.e.,

f(t) = dF(t)/dt
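For a concrete check of these relations, assume an exponential lifetime with a hypothetical failure rate λ = 10⁻⁴ per hour; then R(t) = e^(−λt), F(t) = 1 − R(t), and f(t) = λe^(−λt):

```python
import math

lam = 1e-4                     # assumed failure rate (per hour)
t = 1000.0                     # mission time (hours)

R = math.exp(-lam * t)         # reliability R(t)
F = 1.0 - R                    # unreliability F(t)
f = lam * math.exp(-lam * t)   # failure probability density f(t) = dF/dt

assert abs(R + F - 1.0) < 1e-12

# A numerical derivative of F agrees with the closed form for f.
h = 1e-3
F2 = 1.0 - math.exp(-lam * (t + h))
assert abs((F2 - F) / h - f) < 1e-8
```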

Availability and its measurement (A(t))

Availability is the proportion of time a system is in a functioning condition. The probability that the system is operating normally at time t is defined as the system's availability degree at t, denoted A(t). This is also termed transient (instantaneous) availability; its limiting value as t → ∞ is called the steady-state availability.

Maintainability and its measurement (M(t))

Maintainability is the ability of a system to recover its required function when the system operates under specified conditions, and is repaired following specified procedures and methods. Assume that the system failed at t0; the probability of the system being successfully repaired within [t0, t] is defined as its maintainability degree, denoted M(t).

Safety and its measurement (S(t))

Safety is the property of a system not to endanger personnel or equipment. Assume that the system is operating normally at t0. The probability R(t) that the system operates normally throughout [t0, t], plus the conditional probability of the system being in failsafe mode, is defined as the safety degree of the system at [t0, t], denoted S(t). Failsafe mode is a mode in which the system stops functioning without jeopardizing human life. Therefore, high reliability results in high safety, but high safety does not necessarily result in high reliability.

Performability and its measurement (P(L, t))

Performability is the ability of a system to maintain part of its function and gracefully degrade when failure occurs. The probability of the system operating at performance of level L or above is defined as the system performability degree at time t, denoted P(L, t). Reliability requires that all functions be properly operational, whereas performability requires only that a portion of the functions be properly operational.

Testability and its measurement

Testability is the degree of ease or difficulty with which a system can be tested and its faults detected and located – that is, how difficult and complex testing is. There is currently no universally accepted measure of testability; consequently, it is usually measured by test cost.

In summary, fault tolerance is a system’s ability to complete its targeted function in the presence of a fault. Fault tolerance techniques are the major methods employed to achieve system reliability. Of the six features described above, reliability and availability are the most important. Therefore, we focus on these two features in the ensuing discussion.

Because R(t) is the probability of the system operating continuously within [t0, t], it is closely related to the system's mean time to failure (MTTF) and mean time between failures (MTBF). MTTF is the mathematical expectation of a non-repairable system's operating time before failure, i.e.,

MTTF = ∫(0 to ∞) R(t) dt

When the system lifetime follows an exponential distribution – that is, the failure rate is a constant λ, so that R(t) = e^(−λt) – then:

MTTF = 1/λ

For a repairable product, MTBF is the mean time between two consecutive failures. Let MTTR (mean time to repair) represent the system recovery time; then MTTF and MTBF are related by the following equation:

MTBF = MTTF + MTTR

Availability A(t) is the proportion of time during which the system is available within [t0, t] (normal operating time versus total operating time). The steady-state availability can be calculated using MTBF, MTTF, and MTTR – that is:

A = MTTF/MTBF = MTTF/(MTTF + MTTR)

The definitions of MTTF and MTBF show that reliability and availability are not positively correlated – that is, high availability does not necessarily imply high reliability. For example, given a system that fails once per hour with a recovery time of 1 second, its MTBF is 1 hour, which is quite low, but its availability is A = 3599/3600 ≈ 0.99972, which is very high.
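The example above can be reproduced numerically; the figures (one failure per hour, 1 second recovery) are taken from the text:

```python
# Availability of a system that fails once per hour but recovers in 1 second.
mttr = 1.0            # mean time to repair, seconds
mtbf = 3600.0         # mean time between failures, seconds
mttf = mtbf - mttr    # from MTBF = MTTF + MTTR

A = mttf / mtbf       # steady-state availability A = MTTF/MTBF
assert abs(A - 3599 / 3600) < 1e-12   # high availability despite a low MTBF
```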

1.1.2.2 Reliability Model

A reliability model is widely used at the design phase of a computer system to calculate, analyze, and compare its reliability. As described in the following section, the reliability model includes serial connection, parallel connection, and multiple modular redundancy.

Serial connection model

A serial connected system is a system in which the failure of one unit will cause the failure of the entire system. Its reliability model is a serial model, shown in Figure 1.2, which is the most widely used model:

In a serially connected system, if the lifetime of every unit follows an exponential distribution, the mathematical model of the serial model is:

R(t) = R1(t)R2(t) … Rn(t) = e^(−(λ1 + λ2 + … + λn)t)   (1‐1)

where:

R(t) is the reliability of the system;

Ri(t) is the reliability of each unit;

λi is the failure rate of each unit; and

n is the total number of units.

The lifetime of the system follows an exponential distribution if the lifetime of each unit follows an exponential distribution. The failure rate of the system, λ, is the sum of the failure rates of the units, λi, as shown in the following equation:

λ = λ1 + λ2 + … + λn

The MTBF is:

MTBF = 1/λ = 1/(λ1 + λ2 + … + λn)

Equation (1‐1) shows that the reliability of a system is the product of each of its unit’s reliability. The more units there are, the lower the system reliability. From a design point of view, in order to improve the reliability of a serial connected system, the following measures may be taken:

minimize the number of units in the serial connection;

improve the reliability of each unit and reduce its failure rate λi;

reduce the operation time t.
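The serial-model relations can be checked numerically: the product of the unit reliabilities equals the exponential of the summed failure rates, and the system MTBF is the reciprocal of that sum (the rates below are assumed example values):

```python
import math

rates = [2e-5, 5e-5, 1e-5]            # assumed unit failure rates (per hour)
t = 5000.0                            # mission time (hours)

R_product = 1.0
for lam in rates:
    R_product *= math.exp(-lam * t)   # product of unit reliabilities

lam_sys = sum(rates)                  # serial system rate: lambda = sum(lambda_i)
R_closed = math.exp(-lam_sys * t)     # closed form of equation (1-1)
mtbf = 1.0 / lam_sys                  # MTBF = 1/lambda

assert abs(R_product - R_closed) < 1e-12
assert abs(mtbf - 12500.0) < 1e-6     # 1 / 8e-5 hours
```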

Parallel connection model

A parallel connected system is a system in which only the failure of all units will cause the failure of the system. Its reliability model is a parallel model, shown in Figure 1.3, which is the simplest and most widely used model with backup:

The mathematical model of the parallel model is:

R(t) = 1 − [1 − R1(t)][1 − R2(t)] … [1 − Rn(t)]   (1‐2)

where:

R(t) is the reliability of the system;

Ri(t) is the reliability of each unit; and

n is the total number of units.

For the usual two-unit parallel system, if the lifetime of each unit follows an exponential distribution, then:

R(t) = e^(−λ1t) + e^(−λ2t) − e^(−(λ1+λ2)t)

Equation (1‐2) shows that although the unit failure rates λ1 and λ2 are constants, the failure rate of the parallel system is not. For a system of n identical parallel-connected units, if the lifetime of each unit follows an exponential distribution with failure rate λ, the system reliability is:

R(t) = 1 − (1 − e^(−λt))^n

Compared to units with no backup, the reliability of the system is significantly improved. However, the amount of improvement decreases as more parallel units are incorporated into the system.
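A short numeric sketch (with an assumed unit failure rate) shows both properties of the parallel model: reliability improves over a single unit, and each added unit yields a diminishing gain:

```python
import math

lam, t = 1e-4, 5000.0                 # assumed failure rate and mission time
r_unit = math.exp(-lam * t)           # reliability of one unit

def parallel_reliability(n):
    """R = 1 - (1 - Ri)^n for n identical parallel-connected units."""
    return 1.0 - (1.0 - r_unit) ** n

gains = [parallel_reliability(n + 1) - parallel_reliability(n)
         for n in range(1, 4)]        # gain from each additional unit

assert parallel_reliability(2) > parallel_reliability(1)
assert gains[0] > gains[1] > gains[2]  # diminishing improvement per unit
```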

Multiple modular redundancy (r/n(G)) model

Consisting of n units and a voting machine, a system that operates normally when the voting machine operates normally and the number of normal units is no less than r(1 ≤ r ≤ n) is defined as an r/n(G) voting system. Its reliability model is defined as the r/n(G) model, shown in Figure 1.4, which is a type of backup model.

The mathematical form of the r/n(G) model is:

R(t) = Rm Σ(i = r to n) C(n, i) [Ri(t)]^i [1 − Ri(t)]^(n−i)

in which C(n, i) = n!/[i!(n − i)!] is the binomial coefficient.

where:

R(t) is the reliability of the system;

Ri(t) is the reliability of each of the system's units (identical for each unit); and

Rm is the reliability of the voting machine.

If the lifetime of each unit follows an exponential distribution with constant failure rate λ, so that Ri(t) = e^(−λt), the reliability of the r/n(G) system is:

R(t) = Rm Σ(i = r to n) C(n, i) e^(−iλt) (1 − e^(−λt))^(n−i)

Let n = 2k + 1 be an odd number. An r/n(G) system that operates normally when the number of normal units is no less than k + 1 is defined as a majority voting system. A majority voting system is a special case of an r/n(G) system; the two-out-of-three system is the most common majority voting system.

When the reliability of the voting machine is one, and the failure rate of each unit is a constant λ, the mathematical form of the majority voting model is:

R(t) = Σ(i = k + 1 to 2k + 1) C(2k + 1, i) e^(−iλt) (1 − e^(−λt))^(2k+1−i)

If r = 1, the r/n(G) system reduces to a parallel connected system, and the system reliability is:

R(t) = Rm [1 − (1 − e^(−λt))^n]

If r = n, the r/n(G) system reduces to a serially connected system, and the system reliability is:

R(t) = Rm e^(−nλt)
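The limiting cases above can be verified numerically with an ideal voter (Rm = 1); the unit failure rate is an assumed example value, and `rn_reliability` is a hypothetical helper implementing the binomial sum of the r/n(G) model:

```python
import math

def rn_reliability(r, n, r_unit, r_voter=1.0):
    """r/n(G) reliability: binomial sum over i = r..n working units."""
    return r_voter * sum(math.comb(n, i) * r_unit**i * (1 - r_unit)**(n - i)
                         for i in range(r, n + 1))

lam, t = 1e-4, 2000.0                 # assumed failure rate and mission time
ru = math.exp(-lam * t)               # reliability of one unit

# r = 1 reduces to the parallel model; r = n reduces to the serial model.
assert abs(rn_reliability(1, 3, ru) - (1 - (1 - ru) ** 3)) < 1e-12
assert abs(rn_reliability(3, 3, ru) - ru ** 3) < 1e-12

# Two-out-of-three majority voting outperforms a single unit here.
assert rn_reliability(2, 3, ru) > ru
```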

Figure 1.2 Serial connection model.

Figure 1.3 Parallel connection model.

Figure 1.4 The r/n(G) model.

1.2 The Space Environment and Its Hazards for the Spacecraft Control Computer

A spacecraft control computer is constantly exposed to a complex environment that features factors such as zero gravity, vacuum, extreme temperature, and space radiation, in addition to maintenance difficulties. The complexity of the operating environment challenges the computer and often results in system faults during orbit missions [8]. It is particularly important to implement fault tolerance techniques in a spacecraft control computer.

1.2.1 Introduction to Space Environment

1.2.1.1 Solar Radiation

Solar radiation is the most active and important aspect of space radiation. Long‐term observation shows that solar activity can be categorized into two types, based on the energy levels of the particles and magnetic flux released: slow type and eruptive type. Each type has its own radiation effect on a spacecraft.

In slow-type activity, the corona ejects solar wind, in which electrons and protons are the major components, accounting for more than 95% of the total ejected mass, with speeds of up to 900 km/s. Helium ions account for 4.8%, and other particles account for an even smaller percentage [9]. In the solar wind, the flux of low-energy particles is high, and that of high-energy particles is low. In the quiet period of solar minima, particles at 1 AU (150 000 000 km) consist of low-energy solar wind particles and a few galactic cosmic rays (GCRs).

Eruptive solar activity includes coronal mass ejections (CMEs) and solar flares; such events are also called solar particle events (SPEs), solar proton events, or relativistic proton events. During eruptive solar activity, streams of charged particles and high-energy radiation are released into space, with high-energy particle speeds exceeding 2000 km/s. In the most severe five minutes of an eruption, most particles at 1 AU are high-energy particles, with a flux level higher than that of the solar quiet period by several orders of magnitude.

In the 11‐year solar cycle, the probability of CME and flare is low during solar minima and high during solar maxima. Compared with the static slow activity, eruptive solar activity is a low probability event with a very short duration and very low energy, but very high power. As the flux level of eruptive solar activity is higher than that of slow solar activity by several orders of magnitude, it has a severely destructive effect on space electronics and astronauts and, hence, is the focus of space radiation research.

Along with the emitted ions, both types of solar activity carry an interplanetary magnetic field. The magnetic field intensity of eruptive solar activity is extremely high and interacts with the magnetic field of the earth, thereby negatively affecting low-orbit satellites and the earth environment.

1.2.1.2 Galactic Cosmic Rays (GCRs)

GCRs originate from outside the solar system and feature very low ion densities, extremely high energy levels, and isotropic arrival. They comprise 83% protons, 13% helium ions, 3% electrons, and 1% other high-energy particles. The total flux of GCRs is extremely low. The GCR flux is anticorrelated with solar activity: during solar maxima it decreases slightly, and during solar minima it increases slightly.

1.2.1.3 Van Allen Radiation Belt

The interplanetary magnetic field carried by solar activity interacts with the magnetic field of the earth and deforms the earth's magnetosphere, compressing the sun-facing side and stretching the opposite side. This deflects charged ions heading towards the earth so that they leave along the magnetotail, shielding the earth's environment. The shape of the earth's magnetic layer resembles that of a comet's tail, as shown in Figure 1.5.

Figure 1.5 Magnetic layer of the earth and radiation.

Ions that cross the magnetopause and arrive in the vicinity of the earth are captured by the earth's magnetic field. These captured ions form inner and outer belts around the earth, with the line through the south and north poles as their axis. The capture zone was first discovered by Van Allen and is hence called the Van Allen radiation belt. The inner belt is situated in a shell space above the meridian plane within the latitude range ±40° (the shell extends above the equator over the range 1.2 L to 2.5 L, where L is the radius of the earth, L ≈ 6371 km, and L = 1 corresponds to the earth's surface). Protons and electrons constitute most of the inner belt. The outer belt is situated in a shell space above the meridian plane within the latitude range ±55° to ±70° (above the equator over the range 2.8 L to 12 L). The flux of protons and electrons within the Van Allen radiation belt as a function of position is shown in Figures 1.6 and 1.7.

Figure 1.6 Energy spectrum of protons.

Figure 1.7 Electron distribution above the equator.

Because of the inhomogeneity of the earth's magnetic field, high-energy protons are found at altitudes as low as 200 km above the South Atlantic Anomaly. In addition, the convergence of magnetic field lines at the polar zones leads to an increase in the high-energy particle flux in those areas [10].

The components and distribution of high-energy particles within the Van Allen radiation belt are stable when there is no eruptive solar activity. However, when eruptive solar activity occurs, or the interplanetary magnetic field disturbs the earth's magnetic field, the flux and energy spectrum of the high-energy particles increase significantly, and the Van Allen radiation belt moves closer to the earth. As a result, satellite electronics (and even ground facilities) may fail.

1.2.1.4 Secondary Radiation

When primary high-energy particles penetrate a spacecraft's material, nuclear reactions are produced, which in turn generate secondary particles and rays, including strongly penetrating types such as bremsstrahlung and neutrons.

1.2.1.5 Space Surface Charging and Internal Charging

Surface charging results from plasma and the photoelectric effect. Because the mass of electrons in plasma is significantly lower than that of other particles, the speed of the electrons is correspondingly significantly higher than that of other particles. When a satellite is immersed in the uncharged plasma, first a large number of electrons and a small number of other particles are deposited onto the surface of the satellite, to form electron collection current Ie and ion collection current Ii; the surface of the material produces secondary electronic radiation and ions, which form surface‐deposited ion radiation current Isi, leaving electronic radiation current Ise; and the impact of incident electrons on the surface of the material produces reflected electrons to form reflected electron current Ib. If the material is situated in a lit region, the surface emits photons, which forms photon photoelectric current Ip. Hence, the total current on the surface of the material is given by It = Ie – (Ii + Isi + Ise + Ib + Ip).

At the beginning of charging, as a result of the high speed of the electrons, electron collection current constitutes most of the total current and creates a negative potential that is continuously reduced, until the summation of the repulsive force to electrons and the attractive force to ions result in the total current being zero. The effect is a negative potential with respect to plasma – that is, absolute surface charging. Surface potential is related to the energy level and density of plasma. Research shows that when immersed in 100 eV and 300 eV plasma, the respective potentials of the satellite surface are –270 V and –830 V. Because of the thermal plasma environment in high orbit, the large number of deposited electrons in polar orbit and the cold plasma environment in non‐polar‐orbit, the negative surface potential of satellite in high orbit and polar orbit is more severe than that of those in non‐polar low‐earth orbit.

At the lighted surface, the continuous light irradiation results in departure of electrons from the satellite surface, owing to the photoelectric effect. The lighted surface gradually takes on a positive potential of around several volts to dozens of volts, while the unlighted surface maintains a relatively high negative potential. The potential difference between the two surfaces is defined as relative surface charging, which is the major factor contributing to damage to a satellite when it enters or leaves the earth’s shadow.

Internal charging is induced by electrons with energy levels higher than 50 keV residing within poor or isolated conductors after penetration of the spacecraft’s skin. Because the flux of high‐energy level electrons in high orbit and polar orbit is relatively large, satellites in these orbits experience more severe internal charging problems than in others. In addition, during CME and solar flare periods, the flux of high‐energy level electrons surges and lasts a long time. This also leads to serious internal charging problems.

1.2.1.6 Summary of Radiation Environment

The natural space radiation environment is composed of multiple particles with continuous energy levels and flux. These particles contribute to stable factors such as solar winds, the Van Allen radiation belt and GCRs, and eruptive factors such as solar flares and CME. Wilson et al. provided a rough illustration of the space environment particles and their energy spectra, as depicted in Figure 1.8 [11].

Figure 1.8 Space radiation environment.

1.2.1.7 Other Space Environments

In addition to space radiation, a spacecraft is subjected to other special space environments and their corresponding negative effects. These include the following:

Vacuum environment: when a satellite is in orbit, its electrical equipment is in a very low pressure environment, namely a “vacuum” environment. During launch and the return phase, the electrical equipment is in a gradually changing pressure environment.

Thermal environment: the vacuum environment a satellite is in and the huge temperature difference between the satellite’s lighted and unlighted sides invalidate the normal convection thermal control method. This is a new challenge to thermal control. The current practice is to perform thermal control with thermal conduction and radiation.

Atomic oxygen, space debris, and so on.

1.2.2 Analysis of Damage Caused by the Space Environment

The above space environment will cause permanent, transient, and intermittent electrical equipment and computer failures. The damage caused can be categorized into total ionizing dose (TID), single event effect (SEE), internal/surface charging damage, displacement damage (DD), etc.

1.2.2.1 Total Ionization Dose (TID)

When high-energy particles penetrate metal oxide semiconductor (MOS) or bipolar devices and ionize the oxide (SiO2), electron-hole pairs are produced, which undergo two kinds of movement: recombination and drift. Without an external electric field, recombination is dominant; with an external electric field, electrons and holes drift in opposite directions along the field. Because of their high mobility, electrons quickly leave the oxide, and holes begin to accumulate. This process is defined as gate-oxide hole capture. The higher the electric field and the electron mobility, the higher the capture ratio. This explains why TID damage is more severe in biased (powered) devices than in unbiased ones.

An electric field also forms from charge capture at the Si–SiO2 interface (surface capture). For an N-channel MOS (NMOS) transistor, when Vg > 0 V, surface capture results in negative charge accumulation; for a P-channel MOS (PMOS) transistor, when Vg < 0 V, surface capture results in positive charge accumulation.

The extra electric field produced by gate-oxide and surface capture parasitizes the active area of the device, and leads to drift of the threshold voltage Vth, drift of the propagation delay Tpd, an increase in the static current Icc, and attenuation of the transistor's amplification coefficient. Components fail once this damage exceeds a certain limit.

At the outset, the major effect of environmental radiation is gate-oxide capture; over time, however, surface capture becomes dominant. Therefore, the Vth of a PMOS drifts monotonically, while the Vth of an NMOS shows a "rebound" phenomenon: it drifts negatively at the beginning, then changes to positive drift.

Gate-oxide capture anneals at normal temperature (approximately 20–23.5 °C), anneals faster at high temperatures (e.g., 100 °C), and is therefore recoverable damage. Surface capture accumulates charge slowly and steadily; it does not anneal at normal temperature, nor at high temperatures. Under extreme conditions, high temperatures will even intensify the effect of surface capture, so TID damage dominated by surface capture is difficult or impossible to recover from [12].

1.2.2.2 Single Event Effect (SEE)

The causal chain of SEE damage is as follows:

A high-energy particle strike produces a plasma track; the movement of charge within the track activates parasitic structures or weak components, which results in various kinds of damage. Based on its effect, SEE can be categorized into single event latch-up (SEL), single event upset (SEU), single event burnout (SEB), and so on.

1.2.2.2.1 Single Event Latch‐up (SEL)

SEL is caused externally by current resulting from the potential difference within the “transient plasma needle” before its disappearance. The track of transient plasma is produced by the transient ionization of high‐energy level particles entering the Si and SiO2 areas [13].

SEL is caused internally by the parasitic PNPN structure [14]. The PNPN component parasitizes the complementary MOS (CMOS) circuit, such as the NOT gate shown in Figure 1.9(a). The n+ of the NMOS on the P substrate, the p– of the P substrate, and the n+ of the N well contact pad form a lateral parasitic NPN transistor, Vsub. The p+ of the PMOS on the N well, the n– of the N well, and the p+ of the P substrate contact pad form a vertical parasitic PNP transistor, Vwell. Figure 1.9(b) shows the equivalent parasitic PNPN circuit, in which Rwell and Rsub are the parasitic resistances of the well and substrate contacts, respectively.

Figure 1.9 PNPN component: (a) NOT gate. (b) Equivalent circuit.

Under normal conditions, the collector junction of Vsub and Vwell are zero offset, the emitter junction is positively biased, and the PNPN component is in the cut‐off state. The impedance between Vcc and GND is high. When high‐energy level particles enter between the well and substrate, current is produced in this area because of the potential difference. Consequently, the well and substrate are turned on transiently. The result is a voltage drop on Rwell and Rsub which, in turn, makes the emitter junction of Vsub positively biased. If the positive bias value is sufficiently large, it turns on Vsub, and Vsub turns on Vwell. Consequently, the PNPN component is in a positive feedback state and Vcc and GND are in a low impedance state, with a large current that will fuse metal wire and permanently damage the component if no current limiting measures are taken.

The parasitic PNPN structure is a unique damage mode of a CMOS circuit. There are three requirements for the occurrence of latchup:

The loop gain β must be greater than one.

In order to start the latchup positive feedback, there must be proper excitation to provide necessary bias and starting current.

The electrical power supply must provide enough current to maintain latchup positive feedback.

In a CMOS chip, there are many parasitic resistors and bipolar transistors that may become involved in latchup positive feedback. The consequence is a parasitic circuit that is more complex than that shown in Figure 1.9. Latchup current differs as the number of parasitic circuits varies.

Single event snapback (SES) damage may occur when an NMOS transistor is in the OFF state. The equivalent circuit is the lateral parasitic NPN bipolar transistor and its parasitic base resistor [15]. When heavy ions hit the source, the potential difference between the source and the substrate drives a current through the parasitic resistor, creating the bias needed to excite the parasitic bipolar junction transistor (BJT): the emitter junction becomes positively biased and the collector junction reverse biased. The source–substrate potential difference also supplies, through the parasitic resistor, the current needed to turn the BJT on and to keep it on. Long-term BJT turn-on current results in permanent thermal damage to the NMOS transistor.

1.2.2.2.2 Single Event Upset, Single Event Transient, and Single Event Functional Interrupt (SEU, SET, SEFI)

The rationale for SEU can be described as follows: if the charge movement caused by the potential difference in the vicinity of the plasma track is sufficiently large, the logic state of the affected unit will change (i.e., a logic state upset occurs). When only a single bit in a byte is upset, the event is termed an SEU; the upset of multiple bits is termed a multiple-bit upset (MBU). As illustrated by the SRAM cell shown in Figure 1.10, when heavy-ion bombardment turns on the NMOS transistor in the lower left corner, the logic state of point A changes from one to zero because it is pulled to ground; this, in turn, changes point B's logic state from zero to one. Besides SRAM, sequential components such as flip-flops and latches can also experience SEU failures. In practice, because of the complexity of component structures, the actual fault mechanism is more complex than that depicted in Figure 1.10.

Figure 1.10 Physical essence of SEU damage.

When SEU occurs in a control or configuration unit (e.g., the working mode selector of a network protocol chip, address register of a CPU, or configuration unit of an FPGA), component failure can result (i.e., SEFI) [16].