Artificial Intelligence Hardware Design (E-Book)

Albert Chun-Chen Liu

Description

ARTIFICIAL INTELLIGENCE HARDWARE DESIGN

Learn foundational and advanced topics in Neural Processing Unit design with real-world examples from leading voices in the field.

In Artificial Intelligence Hardware Design: Challenges and Solutions, distinguished researchers and authors Drs. Albert Chun Chen Liu and Oscar Ming Kin Law deliver a rigorous and practical treatment of the design of application-specific circuits and systems for accelerating neural network processing. Beginning with a discussion of neural networks and their development history, the book goes on to describe parallel architectures, streaming graphs for massively parallel computation, and convolution optimization. The authors illustrate in-memory computation through Georgia Tech's Neurocube and Stanford's Tetris accelerator using the Hybrid Memory Cube, as well as near-memory architecture through the embedded eDRAM of the Institute of Computing Technology, Chinese Academy of Sciences, and other institutions. Readers will also find a discussion of 3D neural processing techniques to support multilayer neural networks, as well as information like:

* A thorough introduction to neural networks and their development history, as well as Convolutional Neural Network (CNN) models

* Explorations of various parallel architectures, including the Intel CPU, Nvidia GPU, Google TPU, and Microsoft NPU, emphasizing hardware and software integration for performance improvement

* Discussions of streaming graphs for massively parallel computation with the Blaize GSP and Graphcore IPU

* An examination of how to optimize convolution with the UCLA Deep Convolutional Neural Network accelerator's filter decomposition

Perfect for hardware and software engineers and firmware developers, Artificial Intelligence Hardware Design is an indispensable resource for anyone working with Neural Processing Units in either a hardware or software capacity.


Page count: 196

Publication year: 2021




Table of Contents

Cover

Series Page

Title Page

Copyright Page

Author Biographies

Preface

Acknowledgments

Table of Figures

1 Introduction

1.1 Development History

1.2 Neural Network Models

1.3 Neural Network Classification

1.4 Neural Network Framework

1.5 Neural Network Comparison

Exercise

References

2 Deep Learning

2.1 Neural Network Layer

2.2 Deep Learning Challenges

Exercise

References

3 Parallel Architecture

3.1 Intel Central Processing Unit (CPU)

3.2 NVIDIA Graphics Processing Unit (GPU)

3.3 NVIDIA Deep Learning Accelerator (NVDLA)

3.4 Google Tensor Processing Unit (TPU)

3.5 Microsoft Catapult Fabric Accelerator

Exercise

References

4 Streaming Graph Theory

4.1 Blaize Graph Streaming Processor

4.2 Graphcore Intelligence Processing Unit

Exercise

References

5 Convolution Optimization

5.1 Deep Convolutional Neural Network Accelerator

5.2 Eyeriss Accelerator

Exercise

References

6 In‐Memory Computation

6.1 Neurocube Architecture

6.2 Tetris Accelerator

6.3 NeuroStream Accelerator

Exercise

References

7 Near‐Memory Architecture

7.1 DaDianNao Supercomputer

7.2 Cnvlutin Accelerator

Exercise

References

8 Network Sparsity

8.1 Energy Efficient Inference Engine (EIE)

8.2 Cambricon‐X Accelerator

8.3 SCNN Accelerator

8.4 SeerNet Accelerator

Exercise

References

9 3D Neural Processing

9.1 3D Integrated Circuit Architecture

9.2 Power Distribution Network

9.3 3D Network Bridge

9.4 Power‐Saving Techniques

Exercise

References

Appendix A: Neural Network Topology

Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1 Neural network framework.

Chapter 2

Table 2.1 AlexNet neural network model.

Chapter 3

Table 3.1 Intel Xeon family comparison.

Table 3.2 NVIDIA GPU architecture comparison.

Table 3.3 TPU v1 applications.

Table 3.4 Tensor processing unit comparison.

Chapter 5

Table 5.1 Efficiency loss comparison.

Table 5.2 DNN accelerator performance comparison.

Table 5.3 Eyeriss v2 architectural hierarchy.

Table 5.4 Eyeriss architecture.

Chapter 6

Table 6.1 Neurocube performance comparison.

Chapter 8

Table 8.1 SeerNet system performance comparison.

List of Illustrations

Chapter 1

Figure 1.1 High‐tech revolution.

Figure 1.2 Neural network development timeline.

Figure 1.3 ImageNet challenge.

Figure 1.4 Neural network model.

Figure 1.5 Regression.

Figure 1.6 Clustering.

Figure 1.7 Neural network top 1 accuracy vs. computational complexity.

Figure 1.8 Neural network top 1 accuracy density vs. model efficiency [14]....

Figure 1.9 Neural network memory utilization and computational complexity [1...

Chapter 2

Figure 2.1 Deep neural network AlexNet architecture [1].

Figure 2.2 Deep neural network AlexNet model parameters.

Figure 2.3 Deep neural network AlexNet feature map evolution [3].

Figure 2.4 Convolution function.

Figure 2.5 Nonlinear activation functions.

Figure 2.6 Pooling functions.

Figure 2.7 Dropout layer.

Figure 2.8 Deep learning hardware issues [1].

Chapter 3

Figure 3.1 Intel Xeon processor E5 2600 family Grantley platform ring archit...

Figure 3.2 Intel Xeon processor scalable family Purley platform mesh archite...

Figure 3.3 Two‐socket configuration.

Figure 3.4 Four‐socket ring configuration.

Figure 3.5 Four‐socket crossbar configuration.

Figure 3.6 Eight‐socket configuration.

Figure 3.7 Sub‐NUMA cluster domains [3].

Figure 3.8 Cache hierarchy comparison.

Figure 3.9 Intel multiple sockets parallel processing.

Figure 3.10 Intel multiple socket training performance comparison [4].

Figure 3.11 Intel AVX‐512 16 bits FMA operations (VPMADDWD + VPADDD).

Figure 3.12 Intel AVX‐512 with VNNI 16 bits FMA operation (VPDPWSSD).

Figure 3.13 Intel low‐precision convolution.

Figure 3.14 Intel Xeon processor training throughput comparison [2].

Figure 3.15 Intel Xeon processor inference throughput comparison [2].

Figure 3.16 NVIDIA Turing GPU architecture.

Figure 3.17 NVIDIA GPU shared memory.

Figure 3.18 Tensor core 4 × 4 × 4 matrix operation [9].

Figure 3.19 Turing tensor core performance [7].

Figure 3.20 Matrix D thread group indices.

Figure 3.21 Matrix D 4 × 8 elements computation.

Figure 3.22 Different size matrix multiplication.

Figure 3.23 Simultaneous multithreading (SMT).

Figure 3.24 Multithreading schedule.

Figure 3.25 GPU with HBM2 architecture.

Figure 3.26 Eight GPUs NVLink2 configuration.

Figure 3.27 Four GPUs NVLink2 configuration.

Figure 3.28 Two GPUs NVLink2 configuration.

Figure 3.29 Single GPU NVLink2 configuration.

Figure 3.30 NVDLA core architecture.

Figure 3.31 NVDLA small system model.

Figure 3.32 NVDLA large system model.

Figure 3.33 NVDLA software dataflow.

Figure 3.34 Tensor processing unit architecture.

Figure 3.35 Tensor processing unit floorplan.

Figure 3.36 Multiply–Accumulate (MAC) systolic array.

Figure 3.37 Systolic array matrix multiplication.

Figure 3.38 Cost of different numerical format operation.

Figure 3.39 TPU brain floating‐point format.

Figure 3.40 CPU, GPU, and TPU performance comparison [15].

Figure 3.41 Tensor Processing Unit (TPU) v1.

Figure 3.42 Tensor Processing Unit (TPU) v2.

Figure 3.43 Tensor Processing Unit (TPU) v3.

Figure 3.44 Google TensorFlow subgraph optimization.

Figure 3.45 Microsoft Brainwave configurable cloud architecture.

Figure 3.46 Torus network topology.

Figure 3.47 Microsoft Brainwave design flow.

Figure 3.48 The Catapult fabric shell architecture.

Figure 3.49 The Catapult fabric microarchitecture.

Figure 3.50 Microsoft low‐precision quantization [27].

Figure 3.51 Matrix‐vector multiplier overview.

Figure 3.52 Tile engine architecture.

Figure 3.53 Hierarchical decode and dispatch scheme.

Figure 3.54 Sparse matrix‐vector multiplier architecture.

Figure 3.55 (a) Sparse Matrix; (b) CSR Format; and (c) CISR Format.

Chapter 4

Figure 4.1 Data streaming TCS model.

Figure 4.2 Blaize depth‐first scheduling approach.

Figure 4.3 Blaize graph streaming processor architecture.

Figure 4.4 Blaize GSP thread scheduling.

Figure 4.5 Blaize GSP instruction scheduling.

Figure 4.6 Streaming vs. sequential processing comparison.

Figure 4.7 Blaize GSP convolution operation.

Figure 4.8 Intelligence processing unit architecture [8].

Figure 4.9 Intelligence processing unit mixed‐precision multiplication.

Figure 4.10 Intelligence processing unit single‐precision multiplication.

Figure 4.11 Intelligence processing unit interconnect architecture [9].

Figure 4.12 Intelligence processing unit bulk synchronous parallel model.

Figure 4.13 Intelligence processing unit bulk synchronous parallel execution...

Figure 4.14 Intelligence processing unit bulk synchronous parallel inter‐chi...

Chapter 5

Figure 5.1 Deep convolutional neural network hardware architecture.

Figure 5.2 Convolution computation.

Figure 5.3 Filter decomposition with zero padding.

Figure 5.4 Filter decomposition approach.

Figure 5.5 Data streaming architecture with the data flow.

Figure 5.6 DCNN accelerator COL buffer architecture.

Figure 5.7 Data streaming architecture with 1×1 convolution mode.

Figure 5.8 Max pooling architecture.

Figure 5.9 Convolution engine architecture.

Figure 5.10 Accumulation (ACCU) buffer architecture.

Figure 5.11 Neural network model compression.

Figure 5.12 Eyeriss system architecture.

Figure 5.13 2D convolution to 1D multiplication mapping.

Figure 5.14 2D convolution to 1D multiplication – step #1.

Figure 5.15 2D convolution to 1D multiplication – step #2.

Figure 5.16 2D convolution to 1D multiplication – step #3.

Figure 5.17 2D convolution to 1D multiplication – step #4.

Figure 5.18 Output stationary.

Figure 5.19 Output stationary index looping.

Figure 5.20 Weight stationary.

Figure 5.21 Weight stationary index looping.

Figure 5.22 Input stationary.

Figure 5.23 Input stationary index looping.

Figure 5.24 Eyeriss Row Stationary (RS) dataflow.

Figure 5.25 Filter reuse.

Figure 5.26 Feature map reuse.

Figure 5.27 Partial sum reuse.

Figure 5.28 Eyeriss run‐length compression.

Figure 5.29 Eyeriss processing element architecture.

Figure 5.30 Eyeriss global input network.

Figure 5.31 Eyeriss processing element mapping (AlexNet CONV1).

Figure 5.32 Eyeriss processing element mapping (AlexNet CONV2).

Figure 5.33 Eyeriss processing element mapping (AlexNet CONV3).

Figure 5.34 Eyeriss processing element mapping (AlexNet CONV4/CONV5).

Figure 5.35 Eyeriss processing element operation (AlexNet CONV1).

Figure 5.36 Eyeriss processing element operation (AlexNet CONV2).

Figure 5.37 Eyeriss processing element (AlexNet CONV3).

Figure 5.38 Eyeriss processing element operation (AlexNet CONV4/CONV5).

Figure 5.39 Eyeriss architecture comparison.

Figure 5.40 Eyeriss v2 system architecture.

Figure 5.41 Network‐on‐Chip configurations.

Figure 5.42 Mesh network configuration.

Figure 5.43 Eyeriss v2 hierarchical mesh network examples.

Figure 5.44 Eyeriss v2 input activation hierarchical mesh network.

Figure 5.45 Weights hierarchical mesh network.

Figure 5.46 Eyeriss v2 partial sum hierarchical mesh network.

Figure 5.47 Eyeriss v1 neural network model performance [6].

Figure 5.48 Eyeriss v2 neural network model performance [6].

Figure 5.49 Compressed sparse column format.

Figure 5.50 Eyeriss v2 PE architecture.

Figure 5.51 Eyeriss v2 row stationary plus dataflow.

Figure 5.52 Eyeriss architecture AlexNet throughput speedup [6].

Figure 5.53 Eyeriss architecture AlexNet energy efficiency [6].

Figure 5.54 Eyeriss architecture MobileNet throughput speedup [6].

Figure 5.55 Eyeriss architecture MobileNet energy efficiency [6].

Chapter 6

Figure 6.1 Neurocube architecture.

Figure 6.2 Neurocube organization.

Figure 6.3 Neurocube 2D mesh network.

Figure 6.4 Memory‐centric neural computing flow.

Figure 6.5 Programmable neurosequence generator architecture.

Figure 6.6 Neurocube programmable neurosequence generator.

Figure 6.7 Tetris system architecture.

Figure 6.8 Tetris neural network engine.

Figure 6.9 In‐memory accumulation.

Figure 6.10 Global buffer bypass.

Figure 6.11 NN partitioning scheme comparison.

Figure 6.12 Tetris performance and power comparison [7].

Figure 6.13 NeuroStream and NeuroCluster architecture.

Figure 6.14 NeuroStream coprocessor architecture.

Figure 6.15 NeuroStream 4D tiling.

Figure 6.16 NeuroStream roofline plot [8].

Chapter 7

Figure 7.1 DaDianNao system architecture.

Figure 7.2 DaDianNao neural functional unit architecture.

Figure 7.3 DaDianNao pipeline configuration.

Figure 7.4 DaDianNao multi‐node mapping.

Figure 7.5 DaDianNao timing performance (Training) [1].

Figure 7.6 DaDianNao timing performance (Inference) [1].

Figure 7.7 DaDianNao power reduction (Training) [1].

Figure 7.8 DaDianNao power reduction (Inference) [1].

Figure 7.9 DaDianNao basic operation.

Figure 7.10 Cnvlutin basic operation.

Figure 7.11 DaDianNao architecture.

Figure 7.12 Cnvlutin architecture.

Figure 7.13 DaDianNao processing order.

Figure 7.14 Cnvlutin processing order.

Figure 7.15 Cnvlutin zero free neuron array format.

Figure 7.16 Cnvlutin dispatch.

Figure 7.17 Cnvlutin timing comparison [4].

Figure 7.18 Cnvlutin power comparison [4].

Figure 7.19 Cnvlutin2 ineffectual activation skipping.

Figure 7.20 Cnvlutin2 ineffectual weight skipping.

Chapter 8

Figure 8.1 EIE leading nonzero detection network.

Figure 8.2 EIE processing element architecture.

Figure 8.3 Deep compression weight sharing and quantization.

Figure 8.4 Matrix W, vector a and b are interleaved over four processing ele...

Figure 8.5 Matrix W layout in compressed sparse column format.

Figure 8.6 EIE timing performance comparison [1].

Figure 8.7 EIE energy efficient comparison [1].

Figure 8.8 Cambricon‐X architecture.

Figure 8.9 Cambricon‐X processing element architecture.

Figure 8.10 Cambricon‐X sparse compression.

Figure 8.11 Cambricon‐X buffer controller architecture.

Figure 8.12 Cambricon‐X index module architecture.

Figure 8.13 Cambricon‐X direct indexing architecture.

Figure 8.14 Cambricon‐X step indexing architecture.

Figure 8.15 Cambricon‐X timing performance comparison [4].

Figure 8.16 Cambricon‐X energy efficiency comparison [4].

Figure 8.17 SCNN convolution.

Figure 8.18 SCNN convolution nested loop.

Figure 8.19 PT‐IS‐CP‐dense dataflow.

Figure 8.20 SCNN architecture.

Figure 8.21 SCNN dataflow.

Figure 8.22 SCNN weight compression.

Figure 8.23 SCNN timing performance comparison [5].

Figure 8.24 SCNN energy efficiency comparison [5].

Figure 8.25 SeerNet architecture.

Figure 8.26 SeerNet Q‐ReLU and Q‐max‐pooling.

Figure 8.27 SeerNet quantization.

Figure 8.28 SeerNet sparsity‐mask encoding.

Chapter 9

Figure 9.1 2.5D interposer architecture.

Figure 9.2 3D stacked architecture.

Figure 9.3 3D‐IC PDN configuration (pyramid shape).

Figure 9.4 Conventional PDN Manhattan geometry.

Figure 9.5 Novel PDN X topology.

Figure 9.6 3D network bridge.

Figure 9.7 Neural network layer multiple nodes connection.

Figure 9.8 3D network switch.

Figure 9.9 3D network bridge segmentation.

Figure 9.10 Multiple‐channel bidirectional high‐speed link.

Figure 9.11 Power switch configuration.

Figure 9.12 3D neural processing power gating approach.

Figure 9.13 3D neural processing clock gating approach.

Guide

Cover Page

Series Page

Title Page

Copyright Page

Author Biographies

Preface

Acknowledgments

Table of Figures

Table of Contents

Begin Reading

Appendix A Neural Network Topology

Index

Wiley End User License Agreement


IEEE Press, 445 Hoes Lane, Piscataway, NJ 08854

IEEE Press Editorial Board

Ekram Hossain, Editor in Chief

Jón Atli Benediktsson 

Xiaoou Li 

Jeffrey Reed

Anjan Bose 

Lian Yong 

Diomidis Spinellis

David Alan Grier 

Andreas Molisch 

Saeid Nahavandi

Elya B. Joffe 

Sarah Spurgeon 

Ahmet Murat Tekalp

Artificial Intelligence Hardware Design

Challenges and Solutions

Albert Chun Chen Liu and Oscar Ming Kin Law

Kneron Inc., San Diego, CA, USA

Copyright © 2021 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging‐in‐Publication data applied for:

ISBN: 9781119810452

Cover design by Wiley

Cover image: © Rasi Bhadramani/iStock/Getty Images

Author Biographies

Albert Chun Chen Liu is Kneron's founder and CEO. He is an Adjunct Associate Professor at National Tsing Hua University, National Chiao Tung University, and National Cheng Kung University. After graduating from National Cheng Kung University in Taiwan, he received scholarships from Raytheon and the University of California to join the UC Berkeley/UCLA/UCSD research programs, and he earned his Ph.D. in Electrical Engineering from the University of California, Los Angeles (UCLA). Before establishing Kneron in San Diego in 2015, he held R&D and management positions at Qualcomm, Samsung Electronics R&D Center, MStar, and Wireless Information.

Albert has been invited to lecture on computer vision technology and artificial intelligence at the University of California and to serve as a technical reviewer for many internationally renowned academic journals. He holds more than 30 international patents in artificial intelligence, computer vision, and image processing, and has published more than 70 papers. He received the IBM Problem Solving Award, based on the use of the EIP tool suite, in 2007 and the IEEE TCAS Darlington Award in 2021.

Oscar Ming Kin Law developed his interest in smart robots in 2014. He has successfully integrated deep learning with self-driving cars, smart drones, and robotic arms, and he is currently working on humanoid development. He received his Ph.D. in Electrical and Computer Engineering from the University of Toronto, Canada.

Oscar currently works at Kneron on in-memory computing and smart robot development. He has worked at ATI Technologies, AMD, TSMC, and Qualcomm, where he led various groups in chip verification, standard cell design, signal integrity, power analysis, and Design for Manufacturability (DFM). He has conducted seminars at the University of California, San Diego, the University of Toronto, Qualcomm, and TSMC, and he has published over 60 patents in various areas.

Preface

With the breakthrough of the Convolutional Neural Network (CNN) for image classification in 2012, Deep Learning (DL) has successfully solved many complex problems and is widely used in everyday life, automotive, finance, retail, and healthcare. In 2016, Artificial Intelligence (AI) surpassed human performance when Google's AlphaGo, trained through Reinforcement Learning (RL), defeated the world Go champion. The AI revolution is gradually changing our world, much as the personal computer (1977), the Internet (1994), and the smartphone (2007) did. However, most efforts have focused on software development rather than on the hardware challenges:

Big input data

Deep neural network

Massive parallel processing

Reconfigurable network

Memory bottleneck

Intensive computation

Network pruning

Data sparsity

This book shows how to resolve these hardware problems through various designs ranging from the CPU, GPU, and TPU to the NPU. Novel hardware can evolve from these designs for further performance and power improvements:

Parallel architecture

Streaming Graph Theory

Convolution optimization

In‐memory computation

Near‐memory architecture

Network sparsity

3D neural processing

Organization of the Book

Chapter 1 introduces neural networks and discusses their development history.

Chapter 2 reviews the Convolutional Neural Network (CNN) model and describes the function of each layer with examples.

Chapter 3 surveys several parallel architectures: the Intel CPU, Nvidia GPU, Google TPU, and Microsoft NPU. It emphasizes hardware/software integration for performance improvement. The open-source Nvidia Deep Learning Accelerator (NVDLA) project is chosen for FPGA hardware implementation.

Chapter 4 introduces streaming graphs for massively parallel computation through the Blaize GSP and Graphcore IPU, which apply Depth First Search (DFS) for task allocation and the Bulk Synchronous Parallel (BSP) model for parallel operations.

Chapter 5 shows how to optimize convolution with the University of California, Los Angeles (UCLA) Deep Convolutional Neural Network (DCNN) accelerator's filter decomposition and the Massachusetts Institute of Technology (MIT) Eyeriss accelerator's Row Stationary dataflow.

Chapter 6 illustrates in-memory computation through the Georgia Institute of Technology Neurocube and the Stanford Tetris accelerator, both using the Hybrid Memory Cube (HMC), as well as the University of Bologna NeuroStream accelerator using the Smart Memory Cube (SMC).

Chapter 7 highlights near-memory architecture through the DaDianNao supercomputer from the Institute of Computing Technology (ICT), Chinese Academy of Sciences, and the University of Toronto Cnvlutin accelerator. It also shows how Cnvlutin avoids ineffectual zero operations.

Chapter 8 addresses network sparsity through the Stanford Energy Efficient Inference Engine (EIE), the Cambricon-X accelerator from the Institute of Computing Technology (ICT), Chinese Academy of Sciences, the Massachusetts Institute of Technology (MIT) SCNN processor, and the Microsoft SeerNet accelerator.

Chapter 9 introduces innovative 3D neural processing with a network bridge to overcome power and thermal challenges. It also addresses the memory bottleneck and supports large neural network processing.

In the English edition, several chapters have been rewritten with more detailed descriptions, and new deep learning hardware architectures have been added. Exercises challenge the reader to solve problems beyond the scope of this book. The instructional slides are available upon request.

We shall continue to explore different deep learning hardware architectures (e.g. for Reinforcement Learning) and to work on an in-memory computing architecture with a new high-speed arithmetic approach. Compared with the Google Brain floating-point (BFP16) format, the new approach offers a wider dynamic range, higher performance, and less power dissipation. It will be included in a future revision.
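For readers unfamiliar with the Google Brain floating-point format mentioned above, the minimal Python sketch below (not from the book; the helper names are ours) illustrates its commonly cited layout, assuming the usual bfloat16 convention of 1 sign bit, 8 exponent bits, and 7 mantissa bits. Under that assumption, converting from float32 amounts to keeping the upper 16 bits of the float32 bit pattern, which preserves float32's dynamic range while reducing mantissa precision.

import struct

def float32_to_bf16_bits(x: float) -> int:
    # Pack x as a big-endian float32, then keep the top 16 bits:
    # 1 sign bit, 8 exponent bits, and the upper 7 mantissa bits.
    f32_bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return f32_bits >> 16

def bf16_bits_to_float32(b: int) -> float:
    # Re-expand the 16-bit pattern to float32 for inspection.
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

if __name__ == "__main__":
    # Values spanning a wide range; the exponent (hence the range) is kept,
    # only mantissa precision is lost by this simple truncation.
    for v in (3.140625, 1e-38, 3.0e38):
        b = float32_to_bf16_bits(v)
        print(f"{v:>12g} -> 0x{b:04x} -> {bf16_bits_to_float32(b):g}")

Hardware implementations typically round rather than truncate; the snippet truncates only to make the bit layout visible.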

Albert Chun Chen Liu

Oscar Ming Kin Law

Acknowledgments

First, we would like to thank all who have supported the publication of the book. We are thankful to Iain Law and Enoch Law for the manuscript preparation and project development. We would like to thank Lincoln Lee and Amelia Leung for reviewing the content. We also thank Claire Chang, Charlene Jin, and Alex Liao for managing the book production and publication. In addition, we are grateful to the readers of the Chinese edition for their valuable feedback on improving the content of this book. Finally, we would like to thank our families for their support throughout the publication of this book.

Albert Chun Chen Liu

Oscar Ming Kin Law

Table of Figures

1.1 High-tech revolution

1.2 Neural network development timeline

1.3 ImageNet challenge

1.4 Neural network model

1.5 Regression

1.6 Clustering

1.7 Neural network top 1 accuracy vs. computational complexity

1.8