Accelerators for Convolutional Neural Networks

Comprehensive and thorough resource exploring different types of convolutional neural networks and complementary accelerators

Accelerators for Convolutional Neural Networks provides basic deep learning knowledge and instructive content for Internet of Things (IoT) and edge computing practitioners building convolutional neural network (CNN) accelerators. It elucidates compressive coding for CNNs, presents a two-step lossless input feature maps compression method, discusses an arithmetic coding-based lossless weights compression method and the design of an associated decoding method, describes contemporary sparse CNNs that consider sparsity in both weights and activation maps, and discusses hardware/software co-design and co-scheduling techniques that can lead to better optimization and utilization of the available hardware resources for CNN acceleration.

The first part of the book provides an overview of CNNs along with the composition and parameters of different contemporary CNN models. Later chapters focus on compressive coding for CNNs and the design of dense CNN accelerators. The book also provides directions for future research and development for CNN accelerators.

Other sample topics covered in Accelerators for Convolutional Neural Networks include:

* How to apply arithmetic coding and decoding with range scaling for lossless weight compression for 5-bit CNN weights to deploy CNNs in extremely resource-constrained systems
* State-of-the-art research surrounding dense CNN accelerators, which are mostly based on systolic arrays or parallel multiply-accumulate (MAC) arrays
* The iMAC dense CNN accelerator, which combines image-to-column (im2col) and general matrix multiplication (GEMM) hardware acceleration
* A multi-threaded, low-cost, log-based processing element (PE) core, instances of which are stacked in a spatial grid to form the NeuroMAX dense accelerator
* Sparse-PE, a multi-threaded and flexible CNN PE core that exploits sparsity in both weights and activation maps, instances of which can be stacked in a spatial grid to build sparse CNN accelerators

For researchers in AI, computer vision, computer architecture, and embedded systems, along with graduate and senior undergraduate students in related programs of study, Accelerators for Convolutional Neural Networks is an essential resource for understanding the many facets of the subject and relevant applications.
IEEE Press
445 Hoes Lane Piscataway, NJ 08854
IEEE Press Editorial Board
Sarah Spurgeon,
Editor in Chief
Jón Atli Benediktsson
Behzad Razavi
Jeffrey Reed
Anjan Bose
Jim Lyke
Diomidis Spinellis
James Duncan
Hai Li
Adam Drobot
Amin Moeness
Brian Johnson
Tom Robertazzi
Desineni Subbaram Naidu
Ahmet Murat Tekalp
Arslan Munir, Kansas State University, USA
Joonho Kong, Kyungpook National University, South Korea
Mahmood Azhar Qureshi, Kansas State University, USA
Copyright © 2024 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging‐in‐Publication Data Applied for:
Hardback ISBN: 9781394171880
Cover Design: Wiley
Cover Image: © Gorodenkoff/Shutterstock; Michael Traitov/Shutterstock
Arslan Munir dedicates this book to his wife Neda and his parents for their continuous support.
Joonho Kong dedicates this book to his wife Jiyeon, children Eunseo and Eunyu, and his parents for their continuous support.
Mahmood Azhar Qureshi dedicates this book to his parents, siblings, and his wife Kiran, all of whom provided continuous support throughout his academic and professional career.
Arslan Munir is currently an Associate Professor in the Department of Computer Science at Kansas State University. He was a postdoctoral research associate in the Electrical and Computer Engineering (ECE) Department at Rice University, Houston, Texas, USA, from May 2012 to June 2014. He received his MASc in ECE from the University of British Columbia (UBC), Vancouver, Canada, in 2007 and his PhD in ECE from the University of Florida (UF), Gainesville, Florida, USA, in 2012. From 2007 to 2008, he worked as a software development engineer at Mentor Graphics Corporation in the Embedded Systems Division.
Munir's current research interests include embedded and cyber-physical systems, artificial intelligence, deep learning hardware, computer vision, secure and trustworthy systems, parallel computing, and reconfigurable computing. Munir has received many academic awards, including a doctoral fellowship from the Natural Sciences and Engineering Research Council (NSERC) of Canada. He earned gold medals for best performance in electrical engineering, as well as gold medals and an academic roll of honor for securing rank one in pre-engineering provincial examinations (out of approximately 300,000 candidates). He is a senior member of IEEE.
Joonho Kong is currently an Associate Professor with the School of Electronics Engineering, Kyungpook National University. He received the BS degree in computer science and the MS and PhD degrees in computer science and engineering from Korea University in 2007, 2009, and 2011, respectively. He worked as a postdoctoral research associate with the Department of Electrical and Computer Engineering, Rice University, from 2012 to 2014. Before joining Kyungpook National University, he also worked as a Senior Engineer at Samsung Electronics from 2014 to 2015. His research interests include computer architecture, heterogeneous computing, embedded systems, hardware/software co-design, AI/ML accelerators, and hardware security. He is a member of IEEE.
Mahmood Azhar Qureshi is currently a Senior Design Engineer at Intel Corporation. He received his PhD in Computer Science from Kansas State University, Manhattan, Kansas, in 2021, where he also worked as a research assistant from 2018 to 2021. He received his MS in electrical engineering from the University of Engineering and Technology (UET), Taxila, Pakistan, in 2018 and his BE in electrical engineering from the National University of Sciences and Technology (NUST), Pakistan, in 2013. From 2014 to 2018, he worked as a Senior RTL Design Engineer at the Center for Advanced Research in Engineering (CARE) Pvt. Ltd, Islamabad, Pakistan. During the summer of 2020, he interned at MathWorks, USA, where he was actively involved in adding new features to MATLAB, a tool used globally in industry as well as academia. During fall 2020, he interned at Tesla, working on the failure analysis of the infotainment hardware for the Tesla Model 3 and Model Y global fleet.
Convolutional neural networks (CNNs) have gained tremendous significance in the domain of artificial intelligence (AI) because of their use in a variety of applications related to visual imagery analysis. There has been a drastic increase in the accuracy of CNNs in recent years, which has helped CNNs make their way into real-world applications. This increase in accuracy, however, translates into a sizable model and high computational requirements, which make the deployment of these CNNs in resource-limited computing platforms a challenging endeavor. Thus, embedding CNN inference into various real-world applications requires the design of high-performance, area-efficient, and energy-efficient accelerator architectures. This book targets the design of accelerators for CNNs.
This book is organized into five parts: overview, compressive coding for CNNs, dense CNN accelerators, sparse CNN accelerators, and HW/SW co-design and co-scheduling for CNN acceleration.

The first part of the book provides an overview of CNNs along with the composition of different contemporary CNN models. The book then discusses some of the architectural and algorithmic techniques for efficient processing of CNN models.

The second part of the book discusses compressive coding for CNNs to compress CNN weights and feature maps. This part first discusses Huffman coding for lossless compression of CNN weights and feature maps. The book then elucidates a two-step lossless input feature maps compression method, followed by a discussion of an arithmetic coding and decoding-based lossless weights compression method.

The third part of the book focuses on the design of dense CNN accelerators. The book provides a discussion of contemporary dense CNN accelerators. It then presents the iMAC dense CNN accelerator, which combines image-to-column and general matrix multiplication hardware acceleration, followed by a discussion of another dense CNN accelerator that utilizes log-based processing elements and a 2D data flow to maximize data reuse and hardware utilization.

The fourth part of the book targets sparse CNN accelerators. The book discusses contemporary sparse CNNs that consider sparsity in weights and activation maps (i.e., many weights and activations in CNNs are zero and result in ineffectual computations) to deliver high effective throughput. The book then presents a sparse CNN accelerator that performs in situ decompression and convolution of sparse input feature maps. Afterwards, the book discusses a sparse CNN accelerator that can actively skip a huge number of ineffectual computations (i.e., computations involving zero weights and/or activations), while favoring only effective computations (nonzero weights and nonzero activations), to drastically improve hardware utilization. The book then presents another sparse CNN accelerator that uses a sparse binary mask representation to actively look ahead into sparse computations and dynamically schedule its computational threads to maximize thread utilization and throughput.

The fifth part of the book targets hardware/software co-design and co-scheduling for CNN acceleration. The book discusses hardware/software co-design and co-scheduling that can lead to better optimization and utilization of the available hardware resources for CNN acceleration. The book summarizes recent works on hardware/software co-design and scheduling. It then presents a technique that utilizes software, algorithm, and hardware co-design to reduce the response time of CNN inferences. Afterwards, the book discusses a CPU-accelerator co-scheduling technique that co-utilizes the CPU and CNN accelerators to expedite CNN inference. The book also provides directions for future research and development for CNN accelerators.
This is the first book on the subject of accelerators for CNNs that introduces readers to advances and state-of-the-art research in the design of CNN accelerators. This book can serve as a good reference for students, researchers, and practitioners working in the areas of hardware design, computer architecture, and AI acceleration.
Arslan Munir
January 24, 2023
Manhattan, KS, USA
Deep neural networks (DNNs) have enabled the deployment of artificial intelligence (AI) in many modern applications including autonomous driving [1], image recognition [2], and speech processing [3]. In many applications, DNNs have achieved close to human‐level accuracy and, in some, they have exceeded human accuracy [4]. This high accuracy comes from a DNN's unique ability to automatically extract high‐level features from a huge quantity of training data using statistical learning and improvement over time. This learning over time provides a DNN with an effective representation of the input space. This is quite different from the earlier approaches where specific features were hand‐crafted by domain experts and were subsequently used for feature extraction.
Convolutional neural networks (CNNs) are a type of DNN most commonly used for computer vision tasks. Among different types of DNNs, such as multilayer perceptrons (MLPs), recurrent neural networks (RNNs), long short-term memory (LSTM) networks, radial basis function networks (RBFNs), generative adversarial networks (GANs), restricted Boltzmann machines (RBMs), deep belief networks (DBNs), and autoencoders, CNNs are the most commonly used. The invention of CNNs has revolutionized the field of computer vision and enabled many applications of computer vision to go mainstream. CNNs have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, object detection, activity recognition, natural language processing, brain–computer interfaces, and financial time-series prediction.
DNN/CNN processing is usually carried out in two stages, training and inference, each with its own computational needs. Training is the process where a DNN model is trained using a large application-specific dataset. The training time depends on the model size and the target accuracy requirements. For high-accuracy applications like autonomous driving, training a DNN can take weeks and is usually performed in the cloud. Inference, on the other hand, can be performed either on the cloud or on the edge device (mobile device, Internet of things (IoT) device, autonomous vehicle, etc.). Nowadays, in many applications, it is advantageous to perform the inference process on the edge devices, as shown in Figure 1.1. For example, in cellphones, it is desirable to perform image and video processing on the device itself rather than sending the data over to the cloud for processing. This methodology reduces the communication cost and the latency involved with data transmission and reception. It also eliminates the risk of losing important device features should there be a network disruption or loss of connectivity. Another motivation for doing inference on the device is the ever-increasing security risk involved with sending personalized data, including images and videos, over to cloud servers for processing. Autonomous driving systems that rely on visual data need to perform inference locally to avoid latency and security issues, both of which can result in a catastrophe should an undesirable event occur. Performing DNN/CNN inference on the edge presents its own set of challenges. This stems from the fact that the embedded platforms running on edge devices have stringent cost limitations that limit their compute capabilities. Running compute- and memory-intensive DNN/CNN inference on these devices in an efficient manner therefore becomes a matter of prime importance.
Figure 1.1 DNN/CNN processing methodology.
Source: (b) Daughter#3 ‐ Cecil/Wikimedia Commons/CC BY‐SA 2.0.
Neural nets have been around since the 1940s; however, the first practically applicable neural network, referred to as LeNet [5], was proposed in 1989. This neural network was designed to solve the problem of digit recognition in hand-written numeric digits. It paved the way for the development of neural networks responsible for various applications related to digit recognition, such as automated teller machines (ATMs), optical character recognition (OCR), automatic number plate recognition, and traffic sign recognition. The slow growth and little to no adoption of neural networks in the early days were mainly due to the massive computational requirements involved in their processing, which limited their study to theoretical concepts.
Over the past decade, there has been an exponential growth in research on DNNs, with many new high-accuracy neural networks being deployed for various applications. This has only been possible because of two factors. The first factor is the advancements in the processing power of semiconductor devices and technological breakthroughs in computer architecture. Nowadays, computers have significantly higher computing capability. This enables the processing of a neural network within a reasonable time frame, something that was not achievable in the early days. The second factor is the availability of a large amount of training data. As neural networks learn over time, providing huge amounts of training data enables better accuracy. For example, Meta (the parent company of Facebook) receives close to a billion user images per day, whereas YouTube has 300 hours of video uploaded every minute [6]. This enables the service providers to train their neural networks for targeted advertising campaigns, bringing in billions of dollars of advertising revenue. Apart from their use in social media platforms, DNNs are making a huge impact in many other domains. Some of these areas include:
Speech Processing: Speech processing algorithms have improved significantly in the past few years. Nowadays, many applications have been developed that use DNNs to perform real-time speech recognition with unprecedented levels of accuracy [3, 7–9]. Many technology companies are also using DNNs to perform language translation used in a wide variety of applications. Google, for example, uses Google's neural machine translation system (GNMT) [10], which uses an LSTM-based seq2seq model for their language translation applications.

Autonomous Driving: Autonomous driving has been one of the biggest technological breakthroughs in the auto industry since the invention of the internal combustion engine. It is not a coincidence that the self-driving boom came at the same time as high-accuracy CNNs became increasingly popular. Companies like Tesla [11] and Waymo [12] are using various types of self-driving technology, including visual feeds and Lidar, for their self-driving solutions. One thing which is common in all these solutions is the use of CNNs for visual perception of the road conditions, which is the main back-end technology used in advanced driver assistance systems (ADAS).

Medical AI: Another crucial area where DNNs/CNNs have become increasingly useful is medicine. Nowadays, doctors can use AI-assisted medical imagery to perform various surgeries. AI systems use DNNs in genomics to gather insights about genetic disorders like autism [13, 14]. DNNs/CNNs are also useful in the detection of various types of cancers like skin and brain cancer [15, 16].

Security: The advent of AI has challenged many traditional security approaches that were previously deemed sufficient. The rollout of 5G technology has caused a massive surge of IoT-based deployments, which traditional security approaches are not able to keep up with. Physical unclonability approaches [17–21] were introduced to protect this massive deployment of IoTs against security attacks with minimum cost overheads. These approaches, however, were also unsuccessful in preventing AI-assisted attacks using DNNs [22, 23]. Researchers have now been forced to upgrade security threat models to incorporate AI-based attacks [24, 25]. Because of a massive increase in AI-assisted cyber-attacks on clouds and datacenters, companies have realized that the best way of defeating offensive AI attacks is by incorporating AI-based counterattacks [26, 27].
Overall, the use of DNNs, in particular CNNs, in various applications has seen exponential growth over the past decade, and this trend continues to rise. The massive increase in CNN deployments on edge devices requires the development of efficient processing architectures to keep up with the computational requirements of CNN inference.
This section discusses some of the pitfalls of high‐accuracy DNN/CNN models focusing on compute and energy bottlenecks, and the effect of sparsity of high‐accuracy models on throughput and hardware utilization.
CNNs are composed of multiple convolution layers (CONV) which help in extracting low‐, mid‐, and high‐level input features for better accuracy. Although CNNs are primarily used in applications related to image and video processing, they are also used in speech processing [3, 7], gameplay [28], and robotics [29] applications. We will further discuss the basics of CNNs in Chapter 2. In this section, we explore some of the bottlenecks when it comes to implementing high‐accuracy CNN inference engines in embedded mobile devices.
Table 1.1 Popular CNN models.

CNN model        Layers   Top-1 accuracy (%)   Top-5 accuracy (%)   Parameters   MACs
AlexNet [30]        8          63.3                 84.6               62M        666M
VGG-16 [31]        16          74.3                 91.9              138M       15.3B
GoogleNet [35]     22          68.9                 88                6.8M        1.5B
MobileNet [35]     28          70.9                 89.9              4.2M        569M
ResNet-50 [32]     50          75.3                 92.2             25.5M        3.9B
The development of high-accuracy CNN models [30–34] in recent years has strengthened the notion of employing DNNs in various AI applications. The classification accuracy of CNNs for the ImageNet challenge [2] has improved considerably, from 63.3% in 2012 (AlexNet [30]) to a staggering 87.3% (EfficientNetV2 [4]) in 2021. This high jump in accuracy comes with high compute and energy costs for CNN inference. Table 1.1 shows some of the most commonly used CNN models. The models are trained using the ImageNet dataset [2], and their top-1 and top-5 classification accuracies are also given. We note that top-1 accuracy is the conventional accuracy, which means that the model answer (i.e., the one predicted by the model with the highest probability) must be exactly the expected answer. Top-5 accuracy means that any of the five highest-probability answers predicted by the model must match the expected answer. It can be seen from Table 1.1 that the addition of more layers results in better accuracy. This addition, however, also corresponds to a greater number of model parameters, requiring more memory and storage. It also results in more multiply-accumulate (MAC) operations, causing an increase in computational complexity and resource requirements, which in turn affects the performance of edge devices.
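To make the top-1 and top-5 definitions concrete, the following is a minimal sketch (not from the book) of how top-k accuracy could be computed from raw model scores with NumPy; the function name and the toy data are illustrative.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: (num_samples, num_classes) array of model outputs (logits or probabilities)
    labels: (num_samples,) array of ground-truth class indices
    """
    # Indices of the k largest scores per sample (order within the top k does not matter)
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

# Toy example: 3 samples, 10 classes
rng = np.random.default_rng(0)
scores = rng.random((3, 10))
labels = np.array([2, 7, 4])
print("top-1:", top_k_accuracy(scores, labels, k=1))
print("top-5:", top_k_accuracy(scores, labels, k=5))
```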
Even though some efforts have been made to reduce the size of high-accuracy models, they still require massive amounts of computation over a series of network layers to perform a particular inference task (classification, segmentation, etc.). This tremendous number of computations (typically in the tens of millions) presents a huge challenge for the neural network accelerators (NNAs) running the CNN inference. NNAs are specialized hardware blocks inside a computer system (e.g., mobile devices and cloud servers) that speed up the computations of the CNN inference process to maintain the real-time requirements of the system and improve system throughput. Apart from the massive computational requirements, the addition of more layers for higher accuracy drastically increases the CNN model size. This prevents the CNN model from being stored in the limited on-chip static random access memory (SRAM) of the edge device and, therefore, requires off-chip dynamic random access memory (DRAM), which carries a high access energy cost.
Figure 1.2 Energy cost (relative to the 8 bit Add operation) shown on a log₁₀ scale for a 45 nm process technology.
Source: Adapted from [6, 36].
To put this in perspective, the energy cost per fetch for 32 bit coefficients in an off-chip low-power double data rate 2 (LPDDR2) DRAM is about 640 pJ, which is about 6400× the energy cost of a 32 bit integer ADD operation [36]. The bigger the model is, the more memory referencing is performed to access the model data, which in turn expends more energy. Figure 1.2 shows the energy cost of various compute and memory operations relative to an 8 bit integer add (8 bit INT Add) operation. It can be seen that the DRAM Read operation dominates the energy graph, with the 32 bit DRAM Read consuming more than four orders of magnitude higher energy than the 8 bit INT Add. As a consequence, the energy cost from the DRAM accesses alone would be well beyond the limitations of an embedded mobile device with limited battery life. Therefore, in addition to accelerating the compute operations, the NNA also needs to minimize off-chip memory transactions to decrease the overall energy consumption.
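As a rough, hedged illustration of why model size matters, the sketch below combines the ~640 pJ per 32-bit DRAM fetch figure quoted above with the parameter counts from Table 1.1 to estimate the energy of streaming the weights from DRAM just once; it ignores activations, data reuse, and caching, so it is only a back-of-the-envelope estimate.

```python
# Back-of-the-envelope DRAM energy for reading model weights once,
# using the ~640 pJ per 32-bit off-chip LPDDR2 fetch figure quoted above.
DRAM_FETCH_PJ = 640  # energy per 32-bit DRAM fetch (pJ)

models = {"AlexNet": 62e6, "VGG-16": 138e6, "ResNet-50": 25.5e6}  # parameter counts from Table 1.1

for name, params in models.items():
    energy_mj = params * DRAM_FETCH_PJ * 1e-12 * 1e3  # pJ -> J -> mJ
    print(f"{name}: ~{energy_mj:.1f} mJ just to stream the weights from DRAM once")
```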
Many algorithm-level techniques have been developed to minimize the computational requirements of a CNN without incurring a loss in accuracy. Since the main compute bottleneck in CNN inference is the CONV operation, MobileNets [33, 34] were developed to reduce the total number of CONV operations. These CNNs drastically reduce the total number of parameters and MAC operations by breaking down the standard 2D convolution into depthwise and pointwise convolutions (together known as depthwise separable convolution). The depthwise and pointwise convolutions result in roughly an 8× to 9× reduction in total computations compared to regular CONV operations, with a slight decrease in accuracy. They also eliminate varying filter sizes and, instead, use 3 × 3 and 1 × 1 filters for performing convolution operations. This makes them ideal for embedded mobile devices because of their relatively low memory footprint and lower total MAC operations.
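The following sketch (with illustrative layer dimensions, not taken from the MobileNet papers) counts MACs for a standard convolution versus its depthwise-plus-pointwise factorization, showing where the roughly 8–9× reduction for 3 × 3 kernels comes from.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    # Each of the h*w*c_out outputs needs a k*k*c_in dot product
    return h * w * c_out * k * k * c_in

def depthwise_separable_macs(h, w, c_in, c_out, k):
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution across channels
    return depthwise + pointwise

# Illustrative layer: 56x56 feature map, 128 -> 128 channels, 3x3 kernel
std = standard_conv_macs(56, 56, 128, 128, 3)
sep = depthwise_separable_macs(56, 56, 128, 128, 3)
print(f"standard: {std/1e6:.0f}M MACs, separable: {sep/1e6:.0f}M MACs, "
      f"reduction: {std/sep:.1f}x")   # roughly 8-9x for 3x3 kernels
```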
A widely used approach for decreasing the memory bottleneck is the reduction in the precision of both weights and activations using various quantization strategies [37–39]. This again does not result in a significant loss in accuracy and reduces the model size by a considerable amount. Hardware implementations like Envision [40], UNPU [41], and Stripes [42] show how reduced bit precision and quantization translate into better energy savings.
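As a hedged illustration of the basic idea (not the specific schemes of [37–39]), the sketch below applies symmetric uniform quantization to a random weight tensor and reports the storage saving and the resulting approximation error.

```python
import numpy as np

def quantize_symmetric(weights, num_bits=8):
    """Map float weights to signed integers using a single per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8 bits
    scale = np.abs(weights).max() / qmax       # largest-magnitude weight maps to qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_symmetric(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"storage: {w.nbytes} -> {q.nbytes} bytes, mean abs error: {err:.5f}")
```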
Nonlinear activation functions [6], in addition to deep layers, are one of the key characteristics that improve the accuracy of a CNN model. Typically, nonlinearity is added by incorporating activation functions, the most common being the rectified linear unit (ReLU) [6]. The ReLU converts all negative values in a feature map to zeros. Since the output of one layer is the input to the next layer, many of the computations within a layer involve multiplication with zeros. These feature maps containing zeros are referred to as one-sided sparse feature maps. The multiplications resulting from this one-sided sparsity waste compute cycles and decrease the effective throughput and hardware utilization, thus reducing the performance of the accelerator. It also results in high energy costs, as the transfer of zeros to/from off-chip memory is wasted memory access. In order to reduce the computational and memory access volume, previous works [43–45] have exploited this one-sided sparsity and demonstrated some performance improvements. Exacerbating the issue of wasted compute cycles and memory accesses, two-sided sparsity arises in CNNs, often through pruning techniques, when the weight data also consists of zeros in addition to the feature maps. Designing a CNN accelerator that can overcome the wasted compute cycles and memory accesses of one-sided and two-sided sparsity is quite challenging.
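A small sketch of the effect described above, assuming a synthetic zero-mean feature map: ReLU zeroes out roughly half of the activations, and every such zero turns downstream multiply-accumulates into ineffectual work unless the hardware skips them.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic pre-activation feature map (zero-mean, so roughly half the values are negative)
pre_act = rng.standard_normal((64, 28, 28)).astype(np.float32)
act = np.maximum(pre_act, 0.0)               # ReLU

sparsity = (act == 0).mean()
print(f"activation sparsity after ReLU: {sparsity:.1%}")
# Every zero activation feeding the next layer turns its multiply-accumulates
# into ineffectual work unless the accelerator explicitly skips them.
```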
In recent years, many pruning techniques have been developed for the compression of DNN models [46–49]. Han et al. [46] iteratively pruned connections based on a parameter threshold and performed retraining to retain accuracy. This type of pruning is referred to as unstructured pruning. It arbitrarily removes weight connections in a DNN/CNN but does little to improve acceleration on temporal architectures like central processing units (CPUs) and graphics processing units (GPUs), which rely on accelerating matrix multiplications. Another form of pruning, referred to as structured pruning [50, 51], reduces the size of weight matrices and maintains a full matrix. This makes it possible to simplify the NNA design since the sparsity patterns are predictable, thereby enabling better hardware support for operation scheduling.
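The sketch below shows magnitude-based unstructured pruning in the spirit of Han et al. [46]; the single-shot thresholding and the 90% sparsity target are illustrative, and the retraining step of the original method is omitted.

```python
import numpy as np

def magnitude_prune(weights, target_sparsity=0.9):
    """Zero out the smallest-magnitude weights until target_sparsity is reached."""
    threshold = np.quantile(np.abs(weights), target_sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.random.randn(512, 512).astype(np.float32)
pruned, mask = magnitude_prune(w, target_sparsity=0.9)
print(f"kept {mask.mean():.1%} of weights")   # ~10% nonzero weights remain
# In the iterative scheme of [46], pruning alternates with retraining so that the
# remaining weights recover the accuracy lost by removing connections.
```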
Both unstructured and structured pruning strategies, as described above, result in two-sided sparsity (i.e., sparsity in both weights and activations), which leads to approximately 9× model reduction for AlexNet and 13× reduction for VGG-16. The pruning strategies also result in a 4–9× effective compute reduction (depending on the model). These gains seem very promising; however, designing an accelerator architecture to leverage them is quite challenging because of the following reasons:
Data Access Inconsistency: Computation gating is one of the most common ways by which sparsity is generally exploited. Whenever a zero in the activation or the weight data is read, no operation is performed. This results in energy savings but has no impact on throughput because the compute cycle is still wasted. Complex read logic needs to be implemented to discard the zeros and, instead, perform effective computations on nonzero data. Some previous works [52, 53] use sparse compression formats like compressed sparse column (CSC) or compressed sparse row (CSR) to represent sparse data (a minimal CSR sketch follows this list). These formats have variable lengths and make looking ahead difficult if both the weight and the activation sparsity are being considered. Other than that, developing the complex control and read logic to process these formats can be quite challenging.

Low Utilization of the Processing Element (PE) Array: Convolution operations for CNN inference are usually performed using a two-dimensional array of PEs in a CNN accelerator. Different dataflows (input stationary, output stationary, weight stationary, etc.) have been proposed that efficiently map the weight data and the activation data onto the PE array to maximize throughput [6, 54]. Sparsity introduces inconsistency in the scheduling of data, thereby reducing hardware utilization. The subset of PEs provided with more sparse data have idle times, while those provided with less sparse (or denser) data are fully active. This bounds the throughput of the accelerator to the most active PEs and, therefore, leads to underutilization of the PE array.
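The following is a minimal sketch of the textbook CSR layout referenced in the first item above (not necessarily the exact format used in [52, 53]); the comments point out why the variable-length rows make lookahead hard.

```python
import numpy as np

def to_csr(matrix):
    """Textbook CSR encoding: nonzero values, their column indices, and row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))          # each row occupies a variable-length slice
    return np.array(values), np.array(col_idx), np.array(row_ptr)

w = np.array([[0, 2, 0, 0],
              [0, 0, 0, 0],
              [1, 0, 0, 3]], dtype=np.float32)
values, col_idx, row_ptr = to_csr(w)
print(values, col_idx, row_ptr)
# Row i lives in values[row_ptr[i]:row_ptr[i+1]]; because these slices differ in
# length, hardware cannot tell where a future row starts without walking the
# pointers, which is what makes lookahead across two-sided sparsity difficult.
```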
Considering the abovementioned issues, many accelerators have been proposed in the past that attempt to strike a balance between hardware resource complexity and performance improvements. The CNN accelerators that exploit sparsity in CNN models are covered in detail in Part IV of this book.
This chapter discussed the history and applications of DNNs, focusing on CNNs. The chapter also highlighted the compute and energy bottlenecks as well as the effect of sparsity in high‐accuracy CNN models on the throughput and hardware utilization of edge devices.
This chapter gives an overview of the composition of different deep neural network (DNN) models, specifically convolutional neural networks (CNNs), and explores the different layers that these neural networks are composed of. Additionally, this chapter describes some of the most popular high-accuracy CNN models and the datasets upon which these CNNs operate. Finally, the chapter reviews some of the architectural and algorithmic techniques for efficient processing of high-accuracy CNN models on edge devices.
DNNs are a manifestation of the notion of deep learning, which comes under the umbrella of artificial intelligence (AI). The main inspiration behind DNNs is the way different neurons work in a brain to process and communicate information. The raw sensory input is transformed into a high‐level abstraction in order to extract meaningful information and make decisions. This transformation, referred to as inference (or forward propagation) in DNNs, results from many stages of nonlinear processing, with each stage called a layer.
Figure 2.1 shows a simple DNN with four hidden layers and one input and output layer. The DNN layers receive a weighted sum of input values (Σᵢ wᵢxᵢ, where wᵢ denotes the weights and xᵢ denotes the inputs) and compute outputs using a nonlinear function, referred to as the activation function. Many activation functions have been proposed in the literature, some of which are shown in Figure 2.2. The rectified linear unit (ReLU) is one of the most commonly used activation functions in many modern state-of-the-art DNNs. The weights in each layer are determined through the process of training. Once the training is complete after meeting a desired accuracy, the trained model is deployed on computer servers or often on edge devices where inference is performed.
Figure 2.1 A neural network example with one input layer, four hidden layers, and one output layer.
Figure 2.2 Various nonlinear activation functions.
Source: Figure adapted from DNN tutorial survey [6].
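To tie Figures 2.1 and 2.2 together, here is a minimal sketch of a single neuron as described above: a weighted sum of inputs followed by a nonlinear activation (ReLU in this illustration); the input and weight values are arbitrary.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def neuron(inputs, weights, bias, activation=relu):
    """One neuron: weighted sum of inputs followed by a nonlinear activation."""
    return activation(np.dot(weights, inputs) + bias)

x = np.array([0.5, -1.2, 3.0])      # inputs from the previous layer
w = np.array([0.8, 0.1, -0.4])      # learned weights
print(neuron(x, w, bias=0.2))        # negative weighted sum -> ReLU outputs 0.0
```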
DNNs come in various shapes and sizes depending on the target application. Multi-layer perceptrons (MLPs) are DNNs that consist of many fully connected (FC) layers, with each layer followed by a nonlinear activation function. In MLPs, each neuron in layer i is connected to every neuron in layer i + 1. In this way, the FC layer computations can be generalized as matrix–vector multiplications followed by the activation function. The FC layer computation can be represented as:

y = f(Wx + b),

where x is the input vector, W is the weight matrix, b is the bias, f is the activation function, and y represents the output activation vector. It can be seen that the DNN in Figure 2.1 is an example of an MLP network.
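A minimal sketch of the FC layer computation y = f(Wx + b) described above, using NumPy and illustrative layer sizes.

```python
import numpy as np

def fc_layer(x, W, b, activation=lambda z: np.maximum(z, 0.0)):
    """Fully connected layer: y = f(Wx + b), with ReLU as the default activation."""
    return activation(W @ x + b)

rng = np.random.default_rng(1)
x = rng.standard_normal(128)            # input activation vector
W = rng.standard_normal((64, 128))      # weight matrix (64 outputs x 128 inputs)
b = np.zeros(64)                        # bias vector
y = fc_layer(x, W, b)
print(y.shape)                          # (64,)
```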
CNNs are a specialized type of DNN that utilize a mathematical operation called convolution instead of general matrix multiplication in at least one of their layers. A CNN is composed of an input layer, hidden layers, and an output layer. In any feed-forward neural network, the middle layers between the input layer and the output layer are known as hidden layers. The hidden layers take a set of weighted inputs and produce output through an activation function. In a CNN, the hidden layers include layers that perform convolution operations. In a convolutional layer of the CNN, each neuron receives input from only a confined area of the previous layer (e.g., a square of 3 × 3 or 5 × 5 dimension) called the neuron's receptive field. In an FC layer of the CNN (further elaborated in the following), the receptive field is the entire previous layer. Each neuron in a neural network applies a specific function to the input values received from the receptive field in the previous layer to compute an output value. The input value received by the function in a neuron depends on a vector of weights and biases, which are learned iteratively as the neural network is being trained on the input data. In CNNs, the vectors of weights and biases are called filters, which capture specific features of the input. A distinctive feature of CNNs as opposed to other DNNs or MLPs is that a filter in CNNs can be shared by many neurons (i.e., weight or filter sharing). As illustrated in Figure 2.3, the same weights are shared between the neurons across receptive fields in the case of CNNs. This sharing of filters in CNNs reduces the memory footprint because a single vector of weights and a single bias can be used across all the receptive fields that share that filter, as opposed to having a separate weight vector and bias for each receptive field. The other distinctive feature of CNNs as compared to MLPs is the sparse connection between input and output neurons, meaning that only a small fraction of the input neurons are connected to a certain output neuron. As shown in Figure 2.3, the neurons are sparsely connected in CNNs, while all the input neurons are connected to all the output neurons (i.e., are densely connected) in MLPs.
Figure 2.3 The demonstration of (a) MLPs and (b) CNNs.
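The sketch below illustrates the weight sharing and confined receptive fields described above: a single 3 × 3 filter is reused at every output position of a valid 2D convolution (deep-learning convention, i.e., no kernel flip); the dimensions are illustrative.

```python
import numpy as np

def conv2d_single_filter(image, kernel):
    """Valid 2D convolution with one shared filter (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=image.dtype)
    for i in range(oh):
        for j in range(ow):
            receptive_field = image[i:i + kh, j:j + kw]   # confined input region
            out[i, j] = np.sum(receptive_field * kernel)  # same shared weights every time
    return out

image = np.random.rand(8, 8).astype(np.float32)
kernel = np.random.rand(3, 3).astype(np.float32)          # one shared 3x3 filter
print(conv2d_single_filter(image, kernel).shape)           # (6, 6)
```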
An overall architecture of a CNN is depicted in Figure 2.4. The input for CNNs is typically composed of three channels (red, green, and blue [RGB]), each of which comprises two‐dimensional pixel arrays. The convolution layer performs the convolution operations with filters to extract features from the inputs. The pooling layer