DEMYSTIFYING DEEP LEARNING Discover how to train Deep Learning models by learning how to build real Deep Learning software libraries and verification software!
The study of Deep Learning and Artificial Neural Networks (ANNs) is a significant subfield of artificial intelligence (AI) found within numerous fields: medicine, law, financial services, and science, for example. Just as the robot revolution threatened blue-collar jobs in the 1970s, so now the AI revolution promises a new era of productivity for white-collar jobs. Important tasks have begun to be taken over by ANNs, from disease detection and prevention, to reading and supporting legal contracts, to understanding experimental data, modeling protein folding, and modeling hurricanes. AI is everywhere--in the news, in think tanks, and on the agendas of government policy makers all over the world--and ANNs often provide the backbone for AI.
Relying on an informal and succinct approach, Demystifying Deep Learning is a useful tool for learning the steps needed to implement ANN algorithms, using both a software library for neural network training and the verification software to test it. The volume explains how real ANNs work and includes six practical examples that demonstrate, in real code, how to build ANNs and the datasets they need, all available as open source to ensure practical usage. This approachable book follows ANN techniques that are used every day as they are adapted to natural language processing, image recognition, problem solving, and generative applications. It is an important introduction to the field, equipping the reader for more advanced study.
Demystifying Deep Learning readers will also find:
* A volume that emphasizes the importance of classification
* Discussion of why ANN libraries, such as TensorFlow and PyTorch, are written in C++ rather than Python
* A "Projects" section concluding each chapter to encourage students to experiment with real code
* A supporting library of software to accompany the book at https://github.com/nom-de-guerre/RANT
* An approachable explanation of how generative AI, such as generative adversarial networks (GANs), really works
* An accessible motivation and elucidation of how transformers, the basis of large language models (LLMs) such as ChatGPT, work
Demystifying Deep Learning is ideal for engineers and professionals who need to learn and understand ANNs in their work. It is also a helpful text for advanced undergraduates seeking a solid grounding in the topic.
Page count: 417
Publication year: 2023
Cover
Table of Contents
Title Page
Copyright
About the Author
Acronyms
1 Introduction
1.1 AI/ML – Deep Learning?
1.2 A Brief History
1.3 The Genesis of Models
1.4 Numerical Computation – Computer Numbers Are Not Real
1.5 Summary
1.6 Projects
Notes
2 Deep Learning and Neural Networks
2.1 Feed‐Forward and Fully‐Connected Artificial Neural Networks
2.2 Computing Neuron State
2.3 The Feed‐Forward ANN Expressed with Matrices
2.4 Classification
2.5 Summary
2.6 Projects
Notes
3 Training Neural Networks
3.1 Preparing the Training Set: Data Preprocessing
3.2 Weight Initialization
3.3 Training Outline
3.4 Least Squares: A Trivial Example
3.5 Backpropagation of Error for Regression
3.6 Stochastic Sine
3.7 Verification of a Software Implementation
3.8 Summary
3.9 Projects
Notes
4 Training Classifiers
4.1 Backpropagation for Classifiers
4.2 Computing the Derivative of the Loss
4.3 Multilabel Classification
4.4 Summary
4.5 Projects
Note
5 Weight Update Strategies
5.1 Stochastic Gradient Descent
5.2 Weight Updates as Iteration and Convex Optimization
5.3 RPROP+
5.4 Momentum Methods
5.5 Levenberg–Marquardt Optimization for Neural Networks
5.6 Summary
5.7 Projects
Notes
6 Convolutional Neural Networks
6.1 Motivation
6.2 Convolutions and Features
6.3 Filters
6.4 Pooling
6.5 Feature Layers
6.6 Training a CNN
6.7 Applications
6.8 Summary
6.9 Projects
7 Fixing the Fit
7.1 Quality of the Solution
7.2 Generalization Error
7.3 Classification Performance
7.4 Regularization
7.5 Advanced Normalization
7.6 Summary
7.7 Projects
Notes
8 Design Principles for a Deep Learning Training Library
8.1 Computer Languages
8.2 The Matrix: Crux of a Library Implementation
8.3 The Framework
8.4 Summary
8.5 Projects
Notes
9 Vistas
9.1 The Limits of ANN Learning Capacity
9.2 Generative Adversarial Networks
9.3 Reinforcement Learning
9.4 Natural Language Processing Transformed
9.5 Neural Turing Machines
9.6 Summary
9.7 Projects
Notes
Appendix A: Mathematical Review
A.1 Linear Algebra
A.2 Basic Calculus
A.3 Advanced Matrices
A.4 Probability
Notes
Glossary
References
Index
End User License Agreement
Chapter 3
Table 3.1 Important Activation Functions
Table 3.2 Verification Results
Chapter 5
Table 5.1 Table of Times and Accuracy for a Sample of Minibatches When ...
Table 5.2 Comparison of Strategies
Chapter 8
Table 8.1 Comparison of Performance Between C++ and Python.
Chapter 1
Figure 1.1 Examples of GAN‐generated cats. The matrix on the left contains e...
Figure 1.2 The graph of $t(h)$ is plotted with empty circles as points. The ANN's ...
Figure 1.3 IEEE‐754 representation for the value of . The exponent is biase...
Chapter 2
Figure 2.1 A trained ANN that has learnt the sine function. The circles, gra...
Figure 2.2 A trained ANN that has learnt the cosine function. The only diffe...
Figure 2.3 The output of two ANNs is superimposed on the ground truth for th...
Figure 2.4 The sine model in more detail. The layers are labeled. The left i...
Figure 2.5 Three popular activation functions. The two on the left are super...
Figure 2.6 A binary diabetes classifier. The predictors are continuous, but ...
Figure 2.7 A trained classifier for the Iris dataset. There are 4 predictors a...
Chapter 3
Figure 3.1 An ANN with a preprocessing layer. The preprocessing nodes are in...
Figure 3.2 The fit of a least squares solution to a random sample of points ...
Figure 3.3 The decision boundaries for two classification problems. The exam...
Figure 3.4 A trained ANN model that has learnt the sine function. The layers...
Figure 3.5 The sigmoid activation function plotted in black. Its derivative ...
Figure 3.6 MSE loss for four training runs of a sine ANN. They all show the ...
Figure 3.7 The output of an ANN learning the sine function. The black curve ...
Figure 3.8 An ANN with a differencing verification layer. The differencing l...
Chapter 4
Figure 4.1 Likelihood for the fairness of a coin following 15 tosses and obs...
Figure 4.2 Detail of an iris classifier's terminal layer. The softmax layer ...
Figure 4.3 An example of a multilabel classifier. The outputs corresponding ...
Chapter 5
Figure 5.1 Four training runs using SGD to train a classifier to recognize t...
Figure 5.2 A loss surface for a hypothetical ANN. A static choice for is p...
Figure 5.3 An example run of Newton's Method for optimization for the cosine...
Figure 5.4 An example path for RPROP+ during the training of a sine ANN. It ...
Figure 5.5 Computed densities of observed weight updates. Training was initi...
Figure 5.6 Densities of log scale losses of models following training. LM lo...
Chapter 6
Figure 6.1 Five examples of hand‐written twos from the MNIST dataset. The im...
Figure 6.2 The repeated application of a kernel to produce a feature map. Th...
Figure 6.3 The result of applying the kernels in (6.1) to detect features. T...
Figure 6.4 The figure depicts a image and the feature map that results fol...
Figure 6.5 An example of a complete CNN. The CNN is a classifier that accept...
Figure 6.6 The top row is the result of applying a set of 5 filters to an ex...
Figure 6.7 An example of a pooling layer in a forward training pass and the ...
Figure 6.8 Backpropagation from a pooling layer to a filter. The filter grad...
Figure 6.9 Some example weight updates in a filter. A weight accumulates the...
Chapter 7
Figure 7.1 Sample of hand‐written 3s from the MNIST dataset.
Figure 7.2 kNN models for a static dataset. The data are produced as pairs o...
Figure 7.3 The classical bias‐variance trade‐off. The minimum of the error o...
Figure 7.4 A confusion matrix for the Iris dataset and a trained ANN. As there are ...
Figure 7.5 An example of an ANN with a dropout layer. The left side shows a ...
Figure 7.6 The architecture of a dropout implementation. The dropout logic i...
Figure 7.7 An ANN with a batch normalization layer. The 1:1 links correspond...
Figure 7.8 The unnormalized means ribboned with the standard deviation for 5...
Figure 7.9 The distribution of means by type. The layer normalization densit...
Chapter 8
Figure 8.1 A C program and the resultant assembly code following compilation...
Figure 8.2 Detail of the memory hierarchy. The DRAM DIMM at the bottom is wh...
Figure 8.3 Two options for the physical memory layout of a matrix, . The ba...
Figure 8.4 The physical layout of the input matrix is on the left. The matri...
Figure 8.5 The representation of an ANN graph. Layers are implemented by inh...
Chapter 9
Figure 9.1 An example of generating twos. During training, 4 sample twos wer...
Figure 9.2 A selection of hand‐written 2's from the MNIST dataset.
Figure 9.3 The GAN game. The generator, , produces fakes and attempts to fo...
Figure 9.4 A chess game following the opening moves of both players. Blue we...
Figure 9.5 (a) Markov Decision Process and (b) the resulting Markov Chain fo...
Figure 9.6 An example training run for tic‐tac‐toe. The progression of the g...
Figure 9.7 The transformer architecture. The figure on the left depicts the ...
Figure 9.8 The transformer block for multi‐head attention. The changes are c...
IEEE Press
445 Hoes Lane Piscataway, NJ 08854
IEEE Press Editorial Board
Sarah Spurgeon, Editor in Chief
Jón Atli Benediktsson
Behzad Razavi
Jeffrey Reed
Anjan Bose
Jim Lyke
Diomidis Spinellis
James Duncan
Hai Li
Adam Drobot
Amin Moeness
Brian Johnson
Tom Robertazzi
Desineni Subbaram Naidu
Ahmet Murat Tekalp
Douglas J. Santry
University of Kent, United Kingdom
Copyright © 2024 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging‐in‐Publication Data Applied for:
Hardback ISBN: 9781394205608
Cover Design: Wiley
Cover Image: © Yuichiro Chino/Getty Images
Douglas J. Santry, PhD, MSc, is a Lecturer in Computer Science at the University of Kent, UK. Prior to his current position, he worked extensively in industry at Apple Computer Corp., NetApp, and Goldman Sachs. At NetApp, he conducted research into embedded and real‐time machine learning techniques.
AI
artificial intelligence
ANN
artificial neural network
BERT
bidirectional encoder representation for transformers
BN
Bayesian network
BPG
backpropagation
CNN
convolutional neural network
CNN
classifying neural network
DL
deep learning
FFFC
feed‐forward fully‐connected
GAN
generative adversarial network
GANN
generative artificial neural network
GPT
generative pre‐trained transformer
LLM
large language model
LSTM
long short term memory
ML
machine learning
MLE
maximum likelihood estimator
MSE
mean squared error
NLP
natural language processing
RL
reinforcement learning
RNN
recurrent neural network
SGD
stochastic gradient descent
Interest in deep learning (DL) is increasing every day. It has escaped from the research laboratories and become a daily fact of life. The achievements and potential of DL are reported in the lay news and form the subject of discussion at dinner tables, cafes, and pubs across the world. This is an astonishing change of fortune considering the technology upon which it is founded was pronounced a research dead end in 1969 (131) and largely abandoned.
The universe of DL is a veritable alphabet soup of bewildering acronyms. There are artificial neural networks (ANNs), RNNs, LSTMs, CNNs, Generative Adversarial Networks (GANs), and more are introduced every day. The types and applications of DL are proliferating rapidly, and the acronyms grow in number with them. As DL is successfully applied to new problem domains, this trend will continue. Since 2015, the number of artificial intelligence (AI) patents filed per annum has been growing at a rate of 76.6% and shows no signs of slowing down (169). The growth rate speaks to the increasing investment in DL and suggests that it is still accelerating.
DL is based on ANNs; often just "neural networks" is written, with the "artificial" implied. ANNs attempt to mathematically model biological assemblies of neurons. The initial goal of research into ANNs was to realize AI in a computer. The motivation and means were to mimic the biological mechanisms of cognitive processes in animal brains. This led to the idea of modeling the networks of neurons in brains. If biological neural networks could be modeled accurately with mathematics, then computers could be programmed with the models. Computers would then be able to perform tasks that were previously thought possible only for humans; the dream of the electronic brain was born (151). Two problem domains were of particular interest: natural language processing (NLP) and image recognition. These were areas where brains were thought to be the only viable instrument; today, these applications are only the tip of the iceberg.
In the field of image recognition, DL has achieved spectacular results and, by some metrics, is out‐performing humans. Image recognition is the task of finding and identifying objects in an image. DL has a better record than humans (13; 63) at recognizing the ImageNet (32) test suite, an important database of millions of photographs. Computer vision has become so reliable for some tasks that it is common for motor cars to offer features based on it, and in some cases, cars can even drive themselves. In airports and shopping malls, we are continually monitored by CCTV, but often it is a computer, not a human, performing the monitoring (39). Some CCTV monitors look for known thieves and automatically alert the staff, or even the local police, when they are spotted in a shop (165). This can lead to problems. When courts and the police do not understand how to interpret the results of the software, great injustices can follow.
One such example is that of Robert Julian‐Borchak Williams (66). Mr. Williams' case is a cautionary tale. AI image recognition software is not evidence and does not claim to be. It is meant to point law enforcement in a promising direction of investigation; it is a complement to investigation, not a substitute. But too often the police assume the computer's hint is a formal allegation and treat it as such. Mr. Williams was accused by the police of a crime that he did not commit. The police were acting on information from AI image recognition software, and they treated it as conclusive because they did not understand what the computer was telling them. A computer suggested that the video of a shoplifter in a shop could be Mr. Williams. As a result, a warrant was obtained on the basis of the computer's identification. All the "safeguards," such as corroborating evidence, despite being formal policy of the police department, were ignored, and Mr. Williams had a nightmare visited upon him. He was arrested, processed, and charged with no effort on the part of the police to confirm the computer's suggestion. This scenario has grown so frequent that there are real concerns about the police and the courts using AI technology as an aid to their work. Subsequently, Amazon, IBM, and Microsoft withdrew their facial recognition software from police use pending federal regulation (59). DL, like any tool, must be used responsibly to provide the greatest benefit and mitigate harm.
DL ANNs have also made tremendous progress in the field of NLP. Natural language is how people communicate, such as English or Japanese. Computers are just elaborate calculators with no capacity for inference or context; hence, people use programming languages to talk to computers. The current state of the art in NLP is based on transformers (155) (see Section 9.4 for details). Transformers have driven the rapid progress in language models and NLP tools since 2017. Moreover, progress in NLP systems is outstripping the test suites. A popular language comprehension benchmark, the General Language Understanding Evaluation (GLUE) (158), was quickly mastered by research systems, leading to its replacement by SuperGLUE in the space of a year (159). SuperGLUE will soon be upgraded. Another important benchmark, the Stanford Question Answering Dataset 2.0 (SQuAD) (121), has also been mastered1 and is anticipating an update to increase the challenge. The test suites are currently too easy for modern NLP systems. This is impressive, as the bar was not set low: DL ANNs are, on average, outperforming humans on both test suites, so the suites can fairly be described as genuinely challenging.
Of particular note is OpenAI's ChatGPT; it has dazzled the world (128). The author recently had to change the questions for his university course assignments because the students were using ChatGPT to produce complete answers. Because ChatGPT can understand English, some students were cutting and pasting the question, in plain English, into the ChatGPT prompt and doing the reverse with the response. ChatGPT is able to produce Python code that is correct. The irony of students using DL to cheat on AI course work was not lost on him.
A lot of the debate surrounding ChatGPT has centered on its abilities, what it can and cannot do reliably, but to dwell on that is to miss the point. The true import of ChatGPT is not what it can do today. ChatGPT is not perfect, and its creators never claimed it was; far from it. The free version used by most of the world was made available to aid in identifying and fixing problems. ChatGPT is a point in a trend; its capabilities today are not what matters. The real point is the implication of what language models will be capable of in five to ten years. The coming language models will clearly be extremely powerful. Businesses and professions that think they are safe because ChatGPT is not perfect are taking terrible risks. There is a misconception that it is low‐skilled jobs that will experience the most change, and that the professions will remain untouched as they have been for decades. This is a mistake. The real application of DL is not in low‐skilled jobs. Factories and manufacturing were already disrupted starting in the 1970s with the introduction of automation. DL is going to make the professions, such as medicine and law, more productive. It is the high‐skilled jobs that are going to experience the most disruption. A study by OpenAI examining the potential of its language models suggested that up to 80% of the US workforce would experience some form of change resulting from language models (38). This may be a conservative estimate.
Perhaps one of the most interesting advances of DL is the emergence of systems that produce meaningful content. The systems mentioned so far either classify, inflect (e.g. translate), or "comprehend" input data. Systems that produce material instead of consuming it are known as generative. When produced with DL, they are known as a Generative Artificial Neural Network (GANN). ChatGPT is an example of a generative language model. Images and videos can also be generated. A GANN can draw an image; this is very different from learning to recognize an image. A powerful means of building GANNs is with GANs (50); again, very much an alphabet soup. As an example, a GAN can be taught impressionist painting by training it with pictures by the impressionist masters. The GAN will then produce a novel painting very much in the genre of impressionism. The quality of the images generated is remarkable. Figure 1.1 displays an example of cats produced by a GAN (81). The GAN was trained to learn what cats look like and produce examples. The object is to produce photorealistic synthetic cats. Products such as Adobe Photoshop have included this facility for general use by the public (90). In the sphere of video and audio, GANs are producing the so‐called "deep fake" videos that are of very high quality. Deep fakes are becoming increasingly difficult for humans to detect. In the age of information war and disinformation, the ramifications are serious. GANs are performing tasks at levels undreamt of a few decades ago; the quality can be striking, and even troubling. As new applications are identified for GANs, the resources dedicated to improving them will continue to grow and produce ever more spectacular results.
Figure 1.1 Examples of GAN‐generated cats. The matrix on the left contains examples from the training set. The matrix on the right are GAN‐generated cats. The cats on the right do not exist. They were generated by the GAN.
Source: Karras et al. (81).
It is all too common to see the acronym AI/ML, which stands for artificial intelligence/machine learning, and worse to see the terms used interchangeably. AI, as the name implies, is the study of simulating or creating intelligence, and even defining intelligence. It is a field of study encompassing many areas including, but not limited to, machine learning. AI researchers can also be biologists (histology and neurology), psychologists, mathematicians, computer scientists, and philosophers. What is intelligence? What are the criteria for certifying something as intelligent? These are philosophical questions as much as technical challenges. How can AI be defined without an understanding of “natural” intelligence? That is a question that lies more in the biological realm than that of technology. Machine Learning is a subfield of AI. DL and ANNs are a subfield of machine learning.
The polymath, Alan Turing, suggested what has come to be known as the Turing Test2 in 1950 (153). He argued that if a machine could fool a human by convincing the human that it is human too, then the computer is “intelligent.” He proposed concealing a human and a computer and linking them over a teletype to a third party, a human evaluator. If the human evaluator could not distinguish between the human and the computer, then, effectively, the computer could be deemed “intelligent.” It is an extremely controversial assertion, but a useful one in 1950. It has formed an invaluable basis for discussion ever since. An influential argument forwarded in 1980 by the philosopher, John Searle, asserts that a machine can never realize real intelligence in a digital computer. Searle argued that a machine that could pass the Turing test was not necessarily intelligent. He proposed a thought experiment called the Chinese Room (135). The Turing test was constrained to be performed in Chinese, and it was accepted that a machine could be programmed to pass the test. Searle argued that there is an important distinction between simulating Chinese and understanding Chinese. The latter is the true mark of intelligence. He characterized the difference as “weak AI” and “strong AI”. A computer executing a program of instructions is not thinking, and Searle argued that is all a computer could ever do. There is a large body of literature, some of which predates Turing's contribution and dates back to Leibniz (96; 98), debating the point. OpenAI's recent offering, ChatGPT, is a perfect example of this dichotomy. The lay press speculates (128) on whether it is intelligent, but clearly it is an example of “weak AI.” The product can pass the Turing test, but it is not intelligent.
To understand what machine learning is, one must place it in relation to AI. It is a means of realizing some aspect of AI in a digital computer; it is a subfield of AI. Tom Mitchell, who wrote a seminal text on machine learning (105), provides a useful definition of machine learning: “A computer program is said to learn3 from experience, E, with respect to some class of tasks, T, and performance measure, P, if its performance at tasks in T, as measured by P, improves with experience E.” [Page 2]. Despite first appearances, this really is a very concise definition. So while it is clear that machine learning is related to AI, the reverse is not necessarily true. DL and ANN are, in turn, specializations of machine learning. Thus, while DL is a specialization of AI, not all AI topics are necessarily connected to DL.
The object of this book is to present a canonical mathematical basis for DL concisely and directly, with an emphasis on practical implementation; as such, the reference approach is consciously eschewed. It is not an attempt to cover everything, as it cannot. The field of DL has advanced to the point where both its depth and breadth call for a series of books. But a brief history of DL is clearly indicated. DL evolved from ANNs, and so the history begins with them. The interested reader is directed to (31) for a more thorough history and to (57; 127) for a stronger biological motivation.
ANNs are inspired by, and originally attempted to simulate, biological neural networks. Naturally, research into biological neural networks predated ANNs. During the nineteenth century, great strides were taken, and it was an interdisciplinary effort. As physicists began to explain electricity and scientists placed chemistry on a firm scientific footing, the basis was created for a proper understanding of the biological phenomena that depended on them. Advances in grinding lenses, combined with a better appreciation of the light condenser, led to a dramatic increase in the quality of microscopes. The stage was set for histologists, biologists, and anatomists to make progress in identifying and understanding tissues and cell differentiation. Putting all those pieces together yielded new breakthroughs in understanding the biology of living things in every sphere.
Alexander Bain and William James made independent seminal contributions (8; 76). They postulated that physical action, the movement of muscles, was directed and controlled by neurons in the brain and communicated with electrical signals. Santiago Ramón y Cajal (167) and Sir Charles Sherrington (136) put the study of neurology on a firm footing with their descriptions of neurons and synapses; both would go on to win Nobel prizes for their contributions in 1906 and 1932, respectively.
By the 1940s, a firm understanding of biological neurons had been developed. Computer science was nascent, but fundamental results were developed. In the 1930s, Alonzo Church had described his Lambda Calculus model of computation (21), and his student, Alan Turing, had defined his Turing Machine4 (152), both formal models of computation. The age of modern computation was dawning. Warren McCulloch and Walter Pitts wrote a number of papers that proposed artificial neurons to simulate Turing machines (164). Their first paper was published in 1943. They showed that artificial neurons could implement logic and arithmetic functions. Their work hypothesized networks of artificial neurons cooperating to implement higher‐level logic. They did not implement or evaluate their ideas, but researchers had now begun thinking about artificial neurons.
Donald Hebb, an eminent psychologist, wrote a book in 1949 postulating a learning rule for artificial neurons (65). It is a supervised learning rule. While the rule itself is numerically unstable, it contains many of the ingredients of modern ANNs. Hebb's neurons computed state based on the scalar product and weighted the connections between the individual neurons. Connections between neurons were reinforced based on use. While modern learning rules and network topologies are different, Hebb's work was prescient. Many of the elements of modern ANNs are recognizable, such as a neuron's state computation, response propagation, and a general network of weighted connections.
The next step to modern ANNs was Frank Rosenblatt's perceptron (130). Rosenblatt published his first paper in 1958. Building on Hebb's neuron, he proposed an updated supervised learning rule called the perceptron rule. Rosenblatt was interested in computer vision. His first implementation was in software on an IBM 704 mainframe (it had 18 k of memory!). Perceptrons were eventually implemented in hardware. The machine was a contemporary marvel fitted with an array of cadmium sulfide photocells used to create a 400 pixel input image. The New York Times reported it with the headline, “Electronic Brain Teaches Itself.” Hebb's neuron state was improved with the introduction of a bias, an innovation still very important today. Perceptrons were capable of learning linear decision boundaries, that is, the categories of classification had to be linearly separable.
The next milestone was a paper by Widrow and Hoff in 1960 that proposed a new learning rule, the delta rule. It was more numerically stable than the perceptron learning rule. Their research system was called ADALINE (15) and used least squares to train the network. Like Rosenblatt's early work, ADALINE was implemented in hardware, with memistors. The follow‐up system, MADALINE (163), included multiple layers of perceptrons, another step toward modern ANNs. It suffered from a similar limitation as Rosenblatt's perceptrons in that it could only address linearly separable problems; it was a composition of linear classifiers.
In 1969, Minsky and Papert published a book that cast a pall over ANN research (106). They demonstrated that ANNs, as they were understood at that point, suffer from an inherent limitation. It was argued that ANNs could never solve "interesting" problems; but the assertion was based on the assumption that ANNs could never practically handle nonlinear decision boundaries. They famously used the example of the XOR logic gate. As the XOR truth table could not be learnt by an ANN, and XOR is a trivial concept when compared to image recognition and other applications, they concluded that ANNs were not appropriate for the latter applications. As most interesting problems are nonlinear, including vision and NLP, they concluded that the ANN was a research dead end. Their book had the effect of chilling research in ANNs for many years as the AI community accepted their conclusion. It coincided with a general reassessment of the practicality of AI research in general and the beginning of the first "AI Winter."
The fundamental problem facing ANN researchers was how to train multiple layers of an ANN to solve nonlinear problems. While there were multiple independent developments, Rumelhart, Hinton, and Williams are generally credited with the work that described the backpropagation of error algorithm in the context of training ANNs (34). This was published in 1986. Backpropagation of error is still the basis of the majority of modern ANN training algorithms. Their method provided a means of training ANNs to learn nonlinear problems reliably.
It was also in 1986 that Rina Dechter coined the term, “Deep Learning” (30). The usage was not what is meant by DL today. She was describing a backtracking algorithm for theorem proving with Prolog programs.
The confluence of two trends, the dissemination of the backpropagation algorithm and the advent of widely available workstations, led to unprecedented experimentation and advances in ANNs. By 1989, in the space of just 3 years, ANNs had been successfully trained to recognize hand‐written digits in the form of postal codes from the United States Postal Service. This feat was achieved by a team led by Yann LeCun at AT&T Labs (91). The work had all the recognizable features of DL, but the term had not yet been applied to neural networks in that sense. The system would evolve into LeNet‐5, a classic DL model. The renewed interest in ANN research has continued unbroken down to this day. In 2006, Hinton et al. described a multi‐layered belief network referred to as a "Deep Belief Network" (67). The usage arguably led to referring to deep neural networks as DL. The introduction of AlexNet in 2012 demonstrated how to efficiently use GPUs to train DL models (89). AlexNet set records in image recognition benchmarks. Since AlexNet, DL models have dominated most machine learning applications; it heralded the DL Age of machine learning.
We leave our abridged history here and conclude with a few thoughts. As the computing power required to train ANNs grows ever cheaper, access to the resources required for research becomes more widely available. The IBM Supercomputer, ASCI White, cost US$110 million in 2001 and occupied a special purpose room. It had 8192 processors for a total of 123 billion transistors with a peak performance of 12.3 TFLOPS.5 In 2023, an Apple Mac Studio costs US$4000, contains 114 billion transistors, and offers peak performance of 20 TFLOPS. It sits quietly and discreetly on a desk. In conjunction with improvements in hardware, there is a change in the culture of disseminating results. The results of research are proliferating in an ever more timely fashion.6 The papers themselves are also recognizing that describing the algorithms is not the only point of interest. Papers are including experimental methodology and setup more frequently, making it easier to reproduce results. This is made possible by ever cheaper and more powerful hardware. Clearly, the DL boom has just begun.
A model is an attempt to mimic some phenomenon. It can take the form of a sculpture, a painting, or a mathematical explanation of observations of the natural world. People have been modeling the world since the dawn of civilization. The models of interest in this book are quantitative mathematical models. People build quantitative models to understand the world and use the models to make predictions. With accurate predictions comes the capacity to exploit and manipulate natural phenomena. Humans walked on the moon because accurate models of gravity, among many other things, were possible. Building quantitative models requires many technologies: writing, the invention of numbers and a means of operating on them (arithmetic), and finally mathematics. In its simplest form, a model is a mathematical function. In essence, building a model means developing a mathematical function that makes accurate predictions; the scientific method is an extraordinarily successful example of this. DL ANNs are forms of models, but before we examine them let us examine how models have traditionally been developed.
People have been building models for millennia. The traditional means of doing so is to write down a constrained set of equations and then solve them. For millennia, the constraints have been in the form of natural laws or similar phenomena. The laws are often discovered scientifically. Ibn al‐Haytham and Galileo Galilei (45) independently invented the scientific method, which, when combined with the calculus (invented independently by Newton and Leibniz in the 1660s), led a century later to an explosion of understanding of the natural world. The scientist gathers data, interprets it, and composes a law in the form of an equation that explains it. For example, Newton's law of gravity is
$$F = G\,\frac{m_1 m_2}{r^2},$$
where, in SI units, $F$ is the gravitational force, $G$ is the gravitational constant, $r$ is the distance between the two objects, and $m_1$ and $m_2$ are the masses of the objects.
Using the equation for gravity, one can build models by writing an equation and then solving it. The law of gravity acts as the constraint. Natural laws are discovered by scientists collecting, analyzing, and interpreting the data to discern the relationships between the variables, and the result is an interpretable model. Once natural laws have been published, such as the conservation of mass, scientists and engineers can use them to build models of systems of interest. This is done for exciting things like the equations of motion for rockets and dull things like designing the plumbing for an apartment building; mathematical models are everywhere.
The process of building a model begins with writing down a set of constraints in the form of a system of differential equations and then solving them. To illustrate, consider the trivial problem of producing a model that computes the time to fall for an object dropped from a height, $h$, near the surface of the Earth. The object's motion is constrained by gravity. The classical means of proceeding is to use Newton's laws and write down a constraint. Acceleration near the surface of the Earth can be approximated with the constant $g$ (9.80665 m s$^{-2}$). Employing Newton's notation for derivatives, and letting $x(t)$ denote the distance fallen after time $t$, we obtain the following equation of motion (acceleration in this case) based on the physical constraint:
$$\ddot{x} = g.$$
The equation can be integrated to obtain the velocity (ignoring friction),
$$\dot{x} = gt,$$
which in turn can be integrated to produce the desired model, $t(h)$:
$$x(t) = \tfrac{1}{2}gt^{2} \quad\Longrightarrow\quad t(h) = \sqrt{\frac{2h}{g}}.$$
This yields an analytical solution obtained from the constraint, which was obtained from a natural law. Of course this is a very trivial example, and often an analytical solution is not available. Under those circumstances, the modeler must resort to numerical methods to solve the equations, but it illustrates the historical approach.
With modern computers, another approach to obtaining a function, $t(h)$, is possible; an ANN can be used. Instead of constraining the system with a natural law, it is constrained empirically, with a dataset. ANNs are trained with supervised learning techniques. They can be thought of as functions that start as raw clay. Supervised training moulds the clay into the desired shape (an accurate model), and the desired model is specified with a dataset; that is, the dataset defines the model, not a natural law. To demonstrate, the example of $t(h)$ is revisited.
Training the ANN is done with supervised learning techniques. The raw clay of the untrained ANN function needs to be defined by data, so the first step is to collect data. This is done by measuring the time to fall from a number of different heights. This would result in a dataset of the form $\{(h_i, t_i)\}_{i=1}^{N}$, where each tuple consists of the height and the time to fall to the ground. Using the data, the ANN can be trained and we obtain
$$\hat{t}(h) \approx t(h).$$
Once trained, the ANN model, $\hat{t}(h)$, approximates $t(h)$, the analytical solution.
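To make the empirical approach concrete, the following minimal sketch (illustrative only, not code from the book's library) shows how such a dataset of (height, time) tuples might be assembled. Since no real drop experiments are available here, the "measurements" are simulated with the analytical model plus Gaussian noise standing in for experimental error; the heights sampled and the noise level are arbitrary choices.

```cpp
// Sketch: assemble an empirical dataset {(h_i, t_i)} for training the ANN.
// Measurements are simulated with t(h) = sqrt(2h/g) plus noise.
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

int main()
{
    const double g = 9.80665;                           // m/s^2, near the Earth's surface
    std::mt19937 rng(42);
    std::normal_distribution<double> noise(0.0, 0.01);  // measurement error, in seconds

    std::vector<std::pair<double, double>> dataset;     // (height, time-to-fall) tuples
    for (double h = 1.0; h <= 100.0; h += 1.0)
        dataset.emplace_back(h, std::sqrt(2.0 * h / g) + noise(rng));

    // A trained ANN would be fit to this table; print a few rows to show its shape.
    for (std::size_t i = 0; i < dataset.size(); i += 25)
        std::printf("h = %6.1f m    t = %.3f s\n", dataset[i].first, dataset[i].second);

    return 0;
}
```

The table of tuples is the entire specification of the desired model; no knowledge of Newton's laws appears anywhere in it.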
There are now two models, hopefully producing the same results, but arrived at with completely different techniques. The results of both are depicted in Figure 1.2. There are, however, some meaningful differences. First, the ANN is a black box: it may be correct, but nothing can really be said about it; the final model does not admit of interpretability. The analytical result can be used to predict asymptotic behavior or be rearranged for further insights. Moreover, the analytical solution was obtained by rearranging the solution to the differential Eq. (1.4). Second, the training of the ANN uses far more compute resources, memory and CPU, than the analytical solution. And finally, assembling the dataset is a great deal of trouble. In this trivial example, someone already did that and arrived at the gravitational constant, $g$. Comparing the two methods, the ANN approach seems like a great deal more trouble.
This raises the question: given the seeming disadvantages of ANNs, why would anyone ever use them? The answer lies in the differences between the approaches, the seeming "disadvantages." The ANN approach, training with raw data, did not require any understanding of, or insight into, the underlying process that produced the data in order to build an accurate model, none at all. The model was constrained empirically, by the data; no constraint in the form of a natural law or principle was required. This is extremely useful for many modern problems of interest.
Consider the problem of classifying black‐and‐white digital images as one of either a cat, a dog, or a giraffe; we need a function. The function is $f: \mathbb{R}^{M} \rightarrow K$, where $K$ is the set $K = \{\text{cat}, \text{dog}, \text{giraffe}\}$, and $M$ is the resolution of the image. For such applications, empirically specifying the function is the only means of obtaining the model. There are no other constraints available: Einstein cannot help with a natural law, the Black–Scholes equation is of no use, nor can a principle such as "no arbitrage" be invoked. There is no natural starting point. The underlying process is unknown, and probably unknowable. The fact that we have no insight into the process producing the data is no hindrance at all. There is a drawback in that the resulting model is not interpretable, but nevertheless the approach has been immensely successful. Using supervised learning techniques for this application imposes the requirement to collect a set of images and label them with the correct answer (one of the three possible answers), as sketched below. Thus, setting aside the need for interpretability or an understanding of the generating process, it is possible to accurately model a whole new set of applications.
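As a concrete illustration of what "empirically specifying the function" entails, the short sketch below (illustrative, not the book's code; the type names are invented) shows the shape of such a labeled dataset: each example pairs the raw pixels of an image with one of the three possible answers.

```cpp
// Sketch of a labeled dataset for the cat/dog/giraffe classifier.  The labels
// are the only "constraint" available; there is no natural law to appeal to.
#include <cstdint>
#include <vector>

enum class Label : std::uint8_t { Cat, Dog, Giraffe };

struct Example {
    std::vector<float> pixels;   // M grey-scale values, the image flattened
    Label answer;                // the human-supplied ground truth
};

using Dataset = std::vector<Example>;   // the empirical specification of f
```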
Figure 1.2 The graph of $t(h)$ is plotted with empty circles as points. The ANN's predictions are crosses plotted over them. The points are from the training dataset. The ANN seems to have learnt the function.
Even for applications where natural laws exist, leading to a system of constraints, ANNs are beginning to enjoy some success. Combinatoric problems such as protein folding have been successfully addressed with ANNs (16). ANNs are better at predicting the shapes of proteins than approaches that solve the differential equations and quantum mechanical constraints. Large problems lacking an analytical solution, such as predicting the paths of hurricanes, are also seeing investment in ANNs to produce more accurate predictions (22). There are many more examples.
Finally, it is worth bearing in mind the inherent differences between the DL models composed of ANNs and animal brains. ANNs were motivated by, and attempted to simulate, biological neuron assemblies (Hebb and Rosenblatt were psychologists). Owing to the success of DL, the nature of the simulation is often lost while retaining the connection; this can be unfortunate.
It must not be forgotten that biological neural networks are physical; they are cells, "hardware." Biological neurons operate independently, asynchronously, and concurrently; they are the unit of computation. In this sense, a brain is a biological parallel computer. ANNs are software simulating biological hardware on a completely different computer architecture. There are inherent differences between the biological instance and the simulation that render the ANN inefficient. A simulated neuron must wait to have its state updated when a signal "arrives." The delay is owing to waiting in a queue for its turn on a CPU core – the simulation's unit of computation. Biological neurons are the "CPU"s and continually update themselves. A human brain has approximately 100 billion neurons with an average of 10,000 synapses (connections) each (79), and they do not need to wait their turn to compute state – they are the state. The ANN simulation must queue all its virtual neurons serially for a chance on a CPU to update their state. To do this efficiently, DL models typically impose strong restrictions on the topology of the network of virtual neuron connections. ANN software is simulating a parallel computer on a serial computer. Even allowing for the parallelism of GPUs, the simulation is still $O(\text{number of neurons})$. The characteristics are different too: a biological neural network is an analog computer and modern computers are digital.
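The following conceptual sketch (not the book's library; the struct and function names are illustrative) makes the point explicit: the virtual neurons are visited one after another in a loop, so a forward pass costs time proportional to the number of neurons, whereas every biological neuron would have updated itself concurrently.

```cpp
// Conceptual sketch: an ANN simulation must visit its virtual neurons
// serially, whereas biological neurons all update themselves in parallel.
#include <cstddef>
#include <vector>

struct Neuron {
    std::vector<double> weights;   // one weight per incoming connection
    double bias  = 0.0;
    double state = 0.0;
};

// Serially recompute every neuron's state from the previous layer's outputs.
void updateLayer(std::vector<Neuron> &layer, const std::vector<double> &inputs)
{
    for (Neuron &n : layer) {                  // each neuron waits its turn on the CPU
        double z = n.bias;
        for (std::size_t i = 0; i < n.weights.size(); ++i)
            z += n.weights[i] * inputs[i];     // weighted sum of incoming signals
        n.state = z;                           // activation function omitted for brevity
    }
}
```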
The nature of a computer is also very much at variance with an animal brain. A human brain uses around 20 W of energy (79). An Intel Xeon CPU consumes between 200 and 300 W, as do GPUs. The power usage of the GPU farms used to train Google's BERT or NVIDIA's GANN is measured in kilowatts. Training language models can cost US$8,000,000 just for the electricity (19). It is also common to compare biological neurons to transistors. It is a really fine example of an apples‐to‐oranges comparison. Transistors have switching times on the order of nanoseconds. Biological neuron switching times are on the order of milliseconds. Transistors typically have 3 static connections, while neurons can have thousands of connections. A neuron's set of connections, synapses, can dynamically adapt to changing circumstances. Neurons perform very different functions, and a great many transistors would be required to implement the functionality of a single neuron.
None of this is to say that DL software is inappropriate for use or not fit for purpose, quite the contrary, but it is important to have some perspective on the nature of the simulation and the fundamental differences.
Before presenting DL algorithms, it should be emphasized that ANNs are mathematical algorithms implemented on digital computers. When reading this text, it is important to understand that naïve computer implementations of mathematical methods can lead to surprising7 results. Blindly typing equations into a computer is often a recipe for trouble, and DL is no exception. Arithmetic is different on a computer and varies in unexpected ways, as will be seen. Unlike the normal arithmetical operations, most computer implementations of addition and multiplication are not associative or distributive. The reader is encouraged to peruse this section with a computer and experiment to aid in understanding the pitfalls.
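As a first experiment, the short program below (illustrative, not from the book) demonstrates the loss of associativity: regrouping three floating point additions changes the answer.

```cpp
// IEEE-754 addition is not associative: the grouping of operations matters.
#include <cstdio>

int main()
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    float left  = (a + b) + c;   // (1e8 - 1e8) + 1 == 1
    float right = a + (b + c);   // b + c rounds back to -1e8, so the total is 0

    std::printf("(a + b) + c = %g\n", left);    // prints 1
    std::printf("a + (b + c) = %g\n", right);   // prints 0
    return 0;
}
```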
Consider the interval, $[1,2]$, as a subset of the natural numbers, $\mathbb{N}$. Its cardinality, $|[1,2]|$, is two. Intervals of the natural numbers are countable. Now consider $[1,2]$ as a subset of $\mathbb{R}$. The real number line is continuous, a characteristic relied upon by the calculus. So, in this case, $|[1,2]| = \infty$. Indeed, any subinterval of $\mathbb{R}$, no matter how small, also has a cardinality of infinity. Equations are generally derived assuming that $\mathbb{R}$ is available, but the real number line does not exist in a computer. Computers simulate $\mathbb{R}$ with a necessarily discrete (finite) set called floating point numbers. This has profound implications when programming. Two mathematically equivalent algorithms can behave completely differently when implemented on the same digital computer.
By far, the most common implementation of floating point numbers on modern digital computers is the IEEE‐754 standard for floating point values (25). First agreed in 1985, it has been continually updated ever since. Intel's x86 family of processors implements it, as do Apple's ARM chips, such as the Mx family of SoCs. It is often misunderstood as simply a format for representing floating point numbers, but it is actually a complete system defining behavior and operations, including the handling of errors. This is extremely important, as running a program on different CPU architectures that are IEEE‐754 compliant will yield the same numerical results. The most common IEEE‐754 floating point types are the 32‐bit ("single‐precision") and the 64‐bit ("double‐precision") formats.8 Computer languages usually expose them as native types and the programmer uses them without realizing it. It is immediately clear that, by their very nature of being finite, an IEEE representation can only represent a finite subset of real numbers. A 32‐bit format for floating point numbers can, at most, represent $2^{32}$ values; that is a long way from infinity.
To illustrate the pitfalls of floating point arithmetic, we present a simple computer experiment following the presentation of Forsythe (41) and the classic Linear Algebra text, Matrix Computations (49). Consider the polynomial, $ax^2 + bx + c$. The quadratic equation, a seemingly innocuous equation known by all school children, computes the roots with the following:
$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$
There are two roots. At first glance, this appears to be a trivial equation to implement. For the smallest root of the quadratic:
$$x_{\min} = \frac{-b + \sqrt{b^2 - 4ac}}{2a},$$
or, alternatively,
$$x_{\min} = \frac{2c}{-b - \sqrt{b^2 - 4ac}}.$$
Both of these forms are mathematically equivalent, but they are very different when implemented in a computer. Letting ... and ...
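The effect is easy to reproduce. The sketch below uses illustrative coefficients (not necessarily the values elided above), chosen so that $b$ is large relative to $a$ and $c$: in single precision the textbook form loses every significant digit of the smaller root to cancellation, while the rationalized form recovers it.

```cpp
// Catastrophic cancellation in the quadratic formula, single precision.
// Illustrative polynomial: x^2 + 10000x + 1, whose smaller-magnitude root
// is approximately -1.0e-4.
#include <cmath>
#include <cstdio>

int main()
{
    float a = 1.0f, b = 10000.0f, c = 1.0f;
    float d = std::sqrt(b * b - 4.0f * a * c);   // b*b - 4ac rounds to b*b in float

    float naive  = (-b + d) / (2.0f * a);        // subtracts two nearly equal numbers
    float stable = (2.0f * c) / (-b - d);        // rationalized form, no cancellation

    std::printf("naive  root = %.8g\n", naive);  // prints 0: every digit has been lost
    std::printf("stable root = %.8g\n", stable); // ~ -1.0e-4, correct to single precision
    return 0;
}
```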