DEEP LEARNING
A concise and practical exploration of key topics and applications in data science
In Deep Learning: From Big Data to Artificial Intelligence with R, expert researcher Dr. Stéphane Tufféry delivers an insightful discussion of the applications of deep learning and big data that focuses on practical instructions on various software tools and deep learning methods relying on three major libraries: MXNet, PyTorch, and Keras-TensorFlow. In the book, numerous, up-to-date examples are combined with key topics relevant to modern data scientists, including processing optimization, neural network applications, natural language processing, and image recognition.
This is a thoroughly revised and updated edition of a book originally released in French, with new examples and methods included throughout. Classroom-tested and intuitively organized, Deep Learning: From Big Data to Artificial Intelligence with R offers complimentary access to a companion website that provides R and Python source code for the examples offered in the book. Readers will also find:
Perfect for graduate students studying data science, big data, deep learning, and artificial intelligence, Deep Learning: From Big Data to Artificial Intelligence with R will also earn a place in the libraries of data science researchers and practicing data scientists.
Page count: 1018
Year of publication: 2022
Cover
Title Page
Copyright
Dedication
Acknowledgements
Introduction
The Structure of the Book
Notes
1 From Big Data to Deep Learning
1.1 Introduction
1.2 Examples of the Use of Big Data and Deep Learning
1.3 Big Data and Deep Learning for Companies and Organizations
1.4 Big Data and Deep Learning for Individuals
1.5 Risks in Data Processing
1.6 Protection of Personal Data
1.7 Open Data
Notes
2 Processing of Large Volumes of Data
2.1 Issues
2.2 The Search for a Parsimonious Model
2.3 Algorithmic Complexity
2.4 Parallel Computing
2.5 Distributed Computing
2.6 Computer Resources
2.7 R and Python Software
2.8 Quantum Computing
Notes
3 Reminders of Machine Learning
3.1 General
3.2 The Optimization Algorithms
3.3 Complexity Reduction and Penalized Regression
3.4 Ensemble Methods
3.5 Support Vector Machines
3.6 Recommendation Systems
Notes
4 Natural Language Processing
4.1 From Lexical Statistics to Natural Language Processing
4.2 Uses of Text Mining and Natural Language Processing
4.3 The Operations of Textual Analysis
4.4 Vector Representation and Word Embedding
4.5 Sentiment Analysis
Notes
5 Social Network Analysis
5.1 Social Networks
5.2 Characteristics of Graphs
5.3 Characterization of Social Networks
5.4 Measures of Influence in a Graph
5.5 Graphs with R
5.6 Community Detection
5.7 Research and Analysis on Social Networks
5.8 The Business Model of Social Networks
5.9 Digital Advertising
5.10 Social Network Analysis with R
Notes
6 Handwriting Recognition
6.1 Data
6.2 Issues
6.3 Data Processing
6.4 Linear and Quadratic Discriminant Analysis
6.5 Multinomial Logistic Regression
6.6 Random Forests
6.7 Extra-Trees
6.8 Gradient Boosting
6.9 Support Vector Machines
6.10 Single Hidden Layer Perceptron
6.11 H2O Neural Network
6.12 Synthesis of “Classical” Methods
Notes
7 Deep Learning
7.1 The Principles of Deep Learning
7.2 Overview of Deep Neural Networks
7.3 Recall on Neural Networks and Their Training
7.4 Difficulties of Gradient Backpropagation
7.5 The Structure of a Convolutional Neural Network
7.6 The Convolution Mechanism
7.7 The Convolution Parameters
7.8 Batch Normalization
7.9 Pooling
7.10 Dilated Convolution
7.11 Dropout and DropConnect
7.12 The Architecture of a Convolutional Neural Network
7.13 Principles of Deep Network Learning for Computer Vision
7.14 Adaptive Learning Algorithms
7.15 Progress in Image Recognition
7.16 Recurrent Neural Networks
7.17 Capsule Networks
7.18 Autoencoders
7.19 Generative Models
7.20 Other Applications of Deep Learning
Notes
8 Deep Learning for Computer Vision
8.1 Deep Learning Libraries
8.2 MXNet
8.3 Keras and TensorFlow
8.4 Configuring a Machine's GPU for Deep Learning
8.5 Computing in the Cloud
8.6 PyTorch
Notes
9 Deep Learning for Natural Language Processing
9.1 Neural Network Methods for Text Analysis
9.2 Text Generation Using a Recurrent Neural Network LSTM
9.3 Text Classification Using a LSTM or GRU Recurrent Neural Network
9.4 Text Classification Using a H2O Model
9.5 Application of Convolutional Neural Networks
9.6 Spam Detection Using a Recurrent Neural Network LSTM
9.7 Transformer Models, BERT, and Its Successors
Notes
10 Artificial Intelligence
10.1 The Beginnings of Artificial Intelligence
10.2 Human Intelligence and Artificial Intelligence
10.3 The Different Forms of Artificial Intelligence
10.4 Ethical and Societal Issues of Artificial Intelligence
10.5 Fears and Hopes of Artificial Intelligence
10.6 Some Dates of Artificial Intelligence
Notes
Conclusion
Note
Annotated Bibliography
On Big Data and High Dimensional Statistics
On Deep Learning
On Artificial Intelligence
On the Use of R and Python in Data Science and on Big Data
Index
End User License Agreement
Chapter 2
Table 2.1 Data table.
Table 2.2 Comparison of Python and R packages.
Table 2.3 Comparison of machine learning functions.
Chapter 3
Table 3.1 Comparison of bagging, random forests, extra-trees, and boosting....
Chapter 4
Table 4.1 Comparison of parsing features.
Chapter 6
Table 6.1 Random forests with the ranger package.
Table 6.2 Models obtained with the extraTrees package (numRandomCuts = 1)...
Table 6.3 Models obtained with the extraTrees package (numRandomCuts = 3)...
Table 6.4 Classical methods applied to the MNIST database.
Chapter 7
Table 7.1 Choice of activation and error functions.
Table 7.2 MNIST between 2003 and 2020.
Chapter 8
Table 8.1 Deep learning libraries.
Table 8.2 Keras pre-trained models.
Chapter 9
Table 9.1 Some transformer models.
Chapter 10
Table 10.1 Timeline of artificial intelligence.
Chapter 1
Figure 1.1 Google Trends.
Figure 1.2 Stock market price.
Figure 1.3 Google search queries.
Figure 1.4 Stock market prices and search queries time series.
Figure 1.5 Stock market prices and search queries overlay.
Figure 1.6 Correlogram.
Figure 1.7 Google Flu Trends.
Figure 1.8 Number of queries in Google Flu Trends.
Chapter 2
Figure 2.1 MapReduce.
Chapter 3
Figure 3.1.
Figure 3.2.
Figure 3.3 Fixed step = 0.7.
Figure 3.4 Fixed step = 0.6.
Figure 3.5 Fixed step = 0.1.
Figure 3.6 Optimal step descent (in dashed line) and conjugate gradient ...
Figure 3.7 Saddle point.
Figure 3.8 Gradient descent with an ill-conditioned Hessian matrix has eig...
Figure 3.9 Influence of the second derivative.
Figure 3.10 Evolution of the coefficients as a function of the Lasso penalty...
Figure 3.11 Evolution of the coefficients as a function of the ridge penalty...
Figure 3.12 Adaptive boosting mechanism.
Figure 3.13 Weighting of observations according to the error rate.
Figure 3.14 Effect of learning rate on boosting convergence (λ = 1).
Figure 3.15 Effect of learning rate on boosting convergence (λ = 0.1).
Figure 3.16 Effect of learning rate on boosting convergence (λ = 0.01).
Figure 3.17 Correct separation (B2) and optimal separation (B1).
Figure 3.18 Example of transformation in an SVM.
Chapter 4
Figure 4.1 CA on document-term table.
Figure 4.2 Word cloud.
Figure 4.3 Density of a Dirichlet distribution of order 3 and parameters (0....
Figure 4.4 Density of a Dirichlet distribution of order 3 and parameters (10...
Figure 4.5 Finding the optimal number of topics.
Figure 4.6 Frequency of words in Un amour de Swann (in blue) and in Combray ...
Figure 4.7 Word2Vec embedding models.
Figure 4.8 Word2Vec representation (first two vectors).
Figure 4.9 Word2Vec representation (first two factorial axes).
Figure 4.10 FastText representation (first two factorial axes).
Chapter 5
Figure 5.1 Displaying a graph of R packages.
Figure 5.2 Maximum PageRank display.
Figure 5.3 Detecting communities in the country graph.
Figure 5.4 Hashtag word cloud.
Figure 5.5 Verification of Zipf's law.
Figure 5.6 Word cloud.
Figure 5.7 Clustering of terms.
Figure 5.8 Graph of the semi-partial R².
Figure 5.9 Evolution of the opinion score.
Figure 5.10 Term adjacency graph.
Figure 5.11 Communities with the Louvain algorithm.
Chapter 6
Figure 6.1 Images from the MNIST database.
Figure 6.2 Reading the MNIST database.
Figure 6.3 Average digits from the MNIST database.
Figure 6.4 Explanation of the error rate.
Chapter 7
Figure 7.2 Hidden layer perceptron.
Figure 7.3 Activation functions.
Figure 7.4 Residual networks.
Figure 7.5 Determining the weights of a network by autoencoding.
Figure 7.6 Local connections.
Figure 7.7 Shared weights.
Figure 7.8 Structure of a convolutional network.
Figure 7.9 Convolution in a neural network.
Figure 7.10 Convolution pattern detection.
Figure 7.11 Examples of convolution.
Figure 7.12 Stacking of convolution layers.
Figure 7.13 Convolution stride.
Figure 7.14 Max pooling.
Figure 7.15 Dilated convolution.
Figure 7.16 Dropout and DropConnect.
Figure 7.17 Example of a convolutional network.
Figure 7.18 MNIST in 1998.
Figure 7.19 LeNet-5 network.
Figure 7.20 CIFAR-10.
Figure 7.21 The ILSVRC Challenge.
Figure 7.22 AlexNet and VGG neural networks.
Figure 7.23 Accuracy of CNNs as a function of their complexity.
Figure 7.24 Recurrent neural network.
Figure 7.25 LSTM recurrent network.
Figure 7.26 GRU recurrent network.
Figure 7.27 Autoencoder.
Figure 7.28 Keras autoencoder (400, 200, 100, 2, 100, 200, 400) and 100 epoc...
Figure 7.29 Representation of the proximity matrix of a random forest.
Figure 7.30 Principal component analysis on MNIST.
Figure 7.31 Principle of generative adversarial networks.
Figure 7.32 YOLO grid.
Figure 7.33 Object detection with YOLO.
Figure 7.34 Detecting the style of a painter.
Figure 7.35 Go.
Chapter 8
Figure 8.1 MNIST images.
Figure 8.2 Flattening of convolution layers on fully connected layers.
Figure 8.3 Convolutional neural network created with MXNet.
Figure 8.4 Accuracy as a function of the number of epochs.
Figure 8.5 Comparison of the accuracy of several variants.
Figure 8.6 Handwritten number.
Figure 8.7 Accuracy as a function of the number of epochs.
Figure 8.8 Training history of the neural network with Keras on MNIST.
Figure 8.9 Image processed with Keras.
Figure 8.10 Superpixels of LIME.
Figure 8.11 Explanation of LIME.
Figure 8.12 Relevant superpixels in the explanation of LIME.
Figure 8.13 First 100 images of CIFAR-10.
Figure 8.14 Training history of the neural network with Keras on CIFAR-10.
Figure 8.15 CIFAR-10 image classification errors.
Figure 8.16 Cats and dogs: simple convolutional model.
Figure 8.17 Cats and dogs: convolutional model with data augmentation.
Figure 8.18 Cats and dogs: VGG16 pre-trained model with data augmentation.
Figure 8.19 Cats and dogs: Xception pre-trained model with data augmentation...
Figure 8.20 Graphics card information.
Figure 8.21 Downloading the graphics card driver.
Figure 8.22 Creating a Kaggle notebook.
Figure 8.23 Creating an R notebook on Kaggle.
Figure 8.24 Configuring a notebook on Kaggle.
Figure 8.25 Number of cores allocated to a Google Colab session.
Figure 8.26 Google Colab notebook settings.
Figure 8.27 Google Colab Jupyter notebook.
Figure 8.28 Convolutional network in Google Colab.
Figure 8.29 Reading a file with R in Google Colab.
Figure 8.30 RStudio Cloud.
Figure 8.31 The first 100 articles in Fashion-MNIST.
Figure 8.32 The Kuzushiji-MNIST dataset.
Chapter 9
Figure 9.1 Word prediction with an LSTM network.
Figure 9.3 Training an LSTM model of complaint classification.
Figure 9.4 Training a GRU model of complaint classification.
Figure 9.5 Training a bi-directional GRU model of complaint classification....
Figure 9.7 Convolutional neural network applied to a text.
Stéphane Tufféry
Associate Professor, University of Rennes 1, France
This edition first published 2023
© 2023 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Stéphane Tufféry to be identified as the author of this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
9600 Garsington Road, Oxford, OX4 2DQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication data applied for
Hardback ISBN: 9781119845010
Cover Design: Wiley
Cover Images: Courtesy of Stéphane Tufféry; Maxger/Shutterstock
To my parents
I warmly thank Franck Berthuit and Ricco Rakotomalala for their careful reading and helpful comments. I am also pleased to thank all the people I met at the University of Rennes 1, at the Ensai (École Nationale de la Statistique et de l'Analyse de l'Information) and at the Institute of Actuaries of Paris, where I taught the courses on data science, big data, and deep learning from which this book was derived.
This book is dedicated to deep learning, which is a recent branch of a slightly older discipline: machine learning.1 Deep learning is particularly well suited to the analysis of complex data, such as images and natural language. For this reason, it is at the heart of many of the artificial intelligence applications that we will describe in this book. Although deep learning today relies almost exclusively on neural networks, we will first look at other machine learning methods, partly because of the concepts they share with neural networks and which it is important to understand in their generality, and partly to compare their results with those of deep learning methods. We will then be able to fully measure the effectiveness of deep learning methods in computer vision and automatic natural language processing problems. This is what the present book will do, recalling the theoretical foundations of these methods while showing how to implement them in concrete situations, with examples implemented using open source deep learning libraries, mainly in R and also in Python, as indicated below. As we will see, the prodigious development of deep learning and artificial intelligence has been made possible by new theoretical concepts, by more powerful computing tools, but also by the possibility of using immense masses of varied data: images, videos, audio recordings, texts, traces on the Internet, signals from connected objects ... These big data will be very present in this book.
Chapter 1 is an overview of deep learning and big data with their principles and applications in the main sectors of finance, insurance, industry, transport, medicine, and scientific research. A few pages are devoted to the main difficulties that can be encountered in processing data in machine learning and deep learning, particularly when it comes to big data. We must not neglect the IT risks inherent in the collection and storage, sometimes in a cloud, of large amounts of personal data. The news about certain social networks regularly reminds us of this. At the opposite end of the spectrum from this commercial vision of big data are open data, which close the chapter.
Chapter 2 deals with concepts that data scientists must know when dealing with large volumes of data: parsimony in modeling, algorithmic complexity, parallel computing and its generalization, which is distributed computing. We devote a few pages to the MapReduce algorithm at the basis of distributed computing, its implementation in the Hadoop system, and to the database management systems, known as NoSQL and column-oriented, particularly adapted to big data. We will see that "analytical" applications such as machine learning have particular computing requirements that require specific solutions: Spark is one of them. We then review the hardware and software resources to be implemented, whether they are on the user's machine or in a cloud. We talk about the processors that enable deep learning computations to be accelerated, as well as the two most widely used open source software environments in statistics, machine learning, and deep learning: R and Python. A synoptic table compares the main machine learning methods implemented in R, Python (scikit-learn library), and Spark (MLlib). We also found it interesting to mention quantum computing, for which specific versions of algorithms are starting to be designed, notably in linear algebra, machine learning, optimization, and cryptography. The prospects of quantum computing are still distant but very promising, with the possibility of a considerable reduction in computing time.
Chapter 3 recalls some essential principles of machine learning and data science: the bias-variance dilemma in modeling, complexity reduction methods, optimization algorithms, such as gradient descent, Newton or Levenberg-Marquardt, ensemble (or aggregation) methods such as random forests, Extra-Trees, or boosting, and useful methods for big data, such as incremental algorithms and the recommendation systems used by social networks and online commerce. Apart from these reminders, it is assumed that the reader is familiar with machine learning methods but, if required, a bibliography is given at the end of the book and notes are provided in each chapter for specific references.
Chapter 4 presents natural language processing methods. The principles of textual analysis are introduced, including segmentation into units or tokenization, part-of-speech tagging, named entity recognition, lemmatization, and other simplification operations that aim to reduce the volume of data and the complexity of the problem as much as possible while retaining the maximum amount of information, which is a constant concern in statistics and machine learning. We then describe the operations of vector representation of words, which go from the classical document-term matrix to the methods of word embedding, which started with Word2Vec, GloVe, and fastText, a list that continues to grow. We speak of embedding because each word is associated with a point in a vector space of fairly small dimension, of the order of a few hundred, i.e. far fewer dimensions than there are distinct terms in the vocabulary, with the remarkable property that two semantically close words correspond to close points in the vector space, and that arithmetic operations in this vector space can lead to identities such as "King" – "Man" + "Woman" = "Queen". These vector embeddings preserve not only the proximity of words but also their relations. They are therefore an efficient way to transform documents for analysis, for example, to classify them into categories: spam or non-spam, type of message, subject of the complaint, etc. We also discuss topic modeling, which uses methods such as latent Dirichlet allocation to detect all the topics present in a corpus of documents. We present another current method of natural language processing, sentiment analysis, which seeks to detect the sentiments expressed in a text, either in a binary form of positive or negative sentiments, or in a more elaborate form of joy, fear, anger, etc. Neural methods applied to natural language processing are discussed in Chapter 9, after the one devoted to the principles of deep learning.
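As a toy illustration of this embedding arithmetic, the following R sketch uses a handful of hypothetical three-dimensional word vectors (real embeddings are trained on large corpora and have hundreds of dimensions) and looks for the vocabulary word closest, by cosine similarity, to "king" – "man" + "woman".

# Hypothetical toy word vectors, for illustration only
embeddings <- rbind(
  king  = c(0.80, 0.65, 0.10),
  man   = c(0.70, 0.10, 0.05),
  woman = c(0.68, 0.12, 0.80),
  queen = c(0.78, 0.67, 0.85),
  apple = c(0.05, 0.90, 0.20)
)
# Cosine similarity between two vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
# "king" - "man" + "woman" should land close to "queen"
target <- embeddings["king", ] - embeddings["man", ] + embeddings["woman", ]
sort(apply(embeddings, 1, cosine, b = target), decreasing = TRUE)

With trained embeddings such as those of Word2Vec or fastText, the same arithmetic is applied to vectors of a few hundred dimensions, and the query words themselves are usually excluded from the ranking.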
Chapter 5 shows how to analyze social networks, starting from the notions of graph theory and taking the example of Twitter. We are particularly interested in the so-called centrality and influence measures, as they are very important in social networks and web search engines. We are also interested in the detection of communities, which are the dense sub-graphs that can constitute a partition of the studied graph. The search for communities in a graph is an active field of research in various domains (biology, sociology, marketing), because the vertices of the same community tend to share interesting properties. Some consideration is also given to the economic model of social networks, to digital advertising, and to what is called programmatic advertising.
Chapter 6 deals with the classical problem of recognizing handwritten digits on bank checks and postal codes on envelopes, among others. On a well-known dataset (MNIST), it compares the different machine learning methods previously discussed in the book: in particular penalized regression, random forests, gradient boosting, and support vector machines.
Chapter 7 is a long and important chapter on deep learning. It explains the principles of deep learning and the architecture of deep neural networks, especially convolutional and recurrent networks, which are those most widely used for computer vision and natural language processing today. The many features designed to optimize their performance are presented, such as pooling, normalization, dropout, and adaptive learning, with indications on how best to use them. We review the fundamental learning mechanism of neural networks, backpropagation, the difficulties encountered in its application to multilayer networks with the vanishing gradient phenomenon that led for a while to the “winter of artificial intelligence,” and the solutions found in the last ten years by new ideas and increased computing power. Particular networks are described: autoencoders for data compression, and generative neural networks that are increasingly being developed to have artificial intelligence produce texts, images or music. Illustrations show the interest of deep learning for subjects ranging from object detection to strategy games.
Chapter 8 presents the application to computer vision of the methods seen in Chapter 7, using the MXNet, Keras-TensorFlow, and PyTorch libraries. In particular, they are applied to three classical datasets: (1) the MNIST database already discussed in Chapter 6, which allows the performances of classical and deep learning methods to be compared; (2) the CIFAR-10 image database; and (3) a database of cat and dog pictures. We apply transfer learning. We sketch the question of the explainability of machine learning algorithms by applying the LIME method to images to find out on which parts of the image the model relies for its predictions. We show how to configure a computer with a Windows operating system to use its graphics processing unit (GPU) for deep learning computations, which are much faster on these graphics processors than on classical processors (CPUs). This configuration is not straightforward, and the steps indicated must be followed carefully. The chapter concludes with examples of cloud computing, using the Google Colab platform with a Jupyter notebook running Python code.
Chapter 9 returns to natural language processing, applying to it the deep learning methods described in Chapter 7: generative models and recurrent neural networks. An example application is given of the creation of poems by a neural network which has been provided with an example for its training (Shakespeare's Sonnets) but no other information on the English language, no dictionary, and no grammar rules. We also show how to apply LSTM and GRU recurrent networks to document classification, and we compare them to classical machine learning methods. We then show how recent “transformer” models have led to a refinement of the word embedding methods seen in Chapter 4, with the BERT algorithm and its successors able to take into account the context of a polysemous word to compute its embedding. But these transformer models are much more than word embedding methods and they are today the state of the art for all natural language processing tasks, document classification, translation, question-answering, text generation, automatic summarization, etc. We give an insight into their performance by applying the BERT model to the same example of document classification as the previous LSTM model.
Finally, Chapter 10 describes artificial intelligence with its concepts, its relationship with human intelligence, its links with symbolic methods, machine learning and deep learning, its applications and, of course, the hopes and debates it raises.
The book concludes with an annotated bibliography and an index.
In the examples in this book, R has been used more often than Python. Even if Python's successes in data science are growing, R remains the reference software in this field, the richest in statistics, and it has progressively caught up in the field of deep learning. Indeed, the main methods, first of all convolutional and recurrent neural networks, which were initially interfaced with Python (the computations themselves are often implemented in C++ and CUDA), are increasingly being interfaced with R as well. It should be noted, however, that the use of TensorFlow and Keras, even with R, requires the prior installation of Python, at least in a minimal distribution such as Miniconda. As for the PyTorch library, until 2020 it had not been ported to R and required writing Python code for its use, which we do at the end of Chapter 8. Since then, the project "torch for R" supported by RStudio has implemented the PyTorch library in R, directly from C++ code without going through the Python interface.2
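As a hedged, minimal sketch of this setup (package options and behavior vary by version, so the current documentation should be consulted), the installation sequence in R might look as follows.

# Keras/TensorFlow from R: requires a Python distribution such as Miniconda,
# which install_keras() can set up in a dedicated environment
install.packages("keras")
library(keras)
install_keras()
# torch for R: no Python required, the C++ libtorch backend is downloaded directly
install.packages("torch")
library(torch)
install_torch()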
Another reason for choosing R is that many books have already been published on Python, while few describe the use of R in deep learning. Of course, these issues are the subject of articles and discussions in forums, but the useful elements are scattered and not necessarily complete or consistent with one another. We have therefore favored an approach that is perhaps more circumscribed, but, we hope, coherent and likely to allow the reader to solve the problems of data science and deep learning. Since this is the objective of this book, and not a systematic comparison of software libraries, we have chosen not to increase the volume of the book by presenting all the code in both R and Python versions. For the same reason, we have not presented examples of the use of all deep learning libraries, but only three of the main ones: Keras-TensorFlow, MXNet, and PyTorch.
Many running times are shown, to illustrate the differences in computational performance between different methods or ways of programming, or sometimes between different software. Most of the deep neural network training examples were implemented on a laptop computer with a quad-core Intel i5-8300H processor running at 2.3 GHz, with 8 GB of RAM, a 64-bit Windows 10 operating system, and an NVIDIA GeForce GTX 1050 graphics card. The execution times are sometimes compared with and without the use of this GPU. They would of course be very different if they were measured on other machines, and it is better to look at the differences in computing time between two approaches than at their absolute value.
The duration and some results of the computations can also sometimes depend on the version of R, and those presented in this book were mostly obtained with versions 3.6.x to 4.1.x of R. The reader should not be surprised by differences that one can obtain with other more or less recent versions. This is especially true of the big data and deep learning packages, which are frequently evolving like the methods they implement.
The methods presented in this book go far beyond the field of statistics to cover statistical learning and machine learning, of which deep learning is a particular branch.
In short, we can say that although statistics seeks to predict phenomena, it seeks above all to explain them and therefore to provide a description in the form of models. A model is a representation of reality that assumes that the data follow certain probability distributions. The statistician carries out tests to check this assumption and to ensure that the model is well founded. If it is proven that the observed data do follow the assumed probability distribution, or at least do not deviate too much from it, all that remains is to estimate the parameter(s) of this distribution and to verify, once again by means of statistical tests, the significance of this estimate (in common language: its reliability).
In machine learning, we are primarily interested in the predictive power of the methods and the generalization capacity of the models obtained. We do not ask them to provide a formalized description of reality, and the notion of hypothesis testing takes a back seat.3 This is all the more true since we are interested in phenomena that are sometimes too complex to be described by simple probability distributions; such phenomena are instead captured by much more complex mechanisms, deep learning mechanisms that are not without analogy to the functioning of the brain, which partly explains their important place in artificial intelligence.
1 We include here in the same term statistics, statistical learning, and machine learning, for the sake of brevity and because the boundaries are shifting, despite our attempt to draw them at the end of this Introduction.
2 https://www.rstudio.com/blog/torch/.
3 According to Brian Ripley's quip (useR! 2004, Vienna): "Machine learning is statistics minus any checking of models and assumptions."
The first chapter of the book begins with a presentation of big data, which is both a growing field of study and application (with algorithms and tools) and its raw material (data sets). The term data is in the plural (of datum) as it should be in English, and we can talk about the plurality of data sources, but we can also think of the discipline in the singular, with a set of concepts and methods, a mixture of machine learning1 and computer technology. In the first sense of the term, big data can be understood as large collections of data, too large for traditional data-processing systems, but this definition only emphasizes the volume of these data, whereas their variety is often also essential, as the following pages will show. Of the machine learning methods, we will focus on deep learning methods, which are the best suited to dealing with big data such as text, voice, images, and video. We will then go on to examine some key points in the processing of data, especially big data, and we will conclude with questions of personal data protection and some elements on open data.
Big Data covers all the issues associated with the collection and exploitation of very large sets of data, of very varied natures and formats (texts, photographs, clicks, signals from sensors, connected objects, etc.), and in very rapid, even continuous, evolution. These data are invading many fields of activity: health, industry, transport, finance, banking and insurance, retail, public policy, security, scientific research, etc.
The economic stakes are high: McKinsey2 estimates that big data could save healthcare policies in the United States $300 billion per year and generate $600 billion in consumption by using consumer location data.
We will see that the impacts of big data are very important for both people and business, and that the technological challenges are formidable.
Before going into detail about big data, let's take a quick historical look at the developments that led to them. For several decades, the rise of computing power has accompanied the explosion of data production. This rise has developed over several eras:
before 1950, the beginnings of statistics with a few hundred individuals and a few variables, collected in a laboratory according to a strict protocol of experimental design for a scientific study;
in the 1960s–1980s, data analysis with a few tens of thousands of individuals and a few dozen variables, rigorously collected for a specific survey;
in the 1990s and 2000s, data mining with several million individuals and several hundred variables, collected in the information system of companies for decision-making;
from the 2010s, big data with several hundred million individuals and several thousand variables, of all types, collected in companies, systems, the Internet, for decision-making and new services.
The period from the end of the nineteenth century until the 1950s was the era of classical statistics. It preceded the invention of computers, so the means of calculation were manual and very limited, and it was characterized by:
small volumes studied;
the strong hypotheses on the statistical distributions followed (triad: linearity, normality, homoscedasticity);
models derived from theory and confronted with data;
the probabilistic and statistical nature of the methods;
laboratory use.
The predominance of the hypothetico-deductive method and the importance of tests and inferential statistics can be noted. Data are collected and analyzed within a strict, often scientific, framework in order to verify a theory, which can be refuted by the result of a test. The foundations of mathematical statistics, but also important predictive methods, such as linear discriminant analysis and logistic regression, date from this period.
Between the 1960s and the 1980s, the emergence of computers revolutionized the discipline, allowing for much more complex and rapid calculations than before, on thousands of analyzed individuals and dozens of variables, with the construction of "Individuals x Variables" tables. This was a time of great theoretical creativity, during which many fundamental methods were invented that are still widely used today. This was the golden age of data analysis, and especially of factorial analysis, and visual representation began to take on great importance.
The 1990s saw the advent of the concept of data mining, which is not only characterized by an explosion of computing resources and the quantity of data to be processed, but also by a profound change in the role of quantitative analysis. Even if it uses the tools of statistics, data mining differs from statistics in many qualitative and quantitative aspects:
millions or tens of millions of individuals, and hundreds or thousands of variables are analyzed;
many variables are non-numerical, sometimes textual;
weak assumptions are made about the statistical distributions followed;
data are collected prior to the study, and often for other purposes;
the populations studied are constantly changing;
outliers (out of the norm, at least in terms of the distributions studied) are present;
the data are imperfect, with data entry errors, coding errors, missing values;
fast calculations are needed, sometimes in real time;
we are not always looking for the mathematical optimum, but sometimes for the model that is easiest to understand for non-statisticians;
the models are derived from the data and sometimes attempts are made to derive theoretical elements from them;
some methods are starting to come from information theory or machine learning;
used in companies, and not only in labs and universities.
Two major changes have occurred. On the one hand, the data analyzed often were not collected for the needs of the study but for management needs, with all that this implies in terms of redundancy, imprecision, and even errors. Data mining involves a long and important work of "cleaning" the data prior to modeling. On the other hand, the aim is not essentially scientific, or even sometimes explanatory, but rather to assist in decision-making. It is not a question of formulating theoretical hypotheses and then verifying them with the help of observations, but of finding models that fit the observed data as well as possible and can, in some very specific cases, suggest theoretical clues.
For example, in the commercial field, we do not try to elaborate a theory of customer behavior and its hidden motives; we simply try to characterize the profiles of the customers most likely to buy a given product, and possibly to know when and at what price.
A new era began in the 2010s when big data appeared, with hundreds of millions of individuals and thousands of variables, of all types, sometimes very noisy, collected in companies, objects, the Internet, to provide decision support, and also, more than with data mining, new services. The concept of big data was born out of the explosion in the production of all kinds of data: traces left on the Web (sites visited, videos seen, clicks, keywords searched for, etc.), opinions expressed in social networks (about a person, a company, a brand, a product, a service, a movie, a restaurant), posts in social media and content shared on websites (blogs, photographs, videos, music tracks, etc.), geolocation by GPS or IP address, information gathered by industrial, road and climate sensors, RFID chips, NFC devices and connected objects (smartphones, connected watches and wristbands, intelligent personal assistants,3 cameras, electricity meters, household appliances, scales, clothes, medical devices, cars, glasses with augmented reality ...) that form what is called the Internet of Things (IoT).
These big data are characterized by what Doug Laney has called4 the “three Vs”: volume, velocity and variety.5
The volume of data involved has given Big Data its name, with an order of magnitude that can reach the petabyte (10^15 bytes). The increase in data volume comes from the increase in:
the number of individuals observed (more numerous or at a finer level);
the frequency of observation and data recording (from monthly to daily or hourly);
the number of observed features.
This increase also comes from the observation of new data, especially from connected objects, geolocation, and the Web.
Thus, in one minute on the Internet:6
120 new accounts are created on LinkedIn;
571 websites are created;
600 Wikipedia pages are created or updated;
1,500 blogs are posted;
46,740 photographs are posted on Instagram;
154,200 calls are made on Skype;
752,000 dollars are spent online;
$258,750 in sales are made by Amazon;
342,000 smartphone applications are downloaded;
456,000 tweets are sent;
900,000 people log on to Facebook;
4.1 million videos are viewed on YouTube;
3.6 million searches are made on Google;
16 million SMS are sent;
156 million emails are sent (including over 100 million spam emails).
This aspect of volume is perhaps the most visible and spectacular feature of big data, but it is not entirely new, since the retail, banking, and telecom industries have long been handling huge volumes of data, with annual numbers of transactions routinely exceeding one billion. However, while the number of objects or individuals processed (the rows of the databases) was very large, the number of their observed characteristics (the columns of the databases) was not, and this is rather where the novelty of big data lies, along with their theoretical and practical difficulties. The main theoretical difficulty is the so-called "curse of dimensionality," to which we return in Section 1.5.1. From a practical point of view, it is sometimes said that big data begin when we can no longer load the data into memory and, more generally, when we can no longer process them by "conventional" means (in a rather vague sense, which can, for example, refer to non-distributed computer architectures). What sounds like a joke is at the same time pragmatic, and refers to the technical and software tools that can manipulate data that cannot reside in memory.
The variety of big data is due to the fact that these data are of very diverse natures and forms: numerical, graphs, web logs, texts in various forms (documents, emails, SMS, etc.), sounds, images, videos, functional data ... This variety makes it difficult to use the usual databases and requires a variety of methods: graph analysis, deep neural networks, text mining, web mining ... Heterogeneous data can be cross-referenced: for example, sales data matched with social network data, or pollution sensor data matched with weather data and traffic sensor data.
The velocity of big data comes from the fact that these data are updated rapidly, sometimes in real time and in a continuous flow, or at least at high frequency, and must often be processed just as quickly. By data streams, we mean the continuous flow of data from industrial, meteorological, and astronomical sensors. An autonomous vehicle must be able to continuously and immediately process an enormous quantity of data, estimated at 30 terabytes per day. On merchant sites, the customer's decision on the Web is made quickly, because it only takes one click to change site, so the best commercial offer must be made to them instantly. Another example: credit card fraud must of course be detected in real time. In some cases, limiting the computation time introduces an error that must be estimated and controlled.
Note that in some cases, it is not only the application of the statistical model but its update that is done in real time or at least very frequently. Velocity thus has three components:
data velocity;
the speed of the processing to be carried out;
the speed of updating the models.
The concept of data science has recently developed to address the theoretical and technical challenges raised by big data. The modeling methods applied to big data are far from new, but big data have strongly motivated important work on them. We can mention in particular some advanced techniques of sampling, optimization, machine learning, deep learning, penalized estimators of the Lasso or Dantzig type, incremental learning, spatial statistics, the use of graphs, and of course the analysis and generation of natural language, both written and spoken.
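To give a concrete flavor of one of these techniques, here is a minimal R sketch of a Lasso-penalized regression with the glmnet package, on simulated data invented for the illustration; the general pattern is cross-validation to choose the penalty, followed by inspection of the sparse coefficients.

library(glmnet)
set.seed(1)
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p)
# Only the first 5 of the 50 predictors actually matter in this simulation
y <- drop(x[, 1:5] %*% runif(5, 1, 2)) + rnorm(n)
# alpha = 1 gives the Lasso penalty; cv.glmnet selects lambda by cross-validation
cv_fit <- cv.glmnet(x, y, alpha = 1)
coef(cv_fit, s = "lambda.1se")  # most coefficients are shrunk exactly to zero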
Sampling issues are important, as they can help to reduce the volume of data to make it more easily manipulated, and to infer general conclusions from partial observations. But the representativeness of samples is difficult to establish, with multiple data sources that do not cover the same populations and have a significant number of missing values. This raises problems of sampling techniques and sample adjustment. For example, it is necessary to assign appropriate weights to the observed units in order to obtain a representative sample of the population studied. There is also the issue of matching individual data from different sources and using auxiliary information. These methods are already used by national statistical institutes (such as the French INSEE), which use numerous data sources (surveys, population censuses, and administrative files), as well as by media audience measurement institutes.
Incremental learning refers to methods that allow us to build a model, not on the basis of a complete sample, but on the basis of data that arrive in smaller chunks and from which we must update the model without having to take into account past observations. These methods make it possible to analyze data arriving in streams or which are too large to be processed at once. They are found in certain decision trees, the Very Fast Decision Trees, which rely on a theoretical limit, the Hoeffding bound, to determine a number of observations sufficient to obtain a split of each node of a tree close enough to the split that would have been obtained with all the observations. By this sampling, these trees can handle data so massive that the calculations would otherwise be too long or even impossible. Another widely used incremental algorithm is Alan Miller's "Memory Bounded Algorithm" AS 274, which is used for linear regression and is implemented in the biglm package of R. Incremental algorithms also exist in deep learning.
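As a hedged sketch of incremental fitting with the biglm package mentioned above (the data and chunk sizes are invented for the illustration), a linear regression can be fitted on a first chunk and then updated as new observations arrive.

library(biglm)
set.seed(1)
# Simulate data arriving in chunks, as in a stream
make_chunk <- function(n) {
  x <- rnorm(n)
  data.frame(x = x, y = 2 * x + rnorm(n))
}
# Fit on the first chunk, then update the model with each new chunk,
# without ever holding all the observations in memory at once
fit <- biglm(y ~ x, data = make_chunk(1000))
for (i in 1:9) {
  fit <- update(fit, moredata = make_chunk(1000))
}
summary(fit)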
Machine learning methods, such as model aggregation (ensemble methods), support vector machines, and neural networks discussed in this book, are used for their high predictive power, in situations where model readability is not required and their “black box” characteristic is not an insurmountable obstacle, especially since some methods of explainability (Section 1.5.7) of their predictions are beginning to develop.
Deep learning is used for problems as complex as image, video, text, and speech recognition. It uses sophisticated machine learning techniques such as convolutional neural networks, and it relies on massively parallel computer architectures, especially for computations on graphics processors, which we discuss later. Along with big data, deep learning is the main topic of this book because it is at the heart of modern data science. Most big data could not be analyzed without deep learning.
Whether the images are from social networks or medical imaging, advanced algorithms are needed to process them, not just to retouch or enhance them, but to recognize faces, places, or tumor cells. Unfortunately, deep learning can also create fake images or videos, manufacturing deepfakes. “Multimodal artificial intelligence” methods, combining image, voice and text recognition, are being implemented to try to flush out these deepfakes. Deep learning is also becoming indispensable for processing data as numerous and complex as those produced by genetic analysis, astronomy, or particle physics. Deep learning has also made it possible to achieve unequaled performance in text and voice processing, whether it is answering human questions in writing or orally, or helping users in their searches on the Web: today, huge transformer neural networks (Section 9.7) analyze the requests of Internet users to find the most relevant information.
Examples of the use of big data and deep learning abound in a wide variety of sectors. This section gives a quick overview before we come back to discuss several of them.
In the field of transport, applications include the improvement of road traffic by geolocation, the search for free parking spaces, the billing of parking in paid zones thanks to the optical recognition of characters on license plates, and the dynamic pricing of airline tickets.
This last application is part of dynamic pricing, also called yield management or revenue management. It concerns activities with fixed available capacity, high fixed costs, and perishable products (to be sold before a certain date) that can be sold in advance at differentiated prices. Dynamic pricing determines in real time the appropriate quantities to put on sale, at the appropriate price, at the appropriate time, in order to maximize the profit generated by the sale. It is about maximizing margin without reducing demand. It was born in the 1980s in the airline industry with American Airlines, but has since spread to many other areas: transport, hotels, advertising space, tourism, entertainment. For example, it is applied by Uber to regulate the number of drivers behind the wheel at a given time: lowering fares when supply exceeds demand should encourage Uber drivers not to drive at that time. Dynamic pricing is also used in fashion, where it is more difficult to implement for new products with less predictable sales.
To take the example of air transport, here is what Stéphane Ormand, head of the “revenue management” department at Air France-KLM Group, says about dynamic pricing in an interview in La Croix on June 27, 2016:
Nearly a year in advance, they [pricers] virtually carve up an aircraft for each flight into a multitude of fares, taking into account a multitude of factors – everything from the economic crisis in Brazil to falling exchange rates in Venezuela or Nigeria, or the Euro soccer tournament in France ... Analysts must keep a constant watch on the competition, local or global events that can disrupt demand for a given flight and adapt the sales strategy.
A single economy class cabin can be divided into fifteen or so fares, with fare differentials of 1 to 10, and the number of seats between these fares can be adjusted according to demand.
In a sense, dynamic pricing is a return to the age-old practice of prices that were not fixed and posted, but were the result of negotiation. Here, we cannot speak of negotiation because the price is set without the buyer's knowledge, but it is the overall behavior of all buyers that influences the price. Dynamic pricing must be legally regulated so as not to discriminate against certain customers on the basis of their individual profile or the place where they live.
In marketing, geolocation makes it possible to send a promotion or a coupon to a smartphone when its owner passes near a business. This geolocation (or geofencing) uses the GPS of smartphones, Wi-Fi hotspots, and beacons.
Another marketing application of big data is the analysis of preferences, recommendations, possibly linked to sales data, to target consumers more efficiently (Section 3.6).
A classic in the retail industry is the analysis of receipts and its cross-referencing with loyalty program data. These analyses can even be done in real time, so that it is at the moment of checkout that the customer is alerted, based on the contents of their shopping cart, that they may have forgotten to buy an item. There is a lot of talk about Internet big data, but 90% of sales in the world are still made in physical stores, and Wal-Mart remains the leading retail group in the world, even if Amazon comes in third place in Deloitte's Top 250 in 2020.7 It was in sixth place in 2016 and in 186th place in 2000. Moreover, it should be noted that Amazon is starting to open physical stores and to make agreements with mass retail chains. Under the terms of an agreement between Google and Carrefour, since 2020, Google's voice assistant can be used to order purchases at Carrefour, through a smartphone or a connected speaker. Such an agreement was made in August 2017 in the United States between Google and Wal-Mart. With big data, physical stores can try to exploit their advantages over online businesses.
Combining the two types of businesses, Amazon Go is a physical store that relies on sensors to know which items have been picked up by customers (and not put back on the shelf). Payment is made automatically when leaving the store, if one has an Amazon account and a smartphone with the Amazon Go app. Like online visits and purchases, those made in this store can be tracked by Amazon, which can use them to derive relevant recommendations.
Perhaps less publicized than the previous applications of big data, the scientific applications are important and varied in fields that are big users of massive data, such as meteorology, seismology, oceanography, ecology, genomics, epidemiology, medical imaging, astronomy, and nuclear physics.
For example, the Large Synoptic Survey Telescope (LSST) records 30 terabytes of images each day, with real-time alerts for changes in the position or brightness of celestial objects. The Large Hadron Collider (LHC), used to discover the Higgs boson, records 60 terabytes of data every day from its 100 million collisions per second. The Discovery supercomputer at the NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate data and climate model simulations.
In literature, the millions of digitized texts cover the entire history of literature and can be analyzed for comparative literature or to study the diffusion of cultural movements in a society, according to the historical or social context. They are also available and downloadable on sites such as Gutenberg8 for texts in the public domain. In archeology, computer vision can be applied to the identification of patterns on ancient objects, their comparison with other known objects, their dating, and the study of their technical and stylistic evolution.9 These applications of deep learning are part of the disciplines called digital humanities.
Artificial intelligence comes to the aid of paleography, that is, the study of ancient writings, to distinguish minute differences in the style of two texts. Thus, in April 2021, scientists from the University of Groningen in the Netherlands published an article in PLOS One revealing that two scribes, and not just one, had probably written one of the Dead Sea Scrolls, the Great Isaiah Scroll.
In journalism, data journalism relies on the analysis of digital data to verify, produce, and distribute information. These data are increasingly numerous, notably thanks to Open Data (Section 1.7) and social networks (Section 5.1). This is different from infographics, which only aim to represent data. Data journalism can also rely on natural language processing tools, machine translation, and automatic text generation to comment on data. Among the applications of data journalism, we can mention the identification of individuals or events in photographs, the detection of fake photographs or videos, the detection of fake news and fact-checking, the exploration and analysis of archives and large databases, the automatic monitoring of social media to detect emerging topics and "breaking news" to be announced before the competition, the analysis of political speeches, and the detection of propaganda and influence operations. We can also mention the recommendation systems (Section 3.6) that suggest to readers new articles likely to interest them, based on those they have already read.
In the field of human resources, job boards rely on machine learning methods, and companies use résumé analysis enriched by searching CV libraries, detecting links made by the candidate on social networks, events in which he or she participates, his or her career path, etc. This sourcing allows start-ups to detect candidates who best match the position offered and the company's culture more quickly than headhunters, and at a lower cost. Interview reports with employees can also be analyzed to automatically detect positive or negative tones, situations of discomfort, the expression of difficulties. More recently, AI has been used to automate telephone or video-conference interviews with candidates, but with results that are sometimes unreliable in evaluating the aptitudes of candidates for a position. Recruitment is a field where particular attention is paid to the absence of bias and discrimination.
Applications of big data to education include the development of educational technologies (EdTech) and the analysis of social networks to gauge the popularity of lessons and student satisfaction, and to adapt teaching to the progress of each student.