Data Science in Theory and Practice

EXPLORE THE FOUNDATIONS OF DATA SCIENCE WITH THIS INSIGHTFUL NEW RESOURCE

Data Science in Theory and Practice delivers a comprehensive treatment of the mathematical and statistical models useful for analyzing data sets arising in various disciplines, such as banking, finance, health care, bioinformatics, security, education, and social services. Written in five parts, the book examines some of the most commonly used and fundamental mathematical and statistical concepts that form the basis of data science. The authors go on to analyze various data transformation techniques useful for extracting information from raw data, long memory behavior, and predictive modeling. The book offers readers a multitude of topics, all relevant to the analysis of complex data sets. Along with a robust exploration of the theory underpinning data science, it contains numerous applications to specific, practical problems. The book also provides examples of code in R and Python, along with pseudo-algorithms for porting the code to any other language.
Ideal for students and practitioners without a strong background in data science, the book covers topics such as:

* Analyses of foundational theoretical subjects, including the history of data science, matrix algebra and random vectors, and multivariate analysis
* A comprehensive examination of time series forecasting, including the different components of time series and transformations to achieve stationarity
* Introductions to both the R and Python programming languages, including basic data types and sample manipulations for both languages
* An exploration of algorithms, including how to write one and how to perform an asymptotic analysis
* A comprehensive discussion of several techniques for analyzing and predicting complex data sets

Perfect for advanced undergraduate and graduate students in Data Science, Business Analytics, and Statistics programs, Data Science in Theory and Practice will also earn a place in the libraries of practicing data scientists, data and business analysts, and statisticians in the private sector, government, and academia.
Cover
Title Page
Copyright
List of Figures
List of Tables
Preface
1 Background of Data Science
1.1 Introduction
1.2 Origin of Data Science
1.3 Who is a Data Scientist?
1.4 Big Data
2 Matrix Algebra and Random Vectors
2.1 Introduction
2.2 Some Basics of Matrix Algebra
2.3 Random Variables and Distribution Functions
2.4 Problems
3 Multivariate Analysis
3.1 Introduction
3.2 Multivariate Analysis: Overview
3.3 Mean Vectors
3.4 Variance–Covariance Matrices
3.5 Correlation Matrices
3.6 Linear Combinations of Variables
3.7 Problems
4 Time Series Forecasting
4.1 Introduction
4.2 Terminologies
4.3 Components of Time Series
4.4 Transformations to Achieve Stationarity
4.5 Elimination of Seasonality via Differencing
4.6 Additive and Multiplicative Models
4.7 Measuring Accuracy of Different Time Series Techniques
4.8 Averaging and Exponential Smoothing Forecasting Methods
4.9 Problems
5 Introduction to R
5.1 Introduction
5.2 Basic Data Types
5.3 Simple Manipulations – Numbers and Vectors
5.4 Problems
6 Introduction to Python
6.1 Introduction
6.2 Basic Data Types
6.3 Number Type Conversion
6.4 Python Conditions
6.5 Python File Handling: Open, Read, and Close
6.6 Python Functions
6.7 Problems
7 Algorithms
7.1 Introduction
7.2 Algorithm – Definition
7.3 How to Write an Algorithm
7.4 Asymptotic Analysis of an Algorithm
7.5 Examples of Algorithms
7.6 Flowchart
7.7 Problems
8 Data Preprocessing and Data Validations
8.1 Introduction
8.2 Definition – Data Preprocessing
8.3 Data Cleaning
8.4 Data Transformations
8.5 Data Reduction
8.6 Data Validations
8.7 Problems
9 Data Visualizations
9.1 Introduction
9.2 Definition – Data Visualization
9.3 Data Visualization Techniques
9.4 Data Visualization Tools
9.5 Problems
10 Binomial and Trinomial Trees
10.1 Introduction
10.2 The Binomial Tree Method
10.3 Binomial Discrete Model
10.4 Trinomial Tree Method
10.5 Problems
11 Principal Component Analysis
11.1 Introduction
11.2 Background of Principal Component Analysis
11.3 Motivation
11.4 The Mathematics of PCA
11.5 How PCA Works
11.6 Application
11.7 Problems
12 Discriminant and Cluster Analysis
12.1 Introduction
12.2 Distance
12.3 Discriminant Analysis
12.4 Cluster Analysis
12.5 Problems
13 Multidimensional Scaling
13.1 Introduction
13.2 Motivation
13.3 Number of Dimensions and Goodness of Fit
13.4 Proximity Measures
13.5 Metric Multidimensional Scaling
13.6 Nonmetric Multidimensional Scaling
13.7 Problems
14 Classification and Tree‐Based Methods
14.1 Introduction
14.2 An Overview of Classification
14.3 Linear Discriminant Analysis
14.4 Tree‐Based Methods
14.5 Applications
14.6 Problems
15 Association Rules
15.1 Introduction
15.2 Market Basket Analysis
15.3 Terminologies
15.4 The Apriori Algorithm
15.5 Applications
15.6 Problems
16 Support Vector Machines
16.1 Introduction
16.2 The Maximal Margin Classifier
16.3 Classification Using a Separating Hyperplane
16.4 Kernel Functions
16.5 Applications
16.6 Problems
17 Neural Networks
17.1 Introduction
17.2 Perceptrons
17.3 Feed Forward Neural Network
17.4 Recurrent Neural Networks
17.5 Long Short‐Term Memory
17.6 Application
17.7 Significance of Study
17.8 Problems
18 Fourier Analysis
18.1 Introduction
18.2 Definition
18.3 Discrete Fourier Transform
18.4 The Fast Fourier Transform (FFT) Method
18.5 Dynamic Fourier Analysis
18.6 Applications of the Fourier Transform
18.7 Problems
19 Wavelets Analysis
19.1 Introduction
19.2 Discrete Wavelets Transforms
19.3 Applications of the Wavelets Transform
19.4 Problems
20 Stochastic Analysis
20.1 Introduction
20.2 Necessary Definitions from Probability Theory
20.3 Stochastic Processes
20.4 Examples of Stochastic Processes
20.5 Measurable Functions and Expectations
20.6 Problems
21 Fractal Analysis – Lévy, Hurst, DFA, DEA
21.1 Introduction and Definitions
21.2 Lévy Processes
21.3 Lévy Flight Models
21.4 Rescaled Range Analysis (Hurst Analysis)
21.5 Detrended Fluctuation Analysis (DFA)
21.6 Diffusion Entropy Analysis (DEA)
21.7 Application – Characterization of Volcanic Time Series
21.8 Problems
22 Stochastic Differential Equations
22.1 Introduction
22.2 Stochastic Differential Equations
22.3 Examples
22.4 Multidimensional Stochastic Differential Equations
22.5 Simulation of Stochastic Differential Equations
22.6 Problems
23 Ethics: With Great Power Comes Great Responsibility
23.1 Introduction
23.2 Data Science Ethical Principles
23.3 Data Science Code of Professional Conduct
23.4 Application
23.5 Problems
Bibliography
Index
End User License Agreement
Maria Cristina Mariani
University of Texas, El Paso
El Paso, United States

Osei Kofi Tweneboah
Ramapo College of New Jersey
Mahwah, United States

Maria Pia Beccar-Varela
University of Texas, El Paso
El Paso, United States
This first edition first published 2022
© 2022 John Wiley and Sons, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions
The right of Maria Cristina Mariani, Osei Kofi Tweneboah, and Maria Pia Beccar‐Varela to be identified as the authors of this work has been asserted in accordance with law.
Registered Office: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office: 111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging‐in‐Publication Data applied for
ISBN: 9781119674689
Cover Design: Wiley
Cover Image: © nobeastsofierce/Shutterstock
Figure 4.1 Time series data of phase arrival times of an earthquake.
Figure 4.2 Time series data of financial returns corresponding to Bank of America (BAC) stock index.
Figure 4.3 Seasonal trend component.
Figure 4.4 Linear trend component. The horizontal axis is time , and the vertical axis is the time series . (a) Linear increasing trend. (b) Linear decreasing trend.
Figure 4.5 Nonlinear trend component. The horizontal axis is time and the vertical axis is the time series . (a) Nonlinear increasing trend. (b) Nonlinear decreasing trend.
Figure 4.6 Cyclical component (imposed on the underlying trend). The horizontal axis is time and the vertical axis is the time series .
Figure 7.1 The big O notation.
Figure 7.2 The notation.
Figure 7.3 The notation.
Figure 7.4 Symbols used in flowchart.
Figure 7.5 Flowchart to add two numbers entered by user.
Figure 7.6 Flowchart to find all roots of a quadratic equation .
Figure 7.7 Flowchart.
Figure 8.1 The box plot.
Figure 8.2 Box plot example.
Figure 9.1 Scatter plot of temperature versus ice cream sales.
Figure 9.2 Heatmap of handwritten digit data.
Figure 9.3 Map of earthquake magnitudes recorded in Chile.
Figure 9.4 Spatial distribution of earthquake magnitudes (Mariani et al. 2016).
Figure 9.5 Number of text messages sent.
Figure 9.6 Normal Q–Q plot.
Figure 9.7 Risk of loan default. Source: Tableau Viz Gallery.
Figure 9.8 Top five publishing markets. Source: Modified from International Publishers Association – Annual Report.
Figure 9.9 High yield defaulted issuer and volume trends. Source: Based on Fitch High Yield Default Index, Bloomberg.
Figure 9.10 Statistics page for popular movies and cinema locations. Source: Google Charts.
Figure 10.1 One‐step binomial tree for the return process.
Figure 11.1 Height versus weight.
Figure 11.2 Visualizing low‐dimensional data.
Figure 11.3 2D data set.
Figure 11.4 First PCA axis.
Figure 11.5 Second PCA axis.
Figure 11.6 New axis.
Figure 11.7 Scatterplot of Royal Dutch Shell stock versus Exxon Mobil stock.
Figure 12.1 Classification (by quadrant) of earthquakes and explosions using the Chernoff and Kullback–Leibler differences.
Figure 12.2 Classification (by quadrant) of Lehman Brothers collapse and Flash crash event using the Chernoff and Kullback–Leibler differences.
Figure 12.3 Clustering results for the earthquake and explosion series based on symmetric divergence using PAM algorithm.
Figure 12.4 Clustering results for the Lehman Brothers collapse, Flash crash event, Citigroup (2009), and IAG (2011) stock data based on symmetric divergence using the PAM algorithm.
Figure 13.1 Scatter plot of data in Table 13.1
Figure 16.1 The ‐plane and several other horizontal planes.
Figure 16.2 The ‐plane and several parallel planes.
Figure 16.3 The plane .
Figure 16.4 Two class problem when data is linearly separable.
Figure 16.5 Two class problem when data is not linearly separable.
Figure 16.6 ROC curve for linear SVM.
Figure 16.7 ROC curve for nonlinear SVM.
Figure 17.1 Single hidden layer feed‐forward neural networks.
Figure 17.2 Simple recurrent neural network.
Figure 17.3 Long short‐term memory unit.
Figure 17.4 Philippines (PSI). (a) Basic RNN. (b) LTSM.
Figure 17.5 Thailand (SETI). (a) Basic RNN. (b) LTSM.
Figure 17.6 United States (NASDAQ). (a) Basic RNN. (b) LTSM.
Figure 17.7 JPMorgan Chase & Co. (JPM). (a) Basic RNN. (b) LTSM.
Figure 17.8 Walmart (WMT). (a) Basic RNN. (b) LTSM.
Figure 18.1 3D power spectra of the daily returns from the four analyzed stock companies. (a) Discover. (b) Microsoft. (c) Walmart. (d) JPM Chase.
Figure 18.2 3D power spectra of the returns (generated per minute) from the four analyzed stock companies. (a) Discover. (b) Microsoft. (c) Walmart. (d) JPM Chase.
Figure 19.1 Time‐frequency image of explosion 1 recorded by ANMO (Table 19.2).
Figure 19.2 Time‐frequency image of earthquake 1 recorded by ANMO (Table 19.2).
Figure 19.3 Three‐dimensional graphic information of explosion 1 recorded by ANMO (Table 19.2).
Figure 19.4 Three‐dimensional graphic information of earthquake 1 recorded by ANMO (Table 19.2).
Figure 19.5 Time‐frequency image of explosion 2 recorded by TUC (Table 19.3).
Figure 19.6 Time‐frequency image of earthquake 2 recorded by TUC (Table 19.3).
Figure 19.7 Three‐dimensional graphic information of explosion 2 recorded by TUC (Table 19.3).
Figure 19.8 Three‐dimensional graphic information of earthquake 2 recorded by TUC (Table 19.3).
Figure 21.1 for volcanic eruptions 1 and 2.
Figure 21.2 DFA for volcanic eruptions 1 and 2.
Figure 21.3 DEA for volcanic eruptions 1 and 2.
Table 2.1 Examples of random vectors.
Table 3.1 Ramus Bone Length at Four Ages for 20 Boys.
Table 4.1 Time series data of the volume of sales of over a six hour period.
Table 4.2 Simple moving average forecasts.
Table 4.3 Time series data used in Example 4.6.
Table 4.4 Weighted moving average forecasts.
Table 4.5 Trend projection of weighted moving average forecasts.
Table 4.6 Exponential smoothing forecasts of volume of sales.
Table 4.7 Exponential smoothing forecasts from Example 4.9.
Table 4.8 Adjusted exponential smoothing forecasts.
Table 6.1 Numbers.
Table 6.2 Files mode in Python.
Table 7.1 Common asymptotic notations.
Table 9.1 Temperature versus ice cream sales.
Table 12.1 Events information.
Table 12.2 Discriminant scores for earthquakes and explosions groups.
Table 12.3 Discriminant scores for Lehman Brothers collapse and Flash crash event.
Table 12.4 Discriminant scores for Citigroup in 2009 and IAG stock in 2011.
Table 13.1 Data matrix.
Table 13.2 Distance matrix.
Table 13.3 Stress and goodness of fit.
Table 13.4 Data matrix.
Table 14.1 Models' performances on the test dataset with 23 variables using AUC and mean square error (MSE) values for the five models.
Table 14.2 Top 10 variables selected by the Random forest algorithm.
Table 14.3 Performance for the four models using the top 10 features from model Random forest on the test dataset.
Table 15.1 Market basket transaction data.
Table 15.2 A binary representation of market basket transaction data.
Table 15.3 Grocery transactional data.
Table 15.4 Transaction data.
Table 16.1 Models performances on the test dataset.
Table 18.1 Percentage of power for Discover data.
Table 18.2 Percentage of power for JPM data.
Table 18.3 Percentage of power for Microsoft data.
Table 18.4 Percentage of power for Walmart data.
Table 19.1 Determining and for .
Table 19.2 Percentage of total power (energy) for Albuquerque, New Mexico (ANMO) seismic station.
Table 19.3 Percentage of total power (energy) for Tucson, Arizona (TUC) seismic station.
Table 21.1 Moments of the Poisson distribution with intensity .
Table 21.2 Moments of the distribution.
Table 21.3 Scaling exponents of Volcanic Data time series.
This textbook is dedicated to practitioners, graduate students, and advanced undergraduate students who are interested in Data Science, Business Analytics, and Statistical and Mathematical Modeling in different disciplines such as Finance, Geophysics, and Engineering. The book is designed to serve as a textbook for several courses in the aforementioned areas and as a reference guide for practitioners in the industry.
The book has a strong theoretical background and several applications to specific practical problems. It contains numerous techniques applicable to modern data science and other disciplines. In today's world, many fields are confronted with increasingly large amounts of complex data. Financial, healthcare, and geophysical data sampled at high frequency are no exception. These staggering amounts of data pose special challenges to the world of finance and to other disciplines such as healthcare and geophysics, as traditional models and information technology tools are poorly suited to grapple with their size and complexity. Probabilistic modeling, mathematical modeling, and statistical data analysis attempt to discover order in apparent disorder; this textbook may serve as a guide to various systematic approaches for carrying out these quantitative activities on complex data sets.
The textbook is split into five distinct parts. In the first part, Foundations of Data Science, we discuss fundamental mathematical and statistical concepts that form the basis for the study of data science. In the second part, Data Science in Practice, we present a brief introduction to R and Python programming and to writing algorithms; we also discuss various techniques for data preprocessing, validation, and visualization. In the third part, Data Mining and Machine Learning Techniques for Complex Data Sets, and the fourth part, Advanced Models for Big Data Analytics and Complex Data Sets, we provide comprehensive techniques for analyzing and predicting different types of complex data sets.
We conclude this book with a discussion of ethics in data science: With great power comes great responsibility.
The authors express their deepest gratitude to Wiley for making the publication a reality.
El Paso, TX and Mahwah, NJ, USA
September 2021

Maria Cristina Mariani
Osei Kofi Tweneboah
Maria Pia Beccar‐Varela
Data science is one of the most promising and in-demand career paths for skilled professionals in the 21st century. Today, successful data professionals understand that they must advance past the traditional skills of analyzing large amounts of data, statistical learning, and programming. To explore and discover useful information for their companies or organizations, data scientists must have a good grasp of the full data science life cycle, along with the flexibility and understanding needed to maximize returns at each phase of the process.
Data science is a "concept to unify statistics, mathematics, computer science, data analysis, machine learning and their related methods" in order to find trends in, understand, and analyze actual phenomena with data. Due to the Coronavirus disease (COVID-19) pandemic, many colleges, institutions, and large organizations asked their nonessential employees to work virtually. These virtual meetings have provided colleges and companies with plenty of data. Some aspects of the data suggest that virtual fatigue is on the rise. Virtual fatigue is defined as the burnout associated with overdependence on virtual platforms for communication. Data science provides tools to explore and reveal the best and worst aspects of virtual work.
In the past decade, data scientists have become necessary assets and are present in almost all institutions and organizations. These professionals are data‐driven individuals with high‐level technical skills who are capable of building complex quantitative algorithms to organize and synthesize large amounts of information used to answer questions and drive strategy in their organization. This is coupled with the experience in communication and leadership needed to deliver tangible results to various stakeholders across an organization or business.
Data scientists need to be curious and results-oriented, with good domain-specific knowledge and communication skills that allow them to explain highly technical results to their nontechnical counterparts. They possess a strong quantitative background in statistics and mathematics as well as programming knowledge, with focuses in data warehousing, mining, and modeling to build and analyze algorithms. In fact, data scientists are analytical data experts who have the technical skills to solve complex problems and the curiosity to explore how problems need to be solved.
Data scientists are part mathematicians, part statisticians, and part computer scientists. And because they span both the business and information technology (IT) worlds, they are in high demand and well paid. Data scientists were not very popular some decades ago; however, their sudden popularity reflects how businesses now think about "Big data." Big data is defined as a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be handled by traditional data-processing application software. That bulky mass of unstructured information can no longer be ignored and forgotten. It is a virtual gold mine that helps boost revenue, as long as there is someone who explores it and discovers business insights that no one thought to look for before. Many data scientists began their careers as statisticians, business analysts, or data analysts. However, as big data began to grow and evolve, those roles evolved as well. Data is no longer just an add-on for IT to handle. It is vital information that requires analysis, creative curiosity, and the ability to translate high-tech ideas into innovative ways to make a profit and to help practitioners make informed decisions.
The term "data scientist" was coined as recently as 2008, when companies realized the need for data professionals skilled in organizing and analyzing massive amounts of data. Data scientists are quantitative and analytical data experts who utilize their skills in both technology and social science to find trends and manage the data around them. With the growth of big data integration in business, they have emerged at the forefront of the data revolution. They are part mathematicians, statisticians, computer programmers, and analysts, equipped with a diverse and wide-ranging skill set that balances knowledge of several computer programming languages with advanced experience in statistical learning and data visualization.
There is no definitive job description for the data scientist role. However, we outline here some of the things they typically do:
Collecting and recording large amounts of unruly data and transforming it into a more usable format.
Solving business‐related problems using data‐driven techniques.
Working with a variety of programming languages, including SAS, Minitab, R, and Python.
Having a strong background of mathematics and statistics including statistical tests and distributions.
Staying on top of quantitative and analytical techniques such as machine learning, deep learning, and text analytics.
Communicating and collaborating with both IT and business.
Looking for order and patterns in data, as well as spotting trends that enable businesses to make informed decisions.
Some of the useful tools that every data scientist or practitioner needs are outlined below:
Data preparation:
The process of cleaning and transforming raw data into suitable formats prior to processing and analysis.
Data visualization:
The presentation of data in a pictorial or graphical format so it can be easily analyzed.
Statistical learning or Machine learning:
A branch of artificial intelligence based on mathematical algorithms and automation. Artificial intelligence (AI) refers to the process of building smart machines capable of performing tasks that typically require human intelligence. They are designed to make decisions, often using real-time data. Real-time data is information that is passed along to the end user as soon as it is gathered.
Deep learning:
An area of statistical learning research that uses data to model complex abstractions.
Pattern recognition:
Technology that recognizes patterns in data (often used interchangeably with machine learning).
Text analytics:
The process of examining unstructured data and drawing meaning out of written communication.
We will discuss all the above tools in detail in this book. There are several scientific and programming skills that every data scientist should have. They must be able to utilize key technical tools and skills, including R, Python, SAS, SQL, Tableau, and several others. Because technology is ever evolving, data scientists must continually learn new and emerging techniques to stay on top of their game. We will discuss R and Python programming in Chapters 5 and 6.
Big data is a term applied to ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by classical data-processing tools. In particular, it refers to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process the data with low latency. Sources of big data include sensors, the stock market, devices, video/audio, networks, log files, transactional applications, the web, and social media, with much of it generated in real time and at a very large scale.
In recent times, the use of the term "big data" (for both stored and real-time data) tends to refer to the use of user behavior analytics (UBA), predictive analytics, or certain other advanced data analytics methods that extract value from data. UBA solutions look at patterns of human behavior and then apply algorithms and statistical analysis to detect meaningful anomalies in those patterns, anomalies that can indicate potential threats. Examples include the detection of hackers, insider threats, targeted attacks, and financial fraud.
Predictive analytics deals with the process of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends. Generally, predictive analytics does not tell you what will happen in the future; rather, it forecasts what might happen with some degree of certainty. Predictive analytics goes hand in hand with big data: businesses and organizations collect large amounts of real-time customer data, and predictive analytics uses this historical data, combined with customer insight, to forecast future events. Predictive analytics helps organizations use big data to move from a historical view to a forward-looking perspective of the customer. In this book, we will discuss several methods for analyzing big data.
Big data has one or more of the following characteristics: high volume, high velocity, high variety, and high veracity. That is, the data sets are characterized by huge amounts (volume) of frequently updated data (velocity) in various types, such as numeric, textual, audio, images, and videos (variety), with high quality (veracity). We briefly discuss each characteristic.
Volume: Volume describes the quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not.
Velocity: Velocity describes the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in both stored and real-time forms. Compared to small data, big data is produced more continually (at the scale of nanoseconds, seconds, minutes, hours, etc.). Two types of velocity related to big data are the frequency of generation and the frequency of handling, recording, and reporting.
Variety: Variety describes the types and formats of the data. This helps the people who analyze the data to effectively use the resulting insight. Big data draws from different formats and completes missing pieces through data fusion. Data fusion is a term used to describe the technique of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source.
Veracity: Veracity describes the quality of the data and the data value. The quality of the data obtained can greatly affect the accuracy of the analyzed results.
In the next subsection we will discuss some big data architectures. A comprehensive study of this topic can be found in the application architecture guide of the Microsoft technical documentation.
Big data architectures are designed to handle the ingestion, processing, and analysis of data that is too large or complex for classical data-processing application tools. Some popular big data architectures are the Lambda architecture, the Kappa architecture, and the Internet of Things (IoT). We refer the reader to the Microsoft technical documentation on big data architectures for a detailed discussion of the different architectures. Almost all big data architectures include all or some of the following components:
Data sources: All big data solutions begin with one or more data sources. Some common data sources include the following: application data stores such as relational databases, static files produced by applications such as web server log files, and real-time data sources such as Internet of Things (IoT) devices.
Data storage: Data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats. This kind of store is often called a data lake. A data lake is a storage repository that allows one to store structured and unstructured data at any scale until it is needed.
Batch processing: Since the data sets are enormous, a big data solution often must process data files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Normally, these jobs involve reading source files, processing them, and writing the output to new files. Options include running U-SQL jobs or using Java, Scala, R, or Python programs. U-SQL is a data processing language that merges the benefits of SQL with the expressive power of one's own code.
Real-time message ingestion: If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing. This might be a simple data store, where incoming messages are stored into a folder for processing. However, many solutions need a message ingestion store to act as a buffer for messages and to support scale-out processing, reliable delivery, and other message queuing semantics.
Stream processing: After obtaining real-time messages, the solution must process them by filtering, aggregating, and preparing the data for analysis. The processed stream data is then written to an output sink.
Analytical data store: Several big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools. The analytical data store used to serve these queries can be a Kimball-style relational data warehouse, as observed in most classical business intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency NoSQL technology, such as HBase, or an interactive Hive database that provides a metadata abstraction over data files in the distributed data store.
Analysis and reporting: The goal of most big data solutions is to provide insights into the data through analysis and reporting. Users can analyze the data using mathematical and statistical models as well as data visualization techniques. Analysis and reporting can also take the form of interactive data exploration by data scientists or data analysts.
Orchestration: Several big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results to a report or dashboard.
The matrix algebra and random vectors presented in this chapter will enable us to precisely state statistical models. We will begin by discussing some basic concepts that will be essential throughout this chapter. For more details on matrix algebra, please consult Axler (2015).
Definition 2.1 (Vector) A vector $\mathbf{x}$ is an array of real numbers $x_1, x_2, \ldots, x_n$, and it is written as:
$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}.$$
Definition 2.2 (Scalar multiplication of vectors) The product of a scalar $c$ and a vector $\mathbf{x}$ is the vector obtained by multiplying each entry in the vector by the scalar:
$$c\mathbf{x} = \begin{pmatrix} cx_1 \\ cx_2 \\ \vdots \\ cx_n \end{pmatrix}.$$
Definition 2.3 (Vector addition) The sum of two vectors of the same size is the vector obtained by adding corresponding entries in the vectors:
$$\mathbf{x} + \mathbf{y} = \begin{pmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_n + y_n \end{pmatrix},$$
so that $\mathbf{x} + \mathbf{y}$ is the vector with $i$th element $x_i + y_i$.
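Definitions 2.2 and 2.3 can be illustrated with a short sketch in Python using NumPy (an assumption here; the book introduces Python later, in Chapter 6):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Scalar multiplication (Definition 2.2): each entry is scaled by c
c = 2.0
print(c * x)   # [2. 4. 6.]

# Vector addition (Definition 2.3): entries are added componentwise
print(x + y)   # [5. 7. 9.]
```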
Definition 2.4 (Matrix) Let $m$ and $n$ denote positive integers. An $m$-by-$n$ matrix $A$ is a rectangular array of real numbers with $m$ rows and $n$ columns:
$$A = \begin{pmatrix} A_{1,1} & \cdots & A_{1,n} \\ \vdots & & \vdots \\ A_{m,1} & \cdots & A_{m,n} \end{pmatrix}.$$
The notation $A_{j,k}$ denotes the entry in row $j$, column $k$ of $A$. In other words, the first index refers to the row number and the second index refers to the column number.
Example 2.1 If
$$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix},$$
then $A_{2,1} = 3$.
Definition 2.5 (Transpose of a matrix) The transpose operation of a matrix changes the columns into rows, i.e. in matrix notation $(A^{\top})_{j,k} = A_{k,j}$, where "$\top$" denotes transpose.
Example 2.2 If
$$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad \text{then} \quad A^{\top} = \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix}.$$
Definition 2.6 (Scalar multiplication of a matrix) The product of a scalar $c$ and a matrix $A$ is the matrix $cA$ obtained by multiplying each entry in the matrix by the scalar. In other words, $(cA)_{j,k} = cA_{j,k}$.
Definition 2.7 (Matrix addition) The sum of two matrices of the same size is the matrix obtained by adding corresponding entries in the matrices. In other words, $(A+B)_{j,k} = A_{j,k} + B_{j,k}$.
Definition 2.8 (Matrix multiplication) Suppose $A$ is an $m$-by-$n$ matrix and $B$ is an $n$-by-$p$ matrix. Then $AB$ is defined to be the $m$-by-$p$ matrix whose entry in row $j$, column $k$, is given by the following equation:
$$(AB)_{j,k} = \sum_{i=1}^{n} A_{j,i} B_{i,k}.$$
In other words, the entry in row $j$, column $k$, of $AB$ is computed by taking row $j$ of $A$ and column $k$ of $B$, multiplying together corresponding entries, and then summing. The number of columns of $A$ must equal the number of rows of $B$.
Example 2.3 If
$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 7 & 8 \\ 9 & 10 \\ 11 & 12 \end{pmatrix},$$
then
$$AB = \begin{pmatrix} 58 & 64 \\ 139 & 154 \end{pmatrix}.$$
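Definition 2.8 can be checked numerically with a small sketch in Python using NumPy (the matrices below are illustrative):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])    # 2-by-3 matrix
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])     # 3-by-2 matrix

# Matrix product (Definition 2.8): (AB)_{j,k} = sum_i A_{j,i} B_{i,k}
C = A @ B                    # 2-by-2 result
print(C)

# Verify one entry by hand: row 1 of A times column 1 of B
entry = sum(A[0, i] * B[i, 0] for i in range(3))
assert entry == C[0, 0]
```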
Definition 2.9 (Square matrix) A matrix is said to be a square matrix if the number of rows is the same as the number of columns.
Definition 2.10 (Symmetric matrix) A square matrix $A$ is said to be symmetric if $A = A^{\top}$, or in matrix notation, $A_{j,k} = A_{k,j}$ for all $j$ and $k$.
Example 2.4 The matrix $\begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}$ is symmetric; the matrix $\begin{pmatrix} 1 & 2 \\ 4 & 3 \end{pmatrix}$ is not symmetric.
Definition 2.11 (Trace) For any square matrix $A$, the trace of $A$, denoted by $\operatorname{tr}(A)$, is defined as the sum of the diagonal elements, i.e.
$$\operatorname{tr}(A) = \sum_{i} A_{i,i}.$$
Example 2.5 Let $A$ be a matrix with
$$A = \begin{pmatrix} 2 & 7 \\ 1 & 3 \end{pmatrix}.$$
Then $\operatorname{tr}(A) = 2 + 3 = 5$.
We remark that the trace is only defined for square matrices.
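A quick numerical check of Definition 2.11, sketched in Python with NumPy (the matrix is illustrative):

```python
import numpy as np

A = np.array([[2.0, 7.0],
              [1.0, 3.0]])

# Trace (Definition 2.11): the sum of the diagonal elements
assert np.trace(A) == A[0, 0] + A[1, 1]
print(np.trace(A))   # 5.0
```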
Definition 2.12 (Determinant of a matrix) Suppose $A$ is an $n$-by-$n$ matrix,
$$A = \begin{pmatrix} A_{1,1} & \cdots & A_{1,n} \\ \vdots & & \vdots \\ A_{n,1} & \cdots & A_{n,n} \end{pmatrix}.$$
The determinant of $A$, denoted $\det A$ or $|A|$, is defined by
$$\det A = \sum_{k=1}^{n} A_{1,k} C_{1,k},$$
where the $C_{j,k}$ are referred to as the "cofactors" and are computed from
$$C_{j,k} = (-1)^{j+k} \det M_{j,k}.$$
The term $M_{j,k}$ is known as the "minor matrix" and is the matrix you get if you eliminate row $j$ and column $k$ from matrix $A$.
Finding the determinant depends on the dimension of the matrix $A$; determinants exist only for square matrices.
Example 2.6 For a 2 by 2 matrix
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix},$$
we have
$$\det A = ad - bc.$$
Example 2.7 For a 3 by 3 matrix
$$A = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix},$$
we have
$$\det A = a_{1,1}(a_{2,2}a_{3,3} - a_{2,3}a_{3,2}) - a_{1,2}(a_{2,1}a_{3,3} - a_{2,3}a_{3,1}) + a_{1,3}(a_{2,1}a_{3,2} - a_{2,2}a_{3,1}).$$
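As a check on Definition 2.12, the cofactor expansion can be implemented recursively and compared against NumPy's determinant routine. This is a sketch for illustration, not an efficient implementation (cofactor expansion has factorial cost):

```python
import numpy as np

def det_cofactor(A):
    """Determinant by cofactor expansion along the first row (Definition 2.12)."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for k in range(n):
        # Minor matrix M_{1,k}: delete row 0 and column k
        minor = np.delete(np.delete(A, 0, axis=0), k, axis=1)
        cofactor = (-1) ** k * det_cofactor(minor)
        total += A[0, k] * cofactor
    return total

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 10.0]])
assert np.isclose(det_cofactor(A), np.linalg.det(A))
```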
Definition 2.13 (Positive definite matrix) A square matrix $A$ is called positive definite if, for any vector $\mathbf{x}$ not identically zero, we have
$$\mathbf{x}^{\top} A \mathbf{x} > 0.$$
Example 2.8 Let $A$ be the 2 by 2 matrix
$$A = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}.$$
To show that $A$ is positive definite, by definition, for any $\mathbf{x} = (x_1, x_2)^{\top} \neq \mathbf{0}$,
$$\mathbf{x}^{\top} A \mathbf{x} = 2x_1^2 - 2x_1x_2 + 2x_2^2 = x_1^2 + x_2^2 + (x_1 - x_2)^2 > 0.$$
Therefore, $A$ is positive definite.
Definition 2.14 (Positive semidefinite matrix) A matrix $A$ is called positive semidefinite (or nonnegative definite) if, for any vector $\mathbf{x}$, we have
$$\mathbf{x}^{\top} A \mathbf{x} \geq 0.$$
Definition 2.15 (Negative definite matrix) A square matrix $A$ is called negative definite if, for any vector $\mathbf{x}$ not identically zero, we have
$$\mathbf{x}^{\top} A \mathbf{x} < 0.$$
Example 2.9 Let $A$ be the 2 by 2 matrix
$$A = \begin{pmatrix} -2 & 1 \\ 1 & -2 \end{pmatrix}.$$
To show that $A$ is negative definite, by definition, for any $\mathbf{x} = (x_1, x_2)^{\top} \neq \mathbf{0}$,
$$\mathbf{x}^{\top} A \mathbf{x} = -2x_1^2 + 2x_1x_2 - 2x_2^2 = -\left(x_1^2 + x_2^2 + (x_1 - x_2)^2\right) < 0.$$
Therefore, $A$ is negative definite.
Definition 2.16 (Negative semidefinite matrix) A matrix $A$ is called negative semidefinite if, for any vector $\mathbf{x}$, we have
$$\mathbf{x}^{\top} A \mathbf{x} \leq 0.$$
We state the following theorem without proof.
A 2 by 2 symmetric matrix
$$A = \begin{pmatrix} a & b \\ b & c \end{pmatrix}$$
is:
positive definite if and only if $a > 0$ and $\det A > 0$;
negative definite if and only if $a < 0$ and $\det A > 0$;
indefinite if and only if $\det A < 0$.
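Definiteness can also be checked numerically through the signs of the eigenvalues of a symmetric matrix, an equivalent criterion to the quadratic-form definitions above. A sketch in Python using NumPy (the test matrices are illustrative):

```python
import numpy as np

def classify(A):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eigs = np.linalg.eigvalsh(A)          # eigenvalues of a symmetric matrix
    if np.all(eigs > 0):
        return "positive definite"
    if np.all(eigs < 0):
        return "negative definite"
    if np.any(eigs > 0) and np.any(eigs < 0):
        return "indefinite"
    return "semidefinite"

print(classify(np.array([[2.0, -1.0], [-1.0, 2.0]])))   # positive definite
print(classify(np.array([[-2.0, 1.0], [1.0, -2.0]])))   # negative definite
print(classify(np.array([[1.0, 2.0], [2.0, 1.0]])))     # indefinite
```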
We begin this section with the definition of a $\sigma$-algebra.
Definition 2.17 (σ-algebra) A $\sigma$-algebra $\mathcal{F}$ is a collection of subsets of a set $\Omega$ satisfying the following conditions:
$\emptyset \in \mathcal{F}$.
If $A \in \mathcal{F}$, then its complement $A^{c} \in \mathcal{F}$.
If $A_1, A_2, \ldots$ is a countable collection of sets in $\mathcal{F}$, then their union $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.
Definition 2.18 (Measurable functions) A real-valued function $f$ defined on $\Omega$ is called measurable with respect to a sigma algebra $\mathcal{F}$ in that space if the inverse image of the set $B$, defined as $f^{-1}(B) = \{\omega \in \Omega : f(\omega) \in B\}$, is a set in the $\sigma$-algebra $\mathcal{F}$ for all Borel sets $B$ of $\mathbb{R}$. Borel sets are sets that are constructed from open or closed sets by repeatedly taking countable unions, countable intersections, and relative complements.
Definition 2.19 (Random vector) A random vector $\mathbf{X}$ is any measurable function defined on the probability space $(\Omega, \mathcal{F}, P)$ with values in $\mathbb{R}^{n}$ (Table 2.1).
Measurable functions will be discussed in detail in Section 20.5.
Suppose we have a random vector $\mathbf{X}$ defined on a space $(\Omega, \mathcal{F}, P)$. The sigma algebra generated by $\mathbf{X}$ is the smallest sigma algebra in $\Omega$ that contains all the preimages of Borel sets in $\mathbb{R}^{n}$ through $\mathbf{X}$. That is,
$$\sigma(\mathbf{X}) = \sigma\left(\left\{\mathbf{X}^{-1}(B) : B \text{ a Borel set in } \mathbb{R}^{n}\right\}\right).$$
This abstract concept is necessary to make sure that we may calculate any probability related to the random vector $\mathbf{X}$.
Any random vector has a distribution function, defined similarly to the one-dimensional case. Specifically, if the random vector $\mathbf{X}$ has components $X_1, \ldots, X_n$, its cumulative distribution function, or cdf, is defined as:
$$F(x_1, \ldots, x_n) = P(X_1 \leq x_1, \ldots, X_n \leq x_n).$$
Associated with a random variable and its cdf is another function, called the probability density function (pdf) or probability mass function (pmf). The terms pdf and pmf refer to the continuous and discrete cases of random variables, respectively.
Table 2.1 Examples of random vectors.
Experiment: Toss two dice. Random variable: $X$ = sum of the numbers.
Experiment: Toss a coin 10 times. Random variable: $X$ = number of tails in 10 tosses.
Definition 2.20 (Probability mass function) The pmf of a discrete random variable $X$ is given by
$$p(x) = P(X = x).$$
Definition 2.21 (Probability density function) The pdf $f(x)$ of a continuous random variable $X$ is the function that satisfies
$$F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t)\, dt.$$
We will discuss these notions in detail in Chapter 20.
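The pmf of Definition 2.20 can be illustrated with the two-dice experiment from Table 2.1. A small sketch in Python (the enumeration of outcomes is ours):

```python
from fractions import Fraction
from collections import Counter

# pmf of X = sum of the numbers when tossing two fair dice (Definition 2.20):
# count how many of the 36 equally likely outcomes give each sum
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
pmf = {x: Fraction(c, 36) for x, c in counts.items()}

print(pmf[7])                    # 1/6, the most likely sum
assert sum(pmf.values()) == 1    # probabilities sum to one
```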
Using these concepts, we can define the moments of the distribution. In fact, suppose that $g$ is any function; then we can calculate the expected value of the random variable $g(\mathbf{X})$, when the joint density $f(x_1, \ldots, x_n)$ exists, as:
$$E[g(\mathbf{X})] = \int_{\mathbb{R}^n} g(x_1, \ldots, x_n)\, f(x_1, \ldots, x_n)\, dx_1 \cdots dx_n.$$
Now we can define the moments of the random vector. The first moment is a vector:
$$\boldsymbol{\mu} = E[\mathbf{X}] = \begin{pmatrix} E[X_1] \\ \vdots \\ E[X_n] \end{pmatrix}.$$
The expectation applies to each component in the random vector. Expectations of functions of random vectors are computed just as with univariate random variables. We recall that the expectation of a random variable is its average value.
The second moment requires calculating all the combinations of the components. The result can be presented in matrix form. The second central moment can be presented as the covariance matrix:
$$\Sigma = \operatorname{Cov}(\mathbf{X}) = E\left[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^{\top}\right],$$
where we used the transpose matrix notation, and since $\operatorname{Cov}(X_i, X_j) = \operatorname{Cov}(X_j, X_i)$, the matrix $\Sigma$ is symmetric.
We note that the covariance matrix is positive semidefinite (nonnegative definite), i.e. for any vector $\mathbf{a}$, we have $\mathbf{a}^{\top} \Sigma \mathbf{a} \geq 0$.
Now we explain why the covariance matrix has to be positive semidefinite. Take any vector $\mathbf{a} \in \mathbb{R}^{n}$. Then the product
$$Y = \mathbf{a}^{\top} \mathbf{X} \qquad (2.2)$$
is a one-dimensional random variable, and its variance must be nonnegative. This is because, in the one-dimensional case, the variance of a random variable $Y$ is defined as $\operatorname{Var}(Y) = E[(Y - E[Y])^2]$. We see that the variance is nonnegative for every random variable, and it is equal to zero if and only if the random variable is constant. The expectation of (2.2) is $E[Y] = \mathbf{a}^{\top} \boldsymbol{\mu}$. Then we can write (since for any number $c$, $c^{\top} = c$):
$$\operatorname{Var}(Y) = E\left[\left(\mathbf{a}^{\top}\mathbf{X} - \mathbf{a}^{\top}\boldsymbol{\mu}\right)^{2}\right] = E\left[\mathbf{a}^{\top}(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^{\top}\mathbf{a}\right] = \mathbf{a}^{\top} \Sigma\, \mathbf{a}.$$
Since the variance is always nonnegative, the covariance matrix must be nonnegative definite (or positive semidefinite). We recall that a square symmetric matrix $A$ is positive semidefinite if $\mathbf{a}^{\top} A \mathbf{a} \geq 0$ for all $\mathbf{a}$, rather than strictly positive. This difference is in fact important in the context of random variables, since one may be able to construct a nontrivial linear combination whose variance is equal to zero.
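The identity $\operatorname{Var}(\mathbf{a}^{\top}\mathbf{X}) = \mathbf{a}^{\top}\Sigma\,\mathbf{a}$ can be verified numerically with a sketch in Python using NumPy (the simulated data and the vector $\mathbf{a}$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))       # 500 observations of a 3-dimensional vector
Sigma = np.cov(X, rowvar=False)     # sample covariance matrix (3-by-3)

a = np.array([1.0, -2.0, 0.5])
quad_form = a @ Sigma @ a           # quadratic form a' Sigma a
var_combo = np.var(X @ a, ddof=1)   # sample variance of the combination a'X

# The quadratic form equals the variance of a'X, hence is nonnegative
assert np.isclose(quad_form, var_combo)
assert quad_form >= 0
```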
The covariance matrix is discussed in detail in Chapter 3.
We now present examples of multivariate distributions.
Before we discuss the Dirichlet distribution, we define the Beta distribution.
Definition 2.22 (Beta distribution) A random variable $X$ is said to have a Beta distribution with parameters $\alpha$ and $\beta$ if it has a pdf defined as:
$$f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}, \qquad 0 \leq x \leq 1,$$
where $\alpha > 0$ and $\beta > 0$.
The Dirichlet distribution $\operatorname{Dir}(\boldsymbol{\alpha})$, named after Johann Peter Gustav Lejeune Dirichlet (1805–1859), is a multivariate distribution parameterized by a vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)$ of positive parameters.
Specifically, the joint density of an $n$-dimensional random vector $\mathbf{X} \sim \operatorname{Dir}(\boldsymbol{\alpha})$ is defined as:
$$f(x_1, \ldots, x_n) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^{n} x_i^{\alpha_i - 1}\; \mathbf{1}_{\left\{x_i > 0,\ \sum_{i=1}^{n} x_i = 1\right\}},$$
where $\mathbf{1}$ is an indicator function.
Definition 2.23 (Indicator function) The indicator function of a subset $A$ of a set $S$ is a function
$$\mathbf{1}_A : S \to \{0, 1\}$$
defined as
$$\mathbf{1}_A(x) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases}$$
The components of the random vector $\mathbf{X}$ thus are always positive and have the property $\sum_{i=1}^{n} X_i = 1$. The normalizing constant $B(\boldsymbol{\alpha})$ is the multinomial beta function, which is defined as:
$$B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{n} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{n} \alpha_i\right)},$$
where we used the notation $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)$ and $\Gamma(\cdot)$ for the Gamma function.
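These simplex properties can be checked by simulation: draws from a Dirichlet distribution have positive components that sum to one. A sketch in Python using NumPy's random generator (the parameter vector is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = np.array([2.0, 3.0, 5.0])          # positive parameter vector
samples = rng.dirichlet(alpha, size=1000)  # 1000 draws from Dir(alpha)

# Each draw lies on the simplex: positive components summing to one
assert np.all(samples > 0)
assert np.allclose(samples.sum(axis=1), 1.0)

# The sample mean of component i approaches alpha_i / sum(alpha)
print(samples.mean(axis=0))   # roughly (0.2, 0.3, 0.5) for this alpha
```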
Because the Dirichlet distribution creates