Data Science in Theory and Practice

Maria Cristina Mariani

Description

Explore the foundations of data science with this insightful new resource.

Data Science in Theory and Practice delivers a comprehensive treatment of the mathematical and statistical models useful for analyzing data sets arising in disciplines such as banking, finance, health care, bioinformatics, security, education, and social services. Written in five parts, the book examines some of the most commonly used and fundamental mathematical and statistical concepts that form the basis of data science. The authors go on to analyze various data transformation techniques useful for extracting information from raw data, long memory behavior, and predictive modeling. The book offers readers a multitude of topics, all relevant to the analysis of complex data sets. Along with a robust exploration of the theory underpinning data science, it contains numerous applications to specific practical problems. It also provides examples of code algorithms in R and Python, together with pseudo-algorithms to port the code to any other language.

Ideal for students and practitioners without a strong background in data science, the book covers topics such as:

* Analyses of foundational theoretical subjects, including the history of data science, matrix algebra and random vectors, and multivariate analysis

* A comprehensive examination of time series forecasting, including the different components of time series and transformations to achieve stationarity

* Introductions to both the R and Python programming languages, including basic data types and sample manipulations for both languages

* An exploration of algorithms, including how to write one and how to perform an asymptotic analysis

* A comprehensive discussion of several techniques for analyzing and predicting complex data sets

Perfect for advanced undergraduate and graduate students in Data Science, Business Analytics, and Statistics programs, Data Science in Theory and Practice will also earn a place in the libraries of practicing data scientists, data and business analysts, and statisticians in the private sector, government, and academia.




Table of Contents

Cover

Title Page

Copyright

List of Figures

List of Tables

Preface

1 Background of Data Science

1.1 Introduction

1.2 Origin of Data Science

1.3 Who is a Data Scientist?

1.4 Big Data

2 Matrix Algebra and Random Vectors

2.1 Introduction

2.2 Some Basics of Matrix Algebra

2.3 Random Variables and Distribution Functions

2.4 Problems

3 Multivariate Analysis

3.1 Introduction

3.2 Multivariate Analysis: Overview

3.3 Mean Vectors

3.4 Variance–Covariance Matrices

3.5 Correlation Matrices

3.6 Linear Combinations of Variables

3.7 Problems

4 Time Series Forecasting

4.1 Introduction

4.2 Terminologies

4.3 Components of Time Series

4.4 Transformations to Achieve Stationarity

4.5 Elimination of Seasonality via Differencing

4.6 Additive and Multiplicative Models

4.7 Measuring Accuracy of Different Time Series Techniques

4.8 Averaging and Exponential Smoothing Forecasting Methods

4.9 Problems

5 Introduction to R

5.1 Introduction

5.2 Basic Data Types

5.3 Simple Manipulations – Numbers and Vectors

5.4 Problems

6 Introduction to Python

6.1 Introduction

6.2 Basic Data Types

6.3 Number Type Conversion

6.4 Python Conditions

6.5 Python File Handling: Open, Read, and Close

6.6 Python Functions

6.7 Problems

7 Algorithms

7.1 Introduction

7.2 Algorithm – Definition

7.3 How to Write an Algorithm

7.4 Asymptotic Analysis of an Algorithm

7.5 Examples of Algorithms

7.6 Flowchart

7.7 Problems

8 Data Preprocessing and Data Validations

8.1 Introduction

8.2 Definition – Data Preprocessing

8.3 Data Cleaning

8.4 Data Transformations

8.5 Data Reduction

8.6 Data Validations

8.7 Problems

9 Data Visualizations

9.1 Introduction

9.2 Definition – Data Visualization

9.3 Data Visualization Techniques

9.4 Data Visualization Tools

9.5 Problems

10 Binomial and Trinomial Trees

10.1 Introduction

10.2 The Binomial Tree Method

10.3 Binomial Discrete Model

10.4 Trinomial Tree Method

10.5 Problems

11 Principal Component Analysis

11.1 Introduction

11.2 Background of Principal Component Analysis

11.3 Motivation

11.4 The Mathematics of PCA

11.5 How PCA Works

11.6 Application

11.7 Problems

12 Discriminant and Cluster Analysis

12.1 Introduction

12.2 Distance

12.3 Discriminant Analysis

12.4 Cluster Analysis

12.5 Problems

13 Multidimensional Scaling

13.1 Introduction

13.2 Motivation

13.3 Number of Dimensions and Goodness of Fit

13.4 Proximity Measures

13.5 Metric Multidimensional Scaling

13.6 Nonmetric Multidimensional Scaling

13.7 Problems

14 Classification and Tree‐Based Methods

14.1 Introduction

14.2 An Overview of Classification

14.3 Linear Discriminant Analysis

14.4 Tree‐Based Methods

14.5 Applications

14.6 Problems

15 Association Rules

15.1 Introduction

15.2 Market Basket Analysis

15.3 Terminologies

15.4 The Apriori Algorithm

15.5 Applications

15.6 Problems

16 Support Vector Machines

16.1 Introduction

16.2 The Maximal Margin Classifier

16.3 Classification Using a Separating Hyperplane

16.4 Kernel Functions

16.5 Applications

16.6 Problems

17 Neural Networks

17.1 Introduction

17.2 Perceptrons

17.3 Feed Forward Neural Network

17.4 Recurrent Neural Networks

17.5 Long Short‐Term Memory

17.6 Application

17.7 Significance of Study

17.8 Problems

18 Fourier Analysis

18.1 Introduction

18.2 Definition

18.3 Discrete Fourier Transform

18.4 The Fast Fourier Transform (FFT) Method

18.5 Dynamic Fourier Analysis

18.6 Applications of the Fourier Transform

18.7 Problems

19 Wavelets Analysis

19.1 Introduction

19.2 Discrete Wavelets Transforms

19.3 Applications of the Wavelets Transform

19.4 Problems

20 Stochastic Analysis

20.1 Introduction

20.2 Necessary Definitions from Probability Theory

20.3 Stochastic Processes

20.4 Examples of Stochastic Processes

20.5 Measurable Functions and Expectations

20.6 Problems

21 Fractal Analysis – Lévy, Hurst, DFA, DEA

21.1 Introduction and Definitions

21.2 Lévy Processes

21.3 Lévy Flight Models

21.4 Rescaled Range Analysis (Hurst Analysis)

21.5 Detrended Fluctuation Analysis (DFA)

21.6 Diffusion Entropy Analysis (DEA)

21.7 Application – Characterization of Volcanic Time Series

21.8 Problems

22 Stochastic Differential Equations

22.1 Introduction

22.2 Stochastic Differential Equations

22.3 Examples

22.4 Multidimensional Stochastic Differential Equations

22.5 Simulation of Stochastic Differential Equations

22.6 Problems

23 Ethics: With Great Power Comes Great Responsibility

23.1 Introduction

23.2 Data Science Ethical Principles

23.3 Data Science Code of Professional Conduct

23.4 Application

23.5 Problems

Bibliography

Index

End User License Agreement




Data Science in Theory and Practice

 

Techniques for Big Data Analytics and Complex Data Sets

 

 

 

Maria Cristina Mariani
University of Texas, El Paso
El Paso, United States

Osei Kofi Tweneboah
Ramapo College of New Jersey
Mahwah, United States

Maria Pia Beccar-Varela
University of Texas, El Paso
El Paso, United States

 

This first edition first published 2022

© 2022 John Wiley and Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions

The right of Maria Cristina Mariani, Osei Kofi Tweneboah, and Maria Pia Beccar‐Varela to be identified as the authors of this work has been asserted in accordance with law.

Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office
111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication Data applied for

ISBN: 9781119674689

Cover Design: Wiley
Cover Image: © nobeastsofierce/Shutterstock

List of Figures

Figure 4.1 Time series data of phase arrival times of an earthquake.

Figure 4.2 Time series data of financial returns corresponding to Bank of America (BAC) stock index.

Figure 4.3 Seasonal trend component.

Figure 4.4 Linear trend component. The horizontal axis is time, and the vertical axis is the time series. (a) Linear increasing trend. (b) Linear decreasing trend.

Figure 4.5 Nonlinear trend component. The horizontal axis is time, and the vertical axis is the time series. (a) Nonlinear increasing trend. (b) Nonlinear decreasing trend.

Figure 4.6 Cyclical component (imposed on the underlying trend). The horizontal axis is time, and the vertical axis is the time series.

Figure 7.1 The big O notation.

Figure 7.2 The notation.

Figure 7.3 The notation.

Figure 7.4 Symbols used in flowchart.

Figure 7.5 Flowchart to add two numbers entered by user.

Figure 7.6 Flowchart to find all roots of a quadratic equation.

Figure 7.7 Flowchart.

Figure 8.1 The box plot.

Figure 8.2 Box plot example.

Figure 9.1 Scatter plot of temperature versus ice cream sales.

Figure 9.2 Heatmap of handwritten digit data.

Figure 9.3 Map of earthquake magnitudes recorded in Chile.

Figure 9.4 Spatial distribution of earthquake magnitudes (Mariani et al. 2016).

Figure 9.5 Number of text messages sent.

Figure 9.6 Normal Q–Q plot.

Figure 9.7 Risk of loan default. Source: Tableau Viz Gallery.

Figure 9.8 Top five publishing markets. Source: Modified from International Publishers Association – Annual Report.

Figure 9.9 High yield defaulted issuer and volume trends. Source: Based on Fitch High Yield Default Index, Bloomberg.

Figure 9.10 Statistics page for popular movies and cinema locations. Source: Google Charts.

Figure 10.1 One‐step binomial tree for the return process.

Figure 11.1 Height versus weight.

Figure 11.2 Visualizing low‐dimensional data.

Figure 11.3 2D data set.

Figure 11.4 First PCA axis.

Figure 11.5 Second PCA axis.

Figure 11.6 New axis.

Figure 11.7 Scatterplot of Royal Dutch Shell stock versus Exxon Mobil stock.

Figure 12.1 Classification (by quadrant) of earthquakes and explosions using the Chernoff and Kullback–Leibler differences.

Figure 12.2 Classification (by quadrant) of Lehman Brothers collapse and Flash crash event using the Chernoff and Kullback–Leibler differences.

Figure 12.3 Clustering results for the earthquake and explosion series based on symmetric divergence using PAM algorithm.

Figure 12.4 Clustering results for the Lehman Brothers collapse, Flash crash event, Citigroup (2009), and IAG (2011) stock data based on symmetric divergence using the PAM algorithm.

Figure 13.1 Scatter plot of data in Table 13.1

Figure 16.1 The ‐plane and several other horizontal planes.

Figure 16.2 The ‐plane and several parallel planes.

Figure 16.3 The plane .

Figure 16.4 Two class problem when data is linearly separable.

Figure 16.5 Two class problem when data is not linearly separable.

Figure 16.6 ROC curve for linear SVM.

Figure 16.7 ROC curve for nonlinear SVM.

Figure 17.1 Single hidden layer feed‐forward neural networks.

Figure 17.2 Simple recurrent neural network.

Figure 17.3 Long short‐term memory unit.

Figure 17.4 Philippines (PSI). (a) Basic RNN. (b) LSTM.

Figure 17.5 Thailand (SETI). (a) Basic RNN. (b) LSTM.

Figure 17.6 United States (NASDAQ). (a) Basic RNN. (b) LSTM.

Figure 17.7 JPMorgan Chase & Co. (JPM). (a) Basic RNN. (b) LSTM.

Figure 17.8 Walmart (WMT). (a) Basic RNN. (b) LSTM.

Figure 18.1 3D power spectra of the daily returns from the four analyzed stock companies. (a) Discover. (b) Microsoft. (c) Walmart. (d) JPM Chase.

Figure 18.2 3D power spectra of the returns (generated per minute) from the four analyzed stock companies. (a) Discover. (b) Microsoft. (c) Walmart. (d) JPM Chase.

Figure 19.1 Time‐frequency image of explosion 1 recorded by ANMO (Table 19.2).

Figure 19.2 Time‐frequency image of earthquake 1 recorded by ANMO (Table 19.2).

Figure 19.3 Three‐dimensional graphic information of explosion 1 recorded by ANMO (Table 19.2).

Figure 19.4 Three‐dimensional graphic information of earthquake 1 recorded by ANMO (Table 19.2).

Figure 19.5 Time‐frequency image of explosion 2 recorded by TUC (Table 19.3).

Figure 19.6 Time‐frequency image of earthquake 2 recorded by TUC (Table 19.3).

Figure 19.7 Three‐dimensional graphic information of explosion 2 recorded by TUC (Table 19.3).

Figure 19.8 Three‐dimensional graphic information of earthquake 2 recorded by TUC (Table 19.3).

Figure 21.1 for volcanic eruptions 1 and 2.

Figure 21.2 DFA for volcanic eruptions 1 and 2.

Figure 21.3 DEA for volcanic eruptions 1 and 2.

List of Tables

Table 2.1 Examples of random vectors.

Table 3.1 Ramus Bone Length at Four Ages for 20 Boys.

Table 4.1 Time series data of the volume of sales over a six-hour period.

Table 4.2 Simple moving average forecasts.

Table 4.3 Time series data used in Example 4.6.

Table 4.4 Weighted moving average forecasts.

Table 4.5 Trend projection of weighted moving average forecasts.

Table 4.6 Exponential smoothing forecasts of volume of sales.

Table 4.7 Exponential smoothing forecasts from Example 4.9.

Table 4.8 Adjusted exponential smoothing forecasts.

Table 6.1 Numbers.

Table 6.2 Files mode in Python.

Table 7.1 Common asymptotic notations.

Table 9.1 Temperature versus ice cream sales.

Table 12.1 Events information.

Table 12.2 Discriminant scores for earthquakes and explosions groups.

Table 12.3 Discriminant scores for Lehman Brothers collapse and Flash crash event.

Table 12.4 Discriminant scores for Citigroup in 2009 and IAG stock in 2011.

Table 13.1 Data matrix.

Table 13.2 Distance matrix.

Table 13.3 Stress and goodness of fit.

Table 13.4 Data matrix.

Table 14.1 Models' performances on the test dataset with 23 variables using AUC and mean square error (MSE) values for the five models.

Table 14.2 Top 10 variables selected by the Random forest algorithm.

Table 14.3 Performance for the four models using the top 10 features from model Random forest on the test dataset.

Table 15.1 Market basket transaction data.

Table 15.2 A binary representation of market basket transaction data.

Table 15.3 Grocery transactional data.

Table 15.4 Transaction data.

Table 16.1 Models' performances on the test dataset.

Table 18.1 Percentage of power for Discover data.

Table 18.2 Percentage of power for JPM data.

Table 18.3 Percentage of power for Microsoft data.

Table 18.4 Percentage of power for Walmart data.

Table 19.1 Determining and for .

Table 19.2 Percentage of total power (energy) for Albuquerque, New Mexico (ANMO) seismic station.

Table 19.3 Percentage of total power (energy) for Tucson, Arizona (TUC) seismic station.

Table 21.1 Moments of the Poisson distribution with intensity .

Table 21.2 Moments of the distribution.

Table 21.3 Scaling exponents of Volcanic Data time series.

Preface

This textbook is dedicated to practitioners, graduate students, and advanced undergraduate students who are interested in Data Science, Business Analytics, and Statistical and Mathematical Modeling in disciplines such as Finance, Geophysics, and Engineering. The book is designed to serve as a textbook for several courses in these areas and as a reference guide for practitioners in the industry.

The book has a strong theoretical background and several applications to specific practical problems. It contains numerous techniques applicable to modern data science and other disciplines. In today's world, many fields are confronted with increasingly large amounts of complex data. Financial, healthcare, and geophysical data sampled with high frequency are no exception. These staggering amounts of data pose special challenges to the world of finance and to other disciplines such as healthcare and geophysics, as traditional models and information technology tools can be poorly suited to grapple with their size and complexity. Probabilistic modeling, mathematical modeling, and statistical data analysis attempt to discover order in apparent disorder; this textbook may serve as a guide to various new systematic approaches for implementing these quantitative activities on complex data sets.

The textbook is split into five distinct parts. In the first part, Foundations of Data Science, we discuss some fundamental mathematical and statistical concepts that form the basis for the study of data science. In the second part, Data Science in Practice, we present a brief introduction to R and Python programming and to writing algorithms; in addition, various techniques for data preprocessing, validation, and visualization are discussed. In the third part, Data Mining and Machine Learning Techniques for Complex Data Sets, and the fourth part, Advanced Models for Big Data Analytics and Complex Data Sets, we provide comprehensive techniques for analyzing and predicting different types of complex data sets.

We conclude this book with a discussion of ethics in data science: With great power comes great responsibility.

The authors express their deepest gratitude to Wiley for making the publication a reality.

El Paso, TX and Mahwah, NJ, USA
September 2021

Maria Cristina Mariani
Osei Kofi Tweneboah
Maria Pia Beccar‐Varela

1 Background of Data Science

1.1 Introduction

Data science is one of the most promising and high‐demand career paths for skilled professionals in the 21st century. Currently, successful data professionals understand that they must advance past the traditional skills of analyzing large amounts of data, statistical learning, and programming. In order to explore and discover useful information for their companies or organizations, data scientists must have a good grasp of the full spectrum of the data science life cycle and the flexibility and understanding needed to maximize returns at each phase of the process.

Data science is a “concept to unify statistics, mathematics, computer science, data analysis, machine learning and their related methods” in order to find trends in, understand, and analyze actual phenomena with data. Due to the coronavirus disease (COVID-19) pandemic, many colleges, institutions, and large organizations asked their nonessential employees to work virtually. The resulting virtual meetings have provided colleges and companies with plenty of data. Some aspects of the data suggest that virtual fatigue is on the rise. Virtual fatigue is defined as the burnout associated with overdependence on virtual platforms for communication. Data science provides tools to explore and reveal the best and worst aspects of virtual work.

In the past decade, data scientists have become necessary assets and are present in almost all institutions and organizations. These professionals are data‐driven individuals with high‐level technical skills who are capable of building complex quantitative algorithms to organize and synthesize large amounts of information used to answer questions and drive strategy in their organization. This is coupled with the experience in communication and leadership needed to deliver tangible results to various stakeholders across an organization or business.

Data scientists need to be curious and results‐oriented, with good domain‐specific knowledge and communication skills that allow them to explain highly technical results to their nontechnical counterparts. They possess a strong quantitative background in statistics and mathematics as well as programming knowledge, with focuses in data warehousing, mining, and modeling to build and analyze algorithms. In fact, data scientists are a group of analytical data experts who have the technical skills to solve complex problems and the curiosity to explore how problems should be solved.

1.2 Origin of Data Science

Data scientists are part mathematician, part statistician, and part computer scientist. Because they span both the business and information technology (IT) worlds, they are in high demand and well paid. Data scientists were not very popular some decades ago; however, their sudden popularity reflects how businesses now think about “Big data.” Big data is defined as a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be handled by traditional data‐processing application software. That bulky mass of unstructured information can no longer be ignored and forgotten. It is a virtual gold mine that helps boost revenue, as long as there is someone who explores and discovers business insights that no one thought to look for before. Many data scientists began their careers as statisticians, business analysts, or data analysts. However, as big data began to grow and evolve, those roles evolved as well. Data is no longer just an add‐on for IT to handle. It is vital information that requires analysis, creative curiosity, and the ability to translate high‐tech ideas into innovative ways to make a profit and to help practitioners make informed decisions.

1.3 Who is a Data Scientist?

The term “data scientist” was coined as recently as 2008, when companies realized the need for data professionals who are skilled in organizing and analyzing massive amounts of data. Data scientists are quantitative and analytical data experts who utilize their skills in both technology and social science to find trends and manage the data around them. With the growth of big data integration in business, they have emerged at the forefront of the data revolution. They are part mathematicians, statisticians, computer programmers, and analysts who are equipped with a diverse and wide‐ranging skill set, balancing knowledge in several computer programming languages with advanced experience in statistical learning and data visualization.

There is no definitive job description when it comes to the data scientist role. However, we outline here some of the tasks they typically perform:

Collecting and recording large amounts of unruly data and transforming it into a more usable format.

Solving business‐related problems using data‐driven techniques.

Working with a variety of programming languages, including SAS, Minitab, R, and Python.

Having a strong background in mathematics and statistics, including statistical tests and distributions.

Staying on top of quantitative and analytical techniques such as machine learning, deep learning, and text analytics.

Communicating and collaborating with both IT and business.

Looking for order and patterns in data, as well as spotting trends that enable businesses to make informed decisions.

Some of the useful tools that every data scientist or practitioner needs are outlined below:

Data preparation:

The process of cleaning and transforming raw data into suitable formats prior to processing and analysis.

Data visualization:

The presentation of data in a pictorial or graphical format so it can be easily analyzed.

Statistical learning or Machine learning:

A branch of artificial intelligence based on mathematical algorithms and automation. Artificial intelligence (AI) refers to the process of building smart machines capable of performing tasks that typically require human intelligence. They are designed to make decisions, often using real‐time data. Real‐time data is information that is passed along to the end user as soon as it is gathered.

Deep learning:

An area of statistical learning research that uses data to model complex abstractions.

Pattern recognition:

Technology that recognizes patterns in data (often used interchangeably with machine learning).

Text analytics:

The process of examining unstructured data and drawing meaning out of written communication.

We will discuss all of the above tools in detail in this book. There are several scientific and programming skills that every data scientist should have. They must be able to utilize key technical tools and skills, including R, Python, SAS, SQL, Tableau, and several others. Because the technology is ever‐growing, data scientists must always learn new and emerging techniques to stay on top of their game. We will discuss R and Python programming in Chapters 5 and 6.
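As a small illustration of the data preparation tool mentioned above, the following sketch uses the Python pandas library to clean a tiny, hypothetical data set; the column names and values are invented for illustration only:

```python
import pandas as pd

# Hypothetical raw data with common problems: a missing identifier,
# inconsistent text casing/whitespace, and numbers stored as strings.
raw = pd.DataFrame({
    "customer": ["Alice", "alice ", None, "Bob"],
    "spend": ["100", "250", "75", None],
})

clean = (
    raw.dropna(subset=["customer"])  # drop rows with no identifier
       .assign(
           customer=lambda d: d["customer"].str.strip().str.title(),
           spend=lambda d: pd.to_numeric(d["spend"]),  # strings -> numbers
       )
)
print(clean)
```

Real data preparation pipelines also handle deduplication, outliers, and type validation, but the pattern of chaining small, explicit transformations is the same.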

1.4 Big Data

Big data is a term applied to ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by classical data‐processing tools. In particular, it refers to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process the data with low latency. Sources of big data include data from sensors, stock markets, devices, video/audio, networks, log files, transactional applications, the web, and social media, with much of it generated in real time and at a very large scale.

In recent times, the use of the term “big data” (both stored and real‐time) tends to refer to the use of user behavior analytics (UBA), predictive analytics, or certain other advanced data analytics methods that extract value from data. UBA solutions look at patterns of human behavior and then apply algorithms and statistical analysis to detect meaningful anomalies in those patterns, anomalies that indicate potential threats: for example, detection of hackers, insider threats, targeted attacks, financial fraud, and several others.

Predictive analytics deals with the process of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends. Generally, predictive analytics does not tell you what will happen in the future. However, it forecasts what might happen in the future with some degree of certainty. Predictive analytics goes hand in hand with big data: businesses and organizations collect large amounts of real‐time customer data, and predictive analytics uses this historical data, combined with customer insight, to forecast future events. Predictive analytics helps organizations use big data to move from a historical view to a forward‐looking perspective of the customer. In this book, we will discuss several methods for analyzing big data.

1.4.1 Characteristics of Big Data

Big data has one or more of the following characteristics: high volume, high velocity, high variety, and high veracity. That is, the data sets are characterized by huge amounts (volume) of frequently updated data (velocity) in various types, such as numeric, textual, audio, images, and videos (variety), with high quality (veracity). We briefly discuss each characteristic.

Volume: Volume describes the quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not.

Velocity: Velocity describes the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in both stored and real‐time forms. Compared to small data, big data is produced more continually (it could be every nanosecond, second, minute, hour, etc.). Two types of velocity related to big data are the frequency of generation and the frequency of handling, recording, and reporting.

Variety: Variety describes the types and formats of the data, which helps the people who analyze it to effectively use the resulting insight. Big data draws from different formats and completes missing pieces through data fusion. Data fusion is the technique of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source.

Veracity: Veracity describes the quality of the data and the data value. The quality of the data obtained can greatly affect the accuracy of the analyzed results.

In the next subsection we will discuss some big data architectures. A comprehensive study of this topic can be found in the application architecture guide of the Microsoft technical documentation.

1.4.2 Big Data Architectures

Big data architectures are designed to handle the ingestion, processing, and analysis of data that is too large or complex for classical data-processing application tools. Some popular big data architectures are the Lambda architecture, Kappa architecture and the Internet of Things (IoT). We refer the reader to the Microsoft technical documentation on Big data architectures for a detailed discussion on the different architectures. Almost all big data architectures include all or some of the following components:

Data sources: All big data solutions begin with one or more data sources. Some common data sources include the following: application data stores such as relational databases, static files produced by applications such as web server log files, and real‐time data sources such as Internet of Things (IoT) devices.

Data storage: Data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats. This kind of store is often called a data lake. A data lake is a storage repository that allows one to store structured and unstructured data at any scale until it is needed.

Batch processing: Since the data sets are enormous, a big data solution often must process data files using long‐running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Normally, these jobs involve reading source files, processing them, and writing the output to new files. Options include running U‐SQL jobs or using Java, Scala, R, or Python programs. U‐SQL is a data processing language that merges the benefits of SQL with the expressive power of one's own code.

Real‐time message ingestion: If the solution includes real‐time sources, the architecture must include a way to capture and store real‐time messages for stream processing. This might be a simple data store, where incoming messages are stored into a folder for processing. However, many solutions need a message ingestion store to act as a buffer for messages and to support scale‐out processing, reliable delivery, and other message queuing semantics.

Stream processing: After obtaining real‐time messages, the solution must process them by filtering, aggregating, and preparing the data for analysis. The processed stream data is then written to an output sink.

Analytical data store: Several big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools. The analytical data store used to serve these queries can be a Kimball‐style relational data warehouse, as observed in most classical business intelligence (BI) solutions. Alternatively, the data could be presented through a low‐latency NoSQL technology, such as HBase, or an interactive Hive database that provides a metadata abstraction over data files in the distributed data store.

Analysis and reporting: The goal of most big data solutions is to provide insights into the data through analysis and reporting. Users can analyze the data using mathematical and statistical models as well as data visualization techniques. Analysis and reporting can also take the form of interactive data exploration by data scientists or data analysts.

Orchestration: Several big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or move the results to a report or dashboard.

2Matrix Algebra and Random Vectors

2.1 Introduction

The matrix algebra and random vectors presented in this chapter will enable us to precisely state statistical models. We will begin by discussing some basic concepts that will be essential throughout this chapter. For more details on matrix algebra, please consult Axler (2015).

2.2 Some Basics of Matrix Algebra

2.2.1 Vectors

Definition 2.1   (Vector)   A vector $\mathbf{x}$ is an array of real numbers $x_1, x_2, \ldots, x_n$, and it is written as:

$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$

Definition 2.2   (Scalar multiplication of vectors)   The product of a scalar $c$ and a vector $\mathbf{x}$ is the vector obtained by multiplying each entry in the vector by the scalar:

$$c\mathbf{x} = (cx_1, cx_2, \ldots, cx_n)^T$$

Definition 2.3   (Vector addition)   The sum of two vectors $\mathbf{x}$ and $\mathbf{y}$ of the same size is the vector obtained by adding corresponding entries in the vectors:

$$\mathbf{x} + \mathbf{y} = (x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n)^T$$

so that $\mathbf{x} + \mathbf{y}$ is the vector with $i$th element $x_i + y_i$.
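Definitions 2.2 and 2.3 can be checked numerically. A minimal sketch in Python, using NumPy arrays as vectors (the particular numbers are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
c = 2.0

# Scalar multiplication (Definition 2.2): each entry multiplied by c
print(c * x)   # [2. 4. 6.]

# Vector addition (Definition 2.3): entrywise sum
print(x + y)   # [5. 7. 9.]
```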

2.2.2 Matrices

Definition 2.4   (Matrix)   Let $m$ and $n$ denote positive integers. An $m$‐by‐$n$ matrix $A$ is a rectangular array of real numbers with $m$ rows and $n$ columns:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

The notation $a_{ij}$ denotes the entry in row $i$, column $j$ of $A$. In other words, the first index refers to the row number and the second index refers to the column number.

Example 2.1

then .

Definition 2.5   (Transpose of a matrix)   The transpose operation of a matrix changes the columns into rows, i.e. in matrix notation $(A^T)_{ij} = a_{ji}$, where “$T$” denotes transpose.

Example 2.2

Definition 2.6   (Scalar multiplication of a matrix)   The product of a scalar $c$ and a matrix $A$ is the matrix obtained by multiplying each entry in the matrix by the scalar:

$$cA = (c\,a_{ij})$$

In other words, $(cA)_{ij} = c\,a_{ij}$.

Definition 2.7   (Matrix addition)   The sum of two matrices $A$ and $B$ of the same size is the matrix obtained by adding corresponding entries in the matrices:

$$A + B = (a_{ij} + b_{ij})$$

In other words, $(A + B)_{ij} = a_{ij} + b_{ij}$.
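The transpose, scalar multiplication, and matrix addition operations of Definitions 2.5 to 2.7 map directly onto NumPy, as this short sketch with arbitrary 2 by 2 matrices shows:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

print(A.T)     # transpose (Definition 2.5): rows become columns
print(3 * A)   # scalar multiplication (Definition 2.6): every entry times 3
print(A + B)   # matrix addition (Definition 2.7): entrywise sum
```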

Definition 2.8   (Matrix multiplication)   Suppose $A$ is an $m$‐by‐$n$ matrix and $B$ is an $n$‐by‐$p$ matrix. Then $AB$ is defined to be the $m$‐by‐$p$ matrix whose entry in row $i$, column $j$, is given by the following equation:

$$(AB)_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$$

In other words, the entry in row $i$, column $j$, of $AB$ is computed by taking row $i$ of $A$ and column $j$ of $B$, multiplying together corresponding entries, and then summing. The number of columns of $A$ must be equal to the number of rows of $B$.

Example 2.3

then
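A quick way to verify the row‐times‐column rule of Definition 2.8 is to compare NumPy's built‐in matrix product against the explicit sum; the matrices below are chosen arbitrarily for illustration:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # 2-by-3
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])      # 3-by-2

C = A @ B                   # 2-by-2 product

# Entry (i, j) is the sum over k of A[i, k] * B[k, j]
manual = sum(A[0, k] * B[k, 1] for k in range(3))
assert manual == C[0, 1]
print(C)   # [[ 4  5]
           #  [10 11]]
```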

Definition 2.9   (Square matrix)   A matrix is said to be a square matrix if the number of rows is the same as the number of columns.

Definition 2.10   (Symmetric matrix)   A square matrix $A$ is said to be symmetric if $A = A^T$, or in matrix notation $a_{ij} = a_{ji}$ for all $i$ and $j$.

Example 2.4

The matrix is symmetric; the matrix is not symmetric.

Definition 2.11   (Trace)   For any square matrix $A$, the trace of $A$, denoted by $\mathrm{tr}(A)$, is defined as the sum of the diagonal elements, i.e.

$$\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii}$$

Example 2.5

Let be a matrix with

Then

We remark that the trace is only defined for square matrices.
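NumPy computes the trace of Definition 2.11 directly; a minimal check on an arbitrary 3 by 3 matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 4.0],
              [0.0, 4.0, 5.0]])

# Trace (Definition 2.11): sum of the diagonal entries 2 + 3 + 5
print(np.trace(A))   # 10.0
```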

Definition 2.12   (Determinant of a matrix)   Suppose $A$ is an $n$‐by‐$n$ matrix,

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}$$

The determinant of $A$, denoted $\det(A)$ or $|A|$, is defined by

$$\det(A) = \sum_{j=1}^{n} a_{1j} C_{1j}$$

where the $C_{ij}$ are referred to as the “cofactors” and are computed from

$$C_{ij} = (-1)^{i+j} \det(M_{ij})$$

The term $M_{ij}$ is known as the “minor matrix” and is the matrix you get if you eliminate row $i$ and column $j$ from matrix $A$.

Finding the determinant depends on the dimension of the matrix ; determinants only exist for square matrices.

Example 2.6

For a 2 by 2 matrix

$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$

we have

$$\det(A) = ad - bc$$

Example 2.7

For a 3 by 3 matrix

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}$$

we have

$$\det(A) = a_{11}(a_{22}a_{33} - a_{23}a_{32}) - a_{12}(a_{21}a_{33} - a_{23}a_{31}) + a_{13}(a_{21}a_{32} - a_{22}a_{31})$$
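The cofactor expansion of Definition 2.12 can be sketched as a short recursive Python function. It is meant to illustrate the definition; for anything beyond small matrices, `np.linalg.det` is the practical choice:

```python
import numpy as np

def det_cofactor(A):
    """Determinant by cofactor expansion along the first row (Definition 2.12)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        # Minor matrix M_{1j}: remove row 0 and column j
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)
        total += (-1) ** j * A[0, j] * det_cofactor(minor)
    return total

A = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
print(det_cofactor(A), np.linalg.det(A))   # both give -3.0 (up to rounding)
```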

Definition 2.13   (Positive definite matrix)   A square matrix $A$ is called positive definite if, for any vector $\mathbf{x}$ not identically zero, we have

$$\mathbf{x}^T A \mathbf{x} > 0$$

Example 2.8

Let be a 2 by 2 matrix

To show that is positive definite, by definition

Therefore, is positive definite.

Definition 2.14   (Positive semidefinite matrix)   A matrix $A$ is called positive semidefinite (or nonnegative definite) if, for any vector $\mathbf{x}$, we have

$$\mathbf{x}^T A \mathbf{x} \geq 0$$

Definition 2.15   (Negative definite matrix)   A square matrix $A$ is called negative definite if, for any vector $\mathbf{x}$ not identically zero, we have

$$\mathbf{x}^T A \mathbf{x} < 0$$

Example 2.9

Let be a 2 by 2 matrix

To show that is negative definite, by definition

Therefore, is negative definite.

Definition 2.16   (Negative semidefinite matrix)   A matrix $A$ is called negative semidefinite if, for any vector $\mathbf{x}$, we have

$$\mathbf{x}^T A \mathbf{x} \leq 0$$

We state the following theorem without proof.

Theorem 2.1

A 2 by 2 symmetric matrix

$$A = \begin{pmatrix} a & b \\ b & c \end{pmatrix}$$

is:

positive definite if and only if $a > 0$ and $\det(A) = ac - b^2 > 0$,

negative definite if and only if $a < 0$ and $\det(A) = ac - b^2 > 0$,

indefinite if and only if $\det(A) = ac - b^2 < 0$.
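The conditions of Theorem 2.1 translate into a few lines of Python; the helper name `classify_2x2` is invented for this sketch, and the boundary case $\det(A) = 0$, which the theorem does not classify, is reported separately:

```python
def classify_2x2(a, b, c):
    """Classify the symmetric matrix [[a, b], [b, c]] using Theorem 2.1."""
    det = a * c - b * b
    if det < 0:
        return "indefinite"
    if a > 0 and det > 0:
        return "positive definite"
    if a < 0 and det > 0:
        return "negative definite"
    return "semidefinite or singular"   # det == 0 boundary, not covered by the theorem

print(classify_2x2(2, 1, 2))    # det = 3 > 0, a > 0  -> positive definite
print(classify_2x2(-2, 1, -2))  # det = 3 > 0, a < 0  -> negative definite
print(classify_2x2(1, 2, 1))    # det = -3 < 0        -> indefinite
```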

2.3 Random Variables and Distribution Functions

We begin this section with the definition of a σ‐algebra.

Definition 2.17   (σ‐algebra)   A σ‐algebra $\mathcal{F}$ is a collection of subsets of $\Omega$ satisfying the following conditions:

$\emptyset \in \mathcal{F}$.

If $A \in \mathcal{F}$, then its complement $A^c \in \mathcal{F}$.

If $A_1, A_2, \ldots$ is a countable collection of sets in $\mathcal{F}$, then their union $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.

Definition 2.18   (Measurable functions)   A real‐valued function $f$ defined on $\Omega$ is called measurable with respect to a sigma algebra $\mathcal{F}$ in that space if the inverse image of the set $B$, defined as $f^{-1}(B) = \{\omega \in \Omega : f(\omega) \in B\}$, is a set in the σ‐algebra $\mathcal{F}$, for all Borel sets $B$ of $\mathbb{R}$. Borel sets are sets that are constructed from open or closed sets by repeatedly taking countable unions, countable intersections, and relative complements.

Definition 2.19   (Random vector)   A random vector $\mathbf{X}$ is any measurable function defined on the probability space $(\Omega, \mathcal{F}, P)$ with values in $\mathbb{R}^n$ (Table 2.1).

Measurable functions will be discussed in detail in Section 20.5.

Suppose we have a random vector $\mathbf{X}$ defined on a space $(\Omega, \mathcal{F}, P)$. The sigma algebra generated by $\mathbf{X}$ is the smallest sigma algebra in $\Omega$ that contains all the preimages of Borel sets in $\mathbb{R}^n$ through $\mathbf{X}$. That is

$$\sigma(\mathbf{X}) = \sigma\left(\left\{\mathbf{X}^{-1}(B) : B \text{ a Borel set in } \mathbb{R}^n\right\}\right)$$

This abstract concept is necessary to make sure that we may calculate any probability related to the random vector $\mathbf{X}$.

Any random vector has a distribution function, defined similarly to the one‐dimensional case. Specifically, if the random vector $\mathbf{X}$ has components $X_1, \ldots, X_n$, its cumulative distribution function or cdf is defined as:

$$F(x_1, \ldots, x_n) = P(X_1 \leq x_1, \ldots, X_n \leq x_n)$$

Associated with a random variable and its cdf is another function, called the probability density function (pdf) or probability mass function (pmf). The terms pdf and pmf refer to the continuous and discrete cases of random variables, respectively.

Table 2.1 Examples of random vectors.

Experiment | Random variable

Toss two dice | $X$ = sum of the numbers

Toss a coin 10 times | $X$ = number of tails in 10 tosses

Definition 2.20   (Probability mass function)   The pmf of a discrete random variable $X$ is given by

$$f(x) = P(X = x)$$

Definition 2.21   (Probability density function)   The pdf $f(x)$ of a continuous random variable $X$ is the function that satisfies

$$F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t)\,dt$$
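As a concrete illustration of a pmf, the two‐dice experiment from Table 2.1 can be enumerated exhaustively in Python:

```python
from itertools import product
from collections import Counter

# pmf of X = sum of the numbers when tossing two dice (Table 2.1):
# enumerate all 36 equally likely outcomes and count each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {x: n / 36 for x, n in counts.items()}

print(pmf[7])   # 6/36, the most likely sum
assert abs(sum(pmf.values()) - 1.0) < 1e-12   # probabilities sum to one
```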

We will discuss these notions in detail in Chapter 20.

Using these concepts, we can define the moments of the distribution. In fact, suppose that $g$ is any function; then we can calculate the expected value of the random variable $g(\mathbf{X})$ when the joint density $f$ exists as:

$$E[g(\mathbf{X})] = \int_{\mathbb{R}^n} g(x_1, \ldots, x_n) f(x_1, \ldots, x_n)\, dx_1 \cdots dx_n$$

Now we can define the moments of the random vector. The first moment is a vector

$$\boldsymbol{\mu} = E[\mathbf{X}] = \left(E[X_1], E[X_2], \ldots, E[X_n]\right)^T$$

The expectation applies to each component in the random vector. Expectations of functions of random vectors are computed just as with univariate random variables. We recall that expectation of a random variable is its average value.

The second moment requires calculating all the combinations of the components. The result can be presented in matrix form. The second central moment can be presented as the covariance matrix:

(2.1) $$\boldsymbol{\Sigma} = \mathrm{Cov}(\mathbf{X}) = E\left[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T\right]$$

where we used the transpose matrix notation, and since $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$, the matrix $\boldsymbol{\Sigma}$ is symmetric.

We note that the covariance matrix $\boldsymbol{\Sigma}$ is positive semidefinite (nonnegative definite), i.e. for any vector $\mathbf{a}$, we have $\mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a} \geq 0$.

Now we explain why the covariance matrix has to be positive semidefinite. Take any vector $\mathbf{a}$. Then the product

(2.2) $$Y = \mathbf{a}^T \mathbf{X}$$

is a random variable (one dimensional) and its variance must be nonnegative. This is because in the one‐dimensional case, the variance of a random variable $Y$ is defined as $\mathrm{Var}(Y) = E\left[(Y - E[Y])^2\right]$. We see that the variance is nonnegative for every random variable, and it is equal to zero if and only if the random variable is constant. The expectation of (2.2) is $E[Y] = \mathbf{a}^T \boldsymbol{\mu}$. Then we can write (since for any scalar $c$, $c^T = c$)

$$\mathrm{Var}(Y) = E\left[\left(\mathbf{a}^T \mathbf{X} - \mathbf{a}^T \boldsymbol{\mu}\right)^2\right] = E\left[\mathbf{a}^T (\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T \mathbf{a}\right] = \mathbf{a}^T \boldsymbol{\Sigma} \mathbf{a}$$

Since the variance is always nonnegative, the covariance matrix must be nonnegative definite (or positive semidefinite). We recall that a square symmetric matrix $A$ is positive semidefinite if $\mathbf{a}^T A \mathbf{a} \geq 0$ for every $\mathbf{a}$. This distinction is in fact important in the context of random variables, since you may be able to construct a linear combination which is not always constant but whose variance is equal to zero.
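The argument above can be checked numerically: a sample covariance matrix computed with NumPy is symmetric, and the quadratic form a^T Sigma a is nonnegative for any vector a (the sample size and dimension below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))       # 1000 draws of a 3-dimensional random vector
Sigma = np.cov(X, rowvar=False)      # 3-by-3 sample covariance matrix

# Symmetric, and a^T Sigma a >= 0 for an arbitrary vector a
assert np.allclose(Sigma, Sigma.T)
a = rng.normal(size=3)
assert a @ Sigma @ a >= 0

# Equivalently, all eigenvalues of Sigma are nonnegative (up to rounding)
print(np.linalg.eigvalsh(Sigma))
```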

The covariance matrix is discussed in detail in Chapter 3.

We now present examples of multivariate distributions.

2.3.1 The Dirichlet Distribution

Before we discuss the Dirichlet distribution, we define the Beta distribution.

Definition 2.22   (Beta distribution)   A random variable $X$ is said to have a Beta distribution with parameters $\alpha$ and $\beta$ if it has a pdf defined as:

$$f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}, \quad 0 < x < 1$$

where $\alpha > 0$ and $\beta > 0$.
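A minimal Python sketch of the Beta pdf, using the Gamma function from the standard library; the normalizing constant follows the standard Gamma-ratio parameterization, and the helper name `beta_pdf` is ours:

```python
import math

def beta_pdf(x, alpha, beta):
    """pdf of the Beta(alpha, beta) distribution on (0, 1) (Definition 2.22)."""
    const = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return const * x ** (alpha - 1) * (1 - x) ** (beta - 1)

# Beta(1, 1) is the uniform distribution on (0, 1): the pdf is constant 1
print(beta_pdf(0.3, 1, 1))   # 1.0
print(beta_pdf(0.5, 2, 2))   # 1.5
```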

The Dirichlet distribution, named after Johann Peter Gustav Lejeune Dirichlet (1805–1859), is a multivariate distribution parameterized by a vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)$ of positive parameters.

Specifically, the joint density of an $n$‐dimensional random vector $\mathbf{X} = (X_1, \ldots, X_n)$ is defined as:

$$f(x_1, \ldots, x_n) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^{n} x_i^{\alpha_i - 1}\, \mathbf{1}_{\left\{x_i > 0,\; \sum_{i=1}^{n} x_i = 1\right\}}$$

where $\mathbf{1}_{\{\cdot\}}$ is an indicator function.

Definition 2.23   (Indicator function)   The indicator function of a subset $A$ of a set $X$ is a function

$$\mathbf{1}_A : X \to \{0, 1\}$$

defined as

$$\mathbf{1}_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases}$$

The components of the random vector $\mathbf{X}$ thus are always positive and have the property $\sum_{i=1}^{n} X_i = 1$. The normalizing constant $B(\boldsymbol{\alpha})$ is the multinomial beta function, which is defined as:

$$B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{n} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{n} \alpha_i\right)}$$

where we used the notation $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)$ and $\Gamma$ for the Gamma function.
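NumPy can draw from the Dirichlet distribution directly, which makes it easy to verify that each sample has positive components summing to one; the parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = [2.0, 3.0, 5.0]                  # positive parameter vector
samples = rng.dirichlet(alpha, size=5)   # 5 draws, each a 3-component vector

# Each draw has positive components that sum to 1
print(samples)
assert np.all(samples > 0)
assert np.allclose(samples.sum(axis=1), 1.0)
```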

Because the Dirichlet distribution creates

3Multivariate Analysis

3.1 Introduction