A General Introduction to Data Analytics - João Moreira - E-Book

Description

A guide to the principles and methods of data analysis that does not require knowledge of statistics or programming. A General Introduction to Data Analytics is an essential guide to understanding and using data analytics. This book is written in easy-to-understand terms and does not require familiarity with statistics or programming. The authors, noted experts in the field, explain the intuition behind the basic data analytics techniques. The text also contains exercises and illustrative examples. Designed to be easily accessible to non-experts, the book motivates the necessity of analyzing data. It explains how to visualize and summarize data, and how to find natural groups and frequent patterns in a dataset. The book also explores predictive tasks, be they classification or regression. Finally, it discusses popular data analytics applications, such as mining the web, information retrieval, social network analysis, working with text, and recommender systems. The learning resources offer:

* A guide to the reasoning behind data mining techniques
* A unique illustrative example that extends throughout all the chapters
* Exercises at the end of each chapter and larger projects at the end of each of the text's two main parts

Together with these learning resources, the book can serve as the guide for a 13-week course, with one chapter per course topic. The book was written in a format that allows the main data analytics concepts to be understood by non-mathematicians, non-statisticians and non-computer scientists interested in getting an introduction to data science. A General Introduction to Data Analytics is a basic guide to data analytics written in highly accessible terms.

Table of Contents

Cover

Preface

Part I: Introductory Background

1 What Can We Do With Data?

1.1 Big Data and Data Science

1.2 Big Data Architectures

1.3 Small Data

1.4 What is Data?

1.5 A Short Taxonomy of Data Analytics

1.6 Examples of Data Use

1.7 A Project on Data Analytics

1.8 How this Book is Organized

1.9 Who Should Read this Book

Part II: Getting Insights from Data

2 Descriptive Statistics

2.1 Scale Types

2.2 Descriptive Univariate Analysis

2.3 Descriptive Bivariate Analysis

2.4 Final Remarks

2.5 Exercises

3 Descriptive Multivariate Analysis

3.1 Multivariate Frequencies

3.2 Multivariate Data Visualization

3.3 Multivariate Statistics

3.4 Infographics and Word Clouds

3.5 Final Remarks

3.6 Exercises

4 Data Quality and Preprocessing

4.1 Data Quality

4.2 Converting to a Different Scale Type

4.3 Converting to a Different Scale

4.4 Data Transformation

4.5 Dimensionality Reduction

4.6 Final Remarks

4.7 Exercises

5 Clustering

5.1 Distance Measures

5.2 Clustering Validation

5.3 Clustering Techniques

5.4 Final Remarks

5.5 Exercises

6 Frequent Pattern Mining

6.1 Frequent Itemsets

6.2 Association Rules

6.3 Behind Support and Confidence

6.4 Other Types of Pattern

6.5 Final Remarks

6.6 Exercises

7 Cheat Sheet and Project on Descriptive Analytics

7.1 Cheat Sheet of Descriptive Analytics

7.2 Project on Descriptive Analytics

Part III: Predicting the Unknown

8 Regression

8.1 Predictive Performance Estimation

8.2 Finding the Parameters of the Model

8.3 Technique and Model Selection

8.4 Final Remarks

8.5 Exercises

9 Classification

9.1 Binary Classification

9.2 Predictive Performance Measures for Classification

9.3 Distance‐based Learning Algorithms

9.4 Probabilistic Classification Algorithms

9.5 Final Remarks

9.6 Exercises

10 Additional Predictive Methods

10.1 Search‐based Algorithms

10.2 Optimization‐based Algorithms

10.3 Final Remarks

10.4 Exercises

11 Advanced Predictive Topics

11.1 Ensemble Learning

11.2 Algorithm Bias

11.3 Non‐binary Classification Tasks

11.4 Advanced Data Preparation Techniques for Prediction

11.5 Description and Prediction with Supervised Interpretable Techniques

11.6 Exercises

12 Cheat Sheet and Project on Predictive Analytics

12.1 Cheat Sheet on Predictive Analytics

12.2 Project on Predictive Analytics

Part IV: Popular Data Analytics Applications

13 Applications for Text, Web and Social Media

13.1 Working with Texts

13.2 Recommender Systems

13.3 Social Network Analysis

13.4 Exercises

Appendix A: A Comprehensive Description of the CRISP‐DM Methodology

A.1 Business Understanding

A.2 Data Understanding

A.3 Data Preparation

A.4 Modeling

A.5 Evaluation

A.6 Deployment

References

Index

End User License Agreement

List of Tables

Chapter 01

Table 1.1 Data set of our private contact list.

Table 1.2 Family relations between contacts.

Chapter 02

Table 2.1 Data set of our private list of contacts with weight and height.

Table 2.2 Univariate absolute and relative frequencies for “company” attribute.

Table 2.3 Univariate absolute and relative frequencies for height.

Table 2.4 Univariate plots.

Table 2.5 Location univariate statistics for weight.

Table 2.6 Central tendency statistics according to the type of scale.

Table 2.7 Dispersion univariate statistics for the “weight” attribute.

Table 2.8 The rank values for the attributes “weight” and “height”.

Chapter 03

Table 3.1 Data set of our private list of contacts with weight and height.

Table 3.2 Location univariate statistics for quantitative attributes.

Table 3.3 Dispersion univariate statistics for quantitative attributes.

Table 3.4 Covariance matrix for quantitative attributes.

Table 3.5 Pearson correlation matrix for quantitative attributes.

Chapter 04

Table 4.1 Filling of missing values.

Table 4.2 Removal of redundant objects.

Table 4.3 Data set of our private list of contacts with weight and height.

Table 4.4 Food preferences of our colleagues.

Table 4.5 Conversion from nominal scale to relative scale.

Table 4.6 Conversion from the nominal scale to binary values.

Table 4.7 Conversion from the nominal scale to the relative scale.

Table 4.8 Conversion from the nominal scale to the relative scale.

Table 4.9 Conversion from the nominal scale to the relative scale.

Table 4.10 Conversion from the ordinal scale to the relative or absolute scale.

Table 4.11 Conversion from the ordinal scale to the relative scale.

Table 4.12 Euclidean distances of ages expressed in years.

Table 4.13 Euclidean distance with age expressed in decades.

Table 4.14 Normalization using min–max rescaling.

Table 4.15 Normalization using standardization.

Table 4.16 Euclidean distance with normalized values.

Table 4.17 How much each friend earns as income and spends on dinners per year.

Table 4.18 Correlation between each predictive attribute and the target attribute.

Table 4.19 Predictive performance of a classifier for each predictive attribute.

Table 4.20 Filling of missing values.

Chapter 05

Table 5.1 Simple social network data set.

Table 5.2 Example of bag of words vectors.

Table 5.3 The clusters to which each friend belongs according to Figure 5.7a–f.

Table 5.4 Normalized centroids of the example dataset.

Table 5.5 Advantages and disadvantages of k‐means.

Table 5.6 Advantages and disadvantages of DBSCAN.

Table 5.7 First iteration of agglomerative hierarchical clustering.

Table 5.8 Second iteration of agglomerative hierarchical clustering.

Table 5.9 Advantages and disadvantages of agglomerative hierarchical clustering.

Chapter 06

Table 6.1 Data about the preferred cuisine of contacts.

Table 6.2 Transactional data created from Table 6.1.

Table 6.3 Combinatorial explosion with growing size of the set of items.

Table 6.4 Transactional data from Table 6.2 in vertical format, corresponding to the first iteration of the Eclat algorithm.

Table 6.5 An example of a two‐way contingency table for two itemsets.

Table 6.6 Two‐way contingency tables for two itemsets for two groups A and B of data and the whole data set (combined groups A and B). The presence or absence of itemsets in transactions is marked by Yes and No, respectively.

Table 6.7 Example of a sequence database and its items.

Chapter 07

Table 7.1 Summarizing methods for a single attribute.

Table 7.2 Summarizing methods for two attributes.

Table 7.3 Distance measures.

Table 7.4 Clustering methods.

Table 7.5 Time complexity and memory requirements of frequent itemset mining approaches.

Table 7.6 Measures generally related to frequent pattern mining approaches.

Table 7.7 Attributes of the Breast Cancer Wisconsin data set.

Table 7.8 Statistics of the attributes of the Breast Cancer Wisconsin data set.

Chapter 08

Table 8.1 Data set of our contact list, with weights and heights.

Table 8.2 Advantages and disadvantages of linear regression.

Table 8.3 Advantages and disadvantages of ridge regression.

Table 8.4 Advantages and disadvantages of the lasso.

Table 8.5 My social network data using the height as target.

Chapter 09

Table 9.1 Simple labeled social network data set for model induction.

Table 9.2 Simple labeled social network data set 2.

Table 9.3 Simple labeled social network data set 3.

Table 9.4 Extended simple labeled social network data set.

Table 9.5 Advantages and disadvantages of k‐NN.

Table 9.6 Advantages and disadvantages of logistic regression.

Table 9.7 Advantages and disadvantages of NB

Chapter 10

Table 10.1 Simple labeled social network data set 3.

Table 10.2 Advantages and disadvantages of decision trees.

Table 10.3 Advantages and disadvantages of MARS.

Table 10.4 Advantages and disadvantages of ANNs.

Table 10.5 Advantages and disadvantages of DL.

Table 10.6 Advantages and disadvantages of SVM.

Chapter 11

Table 11.1 Advantages and disadvantages of bagging.

Table 11.2 Advantages and disadvantages of random forests.

Table 11.3 Advantages and disadvantages of Adaboost.

Chapter 12

Table 12.1 A cheat sheet on predictive algorithms.

Table 12.2 Predictive attributes of the Polish company insolvency data set.

Table 12.3 Statistics of the Polish company insolvency data set.

Table 12.4 K‐NN confusion matrix using five predictive attributes.

Table 12.5 C4.5 confusion matrix using five predictive attributes.

Table 12.6 Random forest confusion matrix using all predictive attributes.

Chapter 13

Table 13.1 Training set of labeled texts.

Table 13.2 Results of stemming.

Table 13.3 Results of applying a stemmer.

Table 13.4 Stems after removal of stop words.

Table 13.5 Objects with their stems.

Table 13.6 Item recommendation scenario.

Table 13.7 Rating prediction scenario.

Table 13.8 Data for a content‐based RT for the user Eve from Table 13.7.

Table 13.9 Cosine vector similarities between the users from Table 13.6.

Table 13.10 Pearson correlation similarities between users in Table 13.7.

Table 13.11 The adjacency matrix for the network in Figure 13.5.

Table 13.12 The adjacency matrix from Table 13.11 squared, showing the counts of paths of length two between pairs of nodes.

Table 13.13 Basic properties of nodes from the network in Figure 13.5.

Table 13.14 The distance matrix – distances between nodes – for the graph in Figure 13.5.

List of Illustrations

Chapter 01

Figure 1.1 A prediction model to classify someone as either good or bad company.

Figure 1.2 The use of different methodologies on data analytics through time.

Figure 1.3 The CRISP‐DM methodology

Chapter 02

Figure 2.1 The main areas of statistics.

Figure 2.2 The relation between the four types of scales: absolute, relative, ordinal and nominal.

Figure 2.3 An example of an area chart used to compare several probability density functions.

Figure 2.4 Price absolute frequency distributions with (histogram) and without (bar chart) cell definition.

Figure 2.5 Empirical and probability distribution functions.

Figure 2.6 Stacked bar plot for “company” split by “gender”.

Figure 2.7 Location statistics on the absolute frequency plot for the attribute “weight”.

Figure 2.8 Box‐plot for the attribute “height”.

Figure 2.9 Central tendency statistics in asymmetric and symmetric unimodal distributions.

Figure 2.10 An example of the Likert scale.

Figure 2.11 Combination of a histogram and a box‐plot for the “height” attribute.

Figure 2.12 The probability density function.

Figure 2.13 The probability density function for different standard deviations.

Figure 2.14 3D histogram for attributes “weight” and “height”.

Figure 2.15 Scatter plot for the attributes “weight” and “height”.

Figure 2.16 Three examples of correlation between two attributes.

Figure 2.17 The scatter plot for two attributes.

Figure 2.18 Contingency table with absolute joint frequencies for “company” and “gender”.

Figure 2.19 Mosaic plot for “company” and “gender”.

Figure 2.20 Scatter plot.

Chapter 03

Figure 3.1 Plot of objects with three attributes.

Figure 3.2 Two alternatives for a plot of three attributes, where the third attribute is qualitative.

Figure 3.3 Plot for three attributes from the contacts data set.

Figure 3.4 Plot for four attributes of the friends data set using color for the fourth attribute.

Figure 3.5 Parallel coordinate plot for three attributes.

Figure 3.6 Parallel coordinate plot for five attributes.

Figure 3.7 Parallel coordinate plots for multiple attributes: left, using a different style of line for contacts who are good and bad company; right, with the order of the attributes changed as well.

Figure 3.8 Star plot with the value of each attribute for each object in our contacts data set.

Figure 3.9 Star plot with the value of each attribute for each object in contacts data set.

Figure 3.10 Visualization of the objects in our contacts data set using Chernoff faces.

Figure 3.11 Set of box plots, one for each attribute.

Figure 3.12 Matrix of scatter plots for quantitative attributes.

Figure 3.13 Matrix of scatter plots for quantitative attributes with additional Pearson correlation values.

Figure 3.14 Correlograms for Pearson correlation between the attributes “maxtemp”, “weight”, “height” and “years”.

Figure 3.15 Heatmap for the short version of the contacts data set.

Figure 3.16 Infographic of the level of qualifications in England (Contains public sector information licensed under the Open Government Licence v3.0.).

Figure 3.17 Text visualization using a word cloud.

Chapter 04

Figure 4.1 Data set with and without noise.

Figure 4.2 Data set with outliers.

Figure 4.3 Outlier detection based on the interquartile range distance.

Figure 4.4 Two alternatives for a plot of three attributes, the last of which is qualitative.

Figure 4.5 Principal components obtained by PCA for the short version of the contacts data set.

Figure 4.6 Components obtained by PCA and ICA for the short version of the contacts data set.

Chapter 05

Figure 5.1 Applying clustering to a dataset.

Figure 5.2 Alternative clustering of the data set.

Figure 5.3 Euclidean and Manhattan distances.

Figure 5.4 Dynamic time warping.

Figure 5.5 Transformation of images into matrices and vectors.

Figure 5.6 (a) Partitional and (b) hierarchical clustering.

Figure 5.7 K‐means, where the big symbols represent the centroids. The example proceeds step by step according to the k‐means algorithm: (a) step 4; (b) 1st iteration of step 6; (c) 1st iteration of step 7; (d) 2nd iteration of step 6; (e) 2nd iteration of step 7; and (f) 3rd iteration of step 6. The algorithm stops at the 3rd iteration (f) because no instance changes symbol between the 2nd (d) and 3rd (f) iterations of step 6.

Figure 5.8 Examples of convex and non‐convex shapes.

Figure 5.9 Variation of within‐groups sum of squares with number of clusters.

Figure 5.10 DBSCAN for the example data set using a minimum number of instances of two.

Figure 5.11 Comparison of k‐means and DBSCAN: (a) k‐means with a given number of clusters; (b) DBSCAN with a given minimum number of instances and a given epsilon.

Figure 5.12 Single, complete and average linkage criteria.

Figure 5.13 Effect of different linkage criteria on sample data set: (a) single; (b) complete; (c) average; (d) Ward.

Figure 5.14 Dendrogram defined by agglomerative hierarchical clustering using the average as linkage criterion.

Figure 5.15 Dendrogram.

Figure 5.16 Space invaders.

Chapter 06

Figure 6.1 Itemsets organized into a lattice according to subset‐superset relations. The members are abbreviations for the five cuisine types introduced in Table 6.2, i.e. A for the Arabic cuisine, I for Indian, etc.

Figure 6.2 The Apriori principle on the data from Table 6.2 (or 6.1), illustrated using an enumeration tree.

Figure 6.3 Building an FP‐tree on the data from Table 6.2 (or 6.1). Each node contains an identifier of an item and its count in the given path.

Figure 6.4 The FP‐growth process of finding frequent itemsets ending in item O but not containing item A from the FP‐tree (Figure 6.3) built from the data in Table 6.2.

Figure 6.5 Maximal (shaded) and closed (with solid edges) itemsets generated from the data in Table 6.2.

Figure 6.6 Association rules lattice corresponding to the frequent itemset {I,M,O} found in the data (Table 6.2).

Chapter 07

Figure 7.1 Relationship between frequent, closed and maximal patterns.

Figure 7.2 Centroids for k‐means, where the benign breast mass is Cluster 1.

Figure 7.3 Centroids for k‐means, where the benign breast mass is Cluster 1.

Figure 7.4 Centroids for k‐means, where the benign breast mass is Clusters 1 and 2.

Figure 7.5 The elbow curve for the breast cancer problem using the average of the within‐groups sum of squares.

Chapter 08

Figure 8.1 Simple linear regression for the social network data.

Figure 8.2 Holdout validation method.

Figure 8.3 Random sub‐sampling validation method.

Figure 8.4 k‐fold cross validation method.

Figure 8.5 Leave one out method.

Figure 8.6 Bootstrap method.

Figure 8.7 Approximately homogeneous variance between diameter and height and nonhomogeneous variance between diameter and age. This data is about the shellfish Haliotis moluscus, also known as abalone. It is a food source in several human cultures. The shells are used as decorative items.

Figure 8.8 Performance of different models when trained and tested on different instances of the same data set. Refer to main text for details.

Figure 8.9 Models from the Figure 8.8: left, average models; center, linear regression models; right, polynomial models.

Figure 8.10 The bias–variance trade‐off.

Chapter 09

Figure 9.1 Simple binary classification task.

Figure 9.2 Classification model induced for the binary classification task.

Figure 9.3 New binary classification task.

Figure 9.4 Binary classification task with two predictive attributes.

Figure 9.5 Classification model induced by a classification algorithm for the classification task with two predictive attributes.

Figure 9.6 Linearly separable classification task with three predictive attributes.

Figure 9.7 Confusion matrix for a binary classification task.

Figure 9.8 Data distribution according to predictive attribute values, true class and predicted class for a fictitious dataset.

Figure 9.9 Predictive performance measures for a binary classification task.

Figure 9.10 Confusion matrix for the example dataset.

Figure 9.11 ROC graph for three classifiers.

Figure 9.12 Classifier AUC with different ROC values.

Figure 9.13 Different AUC situations.

Figure 9.14 Influence of the value of k in k‐NN algorithm classification.

Figure 9.15 Case‐based reasoning, adapted from [29].

Figure 9.16 The age empirical distributions for people who are good and bad company.

Chapter 10

Figure 10.1 Example of a decision tree.

Figure 10.2 Input space partition by a decision tree.

Figure 10.3 Graphical representation of the CART classification tree for our social data set in Table 8.5.

Figure 10.4 Rules representation of the CART classification tree for our social data set from Table 8.5.

Figure 10.5 Example of a prediction produced by a CART decision tree.

Figure 10.6 Solving an artificial data set with (a) MLR, (b) CART, (c) model trees and (d) MARS.

Figure 10.7 Detail of a model tree and MARS using an artificial data set.

Figure 10.8 MLP neural network with two hidden layers.

Figure 10.9 How the MLP training affects the orientation and position of separating hyperplanes and the orientation and position of convex and non‐convex separating regions.

Figure 10.10 Orientation and position of separating hyperplanes, and orientation and position of convex and non‐convex separating regions of a trained MLP network.

Figure 10.11 Separation regions created by (left) an MLP network trained with backpropagation and (right) a decision tree induction algorithm for the same data set.

Figure 10.12 Large margin defined by SVM support vectors.

Figure 10.13 Decision borders found by (left) perceptron and (right) SVM learning algorithms.

Figure 10.14 Increase of separation margin by allowing some objects to be placed inside the separation margin.

Figure 10.15 Example of the use of a kernel function to transform a non‐linearly separable classification task in a two‐dimensional space to a linearly separable classification task in a three‐dimensional space.

Figure 10.16 The soft margin in a regression problem and how the error is calculated.

Chapter 11

Figure 11.1 Ensemble with a parallel combination of classifiers.

Figure 11.2 Ensemble with a sequential combination of classifiers.

Figure 11.3 Ensemble combining parallel and sequential approaches.

Figure 11.4 Example of a random forest predictive model.

Figure 11.5 Representational bias for three different classification algorithms.

Figure 11.6 Classification model induced by a decision tree algorithm on a data set with two predictive attributes.

Figure 11.7 Example of a one‐class classification task.

Figure 11.8 Example of a multi‐class classification task.

Figure 11.9 Example of a ranking classification task.

Figure 11.10 Example of a multi‐label classification task.

Figure 11.11 Example of a hierarchical classification task.

Chapter 13

Figure 13.1 Text about the food preferences of a friend.

Figure 13.2 Regression tree model learned from the training data in Table 13.8.

Figure 13.3 Representation of users and items in a common, latent two‐dimensional space from Example 13.9.

Figure 13.4 Edge types in networks: (a) undirected; (b) weighted, where the thickness of an edge is proportional to its weight; (c) directed; and (d) directed and weighted.

Figure 13.5 An example social network. The nodes correspond to our friends from the example in Table 1.1; each person is indicated by their initial.

Figure 13.6 Part of a social network from Figure 13.5.

Figure 13.7 Example with three networks.


A General Introduction to Data Analytics

João Mendes Moreira

University of Porto

André C. P. L. F. de Carvalho

University of São Paulo

Tomáš Horváth

Eötvös Loránd University in Budapest
Pavol Jozef Šafárik University in Košice

This edition first published 2019
© 2019 John Wiley & Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of João Moreira, André de Carvalho, and Tomáš Horváth to be identified as the author(s) of this work has been asserted in accordance with law.

Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office
111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty
In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: Moreira, João, 1969– author. | Carvalho, André Carlos Ponce de Leon Ferreira, author. | Horváth, Tomáš, 1976– author.
Title: A general introduction to data analytics / by João Mendes Moreira, André C. P. L. F. de Carvalho, Tomáš Horváth.
Description: Hoboken, NJ : John Wiley & Sons, 2019. | Includes bibliographical references and index.
Identifiers: LCCN 2017060728 (print) | LCCN 2018005929 (ebook) | ISBN 9781119296256 (pdf) | ISBN 9781119296263 (epub) | ISBN 9781119296249 (cloth)
Subjects: LCSH: Mathematical statistics–Methodology. | Electronic data processing. | Data mining.
Classification: LCC QA276.4 (ebook) | LCC QA276.4 .M664 2018 (print) | DDC 519.50285–dc23
LC record available at https://lccn.loc.gov/2017060728

Cover image: © agsandrew/Shutterstock
Cover design by Wiley

To the women at home that make my life better: Mamã, Yá and Yé – João
To my family, Valeria, Beatriz, Gabriela and Mariana – André
To my wife Danielle – Tomáš

Preface

We are living in a period of history that will certainly be remembered as one where information began to be instantaneously obtainable, services were tailored to individual criteria, and people did what made them feel good (if it did not put their lives at risk). Every year, machines are able to do more and more things that improve our quality of life. More data is available than ever before, and will become even more so. This is a time when we can extract more information from data than ever before, and benefit more from it.

In different areas of business and in different institutions, new ways to collect data are continuously being created. Old documents are being digitized, new sensors count the number of cars passing along motorways and extract useful information from them, our smartphones are informing us where we are at each moment and what new opportunities are available, and our favorite social networks register to whom we are related or what things we like.

Whatever area we work in, new data is available: data on how students evaluate professors; data on the evolution of diseases and the best treatment options per patient; data on soil, humidity levels and the weather, enabling us to produce more food of better quality; data on the macro economy, our investments and stock market indicators over time, enabling a fairer distribution of wealth; and data on the things we purchase, allowing us to purchase more effectively and at lower cost.

Students in many different domains feel the need to take advantage of the data they have. New courses on data analytics have been proposed in many different programs, from biology to information science, from engineering to economics, from social sciences to agronomy, all over the world.

The first books on data analytics that appeared some years ago were written by data scientists for other data scientists or for data science students. The majority of the people interested in these subjects were computing and statistics students, and the books on data analytics were written mainly for them. Nowadays, more and more people are interested in learning data analytics. Students of economics, management, biology, medicine, sociology, engineering, and some other subjects are willing to learn about data analytics. This book intends not only to provide a new, friendlier textbook for computing and statistics students, but also to open data analytics to those students who may know nothing about computing or statistics, but want to learn these subjects in a simple way. Those who have already studied subjects such as statistics will recognize some of the content described in this book, such as descriptive statistics. Students from computing will be familiar with pseudocode.

After reading this book, it is not expected that you will feel like a data scientist with the ability to create new methods, but it is expected that you will feel like a data analytics practitioner, able to drive a data analytics project, using the right methods to solve real problems.

João Mendes Moreira
University of Porto, Porto, Portugal

André C. P. L. F. de Carvalho
University of São Paulo, São Carlos, Brazil

Tomáš Horváth
Eötvös Loránd University in Budapest
Pavol Jozef Šafárik University in Košice
October, 2017

Acknowledgments

The authors would like to thank Bruno Almeida Pimentel, Edésio Alcobaça Neto, Everlândio Fernandes, Victor Alexandre Padilha and Victor Hugo Barella for their useful comments.

Over the last several months, we have been in contact with several people from Wiley: Jon Gurstelle, Executive Editor on Statistics; Kathleen Pagliaro, Assistant Editor; Samantha Katherine Clarke and Kshitija Iyer, Project Editors; and Katrina Maceda, Production Editor. To all these wonderful people, we owe a deep sense of gratitude, especially now that this project has been completed.

Lastly, we would like to thank our families for their constant love, support, patience, and encouragement.

J. A. T.

Presentational Conventions

Definition: The definitions are presented in the format shown here.

Special sections and formats: Whenever a method is described, three different sections are presented:

Assessing and evaluating results: how can we assess the results of a method? How to interpret them? This section is all about answering these questions.

Setting the hyper‐parameters: each method has its own hyper‐parameters that must be set. This section explains how to set them.

Advantages and disadvantages: a table summarizes the positive and negative characteristics of a given method.

About the Companion Website

This book is accompanied by a companion website:

www.wiley.com/go/moreira/dataanalytics

The website includes:

Presentation slides for instructors

Part I: Introductory Background

1 What Can We Do With Data?

Until recently, researchers working with data analysis struggled to obtain data for their experiments. Recent advances in the technology of data processing, data storage and data transmission, associated with advanced and intelligent computer software, reducing costs and increasing capacity, have changed this scenario. This is the time of the Internet of Things, where the aim is to have everything, or almost everything, connected. Data previously produced on paper are now online. Each day, a larger quantity of data is generated and consumed. Whenever you post a comment on your social network, upload a photograph, some music or a video, navigate through the Internet, or add a comment to an e‐commerce web site, you are contributing to the data increase. Additionally, machines, financial transactions and sensors such as security cameras are increasingly gathering data from very diverse and widespread sources.

In 2012, it was estimated that the amount of data available in the world doubles each year [1]. Another estimate, from 2014, predicted that by 2020 all information would be digitized, eliminated or reinvented in 80% of the processes and products of the previous decade [2]. In a third report, from 2015, it was predicted that mobile data traffic would be almost 10 times larger by 2020 [3]. The result of all these rapid increases of data is named by some the “data explosion”.

Despite the impression that this can give – that we are drowning in data – there are several benefits from having access to all these data. These data provide a rich source of information that can be transformed into new, useful, valid and human‐understandable knowledge. Thus, there is a growing interest in exploring these data to extract this knowledge, using it to support decision making in a wide variety of fields: agriculture, commerce, education, environment, finance, government, industry, medicine, transport and social care. Several companies around the world are realizing the gold mine they have and the potential of these data to support their work, reduce waste and dangerous and tedious work activities, and increase the value of their products and their profits.

The analysis of these data to extract such knowledge is the subject of a vibrant area known as data analytics, or simply “analytics”. You can find several definitions of analytics in the literature. The definition adopted here is:

Analytics

The science that analyzes crude data to extract useful knowledge (patterns) from them.

This process can also include data collection, organization, pre‐processing, transformation, modeling and interpretation.

Analytics as a knowledge area involves input from many different areas. The idea of generalizing knowledge from a data sample comes from a branch of statistics known as inductive learning, an area of research with a long history. With the advances of personal computers, the use of computational resources to solve problems of inductive learning became more and more popular. Computational capacity has been used to develop new methods. At the same time, new problems have appeared that require a good knowledge of computer science. For instance, the ability to perform a given task with more computational efficiency has become a subject of study for people working in computational statistics.

In parallel, several researchers have dreamed of being able to reproduce human behavior using computers. These were people from the area of artificial intelligence. They also used statistics for their research, but the idea of reproducing human and biological behavior in computers was an important source of motivation. For instance, reproducing how the human brain works with artificial neural networks has been studied since the 1940s; reproducing how ants work with ant colony optimization algorithms, since the 1990s. The term machine learning (ML) appeared in this context as the “field of study that gives computers the ability to learn without being explicitly programmed,” according to Arthur Samuel in 1959 [4].

In the 1990s, a new term appeared with a slightly different meaning: data mining (DM). The 1990s was the decade of the appearance of business intelligence tools, as a consequence of data storage facilities with larger and cheaper capacity. Companies started to collect more and more data, aiming to either solve or improve business operations, for example by detecting credit card fraud, by advising the public of road network constraints in cities, or by improving relations with clients using more efficient techniques of relational marketing. The challenge was to be able to mine the data in order to extract the knowledge necessary for a given task. This is the goal of data mining.

1.1 Big Data and Data Science

In the first years of the 21st century, the term big data appeared. Big data, a technology for data processing, was initially defined by the “three Vs”, although some more Vs have been proposed since. The first three Vs allow us to define a taxonomy of big data. They are: volume, variety and velocity. Volume is concerned with how to store big data: data repositories for large amounts of data. Variety is concerned with how to put together data from different sources. Velocity concerns the ability to deal with data arriving very fast, in streams known as data streams. Analytics is also about discovering knowledge from data streams, going beyond the velocity component of big data.

Another term that has appeared and is sometimes used as a synonym for big data is data science. According to Provost and Fawcett [5], big data are data sets that are too large to be managed by conventional data‐processing technologies, requiring the development of new techniques and tools for data storage, processing and transmission. These tools include, for example, MapReduce, Hadoop, Spark and Storm. But data volume is not the only characterization of big data. The word “big” can refer to the number of data sources, to the importance of the data, to the need for new processing techniques, to how fast data arrive, to the combination of different sets of data so they can be analyzed in real time, and to its ubiquity, since any company, nonprofit organization or individual has access to data now.

Thus big data is more concerned with technology. It provides a computing environment, not only for analytics, but also for other data processing tasks. These tasks include financial transaction processing, web data processing and georeferenced data processing.

Data science is concerned with the creation of models able to extract patterns from complex data and the use of these models in real‐life problems. Data science extracts meaningful and useful knowledge from data, with the support of suitable technologies. It has a close relationship to analytics and data mining. Data science goes beyond data mining by providing a knowledge extraction framework, including statistics and visualization.

Therefore, while big data gives support to data collection and management, data science applies techniques to these data to discover new and useful knowledge: big data collects and data science discovers. Other terms such as knowledge discovery or extraction, pattern recognition, data analysis, data engineering, and several others are also used. The definition we use of data analytics covers all these areas that are used to extract knowledge from data.

1.2 Big Data Architectures

As data increase in size, velocity and variety, new computer technologies become necessary. These new technologies, which include hardware and software, must be easily expanded as more data are processed. This property is known as scalability. One way to obtain scalability is by distributing the data processing tasks into several computers, which can be combined into clusters of computers. The reader should not confuse clusters of computers with clusters produced by clustering techniques, which are techniques from analytics in which a data set is partitioned to find groups within it.

Even if processing power is expanded by combining several computers in a cluster, creating a distributed system, conventional software for distributed systems usually cannot cope with big data. One of the limitations is the efficient distribution of data among the different processing and storage units. To deal with these requirements, new software tools and techniques have been developed.

One of the first techniques developed for big data processing using clusters was MapReduce. MapReduce is a programming model that has two steps: map and reduce. The most famous implementation of MapReduce is called Hadoop.

MapReduce divides the data set into parts – chunks – and stores in the memory of each cluster computer the chunk of the data set needed by this computer to accomplish its processing task. As an example, suppose that you need to calculate the average salary of 1 billion people and you have a cluster with 1000 computers, each with a processing unit and a storage memory. The people can be divided into 1000 chunks – subsets – with data from 1 million people each. Each chunk can be processed independently by one of the computers. The results produced by each of these computers (the average salary of 1 million people) can be averaged, returning the final salary average.

To efficiently solve a big data problem, a distributed system must meet the following requirements:

Make sure that no chunk of data is lost and the whole task is concluded. If one or more computers fail, their tasks and the corresponding data chunks must be taken over by another computer in the cluster.

Repeat the same task, and the corresponding data chunk, in more than one cluster computer; this is called redundancy. Thus, if one or more computers fail, a redundant computer carries on with the task.

Computers that have had faults can return to the cluster again when they are fixed.

Computers can be easily removed from the cluster or extra ones included in it as the processing demand changes.

A solution incorporating these requirements must hide from the data analyst the details of how the software works, such as how the data chunks and tasks are divided among the cluster computers.

1.3 Small Data

In the opposite direction from big data technologies and methods, there is a movement towards more personal, subjective analysis of chunks of data, termed “small data”. Small data is a data set whose volume and format allows its processing and analysis by a person or a small organization. Thus, instead of collecting data from several sources, with different formats, and generated at increasing velocities, creating large data repositories and processing facilities, small data favors the partition of a problem into small packages, which can be analyzed by different people or small groups in a distributed and integrated way.

People are continuously producing small data as they perform their daily activities, be it navigating the web, buying a product in a shop, undergoing medical examinations or using apps on their mobile phones. When these data are collected to be stored and processed in large data servers they become big data. To be characterized as small data, a data set must have a size that allows its full understanding by a user.

The type of knowledge sought in big and small data is also different, with the first looking for correlations and the second for causality relations. While big data provide tools that allow companies to understand their customers, small data tools try to help customers to understand themselves. Thus, big data is concerned with customers, products and services, and small data is concerned with the individuals that produced the data.

1.4 What is Data?

But what is data about? Data, in the information age, are a large set of bits encoding numbers, texts, images, sounds, videos, and so on. Unless we add information to data, they are meaningless. When we add information, giving a meaning to them, these data become knowledge. But before data become knowledge, typically, they pass through several steps where they are still referred to as data, despite being a bit more organized; that is, they have some information associated with them.

Let us see the example of data collected from a private list of acquaintances or contacts.

Information as presented in Table 1.1, usually referred to as tabular data, is characterized by the way data are organized. In tabular data, data are organized in rows and columns, where each column represents a characteristic of the data and each row represents an occurrence of the data. A column is referred to as an attribute or, with the same meaning, a feature, while a row is referred to as an instance, or with the same meaning, an object.

Table 1.1 Data set of our private contact list.

Contact     Age   Educational level   Company
Andrew      55    1.0                 Good
Bernhard    43    2.0                 Good
Carolina    37    5.0                 Bad
Dennis      82    3.0                 Good
Eve         23    3.2                 Bad
Fred        46    5.0                 Good
Gwyneth     38    4.2                 Bad
Hayden      50    4.0                 Bad
Irene       29    4.5                 Bad
James       42    4.1                 Good
Kevin       35    4.5                 Bad
Lea         38    2.5                 Good
Marcus      31    4.8                 Bad
Nigel       71    2.3                 Good

Instance or Object

Examples of the concept we want to characterize.

Example 1.1

In the example in Table 1.1, we intend to characterize people in our private contact list. Each member is, in this case, an instance or object. It corresponds to a row of the table.

Attribute or Feature

Attributes, also called features, are characteristics of the instances.

Example 1.2

In Table 1.1, contact, age, education level and company are four different attributes.

The majority of the chapters in this book expect the data to be in tabular format; that is, already organized by rows and columns, each row representing an instance and each column representing an attribute. However, a table can be organized differently, having the instances per column and the attributes per row.

There are, however, data that are not possible to represent in a single table.

Example 1.3

As an example, if some of the contacts are relatives of other contacts, a second table, as shown in Table 1.2, representing the family relationships, would be necessary. You should note that each person referred to in Table 1.2 also exists in Table 1.1, i.e., there are relations between attributes of different tables.

Table 1.2 Family relations between contacts.

Friend   Father   Mother   Sister
Eve      Andrew   Hayden   Irene
Irene    Andrew   Hayden   Eve

Data sets represented by several tables, making clear the relations between these tables, are called relational data sets. This information is easily handled using relational databases. In this book, only simple forms of relational data will be used. This is discussed in each chapter whenever necessary.

Example 1.4

In our example, data is split into two tables, one with the individual data of each contact (Table 1.1) and the other with the data about the family relations between them (Table 1.2).

1.5 A Short Taxonomy of Data Analytics

Now that we know what data are, we will look at what we can do with them. A natural taxonomy that exists in data analytics is:

Descriptive analytics: summarize or condense data to extract patterns

Predictive analytics: extract models from data to be used for future predictions.

In descriptive analytics tasks, the result of a given method or technique is obtained directly by applying an algorithm to the data. The result can be a statistic, such as an average, a plot, or a set of groups with similar instances, among other things, as we will see in this book. Let us see the definition of method and algorithm.

Method or technique

A method or technique is a systematic procedure that allows us to achieve an intended goal.

A method shows how to perform a given task. But in order to use a language closer to the language computers can understand, it is necessary to describe the method/technique through an algorithm.

Algorithm

An algorithm is a self‐contained, step‐by‐step set of instructions easily understandable by humans, allowing the implementation of a given method. They are self‐contained in order to be easily translated to an arbitrary programming language.

Example 1.5

The method to obtain the average age of my contacts uses the ages of each (we could use other methods, such as using the number of contacts for each different age). A possible algorithm for this very simple example is shown next.

In the limit, a method can be straightforward. It is possible, in many cases, to express it as a formula instead of as an algorithm.

Example 1.6

For instance, the average could be expressed as $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $x_i$ is the age of the $i$th contact and $n$ is the number of contacts.

We have seen an algorithm that describes a descriptive method. An algorithm can also describe predictive methods. In this last case it describes how to generate a model. Let us see what a model is.

Model

A model in data analytics is a generalization obtained from data that can be used afterwards to generate predictions for new given instances. It can be seen as a prototype that can be used to make predictions. Thus, model induction is a predictive task.

Example 1.7

If we apply an algorithm for the induction of decision trees to provide an explanation of who, among our contacts, is good company, we obtain a model, called a decision tree, like the one presented in Figure 1.1. It can be seen that people older than 38 years are typically better company than those aged 38 or less: more than 80% of people aged 38 or less are bad company, while more than 80% of people older than 38 are good company. This model could be used to predict whether a new contact is good company or not. It would be enough to know the age of that new contact.

Figure 1.1 A prediction model to classify someone as either good or bad company.

Now that we have a rough idea of what analytics is, let us see real examples of problems in data analytics.

1.6 Examples of Data Use

We will describe two real‐world problems from different areas as an introduction to the different subjects that are covered in this book. Many more could be presented. One of the problems is from medicine and the other is from economics. The problems were chosen with a view to the availability of relevant data, because the problems involved will be solved in the project chapters of the book (Chapters 7 and 12).

1.6.1 Breast Cancer in Wisconsin

Breast cancer is a well‐known problem that affects mainly women. The detection of breast tumors can be performed through a biopsy technique known as fine‐needle aspiration. This uses a fine needle to sample cells from the mass under study. Samples of breast mass obtained using fine‐needle aspiration were recorded in a set of images [6]. Then, a dataset was collected by extracting features from these images. The objective of the first problem is to detect different patterns of breast tumors in this dataset, to enable it to be used for diagnostic purposes.

1.6.2 Polish Company Insolvency Data

The second problem concerns the prediction of the economic wealth of Polish companies. Can we predict which companies will become insolvent in the next five years? The answer to this question is obviously relevant to institutions and shareholders.

1.7 A Project on Data Analytics

Every project needs a plan. Or, to be precise, a methodology to prepare the plan. A project on data analytics does not imply only the use of one or more specific methods. It implies:

understanding the problem to be solved

defining the objectives of the project

looking for the necessary data

preparing these data so that they can be used

identifying suitable methods and choosing between them

tuning the hyper‐parameters of each method (see below)

analyzing and evaluating the results

redoing the pre‐processing tasks and repeating the experiments

and so on.

In this book, we assume that in the induction of a model there are both hyper‐parameters and parameters whose values are set. The values of the hyper‐parameters are set by the user, or by some external optimization method. The values of the parameters, on the other hand, are set by a modeling or learning algorithm in its internal procedure. When the distinction is not clear, we use the term parameter. Thus, hyper‐parameters might be, for example, the number of layers and the activation function in a multi‐layer perceptron neural network, or the number of clusters for the k‐means algorithm. Examples of parameters are the weights found by the backpropagation algorithm when training a multi‐layer perceptron neural network and the assignment of objects to clusters carried out by k‐means. Multi‐layer perceptron neural networks and k‐means will be explained later in this book.

How can we perform all these operations in an organized way? This section is all about methodologies for planning and developing projects in data analytics.

A brief history of methodologies for data analytics is presented first. Afterwards, two different methodologies are described:

a methodology from Academia, KDD

a methodology from industry, CRISP‐DM.

The latter is used in the cheat sheet and project chapters (Chapters 7 and 12).

1.7.1 A Little History on Methodologies for Data Analytics

Machine learning, knowledge discovery from data and related areas experienced strong development in the 1990s. Both in academia and industry, the research on these topics was advancing quickly. Naturally, methodologies for projects in these areas, now referred to as data analytics, became a necessity. In the mid‐1990s, both in academia and industry, different methodologies were presented.

The most successful methodology from academia came from the USA. This was the KDD process of Usama Fayyad, Gregory Piatetsky‐Shapiro and Padhraic Smyth [7]. Despite being from academia, the authors had considerable work experience in industry.

The most successful methodology from industry was, and still is, the CRoss‐Industry Standard Process for Data Mining (CRISP‐DM) [8]. Conceived in 1996, it later got underway as a European Union project under the ESPRIT funding initiative. The project had five partners from industry: SPSS, Teradata, Daimler AG, NCR Corporation and OHRA, an insurance company. In 1999 the first version was presented. An attempt to create a new version began between 2006 and 2008, but no new version is known to have resulted from these efforts. CRISP‐DM is nowadays used by many different practitioners and by several corporations, in particular IBM. However, despite its popularity, CRISP‐DM needs new developments in order to meet the new challenges of the age of big data.

Other methodologies exist. Some of them are tool‐specific: they assume the use of a particular data analytics tool. This is not the case for SEMMA, which, despite having been created by SAS, is tool independent. Each letter of its name, SEMMA, refers to one of its five steps: Sample, Explore, Modify, Model and Assess.

Polls performed by kdnuggets [9] over the years (2002, 2004, 2007 and 2014) show how methodologies on data analytics have been used through time (Figure 1.2).

Next, the KDD process and the CRISP‐DM methodologies are described in detail.

Figure 1.2 The use of different methodologies on data analytics through time.

1.7.2 The KDD Process

Intended to be a methodology that could cope with all the processes necessary to extract knowledge from data, the KDD process proposes a sequence of nine steps. In spite of the sequence, the KDD process considers the possibility of going back to any previous step in order to redo some part of the process. The nine steps are:

Learning the application domain:

What is expected in terms of the application domain? What are the characteristics of the problem; its specificities? A good understanding of the application domain is required.

Creating a target dataset:

What data are needed for the problem? Which attributes? How will they be collected and put in the desired format (say, a tabular data set)? Once the application domain is known, the data analyst team should be able to identify the data necessary to accomplish the project.

Data cleaning and pre‐processing:

How should missing values and/or outliers such as extreme values be handled? What data type should we choose for each attribute? It is necessary to put the data in a specific format, such as a tabular format.

Data reduction and projection:

Which features should we include to represent the data? From the available features, which ones should be discarded? Should further information be added, such as adding the day of the week to a timestamp? This can be useful in some tasks. Irrelevant attributes should be removed.

Choosing the data mining function:

Which type of methods should be used? Four types of method are: summarization, clustering, classification and regression. The first two are from the branch of descriptive analytics while the latter two are from predictive analytics.

Choosing the data mining algorithm(s):

Given the characteristics of the problem and the characteristics of the data, which methods should be used? It is expected that specific algorithms will be selected.

Data mining:

Given the characteristics of the problem, the characteristics of the data, and the applicable method type, which specific methods should be used? Which values should be assigned to the hyper‐parameters? The choice of method depends on many different factors: interpretability, ability to handle missing values, capacity to deal with outliers, computational efficiency, among others.

Interpretation:

What is the meaning of the results? What is the utility for the final user? To select the useful results and to evaluate them in terms of the application domain is the goal of this step. It is common to go back to a previous step when the results are not as good as expected.

Using discovered knowledge:

How can we apply the new knowledge in practice? How is it integrated in everyday life? This implies the integration of the new knowledge into the operational system or in the reporting system.

For simplicity's sake, the nine steps were described sequentially, as is typical. However, in practice, some jumps back and forth are often necessary. As an example, steps 3 and 4 can be grouped together with steps 5 and 6: the way we pre‐process the data depends on the methods we will use. For instance, some methods are able to deal with missing values, others are not. When a method is not able to deal with missing values, those values should be filled in somehow, or some attributes or instances should be removed. Also, there are methods that are too sensitive to outliers or extreme values; when this happens, the outliers should be removed. Otherwise, it is not necessary to remove them. These are just examples of how data cleaning and pre‐processing tasks depend on the chosen method(s) (steps 5 and 6).

1.7.3 The CRISP‐DM Methodology

CRoss‐Industry Standard Process for Data Mining (CRISP‐DM) is a six‐step method, which, like the KDD process, uses a non‐rigid sequential framework. Despite the six phases, CRISP‐DM is seen as a perpetual process, used throughout the life of a company in successive iterations (Figure 1.3).

Figure 1.3 The CRISP‐DM methodology

(adapted from http://www.crisp‐dm.org/).

The six phases are:

Business understanding:

This involves understanding the business domain, being able to define the problem from the business domain perspective, and finally being able to translate such business problems into a data analytics problem.

Data understanding:

This involves collection of the necessary data and their initial visualization/summarization in order to obtain the first insights, particularly but not exclusively, about data quality problems such as missing data or outliers.

Data preparation:

This involves preparing the data set for the modeling tool, and includes data transformation, feature construction, outlier removal, filling in of missing data and removal of incomplete instances.

Modeling: