Intelligent Data Analysis - E-Book

Description

This book focuses on methods and tools for intelligent data analysis that aim to narrow the growing gap between data gathering and data comprehension. It also emphasizes solving problems that result from automated data collection, such as the analysis of computer-based patient records, data warehousing tools, intelligent alarming, and effective and efficient monitoring. The book describes the different approaches to intelligent data analysis from a practical point of view: solving common real-life problems with data analysis tools.

Page count: 679

Year of publication: 2020

Table of Contents

Cover

List of Contributors

Series Preface

Preface

1 Intelligent Data Analysis: Black Box Versus White Box Modeling

1.1 Introduction

1.2 Interpretation of White Box Models

1.3 Interpretation of Black Box Models

1.4 Issues and Further Challenges

1.5 Summary

References

2 Data: Its Nature and Modern Data Analytical Tools

2.1 Introduction

2.2 Data Types and Various File Formats

2.3 Overview of Big Data

2.4 Data Analytics Phases

2.5 Data Analytical Tools

2.6 Database Management System for Big Data Analytics

2.7 Challenges in Big Data Analytics

2.8 Conclusion

References

3 Statistical Methods for Intelligent Data Analysis: Introduction and Various Concepts

3.1 Introduction

3.2 Probability

3.3 Descriptive Statistics

3.4 Inferential Statistics

3.5 Statistical Methods

3.6 Errors

3.7 Conclusion

References

4 Intelligent Data Analysis with Data Mining: Theory and Applications

4.1 Introduction to Data Mining

4.2 Data and Knowledge

4.3 Discovering Knowledge in Data Mining

4.4 Data Analysis and Data Mining

4.5 Data Mining: Issues

4.6 Data Mining: Systems and Query Language

4.7 Data Mining Methods

4.8 Data Exploration

4.9 Data Visualization

4.10 Probability Concepts for Intelligent Data Analysis (IDA)

Reference

5 Intelligent Data Analysis: Deep Learning and Visualization

5.1 Introduction

5.2 Deep Learning and Visualization

5.3 Data Processing and Visualization

5.4 Experiments and Results

5.5 Conclusion

References

6 A Systematic Review on the Evolution of Dental Caries Detection Methods and Its Significance in Data Analysis Perspective

6.1 Introduction

6.2 Different Caries Lesion Detection Methods and Data Characterization

6.3 Technical Challenges with the Existing Methods

6.4 Result Analysis

6.5 Conclusion

Acknowledgment

References

7 Intelligent Data Analysis Using Hadoop Cluster – Inspired MapReduce Framework and Association Rule Mining on Educational Domain

7.1 Introduction

7.2 Learning Analytics in Education

7.3 Motivation

7.4 Literature Review

7.5 Intelligent Data Analytical Tools

7.6 Intelligent Data Analytics Using MapReduce Framework in an Educational Domain

7.7 Results

7.8 Conclusion and Future Scope

References

8 Influence of Green Space on Global Air Quality Monitoring: Data Analysis Using K-Means Clustering Algorithm

8.1 Introduction

8.2 Material and Methods

8.3 Results

8.4 Quantitative Analysis

8.5 Discussion

8.6 Conclusion

References

9 IDA with Space Technology and Geographic Information System

9.1 Introduction

9.2 Geospatial Techniques

9.3 Comparative Analysis

9.4 Conclusion

References

10 Application of Intelligent Data Analysis in Intelligent Transportation System Using IoT

10.1 Introduction to Intelligent Transportation System (ITS)

10.2 Issues and Challenges of Intelligent Transportation System (ITS)

10.3 Intelligent Data Analysis Makes an IoT-Based Transportation System Intelligent

10.4 Intelligent Data Analysis for Security in Intelligent Transportation System

10.5 Tools to Support IDA in an Intelligent Transportation System

References

11 Applying Big Data Analytics on Motor Vehicle Collision Predictions in New York City

11.1 Introduction

11.2 Materials and Methods

11.3 Classification Algorithms and K-Fold Validation Using Data Set Obtained from NYPD (2012–2017)

11.4 Results

11.5 Discussion

11.6 Conclusion

References

12 A Smart and Promising Neurological Disorder Diagnostic System: An Amalgamation of Big Data, IoT, and Emerging Computing Techniques

12.1 Introduction

12.2 Statistics of Neurological Disorders

12.3 Emerging Computing Techniques

12.4 Related Works and Publication Trends of Articles

12.5 The Need for Neurological Disorders Diagnostic System

12.6 Conclusion

References

13 Comments-Based Analysis of a Bug Report Collection System and Its Applications

13.1 Introduction

13.2 Background

13.3 Related Work

13.4 Data Collection Process

13.5 Analysis of Bug Reports

13.6 Threats to Validity

13.7 Conclusion

References

Notes

14 Sarcasm Detection Algorithms Based on Sentiment Strength

14.1 Introduction

14.2 Literature Survey

14.3 Experiment

14.4 Results and Evaluation

14.5 Conclusion

References

Notes

15 SNAP: Social Network Analysis Using Predictive Modeling

15.1 Introduction

15.2 Literature Survey

15.3 Comparative Study

15.4 Simulation and Analysis

15.5 Conclusion and Future Work

References

16 Intelligent Data Analysis for Medical Applications

16.1 Introduction

16.2 IDA Needs in Medical Applications

16.3 IDA Methods Classifications

16.4 Intelligent Decision Support System in Medical Applications

16.5 Conclusion

References

17 Bruxism Detection Using Single-Channel C4-A1 on Human Sleep S2 Stage Recording

17.1 Introduction

17.2 History of Sleep Disorder

17.3 Electroencephalogram Signal

17.4 EEG Data Measurement Technique

17.5 Literature Review

17.6 Subjects and Methodology

17.7 Data Analysis of the Bruxism and Normal Data Using EEG Signal

17.8 Result

17.9 Conclusions

Acknowledgments

References

18 Handwriting Analysis for Early Detection of Alzheimer's Disease

18.1 Introduction and Background

18.2 Proposed Work and Methodology

18.3 Results and Discussions

18.4 Conclusion

References

Index

End User License Agreement

List of Tables

Chapter 2

Table 2.1 Schema of an employee table in a broker company.

Table 2.2 Data storage measurements.

Table 2.3 Comparison of different data analytic tools.

Table 2.4 Comparison between SQL and NoSQL.

Chapter 4

Table 4.1 Dissimilarities between data and knowledge.

Chapter 6

Table 6.1 Global DMFT trends for 12-year-old children [81–83].

Table 6.2 Code meaning.

Chapter 7

Table 7.1 Educational data set [37].

Table 7.2 Synthesized educational data set [38].

Table 7.3 The data set for course selection.

Table 7.4 Output of Map reduce task.

Table 7.5 Best rules found by Apriori.

Chapter 8

Table 8.1 Air quality categories (annual mean ambient defined by WHO).

Table 8.2 Categorization of the difference of green space area percentage dur...

Table 8.3 Analysis of variance (ANOVA) statistics table.

Chapter 9

Table 9.1 NoSQL database types.

Table 9.2 NoSQL database types.

Table 9.3 NoSQL database types.

Chapter 10

Table 10.1 Objects, statistical techniques, and graphs supported by R program...

Chapter 11

Table 11.1 Illustration of data set attributes.

Table 11.2 Categorized vehicle groups.

Table 11.3 Description of classification algorithms and functionalities.

Table 11.4 Comparison of classifier results.

Table 11.5 Analyzed p-value test results.

Chapter 12

Table 12.1 Difference between neurological and psychological disorders.

Table 12.2 Publications details along with citations used in the study.

Chapter 13

Table 13.1 Comparison of previous studies of data extraction.

Table 13.2 Categories of error and its significant keywords.

Table 13.3 Frequent words for severe and nonsevere bugs.

Chapter 14

Table 14.1 Examples for hyperbolic sarcasm.

Table 14.2 Examples for general sarcasm, positive sentences, and negative sen...

Table 14.3 Shows the patterns used by extended Algorithm 14.2 to detect the p...

Table 14.4 Shows example cases for Table 14.3.

Table 14.5 True positive and True negative values of the classification resul...

Table 14.6 Evaluation results of the classification done by the extended algo...

Chapter 15

Table 15.1 Comparison table of literature work.

Chapter 17

Table 17.1 The comparative analysis between bruxism and a normal human for th...

Table 17.2 The comparative analysis between bruxism and normal human for the ...

Table 17.3 The comparative analysis between bruxism and normal human for the ...

List of Illustrations

Chapter 1

Figure 1.1 Data analysis process.

Figure 1.2 Linear regression.

Figure 1.3 Decision tree.

Figure 1.4 Distribution of points in case of high and low information gain....

Figure 1.5 Partial dependence plots from a gradient boosting regressor train...

Figure 1.6 Partial dependence plot from a gradient boosting regressor traine...

Figure 1.7 Relationship between X_2 and Y [24].

Figure 1.8 ICE plot between feature X_2 and Y [24].

Figure 1.9 Calculation of PDP and M-plot [25].

Figure 1.10 Calculation of ALE plot [25].

Figure 1.11 Correlation does not imply causation [29].

Chapter 2

Figure 2.1 Various stages of data.

Figure 2.2 Classifications of digital data.

Figure 2.3 CSV file opened in Microsoft Excel.

Figure 2.4 Plain text file opened in Notepad.

Figure 2.8 Characteristics of big data.

Figure 2.9 Different types of big data analytics.

Figure 2.10 Various phases of data analytics.

Figure 2.11 Features of Apache Spark.

Figure 2.12 Components of Hadoop.

Chapter 4

Figure 4.1 From data to knowledge.

Figure 4.2 Variety of data in data mining.

Figure 4.3 Knowledge tree for intelligent data mining.

Figure 4.4 Knowledge discovery process.

Figure 4.5 Relationship between data analysis and data mining.

Figure 4.6 Issues in data mining.

Figure 4.7 Various systems in data mining.

Figure 4.8 Diagrammatic concept of classification.

Figure 4.9 Diagrammatic concept of clustering.

Figure 4.10 Diagrammatic concept of classification.

Figure 4.11 Specimen for decision tree induction.

Figure 4.12 Sample representation for stacked column chart.

Figure 4.13 Different relationships shown by scatter plots for bivariate ana...

Figure 4.14 Different techniques used for data visualization.

Figure 4.15 Different sample visualizations used for different cases.

Figure 4.16 Different probability distribution functions classification and ...

Chapter 5

Figure 5.1 Left: overview of neural network and deep learning; Right: branch...

Figure 5.2 (a) Overview of visualization: score function, data loss, and reg...

Figure 5.3 Linear model and sample data visualization: left: a simple linear...

Figure 5.4 Gradient descent is the excellent to visualization in deep learni...

Figure 5.5 left: Design the model with simplify blocks regarding dog detecti...

Figure 5.6 The loss of entropy.

Figure 5.7 (a) Matrix multiplication for deep learning using linear model. (...

Figure 5.8 Optimizer [16]: Adam works and others shows.

Figure 5.9 Left: example of black box most uses to visualize the complex net...

Figure 5.10 Overview of reinforcement learning model [9]: an agent is visual...

Figure 5.11 Deep reinforcement learning.

Figure 5.12 Reinforcement learning and visualization.

Figure 5.13 Inception v3 module: it was the powerful for visualizing the dee...

Figure 5.14 GoogLeNet architecture [12].

Figure 5.15 x: input, z: logit, : softmax, y: 1-hot labels;

Figure 5.16 Example of interpretation of histogram distribution [Morvan].

Figure 5.17 Illustrated the multiple layers features in representation [medi...

Figure 5.18 Relationship visualizations: two variables using the scatter dia...

Figure 5.19 Comparison method: overview of charts is represented the most co...

Figure 5.20 Composition methodology: overview of charts is represented most ...

Figure 5.21 Example of visualization applied MNIST data set by using deep le...

Figure 5.22 MNIST visualization.

Figure 5.23 Example of visualization using MNIST in 3D.

Figure 5.24 L1 and L2 regularization.

Figure 5.25 Dropout processing and visualization: sampling dropout loss base...

Figure 5.26 Mask-RCNN for object detection and segmentation [21].

Figure 5.27 Mask-RCNN result progress: training with Mask-RCNN according to ...

Figure 5.28 Deep learning and object visualization based on sampling during ...

Figure 5.29 Deep learning and object visualization.

Figure 5.30 Human detection using Mask RCNN: noised data during the human de...

Figure 5.31 Showing the activation function of layers based on food recognit...

Figure 5.32 Interpretation of histogram distribution using Mask-RCNN.

Figure 5.33 Overfitting representation based on experience from Mask-RCNN [2...

Figure 5.34 Weights histogram based on distributed parameters of training se...

Figure 5.35 Correlations.

Figure 5.36 Visualization of food recognition.

Figure 5.37 Visualization for deep matrix factorization model [18].

Figure 5.38 Visualization and loss function in deep learning for recommendat...

Figure 5.39 Data visualization in MovieLens 1 M of recommendation system bas...

Figure 5.40 Line in charts, and modeling and visualization for reinforcement...

Chapter 6

Figure 6.1 Dental caries at its different phases.

Figure 6.2 Worldwide dental caries severity regions.

Figure 6.3 The affected risk of dental caries on smoking.

Figure 6.4 Worldwide dental caries affected Level that according to DMFT amo...

Figure 6.5 Classification of caries detection method.

Figure 6.6 Internal diagram of point detection method.

Figure 6.7 Teeth data features along with its distribution.

Figure 6.8 Discoloration of enamel under FOTI machine.

Figure 6.9 (a) (35–40) mm teeth image, (b) QLF teeth image.

Figure 6.10 (a) FOTI device, (b) diagnodent device, (c) QLF machine, (d) car...

Figure 6.11 Caries affected lesion, 3D view of the same lesion and it spread...

Figure 6.12 Performance of traditional caries detection methods after Bader ...

Figure 6.13 Performance of traditional method for Proximal Surfaces after Ba...

Chapter 7

Figure 7.1 Artificial intelligence and its subsets using intelligent data an...

Figure 7.2 Learner support provided by learning analytics.

Figure 7.3 Learning through web and mobile computing.

Figure 7.4 Sample techniques for the analytics engine [6].

Figure 7.5 Data mining using WEKA tool [36].

Figure 7.6 Decision tree generated for the data set [37].

Figure 7.7 Distribution table for the data set [37].

Figure 7.8 (a). Visualization of student attributes (K = 2) [38]. (b). Visua...

Figure 7.9 Working principle of MapReduce framework.

Figure 7.10 Output obtained from MapReduce programming framework.

Chapter 8

Figure 8.1 The flow of data processing procedure.

Figure 8.2 (a) Air quality with land areas in 2014 (using 1 048 576 instance...

Figure 8.3 (a) Tree area in 1990. (b) Tree area in 2014. (c) Difference of t...

Figure 8.4 Variance of each attribute with coordinates.

Figure 8.5 Variance of each attribute.

Figure 8.6 Count values of cases in each cluster.

Figure 8.7 Tree area percentage/relation of raw data (difference) and ranges...

Figure 8.8 Air quality with green space percentage.

Figure 8.9 Air quality with green space percentage analysis.

Chapter 9

Figure 9.1 Data collection from various sources from the space.

Figure 9.2 GIS evolution and future trends.

Figure 9.3 Remote sensing big data architecture.

Figure 9.4 The machine learning process.

Figure 9.5 Big data in remote sensing.

Figure 9.6 Big data in remote sensing.

Figure 9.7 Geospatial techniques.

Figure 9.8 A roadmap for geospatial big data management.

Figure 9.9 A roadmap knowledge discovery and service.

Figure 9.10 Conceptual diagram of the proposed fogGIS framework for power-ef...

Chapter 10

Figure 10.1 Overview of intelligent transportation system.

Figure 10.2 Services of intelligent transportation system (ITS).

Figure 10.3 Challenges and opportunities in the implementation of ITS.

Figure 10.4 Process of intelligent data analysis.

Figure 10.5 Three-dimensional model for security in ITS.

Figure 10.6 Data types of Python.

Chapter 11

Figure 11.1 Overall methodology of data analysis process.

Figure 11.2 Accuracy comparison of RF and kNN.

Figure 11.3 Random forest node processing time.

Figure 11.4 Random forest node accuracy.

Figure 11.5 Heat map of large vehicle collisions.

Figure 11.6 Heat map of very-small vehicle collisions.

Figure 11.7 Comparison of number of collisions, persons injured, and persons...

Figure 11.8 Number of persons injured based on vehicle groups.

Figure 11.9 Number of persons killed based on vehicle groups.

Figure 11.10 Number of persons injured based on borough.

Figure 11.11 Number of persons killed based on borough.

Figure 11.12 Number of persons injured in medium vehicles over N-68802 colli...

Figure 11.13 Number of persons killed in medium vehicles over N-68802 collis...

Figure 11.14 Number of persons injured in large vehicles over N-27508 collis...

Figure 11.15 Number of persons killed in large vehicles over N-27508 collisi...

Figure 11.16 Number of persons injured in small vehicles over N-892174 colli...

Figure 11.17 Number of persons killed in small vehicles over N-892174 collis...

Figure 11.18 Number of persons injured in very small vehicles over N-9705 co...

Figure 11.19 Number of persons killed in very small vehicles over N-9705 col...

Chapter 12

Figure 12.1 Types of neurological disorders.

Figure 12.2 Prevalence and death rate due to neurological disorders in the y...

Figure 12.3 Prevalence of neurological disorders in different countries [15]...

Figure 12.4 IoT and big data.

Figure 12.5 Soft computing techniques.

Figure 12.6 The process to generate an optimal solution [76, 77].

Figure 12.7 Machine learning applications.

Figure 12.8 The accuracy achieved by different studies for neurological diso...

Figure 12.9 Sensitivity achieved by different studies for neurological disord...

Figure 12.10 Specificity achieved by different studies for neurological disor...

Figure 12.11 Publication trend from 2008 to 2018 for neurological disorder d...

Figure 12.12 Neurological disorder diagnostic framework.

Chapter 13

Figure 13.1 Statistics of bug reports of 20 projects of the Apache Software ...

Figure 13.2 (a) Number of bug reports based on resolution. (b) Number of bug...

Figure 13.3 Example of bug report of Accumulo project.

Figure 13.4 Data extraction process.

Figure 13.5 Number of open bugs of distinct severity level.

Figure 13.6 Percentage of open bugs as per severity level.

Figure 13.7 (a)–(d) Most contributing developer for 20 projects of Apache So...

Figure 13.8 Code for finding correlated words (a) Association graph for logic...

Figure 13.9 (a)–(d) Association graphs of various errors for Kafka project (...

Figure 13.10 (a)–(b) Frequency and association plots for severe bugs.

Figure 13.11 K-means cluster group similar words.

Figure 13.12 Dendrogram of most similar words.

Chapter 14

Figure 14.1 Sentiment strengths and their elaboration as given by SentiStre...

Figure 14.2 Chart showing classification results of all four sentiments.

Figure 14.3 Chart showing evaluation results.

Chapter 16

Figure 16.1 Conventional decision support system.

Figure 16.2 Intelligent system for decision support/expert analysis in layou...

Chapter 17

Figure 17.1 Differences of bruxism patient teeth and normal human teeth.

Figure 17.2 Flow chart of the proposed work.

Figure 17.3 Low pass filter.

Figure 17.4 The loading of the bruxism data for the EEG signal and the total...

Figure 17.5 Loading of the normal data for the EEG signal in the S2 snooze s...

Figure 17.6 Extracted single-channels C4-A1 of the bruxism for the S2 sleep ...

Figure 17.7 Extracted single-channels C4-A1 of the normal for the S2 sleep s...

Figure 17.8 Filtered C4-A1 channel of S2 sleep stage for bruxism, we used a ...

Figure 17.9 Filtered C4-A1 channel of S2 sleep stage for the normal, we used...

Figure 17.10 Sampled C4-A1 channel of S2 sleep stage for the bruxism using t...

Figure 17.11 Sampled C4-A1 channel of S2 sleep stage for the normal using Ha...

Figure 17.12 It has represented the estimation of the power spectral density...

Figure 17.13 It has represented the estimation of the power spectral density...

Figure 17.14 Graphical representation for the normalized value of the single...

Chapter 18

Figure 18.1 This is simplest form of representation of the Encoder architect...

Figure 18.2 (a) The encoder compresses data into latent space (y). (b) The d...

Figure 18.3 Image reconstruction process.

Figure 18.4 Line segment from handwritten sample from patients suffering fro...

Figure 18.5 (a and b) Word segmentation samples produced from the segmented ...

Figure 18.6 Sample of character segmentation obtained using segmented words....

Figure 18.7 Segmented characters reconstructed using VAE.

Figure 18.8 Clusters of reconstructed images using VAE. (a) Cluster for “e,”...

Figure 18.9 Ambiguous “l” and “e.”

Figure 18.10 Ambiguous “c” and “e.”

Figure 18.11 Unclear or disconnected writing with spelling errors.


Intelligent Data Analysis

From Data Gathering to Data Comprehension

Edited by

Deepak Gupta

Maharaja Agrasen Institute of Technology

Delhi, India

Siddhartha Bhattacharyya

CHRIST (Deemed to be University)

Bengaluru, India

Ashish Khanna

Maharaja Agrasen Institute of Technology

Delhi, India

Kalpna Sagar

KIET Group of Institutions

Uttar Pradesh, India

This edition first published 2020

© 2020 John Wiley & Sons Ltd

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Deepak Gupta, Siddhartha Bhattacharyya, Ashish Khanna, and Kalpna Sagar to be identified as the authors of the editorial material in this work has been asserted in accordance with law.

Registered Offices

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office

The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This work's use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: Gupta, Deepak, editor.

Title: Intelligent data analysis : from data gathering to data comprehension / edited by Dr. Deepak Gupta, Dr. Siddhartha Bhattacharyya, Dr. Ashish Khanna, Ms. Kalpna Sagar.

Description: Hoboken, NJ, USA : Wiley, 2020. | Series: The Wiley series in intelligent signal and data processing | Includes bibliographical references and index.

Identifiers: LCCN 2019056735 (print) | LCCN 2019056736 (ebook) | ISBN 9781119544456 (hardback) | ISBN 9781119544449 (adobe pdf) | ISBN 9781119544463 (epub)

Subjects: LCSH: Data mining. | Computational intelligence.

Classification: LCC QA76.9.D343 I57435 2020 (print) | LCC QA76.9.D343 (ebook) | DDC 006.3/12–dc23

LC record available at https://lccn.loc.gov/2019056735

LC ebook record available at https://lccn.loc.gov/2019056736

Cover Design: Wiley

Cover Image: © gremlin/Getty Images

Deepak Gupta would like to dedicate this book to his father, Sh. R.K. Gupta, his mother, Smt. Geeta Gupta, his mentors, for their constant encouragement, and to his family members, including his wife, brothers, sisters, and kids, as well as to the students.

Siddhartha Bhattacharyya would like to dedicate this book to his parents, the late Ajit Kumar Bhattacharyya and the late Hashi Bhattacharyya, his beloved wife, Rashni, and his research scholars, Sourav, Sandip, Hrishikesh, Pankaj, Debanjan, Alokananda, Koyel, and Tulika.

Ashish Khanna would like to dedicate this book to his parents, the late R.C. Khanna and Smt. Surekha Khanna, for their constant encouragement and support, and to his wife, Sheenu, and children, Master Bhavya and Master Sanyukt.

Kalpna Sagar would like to dedicate this book to her father, Mr. Lekh Ram Sagar, and her mother, Smt. Gomti Sagar, the strongest people in her life.

List of Contributors

Ambarish G. Mohapatra

Silicon Institute of Technology

Bhubaneswar

India

Anirban Mukherjee

RCC Institute of Information Technology

West Bengal

India

Aniruddha Sadhukhan

RCC Institute of Information Technology

West Bengal

India

Anisha Roy

RCC Institute of Information Technology

West Bengal

India

Arvinder Kaur

Guru Gobind Singh Indraprastha University

India

Ayush Ahuja

Jaypee Institute of Information Technology Noida

India

Biswajit Modak

Nabadwip State General Hospital

Nabadwip

India

R.S. Bhatia

National Institute of Technology

Kurukshetra

India

Bright Keswani

Suresh Gyan Vihar University

Jaipur

India

Dakun Lai

University of Electronic Science and Technology of China

Chengdu

China

Deepak Kumar Sharma

Netaji Subhas University of Technology

New Delhi

India

Dhanushka Abeyratne

Yellowfin (HQ)

The University of Melbourne

Australia

Faijan Akhtar

Jamia Hamdard

New Delhi

India

Gihan S. Pathirana

Charles Sturt University

Melbourne

Australia

Huy V. Pham

Ton Duc Thang University

Vietnam

Malka N. Halgamuge

The University of Melbourne

Australia

Manashi De

Techno India

West Bengal

India

Manik Sharma

DAV University

Jalandhar

India

Manu Agarwal

Jaypee Institute of Information Technology Noida

India

Manu Sood

University Shimla

India

Md Belal Bin Heyat

University of Electronic Science and Technology of China

Chengdu

China

Mohd Ammar Bin Hayat

Medical University

India

Moolchand Sharma

Maharaja Agrasen Institute of Technology (MAIT)

Delhi

India

Nabendu Chaki

University of Calcutta

Kolkata

India

Nisheeth Joshi

Banasthali Vidyapith

Rajasthan

India

Om Prakash Rishi

University of Kota

India

Poonam Keswani

Akashdeep PG College

Jaipur

India

Prableen Kaur

DAV University

Jalandhar

India

Pragya Katyayan

Banasthali Vidyapith

Rajasthan

India

Pratiyush Guleria

University Shimla

India

Prerna Sharma

Maharaja Agrasen Institute of Technology (MAIT)

Delhi

India

Rachna Jain

Bharati Vidyapeeth's College of Engineering

New Delhi

India

Rahul Johari

GGSIP University

New Delhi

India

Rajib Saha

RCC Institute of Information Technology

West Bengal

India

Rakesh Roshan

Institute of Management Studies

Ghaziabad

India

Ramneek Singhal

Bharati Vidyapeeth's College of Engineering

New Delhi

India

Ravinder Ahuja

Jaypee Institute of Information Technology Noida

India

Samarth Chugh

Netaji Subhas University of Technology

New Delhi

India

Samridhi Seth

GGSIP University

New Delhi

India

Sarthak Gupta

Netaji Subhas University of Technology

New Delhi

India

Shadab Azad

Chaudhary Charan Singh University Meerut

India

Shafan Azad

Dr. A.P.J. Abdul Kalam Technical University

Uttar Pradesh

India

Shajan Azad

Hayat Institute of Nursing

Lucknow

India

Shikhar Asthana

Jaypee Institute of Information Technology Noida

India

Shivam Bachhety

Bharati Vidyapeeth's College of Engineering

New Delhi

India

Shubham Kumaram

Netaji Subhas University of Technology

New Delhi

India

Shubhra Goyal

Guru Gobind Singh Indraprastha University

India

Siddhant Bagga

Netaji Subhas University of Technology

New Delhi

India

Soma Datta

University of Calcutta

Kolkata

India

Tarini Ch. Mishra

Silicon Institute of Technology

Bhubaneswar

India

Than D. Le

University of Bordeaux

France

Vikas Chaudhary

KIET

Ghaziabad

India

Series Preface

Dr. Siddhartha Bhattacharyya, CHRIST (Deemed to be University), Bengaluru, India (Series Editor)

The Intelligent Signal and Data Processing (ISDP) book series is aimed at fostering the field of signal and data processing, which encompasses the theory and practice of algorithms and hardware that convert signals produced by artificial or natural means into a form useful for a specific purpose. The signals might be speech, audio, images, video, sensor data, telemetry, electrocardiograms, or seismic data, among others. The possible application areas include transmission, display, storage, interpretation, classification, segmentation, or diagnosis. The primary objective of the ISDP book series is to evolve future-generation scalable intelligent systems for faithful analysis of signals and data. ISDP is mainly intended to enrich the scholarly discourse on intelligent signal and image processing in different incarnations. ISDP will benefit a wide range of learners, including students, researchers, and practitioners. The student community can use the volumes in the series as reference texts to advance their knowledge base. In addition, the monographs will come in handy to aspiring researchers because of the valuable contributions they make to this field. Moreover, both faculty members and data practitioners are likely to grasp the depth of the relevant knowledge base from these volumes.

The series coverage will contain, not exclusively, the following:

Intelligent signal processing

Adaptive filtering

Learning algorithms for neural networks

Hybrid soft-computing techniques

Spectrum estimation and modeling

Image processing

Image thresholding

Image restoration

Image compression

Image segmentation

Image quality evaluation

Computer vision and medical imaging

Image mining

Pattern recognition

Remote sensing imagery

Underwater image analysis

Gesture analysis

Human mind analysis

Multidimensional image analysis

Speech processing

Modeling

Compression

Speech recognition and analysis

Video processing

Video compression

Analysis and processing

3D video compression

Target tracking

Video surveillance

Automated and distributed crowd analytics

Stereo-to-auto stereoscopic 3D video conversion

Virtual and augmented reality

Data analysis

Intelligent data acquisition

Data mining

Exploratory data analysis

Modeling and algorithms

Big data analytics

Business intelligence

Smart cities and smart buildings

Multiway data analysis

Predictive analytics

Intelligent systems

Preface

Intelligent data analysis (IDA), knowledge discovery, and decision support have recently become more challenging research fields and have gained much attention among a large number of researchers and practitioners. In our view, the awareness of these challenging research fields and emerging technologies among the research community will increase the applications in biomedical science. This book aims to present the various approaches, techniques, and methods that are available for IDA, and to present case studies of their application.

This volume comprises 18 chapters focusing on the latest advances in IDA tools and techniques.

Machine learning models are broadly categorized into two types: white box and black box. Due to the difficulty in interpreting their inner workings, some machine learning models are considered black box models. Chapter 1 focuses on the different machine learning models, along with their advantages and limitations as far as the analysis of data is concerned.

With the advancement of technology, the amount of data generated is very large. The data generated has useful information that needs to be gathered by data analytics tools in order to make better decisions. In Chapter 2, the definition of data and its classifications based on different factors is given. The reader will learn about how and what data is and about the breakup of the data. After a description of what data is, the chapter will focus on defining and explaining big data and the various challenges faced by dealing with big data. The authors also describe various types of analytics that can be performed on large data and six data analytics tools (Microsoft Excel, Apache Spark, OpenRefine, R, Hadoop, and Tableau).

In recent years, the widespread use of computers and the internet has led to the generation of data on an unprecedented scale. To make effective use of this data, it must be collected and analyzed so that inferences can be made to improve various products and services. Statistics deals with the collection, organization, and analysis of data. The organization and description of data is studied under descriptive statistics in Chapter 3, while the analysis of data and how to make predictions based on it is dealt with in inferential statistics.

After having an idea about various aspects of IDA in the previous chapters, Chapter 4 deals with an overview of data mining. It also discusses the process of knowledge discovery in data along with a detailed analysis of various mining methods including classification, clustering, and decision tree. In addition to that, the chapter concludes with a view of data visualization and probability concepts for IDA.

In Chapter 5, the authors demonstrate one of the most crucial and challenging areas in computer vision and the IDA field based on manipulating the convergence. This subject is divided into a deep learning paradigm for object segmentation in computer vision and a visualization paradigm for efficiently incremental interpretation in manipulating the datasets for supervised and unsupervised learning, and online or offline training in reinforcement learning. This topic has recently had a large impact in robotics and autonomous systems, food detection, recommendation systems, and medical applications.

Dental caries is a painful bacterial disease of teeth caused mainly by Streptococcus mutans, acid, and carbohydrates, and it destroys the enamel, or the dentine, layer of the tooth. As per the World Health Organization report, worldwide, 60–90% of school children and almost 100% of adults have dental caries. Dental caries and periodontal disease without treatment for long periods cause tooth loss. There is not a single method to detect caries in its earliest stages. The size of carious lesions and early caries detection are very challenging tasks for dental practitioners. The methods related to dental caries detection are the radiograph, QLF or quantitative light-induced fluorescence, ECM, FOTI, DIFOTI, etc. In a radiograph-based technique, dentists analyze the image data. In Chapter 6, the authors present a method to detect caries by analyzing the secondary emission data.

With the growth of data in the education field in recent years, there is a need for intelligent data analytics so that academic data can be used effectively to improve learning. Educational data mining and learning analytics are the fields of IDA that play important roles in intelligent analysis of educational data. One of the real challenges faced by students and institutions alike is the quality of education. An equally important factor related to the quality of education is the performance of students in the higher education system. The decisions that the students make while selecting their area of specialization are of grave concern here. In the absence of support systems, the students and the teachers/mentors fall short when making the right decisions for the furthering of their chosen career paths. Therefore, in Chapter 7, the authors attempt to address the issue by proposing a system that can guide the student to choose and to focus on the right course(s) based on their personal preferences. For this purpose, a system has been envisaged by blending data mining and classification with big data. A methodology using the MapReduce framework and association rule mining is proposed in order to derive the right blend of courses for students to pursue to enhance their career prospects.

Atmospheric air pollution is creating significant health problems that affect millions of people around the world. Chapter 8 analyzes the hypothesis about whether or not global green space variation is changing the global air quality. The authors perform a big data analysis with a data set that contains more than 1M (1 048 000) green space data and air quality data points by considering 190 countries during the years 1990 to 2015. Air quality is measured by considering the particulate matter (PM) value. The analysis is carried out using multivariate graphs and a k-means clustering algorithm. The relative geographical changes of the tree areas, as well as the level of the air quality, were identified and the results indicated encouraging news.

Space technology and geotechnology, such as geographic information systems, play a vital role in the day-to-day activities of a society. In the initial days, data collection was very rudimentary and primitive. The quality of the data collected was a subject of verification and the accuracy of the data was also questionable. With the advent of newer technology, these problems have been overcome. With modern sophisticated systems, space science has changed drastically. Implementing cutting-edge spaceborne sensors has made it possible to capture real-time data from space. Chapter 9 focuses on these aspects in detail.

Transportation plays an important role in our overall economy, conveying products and people through progressively mind-boggling, interconnected, and multidimensional transportation frameworks. But the complexities of present-day transportation can't be managed by previous systems. The utilization of IDA frameworks and strategies, with compelling information gathering and data dispersion frameworks, provides the opportunities required to build the future intelligent transportation systems (ITSs). In Chapter 10, the authors exhibit the application of IDA in IoT-based ITS.

Chapter 11 aims to observe emerging patterns and trends by using big data analysis to enhance predictions of motor vehicle collisions using a data set consisting of 17 attributes and 998 193 collisions in New York City. The data is extracted from the New York City Police Department (NYPD). The data set has then been tested in three classification algorithms, which are k-nearest neighbor, random forest, and naive Bayes. The outputs are captured using k-fold cross-validation method. These outputs are used to identify and compare classifier accuracy, and random forest node accuracy and processing time. Further, an analysis of raw data is performed describing the four different vehicle groups in order to detect significance within the recorded period. Finally, extreme cases of collision severity are identified using outlier analysis. The analysis demonstrates that out of three classifiers, random forest gives the best results.

Neurological disorders are the diseases that are related to the brain, nervous system, and the spinal cord of the human body. These disorders may affect the walking, speaking, learning, and moving capacity of human beings. Some of the major human neurological disorders are stroke, brain tumors, epilepsy, meningitis, Alzheimer's, etc. Additionally, remarkable growth has been observed in the areas of disease diagnosis and health informatics. The critical human disorders related to lung, kidney, skin, and brain have been successfully diagnosed using different data mining and machine learning techniques. In Chapter 12, several neurological and psychological disorders are discussed. The role of different computing techniques in designing different biomedical applications is presented. In addition, the challenges and promising areas of innovation in designing a smart and intelligent neurological disorder diagnostic system using big data, internet of things, and emerging computing techniques are also highlighted.

Bug reports are one of the crucial software artifacts in open-source software. Issue tracking systems maintain enormous numbers of bug reports with several attributes, such as long descriptions of bugs, threaded discussion comments, and bug meta-data, which includes BugID, priority, status, resolution, time, and others. In Chapter 13, bug reports of 20 open-source projects of the Apache Software Foundation are extracted using a tool named the Bug Report Collection System for trend analysis. As per the quantitative analysis of data, about 20% of open bugs are critical in nature, which directly impacts the functioning of the system. The presence of a large number of bugs of this kind can put systems in vulnerable positions and reduce their risk aversion capability. Thus, it is essential to resolve these issues on a high priority. The test lead can assign these issues to the most contributing developers of a project for quick closure of opened critical bugs. The comments are mined, which helps identify the developers resolving the majority of bugs, which is beneficial for test leads of distinct projects. As per the collated data, the areas more prone to system failures are determined, such as input/output type errors and logical code errors.

Sentiments are the standard way by which people express their feelings. Sentiments are broadly classified as positive and negative. The problem occurs when the user expresses with words that are different than the actual feelings. This phenomenon is generally known to us as sarcasm, where people say something opposite the actual sentiments. Sarcasm detection is of great importance for the correct analysis of sentiments. Chapter 14 attempts to give an algorithm for successful detection of hyperbolic sarcasm and general sarcasm in a data set of sarcastic posts that are collected from pages dedicated for sarcasm on social media sites such as Facebook, Pinterest, and Instagram. This chapter also shows the initial results of the algorithm and its evaluation.

Predictive analytics refers to forecasting future probabilities by extracting information from existing data sets and determining patterns from predicted outcomes. Predictive analytics also includes what-if scenarios and risk assessment. In Chapter 15, an effort has been made to use principles of predictive modeling to analyze an authentic social network data set, and the results have been encouraging. The post-analysis of the results has focused on exhibiting contact details, mobility pattern, and a number of degree of connections/minutes leading to identification of the linkage/bonding between the nodes in the social network.

Modern medicine has been confronted by a major challenge of achieving promise and capacity of tremendous expansion in medical data sets of all kinds. Medical databases develop huge bulk of knowledge and data, which mandates a specialized tool to store and perform analysis of data and as a result, effectively use saved knowledge and data. Information is extracted from data by using a domain's background knowledge in the process of IDA. Various matters dealt with regard use, definition, and impact of these processes and they are tested for their optimization in application domains of medicine. The primary focus of Chapter 16 is on the methods and tools of IDA, with an aim to minimize the growing differences between data comprehension and data gathering.

Snoozing, or sleeping, is a physical phenomenon of the human life. When human snooze is disturbed, it generates many problems, such as mental disease, heart disease, etc. Total snooze is characterized by two stages, viz., rapid eye movement and nonrapid eye movement. Bruxism is a type of snooze disorder. The traditional method of the prognosis takes time and the result is in analog form. Chapter 17 proposes a method for easy prognosis of snooze bruxism.

Neurodegenerative diseases like Alzheimer's and Parkinson's impair the cognitive and motor abilities of the patient, along with memory loss and confusion. As handwriting involves proper functioning of the brain and motor control, it is affected. Alteration in handwriting is one of the first signs of Alzheimer's disease. The handwriting gets shaky, due to loss of muscle control, confusion, and forgetfulness. The symptoms get progressively worse. The handwriting becomes illegible and phonological spelling mistakes become inevitable. In Chapter 18, the authors use a feature extraction technique to be used as a parameter for diagnosis. A variational autoencoder (VAE), a deep unsupervised learning technique, has been applied, which is used to compress the input data and then reconstruct it keeping the targeted output the same as the targeted input.

This edited volume on IDA gathers researchers, scientists, and practitioners interested in computational data analysis methods, aimed at narrowing the gap between extensive amounts of data stored in medical databases and the interpretable, understandable, and effective use of the stored data. The expected readers of this book are researchers, scientists, and practitioners interested in IDA, knowledge discovery, and decision support in databases, particularly those who are interested in using these technologies. This publication provides useful references for educational institutions, industry, academic researchers, professionals, developers, and practitioners to apply, evaluate, and reproduce the contributions to this book.

May 07, 2019

New Delhi, India

Deepak Gupta

Bengaluru, India

Siddhartha Bhattacharyya

New Delhi, India

Ashish Khanna

Uttar Pradesh, India

Kalpna Sagar

1 Intelligent Data Analysis: Black Box Versus White Box Modeling

Sarthak Gupta, Siddhant Bagga, and Deepak Kumar Sharma

Division of Information Technology, Netaji Subhas University of Technology, New Delhi, India

1.1 Introduction

In the midst of all of the societal challenges of today's world, digital transformation is rapidly becoming a necessity. The number of internet users is growing at an unprecedented rate. New devices, sensors, and technologies are emerging every day. These factors have led to an exponential increase in the volume of data being generated. According to recent research [1], users of the internet generate 2.5 quintillion bytes of data per day.

1.1.1 Intelligent Data Analysis

Data is only as good as what you make of it. The sheer amount of data being generated calls for methods to leverage its power. With the proper tools and methodologies, data analysis can improve decision making, lower the risks, and unearth hidden insights. Intelligent data analysis (IDA) is concerned with effective analysis of data [2, 3].

The process of IDA consists of three main steps (see Figure 1.1):

Data collection and preparation: This step involves acquiring data and converting it into a format suitable for further analysis. This may involve storing the data as a table, taking care of empty or null values, etc.

Exploration: Before a thorough analysis can be performed on the data, certain characteristics are examined, such as the number of data points, included variables, statistical features, etc. Data exploration allows analysts to get familiar with the dataset and create prospective hypotheses. Visualization is extensively used in this step. Various visualization techniques will be discussed in depth later in this chapter.

Analysis: Various machine learning and deep learning algorithms are applied at this step. Data analysts build models that try to find the best possible fit to the data points. These models can be classified as white box or black box models.

A more comprehensive introduction to data analysis can be found in prior pieces of literature [4–6].
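The three steps above can be illustrated with a minimal Python sketch. The records, variable names, and the choice of a one-feature least-squares line for the analysis step are assumptions made purely for illustration; they are not taken from the chapter.

```python
from statistics import mean, stdev

# Hypothetical raw records: (hours_studied, exam_score); None marks a missing value.
raw = [(1.0, 52.0), (2.0, 55.0), (None, 61.0), (3.0, 61.0), (4.0, None), (5.0, 68.0)]

# Step 1: collection and preparation -- keep only complete rows.
rows = [(x, t) for x, t in raw if x is not None and t is not None]

# Step 2: exploration -- basic characteristics of the prepared data.
xs = [x for x, _ in rows]
ts = [t for _, t in rows]
summary = {"n": len(rows), "mean_x": mean(xs), "mean_t": mean(ts), "stdev_t": stdev(ts)}

# Step 3: analysis -- fit t ~ w0 + w1 * x by ordinary least squares (one feature).
x_bar, t_bar = summary["mean_x"], summary["mean_t"]
w1 = sum((x - x_bar) * (t - t_bar) for x, t in rows) / sum((x - x_bar) ** 2 for x in xs)
w0 = t_bar - w1 * x_bar

print(summary)
print(f"fitted model: t = {w0:.2f} + {w1:.2f} * x")
```

In practice each step is far richer (real data sources, visual exploration, model selection), but the flow of data from one step to the next is the same.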

Figure 1.1 Data analysis process.

1.1.2 Applications of IDA and Machine Learning

IDA and machine learning can be applied to a multitude of products and services, since these models have the ability to make fast, data-driven decisions at scale. We're surrounded by live examples of machine learning in things we use in day-to-day life.

A primary example is web page ranking [7, 8]. Whenever we search for anything on a search engine, the results that we get are presented to us in the order of relevance. To achieve this, the search engine needs to “know” which pages are more relevant than others.

A related application is collaborative filtering [9, 10]. Collaborative filtering filters information based on recommendations of other people. It is based on the premise that people who agreed in their evaluation of certain items in the past are likely to agree again in the future.
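To make the premise concrete, here is a minimal sketch of user-based collaborative filtering in Python. The ratings, the user and item names, and the choice of cosine similarity as the agreement measure are all assumptions for illustration; production recommenders use considerably more refined techniques.

```python
from math import sqrt

# Hypothetical ratings (user -> {item: rating on a 1-5 scale}).
ratings = {
    "ana":  {"A": 5, "B": 4, "C": 1},
    "ben":  {"A": 4, "B": 5, "C": 2, "D": 5},
    "cara": {"A": 1, "B": 2, "C": 5, "D": 1},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += sim
    return num / den if den else None

# "ana" has not rated item "D"; users who agreed with her in the past weigh more.
print(predict("ana", "D"))
```

Because "ana" has agreed with "ben" far more than with "cara" on the items they share, the predicted rating for "D" is pulled toward ben's rating.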

Another application is automatic translation of documents from one language to another. Manually doing this is an extremely arduous task and would take a significant amount of time.

IDA and machine learning models are also being used for many other tasks [11, 12], such as object classification, named entity recognition, object localization, and stock price prediction.

1.1.3 White Box Models Versus Black Box Models

IDA aims to analyze the data to create predictive models. Suppose that we're given a dataset D(X,T), where X represents inputs and T represents target values (i.e., known correct values with respect to the input). The goal is to learn a function (or map) from inputs (X) to outputs (T). This is done by employing supervised machine learning algorithms [13]. A model refers to the artifact that is created by the training (or learning) process. Models are broadly categorized into two types:

White box models: The models whose predictions are easily explainable are called white box models. These models are typically simple, and hence often less effective; their accuracy is usually lower than that of more complex models. Examples include simple decision trees, linear regression, and logistic regression.

Black box models: The models whose predictions are difficult to interpret or explain are called black box models. They are difficult to interpret because of their complexity. Since they are complex models, their accuracy is usually high. Examples include large decision trees, random forests, and neural networks.

So, IDA and machine learning models suffer from an accuracy-explainability trade-off. However, with advances in IDA, the explainability gap in black box models is narrowing.

1.1.4 Model Interpretability

If black box models have better accuracy, why not use them all the time? The problem is that a single metric, such as classification accuracy, is an incomplete description of most real-world tasks [14, 15]. Sometimes in low-risk environments, where decisions don't have severe consequences, it might be sufficient to just know that the model performed well on some test dataset without the need for an explanation. However, machine learning models are being extensively used in high-risk environments like health care, finance, data security, etc. where the impact of decisions is huge. Therefore, it's extremely important to bridge the explainability gap in black box models, so that they can be used with confidence in place of white box models to provide better accuracy.

Interpretability methods may be local or global. Global methods try to explain the model itself, thereby explaining all possible outcomes. Local methods, on the other hand, try to explain why a particular decision was made.
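The distinction can be sketched with a small linear model, where both kinds of explanation are easy to read off. The model, its weights, and the feature names below are hypothetical and serve only to illustrate the idea; explaining real black box models requires dedicated techniques.

```python
# Hypothetical fitted linear model: y = intercept + 0.5 * age - 2.0 * income
weights = {"intercept": 10.0, "age": 0.5, "income": -2.0}

def predict(x):
    """Prediction of the linear model for one input (dict of feature values)."""
    return weights["intercept"] + sum(weights[name] * value for name, value in x.items())

def global_explanation():
    """Global view: the weights describe the model's behavior for every input."""
    return {name: w for name, w in weights.items() if name != "intercept"}

def local_explanation(x):
    """Local view: how much each feature contributed to this one prediction."""
    return {name: weights[name] * value for name, value in x.items()}

sample = {"age": 40.0, "income": 3.0}
print("prediction:", predict(sample))
print("global:", global_explanation())
print("local:", local_explanation(sample))
```

The global explanation is the same for every input, whereas the local explanation changes with the particular data point being explained.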

As artificial intelligence (AI)-assisted decision making is becoming commonplace, the ability to generate simple explanations for black box systems is going to be extremely important, and is already an area of active research.

1.2 Interpretation of White Box Models

White box models are extremely easy to interpret, since interpretability is inherent in their nature. Let's discuss a few white box models and how to interpret them.

1.2.1 Linear Regression

Linear regression [16, 17] attempts to model the relationship between input variables and output by fitting a linear equation to the observed data (see Figure 1.2). A linear regression equation is of the form:

$$y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_p x_p \tag{1.1}$$

where

$y$ is the output variable,

$x_1, x_2, \ldots, x_p$ are the $p$ input variables,

$w_1, w_2, \ldots, w_p$ are the weights associated with the input variables, and

$w_0$ is the intercept, which makes sure that the regression line works even if the data is not centered around the origin (along the output dimension).

The weights are calculated using techniques like ordinary least squares and gradient descent. The details of these techniques are beyond the scope of this chapter; we will focus more on the interpretability of these models.

The interpretation of the weights of a linear model is straightforward: an increase of one unit in the feature $x_j$, with all other features held fixed, changes the output by $w_j$.
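As a minimal sketch of how such weights can be obtained and read, the following Python code fits a two-feature linear model with batch gradient descent on the mean squared error. The data set, the learning rate, the number of steps, and the function names are assumptions chosen for illustration; the data is generated without noise so that the recovered weights can be checked against the known ones.

```python
# Hypothetical noise-free data generated from y = 1 + 2*x1 - 3*x2.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
T = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in X]

def predict(w, x):
    """Linear model: w[0] is the intercept, w[1:] are the feature weights."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

def fit_gradient_descent(X, T, lr=0.1, steps=2000):
    """Minimize the mean squared error with batch gradient descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(steps):
        errors = [predict(w, x) - t for x, t in zip(X, T)]
        # Gradient with respect to the intercept, then each feature weight.
        grad = [2.0 / n * sum(errors)]
        grad += [2.0 / n * sum(e * x[j] for e, x in zip(errors, X)) for j in range(p)]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

w = fit_gradient_descent(X, T)

# Raising x1 by one unit (x2 held fixed) changes the prediction by w[1].
delta = predict(w, (3.0, 1.0)) - predict(w, (2.0, 1.0))
print([round(wj, 3) for wj in w], round(delta, 3))
```

The difference between the two predictions equals the weight of the first feature, which is exactly the one-unit interpretation described above.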

Another metric for interpreting linear models is the $R^2$ measurement [18]. The $R^2$ value tells us how much of the variance of the target outcomes is explained by the model. It ranges from 0 to 1; the higher the $R^2$ value, the better the model explains the data. $R^2$ is calculated as:

Figure 1.2 Linear regression.

$$R^2 = 1 - \frac{SS_r}{SS_t} \tag{1.2}$$

where

$SS_r$ is the squared sum of residuals, and

$SS_t$ is the total sum of squares (proportional to the variance of the data).

The residual $e_i$ is defined as:

$$e_i = t_i - y_i \tag{1.3}$$

where

$y_i$ is the model's predicted output, and

$t_i$ is the target value in the dataset.

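Hence, $SS_r$ is calculated as:

$$SS_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (t_i - y_i)^2 \tag{1.4}$$

And $SS_t$ is calculated as:

$$SS_t = \sum_{i=1}^{n} (t_i - \bar{t})^2 \tag{1.5}$$

where $\bar{t}$ is the mean of all target values.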
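The calculation can be sketched in a few lines of Python, assuming the standard definition in which R² equals one minus the ratio of the squared sum of residuals to the total sum of squares about the mean of the targets. The function name and the sample values are hypothetical.

```python
def r_squared(targets, predictions):
    """R^2 = 1 - SS_r / SS_t for paired target and predicted values."""
    t_bar = sum(targets) / len(targets)
    # Squared sum of residuals: how far the predictions are from the targets.
    ss_r = sum((t - y) ** 2 for t, y in zip(targets, predictions))
    # Total sum of squares: how far the targets are from their own mean.
    ss_t = sum((t - t_bar) ** 2 for t in targets)
    return 1.0 - ss_r / ss_t

targets = [3.0, 5.0, 7.0, 9.0]
predictions = [2.5, 5.5, 6.5, 9.5]
print(r_squared(targets, predictions))
```

Two boundary cases help with interpretation: a model that reproduces every target exactly scores 1, while a model that always predicts the mean of the targets scores 0.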

However, there is a problem with the $R^2$ value: it increases with the number of features, even if they carry no information about the target values. Hence, the adjusted $R^2$ value ($\bar{R}^2$) is used, which takes into account the number of input features:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \tag{1.6}$$

where

$\bar{R}^2$ is the adjusted $R^2$ value,

$n$ is the number of data points, and

$p$ is the number of input features (or input variables).
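As a small worked example, the following Python sketch assumes the standard form of the adjusted R², namely one minus (1 − R²)(n − 1)/(n − p − 1). The function name and the numbers are hypothetical; they are chosen only to show how the penalty grows with the number of features.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for a model with p input features fitted on n data points."""
    if n - p - 1 <= 0:
        raise ValueError("adjusted R^2 requires n > p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Same raw R^2 and sample size, but a different number of input features.
few_features = adjusted_r_squared(0.95, n=20, p=2)
many_features = adjusted_r_squared(0.95, n=20, p=10)
print(round(few_features, 4), round(many_features, 4))
```

With the raw R² held at 0.95, the model with ten features receives a visibly lower adjusted value than the model with two, reflecting the penalty for features that do not improve the fit.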

1.2.2 Decision Tree

Decision trees [19] are classifiers – they classify a given data point by posing a series of questions about the features associated with the data item (see Figure 1.3).

Unlike linear regression, decision trees are able to model nonlinear data. In a decision tree, nodes represent features, each edge or link represents a decision, and leaf nodes represent outcomes.

The general algorithm for decision trees is given below:

1. Pick the best attribute/feature. The best feature is the one that separates the data in the best possible way. The optimal split is one after which all data points belonging to different classes are in separate subsets.

2. For each value of the attribute, create a new child node of the current node.

3. Divide the data among the new child nodes.

4. For each new child node:

a. If all the data points in that node belong to the same class, then stop.

b. Else, go to step 1 and repeat the process with the current node as the decision node.
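The algorithm above can be sketched recursively in Python. The text does not fix how the "best" attribute is measured, so this sketch assumes an entropy-based criterion (choosing the attribute with the lowest weighted entropy after the split, as in ID3-style trees). The toy data set, the attribute names, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute with the lowest weighted entropy after the split."""
    def split_entropy(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return min(attributes, key=split_entropy)

def build_tree(rows, labels, attributes):
    # Stop when the node is pure; if no attributes remain, use the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    children = {}
    # One child per attribute value, with the data divided among the children.
    for value in sorted({row[attr] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        children[value] = build_tree(sub_rows, sub_labels, remaining)
    return {"attribute": attr, "children": children}

def classify(tree, row):
    """Follow the decisions from the root down to a leaf (a class label)."""
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attribute"]]]
    return tree

rows = [
    {"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"}, {"outlook": "rainy", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"}, {"outlook": "overcast", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "play", "play"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
print(classify(tree, {"outlook": "rainy", "windy": "yes"}))
```

The resulting nested dictionary can be read directly as a sequence of questions, which is what makes small trees easy to interpret. This sketch does not handle attribute values that were never seen during training, nor numeric features, both of which practical implementations must address.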

Figure 1.3 Decision tree.

Figure 1.4