This book focuses on methods and tools for intelligent data analysis, aimed at narrowing the increasing gap between data gathering and data comprehension. Emphasis is also given to solving problems that result from automated data collection, such as analysis of computer-based patient records, data warehousing tools, intelligent alarming, and effective and efficient monitoring. The book describes the different approaches to intelligent data analysis from a practical point of view: solving common real-life problems with data analysis tools.
Page count: 679
Year of publication: 2020
Cover
List of Contributors
Series Preface
Preface
1 Intelligent Data Analysis: Black Box Versus White Box Modeling
1.1 Introduction
1.2 Interpretation of White Box Models
1.3 Interpretation of Black Box Models
1.4 Issues and Further Challenges
1.5 Summary
References
2 Data: Its Nature and Modern Data Analytical Tools
2.1 Introduction
2.2 Data Types and Various File Formats
2.3 Overview of Big Data
2.4 Data Analytics Phases
2.5 Data Analytical Tools
2.6 Database Management System for Big Data Analytics
2.7 Challenges in Big Data Analytics
2.8 Conclusion
References
3 Statistical Methods for Intelligent Data Analysis: Introduction and Various Concepts
3.1 Introduction
3.2 Probability
3.3 Descriptive Statistics
3.4 Inferential Statistics
3.5 Statistical Methods
3.6 Errors
3.7 Conclusion
References
4 Intelligent Data Analysis with Data Mining: Theory and Applications
4.1 Introduction to Data Mining
4.2 Data and Knowledge
4.3 Discovering Knowledge in Data Mining
4.4 Data Analysis and Data Mining
4.5 Data Mining: Issues
4.6 Data Mining: Systems and Query Language
4.7 Data Mining Methods
4.8 Data Exploration
4.9 Data Visualization
4.10 Probability Concepts for Intelligent Data Analysis (IDA)
Reference
5 Intelligent Data Analysis: Deep Learning and Visualization
5.1 Introduction
5.2 Deep Learning and Visualization
5.3 Data Processing and Visualization
5.4 Experiments and Results
5.5 Conclusion
References
6 A Systematic Review on the Evolution of Dental Caries Detection Methods and Its Significance in Data Analysis Perspective
6.1 Introduction
6.2 Different Caries Lesion Detection Methods and Data Characterization
6.3 Technical Challenges with the Existing Methods
6.4 Result Analysis
6.5 Conclusion
Acknowledgment
References
7 Intelligent Data Analysis Using Hadoop Cluster – Inspired MapReduce Framework and Association Rule Mining on Educational Domain
7.1 Introduction
7.2 Learning Analytics in Education
7.3 Motivation
7.4 Literature Review
7.5 Intelligent Data Analytical Tools
7.6 Intelligent Data Analytics Using MapReduce Framework in an Educational Domain
7.7 Results
7.8 Conclusion and Future Scope
References
8 Influence of Green Space on Global Air Quality Monitoring: Data Analysis Using K-Means Clustering Algorithm
8.1 Introduction
8.2 Material and Methods
8.3 Results
8.4 Quantitative Analysis
8.5 Discussion
8.6 Conclusion
References
9 IDA with Space Technology and Geographic Information System
9.1 Introduction
9.2 Geospatial Techniques
9.3 Comparative Analysis
9.4 Conclusion
References
10 Application of Intelligent Data Analysis in Intelligent Transportation System Using IoT
10.1 Introduction to Intelligent Transportation System (ITS)
10.2 Issues and Challenges of Intelligent Transportation System (ITS)
10.3 Intelligent Data Analysis Makes an IoT-Based Transportation System Intelligent
10.4 Intelligent Data Analysis for Security in Intelligent Transportation System
10.5 Tools to Support IDA in an Intelligent Transportation System
References
11 Applying Big Data Analytics on Motor Vehicle Collision Predictions in New York City
11.1 Introduction
11.2 Materials and Methods
11.3 Classification Algorithms and K-Fold Validation Using Data Set Obtained from NYPD (2012–2017)
11.4 Results
11.5 Discussion
11.6 Conclusion
References
12 A Smart and Promising Neurological Disorder Diagnostic System: An Amalgamation of Big Data, IoT, and Emerging Computing Techniques
12.1 Introduction
12.2 Statistics of Neurological Disorders
12.3 Emerging Computing Techniques
12.4 Related Works and Publication Trends of Articles
12.5 The Need for Neurological Disorders Diagnostic System
12.6 Conclusion
References
13 Comments-Based Analysis of a Bug Report Collection System and Its Applications
13.1 Introduction
13.2 Background
13.3 Related Work
13.4 Data Collection Process
13.5 Analysis of Bug Reports
13.6 Threats to Validity
13.7 Conclusion
References
Notes
14 Sarcasm Detection Algorithms Based on Sentiment Strength
14.1 Introduction
14.2 Literature Survey
14.3 Experiment
14.4 Results and Evaluation
14.5 Conclusion
References
Notes
15 SNAP: Social Network Analysis Using Predictive Modeling
15.1 Introduction
15.2 Literature Survey
15.3 Comparative Study
15.4 Simulation and Analysis
15.5 Conclusion and Future Work
References
16 Intelligent Data Analysis for Medical Applications
16.1 Introduction
16.2 IDA Needs in Medical Applications
16.3 IDA Methods Classifications
16.4 Intelligent Decision Support System in Medical Applications
16.5 Conclusion
References
17 Bruxism Detection Using Single-Channel C4-A1 on Human Sleep S2 Stage Recording
17.1 Introduction
17.2 History of Sleep Disorder
17.3 Electroencephalogram Signal
17.4 EEG Data Measurement Technique
17.5 Literature Review
17.6 Subjects and Methodology
17.7 Data Analysis of the Bruxism and Normal Data Using EEG Signal
17.8 Result
17.9 Conclusions
Acknowledgments
References
18 Handwriting Analysis for Early Detection of Alzheimer's Disease
18.1 Introduction and Background
18.2 Proposed Work and Methodology
18.3 Results and Discussions
18.4 Conclusion
References
Index
End User License Agreement
Chapter 2
Table 2.1 Schema of an employee table in a broker company.
Table 2.2 Data storage measurements.
Table 2.3 Comparison of different data analytic tools.
Table 2.4 Comparison between SQL and NoSQL.
Chapter 4
Table 4.1 Dissimilarities between data and knowledge.
Chapter 6
Table 6.1 Global DMFT trends for 12-year-old children [81–83].
Table 6.2 Code meaning.
Chapter 7
Table 7.1 Educational data set [37].
Table 7.2 Synthesized educational data set [38].
Table 7.3 The data set for course selection.
Table 7.4 Output of Map reduce task.
Table 7.5 Best rules found by Apriori.
Chapter 8
Table 8.1 Air quality categories (annual mean ambient defined by WHO).
Table 8.2 Categorization of the difference of green space area percentage dur...
Table 8.3 Analysis of variance (ANOVA) statistics table.
Chapter 9
Table 9.1 NoSQL database types.
Table 9.2 NoSQL database types.
Table 9.3 NoSQL database types.
Chapter 10
Table 10.1 Objects, statistical techniques, and graphs supported by R program...
Chapter 11
Table 11.1 Illustration of data set attributes.
Table 11.2 Categorized vehicle groups.
Table 11.3 Description of classification algorithms and functionalities.
Table 11.4 Comparison of classifier results.
Table 11.5 Analyzed p-value test results.
Chapter 12
Table 12.1 Difference between neurological and psychological disorders.
Table 12.2 Publications details along with citations used in the study.
Chapter 13
Table 13.1 Comparison of previous studies of data extraction.
Table 13.2 Categories of error and its significant keywords.
Table 13.3 Frequent words for severe and nonsevere bugs.
Chapter 14
Table 14.1 Examples for hyperbolic sarcasm.
Table 14.2 Examples for general sarcasm, positive sentences, and negative sen...
Table 14.3 Shows the patterns used by extended Algorithm 14.2 to detect the p...
Table 14.4 Shows example cases for Table 14.3.
Table 14.5 True positive and True negative values of the classification resul...
Table 14.6 Evaluation results of the classification done by the extended algo...
Chapter 15
Table 15.1 Comparison table of literature work.
Chapter 17
Table 17.1 The comparative analysis between bruxism and a normal human for th...
Table 17.2 The comparative analysis between bruxism and normal human for the ...
Table 17.3 The comparative analysis between bruxism and normal human for the ...
Chapter 1
Figure 1.1 Data analysis process.
Figure 1.2 Linear regression.
Figure 1.3 Decision tree.
Figure 1.4 Distribution of points in case of high and low information gain....
Figure 1.5 Partial dependence plots from a gradient boosting regressor train...
Figure 1.6 Partial dependence plot from a gradient boosting regressor traine...
Figure 1.7 Relationship between X_2 and Y [24].
Figure 1.8 ICE plot between feature X_2 and Y [24].
Figure 1.9 Calculation of PDP and M-plot [25].
Figure 1.10 Calculation of ALE plot [25].
Figure 1.11 Correlation does not imply causation [29].
Chapter 2
Figure 2.1 Various stages of data.
Figure 2.2 Classifications of digital data.
Figure 2.3 CSV file opened in Microsoft Excel.
Figure 2.4 Plain text file opened in Notepad.
Figure 2.8 Characteristics of big data.
Figure 2.9 Different types of big data analytics.
Figure 2.10 Various phases of data analytics.
Figure 2.11 Features of Apache Spark.
Figure 2.12 Components of Hadoop.
Chapter 4
Figure 4.1 From data to knowledge.
Figure 4.2 Variety of data in data mining.
Figure 4.3 Knowledge tree for intelligent data mining.
Figure 4.4 Knowledge discovery process.
Figure 4.5 Relationship between data analysis and data mining.
Figure 4.6 Issues in data mining.
Figure 4.7 Various systems in data mining.
Figure 4.8 Diagrammatic concept of classification.
Figure 4.9 Diagrammatic concept of clustering.
Figure 4.10 Diagrammatic concept of classification.
Figure 4.11 Specimen for decision tree induction.
Figure 4.12 Sample representation for stacked column chart.
Figure 4.13 Different relationships shown by scatter plots for bivariate ana...
Figure 4.14 Different techniques used for data visualization.
Figure 4.15 Different sample visualizations used for different cases.
Figure 4.16 Different probability distribution functions classification and ...
Chapter 5
Figure 5.1 Left: overview of neural network and deep learning; Right: branch...
Figure 5.2 (a) Overview of visualization: score function, data loss, and reg...
Figure 5.3 Linear model and sample data visualization: left: a simple linear...
Figure 5.4 Gradient descent is the excellent to visualization in deep learni...
Figure 5.5 left: Design the model with simplify blocks regarding dog detecti...
Figure 5.6 The loss of entropy.
Figure 5.7 (a) Matrix multiplication for deep learning using linear model. (...
Figure 5.8 Optimizer [16]: Adam works and others shows.
Figure 5.9 Left: example of block box most uses to visualize the complex net...
Figure 5.10 Overview of reinforcement learning model [9]: an agent is visual...
Figure 5.11 Deep reinforcement learning.
Figure 5.12 Reinforcement learning and visualization.
Figure 5.13 Inception v3 module: it was the powerful for visualizing the dee...
Figure 5.14 GoogLeNet architecture [12].
Figure 5.15 x: input, z: logit, : softmax, y: 1-hot labels;
Figure 5.16 Example of interpretation of histogram distribution [Morvan].
Figure 5.17 Illustrated the multiple layers features in representation [medi...
Figure 5.18 Relationship visualizations: two variables using the scatter dia...
Figure 5.19 Comparison method: overview of charts is represented the most co...
Figure 5.20 Composition methodology: overview of charts is represented most ...
Figure 5.21 Example of visualization applied MNIST data set by using deep le...
Figure 5.22 MNIST visualization.
Figure 5.23 Example of visualization using MNIST in 3D.
Figure 5.24 L1 and L2 regularization.
Figure 5.25 Dropout processing and visualization: sampling dropout loss base...
Figure 5.26 Mask-RCNN for object detection and segmentation [21].
Figure 5.27 Mask-RCNN result progress: training with Mask-RCNN according to ...
Figure 5.28 Deep learning and object visualization based on sampling during ...
Figure 5.29 Deep learning and object visualization.
Figure 5.30 Human detection using Mask RCNN: noised data during the human de...
Figure 5.31 Showing the activation function of layers based on food recognit...
Figure 5.32 Interpretation of histogram distribution using Mask-RCNN.
Figure 5.33 Overfitting representation based on experience from Mask-RCNN [2...
Figure 5.34 Weights histogram based on distributed parameters of training se...
Figure 5.35 Correlations.
Figure 5.36 Visualization of food recognition.
Figure 5.37 Visualization for deep matrix factorization model [18].
Figure 5.38 Visualization and loss function in deep learning for recommendat...
Figure 5.39 Data visualization in MovieLens 1 M of recommendation system bas...
Figure 5.40 Line in charts, and modeling and visualization for reinforcement...
Chapter 6
Figure 6.1 Dental caries at its different phases.
Figure 6.2 Worldwide dental caries severity regions.
Figure 6.3 The affected risk of dental caries on smoking.
Figure 6.4 Worldwide dental caries affected Level that according to DMFT amo...
Figure 6.5 Classification of caries detection method.
Figure 6.6 Internal diagram of point detection method.
Figure 6.7 Teeth data features along with its distribution.
Figure 6.8 Discoloration of enamel under FOTI machine.
Figure 6.9 (a) (35–40) mm teeth image, (b) QLF teeth image.
Figure 6.10 (a) FOTI device, (b) diagnodent device, (c) QLF machine, (d) car...
Figure 6.11 Caries affected lesion, 3D view of the same lesion and it spread...
Figure 6.12 Performance of traditional caries detection methods after Bader ...
Figure 6.13 Performance of traditional method for Proximal Surfaces after Ba...
Chapter 7
Figure 7.1 Artificial intelligence and its subsets using intelligent data an...
Figure 7.2 Learner support provided by learning analytics.
Figure 7.3 Learning through web and mobile computing.
Figure 7.4 Sample techniques for the analytics engine [6].
Figure 7.5 Data mining using WEKA tool [36].
Figure 7.6 Decision tree generated for the data set [37].
Figure 7.7 Distribution table for the data set [37].
Figure 7.8 (a). Visualization of student attributes (K = 2) [38]. (b). Visua...
Figure 7.9 Working principle of MapReduce framework.
Figure 7.10 Output obtained from MapReduce programming framework.
Chapter 8
Figure 8.1 The flow of data processing procedure.
Figure 8.2 (a) Air quality with land areas in 2014 (using 1 048 576 instance...
Figure 8.3 (a) Tree area in 1990. (b) Tree area in 2014. (c) Difference of t...
Figure 8.4 Variance of each attribute with coordinates.
Figure 8.5 Variance of each attribute.
Figure 8.6 Count values of cases in each cluster.
Figure 8.7 Tree area percentage/relation of raw data (difference) and ranges...
Figure 8.8 Air quality with green space percentage.
Figure 8.9 Air quality with green space percentage analysis.
Chapter 9
Figure 9.1 Data collection from various sources from the space.
Figure 9.2 GIS evolution and future trends.
Figure 9.3 Remote sensing big data architecture.
Figure 9.4 The machine learning process.
Figure 9.5 Big data in remote sensing.
Figure 9.6 Big data in remote sensing.
Figure 9.7 Geospatial techniques.
Figure 9.8 A roadmap for geospatial big data management.
Figure 9.9 A roadmap knowledge discovery and service.
Figure 9.10 Conceptual diagram of the proposed fogGIS framework for power-ef...
Chapter 10
Figure 10.1 Overview of intelligent transportation system.
Figure 10.2 Services of intelligent transportation system (ITS).
Figure 10.3 Challenges and opportunities in the implementation of ITS.
Figure 10.4 Process of intelligent data analysis.
Figure 10.5 Three-dimensional model for security in ITS.
Figure 10.6 Data types of Python.
Chapter 11
Figure 11.1 Overall methodology of data analysis process.
Figure 11.2 Accuracy comparison of RF and kNN.
Figure 11.3 Random forest node processing time.
Figure 11.4 Random forest node accuracy.
Figure 11.5 Heat map of large vehicle collisions.
Figure 11.6 Heat map of very-small vehicle collisions.
Figure 11.7 Comparison of number of collisions, persons injured, and persons...
Figure 11.8 Number of persons injured based on vehicle groups.
Figure 11.9 Number of persons killed based on vehicle groups.
Figure 11.10 Number of persons injured based on borough.
Figure 11.11 Number of persons killed based on borough.
Figure 11.12 Number of persons injured in medium vehicles over N-68802 colli...
Figure 11.13 Number of persons killed in medium vehicles over N-68802 collis...
Figure 11.14 Number of persons injured in large vehicles over N-27508 collis...
Figure 11.15 Number of persons killed in large vehicles over N-27508 collisi...
Figure 11.16 Number of persons injured in small vehicles over N-892174 colli...
Figure 11.17 Number of persons killed in small vehicles over N-892174 collis...
Figure 11.18 Number of persons injured in very small vehicles over N-9705 co...
Figure 11.19 Number of persons killed in very small vehicles over N-9705 col...
Chapter 12
Figure 12.1 Types of neurological disorders.
Figure 12.2 Prevalence and death rate due to neurological disorders in the y...
Figure 12.3 Prevalence of neurological disorders in different countries [15]...
Figure 12.4 IoT and big data.
Figure 12.5 Soft computing techniques.
Figure 12.6 The process to generate an optimal solution [76, 77].
Figure 12.7 Machine learning applications.
Figure 12.8 The accuracy achieved by different studies for neurological diso...
Figure 12.9 Sensitivity achieve by different studies for neurological disord...
Figure 12.10 Specificity achieve by different studies for neurological disor...
Figure 12.11 Publication trend from 2008 to 2018 for neurological disorder d...
Figure 12.12 Neurological disorder diagnostic framework.
Chapter 13
Figure 13.1 Statistics of bug reports of 20 projects of the Apache Software ...
Figure 13.2 (a) Number of bug reports based on resolution. (b) Number of bug...
Figure 13.3 Example of bug report of Accumulo project.
Figure 13.4 Data extraction process.
Figure 13.5 Number of open bugs of distinct severity level.
Figure 13.6 Percentage of open bugs as per severity level.
Figure 13.7 (a)–(d) Most contributing developer for 20 projects of Apache So...
Figure 13.8 Code for finding correlated words (a) Association graph for logic...
Figure 13.9 (a)–(d) Association graphs of various errors for Kafka project (...
Figure 13.10 (a)–(b) Frequency and association plots for severe bugs.
Figure 13.11 K-means cluster group similar words.
Figure 13.12 Dendrogram of most similar words.
Chapter 14
Figure 14.1 Sentiment strengths and their elaboration as given by SentiStre...
Figure 14.2 Chart showing classification results of all four sentiments.
Figure 14.3 Chart showing evaluation results.
Chapter 16
Figure 16.1 Conventional decision support system.
Figure 16.2 Intelligent system for decision support/expert analysis in layou...
Chapter 17
Figure 17.1 Differences of bruxism patient teeth and normal human teeth.
Figure 17.2 Flow chart of the proposed work.
Figure 17.3 Low pass filter.
Figure 17.4 The loading of the bruxism data for the EEG signal and the total...
Figure 17.5 Loading of the normal data for the EEG signal in the S2 snooze s...
Figure 17.6 Extracted single-channels C4-A1 of the bruxism for the S2 sleep ...
Figure 17.7 Extracted single-channels C4-A1 of the normal for the S2 sleep s...
Figure 17.8 Filtered C4-A1 channel of S2 sleep stage for bruxism, we used a ...
Figure 17.9 Filtered C4-A1 channel of S2 sleep stage for the normal, we used...
Figure 17.10 Sampled C4-A1 channel of S2 sleep stage for the bruxism using t...
Figure 17.11 Sampled C4-A1 channel of S2 sleep stage for the normal using Ha...
Figure 17.12 It has represented the estimation of the power spectral density...
Figure 17.13 It has represented the estimation of the power spectral density...
Figure 17.14 Graphical representation for the normalized value of the single...
Chapter 18
Figure 18.1 This is simplest form of representation of the Encoder architect...
Figure 18.2 (a) The encoder compresses data into latent space (y). (b) The d...
Figure 18.3 Image reconstruction process.
Figure 18.4 Line segment from handwritten sample from patients suffering fro...
Figure 18.5 (a and b) Word segmentation samples produced from the segmented ...
Figure 18.6 Sample of character segmentation obtained using segmented words....
Figure 18.7 Segmented characters reconstructed using VAE.
Figure 18.8 Clusters of reconstructed images using VAE. (a) Cluster for “e,”...
Figure 18.9 Ambiguous “l” and “e.”
Figure 18.10 Ambiguous “c” and “e.”
Figure 18.11 Unclear or disconnected writing with spelling errors.
Edited by
Deepak Gupta
Maharaja Agrasen Institute of Technology
Delhi, India
Siddhartha Bhattacharyya
CHRIST (Deemed to be University)
Bengaluru, India
Ashish Khanna
Maharaja Agrasen Institute of Technology
Delhi, India
Kalpna Sagar
KIET Group of Institutions
Uttar Pradesh, India
This edition first published 2020
© 2020 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Deepak Gupta, Siddhartha Bhattacharyya, Ashish Khanna, and Kalpna Sagar to be identified as the authors of the editorial material in this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This work's use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data
Names: Gupta, Deepak, editor.
Title: Intelligent data analysis : from data gathering to data comprehension / edited by Dr. Deepak Gupta, Dr. Siddhartha Bhattacharyya, Dr. Ashish Khanna, Ms. Kalpna Sagar.
Description: Hoboken, NJ, USA : Wiley, 2020. | Series: The Wiley series in intelligent signal and data processing | Includes bibliographical references and index.
Identifiers: LCCN 2019056735 (print) | LCCN 2019056736 (ebook) | ISBN 9781119544456 (hardback) | ISBN 9781119544449 (adobe pdf) | ISBN 9781119544463 (epub)
Subjects: LCSH: Data mining. | Computational intelligence.
Classification: LCC QA76.9.D343 I57435 2020 (print) | LCC QA76.9.D343 (ebook) | DDC 006.3/12–dc23
LC record available at https://lccn.loc.gov/2019056735
LC ebook record available at https://lccn.loc.gov/2019056736
Cover Design: Wiley
Cover Image: © gremlin/Getty Images
Deepak Gupta would like to dedicate this book to his father, Sh. R.K. Gupta, his mother, Smt. Geeta Gupta, his mentors for their constant encouragement, and his family members, including his wife, brothers, sisters, kids and the students.
Siddhartha Bhattacharyya would like to dedicate this book to his parents, the late Ajit Kumar Bhattacharyya and the late Hashi Bhattacharyya, his beloved wife, Rashni, and his research scholars, Sourav, Sandip, Hrishikesh, Pankaj, Debanjan, Alokananda, Koyel, and Tulika.
Ashish Khanna would like to dedicate this book to his parents, the late R.C. Khanna and Smt. Surekha Khanna, for their constant encouragement and support, and to his wife, Sheenu, and children, Master Bhavya and Master Sanyukt.
Kalpna Sagar would like to dedicate this book to her father, Mr. Lekh Ram Sagar, and her mother, Smt. Gomti Sagar, the strongest persons in her life.
List of Contributors
Ambarish G. Mohapatra
Silicon Institute of Technology
Bhubaneswar
India
Anirban Mukherjee
RCC Institute of Information Technology
West Bengal
India
Aniruddha Sadhukhan
RCC Institute of Information Technology
West Bengal
India
Anisha Roy
RCC Institute of Information Technology
West Bengal
India
Arvinder Kaur
Guru Gobind Singh Indraprastha University
India
Ayush Ahuja
Jaypee Institute of Information Technology Noida
India
Biswajit Modak
Nabadwip State General Hospital
Nabadwip
India
R.S. Bhatia
National Institute of Technology
Kurukshetra
India
Bright Keswani
Suresh Gyan Vihar University
Jaipur
India
Dakun Lai
University of Electronic Science and Technology of China
Chengdu
China
Deepak Kumar Sharma
Netaji Subhas University of Technology
New Delhi
India
Dhanushka Abeyratne
Yellowfin (HQ)
The University of Melbourne
Australia
Faijan Akhtar
Jamia Hamdard
New Delhi
India
Gihan S. Pathirana
Charles Sturt University
Melbourne
Australia
Huy V. Pham
Ton Duc Thang University
Vietnam
Malka N. Halgamuge
The University of Melbourne
Australia
Manashi De
Techno India
West Bengal
India
Manik Sharma
DAV University
Jalandhar
India
Manu Agarwal
Jaypee Institute of Information Technology Noida
India
Manu Sood
University Shimla
India
Md Belal Bin Heyat
University of Electronic Science and Technology of China
Chengdu
China
Mohd Ammar Bin Hayat
Medical University
India
Moolchand Sharma
Maharaja Agrasen Institute of Technology (MAIT)
Delhi
India
Nabendu Chaki
University of Calcutta
Kolkata
India
Nisheeth Joshi
Banasthali Vidyapith
Rajasthan
India
Om Prakash Rishi
University of Kota
India
Poonam Keswani
Akashdeep PG College
Jaipur
India
Prableen Kaur
DAV University
Jalandhar
India
Pragya Katyayan
Banasthali Vidyapith
Rajasthan
India
Pratiyush Guleria
University Shimla
India
Prerna Sharma
Maharaja Agrasen Institute of Technology (MAIT)
Delhi
India
Rachna Jain
Bharati Vidyapeeth's College of Engineering
New Delhi
India
Rahul Johari
GGSIP University
New Delhi
India
Rajib Saha
RCC Institute of Information Technology
West Bengal
India
Rakesh Roshan
Institute of Management Studies
Ghaziabad
India
Ramneek Singhal
Bharati Vidyapeeth's College of Engineering
New Delhi
India
Ravinder Ahuja
Jaypee Institute of Information Technology Noida
India
Samarth Chugh
Netaji Subhas University of Technology
New Delhi
India
Samridhi Seth
GGSIP University
New Delhi
India
Sarthak Gupta
Netaji Subhas University of Technology
New Delhi
India
Shadab Azad
Chaudhary Charan Singh University Meerut
India
Shafan Azad
Dr. A.P.J. Abdul Kalam Technical University
Uttar Pradesh
India
Shajan Azad
Hayat Institute of Nursing
Lucknow
India
Shikhar Asthana
Jaypee Institute of Information Technology Noida
India
Shivam Bachhety
Bharati Vidyapeeth's College of Engineering
New Delhi
India
Shubham Kumaram
Netaji Subhas University of Technology
New Delhi
India
Shubhra Goyal
Guru Gobind Singh Indraprastha University
India
Siddhant Bagga
Netaji Subhas University of Technology
New Delhi
India
Soma Datta
University of Calcutta
Kolkata
India
Tarini Ch. Mishra
Silicon Institute of Technology
Bhubaneswar
India
Than D. Le
University of Bordeaux
France
Vikas Chaudhary
KIET
Ghaziabad
India
Series Preface
The Intelligent Signal and Data Processing (ISDP) book series is aimed at fostering the field of signal and data processing, which encompasses the theory and practice of algorithms and hardware that convert signals produced by artificial or natural means into a form useful for a specific purpose. The signals might be speech, audio, images, video, sensor data, telemetry, electrocardiograms, or seismic data, among others. The possible application areas include transmission, display, storage, interpretation, classification, segmentation, and diagnosis. The primary objective of the ISDP book series is to evolve future-generation scalable intelligent systems for faithful analysis of signals and data. ISDP is mainly intended to enrich the scholarly discourse on intelligent signal and image processing in its different incarnations. ISDP will benefit a wide range of learners, including students, researchers, and practitioners. Students can use the volumes in the series as reference texts to advance their knowledge base. The monographs will also come in handy to aspiring researchers because of the valuable contributions they make to this field. Moreover, both faculty members and data practitioners are likely to grasp the depth of the relevant knowledge base from these volumes.
The series coverage will include, but is not limited to, the following:
Intelligent signal processing
    Adaptive filtering
    Learning algorithms for neural networks
    Hybrid soft-computing techniques
    Spectrum estimation and modeling
Image processing
    Image thresholding
    Image restoration
    Image compression
    Image segmentation
    Image quality evaluation
    Computer vision and medical imaging
    Image mining
    Pattern recognition
    Remote sensing imagery
    Underwater image analysis
    Gesture analysis
    Human mind analysis
    Multidimensional image analysis
Speech processing
    Modeling
    Compression
    Speech recognition and analysis
Video processing
    Video compression
    Analysis and processing
    3D video compression
    Target tracking
    Video surveillance
    Automated and distributed crowd analytics
    Stereo-to-auto stereoscopic 3D video conversion
    Virtual and augmented reality
Data analysis
    Intelligent data acquisition
    Data mining
    Exploratory data analysis
    Modeling and algorithms
    Big data analytics
    Business intelligence
    Smart cities and smart buildings
    Multiway data analysis
    Predictive analytics
    Intelligent systems
Intelligent data analysis (IDA), knowledge discovery, and decision support have recently become more challenging research fields and have gained much attention among a large number of researchers and practitioners. In our view, the awareness of these challenging research fields and emerging technologies among the research community will increase the applications in biomedical science. This book aims to present the various approaches, techniques, and methods that are available for IDA, and to present case studies of their application.
This volume comprises 18 chapters focusing on the latest advances in IDA tools and techniques.
Machine learning models are broadly categorized into two types: white box and black box. Due to the difficulty in interpreting their inner workings, some machine learning models are considered black box models. Chapter 1 focuses on the different machine learning models, along with their advantages and limitations as far as the analysis of data is concerned.
With the advancement of technology, the amount of data generated is very large. The data generated has useful information that needs to be gathered by data analytics tools in order to make better decisions. In Chapter 2, the definition of data and its classifications based on different factors is given. The reader will learn what data is and how it is broken down into categories. After a description of what data is, the chapter will focus on defining and explaining big data and the various challenges faced when dealing with big data. The authors also describe various types of analytics that can be performed on large data and six data analytics tools (Microsoft Excel, Apache Spark, OpenRefine, R, Hadoop, and Tableau).
In recent years, the widespread use of computers and the internet has led to the generation of data on an unprecedented scale. To make effective use of this data, it must be collected and analyzed so that inferences can be made to improve various products and services. Statistics deals with the collection, organization, and analysis of data. Chapter 3 covers both branches: the organization and description of data is studied under descriptive statistics, while the analysis of data and how to make predictions based on it is dealt with in inferential statistics.
After having an idea about various aspects of IDA in the previous chapters, Chapter 4 deals with an overview of data mining. It also discusses the process of knowledge discovery in data along with a detailed analysis of various mining methods including classification, clustering, and decision tree. In addition to that, the chapter concludes with a view of data visualization and probability concepts for IDA.
In Chapter 5, the authors demonstrate one of the most crucial and challenging areas in computer vision and the IDA field based on manipulating the convergence. This subject is divided into a deep learning paradigm for object segmentation in computer vision and visualization paradigm for efficiently incremental interpretation in manipulating the datasets for supervised and unsupervised learning, and online or offline training in reinforcement learning. This topic recently has had a large impact in robotics and autonomous systems, food detection, recommendation systems, and medical applications.
Dental caries is a painful bacterial disease of teeth caused mainly by Streptococcus mutans, acid, and carbohydrates, and it destroys the enamel, or the dentine, layer of the tooth. As per the World Health Organization report, worldwide, 60–90% of school children and almost 100% of adults have dental caries. Dental caries and periodontal disease without treatment for long periods cause tooth loss. There is not a single method to detect caries in its earliest stages. The size of carious lesions and early caries detection are very challenging tasks for dental practitioners. The methods related to dental caries detection are the radiograph, QLF or quantitative light-induced fluorescence, ECM, FOTI, DIFOTI, etc. In a radiograph-based technique, dentists analyze the image data. In Chapter 6, the authors present a method to detect caries by analyzing the secondary emission data.
With the growth of data in the education field in recent years, there is a need for intelligent data analytics, so that academic data can be used effectively to improve learning. Educational data mining and learning analytics are the fields of IDA that play important roles in intelligent analysis of educational data. One of the real challenges faced by students and institutions alike is the quality of education. An equally important factor related to the quality of education is the performance of students in the higher education system. The decisions that the students make while selecting their area of specialization are of grave concern here. In the absence of support systems, the students and the teachers/mentors fall short when making the right decisions for the furthering of their chosen career paths. Therefore, in Chapter 7, the authors attempt to address the issue by proposing a system that can guide the student to choose and to focus on the right course(s) based on their personal preferences. For this purpose, a system has been envisaged by blending data mining and classification with big data. A methodology using MapReduce Framework and association rule mining is proposed in order to derive the right blend of courses for students to pursue to enhance their career prospects.
Atmospheric air pollution is creating significant health problems that affect millions of people around the world. Chapter 8 analyzes the hypothesis about whether or not global green space variation is changing the global air quality. The authors perform a big data analysis with a data set that contains more than 1M (1 048 000) green space data and air quality data points by considering 190 countries during the years 1990 to 2015. Air quality is measured by considering the particulate matter (PM) value. The analysis is carried out using multivariate graphs and a k-means clustering algorithm. The relative geographical changes of the tree areas, as well as the level of the air quality, were identified and the results indicated encouraging news.
Space technology and geotechnology, such as geographic information systems, play a vital role in the day-to-day activities of a society. In the initial days, the data collection was very rudimentary and primitive. The quality of the data collected was a subject of verification and the accuracy of the data was also questionable. With the advent of newer technology, the problems have been overcome. Using modern sophisticated systems, space science has been changed drastically. Implementing cutting-edge spaceborne sensors has made it possible to capture real-time data from space. Chapter 9 focuses on these aspects in detail.
Transportation plays an important role in our overall economy, conveying products and people through progressively mind-boggling, interconnected, and multidimensional transportation frameworks. However, the complexities of present-day transportation can't be managed by previous systems. The utilization of IDA frameworks and strategies, with compelling information gathering and data dispersion frameworks, gives openings that are required to build the future intelligent transportation systems (ITSs). In Chapter 10, the authors exhibit the application of IDA in IoT-based ITS.
Chapter 11 aims to observe emerging patterns and trends by using big data analysis to enhance predictions of motor vehicle collisions using a data set consisting of 17 attributes and 998 193 collisions in New York City. The data is extracted from the New York City Police Department (NYPD). The data set has then been tested in three classification algorithms, which are k-nearest neighbor, random forest, and naive Bayes. The outputs are captured using k-fold cross-validation method. These outputs are used to identify and compare classifier accuracy, and random forest node accuracy and processing time. Further, an analysis of raw data is performed describing the four different vehicle groups in order to detect significance within the recorded period. Finally, extreme cases of collision severity are identified using outlier analysis. The analysis demonstrates that out of three classifiers, random forest gives the best results.
Neurological disorders are the diseases that are related to the brain, nervous system, and the spinal cord of the human body. These disorders may affect the walking, speaking, learning, and moving capacity of human beings. Some of the major human neurological disorders are stroke, brain tumors, epilepsy, meningitis, Alzheimer's, etc. Additionally, remarkable growth has been observed in the areas of disease diagnosis and health informatics. The critical human disorders related to lung, kidney, skin, and brain have been successfully diagnosed using different data mining and machine learning techniques. In Chapter 12, several neurological and psychological disorders are discussed. The role of different computing techniques in designing different biomedical applications is presented. In addition, the challenges and promising areas of innovation in designing a smart and intelligent neurological disorder diagnostic system using big data, internet of things, and emerging computing techniques are also highlighted.
Bug reports are one of the crucial software artifacts in open-source software. Issue tracking systems maintain enormous bug reports with several attributes, such as long description of bugs, threaded discussion comments, and bug meta-data, which includes BugID, priority, status, resolution, time, and others. In Chapter 13, bug reports of 20 open-source projects of the Apache Software Foundation are extracted using a tool named the Bug Report Collection System for trend analysis. As per the quantitative analysis of data, about 20% of open bugs are critical in nature, which directly impacts the functioning of the system. The presence of a large number of bugs of this kind can put systems into vulnerability positions and reduces the risk aversion capability. Thus, it is essential to resolve these issues on a high priority. The test lead can assign these issues to the most contributing developers of a project for quick closure of opened critical bugs. The comments are mined, which help us identify the developers resolving the majority of bugs, which is beneficial for test leads of distinct projects. As per the collated data, the areas more prone to system failures are determined such as input/output type error and logical code error.
Sentiments are the standard way by which people express their feelings. Sentiments are broadly classified as positive and negative. The problem occurs when the user expresses with words that are different than the actual feelings. This phenomenon is generally known to us as sarcasm, where people say something opposite the actual sentiments. Sarcasm detection is of great importance for the correct analysis of sentiments. Chapter 14 attempts to give an algorithm for successful detection of hyperbolic sarcasm and general sarcasm in a data set of sarcastic posts that are collected from pages dedicated for sarcasm on social media sites such as Facebook, Pinterest, and Instagram. This chapter also shows the initial results of the algorithm and its evaluation.
Predictive analytics refers to forecasting the future probabilities by extracting information from existing data sets and determining patterns from predicted outcomes. Predictive analytics also includes what-if scenarios and risk assessment. In Chapter 15, an effort has been made to use principles of predictive modeling to analyze the authentic social network data set, and results have been encouraging. The post-analysis of the results has been focused on exhibiting contact details, mobility pattern, and a number of degree of connections/minutes leading to identification of the linkage/bonding between the nodes in the social network.
Modern medicine has been confronted by a major challenge of achieving promise and capacity of tremendous expansion in medical data sets of all kinds. Medical databases develop huge bulk of knowledge and data, which mandates a specialized tool to store and perform analysis of data and as a result, effectively use saved knowledge and data. Information is extracted from data by using a domain's background knowledge in the process of IDA. Various matters dealt with regard use, definition, and impact of these processes and they are tested for their optimization in application domains of medicine. The primary focus of Chapter 16 is on the methods and tools of IDA, with an aim to minimize the growing differences between data comprehension and data gathering.
Snoozing, or sleeping, is a physical phenomenon of the human life. When human snooze is disturbed, it generates many problems, such as mental disease, heart disease, etc. Total snooze is characterized by two stages, viz., rapid eye movement and nonrapid eye movement. Bruxism is a type of snooze disorder. The traditional method of the prognosis takes time and the result is in analog form. Chapter 17 proposes a method for easy prognosis of snooze bruxism.
Neurodegenerative diseases like Alzheimer's and Parkinson's impair the cognitive and motor abilities of the patient, along with memory loss and confusion. As handwriting involves proper functioning of the brain and motor control, it is affected. Alteration in handwriting is one of the first signs of Alzheimer's disease. The handwriting gets shaky, due to loss of muscle control, confusion, and forgetfulness. The symptoms get progressively worse. It gets illegible and the phonological spelling mistakes become inevitable. In Chapter 18, the authors use a feature extraction technique to be used as a parameter for diagnosis. A variational autoencoder (VAE), a deep unsupervised learning technique, has been applied, which is used to compress the input data and then reconstruct it keeping the targeted output the same as the targeted input.
This edited volume on IDA gathers researchers, scientists, and practitioners interested in computational data analysis methods, aimed at narrowing the gap between extensive amounts of data stored in medical databases and the interpretable, understandable, and effective use of the stored data. The expected readers of this book are researchers, scientists, and practitioners interested in IDA, knowledge discovery, and decision support in databases, particularly those who are interested in using these technologies. This publication provides useful references for educational institutions, industry, academic researchers, professionals, developers, and practitioners to apply, evaluate, and reproduce the contributions to this book.
May 07, 2019
New Delhi, India
Deepak Gupta
Bengaluru, India
Siddhartha Bhattacharyya
New Delhi, India
Ashish Khanna
Uttar Pradesh, India
Kalpna Sagar
Sarthak Gupta, Siddhant Bagga, and Deepak Kumar Sharma
Division of Information Technology, Netaji Subhas University of Technology, New Delhi, India
In the midst of all of the societal challenges of today's world, digital transformation is rapidly becoming a necessity. The number of internet users is growing at an unprecedented rate. New devices, sensors, and technologies are emerging every day. These factors have led to an exponential increase in the volume of data being generated. According to recent research [1], users of the internet generate 2.5 quintillion bytes of data per day.
Data is only as good as what you make of it. The sheer amount of data being generated calls for methods to leverage its power. With the proper tools and methodologies, data analysis can improve decision making, lower the risks, and unearth hidden insights. Intelligent data analysis (IDA) is concerned with effective analysis of data [2, 3].
The process of IDA consists of three main steps (see Figure 1.1):
Data collection and preparation: This step involves acquiring data, and converting it into a format suitable for further analysis. This may involve storing the data as a table, taking care of empty or null values, etc.
Exploration: Before a thorough analysis can be performed on the data, certain characteristics are examined like number of data points, included variables, statistical features, etc. Data exploration allows analysts to get familiar with the dataset, and create prospective hypotheses. Visualization is extensively used in this step. Various visualization techniques will be discussed in depth later in this chapter.
Analysis: Various machine learning and deep learning algorithms are applied at this step. Data analysts build models that try to find the best possible fit to the data points. These models can be classified as white box or black box models.
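The three steps can be illustrated with a minimal Python sketch. The records, the column names `hours` and `score`, and the choice of a one-variable least-squares line for the analysis step are all assumptions made for illustration; they do not come from the chapter.

```python
from statistics import mean, stdev

# Hypothetical raw records; None marks a missing value.
raw = [
    {"hours": 1.0, "score": 52.0},
    {"hours": 2.0, "score": 54.0},
    {"hours": None, "score": 61.0},
    {"hours": 3.0, "score": 56.0},
    {"hours": 4.0, "score": None},
    {"hours": 5.0, "score": 60.0},
]

# Step 1, collection and preparation: keep only complete rows.
table = [row for row in raw if None not in row.values()]

# Step 2, exploration: number of data points and basic statistics per variable.
columns = {name: [row[name] for row in table] for name in ("hours", "score")}
summary = {
    name: {"count": len(vals), "mean": mean(vals), "stdev": stdev(vals)}
    for name, vals in columns.items()
}

# Step 3, analysis: fit score = w0 + w1 * hours by ordinary least squares
# (closed form for a single input variable).
xs, ts = columns["hours"], columns["score"]
x_bar, t_bar = mean(xs), mean(ts)
w1 = sum((x - x_bar) * (t - t_bar) for x, t in zip(xs, ts)) / sum(
    (x - x_bar) ** 2 for x in xs
)
w0 = t_bar - w1 * x_bar

print(summary)
print(f"score = {w0:.2f} + {w1:.2f} * hours")
```

In practice each step is far richer (for example, imputing rather than dropping missing values, or plotting during exploration), but the order of the steps stays the same.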
A more comprehensive introduction to data analysis can be found in prior pieces of literature [4–6].
Figure 1.1 Data analysis process.
IDA and machine learning can be applied to a multitude of products and services, since these models have the ability to make fast, data-driven decisions at scale. We're surrounded by live examples of machine learning in things we use in day-to-day life.
A primary example is web page ranking [7, 8]. Whenever we search for anything on a search engine, the results that we get are presented to us in the order of relevance. To achieve this, the search engine needs to “know” which pages are more relevant than others.
A related application is collaborative filtering [9, 10]. Collaborative filtering filters information based on recommendations of other people. It is based on the premise that people who agreed in their evaluation of certain items in the past are likely to agree again in the future.
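As a rough illustration of this premise, the sketch below predicts a rating for an item a user has not seen as a similarity-weighted average of other users' ratings. The ratings, the user and item names, and the use of cosine similarity over co-rated items are assumptions chosen for brevity; production recommenders use more elaborate schemes.

```python
from math import sqrt

# Hypothetical ratings: user -> item -> score on a 1-5 scale.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 5, "B": 5, "C": 1, "D": 5},
    "cat": {"A": 1, "B": 1, "C": 5, "D": 1},
}


def similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = sqrt(sum(ratings[u][i] ** 2 for i in shared))
    norm_v = sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    pairs = [
        (similarity(user, other), their[item])
        for other, their in ratings.items()
        if other != user and item in their
    ]
    total = sum(s for s, _ in pairs)
    return sum(s * r for s, r in pairs) / total if total else None


# ann agreed with bob in the past, so her predicted rating for item D
# is pulled toward bob's rating rather than cat's.
print(predict("ann", "D"))
```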
Another application is automatic translation of documents from one language to another. Manually doing this is an extremely arduous task and would take a significant amount of time.
IDA and machine learning models are also being used for many other tasks [11, 12] like object classification, named entity recognition, object localization, stock prices prediction, etc.
IDA aims to analyze the data to create predictive models. Suppose that we're given a dataset D(X,T), where X represents inputs and T represents target values (i.e., known correct values with respect to the input). The goal is to learn a function (or map) from inputs (X) to outputs (T). This is done by employing supervised machine learning algorithms [13]. A model refers to the artifact that is created by the training (or learning) process. Models are broadly categorized into two types:
White box models: The models whose predictions are easily explainable are called white box models. These models are extremely simple, and hence often not very effective: their accuracy is usually lower than that of black box models. Examples include simple decision trees, linear regression, and logistic regression.
Black box models: The models whose predictions are difficult to interpret or explain are called black box models. They are difficult to interpret because of their complexity. Since they are complex models, their accuracy is usually high. Examples include large decision trees, random forests, and neural networks.
So, IDA and machine learning models suffer from an accuracy-explainability trade-off. However, with advances in IDA, the explainability gap in black box models is shrinking.
If black box models have better accuracy, why not use them all the time? The problem is that a single metric, such as classification accuracy, is an incomplete description of most real-world tasks [14, 15]. Sometimes in low-risk environments, where decisions don't have severe consequences, it might be sufficient to just know that the model performed well on some test dataset without the need for an explanation. However, machine learning models are being extensively used in high-risk environments like health care, finance, data security, etc. where the impact of decisions is huge. Therefore, it's extremely important to bridge the explainability gap in black box models, so that they can be used with confidence in place of white box models to provide better accuracy.
Interpretability methods may be local or global. Global methods try to explain the model itself, thereby explaining all possible outcomes. Local methods, on the other hand, try to explain why a particular decision was made.
As artificial intelligence (AI)-assisted decision making is becoming commonplace, the ability to generate simple explanations for black box systems is going to be extremely important, and is already an area of active research.
White box models are extremely easy to interpret, since interpretability is inherent in their nature. Let's look at a few white box models and how to interpret them.
Linear regression [16, 17] attempts to model the relationship between input variables and output by fitting a linear equation to the observed data (see Figure 1.2). A linear regression equation is of the form:
$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p$$
where $y$ is the output variable, $x_1, x_2, \dots, x_p$ are the $p$ input variables, $w_1, w_2, \dots, w_p$ are the weights associated with the input variables, and $w_0$ makes sure that the regression line works even if the data is not centered around the origin (along the output dimension).
The weights are calculated using techniques like ordinary least squares and gradient descent. The details of these techniques are beyond the scope of this chapter; we will focus more on the interpretability of these models.
The interpretation of the weights of a linear model is straightforward: with all other features held fixed, an increase of one unit in the feature $x_j$ changes the output by $w_j$ (an increase if $w_j$ is positive, a decrease if it is negative).
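This reading of the weights can be checked numerically. In the sketch below the weights `w0` and `w` are hypothetical values rather than the result of any fit; each feature is raised by one unit in turn while the others are held fixed, and the change in the prediction is recorded.

```python
# Hypothetical fitted parameters for a model with p = 3 input variables.
w0 = 1.5
w = [2.0, -0.5, 4.0]


def predict(x):
    """Linear model: y = w0 + w1*x1 + ... + wp*xp."""
    return w0 + sum(wj * xj for wj, xj in zip(w, x))


x = [1.0, 2.0, 3.0]
base = predict(x)

# Raise one feature at a time by one unit, keeping the others fixed,
# and record how much the output changes.
effects = []
for j in range(len(w)):
    bumped = list(x)
    bumped[j] += 1.0
    effects.append(predict(bumped) - base)

print(effects)  # each entry matches the corresponding weight
```

Note that this holds only when the features can be varied independently; with strongly correlated inputs, "holding the others fixed" may not correspond to any realistic data point.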
Another metric for interpreting linear models is the $R^2$ measure [18]. The $R^2$ value tells us how much of the variance of the target outcomes is explained by the model. For a least-squares fit with an intercept term, $R^2$ ranges from 0 to 1 on the training data. The higher the $R^2$ value, the better the model explains the data. $R^2$ is calculated as:
$$R^2 = 1 - \frac{SS_r}{SS_t}$$
Figure 1.2 Linear regression.
where $SS_r$ is the sum of squared residuals, and $SS_t$ is the total sum of squares (proportional to the variance of the data).
The residual $e_i$ is defined as:
$$e_i = y_i - t_i$$
where $y_i$ is the model's predicted output, and $t_i$ is the target value in the dataset.
Hence, $SS_r$ is calculated as:
$$SS_r = \sum_{i=1}^{n} e_i^2$$
And $SS_t$ is calculated as:
$$SS_t = \sum_{i=1}^{n} (t_i - \bar{t})^2$$
where $\bar{t}$ is the mean of all target values.
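A small sketch of the calculation, using the standard definition $R^2 = 1 - SS_r / SS_t$. The function name `r_squared` and the example numbers are illustrative assumptions.

```python
def r_squared(targets, predictions):
    """R^2 = 1 - SS_r / SS_t for paired target and predicted values."""
    t_bar = sum(targets) / len(targets)
    # Sum of squared residuals between predictions and targets.
    ss_r = sum((y - t) ** 2 for y, t in zip(predictions, targets))
    # Total sum of squares of the targets around their mean.
    ss_t = sum((t - t_bar) ** 2 for t in targets)
    return 1.0 - ss_r / ss_t


targets = [3.0, 5.0, 7.0, 9.0]
predictions = [2.5, 5.5, 6.5, 9.5]
print(r_squared(targets, predictions))
```

Two boundary cases help with interpretation: a model that reproduces every target exactly gets $R^2 = 1$, and a model that always predicts the mean of the targets gets $R^2 = 0$.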
However, the $R^2$ value has a drawback: it never decreases as features are added, even if they carry no information about the target values. Hence, the adjusted $R^2$ value ($\bar{R}^2$) is used, which takes into account the number of input features:
$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$
where $\bar{R}^2$ is the adjusted $R^2$ value, $n$ is the number of data points, and $p$ is the number of input features (or input variables).
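The effect of the correction can be seen in a short sketch based on the standard adjusted $R^2$ formula, $1 - (1 - R^2)(n - 1)/(n - p - 1)$. The function name and the sample values of $R^2$, $n$, and $p$ are assumptions for illustration.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n data points and p input features."""
    if n - p - 1 <= 0:
        raise ValueError("need more data points than features plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Same raw R^2 and sample size, but a growing number of features:
# the adjusted value drops as features are added without improving the fit.
for p in (1, 5, 10):
    print(p, adjusted_r_squared(0.95, n=21, p=p))
```

Because the penalty grows with $p$, a feature is worth adding only if it raises $R^2$ by more than the correction takes away.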
Decision trees [19] are classifiers – they classify a given data point by posing a series of questions about the features associated with the data item (see Figure 1.3).
Unlike linear regression, decision trees are able to model nonlinear data. In a decision tree, internal nodes represent features, each edge or link represents a decision, and leaf nodes represent outcomes.
The general algorithm for decision trees is given below:
1. Pick the best attribute/feature. The best feature is the one that separates the data in the best possible way. The optimal split is one after which all data points belonging to different classes are in separate subsets.
2. For each value of the attribute, create a new child node of the current node.
3. Divide the data into the new child nodes.
4. For each new child node:
   a. If all the data points in that node belong to the same class, then stop.
   b. Else, go to step 1 and repeat the process with the current node as the decision node.
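The steps above can be sketched as a short recursive procedure. The text does not say how the "best" attribute is measured, so this sketch assumes an entropy-based criterion (the attribute whose split leaves the lowest weighted entropy). The toy data, the attribute names, and the fallback to a majority label when no attributes remain are likewise assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels (0 for a pure node)."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def split(rows, attr):
    """Steps 2 and 3: one group of rows per value of the attribute."""
    groups = {}
    for features, label in rows:
        groups.setdefault(features[attr], []).append((features, label))
    return groups


def best_attribute(rows, attrs):
    """Step 1: attribute whose split leaves the lowest weighted entropy."""
    def remainder(attr):
        return sum(
            len(group) / len(rows) * entropy([label for _, label in group])
            for group in split(rows, attr).values()
        )
    return min(attrs, key=remainder)


def build(rows, attrs):
    labels = [label for _, label in rows]
    # Step 4a: stop when the node is pure (or no attribute is left to test).
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attrs)
    rest = [a for a in attrs if a != attr]
    # Step 4b: repeat the process on each child node.
    children = {v: build(g, rest) for v, g in split(rows, attr).items()}
    return {"attr": attr, "children": children}


def classify(tree, features):
    """Follow edges until a leaf is reached; unseen values raise KeyError."""
    while isinstance(tree, dict):
        tree = tree["children"][features[tree["attr"]]]
    return tree


# Hypothetical training data: (features, class label).
data = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]
tree = build(data, ["outlook", "windy"])
print(tree)
print(classify(tree, {"outlook": "rainy", "windy": "yes"}))
```

The resulting tree is itself the explanation: each prediction can be read off as the sequence of attribute tests on the path from the root to a leaf.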
Figure 1.3 Decision tree.
Figure 1.4
