Machine learning techniques are increasingly being used to address problems in computational biology and bioinformatics. Novel machine learning techniques for analyzing high-throughput data in the form of sequences, gene and protein expressions, pathways, and images are becoming vital for understanding diseases and for future drug discovery. Machine learning techniques such as Markov models, support vector machines, neural networks, and graphical models have been successful in analyzing life science data because of their ability to handle the randomness and uncertainty of data noise and to generalize. Machine Learning in Bioinformatics compiles recent approaches in machine learning methods and their applications to contemporary problems in bioinformatics, such as classification and prediction of disease, feature selection, dimensionality reduction, gene selection, classification of microarray data, and many more.
Page count: 668
Publication year: 2021
Cover
Title Page
Copyright
Preface
Acknowledgement
Part 1: THE COMMENCEMENT OF MACHINE LEARNING SOLICITATION TO BIOINFORMATICS
1 Introduction to Supervised Learning
1.1 Introduction
1.2 Learning Process & its Methodologies
1.3 Classification and its Types
1.4 Regression
1.5 Random Forest
1.6 K-Nearest Neighbor
1.7 Decision Trees
1.8 Support Vector Machines
1.9 Neural Networks
1.10 Comparison of Numerical Interpretation
1.11 Conclusion & Future Scope
References
2 Introduction to Unsupervised Learning in Bioinformatics
2.1 Introduction
2.2 Clustering in Unsupervised Learning
2.3 Clustering in Bioinformatics—Genetic Data
2.4 Conclusion
References
3 A Critical Review on the Application of Artificial Neural Network in Bioinformatics
3.1 Introduction
3.2 Biological Datasets
3.3 Building Computational Model
3.4 Literature Review
3.5 Critical Analysis
3.6 Conclusion
References
Part 2: MACHINE LEARNING AND GENOMIC TECHNOLOGY, FEATURE SELECTION AND DIMENSIONALITY REDUCTION
4 Dimensionality Reduction Techniques: Principles, Benefits, and Limitations
4.1 Introduction
4.2 The Benefits and Limitations of Dimension Reduction Methods
4.3 Components of Dimension Reduction
4.4 Methods of Dimensionality Reduction
4.5 Conclusion
References
5 Plant Disease Detection Using Machine Learning Tools With an Overview on Dimensionality Reduction
5.1 Introduction
5.2 Flowchart
5.3 Machine Learning (ML) in Rapid Stress Phenotyping
5.4 Dimensionality Reduction
5.5 Literature Survey
5.6 Types of Plant Stress
5.7 Implementation I: Numerical Dataset
5.8 Implementation II: Image Dataset
5.9 Conclusion
References
6 Gene Selection Using Integrative Analysis of Multi-Level Omics Data: A Systematic Review
6.1 Introduction
6.2 Approaches for Gene Selection
6.3 Multi-Level Omics Data Integration
6.4 Machine Learning Approaches for Multi-Level Data Integration
6.5 Critical Observation
6.6 Conclusion
References
7 Random Forest Algorithm in Imbalance Genomics Classification
7.1 Introduction
7.2 Methodological Issues
7.3 Biological Terminologies
7.4 Proposed Model
7.5 Experimental Analysis
7.6 Current and Future Scope of ML in Genomics
7.7 Conclusion
References
8 Feature Selection and Random Forest Classification for Breast Cancer Disease
8.1 Introduction
8.2 Literature Survey
8.3 Machine Learning
8.4 Feature Engineering
8.5 Methodology
8.6 Result Analysis
8.7 Conclusion
References
9 A Comprehensive Study on the Application of Grey Wolf Optimization for Microarray Data
9.1 Introduction
9.2 Microarray Data
9.3 Grey Wolf Optimization (GWO) Algorithm
9.4 Studies on GWO Variants
9.5 Application of GWO in Medical Domain
9.6 Application of GWO in Microarray Data
9.7 Conclusion and Future Work
References
10 The Cluster Analysis and Feature Selection: Perspective of Machine Learning and Image Processing
10.1 Introduction
10.2 Various Image Segmentation Techniques
10.3 How to Deal With Image Dataset
10.4 Class Imbalance Problem
10.5 Optimization of Hyperparameter
10.6 Case Study
10.7 Using AI to Detect Coronavirus
10.8 Using Artificial Intelligence (AI), CT Scan and X-Ray
10.9 Conclusion
References
Part 3: MACHINE LEARNING AND HEALTHCARE APPLICATIONS
11 Artificial Intelligence and Machine Learning for Healthcare Solutions
11.1 Introduction
11.2 Using Machine Learning Approaches for Different Purposes
11.3 Various Resources of Medical Data Set for Research
11.4 Deep Learning in Healthcare
11.5 Various Projects in Medical Imaging and Diagnostics
11.6 Conclusion
References
12 Forecasting of Novel Corona Virus Disease (Covid-19) Using LSTM and XG Boosting Algorithms
12.1 Introduction
12.2 Machine Learning Algorithms for Forecasting
12.3 Proposed Method
12.4 Implementation
12.5 Results and Discussion
12.6 Conclusion and Future Work
References
13 An Innovative Machine Learning Approach to Diagnose Cancer at Early Stage
13.1 Introduction
13.2 Related Work
13.3 Materials and Methods
13.4 System Design
13.5 Results and Discussion
13.6 Conclusion
References
14 A Study of Human Sleep Staging Behavior Based on Polysomnography Using Machine Learning Techniques
14.1 Introduction
14.2 Polysomnography Signal Analysis
14.3 Case Study on Automated Sleep Stage Scoring
14.4 Summary and Conclusion
References
15 Detection of Schizophrenia Using EEG Signals
15.1 Introduction
15.2 Methodology
15.3 Literature Review
15.4 Discussion
15.5 Conclusion
References
16 Performance Analysis of Signal Processing Techniques in Bioinformatics for Medical Applications Using Machine Learning Concepts
16.1 Introduction
16.2 Basic Definition of Anatomy and Cell at Micro Level
16.3 Signal Processing—Genome Signal Processing
16.4 Hotspots Identification Algorithm
16.5 Results—Experimental Investigations
16.6 Analysis Using Machine Learning Metrics
16.7 Conclusion
Appendix
A.1 Hotspot Identification Code
A.2 Performance Metrics Code
References
17 Survey of Various Statistical Numerical and Machine Learning Ontological Models on Infectious Disease Ontology
17.1 Introduction
17.2 Disease Ontology
17.3 Infectious Disease Ontology
17.4 Biomedical Ontologies on IDO
17.5 Various Methods on IDO
17.6 Machine Learning-Based Ontology for IDO
17.7 Recommendation or Suggestions for Future Study
17.8 Conclusions
References
18 An Efficient Model for Predicting Liver Disease Using Machine Learning
18.1 Introduction
18.2 Related Works
18.3 Proposed Model
18.4 Results and Analysis
18.5 Conclusion
References
Part 4: BIOINFORMATICS AND MARKET ANALYSIS
19 A Novel Approach for Prediction of Stock Market Behavior Using Bioinformatics Techniques
19.1 Introduction
19.2 Literature Review
19.3 Proposed Work
19.4 Experimental Study
19.5 Conclusion and Future Work
References
20 Stock Market Price Behavior Prediction Using Markov Models: A Bioinformatics Approach
20.1 Introduction
20.2 Literature Survey
20.3 Proposed Work
20.4 Experimental Work
20.5 Conclusions and Future Work
References
Index
End User License Agreement
Chapter 1
Figure 1.1 Traditional learning.
Figure 1.2 Machine learning.
Figure 1.3 Learning behavior of a machine.
Figure 1.4 Block diagram of supervised learning.
Figure 1.5 Block diagram of unsupervised learning.
Figure 1.6 Block diagram of reinforcement learning.
Figure 1.7 Concept of classification.
Figure 1.8 Classification based on gender.
Figure 1.9 Regression.
Figure 1.10 Cholesterol line fit plot.
Figure 1.11 ROC curve for logistic regression.
Figure 1.12 Random forest.
Figure 1.13 ROC curve for random forest.
Figure 1.14 ROC curve for k-nearest neighbor.
Figure 1.15 Decision tree.
Figure 1.16 Support vector machine.
Figure 1.17 Neural network (general).
Figure 1.18 Neural network (detailed).
Chapter 2
Figure 2.1 Machine learning in bioinformatics.
Figure 2.2 Example matrix of gene expression (10 genes in a row and 2 samples in...
Figure 2.3 Partition clustering algorithms.
Figure 2.4 (a) Agglomerative clustering, (b) divisive clustering.
Figure 2.5 Self Organizing Map (SOM).
Chapter 3
Figure 3.1 Areas of research of bioinformatics [4].
Figure 3.2 Simple network architecture of ANN with four input unit [17].
Figure 3.3 Single layer perceptron (left) and multilayer perceptron with one hid...
Chapter 4
Figure 4.1 The Steps of LDA reduction technique.
Figure 4.2 The steps of backward feature elimination technique.
Figure 4.3 The steps of backward feature elimination technique.
Figure 4.4 Low variance ratio feature reduction techniques.
Figure 4.5 The steps of the random forest algorithm.
Chapter 5
Figure 5.1 Flowchart to depict the structure of the book chapter.
Figure 5.2 Depicts the different steps in which PC1 is created, by considering a...
Figure 5.3 Shows the screen plot which depicts the variance of each of the princ...
Figure 5.4 Shows the steps followed in extraction of the features by ORB in case...
Figure 5.5 Shows the steps followed in extraction of the features by ORB in case...
Figure 5.6 Shows the steps followed in extraction of the features by ORB in case...
Figure 5.7 Color histogram comparison between three pairs of images: images of b...
Chapter 6
Figure 6.1 Multiple levels of omics data in biological system, from genome, epig...
Figure 6.2 (a) Hypothesis 1: Molecular variations propagates linearly in a hiera...
Figure 6.3 Different integration pipelines of multi-level omics data. (a) Horizo...
Figure 6.4 Methods for parallel integration. (a) Concatenation-based, (b) transf...
Chapter 7
Figure 7.1 Knowledge discovery process in genomic data.
Figure 7.2 Decision tree classifier on a dataset having two features (X1 and X2)...
Figure 7.3 Ensemble learning model to increase the accuracy of classification mo...
Figure 7.4 Bootstrap aggregation (bagging) technique.
Figure 7.5 Implementing RF classifier in a dataset having four features (X1…X4) ...
Figure 7.6 Genes, proteins and molecular machines.
Chapter 8
Figure 8.1 Ensemble learning.
Figure 8.2 Output label.
Figure 8.3 Heat map.
Figure 8.4 Correlation matrix.
Figure 8.5 Confusion matrix.
Figure 8.6 Confusion matrix.
Figure 8.7 Comparison of different methods.
Chapter 9
Figure 9.1 Schematic representation of DNA microarray technology [12].
Figure 9.2 Model for microarray data analysis [13].
Figure 9.3 Social hierarchy of grey wolves.
Figure 9.4 Flow chart of GWO.
Chapter 10
Figure 10.1 Flowchart for K-mean clustering.
Figure 10.2 Linear classifier.
Figure 10.3 Hyper plane classifier.
Figure 10.4 Optimal line.
Figure 10.5 Working process of SVM.
Figure 10.6 Hybrid algorithm.
Figure 10.7 C—means clustering algorithm.
Figure 10.8 Model of data prediction.
Figure 10.9 The flow-chart signifies different systems, approaches and investiga...
Chapter 12
Figure 12.1 Total Confirmed Cases across the world on May, 2020. (Source : https...
Figure 12.2 Confirmed Covid-19 cases in Tamilnadu—Highly affected cities. (Sourc...
Figure 12.3 Proposed model for forecasting covid-19.
Figure 12.4 Working of LSTM.
Figure 12.5 Working of the gradient boosting algorithm.
Figure 12.6 Output of linear regression for linear data.
Figure 12.7 Working of polynomial regression with the non-linear data.
Figure 12.8 Actual and predicted values of Confirmed cases using polynomial Regr...
Figure 12.9 Actual and predicted values of fatalities using polynomial regressio...
Chapter 13
Figure 13.1 Block diagram.
Figure 13.2 Input image.
Figure 13.3 Gray image.
Figure 13.4 Filtered image.
Figure 13.5 SNR comparison.
Figure 13.6 Identification of cancer.
Chapter 14
Figure 14.1 C3-A2 channel of EEG signal for different stages behavior of sleep: ...
Figure 14.2 60 s epoch original EEG signal behavior from Healthy Subject.
Figure 14.3 Different brain behaviors of healthy subject during different sleep ...
Figure 14.4 EEG signal behavior from sleep disorder subject with duration of 60 ...
Figure 14.5 Different sleep stage characteristics from subject affected with sle...
Figure 14.6 (a) 10–20 Electrode placement system, (b) Notations for placement of...
Figure 14.7 Left frontal region recording.
Figure 14.8 Left central region recording.
Figure 14.9 Right frontal recording.
Figure 14.10 Right central recording.
Figure 14.11 Left occipital recording.
Figure 14.12 Right occipital region recording.
Figure 14.13 Right outer canthus region recording.
Figure 14.14 Left outer canthus region recording (typical activity of the EOG (L...
Figure 14.15 Chin EMG region recording (typical activity of the EMG (EMG-X1)).
Figure 14.16 Limb EMG (left leg) region recording (typical activity of the EMG (...
Figure 14.17 Limb EMG (Right Leg) Region Recording (Typical activity of the EMG ...
Figure 14.18 Sleep stages behavior of the subjects (a) Subject-16, (b) Subject-1...
Figure 14.19 Framework of the proposed research work.
Figure 14.20 Performance graphs of the proposed SleepEEG test model with the C4-...
Chapter 15
Figure 15.1 Electrode placement method.
Figure 15.2 EEG delta, theta, alpha, beta and gamma [7].
Figure 15.3 Steps involved in processing of EEG signal.
Chapter 16
Figure 16.1 Bioinformatics suit.
Figure 16.2 Amount of data stored by EBI over the years [3].
Figure 16.3 Depicts prokaryotic & eukaryotic cells.
Figure 16.4 Eukaryotic nuclear DNA within the chromosomes.
Figure 16.5 Depicts RNA transcription process shows a strand RNA given from a do...
Figure 16.6 Depicts transcription and translation describing the process of conv...
Figure 16.7 Depicts a protein handshake.
Figure 16.8 Depicts Protein–Protein interaction render to hotspot.
Figure 16.9 Complex plane of unit circle taken for 20 complex numbers in CPNR.
Figure 16.10 Mapping of amino acids using CPNR.
Figure 16.11 Mapping of amino acids using EIIP.
Figure 16.12 Depicts Flow of EIIP, CPNR mapping with DWT for hotspots identifica...
Figure 16.13 CPNR results.
Figure 16.14 EIIP results.
Figure 16.15 Confusion matrix.
Figure 16.16 Precision calcuation details.
Figure 16.17 Recall calculation details.
Figure 16.18 Overall performance analysis visualization.
Chapter 18
Figure 18.1 Overview of experimental methodology.
Figure 18.2 Correlation heatmap.
Figure 18.3 Encoding process.
Chapter 19
Figure 19.1 Framework of the book chapter.
Figure 19.2 Actual vs predicted behavior BSE February, 2001.
Figure 19.3 Actual vs predicted behavior BSE February, 2004.
Figure 19.4 Actual vs predicted behavior BSE February, 2005.
Figure 19.5 Actual vs predicted behavior BSE February, 2008.
Figure 19.6 Actual vs predicted behavior BSE February, 2010.
Figure 19.7 Actual vs predicted behavior BSE March, 2001.
Figure 19.8 Actual vs predicted behavior BSE March, 2002.
Figure 19.9 Actual vs predicted behavior BSE March, 2003.
Figure 19.10 Actual vs predicted behavior BSE March, 2006.
Figure 19.11 Actual vs predicted behavior BSE March, 2008.
Figure 19.12 Actual vs predicted behavior BSE March, 2010.
Figure 19.13 Actual vs predicted behavior BSE April, 2001.
Figure 19.14 Actual vs predicted behavior BSE April, 2003.
Figure 19.15 Actual vs predicted behavior BSE April, 2004.
Figure 19.16 Actual vs predicted behavior BSE April, 2007.
Figure 19.17 Actual vs predicted behavior BSE April, 2008.
Figure 19.18 Actual vs predicted behavior BSE May, 2001.
Figure 19.19 Actual vs predicted behavior BSE May, 2002.
Figure 19.20 Actual vs predicted behavior BSE October, 2002.
Figure 19.21 Actual vs predicted behavior BSE October, 2003.
Chapter 20
Figure 20.1 Framework for prediction using Markov model.
Figure 20.2 Example for zero order Markov model of BSE (2001).
Figure 20.3 Example for first order Markov model of BSE (2001).
Figure 20.4 Behavior of BSE, January 2005 (actual vs predicted).
Figure 20.5 Behavior of BSE, February 2005 (actual vs predicted).
Figure 20.6 Behavior of BSE, March 2005 (actual vs predicted).
Figure 20.7 Behavior of BSE, March 2010 (actual vs predicted).
Figure 20.8 First two days prediction accuracy percentage of BSE (November).
Figure 20.9 First four days prediction accuracy percentage of BSE (November).
Figure 20.10 First six days prediction accuracy percentage of BSE (November).
Chapter 1
Table 1.1 Regression statistics.
Table 1.2 AUC: Logistic regression.
Table 1.3 Difference between linear & logistic regression.
Table 1.4 AUC: Random forest.
Table 1.5 AUC: K-nearest neighbor.
Table 1.6 AUC: Decision trees.
Table 1.7 AUC: Support vector machines.
Table 1.8 AUC: Neural network.
Table 1.9 AUC: Comparison of numerical interpretations.
Chapter 2
Table 2.1 Gene expression data matrix representation.
Chapter 3
Table 3.1 Shows published articles of ANN used for biological data.
Chapter 5
Table 5.1 Highlights the important work of literature published in the last two ...
Table 5.2 Depicts the features influencing each of the 9 principal components.
Chapter 6
Table 6.1 Published studies of multi-level omics data integration using unsuperv...
Table 6.2 Published studies of multi-level omics data integration using supervis...
Chapter 7
Table 7.1 The classification accuracy, recall, precision and F-score of ensemble...
Table 7.2 The classification accuracy, recall, precision and F-score of ensemble...
Table 7.3 The classification accuracy, recall, precision and F-score of ensemble...
Chapter 8
Table 8.1 Dataset Attributes.
Chapter 9
Table 9.1 Literature Review for Hybridization of GWO Algorithm.
Table 9.2 Literature Review for GWO Extension Algorithms.
Table 9.3 Literature Review for GWO Modification Algorithm.
Table 9.4 Literature Review for GWO in Medical Applications.
Table 9.5 Literature Review for GWO in Medical Application for Microarray Datase...
Chapter 12
Table 12.1 Sample dataset.
Table 12.2 Performance analysis of the proposed algorithms.
Chapter 13
Table 13.1 SNR Comparison of different images.
Chapter 14
Table 14.1 Composed channels of polysomnography.
Table 14.2 Details of enrolled subjects in this proposed work.
Table 14.3 Description of epochs of various sleep stages from four subjects used...
Table 14.4 Confusion matrix performance achieved by the SleepEEG study using C4-...
Table 14.5 Performances Statistics of Subjects with input of C4-A1 channel sleep...
Chapter 15
Table 15.1 Different part of cerebrum and its operations.
Table 15.2 Literature Review on the Analysis of EEG Signal for Detection of Schi...
Chapter 16
Table 16.1 Genetic code describing 64 possible codons and the corresponding amin...
Table 16.2 Amino acids listed in table with 3-letter and 1-letter codes.
Table 16.3 Depicts EIIP mapping for amino acids.
Table 16.4 Number of codons for 20 amino acids.
Table 16.5 Different pattern among each of the non-composable number and its bes...
Table 16.6 Complex Prime Numerical Representation (CPNR).
Table 16.7 Dataset of protein sequences used for experiment.
Table 16.8 Details of identified hotspots using CPNR.
Table 16.9 Details of identified hotspots using EIIP.
Table 16.10 Analysis of experimental results.
Chapter 18
Table 18.1 Data types of features.
Table 18.2 Comparison of Performance Metrics on various Machine Learning Models.
Table 18.3 Accuracy on application of 10-fold cross validation of logistic regre...
Chapter 19
Table 19.1 Hamming distance result year wise.
Chapter 20
Table 20.1 Example for second order Markov model of BSE (2001).
Table 20.2 Hamming distance of BSE (2002 to 2014) using zero order Markov model.
Table 20.3 Hamming distance of BSE (2002 to 2014) using first order Markov model...
Table 20.4 Hamming distance of BSE (2002 to 2014) using Second order Markov mode...
Scrivener Publishing
100 Cummings Center, Suite 541J
Beverly, MA 01915-6106
Publishers at Scrivener
Martin Scrivener ([email protected])
Phillip Carmical ([email protected])
Edited by
Rabinarayan Satpathy
Tanupriya Choudhury
Suneeta Satpathy
Sachi Nandan Mohanty
and
Xiaobo Zhang
This edition first published 2021 by John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA and Scrivener Publishing LLC, 100 Cummings Center, Suite 541J, Beverly, MA 01915, USA
© 2021 Scrivener Publishing LLC
For more information about Scrivener publications please visit www.scrivenerpublishing.com.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
Wiley Global Headquarters
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials, or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read.
Library of Congress Cataloging-in-Publication Data
ISBN 978-1-119-78553-8
Cover image: Pixabay.Com
Cover design by Russell Richardson
Set in size of 11pt and Minion Pro by Manila Typesetting Company, Makati, Philippines
Printed in the USA
10 9 8 7 6 5 4 3 2 1
Machine learning has become increasingly popular in recent decades due to its well-defined algorithms and techniques that enable computers to learn and solve real-life problems which are difficult, time-consuming, and tedious to solve traditionally. Regarded as a subdomain of artificial intelligence, it has a gamut of applications in the fields of healthcare, medical diagnosis, bioinformatics, natural language processing, stock market analysis and many more. Recently, there has been an explosion of heterogeneous biological data requiring analysis, retrieval of useful patterns, management and proper storage. Moreover, there is the additional challenge of developing automated tools and techniques that can deal with these different kinds of outsized data in order to translate and transform computational modelling of biological systems and their correlated disciplinary data for further classification, clustering, prediction and decision-making.
Machine learning has proven its potential through its application in extracting relevant information in various biological domains such as bioinformatics, and it has been successful in finding efficient solutions for complex biomedical problems. Before the application of machine learning, traditional mathematical and statistical models were used, along with expert domain intelligence, to carry out investigations and experiments manually, using instruments, hands, eyes, etc. But such conventional methods alone are not enough to deal with large volumes of different types of biological data. Hence, the application of machine learning techniques has become the need of the hour in research, in order to find solutions to complex bioinformatics problems spanning both computer science and biology. With this in mind, this book has been designed with a number of chapters from eminent researchers who relate and explain machine learning techniques and their application to various bioinformatics problems such as classification and prediction of disease, feature selection, dimensionality reduction, gene selection, etc. Since the chapters are based on progressive collaborative research work on a broad range of topics and implementations, the book will be of interest to both students and researchers from the computer science and biological domains.
This edited book is organized into four parts. The first part introduces the applications of machine learning techniques in bioinformatics. The chapters in the second part cover machine learning applications for dimensionality reduction, feature and gene selection, plant disease analysis and prediction, and cluster analysis. The third part brings together a variety of machine learning research applications in the healthcare domain. The book concludes with machine learning applications for stock market behavior analysis and prediction.
The Editors
December 2020
The editors would like to acknowledge and congratulate all the people who extended their assistance for this book. Our sincere thanks go to each of the chapter authors for their contributions; without their support this book would not have become a reality. Our heartfelt gratitude also goes to the subject matter experts who found the time to review the chapters and return them in time to improve the quality, prominence, and uniform arrangement of the chapters in the book. Finally, many thanks to all the team members of Scrivener Publishing for their dedicated support and help in publishing this edited book.
Rajat Verma, Vishal Nagar and Satyasundara Mahapatra*
PSIT, Kanpur, Uttar Pradesh, India
Abstract
Artificial Intelligence (AI) has enhanced its importance through machines in the present business scenario. AI refers to the intelligence exhibited by machines, in contrast to the natural intelligence of living beings. Today, AI is popular largely because of its Machine Learning (ML) techniques. In ML, the performance of a machine depends upon its learning, so improvement in a machine's performance is always proportional to its learning behavior. These learning behaviors are derived from knowledge of the intelligence of living beings. This chapter presents an introduction to AI through a detailed overview of ML. In the journey of ML's success, data is the only requirement. ML is executed through diverse learning approaches, known as supervised, unsupervised, and reinforcement learning, all of which operate on data as their quintessential element. In supervised learning, attempts are made to find the relationship between the independent variables (the input attributes) and the dependent variables (the target attributes). Unsupervised learning works contrary to the supervised approach: the former deals with the formation of groups or clusters, whereas the latter deals with the relationship between the input and target attributes. The third approach, reinforcement learning, works through feedback or reward. This chapter focuses on the importance of ML and its learning techniques in day-to-day life with the help of a case study on a heart disease dataset. The numerical interpretation of the learning techniques is explained with the help of graphs and tabular data for easy understanding.
Keywords: Artificial intelligence, machine learning, supervised, unsupervised, reinforcement, knowledge, intelligence
In today's world, businesses are moving towards the implementation of automatic intelligence for decision making. This is possible with the help of a well-known technique called Artificial Intelligence (AI). AI also plays a vital role in research, where decisions must be taken instantly. AI is divided into sub-domains such as Machine Learning (ML) and Artificial Neural Networks (ANN) [1]. ML, also termed augmented analytics [2], concerns improving a machine's performance through the experience the machine has previously acquired; traditional learning (i.e. the kind of instruction used since the mid-1800s) does not work as efficiently as ML [3]. In traditional learning, the user supplies data and a program as inputs and obtains the output or results, whereas in ML the user provides the data and the desired output as inputs and the program or rules are produced as the output. This means the data is more important than the program, because the business world depends on the accuracy of the program used for decision making. The block diagram of traditional learning is shown below in Figure 1.1 for easy understanding.
Traditional learning is a manual process, whereas ML is automated. ML has increased the accuracy of analytics across diverse domains, including the preparation of data (raw facts and figures), automatic outlier detection, Natural Language Interfaces (NLI), recommendations, etc. [4]. In these domains, the bias involved in taking decisions on a business problem is decreased.
Figure 1.1 Traditional learning.
Figure 1.2 Machine learning.
ML is a sub-group of AI and its primary work is allowing systems to learn automatically with the help of data or observations obtained from the environment through different devices [5]. The block-diagram of ML is shown below in Figure 1.2.
ML-based algorithms perform predictions as well as decisions by using mathematical models built from training data [6–8]. A few popular applications of machine learning are e-mail filtering [9], medical diagnosis [10], classification [11], extraction [12], etc. ML works to increase the accuracy of computer programs by accessing data from the surroundings, learning from that data automatically, and enhancing the capacity for decision making. The main objective of ML is to minimize human intervention and assistance while performing any task. The next section of this chapter highlights the process of learning along with its different methodologies.
In AI, learning means the process of training a machine so that the machine can take decisions instantly; the performance of that machine is thereby upgraded in accuracy. When a machine acts in its working environment, it may meet success or failure, and from these successes or failures the machine gains experience. This newly gained experience improves the machine's actions and forms an optimal policy for the working environment. This is known as learning from experience, and it makes learning possible in an unknown working environment. A general block diagram of such a learning architecture is presented below in Figure 1.3, which depicts the mechanism by which a machine learns a new experience. The sequence of learning behavior is given stepwise below.
Step 1. The IoT-based sensors receive input from the environment.
Step 2. The sensor sends these inputs to the critic for performance evaluation against previously stored performance standards. Simultaneously, the sensor sends the same input to the performance element to check its effectiveness; if it is found acceptable, the result is immediately returned to the environment through the effectors.
Step 3. The critic provides feedback to the learning element; any new feedback is used to update the performance element. The updated knowledge comes back to the learning element and is sent to the problem generator as a learning goal, to be evaluated through experiments. The updates are sent to the performance element for future reference.
Figure 1.3 Learning behavior of a machine.
The learning process of ML is done in three different ways. These are supervised learning, unsupervised learning, and reinforcement learning. These three learning types have their importance in the different fields of bioinformatics research. Hence, they are explained with suitable examples in the next sections.
This is the most common learning mechanism in ML and the one used by most newcomers to research in their respective fields. This mechanism trains the machine using a labeled dataset in the form of input–output pairs, as depicted in Refs. [13–15]. These datasets are available in continuous or discrete form, but the important point is that supervision with an appropriate training model is needed. As supervised learning predicts accurate results [16], it is mostly used for regression analysis and classification. Figure 1.4 shows the execution model of supervised learning.
The figure shows that in supervised learning, a given set of input attributes (i.e. A1, A2, A3, A4 … Ak) along with their output attributes (i.e. B1, B2, B3, B4 … Bk) is kept in a knowledge dataset. The learning algorithm takes an input Ai, executes its model, and produces the result Bi as the desired output. Supervised learning is important in the field of bioinformatics; in the heart disease scenario, the inputs can be symptoms of heart disease such as high cholesterol, chest pain, and blood pressure, and the output is whether a person is suffering from heart disease or not. All these inputs are passed to the learning algorithm, where the model gets trained; if a new input is then passed through the model, the machine gives an expected output. If the accuracy of the expected output is not up to the mark, the model needs modification or upgrading.
Figure 1.4 Block diagram of supervised learning.
An example of supervised learning could be a person who feels he has a high cholesterol level and chest pain and goes to the doctor for a check-up. The doctor feeds the inputs given by the patient to the machine, which predicts and tells the doctor that the patient is suffering from a cardiac issue. This is an analogy to supervised learning: the inputs given by the patient are the independent variables, and the corresponding output from the machine acts as the dependent attribute. The machine acts as a model that predicts and gives a relevant output because it has been trained on similar inputs. Supervised learning is itself a huge subfield of ML and supports a variety of techniques used in research work, including regression analysis, Artificial Neural Networks (ANN), and Support Vector Machines (SVM).
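As a rough illustration of this flow, the sketch below trains a classifier on a handful of made-up symptom records and predicts the label for a new input. The feature values and the use of scikit-learn's LogisticRegression are illustrative assumptions, not the chapter's actual code.

```python
# Minimal supervised-learning sketch: inputs A1..Ak with known outputs B1..Bk,
# then a prediction Bi for a new input Ai. All values are hypothetical.
from sklearn.linear_model import LogisticRegression

X = [[240, 1, 140],   # cholesterol, chest pain (0/1), systolic blood pressure
     [180, 0, 120],
     [260, 1, 150],
     [170, 0, 115]]
y = [1, 0, 1, 0]      # 1 = heart disease, 0 = healthy (training labels)

model = LogisticRegression().fit(X, y)   # the learning algorithm builds the model
print(model.predict([[250, 1, 145]]))    # predicted output for a new patient
```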
In unsupervised learning, the user does not have to supervise the model. Instead, the model works on its own to find information in the data, and clusters are formed [17–21]. A block diagram of unsupervised learning is shown in Figure 1.5.
The figure shows that in unsupervised learning the inputs are collected as a set of features described as A1, A2, A3, A4 … Ak, but output features are not available. The input parameters are passed to a learning algorithm module, and diverse groups called clusters are formed [22–26].
Figure 1.5 Block diagram of unsupervised learning.
Unsupervised learning also has a role in bioinformatics; in the heart disease scenario, the inputs can be symptoms of heart disease such as high cholesterol, chest pain, and blood pressure. These symptoms are passed to the learning algorithm as input, clusters are formed by the model (variables/values of similar types in one cluster), and these help the patient identify a disease that may occur in the near future.
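A minimal clustering sketch of this idea, assuming two hypothetical symptom features (cholesterol and blood pressure) and scikit-learn's KMeans; the data and parameters are placeholders, not the chapter's implementation.

```python
# Unlabeled patient records are grouped into clusters without any target column.
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[240, 140], [250, 150], [180, 120], [170, 115], [260, 155]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster index assigned to each patient
print(kmeans.cluster_centers_)  # the centre of each discovered cluster
```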
In the field of ML, reinforcement learning was developed by John Andreae in 1963, when he invented a system called STeLLA [27]. It is a dynamic approach that works on the concept of feedback [28–31]. Reinforcement for a machine is the reward it receives upon acting in the environment. When the machine acts on its environment, it receives some evaluation of its actions, which is called reinforcement, but it is not told which action is the correct one for achieving the goal. The machine's utility is defined by the feedback function [32], and the objective is to maximize the expected feedback. The block diagram of reinforcement learning is shown below in Figure 1.6.
The figure shows that a machine first performs some actions in the environment and then starts to receive feedback. The feedback collected may be positive or negative. Positive feedback is kept inside the machine as knowledge, while the machine tries to learn from negative feedback so that such an incident does not happen again in the future. Another important aspect of reinforcement learning is the state, which also provides input to the machine, based on the situation, for learning purposes.
Figure 1.6 Block diagram of reinforcement learning.
A few points about reinforcement learning are as follows:
Input of the reinforcement learning process: the initial state.
Output of the reinforcement learning process: diverse solutions may be present, depending on the feedback obtained.
The training process is purely based on input.
The reinforcement learning model is a continuous process.
The best solution in reinforcement learning is the one with the maximum positive feedback.
An example of reinforcement learning could be a person suffering from high cholesterol and high blood pressure. He visits his family doctor and requests medication. After analyzing the symptoms, the doctor prescribes a diet chart and a set of medicines to lower the cholesterol level and blood pressure. He takes the medicines and feels better. Here, the patient gets positive feedback in the form of the results of the medication provided by the doctor. The patient will now be motivated to consume only a low-fat, low-sodium diet to keep the blood pressure and cholesterol levels down. If the levels do not go down, the patient will ask the doctor again, and more tests will be considered to lower the parameters required to evaluate the patient's heart.
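The feedback loop described above can be sketched as a toy program: the "machine" repeatedly tries one of two hypothetical treatments and reinforces whichever one returns positive feedback more often. This is only an illustrative sketch of the reward idea, not an algorithm from the chapter.

```python
# Toy feedback loop: estimate the value of each action from observed rewards.
import random

rewards = {"diet_A": 0.8, "diet_B": 0.4}  # hidden chance of positive feedback
value = {"diet_A": 0.0, "diet_B": 0.0}    # the machine's learned estimates

for step in range(1000):
    # Explore occasionally; otherwise exploit the best estimate so far.
    if random.random() < 0.1:
        action = random.choice(list(value))
    else:
        action = max(value, key=value.get)
    feedback = 1 if random.random() < rewards[action] else -1  # +/- feedback
    value[action] += 0.1 * (feedback - value[action])          # learn from it

print(value)  # the action with more positive feedback ends up valued higher
```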
Classification is a task in ML that deals with the organized process of assigning a class label to an observation from the problem domain. It is a sub-group of the supervised form of ML. The traditional classification scheme was invented by the Swedish botanist Carl Linnaeus, as depicted in Ref. [33]. In the process of calculating the desired output in supervised learning, classification is most effective when the input attributes are discrete. The classification approach always helps the user take decisions by providing classified conclusions from the observed data and values, as discussed in Refs. [34–36]. Figure 1.7 presents a classification graph built from the data of different persons who are or are not suffering from heart disease.
In the above figure, the patients that are suffering from Heart disease are represented by the triangle symbol, and those who are not, are represented by rectangle symbols. The hyperplane (partition) line depicts the bifurcation between these two classified entities. In general, there are four types of classification techniques. They are:
Figure 1.7 Concept of classification.
Binary Classification: classification tasks with two class labels, where typically one class represents the normal state and the other the abnormal state [37].
Imbalanced Classification: classification tasks where the examples are unequally distributed across the classes [38].
Multi-Label Classification: classification tasks with two or more class labels, where one or more class labels may be predicted for each example [39].
Multi-Class Classification: classification tasks where the number of class labels is greater than two [40].
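For concreteness, the hypothetical label arrays below illustrate how the four classification types above differ in the shape and distribution of their targets; the values are made up.

```python
# Illustrative label arrays for the four classification types.
y_binary     = [0, 1, 0, 1]                    # two classes: normal vs. abnormal
y_imbalanced = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # classes unequally distributed
y_multilabel = [[1, 0, 1], [0, 1, 0]]          # several labels may apply at once
y_multiclass = [0, 2, 1, 3]                    # more than two exclusive classes
```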
Figure 1.8 Classification based on gender.
To demonstrate the classification approach more precisely, a heart disease dataset [41] has been used that comprises a total of 1,025 people, of whom 312 are female and 713 are male. A particular reason for choosing this dataset is that people continue to suffer from heart disease, owing to factors such as excessive alcohol consumption, oily and fast food, and the inhalation of dangerous gases due to pollution. The classification by gender is shown below in Figure 1.8.
Regression is a very powerful type of statistical analysis, used for finding the strength and character of the relationship between one dependent variable and a series of independent variables [42–44]. This analysis provides knowledge about whether any future update is possible. The regression operation gives a researcher the ability to identify the best parameters of a topic for analysis, and also the parameters that should not be used for analysis.
In the field of ML, Linear Regression is the most common type of regression analysis for the purpose of prediction [45]. In this process of statistical analysis, equations are made for identifying the useful and not-useful parameters. These are done by linear regression as well as multiple linear regression [46–49]. The representation of Linear Regression is presented in Equation (1.1) and the representation of Multiple Linear Regression is presented in Equation (1.2).
B = n + qA + i        (1.1)
B = n + q1A1 + q2A2 + … + qkAk + i        (1.2)
Where,
B is the dependent variable,
A, or Aj for j = 1, …, k, are the independent variables,
n is the intercept,
q, or qj for j = 1, …, k, are the slope variables,
i is the regression residual, and
k is any natural number.
For easy understanding, a case study on heart disease is discussed below. In this case study, the regression approach was used to predict whether a person has heart disease or not. Here, the dependent variable is heart disease and the independent variables are cholesterol level, blood pressure, etc. After analyzing the data, it was found that the patient has a heart problem, which is presented below on a 2D plane in Figure 1.9.
The steps required for regression analysis are [50]:
Select the dependent & independent variables.
Explore the correlation matrix along with the scatter plot.
Perform the Linear or Multiple Regression Operation.
Account for outliers and multicollinearity.
Perform the t-test.
Handle the insignificant variables.
Figure 1.9 Regression.
Figure 1.10 Cholesterol line fit plot.
The regression operation was performed on the heart disease dataset using age and cholesterol, with the results shown in Figure 1.10.
In the above figure, a line fit plot shows the line of best fit, known as the trend line. This trend line is based on a linear equation and represents the typical cholesterol level of a person with respect to age. The plot has two axes: the vertical axis depicting age and the horizontal axis depicting cholesterol values. A trend line can be linear, polynomial, or exponential, as discussed in Refs. [51–53]. From the regression analysis of the heart disease dataset, the following numerical interpretation is obtained and presented in Table 1.1.
Where,
Multiple R (Correlation Coefficient): depicts the strength of the linear relationship between two variables, here the age and cholesterol of a person. This value always lies between −1 and +1. The obtained value of 0.972834634 indicates a strong relationship between age and cholesterol level.
R²: the coefficient of determination, i.e. the goodness of fit. The obtained value of 0.946407225 indicates that about 95% of the variation in the heart disease dataset is explained by the regression model.
Adjusted R²: an upgraded version of R² that adjusts for the number of predictors in the model. Its value increases when a new term improves the model more than expected, and vice versa. The obtained value of 0.945430663 indicates that the model is not performing at its best, so the number of predictors needs modification.
Standard Error: measures the precision of the regression model; the smaller the number, the more accurate the results. The obtained value of 12.7814549 indicates that the results are close to the actual values. The standard error is a measure of how well the data have been approximated.
Table 1.1 Regression statistics.
Multiple R: 0.972834634
R Square: 0.946407225
Adjusted R Square: 0.945430663
Standard Error: 12.7814549
Observations: 1,025
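As a sketch of how statistics like those in Table 1.1 could be computed, the snippet below uses SciPy's linregress on placeholder age and cholesterol arrays; the chapter's own implementation is not shown, so the values here are purely illustrative.

```python
# Compute Multiple R, R^2, adjusted R^2 and the standard error of a simple
# linear regression, using hypothetical stand-in data for age vs. cholesterol.
import numpy as np
from scipy import stats

age = np.array([29, 40, 51, 63, 70], dtype=float)      # placeholder values
chol = np.array([180, 200, 230, 260, 280], dtype=float)

res = stats.linregress(age, chol)
n, k = len(age), 1                                      # observations, predictors
r2 = res.rvalue ** 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
resid = chol - (res.intercept + res.slope * age)
std_err = np.sqrt(np.sum(resid ** 2) / (n - k - 1))     # std. error of the regression

print(f"Multiple R: {res.rvalue:.4f}, R^2: {r2:.4f}, "
      f"Adjusted R^2: {adj_r2:.4f}, Standard Error: {std_err:.4f}")
```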
Logistic regression is a statistical model used for estimating the probability of a class with the help of a binary dependent variable, i.e. Yes or No: it indicates whether an observation belongs to the Yes category or the No category. For example, after an event is executed on an object, the result may be Win or Loss, Pass or Fail, Accept or Not-Accept, etc. Mathematically, the logistic regression model is expressed with two indicator values, 0 and 1. It differs from the linear regression technique, as depicted in Ref. [54]. Because logistic regression is important in real-life classification problems, as depicted in Refs. [55, 56], fields such as the medical sciences, social sciences, and ML use this model in their various areas of operation.
Logistic regression was performed on the heart disease dataset [41]. The Receiver Operating Characteristic (ROC) curve is calculated, with the true positive rate plotted on the y-axis and the false positive rate plotted on the x-axis. After performing the logistic regression in Python (Google Colab), the outcome is represented in Figure 1.11 and Table 1.2: Figure 1.11 shows the ROC curve and Table 1.2 the Area under the ROC Curve (AUC).
The AUC value obtained on the training data (Table 1.2) is 0.8374022, and when the model is evaluated on the test data the result is outstanding (0.9409523). This indicates that the model is more than 90% efficient for classification. In the next section, the difference between linear and logistic regression is discussed.
Figure 1.11 ROC curve for logistic regression.
Table 1.2 AUC: Logistic regression.
Parameter: The area under the ROC Curve (AUC)
Training Data: 0.8374022 (Excellent)
Test Data: 0.9409523 (Outstanding)
Index: 0.5: No Discriminant, 0.6–0.8: Can be considered acceptable, 0.8–0.9: Excellent, >0.9: Outstanding
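A minimal sketch of this ROC/AUC workflow, assuming a small placeholder DataFrame in place of the heart disease dataset; the column names and split parameters are assumptions, not the chapter's code.

```python
# Train/test AUC for a logistic regression model, mirroring Table 1.2's layout.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Placeholder data standing in for the heart disease dataset [41].
df = pd.DataFrame({
    "age":    [29, 40, 51, 63, 70, 45, 58, 36],
    "chol":   [180, 200, 230, 260, 280, 210, 255, 190],
    "target": [0, 0, 1, 1, 1, 0, 1, 0],   # 1 = heart disease (made up)
})

X, y = df.drop(columns="target"), df["target"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
for name, X_, y_ in [("Training", X_tr, y_tr), ("Test", X_te, y_te)]:
    score = clf.predict_proba(X_)[:, 1]             # P(positive class)
    print(name, "AUC:", roc_auc_score(y_, score))   # compare with the index scale

fpr, tpr, _ = roc_curve(y_te, clf.predict_proba(X_te)[:, 1])  # ROC curve points
```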
Linear and logistic regression are two common types of regression used for prediction, with the result of the prediction represented by numeric variables. The differences between linear and logistic regression are given in Table 1.3 for easy understanding.
Linear regression models the data using a straight line, whereas logistic regression models the probability of a binary event as a function of the independent variables. A few other types of regression analysis, described by different researchers, are listed below.
Table 1.3 Difference between linear & logistic regression.
1. Purpose. Linear: used for solving regression problems. Logistic: used for solving classification problems.
2. Variables involved. Linear: continuous variables. Logistic: categorical variables.
3. Objective. Linear: find the best-fit line and predict the output. Logistic: find the S-curve and classify the samples.
4. Output. Linear: continuous values such as age, price, etc. Logistic: categorical values such as 0 & 1, Yes & No.
5. Collinearity. Linear: there may be collinearity between independent attributes. Logistic: there should be no collinearity between independent attributes.
6. Relationship. Linear: the relationship between the dependent variable and the independent variables must be linear. Logistic: the relationship may not be linear.
7. Estimation method used. Linear: least squares estimation. Logistic: maximum likelihood estimation.
Polynomial Regression: It is used for curvilinear data [57–58].
Stepwise Regression: It works with predictive models [59–60].
Ridge Regression: Used for multiple regression data [61–62].
Lasso Regression: Used for the purpose of variable selection & regularization [63–64].
Elastic Net Regression: Used when the penalties of lasso and ridge method are combined [65].
Random forest was first invented by Tin Kam Ho [66]. It is a supervised ensemble learning method that solves regression and classification problems. As an ensemble (bagging) algorithm, it works by averaging the results of many trees and thereby reducing overfitting [67–71]. It is a flexible, ready-to-use machine learning method. Random forest can also be used for regression, in which case it is known as regression forests [72]. It can cope with missing values, but suffers from complexity and a longer training period. There are two specific reasons for calling it "random":
When building trees, a random sample of the training dataset is drawn.
When splitting nodes, a random subset of features is considered.
The functioning of random forests is illustrated in Figure 1.12.
In the above figure there are five trees, each voting for a disease: blue represents liver disease, orange represents heart disease, green represents stomach disease, and yellow represents lung disease. By majority of color, orange (heart disease) is the winner.
Figure 1.12 Random forest.
This concept is known as the wisdom of the crowd, as discussed in Ref. [73]. The method is executed with the help of two concepts, listed below (a parameter sketch follows this list):
Bagging: decision trees are very sensitive to the data on which they are trained, so a small change in the data can produce a very different model and completely change the structure of the tree. Random forests take advantage of this by allowing each tree to randomly sample the dataset with replacement, resulting in different trees. This is called bagging or bootstrap aggregation [74–75].
Random feature selection: normally, when a node is split, every possible feature is considered and the one producing the best separation is chosen. In a random forest, by contrast, only a random subset of features is considered at each split. This allows more variation among the trees and results in greater diversification [76].
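Both sources of randomness map directly onto parameters of scikit-learn's RandomForestClassifier, as the sketch below shows; the parameter values are illustrative assumptions, not the chapter's settings.

```python
# The two sources of randomness in a random forest, expressed as parameters.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,     # number of trees whose votes are aggregated
    bootstrap=True,       # bagging: each tree sees a sample drawn with replacement
    max_features="sqrt",  # random feature selection: subset tried at each split
    random_state=0,
)
# rf.fit(X_train, y_train) would train the forest (X_train/y_train hypothetical);
# each tree differs because both its sample and its split features are randomized.
```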
The random forest concept was also applied to the heart disease dataset; low correlation between the individual models is the key. The Area under the ROC Curve (AUC) of the random forest, computed in Python (Google Colab), is shown in Table 1.4 and Figure 1.13.
In the above table, the area under the receiver operating characteristic curve (AUC), which measures the degree of separability, is given. The obtained value on the training data is 1.0000000 and on the test data 1.0000000, both attaining an outstanding remark on the AUC scale. The result indicates that the model performs outstandingly on the heart disease dataset.
Table 1.4 AUC: Random forest.
Parameter: The area under the ROC Curve (AUC)
Training Data: 1.0000000 (Outstanding)
Test Data: 1.0000000 (Outstanding)
Index: 0.5: No Discriminant, 0.6–0.8: Can be considered acceptable, 0.8–0.9: Excellent, >0.9: Outstanding
Figure 1.13 ROC curve for random forest.
K-nearest neighbor (KNN) belongs to the category of supervised classification algorithms and hence needs labeled data for training [77, 78]. In this approach, the value of K is supplied by the user. KNN can be used for both classification and regression, provided the attributes are known. The algorithm assigns a new data point a label according to its k closest data points.
For the heart disease dataset, the Area under the ROC Curve (AUC) has again been used; the ROC curve is the most basic tool for judging a classifier's performance in medical decision-making [79–81]. It is a graphical plot for judging the diagnostic ability of a binary classifier. The ROC curve generated for KNN on the heart disease dataset [41] is presented below in Figure 1.14.
In the figure, the true positive rate (probability of detection) is plotted on the y-axis and the false positive rate (probability of false alarm) on the x-axis. The false positive rate is the proportion of known negative cases for which the predicted condition is positive.
The AUC of the k-nearest neighbor classifier, computed on the heart disease dataset [41] in Python (Google Colab), is shown below in Table 1.5.
Figure 1.14 ROC curve for k-nearest neighbor.
Table 1.5 AUC: K-nearest neighbor.
Parameter: The area under the ROC Curve (AUC)
Training Data: 1.0000000 (Outstanding)
Test Data: 1.0000000 (Outstanding)
Index: 0.5: No Discriminant, 0.6–0.8: Can be considered acceptable, 0.8–0.9: Excellent, >0.9: Outstanding
The obtained value on the training data is 1.0000000 and on the test data 1.0000000, both attaining an outstanding remark on the AUC scale. The result shows that KNN performs outstandingly on the dataset.
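A minimal sketch of how the user-supplied k changes the KNN model, using synthetic stand-in data rather than the heart disease dataset; note how k = 1 memorizes the training data, echoing the training AUC of 1.0 reported above.

```python
# Vary k and compare training vs. test AUC on synthetic placeholder data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for k in (1, 5, 15):   # the value of K is suggested by the user
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    tr = roc_auc_score(y_tr, knn.predict_proba(X_tr)[:, 1])
    te = roc_auc_score(y_te, knn.predict_proba(X_te)[:, 1])
    print(f"k={k}: train AUC={tr:.3f}, test AUC={te:.3f}")  # k=1: train AUC 1.0
```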
Decision tree is a form of supervised machine learning invented by William Belson in 1959 [82]. It predicts response values by learning decision rules derived from the features [83–84]. Decision trees are good for evaluating options and are used in operations research and decision analysis. An example decision tree for deciding whether a person has heart disease is presented below in Figure 1.15 for easy understanding.
The above figure answers the question "Does a person have heart disease?" by checking various conditions and reaching a conclusion. First, it is checked whether the person has chest pain. If yes, it is then checked whether the person has high blood pressure; whether the blood pressure is high or even low, the person is classified as suffering from heart disease. If the person does not have chest pain, he is not suffering from heart disease. After implementing the decision tree on the heart disease dataset [41], the AUC values were generated and are presented in Table 1.6. The implementation was done in Python (Google Colab).
Figure 1.15 Decision tree.
Table 1.6 AUC: Decision trees.
Parameter: The area under the ROC Curve (AUC)
Training Data: 0.9588996 (Outstanding)
Test Data: 0.9773333 (Outstanding)
Index: 0.5: No Discriminant, 0.6–0.8: Can be considered acceptable, 0.8–0.9: Excellent, >0.9: Outstanding
The obtained value on the training data is 0.9588996 and on the test data 0.9773333, both attaining an outstanding remark on the AUC scale. The result indicates that the decision tree model performs outstandingly on the heart disease dataset.
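As a toy sketch mirroring the rules of Figure 1.15, the snippet below fits a tiny decision tree on made-up (chest pain, high blood pressure) examples and prints the learned rules; it is illustrative only, not the chapter's implementation.

```python
# Fit a tiny tree on hypothetical examples and print its decision rules.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # chest pain?, high blood pressure?
y = [0, 0, 1, 1]                      # disease follows chest pain in this toy data
tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["chest_pain", "high_bp"]))
```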
The original Support Vector Machine (SVM) algorithm was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963 [85]. In machine learning, the support vector classifier fits the data the user provides and returns the best-fit hyperplane that categorizes the data. After obtaining the hyperplane, the user can feed features to the classifier to check the predicted class [86–87]. SVM is used for analyzing data for either regression or classification. Taking the same example of deciding whether a person suffers from heart disease, but in a more detailed view, the idea is depicted in Figure 1.16.
