Data Analytics in Bioinformatics (E-Book)

Description

Machine learning techniques are increasingly being used to address problems in computational biology and bioinformatics. Novel machine learning techniques for analyzing high-throughput data in the form of sequences, gene and protein expression, pathways, and images are becoming vital for understanding diseases and for future drug discovery. Machine learning techniques such as Markov models, support vector machines, neural networks, and graphical models have been successful in analyzing life science data because of their ability to handle the randomness and uncertainty of data noise and to generalize. Data Analytics in Bioinformatics compiles recent machine learning methods and their applications to contemporary problems in bioinformatics, such as the classification and prediction of disease, feature selection, dimensionality reduction, gene selection, the classification of microarray data, and many more.

Page count: 668

Publication year: 2021



Table of Contents

Cover

Title Page

Copyright

Preface

Acknowledgement

Part 1: THE COMMENCEMENT OF MACHINE LEARNING SOLICITATION TO BIOINFORMATICS

1 Introduction to Supervised Learning

1.1 Introduction

1.2 Learning Process & its Methodologies

1.3 Classification and its Types

1.4 Regression

1.5 Random Forest

1.6 K-Nearest Neighbor

1.7 Decision Trees

1.8 Support Vector Machines

1.9 Neural Networks

1.10 Comparison of Numerical Interpretation

1.11 Conclusion & Future Scope

References

2 Introduction to Unsupervised Learning in Bioinformatics

2.1 Introduction

2.2 Clustering in Unsupervised Learning

2.3 Clustering in Bioinformatics—Genetic Data

2.4 Conclusion

References

3 A Critical Review on the Application of Artificial Neural Network in Bioinformatics

3.1 Introduction

3.2 Biological Datasets

3.3 Building Computational Model

3.4 Literature Review

3.5 Critical Analysis

3.6 Conclusion

References

Part 2: MACHINE LEARNING AND GENOMIC TECHNOLOGY, FEATURE SELECTION AND DIMENSIONALITY REDUCTION

4 Dimensionality Reduction Techniques: Principles, Benefits, and Limitations

4.1 Introduction

4.2 The Benefits and Limitations of Dimension Reduction Methods

4.3 Components of Dimension Reduction

4.4 Methods of Dimensionality Reduction

4.5 Conclusion

References

5 Plant Disease Detection Using Machine Learning Tools With an Overview on Dimensionality Reduction

5.1 Introduction

5.2 Flowchart

5.3 Machine Learning (ML) in Rapid Stress Phenotyping

5.4 Dimensionality Reduction

5.5 Literature Survey

5.6 Types of Plant Stress

5.7 Implementation I: Numerical Dataset

5.8 Implementation II: Image Dataset

5.9 Conclusion

References

6 Gene Selection Using Integrative Analysis of Multi-Level Omics Data: A Systematic Review

6.1 Introduction

6.2 Approaches for Gene Selection

6.3 Multi-Level Omics Data Integration

6.4 Machine Learning Approaches for Multi-Level Data Integration

6.5 Critical Observation

6.6 Conclusion

References

7 Random Forest Algorithm in Imbalance Genomics Classification

7.1 Introduction

7.2 Methodological Issues

7.3 Biological Terminologies

7.4 Proposed Model

7.5 Experimental Analysis

7.6 Current and Future Scope of ML in Genomics

7.7 Conclusion

References

8 Feature Selection and Random Forest Classification for Breast Cancer Disease

8.1 Introduction

8.2 Literature Survey

8.3 Machine Learning

8.4 Feature Engineering

8.5 Methodology

8.6 Result Analysis

8.7 Conclusion

References

9 A Comprehensive Study on the Application of Grey Wolf Optimization for Microarray Data

9.1 Introduction

9.2 Microarray Data

9.3 Grey Wolf Optimization (GWO) Algorithm

9.4 Studies on GWO Variants

9.5 Application of GWO in Medical Domain

9.6 Application of GWO in Microarray Data

9.7 Conclusion and Future Work

References

10 The Cluster Analysis and Feature Selection: Perspective of Machine Learning and Image Processing

10.1 Introduction

10.2 Various Image Segmentation Techniques

10.3 How to Deal With Image Dataset

10.4 Class Imbalance Problem

10.5 Optimization of Hyperparameter

10.6 Case Study

10.7 Using AI to Detect Coronavirus

10.8 Using Artificial Intelligence (AI), CT Scan and X-Ray

10.9 Conclusion

References

Part 3: MACHINE LEARNING AND HEALTHCARE APPLICATIONS

11 Artificial Intelligence and Machine Learning for Healthcare Solutions

11.1 Introduction

11.2 Using Machine Learning Approaches for Different Purposes

11.3 Various Resources of Medical Data Set for Research

11.4 Deep Learning in Healthcare

11.5 Various Projects in Medical Imaging and Diagnostics

11.6 Conclusion

References

12 Forecasting of Novel Corona Virus Disease (Covid-19) Using LSTM and XG Boosting Algorithms

12.1 Introduction

12.2 Machine Learning Algorithms for Forecasting

12.3 Proposed Method

12.4 Implementation

12.5 Results and Discussion

12.6 Conclusion and Future Work

References

13 An Innovative Machine Learning Approach to Diagnose Cancer at Early Stage

13.1 Introduction

13.2 Related Work

13.3 Materials and Methods

13.4 System Design

13.5 Results and Discussion

13.6 Conclusion

References

14 A Study of Human Sleep Staging Behavior Based on Polysomnography Using Machine Learning Techniques

14.1 Introduction

14.2 Polysomnography Signal Analysis

14.3 Case Study on Automated Sleep Stage Scoring

14.4 Summary and Conclusion

References

15 Detection of Schizophrenia Using EEG Signals

15.1 Introduction

15.2 Methodology

15.3 Literature Review

15.4 Discussion

15.5 Conclusion

References

16 Performance Analysis of Signal Processing Techniques in Bioinformatics for Medical Applications Using Machine Learning Concepts

16.1 Introduction

16.2 Basic Definition of Anatomy and Cell at Micro Level

16.3 Signal Processing—Genome Signal Processing

16.4 Hotspots Identification Algorithm

16.5 Results—Experimental Investigations

16.6 Analysis Using Machine Learning Metrics

16.7 Conclusion

Appendix

A.1 Hotspot Identification Code

A.2 Performance Metrics Code

References

17 Survey of Various Statistical Numerical and Machine Learning Ontological Models on Infectious Disease Ontology

17.1 Introduction

17.2 Disease Ontology

17.3 Infectious Disease Ontology

17.4 Biomedical Ontologies on IDO

17.5 Various Methods on IDO

17.6 Machine Learning-Based Ontology for IDO

17.7 Recommendations or Suggestions for Future Study

17.8 Conclusions

References

18 An Efficient Model for Predicting Liver Disease Using Machine Learning

18.1 Introduction

18.2 Related Works

18.3 Proposed Model

18.4 Results and Analysis

18.5 Conclusion

References

Part 4: BIOINFORMATICS AND MARKET ANALYSIS

19 A Novel Approach for Prediction of Stock Market Behavior Using Bioinformatics Techniques

19.1 Introduction

19.2 Literature Review

19.3 Proposed Work

19.4 Experimental Study

19.5 Conclusion and Future Work

References

20 Stock Market Price Behavior Prediction Using Markov Models: A Bioinformatics Approach

20.1 Introduction

20.2 Literature Survey

20.3 Proposed Work

20.4 Experimental Work

20.5 Conclusions and Future Work

References

Index

End User License Agreement

List of Illustrations

Chapter 1

Figure 1.1 Traditional learning.

Figure 1.2 Machine learning.

Figure 1.3 Learning behavior of a machine.

Figure 1.4 Block diagram of supervised learning.

Figure 1.5 Block diagram of unsupervised learning.

Figure 1.6 Block diagram of reinforcement learning.

Figure 1.7 Concept of classification.

Figure 1.8 Classification based on gender.

Figure 1.9 Regression.

Figure 1.10 Cholesterol line fit plot.

Figure 1.11 ROC curve for logistic regression.

Figure 1.12 Random forest.

Figure 1.13 ROC curve for random forest.

Figure 1.14 ROC curve for k-nearest neighbor.

Figure 1.15 Decision tree.

Figure 1.16 Support vector machine.

Figure 1.17 Neural network (general).

Figure 1.18 Neural network (detailed).

Chapter 2

Figure 2.1 Machine learning in bioinformatics.

Figure 2.2 Example matrix of gene expression (10 genes in a row and 2 samples in...

Figure 2.3 Partition clustering algorithms.

Figure 2.4 (a) Agglomerative clustering, (b) divisive clustering.

Figure 2.5 Self Organizing Map (SOM).

Chapter 3

Figure 3.1 Areas of research of bioinformatics [4].

Figure 3.2 Simple network architecture of ANN with four input unit [17].

Figure 3.3 Single layer perceptron (left) and multilayer perceptron with one hid...

Chapter 4

Figure 4.1 The Steps of LDA reduction technique.

Figure 4.2 The steps of backward feature elimination technique.

Figure 4.3 The steps of backward feature elimination technique.

Figure 4.4 Low variance ratio feature reduction techniques.

Figure 4.5 The steps of the random forest algorithm.

Chapter 5

Figure 5.1 Flowchart to depict the structure of the book chapter.

Figure 5.2 Depicts the different steps in which PC1 is created, by considering a...

Figure 5.3 Shows the scree plot which depicts the variance of each of the princ...

Figure 5.4 Shows the steps followed in extraction of the features by ORB in case...

Figure 5.5 Shows the steps followed in extraction of the features by ORB in case...

Figure 5.6 Shows the steps followed in extraction of the features by ORB in case...

Figure 5.7 Color histogram comparison between three pairs of images: images of b...

Chapter 6

Figure 6.1 Multiple levels of omics data in biological system, from genome, epig...

Figure 6.2 (a) Hypothesis 1: Molecular variations propagates linearly in a hiera...

Figure 6.3 Different integration pipelines of multi-level omics data. (a) Horizo...

Figure 6.4 Methods for parallel integration. (a) Concatenation-based, (b) transf...

Chapter 7

Figure 7.1 Knowledge discovery process in genomic data.

Figure 7.2 Decision tree classifier on a dataset having two features (X1 and X2)...

Figure 7.3 Ensemble learning model to increase the accuracy of classification mo...

Figure 7.4 Bootstrap aggregation (bagging) technique.

Figure 7.5 Implementing RF classifier in a dataset having four features (X1…X4) ...

Figure 7.6 Genes, proteins and molecular machines.

Chapter 8

Figure 8.1 Ensemble learning.

Figure 8.2 Output label.

Figure 8.3 Heat map.

Figure 8.4 Correlation matrix.

Figure 8.5 Confusion matrix.

Figure 8.6 Confusion matrix.

Figure 8.7 Comparison of different methods.

Chapter 9

Figure 9.1 Schematic representation of DNA microarray technology [12].

Figure 9.2 Model for microarray data analysis [13].

Figure 9.3 Social hierarchy of grey wolves.

Figure 9.4 Flow chart of GWO.

Chapter 10

Figure 10.1 Flowchart for K-means clustering.

Figure 10.2 Linear classifier.

Figure 10.3 Hyper plane classifier.

Figure 10.4 Optimal line.

Figure 10.5 Working process of SVM.

Figure 10.6 Hybrid algorithm.

Figure 10.7 C-means clustering algorithm.

Figure 10.8 Model of data prediction.

Figure 10.9 The flow-chart signifies different systems, approaches and investiga...

Chapter 12

Figure 12.1 Total Confirmed Cases across the world on May, 2020. (Source : https...

Figure 12.2 Confirmed Covid-19 cases in Tamilnadu—Highly affected cities. (Sourc...

Figure 12.3 Proposed model for forecasting covid-19.

Figure 12.4 Working of LSTM.

Figure 12.5 Working of the gradient boosting algorithm.

Figure 12.6 Output of linear regression for linear data.

Figure 12.7 Working of polynomial regression with the non-linear data.

Figure 12.8 Actual and predicted values of Confirmed cases using polynomial Regr...

Figure 12.9 Actual and predicted values of fatalities using polynomial regressio...

Chapter 13

Figure 13.1 Block diagram.

Figure 13.2 Input image.

Figure 13.3 Gray image.

Figure 13.4 Filtered image.

Figure 13.5 SNR comparison.

Figure 13.6 Identification of cancer.

Chapter 14

Figure 14.1 C3-A2 channel of EEG signal for different stages behavior of sleep: ...

Figure 14.2 60 s epoch original EEG signal behavior from Healthy Subject.

Figure 14.3 Different brain behaviors of healthy subject during different sleep ...

Figure 14.4 EEG signal behavior from sleep disorder subject with duration of 60 ...

Figure 14.5 Different sleep stage characteristics from subject affected with sle...

Figure 14.6 (a) 10–20 Electrode placement system, (b) Notations for placement of...

Figure 14.7 Left frontal region recording.

Figure 14.8 Left central region recording.

Figure 14.9 Right frontal recording.

Figure 14.10 Right central recording.

Figure 14.11 Left occipital recording.

Figure 14.12 Right occipital region recording.

Figure 14.13 Right outer canthus region recording.

Figure 14.14 Left outer canthus region recording (typical activity of the EOG (L...

Figure 14.15 Chin EMG region recording (typical activity of the EMG (EMG-X1)).

Figure 14.16 Limb EMG (left leg) region recording (typical activity of the EMG (...

Figure 14.17 Limb EMG (Right Leg) Region Recording (Typical activity of the EMG ...

Figure 14.18 Sleep stages behavior of the subjects (a) Subject-16, (b) Subject-1...

Figure 14.19 Framework of the proposed research work.

Figure 14.20 Performance graphs of the proposed SleepEEG test model with the C4-...

Chapter 15

Figure 15.1 Electrode placement method.

Figure 15.2 EEG delta, theta, alpha, beta and gamma [7].

Figure 15.3 Steps involved in processing of EEG signal.

Chapter 16

Figure 16.1 Bioinformatics suit.

Figure 16.2 Amount of data stored by EBI over the years [3].

Figure 16.3 Depicts prokaryotic & eukaryotic cells.

Figure 16.4 Eukaryotic nuclear DNA within the chromosomes.

Figure 16.5 Depicts RNA transcription process shows a strand RNA given from a do...

Figure 16.6 Depicts transcription and translation describing the process of conv...

Figure 16.7 Depicts a protein handshake.

Figure 16.8 Depicts Protein–Protein interaction render to hotspot.

Figure 16.9 Complex plane of unit circle taken for 20 complex numbers in CPNR.

Figure 16.10 Mapping of amino acids using CPNR.

Figure 16.11 Mapping of amino acids using EIIP.

Figure 16.12 Depicts Flow of EIIP, CPNR mapping with DWT for hotspots identifica...

Figure 16.13 CPNR results.

Figure 16.14 EIIP results.

Figure 16.15 Confusion matrix.

Figure 16.16 Precision calculation details.

Figure 16.17 Recall calculation details.

Figure 16.18 Overall performance analysis visualization.

Chapter 18

Figure 18.1 Overview of experimental methodology.

Figure 18.2 Correlation heatmap.

Figure 18.3 Encoding process.

Chapter 19

Figure 19.1 Framework of the book chapter.

Figure 19.2 Actual vs predicted behavior BSE February, 2001.

Figure 19.3 Actual vs predicted behavior BSE February, 2004.

Figure 19.4 Actual vs predicted behavior BSE February, 2005.

Figure 19.5 Actual vs predicted behavior BSE February, 2008.

Figure 19.6 Actual vs predicted behavior BSE February, 2010.

Figure 19.7 Actual vs predicted behavior BSE March, 2001.

Figure 19.8 Actual vs predicted behavior BSE March, 2002.

Figure 19.9 Actual vs predicted behavior BSE March, 2003.

Figure 19.10 Actual vs predicted behavior BSE March, 2006.

Figure 19.11 Actual vs predicted behavior BSE March, 2008.

Figure 19.12 Actual vs predicted behavior BSE March, 2010.

Figure 19.13 Actual vs predicted behavior BSE April, 2001.

Figure 19.14 Actual vs predicted behavior BSE April, 2003.

Figure 19.15 Actual vs predicted behavior BSE April, 2004.

Figure 19.16 Actual vs predicted behavior BSE April, 2007.

Figure 19.17 Actual vs predicted behavior BSE April, 2008.

Figure 19.18 Actual vs predicted behavior BSE May, 2001.

Figure 19.19 Actual vs predicted behavior BSE May, 2002.

Figure 19.20 Actual vs predicted behavior BSE October, 2002.

Figure 19.21 Actual vs predicted behavior BSE October, 2003.

Chapter 20

Figure 20.1 Framework for prediction using Markov model.

Figure 20.2 Example for zero order Markov model of BSE (2001).

Figure 20.3 Example for first order Markov model of BSE (2001).

Figure 20.4 Behavior of BSE, January 2005 (actual vs predicted).

Figure 20.5 Behavior of BSE, February 2005 (actual vs predicted).

Figure 20.6 Behavior of BSE, March 2005 (actual vs predicted).

Figure 20.7 Behavior of BSE, March 2010 (actual vs predicted).

Figure 20.8 First two days prediction accuracy percentage of BSE (November).

Figure 20.9 First four days prediction accuracy percentage of BSE (November).

Figure 20.10 First six days prediction accuracy percentage of BSE (November).

List of Tables

Chapter 1

Table 1.1 Regression statistics.

Table 1.2 AUC: Logistic regression.

Table 1.3 Difference between linear & logistic regression.

Table 1.4 AUC: Random forest.

Table 1.5 AUC: K-nearest neighbor.

Table 1.6 AUC: Decision trees.

Table 1.7 AUC: Support vector machines.

Table 1.8 AUC: Neural network.

Table 1.9 AUC: Comparison of numerical interpretations.

Chapter 2

Table 2.1 Gene expression data matrix representation.

Chapter 3

Table 3.1 Shows published articles of ANN used for biological data.

Chapter 5

Table 5.1 Highlights the important work of literature published in the last two ...

Table 5.2 Depicts the features influencing each of the 9 principal components.

Chapter 6

Table 6.1 Published studies of multi-level omics data integration using unsuperv...

Table 6.2 Published studies of multi-level omics data integration using supervis...

Chapter 7

Table 7.1 The classification accuracy, recall, precision and F-score of ensemble...

Table 7.2 The classification accuracy, recall, precision and F-score of ensemble...

Table 7.3 The classification accuracy, recall, precision and F-score of ensemble...

Chapter 8

Table 8.1 Dataset Attributes.

Chapter 9

Table 9.1 Literature Review for Hybridization of GWO Algorithm.

Table 9.2 Literature Review for GWO Extension Algorithms.

Table 9.3 Literature Review for GWO Modification Algorithm.

Table 9.4 Literature Review for GWO in Medical Applications.

Table 9.5 Literature Review for GWO in Medical Application for Microarray Datase...

Chapter 12

Table 12.1 Sample dataset.

Table 12.2 Performance analysis of the proposed algorithms.

Chapter 13

Table 13.1 SNR Comparison of different images.

Chapter 14

Table 14.1 Composed channels of polysomnography.

Table 14.2 Details of enrolled subjects in this proposed work.

Table 14.3 Description of epochs of various sleep stages from four subjects used...

Table 14.4 Confusion matrix performance achieved by the SleepEEG study using C4-...

Table 14.5 Performances Statistics of Subjects with input of C4-A1 channel sleep...

Chapter 15

Table 15.1 Different part of cerebrum and its operations.

Table 15.2 Literature Review on the Analysis of EEG Signal for Detection of Schi...

Chapter 16

Table 16.1 Genetic code describing 64 possible codons and the corresponding amin...

Table 16.2 Amino acids listed in table with 3-letter and 1-letter codes.

Table 16.3 Depicts EIIP mapping for amino acids.

Table 16.4 Number of codons for 20 amino acids.

Table 16.5 Different pattern among each of the non-composable number and its bes...

Table 16.6 Complex Prime Numerical Representation (CPNR).

Table 16.7 Dataset of protein sequences used for experiment.

Table 16.8 Details of identified hotspots using CPNR.

Table 16.9 Details of identified hotspots using EIIP.

Table 16.10 Analysis of experimental results.

Chapter 18

Table 18.1 Data types of features.

Table 18.2 Comparison of Performance Metrics on various Machine Learning Models.

Table 18.3 Accuracy on application of 10-fold cross validation of logistic regre...

Chapter 19

Table 19.1 Hamming distance result year wise.

Chapter 20

Table 20.1 Example for second order Markov model of BSE (2001).

Table 20.2 Hamming distance of BSE (2002 to 2014) using zero order Markov model.

Table 20.3 Hamming distance of BSE (2002 to 2014) using first order Markov model...

Table 20.4 Hamming distance of BSE (2002 to 2014) using Second order Markov mode...


Scrivener Publishing

100 Cummings Center, Suite 541J

Beverly, MA 01915-6106

Publishers at Scrivener

Martin Scrivener ([email protected])

Phillip Carmical ([email protected])

Data Analytics in Bioinformatics

A Machine Learning Perspective

Edited by

Rabinarayan Satpathy

Tanupriya Choudhury

Suneeta Satpathy

Sachi Nandan Mohanty

and

Xiaobo Zhang

This edition first published 2021 by John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA and Scrivener Publishing LLC, 100 Cummings Center, Suite 541J, Beverly, MA 01915, USA

© 2021 Scrivener Publishing LLC

For more information about Scrivener publications please visit www.scrivenerpublishing.com.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

Wiley Global Headquarters

111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials, or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read.

Library of Congress Cataloging-in-Publication Data

ISBN 978-1-119-78553-8

Cover image: Pixabay.com

Cover design by Russell Richardson

Set in size of 11pt and Minion Pro by Manila Typesetting Company, Makati, Philippines

Printed in the USA

10 9 8 7 6 5 4 3 2 1

Preface

Machine learning has become increasingly popular in recent decades due to its well-defined algorithms and techniques that enable computers to learn and solve real-life problems which are difficult, time-consuming, and tedious to solve traditionally. Regarded as a subdomain of artificial intelligence, it has a gamut of applications in the field of healthcare, medical diagnosis, bioinformatics, natural language processing, stock market analysis and many more. Recently, there has been an explosion of heterogeneous biological data requiring analysis, retrieval of useful patterns, management and proper storage. Moreover, there is the additional challenge of developing automated tools and techniques that can deal with these different kinds of outsized data in order to translate and transform computational modelling of biological systems and its correlated disciplinary data for further classification, clustering, prediction and decision-making.

Machine learning has proven its potential through its application in extracting relevant information in various biological domains such as bioinformatics, and it has been successful in finding efficient solutions to complex biomedical problems. Before machine learning was applied, traditional mathematical and statistical models were used alongside domain expertise to carry out investigations and experiments manually, using instruments, hands, eyes, etc. Such conventional methods alone are not enough to deal with large volumes of different types of biological data; the application of machine learning techniques has therefore become essential for solving complex bioinformatics problems spanning both computer science and biology. With this in mind, this book has been designed around chapters from eminent researchers who explain machine learning techniques and their application to various bioinformatics problems such as the classification and prediction of disease, feature selection, dimensionality reduction, and gene selection. Since the chapters are based on progressive collaborative research work on a broad range of topics and implementations, the book will be of interest to students and researchers from both the computer science and biological domains.

This edited book is organized into four parts. The first part introduces the application of machine learning techniques to bioinformatics. The chapters in the second part cover machine learning applications for dimensionality reduction, feature and gene selection, plant disease analysis and prediction, and cluster analysis. The third part brings together a variety of machine learning research applications in the healthcare domain. The book concludes with machine learning applications for stock market behavioral analysis and prediction.

The Editors

December 2020

Acknowledgement

The editors would like to acknowledge and thank everyone who assisted with this book. Our sincere thanks go to each of the chapter authors for their contributions, without whose support this book would not have become a reality. Our heartfelt gratitude also goes to the subject matter experts who found the time to review the chapters and return them promptly, improving the quality, prominence, and uniform arrangement of the chapters in the book. Finally, many thanks to the team at Scrivener Publishing for their dedicated support and help in publishing this edited book.

Part 1: THE COMMENCEMENT OF MACHINE LEARNING SOLICITATION TO BIOINFORMATICS

1 Introduction to Supervised Learning

Rajat Verma, Vishal Nagar and Satyasundara Mahapatra*

PSIT, Kanpur, Uttar Pradesh, India

Abstract

Artificial Intelligence (AI) has become increasingly important to machines in the present business scenario. AI denotes the intelligence exhibited by machines, in contrast to the natural intelligence exhibited by living beings. Today, AI is popular largely because of its Machine Learning (ML) techniques. In ML, the performance of a machine depends upon how well that machine learns, so improving a machine's performance is always proportional to improving its learning behavior; these learning behaviors are derived from knowledge of the intelligence of living beings. This chapter presents an introduction to AI through a detailed overview of ML. Data is the one essential requirement for ML's success, and ML is carried out through diverse learning approaches, known as supervised, unsupervised, and reinforcement learning, all of which operate on data as their quintessential element. In supervised learning, the goal is to find the relationship between the independent variables (the input attributes) and the dependent variables (the target attributes). Unsupervised learning works to the contrary: it deals with the formation of groups or clusters, rather than with the relationship between input and target attributes. The third approach, reinforcement learning, works through feedback or rewards. This chapter focuses on the importance of ML and its learning techniques in daily life with the help of a case study on a heart disease dataset. The numerical interpretation of the learning techniques is explained with graphs and tables for easy understanding.

Keywords: Artificial intelligence, machine learning, supervised, unsupervised, reinforcement, knowledge, intelligence

1.1 Introduction

In today's world, businesses are moving towards automated intelligence for decision making. This is made possible by the well-known field of Artificial Intelligence (AI). AI also plays a vital role in research, enabling decisions to be taken instantly. AI is divided into sub-domains such as Machine Learning (ML) and Artificial Neural Networks (ANN) [1]. ML, also termed augmented analytics [2], refers to the improvement of a machine's performance through the experience the machine has previously gained; traditional learning (i.e., the conventional approach that preceded ML) works far less efficiently by comparison [3]. In traditional learning, the user supplies data and a program as input and obtains the output or results, whereas in ML the user supplies the data and the desired output as input, and the program or rules are produced as output. This means that the data is more important than the program, because the business world depends on the accuracy of the program used for decision making. The block diagram of traditional learning is shown in Figure 1.1 for easy understanding.
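This contrast can be sketched in code. The following is a hypothetical toy illustration (the cholesterol values, labels, and threshold rule are invented, not taken from the chapter's dataset): in traditional programming the programmer writes the rule, while in ML a rule is derived from data together with known outputs.

```python
# Traditional programming: the programmer supplies the rule directly.
def traditional_rule(cholesterol):
    return "high risk" if cholesterol > 240 else "low risk"

# Machine learning: the "rule" (here, a threshold) is derived
# from data plus the desired outputs.
def learn_threshold(values, labels):
    # Pick the cut-off that classifies the training examples best.
    best_t, best_correct = None, -1
    for t in sorted(values):
        correct = sum(
            (v > t) == (label == "high risk")
            for v, label in zip(values, labels)
        )
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

data = [180, 200, 250, 260, 300]
labels = ["low risk", "low risk", "high risk", "high risk", "high risk"]
t = learn_threshold(data, labels)
print(t)  # the learned threshold that separates the two classes
```

Here the data and the labeled outcomes together produce the decision rule, which is the reversal of roles that the paragraph above describes.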

Traditional learning is a manual process, whereas ML is automated. ML has increased the accuracy of analytics across diverse domains, including the preparation of data (raw facts and figures), automatic outlier detection, Natural Language Interfaces (NLI), and recommendations [4]. As a result, the bias involved in taking decisions on a business problem is decreased.

Figure 1.1 Traditional learning.

Figure 1.2 Machine learning.

ML is a sub-group of AI whose primary aim is to allow systems to learn automatically from data or observations obtained from the environment through different devices [5]. The block diagram of ML is shown below in Figure 1.2.

ML-based algorithms make predictions and decisions by using mathematical models fitted to training data [6–8]. A few popular applications of machine learning are e-mail filtering [9], medical diagnosis [10], classification [11], and information extraction [12]. ML improves the accuracy of computer programs by accessing data from the surroundings, learning from it automatically, and enhancing the capacity for decision making. The main objective of ML is to minimize human intervention and assistance while performing any task. The next section of this chapter highlights the process of learning along with its different methodologies.

1.2 Learning Process & its Methodologies

In AI, learning means training a machine in such a way that the machine can take decisions instantly, upgrading its performance through improved accuracy. As a machine operates in its working environment it meets either success or failure, and from these successes and failures it gains experience. This newly gained experience improves the machine's actions and forms an optimal policy for the working environment. This process is known as learning from experience, and it is possible even in an unknown working environment. A general block diagram of such a learning architecture is presented below in Figure 1.3, which illustrates the mechanism by which a machine learns from new experience. The sequence of learning behavior, step by step, is given below.

Step 1. IoT-based sensors receive input from the environment.

Step 2. The sensors send these inputs to the critic for performance evaluation against the previously stored performance standards. Simultaneously, the sensors send the same input to the performance element to check its effectiveness; if found satisfactory, it is immediately returned to the environment through the effectors.

Step 3. The critic provides feedback to the learning element; any new feedback is used to update the performance element. The updated knowledge returns to the learning element and is sent to the problem generator as a learning goal, to be evaluated through experiments. The updates are then passed to the performance element for future reference.
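The three steps above can be sketched in code. This is only an illustrative toy, assuming a numeric environment; the component names (critic, learning element, performance element) follow Figure 1.3, but every function body here is invented for demonstration.

```python
# Minimal sketch of the learning architecture in Figure 1.3.
# All thresholds and update rules are illustrative assumptions.

def critic(percept, standard=0.5):
    """Step 2: compare a percept against a stored performance standard."""
    return "positive" if percept >= standard else "negative"

def learning_element(knowledge, percept, feedback):
    """Step 3: update stored knowledge from the critic's feedback."""
    if feedback == "positive":
        knowledge.append(percept)
    return knowledge

def performance_element(knowledge):
    """Choose an action from current knowledge (here: the running mean)."""
    return sum(knowledge) / len(knowledge) if knowledge else 0.0

knowledge = []
for percept in [0.2, 0.7, 0.9, 0.4, 0.8]:   # Step 1: sensor inputs
    feedback = critic(percept)
    knowledge = learning_element(knowledge, percept, feedback)
    action = performance_element(knowledge)  # act through the effectors

print(round(action, 2))
```

The loop accumulates only positively evaluated percepts, mirroring how the architecture keeps knowledge that meets the performance standard.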

Figure 1.3 Learning behavior of a machine.

The learning process in ML takes three forms: supervised learning, unsupervised learning, and reinforcement learning. These three learning types each have their importance in different fields of bioinformatics research; hence, they are explained with suitable examples in the next sections.

1.2.1 Supervised Learning

This is the most common learning mechanism in ML and is used by most newcomers to research in their respective fields. It trains the machine on a labeled dataset, i.e., a set of input–output pairs, as depicted in Refs. [13–15]. These datasets are available in continuous or discrete form, but in either case they require supervision with an appropriate training model. Because supervised learning predicts accurate results [16], it is mostly used for regression analysis and classification. Figure 1.4 shows the execution model of supervised learning.

The figure shows that in supervised learning, a given set of input attributes (A1, A2, A3, A4, …, Ak) along with their output attributes (B1, B2, B3, B4, …, Bk) is kept in a knowledge dataset. The learning algorithm takes an input Ai, executes its model, and produces the result Bi as the desired output. Supervised learning has its importance in the field of bioinformatics: in the heart disease scenario, the inputs can be symptoms of heart disease such as high cholesterol, chest pain, and blood pressure, and the output is whether or not a person suffers from heart disease. All these inputs are passed to the learning algorithm, where the model is trained; when a new input is then passed through the model, the machine gives the expected output. If the accuracy of that output is not up to the mark, the model needs modification or upgrading.

Figure 1.4 Block diagram of supervised learning.

An example of supervised learning could be a person who feels he has a high cholesterol level and chest pain and goes to the doctor for a check-up. The doctor feeds the inputs given by the patient to the machine, which predicts that the patient is suffering from a cardiac issue. This is an analogy to supervised learning: the inputs given by the patient are the independent variables, and the corresponding output from the machine is the dependent attribute. The machine acts as a model that predicts a relevant output because it has been trained on similar inputs. Supervised learning is itself a huge subfield of ML and underlies a variety of techniques used in research work, including regression analysis, Artificial Neural Networks (ANN), and Support Vector Machines (SVM).
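The doctor's-visit analogy can be sketched as a short supervised-learning program. The symptom values and labels below are invented for illustration (they are not drawn from the chapter's heart disease dataset), and the choice of a decision-tree classifier is simply one convenient supervised model.

```python
# Toy supervised learning: labeled symptom data -> trained classifier.
# All numbers are illustrative assumptions, not real patient data.
from sklearn.tree import DecisionTreeClassifier

# Each row: [cholesterol (mg/dL), chest pain (0/1), systolic BP (mmHg)]
A = [[240, 1, 150], [180, 0, 120], [260, 1, 160],
     [170, 0, 110], [250, 1, 155], [190, 0, 118]]
B = [1, 0, 1, 0, 1, 0]   # labels that "supervise" training: 1 = heart disease

model = DecisionTreeClassifier(random_state=0).fit(A, B)

# A new, unseen patient: high cholesterol, chest pain, high blood pressure.
prediction = model.predict([[255, 1, 158]])[0]
print(prediction)
```

The labeled pairs (A, B) play the role of the knowledge dataset in Figure 1.4; the fitted model is the analogue of the trained machine in the doctor's office.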

1.2.2 Unsupervised Learning

In unsupervised learning, the user does not have to supervise the model. Here, the model is allowed to work on its own to discover information, and clusters are formed [17–21]. A block diagram of unsupervised learning is shown in Figure 1.5.

The figure shows that in unsupervised learning the inputs are collected as a set of features A1, A2, A3, A4, …, Ak, but no output features are available. The input parameters are passed to a learning algorithm module, and diverse groups, called clusters, are formed [22–26].

Figure 1.5 Block diagram of unsupervised learning.

Unsupervised learning has its role in bioinformatics: in the heart disease scenario, the inputs can be symptoms of heart disease such as high cholesterol, chest pain, and blood pressure. These symptoms are passed to the learning algorithm, which forms clusters (variables/values of similar types in one cluster) that can help identify a disease the patient may develop in the near future.
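A minimal sketch of this clustering idea, assuming invented symptom measurements and using k-means as one common clustering algorithm (the chapter does not prescribe a specific one):

```python
# Toy unsupervised learning: the same kind of symptom measurements,
# but with NO labels; k-means groups similar profiles together.
from sklearn.cluster import KMeans

A = [[240, 150], [180, 120], [260, 160],
     [170, 110], [250, 155], [190, 118]]   # [cholesterol, systolic BP]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(A)

# Patients with similar symptom profiles land in the same cluster.
labels = list(km.labels_)
print(labels)
```

Rows 0, 2, and 4 (high cholesterol and blood pressure) fall into one cluster and rows 1, 3, and 5 into the other, without any label ever being supplied.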

1.2.3 Reinforcement Learning

In the field of ML, reinforcement learning was pioneered by John Andreae in 1963, when he invented a system called STELLA [27]. It is a dynamic approach that works on the concept of feedback [28–31]. Reinforcement, for a machine, is the reward it receives upon acting in the environment: when the machine acts, it receives an evaluation of its actions (the reinforcement), but it is not told which action is the correct one for achieving the goal. The machine's utility is defined by the feedback function [32], and the objective is to maximize the expected feedback. The block diagram of reinforcement learning is shown below in Figure 1.6.

The figure shows that the machine first performs actions in the environment and then starts to receive feedback, which may be positive or negative. Positive feedback is kept inside the machine as knowledge, while the machine tries to learn from negative feedback so that such an incident does not happen again. Another important aspect of reinforcement learning is the state, which also provides situational input to the machine for learning purposes.

Figure 1.6 Block diagram of reinforcement learning.

A few key points of reinforcement learning are as follows:

Input to the reinforcement learning process: an initial state.

Output of the reinforcement learning process: diverse solutions may be present, depending on the feedback obtained.

The training process is purely based on the input.

The reinforcement learning model is a continuous process.

The best solution is the one that yields the maximum positive feedback.

An example of reinforcement learning could be a person suffering from high cholesterol and high blood pressure. He visits his family doctor and requests medication. After analyzing the symptoms, the doctor prescribes a diet chart and a set of medicines to lower the cholesterol level and blood pressure. He takes the medicines and feels better: the patient receives positive feedback in the form of the results of the medication. Motivated by this, the patient will consume only a low-fat, low-sodium diet to keep blood pressure and cholesterol down. If the levels do not go down, the patient will return to the doctor, and more tests will be considered to lower the parameters used to evaluate his heart.
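The diet example can be caricatured as a tiny reward-driven learner. The environment, action names, and reward values below are all invented for illustration; the update rule is a simple one-state value estimate, not a full reinforcement-learning algorithm.

```python
# Toy reinforcement learning: the agent is never told which action is
# "correct"; it only receives reward feedback and updates its estimates.
import random

random.seed(0)
actions = ["low_fat_diet", "high_fat_diet"]   # illustrative action names
q = {a: 0.0 for a in actions}                 # the agent's value estimates
alpha = 0.5                                   # learning rate (assumed)

def reward(action):
    # Positive feedback for the healthy action, negative otherwise.
    return 1.0 if action == "low_fat_diet" else -1.0

for _ in range(100):
    a = random.choice(actions)                 # explore the environment
    q[a] += alpha * (reward(a) - q[a])         # learn from the feedback

best = max(q, key=q.get)
print(best)
```

After repeated trials the value estimate for the rewarded action dominates, so the agent's policy converges on the action with maximum positive feedback.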

1.3 Classification and its Types

Classification is a task in ML that deals with the organized process of assigning a class label to an observation from the problem domain; it is a sub-group of the supervised form of ML. The traditional classification scheme goes back to the Swedish botanist Carl Linnaeus, as depicted in Ref. [33]. In the process of computing the desired output in supervised learning, classification is most effective when the input attribute is discrete. The classification approach always helps the user take decisions by providing classified conclusions from the observed data and values, as discussed in Refs. [34–36]. Figure 1.7 presents a classification graph built from data on different persons, some suffering from heart disease and some not.

In the figure, patients suffering from heart disease are represented by triangle symbols, and those who are not by rectangle symbols. The hyperplane (partition) line depicts the bifurcation between these two classes. In general, there are four types of classification techniques:

Figure 1.7 Concept of classification.

Binary Classification: classification tasks with exactly two class labels, conventionally one for the normal state and one for the abnormal state [37].

Imbalanced Classification: classification tasks in which the examples are unequally distributed across the classes [38].

Multi-Label Classification: classification tasks with two or more class labels, where one or more labels may be predicted for each example [39].

Multi-Class Classification: classification tasks where the number of class labels is greater than two [40].

Figure 1.8 Classification based on gender.

To demonstrate the classification approach more concretely, a heart disease dataset [41] has been used comprising a total of 1,025 people, of whom 312 are females and 713 are males. A particular reason for choosing this dataset is that people continue to suffer from heart disease, driven by excessive alcohol consumption, oily and fast food, and the inhalation of dangerous gases due to pollution. The classification by gender is shown below in Figure 1.8.
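The four classification settings above differ mainly in the shape of their label sets, which a short sketch makes concrete. All label values here are invented toy examples, not drawn from the heart disease dataset.

```python
# Toy label sets illustrating the four classification settings.
from collections import Counter

binary     = [0, 1, 0, 1, 1, 0]                     # exactly two classes
multiclass = ["liver", "heart", "lung", "heart"]    # more than two classes
multilabel = [["chest_pain"], ["chest_pain", "high_bp"], []]  # labels per example
imbalanced = [0] * 95 + [1] * 5                     # unequal class distribution

print(sorted(set(binary)))          # the two binary labels
print(len(set(multiclass)))         # number of distinct classes
print(Counter(imbalanced))          # skewed class counts
```

In the multi-label case each example carries its own (possibly empty) list of labels, whereas binary, multi-class, and imbalanced tasks assign exactly one label per example.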

1.4 Regression

Regression is a very powerful type of statistical analysis, used for finding the strength and character of the relationship between one dependent variable and a series of independent variables [42–44]. The analysis indicates whether any updating of the model is warranted in the future. The regression operation gives a researcher the ability to identify the best parameters of a topic for analysis, and likewise the parameters that should not be used for analysis.

In the field of ML, linear regression is the most common type of regression analysis for the purpose of prediction [45]. In this statistical analysis, equations are formed to identify the useful and non-useful parameters, using simple linear regression as well as multiple linear regression [46–49]. Linear regression is presented in Equation (1.1) and multiple linear regression in Equation (1.2).

(1.1) B = n + qA + i

(1.2) B = n + q1A1 + q2A2 + … + qkAk + i

where B is the dependent variable; A (or the Aj, j = 1, …, k) are the independent variables; n is the intercept; q (or the qj, j = 1, …, k) are the slope coefficients; i is the regression residual; and k is any natural number.
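Equation (1.1) can be fitted by ordinary least squares. The sketch below uses invented numbers chosen so that the true intercept n and slope q are known in advance, purely to show the mechanics of the fit.

```python
# Fitting Equation (1.1), B = n + qA + i, by least squares on toy data.
import numpy as np

A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # independent variable
B = 10.0 + 2.0 * A                          # exact line: n = 10, q = 2

X = np.column_stack([np.ones_like(A), A])   # design matrix [1, A]
(n, q), *_ = np.linalg.lstsq(X, B, rcond=None)

print(round(n, 3), round(q, 3))
```

For Equation (1.2) one simply stacks more columns A1, …, Ak into the design matrix; the same least-squares call then returns the intercept and all k slope coefficients.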

For easy understanding, a case study on heart disease is discussed below. With the help of the regression approach, a prediction was made as to whether a person has heart disease or not. Here, the dependent variable is heart disease, and the independent variables are cholesterol level, blood pressure, etc. After analyzing the data, it was found that the patient has a heart problem, as presented on a 2D plane in Figure 1.9.

The steps required for regression analysis are [50]:

Select the dependent and independent variables.

Explore the correlation matrix along with the scatter plot.

Perform the linear or multiple regression operation.

Deal with the outliers and with multi-collinearity.

Perform the t-test.

Handle the insignificant variables.

Figure 1.9 Regression.

Figure 1.10 Cholesterol line fit plot.

The regression operation was performed on the heart disease dataset with respect to age and cholesterol, giving the results shown in Figure 1.10.

The figure shows a line fit plot depicting the line of best fit, known as the trend line. This trend line is based on a linear equation and presents the standard cholesterol level of a typical person with respect to age. The plot has two axes: the vertical axis depicts age and the horizontal axis depicts cholesterol values. A trend line may be linear, polynomial, or exponential, as discussed in Refs. [51–53]. The regression analysis on the heart disease dataset yields the numerical interpretation presented in Table 1.1.

where:

Multiple R (correlation coefficient): depicts the strength of the linear relationship between the two variables, i.e., age and cholesterol. This value always lies between −1 and +1; the obtained value of 0.972834634 indicates a strong relationship between age and cholesterol level.

R2: the coefficient of determination, i.e., the goodness of fit. The obtained value of 0.946407225 indicates that about 95% of the values in the heart disease dataset fit the regression model.

Adjusted R2: an upgraded version of R2 that adjusts for the number of predictors in the model. This value increases only when a new term improves the model more than would be expected by chance, and decreases otherwise. The obtained value of 0.945430663 is close to R2, indicating that the chosen predictors contribute meaningfully to the model.

Standard error: measures the precision of the regression model; the smaller the number, the more accurate the results. The obtained value of 12.7814549 indicates that the results are close to the accurate values. The standard error depicts how well the data has been approximated.
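The four statistics above can be computed directly from any least-squares fit. The sketch below uses synthetic age/cholesterol-style data (invented numbers, so the values will differ from Table 1.1); the formulas themselves are the standard definitions.

```python
# Computing Multiple R, R-squared, adjusted R-squared, and standard error
# for a simple linear fit on synthetic (illustrative) data.
import numpy as np

np.random.seed(0)
age = np.random.uniform(30, 70, 100)                  # illustrative ages
chol = 3.0 * age + 40 + np.random.normal(0, 12, 100)  # illustrative cholesterol

X = np.column_stack([np.ones_like(age), age])
coef, *_ = np.linalg.lstsq(X, chol, rcond=None)
resid = chol - X @ coef

k = 1                                                 # number of predictors
n_obs = len(chol)
ss_res = np.sum(resid ** 2)
ss_tot = np.sum((chol - chol.mean()) ** 2)

r2 = 1 - ss_res / ss_tot                              # coefficient of determination
multiple_r = np.sqrt(r2)                              # strength of the relationship
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / (n_obs - k - 1) # penalizes extra predictors
std_err = np.sqrt(ss_res / (n_obs - k - 1))           # residual standard error

print(round(r2, 3), round(adj_r2, 3))
```

Note that adjusted R2 is always at most R2; the gap between the two widens as predictors that add little explanatory power are included.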

Table 1.1 Regression statistics.

Regression Statistics

Multiple R

0.972834634

R Square

0.946407225

Adjusted R Square

0.945430663

Standard Error

12.7814549

Observations

1,025

1.4.1 Logistic Regression

Logistic regression is a statistical model used for identifying the probability of a class with the help of a binary dependent variable, i.e., Yes or No: it indicates whether an observation belongs to the Yes category or the No category. For example, the results of an event may be win or loss, pass or fail, accept or not accept, etc. Mathematically, the logistic regression model uses two indicator values, 0 and 1. It differs from the linear regression technique as depicted in Ref. [54]. Because logistic regression is important in real-life classification problems, as depicted in Refs. [55, 56], fields such as medical science, social science, and ML use this model in their various operations.

Logistic regression was performed on the heart disease dataset [41]. The Receiver Operating Characteristic (ROC) curve was calculated, with the true positive rate plotted on the y-axis and the false positive rate on the x-axis. After performing logistic regression in Python (Google Colab), the outcome is represented in Figure 1.11 and Table 1.2: Figure 1.11 shows the ROC curve and Table 1.2 the Area Under the ROC Curve (AUC).

The AUC value obtained on the training data (Table 1.2) is 0.8374022, while on the test data the result is outstanding (0.9409523). This indicates that the model is more than 90% efficient for classification. The next section discusses the difference between linear and logistic regression.
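The train/test AUC workflow behind Table 1.2 can be sketched as follows. The data here is synthetic (three invented symptom-like features), so the AUC value will differ from the chapter's figures; the evaluation steps are the same.

```python
# Logistic regression with a train/test split and AUC evaluation,
# on synthetic illustrative data (not the chapter's dataset).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))                 # 3 symptom-like features
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)

# AUC uses the predicted probability of the positive class.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```

Scoring on held-out test data, as in Table 1.2, guards against an optimistic AUC that merely reflects memorization of the training set.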

Figure 1.11 ROC curve for logistic regression.

Table 1.2 AUC: Logistic regression.

Parameter: The area under the ROC Curve (AUC)
Training Data: 0.8374022 (Excellent)
Test Data: 0.9409523 (Outstanding)

Index: 0.5: No Discriminant, 0.6–0.8: Can be considered accepted, 0.8–0.9: Excellent, >0.9: Outstanding

1.4.2 Difference between Linear & Logistic Regression

Linear and logistic regression are two common types of regression used for prediction, with the result of the prediction represented by numeric variables. The difference between linear and logistic regression is depicted in Table 1.3 for easy understanding.

Linear regression models the data with a straight line, whereas logistic regression models the probability of a binary event as a logistic function of the independent variables. A few other types of regression analysis, described by different researchers, are listed below.

Table 1.3 Difference between linear & logistic regression.

1. Purpose. Linear regression: used for solving regression problems. Logistic regression: used for solving classification problems.

2. Variables involved. Linear regression: continuous variables. Logistic regression: categorical variables.

3. Objective. Linear regression: finding the best-fit line and predicting the output. Logistic regression: finding the s-curve and classifying the samples.

4. Output. Linear regression: continuous values such as age, price, etc. Logistic regression: categorical values such as 0 & 1, Yes & No.

5. Collinearity. Linear regression: there may be collinearity between independent attributes. Logistic regression: there should not be collinearity between independent attributes.

6. Relationship. Linear regression: the relationship between the dependent variable and the independent variables must be linear. Logistic regression: the relationship between the dependent variable and the independent variables may not be linear.

7. Estimation method used. Linear regression: least squares estimation. Logistic regression: maximum likelihood estimation.

Polynomial Regression: It is used for curvilinear data [57–58].

Stepwise Regression: It works with predictive models [59–60].

Ridge Regression: Used for multiple regression data [61–62].

Lasso Regression: Used for the purpose of variable selection & regularization [63–64].

Elastic Net Regression: Used when the penalties of lasso and ridge method are combined [65].
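A quick sketch of three of the penalized variants listed above (ridge, lasso, and elastic net) on synthetic data. The data is invented so that only the first feature truly matters, which lets the lasso's variable-selection behavior show itself; the penalty strengths are arbitrary illustrative choices.

```python
# Ridge, lasso, and elastic net on synthetic data where only feature 0
# carries signal; the lasso drives the irrelevant coefficients to zero.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, 100)   # only feature 0 matters

ridge = Ridge(alpha=1.0).fit(X, y)            # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.5).fit(X, y)            # L1 penalty: can zero them out
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)  # combined penalties

print(round(float(ridge.coef_[0]), 2), round(float(lasso.coef_[0]), 2))
```

Ridge keeps all five coefficients nonzero but small, while the lasso (and, to a lesser extent, elastic net) performs variable selection by setting the four irrelevant coefficients exactly to zero.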

1.5 Random Forest

The random forest was first introduced by Tin Kam Ho [66]. Random forest is a supervised ensemble learning method that solves both regression and classification problems. As an ensemble (bagging) algorithm, it works by averaging the results of many trees and thereby reducing overfitting [67–71]. It is a flexible, ready-to-use machine learning method; used for regression, random forests are known as regression forests [72]. The method can cope with missing values, but brings added complexity and a longer training period. There are two specific reasons for calling it Random:

When building trees, a random sample of the training dataset is drawn.

When splitting nodes, a random subset of features is considered.

The functioning of random forests is illustrated in Figure 1.12.

In the figure, five trees are shown, each voting for a disease by color: blue represents liver disease, orange represents heart disease, green represents stomach disease, and yellow represents lung disease. As per the majority color, orange (heart disease) is the winner.

Figure 1.12 Random forest.

This concept is known as the wisdom of the crowd, as discussed in Ref. [73]. The execution of the method rests on two concepts, listed below.

Bagging: decision trees are very sensitive to the data they are trained on; a small change in the data can have diverse effects on the model, and the structure of the tree can change completely. Random forests take advantage of this by allowing each tree to randomly sample the dataset with replacement, which results in different trees. This is called bagging, or bootstrap aggregation [74–75].

Random Feature Selection: normally, when a node is split, every possible feature is considered and the one that produces the most separation is chosen. In a random forest, only a random subset of features is considered at each split. This allows more variation among the trees and results in greater diversification [76].
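Both sources of randomness correspond directly to constructor options in a typical random-forest implementation. The sketch below uses synthetic data (invented, with a single informative feature) and scikit-learn's implementation as one concrete example.

```python
# Random forest showing the two randomness mechanisms described above:
# bootstrap row sampling (bagging) and per-split feature subsets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)          # synthetic, separable labels

forest = RandomForestClassifier(
    n_estimators=50,
    bootstrap=True,        # bagging: each tree sees a bootstrap sample
    max_features="sqrt",   # random subset of features at each split
    random_state=0,
).fit(X, y)

# Each tree votes; the majority class wins ("wisdom of the crowd").
acc = forest.score(X, y)
print(round(acc, 3))
```

Because each of the 50 trees is trained on a different bootstrap sample and splits on different feature subsets, their errors are only weakly correlated, which is what makes the averaged vote stronger than any single tree.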

The random forest concept was also applied to the heart disease dataset; low correlation between the individual models is the key. The Area Under the ROC Curve (AUC) for the random forest, computed in Python (Google Colab), is shown in Table 1.4 and Figure 1.13.

The table reports the area under the receiver operating characteristic curve (AUC), which measures the degree of separability. The value obtained on the training data is 1.0000000 and the value on the test data is also 1.0000000, both attaining an outstanding remark on the AUC scale. The result indicates that the model performs outstandingly on the heart disease dataset.

Table 1.4 AUC: Random forest.

Parameter: The area under the ROC Curve (AUC)
Training Data: 1.0000000 (Outstanding)
Test Data: 1.0000000 (Outstanding)

Index: 0.5: No Discriminant, 0.6–0.8: Can be considered accepted, 0.8–0.9: Excellent, >0.9: Outstanding

Figure 1.13 ROC curve for random forest.

1.6 K-Nearest Neighbor

K-nearest neighbor (KNN) belongs to the category of supervised classification algorithms and hence needs labeled data for training [77, 78]. In this approach, the value of K is chosen by the user. KNN can be used for both classification and regression, provided the attributes are known. The algorithm labels a new data point according to the K closest training points.

On the heart disease dataset, too, the Area Under the ROC Curve (AUC) was used. The ROC curve is the most basic tool for judging a classifier's performance in medical decision-making concerns [79–81]: it is a graphical plot of the diagnostic ability of a binary classifier. The ROC curve generated for KNN on the heart disease dataset [41] is presented below in Figure 1.14.

In the figure, the true positive rate (probability of detection) is on the y-axis and the false positive rate (probability of false alarm) is on the x-axis. The false positive rate is the proportion of known-negative cases for which the predicted condition is positive.

The AUC for K-nearest neighbor on the heart disease dataset [41], computed in Python (Google Colab), is shown below in Table 1.5.

Figure 1.14 ROC curve for k-nearest neighbor.

Table 1.5 AUC: K-nearest neighbor.

Parameter: The area under the ROC Curve (AUC)
Training Data: 1.0000000 (Outstanding)
Test Data: 1.0000000 (Outstanding)

Index: 0.5: No Discriminant, 0.6–0.8: Can be considered accepted, 0.8–0.9: Excellent, >0.9: Outstanding

The value obtained on the training data is 1.0000000 and the value on the test data is also 1.0000000, both attaining an outstanding remark on the AUC scale. The result shows that KNN performs outstandingly on the dataset.
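The majority-vote mechanism of KNN can be sketched in a few lines. The symptom values and labels below are invented for illustration; with K = 3, a new point takes the label held by the majority of its three nearest training points.

```python
# Toy KNN: the user picks K, and a new point is labeled by majority vote
# among its K closest training points. All numbers are illustrative.
from sklearn.neighbors import KNeighborsClassifier

X = [[240, 150], [250, 155], [260, 160],   # high-risk profiles
     [170, 110], [180, 118], [190, 120]]   # low-risk profiles
y = [1, 1, 1, 0, 0, 0]                      # labels required for training

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
pred = knn.predict([[245, 152]])[0]         # neighbors are all high-risk
print(pred)
```

The choice of K trades off noise sensitivity (small K) against over-smoothing (large K); odd values of K avoid ties in binary problems.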

1.7 Decision Trees

A decision tree is a form of supervised machine learning; the method traces back to William Belson in 1959 [82]. It predicts response values by learning decision rules derived from features [83–84]. Decision trees are good for evaluating options and are used in operations research and decision analysis. An example decision tree for whether a person has heart disease is presented below in Figure 1.15 for easy understanding.

The figure answers the question "Does a person have heart disease or not?" by working through the conditions to a conclusion. First, it is checked whether the person has chest pain. If yes, the blood pressure is checked next; in this figure, whether the blood pressure is high or low, the person is classified as suffering from heart disease. If the person does not have chest pain, he is not suffering from heart disease. After implementing the decision tree on the heart disease dataset [41] in Python (Google Colab), the AUC values generated are presented in Table 1.6.
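The rules of Figure 1.15 are simple enough to write out directly as nested conditions; this sketch encodes the figure's logic as described above (in which chest pain leads to a disease classification regardless of the blood-pressure branch), with invented argument names.

```python
# The decision rules of Figure 1.15 as plain nested conditions.
# 1 = has heart disease, 0 = does not; logic follows the figure as described.
def diagnose(chest_pain: bool, high_blood_pressure: bool) -> int:
    if not chest_pain:
        return 0          # no chest pain -> no heart disease
    if high_blood_pressure:
        return 1          # chest pain + high BP -> heart disease
    return 1              # the figure classes this branch as disease too

print(diagnose(True, True), diagnose(False, False))
```

A learned decision tree works the same way, except that the split conditions and thresholds are derived automatically from the training data rather than written by hand.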

Figure 1.15 Decision tree.

Table 1.6 AUC: Decision trees.

Parameter: The area under the ROC Curve (AUC)
Training Data: 0.9588996 (Outstanding)
Test Data: 0.9773333 (Outstanding)

Index: 0.5: No Discriminant, 0.6–0.8: Can be considered accepted, 0.8–0.9: Excellent, >0.9: Outstanding

The value obtained on the training data is 0.9588996 and the value on the test data is 0.9773333, both attaining an outstanding remark on the AUC scale. The result indicates that the decision tree model performs outstandingly on the heart disease dataset.

1.8 Support Vector Machines

The original Support Vector Machine (SVM) algorithm was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963 [85]. In machine learning, a support vector classifier fits the data the user provides and returns the best-fit hyperplane that categorizes the data; once the hyperplane is obtained, the user can feed features to the classifier to check the predicted class [86–87]. SVMs are used for analyzing data for either regression or classification. Taking the same example of classifying whether a person suffers from heart disease, but in a more detailed view, the idea is depicted in Figure 1.16.
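The fit-then-predict cycle just described can be sketched on toy data. The symptom values and labels below are invented, and a linear kernel is chosen so the learned boundary is literally a separating hyperplane, as in the two-class picture above.

```python
# Toy support vector classifier: fit a linear hyperplane to two-class
# data, then predict the class of new feature vectors. Data is invented.
from sklearn.svm import SVC

X = [[240, 150], [250, 155], [260, 160],   # class 1: heart disease
     [170, 110], [180, 118], [190, 120]]   # class 0: healthy
y = [1, 1, 1, 0, 0, 0]

svm = SVC(kernel="linear").fit(X, y)       # best-fit separating hyperplane
pred = svm.predict([[255, 158]])[0]        # feed new features to the classifier
print(pred)
```

Only the training points closest to the boundary (the support vectors) determine the hyperplane; the remaining points could move without changing the classifier.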