This book is an essential resource on the impact of AI in medical systems, helping readers stay ahead in the modern era with cutting-edge solutions, knowledge, and real-world case studies.
Wellness Management Powered by AI Technologies explores the intricate ways machine learning and the Internet of Things (IoT) have been woven into the fabric of healthcare solutions. From smart wearable devices tracking vital signs in real time to ML-driven diagnostic tools providing accurate predictions, readers will gain insights into how these technologies continually reshape healthcare.
The book begins by examining the fundamental principles of machine learning and IoT, providing readers with a solid understanding of the underlying concepts. Through clear and concise explanations, readers will grasp the complexities of the algorithms that power predictive analytics, disease detection, and personalized treatment recommendations. In parallel, they will uncover the role of IoT devices in collecting data that fuels these intelligent systems, bridging the gap between patients and practitioners.
In the following chapters, readers will delve into real-world case studies and success stories that illustrate the tangible benefits of this dynamic duo. This book is not merely a technical exposition; it serves as a roadmap for healthcare professionals and anyone invested in the future of healthcare.
Readers will find the book:
Audience
This book is a valuable resource for policymakers and for researchers, industry professionals, and engineers from diverse fields such as computer science, artificial intelligence, electronics and electrical engineering, and healthcare management.
Page count: 601
Year of publication: 2024
Cover
Table of Contents
Series Page
Title Page
Copyright Page
Preface
1 Exploring Functional Modules Using Co-Clustering of Protein Interaction Networks
1.1 Introduction
1.2 Related Works
1.3 Basic Terminologies
1.4 Existing Methods
1.5 About Dataset
1.6 Experimental Environment
1.7 Validation Measures
1.8 Biological Significances
1.9 Proposed Co-Clustering Approach: MR-CoC
1.10 Functional Module Mining Using MR-CoC
1.11 Conclusion
Appendix
References
2 Natural Language Processing in Healthcare: Enhancing Wellbeing through a COVID-19 Case Study
2.1 Introduction
2.2 NLP Approaches
2.3 NLP Pipeline for Smart Healthcare
2.4 Applications of NLP in Healthcare
2.5 COVID Detection Using NLP
2.6 Results and Discussion
2.7 Conclusion
References
3 Artificial Intelligence Assisted Internet of Medical Things (AIoMTs) in Sustainable Healthcare Ecosystem
3.1 Introduction
3.2 Medical Wearable Electronics
3.3 Electronic Signals in Sensors
3.4 Electronic Devices Challenges in the AIoMT
3.5 AIoMT Benefits
3.6 AIoMTs Challenges
3.7 AIoMT Limitations
3.8 Future Research Direction
3.9 Conclusions and Future Scope
References
4 An Online Platform for Timely Access to Medical Care with the Help of Real-Time Data Analysis
4.1 Introduction
4.2 What Happened
4.3 Literature Review
4.4 Methodology
4.5 Hardware Component
4.6 Conclusion
4.7 Future Work
References
5 A Comprehensive Review of Cardiac Image Analysis for Precise Heart Disease Diagnosis Using Deep Learning Techniques
5.1 Introduction and Major Contribution
5.2 Literature Review
5.3 Machine Learning Methods
5.4 Proposed System
5.5 Mathematical Model
5.6 Data Preparation
5.7 Model Training and Evaluation
5.8 Results and Discussion
5.9 Conclusion and Future Work
References
6 A Hybrid Machine Learning Model for an Efficient Detection of Liver Inflammation
Abbreviations
6.1 Introduction
6.2 Machine Learning for Liver Disease Prediction
6.3 Related Works
6.4 Experimental Analysis
6.5 Result Evaluation
6.6 Conclusion
6.7 Enhancement of PCA Over Other Dimensionality Reductions
References
7 Advancements in Parkinson’s Disease Diagnosis through Automated Speech Analysis
7.1 Introduction
7.2 Speech Characteristics in Parkinson’s Disease
7.3 Technological Advances in Speech Analysis
7.4 Integration of Multimodal Data
7.5 Related Works
7.6 Building a Machine Learning (ML) Model
7.7 Experimental Analysis and Performance Measures
7.8 Future Directions
7.9 Challenges and Limitations
7.10 Conclusion and Implications
References
8 Public Opinion Segmentation on COVID-19 Vaccination and Its Impact on Wellbeing
8.1 Introduction
8.2 Background and Related Work
8.3 Machine Learning Techniques
8.4 Ensemble Machine Learning Algorithms
8.5 Methodology
8.6 Results and Discussion
8.7 Impact on Wellbeing
8.8 Conclusion
References
9 Revolutionizing Healthcare with IoT in Cardiology
9.1 Introduction
9.2 Background
9.3 Motivation
9.4 Primary Diseases Globally
9.5 IoT Revolutionizes Healthcare
9.6 IoT Patient Monitoring Devices and Early Detection of Heart-Related Problems
9.7 An IoT-Based Heart Disease Monitoring System
9.8 Conclusions
References
10 Human Biological Analysis Through Fitness Watch Using Deep Learning Algorithm
10.1 Introduction
10.2 Literature Survey
10.3 Methodology
10.4 Results and Discussion
10.5 Limitation of the Work
10.6 Validation and Comparative Analysis
10.7 Conclusion
References
11 Decoding Kidney Health: Effectiveness of Machine Learning Techniques in Diagnosis of Chronic Kidney Disease
11.1 Introduction
11.2 Methods
11.3 Methodology
11.4 Results and Discussion
11.5 Conclusion
References
12 Integrating Metaheuristics and Machine Learning for Wellbeing Management: Case of COVID-19
12.1 Introduction
12.2 Related Work
12.3 Background Knowledge
12.4 Methodology
12.5 Results and Discussions
12.6 Conclusion
References
13 Fusing Sentiment Analysis with Hybrid Collaborative Algorithms for Enhanced Recommender Systems
13.1 Introduction
13.2 Literature Survey
13.3 Comparative Result Study
13.4 Conclusion and Future Scope
References
14 The Future of Well-Being: AI-Powered Health Management with Privacy at its Core
14.1 Introduction
14.2 Related Works
14.3 Proposed Work
14.4 Performance Evaluation
14.5 Conclusion and Future Work
References
15 Artificial Pancreas: Enhancing Glucose Control and Overall Well-Being
15.1 Introduction
15.2 Closed-Loop Diabetes Control System
15.3 Testing and Regulatory Approvals
15.4 Safety Requirements in the Design of Artificial Pancreas
References
Index
End User License Agreement
List of Tables
Chapter 1
Table 1.1 Literature study for functional module mining.
Table 1.2 A review on existing approaches for binary co-clustering.
Table 1.3 SCoCnsym: Toy example.
Table 1.4 Synthetic dataset description: SCoCnsym.
Table 1.5 Comparative analysis of SCoCnsym based on match score measure.
Table 1.6 Comparative analysis of SCoCnsym based on computational time.
Table 1.7 Types of seeding tested on synthetic Dataset_rand_I in SCoCrand appr...
Table 1.8 Comparative analysis of SCoCrand based on match score measure of co-...
Table 1.9 Comparative analysis of SCoCrand based on computational time (in sec...
Table 1.10 Key-value pairs of MR-CoC.
Table 1.11 Synthetic datasets for the MR-CoC approach.
Table 1.12 Comparative analysis of MR-CoC based on match score measure.
Table 1.13 Comparative analysis of MR-CoC based on computational time.
Table 1.14 Proposed methods vs. focused issues.
Table 1.15 MR-CoC: biological functionalities of few UM protein modules.
Table 1.16 Drug target-based biological functionalities of UM functional modul...
Table 1.17 List of abbreviations.
Chapter 2
Table 2.1 Features in the collected dataset.
Table 2.2 Classification report of machine learning algorithms using the propo...
Table 2.3 Classification report of ensemble machine learning algorithms using ...
Chapter 4
Table 4.1 Survey on previously implemented techniques.
Chapter 5
Table 5.1 Accuracy performance.
Table 5.2 Error performance employing different algorithms.
Chapter 6
Table 6.1 Different possible datasets of liver patients.
Table 6.2 Correlation between PCA and variance.
Table 6.3 Evaluation results of liver dataset.
Chapter 7
Table 7.1 Recent works in Parkinson’s disease (PD) detection.
Table 7.2 Experimental performance analysis of the classifiers.
Table 7.3 Hyperparameters tuning for different kernels.
Table 7.4 Comparative study of the recent studies of PD detection models.
Chapter 8
Table 8.1 Hyperparameter tuning results.
Table 8.2 Classification report of various machine and ensemble learning algor...
Chapter 9
Table 9.1 Primary diseases globally.
Table 9.2 Average heart rate [72].
Chapter 10
Table 10.1 Gender-based health analytics.
Table 10.2 Age-based health analytics.
Chapter 11
Table 11.1 Confusion matrix.
Table 11.2 Accuracies of kidney disease diagnosis models.
Table 11.3 Confusion matrix of kidney disease diagnosis models.
Table 11.4 Precision recall and F1 score of diagnosis models.
Chapter 12
Table 12.1 LSTM parameters’ configuration type.
Table 12.2 LSTM parameters’ values.
Table 12.3 Genetic algorithm parameters.
Table 12.4 Feature selection results for the total number of deaths in UAE.
Table 12.5 Genetic algorithm results for the total number of deaths in UAE.
Table 12.6 Cross-validation results for the total number of deaths in UAE.
Table 12.7 Benchmark model results for the total number of deaths in UAE.
Table 12.8 The GA-optimized LSTM results for the total number of deaths in Oma...
Table 12.9 Results of the GA-optimized LSTM.
Table 12.10 The GA-optimized LSTM results for the total number of cases in UAE...
Table 12.11 The GA-optimized LSTM results for the total number of cases in Bah...
Table 12.12 Different variants of LSTM results for the total number of cases i...
Chapter 13
Table 13.1 Summary of the reviewed literature.
Table 13.2 Comparison of the accuracy of various procedures.
Chapter 14
Table 14.1 Key contributions and limitations of existing privacy schemes.
Table 14.2 Performance metrics.
Table 14.3 Performance evaluation metrics comparison across systems.
List of Illustrations
Chapter 1
Figure 1.1 Workflow of the current research.
Figure 1.2 Central dogma of proteins [25].
Figure 1.3 Sample protein interaction network.
Figure 1.4 Sample protein modules.
Figure 1.5 Co-clustering: A toy example.
Figure 1.6 Sample illustration of BiMax algorithm.
Figure 1.7 Sample PIN dataset.
Figure 1.8 STRING database download page.
Figure 1.9 CORUM database statistics [29].
Figure 1.10 Protein complex statistics.
Figure 1.11 Workflow of MapReduce in MATLAB [30].
Figure 1.12 Sample input matrix and its heatmap.
Figure 1.13 Synthetic Dataset_nsym_I. (a) Noiseless; (b) noisy.
Figure 1.14 Synthetic Dataset_nsym_II. (a) Noiseless; (b) noisy.
Figure 1.15 Heatmap of synthetic Dataset_nsym_III. (a) Noiseless; (b) noisy.
Figure 1.16 Heatmap of synthetic Dataset_nsym_IV. (a) Noiseless; (b) noisy.
Figure 1.17 Synthetic Dataset_rand_I—implanted co-clusters at random portions ...
Figure 1.18 SCoCrand: row-wise seeds of Dataset_rand_I.
Figure 1.19 SCoCrand: column-wise seeds of Dataset_rand_I.
Figure 1.20 SCoCrand: random seeds of Dataset_rand_I.
Figure 1.21 Workflow of the MR-CoC approach.
Figure 1.22 Protein complex coverage with minimum protein module size 3.
Figure 1.23 Protein complex inclusion rate with module size 4 and above.
Figure 1.24 Protein complex inclusion with module size 5 and above.
Chapter 2
Figure 2.1 Pipeline for smart healthcare.
Figure 2.2 Proposed methodology.
Chapter 3
Figure 3.1 Essential emerging technologies and selected applications.
Figure 3.2 Enabling medical technologies applications.
Figure 3.3 AIoMTs and application.
Figure 3.4 Sampled IoMT application in medical attention.
Figure 3.5 AIoMT management processes.
Figure 3.6 Medical healthcare innovation.
Chapter 4
Figure 4.1 Ministry of road transport and highways.
Figure 4.2 Deaths amenable to healthcare.
Figure 4.3 Decision tree working.
Figure 4.4 K-means algorithm.
Figure 4.5 Conceptual framework.
Figure 4.6 Model visualization results.
Figure 4.7 Finding the shortest path.
Figure 4.8 RFID technology.
Chapter 5
Figure 5.1 Convolutional neural network (CNN) architecture.
Figure 5.2 Precision vs. sensitivity vs. specificity.
Figure 5.3 Accuracy of various algorithms.
Figure 5.4 MAE chart.
Figure 5.5 Kappa statistics chart.
Figure 5.6 Confusion matrix for ECG heartbeat.
Figure 5.7 Percentage of correctly classified by category.
Figure 5.8 Distribution of heartbeats classified correctly and incorrectly.
Figure 5.9 The training and validation accuracy and loss of a convolutional ne...
Chapter 6
Figure 6.1 Cumulative proteins vs albumin.
Figure 6.2 Integration of PCA with KNN for liver inflammation.
Figure 6.3 Random forest algorithm.
Figure 6.4 Confusion matrix before and after applying PCA.
Chapter 7
Figure 7.1 Human brain [27].
Figure 7.2 PD symptoms.
Figure 7.3 Sample screenshot of the dataset.
Figure 7.4 Model building process.
Figure 7.5 Correlation graph.
Figure 7.6 Accumulated explained variance.
Figure 7.7 Classification classes.
Figure 7.8 Comparison graph for the different classifiers under study.
Figure 7.9 Comparison for the different kernels with respect to the Parzen Win...
Chapter 8
Figure 8.1 Word cloud of positive tweets.
Figure 8.2 Word cloud of negative tweets.
Figure 8.3 Distribution of COVID-19 vaccination-based sentiments.
Chapter 9
Figure 9.1 The concept of IoT in healthcare [20].
Figure 9.2 Heartbeat sensor [52].
NOTE:
A group consisting of K. Butchi Raju,...
Figure 9.3 Smart heart disease prediction system incorporating IoT and fog com...
Figure 9.4 ECG sensor [53].
Figure 9.5 Blood pressure [54].
Figure 9.6 Heart rate monitor [55].
Figure 9.7 Pulse oximeter [56].
Figure 9.8 Temperature sensor [57].
Figure 9.9 Respiratory rate monitor [58].
Figure 9.10 Activity and movement sensors [59].
Figure 9.11 Sleep monitoring sensor [60].
Figure 9.12 Stress and anxiety monitor [61].
Figure 9.13 System flow chart.
Figure 9.14 Heart rate sensor connection with Arduino Uno [70].
Figure 9.15 Heart monitoring system using Blynk app [72].
Figure 9.16 (a): Arduino board [75], (b): original pictures of Arduino board c...
Figure 9.17 HC-05 Bluetooth [81].
Figure 9.18 Jumper wires [82].
Figure 9.19 Breadboard [83].
Figure 9.20 USB cable connecting PC with Arduino [84].
Figure 9.21 Result on the serial monitor.
Figure 9.22 Indicating low BP.
Figure 9.23 Normal BP.
Figure 9.24 High BP.
Figure 9.25 LCD display [85].
Chapter 10
Figure 10.1 Fitness interface of (a) teenage group, (b) youth group, and (c) m...
Figure 10.2 Dataset collected from the user using Google Forms.
Figure 10.3 Output graph of training and validation.
Figure 10.4 Gender graph.
Figure 10.5 Age group graph.
Figure 10.6 Count of predicted and actual BMI.
Figure 10.7 Actual BMI vs. predicted BMI.
Chapter 11
Figure 11.1 Proposed flow chart for kidney disease diagnosis.
Figure 11.2 Precision–recall and ROC curves depicting the diagnostic performan...
Figure 11.3 A bar chart visually represents the comparison of various machine ...
Chapter 12
Figure 12.1 LSTM structure.
Figure 12.2 Genetic algorithm lifecycle.
Figure 12.3 Stages of the framework for building element of decision forecasti...
Figure 12.4 Gene settings in GA.
Figure 12.5 CV RMSE for the different variants of LSTM for total number of dea...
Figure 12.6 GA-optimized LSTM predictions for day 14 for the total number of d...
Figure 12.7 GA-optimized LSTM predictions for day 14 for the total number of d...
Figure 12.8 Benchmark model results compared with actual values.
Figure 12.9 GA-optimized LSTM predictions for day 14 for the total number of d...
Figure 12.10 GA-optimized LSTM predictions for day 14 for the total number of ...
Figure 12.11 GA-optimized LSTM predictions for day 14 for the total number of ...
Figure 12.12 GA-optimized LSTM predictions for day 14 for the total number of ...
Figure 12.13 GA-optimized LSTM predictions for day 14 for the total number of ...
Figure 12.14 GA-optimized LSTM predictions for day 14 for the total number of ...
Figure 12.15 GA-optimized LSTM predictions for day 14 for the total number of ...
Figure 12.16 GA-optimized LSTM predictions for day 14 for the total number of ...
Figure 12.17 GA-optimized LSTM predictions for day 14 for the total number of ...
Figure 12.18 GA-optimized LSTM predictions for day 14 for the total number of ...
Chapter 13
Figure 13.1 Sentiment analysis.
Figure 13.2 Analysis of sentiment based on collaborative filtering.
Figure 13.3 Collaborative filtering.
Figure 13.4 HCF-based recommender system.
Figure 13.5 Comparison of various sentiment analysis techniques.
Chapter 14
Figure 14.1 Structure of the privacy revolution of federated learning.
Figure 14.2 Architecture of proposed AI-powered health management with privacy...
Figure 14.3 Adaptive AI framework.
Figure 14.4 Differential privacy framework.
Figure 14.5 Verifiable credentials and blockchain integration.
Figure 14.6 Federated learning auditing mechanism.
Figure 14.7 Model accuracy.
Figure 14.8 Performance evaluation across existing systems.
Chapter 15
Figure 15.1 Closed-loop diabetes control system.
Figure 15.2 SMBG glucose monitor.
Scrivener Publishing, 100 Cummings Center, Suite 541J, Beverly, MA 01915-6106
Machine Learning in Biomedical Science and Healthcare Informatics
Series Editors: Vishal Jain and Jyotir Moy Chatterjee
In this series, an attempt has been made to capture the scope of various applications of machine learning in the biomedical engineering and healthcare fields, with a special emphasis on the most representative machine learning techniques, namely deep learning-based approaches. Machine learning tasks are typically classified into two broad categories depending on whether there is a learning ‘label’ or ‘feedback’ available to a learning system: supervised learning and unsupervised learning. This series also introduces various types of machine learning tasks in the biomedical engineering field from classification (supervised learning) to clustering (unsupervised learning). The objective of the series is to compile all aspects of biomedical science and healthcare informatics, from fundamental principles to current advanced concepts.
Publishers at Scrivener: Martin Scrivener and Phillip Carmical
Edited by
Bharat Bhushan
School of Engineering and Technology, Sharda University, Greater Noida, India
Akib Khanday
Dept. of Computer Science and Software Engineering, United Arab Emirates University, UAE
Department of Computer Science, Samarkand International University of Technology, Samarkand, Uzbekistan
Khursheed Aurangzeb
College of Computer and Information Sciences, King Saud University, Riyadh, Kingdom of Saudi Arabia
Sudhir Kumar Sharma
KIET Group of Institutions, Delhi-NCR, Ghaziabad, India
and
Parma Nand
School of Engineering Technology, Sharda University, Greater Noida, India
This edition first published 2025 by John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA and Scrivener Publishing LLC, 100 Cummings Center, Suite 541J, Beverly, MA 01915, USA
© 2025 Scrivener Publishing LLC
For more information about Scrivener publications please visit www.scrivenerpublishing.com.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
Wiley Global Headquarters
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials, or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read.
Library of Congress Cataloging-in-Publication Data
ISBN 978-1-394-28699-7
Front cover images courtesy of Adobe Firefly
Cover design by Russell Richardson
Machine Learning (ML) and the Internet of Things (IoT) offer powerful applications and solutions when integrated. ML and IoT have become essential components of the healthcare industry, with hospitals, clinics, and healthcare providers adopting ML-powered diagnostic tools, wearable devices, and IoT-enabled patient monitoring systems. This adoption aims to improve patient care, reduce healthcare costs, and enhance service quality. ML and IoT are key growth areas within the technology sector, with tech companies, startups, and research institutions actively developing innovative solutions that leverage ML algorithms and IoT connectivity to address healthcare challenges.
This book explores how technology has become an essential component of modern healthcare solutions, from wearable smart devices that track vital signs in real time to machine learning-driven diagnostic tools that generate precise forecasts. IoT devices produce vast amounts of data from sensors and connected systems. By applying machine learning algorithms to this data, patterns can be identified that predict when a machine is likely to fail, enabling predictive maintenance, reducing downtime, and optimizing maintenance schedules.
IoT devices monitor patients' health metrics, while ML models analyze this data to detect abnormalities and alert healthcare professionals. Together, ML and IoT have optimized healthcare processes and contributed to proactive wellness management. They demonstrate how smart wearable devices, with real-time vital sign tracking, and ML-driven diagnostic tools, capable of highly accurate predictions, are reshaping the landscape of healthcare solutions. The convergence of ML and IoT has also led to innovative solutions in remote patient monitoring, early disease detection, mental health support, and personalized wellness plans.
We are grateful to the contributing authors for their dedication and expertise, and we extend our thanks to the reviewers who have provided invaluable feedback throughout the preparation of this volume. Finally, we thank Martin Scrivener and Scrivener Publishing for their support and publication.
The Editors
October 2024
R. Gowri1* and R. Rathipriya2†
1Department of Computer Science, AVS College of Arts and Science (Autonomous), Salem, Tamil Nadu, India
2Department of Computer Science, Periyar University, Salem, Tamil Nadu, India
This chapter introduces a new score-based co-clustering (SCoC) method for functional module mining (FMM) in protein interaction networks (PINs). The method addresses the drawbacks of previous approaches, including computational overhead, time consumption, and a disregard for quality and overlapping modules. Two revised versions of the SCoC method are proposed: MR-CoC and SCoCrand. Their performance is evaluated on synthetic datasets, which are constructed to impose specific conditions such as distributed co-clusters, matrix type, noise, and data size. The chapter also discusses how the proposed methods are implemented. In addition, MR-CoC is applied to functional module mining in the human protein interaction network. To assess the efficiency of MR-CoC, its results are compared with existing protein complexes, and the biological implications of the findings are examined.
Keywords: MapReduce, protein modules, functional modules, functional coherence, key-value pairs, molecular function
The main problem addressed in this research is functional module mining in a protein interaction network (PIN). Currently, functional modules are identified through laboratory experiments. The process involves selecting initial candidates and then adding or eliminating participants of the module based on lab experiments and analysis. Biologists choose the initial candidates manually, using specific tools suited to their requirements. Finding such candidates in a complex PIN is a tedious task for biologists [1]. This research aims to propose a computational solution to the problem, as shown in Figure 1.1.
Currently, the availability of biological networks is increasing due to technological developments in bioinformatics [2]. These networks are highly useful in the medical field for studying and analyzing the behaviors and functionality of various pathogens within the host organisms [3]. They are used for early detection of disease, disease diagnosis, prognosis, drug discovery, drug target identification, and so on.
They are also used to study the functionalities of various organisms, especially through their PINs, which communicate signals or information within the biological system [4]. Proteins are connected with each other to form a PIN. Protein complexes, or functional modules, are groups of densely connected proteins that perform a specific biological process.
Figure 1.1 Workflow of the current research.
Functional module mining is significant for the following reasons [5, 6]:
The central nodes of biological networks are vulnerable to targeted attacks by dangerous diseases.
It supports the prediction of key targets for tackling drug resistance in glioma (a malignant tumor).
Network modularity is correlated with the survivability of cancer (diseased) patients.
Pathogen infections (say, cancer) tend to be enriched in particular network modules.
Identifying significant network modules reveals the hallmark network modules that are predominantly activated in each tumor.
Neurodegenerative diseases (e.g., Parkinson’s disease) arise from many dysregulated physiological processes, which are identified from abnormalities in molecular networks.
Identifying therapeutic targets (drug target-based functional modules) in complex disorders is a major challenge in developing effective therapies for complex diseases.
Various approaches for identifying functional modules (sub-networks) exist in the literature. Some use graph-theoretical concepts to identify these sub-networks but suffer from poor scalability, neglect of overlapping sub-networks, high time consumption, and computational overhead [41–44]. The related works also show that such approaches do not consider any functional features when identifying functional modules.
The objective of this chapter is to overcome these issues and to extract functional modules based on their functional density measures using data-mining techniques. The score-based co-clustering with MapReduce (MR-CoC) approach is evaluated on the PIN of Homo sapiens, and the results are compared with the existing protein complexes. The approach mines the existing protein complexes efficiently. The biological significance of the unknown modules is then analyzed.
Biologists can use this approach to find novel functional modules in any PIN. It reduces the complexity of manually extracting functional modules and can be used to predict new drug targets for various diseases. The resulting modules can be tested further in laboratories for new functional module predictions.
This chapter is further organized as follows: Section 1.2 discusses the related research works of functional module mining and binary co-clustering approaches. Sections 1.3 to 1.8 discuss the terminologies, existing methods, datasets, experimental environment, validation measures, and biological significances. Section 1.9 presents the proposed method MR-CoC and its enhancements based on the comparative analysis. Section 1.10 elaborates the functional module mining using MR-CoC based on a comparative analysis of the results under different experimental setups, and analysis of experimental results for their biological significance. Section 1.11 summarizes the entire work carried out in this chapter.
Most functional module mining approaches in the literature rely on graph-theoretical algorithms and on the topological properties of the network. The related research articles are listed in Table 1.1. They focus on MCODE, edge sampling, clustering, and optimization techniques for functional module mining. Table 1.1 highlights the approaches, measures, and related issues (scalability, time consumption, computational overhead, functional measures, etc.). The tick mark and cross mark in Table 1.1 represent the presence and absence of the specified issue.
Among the surveyed works, MCODE [7] is the pioneering approach for protein complex detection. It detects cliques in the PIN based on network density. The complexity of MCODE is O(n^3).
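To make the notion of network density concrete, the following minimal sketch computes the density of the subgraph induced by a candidate module, that is, the fraction of possible protein pairs that actually interact. It is not the MCODE algorithm itself; the function name `module_density`, the toy edge list, and the protein labels are assumptions made purely for illustration.

```python
from itertools import combinations


def module_density(edges, module):
    """Return observed edges / possible edges for the subgraph induced by `module`."""
    nodes = sorted(set(module))
    n = len(nodes)
    if n < 2:
        return 0.0  # a single node has no possible edges
    # Store interactions as unordered pairs so (a, b) and (b, a) match
    edge_set = {frozenset(edge) for edge in edges}
    present = sum(
        1 for pair in combinations(nodes, 2) if frozenset(pair) in edge_set
    )
    possible = n * (n - 1) / 2
    return present / possible


# Hypothetical toy PIN; protein names are placeholders, not real data
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")]

clique_density = module_density(toy_edges, ["P1", "P2", "P3"])  # fully connected
sparse_density = module_density(toy_edges, ["P1", "P4", "P5"])  # one edge of three
```

A fully connected group (a clique) reaches the maximum density of 1, which is why density is a natural score for ranking candidate modules.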
The PIN is represented by its adjacency matrix, which is a binary symmetric matrix. A binary co-clustering approach is therefore proposed in this chapter, and the existing co-clustering approaches for binary data are reviewed in this section. Cheng and Church [10], Plaid [11], OPSM [12], and similar methods are designed for numerical data matrices. They rely on distance measures that do not suit binary datasets. Several approaches were developed specifically for co-clustering binary data: xMotif [13], BiMax [14], BicBin [15], BicSim [16], BMF [17], BiBit [18], BBK [19], BitTable [20], BiBinCons & BiBinAlter [21], and ParBiBit [22]. Detailed reviews of these approaches are presented in Table 1.2. Each approach is reviewed against the following criteria: method, binarization, overlapped co-clusters, scalability issue, computational overhead, parameter tuning, time consumption, parallelization, noise sensitivity, and remarks about the approach.
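The adjacency-matrix representation can be sketched as follows. The example builds a binary symmetric matrix from an undirected edge list and checks whether a chosen set of rows and columns forms an all-ones submatrix, which is one simple way to think of a binary co-cluster. The helper names, the toy network, and the all-ones criterion are illustrative assumptions, not the chapter's SCoC scoring.

```python
def adjacency_matrix(proteins, edges):
    """Build a binary symmetric adjacency matrix from an undirected edge list."""
    index = {protein: i for i, protein in enumerate(proteins)}
    matrix = [[0] * len(proteins) for _ in proteins]
    for a, b in edges:
        i, j = index[a], index[b]
        matrix[i][j] = 1
        matrix[j][i] = 1  # interactions are undirected, so mirror the entry
    return matrix


def is_symmetric(matrix):
    """Check that matrix[i][j] equals matrix[j][i] for every pair of indices."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


def is_all_ones(matrix, rows, cols):
    """Check whether the submatrix selected by `rows` and `cols` contains only 1s."""
    return all(matrix[r][c] == 1 for r in rows for c in cols)


# Hypothetical toy PIN; protein names are placeholders, not real data
proteins = ["P1", "P2", "P3", "P4"]
edges = [("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P2", "P4")]

pin = adjacency_matrix(proteins, edges)
symmetric = is_symmetric(pin)
# Rows P1, P2 against columns P3, P4 form a block of ones
block_found = is_all_ones(pin, rows=[0, 1], cols=[2, 3])
# Rows P1, P2 against columns P1, P2 do not (P1 and P2 do not interact)
no_block = is_all_ones(pin, rows=[0, 1], cols=[0, 1])
```

Because the matrix holds only 0s and 1s, distance measures built for numerical matrices carry little information here, which is the motivation the text gives for binary-specific methods.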
Table 1.1 Literature study for functional module mining.
| Title | Algorithm | Measure | Scalability issue | Ignore overlapped modules | Time consumption | Computational overhead | Functional features usage |
|---|---|---|---|---|---|---|---|
| An automated method for finding molecular complexes in large protein interaction networks [7] | MCODE | Network density | ✓ | ✓ | ✓ | ✓ | × |
| Detection of functional modules from protein interaction networks [4] | Clustering | Classification score | ✓ | × | ✓ | ✓ | × |
| Identifying functional modules in protein–protein interaction networks: an integrated exact approach [8] | Mathematical optimization | Modular scoring function | ✓ | ✓ | ✓ | ✓ | × |
| Protein interaction networks—more than mere modules [1] | Block method based on GO terms | Error minimization | ✓ | ✓ | ✓ | ✓ | × |
| Weighted consensus clustering for identifying functional modules in protein–protein interaction networks [9] | Combines 4 clustering algorithms | Cluster coefficient | ✓ | ✓ | ✓ | ✓ | × |
| BicNET: Flexible module discovery in large-scale biological networks using biclustering [2] | Biclustering | Biclusters | ✓ | ✓ | ✓ | ✓ | × |
From Table 1.2, xMotif and ParBiBit are greedy algorithms; BiMax and BBK are divide-and-conquer approaches; BiSim and BiBinCons are iterative approaches; BicBin is a model-based approach; BiBit is an exhaustive enumeration approach; and BitTable is an Apriori-type approach. All the existing approaches discussed here binarize numerical data before processing it. The xMotif, BiMax, and BBK algorithms ignore overlapped co-clusters in the dataset. All these approaches have computational overhead and scalability issues. Parameter tuning is necessary for better performance in every approach in Table 1.2 except xMotif. BicBin is a parallelized approach; thus, it consumes less time than the other existing approaches. Noise in the dataset reduces the efficiency of co-clustering, causing quality co-clusters to be missed. BicBin, BMF, and BiBinCons are the noise-insensitive approaches among the related works.
From the study, it is clear that the BiMax and xMotif approaches are used for comparative analysis in most of these existing methods. There is no benchmark algorithm for co-clustering.
Most of these existing approaches fail to extract all the embedded co-clusters from the given data. Their limitations are missed overlapped and quality co-clusters, scalability issues, computational overhead, and high time consumption.
Table 1.2 A review on existing approaches for binary co-clustering.
| S. no. | Existing approach | Method | Binarization | Overlapped co-clusters | Scalability issue | Computational overhead | Parameter tuning | Time consumption | Parallelization | Noise sensitivity | Remarks |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | xMotif (2003) | Greedy | ✓ | × | ✓ | ✓ | × | ✓ | × | ✓ | As it is a greedy approach, it may lose good co-clusters and may take wrong decisions; leukemia gene expression analysis |
| 2 | BiMax (2006) | Divide and conquer | ✓ | × | ✓ | ✓ | ✓ | ✓ | × | ✓ | Benchmark approach; best for limited-size datasets; extracts only pure constant maximal biclusters; gene expression data analysis |
| 3 | BicBin (2008) | Model based | ✓ | ✓ | ✓ | ✓ | ✓ | × | ✓ | × | Uses cost function to evaluate the sub-matrices |
| 4 | BiSim (2009) | Iterative | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × | ✓ | Used for gene expression analysis |
| 5 | BMF (2010) | Matrix multiplication | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × | × | Quality of co-clusters depends on the selection of the discretization method; fixing the number of factors ‘k’ is a big hurdle; performs dimensionality reduction of data |
| 6 | BiBit (2011) | Exhaustive enumeration | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × | ✓ | Uses Boolean algebraic operations; searches co-clusters in all possible combinations; suits limited data size; embryonic tumor gene expression analysis |
| 7 | BBK (2012) | Divide and conquer | ✓ | × | ✓ | ✓ | ✓ | ✓ | × | ✓ | Uses the Bron-Kerbosch backtracking approach to improve the BiMax approach; still faces the listed issues |
| 8 | BitTable (2014) | Apriori | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × | ✓ | Too many candidate itemsets; scans data once for each itemset; high memory consumption; uses various bit-table operations |
| 9 | BiBinCons, BiBinAlter (2015) | Iterative | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × | × | Performs an exhaustive search for a bicluster in each row and column combination; microarray data analysis |
| 10 | ParBiBit (2018) | Greedy | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Utilizes modern distributed memory systems efficiently; as it is a greedy method, it is likely to take wrong decisions |
The basic terminologies used in this research are categorized under biological terminologies and technical terminologies.
In this research, the biological terminologies cover a variety of biologically related topics. This section discusses several of them to aid understanding of the bioinformatics concepts. They are as follows:
Protein: Proteins are linear chains of amino acids, drawn from the 22 proteinogenic amino acids, and are described by their amino acid sequence. The data consist of one-dimensional arrays represented as long strings of letters [4, 23, 24], as shown in Figure 1.2. Each protein is encoded by a gene or a portion of a gene. Proteins are used to describe how biological systems operate, and they oversee a range of physiological and biochemical processes within the cell. The central dogma, depicted in Figure 1.2, illustrates the route by which proteins are produced.
Molecular Networks: Molecular networks are the interconnected relationships among biological components such as genes, RNA, and proteins. They represent the relationships, connectivity, and communication among these products, which arise through chemical reactions within the cell [4, 23, 24].
Protein Interaction Networks (PINs): PINs are the molecular networks formed by proteins. They highlight the connections among diverse proteins in various cellular compartments and are used for transmitting signals and commands to different parts of the system. Figure 1.3 shows a sample PIN [4, 23, 24].
Protein Modules: A protein module is a set of proteins that carries out a specific function within the cell. It is a component of a protein interaction network whose members are more strongly connected with each other than with other proteins [24]; these portions of the PIN are highly coherent. Figure 1.4 highlights the protein modules in the sample PIN [24].
Protein Complexes vs. Functional Modules: Both are protein modules. Protein complexes are stationary, meaning they remain in a fixed position. Functional modules are dynamic: they can form at any location and time and do not remain fixed in one place. Functional modules emerge as a result of pathogen infections and protein abnormalities. They are used to predict the presence of pathogen infections, diseases, and anomalies, and are employed in drug and drug target discovery, therapy recommendation, and treatment development for various complex disorders [23, 24].
Drug Targets: These are proteins that facilitate drug delivery to infected areas, enhance therapeutic effectiveness, and interact with the drug to improve treatment [26].
Figure 1.2 Central dogma of proteins [25].
Figure 1.3 Sample protein interaction network.
Figure 1.4 Sample protein modules.
The existing methods used for comparative analysis at various stages of this research, namely the binary co-clustering approaches (BiMax and xMotif) and the optimization approaches (PSO, GA, and Firefly), are discussed in detail in this section.
Data are co-clustered when their rows and columns are grouped at the same time, as seen in Figure 1.5. Co-clustering identifies and isolates specific patterns present within a certain region of the data. A co-cluster is a distinct collection of rows and columns that exhibit greater similarity, as shown in Figure 1.5.
Figure 1.5 Co-clustering: A toy example.
This study involves a comparative investigation of the proposed binary co-clustering strategy against the Binary inclusion-Maximal (BiMax) [14] and xMotif [13] algorithms.
Prelic et al. developed the Binary inclusion-Maximal (BiMax) co-clustering algorithm in 2006 for finding the maximal co-clusters in a binary data matrix. This algorithm is a pioneer and is used for comparative analysis by most co-clustering approaches. The pseudocode of BiMax is given in Algorithm 1.1. This approach has O(n^2 m^2 α) complexity, where n, m, and α are the numbers of genes, conditions, and inclusion-maximal co-clusters, respectively.
Figure 1.6 Sample illustration of BiMax algorithm.
Algorithm 1.1 BiMax.
Input: Input data matrix (E)
Output: Resulting sub-matrices
Step 1 : Divide the columns into two subsets, CU and CV
Step 2 : Sort the rows of E with the first row as a template:
    place in GU all genes expressed only under conditions in CU
    place in GW all genes expressed under conditions in both CU and CV
    place in GV all genes expressed only under conditions in CV
Step 3 : Define the combinations of genes GU, GW, GV and conditions CU, CV
Step 4 : Recursively decompose the U and V sub-matrices using Steps 1 to 4.
Figure 1.6 depicts the sample illustration of the BiMax approach. It is a reference-based divide-and-conquer approach used for binary co-clustering. It takes its first row as its reference to further group the data.
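The column split and row grouping in Steps 1 and 2 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the original BiMax implementation: the function name `bimax_partition`, the use of NumPy, and the choice of splitting the columns by the 1s of the template row are assumptions, and only a single split is shown rather than the full recursion.

```python
import numpy as np

def bimax_partition(E, template_row=0):
    """One BiMax-style split of a binary matrix E (rows = genes, columns = conditions).

    Hypothetical helper for illustration only: columns that are 1 in the
    template row form C_U, the remaining columns form C_V, and the rows are
    then sorted into G_U, G_W, and G_V.
    """
    E = np.asarray(E, dtype=bool)
    C_U = np.flatnonzero(E[template_row])     # columns where the template row has 1s
    C_V = np.flatnonzero(~E[template_row])    # all remaining columns
    in_U = E[:, C_U].any(axis=1)              # row has at least one 1 in C_U
    in_V = E[:, C_V].any(axis=1)              # row has at least one 1 in C_V
    G_U = np.flatnonzero(in_U & ~in_V)        # expressed only under C_U
    G_W = np.flatnonzero(in_U & in_V)         # expressed under both C_U and C_V
    G_V = np.flatnonzero(~in_U & in_V)        # expressed only under C_V
    return C_U, C_V, G_U, G_W, G_V

# Toy 4 x 4 matrix (made-up data)
E = [[1, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1]]
C_U, C_V, G_U, G_W, G_V = bimax_partition(E)
```

In a full implementation, the sub-matrices defined by (GU and GW, CU) and by (GW and GV, all columns) would be decomposed recursively in the same way.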
The xMotif algorithm was devised to extract conserved gene motifs from gene expression data [13]. It has also been applied to various co-clustering problems in the literature, and it suits both real-valued and binary data. Introduced by Murali and Kasif in 2003, xMotif is among the most frequently used co-clustering approaches and has remained in use over a long period. An xMotif is itself a co-cluster. The pseudocode of xMotif is given in Algorithm 1.2. This algorithm finds the largest xMotif in the given data. In the algorithm, “ns” and “nd” are the number of initial seeds and the number of samples chosen for each seed, respectively, “sd” is the sample size, and “α” and “β” are the user-defined fractions of samples and genes conserved, respectively, in the samples chosen. It is less time-consuming than other co-clustering approaches. Its time complexity is O(nm O(log(1/α) + log(1/β))).
Algorithm 1.2 xMotif.
Input: Input data matrix
Output: Motif
Step 1 : Repeat i = 1 to ns
    Select a random sample c uniformly
    Repeat j = 1 to nd
        Select a subset D of samples randomly, of size sd
        Include (g, s) in Gij for each row (gene) g that is in state s in c and in all samples in D
        Assign to Cij the set of samples that agree with c on all gene-states in Gij
        Discard (Cij, Gij) if fewer than αn samples are present in Cij
Step 2 : Return (C+, G+), with maximal |Gij|, 1 ≤ i ≤ ns, 1 ≤ j ≤ nd
This section describes the datasets used in the experiments of this research work: the PIN dataset, the protein complex dataset, and the targets dataset.
A PIN is a set of interconnections between proteins; these interconnections are also called interactions. Each interaction is a link connecting two proteins, so each record in the data provides two interactors to describe an interaction. These interactions are undirected. Sample dataset records are given in Figure 1.7.
The sample dataset is the first ten records of Homo_Sapiens PIN. There are various columns corresponding to the interactors (A, B), official symbols of A and B, alias names of A and B, experimental system, PubMed identifier of the interaction, and organisms of A and B in the dataset. In this research, the interactor columns are used for experimentation; the other additional information can be used for analytical purposes. This study makes use of a PIN dataset that was obtained from the STRING database [27, 28].
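Since only the two interactor columns are used, each record can be turned into an entry of the binary symmetric adjacency matrix mentioned earlier. The following Python sketch shows one way this could be done; the function name `pin_to_adjacency` and the protein identifiers are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def pin_to_adjacency(interactions):
    """Build a binary symmetric adjacency matrix from (interactor A, interactor B) pairs."""
    proteins = sorted({p for pair in interactions for p in pair})
    index = {p: i for i, p in enumerate(proteins)}
    A = np.zeros((len(proteins), len(proteins)), dtype=np.uint8)
    for a, b in interactions:
        A[index[a], index[b]] = 1
        A[index[b], index[a]] = 1   # interactions are undirected, so mirror the entry
    return proteins, A

# Hypothetical identifiers, for illustration only
proteins, A = pin_to_adjacency([("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4")])
```

Row i and column i of `A` both refer to `proteins[i]`, so the matrix equals its own transpose.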
Figure 1.7 Sample PIN dataset.
The existing and tested protein interactions are present in this STRING repository [27, 28]. Physical and functional linkages are also a part of these interactions. There are ≅ 3.12 billion interactions involving ≅ 24.58 million proteins from 5090 different organisms. This database contains 19.4K Homo sapiens proteins and ≅ 8.5 million interactions.
The various data sources of this database are COG, BioGRID, MINT, KEGG, Gene Ontology, Ensembl, etc. [27]. The protein sequence data of these proteins were also taken from this database. Figure 1.8 shows the sample download page of the STRING database.
Figure 1.8 STRING database download page.
Protein complexes are groups of proteins that work together to carry out particular cellular tasks in living organisms [24]. Different organisms have their own unique protein complexes. This study uses protein complex data, retrieved from the CORUM database, to compare the outcomes.
This database is a comprehensive collection of mammalian protein complexes [29] that have been confirmed by experiments. Of its complexes, 64% are human, 16% mouse, and 12% rat; it includes a total of 2,358 human protein complexes. The name, composition, function, and reference of each protein complex are also included.
Figure 1.9 shows the statistics of the protein complexes in each release of the CORUM database [29]. The protein complexes range in size from 1 to 64, but very few have sizes from 9 to 64 (most sizes in this range have no complexes at all), and the majority of complexes have sizes from 1 to 8. As per the literature, the protein complex sizes and their counts are shown in Figure 1.10. Based on these sizes, the 2,358 protein complexes taken for this research are restricted to the size range 3 to 8. Sizes 1 and 2 are excluded because size 1 represents a single protein and size 2 describes a single interaction; including them would increase the time complexity, and individual proteins and their interactions in the PIN can instead be analyzed separately. Hence, the size range 3 to 8 is used. This limit is fixed on a trial-and-error basis. For larger sizes, more seeds and longer seeds have to be generated. Based on the size of most of the existing protein complexes, the maximum size is fixed at 8 for this research work; it can be increased in the future to extend this research.
Figure 1.9 CORUM database statistics [29].
Figure 1.10 Protein complex statistics.
All the implementations and experiments of this research are carried out in the MATLAB 2016a environment. The co-clustering approach proposed for functional module mining uses the MapReduce (MR) framework, in the basic configuration provided by MATLAB [30]. This default MapReduce framework runs on a virtual distributed environment; a distributed Hadoop cluster can also be configured in MATLAB. The MapReduce framework is discussed in detail below.
The MapReduce framework is designed for the parallel processing of voluminous distributed data [31–33]. Data are processed as key-value (KV) pairs. The framework consists of three phases: the map phase, the intermediate (rearranging or grouping) phase, and the reduce phase. Of these, the map and reduce phases are user-defined and consist of user-defined functions that process the data. The map function works on the data blocks and produces KV pairs; the reduce function works on each unique key separately and returns KV pairs. The intermediate phase groups the KV pairs by key, which is the default processing in this framework.
Figure 1.11 Workflow of MapReduce in MATLAB [30].
The MapReduce Framework, as per the MATLAB [30], is in Figure 1.11. The inputs and outputs of this framework are in the form of a data store. Tables of data, KV pairs, text files, images, and other types of media can all be input. The key-value datastore will be the final product. The map function will take the data blocks from this input data store for processing and produce the intermediate KV pairs in the form of intermediate KV store. In the intermediate phase, a value iterator will be generated by default for each key, as in Figure 1.11. This phase is for grouping the intermediate KV pairs based on the unique keys. After this phase, the user-defined “Reduce” function will be invoked for each unique key for further processing. It will generate the output KV pairs.
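The three phases can be imitated in a few lines of Python to make the data flow concrete. The sketch below is a sequential, in-memory stand-in written for illustration only; it is not MATLAB's mapreduce API, and the names `map_reduce`, `degree_map`, and `degree_reduce`, as well as the toy interaction records, are assumptions.

```python
from collections import defaultdict

def map_reduce(blocks, map_fn, reduce_fn):
    """Run the map, group-by-key, and reduce phases sequentially over data blocks."""
    intermediate = defaultdict(list)
    for block in blocks:                        # map phase: one call per data block
        for key, value in map_fn(block):
            intermediate[key].append(value)     # intermediate phase: group values by key
    output = {}
    for key, values in intermediate.items():    # reduce phase: one call per unique key
        for out_key, out_value in reduce_fn(key, iter(values)):
            output[out_key] = out_value
    return output

def degree_map(block):
    """Emit (protein, 1) for both interactors of every interaction in the block."""
    for a, b in block:
        yield a, 1
        yield b, 1

def degree_reduce(protein, counts):
    """Sum the counts collected for one protein."""
    yield protein, sum(counts)

# Two data blocks of made-up interactions
blocks = [[("P1", "P2"), ("P2", "P3")], [("P1", "P3"), ("P3", "P4")]]
degrees = map_reduce(blocks, degree_map, degree_reduce)
```

Here the reduce function receives an iterator over the values of a single key, which mirrors the value iterator that the framework generates for each key in the intermediate phase.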
The validation measures used for evaluating the performance of the proposed work, namely match score, network metrics, and functional coherence, are described in this section.
The match score measure is used for evaluating the proportion of matching between two patterns of data or two sets of patterns [14]. The patterns are co-clusters. It is evaluated using Equations 1.1 and 1.2.
The local patterns formed by grouping rows and columns simultaneously are called co-clusters. scr(B1, B2) is the match score between two co-clusters, evaluated using Equation 1.1. Specifically, it is the ratio of the number of row and column identifiers common to the two co-clusters to the total number of unique identifiers among them. “B1” and “B2” represent the two co-clusters: one is the target co-cluster, and the other is the output (predicted) co-cluster. {I1, J1} and {I2, J2} represent the row and column identifier sets of the co-clusters B1 and B2, respectively.
Scr*B(M1, M2) is the match score between two co-cluster sets, as in Equation 1.2. It is the average, over all co-clusters in “M1”, of the maximal match score with the co-clusters in “M2”. “M1” is the target co-cluster set and “M2” is the output co-cluster set.
The match score measure is used in this research to evaluate the match between the existing and extracted patterns and thereby assess their accuracy.
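A minimal Python sketch of the two match scores is given below. It assumes a Jaccard-style form, in which the score of two co-clusters is the number of shared row and column identifiers divided by the number of distinct identifiers, so that identical co-clusters score 1, and it assumes that the set-level score averages the best match of each target co-cluster. Equations 1.1 and 1.2 may differ in detail, and the function names are hypothetical.

```python
def match_score(b1, b2):
    """Match score between two co-clusters, each given as (row_ids, col_ids).

    Assumed form: (|I1 & I2| + |J1 & J2|) / (|I1 | I2| + |J1 | J2|).
    """
    (i1, j1), (i2, j2) = b1, b2
    i1, j1, i2, j2 = set(i1), set(j1), set(i2), set(j2)
    common = len(i1 & i2) + len(j1 & j2)
    unique = len(i1 | i2) + len(j1 | j2)
    return common / unique if unique else 0.0

def set_match_score(m1, m2):
    """Average, over the co-clusters in m1, of the best match score found in m2."""
    if not m1 or not m2:
        return 0.0
    return sum(max(match_score(b1, b2) for b2 in m2) for b1 in m1) / len(m1)

# Made-up co-clusters: (row identifiers, column identifiers)
target = ([0, 1, 2], [0, 1])
predicted = ([1, 2, 3], [0, 1])
score = match_score(target, predicted)   # 4 shared identifiers out of 6 distinct ones
```

Under these assumptions the set-level score is not symmetric: swapping the target and output sets generally changes the value.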
Functional coherence is used to assess how well the functional annotations of a protein module hold together [34]. Each protein is accompanied by biological functional characteristics known as functional annotations, which describe its functions within the biological system under several aspects. To predict the properties of a protein module, these functional annotations are quantified for every module. The similarity of these functional annotations is measured by functional coherence (FC): the proportion of proteins in the protein module that carry a given functional annotation, relative to the total number of proteins in the module.
The functional coherence (FC) of the jth annotation in the ith protein module (PM) is derived using Equation 1.3, where pij denotes the proteins in the ith module that have the jth annotation and pi denotes all the proteins in the ith module. If all the proteins in the module have the jth annotation, the value of FCj is “1”; if none of them has it, the value is “0”. This FC value describes the functioning of the protein module. In this study, the functions of the resulting protein modules are determined using this metric.
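The proportion described above can be computed directly. The following Python sketch assumes that FC is the number of proteins in the module carrying the annotation divided by the module size; the function name `functional_coherence`, the dictionary layout, and the protein and annotation names are made up for illustration.

```python
def functional_coherence(module, annotations, term):
    """Fraction of the proteins in `module` that carry the annotation `term`.

    `annotations` maps a protein identifier to its set of annotation terms.
    """
    module = list(module)
    if not module:
        return 0.0
    hits = sum(1 for protein in module if term in annotations.get(protein, set()))
    return hits / len(module)

# Hypothetical annotations, for illustration only
annotations = {
    "P1": {"binding"},
    "P2": {"binding", "catalysis"},
    "P3": {"catalysis"},
}
fc_all = functional_coherence(["P1", "P2"], annotations, "binding")         # every protein annotated
fc_part = functional_coherence(["P1", "P2", "P3"], annotations, "binding")  # two of three annotated
```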
Functional annotations are used to illustrate the biological importance of proteins. They are the gold standard for characterizing the functional features of molecular products such as proteins [35, 36]. These annotations fall into three primary groups, as follows:
Molecular Function: This covers molecular-level activities such as binding and catalysis. Depending on location and circumstances, they may be carried out by a single bio-product (protein) or a set of bio-products (protein complex). These annotations describe only the activity, not the bio-products that carry it out [37, 38].
Biological Process: This covers cellular-level activities, such as metabolism and transport, that are essential to the biological processes of the bio-products. A biological process in the cell is accomplished by combining one or more molecular functions, carried out by one or more bio-products [37, 38].
Cellular Component: This represents the relative position or cellular compartment of the bio-product during its molecular activities. Factors such as cellular topology, cellular compartments, and the presence of stable macromolecular complexes are usually considered when making these determinations [37, 38].
In this section, the SCoC approach for non-symmetric matrices, called SCoCnsym, is presented; it is an extended version of the previous SCoC approach. A non-symmetric matrix is either a rectangular matrix or a square matrix whose upper-right and lower-left triangles are not the same. It is used to represent the relationship between two disjoint sets of objects. SCoCnsym operates on one dimension at a time, either a row or a column, based on the condition specified. The score threshold plays the major role in mining different types of co-clusters as required. Algorithm 1.3 shows the steps of SCoCnsym.
Algorithm 1.3 SCoCnsym.
Input: 2-D input data (D)
Result: Co-cluster (C)
Step 1 : C = D; compute the score Sn of C
Step 2 : While Sn < threshold:
    compute the scores of the rows and columns of C
    ignore the lowest-scoring row or column in C
    recompute Sn
Step 3 : Return C
Initially, the input data matrix D contains noise and an embedded non-symmetric co-cluster. The score of the co-cluster is computed at each iteration; the row or column with the lowest score is ignored, and the score is updated using the equation. This process continues while the score is less than the selected threshold. The algorithm mines the constant (1’s) co-cluster from a given data matrix when the threshold is “1”. For the 0’s co-cluster, the score threshold should be “0”, and the row or column with the highest score has to be removed instead. This approach extracts the maximal constant co-cluster from the given binary data matrix. The time complexity of this approach is O(ne + nv^2).
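The iterative removal can be sketched in Python as follows. The scoring equation is not reproduced here, so this sketch assumes that the score is the density of the current sub-matrix (the fraction of 1s) and that row and column scores are their respective densities. The function name `scoc_nsym` and the toy matrix are illustrative assumptions, and such a greedy procedure should be read as a sketch rather than as the authors' implementation.

```python
import numpy as np

def scoc_nsym(D, threshold=1.0):
    """Greedy sketch: drop the lowest-scoring row or column until the score reaches the threshold.

    Assumption: the score S_n is the density (mean) of the current binary sub-matrix.
    Returns the identifiers of the rows and columns that remain.
    """
    D = np.asarray(D, dtype=float)
    rows = np.arange(D.shape[0])
    cols = np.arange(D.shape[1])
    while rows.size and cols.size:
        C = D[np.ix_(rows, cols)]
        if C.mean() >= threshold:              # S_n has reached the threshold: stop
            break
        row_scores = C.mean(axis=1)
        col_scores = C.mean(axis=0)
        r, c = row_scores.argmin(), col_scores.argmin()
        if row_scores[r] <= col_scores[c]:
            rows = np.delete(rows, r)          # ignore the lowest-scoring row
        else:
            cols = np.delete(cols, c)          # ignore the lowest-scoring column
    return rows, cols

# Toy 5 x 4 matrix (made-up data) with a 3 x 2 block of 1s in the top-left corner
D = [[1, 1, 0, 0],
     [1, 1, 0, 1],
     [1, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 1, 0, 0]]
rows, cols = scoc_nsym(D)   # rows -> [0, 1, 2], cols -> [0, 1]
```

For a 0's co-cluster, the same loop would instead remove the highest-scoring row or column and stop once the density falls to the threshold of 0.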
The sample illustration of SCoCnsym for mining 1’s constant co-cluster is explained in this section. Let “C” be the Input Matrix, where the embedded co-cluster is highlighted in Figure 1.12. The score threshold is “1”.
Figure 1.12 Sample input matrix and its heatmap.
Table 1.3 illustrates a sample for the proposed SCoCnsym approach. Here the 7 × 3 constant 1’s co-cluster is embedded in the 8 × 7 noisy data matrix. In every iteration, the row score and column score of the given data are computed; either the row or column with a minimal score is removed; the score is evaluated for the result matrix. This approach is also attempted on the synthetic datasets for their performance analysis similar to the previous implementation, which is discussed in the next section.
Table 1.3 SCoCnsym: Toy example.
The synthetic binary datasets are generated for experimenting with the SCoCnsym. In this section, four different binary datasets are generated, where the co-clusters and noise are implanted.
Figure 1.13 Synthetic Dataset_nsym_I. (a) Noiseless; (b) noisy.
SCoCnsym is tested on both noisy and noiseless data. The dataset description is shown in Table 1.4, which lists the matrix type, data size, co-cluster size, co-cluster position, and the presence and nature of noise.
The proposed approach SCoCnsym is applied to four different synthetic non-symmetric datasets with a score threshold of “1” for mining the constant 1’s co-cluster. The experiments are carried out in the MATLAB environment, and the performance of the SCoCnsym approach is compared with that of the existing approaches. The results are given in Table 1.5, which shows that the proposed approach SCoCnsym outperforms the existing approaches. SCoCnsym extracted the implanted co-clusters from all the different types of synthetic data, including the noisy ones. The BiMax algorithm can extract the implanted co-clusters only from the datasets without noise; its performance is affected by noise in the data.
Figure 1.14 Synthetic Dataset_nsym_II. (a) Noiseless; (b) noisy.
The computational time taken by these approaches for mining the co-clusters in the four synthetic datasets is recorded in Table 1.6. The outcomes show that the xMotif approach takes less time than the proposed approach, but it does not extract the expected co-cluster from any dataset.
Figure 1.15 Heatmap of synthetic Dataset_nsym_III. (a) Noiseless; (b) noisy.
In all four synthetic datasets, the expected (implanted) co-cluster is the maximal co-cluster. The BiMax algorithm mines co-clusters in less time than the proposed approach; however, its performance is highly affected by noise.
The previous two approaches mine the maximal co-cluster present in the dataset, whereas the remaining hidden patterns are not explored. Randomizing the approach allows it to mine co-clusters hidden in different portions of the dataset, i.e., the smaller co-clusters that are otherwise ignored. It is implemented by introducing a random seed vector that represents the search location within the dataset. Many random seeds are generated to initiate different searches, which results in exploring the different patterns present in the data. In addition, traversing the entire dataset for co-cluster mining is time-consuming for large datasets; in such cases, randomization is a better way to mine the patterns in the dataset stochastically. For these reasons, the randomization of the SCoC approach is proposed.
Figure 1.16 Heatmap of synthetic Dataset_nsym_IV. (a) Noiseless; (b) noisy.
In this section, the implementation of the SCoCrand is discussed. Here, three different seeding techniques are adopted for testing this proposed approach. They are row-wise random seeding, column-wise random seeding, and random seeding (both row and column seed). A synthetic dataset is generated and tested with these types of seeding techniques discussed further.
This SCoCrand approach uses the SCoC approach, which is applied to each seeding portion. The pseudocode of the SCoCrand is shown in Algorithm 1.4.
Table 1.4 Synthetic dataset description—SCoCnsym.
| S. no. | Dataset name | Size | Matrix type | Co-cluster size | Co-cluster position | Noise type |
|---|---|---|---|---|---|---|
| 1 | Dataset_nsym_I (Figure 1.13) | 500 × 700 | Non-symmetric binary | 150 × 190 | Randomly scattered and symmetrical | Random symmetric noise |
| 2 | Dataset_nsym_II (Figure 1.14) | 200 × 400 | Non-symmetric binary | 150 × 190 | Specific continuous | Symmetric random noise |
| 3 | Dataset_nsym_III (Figure 1.15) | 100 × 200 | Non-symmetric binary | 50 × 50 | Randomly symmetrical | Random noise |
| 4 | Dataset_nsym_IV (Figure 1.16) | 400 × 200 | Non-symmetric binary | 200 × 150 | Randomly scattered and symmetrical | Symmetric random noise |
Table 1.5 Comparative analysis of SCoCnsym based on match score measure.
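| Approach | Dataset_nsym I (noise-free) | Dataset_nsym I (noisy) | Dataset_nsym II (noise-free) | Dataset_nsym II (noisy) | Dataset_nsym III (noise-free) | Dataset_nsym III (noisy) | Dataset_nsym IV (noise-free) | Dataset_nsym IV (noisy) |
|---|---|---|---|---|---|---|---|---|
| SCoCnsym | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| BiMax | 1 | 0.4319 | 1 | 0.0117 | 1 | 0.0278 | 1 | 0.5719 |
| xMotif | 0.1075 | 0.1508 | 0.2500 | 0.1709 | 0.1667 | 0.0397 | 0.1867 | 0.0083 |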
Table 1.6 Comparative analysis of SCoCnsym based on computational time.
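| Approach | Dataset_nsym I (noise-free) | Dataset_nsym I (noisy) | Dataset_nsym II (noise-free) | Dataset_nsym II (noisy) | Dataset_nsym III (noise-free) | Dataset_nsym III (noisy) | Dataset_nsym IV (noise-free) | Dataset_nsym IV (noisy) |
|---|---|---|---|---|---|---|---|---|
| SCoCnsym | 1.44029 | 6.60305 | 0.83674 | 0.93903 | 0.11604 | 0.35163 | 0.68721 | 1.14249 |
| BiMax | 9.48878 | 0.01153 | 0.00743 | 0.01251 | 0.00197 | 0.00166 | 0.01251 | 0.00317 |
| xMotif | 0.98624 | 0.03936 | 0.01348 | 0.01785 | 0.01009 | 0.03381 | 0.03746 | 0.03430 |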
Algorithm 1.4 SCoCrand.
Input: 2-D input data (D)
Result: Co-clusters (M = {C1, C2, …, Cns})
Step 1 : Assign the seed count (ns) and seed size (lens)
Step 2 : Generate the initial seeds
Step 3 : For every seed si:
    perform SCoCnsym to extract a co-cluster
Step 4 : Assess and visualize the resultant co-clusters (M).
A seed vector is the set of identifiers representing a particular subset of the dataset. The number of seeds (ns) and the seed vector size (lens) can be fixed based on the user requirement or set on a trial-and-error basis. SCoCrand extracts multiple co-clusters instead of one: different sub-matrices of the input data are selected using random seeds, and a co-cluster is extracted from each by applying SCoCnsym. The time complexity of this approach is Tr = O(Ns(max(Nes) + Nvs^2)), where “Ns”, “Nes”, and “Nvs” are the seed count, the number of edges in a seed, and the number of vertices in a seed, respectively.
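The seeding loop can be sketched in Python as follows. This is a sketch under assumptions rather than the authors' implementation: `greedy_constant_block` is a density-based stand-in for SCoCnsym (the actual scoring equation is not reproduced here), and the function names, the `mode` argument, and the use of a fixed random seed are hypothetical choices made for illustration.

```python
import numpy as np

def greedy_constant_block(C, threshold=1.0):
    """Stand-in for SCoCnsym: drop the least dense row or column until density >= threshold."""
    rows, cols = np.arange(C.shape[0]), np.arange(C.shape[1])
    while rows.size and cols.size:
        S = C[np.ix_(rows, cols)]
        if S.mean() >= threshold:
            break
        rs, cs = S.mean(axis=1), S.mean(axis=0)
        if rs.min() <= cs.min():
            rows = np.delete(rows, rs.argmin())
        else:
            cols = np.delete(cols, cs.argmin())
    return rows, cols

def scoc_rand(D, ns, lens, mode="both", seed=0):
    """Extract one co-cluster per random seed; returns a list of (row_ids, col_ids).

    mode: "row" (row-wise seeds), "col" (column-wise seeds), or "both".
    """
    D = np.asarray(D, dtype=float)
    rng = np.random.default_rng(seed)

    def pick(n, seeded):
        # A seed is a sorted random subset of identifiers; unseeded axes keep all identifiers.
        if not seeded:
            return np.arange(n)
        return np.sort(rng.choice(n, size=min(lens, n), replace=False))

    results = []
    for _ in range(ns):
        r = pick(D.shape[0], mode in ("row", "both"))
        c = pick(D.shape[1], mode in ("col", "both"))
        sub_r, sub_c = greedy_constant_block(D[np.ix_(r, c)])
        results.append((r[sub_r], c[sub_c]))   # map back to the original identifiers
    return results

# Made-up 6 x 6 matrix with a 3 x 3 block of 1s
D = np.zeros((6, 6))
D[:3, :3] = 1
co_clusters = scoc_rand(D, ns=4, lens=4, mode="both", seed=1)
```

Because each seed restricts the search to a sub-matrix, the extracted co-clusters are at most of the seed size along each seeded dimension, and some seeds may yield empty or duplicate results.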
In this approach, the seeds are chosen to mine more patterns from the given data matrix. It helps to mine the maximal co-cluster in each sub-matrix of the data. They are selected based on the random seeds. The seeds are generated at different criteria like row-wise random seeds, column-wise random seeds, and random seeds (both row and column). The row-wise random seed splits the data along the row (row subset); similarly, the column-wise