DATA SCIENCE WITH SEMANTIC TECHNOLOGIES This book serves as an important guide to applications of data science with semantic technologies for the upcoming generation, and thus is a unique resource for scholars, researchers, professionals, and practitioners in this field. To create intelligence in data science, it becomes necessary to utilize semantic technologies, which allow machine-readable representation of data. This intelligence uniquely identifies and connects data with common business terms, and it also enables users to communicate with data. Instead of structuring the data, semantic technologies help users understand the meaning of the data through the concepts of semantics, ontology, OWL, linked data, and knowledge graphs. These technologies help organizations understand all of their stored data, add value to it, and enable insights that were not available before. As data is the most important asset of any organization, it is essential to apply semantic technologies in data science to fulfill the organization's needs. Data Science with Semantic Technologies provides a roadmap for the deployment of semantic technologies in the field of data science. Moreover, it highlights how data science enables the user to create intelligence through these technologies by exploring the opportunities and eradicating the challenges in current and future time frames. In addition, this book answers various questions, such as: Can semantic technologies facilitate data science? Which types of data science problems can be tackled by semantic technologies? How can data scientists benefit from these technologies? What is knowledge data science? How does knowledge data science relate to other domains? What is the role of semantic technologies in data science? What are the current progress and future of data science with semantic technologies? Which types of problems require the immediate attention of researchers?
Audience: Researchers in the fields of data science, semantic technologies, artificial intelligence, big data, and other related domains, as well as industry professionals, software engineers/scientists, and project managers who are developing software for data science. Students across the globe will gain basic and advanced knowledge of the current state and potential future of data science.
Page count: 629
Publication year: 2022
Cover
Title Page
Copyright
Preface
1 A Brief Introduction and Importance of Data Science
1.1 What is Data Science? What Does a Data Scientist Do?
1.2 Why Data Science is in Demand?
1.3 History of Data Science
1.4 How Does Data Science Differ from Business Intelligence?
1.5 Data Science Life Cycle
1.6 Data Science Components
1.7 Why Data Science is Important
1.8 Current Challenges
1.9 Tools Used for Data Science
1.10 Benefits and Applications of Data Science
1.11 Conclusion
References
2 Exploration of Tools for Data Science
2.1 Introduction
2.2 Top Ten Tools for Data Science
2.3 Python for Data Science
2.4 R Language for Data Science
2.5 SQL for Data Science
2.6 Microsoft Excel for Data Science
2.7 D3.JS for Data Science
2.8 Other Important Tools for Data Science
2.9 Conclusion
References
3 Data Modeling as Emerging Problems of Data Science
3.1 Introduction
3.2 Data
3.3 Data Model Design
3.4 Data Modeling
3.5 Polyglot Persistence Environment
References
4 Data Management as Emerging Problems of Data Science
4.1 Introduction
4.2 Perspective and Context
4.3 Data Distribution
4.4 CAP Theorem
4.5 Polyglot Persistence
References
5 Role of Data Science in Healthcare
5.1 Predictive Modeling—Disease Diagnosis and Prognosis
5.2 Preventive Medicine—Genetics/Molecular Sequencing
5.3 Personalized Medicine
5.4 Signature Biomarkers Discovery from High Throughput Data
Conclusion
References
6 Partitioned Binary Search Trees (P(h)-BST): A Data Structure for Computer RAM
6.1 Introduction
6.2 P(h)-BST Structure
6.3 Maintenance Operations
6.4 Insert and Delete Algorithms
6.5 P(h)-BST as a Generator of Balanced Binary Search Trees
6.6 Simulation Results
6.7 Conclusion
Acknowledgments
References
7 Security Ontologies: An Investigation of Pitfall Rate
7.1 Introduction
7.2 Secure Data Management in the Semantic Web
7.3 Security Ontologies in a Nutshell
7.4 InFra_OE Framework
7.5 Conclusion
References
8 IoT-Based Fully-Automated Fire Control System
8.1 Introduction
8.2 Related Works
8.3 Proposed Architecture
8.4 Major Components
8.5 Hardware Interfacing
8.6 Software Implementation
8.7 Conclusion
References
9 Phrase Level-Based Sentiment Analysis Using Paired Inverted Index and Fuzzy Rule
9.1 Introduction
9.2 Literature Survey
9.3 Methodology
9.4 Conclusion
References
10 Semantic Technology Pillars: The Story So Far
10.1 The Road that Brought Us Here
10.2 What is a Semantic Pillar?
10.3 The Foundation Semantic Pillars: IRI’s, RDF, and RDFS
10.4 The Semantic Upper Pillars: OWL, SWRL, SPARQL, and SHACL
10.5 Conclusion
References
11 Evaluating Richness of Security Ontologies for Semantic Web
11.1 Introduction
11.2 Ontology Evaluation: State-of-the-Art
11.3 Security Ontology
11.4 Richness of Security Ontologies
11.5 Conclusion
References
12 Health Data Science and Semantic Technologies
12.1 Health Data
12.2 Data Science
12.3 Health Data Science
12.4 Examples of Health Data Science Applications
12.5 Health Data Science Challenges
12.6 Health Data Science and Semantic Technologies
12.7 Application of Data Science for COVID-19
12.8 Data Challenges During COVID-19 Outbreak
12.9 Biomedical Data Science
12.10 Conclusion
References
13 Hybrid Mixed Integer Optimization Method for Document Clustering Based on Semantic Data Matrix
13.1 Introduction
13.2 A Method for Constructing a Semantic Matrix of Relations Between Documents and Taxonomy Concepts
13.3 Mathematical Statements for Clustering Problem
13.4 Heuristic Hybrid Clustering Algorithm
13.5 Application of a Hybrid Optimization Algorithm for Document Clustering
13.6 Conclusion
References
14 Role of Knowledge Data Science During COVID-19 Pandemic
14.1 Introduction
14.2 Literature Review
14.3 Model Discussion
14.4 Results and Discussions
14.5 Conclusion
References
15 Semantic Data Science in the COVID-19 Pandemic
15.1 Crises Often Are Catalysts for New Technologies
15.2 The Domains of COVID-19 Semantic Data Science Research
15.3 Discussion
References
Index
Wiley End User License Agreement
Chapter 1
Table 1.1 Comparison of data science and business intelligence.
Chapter 2
Table 2.1 Common SQL commands.
Table 2.2 Common SQL operations.
Table 2.3 Common SQL order of execution.
Table 2.4 Worldwide electric energy consumption from 1981 to 2020.
Table 2.5 Sample dataset: prices of product (X) within 10 consecutive months.
Chapter 3
Table 3.1 Atoms and characteristics.
Chapter 5
Table 5.1 Data sets [14].
Table 5.2 Genes selected—AML/ALL data set [14].
Table 5.3 Genes selected - Lung Harvard2 data set [14].
Table 5.4 Signature genes related with various conditions for data set II (Ovari...
Chapter 6
Table 6.1 Rotations, Ratios Max/Min, Execution times–Insert algorithm– Random ca...
Table 6.2 Rotations, Ratios Max/Min, Execution times–Insert algorithm– Random ca...
Table 6.3 Number of rotations, execution times–Delete algorithm–Random case.
Table 6.4 Insert algorithm—Ascending case.
Table 6.5 Execution time–Insert algorithm–Random case.
Table 6.6 Execution time–Delete algorithm–Random case.
Chapter 7
Table 7.1 Size of security ontologies.
Chapter 9
Table 9.1 Comparison of our approach with other benchmark approaches.
Chapter 11
Table 11.1 Richness of security ontologies via OntoMetric tool (a) security onto...
Chapter 14
Table 14.1 COVID-19 pandemic timeline from January 2020 to December 2020.
Table 14.2 COVID-19 pandemic timeline from January 2021 to June 2021.
Table 14.3 The country wise display of confirmed, recovered, deaths, and active ...
Table 14.4 (a), 14.4 (b), 14.4 (c), and 14.4 (d) depict the predicted outpu...
Table 14.5 Model performance for active cases.
Table 14.6 Model performance for confirmed cases.
Table 14.7 Model performance for recovered cases.
Table 14.8 Model performance for death cases.
Table 14.9 Display of top 20 countries confirmed and recovered cases as on 29.05...
Table 14.10 COVID-19 patients actual and predicted confirmed, deaths, recovered ...
Table 14.11 Forecast values for 5th day, 10th day, and 15th day new infected cas...
Chapter 15
Table 15.1 Summary table of all reviewed research.
Chapter 1
Figure 1.1 Proceeding of IJCAI-workshop.
Figure 1.2 Data science Venn diagram.
Figure 1.3 Job growth on analytics and data science.
Figure 1.4 Components of data science.
Figure 1.5 Features of SAS.
Figure 1.6 Features of Apache Spark.
Figure 1.7 Features of D3.js.
Figure 1.8 Features of MATLAB.
Figure 1.9 Features of Excel.
Figure 1.10 Features of Tableau.
Figure 1.11 Features of NLTK.
Figure 1.12 Features of TensorFlow.
Chapter 2
Figure 2.1 Diverse fields of data science.
Figure 2.2 Data scientist vs. data engineer [5].
Figure 2.3 Structured data vs. unstructured data vs semistructured data.
Figure 2.4 Top ten tools for data science.
Figure 2.5 Front view of Anaconda platform and Jupyter for IPython.
Figure 2.6 Companies that use R language for data analytics.
Figure 2.7 Operators of R language.
Figure 2.8 Most common operators of R language.
Figure 2.9 Example of schema for RDB of two tables.
Figure 2.10 Simplified example for outliers in the dataset.
Figure 2.11 Algorithm of outlier detection used in Tukey labeling.
Figure 2.12 Mouse pointer points to statistic chart symbol, and then points box ...
Figure 2.13 Box and whisker with single outlier.
Figure 2.14 Linear regression vs logistic regression.
Figure 2.15 Using the scatter plot tool in Microsoft Excel.
Figure 2.16 The scatter plot for the energy data records provided in Table 2.4.
Figure 2.17 “Add Trend line” option for the scatter plot.
Figure 2.18 Using Linear regression via “Add Trend line” option for the scatter ...
Figure 2.19 Using forecasting tool in Microsoft Excel.
Figure 2.20 Preview of forecasting results.
Figure 2.21 Final forecasting results of six future values.
Figure 2.22 Apache Spark architecture.
Figure 2.23 MongoDB data store architecture.
Figure 2.24 MATLAB computing system architecture.
Figure 2.25 Neo4j graph database architecture.
Figure 2.26 VMWare virtualization system architecture.
Chapter 3
Figure 3.1 Periodic table of atoms. (Sources: https://en.wikipedia.org/wiki/Peri...
Figure 3.2 Schema of data item, data field, record and file (Sources: https://ww...
Chapter 5
Figure 5.1 The history of DNA sequencing technologies (Source: https://www.cd-ge...
Figure 5.2 Block diagram of proposed method I.
Figure 5.3 Block diagram of proposed method II [14].
Chapter 6
Figure 6.1 P(2)-BST.
Figure 6.2 P(3)-BST.
Screenshot 6-01 AVL tree insert algorithm.
Screenshot 6-02 AVL tree insert algorithm.
Figure 6.3 Partitioning.
Figure 6.4 Departitioning.
Figure 6.5 Departitioning (special case).
Figure 6.6 Restructuring-partitioning.
Figure 6.7 Transforming.
Figure 6.8 Step-by-step insert process.
Figure 6.9 Step-by-step delete process.
Figure 6.10 P(2)-BST and P(3)-BST versus corresponding Red-Black trees.
Screenshot 6-03 PBST node implementation.
Screenshot 6-04 Red-Black tree node implementation.
Screenshot 6-05 AVL tree node implementation.
Figure 6.11 Numbers of rotations–Insert algorithm–Random case.
Figure 6.12 Execution times–Insert algorithm–Random case.
Figure 6.13 Number of rotations–Delete algorithm–Random case.
Figure 6.14 Execution times–Delete algorithm–Random case.
Figure 6.15 Maximal height of path–Insert algorithm–Ascending case.
Figure 6.16 Ratio Max/Min–Insert algorithm–Ascending case.
Figure 6.17 Execution time–Insert algorithm–Ascending case.
Figure 6.18 Distribution of class and simple nodes–Insert algorithm–Ascending ca...
Chapter 7
Figure 7.1 Languages leading to OWL.
Figure 7.2 From HTML to linked data.
Figure 7.3 Obtained articles per year (from 2010 to July 2021) (a) Science direc...
Figure 7.4 Screening of the articles.
Figure 7.5 Classification framework for security ontologies analysis.
Figure 7.6 Evaluation of security ontology via OOPS! tool of InFra_OE framework.
Figure 7.7 Pitfall rate of security ontologies.
Chapter 8
Figure 8.1 Water used in coal mining.
Figure 8.2 Interfacing block representation of proposed architecture.
Figure 8.3 (a) Arduino Uno (b) Arduino Uno ATmega328P pin mapping.
Figure 8.4 LM 35 temperature sensor.
Figure 8.5 16X2 LCD display module.
Figure 8.6 (a) DHT11 temperature-humidity sensor. (b) DHT11 connection with proc...
Figure 8.7 Negative temperature coefficient of resistance.
Figure 8.8 Moisture sensor module.
Figure 8.9 MQ 135 gas sensor module.
Figure 8.10 SIM900 GSM GPRS Module connection with Arduino Uno.
Figure 8.11 Solar PV system.
Figure 8.12 LM-35 interfaces with Arduino.
Figure 8.13 DHT11 interface with Arduino.
Figure 8.14 MQ-X module interface with Arduino.
Figure 8.15 NEO 6M GPS module interface with Arduino.
Figure 8.16 DHT11 module PC display result.
Chapter 9
Figure 9.1 Inverted index structure compared with other inverted index structures.
Figure 9.2 Hierarchical structure of sentence classification.
Figure 9.3 Fuzzy system.
Chapter 10
Figure 10.1 (a), (b), and (c): Predicting the cost of a house based on square feet.
Figure 10.2 Semantic stack architecture.
Figure 10.3 RDF graph.
Figure 10.4 Property values for Mary_Doe.
Figure 10.5 Explanation for inference that Mary_Doe has_Aunt Sarah_Doe.
Chapter 11
Figure 11.1 Ontology evaluation tools.
Chapter 13
Figure 13.1 Calculation of the elements of the semantic matrix.
Figure 13.2 An example of calculating the strength of the semantic relationship ...
Figure 13.3 Relations of the distances of objects i, j to the center of the cl...
Figure 13.4 Geometric representation of cluster k objects for the extended PDC c...
Chapter 14
Figure 14.1 Illustration of the SARS-CoV-2 virion.
Figure 14.2 Transmission stages of the coronavirus.
Figure 14.3 Flowchart of the implementation methodology.
Figure 14.4 World map of COVID-19 active cases as of 29.05.2021. The darker the ...
Figure 14.5 The graph depicts the observed date vs. data frame (because of the hu...
Figure 14.6 (a–b) shows the actual and predicted cases active, death, recovered,...
Figure 14.6 (c-d) shows the actual and predicted cases active, death, recovered,...
Figure 14.7 Trend, weekly, and daily cases of active case.
Figure 14.8 Trend, weekly, and daily cases of confirmed cases.
Figure 14.9 Trend, weekly, and daily cases of recovered cases.
Figure 14.10 Trend, weekly, and daily cases of death cases.
Figure 14.11 Trend of the performance measures (i) MSE, (ii) RMSE, (iii) MAE for...
Figure 14.12 Trend of the performance measures (i) MSE, (ii) RMSE, (iii) MdAPE f...
Figure 14.13 Trend of the performance measures (i) MSE, (ii) RMSE, and (iii) MAP...
Figure 14.14 Trend of the performance measures (i) MSE, (ii) RMSE, and (iii) MAP...
Figure 14.15 The world’s top 20 countries recovered vs confirmed cases as on 29....
Figure 14.16 Worldwide top countries daily COVID-19 cases of actual vs predicted...
Chapter 15
Figure 15.1 Semantic technology COVID-19 domains.
Figure 15.2 Amazon semantic search tool.
Figure 15.3 COVID*GRAPH semantic search tool.
Figure 15.4 Knowledge graph visualization of COVID-19 concepts and relevant pape...
Figure 15.5 Example of papers discovered via network graph semantic search.
Figure 15.6 The Johns Hopkins dashboard.
Figure 15.7 The NY times dataset in the Stardog cloud.
Figure 15.8 COVID trials visualization.
Figure 15.9 Drug repurposing visualization.
Figure 15.10 Infection paths visualized by CODO.
Figure 15.11 Geographic information visualized in CODO.
Scrivener Publishing
100 Cummings Center, Suite 541J
Beverly, MA 01915-6106
Publishers at Scrivener
Martin Scrivener ([email protected])
Phillip Carmical ([email protected])
Edited by
Archana Patel
Department of Software Engineering, School of Computing and Information Technology, Eastern International University, Vietnam
Narayan C. Debnath
School of Computing and Information Technology, Eastern International University, Vietnam
and
Bharat Bhusan
School of Engineering and Technology, Sharda University, India
This edition first published 2022 by John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA and Scrivener Publishing LLC, 100 Cummings Center, Suite 541J, Beverly, MA 01915, USA. © 2022 Scrivener Publishing LLC. For more information about Scrivener publications please visit www.scrivenerpublishing.com.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
Wiley Global Headquarters: 111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Limit of Liability/Disclaimer of Warranty: While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials, or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read.
Library of Congress Cataloging-in-Publication Data
ISBN 9781119864981
Cover image: Pixabay.com. Cover design by Russell Richardson.
Set in 11pt Minion Pro by Manila Typesetting Company, Makati, Philippines.
Printed in the USA
10 9 8 7 6 5 4 3 2 1
Data science is an invaluable resource that deals with vast volumes of data, using modern tools and techniques to find unseen patterns, derive meaningful information, and make business decisions. To create intelligence in data science, it becomes necessary to utilize semantic technologies, which allow machine-readable representation of data. This intelligence uniquely identifies and connects data with common business terms, and also enables users to communicate with data. Instead of structuring the data, semantic technologies help users understand the meaning of the data through the concepts of semantics, ontology, OWL, linked data, and knowledge graphs. These technologies assist organizations in understanding all of their stored data, adding value to it, and enabling insights that were not available before. Organizations are also using semantic technologies to unearth precious nuggets of information from vast volumes of data and to enable more flexible use of data. These technologies can address the existing problems of data scientists and help them make better decisions for any organization. All of these needs are part of a focused shift toward the utilization of semantic technologies in data science, which provide knowledge along with the ability to understand, reason, plan, and learn with existing and new data sets. These technologies also generate expected, reproducible, user-desired results.
This book aims to provide a roadmap for the deployment of semantic technologies in the field of data science. Moreover, it highlights how data science enables the user to create intelligence through these technologies by exploring the opportunities and eradicating the challenges in current and future time frames. It can serve as an important guide to applications of data science with semantic technologies for the upcoming generation and thus is a unique resource for scholars, researchers, professionals and practitioners in this field. Following is a brief description of the subjects covered in the 15 chapters of the book.
– Chapter 1 provides a brief introduction to data science. It addresses various aspects of data science such as what a data scientist does and why data science is in demand; the history of data science and how it differs from business intelligence; the life cycle of data science and data science components; why data science is important; the challenges of data science; the tools used for data science; and the benefits and applications of data science.
– Chapter 2 provides an overview of the top 10 tools and applications that should be of interest to any data scientist. Its objective includes, but is not limited to, realizing the use of Python in developing solutions to data science tasks; recognizing the use of R language as an open-source data science provider; traveling around the SQL to provide structured models for data science projects; navigating through data analytics and statistics using Excel; and using D3.js scripting tools for data visualization. Also, practical examples/case studies are provided on data visualization, data analytics, regression, forecasting, and outlier detection.
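As a taste of the outlier-detection material, the Tukey labeling rule referenced in Chapter 2 (Figure 2.11) can be sketched in a few lines of Python. This is an illustrative sketch only; the function name and the fence multiplier k are conventional choices, not taken from the chapter:

```python
import statistics

def tukey_outliers(data, k=1.5):
    """Flag points outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles of the sample
    iqr = q3 - q1                                # interquartile range
    lo, hi = q1 - k * iqr, q3 + k * iqr          # the two fences
    return [x for x in data if x < lo or x > hi]

# A single extreme value is flagged; a uniform run is not.
print(tukey_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))  # [100]
```

Note that `statistics.quantiles` defaults to the exclusive quartile method, so results can differ slightly from Excel's QUARTILE.INC.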
– Chapter 3 presents the use of data modeling for data science, revealing the possibility of a new side of the data. The chapter covers different types of data (unstructured data, semi-structured data, structured data, hybrid (un/semi)-structured data and big data) and data model design.
– Chapter 4 shows data management by considering language based on the novelty view of data. The chapter focuses on data life cycle, data distribution and CAP theorem.
– Chapter 5 presents the role of data science in healthcare. There are several fields in the healthcare sector, such as predictive modeling, genetics, etc., which make use of data science for diagnosis and drug discovery, thereby increasing usability of precision medicine.
– Chapter 6 provides a new balanced binary search tree that contains two kinds of nodes: simple nodes and class nodes. Two advantages make the new structure attractive. First, it subsumes the most popular data structures, AVL and Red-Black trees. Second, it yields other, previously unexplored balanced binary search trees in which the maximal height of paths can be adjusted between 1.44 lg(n) and 2 lg(n), where n is the number of nodes in the tree and lg is the base-two logarithm.
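The two extremes of that adjustable range can be computed directly. The small Python helper below is only an illustration of the quoted bounds (1.44 lg(n) is the classical AVL worst-case constant, 2 lg(n) the Red-Black one); it is not part of the chapter's P(h)-BST implementation:

```python
import math

def height_bounds(n):
    """Return (avl_like, red_black_like) maximal-height bounds for an
    n-node balanced BST: 1.44*lg(n) and 2*lg(n)."""
    lg = math.log2(n)
    return 1.44 * lg, 2.0 * lg

# For n = 1024 (lg n = 10) the adjustable range is roughly 14.4 to 20.
print(height_bounds(1024))
```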
– Chapter 7 shows the study of machine learning and deep learning algorithms with detailed and analytical comparisons, which help new and inexperienced medical professionals or researchers in the medical field. The proposed machine learning model has an accurate algorithm that works with rich healthcare data, a high-dimensional data handling system, and an intelligent framework that uses different data sources to predict heart disease. This chapter uses an ensemble-based deep learning model with optimal feature selection to improve accuracy.
– Chapter 8 presents an IoT-based automated fire control system in a mining area which will help to protect many valuable lives whenever an accident occurs due to fire. In the experimental application, different types of sensors for temperature, moisture, and gas are used to sense the different environmental data.
– Chapter 9 offers an aspect identification method for sentiment sentences in review documents. The chapter describes two key tasks—one for extracting significant features from the reviews and another for identification of degrees of product reviews.
– Chapter 10 shows the research that paved the way for semantic technology. It then describes each of the semantic pillars with examples and explanations of the business value of each technology.
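To give a flavor of the kind of inference the chapter illustrates (Figure 10.5 derives that Mary_Doe has_Aunt Sarah_Doe), here is a toy Python triple store with one SWRL-style rule. The data, the `match` helper, and the rule are hypothetical sketches; a real semantic stack would represent these as RDF triples and run the rule via SWRL or SPARQL over a store such as rdflib:

```python
# A toy triple store: each fact is a (subject, predicate, object) tuple.
triples = {
    ("Mary_Doe", "has_parent", "John_Doe"),
    ("John_Doe", "has_sister", "Sarah_Doe"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [(a, b, c) for (a, b, c) in triples
            if (s is None or a == s)
            and (p is None or b == p)
            and (o is None or c == o)]

# SWRL-style rule: has_parent(x, y) and has_sister(y, z) imply has_Aunt(x, z).
for (x, _, y) in match(p="has_parent"):
    for (_, _, z) in match(s=y, p="has_sister"):
        triples.add((x, "has_Aunt", z))

print(match(p="has_Aunt"))  # [('Mary_Doe', 'has_Aunt', 'Sarah_Doe')]
```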
– Chapter 11 describes the ontology evaluation tools and then focuses on the evaluation of the security ontologies. The existing ontology evaluation tools are classified under two categories; namely, domain-dependent ontology evaluation tools and domain-independent ontology evaluation tools. The evaluation of security ontology assesses the quality of ontology among the available ontologies.
– Chapter 12 discusses the main concepts of health data, data science, health data science, examples of the application of health data science and related challenges. In addition, it also highlights the application of semantic technologies in health data science and the challenges that lie ahead of using these technologies.
– Chapter 13 proposes an original hybrid optimization approach based on two different mixed integer programming statements. The first statement minimizes the sum of pairwise distances between all objects in a cluster (PDC clustering), while the second minimizes the total distance from objects to cluster centers (CC clustering). Computational experiments showed that the hybrid method developed for solving the clustering problem combines the advantages of both approaches: the speed of the k-means method and the accuracy of PDC clustering. This makes it possible to eliminate the main drawback of k-means, namely the lack of a guarantee that the global optimum is found.
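The two objective functions contrasted above can be stated compactly. The Python sketch below is illustrative only (function names are my own and Euclidean distance is assumed); it evaluates the CC and PDC criteria for a given labeled point set:

```python
import itertools
import math

def cc_objective(points, labels, centers):
    """CC clustering criterion: total distance from each object to the
    center of its assigned cluster."""
    return sum(math.dist(p, centers[l]) for p, l in zip(points, labels))

def pdc_objective(points, labels):
    """PDC clustering criterion: sum of pairwise distances between all
    objects that share a cluster."""
    return sum(math.dist(p, q)
               for (p, lp), (q, lq) in itertools.combinations(zip(points, labels), 2)
               if lp == lq)

# Two tight clusters of two points each.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
lbl = [0, 0, 1, 1]
ctr = [(0, 0.5), (10, 0.5)]
print(cc_objective(pts, lbl, ctr), pdc_objective(pts, lbl))
```

A hybrid scheme in the chapter's spirit would use the fast CC/k-means objective to seed candidate partitions and the PDC objective to assess them more accurately.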
– Chapter 14 uses a model for the analysis of time series data which highly depend on the novel coronavirus 2019. This model predicts the future trend of confirmed, recovered, active, and death cases based on the available data from January 22, 2020 to May 29, 2021. The present model predicted the spread of COVID-19 for a future period of 30 days. The RMSE, MSE, MAE, and MdAPE metrics are used for the model evaluation.
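The evaluation metrics named above are standard. A minimal Python rendering (assuming MdAPE denotes the median absolute percentage error, as is conventional; the chapter may define it differently) might look like:

```python
import math
import statistics

def mse(actual, predicted):
    """Mean squared error."""
    return statistics.fmean((a - p) ** 2 for a, p in zip(actual, predicted))

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(mse(actual, predicted))

def mae(actual, predicted):
    """Mean absolute error."""
    return statistics.fmean(abs(a - p) for a, p in zip(actual, predicted))

def mdape(actual, predicted):
    """Median absolute percentage error, in percent."""
    return statistics.median(abs((a - p) / a) * 100 for a, p in zip(actual, predicted))

# Example: errors of 10 on actuals of 100 and 200.
print(mse([100, 200], [110, 190]), rmse([100, 200], [110, 190]))
```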
– Chapter 15 focuses on systems that incorporated real-world data utilized by actual users. It first describes a new methodology for the survey and then covers the various domains where semantic technology can be applied and some of the most impressive systems developed in each domain.
Finally, the editors would like to sincerely thank all the authors and reviewers who contributed their time and expertise for the successful completion of this book. The editors also appreciate the support given by Scrivener Publishing, which allowed us to compile the first edition of this book.
The Editors
Archana Patel
Narayan C. Debnath
Bharat Bhusan
June 2022
Karthika N.1*, Sheela J.1 and Janet B.2
1Department of SCOPE, VIT-AP University, Amaravati, Andhra Pradesh, India
2Department of Computer Applications, National Institute of Technology, Tiruchirappalli, India
Abstract
Data is a very important component of any organization. According to the International Data Corporation, global data will reach 175 zettabytes by 2025. Organizations need data to help them make careful business decisions, but data is worthless until it is transformed into valuable information. Data science plays a vital role in processing and interpreting data. It focuses on the analysis and management of data and is concerned with obtaining useful information from large datasets. It is applied across a wide range of industries, including healthcare, marketing, banking, finance, and policy work, enabling companies to make informed decisions about growth, optimization, and performance. In this brief monograph, we address the following questions.
What is data science, and what does a data scientist do? Why is data science in demand? What is the history of data science, and how does it differ from business intelligence? What are the data science life cycle and its components? Why is data science important? What are the challenges of data science, the tools used for it, and its benefits and applications?
Keywords: Data science, history, lifecycle, components, tools
Data is a very important component of any organization. According to the International Data Corporation, global data will reach 175 zettabytes by 2025. Organizations need data to help them make careful business decisions, but data is worthless until it is transformed into valuable information. Data science plays a vital role in processing and interpreting data. It focuses on the analysis and management of data and is concerned with obtaining useful information from large datasets. It is applied across a wide range of industries, including healthcare, marketing, banking, finance, and policy work, enabling companies to make informed decisions about growth, optimization, and performance. In a nutshell, data science is an integrative strategy for deriving actionable insights from the massive and ever-increasing data sets of today's organizations. Preparing data for analysis and processing, performing advanced data analysis, and presenting the findings to expose trends and allow stakeholders to make educated decisions are all part of data science [1, 2]. Data science experts are data-driven individuals with advanced technical capabilities who can construct complicated quantitative algorithms to organize and interpret huge amounts of data in order to answer questions and drive strategy in their company. These skills are combined with the communication and leadership abilities required to deliver tangible results to the many stakeholders throughout a company or organization. Data scientists must be inquisitive and results-driven, with strong industry-specific expertise and communication skills that enable them to convey highly technical outcomes to nontechnical colleagues. To create and analyze algorithms, they need a solid quantitative background in statistics and linear algebra, as well as programming experience with a focus on data warehousing, mining, and modeling [3].
Data science is the branch of science concerned with the discovery, analysis, modeling, and extraction of useful information, and it has become a buzzword in many companies. Firms are increasingly aware that they have been sitting on treasure mines of data; the priority with which this data must be analyzed, and ROI generated, is obvious. Below we look at the most important reasons that data science professions are in high demand [4].
Data Organization
During the mid-2000s IT boom, the emphasis was on "lifting and shifting" offline business operations into automated computer systems. Digital content generation, transactional data processing, and data log streams have all been constant throughout the last two decades. As a result, every company now has a plethora of information that it believes could be valuable but does not know how to use. This is apparent in Glassdoor's recent analysis, which identifies the 50 greatest jobs of the modern era.
Scarcity of Trained Manpower
According to a McKinsey Global Institute study, by 2018 the United States would be short 190,000 data scientists and 1.5 million managers and analysts who could properly comprehend and make judgments based on Big Data. The need is particularly great in India, where the tools and techniques are available but there are not enough qualified people. Data scientists, who can perform analytics, and analytics consultants, who can analyze and apply data, are the two sorts of talent shortages, according to Srikanth Velamakanni, co-founder and CEO of Fractal Analytics: "The supply of talent in these fields, particularly data scientists, is extremely limited, and the demand is enormous."
The Pay Is Outstanding
A data science position is currently one of the highest paying in the market. According to Glassdoor, the national average income for a data scientist/analyst in the United States is more than $62,000. In India, pay is heavily influenced by experience; those with the appropriate skillset can earn up to 19 LPA (source: PayScale).
The “X” Factor
A data scientist's major responsibilities are exceptional and specific to the position. Because of the nature of the profession, data scientists can flourish in their careers by integrating analytical expertise across diverse areas such as big data, machine learning, and so on. This vast knowledge base gives them an unsurpassed reputation, or X-factor.
Data Scientists’ Democratization
Tech behemoths are not the only ones who need data scientists. According to a Harvard Business Review report issued several years ago, "organizations in the top third of their industry in the use of data-driven decision making were, on average, 5% more productive and 6% more profitable than their competitors." Even mid-sized and small organizations have been driven to adopt data science as a result. In truth, many small businesses are willing to hire entry-level data scientists at a fair wage. This works well for both sides: the scientist can further develop his or her skills, and the company pays less than it otherwise would.
Fewer Barriers for Professionals
Data science is open to a wide range of experts from varied backgrounds because it is a relatively new discipline. Math/statistics, computer science and engineering, and natural science are all areas of knowledge for today’s data scientists. Some perhaps have social science, economics, or business degrees. They have all devised a problem-solving technique and improved their skills through formal or online education.
Abundance of Jobs
Data science is employed across business sectors, from production to healthcare and from information technology to finance, so there are plenty of data science jobs available for individuals who are interested and willing to put in the effort. This is true not only in terms of industries but also in terms of geography. Regardless of one's location or current domain, data science and analytics are open to everybody.
A Wide Range of Roles
Even though "data science" is indeed a broad term, numerous subroles fall under its scope. Data scientists, data architects, business intelligence engineers, business analysts, data engineers, database administrators, and data analytics managers are all in considerable demand.
The term "data science" was coined relatively recently to describe a new profession concerned with making sense of large volumes of data. Making sense of data, however, has a long history and has been addressed for years by computer scientists, scientists, librarians, statisticians, and others. The history below shows how the term "data science" evolved over time, along with attempts to describe it and associated concepts [5].
In 1974, Peter Naur's book gave a broad overview of the contemporary data processing techniques employed in a variety of applications. It is organized around the data principle of the IFIP Guide to Data Processing Concepts and Terms: "Data is a codified representation of ideas or facts that may be communicated and perhaps changed by some process." According to the book's preface, a course plan titled "Datalogy, the science of data and data processes, and its place in education" was presented at the 1968 IFIP Congress, and the term "data science" has been used since then. Data science, according to Naur, is "the science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences."
In 1977, the International Association for Statistical Computing (IASC) was founded as an ISI chapter. "The goal of the IASC," says the organization, "is to connect conventional statistical techniques, innovative computer technology, and domain specialists' skills to transform data into knowledge and information." In 1989, the first Knowledge Discovery in Databases (KDD) workshop was organized and chaired by Gregory Piatetsky-Shapiro; Figure 1.1 shows the proceedings of this IJCAI workshop. In 1995, it became the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). In September 1994, an article on "Database Marketing" appeared in Business Week: "Enterprises are acquiring vast amounts of data on you, processing it to determine how likely you are to purchase a product, and then using that intelligence to design a marketing strategy precisely tuned to convince you to do so... A prior spike of anticipation in the 1980s, sparked by the extensive use of checkout scanners, resulted in severe disappointment: several organizations were overwhelmed by the vast amounts of data and were unable to do anything valuable with them... Despite this, many corporations recognize that they have no option but to enter the database marketing arena."
In 1997, the journal Data Mining and Knowledge Discovery was launched; the reversal of the two terms in its title emphasized the rise of "data mining" as the standard term for "extracting information from vast data sets." In December 1999, Jacob Zahavi noted in Knowledge@Wharton's "Mining Data for Nuggets of Knowledge": "Existing statistical procedures perform effectively with relatively small data sets. Today's databases, however, can involve millions of rows and scores of columns of data... In data mining, scalability is a major concern. Another technical issue is building models that can do a better job of analyzing data, recognizing nonlinear relationships and interactions among elements... Specialized data mining techniques may need to be developed to address web-site decisions."
Figure 1.1 Proceeding of IJCAI-workshop.
In 2001, William S. Cleveland published "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics," a plan to "expand the key areas of technological endeavor in statistics." Because the plan is ambitious and requires significant change, the new area will be dubbed "data science." Cleveland compares and contrasts the proposed new discipline with computer science and contemporary data mining research. Also in 2001, Leo Breiman published "Statistical Modeling: The Two Cultures": there are two contrasting cultures in the use of statistical modeling to draw conclusions from data. One assumes the data are generated by a given stochastic data model.
The statistical community has been devoted to using data models nearly exclusively. This emphasis has resulted in irrelevant theory and questionable conclusions, and has kept statisticians from a wide spectrum of current problems. Algorithmic modeling has advanced significantly, in theory and in practice, in domains other than statistics; it can be used on huge, challenging data sets as well as on small ones, as a more useful and reliable alternative to data modeling. If we want to use data to solve problems as a field, we ought to move away from exclusive reliance on data models and use a wider range of tools. In 2002, the Committee on Data for Science and Technology (CODATA) of the International Council for Science (ICSU) launched the Data Science Journal. In January 2003, the Journal of Data Science was launched: "By data science, we mean almost everything that has to do with data: acquiring, examining, and modeling. However, the most important feature is its applicability: it may be used in a wide range of situations." The journal is primarily concerned with the application of statistical methods in general, and it allows all data scientists to present their viewpoints and exchange ideas. The data science Venn diagram is depicted in Figure 1.2.
In September 2010, Chris Wiggins and Hilary Mason write in "A Taxonomy of Data Science": one possible taxonomy of what a data scientist does may be, in roughly chronological order: obtain, scrub, explore, model, and interpret. Data science is evidently a mixture of the hacker arts, statistics, and machine learning, together with the math and subject-matter knowledge needed to make the analysis interpretable. It necessitates creative thinking and a willingness to learn. "To become a properly trained data scientist, one needs to comprehend a great deal," writes Drew Conway in "The Data Science Venn Diagram," also from September 2010. "Regrettably, simply citing literature and teachings is insufficient to untangle the tangles."
Figure 1.2 Data science Venn diagram.
In May 2011, Pete Warden writes in his article "Why the term 'data science' is flawed but useful" that there is no commonly agreed-upon boundary between what is inside and outside the purview of data science. Rather than choosing a topic at the outset and later gathering data to shed light on it, data scientists tend to be more concerned with what the data can disclose, and then pick fascinating strands to pursue. In June 2011, Matthew J. Graham presented "The Art of Data Science" at the Astrostatistics and Data Mining in Large Astronomical Databases workshop. "We need to learn new abilities to succeed in the modern data-intensive world of twenty-first-century science," he says. "We should understand how [data] is perceived and expressed, as well as how it relates to physical space and time."
Figure 1.3 Job growth on analytics and data science.
The title "business analyst" felt too constricting. "Data analyst" was a candidate, but we believed it limited what people could do; after all, many members of our teams were highly skilled engineers. The phrase "research scientist" was used as a job title by firms including Sun, Yahoo, HP, IBM, and Xerox. We assumed, however, that most research scientists worked on futuristic and abstract issues in labs set apart from the product development groups, and that it could take years for lab research to have an impact on key products. Instead, our teams concentrated on building data applications that could have an immediate and significant influence on the firm. "Data scientist" seemed to be the best fit: people who combine data and science to produce something new [6]. Figure 1.3 shows the job growth in analytics and data science.
Business intelligence (BI) is a method of analyzing descriptive data using technology and expertise in order to make informed business decisions [7]. The BI toolkit is used to collect, govern, and transform data, and it allows internal and external stakeholders to share data, making decision-making easier. In short, business intelligence is the process of extracting useful information from data. Some of the things BI can help with are:
Developing a better grasp of the marketplace
Identifying new revenue streams
Enhancing business procedures
Keeping one step ahead of the competition
Cloud computing has been the most important facilitator of BI in recent years. The cloud has enabled organizations to process more data, from more sources, and more efficiently than they could before cloud technologies were introduced.
Data science vs. business intelligence: it is beneficial to understand the differences between data science and business intelligence, and also how they work together. It is not a question of choosing one over the other [8]; it all boils down to picking the proper solution to obtain the information you need, which most of the time means combining data science and business intelligence. The simplest way to distinguish the two is to think of data science in terms of the future and business intelligence in terms of the past and present: data science is concerned with predictive and prescriptive analysis, whereas business intelligence is concerned with descriptive analysis. Scope, data integration, and skill set are other differentiating considerations [9]. Table 1.1 shows the comparison between data science and business intelligence.
Table 1.1 Comparison of data science and business intelligence.
Factor | Data science | Business intelligence
Concept | A discipline that employs mathematics, statistics, and other methods to uncover hidden patterns in data. | A collection of technologies, applications, and processes that businesses employ to analyze business data.
Focus | It concentrates on the future. | It concentrates on the past and present.
Data | It can handle both structured and unstructured data. | It is mostly concerned with structured data.
Flexibility | More adaptable, since data sources can be added as needed. | Less flexible, because data sources must be planned ahead of time.
Method | It employs the scientific method. | It employs the analytic method.
Complexity | More sophisticated than business intelligence. | Much simpler than data science.
Expertise | Its practitioners are data scientists. | Its users are business users.
Questions | It addresses what will happen and what might happen. | It is concerned with what occurred.
Tools | SAS, BigML, MATLAB, Excel, and other tools. | InsightSquared Sales Analytics, Klipfolio, ThoughtSpot, Cyfe, TIBCO Spotfire, and other solutions.
The data science life cycle contains the following steps:
a)
Gathering Data:
The first step of the data science life cycle is to collect data from the available sources. Languages like Python and R provide packages that can read data from different sources. Another option is to use Web APIs to extract data; users of social media sites such as Facebook and Twitter access data this way via web servers. The most common method of collection, however, is straight from files, for example by downloading datasets from Kaggle in Comma Separated Values (CSV) or Tab Separated Values (TSV) format. These are flat text files, so a parser is required to read them. Data may also come from databases such as MongoDB, Oracle, MySQL, DB2, and PostgreSQL.
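As a minimal sketch of this step (the field names and records are hypothetical), the snippet below parses a small CSV payload with Python's standard csv module, the same way a flat file downloaded from Kaggle would be read; in practice the io.StringIO stand-in would be replaced by an open file handle.

```python
import csv
import io

# A tiny CSV payload standing in for a downloaded file.
# (Hypothetical data; in practice you would open("customers.csv") instead.)
raw = """name,age,signed_up
Alice,34,2021-03-01
Bob,29,2021-05-17
"""

# CSV and TSV files are flat text, so a parser such as csv.DictReader
# is needed to turn each line into a usable record.
rows = list(csv.DictReader(io.StringIO(raw)))

print(len(rows))        # number of records gathered
print(rows[0]["name"])  # first record's name field
```

The same dictionaries-per-row shape is what a Web API response or a database cursor would typically be normalized into before the cleaning step.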
b)
Cleaning Data:
The following stage is to clean the data, which entails scrubbing in addition to filtering. This may require converting the data to a different layout, which is necessary for subsequent processing and analysis. If the files are web logs, the lines of these files must also be filtered. Furthermore, cleaning entails removing and replacing values; the replacement of missing values must be done carefully, as missing entries may masquerade as valid values. Columns are also split, combined, and removed.
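The filtering and replacement described above can be sketched in a few lines of plain Python (the records and the "UNKNOWN" placeholder are hypothetical choices for illustration):

```python
# Minimal cleaning sketch over hypothetical gathered records:
# rows with a missing age are filtered out, and missing names are
# replaced with a placeholder rather than silently kept.
records = [
    {"name": "Alice", "age": "34"},
    {"name": "",      "age": "29"},  # missing name -> replace
    {"name": "Carol", "age": ""},    # missing age  -> drop
]

cleaned = []
for rec in records:
    if rec["age"] == "":                 # filter: key value absent entirely
        continue
    name = rec["name"] or "UNKNOWN"      # replace missing values carefully
    # convert to a different layout: age becomes an integer for analysis
    cleaned.append({"name": name, "age": int(rec["age"])})

print(cleaned)
```

Whether to drop a row or impute a value is a per-column judgment call; the point of the sketch is that each rule is made explicit rather than left to chance.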
c)
Exploring Data:
Before the data can be used, it must first be explored. In a business context, it is up to the data scientist to turn the available data into something usable in a corporate setting, so the first step is to explore the data. The data and its qualities must be inspected, because different kinds of data (ordinal, nominal, numerical, and categorical) require distinct treatment. Next, descriptive statistics are computed so that important variables can be evaluated and features can be extracted. Correlation is used to examine the important variables, although correlation between variables does not imply causation. "Feature" is a term used in machine learning; features help data scientists identify the attributes that represent the data in question, such as a person's "name," "gender," and "age." In addition, data visualization is used to highlight critical trends and patterns; simple tools like bar and line charts can help people understand the value of data [10].
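The descriptive statistics and correlation check mentioned above can be sketched with the standard library alone (the age and spend values are hypothetical; the Pearson helper is written out so no third-party package is assumed):

```python
import statistics

# Hypothetical numeric features for five customers.
age   = [23, 35, 41, 29, 52]
spend = [120, 260, 310, 190, 400]

# Descriptive statistics let the important variables be evaluated.
mean_age = statistics.mean(age)
std_age  = statistics.stdev(age)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Strong positive correlation here, but correlation is not causation.
r = pearson(age, spend)
print(mean_age, round(r, 3))
```

A correlation this close to 1 would flag "age" as a variable worth keeping as a feature for the modeling step, subject to the causation caveat above.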
d)
Modeling Data:
The modeling phase follows the crucial stages of data cleaning and exploration and is frequently regarded as the most fascinating part of the data science life cycle. The first step in modeling is to reduce the number of dimensions in the data set; not every value and feature is required to predict the results. At this point, the data scientist must select the critical features that will best aid the model's predictions. Modeling entails a variety of tasks. Through logistic regression, models may be trained to classify, for example sorting emails into "Primary" and "Promotions." Linear regression can be used to forecast values. Data can also be clustered in order to understand the logic behind the resulting segments; e-commerce consumers, for example, are grouped so that their behavior on a particular e-commerce site can be studied. Hierarchical clustering, K-Means, and other clustering techniques make this possible. Prediction and regression are thus the most used tools for classification and identification, forecasting values, and grouping via clustering.
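To make the clustering idea concrete, here is a minimal one-dimensional K-Means sketch in pure Python (the spend values, the two initial centroids, and the two-segment choice are all hypothetical; a real project would use a library implementation over many features):

```python
# A minimal 1-D K-Means sketch over hypothetical monthly spend values,
# illustrating how e-commerce consumers might be grouped into two
# segments for behavioral study.
points = [12, 15, 14, 80, 85, 90]
centroids = [12.0, 90.0]  # initial guesses, one per segment

for _ in range(10):  # repeat assign/update; converges quickly here
    clusters = [[], []]
    for p in points:
        # assignment step: attach each point to its nearest centroid
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # update step: move each centroid to the mean of its cluster
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # one low-spend and one high-spend segment center
```

The alternation of an assignment step and an update step is exactly what K-Means does in any number of dimensions; hierarchical clustering replaces this loop with successive merging of the closest groups.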
e)
Interpreting Data:
The final and most crucial stage of the life cycle is data interpretation: interpreting the data and the models. The power of every prediction model lies in its ability to generalize, that is, how well it handles future data, which is ambiguous and unknown [11]. Data interpretation refers to presenting the data to a layperson with no technical knowledge of it. The delivered outcomes answer the business questions posed at the start of the life cycle, combined with the actionable insights uncovered through the data science life cycle process [12].
Data science can provide both predictive and prescriptive insights, and actionable insight is an important way of showcasing this: it enables you to reproduce a favorable result and avoid a negative one. These findings must also be well visualized and grounded in the original business questions. The most important part is condensing all the information so that it is truly useful to the company.
The primary components or processes of data science are depicted by Figure 1.4:
a)
Data Exploration:
Data exploration is the most crucial phase because it takes the most time, around 70% of the total. Data is the most important component of data science, yet we rarely receive it in a well-formatted state; it contains a significant amount of noise, i.e., unnecessary data. So what do we do in this step? We verify the observations (rows) and characteristics (columns) and use statistical methods to reduce noise, including data sampling and transformation. This stage is also used to evaluate the relationships between the various characteristics in the data set, i.e., whether the features are dependent on or independent of one another, and whether missing values exist [13]. In a nutshell, the data is transformed and prepared for use, which makes this one of the most time-consuming operations.
Figure 1.4 Components of data science.
b)
Modeling:
Once the data is prepared and suitable for use, the second phase is to put machine learning algorithms to work. At this point the data is actually fitted into a model, chosen depending on the sort of data and the business need; for example, a recommendation model used to suggest an article to a consumer. Once the model is built, we fit the data into it.
c)
Testing the Model:
This is the next step, and it is critical to the model's success. The model is tested with test data to ensure that it is accurate and has other desirable properties, and necessary changes are made to achieve the intended result. If the required accuracy is not achieved, we can return to the modeling step, choose a new model, and then repeat the modeling and testing steps until we find the model that best suits the business needs [14].
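The accuracy check and the "return to modeling" decision can be sketched as follows (the labels, predictions, and the 0.70 threshold are hypothetical stand-ins for a real held-out test set and a real business requirement):

```python
# Sketch of the testing step with hypothetical held-out labels: score
# the model on test data and decide whether to return to modeling.
actual    = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth test labels
predicted = [1, 0, 1, 0, 0, 1, 0, 1]   # model outputs on the test set

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(accuracy)

REQUIRED_ACCURACY = 0.70  # threshold set by the business requirements
if accuracy < REQUIRED_ACCURACY:
    print("Below target: return to the modeling step and try a new model.")
```

In practice accuracy is only one of the desirable properties mentioned above; precision, recall, or a business-specific cost metric would be checked against thresholds in the same way.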
d)
Deploying Models:
We finalize the model that gives the best results based on the testing findings and deploy it in the production environment once we achieve the desired result through rigorous testing in accordance with the business requirements.
Data is important, and so is the science of deciphering it. Data is being generated in billions of bytes, and its worth has now surpassed that of oil. The role of the data scientist is, and will continue to be, critical for firms across a variety of industries.
Data without science is nothing.
It is necessary to interpret and analyze data.
This emphasizes the importance of having high-quality data and knowing how to read it and produce data-driven discoveries.
Data will aid in the development of improved consumer experiences.
In the case of commodities and products, data science will use machine learning to help corporations invent and produce products that people will love. A good recommendation system, for example, can help an e-commerce company find its consumer personas by examining their past purchases.
Data will be used across verticals.
Data science is not just about consumer goods, technology, or healthcare. From banking to transportation to manufacturing, there will be a huge demand for data scientists to streamline corporate processes. As a result, everyone interested in becoming a data scientist will have a whole new universe of possibilities open to them. Data is the way of the future, and improving data science is critical to improving marketing. Big data and data science are critical components of future progress [15]. The data science process entails analyzing, visualizing, extracting, managing, and storing data in order to generate significant analytical insights. These data-driven insights and reports help businesses analyze their marketing strategy, make more effective data-driven decisions, and create more effective advertisements.
Data is collected from social media, phone data, e-commerce sites, healthcare surveys, internet searches, and other fields and platforms. As the amount of available data grows, a new field of study known as big data (exceptionally huge data sets) emerges, which can aid the development of better operational tools in a variety of fields. Collaboration with financial service providers, which employ technology to create and enhance traditional financial goods and services, allows easy access to ever-growing data sets. Emerging financial technology solutions, like cloud computing and storage, generate still more data, which can be shared easily across institutions [16]. Companies, on the other hand, may find that interpreting large amounts of unstructured data for effective decision making is extremely difficult and time-consuming. Data science has arisen throughout the world to deal with such annoyances.
Incorporating data science into a commercial organization poses extra hurdles in addition to the analytical ones. We have compiled a list of the most common issues and difficulty areas that arise throughout a data science project, both organizationally and technically [17].
Individual, "lone wolf" data scientists are giving way to teams with highly specialized expertise. When data science tasks are addressed as a collaborative effort, the main problem for projects is coordination: poorly coordinated procedures result in confusion, inefficiency, and errors. This absence of effective synchronization occurs both within and across data analytics teams. Apart from a lack of direction, there are apparent collaboration challenges, including an absence of open communication among the three primary stakeholders: the customer, the analytics team, and the IT department. Analytics teams struggle, for example, to deploy effectively to production, to cooperate with the IT department, and to communicate data science to business associates. There is also a lack of business support, in the sense that there is not enough business data, and sometimes not enough domain knowledge, to produce respectable results. In general, the data analytics team, including its data scientists, has difficulty collaborating effectively with the IT department and business representatives. Furthermore, researchers emphasize ineffective data analytics management practices and inadequate management, as well as a lack of top-management sponsorship, and assert that working in perplexing, chaotic surroundings can be difficult and can reduce team members' drive and capacity to focus on project goals [18].
In other words, there are issues with assembling the right team for the job, together with a scarcity of people with analytical talents. Because of this scarcity of specialist analytical labor, every major institution has started a new analytics, big data, or data science division. In this context, a multidisciplinary group is necessary: success in data science projects requires data science, technological, business, and management expertise. Due to the lack of a complete team-based strategy and process immaturity, data science teams often rely heavily on the senior data scientist. Such work has extremely uncertain inputs and results; it is frequently ad hoc, with a lot of back-and-forth among team members and trial-and-error to find the correct analysis tools and settings. Because projects are experimental in nature, it can be difficult to set appropriate expectations, create realistic project timelines, and anticipate how long tasks will take to finish. In this regard, the emphasis is on the difficulty of determining the project's scope and risk exposure, as well as the difficulty of comprehending the business goals. The writers specifically point out the lack of clear business goals, inadequate ROI or business cases, and inappropriate project choices.
There is a disproportionate focus on technological difficulties, which has hindered firms' capacity to realize the full promise of data analytics. Data scientists may become fascinated with achieving state-of-the-art outcomes on benchmark tasks rather than focusing on the business challenge, yet striving for a tiny boost in performance can yield models too complex to be effective. This approach works well in data science competitions, like Kaggle, but not in the real world [19].
Analytics vs. stakeholders: furthermore, the project concept is sometimes unclear, and there is a lack of involvement from the business side, which may only offer data and a smattering of domain expertise, assuming that the data analytics group will perform the rest of the "magic" on its own. Machine learning and deep learning techniques have created excessive expectations, leading to the false belief that these new technologies can do anything a company wants at a fair cost, which is not the case. The lack of involvement on the part of the company might also be due to a shortage of understanding between the two parties: data scientists may not comprehend the data's domain, and the business is typically unfamiliar with data analysis methods.