DATA SCIENCE HANDBOOK
This desk reference gives researchers working in various domains hands-on experience with the algorithms and popular techniques used in real-world data science.
Data Science is one of the leading research-driven areas in the modern era. It plays a critical role in healthcare, engineering, education, mechatronics, and medical robotics. Building models and working with data is not value-neutral. We choose the problems we work on, make assumptions in our models, and decide on metrics and algorithms for those problems. The data scientist identifies problems that can be solved with data and with expert modeling and coding tools.
The book starts with introductory concepts in data science such as data munging, data preparation, and transforming data. Chapter 2 discusses data visualization, drawing various plots and histograms. Chapter 3 covers mathematics and statistics for data science. Chapter 4 mainly focuses on machine learning algorithms in data science. Chapter 5 covers outlier analysis and the DBSCAN algorithm. Chapter 6 focuses on clustering. Chapter 7 discusses network analysis. Chapter 8 mainly focuses on regression and the naive Bayes classifier. Chapter 9 covers web-based data visualizations with Plotly. Chapter 10 discusses web scraping.
The book concludes with a section discussing 19 projects on various subjects in data science.
Audience
The handbook is intended for graduate students and research scholars in computer science and electrical engineering, as well as professionals in industries such as healthcare.
Page count: 193
Year of publication: 2022
Cover
Title Page
Copyright
Dedication
Acknowledgment
Preface
1 Data Munging Basics
1 Introduction
1.1 Filtering and Selecting Data
1.2 Treating Missing Values
1.3 Removing Duplicates
1.4 Concatenating and Transforming Data
1.5 Grouping and Data Aggregation
References
2 Data Visualization
2.1 Creating Standard Plots (Line, Bar, Pie)
2.2 Defining Elements of a Plot
2.3 Plot Formatting
2.4 Creating Labels and Annotations
2.5 Creating Visualizations from Time Series Data
2.6 Constructing Histograms, Box Plots, and Scatter Plots
References
3 Basic Math and Statistics
3.1 Linear Algebra
3.2 Calculus
3.3 Inferential Statistics
3.4 Using NumPy to Perform Arithmetic Operations on Data
3.5 Generating Summary Statistics Using Pandas and Scipy
3.6 Summarizing Categorical Data Using Pandas
3.7 Starting with Parametric Methods in Pandas and Scipy
3.8 Delving Into Non-Parametric Methods Using Pandas and Scipy
3.9 Transforming Dataset Distributions
References
4 Introduction to Machine Learning
4.1 Introduction to Machine Learning
4.2 Types of Machine Learning Algorithms
4.3 Exploratory Factor Analysis
4.4 Principal Component Analysis (PCA)
References
5 Outlier Analysis
5.1 Extreme Value Analysis Using Univariate Methods
5.2 Multivariate Analysis for Outlier Detection
5.3 DBSCAN Clustering to Identify Outliers
References
6 Cluster Analysis
6.1 K-Means Algorithm
6.2 Hierarchical Methods
6.3 Instance-Based Learning w/k-Nearest Neighbor
References
7 Network Analysis with NetworkX
7.1 Working with Graph Objects
7.2 Simulating a Social Network (i.e., Directed Network Analysis)
7.3 Analyzing a Social Network
References
8 Basic Algorithmic Learning
8.1 Linear Regression
8.2 Logistic Regression
8.3 Naive Bayes Classifiers
References
9 Web-Based Data Visualizations with Plotly
9.1 Collaborative Analytics
9.2 Basic Charts
9.3 Statistical Charts
9.4 Plotly Maps
References
10 Web Scraping with Beautiful Soup
10.1 The BeautifulSoup Object
10.2 Exploring NavigableString Objects
10.3 Data Parsing
10.4 Web Scraping
10.5 Ensemble Models with Random Forests
References
Data Science Projects
11 COVID-19 Detection and Prediction
Bibliography
12 Leaf Disease Detection
Bibliography
13 Brain Tumor Detection with Data Science
Bibliography
14 Color Detection with Python
Bibliography
15 Detecting Parkinson’s Disease
Bibliography
16 Sentiment Analysis
Bibliography
17 Road Lane Line Detection
Bibliography
18 Fake News Detection
Bibliography
19 Speech Emotion Recognition
Bibliography
20 Gender and Age Detection with Data Science
Bibliography
21 Diabetic Retinopathy
Bibliography
22 Driver Drowsiness Detection in Python
Bibliography
23 Chatbot Using Python
Bibliography
24 Handwritten Digit Recognition Project
Bibliography
25 Image Caption Generator Project in Python
Bibliography
26 Credit Card Fraud Detection Project
Bibliography
27 Movie Recommendation System
Bibliography
28 Customer Segmentation
Bibliography
29 Breast Cancer Classification
Bibliography
30 Traffic Signs Recognition
Bibliography
End User License Agreement
Chapter 4
Table 4.1 Data set for Naïve Bayes computation.
Chapter 1
Fig 1.1 Stages of data preparation process.
Chapter 4
Fig 4.1 Plot of the points for equation y=a+bx.
Fig 4.2 Plot of the transformation function h(x).
Fig 4.3 Example of CART.
Fig 4.4 Rule defining for support, confidence and lift formulae.
Fig 4.5 Pictorial representation of working of k-means algorithm.
Fig 4.6 Construction of PCA.
Fig 4.7 Steps of AdaBoost algorithm.
Fig 4.8 Percentage of variance (information) explained by each PC.
Kolla Bhanu Prakash
K. L. University, Vaddeswaram, Andhra Pradesh, India
This edition first published 2022 by John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA and Scrivener Publishing LLC, 100 Cummings Center, Suite 541J, Beverly, MA 01915, USA
© 2022 Scrivener Publishing LLC
For more information about Scrivener publications please visit www.scrivenerpublishing.com.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
Wiley Global Headquarters
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials, or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read.
Library of Congress Cataloging-in-Publication Data
ISBN 978-1-119-85733-4
Cover image: Pixabay.Com
Cover design by Russell Richardson
Set in size of 11pt and Minion Pro by Manila Typesetting Company, Makati, Philippines
Printed in the USA
10 9 8 7 6 5 4 3 2 1
Dedicated to
My Parents
Sri. Kolla Narayana Rao
Smt. Kolla Uma Maheswari
And my Wife
Mrs. M. V. Prasanna Lakshmi
I would like to thank the Almighty and my parents for their endless support, guidance and love at every stage. I dedicate this book to my parents, family members and my wife.
I would especially like to thank Sri. Koneru Satyanarayana, President, K. L. University, India, for his continuous support and encouragement throughout the preparation of this book.
I also thank my student Ms. Vishnu Priya, who gave her time and effort to support and contribute to this work. I would like to express my gratitude to all who supported, shared, talked things over, read, wrote, offered comments, allowed me to quote remarks and assisted in editing, proofreading and design throughout this book's journey. I offer my sincere thanks to the open data set providers.
I am grateful to Mr. Martin Scrivener and the Wiley-Scrivener Publishing team, who showed me the ropes of creating this book. Without that knowledge I would not have ventured into starting the project that ultimately led to this book. Their trust, their guidance and the time and resources they provided gave me the freedom to manage this book.
Last, but definitely not least, I'd like to thank my readers, who gave me their trust; I hope this work inspires and guides them.
Kolla Bhanu Prakash
Data Science is one of the leading research-driven areas in the modern era. It plays a critical role in healthcare, engineering, education, mechatronics and medical robotics. Building models and working with data is not value neutral. We choose the problems we work on, we make assumptions in our models and we decide on metrics and algorithms for those problems. The data scientist identifies problems that can be solved with data and with expert modelling and coding tools. The main aim of this book is to give researchers working in various domains hands-on experience with the different algorithms and popular techniques used in real-world data science.
The book starts with introductory concepts in data science such as data munging, data preparation and transforming data. Chapter 2 discusses data visualization, drawing various plots and histograms. Chapter 3 covers mathematics and statistics for data science. Chapter 4 mainly focuses on machine learning algorithms in data science. Chapter 5 covers outlier analysis and the DBSCAN algorithm. Chapter 6 focuses on clustering. Chapter 7 discusses network analysis. Chapter 8 mainly focuses on regression and the naive Bayes classifier. Chapter 9 covers web-based data visualizations with Plotly. Chapter 10 discusses web scraping. Various projects in data science are then discussed.
Kolla Bhanu Prakash
June 2022
Data gains value when it is transformed into useful information. Every firm is increasingly concerned with the data generated by all its assets. A firm's data helps personnel across the organization improve their business tasks and save time and maintenance costs. Top-level management fails to make appropriate decisions if it does not treat data as an important factor in understanding the business process. Many poor decisions about advertising a company's products lead to wasted resources and damage the organization's reputation at every level. Companies can avoid squandering money by tracking the success of their marketing channels and concentrating on the ones that provide the best return on investment. As a result, a business can get more leads for less money spent on advertising [1].
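Ranking marketing channels by return on investment can be sketched in a few lines of Python. This is only a minimal illustration, not a method prescribed by the text: the channel names, the spend and revenue figures, and the `roi` helper are all hypothetical, and ROI is assumed here to mean (revenue - spend) / spend.

```python
# Hypothetical spend and revenue per marketing channel (illustrative numbers only).
channels = {
    "email": {"spend": 1000.0, "revenue": 4000.0},
    "search_ads": {"spend": 5000.0, "revenue": 9000.0},
    "print": {"spend": 3000.0, "revenue": 2400.0},
}


def roi(spend, revenue):
    """Return on investment as a fraction: (revenue - spend) / spend."""
    if spend <= 0:
        raise ValueError("spend must be positive")
    return (revenue - spend) / spend


# Rank channels from best to worst ROI.
ranked = sorted(
    ((name, roi(v["spend"], v["revenue"])) for name, v in channels.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, value in ranked:
    print(f"{name}: ROI = {value:.0%}")
```

With these made-up figures, the channel with the smallest budget ranks first, which is the kind of signal that total revenue alone would hide.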
Data science is the study of discovering data patterns across interdisciplinary domains such as business, education, and research. Much of the information extracted is unstructured, such as text and images, while some is structured, such as tabular data. The basic functional features of data science include statistical techniques, inference rules, predictive analytics, fundamental machine learning algorithms, and novel methods for gleaning insights from huge data.
Business use cases in which data science serves customers in different domains:
A banking organization provides a mobile app that sends recommendations on various loan offers to its applicants.
A car manufacturing firm uses data science to build a 3-D printing screen that guides driverless cars by enabling more accurate object detection.
An automation solution provider using a cognitive approach develops an incident response system that detects failures in the functionality offered to its clients.
An audience analytics platform analyses the general viewing behaviour of channel subscribers and provides a solution for grouping their favoured TV channels.
A cyber police department uses statistical tools to analyse crime incidents occurring in a particular locality, using images captured from CCTV footage, and cautions citizens to beware of the criminals involved.
As part of building a smart health care system, body sensor information is used to analyse the health condition of elderly patients with memory loss or paralysis on behalf of their close relatives or caregivers, helping to safeguard them.
Data science adopts four popular strategies [8] while exploring data: (i) understanding the problem in the real world (probing reality); (ii) usage patterns of data (discovering patterns); (iii) building predictive data models for the future (predicting future events); and (iv) being empathetic in the business world (understanding people and the world).
(i) Understanding the problem in the real world (probing reality): Active and passive methods are used to collect data on a particular problem in a business process so that action can be taken. The responses collected during the business process are essential to the analysis behind an appropriate decision, and they lead to success in subsequent decisions.
(ii) Usage patterns of data (discovering patterns): A divide-and-conquer mechanism can be used to analyze complex problems, but it is not always the right solution unless the purpose of the data is understood. Much of the data is analyzed by clustering data usage patterns; this clustering approach helps in dealing with real-time digital marketing data.
(iii) Building predictive models (predicting future events): From the study of statistics it is clear that many mathematical techniques have evolved to analyze current data and predict the future. Predictive analysis helps decision making in current data collection scenarios, and predicting future outcomes adds valuable knowledge to the current data.
(iv) Being empathetic in the business world (understanding people and the world): The toughest task for any organization is building teams that understand the people in the real world who interact with the organization for multiple reasons. Optimal decision making is possible only by understanding the real-world scenarios in which data is generated during these interactions, which provides supporting evidence for framing the organization's decision-making strategy. Advanced techniques such as deep learning are used for visual object recognition in studying the real world.
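Strategy (iii) can be illustrated with about the simplest predictive model available: a straight line y = a + bx fitted by least squares to past observations and extended one step ahead. The sketch below is an example under stated assumptions only; the monthly sales figures and the `fit_line` helper are hypothetical, and real data would rarely be this clean.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


# Hypothetical monthly sales (illustrative numbers only).
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

a, b = fit_line(months, sales)
forecast = a + b * 6  # extrapolate the fitted trend to month 6
print(f"intercept={a}, slope={b}, forecast for month 6={forecast}")
```

A straight line is chosen here purely for brevity; any trend, seasonality or noise in real data would call for a richer model and for checking the fit before trusting the forecast.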
Simple business intelligence tools are adequate for analyzing structured data that is small in volume, and most of the data collected in traditional systems was structured. Today, data is generated from different sources such as financial reports, text files, multimedia, sensors, and instruments. Business intelligence solutions cannot deal with huge volumes of data in different complex formats. Processing such data requires high processing power together with improved analytical tools and algorithms that yield better insights, and this is done as part of data science.
In 1962, John Tukey published a paper on the convergence of statistics and computers, showing how they may provide measurable results in hours. In 1974, Peter Naur published the book Concise Survey of Computer Methods, in which he repeatedly used the term data science to refer to the processing of data through specific mathematical methods. In 1977, an international association was established for statistical processing of data with the purpose of translating data into knowledge by combining modern computer technology, traditional statistical techniques, and domain knowledge. Tukey released Exploratory Data Analysis in the same year, emphasizing the importance of data.
Businesses began collecting enormous volumes of personal data in anticipation of new advertising efforts as early as 1994. Jacob Zahavi emphasized the need for new technology to manage the large volume of data generated by different organizations. William S. Cleveland published an article outlining specialized learning methods and the scope for data scientists, which served as a case study for businesses and educational institutes.
In 2002, the International Council for Science launched a journal for data science. It focused on data science topics such as data systems modeling and its applications. In 2013, IBM claimed that much of the digital data collected worldwide had been generated in the previous two years. From then on, organizations planned to build up substantial data for their benefit in decision making and began gaining good insights to improve organizational growth.
According to IDC, global data will exceed 175 zettabytes by 2025. Data science allows businesses to swiftly interpret large amounts of data from a number of sources and turn that data into actionable insights for better data-driven decisions, and it is widely used in marketing, healthcare, finance, banking, policy work, and other fields. The market for data science platforms is expected to reach 178 billion dollars by 2025. Data science gives data scientists a platform for exploring many options that help business organizations track the latest developments in data gathering and maintenance for appropriate decision making.
Business intelligence (BI) is a decision-making process that draws insights from the current data an organization holds on its transactions with all its stakeholders. It gathers data from all sources, whether external or internal to the organization. BI tools support running queries and displaying results with good visualization mechanisms, for example by analysing the revenue earned in a given quarter in the face of business challenges. BI provides suggestions based on market study, revealing revenue opportunities and business process improvements. It is meant purely for building business strategies that earn long-term profits for the organization. Tools such as OLAP and data warehouse ETL are used for storing and visualizing data in BI.
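The kind of query a BI tool runs over transaction data, such as total revenue per quarter, can be sketched with Python's built-in `sqlite3` module. This is a toy stand-in for a data warehouse, not how any particular BI product works; the `sales` table, its columns, and every figure in it are hypothetical.

```python
import sqlite3

# In-memory database standing in for a data warehouse (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quarter TEXT, region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2021-Q1", "north", 120.0),
        ("2021-Q1", "south", 80.0),
        ("2021-Q2", "north", 150.0),
        ("2021-Q2", "south", 95.0),
    ],
)

# Aggregate revenue per quarter, the sort of summary a BI dashboard displays.
rows = conn.execute(
    "SELECT quarter, SUM(revenue) FROM sales GROUP BY quarter ORDER BY quarter"
).fetchall()
conn.close()

for quarter, total in rows:
    print(quarter, total)
```

The query summarises what has already happened; it does not predict anything, which is the usual line drawn between BI reporting and the predictive modelling of data science.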
Data science is a multi-disciplinary domain that studies data by extracting meaningful insights. It also uses data processing tools from machine learning and artificial intelligence to develop predictive models, which are then used to forecast the future growth of the functions a business organization carries out. Python and R are used to build predictive data models by implementing efficient machine learning algorithms, and the results are communicated through high-end visualization techniques.
Data science is a multi-disciplinary field that draws on artificial intelligence, machine learning and deep learning to uncover more insights from data in different forms, both structured (tabular data) and unstructured (text, images). It studies specific problem domains and finds or defines solutions from the usage patterns in the available input data, revealing good insights [1, 2].