Data Science Handbook - E-Book

Description

DATA SCIENCE HANDBOOK

This desk reference handbook gives researchers working in various domains hands-on experience with the algorithms and popular techniques used in real-world data science.

Data science is one of the leading research-driven areas of the modern era. It plays a critical role in healthcare, engineering, education, mechatronics, and medical robotics. Building models and working with data is not value-neutral: we choose the problems we work on, make assumptions in our models, and decide on the metrics and algorithms for those problems. The data scientist identifies problems that can be solved with data and with expert modeling and coding tools.

The book starts with introductory concepts in data science such as data munging, data preparation, and transforming data. Chapter 2 discusses data visualization, including drawing various plots and histograms. Chapter 3 covers mathematics and statistics for data science. Chapter 4 focuses mainly on machine learning algorithms in data science. Chapter 5 covers outlier analysis and the DBSCAN algorithm. Chapter 6 focuses on clustering. Chapter 7 discusses network analysis. Chapter 8 focuses mainly on regression and the naive Bayes classifier. Chapter 9 covers web-based data visualizations with Plotly. Chapter 10 discusses web scraping.

The book concludes with a section discussing 19 projects on various subjects in data science.

Audience

The handbook is intended for readers ranging from graduate students to research scholars in computer science and electrical engineering, as well as industry professionals in fields such as healthcare.

Page count: 193

Year of publication: 2022

Table of Contents

Cover

Title Page

Copyright

Dedication

Acknowledgment

Preface

1 Data Munging Basics

1 Introduction

1.1 Filtering and Selecting Data

1.2 Treating Missing Values

1.3 Removing Duplicates

1.4 Concatenating and Transforming Data

1.5 Grouping and Data Aggregation

References

2 Data Visualization

2.1 Creating Standard Plots (Line, Bar, Pie)

2.2 Defining Elements of a Plot

2.3 Plot Formatting

2.4 Creating Labels and Annotations

2.5 Creating Visualizations from Time Series Data

2.6 Constructing Histograms, Box Plots, and Scatter Plots

References

3 Basic Math and Statistics

3.1 Linear Algebra

3.2 Calculus

3.3 Inferential Statistics

3.4 Using NumPy to Perform Arithmetic Operations on Data

3.5 Generating Summary Statistics Using Pandas and Scipy

3.6 Summarizing Categorical Data Using Pandas

3.7 Starting with Parametric Methods in Pandas and Scipy

3.8 Delving Into Non-Parametric Methods Using Pandas and Scipy

3.9 Transforming Dataset Distributions

References

4 Introduction to Machine Learning

4.1 Introduction to Machine Learning

4.2 Types of Machine Learning Algorithms

4.3 Exploratory Factor Analysis

4.4 Principal Component Analysis (PCA)

References

5 Outlier Analysis

5.1 Extreme Value Analysis Using Univariate Methods

5.2 Multivariate Analysis for Outlier Detection

5.3 DBSCAN Clustering to Identify Outliers

References

6 Cluster Analysis

6.1 K-Means Algorithm

6.2 Hierarchical Methods

6.3 Instance-Based Learning with k-Nearest Neighbor

References

7 Network Analysis with NetworkX

7.1 Working with Graph Objects

7.2 Simulating a Social Network (i.e., Directed Network Analysis)

7.3 Analyzing a Social Network

References

8 Basic Algorithmic Learning

8.1 Linear Regression

8.2 Logistic Regression

8.3 Naive Bayes Classifiers

References

9 Web-Based Data Visualizations with Plotly

9.1 Collaborative Analytics

9.2 Basic Charts

9.3 Statistical Charts

9.4 Plotly Maps

References

10 Web Scraping with Beautiful Soup

10.1 The BeautifulSoup Object

10.2 Exploring NavigableString Objects

10.3 Data Parsing

10.4 Web Scraping

10.5 Ensemble Models with Random Forests

References

Data Science Projects

11 COVID-19 Detection and Prediction

Bibliography

12 Leaf Disease Detection

Bibliography

13 Brain Tumor Detection with Data Science

Bibliography

14 Color Detection with Python

Bibliography

15 Detecting Parkinson’s Disease

Bibliography

16 Sentiment Analysis

Bibliography

17 Road Lane Line Detection

Bibliography

18 Fake News Detection

Bibliography

19 Speech Emotion Recognition

Bibliography

20 Gender and Age Detection with Data Science

Bibliography

21 Diabetic Retinopathy

Bibliography

22 Driver Drowsiness Detection in Python

Bibliography

23 Chatbot Using Python

Bibliography

24 Handwritten Digit Recognition Project

Bibliography

25 Image Caption Generator Project in Python

Bibliography

26 Credit Card Fraud Detection Project

Bibliography

27 Movie Recommendation System

Bibliography

28 Customer Segmentation

Bibliography

29 Breast Cancer Classification

Bibliography

30 Traffic Signs Recognition

Bibliography

End User License Agreement

List of Tables

Chapter 4

Table 4.1 Data set for Naïve Bayes computation.

List of Illustrations

Chapter 1

Fig 1.1 Stages of data preparation process.

Chapter 4

Fig 4.1 Plot of the points for equation y=a+bx.

Fig 4.2 Plot of the transformation function h(x).

Fig 4.3 Example of CART.

Fig 4.4 Rule defining for support, confidence and lift formulae.

Fig 4.5 Pictorial representation of working of k-means algorithm.

Fig 4.6 Construction of PCA.

Fig 4.7 Steps of AdaBoost algorithm.

Fig 4.8 Percentage of variance (information) explained by each PC.

Data Science Handbook

A Practical Approach

Kolla Bhanu Prakash

K. L. University, Vaddeswaram, Andhra Pradesh, India

This edition first published 2022 by John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA and Scrivener Publishing LLC, 100 Cummings Center, Suite 541J, Beverly, MA 01915, USA

© 2022 Scrivener Publishing LLC

For more information about Scrivener publications please visit www.scrivenerpublishing.com.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

Wiley Global Headquarters

111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials, or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read.

Library of Congress Cataloging-in-Publication Data

ISBN 978-1-119-85733-4

Cover image: Pixabay.Com

Cover design by Russell Richardson

Set in size of 11pt and Minion Pro by Manila Typesetting Company, Makati, Philippines

Printed in the USA

Dedication

Dedicated to

My Parents
Sri. Kolla Narayana Rao
Smt. Kolla Uma Maheswari

And my Wife
Mrs. M. V. Prasanna Lakshmi

Acknowledgment

I would like to thank the Almighty and my parents for their endless support, guidance and love through all stages. I dedicate this book to my parents, family members and my wife.

I would like to specially thank Sri. Koneru Satyanarayana, President, K. L. University, India for his continuous support and encouragement throughout the preparation of this book.

I also thank my student Ms. Vishnu Priya, who gave her time and effort to support and contribute to this book. I would like to express my gratitude to all who supported, shared, talked things over, read, wrote, offered comments, allowed us to quote remarks and assisted in editing, proofreading and design throughout this book's journey. We offer our sincere thanks to the open data set providers.

I am grateful to Mr. Martin Scrivener and the Wiley-Scrivener Publishing team, who showed us the ropes of creating this book. Without that knowledge I would not have ventured into starting it. Their trust in us, their guidance, and the time and resources they provided gave us the freedom to manage this book.

Last, but definitely not least, I would like to thank my readers, who gave us their trust. I hope this work inspires and guides them.

Kolla Bhanu Prakash

Preface

Data science is one of the leading research-driven areas of the modern era. It plays a critical role in healthcare, engineering, education, mechatronics and medical robotics. Building models and working with data is not value-neutral. We choose the problems we work on, we make assumptions in our models and we decide on metrics and algorithms for the problems. The data scientist identifies problems that can be solved with data and with expert modelling and coding tools. The main aim of this book is to give researchers working in various domains hands-on experience with the different algorithms and popular techniques used in real-world data science.

The book starts with introductory concepts in data science such as data munging, data preparation and transforming data. Chapter 2 discusses data visualization, including drawing various plots and histograms. Chapter 3 covers mathematics and statistics for data science. Chapter 4 focuses mainly on machine learning algorithms in data science. Chapter 5 covers outlier analysis and the DBSCAN algorithm. Chapter 6 focuses on clustering. Chapter 7 discusses network analysis. Chapter 8 focuses mainly on regression and the naive Bayes classifier. Chapter 9 covers web-based data visualizations with Plotly. Chapter 10 discusses web scraping. Various projects in data science are then discussed.

Kolla Bhanu Prakash

June 2022

1 Data Munging Basics

1 Introduction

Data gains value when it is transformed into useful information. Every firm attaches great significance to the data generated by its assets. That data helps personnel across the organization improve their business tasks and save time and maintenance expenditure. Top-level management fails to take appropriate decisions if it does not treat data as an important factor in understanding the business process. Poor decisions about advertising a company's products waste resources and damage the organization's reputation at every level. Companies can avoid squandering money by tracking the success of their marketing channels and concentrating on the ones that provide the best return on investment. As a result, a business can get more leads for less money spent on advertising [1].
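As a rough illustration of tracking return on investment across marketing channels, the sketch below computes ROI as (revenue - spend) / spend and ranks the channels. The channel names and figures are hypothetical and are not taken from the book; they only show the arithmetic.

```python
# Hypothetical spend and revenue per marketing channel (illustrative figures only).
channels = {
    "email": {"spend": 1000.0, "revenue": 4000.0},
    "search_ads": {"spend": 5000.0, "revenue": 9000.0},
    "print": {"spend": 3000.0, "revenue": 2400.0},
}


def roi(spend, revenue):
    """Return on investment as a fraction: (revenue - spend) / spend."""
    return (revenue - spend) / spend


# Compute ROI for every channel, then rank channels from best to worst.
roi_by_channel = {name: roi(v["spend"], v["revenue"]) for name, v in channels.items()}
ranked = sorted(roi_by_channel, key=roi_by_channel.get, reverse=True)

for name in ranked:
    print(f"{name}: ROI = {roi_by_channel[name]:.0%}")
```

A negative ROI, as for the hypothetical print channel here, marks a channel that costs more than it returns.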

Data science is the study of discovering data patterns in interdisciplinary domains such as business, education, and research. Much of the information extracted is unstructured, such as text and images, while the rest is structured, as in tabular formats. The basic functional features of data science include statistical techniques, inference rules, predictive analytics, fundamental machine learning algorithms, and novel methods for gleaning insights from huge data sets.

The following business use cases apply data science to serve customers in different domains:

A banking organization provides a mobile app that sends recommendations on various loan offers to its applicants.

A car manufacturing firm uses data science to build a 3-D printing screen for guiding driverless cars by making the object detection mechanism more accurate.

An automation solution provider uses a cognitive approach to develop an incident response system that detects failures in the functionality offered to its clients.

An audience analytics platform analyses the general viewing behaviour of different channel subscribers and groups their favoured TV channels.

A cyber police department uses statistical tools to analyse crime incidents in a particular locality from images captured in CCTV footage and cautions citizens to be aware of the criminals involved.

A smart healthcare system uses body sensor information to analyse the health condition of elderly patients with memory loss or paralysis and keeps their close relatives or caregivers informed, helping to safeguard those patients.

Data science adopts four popular strategies [8] while exploring data: (i) understanding the problem in the real world (probing reality); (ii) usage patterns of data (discovering patterns); (iii) building predictive data models (predicting future events); and (iv) being empathetic in the business world (understanding people and the world).

(i) Understanding the problem in the real world: Active and passive methods are used to collect data on a particular problem in a business process so that action can be taken. All the responses collected during the business process are important for the analysis that supports an appropriate decision and leads to success in subsequent decisions.

(ii) Usage patterns of data (discovering patterns): A divide-and-conquer approach can be used to analyze complex problems, but it is not always the right solution unless the purpose of the data is understood. Much data is analyzed by clustering its usage patterns; this kind of clustering study helps in dealing with real-time digital marketing data.

(iii) Building predictive models (predicting future events): The study of statistics makes clear that many mathematical techniques evolved to analyze current data and predict the future. Predictive analysis helps decision making in the scenarios under which data is currently being collected, and predictions about the future add valuable knowledge to the current data.

(iv) Being empathetic in the business world (understanding people and the world): The toughest task for any organization is building teams that understand the people in the real world who interact with the organization for a variety of reasons. Optimal decision making is possible only by understanding the real-world scenarios in which data is generated during these interactions, which provides supporting evidence for framing the organization's decision-making strategy. Advanced techniques such as deep learning are used for visual object recognition in studying the real world.
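Strategy (ii), grouping data by usage pattern, can be sketched with a minimal k-means (Lloyd's algorithm) in plain Python. The data points, the choice of two clusters, and the starting centroids are all assumptions made for illustration; the source does not prescribe a particular clustering algorithm here.

```python
# Hypothetical usage data: (visits per week, minutes per visit) for six users.
points = [(1, 2), (2, 1), (1, 1), (9, 10), (10, 9), (10, 10)]


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Run Lloyd's algorithm from the given starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, clusters


# Starting centroids are an assumption: one light user and one heavy user.
centroids, clusters = kmeans(points, centroids=[(1, 2), (9, 10)])
print(centroids)
print(clusters)
```

With these starting centroids the light users and heavy users separate into two groups; a different initialization or number of clusters could give a different grouping.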

Purpose of Data Science

Simple business intelligence tools suffice for analyzing structured data that is small in volume, which is mostly what traditional systems collected. Much of the data collected today is unstructured and is generated from sources such as financial reports, text files, multimedia, sensors, and instruments. Business intelligence solutions cannot deal with huge volumes of data in complex formats. Processing such data requires high processing capacity together with improved analytical tools and algorithms for gaining better insights, and this is what data science provides.

Past and Future of Data Science

In 1962, John Tukey published a paper on the convergence of statistics and computers, showing how they could provide measurable results in hours. In 1974, Peter Naur published the book Concise Survey of Computer Methods, in which he used the term data science many times to refer to the processing of data through specific mathematical methods. In 1977, an international association was established for the statistical processing of data, with the purpose of translating data into knowledge by combining modern computer technology, traditional statistical techniques, and domain knowledge. Tukey released Exploratory Data Analysis in the same year, emphasizing the importance of data.

Businesses began collecting enormous volumes of personal data in anticipation of new advertising efforts as early as 1994. Jacob Zahavi emphasized the need for new technology to manage the large volumes of data generated by organizations. William S. Cleveland published an article outlining specialized learning methods and the scope of work for data scientists, which was used as a case study by businesses and educational institutions.

In 2002, the International Council for Science launched a journal for data science. It focused on data science topics such as the modeling of data systems and its applications. In 2013, IBM claimed that much of the digital data collected all over the world had been generated in the preceding two years. From then on, organizations planned to build up substantial amounts of data to support their decision making and began gaining insights that improved organizational growth.

According to IDC, global data will exceed 175 zettabytes by 2025. Data science allows businesses to swiftly interpret large amounts of data from a number of sources and turn that data into actionable insights for better data-driven decisions; it is widely used in marketing, healthcare, finance, banking, policy work, and other fields. The market for data science platforms is expected to reach 178 billion dollars by 2025. Data science gives data scientists a platform for exploring the options open to business organizations and for tracking the latest developments in data gathering and maintenance for appropriate decision making.

BI (Business Intelligence) vs. DS (Data Science)

Business intelligence is a decision-making process based on insights into the current data that an organization holds from its transactions with all of its stakeholders. It gathers data from all sources, both external and internal to the organization. BI tools support running queries and displaying the results with good visualization mechanisms, for example by analyzing the revenue earned in a given quarter in the face of business challenges. BI makes suggestions based on market study, revealing revenue opportunities and improvements to business processes. It is meant purely for building business strategies that earn the organization profits in the long run. Tools such as OLAP and data warehouse ETL are used for storing and visualizing data in BI.
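The kind of descriptive query a BI tool runs, such as total revenue per quarter, can be sketched in a few lines of Python. The transaction records below are hypothetical, and a real BI stack would typically run an equivalent aggregation against a data warehouse rather than an in-memory list.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (transaction date, revenue). Illustrative only.
transactions = [
    (date(2022, 1, 15), 1200.0),
    (date(2022, 2, 3), 800.0),
    (date(2022, 4, 20), 1500.0),
    (date(2022, 6, 30), 700.0),
    (date(2022, 7, 1), 900.0),
]


def quarter(d):
    """Label a date with its calendar quarter, e.g. '2022-Q1'."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


# Aggregate revenue by quarter, the equivalent of a GROUP BY query.
revenue_by_quarter = defaultdict(float)
for d, amount in transactions:
    revenue_by_quarter[quarter(d)] += amount

for label in sorted(revenue_by_quarter):
    print(label, revenue_by_quarter[label])
```

This summarizes what has already happened, which is the BI side of the comparison; it makes no forecast.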

Data science is a multidisciplinary domain that studies data in order to extract meaningful insights. It also uses data processing tools from machine learning and artificial intelligence to develop predictive models, which are then used to forecast the future growth of the functions a business organization carries out. Python and R are used to build predictive data models by implementing efficient machine learning algorithms, and the results are communicated through advanced visualization techniques.
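In contrast to a descriptive BI query, a predictive model extrapolates from past data. The sketch below fits a straight line y = a + bx by ordinary least squares and forecasts the next period. The sales figures are hypothetical, and simple linear regression stands in here for the more capable machine learning algorithms the text refers to.

```python
# Hypothetical monthly sales (units) for months 1 to 5; illustrative only.
months = [1, 2, 3, 4, 5]
sales = [12.0, 15.0, 15.0, 19.0, 21.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


a, b = fit_line(months, sales)
forecast = a + b * 6  # predicted sales for month 6
print(f"intercept={a:.2f}, slope={b:.2f}, forecast for month 6={forecast:.2f}")
```

A forecast like this is only as good as the assumption that the past trend continues, which is why predictive models are usually validated on held-out data before being used for decisions.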

Data Munging Basics

Data science is a multidisciplinary field that draws on artificial intelligence, machine learning, and deep learning to uncover insights from data in different forms, both structured (tabular data) and unstructured (text and images). It studies specific problem domains and finds or defines solutions from the usage patterns in the available input data, revealing useful insights [1, 2].