Applied Machine Learning for Data Science Practitioners - Vidya Subramanian - E-Book

Description

A single-volume reference on data science techniques for evaluating and solving business problems using Applied Machine Learning (ML).

Applied Machine Learning for Data Science Practitioners offers a practical, step-by-step guide to building end-to-end ML solutions for real-world business challenges, empowering data science practitioners to make informed decisions and select the right techniques for any use case.

Unlike many data science books that focus on popular algorithms and coding, this book takes a holistic approach. It equips you with the knowledge to evaluate a range of techniques and algorithms, balancing theory with practical examples that illustrate key concepts, derive insights, and demonstrate applications. Beyond providing code snippets and reviewing their output, the book offers guidance on interpreting results.

This book is an essential resource if you are looking to elevate your understanding of ML and your technical capabilities, combining theoretical and practical coding examples. A basic understanding of using data to solve business problems, high school-level math and statistics, and basic Python coding skills are assumed.

Written by a recognized data science expert, Applied Machine Learning for Data Science Practitioners covers essential topics, including:

  • Data Science Fundamentals that provide you with an overview of core concepts, laying the foundation for understanding ML.
  • Data Preparation covers the process of framing ML problems and preparing data and features for modeling.
  • ML Problem Solving introduces you to a range of ML algorithms, including Regression, Classification, Ranking, Clustering, Patterns, Time Series, and Anomaly Detection.
  • Model Optimization explores frameworks, decision trees, and ensemble methods to enhance performance and guide the selection of the most effective model.
  • ML Ethics addresses the pillars of Fairness, Accountability, Transparency, and Ethics.
  • Model Deployment and Monitoring focuses on production deployment, performance monitoring, and adapting to model drift.


Page count: 947

Publication year: 2025




Table of Contents

COVER

TABLE OF CONTENTS

TITLE PAGE

COPYRIGHT

DEDICATION

ABOUT THE AUTHOR

HOW DO I USE THIS BOOK?

FOREWORD

PREFACE

ACKNOWLEDGMENTS

ABOUT THE COMPANION WEBSITE

SECTION 1: Introduction to Machine Learning and Data Science

CHAPTER 1: Data Science Overview

1.1 What Is Data Science?

1.2 What Is Big Data?

1.3 What Is Machine Learning?

1.4 Summary: Chapter Recap & FAQs

Bibliography

SECTION 2: Data Preparation and Feature Engineering

CHAPTER 2: Data Preparation

2.1 Introduction to Colab

2.2 Data Preparation Overview and Steps

2.3 Align with Stakeholders

2.4 Summary: Chapter Recap & FAQs

Bibliography

CHAPTER 3: Data Extraction

3.1 Data Extraction

3.2 Summary: Chapter Recap & FAQs

Bibliography

CHAPTER 4: Machine Learning Problem Framing

4.1 Evaluate ML as a Potential Solution

4.2 Frame the ML Problem

4.3 Summary: Chapter Recap and FAQs

4.4 How much data do you need to run an ML model?

Bibliography

CHAPTER 5: Data Comprehension

5.1 Data Comprehension

5.2 Test and Validation Datasets: Data Comprehension

5.3 Summary: Chapter Recap and FAQs

Bibliography

CHAPTER 6: Data Quality Engineering

6.1 Data Quality Engineering

6.2 Test and Validation Datasets: Data Quality Engineering

6.3 Summary: Chapter Recap and FAQs

Bibliography

CHAPTER 7: Feature Optimization

7.1 Feature Optimization

7.2 Test and Validation Datasets: Feature Engineering

7.3 Summary: Chapter Recap and FAQs

Bibliography

CHAPTER 8: Feature Set Finalization

8.1 Feature Engineering

8.2 Test and Validation Datasets: Feature Set Finalization

8.3 Summary: Chapter Recap and FAQs

Bibliography

SECTION 3: Build, Train, or Estimate the ML Model

CHAPTER 9: Regression

9.1 Introduction to Regression

9.2 Simple Regression

9.3 Multiple Regression

9.4 Multi‐Output Regression

9.5 Summary: Chapter Recap and FAQs

Bibliography

CHAPTER 10: Classification

10.1 Introduction to Classification

10.2 Model Preparation Steps

10.3 Choice of Classification Model Families

10.4 Model Metrics Evaluation Choices

10.5 Model Diagnostics

10.6 Summary: Chapter Recap and FAQs

Bibliography

CHAPTER 11: Ranking

11.1 Introduction to Ranking and Learning to Rank (LtR) Models

11.2 Types of Ranking Algorithms

11.3 Implementing LtR Algorithms

11.4 Evaluation

11.5 Model Diagnostics

11.6 Summary: Chapter Recap and FAQs

Bibliography

CHAPTER 12: Clustering

12.1 Introduction to Clustering

12.2 Implementing Clustering

12.3 Model Metrics Evaluation Choices

12.4 Model Diagnostics

12.5 Summary: Chapter Recap and FAQs

Bibliography

CHAPTER 13: Patterns

13.1 Introduction to Pattern Mining

13.2 Frequent Pattern Mining

13.3 Association Rules

13.4 Model Diagnostics

13.5 Summary: Chapter Recap and FAQs

Bibliography

CHAPTER 14: Time Series

14.1 Introduction to Time Series

14.2 Univariate Time Series

14.3 Algorithm Choices for Other Time Series Forecasting

14.4 Summary: Chapter Recap and FAQs

Bibliography

CHAPTER 15: Anomaly Detection

15.1 Explore Normal Data and Anomalies

15.2 Anomaly Detection Example

15.3 Summary: Chapter Recap and FAQs

Bibliography

SECTION 4: Model Performance Optimization

CHAPTER 16: Model Optimization and Model Selection

16.1 Introduction to Model Optimization and Model Selection

16.2 Improve Model Input (Optimal model for current dataset)

16.3 Generalization (Optimal Model Performance Across Datasets)

16.4 Refine Model (Optimal Settings for the Current Model)

16.5 Generate More Candidate Models (Optimal Choice Across Models)

16.6 Select a Model

16.7 Summary: Chapter Recap and FAQs

Bibliography

CHAPTER 17: Decision Tree

17.1 Introduction to Decision Trees

17.2 Decision Tree Algorithms

17.3 Model Metrics Evaluation

17.4 Model Diagnosis

17.5 Summary: Chapter Recap & FAQs

Bibliography

CHAPTER 18: Ensemble Methods

18.1 Introduction to Ensemble Methods

18.2 Ensemble Algorithms

18.3 Model Metrics Evaluation

18.4 Model Diagnosis

18.5 Summary: Chapter Recap and FAQs

Bibliography

SECTION 5: ML Ethics

CHAPTER 19: ML Ethics

19.1 Introduction to ML Ethics and Its Goals

19.2 Model Ethics Alignment

19.3 People Alignment

19.4 Real‐World Challenges and How to Deal With Them

19.5 Summary: Chapter Recap and FAQs

Bibliography

SECTION 6: Productionalize the Machine Learning Model

CHAPTER 20: Deploy and Monitor Models

20.1 Deploy Models

20.2 Monitoring Model Performance

20.3 Experimentation

20.4 Summary: Chapter Recap and FAQs

Bibliography

INDEX

END USER LICENSE AGREEMENT

List of Illustrations

How do I Use This Book?

FIGURE 1. Chapter Trail – Applied Machine Learning for Business.

Chapter 1

FIGURE 1.1 Chapter Trail – Data Science Overview.

FIGURE 1.2 Current and Future Digital Trends.

FIGURE 1.3 Data Volume Growth from 2010 to 2025.

FIGURE 1.4 Discovery Insights and Patterns in Big Data.

FIGURE 1.5 Digital and Physical Data Storage Options.

FIGURE 1.6 Visualization for Finding Patterns.

FIGURE 1.7 Knowledge Applications Examples.

FIGURE 1.8 What is a Model?

FIGURE 1.9 Visualization Example.

FIGURE 1.10 Machine Learning Process.

FIGURE 1.11 Machine Learning and Related Technologies.

Chapter 2

FIGURE 2.1 Chapter Trail – Data Preparation Overview.

FIGURE 2.2 Some Coding Tools

FIGURE 2.3 Google Colab.

FIGURE 2.4 Run Google Colab Code.

FIGURE 2.5 Data Preparation Steps.

FIGURE 2.6 Align with the Stakeholders.

Chapter 3

FIGURE 3.1 Chapter Trail – Data Extraction.

FIGURE 3.2 Data Extraction.

FIGURE 3.3 Relational Database.

Chapter 4

FIGURE 4.1 Chapter Trail – ML Problem Framing.

FIGURE 4.2 Frame ML Problem.

FIGURE 4.3 Types of Learning Styles.

FIGURE 4.4 Data Needed for an ML Model.

Chapter 5

FIGURE 5.1 Chapter Trail – Data Comprehension.

FIGURE 5.2 Data Comprehension.

FIGURE 5.3 Data Types in Data Science.

FIGURE 5.4 Skewness.

FIGURE 5.5 Kurtosis.

FIGURE 5.6 Mixture Pattern.

FIGURE 5.7 Oscillation Pattern.

FIGURE 5.8 Cyclic Pattern.

FIGURE 5.9 Trend Pattern.

Chapter 6

FIGURE 6.1 Chapter Trail – Data Quality Engineering.

FIGURE 6.2 Data Quality Engineering.

FIGURE 6.3 Options to Fix Data Quality.

Chapter 7

FIGURE 7.1 Chapter Trail – Feature Optimization.

FIGURE 7.2 Step 5 – Feature Optimization.

Chapter 8

FIGURE 8.1 Chapter Trail – Feature Set Finalization.

FIGURE 8.2 Step 6 – Feature Set Finalization.

FIGURE 8.3 Wrapper Methods.

Chapter 9

FIGURE 9.1 Chapter Trail – Regression.

FIGURE 9.2 Linear Model Example.

FIGURE 9.3 Bias and Variance Tradeoff.

FIGURE 9.4 Types of Regression.

FIGURE 9.5 Scatter Plot Example.

FIGURE 9.6 Estimation.

FIGURE 9.7 Decomposition of Variance.

FIGURE 9.8 Residual Analysis.

Chapter 10

FIGURE 10.1 Chapter Trail – Classification.

FIGURE 10.2 Classification Example.

FIGURE 10.3 Classification Example.

FIGURE 10.4 Classification Types.

FIGURE 10.5 Types of Classifiers.

FIGURE 10.6 Types of Classifiers.

FIGURE 10.7 Classification Example.

Chapter 11

FIGURE 11.1 Chapter Trail – Ranking.

FIGURE 11.2 Ranking Algorithms.

Chapter 12

FIGURE 12.1 Chapter Trail – Clustering.

FIGURE 12.2 Clustering Example.

FIGURE 12.3 Hierarchical Clustering Using Dendrogram.

Chapter 13

FIGURE 13.1 Chapter Trail – Patterns.

FIGURE 13.2 Basic Pattern Mining Methods.

FIGURE 13.3 FP‐Tree.

Chapter 14

FIGURE 14.1 Chapter Trail – Time Series.

FIGURE 14.2 Example Time Series.

FIGURE 14.3 Types of Time Series.

Chapter 15

FIGURE 15.1 Chapter Trail – Anomaly Detection.

FIGURE 15.2 Outliers and Inliers.

Chapter 16

FIGURE 16.1 Chapter Trail – Model Optimization and Model Selection.

FIGURE 16.2 Optimize Model Performance.

FIGURE 16.3 Model Validation Techniques.

FIGURE 16.4 Model Validation → Sampling → Test Train Split.

FIGURE 16.5 Model Validation → Sampling → Test Train Split with Holdout.

FIGURE 16.6 Model Validation → Bootstrapping → Bagging.

FIGURE 16.7 Model Validation → Cross Validation → K‐Fold.

FIGURE 16.8 Model Validation → Cross Validation → Repeated K‐Fold.

FIGURE 16.9 Model Validation → Cross Validation → Leave One Out.

FIGURE 16.10 Model Validation → Cross Validation → Stratified K‐Fold.

FIGURE 16.11 Model Validation → Cross Validation → SMOTE.

FIGURE 16.12 Model Validation → Univariate Time Series → Time Series Split....

FIGURE 16.13 Model Validation → Univariate Time Series → Walk‐Forward Valida...

FIGURE 16.14 Model Validation → Nested Time Series → Predict Second Half.

FIGURE 16.15 Model Validation → Nested Time Series → Forward Chaining.

Chapter 17

FIGURE 17.1 Chapter Trail – Decision Tree.

FIGURE 17.2 Decision Tree Structure.

FIGURE 17.3 Types of Decision Trees.

FIGURE 17.4 Univariate Decision Tree Structure.

FIGURE 17.5 Multivariate Decision Tree Structure.

FIGURE 17.6 Binary Decision Tree Structure.

FIGURE 17.7 Multiway Decision Tree Structure.

FIGURE 17.8 Complete Split Decision Tree Structure.

FIGURE 17.9 Interval Split Decision Tree Structure.

FIGURE 17.10 Single Output Decision Tree Regressor.

FIGURE 17.11 Multi‐Output Decision Tree Regressor.

FIGURE 17.12 Single Output Decision Tree Classifier.

FIGURE 17.13 Multi‐Output Decision Tree Classifier.

FIGURE 17.14 Single Output Decision Tree Ranker.

FIGURE 17.15 Multi‐Output Decision Tree Ranker.

Chapter 18

FIGURE 18.1 Chapter Trail – Ensemble Methods.

FIGURE 18.2 Simplified Ensemble Method Overview.

FIGURE 18.3 Simplified Ensemble Process.

FIGURE 18.4 Types of Ensemble Methods.

FIGURE 18.5 Bagging.

FIGURE 18.6 Pasting.

FIGURE 18.7 Random Subspaces.

FIGURE 18.8 Random Patches.

FIGURE 18.9 AdaBoost.

Chapter 19

FIGURE 19.1 Chapter Trail – ML Ethics.

FIGURE 19.2 Model Transparency.

FIGURE 19.3 Build Alignment by Increasing the Sphere of Influence Indicated ...

Chapter 20

FIGURE 20.1 Chapter Trail – Deploy and Monitor Models.

FIGURE 20.2 Monitor Model Performance.

FIGURE 20.3 Monitor Model Performance.

FIGURE 20.4 Monitor Model Performance.

Guide

Cover

Table of Contents

Title Page

Copyright

Dedication

About the Author

How Do I Use This Book?

Foreword

Preface

Acknowledgments

About the Companion Website

Begin Reading

Index

End User License Agreement


Applied Machine Learning for Data Science Practitioners

 

Vidya Subramanian

 

 

 

 

 

Copyright © 2025 by John Wiley & Sons, Inc. All rights reserved, including rights for text and data mining and training of artificial intelligence technologies or similar technologies.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.

The manufacturer’s authorized representative according to the EU General Product Safety Regulation is Wiley‐VCH GmbH, Boschstr. 12, 69469 Weinheim, Germany, e‐mail: [email protected].

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging‐in‐Publication Data Applied for:

Hardback: 9781394155378

Cover Design: Wiley

Cover Image: © Westend61/Getty Images

 

 

 

Dedicated to my parents, my role models, whose relentless pursuit of excellence inspires me every day.

ABOUT THE AUTHOR

Vidya Subramanian is a seasoned Data Science leader with experience leading teams at Google, Apple, and Intuit. She currently heads Data Science and Analytics for Google Play, with a focus on data‐driven insights for business and product strategy.

Forbes recognized her as one of the “8 Female Analytics Experts From The Fortune 500.” She authored Adobe Analytics with SiteCatalyst (Adobe Press) and PMP Certification Mathematics (McGraw Hill). She has also been a featured speaker at several prestigious conferences.

She holds a Master’s in Information Systems (MIS) from Virginia Tech and a Master’s in Computer Software Application (MCSA) from Somaiya Institute of Management (India).

In her leisure time, she enjoys reading books while sipping endless cups of tea, indulging in culinary experiments, and immersing herself in new travel experiences with her family and friends. She resides in the Bay Area with her husband Ravi, their children – Rhea and Rishi, and their dog Buddy.

Connect on LinkedIn

HOW DO I USE THIS BOOK?

This book is organized into six sections, each encompassing logically grouped chapters. This structure mirrors the framework we use to approach Machine Learning problems (Figure 1). The book equips you with the tools to solve specific problems through hands‐on examples.

FIGURE 1. Chapter Trail – Applied Machine Learning for Business.

This book caters to three distinct audiences:

Aspiring Data Scientists seeking an accessible entry point into the field

Undergraduate and graduate students studying Data Science

Working professionals in search of a reliable reference

The book begins with an overview of the Data Science landscape, providing readers with a solid foundation. It then transitions to explaining a framework for approaching any Data Science question. This structured approach introduces clear objectives for each step, shedding light on the purpose of each action and its integration within the broader context of Machine Learning.

The book doesn't stop at theoretical explanations; it includes practical examples to illustrate our point, derive insights, and apply them, as well as code snippets and guidance on interpreting the code output. Each dataset example is clearly laid out in the appendix. This practical application helps readers understand and apply the concepts in real‐world scenarios.

Each chapter starts with academic exercises to ease readers into the subject matter before addressing practical challenges that may arise. Additionally, I address common concerns and lingering doubts with Frequently Asked Questions (FAQs). The following is a quick layout of the chapters:

Section 1: Introduction to Machine Learning (ML) and Data Science

The

Data Science Overview

chapter begins by exploring the digital trends that permeate our daily lives. We then delve into Big Data and the Knowledge Discovery process. Our journey continues as we learn to distinguish Machine Learning from AI and its role within the Data Science ecosystem.

Section 2: Understanding the Problem, Preparing the Data and Features for the Model

We embark on the

Data Preparation

chapter by accessing the dataset used in this book. As a warm‐up for the subsequent chapters, we kick‐start our coding journey with the “Hello World!” program. We also gain insight into the essential steps of data preparation and emphasize the importance of alignment with stakeholders.
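That warm-up program is the classic one-liner; in Colab (or any Python environment), a minimal version looks like this:

```python
# The traditional first program: store a greeting and print it.
message = "Hello World!"
print(message)
```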

The

Data Extraction

chapter introduces the various data types commonly encountered in Machine Learning problems. We learn to extract structured, semi‐structured, and unstructured data, while respecting user privacy. Real‐world scenarios guide us in creating synthetic data and integrating data when necessary.

The

ML Problem Framing

chapter guides us in assessing options for solving business problems using various ML model types. We explore trade‐offs and success criteria for evaluating alternatives, ultimately framing the ML problem.

The

Data Comprehension

chapter builds on our fundamental knowledge of data types. It also introduces data profiling methods. Subsequently, we delve into a comprehensive, step‐by‐step guide on performing Exploratory Data Analysis (EDA).

The

Data Quality Engineering

chapter introduces various techniques to enhance data quality. We decide on the strategies most pertinent to the problem at hand.

Feature Optimization

presents the standard techniques for creating optimal features that balance the Model's requirements and remain grounded in business reality.

The

Feature Set Finalization

chapter explores all the technical strategies to select the optimal number of features best suited for the business problem.

Section 3: Solve the Machine Learning Problem

The

Regression

chapter focuses on how to estimate, predict, and control using regression. The chapter encompasses the relationships we can derive between data and the spectrum of algorithms that enable us to predict outcomes. We review simple, multiple, and multi‐output regression to better understand the prediction problems we typically encounter.
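To make the idea concrete before the chapter itself, here is a minimal sketch (pure Python, not the book's own code) of simple regression: ordinary least squares fits a line y = a + b·x by setting the slope to cov(x, y) / var(x). The function name and toy data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Perfectly linear toy data: y = 1 + 2x
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)  # 1.0 2.0
```

Multiple and multi-output regression generalize this same fitting idea to many input features and many predicted outputs.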

Next, we explore how to find similarities in data with an emphasis on

Classification

. Classification encompasses a spectrum of problems, including binary, multi‐class, multi‐label, multi‐class multi‐label, zero‐shot, and rule‐based classification, extending our toolkit of solutions.
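The core classification idea, assigning a label to a new point based on labeled examples, can be illustrated with a one-nearest-neighbor rule (an illustrative sketch with made-up data, not a specific method from the chapter):

```python
def one_nn(point, examples):
    """1-nearest-neighbor: label a point with the label of its closest example."""
    closest = min(examples, key=lambda ex: abs(ex[0] - point))
    return closest[1]

# Labeled training data: (feature value, class label)
train = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
print(one_nn(7.0, train))  # high
```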

Ranking and Learning‐to‐Rank (LtR)

are the backbone of search results, recommendations, and voice assistants (Amazon Alexa, Google Home, etc.). We learn more about maximizing user relevance while minimizing information retrieval time.

Next, we venture into

Clustering

, which aims to group similar data into cohorts. Clustering uses only the available feature variables, with no prior knowledge of the structure of the groups, which is the premise of unsupervised learning.
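The classic clustering loop, assign each point to its nearest centroid, then move each centroid to the mean of its points, can be sketched in a few lines for one-dimensional data (an illustrative toy, with assumed function name and data):

```python
def kmeans_1d(points, centroids, iterations=10):
    """Tiny k-means sketch for 1-D data: alternate assignment and centroid update."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
print(centroids)  # two group means, roughly [1.0, 9.0]
```

No labels are supplied anywhere: the two groups emerge from the feature values alone, which is exactly the unsupervised-learning premise described above.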

Pattern Mining

encompasses finding associations, correlations, and recurring patterns in the data. We explore frequently occurring data subsets, sequences, and sub‐sequences. These include Pattern Mining variations like interesting, negative, constrained, and compressed patterns.
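The simplest form of frequent pattern mining, counting item pairs that co-occur across transactions and keeping those above a support threshold, can be sketched as follows (illustrative baskets and names; real algorithms such as Apriori or FP-Growth are far more efficient):

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support=2):
    """Count co-occurring item pairs and keep those meeting the support threshold."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", "eggs"]]
print(frequent_pairs(baskets))  # {('bread', 'milk'): 2, ('bread', 'eggs'): 2}
```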

No book on Data Science is complete without learning forecasting with

Time Series

. We navigate statistical models along with ML solutions for univariate time series. We also learn more about multiple time series and nested time series.
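Even before the statistical models, the flavor of univariate forecasting can be shown with the simplest possible baseline, a moving average (an assumed illustration, not a method singled out by the chapter):

```python
def moving_average_forecast(series, window=3):
    """Naive forecast: predict the next value as the mean of the last `window` points."""
    recent = series[-window:]
    return sum(recent) / len(recent)

print(moving_average_forecast([10, 12, 11, 13, 12]))  # (11 + 13 + 12) / 3 = 12.0
```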

The

Anomaly Detection

chapter discusses the different techniques for finding outliers in the data. These anomalies can signal a risk to business health, or they can be data noise that is safe to ignore. This step also serves as data preparation, addressing outliers to optimize model performance.
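A common statistical baseline for outlier detection (a sketch under assumed data, not the chapter's specific method) flags points lying more than a chosen number of standard deviations from the mean:

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Flag values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) / stdev > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 50]  # 50 is the injected anomaly
print(zscore_outliers(data))  # [50]
```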

Section 4: Optimize the Machine Learning Models

The

Model Optimization

framework helps us optimize the underlying data quality, refine Feature choices, and fine‐tune the hyper‐parameters in our current model for the current dataset. Then, we test how well the model generalizes across new datasets. Finally, we review more candidate models. After comparing the metrics across all candidate models, we select the model that best fits our business requirements.

Decision Trees (DT)

are visually represented as a graphical tree‐like structure. They recursively split the data into branches, grouping similar data together within a branch (homogeneity) while making the branches as different from one another as possible (heterogeneity). We commonly use DT for regression, classification, and specific ranking scenarios.
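The splitting step at the heart of a decision tree can be sketched with a single split (a decision stump) on one feature: try each threshold and keep the one with the lowest weighted Gini impurity. Function names and data here are illustrative assumptions, not the book's code.

```python
def gini(labels):
    """Gini impurity: how mixed a group of class labels is (0 = pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Choose the threshold on a single feature that minimizes weighted impurity."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Labels separate cleanly at x <= 2, so that split has impurity 0.
print(best_split([1, 2, 3, 4], ["a", "a", "b", "b"]))  # (2, 0.0)
```

A full tree applies this same search recursively to each resulting branch until the leaves are sufficiently pure.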

Ensemble Methods

tap into the “crowd wisdom of algorithms” rather than placing all bets on a single algorithm. They seek better predictive performance by aggregating the results of multiple algorithms, and better generalization by resampling the data and combining the predictions of the resulting models.
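The "crowd wisdom" idea reduces, in its simplest form, to majority voting across several models (a hedged sketch with hypothetical model outputs, not a specific ensemble from the chapter):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine one prediction per model into a single ensemble prediction."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical models classify the same input; the ensemble
# returns the class that most models agree on.
model_outputs = ["spam", "spam", "not spam"]
print(majority_vote(model_outputs))  # spam
```

Bagging, pasting, and boosting all build on this combination step, differing mainly in how the individual models are trained and weighted.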

Section 5: Integrate ML Ethics

ML Ethics

entails conversations around Fairness, Accountability, Transparency, and Ethics. A deep dive into model interpretability and explainability helps build cross‐functional alignment, which in turn elevates social accountability through built‐in ethics.

Section 6: Productionalize the Machine Learning Model

In the last chapter, we

Deploy & Monitor the Model

. Finally, we are ready to put the Model in production. We study the different types of drift that impact model performance. Adapting to model drift helps ensure the Model continues to perform optimally.

With all this knowledge, you can handle any data science problem and determine the optimal solution using this framework.

Congratulations! You are officially ready to glean insights from your data and conquer new heights in Data Science.

Becoming a Data Scientist is similar to being a chef; while there are countless recipes to solve any problem, a certain degree of art and self‐expression will make each solution inherently as unique as your signature dish.

Good Luck!

Vidya

FOREWORD

There are many books, tutorials, and courses dedicated to either Machine Learning or Data Science. But there are far fewer options for those who want to see how the two tie together. This book answers the call. It has wide appeal because it covers both Machine Learning and Data Science, and also because it explains the concepts in multiple ways: with prose; with diagrams, tables, and checklists; and with Python code. It is all appropriately understandable by both a beginner and a more experienced reader. It covers solving a problem end‐to‐end, in a more complete way than many other books.

This book shows how the astounding growth in the volume of available data over the last decade or two has led to exciting applications. It explains the knowledge discovery process: managing the data pipeline, iteratively discovering insights, and applying the insights to derive value. It explains the various software tools that facilitate data preparation, model creation, visualization, and automated inference. It defines all the terms and gives examples of tools and applications. Most chapters have an accompanying IPython Notebook; these all run in the Colab environment, so the reader does not need any software other than a web browser.

Most Data Science books stop there, but this one goes further in covering various approaches to Machine Learning (such as regression, classification, clustering, and time series analysis), and tackling the sticky issue of Machine Learning ethics (including fairness, bias, discrimination, diversity and inclusion, accountability, privacy, security, sensitivity, robustness, reliability, transparency, interpretability, explainability, and alignment).

The book concludes with a guide to actually running applications in the real world: deploying and monitoring a model. There are other books that are more detailed and more advanced. But this one delivers on the promise of taking the diligent reader from the start to a place where they can successfully deploy models in the real world.

by Peter Norvig

Peter Norvig, Research Director at Google, is an American computer scientist and a Distinguished Education Fellow at the Stanford Institute for Human‐Centered Artificial Intelligence. In recognition of his invaluable contributions to the field of AI, he was named a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI) in 2001 and a Fellow of the Association for Computing Machinery (ACM) in 2006, among many other accolades. Throughout his illustrious career, Norvig has authored several papers and books, including co‐authoring the widely used textbook Artificial Intelligence: A Modern Approach with Stuart J. Russell, which is taught in more than 1,500 universities across 135 countries.

PREFACE

Data Science is an expansive domain, often appearing as an intricate interplay of Computer Science, algorithms, data, business acumen, mathematics, and statistics. Navigating this complex landscape can be challenging, and keeping pace with its rapid evolution can be overwhelming.

If you're embarking on a Machine Learning journey, you might be daunted by the prospect of mastering all these individual areas: statistics, mathematics, visualizations, computer science, and various tools. This book systematically addresses these concerns, offering practical examples and frameworks to effectively tackle specific problems within Applied Machine Learning.

ACKNOWLEDGMENTS

In his book, "The World as I See It," Albert Einstein is quoted as saying, “It’s every man’s obligation to put back into the world at least the equivalent of what he takes out of it.” As I reflect on my career, I realize that so many thought leaders and colleagues have fueled my passion for Data Science, spent hours debating ideas with me in the pursuit of excellence, and helped push the envelope to deliver innovative solutions. Many teachers, colleagues, mentors, and friends have been instrumental in shaping my career journey. This book is my attempt to pay it forward.

Over the years, as I mentored Data Scientists, I recognized the need for a comprehensive book that covered the breadth and depth of Data Science. Writing this book was an ambitious undertaking, and I almost talked myself out of it, knowing the immense effort it would require. But clearly, that did not work! This book became my late‐night and weekend passion project. My husband, Ravi, always a champion of my dreams, provided unwavering support as I wrestled with the book's content; and my kids, Rhea and Rishi, played a pivotal role in keeping me grounded with a million questions about the book and Machine Learning. Our dog, Buddy, has been my trusty sidekick, especially on the days when my code did not work!

Dr. Peter Norvig, I am deeply grateful for your encouragement and generous offer to write the foreword for this book. Your significant contributions to Data Science have been a humbling inspiration. Anoop Sinha, thank you for your guidance. I’ve learned so much from every interaction with you.

I have strived to make this book as error‐free as possible, enlisting the help of industry practitioners, university professors, and data science students to ensure its accuracy. Debating the book's structure with Jessica Chopra was an absolute joy. Her knowledge is impressive, but it is her passion for innovation that truly inspires me to push my limits every day. Joe Christopher's early feedback and brainstorming sessions were instrumental in shaping the manuscript. Jim Sterne, a true trailblazer in Analytics and a renowned author and speaker, provided candid feedback on the initial chapters, which significantly influenced the book's direction.

I’ve had the privilege of working alongside extremely talented and hardworking colleagues. Anant Nawalgaria, I’m grateful for your comprehensive feedback across multiple chapters and enthusiasm to help despite your busy schedule. Dirk Nachbar and Wayne Wang, your thorough comments helped make this a better book. Thank you, David Lundquist, Raunak Bhandari, and Taibai Xu, for your feedback and suggestions.

Thank you, Dr. Barbara Hoopes, Associate Dean of the Graduate School at Virginia Tech, for your encouragement and feedback on critical chapters. You have been a role model for me since I was your graduate assistant at Virginia Tech. Feng Chen, PhD, Associate Professor, University of Texas at Dallas, thank you for your feedback on the Anomaly Detection chapter. Thanks are also due to Jack Lu, a PhD student at New York University, for sharing insightful feedback on many chapters.

Numerous practitioners generously contributed their time and industry insights to enhance this book. My heartfelt thanks to Zilu Zhou, Ritwik Sinha, Dorian Puleri, Boyko Ivanov, Ravikant Pandey, Arundati Puranik, and Disha Ahuja. Additionally, Rhea Ravishankar, Rishi Ravishankar, Indraneel Mahendrakumar, Sanjna Raisinghani, Suryansh Raisinghani, Chaya Bakshi, Maria Izzi, Aditya Prakash, and Mihir Argulkar offered invaluable perspectives from a student's vantage point.

Special thanks to Margaret Cummins, Aileen Storry, Victoria Bradshaw, Sindhu Raj Kuttappan, and the rest of the Wiley team for their relentless efforts in transforming the manuscript into this beautiful book.

Subha Subramanian patiently reviewed the book countless times, even going so far as to learn key Machine Learning concepts to provide valuable suggestions. Sumathi Raisinghani meticulously reviewed the book from both a high‐level and detailed perspective, ensuring its accuracy. Kavitha Subramanian consistently offered her support with insightful advice throughout the entire process. I'm also deeply grateful to my family – Bhanu, Raju, Vikram, Vidya, Mahendra, Navin, Ravikumar, Kiran, Khushi, Lavanya, Rohit, and my wonderful in‐laws, Prema and Ramachandran – for their unwavering encouragement.

My mother, Kanthimathi, remains my favorite role model. As a teacher, she has profoundly impacted countless lives, elevating their thinking and inspiring them to reach new heights. My father, Subramanian, always encouraged me to balance this book with my job and personal life, often reminding me of his favorite Jim Rohn quote, “If you really want to do something, you'll find a way. If you don't, you'll find an excuse.”

I hope this book serves as a catalyst on your Data Science journey, empowering you to create positive change for yourself and the world around you.

Happy reading!

Vidya Subramanian

ABOUT THE COMPANION WEBSITE

 

This book is accompanied by a companion website:

www.wiley.com/go/subramanian/appliedmachinelearning1 

This website includes code files and data files.

SECTION 1 Introduction to Machine Learning and Data Science

 

CHAPTER 1 Data Science Overview

FIGURE 1.1 Chapter Trail – Data Science Overview.

CHAPTER GOALS

Welcome to the world of Data Science! The world has witnessed remarkable technological advancements in the past decade, often taken for granted. From facial recognition unlocking our phones to the emergence of self‐driving cars, Data Science applications have surpassed our wildest imaginations. Large Language Models (LLMs) have become an integral part of our lives, enabling content generation from prompts. Data is already transforming how we live and work, and its importance will only increase with new data being generated every day.

We need a way to uncover hidden patterns and relationships that would otherwise be impossible to surface, so that we can make data-driven decisions (Figure 1.1). Hence, we look to Data Science to provide the tools and techniques to organize, analyze, and interpret data and turn it into actionable insights that power applications such as self-driving cars and facial recognition software. Our journey begins with understanding the Data Science landscape and contemporary digital trends.

In this chapter, we will:

Understand the digital trends landscape and explore Data Science fundamentals

Learn the basics of Big Data and Machine Learning (ML)

Understand how Machine Learning fits into the overall Data Science umbrella

Let's begin!

In the ever-evolving field of Data Science, keeping up with new terminology, jargon, and the intricate interplay between technologies can be challenging. Luckily, thanks to open-source tools, readily available technology, and accessible data, the barriers to entry are lower than ever. Hence, this book aims to collate the most useful Machine Learning techniques and algorithms, and not only to explain why, what, when, and how to use them, but also to bridge the gap for Data Scientists like you!

We will accomplish our chapter goals by breaking the chapter into subsections, each with its own goal, as follows:

Goal #1 Understand the digital trends landscape and explore Data Science fundamentals

Step 1a:

Delve into the world of Data Science and Big Data

Step 1b:

Understand and learn Data Pre‐preparation and Knowledge Discovery techniques

Step 1c:

Learn how to apply Data Insights

Step 1d:

Become familiar with the tools to assist you in your Data Science journey

1.1 What Is Data Science?

First, let us understand that Data Science is the complete practice of deriving insights from raw data using mathematical, statistical, and computer science techniques. It also includes the necessary processes, tools, and methods to extract knowledge and insights from data (Figure 1.2).

Looking at current and future digital trends, we realize it is an exciting time to be in technology! Machine Learning (ML) and Artificial Intelligence (AI) are being applied to solve problems across a wide range of real‐world domains, a few of which are listed below:

FIGURE 1.2 Current and Future Digital Trends.

Source: DooG/Adobe Stock, visoot/Adobe Stock, AI/Adobe Stock, ipopba/Adobe Stock, phive2015/Adobe Stock, ipopba/Adobe Stock, Julien Eichinger/Adobe Stock, SasinParaksa/Adobe Stock, violetkaipa/Adobe Stock, 1000pixels/Adobe Stock.

Designers use 3D printers to create simple and, more importantly, cheap product parts and prototypes before moving to full-scale production. The technology allows low-risk proofs of concept to be made with little difficulty. With more designers using 3D printers, researchers have found ways to improve 3D printing accuracy through Machine Learning algorithms.

User Authentication validates a customer's identity using one or more authentication factors. With biometrics like voice recognition, fingerprints, FaceID, and heart rate sensors replacing passwords, Machine Learning algorithms help distinguish impersonators from genuine users. There are also privacy and ethical considerations associated with the use of biometric data.

Faster and better-quality Internet with 5G and 6G technology has enabled improved data transmission speeds, energy savings, and decreased latency compared to previous technology generations. These technologies use Machine Learning to forecast demand and integrate distributed computing to deploy equipment and services most efficiently.

1.2 What Is Big Data?

In layman's terms, "Big Data" implies a massive scale of structured and unstructured data that traditional techniques cannot process. In our daily lives, we constantly generate data, both explicitly and implicitly. To highlight how large data growth has been in the past decade, the International Data Corporation (IDC) has indicated data volume growth of roughly 61% per year in recent years, with the total rising from 2 ZB in 2010 to a projected 181 ZB by 2025. A zettabyte (ZB) is a trillion gigabytes. We are now perpetually interconnected through smartphones, computers, and other Internet-enabled devices. Machine interconnectivity has contributed to a double-digit growth trajectory in data, as illustrated in the graph (Figure 1.3).

FIGURE 1.3 Data Volume Growth from 2010 to 2025.

Source: Replicated by Vidya Subramanian from Statista, Inc./https://www.statista.com/statistics/871513/worldwide-data-created/.

Traditionally, data has been stored in relational databases with similar attributes and data types. The consistency of this data makes it optimal for storage, processing, and analysis using relational databases. However, relational databases cannot handle unstructured data or data at the scale we are currently generating.

On the other hand, Big Data systems are built to manage data in inconsistent formats, whether unstructured or semi-structured. The unique characteristics of Big Data enable us to deal with ever-changing data. Organizations must sift through that data to find insights and gain a competitive edge in their industries, and people use those analytics to make informed decisions. So, the underlying technology for analyzing Big Data has to keep pace with the surge in data.

“Big Data” exhibits some unique characteristics famously defined by the 4Vs – nouns that start with “V” – Volume, Variety, Velocity, and Veracity. Let us take a look at what these characteristics look like:

In Big Data, the data volume is enormous. The traditional methods of processing and analyzing data using relational databases are expensive and insufficient for analyzing data in zettabytes. In just a decade, the scale of data has increased from petabytes (PB) to zettabytes (ZB). As a frame of reference for this massive scale, a petabyte is half the contents of all US academic research libraries, and a zettabyte can house as much information as there are grains of sand on all the beaches in the world.

Variety refers to the heterogeneous property of data and how it is stored. The data generated includes structured data housed in relational databases, semi-structured data like graph data, and unstructured data from emails, videos, PDFs, audio, and photos.

Velocity is the rate of data generation, transmission, and collection. According to recent figures from http://Statista.com, we average 231 million emails, 16 million texts, and 6 million Google searches per minute, and those numbers are only growing. That gives us perspective on the amount of data we are generating. In the real world, Amazon ships approximately 1.6 million packages a day. Mathematically, that is more than 66,000 orders per hour, or 18.5 orders per second! Every hour, Walmart collects 2.5 PB of unstructured data from 1 million customers, the equivalent of 167 times the number of books in America's Library of Congress.
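The per-hour and per-second figures above follow directly from the daily total; a quick sanity check in Python (the 1.6 million/day figure is the one cited above):

```python
packages_per_day = 1_600_000  # approximate Amazon daily shipments cited above

per_hour = packages_per_day / 24        # 24 hours in a day
per_second = packages_per_day / 86_400  # 86,400 seconds in a day

print(f"{per_hour:,.0f} per hour, {per_second:.1f} per second")
# → 66,667 per hour, 18.5 per second
```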

Veracity is the consistency and accuracy of the data. Big Data has inherent quality issues that can stem from noise, bias, outliers, incorrect data, and missing values, among other causes. When we have disparate data sources or data captured using different technologies, achieving high data accuracy takes time and effort. In the subsequent chapters, we will review some ways to improve data integrity.

Data Scientists aim to discover insights and patterns from data by leveraging multiple disciplines and various techniques and tools. A typical Data Science process is iterative, starting with data collection from numerous sources and ending with the communication or application of the derived insights, also known as the Knowledge Discovery Process.

The Knowledge Discovery process entails multiple steps to analyze Big Data, from data generation to its application, which can be categorized into three phases:

Phase 1 spans the entire process of generating, collecting, storing, and aggregating the data, which makes data retrieval easy for analyses.

Phase 2 is the iterative insights discovery and evaluation process used to select the optimal algorithm.

Phase 3 focuses on applying the insights.

Finally, several tools and techniques enrich the Insights Discovery journey across all these phases.

Let us take a deeper look at it (Figure 1.4).

FIGURE 1.4 Discovery Insights and Patterns in Big Data.

1.2.1 Phase 1: Data Generation & Data Collection

Big Data generation is the methodical process of gathering information to support the analyses we need or intend to perform. We must ensure that the data collected meets the analysis requirements for the task, and we collect data legally and ethically, with privacy in mind and with explicit user consent. Data can be collected in multiple ways. Examples include transactional data collected from applications, such as shipping, order, and payment information legally required for running the business successfully and for audit purposes.

Individuals can also self-disclose zero-party data to an organization; examples include personal information and preferences that can help curate future experiences through personalization. Organizations typically collect first-party data directly, allowing companies to analyze how consumers interact with their brands and improve user experience and engagement both holistically and individually. Second-party data is data shared by another organization with user consent. In contrast, third-party data is aggregated, rented, or sold by organizations specializing in data collection.

Surveys and social media are other ways to get qualitative user or product experience data; examples include Twitter feeds pulled from an API. Web-scraped data, which entails reading the code rendered on a web page and parsing it to gather the needed information, is another way to get data from the Internet. Companies can also crowdsource data, as Wikipedia does, by enlisting people to help generate it. This is also how many AI models are trained.

Numerous Big Data sources could exist within each company in diverse formats. They may be stored in various repositories and duplicated based on business needs. So, we need to combine them into designated Data Warehouses for processing based on our goals for this data.

1.2.2 Phase 1: Data Ingestion

Data ingestion is the process used to import data from multiple sources. It can vary based on data formats, repositories, and business requirements such as:

Batch data ingestion is an efficient way to collect and ingest large amounts of transactional data over time, enabling applications like payroll and invoice processing.

Stream data ingestion continuously collects data from multiple sources simultaneously in small bursts. Streaming data includes data from wearable devices, phones, and web applications. We stitch this data from disparate sources together to understand user behavior across devices collectively. Real-time data ingestion minimizes the latency of the data being ingested.

Offline data ingestion covers data collected through offline methods like surveys and third-party apps.

Challenges for easy data ingestion include incompatibility between multiple data formats, data volume, and quality.
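The batch/stream distinction above can be sketched with plain Python generators. This is a minimal illustration, not a production pipeline; the record format and batch size are hypothetical:

```python
from typing import Iterable, Iterator, List

def batch_ingest(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Accumulate records and yield fixed-size batches (higher throughput, higher latency)."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

def stream_ingest(records: Iterable[dict]) -> Iterator[dict]:
    """Hand each record downstream as soon as it arrives (low latency, one at a time)."""
    yield from records

events = [{"id": i} for i in range(7)]
print([len(b) for b in batch_ingest(events, batch_size=3)])  # → [3, 3, 1]
```

In a real system, the batch path would write to a warehouse on a schedule, while the stream path would feed a message queue or real-time processor.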

1.2.3 Phase 1: Data Storage

Before determining physical data storage, we should understand various factors related to the data we plan to store. Several factors impact storage choices: volume, type of data, frequency of data, security, reliability, cost, and infrastructure maintenance (Figure 1.5).

FIGURE 1.5 Digital and Physical Data Storage Options.

For smaller data volumes, standard storage options include Comma‐Separated‐Value files (CSV) or Excel. Some popular cloud options that provide cloud data warehouse capabilities that we can query directly include Google Cloud Storage, AWS S3, Snowflake, Azure, and Databricks. We can also store data using a combination of storage options. Let's take a sneak peek into the types of data storage available. While the list is not intended to be comprehensive, it serves as a starting point.

As the name suggests, Centralized storage means data is stored in a central location. This makes hardware configuration, performance, and capacity control more manageable, which reduces expenditure and risk. A centralized dataset also enables better quality, version control, and security. Centralized relational databases and data warehouses traditionally store data on a central server. Popular databases like MySQL, Oracle, and Teradata are examples of this architecture.

In Decentralized storage, there is no single centralized server; the data is distributed across servers. As the volume and variety of Big Data increase, we want a reliable solution that increases retrieval and processing speed and balances load. The data distribution lets us leverage parallel processing across servers, eliminating the need for custom or expensive hardware and coping with the influx of data requests. Data redundancy and failure recovery allow the business to run like a well-oiled machine. Let's take a look at potential applications of decentralized storage:

Peer-to-peer networks can solve the problem of server bottlenecks for Big Data; each computer serves as both a client and a server for the others.

Distributed Relational Databases help coordinate the seamless execution of queries for the end user by spreading tables and objects across interconnected computers.

A Distributed Non-Relational Database stores data across multiple interconnected computers in a non-tabular format. These computers enable distributed file storage, yet they can make data retrieval appear seamless, as if the data were stored in a single location.

Decentralized NewSQL is a class of relational databases that marries the capabilities of unstructured database systems with traditional database systems. NewSQL supports both Relational DataBase Management System (RDBMS) tabular structures and NoSQL capabilities, while preserving the fundamental properties that define an RDBMS. Secondary indexing, in-memory storage, data processing, and parallel query execution lead to faster information retrieval.

As the volume of Big Data increases, we want to combine parallel and distributed processing reliably. Distributed Computing spreads computation across the servers of a distributed system; the parallel work must be managed and synchronized across those servers.

Distributed Ledger Technology is publicly available data storage that lets anyone add data, but not alter or delete it. It refers to the technological infrastructure and protocols that allow access, validation, and record updating in an immutable manner across a network spread over multiple entities or locations, and it finds applications in blockchain technology.

1.2.4 Phase 1: Data Aggregation

After we generate, ingest, and store the data, we need to make it readily available for analysis. With the growing complexity of modern business, multiple data types are held in various storage formats, making data analysis that much harder. Drawing insights from raw, highly granular data is challenging. To overcome that, we create aggregated views that store the data, optimize it for analysis, and make it accessible for processing. Here, we can choose from a few options based on specifics such as where the data currently lives and other considerations such as data volume:

Data Warehouses are central relational database repositories that integrate data from disparate sources. They are designed and optimized for data retrieval rather than manipulation.

Data Cubes are multidimensional databases that allow users to analyze large datasets across multiple dimensions or perspectives, usually time, location, product, and customer. Data cubes are used for complex analytical queries and data mining.

A Data Mart is a curated data subset that supports analytics for business users. It is a repository catering to a subset of users, containing only the information pertinent to them.

Data Virtualization creates a virtual layer over the physical data sources, enabling a unified view of the data. It allows data access from multiple sources without physically replicating or moving it.

Data Lakes are centralized repositories that store large amounts of data whose purpose is typically not yet defined. They are generally used for big data processing, real-time analytics, and Machine Learning.

In-memory Databases store data in RAM, enabling faster processing and retrieval times, especially for real-time applications. They are an alternative to traditional disk-based databases, which have comparatively slower storage and retrieval times.

Not only SQL (NoSQL) Databases are non-relational databases that provide flexible data modeling, horizontal scalability, and high availability. We use them for applications that require high performance and scalability, such as web applications, real-time analytics, and the Internet of Things (IoT).

Graph Databases store data in a graph format, representing complex relationships between data entities. We use them for applications that require relationship-based queries, such as social networks, recommendation systems, and fraud detection.

Vector Databases are built to handle the unique structure of vector embeddings, which are numerical representations of data such as text, images, audio, and video. They index vectors for faster search and retrieval, comparing values to find the items most similar to one another. Databases like Pinecone, Milvus, and Weaviate are useful for recommendation systems, photo and video searches, and natural language processing (NLP) applications.
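The similarity search that vector databases perform at scale can be illustrated in miniature. The sketch below uses hypothetical three-dimensional embeddings and a brute-force scan; real systems work with high-dimensional embeddings and approximate nearest-neighbor indexes such as those in Pinecone, Milvus, or Weaviate:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "index" of item embeddings (hypothetical values).
index = {
    "dog photo": [1.0, 0.0, 0.0],
    "cat photo": [0.7, 0.7, 0.0],
    "tax form":  [0.0, 0.0, 1.0],
}

query = [0.9, 0.2, 0.1]  # hypothetical embedding of the user's query
best = max(index, key=lambda item: cosine_similarity(query, index[item]))
print(best)  # → dog photo
```

The nearest item under cosine similarity is the one whose embedding points in the most similar direction to the query, which is how "most similar text/image/audio" retrieval works conceptually.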

Another critical consideration during the Data Aggregation step is whether to keep the data in its original, native format or to optimize it for analysis by transforming it into a more suitable one. Every architecture has its purpose, strengths, and weaknesses. The choice of which one to apply depends on the business requirements and the trade-offs the business is willing to make.
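As a toy illustration of moving from raw granularity to an aggregated view, the roll-up below summarizes hypothetical transactions per region using only the standard library; a data warehouse or data mart would hold comparable pre-aggregated tables:

```python
from collections import defaultdict

# Raw, high-granularity records (hypothetical).
transactions = [
    {"region": "East", "amount": 120.0},
    {"region": "West", "amount": 80.0},
    {"region": "East", "amount": 200.0},
]

# Aggregated view: totals and counts per region, optimized for analysis.
summary = defaultdict(lambda: {"total": 0.0, "count": 0})
for t in transactions:
    summary[t["region"]]["total"] += t["amount"]
    summary[t["region"]]["count"] += 1

print(dict(summary))
# → {'East': {'total': 320.0, 'count': 2}, 'West': {'total': 80.0, 'count': 1}}
```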

1.2.5 Phase 2: Data Preparation

No matter how you generate, ingest, store, or aggregate the data, you will need to prepare it before using it to generate insights – unless you are very lucky and your data comes packaged with a bow!

At this point, we need to identify a real-world problem or opportunity that our data can effectively address. Before doing that, we need to make sure the quality of the model's input data is usable. This involves a series of steps known as Data Preparation.

It is probably worth repeating the cliché here: "garbage in, garbage out." Our model will only be as good as the underlying data. Throughout data preparation, analysis, and evaluation, we assess where to leverage the insights from this data to improve decision-making, automate processes, and solve complex business problems.

Our task starts with proactively aligning with stakeholders. Cross‐functional alignment is critical as we make multiple collaborative decisions while creating and finalizing the model for its intended business objective.

In the data aggregation step, we collated the data from multiple sources. Now that we have all that data, we profile and understand it in the Data Comprehension step. This encompasses understanding the data and running an exploratory analysis, which typically includes visually inspecting the data and reviewing summary statistics. Once we get a good sense of the data through this analysis, we focus on Data Quality Engineering: we change values as required, transform them, and run data reduction techniques. We also increase the underlying data quality, which is covered in great detail in Chapter 6.
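As a minimal sketch of the summary-statistics part of Data Comprehension, using only the standard library and hypothetical sale prices (not the book's dataset):

```python
import statistics

# Hypothetical vacation-home sale prices.
prices = [288_000, 295_000, 310_000, 330_000, 402_000, 515_000]

summary = {
    "count":  len(prices),
    "min":    min(prices),
    "max":    max(prices),
    "mean":   statistics.mean(prices),
    "median": statistics.median(prices),
    "stdev":  statistics.stdev(prices),
}
print(summary["median"])  # → 320000.0
```

Comparing the mean and median (here the mean is pulled above the median by the two expensive homes) is a quick first check for skew and outliers before any modeling.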

Now, we are ready to discover insights from our data!

1.2.6 Phase 2: Data Analysis

Data Analysis aims to convert raw data into meaningful patterns, trends, associations, and insights that help solve the business problem(s) identified above. The goal is to generate insights using the most efficient techniques for a given problem, such as statistical processes, visualization techniques, or Machine Learning models. We then refine our approach to find consistent, non-redundant, and non-trivial knowledge.

We use the following three techniques for Knowledge Discovery: Data Mining, Information Retrieval, and Machine Learning.

Data Mining enables the discovery of hidden and new insights from data. It sits at the intersection of statistics, math, Computer Science, and data visualization techniques.

In Data Mining, we can apply statistics in exploratory data analysis and predictive models to reveal patterns and trends in massive data sets. For example, companies can find consistent patterns in user behavior that qualify them for a marketing offer or indicate the likelihood of a fraudulent transaction based on data from millions of users.

We can also visualize the data to reveal patterns. Consider a simple example of a vacation home's area and its impact on its sale price. In many chapters of this book, we will extend the same example of a vacation home and the factors that affect its price to illustrate our concepts, derive insights, and apply them toward our learning. In the appendix, you can find additional details about the example and data applicable to relevant chapters.

Below is a simple line plot showing the correlation between VacHomeSalePrice and VacHomeSqFt. Based on this visualization, it is clear that the vacation home sale price changes linearly with the home's square footage (Figure 1.6).
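A pattern like the one in Figure 1.6 can also be quantified numerically. The sketch below fabricates a roughly linear square-footage/price relationship (synthetic data, not the book's dataset) and computes the Pearson correlation that a plot of VacHomeSqFt against VacHomeSalePrice would make visible:

```python
import numpy as np

rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3000, size=200)               # VacHomeSqFt (synthetic)
price = 150 * sqft + rng.normal(0, 20_000, size=200)  # VacHomeSalePrice, roughly linear

r = np.corrcoef(sqft, price)[0, 1]
print(f"Pearson r = {r:.2f}")  # close to 1.0 → strong linear relationship
# Plotting these two columns (e.g., with matplotlib) would show the same pattern visually.
```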

Information Retrieval (IR) is about finding relevant data in our dataset as fast as possible. IR uses a Learning-to-Rank technique that orders the data based on the user's query and any user attributes, to personalize the results. For example, when we query Google Search, the results displayed are based on collected and processed data.

Machine Learning (ML) is a technique used to apply existing knowledge to new data as accurately as possible. Later in this chapter, we will learn more about how Knowledge Discovery techniques apply to ML, as they form the core of this book.

FIGURE 1.6 Visualization for Finding Patterns.

1.2.7 Phase 2: Data Evaluation

Data Evaluation assesses how accurately our Knowledge Discovery solution satisfies its intended purpose. The evaluation technique varies depending on how we discover knowledge, i.e., whether we use statistical models, Information Retrieval, ML techniques, or Large Language Models (LLMs):

When we use Statistical models for Data Mining analysis, we also use appropriate statistical tests to evaluate our hypotheses.

For Information Retrieval, context is important. We must maximize perceived degrees of relevance, preference, or importance while minimizing retrieval time. What makes knowledge evaluation difficult here is the word "perceived." For example, if two users see the same results for the search query "bat," their perceived relevance might differ if one intended to find a baseball bat and the other the mammal.

By integrating Large Language Models (LLMs) into the data evaluation phase, we can leverage advanced NLP capabilities to enhance the accuracy, relevance, and efficiency of the Knowledge Discovery process. This integration enables more comprehensive analysis, deeper insights, and improved decision-making across various domains and applications.

For Machine Learning techniques, the evaluation method varies according to the problem we are solving. A Machine Learning model is an algorithm that generates predictions or patterns. There are several ways to pick suitable algorithms for a given business problem based on the use case and the type of data involved. Model evaluation assesses the appropriateness of a given model using numeric metrics, scores, or loss functions to create a baseline. We then continue to compare and optimize the models and build alignment with stakeholders to ensure the model's performance is within acceptable thresholds. Once it is considered satisfactory, we deploy the model in production. This method is the focus of this book.

1.2.8 Phase 3: Knowledge Application

Knowledge application typically refers to the process of using the insights, techniques, and models developed through the study of data to solve real‐world problems or make informed decisions (Figure 1.7).

FIGURE 1.7 Knowledge Applications Examples.

Now that we have taken a preliminary pass at understanding the basics of Big Data, we can focus on Knowledge Application, which is the application of insights that we derived in the previous step to solve problems like the following:

Artificial Intelligence (AI) uses inputs from the models and generates action. For example, in an autonomous car, Machine Learning is used to identify stop signs in input images. AI uses this information to determine the next course of action: when to apply the brakes, and with how much force.

New research lays the groundwork for discoveries. Specialization areas like Computer Vision and NLP require focused problem-solving. They have led to much research and progress on Large Language Models like PaLM and GPT-4, which fuel Gemini and ChatGPT, respectively.

Decision Support