The Data Science Handbook

Field Cady

Description

Practical, accessible guide to becoming a data scientist, updated to include the latest advances in data science and related fields.

Becoming a data scientist is hard. The job focuses on mathematical tools, but also demands fluency with software engineering, understanding of a business situation, and deep understanding of the data itself. This book provides a crash course in data science, combining all the necessary skills into a unified discipline.

The focus of The Data Science Handbook is on practical applications and the ability to solve real problems, rather than theoretical formalisms that are rarely needed in practice. Among its key points are:

  • An emphasis on software engineering and coding skills, which play a significant role in most real data science problems.
  • Extensive sample code, detailed discussions of important libraries, and a solid grounding in core concepts from computer science (computer architecture, runtime complexity, and programming paradigms).
  • A broad overview of important mathematical tools, including classical techniques in statistics, stochastic modeling, regression, numerical optimization, and more.
  • Extensive tips about the practical realities of working as a data scientist, including an understanding of related job functions, project life cycles, and the varying roles of data science in an organization.
  • Exactly the right amount of theory. A solid conceptual foundation is required for fitting the right model to a business problem, understanding a tool’s limitations, and reasoning about discoveries.

Data science is a quickly evolving field, and this 2nd edition has been updated to reflect the latest developments, including the revolution in AI that has come from Large Language Models and the growth of ML Engineering as its own discipline. Much of data science has become a skillset that anybody can have, making this book not only for aspiring data scientists, but also for professionals in other fields who want to use analytics as a force multiplier in their organization.


Page count: 701

Publication year: 2024




Table of Contents

Cover

Table of Contents

Title Page

Copyright Page

Dedication Page

Preface to the First Edition

Preface to the Second Edition

1 Introduction

1.1 What Data Science Is and Isn’t

1.2 This Book’s Slogan: Simple Models Are Easier to Work With

1.3 How Is This Book Organized?

1.4 How to Use This Book?

1.5 Why Is It All in Python, Anyway?

1.6 Example Code and Datasets

1.7 Parting Words

Part I: The Stuff You’ll Always Use

2 The Data Science Road Map

2.1 Frame the Problem

2.2 Understand the Data: Basic Questions

2.3 Understand the Data: Data Wrangling

2.4 Understand the Data: Exploratory Analysis

2.5 Extract Features

2.6 Model

2.7 Present Results

2.8 Deploy Code

2.9 Iterating

2.10 Glossary

3 Programming Languages

3.1 Why Use a Programming Language? What Are the Other Options?

3.2 A Survey of Programming Languages for Data Science

3.3 Where to Write Code

3.4 Python Overview and Example Scripts

3.5 Python Data Types

3.6 GOTCHA: Hashable and Unhashable Types

3.7 Functions and Control Structures

3.8 Other Parts of Python

3.9 Python’s Technical Libraries

3.10 Other Python Resources

3.11 Further Reading

3.12 Glossary

Interlude: My Personal Toolkit

4 Data Munging: String Manipulation, Regular Expressions, and Data Cleaning

4.1 The Worst Dataset in the World

4.2 How to Identify Pathologies

4.3 Problems with Data Content

4.4 Formatting Issues

4.5 Example Formatting Script

4.6 Regular Expressions

4.7 Life in the Trenches

4.8 Glossary

5 Visualizations and Simple Metrics

5.1 A Note on Python’s Visualization Tools

5.2 Example Code

5.3 Pie Charts

5.4 Bar Charts

5.5 Histograms

5.6 Means, Standard Deviations, Medians, and Quantiles

5.7 Boxplots

5.8 Scatterplots

5.9 Scatterplots with Logarithmic Axes

5.10 Scatter Matrices

5.11 Heatmaps

5.12 Correlations

5.13 Anscombe’s Quartet and the Limits of Numbers

5.14 Time Series

5.15 Further Reading

5.16 Glossary

6 Overview: Machine Learning and Artificial Intelligence

6.1 Historical Context

6.2 The Central Paradigm: Learning a Function from Example

6.3 Machine Learning Data: Vectors and Feature Extraction

6.4 Supervised, Unsupervised, and In‐Between

6.5 Training Data, Testing Data, and the Great Boogeyman of Overfitting

6.6 Reinforcement Learning

6.7 ML Models as Building Blocks for AI Systems

6.8 ML Engineering as a New Job Role

6.9 Further Reading

6.10 Glossary

7 Interlude: Feature Extraction Ideas

7.1 Standard Features

7.2 Features that Involve Grouping

7.3 Preview of More Sophisticated Features

7.4 You Get What You Measure: Defining the Target Variable

8 Machine‐Learning Classification

8.1 What Is a Classifier, and What Can You Do with It?

8.2 A Few Practical Concerns

8.3 Binary Versus Multiclass

8.4 Example Script

8.5 Specific Classifiers

8.6 Evaluating Classifiers

8.7 Selecting Classification Cutoffs

8.8 Further Reading

8.9 Glossary

9 Technical Communication and Documentation

9.1 Several Guiding Principles

9.2 Slide Decks

9.3 Written Reports

9.4 Speaking: What Has Worked for Me

9.5 Code Documentation

9.6 Further Reading

9.7 Glossary

Part II: Stuff You Still Need to Know

10 Unsupervised Learning: Clustering and Dimensionality Reduction

10.1 The Curse of Dimensionality

10.2 Example: Eigenfaces for Dimensionality Reduction

10.3 Principal Component Analysis and Factor Analysis

10.4 Scree Plots and Understanding Dimensionality

10.5 Factor Analysis

10.6 Limitations of PCA

10.7 Clustering

10.8 Further Reading

10.9 Glossary

11 Regression

11.1 Example: Predicting Diabetes Progression

11.2 Fitting a Line with Least Squares

11.3 Alternatives to Least Squares

11.4 Fitting Nonlinear Curves

11.5 Goodness of Fit: R² and Correlation

11.6 Correlation of Residuals

11.7 Linear Regression

11.8 LASSO Regression and Feature Selection

11.9 Further Reading

11.10 Glossary

12 Data Encodings and File Formats

12.1 Typical File Format Categories

12.2 CSV Files

12.3 JSON Files

12.4 XML Files

12.5 HTML Files

12.6 Tar Files

12.7 GZip Files

12.8 Zip Files

12.9 Image Files: Rasterized, Vectorized, and/or Compressed

12.10 It’s All Bytes at the End of the Day

12.11 Integers

12.12 Floats

12.13 Text Data

12.14 Further Reading

12.15 Glossary

13 Big Data

13.1 What Is Big Data?

13.2 When to Use – And Not Use – Big Data

13.3 Hadoop: The File System and the Processor

13.4 Example PySpark Script

13.5 Spark Overview

13.6 Spark Operations

13.7 PySpark Data Frames

13.8 Two Ways to Run PySpark

13.9 Configuring Spark

13.10 Under the Hood

13.11 Spark Tips and Gotchas

13.12 The MapReduce Paradigm

13.13 Performance Considerations

13.14 Further Reading

13.15 Glossary

14 Databases

14.1 Relational Databases and MySQL®

14.2 Key–Value Stores

14.3 Wide‐Column Stores

14.4 Document Stores

14.5 Further Reading

14.6 Glossary

15 Software Engineering Best Practices

15.1 Coding Style

15.2 Version Control and Git for Data Scientists

15.3 Testing Code

15.4 Test‐Driven Development

15.5 AGILE Methodology

15.6 Further Reading

15.7 Glossary

16 Traditional Natural Language Processing

16.1 Do I Even Need NLP?

16.2 The Great Divide: Language Versus Statistics

16.3 Example: Sentiment Analysis on Stock Market Articles

16.4 Software and Datasets

16.5 Tokenization

16.6 Central Concept: Bag‐of‐Words

16.7 Word Weighting: TF‐IDF

16.8 n‐Grams

16.9 Stop Words

16.10 Lemmatization and Stemming

16.11 Synonyms

16.12 Part of Speech Tagging

16.13 Common Problems

16.14 Advanced Linguistic NLP: Syntax Trees, Knowledge, and Understanding

16.15 Further Reading

16.16 Glossary

17 Time Series Analysis

17.1 Example: Predicting Wikipedia Page Views

17.2 A Typical Workflow

17.3 Time Series Versus Time‐Stamped Events

17.4 Resampling and Interpolation

17.5 Smoothing Signals

17.6 Logarithms and Other Transformations

17.7 Trends and Periodicity

17.8 Windowing

17.9 Brainstorming Simple Features

17.10 Better Features: Time Series as Vectors

17.11 Fourier Analysis: Sometimes a Magic Bullet

17.12 Time Series in Context: The Whole Suite of Features

17.13 Further Reading

17.14 Glossary

18 Probability

18.1 Flipping Coins: Bernoulli Random Variables

18.2 Throwing Darts: Uniform Random Variables

18.3 The Uniform Distribution and Pseudorandom Numbers

18.4 Nondiscrete, Noncontinuous Random Variables

18.5 Notation, Expectations, and Standard Deviation

18.6 Dependence, Marginal, and Conditional Probability

18.7 Understanding the Tails

18.8 Binomial Distribution

18.9 Poisson Distribution

18.10 Normal Distribution

18.11 Multivariate Gaussian

18.12 Exponential Distribution

18.13 Log‐Normal Distribution

18.14 Entropy

18.15 Further Reading

18.16 Glossary

19 Statistics

19.1 Statistics in Perspective

19.2 Bayesian Versus Frequentist: Practical Tradeoffs and Differing Philosophies

19.3 Hypothesis Testing: Key Idea and Example

19.4 Multiple Hypothesis Testing

19.5 Parameter Estimation

19.6 Hypothesis Testing: t‐Test

19.7 Confidence Intervals

19.8 Bayesian Statistics

19.9 Naive Bayesian Statistics

19.10 Bayesian Networks

19.11 Choosing Priors: Maximum Entropy or Domain Knowledge

19.12 Further Reading

19.13 Glossary

20 Programming Language Concepts

20.1 Programming Paradigms

20.2 Compilation and Interpretation

20.3 Type Systems

20.4 Further Reading

20.5 Glossary

21 Performance and Computer Memory

21.1 A Word of Caution

21.2 Example Script

21.3 Algorithm Performance and Big‐O Notation

21.4 Some Classic Problems: Sorting a List and Binary Search

21.5 Amortized Performance and Average Performance

21.6 Two Principles: Reducing Overhead and Managing Memory

21.7 Performance Tip: Use Numerical Libraries When Applicable

21.8 Performance Tip: Delete Large Structures You Don’t Need

21.9 Performance Tip: Use Built‐In Functions When Possible

21.10 Performance Tip: Avoid Superfluous Function Calls

21.11 Performance Tip: Avoid Creating Large New Objects

21.12 Further Reading

21.13 Glossary

Part III: Specialized or Advanced Topics

22 Computer Memory and Data Structures

22.1 Virtual Memory, the Stack, and the Heap

22.2 Example C Program

22.3 Data Types and Arrays in Memory

22.4 Structs

22.5 Pointers, the Stack, and the Heap

22.6 Key Data Structures

22.7 Further Reading

22.8 Glossary

23 Maximum‐Likelihood Estimation and Optimization

23.1 Maximum‐Likelihood Estimation

23.2 A Simple Example: Fitting a Line

23.3 Another Example: Logistic Regression

23.4 Optimization

23.5 Gradient Descent

23.6 Convex Optimization

23.7 Stochastic Gradient Descent

23.8 Further Reading

23.9 Glossary

24 Deep Learning and AI

24.1 A Note on Libraries and Hardware

24.2 A Note on Training Data

24.3 Simple Deep Learning: Perceptrons

24.4 What Is a Tensor?

24.5 Convolutional Neural Networks

24.6 Example: The MNIST Handwriting Dataset

24.7 Autoencoders and Latent Vectors

24.8 Generative AI and GANs

24.9 Diffusion Models

24.10 RNNs, Hidden State, and the Encoder–Decoder

24.11 Attention and Transformers

24.12 Stable Diffusion: Bringing the Parts Together

24.13 Large Language Models and Prompt Engineering

24.14 Further Reading

24.15 Glossary

25 Stochastic Modeling

25.1 Markov Chains

25.2 Two Kinds of Markov Chain, Two Kinds of Questions

25.3 Hidden Markov Models and the Viterbi Algorithm

25.4 The Viterbi Algorithm

25.5 Random Walks

25.6 Brownian Motion

25.7 ARIMA Models

25.8 Continuous‐Time Markov Processes

25.9 Poisson Processes

25.10 Further Reading

25.11 Glossary

26 Parting Words

Index

End User License Agreement


The Data Science Handbook

Field Cady

Second Edition

Copyright © 2025 by John Wiley & Sons, Inc. All rights reserved, including rights for text and data mining and training of artificial intelligence technologies or similar technologies.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging‐in‐Publication Data applied for:

Hardback ISBN: 9781394234493

Cover Design: Wiley

Cover Images: © alexlmx/Adobe Stock Photos, © da‐kuk/Getty Images

To my wife, Ryna. Thank you, honey, for your support and for always believing in me.

Preface to the First Edition

This book was written to solve a problem. The people who I interview for data science jobs have sterling mathematical pedigrees, but most of them are unable to write a simple script that computes Fibonacci numbers (in case you aren’t familiar with Fibonacci numbers, this takes about five lines of code). On the other side, employers tend to view data scientists as either mysterious wizards or used‐car salesmen (and when data scientists can’t be trusted to write a basic script, the latter impression has some merit!). These problems reflect a fundamental misunderstanding, by all parties, of what data science is (and isn’t) and what skills its practitioners need.
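In case it is helpful, here is one version of that roughly five‐line script (my own sketch, not code from any interview):

# Iterative Fibonacci: about five lines of Python.
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print([fibonacci(i) for i in range(10)])   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]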

When I first got into data science, I was part of that problem. Years of doing academic physics had trained me to solve problems in a way that was long on abstract theory but short on common sense or flexibility. Mercifully, I also knew how to code (thanks, Google™ internships!), and this let me limp along while I picked up the skills and mindsets that actually mattered.

Since leaving academia, I have done data science consulting for companies of every stripe. This includes web traffic analysis for tiny start‐ups, manufacturing optimizations for Fortune 100 giants, and everything in between. The problems to solve are always unique, but the skills required to solve them are strikingly universal. They are an eclectic mix of computer programming, mathematics, and business savvy. They are rarely found together in one person, but in truth they can be learned by anybody.

One interview I gave stands out in my mind. The candidate was smart and knowledgeable, but the interview made it painfully clear that they were unprepared for the daily work of a data scientist. What do you do as an interviewer when the candidate starts apologizing for wasting your time? We ended up filling the hour with a crash course on what they were missing and how they could go out and fill the gaps in their knowledge. They went out, learned what they needed to, and are now a successful data scientist.

I wrote this book in an attempt to help people like that out, by condensing data science’s various skill sets into a single, coherent volume. It is hands‐on and to the point: ideal for somebody who needs to come up to speed quickly or solve a problem on a tight deadline. The educational system is still catching up to the demands of this new and exciting field, and my hope is that this book will help you bridge the gap.

Field Cady

September 2016

Redmond, Washington

Preface to the Second Edition

In the first edition of this book, I called the introduction “Becoming a Unicorn.” Data science was a new field that was poorly understood, and data scientists were often called “unicorns” in reference to their miraculous ability to do both math and programming. I wrote the book with one central message: data science isn’t as inaccessible as people are making it out to be. It is perfectly reasonable for somebody to acquire the whole palette of skills required, and my book aspired to be a one‐stop‐shop for people to learn them.

A great deal has changed since then, and I’m delighted that the educational system has caught up. There are now degree programs and bootcamps that can teach the essentials of data science to most anybody who is willing to learn them. There are relatively standard curricula, fewer people who are baffled by the subject, and more young professionals embarking on this career path. Data science has gone from being an obscure priesthood to an exciting career that normal people can have.

As the discipline has expanded, the tools have also evolved, and I felt that a second edition was in order. By far the most important change I have made is more coverage of deep learning: previously I barely touched on RNNs, but now I continue up through topics such as encoder–decoder architectures, diffusion models, LLMs, and prompt engineering. AI tools are coming of age (perhaps AI is now where data science was 10 years ago) and a data scientist needs to be familiar with them. I have also updated my treatment of Spark to cover its new DataFrame interface, and reduced the emphasis on Hadoop since it is on the decline. Other changes include a reduced emphasis on Bayesian networks (which have waned in popularity with the rise of deep learning), a switch from Python 2 to Python 3, and numerous improvements to the prose.

Field Cady

Redmond, Washington

1 Introduction

The goal of this book is to turn you into a data scientist, and there are two parts to this mission. First, there is a set of specific concepts, tools, and techniques that you can go out and solve problems with today. They include buzzwords such as machine learning (ML), Spark, and natural language processing (NLP). They also include concepts that are distinctly less sexy but often more useful, like regular expressions, unit tests, and SQL queries. It would be impossible to give an exhaustive list in any single book, but I cast a wide net.

That brings me to the second part of my goal. Tools are constantly changing, and your long‐term future as a data scientist depends less on what you know today and more on what you are able to learn going forward. To that end, I want to help you understand the concepts behind the algorithms and the technological fundamentals that underlie the tools we use. For example, this is why I spend a fair amount of time on computer memory and optimization: they are often the underlying reason that one approach is better than another. If you understand the key concepts, you can make the right trade‐offs, and you will be able to see how new ideas are related to older ones.

As the field evolves, data science is becoming not just a discipline in its own right, but also a skillset that anybody can have. The software tools are getting better and easier to use, best practices are becoming widely known, and people are learning many of the key skills in school before they’ve even started their career. There will continue to be data science specialists, but there is also a growing number of the so‐called “citizen data scientists” whose real job is something else. They are engineers, biologists, UX designers, programmers, and economists: professionals from all fields who have learned the techniques of data science and are fruitfully applying them to their main discipline.

This book is aimed at anybody who is entering the field. Depending on your background, some parts of it may be stuff you already know. Especially for citizen data scientists, other parts may be unnecessary for your work. But taken as a whole, this book will give you a practical skillset for today, and a solid foundation for your future in data science.

1.1 What Data Science Is and Isn’t

Despite the fact that “data science” is widely practiced and studied today, the term itself is somewhat elusive. So before we go any further, I’d like to give you the definition that I use. I’ve found that this one gets right to the heart of what sets it apart from other disciplines. Here goes:

Data science means doing analytically oriented work that, for one reason or another, requires a substantial amount of software engineering skills.

Often the final deliverable is the kind of thing a statistician or business analyst might provide, but achieving that goal demands software skills that your typical analyst simply doesn’t have – writing a custom parser for an obscure data format, keeping complex preprocessing logic organized, and so on. Other times the data scientist will need to write production software based on their insights, or perhaps make their model available in real time. Often the dataset itself is so large that just creating a pie chart requires that the work be done in parallel across a cluster of computers. And sometimes, it’s just a really gnarly SQL query that most people struggle to wrap their heads around.

Nate Silver, a statistician famous for accurate forecasting of US elections, once said: “I think data scientist is a sexed‐up term for statistician.” He has a point, but what he said is only partly true. The discipline of statistics deals mostly with rigorous mathematical methods for solving well‐defined problems; data scientists spend most of their time getting data and the problem into a form where statistical methods can even be applied. This involves making sure that the analytics problem is a good match to business objectives, choosing what to measure and how to quantify things (typically more the domain of a BI analyst), extracting meaningful features from the raw data, and coping with any pathologies of the data or weird edge cases (which often requires a level of coding more typical of a software engineer). Once that heavy lifting is done, you can apply statistical tools to get the final results – although, in practice, you often don’t even need them. Professional statisticians need to do a certain amount of preprocessing themselves, but there is a massive difference in degree.

Historically, statistics focused on rigorous methods to analyze clean datasets, such as those that come out of controlled experiments in medicine and agriculture. Often the data was gathered explicitly to support the statisticians’ analysis! In the 2000s, though, a new class of datasets became popular to analyze. “Big Data” used new cluster computing tools to study large, messy, heterogeneous datasets of the sort that would make statisticians shudder: HTML pages, image files, e‐mails, raw output logs of web servers, and so on. These datasets don’t fit the mold of relational databases or statistical tools, and they were not designed to facilitate any particular statistical analysis; so for decades, they were just piling up without being analyzed. Data science came into being as a way to finally milk them for insights. Most of the first data scientists were computer programmers or ML experts who were working on Big Data problems, not statisticians in the traditional sense.

The lines have now blurred: statisticians do more coding than they used to, Big Data tools are less central to the work of a data scientist, and ML is used by a broad swath of people. And this is healthy: the differences between these fields are, after all, really just a matter of degree and/or historical accident. But, in practical terms, “data scientists” are still the jacks‐of‐all‐trades in the middle. They can do statistics, but if you’re looking to tease every last insight out of clinical trial data, you should consult a statistician. They can train and deploy ML models, but if you’re trying to eke performance out of a large neural network, an ML engineer would be better. They can turn business questions into math problems, but they may not have the deep business knowledge of an analyst.

1.2 This Book’s Slogan: Simple Models Are Easier to Work With

There is a common theme in this book that I would like to call out as the book’s explicit motto: simple models are easier to work with. Let me explain.

People tend to idolize and gravitate toward complicated analytical models like deep neural nets, Bayesian networks, ARIMA models, and the like. There are good reasons to use these tools; the best‐performing models in the world are usually complicated, there may be fancy ways to bake in expert knowledge, etc. There are also bad reasons to use these tools, like ego and pressure to use the latest buzzwords.

But seasoned data scientists understand that there is more to a model than how accurate it is. Simple models are, above all, easier to reason about. If you’re trying to understand what patterns in the data your model is picking up on, simple models are the way to go. Oftentimes this is the whole point of a model anyway: we are just trying to get insights into the system we are studying, and a model’s performance is just used to gauge how fully it has captured the relevant patterns in the data.

A related advantage of simple models is supremely mundane: stuff breaks, and they make it easier to find what’s broken. Bad training data, perverse inputs to the model, and data that is incorrectly formatted – all of these are liable to cause conspicuous failures, and it’s easy to figure out what went wrong by dissecting the model. For this reason, I like “stunt double models,” which have the same input/output format as a complicated one and are used to debug the model’s integration with other systems.
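To make the idea concrete, here is a minimal sketch of a stunt double model, assuming a scikit‐learn‐style fit/predict interface; the class name and its trivial logic are my own illustration, not code from elsewhere in the book:

# A "stunt double" model: same input/output signature as a real
# classifier, but trivially simple logic, so that failures in the
# surrounding pipeline are easy to diagnose.
import numpy as np

class StuntDoubleClassifier:
    """Stands in for a complex model while debugging integrations."""
    def fit(self, X, y):
        # No real training: just remember the most common label.
        values, counts = np.unique(y, return_counts=True)
        self.constant_label = values[np.argmax(counts)]
        return self

    def predict(self, X):
        # Always predict that label, in the shape a real classifier
        # would return.
        return np.full(len(X), self.constant_label)

Once the pipeline runs end to end with the stunt double, you can swap in the complicated model knowing that any new failure lies in the model itself.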

Simple models are less prone to overfitting. If your dataset is small, a fancy model will often actually perform worse: it essentially memorizes the training data, rather than extracting general patterns from it. The simpler a model, the less you have to worry about the size of your dataset (though admittedly this can create a square‐peg‐in‐a‐round‐hole situation where the model can’t fit the data well and performance degrades).

Simple models are easier to hack and jury‐rig. Frequently they have a small number of tunable parameters, with clear meanings that you can adjust to suit the business needs at hand.

The inferior performance of simple models can act as a performance benchmark, a level that the fancier model must meaningfully exceed in order to justify its extra complexity. And if a simple model performs particularly badly, this may suggest that there isn’t enough signal in the data to make the problem worthwhile.

On the other hand, when there is enough training data and it is representative of what you expect to see, fancier models do perform better. You usually don’t want to leave money on the table by deploying grossly inferior models simply because they are easier to debug. And there are many situations, like cutting‐edge AI, where the relevant patterns are very complicated, and it takes a complicated model to accurately capture them. Even in these cases though, it is often possible to keep the complexity modular and, hence, easier to reason about. For example, say we are choosing which ads to show to which customer. Instead of directly predicting the click‐rate for various ads and picking the best one, we might have a very complex model that assigns the person to some pre‐existing user segments, and then a simple model that shows them ads based on the segments they are in. This model will be easier to debug and much more scalable.

Model complexity is an area that requires critical thinking and flexibility. Simple models are often good enough for the problem at hand, especially in situations where training data is limited anyway. When more complexity is justified, it is often buttressed by an army of simple models that tackle various subproblems (like various forms of cleaning and labeling the training data). Simple models are easier to work with, but fancy ones sometimes give better performance: technical and data considerations tell you the constraints, and business value should guide the ultimate choice.

1.3 How Is This Book Organized?

This book is organized into three sections. The first, The Stuff You’ll Always Use, covers topics that, in my experience, you will end up using in almost any data science project. They are core skills, which are absolutely indispensable for data science at any level.

The first section was also written with an eye toward people who need data science to answer a specific question but do not aspire to become full‐fledged data scientists. If you are in this camp, then there is a good chance that Part I of the book will give you everything you need.

The second section, Stuff You Still Need to Know, covers additional core skills for a data scientist. Some of these, such as clustering, are so common that they almost made it into the first section, and they could easily play a role in any project. Others, such as NLP, are somewhat specialized subjects that are critical in certain domains but superfluous in others. In my judgment, a data scientist should be conversant in all of these subjects, even if they don’t always use them all.

The final section, Specialized or Advanced Topics, covers a variety of optional topics. Some of these chapters are just expansions on material from the first two sections, giving more theoretical background and discussing additional subjects. Others are entirely new material, which does come up in data science, but which you could go through a whole career without ever running into.

1.4 How to Use This Book?

This book was written with three use cases in mind:

You can read it cover‐to‐cover. If you do that, it should give you a self‐contained course in data science that will leave you ready to tackle real problems. If you have a strong background in computer programming, or in mathematics, then some of it will be review.

You can use it to come quickly up to speed on a specific subject. I have tried to make the different chapters pretty self‐contained, especially the chapters after the first section.

The book contains a lot of sample code, in pieces that are large enough to use as a starting point for your own projects.

1.5 Why Is It All in Python, Anyway?

The example code in this book is all in Python, except for a few domain‐specific languages such as SQL. My goal isn’t to push you to use Python; there are lots of good tools out there, and you can use whichever ones you want.

However, I wanted to use one language for all of my examples, which lets readers follow the whole book while only knowing one language. Of the various languages available, there are two reasons why I chose Python:

Python is without question the most popular language for data scientists. R is its only major competitor, at least when it comes to free tools. I have used both extensively, and I think that Python is flat‐out better (except for some obscure statistics packages that have been written in R and that are rarely needed anyway).

I like to say that Python is the second‐best language for any task. It’s a jack‐of‐all‐trades. If you only need to worry about statistics, or numerical computation, or web parsing, then there are better options out there. But if you need to do all of these things within a single project, then Python is your best bet. Since data science is so inherently multidisciplinary, this makes it a perfect fit.

As a note of advice, it is much better to be proficient in one language, to the point where you can reliably churn out code that is of high quality, than to be mediocre at several.

1.6 Example Code and Datasets

This book is rich in example code, in fairly long chunks. This was done for two reasons:

As a data scientist, you need to be able to read longish pieces of code. This is a non‐optional skill, and if you aren’t used to it, then this will give you a chance to practice.

I wanted to make it easier for you to poach the code from this book, if you feel so inclined.

You can do whatever you want with the code, with or without attribution. I release it into the public domain in the hope that it can give some people a small leg up. You can find it on my GitHub page at www.github.com/field-cady.

The sample data that I used comes in two forms:

Test datasets that are built into Python’s scientific libraries (an example sketch follows this list).

Data that is pulled off the Internet, from sources such as Yahoo and Wikipedia. When I do this, the example scripts will include code that pulls the data.
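As an illustration of the first kind, a built‐in dataset can be loaded in a couple of lines; this sketch uses scikit‐learn’s diabetes data:

# Loading a test dataset that ships with Python's scientific
# libraries (scikit-learn's diabetes data).
from sklearn.datasets import load_diabetes

data = load_diabetes()
X, y = data.data, data.target   # feature matrix and target variable
print(X.shape, y.shape)         # (442, 10) (442,)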

1.7 Parting Words

It is my hope that this book not only teaches you how to do nuts‐and‐bolts data science but also gives you a feel for how exciting this deeply interdisciplinary subject is. Please feel free to reach out to me at www.fieldcady.com or [email protected] with comments, errata, or any other feedback.

Part I: The Stuff You’ll Always Use

The first section of this book covers core topics that everybody doing data science should know. This includes people who are not interested in being professional data scientists, but need to know just enough to solve some specific problem. These are the subjects that will likely arise in every data science project you do.

2 The Data Science Road Map

In this chapter, I will give you a high‐level overview of the process of data science. I will focus on the different stages of data science work, including common pain points, key things to get right, and where data science parts ways from other disciplines.

The process of solving a data science problem is summarized in the following figure, which I call the Data Science Road Map.

The first step is always to frame the problem: understand the business use case and craft a well‐defined analytics problem (or problems) out of it. This is followed by an extensive stage of grappling with the data and the real‐world things that it describes, so that we can extract meaningful features. Finally, these features are plugged into analytical tools that give us hard numerical results.

Before I go into more detail about the different stages of the roadmap, I want to point out two things.

The first is that “Model and Analyze” loops back to framing the problem. This is one of the key features of data science that differentiate it from traditional software engineering. Data scientists write code, and they use many of the same tools as software engineers. However, there is a tight feedback loop between data science work and the real world. Questions are always being reframed as new insights become available. As a result, data scientists must keep their code base extremely flexible and always have an eye toward the real‐world problem they are solving. Often you will follow the loop back many times, constantly refining your methods and producing new insights.

The second point is that there are two different (although not mutually exclusive) ways to exit the road map: presenting results and deploying code. My friend Michael Li, a data scientist who founded The Data Incubator, likened this to having two different types of clients: humans and machines. They require distinct skill sets and modifications to every stage of the data science road map.

If your clients are humans, then usually you are trying to use available data sources to answer some kind of business problem. Examples would be the following:

Identifying leading indicators of spikes in the price of a stock, so that people can understand what causes price spikes.

Determining whether customers break down into natural subtypes and what characteristics each type has.

Assessing whether traffic to one website can be used to predict traffic to another site.

Typically, the final deliverable for work such as this will be a PowerPoint slide deck or a written report. The goal is to give business insights, and often these insights will be used for making key decisions. This kind of data science also functions as a way to test the waters and see whether some analytics approach is worth a larger follow‐up project that may result in production software.

If your clients are machines, then you are doing something that blends into software engineering, where the deliverable is a piece of software that performs some analytics work. Examples would be the following:

Implementing the algorithm that chooses which ad to show to a customer and training it on real data.

Writing a batch process that generates daily reports based on company records generated that day, using some kind of analytics to point out salient patterns.

In these cases, your main deliverable is a piece of software. In addition to performing a useful task, it had better work well in terms of performance, robustness to bad inputs, and so on.

Once you understand who your clients are, the next step is to determine what you’ll be doing for them. In the next section, I will show you how to do this all‐important step.

2.1 Frame the Problem

The difference between great and mediocre data science is not about math or engineering: it is about asking the right question(s). Alternatively, if you’re trying to build some piece of software, you need to decide what exactly that software should do. No amount of technical competence or statistical rigor can make up for having solved a useless problem.

If your clients are humans, most projects start with some kind of extremely open‐ended question. Perhaps there is a known pain point, but it’s not clear what a solution would look like. If your clients are machines, then the business problem is usually pretty clear, but there can be a lot of ambiguity about what constraints there might be on the software (languages to use, runtime, how accurate predictions need to be, etc.). Before diving into actual work, it’s important to clarify exactly what would constitute a solution to this problem. A “definition of done” is a good way to put it: what criteria constitute a completed project, and (most importantly) what would be required to make the project a success?

For large projects, these criteria are often laid out in a document. Writing that document is a collaborative process involving a lot of back‐and‐forth with stakeholders, negotiation, and sometimes disagreement. In consulting, these documents are often called “statements of work” or SOWs. Within a company that is creating a product (as opposed to just a stand‐alone investigation), they are often referred to as “product requirements documents” or PRDs.

The main purpose of an SOW is to get everybody on the same page about exactly what work should be done, what the priorities are, and what expectations are realistic. Business problems are typically very vague to start off with, and it takes a lot of time and effort to follow a course of action through to the final result. So before investing that effort, it is critical to make sure that you are working on the right problem. Crafting the SOW will often include a range of one‐off analyses that gauge which avenues are promising enough to commit resources to.

There is, however, also an element of self‐defense. Sometimes it ends up being impossible to solve a problem with the available data or maybe stakeholders decide that the project isn’t important anymore. A good SOW keeps everybody honest in case things don’t work out: everybody agrees up‐front that this looks like it will be both valuable and feasible.

Having an SOW doesn’t set things in stone. There are course corrections based on preliminary discoveries. Sometimes, people change their minds after the SOW has been signed. It happens. But, crafting an SOW is the best way to make sure that all efforts are pointed in the most useful direction.

2.2 Understand the Data: Basic Questions

Once you have access to the data you’ll be using, it’s good to have a battery of standard questions that you always ask about it. This is a good way to hit the ground running with your analyses, rather than risk analysis paralysis. It is also a good safeguard to identify problems with the data as quickly as possible.

A few good generic questions to ask are as follows (the sketch after this list shows how several of them can be checked in code):

How big is the dataset? Is this the entire dataset or just a sample? If it’s just a sample, do we know how the sampling was done?

Is this data representative enough? For example, maybe data was only collected for a subset of users.

Are there likely to be gross outliers or extraordinary sources of noise? For example, 99% of the traffic from a web server might be a single denial‐of‐service attack.

Are there likely to be heavy tails? For example, the vast majority of web traffic might go to only a few sites, and if those sites are over‐ or under‐represented in a sample you took, your metrics might be misleading.

Might there be artificial data inserted into the dataset? This happens a lot in industrial settings.

Are there any fields that are unique identifiers? These are the fields you might use for joining between datasets, etc. Make sure that unique ID fields are actually unique – they often aren’t.

If there are two datasets A and B that need to be joined, what does it mean if something in A doesn’t match anything in B?

When data entries are blank, where do the blanks come from?

How common are blank entries?
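Several of these checks are easy to automate as soon as the data is loaded. Here is a minimal pandas sketch; the file name and the customer_id column are hypothetical placeholders:

# Quick first-pass checks on a new dataset. The file name and
# "customer_id" column are hypothetical.
import pandas as pd

df = pd.read_csv("transactions.csv")

print("Rows, columns:", df.shape)      # how big is the dataset?
print(df.isna().sum())                 # how common are blank entries?
# A "unique" ID field often isn't: any nonzero count is a red flag.
print("Duplicated IDs:", df["customer_id"].duplicated().sum())
print(df.describe())                   # gross outliers and heavy tails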

The most important question to ask about the data is whether it can solve the business problem that you are trying to tackle. If not, then you might need to look into additional sources of data or modify the work that you are planning.

Speaking from personal experience, I have been inclined to neglect these preliminary questions. I am excited to get into the actual analysis, so I’ve sometimes jumped right in without taking the time to make sure that I know what I’m doing. For example, I once had a project where there was a collection of motors and time series data monitoring their physical characteristics: one time series per motor. My job was to find leading indicators of failure, and I started doing this by comparing the last day’s worth of time series for a given motor (i.e., the data taken right before it failed) against its previous data. Well, I realized a couple of weeks in that sometimes the time series stopped long before the motor actually failed, and, in other cases, the time series data continued long after the motor was dead. The actual times the motors had died were listed in a separate table, and it would have been easy for me to double‐check early on that they corresponded to the ends of the time series.

2.3 Understand the Data: Data Wrangling

Data wrangling is the process of getting the data from its raw format into something suitable for more conventional analytics. This typically means creating a software pipeline that gets the data out of wherever it is stored, does any cleaning or filtering necessary, and puts it into a regular format.

Data wrangling is the main area where data scientists need skills that a traditional statistician or analyst doesn’t have. The data is often stored in a special‐purpose database that requires specialized tools to access. There could be so much of it that Big Data techniques are required to process it. You might need to use performance tricks to make things run quickly. Especially with messy data, the preprocessing pipelines are often so complex that it is very difficult to keep the code organized.

Speaking of messy data, I should tell you this upfront: industrial datasets are always more convoluted than you would think they reasonably should be. The question is not whether the problems exist but whether they impact your work. My recipe for figuring out how a particular dataset is broken includes the following (a code sketch follows the list):

If the raw data is text, look directly at the plain files in a text editor or something similar. Things such as irregular date formats, irregular capitalizations, and lines that are clearly junk will jump out at you.

If there is a tool that is supposed to be able to open or process the data, make sure that it can actually do it. For example, if you have a CSV file, try opening it in something that reads data frames. Did it read all the rows in? If not, maybe some rows have the wrong number of entries. Did the column that is supposed to be a datetime get read in as a datetime? If not, then maybe the formatting is irregular.

Do some histograms and scatterplots. Are these numbers realistic, given what you know about the real‐life situation? Are there any massive outliers?

Take some simple questions that you already know the (maybe approximate) answer to, answer them based on this data, and see if the results agree. For example, you might try to calculate the number of customers by counting how many unique customer IDs there are. If these numbers don’t agree, then you’ve probably misunderstood something about the data.
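As an illustration of the second item on this list, here is a sketch of verifying that a CSV file actually parses the way it should; the file and column names are hypothetical:

# Sanity-check that a CSV file reads in cleanly. The file name and
# "timestamp" column are hypothetical.
import pandas as pd

df = pd.read_csv("sensor_log.csv")

# Did every row survive? Compare against a raw line count.
with open("sensor_log.csv") as f:
    raw_lines = sum(1 for _ in f) - 1   # minus the header row
print("Rows read:", len(df), "of", raw_lines)

# Did the timestamp column parse as a datetime? NaT values after
# coercion point to irregular formatting.
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
print("Unparseable timestamps:", df["timestamp"].isna().sum())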

2.4 Understand the Data: Exploratory Analysis

Once you have the data digested into a usable format, the next step is exploratory analysis. This basically means poking around in the data, visualizing it in lots of different ways, trying out different ways to transform it, and seeing what there is to see. This stage is very creative, and it’s a great place to let your curiosity run a little wild. Feel free to calculate some correlations and similar metrics, but don’t break out the fancy machine learning classifiers. Keep things simple and intuitive.

There are two things that you typically get out of exploratory analysis:

You develop an intuitive feel for the data, including what the salient patterns look like visually. This is especially important if you’re going to be working with similar data a lot in the future. This also helps ferret out pathologies in the data that weren’t found earlier.

You get a list of concrete hypotheses about what’s going on in the data. Oftentimes, a hypothesis will be motivated by a compelling graphic that you generated: a snapshot of a time series that shows an unmistakable pattern, a scatterplot demonstrating that two variables are related to each other, or a histogram that is clearly bimodal.

A common misconception is that data scientists don’t need visualizations. This attitude is not only inaccurate: it is very dangerous. Most machine learning algorithms are not inherently visual, but it is very easy to misinterpret their outputs if you look only at the numbers. There is no substitute for the human eye when it comes to making intuitive sense of things.

2.5 Extract Features

This stage has a lot of overlap with exploratory analysis and data wrangling. A feature is really just a number or a category that is extracted from your data and describes some entity. For example, you might extract the average word length from a text document or the number of characters in the document. Or, if you have temperature measurements, you might extract the average temperature for a particular location.

In practical terms, feature extraction means taking your raw datasets and distilling them down into a table with rows and columns. This is called “tabular data.” Each row corresponds to some real‐world entity, and each column gives a single piece of information (generally a number) that describes that entity. Virtually all analytics techniques, from lowly scatterplots to fancy neural networks, operate on tabular data.
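For instance, the text‐document features mentioned above can be distilled into a table like this (a sketch with made‐up documents):

# One row per document, one column per feature: tabular data.
# The documents here are made up.
import pandas as pd

docs = ["the quick brown fox", "jumps over the lazy dog", "hello"]

features = pd.DataFrame({
    "n_chars": [len(d) for d in docs],
    "n_words": [len(d.split()) for d in docs],
    "avg_word_len": [sum(len(w) for w in d.split()) / len(d.split())
                     for d in docs],
})
print(features)   # ready for scatterplots or neural networks alike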

Extracting good features is the most important thing for getting your analysis to work. It is much more important than good machine‐learning classifiers, fancy statistical techniques, or elegant code. Especially if your data doesn’t come with readily available features (as is the case with web pages, images, etc.), how you reduce it to numbers will make the difference between success and failure.

Feature extraction is also the most creative part of data science and the one most closely tied to domain expertise. Typically, a really good feature will correspond to some real‐world phenomenon. Data scientists should work closely with domain experts and understand what these phenomena mean and how to distill them into numbers.

Sometimes, there is also room for creativity as to what entities you are extracting features about. For example, let’s say that you have a bunch of transaction logs, each of which gives a person’s name and e‐mail address. Do you want to have one row per human or one row per e‐mail address? For many real‐world situations, you want one row per human (in which case, the number of unique e‐mail addresses they have might be a good feature to extract!), but that opens the very thorny question of how you can tell when two people are the same based on their names.

Most features that we extract will be used to predict something. However, you may also need to extract the thing that you are predicting, which is also called the target variable. For example, I was once tasked with predicting whether my client’s customers would lose their brand loyalty. There was no “loyalty” field in the data: it was just a log of various customer interactions and transactions. I had to figure out a way to measure “loyalty.”

2.6 Model

Once features have been extracted, most data science projects involve some kind of machine‐learning model. Maybe this is a classifier that guesses whether a customer is still loyal, a regression model that predicts a stock’s price on the next day, or a clustering algorithm that breaks customers into different segments.

In many data science projects, the modeling stage is quite simple: you just take a standard suite of models, plug your data into each one of them, and see which one works best. In other cases, a lot of care is taken to carefully tune a model and eke out every last bit of performance.
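In the simple case, that might look like the following sketch, which uses scikit‐learn and one of its built‐in datasets as a stand‐in for features prepared in the earlier stages:

# Plugging one feature table into a standard suite of classifiers
# and comparing cross-validated accuracy. The dataset is a built-in
# stand-in for real extracted features.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "mean accuracy:", round(scores.mean(), 3))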

Analyzing results should really happen at every stage of a data science project, but it becomes especially crucial at the modeling stage. If you have identified different clusters, what do they correspond to? Does your classifier work well enough to be useful? Is there anything interesting about the cases in which it fails?

This stage is what allows for course corrections in a project and gives ideas for what to do differently if there is another iteration.

If your client is a human, it is common to use a variety of models, tuned in different ways, to examine different aspects of your data. If your client is a machine though, you will probably need to zero in on a single, canonical model that will be used in production.

2.7 Present Results

If your client is a human, then you will probably have to give either a slide deck or a written report describing the work you did and what your results were. You are also likely to have to do this even if your main clients are machines.

Communication in slide decks and prose is a difficult, important skill set in itself. But, it is especially tricky with data science, where the material you are communicating is highly technical and you are presenting to a broad audience. Data scientists must communicate fluidly with business stakeholders, domain experts, software engineers, and business analysts. These groups tend to have different knowledge bases coming in, different things they will be paying attention to, and different presentation styles to which they are accustomed.

I can’t emphasize enough the fact that your numbers and figures should be reproducible. There is nothing worse than getting probing questions about a graphic that you can’t answer because you don’t have a record of exactly how it was generated.

2.8 Deploy Code

If your ultimate clients are computers, then it is your job to produce code that will be run regularly in the future by other people. Typically, this falls into one of two categories:

Batch analytics code.

This will be used to redo an analysis similar to the one that has already been done, on data that will be collected in the future. Sometimes, it will produce some human‐readable analytics reports. Other times, it will train a statistical model that will be referenced by other code.

Real‐time code.

This will typically be an analytical module in a larger software package, written in a high‐performance programming language and adhering to all the best practices of software engineering.

There are three typical deliverables from this stage:

The code itself, often baked into a Docker container or something similar. The latter allows the data scientist to have responsibility for the code itself, while engineers handle the system that it plugs into.

Some documentation of how to run the code. Sometimes, this is a stand‐alone document, often called a “run book.” Other times, the documentation is embedded in the code.

Usually, you need some way to test that the code operates correctly. For real‐time code, this will normally take the form of unit tests. For batch processes, it is sometimes a sample input dataset (designed to illustrate all the relevant edge cases) along with what the output should look like.

In deploying code, data scientists often take on a dual role as full‐fledged software engineers. Especially with very intricate algorithms, it often just isn’t practical to have one person spec it out and another implement the same thing for production.

2.9 Iterating

Data science is a deeply iterative process, even more so than typical software engineering. This is because in software you generally have a pretty good idea what you’re aiming to create, even if you take an iterative approach to implementing it. But, in data science, it is usually an open question of what features will end up being useful to extract and what model you will train. For this reason, the data science process should be built around the goal of being able to change things painlessly.

My recommendations are as follows:

Try to get preliminary results as quickly as possible after you’ve understood the data. A scatterplot or histogram that shows you that there is a clear pattern in the data. Maybe a simple model based on crude preliminary features that nonetheless works. Sometimes an analysis is doomed to failure, because there just isn’t much signal in the data. If this is the case, you want to know sooner rather than later, so that you can change your focus.

Automate relentlessly: put your analysis into a single script or notebook so that it’s easy to run the whole thing at once. This is a point that I’ve learned the hard way: it is really, really easy after several hours at the command line to lose track of exactly what processing you did to get your data into its current form. Keep things reproducible from the beginning.

Keep your code modular and broken out into clear stages. This makes it easy to modify, add in, and take out steps as you experiment (a minimal skeleton appears below).
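A minimal skeleton of that structure might look like the following sketch; every function is a placeholder for whatever a real project needs, and the input file name is hypothetical:

# Skeleton of a modular, rerunnable analysis. Each stage is a
# placeholder; the input file name is hypothetical.
import pandas as pd

def load_data():
    return pd.read_csv("raw_data.csv")

def clean(df):
    return df.dropna()          # real cleaning logic goes here

def extract_features(df):
    return df                   # real feature extraction goes here

def model(features):
    print(features.describe())  # real modeling goes here

if __name__ == "__main__":
    # One entry point reruns the whole pipeline, so every result
    # stays reproducible from the raw data.
    model(extract_features(clean(load_data())))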

Notice how much of this comes down to considerations of software, not analytics. The code must be flexible enough to solve all manner of problems, powerful enough to do it efficiently, and comprehensible enough to edit quickly if objectives change. Doing this requires that data scientists use flexible, powerful programming languages, which I will discuss in the next chapter.

2.10 Glossary

Data wrangling

The nitty‐gritty task of cleaning data and getting it into a standard format that is suitable for downstream analysis.

Exploratory analysis

A stage of analysis that focuses on exploring the data to generate hypotheses about it. Exploratory analysis relies heavily on visualizations.

Feature

A small piece of data, usually a number or a label, that is extracted from your data and characterizes some entity in your dataset.

Product requirements document (PRD)

A document that specifies exactly what functionality a planned product should have.

Production code

Software that is run repeatedly and maintained. It especially refers to the source code of a software product that is distributed to other people.

Statement of work (SOW)

A document that specifies what work is to be done in a project, relevant timelines, and specific deliverables.

Target variable

A feature that you are trying to predict in machine learning. Sometimes, it is already in your data, and other times, you must construct it yourself.