Practical, accessible guide to becoming a data scientist, updated to include the latest advances in data science and related fields.
Becoming a data scientist is hard. The job focuses on mathematical tools, but also demands fluency with software engineering, understanding of a business situation, and deep understanding of the data itself. This book provides a crash course in data science, combining all the necessary skills into a unified discipline.
The focus of The Data Science Handbook is on practical applications and the ability to solve real problems, rather than on theoretical formalisms that are rarely needed in practice.
Data science is a quickly evolving field, and this 2nd edition has been updated to reflect the latest developments, including the revolution in AI that has come from Large Language Models and the growth of ML Engineering as its own discipline. Much of data science has become a skillset that anybody can have, making this book not only for aspiring data scientists, but also for professionals in other fields who want to use analytics as a force multiplier in their organization.
Page count: 701
Year of publication: 2024
Cover
Table of Contents
Title Page
Copyright Page
Dedication Page
Preface to the First Edition
Preface to the Second Edition
1 Introduction
1.1 What Data Science Is and Isn’t
1.2 This Book’s Slogan: Simple Models Are Easier to Work With
1.3 How Is This Book Organized?
1.4 How to Use This Book?
1.5 Why Is It All in Python, Anyway?
1.6 Example Code and Datasets
1.7 Parting Words
Part I: The Stuff You’ll Always Use
2 The Data Science Road Map
2.1 Frame the Problem
2.2 Understand the Data: Basic Questions
2.3 Understand the Data: Data Wrangling
2.4 Understand the Data: Exploratory Analysis
2.5 Extract Features
2.6 Model
2.7 Present Results
2.8 Deploy Code
2.9 Iterating
2.10 Glossary
3 Programming Languages
3.1 Why Use a Programming Language? What Are the Other Options?
3.2 A Survey of Programming Languages for Data Science
3.3 Where to Write Code
3.4 Python Overview and Example Scripts
3.5 Python Data Types
3.6 GOTCHA: Hashable and Unhashable Types
3.7 Functions and Control Structures
3.8 Other Parts of Python
3.9 Python’s Technical Libraries
3.10 Other Python Resources
3.11 Further Reading
3.12 Glossary
Interlude: My Personal Toolkit
4 Data Munging: String Manipulation, Regular Expressions, and Data Cleaning
4.1 The Worst Dataset in the World
4.2 How to Identify Pathologies
4.3 Problems with Data Content
4.4 Formatting Issues
4.5 Example Formatting Script
4.6 Regular Expressions
4.7 Life in the Trenches
4.8 Glossary
5 Visualizations and Simple Metrics
5.1 A Note on Python’s Visualization Tools
5.2 Example Code
5.3 Pie Charts
5.4 Bar Charts
5.5 Histograms
5.6 Means, Standard Deviations, Medians, and Quantiles
5.7 Boxplots
5.8 Scatterplots
5.9 Scatterplots with Logarithmic Axes
5.10 Scatter Matrices
5.11 Heatmaps
5.12 Correlations
5.13 Anscombe’s Quartet and the Limits of Numbers
5.14 Time Series
5.15 Further Reading
5.16 Glossary
6 Overview: Machine Learning and Artificial Intelligence
6.1 Historical Context
6.2 The Central Paradigm: Learning a Function from Example
6.3 Machine Learning Data: Vectors and Feature Extraction
6.4 Supervised, Unsupervised, and In‐Between
6.5 Training Data, Testing Data, and the Great Boogeyman of Overfitting
6.6 Reinforcement Learning
6.7 ML Models as Building Blocks for AI Systems
6.8 ML Engineering as a New Job Role
6.9 Further Reading
6.10 Glossary
7 Interlude: Feature Extraction Ideas
7.1 Standard Features
7.2 Features that Involve Grouping
7.3 Preview of More Sophisticated Features
7.4 You Get What You Measure: Defining the Target Variable
8 Machine‐Learning Classification
8.1 What Is a Classifier, and What Can You Do with It?
8.2 A Few Practical Concerns
8.3 Binary Versus Multiclass
8.4 Example Script
8.5 Specific Classifiers
8.6 Evaluating Classifiers
8.7 Selecting Classification Cutoffs
8.8 Further Reading
8.9 Glossary
9 Technical Communication and Documentation
9.1 Several Guiding Principles
9.2 Slide Decks
9.3 Written Reports
9.4 Speaking: What Has Worked for Me
9.5 Code Documentation
9.6 Further Reading
9.7 Glossary
Part II: Stuff You Still Need to Know
10 Unsupervised Learning: Clustering and Dimensionality Reduction
10.1 The Curse of Dimensionality
10.2 Example: Eigenfaces for Dimensionality Reduction
10.3 Principal Component Analysis and Factor Analysis
10.4 Scree Plots and Understanding Dimensionality
10.5 Factor Analysis
10.6 Limitations of PCA
10.7 Clustering
10.8 Further Reading
10.9 Glossary
11 Regression
11.1 Example: Predicting Diabetes Progression
11.2 Fitting a Line with Least Squares
11.3 Alternatives to Least Squares
11.4 Fitting Nonlinear Curves
11.5 Goodness of Fit: R² and Correlation
11.6 Correlation of Residuals
11.7 Linear Regression
11.8 LASSO Regression and Feature Selection
11.9 Further Reading
11.10 Glossary
12 Data Encodings and File Formats
12.1 Typical File Format Categories
12.2 CSV Files
12.3 JSON Files
12.4 XML Files
12.5 HTML Files
12.6 Tar Files
12.7 GZip Files
12.8 Zip Files
12.9 Image Files: Rasterized, Vectorized, and/or Compressed
12.10 It’s All Bytes at the End of the Day
12.11 Integers
12.12 Floats
12.13 Text Data
12.14 Further Reading
12.15 Glossary
13 Big Data
13.1 What Is Big Data?
13.2 When to Use – And Not Use – Big Data
13.3 Hadoop: The File System and the Processor
13.4 Example PySpark Script
13.5 Spark Overview
13.6 Spark Operations
13.7 PySpark Data Frames
13.8 Two Ways to Run PySpark
13.9 Configuring Spark
13.10 Under the Hood
13.11 Spark Tips and Gotchas
13.12 The MapReduce Paradigm
13.13 Performance Considerations
13.14 Further Reading
13.15 Glossary
14 Databases
14.1 Relational Databases and MySQL®
14.2 Key–Value Stores
14.3 Wide‐Column Stores
14.4 Document Stores
14.5 Further Reading
14.6 Glossary
15 Software Engineering Best Practices
15.1 Coding Style
15.2 Version Control and Git for Data Scientists
15.3 Testing Code
15.4 Test‐Driven Development
15.5 AGILE Methodology
15.6 Further Reading
15.7 Glossary
16 Traditional Natural Language Processing
16.1 Do I Even Need NLP?
16.2 The Great Divide: Language Versus Statistics
16.3 Example: Sentiment Analysis on Stock Market Articles
16.4 Software and Datasets
16.5 Tokenization
16.6 Central Concept: Bag‐of‐Words
16.7 Word Weighting: TF‐IDF
16.8 n‐Grams
16.9 Stop Words
16.10 Lemmatization and Stemming
16.11 Synonyms
16.12 Part of Speech Tagging
16.13 Common Problems
16.14 Advanced Linguistic NLP: Syntax Trees, Knowledge, and Understanding
16.15 Further Reading
16.16 Glossary
17 Time Series Analysis
17.1 Example: Predicting Wikipedia Page Views
17.2 A Typical Workflow
17.3 Time Series Versus Time‐Stamped Events
17.4 Resampling and Interpolation
17.5 Smoothing Signals
17.6 Logarithms and Other Transformations
17.7 Trends and Periodicity
17.8 Windowing
17.9 Brainstorming Simple Features
17.10 Better Features: Time Series as Vectors
17.11 Fourier Analysis: Sometimes a Magic Bullet
17.12 Time Series in Context: The Whole Suite of Features
17.13 Further Reading
17.14 Glossary
18 Probability
18.1 Flipping Coins: Bernoulli Random Variables
18.2 Throwing Darts: Uniform Random Variables
18.3 The Uniform Distribution and Pseudorandom Numbers
18.4 Nondiscrete, Noncontinuous Random Variables
18.5 Notation, Expectations, and Standard Deviation
18.6 Dependence, Marginal, and Conditional Probability
18.7 Understanding the Tails
18.8 Binomial Distribution
18.9 Poisson Distribution
18.10 Normal Distribution
18.11 Multivariate Gaussian
18.12 Exponential Distribution
18.13 Log‐Normal Distribution
18.14 Entropy
18.15 Further Reading
18.16 Glossary
19 Statistics
19.1 Statistics in Perspective
19.2 Bayesian Versus Frequentist: Practical Tradeoffs and Differing Philosophies
19.3 Hypothesis Testing: Key Idea and Example
19.4 Multiple Hypothesis Testing
19.5 Parameter Estimation
19.6 Hypothesis Testing: t‐Test
19.7 Confidence Intervals
19.8 Bayesian Statistics
19.9 Naive Bayesian Statistics
19.10 Bayesian Networks
19.11 Choosing Priors: Maximum Entropy or Domain Knowledge
19.12 Further Reading
19.13 Glossary
20 Programming Language Concepts
20.1 Programming Paradigms
20.2 Compilation and Interpretation
20.3 Type Systems
20.4 Further Reading
20.5 Glossary
21 Performance and Computer Memory
21.1 A Word of Caution
21.2 Example Script
21.3 Algorithm Performance and Big‐O Notation
21.4 Some Classic Problems: Sorting a List and Binary Search
21.5 Amortized Performance and Average Performance
21.6 Two Principles: Reducing Overhead and Managing Memory
21.7 Performance Tip: Use Numerical Libraries When Applicable
21.8 Performance Tip: Delete Large Structures You Don’t Need
21.9 Performance Tip: Use Built‐In Functions When Possible
21.10 Performance Tip: Avoid Superfluous Function Calls
21.11 Performance Tip: Avoid Creating Large New Objects
21.12 Further Reading
21.13 Glossary
Part III: Specialized or Advanced Topics
22 Computer Memory and Data Structures
22.1 Virtual Memory, the Stack, and the Heap
22.2 Example C Program
22.3 Data Types and Arrays in Memory
22.4 Structs
22.5 Pointers, the Stack, and the Heap
22.6 Key Data Structures
22.7 Further Reading
22.8 Glossary
23 Maximum‐Likelihood Estimation and Optimization
23.1 Maximum‐Likelihood Estimation
23.2 A Simple Example: Fitting a Line
23.3 Another Example: Logistic Regression
23.4 Optimization
23.5 Gradient Descent
23.6 Convex Optimization
23.7 Stochastic Gradient Descent
23.8 Further Reading
23.9 Glossary
24 Deep Learning and AI
24.1 A Note on Libraries and Hardware
24.2 A Note on Training Data
24.3 Simple Deep Learning: Perceptrons
24.4 What Is a Tensor?
24.5 Convolutional Neural Networks
24.6 Example: The MNIST Handwriting Dataset
24.7 Autoencoders and Latent Vectors
24.8 Generative AI and GANs
24.9 Diffusion Models
24.10 RNNs, Hidden State, and the Encoder–Decoder
24.11 Attention and Transformers
24.12 Stable Diffusion: Bringing the Parts Together
24.13 Large Language Models and Prompt Engineering
24.14 Further Reading
24.15 Glossary
25 Stochastic Modeling
25.1 Markov Chains
25.2 Two Kinds of Markov Chain, Two Kinds of Questions
25.3 Hidden Markov Models and the Viterbi Algorithm
25.4 The Viterbi Algorithm
25.5 Random Walks
25.6 Brownian Motion
25.7 ARIMA Models
25.8 Continuous‐Time Markov Processes
25.9 Poisson Processes
25.10 Further Reading
25.11 Glossary
26 Parting Words
Index
End User License Agreement
Field Cady
Second Edition
Copyright © 2025 by John Wiley & Sons, Inc. All rights reserved, including rights for text and data mining and training of artificial intelligence technologies or similar technologies.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging‐in‐Publication Data applied for:
Hardback ISBN: 9781394234493
Cover Design: Wiley
Cover Images: © alexlmx/Adobe Stock Photos, © da‐kuk/Getty Images
To my wife, Ryna. Thank you, honey, for your support and for always believing in me.
This book was written to solve a problem. The people who I interview for data science jobs have sterling mathematical pedigrees, but most of them are unable to write a simple script that computes Fibonacci numbers (in case you aren’t familiar with Fibonacci numbers, this takes about five lines of code). On the other side, employers tend to view data scientists as either mysterious wizards or used‐car salesmen (and when data scientists can’t be trusted to write a basic script, the latter impression has some merit!). These problems reflect a fundamental misunderstanding, by all parties, of what data science is (and isn’t) and what skills its practitioners need.
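For reference, here is one way such a script can look in Python (an illustrative sketch of mine, not code taken from the book’s repository):

def fibonacci(n):
    """Return the first n Fibonacci numbers as a list."""
    nums = [0, 1]
    while len(nums) < n:
        nums.append(nums[-1] + nums[-2])
    return nums[:n]

print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]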
When I first got into data science, I was part of that problem. Years of doing academic physics had trained me to solve problems in a way that was long on abstract theory but short on common sense or flexibility. Mercifully, I also knew how to code (thanks, Google™ internships!), and this let me limp along while I picked up the skills and mindsets that actually mattered.
Since leaving academia, I have done data science consulting for companies of every stripe. This includes web traffic analysis for tiny start‐ups, manufacturing optimizations for Fortune 100 giants, and everything in between. The problems to solve are always unique, but the skills required to solve them are strikingly universal. They are an eclectic mix of computer programming, mathematics, and business savvy. They are rarely found together in one person, but in truth they can be learned by anybody.
A few interviews I have given stand out in my mind. In each case, the candidate was smart and knowledgeable, but the interview made it painfully clear that they were unprepared for the daily work of a data scientist. What do you do as an interviewer when a candidate starts apologizing for wasting your time? We ended up filling the hour with a crash course on what they were missing and how they could go out and fill the gaps in their knowledge. They went out, learned what they needed to, and are now successful data scientists.
I wrote this book in an attempt to help people like that out, by condensing data science’s various skill sets into a single, coherent volume. It is hands‐on and to the point: ideal for somebody who needs to come up to speed quickly or solve a problem on a tight deadline. The educational system is still catching up to the demands of this new and exciting field, and my hope is that this book will help you bridge the gap.
Field Cady
September 2016
Redmond, Washington
In the first edition of this book, I called the introduction “Becoming a Unicorn.” Data science was a new field that was poorly understood, and data scientists were often called “unicorns” in reference to their miraculous ability to do both math and programming. I wrote the book with one central message: data science isn’t as inaccessible as people are making it out to be. It is perfectly reasonable for somebody to acquire the whole palette of skills required, and my book aspired to be a one‐stop‐shop for people to learn them.
A great deal has changed since then, and I’m delighted that the educational system has caught on. There are now degree programs and bootcamps that can teach the essentials of data science to most anybody who is willing to learn them. There are relatively standard curricula, fewer people who are baffled by the subject, and more young professionals embarking on this exciting career. Data science has gone from being an obscure priesthood to an exciting career that normal people can have.
As the discipline has expanded, the tools have also evolved, and I felt that a second edition was in order. By far the most important change I have made is more coverage of deep learning: previously I barely touched on RNNs, but now I continue up through topics such as encoder–decoder architectures, diffusion models, LLMs, and prompt engineering. AI tools are coming of age (perhaps AI is now where data science was 10 years ago) and a data scientist needs to be familiar with them. I have also updated my treatment of Spark to cover its new DataFrame interface, and reduced the emphasis on Hadoop since it is on the decline. Other changes include a reduced emphasis on Bayesian networks (which have waned in popularity with the rise of deep learning), a switch from Python 2 to Python 3, and numerous improvements to the prose.
Field Cady
Redmond, Washington
The goal of this book is to turn you into a data scientist, and there are two parts to this mission. First, there is a set of specific concepts, tools, and techniques that you can go out and solve problems with today. They include buzzwords such as machine learning (ML), Spark, and natural language processing (NLP). They also include concepts that are distinctly less sexy but often more useful, like regular expressions, unit tests, and SQL queries. It would be impossible to give an exhaustive list in any single book, but I cast a wide net.
That brings me to the second part of my goal. Tools are constantly changing, and your long‐term future as a data scientist depends less on what you know today and more on what you are able to learn going forward. To that end, I want to help you understand the concepts behind the algorithms and the technological fundamentals that underlie the tools we use. For example, this is why I spend a fair amount of time on computer memory and optimization: they are often the underlying reason that one approach is better than another. If you understand the key concepts, you can make the right trade‐offs, and you will be able to see how new ideas are related to older ones.
As the field evolves, data science is becoming not just a discipline in its own right, but also a skillset that anybody can have. The software tools are getting better and easier to use, best practices are becoming widely known, and people are learning many of the key skills in school before they’ve even started their career. There will continue to be data science specialists, but there is also a growing number of the so‐called “citizen data scientists” whose real job is something else. They are engineers, biologists, UX designers, programmers, and economists: professionals from all fields who have learned the techniques of data science and are fruitfully applying them to their main discipline.
This book is aimed at anybody who is entering the field. Depending on your background, some parts of it may be stuff you already know. Especially for citizen data scientists, other parts may be unnecessary for your work. But taken as a whole, this book will give you a practical skillset for today, and a solid foundation for your future in data science.
Despite the fact that “data science” is widely practiced and studied today, the term itself is somewhat elusive. So before we go any further, I’d like to give you the definition that I use. I’ve found that this one gets right to the heart of what sets it apart from other disciplines. Here goes:
Data science means doing analytically oriented work that, for one reason or another, requires a substantial amount of software engineering skills.
Often the final deliverable is the kind of thing a statistician or business analyst might provide, but achieving that goal demands software skills that your typical analyst simply doesn’t have – writing a custom parser for an obscure data format, keeping complex preprocessing logic in order, and so on. Other times the data scientist will need to write production software based on their insights, or perhaps make their model available in real time. Often the dataset itself is so large that just creating a pie chart requires that the work be done in parallel across a cluster of computers. And sometimes, it’s just a really gnarly SQL query that most people struggle to wrap their heads around.
Nate Silver, a statistician famous for accurate forecasting of US elections, once said: “I think data scientist is a sexed‐up term for statistician.” He has a point, but what he said is only partly true. The discipline of statistics deals mostly with rigorous mathematical methods for solving well‐defined problems; data scientists spend most of their time getting data and the problem into a form where statistical methods can even be applied. This involves making sure that the analytics problem is a good match to business objectives, choosing what to measure and how to quantify things (typically more the domain of a BI analyst), extracting meaningful features from the raw data, and coping with any pathologies of the data or weird edge cases (which often requires a level of coding more typical of a software engineer). Once that heavy lifting is done, you can apply statistical tools to get the final results – although, in practice, you often don’t even need them. Professional statisticians need to do a certain amount of preprocessing themselves, but there is a massive difference in degree.
Historically, statistics focused on rigorous methods to analyze clean datasets, such as those that come out of controlled experiments in medicine and agriculture. Often the data was gathered explicitly to support the statisticians’ analysis! In the 2000s, though, a new class of datasets became popular to analyze. “Big Data” used new cluster computing tools to study large, messy, heterogeneous datasets of the sort that would make statisticians shudder: HTML pages, image files, e‐mails, raw output logs of web servers, and so on. These datasets don’t fit the mold of relational databases or statistical tools, and they were not designed to facilitate any particular statistical analysis; so for decades, they were just piling up without being analyzed. Data science came into being as a way to finally milk them for insights. Most of the first data scientists were computer programmers or ML experts who were working on Big Data problems, not statisticians in the traditional sense.
The lines have now blurred: statisticians do more coding than they used to, Big Data tools are less central to the work of a data scientist, and ML is used by a broad swath of people. And this is healthy: the differences between these fields are, after all, really just a matter of degree and/or historical accident. But, in practical terms, “data scientists” are still the jacks‐of‐all‐trades in the middle. They can do statistics, but if you’re looking to tease every last insight out of clinical trial data, you should consult a statistician. They can train and deploy ML models, but if you’re trying to eke performance out of a large neural network, an ML engineer would be better. They can turn business questions into math problems, but they may not have the deep business knowledge of an analyst.
There is a common theme in this book that I would like to call out as the book’s explicit motto: simple models are easier to work with. Let me explain.
People tend to idolize and gravitate toward complicated analytical models like deep neural nets, Bayesian networks, ARIMA models, and the like. There are good reasons to use these tools: the best‐performing models in the world are usually complicated, there may be fancy ways to bake in expert knowledge, and so on. There are also bad reasons to use these tools, like ego and pressure to use the latest buzzwords.
But seasoned data scientists understand that there is more to a model than how accurate it is. Simple models are, above all, easier to reason about. If you’re trying to understand what patterns in the data your model is picking up on, simple models are the way to go. Oftentimes this is the whole point of a model anyway: we are just trying to get insights into the system we are studying, and a model’s performance is just used to gauge how fully it has captured the relevant patterns in the data.
A related advantage of simple models is supremely mundane: stuff breaks, and they make it easier to find what’s broken. Bad training data, perverse inputs to the model, and data that is incorrectly formatted – all of these are liable to cause conspicuous failures, and it’s easy to figure out what went wrong by dissecting the model. For this reason, I like “stunt double models,” which have the same input/output format as a complicated one and are used to debug the model’s integration with other systems.
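To make the idea concrete, here is a minimal sketch of such a stand-in, using scikit-learn’s DummyClassifier on made-up toy data (my choice of illustration, not a prescription from the text):

import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data standing in for whatever the real model consumes.
X_train = np.random.rand(100, 4)
y_train = np.random.randint(0, 2, size=100)

# The "stunt double": same fit/predict interface as the real classifier,
# but trivial internals, so failures point at the plumbing rather than the model.
stunt_double = DummyClassifier(strategy="most_frequent")
stunt_double.fit(X_train, y_train)
print(stunt_double.predict(np.random.rand(5, 4)))  # same output format as the real model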
Simple models are less prone to overfitting. If your dataset is small, a fancy model will often actually perform worse: it essentially memorizes the training data, rather than extracting general patterns from it. The simpler a model, the less you have to worry about the size of your dataset (though admittedly this can create a square‐peg‐in‐a‐round‐hole situation where the model can’t fit the data well and performance degrades).
Simple models are easier to hack and jury‐rig. Frequently they have a small number of tunable parameters, with clear meanings that you can adjust to suit the business needs at hand.
The inferior performance of simple models can act as a performance benchmark, a level that the fancier model must meaningfully exceed in order to justify its extra complexity. And if a simple model performs particularly badly, this may suggest that there isn’t enough signal in the data to make the problem worthwhile.
On the other hand, when there is enough training data and it is representative of what you expect to see, fancier models do perform better. You usually don’t want to leave money on the table by deploying grossly inferior models simply because they are easier to debug. And there are many situations, like cutting‐edge AI, where the relevant patterns are very complicated, and it takes a complicated model to accurately capture them. Even in these cases though, it is often possible to keep the complexity modular and, hence, easier to reason about. For example, say we are choosing which ads to show to which customer. Instead of directly predicting the click‐rate for various ads and picking the best one, we might have a very complex model that assigns the person to some pre‐existing user segments, and then a simple model that shows them ads based on the segments they are in. This model will be easier to debug and much more scalable.
Model complexity is an area that requires critical thinking and flexibility. Simple models are often good enough for the problem at hand, especially in situations where training data is limited anyway. When more complexity is justified, it is often buttressed by an army of simple models that tackle various subproblems (like various forms of cleaning and labeling the training data). Simple models are easier to work with, but fancy ones sometimes give better performance: technical and data considerations tell you the constraints, and business value should guide the ultimate choice.
This book is organized into three sections. The first, The Stuff You’ll Always Use, covers topics that, in my experience, you will end up using in almost any data science project. They are core skills, which are absolutely indispensable for data science at any level.
The first section was also written with an eye toward people who need data science to answer a specific question but do not aspire to become full‐fledged data scientists. If you are in this camp, then there is a good chance that Part I of the book will give you everything you need.
The second section, Stuff You Still Need to Know, covers additional core skills for a data scientist. Some of these, such as clustering, are so common that they almost made it into the first section, and they could easily play a role in any project. Others, such as NLP, are somewhat specialized subjects that are critical in certain domains but superfluous in others. In my judgment, a data scientist should be conversant in all of these subjects, even if they don’t always use them all.
The final section, Specialized or Advanced Topics, covers a variety of topics that are optional. Some of these chapters are just expansions on topics from the first two sections, but they give more theoretical background and discuss some additional topics. Others are entirely new material, which does come up in data science, but which you could go through a career without ever running into.
This book was written with three use cases in mind:
You can read it cover‐to‐cover. If you do that, it should give you a self‐contained course in data science that will leave you ready to tackle real problems. If you have a strong background in computer programming, or in mathematics, then some of it will be review.
You can use it to come quickly up to speed on a specific subject. I have tried to make the different chapters pretty self‐contained, especially the chapters after the first section.
The book contains a lot of sample code, in pieces that are large enough to use as a starting point for your own projects.
The example code in this book is all in Python, except for a few domain‐specific languages such as SQL. My goal isn’t to push you to use Python; there are lots of good tools out there, and you can use whichever ones you want.
However, I wanted to use one language for all of my examples, which lets readers follow the whole book while only knowing one language. Of the various languages available, there are two reasons why I chose Python:
Python is without question the most popular language for data scientists. R is its only major competitor, at least when it comes to free tools. I have used both extensively, and I think that Python is flat‐out better (except for some obscure statistics packages that have been written in R and that are rarely needed anyway).
I like to say that Python is the second‐best language for any task. It’s a jack‐of‐all‐trades. If you only need to worry about statistics, or numerical computation, or web parsing, then there are better options out there. But if you need to do all of these things within a single project, then Python is your best bet. Since data science is so inherently multidisciplinary, this makes it a perfect fit.
As a note of advice, it is much better to be proficient in one language, to the point where you can reliably churn out code that is of high quality, than to be mediocre at several.
This book is rich in example code, in fairly long chunks. This was done for two reasons:
As a data scientist, you need to be able to read longish pieces of code. This is a non‐optional skill, and if you aren’t used to it, then this will give you a chance to practice.
I wanted to make it easier for you to poach the code from this book, if you feel so inclined.
You can do whatever you want with the code, with or without attribution. I release it into the public domain in the hope that it can give some people a small leg up. You can find it on my GitHub page at www.github.com/field-cady.
The sample data that I used comes in two forms:
Test datasets that are built into Python’s scientific libraries (loading one is sketched just after this list).
Data that is pulled off the Internet, from sources such as Yahoo and Wikipedia. When I do this, the example scripts will include code that pulls the data.
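For instance, the built-in datasets require no download at all. A quick sketch of loading one (illustrative only, not one of the book’s own example scripts):

from sklearn import datasets

# One of the small datasets that ships with scikit-learn; no download needed.
iris = datasets.load_iris()
print(iris.data.shape)      # (150, 4): one row per flower, one column per measurement
print(iris.feature_names)   # what those columns mean
print(iris.target[:5])      # species labels for the first five rows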
It is my hope that this book not only teaches you how to do nuts‐and‐bolts data science but also gives you a feel for how exciting this deeply interdisciplinary subject is. Please feel free to reach out to me at www.fieldcady.com or [email protected] with comments, errata, or any other feedback.
The first section of this book covers core topics that everybody doing data science should know. This includes people who are not interested in being professional data scientists, but need to know just enough to solve some specific problem. These are the subjects that will likely arise in every data science project you do.
In this chapter, I will give you a high‐level overview of the process of data science. I will focus on the different stages of data science work, including common pain points, key things to get right, and where data science parts ways from other disciplines.
The process of solving a data science problem is summarized in the following figure, which I call the Data Science Road Map.
The first step is always to frame the problem: understand the business use case and craft a well‐defined analytics problem (or problems) out of it. This is followed by an extensive stage of grappling with the data and the real‐world things that it describes, so that we can extract meaningful features. Finally, these features are plugged into analytical tools that give us hard numerical results.
Before I go into more detail about the different stages of the roadmap, I want to point out two things.
The first is that “Model and Analyze” loops back to framing the problem. This is one of the key features of data science that differentiate it from traditional software engineering. Data scientists write code, and they use many of the same tools as software engineers. However, there is a tight feedback loop between data science work and the real world. Questions are always being reframed as new insights become available. As a result, data scientists must keep their code base extremely flexible and always have an eye toward the real‐world problem they are solving. Often you will follow the loop back many times, constantly refining your methods and producing new insights.
The second point is that there are two different (although not mutually exclusive) ways to exit the road map: presenting results and deploying code. My friend Michael Li, a data scientist who founded The Data Incubator, likened this to having two different types of clients: humans and machines. They require distinct skill sets and modifications to every stage of the data science road map.
If your clients are humans, then usually you are trying to use available data sources to answer some kind of business problem. Examples would be the following:
Identifying leading indicators of spikes in the price of a stock, so that people can understand what causes price spikes.
Determining whether customers break down into natural subtypes and what characteristics each type has.
Assessing whether traffic to one website can be used to predict traffic to another site.
Typically, the final deliverable for work such as this will be a PowerPoint slide deck or a written report. The goal is to give business insights, and often these insights will be used for making key decisions. This kind of data science also functions as a way to test the waters and see whether some analytics approach is worth a larger follow‐up project that may result in production software.
If your clients are machines, then you are doing something that blends into software engineering, where the deliverable is a piece of software that performs some analytics work. Examples would be the following:
Implementing the algorithm that chooses which ad to show to a customer and training it on real data.
Writing a batch process that generates daily reports based on company records generated that day, using some kind of analytics to point out salient patterns.
In these cases, your main deliverable is a piece of software. In addition to performing a useful task, it had better work well in terms of performance, robustness to bad inputs, and so on.
Once you understand who your clients are, the next step is to determine what you’ll be doing for them. In the next section, I will show you how to do this all‐important step.
The difference between great and mediocre data science is not about math or engineering: it is about asking the right question(s). Alternatively, if you’re trying to build some piece of software, you need to decide what exactly that software should do. No amount of technical competence or statistical rigor can make up for having solved a useless problem.
If your clients are humans, most projects start with some kind of extremely open‐ended question. Perhaps there is a known pain point, but it’s not clear what a solution would look like. If your clients are machines, then the business problem is usually pretty clear, but there can be a lot of ambiguity about what constraints there might be on the software (languages to use, runtime, how accurate predictions need to be, etc.). Before diving into actual work, it’s important to clarify exactly what would constitute a solution to this problem. A “definition of done” is a good way to put it: what criteria constitute a completed project, and (most importantly) what would be required to make the project a success?
For large projects, these criteria are often laid out in a document. Writing that document is a collaborative process involving a lot of back‐and‐forth with stakeholders, negotiation, and sometimes disagreement. In consulting, these documents are often called “statements of work” or SOWs. Within a company that is creating a product (as opposed to just a stand‐alone investigation), they are often referred to as “project requirements documents” or PRDs.
The main purpose of an SOW is to get everybody on the same page about exactly what work should be done, what the priorities are, and what expectations are realistic. Business problems are typically very vague to start off with, and it takes a lot of time and effort to follow a course of action through to the final result. So before investing that effort, it is critical to make sure that you are working on the right problem. Crafting the SOW will often include a range of one‐off analyses that gauge which avenues are promising enough to commit resources to.
There is, however, also an element of self‐defense. Sometimes it ends up being impossible to solve a problem with the available data, or maybe stakeholders decide that the project isn’t important anymore. A good SOW keeps everybody honest in case things don’t work out: everybody agrees up‐front that this looks like it will be both valuable and feasible.
Having an SOW doesn’t set things in stone. There are course corrections based on preliminary discoveries. Sometimes, people change their minds after the SOW has been signed. It happens. But, crafting an SOW is the best way to make sure that all efforts are pointed in the most useful direction.
Once you have access to the data you’ll be using, it’s good to have a battery of standard questions that you always ask about it. This is a good way to hit the ground running with your analyses, rather than risk analysis paralysis. It is also a good safeguard to identify problems with the data as quickly as possible.
A few good generic questions to ask are as follows (a short code sketch of several of these checks appears after the list):
How big is the dataset? Is this the entire dataset or just a sample? If it’s just a sample, do we know how the sampling was done?
Is this data representative enough? For example, maybe data was only collected for a subset of users.
Are there likely to be gross outliers or extraordinary sources of noise? For example, 99% of the traffic from a web server might be a single denial‐of‐service attack.
Are there likely to be heavy tails? For example, the vast majority of web traffic might go to only a few sites, and if those sites are over‐ or under‐represented in a sample you took, your metrics might be misleading.
Might there be artificial data inserted into the dataset? This happens a lot in industrial settings.
Are there any fields that are unique identifiers? These are the fields you might use for joining between datasets, etc. Make sure that unique ID fields are actually unique – they often aren’t.
If there are two datasets A and B that need to be joined, what does it mean if something in A doesn’t match anything in B?
When data entries are blank, where do the blanks come from?
How common are blank entries?
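Many of these checks take only a few lines with pandas. The snippet below is a sketch on toy data; the column names and values are invented for illustration:

import pandas as pd

# Toy stand-in for a real dataset; in practice df would come from a file or a database.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "amount": [9.99, 250000.0, 12.50, None],
})

print(len(df))                       # how big is the dataset?
print(df["customer_id"].is_unique)   # supposedly unique IDs often are not (False here)
print(df.isna().mean())              # fraction of blank entries per column
print(df["amount"].describe())       # a quick look for gross outliers and heavy tails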
The most important question to ask about the data is whether it can solve the business problem that you are trying to tackle. If not, then you might need to look into additional sources of data or modify the work that you are planning.
Speaking from personal experience, I have been inclined to neglect these preliminary questions. I am excited to get into the actual analysis, so I’ve sometimes jumped right in without taking the time to make sure that I know what I’m doing. For example, I once had a project where there was a collection of motors and time series data monitoring their physical characteristics: one time series per motor. My job was to find leading indicators of failure, and I started doing this by comparing the last day’s worth of time series for a given motor (i.e., the data taken right before it failed) against its previous data. Well, I realized a couple of weeks in that sometimes the time series stopped long before the motor actually failed, and, in other cases, the time series data continued long after the motor was dead. The actual times the motors had died were listed in a separate table, and it would have been easy for me to double‐check early on that they corresponded to the ends of the time series.
Data wrangling is the process of getting the data from its raw format into something suitable for more conventional analytics. This typically means creating a software pipeline that gets the data out of wherever it is stored, does any cleaning or filtering necessary, and puts it into a regular format.
Data wrangling is the main area where data scientists need skills that a traditional statistician or analyst doesn’t have. The data is often stored in a special‐purpose database that requires specialized tools to access. There could be so much of it that Big Data techniques are required to process it. You might need to use performance tricks to make things run quickly. Especially with messy data, the preprocessing pipelines are often so complex that it is very difficult to keep the code organized.
Speaking of messy data, I should tell you this upfront: industrial datasets are always more convoluted than you would think they reasonably should be. The question is not whether the problems exist but whether they impact your work. My recipe for figuring out how a particular dataset is broken includes the following (a code sketch of one such check appears after the list):
If the raw data is text, look directly at the plain files in a text editor or something similar. Things such as irregular date formats, irregular capitalizations, and lines that are clearly junk will jump out at you.
If there is a tool that is supposed to be able to open or process the data, make sure that it can actually do it. For example, if you have a CSV file, try opening it in something that reads data frames. Did it read all the rows in? If not, maybe some rows have the wrong number of entries. Did the column that is supposed to be a datetime get read in as a datetime? If not, then maybe the formatting is irregular.
Do some histograms and scatterplots. Are these numbers realistic, given what you know about the real‐life situation? Are there any massive outliers?
Take some simple questions that you already know the (maybe approximate) answer to, answer them based on this data, and see if the results agree. For example, you might try to calculate the number of customers by counting how many unique customer IDs there are. If these numbers don’t agree, then you’ve probably misunderstood something about the data.
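As an example of the second check, here is a sketch using a small, deliberately broken CSV that I made up; the junk line shows up as a date that fails to parse:

import io
import pandas as pd

# A toy CSV standing in for a real file; the last line is clearly junk.
raw = io.StringIO(
    "date,temperature\n"
    "2024-01-01,31.2\n"
    "2024-01-02,30.8\n"
    "### SENSOR REBOOT ###,\n"
)
df = pd.read_csv(raw)

print(len(df))                                           # did all the rows come in?
df["date"] = pd.to_datetime(df["date"], errors="coerce")
print(df["date"].isna().sum())                           # rows whose dates failed to parse (1 here)
print(df["temperature"].describe())                      # sanity-check the numbers themselves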
Once you have the data digested into a usable format, the next step is exploratory analysis. This basically means poking around in the data, visualizing it in lots of different ways, trying out different ways to transform it, and seeing what there is to see. This stage is very creative, and it’s a great place to let your curiosity run a little wild. Feel free to calculate some correlations and similar metrics, but don’t break out the fancy machine learning classifiers. Keep things simple and intuitive.
There are two things that you typically get out of exploratory analysis:
You develop an intuitive feel for the data, including what the salient patterns look like visually. This is especially important if you’re going to be working with similar data a lot in the future. This also helps ferret out pathologies in the data that weren’t found earlier.
You get a list of concrete hypotheses about what’s going on in the data. Oftentimes, a hypothesis will be motivated by a compelling graphic that you generated: a snapshot of a time series that shows an unmistakable pattern, a scatterplot demonstrating that two variables are related to each other, or a histogram that is clearly bimodal.
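As a sketch of what this stage can look like in code, here is a minimal example on synthetic data (invented purely for illustration):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data standing in for a wrangled dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=500)})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=500)

df.hist(bins=30)                   # distributions: skew, outliers, bimodality
pd.plotting.scatter_matrix(df)     # pairwise relationships at a glance
print(df.corr())                   # numbers to go along with the pictures
plt.show()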
A common misconception is that data scientists don’t need visualizations. This attitude is not only inaccurate: it is very dangerous. Most machine learning algorithms are not inherently visual, but it is very easy to misinterpret their outputs if you look only at the numbers. There is no substitute for the human eye when it comes to making intuitive sense of things.
This stage has a lot of overlap with exploratory analysis and data wrangling. A feature is really just a number or a category that is extracted from your data and describes some entity. For example, you might extract the average word length from a text document or the number of characters in the document. Or, if you have temperature measurements, you might extract the average temperature for a particular location.
In practical terms, feature extraction means taking your raw datasets and distilling them down into a table with rows and columns. This is called “tabular data.” Each row corresponds to some real‐world entity, and each column gives a single piece of information (generally a number) that describes that entity. Virtually all analytics techniques, from lowly scatterplots to fancy neural networks, operate on tabular data.
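For instance, the word-length and character-count features mentioned above can be pulled into a table like this (a toy sketch; the documents and feature names are invented):

import pandas as pd

# Toy raw data: a few text documents.
documents = {
    "doc1": "the quick brown fox",
    "doc2": "jumps over the lazy dog",
    "doc3": "hello world",
}

def extract_features(text):
    words = text.split()
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }

# One row per document, one column per feature: tabular data.
features = pd.DataFrame({name: extract_features(text) for name, text in documents.items()}).T
print(features)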
Extracting good features is the most important thing for getting your analysis to work. It is much more important than good machine‐learning classifiers, fancy statistical techniques, or elegant code. Especially if your data doesn’t come with readily available features (as is the case with web pages, images, etc.), how you reduce it to numbers will make the difference between success and failure.
Feature extraction is also the most creative part of data science and the one most closely tied to domain expertise. Typically, a really good feature will correspond to some real‐world phenomenon. Data scientists should work closely with domain experts and understand what these phenomena mean and how to distill them into numbers.
Sometimes, there is also room for creativity as to what entities you are extracting features about. For example, let’s say that you have a bunch of transaction logs, each of which gives a person’s name and e‐mail address. Do you want to have one row per human or one row per e‐mail address? For many real‐world situations, you want one row per human (in which case, the number of unique e‐mail addresses they have might be a good feature to extract!), but that opens the very thorny question of how you can tell when two people are the same based on their names.
Most features that we extract will be used to predict something. However, you may also need to extract the thing that you are predicting, which is also called the target variable. For example, I was once tasked with predicting whether my client’s customers would lose their brand loyalty. There was no “loyalty” field in the data: it was just a log of various customer interactions and transactions. I had to figure out a way to measure “loyalty.”
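One common way to construct such a label, sketched here purely hypothetically (it is not the definition used in that project), is to flag customers who have not transacted for some cutoff period:

import pandas as pd

# Hypothetical transaction log; the 90-day cutoff is an arbitrary choice for illustration.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "date": pd.to_datetime([
        "2024-01-05", "2024-06-20", "2024-02-01",
        "2024-05-10", "2024-06-01", "2024-06-25",
    ]),
})

as_of = pd.Timestamp("2024-07-01")
last_seen = transactions.groupby("customer_id")["date"].max()
lost_loyalty = (as_of - last_seen) > pd.Timedelta(days=90)   # the target variable
print(lost_loyalty)    # customer 2 is flagged; customers 1 and 3 are not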
Once features have been extracted, most data science projects involve some kind of machine‐learning model. Maybe this is a classifier that guesses whether a customer is still loyal, a regression model that predicts a stock’s price on the next day, or a clustering algorithm that breaks customers into different segments.
In many data science projects, the modeling stage is quite simple: you just take a standard suite of models, plug your data into each one of them, and see which one works best. In other cases, a lot of care is taken to tune a model and eke out every last bit of performance.
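The “standard suite” approach can be as short as the following sketch, using a built-in dataset and a few scikit-learn classifiers; the particular models and settings are illustrative, not a recommendation:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A small suite of off-the-shelf classifiers, all scored with the same 5-fold cross-validation.
models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))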
Really, analyzing your results should happen at every stage of a data science project, but it becomes especially crucial for the outputs of the modeling stage. If you have identified different clusters, what do they correspond to? Does your classifier work well enough to be useful? Is there anything interesting about the cases in which it fails?
This stage is what allows for course corrections in a project and gives ideas for what to do differently if there is another iteration.
If your client is a human, it is common to use a variety of models, tuned in different ways, to examine different aspects of your data. If your client is a machine though, you will probably need to zero in on a single, canonical model that will be used in production.
If your client is a human, then you will probably have to give either a slide deck or a written report describing the work you did and what your results were. You are also likely to have to do this even if your main clients are machines.
Communication in slide decks and prose is a difficult, important skill set in itself. But it is especially tricky with data science, where the material you are communicating is highly technical and you are presenting to a broad audience. Data scientists must communicate fluidly with business stakeholders, domain experts, software engineers, and business analysts. These groups tend to have different knowledge bases coming in, different things they will be paying attention to, and different presentation styles to which they are accustomed.
I can’t emphasize enough the fact that your numbers and figures should be reproducible. There is nothing worse than getting probing questions about a graphic that you can’t answer because you don’t have a record of exactly how it was generated.
If your ultimate clients are computers, then it is your job to produce code that will be run regularly in the future by other people. Typically, this falls into one of two categories:
Batch analytics code.
This will be used to redo an analysis similar to the one that has already been done, on data that will be collected in the future. Sometimes, it will produce some human‐readable analytics reports. Other times, it will train a statistical model that will be referenced by other code.
Real‐time code.
This will typically be an analytical module in a larger software package, written in a high‐performance programming language and adhering to all the best practices of software engineering.
There are three typical deliverables from this stage:
The code itself, often baked into a Docker container or something similar. Containerization allows the data scientist to have responsibility for the code itself, while engineers handle the system that it plugs into.
Some documentation of how to run the code. Sometimes, this is a stand‐alone work document, often called a “run book.” Other times, the documentation is embedded in the code.
Usually, you need some way of testing the code that ensures it operates correctly. For real‐time code, this will normally take the form of unit tests. For batch processes, it is sometimes a sample input dataset (designed to illustrate all the relevant edge cases) along with what the output should look like.
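For the unit-test case, the deliverable can be as small as a file like the following pytest-style sketch (the tiny feature extractor is defined inline to keep the example self-contained):

# test_features.py: a minimal pytest-style unit test.
# In a real project, extract_features would be imported from your own module.

def extract_features(text):
    words = text.split()
    return {"n_chars": len(text), "n_words": len(words)}

def test_counts_words_and_characters():
    result = extract_features("hello world")
    assert result["n_words"] == 2
    assert result["n_chars"] == 11

def test_single_word_document():
    assert extract_features("hi")["n_words"] == 1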
In deploying code, data scientists often take on a dual role as full‐fledged software engineers. Especially with very intricate algorithms, it often just isn’t practical to have one person spec it out and another implement the same thing for production.
Data science is a deeply iterative process, even more so than typical software engineering. This is because in software you generally have a pretty good idea what you’re aiming to create, even if you take an iterative approach to implementing it. But, in data science, it is usually an open question of what features will end up being useful to extract and what model you will train. For this reason, the data science process should be built around the goal of being able to change things painlessly.
My recommendations are as follows:
Try to get preliminary results as quickly as possible after you’ve understood the data. A scatterplot or histogram that shows you that there is a clear pattern in the data. Maybe a simple model based on crude preliminary features that nonetheless works. Sometimes an analysis is doomed to failure, because there just isn’t much signal in the data. If this is the case, you want to know sooner rather than later, so that you can change your focus.
Automate relentlessly: put your analysis into a single script or notebook so that it’s easy to run the whole thing at once. This is a point that I’ve learned the hard way: it is really, really easy after several hours at the command line to lose track of exactly what processing you did to get your data into its current form. Keep things reproducible from the beginning.
Keep your code modular and broken out into clear stages. This makes it easy to modify, add in, and take out steps as you experiment.
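A skeleton that follows these recommendations might look like the following; the stage names are placeholders to fill in with your own logic:

# analysis.py: one reproducible script, broken into clear stages.

def load_data():
    ...                     # pull the raw data from wherever it lives

def clean_data(raw):
    ...                     # wrangling: fix formats, drop junk rows, and so on

def extract_features(clean):
    ...                     # distill the data into a table of features

def fit_model(features):
    ...                     # train whatever model is currently being tried

def report(model, features):
    ...                     # figures and numbers, regenerated from scratch every run

def main():
    raw = load_data()
    clean = clean_data(raw)
    features = extract_features(clean)
    model = fit_model(features)
    report(model, features)

if __name__ == "__main__":
    main()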
Notice how much of this comes down to considerations of software, not analytics. The code must be flexible enough to solve all manner of problems, powerful enough to do it efficiently, and comprehensible enough to edit quickly if objectives change. Doing this requires that data scientists use flexible, powerful programming languages, which I will discuss in the next chapter.
Data wrangling
The nitty‐gritty task of cleaning data and getting it into a standard format that is suitable for downstream analysis.
Exploratory analysis
A stage of analysis that focuses on exploring the data to generate hypotheses about it. Exploratory analysis relies heavily on visualizations.
Feature
A small piece of data, usually a number or a label, that is extracted from your data and characterizes some entity in your dataset.
Product requirements document (PRD)
A document that specifies exactly what functionality a planned product should have.
Production code
Software that is run repeatedly and maintained. It especially refers to the source code of a software product that is distributed to other people.
Statement of work (SOW)
A document that specifies what work is to be done in a project, relevant timelines, and specific deliverables.
Target variable
A feature that you are trying to predict in machine learning. Sometimes, it is already in your data, and other times, you must construct it yourself.