Data Science - Field Cady - E-Book


Field Cady

Description

Tap into the power of data science with this comprehensive resource for non-technical professionals. Data Science: The Executive Summary - A Technical Book for Non-Technical Professionals is a comprehensive resource for people in non-engineer roles who want to fully understand data science and analytics concepts. Accomplished data scientist and author Field Cady describes both the "business side" of data science, including what problems it solves and how it fits into an organization, and the technical side, including analytical techniques and key technologies. Data Science: The Executive Summary covers topics like:

* Assessing whether your organization needs data scientists, and what to look for when hiring them
* When Big Data is the best approach to use for a project, and when it actually ties analysts' hands
* Cutting-edge Artificial Intelligence, as well as classical approaches that work better for many problems
* How many techniques rely on dubious mathematical idealizations, and when you can work around them

Perfect for executives who make critical decisions based on data science and analytics, as well as managers who hire and assess the work of data scientists, Data Science: The Executive Summary also belongs on the bookshelves of salespeople and marketers who need to explain what a data analytics product does. Finally, data scientists themselves will improve their technical work with insights into the goals and constraints of the business situation.

You can read this e-book in the Legimi apps on Android and iOS, and on Legimi-certified e-readers.

Page count: 370

Year of publication: 2020




Table of Contents

Cover

Data Science: The Executive Summary

Copyright

Dedication

1 Introduction

1.1 Why Managers Need to Know About Data Science

1.2 The New Age of Data Literacy

1.3 Data‐Driven Development

1.4 How to Use this Book

2 The Business Side of Data Science

2.1 What Is Data Science?

2.2 Data Science in an Organization

2.3 Hiring Data Scientists

2.4 Management Failure Cases

3 Working with Modern Data

3.1 Unstructured Data and Passive Collection

3.2 Data Types and Sources

3.3 Data Formats

3.4 Databases

3.5 Data Analytics Software Architectures

Notes

4 Telling the Story, Summarizing Data

4.1 Choosing What to Measure

4.2 Outliers, Visualizations, and the Limits of Summary Statistics: A Picture Is Worth a Thousand Numbers

4.3 Experiments, Correlation, and Causality

4.4 Summarizing One Number

4.5 Key Properties to Assess: Central Tendency, Spread, and Heavy Tails

4.6 Summarizing Two Numbers: Correlations and Scatterplots

4.7 Advanced Material: Fitting a Line or Curve

4.8 Statistics: How to Not Fool Yourself

4.9 Advanced Material: Probability Distributions Worth Knowing

5 Machine Learning

5.1 Supervised Learning, Unsupervised Learning, and Binary Classifiers

5.2 Measuring Performance

5.3 Advanced Material: Important Classifiers

5.4 Structure of the Data: Unsupervised Learning

5.5 Learning as You Go: Reinforcement Learning

6 Knowing the Tools

6.1 A Note on Learning to Code

6.2 Cheat Sheet

6.3 Parts of the Data Science Ecosystem

6.4 Advanced Material: Database Query Crash Course

7 Deep Learning and Artificial Intelligence

7.1 Overview of AI

7.2 Neural Networks

7.3 Natural Language Processing

7.4 Knowledge Bases and Graphs

Postscript

Index

End User License Agreement

List of Tables

Chapter 2

Table 2.1 Data science work can largely be divided into producing human‐under...

Table 2.2 Data engineers specialize in creating software systems to store and...

Table 2.3 BI analysts generally lack the ability to create mathematically com...

Table 2.4 Software engineers create products of a scale and complexity far gr...

Chapter 6

Table 6.1 These functions – which are present in most SQL‐like languages – ta...

Table 6.2 Common SQL aggregation functions.

Chapter 7

Table 7.1 Features of regular expressions.

List of Illustrations

Chapter 2

Figure 2.1 The process of data science is deeply iterative, with the questio...

Chapter 4

Figure 4.1 Anscombe's quartet is a famous demonstration of the limitations o...

Figure 4.2 Mean, median, and mode are the most common measures of central te...

Figure 4.3 Box‐and‐whisker plots capture the median, the 25% and 75% percent...

Figure 4.4 Box‐and‐whisker plots allow you to visually compare several data ...

Figure 4.5 The histograms of two datasets, plotted for comparison on (a) a n...

Figure 4.6 In both of these plots the correlation between x and y will be cl...

Figure 4.7 This dataset will have ordinal correlation of 1, since y consiste...

Figure 4.8 Residuals measure the accuracy of a model. Here the gray points a...

Figure 4.9 A degenerative form of “curve fitting” is used as a base of compa...

Figure 4.10 Large residuals can come from two sources: either that data we a...

Figure 4.11 The most intuitive way to think of a probability distribution is...

Figure 4.12 The area under the curve of a continuous probability distribution is...

Figure 4.13 The Bernoulli distribution is just the flipping of a biased coin...

Figure 4.14 The uniform distribution gives constant probability density ov...

Figure 4.15 The normal distribution, aka Gaussian, is the prototypical “bell...

Figure 4.16 The exponential distribution is often used to estimate the lengt...

Figure 4.17 Say there are many independent events that could happen (there a...

Chapter 5

Figure 5.1 K‐fold cross‐validation breaks the dataset into k partitions. Eac...

Figure 5.2 The performance of a classifier can't really be boiled down to a ...

Figure 5.3 The ROC curve plots the true/false positive rate for a classifier...

Figure 5.4 For this cutoff the fraction of all 0s that get incorrectly flagg...

Figure 5.5 For this cutoff a small change in your classification threshold w...

Figure 5.6 In a lift curve the x‐axis (the “reach”) is the fraction of all d...

Figure 5.7 A decision tree classifier is somewhat like a flow chart. Every n...

Figure 5.8 Support vector machines look for a hyperplane that divides your t...

Figure 5.9 The key weakness of support vector machines is that often there i...

Figure 5.10 Sometimes you can fix the linear separability problem by mapping...

Figure 5.11 The Sigmoid function shows up many places in machine learning. E...

Figure 5.12 A perceptron is a neural network with a single hidden layer.

Figure 5.13 The “curse of dimensionality” describes how high‐dimensional spa...

Figure 5.14 If many fields in your data move in lock‐step then in a sense th...

Figure 5.15 A Scree plot shows how much of a dataset's variability is accoun...

Figure 5.16 The “clusters” identified by k‐means clustering are really just ...

Figure 5.17 The indicated point is closer to the middle of the other cluster...

Chapter 6

Figure 6.1 The map‐reduce paradigm is one of the building blocks of the Big ...

Chapter 7

Figure 7.1 A neural network consists of “nodes” arranged into “layers.” Each...

Figure 7.2 Convolutional neural networks are stars in image processing. The ...

Guide

Cover Page

Title Page

Copyright

Table of Contents

Begin Reading

Postscript

Index

WILEY END USER LICENSE AGREEMENT


Data Science: The Executive Summary

A Technical Book for Non-Technical Professionals

Field Cady

 

 

 

 

 

 

Copyright

This edition first published 2021

© 2021 by John Wiley & Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Field Cady to be identified as the author of this work has been asserted in accordance with law.

Registered Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office

111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication Data

Names: Cady, Field, 1984‐ author.

Title: Data science : the executive summary : a technical book for non‐technical professionals / by Field Cady.

Description: Hoboken, NJ : Wiley, 2021. | Includes bibliographical references and index.

Identifiers: LCCN 2020024708 (print) | LCCN 2020024709 (ebook) | ISBN 9781119544081 (hardback) | ISBN 9781119544166 (adobe pdf) | ISBN 9781119544173 (epub)

Subjects: LCSH: Data mining.

Classification: LCC QA76.9.D343 C33 2021 (print) | LCC QA76.9.D343 (ebook) | DDC 006.3/12–dc23

LC record available at https://lccn.loc.gov/2020024708

LC ebook record available at https://lccn.loc.gov/2020024709

Cover Design: Wiley

Cover Image: © monsitj/Getty Images

For my Uncle Steve, who left the world on the day this book was finished.

And for my son Cyrus, who entered shortly thereafter.

1 Introduction

1.1 Why Managers Need to Know About Data Science

There are many “data science for managers” books on the market today. They are filled with business success stories, pretty visualizations, and pointers about what some of the hot trends are. That material will get you rightfully excited about data science's potential, and maybe even get you started off on the right foot with some promising problems, but it isn't enough to see projects over the finish line or bring the full benefits of data to your organization. Depending on your role you may also need to decide how much to trust a piece of analytical work, make final calls about what tools your company will invest in, and hire/manage a team of data scientists. These tasks don't require writing your own code or performing mathematical derivations, but they do require a solid grounding in data science concepts and the ability to think critically about them.

In the past, mathematical disciplines like statistics and accounting solved precisely defined problems with a clear business meaning. You don't need a STEM degree to understand the idea of testing whether a drug works or balancing a checkbook! But as businesses tackle more open‐ended questions, and do so with datasets that are increasingly complex, the analytics problems become more ambiguous. A data science problem almost never lines up perfectly with something in a textbook; there is always a business consideration or data issue that requires some improvisation. Flexibility like this can become recklessness without fluency in the underlying technical concepts. Combine this with the fact that data science is fast becoming ubiquitous in the business world, and managers and executives face a higher technical bar than they ever did in the past.

Business education has not caught up to this new reality. Most careers follow a “business track” that teaches few technical concepts, or a “technical track” that focuses on hands‐on skills that are useless for businesspeople. This book charts a middle path, teaching non‐technical professionals the core concepts of modern data science. I won't teach you the brass tacks of how to do the work yourself (that truly is for specialists), but I will give you the conceptual background you need to recognize good analytics, frame business needs as solvable problems, manage data science projects, and understand the ways data science is likely to transform your industry.

In my work as a consultant I have seen PMs struggle to mediate technical disagreements, ultimately making decisions based on people's personalities rather than the merits of their ideas. I've seen large‐scale proof‐of‐concept projects that proved the wrong concept, because organizers set out inappropriate success metrics. And I've seen executives scratching their heads after contractors deliver a result, unable to see for themselves whether they got what they paid for.

Conversely, I have seen managers who can talk shop with their analysts, asking solid questions that move the needle on the business. I've seen executives who understand what is and isn't feasible, instinctively moving resources toward projects that are likely to succeed. And I've seen non‐technical employees who can identify key oversights on the part of analysts and communicate results throughout an organization.

Most books on data science come in one of two types. Some are written for aspiring data scientists, with a focus on example code and the gory details of how to tune different models. Others assume that their readers are unable or unwilling to think critically, and dumb the technical material down to the point of uselessness. This book rejects both those approaches. I am convinced that it is not just possible for people throughout the modern business workforce to learn the language of data: it is essential.

1.2 The New Age of Data Literacy

Analytics used to play a minor role in business. For the most part it was used to solve a few well‐known problems that were industry‐specific. When more general analytics was needed, it was for well‐defined problems, like conducting an experiment to see what product customers preferred.

Two trends have changed that situation. The first is the intrusion of computers into every aspect of life and business. Every phone app, every new feature in a computer program, every device that monitors a factory is a place where computers are making decisions based on algorithmic rules, rather than human judgment. Determining those rules, measuring their effectiveness, and monitoring them over time are inherently analytical. The second trend is the profusion of data and machines that can process it. In the past data was rare, gathered with a specific purpose in mind, and carefully structured so as to support the intended analysis. These days every device is generating a constant stream of data, which is passively gathered and stored in whatever format is most convenient. Eventually it gets used by high‐powered computer clusters to answer a staggering range of questions, many of which it wasn't really designed for.

I don't mean to make it sound like computers are able to take care of everything themselves – quite the opposite. They have no real‐world insights, no creativity, and no common sense. It is the job of humans to make sure that computers' brute computational muscle is channeled toward the right questions, and to know their limitations when interpreting the answers. Humans are not being replaced – they are taking on the job of shepherding machines.

I am constantly concerned when I see smart, ethical business people failing to keep up with these changes. Good managers are at risk of botching major decisions for dumb reasons, or even falling prey to unscrupulous snake oil vendors. Some of these people are my friends and colleagues. It's not a question of intelligence or earnestness – many simply don't have the required conceptual background, which is understandable. I wrote this book for my friends and people like them, so that they can be empowered by the age of data rather than left behind.

1.3 Data‐Driven Development

So where is all of this leading? Cutting out hyperbole and speculation, what does it look like for an organization to make full use of modern data technologies and what are the benefits? The goal that we are pushing toward is what I call “data‐driven development” (DDD). In an organization that uses DDD, all stages in a business process have their data gathered, modeled, and deployed to enable better decision making. Overall business goals and workflows are crafted by human experts, but after that every part of the system can be monitored and optimized, hypotheses can be tested rigorously and retroactively, and large‐scale trends can be identified and capitalized on. Data greases the wheels of all parts of the operation and provides a constant pulse on what's happening on the ground.

I break the benefits of DDD into three major categories:

1. Human decisions are better‐informed: Business is filled with decisions about what to prioritize, how to allocate resources, and which direction to take a project. Often the people making these calls have no true confidence in one direction or the other, and the numbers that could help them out are either unavailable or dubious. In DDD the data they need will be available at a moment's notice. More than that though, there will be an understanding of how to access it, pre‐existing models that give interpretations and predictions, and a tribal understanding of how reliable these analyses are.

2. Some decisions are made autonomously: If there is a single class of “killer apps” for data science, it is machine learning algorithms that can make decisions without human intervention. In a DDD system large portions of a workflow can be automated, with assurances about performance based on historical data.

3. Everything can be measured and monitored: Understanding a large, complex, real‐time operation requires the ability to monitor all aspects of it over time. This ranges from concrete stats – like visitors to a website or yield at a stage of a manufacturing pipeline – to fuzzier concepts like user satisfaction. This makes it possible to constantly optimize a system, diagnose problems quickly, and react more quickly to a changing environment.

It might seem at first blush like these benefit categories apply to unrelated aspects of a business. But in fact they have much in common: they rely on the same datasets and data processing systems, they leverage the same models to make predictions, and they inform each other. If an autonomous decision algorithm suddenly starts performing poorly, it will prompt an investigation and possibly lead to high‐level business choices. Monitoring systems use autonomous decision algorithms to prioritize incidents for human investigation. And any major business decision will be accompanied by a plan to keep track of how well it turns out, so that adjustments can be made as needed.

Data science today is treated as a collection of stand‐alone projects, each with its own models, team, and datasets. But in DDD all of these projects are really just applications of a single unified system. DDD goes so far beyond just giving people access to a common database; it keeps a pulse on all parts of a business operation, it automates large parts of it, and where automation isn't possible it puts all the best analyses at people's fingertips.

It's a waste of effort to sit around and guess things that can be measured, or to cross our fingers about hypotheses that we can go out and test. Ideally we should spend our time coming up with creative new ideas, understanding customer needs, deep troubleshooting, or anticipating “black swan” events that have no historical precedent. DDD pushes as much work as possible onto machines and pre‐existing models, so that humans can focus on the work that only a human can do.

1.4 How to Use this Book

This book was written to bring people who don't necessarily have a technical background up to speed on data science. The goals are twofold: first, I want to give a working knowledge of the current state of data science, the tools being used, and where it's going in the foreseeable future. Second, I want to give a solid grounding in the core concepts of analytics that will never go out of date. This book may also be of interest to data scientists who have nitty‐gritty technical chops but want to take their career to the next level by focusing on work that moves the business needle.

The first part of this book, The Business Side of Data Science, stands on its own. It explains in non‐technical terms what data science is, how to manage, hire, and work with data scientists, and how you can leverage DDD without getting into the technical weeds.

Really achieving data literacy, though, requires a certain amount of technical background, at least at a conceptual level. That's where the rest of the book comes in. It gives you the foundation required to formulate clear analytics questions, know what is and isn't possible, understand the tradeoffs between different approaches, and think critically about the usefulness of analytics results. Key jargon is explained in basic terms, the real‐world impact of technical details is shown, unnecessary formalism is avoided, and there is no code. Theory is kept to a minimum, but when it is necessary I illustrate it by example and explain why it is important. I have tried to adhere to Einstein's maxim: “everything should be made as simple as possible… but not simpler.”

Some sections of the book are flagged as “advanced material” in the title. These sections are (by comparison) highly technical in their content. They are necessary for understanding the strengths and weaknesses of specific data science techniques, but are less important for framing analytics problems and managing data science teams.

I have tried to make the chapters as independent as possible, so that the book can be consumed in bite‐sized chunks. In some places the concepts necessarily build off of each other; I have tried to call this out explicitly when it occurs, and to summarize the key background ideas so that the book can be used as a reference.

2 The Business Side of Data Science

A lot of this book focuses on teaching the analytics concepts required to best leverage data science in your organization. This first part, however, zooms out and takes the “pure business” perspective. It functions as a primer on what value data scientists can bring to an organization, where they fit into the business ecosystem, and how to hire and manage them effectively.

2.1 What Is Data Science?

There is no accepted definition of “data science.” I don't expect there to ever be one either, because its existence as a job role has less to do with clearly defined tasks and more to do with historical circumstance. The first data scientists were solving problems that would normally fall under the umbrella of statistics or business intelligence, but they were doing it in a computationally‐intensive way that relied heavily on software engineering and computer science skills. I'll talk more about these historical circumstances and the blurry lines between the job titles shortly, but for now a good working definition is:

Data Science: Analytics work that, for one reason or another, requires a substantial amount of software engineering skills

This definition technically applies to some people who identify as statisticians, business analysts, and mathematicians, but most people in those fields can't do the work of a good data scientist. That might change in the future as the educational system catches up to the demands of the information economy, but for the time being data scientists fill, functionally speaking, a very distinctive role.

2.1.1 What Data Scientists Do

Data science can largely be divided into two types of work: the kind where the clients are humans and the kind where the clients are machines. These styles are often used on the same data, and leverage many of the same techniques, but the goals and final deliverables can be wildly different.

If the client is a human, then typically you are either investigating a business situation (what are our users like?) or you are using data to help make a business decision (is this feature of our product useful enough to justify the cost of its upkeep?). Some of these questions are extremely concrete, like how often a particular pattern shows up in the available data. More often though they are open‐ended, and there is a lot of flexibility in figuring out what best addresses the business question. A few good examples would be

Quantifying the usefulness of a feature on a software product. This involves figuring out what “success” looks like in the data logs, whether some customers are more important than others, and being aware of how the ambiguities here qualify the final assessment.

Determining whether some kind of compelling pattern exists in the available data. Companies are often sitting on years' worth of data and wondering whether there are natural classes of users, or whether there are leading indicators of some significant event. These kinds of “see what there is to see” analyses are often pilots, which gauge whether a given avenue is worth pouring additional time and effort into.

Finding patterns that predict whether a machine will fail or a transaction will go through to completion. Patterns that correlate strongly with failure may represent problems in a company's processes that can be rectified (although they could equally well be outside of your control).

Testing which of two versions of a website works better. AB testing like this largely falls under the domain of statistics, but measuring effectiveness often requires more coding than a typical statistician is up for.

Typically the deliverables for this kind of work are slide decks, written reports, or emails that summarize findings.
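The AB testing mentioned above can be sketched with a standard two-proportion z-test. This is only an illustration of the underlying statistics, not the author's method; the function name and the visitor/conversion numbers are hypothetical, and a real analysis would also involve sample-size planning and guarding against peeking.

```python
from math import sqrt, erf

def ab_test_z(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-proportion z-test (normal approximation) for an A/B test."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    # Pooled conversion rate under the null hypothesis of no difference
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical data: version B converts 120/1000 visitors vs. A's 100/1000
z, p = ab_test_z(100, 1000, 120, 1000)
print(round(z, 2), round(p, 3))  # z is about 1.43, p about 0.15
```

With a p-value around 0.15, this hypothetical experiment would not clear the conventional 0.05 bar; the observed lift could plausibly be noise.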

If the client is a machine then typically the data scientist is devising some logic that will be used by a computer to make real‐time judgements autonomously. Examples include

Determining which ad to show a user, or which product to try up‐selling them with on a website

Monitoring an industrial machine to identify leading indicators of failure and sound an alarm when a situation arises

Identifying components on an assembly line that are likely to cause failures downstream so that they can be discarded or re‐processed

In these situations the data scientist usually writes the final code that will run in a production situation, or at least they write the first version of it that is incorporated by engineers into the final product. In some cases they do not write the real‐time code, but they write code that is periodically re‐run to tune the parameters in the business logic.

The differences between the two categories of data science are summarized in Table 2.1. This might seem like I'm describing two different job roles, but in fact a given data scientist is likely to be involved in both types of work. They use a lot of the same datasets and domain expertise, and there is often a lot of feedback between them. For example, I mentioned identifying leading indicators of failure in an industrial machine – these insights can be used by a human being to improve a process or by a machine to raise an alarm when failure is likely imminent.

Table 2.1 Data science work can largely be divided into producing human‐understandable insights or producing code and models that get run in production.

Client: Human
Uses: Understanding the business; helping humans make data‐driven decisions
Deliverables: Slide decks; narratives that explain how/why
Special considerations: Formulating questions that are useful and answerable; how to measure business outcomes

Client: Machine
Uses: Making decisions autonomously
Deliverables: Production code; performance specs
Special considerations: Code quality; performance in terms of speed

Another common case, which does not fit cleanly into the “client is a machine” or “client is a human” category, is setting up analytics infrastructures. Especially in small organizations or teams, a data scientist often functions as a one‐person shop. They set up the databases, write the pre‐processing scripts that get the data into a form where it is suitable for a database, and create the dashboards that do the standard monitoring. This infrastructure can ultimately be used to help either human or machine clients.

2.1.2 History of Data Science

There are two stories that come together in the history of data science. The first concerns the evolution of data analysis methodology and statistics, and especially how it was impacted by the availability of computers. The second is about the data itself, and the Big Data technologies that changed the way we see it.

Data has been gathered and analyzed in some form for millennia, but statistics as a discipline is widely dated to 1662. In that year John Graunt and William Petty used mathematical methods to study topics in demographics, like life expectancy tables and the populations of cities. Increasingly mathematical techniques were developed as the scientific revolution picked up steam, especially in the context of using astronomical data to estimate the locations and trajectories of celestial bodies. People began to apply these scientific techniques to other areas, like Florence Nightingale in medicine. The Royal Statistical Society was founded in 1834, recognizing the general‐purpose utility of statistics. The greatest figure in classical statistics was Ronald Fisher. In the early twentieth century, he almost single‐handedly created the discipline in its modern form, with a particular focus on biological problems. The name of the game was the following: distill a real‐world situation down into a set of equations that capture the essence of what's going on, but are also simple enough to solve by hand.

The situation changed with the advent of computers, because it was no longer necessary to do the math by hand. This made it possible to try out different analytical approaches to see what worked, or even to just explore the data in an open‐ended way. It also opened the door to a new paradigm – the early stages of machine learning (ML) – which began to gain traction both in and especially outside of the statistics community. In many problems the goal is simply to make accurate predictions, by hook or crook. Previously you did that by understanding a real‐world situation and distilling it down to a mathematical model that (hopefully!) captured the essence of the phenomena you were studying. But maybe you could use a very complicated model instead and just solve it with a computer. In that case you might not need to “understand” the world: if you had enough data, you could just fit a model to it by rote methods. The world is a complex place, and fitting complicated models to large datasets might be more accurate than any idealization simple enough to fit into a human brain.

Over the latter part of the twentieth century, this led to something of a polarization. Traditional statisticians continued to use models based on idealizations of the world, but a growing number of people experimented with the new approach. The divide was best described by Leo Breiman – a statistician who was one of the leading lights in ML – in 2001:

There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

The early “algorithmic models” included the standard classifiers and regressors that the discipline of ML is built on. It has since grown to include deep learning, which is dramatically more complicated than the early models but also (potentially) much more powerful.

The story of the data itself is much more recent. Three years after Breiman's quote, in 2004, the Big Data movement started when Google published a paper on MapReduce, a new software framework that made it easy to program a cluster of computers to collaborate on a single analysis, with the (potentially very large) dataset spread out across the various computers in the cluster. Especially in the early days clusters were notoriously finicky, and they required far more IT skills than a typical statistician was likely to know.
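To make the idea concrete, the canonical illustration of the MapReduce paradigm is counting words across many documents. The sketch below is a minimal single-machine simulation of the three phases (the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are my own labels, not Google's API); in a real cluster, the map and reduce steps would run in parallel on different machines, each holding a slice of the data.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: sum the counts for each word."""
    return {key: sum(values) for key, values in groups.items()}

# In a real cluster, each "document" would live on a different machine.
documents = ["the cat sat", "the dog sat"]
all_pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(all_pairs))
print(counts)  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 2... etc.}
```

The power of the framework is that the programmer only writes the map and reduce functions; the framework handles distributing the data, shuffling the intermediate pairs between machines, and recovering from hardware failures.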

The other thing that was new about Big Data was that the data was often “unstructured.” This means it wasn't relevant numbers arranged into nice rows and columns. It was webpages in HTML, Word documents, and log files belched out by computers – a cacophony of different formats that were not designed with any particular analytics question in mind. Companies had been sitting on backlogs of messy legacy data for years, and MapReduce finally made it possible to milk insights out of them.

Most of the labor of Big Data was in just getting the data from its raw form into tables of numbers where statistics or ML could even be applied. This meant large code frameworks for handling all the different formats, and business‐specific knowledge for how to distill the raw data into meaningful numbers. The actual statistics was often just fitting a line – you need some mathematical knowledge to do this responsibly given so many moving parts, but it hardly requires a full‐blown statistician.

And so the hybrid role of data scientist was born. They were mostly drawn from the ranks of computer scientists and software engineers, especially the ones who were originally from math‐heavy backgrounds (I've always been shocked by how many of the great computer scientists were originally physicists). This gave rise to the mythology of the data scientist – a brilliant polymath who had a mastery of all STEM knowledge. After all, look at the resumes of the people who became data scientists!

The reality though was that you just needed somebody who was competent at both coding and analytics, and the people who knew both of these largely unrelated disciplines were predominantly polymaths. It doesn't need to be this way, and the educational system is quickly figuring out that solid coding skills are important for people in all STEM fields. In explaining my work, I often jokingly tell people that “if you take a mediocre computer programmer and a mediocre statistician and put them together you get a good data scientist.”

Technology evolved. As individual computers became more powerful there was less need to resort to cluster computing (which has many disadvantages I will discuss later in the book – you should only use a cluster if you really have to). Cluster computing itself improved, so that data scientists could spend less time troubleshooting the software and more time on the modeling. And so data science has come full circle – from an order of cluster‐whisperers who could coax insights from data to mainstream professionals who can do whatever is needed (as far as math or coding) to solve the business problem.

2.1.3 Data Science Roadmap

The process of solving a data science problem is summarized in Figure 2.1, which I call the Data Science Roadmap. The first step is always to frame the problem: identify the business use case and craft a well‐defined analytics problem (or problems) out of it. This is the most important stage of the process, because it determines whether the ultimate results are something that will be useful to the business. Asking the right question is also the stage where managers and executives are most critical.

Figure 2.1 The process of data science is deeply iterative, with the questions and methods evolving as discoveries are made. The first stage – and the most important – is to ask questions that provide the most business value.

This is followed by a stage of grappling with the data and the real‐world things that it describes. This involves the nitty‐gritty of how the data is formatted, how to access it, and what exactly it is describing. This stage can also reveal fundamental inadequacies in the data that might require revising the problem we aim to solve. Next comes “feature extraction,” where we distill the data into meaningful numbers and labels that characterize the objects we are studying. If the raw data are text documents, for example, we might characterize them by their length, how often certain key words/phrases occur, and whether they were written by an employee of the organization. Extracted features can also be quite complex, such as the presence or absence of faces in an image (as guessed by a computer program).

If asking the right question is the most important part of data science, the second most important part is probably feature extraction. It is where the real‐world and mathematics meet head‐on, because we want features that faithfully capture business realities while also having good mathematical properties (robustness to outliers, absence of pathological edge cases, etc.). We will have much more to say about feature extraction as the book progresses.
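The text-document example above can be sketched in a few lines of code. This is a hypothetical illustration (the feature names, the keyword "refund", and the `employees` set are all made up for the example), but it shows the basic shape of feature extraction: raw, messy input on one side, a tidy row of numbers and labels on the other.

```python
def extract_features(doc, employees):
    """Distill a raw text document into numbers and labels (hypothetical features)."""
    text = doc["text"]
    return {
        "length": len(text),                                # how long the document is
        "mentions_refund": text.lower().count("refund"),    # how often a key phrase occurs
        "by_employee": doc["author"] in employees,          # written by an employee?
    }

employees = {"alice"}
doc = {"author": "alice", "text": "Please process the refund today."}
features = extract_features(doc, employees)
print(features)
```

Every downstream model, from a pie chart to a neural network, sees only these extracted features, which is why choosing them well matters so much.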

Once features have all been extracted, the actual analysis can run the gamut from simple pie charts to Bayesian networks. A key point to note is that this loops back to framing the problem. Data science is a deeply iterative process, where we are constantly refining our questions in light of new discoveries. Of course in practice these stages can blur together, and it may be prudent to jump ahead depending on the circumstances, but the most important cycle is using our discoveries to help refine the questions that we are asking.

Finally, notice that there are two paths out of the workflow: presenting results and deploying code. These correspond to the case when our clients are humans making business decisions and when they are machines making judgments autonomously.

2.1.4 Demystifying the Terms: Data Science, Machine Learning, Statistics, and Business Intelligence

There are a number of fields that overlap with data science and the terminology can be quite confusing. So now that you have a better idea what data science is, let me sketch the blurry lines that exist between it and other disciplines.

2.1.4.1 Machine Learning

Machine learning (ML) is a collection of techniques whereby a computer can analyze a dataset and identify patterns in it, especially patterns that are relevant to making some kind of prediction. A prototypical example would be to take a lot of images that have been manually flagged as either containing a human face or not – this is called “labeled data.” An ML algorithm could then be trained on this labeled data so that it can confidently identify whether future pictures contain faces. ML specialists tend to come from computer science backgrounds, and they are experts in the mathematical nuances of the various ML models. The models tend to work especially well in situations where there is a staggering amount of data available so that very subtle patterns can be found.
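The train-then-predict pattern can be sketched with one of the simplest possible classifiers: a nearest-centroid model. In practice one would use a library and far richer features; the two-number feature vectors and labels below are made up purely to show the shape of the workflow (labeled data in, trained model out, predictions on new data).

```python
from statistics import mean

def train(labeled_data):
    """'Train' by computing the average feature vector (centroid) for each label."""
    by_label = {}
    for features, label in labeled_data:
        by_label.setdefault(label, []).append(features)
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in by_label.items()}

def predict(centroids, features):
    """Predict the label whose centroid is closest to the new point."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], features))

# Made-up labeled data: [feature1, feature2] -> whether the image contains a face
labeled = [([0.9, 0.8], "face"), ([0.8, 0.9], "face"),
           ([0.2, 0.1], "no_face"), ([0.1, 0.3], "no_face")]
model = train(labeled)
print(predict(model, [0.85, 0.75]))  # face
```

Real ML models differ enormously in their internals, but nearly all of them follow this same contract: fit to labeled examples, then classify unseen ones.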

Data scientists make extensive use of ML models, but they typically use them as pre‐packaged tools and don't worry much about their internal workings. Their concern is more about formulating the problems in the first place – what exactly is the thing we are trying to predict, and will it adequately address the business case? Data scientists also spend a lot of time cleaning known pathologies out of data and condensing it into a sensible format so that the ML models will have an easier time finding the relevant patterns.

2.1.4.2 Statistics

Statistics is similar to ML in that it's about mathematical techniques for finding and quantifying patterns in data. In another universe they could be the same discipline, but in our universe the disciplines evolved along very different paths. ML grew out of computer science, where computational power was a given and the datasets studied were typically very large. For that reason it was possible to study very complex patterns and thoroughly analyze not just whether they existed, but how useful they were. Statistics is an older discipline, having grown more out of studying agricultural and census data. Statistics often does not have the luxury of large datasets; if you're studying whether a drug works every additional datapoint is literally a human life, so statisticians go to great pains to extract as much meaning as possible out of very sparse data. Forget the highly tuned models and fancy performance specs of ML – statisticians often work in situations where it is unclear whether a pattern even exists at all. They spend a great deal of time trying to distinguish between bona fide discoveries and simple coincidences.

Statistics often serves a complementary role to data science in modern business. Statistics is especially used for conducting carefully controlled experiments, testing very specific hypotheses, and drawing conclusions about causality. Data science is more about wading through reams of available data to generate those hypotheses in the first place. For example, you might use data science to identify which feature on a website is the best predictor of people making a purchase, then design an alternative layout of the site that increases the prominence of that feature, and run an AB test. Statistics would be used to determine whether the new layout truly drives more purchases and to assess how large the experiment has to be in order to achieve a given level of confidence.
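The statistical half of that AB-test example boils down to a standard two-proportion z-test. Here is a self-contained sketch with hypothetical purchase counts (the numbers and the function name are invented for illustration); it asks whether the difference between the two layouts' conversion rates is larger than chance alone would plausibly produce.

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test: do layouts A and B have different conversion rates?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # rate under the null hypothesis
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical experiment: 120/1000 purchases on the old layout, 160/1000 on the new
z, p = two_proportion_z_test(120, 1000, 160, 1000)
print(z, p)
```

A small p-value (conventionally below 0.05) suggests the new layout genuinely drives more purchases; the same formula, run in reverse, tells you how many visitors the experiment needs to detect a given effect size.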

In practice data scientists usually work with problems and datasets that are more suitable for ML than statistics. They do, however, need to know the basic concepts, and they need to understand the circumstances when the careful hair‐splitting of statistics is required.

2.1.4.3 Business Intelligence

The term “business intelligence” (BI) focuses more on the uses to which data is being put, rather than the technical aspects of how it is stored, formatted, and analyzed. It is all about giving subjective insight into the business process so that decision makers can monitor things and make the right call. As a job function, BI analysts tend to use data that is readily available in a database, and the analyses are mathematically simple (pie charts, plots of something over time, etc.). Their focus is on connecting the data back to the business reality and giving graphical insight into all aspects of a business process. A BI analyst is the person who will put together a dashboard that condenses operations into a few key charts and then allows for drilling down into the details when the charts show something interesting.

This description might sound an awful lot like data science – graphical exploration, asking the right business question, and so on. And in truth data scientists often provide many of the same deliverables as a BI analyst. The difference is that BI analysts generally lack the technical skills to ask questions that can't be answered with a database query (if you don't know what that means, it will be covered later in the book); they usually don't have strong coding skills and do not know techniques that are mathematically sophisticated. On the other hand they use tools like Tableau to produce much more compelling visualizations, and typically they have a more thorough understanding of the business.