Principles of Data Science - E-Book

Sunil Kakade

Description

Need to turn programming skills into effective data science skills? This book helps you connect mathematics, programming, and business analysis. You'll feel confident asking—and answering—complex, sophisticated questions of your data, turning abstract and raw statistics into actionable ideas.
Working through the data science pipeline, you'll clean and prepare data and learn effective data mining strategies and techniques, gaining a comprehensive view of how the data science puzzle fits together. You'll learn the fundamentals of computational mathematics and statistics, along with the pseudocode used by data scientists and analysts. You'll move on to machine learning, discovering statistical models that help you control and navigate even the densest datasets, and learn to build powerful visualizations that communicate what your data means.

The e-book can be read in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 519

Year of publication: 2018




Table of Contents

Principles of Data Science - Second Edition
Why subscribe?
PacktPub.com
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
1. How to Sound Like a Data Scientist
What is data science?
Basic terminology
Why data science?
Example – xyz123 Technologies
The data science Venn diagram
The math
Example – spawner-recruit models
Computer programming
Why Python?
Python practices
Example of basic Python
Example – parsing a single tweet
Domain knowledge
Some more terminology
Data science case studies
Case study – automating government paper pushing
Fire all humans, right?
Case study – marketing dollars
Case study – what's in a job description?
Summary
2. Types of Data
Flavors of data
Why look at these distinctions?
Structured versus unstructured data
Example of data pre-processing
Word/phrase counts
Presence of certain special characters
The relative length of text
Picking out topics
Quantitative versus qualitative data
Example – coffee shop data
Example – world alcohol consumption data
Digging deeper
The road thus far
The four levels of data
The nominal level
Mathematical operations allowed
Measures of center
What data is like at the nominal level
The ordinal level
Examples
Mathematical operations allowed
Measures of center
Quick recap and check
The interval level
Example
Mathematical operations allowed
Measures of center
Measures of variation
Standard deviation
The ratio level
Examples
Measures of center
Problems with the ratio level
Data is in the eye of the beholder
Summary
Answers
3. The Five Steps of Data Science
Introduction to data science
Overview of the five steps
Asking an interesting question
Obtaining the data
Exploring the data
Modeling the data
Communicating and visualizing the results
Exploring the data
Basic questions for data exploration
Dataset 1 – Yelp
DataFrames
Series
Exploration tips for qualitative data
Nominal level columns
Filtering in pandas
Ordinal level columns
Dataset 2 – Titanic
Summary
4. Basic Mathematics
Mathematics as a discipline
Basic symbols and terminology
Vectors and matrices
Quick exercises
Answers
Arithmetic symbols
Summation
Proportional
Dot product
Graphs
Logarithms/exponents
Set theory
Linear algebra
Matrix multiplication
How to multiply matrices
Summary
5. Impossible or Improbable - A Gentle Introduction to Probability
Basic definitions
Probability
Bayesian versus Frequentist
Frequentist approach
The law of large numbers
Compound events
Conditional probability
The rules of probability
The addition rule
Mutual exclusivity
The multiplication rule
Independence
Complementary events
A bit deeper
Summary
6. Advanced Probability
Collectively exhaustive events
Bayesian ideas revisited
Bayes' theorem
More applications of Bayes' theorem
Example – Titanic
Example – medical studies
Random variables
Discrete random variables
Types of discrete random variables
Binomial random variables
Geometric random variables
Poisson random variable
Continuous random variables
Summary
7. Basic Statistics
What are statistics?
How do we obtain and sample data?
Obtaining data
Observational
Experimental
Sampling data
Probability sampling
Random sampling
Unequal probability sampling
How do we measure statistics?
Measures of center
Measures of variation
Definition
Example – employee salaries
Measures of relative standing
The insightful part – correlations in data
The empirical rule
Summary
8. Advanced Statistics
Point estimates
Sampling distributions
Confidence intervals
Hypothesis tests
Conducting a hypothesis test
One sample t-tests
Example of a one-sample t-test
Assumptions of the one-sample t-test
Type I and type II errors
Hypothesis testing for categorical variables
Chi-square goodness of fit test
Assumptions of the chi-square goodness of fit test
Example of a chi-square test for goodness of fit
Chi-square test for association/independence
Assumptions of the chi-square independence test
Summary
9. Communicating Data
Why does communication matter?
Identifying effective and ineffective visualizations
Scatter plots
Line graphs
Bar charts
Histograms
Box plots
When graphs and statistics lie
Correlation versus causation
Simpson's paradox
If correlation doesn't imply causation, then what does?
Verbal communication
It's about telling a story
On the more formal side of things
The why/how/what strategy of presenting
Summary
10. How to Tell If Your Toaster Is Learning – Machine Learning Essentials
What is machine learning?
Example – facial recognition
Machine learning isn't perfect
How does machine learning work?
Types of machine learning
Supervised learning
Example – heart attack prediction
It's not only about predictions
Types of supervised learning
Regression
Classification
Data is in the eyes of the beholder
Unsupervised learning
Reinforcement learning
Overview of the types of machine learning
How does statistical modeling fit into all of this?
Linear regression
Adding more predictors
Regression metrics
Logistic regression
Probability, odds, and log odds
The math of logistic regression
Dummy variables
Summary
11. Predictions Don't Grow on Trees - or Do They?
Naive Bayes classification
Decision trees
How does a computer build a regression tree?
How does a computer fit a classification tree?
Unsupervised learning
When to use unsupervised learning
k-means clustering
Illustrative example – data points
Illustrative example – beer!
Choosing an optimal number for K and cluster validation
The Silhouette Coefficient
Feature extraction and principal component analysis
Summary
12. Beyond the Essentials
The bias/variance trade-off
Errors due to bias
Error due to variance
Example – comparing body and brain weight of mammals
Two extreme cases of bias/variance trade-off
Underfitting
Overfitting
How bias/variance play into error functions
K folds cross-validation
Grid searching
Visualizing training error versus cross-validation error
Ensembling techniques
Random forests
Comparing random forests with decision trees
Neural networks
Basic structure
Summary
13. Case Studies
Case study 1 – Predicting stock prices based on social media
Text sentiment analysis
Exploratory data analysis
Regression route
Classification route
Going beyond with this example
Case study 2 – Why do some people cheat on their spouses?
Case study 3 – Using TensorFlow
TensorFlow and neural networks
Summary
14. Building Machine Learning Models with Azure Databricks and Azure Machine Learning service
Technical requirements
Technologies for machine learning projects
Apache Spark
Data management in Apache Spark
Databricks and Azure Databricks
MLlib
Configuring Azure Databricks
Creating an Azure Databricks cluster
Training a text classifier with Azure Databricks
Loading data into Azure Databricks
Reading and prepping our dataset
Feature engineering
Tokenizers
StopWordsRemover
TF-IDF
Model training and testing
Exporting the model
Azure Machine Learning
Creating an Azure Machine Learning workspace
Azure Machine Learning SDK for Python
Integrating Azure Databricks and Azure Machine Learning
Programmatically create a new Azure Machine Learning workspace
SMS spam classifier on Azure Machine Learning
Experimenting with and selecting the best model
Deploying to Azure Container Instances
Testing our RESTful intelligent web service
Summary
Other Books You May Enjoy
Leave a review – let other readers know what you think
Index

Principles of Data Science - Second Edition

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Amey Varangoankar

Acquisition Editor: Dayne Castelino

Content Development Editor: Chris D'cruz

Technical Editor: Sneha Hanchate

Copy Editor: Safis Editing

Project Coordinator: Namarata Swetta

Proofreader: Safis Editing

Indexer: Pratik Shirodkar

Graphics: Tom Scaria

Production Coordinator: Nilesh Mohite

First published: December 2016

Second edition: December 2018

Production reference: 1141218

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78980-454-6

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and videos from over 4,000 industry professionals
Learn better with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the authors

Sinan Ozdemir is a data scientist, start-up founder, and educator living in the San Francisco Bay Area. He studied pure mathematics at Johns Hopkins University. He then spent several years conducting lectures on data science at Johns Hopkins University before founding his own start-up, Kylie.ai, which uses artificial intelligence to clone brand personalities and automate customer service communications.

Sinan is also the author of Principles of Data Science, First Edition, available through Packt.

Sunil Kakade is a technologist, educator, and senior leader with expertise in creating data- and AI-driven organizations. He is an adjunct faculty member at Northwestern University, Evanston, IL, where he teaches graduate courses on data science and big data. He has several research papers to his credit and has presented his work on big data applications at reputable conferences. He holds US patents in the areas of big data and retail processes. He is passionate about applying data science to improve business outcomes and save patients' lives. At present, Sunil leads the information architecture and analytics team for a large healthcare organization focused on improving healthcare outcomes. He lives with his wife, Pratibha, and daughter, Preeti, in Scottsdale, Arizona.

I would like to thank my mother, Subhadra; my wife, Pratibha; and my daughter, Preeti, for supporting me during my education and career and for supporting my passion for learning. Many thanks to my mentors: Prof. Faisal Akkawi of Northwestern University; Bill Guise, Sr. Director; Dr. Joseph Colorafi, CMIO; and Deanna Wise, CIO, at Dignity Health, for supporting my passion for big data, data science, and artificial intelligence. Special thanks to Sinan Ozdemir and Packt Publishing for giving me the opportunity to co-author this book. I appreciate the incredible support of my team at Dignity Health Insights in my journey in data science. Finally, I'd like to thank my friend Anand Deshpande, who inspired me to take on this project.

Marco Tibaldeschi, born in 1983, holds a Master's degree in computer engineering and has worked actively on the web since 1994. Being the fourth of four brothers, he has always had a foot in the future. In 1998, he registered his first domain, which hosted one of the first virtual web communities in Italy. Because of this, he has been interviewed by several national newspapers and radio stations, and the University of Pisa wrote a research book to understand the social phenomenon. In 2003, he founded DBN Communication, a web consulting company that owns and develops eDock, a SaaS product that helps sellers manage their inventories and orders on the world's biggest marketplaces (such as Amazon and eBay).

Acknowledgments

I'd like to thank my wife, Giulia, because with her support everything seems and becomes possible. Without her help and her love, I'd be a different man and, I'm sure, a worse one. I'd also like to thank Nelson Morris and Chris D'cruz from Packt for this opportunity and for their continuous support.

About the reviewers

Oleg Okun got his PhD from the Institute of Engineering Cybernetics, National Academy of Sciences (Minsk, Belarus) in 1996. Since 1998 he has worked abroad, doing both academic research (in Belarus and Finland) and industrial research (in Sweden and Germany). His research experience includes document image analysis, cancer prediction by analyzing gene expression profiles (bioinformatics), fingerprint verification and identification (biometrics), online and offline marketing analytics, credit scoring (microfinance), and text search and summarization (natural language processing). He has 80 publications, including one IGI Global-published book and three co-edited books published by Springer-Verlag, as well as book chapters, journal articles, and numerous conference papers. He has also been a reviewer of several books published by Packt Publishing.

Jared James Thompson, PhD, is a graduate of Purdue University and has held both academic and industrial appointments teaching programming, algorithms, and big data technology. He is a machine learning enthusiast and has a particular love of optimization. Jared is currently employed as a machine learning engineer at Atomwise, a start-up that leverages artificial intelligence to design better drugs faster.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Preface

The topic of this book is data science, which is a field of study that has been growing rapidly for the past few decades. Today, more companies than ever before are investing in big data and data science to improve business performance, drive innovation, and create new revenue streams by building data products. According to LinkedIn's 2017 US Emerging Jobs Report, machine learning engineer, data scientist, and big data engineer rank among the top emerging jobs, and companies in a wide range of industries are seeking people with the requisite skills for those roles.

We will dive into topics from all three areas of data science (math, programming, and domain knowledge) and solve complex problems. We will clean, explore, and analyze data in order to derive scientific and accurate conclusions. Machine learning and deep learning techniques will be applied to solve complex data tasks.

Who this book is for

This book is for people who are looking to understand and utilize the basic practices of data science in any domain. The reader should be fairly well acquainted with basic mathematics (algebra, and perhaps probability) and should feel comfortable reading snippets of R/Python as well as pseudocode. The reader is not expected to have worked in a data field; however, they should have the urge to learn and apply the techniques put forth in this book to either their own datasets or those provided to them.

What this book covers

Chapter 1, How to Sound Like a Data Scientist, introduces the basic terminology used by data scientists and looks at the types of problem we will be solving throughout this book.

Chapter 2, Types of Data, looks at the different levels and types of data out there and shows how to manipulate each type. This chapter will begin to deal with the mathematics needed for data science.

Chapter 3, The Five Steps of Data Science, uncovers the five basic steps of performing data science, including data manipulation and cleaning, and shows examples of each step in detail.

Chapter 4, Basic Mathematics, explains the basic mathematical principles that guide the actions of data scientists by presenting and solving examples in calculus, linear algebra, and more.

Chapter 5, Impossible or Improbable - A Gentle Introduction to Probability, is a beginner's guide to probability theory and how it is used to gain an understanding of our random universe.

Chapter 6, Advanced Probability, uses principles from the previous chapter and introduces and applies theorems, such as Bayes' Theorem, in the hope of uncovering the hidden meaning in our world.

Chapter 7, Basic Statistics, deals with the types of problem that statistical inference attempts to explain, using the basics of experimentation, normalization, and random sampling.

Chapter 8, Advanced Statistics, uses hypothesis testing and confidence intervals to gain insight from our experiments. Being able to pick the appropriate test and to interpret p-values and other results is very important as well.

Chapter 9, Communicating Data, explains how correlation and causation affect our interpretation of data. We will also be using visualizations in order to share our results with the world.

Chapter 10, How to Tell If Your Toaster Is Learning – Machine Learning Essentials, focuses on the definition of machine learning and looks at real-life examples of how and when machine learning is applied. A basic understanding of the relevance of model evaluation is introduced.

Chapter 11, Predictions Don't Grow on Trees, or Do They?, looks at more complicated machine learning models, such as decision trees and Bayesian predictions, in order to solve more complex data-related tasks.

Chapter 12, Beyond the Essentials, introduces some of the mysterious forces guiding data science, including bias and variance. Neural networks are introduced as a modern deep learning technique.

Chapter 13, Case Studies, uses an array of case studies in order to solidify the ideas of data science. We will be following the entire data science workflow from start to finish multiple times for different examples, including stock price prediction and handwriting detection.

Chapter 14, Microsoft Databricks Case Studies, will harness the power of the Microsoft data environment as well as Apache Spark to put our machine learning in high gear. This chapter makes use of parallelization and advanced visualization software to get the most out of our data.

Chapter 15, Building Machine Learning Models with Azure Databricks and Azure ML, looks at the different technologies that a data scientist can use on the Microsoft Azure platform, which help in managing big data projects without having to worry about infrastructure and computing power.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at <[email protected]>.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Chapter 1. How to Sound Like a Data Scientist

No matter which industry you work in—IT, fashion, food, or finance—there is no doubt that data affects your life and work. At some point this week, you will either have or hear a conversation about data. News outlets are covering more and more stories about data leaks, cybercrimes, and how data can give us a glimpse into our lives. But why now? What makes this era such a hotbed of data-related industries?

In the nineteenth century, the world was in the grip of the Industrial Age. Mankind was exploring its place in the industrial world, working with giant mechanical inventions. Captains of industry, such as Henry Ford, recognized that using these machines could open major market opportunities, enabling industries to achieve previously unimaginable profits. Of course, the Industrial Age had its pros and cons. While mass production placed goods in the hands of more consumers, our battle with pollution also began at around this time.

By the twentieth century, we were quite skilled at making huge machines; the goal now was to make them smaller and faster. The Industrial Age was over and was replaced by what we now refer to as the Information Age. We started using machines to gather and store information (data) about ourselves and our environment for the purpose of understanding our universe.

Beginning in the 1940s, machines such as ENIAC (considered one of the first—if not the first—computers) were computing math equations and running models and simulations like never before. The following photograph shows ENIAC:

ENIAC—The world's first electronic digital computer (Ref: http://ftp.arl.mil/ftp/historic-computers/)

We finally had a decent lab assistant who could run the numbers better than we could! As with the Industrial Age, the Information Age brought us both the good and the bad. The good was the extraordinary works of technology, including mobile phones and televisions. The bad was not as severe as worldwide pollution, but it still left us with a problem in the twenty-first century: so much data.

That's right: the Information Age, in its quest to procure data, has caused an explosion in the production of electronic data. Estimates show that we created about 1.8 trillion gigabytes of data in 2011 (take a moment to just think about how much that is). Just one year later, in 2012, we created over 2.8 trillion gigabytes of data! This number is only going to grow, with an estimated 40 trillion gigabytes of data created in a single year by 2020. People contribute to this every time they tweet, post on Facebook, save a new resume in Microsoft Word, or just send their mom a picture by text message.

Not only are we creating data at an unprecedented rate, but we are also consuming it at an accelerating pace. Just five years ago, in 2013, the average cell phone user used under 1 GB of data a month. Today, that number is estimated to be well over 2 GB a month. We aren't just looking for the next personality quiz; what we are looking for is insight. With all of this data out there, some of it has to be useful to me! And it can be!

So we, in the twenty-first century, are left with a problem. We have so much data, and we keep making more. We have built insanely tiny machines that collect data 24/7, and it's our job to make sense of it all. Enter the Data Age. This is the age when we take the machines dreamed up by our nineteenth-century ancestors and the data created by our twentieth-century counterparts and create insights and sources of knowledge that every human on Earth can benefit from. The United States government created an entirely new role: chief data scientist. Many companies are now investing in data science departments and hiring data scientists. The benefit is quite obvious: using data to make accurate predictions and simulations gives us insight into our world like never before.

Sounds great, but what's the catch?

This chapter will explore the terminology and vocabulary of the modern data scientist. We will learn keywords and phrases that will be essential in our discussion of data science throughout this book. We will also learn why we use data science and learn about the three key domains that data science is derived from before we begin to look at the code in Python, the primary language used in this book. This chapter will cover the following topics:

The basic terminology of data science
The three domains of data science
The basic Python syntax

What is data science?

Before we go any further, let's look at some basic definitions that we will use throughout this book. The great/awful thing about this field is that it is so young that these definitions can differ from textbook to newspaper to whitepaper.

Basic terminology

The definitions that follow are general enough to be used in daily conversations, and work to serve the purpose of this book, an introduction to the principles of data science.

Let's start by defining what data is. This might seem like a silly first definition to look at, but it is very important. Whenever we use the word "data," we refer to a collection of information in either an organized or unorganized format. These formats have the following qualities:

Organized data: This refers to data that is sorted into a row/column structure, where every row represents a single observation and the columns represent the characteristics of that observation.
Unorganized data: This is the type of data that is in a free form, usually text or raw audio/signals that must be parsed further to become organized.

Whenever you open Excel (or any other spreadsheet program), you are looking at a blank row/column structure waiting for organized data. These programs don't do well with unorganized data. For the most part, we will deal with organized data as it is the easiest to glean insights from, but we will not shy away from looking at raw text and methods of processing unorganized forms of data.
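To make the distinction concrete, here is a minimal Python sketch; the names, ages, and sentence below are invented for illustration, and pandas is assumed to be installed:

import pandas as pd

# Organized data: a row/column structure, one observation per row.
organized = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [34, 29],
    "city": ["Austin", "Boston"]
})

# Unorganized data: free-form text that must be parsed before it is useful.
unorganized = "Alice, 34, lives in Austin. Bob, 29, lives in Boston."

print(organized)                  # already ready for analysis
print(unorganized.split(". "))    # a first, crude step toward organizing it

Notice that the DataFrame is immediately ready for analysis, while the sentence needs parsing before it tells us anything.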

Data science is the art and science of acquiring knowledge through data.

What a small definition for such a big topic, and rightfully so! Data science covers so many things that it would take pages to list it all out (I should know—I tried and got told to edit it down).

Data science is all about how we take data, use it to acquire knowledge, and then use that knowledge to do the following:

Make decisions
Predict the future
Understand the past/present
Create new industries/products

This book is all about the methods of data science, including how to process data, gather insights, and use those insights to make informed decisions and predictions.

Data science is about using data in order to gain new insights that you would otherwise have missed.

As an example, using data science, clinics can identify patients who are likely to not show up for an appointment. This can help improve margins, and providers can give other patients available slots.

That's why data science won't replace the human brain, but complement it, working alongside it. Data science should not be thought of as an end-all solution to our data woes; it is merely an opinion—a very informed opinion, but an opinion nonetheless. It deserves a seat at the table.

Why data science?

In this Data Age, it's clear that we have a surplus of data. But why should that necessitate an entirely new set of vocabulary? What was wrong with our previous forms of analysis? For one, the sheer volume of data makes it literally impossible for a human to parse it in a reasonable time frame. Data is collected in various forms and from different sources, and often comes in a very unorganized format.

Data can be missing, incomplete, or just flat-out wrong. Oftentimes, we will have data on very different scales, and that makes it tough to compare. Say that we are looking at data related to pricing used cars. One characteristic of a car is the year it was made, and another might be the number of miles on that car. Once we clean our data (which we will spend a great deal of time looking at in this book), the relationships between the data become more obvious, and the knowledge that was once buried deep in millions of rows of data simply pops out. One of the main goals of data science is to make explicit the practices and procedures used to discover and apply these relationships in the data.
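As a sketch of the scale problem, the following puts a car's model year and mileage on a comparable scale; the numbers are made up, and standardization is just one common approach:

import pandas as pd

# Invented used-car data: two characteristics on wildly different scales.
cars = pd.DataFrame({
    "year": [2009, 2012, 2015, 2018],
    "miles": [130000, 85000, 42000, 19000]
})

# Standardize each column: subtract its mean and divide by its standard deviation.
standardized = (cars - cars.mean()) / cars.std()
print(standardized)   # both columns now vary on the same, comparable scale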

Earlier, we looked at data science in a more historical perspective, but let's take a minute to discuss its role in business today using a very simple example.

Example – xyz123 Technologies

Ben Runkle, the CEO of xyz123 Technologies, is trying to solve a huge problem. The company is consistently losing long-time customers. He does not know why they are leaving, but he must do something fast. He is convinced that in order to reduce customer churn, he must create new products and features and consolidate existing technologies. To be safe, he calls in his chief data scientist, Dr. Hughan. However, she is not convinced that new products and features alone will save the company. Instead, she turns to the transcripts of recent customer service tickets. She shows Ben the most recent transcripts and finds something surprising:

".... Not sure how to export this; are you?""Where is the button that makes a new list?""Wait, do you even know where the slider is?""If I can't figure this out today, it's a real problem..."

It is clear that customers were having problems with the existing UI/UX, and weren't upset because of a lack of features. Runkle and Hughan organized a mass UI/UX overhaul and their sales have never been better.
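In code, Dr. Hughan's first pass could be as simple as counting complaint keywords across the tickets. The sketch below is a reconstruction of that idea using the quotes above; the keyword list is invented for illustration:

from collections import Counter

# The four transcript snippets quoted above.
tickets = [
    "Not sure how to export this; are you?",
    "Where is the button that makes a new list?",
    "Wait, do you even know where the slider is?",
    "If I can't figure this out today, it's a real problem...",
]

# Hypothetical UI-related keywords to look for.
ui_keywords = ["export", "button", "slider", "figure this out"]

counts = Counter()
for ticket in tickets:
    for keyword in ui_keywords:
        if keyword in ticket.lower():
            counts[keyword] += 1

print(counts)  # frequent UI terms suggest a usability problem, not missing features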

Of course, the science used in the last example was minimal, but it makes a point. We tend to call people like Runkle drivers. Today's common stick-to-your-gut CEO wants to make all decisions quickly and iterate over solutions until something works. Dr. Hughan is much more analytical. She wants to solve the problem just as much as Runkle, but she turns to user-generated data instead of her gut feeling for answers. Data science is about applying the skills of the analytical mind and using them as a driver would.

Both of these mentalities have their place in today's enterprises; however, it is Hughan's way of thinking that dominates the ideas of data science—using data generated by the company as her source of information, rather than just picking up a solution and going with it.

The data science Venn diagram

It is a common misconception that only those with a PhD or geniuses can understand the math/programming behind data science. This is absolutely false. Understanding data science begins with three basic areas:

Math/statistics: This is the use of equations and formulas to perform analysis.
Computer programming: This is the ability to use code to create outcomes on a computer.
Domain knowledge: This refers to understanding the problem domain (medicine, finance, social science, and so on).

The following Venn diagram provides a visual representation of how these three areas of data science intersect:

The Venn diagram of data science

Those with hacking skills can conceptualize and program complicated algorithms using computer languages. Having a math and statistics background allows you to theorize and evaluate algorithms and tweak the existing procedures to fit specific situations. Having substantive expertise (domain expertise) allows you to apply concepts and results in a meaningful and effective way.

While having only two of these three qualities can make you intelligent, it will also leave a gap. Let's say that you are very skilled in coding and have formal training in day trading. You might create an automated system to trade in your place, but lack the math skills to evaluate your algorithms. This will mean that you end up losing money in the long run. It is only when you boost your skills in coding, math, and domain knowledge that you can truly perform data science.

The quality that was probably a surprise for you was domain knowledge. It is really just knowledge of the area you are working in. If a financial analyst started analyzing data about heart attacks, they might need the help of a cardiologist to make sense of a lot of the numbers.

Data science is the intersection of the three key areas mentioned earlier. In order to gain knowledge from data, we must be able to utilize computer programming to access the data, understand the mathematics behind the models we derive, and, above all, understand our analyses' place in the domain we are in. This includes the presentation of data. If we are creating a model to predict heart attacks in patients, is it better to create a PDF of information, or an app where you can type in numbers and get a quick prediction? All these decisions must be made by the data scientist.

Note

The intersection of math and coding is machine learning. This book will look at machine learning in great detail later on, but it is important to note that without the explicit ability to generalize any models or results to a domain, machine learning algorithms remain just that: algorithms sitting on your computer. You might have the best algorithm to predict cancer; you might be able to predict cancer with over 99% accuracy based on past cancer patient data, but if you don't understand how to apply this model in a practical sense so that doctors and nurses can easily use it, your model might be useless.

Both computer programming and math are covered extensively in this book. Domain knowledge comes with both the practice of data science and reading examples of other people's analyses.

The math

Most people stop listening once someone says the word "math." They'll nod along in an attempt to hide their utter disdain for the topic. This book will guide you through the math needed for data science, specifically statistics and probability. We will use these subdomains of mathematics to create what are called models.

A data model refers to an organized and formal relationship between elements of data, usually meant to simulate a real-world phenomenon.

Essentially, we will use math in order to formalize relationships between variables. As a former pure mathematician and current math teacher, I know how difficult this can be. I will do my best to explain everything as clearly as I can. Between the three areas of data science, math is what allows us to move from domain to domain. Understanding the theory allows us to apply a model that we built for the fashion industry to a financial domain.

The math covered in this book ranges from basic algebra to advanced probabilistic and statistical modeling. Do not skip over these chapters, even if you already know these topics or you're afraid of them. Every mathematical concept that I will introduce will be introduced with care and purpose, using examples. The math in this book is essential for data scientists.

Example – spawner-recruit models

In biology, we use, among many other models, a model known as the spawner-recruit model to judge the biological health of a species. It describes a basic relationship between the number of healthy parental units of a species and the number of new units in the group of animals. Using a public dataset of the numbers of salmon spawners and recruits, the graph further down (titled The spawner-recruit model visualized) was created to visualize the relationship between the two. We can see that there definitely is some sort of positive relationship (as one goes up, so does the other). But how can we formalize this relationship? For example, if we knew the number of spawners in a population, could we predict the number of recruits that the group would obtain, and vice versa?

Essentially, models allow us to plug in one variable to get the other. In this example, if we knew that a group of salmon had 1.15 (in thousands) spawners, the model would give us a predicted number of recruits. This kind of result can be very useful for estimating how the health of a population is changing. If we can create these models, we can visually observe how the relationship between the two variables changes.
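As a sketch of what "plugging in one variable to get the other" might look like in code, here is a least-squares fit of a log-linear relationship. The five data points and the log-linear form are assumptions made for illustration, not the book's actual dataset or model:

import numpy as np

# Invented spawner/recruit pairs, both in thousands.
spawners = np.array([1.06, 1.15, 3.00, 3.94, 4.09])
recruits = np.array([3.06, 3.18, 4.61, 4.67, 4.70])

# Fit recruits = a * log(spawners) + b by least squares.
a, b = np.polyfit(np.log(spawners), recruits, deg=1)

# Plug in a spawner count to predict the number of recruits.
print(a * np.log(1.15) + b)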

There are many types of data models, including probabilistic and statistical models. Both of these are subsets of a larger paradigm, called machine learning. The essential idea behind these three topics is that we use data in order to come up with the best model possible. We no longer rely on human instincts—rather, we rely on data, such as that displayed in the following graph:

The spawner-recruit model visualized

The purpose of this example is to show how we can define relationships between data elements using mathematical equations. The fact that I used salmon health data was irrelevant! Throughout this book, we will look at relationships involving marketing dollars, sentiment data, restaurant reviews, and much more. The main reason for this is that I would like you (the reader) to be exposed to as many domains as possible.

Math and coding are vehicles that allow data scientists to step back and apply their skills virtually anywhere.

Computer programming

Let's be honest: you probably think computer science is way cooler than math. That's OK; I don't blame you. The news isn't filled with math stories the way it is filled with news on technology. You don't turn on the TV to see a new theory on primes; rather, you will see investigative reports on how the latest smartphone can take better photos of cats, or something. Computer languages are how we communicate with machines and tell them to do our bidding. Just as a book can be written in many languages, data science can be done in many languages as well. Python, Julia, and R are some of the many languages available to us. This book will focus exclusively on using Python.

Some more terminology

This is a good time to define some more vocabulary. By this point, you're probably excitedly looking up a lot of data science material and seeing words and phrases I haven't used yet. Here are some common terms that you are likely to encounter.

Machine learning: This refers to giving computers the ability to learn from data without explicit "rules" being given by a programmer. We saw the concept of machine learning earlier in this chapter as the intersection of coding and math skills; here, we are attempting to formalize this definition. Machine learning combines the power of computers with intelligent learning algorithms in order to automate the discovery of relationships in data and create powerful data models. Speaking of data models, in this book, we will concern ourselves with the following two basic types of data model:
Probabilistic model: This refers to using probability to find a relationship between elements that includes a degree of randomness.
Statistical model: This refers to taking advantage of statistical theorems to formalize relationships between data elements in a (usually) simple mathematical formula.

Note

While both the statistical and probabilistic models can be run on computers and might be considered machine learning in that regard, we will keep these definitions separate, since machine learning algorithms generally attempt to learn relationships in different ways. We will take a look at the statistical and probabilistic models in later chapters.

Exploratory data analysis (EDA): This refers to preparing data in order to standardize results and gain quick insights. EDA is concerned with data visualization and preparation. This is where we turn unorganized data into organized data and also clean up missing/incorrect data points. During EDA, we will create many types of plots and use these plots to identify key features and relationships to exploit in our data models (a minimal sketch follows this list).
Data mining: This is the process of finding relationships between elements of data. Data mining is the part of data science where we try to find relationships between variables (think of the spawner-recruit model).
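Here is that minimal EDA sketch; the tiny DataFrame is invented for illustration, and the plot call assumes matplotlib is installed alongside pandas:

import pandas as pd
import matplotlib.pyplot as plt

# An invented dataset with missing values, standing in for messy real data.
df = pd.DataFrame({
    "price": [9.99, None, 14.50, 12.00],
    "rating": [4.0, 5.0, None, 3.0]
})

print(df.describe())        # quick numeric summary
print(df.isnull().sum())    # how many values are missing per column
df = df.fillna(df.mean())   # one simple cleanup strategy: fill gaps with column means
df.plot(kind="scatter", x="price", y="rating")  # look for a candidate relationship
plt.show()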

I have tried pretty hard not to use the term big data up until now. This is because I think this term is misused, a lot. Big data is data that is too large to be processed by a single machine (if your laptop crashed, it might be suffering from a case of big data).

The following diagram shows the relationship between these data science concepts:

The state of data science (so far)

The preceding diagram is incomplete and is meant for visualization purposes only.

Summary

At the beginning of this chapter, I posed a simple question: what's the catch of data science? Well, there is one. It isn't all fun, games, and modeling. There must be a price for our quest to create ever-smarter machines and algorithms. As we seek new and innovative ways to discover data trends, a beast lurks in the shadows. I'm not talking about the learning curve of mathematics or programming, nor am I referring to the surplus of data. The Industrial Age left us with an ongoing battle against pollution. The subsequent Information Age left behind a trail of big data. So, what dangers might the Data Age bring us?

The Data Age can lead to something much more sinister: the dehumanization of the individual through mass data.

More and more people are jumping head-first into the field of data science, most with no prior experience in math or CS, which, on the surface, is great. The average data scientist has access to millions of dating profiles, tweets, online reviews, and much more to jump-start their education.