Principles of Data Science. E-Book

Sunil Kakade

29,99 €

Publisher: Packt Publishing
Category: Science and New Technologies
Language: English

Description

Need to turn programming skills into effective data science skills? This book helps you connect mathematics, programming, and business analysis. You’ll feel confident asking—and answering—complex, sophisticated questions of your data, making abstract and raw statistics into actionable ideas.
Going through the data science pipeline, you'll clean and prepare data and learn effective data mining strategies and techniques to gain a comprehensive view of how the data science puzzle fits together. You’ll learn fundamentals of computational mathematics and statistics and pseudo-code used by data scientists and analysts. You’ll learn machine learning, discovering statistical models that help control and navigate even the densest datasets, and learn powerful visualizations that communicate what your data means.

Details

You can read the e-book in the Legimi apps or in any app that supports the following formats:

EPUB

MOBI

Number of pages: 519

Year of publication: 2018

Sample

Principles of Data Science - Second Edition

Why subscribe?

PacktPub.com

Contributors

About the authors

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

1. How to Sound Like a Data Scientist

What is data science?

Basic terminology

Why data science?

Example – xyz123 Technologies

The data science Venn diagram

The math

Example – spawner-recruit models

Computer programming

Why Python?

Python practices

Example of basic Python

Example – parsing a single tweet

Domain knowledge

Some more terminology

Data science case studies

Case study – automating government paper pushing

Fire all humans, right?

Case study – marketing dollars

Case study – what's in a job description?

Summary

2. Types of Data

Flavors of data

Why look at these distinctions?

Structured versus unstructured data

Example of data pre-processing

Word/phrase counts

Presence of certain special characters

The relative length of text

Picking out topics

Quantitative versus qualitative data

Example – coffee shop data

Example – world alcohol consumption data

Digging deeper

The road thus far

The four levels of data

The nominal level

Mathematical operations allowed

Measures of center

What data is like at the nominal level

The ordinal level

Examples

Mathematical operations allowed

Measures of center

Quick recap and check

The interval level

Example

Mathematical operations allowed

Measures of center

Measures of variation

Standard deviation

The ratio level

Examples

Measures of center

Problems with the ratio level

Data is in the eye of the beholder

Summary

Answers

3. The Five Steps of Data Science

Introduction to data science

Overview of the five steps

Asking an interesting question

Obtaining the data

Exploring the data

Modeling the data

Communicating and visualizing the results

Exploring the data

Basic questions for data exploration

Dataset 1 – Yelp

DataFrames

Series

Exploration tips for qualitative data

Nominal level columns

Filtering in pandas

Ordinal level columns

Dataset 2 – Titanic

Summary

4. Basic Mathematics

Mathematics as a discipline

Basic symbols and terminology

Vectors and matrices

Quick exercises

Answers

Arithmetic symbols

Summation

Proportional

Dot product

Graphs

Logarithms/exponents

Set theory

Linear algebra

Matrix multiplication

How to multiply matrices

Summary

5. Impossible or Improbable - A Gentle Introduction to Probability

Basic definitions

Probability

Bayesian versus Frequentist

Frequentist approach

The law of large numbers

Compound events

Conditional probability

The rules of probability

The addition rule

Mutual exclusivity

The multiplication rule

Independence

Complementary events

A bit deeper

Summary

6. Advanced Probability

Collectively exhaustive events

Bayesian ideas revisited

Bayes' theorem

More applications of Bayes' theorem

Example – Titanic

Example – medical studies

Random variables

Discrete random variables

Types of discrete random variables

Binomial random variables

Geometric random variables

Poisson random variable

Continuous random variables

Summary

7. Basic Statistics

What are statistics?

How do we obtain and sample data?

Obtaining data

Observational

Experimental

Sampling data

Probability sampling

Random sampling

Unequal probability sampling

How do we measure statistics?

Measures of center

Measures of variation

Definition

Example – employee salaries

Measures of relative standing

The insightful part – correlations in data

The empirical rule

Summary

8. Advanced Statistics

Point estimates

Sampling distributions

Confidence intervals

Hypothesis tests

Conducting a hypothesis test

One sample t-tests

Example of a one-sample t-test

Assumptions of the one-sample t-test

Type I and type II errors

Hypothesis testing for categorical variables

Chi-square goodness of fit test

Assumptions of the chi-square goodness of fit test

Example of a chi-square test for goodness of fit

Chi-square test for association/independence

Assumptions of the chi-square independence test

Summary

9. Communicating Data

Why does communication matter?

Identifying effective and ineffective visualizations

Scatter plots

Line graphs

Bar charts

Histograms

Box plots

When graphs and statistics lie

Correlation versus causation

Simpson's paradox

If correlation doesn't imply causation, then what does?

Verbal communication

It's about telling a story

On the more formal side of things

The why/how/what strategy of presenting

Summary

10. How to Tell If Your Toaster Is Learning – Machine Learning Essentials

What is machine learning?

Example – facial recognition

Machine learning isn't perfect

How does machine learning work?

Types of machine learning

Supervised learning

Example – heart attack prediction

It's not only about predictions

Types of supervised learning

Regression

Classification

Data is in the eyes of the beholder

Unsupervised learning

Reinforcement learning

Overview of the types of machine learning

How does statistical modeling fit into all of this?

Linear regression

Adding more predictors

Regression metrics

Logistic regression

Probability, odds, and log odds

The math of logistic regression

Dummy variables

Summary

11. Predictions Don't Grow on Trees - or Do They?

Naive Bayes classification

Decision trees

How does a computer build a regression tree?

How does a computer fit a classification tree?

Unsupervised learning

When to use unsupervised learning

k-means clustering

Illustrative example – data points

Illustrative example – beer!

Choosing an optimal number for K and cluster validation

The Silhouette Coefficient

Feature extraction and principal component analysis

Summary

12. Beyond the Essentials

The bias/variance trade-off

Errors due to bias

Error due to variance

Example – comparing body and brain weight of mammals

Two extreme cases of bias/variance trade-off

Underfitting

Overfitting

How bias/variance play into error functions

K folds cross-validation

Grid searching

Visualizing training error versus cross-validation error

Ensembling techniques

Random forests

Comparing random forests with decision trees

Neural networks

Basic structure

Summary

13. Case Studies

Case study 1 – Predicting stock prices based on social media

Text sentiment analysis

Exploratory data analysis

Regression route

Classification route

Going beyond with this example

Case study 2 – Why do some people cheat on their spouses?

Case study 3 – Using TensorFlow

TensorFlow and neural networks

Summary

14. Building Machine Learning Models with Azure Databricks and Azure Machine Learning service

Technical requirements

Technologies for machine learning projects

Apache Spark

Data management in Apache Spark

Databricks and Azure Databricks

MLlib

Configuring Azure Databricks

Creating an Azure Databricks cluster

Training a text classifier with Azure Databricks

Loading data into Azure Databricks

Reading and prepping our dataset

Feature engineering

Tokenizers

StopWordsRemover

TF-IDF

Model training and testing

Exporting the model

Azure Machine Learning

Creating an Azure Machine Learning workspace

Azure Machine Learning SDK for Python

Integrating Azure Databricks and Azure Machine Learning

Programmatically create a new Azure Machine Learning workspace

SMS spam classifier on Azure Machine Learning

Experimenting with and selecting the best model

Deploying to Azure Container Instances

Testing our RESTful intelligent web service

Summary

Other Books You May Enjoy

Leave a review – let other readers know what you think

Index

Principles of Data Science - Second Edition

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Amey Varangoankar

Acquisition Editor: Dayne Castelino

Content Development Editors: Chris D'cruz

Technical Editor: Sneha Hanchate

Copy Editor: Safis Editing

Project Coordinator: Namarata Swetta

Proofreader: Safis Editing

Indexers: Pratik Shirodkar

Graphics: Tom Scaria

Production Coordinator: Nilesh Mohite

First published: December 2016

Second edition: December 2018

Production reference: 1141218

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78980-454-6

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Learn better with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the authors

Sinan Ozdemir is a data scientist, start-up founder, and educator living in the San Francisco Bay Area. He studied pure mathematics at Johns Hopkins University. He then spent several years conducting lectures on data science at Johns Hopkins University before founding his own start-up, Kylie.ai, which uses artificial intelligence to clone brand personalities and automate customer service communications.

Sinan is also the author of Principles of Data Science, First Edition, available through Packt.

Sunil Kakade is a technologist, educator, and senior leader with expertise in creating data- and AI-driven organizations. He is on the adjunct faculty at Northwestern University, Evanston, IL, where he teaches graduate courses in data science and big data. He has several research papers to his credit and has presented his work on big data applications at reputable conferences. He holds US patents in the areas of big data and retail processes. He is passionate about applying data science to improve business outcomes and save patients' lives. At present, Sunil leads the information architecture and analytics team for a large healthcare organization focused on improving healthcare outcomes. He lives with his wife, Pratibha, and daughter, Preeti, in Scottsdale, Arizona.

I would like to thank my mother, Subhadra; wife, Pratibha; and daughter, Preeti, for supporting me during my education and career and for supporting my passion for learning. Many thanks to my mentors, Prof. Faisal Akkawi, Northwestern University; Bill Guise, Sr. Director; Dr. Joseph Colorafi, CMIO; and Deanna Wise, CIO, at Dignity Health, for supporting my passion for big data, data science, and artificial intelligence. Special thanks to Sinan Ozdemir and Packt Publishing for giving me the opportunity to co-author this book. I appreciate the incredible support of my team at Dignity Health Insights in my journey in data science. Finally, I'd like to thank my friend, Anand Deshpande, who inspired me to take on this project.

Marco Tibaldeschi, born in 1983, holds a Master's degree in informatics engineering and has actively worked on the web since 1994. Thanks to the fact that he is the fourth of four brothers, he has always had a foot in the future. In 1998, he registered his first domain, which was one of the first virtual web communities in Italy. Because of this, he has been interviewed by various national newspapers and radio stations, and the University of Pisa has written a research book in order to understand the social phenomenon. In 2003, he founded DBN Communication, a web consulting company that owns and develops eDock, a SaaS that helps sellers to manage their inventories and orders on the biggest marketplaces in the world (like Amazon and eBay).

ACKNOWLEDGMENTS

I'd like to thank my wife Giulia, because with her and her support everything seems and becomes possible. Without her help and her love, I'd be a different man and, I'm sure, a worse one. I'd also like to thank Nelson Morris and Chris D'cruz from Packt for this opportunity and for their continuous support.

About the reviewers

Oleg Okun got his PhD from the Institute of Engineering Cybernetics, National Academy of Sciences (Minsk, Belarus) in 1996. Since 1998 he has worked abroad, doing both academic research (in Belarus and Finland) and industrial research (in Sweden and Germany). His research experience includes document image analysis, cancer prediction by analyzing gene expression profiles (bioinformatics), fingerprint verification and identification (biometrics), online and offline marketing analytics, credit scoring (microfinance), and text search and summarization (natural language processing). He has 80 publications, including one IGI Global-published book and three co-edited books published by Springer-Verlag, as well as book chapters, journal articles, and numerous conference papers. He has also been a reviewer of several books published by Packt Publishing.

Jared James Thompson, PhD, is a graduate of Purdue University and has held both academic and industrial appointments teaching programming, algorithms, and big data technology. He is a machine learning enthusiast and has a particular love of optimization. Jared is currently employed as a machine learning engineer at Atomwise, a start-up that leverages artificial intelligence to design better drugs faster.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Preface

The topic of this book is data science, which is a field of study that has been growing rapidly for the past few decades. Today, more companies than ever before are investing in big data and data science to improve business performance, drive innovation, and create new revenue streams by building data products. According to LinkedIn's 2017 US Emerging Jobs Report, machine learning engineer, data scientist, and big data engineer rank among the top emerging jobs, and companies in a wide range of industries are seeking people with the requisite skills for those roles.

We will dive into topics from all three areas of data science (mathematics, computer programming, and domain knowledge) and solve complex problems. We will clean, explore, and analyze data in order to derive scientific and accurate conclusions. Machine learning and deep learning techniques will be applied to solve complex data tasks.

Who this book is for

This book is for people who are looking to understand and utilize the basic practices of data science for any domain. The reader should be fairly well acquainted with basic mathematics (algebra, and perhaps probability) and should feel comfortable reading snippets in R/Python as well as pseudo code. The reader is not expected to have worked in a data field; however, they should have the urge to learn and apply the techniques put forth in this book to either their own datasets or those provided to them.

What this book covers

Chapter 1, How to Sound Like a Data Scientist, introduces the basic terminology used by data scientists and looks at the types of problem we will be solving throughout this book.

Chapter 2, Types of Data, looks at the different levels and types of data out there and shows how to manipulate each type. This chapter will begin to deal with the mathematics needed for data science.

Chapter 3, The Five Steps of Data Science, uncovers the five basic steps of performing data science, including data manipulation and cleaning, and shows examples of each step in detail.

Chapter 4, Basic Mathematics, explains the basic mathematical principles that guide the actions of data scientists by presenting and solving examples in calculus, linear algebra, and more.

Chapter 5, Impossible or Improbable - A Gentle Introduction to Probability, is a beginner's guide to probability theory and how it is used to gain an understanding of our random universe.

Chapter 6, Advanced Probability, uses principles from the previous chapter and introduces and applies theorems, such as Bayes' Theorem, in the hope of uncovering the hidden meaning in our world.

Chapter 7, Basic Statistics, deals with the types of problem that statistical inference attempts to explain, using the basics of experimentation, normalization, and random sampling.

Chapter 8, Advanced Statistics, uses hypothesis testing and confidence intervals to gain insight from our experiments. Being able to pick which test is appropriate and how to interpret p-values and other results is very important as well.

Chapter 9, Communicating Data, explains how correlation and causation affect our interpretation of data. We will also be using visualizations in order to share our results with the world.

Chapter 10, How to Tell Whether Your Toaster Is Learning – Machine Learning Essentials, focuses on the definition of machine learning and looks at real-life examples of how and when machine learning is applied. A basic understanding of the relevance of model evaluation is introduced.

Chapter 11, Predictions Don't Grow on Trees, or Do They?, looks at more complicated machine learning models, such as decision trees and Bayesian predictions, in order to solve more complex data-related tasks.

Chapter 12, Beyond the Essentials, introduces some of the mysterious forces guiding data science, including bias and variance. Neural networks are introduced as a modern deep learning technique.

Chapter 13, Case Studies, uses an array of case studies in order to solidify the ideas of data science. We will be following the entire data science workflow from start to finish multiple times for different examples, including stock price prediction and handwriting detection.

Chapter 14, Microsoft Databricks Case Studies, will harness the power of the Microsoft data environment as well as Apache Spark to put our machine learning in high gear. This chapter makes use of parallelization and advanced visualization software to get the most out of our data.

Chapter 15, Building Machine Learning Models with Azure Databricks and Azure ML, looks at the different technologies that a data scientist can use on Microsoft Azure Platform, which help in managing big data projects without having to worry about infrastructure and computing power.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at <[email protected]>.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Chapter 1. How to Sound Like a Data Scientist

No matter which industry you work in—IT, fashion, food, or finance—there is no doubt that data affects your life and work. At some point this week, you will either have or hear a conversation about data. News outlets are covering more and more stories about data leaks, cybercrimes, and how data can give us a glimpse into our lives. But why now? What makes this era such a hotbed of data-related industries?

In the nineteenth century, the world was in the grip of the Industrial Age. Mankind was exploring its place in the industrial world, working with giant mechanical inventions. Captains of industry, such as Henry Ford, recognized that using these machines could open major market opportunities, enabling industries to achieve previously unimaginable profits. Of course, the Industrial Age had its pros and cons. While mass production placed goods in the hands of more consumers, our battle with pollution also began at around this time.

By the twentieth century, we were quite skilled at making huge machines; the goal now was to make them smaller and faster. The Industrial Age was over and was replaced by what we now refer to as the Information Age. We started using machines to gather and store information (data) about ourselves and our environment for the purpose of understanding our universe.

Beginning in the 1940s, machines such as ENIAC (considered one of the first—if not the first—computers) were computing math equations and running models and simulations like never before. The following photograph shows ENIAC:

ENIAC—The world's first electronic digital computer (Ref: http://ftp.arl.mil/ftp/historic-computers/)

We finally had a decent lab assistant who could run the numbers better than we could! As with the Industrial Age, the Information Age brought us both the good and the bad. The good was the extraordinary works of technology, including mobile phones and televisions. The bad was not as bad as worldwide pollution, but still left us with a problem in the twenty-first century—so much data.

That's right: the Information Age, in its quest to procure data, has caused an explosion in the production of electronic data. Estimates show that we created about 1.8 trillion gigabytes of data in 2011 (take a moment to just think about how much that is). Just one year later, in 2012, we created over 2.8 trillion gigabytes of data! This number is expected to keep growing, reaching an estimated 40 trillion gigabytes of data created in a single year by 2020. People contribute to this every time they tweet, post on Facebook, save a new resume in Microsoft Word, or just send their mom a picture by text message.

Not only are we creating data at an unprecedented rate, but we are also consuming it at an accelerated pace. Just five years ago, in 2013, the average cell phone user used under 1 GB of data a month. Today, that number is estimated to be well over 2 GB a month. We aren't just looking for the next personality quiz; what we are looking for is insight. With all of this data out there, some of it has to be useful to me! And it can be!

So we, in the twenty-first century, are left with a problem. We have so much data and we keep making more. We have built insanely tiny machines that collect data 24/7, and it's our job to make sense of it all. Enter the Data Age. This is the age when we take machines dreamed up by our nineteenth-century ancestors and the data created by our twentieth-century counterparts and create insights and sources of knowledge that every human on Earth can benefit from. The United States government created an entirely new role, that of chief data scientist. Many companies are now investing in data science departments and hiring data scientists. The benefit is quite obvious: using data to make accurate predictions and simulations gives us insight into our world like never before.

Sounds great, but what's the catch?

This chapter will explore the terminology and vocabulary of the modern data scientist. We will learn keywords and phrases that will be essential in our discussion of data science throughout this book. We will also learn why we use data science and learn about the three key domains that data science is derived from before we begin to look at the code in Python, the primary language used in this book. This chapter will cover the following topics:

The basic terminology of data science

The three domains of data science

The basic Python syntax

What is data science?

Before we go any further, let's look at some basic definitions that we will use throughout this book. The great/awful thing about this field is that it is so young that these definitions can differ from textbook to newspaper to whitepaper.

Basic terminology

The definitions that follow are general enough to be used in daily conversations, and work to serve the purpose of this book, an introduction to the principles of data science.

Let's start by defining what data is. This might seem like a silly first definition to look at, but it is very important. Whenever we use the word "data," we refer to a collection of information in either an organized or unorganized format. These formats have the following qualities:

Organized data: This refers to data that is sorted into a row/column structure, where every row represents a single observation and the columns represent the characteristics of that observation.

Unorganized data: This is the type of data that is in a free form, usually text or raw audio/signals that must be parsed further to become organized.

Whenever you open Excel (or any other spreadsheet program), you are looking at a blank row/column structure waiting for organized data. These programs don't do well with unorganized data. For the most part, we will deal with organized data as it is the easiest to glean insights from, but we will not shy away from looking at raw text and methods of processing unorganized forms of data.
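To make the distinction concrete, here is a minimal sketch of turning unorganized text into organized rows. The messages, the `to_row` helper, and the chosen columns (`word_count`, `is_question`) are assumptions made up for illustration; they are not taken from the book's code.

```python
# Hypothetical unorganized data: free-form text, one message per string.
raw_messages = [
    "Where is the button that makes a new list?",
    "Export worked fine today",
    "Wait, do you even know where the slider is?",
]


def to_row(message):
    """Turn one free-form message into a row with named columns."""
    return {
        "text": message,
        "word_count": len(message.split()),            # whitespace-separated words
        "is_question": message.strip().endswith("?"),  # crude question check
    }


# Organized data: each row is one observation, each key is a characteristic.
organized = [to_row(m) for m in raw_messages]

for row in organized:
    print(row["word_count"], row["is_question"], row["text"])
```

Each dictionary plays the role of a spreadsheet row. A library such as pandas could hold the same rows in a DataFrame, but plain Python is enough to show the idea.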

Data science is the art and science of acquiring knowledge through data.

What a small definition for such a big topic, and rightfully so! Data science covers so many things that it would take pages to list it all out (I should know—I tried and got told to edit it down).

Data science is all about how we take data, use it to acquire knowledge, and then use that knowledge to do the following:

Make decisions

Predict the future

Understand the past/present

Create new industries/products

This book is all about the methods of data science, including how to process data, gather insights, and use those insights to make informed decisions and predictions.

Data science is about using data in order to gain new insights that you would otherwise have missed.

As an example, using data science, clinics can identify patients who are likely to not show up for an appointment. This can help improve margins, and providers can give other patients available slots.

That's why data science won't replace the human brain, but complement it, working alongside it. Data science should not be thought of as an end-all solution to our data woes; it is merely an opinion—a very informed opinion, but an opinion nonetheless. It deserves a seat at the table.

Why data science?

In this Data Age, it's clear that we have a surplus of data. But why should that necessitate an entirely new set of vocabulary? What was wrong with our previous forms of analysis? For one, the sheer volume of data makes it literally impossible for a human to parse it in a reasonable time frame. Data is collected in various forms and from different sources, and often comes in a very unorganized format.

Data can be missing, incomplete, or just flat out wrong. Oftentimes, we will have data on very different scales, and that makes it tough to compare it. Say that we are looking at data in relation to pricing used cars. One characteristic of a car is the year it was made, and another might be the number of miles on that car. Once we clean our data (which we will spend a great deal of time looking at in this book), the relationships between the data become more obvious, and the knowledge that was once buried deep in millions of rows of data simply pops out. One of the main goals of data science is to make explicit practices and procedures to discover and apply these relationships in the data.
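The used-car example can be sketched in a few lines. Year of manufacture and mileage live on very different scales, so one common way to compare them is to rescale each column to the range 0 to 1. The text does not say which technique to use; min-max scaling is an assumption here, and the car records and the `min_max_scale` helper are hypothetical.

```python
# Hypothetical used-car records; the values are invented for illustration.
cars = [
    {"year": 2008, "miles": 120000},
    {"year": 2013, "miles": 60000},
    {"year": 2018, "miles": 0},
]


def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled_year = min_max_scale([c["year"] for c in cars])
scaled_miles = min_max_scale([c["miles"] for c in cars])

for car, y, m in zip(cars, scaled_year, scaled_miles):
    print(car["year"], car["miles"], round(y, 2), round(m, 2))
```

After scaling, both columns run from 0 to 1, so a difference of 0.5 means "half the observed range" for either characteristic, which makes the two easier to compare side by side.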

Earlier, we looked at data science in a more historical perspective, but let's take a minute to discuss its role in business today using a very simple example.

Example – xyz123 Technologies

Ben Runkle, the CEO of xyz123 Technologies, is trying to solve a huge problem. The company is consistently losing long-time customers. He does not know why they are leaving, but he must do something fast. He is convinced that in order to reduce churn, he must create new products and features, and consolidate existing technologies. To be safe, he calls in his chief data scientist, Dr. Hughan. However, she is not convinced that new products and features alone will save the company. Instead, she turns to the transcripts of recent customer service tickets. She reads through the most recent transcripts, finds something surprising, and shows Ben:

".... Not sure how to export this; are you?"

"Where is the button that makes a new list?"

"Wait, do you even know where the slider is?"

"If I can't figure this out today, it's a real problem..."

It is clear that customers were having problems with the existing UI/UX, and weren't upset because of a lack of features. Runkle and Hughan organized a mass UI/UX overhaul and their sales have never been better.

Of course, the science used in the last example was minimal, but it makes a point. We tend to call people like Runkle drivers. Today's common stick-to-your-gut CEO wants to make all decisions quickly and iterate over solutions until something works. Dr. Hughan is much more analytical. She wants to solve the problem just as much as Runkle, but she turns to user-generated data instead of her gut feeling for answers. Data science is about applying the skills of the analytical mind and using them as a driver would.

Both of these mentalities have their place in today's enterprises; however, it is Hughan's way of thinking that dominates the ideas of data science—using data generated by the company as her source of information, rather than just picking up a solution and going with it.

The data science Venn diagram

It is a common misconception that only geniuses or those with a PhD can understand the math and programming behind data science. This is absolutely false. Understanding data science begins with three basic areas:

Math/statistics: This is the use of equations and formulas to perform analysis.
Computer programming: This is the ability to use code to create outcomes on a computer.
Domain knowledge: This refers to understanding the problem domain (medicine, finance, social science, and so on).

The following Venn diagram provides a visual representation of how these three areas of data science intersect:

The Venn diagram of data science

Those with hacking skills can conceptualize and program complicated algorithms using computer languages. Having a math and statistics background allows you to theorize and evaluate algorithms and tweak the existing procedures to fit specific situations. Having substantive expertise (domain expertise) allows you to apply concepts and results in a meaningful and effective way.

While having only two of these three qualities can make you intelligent, it will also leave a gap. Let's say that you are very skilled in coding and have formal training in day trading. You might create an automated system to trade in your place, but lack the math skills to evaluate your algorithms. This will mean that you end up losing money in the long run. It is only when you boost your skills in coding, math, and domain knowledge that you can truly perform data science.

The quality that was probably a surprise for you was domain knowledge. It is really just knowledge of the area you are working in. If a financial analyst started analyzing data about heart attacks, they might need the help of a cardiologist to make sense of a lot of the numbers.

Data science is the intersection of the three key areas mentioned earlier. In order to gain knowledge from data, we must be able to utilize computer programming to access the data, understand the mathematics behind the models we derive, and, above all, understand our analyses' place in the domain we are in. This includes the presentation of data. If we are creating a model to predict heart attacks in patients, is it better to create a PDF of information, or an app where you can type in numbers and get a quick prediction? All these decisions must be made by the data scientist.

Note

The intersection of math and coding is machine learning. This book will look at machine learning in great detail later on, but it is important to note that without the explicit ability to generalize any models or results to a domain, machine learning algorithms remain just that—algorithms sitting on your computer. You might have the best algorithm to predict cancer. You could be able to predict cancer with over 99% accuracy based on past cancer patient data, but if you don't understand how to apply this model in a practical sense so that doctors and nurses can easily use it, your model might be useless.

Both computer programming and math are covered extensively in this book. Domain knowledge comes with both the practice of data science and reading examples of other people's analyses.

The math

Most people stop listening once someone says the word "math." They'll nod along in an attempt to hide their utter disdain for the topic. This book will guide you through the math needed for data science, specifically statistics and probability. We will use these subdomains of mathematics to create what are called models.

A data model refers to an organized and formal relationship between elements of data, usually meant to simulate a real-world phenomenon.

Essentially, we will use math in order to formalize relationships between variables. As a former pure mathematician and current math teacher, I know how difficult this can be. I will do my best to explain everything as clearly as I can. Among the three areas of data science, math is what allows us to move from domain to domain. Understanding the theory allows us to apply a model that we built for the fashion industry to a financial domain.

The math covered in this book ranges from basic algebra to advanced probabilistic and statistical modeling. Do not skip over these chapters, even if you already know these topics or you're afraid of them. Every mathematical concept that I will introduce will be introduced with care and purpose, using examples. The math in this book is essential for data scientists.

Example – spawner-recruit models

In biology, we use, among many other models, a model known as the spawner-recruit model to judge the biological health of a species. It is a basic relationship between the number of healthy parental units of a species and the number of new units in the group of animals. In a public dataset of the number of salmon spawners and recruits, the graph further down (titled spawner-recruit model) was formed to visualize the relationship between the two. We can see that there definitely is some sort of positive relationship (as one goes up, so does the other). But how can we formalize this relationship? For example, if we knew the number of spawners in a population, could we predict the number of recruits that the group would obtain, and vice versa?

Essentially, models allow us to plug in one variable to get the other. Consider the following example [the model equation is missing from this excerpt].

In this example, let's say we knew that a group of salmon had 1.15 (in thousands) spawners. Then, we would have the following [the computed result is missing from this excerpt].

This result can be very beneficial for estimating how the health of a population is changing. If we can create these models, we can visually observe how the relationship between the two variables can change.
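As a rough illustration of plugging in one variable to get the other, the following sketch fits a straight line by ordinary least squares and then estimates the recruits for 1.15 (in thousands) spawners. The data points are made up, and the straight-line form is an assumption chosen for simplicity; neither is the dataset or the fitted model behind the graph discussed in this section.

```python
# Made-up (spawners, recruits) pairs, in thousands. NOT the book's dataset.
spawners = [0.5, 1.0, 1.5, 2.0, 2.5]
recruits = [1.2, 1.9, 3.1, 3.8, 5.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(spawners, recruits)


def predict_recruits(n_spawners):
    """Plug a spawner count (in thousands) into the fitted line."""
    return intercept + slope * n_spawners


estimate = predict_recruits(1.15)
print(f"recruits ~ {intercept:.2f} + {slope:.2f} * spawners")
print(f"estimated recruits for 1.15 thousand spawners: {estimate:.3f}")
```

The same fitted line can be rearranged to go in the other direction, estimating spawners from a known number of recruits, as long as the slope is not zero.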

There are many types of data models, including probabilistic and statistical models. Both of these are subsets of a larger paradigm, called machine learning. The essential idea behind these three topics is that we use data in order to come up with the best model possible. We no longer rely on human instincts—rather, we rely on data, such as that displayed in the following graph:

The spawner-recruit model visualized

The purpose of this example is to show how we can define relationships between data elements using mathematical equations. The fact that I used salmon health data was irrelevant! Throughout this book, we will look at relationships involving marketing dollars, sentiment data, restaurant reviews, and much more. The main reason for this is that I would like you (the reader) to be exposed to as many domains as possible.

Math and coding are vehicles that allow data scientists to step back and apply their skills virtually anywhere.

Computer programming

Let's be honest: you probably think computer science is way cooler than math. That's OK, I don't blame you. The news isn't filled with math news like it is with news on technology. You don't turn on the TV to see a new theory on primes; rather, you will see investigative reports on how the latest smartphone can take better photos of cats, or something. Computer languages are how we communicate with machines and tell them to do our bidding. Just as a book can be written in many languages, data science can be done in many programming languages. Python, Julia, and R are some of the many languages that are available to us. This book will focus exclusively on using Python.

Some more terminology

This is a good time to define some more vocabulary. By this point, you're probably excitedly looking up a lot of data science material and seeing words and phrases I haven't used yet. Here are some common terms that you are likely to encounter.

Machine learning: This refers to giving computers the ability to learn from data without explicit "rules" being given by a programmer. We have seen the concept of machine learning earlier in this chapter as the intersection of coding and math skills. Here, we are attempting to formalize this definition. Machine learning combines the power of computers with intelligent learning algorithms in order to automate the discovery of relationships in data and create powerful data models. Speaking of data models, in this book, we will concern ourselves with the following two basic types of data model:

Probabilistic model: This refers to using probability to find a relationship between elements that includes a degree of randomness.
Statistical model: This refers to taking advantage of statistical theorems to formalize relationships between data elements in a (usually) simple mathematical formula.

Note

While both the statistical and probabilistic models can be run on computers and might be considered machine learning in that regard, we will keep these definitions separate, since machine learning algorithms generally attempt to learn relationships in different ways. We will take a look at the statistical and probabilistic models in later chapters.

Exploratory data analysis (EDA): This refers to preparing data in order to standardize results and gain quick insights. EDA is concerned with data visualization and preparation. This is where we turn unorganized data into organized data and clean up missing/incorrect data points. During EDA, we will create many types of plots and use these plots to identify key features and relationships to exploit in our data models.
Data mining: This is the process of finding relationships between elements of data. Data mining is the part of data science where we try to find relationships between variables (think the spawner-recruit model).
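The clean-up side of EDA described above can be sketched in a few lines. This is only an illustration under stated assumptions: the mileage readings are invented, None stands for a missing value, a negative number stands for an obviously incorrect one, and filling with the median is just one possible choice.

```python
from statistics import mean, median

# Hypothetical raw mileage readings: None is missing, a negative value is invalid.
raw_miles = [120_000, None, 45_000, -1, 8_000]


def is_valid(reading):
    """A reading is usable if it is present and non-negative."""
    return reading is not None and reading >= 0


# Use the median of the valid readings as the fill value (an assumption).
valid = [m for m in raw_miles if is_valid(m)]
fill_value = median(valid)

# Replace every missing or invalid reading with the fill value.
cleaned = [m if is_valid(m) else fill_value for m in raw_miles]

# A quick numeric summary of the cleaned column.
summary = {
    "count": len(cleaned),
    "min": min(cleaned),
    "max": max(cleaned),
    "mean": mean(cleaned),
}
print(cleaned)
print(summary)
```

In practice, a summary like this is usually followed by plots of the cleaned column, which is where the quick insights tend to show up.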

I have tried pretty hard not to use the term big data up until now. This is because I think this term is misused, a lot. Big data is data that is too large to be processed by a single machine (if your laptop crashed, it might be suffering from a case of big data).

The following diagram shows the relationship between these data science concepts:

The state of data science (so far)

The preceding diagram is incomplete and is meant for visualization purposes only.

Summary

At the beginning of this chapter, I posed a simple question: what's the catch of data science? Well, there is one. It isn't all fun, games, and modeling. There must be a price for our quest to create ever-smarter machines and algorithms. As we seek new and innovative ways to discover data trends, a beast lurks in the shadows. I'm not talking about the learning curve of mathematics or programming, nor am I referring to the surplus of data. The Industrial Age left us with an ongoing battle against pollution. The subsequent Information Age left behind a trail of big data. So, what dangers might the Data Age bring us?

The Data Age can lead to something much more sinister — the dehumanization of the individual through mass data.

More and more people are jumping head-first into the field of data science, most with no prior experience of math or CS, which, on the surface, is great. Average data scientists have access to millions of dating profiles' data, tweets, online reviews, and much more in order to jump start their education.