Need to turn programming skills into effective data science skills? This book helps you connect mathematics, programming, and business analysis. You’ll feel confident asking—and answering—complex, sophisticated questions of your data, making abstract and raw statistics into actionable ideas.
Going through the data science pipeline, you'll clean and prepare data and learn effective data mining strategies and techniques to gain a comprehensive view of how the data science puzzle fits together. You’ll learn fundamentals of computational mathematics and statistics and pseudo-code used by data scientists and analysts. You’ll learn machine learning, discovering statistical models that help control and navigate even the densest datasets, and learn powerful visualizations that communicate what your data means.
Page count: 519
Year of publication: 2018
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangoankar
Acquisition Editor: Dayne Castelino
Content Development Editors: Chris D'cruz
Technical Editor: Sneha Hanchate
Copy Editor: Safis Editing
Project Coordinator: Namarata Swetta
Proofreader: Safis Editing
Indexers: Pratik Shirodkar
Graphics: Tom Scaria
Production Coordinator: Nilesh Mohite
First published: December 2016
Second edition: December 2018
Production reference: 1141218
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78980-454-6
www.packtpub.com
mapt.io
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Sinan Ozdemir is a data scientist, start-up founder, and educator living in the San Francisco Bay Area. He studied pure mathematics at Johns Hopkins University. He then spent several years conducting lectures on data science at Johns Hopkins University before founding his own start-up, Kylie.ai, which uses artificial intelligence to clone brand personalities and automate customer service communications.
Sinan is also the author of Principles of Data Science, First Edition available through Packt.
Sunil Kakade is a technologist, educator, and senior leader with expertise in creating data- and AI-driven organizations. He is on the adjunct faculty at Northwestern University, Evanston, IL, where he teaches graduate courses in data science and big data. He has several research papers to his credit and has presented his work on big data applications at reputable conferences. He holds US patents in the areas of big data and retail processes. He is passionate about applying data science to improve business outcomes and save patients' lives. At present, Sunil leads the information architecture and analytics team for a large healthcare organization focused on improving healthcare outcomes. He lives with his wife, Pratibha, and daughter, Preeti, in Scottsdale, Arizona.
I would like to thank my mother, Subhadra; my wife, Pratibha; and my daughter, Preeti, for supporting me during my education and career and for supporting my passion for learning. Many thanks to my mentors, Prof. Faisal Akkawi, Northwestern University; Bill Guise, Sr. Director; Dr. Joseph Colorafi, CMIO; and Deanna Wise, CIO at Dignity Health, for supporting my passion for big data, data science, and artificial intelligence. Special thanks to Sinan Ozdemir and Packt Publishing for giving me the opportunity to co-author this book. I appreciate the incredible support of my team at Dignity Health Insights in my journey in data science. Finally, I'd like to thank my friend, Anand Deshpande, who inspired me to take on this project.
Marco Tibaldeschi, born in 1983, holds a Master's degree in informatics engineering and has actively worked on the web since 1994. As the fourth of four brothers, he has always had a foot in the future. In 1998, he registered his first domain, which was one of the first virtual web communities in Italy. Because of this, he has been interviewed by various national newspapers and radio stations, and the University of Pisa has written a research book to understand the social phenomenon. In 2003, he founded DBN Communication, a web consulting company that owns and develops eDock, a SaaS that helps sellers manage their inventories and orders on the biggest marketplaces in the world (such as Amazon and eBay).
ACKNOWLEDGMENTS
I'd like to thank my wife Giulia, because with her and her support everything seems and becomes possible. Without her help and her love, I'd be a different man and, I'm sure, a worse one. I'd also like to thank Nelson Morris and Chris D'cruz from Packt for this opportunity and for their continuous support.
Oleg Okun got his PhD from the Institute of Engineering Cybernetics, National Academy of Sciences (Minsk, Belarus) in 1996. Since 1998 he has worked abroad, doing both academic research (in Belarus and Finland) and industrial research (in Sweden and Germany). His research experience includes document image analysis, cancer prediction by analyzing gene expression profiles (bioinformatics), fingerprint verification and identification (biometrics), online and offline marketing analytics, credit scoring (microfinance), and text search and summarization (natural language processing). He has 80 publications, including one IGI Global-published book and three co-edited books published by Springer-Verlag, as well as book chapters, journal articles, and numerous conference papers. He has also been a reviewer of several books published by Packt Publishing.
Jared James Thompson, PhD, is a graduate of Purdue University and has held both academic and industrial appointments teaching programming, algorithms, and big data technology. He is a machine learning enthusiast and has a particular love of optimization. Jared is currently employed as a machine learning engineer at Atomwise, a start-up that leverages artificial intelligence to design better drugs faster.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
The topic of this book is data science, which is a field of study that has been growing rapidly for the past few decades. Today, more companies than ever before are investing in big data and data science to improve business performance, drive innovation, and create new revenue streams by building data products. According to LinkedIn's 2017 US Emerging Jobs Report, machine learning engineer, data scientist, and big data engineer rank among the top emerging jobs, and companies in a wide range of industries are seeking people with the requisite skills for those roles.
We will dive into topics from all three areas and solve complex problems. We will clean, explore, and analyze data in order to derive scientific and accurate conclusions. Machine learning and deep learning techniques will be applied to solve complex data tasks.
This book is for people who are looking to understand and utilize the basic practices of data science for any domain. The reader should be fairly well acquainted with basic mathematics (algebra, and perhaps probability) and should feel comfortable reading snippets in R/Python as well as pseudo code. The reader is not expected to have worked in a data field; however, they should have the urge to learn and apply the techniques put forth in this book to either their own datasets or those provided to them.
Chapter 1, How to Sound Like a Data Scientist, introduces the basic terminology used by data scientists and looks at the types of problem we will be solving throughout this book.
Chapter 2, Types of Data, looks at the different levels and types of data out there and shows how to manipulate each type. This chapter will begin to deal with the mathematics needed for data science.
Chapter 3, The Five Steps of Data Science, uncovers the five basic steps of performing data science, including data manipulation and cleaning, and shows examples of each step in detail.
Chapter 4, Basic Mathematics, explains the basic mathematical principles that guide the actions of data scientists by presenting and solving examples in calculus, linear algebra, and more.
Chapter 5, Impossible or Improbable – A Gentle Introduction to Probability, is a beginner's guide to probability theory and how it is used to gain an understanding of our random universe.
Chapter 6, Advanced Probability, uses principles from the previous chapter and introduces and applies theorems, such as Bayes' Theorem, in the hope of uncovering the hidden meaning in our world.
Chapter 7, Basic Statistics, deals with the types of problem that statistical inference attempts to explain, using the basics of experimentation, normalization, and random sampling.
Chapter 8, Advanced Statistics, uses hypothesis testing and confidence intervals to gain insight from our experiments. Being able to pick which test is appropriate and how to interpret p-values and other results is very important as well.
Chapter 9, Communicating Data, explains how correlation and causation affect our interpretation of data. We will also be using visualizations in order to share our results with the world.
Chapter 10, How to Tell Whether Your Toaster Is Learning – Machine Learning Essentials, focuses on the definition of machine learning and looks at real-life examples of how and when machine learning is applied. A basic understanding of the relevance of model evaluation is introduced.
Chapter 11, Predictions Don't Grow on Trees, or Do They?, looks at more complicated machine learning models, such as decision trees and Bayesian predictions, in order to solve more complex data-related tasks.
Chapter 12, Beyond the Essentials, introduces some of the mysterious forces guiding data science, including bias and variance. Neural networks are introduced as a modern deep learning technique.
Chapter 13, Case Studies, uses an array of case studies in order to solidify the ideas of data science. We will be following the entire data science workflow from start to finish multiple times for different examples, including stock price prediction and handwriting detection.
Chapter 14, Microsoft Databricks Case Studies, will harness the power of the Microsoft data environment as well as Apache Spark to put our machine learning in high gear. This chapter makes use of parallelization and advanced visualization software to get the most out of our data.
Chapter 15, Building Machine Learning Models with Azure Databricks and Azure ML, looks at the different technologies that a data scientist can use on Microsoft Azure Platform, which help in managing big data projects without having to worry about infrastructure and computing power.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at <[email protected]>.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
No matter which industry you work in—IT, fashion, food, or finance—there is no doubt that data affects your life and work. At some point this week, you will either have or hear a conversation about data. News outlets are covering more and more stories about data leaks, cybercrimes, and how data can give us a glimpse into our lives. But why now? What makes this era such a hotbed of data-related industries?
In the nineteenth century, the world was in the grip of the Industrial Age. Mankind was exploring its place in the industrial world, working with giant mechanical inventions. Captains of industry, such as Henry Ford, recognized that using these machines could open major market opportunities, enabling industries to achieve previously unimaginable profits. Of course, the Industrial Age had its pros and cons. While mass production placed goods in the hands of more consumers, our battle with pollution also began at around this time.
By the twentieth century, we were quite skilled at making huge machines; the goal now was to make them smaller and faster. The Industrial Age was over and was replaced by what we now refer to as the Information Age. We started using machines to gather and store information (data) about ourselves and our environment for the purpose of understanding our universe.
Beginning in the 1940s, machines such as ENIAC (considered one of the first—if not the first—computers) were computing math equations and running models and simulations like never before. The following photograph shows ENIAC:
ENIAC—The world's first electronic digital computer (Ref: http://ftp.arl.mil/ftp/historic-computers/)
We finally had a decent lab assistant who could run the numbers better than we could! As with the Industrial Age, the Information Age brought us both the good and the bad. The good was the extraordinary works of technology, including mobile phones and televisions. The bad was not as bad as worldwide pollution, but still left us with a problem in the twenty-first century—so much data.
That's right: the Information Age, in its quest to procure data, has exploded the production of electronic data. Estimates show that we created about 1.8 trillion gigabytes of data in 2011 (take a moment to just think about how much that is). Just one year later, in 2012, we created over 2.8 trillion gigabytes of data! This number is expected to keep growing, reaching an estimated 40 trillion gigabytes of data created in a single year by 2020. People contribute to this every time they tweet, post on Facebook, save a new resume in Microsoft Word, or just send their mom a picture by text message.
Not only are we creating data at an unprecedented rate, but we are also consuming it at an accelerating pace. Just five years ago, in 2013, the average cell phone user used under 1 GB of data a month. Today, that number is estimated to be well over 2 GB a month. We aren't just looking for the next personality quiz; what we are looking for is insight. With all of this data out there, some of it has to be useful to me! And it can be!
So we, in the twenty-first century, are left with a problem. We have so much data and we keep making more. We have built insanely tiny machines that collect data 24/7, and it's our job to make sense of it all. Enter the Data Age. This is the age when we take machines dreamed up by our nineteenth-century ancestors and the data created by our twentieth-century counterparts and create insights and sources of knowledge that every human on Earth can benefit from. The United States government created an entirely new role: chief data scientist. Many companies are now investing in data science departments and hiring data scientists. The benefit is quite obvious: using data to make accurate predictions and simulations gives us insight into our world like never before.
Sounds great, but what's the catch?
This chapter will explore the terminology and vocabulary of the modern data scientist. We will learn keywords and phrases that will be essential in our discussion of data science throughout this book. We will also learn why we use data science and learn about the three key domains that data science is derived from before we begin to look at the code in Python, the primary language used in this book. This chapter will cover the following topics:
Before we go any further, let's look at some basic definitions that we will use throughout this book. The great/awful thing about this field is that it is so young that these definitions can differ from textbook to newspaper to whitepaper.
The definitions that follow are general enough to be used in daily conversations, and work to serve the purpose of this book, an introduction to the principles of data science.
Let's start by defining what data is. This might seem like a silly first definition to look at, but it is very important. Whenever we use the word "data," we refer to a collection of information in either an organized or unorganized format. These formats have the following qualities:
Whenever you open Excel (or any other spreadsheet program), you are looking at a blank row/column structure waiting for organized data. These programs don't do well with unorganized data. For the most part, we will deal with organized data as it is the easiest to glean insights from, but we will not shy away from looking at raw text and methods of processing unorganized forms of data.
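To make the distinction concrete, here is a minimal Python sketch. The records and the review text are invented for illustration and do not come from any dataset used in this book. The first part holds organized data as rows with named characteristics; the second part takes a piece of unorganized free-form text and parses it into an organized table of word counts. Counting words is only one simple way to impose structure on text, chosen here because it needs nothing beyond the standard library.

```python
from collections import Counter
import re

# Organized data: every row is one observation, every key is a characteristic.
# These records are hypothetical.
organized = [
    {"name": "Ana", "age": 34, "city": "Lisbon"},
    {"name": "Raj", "age": 41, "city": "Pune"},
]

# A row/column structure makes questions like "what is the average age?" easy.
average_age = sum(row["age"] for row in organized) / len(organized)

# Unorganized data: free-form text that must be parsed before analysis.
# This review is hypothetical.
unorganized = "Great app! The app crashes sometimes, but support was great."

# Lowercase the text, extract the alphabetic words, and count each one.
words = re.findall(r"[a-z]+", unorganized.lower())
word_counts = Counter(words)

print(average_age)
print(word_counts.most_common(2))
```

Once the text has been turned into counts, it can be treated like any other organized table, which is why so much early effort in a project goes into this kind of parsing.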
Data science is the art and science of acquiring knowledge through data.
What a small definition for such a big topic, and rightfully so! Data science covers so many things that it would take pages to list it all out (I should know—I tried and got told to edit it down).
Data science is all about how we take data, use it to acquire knowledge, and then use that knowledge to do the following:
This book is all about the methods of data science, including how to process data, gather insights, and use those insights to make informed decisions and predictions.
Data science is about using data in order to gain new insights that you would otherwise have missed.
As an example, using data science, clinics can identify patients who are likely to not show up for an appointment. This can help improve margins, and providers can give other patients available slots.
That's why data science won't replace the human brain, but complement it, working alongside it. Data science should not be thought of as an end-all solution to our data woes; it is merely an opinion—a very informed opinion, but an opinion nonetheless. It deserves a seat at the table.
In this Data Age, it's clear that we have a surplus of data. But why should that necessitate an entirely new set of vocabulary? What was wrong with our previous forms of analysis? For one, the sheer volume of data makes it literally impossible for a human to parse it in a reasonable time frame. Data is collected in various forms and from different sources, and often comes in a very unorganized format.
Data can be missing, incomplete, or just flat out wrong. Oftentimes, we will have data on very different scales, and that makes it tough to compare it. Say that we are looking at data in relation to pricing used cars. One characteristic of a car is the year it was made, and another might be the number of miles on that car. Once we clean our data (which we will spend a great deal of time looking at in this book), the relationships between the data become more obvious, and the knowledge that was once buried deep in millions of rows of data simply pops out. One of the main goals of data science is to make explicit practices and procedures to discover and apply these relationships in the data.
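The used-car example can be sketched in a few lines of Python. The four cars below are hypothetical, and min-max scaling is just one common way of putting features on a shared scale; it is used here as an illustration, not as the method this book prescribes. After scaling, both the year and the mileage fall between 0 and 1, so they can be compared directly.

```python
# Hypothetical used-car records: (year made, miles driven).
cars = [(2008, 120_000), (2012, 85_000), (2016, 40_000), (2018, 15_000)]


def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lowest) / span for v in values]


years = [year for year, _ in cars]
miles = [mileage for _, mileage in cars]

scaled_years = min_max_scale(years)
scaled_miles = min_max_scale(miles)

# Print each car next to its scaled year and scaled mileage.
for car, y, m in zip(cars, scaled_years, scaled_miles):
    print(car, round(y, 2), round(m, 2))
```

On the raw numbers, a difference of 10 years looks tiny next to a difference of 100,000 miles. On the scaled numbers, the oldest car and the highest-mileage car both sit at an extreme of the same 0-to-1 range, which makes the relationship between age and mileage much easier to see.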
Earlier, we looked at data science in a more historical perspective, but let's take a minute to discuss its role in business today using a very simple example.
Ben Runkle, the CEO of xyz123 Technologies, is trying to solve a huge problem. The company is consistently losing long-time customers. He does not know why they are leaving, but he must do something fast. He is convinced that in order to reduce his churn, he must create new products and features, and consolidate existing technologies. To be safe, he calls in his chief data scientist, Dr. Hughan. However, she is not convinced that new products and features alone will save the company. Instead, she turns to the transcripts of recent customer service tickets. She shows Ben the most recent transcripts and finds something surprising:
It is clear that customers were having problems with the existing UI/UX, and weren't upset because of a lack of features. Runkle and Hughan organized a mass UI/UX overhaul and their sales have never been better.
Of course, the science used in the last example was minimal, but it makes a point. We tend to call people like Runkle drivers. Today's common stick-to-your-gut CEO wants to make all decisions quickly and iterate over solutions until something works. Dr. Hughan is much more analytical. She wants to solve the problem just as much as Runkle, but she turns to user-generated data instead of her gut feeling for answers. Data science is about applying the skills of the analytical mind and using them as a driver would.
Both of these mentalities have their place in today's enterprises; however, it is Hughan's way of thinking that dominates the ideas of data science—using data generated by the company as her source of information, rather than just picking up a solution and going with it.
It is a common misconception that only those with a PhD or geniuses can understand the math/programming behind data science. This is absolutely false. Understanding data science begins with three basic areas: math and statistics, computer programming, and domain knowledge.
The following Venn diagram provides a visual representation of how these three areas of data science intersect:
The Venn diagram of data science
Those with hacking skills can conceptualize and program complicated algorithms using computer languages. Having a math and statistics background allows you to theorize and evaluate algorithms and tweak the existing procedures to fit specific situations. Having substantive expertise (domain expertise) allows you to apply concepts and results in a meaningful and effective way.
While having only two of these three qualities can make you intelligent, it will also leave a gap. Let's say that you are very skilled in coding and have formal training in day trading. You might create an automated system to trade in your place, but lack the math skills to evaluate your algorithms. This will mean that you end up losing money in the long run. It is only when you boost your skills in coding, math, and domain knowledge that you can truly perform data science.
The quality that was probably a surprise for you was domain knowledge. It is really just knowledge of the area you are working in. If a financial analyst started analyzing data about heart attacks, they might need the help of a cardiologist to make sense of a lot of the numbers.
Data science is the intersection of the three key areas mentioned earlier. In order to gain knowledge from data, we must be able to utilize computer programming to access the data, understand the mathematics behind the models we derive, and, above all, understand our analyses' place in the domain we are in. This includes the presentation of data. If we are creating a model to predict heart attacks in patients, is it better to create a PDF of information, or an app where you can type in numbers and get a quick prediction? All these decisions must be made by the data scientist.
The intersection of math and coding is machine learning. This book will look at machine learning in great detail later on, but it is important to note that without the explicit ability to generalize any models or results to a domain, machine learning algorithms remain just that—algorithms sitting on your computer. You might have the best algorithm to predict cancer. You could be able to predict cancer with over 99% accuracy based on past cancer patient data, but if you don't understand how to apply this model in a practical sense so that doctors and nurses can easily use it, your model might be useless.
Both computer programming and math are covered extensively in this book. Domain knowledge comes with both the practice of data science and reading examples of other people's analyses.
Most people stop listening once someone says the word "math." They'll nod along in an attempt to hide their utter disdain for the topic. This book will guide you through the math needed for data science, specifically statistics and probability. We will use these subdomains of mathematics to create what are called models.
A data model refers to an organized and formal relationship between elements of data, usually meant to simulate a real-world phenomenon.
Essentially, we will use math in order to formalize relationships between variables. As a former pure mathematician and current math teacher, I know how difficult this can be. I will do my best to explain everything as clearly as I can. Between the three areas of data science, math is what allows us to move from domain to domain. Understanding the theory allows us to apply a model that we built for the fashion industry to a financial domain.
The math covered in this book ranges from basic algebra to advanced probabilistic and statistical modeling. Do not skip over these chapters, even if you already know these topics or you're afraid of them. Every mathematical concept that I will introduce will be introduced with care and purpose, using examples. The math in this book is essential for data scientists.
In biology, we use, among many other models, a model known as the spawner-recruit model to judge the biological health of a species. It is a basic relationship between the number of healthy parental units of a species and the number of new units in the group of animals. In a public dataset of the number of salmon spawners and recruits, the graph further down (titled spawner-recruit model) was formed to visualize the relationship between the two. We can see that there definitely is some sort of positive relationship (as one goes up, so does the other). But how can we formalize this relationship? For example, if we knew the number of spawners in a population, could we predict the number of recruits that the group would obtain, and vice versa?
Essentially, models allow us to plug in one variable to get the other. Consider the following example:
[Equation not recoverable from the source]
In this example, let's say we knew that a group of salmon had 1.15 (in thousands) spawners. Then, we would have the following:
[Result not recoverable from the source]
This result can be very beneficial for estimating how the health of a population is changing. If we can create these models, we can visually observe how the relationship between the two variables can change.
There are many types of data models, including probabilistic and statistical models. Both of these are subsets of a larger paradigm, called machine learning. The essential idea behind these three topics is that we use data in order to come up with the best model possible. We no longer rely on human instincts—rather, we rely on data, such as that displayed in the following graph:
The spawner-recruit model visualized
The purpose of this example is to show how we can define relationships between data elements using mathematical equations. The fact that I used salmon health data was irrelevant! Throughout this book, we will look at relationships involving marketing dollars, sentiment data, restaurant reviews, and much more. The main reason for this is that I would like you (the reader) to be exposed to as many domains as possible.
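As a rough illustration of formalizing such a relationship, the sketch below fits a straight line to a handful of (spawners, recruits) pairs using ordinary least squares and then plugs in a spawner count to get a predicted recruit count. Everything specific here is an assumption made for the sake of the example: the five data points are invented and are not the salmon dataset, and a straight line is simply the easiest form to show, not necessarily the form of the spawner-recruit model discussed above.

```python
# Hypothetical (spawners, recruits) pairs, both in thousands.
# These numbers are invented for illustration and are not the salmon dataset.
data = [(0.5, 1.2), (1.0, 2.1), (1.5, 2.9), (2.0, 4.1), (2.5, 4.8)]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of co-deviations of x and y, and sum of squared deviations of x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(data)


def predict_recruits(spawners):
    """Plug a spawner count (in thousands) into the fitted line."""
    return slope * spawners + intercept


print(slope, intercept)
print(predict_recruits(1.15))
```

The fitted slope and intercept are the formalized relationship: once they are known, any spawner count can be turned into an estimated recruit count, and refitting on new data shows how the relationship itself is changing.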
Math and coding are vehicles that allow data scientists to step back and apply their skills virtually anywhere.
Let's be honest: you probably think computer science is way cooler than math. That's OK, I don't blame you. The news isn't filled with math news like it is with news on technology. You don't turn on the TV to see a new theory on primes; rather, you will see investigative reports on how the latest smartphone can take better photos of cats, or something. Computer languages are how we communicate with machines and tell them to do our bidding. Just as a book can be written in many languages, data science can be done in many programming languages. Python, Julia, and R are some of the many languages that are available to us. This book will focus exclusively on using Python.
This is a good time to define some more vocabulary. By this point, you're probably excitedly looking up a lot of data science material and seeing words and phrases I haven't used yet. Here are some common terms that you are likely to encounter.
While both the statistical and probabilistic models can be run on computers and might be considered machine learning in that regard, we will keep these definitions separate, since machine learning algorithms generally attempt to learn relationships in different ways. We will take a look at the statistical and probabilistic models in later chapters.
I have tried pretty hard not to use the term big data up until now. This is because I think this term is misused, a lot. Big data is data that is too large to be processed by a single machine (if your laptop crashed, it might be suffering from a case of big data).
The following diagram shows the relationship between these data science concepts:
The state of data science (so far)
The preceding diagram is incomplete and is meant for visualization purposes only.
At the beginning of this chapter, I posed a simple question: what's the catch of data science? Well, there is one. It isn't all fun, games, and modeling. There must be a price for our quest to create ever-smarter machines and algorithms. As we seek new and innovative ways to discover data trends, a beast lurks in the shadows. I'm not talking about the learning curve of mathematics or programming, nor am I referring to the surplus of data. The Industrial Age left us with an ongoing battle against pollution. The subsequent Information Age left behind a trail of big data. So, what dangers might the Data Age bring us?
The Data Age can lead to something much more sinister — the dehumanization of the individual through mass data.
More and more people are jumping head-first into the field of data science, most with no prior experience of math or CS, which, on the surface, is great. Average data scientists have access to millions of dating profiles' data, tweets, online reviews, and much more in order to jump start their education.
