Python: Advanced Predictive Analytics

Ashish Kumar

Description

Gain practical insights by exploiting data in your business to build advanced predictive modeling applications

About This Book

  • A step-by-step guide to predictive modeling including lots of tips, tricks, and best practices
  • Learn how to use popular predictive modeling algorithms such as Linear Regression, Decision Trees, Logistic Regression, and Clustering
  • Master open source Python tools to build sophisticated predictive models

Who This Book Is For

This book is designed for business analysts, BI analysts, data scientists, or junior level data analysts who are ready to move on from a conceptual understanding of advanced analytics and become an expert in designing and building advanced analytics solutions using Python. If you are familiar with coding in Python (or some other programming/statistical/scripting language) but have never used or read about predictive analytics algorithms, this book will also help you.

What You Will Learn

  • Understand the statistical and mathematical concepts behind predictive analytics algorithms and implement them using Python libraries
  • Get to know various methods for importing, cleaning, sub-setting, merging, joining, concatenating, exploring, grouping, and plotting data with pandas and NumPy
  • Master the use of Python notebooks for exploratory data analysis and rapid prototyping
  • Get to grips with applying regression, classification, clustering, and deep learning algorithms
  • Discover advanced methods to analyze structured and unstructured data
  • Visualize the performance of models and the insights they produce
  • Ensure the robustness of your analytic applications by mastering the best practices of predictive analysis

In Detail

Social Media and the Internet of Things have resulted in an avalanche of data. Data is powerful but not in its raw form; it needs to be processed and modeled, and Python is one of the most robust tools out there to do so. It has an array of packages for predictive modeling and a suite of IDEs to choose from. Using the Python programming language, analysts can use these sophisticated methods to build scalable analytic applications. This book is your guide to getting started with predictive analytics using Python.

You'll balance both statistical and mathematical concepts, and implement them in Python using libraries such as pandas, scikit-learn, and NumPy. Through case studies and code examples using popular open-source Python libraries, this book illustrates the complete development process for analytic applications. Covering a wide range of algorithms for classification, regression, and clustering, as well as cutting-edge techniques such as deep learning, this book explains not only how these methods work, but also how to implement them in practice. You will learn to choose the right approach for your problem and how to develop engaging visualizations to bring the insights of predictive modeling to life.

Finally, you will learn best practices in predictive modeling, as well as the different applications of predictive modeling in the modern world. The course provides you with highly practical content from the following Packt books:

1. Learning Predictive Analytics with Python

2. Mastering Predictive Analytics with Python

Style and approach

This course aims to create a smooth learning path that will teach you how to effectively perform predictive analytics using Python. Through this comprehensive course, you'll learn the basics of predictive analytics and progress to predictive modeling in the modern world.




Table of Contents

Python: Advanced Predictive Analytics
Credits
Preface
What this learning path covers
What you need for this learning path
Who this learning path is for
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Module 1
1. Getting Started with Predictive Modelling
Introducing predictive modelling
Scope of predictive modelling
Ensemble of statistical algorithms
Statistical tools
Historical data
Mathematical function
Business context
Knowledge matrix for predictive modelling
Task matrix for predictive modelling
Applications and examples of predictive modelling
LinkedIn's "People also viewed" feature
What does it do?
How is it done?
Correct targeting of online ads
How is it done?
Santa Cruz predictive policing
How is it done?
Determining the activity of a smartphone user using accelerometer data
How is it done?
Sport and fantasy leagues
How was it done?
Python and its packages – download and installation
Anaconda
Standalone Python
Installing a Python package
Installing pip
Installing Python packages with pip
Python and its packages for predictive modelling
IDEs for Python
Summary
2. Data Cleaning
Reading the data – variations and examples
Data frames
Delimiters
Various methods of importing data in Python
Case 1 – reading a dataset using the read_csv method
The read_csv method
Use cases of the read_csv method
Passing the directory address and filename as variables
Reading a .txt dataset with a comma delimiter
Specifying the column names of a dataset from a list
Case 2 – reading a dataset using the open method of Python
Reading a dataset line by line
Changing the delimiter of a dataset
Case 3 – reading data from a URL
Case 4 – miscellaneous cases
Reading from an .xls or .xlsx file
Writing to a CSV or Excel file
Basics – summary, dimensions, and structure
Handling missing values
Checking for missing values
What constitutes missing data?
How missing values are generated and propagated
Treating missing values
Deletion
Imputation
Creating dummy variables
Visualizing a dataset by basic plotting
Scatter plots
Histograms
Boxplots
Summary
3. Data Wrangling
Subsetting a dataset
Selecting columns
Selecting rows
Selecting a combination of rows and columns
Creating new columns
Generating random numbers and their usage
Various methods for generating random numbers
Seeding a random number
Generating random numbers following probability distributions
Probability density function
Cumulative density function
Uniform distribution
Normal distribution
Using the Monte-Carlo simulation to find the value of pi
Geometry and mathematics behind the calculation of pi
Generating a dummy data frame
Grouping the data – aggregation, filtering, and transformation
Aggregation
Filtering
Transformation
Miscellaneous operations
Random sampling – splitting a dataset in training and testing datasets
Method 1 – using the Customer Churn Model
Method 2 – using sklearn
Method 3 – using the shuffle function
Concatenating and appending data
Merging/joining datasets
Inner Join
Left Join
Right Join
An example of the Inner Join
An example of the Left Join
An example of the Right Join
Summary of Joins in terms of their length
Summary
4. Statistical Concepts for Predictive Modelling
Random sampling and the central limit theorem
Hypothesis testing
Null versus alternate hypothesis
Z-statistic and t-statistic
Confidence intervals, significance levels, and p-values
Different kinds of hypothesis test
A step-by-step guide to do a hypothesis test
An example of a hypothesis test
Chi-square tests
Correlation
Summary
5. Linear Regression with Python
Understanding the maths behind linear regression
Linear regression using simulated data
Fitting a linear regression model and checking its efficacy
Finding the optimum value of variable coefficients
Making sense of result parameters
p-values
F-statistics
Residual Standard Error
Implementing linear regression with Python
Linear regression using the statsmodel library
Multiple linear regression
Multi-collinearity
Variance Inflation Factor
Model validation
Training and testing data split
Summary of models
Linear regression with scikit-learn
Feature selection with scikit-learn
Handling other issues in linear regression
Handling categorical variables
Transforming a variable to fit non-linear relations
Handling outliers
Other considerations and assumptions for linear regression
Summary
6. Logistic Regression with Python
Linear regression versus logistic regression
Understanding the math behind logistic regression
Contingency tables
Conditional probability
Odds ratio
Moving on to logistic regression from linear regression
Estimation using the Maximum Likelihood Method
Likelihood function:
Log likelihood function:
Building the logistic regression model from scratch
Making sense of logistic regression parameters
Wald test
Likelihood Ratio Test statistic
Chi-square test
Implementing logistic regression with Python
Processing the data
Data exploration
Data visualization
Creating dummy variables for categorical variables
Feature selection
Implementing the model
Model validation and evaluation
Cross validation
Model validation
The ROC curve
Confusion matrix
Summary
7. Clustering with Python
Introduction to clustering – what, why, and how?
What is clustering?
How is clustering used?
Why do we do clustering?
Mathematics behind clustering
Distances between two observations
Euclidean distance
Manhattan distance
Minkowski distance
The distance matrix
Normalizing the distances
Linkage methods
Single linkage
Complete linkage
Average linkage
Centroid linkage
Ward's method
Hierarchical clustering
K-means clustering
Implementing clustering using Python
Importing and exploring the dataset
Normalizing the values in the dataset
Hierarchical clustering using scikit-learn
K-Means clustering using scikit-learn
Interpreting the cluster
Fine-tuning the clustering
The elbow method
Silhouette Coefficient
Summary
8. Trees and Random Forests with Python
Introducing decision trees
A decision tree
Understanding the mathematics behind decision trees
Homogeneity
Entropy
Information gain
ID3 algorithm to create a decision tree
Gini index
Reduction in Variance
Pruning a tree
Handling a continuous numerical variable
Handling a missing value of an attribute
Implementing a decision tree with scikit-learn
Visualizing the tree
Cross-validating and pruning the decision tree
Understanding and implementing regression trees
Regression tree algorithm
Implementing a regression tree using Python
Understanding and implementing random forests
The random forest algorithm
Implementing a random forest using Python
Why do random forests work?
Important parameters for random forests
Summary
9. Best Practices for Predictive Modelling
Best practices for coding
Commenting the codes
Defining functions for substantial individual tasks
Example 1
Example 2
Example 3
Avoid hard-coding of variables as much as possible
Version control
Using standard libraries, methods, and formulas
Best practices for data handling
Best practices for algorithms
Best practices for statistics
Best practices for business contexts
Summary
A. A List of Links
2. Module 2
1. From Data to Decisions – Getting Started with Analytic Applications
Designing an advanced analytic solution
Data layer: warehouses, lakes, and streams
Modeling layer
Deployment layer
Reporting layer
Case study: sentiment analysis of social media feeds
Data input and transformation
Sanity checking
Model development
Scoring
Visualization and reporting
Case study: targeted e-mail campaigns
Data input and transformation
Sanity checking
Model development
Scoring
Visualization and reporting
Summary
2. Exploratory Data Analysis and Visualization in Python
Exploring categorical and numerical data in IPython
Installing IPython notebook
The notebook interface
Loading and inspecting data
Basic manipulations – grouping, filtering, mapping, and pivoting
Charting with Matplotlib
Time series analysis
Cleaning and converting
Time series diagnostics
Joining signals and correlation
Working with geospatial data
Loading geospatial data
Working in the cloud
Introduction to PySpark
Creating the SparkContext
Creating an RDD
Creating a Spark DataFrame
Summary
3. Finding Patterns in the Noise – Clustering and Unsupervised Learning
Similarity and distance metrics
Numerical distance metrics
Correlation similarity metrics and time series
Similarity metrics for categorical data
K-means clustering
Affinity propagation – automatically choosing cluster numbers
k-medoids
Agglomerative clustering
Where agglomerative clustering fails
Streaming clustering in Spark
Summary
4. Connecting the Dots with Models – Regression Methods
Linear regression
Data preparation
Model fitting and evaluation
Statistical significance of regression outputs
Generalized estimating equations
Mixed effects models
Time series data
Generalized linear models
Applying regularization to linear models
Tree methods
Decision trees
Random forest
Scaling out with PySpark – predicting year of song release
Summary
5. Putting Data in its Place – Classification Methods and Analysis
Logistic regression
Multiclass logistic classifiers: multinomial regression
Formatting a dataset for classification problems
Learning pointwise updates with stochastic gradient descent
Jointly optimizing all parameters with second-order methods
Fitting the model
Evaluating classification models
Strategies for improving classification models
Separating nonlinear boundaries with support vector machines
Fitting an SVM to the census data
Boosting – combining small models to improve accuracy
Gradient boosted decision trees
Comparing classification methods
Case study: fitting classifier models in PySpark
Summary
6. Words and Pixels – Working with Unstructured Data
Working with textual data
Cleaning textual data
Extracting features from textual data
Using dimensionality reduction to simplify datasets
Principal component analysis
Latent Dirichlet Allocation
Using dimensionality reduction in predictive modeling
Images
Cleaning image data
Thresholding images to highlight objects
Dimensionality reduction for image analysis
Case Study: Training a Recommender System in PySpark
Summary
7. Learning from the Bottom Up – Deep Networks and Unsupervised Features
Learning patterns with neural networks
A network of one – the perceptron
Combining perceptrons – a single-layer neural network
Parameter fitting with back-propagation
Discriminative versus generative models
Vanishing gradients and explaining away
Pretraining belief networks
Using dropout to regularize networks
Convolutional networks and rectified units
Compressing Data with autoencoder networks
Optimizing the learning rate
The TensorFlow library and digit recognition
The MNIST data
Constructing the network
Summary
8. Sharing Models with Prediction Services
The architecture of a prediction service
Clients and making requests
The GET requests
The POST request
The HEAD request
The PUT request
The DELETE request
Server – the web traffic controller
Application – the engine of the predictive services
Persisting information with database systems
Case study – logistic regression service
Setting up the database
The web server
The web application
The flow of a prediction service – training a model
On-demand and bulk prediction
Summary
9. Reporting and Testing – Iterating on Analytic Systems
Checking the health of models with diagnostics
Evaluating changes in model performance
Changes in feature importance
Changes in unsupervised model performance
Iterating on models through A/B testing
Experimental allocation – assigning customers to experiments
Deciding a sample size
Multiple hypothesis testing
Guidelines for communication
Translate terms to business values
Visualizing results
Case Study: building a reporting service
The report server
The report application
The visualization layer
Summary
Bibliography
Index

Python: Advanced Predictive Analytics

Python: Advanced Predictive Analytics

Copyright © 2017 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Published on: December 2017

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78899-236-7

www.packtpub.com

Credits

Authors

Ashish Kumar

Joseph Babcock

Reviewers

Matt Hollingsworth

Dipanjan Deb

Content Development Editor

Cheryl Dsa

Production Coordinator

Melwyn D'sa

Preface

Social media and the Internet of Things have resulted in an avalanche of data. Data is powerful but not in its raw form; it needs to be processed and modeled, and Python is one of the most robust tools out there to do so. It has an array of packages for predictive modeling and a suite of IDEs to choose from. Using the Python programming language, analysts can use these sophisticated methods to build scalable analytic applications that deliver insights of tremendous value to their organizations.

This course is your guide to getting started with predictive analytics using Python. You will see how to process data and make predictive models from it. We balance statistical and mathematical concepts, and implement them in Python using libraries such as pandas, scikit-learn, and NumPy. Later, you will learn the process of turning raw data into powerful insights. Through case studies and code examples using popular open-source Python libraries, this course illustrates the complete development process for analytic applications and how to quickly apply these methods to your own data to create robust and scalable prediction services. Covering a wide range of algorithms for classification, regression, and clustering, as well as cutting-edge techniques such as deep learning, this book illustrates not only how these methods work, but how to implement them in practice. You will learn to choose the right approach for your problem and how to develop engaging visualizations to bring the insights of predictive modeling to life. Finally, you will see the best practices in predictive modeling, as well as the different applications of predictive modeling in the modern world.

What this learning path covers

Module 1, Learning Predictive Analytics with Python, is your guide to getting started with predictive analytics using Python. You will see how to process data and make predictive models from it. It strikes a balance between statistical and mathematical concepts and implements them in Python using libraries such as pandas, scikit-learn, and NumPy.

Module 2, Mastering Predictive Analytics with Python, will show you the process of turning raw data into powerful insights. Through case studies and code examples using popular open-source Python libraries, this course illustrates the complete development process for analytic applications and how to quickly apply these methods to your own data to create robust and scalable prediction services.

What you need for this learning path

Module 1:

In order to make the best use of this module, you will require the following:

  • All the datasets that have been used to illustrate the concepts in various chapters. These datasets can be downloaded from this URL: https://goo.gl/zjS4C6. There is a sub-folder containing the required datasets for each chapter.
  • Your computer should have any of the Python distributions installed. The examples in the module have been worked out in IPython Notebook; following the examples will be much easier if you use IPython Notebook. It comes with the Anaconda distribution, which can be installed from https://www.continuum.io/downloads.
  • The widely used Python packages, for example, pandas, matplotlib, scikit-learn, NumPy, and so on, should be installed. If you install Anaconda, these packages come pre-installed.
  • One of the best ways to use this module is to take the dataset used to illustrate the concepts and follow along with the chapter. The concepts are easier to understand if you work hands-on through the examples.
  • A basic aptitude for mathematics is expected. It is beneficial to understand the mathematics behind the algorithms before applying them.
  • Prior experience or knowledge of coding will be an added advantage, but it is not a pre-requisite.
  • Similarly, knowledge of statistics and some algorithms will be beneficial, but it is not a pre-requisite.
  • An open mind, curious to learn the tips and tricks of a subject that is going to be an indispensable skill set in the coming future.

Module 2:

You'll need the latest versions of Python and PySpark installed, along with the Jupyter Notebook.

Who this learning path is for

This course is designed for business analysts, BI analysts, data scientists, or junior-level data analysts who are ready to move from a conceptual understanding of advanced analytics to becoming experts in designing and building advanced analytics solutions using Python. If you are familiar with coding in Python (or some other programming/statistical/scripting language) but have never used or read about predictive analytics algorithms, this course will also help you.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this course—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the course's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt course, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this course from your account at http://www.packtpub.com. If you purchased this course elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the course in the Search box.
5. Select the course for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this course from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the course's webpage at the Packt Publishing website. This page can be accessed by entering the course's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/Python-Advanced-Predictive-Analytics. We also have other code bundles from our rich catalog of books, videos, and courses available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our courses—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this course. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the course in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this course, you can contact us at <[email protected]>, and we will do our best to address the problem.

Part 1. Module 1

Learning Predictive Analytics with Python

Gain practical insights into predictive modelling by implementing Predictive Analytics algorithms on public datasets with Python

Chapter 1. Getting Started with Predictive Modelling

Predictive modelling is an art; it is the science of unearthing the story buried in silos of data. This chapter introduces the scope and application of predictive modelling and gives a glimpse of what can be achieved with it, using some real-life examples.

In this chapter, we will cover the following topics in detail:

  • Introducing predictive modelling
  • Applications and examples of predictive modelling
  • Installing and downloading Python and its packages
  • Working with different IDEs for Python

Introducing predictive modelling

Did you know that Facebook users around the world share 2,460,000 pieces of content every minute of the day? Did you know that 72 hours' worth of new video content is uploaded to YouTube in the same time and, brace yourself, did you know that every day around 2.5 exabytes (2.5 x 10^18 bytes) of data is created by us humans? To give you a perspective on how much data that is, you would need roughly 2.5 million 1 TB (1,000 GB) hard disk drives every day to store that much data. In a year, the number of drives needed would far exceed the population of the United States, and this estimate assumes that the rate of data generation will remain the same, which in all likelihood will not be the case.
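As a rough sanity check, this back-of-the-envelope estimate can be reproduced in a few lines of Python (treating 1 TB as 10^12 bytes; the figures themselves are only the approximations quoted above):

# Back-of-the-envelope check of the storage estimate above.
bytes_per_day = 2.5e18           # roughly 2.5 exabytes of data created per day
drive_capacity = 1e12            # one 1 TB (10^12 byte) hard disk drive

drives_per_day = bytes_per_day / drive_capacity
drives_per_year = drives_per_day * 365

print(drives_per_day)            # 2.5 million drives every day
print(drives_per_year)           # roughly 900 million drives in a year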

The breakneck speed at which social media and the Internet of Things have grown is reflected in the huge silos of data humans generate: data about where we live, where we come from, what we like, what we buy, how much money we spend, where we travel, and so on. Whenever we interact with a social media or Internet of Things website, we leave a trail, which these websites gleefully log as their data. Every time you buy a book on Amazon, receive a payment through PayPal, write a review on Yelp, post a photo on Instagram, or check in on Facebook, apart from generating business for these websites, you are creating data for them.

Harvard Business Review (HBR) says "Data is the new oil" and that "Data Scientist is the sexiest job of the 21st century". So, why is data so important and how can we realize its full potential? Broadly, data is used in two ways:

  • Retrospective analytics: This approach helps us analyze history and glean insights from the data. It allows us to learn from mistakes and adopt best practices. These insights and learnings become the torchbearers for devising better strategies. Not surprisingly, many experts have been claiming that data is the new middle manager.
  • Predictive analytics: This approach unleashes the might of data. In short, it allows us to predict the future. Data science algorithms take historical data and spit out a statistical model, which can predict who will buy, cheat, lie, or die in the future.

Let us evaluate the comparisons made with oil in detail:

  • Data is as abundant as oil used to be, once upon a time, but in contrast to oil, data is a non-depleting resource. In fact, one can argue that it is reusable, in the sense that each dataset can be used in more than one way and also multiple times.
  • It doesn't take years to create data, as it does for oil.
  • Oil in its crude form is worth nothing. It needs to be refined through a comprehensive process to make it usable, and there are various grades of this process to suit various needs; it's the same with data. The data sitting in silos is worthless; it needs to be cleaned, manipulated, and modelled to be of any use. Just as we need refineries and people who can operate those refineries, we need tools that can handle data and people who can operate those tools. Some of the tools for these tasks are Python, R, SAS, and so on, and the people who operate these tools are called data scientists.

A more detailed comparison of oil and data is provided in the following table:

| Data | Oil |
| --- | --- |
| It's a non-depleting resource and also reusable. | It's a depleting resource and non-reusable. |
| Data collection requires some infrastructure or system in place. Once the system is in place, data generation happens seamlessly. | Drilling oil requires a lot of infrastructure. Once the infrastructure is in place, one can keep drawing oil until the stock dries up. |
| It needs to be cleaned and modelled. | It needs to be cleaned and processed. |
| The time taken to generate data varies from fractions of a second to months and years. | It takes decades to generate. |
| The worth and marketability of different kinds of data is different. | The worth of crude oil is the same everywhere. However, the price and marketability of different end products of refinement are different. |
| The time horizon for monetizing data is smaller after getting the data. | The time horizon for monetizing oil is longer than that for data. |

Scope of predictive modelling

Predictive modelling is an ensemble of statistical algorithms coded in a statistical tool which, when applied to historical data, outputs a mathematical function (or equation). This function can, in turn, be used to predict outcomes based on some inputs from the future (on which the model operates), to drive a goal in a business context or to enable better decision making in general.

To understand what predictive modelling entails, let us focus on the phrases highlighted previously.

Ensemble of statistical algorithms

Statistics is important for understanding data; it tells us volumes about the data. How is the data distributed? Is it centered with little variance, or does it vary widely? Are two of the variables dependent on or independent of each other? Statistics helps us answer these questions. This book expects a basic understanding of elementary statistical terms, such as mean, variance, covariance, and correlation. Advanced terms, such as hypothesis testing, chi-square tests, p-values, and so on, will be explained as and when required. Statistics is the cog in the wheel called the model.

Algorithms, on the other hand, are the blueprints of a model. They are responsible for creating mathematical equations from the historical data. They analyze the data, quantify the relationship between the variables, and convert it into a mathematical equation. There is a variety of them: Linear Regression, Logistic Regression, Clustering, Decision Trees, Time-Series Modelling, Naïve Bayes Classifiers, Natural Language Processing, and so on. These models can be classified under two classes:

  • Supervised algorithms: These are algorithms wherein the historical data has an output variable in addition to the input variables. The model makes use of the output variable from the historical data, apart from the input variables. Examples of such algorithms include Linear Regression, Logistic Regression, Decision Trees, and so on.
  • Unsupervised algorithms: These algorithms work without an output variable in the historical data. An example of such an algorithm is clustering.

The selection of a particular algorithm for a model depends majorly on the kind of data available. The focus of this book would be to explain methods of handling various kinds of data and illustrating the implementation of some of these models.

Statistical tools

There are many statistical tools available today, laced with inbuilt methods to run basic statistical chores. The arrival of robust open-source tools like R and Python has made them extremely popular in industry and academia alike. Apart from that, Python's packages are well documented, so debugging is easier.

Python has a number of libraries, especially for running statistical, cleaning, and modelling chores. It has emerged as the first among equals when it comes to choosing a tool for implementing predictive modelling. As the title suggests, Python will be the choice for this book as well.

Historical data

Our machinery (model) is built and operated on this oil called data. In general, a model is built on the historical data and works on future data. Additionally, a predictive model can be used to fill missing values in historical data by interpolating the model over sparse historical data. In many cases, during modelling stages, future data is not available. Hence, it is a common practice to divide the historical data into training (to act as historical data) and testing (to act as future data) through sampling.
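As a quick preview of the sampling idea just described, here is a minimal sketch using scikit-learn's train_test_split; the file and column names are placeholders, not datasets from this book, and on older scikit-learn versions the import lives in sklearn.cross_validation instead of sklearn.model_selection:

# Splitting historical data into training (acting as the past) and testing (acting as the future).
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('historical_data.csv')      # hypothetical historical dataset
train, test = train_test_split(data, test_size=0.2, random_state=42)

print(len(train), len(test))                   # roughly an 80/20 split of the rows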

As discussed earlier, the data might or might not have an output variable. However, one thing that it promises to be is messy. It needs to undergo a lot of cleaning and manipulation before it can become of any use for a modelling process.

Mathematical function

Most of the data science algorithms have underlying mathematics behind them. In many of the algorithms, such as regression, a mathematical equation (of a certain type) is assumed and the parameters of the equations are derived by fitting the data to the equation.

For example, the goal of linear regression is to fit a linear model to a dataset and find the parameters of the following equation:

Y = β0 + β1*X1 + β2*X2 + … + βn*Xn

The purpose of modelling is to find the best values for the coefficients. Once these values are known, the previous equation is good to predict the output. The equation above, which can also be thought of as a linear function of the Xi's (the input variables), is the linear regression model.

Another example is logistic regression. There, too, we have a mathematical equation, or a function of the input variables, with some differences. The defining equation for logistic regression is as follows:

P = e^(a + b*X) / (1 + e^(a + b*X))

Here, the goal is to estimate the values of a and b by fitting the data to this equation; P is the predicted probability of the outcome. Any supervised algorithm will have an equation or function similar to that of the model above. For unsupervised algorithms, an underlying mathematical function or criterion (which can be formulated as a function or equation) serves the purpose. The mathematical equation or function is the backbone of a model.
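To see what this function does, here is a tiny illustration, not from the book and with made-up coefficient values: it squeezes any linear combination a + b*X into a value strictly between 0 and 1, which is why it is suitable for predicting probabilities.

# Evaluating the logistic function above for a few inputs.
import numpy as np

a, b = -1.0, 0.5                          # hypothetical coefficient values
x = np.linspace(-10, 10, 5)
p = np.exp(a + b * x) / (1 + np.exp(a + b * x))

print(p)                                  # every value lies strictly between 0 and 1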

Business context

All the effort that goes into predictive analytics and all its worth, which accrues to data, is because it solves a business problem. A business problem can be anything and it will become more evident in the following examples:

  • Tricking the users of the product/service into buying more from you by increasing the click-through rates of online ads
  • Predicting the probable crime scenes in order to prevent crime
  • Aggregating an invincible lineup for a sports league
  • Predicting the failure rates and associated costs of machinery components
  • Managing the churn rate of the customers

The predictive analytics is being used in an array of industries to solve business problems. Some of these industries are, as follows:

  • Banking
  • Social media
  • Retail
  • Transport
  • Healthcare
  • Policing
  • Education
  • Travel and logistics
  • E-commerce
  • Human resources

All that matters is by what quantum the proposed solution made life better for the business. That is the reason predictive analytics is becoming an indispensable practice for management consulting.

In short, predictive analytics sits at the sweet spot where statistics, algorithms, technology, and business sense intersect. Think about it: a mathematician, a programmer, and a business person rolled into one.

Knowledge matrix for predictive modelling

As discussed earlier, predictive modelling is an interdisciplinary field sitting at the interface of, and requiring knowledge of, four disciplines: statistics, algorithms, tools and techniques, and business sense. Each of these disciplines is equally indispensable for performing a successful task of predictive modelling.

These four disciplines of predictive modelling carry equal weights and can be better represented as a knowledge matrix; it is a symmetric 2 x 2 matrix containing four equal-sized squares, each representing a discipline.

Fig. 1.1: Knowledge matrix: four disciplines of predictive modelling

Task matrix for predictive modelling

The tasks involved in predictive modelling follow the Pareto principle. Around 80% of the effort in the modelling process goes towards data cleaning and wrangling, while only 20% of the time and effort goes into implementing the model and getting the prediction. However, the meaty part of the modelling, which yields almost 80% of the results and insights, is undoubtedly the implementation of the model. This information can be better represented as a matrix, which can be called a task matrix and looks similar to the following figure:

Fig. 1.2: Task matrix: split of time spent on data cleaning and modelling and their final contribution to the model

Many of the data cleaning and exploration chores can be automated because they are largely similar, irrespective of the data. The part that needs a lot of human thinking is the implementation of a model, which is what makes up the bulk of this book.

Applications and examples of predictive modelling

In the introductory section, data has been compared with oil. While oil has been the primary source of energy for the last couple of centuries, and the legends of OPEC, petrodollars, and the Gulf Wars have framed oil as a coveted resource, the might of data needs to be demonstrated here to set the premise for the comparison. Let us glance through some examples of predictive analytics to marvel at the might of data.

LinkedIn's "People also viewed" feature

If you are a frequent LinkedIn user, you might be familiar with LinkedIn's "People also viewed" feature.

What does it do?

Let's say you have searched for a person who works at a particular organization, and LinkedIn throws up a list of search results. You click on one of them and land on their profile. In the middle-right section of the screen, you will find a panel titled "People Also Viewed"; it is essentially a list of people who either work at the same organization as the person whose profile you are currently viewing, or who have the same designation and belong to the same industry.

Isn't it cool? You might have had to search for these people separately if not for this feature. This feature increases the efficacy of your search results and saves you time.

How is it done?

Are you wondering how LinkedIn does it? The rough blueprint is as follows:

  • LinkedIn leverages its search history data to do this. The model underneath this feature plunges into a treasure trove of search history data and looks at what people searched for next after finding the person they were originally looking for.
  • The event of searching for a particular second person after searching for a particular first person has some probability, which is calculated using all the data for such searches. The profiles with the highest probability of being searched for next (based on the historical data) are shown in the "People Also Viewed" section.
  • This probability comes under the ambit of a broad set of rules called Association Rules. These are very widely used in retail analytics, where we are interested in knowing which products sell together; in other words, what is the probability of buying a particular second product given that the consumer has already bought the first product? A toy version of this conditional-probability calculation is sketched right after this list.
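The following sketch is not LinkedIn's actual system; it is a toy illustration, on made-up data, of estimating P(next profile = B | first profile = A) from historical view pairs:

# Estimating "people also viewed" suggestions from historical (first, next) view pairs.
from collections import Counter, defaultdict

view_pairs = [                                   # hypothetical search-history data
    ("alice", "bob"), ("alice", "bob"), ("alice", "carol"),
    ("dave", "erin"), ("dave", "erin"), ("dave", "frank"),
]

next_counts = defaultdict(Counter)
for first, nxt in view_pairs:
    next_counts[first][nxt] += 1

def people_also_viewed(profile, top_n=3):
    """Return the most likely next profiles with their estimated probabilities."""
    counts = next_counts[profile]
    total = sum(counts.values())
    return [(person, count / total) for person, count in counts.most_common(top_n)]

print(people_also_viewed("alice"))               # [('bob', 0.67), ('carol', 0.33)] approximately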

Correct targeting of online ads

If you browse the Internet, which I am sure you must be doing frequently, you must have encountered online ads, both on the websites and smartphone apps. Just like the ads in the newspaper or TV, there is a publisher and an advertiser for online ads too. The publisher in this case is the website or the app where the ad will be shown while the advertiser is the company/organization that is posting that ad.

The ultimate goal of an online ad is to be clicked on. Each instance of an ad display is called an impression. The ratio of clicks to impressions is called the Click Through Rate and is the single most important metric that advertisers are interested in. The problem statement is to determine the list of publishers where the advertiser should publish its ads so that the Click Through Rate is maximized.

How is it done?

  • The historical data in this case consists of information about people who visited a certain website/app and whether or not they clicked the published ad. Some combination of classification models, such as Decision Trees and Support Vector Machines, is used in such cases to determine whether a visitor will click on the ad, given the visitor's profile information.
  • One problem with standard classification algorithms in such cases is that Click Through Rates are very small numbers, of the order of less than 1%, so the resulting dataset used for classification has very sparse positive outcomes. The majority (non-click) records need to be down-sampled to enrich the data with positive outcomes before modelling; a minimal sketch of this step follows this list.
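Here is a minimal sketch of that down-sampling step, assuming a hypothetical impression-level dataset with a clicked column (the file and column names are placeholders, not from the book):

# Down-sampling the majority (non-click) class so clicks are better represented.
import pandas as pd

ads = pd.read_csv('ad_impressions.csv')          # hypothetical impression-level data
clicks = ads[ads['clicked'] == 1]                # the rare positive outcomes
non_clicks = ads[ads['clicked'] == 0]

# Keep, say, ten non-clicks for every click instead of the original ratio.
non_clicks_sampled = non_clicks.sample(n=10 * len(clicks), random_state=1)
balanced = pd.concat([clicks, non_clicks_sampled]).sample(frac=1, random_state=1)

print(balanced['clicked'].mean())                # the positive rate is now roughly 1 in 11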

Logistic regression is one of the standard classifiers for situations with binary outcomes. In banking, for example, whether a person will default on a loan can be predicted using logistic regression, given their credit history.

Santa Cruz predictive policing

Based on the historical data consisting of the area and time window of the occurrence of a crime, a model was developed to predict the place and time where the next crime might take place.

How is it done?

  • A decision tree model was created using the historical data. The model foretells whether a crime will occur in an area on a given date and time in the future; a toy sketch of such a model follows this list.
  • The model is recalibrated every day to include the crimes that happened during that day.
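The following is a toy sketch of that kind of decision tree model, on entirely made-up data; a real system would use far richer features and far more history:

# A toy decision tree predicting whether a crime occurs, given area and time of day.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

crimes = pd.DataFrame({
    'area_code':   [1, 1, 2, 2, 3, 3, 1, 2],
    'hour_of_day': [22, 23, 14, 2, 20, 21, 13, 3],
    'crime':       [1, 1, 0, 1, 1, 1, 0, 0],       # 1 = a crime occurred
})

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(crimes[['area_code', 'hour_of_day']], crimes['crime'])

# Will a crime occur in area 1 at 11 pm?
print(model.predict(pd.DataFrame({'area_code': [1], 'hour_of_day': [23]})))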

The good news is that the police are using such techniques to predict crime scenes in advance so that they can prevent crimes from happening. The bad news is that certain terrorist organizations use similar techniques to target the locations that will cause the maximum damage with minimal effort on their side. The good news, again, is that this strategic behavior of terrorists has been studied in detail and is being used to form counter-terrorism policies.

Determining the activity of a smartphone user using accelerometer data

The accelerometer in a smartphone measures the acceleration over a period of time as the user indulges in various activities. The acceleration is measured over the three axes, X, Y, and Z. This acceleration data can then be used to determine whether the user is sleeping, walking, running, jogging, and so on.

How is it done?

  • The acceleration data is clustered based on the acceleration values in the three directions; the values for similar activities cluster together.
  • The clustering performs well in such cases if the columns contributing the most to the separation of activities are included while calculating the distance matrix for clustering. Such columns can be found using a technique called Singular Value Decomposition. A toy clustering sketch follows this list.
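Here is a toy sketch of the clustering idea, on synthetic readings rather than a real accelerometer trace; k-means is used here purely for illustration:

# Clustering synthetic accelerometer readings (X, Y, Z) into two activity groups.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
resting = rng.normal(loc=[0.0, 0.0, 9.8], scale=0.1, size=(50, 3))   # phone lying still
walking = rng.normal(loc=[1.5, 0.5, 9.8], scale=0.8, size=(50, 3))   # more movement
readings = np.vstack([resting, walking])                              # columns: X, Y, Z

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(readings)
print(kmeans.labels_[:5], kmeans.labels_[-5:])    # the two activities fall into separate clusters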

Sport and fantasy leagues

Moneyball, anyone? Yes, the movie. The movie where a statistician turns around the fortunes of a poorly performing baseball team, the Oakland Athletics (the Oakland A's), by developing an algorithm to select players who were cheap to buy but had a lot of latent potential to perform.

How was it done?

  • Bill James, using historical data, concluded that the older metrics used to rate a player, such as stolen bases, runs batted in, and batting average, were not very useful indicators of a player's performance in a given match. He relied instead on metrics like on-base percentage and slugging percentage as better predictors of a player's performance.
  • James, the chief statistician behind the algorithms, compiled performance data for all the baseball league players and sorted them by these metrics. Surprisingly, the players who had high values for these statistics also came at cheaper prices.

This way, they gathered an unbeatable team that didn't have individual stars who came at hefty prices but as a team were an indomitable force. Since then, these algorithms and their variations have been used in a variety of real and fantasy leagues to select players. The variants of these algorithms are also being used by Venture Capitalists to optimize and automate their due diligence to select the prospective start-ups to fund.

Python and its packages – download and installation

There are various ways in which one can access and install Python and its packages. Here we will discuss a couple of them.

Anaconda

Anaconda is a popular Python distribution consisting of more than 195 popular Python packages. Installing Anaconda automatically installs many of the packages discussed in the preceding section. They can be accessed from an IDE called Spyder (more on this later in this chapter), which is itself installed as part of the Anaconda installation. Anaconda also installs IPython Notebook; when you click on the IPython Notebook icon, it opens a browser tab and a Command Prompt.

Note

Anaconda can be downloaded and installed from the following web address: http://continuum.io/downloads

Download the suitable installer, double-click on the .exe file, and it will install Anaconda. Two of the features that you must check after the installation are:

  • IPython Notebook
  • Spyder IDE

Search for them from the Start menu if they don't appear in the list of programs and files by default. We will be using IPython Notebook extensively, and the code in this book works best when run in IPython Notebook.

IPython Notebook can be opened by clicking on the icon. Alternatively, you can use the Command Prompt to open IPython Notebook. Just navigate to the directory where you have installed Anaconda and then write ipython notebook, as shown in the following screenshot:

Fig. 1.3: Opening IPython Notebook

Note

On the system used for this book, Anaconda was installed in the C:\Users\ashish directory. One can open a new notebook in IPython by clicking on the New Notebook button on the dashboard that opens up. In this book, we have used IPython Notebook extensively.

Standalone Python

You can download a Python version that is stable and compatible with the OS on your system. The most stable version of Python is 2.7.0, so installing this version is highly recommended. You can download it from https://www.python.org/ and install it.

There are some Python packages that you need to install on your machine before you start predictive analytics and modelling. This section consists of a demo of installation of one such library and a brief description of all such libraries.

Installing a Python package

There are several ways to install a Python package. The easiest and the most effective is the one using pip. As you might be aware, pip is a package management system that is used to install and manage software packages written in Python. To be able to use it to install other packages, pip needs to be installed first.

Installing pip

The following steps demonstrate how to install pip. Follow closely!

Navigate to the webpage shown in the following screenshot. The URL address is https://pypi.python.org/pypi/pip:

Downloading pip from Python's official website

Download the pip-7.0.3.tar.gz file and unzip it in the folder where Python is installed. If you have Python v2.7.0 installed, this folder should be C:\Python27:

Unzipping the .tar.gz file for pip in the correct folder

On unzipping the previously mentioned file, a folder called pip-7.0.3 is created. Opening that folder will take you to a screen similar to the one in the preceding screenshot.

Open the Command Prompt on your computer and change the current directory to the directory shown in the preceding screenshot, that is, C:\Python27\pip-7.0.3, using the following command:

cd C:\Python27\pip-7.0.3

The result of the preceding command is shown in the following screenshot:

Navigating to the directory where pip is installed

Now, the current directory is set to the directory where the setup file for pip (setup.py) resides. Write the following command to install pip:
python setup.py install
The result of the preceding command is shown in the following screenshot:

Installing pip using a command line

Once pip is installed, it is very easy to install all the required Python packages to get started.

Installing Python packages with pip

The following are the steps to install Python packages using pip, which we just installed in the preceding section:

Change the current directory in the command prompt to the directory where Python v2.7.0 is installed, that is, C:\Python27. Then, write the following command to install the package:
pip install package-name
For example, to install pandas, you can proceed as follows:

Installing a Python package using a command line and pip

Finally, to confirm that the package has installed successfully, write the following command:
python -c "import pandas"
The result of the preceding command is shown in the following screenshot:

Checking whether the package has installed correctly or not

If this doesn't throw up an error, then the package has been installed successfully.

Python and its packages for predictive modelling

In this section, we will discuss some commonly used packages for predictive modelling.

pandas: The most important and versatile package, used widely across data science domains, is pandas, and it is no wonder that you will see import pandas at the beginning of almost any data science code snippet, in this book and in general. Among other things, the pandas package facilitates:

  • Reading a dataset in a usable format (a data frame, in the case of Python)
  • Calculating basic statistics
  • Running basic operations, such as sub-setting a dataset, merging/concatenating two datasets, handling missing data, and so on (see the short sketch after this list)
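As a quick taste of these operations, here is a minimal, self-contained sketch; the file and column names are placeholders, not datasets from this book:

# A minimal pandas sketch: reading, summarizing, sub-setting, and merging data.
import pandas as pd

sales = pd.read_csv('sales.csv')                     # read a dataset into a data frame
print(sales.shape)                                   # dimensions
print(sales.describe())                              # basic statistics

recent = sales[sales['year'] >= 2016]                # sub-setting rows
recent = recent[['customer_id', 'amount']]           # sub-setting columns

customers = pd.read_csv('customers.csv')
merged = recent.merge(customers, on='customer_id')   # merging two datasets
merged['amount'] = merged['amount'].fillna(0)        # handling missing data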

The various methods in pandas will be explained in this book as and when we use them.

Note

To get an overview, navigate to the official page of pandas here: http://pandas.pydata.org/index.html

NumPy: NumPy is, in many ways, a MATLAB equivalent in the Python environment. It has powerful methods for mathematical calculations and simulations. The following are some of its features:

  • A powerful and widely used N-d array element
  • An ensemble of powerful mathematical functions used in linear algebra, Fourier transforms, and random number generation
  • A combination of the random number generators and the N-d array is used to generate dummy datasets to demonstrate various procedures, a practice we will follow extensively in this book (a short sketch follows this list)
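The following is a small sketch, not taken from the book, of exactly that practice: using NumPy's random number generators and arrays to build a dummy dataset.

# Generating a dummy dataset with NumPy's random number generator and arrays.
import numpy as np

np.random.seed(42)                       # make the dummy data reproducible
x = np.random.uniform(0, 10, size=100)   # 100 uniform random numbers
noise = np.random.normal(0, 1, size=100)
y = 3 * x + 5 + noise                    # a noisy linear relationship

print(x.mean(), y.std())                 # quick mathematical summaries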

Note

To get an overview, navigate to official page of NumPy at http://www.NumPy.org/

matplotlib: matplotlib is a Python library that easily generates high-quality 2-D plots. Again, it is very similar to MATLAB.

  • It can be used to plot all kinds of common plots, such as histograms, stacked and unstacked bar charts, scatterplots, heat diagrams, box plots, power spectra, error charts, and so on
  • It can be used to edit and manipulate all the plot properties, such as the title, axes properties, color, scale, and so on (a short sketch follows this list)
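Here is a minimal sketch, on synthetic data, of a histogram and a scatter plot with a few of the plot properties mentioned above:

# A histogram and a scatter plot of synthetic data with matplotlib.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
x = np.random.normal(size=500)
y = 2 * x + np.random.normal(size=500)

plt.figure(figsize=(8, 3))
plt.subplot(1, 2, 1)
plt.hist(x, bins=30, color='steelblue')
plt.title('Histogram of x')

plt.subplot(1, 2, 2)
plt.scatter(x, y, s=5, color='darkorange')
plt.title('y versus x')
plt.xlabel('x')
plt.ylabel('y')

plt.tight_layout()
plt.show()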

Note

To get an overview, navigate to the official page of matplotlib at: http://matplotlib.org

IPython: IPython provides an environment for interactive computing.

It provides a browser-based notebook that is an IDE-cum-development environment supporting code, rich media, inline plots, and model summaries. These notebooks and their content can be saved and used later to demonstrate results as they are, or to save the code separately and execute it. It has emerged as a powerful tool for web-based tutorials, as the code and the results flow smoothly one after the other in this environment. At many places in this book, we will be using this environment.

Note

To get an overview, navigate to the official page of IPython here http://ipython.org/

Scikit-learn: scikit-learn is the mainstay of any predictive modelling in Python. It is a robust collection of all the data science algorithms and methods to implement them. Some of the features of scikit-learn are as follows:

  • It is built on top of Python packages like NumPy and SciPy
  • It is very simple and efficient to use
  • It has methods to implement most of the predictive modelling techniques, such as linear regression, logistic regression, clustering, and Decision Trees
  • It provides a very concise way to predict outcomes based on a model and to measure the accuracy of those outcomes (a short sketch follows this list)
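As a small, self-contained taste of that workflow, here is a sketch on synthetic data (not an example from the book):

# Fit a linear regression with scikit-learn, predict, and measure the fit.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

np.random.seed(1)
X = np.random.uniform(0, 10, size=(200, 1))
y = 4.0 * X[:, 0] + 2.0 + np.random.normal(scale=1.0, size=200)

model = LinearRegression().fit(X, y)      # fit the model
predictions = model.predict(X)            # predict the outcome

print(model.coef_, model.intercept_)      # roughly [4.0] and 2.0
print(r2_score(y, predictions))           # accuracy (R-squared) of the fit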

Note

To get an overview, navigate to the official page of scikit-learn here: http://scikit-learn.org/stable/index.html

Python packages, other than these, if used in this book, will be situation based and can be installed using the method described earlier in this section.

IDEs for Python

An IDE, or Integrated Development Environment, is software that provides a source-code editor and debugger for writing code. Using such software, one can write, test, and debug a code snippet before adding it to the production version of the code.

IDLE: IDLE is the default Integrated Development Environment for Python that comes with the default implementation of Python. It comes with the following features:

  • Multi-window text editor with auto-completion, smart indent, and syntax and keyword highlighting
  • Python shell with syntax highlighting

IDLE is widely popular as an IDE for beginners; it is simple to use and works well for simple tasks. Some of the issues with IDLE are bad output reporting, absence of line numbering options, and so on. As a result, advanced practitioners move on to better IDEs.

IPython Notebook: IPython Notebook is a powerful computational environment where code, execution, results, and media can co-exist in one single document. There are two components of this computing environment:

  • IPython Notebook: A web application containing code, executions, plots, and results, stored in different cells; they can be saved and edited as and when required
  • Notebook: A plain-text document meant to record and distribute the results of a computational analysis

The IPython documents are stored with the .ipynb extension in the directory from which the notebook server is launched on the computer.

Some of the features of IPython Notebook are as follows:

  • Inline rendering of matplotlib plots, which can be saved in multiple formats (JPEG, PNG)
  • Standard Python syntax in the notebook, which can be saved as a Python script
  • The notebooks can be saved as HTML files and .ipynb files; these notebooks can be viewed in browsers, and this has made the format a popular tool for illustrated blogging in Python

A notebook in IPython looks as shown in the following screenshot:

An IPython Notebook

Spyder: Spyder is a powerful scientific computing and development environment for Python. It has the following features:

  • Advanced editing, auto-completion, debugging, and interactive testing
  • Python kernel and code editor with line numbering in the same screen
  • Preinstalled scientific packages like NumPy, pandas, scikit-learn, matplotlib, and so on

In some ways, Spyder is very similar to the RStudio environment, where text editing and interactive testing go hand in hand:

The interface of Spyder IDE

In this book, IPython Notebook and Spyder have been used extensively. IDLE has been used from time to time, and some people use other environments, such as PyCharm. Readers of this book are free to use such editors if they are more comfortable with them. However, they should make sure that all the required packages work fine in those environments.

Summary

The following are some of the takeaways from this chapter:

Social media and the Internet of Things have resulted in an avalanche of data.
Data is powerful, but not in its raw form; it needs to be processed and modelled.
Organizations across the world and across domains are using data to solve critical business problems. Knowledge of statistical algorithms, statistical tools, business context, and the handling of historical data is vital to solving these problems using predictive modelling.
Python is a robust tool to handle, process, and model data. It has an array of packages for predictive modelling and a suite of IDEs to choose from.

Let us enter the battlefield where Python is our weapon. In the next chapter, we will learn how to read data in various scenarios and do some basic processing.

Chapter 2. Data Cleaning

Without any further ado, let's kick-start the engine and start our foray into the world of predictive analytics. However, you need to remember that our fuel is data. In order to do any predictive analysis, one needs to access and import data for the engine to rev up.

I assume that you have already installed Python and the required packages, along with an IDE of your choice. Predictive analytics, like any other art, is best learnt hands-on and practiced as frequently as possible. This book will be of the best use if you open a Python IDE of your choice and practice the explained concepts on your own. So, if you haven't installed Python and its packages yet, now is the time. If not all the packages, at least pandas should be installed, as it is the mainstay of what we will learn in this chapter.

After reading this chapter, you should be familiar with the following topics:

Handling various kinds of data importing scenarios, that is, importing various kinds of datasets (.csv, .txt), different kinds of delimiters (comma, tab, pipe), and different methods (read_csv, read_table)
Getting basic information, such as dimensions, column names, and summary statistics
Getting basic data cleaning done, that is, removing NAs and blank spaces, imputing values to missing data points, changing a variable type, and so on
Creating dummy variables in various scenarios to aid modelling
Generating simple plots like scatter plots, bar charts, histograms, box plots, and so on

From now on, we will be using a lot of publicly available datasets to illustrate concepts and examples. All the used datasets have been stored in a Google Drive folder, which can be accessed from this link: https://goo.gl/zjS4C6.

Note

This folder is called "Datasets for Predictive Modelling with Python". This folder has a subfolder dedicated to each chapter of the book. Each subfolder contains the datasets that were used in the chapter.

The dataset paths used in this book are paths on my local computer. You should download the datasets from these subfolders to your local computer before using them. Better still, you can download the entire folder at once and save it somewhere on your local computer.

Reading the data – variations and examples

Before we delve deeper into the realm of data, let us familiarize ourselves with a few terms that will appear frequently from now on.

Data frames

A data frame is one of the most common data structures available in Python. Data frames are very similar to tables in a spreadsheet or a SQL table. In Python vocabulary, a data frame can also be thought of as a dictionary of Series objects (in terms of structure). A data frame, like a spreadsheet, has index labels (analogous to rows) and column labels (analogous to columns). It is the most commonly used pandas object and is a 2D structure whose columns can be of different or the same types. Most of the standard operations, such as aggregation, filtering, and pivoting, which can be applied to a spreadsheet or a SQL table, can be applied to data frames using methods in pandas.

The following screenshot is an illustrative picture of a data frame. We will learn more about working with them as we progress in the chapter:

Fig. 2.1 A data frame
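To make the structure concrete, here is a minimal sketch that builds a small data frame from a dictionary of columns; the names and values are made up for the example:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],   # a column of strings
    'Age': [25, 32, 47],                 # a column of integers
    'Score': [88.5, 92.0, 79.5]          # a column of floats
})

print(df)        # rows carry index labels, columns carry column labels
print(df.shape)  # dimensions of the data frame: (3, 3)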

Delimiters

A delimiter is a special character that separates the columns of a dataset from one another. The most common delimiter (one can go to the extent of calling it the default delimiter) is a comma (,). A .csv file is called so because it has comma-separated values. However, a dataset can have any special character as its delimiter, and one needs to know how to juggle and manage delimiters in order to do an exhaustive exploratory analysis and build a robust predictive model. Later in this chapter, we will learn how to do that.

The read_csv method

The name of the method doesn't unveil its full might. It is a kind of misnomer, in the sense that it makes us think it can be used to read only CSV files, which is not the case. Various kinds of files, including .txt files with delimiters of various kinds, can be read using this method.

Let's learn a little bit more about the various arguments of this method in order to assess its true potential. Although the read_csv method has close to 30 arguments, the ones listed in the next section are the ones that are most commonly used.

The general form of a read_csv statement is something similar to:

pd.read_csv(filepath, sep=',', dtype=None, header=None, skiprows=None, index_col=None, skip_blank_lines=True, na_filter=True)
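To make this general form concrete, here is an illustrative call; the file name and the column name used in dtype are hypothetical placeholders, not files supplied with the book:

import numpy as np
import pandas as pd

# 'my_dataset.csv' and the column 'price' are hypothetical placeholders
df = pd.read_csv('my_dataset.csv',
                 sep=',',                       # comma-delimited file
                 dtype={'price': np.float64},   # read the 'price' column as float64
                 header=0,                      # the first row contains column names
                 index_col=None)                # do not use any column as the row index
print(df.head())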

Now, let us understand the significance and usage of each of these arguments one by one:

filepath: filepath is the complete address of the dataset or file that you are trying to read. The complete address includes the address of the directory in which the file is stored and the full name of the file with its extension. Remember to use a forward slash (/) in the directory address. Later in this chapter, we will see that the filepath can be a URL as well.
sep: sep allows us to specify the delimiter of the dataset to be read. By default, the method assumes that the delimiter is a comma (,). Other commonly used delimiters are the blank space ( ) and the tab (\t); datasets using them are called space-delimited or tab-delimited datasets. This argument of the method also takes regular expressions as a value.
dtype: Sometimes certain columns of the dataset need to be formatted to some other type in order to apply certain operations successfully. One example is date variables. Very often, they have a string type, which needs to be converted to a date type before we can apply date-related operations to them. The dtype argument is used to specify the data types of the columns of the dataset. Suppose two columns, a and b, of the dataset need to be formatted to the types float64 and int32; this can be achieved by passing {'a': np.float64, 'b': np.int32} as the value of dtype. If not specified, the columns are left in the same format as originally found.
header: The value of a