Practical Predictive Analytics

Ralph Winters

Description

Make sense of your data and predict the unpredictable

About This Book

  • A unique book that centers around developing the six key practical skills needed to develop and implement predictive analytics
  • Apply the principles and techniques of predictive analytics to effectively interpret big data
  • Solve real-world analytical problems with the help of practical case studies and real-world scenarios taken from the world of healthcare, marketing, and other business domains

Who This Book Is For

This book is for those with a mathematical/statistics background who wish to understand the concepts, techniques, and implementation of predictive analytics to resolve complex analytical issues. Basic familiarity with the R programming language is expected.

What You Will Learn

  • Master the core predictive analytics algorithms which are used in business today
  • Learn to implement the six steps for a successful analytics project
  • Identify the right algorithm for your requirements
  • Use and apply predictive analytics to research problems in healthcare
  • Implement predictive analytics to retain and acquire your customers
  • Use text mining to understand unstructured data
  • Develop models on your own PC or in Spark/Hadoop environments
  • Implement predictive analytics products for customers

In Detail

This is the go-to book for anyone interested in the steps needed to develop predictive analytics solutions with examples from the world of marketing, healthcare, and retail. We'll get started with a brief history of predictive analytics and learn about different roles and functions people play within a predictive analytics project. Then, we will learn about various ways of installing R along with their pros and cons, combined with a step-by-step installation of RStudio, and a description of the best practices for organizing your projects.

On completing the installation, we will begin to acquire the skills necessary to input, clean, and prepare your data for modeling. We will learn the six specific steps needed to implement and successfully deploy a predictive model, starting from asking the right questions, through model development, and ending with deploying your predictive model into production. We will learn why collaboration is important and how agile, iterative modeling cycles can increase your chances of developing and deploying the most successful model.

We will then continue the journey into the cloud, extending your skill set by learning about Databricks and SparkR, which allow you to develop predictive models on vast amounts of data.

Style and Approach

This book takes a practical, hands-on approach wherein the algorithms are explained with the help of real-world use cases. It is written in a well-researched academic style that offers a good mix of theoretical and practical information. Code examples are supplied both for theoretical concepts and for the case studies. Key references and summaries are provided at the end of each chapter so that you can explore those topics on your own.




Practical Predictive Analytics

 

 

 

 

 

 

 

 

 

 

 

Back to the future with R, Spark, and more!

 

 

 

 

 

 

 

 

 

Ralph Winters

 

BIRMINGHAM - MUMBAI


Practical Predictive Analytics

 

Copyright © 2017 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: June 2017

Production reference: 1300617

 

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78588-618-8

www.packtpub.com

Credits

Author: Ralph Winters
Copy Editor: Safis Editing
Reviewers: Alberto Boschetti, Armando Fandango
Project Coordinator: Shweta H Birwatkar
Commissioning Editor: Veena Pagare
Proofreader: Safis Editing
Acquisition Editor: Chandan Kumar
Indexer: Mariammal Chettiyar
Content Development Editor: Amrita Noronha
Graphics: Tania Dutta
Technical Editor: Sneha Hanchate
Production Coordinator: Melwyn Dsa

About the Author

Ralph Winters started his career as a database researcher for a music performing rights organization (he composed as well!), then branched out into healthcare survey research, finally landing in the analytics and information technology world. He has provided his statistical and analytics expertise to many large Fortune 500 companies in the financial, direct marketing, insurance, healthcare, and pharmaceutical industries. He has worked on many diverse types of predictive analytics projects involving customer retention, anti-money laundering, voice-of-the-customer text mining analytics, and healthcare risk and customer choice models.

He is currently a data architect for a healthcare services company, working in the data and advanced analytics group. He enjoys working collaboratively with a smart team of business analysts, technologists, and actuaries, as well as with other data scientists.

Ralph considers himself a practical person. In addition to authoring Practical Predictive Analytics for Packt Publishing, he contributed two tutorials illustrating the use of predictive analytics in medicine and healthcare to Practical Predictive Analytics and Decisioning Systems for Medicine (Miner et al., Elsevier, September 2014), and also presented Practical Text Mining with SQL using Relational Databases at the 2013 11th Annual Text and Social Analytics Summit in Cambridge, MA.

Ralph resides in New Jersey with his loving wife Katherine, amazing daughters Claire and Anna, and his four-legged friends, Bubba and Phoebe, who can be unpredictable.

Ralph's web site can be found at ralphwinters.com.

About the Reviewers

Armando Fandango serves as chief technology officer of REAL Inc., building AI-based products and platforms for making smart connections between brands, agencies, publishers, and audiences. Armando founded NeuraSights with the goal of creating insights from small and big data using neural networks and machine learning. Previously, as chief data scientist and chief technology officer (CTO) for Epic Engineering and Consulting Group LLC, Armando worked with government agencies and large private organizations to build smart products by incorporating machine learning, big data engineering, enterprise data repositories, and enterprise dashboards. Armando has led data science and engineering teams as head of data for Sonobi Inc., driving big data and predictive analytics technology and strategy for JetStream, Sonobi's AdTech platform. Armando has managed high-performance computing (HPC) consulting and infrastructure for the Advanced Research Computing Centre at UCF. Armando has also been advising high-tech startups Quantfarm, Cortxia Foundation, and Studyrite as an advisory board member and AI expert. Armando has authored a book titled Python Data Analysis - Second Edition and has published research in international journals and conferences.

 

Alberto Boschetti is a data scientist with strong expertise in signal processing and statistics. He holds a Ph.D. in telecommunication engineering and currently lives and works in London. In his work projects, he daily faces challenges spanning natural language processing (NLP), machine learning, and distributed processing. He is very passionate about his job and always tries to stay up to date with the latest developments in data science technologies by attending meetups, conferences, and other events. He is the author of Python Data Science Essentials, Regression Analysis with Python, and Large Scale Machine Learning with Python, all published by Packt.

 

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1785886185.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Getting Started with Predictive Analytics

Predictive analytics are in so many industries

Predictive Analytics in marketing

Predictive Analytics in healthcare

Predictive Analytics in other industries

Skills and roles that are important in Predictive Analytics

Related job skills and terms

Predictive analytics software

Open source software

Closed source software

Peaceful coexistence

Other helpful tools

Past the basics

Data analytics/research

Data engineering

Management

Team data science

Two different ways to look at predictive analytics

R

CRAN

R installation

Alternate ways of exploring R

How is a predictive analytics project organized?

Setting up your project and subfolders

GUIs

Getting started with RStudio

Rearranging the layout to correspond with the examples

Brief description of some important panes

Creating a new project

The R console

The source window

Creating a new script

Our first predictive model

Code description

Saving the script

Your second script

Code description

The predict function

Examining the prediction errors

R packages

The stargazer package

Installing stargazer package

Code description

Saving your work

References

Summary

The Modeling Process

Advantages of a structured approach

Ways in which structured methodologies can help

Analytic process methodologies

CRISP-DM and SEMMA

CRISP-DM and SEMMA chart

Agile processes

Six sigma and root cause

To sample or not to sample?

Using all of the data

Comparing a sample to the population

An analytics methodology outline – specific steps

Step 1 business understanding

Communicating business goals – the feedback loop

Internal data

External data

Tools of the trade

Process understanding

Data lineage

Data dictionaries

SQL

Example – Using SQL to get sales by region

Charts and plots

Spreadsheets

Simulation

Example – simulating if a customer contact will yield a sale

Example – simulating customer service calls

Step 2 data understanding

Levels of measurement

Nominal data

Ordinal data

Interval data

Ratio data

Converting from the different levels of measurement

Dependent and independent variables

Transformed variables

Single variable analysis

Summary statistics

Bivariate analysis

Types of questions that bivariate analysis can answer

Quantitative with quantitative variables

Code example

Nominal with nominal variables

Cross-tabulations

Mosaic plots

Nominal with quantitative variables

Point biserial correlation

Step 3 data preparation

Step 4 modeling

Description of specific models

Poisson (counts)

Logistic regression

Support vector machines (SVM)

Decision trees

Random forests

Example - comparing single decision trees to a random forest

An age decision tree

An alternative decision tree

The random forest model

Random forest versus decision trees

Variable importance plots

Dimension reduction techniques

Principal components

Clustering

Time series models

Naive Bayes classifier

Text mining techniques

Step 5 evaluation

Model validation

Area under the curve

Computing an ROC curve using the titanic dataset

In sample/out of sample tests, walk forward tests

Training/test/validation datasets

Time series validation

Benchmark against best champion model

Expert opinions: man against machine

Meta-analysis

Dart board method

Step 6 deployment

Model scoring

References

Notes

Summary

Inputting and Exploring Data

Data input

Text file Input

The read.table function

Database tables

Spreadsheet files

XML and JSON data

Generating your own data

Tips for dealing with large files

Data munging and wrangling

Joining data

Using the sqldf function

Housekeeping and loading of necessary packages

Generating the data

Examining the metadata

Merging data using Inner and Outer joins

Identifying members with multiple purchases

Eliminating duplicate records

Exploring the hospital dataset

Output from the str(df) function

Output from the View function

The colnames function

The summary function

Sending the output to an HTML file

Open the file in the browser

Plotting the distributions

Visual plotting of the variables

Breaking out summaries by groups

Standardizing data

Changing a variable to another type

Appending the variables to the existing dataframe

Extracting a subset

Transposing a dataframe

Dummy variable coding

Binning – numeric and character

Binning character data

Missing values

Setting up the missing values test dataset

The various types of missing data

Missing Completely at Random (MCAR)

Testing for MCAR

Missing at Random (MAR)

Not Missing at Random (NMAR)

Correcting for missing values

Listwise deletion

Imputation methods

Imputing missing values using the 'mice' package

Running a regression with imputed values

Imputing categorical variables

Outliers

Why outliers are important

Detecting outliers

Transforming the data

Tracking down the cause of the outliers

Ways to deal with outliers

Example – setting the outliers to NA

Multivariate outliers

Data transformations

Generating the test data

The Box-Cox Transform

Variable reduction/variable importance

Principal Components Analysis (PCA)

Where is PCA used?

A PCA example – US Arrests

All subsets regression

An example – airquality

Adjusted R-square plot

Variable importance

Variable influence plot

References

Summary

Introduction to Regression Algorithms

Supervised versus unsupervised learning models

Supervised learning models

Unsupervised learning models

Regression techniques

Advantages of regression

Generalized linear models

Linear regression using GLM

Logistic regression

The odds ratio

The logistic regression coefficients

Example - using logistic regression in health care to predict pain thresholds

Reading the data

Obtaining some basic counts

Saving your data

Fitting a GLM model

Examining the residuals

Residual plots

Added variable plots

Outliers in the regression

P-values and effect size

P-values and effect sizes

Variable selection

Interactions

Goodness of fit statistics

McFadden statistic

Confidence intervals and Wald statistics

Basic regression diagnostic plots

Description of the plots

An interactive game – guessing if the residuals are random

Goodness of fit – Hosmer-Lemeshow test

Goodness of fit example on the PainGLM data

Regularization

An example – ElasticNet

Choosing a correct lambda

Printing out the possible coefficients based on lambda

Summary

Introduction to Decision Trees, Clustering, and SVM

Decision tree algorithms

Advantages of decision trees

Disadvantages of decision trees

Basic decision tree concepts

Growing the tree

Impurity

Controlling the growth of the tree

Types of decision tree algorithms

Examining the target variable

Using formula notation in an rpart model

Interpretation of the plot

Printing a text version of the decision tree

The ctree algorithm

Pruning

Other options to render decision trees

Cluster analysis

Clustering is used in diverse industries

What is a cluster?

Types of clustering

Partitional clustering

K-means clustering

The k-means algorithm

Measuring distance between clusters

Clustering example using k-means

Cluster elbow plot

Extracting the cluster assignments

Graphically displaying the clusters

Cluster plots

Generating the cluster plot

Hierarchical clustering

Examining some examples from cluster 1

Examining some examples from cluster 2

Examining some examples from cluster 3

Support vector machines

Simple illustration of a mapping function

Analyzing consumer complaints data using SVM

Converting unstructured to structured data

References

Summary

Using Survival Analysis to Predict and Analyze Customer Churn

What is survival analysis?

Time-dependent data

Censoring

Left censoring

Right censoring

Our customer satisfaction dataset

Generating the data using probability functions

Creating the churn and no churn dataframes

Creating and verifying the new simulated variables

Recombining the churner and non-churners

Creating matrix plots

Partitioning into training and test data

Setting the stage by creating survival objects

Examining survival curves

Better plots

Contrasting survival curves

Testing for the gender difference between survival curves

Testing for the educational differences between survival curves

Plotting the customer satisfaction and number of service call curves

Improving the education survival curve by adding gender

Transforming service calls to a binary variable

Testing the difference between customers who called and those who did not

Cox regression modeling

Our first model

Examining the cox regression output

Proportional hazards test

Proportional hazard plots

Obtaining the cox survival curves

Plotting the curve

Partial regression plots

Examining subset survival curves

Comparing gender differences

Comparing customer satisfaction differences

Validating the model

Computing baseline estimates

Running the predict() function

Predicting the outcome at time 6

Determining concordance

Time-based variables

Changing the data to reflect the second survey

How survSplit works

Adjusting records to simulate an intervention

Running the time-based model

Comparing the models

Variable selection

Incorporating interaction terms

Displaying the formulas sublist

Comparing AIC among the candidate models

Summary

Using Market Basket Analysis as a Recommender Engine

What is market basket analysis?

Examining the groceries transaction file

Format of the groceries transaction Files

The sample market basket

Association rule algorithms

Antecedents and descendants

Evaluating the accuracy of a rule

Support

Calculating support

Examples

Confidence

Lift

Evaluating lift

Preparing the raw data file for analysis

Reading the transaction file

capture.output function

Analyzing the input file

Analyzing the invoice dates

Plotting the dates

Scrubbing and cleaning the data

Removing unneeded character spaces

Simplifying the descriptions

Removing colors automatically

The colors() function

Cleaning up the colors

Filtering out single item transactions

Looking at the distributions

Merging the results back into the original data

Compressing descriptions using camelcase

Custom function to map to camelcase

Extracting the last word

Creating the test and training datasets

Saving the results

Loading the analytics file

Determining the consequent rules

Replacing missing values

Making the final subset

Creating the market basket transaction file

Method one – Coercing a dataframe to a transaction file

Inspecting the transaction file

Obtaining the topN purchased items

Finding the association rules

Examining the rules summary

Examining the rules quality and observing the highest support

Confidence and lift measures

Filtering a large number of rules

Generating many rules

Plotting many rules

Method two – Creating a physical transactions file

Reading the transaction file back in

Plotting the rules

Creating subsets of the rules

Text clustering

Converting to a document term matrix

Removing sparse terms

Finding frequent terms

K-means clustering of terms

Examining cluster 1

Examining cluster 2

Examining cluster 3

Examining cluster 4

Examining cluster 5

Predicting cluster assignments

Using flexclust to predict cluster assignment

Running k-means to generate the clusters

Creating the test DTM

Running the apriori algorithm on the clusters

Summarizing the metrics

References

Summary

Exploring Health Care Enrollment Data as a Time Series

Time series data

Exploring time series data

Health insurance coverage dataset

Housekeeping

Read the data in

Subsetting the columns

Description of the data

Target time series variable

Saving the data

Determining all of the subset groups

Merging the aggregate data back into the original data

Checking the time intervals

Picking out the top groups in terms of average population size

Plotting the data using lattice

Plotting the data using ggplot

Sending output to an external file

Examining the output

Detecting linear trends

Automating the regressions

Ranking the coefficients

Merging scores back into the original dataframe

Plotting the data with the trend lines

Plotting all the categories on one graph

Adding labels

Performing some automated forecasting using the ets function

Converting the dataframe to a time series object

Smoothing the data using moving averages

Simple moving average

Computing the SMA using a function

Verifying the SMA calculation

Exponential moving average

Computing the EMA using a function

Selecting a smoothing factor

Using the ets function

Forecasting using ALL AGES

Plotting the predicted and actual values

The forecast (fit) method

Plotting future values with confidence bands

Modifying the model to include a trend component

Running the ets function iteratively over all of the categories

Accuracy measures produced by onestep

Comparing the Test and Training for the "UNDER 18 YEARS" group

Accuracy measures

References

Summary

Introduction to Spark Using R

About Spark

Spark environments

Cluster computing

Parallel computing

SparkR

Dataframes

Building our first Spark dataframe

Simulation

Importing the sample notebook

Notebook format

Creating a new notebook

Becoming large by starting small

The Pima Indians diabetes dataset

Running the code

Running the initialization code

Extracting the Pima Indians diabetes dataset

Examining the output

Output from the str() function

Output from the summary() function

Comparing outcomes

Checking for missing values

Imputing the missing values

Checking the imputations (reader exercise)

Missing values complete!

Calculating the correlation matrices

Calculating the column means

Simulating the data

Which correlations to use?

Checking the object type

Simulating the negative cases

Concatenating the positive and negative cases into a single Spark dataframe

Running summary statistics

Saving your work

Summary

Exploring Large Datasets Using Spark

Performing some exploratory analysis on positives

Displaying the contents of a Spark dataframe

Graphing using native graph features

Running pairwise correlations directly on a Spark dataframe

Cleaning up and caching the table in memory

Some useful Spark functions to explore your data

Count and groupby

Covariance and correlation functions

Creating new columns

Constructing a cross-tab

Contrasting histograms

Plotting using ggplot

Spark SQL

Registering tables

Issuing SQL through the R interface

Using SQL to examine potential outliers

Creating some aggregates

Picking out some potential outliers using a third query

Changing to the SQL API

SQL – computing a new column using the Case statement

Evaluating outcomes based upon the Age segment

Computing mean values for all of the variables

Exporting data from Spark back into R

Running local R packages

Using the pairs function (available in the base package)

Generating a correlation plot

Some tips for using Spark

Summary

Spark Machine Learning - Regression and Cluster Models

About this chapter/what you will learn

Reading the data

Running a summary of the dataframe and saving the object

Splitting the data into train and test datasets

Generating the training datasets

Generating the test dataset

A note on parallel processing

Introducing errors into the test data set

Generating a histogram of the distribution

Generating the new test data with errors

Spark machine learning using logistic regression

Examining the output:

Regularization Models

Predicting outcomes

Plotting the results

Running predictions for the test data

Combining the training and test dataset

Exposing the three tables to SQL

Validating the regression results

Calculating goodness of fit measures

Confusion matrix

Confusion matrix for test group

Distribution of average errors by group

Plotting the data

Pseudo R-square

Root-mean-square error (RMSE)

Plotting outside of Spark

Collecting a sample of the results

Examining the distributions by outcome

Registering some additional tables

Creating some global views

User exercise

Cluster analysis

Preparing the data for analysis

Reading the data from the global views

Inputting the previously computed means and standard deviations

Joining the means and standard deviations with the training data

Joining the means and standard deviations with the test data

Normalizing the data

Displaying the output

Running the k-means model

Fitting the model to the training data

Fitting the model to the test data

Graphically display cluster assignment

Plotting via the Pairs function

Characterizing the clusters by their mean values

Calculating mean values for the test data

Summary

Spark Models – Rule-Based Learning

Loading the stop and frisk dataset

Importing the CSV file to databricks

Reading the table

Running the first cell

Reading the entire file into memory

Transforming some variables to integers

Discovering the important features

Eliminating some factors with a large number of levels

Test and train datasets

Examining the binned data

Running the OneR model

Interpreting the output

Constructing new variables

Running the prediction on the test sample

Another OneR example

The rules section

Constructing a decision tree using Rpart

First collect the sample

Decision tree using Rpart

Plot the tree

Running an alternative model in Python

Running a Python Decision Tree

Reading the Stop and Frisk table

Indexing the classification features

Mapping to an RDD

Specifying the decision tree model

Producing a larger tree

Visual trees

Comparing train and test decision trees

Summary

Preface

This is a different kind of predictive analytics book. My original intention was to introduce predictive analytics techniques targeted towards legacy analytics folks, using open source tools.

However, I soon realized that there were certain aspects of legacy analytics tools that could benefit the new generation of data scientists. Having worked a large part of my career in enterprise data solutions, I was interested in writing about some different kinds of topics, such as analytics methodologies, agile, metadata, SQL analytics, and reproducible research, which are often neglected in data science/predictive analytics books, but are still critical to the success of an analytics project.

I also wanted to write about some underrepresented analytics techniques that extend beyond standard regression and classification tasks, such as using survival analysis to predict customer churn, and using market basket analysis as a recommendation engine.

Since there is a lot of movement towards cloud-based solutions, I thought it was important to cover cloud-based analytics (big data), so I included several chapters on developing predictive analytics solutions within a Spark environment.

Whatever your orientation is, a key point of this book is collaboration, and I hope that regardless of your definition of data science, predictive analytics, big data, or even a benign term such as forecasting, you will find something here that suits your needs.

Furthermore, I wanted to pay homage to the domain expert as part of the data science team. Often these analysts are not given fancy titles, but such business analysts can make the difference between a successful analytics project and one that falls flat on its face. Hopefully, some of the topics I discuss will strike a chord with them, and get them more interested in some of the technical concepts of predictive analytics.

When I was asked by Packt to write a book about predictive analytics, I first wondered what would be a good open source language to bridge the gap between legacy analytics and today's data scientist world. I thought about this considerably, since each language brings its own nuances in terms of how solutions to problems are expressed. However, I decided ultimately not to sweat the details, since predictive analytics concepts are not language-dependent, and the choice of language often is determined by personal preference as well as what is in use within the company in which you work.

I chose the R language because my background is in statistics, and I felt that R had good statistical rigor, now has reasonable integration with proprietary software such as SAS, and also integrates well with relational database systems and web protocols. It also has an excellent plotting and visualization system and, along with its many good user-contributed packages, covers most statistical and predictive analytics tasks.

Regarding statistics, I suggest that you learn as much statistics as you can. Knowing statistics can help you separate good models from bad, and help you identify many problems in bad data just by understanding basic concepts such as measures of central tendency (mean, median, mode), hypothesis testing, p-values, and effect sizes. It will also help you shy away from merely running a package in an automated way, and encourage you to look a little at what is under the hood.

One downside to R is that it processes data in memory, which can limit the size of the datasets you can work with on a single PC. For the datasets we use in this book, there should be no problem running R on a single PC. If you are interested in analyzing big data, I do spend several chapters discussing R and Spark within a cloud environment, in which you can process very large datasets that are distributed across many different computers.

Speaking of the datasets used in this book, I did not want to use the same datasets that you see analyzed repeatedly. Some of these datasets are excellent for demonstrating techniques, but I wanted some alternatives. However, I did not see a whole lot of alternatives that I thought would be useful for this book. Some were from unknown sources, some needed formal permission to use, and some lacked a good data dictionary. So, for many chapters, I ended up generating my own data using simulation techniques in R. I believe that was a good choice, since it enabled me to introduce some data-generating techniques that you can use in your own work.
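
As a small taste of this, the following is a minimal sketch (illustrative only, with made-up variable names, and not taken from the book's chapters) of how a customer dataset can be simulated with base R sampling functions:

    # Minimal data simulation sketch using base R (all names are illustrative)
    set.seed(1020)                                   # make the simulation reproducible
    n <- 1000
    customers <- data.frame(
      id        = 1:n,
      age       = round(rnorm(n, mean = 45, sd = 12)),            # roughly normal ages
      segment   = sample(c("retail", "wireless", "online"), n, replace = TRUE),
      purchases = rpois(n, lambda = 3),                           # simple count variable
      churned   = rbinom(n, size = 1, prob = 0.2)                 # about a 20% churn rate
    )
    head(customers)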

The data I used covers a good spectrum of marketing, retail and healthcare applications. I also would have liked to include some financial predictive analytics use cases but ran out of time. Maybe I will leave that for another book!

What this book covers

Chapter 1, Getting Started with Predictive Analytics, begins with a little bit of history of how predictive analytics developed. We then discuss some of the different roles of predictive analytics practitioners, and describe the industries in which they work. Ways to organize predictive analytics projects on a PC are discussed next, the R language is introduced, and we end the chapter with a short example of a predictive model.

Chapter 2, The Modeling Process, discusses how the development of predictive models can be organized into a series of stages, each with different goals, such as exploration and problem definition, leading to the actual development of a predictive model. We discuss two important analytics methodologies, CRISP-DM and SEMMA. Code examples are sprinkled throughout the chapter to demonstrate some of the ideas central to the methodologies, so you will, hopefully, never be bored.

Chapter 3, Inputting and Exploring Data, introduces various ways that you can bring your own input data into R. We also discuss various data preparation techniques using standard SQL functions as well as analogous methods using the R dplyr package. Have no data to input? No problem. We will show you how to generate your own human-like data using the R package wakefield.
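
As a hedged preview of that SQL-versus-dplyr comparison, here is a small sketch with a made-up sales dataframe (the column names are purely illustrative, and both packages are assumed to be installed):

    # Summarizing sales by region two ways (illustrative data)
    library(sqldf)    # run SQL directly against R dataframes
    library(dplyr)    # grammar-of-data-manipulation equivalent

    sales <- data.frame(region = c("East", "West", "East", "North"),
                        amount = c(100, 250, 75, 300))

    # SQL flavor
    sqldf("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

    # dplyr equivalent
    sales %>% group_by(region) %>% summarise(total = sum(amount))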

Chapter 4, Introduction to Regression Algorithms, begins with a discussion of supervised versus unsupervised algorithms. The rest of the chapter concentrates on regression algorithms, which represent the supervised algorithm category. You will learn about interpreting regression output, such as model coefficients and residual plots. There is even an interactive game that tests whether you can determine if a series of residuals is random or not.
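
For readers who want a glimpse before reaching the chapter, a minimal regression example on a built-in R dataset (not the chapter's own case study) looks something like this:

    # Fit a simple linear regression on the built-in mtcars dataset
    fit <- lm(mpg ~ wt + hp, data = mtcars)
    summary(fit)           # coefficients, p-values, and R-squared
    plot(fit, which = 1)   # residuals versus fitted values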

Chapter 5, Introduction to Decision Trees, Clustering, and SVM, concentrates on three other core predictive algorithms that have widespread use and, along with regression, can be used to solve many, if not most, of your predictive analytics problems. The last algorithm discussed, the Support Vector Machine (SVM), is often used with high-dimensional data such as unstructured text, so we accompany this example with some text mining techniques applied to customer complaint comments.
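
As a flavor of what is to come, here is a tiny decision tree and k-means sketch on the built-in iris data (purely illustrative; the chapter uses its own datasets):

    library(rpart)

    # Grow a small classification tree
    tree <- rpart(Species ~ ., data = iris, method = "class")
    printcp(tree)                        # complexity table used when pruning

    # Cluster the numeric columns with k-means
    set.seed(123)
    km <- kmeans(iris[, 1:4], centers = 3)
    table(km$cluster, iris$Species)      # compare the clusters with the actual species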

Chapter 6, Using Survival Analysis to Predict and Analyze Customer Churn, discusses a specific modeling technique known as survival analysis and follows a hypothetical customer marketing satisfaction and retention example. We will also delve more deeply into simulating customer choice using some sampling functions available in R.
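
A bare-bones survival analysis sketch, using the survival package and its bundled lung dataset rather than the simulated churn data built in the chapter, looks like this:

    library(survival)

    # Kaplan-Meier survival curves by sex on the bundled lung dataset
    fit <- survfit(Surv(time, status) ~ sex, data = lung)
    plot(fit, col = c("blue", "red"), xlab = "Days", ylab = "Survival probability")

    # A simple Cox proportional hazards model
    cox <- coxph(Surv(time, status) ~ age + sex, data = lung)
    summary(cox)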

Chapter 7, Using Market Basket Analysis as a Recommender Engine, introduces the concept of association rules and market basket analysis, and steps you through some techniques that can predict future purchases based upon various combinations of previous purchases from an online retail store. It also introduces some text analytics techniques coupled with some cluster analysis that places various customers into different segments. You will learn some additional data cleaning techniques, and learn how to generate some interesting association plots.
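
To illustrate the association rule idea ahead of time, here is a minimal arules sketch on the package's bundled Groceries transactions (the chapter itself works with a different online retail file):

    library(arules)
    data(Groceries)   # transactions bundled with the arules package

    # Mine association rules with minimum support and confidence thresholds
    rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))
    inspect(head(sort(rules, by = "lift"), 5))   # the top five rules by lift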

Chapter 8, Exploring Health Care Enrollment Data as a Time Series, introduces time series analytics. Healthcare enrollment data from the CMS website is first explored. Then we move on to defining some basic time series concepts such as simple and exponential moving averages. Finally, we work with the R forecast package which, as its name implies, helps you to perform some time series forecasting.
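
A quick forecast-package sketch on the built-in AirPassengers series (not the CMS enrollment data used in the chapter) shows the basic workflow:

    library(forecast)

    fit <- ets(AirPassengers)      # automatic exponential smoothing model selection
    fc  <- forecast(fit, h = 12)   # forecast the next 12 months
    plot(fc)                       # forecast plot with confidence bands
    accuracy(fit)                  # in-sample accuracy measures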

Chapter 9, Introduction to Spark Using R, introduces SparkR, which is an environment for accessing large Spark clusters using R; no local version of R needs to be installed. It also introduces Databricks, which is a cloud-based environment for running R (as well as Python, SQL, and other languages) against Spark-based big data. This chapter also demonstrates techniques for transforming small datasets into much larger Spark dataframes, using the Pima Indians diabetes dataset as a reference.
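
A rough illustration of the SparkR workflow follows; it assumes a Spark 2.x session is available (for example, inside a Databricks notebook, where the session is usually created for you), and it uses the built-in faithful data as a stand-in:

    library(SparkR)
    sparkR.session()               # assumes a local or cluster Spark 2.x installation

    # Promote a local R dataframe to a distributed Spark DataFrame
    df <- createDataFrame(faithful)
    printSchema(df)
    head(df)

    # A simple grouped aggregation on the Spark DataFrame
    head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))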

Chapter 10, Exploring Large Datasets Using Spark, shows how to perform some exploratory data analysis using a combination of SparkR and Spark SQL on the Pima Indians diabetes data loaded into Spark. We will learn the basics of exploring Spark data using some Spark-specific commands that allow us to filter, group, summarize, and visualize our Spark data.
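
A hedged sketch of the Spark SQL pattern described here (again using the faithful data as a stand-in for the Pima Indians data, and assuming the SparkR 2.x API):

    library(SparkR)
    sparkR.session()                          # assumes an existing Spark 2.x environment
    df <- createDataFrame(faithful)           # stand-in for the chapter's Pima data

    # Register the DataFrame so it can be queried with plain SQL
    createOrReplaceTempView(df, "geyser")

    # Group and summarize with SQL, then pull the small result back into local R
    res <- sql("SELECT waiting, COUNT(*) AS n FROM geyser GROUP BY waiting ORDER BY n DESC")
    head(collect(res))                        # collect() brings the result to the R session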

Chapter 11, Spark Machine Learning – Regression and Cluster Models, covers machine learning by first illustrating a logistic regression model that has been built using a Spark cluster. We will learn how to split Spark data into training and test data in Spark, run a logistic regression model, and then evaluate its performance.
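
As a rough sketch of that train/test split and model fit in SparkR (spark.logit and randomSplit are assumed to be available, as in Spark 2.1 and later; the mtcars data here is just a stand-in for the chapter's dataset):

    library(SparkR)
    sparkR.session()

    # Stand-in binary outcome data; the chapter uses the Pima Indians diabetes data
    df <- createDataFrame(mtcars)

    # Split into training and test Spark DataFrames
    splits <- randomSplit(df, c(0.7, 0.3), seed = 42)
    train  <- splits[[1]]
    test   <- splits[[2]]

    # Fit a logistic regression model (am is already coded 0/1)
    model <- spark.logit(train, am ~ wt + hp)
    summary(model)

    # Score the held-out data
    preds <- predict(model, test)
    head(select(preds, "am", "prediction"))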

Chapter 12, Spark Models - Rule-Based Learning, teaches you how to run decision tree models in Spark using the Stop and Frisk dataset. You will learn how to overcome some of the algorithmic limitations of the Spark MLlib environment by extracting some cluster samples to your local machine and then running some non-Spark algorithms that you are already familiar with. This chapter also introduces a new rule-based algorithm, OneR, and demonstrates how you can mix different languages, such as R, SQL, and even Python, in the same notebook using the %magic directive.

What you need for this book

This is neither an introductory predictive analytics book, nor an introductory book for learning R or Spark. Some knowledge of base R data manipulation techniques is expected. Some prior knowledge of predictive analytics is useful. As mentioned earlier, knowledge of basic statistical concepts such as hypothesis testing, correlation, means, standard deviations, and p-values will also help you navigate this book.

Who this book is for

This book is for those who have already had an introduction to R, and are looking to learn how to develop enterprise predictive analytics solutions. Additionally, traditional business analysts and managers who wish to extend their skills into predictive analytics using open source R may find the book useful. Existing predictive analytic practitioners who know another language, or those who wish to learn about analytics using Spark, will also find the chapters on Spark and R beneficial.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  • Log in or register to our website using your e-mail address and password.
  • Hover the mouse pointer on the SUPPORT tab at the top.
  • Click on Code Downloads & Errata.
  • Enter the name of the book in the Search box.
  • Select the book for which you're looking to download the code files.
  • Choose from the drop-down menu where you purchased this book from.
  • Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Predictive-Analytics. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/PracticalPredictiveAnalytics_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Getting Started with Predictive Analytics

"In God we trust, all others must bring Data"
- Deming

I enjoy working in predictive analytics and explaining it to people, because it is based upon a simple concept: predicting the probability of future events based upon historical data. Its history may date back to at least 650 BC. Some early examples include the Babylonians, who tried to predict short-term weather changes based on cloud appearances and halos (Weather Forecasting through the Ages, NASA).

Medicine also has a long history of needing to classify diseases. The Babylonian king Adad-apla-iddina decreed that medical records be collected to form the Diagnostic Handbook. Some predictions in this corpus list treatments based on the number of days the patient had been sick, and their pulse rate (Linda Miner et al., 2014). One of the first instances of bioinformatics!

In later times, specialized predictive analytics was developed at the onset of the insurance underwriting industry. It was used as a way to predict the risk associated with insuring marine vessels (https://www.lloyds.com/lloyds/about-us/history/corporate-history). At about the same time, life insurance companies began predicting the age to which a person would live in order to set the most appropriate premium rates.

Although the idea of prediction always seemed to be rooted early in the human need to understand and classify, it was not until the 20th century, and the advent of modern computing, that it really took hold.

In addition to helping the British government break codes in the 1940s, Alan Turing also worked on the initial computer chess algorithms that pitted man against machine. Monte Carlo simulation methods originated as part of the Manhattan Project, where mainframe computers crunched numbers for days in order to determine the probability of nuclear attacks (Computing and the Manhattan Project, n.d.).

In the 1950s, Operations Research (OR) theory developed, which includes techniques for optimizing routes, such as finding the shortest distance between two points. To this day, these techniques are used in logistics by companies such as UPS and Amazon.

Non-mathematicians have also gotten in on the act. In the 1970s, cardiologist Lee Goldman (who worked aboard a submarine) spent years developing a decision tree that diagnosed chest pain efficiently. This helped the staff determine whether or not the submarine needed to resurface in order to help the chest pain sufferers (Gladwell, 2005)!

What many of these examples had in common was that people first made observations about events which had already occurred, and then used this information to generalize and make decisions about what might occur in the future. Along with prediction came further understanding of cause and effect, and of how the various parts of the problem were interrelated. Discovery and insight came about through methodology and adhering to the scientific method.

Most importantly, they came about in order to find solutions to important, and often practical, problems of the times. That is what made them unique.

Predictive analytics are in so many industries

We have come a long way since then, and practical analytics solutions have furthered growth in so many different industries. The internet has had a profound effect on this; it has enabled every click to be stored and analyzed. More data is being collected and stored, some with very little effort, than ever before. That in itself has enabled more industries to enter predictive analytics.

Predictive Analytics in marketing

One industry that has embraced predictive analytics for quite a long time is marketing. Marketing has always been concerned with customer acquisition and retention, and has developed predictive models involving various promotional offers and customer touch points, all geared to keeping customers and acquiring new ones. This is very pronounced in certain segments of marketing, such as wireless and online shopping, in which customers are always searching for the best deal.

Specifically, advanced analytics can help answer questions such as: if I offer a customer 10% off with free shipping, will that yield more revenue than 15% off with no free shipping? The 360-degree view of the customer has expanded the number of ways one can engage with the customer, therefore enabling marketing mix and attribution modeling to become increasingly important. Location-based devices have enabled marketing predictive applications to incorporate real-time data, issuing recommendations to customers while they are in the store.

Predictive Analytics in healthcare

Predictive analytics in healthcare has its roots in clinical trials, which use carefully selected samples to test the efficacy of drugs and treatments. However, healthcare has been going beyond this. With the advent of sensors, data can be incorporated into predictive analytics to monitor patients with critical illness, and to send alerts to patients when they are at risk. Healthcare companies can now predict which individual patients will comply with courses of treatment advocated by health providers. This provides early warning signs to all parties, which can prevent future complications, as well as lower the total cost of treatment.

Predictive Analytics in other industries

Other examples can be found in just about every other industry. Here are just a few:

Finance:

Fraud detection is a huge area. Financial institutions are able to monitor clients' internal and external transactions for fraud, through pattern recognition and other machine learning algorithms, and then alert a customer concerning suspicious activity. Analytics are often performed in real time. This is a big advantage, as criminals can be very sophisticated and stay one step ahead of the previous analysis.

Wall Street program trading is another example. Trading algorithms predict intraday highs and lows, and decide when to buy and sell securities.

Sports management:

Sports management analytics is able to predict which sports events will yield the greatest attendance and institute variable ticket pricing based upon audience interest.

In baseball, a pitcher's entire game can be recorded and then digitally analyzed. Sensors can also be attached to a pitcher's arm to alert when a future injury might occur.

Higher education:

Colleges can predict how many, and which kind of, students are likely to attend the next semester, and can plan resources accordingly. This is a challenge that is beginning to surface now; many schools may be looking at how scoring changes made to the SAT in 2016 are affecting admissions.

Time-based assessments of online modules can enable professors to identify students' potential problem areas, and tailor individual instruction.

Government:

Federal and state governments have embraced the open data concept and have made more data available to the public, which has empowered Citizen Data Scientists to help solve critical social and governmental problems.

The potential use of data for emergency services, traffic safety, and healthcare is overwhelmingly positive.

Although these industries can be quite different, predictive analytics is typically implemented to increase revenue, decrease costs, or alter outcomes for the better.

Skills and roles that are important in Predictive Analytics

So what skills do you need to be successful in predictive analytics? I believe that there are three basic skills:

Algorithmic/statistical/programming skills: These are the actual technical skills needed to implement a technical solution to a problem. I bundle these together since they are typically used in tandem. Will it be a purely statistical solution, or will there need to be a bit of programming thrown in to customize an algorithm and clean the data? There are always multiple ways of doing the same task, and it will be up to you, the predictive modeler, to determine how it is to be done.

Business skills: These are the skills needed for communicating thoughts and ideas among groups of interested parties. Business and data analysts who have worked in certain industries for long periods of time, and know their business very well, are increasingly being called upon to participate in predictive analytics projects. Data science is becoming a team sport; most projects involve working with others in the organization and summarizing findings, so good presentation and documentation skills are important. You will often hear the term domain knowledge associated with this. Domain knowledge is important since it allows you to apply your particular analytics skills to the particular analytic problems of whatever business you work (or wish to work) within. Every business has its own nuances when it comes to solving analytic problems. If you do not have the time or inclination to learn all about the inner workings of the problem at hand yourself, partner with someone who does. That will be the start of a great team!

Data storage/Extract, Transform, and Load (ETL) skills: This can refer to specialized knowledge regarding extracting data and storing it in a relational or non-relational NoSQL data store. Historically, these tasks were handled exclusively within a data warehouse. But now that the age of big data is upon us, specialists have emerged who understand the intricacies of data storage and the best way to organize it.

Related job skills and terms

Along with the term predictive analytics, here are some terms that are very much related:

Predictive modeling: This specifically means using a mathematical/statistical model to predict the likelihood of a dependent or target variable. You may still be able to predict without one; however, if there is no underlying model, it is not a predictive model.

Artificial intelligence (AI): A broader term for how machines are able to rationalize and solve problems. AI's early days were rooted in neural networks.

Machine learning: A subset of AI. It specifically deals with how a machine learns automatically from data, usually to try to replicate human decision-making or to best it. At this point, everyone knows about Watson, which beat two human opponents in Jeopardy!

Data science: Data science encompasses predictive analytics but also adds algorithmic development via coding, and good presentation skills via visualization.

Data engineering: Data engineering concentrates on data extraction and data preparation processes, which allow raw data to be transformed into a form suitable for analytics. A knowledge of system architecture is important. The data engineer will typically produce the data to be used by the predictive analysts (or data scientists).

Data analyst/business analyst/domain expert: This is an umbrella term for someone who is well versed in the way the business at hand works, and is an invaluable person to learn from in terms of what may have meaning, and what may not.

Statistics: The classical form of inference, typically done via hypothesis testing. Statistics also forms the basis for the probability distributions used in machine learning, and is closely tied with predictive analytics and data science.

Predictive analytics software

Originally, predictive analytics was performed by hand, by statisticians on mainframe computers, using a progression of various languages such as FORTRAN. Some of these languages are still very much in use today. FORTRAN, for example, is still one of the fastest-performing languages around, and operates with very little memory. So, although it may no longer be as widespread in predictive model development as other languages, it certainly can be used to implement models in a production environment.

Nowadays, there are many choices about which software to use, and many loyalists remain true to their chosen software. The reality is that for solving a specific type of predictive analytics problem, there exists a certain amount of overlap, and certainly the goal is the same. Once you get the hang of the methodologies used for predictive analytics in one software package, it should be fairly easy to translate your skills to another package.

Open source software

Open source emphasizes agile development and community sharing. Of course, open source software is free, but free must also be balanced in the context of Total Cost of Ownership (TCO). TCO includes everything that is factored into a software's cost over a period of time: not only the cost of the software itself, but also training, infrastructure setup, maintenance, and people costs, as well as other expenses associated with the quick upgrade and development cycles that exist in some products.

Closed source software

Closed source (or proprietary) software such as SAS and SPSS was at the forefront of predictive analytics, and has continued to this day to extend its reach beyond the traditional realm of statistics and machine learning. Closed source software emphasizes stability, better support, and security, with better memory management, which are important factors for some companies.

Peaceful coexistence

There is much debate nowadays regarding which one is better. My prediction is that they will both coexist peacefully, with one not replacing the other. Data sharing and common APIs will become more common. Each has its place within the data architecture and ecosystem that are deemed correct for a company. Each company will emphasize certain factors, and both open and closed software systems are constantly improving themselves. So, in terms of learning one or the other, it is not an either/or decision. Predictive analytics, per se, does not care what software you use. Please be open to the advantages offered by both open and closed software. If you do, that will certainly open up possibilities for working with different kinds of companies and technologies.

Other helpful tools

Man does not live by bread alone, so it would behoove you to learn some additional tools beyond R, so as to advance your analytics skills:

SQL: SQL is a valuable tool to know, regardless of which language/package/environment you choose to work in. Virtually every analytics tool will have a SQL interface, and knowledge of how to optimize SQL queries will definitely speed up your productivity, especially if you are doing a lot of data extraction directly from a SQL database. Today's common thought is to do as much pre-processing as possible within the database, so if you will be doing a lot of extracting from databases such as MySQL, PostgreSQL, Oracle, or Teradata, it will be a good thing to learn how queries are optimized within their native framework. In the R language, there are several SQL packages that are useful for interfacing with various external databases. We will be using sqldf, which is a popular R package for running SQL against R dataframes (a short sqldf sketch follows this list). There are other packages that are specifically tailored to the specific database you will be working with.

Web extraction tools: Not every data source will originate from a data warehouse. Knowledge of APIs that extract data from the internet will be valuable. Some popular tools include curl and jsonlite.

Spreadsheets: Despite their problems, spreadsheets are often the fastest way to do quick data analysis and, more importantly, enable you to share your results with others! R offers several interfaces to spreadsheets but, again, learning standalone spreadsheet skills such as pivot tables and Visual Basic for Applications will give you an advantage if you work for corporations in which these skills are heavily used.

Data visualization tools: Data visualization tools are great for adding impact to an analysis, and for concisely encapsulating complex information. Native R visualization tools are great, but not every company will be using R. Learn some third-party visualization tools such as D3.js, Google Charts, Qlikview, or Tableau.

Big data, Spark, Hadoop, NoSQL databases: It is becoming increasingly important to know a little bit about these technologies, at least from the viewpoint of having to extract and analyze data that resides within these frameworks. Many software packages have APIs that talk directly to Hadoop and can run predictive analytics directly within the native environment, or extract data and perform the analytics locally.
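
As promised in the SQL item above, here is a minimal sqldf sketch showing an ordinary SQL join run directly against two R dataframes (the table and column names are made up):

    library(sqldf)

    customers <- data.frame(id = 1:3, name = c("Ann", "Bob", "Carl"))
    orders    <- data.frame(id = c(1, 1, 3), amount = c(20, 35, 15))

    # An inner join and aggregate written as ordinary SQL against the dataframes
    sqldf("SELECT c.name, SUM(o.amount) AS total
           FROM customers c JOIN orders o ON c.id = o.id
           GROUP BY c.name")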

Past the basics

Given that the predictive analytics space is so huge, once you are past the basics, ask yourself what area of predictive analytics really interests you, and what you would like to specialize in. Learning all you can about everything concerning predictive analytics is good at the beginning, but ultimately you will be called upon because you are an expert in certain industries or techniques. This could be research, algorithmic development, or even managing analytics teams.

Data analytics/research

As general guidance, if you are involved in, or are oriented toward, the data analytics or research portion of data science, I would suggest that you concentrate on data mining methodologies and the specific data modeling techniques that are heavily prevalent in the industries that interest you.

For example, logistic regression is heavily used in the insurance industry, but social network analysis is not. Economic research is geared toward time series analysis, but not so much cluster analysis. Recommender engines are prevalent in online retail.

Data engineering

If you are involved more on the data engineering side, concentrate more on data cleaning, being able to integrate various data sources, and the tools needed to accomplish this.

Management

If you are a manager, concentrate on model development, testing and control, metadata, and presenting results to upper management in order to demonstrate value or return on investment.

Team data science

Of course, predictive analytics is becoming more of a team sport, rather than a solo endeavor, and the data science team is very much alive. There is a lot that has been written about the components of a data science team, much of which can be reduced to the three basic skills that I outlined earlier.

Two different ways to look at predictive analytics

Various industries interpret the goals of predictive analytics differently. For example, social science and marketing like to understand the factors which go into a model, and can sacrifice a bit of accuracy if a model can be explained well enough. On the other hand, a black box stock trading model is more interested in minimizing the number of bad trades, and at the end of the day tallies up the gains and losses, not really caring which parts of the trading algorithm worked. Accuracy is more important in the end.

Depending upon how you intend to approach a particular problem, look at how two different analytical mindsets can affect the predictive analytics process:

Minimize prediction error goal: This is a very common use case for machine learning. The initial goal is to predict, using the appropriate algorithms, in order to minimize the prediction error. If done incorrectly, an algorithm will ultimately fail and will need to be continually optimized to come up with the new best algorithm. If this is performed mechanically, without regard to understanding the model, it will certainly result in failed outcomes. Certain models, especially over-optimized ones with many variables, can have a very high prediction rate but be unstable in a variety of ways. If one does not have an understanding of the model, it can be difficult to react to changes in the data input.

Understanding model goal: This came out of the scientific method and is tied closely to the concept of hypothesis testing. This can be done in certain kinds of models, such as regression and decision trees, and is more difficult in other kinds of models, such as Support Vector Machine (SVM) and neural networks