Description

This book will teach you advanced techniques in machine learning with the latest code in R 3.3.2. You will delve into statistical learning theory and supervised learning, design efficient algorithms, learn about creating recommendation engines, use multi-class classification and deep learning, and more.
You will explore, in depth, topics such as data mining, classification, clustering, regression, predictive modeling, anomaly detection, and boosted trees with XGBOOST. More than just knowing the outcome, you'll understand how these concepts work and what they do.
With a gentle learning curve on topics such as neural networks, you will explore deep learning and more. By the end of this book, you will be able to perform machine learning with R in the cloud, using AWS in various scenarios with different datasets.




Title Page

Mastering Machine Learning with R

Second Edition

               

Advanced prediction, algorithms, and learning methods with R 3.x

           

Cory Lesmeister

 

BIRMINGHAM - MUMBAI

Copyright

Mastering Machine Learning with R

Second Edition

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2015

Second edition: April 2017

Production reference: 1140417

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham 
B3 2PB, UK.

ISBN 978-1-78728-747-1

www.packtpub.com

Credits

Author

 

Cory Lesmeister

Copy Editor

Manisha Sinha

Reviewers

 

Doug Ortiz

Miroslav Kopecky

Project Coordinator

 

Nidhi Joshi

Commissioning Editor

 

Veena Pagare

Proofreader

 

Safis Editing

Acquisition Editor

 

Tushar Gupta

Indexer

 

Mariammal Chettiyar 

Content Development Editors

 

Manthan Raja

Jagruti Babaria

Graphics

 

Tania Dutta

Technical Editor

 

Dharmendra Yadav

Production Coordinator

 

Shraddha Falebhai

  

About the Author

Cory Lesmeister has over a dozen years of quantitative experience and is currently a Senior Quantitative Manager in the banking industry, responsible for building marketing and regulatory models. Cory spent 16 years at Eli Lilly and Company in sales, market research, Lean Six Sigma, marketing analytics, and new product forecasting. A former U.S. Army active duty and reserve officer, Cory was in Baghdad, Iraq, in 2009 serving as the strategic advisor to the 29,000-person Iraqi Oil Police, where he supplied equipment to help the country secure and protect its oil infrastructure. An aviation aficionado, Cory has a BBA in aviation administration from the University of North Dakota and a commercial helicopter license.

About the Reviewers

Doug Ortiz is an independent consultant who has been architecting, developing, and integrating enterprise solutions throughout his career. Organizations that leverage his skill set have been able to rediscover and reuse their underutilized data via existing and emerging technologies such as the Microsoft BI stack, Hadoop, NoSQL databases, SharePoint, and related toolsets and technologies.

He is the founder of Illustris, LLC, and can be reached at [email protected].

Interesting aspects of his profession are listed here:

Has experience integrating multiple platforms and products

Helps organizations gain a deeper understanding and value of their current investments in data and existing resources, turning them into useful sources of information

Has improved, salvaged, and architected projects by utilizing unique and innovative techniques

His hobbies include yoga and scuba diving.

   

Miroslav Kopecky has been a passionate JVM enthusiast since the moment he joined Sun Microsystems in 2002. He truly believes in distributed system design, concurrency, and parallel computing. One of Miro's favorite hobbies is the development of autonomic systems. He is one of the co-authors of and main contributors to the open source Java IoT/robotics framework Robo4J. Miro is currently working on an online energy trading platform for enmacc.de as a senior software developer.

I would like to thank my family and my wife, Tanja, for their great support while I was reviewing this book.

Packt Upsell

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787287475.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

A Process for Success

The process

Business understanding

Identifying the business objective

Assessing the situation

Determining the analytical goals

Producing a project plan

Data understanding

Data preparation

Modeling

Evaluation

Deployment

Algorithm flowchart

Summary

Linear Regression - The Blocking and Tackling of Machine Learning

Univariate linear regression

Business understanding

Multivariate linear regression

Business understanding

Data understanding and preparation

Modeling and evaluation

Other linear model considerations

Qualitative features

Interaction terms

Summary

Logistic Regression and Discriminant Analysis

Classification methods and linear regression

Logistic regression

Business understanding

Data understanding and preparation

Modeling and evaluation

The logistic regression model

Logistic regression with cross-validation

Discriminant analysis overview

Discriminant analysis application

Multivariate Adaptive Regression Splines (MARS)

Model selection

Summary

Advanced Feature Selection in Linear Models

Regularization in a nutshell

Ridge regression

LASSO

Elastic net

Business case

Business understanding

Data understanding and preparation

Modeling and evaluation

Best subsets

Ridge regression

LASSO

Elastic net

Cross-validation with glmnet

Model selection

Regularization and classification

Logistic regression example 

Summary

More Classification Techniques - K-Nearest Neighbors and Support Vector Machines

K-nearest neighbors

Support vector machines

Business case

Business understanding

Data understanding and preparation

Modeling and evaluation

KNN modeling

SVM modeling

Model selection

Feature selection for SVMs

Summary

Classification and Regression Trees

An overview of the techniques

Understanding the regression trees

Classification trees

Random forest

Gradient boosting

Business case

Modeling and evaluation

Regression tree

Classification tree

Random forest regression

Random forest classification

Extreme gradient boosting - classification

Model selection

Feature selection with random forests

Summary

Neural Networks and Deep Learning

Introduction to neural networks

Deep learning, a not-so-deep overview

Deep learning resources and advanced methods

Business understanding

Data understanding and preparation

Modeling and evaluation

An example of deep learning

H2O background

Data upload to H2O

Create train and test datasets

Modeling

Summary

Cluster Analysis

Hierarchical clustering

Distance calculations

K-means clustering

Gower and partitioning around medoids

Gower

PAM

Random forest

Business understanding

Data understanding and preparation

Modeling and evaluation

Hierarchical clustering

K-means clustering

Gower and PAM

Random forest and PAM

Summary

Principal Components Analysis

An overview of the principal components

Rotation

Business understanding

Data understanding and preparation

Modeling and evaluation

Component extraction

Orthogonal rotation and interpretation

Creating factor scores from the components

Regression analysis

Summary

Market Basket Analysis, Recommendation Engines, and Sequential Analysis

An overview of a market basket analysis

Business understanding

Data understanding and preparation

Modeling and evaluation

An overview of a recommendation engine

User-based collaborative filtering

Item-based collaborative filtering

Singular value decomposition and principal components analysis

Business understanding and recommendations

Data understanding, preparation, and recommendations

Modeling, evaluation, and recommendations

Sequential data analysis

Sequential analysis applied

Summary

Creating Ensembles and Multiclass Classification

Ensembles

Business and data understanding

Modeling evaluation and selection

Multiclass classification

Business and data understanding

Model evaluation and selection

Random forest

Ridge regression

MLR's ensemble

Summary

Time Series and Causality

Univariate time series analysis

Understanding Granger causality

Business understanding

Data understanding and preparation

Modeling and evaluation

Univariate time series forecasting

Examining the causality

Linear regression

Vector autoregression

Summary

Text Mining

Text mining framework and methods

Topic models

Other quantitative analyses

Business understanding

Data understanding and preparation

Modeling and evaluation

Word frequency and topic models

Additional quantitative analysis

Summary

R on the Cloud

Creating an Amazon Web Services account

Launch a virtual machine

Start RStudio

Summary

R Fundamentals

Getting R up-and-running

Using R

Data frames and matrices

Creating summary statistics

Installing and loading R packages

Data manipulation with dplyr

Summary

Sources

What this book covers

Here is a list of changes from the first edition by chapter:

Chapter 1, A Process for Success, has the flowchart redone to correct an unintended typo and to add additional methodologies.

Chapter 2, Linear Regression – the Blocking and Tackling of Machine Learning, has the code improved, and better charts have been provided; other than that, it remains relatively close to the original.

Chapter 3, Logistic Regression and Discriminant Analysis, has the code improved and streamlined. One of my favorite techniques, multivariate adaptive regression splines, has been added; it performs well, handles non-linearity, and is easy to explain. It is my base model, with others becoming "challengers" to try and outperform it.

Chapter 4, Advanced Feature Selection in Linear Models, has techniques not only for regression but also for a classification problem included.

Chapter 5, More Classification Techniques – K-Nearest Neighbors and Support Vector Machines, has the code streamlined and simplified.

Chapter 6, Classification and Regression Trees, has the addition of the very popular techniques provided by the XGBOOST package. Additionally, I added the technique of using random forest as a feature selection tool.

Chapter 7, Neural Networks and Deep Learning, has been updated with additional information on deep learning methods and has improved code for the H2O package, including hyper-parameter search.

Chapter 8, Cluster Analysis, has the methodology of doing unsupervised learning with random forests added.

Chapter 9, Principal Components Analysis, uses a different dataset, and an out-of-sample prediction has been added.

Chapter 10, Market Basket Analysis, Recommendation Engines, and Sequential Analysis, has the addition of sequential analysis, which, I'm discovering, is more and more important, especially in marketing.

Chapter 11, Creating Ensembles and Multiclass Classification, has completely new content, using several great packages.

Chapter 12, Time Series and Causality, has a couple of additional years of climate data added, along with a demonstration of different methods of causality testing.

Chapter 13, Text Mining, has additional data and improved code.

Chapter 14, R on the Cloud, is another chapter of new content, allowing you to get R on the cloud, simply and quickly.

Appendix A, R Fundamentals, has additional data manipulation methods.

Appendix B, Sources, has a list of sources and references.

What you need for this book

As R is free and open source software, you will only need to download and install it from https://www.r-project.org/. Although it is not mandatory, it is highly recommended that you also download and install the RStudio IDE from https://www.rstudio.com/products/RStudio/.

Who this book is for

This book is for data science professionals, data analysts, or anyone with a working knowledge of machine learning with R who now wants to take their skills to the next level and become an expert in the field.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Machine-Learning-with-R-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringMachineLearningwithRSecondEdition_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to the list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

A Process for Success

"If you don't know where you are going, any road will get you there."                                                                                                              - Robert Carrol
"If you can't describe what you are doing as a process, you don't know what you're doing."                                                                                                              - W. Edwards Deming

At first glance, this chapter may seem to have nothing to do with machine learning, but it has everything to do with machine learning (specifically, its implementation and making change happen). The smartest people, the best software, and the best algorithms do not guarantee success, no matter how well the problem is defined.

In most, if not all, projects, the key to successfully solving problems or improving decision-making is not the algorithm, but the softer, more qualitative skills of communication and influence. The problem many of us have with this is that it is hard to quantify how effective one is with these skills. It is probably safe to say that many of us ended up in this position because of a desire to avoid them. After all, the highly successful TV comedy The Big Bang Theory was built on this premise. Therefore, the goal of this chapter is to set you up for success. The intent is to provide a process, a flexible process no less, whereby you can become a change agent: a person who can influence and turn their insights into action without positional power. We will focus on the Cross-Industry Standard Process for Data Mining (CRISP-DM). It is probably the most well-known and respected of all processes for analytical projects. Even if you use another industry process or something proprietary, there should still be a few gems in this chapter that you can take away.

I will not hesitate to say that all of this is easier said than done; without question, I'm guilty of every sin (both commission and omission) that will be discussed in this chapter. With skill and some luck, you can avoid the many physical and emotional scars I've picked up over the last 12 years.

Finally, we will also have a look at a flowchart (a cheat sheet) that you can use to help you identify what methodologies to apply to the problem at hand.

The process

The CRISP-DM process was designed specifically for data mining. However, it is flexible and thorough enough to be applied to any analytical project, whether it is predictive analytics, data science, or machine learning. Don't be intimidated by the numerous lists of tasks as you can apply your judgment to the process and adapt it for any real-world situation. The following figure provides a visual representation of the process and shows the feedback loops that make it so flexible:

Figure 1: CRISP-DM 1.0, Step-by-step data mining guide

The process has the following six phases:

Business understanding

Data understanding

Data preparation

Modeling

Evaluation

Deployment

For an in-depth review of the entire process with all of its tasks and subtasks, you can examine the paper by SPSS, CRISP-DM 1.0, step-by-step data mining guide, available at https://the-modeling-agency.com/crisp-dm.pdf.

I will discuss each of the steps in the process, covering the important tasks. However, the discussion will not be as detailed as the guide, but more high-level. We will not skip any of the critical details, but will focus more on the techniques that one can apply to the tasks. Keep in mind that these process steps will be used in later chapters as a framework in the actual application of the machine learning methods in general, and the R code in particular.

Business understanding

One cannot overstate how important this first step of the process is in achieving success. It is the foundational step, and failure or success here will likely determine failure or success for the rest of the project. The purpose of this step is to identify the requirements of the business so that you can translate them into analytical objectives. It has the following four tasks:

Identifying the business objective.

Assessing the situation.

Determining analytical goals.

Producing a project plan.

Identifying the business objective

The key to this task is to identify the goals of the organization and frame the problem. An effective question to ask is, "What are we going to do differently?" This may seem like a benign question, but it can really challenge people to work out what they need from an analytical perspective, and it can get to the root of the decision that needs to be made. It can also prevent you from going out and doing a lot of unnecessary work on some kind of "fishing expedition." As such, the key for you is to identify the decision. A working definition of a decision can be put forward to the team as the irrevocable choice to commit or not commit resources. Additionally, remember that the choice to do nothing different is indeed a decision.

This does not mean that a project should not be launched if the choices are not absolutely clear. There will be times when the problem is not, or cannot be, well defined; to paraphrase former Defense Secretary Donald Rumsfeld, there are known-unknowns. Indeed, there will probably be many times when the problem is ill defined and the project's main goal is to further the understanding of the problem and generate hypotheses; again calling on Secretary Rumsfeld, unknown-unknowns, which means that you don't know what you don't know. However, with ill-defined problems, one could go forward with an understanding of what will happen next in terms of resource commitment based on the various outcomes from hypothesis exploration.

Another thing to consider in this task is the management of expectations. There is no such thing as perfect data, no matter what its depth and breadth are. This is not the time to make guarantees but to communicate what is possible, given your expertise.

I recommend a couple of outputs from this task. The first is a mission statement. This is not the touchy-feely mission statement of an organization, but it is your mission statement or, more importantly, the mission statement approved by the project sponsor. I stole this idea from my years of military experience and I could write volumes on why it is effective, but that is for another day. Let's just say that, in the absence of clear direction or guidance, the mission statement, or whatever you want to call it, becomes the unifying statement for all stakeholders and can help prevent scope creep. It consists of the following points:

Who: This is yourself, the team, or the project name; everyone likes a cool project name, for example, Project Viper, Project Fusion, and so on

What: This is the task that you will perform, for example, conducting machine learning

When: This is the deadline

Where: This could be geographical, or by function, department, initiative, and so on

Why: This is the purpose behind implementing the project, that is, the business goal

The second output is as clear a definition of success as possible. Literally, ask, "What does success look like?" Help the team/sponsor paint a picture of success that you can understand. Your job then is to translate this into modeling requirements.

Assessing the situation

This task helps you in project planning by gathering information on the resources available, constraints, and assumptions; identifying the risks; and building contingency plans. I would further add that this is also the time to identify the key stakeholders that will be impacted by the decision(s) to be made.

A couple of points here. When examining the resources that are available, do not neglect to scour the records of past and current projects. Odds are that someone in the organization has worked on, or is working on, the same problem, and it may be essential to synchronize your work with theirs. Don't forget to enumerate the risks in terms of time, people, and money. Do everything in your power to create a list of stakeholders, both those who impact your project and those who could be impacted by it. Identify who these people are and how they can influence, or be impacted by, the decision. Once this is done, work with the project sponsor to formulate a communication plan with these stakeholders.

Determining the analytical goals

Here, you are looking to translate the business goal into technical requirements. This includes translating the success criterion from the business objective into a criterion of technical success. This might be something such as RMSE or a level of predictive accuracy.
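For illustration, here is a minimal sketch in R of how two such technical success criteria might be computed; the actual and predicted vectors are hypothetical placeholders:

# Hypothetical actual and predicted values for a regression problem
actual    <- c(10, 12, 9, 15, 14)
predicted <- c(11, 11, 10, 14, 16)

# Root Mean Squared Error (RMSE): the square root of the mean
# squared difference between actual and predicted values
rmse <- sqrt(mean((actual - predicted)^2))
rmse

# Predictive accuracy for a hypothetical binary classifier:
# the proportion of predictions that match the true labels
truth <- c(1, 0, 1, 1, 0)
pred  <- c(1, 0, 0, 1, 0)
accuracy <- mean(truth == pred)
accuracy

Agreeing on a concrete number such as this with the sponsor up front gives the team an unambiguous target to evaluate against later.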

Producing a project plan

The task here is to build an effective project plan with all the information gathered up to this point. Regardless of what technique you use, whether it be a Gantt chart or some other graphic, produce it and make it a part of your communication plan. Make this plan widely available to the stakeholders and update it on a regular basis and as circumstances dictate.

Data understanding

After enduring the all-important pain of the first step, you can now get busy with the data. The tasks in this process consist of the following:

Collecting the data.

Describing the data.

Exploring the data.

Verifying the data quality.

This step is the classic case of Extract, Transform, Load (ETL). There are some considerations here. You need to make an initial determination that the data available is adequate to meet your analytical needs. As you explore the data, visually and otherwise, determine whether the variables are sparse and identify the extent to which data may be missing. This may drive the learning method that you use and/or determine whether the imputation of the missing data is necessary and feasible.
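As a quick sketch of such an exploration in base R (the data frame df here is a hypothetical placeholder used only for illustration):

# Hypothetical data frame with some missing values
df <- data.frame(
  age    = c(25, NA, 41, 33, NA),
  income = c(52000, 61000, NA, 45000, 58000)
)

# Count and proportion of missing values per variable; a high
# proportion may drive the learning method you choose and whether
# imputation is necessary and feasible
colSums(is.na(df))
colMeans(is.na(df))

# Number of rows with at least one missing value
sum(!complete.cases(df))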

Verifying the data quality is critical. Take the time to understand who collects the data, how it is collected, and even why it is collected. It is likely that you will stumble upon incomplete data collection, cases where unintended IT issues led to errors in the data, or planned changes in the business rules. This is critical in time series, where business rules on how the data is classified often change over time. Finally, it is a good idea to begin documenting any code at this step. As a part of the documentation process, if a data dictionary is not available, save yourself potential heartache and make one.

Data preparation

Almost there! This step has the following five tasks:

Selecting the data.

Cleaning the data.

Constructing the data.

Integrating the data.

Formatting the data.

These tasks are relatively self-explanatory. The goal is to get the data ready to input into the algorithms. This includes merging, feature engineering, and transformations. If imputation is needed, then it happens here as well. Additionally, with R, pay attention to how the outcome needs to be labeled. If your outcome/response variable is Yes/No, it may not work in some packages and will require transformation into a variable coded 1/0. At this point, you should also break your data into the various sets if applicable: train, test, or validate. This step can be an unmitigated burden, but most experienced people will tell you that it is where you can separate yourself from your peers. With this, let's move on to the payoff, where you earn your money.
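Before we do, here is a minimal sketch in base R of the outcome recoding and train/test split just described; the data frame and its variables are hypothetical placeholders:

# Hypothetical data with a Yes/No response
set.seed(123)  # for reproducibility
df <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  outcome = sample(c("Yes", "No"), 100, replace = TRUE)
)

# Recode a Yes/No response to 1/0 for packages that require it
df$y <- ifelse(df$outcome == "Yes", 1, 0)

# A simple 70/30 train/test split
train_idx <- sample(seq_len(nrow(df)), size = 0.7 * nrow(df))
train <- df[train_idx, ]
test  <- df[-train_idx, ]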

Modeling

This is where all the work that you've done up to this point can lead to fist-pumping exuberance or fist-pounding exasperation. But hey, if it was that easy, everyone would be doing it. The tasks are as follows:

Selecting a modeling technique.

Generating a test design.

Building a model.

Assessing a model.

Oddly, this process step includes the considerations that you have already thought of and prepared for. In the first step, you will need at least some idea about how you will be modeling. Remember that this is a flexible, iterative process and not some strict linear flowchart such as an aircrew checklist.

The cheat sheet included in this chapter should help guide you in the right direction for the modeling techniques. Test design refers to the creation of your test and train datasets and/or the use of cross-validation, and this should have been thought of and accounted for in the data preparation.

Model assessment involves comparing the models with the criteria/criterion that you developed in the business understanding step, for example, RMSE, lift, ROC, and so on.
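As a sketch of one such comparison, assuming the pROC package is installed, two hypothetical classifiers could be compared by the area under the ROC curve (AUC):

# Compare two hypothetical classifiers by AUC, assuming the
# pROC package is installed (install.packages("pROC"))
library(pROC)

set.seed(123)
labels <- sample(c(0, 1), 200, replace = TRUE)

# Hypothetical predicted probabilities from two models; the second
# is constructed to be more informative than the first
probs_model1 <- runif(200)
probs_model2 <- plogis(2 * labels + rnorm(200))

roc1 <- roc(labels, probs_model1)
roc2 <- roc(labels, probs_model2)

auc(roc1)
auc(roc2)  # the higher AUC is preferred, all else being equal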

Evaluation

With the evaluation process, the main goal is to confirm that the model selected at this point meets the business objective. Ask yourself and others, "Have we achieved our definition of success?" Let the Netflix Prize serve as a cautionary tale here. I'm sure you are aware that Netflix awarded a $1-million prize to the team that could produce the best recommendation algorithm as defined by the lowest RMSE. However, Netflix did not implement it, because the incremental accuracy gained was not worth the engineering effort! Always apply Occam's razor. At any rate, here are the tasks:

Evaluating the results.

Reviewing the process.

Determining the next steps.

In reviewing the process, it may be necessary, as you no doubt determined earlier in the process, to take the results through governance and communicate with the other stakeholders in order to gain their buy-in. As for the next steps, if you want to be a change agent, make sure that you answer the what, so what, and now what in the stakeholders' minds. If you can tie their now what into the decision that you made earlier, you have earned your money.

Deployment

If everything is done according to the plan up to this point, it might just come down to flipping a switch and your model goes live. Assuming that this is not the case, here are the tasks for this step:

Deploying the plan.

Monitoring and maintaining the plan.

Producing the final report.

Reviewing the project.

Once the deployment and monitoring/maintenance are underway, it is crucial for you and those who will walk in your steps to produce a well-written final report. This report should include a white paper and briefing slides. I have to say that I resisted the drive to put my findings in a white paper, as I was an indentured servant to the military's passion for PowerPoint slides. However, slides can and will be used against you, cherry-picked or misrepresented by various parties for their benefit. Trust me, that just doesn't happen with a white paper, as it becomes an extension of your findings and beliefs. Use PowerPoint to brief stakeholders, but use the white paper as the document of record and as a preread, should your organization insist on one. It is my standard procedure to create this white paper in R using knitr and LaTeX.
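As an illustrative sketch, assuming you have a knitr/LaTeX source file (here hypothetically named report.Rnw) and a working LaTeX installation, compiling the white paper to PDF is a single call:

# A minimal sketch: compile a knitr/LaTeX source file to PDF.
# report.Rnw is a hypothetical file containing LaTeX markup with
# embedded R code chunks; a LaTeX distribution must be installed.
library(knitr)
knit2pdf("report.Rnw")  # produces report.tex and report.pdf

Keeping the analysis and the narrative in one source file like this makes the document of record reproducible.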

Now for the all-important process review, you may have your own proprietary way of conducting it; but here is what it should cover, whether you conduct it in a formal or informal way:

What was the plan?

What actually happened?

Why did it happen or not happen?

What should be sustained in future projects?

What should be improved upon in future projects?

Create an action plan to ensure sustainment and improvement happen

That concludes the review of the CRISP-DM process, which provides a comprehensive and flexible framework to guarantee the success of your project and make you an agent of change.

Algorithm flowchart

The purpose of this section is to create a tool that will help you not just select possible modeling techniques, but also think more deeply about the problem. The residual benefit is that it may help you frame the problem with the project sponsor/team. The techniques in the flowchart are certainly not comprehensive, but are extensive enough to get you started. It also includes techniques not discussed in this book.

The following figure starts the flow of selecting the potential modeling techniques. As you answer the question(s), it will take you to one of the four additional charts:

Figure 2

If the data is text or in the time series format, then you will follow the flow in the following figure:

Figure 3

In this branch of the algorithm, you do not have text or time series data. You also do not want to predict a category, so you are looking to make recommendations, understand associations, or predict a quantity:

Figure 4

To get to this section, you will have data that is not text or time series. You want to categorize the data, but it does not have an outcome label, which brings us to clustering methods, as follows:

Figure 5

This brings us to the situation where we want to categorize the data and it is labeled, that is, classification:

Figure 6

Summary

This chapter was about how to set you and your team up for success in any project that you tackle. The CRISP-DM process is put forward as a flexible and comprehensive framework to facilitate the softer skills of communication and influence. Each step of the process and the tasks within each step were enumerated. More than that, the commentary provides some techniques and considerations to help with process execution. By taking heed of the process, you can indeed become an agent of positive change in any organization.

The other item put forth in this chapter was an algorithm flowchart: a cheat sheet to help you identify some of the proper techniques to apply in order to solve the business problem. With this foundation in place, we can now move on to applying these techniques to real-world problems.

Linear Regression - The Blocking and Tackling of Machine Learning

"Some people try to find things in this game that don't exist, but football is only two things - blocking and tackling."                                                                     - Vince Lombardi, Hall of Fame Football Coach

It is important that we get started with a simple, yet extremely effective technique that has been used for a long time: linear regression. Albert Einstein is believed to have remarked at one time or another that things should be made as simple as possible, but no simpler. This is sage advice and a good rule of thumb in the development of algorithms for machine learning. Considering the other techniques that we will discuss later, there is no simpler model than tried and tested linear regression, which uses the least squares approach to predict a quantitative outcome. In fact, one can consider it to be the foundation of all the methods that we will discuss later, many of which are mere extensions. If you can master the linear regression method, well, then quite frankly, I believe you can master the rest of this book. Therefore, let us consider this a good starting point for our journey towards becoming a machine learning guru.
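As a small taste of what is to come, here is a least squares fit in base R using the built-in cars dataset (speed versus stopping distance); this is a generic illustration, not the chapter's business case:

# Univariate linear regression via least squares on the
# built-in cars dataset
fit <- lm(dist ~ speed, data = cars)

summary(fit)   # coefficients, R-squared, and diagnostics

# Predict the stopping distance at a hypothetical speed of 21 mph
predict(fit, newdata = data.frame(speed = 21))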

This chapter covers introductory material, and an expert in this subject can skip ahead to the next topic. Otherwise, ensure that you thoroughly understand this topic before venturing to other, more complex learning methods. I believe you will discover that many of your projects can be addressed by just applying what is discussed in the following section. Linear regression is probably the easiest model to explain to your customers, most of whom will have at least a cursory understanding of R-squared. Many of them will have been exposed to it at great depth and thus be comfortable with variable contribution, collinearity, and the like.

Multivariate linear regression