Scala Machine Learning Projects

Md. Rezaul Karim

Description

Machine learning has had a huge impact on academia and industry by turning data into actionable information. Scala has seen a steady rise in adoption over the past few years, especially in the fields of data science and analytics. This book is for data scientists, data engineers, and deep learning enthusiasts who have a background in complex numerical computing and want to learn more about hands-on machine learning application development.
If you're well versed in machine learning concepts and want to expand your knowledge by delving into the practical implementation of these concepts using the power of Scala, then this book is what you need! Through 11 end-to-end projects, you will become acquainted with popular machine learning libraries such as Spark ML, H2O, DeepLearning4j, and MXNet.
By the end, you will be able to use numerical computing and functional programming to carry out complex numerical tasks and to develop, build, and deploy research or commercial projects in a production-ready environment.





Scala Machine Learning Projects


Build real-world machine learning and deep learning projects with Scala


Md. Rezaul Karim


BIRMINGHAM - MUMBAI

Scala Machine Learning Projects

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty
Acquisition Editor: Tushar Gupta
Content Development Editor: Cheryl Dsa
Technical Editor: Sagar Sawant
Copy Editors: Vikrant Phadkay, Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Aishwarya Gangawane
Graphics: Tania Dutta
Production Coordinator: Shantanu Zagade

First published: January 2018

Production reference: 1290118

Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.

ISBN 978-1-78847-904-2

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Md. Rezaul Karim is a Research Scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Germany. Before joining FIT, he worked as a Researcher at the Insight Centre for Data Analytics, Ireland. Before that, he worked as a Lead Engineer at Samsung Electronics, Korea.

He has 9 years of R&D experience with C++, Java, R, Scala, and Python. He has published several research papers concerning bioinformatics, big data, and deep learning. He has practical working experience with Spark, Zeppelin, Hadoop, Keras, Scikit-Learn, TensorFlow, DeepLearning4j, MXNet, and H2O.

 

About the reviewer

Dave Wentzel is the chief technology officer of Capax Global, a premier Microsoft consulting partner. Dave is responsible for setting the strategy and defining service offerings and capabilities for the data platform and Azure practice at Capax. He also works directly with clients to help them with their big data journeys. He is a frequent blogger and speaker on big data and data science topics.

 

Sumit Pal is a published author with Apress. He has more than 22 years of experience in software from startups to enterprises and is an independent consultant working with big data, data visualization, and data science. He builds end-to-end data-driven analytic systems.

Sumit has worked for Microsoft (SQLServer), Oracle (OLAP Kernel), and Verizon. He advises clients on their data architectures and builds solutions in Spark and Scala. He has spoken at multiple conferences in North America and Europe and has developed big data analyst training for Experfy. He has an MS and a BS in computer science.


Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Analyzing Insurance Severity Claims

Machine learning and learning workflow

Typical machine learning workflow

Hyperparameter tuning and cross-validation

Analyzing and predicting insurance severity claims

Motivation

Description of the dataset

Exploratory analysis of the dataset

Data preprocessing

LR for predicting insurance severity claims

Developing insurance severity claims predictive model using LR

GBT regressor for predicting insurance severity claims

Boosting the performance using random forest regressor

Random Forest for classification and regression

Comparative analysis and model deployment

Spark-based model deployment for large-scale dataset

Summary

Analyzing and Predicting Telecommunication Churn

Why do we perform churn analysis, and how do we do it?

Developing a churn analytics pipeline

Description of the dataset

Exploratory analysis and feature engineering

LR for churn prediction

SVM for churn prediction

DTs for churn prediction

Random Forest for churn prediction

Selecting the best model for deployment

Summary

High Frequency Bitcoin Price Prediction from Historical and Live Data

Bitcoin, cryptocurrency, and online trading

State-of-the-art automated trading of Bitcoin

Training

Prediction

High-level data pipeline of the prototype

Historical and live-price data collection

Historical data collection

Transformation of historical data into a time series

Assumptions and design choices

Data preprocessing

Real-time data through the Cryptocompare API

Model training for prediction

Scala Play web service

Concurrency through Akka actors

Web service workflow

JobModule

Scheduler

SchedulerActor

PredictionActor and the prediction step

TraderActor

Predicting prices and evaluating the model

Demo prediction using Scala Play framework

Why RESTful architecture?

Project structure

Running the Scala Play web app

Summary

Population-Scale Clustering and Ethnicity Prediction

Population scale clustering and geographic ethnicity

Machine learning for genetic variants

1000 Genomes Project dataset description

Algorithms, tools, and techniques

H2O and Sparkling water

ADAM for large-scale genomics data processing

Unsupervised machine learning

Population genomics and clustering

How does K-means work?

DNNs for geographic ethnicity prediction

Configuring programming environment

Data pre-processing and feature engineering

Model training and hyperparameter tuning

Spark-based K-means for population-scale clustering

Determining the number of optimal clusters

Using H2O for ethnicity prediction

Using random forest for ethnicity prediction

Summary

Topic Modeling - A Better Insight into Large-Scale Texts

Topic modeling and text clustering

How does LDA algorithm work?

Topic modeling with Spark MLlib and Stanford NLP

Implementation

Step 1 - Creating a Spark session

Step 2 - Creating vocabulary and tokens count to train the LDA after text pre-processing

Step 3 - Instantiate the LDA model before training

Step 4 - Set the NLP optimizer

Step 5 - Training the LDA model

Step 6 - Prepare the topics of interest

Step 7 - Topic modeling

Step 8 - Measuring the likelihood of two documents

Other topic models versus the scalability of LDA

Deploying the trained LDA model

Summary

Developing Model-based Movie Recommendation Engines

Recommendation system

Collaborative filtering approaches

Content-based filtering approaches

Hybrid recommender systems

Model-based collaborative filtering

The utility matrix

Spark-based movie recommendation systems

Item-based collaborative filtering for movie similarity

Step 1 - Importing necessary libraries and creating a Spark session

Step 2 - Reading and parsing the dataset

Step 3 - Computing similarity

Step 4 - Testing the model

Model-based recommendation with Spark

Data exploration

Movie recommendation using ALS

Step 1 - Import packages, load, parse, and explore the movie and rating dataset

Step 2 - Register both DataFrames as temp tables to make querying easier

Step 3 - Explore and query for related statistics

Step 4 - Prepare training and test rating data and check the counts

Step 5 - Prepare the data for building the recommendation model using ALS

Step 6 - Build an ALS user product matrix

Step 7 - Making predictions

Step 8 - Evaluating the model

Selecting and deploying the best model 

Summary

Options Trading Using Q-learning and Scala Play Framework

Reinforcement versus supervised and unsupervised learning

Using RL

Notation, policy, and utility in RL

Policy

Utility

A simple Q-learning implementation

Components of the Q-learning algorithm

States and actions in QLearning

The search space

The policy and action-value

QLearning model creation and training

QLearning model validation

Making predictions using the trained model

Developing an options trading web app using Q-learning

Problem description

Implementing an options trading web application

Creating an option property

Creating an option model

Putting it all together

Evaluating the model

Wrapping up the options trading app as a Scala web app

The backend

The frontend

Running and Deployment Instructions

Model deployment

Summary

Clients Subscription Assessment for Bank Telemarketing using Deep Neural Networks

Client subscription assessment through telemarketing

Dataset description

Installing and getting started with Apache Zeppelin

Building from the source

Starting and stopping Apache Zeppelin

Creating notebooks

Exploratory analysis of the dataset

Label distribution

Job distribution

Marital distribution

Education distribution

Default distribution

Housing distribution

Loan distribution

Contact distribution

Month distribution

Day distribution

Previous outcome distribution

Age feature

Duration distribution

Campaign distribution

Pdays distribution

Previous distribution

emp_var_rate distributions

cons_price_idx features

cons_conf_idx distribution

Euribor3m distribution

nr_employed distribution

Statistics of numeric features

Implementing a client subscription assessment model

Hyperparameter tuning and feature selection

Number of hidden layers

Number of neurons per hidden layer

Activation functions

Weight and bias initialization

Regularization

Summary

Fraud Analytics Using Autoencoders and Anomaly Detection

Outlier and anomaly detection

Autoencoders and unsupervised learning

Working principles of an autoencoder

Efficient data representation with autoencoders

Developing a fraud analytics model

Description of the dataset and using linear models

Problem description

Preparing programming environment

Step 1 - Loading required packages and libraries

Step 2 - Creating a Spark session and importing implicits

Step 3 - Loading and parsing input data

Step 4 - Exploratory analysis of the input data

Step 5 - Preparing the H2O DataFrame

Step 6 - Unsupervised pre-training using autoencoder

Step 7 - Dimensionality reduction with hidden layers

Step 8 - Anomaly detection

Step 9 - Pre-trained supervised model

Step 10 - Model evaluation on the highly-imbalanced data

Step 11 - Stopping the Spark session and H2O context

Auxiliary classes and methods

Hyperparameter tuning and feature selection

Summary

Human Activity Recognition using Recurrent Neural Networks

Working with RNNs

Contextual information and the architecture of RNNs

RNN and the long-term dependency problem

LSTM networks

Human activity recognition using the LSTM model

Dataset description

Setting and configuring MXNet for Scala

Implementing an LSTM model for HAR

Step 1 - Importing necessary libraries and packages

Step 2 - Creating MXNet context

Step 3 - Loading and parsing the training and test set

Step 4 - Exploratory analysis of the dataset

Step 5 - Defining internal RNN structure and LSTM hyperparameters

Step 6 - LSTM network construction

Step 7 - Setting up an optimizer

Step 8 - Training the LSTM network

Step 9 - Evaluating the model

Tuning LSTM hyperparameters and GRU

Summary

Image Classification using Convolutional Neural Networks

Image classification and drawbacks of DNNs

CNN architecture

Convolutional operations

Pooling layer and padding operations

Subsampling operations

Convolutional and subsampling operations in DL4j

Configuring DL4j, ND4s, and ND4j

Convolutional and subsampling operations in DL4j

Large-scale image classification using CNN

Problem description

Description of the image dataset

Workflow of the overall project

Implementing CNNs for image classification

Image processing

Extracting image metadata

Image feature extraction

Preparing the ND4j dataset

Training the CNNs and saving the trained models

Evaluating the model

Wrapping up by executing the main() method

Tuning and optimizing CNN hyperparameters

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

Machine learning has made a huge impact on academia and industry by turning data into actionable intelligence. Scala, on the other hand, has seen a steady rise in adoption over the last few years, especially in the fields of data science and analytics. This book has been written for data scientists, data engineers, and deep learning enthusiasts who have a solid background in complex numerical computing and want to learn more about hands-on machine learning application development.

So, if you're well-versed in machine learning concepts and want to expand your knowledge by delving into practical implementations using the power of Scala, then this book is what you need! Through 11 end-to-end projects, you will be acquainted with popular machine learning libraries such as Spark ML, H2O, Zeppelin, DeepLearning4j, and MXNet.

After reading this book and practicing all of the projects, you will be able to master numerical computing, deep learning, and functional programming to carry out complex numerical tasks. You can thus develop, build, and deploy research and commercial projects in a production-ready environment.

This book isn’t meant to be read cover to cover. You can turn the pages to a chapter that looks like something you’re trying to accomplish or that simply ignites your interest. But any kind of improvement feedback is welcome.

Happy reading!

Who this book is for

If you want to leverage the power of both Scala and open source libraries such as Spark ML, Deeplearning4j, H2O, MXNet, and Zeppelin to make sense of Big Data, then this book is for you. A strong understanding of Scala and the Scala Play Framework is recommended. Basic familiarity with ML techniques will be an added advantage.

What this book covers

Chapter 1, Analyzing Insurance Severity Claims, shows how to develop a predictive model for analyzing insurance severity claims using some widely used regression techniques. We will demonstrate how to deploy this model in a production-ready environment.

Chapter 2, Analyzing and Predicting Telecommunication Churn, uses the Orange Telecoms Churn dataset, consisting of cleaned customer activity and churn labels specifying whether customers canceled their subscription or not, to develop a real-life predictive model.

Chapter 3, High-Frequency Bitcoin Price Prediction from Historical and Live Data, shows how to develop a real-life project that collects historical and live data. We predict the Bitcoin price for the upcoming weeks, months, and so on. In addition, we demonstrate how to generate a simple signal for online trading in Bitcoin. Finally, this chapter wraps up the whole application as a web app using the Scala Play Framework.

Chapter 4, Population-Scale Clustering and Ethnicity Prediction, uses genomic variation data from the 1000 Genomes Project to apply the K-means clustering approach to scalable genomic data analysis. This is aimed at clustering genotypic variants at the population scale. Finally, we train deep neural network and random forest models to predict ethnicity.

Chapter 5, Topic Modeling in NLP – A Better Insight into Large-Scale Texts, shows how to develop a topic modeling application by utilizing the Spark-based LDA algorithm and Stanford NLP to handle large-scale raw texts.

Chapter 6, Developing Model-Based Movie Recommendation Engines, shows how to develop a scalable movie recommendation engine by interoperating between singular value decomposition, ALS, and matrix factorization. The MovieLens dataset will be used for this end-to-end project.

Chapter 7, Options Trading using Q-Learning and the Scala Play Framework, applies the Q-learning reinforcement learning algorithm to real-life IBM stock datasets and designs a machine learning system driven by criticisms and rewards. The goal is to develop a real-life options trading application. The chapter wraps up the whole application as a web app using the Scala Play Framework.

Chapter 8, Clients Subscription Assessment for Bank Telemarketing using Deep Neural Networks, is an end-to-end project that shows how to solve a real-life problem called client subscription assessment. An H2O deep neural network will be trained using a bank telemarketing dataset. Finally, the chapter evaluates the performance of this predictive model.

Chapter 9, Fraud Analytics using Autoencoders and Anomaly Detection, uses autoencoders and the anomaly detection technique for fraud analytics. The dataset used is a fraud detection dataset collected and analyzed during a research collaboration by Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles).

Chapter 10, Human Activity Recognition using Recurrent Neural Networks, includes another end-to-end project that shows how to use an RNN implementation called LSTM for human activity recognition using a smartphone sensor dataset.

Chapter 11, Image Classification using Convolutional Neural Networks, demonstrates how to develop predictive analytics applications such as image classification, using convolutional neural networks on a real-world image dataset from Yelp.

To get the most out of this book

This book is dedicated to developers, data analysts, and deep learning enthusiasts who do not have much background with complex numerical computations but want to know what deep learning is. A strong understanding of Scala and its functional programming concepts is recommended. Some basic understanding and high-level knowledge of Spark ML, H2O, Zeppelin, DeepLearning4j, and MXNet would act as an added advantage in order to grasp this book. Additionally, basic know-how of build tools such as Maven and SBT is assumed.

All the examples have been implemented using Scala on Ubuntu 16.04 LTS 64-bit and Windows 10 64-bit. You will also need the following (preferably the latest versions):

Apache Spark 2.0.0 (or higher)

MXNet, Zeppelin, DeepLearning4j, and H2O (see the details in the chapters and in the supplied pom.xml files)

Hadoop 2.7 (or higher)

Java (JDK and JRE) 1.7+/1.8+

Scala 2.11.x (or higher)

Eclipse Mars or Luna (latest) with Maven plugin (2.9+), Maven compiler plugin (2.3.2+), and Maven assembly plugin (2.4.1+)

IntelliJ IDE

SBT plugin and Scala Play Framework installed

A computer with at least a Core i3 processor, Core i5 (recommended), or Core i7 (to get the best results) is needed. However, multicore processing will provide faster data processing and scalability. At least 8 GB RAM is recommended for standalone mode; use at least 32 GB RAM for a single VM and higher for a cluster. You should have enough storage for running heavy jobs (depending on the dataset size you will be handling); preferably, at least 50 GB of free disk storage (for standalone and for SQL Warehouse).

Linux distributions are preferable (including Debian, Ubuntu, Fedora, RHEL, CentOS, and many more). To be more specific, for example, for Ubuntu it is recommended to have a 14.04 (LTS) 64-bit (or later) complete installation, VMWare player 12, or VirtualBox. You can run Spark jobs on Windows (XP/7/8/10) or Mac OS X (10.4.7+).

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at www.packtpub.com.

Select the SUPPORT tab.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Scala-Machine-Learning-Projects. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/ScalaMachineLearningProjects_ColorImages.pdf.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.

Analyzing Insurance Severity Claims

Predicting the cost, and hence the severity, of claims in an insurance company is a real-life problem that needs to be solved in an accurate way. In this chapter, we will show you how to develop a predictive model for analyzing insurance severity claims using some of the most widely used regression algorithms.

We will start with simple linear regression (LR) and we will see how to improve the performance using some ensemble techniques, such as gradient boosted tree (GBT) regressors. Then we will look at how to boost the performance with Random Forest regressors. Finally, we will show you how to choose the best model and deploy it for a production-ready environment. Also, we will provide some background studies on machine learning workflow, hyperparameter tuning, and cross-validation.

For the implementation, we will use Spark ML API for faster computation and massive scalability. In a nutshell, we will learn the following topics throughout this end-to-end project:

Machine learning and learning workflow

Hyperparameter tuning and cross-validation of ML models

LR for analyzing insurance severity claims

Improving performance with gradient boosted regressors

Boosting the performance with random forest regressors

Model deployment

Machine learning and learning workflow

Machine learning (ML) is about using a set of statistical and mathematical algorithms to perform tasks such as concept learning, predictive modeling, clustering, and mining useful patterns. The ultimate goal is to improve the learning in such a way that it becomes automatic, so that little or no human interaction is needed, or at least to reduce the level of human interaction as much as possible.

We now refer to a famous definition of ML by Tom M. Mitchell (Machine Learning, Tom Mitchell, McGraw Hill, 1997), where he explained what learning really means from a computer science perspective:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Based on the preceding definition, we can conclude that a computer program or machine can do the following:

Learn from data and histories

Be improved with experience

Interactively enhance a model that can be used to predict an outcome

A typical ML function can be formulated as a convex optimization problem for finding a minimizer of a convex function f that depends on a variable vector w (the weights), which has d records. Formally, we can write this as the following optimization problem:

$$\min_{w \in \mathbb{R}^{d}} f(w)$$

Here, the objective function is of the form:

$$f(w) := \lambda\, R(w) + \frac{1}{n}\sum_{i=1}^{n} L(w; x_{i}, y_{i})$$

Here, the vectors $x_{i}$ are the training data points for $1 \le i \le n$, and $y_{i}$ are their corresponding labels that we want to predict eventually. We call the method linear if $L(w; x, y)$ can be expressed as a function of $w^{T}x$ and $y$.

The objective function f has two components:

A regularizer that controls the complexity of the model

The loss that measures the error of the model on the training data

The loss function L(w; x, y) is typically a convex function in w. The fixed regularization parameter λ≥0 defines the trade-off between the two goals of minimizing the loss on the training data and minimizing model complexity to avoid overfitting. Throughout the chapters, we will learn about the different learning types and algorithms in more detail.
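To make this formulation concrete, consider a generic illustration (not tied to any particular dataset in this book): choosing the squared loss together with an L2 regularizer yields the familiar regularized least-squares (ridge regression) objective, which is a linear method because the loss depends on w only through $w^{T}x$:

$$L(w; x, y) = \frac{1}{2}\left(w^{T}x - y\right)^{2}, \qquad R(w) = \frac{1}{2}\lVert w \rVert_{2}^{2}$$

$$f(w) = \frac{\lambda}{2}\lVert w \rVert_{2}^{2} + \frac{1}{2n}\sum_{i=1}^{n}\left(w^{T}x_{i} - y_{i}\right)^{2}$$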

On the other hand, deep neural networks (DNNs) form the core of deep learning (DL) by providing algorithms to model complex and high-level abstractions in data; they can better exploit large-scale datasets to build complex models.

There are several widely used deep learning architectures based on artificial neural networks: DNNs, capsule networks, restricted Boltzmann machines, deep belief networks, factorization machines, and recurrent neural networks.

These architectures have been widely used in computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, and drug design. Throughout the chapters, we will see several real-life examples of using these architectures to achieve state-of-the-art predictive accuracy.

Typical machine learning workflow

A typical ML application involves several processing steps, from the input to the output, forming a scientific workflow as shown in Figure 1, ML workflow. The following steps are involved in a typical ML application:

Load the data

Parse the data into the input format for the algorithm

Pre-process the data and handle the missing values

Split the data into three sets: one for training the model (training set), one for validating it (validation set), and one for testing it (test set)

Run the algorithm to build and train your ML model

Make predictions with the training data and observe the results

Test and evaluate the model with the test data, or alternatively validate the model using a cross-validation technique with the third dataset, called the validation dataset

Tune the model for better performance and accuracy

Scale up the model so that it can handle massive datasets in future

Deploy the ML model in production:

Figure 1: ML workflow
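To make these steps concrete, the following is a minimal sketch of such a workflow using the Spark ML API. It is only an illustration: the CSV path, the feature and label column names, and the choice of linear regression are assumptions made for this sketch, not the actual project code used later in this chapter.

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object MinimalWorkflow {
  def main(args: Array[String]): Unit = {
    // Steps 1-2: Load and parse the data (the path and schema are hypothetical)
    val spark = SparkSession.builder.master("local[*]").appName("MinimalWorkflow").getOrCreate()
    val raw = spark.read.option("header", "true").option("inferSchema", "true").csv("data/input.csv")

    // Step 3: Pre-process the data; here we simply drop rows with missing values
    val cleaned = raw.na.drop()

    // Assemble the (assumed) feature columns into a single vector column
    val assembler = new VectorAssembler()
      .setInputCols(Array("feature1", "feature2"))
      .setOutputCol("features")
    val data = assembler.transform(cleaned)

    // Step 4: Split the data into training, validation, and test sets
    val Array(train, validation, test) = data.randomSplit(Array(0.7, 0.15, 0.15), seed = 12345L)

    // Step 5: Run the algorithm to build and train the ML model
    val lr = new LinearRegression().setLabelCol("label").setFeaturesCol("features")
    val model = lr.fit(train)

    // Steps 6-7: Make predictions and evaluate the model on held-out data
    val predictions = model.transform(test)
    val rmse = new RegressionEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("rmse")
      .evaluate(predictions)
    println(s"Test RMSE: $rmse")

    spark.stop()
  }
}

Tuning, scaling up, and deployment (steps 8 to 10) are covered in the following sections and in later chapters.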

The preceding workflow represents a few common steps for solving ML problems, where ML tasks can be broadly categorized into supervised, unsupervised, semi-supervised, and reinforcement learning, as well as recommendation systems. The following Figure 2, Supervised learning in action, shows a schematic diagram of supervised learning. After the algorithm has found the required patterns, those patterns can be used to make predictions for unlabeled test data:

Figure 2: Supervised learning in action

Classification and regression are examples of supervised learning problems, and predictive models for predictive analytics can be built based on them. Throughout the upcoming chapters, we will provide several examples of supervised learning, such as LR, logistic regression, random forests, decision trees, Naive Bayes, multilayer perceptrons, and so on.

A regression algorithm is meant to produce continuous output. The input is allowed to be either discrete or continuous:

Figure 3: A regression algorithm is meant to produce continuous output

A classification algorithm, on the other hand, is meant to produce discrete output from an input of a set of discrete or continuous values. This distinction is important to know because discrete-valued outputs are handled better by classification, which will be discussed in upcoming chapters:

Figure 4: A classification algorithm is meant to produce discrete output

In this chapter, we will mainly focus on supervised regression algorithms. We will start by describing the problem statement and then move on to the very simple LR algorithm. Often, the performance of these ML models is optimized using hyperparameter tuning and cross-validation techniques, so a brief understanding of them is essential so that we can easily use them in future chapters.

Hyperparameter tuning and cross-validation

Tuning an algorithm is simply a process that one goes through in order to enable the algorithm to perform optimally in terms of runtime and memory usage. In Bayesian statistics, a hyperparameter is a parameter of a prior distribution. In terms of ML, the term hyperparameter refers to those parameters that cannot be directly learned from the regular training process.

Hyperparameters are usually fixed before the actual training process begins. They are chosen by setting different values for those hyperparameters, training different models, and deciding which ones work best by testing them. Here are some typical examples of such parameters:

Number of leaves, bins, or depth of a tree

Number of iterations

Number of latent factors in a matrix factorization

Learning rate

Number of hidden layers in a deep neural network

The number of clusters in k-means clustering and so on

In short, hyperparameter tuning is a technique for choosing the right combination of hyperparameters based on the model's performance on the presented data. It is one of the fundamental requirements for obtaining meaningful and accurate results from ML algorithms in practice. The following figure shows the model tuning process, things to consider, and the workflow:

Figure 5: Model tuning process

Cross-validation (also known as rotation estimation) is a model validation technique for assessing the quality of statistical analysis and results. The target is to make the model generalize well toward an independent test set. It helps if you want to estimate how accurately a predictive model will perform in practice when you deploy it as an ML application. During the cross-validation process, a model is usually trained with a dataset of a known type.

Conversely, it is tested using a dataset of an unknown type. In this regard, cross-validation helps to describe a dataset on which to test the model during the training phase, using the validation set. There are two types of cross-validation, which can be categorized as follows:

Exhaustive cross-validation: This includes leave-p-out cross-validation and leave-one-out cross-validation

Non-exhaustive cross-validation: This includes K-fold cross-validation and repeated random subsampling cross-validation

In most cases, the researcher/data scientist/data engineer uses 10-fold cross-validation instead of testing on a validation set (see more in Figure 6, 10-fold cross-validation technique). This is the most widely used cross-validation technique across all use cases and problem types, as explained by the following figure.

Basically, using this technique, your complete training data is split into a number of folds, and this number can be specified. Then the whole pipeline is run once for every fold, and one ML model is trained for each fold. Finally, the different ML models obtained are combined by a voting scheme for classifiers or by averaging for regression:

Figure 6: 10-fold cross-validation technique

Moreover, to reduce the variability, multiple iterations of cross-validation are performed using different partitions; finally, the validation results are averaged over the rounds.
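For reference, the following is a hedged sketch of how k-fold cross-validation and grid-based hyperparameter tuning are typically wired together with the Spark ML API; the estimator, the parameter values in the grid, and the column names are illustrative assumptions rather than the settings used in this chapter's project.

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// trainingData is assumed to be a DataFrame with "features" and "label" columns
val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")

// Hyperparameters to search over; the candidate values here are only examples
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.001, 0.01, 0.1))
  .addGrid(lr.maxIter, Array(10, 100))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

// 10-fold cross-validation: every parameter combination is trained and
// evaluated on 10 different train/validation splits of the training data
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator().setMetricName("rmse"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(10)

// fit() returns the model retrained with the best-performing parameter combination
val cvModel = cv.fit(trainingData)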

Analyzing and predicting insurance severity claims

Predicting the cost, and hence the severity, of claims in an insurance company is a real-life problem that needs to be solved in a more accurate and automated way. We will do something similar in this example.

We will start with simple linear regression (LR) and learn how to improve the performance using some ensemble techniques, such as a gradient boosted tree (GBT) regressor. Then we will look at how to boost the performance with a random forest regressor. Finally, we will show how to choose the best model and deploy it for a production-ready environment.

Motivation

When someone is devastated by a serious car accident, their focus is on their life, family, children, friends, and loved ones. However, once a claim is filed with the insurance company, the overall paper-based process of calculating claim severity is a tedious task to complete.

This is why insurance companies are continually seeking fresh ideas to improve their claims service for their clients in an automated way. Therefore, predictive analytics is a viable solution for predicting the cost, and hence the severity, of claims based on the available historical data.

Description of the dataset

A dataset from the Allstate Insurance company will be used. It consists of more than 300,000 examples of masked and anonymized data, with more than 100 categorical and numerical attributes, thus complying with confidentiality constraints while providing more than enough data for building and evaluating a variety of ML techniques.

The dataset is downloaded from the Kaggle website at https://www.kaggle.com/c/allstate-claims-severity/data. Each row in the dataset represents an insurance claim. Now, the task is to predict the value for the loss column. Variables prefaced with cat are categorical, while those prefaced with cont are continuous.

It is to be noted that the Allstate Corporation, founded in 1931, is the second largest insurance company in the United States. We are trying to automate the whole process, that is, to predict the cost, and hence the severity, of accident and damage claims.
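As a first, hedged look at this dataset, the following sketch shows how the categorical and continuous columns can be separated by their prefixes with Spark; the local file path is an assumption, and train.csv must first be downloaded from the Kaggle competition page.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("AllstateEDA").getOrCreate()

// The path below is a local assumption; point it at the downloaded train.csv
val train = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/insurance_train.csv")

// Variables prefaced with "cat" are categorical, those with "cont" are continuous,
// and "loss" is the target column we want to predict
val categoricalCols = train.columns.filter(_.startsWith("cat"))
val continuousCols = train.columns.filter(_.startsWith("cont"))
println(s"Rows: ${train.count()}, categorical: ${categoricalCols.length}, continuous: ${continuousCols.length}")

// Peek at a couple of columns of each type together with the target
val preview = (categoricalCols.take(2) ++ continuousCols.take(2)) :+ "loss"
train.select(preview.head, preview.tail: _*).show(5)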