Machine Learning with Scala Quick Start Guide - Md. Rezaul Karim - E-Book

Md. Rezaul Karim

Description

Supervised and unsupervised machine learning made easy in Scala with this quick-start guide.




Key Features



  • Construct and deploy machine learning systems that learn from your data and give accurate predictions


  • Unleash the power of Spark ML along with popular machine learning algorithms to solve complex tasks in Scala


  • Solve hands-on problems by combining popular neural network architectures such as LSTM and CNN using Scala with DeepLearning4j library





Book Description



Scala is a highly scalable language that integrates object-oriented and functional programming concepts, making it easy to build scalable and complex big data applications. This book is a handy guide for machine learning developers and data scientists who want to develop and train effective machine learning models in Scala.






The book starts with an introduction to machine learning, covering both machine learning and deep learning basics. It then explains how to use Scala-based ML libraries to solve classification and regression problems using linear regression, generalized linear regression, logistic regression, support vector machine, and Naive Bayes algorithms.






It also covers tree-based ensemble techniques for solving both classification and regression problems. Moving ahead, it covers unsupervised learning techniques, such as dimensionality reduction, clustering, and recommender systems. Finally, it provides a brief overview of deep learning using a real-life example in Scala.





What you will learn



  • Get acquainted with JVM-based machine learning libraries for Scala such as Spark ML and Deeplearning4j


  • Learn about RDDs, DataFrames, and Spark SQL for analyzing structured and unstructured data


  • Understand supervised and unsupervised learning techniques with best practices and pitfalls


  • Learn classification and regression analysis with linear regression, logistic regression, Naive Bayes, support vector machine, and tree-based ensemble techniques


  • Learn effective ways of clustering analysis with dimensionality reduction techniques


  • Learn about recommender systems with the collaborative filtering approach


  • Delve into deep learning and neural network architectures





Who this book is for



This book is for machine learning developers looking to train machine learning models in Scala without spending too much time and effort. Some fundamental knowledge of Scala programming and some basics of statistics and linear algebra is all you need to get started with this book.

You can read this e-book in Legimi apps or any other app that supports the following format: EPUB

Page count: 232




Machine Learning with Scala Quick Start Guide

 

 

 

 

 

 

Leverage popular machine learning algorithms and techniques and implement them in Scala

 

 

 

 

 

 

 

 

 

 

Md. Rezaul Karim

 

 

 

 

 

 

 

 

 

 

 

 

 

BIRMINGHAM - MUMBAI

Machine Learning with Scala Quick Start Guide

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Aditi Gour
Content Development Editor: Roshan Kumar
Technical Editor: Nilesh Sawakhande
Copy Editor: Safis Editing
Project Coordinator: Namrata Swetta
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Alishon Mendonsa
Production Coordinator: Shraddha Falebhai

First published: April 2019

Production reference: 1300419

Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.

ISBN 978-1-78934-507-0

www.packtpub.com

 
mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

Packt.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

Contributors

About the author

Md. Rezaul Karim is a researcher, author, and data science enthusiast with a strong computer science background and 10 years of R&D experience in machine learning, deep learning, and data mining algorithms, applied to emerging bioinformatics research problems with a focus on making them explainable. He is passionate about applied machine learning, knowledge graphs, and explainable artificial intelligence (XAI).

Currently, he is working as a research scientist at Fraunhofer FIT, Germany. He is also a Ph.D. candidate at RWTH Aachen University, Germany. Before joining FIT, he worked as a researcher at the Insight Centre for Data Analytics, Ireland. Previously, he worked as a lead software engineer at Samsung Electronics, Korea.

About the reviewers

Ajay Kumar N has experience in big data, and specializes in cloud computing and various big data frameworks, including Apache Spark and Apache Hadoop. His primary language of choice is Python, but he also has a special interest in functional programming languages such as Scala. He has worked extensively with NumPy, pandas, and scikit-learn, and often contributes to open source projects related to data science and machine learning.

Sarbashree Ray has over 5 years' experience in big data analytics and currently works at Reliance Jio as a deputy manager. Sarbashree is an engineering professional with experience in designing and executing solutions for complex business problems involving large-scale big data and machine learning technologies, real-time analytics, and reporting solutions. He is known for using the right tools when and where they make sense, and for creating intuitive architectures that help organizations effectively analyze and process terabytes of structured and unstructured data. He is also able to integrate state-of-the-art big data technologies into overall architectures and lead a team of developers through the construction, testing, and implementation phases.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Machine Learning with Scala Quick Start Guide

About Packt

Why subscribe?

Packt.com

Contributors

About the author

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Code in Action

Conventions used

Get in touch

Reviews

Introduction to Machine Learning with Scala

Technical requirements

Overview of ML

Working principles of a learning algorithm

General machine learning rule of thumb

General issues in machine learning models

ML tasks

Supervised learning

Unsupervised learning

Reinforcement learning

Summarizing learning types with applications

Overview of Scala

ML libraries in Scala

Spark MLlib and ML

ScalNet and DynaML

ScalaNLP, Vegas, and Breeze

Getting started learning

Description of the dataset

Configuring the programming environment

Getting started with Apache Spark

Reading the training dataset

Preprocessing and feature engineering

Preparing training data and training a classifier

Evaluating the model

Summary

Scala for Regression Analysis

Technical requirements

An overview of regression analysis

Learning

Inferencing

Regression analysis algorithms

Performance metrics

Learning regression analysis through examples

Description of the dataset

Exploratory analysis of the dataset

Feature engineering and data preparation

Linear regression

Generalized linear regression (GLR)

Hyperparameter tuning and cross-validation

Hyperparameter tuning

Cross-validation

Tuning and cross-validation in Spark ML

Summary

Scala for Learning Classification

Technical requirements

Overview of classification

Developing predictive models for churn

Description of the dataset

Exploratory analysis and feature engineering

LR for churn prediction

NB for churn prediction

SVM for churn prediction

Summary

Scala for Tree-Based Ensemble Techniques

Technical requirements

Decision trees and tree ensembles

Decision trees for supervised learning

Decision trees for classification

Decision trees for regression

Gradient boosted trees for supervised learning

Gradient boosted trees for classification

GBTs for regression

Random forest for supervised learning

Random forest for classification

Random forest for regression

What's next?

Summary

Scala for Dimensionality Reduction and Clustering

Technical requirements

Overview of unsupervised learning

Clustering analysis

Clustering analysis algorithms

K-means for clustering analysis

Bisecting k-means

Gaussian mixture model

Other clustering analysis algorithms

Clustering analysis through examples

Description of the dataset

Preparing the programming environment

Clustering geographic ethnicity

Training the k-means algorithm

Dimensionality reduction

Principal component analysis with Spark ML

Determining the optimal number of clusters

The elbow method

The silhouette analysis

Summary

Scala for Recommender System

Technical requirements

Overview of recommendation systems

Types of recommender systems

Similarity-based recommender systems

Content-based filtering approaches

Collaborative filtering approaches

The utility matrix

Model-based book recommendation system

Matrix factorization

Exploratory analysis

Prepare training and test rating data

Adding new user ratings and making new predictions

Summary

Introduction to Deep Learning with Scala

Technical requirements

DL versus ML

DL and ANNs

ANNs and the human brain

A brief history of neural networks

How does an ANN learn?

Training a neural network

Weight and bias initialization

Activation functions

Neural network architectures

DNNs

Autoencoders

CNNs

RNNs

Generative adversarial networks (GANs)

Capsule networks

DL frameworks

Getting started with learning

Description of the dataset

Preparing the programming environment

Preprocessing

Dataset preparation

LSTM network construction

Network training

Evaluating the model

Observing the training using Deeplearning4j UI

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

Machine learning has made a huge impact not only in academia, but also in industry, by turning data into actionable intelligence. Scala is not only an object-oriented and functional programming language, but it can also leverage the advantages of the Java Virtual Machine (JVM). Scala helps keep code complexity under control and offers concise notation, which is probably the reason it has seen a steady rise in adoption over the last few years, especially in data science and analytics.

This book is aimed at aspiring data scientists, data engineers, and deep learning enthusiasts who are new to the field and want a great head start with machine learning best practices. Even if you're not well versed in machine learning concepts but still want to expand your knowledge by delving into practical implementations of supervised learning, unsupervised learning, and recommender systems with Scala, you will be able to grasp the content easily!

Throughout the chapters, you'll become acquainted with popular machine learning libraries in Scala, learning how to carry out regression and classification analysis using both linear methods and tree-based ensemble techniques, as well as looking at clustering analysis, dimensionality reduction, and recommender systems, before delving into deep learning at the end.

After reading this book, you will have a good head start in solving more complex machine learning tasks. This book isn't meant to be read cover to cover. You can turn the pages to a chapter that looks like something you're trying to accomplish or that ignites your interest.

Suggestions for improvement are always welcome. Happy reading!

Who this book is for

Machine learning developers looking to learn how to train machine learning models in Scala, without spending too much time and effort, will find this book to be very useful. Some fundamental knowledge of Scala programming and some basics of statistics and linear algebra is all you need to get started with this book.

What this book covers

Chapter 1, Introduction to Machine Learning with Scala, first explains some basic concepts of machine learning and different learning tasks. It then discusses Scala-based machine learning libraries, which is followed by configuring your programming environment. Finally, it covers Apache Spark briefly, before demonstrating a step-by-step example.

Chapter 2, Scala for Regression Analysis, covers a supervised learning task called regression analysis with examples, followed by regression metrics. It then explains some regression analysis algorithms, including linear regression and generalized linear regression. Finally, it demonstrates a step-by-step solution to a regression analysis task using Spark ML in Scala.

Chapter 3, Scala for Learning Classification, briefly explains another supervised learning task called classification with examples, followed by explaining how to interpret performance evaluation metrics. It then covers widely used classification algorithms such as logistic regression, Naïve Bayes, and support vector machines (SVMs). Finally, it demonstrates a step-by-step solution to a classification problem using Spark ML in Scala.

Chapter 4, Scala for Tree-Based Ensemble Techniques, covers very powerful and widely used tree-based approaches, including decision trees, gradient-boosted trees, and random forest algorithms, for both classification and regression analysis. It then revisits the examples of Chapter 2, Scala for Regression Analysis, and Chapter 3, Scala for Learning Classification, before solving them using these tree-based algorithms.

Chapter 5, Scala for Dimensionality Reduction and Clustering, briefly discusses different clustering analysis algorithms, followed by a step-by-step example of solving a clustering problem. Finally, it discusses the curse of dimensionality in high-dimensional data, before showing an example of solving it using principal component analysis (PCA).

Chapter 6, Scala for Recommender System, briefly covers similarity-based, content-based, and collaborative filtering approaches for developing recommendation systems. Finally, it demonstrates an example of a book recommender system with Spark ML in Scala.

Chapter 7, Introduction to Deep Learning with Scala, briefly covers deep learning, artificial neural networks, and neural network architectures. It then discusses some available deep learning frameworks. Finally, it demonstrates a step-by-step example of solving a cancer type prediction problem using a long short-term memory (LSTM) network.

To get the most out of this book

All the examples have been implemented in Scala with some open source libraries, including Apache Spark MLlib/ML and Deeplearning4j. However, to get the best out of them, you should have a powerful computer and software stack.

A Linux distribution is preferable (for example, Debian, Ubuntu, or CentOS). For example, for Ubuntu, it is recommended to have at least a 14.04 (LTS) 64-bit complete installation on VMware Workstation Player 12 or VirtualBox. You can run Spark jobs on Windows (7/8/10) or macOS X (10.4.7+) as well.

A computer with a Core i5 processor, enough storage (for example, for running Spark jobs, you'll need at least 50 GB of free disk space for a standalone cluster and for the SQL warehouse), and at least 16 GB of RAM is recommended. Optionally, if you want to perform neural network training on a GPU (for the last chapter only), the NVIDIA GPU driver has to be installed, with CUDA and cuDNN configured.

The following APIs and tools are required in order to execute the source code in this book (a minimal build definition sketch is given after the list):

Java/JDK, version 1.8

Scala, version 2.11.8

Spark, version 2.2.0 or higher

Spark csv_2.11, version 1.3.0

ND4j backend version nd4j-cuda-9.0-platform for GPU; otherwise, nd4j-native

ND4j, version 1.0.0-alpha

DL4j, version 1.0.0-alpha          

Datavec, version 1.0.0-alpha

Arbiter, version 1.0.0-alpha

Eclipse Mars or Luna (latest version) or IntelliJ IDEA

Maven Eclipse plugin (2.9 or higher)

Maven compiler plugin for Eclipse (2.3.2 or higher)

Maven assembly plugin for Eclipse (2.4.1 or higher)
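
The book's own build instructions are Maven-based (see the plugin versions listed above). If you prefer sbt, a minimal build.sbt along the following lines should pull in roughly equivalent dependencies; the exact artifact selection and the CPU-only ND4J backend are assumptions to adapt to your environment:

// build.sbt -- a minimal sketch, not the book's official build file
name := "scala-ml-quick-start"
version := "0.1"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"           % "2.2.0",
  "org.apache.spark"   %% "spark-sql"            % "2.2.0",
  "org.apache.spark"   %% "spark-mllib"          % "2.2.0",
  "org.deeplearning4j"  % "deeplearning4j-core"  % "1.0.0-alpha",
  // swap in "org.nd4j" % "nd4j-cuda-9.0-platform" % "1.0.0-alpha" for GPU training
  "org.nd4j"            % "nd4j-native-platform" % "1.0.0-alpha",
  "org.datavec"         % "datavec-api"          % "1.0.0-alpha"
)

With this in place, sbt compile should resolve everything needed for the Spark and Deeplearning4j examples; the Maven setup described in the book works equally well.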

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packt.com.

2. Select the SUPPORT tab.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-with-Scala-Quick-Start-Guide. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Code in Action

Visit the following link to check out videos of the code being run: http://bit.ly/2WhQf2i

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Introduction to Machine Learning with Scala

In this chapter, we will explain some basic concepts of machine learning (ML) that will be used in all subsequent chapters. We will start with a brief introduction to ML, including the basic learning workflow, the ML rule of thumb, and different learning tasks. Then we will gradually cover the most important ML tasks.

We will also discuss Scala and Scala-based ML libraries to give you a quick start for the following chapters. Finally, we will get started with ML using Scala and Spark ML by solving a real-life problem. The chapter will briefly cover the following topics:

Overview of ML

ML tasks

Introduction to Scala

Scala ML libraries

Getting started with ML with Spark ML

Technical requirements

You'll be required to have basic knowledge of Scala and Java. Since Scala is also a JVM-based language, make sure both the Java JRE and JDK are installed and configured on your machine. To be more specific, you'll need Scala 2.11.x and Java 1.8.x installed. You will also need an IDE, such as Eclipse, IntelliJ IDEA, or Scala IDE, with the necessary plugins. However, if you're using IntelliJ IDEA, Scala will already be integrated.
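
As a quick sanity check of the environment, a minimal sketch like the following (plain Scala, no extra dependencies; the object name is arbitrary) prints the Java and Scala versions your setup is actually using:

// EnvCheck.scala -- a minimal environment sanity check
object EnvCheck extends App {
  println(s"Java version : ${System.getProperty("java.version")}")        // expect 1.8.x
  println(s"Scala version: ${scala.util.Properties.versionNumberString}") // expect 2.11.x
}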

The code files of this chapter can be found on GitHub:

https://github.com/PacktPublishing/Machine-Learning-with-Scala-Quick-Start-Guide/tree/master/Chapter01

Check out the following video to see the Code in Action: http://bit.ly/2V3Id08

Overview of ML

ML approaches are based on a set of statistical and mathematical algorithms that carry out tasks such as classification, regression analysis, concept learning, predictive modeling, clustering, and the mining of useful patterns. Using ML, we aim to automate the whole learning process so that full human interaction is not needed, or at least so that the level of such interaction is reduced as much as possible.

Working principles of a learning algorithm

Tom M. Mitchell explained what learning really means from a computer science perspective:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Based on this definition, we can conclude that a computer program or machine can do the following:

Learn from data and histories

Improve with experience

Iteratively enhance a model that can be used to predict outcomes of questions

Since the preceding points are at the core of predictive analytics, almost every ML algorithm we use can be treated as an optimization problem. This is about finding parameters that minimize an objective function, for example, a weighted sum of two terms such as a cost function and regularization. Typically, an objective function has two components:

A regularizer, which controls the complexity of the model

The loss, which measures the error of the model on the training data

On the other hand, the regularization parameter defines the trade-off between minimizing the training error and the model's complexity, in an effort to avoid overfitting problems. Now, if both of these components are convex, then their sum is also convex. So, when using an ML algorithm, the goal is to obtain the best hyperparameters of a function that return the minimum error when making predictions. Therefore, by using a convex optimization technique, we can minimize the function until it converges toward the minimum error.
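
To make this concrete, here is a minimal, library-free Scala sketch (not taken from the book's code) of such an objective for a linear model: a mean squared error loss term plus an L2 regularizer weighted by a lambda parameter. Both terms are convex, so their sum is convex as well:

// A regularized objective J(w) = MSE(w) + lambda * ||w||^2 -- illustrative only
object RegularizedObjective {
  // Linear prediction: dot product of weights and features
  def predict(weights: Array[Double], features: Array[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum

  def objective(weights: Array[Double],
                data: Seq[(Array[Double], Double)], // (features, label) pairs
                lambda: Double): Double = {
    val loss = data.map { case (x, y) =>
      val err = predict(weights, x) - y
      err * err
    }.sum / data.size                               // mean squared training error
    val regularizer = weights.map(w => w * w).sum   // L2 penalty on model complexity
    loss + lambda * regularizer
  }
}

An optimizer such as gradient descent would then search for the weights that minimize this value; larger values of lambda favor simpler models at the cost of a higher training error.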

Given that a problem is convex, it is usually easier to analyze the asymptotic behavior of the algorithm, which shows how fast it converges as the model observes more and more training data. The task of ML is to train a model so that it can recognize complex patterns from the given input data and can make decisions in an automated way.

Thus, inferencing is all about testing the model against new (that is, unobserved) data and evaluating the performance of the model itself. However, throughout the whole process, and for making the predictive model a successful one, data acts as the first-class citizen in all ML tasks. In reality, the data that we feed to our machine learning systems must be represented as mathematical objects, such as vectors, so that the systems can consume it. For example, in the following diagram, raw images are embedded into numeric values called feature vectors before being fed into the learning algorithm: 

Depending on the available data and feature types, the performance of your predictive model can fluctuate dramatically. Therefore, selecting the right features is one of the most important steps before inferencing takes place. This is called feature engineering, where domain knowledge about the data is used to create only selective or useful features, which help prepare the feature vectors so that a machine learning algorithm works well.

For example, comparing hotels is quite difficult unless we already have personal experience of staying in multiple hotels. However, with the help of an ML model that has already been trained with quality features extracted from thousands of reviews (for example, how many stars a hotel has, the size of the rooms, the location, room service, and so on), it is now pretty feasible. We'll see several examples throughout the chapters. However, before developing such an ML model, knowing some ML concepts is also important. 
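
As an illustration of this step, here is a minimal Spark ML sketch that assembles raw numeric columns into a feature vector; the column names (stars, roomSize, rating) are hypothetical, chosen only to echo the hotel example above:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("FeatureEngineeringSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A tiny, made-up hotel dataset: star rating, room size, and review score
val hotels = Seq((5.0, 42.0, 4.6), (3.0, 18.0, 3.9)).toDF("stars", "roomSize", "rating")

// Combine the raw columns into a single numeric feature vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("stars", "roomSize", "rating"))
  .setOutputCol("features")

assembler.transform(hotels).show(false) // each row now carries a feature vector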

General machine learning rule of thumb

The general machine learning rule of thumb is that the more data there is, the better the predictive model. However, having more features often creates a mess, to the extent that the performance degrades drastically, especially if the dataset is high-dimensional. The entire learning process requires input datasets that can be split into three types (or are already provided as such):

A training set is the knowledge base, coming from historical or live data, that is used to fit the parameters of the ML algorithm. During the training phase, the ML model utilizes the training set to find the optimal weights and minimize the objective function (that is, the training error). Here, the backpropagation rule or another optimization algorithm is used to train the model, but all the hyperparameters need to be set before the learning process starts.

A validation set is a set of examples used to tune the parameters of an ML model. It ensures that the model is trained well and generalizes well, avoiding overfitting. Some ML practitioners also refer to it as a development set or dev set.

A test set is used for evaluating the performance of the trained model on unseen data. This step is also referred to as model inferencing. After assessing the final model on the test set (that is, once we're fully satisfied with the model's performance), we do not have to tune the model any further; the trained model can be deployed in a production-ready environment.

A common practice is splitting the input data (after the necessary preprocessing and feature engineering) into 60% for training, 20% for validation, and 20% for testing, but it really depends on the use case. Sometimes, we also need to up-sample or down-sample the data, based on the availability and quality of the datasets.
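
For example, a minimal Spark sketch of such a three-way split (the toy dataset, the 60/20/20 weights, and the seed are all illustrative) could look like this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("DataSplitSketch")
  .master("local[*]")
  .getOrCreate()

val data = spark.range(0, 1000).toDF("id") // stand-in for a real, preprocessed dataset

// randomSplit normalizes the weights, so they can be read as percentages
val Array(trainSet, validSet, testSet) =
  data.randomSplit(Array(0.6, 0.2, 0.2), seed = 12345L)

println(s"train=${trainSet.count()} validation=${validSet.count()} test=${testSet.count()}")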

This rule of thumb of learning on different types of training sets can differ across machine learning tasks, as we will cover in the next section. However, before that, let's take a quick look at a few common phenomena in machine learning.

General issues in machine learning models

When we use this input data for training, validation, and testing, the learning algorithm usually cannot learn with 100% accuracy, which gives rise to training, validation, and test errors (or loss). There are two types of error that one can encounter in a machine learning model:

Irreducible error

Reducible error

The irreducible error cannot be reduced even with the most robust and sophisticated model. However, the reducible error, which has two components, called bias and variance, can be reduced. Therefore, to understand the model (that is, prediction errors), we need to focus on bias and variance only:

Bias means how far the predicted values are from the actual values. Usually, if the average predicted values are very different from the actual values (labels), then the bias is higher.

An ML model has high bias when it can't model the relationship between the input and output variables (it can't capture the complexity of the data well) and is too simple. Thus, a too-simple model with high bias causes underfitting of the data.

The following diagram gives some high-level insights and also shows what a just-right fit model should look like:

Variance signifies the variability between the predicted values and the actual values (how scattered they are).

Identifying high bias and high variance: if the model has a high training error and the validation or test error is about the same as the training error, the model has high bias. On the other hand, if the model has a low training error but a high validation or test error, the model has high variance.

An ML model that performs very well on the training set but doesn't work well on the test set (because of a high error rate) is ultimately an overfit model. We can recap overfitting and underfitting once more:

Underfitting: If your training and validation error are both relatively equal and very high, then your model is most likely underfitting your training data.

Overfitting

: I