Supervised and unsupervised machine learning made easy in Scala with this quick-start guide.
Key Features
Book Description
Scala combines object-oriented and functional programming concepts in a highly scalable language, which makes it easy to build scalable and complex big data applications. This book is a handy guide for machine learning developers and data scientists who want to develop and train effective machine learning models in Scala.
The book starts with an introduction to machine learning, while covering deep learning and machine learning basics. It then explains how to use Scala-based ML libraries to solve classification and regression problems using linear regression, generalized linear regression, logistic regression, support vector machine, and Naive Bayes algorithms.
It also covers tree-based ensemble techniques for solving both classification and regression problems. Moving ahead, it covers unsupervised learning techniques, such as dimensionality reduction, clustering, and recommender systems. Finally, it provides a brief overview of deep learning using a real-life example in Scala.
What you will learn
Who this book is for
This book is for machine learning developers looking to train machine learning models in Scala without spending too much time and effort. Some fundamental knowledge of Scala programming and some basics of statistics and linear algebra is all you need to get started with this book.
Number of pages: 232
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Aditi Gour
Content Development Editor: Roshan Kumar
Technical Editor: Nilesh Sawakhande
Copy Editor: Safis Editing
Project Coordinator: Namrata Swetta
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Alishon Mendonsa
Production Coordinator: Shraddha Falebhai
First published: April 2019
Production reference: 1300419
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78934-507-0
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Md. Rezaul Karim is a researcher, author, and data science enthusiast with a strong computer science background, plus 10 years of R&D experience in machine learning, deep learning, and data mining algorithms to solve emerging bioinformatics research problems by making them explainable. He is passionate about applied machine learning, knowledge graphs, and explainable artificial intelligence (XAI).
Currently, he is working as a research scientist at Fraunhofer FIT, Germany. He is also a Ph.D. candidate at RWTH Aachen University, Germany. Before joining FIT, he worked as a researcher at the Insight Centre for Data Analytics, Ireland. Previously, he worked as a lead software engineer at Samsung Electronics, Korea.
Ajay Kumar N has experience in big data, and specializes in cloud computing and various big data frameworks, including Apache Spark and Apache Hadoop. His primary language of choice is Python, but he also has a special interest in functional programming languages such as Scala. He has worked extensively with NumPy, pandas, and scikit-learn, and often contributes to open source projects related to data science and machine learning.
Sarbashree Ray has over 5 years' experience in big data analytics, currently at Reliance Jio as a deputy manager. Sarbashree is an engineering professional with experience of designing and executing solutions for complex business problems involving large-scale big data and machine learning technologies, real-time analytics, and reporting solutions. He is also known for using the right tools when and where they make sense, and creating intuitive architectures that help organizations effectively analyze and process terabytes of structured and unstructured data. He is also able to integrate state-of-the-art big data technologies into overall architectures and lead a team of developers through the construction, testing, and implementation phases.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Machine Learning with Scala Quick Start Guide
About Packt
Why subscribe?
Packt.com
Contributors
About the author
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Code in Action
Conventions used
Get in touch
Reviews
Introduction to Machine Learning with Scala
Technical requirements
Overview of ML
Working principles of a learning algorithm
General machine learning rule of thumb
General issues in machine learning models
ML tasks
Supervised learning
Unsupervised learning
Reinforcement learning
Summarizing learning types with applications
Overview of Scala
ML libraries in Scala
Spark MLlib and ML
ScalNet and DynaML
ScalaNLP, Vegas, and Breeze
Getting started learning
Description of the dataset
Configuring the programming environment
Getting started with Apache Spark
Reading the training dataset
Preprocessing and feature engineering
Preparing training data and training a classifier
Evaluating the model
Summary
Scala for Regression Analysis
Technical requirements
An overview of regression analysis
Learning
Inferencing
Regression analysis algorithms
Performance metrics
Learning regression analysis through examples
Description of the dataset
Exploratory analysis of the dataset
Feature engineering and data preparation
Linear regression
Generalized linear regression (GLR)
Hyperparameter tuning and cross-validation
Hyperparameter tuning
Cross-validation
Tuning and cross-validation in Spark ML
Summary
Scala for Learning Classification
Technical requirements
Overview of classification
Developing predictive models for churn
Description of the dataset
Exploratory analysis and feature engineering
LR for churn prediction
NB for churn prediction
SVM for churn prediction
Summary
Scala for Tree-Based Ensemble Techniques
Technical requirements
Decision trees and tree ensembles
Decision trees for supervised learning
Decision trees for classification
Decision trees for regression
Gradient boosted trees for supervised learning
Gradient boosted trees for classification
GBTs for regression
Random forest for supervised learning
Random forest for classification
Random forest for regression
What's next?
Summary
Scala for Dimensionality Reduction and Clustering
Technical requirements
Overview of unsupervised learning
Clustering analysis
Clustering analysis algorithms
K-means for clustering analysis
Bisecting k-means
Gaussian mixture model
Other clustering analysis algorithms
Clustering analysis through examples
Description of the dataset
Preparing the programming environment
Clustering geographic ethnicity
Training the k-means algorithm
Dimensionality reduction
Principal component analysis with Spark ML
Determining the optimal number of clusters
The elbow method
The silhouette analysis
Summary
Scala for Recommender System
Technical requirements
Overview of recommendation systems
Types of recommender systems
Similarity-based recommender systems
Content-based filtering approaches
Collaborative filtering approaches
The utility matrix
Model-based book recommendation system
Matrix factorization
Exploratory analysis
Prepare training and test rating data
Adding new user ratings and making new predictions
Summary
Introduction to Deep Learning with Scala
Technical requirements
DL versus ML
DL and ANNs
ANNs and the human brain
A brief history of neural networks
How does an ANN learn?
Training a neural network
Weight and bias initialization
Activation functions
Neural network architectures
DNNs
Autoencoders
CNNs
RNNs
Generative adversarial networks (GANs)
Capsule networks
DL frameworks
Getting started with learning
Description of the dataset
Preparing the programming environment
Preprocessing
Dataset preparation
LSTM network construction
Network training
Evaluating the model
Observing the training using Deeplearning4j UI
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Machine learning has made a huge impact not only in academia, but also in industry, by turning data into actionable intelligence. Scala is not only an object-oriented and functional programming language, but can also leverage the advantages of Java Virtual Machine (JVM). Scala provides code complexity optimization and offers concise notation, which is probably the reason it has seen a steady rise in adoption over the last few years, especially in data science and analytics.
This book is aimed at aspiring data scientists, data engineers, and deep learning enthusiasts who are newbies and want to have a great head start at machine learning best practices. Even if you're not well versed in machine learning concepts, but still want to expand your knowledge by delving into practical implementations of supervised learning, unsupervised learning, and recommender systems with Scala, you will be able to grasp the content easily!
Throughout the chapters, you'll become acquainted with popular machine learning libraries in Scala, learning how to carry out regression and classification analysis using both linear methods and tree-based ensemble techniques, as well as looking at clustering analysis, dimensionality reduction, and recommender systems, before delving into deep learning at the end.
After reading this book, you will have a good head start in solving more complex machine learning tasks. This book isn't meant to be read cover to cover. You can turn the pages to a chapter that looks like something you're trying to accomplish or that ignites your interest.
Suggestions for improvement are always welcome. Happy reading!
Machine learning developers looking to learn how to train machine learning models in Scala, without spending too much time and effort, will find this book to be very useful. Some fundamental knowledge of Scala programming and some basics of statistics and linear algebra is all you need to get started with this book.
Chapter 1, Introduction to Machine Learning with Scala, first explains some basic concepts of machine learning and different learning tasks. It then discusses Scala-based machine learning libraries, which is followed by configuring your programming environment. Finally, it covers Apache Spark briefly, before demonstrating a step-by-step example.
Chapter 2, Scala for Regression Analysis, covers a supervised learning task called regression analysis with examples, followed by regression metrics. It then explains some regression analysis algorithms, including linear regression and generalized linear regression. Finally, it demonstrates a step-by-step solution to a regression analysis task using Spark ML in Scala.
Chapter 3, Scala for Learning Classification, briefly explains another supervised learning task called classification with examples, followed by explaining how to interpret performance evaluation metrics. It then covers widely used classification algorithms such as logistic regression, Naïve Bayes, and support vector machines (SVMs). Finally, it demonstrates a step-by-step solution to a classification problem using Spark ML in Scala.
Chapter 4, Scala for Tree-Based Ensemble Techniques, covers very powerful and widely used tree-based approaches, including decision trees, gradient-boosted trees, and random forest algorithms, for both classification and regression analysis. It then revisits the examples of Chapter 2, Scala for Regression Analysis, and Chapter 3, Scala for Learning Classification, before solving them using these tree-based algorithms.
Chapter 5, Scala for Dimensionality Reduction and Clustering, briefly discusses different clustering analysis algorithms, followed by a step-by-step example of solving a clustering problem. Finally, it discusses the curse of dimensionality in high-dimensional data, before showing an example of solving it using principal component analysis (PCA).
Chapter 6, Scala for Recommender System, briefly covers similarity-based, content-based, and collaborative filtering approaches for developing recommendation systems. Finally, it demonstrates an example of a book recommender system with Spark ML in Scala.
Chapter 7, Introduction to Deep Learning with Scala, briefly covers deep learning, artificial neural networks, and neural network architectures. It then discusses some available deep learning frameworks. Finally, it demonstrates a step-by-step example of solving a cancer type prediction problem using a long short-term memory (LSTM) network.
All the examples have been implemented in Scala with some open source libraries, including Apache Spark MLlib/ML and Deeplearning4j. However, to get the best out of them, you should have a powerful computer and software stack.
A Linux distribution is preferable (for example, Debian, Ubuntu, or CentOS). For example, for Ubuntu, it is recommended to have at least a 14.04 (LTS) 64-bit complete installation on VMware Workstation Player 12 or VirtualBox. You can run Spark jobs on Windows (7/8/10) or macOS X (10.4.7+) as well.
A computer with a Core i5 processor, enough storage (for example, for running Spark jobs, you'll need at least 50 GB of free disk storage for standalone cluster and for the SQL warehouse), and at least 16 GB RAM are recommended. And optionally, if you want to perform the neural network training on the GPU (for the last chapter only), the NVIDIA GPU driver has to be installed with CUDA and CuDNN configured.
The following APIs and tools are required in order to execute the source code in this book:
Java/JDK, version 1.8
Scala, version 2.11.8
Spark, version 2.2.0 or higher
Spark csv_2.11, version 1.3.0
ND4j backend version nd4j-cuda-9.0-platform for GPU; otherwise, nd4j-native
ND4j, version 1.0.0-alpha
DL4j, version 1.0.0-alpha
Datavec, version 1.0.0-alpha
Arbiter, version 1.0.0-alpha
Eclipse Mars or Luna (latest version) or IntelliJ IDEA
Maven Eclipse plugin (2.9 or higher)
Maven compiler plugin for Eclipse (2.3.2 or higher)
Maven assembly plugin for Eclipse (2.4.1 or higher)
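As a rough illustration of how the versions listed above might be pinned in a build, here is a minimal sbt sketch. The book's examples are built with Maven, so the use of sbt here is an assumption, as are the exact artifact coordinates; check them against the book's own pom.xml before relying on them.

```scala
// build.sbt (sketch only): versions are taken from the list above,
// while the artifact coordinates are assumptions to be verified.
scalaVersion := "2.11.8"

val sparkVersion = "2.2.0"
val dl4jVersion  = "1.0.0-alpha"

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"           % sparkVersion,
  "org.apache.spark"   %% "spark-sql"            % sparkVersion,
  "org.apache.spark"   %% "spark-mllib"          % sparkVersion,
  "org.deeplearning4j" %  "deeplearning4j-core"  % dl4jVersion,
  // CPU backend; a CUDA backend would be swapped in for GPU training
  "org.nd4j"           %  "nd4j-native-platform" % dl4jVersion
)
```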
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
Log in or register at www.packt.com.
Select the SUPPORT tab.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-with-Scala-Quick-Start-Guide. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Visit the following link to check out videos of the code being run: http://bit.ly/2WhQf2i
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
In this chapter, we will explain some basic concepts of machine learning (ML) that will be used in all subsequent chapters. We will start with a brief introduction to ML, including the basic learning workflow, the ML rule of thumb, and different learning tasks. Then we will gradually cover the most important ML tasks.
We will also discuss getting started with Scala and Scala-based ML libraries, to prepare for the next chapter. Finally, we will get started with ML using Scala and Spark ML by solving a real-life problem. The chapter will briefly cover the following topics:
Overview of ML
ML tasks
Introduction to Scala
Scala ML libraries
Getting started with ML with Spark ML
You'll be required to have basic knowledge of Scala and Java. Since Scala is a JVM-based language, make sure both the Java JRE and JDK are installed and configured on your machine. To be more specific, you'll need Scala 2.11.x and Java 1.8.x installed. You also need an IDE, such as Eclipse, IntelliJ IDEA, or Scala IDE, with the necessary plugins. However, if you're using IntelliJ IDEA, Scala will already be integrated.
The code files of this chapter can be found on GitHub:
https://github.com/PacktPublishing/Machine-Learning-with-Scala-Quick-Start-Guide/tree/master/Chapter01
Check out the following video to see the Code in Action: http://bit.ly/2V3Id08
ML approaches are based on a set of statistical and mathematical algorithms in order to carry out tasks such as classification, regression analysis, concept learning, predictive modeling, clustering, and mining of useful patterns. Using ML, we aim to automate the whole learning process so that human interaction is either not needed at all or is at least reduced as much as possible.
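Tom M. Mitchell explained what learning really means from a computer science perspective:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."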
Based on this definition, we can conclude that a computer program or machine can do the following:
Learn from data and histories
Improve with experience
Iteratively enhance a model that can be used to predict outcomes of questions
Since the preceding points are at the core of predictive analytics, almost every ML algorithm we use can be treated as an optimization problem. This is about finding parameters that minimize an objective function, for example, a weighted sum of two terms such as a cost function and regularization. Typically, an objective function has two components:
A regularizer, which controls the complexity of the model
The loss, which measures the error of the model on the training data
The regularization parameter defines the trade-off between minimizing the training error and limiting the model's complexity, in an effort to avoid overfitting. If both components are convex, then their sum is also convex. So, when using an ML algorithm, the goal is to find the model parameters that yield the minimum error when making predictions. With a convex optimization technique, we can minimize the objective function until it converges toward that minimum.
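To make the idea of an objective function with a loss term and a regularizer more concrete, here is a minimal sketch in plain Scala. It assumes a one-parameter linear model, a mean squared error loss, and an L2 regularizer; the names `objective`, `gradient`, and `minimize`, as well as the learning rate and iteration count, are hypothetical choices for illustration and are not taken from the book's code.

```scala
object RegularizedObjective {
  // Objective: J(w) = (1/n) * sum_i (w * x_i - y_i)^2 + lambda * w^2
  // Both terms are convex in w, so their sum is convex as well.
  def objective(w: Double, xs: Seq[Double], ys: Seq[Double], lambda: Double): Double = {
    val loss = xs.zip(ys).map { case (x, y) => math.pow(w * x - y, 2) }.sum / xs.size
    val regularizer = lambda * w * w
    loss + regularizer
  }

  // Derivative of the objective with respect to w.
  def gradient(w: Double, xs: Seq[Double], ys: Seq[Double], lambda: Double): Double = {
    val lossGrad = xs.zip(ys).map { case (x, y) => 2.0 * (w * x - y) * x }.sum / xs.size
    lossGrad + 2.0 * lambda * w
  }

  // Plain gradient descent with a fixed step size, starting from w = 0.
  // learningRate and iterations are illustrative defaults, not tuned values.
  def minimize(xs: Seq[Double], ys: Seq[Double], lambda: Double,
               learningRate: Double = 0.01, iterations: Int = 1000): Double =
    (1 to iterations).foldLeft(0.0) { (w, _) =>
      w - learningRate * gradient(w, xs, ys, lambda)
    }
}
```

With `lambda = 0.0`, the minimizer simply fits the data; a positive `lambda` pulls the learned weight toward zero, which is the trade-off between training error and model complexity described above.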
Given that a problem is convex, it is usually easier to analyze the asymptotic behavior of the algorithm, which shows how fast it converges as the model observes more and more training data. The task of ML is to train a model so that it can recognize complex patterns from the given input data and can make decisions in an automated way.
Inferencing, in turn, is all about testing the model against new (that is, unobserved) data and evaluating the performance of the model itself. Throughout the whole process, data is the first-class citizen in all ML tasks and is key to making a predictive model successful. In practice, the data that we feed to our machine learning systems must be made up of mathematical objects, such as vectors, so that the systems can consume it. For example, in the following diagram, raw images are embedded into numeric values called feature vectors before being fed into the learning algorithm:
Depending on the available data and feature types, the performance of your predictive model can vary dramatically. Therefore, selecting the right features is one of the most important steps before inferencing takes place. This is called feature engineering, where domain knowledge about the data is used to create only selective or useful features, from which the feature vectors consumed by the machine learning algorithm are prepared.
For example, comparing hotels is quite difficult unless we already have personal experience of staying in multiple hotels. However, with the help of an ML model that has already been trained on quality features drawn from thousands of reviews (for example, how many stars a hotel has, the size of the room, location, room service, and so on), it becomes quite feasible. We'll see several examples throughout the chapters. However, before developing such an ML model, knowing some ML concepts is also important.
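As a small illustration of turning the hotel attributes mentioned above into a feature vector, consider the following sketch in plain Scala. The `Hotel` case class, its field names, and the fixed list of locations are hypothetical and exist only for this example; a real pipeline would typically rely on the transformers of an ML library.

```scala
object HotelFeatures {
  // Hypothetical raw record; the field names are illustrative only.
  final case class Hotel(stars: Int, roomSizeSqm: Double,
                         location: String, hasRoomService: Boolean)

  // Fixed vocabulary for the categorical "location" field (an assumption).
  val locations: Seq[String] = Seq("city", "beach", "airport")

  // Layout: [stars, roomSizeSqm, one-hot(location)..., hasRoomService as 0/1]
  def toFeatureVector(h: Hotel): Array[Double] = {
    val oneHot = locations.map(l => if (l == h.location) 1.0 else 0.0)
    val roomService = if (h.hasRoomService) 1.0 else 0.0
    (Seq(h.stars.toDouble, h.roomSizeSqm) ++ oneHot :+ roomService).toArray
  }
}
```

Numeric attributes are copied as they are, the categorical location is one-hot encoded, and the Boolean flag becomes 0 or 1, so every hotel ends up as a vector of the same length.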
The general machine learning rule of thumb is that the more data there is, the better the predictive model. However, having more features often creates a mess, to the extent that the performance degrades drastically, especially if the dataset is high-dimensional. The entire learning process requires input datasets that can be split into three types (or are already provided as such):
A training set is the knowledge base, coming from historical or live data, that is used to fit the parameters of the ML algorithm. During the training phase, the ML model uses the training set to find the optimal weights by minimizing the training error, that is, the objective function. Here, backpropagation or another optimization algorithm is used to train the model, but all hyperparameters must be set before the learning process starts.
A validation set is a set of examples used to tune the hyperparameters of an ML model. It helps ensure that the model is trained well and generalizes, rather than overfitting. Some ML practitioners refer to it as a development set or dev set.
A test set is used for evaluating the performance of the trained model on unseen data. This step is also referred to as model inferencing. Once the final model has been assessed on the test set and we are satisfied with its performance, no further tuning is needed and the trained model can be deployed in a production-ready environment.
A common practice is splitting the input data (after the necessary preprocessing and feature engineering) into 60% for training, 20% for validation, and 20% for testing, but it really depends on the use case. Sometimes, we also need to perform up-sampling or down-sampling on the data, based on the availability and quality of the datasets.
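The splitting step can be sketched in plain Scala as follows. The `split` helper, its parameter names, and the fixed seed are hypothetical; the fractions are parameters, so any ratio that leaves room for a test set can be used.

```scala
import scala.util.Random

object DataSplit {
  // Shuffle with a fixed seed, then cut into training/validation/test parts.
  // The test set receives whatever remains after the first two cuts.
  def split[A](data: Seq[A], trainFrac: Double, validFrac: Double,
               seed: Long = 42L): (Seq[A], Seq[A], Seq[A]) = {
    require(trainFrac > 0 && validFrac >= 0 && trainFrac + validFrac < 1.0,
      "fractions must leave room for a test set")
    val shuffled = new Random(seed).shuffle(data)
    val nTrain = math.round(data.size * trainFrac).toInt
    val nValid = math.round(data.size * validFrac).toInt
    val (train, rest) = shuffled.splitAt(nTrain)
    val (valid, test) = rest.splitAt(nValid)
    (train, valid, test)
  }
}
```

For example, `DataSplit.split(1 to 100, 0.6, 0.2)` yields three disjoint parts that together cover the original data. When the data lives in a Spark DataFrame, `randomSplit` with an array of weights serves a similar purpose.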
This rule of thumb of learning on different types of training sets can differ across machine learning tasks, as we will cover in the next section. However, before that, let's take a quick look at a few common phenomena in machine learning.
When we use this input data for training, validation, and testing, the learning algorithm usually cannot learn with 100% accuracy, which gives rise to training, validation, and test errors (or losses). There are two types of error that one can encounter in a machine learning model:
Irreducible error
Reducible error
The irreducible error cannot be reduced even with the most robust and sophisticated model. However, the reducible error, which has two components, called bias and variance, can be reduced. Therefore, to understand the model (that is, prediction errors), we need to focus on bias and variance only:
Bias measures how far the predicted values are from the actual values. Usually, if the average predicted values are very different from the actual values (labels), then the bias is higher.
An ML model has a high bias when it is too simple to model the relationship between the input and output variables (that is, it can't capture the complexity of the data well). Thus, a too-simple model with high bias underfits the data.
The following diagram gives some high-level insights and also shows what a just-right fit model should look like:
Variance signifies the variability between the predicted values and the actual values (how scattered they are).
An ML model with high variance usually performs very well on the training set but doesn't work well on the test set (because of high error rates). Ultimately, it results in an overfit model. We can recap the overfitting and underfitting once more:
Underfitting: If your training and validation errors are both roughly equal and very high, then your model is most likely underfitting your training data.
Overfitting
: I