Scala for Machine Learning - Second Edition

Patrick R. Nicolas

Description

Leverage Scala and Machine Learning to study and construct systems that can learn from data

About This Book

  • Explore a broad variety of data processing, machine learning, and genetic algorithms through diagrams, mathematical formulation, and updated source code in Scala
  • Take your expertise in Scala programming to the next level by creating and customizing AI applications
  • Experiment with different techniques and evaluate their benefits and limitations using real-world applications in a tutorial style

Who This Book Is For

If you're a data scientist or a data analyst with a fundamental knowledge of Scala who wants to learn and implement various machine learning techniques, this book is for you. All you need is a good understanding of the Scala programming language, a basic knowledge of statistics, a keen interest in Big Data processing, and this book!

What You Will Learn

  • Build dynamic workflows for scientific computing
  • Leverage open source libraries to extract patterns from time series
  • Write your own classification, clustering, or evolutionary algorithm
  • Perform relative performance tuning and evaluation of Spark
  • Master probabilistic models for sequential data
  • Experiment with advanced techniques such as regularization and kernelization
  • Dive into neural networks and some deep learning architecture
  • Apply some basic multi-armed bandit algorithms
  • Solve big data problems with Scala parallel collections, Akka actors, and Apache Spark clusters
  • Apply key learning strategies to a technical analysis of financial markets

In Detail

The discovery of information through data clustering and classification is becoming a key differentiator for competitive organizations. Machine learning applications are everywhere, from self-driving cars, engineering design, logistics, manufacturing, and trading strategies to the detection of genetic anomalies.

This book is your one-stop guide to the functional capabilities of the Scala programming language, such as dependency injection and implicits, that are critical to building machine learning algorithms. You start by learning data preprocessing and filtering techniques. Following this, you'll move on to unsupervised learning techniques such as clustering and dimension reduction, followed by probabilistic graphical models such as Naive Bayes, hidden Markov models, and Monte Carlo inference. The book then covers discriminative algorithms such as linear and logistic regression with regularization, kernelization, support vector machines, neural networks, and deep learning. You'll also explore evolutionary computing, multi-armed bandit algorithms, and reinforcement learning.

Finally, the book includes a comprehensive overview of parallel computing in Scala and Akka, followed by a description of Apache Spark and its ML library. With code updated for the latest version of Scala and comprehensive examples, this book will give you more than just a solid foundation in machine learning with Scala.

Style and approach

This book is designed as a tutorial with hands-on exercises using technical analysis of financial markets and corporate data. Each chapter is structured so that you can grasp its key concepts easily.


Table of Contents

Scala for Machine Learning Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Getting Started
Mathematical notations for the curious
Why machine learning?
Classification
Prediction
Optimization
Regression
Why Scala?
Scala as a functional language
Abstraction
Higher kinded types
Functors
Monads
Scala as an object oriented language
Scala as a scalable language
Model categorization
Taxonomy of machine learning algorithms
Unsupervised learning
Clustering
Dimension reduction
Supervised learning
Generative models
Discriminative models
Semi-supervised learning
Reinforcement learning
Leveraging Java libraries
Tools and frameworks
Java
Scala
Eclipse Scala IDE
IntelliJ IDEA Scala plugin
Simple build tool
Apache Commons Math
Description
Licensing
Installation
JFreeChart
Description
Licensing
Installation
Other libraries and frameworks
Source code
Convention
Context bounds
Presentation
Primitives and implicits
Immutability
Let's kick the tires
Writing a simple workflow
Step 1 – scoping the problem
Step 2 – loading data
Step 3 – preprocessing data
Immutable normalization
Step 4 – discovering patterns
Analyzing data
Plotting data
Visualizing model features
Visualizing label
Step 5 – implementing the classifier
Selecting an optimizer
Training the model
Classifying observations
Step 6 – evaluating the model
Summary
2. Data Pipelines
Modeling
What is a model?
Model versus design
Selecting features
Extracting features
Defining a methodology
Monadic data transformation
Error handling
Monads to the rescue
Implicit models
Explicit models
Workflow computational model
Supporting mathematical abstractions
Step 1 – variable declaration
Step 2 – model definition
Step 3 – instantiation
Composing mixins to build workflow
Understanding the problem
Defining modules
Instantiating the workflow
Modularizing
Profiling data
Immutable statistics
Z-score and Gauss
Assessing a model
Validation
Key quality metrics
F-score for binomial classification
F-score for multinomial classification
Area under the curves
Area under PRC
Area under ROC
Cross-validation
One-fold cross-validation
K-fold cross-validation
Bias-variance decomposition
Overfitting
Summary
3. Data Preprocessing
Time series in Scala
Context bounds
Types and operations
Transpose operator
Differential operator
Lazy views
Moving averages
Simple moving average
Weighted moving average
Exponential moving average
Fourier analysis
Discrete Fourier transform (DFT)
DFT-based filtering
Detection of market cycles
The discrete Kalman filter
The state space estimation
The transition equation
The measurement equation
The recursive algorithm
Prediction
Correction
Kalman smoothing
Fixed lag smoothing
Experimentation
Benefits and drawbacks
Alternative preprocessing techniques
Summary
4. Unsupervised Learning
K-means clustering
K-means
Measuring similarity
Defining the algorithm
Step 1 – Clusters configuration
Defining clusters
Initializing clusters
Step 2 – Clusters assignment
Step 3 – Reconstruction error minimization
Creating K-means components
Tail recursive implementation
Iterative implementation
Step 4 – Classification
Curse of dimensionality
Evaluation
The results
Tuning the number of clusters
Validation
Expectation-Maximization (EM)
Gaussian mixture model
EM overview
Implementation
Classification
Testing
Online EM
Summary
5. Dimension Reduction
Challenging model complexity
The divergences
The Kullback-Leibler divergence
Overview
Implementation
Testing
The mutual information
Principal components analysis (PCA)
Algorithm
Implementation
Test case
Evaluation
Extending PCA
Validation
Categorical features
Performance
Nonlinear models
Kernel PCA
Manifolds
Summary
6. Naïve Bayes Classifiers
Probabilistic graphical models
Naïve Bayes classifiers
Introducing the multinomial Naïve Bayes
Formalism
The frequentist perspective
The predictive model
The zero-frequency problem
Implementation
Design
Training
Class likelihood
Binomial model
Multinomial model
Classifier components
Classification
F1 Validation
Features extraction
Testing
Multivariate Bernoulli classification
Model
Implementation
Naïve Bayes and text mining
Basics information retrieval
Implementation
Analyzing documents
Extracting relative terms frequency
Generating the features
Testing
Retrieving textual information
Evaluating text mining classifier
Pros and cons
Summary
7. Sequential Data Models
Markov decision processes
The Markov property
The first-order discrete Markov chain
The hidden Markov model (HMM)
Notation
The lambda model
Design
Evaluation (CF-1)
Alpha (forward pass)
Beta (backward pass)
Training (CF-2)
Baum-Welch estimator (EM)
Decoding (CF-3)
The Viterbi algorithm
Putting it all together
Test case 1 – Training
Test case 2 – Evaluation
HMM as filtering technique
Conditional random fields
Introduction to CRF
Linear chain CRF
Regularized CRF and text analytics
The feature functions model
Design
Implementation
Configuring the CRF classifier
Training the CRF model
Applying the CRF model
Tests
The training convergence profile
Impact of the size of the training set
Impact of L2 regularization factor
Comparing CRF and HMM
Performance consideration
Summary
8. Monte Carlo Inference
The purpose of sampling
Gaussian sampling
Box-Muller transform
Monte Carlo approximation
Overview
Implementation
Bootstrapping with replacement
Overview
Resampling
Implementation
Pros and cons of bootstrap
Markov Chain Monte Carlo (MCMC)
Overview
Metropolis-Hastings (MH)
Implementation
Test
Summary
9. Regression and Regularization
Linear regression
Univariate linear regression
Implementation
Test case
Ordinary least squares (OLS) regression
Design
Implementation
Test case 1 – trending
Test case 2 – features selection
Regularization
Ln roughness penalty
Ridge regression
Design
Implementation
Test case
Numerical optimization
Logistic regression
Logistic function
Design
Training workflow
Step 1 – configuring the optimizer
Step 2 – computing the Jacobian matrix
Step 3 – managing the convergence of optimizer
Step 4 – defining the least squares problem
Step 5 – minimizing the sum of square errors
Test
Classification
Summary
10. Multilayer Perceptron
Feed-forward neural networks (FFNN)
The biological background
Mathematical background
The multilayer perceptron (MLP)
Activation function
Network topology
Design
Configuration
Network components
Network topology
Input and hidden layers
Output layer
Synapses
Connections
Weights initialization
Model
Problem types (modes)
Online versus batch training
Training epoch
Step 1 – input forward propagation
Computational flow
Error functions
Operating modes
Softmax
Step 2 – error backpropagation
Weights adjustment
Error propagation
The computational model
Step 3 – exit condition
Putting it all together
Training and classification
Regularization
Model generation
Fast Fisher-Yates shuffle
Prediction
Model fitness
Evaluation
Execution profile
Impact of learning rate
Impact of the momentum factor
Impact of the number of hidden layers
Test case
Implementation
Models evaluation
Impact of hidden layers' architecture
Benefits and limitations
Summary
11. Deep Learning
Sparse autoencoder
Undercomplete autoencoder
Deterministic autoencoder
Categorization
Feed-forward sparse, undercomplete autoencoder
Sparsity updating equations
Implementation
Restricted Boltzmann Machines (RBMs)
Boltzmann machine
Binary restricted Boltzmann machines
Conditional probabilities
Sampling
Log-likelihood gradient
Contrastive divergence
Configuration parameters
Unsupervised learning
Convolution neural networks
Local receptive fields
Weight sharing
Convolution layers
Sub-sampling layers
Putting it all together
Summary
12. Kernel Models and SVM
Kernel functions
Overview
Common discriminative kernels
Kernel monadic composition
The support vector machine (SVM)
The linear SVM
The separable case (hard margin)
The non-separable case (soft margin)
The nonlinear SVM
Max-margin classification
The kernel trick
Support vector classifier (SVC)
The binary SVC
LIBSVM
Design
Configuration parameters
The SVM formulation
The SVM kernel function
The SVM execution
Interface to LIBSVM
Training
Classification
C-penalty and margin
Kernel evaluation
Application to risk analysis
Anomaly detection with one-class SVC
Support vector regression (SVR)
Overview
SVR versus linear regression
Performance considerations
Summary
13. Evolutionary Computing
Evolution
The origin
NP problems
Evolutionary computing
Genetic algorithms and machine learning
Genetic algorithm components
Encodings
Value encoding
Predicate encoding
Solution encoding
The encoding scheme
Flat encoding
Hierarchical encoding
Genetic operators
Selection
Crossover
Mutation
Fitness score
Implementation
Software design
Key components
Population
Chromosomes
Genes
Selection
Controlling population growth
GA configuration
Crossover
Population
Chromosomes
Genes
Mutation
Population
Chromosomes
Genes
Reproduction
Solver
GA for trading strategies
Definition of trading strategies
Trading operators
The cost function
Market signals
Trading strategies
Signal encoding
Test case – Fall 2008 market crash
Creating trading strategies
Configuring the optimizer
Finding the best trading strategy
Tests
The weighted score
The unweighted score
Advantages and risks of genetic algorithms
Summary
14. Multiarmed Bandits
K-armed bandit
Exploration-exploitation trade-offs
Expected cumulative regret
Bayesian Bernoulli bandits
Epsilon-greedy algorithm
Thompson sampling
Bandit context
Prior/posterior beta distribution
Implementation
Simulated exploration and exploitation
Upper bound confidence
Confidence interval
Implementation
Summary
15. Reinforcement Learning
Reinforcement learning
Understanding the challenge
A solution – Q-learning
Terminology
Concept
Value of policy
Bellman optimality equations
Temporal difference for model-free learning
Action-value iterative update
Implementation
Software design
The states and actions
The search space
The policy and action-value
The Q-learning components
The Q-learning training
Tail recursion to the rescue
Validation
The prediction
Option trading using Q-learning
Option property
Option model
Quantization
Putting it all together
Evaluation
Pros and cons of reinforcement learning
Learning classifier systems
Introduction to LCS
Combining learning and evolution
Terminology
Extended learning classifier systems
XCS components
Application to portfolio management
XCS core data
XCS rules
Covering
Example of implementation
Benefits and limitations of learning classifier systems
Summary
16. Parallelism in Scala and Akka
Overview
Scala
Object creation
Streams
Memory on demand
Design for reusing Streams memory
Parallel collections
Processing a parallel collection
Benchmark framework
Performance evaluation
Scalability with Actors
The Actor model
Partitioning
Beyond Actors – reactive programming
Akka
Master-workers
Messages exchange
Worker Actors
The workflow controller
The master Actor
Master with routing
Distributed discrete Fourier transform
Limitations
Futures
Blocking on futures
Future callbacks
Putting it all together
Summary
17. Apache Spark MLlib
Overview
Apache Spark core
Why Spark?
Design principles
In-memory persistency
Laziness
Transforms and actions
Shared variables
Experimenting with Spark
Deploying Spark
Using Spark shell
MLlib library
Overview
Creating RDDs
K-means using MLlib
Tests
Reusable ML pipelines
Reusable ML transforms
Encoding features
Training the model
Predictive model
Training summary statistics
Validating the model
Grid search
Apache Spark and ScalaTest
Extending Spark
Kullback-Leibler divergence
Implementation
Kullback-Leibler evaluator
Streaming engine
Why streaming?
Batch and real-time processing
Architecture overview
Discretized streams
Use case – continuous parsing
Checkpointing
Performance evaluation
Tuning parameters
Performance considerations
Pros and cons
Summary
A. Basic Concepts
Scala programming
List of libraries and tools
Code snippets format
Best practices
Encapsulation
Class constructor template
Companion objects versus case classes
Enumerations versus case classes
Overloading
Design template for immutable classifiers
Utility classes
Data extraction
Financial data sources
Documents extraction
DMatrix class
Counter
Monitor
Mathematics
Linear algebra
QR decomposition
LU factorization
LDL decomposition
Cholesky factorization
Singular Value Decomposition (SVD)
Eigenvalue decomposition
Algebraic and numerical libraries
First order predicate logic
Jacobian and Hessian matrices
Summary of optimization techniques
Gradient descent methods
Steepest descent
Conjugate gradient
Stochastic gradient descent
Quasi-Newton algorithms
BFGS
L-BFGS
Nonlinear least squares minimization
Gauss-Newton
Levenberg-Marquardt
Lagrange multipliers
Overview dynamic programming
Finances 101
Fundamental analysis
Technical analysis
Terminology
Trading data
Trading signal and strategy
Price patterns
Options trading
Financial data sources
Suggested online courses
References
B. References
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 9
Chapter 10
Chapter 11
Chapter 12
Chapter 13
Chapter 14
Chapter 15
Chapter 16
Chapter 17
Index

Scala for Machine Learning Second Edition

Scala for Machine Learning Second Edition

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: December 2015

Second edition: September 2017

Production reference: 1190917

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78712-238-3

www.packtpub.com

Credits

Author

Patrick R. Nicolas

Reviewers

Sumit Pal

Dave Wentzel

Commissioning Editor

Amey Varangaonkar

Acquisition Editor

Tushar Gupta

Content Development Editor

Amrita Noronha

Technical Editor

Nilesh Sawakhande

Copy Editors

Safis Editing

Laxmi Subramanian

Project Coordinator

Shweta H Birwatkar

Proofreader

Safis Editing

Indexer

Mariammal Chettiyar

Graphics

Tania Dutta

Production Coordinator

Shantanu Zagade

Cover Work

Deepika Naik

About the Author

Patrick R. Nicolas is the director of engineering at Agile SDE, California. He has more than 25 years of experience in software engineering and building applications in C++, Java, and more recently in Scala/Spark, and has held several managerial positions. His interests include real-time analytics, modeling, and the development of nonlinear models.

About the Reviewers

Sumit Pal has more than 24 years of experience in the software industry, spanning companies from start-ups to enterprises.

He is a big data architect and a visualization and data science consultant who builds end-to-end data-driven analytic systems.

Sumit has worked for Microsoft (SQLServer), Oracle (OLAP), and Verizon (big data analytics).

Currently, he works for multiple clients, building their data architectures and big data solutions and works with Spark, Scala, Java, and Python.

He has extensive experience in building scalable systems, from the middle tier and data tier to visualization, for analytics applications using big data and NoSQL databases.

Sumit has expertise in database internals, data warehouses, and dimensional modeling. As an associate director for big data at Verizon, he strategized, managed, architected, and developed analytic platforms for machine learning applications. He was the chief architect at ModelN/LeapfrogRX (2006-2013), where he architected the core analytics platform.

He is the author of SQL On Big Data - Technology, Architecture and Roadmap published by Apress in October 2016.

He has spoken on the topic covered in this book at the following conferences:

  • May 2016, Big Data Conference - Linux Foundation in Vancouver, Canada
  • March 2016, World Data Center Conference in Las Vegas, USA
  • November 2015, BigData TechCon in Chicago, USA
  • August 2015, Global Big Data Conference in Boston, USA

Dave Wentzel is the Chief Technology Officer (CTO) of Capax Global, a premier Microsoft consulting partner. Dave is responsible for setting the strategy and defining service offerings and capabilities for the data platform and Azure practice at Capax. Dave also works directly with clients to help them with their big data journey. Dave is a frequent blogger and speaker on big data and data science topics.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review.

If you’d like to join our team of regular reviewers, you can e-mail us at <[email protected]>. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Preface

Not a single day passes that we do not hear about big data in the news media, technical conferences, and even coffee shops. The ever-increasing amount of data collected in process monitoring, research, or simple human behavior becomes valuable only if you extract knowledge from it. Machine learning is the essential tool to mine data for knowledge. This book covers the what, why, and how of machine learning:

  • What are the objectives and the mathematical foundations of machine learning?
  • Why is Scala the ideal programming language to implement machine learning algorithms?
  • How can you apply machine learning to solve real-world problems?

Throughout this book, machine learning algorithms are described with diagrams, mathematical formulations, and documented snippets of Scala code, allowing you to understand these key concepts in your own unique way.

What this book covers

Chapter 1, Getting Started, introduces the basic concepts of statistical analysis, classification, regression, prediction, clustering, and optimization. This chapter covers the Scala language features and libraries, followed by the implementation of a simple application.

Chapter 2, Data Pipelines, describes a typical workflow for classification, the concept of bias/variance trade-off, and validation using the Scala dependency injection applied to the technical analysis of financial markets.

Chapter 3, Data Preprocessing, covers time series analyses and leverages Scala to implement data preprocessing and smoothing techniques such as moving averages, discrete Fourier transform, and the Kalman recursive filter.

Chapter 4, Unsupervised Learning, covers key clustering methods such as K-means clustering, Gaussian mixture Expectation-Maximization and function approximation.

Chapter 5, Dimension Reduction, describes the Kullback-Leibler divergence and principal component analysis for linear models, followed by an overview of manifolds applied to nonlinear models.

Chapter 6, Naive Bayes Classifiers, focuses on probabilistic graphical models and, more specifically, the implementation of Naive Bayes models and their application to text mining.

Chapter 7, Sequential Data Models, introduces the Markov processes followed by a full implementation of the hidden Markov model, and conditional random fields applied to pattern recognition in financial market data.

Chapter 8, Monte Carlo Inference, describes Gaussian sampling using the Box-Muller technique, bootstrap replication with replacement, and the ubiquitous Metropolis-Hastings algorithm for Markov Chain Monte Carlo.

Chapter 9, Regression and Regularization, covers a typical implementation of the linear and least squares regression, the ridge regression as a regularization technique, and finally, the logistic regression.

Chapter 10, Multilayer Perceptron, describes feed-forward neural networks followed by a full implementation of the multilayer perceptron classifier.

Chapter 11, Deep Learning, implements a sparse autoencoder and a restricted Boltzmann machine for dimension reduction in Scala, followed by an overview of convolutional neural networks.

Chapter 12, Kernel Models and Support Vector Machines, covers the concept of kernel functions with implementation of support vector machine classification and regression, followed by the application of the one-class SVM to anomaly detection.

Chapter 13, Evolutionary Computing, describes the basics of evolutionary computing and the implementation of the different components of a multipurpose genetic algorithm.

Chapter 14, Multiarmed Bandits, introduces the concept of the exploration-exploitation trade-off using the epsilon-greedy algorithm, the upper confidence bound technique, and context-free Thompson sampling.

Chapter 15, Reinforcement Learning, introduces the concept of reinforcement learning with an implementation of the Q-learning algorithm, followed by a template to build a learning classifier system.

Chapter 16, Parallelism in Scala and Akka, describes some of the artifacts and frameworks to create scalable applications and evaluate the relative performance of Scala parallel collections and Akka-based distributed computation.

Chapter 17, Apache Spark MLlib, covers the architecture and key concepts of Apache Spark, machine learning leveraging resilient distributed datasets, reusable ML pipelines, extending MLlib with distributed divergences and an example of Spark streaming library.

Appendix A, Basic Concepts, describes the Scala language constructs used throughout the book, elements of linear algebra and optimization techniques.

Appendix B, References, provides a chapter-wise list of references [source, entry] for each chapter.

What you need for this book

A decent command of the Scala programming language is a prerequisite. Reading through a mathematical formulation, conveniently defined in an information box, is optional. However, some basic knowledge of mathematics and statistics might be helpful to understand the inner workings of some algorithms.

The book uses the following libraries:

  • Scala 2.11.8 or higher
  • Java 1.8.0_25
  • SBT 0.13 or higher
  • JFreeChart 1.0.17
  • Apache Commons Math library 3.5 (Chapter 3, Data Preprocessing, Chapter 4, Unsupervised Learning, and Chapter 9, Regression and Regularization)
  • Indian Institute of Technology Bombay CRF 0.2 (Chapter 7, Sequential Data Models)
  • LIBSVM 0.1.6 (Chapter 12, Kernel Models and SVM)
  • Akka 2.3.8 or higher (or Typesafe Activator 1.2.10 or higher) (Chapter 16, Parallelism in Scala and Akka)
  • Apache Spark 2.1.0 or higher (Chapter 17, Apache Spark MLlib)

Tip

Understanding the mathematical formulation of a model is optional.

Who this book is for

This book is for software developers with a background in Scala programming who want to learn how to create, validate, and apply machine learning algorithms. The book is also beneficial to data scientists who want to explore functional programming or improve the scalability of their existing applications using Scala.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Scala-for-Machine-Learning-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/ScalaforMachineLearningSecondEdition_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Model categorization

A model can be predictive, descriptive, or adaptive.

Predictive models discover patterns in historical data and extract fundamental trends and relationships between factors (or features). They are used to predict and classify future events or observations. Predictive analytics is used in a variety of fields, including marketing, insurance, and pharmaceuticals. Predictive models are created through supervised learning using a pre-selected training set.

Descriptive models attempt to find unusual patterns or affinities in data by grouping observations into clusters with similar properties. These models define the first and important step in knowledge discovery. They are commonly generated through unsupervised learning.

A third category of models, known as adaptive modeling, is created through reinforcement learning. Reinforcement learning consists of one or several decision-making agents that recommend, and possibly execute, actions in an attempt to solve a problem, optimizing an objective function or resolving constraints.

Leveraging Java libraries

There are numerous robust, accurate, and efficient Java libraries for mathematics, linear algebra, or optimization that have been widely used for many years:

  • JBlas/Linpack: https://github.com/mikiobraun/jblas
  • Parallel Colt: https://github.com/rwl/ParallelColt
  • Apache Commons Math: http://commons.apache.org/proper/commons-math

There is absolutely no need to rewrite, debug, and test these components in Scala. Developers should consider creating a wrapper or interface to their favorite, reliable Java library. The book leverages the Apache Commons Math library for some specific linear algebra algorithms.
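As an illustration, here is a minimal sketch of such a wrapper, hiding the Apache Commons Math eigenvalue decomposition behind a small Scala trait. The trait and object names are purely illustrative and are not part of the book's code bundle:

import org.apache.commons.math3.linear.{Array2DRowRealMatrix, EigenDecomposition}

// Thin Scala facade over the Java API (illustrative names)
trait EigenSolver {
  def realEigenvalues(m: Array[Array[Double]]): Array[Double]
}

// Implementation that delegates to Apache Commons Math
object CommonsMathEigenSolver extends EigenSolver {
  override def realEigenvalues(m: Array[Array[Double]]): Array[Double] =
    new EigenDecomposition(new Array2DRowRealMatrix(m)).getRealEigenvalues
}

object EigenDemo extends App {
  // Eigenvalues of a small symmetric matrix
  val cov = Array(Array(2.0, 0.5), Array(0.5, 1.0))
  CommonsMathEigenSolver.realEigenvalues(cov).foreach(println)
}

Client code depends only on the EigenSolver trait, so the underlying Java library can be swapped out without touching the callers.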

Tools and frameworks

Before getting your hands dirty, you need to download and deploy the minimum set of tools and libraries; there is no need to reinvent the wheel, after all. A few key components have to be installed in order to compile and run the source code described throughout this book. We will focus on open source and commonly available libraries, although you are invited to experiment with the equivalent tools of your choice. The learning curve for the frameworks described here is minimal.

Java

The code described in the book has been tested with JDK 1.7.0_45 and JDK 1.8.0_25 on Windows x64 and MacOS X x64. You need to install the Java Development Kit if you have not already done so. Finally, the environment variables JAVA_HOME, PATH, and CLASSPATH have to be updated accordingly.

Scala

The code has been tested with Scala 2.11.4 and 2.11.8. We recommend using Scala version 2.11.4 or higher with SBT 0.13.1 or higher. Let's assume that the Scala runtime (REPL) and libraries have been properly installed and that the environment variables SCALA_HOME and PATH have been updated.

The Scala standard library can be downloaded as binaries or as part of the Typesafe Activator tool by visiting http://www.scala-lang.org/download/.

Eclipse Scala IDE

The description and installation instructions for the Eclipse Scala IDE version 4.0 and higher are available at http://scala-ide.org/docs/user/gettingstarted.html.

IntelliJ IDEA Scala plugin

You can also download the IntelliJ IDEA Scala plugin version 13 or higher from the JetBrains website at http://confluence.jetbrains.com/display/SCA/.

Simple build tool

The ubiquitous Simple Build Tool (SBT) will be our primary building engine. It can be downloaded as part of the Typesafe activator or directly from http://www.scala-sbt.org/download.html.

The syntax of the build file sbt/build.sbt conforms to version 0.13 and is used to compile and assemble the source code presented throughout this book. To build Scala for machine learning, do the following:

1. Set the maximum size for the JVM heap to 2058 Mbytes or higher and the permanent memory to 512 Mbytes or higher (that is, -Xmx4096m -Xms512m -XX:MaxPermSize=512m).
2. To build the Scala for machine learning library package: $(ROOT)/sbt clean publish-local
3. To build the package including test and resource files: $(ROOT)/sbt clean package
4. To generate Scala doc for the library: $(ROOT)/sbt doc
5. To generate Scala doc for the examples: $(ROOT)/sbt test:doc
6. To generate a report for compliance to the Scala style guide: $(ROOT)/sbt scalastyle
7. To compile all examples: $(ROOT)/sbt test:compile
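For reference, a minimal build.sbt sketch in the sbt 0.13 syntax is shown below. The project name, version, and dependency versions are illustrative and may differ from the build file shipped with the book's code bundle:

name := "scala-for-machine-learning"

version := "0.99"

scalaVersion := "2.11.8"

// Java libraries used in the early chapters
libraryDependencies ++= Seq(
  "org.apache.commons" % "commons-math3" % "3.6",
  "org.jfree" % "jfreechart" % "1.0.17"
)

// Heap and permanent generation sizing for forked JVMs
javaOptions ++= Seq("-Xmx4096m", "-Xms512m", "-XX:MaxPermSize=512m")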

Apache Commons Math

Apache Commons Math is a Java library for numerical processing, algebra, statistics, and optimization [1:6].

Description

This is a lightweight library that provides developers with a foundation of small, ready-to-use Java classes that can be easily woven into a machine learning problem. The examples used throughout the book require version 3.5 or higher.

The math library supports the following:

  • Functions, differentiation, integrals, and ordinary differential equations
  • Statistical distributions
  • Linear and nonlinear optimization
  • Dense and sparse vectors and matrices
  • Curve fitting, correlation, and regression

For more information, visit http://commons.apache.org/proper/commons-math.
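As a quick illustration of the statistics support, the following sketch fits an ordinary least squares line using the library's SimpleRegression class; the data points are synthetic and chosen only for the example:

import org.apache.commons.math3.stat.regression.SimpleRegression

object RegressionDemo extends App {
  // Fit y = intercept + slope*x on a handful of synthetic points
  val regression = new SimpleRegression
  regression.addData(Array(
    Array(1.0, 1.9), Array(2.0, 4.1), Array(3.0, 6.2), Array(4.0, 7.8)
  ))
  println(s"slope=${regression.getSlope} intercept=${regression.getIntercept}")
}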

Licensing

The library is distributed under the Apache License 2.0; the terms are available at https://www.apache.org/licenses/LICENSE-2.0.

Installation

The installation and deployment of the Apache Commons Math library are quite simple. The steps are as follows:

1. Go to the download page at http://commons.apache.org/proper/commons-math/download_math.cgi.
2. Download the latest .jar file from the binaries section, commons-math3-3.6-bin.zip (for version 3.6, for instance).
3. Unzip and install the .jar file.
4. Add commons-math3-3.6.jar to the CLASSPATH, as follows:
For macOS X:
export CLASSPATH=$CLASSPATH:/Commons_Math_path/commons-math3-3.6.jar
For Windows:
Go to System property | Advanced system settings | Advanced | Environment variables and then edit the entry CLASSPATH variable.
5. Add the commons-math3-3.6.jar file to your IDE environment if needed:
Eclipse Scala IDE: Project | Properties | Java Build Path | Libraries | Add External JARs
IntelliJ IDEA: File | Project Structure | Project Settings | Libraries | +
6. You can also download the source archive, commons-math3-3.6-src.zip, from the source section.

JFreeChart

JFreeChart is an open source charting and plotting Java library widely used in the Java programming community. It was originally created by David Gilbert [1:8].

Description

The library supports a variety of configurable plots and charts (scatter, dial, pie, area, bar, box and whisker, stacked, and 3D). We use JFreeChart to display the output of data processing and algorithms throughout the book, but you are encouraged to explore this great library on your own, as time permits.
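As a taste of the API, the following sketch plots a sine wave in a Swing frame; the chart title and axis labels are arbitrary:

import org.jfree.chart.{ChartFactory, ChartFrame}
import org.jfree.chart.plot.PlotOrientation
import org.jfree.data.xy.{XYSeries, XYSeriesCollection}

object PlotDemo extends App {
  // Build an XY dataset from a Scala range
  val series = new XYSeries("sin(x)")
  (0 until 100).foreach(i => series.add(i * 0.1, math.sin(i * 0.1)))
  val dataset = new XYSeriesCollection(series)

  // Create a line chart and display it in a Swing frame
  val chart = ChartFactory.createXYLineChart(
    "Sample plot", "x", "sin(x)", dataset,
    PlotOrientation.VERTICAL, true, false, false)
  val frame = new ChartFrame("JFreeChart demo", chart)
  frame.pack()
  frame.setVisible(true)
}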

Licensing

It is distributed under the terms of the GNU Lesser General Public License (LGPL), which permits its use in proprietary applications.

Installation

To install and deploy JFreeChart, perform the following steps:

1. Visit http://www.jfree.org/jfreechart/.
2. Download the latest version from Source Forge: https://sourceforge.net/projects/jfreechart/files/.
3. Unzip and deploy the .jar file.
4. Add jfreechart-1.0.17.jar (for version 1.0.17) to the CLASSPATH, as follows:
For macOS X:
export CLASSPATH=$CLASSPATH:/JFreeChart_path/jfreechart-1.0.17.jar
For Windows:
Go to System property | Advanced system settings | Advanced | Environment variables and then edit the entry CLASSPATH variable.
5. Add the jfreechart-1.0.17.jar file to your IDE environment:
Eclipse Scala IDE: Project | Properties | Java Build Path | Libraries | Add External JARs
IntelliJ IDEA: File | Project Structure | Project Settings | Libraries | +

Other libraries and frameworks

Libraries and tools that are specific to a single chapter are introduced along with the topic. Scalable frameworks are presented in the last chapter along with instructions for downloading them. Libraries related to the conditional random fields and support vector machines are described in their respective chapters.

Note

Why aren't we using Scala algebra and Scala numerical libraries?

Libraries such as Breeze, ScalaNLP, and Algebird are interesting Scala frameworks for linear algebra, numerical analysis, and machine learning. They provide even the most seasoned Scala programmer with a high-quality layer of abstraction. However, this book is designed as a tutorial that allows developers to write algorithms from the ground up using existing or legacy Java libraries [1:9].