
Scala for Machine Learning - Second Edition E-Book

Patrick R. Nicolas


Publisher: Packt Publishing
Category: Science and New Technologies
Language: English

Description

Leverage Scala and Machine Learning to study and construct systems that can learn from data

About This Book

Explore a broad variety of data processing, machine learning, and genetic algorithms through diagrams, mathematical formulation, and updated source code in Scala
Take your expertise in Scala programming to the next level by creating and customizing AI applications
Experiment with different techniques and evaluate their benefits and limitations using real-world applications in a tutorial style

Who This Book Is For

If you're a data scientist or a data analyst with a fundamental knowledge of Scala who wants to learn and implement various machine learning techniques, this book is for you. All you need is a good understanding of the Scala programming language, a basic knowledge of statistics, a keen interest in big data processing, and this book!

What You Will Learn

Build dynamic workflows for scientific computing
Leverage open source libraries to extract patterns from time series
Write your own classification, clustering, or evolutionary algorithm
Perform relative performance tuning and evaluation of Spark
Master probabilistic models for sequential data
Experiment with advanced techniques such as regularization and kernelization
Dive into neural networks and some deep learning architecture
Apply some basic multiarmed bandit algorithms
Solve big data problems with Scala parallel collections, Akka actors, and Apache Spark clusters
Apply key learning strategies to a technical analysis of financial markets

In Detail

The discovery of information through data clustering and classification is becoming a key differentiator for competitive organizations. Machine learning applications are everywhere, from self-driving cars, engineering design, logistics, manufacturing, and trading strategies, to detection of genetic anomalies.

This book is your one-stop guide to the functional capabilities of the Scala programming language, such as dependency injection and implicits, that are critical to the creation of machine learning algorithms. You start by learning data preprocessing and filtering techniques. Following this, you'll move on to unsupervised learning techniques such as clustering and dimension reduction, followed by probabilistic graphical models such as Naive Bayes, hidden Markov models, and Monte Carlo inference. Next, the book covers discriminative algorithms such as linear and logistic regression with regularization, kernelization, support vector machines, neural networks, and deep learning. You'll then move on to evolutionary computing, multiarmed bandit algorithms, and reinforcement learning.

Finally, the book includes a comprehensive overview of parallel computing in Scala and Akka, followed by a description of Apache Spark and its ML library. With updated code based on the latest version of Scala and comprehensive examples, this book will ensure that you have more than just a solid fundamental knowledge of machine learning with Scala.

Style and approach

This book is designed as a tutorial with hands-on exercises using technical analysis of financial markets and corporate data. The approach of each chapter is such that it allows you to understand key concepts easily.

Details

Pages: 867

Year of publication: 2017


Excerpt

Scala for Machine Learning Second Edition

Credits

About the Author

About the Reviewers

www.PacktPub.com

eBooks, discount offers, and more

Why subscribe?

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. Getting Started

Mathematical notations for the curious

Why machine learning?

Classification

Prediction

Optimization

Regression

Why Scala?

Scala as a functional language

Abstraction

Higher kinded types

Functors

Monads

Scala as an object oriented language

Scala as a scalable language

Model categorization

Taxonomy of machine learning algorithms

Unsupervised learning

Clustering

Dimension reduction

Supervised learning

Generative models

Discriminative models

Semi-supervised learning

Reinforcement learning

Leveraging Java libraries

Tools and frameworks

Java

Scala

Eclipse Scala IDE

IntelliJ IDEA Scala plugin

Simple build tool

Apache Commons Math

Description

Licensing

Installation

JFreeChart

Description

Licensing

Installation

Other libraries and frameworks

Source code

Convention

Context bounds

Presentation

Primitives and implicits

Immutability

Let's kick the tires

Writing a simple workflow

Step 1 – scoping the problem

Step 2 – loading data

Step 3 – preprocessing data

Immutable normalization

Step 4 – discovering patterns

Analyzing data

Plotting data

Visualizing model features

Visualizing label

Step 5 – implementing the classifier

Selecting an optimizer

Training the model

Classifying observations

Step 6 – evaluating the model

Summary

2. Data Pipelines

Modeling

What is a model?

Model versus design

Selecting features

Extracting features

Defining a methodology

Monadic data transformation

Error handling

Monads to the rescue

Implicit models

Explicit models

Workflow computational model

Supporting mathematical abstractions

Step 1 – variable declaration

Step 2 – model definition

Step 3 – instantiation

Composing mixins to build workflow

Understanding the problem

Defining modules

Instantiating the workflow

Modularizing

Profiling data

Immutable statistics

Z-score and Gauss

Assessing a model

Validation

Key quality metrics

F-score for binomial classification

F-score for multinomial classification

Area under the curves

Area under PRC

Area under ROC

Cross-validation

One-fold cross-validation

K-fold cross-validation

Bias-variance decomposition

Overfitting

Summary

3. Data Preprocessing

Time series in Scala

Context bounds

Types and operations

Transpose operator

Differential operator

Lazy views

Moving averages

Simple moving average

Weighted moving average

Exponential moving average

Fourier analysis

Discrete Fourier transform (DFT)

DFT-based filtering

Detection of market cycles

The discrete Kalman filter

The state space estimation

The transition equation

The measurement equation

The recursive algorithm

Prediction

Correction

Kalman smoothing

Fixed lag smoothing

Experimentation

Benefits and drawbacks

Alternative preprocessing techniques

Summary

4. Unsupervised Learning

K-means clustering

K-means

Measuring similarity

Defining the algorithm

Step 1 – Clusters configuration

Defining clusters

Initializing clusters

Step 2 – Clusters assignment

Step 3 – Reconstruction error minimization

Creating K-means components

Tail recursive implementation

Iterative implementation

Step 4 – Classification

Curse of dimensionality

Evaluation

The results

Tuning the number of clusters

Validation

Expectation-Maximization (EM)

Gaussian mixture model

EM overview

Implementation

Classification

Testing

Online EM

Summary

5. Dimension Reduction

Challenging model complexity

The divergences

The Kullback-Leibler divergence

Overview

Implementation

Testing

The mutual information

Principal components analysis (PCA)

Algorithm

Implementation

Test case

Evaluation

Extending PCA

Validation

Categorical features

Performance

Nonlinear models

Kernel PCA

Manifolds

Summary

6. Naïve Bayes Classifiers

Probabilistic graphical models

Naïve Bayes classifiers

Introducing the multinomial Naïve Bayes

Formalism

The frequentist perspective

The predictive model

The zero-frequency problem

Implementation

Design

Training

Class likelihood

Binomial model

Multinomial model

Classifier components

Classification

F1 Validation

Features extraction

Testing

Multivariate Bernoulli classification

Model

Implementation

Naïve Bayes and text mining

Basics information retrieval

Implementation

Analyzing documents

Extracting relative terms frequency

Generating the features

Testing

Retrieving textual information

Evaluating text mining classifier

Pros and cons

Summary

7. Sequential Data Models

Markov decision processes

The Markov property

The first-order discrete Markov chain

The hidden Markov model (HMM)

Notation

The lambda model

Design

Evaluation (CF-1)

Alpha (forward pass)

Beta (backward pass)

Training (CF-2)

Baum-Welch estimator (EM)

Decoding (CF-3)

The Viterbi algorithm

Putting it all together

Test case 1 – Training

Test case 2 – Evaluation

HMM as filtering technique

Conditional random fields

Introduction to CRF

Linear chain CRF

Regularized CRF and text analytics

The feature functions model

Design

Implementation

Configuring the CRF classifier

Training the CRF model

Applying the CRF model

Tests

The training convergence profile

Impact of the size of the training set

Impact of L2 regularization factor

Comparing CRF and HMM

Performance consideration

Summary

8. Monte Carlo Inference

The purpose of sampling

Gaussian sampling

Box-Muller transform

Monte Carlo approximation

Overview

Implementation

Bootstrapping with replacement

Overview

Resampling

Implementation

Pros and cons of bootstrap

Markov Chain Monte Carlo (MCMC)

Overview

Metropolis-Hastings (MH)

Implementation

Test

Summary

9. Regression and Regularization

Linear regression

Univariate linear regression

Implementation

Test case

Ordinary least squares (OLS) regression

Design

Implementation

Test case 1 – trending

Test case 2 – features selection

Regularization

Ln roughness penalty

Ridge regression

Design

Implementation

Test case

Numerical optimization

Logistic regression

Logistic function

Design

Training workflow

Step 1 – configuring the optimizer

Step 2 – computing the Jacobian matrix

Step 3 – managing the convergence of optimizer

Step 4 – defining the least squares problem

Step 5 – minimizing the sum of square errors

Test

Classification

Summary

10. Multilayer Perceptron

Feed-forward neural networks (FFNN)

The biological background

Mathematical background

The multilayer perceptron (MLP)

Activation function

Network topology

Design

Configuration

Network components

Network topology

Input and hidden layers

Output layer

Synapses

Connections

Weights initialization

Model

Problem types (modes)

Online versus batch training

Training epoch

Step 1 – input forward propagation

Computational flow

Error functions

Operating modes

Softmax

Step 2 – error backpropagation

Weights adjustment

Error propagation

The computational model

Step 3 – exit condition

Putting it all together

Training and classification

Regularization

Model generation

Fast Fisher-Yates shuffle

Prediction

Model fitness

Evaluation

Execution profile

Impact of learning rate

Impact of the momentum factor

Impact of the number of hidden layers

Test case

Implementation

Models evaluation

Impact of hidden layers' architecture

Benefits and limitations

Summary

11. Deep Learning

Sparse autoencoder

Undercomplete autoencoder

Deterministic autoencoder

Categorization

Feed-forward sparse, undercomplete autoencoder

Sparsity updating equations

Implementation

Restricted Boltzmann Machines (RBMs)

Boltzmann machine

Binary restricted Boltzmann machines

Conditional probabilities

Sampling

Log-likelihood gradient

Contrastive divergence

Configuration parameters

Unsupervised learning

Convolution neural networks

Local receptive fields

Weight sharing

Convolution layers

Sub-sampling layers

Putting it all together

Summary

12. Kernel Models and SVM

Kernel functions

Overview

Common discriminative kernels

Kernel monadic composition

The support vector machine (SVM)

The linear SVM

The separable case (hard margin)

The non-separable case (soft margin)

The nonlinear SVM

Max-margin classification

The kernel trick

Support vector classifier (SVC)

The binary SVC

LIBSVM

Design

Configuration parameters

The SVM formulation

The SVM kernel function

The SVM execution

Interface to LIBSVM

Training

Classification

C-penalty and margin

Kernel evaluation

Application to risk analysis

Anomaly detection with one-class SVC

Support vector regression (SVR)

Overview

SVR versus linear regression

Performance considerations

Summary

13. Evolutionary Computing

Evolution

The origin

NP problems

Evolutionary computing

Genetic algorithms and machine learning

Genetic algorithm components

Encodings

Value encoding

Predicate encoding

Solution encoding

The encoding scheme

Flat encoding

Hierarchical encoding

Genetic operators

Selection

Crossover

Mutation

Fitness score

Implementation

Software design

Key components

Population

Chromosomes

Genes

Selection

Controlling population growth

GA configuration

Crossover

Population

Chromosomes

Genes

Mutation

Population

Chromosomes

Genes

Reproduction

Solver

GA for trading strategies

Definition of trading strategies

Trading operators

The cost function

Market signals

Trading strategies

Signal encoding

Test case – Fall 2008 market crash

Creating trading strategies

Configuring the optimizer

Finding the best trading strategy

Tests

The weighted score

The unweighted score

Advantages and risks of genetic algorithms

Summary

14. Multiarmed Bandits

K-armed bandit

Exploration-exploitation trade-offs

Expected cumulative regret

Bayesian Bernoulli bandits

Epsilon-greedy algorithm

Thompson sampling

Bandit context

Prior/posterior beta distribution

Implementation

Simulated exploration and exploitation

Upper confidence bound

Confidence interval

Implementation

Summary

15. Reinforcement Learning

Reinforcement learning

Understanding the challenge

A solution – Q-learning

Terminology

Concept

Value of policy

Bellman optimality equations

Temporal difference for model-free learning

Action-value iterative update

Implementation

Software design

The states and actions

The search space

The policy and action-value

The Q-learning components

The Q-learning training

Tail recursion to the rescue

Validation

The prediction

Option trading using Q-learning

Option property

Option model

Quantization

Putting it all together

Evaluation

Pros and cons of reinforcement learning

Learning classifier systems

Introduction to LCS

Combining learning and evolution

Terminology

Extended learning classifier systems

XCS components

Application to portfolio management

XCS core data

XCS rules

Covering

Example of implementation

Benefits and limitations of learning classifier systems

Summary

16. Parallelism in Scala and Akka

Overview

Scala

Object creation

Streams

Memory on demand

Design for reusing Streams memory

Parallel collections

Processing a parallel collection

Benchmark framework

Performance evaluation

Scalability with Actors

The Actor model

Partitioning

Beyond Actors – reactive programming

Akka

Master-workers

Messages exchange

Worker Actors

The workflow controller

The master Actor

Master with routing

Distributed discrete Fourier transform

Limitations

Futures

Blocking on futures

Future callbacks

Putting it all together

Summary

17. Apache Spark MLlib

Overview

Apache Spark core

Why Spark?

Design principles

In-memory persistency

Laziness

Transforms and actions

Shared variables

Experimenting with Spark

Deploying Spark

Using Spark shell

MLlib library

Overview

Creating RDDs

K-means using MLlib

Tests

Reusable ML pipelines

Reusable ML transforms

Encoding features

Training the model

Predictive model

Training summary statistics

Validating the model

Grid search

Apache Spark and ScalaTest

Extending Spark

Kullback-Leibler divergence

Implementation

Kullback-Leibler evaluator

Streaming engine

Why streaming?

Batch and real-time processing

Architecture overview

Discretized streams

Use case – continuous parsing

Checkpointing

Performance evaluation

Tuning parameters

Performance considerations

Pros and cons

Summary

A. Basic Concepts

Scala programming

List of libraries and tools

Code snippets format

Best practices

Encapsulation

Class constructor template

Companion objects versus case classes

Enumerations versus case classes

Overloading

Design template for immutable classifiers

Utility classes

Data extraction

Financial data sources

Documents extraction

DMatrix class

Counter

Monitor

Mathematics

Linear algebra

QR decomposition

LU factorization

LDL decomposition

Cholesky factorization

Singular Value Decomposition (SVD)

Eigenvalue decomposition

Algebraic and numerical libraries

First order predicate logic

Jacobian and Hessian matrices

Summary of optimization techniques

Gradient descent methods

Steepest descent

Conjugate gradient

Stochastic gradient descent

Quasi-Newton algorithms

BFGS

L-BFGS

Nonlinear least squares minimization

Gauss-Newton

Levenberg-Marquardt

Lagrange multipliers

Overview dynamic programming

Finances 101

Fundamental analysis

Technical analysis

Terminology

Trading data

Trading signal and strategy

Price patterns

Options trading

Financial data sources

Suggested online courses

References

B. References

Chapter 1

Chapter 2

Chapter 3

Chapter 4

Chapter 5

Chapter 6

Chapter 7

Chapter 8

Chapter 9

Chapter 10

Chapter 11

Chapter 12

Chapter 13

Chapter 14

Chapter 15

Chapter 16

Chapter 17

Index

Scala for Machine Learning Second Edition

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: December 2015

Second edition: September 2017

Production reference: 1190917

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78712-238-3

www.packtpub.com

Credits

Author

Patrick R. Nicolas

Reviewers

Sumit Pal

Dave Wentzel

Commissioning Editor

Amey Varangaonkar

Acquisition Editor

Tushar Gupta

Content Development Editor

Amrita Noronha

Technical Editor

Nilesh Sawakhande

Copy Editors

Safis Editing

Laxmi Subramanian

Project Coordinator

Shweta H Birwatkar

Proofreader

Safis Editing

Indexer

Mariammal Chettiyar

Graphics

Tania Dutta

Production Coordinator

Shantanu Zagade

Cover Work

Deepika Naik

About the Author

Patrick R. Nicolas is the director of engineering at Agile SDE, California. He has more than 25 years of experience in software engineering and building applications in C++, Java, and more recently in Scala/Spark, and has held several managerial positions. His interests include real-time analytics, modeling, and the development of nonlinear models.

About the Reviewers

Sumit Pal has more than 24 years of experience in the software industry, spanning companies from start-ups to enterprises.

He is a big data architect and a visualization and data science consultant who builds end-to-end data-driven analytic systems.

Sumit has worked for Microsoft (SQLServer), Oracle (OLAP), and Verizon (big data analytics).

Currently, he works for multiple clients, building their data architectures and big data solutions and works with Spark, Scala, Java, and Python.

He has extensive experience in building scalable systems, from the middle tier and data tier to visualization for analytics applications, using big data and NoSQL databases.

Sumit has expertise in database internals, data warehouses, and dimensional modeling. As an associate director for big data at Verizon, Sumit strategized, managed, architected, and developed analytic platforms for machine learning applications. Sumit was the chief architect at ModelN/LeapfrogRX (2006-2013), where he architected the core analytics platform.

He is the author of SQL On Big Data - Technology, Architecture and Roadmap published by Apress in October 2016.

He has spoken on the topic covered in this book at the following conferences:

May 2016, Big Data Conference (Linux Foundation) in Vancouver, Canada
March 2016, World Data Center Conference in Las Vegas, USA
November 2015, BigData TechCon in Chicago, USA
August 2015, Global Big Data Conference in Boston, USA

He is also the author of SQL On Big Data by Apress in December 2016.

Dave Wentzel is the Chief Technology Officer (CTO) of Capax Global, a premier Microsoft consulting partner. Dave is responsible for setting the strategy and defining service offerings and capabilities for the data platform and Azure practice at Capax. Dave also works directly with clients to help them with their big data journey. Dave is a frequent blogger and speaker on big data and data science topics.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review.

If you’d like to join our team of regular reviewers, you can e-mail us at <[email protected]>. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Preface

Not a single day passes that we do not hear about big data in the news media, technical conferences, and even coffee shops. The ever-increasing amount of data collected in process monitoring, research, or simple human behavior becomes valuable only if you extract knowledge from it. Machine learning is the essential tool to mine data for knowledge. This book covers the what, why, and how of machine learning:

What are the objectives and the mathematical foundations of machine learning?
Why is Scala the ideal programming language to implement machine learning algorithms?
How can you apply machine learning to solve real-world problems?

Throughout this book, machine learning algorithms are described with diagrams, mathematical formulations, and documented snippets of Scala code, allowing you to understand these key concepts in your own unique way.

What this book covers

Chapter 1, Getting Started, introduces the basic concepts of statistical analysis, classification, regression, prediction, clustering, and optimization. This chapter covers the Scala language features and libraries, followed by the implementation of a simple application.

Chapter 2, Data Pipelines, describes a typical workflow for classification, the concept of bias/variance trade-off, and validation using the Scala dependency injection applied to the technical analysis of financial markets.

Chapter 3, Data Preprocessing, covers time series analyses and leverages Scala to implement data preprocessing and smoothing techniques such as moving averages, discrete Fourier transform, and the Kalman recursive filter.

Chapter 4, Unsupervised Learning, covers key clustering methods such as K-means clustering, Gaussian mixture Expectation-Maximization and function approximation.

Chapter 5, Dimension Reduction, describes the Kullback-Leibler divergence and principal component analysis for linear models, followed by an overview of manifolds applied to nonlinear models.

Chapter 6, Naive Bayes Classifiers, focuses on probabilistic graphical models and, more specifically, the implementation of Naive Bayes models and their application to text mining.

Chapter 7, Sequential Data Models, introduces the Markov processes followed by a full implementation of the hidden Markov model, and conditional random fields applied to pattern recognition in financial market data.

Chapter 8, Monte Carlo Inference, describes Gaussian sampling using Box-Muller technique, Bootstrap replication with replacement, and the ubiquitous Metropolis-Hastings algorithm for Markov Chain Monte Carlo.

Chapter 9, Regression and Regularization, covers a typical implementation of the linear and least squares regression, the ridge regression as a regularization technique, and finally, the logistic regression.

Chapter 10, Multilayer Perceptron, describes feed-forward neural networks followed by a full implementation of the multilayer perceptron classifier.

Chapter 11, Deep Learning, implements a sparse autoencoder and restricted Boltzmann machines for dimension reduction in Scala, followed by an overview of the convolutional neural network.

Chapter 12, Kernel Models and Support Vector Machines, covers the concept of kernel functions with implementation of support vector machine classification and regression, followed by the application of the one-class SVM to anomaly detection.

Chapter 13, Evolutionary Computing, describes the basics of evolutionary computing and the implementation of the different components of a multipurpose genetic algorithm.

Chapter 14, Multiarmed Bandits, introduces the concept of the exploration-exploitation trade-off using the epsilon-greedy algorithm, the upper confidence bound technique, and context-free Thompson sampling.

Chapter 15, Reinforcement Learning, introduces the concept of reinforcement learning with an implementation of the Q-learning algorithm, followed by a template to build a learning classifier system.

Chapter 16, Parallelism in Scala and Akka, describes some of the artifacts and frameworks to create scalable applications and evaluate the relative performance of Scala parallel collections and Akka-based distributed computation.

Chapter 17, Apache Spark MLlib, covers the architecture and key concepts of Apache Spark, machine learning leveraging resilient distributed datasets, reusable ML pipelines, extending MLlib with distributed divergences and an example of Spark streaming library.

Appendix A, Basic Concepts, describes the Scala language constructs used throughout the book, elements of linear algebra and optimization techniques.

Appendix B, References, provides a list of references [source, entry] for each chapter.

What you need for this book

A decent command of the Scala programming language is a prerequisite. Reading through a mathematical formulation, conveniently defined in an information box, is optional. However, some basic knowledge of mathematics and statistics might be helpful to understand the inner workings of some algorithms.

The book uses the following libraries:

Scala 2.11.8 or higher
Java 1.8.0_25
SBT 0.13 or higher
JFreeChart 1.0.17
Apache Commons Math library 3.5 (Chapter 3, Data Preprocessing, Chapter 4, Unsupervised Learning, and Chapter 9, Regression and Regularization)
Indian Institute of Technology Bombay CRF 0.2 (Chapter 7, Sequential Data Models)
LIBSVM 0.1.6 (Chapter 12, Kernel Models and Support Vector Machines)
Akka 2.3.8 or higher (or Typesafe activator 1.2.10 or higher) (Chapter 16, Parallelism in Scala and Akka)
Apache Spark 2.1.0 or higher (Chapter 17, Apache Spark MLlib)

Tip

Understanding the mathematical formulation of a model is optional.

Who this book is for

This book is for software developers with a background in Scala programming who want to learn how to create, validate, and apply machine learning algorithms. The book is also beneficial to data scientists who want to explore functional programming or improve the scalability of their existing applications using Scala.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Scala-for-Machine-Learning-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/ScalaforMachineLearningSecondEdition_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Model categorization

A model can be predictive, descriptive, or adaptive.

Predictive models discover patterns in historical data and extract fundamental trends and relationships between factors (or features). They are used to predict and classify future events or observations. Predictive analytics is used in a variety of fields, including marketing, insurance, and pharmaceuticals. Predictive models are created through supervised learning using a pre-selected training set.

Descriptive models attempt to find unusual patterns or affinities in data by grouping observations into clusters with similar properties. These models are an important first step in knowledge discovery. They are commonly generated through unsupervised learning.

A third category of models, known as adaptive modeling, is created through reinforcement learning. Reinforcement learning consists of one or several decision-making agents that recommend, and possibly execute, actions in an attempt to solve a problem, optimizing an objective function or resolving constraints.

Leveraging Java libraries

There are numerous robust, accurate, and efficient Java libraries for mathematics, linear algebra, or optimization that have been widely used for many years:

JBlas/Linpack: https://github.com/mikiobraun/jblas
Parallel Colt: https://github.com/rwl/ParallelColt
Apache Commons Math: http://commons.apache.org/proper/commons-math

There is no need to rewrite, debug, and test these components in Scala. Developers should instead consider creating a wrapper or interface to their preferred, reliable Java library. The book leverages the Apache Commons Math library for some specific linear algebra algorithms.
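
As a minimal sketch of such a wrapper, the following code hides a Java API behind a small Scala interface. To stay free of external dependencies, the JDK class java.util.Arrays stands in for a third-party library. The name SortedVector and its methods are hypothetical and are not part of the book's source code.

```scala
import java.util.{Arrays => JArrays}

// Hypothetical wrapper: exposes a Java utility class through an immutable,
// Scala-friendly interface. java.util.Arrays stands in for any Java library.
final class SortedVector private (private val data: Array[Double]) {

  // Delegates to the Java binary search and converts the Java convention
  // (negative result when the key is absent) into an Option.
  def indexOf(x: Double): Option[Int] = {
    val idx = JArrays.binarySearch(data, x)
    if (idx >= 0) Some(idx) else None
  }

  // Read-only view of the sorted values as an immutable collection.
  def toVector: Vector[Double] = data.toVector
}

object SortedVector {
  // Clones the input so the Java in-place sort never mutates the caller's data.
  def apply(values: Seq[Double]): SortedVector = {
    val copy = values.toArray.clone()
    JArrays.sort(copy)
    new SortedVector(copy)
  }
}
```

With this design, SortedVector(Seq(3.0, 1.0, 2.0)).indexOf(2.0) evaluates to Some(1), and callers never see the mutable array or the negative return codes of the underlying Java method.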

Tools and frameworks

Before getting your hands dirty, you need to download and deploy the minimum set of tools and libraries; there is no need to reinvent the wheel, after all. A few key components have to be installed in order to compile and run the source code described throughout this book. We will focus on open source and commonly available libraries, although you are invited to experiment with the equivalent tools of your choice. The learning curve for the frameworks described here is minimal.

Java

The code described in the book has been tested with JDK 1.7.0_45 and JDK 1.8.0_25 on Windows x64 and MacOS X x64. You need to install the Java Development Kit if you have not already done so. Finally, the environment variables JAVA_HOME, PATH, and CLASSPATH have to be updated accordingly.

Scala

The code has been tested with Scala 2.11.4 and 2.11.8. We recommend using Scala version 2.11.4 or higher with SBT 0.13.1 or higher. Let's assume that the Scala runtime (REPL) and libraries have been properly installed and that the environment variables SCALA_HOME and PATH have been updated.

The Scala standard library can be downloaded as binaries or as part of the Typesafe Activator tool by visiting http://www.scala-lang.org/download/.
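
One way to confirm the Java and Scala setup is a short program that prints the runtime versions and reports which of the environment variables named above are undefined. This is only a sketch: the object EnvCheck and its methods are hypothetical helpers, not part of the book's code.

```scala
import scala.util.Properties

// Hypothetical helper for checking the local setup; not part of the book's code.
object EnvCheck {
  // Environment variables mentioned in the Java and Scala sections.
  val required: Seq[String] = Seq("JAVA_HOME", "SCALA_HOME", "PATH")

  // Returns the names of the variables that are not defined in the given environment.
  def missing(env: Map[String, String] = sys.env): Seq[String] =
    required.filterNot(env.contains)

  def main(args: Array[String]): Unit = {
    println(s"Java version:  ${Properties.javaVersion}")
    println(s"Scala version: ${Properties.versionNumberString}")
    val absent = missing()
    if (absent.isEmpty) println("All expected environment variables are set")
    else println(s"Not set: ${absent.mkString(", ")}")
  }
}
```

Passing the environment as a parameter keeps the check easy to exercise with a hand-built Map, while the default argument reads the real process environment.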

Eclipse Scala IDE

The description and installation instructions for the Eclipse Scala IDE version 4.0 and higher are available at http://scala-ide.org/docs/user/gettingstarted.html.

IntelliJ IDEA Scala plugin

You can also download the IntelliJ IDEA Scala plugin version 13 or higher from the JetBrains website at http://confluence.jetbrains.com/display/SCA/.

Simple build tool

The ubiquitous Simple Build Tool (SBT) will be our primary build engine. It can be downloaded as part of the Typesafe Activator or directly from http://www.scala-sbt.org/download.html.

The syntax of the build file sbt/build.sbt conforms to version 0.13 and is used to compile and assemble the source code presented throughout this book. To build Scala for machine learning, do the following:

Set the maximum size for the JVM heap to 2058 Mbytes or higher and the permanent memory to 512 Mbytes or higher (for example, -Xmx4096m -Xms512m -XX:MaxPermSize=512m)
To build the Scala for machine learning library package: $(ROOT)/sbt clean publish-local
To build the package including test and resource files: $(ROOT)/sbt clean package
To generate Scala doc for the library: $(ROOT)/sbt doc
To generate Scala doc for the examples: $(ROOT)/sbt test:doc
To generate a report for compliance with the Scala style guide: $(ROOT)/sbt scalastyle
To compile all examples: $(ROOT)/sbt test:compile
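
For readers setting up their own sandbox project, a build definition along the following lines could declare the two Java libraries used in the book as managed dependencies, so that sbt downloads them and puts them on the classpath. This fragment is an illustration in sbt 0.13 syntax and is not the book's sbt/build.sbt; the project name is made up and the Maven coordinates are assumptions that should be checked against your repository.

```scala
// Hypothetical build.sbt fragment in sbt 0.13 syntax; not the book's actual build file.
name := "scala-ml-sandbox"            // illustrative project name

scalaVersion := "2.11.8"              // one of the versions the code was tested with

// Maven coordinates are assumptions; verify them against your repository.
libraryDependencies ++= Seq(
  "org.apache.commons" % "commons-math3" % "3.6",
  "org.jfree"          % "jfreechart"    % "1.0.17"
)

// Run in a forked JVM so that the options below apply to the application.
fork := true

javaOptions ++= Seq("-Xmx4096m", "-Xms512m")
```

The -XX:MaxPermSize flag is left out of this sketch because Java 8 no longer has a permanent generation; add it back if you build with JDK 1.7.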

Apache Commons Math

Apache Commons Math is a Java library for numerical processing, algebra, statistics, and optimization [1:6].

Description

This is a lightweight library that provides developers with a foundation of small, ready-to-use Java classes that can be easily woven into a machine learning problem. The examples used throughout the book require version 3.5 or higher.

The math library supports the following:

Functions, differentiation, integration, and ordinary differential equations
Statistical distributions
Linear and non-linear optimization
Dense and sparse vectors and matrices
Curve fitting, correlation, and regression

For more information, visit http://commons.apache.org/proper/commons-math.
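
To give a feel for how these Java classes are called from Scala, here is a small sketch that computes summary statistics and a least-squares line fit. It assumes commons-math3 is on the classpath; the object CommonsMathDemo and its method names are hypothetical and do not come from the book.

```scala
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics
import org.apache.commons.math3.stat.regression.SimpleRegression

// Hypothetical demo object; assumes commons-math3 is on the classpath.
object CommonsMathDemo {

  // Returns (mean, sample standard deviation) of the observations.
  def summary(xs: Seq[Double]): (Double, Double) = {
    val stats = new DescriptiveStatistics()
    xs.foreach(stats.addValue)
    (stats.getMean, stats.getStandardDeviation)
  }

  // Least-squares fit of y = intercept + slope * x; returns (intercept, slope).
  def fitLine(points: Seq[(Double, Double)]): (Double, Double) = {
    val regression = new SimpleRegression()
    points.foreach { case (x, y) => regression.addData(x, y) }
    (regression.getIntercept, regression.getSlope)
  }
}
```

Returning plain Scala tuples keeps the mutable Java accumulator objects local to each method, which fits the wrapper approach recommended earlier.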

Licensing

The library is distributed under the Apache License 2.0; the terms are available at https://www.apache.org/licenses/LICENSE-2.0.

Installation

The installation and deployment of the Apache Commons Math library are quite simple. The steps are as follows:

1. Go to the download page at http://commons.apache.org/proper/commons-math/download_math.cgi.
2. Download the latest .jar files in the binary section, commons-math3-3.6-bin.zip (for version 3.6, for instance).
3. Unzip and install the .jar file.
4. Add commons-math3-3.6.jar to the CLASSPATH, as follows:

For macOS X:

export CLASSPATH=$CLASSPATH:/Commons_Math_path/commons-math3-3.6.jar

For Windows:

Go to System property | Advanced system settings | Advanced | Environment variables and then edit the entry CLASSPATH variable.

Add the commons-math3-3.6.jar file to your IDE environment. If needed, download the source commons-math3-3.6-src.zip from the source section.

JFreeChart

JFreeChart is an open source chart and plotting Java library widely used in the Java programmer community. It was originally created by David Gilbert [1:8].

Description

The library supports a variety of configurable plots and charts (scatter, dial, pie, area, bar, box and whisker, stacked, and 3D). We use JFreeChart to display the output of data processing and algorithms throughout the book, but you are encouraged to explore this library on your own, as time permits.
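
The following sketch shows one way to build a simple line chart from Scala and save it as a PNG file. It assumes a JFreeChart 1.0.x jar is on the classpath (the ChartUtilities class was renamed ChartUtils in the 1.5 release). The object LinePlotDemo, its methods, and the axis labels are illustrative choices, not the book's plotting code.

```scala
import java.io.File
import org.jfree.chart.{ChartFactory, ChartUtilities, JFreeChart}
import org.jfree.chart.plot.PlotOrientation
import org.jfree.data.xy.{XYSeries, XYSeriesCollection}

// Hypothetical demo object; assumes JFreeChart 1.0.x is on the classpath.
object LinePlotDemo {

  // Plots each value against its position in the sequence.
  def lineChart(title: String, ys: Seq[Double]): JFreeChart = {
    val series = new XYSeries(title)
    ys.zipWithIndex.foreach { case (y, i) => series.add(i.toDouble, y) }
    ChartFactory.createXYLineChart(
      title, "index", "value", new XYSeriesCollection(series),
      PlotOrientation.VERTICAL,
      true,   // show legend
      false,  // no tooltips
      false   // no URLs
    )
  }

  // Writes the chart to a 640 x 480 PNG file at the given path.
  def save(chart: JFreeChart, path: String): Unit =
    ChartUtilities.saveChartAsPNG(new File(path), chart, 640, 480)
}
```

A call such as LinePlotDemo.save(LinePlotDemo.lineChart("demo", Seq(1.0, 4.0, 9.0)), "demo.png") would write the chart to the working directory.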

Licensing

It is distributed under the terms of the GNU Lesser General Public License (LGPL), which permits its use in proprietary applications.

Installation

To install and deploy JFreeChart, perform the following steps:

1. Visit http://www.jfree.org/jfreechart/.
2. Download the latest version from SourceForge: https://sourceforge.net/projects/jfreechart/files/.
3. Unzip and deploy the .jar file.
4. Add jfreechart-1.0.17.jar (for version 1.0.17) to the CLASSPATH, as follows:

For macOS X:

export CLASSPATH=$CLASSPATH:/JFreeChart_path/jfreechart-1.0.17.jar

For Windows:

Go to System property | Advanced system settings | Advanced | Environment variables and then edit the entry CLASSPATH variable.

Add the jfreechart-1.0.17.jar file to your IDE environment.

Other libraries and frameworks

Libraries and tools that are specific to a single chapter are introduced along with the topic. Scalable frameworks are presented in the last chapter along with instructions for downloading them. Libraries related to the conditional random fields and support vector machines are described in their respective chapters.

Note

Why aren't we using Scala algebra and Scala numerical libraries?

Libraries such as Breeze, ScalaNLP, and Algebird are interesting Scala frameworks for linear algebra, numerical analysis, and machine learning. They provide even the most seasoned Scala programmer with a high-quality layer of abstraction. However, this book is designed as a tutorial that allows developers to write algorithms from the ground up using existing or legacy Java libraries [1:9].
