Leverage Scala and Machine Learning to study and construct systems that can learn from data
If you're a data scientist or a data analyst with a fundamental knowledge of Scala who wants to learn and implement various Machine learning techniques, this book is for you. All you need is a good understanding of the Scala programming language, a basic knowledge of statistics, a keen interest in Big Data processing, and this book!
The discovery of information through data clustering and classification is becoming a key differentiator for competitive organizations. Machine learning applications are everywhere, from self-driving cars, engineering design, logistics, manufacturing, and trading strategies, to detection of genetic anomalies.
The book is your one-stop guide to the functional capabilities of the Scala programming language, such as dependency injection and implicits, that are critical to the creation of machine learning algorithms. You start by learning data preprocessing and filtering techniques. Following this, you'll move on to unsupervised learning techniques such as clustering and dimension reduction, followed by probabilistic graphical models such as Naive Bayes, hidden Markov models, and Monte Carlo inference. Further, it covers discriminative algorithms such as linear and logistic regression with regularization, kernelization, support vector machines, neural networks, and deep learning. You'll then move on to evolutionary computing, multiarmed bandit algorithms, and reinforcement learning.
Finally, the book includes a comprehensive overview of parallel computing in Scala and Akka followed by a description of Apache Spark and its ML library. With updated code based on the latest version of Scala and comprehensive examples, this book will ensure that you have more than just a solid fundamental knowledge in machine learning with Scala.
This book is designed as a tutorial with hands-on exercises using technical analysis of financial markets and corporate data. The approach of each chapter is such that it allows you to understand key concepts easily.
Number of pages: 867
Year of publication: 2017
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2015
Second edition: September 2017
Production reference: 1190917
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78712-238-3
www.packtpub.com
Author
Patrick R. Nicolas
Reviewers
Sumit Pal
Dave Wentzel
Commissioning Editor
Amey Varangaonkar
Acquisition Editor
Tushar Gupta
Content Development Editor
Amrita Noronha
Technical Editor
Nilesh Sawakhande
Copy Editors
Safis Editing
Laxmi Subramanian
Project Coordinator
Shweta H Birwatkar
Proofreader
Safis Editing
Indexer
Mariammal Chettiyar
Graphics
Tania Dutta
Production Coordinator
Shantanu Zagade
Cover Work
Deepika Naik
Patrick R. Nicolas is the director of engineering at Agile SDE, California. He has more than 25 years of experience in software engineering and building applications in C++, Java, and more recently in Scala/Spark, and has held several managerial positions. His interests include real-time analytics, modeling, and the development of nonlinear models.
Sumit Pal has more than 24 years of experience in the software industry, spanning companies from start-ups to enterprises.
He is a big data architect, visualization, and data science consultant, and builds end-to-end data-driven analytic systems.
Sumit has worked for Microsoft (SQLServer), Oracle (OLAP), and Verizon (big data analytics).
Currently, he works for multiple clients, building their data architectures and big data solutions and works with Spark, Scala, Java, and Python.
He has extensive experience in building scalable systems in middle tier, data tier to visualization for analytics applications, using big data and NoSQL databases.
Sumit has expertise in database internals, data warehouses, and dimensional modeling. As an associate director for big data at Verizon, Sumit strategized, managed, architected, and developed analytic platforms for machine learning applications. Sumit was the chief architect at ModelN/LeapfrogRX (2006-2013), where he architected the core analytics platform.
He is the author of SQL On Big Data - Technology, Architecture and Roadmap published by Apress in October 2016.
He has spoken on the topic covered in this book at the following conferences:
He is also the author of SQL On Big Data by Apress in December 2016.
Dave Wentzel is the Chief Technology Officer (CTO) of Capax Global, a premier Microsoft consulting partner. Dave is responsible for setting the strategy and defining service offerings and capabilities for the data platform and Azure practice at Capax. Dave also works directly with clients to help them with their big data journey. Dave is a frequent blogger and speaker on big data and data science topics.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review.
If you’d like to join our team of regular reviewers, you can e-mail us at <[email protected]>. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Not a single day passes that we do not hear about big data in the news media, technical conferences, and even coffee shops. The ever-increasing amount of data collected in process monitoring, research, or simple human behavior becomes valuable only if you extract knowledge from it. Machine learning is the essential tool to mine data for knowledge. This book covers the what, why, and how of machine learning:
Throughout this book, machine learning algorithms are described with diagrams, mathematical formulations, and documented snippets of Scala code, allowing you to understand these key concepts in your own unique way.
Chapter 1, Getting Started, introduces the basic concepts of statistical analysis, classification, regression, prediction, clustering, and optimization. This chapter covers the Scala language, its features, and its libraries, followed by the implementation of a simple application.
Chapter 2, Data Pipelines, describes a typical workflow for classification, the concept of bias/variance trade-off, and validation using the Scala dependency injection applied to the technical analysis of financial markets.
Chapter 3, Data Preprocessing, covers time series analyses and leverages Scala to implement data preprocessing and smoothing techniques such as moving averages, discrete Fourier transform, and the Kalman recursive filter.
Chapter 4, Unsupervised Learning, covers key clustering methods such as K-means clustering, Gaussian mixture Expectation-Maximization and function approximation.
Chapter 5, Dimension Reduction, describes the Kullback-Leibler divergence, the principal component analysis for linear models followed by an overview of manifold applied to non-linear models.
Chapter 6, Naive Bayes Classifiers, focuses on the probabilistic graphical models and more specifically the implementation of Naive Bayes models and its application to text mining.
Chapter 7, Sequential Data Models, introduces the Markov processes followed by a full implementation of the hidden Markov model, and conditional random fields applied to pattern recognition in financial market data.
Chapter 8, Monte Carlo Inference, describes Gaussian sampling using the Box-Muller technique, bootstrap replication with replacement, and the ubiquitous Metropolis-Hastings algorithm for Markov Chain Monte Carlo.
Chapter 9, Regression and Regularization, covers a typical implementation of the linear and least squares regression, the ridge regression as a regularization technique, and finally, the logistic regression.
Chapter 10, Multilayer Perceptron, describes feed-forward neural networks followed by a full implementation of the multilayer perceptron classifier.
Chapter 11, Deep Learning, implements a sparse autoencoder and a restricted Boltzmann machine for dimension reduction in Scala, followed by an overview of the convolutional neural network.
Chapter 12, Kernel Models and Support Vector Machines, covers the concept of kernel functions with implementation of support vector machine classification and regression, followed by the application of the one-class SVM to anomaly detection.
Chapter 13, Evolutionary Computing, describes the basics of evolutionary computing and the implementation of the different components of a multipurpose genetic algorithm.
Chapter 14, Multiarmed Bandits, introduces the concept of the exploration-exploitation trade-off using the epsilon-greedy algorithm, the upper confidence bound technique, and context-free Thompson sampling.
Chapter 15, Reinforcement Learning, introduces the concept of reinforcement learning with an implementation of the Q-learning algorithm followed by a template to build a learning classifier system.
Chapter 16, Parallelism in Scala and Akka, describes some of the artifacts and frameworks to create scalable applications and evaluate the relative performance of Scala parallel collections and Akka-based distributed computation.
Chapter 17, Apache Spark MLlib, covers the architecture and key concepts of Apache Spark, machine learning leveraging resilient distributed datasets, reusable ML pipelines, extending MLlib with distributed divergences and an example of Spark streaming library.
Appendix A, Basic Concepts, describes the Scala language constructs used throughout the book, elements of linear algebra and optimization techniques.
Appendix B, References, provides a list of references [source, entry] for each chapter.
A decent command of the Scala programming language is a prerequisite. Reading through a mathematical formulation, conveniently defined in an information box, is optional. However, some basic knowledge of mathematics and statistics might be helpful to understand the inner workings of some algorithms.
The book uses the following libraries:
Understanding the mathematical formulation of a model is optional.
This book is for software developers with a background in Scala programming who want to learn how to create, validate, and apply machine learning algorithms. The book is also beneficial to data scientists who want to explore functional programming or improve the scalability of their existing applications using Scala.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Scala-for-Machine-Learning-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/ScalaforMachineLearningSecondEdition_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
A model can be predictive, descriptive, or adaptive.
Predictive models discover patterns in historical data and extract fundamental trends and relationships between factors (or features). They are used to predict and classify future events or observations. Predictive analytics is used in a variety of fields, including marketing, insurance, and pharmaceuticals. Predictive models are created through supervised learning using a pre-selected training set.
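As a minimal sketch of what supervised training on a pre-selected training set can look like, the following Scala snippet fits a straight line to labeled observations using ordinary least squares and then predicts the value of an unseen observation. The names LinearModel and LinearRegression are hypothetical and chosen for illustration only; the snippet is not taken from the book's source code.

```scala
// Hypothetical names (LinearModel, LinearRegression) used for illustration only.
final case class LinearModel(slope: Double, intercept: Double) {
  // Predict the expected value for a new, unseen observation
  def predict(x: Double): Double = slope * x + intercept
}

object LinearRegression {
  // Supervised training: every observation x comes with its expected value y
  def train(trainingSet: Seq[(Double, Double)]): LinearModel = {
    require(trainingSet.size > 1, "At least two labeled observations are needed")
    val n = trainingSet.size
    val meanX = trainingSet.map(_._1).sum / n
    val meanY = trainingSet.map(_._2).sum / n
    // Sums of cross-deviations and squared deviations around the means
    val covXY = trainingSet.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = trainingSet.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    require(varX > 0.0, "Observations must not all share the same x value")
    val slope = covXY / varX
    LinearModel(slope, meanY - slope * meanX)
  }
}
```

A call such as LinearRegression.train(Seq((0.0, 1.0), (1.0, 3.0), (2.0, 5.0))) returns a model whose predict method can then be applied to new observations.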
Descriptive models attempt to find unusual patterns or affinities in data by grouping observations into clusters with similar properties. These models define the first and important step in knowledge discovery. They are commonly generated through unsupervised learning.
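To make the idea of grouping observations without labels more concrete, here is a minimal sketch of K-means restricted to one-dimensional observations. The object name Clustering, its method names, and the choice of initial centroids are assumptions made for this illustration; it is a simplified stand-in, not the K-means implementation presented in Chapter 4.

```scala
// Hypothetical object and method names, for illustration only.
object Clustering {
  // Assign each observation to the index of its nearest centroid
  def assign(obs: Seq[Double], centroids: Seq[Double]): Seq[Int] =
    obs.map(x => centroids.indices.minBy(i => math.abs(x - centroids(i))))

  // Repeatedly reassign observations, then move each centroid to its cluster mean
  def kmeans(obs: Seq[Double], initial: Seq[Double], maxIters: Int = 20): Seq[Double] = {
    @scala.annotation.tailrec
    def iterate(centroids: Seq[Double], iters: Int): Seq[Double] = {
      val labels = assign(obs, centroids)
      val updated = centroids.indices.map { i =>
        val members = obs.zip(labels).collect { case (x, label) if label == i => x }
        // An empty cluster keeps its previous centroid
        if (members.isEmpty) centroids(i) else members.sum / members.size
      }
      // Stop when the centroids no longer move or the iteration budget is spent
      if (iters <= 1 || updated == centroids) updated else iterate(updated, iters - 1)
    }
    iterate(initial, maxIters)
  }
}
```

No expected value is supplied for any observation: the clusters emerge from the distances between observations alone, which is what distinguishes this from the supervised case.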
A third category of models, known as adaptive modeling, is created through reinforcement learning. Reinforcement learning consists of one or several decision-making agents that recommend, and possibly execute, actions in an attempt to solve a problem, optimizing an objective function or resolving constraints.
There are numerous robust, accurate, and efficient Java libraries for mathematics, linear algebra, or optimization that have been widely used for many years:
There is absolutely no need to rewrite, debug, and test these components in Scala. Developers should consider creating a wrapper or interface to their favorite and reliable Java library. The book leverages the Apache Commons Math library for some specific linear algebra algorithms.
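The wrapper approach can be sketched as follows. To keep the example runnable without extra dependencies, the JDK class java.util.DoubleSummaryStatistics (available from Java 8) stands in for a third-party library such as Apache Commons Math. The trait Stats and the class JavaStats are hypothetical names introduced for this illustration.

```scala
import java.util.DoubleSummaryStatistics

// Hypothetical Scala-facing interface: client code depends on this trait only
trait Stats {
  def mean: Double
  def min: Double
  def max: Double
}

// Adapter that hides the mutable Java object behind a read-only Scala API
final class JavaStats(values: Seq[Double]) extends Stats {
  require(values.nonEmpty, "Cannot compute statistics of an empty sequence")

  // The Java object is populated once at construction and never exposed
  private val delegate = new DoubleSummaryStatistics
  values.foreach(x => delegate.accept(x))

  override def mean: Double = delegate.getAverage
  override def min: Double = delegate.getMin
  override def max: Double = delegate.getMax
}
```

Because callers only see the Stats trait, the underlying Java class could later be swapped for another library without touching client code.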
Before getting your hands dirty, you need to download and deploy the minimum set of tools and libraries; there is no need to reinvent the wheel, after all. A few key components have to be installed in order to compile and run the source code described throughout this book. We will focus on open source and commonly available libraries, although you are invited to experiment with the equivalent tools of your choice. The learning curve for the frameworks described here is minimal.
The code described in the book has been tested with JDK 1.7.0_45 and JDK 1.8.0_25 on Windows x64 and MacOS X x64. You need to install the Java Development Kit if you have not already done so. Finally, the environment variables JAVA_HOME, PATH, and CLASSPATH have to be updated accordingly.
The code has been tested with Scala 2.11.4 and 2.11.8. We recommend using Scala version 2.11.4 or higher with SBT 0.13.1 or higher. Let's assume that the Scala runtime (REPL) and libraries have been properly installed and that the environment variables SCALA_HOME and PATH have been updated.
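One way to confirm which versions are actually picked up is to query them from Scala itself. The sketch below relies on scala.util.Properties from the standard library; the object name EnvironmentCheck and its methods are hypothetical and only serve this illustration.

```scala
import scala.util.Properties

// Hypothetical helper, for illustration only
object EnvironmentCheck {
  // Version of the Scala library found on the classpath
  def scalaVersion: String = Properties.versionNumberString

  // Version of the JVM running this code
  def javaVersion: String = Properties.javaVersion

  // Value of an environment variable, or None when it is not set
  def env(name: String): Option[String] = Properties.envOrNone(name)

  // One line per item worth checking before building the examples
  def report: Seq[String] = {
    val javaHome = env("JAVA_HOME").getOrElse("<not set>")
    val scalaHome = env("SCALA_HOME").getOrElse("<not set>")
    Seq(
      s"Scala library: $scalaVersion",
      s"Java runtime:  $javaVersion",
      s"JAVA_HOME:     $javaHome",
      s"SCALA_HOME:    $scalaHome"
    )
  }
}
```

Running EnvironmentCheck.report.foreach(println) in the REPL prints the four lines, which can then be compared with the versions listed above.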
The Scala standard library can be downloaded as binaries or as part of the Typesafe Activator tool by visiting http://www.scala-lang.org/download/.
The description and installation instructions for the Eclipse Scala IDE version 4.0 and higher are available at http://scala-ide.org/docs/user/gettingstarted.html.
You can also download the IntelliJ IDEA Scala plugin version 13 or higher from the JetBrains website at http://confluence.jetbrains.com/display/SCA/.
The ubiquitous Simple Build Tool (SBT) will be our primary building engine. It can be downloaded as part of the Typesafe activator or directly from http://www.scala-sbt.org/download.html.
The syntax of the build file sbt/build.sbt conforms to version 0.13 and is used to compile and assemble the source code presented throughout this book. To build Scala for machine learning, do the following:
Apache Commons Math is a Java library for numerical processing, algebra, statistics, and optimization [1:6].
This is a lightweight library that provides developers with a foundation of small, ready-to-use Java classes that can be easily weaved into a machine learning problem. The examples used throughout the book require version 3.5 or higher.
The math library supports the following:
For more information, visit http://commons.apache.org/proper/commons-math.
Apache Commons Math is distributed under the Apache License 2.0; the terms are available at https://www.apache.org/licenses/LICENSE-2.0.
The installation and deployment of the Apache Commons Math library are quite simple. The steps are as follows:
Go to System property | Advanced system settings | Advanced | Environment variables and then edit the entry CLASSPATH variable.
the source commons-math3-3.6-src.zip from the source section.
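For readers who prefer to let SBT resolve the library instead of editing CLASSPATH by hand, the dependency could be declared along the following lines. This is a sketch, not the build file shipped with the book: the Scala version and the library version 3.6 are assumptions picked from the versions quoted in this chapter.

```scala
// build.sbt fragment (sketch): versions are assumptions based on those quoted in the text
scalaVersion := "2.11.8"

// Java library, so a single % is used (no Scala version suffix on the artifact name)
libraryDependencies += "org.apache.commons" % "commons-math3" % "3.6"
```

With this declaration, SBT downloads the jar file and adds it to the compile and run classpath of the project.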
JFreeChart is an open source charting and plotting Java library widely used in the Java programmer community. It was originally created by David Gilbert [1:8].
The library supports a variety of configurable plots and charts (scatter, dial, pie, area, bar, box and whisker, stacked, and 3D). We use JFreeChart to display the output of data processing and algorithm throughout the book, but you are encouraged to explore this great library on your own, as time permits.
It is distributed under the terms of the GNU Lesser General Public License (LGPL), which permits its use in proprietary applications.
To install and deploy JFreeChart, perform the following steps:
Go to System property | Advanced system settings | Advanced | Environment variables and then edit the entry CLASSPATH variable.
Libraries and tools that are specific to a single chapter are introduced along with the topic. Scalable frameworks are presented in the last chapter along with instructions for downloading them. Libraries related to the conditional random fields and support vector machines are described in their respective chapters.
Why aren't we using Scala algebra and Scala numerical libraries?
Libraries such as Breeze, ScalaNLP, and Algebird are interesting Scala frameworks for linear algebra, numerical analysis, and machine learning. They provide even the most seasoned Scala programmer with a high-quality layer of abstraction. However, this book is designed as a tutorial that allows developers to write algorithms from the ground up using existing or legacy Java libraries [1:9].
