Simplify machine learning model implementations with Spark
This book is for Scala developers with a fairly good exposure to and understanding of machine learning techniques, but who lack practical implementations with Spark. A solid knowledge of machine learning algorithms is assumed, as well as hands-on experience of implementing ML algorithms with Scala. However, you do not need to be acquainted with the Spark ML libraries and ecosystem.
Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability, and optimization. Learning about algorithms enables a wide range of applications, from everyday tasks such as product recommendations and spam filtering to cutting edge applications such as self-driving cars and personalized medicine. You will gain hands-on experience of applying these principles using Apache Spark, a resilient cluster computing system well suited for large-scale machine learning tasks.
This book begins with a quick overview of setting up the necessary IDEs to facilitate the execution of code examples that will be covered in various chapters. It also highlights some key issues developers face while working with machine learning algorithms on the Spark platform. We progress by uncovering the various Spark APIs and the implementation of ML algorithms, developing classification systems, recommendation engines, text analytics, clustering, and learning systems along the way. Toward the final chapters, we'll focus on building high-end applications and explain various unsupervised methodologies and the challenges to tackle when implementing ML systems with big data.
This book is packed with intuitive recipes supported with line-by-line explanations to help you understand how to optimize your workflow and resolve problems when working with complex data modeling tasks and predictive algorithms. It is a valuable resource for data scientists and those working on large-scale data projects.
Page count: 575
Year of publication: 2017
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: September 2017
Production reference: 1200917
ISBN 978-1-78355-160-6
www.packtpub.com
Authors
Siamak Amirghodsi
Meenakshi Rajendran
Broderick Hall
Shuen Mei
Copy Editor
Safis Editing
Reviewers
Sumit Pal
Mohammad Guller
Project Coordinator
Sheejal Shah
Commissioning Editor
Ashwin Nair
Proofreader
Safis Editing
Acquisition Editor
Vinay Argekar
Indexer
Rekha Nair
Content Development Editor
Nikhil Borkar
Graphics
Kirk D'Penha
Technical Editor
Madhunikita Sunil Chindarkar
Production Coordinator
Melwyn Dsa
Siamak Amirghodsi (Sammy) is a world-class senior technology executive leader with an entrepreneurial track record of overseeing big data strategies, cloud transformation, quantitative risk management, advanced analytics, large-scale regulatory data platforming, enterprise architecture, technology road mapping, multi-project execution, and organizational streamlining in Fortune 20 environments in a global setting.
Siamak is a hands-on big data, cloud, machine learning, and AI expert, and is currently overseeing the large-scale cloud data platforming and advanced risk analytics build out for a tier-1 financial institution in the United States. Siamak's interests include building advanced technical teams, executive management, Spark, Hadoop, big data analytics, AI, deep learning nets, TensorFlow, cognitive models, swarm algorithms, real-time streaming systems, quantum computing, financial risk management, trading signal discovery, econometrics, long-term financial cycles, IoT, blockchain, probabilistic graphical models, cryptography, and NLP.
Siamak is fully certified on Cloudera's big data platform and follows Apache Spark, TensorFlow, Hadoop, Hive, Pig, Zookeeper, Amazon AWS, Cassandra, HBase, Neo4j, MongoDB, and GPU architecture, while being fully grounded in the traditional IBM/Oracle/Microsoft technology stack for business continuity and integration.
Siamak has a PMP designation. He holds an advanced degree in computer science and an MBA from the University of Chicago (ChicagoBooth), with emphasis on strategic management, quantitative finance, and econometrics.
Meenakshi Rajendran is a hands-on big data analytics and data governance manager with expertise in large-scale data platforming and machine learning program execution on a global scale. She is experienced in the end-to-end delivery of data analytics and data science products for leading financial institutions. Meenakshi holds a master's degree in business administration and is a certified PMP with over 13 years of experience in global software delivery environments. She understands not only the underpinnings of big data and data science technology but also the human side of the equation.
Meenakshi’s favorite languages are Python, R, Julia, and Scala. Her areas of research and interest are Apache Spark, cloud, regulatory data governance, machine learning, Cassandra, and managing global data teams at scale. In her free time, she dabbles in software engineering management literature, cognitive psychology, and chess for relaxation.
Broderick Hall is a hands-on big data analytics expert and holds a master’s degree in computer science with 20 years of experience in designing and developing complex enterprise-wide software applications with real-time and regulatory requirements at a global scale. He has extensive experience in designing and building real-time financial applications for some of the largest financial institutions and exchanges in the USA. He is a deep learning early adopter and is currently working on a large-scale cloud-based data platform with deep learning net augmentation.
Broderick has extensive experience working in healthcare, travel, real estate, and data center management. Broderick also enjoys his role as an adjunct professor, instructing courses in Java programming and object-oriented programming. He is currently focused on delivering real-time big data mission-critical analytics applications in the financial services industry.
Broderick has been actively involved with Hadoop, Spark, Cassandra, TensorFlow, and deep learning since the early days, while actively pursuing machine learning, cloud architecture, data platforms, data science, and practical applications in cognitive sciences. He enjoys programming in Scala, Python, R, Java, and Julia.
Shuen Mei is a big data analytics platform expert with 15+ years of experience in the financial services industry. He is experienced in designing, building, and executing large-scale, enterprise-distributed financial systems with mission-critical low-latency requirements. He is certified in Apache Spark and on the Cloudera Big Data platform, including the Developer, Admin, and HBase certifications.
Shuen is also a certified AWS solutions architect with emphasis on petabyte-range real-time data platform systems. Shuen is a skilled software engineer with extensive experience in delivering infrastructure, code, data architecture, and performance tuning solutions in trading and finance for Fortune 100 companies.
Shuen holds a master's degree in MIS from the University of Illinois. He actively follows Spark, TensorFlow, Hadoop, cloud architecture, Apache Flink, Hive, HBase, Cassandra, and related systems. He is passionate about Scala, Python, Java, Julia, cloud computing, machine learning algorithms, and deep learning at scale.
Sumit Pal, who has authored SQL on Big Data - Technology, Architecture, and Innovations by Apress, has more than 22 years of experience in the software industry in various roles, spanning companies from startups to enterprises.
Sumit is an independent consultant working with big data, data visualization, and data science, and he is a software architect building end-to-end data-driven analytic systems.
Sumit has worked for Microsoft (SQL server development team), Oracle (OLAP development team), and Verizon (big data analytics team) in a career spanning 22 years. Currently, he works for multiple clients advising them on their data architectures and big data solutions, and does hands-on coding with Spark, Scala, Java, and Python.
Sumit has spoken at the following Big Data Conferences:
Data Summit NY, May 2017
Big Data Symposium Boston, May 2017
Apache Linux Foundation, May 2016, Vancouver, Canada
Data Center World, March 2016, Las Vegas
Chicago, Nov 2015
Global Big Data Conference, Boston, Aug 2015
Sumit has also developed a Big Data Analyst Training course for Experfy, more details of which can be found at https://www.experfy.com/training/courses/big-data-analyst.
Sumit has extensive experience in building scalable systems across the stack, from the middle tier and data tier to visualization for analytics applications, using big data and NoSQL databases. He has deep expertise in database internals, data warehouses, dimensional modeling, data science with Java and Python, and SQL.
Sumit started his career as a part of the SQL Server Development Team at Microsoft in 1996-97 and then as a core server engineer for Oracle Corporation at their OLAP Development team in Burlington, MA.
Sumit has also worked at Verizon as an Associate Director for big data architecture, where he strategized, managed, architected, and developed platforms and solutions for analytics and machine learning applications. He has also served as a chief architect at ModelN/LeapfrogRX (2006-2013), where he architected the middle-tier core analytics platform with open source OLAP engine (Mondrian) on J2EE and solved some complex Dimensional ETL, Modeling, and performance optimization problems.
Sumit has an MS and a BS in computer science. He hiked to the Mt. Everest Base Camp in October 2016.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review.
If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Sections
Getting ready
How to do it…
How it works…
There's more…
See also
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
Practical Machine Learning with Spark Using Scala
Introduction
Apache Spark
Machine learning
Scala
Software versions and libraries used in this book
Downloading and installing the JDK
Getting ready
How to do it...
Downloading and installing IntelliJ
Getting ready
How to do it...
Downloading and installing Spark
Getting ready
How to do it...
Configuring IntelliJ to work with Spark and run Spark ML sample codes
Getting ready
How to do it...
There's more...
See also
Running a sample ML code from Spark
Getting ready
How to do it...
Identifying data sources for practical machine learning
Getting ready
How to do it...
See also
Running your first program using Apache Spark 2.0 with the IntelliJ IDE
How to do it...
How it works...
There's more...
See also
How to add graphics to your Spark program
How to do it...
How it works...
There's more...
See also
Just Enough Linear Algebra for Machine Learning with Spark
Introduction
Package imports and initial setup for vectors and matrices
How to do it...
There's more...
See also
Creating DenseVector and setup with Spark 2.0
How to do it...
How it works...
There's more...
See also
Creating SparseVector and setup with Spark
How to do it...
How it works...
There's more...
See also
Creating dense matrix and setup with Spark 2.0
Getting ready
How to do it...
How it works...
There's more...
See also
Using sparse local matrices with Spark 2.0
How to do it...
How it works...
There's more...
See also
Performing vector arithmetic using Spark 2.0
How to do it...
How it works...
There's more...
See also
Performing matrix arithmetic using Spark 2.0
How to do it...
How it works...
Exploring RowMatrix in Spark 2.0
How to do it...
How it works...
There's more...
See also
Exploring Distributed IndexedRowMatrix in Spark 2.0
How to do it...
How it works...
See also
Exploring distributed CoordinateMatrix in Spark 2.0
How to do it...
How it works...
See also
Exploring distributed BlockMatrix in Spark 2.0
How to do it...
How it works...
See also
Spark's Three Data Musketeers for Machine Learning - Perfect Together
Introduction
RDDs - what started it all...
DataFrame - a natural evolution to unite API and SQL via a high-level API
Dataset - a high-level unifying Data API
Creating RDDs with Spark 2.0 using internal data sources
How to do it...
How it works...
Creating RDDs with Spark 2.0 using external data sources
How to do it...
How it works...
There's more...
See also
Transforming RDDs with Spark 2.0 using the filter() API
How to do it...
How it works...
There's more...
See also
Transforming RDDs with the super useful flatMap() API
How to do it...
How it works...
There's more...
See also
Transforming RDDs with set operation APIs
How to do it...
How it works...
See also
RDD transformation/aggregation with groupBy() and reduceByKey()
How to do it...
How it works...
There's more...
See also
Transforming RDDs with the zip() API
How to do it...
How it works...
See also
Join transformation with paired key-value RDDs
How to do it...
How it works...
There's more...
Reduce and grouping transformation with paired key-value RDDs
How to do it...
How it works...
See also
Creating DataFrames from Scala data structures
How to do it...
How it works...
There's more...
See also
Operating on DataFrames programmatically without SQL
How to do it...
How it works...
There's more...
See also
Loading DataFrames and setup from an external source
How to do it...
How it works...
There's more...
See also
Using DataFrames with standard SQL language - SparkSQL
How to do it...
How it works...
There's more...
See also
Working with the Dataset API using a Scala Sequence
How to do it...
How it works...
There's more...
See also
Creating and using Datasets from RDDs and back again
How to do it...
How it works...
There's more...
See also
Working with JSON using the Dataset API and SQL together
How to do it...
How it works...
There's more...
See also
Functional programming with the Dataset API using domain objects
How to do it...
How it works...
There's more...
See also
Common Recipes for Implementing a Robust Machine Learning System
Introduction
Spark's basic statistical API to help you build your own algorithms
How to do it...
How it works...
There's more...
See also
ML pipelines for real-life machine learning applications
How to do it...
How it works...
There's more...
See also
Normalizing data with Spark
How to do it...
How it works...
There's more...
See also
Splitting data for training and testing
How to do it...
How it works...
There's more...
See also
Common operations with the new Dataset API
How to do it...
How it works...
There's more...
See also
Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0
How to do it...
How it works...
There's more...
See also
LabeledPoint data structure for Spark ML
How to do it...
How it works...
There's more...
See also
Getting access to Spark cluster in Spark 2.0
How to do it...
How it works...
There's more...
See also
Getting access to Spark cluster pre-Spark 2.0
How to do it...
How it works...
There's more...
See also
Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0
How to do it...
How it works...
There's more...
See also
New model export and PMML markup in Spark 2.0
How to do it...
How it works...
There's more...
See also
Regression model evaluation using Spark 2.0
How to do it...
How it works...
There's more...
See also
Binary classification model evaluation using Spark 2.0
How to do it...
How it works...
There's more...
See also
Multiclass classification model evaluation using Spark 2.0
How to do it...
How it works...
There's more...
See also
Multilabel classification model evaluation using Spark 2.0
How to do it...
How it works...
There's more...
See also
Using the Scala Breeze library to do graphics in Spark 2.0
How to do it...
How it works...
There's more...
See also
Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I
Introduction
Fitting a linear regression line to data the old fashioned way
How to do it...
How it works...
There's more...
See also
Generalized linear regression in Spark 2.0
How to do it...
How it works...
There's more...
See also
Linear regression API with Lasso and L-BFGS in Spark 2.0
How to do it...
How it works...
There's more...
See also
Linear regression API with Lasso and 'auto' optimization selection in Spark 2.0
How to do it...
How it works...
There's more...
See also
Linear regression API with ridge regression and 'auto' optimization selection in Spark 2.0
How to do it...
How it works...
There's more...
See also
Isotonic regression in Apache Spark 2.0
How to do it...
How it works...
There's more...
See also
Multilayer perceptron classifier in Apache Spark 2.0
How to do it...
How it works...
There's more...
See also
One-vs-Rest classifier (One-vs-All) in Apache Spark 2.0
How to do it...
How it works...
There's more...
See also
Survival regression – parametric AFT model in Apache Spark 2.0
How to do it...
How it works...
There's more...
See also
Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II
Introduction
Linear regression with SGD optimization in Spark 2.0
How to do it...
How it works...
There's more...
See also
Logistic regression with SGD optimization in Spark 2.0
How to do it...
How it works...
There's more...
See also
Ridge regression with SGD optimization in Spark 2.0
How to do it...
How it works...
There's more...
See also
Lasso regression with SGD optimization in Spark 2.0
How to do it...
How it works...
There's more...
See also
Logistic regression with L-BFGS optimization in Spark 2.0
How to do it...
How it works...
There's more...
See also
Support Vector Machine (SVM) with Spark 2.0
How to do it...
How it works...
There's more...
See also
Naive Bayes machine learning with Spark 2.0 MLlib
How to do it...
How it works...
There's more...
See also
Exploring ML pipelines and DataFrames using logistic regression in Spark 2.0
Getting ready
How to do it...
How it works...
There's more...
PipeLine
Vectors
See also
Recommendation Engine that Scales with Spark
Introduction
Content filtering
Collaborative filtering
Neighborhood method
Latent factor models techniques
Setting up the required data for a scalable recommendation engine in Spark 2.0
How to do it...
How it works...
There's more...
See also
Exploring the movies data details for the recommendation system in Spark 2.0
How to do it...
How it works...
There's more...
See also
Exploring the ratings data details for the recommendation system in Spark 2.0
How to do it...
How it works...
There's more...
See also
Building a scalable recommendation engine using collaborative filtering in Spark 2.0
How to do it...
How it works...
There's more...
See also
Dealing with implicit input for training
Unsupervised Clustering with Apache Spark 2.0
Introduction
Building a KMeans classifying system in Spark 2.0
How to do it...
How it works...
KMeans (Lloyd Algorithm)
KMeans++ (Arthur's algorithm)
KMeans|| (pronounced as KMeans Parallel)
There's more...
See also
Bisecting KMeans, the new kid on the block in Spark 2.0
How to do it...
How it works...
There's more...
See also
Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
How to do it...
How it works...
New GaussianMixture()
There's more...
See also
Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0
How to do it...
How it works...
There's more...
See also
Latent Dirichlet Allocation (LDA) to classify documents and text into topics
How to do it...
How it works...
There's more...
See also
Streaming KMeans to classify data in near real-time
How to do it...
How it works...
There's more...
See also
Optimization - Going Down the Hill with Gradient Descent
Introduction
How do machines learn using an error-based system?
Optimizing a quadratic cost function and finding the minima using just math to gain insight
How to do it...
How it works...
There's more...
See also
Coding a quadratic cost function optimization using Gradient Descent (GD) from scratch
How to do it...
How it works...
There's more...
See also
Coding Gradient Descent optimization to solve Linear Regression from scratch
How to do it...
How it works...
There's more...
See also
Normal equations as an alternative for solving Linear Regression in Spark 2.0
How to do it...
How it works...
There's more...
See also
Building Machine Learning Systems with Decision Tree and Ensemble Models
Introduction
Ensemble models
Measures of impurity
Getting and preparing real-world medical data for exploring Decision Trees and Ensemble models in Spark 2.0
How to do it...
There's more...
Building a classification system with Decision Trees in Spark 2.0
How to do it
How it works...
There's more...
See also
Solving Regression problems with Decision Trees in Spark 2.0
How to do it...
How it works...
See also
Building a classification system with Random Forest Trees in Spark 2.0
How to do it...
How it works...
See also
Solving regression problems with Random Forest Trees in Spark 2.0
How to do it...
How it works...
See also
Building a classification system with Gradient Boosted Trees (GBT) in Spark 2.0
How to do it...
How it works....
There's more...
See also
Solving regression problems with Gradient Boosted Trees (GBT) in Spark 2.0
How to do it...
How it works...
There's more...
See also
Curse of High-Dimensionality in Big Data
Introduction
Feature selection versus feature extraction
Two methods of ingesting and preparing a CSV file for processing in Spark
How to do it...
How it works...
There's more...
See also
Singular Value Decomposition (SVD) to reduce high-dimensionality in Spark
How to do it...
How it works...
There's more...
See also
Principal Component Analysis (PCA) to pick the most effective latent factor for machine learning in Spark
How to do it...
How it works...
There's more...
See also
Implementing Text Analytics with Spark 2.0 ML Library
Introduction
Doing term frequency with Spark - everything that counts
How to do it...
How it works...
There's more...
See also
Displaying similar words with Spark using Word2Vec
How to do it...
How it works...
There's more...
See also
Downloading a complete dump of Wikipedia for a real-life Spark ML project
How to do it...
There's more...
See also
Using Latent Semantic Analysis for text analytics with Spark 2.0
How to do it...
How it works...
There's more...
See also
Topic modeling with Latent Dirichlet allocation in Spark 2.0
How to do it...
How it works...
There's more...
See also
Spark Streaming and Machine Learning Library
Introduction
Structured streaming for near real-time machine learning
How to do it...
How it works...
There's more...
See also
Streaming DataFrames for real-time machine learning
How to do it...
How it works...
There's more...
See also
Streaming Datasets for real-time machine learning
How to do it...
How it works...
There's more...
See also
Streaming data and debugging with queueStream
How to do it...
How it works...
See also
Downloading and understanding the famous Iris data for unsupervised classification
How to do it...
How it works...
There's more...
See also
Streaming KMeans for a real-time on-line classifier
How to do it...
How it works...
There's more...
See also
Downloading wine quality data for streaming regression
How to do it...
How it works...
There's more...
Streaming linear regression for a real-time regression
How to do it...
How it works...
There's more...
See also
Downloading Pima Diabetes data for supervised classification
How to do it...
How it works...
There's more...
See also
Streaming logistic regression for an on-line classifier
How to do it...
How it works...
There's more...
See also
Data is the new silicon of our age, and machine learning, coupled with biologically inspired cognitive systems, serves as the core foundation to not only enable but also accelerate the birth of the fourth industrial revolution. This book is dedicated to our parents, who through extreme hardship and sacrifice, made our education possible and taught us to always practice kindness.
The Apache Spark 2.x Machine Learning Cookbook is crafted by four friends with diverse backgrounds, who bring vast experience across multiple industries and academic disciplines. The team has immense experience in the subject matter at hand. The book is as much about friendship as it is about the science underpinning Spark and Machine Learning. We wanted to put our thoughts together and write a book for the community that not only combines Spark's ML code and real-world data sets but also provides context-relevant explanations, references, and readings for deeper understanding and to promote further research. This book is a reflection of what our team would have wished to have when we got started with Apache Spark.
My own interest in machine learning and artificial intelligence started in the mid-eighties when I had the opportunity to read two significant artifacts that happened to be listed back to back in Artificial Intelligence, An International Journal, Volume 28, Number 1, February 1986. While it has been a long journey for engineers and scientists of my generation, fortunately, the advancements in resilient distributed computing, cloud computing, GPUs, cognitive computing, optimization, and advanced machine learning have made a decades-long dream come true. All these advancements have become accessible to the current generation of ML enthusiasts and data scientists alike.
We live in one of the rarest periods in history: a time when multiple technological and sociological trends have merged at the same point in time. The elasticity of cloud computing with built-in access to ML and deep learning nets will provide a whole new set of opportunities to create and capture new markets. The emergence of Apache Spark as the lingua franca, or common language, of near real-time resilient distributed computing and data virtualization has given smart companies the opportunity to employ ML techniques at scale without heavy investment in specialized data centers or hardware.
The Apache Spark 2.x Machine Learning Cookbook is one of the most comprehensive treatments of the Apache Spark machine learning API, with selected subcomponents of Spark to give you the foundation you need before you can master a high-end career in machine learning and Apache Spark. The book is written with the goal of providing clarity and accessibility, and it reflects our own experience (including reading the source code) and learning curve with Apache Spark, which started with Spark 1.0.
The Apache Spark 2.x Machine Learning Cookbook lives at the intersection of Apache Spark, machine learning, and Scala. It is written through a practitioner’s lens for developers and data scientists who need to understand not only the code but also the details, theory, and inner workings of a given Spark ML algorithm or API to establish a successful career in the new economy.
The book takes the cookbook format to a whole new level by blending downloadable, ready-to-run Apache Spark ML code recipes with background, actionable theory, references, research, and real-life data sets to help the reader understand the what, the how, and the why behind the extensive facilities offered by Spark's machine learning library. The book starts by laying the foundations needed to succeed and then rapidly evolves to cover all the meaningful ML algorithms available in Apache Spark.
Chapter 1, Practical Machine Learning with Spark Using Scala, covers installing and configuring a real-life development environment with machine learning and programming with Apache Spark. Using screenshots, it walks you through downloading, installing, and configuring Apache Spark and IntelliJ IDEA along with the necessary libraries that would reflect a developer’s desktop in a real-world setting. It then proceeds to identify and list over 40 data repositories with real-world data sets that can help the reader in experimenting and advancing even further with the code recipes. In the final step, we run our first ML program on Spark and then provide directions on how to add graphics to your machine learning programs, which are used in the subsequent chapters.
Chapter 2, Just Enough Linear Algebra for Machine Learning with Spark, covers the use of linear algebra (vectors and matrices), which is the foundation of some of the most monumental works in machine learning. It provides a comprehensive treatment of the DenseVector, SparseVector, and matrix facilities available in Apache Spark through the recipes in the chapter. It provides recipes for both local and distributed matrices, including RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix, to give a detailed explanation of this topic. We included this chapter because mastery of Spark and its ML/MLlib facilities was only possible by reading most of the source code line by line and understanding how matrix decomposition and vector/matrix arithmetic work underneath the more coarse-grained algorithms in Spark.
Chapter 3, Spark’s Three Data Musketeers for Machine Learning - Perfect Together, provides an end-to-end treatment of the three pillars of resilient distributed data manipulation and wrangling in Apache Spark. The chapter comprises detailed recipes covering RDDs, DataFrame, and Dataset facilities from a practitioner’s point of view. Through an exhaustive list of 17 recipes, examples, references, and explanations, it lays out the foundation to build a successful career in machine learning sciences. The chapter provides both functional (code) and non-functional (SQL interface) programming approaches to solidify the knowledge base, reflecting the real demands of a successful Spark ML engineer at tier-1 companies.
Chapter 4, Common Recipes for Implementing a Robust Machine Learning System, covers and factors out the tasks that are common in most machine learning systems through 16 short but to-the-point code recipes that the reader can use in their own real-world systems. It covers a gamut of techniques, ranging from normalizing data to evaluating the model output, using best practice metrics via Spark’s ML/MLlib facilities that might not be readily visible to the reader. It is a combination of recipes that we use in our day-to-day jobs in most situations but are listed separately to save on space and complexity of other recipes.
Chapter 5, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I, is the first of two chapters exploring classification and regression in Apache Spark. This chapter starts with Generalized Linear Regression (GLM), extending it to Lasso and Ridge with the different types of optimization available in Spark. The chapter then proceeds to cover Isotonic regression, Survival regression, the multilayer perceptron (a neural network), and the One-vs-Rest classifier.
Chapter 6, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II, is the second of the two regression and classification chapters. This chapter covers RDD-based regression systems, ranging from Linear, Logistic, and Ridge to Lasso, using Stochastic Gradient Descent and L-BFGS optimization in Spark. The last three recipes cover Support Vector Machine (SVM) and Naïve Bayes, ending with a detailed recipe for ML pipelines, which are gaining a prominent position in the Spark ML ecosystem.
Chapter 7, Recommendation Engine that Scales with Spark, covers how to explore your data set and build a movie recommendation engine using Spark’s ML library facilities. It uses a large dataset and some recipes in addition to figures and write-ups to explore the various methods of recommenders before going deep into collaborative filtering techniques in Spark.
Chapter 8, Unsupervised Clustering with Apache Spark 2.0, covers the techniques used in unsupervised learning, such as KMeans, Gaussian Mixture with Expectation Maximization (EM), Power Iteration Clustering (PIC), and Latent Dirichlet Allocation (LDA), while also covering the why and how to help the reader understand the core concepts. Using Spark Streaming, the chapter concludes with a real-time KMeans clustering recipe to classify the input stream into labeled classes via unsupervised means.
Chapter 9, Optimization - Going Down the Hill with Gradient Descent, is a unique chapter that walks you through optimization as it applies to machine learning. It starts from a closed-form formula and quadratic function optimization (for example, a cost function), and moves to using Gradient Descent (GD) in order to solve a regression problem from scratch. The chapter helps you look under the hood by developing your skill set with Scala code, while providing an in-depth explanation of how to code and understand Gradient Descent (GD) from scratch. The chapter concludes with Spark's ML APIs used to achieve the same concepts that we code from scratch.
Chapter 10, Building Machine Learning Systems with Decision Tree and Ensemble Models, covers the Tree and Ensemble models for classification and regression in depth using Spark’s machine library. We use three real-world data sets to explore the classification and regression problems using Decision Tree, Random Forest Tree, and Gradient Boosted Tree. The chapter provides an in-depth explanation of these methods in addition to plug-and-play code recipes that explore Apache Spark’s machine library step by step.
Chapter 11, The Curse of High-Dimensionality in Big Data, demystifies the art and science of dimensionality reduction and provides complete coverage of Spark's ML/MLlib library, which facilitates this important concept in machine learning at scale. The chapter provides sufficient and in-depth coverage of the theory (the what and why) and then proceeds to cover the two fundamental techniques available (the how) in Spark for the readers to use. The chapter covers Singular Value Decomposition (SVD), which relates well to the second chapter, and then proceeds to examine Principal Component Analysis (PCA) in depth with code and write-ups.
Chapter 12, Implementing Text Analytics with Spark 2.0 ML Library, covers the various techniques available in Spark for implementing text analytics at scale. It provides a comprehensive treatment by starting from the basics, such as Term Frequency (TF) and similarity techniques, such as Word2Vec, and moves on to analyzing a complete dump of Wikipedia for a real-life Spark ML project. The chapter concludes with an in-depth discussion and code for implementing Latent Semantic Analysis (LSA) and Topic Modeling with Latent Dirichlet Allocation (LDA) in Spark.
Chapter 13, Spark Streaming and Machine Learning Library, starts by providing an introduction to and the future direction of Spark streaming, and then proceeds to provide recipes for both RDD-based (DStream) and structured streaming to establish a baseline. The chapter then proceeds to cover all the available ML streaming algorithms in Spark at the time of writing this book. The chapter provides code and shows how to implement streaming DataFrame and streaming data sets, and then proceeds to cover queueStream for debugging before it goes into Streaming KMeans (unsupervised learning) and streaming linear models such as Linear and Logistic regression using real-world datasets.
Please use the details from the software list document.
To execute the recipes in this book, you need a system running Windows 7 and above, or Mac 10, with the following software installed:
Apache Spark 2.x
Oracle JDK SE 1.8.x
JetBrains IntelliJ Community Edition 2016.2.x or later
Scala plug-in for IntelliJ 2016.2.x
Jfreechart 1.0.19
breeze-core 0.12
Cloud9 1.5.0 JAR
Bliki-core 3.0.19
hadoop-streaming 2.2.0
Jcommon 1.0.23
Lucene-analyzers-common 6.0.0
Lucene-core-6.0.0
Spark-streaming-flume-assembly 2.0.0
Spark-streaming-kafka-assembly 2.0.0
The hardware requirements for this software are mentioned in the software list provided with the code bundle of this book.
This book is for Scala developers with a fairly good exposure to and understanding of machine learning techniques, but who lack practical implementations with Spark. A solid knowledge of machine learning algorithms is assumed, as well as some hands-on experience of implementing ML algorithms with Scala. However, you do not need to be acquainted with the Spark ML libraries and the ecosystem.
In this book, you will find several headings that appear frequently (Getting ready, How to do it…, How it works…, There's more…, and See also). To give clear instructions on how to complete a recipe, we use these sections as follows:
This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.
This section contains the steps required to follow the recipe.
This section usually consists of a detailed explanation of what happened in the previous section.
This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.
This section provides helpful links to other useful information for the recipe.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Mac users note that we installed Spark 2.0 in the /Users/USERNAME/spark/spark-2.0.0-bin-hadoop2.7/ directory on a Mac machine."
A block of code is set as follows:
object HelloWorld extends App { println("Hello World!") }
Any command-line input or output is written as follows:
mysql -u root -p
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Configure Global Libraries. Select Scala SDK as your global library."
Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps:
Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.
You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account. Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Apache-Spark-2x-Machine-Learning-Cookbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
In this chapter, we will cover:
Downloading and installing the JDK
Downloading and installing IntelliJ
Downloading and installing Spark
Configuring IntelliJ to work with Spark and run Spark ML sample codes
Running a sample ML code from Spark
Identifying data sources for practical machine learning
Running your first program using Apache Spark 2.0 with the IntelliJ IDE
How to add graphics to your Spark program
With the recent advancements in cluster computing coupled with the rise of big data, the field of machine learning has been pushed to the forefront of computing. The need for an interactive platform that enables data science at scale has long been a dream that is now a reality.
The following three areas together have enabled and accelerated interactive data science at scale:
Apache Spark
: A unified technology platform for data science that combines a fast compute engine and fault-tolerant data structures into a well-designed and integrated offering
Machine learning
: A field of artificial intelligence that enables machines to mimic some of the tasks originally reserved exclusively for the human brain
Scala
: A modern JVM-based language that builds on traditional languages, but unites functional and object-oriented concepts without the verboseness of other languages
First, we need to set up the development environment, which will consist of the following components:
Spark
IntelliJ community edition IDE
Scala
The recipes in this chapter will give you detailed instructions for installing and configuring the IntelliJ IDE, Scala plugin, and Spark. After the development environment is set up, we'll proceed to run one of the Spark ML sample codes to test the setup.
Apache Spark is emerging as the de facto platform and trade language for big data analytics and as a complement to the Hadoop paradigm. Spark enables a data scientist to work in the manner that is most conducive to their workflow right out of the box. Spark's approach is to process the workload in a completely distributed manner without the need for MapReduce (MR) or repeated writing of the intermediate results to a disk.
Spark provides an easy-to-use distributed framework in a unified technology stack, which has made it the platform of choice for data science projects, which more often than not require an iterative algorithm that eventually converges toward a solution. These algorithms, due to their inner workings, generate large amounts of intermediate results that need to go from one stage to the next during the intermediate steps. The need for an interactive tool with a robust native distributed machine learning library (MLlib) rules out a disk-based approach for most data science projects.
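To make this concrete, the following small sketch (our own illustration, not one of the book's recipes) shows the pattern that iterative algorithms rely on: the same dataset is scanned on every pass, so it is cached in memory once with cache() instead of being re-read from disk on each iteration:

import org.apache.spark.sql.SparkSession

object IterativeCaching extends App {
  val spark = SparkSession.builder.master("local[*]").appName("IterativeCaching").getOrCreate()

  // Toy numeric data; in a real job this would be a large distributed dataset
  val points = spark.sparkContext.parallelize(Seq(1.0, 2.0, 3.0, 4.0)).cache()

  var estimate = 0.0
  for (_ <- 1 to 10) {
    // Every pass re-scans the same RDD; cache() keeps it in memory across iterations
    val correction = points.map(x => x - estimate).mean()
    estimate += 0.5 * correction
  }
  println(s"Estimate after 10 passes: $estimate") // moves toward the data's mean (2.5)
  spark.stop()
}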
Spark has a different approach toward cluster computing. It solves the problem as a technology stack rather than as an ecosystem. A large number of centrally managed libraries combined with a lightning-fast compute engine that can support fault-tolerant data structures has poised Spark to take over Hadoop as the preferred big data platform for analytics.
Spark has a modular approach, as depicted in the following diagram:
The aim of machine learning is to produce machines and devices that can mimic human intelligence and automate some of the tasks that have been traditionally reserved for a human brain. Machine learning algorithms are designed to go through very large data sets in a relatively short time and approximate answers that would have taken a human much longer to process.
The field of machine learning can be classified into many forms and at a high level, it can be classified as supervised and unsupervised learning. Supervised learning algorithms are a class of ML algorithms that use a training set (that is, labeled data) to compute a probabilistic distribution or graphical model that in turn allows them to classify the new data points without further human intervention. Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses.
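As a minimal illustration of the difference (a toy example of our own, using Spark's ML API rather than a recipe from this book), the supervised learner below is fitted on rows that carry labels, while the unsupervised one is fitted on features alone:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object SupervisedVsUnsupervised extends App {
  val spark = SparkSession.builder.master("local[*]").appName("SupervisedVsUnsupervised").getOrCreate()

  // Supervised: each row carries a label alongside its features
  val labeled = spark.createDataFrame(Seq(
    (1.0, Vectors.dense(2.0, 3.0)),
    (0.0, Vectors.dense(0.5, 0.1)),
    (1.0, Vectors.dense(1.8, 2.5)),
    (0.0, Vectors.dense(0.3, 0.4))
  )).toDF("label", "features")
  val classifier = new LogisticRegression().fit(labeled)

  // Unsupervised: only features; the algorithm infers the grouping on its own
  val unlabeled = labeled.select("features")
  val clusters = new KMeans().setK(2).setSeed(1L).fit(unlabeled)

  println(s"Learned coefficients: ${classifier.coefficients}")
  println(s"Cluster centers: ${clusters.clusterCenters.mkString(", ")}")
  spark.stop()
}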
Out of the box, Spark offers a rich set of ML algorithms that can be deployed on large datasets without any further coding. The following figure depicts Spark's MLlib algorithms as a mind map. Spark's MLlib is designed to take advantage of parallelism while having fault-tolerant distributed data structures. Spark refers to such data structures as Resilient Distributed Datasets or RDDs:
Scala is a modern programming language that is emerging as an alternative to traditional programming languages such as Java and C++. Scala is a JVM-based language that not only offers a concise syntax without the traditional boilerplate code, but also incorporates both object-oriented and functional programming into an extremely crisp and extraordinarily powerful type-safe language.
Scala takes a flexible and expressive approach, which makes it perfect for interacting with Spark's MLlib. The fact that Spark itself is written in Scala provides strong evidence that Scala is a full-service programming language that can be used to create sophisticated system code with heavy performance needs.
Scala builds on Java's tradition by addressing some of its shortcomings, while avoiding an all-or-nothing approach. Scala code compiles into Java bytecode, which in turn makes it possible to coexist with rich Java libraries interchangeably. The ability to use Java libraries with Scala and vice versa provides continuity and a rich environment for software engineers to build modern and complex machine learning systems without being fully disconnected from the Java tradition and code base.
Scala fully supports a feature-rich functional programming paradigm with standard support for lambdas, currying, type inference, immutability, lazy evaluation, and a pattern-matching paradigm reminiscent of Perl without the cryptic syntax. Scala is an excellent match for machine learning programming due to its support for algebra-friendly data types, anonymous functions, covariance, contravariance, and higher-order functions.
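A compact, Spark-free sketch of a few of these language features (our own illustration, not code from the book):

object ScalaFeatures extends App {
  // A lambda (anonymous function) passed to a higher-order function
  val squares = List(1, 2, 3).map(x => x * x)

  // Currying: arguments supplied in separate parameter lists, allowing partial application
  def scale(factor: Double)(x: Double): Double = factor * x
  val double = scale(2.0) _

  // Immutability and lazy evaluation: computed once, on first use
  lazy val answer = { println("computed once"); 42 }

  // Pattern matching on structure and type
  def describe(v: Any): String = v match {
    case 0         => "zero"
    case n: Int    => s"the integer $n"
    case s: String => s"the string $s"
    case _         => "something else"
  }

  println(s"$squares, ${double(3.0)}, $answer, ${describe(7)}")
}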
Here's a hello world program in Scala:
object HelloWorld extends App { println("Hello World!") }
Compiling and running HelloWorld in Scala looks like this:
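Assuming the object above is saved in a file named HelloWorld.scala and the Scala tools are on your PATH, the command-line equivalent is roughly as follows:

scalac HelloWorld.scala
scala HelloWorld
Hello World!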
The Apache Spark Machine Learning Cookbook takes a practical approach by offering a multi-disciplinary view with the developer in mind. This book focuses on the interactions and cohesiveness of machine learning, Apache Spark, and Scala. We also take an extra step and teach you how to set up and run a comprehensive development environment familiar to a developer, rather than providing code snippets that you have to run in an interactive shell without the modern facilities that an IDE provides.
The following table provides a detailed list of software versions and libraries used in this book. If you follow the installation instructions covered in this chapter, it will include most of the items listed here. Any other JAR or library files that may be required for specific recipes are covered via additional installation instructions in the respective recipes:
Core systems
Version
Spark
2.0.0
Java
1.8
IntelliJ IDEA
2016.2.4
Scala-sdk
2.11.8
Miscellaneous JARs that will be required are as follows:
Miscellaneous JARs
Version
bliki-core
3.0.19
breeze-viz
0.12
Cloud9
1.5.0
Hadoop-streaming
2.2.0
JCommon
1.0.23
JFreeChart
1.0.19
lucene-analyzers-common
6.0.0
Lucene-Core
6.0.0
scopt
3.3.0
spark-streaming-flume-assembly
2.0.0
spark-streaming-kafka-0-8-assembly
2.0.0
We have additionally tested all the recipes in this book on Spark 2.1.1 and found that the programs executed as expected. For learning purposes, it is recommended that you use the software versions and libraries listed in these tables.
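If you prefer to have a build tool resolve these libraries instead of downloading the JARs by hand, a minimal build.sbt along the following lines would pull in roughly equivalent artifacts. The group IDs and the use of sbt are our own assumptions; the book's recipes add the JARs manually through IntelliJ, so treat this only as a sketch to adapt:

// build.sbt: a sketch only; the recipes in this book wire JARs into IntelliJ by hand

name := "spark-ml-cookbook"

version := "1.0"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.0.0",
  "org.apache.spark" %% "spark-sql"   % "2.0.0",
  "org.apache.spark" %% "spark-mllib" % "2.0.0",
  "org.scalanlp"     %% "breeze-viz"  % "0.12",   // plotting support used with Breeze
  "org.jfree"         % "jfreechart"  % "1.0.19", // charting for the graphics recipes
  "org.jfree"         % "jcommon"     % "1.0.23",
  "com.github.scopt" %% "scopt"       % "3.3.0"   // command-line parsing used by the Spark examples
)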
To stay current with the rapidly changing Spark landscape and documentation, the API links to the Spark documentation mentioned throughout this book point to the latest version of Spark 2.x.x, but the API references in the recipes are explicitly for Spark 2.0.0.
All the Spark documentation links provided in this book will point to the latest documentation on Spark's website. If you prefer to look for documentation for a specific version of Spark (for example, Spark 2.0.0), look for relevant documentation on the Spark website using the following URL:
https://spark.apache.org/documentation.html
We've made the code as simple as possible for clarity purposes rather than demonstrating the advanced features of Scala.
The first step is to download the JDK development environment that is required for Scala/Spark development.
When you are ready to download and install the JDK, access the following link:
http://www.oracle.com/technetwork/java/javase/downloads/index.html
After successful download, follow the on-screen instructions to install the JDK.
IntelliJ Community Edition is a lightweight IDE for Java SE, Groovy, Scala, and Kotlin development. To complete setting up your machine learning with the Spark development environment, the IntelliJ IDE needs to be installed.
When you are ready to download and install IntelliJ, access the following link:
https://www.jetbrains.com/idea/download/
At the time of writing, we are using IntelliJ version 15.x or later (for example, version 2016.2.4) to test the examples in the book, but feel free to download the latest version. Once the installation file is downloaded, double-click on the downloaded file (.exe) and begin to install the IDE. Leave all the installation options at the default settings if you do not want to make any changes. Follow the on-screen instructions to complete the installation:
We now proceed to download and install Spark.
When you are ready to download and install Spark, access the Apache website at this link:
http://spark.apache.org/downloads.html
Go to the Apache website and select the required download parameters, as shown in this screenshot:
Make sure to accept the default choices (click on Next) and proceed with the installation.
We need to run some configurations to ensure that the project settings are correct before being able to run the samples that are provided by Spark or any of the programs listed in this book.
We need to be particularly careful when configuring the project structure and global libraries. After we set everything up, we proceed to run the sample ML code provided by the Spark team to verify the setup. Sample code can be found under the Spark directory or can be obtained by downloading the Spark source code with samples.
The following are the steps for configuring IntelliJ to work with Spark MLlib and for running the sample ML code provided by Spark in the examples directory. The examples directory can be found in your home directory for Spark. Use the Scala samples to proceed:
Click on the Project Structure... option, as shown in the following screenshot, to configure project settings:
Verify the settings:
Configure Global Libraries. Select Scala SDK as your global library:
Select the JARs for the new Scala SDK and let the download complete:
Select the project name:
Verify the settings and additional libraries:
Add dependency JARs. Select modules under the Project Settings in the left-hand pane and click on dependencies to choose the required JARs, as shown in the following screenshot:
Select the JAR files provided by Spark. Choose Spark's default installation directory and then select the lib directory:
We then select the JAR files for examples that are provided for Spark out of the box.
Add required JARs by verifying that you selected and imported all the JARs listed under External Libraries in the left-hand pane:
Spark 2.0 uses Scala 2.11. Two new streaming JARs, Flume and Kafka, are needed to run the examples, and can be downloaded from the following URLs:
https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-flume-assembly_2.11/2.0.0/spark-streaming-flume-assembly_2.11-2.0.0.jar
https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-8-assembly_2.11/2.0.0/spark-streaming-kafka-0-8-assembly_2.11-2.0.0.jar
The next step is to download and install the Flume and Kafka JARs. For the purposes of this book, we have used the Maven repo:
Download and install the Kafka assembly:
Download and install the Flume assembly:
After the download is complete, move the downloaded JAR files to the lib directory of Spark. We used the C drive when we installed Spark:
Open your IDE and verify that all the JARs under the External Libraries folder on the left, as shown in the following screenshot, are present in your setup:
Build the example projects in Spark to verify the setup:
Verify that the build was successful:
Prior to Spark 2.0, we needed another library from Google called Guava for facilitating I/O and for providing a set of rich methods of defining tables and then letting Spark broadcast them across the cluster. Due to dependency issues that were hard to work around, Spark 2.0 no longer uses the Guava library. Make sure you use the Guava library if you are using Spark versions prior to 2.0 (required in version 1.5.2). The Guava library can be accessed at the following URL:
https://github.com/google/guava/wiki
You may want to use Guava version 15.0, which can be found here:
https://mvnrepository.com/artifact/com.google.guava/guava/15.0
If you are using installation instructions from previous blogs, make sure to exclude the Guava library from the installation set.
If there are other third-party libraries or JARs required for the completion of the Spark installation, you can find those in the following Maven repository:
https://repo1.maven.org/maven2/org/apache/spark/
We can verify the setup by simply downloading the sample code from the Spark source tree and importing it into IntelliJ to make sure it runs.
We will first run the logistic regression code from the samples to verify installation. In the next section, we proceed to write our own version of the same program and examine the output in order to understand how it works.
Go to the source directory and pick one of the ML sample code files to run. We've selected the logistic regression example.
After selecting the example, select Edit Configurations..., as shown in the following screenshot:
In the Configurations tab, define the following options:
VM options: The choice shown allows you to run a standalone Spark cluster (a typical local value is sketched just after this list)
Program arguments: What we are supposed to pass into the program
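For reference, one common way to point these examples at a local, standalone-style master is to set the Spark master through a JVM system property in the VM options field; the program arguments depend entirely on which example you picked, so the placeholder below is not a value prescribed by the book:

VM options: -Dspark.master=local[*]
Program arguments: (whatever input arguments the chosen example expects, if any)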
Run the logistic regression by going to Run 'LogisticRegressionExample', as shown in the following screenshot:
Verify the exit code and make sure it is as shown in the following screenshot:
Getting data for machine learning projects was a challenge in the past. However, now there is a rich set of public data sources specifically suitable for machine learning.
In addition to the university and government sources, there are many other open sources of data that can be used to learn and code your own examples and projects. We will list the data sources and show you how to best obtain and download data for each chapter.
The following is a list of open source data worth exploring if you would like to develop applications in this field:
UCI machine learning repository
