This book will teach you about popular machine learning algorithms and their implementation. You will learn how various machine learning concepts are implemented in the context of Spark ML. You will start by installing Spark in single-node and multinode clusters. Next, you'll see how to execute Scala- and Python-based programs for Spark ML. Then we will take a few datasets and go deeper into clustering, classification, and regression. Toward the end, we will also cover text processing using Spark ML.
Once you have learned the concepts, you can apply them to implement algorithms in green-field projects or to migrate existing systems, such as those built on Mahout or scikit-learn, to this new platform.
By the end of this book, you will acquire the skills to leverage Spark's features to create your own scalable machine learning applications and power a modern data-driven business.
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2015
Second edition: April 2017
Production reference: 1270417
ISBN 978-1-78588-993-6
www.packtpub.com
Authors
Rajdeep Dua
Manpreet Singh Ghotra
Nick Pentreath
Copy Editors
Safis Editing
Sonia Mathur
Reviewer
Brian O'Neill
Project Coordinator
Vaidehi Sawant
Commissioning Editor
Akram Hussain
Proofreader
Safis Editing
Acquisition Editor
Tushar Gupta
Indexer
Francy Puthiry
Content Development Editor
Rohit Kumar Singh
Production Coordinator
Deepika Naik
Technical Editors
Nirant Carvalho
Kunal Mali
Rajdeep Dua has over 16 years of experience in the Cloud and Big Data space. He worked in the advocacy team for Google's big data tool, BigQuery. He worked on the Greenplum big data platform at VMware in the developer evangelist team. He also worked closely with a team on porting Spark to run on VMware's public and private cloud as a feature set. He has taught Spark and Big Data at some of the most prestigious tech schools in India: IIIT Hyderabad, ISB, IIIT Delhi, and College of Engineering Pune.
Currently, he leads the developer relations team at Salesforce India. He also works with the data pipeline team at Salesforce, which uses Hadoop and Spark to expose big data processing tools for developers.
He has published Big Data and Spark tutorials at http://www.clouddatalab.com. He has also presented BigQuery and Google App Engine at the W3C conference in Hyderabad (http://wwwconference.org/proceedings/www2011/schedule/www2011_Program.pdf). He led the developer relations teams at Google, VMware, and Microsoft, and he has spoken at hundreds of other conferences on the cloud. Some of the other references to his work can be seen at http://yourstory.com/2012/06/vmware-hires-rajdeep-dua-to-lead-the-developer-relations-in-india/ and http://dl.acm.org/citation.cfm?id=2624641.
His contributions to the open source community are related to Docker, Kubernetes, Android, OpenStack, and Cloud Foundry.
You can connect with him on LinkedIn at https://www.linkedin.com/in/rajdeepd.
Manpreet Singh Ghotra has more than 12 years of experience in software development for both enterprise and big data software. He is currently working on developing a machine learning platform using Apache Spark at Salesforce. He has worked on a sentiment analyzer using the Apache stack and machine learning.
He was part of the machine learning group at one of the largest online retailers in the world, working on transit time calculations and an R-based recommendation system, both using Apache Mahout.
With a master's and postgraduate degree in machine learning, he has contributed to and worked for the machine learning community.
His GitHub profile is https://github.com/badlogicmanpreet and you can find him on LinkedIn at https://in.linkedin.com/in/msghotra.
Nick Pentreath has a background in financial markets, machine learning, and software development. He has worked at Goldman Sachs Group, Inc., as a research scientist at the online ad targeting start-up Cognitive Match Limited, London, and as the lead of the data science and analytics team at Mxit, Africa's largest social network. He is a cofounder of Graphflow, a big data and machine learning company focused on user-centric recommendations and customer intelligence. He is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add value to the bottom line. Nick is a member of the Apache Spark Project Management Committee.
Brian O'Neill is the principal architect at Monetate, Inc. Monetate's personalization platform leverages Spark and machine learning algorithms to process millions of events per second, leveraging real-time context and analytics to create personalized brand experiences at scale. Brian is a perennial DataStax Cassandra MVP and has also won InfoWorld's Technology Leadership award. Previously, he was CTO for Health Market Science (HMS), now a LexisNexis company. He is a graduate of Brown University and holds patents in artificial intelligence and data management.
Prior to this publication, Brian authored a book on distributed computing, Storm Blueprints: Patterns for Distributed Real-time Computation, and contributed to Learning Cassandra for Administrators.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://goo.gl/5LgUpI.
If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Getting Up and Running with Spark
Installing and setting up Spark locally
Spark clusters
The Spark programming model
SparkContext and SparkConf
SparkSession
The Spark shell
Resilient Distributed Datasets
Creating RDDs
Spark operations
Caching RDDs
Broadcast variables and accumulators
SchemaRDD
Spark data frame
The first step to a Spark program in Scala
The first step to a Spark program in Java
The first step to a Spark program in Python
The first step to a Spark program in R
SparkR DataFrames
Getting Spark running on Amazon EC2
Launching an EC2 Spark cluster
Configuring and running Spark on Amazon Elastic Map Reduce
UI in Spark
Supported machine learning algorithms by Spark
Benefits of using Spark ML as compared to existing libraries
Spark Cluster on Google Compute Engine - DataProc
Hadoop and Spark Versions
Creating a Cluster
Submitting a Job
Summary
Math for Machine Learning
Linear algebra
Setting up the Scala environment in IntelliJ
Setting up the Scala environment on the Command Line
Fields
Real numbers
Complex numbers
Vectors
Vector spaces
Vector types
Vectors in Breeze
Vectors in Spark
Vector operations
Hyperplanes
Vectors in machine learning
Matrix
Types of matrices
Matrix in Spark
Distributed matrix in Spark
Matrix operations
Determinant
Eigenvalues and eigenvectors
Singular value decomposition
Matrices in machine learning
Functions
Function types
Functional composition
Hypothesis
Gradient descent
Prior, likelihood, and posterior
Calculus
Differential calculus
Integral calculus
Lagrange multipliers
Plotting
Summary
Designing a Machine Learning System
What is Machine Learning?
Introducing MovieStream
Business use cases for a machine learning system
Personalization
Targeted marketing and customer segmentation
Predictive modeling and analytics
Types of machine learning models
The components of a data-driven machine learning system
Data ingestion and storage
Data cleansing and transformation
Model training and testing loop
Model deployment and integration
Model monitoring and feedback
Batch versus real time
Data Pipeline in Apache Spark
An architecture for a machine learning system
Spark MLlib
Performance improvements in Spark ML over Spark MLlib
Comparing algorithms supported by MLlib
Classification
Clustering
Regression
MLlib supported methods and developer APIs
Spark Integration
MLlib vision
MLlib versions compared
Spark 1.6 to 2.0
Summary
Obtaining, Processing, and Preparing Data with Spark
Accessing publicly available datasets
The MovieLens 100k dataset
Exploring and visualizing your data
Exploring the user dataset
Count by occupation
Movie dataset
Exploring the rating dataset
Rating count bar chart
Distribution of the number of ratings
Processing and transforming your data
Filling in bad or missing data
Extracting useful features from your data
Numerical features
Categorical features
Derived features
Transforming timestamps into categorical features
Extract time of day
Text features
Simple text feature extraction
Sparse Vectors from Titles
Normalizing features
Using ML for feature normalization
Using packages for feature extraction
TF-IDF
IDF
Word2Vec
Skip-gram model
Standard scaler
Summary
Building a Recommendation Engine with Spark
Types of recommendation models
Content-based filtering
Collaborative filtering
Matrix factorization
Explicit matrix factorization
Implicit Matrix Factorization
Basic model for Matrix Factorization
Alternating least squares
Extracting the right features from your data
Extracting features from the MovieLens 100k dataset
Training the recommendation model
Training a model on the MovieLens 100k dataset
Training a model using Implicit feedback data
Using the recommendation model
ALS Model recommendations
User recommendations
Generating movie recommendations from the MovieLens 100k dataset
Inspecting the recommendations
Item recommendations
Generating similar movies for the MovieLens 100k dataset
Inspecting the similar items
Evaluating the performance of recommendation models
ALS Model Evaluation
Mean Squared Error
Mean Average Precision at K
Using MLlib's built-in evaluation functions
RMSE and MSE
MAP
FP-Growth algorithm
FP-Growth Basic Sample
FP-Growth Applied to MovieLens Data
Summary
Building a Classification Model with Spark
Types of classification models
Linear models
Logistic regression
Multinomial logistic regression
Visualizing the StumbleUpon dataset
Extracting features from the Kaggle/StumbleUpon evergreen classification dataset
StumbleUponExecutor
Linear support vector machines
The naive Bayes model
Decision trees
Ensembles of trees
Random Forests
Gradient-Boosted Trees
Multilayer perceptron classifier
Extracting the right features from your data
Training classification models
Training a classification model on the Kaggle/StumbleUpon evergreen classification dataset
Using classification models
Generating predictions for the Kaggle/StumbleUpon evergreen classification dataset
Evaluating the performance of classification models
Accuracy and prediction error
Precision and recall
ROC curve and AUC
Improving model performance and tuning parameters
Feature standardization
Additional features
Using the correct form of data
Tuning model parameters
Linear models
Iterations
Step size
Regularization
Decision trees
Tuning tree depth and impurity
The naive Bayes model
Cross-validation
Summary
Building a Regression Model with Spark
Types of regression models
Least squares regression
Decision trees for regression
Evaluating the performance of regression models
Mean Squared Error and Root Mean Squared Error
Mean Absolute Error
Root Mean Squared Log Error
The R-squared coefficient
Extracting the right features from your data
Extracting features from the bike sharing dataset
Training and using regression models
BikeSharingExecutor
Training a regression model on the bike sharing dataset
Generalized linear regression
Decision tree regression
Ensembles of trees
Random forest regression
Gradient boosted tree regression
Improving model performance and tuning parameters
Transforming the target variable
Impact of training on log-transformed targets
Tuning model parameters
Creating training and testing sets to evaluate parameters
Splitting data for Decision tree
The impact of parameter settings for linear models
Iterations
Step size
L2 regularization
L1 regularization
Intercept
The impact of parameter settings for the decision tree
Tree depth
Maximum bins
The impact of parameter settings for the Gradient Boosted Trees
Iterations
MaxBins
Summary
Building a Clustering Model with Spark
Types of clustering models
k-means clustering
Initialization methods
Mixture models
Hierarchical clustering
Extracting the right features from your data
Extracting features from the MovieLens dataset
K-means - training a clustering model
Training a clustering model on the MovieLens dataset
K-means - interpreting cluster predictions on the MovieLens dataset
Interpreting the movie clusters
K-means - evaluating the performance of clustering models
Internal evaluation metrics
External evaluation metrics
Computing performance metrics on the MovieLens dataset
Effect of iterations on WSSSE
Bisecting KMeans
Bisecting K-means - training a clustering model
WSSSE and iterations
Gaussian Mixture Model
Clustering using GMM
Plotting the user and item data with GMM clustering
GMM - effect of iterations on cluster boundaries
Summary
Dimensionality Reduction with Spark
Types of dimensionality reduction
Principal components analysis
Singular value decomposition
Relationship with matrix factorization
Clustering as dimensionality reduction
Extracting the right features from your data
Extracting features from the LFW dataset
Exploring the face data
Visualizing the face data
Extracting facial images as vectors
Loading images
Converting to grayscale and resizing the images
Extracting feature vectors
Normalization
Training a dimensionality reduction model
Running PCA on the LFW dataset
Visualizing the Eigenfaces
Interpreting the Eigenfaces
Using a dimensionality reduction model
Projecting data using PCA on the LFW dataset
The relationship between PCA and SVD
Evaluating dimensionality reduction models
Evaluating k for SVD on the LFW dataset
Singular values
Summary
Advanced Text Processing with Spark
What's so special about text data?
Extracting the right features from your data
Term weighting schemes
Feature hashing
Extracting the tf-idf features from the 20 Newsgroups dataset
Exploring the 20 Newsgroups data
Applying basic tokenization
Improving our tokenization
Removing stop words
Excluding terms based on frequency
A note about stemming
Feature Hashing
Building a tf-idf model
Analyzing the tf-idf weightings
Using a tf-idf model
Document similarity with the 20 Newsgroups dataset and tf-idf features
Training a text classifier on the 20 Newsgroups dataset using tf-idf
Evaluating the impact of text processing
Comparing raw features with processed tf-idf features on the 20 Newsgroups dataset
Text classification with Spark 2.0
Word2Vec models
Word2Vec with Spark MLlib on the 20 Newsgroups dataset
Word2Vec with Spark ML on the 20 Newsgroups dataset
Summary
Real-Time Machine Learning with Spark Streaming
Online learning
Stream processing
An introduction to Spark Streaming
Input sources
Transformations
Keeping track of state
General transformations
Actions
Window operators
Caching and fault tolerance with Spark Streaming
Creating a basic streaming application
The producer application
Creating a basic streaming application
Streaming analytics
Stateful streaming
Online learning with Spark Streaming
Streaming regression
A simple streaming regression program
Creating a streaming data producer
Creating a streaming regression model
Streaming K-means
Online model evaluation
Comparing model performance with Spark Streaming
Structured Streaming
Summary
Pipeline APIs for Spark ML
Introduction to pipelines
DataFrames
Pipeline components
Transformers
Estimators
How pipelines work
Machine learning pipeline with an example
StumbleUponExecutor
Summary
In recent years, the volume of data being collected, stored, and analyzed has exploded, in particular in relation to activity on the Web and mobile devices, as well as data from the physical world collected via sensor networks. While large-scale data storage, processing, analysis, and modeling were previously the domain of the largest institutions, such as Google, Yahoo!, Facebook, Twitter, and Salesforce, increasingly many organizations are faced with the challenge of handling massive amounts of data.
When faced with this quantity of data and the common requirement to utilize it in real time, human-powered systems quickly become infeasible. This has led to a rise in so-called big data and machine learning systems that learn from this data to make automated decisions.
In answer to the challenge of dealing with ever larger-scale data without any prohibitive cost, new open source technologies emerged at companies such as Google, Yahoo!, Amazon, and Facebook, which aimed at making it easier to handle massive data volumes by distributing data storage and computation across a cluster of computers.
The most widespread of these is Apache Hadoop, which made it significantly easier and cheaper to both store large amounts of data (via the Hadoop Distributed File System, or HDFS) and run computations on this data (via Hadoop MapReduce, a framework to perform computation tasks in parallel across many nodes in a computer cluster).
However, MapReduce has some important shortcomings, including high overheads to launch each job and reliance on storing intermediate data and results of the computation to disk, both of which make Hadoop relatively ill-suited for use cases of an iterative or low-latency nature. Apache Spark is a new framework for distributed computing that is designed from the ground up to be optimized for low-latency tasks and to store intermediate data and results in memory, thus addressing some of the major drawbacks of the Hadoop framework. Spark provides a clean, functional, and easy-to-understand API to write applications, and is fully compatible with the Hadoop ecosystem.
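As a loose illustration of that functional style, here is a word count expressed as a chain of map-like and reduce-like steps. This is plain Python on a made-up two-line corpus, not Spark's actual API; it only mirrors the shape of the transformations Spark encourages:

```python
# A toy, purely in-memory stand-in for the functional, chained style that
# Spark's API encourages -- plain Python, not the Spark API itself.
lines = [
    "spark makes cluster computing simple",
    "spark caches data in memory",
]

# flatMap-like step: split each line into words
words = [word for line in lines for word in line.split()]

# map-like step: pair each word with a count of 1
pairs = [(word, 1) for word in words]

# reduceByKey-like step: sum the counts per word
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts["spark"])  # -> 2, since "spark" appears once in each line
```

In Spark, the same pipeline shape appears as `flatMap`, `map`, and `reduceByKey` transformations over a distributed dataset, which Chapter 1 introduces.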
Furthermore, Spark provides native APIs in Scala, Java, Python, and R. The Scala and Python APIs allow all the benefits of the Scala or Python language, respectively, to be used directly in Spark applications, including using the relevant interpreter for real-time, interactive exploration. Spark itself now provides a toolkit (Spark MLlib in 1.6 and Spark ML in 2.0) of distributed machine learning and data mining models that is under heavy development and already contains high-quality, scalable, and efficient algorithms for many common machine learning tasks, some of which we will delve into in this book.
Applying machine learning techniques to massive datasets is challenging, primarily because most well-known machine learning algorithms are not designed for parallel architectures. In many cases, designing such algorithms is not an easy task. The nature of machine learning models is generally iterative, hence the strong appeal of Spark for this use case. While there are many competing frameworks for parallel computing, Spark is one of the few that combines speed, scalability, in-memory processing, and fault tolerance with ease of programming and a flexible, expressive, and powerful API design.
Throughout this book, we will focus on real-world applications of machine learning technology. While we may briefly delve into some theoretical aspects of machine learning algorithms and required maths for machine learning, the book will generally take a practical, applied approach with a focus on using examples and code to illustrate how to effectively use the features of Spark and MLlib, as well as other well-known and freely available packages for machine learning and data analysis, to create a useful machine learning system.
Chapter 1, Getting Up and Running with Spark, shows how to install and set up a local development environment for the Spark framework, as well as how to create a Spark cluster in the cloud using Amazon EC2. The Spark programming model and API will be introduced and a simple Spark application will be created using Scala, Java, and Python.
Chapter 2, Math for Machine Learning, provides a mathematical introduction to machine learning. Understanding math and many of its techniques is important to get a good hold on the inner workings of the algorithms and to get the best results.
Chapter 3, Designing a Machine Learning System, presents an example of a real-world use case for a machine learning system. We will design a high-level architecture for an intelligent system in Spark based on this illustrative use case.
Chapter 4, Obtaining, Processing, and Preparing Data with Spark, details how to go about obtaining data for use in a machine learning system, in particular from various freely and publicly available sources. We will learn how to process, clean, and transform the raw data into features that may be used in machine learning models, using available tools, libraries, and Spark's functionality.
Chapter 5, Building a Recommendation Engine with Spark, deals with creating a recommendation model based on the collaborative filtering approach. This model will be used to recommend items to a given user, as well as create lists of items that are similar to a given item. Standard metrics to evaluate the performance of a recommendation model will be covered here.
Chapter 6, Building a Classification Model with Spark, details how to create a model for binary classification, as well as how to utilize standard performance-evaluation metrics for classification tasks.
Chapter 7, Building a Regression Model with Spark, shows how to create a model for regression, extending the classification model created in Chapter 6, Building a Classification Model with Spark. Evaluation metrics for the performance of regression models will be detailed here.
Chapter 8, Building a Clustering Model with Spark, explores how to create a clustering model and how to use related evaluation methodologies. You will learn how to analyze and visualize the clusters that are generated.
Chapter 9, Dimensionality Reduction with Spark, takes us through methods to extract the underlying structure from, and reduce the dimensionality of, our data. You will learn some common dimensionality-reduction techniques and how to apply and analyze them. You will also see how to use the resulting data representation as an input to another machine learning model.
Chapter 10, Advanced Text Processing with Spark, introduces approaches to deal with large-scale text data, including techniques for feature extraction from text and dealing with the very high-dimensional features typical in text data.
Chapter 11, Real-Time Machine Learning with Spark Streaming, provides an overview of Spark Streaming and how it fits in with the online and incremental learning approaches to apply machine learning on data streams.
Chapter 12, Pipeline APIs for Spark ML, introduces the pipeline API: a uniform set of APIs built on top of DataFrames that helps users create and tune machine learning pipelines.
Throughout this book, we assume that you have some basic experience with programming in Scala or Python and have some basic knowledge of machine learning, statistics, and data analysis.
This book is aimed at entry-level to intermediate data scientists, data analysts, software engineers, and practitioners involved in machine learning or data mining with an interest in large-scale machine learning approaches, but who are not necessarily familiar with Spark. You may have some experience of statistics or machine learning software (perhaps including MATLAB, scikit-learn, Mahout, R, Weka, and so on) or distributed systems (including some exposure to Hadoop).
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-with-Spark-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MachineLearningwithSparkSecondEdition_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
Apache Spark is a framework for distributed computing; this framework aims to make it simpler to write programs that run in parallel across many nodes in a cluster of computers or virtual machines. It tries to abstract the tasks of resource scheduling, job submission, execution, tracking, and communication between nodes as well as the low-level operations that are inherent in parallel data processing. It also provides a higher level API to work with distributed data. In this way, it is similar to other distributed processing frameworks such as Apache Hadoop; however, the underlying architecture is somewhat different.
Spark began as a research project at the AMP Lab at the University of California, Berkeley (https://amplab.cs.berkeley.edu/projects/spark-lightning-fast-cluster-computing/). The lab focused on the use case of distributed machine learning algorithms. Hence, Spark is designed from the ground up for high performance in applications of an iterative nature, where the same data is accessed multiple times. This performance is achieved primarily through caching datasets in memory, combined with low latency and overhead when launching parallel computation tasks. Together with other features such as fault tolerance, flexible distributed-memory data structures, and a powerful functional API, Spark has proved to be broadly useful for a wide range of large-scale data processing tasks, over and above machine learning and iterative analytics.
For more information, you can visit http://spark.apache.org/community.html and http://spark.apache.org/community.html#history.
Performance-wise, Spark is much faster than Hadoop for comparable workloads.
Spark runs in four modes:
The standalone local mode, where all Spark processes are run within the same Java Virtual Machine (JVM) process
The standalone cluster mode, using Spark's own built-in, job-scheduling framework
Using Mesos, a popular open source cluster-computing framework
Using YARN (commonly referred to as NextGen MapReduce), Hadoop's resource management and job scheduling framework
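As a sketch of how these modes are typically selected, the master URL passed to `spark-submit` chooses among them. The host names, ports, and application JAR name below are placeholders, not values from this book:

```shell
# Local mode: all Spark processes inside a single JVM, using all cores
spark-submit --master "local[*]" my-app.jar

# Standalone cluster mode: Spark's own built-in job scheduler
spark-submit --master spark://master-host:7077 my-app.jar

# Mesos cluster
spark-submit --master mesos://mesos-host:5050 my-app.jar

# YARN cluster
spark-submit --master yarn my-app.jar
```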
In this chapter, we will do the following:
Download the Spark binaries and set up a development environment that runs in Spark's standalone local mode. This environment will be used throughout the book to run the example code.
Explore Spark's programming model and API using Spark's interactive console.
Write our first Spark program in Scala, Java, R, and Python.
Set up a Spark cluster using Amazon's Elastic Compute Cloud (EC2) platform, which can be used for large-sized data and heavier computational requirements, rather than running in the local mode
Set up a Spark Cluster using Amazon Elastic Map Reduce
If you have previous experience in setting up Spark and are familiar with the basics of writing a Spark program, feel free to skip this chapter.
Spark can be run using the built-in standalone cluster scheduler in the local mode. This means that all the Spark processes are run within the same JVM: effectively, a single, multithreaded instance of Spark. The local mode is widely used for prototyping, development, debugging, and testing. However, this mode can also be useful in real-world scenarios to perform parallel computation across multiple cores on a single computer.
As Spark's local mode is fully compatible with the cluster mode, programs written and tested locally can be run on a cluster with just a few additional steps.
The first step in setting up Spark locally is to download the latest version from http://spark.apache.org/downloads.html, which contains links to download various versions of Spark as well as to obtain the latest source code via GitHub.
Spark needs to be built against a specific version of Hadoop in order to access the Hadoop Distributed File System (HDFS) as well as standard and custom Hadoop input sources; prebuilt packages are available for Cloudera's Hadoop distribution, MapR's Hadoop distribution, and Hadoop 2 (YARN). Unless you wish to build Spark against a specific Hadoop version, we recommend that you download the prebuilt Hadoop 2.7 package from an Apache mirror, for example http://d3kbcqa49mib13.cloudfront.net/spark-2.0.2-bin-hadoop2.7.tgz.
Spark requires the Scala programming language (version 2.10.x or 2.11.x at the time of writing this book) in order to run. Fortunately, the prebuilt binary package comes with the Scala runtime packages included, so you don't need to install Scala separately in order to get started. However, you will need to have a Java Runtime Environment (JRE) or Java Development Kit (JDK).
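As a quick sanity check before unpacking Spark, you can verify that a Java runtime is visible on your PATH. This is a minimal POSIX shell sketch; the exact version string printed will vary by vendor and platform:

```shell
# Check that a JRE/JDK is available before running Spark.
# Spark 2.0.x requires Java 7 or later.
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
else
  echo "No JRE/JDK found on PATH; install one before running Spark" >&2
fi
```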
Once you have downloaded the Spark binary package, unpack its contents and change into the newly created directory by running the following commands:
$ tar xfvz spark-2.0.2-bin-hadoop2.7.tgz
$ cd spark-2.0.2-bin-hadoop2.7
Spark places the scripts used to run Spark programs in the bin directory. You can test whether everything is working correctly by running one of the example programs included in Spark. Run the following command:
$ bin/run-example SparkPi 100
This will run the example in Spark's local standalone mode. In this mode, all the Spark processes are run within the same JVM, and Spark uses multiple threads for parallel processing. By default, the preceding example uses a number of threads equal to the number of cores available on your system. Once the program is executed, you should see something similar to the following lines toward the end of the output:
...
16/11/24 14:41:58 INFO Executor: Finished task 99.0 in stage 0.0 (TID 99). 872 bytes result sent to driver
16/11/24 14:41:58 INFO TaskSetManager: Finished task 99.0 in stage 0.0 (TID 99) in 59 ms on localhost (100/100)
16/11/24 14:41:58 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 1.988 s
16/11/24 14:41:58 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/11/24 14:41:58 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 2.235920 s
Pi is roughly 3.1409527140952713
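As noted above, local mode defaults to one worker thread per available core. You can check that default ahead of time; this sketch assumes a Linux system with GNU coreutils (on macOS, sysctl -n hw.ncpu is the equivalent):

```shell
# Print the number of CPU cores Spark's local mode will use by default.
if command -v nproc >/dev/null 2>&1; then
  cores=$(nproc)
else
  cores=$(sysctl -n hw.ncpu 2>/dev/null || echo 1)
fi
echo "Spark local mode will use up to $cores threads by default"
```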
The preceding command runs the org.apache.spark.examples.SparkPi class.
The master can be specified in the local[N] form, where N is the number of threads to use; passing local[*] uses all of the cores on the local machine, which is a common usage.
To use only two threads, run the following command instead:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master local[2] ./examples/jars/spark-examples_2.11-2.0.2.jar 100
A Spark cluster is made up of two types of processes: a driver program and multiple executors. In the local mode, all these processes are run within the same JVM. In a cluster, these processes are usually run on separate nodes.
For example, a typical cluster that runs in Spark's standalone mode (that is, using Spark's built-in cluster management modules) will have the following:
A master node that runs the Spark standalone master process as well as the driver program
A number of worker nodes, each running an executor process
While we will be using Spark's local standalone mode throughout this book to illustrate concepts and examples, the same Spark code that we write can be run on a Spark cluster. In the preceding example, if we run the code on a Spark standalone cluster, we could simply pass in the URL for the master node, as follows:
$ ./bin/spark-submit --master spark://IP:PORT \
  --class org.apache.spark.examples.SparkPi \
  ./examples/jars/spark-examples_2.11-2.0.2.jar 100
Here, IP is the IP address and PORT is the port of the Spark master. This tells Spark to run the program on the cluster where the Spark master process is running.
A full treatment of Spark's cluster management and deployment is beyond the scope of this book. However, we will briefly teach you how to set up and use an Amazon EC2 cluster later in this chapter.
For an overview of the Spark cluster-application deployment, take a look at the following links:
http://spark.apache.org/docs/latest/cluster-overview.html
http://spark.apache.org/docs/latest/submitting-applications.html
Before we delve into a high-level overview of Spark's design, we will introduce the SparkContext object as well as the Spark shell, which we will use to interactively explore the basics of the Spark programming model.
While this section provides a brief overview and examples of using Spark, we recommend that you read the following documentation to get a detailed understanding:
Refer to the following URLs:
For the Spark quick start, refer to http://spark.apache.org/docs/latest/quick-start
For the Spark programming guide, which covers Scala, Java, Python, and R, refer to http://spark.apache.org/docs/latest/programming-guide.html
The core of Spark is a concept called the Resilient Distributed Dataset (RDD). An RDD is a collection of records (strictly speaking, objects of some type) that are distributed or partitioned across many nodes in a cluster (for the purposes of the Spark local mode, the single multithreaded process can be thought of in the same way). An RDD in Spark is fault-tolerant; this means that if a given node or task fails (for some reason other than erroneous user code, such as hardware failure, loss of communication, and so on), the RDD can be reconstructed automatically on the remaining nodes and the job will still be completed.
SchemaRDD is a combination of an RDD and schema information. It also offers many rich and easy-to-use APIs (that is, the Dataset API). SchemaRDD is no longer exposed directly as of Spark 2.0; it is used internally by the DataFrame and Dataset APIs.
A schema describes how structured data is logically organized. Once the schema information is available, the SQL engine can provide structured query capability over the corresponding data. The Dataset API can be seen as an alternative to Spark SQL's parser: instead of parsing SQL text, API calls build up the program's logical plan tree. Subsequent processing steps reuse Spark SQL's core logic, so we can consider the Dataset API's processing functions to be equivalent to those of SQL queries.
SchemaRDD is an RDD subclass. When a program calls the Dataset API, a new SchemaRDD object is created, and the logical plan attribute of the new object is created by adding a new logical operation node to the original logical plan tree. As with RDDs, operations of the Dataset API are of two types: transformations and actions.
APIs related to relational operations are of the transformation type.
Operations associated with data output sources are of the action type. As with RDDs, a Spark job is triggered and delivered for cluster execution only when an action is called.
We will now use the ideas introduced in the previous section to write a basic Spark program to manipulate a dataset. We will start with Scala and then write the same program in Java and Python. Our program will be based on exploring some data from an online store, recording which users have purchased which products. The data is contained in a Comma-Separated Value (CSV) file called UserPurchaseHistory.csv. This file is expected to be in the data directory.
The contents are shown in the following snippet. The first column of the CSV is the username, the second column is the product name, and the final column is the price: