When people want a way to process big data at speed, Spark is invariably the solution. With its ease of development (in comparison to the relative complexity of Hadoop), it’s unsurprising that it’s becoming popular with data analysts and engineers everywhere.
Beginning with the fundamentals, we’ll show you how to get set up with Spark with minimum fuss. You’ll then get to grips with some simple APIs before investigating machine learning and graph processing – throughout we’ll make sure you know exactly how to apply your knowledge.
You will also learn how to use the Spark shell and load data, before finding out how to build and run your own Spark applications. Discover how to manipulate your RDDs and get stuck into a range of DataFrame APIs. As if that's not enough, you'll also learn some useful machine learning algorithms with the help of Spark MLlib and integrate Spark with R. We'll also make sure you're confident and prepared for graph processing, as you learn more about the GraphX API.
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2013
Second edition: March 2015
Third edition: October 2016
Production reference: 1141016
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-927-1
www.packtpub.com
Author
Krishna Sankar
Copy Editor
Safis Editing
Reviewers
Sumit Pal
Alexis Roos
Project Coordinator
Suzanne Coutinho
Commissioning Editor
Akram Hussain
Proofreader
Safis Editing
Acquisition Editor
Tushar Gupta
Indexer
Tejal Daruwale Soni
Content Development Editor
Nikhil Borkar
Graphics
Kirk D'Penha
Technical Editor
Madhunikita Sunil Chindarkar
Production Coordinator
Melwyn D'sa
Krishna Sankar is a Senior Specialist—AI Data Scientist with Volvo Cars, focusing on autonomous vehicles. His earlier stints include Chief Data Scientist at http://cadenttech.tv/, Principal Architect/Data Scientist at Tata America Intl. Corp., Director of Data Science at a bioinformatics startup, and Distinguished Engineer at Cisco. He has spoken at various conferences, including ML tutorials at Strata SJC and London 2016, Spark Summit [goo.gl/ab30lD], Strata-Spark Camp, OSCON, PyCon, and PyData, and writes about Robots Rules of Order [goo.gl/5yyRv6], Big Data Analytics—Best of the Worst [goo.gl/ImWCaz], predicting NFL, Spark [http://goo.gl/E4kqMD], Data Science [http://goo.gl/9pyJMH], Machine Learning [http://goo.gl/SXF53n], and Social Media Analysis [http://goo.gl/D9YpVQ]. He has also been a guest lecturer at the Naval Postgraduate School. His occasional blogs can be found at https://doubleclix.wordpress.com/. His other passions are flying drones (he is working toward a Drone Pilot License, FAA UAS Pilot) and Lego Robotics; you will find him at the St. Louis FLL World Competition as a Robot Design Judge.
My first thanks go to you, the reader, who is taking the time to understand the technologies that Apache Spark brings to computation, and to the developers of the Spark platform. The book reviewers, Sumit and Alexis, did a wonderful and thorough job morphing my rough materials into correct, readable prose. This book is the result of dedicated work by many at Packt, notably Nikhil Borkar, the Content Development Editor, who deserves all the credit. Madhunikita, as always, has been the guiding force behind the hard work of bringing the materials together, in more than one way. On a personal note, my bosses at Volvo, namely Petter Horling, Vedad Cajic, Andreas Wallin, and Mats Gustafsson, are a constant source of guidance and insights. And of course, my spouse Usha and son Kaushik always have an encouraging word; special thanks to Usha's father, Mr. Natarajan, whose wisdom we all rely upon, and to my late mom for her kindness.
Sumit Pal has more than 22 years of experience in the software industry, in roles spanning companies from startups to enterprises. He is a big data, visualization, and data science consultant, as well as a software architect who builds end-to-end data-driven analytic systems. He has worked for Microsoft (SQL Server development team), Oracle (OLAP development team), and Verizon (big data analytics team). Currently, he works with multiple clients, advising them on their data architectures and big data solutions, and does hands-on coding with Spark, Scala, Java, and Python. He has extensive experience in building scalable systems across the stack, from the middle tier and data tier to visualization for analytics applications, using big data and NoSQL databases.
Sumit has deep expertise in database internals, data warehouses, dimensional modeling, and data science with Java, Python, and SQL. He started his career as part of the SQL Server development team at Microsoft in 1996-97, and then worked as a core server engineer on Oracle's OLAP development team in Burlington, MA. Sumit has also worked at Verizon as an Associate Director for big data architecture, where he strategized, managed, architected, and developed platforms and solutions for analytics and machine learning applications. He also served as Chief Architect at ModelN/LeapfrogRX (2006-2013), where he architected the middle-tier core analytics platform with an open source OLAP engine (Mondrian) on J2EE and solved some complex dimensional ETL, modeling, and performance optimization problems. Sumit holds an MS and a BS in computer science.
Alexis Roos (@alexisroos) has over 20 years of software engineering experience, with strong expertise in data science, big data, and application infrastructure. Currently an engineering manager at Salesforce, Alexis manages a team of backend engineers building entry-level Salesforce CRM (SalesforceIQ). Prior to that, at Radius Intelligence, Alexis designed a comprehensive US business graph built from billions of records using Spark, GraphX, MLlib, and Scala.
Alexis also worked for the startups Couchbase and Concurrent Inc., for Sun Microsystems/Oracle for over 13 years, and for several large systems integrators in Europe, where he built and supported dozens of distributed application architectures across a range of verticals, including telecommunications, healthcare, finance, and government. Alexis holds a master's degree in computer science with a focus on cognitive science. He has spoken at dozens of conferences worldwide (including Spark Summit, Scala by the Bay, Hadoop Summit, and JavaOne), as well as delivered university courses and participated in industry panels.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Apache Spark has captured the imagination of analytics and big data developers, and rightfully so. In a nutshell, Spark enables distributed computing at scale, in the lab or in production. Until now, the collect-store-transform pipeline was distinct from the data science reason-model pipeline, which was again distinct from the deployment of the analytics and machine learning models. With Spark and technologies such as Kafka, we can now seamlessly span the data management and data science pipelines. Moreover, we can now build data science models on larger datasets rather than just samples of the data. And whatever models we build can be deployed into production (with added work from engineering on the "ilities", of course). It is our hope that this book will help a data engineer get familiar with the fundamentals of the Spark platform, as well as provide hands-on experience with some of its advanced capabilities.
Chapter 1, Installing Spark and Setting Up Your Cluster, details some common methods for setting up Spark.
Chapter 2, Using the Spark Shell, introduces the command line for Spark. The shell is good for trying out quick program snippets or just figuring out the syntax of a call interactively.
Chapter 3, Building and Running a Spark Application, covers the ways for compiling Spark applications.
Chapter 4, Creating a SparkSession Object, describes the programming aspects of connecting to a Spark server, covering the SparkSession and the enclosed SparkContext.
Chapter 5, Loading and Saving Data in Spark, deals with how we can get data in and out of a Spark environment.
Chapter 6, Manipulating Your RDD, describes how to program with Resilient Distributed Datasets (RDDs), the fundamental data abstraction layer in Spark that makes all the magic possible.
Chapter 7, Spark 2.0 Concepts, is a short, interesting chapter that discusses the evolution of Spark and the concepts underpinning the Spark 2.0 release, which is a major milestone.
Chapter 8, Spark SQL, deals with the SQL interface in Spark. Spark SQL is probably the most widely used feature.
Chapter 9, Foundations of Datasets/DataFrames – The Proverbial Workhorse for Data Scientists, is another interesting chapter, which introduces the Datasets/DataFrames APIs added in the Spark 2.0 release.
Chapter 10, Spark with Big Data, describes the interfaces with Parquet and HBase.
Chapter 11, Machine Learning with Spark ML Pipelines, is my favorite chapter. We talk about regression, classification, clustering, and recommendation in this chapter. This is probably the largest chapter in this book. If you are stranded on a remote island and could take only one chapter with you, this should be the one!
Chapter 12, GraphX, talks about an important capability, processing graphs at scale, and also discusses interesting algorithms such as PageRank.
Like any development platform, learning to develop systems with Spark takes trial and error. Writing programs, encountering errors, and agonizing over pesky bugs are all part of the process. We assume a basic level of programming experience, in Python or Java, and experience working with operating system commands. We have kept the examples simple and to the point. In terms of resources, we do not assume any esoteric equipment for running the examples and developing the code. A normal development machine is enough.
Data scientists and data engineers who are new to Spark will benefit from this book. Our goal in developing this book is to give an in-depth, hands-on, end-to-end knowledge of Apache Spark 2. We have kept it simple and short so that one can get a good introduction in a short period of time. Folks who have an exposure to big data and analytics will recognize the patterns and the pragmas. Having said that, anyone who wants to understand distributed programming will benefit from working through the examples and reading the book.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The hallmark of a MapReduce system is this: map and reduce, the two primitives."
A block of code is set as follows:
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.11</version>
  <scope>test</scope>
</dependency>
Any command-line input or output is written as follows:
./ec2/spark-ec2 -i ~/spark-keypair.pem launch myfirstsparkcluster --resume
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "From Spark 2.0.0 onwards, they have changed the packaging, so we have to include spark-2.0.0/assembly/target/scala-2.11/jars in Add External Jars…."
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of your preferred archive extraction tool.
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Fast-Data-Processing-with-Spark-2. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
This chapter will detail some common methods to set up Spark. Spark on a single machine is excellent for testing or exploring small Datasets, but here you will also learn to use Spark's built-in deployment scripts with a dedicated cluster via Secure Shell (SSH). For cloud deployments of Spark, this chapter will look at EC2 (both traditional and Elastic MapReduce). Feel free to skip this chapter if you already have your local Spark instance installed and want to get straight to programming. The best way to navigate through installation is to use this chapter as a guide and refer to the Spark installation documentation at http://spark.apache.org/docs/latest/cluster-overview.html.
Regardless of how you are going to deploy Spark, you will want to get the latest version of Spark from https://spark.apache.org/downloads.html (Version 2.0.0 as of this writing). Spark currently releases every 90 days. Coders who want to work with the latest builds can clone the code directly from the repository at https://github.com/apache/spark. The building instructions are available at https://spark.apache.org/docs/latest/building-spark.html. Both source code and prebuilt binaries are available at this link. To interact with the Hadoop Distributed File System (HDFS), you need a Spark build that is built against the same version of Hadoop as your cluster. For Version 2.0.0 of Spark, the prebuilt packages are built against the available Hadoop Versions 2.3, 2.4, 2.6, and 2.7. If you are up for the challenge, it's recommended that you build from the source, as it gives you the flexibility of choosing the HDFS version that you want to support as well as applying patches. In this chapter, we will do both.
As you explore the latest version of Spark, an essential task is to read the release notes, especially what has been changed and deprecated. For 2.0.0, the list is slightly long and is available at https://spark.apache.org/releases/spark-release-2-0-0.html#removals-behavior-changes-and-deprecations. For example, the notes talk about where the EC2 scripts have moved to and about support for Hadoop 2.1 and earlier.
To compile the Spark source, you will need the appropriate version of Scala and the matching JDK. The Spark source tarball includes the required Scala components. The following discussion is for information only; there is no need to install Scala separately.
The Spark developers have done a good job of managing the dependencies. Refer to the https://spark.apache.org/docs/latest/building-spark.html web page for the latest information on this. The website states that:
"Building Spark using Maven requires Maven 3.3.9 or newer and Java 7+."
Scala gets pulled down as a dependency by Maven (currently Scala 2.11.8). Scala does not need to be installed separately; it is just a bundled dependency.
Just as a note, Spark 2.0.0 by default runs with Scala 2.11.8, but can be compiled to run with Scala 2.10. I have just seen e-mails in the Spark users' group on this.
This brings up another interesting point about the Spark community. The two essential mailing lists are [email protected] and [email protected]. More details about the Spark community are available at https://spark.apache.org/community.html.
One convention that would be handy is to download and install software in the /opt directory. Also, have a generic soft link to Spark that points to the current version. For example, /opt/spark points to /opt/spark-2.0.0 with the following command:
sudo ln -f -s spark-2.0.0 spark
Downloading the example code
You can download the example code files for all of the Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Later, if you upgrade, say to Spark 2.1, you can change the soft link.
However, remember to copy any configuration changes and old logs when you change to a new distribution. A more flexible way is to change the configuration directory to /etc/opt/spark and the log files to /var/log/spark/. In this way, these files will stay independent of the distribution updates. More details are available at https://spark.apache.org/docs/latest/configuration.html#overriding-configuration-directory and https://spark.apache.org/docs/latest/configuration.html#configuring-logging.
Let's download prebuilt Spark and install it. Later, we will also compile a version and build from the source. The download is straightforward. The download page is at http://spark.apache.org/downloads.html. Select the options as shown in the following screenshot:
We will use wget from the command line. You can do a direct download as well:
cd /opt
sudo wget http://www-us.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
We are downloading the prebuilt version for Apache Hadoop 2.7 from one of the possible mirrors. We could have easily downloaded other prebuilt versions as well, as shown in the following screenshot:
To uncompress it, execute the following command:
sudo tar xvf spark-2.0.0-bin-hadoop2.7.tgz
To test the installation, run the following command:
/opt/spark-2.0.0-bin-hadoop2.7/bin/run-example SparkPi 10
It will fire up the Spark stack and calculate the value of Pi. The result will be as shown in the following screenshot:
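Beyond the bundled SparkPi example, a quick interactive check is to launch the Spark shell from the same installation and run a couple of one-liners. The following is a minimal sketch (not part of the book's code bundle); it assumes the prebuilt binaries sit in /opt/spark-2.0.0-bin-hadoop2.7, and it relies on sc and spark, the SparkContext and SparkSession that the shell creates for you:
// Started with: /opt/spark-2.0.0-bin-hadoop2.7/bin/spark-shell
val data = sc.parallelize(1 to 1000)  // distribute a small range across the local cores
println(data.sum())                   // action; should print 500500.0
println(spark.version)                // should print 2.0.0, confirming the build you downloaded
If both statements come back without errors, the download and the local Spark runtime are in working order.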
Let's compile Spark on a new AWS instance. In this way, you can clearly understand what all the requirements are to get a Spark stack compiled and installed. I am using the Amazon Linux AMI, which has Java and other base stacks installed by default. As this is a book on Spark, we can safely assume that you would have the base configurations covered. We will cover the incremental installs for the Spark stack here.
The latest instructions for building from the source are available at http://spark.apache.org/docs/latest/building-spark.html.
The first order of business is to download the latest source from https://spark.apache.org/downloads.html. Select Source Code from option 2. Choose a package type and either download directly or select a mirror. The download page is shown in the following screenshot:
We can either download from the web page or use wget.
We will use wget from the first mirror shown in the preceding screenshot and download it to the opt subdirectory, as shown in the following command:
cd /opt
sudo wget http://www-eu.apache.org/dist/spark/spark-2.0.0/spark-2.0.0.tgz
sudo tar -xzf spark-2.0.0.tgz
The latest development source is in GitHub, which is available at https://github.com/apache/spark. The latest version can be checked out with a Git clone of https://github.com/apache/spark.git. This should be done only when you want to see the developments for the next version or when you are contributing to the source.
Compilation by nature is uneventful, but a lot of information gets displayed on the screen:
cd /opt/spark-2.0.0
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
sudo mvn clean package -Pyarn -Phadoop-2.7 -DskipTests
In order for the preceding snippet to work, we will need Maven installed on our system. Check by typing mvn -v. You will see the output as shown in the following screenshot:
In case Maven is not installed in your system, the commands to install the latest version of Maven are given here:
wget http://mirror.cc.columbia.edu/pub/software/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
sudo tar -xzf apache-maven-3.3.9-bin.tar.gz
sudo ln -f -s apache-maven-3.3.9 maven
export M2_HOME=/opt/maven
export PATH=${M2_HOME}/bin:${PATH}
Detailed Maven installation instructions are available at http://maven.apache.org/download.cgi#Installation. Sometimes, you will have to debug Maven using the -X switch. When I ran Maven, the Amazon Linux AMI didn't have the Java compiler! I had to install javac for the Amazon Linux AMI using the following command:
sudo yum install java-1.7.0-openjdk-devel
The compilation time varies. On my Mac, it took approximately 28 minutes. On Amazon Linux with a t2.medium instance, it took 38 minutes. The times could vary, depending on the Internet connection, which libraries are cached, and so forth.
In the end, you will see a build success message like the one shown in the following screenshot:
As an example, the compilation switches -Pyarn -Phadoop-2.7 -DskipTests are explained at https://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version. The -D switch defines a system property and -P defines a profile.
You can also compile the source code in IDEA, and then upload the built version to your cluster.
A quick way to test the installation is by calculating Pi:
/opt/spark/bin/run-example SparkPi 10
The result will be a few debug messages, and then the value of Pi, as shown in the following screenshot:
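To connect this test with the programming model discussed next, here is a rough sketch of the kind of Monte Carlo estimate a SparkPi-style program performs. It is an illustrative approximation rather than the bundled example's actual source; it can be typed into spark-shell, where sc is the shell-provided SparkContext:
val slices = 10                      // mirrors the argument we passed to run-example
val n = 100000 * slices              // total number of random points to scatter
val inside = sc.parallelize(1 to n, slices).map { _ =>
  val x = math.random * 2 - 1        // random point in the square [-1, 1] x [-1, 1]
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0   // 1 if the point lands inside the unit circle
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * inside / n)
The work of generating and testing the points is spread across the ten partitions, which is exactly the style of data parallelism discussed in the next section.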
This is a good time to talk about the basic mechanics and mechanisms of Spark. We will progressively dig deeper, but for now let's take a quick look at the top level.
Essentially, Spark provides a framework to process vast amounts of data, be it gigabytes, terabytes, or occasionally petabytes. The two main ingredients are computation and scale. The size and effectiveness of the problems we can solve depend on these two factors, that is, the ability to apply complex computations over large amounts of data in a timely fashion. If our monthly runs take 40 days, we have a problem.
The key, of course, is parallelism, massive parallelism to be exact. We can make our computational algorithm tasks work in parallel, that is, instead of doing the steps one after another, we can perform many steps at the same time, or carry out data parallelism. This means that we run the same algorithms over a partitioned Dataset in parallel. In my humble opinion, Spark is extremely effective in applying data parallelism in an elegant framework. As you will see in the rest of this book, the two components are the Resilient Distributed Dataset (RDD) and the cluster manager. The cluster manager distributes the code and manages the data that is represented in RDDs. RDDs with transformations and actions are the main programming abstractions and present parallelized collections. Behind the scenes, the cluster manager controls the distribution and interaction with RDDs, distributes code, and manages fault-tolerant execution.

As you will see later in the book, Spark has more abstractions on top of RDDs, namely DataFrames and Datasets. These layers make it extremely efficient for a data engineer or a data scientist to work on distributed data. Spark works with three types of cluster managers: standalone, Apache Mesos, and Hadoop YARN. The Spark page at http://spark.apache.org/docs/latest/cluster-overview.html has a lot more details on this. I just gave you a quick introduction here.
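To ground these terms, here is a short illustrative sketch (not code from the book's bundle) that can be run in spark-shell, where sc and spark already exist. It builds an RDD, records a transformation lazily, triggers it with an action, and then lifts the same data into a DataFrame:
val words = sc.parallelize(Seq("spark", "rdd", "dataframe", "dataset"))
val lengths = words.map(_.length)        // transformation: only recorded, nothing runs yet
println(lengths.reduce(_ + _))           // action: the job is actually scheduled and executed

// The same data through the higher-level DataFrame abstraction
import spark.implicits._
val df = words.toDF("word")
df.filter($"word".startsWith("data")).show()   // prints the rows "dataframe" and "dataset"
The choice of cluster manager does not change this code; it only changes how and where the tasks run.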
If you have installed Hadoop 2.0, it is recommended that you install Spark on YARN. If you have installed Hadoop 1.0, the standalone version is recommended. If you want to try Mesos, you can choose to install Spark on Mesos. It is not recommended to install both YARN and Mesos.
Refer to the following diagram: