When people want a way to process big data at speed, Spark is invariably the solution. With its ease of development (in comparison to the relative complexity of Hadoop), it’s unsurprising that it’s becoming popular with data analysts and engineers everywhere.
Beginning with the fundamentals, we’ll show you how to get set up with Spark with minimum fuss. You’ll then get to grips with some simple APIs before investigating machine learning and graph processing – throughout we’ll make sure you know exactly how to apply your knowledge.
You will also learn how to use the Spark shell and load data, before finding out how to build and run your own Spark applications. Discover how to manipulate your RDDs and get stuck into a range of DataFrame APIs. As if that's not enough, you'll also learn some useful machine learning algorithms with the help of Spark MLlib and integrate Spark with R. We'll also make sure you're confident and prepared for graph processing, as you learn more about the GraphX API.
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2013
Second edition: March 2015
Third edition: October 2016
Production reference: 1141016
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-927-1
www.packtpub.com
Author
Krishna Sankar
Copy Editor
Safis Editing
Reviewers
Sumit Pal
Alexis Roos
Project Coordinator
Suzanne Coutinho
Commissioning Editor
Akram Hussain
Proofreader
Safis Editing
Acquisition Editor
Tushar Gupta
Indexer
Tejal Daruwale Soni
Content Development Editor
Nikhil Borkar
Graphics
Kirk D'Penha
Technical Editor
Madhunikita Sunil Chindarkar
Production Coordinator
Melwyn D'sa
Krishna Sankar is a Senior Specialist—AI Data Scientist with Volvo Cars, focusing on autonomous vehicles. His earlier stints include Chief Data Scientist at http://cadenttech.tv/, Principal Architect/Data Scientist at Tata America Intl. Corp., Director of Data Science at a bioinformatics startup, and Distinguished Engineer at Cisco. He has spoken at various conferences, including ML tutorials at Strata SJC and London 2016, Spark Summit [goo.gl/ab30lD], Strata-Spark Camp, OSCON, PyCon, and PyData, and writes about Robots Rules of Order [goo.gl/5yyRv6], Big Data Analytics—Best of the Worst [goo.gl/ImWCaz], predicting NFL, Spark [http://goo.gl/E4kqMD], Data Science [http://goo.gl/9pyJMH], Machine Learning [http://goo.gl/SXF53n], and Social Media Analysis [http://goo.gl/D9YpVQ]. He has also been a guest lecturer at the Naval Postgraduate School. His occasional blogs can be found at https://doubleclix.wordpress.com/. His other passions are flying drones (he is working toward a Drone Pilot License, FAA UAS Pilot) and Lego Robotics; you will find him at the St. Louis FLL World Competition as a Robot Design Judge.
My first thanks go to you, the reader, who is taking the time to understand the technologies that Apache Spark brings to computation, and to the developers of the Spark platform. The book reviewers, Sumit and Alexis, did a wonderful and thorough job morphing my rough materials into correct, readable prose. This book is the result of dedicated work by many at Packt, notably Nikhil Borkar, the Content Development Editor, who deserves all the credit. Madhunikita, as always, has been the guiding force behind the hard work of bringing the materials together, in more than one way. On a personal note, my bosses at Volvo, namely Petter Horling, Vedad Cajic, Andreas Wallin, and Mats Gustafsson, are a constant source of guidance and insights. And of course, my spouse Usha and son Kaushik always have an encouraging word; special thanks to Usha's father, Mr. Natarajan, whose wisdom we all rely upon, and to my late mom for her kindness.
Sumit Pal has more than 22 years of experience in the software industry, in roles spanning companies from startups to enterprises. He is a big data, visualization, and data science consultant, as well as a software architect who builds end-to-end data-driven analytic systems. He has worked for Microsoft (SQL Server development team), Oracle (OLAP development team), and Verizon (big data analytics team). Currently, he works with multiple clients, advising them on their data architectures and big data solutions, and does hands-on coding with Spark, Scala, Java, and Python. He has extensive experience in building scalable systems across the stack, from the middle tier and data tier to visualization for analytics applications, using big data and NoSQL databases.
Sumit has deep expertise in database internals, data warehouses, dimensional modeling, and data science with Java, Python, and SQL. He started his career as part of the SQL Server development team at Microsoft in 1996-97, and then worked as a core server engineer on Oracle's OLAP development team in Burlington, MA. Sumit has also worked at Verizon as an Associate Director for big data architecture, where he strategized, managed, architected, and developed platforms and solutions for analytics and machine learning applications. He also served as Chief Architect at ModelN/LeapfrogRX (2006-2013), where he architected the middle-tier core analytics platform with an open source OLAP engine (Mondrian) on J2EE and solved some complex dimensional ETL, modeling, and performance optimization problems. Sumit holds an MS and a BS in computer science.
Alexis Roos (@alexisroos) has over 20 years of software engineering experience, with strong expertise in data science, big data, and application infrastructure. Currently an engineering manager at Salesforce, Alexis manages a team of backend engineers building entry-level Salesforce CRM (SalesforceIQ). Prior to that, at Radius Intelligence, Alexis designed a comprehensive US business graph built from billions of records using Spark, GraphX, MLlib, and Scala.
Alexis also worked for the startups Couchbase and Concurrent Inc., for Sun Microsystems/Oracle for over 13 years, and for several large systems integrators in Europe, where he built and supported dozens of distributed application architectures across a range of verticals, including telecommunications, healthcare, finance, and government. Alexis holds a master's degree in computer science with a focus on cognitive science. He has spoken at dozens of conferences worldwide (including Spark Summit, Scala by the Bay, Hadoop Summit, and JavaOne), as well as delivered university courses and participated in industry panels.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Apache Spark has captured the imagination of analytics and big data developers, and rightfully so. In a nutshell, Spark enables distributed computing at scale, in the lab or in production. Until now, the collect-store-transform pipeline was distinct from the data science reason-model pipeline, which was again distinct from the deployment of the analytics and machine learning models. With Spark and technologies such as Kafka, we can now seamlessly span the data management and data science pipelines. Moreover, we can now build data science models on larger datasets rather than just samples of the data. And whatever models we build can be deployed into production (with added work from engineering on the "ilities", of course). It is our hope that this book will help a data engineer get familiar with the fundamentals of the Spark platform, as well as provide hands-on experience with some of its advanced capabilities.
Chapter 1, Installing Spark and Setting Up Your Cluster, details some common methods for setting up Spark.
Chapter 2, Using the Spark Shell, introduces the command line for Spark. The shell is good for trying out quick program snippets or just figuring out the syntax of a call interactively.
Chapter 3, Building and Running a Spark Application, covers the ways for compiling Spark applications.
Chapter 4, Creating a SparkSession Object, describes the programming aspects of connecting to a Spark server, covering the SparkSession and the enclosed SparkContext.
Chapter 5, Loading and Saving Data in Spark, deals with how we can get data in and out of a Spark environment.
Chapter 6, Manipulating Your RDD, describes how to program with Resilient Distributed Datasets (RDDs), the fundamental data abstraction layer in Spark that makes all the magic possible.
Chapter 7, Spark 2.0 Concepts, is a short, interesting chapter that discusses the evolution of Spark and the concepts underpinning the Spark 2.0 release, which is a major milestone.
Chapter 8, Spark SQL, deals with the SQL interface in Spark. Spark SQL is probably the most widely used feature.
Chapter 9, Foundations of Datasets/DataFrames – The Proverbial Workhorse for Data Scientists, is another interesting chapter, which introduces the Datasets/DataFrames APIs added in the Spark 2.0 release.
Chapter 10, Spark with Big Data, describes the interfaces with Parquet and HBase.
Chapter 11, Machine Learning with Spark ML Pipelines, is my favorite chapter. We talk about regression, classification, clustering, and recommendation in this chapter. This is probably the largest chapter in this book. If you are stranded on a remote island and could take only one chapter with you, this should be the one!
Chapter 12, GraphX, talks about an important capability, processing graphs at scale, and also discusses interesting algorithms such as PageRank.
Like any development platform, learning to develop systems with Spark takes trial and error. Writing programs, encountering errors, and agonizing over pesky bugs are all part of the process. We assume a basic level of programming experience, in Python or Java, and experience working with operating system commands. We have kept the examples simple and to the point. In terms of resources, we do not assume any esoteric equipment for running the examples and developing the code. A normal development machine is enough.
Data scientists and data engineers who are new to Spark will benefit from this book. Our goal in developing this book is to give an in-depth, hands-on, end-to-end knowledge of Apache Spark 2. We have kept it simple and short so that one can get a good introduction in a short period of time. Folks who have an exposure to big data and analytics will recognize the patterns and the pragmas. Having said that, anyone who wants to understand distributed programming will benefit from working through the examples and reading the book.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The hallmark of a MapReduce system is this: map and reduce, the two primitives."
A block of code is set as follows:
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.11</version>
  <scope>test</scope>
</dependency>
Any command-line input or output is written as follows:
./ec2/spark-ec2 -i ~/spark-keypair.pem launch myfirstsparkcluster --resume
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "From Spark 2.0.0 onwards, they have changed the packaging, so we have to include spark-2.0.0/assembly/target/scala-2.11/jars in Add External Jars…."
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of your preferred archive extraction tool.
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Fast-Data-Processing-with-Spark-2. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
This chapter will detail some common methods to set up Spark. Spark on a single machine is excellent for testing or exploring small Datasets, but here you will also learn to use Spark's built-in deployment scripts with a dedicated cluster via Secure Shell (SSH). For cloud deployments of Spark, this chapter will look at EC2 (both traditional and Elastic MapReduce). Feel free to skip this chapter if you already have your local Spark instance installed and want to get straight to programming. The best way to navigate through installation is to use this chapter as a guide and refer to the Spark installation documentation at http://spark.apache.org/docs/latest/cluster-overview.html.
Regardless of how you are going to deploy Spark, you will want to get the latest version of Spark from https://spark.apache.org/downloads.html (Version 2.0.0 as of this writing). Spark currently releases every 90 days. Coders who want to work with the latest builds can clone the code directly from the repository at https://github.com/apache/spark. The building instructions are available at https://spark.apache.org/docs/latest/building-spark.html. Both source code and prebuilt binaries are available at this link. To interact with the Hadoop Distributed File System (HDFS), you need a Spark build that is built against the same version of Hadoop as your cluster. For Version 2.0.0 of Spark, the prebuilt packages are built against the available Hadoop Versions 2.3, 2.4, 2.6, and 2.7. If you are up for the challenge, it's recommended that you build from the source, as it gives you the flexibility of choosing the HDFS version that you want to support as well as applying patches. In this chapter, we will do both.
As you explore the latest version of Spark, an essential task is to read the release notes, especially what has been changed and deprecated. For 2.0.0, the list is slightly long and is available at https://spark.apache.org/releases/spark-release-2-0-0.html#removals-behavior-changes-and-deprecations. For example, the notes talk about where the EC2 scripts have moved to and about support for Hadoop 2.1 and earlier.
To compile the Spark source, you will need the appropriate version of Scala and the matching JDK. The Spark source tarball includes the required Scala components. The following discussion is for information only; there is no need to install Scala separately.
The Spark developers have done a good job of managing the dependencies. Refer to the https://spark.apache.org/docs/latest/building-spark.html web page for the latest information on this. The website states that:
"Building Spark using Maven requires Maven 3.3.9 or newer and Java 7+."
Scala gets pulled down as a dependency by Maven (currently Scala 2.11.8). Scala does not need to be installed separately; it is just a bundled dependency.
Just as a note, Spark 2.0.0 by default runs with Scala 2.11.8, but can be compiled to run with Scala 2.10. I have just seen e-mails in the Spark users' group on this.
This brings up another interesting point about the Spark community. The two essential mailing lists are [email protected] and [email protected]. More details about the Spark community are available at https://spark.apache.org/community.html.
One convention that would be handy is to download and install software in the /opt directory. Also, have a generic soft link to Spark that points to the current version. For example, /opt/spark points to /opt/spark-2.0.0 with the following command:
sudo ln -f -s spark-2.0.0 spark
Downloading the example code
You can download the example code files for all of the Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Later, if you upgrade, say to Spark 2.1, you can change the soft link.
However, remember to copy any configuration changes and old logs when you change to a new distribution. A more flexible way is to change the configuration directory to /etc/opt/spark and the log files to /var/log/spark/. In this way, these files will stay independent of the distribution updates. More details are available at https://spark.apache.org/docs/latest/configuration.html#overriding-configuration-directory and https://spark.apache.org/docs/latest/configuration.html#configuring-logging.
Let's download prebuilt Spark and install it. Later, we will also compile a version and build from the source. The download is straightforward. The download page is at http://spark.apache.org/downloads.html. Select the options as shown in the following screenshot:
We will use wget from the command line. You can do a direct download as well:
cd /opt
sudo wget http://www-us.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
We are downloading the prebuilt version for Apache Hadoop 2.7 from one of the possible mirrors. We could have easily downloaded other prebuilt versions as well, as shown in the following screenshot:
To uncompress it, execute the following command:
sudo tar xvf spark-2.0.0-bin-hadoop2.7.tgz
To test the installation, run the following command:
/opt/spark-2.0.0-bin-hadoop2.7/bin/run-example SparkPi 10
It will fire up the Spark stack and calculate the value of Pi. The result will be as shown in the following screenshot:
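Beyond the bundled SparkPi example, a quick interactive check is to launch the Spark shell from the same installation and run a couple of one-liners. The following is a minimal sketch (not part of the book's code bundle); it assumes the prebuilt binaries sit in /opt/spark-2.0.0-bin-hadoop2.7, and it relies on sc and spark, the SparkContext and SparkSession that the shell creates for you:
// Started with: /opt/spark-2.0.0-bin-hadoop2.7/bin/spark-shell
val data = sc.parallelize(1 to 1000)  // distribute a small range across the local cores
println(data.sum())                   // action; should print 500500.0
println(spark.version)                // should print 2.0.0, confirming the build you downloaded
If both statements come back without errors, the download and the local Spark runtime are in working order.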
Let's compile Spark on a new AWS instance. In this way, you can clearly understand what all the requirements are to get a Spark stack compiled and installed. I am using the Amazon Linux AMI, which has Java and other base stacks installed by default. As this is a book on Spark, we can safely assume that you would have the base configurations covered. We will cover the incremental installs for the Spark stack here.
The latest instructions for building from the source are available at http://spark.apache.org/docs/latest/building-spark.html.
The first order of business is to download the latest source from https://spark.apache.org/downloads.html. Select Source Code from option 2. Choose a package type and either download directly or select a mirror. The download page is shown in the following screenshot:
We can either download from the web page or use wget.
We will use wget from the first mirror shown in the preceding screenshot and download it to the opt subdirectory, as shown in the following command:
cd /opt
sudo wget http://www-eu.apache.org/dist/spark/spark-2.0.0/spark-2.0.0.tgz
sudo tar -xzf spark-2.0.0.tgz
The latest development source is in GitHub, which is available at https://github.com/apache/spark. The latest version can be checked out with a Git clone of https://github.com/apache/spark.git. This should be done only when you want to see the developments for the next version or when you are contributing to the source.
Compilation by nature is uneventful, but a lot of information gets displayed on the screen:
cd /opt/spark-2.0.0
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
sudo mvn clean package -Pyarn -Phadoop-2.7 -DskipTests
In order for the preceding snippet to work, we will need Maven installed on our system. Check by typing mvn -v. You will see the output as shown in the following screenshot:
In case Maven is not installed in your system, the commands to install the latest version of Maven are given here:
wget http://mirror.cc.columbia.edu/pub/software/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
sudo tar -xzf apache-maven-3.3.9-bin.tar.gz
sudo ln -f -s apache-maven-3.3.9 maven
export M2_HOME=/opt/maven
export PATH=${M2_HOME}/bin:${PATH}
Detailed Maven installation instructions are available at http://maven.apache.org/download.cgi#Installation. Sometimes, you will have to debug Maven using the -X switch. When I ran Maven, the Amazon Linux AMI didn't have the Java compiler! I had to install javac for the Amazon Linux AMI using the following command:
sudo yum install java-1.7.0-openjdk-devel
The compilation time varies. On my Mac, it took approximately 28 minutes. On Amazon Linux with a t2.medium instance, it took 38 minutes. The times could vary, depending on the Internet connection, which libraries are cached, and so forth.
In the end, you will see a build success message like the one shown in the following screenshot:
As an example, the compilation switches -Pyarn -Phadoop-2.7 -DskipTests are explained at https://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version. The -D switch defines a system property and -P defines a profile.
You can also compile the source code in IDEA, and then upload the built version to your cluster.
A quick way to test the installation is by calculating Pi:
/opt/spark/bin/run-example SparkPi 10
The result will be a few debug messages, and then the value of Pi, as shown in the following screenshot:
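To connect this test with the programming model discussed next, here is a rough sketch of the kind of Monte Carlo estimate a SparkPi-style program performs. It is an illustrative approximation rather than the bundled example's actual source; it can be typed into spark-shell, where sc is the shell-provided SparkContext:
val slices = 10                      // mirrors the argument we passed to run-example
val n = 100000 * slices              // total number of random points to scatter
val inside = sc.parallelize(1 to n, slices).map { _ =>
  val x = math.random * 2 - 1        // random point in the square [-1, 1] x [-1, 1]
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0   // 1 if the point lands inside the unit circle
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * inside / n)
The work of generating and testing the points is spread across the ten partitions, which is exactly the style of data parallelism discussed in the next section.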
This is a good time to talk about the basic mechanics and mechanisms of Spark. We will progressively dig deeper, but for now let's take a quick look at the top level.
Essentially, Spark provides a framework to process vast amounts of data, be it gigabytes, terabytes, or occasionally petabytes. The two main ingredients are computation and scale. The size and effectiveness of the problems we can solve depend on these two factors, that is, the ability to apply complex computations over large amounts of data in a timely fashion. If our monthly runs take 40 days, we have a problem.
The key, of course, is parallelism, massive parallelism to be exact. We can make our computational algorithm tasks work in parallel, that is, instead of doing the steps one after another, we can perform many steps at the same time, or carry out data parallelism. This means that we run the same algorithms over a partitioned Dataset in parallel. In my humble opinion, Spark is extremely effective in applying data parallelism in an elegant framework. As you will see in the rest of this book, the two components are the Resilient Distributed Dataset (RDD) and the cluster manager. The cluster manager distributes the code and manages the data that is represented in RDDs. RDDs with transformations and actions are the main programming abstractions and present parallelized collections. Behind the scenes, the cluster manager controls the distribution and interaction with RDDs, distributes code, and manages fault-tolerant execution.

As you will see later in the book, Spark has more abstractions on top of RDDs, namely DataFrames and Datasets. These layers make it extremely efficient for a data engineer or a data scientist to work on distributed data. Spark works with three types of cluster managers: standalone, Apache Mesos, and Hadoop YARN. The Spark page at http://spark.apache.org/docs/latest/cluster-overview.html has a lot more details on this. I just gave you a quick introduction here.
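To ground these terms, here is a short illustrative sketch (not code from the book's bundle) that can be run in spark-shell, where sc and spark already exist. It builds an RDD, records a transformation lazily, triggers it with an action, and then lifts the same data into a DataFrame:
val words = sc.parallelize(Seq("spark", "rdd", "dataframe", "dataset"))
val lengths = words.map(_.length)        // transformation: only recorded, nothing runs yet
println(lengths.reduce(_ + _))           // action: the job is actually scheduled and executed

// The same data through the higher-level DataFrame abstraction
import spark.implicits._
val df = words.toDF("word")
df.filter($"word".startsWith("data")).show()   // prints the rows "dataframe" and "dataset"
The choice of cluster manager does not change this code; it only changes how and where the tasks run.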
If you have installed Hadoop 2.0, it is recommended that you install Spark on YARN. If you have installed Hadoop 1.0, the standalone version is recommended. If you want to try Mesos, you can choose to install Spark on Mesos. It is not recommended to install both YARN and Mesos.
Refer to the following diagram: