Apache Spark has seen unprecedented growth in its adoption over the last few years, mainly because of its speed, versatility, and real-time data processing capabilities. It has quickly become the tool of choice for many Big Data professionals looking to find quick insights from large chunks of data. This book introduces you to the Apache Spark framework, and familiarizes you with all the latest features and capabilities introduced in Spark 2.
Starting with a detailed introduction to Spark's architecture and the installation procedure, this book covers everything you need to know about the Spark framework in the most practical manner. You will learn how to perform basic ETL activities using Spark, and work with different components of Spark such as Spark SQL, as well as the Dataset and DataFrame APIs for manipulating your data. Then, you will perform machine learning using Spark MLlib, as well as streaming analytics and graph processing using the Spark Streaming and GraphX modules respectively. The book also places special emphasis on deploying your Spark applications and how they can be operated in clustered mode.
During the course of the book, you will come across implementations of different real-world use-cases and examples, giving you the hands-on knowledge you need to use Apache Spark in the best possible manner.
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: March 2017
Production reference: 1240317
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78588-513-6
www.packtpub.com
Author
Muhammad Asif Abbasi
Copy Editor
Safis Editing
Reviewer
Prashant Verma
Project Coordinator
Nidhi Joshi
Commissioning Editor
Veena Pagare
Proofreader
Safis Editing
Acquisition Editor
Tushar Gupta
Indexer
Tejal Daruwale Soni
Content Development Editor
Mayur Pawanikar
Graphics
Tania Dutta
Technical Editor
Karan Thakkar
Production Coordinator
Nilesh Mohite
Muhammad Asif Abbasi has worked in the industry for over 15 years, in a variety of roles from engineering solutions to selling solutions and everything in between. Asif is currently working with SAS, a market leader in analytics solutions, as a Principal Business Solutions Manager for the Global Technologies Practice. Based in London, Asif has vast experience in consulting for major organizations and industries across the globe, and in running proof-of-concepts across various industries including, but not limited to, telecommunications, manufacturing, retail, finance, services, utilities, and government. Asif is an Oracle Certified Java EE 5 Enterprise Architect, Teradata Certified Master, PMP, and Hortonworks Certified Hadoop developer and administrator. Asif also holds a Master's degree in Computer Science and Business Administration.
Prashant Verma started his IT career in 2011 as a Java developer at Ericsson, working in the telecom domain. After a couple of years of Java EE experience, he moved into the Big Data domain, and has worked on almost all of the popular big data technologies, such as Hadoop, Spark, Flume, Mongo, Cassandra, and so on. He has also played with Scala. He currently works with QA Infotech as Lead Data Engineer, solving e-learning problems using analytics and machine learning.
Prashant has also worked on Apache Spark for Java Developers, Packt, as a technical reviewer.
I want to thank Packt Publishing for giving me the chance to review the book as well as my employer and my family for their patience while I was busy working on this book.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review at the website where you acquired this product.
If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
This book will cover the technical aspects of Apache Spark 2.0, one of the fastest growing open-source projects. In order to understand what Apache Spark is, we will quickly recap the history of Big Data, and what has made Apache Spark popular. Irrespective of your expertise level, we suggest going through this introduction as it will help set the context of the book.
Before looking at present-day Spark, it is worth understanding what problems Spark intends to solve, especially around data movement. Without knowing the background, we will not be able to predict the future.
"You have to learn the past to predict the future."
Late 1990s: The world was a much simpler place to live, with proprietary databases being the sole choice of consumers. Data was growing at quite an amazing pace, and some of the biggest databases boasted of maintaining datasets in excess of a Terabyte.
Early 2000s: The dotcom bubble happened, which meant companies started going online, with the likes of Amazon and eBay leading the revolution. Some of the dotcom start-ups failed, while others succeeded. The commonality among these business models was a razor-sharp focus on page views, and everything became focused on the number of users. A lot of marketing budget was spent on getting people online, which meant more customer behavior data in the form of weblogs. Since the de facto storage was an MPP database, and the value of such weblogs was unknown, more often than not these weblogs were stuffed into archive storage or deleted.
2002: In search of a better search engine, Doug Cutting and Mike Cafarella started work on an open source project called Nutch, the objective of which was to be a web-scale crawler. Web scale was defined as billions of web pages; Doug and Mike were able to index hundreds of millions of web pages, running on a handful of nodes that had a knack for falling over.
2004-2006: Google published papers on the Google File System (GFS) (2003) and MapReduce (2004), demonstrating that the backbone of their search engine was resilient to failures and almost linearly scalable. Doug Cutting took particular interest in this development, as he could see that the GFS and MapReduce papers directly addressed Nutch's shortcomings. Doug Cutting added a MapReduce implementation to Nutch, which ran on 20 nodes and was much easier to program. Of course, we are talking in comparative terms here.
2006-2008: Cutting went to work with Yahoo! in 2006, which had lost the search crown to Google and was equally impressed by the GFS and MapReduce papers. The storage and processing parts of Nutch were spun out to form a separate project named Hadoop under the ASF, whereas the Nutch web crawler remained a separate project. Hadoop became a top-level Apache project in 2008. On February 19, 2008, Yahoo! announced that its search index ran on a 10,000-node Hadoop cluster (truly an amazing feat).
We haven't forgotten about the proprietary database vendors. The majority of them didn't expect Hadoop to change anything for them, as database vendors typically focused on relational data, which was smaller in volume but higher in value. I was talking to the CTO of a major database vendor (who will remain unnamed) about this new and upcoming popular elephant (Hadoop, of course! Thanks to Doug Cutting's son for choosing a sane name; he could have chosen anything else, and you know how kids name things these days...). The CTO was quite adamant that the real value was in relational data, which was the bread and butter of his company, and that unstructured data, despite its huge volumes, had less business value. This was essentially an 80-20 rule for data: from a size perspective, unstructured data was four times the size of structured data (80-20), whereas the same structured data had four times the value of unstructured data. I would say that the relational database vendors massively underestimated the value of unstructured data back then.
Anyway, back to Hadoop: after the announcement by Yahoo!, a lot of companies wanted to get a piece of the action. They realized something big was about to happen in the data space. Lots of interesting use cases started to appear in the Hadoop space, and MapReduce, the de facto compute engine on Hadoop, wasn't able to meet all those expectations.
The MapReduce Conundrum: The original Hadoop comprised primarily HDFS and MapReduce as a compute engine. The original use case of web-scale search meant that the architecture was primarily aimed at long-running batch jobs (typically single-pass jobs without iterations), like the original use case of indexing web pages. The core requirements of such a framework were scalability and fault tolerance, as you don't want to restart a job that had been running for 3 days after it had completed 95% of its work. Furthermore, the objective of MapReduce was to target acyclic data flows.
A typical MapReduce program is composed of a Map() operation and optionally a Reduce() operation, and any workload had to be converted to the MapReduce paradigm before you could get the benefit of Hadoop. Not only that, but the majority of other open source projects on Hadoop also used MapReduce as the way to perform computation. For example, Hive and Pig Latin both generated MapReduce jobs to operate on Big Data sets. The problem with the architecture of MapReduce was that the job output data from each step had to be stored in a distributed file system before the next step could begin. This meant that each iteration had to reload the data from disk, incurring a significant performance penalty. Furthermore, while typically designed for batch jobs, Hadoop has often been used for exploratory analysis through SQL-like interfaces such as Pig and Hive. Each query incurs significant latency due to the initial MapReduce job setup and the initial data read, which often means increased wait times for the users.
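To make the paradigm concrete, here is a toy word count expressed with plain Scala collections; this is only a sketch of the Map()/Reduce() programming model (the input lines are made up for illustration), not actual Hadoop code:

// Toy sketch of the MapReduce paradigm using plain Scala collections,
// not actual Hadoop code; the input lines are illustrative.
val lines = Seq("spark is fast", "hadoop is batch")

// Map(): emit a (word, 1) pair for every word
val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Shuffle and Reduce(): group the pairs by key and sum the counts
val counts = mapped.groupBy(_._1).map { case (word, pairs) =>
  (word, pairs.map(_._2).sum)
}
// counts contains: spark -> 1, is -> 2, fast -> 1, hadoop -> 1, batch -> 1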
Beginning of Spark: In June 2011, Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica published a paper in which they proposed a framework that could outperform Hadoop by 10 times in iterative machine learning jobs. The framework is now known as Spark. The paper aimed to solve two of the major inadequacies of the Hadoop/MapReduce framework: iterative and interactive analysis.
The idea that you could plug the gaps of MapReduce from an iterative and interactive analysis point of view, while maintaining its scalability and resilience, meant that the platform could be used across a wide variety of use cases.
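As a minimal sketch of why this matters (assuming a running SparkContext named sc; the input path and computation are illustrative), Spark lets you cache a working dataset in memory once and reuse it across iterations, instead of reloading it from disk on every pass as MapReduce does:

// Minimal sketch, assuming a SparkContext `sc`; the path is illustrative.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.split(",").map(_.toDouble))
  .cache() // parsed once, then kept in memory across iterations

var result = 0.0
for (i <- 1 to 10) {
  // each pass reuses the cached RDD; no disk reload between iterations
  result = points.map(_.sum).reduce(_ + _)
}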
This created huge interest in Spark, particularly from communities of users who had become frustrated with the relatively slow response from MapReduce, particularly for interactive query requests. In 2015, Spark became the most active open source project in Big Data, and gained tons of new features and improvements along the way. The community grew almost 300%, with attendance at Spark Summit increasing from just 1,100 in 2014 to almost 4,000 in 2015. The number of meetup groups grew by a factor of 4, and the contributors to the project increased from just over 100 in 2013 to 600 in 2015.
Spark is today the hottest technology for big data analytics. Numerous benchmarks have confirmed that it is the fastest engine out there. If you go to any Big Data conference, be it Strata + Hadoop World or Hadoop Summit, Spark is considered to be the technology of the future.
Stack Overflow released the results of its 2016 developer survey (http://bit.ly/1MpdIlU), with responses from 56,033 engineers across 173 countries. Some of the facts related to Spark were pretty interesting: Spark was the leader in both the Trending Tech and the Top-Paying Tech categories.
In addition to plugging MapReduce's deficiencies, Spark provides three major things that make it really powerful:
We hope that this book gives you the foundation of understanding Spark as a framework, and helps you take the next step towards using it for your implementations.
Chapter 1, Architecture and Installation, will help you get started on the journey of learning Spark. This will walk you through key architectural components before helping you write your first Spark application.
Chapter 2, Transformations and Actions with Spark RDDs, will help you understand the basic constructs such as Spark RDDs, the difference between transformations, actions, and lazy evaluation, and how you can share data.
Chapter 3, ETL with Spark, will help you with loading data, transforming it, and saving it back to external storage systems.
Chapter 4, Spark SQL, will help you understand the intricacies of the DataFrame and Dataset APIs before a discussion of the under-the-hood power of the Catalyst optimizer and how it ensures that your client applications remain performant irrespective of your client API.
Chapter 5, Spark Streaming, will help you understand the architecture of Spark Streaming, sliding window operations, caching, persistence, check-pointing, and fault tolerance, before discussing structured streaming and how it revolutionizes stream processing.
Chapter 6, Machine Learning with Spark, is where the rubber hits the road, and where you understand the basics of machine learning before looking at the various types of machine learning, and feature engineering utility functions, and finally looking at the algorithms provided by Spark MLlib API.
Chapter 7, GraphX, will help you understand the importance of graphs in today's world, before introducing terminology such as vertex, edge, and motif. We will then look at some of the graph algorithms in GraphX and also talk about GraphFrames.
Chapter 8, Operating in Clustered mode, helps the user understand how Spark can be deployed as standalone, or with YARN or Mesos.
Chapter 9, Building a Recommendation system, will help the user understand the intricacies of a recommendation system before building one with an ALS model.
Chapter 10, Predicting Customer Churn, will help the user understand the importance of churn prediction before using a random forest classifier to predict churn on a telecommunication dataset.
Appendix, There's More with Spark, is where we cover the topics around performance tuning, sizing your executors, and security before walking the user through setting up PySpark with Jupyter notebook.
You will need Spark 2.0, which you can download from the Apache Spark website. We have used a few different configurations, but you can essentially run most of these examples inside a virtual machine with 4-8 GB of RAM and 10 GB of available disk space.
This book is for people who have heard of Spark, and want to understand more. This is a beginner-level book for people who want to get hands-on experience with the fastest growing open source project. This book provides ample reading and links to exciting YouTube videos for additional exploration of the topics.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can include other contexts through the use of the include directive."
A block of code is set as follows:
[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)

Any command-line input or output is written as follows:

# cp /usr/src/asterisk-addons/configs/cdr_mysql.conf.sample /etc/asterisk/cdr_mysql.conf

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Clicking the Next button moves you to the next screen."
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of WinRAR/7-Zip for Windows, Zipeg/iZip/UnRarX for Mac, or 7-Zip/PeaZip for Linux.
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Apache-Spark-2. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
This chapter provides the big picture of Spark, including its architecture. You will be taken from the higher-level details of the framework through to installing Spark and writing your very first program on Spark.
We'll cover the following core topics in this chapter. If you are already familiar with these topics, please feel free to jump to the next chapter on Spark RDDs, Transformations and Actions with Spark RDDs:
Apache Spark architecture overview:
Apache Spark is an open source distributed data processing engine for clusters, which provides a unified programming model across different types of data processing workloads and platforms.
Figure 1.1: Apache Spark Unified Stack
At the core of the project is a set of APIs for Streaming, SQL, Machine Learning (ML), and Graph processing. The Spark community supports the Spark project by providing connectors to various open source and proprietary data storage engines. Spark also has the ability to run on a variety of cluster managers like YARN and Mesos, in addition to the Standalone cluster manager which comes bundled with Spark for standalone installation. This is a marked difference from the Hadoop ecosystem, where Hadoop provides a complete platform in terms of storage formats, compute engine, cluster manager, and so on. Spark has been designed with the single goal of being an optimized compute engine. This therefore allows you to run Spark on a variety of cluster managers, including running it standalone or plugging it into YARN and Mesos. Similarly, Spark does not have its own storage, but it can connect to a wide number of storage engines.
Currently Spark APIs are available in some of the most common languages including Scala, Java, Python, and R.
Let's start by going through the various APIs available in Spark.
At the heart of the Spark architecture is the core engine of Spark, commonly referred to as spark-core, which forms the foundation of this powerful architecture. Spark-core provides services such as managing the memory pool, scheduling of tasks on the cluster (Spark works as a Massively Parallel Processing (MPP) system when deployed in cluster mode), recovering failed jobs, and providing support to work with a wide variety of storage systems such as HDFS, S3, and so on.
Spark-Core provides a full scheduling component for standalone scheduling; the code is available at https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/scheduler.
Spark-Core abstracts users of the APIs from the lower-level technicalities of working on a cluster. Spark-Core also provides the RDD APIs, which are the basis of the other higher-level APIs and the core programming elements on Spark. We'll talk about the RDD, DataFrame, and Dataset APIs later in this book.
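As a first taste (a minimal sketch, assuming the SparkContext named sc that spark-shell creates for you), here is how a local collection becomes a distributed dataset, and how a lazy transformation differs from an action:

// Minimal RDD sketch, assuming the SparkContext `sc` provided by spark-shell.
val numbers = sc.parallelize(1 to 100)  // distribute a local collection as an RDD
val evens = numbers.filter(_ % 2 == 0)  // transformation: lazy, nothing runs yet
println(evens.count())                  // action: triggers the job, prints 50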
MPP systems generally use a large number of processors (on separate hardware or virtualized) to perform a set of operations in parallel. The objective of MPP systems is to divide work into smaller tasks and run them in parallel to increase throughput.
Spark SQL is one of the most popular modules of Spark, designed for structured and semi-structured data processing. Spark SQL allows users to query structured data inside Spark programs using SQL or the DataFrame and Dataset APIs, which are usable in Java, Scala, Python, and R. Because the DataFrame API provides a uniform way to access a variety of data sources, including Hive datasets, Avro, Parquet, ORC, JSON, and JDBC, users should be able to connect to any data source the same way, and join across these multiple sources together. The use of the Hive metastore by Spark SQL gives the user full compatibility with existing Hive data, queries, and UDFs. Users can seamlessly run their current Hive workload without modification on Spark.
Spark SQL can also be accessed through the spark-sql shell, and existing business tools can connect to it via standard JDBC and ODBC interfaces.
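As a minimal sketch of both access paths (assuming a SparkSession named spark; the people.json file and its name/age columns are illustrative):

// Minimal Spark SQL sketch, assuming a SparkSession `spark`; the file
// people.json and its name/age columns are illustrative.
val people = spark.read.json("people.json") // build a DataFrame from JSON
people.createOrReplaceTempView("people")    // register it for SQL queries

// The same query expressed via the DataFrame API and via SQL:
people.filter(people("age") > 21).show()
spark.sql("SELECT name, age FROM people WHERE age > 21").show()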
More than 50% of users consider Spark Streaming to be the most important component of Apache Spark. Spark Streaming is a module of Spark that enables the processing of data arriving in passive or live streams. Passive streams can be from static files that you choose to stream to your Spark cluster. This can include all sorts of data, ranging from web server logs and social-media activity (following a particular Twitter hashtag) to sensor data from your car/phone/home, and so on. Spark Streaming provides a bunch of APIs that help you create streaming applications in a way similar to how you would create a batch job, with minor tweaks.
As of Spark 2.0, the philosophy behind Spark Streaming is that you should not have to reason about streaming any differently from building a data application against a traditional data source. Data from streaming sources is continuously appended to existing tables, and all the operations are run on the new window. A single API lets users create batch or streaming applications, with the only difference being that a table in a batch application is finite, while the table for a streaming job is considered to be infinite.
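A minimal sketch of that model (assuming a SparkSession named spark; the socket source on localhost:9999 is just an illustrative input): the word count below is written exactly as it would be on a batch DataFrame, but runs continuously over the infinite input table:

// Minimal Structured Streaming sketch, assuming a SparkSession `spark`;
// the socket host/port are illustrative.
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same word count you would write over a static DataFrame:
val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete") // re-emit the full counts as the infinite table grows
  .format("console")
  .start()
query.awaitTermination()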
MLlib is the machine learning library for Spark. If you remember from the preface, iterative algorithms were one of the key drivers behind the creation of Spark, and most machine learning algorithms perform iterative processing in one way or another.
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data.
Spark MLlib allows developers to use the Spark API to build machine learning algorithms by tapping into a number of data sources, including HDFS, HBase, Cassandra, and so on. Spark is super fast with iterative computing, and for such workloads it can perform up to 100 times better than MapReduce. Spark MLlib contains a number of algorithms and utilities including, but not limited to, logistic regression, Support Vector Machines (SVM), classification and regression trees, random forests and gradient-boosted trees, recommendation via ALS, clustering via K-Means, Principal Component Analysis (PCA), and many others.
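As a flavor of the DataFrame-based spark.ml API (a minimal sketch, assuming a SparkSession named spark; the LIBSVM input path is illustrative), here is how an iterative algorithm such as logistic regression is trained:

// Minimal MLlib sketch, assuming a SparkSession `spark`; the input path
// is illustrative and expects LIBSVM-formatted label/feature data.
import org.apache.spark.ml.classification.LogisticRegression

val training = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")

val lr = new LogisticRegression()
  .setMaxIter(10)    // iterative optimization, Spark's sweet spot
  .setRegParam(0.01) // regularization strength

val model = lr.fit(training)
println(s"Coefficients: ${model.coefficients}")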
GraphX is an API designed to manipulate graphs. The graphs can range from a graph of web pages linked to each other via hyperlinks to a social network graph on Twitter connected by followers or retweets, or a Facebook friends list.
Graph theory is a study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph is made up of vertices (nodes/points), which are connected by edges (arcs/lines).
--Wikipedia.org

Spark provides a built-in library for graph manipulation, which allows developers to seamlessly work with both graphs and collections by combining ETL, exploratory analysis, and iterative graph computation in a single workflow. The ability to combine transformations, machine learning, and graph computation in a single system at high speed makes Spark one of the most flexible and powerful frameworks out there. The ability of Spark to retain the speed of computation along with the standard features of fault tolerance makes it especially handy for big data problems. Spark GraphX has a number of built-in graph algorithms, including PageRank, connected components, label propagation, SVD++, and triangle counting.
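A minimal GraphX sketch (assuming the SparkContext sc from spark-shell; the vertices and edges below are made up for illustration) that builds a tiny follower graph and runs PageRank on it:

// Minimal GraphX sketch, assuming SparkContext `sc`; the vertices and
// edges are illustrative. Builds a tiny follower graph and runs PageRank.
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))

val graph = Graph(vertices, edges)
val ranks = graph.pageRank(0.001).vertices // run PageRank to the given tolerance
ranks.collect().foreach(println)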
Apache Spark runs on both Windows and Unix-like systems (for example, Linux and Mac OS). If you are starting with Spark, you can run it locally on a single machine. Spark requires Java 7+, Python 2.6+, and R 3.1+. If you would like to use the Scala API (the language in which Spark was written), you need at least Scala version 2.10.x.
Spark can also run in clustered mode, both by itself and on several existing cluster managers. You can deploy Spark on any of the following cluster managers, and the list is growing every day due to active community support: the Standalone cluster manager that ships with Spark, Hadoop YARN, and Apache Mesos.
As mentioned in the earlier pages, while Spark can be deployed on a cluster, you can also run it in local mode on a single machine.
In this chapter, we are going to download and install Apache Spark on a Linux machine and run it in local mode. Before we do anything, we need to download Apache Spark from the Apache Spark project's downloads page at https://spark.apache.org/downloads.html.
If you are using Windows, please remember to use a pathname without any spaces.
The TAR utility is generally used to unpack TAR files. If you don't have TAR, you might want to download it from your repository or use 7-Zip, which is also one of my favorite utilities.
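For example, assuming you downloaded the Spark 2.0 build for Hadoop 2.7 (the exact filename depends on the version and package you picked), you would extract it and change into the resulting folder as follows:

tar -xzf spark-2.0.0-bin-hadoop2.7.tgz
cd spark-2.0.0-bin-hadoop2.7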
The bin folder contains a number of executable shell scripts such as pyspark, sparkR, spark-shell, spark-sql, and spark-submit. All of these executables are used to interact with Spark, and we will be using most if not all of these.
If you look at my particular download of Spark, you will find a folder called yarn. The example below is a Spark build for Hadoop version 2.7, which comes with YARN as a cluster manager.

Figure 1.2: Spark folder contents
We'll start by running the Spark shell, which is a very simple way to get started with Spark and learn the API. The Spark shell is a Scala Read-Evaluate-Print-Loop (REPL), one of the few REPLs available with Spark, which also ships REPLs for Python and R.
You should change to the Spark download directory and run the Spark shell as follows:

./bin/spark-shell
Figure 1.3: Starting Spark shell
We now have Spark running in local mode. We'll discuss the details of the deployment architecture a bit later in this chapter, but for now let's kick-start some basic Spark programming to appreciate the power and simplicity of the Spark framework.
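For instance, a first interaction might look like this (a minimal sketch; README.md ships in the Spark download directory, and sc is the SparkContext the shell creates for you):

// Inside spark-shell, where `sc` is pre-created for you.
val textFile = sc.textFile("README.md")                // RDD of the file's lines
println(textFile.count())                              // total number of lines
println(textFile.filter(_.contains("Spark")).count())  // lines mentioning "Spark"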