Learning PySpark

Tomasz Drabas

Description

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark.
You will get familiar with the modules available in PySpark. You will learn how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You will also get a thorough overview of the machine learning capabilities of PySpark using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command.
By the end of this book, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.




Table of Contents

Learning PySpark
Credits
Foreword
About the Authors
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Understanding Spark
What is Apache Spark?
Spark Jobs and APIs
Execution process
Resilient Distributed Dataset
DataFrames
Datasets
Catalyst Optimizer
Project Tungsten
Spark 2.0 architecture
Unifying Datasets and DataFrames
Introducing SparkSession
Tungsten phase 2
Structured Streaming
Continuous applications
Summary
2. Resilient Distributed Datasets
Internal workings of an RDD
Creating RDDs
Schema
Reading from files
Lambda expressions
Global versus local scope
Transformations
The .map(...) transformation
The .filter(...) transformation
The .flatMap(...) transformation
The .distinct(...) transformation
The .sample(...) transformation
The .leftOuterJoin(...) transformation
The .repartition(...) transformation
Actions
The .take(...) method
The .collect(...) method
The .reduce(...) method
The .count(...) method
The .saveAsTextFile(...) method
The .foreach(...) method
Summary
3. DataFrames
Python to RDD communications
Catalyst Optimizer refresh
Speeding up PySpark with DataFrames
Creating DataFrames
Generating our own JSON data
Creating a DataFrame
Creating a temporary table
Simple DataFrame queries
DataFrame API query
SQL query
Interoperating with RDDs
Inferring the schema using reflection
Programmatically specifying the schema
Querying with the DataFrame API
Number of rows
Running filter statements
Querying with SQL
Number of rows
Running filter statements using the where clause
DataFrame scenario – on-time flight performance
Preparing the source datasets
Joining flight performance and airports
Visualizing our flight-performance data
Spark Dataset API
Summary
4. Prepare Data for Modeling
Checking for duplicates, missing observations, and outliers
Duplicates
Missing observations
Outliers
Getting familiar with your data
Descriptive statistics
Correlations
Visualization
Histograms
Interactions between features
Summary
5. Introducing MLlib
Overview of the package
Loading and transforming the data
Getting to know your data
Descriptive statistics
Correlations
Statistical testing
Creating the final dataset
Creating an RDD of LabeledPoints
Splitting into training and testing
Predicting infant survival
Logistic regression in MLlib
Selecting only the most predictable features
Random forest in MLlib
Summary
6. Introducing the ML Package
Overview of the package
Transformer
Estimators
Classification
Regression
Clustering
Pipeline
Predicting the chances of infant survival with ML
Loading the data
Creating transformers
Creating an estimator
Creating a pipeline
Fitting the model
Evaluating the performance of the model
Saving the model
Parameter hyper-tuning
Grid search
Train-validation splitting
Other features of PySpark ML in action
Feature extraction
NLP-related feature extractors
Discretizing continuous variables
Standardizing continuous variables
Classification
Clustering
Finding clusters in the births dataset
Topic mining
Regression
Summary
7. GraphFrames
Introducing GraphFrames
Installing GraphFrames
Creating a library
Preparing your flights dataset
Building the graph
Executing simple queries
Determining the number of airports and trips
Determining the longest delay in this dataset
Determining the number of delayed versus on-time/early flights
What flights departing Seattle are most likely to have significant delays?
What states tend to have significant delays departing from Seattle?
Understanding vertex degrees
Determining the top transfer airports
Understanding motifs
Determining airport ranking using PageRank
Determining the most popular non-stop flights
Using Breadth-First Search
Visualizing flights using D3
Summary
8. TensorFrames
What is Deep Learning?
The need for neural networks and Deep Learning
What is feature engineering?
Bridging the data and algorithm
What is TensorFlow?
Installing Pip
Installing TensorFlow
Matrix multiplication using constants
Matrix multiplication using placeholders
Running the model
Running another model
Discussion
Introducing TensorFrames
TensorFrames – quick start
Configuration and setup
Launching a Spark cluster
Creating a TensorFrames library
Installing TensorFlow on your cluster
Using TensorFlow to add a constant to an existing column
Executing the Tensor graph
Blockwise reducing operations example
Building a DataFrame of vectors
Analysing the DataFrame
Computing elementwise sum and min of all vectors
Summary
9. Polyglot Persistence with Blaze
Installing Blaze
Polyglot persistence
Abstracting data
Working with NumPy arrays
Working with pandas' DataFrame
Working with files
Working with databases
Interacting with relational databases
Interacting with the MongoDB database
Data operations
Accessing columns
Symbolic transformations
Operations on columns
Reducing data
Joins
Summary
10. Structured Streaming
What is Spark Streaming?
Why do we need Spark Streaming?
What is the Spark Streaming application data flow?
Simple streaming application using DStreams
A quick primer on global aggregations
Introducing Structured Streaming
Summary
11. Packaging Spark Applications
The spark-submit command
Command line parameters
Deploying the app programmatically
Configuring your SparkSession
Creating SparkSession
Modularizing code
Structure of the module
Calculating the distance between two points
Converting distance units
Building an egg
User defined functions in Spark
Submitting a job
Monitoring execution
Databricks Jobs
Summary
Index

Learning PySpark

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2017

Production reference: 1220217

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78646-370-8

www.packtpub.com

Credits

Authors

Tomasz Drabas

Denny Lee

Reviewer

Holden Karau

Commissioning Editor

Amey Varangaonkar

Acquisition Editor

Prachi Bisht

Content Development Editor

Amrita Noronha

Technical Editor

Akash Patel

Copy Editor

Safis Editing

Project Coordinator

Shweta H Birwatkar

Proofreader

Safis Editing

Indexer

Aishwarya Gangawane

Graphics

Disha Haria

Production Coordinator

Aparna Bhagat

Cover Work

Aparna Bhagat

Foreword

Thank you for choosing this book to start your PySpark adventures; I hope you are as excited as I am. When Denny Lee first told me about this new book I was delighted. One of the most important things that makes Apache Spark such a wonderful platform is that it supports both the Java/Scala/JVM world and the Python (and more recently R) world. Many of the previous books on Spark have focused either on all of the core languages or primarily on the JVM languages, so it's great to see PySpark get its chance to shine with a dedicated book from such experienced Spark educators. By supporting both of these different worlds, we are able to work together more effectively as Data Scientists and Data Engineers, while stealing the best ideas from each other's communities.

It has been a privilege to have the opportunity to review early versions of this book, which has only increased my excitement for the project. I've had the privilege of being at some of the same conferences and meetups and watching the authors introduce new concepts in the world of Spark to a variety of audiences (from first-timers to old hands), and they've done a great job distilling their experience for this book. The experience of the authors shines through in everything from their explanations to the topics covered. Beyond simply introducing PySpark, they have also taken the time to look at up-and-coming packages from the community, such as GraphFrames and TensorFrames.

I think the community is one of those often-overlooked components when deciding what tools to use, and Python has a great community; I'm looking forward to you joining the Python Spark community. So, enjoy your adventure; I know you are in good hands with Denny Lee and Tomek Drabas. I truly believe that by having a diverse community of Spark users we will be able to make better tools useful for everyone, so I hope to see you around at one of the conferences, meetups, or mailing lists soon :)

Holden Karau

P.S.

I owe Denny a beer; if you want to buy him a Bud Light lime (or lime-a-rita) for me I'd be much obliged (although he might not be quite as amused as I am).

About the Authors

Tomasz Drabas is a Data Scientist working for Microsoft and currently residing in the Seattle area. He has over 13 years of experience in data analytics and data science in numerous fields (advanced technology, airlines, telecommunications, finance, and consulting), gained while working on three continents: Europe, Australia, and North America. While in Australia, Tomasz worked on his PhD in Operations Research, with a focus on choice modeling and revenue management applications in the airline industry.

At Microsoft, Tomasz works with big data on a daily basis, solving machine learning problems such as anomaly detection, churn prediction, and pattern recognition using Spark.

Tomasz has also authored the Practical Data Analysis Cookbook published by Packt Publishing in 2016.

I would like to thank my family: Rachel, Skye, and Albert—you are the love of my life and I cherish every day I spend with you! Thank you for always standing by me and for encouraging me to push my career goals further and further. Also, to my family and my in-laws for putting up with me (in general).

There are many more people that have influenced me over the years that I would have to write another book to thank them all. You know who you are and I want to thank you from the bottom of my heart!

However, I would not have gotten through my PhD if it was not for Czesia Wieruszewska; Czesiu - dziękuję za Twoją pomoc bez której nie rozpocząłbym mojej podróży po Antypodach. Along with Krzys Krzysztoszek, you guys have always believed in me! Thank you!

Denny Lee is a Principal Program Manager at Microsoft for the Azure DocumentDB team—Microsoft's blazing fast, planet-scale managed document store service.  He is a hands-on distributed systems and data science engineer with more than 18 years of experience developing Internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premise and cloud environments.

He has extensive experience building greenfield teams as well as acting as a turnaround/change catalyst. Prior to joining the Azure DocumentDB team, Denny worked as a Technology Evangelist at Databricks; he has been working with Apache Spark since 0.5. He was also the Senior Director of Data Sciences Engineering at Concur, and was on the incubation team that built Microsoft's Hadoop on Windows and Azure service (currently known as HDInsight). Denny also has a Master's in Biomedical Informatics from Oregon Health and Sciences University and has architected and implemented powerful data solutions for enterprise healthcare customers for the last 15 years.

I would like to thank my wonderful spouse, Hua-Ping, and my awesome daughters, Isabella and Samantha. You are the ones who keep me grounded and help me reach for the stars!

About the Reviewer

Holden Karau is a transgender Canadian and an active open source contributor. When not in San Francisco working as a software development engineer at IBM's Spark Technology Center, Holden talks internationally about Spark and holds office hours at coffee shops at home and abroad. Holden is a co-author of numerous books on Spark, including High Performance Spark (which she believes is the gift of the season for those with expense accounts) and Learning Spark. Holden is a Spark committer, specializing in PySpark and Machine Learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of software she enjoys playing with fire, welding, scooters, poutine, and dancing.

www.PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1786463709.

If you'd like to join our team of regular reviewers, you can email us at <[email protected]>. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Preface

It is estimated that in 2013 the whole world produced around 4.4 zettabytes of data; that is, 4.4 billion terabytes! By 2020, we (as the human race) are expected to produce ten times that. With data getting larger literally by the second, and given the growing appetite for making sense out of it, in 2004 Google employees Jeffrey Dean and Sanjay Ghemawat published the seminal paper MapReduce: Simplified Data Processing on Large Clusters. Since then, technologies leveraging the concept started growing very quickly with Apache Hadoop initially being the most popular. It ultimately created a Hadoop ecosystem that included abstraction layers such as Pig, Hive, and Mahout – all leveraging this simple concept of map and reduce.

However, even though capable of chewing through petabytes of data daily, MapReduce is a fairly restricted programming framework. Also, most of the tasks require reading from and writing to disk. Seeing these drawbacks, in 2009 Matei Zaharia started working on Spark as part of his PhD. Spark was first released in 2012. Even though Spark is based on the same MapReduce concept, its advanced ways of dealing with data and organizing tasks make it up to 100x faster than Hadoop (for in-memory computations).

In this book, we will guide you through the latest incarnation of Apache Spark using Python. We will show you how to read structured and unstructured data, how to use some fundamental data types available in PySpark, how to build machine learning models, operate on graphs, read streaming data, and deploy your models in the cloud. Each chapter tackles a different problem, and by the end of the book we hope you will be knowledgeable enough to solve other problems we did not have space to cover here.

What this book covers

Chapter 1, Understanding Spark, provides an introduction to the Spark world with an overview of the technology and the concepts behind how Spark jobs and APIs are organized.

Chapter 2, Resilient Distributed Datasets, covers RDDs, the fundamental, schema-less data structure available in PySpark.

Chapter 3, DataFrames, provides a detailed overview of a data structure that bridges the gap between Scala and Python in terms of efficiency.

Chapter 4, Prepare Data for Modeling, guides the reader through the process of cleaning up and transforming data in the Spark environment.

Chapter 5, Introducing MLlib, introduces the machine learning library that works on RDDs and reviews the most useful machine learning models.

Chapter 6, Introducing the ML Package, covers the current mainstream machine learning library and provides an overview of all the models currently available.

Chapter 7, GraphFrames, will guide you through the new structure that makes solving problems with graphs easy.

Chapter 8, TensorFrames, introduces the bridge between Spark and the Deep Learning world of TensorFlow.

Chapter 9, Polyglot Persistence with Blaze, describes how Blaze can be paired with Spark for even easier abstraction of data from various sources.

Chapter 10, Structured Streaming, provides an overview of streaming tools available in PySpark.

Chapter 11, Packaging Spark Applications, will guide you through the steps of modularizing your code and submitting it to Spark for execution through the command-line interface.

For more information, we have provided two bonus chapters as follows:

Installing Spark: https://www.packtpub.com/sites/default/files/downloads/InstallingSpark.pdf

Free Spark Cloud Offering: https://www.packtpub.com/sites/default/files/downloads/FreeSparkCloudOffering.pdf

What you need for this book

For this book you need a personal computer (either a Windows machine, a Mac, or a Linux machine). To run Apache Spark, you will need Java 7+ and an installed and configured Python 2.6+ or 3.4+ environment; we use the Anaconda distribution of Python in version 3.5, which can be downloaded from https://www.continuum.io/downloads.

The Python modules we use throughout the book come preinstalled with Anaconda. We also use GraphFrames and TensorFrames, which can be loaded dynamically while starting a Spark instance; to load these you just need an Internet connection. It is fine if some of those modules are not currently installed on your machine – we will guide you through the installation process.

Who this book is for

This book is for everyone who wants to learn the fastest-growing technology in big data: Apache Spark. We hope that even the more advanced practitioners from the field of data science can find some of the examples refreshing and the more advanced topics interesting.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

All the code is also available on GitHub: https://github.com/drabastomek/learningPySpark.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-PySpark. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/LearningPySpark_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1. Understanding Spark

Apache Spark is a powerful open source processing engine originally developed by Matei Zaharia as a part of his PhD thesis while at UC Berkeley. The first version of Spark was released in 2012. In 2013, Zaharia co-founded Databricks and became its CTO; he also holds a professor position at Stanford, having come from MIT. At the same time, the Spark codebase was donated to the Apache Software Foundation and has become its flagship project.

Apache Spark is a fast, easy-to-use framework that allows you to solve a wide variety of complex data problems, whether semi-structured, structured, streaming, and/or machine learning / data science. It has also become one of the largest open source communities in big data, with more than 1,000 contributors from 250+ organizations and 300,000+ Spark Meetup community members in 570+ locations worldwide.

In this chapter, we will provide a primer to understanding Apache Spark. We will explain the concepts behind Spark Jobs and APIs, introduce the Spark 2.0 architecture, and explore the features of Spark 2.0.

The topics covered are:

What is Apache Spark?
Spark Jobs and APIs
Review of Resilient Distributed Datasets (RDDs), DataFrames, and Datasets
Review of Catalyst Optimizer and Project Tungsten
Review of the Spark 2.0 architecture

What is Apache Spark?

Apache Spark is a powerful, open source distributed querying and processing engine. It provides the flexibility and extensibility of MapReduce but at significantly higher speeds: up to 100 times faster than Apache Hadoop when data is stored in memory and up to 10 times faster when accessing disk.

Apache Spark allows the user to read, transform, and aggregate data, as well as train and deploy sophisticated statistical models with ease. The Spark APIs are accessible in Java, Scala, Python, R, and SQL. Apache Spark can be used to build applications, to package them up as libraries to be deployed on a cluster, or to perform quick analytics interactively through notebooks (such as Jupyter, Spark-Notebook, Databricks notebooks, and Apache Zeppelin).
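For instance, a few lines of PySpark are enough to read, transform, and aggregate a dataset. The following is a minimal sketch only; the flights.csv file and its delay and origin columns are hypothetical stand-ins for your own data:

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session - the unified entry point in Spark 2.0
spark = SparkSession.builder.appName("quick-look").getOrCreate()

# Read a CSV file into a DataFrame; the same .read interface also handles
# JSON, Parquet, JDBC sources, and more
flights = spark.read.csv("flights.csv", header=True, inferSchema=True)

# Transform and aggregate: count delayed flights per origin airport
(flights
    .filter(flights.delay > 0)
    .groupBy("origin")
    .count()
    .orderBy("count", ascending=False)
    .show(5))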

Apache Spark exposes a host of libraries familiar to data analysts, data scientists, or researchers who have worked with Python's pandas or R's data.frames or data.tables. It is important to note that while Spark DataFrames will be familiar to pandas or data.frames / data.tables users, there are some differences, so please temper your expectations. Users with more of a SQL background can use the language to shape their data as well. Also, delivered with Apache Spark are several already implemented and tuned algorithms, statistical models, and frameworks: MLlib and ML for machine learning, GraphX and GraphFrames for graph processing, and Spark Streaming (both DStreams and Structured Streaming). Spark allows the user to combine these libraries seamlessly in the same application.

Apache Spark can easily run locally on a laptop, yet can also easily be deployed in standalone mode, over YARN, or on Apache Mesos - either on your local cluster or in the cloud. It can read from and write to a diverse set of data sources including (but not limited to) HDFS, Apache Cassandra, Apache HBase, and S3:

Source: Apache Spark is the smartphone of Big Data http://bit.ly/1QsgaNj

Note

For more information, please refer to: Apache Spark is the Smartphone of Big Data at http://bit.ly/1QsgaNj

Spark Jobs and APIs

In this section, we will provide a short overview of the Apache Spark Jobs and APIs. This provides the necessary foundation for the subsequent section on Spark 2.0 architecture.

Execution process

Any Spark application spins off a single driver process (that can contain multiple jobs) on the master node that then directs executor processes (that contain multiple tasks) distributed to a number of worker nodes as noted in the following diagram:

The driver process determines the number and the composition of the task processes directed to the executor nodes based on the graph generated for the given job. Note that any worker node can execute tasks from a number of different jobs.

A Spark job is associated with a chain of object dependencies organized in a directed acyclic graph (DAG), such as the following example generated from the Spark UI. Given this, Spark can optimize the scheduling (for example, determine the number of tasks and workers required) and execution of these tasks:

Note

For more information on the DAG scheduler, please refer to http://bit.ly/29WTiK8.
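As a rough illustration of the driver and executor relationship (a minimal sketch with hypothetical values, not a prescription), the configuration below controls what the driver asks the cluster manager for, and the number of partitions translates directly into the number of tasks per stage:

from pyspark import SparkConf, SparkContext

# The driver runs in this Python process; the master URL and executor
# settings tell the cluster manager what resources to provide
conf = (SparkConf()
        .setAppName("driver-executor-demo")
        .setMaster("local[4]")                # four local threads stand in for executors
        .set("spark.executor.memory", "2g"))  # per-executor memory on a real cluster
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(100000), numSlices=8)  # 8 partitions -> 8 tasks per stage
print(rdd.count())                                # the action submits a job with its own DAG
sc.stop()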

Resilient Distributed Dataset

Apache Spark is built around a distributed collection of immutable Java Virtual Machine (JVM) objects called Resilient Distributed Datasets (RDDs for short). As we are working with Python, it is important to note that the Python data is stored within these JVM objects. More on this will be discussed in the subsequent chapters on RDDs and DataFrames. These objects allow any job to perform calculations very quickly. RDDs are calculated against, cached, and stored in memory: a scheme that results in orders of magnitude faster computations compared to other traditional distributed frameworks like Apache Hadoop.

At the same time, RDDs expose some coarse-grained transformations (such as map(...), reduce(...), and filter(...), which we will cover in greater detail in Chapter 2, Resilient Distributed Datasets), keeping the flexibility and extensibility of the Hadoop platform to perform a wide variety of calculations. RDDs apply and log transformations to the data in parallel, resulting in both increased speed and fault-tolerance. By registering the transformations, RDDs provide data lineage - a form of an ancestry tree for each intermediate step in the form of a graph. This, in effect, guards the RDDs against data loss: if a partition of an RDD is lost, it still has enough information to recreate that partition instead of simply depending on replication.
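The snippet below is a small sketch of this behavior: each transformation is only registered against the source RDD, and .toDebugString() prints the recorded lineage that Spark would use to rebuild a lost partition:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Two transformations are registered against the source RDD
words = sc.parallelize(["spark", "rdd", "lineage", "spark"])
lengths = words.distinct().map(lambda w: (w, len(w)))

# Keep the computed partitions in memory for subsequent actions
lengths.cache()

# Print the lineage (ancestry) recorded for this RDD
print(lengths.toDebugString())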

Note

If you want to learn more about data lineage, check out this link: http://ibm.co/2ao9B1t.

RDDs have two sets of parallel operations: transformations (which return pointers to new RDDs) and actions (which return values to the driver after running a computation); we will cover these in greater detail in later chapters.

Note

For the latest list of transformations and actions, please refer to the Spark Programming Guide at http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations.

RDD transformation operations are lazy in the sense that they do not compute their results immediately. The transformations are only computed when an action is executed and the results need to be returned to the driver. This delayed execution results in more fine-tuned queries: queries that are optimized for performance. This optimization starts with Apache Spark's DAGScheduler – the stage-oriented scheduler that organizes the computation into stages, as seen in the preceding screenshot. By having separate RDD transformations and actions, the DAGScheduler can perform optimizations in the query, including being able to avoid shuffling the data (the most resource-intensive task).
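A short sketch of this laziness: the map() call below returns immediately because it only records the transformation, and the work is done only when the reduce() action asks for a result:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Returns instantly - only the transformation is recorded in the DAG
doubled = sc.parallelize(range(1000000)).map(lambda x: x * 2)

# The DAGScheduler plans the stages and runs the job only now, when an
# action requires a value to be returned to the driver
total = doubled.reduce(lambda a, b: a + b)
print(total)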

For more information on the DAGScheduler and optimizations (specifically around narrow or wide dependencies), a great reference is the Narrow vs. Wide Transformations section in High Performance Spark in Chapter 5, Effective Transformations (https://smile.amazon.com/High-Performance-Spark-Practices-Optimizing/dp/1491943203).

DataFrames

DataFrames, like RDDs, are immutable collections of data distributed among the nodes in a cluster. However, unlike RDDs, in DataFrames data is organized into named columns.

Note

If you are familiar with Python's pandas or R data.frames, this is a similar concept.

DataFrames were designed to make processing large data sets even easier. They allow developers to formalize the structure of the data, allowing higher-level abstraction; in that sense DataFrames resemble tables from the relational database world. DataFrames provide a domain specific language API to manipulate the distributed data and make Spark accessible to a wider audience, beyond specialized data engineers.

One of the major benefits of DataFrames is that the Spark engine initially builds a logical execution plan and executes generated code based on a physical plan determined by a cost optimizer. Unlike RDDs, which can be significantly slower in Python than in Java or Scala, the introduction of DataFrames has brought performance parity across all the languages.
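As a minimal sketch, a DataFrame with named columns can be created directly from local data and queried with the same API in any supported language; the column names and values below are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Data organized into named columns, much like a relational table
df = spark.createDataFrame(
    [("SEA", 10), ("SFO", 25), ("JFK", 5)],
    ["airport", "delay"])

# The query is planned and executed by the Spark engine, so Python code
# runs as fast as the equivalent Scala code
df.filter(df.delay > 5).select("airport").show()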

Datasets

Introduced in Spark 1.6, the goal of Spark Datasets is to provide an API that allows users to easily express transformations on domain objects, while also providing the performance and benefits of the robust Spark SQL execution engine. Unfortunately, at the time of writing this book Datasets are only available in Scala or Java. When they are available in PySpark we will cover them in future editions.

Catalyst Optimizer

Spark SQL is one of the most technically involved components of Apache Spark as it powers both SQL queries and the DataFrame API. At the core of Spark SQL is the Catalyst Optimizer. The optimizer is based on functional programming constructs and was designed with two purposes in mind: to ease the addition of new optimization techniques and features to Spark SQL, and to allow external developers to extend the optimizer (for example, adding data source specific rules, support for new data types, and so on):

Note

For more information, check out Deep Dive into Spark SQL's Catalyst Optimizer (http://bit.ly/271I7Dk) and Apache Spark DataFrames: Simple and Fast Analysis of Structured Data (http://bit.ly/29QbcOV)
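You can watch the Catalyst Optimizer at work by asking any DataFrame query for its plans; this minimal sketch (with made-up data) prints the parsed, analyzed, and optimized logical plans followed by the physical plan that Catalyst produces:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "label"])

# explain(True) prints the logical plans and the physical plan
# generated by the Catalyst Optimizer for this query
df.filter(df.id > 1).select("label").explain(True)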

Project Tungsten

Tungsten is the codename for an umbrella project of Apache Spark's execution engine. The project focuses on improving the Spark algorithms so they use memory and CPU more efficiently, pushing the performance of modern hardware closer to its limits.

The efforts of this project focus, among other things, on:

Managing memory explicitly so the overhead of the JVM's object model and garbage collection is eliminated
Designing algorithms and data structures that exploit the memory hierarchy
Generating code at runtime so applications can exploit modern compilers and optimize for CPUs
Eliminating virtual function dispatches so that multiple CPU calls are reduced
Utilizing low-level programming (for example, loading immediate data to CPU registers) to speed up memory access and optimizing Spark's engine to efficiently compile and execute simple loops
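One visible result of Tungsten is whole-stage code generation, which is enabled by default in Spark 2.0. The short sketch below (exact output varies by version) shows the relevant setting; operators fused into a single generated function appear with an asterisk (*) prefix in the physical plan:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Whole-stage code generation is controlled by this setting and is
# enabled by default in Spark 2.0
print(spark.conf.get("spark.sql.codegen.wholeStage"))

# Fused operators are marked with an asterisk (*) in the physical plan
spark.range(1000).selectExpr("sum(id)").explain()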

Note

For more information, please refer to

Project Tungsten: Bringing Apache Spark Closer to Bare Metal (https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html)

Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal [SSE 2015 Video and Slides] (https://spark-summit.org/2015/events/deep-dive-into-project-tungsten-bringing-spark-closer-to-bare-metal/) and

Apache Spark as a Compiler: Joining a Billion Rows per Second on a Laptop (https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html)

Spark 2.0 architecture

Apache Spark 2.0 is the most recent major release of the Apache Spark project, based on the key learnings from the last two years of development of the platform:

Source: Apache Spark 2.0: Faster, Easier, and Smarter http://bit.ly/2ap7qd5

The three overriding themes of the Apache Spark 2.0 release are performance enhancements (via Tungsten Phase 2), the introduction of Structured Streaming, and the unification of Datasets and DataFrames. We will describe Datasets here as they are part of Spark 2.0, even though they are currently only available in Scala and Java.

Note

Refer to the following presentations by key Spark committers for more information about Apache Spark 2.0: