Learning Spark SQL

Aurobindo Sarkar

Description

Design, implement, and deliver successful streaming applications, machine learning pipelines, and graph applications using the Spark SQL API

About This Book



  • Learn about the design and implementation of streaming applications, machine learning pipelines, deep learning, and large-scale graph processing applications using Spark SQL APIs and Scala.
  • Learn data exploration and data munging, and how to process structured and semi-structured data using real-world datasets, gaining hands-on exposure to the issues and challenges of working with noisy and "dirty" real-world data.
  • Understand design considerations for scalability and performance in web-scale Spark application architectures.

Who This Book Is For



If you are a developer, engineer, or architect and want to learn how to use Apache Spark in a web-scale project, then this is the book for you. It is assumed that you have prior knowledge of SQL querying. Basic programming knowledge of Scala, Java, R, or Python is all you need to get started with this book.

What You Will Learn



  • Familiarize yourself with Spark SQL programming, including working with DataFrame/Dataset API and SQL
  • Perform a series of hands-on exercises with different types of data sources, including CSV, JSON, Avro, MySQL, and MongoDB
  • Perform data quality checks, data visualization, and basic statistical analysis tasks
  • Perform data munging tasks on publicly available datasets
  • Learn how to use Spark SQL and Apache Kafka to build streaming applications
  • Learn key performance-tuning tips and tricks in Spark SQL applications
  • Learn key architectural components and patterns in large-scale Spark SQL applications

In Detail



In the past year, Apache Spark has been increasingly adopted for the development of distributed applications. Spark SQL APIs provide an optimized interface that helps developers build such applications quickly and easily. However, designing web-scale production applications using Spark SQL APIs can be a complex task. Hence, understanding the design and implementation best practices before you start your project will help you avoid such pitfalls.



This book gives an insight into the engineering practices used to design and build real-world, Spark-based applications. The book's hands-on examples will give you the required confidence to work on any future projects you encounter in Spark SQL.



It starts by familiarizing you with data exploration and data munging tasks using Spark SQL and Scala. Extensive code examples will help you understand the methods used to implement typical use-cases for various types of applications. You will get a walkthrough of the key concepts and terms that are common to streaming, machine learning, and graph applications. You will also learn key performance-tuning details including Cost Based Optimization (Spark 2.2) in Spark SQL applications. Finally, you will move on to learning how such systems are architected and deployed for a successful delivery of your project.

Style and approach



This book is a hands-on guide to designing, building, and deploying Spark SQL-centric production applications at scale.

Page count: 374

Year of publication: 2017




Learning Spark SQL

Architect streaming analytics and machine learning solutions

Aurobindo Sarkar

BIRMINGHAM - MUMBAI

Learning Spark SQL

Copyright © 2017 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

 

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

 

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: August 2017

 

Production reference: 1010917

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78588-835-9

 

www.packtpub.com

Credits

Author: Aurobindo Sarkar

Reviewer: Sumit Gupta

Commissioning Editor: Kunal Parikh

Acquisition Editor: Larissa Pinto

Content Development Editor: Arun Nadar

Technical Editor: Shweta Jadhav

Copy Editor: Shaila Kusanale

Project Coordinator: Ritika Manoj

Proofreader: Safis Editing

Indexer: Tejal Daruwale Soni

Graphics: Jason Monteiro

Production Coordinator: Shantanu Zagade

About the Author

Aurobindo Sarkar is currently the Country Head (India Engineering Center) for ZineOne Inc. With a career spanning over 24 years, he has consulted at some of the leading organizations in India, the US, the UK, and Canada. He specializes in real-time web-scale architectures, machine learning, deep learning, cloud engineering, and big data analytics. Aurobindo has been actively working as a CTO in technology start-ups for over 8 years now. As a member of the top leadership team at various start-ups, he has mentored founders and CxOs, provided technology advisory services, and led product architecture and engineering teams.

 

I would like to thank Packt for giving me the opportunity to write this book. Their patience, understanding, and support as I wrote, rewrote, revised, and improved upon the content of this book was massive in ensuring that the book remained current with the rapidly evolving versions of Spark.  I would especially like to thank Larissa Pinto, the acquisition editor (who first contacted me to write this book over a year ago) and Arun Nadar, the content development editor, who continuously, and patiently, worked with me to bring this book to a conclusion. I would also like to thank my friends and colleagues who encouraged me throughout the journey. Most of all, I want to thank my wife, Nitya, and kids, Somnath, Ravishankar, and Nandini, who understood, encouraged, and supported me, and sacrificed many family moments for me to be able to complete this book successfully. This one is for them…

About the Reviewer

Sumit Gupta is a seasoned professional, innovator, and technology evangelist with over 100 months of experience in architecting, managing, and delivering enterprise solutions revolving around a variety of business domains, such as hospitality, healthcare, risk management, insurance, and more. He is passionate about technology and has an overall hands-on experience of over 16 years in the software industry. He has been using big data and cloud technologies over the last 5 years to solve complex business problems.

Sumit has also authored Neo4j Essentials, Building Web Applications with Python and Neo4j, Real-Time Big Data Analytics, and Learning Real-time Processing with Spark Streaming, all by Packt.

You can find him on LinkedIn at sumit1001.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review.

 

If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Getting Started with Spark SQL

What is Spark SQL?

Introducing SparkSession

Understanding Spark SQL concepts

Understanding Resilient Distributed Datasets (RDDs)

Understanding DataFrames and Datasets

Understanding the Catalyst optimizer

Understanding Catalyst optimizations

Understanding Catalyst transformations

Introducing Project Tungsten

Using Spark SQL in streaming applications

Understanding Structured Streaming internals

Summary

Using Spark SQL for Processing Structured and Semistructured Data

Understanding data sources in Spark applications

Selecting Spark data sources

Using Spark with relational databases

Using Spark with MongoDB (NoSQL database)

Using Spark with JSON data

Using Spark with Avro files

Using Spark with Parquet files

Defining and using custom data sources in Spark

Summary

Using Spark SQL for Data Exploration

Introducing Exploratory Data Analysis (EDA)

Using Spark SQL for basic data analysis

Identifying missing data

Computing basic statistics

Identifying data outliers

Visualizing data with Apache Zeppelin

Sampling data with Spark SQL APIs

Sampling with the DataFrame/Dataset API

Sampling with the RDD API

Using Spark SQL for creating pivot tables

Summary

Using Spark SQL for Data Munging

Introducing data munging

Exploring data munging techniques

Pre-processing of the household electric consumption Dataset

Computing basic statistics and aggregations

Augmenting the Dataset

Executing other miscellaneous processing steps

Pre-processing of the weather Dataset

Analyzing missing data

Combining data using a JOIN operation

Munging textual data

Processing multiple input data files

Removing stop words

Munging time series data

Pre-processing of the time-series Dataset

Processing date fields

Persisting and loading data

Defining a date-time index

Using the TimeSeriesRDD object

Handling missing time-series data

Computing basic statistics

Dealing with variable length records

Converting variable-length records to fixed-length records

Extracting data from "messy" columns

Preparing data for machine learning

Pre-processing data for machine learning

Creating and running a machine learning pipeline

Summary

Using Spark SQL in Streaming Applications

Introducing streaming data applications

Building Spark streaming applications

Implementing sliding window-based functionality

Joining a streaming Dataset with a static Dataset

Using the Dataset API in Structured Streaming

Using output sinks

Using the Foreach Sink for arbitrary computations on output

Using the Memory Sink to save output to a table

Using the File Sink to save output to a partitioned table

Monitoring streaming queries

Using Kafka with Spark Structured Streaming

Introducing Kafka concepts

Introducing ZooKeeper concepts

Introducing Kafka-Spark integration

Introducing Kafka-Spark Structured Streaming

Writing a receiver for a custom data source

Summary

Using Spark SQL in Machine Learning Applications

Introducing machine learning applications

Understanding Spark ML pipelines and their components

Understanding the steps in a pipeline application development process

Introducing feature engineering

Creating new features from raw data

Estimating the importance of a feature

Understanding dimensionality reduction

Deriving good features

Implementing a Spark ML classification model

Exploring the diabetes Dataset

Pre-processing the data

Building the Spark ML pipeline

Using StringIndexer for indexing categorical features and labels

Using VectorAssembler for assembling features into one column

Using a Spark ML classifier

Creating a Spark ML pipeline

Creating the training and test Datasets

Making predictions using the PipelineModel

Selecting the best model

Changing the ML algorithm in the pipeline

Introducing Spark ML tools and utilities

Using Principal Component Analysis to select features

Using encoders

Using Bucketizer

Using VectorSlicer

Using Chi-squared selector

Using a Normalizer

Retrieving our original labels

Implementing a Spark ML clustering model

Summary

Using Spark SQL in Graph Applications

Introducing large-scale graph applications

Exploring graphs using GraphFrames

Constructing a GraphFrame

Basic graph queries and operations

Motif analysis using GraphFrames

Processing subgraphs

Applying graph algorithms

Saving and loading GraphFrames

Analyzing JSON input modeled as a graph 

Processing graphs containing multiple types of relationships

Understanding GraphFrame internals

Viewing GraphFrame physical execution plan

Understanding partitioning in GraphFrames

Summary

Using Spark SQL with SparkR

Introducing SparkR

Understanding the SparkR architecture

Understanding SparkR DataFrames

Using SparkR for EDA and data munging tasks

Reading and writing Spark DataFrames

Exploring structure and contents of Spark DataFrames

Running basic operations on Spark DataFrames

Executing SQL statements on Spark DataFrames

Merging SparkR DataFrames

Using User Defined Functions (UDFs)

Using SparkR for computing summary statistics

Using SparkR for data visualization

Visualizing data on a map

Visualizing graph nodes and edges

Using SparkR for machine learning

Summary

Developing Applications with Spark SQL

Introducing Spark SQL applications

Understanding text analysis applications

Using Spark SQL for textual analysis

Preprocessing textual data

Computing readability

Using word lists

Creating data preprocessing pipelines

Understanding themes in document corpuses

Using Naive Bayes classifiers

Developing a machine learning application

Summary

Using Spark SQL in Deep Learning Applications

Introducing neural networks

Understanding deep learning

Understanding representation learning

Understanding stochastic gradient descent

Introducing deep learning in Spark

Introducing CaffeOnSpark

Introducing DL4J

Introducing TensorFrames

Working with BigDL

Tuning hyperparameters of deep learning models

Introducing deep learning pipelines

Understanding Supervised learning

Understanding convolutional neural networks

Using neural networks for text classification

Using deep neural networks for language processing

Understanding Recurrent Neural Networks

Introducing autoencoders

Summary

Tuning Spark SQL Components for Performance

Introducing performance tuning in Spark SQL

Understanding DataFrame/Dataset APIs

Optimizing data serialization

Understanding Catalyst optimizations

Understanding the Dataset/DataFrame API

Understanding Catalyst transformations

Visualizing Spark application execution

Exploring Spark application execution metrics

Using external tools for performance tuning

Cost-based optimizer in Apache Spark 2.2

Understanding the CBO statistics collection

Statistics collection functions

Filter operator

Join operator

Build side selection

Understanding multi-way JOIN ordering optimization

Understanding performance improvements using whole-stage code generation

Summary

Spark SQL in Large-Scale Application Architectures

Understanding Spark-based application architectures

Using Apache Spark for batch processing

Using Apache Spark for stream processing

Understanding the Lambda architecture

Understanding the Kappa Architecture

Design considerations for building scalable stream processing applications

Building robust ETL pipelines using Spark SQL

Choosing appropriate data formats

Transforming data in ETL pipelines

Addressing errors in ETL pipelines

Implementing a scalable monitoring solution

Deploying Spark machine learning pipelines

Understanding the challenges in typical ML deployment environments

Understanding types of model scoring architectures

Using cluster managers

Summary

Preface

We will start this book with the basics of Spark SQL and its role in Spark applications. After the initial familiarization with Spark SQL, we will focus on using Spark SQL to execute tasks that are common to all big data projects, such as working with various types of data sources, exploratory data analysis, and data munging. We will also see how Spark SQL and SparkR can be leveraged to accomplish typical data science tasks at scale.

With the DataFrame/Dataset API and the Catalyst optimizer at the heart of Spark SQL, it is no surprise that Spark SQL plays a key role in all applications based on the Spark technology stack. These applications include large-scale machine learning pipelines, large-scale graph applications, and emerging Spark-based deep learning applications. Additionally, we will present Spark SQL-based Structured Streaming applications that are deployed in complex production environments as continuous applications.

We will also review performance tuning in Spark SQL applications, including cost-based optimization (CBO) introduced in Spark 2.2. Finally, we will present application architectures that leverage Spark modules and Spark SQL in real-world applications. More specifically, we will cover key architectural components and patterns in large-scale Spark applications that architects and designers will find useful as building blocks for their own specific use cases.

What this book covers

Chapter 1, Getting Started with Spark SQL, gives you an overview of Spark SQL while getting you comfortable with the Spark environment through hands-on sessions.

Chapter 2, Using Spark SQL for Processing Structured and Semistructured Data, will help you use Spark to work with a relational database (MySQL), NoSQL database (MongoDB), semistructured data (JSON), and data storage formats commonly used in the Hadoop ecosystem (Avro and Parquet).

Chapter 3, Using Spark SQL for Data Exploration, demonstrates the use of Spark SQL to explore datasets, perform basic data quality checks, generate samples and pivot tables, and visualize data with Apache Zeppelin.

Chapter 4, Using Spark SQL for Data Munging, uses Spark SQL for performing some basic data munging/wrangling tasks. It also introduces you to a few techniques to handle missing data, bad data, duplicate records, and so on.

Chapter 5, Using Spark SQL in Streaming Applications, provides a few examples of using Spark SQL DataFrame/Dataset APIs to build streaming applications. It also shows how to use Kafka in Structured Streaming applications.

Chapter 6, Using Spark SQL in Machine Learning Applications, focuses on using Spark SQL in machine learning applications. In this chapter, we will mainly explore the key concepts in feature engineering and implement machine learning pipelines.

Chapter 7, Using Spark SQL in Graph Applications, introduces you to GraphFrame applications. It provides examples of using Spark SQL DataFrame/Dataset APIs to build graph applications and apply various graph algorithms in your graph applications.

Chapter 8, Using Spark SQL with SparkR, covers the SparkR architecture and SparkR DataFrames API. It provides code examples for using SparkR for Exploratory Data Analysis (EDA) and data munging tasks, data visualization, and machine learning.

Chapter 9, Developing Applications with Spark SQL, helps you build Spark applications using a mix of Spark modules. It presents examples of applications that combine Spark SQL with Spark Streaming, Spark Machine Learning, and so on.

Chapter 10, Using Spark SQL in Deep Learning Applications, introduces you to deep learning in Spark. It covers the basic concepts of a few popular deep learning models before you delve into working with BigDL and Spark.

Chapter 11, Tuning Spark SQL Components for Performance, presents you with the foundational concepts related to tuning a Spark application, including data serialization using encoders. It also covers the key aspects of the cost-based optimizer introduced in Spark 2.2 to optimize Spark SQL execution automatically.

Chapter 12, Spark SQL in Large-Scale Application Architectures, teaches you to identify the use cases where Spark SQL can be used in large-scale application architectures to implement typical functional and non-functional requirements.

What you need for this book

This book is based on Spark 2.2.0 (pre-built for Apache Hadoop 2.7 or later) and Scala 2.11.8. For one or two subsections, Spark 2.1.0 has also been used due to the unavailability of certain libraries and reported bugs (when used with Apache Spark 2.2). The hardware and OS specifications include minimum 8 GB RAM (16 GB strongly recommended), 100 GB HDD, and OS X 10.11.6 or later (or appropriate Linux versions recommended for Spark development).
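
If you build standalone Spark applications rather than working only in the Spark shell, the following is a minimal sketch of an sbt build matching these versions (the project name is illustrative):

    // build.sbt (illustrative)
    name := "learning-spark-sql-examples"    // hypothetical project name
    scalaVersion := "2.11.8"
    libraryDependencies ++= Seq(
      // marked "provided" because spark-shell/spark-submit supply these jars at runtime
      "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.2.0" % "provided"
    )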

Who this book is for

If you are a developer, engineer, or architect and want to learn how to use Apache Spark in a web-scale project, then this is the book for you. It is assumed that you have prior knowledge of SQL querying. Basic programming knowledge of Scala, Java, R, or Python is all you need to get started with this book.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have several things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Spark-SQL. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/LearningSparkSQL_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Getting Started with Spark SQL

Spark SQL is at the heart of all applications developed using Spark. In this book, we will explore Spark SQL in great detail, including its usage in various types of applications as well as its internal workings. Developers and architects will appreciate the technical concepts and hands-on sessions presented in each chapter, as they progress through the book.

In this chapter, we will introduce you to the key concepts related to Spark SQL. We will start with SparkSession, the new entry point for Spark SQL in Spark 2.0. Then, we will explore Spark SQL's interfaces: RDDs, DataFrames, and the Dataset API. Later on, we will explain the developer-level details regarding the Catalyst optimizer and Project Tungsten.

Finally, we will introduce an exciting new feature in Spark 2.0 for streaming applications, called Structured Streaming. Specific hands-on exercises (using publicly available Datasets) are presented throughout the chapter, so you can actively follow along as you read through the various sections.

More specifically, the sections in this chapter will cover the following topics, along with hands-on practice sessions:

  • What is Spark SQL?
  • Introducing SparkSession
  • Understanding Spark SQL concepts
  • Understanding RDDs, DataFrames, and Datasets
  • Understanding the Catalyst optimizer
  • Understanding Project Tungsten
  • Using Spark SQL in continuous applications
  • Understanding Structured Streaming internals

What is Spark SQL?

Spark SQL is one of the most advanced components of Apache Spark. It has been a part of the core distribution since Spark 1.0 and supports Python, Scala, Java, and R programming APIs. As illustrated in the figure below, Spark SQL components provide the foundation for Spark machine learning applications, streaming applications, graph applications, and many other types of application architectures.

Such applications typically use Spark ML pipelines, Structured Streaming, and GraphFrames, which are all based on Spark SQL interfaces (the DataFrame/Dataset API). These applications, along with constructs such as SQL, DataFrames, and the Dataset API, automatically receive the benefits of the Catalyst optimizer. This optimizer is also responsible for generating executable query plans based on the lower-level RDD interfaces.

We will explore ML pipelines in more detail in Chapter 6, Using Spark SQL in Machine Learning Applications. GraphFrames will be covered in Chapter 7, Using Spark SQL in Graph Applications. While we will introduce the key concepts regarding Structured Streaming and the Catalyst optimizer in this chapter, more details about them are covered in Chapter 5, Using Spark SQL in Streaming Applications, and Chapter 11, Tuning Spark SQL Components for Performance.

In Spark 2.0, the DataFrame API has been merged with the Dataset API, thereby unifying data processing capabilities across Spark libraries. This also enables developers to work with a single high-level and type-safe API. However, the Spark software stack does not prevent developers from directly using the low-level RDD interface in their applications. Though the low-level RDD API will continue to be available, a vast majority of developers are expected to (and are recommended to) use the high-level APIs, namely, the Dataset and DataFrame APIs.
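
As a minimal sketch of this unified API (assuming the spark session that the Spark shell provides, and a hypothetical Person case class), the same data can be manipulated as a typed Dataset, an untyped DataFrame, or, where really necessary, an RDD:

    import spark.implicits._

    case class Person(name: String, age: Int)    // hypothetical example class

    val ds  = Seq(Person("Alice", 29), Person("Bob", 31)).toDS()   // typed Dataset[Person]
    val df  = ds.toDF()                                            // untyped DataFrame (Dataset[Row])
    val rdd = ds.rdd                                               // escape hatch to the low-level RDD API

    ds.filter(_.age > 30).show()      // lambda checked at compile time
    df.filter($"age" > 30).show()     // equivalent column-expression form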

Additionally, Spark 2.0 extends Spark SQL capabilities by including a new ANSI SQL parser with support for subqueries and the SQL:2003 standard. More specifically, the subquery support now includes correlated/uncorrelated subqueries, and IN / NOT IN and EXISTS / NOT EXISTS predicates in WHERE / HAVING clauses.
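
For example, the following sketch (using two hypothetical temporary views, t1 and t2) exercises the IN and NOT EXISTS predicates in a WHERE clause:

    spark.range(0, 10).createOrReplaceTempView("t1")
    spark.range(5, 15).createOrReplaceTempView("t2")

    // uncorrelated IN subquery
    spark.sql("SELECT id FROM t1 WHERE id IN (SELECT id FROM t2)").show()

    // correlated NOT EXISTS subquery
    spark.sql(
      "SELECT id FROM t1 a WHERE NOT EXISTS (SELECT 1 FROM t2 b WHERE b.id = a.id)"
    ).show()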

At the core of Spark SQL is the Catalyst optimizer, which leverages Scala's advanced features, such as pattern matching, to provide an extensible query optimizer. DataFrames, Datasets, and SQL queries share the same execution and optimization pipeline; hence, there is no performance penalty for using any one of these constructs (or for using any of the supported programming APIs). The high-level DataFrame-based code written by the developer is converted to Catalyst expressions and then to low-level Java bytecode as it passes through this pipeline.
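
One way to observe this shared pipeline is to compare the physical plans Spark produces for an equivalent DataFrame expression and SQL query; the following sketch (reusing the hypothetical t1 view from the previous example) should yield essentially the same plan for both:

    import org.apache.spark.sql.functions.col

    val viaDataFrame = spark.table("t1").filter(col("id") > 5).select("id")
    val viaSQL       = spark.sql("SELECT id FROM t1 WHERE id > 5")

    viaDataFrame.explain()   // physical plan generated by Catalyst
    viaSQL.explain()         // should be essentially identical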

SparkSession is the entry point into Spark SQL-related functionality and we describe it in more detail in the next section.
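
As a quick preview, a minimal sketch of creating a SparkSession in a standalone application looks like the following (the application name and local master are illustrative choices); in the Spark shell, an equivalent session is pre-created and bound to the spark variable:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder()
      .appName("LearningSparkSQL")   // hypothetical application name
      .master("local[*]")            // local mode, convenient for experimentation
      .getOrCreate()

    import spark.implicits._         // enables toDF()/toDS() and the $"column" syntax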

Understanding Spark SQL concepts

In this section, we will explore the key concepts related to Resilient Distributed Datasets (RDDs), DataFrames and Datasets, the Catalyst optimizer, and Project Tungsten.

Understanding the Catalyst optimizer

The Catalyst optimizer is at the core of Spark SQL and is implemented in Scala. It enables several key features, such as schema inference (from JSON data), that are very useful in data analysis work.
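
For example, schema inference lets you load JSON data and query it without declaring a schema up front; a small sketch, assuming a hypothetical people.json file containing one JSON record per line:

    val people = spark.read.json("/path/to/people.json")   // hypothetical path
    people.printSchema()                                    // schema inferred from the JSON records
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 21").show()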

The following figure shows the high-level transformation process from a developer's program containing DataFrames/Datasets to the final execution plan:

The internal representation of the program is a query plan. The query plan describes data operations, such as aggregate, join, and filter, which match what is defined in your query. These operations generate a new Dataset from the input Dataset. After we have an initial version of the query plan ready, the Catalyst optimizer will apply a series of transformations to convert it to an optimized query plan. Finally, the Spark SQL code generation mechanism translates the optimized query plan into a DAG of RDDs that is ready for execution. The query plans and the optimized query plans are internally represented as trees. So, at its core, the Catalyst optimizer contains a general library for representing trees and applying rules to manipulate them. On top of this library sit several other libraries that are more specific to relational query processing.

Catalyst has two types of query plans: Logical and Physical Plans. The Logical Plan describes the computations on the Datasets without defining how to carry out the specific computations. Typically, the Logical Plan generates a list of attributes or columns as output under a set of constraints on the generated rows. The Physical Plan describes the computations on Datasets with specific definitions on how to execute them (it is executable).

Let's explore the transformation steps in more detail. The initial query plan is essentially an unresolved Logical Plan, that is, we don't know the source of the Datasets or the columns (contained in the Dataset) at this stage and we also don't know the types of columns. The first step in this pipeline is the analysis step. During analysis, the catalog information is used to convert the unresolved Logical Plan to a resolved Logical Plan.

In the next step, a set of logical optimization rules is applied to the resolved Logical Plan, resulting in an optimized Logical Plan. In the physical planning step, the optimizer may generate multiple Physical Plans and compare their costs to pick the best one. The first version of the Cost-based Optimizer (CBO), built on top of Spark SQL, was released in Spark 2.2. More details on cost-based optimization are presented in Chapter 11, Tuning Spark SQL Components for Performance.
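
You can observe these stages for any query by printing its extended plan, or by inspecting the developer-facing queryExecution field, as in the following sketch:

    val q = spark.range(1000).filter("id % 100 = 0").selectExpr("id * 2 AS doubled")

    q.explain(true)                  // prints the parsed, analyzed, and optimized logical plans,
                                     // followed by the selected physical plan

    q.queryExecution.logical         // parsed (unresolved) Logical Plan
    q.queryExecution.analyzed        // resolved Logical Plan
    q.queryExecution.optimizedPlan   // optimized Logical Plan
    q.queryExecution.executedPlan    // Physical Plan prepared for execution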

All three (DataFrame, Dataset, and SQL) share the same optimization pipeline, as illustrated in the following figure:

Understanding Catalyst optimizations

In Catalyst, there are two main types of optimizations, Logical and Physical:

Logical Optimizations: This includes the ability of the optimizer to push filter predicates down to the data source and enable execution to skip irrelevant data. For example, in the case of Parquet files, entire blocks can be skipped and comparisons on strings can be turned into cheaper integer comparisons via dictionary encoding. In the case of RDBMSs, the predicates are pushed down to the database to reduce the amount of data traffic.

Physical Optimizations: This includes the ability to intelligently choose between broadcast joins and shuffle joins to reduce network traffic, as well as performing lower-level optimizations, such as eliminating expensive object allocations and reducing virtual function calls. Hence, performance typically improves when DataFrames are introduced in your programs.
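
The following sketch shows how both kinds of optimization described above surface in the physical plans printed by explain() (the Parquet paths and join inputs are hypothetical):

    import org.apache.spark.sql.functions.{broadcast, col}

    // Logical optimization: the filter is pushed down to the Parquet reader and
    // appears as PushedFilters in the physical plan
    val events = spark.read.parquet("/data/events")   // hypothetical dataset
    events.filter(col("year") >= 2017).select("eventId", "year").explain()

    // Physical optimization: hint that the small lookup table should be broadcast,
    // so the plan uses BroadcastHashJoin instead of a shuffle-based join
    val transactions = spark.read.parquet("/data/transactions")   // hypothetical, large
    val countries    = spark.read.parquet("/data/countries")      // hypothetical, small
    transactions.join(broadcast(countries), Seq("countryCode")).explain()

    // without the hint, Spark broadcasts automatically below this size threshold (in bytes)
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)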

The Rule Executor is responsible for the analysis and logical optimization steps, while a set of strategies and the Rule Executor are responsible for the physical planning step. The Rule Executor transforms a tree into another tree of the same type by applying a set of rules in batches. These rules can be applied one or more times. Each of these rules is implemented as a transform. A transform is basically a function associated with every tree and is used to implement a single rule. In Scala terms, the transformation is defined as a partial function (a function defined for a subset of its possible arguments). These are typically written as case statements, with pattern matching determining whether the partial function is defined for a given input.
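
To make the rule/transform style concrete, here is a sketch of a Catalyst-style rule modeled on the classic example from the Catalyst paper; SimplifyAddZero is a made-up name, and it uses Spark's internal (non-public) APIs, so it is intended as an illustration rather than as typical application code:

    import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule
    import org.apache.spark.sql.types.IntegerType

    object SimplifyAddZero extends Rule[LogicalPlan] {
      // the partial function is only defined for Add expressions involving a zero literal;
      // pattern matching determines whether the rule applies to a given subtree
      def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
        case Add(left, Literal(0, IntegerType))  => left
        case Add(Literal(0, IntegerType), right) => right
      }
    }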

The Rule Executor makes the Physical Plan ready for execution by preparing scalar subqueries, ensuring that the input rows meet the requirements of the specific operation and applying the physical optimizations. For example, in the sort merge join operations, the input rows need to be sorted as per the join condition. The optimizer inserts the appropriate sort operations, as required, on the input rows before the sort merge join operation is executed.
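
A sketch that makes this visible (with illustrative inputs and broadcasting disabled so that a sort merge join is selected):

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)   // force a shuffle-based join

    val left  = spark.range(1000000).withColumnRenamed("id", "key")
    val right = spark.range(1000000).withColumnRenamed("id", "key")

    left.join(right, "key").explain()
    // the physical plan should show a SortMergeJoin whose inputs are wrapped in
    // Sort (and Exchange) operators inserted by the planner, as described above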

Understanding Catalyst transformations