Design, implement, and deliver successful streaming applications, machine learning pipelines, and graph applications using the Spark SQL API
About This Book
Who This Book Is For
If you are a developer, engineer, or an architect and want to learn how to use Apache Spark in a web-scale project, then this is the book for you. It is assumed that you have prior knowledge of SQL querying. Basic programming knowledge of Scala, Java, R, or Python is all you need to get started with this book.
What You Will Learn
In Detail
In the past year, Apache Spark has been increasingly adopted for the development of distributed applications. Spark SQL APIs provide an optimized interface that helps developers build such applications quickly and easily. However, designing web-scale production applications using Spark SQL APIs can be a complex task. Understanding the relevant design and implementation best practices before you start your project will help you avoid such pitfalls.
This book gives an insight into the engineering practices used to design and build real-world, Spark-based applications. The book's hands-on examples will give you the required confidence to work on any future projects you encounter in Spark SQL.
It starts by familiarizing you with data exploration and data munging tasks using Spark SQL and Scala. Extensive code examples will help you understand the methods used to implement typical use-cases for various types of applications. You will get a walkthrough of the key concepts and terms that are common to streaming, machine learning, and graph applications. You will also learn key performance-tuning details including Cost Based Optimization (Spark 2.2) in Spark SQL applications. Finally, you will move on to learning how such systems are architected and deployed for a successful delivery of your project.
Style and approach
This book is a hands-on guide to designing, building, and deploying Spark SQL-centric production applications at scale.
Page count: 374
Year of publication: 2017
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: August 2017
Production reference: 1010917
ISBN 978-1-78588-835-9
www.packtpub.com
Author: Aurobindo Sarkar
Copy Editor: Shaila Kusanale
Reviewer: Sumit Gupta
Project Coordinator: Ritika Manoj
Commissioning Editor: Kunal Parikh
Proofreader: Safis Editing
Acquisition Editor: Larissa Pinto
Indexer: Tejal Daruwale Soni
Content Development Editor: Arun Nadar
Graphics: Jason Monteiro
Technical Editor: Shweta Jadhav
Production Coordinator: Shantanu Zagade
Aurobindo Sarkar is currently the Country Head (India Engineering Center) for ZineOne Inc. With a career spanning over 24 years, he has consulted at some of the leading organizations in India, US, UK, and Canada. He specializes in real-time web-scale architectures, machine learning, deep learning, cloud engineering, and big data analytics. Aurobindo has been actively working as a CTO in technology start-ups for over 8 years now. As a member of the top leadership team at various start-ups, he has mentored founders and CxOs, provided technology advisory services, and led product architecture and engineering teams.
Sumit Gupta is a seasoned professional, innovator, and technology evangelist with over 100 months of experience in architecting, managing, and delivering enterprise solutions revolving around a variety of business domains, such as hospitality, healthcare, risk management, insurance, and more. He is passionate about technology and has an overall hands-on experience of over 16 years in the software industry. He has been using big data and cloud technologies over the last 5 years to solve complex business problems.
Sumit has also authored Neo4j Essentials, Building Web Applications with Python and Neo4j, Real-Time Big Data Analytics, and Learning Real-time Processing with Spark Streaming, all published by Packt.
You can find him on LinkedIn at sumit1001.
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review.
If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Getting Started with Spark SQL
What is Spark SQL?
Introducing SparkSession
Understanding Spark SQL concepts
Understanding Resilient Distributed Datasets (RDDs)
Understanding DataFrames and Datasets
Understanding the Catalyst optimizer
Understanding Catalyst optimizations
Understanding Catalyst transformations
Introducing Project Tungsten
Using Spark SQL in streaming applications
Understanding Structured Streaming internals
Summary
Using Spark SQL for Processing Structured and Semistructured Data
Understanding data sources in Spark applications
Selecting Spark data sources
Using Spark with relational databases
Using Spark with MongoDB (NoSQL database)
Using Spark with JSON data
Using Spark with Avro files
Using Spark with Parquet files
Defining and using custom data sources in Spark
Summary
Using Spark SQL for Data Exploration
Introducing Exploratory Data Analysis (EDA)
Using Spark SQL for basic data analysis
Identifying missing data
Computing basic statistics
Identifying data outliers
Visualizing data with Apache Zeppelin
Sampling data with Spark SQL APIs
Sampling with the DataFrame/Dataset API
Sampling with the RDD API
Using Spark SQL for creating pivot tables
Summary
Using Spark SQL for Data Munging
Introducing data munging
Exploring data munging techniques
Pre-processing of the household electric consumption Dataset
Computing basic statistics and aggregations
Augmenting the Dataset
Executing other miscellaneous processing steps
Pre-processing of the weather Dataset
Analyzing missing data
Combining data using a JOIN operation
Munging textual data
Processing multiple input data files
Removing stop words
Munging time series data
Pre-processing of the time-series Dataset
Processing date fields
Persisting and loading data
Defining a date-time index
Using the TimeSeriesRDD object
Handling missing time-series data
Computing basic statistics
Dealing with variable length records
Converting variable-length records to fixed-length records
Extracting data from "messy" columns
Preparing data for machine learning
Pre-processing data for machine learning
Creating and running a machine learning pipeline
Summary
Using Spark SQL in Streaming Applications
Introducing streaming data applications
Building Spark streaming applications
Implementing sliding window-based functionality
Joining a streaming Dataset with a static Dataset
Using the Dataset API in Structured Streaming
Using output sinks
Using the Foreach Sink for arbitrary computations on output
Using the Memory Sink to save output to a table
Using the File Sink to save output to a partitioned table
Monitoring streaming queries
Using Kafka with Spark Structured Streaming
Introducing Kafka concepts
Introducing ZooKeeper concepts
Introducing Kafka-Spark integration
Introducing Kafka-Spark Structured Streaming
Writing a receiver for a custom data source
Summary
Using Spark SQL in Machine Learning Applications
Introducing machine learning applications
Understanding Spark ML pipelines and their components
Understanding the steps in a pipeline application development process
Introducing feature engineering
Creating new features from raw data
Estimating the importance of a feature
Understanding dimensionality reduction
Deriving good features
Implementing a Spark ML classification model
Exploring the diabetes Dataset
Pre-processing the data
Building the Spark ML pipeline
Using StringIndexer for indexing categorical features and labels
Using VectorAssembler for assembling features into one column
Using a Spark ML classifier
Creating a Spark ML pipeline
Creating the training and test Datasets
Making predictions using the PipelineModel
Selecting the best model
Changing the ML algorithm in the pipeline
Introducing Spark ML tools and utilities
Using Principal Component Analysis to select features
Using encoders
Using Bucketizer
Using VectorSlicer
Using Chi-squared selector
Using a Normalizer
Retrieving our original labels
Implementing a Spark ML clustering model
Summary
Using Spark SQL in Graph Applications
Introducing large-scale graph applications
Exploring graphs using GraphFrames
Constructing a GraphFrame
Basic graph queries and operations
Motif analysis using GraphFrames
Processing subgraphs
Applying graph algorithms
Saving and loading GraphFrames
Analyzing JSON input modeled as a graph
Processing graphs containing multiple types of relationships
Understanding GraphFrame internals
Viewing GraphFrame physical execution plan
Understanding partitioning in GraphFrames
Summary
Using Spark SQL with SparkR
Introducing SparkR
Understanding the SparkR architecture
Understanding SparkR DataFrames
Using SparkR for EDA and data munging tasks
Reading and writing Spark DataFrames
Exploring structure and contents of Spark DataFrames
Running basic operations on Spark DataFrames
Executing SQL statements on Spark DataFrames
Merging SparkR DataFrames
Using User Defined Functions (UDFs)
Using SparkR for computing summary statistics
Using SparkR for data visualization
Visualizing data on a map
Visualizing graph nodes and edges
Using SparkR for machine learning
Summary
Developing Applications with Spark SQL
Introducing Spark SQL applications
Understanding text analysis applications
Using Spark SQL for textual analysis
Preprocessing textual data
Computing readability
Using word lists
Creating data preprocessing pipelines
Understanding themes in document corpuses
Using Naive Bayes classifiers
Developing a machine learning application
Summary
Using Spark SQL in Deep Learning Applications
Introducing neural networks
Understanding deep learning
Understanding representation learning
Understanding stochastic gradient descent
Introducing deep learning in Spark
Introducing CaffeOnSpark
Introducing DL4J
Introducing TensorFrames
Working with BigDL
Tuning hyperparameters of deep learning models
Introducing deep learning pipelines
Understanding Supervised learning
Understanding convolutional neural networks
Using neural networks for text classification
Using deep neural networks for language processing
Understanding Recurrent Neural Networks
Introducing autoencoders
Summary
Tuning Spark SQL Components for Performance
Introducing performance tuning in Spark SQL
Understanding DataFrame/Dataset APIs
Optimizing data serialization
Understanding Catalyst optimizations
Understanding the Dataset/DataFrame API
Understanding Catalyst transformations
Visualizing Spark application execution
Exploring Spark application execution metrics
Using external tools for performance tuning
Cost-based optimizer in Apache Spark 2.2
Understanding the CBO statistics collection
Statistics collection functions
Filter operator
Join operator
Build side selection
Understanding multi-way JOIN ordering optimization
Understanding performance improvements using whole-stage code generation
Summary
Spark SQL in Large-Scale Application Architectures
Understanding Spark-based application architectures
Using Apache Spark for batch processing
Using Apache Spark for stream processing
Understanding the Lambda architecture
Understanding the Kappa Architecture
Design considerations for building scalable stream processing applications
Building robust ETL pipelines using Spark SQL
Choosing appropriate data formats
Transforming data in ETL pipelines
Addressing errors in ETL pipelines
Implementing a scalable monitoring solution
Deploying Spark machine learning pipelines
Understanding the challenges in typical ML deployment environments
Understanding types of model scoring architectures
Using cluster managers
Summary
We will start this book with the basics of Spark SQL and its role in Spark applications. After the initial familiarization with Spark SQL, we will focus on using Spark SQL to execute tasks that are common to all big data projects, such as working with various types of data sources, exploratory data analysis, and data munging. We will also see how Spark SQL and SparkR can be leveraged to accomplish typical data science tasks at scale.
With the DataFrame/Dataset API and the Catalyst optimizer at the heart of Spark SQL, it is no surprise that it plays a key role in all applications based on the Spark technology stack. These applications include large-scale machine learning pipelines, large-scale graph applications, and emerging Spark-based deep learning applications. Additionally, we will present Spark SQL-based Structured Streaming applications that are deployed in complex production environments as continuous applications.
We will also review performance tuning in Spark SQL applications, including cost-based optimization (CBO) introduced in Spark 2.2. Finally, we will present application architectures that leverage Spark modules and Spark SQL in real-world applications. More specifically, we will cover key architectural components and patterns in large-scale Spark applications that architects and designers will find useful as building blocks for their own specific use cases.
Chapter 1, Getting Started with Spark SQL, gives you an overview of Spark SQL while getting you comfortable with the Spark environment through hands-on sessions.
Chapter 2, Using Spark SQL for Processing Structured and Semistructured Data, will help you use Spark to work with a relational database (MySQL), NoSQL database (MongoDB), semistructured data (JSON), and data storage formats commonly used in the Hadoop ecosystem (Avro and Parquet).
Chapter 3, Using Spark SQL for Data Exploration, demonstrates the use of Spark SQL to explore datasets, perform basic data quality checks, generate samples and pivot tables, and visualize data with Apache Zeppelin.
Chapter 4, Using Spark SQL for Data Munging, uses Spark SQL for performing some basic data munging/wrangling tasks. It also introduces you to a few techniques to handle missing data, bad data, duplicate records, and so on.
Chapter 5, Using Spark SQL in Streaming Applications, provides a few examples of using Spark SQL DataFrame/Dataset APIs to build streaming applications. It also shows how to use Kafka in Structured Streaming applications.
Chapter 6, Using Spark SQL in Machine Learning Applications, focuses on using Spark SQL in machine learning applications. In this chapter, we will mainly explore the key concepts in feature engineering and implement machine learning pipelines.
Chapter 7, Using Spark SQL in Graph Applications, introduces you to GraphFrame applications. It provides examples of using Spark SQL DataFrame/Dataset APIs to build graph applications and apply various graph algorithms to them.
Chapter 8, Using Spark SQL with SparkR, covers the SparkR architecture and SparkR DataFrames API. It provides code examples for using SparkR for Exploratory Data Analysis (EDA) and data munging tasks, data visualization, and machine learning.
Chapter 9, Developing Applications with Spark SQL, helps you build Spark applications using a mix of Spark modules. It presents examples of applications that combine Spark SQL with Spark Streaming, Spark Machine Learning, and so on.
Chapter 10, Using Spark SQL in Deep Learning Applications, introduces you to deep learning in Spark. It covers the basic concepts of a few popular deep learning models before you delve into working with BigDL and Spark.
Chapter 11, Tuning Spark SQL Components for Performance, presents you with the foundational concepts related to tuning a Spark application, including data serialization using encoders. It also covers the key aspects of the cost-based optimizer introduced in Spark 2.2 to optimize Spark SQL execution automatically.
Chapter 12, Spark SQL in Large-Scale Application Architectures, teaches you to identify the use cases where Spark SQL can be used in large-scale application architectures to implement typical functional and non-functional requirements.
This book is based on Spark 2.2.0 (pre-built for Apache Hadoop 2.7 or later) and Scala 2.11.8. For one or two subsections, Spark 2.1.0 has also been used, due to the unavailability of certain libraries or reported bugs when used with Apache Spark 2.2. The hardware and OS requirements include a minimum of 8 GB RAM (16 GB strongly recommended), 100 GB of HDD space, and OS X 10.11.6 or later (or an appropriate Linux distribution recommended for Spark development).
If you are a developer, engineer, or an architect and want to learn how to use Apache Spark in a web-scale project, then this is the book for you. It is assumed that you have prior knowledge of SQL querying. Basic programming knowledge of Scala, Java, R, or Python is all you need to get started with this book.
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important to us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have several things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Spark-SQL. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/LearningSparkSQL_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to the list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
Spark SQL is at the heart of all applications developed using Spark. In this book, we will explore Spark SQL in great detail, including its usage in various types of applications as well as its internal workings. Developers and architects will appreciate the technical concepts and hands-on sessions presented in each chapter, as they progress through the book.
In this chapter, we will introduce you to the key concepts related to Spark SQL. We will start with SparkSession, the new entry point for Spark SQL in Spark 2.0. Then, we will explore Spark SQL's interfaces RDDs, DataFrames, and Dataset APIs. Later on, we will explain the developer-level details regarding the Catalyst optimizer and Project Tungsten.
Finally, we will introduce an exciting new feature in Spark 2.0 for streaming applications, called Structured Streaming. Specific hands-on exercises (using publicly available Datasets) are presented throughout the chapter, so you can actively follow along as you read through the various sections.
More specifically, the sections in this chapter will cover the following topics, along with hands-on practice sessions:
What is Spark SQL?
Introducing SparkSession
Understanding Spark SQL concepts
Understanding RDDs, DataFrames, and Datasets
Understanding the Catalyst optimizer
Understanding Project Tungsten
Using Spark SQL in continuous applications
Understanding Structured Streaming internals
Spark SQL is one of the most advanced components of Apache Spark. It has been a part of the core distribution since Spark 1.0 and supports Python, Scala, Java, and R programming APIs. As illustrated in the figure below, Spark SQL components provide the foundation for Spark machine learning applications, streaming applications, graph applications, and many other types of application architectures.
Such applications typically use Spark ML pipelines, Structured Streaming, and GraphFrames, which are all based on Spark SQL interfaces (the DataFrame/Dataset API). These applications, along with constructs such as SQL, DataFrames, and the Datasets API, automatically receive the benefits of the Catalyst optimizer. This optimizer is also responsible for generating executable query plans based on the lower-level RDD interfaces.
We will explore ML pipelines in more detail in Chapter 6, Using Spark SQL in Machine Learning Applications. GraphFrames will be covered in Chapter 7, Using Spark SQL in Graph Applications. While we will introduce the key concepts regarding Structured Streaming and the Catalyst optimizer in this chapter, more details about them are presented in Chapter 5, Using Spark SQL in Streaming Applications, and Chapter 11, Tuning Spark SQL Components for Performance.
In Spark 2.0, the DataFrame API has been merged with the Dataset API, thereby unifying data processing capabilities across Spark libraries. This also enables developers to work with a single high-level and type-safe API. However, the Spark software stack does not prevent developers from directly using the low-level RDD interface in their applications. Though the low-level RDD API will continue to be available, a vast majority of developers are expected to (and are recommended to) use the high-level APIs, namely, the Dataset and DataFrame APIs.
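To make the unification concrete, here is a minimal sketch, with a hypothetical Person case class and made-up rows, showing the typed Dataset API and the untyped DataFrame API side by side; both flow through the same Catalyst pipeline described later in this chapter:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical domain class used only for this illustration
case class Person(name: String, age: Long)

object UnifiedApiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("UnifiedApiSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Typed API: a Dataset[Person] checked at compile time
    val peopleDS: Dataset[Person] = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()

    // Untyped API: a DataFrame is just an alias for Dataset[Row]
    val peopleDF: DataFrame = peopleDS.toDF()

    // Equivalent queries expressed through both APIs
    peopleDS.filter(_.age > 30).show()
    peopleDF.filter($"age" > 30).show()

    spark.stop()
  }
}
```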
Additionally, Spark 2.0 extends Spark SQL capabilities by including a new ANSI SQL parser with support for subqueries and the SQL:2003 standard. More specifically, the subquery support now includes correlated/uncorrelated subqueries, and IN / NOT IN and EXISTS / NOT EXISTS predicates in WHERE / HAVING clauses.
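The following is a minimal sketch of the kind of subquery predicates the new parser supports; the temporary views (customers and orders) and their sample rows are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SubquerySketch").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up sample data registered as temporary views
Seq((1, "Alice"), (2, "Bob"), (3, "Carol"))
  .toDF("cust_id", "name").createOrReplaceTempView("customers")
Seq((100, 1, 250.0), (101, 1, 80.0), (102, 3, 40.0))
  .toDF("order_id", "cust_id", "amount").createOrReplaceTempView("orders")

// Correlated EXISTS subquery in a WHERE clause
spark.sql("""
  SELECT c.name FROM customers c
  WHERE EXISTS (SELECT 1 FROM orders o
                WHERE o.cust_id = c.cust_id AND o.amount > 100)
""").show()

// Uncorrelated IN subquery
spark.sql("SELECT name FROM customers WHERE cust_id IN (SELECT cust_id FROM orders)").show()
```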
At the core of Spark SQL is the Catalyst optimizer, which leverages Scala's advanced features, such as pattern matching, to provide an extensible query optimizer. DataFrames, Datasets, and SQL queries share the same execution and optimization pipeline; hence, there is no performance penalty for using any one of these constructs (or for using any of the supported programming APIs). The high-level DataFrame-based code written by the developer is converted to Catalyst expressions and then to low-level Java bytecode as it passes through this pipeline.
SparkSession is the entry point into Spark SQL-related functionality and we describe it in more detail in the next section.
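As a brief preview, before we cover it in detail, the following is a minimal sketch of creating a SparkSession in a standalone Scala application; the application name, master URL, and configuration value are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("LearningSparkSQL")                  // placeholder application name
  .master("local[*]")                           // use your cluster manager URL in production
  .config("spark.sql.shuffle.partitions", "8")  // example configuration setting
  .getOrCreate()

// SparkSession exposes the SQL, DataFrame, and Dataset APIs
val numbers = spark.range(0, 10).toDF("id")
numbers.createOrReplaceTempView("numbers")
spark.sql("SELECT count(*) AS n FROM numbers").show()
```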
In this section, we will explore key concepts related to Resilient Distributed Datasets (RDDs), DataFrames and Datasets, the Catalyst optimizer, and Project Tungsten.
The Catalyst optimizer is at the core of Spark SQL and is implemented in Scala. It enables several key features, such as schema inference (from JSON data), that are very useful in data analysis work.
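For example, the following is a minimal sketch of schema inference over JSON data; the sample records and field names are made up, and reading JSON from a Dataset[String] in this way assumes Spark 2.2:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SchemaInferenceSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up JSON records held in a Dataset[String]
val jsonRecords = Seq(
  """{"name": "Alice", "age": 29, "city": "London"}""",
  """{"name": "Bob", "age": 35, "city": "Mumbai"}"""
).toDS()

// Spark samples the records and infers the column names and types
val people = spark.read.json(jsonRecords)
people.printSchema()
people.show()
```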
The following figure shows the high-level transformation process from a developer's program containing DataFrames/Datasets to the final execution plan:
The internal representation of the program is a query plan. The query plan describes data operations such as aggregate, join, and filter, which match what is defined in your query. These operations generate a new Dataset from the input Dataset. After we have an initial version of the query plan ready, the Catalyst optimizer will apply a series of transformations to convert it to an optimized query plan. Finally, the Spark SQL code generation mechanism translates the optimized query plan into a DAG of RDDs that is ready for execution. The query plans and the optimized query plans are internally represented as trees. So, at its core, the Catalyst optimizer contains a general library for representing trees and applying rules to manipulate them. On top of this library are several other libraries that are more specific to relational query processing.
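One way to observe this pipeline, sketched below with made-up column names and data, is to inspect the queryExecution field of a Dataset, which exposes the plan at each stage:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("PlanInspectionSketch").master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(("US", 100.0), ("IN", 80.0), ("US", 30.0)).toDF("country", "amount")
val query = sales.filter($"amount" > 50).groupBy($"country").agg(sum($"amount").as("total"))

// The plans below correspond to the stages of the transformation process
println(query.queryExecution.logical)        // initial (unresolved) logical plan
println(query.queryExecution.analyzed)       // resolved logical plan
println(query.queryExecution.optimizedPlan)  // after Catalyst's optimization rules
println(query.queryExecution.executedPlan)   // physical plan executed as RDD operations
```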
Catalyst has two types of query plans: Logical and Physical Plans. The Logical Plan describes the computations on the Datasets without defining how to carry out the specific computations. Typically, the Logical Plan generates a list of attributes or columns as output under a set of constraints on the generated rows. The Physical Plan describes the computations on Datasets with specific definitions on how to execute them (it is executable).
Let's explore the transformation steps in more detail. The initial query plan is essentially an unresolved Logical Plan; at this stage, we don't know the source of the Datasets or the columns (contained in the Dataset), nor do we know the types of the columns. The first step in this pipeline is the analysis step. During analysis, the catalog information is used to convert the unresolved Logical Plan into a resolved Logical Plan.
In the next step, a set of logical optimization rules is applied to the resolved Logical Plan, resulting in an optimized Logical Plan. In the physical planning step, the optimizer may generate multiple Physical Plans and compare their costs to pick the best one. The first version of the cost-based optimizer (CBO), built on top of Spark SQL, was released in Spark 2.2. More details on cost-based optimization are presented in Chapter 11, Tuning Spark SQL Components for Performance.
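The following hedged sketch, using a hypothetical table named web_logs, prints each stage of the plan with explain(true) and collects the table-level and column-level statistics that the Spark 2.2 cost-based optimizer uses when comparing Physical Plans:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CboSketch").master("local[*]").getOrCreate()

// Create a small managed table to experiment with (made-up data)
spark.range(0, 1000)
  .selectExpr("id", "id % 10 AS status")
  .write.mode("overwrite").saveAsTable("web_logs")

// Prints the parsed, analyzed, and optimized logical plans plus the physical plan
spark.table("web_logs").filter("status = 0").explain(true)

// Collect the statistics used by the cost-based optimizer
spark.sql("ANALYZE TABLE web_logs COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE web_logs COMPUTE STATISTICS FOR COLUMNS status")
```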
All three of these interfaces, namely DataFrames, Datasets, and SQL, share the same optimization pipeline, as illustrated in the following figure:
In Catalyst, there are two main types of optimizations, logical and physical:
Logical Optimizations: This includes the ability of the optimizer to push filter predicates down to the data source and enable execution to skip irrelevant data. For example, in the case of Parquet files, entire blocks can be skipped and comparisons on strings can be turned into cheaper integer comparisons via dictionary encoding. In the case of RDBMSs, the predicates are pushed down to the database to reduce the amount of data traffic.
Physical Optimizations: This includes the ability to intelligently choose between broadcast joins and shuffle joins to reduce network traffic, and to perform lower-level optimizations such as eliminating expensive object allocations and reducing virtual function calls. Hence, performance typically improves when DataFrames are introduced in your programs. A short sketch illustrating both kinds of optimization follows this list.
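Here is a short, hedged sketch of both kinds of optimization; the file path, column names, and table sizes are made up. The first part shows a filter being pushed down to a Parquet data source (visible as PushedFilters in the physical plan), and the second part uses the broadcast hint so the planner picks a broadcast join over a shuffle-based join:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("OptimizationSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Logical optimization: predicate pushdown to a Parquet data source
Seq((1, "open"), (2, "closed"), (3, "open"))
  .toDF("id", "state")
  .write.mode("overwrite").parquet("/tmp/tickets.parquet")

val openTickets = spark.read.parquet("/tmp/tickets.parquet").filter($"state" === "open")
// The FileScan node lists the predicate under PushedFilters, that is,
// it is evaluated by the Parquet reader rather than by Spark afterwards
openTickets.explain()

// Physical optimization: broadcast join versus shuffle join
val facts = spark.range(0, 1000000).selectExpr("id", "id % 100 AS country_id")
val dims = (0 until 100).map(i => (i.toLong, s"country_$i")).toDF("country_id", "name")

// The hint ships the small dimension table to every executor, so the
// physical plan uses BroadcastHashJoin instead of a shuffle-based join
facts.join(broadcast(dims), "country_id").explain()
```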
The Rule Executor is responsible for the analysis and logical optimization steps, while a set of strategies and the Rule Executor are responsible for the physical planning step. The Rule Executor transforms a tree into another tree of the same type by applying a set of rules in batches. These rules can be applied one or more times. Also, each of these rules is implemented as a transform. A transform is basically a function associated with every tree, and is used to implement a single rule. In Scala terms, the transformation is defined as a partial function (a function defined for a subset of its possible arguments). These are typically defined as case statements to determine whether the partial function (using pattern matching) is defined for the given input.
The Rule Executor makes the Physical Plan ready for execution by preparing scalar subqueries, ensuring that the input rows meet the requirements of the specific operation and applying the physical optimizations. For example, in the sort merge join operations, the input rows need to be sorted as per the join condition. The optimizer inserts the appropriate sort operations, as required, on the input rows before the sort merge join operation is executed.
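As an illustration of this preparation step, the following hedged sketch (with made-up data, and with the broadcast threshold disabled so that a sort merge join is chosen) lets you see the Sort operators inserted on both join inputs in the physical plan:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SortMergeJoinSketch").master("local[*]").getOrCreate()

// Disable broadcast joins so the planner selects a sort merge join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val left = spark.range(0, 100000).toDF("key")
val right = spark.range(0, 100000).toDF("key")

// Expect a SortMergeJoin preceded by Sort (and Exchange) operators on both inputs
left.join(right, "key").explain()
```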
