Spark has become a de facto standard for big data processing. Migrating data processing to Spark saves resources, streamlines your business focus, and modernizes workloads, creating new business opportunities through Spark’s advanced capabilities. Written by a senior solutions architect at Databricks, with experience in leading data science and data engineering teams in Fortune 500s as well as startups, this book is your exhaustive guide to achieving the Databricks Certified Associate Developer for Apache Spark certification on your first attempt.
You’ll explore the core components of Apache Spark, its architecture, and its optimization, while familiarizing yourself with the Spark DataFrame API and its components needed for data manipulation. You’ll also find out what Spark streaming is and why it’s important for modern data stacks, before learning about machine learning in Spark and its different use cases. What’s more, you’ll discover sample questions at the end of each section along with two mock exams to help you prepare for the certification exam.
By the end of this book, you’ll know what to expect in the exam and gain enough understanding of Spark and its tools to pass the exam. You’ll also be able to apply this knowledge in a real-world setting and take your skillset to the next level.
Databricks Certified Associate Developer for Apache Spark Using Python
The ultimate guide to getting certified in Apache Spark using practical examples with Python
Saba Shah
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Kaustubh Manglurkar
Publishing Product Manager: Chayan Majumdar
Book Project Manager: Hemangi Lotlikar
Senior Editor: Shrishti Pandey
Technical Editor: Kavyashree K S
Copy Editor: Safis Editing
Proofreader: Shrishti Pandey
Indexer: Pratik Shirodkar
Production Designer: Ponraj Dhandapani
Senior DevRel Marketing Coordinator: Nivedita Singh
First published: May 2024
Production reference: 1160524
Published by Packt Publishing Ltd., Grosvenor House, 11 St Paul’s Square, Birmingham, B3 1RB, UK
ISBN: 978-1-80461-978-0
www.packtpub.com
To my parents, Neelam Khalid (late) and Syed Khalid Mahmood, for their sacrifices and for exemplifying the power of patience and determination. To my loving husband, Arslan Shah, for being by my side through the ups and downs of life and being my support through it all. To Amna Shah for being my sister and friend. To Mariam Wasim for reminding me what true friendship looks like.
– Saba Shah
I have known and worked with Saba Shah for several years. Saba’s journey with Apache Spark began about 10 years ago. In this book, she will guide readers through the experiences she has gained on her journey.
In today’s dynamic data landscape, proficiency in Spark has become indispensable for data engineers, analysts, and scientists alike. This guide, meticulously crafted by seasoned experts, is your key to mastering Apache Spark and achieving certification success.
The journey begins with an insightful overview of the certification guide and exam, providing invaluable insights into what to expect and how to prepare effectively. From there, Saba delves deep into the core concepts of Spark, exploring its architecture, transformations, and the myriad of applications it enables.
As you progress through the chapters, you’ll gain a comprehensive understanding of Spark DataFrames and their operations, paving the way for advanced techniques and optimization strategies. From adaptive query execution to structured streaming, each topic is meticulously dissected, ensuring you gain a thorough grasp of Spark’s capabilities.
Machine learning enthusiasts will find a dedicated section on Spark ML, empowering them to harness the power of Spark for predictive analytics and model development. Additionally, two mock tests serve as the ultimate litmus test, allowing you to gauge your readiness and identify areas for improvement.
Whether you’re embarking on your Spark journey or seeking to validate your expertise with certification, this guide equips you with the knowledge, tools, and confidence needed to excel. Let this book be your trusted companion as you navigate the complexities of Apache Spark and embark on a journey of continuous learning and growth.
With Saba’s words, step-by-step instructions, screenshots, source code snippets, examples, and links to additional sources of information, you will learn how to continuously enhance your skills and be well-equipped to be a certified Apache Spark developer.
Best wishes on your certification journey!
Rod Waltermann
Distinguished Engineer
Chief Architect Cloud and AI Software
Lenovo
Saba Shah is a data and AI architect and evangelist, with a wide technical breadth and deep understanding of big data and machine learning technologies. She has experience leading data science and data engineering teams at Fortune 500 firms as well as start-ups. She started her career as a software engineer but soon transitioned to big data. She is currently a solutions architect at Databricks and works with enterprises, building their data strategy and helping them create a vision for the future with machine learning and predictive analytics. She currently resides in Research Triangle Park, North Carolina. In this book, she shares her expertise to empower you in the dynamic world of Spark.
Aviral Bhardwaj is a professional with six years of experience in the big data domain, showcasing expertise in technologies such as AWS and Databricks. Aviral has collaborated with companies including Knowledge Lens, ZS Associates, Amgen Inc., AstraZeneca, Lovelytics, and FanDuel as a contractor, and he currently works with GMG Inc. Furthermore, Aviral holds certifications as a Databricks Certified Spark Associate, Data Engineer Associate, and Data Engineering Professional, demonstrating a deep understanding of Databricks.
Rakesh Dey is a seasoned data engineer with eight years of experience in total, including six years with big data technologies such as Spark, Hive, and Impala. He has extensive knowledge of the Databricks platform and of building end-to-end ETL implementations on it. He has worked on projects with new technologies and helped customers achieve performance and cost optimization relative to on-premises solutions. He holds several Databricks certifications, from the intermediate to professional levels, and currently works at Deloitte.
This part covers the basics of the certification exam for PySpark and the rules that need to be kept in mind. It describes the various types of questions asked in the exam and how to prepare for them.
This part has the following chapter:
Chapter 1, Overview of the Certification Guide and Exam

Preparing for any task starts with thoroughly understanding the problem at hand and then devising a strategy to tackle it. An effective approach in this planning phase is to create a step-by-step methodology for addressing each aspect of the challenge. This lets smaller tasks be handled individually, helping you progress systematically through the challenge without feeling overwhelmed.
This chapter intends to demonstrate this step-by-step approach to working through your Spark certification exam. In this chapter, we will cover the following topics:
- Overview of the certification exam
- Different types of questions to expect in the exam
- Overview of the rest of the chapters in this book

We’ll start by providing an overview of the certification exam.
The exam consists of 60 questions. The time you’re given to attempt these questions is 120 minutes. This gives you about 2 minutes per question.
To pass the exam, you need to have a score of 70%, which means that you need to answer 42 questions correctly out of 60 for you to pass.
If you are well prepared, this time should be enough for you to answer the questions and also review them before the time finishes.
Next, we will see how the questions are distributed throughout the exam.
Exam questions are distributed across a few broad categories. The following table provides a breakdown of the questions by category:
Topic | Percentage of Exam | Number of Questions
Spark Architecture: Understanding of Concepts | 17% | 10
Spark Architecture: Understanding of Applications | 11% | 7
Spark DataFrame API Applications | 72% | 43

Table 1.1: Exam breakdown
Looking at this distribution, you will want to focus most of your exam preparation on the Spark DataFrame API, since this section covers around 72% of the exam (about 43 questions). If you can answer these questions correctly, passing the exam becomes much easier.
This doesn’t mean that you should neglect the Spark architecture areas, though. Architecture questions vary in difficulty and can occasionally be confusing, but most are straightforward, so they let you pick up easy points.
Let’s look at some of the other resources available that can help you prepare for this exam.
When you start planning to take the certification exam, the first thing you must do is master Spark concepts. This book will help you with these concepts. Once you’ve done this, it would be useful to do mock exams. There are two mock exams available in this book for you to take advantage of.
In addition, Databricks provides a practice exam, which is very useful for exam preparation. You can find it here: https://files.training.databricks.com/assessments/practice-exams/PracticeExam-DCADAS3-Python.pdf.
During the exam, you will be given access to the Spark documentation. This is done via Webassessor and its interface is a little different than the regular Spark documentation you’ll find on the internet. It would be good for you to familiarize yourself with this interface. You can find the interface at https://www.webassessor.com/zz/DATABRICKS/Python_v2.html. I recommend going through it and trying to find different packages and functions of Spark via this documentation to make yourself comfortable navigating it during the exam.
Next, we will look at how we can register for the exam.
Databricks is the company that has prepared these exams and certifications. Here is the link to register for the exam: https://www.databricks.com/learn/certification/apache-spark-developer-associate.
Next, we will look at some of the prerequisites for the exam.
Some prerequisites are needed before you can take the exam so that you can be successful in passing the certification. Some of the major ones are as follows:
- Grasp the fundamentals of Spark architecture, encompassing the principles of Adaptive Query Execution.
- Utilize the Spark DataFrame API proficiently for various data manipulation tasks (a short sketch after this list illustrates a few of them), such as the following:
  - Performing column operations, such as selection, renaming, and manipulation
  - Executing row operations, including filtering, dropping, sorting, and aggregating data
  - Conducting DataFrame-related tasks, such as joining, reading, writing, and implementing partitioning strategies
  - Demonstrating proficiency in working with user-defined functions (UDFs) and Spark SQL functions
- While not explicitly tested, a functional understanding of either Python or Scala is expected. The examination is available in both programming languages.
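To give you a feel for these skills, here is a short, purely illustrative sketch in PySpark. The data, column names, and output path are hypothetical and are not taken from the exam:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("exam-prep-sketch").getOrCreate()

# Hypothetical data for illustration only
df = spark.createDataFrame(
    [(1, "Sales", 50000), (2, "IT", 65000), (3, "Sales", 47000)],
    ["employeeId", "department", "salary"],
)

# Column operation: renaming a column
renamed = df.withColumnRenamed("salary", "employeeSalary")

# Row operations: filtering and sorting
high_paid = renamed.filter(F.col("employeeSalary") > 48000).orderBy("employeeSalary")

# A simple user-defined function (UDF) applied to a column
to_lower = F.udf(lambda value: value.lower(), StringType())
labeled = high_paid.withColumn("departmentLabel", to_lower(F.col("department")))

# Writing with a partitioning strategy (the output path is hypothetical)
labeled.write.mode("overwrite").partitionBy("department").parquet("/tmp/employees_out")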
Hopefully, by the end of this book, you will be able to fully grasp all these concepts and have done enough practice on your own to be prepared for the exam with full confidence.

Now, let’s discuss what to expect during the online proctored exam.
The Spark certification exam is an online proctored exam. What this means is that you will be taking the exam from the comfort of your home, but someone will be proctoring the exam online. I encourage you to understand the procedures and rules of the proctored exam in advance. This will save you a lot of trouble and anxiety at the time of the exam.
To give you an overview, throughout the exam session, the following procedures will be in place:
- Webcam monitoring will be conducted by a Webassessor proctor to ensure exam integrity
- You will need to present a valid form of identification with a photo
- You will need to take the exam alone
- Your desk needs to be decluttered, and there should be no other electronic devices in the room except the laptop that you’ll need for the exam
- There should not be any posters or charts on the walls of the room that may aid you in the exam
- The proctor will be listening to you during the exam as well, so you’ll want to make sure that you’re sitting in a quiet and comfortable environment
- It is recommended not to use your work laptop for this exam, as it requires software to be installed and your antivirus and firewall to be disabled

The proctor’s responsibilities are as follows:
- Overseeing your exam session to maintain exam integrity
- Addressing any queries related to the exam delivery process
- Offering technical assistance if needed

It’s important to note that the proctor will not offer any form of assistance regarding the exam content.

I recommend that you take sufficient time before the exam to set up the environment where you’ll be taking the exam. This will ensure a smooth online exam procedure where you can focus on the questions and not worry about anything else.
Now, let’s talk about the different types of questions that may appear in the exam.
There are different categories of questions that you will find in the exam. They can be broadly divided into theoretical and code questions. We will look at both categories and their respective subcategories in this section.
Theoretical questions are the questions where you will be asked about the conceptual understanding of certain topics. Theoretical questions can be subdivided further into different categories. Let’s look at some of these categories, along with example questions taken from previous exams that fall into them.
Explanation questions are ones where you need to define and explain something. It can also include how something works and what it does. Let’s look at an example.
Which of the following describes a worker node?
- Worker nodes are the nodes of a cluster that perform computations.
- Worker nodes are synonymous with executors.
- Worker nodes always have a one-to-one relationship with executors.
- Worker nodes are the most granular level of execution in the Spark execution hierarchy.
- Worker nodes are the coarsest level of execution in the Spark execution hierarchy.

Connections questions are such questions where you need to define how different things are related to each other or how they differ from each other. Let’s look at an example to demonstrate this.
Which of the following describes the relationship between worker nodes and executors?
- An executor is a Java Virtual Machine (JVM) running on a worker node.
- A worker node is a JVM running on an executor.
- There are always more worker nodes than executors.
- There are always the same number of executors and worker nodes.
- Executors and worker nodes are not related.

Scenario questions involve defining how things work in different if-else scenarios – for example, “If ______ occurs, then _____ happens.” This category also includes questions that ask which statement about a scenario is incorrect. Let’s look at an example to demonstrate this.
If Spark is running in cluster mode, which of the following statements about nodes is incorrect?
- There is a single worker node that contains the Spark driver and the executors.
- The Spark driver runs in its own non-worker node without any executors.
- Each executor is a running JVM inside a worker node.
- There is always more than one node.
- There might be more executors than total nodes or more total nodes than executors.

Categorization questions are such questions where you need to describe the categories that something belongs to. Let’s look at an example to demonstrate this.
Which of the following statements accurately describes stages?
- Tasks within a stage can be simultaneously executed by multiple machines.
- Various stages within a job can run concurrently.
- Stages comprise one or more jobs.
- Stages temporarily store transactions before committing them through actions.

Configuration questions are such questions where you need to outline how things will behave based on different cluster configurations. Let’s look at an example to demonstrate this.
Which of the following statements accurately describes Spark’s cluster execution mode?
- Cluster mode runs executor processes on gateway nodes.
- Cluster mode involves the driver being hosted on a gateway machine.
- In cluster mode, the Spark driver and the cluster manager are not co-located.
- The driver in cluster mode is located on a worker node.

Next, we’ll look at the code-based questions and their subcategories.
The next category is code-based questions. A large number of Spark API-based questions lie in this category. Code-based questions are the questions where you will be given a code snippet, and you will be asked questions about it. Code-based questions can be subdivided further into different categories. Let’s look at some of these categories, along with example questions taken from previous exams that fall into these different subcategories.
Function identification questions are such questions where you need to identify which function performs a given operation. It is important to know the different functions that are available in Spark for data manipulation, along with their syntax. Let’s look at an example to demonstrate this.
Which of the following code blocks returns a copy of the df DataFrame, where the column salary has been renamed employeeSalary?
df.withColumn(["salary", "employeeSalary"])df.withColumnRenamed("salary").alias("employeeSalary ")df.withColumnRenamed("salary", " employeeSalary ")df.withColumn("salary", " employeeSalary ")Fill-in-the-blank questions are such questions where you need to complete the code block by filling in the blanks. Let’s look at an example to demonstrate this.
Fill-in-the-blank questions are such questions where you need to complete the code block by filling in the blanks. Let’s look at an example to demonstrate this.

The following code block should return a DataFrame with the employeeId, salary, bonus, and department columns from the transactionsDf DataFrame. Choose the answer that correctly fills the blanks to accomplish this.
transactionsDf.__1__(__2__)

- __1__ = drop; __2__ = "employeeId", "salary", "bonus", "department"
- __1__ = filter; __2__ = "employeeId, salary, bonus, department"
- __1__ = select; __2__ = ["employeeId", "salary", "bonus", "department"]
- __1__ = select; __2__ = col(["employeeId", "salary", "bonus", "department"])

Order-lines-of-code questions are such questions where you need to place the lines of code in a certain order so that you can execute an operation correctly. Let’s look at an example to demonstrate this.
Which of the following code blocks creates a DataFrame that shows the mean of the salary column of the salaryDf DataFrame based on the department and state columns, where age is greater than 35?
i. salaryDf.filter(col("age") > 35)
ii. .filter(col("employeeID")
iii. .filter(col("employeeID").isNotNull())
iv. .groupBy("department")
v. .groupBy("department", "state")
vi. .agg(avg("salary").alias("mean_salary"))
vii. .agg(average("salary").alias("mean_salary"))

- i, ii, v, vi
- i, iii, v, vi
- i, iii, vi, vii
- i, ii, iv, vi
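For reference, the operation described in the question can be written as a single chain. The following is a minimal, illustrative sketch using hypothetical salaryDf data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

spark = SparkSession.builder.getOrCreate()

# Hypothetical data for illustration only
salaryDf = spark.createDataFrame(
    [(1, 40, "IT", "NC", 90000), (2, 30, "IT", "NC", 70000), (3, 50, "Sales", "NY", 80000)],
    ["employeeID", "age", "department", "state", "salary"],
)

mean_salary_df = (
    salaryDf.filter(col("age") > 35)            # keep rows where age is greater than 35
    .filter(col("employeeID").isNotNull())      # drop rows with a null employeeID
    .groupBy("department", "state")             # group by both columns
    .agg(avg("salary").alias("mean_salary"))    # compute the mean salary per group
)
mean_salary_df.show()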
This chapter provided an overview of the certification exam. At this point, you know what to expect in the exam and how to best prepare for it. To do so, we covered the different types of questions that you will encounter.

Going forward, each chapter of this book will equip you with practical knowledge and hands-on examples so that you can harness the power of Apache Spark for various data processing and analytics tasks.
This part will offer you a comprehensive understanding of Spark’s capabilities and operational principles. It will cover what Spark is, why it’s important, and some of the applications Spark is most useful for. It will tell you about the different types of users who can benefit from Spark. It will also cover the basics of Spark architecture and how applications are executed in Spark. It will detail narrow and wide Spark transformations and discuss lazy evaluation in Spark. It’s important to have this understanding because Spark works differently than other traditional frameworks.
This part has the following chapters:
- Chapter 2, Understanding Apache Spark and Its Applications
- Chapter 3, Spark Architecture and Transformations

With the advent of machine learning and data science, the world is seeing a paradigm shift. A tremendous amount of data is being collected every second, and it’s hard for computing power to keep up with this pace of rapid data growth. To make use of all this data, Spark has become a de facto standard for big data processing. Migrating data processing to Spark is not only a question of saving resources that will allow you to focus on your business; it’s also a means of modernizing your workloads to leverage the capabilities of Spark and the modern technology stack to create new business opportunities.
In this chapter, we will cover the following topics:
- What is Apache Spark?
- Why choose Apache Spark?
- Different components of Spark
- What are the Spark use cases?
- Who are the Spark users?

Apache Spark is an open-source big data framework that is used for multiple big data applications. The strength of Spark lies in its superior parallel processing capabilities, which make it a leader in its domain.
According to its website (https://spark.apache.org/), Spark is described as “The most widely-used engine for scalable computing.”
Apache Spark started as a research project at the UC Berkeley AMPLab in 2009 and moved to an open source license in 2010. Later, in 2013, it came under the Apache Software Foundation (https://spark.apache.org/). It gained popularity after 2013, and today, it serves as a backbone for a large number of big data products across various Fortune 500 companies and has thousands of developers actively working on it.
Spark came into being because of limitations in the Hadoop MapReduce framework. MapReduce’s main premise was to read data from disk, distribute it for parallel processing, apply map functions to it, and then reduce the intermediate results and save them back to disk. This back-and-forth of reading from and writing to disk quickly becomes time-consuming and costly.
To overcome this limitation, Spark introduced the concept of in-memory computation. On top of that, Spark has several capabilities that came as a result of different research initiatives. You will read more about them in the next section.
Spark’s foundation lies in its major capabilities such as in-memory computation, lazy evaluation, fault tolerance, and support for multiple languages such as Python, SQL, Scala, and R. We will discuss each one of them in detail in the following section.
Let’s start with in-memory computation.
The first major differentiator on which Spark’s foundation is built is in-memory computation. Remember when we discussed the Hadoop MapReduce technology? One of its major limitations is that it writes back to disk at each step. Spark saw this as an opportunity for improvement and introduced the concept of in-memory computation. The main idea is that the data remains in memory for as long as it is being worked on. If the data fits in memory at once, the need to write to disk at each step is eliminated, and the complete computation cycle can be done in memory. The thing to note here is that with the advent of big data, it’s hard to hold all the data in memory; even with heavyweight servers and cloud clusters, memory remains finite. This is where Spark’s internal framework for parallel processing comes into play. The Spark framework utilizes the underlying hardware resources efficiently, distributing computations across multiple cores and using the hardware’s capabilities to the maximum.
This tremendously reduces computation time, since the overhead of writing to disk and reading it back for the subsequent step is minimized, as long as the data fits in the memory available to the Spark cluster.
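To make this concrete, here is a minimal sketch of keeping a DataFrame in memory so that repeated computations reuse it rather than re-reading from disk. The input path and column names are hypothetical; cache() only marks the data to be kept in memory once an action first computes it.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

events = spark.read.parquet("/data/events")  # hypothetical input path
events.cache()                               # keep the data in memory once it has been computed

# The first action materializes and caches the data; later computations reuse it
# from memory instead of reading from disk again.
daily_counts = events.groupBy("event_date").count()
error_events = events.filter(col("status") == "error")

daily_counts.show()
error_events.show()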
Generally, when we work with programming frameworks, the backend compilers look at each statement and execute it. While this works well for most programming paradigms, big data and parallel processing call for a look-ahead model. Spark is well known for its parallel processing capabilities, and to achieve even better performance, it doesn’t execute code as it reads it. When we submit a Spark statement for execution, the first step is that Spark builds a logical map of the queries; once that map is built, it plans the best path of execution. You will read more about these intricacies in the Spark architecture chapters. Only once the plan is established does execution begin, and even then, Spark holds off running statements until it hits an “action” statement. There are two types of statements in Spark:
- Transformations
- Actions

You will learn more about the different types of Spark statements in detail in Chapter 3, where we discuss Spark architecture.
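To make the distinction concrete, here is a minimal sketch using a small, hypothetical DataFrame: the filter and select calls are transformations that only extend Spark’s logical plan, while count is an action that triggers planning, optimization, and execution.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical data for illustration only
orders = spark.createDataFrame(
    [(1, 250.0), (2, 80.0), (3, 120.0)], ["order_id", "amount"]
)

large_orders = orders.filter(col("amount") > 100)     # transformation: only recorded in the plan
selected = large_orders.select("order_id", "amount")  # transformation: still nothing is executed

print(selected.count())                               # action: Spark now plans, optimizes, and executes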
Here are a few advantages of lazy evaluation:

- Efficiency
- Code manageability
- Query and resource optimization
- Reduced complexities

Spark’s foundation is built on resilient distributed datasets (RDDs
