Spark has become a de facto standard for big data processing. Migrating data processing to Spark saves resources, streamlines your business focus, and modernizes workloads, creating new business opportunities through Spark’s advanced capabilities. Written by a senior solutions architect at Databricks, with experience in leading data science and data engineering teams in Fortune 500s as well as startups, this book is your exhaustive guide to achieving the Databricks Certified Associate Developer for Apache Spark certification on your first attempt.
You’ll explore the core components of Apache Spark, its architecture, and its optimization, while familiarizing yourself with the Spark DataFrame API and its components needed for data manipulation. You’ll also find out what Spark streaming is and why it’s important for modern data stacks, before learning about machine learning in Spark and its different use cases. What’s more, you’ll discover sample questions at the end of each section along with two mock exams to help you prepare for the certification exam.
By the end of this book, you’ll know what to expect in the exam and gain enough understanding of Spark and its tools to pass the exam. You’ll also be able to apply this knowledge in a real-world setting and take your skillset to the next level.
Databricks Certified Associate Developer for Apache Spark Using Python
The ultimate guide to getting certified in Apache Spark using practical examples with Python
Saba Shah
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Kaustubh Manglurkar
Publishing Product Manager: Chayan Majumdar
Book Project Manager: Hemangi Lotlikar
Senior Editor: Shrishti Pandey
Technical Editor: Kavyashree K S
Copy Editor: Safis Editing
Proofreader: Shrishti Pandey
Indexer: Pratik Shirodkar
Production Designer: Ponraj Dhandapani
Senior DevRel Marketing Coordinator: Nivedita Singh
First published: May 2024
Production reference: 1160524
Published by Packt Publishing Ltd., Grosvenor House, 11 St Paul’s Square, Birmingham, B3 1RB, UK
ISBN: 978-1-80461-978-0
www.packtpub.com
To my parents, Neelam Khalid (late) and Syed Khalid Mahmood, for their sacrifices and for exemplifying the power of patience and determination. To my loving husband, Arslan Shah, for being by my side through the ups and downs of life and being my support through it all. To Amna Shah for being my sister and friend. To Mariam Wasim for reminding me what true friendship looks like.
– Saba Shah
I have known and worked with Saba Shah for several years. Saba’s journey with Apache Spark began about 10 years ago. In this book, she will guide readers through the experiences she has gained on her journey.
In today’s dynamic data landscape, proficiency in Spark has become indispensable for data engineers, analysts, and scientists alike. This guide, meticulously crafted by seasoned experts, is your key to mastering Apache Spark and achieving certification success.
The journey begins with an insightful overview of the certification guide and exam, providing invaluable insights into what to expect and how to prepare effectively. From there, Saba delves deep into the core concepts of Spark, exploring its architecture, transformations, and the myriad of applications it enables.
As you progress through the chapters, you’ll gain a comprehensive understanding of Spark DataFrames and their operations, paving the way for advanced techniques and optimization strategies. From adaptive query execution to structured streaming, each topic is meticulously dissected, ensuring you gain a thorough grasp of Spark’s capabilities.
Machine learning enthusiasts will find a dedicated section on Spark ML, empowering them to harness the power of Spark for predictive analytics and model development. Additionally, two mock tests serve as the ultimate litmus test, allowing you to gauge your readiness and identify areas for improvement.
Whether you’re embarking on your Spark journey or seeking to validate your expertise with certification, this guide equips you with the knowledge, tools, and confidence needed to excel. Let this book be your trusted companion as you navigate the complexities of Apache Spark and embark on a journey of continuous learning and growth.
With Saba’s words, step-by-step instructions, screenshots, source code snippets, examples, and links to additional sources of information, you will learn how to continuously enhance your skills and be well-equipped to be a certified Apache Spark developer.
Best wishes on your certification journey!
Rod Waltermann
Distinguished Engineer
Chief Architect Cloud and AI Software
Lenovo
Saba Shah is a data and AI architect and evangelist, with a wide technical breadth and deep understanding of big data and machine learning technologies. She has experience leading data science and data engineering teams at Fortune 500 firms as well as start-ups. She started her career as a software engineer but soon transitioned to big data. She is currently a solutions architect at Databricks and works with enterprises, building their data strategy and helping them create a vision for the future with machine learning and predictive analytics. She currently resides in Research Triangle Park, North Carolina. In this book, she shares her expertise to empower you in the dynamic world of Spark.
Aviral Bhardwaj is a professional with six years of experience in the big data domain, showcasing expertise in technologies such as AWS and Databricks. Aviral has collaborated with companies including Knowledge Lens, ZS Associates, Amgen Inc., AstraZeneca, Lovelytics, and FanDuel as a contractor, and he currently works with GMG Inc. Furthermore, Aviral holds certifications as a Databricks Certified Spark Associate, Data Engineer Associate, and Data Engineering Professional, demonstrating a deep understanding of Databricks.
Rakesh Dey is a seasoned data engineer with eight years of experience in total, including six years with big data technologies such as Spark, Hive, and Impala. He has extensive knowledge of the Databricks platform and of building end-to-end ETL implementations on it. He has worked on projects with new technologies and helped customers achieve performance and cost optimization relative to on-premises solutions. He holds several Databricks certifications, from the intermediate to professional levels, and currently works at Deloitte.
This part covers the basics of the certification exam for PySpark and the rules that need to be kept in mind. It describes the various types of questions asked in the exam and how to prepare for them.
This part has the following chapter:
Chapter 1, Overview of the Certification Guide and Exam

Preparing for any task starts with thoroughly understanding the problem at hand and then devising a strategy to tackle it. An effective approach in this planning phase is to create a step-by-step methodology for addressing each aspect of the challenge. This lets smaller tasks be handled individually, helping you progress systematically through the challenge without feeling overwhelmed.
This chapter intends to demonstrate this step-by-step approach to working through your Spark certification exam. In this chapter, we will cover the following topics:
- Overview of the certification exam
- Different types of questions to expect in the exam
- Overview of the rest of the chapters in this book

We’ll start by providing an overview of the certification exam.
The exam consists of 60 questions. The time you’re given to attempt these questions is 120 minutes. This gives you about 2 minutes per question.
To pass the exam, you need to have a score of 70%, which means that you need to answer 42 questions correctly out of 60 for you to pass.
If you are well prepared, this time should be enough for you to answer the questions and also review them before the time finishes.
Next, we will see how the questions are distributed throughout the exam.
Exam questions are distributed across a few broad categories. The following table provides a breakdown of the questions by category:
Topic | Percentage of Exam | Number of Questions
Spark Architecture: Understanding of Concepts | 17% | 10
Spark Architecture: Understanding of Applications | 11% | 7
Spark DataFrame API Applications | 72% | 43

Table 1.1: Exam breakdown
Looking at this distribution, you will want to focus most of your exam preparation on the Spark DataFrame API, since this section covers around 72% of the exam (about 43 questions). If you can answer these questions correctly, passing the exam becomes much easier.
This doesn’t mean that you should neglect the Spark architecture areas, though. Architecture questions vary in difficulty and can occasionally be confusing, but most are straightforward, so they let you pick up easy points.
Let’s look at some of the other resources available that can help you prepare for this exam.
When you start planning to take the certification exam, the first thing you must do is master Spark concepts. This book will help you with these concepts. Once you’ve done this, it would be useful to do mock exams. There are two mock exams available in this book for you to take advantage of.
In addition, Databricks provides a practice exam, which is very useful for exam preparation. You can find it here: https://files.training.databricks.com/assessments/practice-exams/PracticeExam-DCADAS3-Python.pdf.
During the exam, you will be given access to the Spark documentation. This is done via Webassessor and its interface is a little different than the regular Spark documentation you’ll find on the internet. It would be good for you to familiarize yourself with this interface. You can find the interface at https://www.webassessor.com/zz/DATABRICKS/Python_v2.html. I recommend going through it and trying to find different packages and functions of Spark via this documentation to make yourself comfortable navigating it during the exam.
Next, we will look at how we can register for the exam.
Databricks is the company that has prepared these exams and certifications. Here is the link to register for the exam: https://www.databricks.com/learn/certification/apache-spark-developer-associate.
Next, we will look at some of the prerequisites for the exam.
Some prerequisites are needed before you can take the exam so that you can be successful in passing the certification. Some of the major ones are as follows:
- Grasp the fundamentals of Spark architecture, encompassing the principles of Adaptive Query Execution.
- Utilize the Spark DataFrame API proficiently for various data manipulation tasks (a short sketch after this list illustrates a few of them), such as the following:
  - Performing column operations, such as selection, renaming, and manipulation
  - Executing row operations, including filtering, dropping, sorting, and aggregating data
  - Conducting DataFrame-related tasks, such as joining, reading, writing, and implementing partitioning strategies
  - Demonstrating proficiency in working with user-defined functions (UDFs) and Spark SQL functions
- While not explicitly tested, a functional understanding of either Python or Scala is expected. The examination is available in both programming languages.
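To give you a feel for these skills, here is a short, purely illustrative sketch in PySpark. The data, column names, and output path are hypothetical and are not taken from the exam:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("exam-prep-sketch").getOrCreate()

# Hypothetical data for illustration only
df = spark.createDataFrame(
    [(1, "Sales", 50000), (2, "IT", 65000), (3, "Sales", 47000)],
    ["employeeId", "department", "salary"],
)

# Column operation: renaming a column
renamed = df.withColumnRenamed("salary", "employeeSalary")

# Row operations: filtering and sorting
high_paid = renamed.filter(F.col("employeeSalary") > 48000).orderBy("employeeSalary")

# A simple user-defined function (UDF) applied to a column
to_lower = F.udf(lambda value: value.lower(), StringType())
labeled = high_paid.withColumn("departmentLabel", to_lower(F.col("department")))

# Writing with a partitioning strategy (the output path is hypothetical)
labeled.write.mode("overwrite").partitionBy("department").parquet("/tmp/employees_out")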
Hopefully, by the end of this book, you will be able to fully grasp all these concepts and have done enough practice on your own to be prepared for the exam with full confidence.

Now, let’s discuss what to expect during the online proctored exam.
The Spark certification exam is an online proctored exam. What this means is that you will be taking the exam from the comfort of your home, but someone will be proctoring the exam online. I encourage you to understand the procedures and rules of the proctored exam in advance. This will save you a lot of trouble and anxiety at the time of the exam.
To give you an overview, throughout the exam session, the following procedures will be in place:
- Webcam monitoring will be conducted by a Webassessor proctor to ensure exam integrity
- You will need to present a valid form of identification with a photo
- You will need to take the exam alone
- Your desk needs to be decluttered, and there should be no other electronic devices in the room except the laptop that you’ll need for the exam
- There should not be any posters or charts on the walls of the room that may aid you in the exam
- The proctor will be listening to you during the exam as well, so you’ll want to make sure that you’re sitting in a quiet and comfortable environment
- It is recommended not to use your work laptop for this exam, as it requires software to be installed and your antivirus and firewall to be disabled

The proctor’s responsibilities are as follows:
- Overseeing your exam session to maintain exam integrity
- Addressing any queries related to the exam delivery process
- Offering technical assistance if needed

It’s important to note that the proctor will not offer any form of assistance regarding the exam content.

I recommend that you take sufficient time before the exam to set up the environment where you’ll be taking the exam. This will ensure a smooth online exam procedure where you can focus on the questions and not worry about anything else.
Now, let’s talk about the different types of questions that may appear in the exam.
There are different categories of questions that you will find in the exam. They can be broadly divided into theoretical and code questions. We will look at both categories and their respective subcategories in this section.
Theoretical questions are the questions where you will be asked about the conceptual understanding of certain topics. Theoretical questions can be subdivided further into different categories. Let’s look at some of these categories, along with example questions taken from previous exams that fall into them.
Explanation questions are ones where you need to define and explain something. It can also include how something works and what it does. Let’s look at an example.
Which of the following describes a worker node?
- Worker nodes are the nodes of a cluster that perform computations.
- Worker nodes are synonymous with executors.
- Worker nodes always have a one-to-one relationship with executors.
- Worker nodes are the most granular level of execution in the Spark execution hierarchy.
- Worker nodes are the coarsest level of execution in the Spark execution hierarchy.

Connections questions are such questions where you need to define how different things are related to each other or how they differ from each other. Let’s look at an example to demonstrate this.
Which of the following describes the relationship between worker nodes and executors?
- An executor is a Java Virtual Machine (JVM) running on a worker node.
- A worker node is a JVM running on an executor.
- There are always more worker nodes than executors.
- There are always the same number of executors and worker nodes.
- Executors and worker nodes are not related.

Scenario questions involve defining how things work in different if-else scenarios – for example, “If ______ occurs, then _____ happens.” This category also includes questions that ask which statement about a scenario is incorrect. Let’s look at an example to demonstrate this.
If Spark is running in cluster mode, which of the following statements about nodes is incorrect?
- There is a single worker node that contains the Spark driver and the executors.
- The Spark driver runs in its own non-worker node without any executors.
- Each executor is a running JVM inside a worker node.
- There is always more than one node.
- There might be more executors than total nodes or more total nodes than executors.

Categorization questions are such questions where you need to describe the categories that something belongs to. Let’s look at an example to demonstrate this.
Which of the following statements accurately describes stages?
- Tasks within a stage can be simultaneously executed by multiple machines.
- Various stages within a job can run concurrently.
- Stages comprise one or more jobs.
- Stages temporarily store transactions before committing them through actions.

Configuration questions are such questions where you need to outline how things will behave based on different cluster configurations. Let’s look at an example to demonstrate this.
Which of the following statements accurately describes Spark’s cluster execution mode?
- Cluster mode runs executor processes on gateway nodes.
- Cluster mode involves the driver being hosted on a gateway machine.
- In cluster mode, the Spark driver and the cluster manager are not co-located.
- The driver in cluster mode is located on a worker node.

Next, we’ll look at the code-based questions and their subcategories.
The next category is code-based questions. A large number of Spark API-based questions lie in this category. Code-based questions are the questions where you will be given a code snippet, and you will be asked questions about it. Code-based questions can be subdivided further into different categories. Let’s look at some of these categories, along with example questions taken from previous exams that fall into these different subcategories.
Function identification questions are such questions where you need to identify which function performs a given operation. It is important to know the different functions that are available in Spark for data manipulation, along with their syntax. Let’s look at an example to demonstrate this.
Which of the following code blocks returns a copy of the df DataFrame, where the column salary has been renamed employeeSalary?
df.withColumn(["salary", "employeeSalary"])df.withColumnRenamed("salary").alias("employeeSalary ")df.withColumnRenamed("salary", " employeeSalary ")df.withColumn("salary", " employeeSalary ")Fill-in-the-blank questions are such questions where you need to complete the code block by filling in the blanks. Let’s look at an example to demonstrate this.
Fill-in-the-blank questions are such questions where you need to complete the code block by filling in the blanks. Let’s look at an example to demonstrate this.

The following code block should return a DataFrame with the employeeId, salary, bonus, and department columns from the transactionsDf DataFrame. Choose the answer that correctly fills the blanks to accomplish this.
transactionsDf.__1__(__2__)

- __1__ = drop; __2__ = "employeeId", "salary", "bonus", "department"
- __1__ = filter; __2__ = "employeeId, salary, bonus, department"
- __1__ = select; __2__ = ["employeeId", "salary", "bonus", "department"]
- __1__ = select; __2__ = col(["employeeId", "salary", "bonus", "department"])

Order-lines-of-code questions are such questions where you need to place the lines of code in a certain order so that you can execute an operation correctly. Let’s look at an example to demonstrate this.
Which of the following code blocks creates a DataFrame that shows the mean of the salary column of the salaryDf DataFrame based on the department and state columns, where age is greater than 35?
i. salaryDf.filter(col("age") > 35)
ii. .filter(col("employeeID")
iii. .filter(col("employeeID").isNotNull())
iv. .groupBy("department")
v. .groupBy("department", "state")
vi. .agg(avg("salary").alias("mean_salary"))
vii. .agg(average("salary").alias("mean_salary"))

- i, ii, v, vi
- i, iii, v, vi
- i, iii, vi, vii
- i, ii, iv, vi
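For reference, the operation described in the question can be written as a single chain. The following is a minimal, illustrative sketch using hypothetical salaryDf data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

spark = SparkSession.builder.getOrCreate()

# Hypothetical data for illustration only
salaryDf = spark.createDataFrame(
    [(1, 40, "IT", "NC", 90000), (2, 30, "IT", "NC", 70000), (3, 50, "Sales", "NY", 80000)],
    ["employeeID", "age", "department", "state", "salary"],
)

mean_salary_df = (
    salaryDf.filter(col("age") > 35)            # keep rows where age is greater than 35
    .filter(col("employeeID").isNotNull())      # drop rows with a null employeeID
    .groupBy("department", "state")             # group by both columns
    .agg(avg("salary").alias("mean_salary"))    # compute the mean salary per group
)
mean_salary_df.show()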
This chapter provided an overview of the certification exam. At this point, you know what to expect in the exam and how to best prepare for it. To do so, we covered the different types of questions that you will encounter.

Going forward, each chapter of this book will equip you with practical knowledge and hands-on examples so that you can harness the power of Apache Spark for various data processing and analytics tasks.
This part will offer you a comprehensive understanding of Spark’s capabilities and operational principles. It will cover what Spark is, why it’s important, and some of the applications Spark is most useful for. It will tell you about the different types of users who can benefit from Spark. It will also cover the basics of Spark architecture and how applications are executed in Spark. It will detail narrow and wide Spark transformations and discuss lazy evaluation in Spark. It’s important to have this understanding because Spark works differently than other traditional frameworks.
This part has the following chapters:
- Chapter 2, Understanding Apache Spark and Its Applications
- Chapter 3, Spark Architecture and Transformations

With the advent of machine learning and data science, the world is seeing a paradigm shift. A tremendous amount of data is being collected every second, and it’s hard for computing power to keep up with this pace of rapid data growth. To make use of all this data, Spark has become a de facto standard for big data processing. Migrating data processing to Spark is not only a question of saving resources that will allow you to focus on your business; it’s also a means of modernizing your workloads to leverage the capabilities of Spark and the modern technology stack to create new business opportunities.
In this chapter, we will cover the following topics:
- What is Apache Spark?
- Why choose Apache Spark?
- Different components of Spark
- What are the Spark use cases?
- Who are the Spark users?

Apache Spark is an open-source big data framework that is used for multiple big data applications. The strength of Spark lies in its superior parallel processing capabilities, which make it a leader in its domain.
According to its website (https://spark.apache.org/), Spark is described as “The most widely-used engine for scalable computing.”
Apache Spark started as a research project at the UC Berkeley AMPLab in 2009 and moved to an open source license in 2010. Later, in 2013, it came under the Apache Software Foundation (https://spark.apache.org/). It gained popularity after 2013, and today, it serves as a backbone for a large number of big data products across various Fortune 500 companies and has thousands of developers actively working on it.
Spark came into being because of limitations in the Hadoop MapReduce framework. MapReduce’s main premise was to read data from disk, distribute it for parallel processing, apply map functions to it, and then reduce the intermediate results and save them back to disk. This back-and-forth of reading from and writing to disk quickly becomes time-consuming and costly.
To overcome this limitation, Spark introduced the concept of in-memory computation. On top of that, Spark has several capabilities that came as a result of different research initiatives. You will read more about them in the next section.
Spark’s foundation lies in its major capabilities such as in-memory computation, lazy evaluation, fault tolerance, and support for multiple languages such as Python, SQL, Scala, and R. We will discuss each one of them in detail in the following section.
Let’s start with in-memory computation.
The first major differentiator on which Spark’s foundation is built is in-memory computation. Remember when we discussed the Hadoop MapReduce technology? One of its major limitations is that it writes back to disk at each step. Spark saw this as an opportunity for improvement and introduced the concept of in-memory computation. The main idea is that the data remains in memory for as long as it is being worked on. If the data fits in memory at once, the need to write to disk at each step is eliminated, and the complete computation cycle can be done in memory. The thing to note here is that with the advent of big data, it’s hard to hold all the data in memory; even with heavyweight servers and cloud clusters, memory remains finite. This is where Spark’s internal framework for parallel processing comes into play. The Spark framework utilizes the underlying hardware resources efficiently, distributing computations across multiple cores and using the hardware’s capabilities to the maximum.
This tremendously reduces computation time, since the overhead of writing to disk and reading it back for the subsequent step is minimized, as long as the data fits in the memory available to the Spark cluster.
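To make this concrete, here is a minimal sketch of keeping a DataFrame in memory so that repeated computations reuse it rather than re-reading from disk. The input path and column names are hypothetical; cache() only marks the data to be kept in memory once an action first computes it.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

events = spark.read.parquet("/data/events")  # hypothetical input path
events.cache()                               # keep the data in memory once it has been computed

# The first action materializes and caches the data; later computations reuse it
# from memory instead of reading from disk again.
daily_counts = events.groupBy("event_date").count()
error_events = events.filter(col("status") == "error")

daily_counts.show()
error_events.show()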
Generally, when we work with programming frameworks, the backend compilers look at each statement and execute it. While this works well for most programming paradigms, big data and parallel processing call for a look-ahead model. Spark is well known for its parallel processing capabilities, and to achieve even better performance, it doesn’t execute code as it reads it. When we submit a Spark statement for execution, the first step is that Spark builds a logical map of the queries; once that map is built, it plans the best path of execution. You will read more about these intricacies in the Spark architecture chapters. Only once the plan is established does execution begin, and even then, Spark holds off running statements until it hits an “action” statement. There are two types of statements in Spark:
- Transformations
- Actions

You will learn more about the different types of Spark statements in detail in Chapter 3, where we discuss Spark architecture.
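To make the distinction concrete, here is a minimal sketch using a small, hypothetical DataFrame: the filter and select calls are transformations that only extend Spark’s logical plan, while count is an action that triggers planning, optimization, and execution.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical data for illustration only
orders = spark.createDataFrame(
    [(1, 250.0), (2, 80.0), (3, 120.0)], ["order_id", "amount"]
)

large_orders = orders.filter(col("amount") > 100)     # transformation: only recorded in the plan
selected = large_orders.select("order_id", "amount")  # transformation: still nothing is executed

print(selected.count())                               # action: Spark now plans, optimizes, and executes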
Here are a few advantages of lazy evaluation:

- Efficiency
- Code manageability
- Query and resource optimization
- Reduced complexities

Spark’s foundation is built on resilient distributed datasets (RDDs
