Data Engineering with Scala and Spark - Eric Tome - E-Book


Description

Most data engineers know that performance problems in a distributed computing environment can easily undermine the overall efficiency and effectiveness of data engineering tasks. While Python remains a popular choice for data engineering due to its ease of use, Scala shines in scenarios where the performance of distributed data processing is paramount.
This book will teach you how to leverage the Scala programming language on the Spark framework and use the latest cloud technologies to build continuous and triggered data pipelines. You’ll do this by setting up a data engineering environment for local development and scalable distributed cloud deployments using data engineering best practices, test-driven development, and CI/CD. You’ll also get to grips with the DataFrame, Dataset, and Spark SQL APIs and how to use them. Data profiling and data quality in Scala are covered as well, alongside techniques for orchestrating and performance-tuning your end-to-end pipelines to deliver data to your end users.
By the end of this book, you will be able to build streaming and batch data pipelines using Scala while following software engineering best practices.

You can read the e-book in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 310

Publication year: 2024




Data Engineering with Scala and Spark

Build streaming and batch pipelines that process massive amounts of data using Scala

Eric Tome

Rupam Bhattacharjee

David Radford

Data Engineering with Scala and Spark

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Associate Group Product Manager: Kaustubh Manglurkar

Associate Publishing Product Manager: Arindam Majumder

Book Project Manager: Kirti Pisat

Senior Editor: Tiksha Lad

Technical Editor: Kavyashree K S

Copy Editor: Safis Editing

Proofreader: Safis Editing

Indexer: Subalakshmi Govindhan

Production Designer: Alishon Mendonca

DevRel Marketing Coordinator: Nivedita Singh

First published: January 2024

Production reference: 1160124

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-80461-258-3

www.packtpub.com

Contributors

About the authors

Eric Tome has over 25 years of experience working with data. He has contributed to and led teams that ingested, cleansed, standardized, and prepared data used by business intelligence, data science, and operations teams. He has a background in mathematics and currently works as a senior solutions architect at Databricks, helping customers solve their data and AI challenges.

Rupam Bhattacharjee works as a lead data engineer at IBM. He has architected and developed data pipelines, processing massive structured and unstructured data using Spark and Scala for on-premises Hadoop and K8s clusters on the public cloud. He has a degree in electrical engineering.

David Radford has worked in big data for over 10 years, with a focus on cloud technologies. He led consulting teams for several years, completing a migration from legacy systems to modern data stacks. He holds a master’s degree in computer science and works as a senior solutions architect at Databricks.

About the reviewers

Bartosz Konieczny is a freelance data engineering enthusiast who has been coding for 15+ years. He has held various senior hands-on positions that helped him work on many data engineering problems in batch and stream processing, such as sessionization, data ingestion, data cleansing, ordered data processing, and data migration. He enjoys solving data challenges with public cloud services and open source technologies, especially Apache Spark, Apache Kafka, Apache Airflow, and Delta Lake. In addition, he blogs at waitingforcode.com.

Palanivelrajan is a highly passionate data evangelist with 19.5 years of experience in the data and analytics space. He has rich experience in architecting, developing, and delivering modern data platforms, data lakes, data warehouses, business intelligence, data science, and ML solutions. For the last five years, he has worked in engineering management, and he has 12+ years of experience in data architecture (big data and the cloud). He has built data teams and data practices and has been active in presales, planning, roadmaps, and execution. He has hired, managed, and mentored data engineers, data analysts, data scientists, ML engineers, and data architects. He has worked as a data engineering manager and a data architect for Sigmoid Analytics, Nike, the Data Team, and Tata Communications.

Table of Contents

Preface

Part 1 – Introduction to Data Engineering, Scala, and an Environment Setup

1

Scala Essentials for Data Engineers

Technical requirements

Understanding functional programming

Understanding objects, classes, and traits

Classes

Object

Trait

Working with higher-order functions (HOFs)

Examples of HOFs from the Scala collection library

Understanding polymorphic functions

Variance

Option type

Collections

Understanding pattern matching

Wildcard patterns

Constant patterns

Variable patterns

Constructor patterns

Sequence patterns

Tuple patterns

Typed patterns

Implicits in Scala

Summary

Further reading

2

Environment Setup

Technical requirements

Setting up a cloud environment

Leveraging cloud object storage

Using Databricks

Local environment setup

The build tool

Summary

Further reading

Part 2 – Data Ingestion, Transformation, Cleansing, and Profiling Using Scala and Spark

3

An Introduction to Apache Spark and Its APIs – DataFrame, Dataset, and Spark SQL

Technical requirements

Working with Apache Spark

How do Spark applications work?

What happens on executors?

Creating a Spark application using Scala

Spark stages

Shuffling

Understanding the Spark Dataset API

Understanding the Spark DataFrame API

Spark SQL

The select function

Creating temporary views

Summary

4

Working with Databases

Technical requirements

Understanding the Spark JDBC API

Working with the Spark JDBC API

Loading the database configuration

Creating a database interface

Creating a factory method for SparkSession

Performing various database operations

Working with databases

Updating the Database API with Spark read and write

Summary

5

Object Stores and Data Lakes

Understanding distributed file systems

Data lakes

Object stores

Streaming data

Working with streaming sources

Processing and sinks

Aggregating streams

Summary

6

Understanding Data Transformation

Technical requirements

Understanding the difference between transformations and actions

Using Select and SelectExpr

Filtering and sorting

Learning how to aggregate, group, and join data

Leveraging advanced window functions

Working with complex dataset types

Summary

7

Data Profiling and Data Quality

Technical requirements

Understanding components of Deequ

Performing data analysis

Leveraging automatic constraint suggestion

Defining constraints

Storing metrics using MetricsRepository

Detecting anomalies

Summary

Part 3 – Software Engineering Best Practices for Data Engineering in Scala

8

Test-Driven Development, Code Health, and Maintainability

Technical requirements

Introducing TDD

Creating unit tests

Performing integration testing

Checking code coverage

Running static code analysis

Installing SonarQube locally

Creating a project

Running SonarScanner

Understanding linting and code style

Linting code with WartRemover

Formatting code using scalafmt

Summary

9

CI/CD with GitHub

Technical requirements

Introducing CI/CD and GitHub

Understanding Continuous Integration (CI)

Understanding Continuous Delivery (CD)

Understanding the big picture of CI/CD

Working with GitHub

Cloning a repository

Understanding branches

Writing, committing, and pushing code

Creating pull requests

Reviewing and merging pull requests

Understanding GitHub Actions

Workflows

Jobs

Steps

Summary

Part 4 – Productionalizing Data Engineering Pipelines – Orchestration and Tuning

10

Data Pipeline Orchestration

Technical requirements

Understanding the basics of orchestration

Understanding core features of Apache Airflow

Apache Airflow’s extensibility

Extending beyond operators

Monitoring and UI

Hosting and deployment options

Designing data pipelines with Airflow

Working with Argo Workflows

Installing Argo Workflows

Understanding the core components of Argo Workflows

Taking a short detour

Creating an Argo workflow

Using Databricks Workflows

Leveraging Azure Data Factory

Primary components of ADF

Summary

11

Performance Tuning

Introducing the Spark UI

Navigating the Spark UI

The Jobs tab – overview of job execution

Leveraging the Spark UI for performance tuning

Identifying performance bottlenecks

Optimizing data shuffling

Memory management and garbage collection

Scaling resources

Analyzing SQL query performance

Right-sizing compute resources

Understanding the basics

Understanding data skewing, indexing, and partitioning

Data skew

Indexing and partitioning

Summary

Part 5 – End-to-End Data Pipelines

12

Building Batch Pipelines Using Spark and Scala

Understanding our business use case

What’s our marketing use case?

Understanding the data

Understanding the medallion architecture

The end-to-end pipeline

Ingesting the data

Transforming the data

Checking data quality

Creating a serving layer

Orchestrating our batch process

Summary

13

Building Streaming Pipelines Using Spark and Scala

Understanding our business use case

What’s our IoT use case?

Understanding the data

The end-to-end pipeline

Ingesting the data

Transforming the data

Creating a serving layer

Orchestrating our streaming process

Summary

Index

Other Books You May Enjoy

Part 1 – Introduction to Data Engineering, Scala, and an Environment Setup

In this part, Chapter 1 introduces Scala’s significance in data engineering, emphasizing its type safety and native compatibility with Spark. It covers key concepts such as functional programming, objects, classes, and higher-order functions. Chapter 2 then contrasts two data engineering environments: a cloud-based setup, which offers portability and easy access but carries associated maintenance costs, and a local setup, which requires initial configuration but avoids cloud expenses.
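As a small taste of the higher-order functions Chapter 1 covers, here is a minimal Scala sketch; the names `applyTwice` and `increment` are illustrative, not taken from the book:

```scala
// A higher-order function accepts (or returns) another function.
object HofSketch {
  // applyTwice takes a function f and applies it to x twice.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  def main(args: Array[String]): Unit = {
    // A function value with an explicit type annotation,
    // illustrating Scala's type safety.
    val increment: Int => Int = _ + 1
    println(applyTwice(increment, 3)) // prints 5

    // The Scala collection library exposes HOFs such as map and filter:
    println(List(1, 2, 3).map(_ * 2).filter(_ > 2)) // prints List(4, 6)
  }
}
```

Passing `increment` as an argument is what makes `applyTwice` higher-order; the same pattern underlies the collection methods, like `map`, discussed in the chapter.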

This part has the following chapters:

Chapter 1, Scala Essentials for Data Engineers

Chapter 2, Environment Setup