Building Big Data Pipelines with Apache Beam

Jan Lukavský

Description

Apache Beam is an open source unified programming model for implementing and executing data processing pipelines, including Extract, Transform, and Load (ETL), batch, and stream processing.

This book will help you to confidently build data processing pipelines with Apache Beam. You’ll start with an overview of Apache Beam and understand how to use it to implement basic pipelines. You’ll also learn how to test and run the pipelines efficiently. As you progress, you’ll explore how to structure your code for reusability and also use various Domain Specific Languages (DSLs). Later chapters will show you how to use schemas and query your data using (streaming) SQL. Finally, you’ll understand advanced Apache Beam concepts, such as implementing your own I/O connectors.

By the end of this book, you’ll have gained a deep understanding of the Apache Beam model and be able to apply it to solve problems.
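To give a flavor of the model, here is a minimal word-count sketch using the Beam Java SDK. It is illustrative only: the input and output paths are placeholders, and a runner (such as the bundled DirectRunner) is assumed to be on the classpath.

import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalWordCount {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    pipeline
        // Read lines of text; "input.txt" is a placeholder path.
        .apply(TextIO.read().from("input.txt"))
        // Split each line into individual words.
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via(line -> Arrays.asList(line.split("\\s+"))))
        // Count the occurrences of each word.
        .apply(Count.perElement())
        // Format each (word, count) pair as a line of text.
        .apply(MapElements.into(TypeDescriptors.strings())
            .via(kv -> kv.getKey() + ": " + kv.getValue()))
        // Write the results; "output" is a placeholder prefix.
        .apply(TextIO.write().to("output"));
    pipeline.run().waitUntilFinish();
  }
}

Because Beam separates the pipeline definition from its execution, the same code can be submitted to a different runner (for example, Apache Flink) by changing only the pipeline options.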

Formats: EPUB, MOBI

Page count: 407

Publication year: 2022




Building Big Data Pipelines with Apache Beam

Use a single programming model for both batch and stream data processing

Jan Lukavský

BIRMINGHAM—MUMBAI

Building Big Data Pipelines with Apache Beam

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Reshma Raman

Senior Editor: David Sugarman

Content Development Editor: Nathanya Dias

Technical Editor: Devanshi Ayare

Copy Editor: Safis Editing

Project Coordinator: Aparna Ravikumar Nair

Proofreader: Safis Editing

Indexer: Sejal Dsilva

Production Designer: Ponraj Dhandapani

Marketing Coordinator: Priyanka Mhatre

First published: January 2022

Production reference: 1161221

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80056-493-0

www.packt.com

Contributors

About the author

Jan Lukavský is a freelance big data architect and engineer and a committer on Apache Beam. He is a certified Apache Hadoop professional. He works on open source big data systems that combine batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.

I want to thank my family for all their support and patience, especially my wife, Pavla, and my children.

About the reviewer

Marcelo Henrique Neppel currently works as a software engineer at Canonical, where he works with technologies including Kubernetes and Juju. Previously, he coordinated two teams at a big data company, developing Apache Beam pipelines for several projects, and worked at BPlus Tecnologia on databases and integrations.

I would like to thank Packt for giving me the opportunity to contribute to this excellent book. I would like to thank my wife, Janaina, and my family, for always supporting me, and also Gabriel Verani, who introduced me to Apache Beam.

Table of Contents

Preface

Section 1: Apache Beam: Essentials

Chapter 1: Introduction to Data Processing with Apache Beam

Technical requirements

Why Apache Beam?

Writing your first pipeline

Running our pipeline against streaming data

Exploring the key properties of unbounded data

Measuring event time progress inside data streams

States and triggers

Timers

Assigning data to windows

Defining the life cycle of a state in terms of windows

Pane accumulation

Unifying batch and streaming data processing

Summary

Chapter 2: Implementing, Testing, and Deploying Basic Pipelines

Technical requirements

Setting up the environment for this book

Installing Apache Kafka

Making our code accessible from minikube

Installing Apache Flink

Reinstalling the complete environment

Task 1 – Calculating the K most frequent words in a stream of lines of text

Defining the problem

Discussing the problem decomposition

Implementing the solution

Testing our solution

Deploying our solution

Task 2 – Calculating the maximal length of a word in a stream

Defining the problem

Discussing the problem decomposition

Implementing the solution

Testing our solution

Deploying our solution

Specifying the PCollection Coder object and the TypeDescriptor object

Understanding default triggers, on time, and closing behavior

Introducing the primitive PTransform object – Combine

Task 3 – Calculating the average length of words in a stream

Defining the problem

Discussing the problem decomposition

Implementing the solution

Testing our solution

Deploying our solution

Task 4 – Calculating the average length of words in a stream with fixed lookback

Defining the problem

Discussing the problem decomposition

Implementing the solution

Testing our solution

Deploying our solution

Ensuring pipeline upgradability

Task 5 – Calculating performance statistics for a sport activity tracking application

Defining the problem

Discussing the problem decomposition

Implementing the solution

Testing our solution

Deploying our solution

Introducing the primitive PTransform object – GroupByKey

Introducing the primitive PTransform object – Partition

Summary

Chapter 3: Implementing Pipelines Using Stateful Processing

Technical requirements

Task 6 – Using an external service for data augmentation

Defining the problem

Discussing the problem decomposition

Implementing the solution

Testing our solution

Deploying our solution

Introducing the primitive PTransform object – stateless ParDo

Task 7 – Batching queries to an external RPC service

Defining the problem

Discussing the problem decomposition

Implementing the solution

Task 8 – Batching queries to an external RPC service with defined batch sizes

Defining the problem

Discussing the problem decomposition

Implementing the solution

Introducing the primitive PTransform object – stateful ParDo

Describing the theoretical properties of the stateful ParDo object

Applying the theoretical properties of the stateful ParDo object to the API of DoFn

Using side outputs

Defining droppable data in Beam

Task 9 – Separating droppable data from the rest of the data processing

Defining the problem

Discussing the problem decomposition

Implementing the solution

Testing our solution

Deploying our solution

Task 10 – Separating droppable data from the rest of the data processing, part 2

Defining the problem

Discussing the problem decomposition

Implementing the solution

Testing our solution

Deploying our solution

Using side inputs

Summary

Section 2: Apache Beam: Toward Improving Usability

Chapter 4: Structuring Code for Reusability

Technical requirements

Explaining PTransform expansion

Task 11 – Enhancing SportTracker by runner motivation using side inputs

Problem definition

Problem decomposition discussion

Solution implementation

Testing our solution

Deploying our solution

Introducing the composite transform – CoGroupByKey

Task 12 – Enhancing SportTracker by runner motivation using CoGroupByKey

Problem definition

Problem decomposition discussion

Solution implementation

Introducing the Join library DSL

Stream-to-stream joins explained

Task 13 – Writing a reusable PTransform – StreamingInnerJoin

Problem definition

Problem decomposition discussion

Solution implementation

Testing our solution

Deploying our solution

Table-stream duality

Summary

Chapter 5: Using SQL for Pipeline Implementation

Technical requirements

Understanding schemas

Attaching a schema to a PCollection

Transforms for PCollections with schemas

Implementing our first streaming pipeline using SQL

Task 14 – Implementing SQLMaxWordLength

Problem definition

Problem decomposition discussion

Solution implementation

Task 15 – Implementing SchemaSportTracker

Problem definition

Problem decomposition discussion

Solution implementation

Task 16 – Implementing SQLSportTrackerMotivation

Problem definition

Problem decomposition discussion

Solution implementation

Further development of Apache Beam SQL

Summary

Chapter 6: Using Your Preferred Language with Portability

Technical requirements

Introducing the portability layer

Portable representation of the pipeline

Job Service

SDK harness

Implementing our first pipelines in the Python SDK

Implementing our first Python pipeline

Implementing our first streaming Python pipeline

Task 17 – Implementing MaxWordLength in the Python SDK

Problem definition

Problem decomposition discussion

Solution implementation

Testing our solution

Deploying our solution

Python SDK type hints and coders

Task 18 – Implementing SportTracker in the Python SDK

Problem definition

Solution implementation

Testing our solution

Deploying our solution

Task 19 – Implementing RPCParDo in the Python SDK

Problem definition

Solution implementation

Testing our solution

Deploying our solution

Task 20 – Implementing SportTrackerMotivation in the Python SDK

Problem definition

Solution implementation

Deploying our solution

Using the DataFrame API

Interactive programming using InteractiveRunner

Introducing and using cross-language pipelines

Summary

Section 3: Apache Beam: Advanced Concepts

Chapter 7: Extending Apache Beam's I/O Connectors

Technical requirements

Defining splittable DoFn as a unification for bounded and unbounded sources

Task 21 – Implementing our own splittable DoFn – a streaming file source

Defining the problem

Discussing the problem decomposition

Implementing the solution

Testing our solution

Deploying our solution

Task 22 – A non-I/O application of splittable DoFn – PiSampler

Defining the problem

Discussing the problem decomposition

Implementing the solution

Testing our solution

Deploying our solution

The legacy Source API and the Read transform

Writing a custom data sink

The inherent non-determinism of Apache Beam pipelines

Summary

Chapter 8: Understanding How Runners Execute Pipelines

Describing the anatomy of an Apache Beam runner

Identifying which transforms should be overridden

Explaining the differences between classic and portable runners

Classic runners

Portable pipeline representations

The executable stage concept and the pipeline fusion process

Understanding how a runner handles state

Ensuring fault tolerance

Local state with periodic checkpoints

Remote state

Exploring the Apache Beam capability matrix

Understanding windowing semantics in depth

Merging and non-merging windows

Debugging pipelines and using Apache Beam metrics for observability

Using metrics in the Java SDK

Summary

Other Books You May Enjoy

Section 1: Apache Beam: Essentials

This section provides a general introduction to how most streaming data processing systems work, the general properties of data streams, and the problems that need to be solved to ensure computational correctness while balancing throughput and latency, all in the context of Apache Beam. It also covers how pipelines are implemented, tested, and run; a small sketch of the throughput/latency trade-off follows below.
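As a taste of what balancing throughput and latency looks like in code, the following sketch (Beam Java SDK) assumes an unbounded PCollection<String> named words; the window size, early-firing delay, and allowed lateness are illustrative values, not recommendations. Early, speculative panes lower latency, while allowed lateness preserves correctness for late-arriving data.

// Assumes the usual Beam imports (org.apache.beam.sdk.values,
// org.apache.beam.sdk.transforms, org.apache.beam.sdk.transforms.windowing)
// and org.joda.time.Duration.
PCollection<KV<String, Long>> counts = words
    .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
        // Emit a speculative result 10 seconds after the first element
        // arrives in a pane, trading completeness for lower latency...
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(10))))
        // ...while still accepting data that arrives up to one minute late.
        .withAllowedLateness(Duration.standardMinutes(1))
        .accumulatingFiredPanes())
    .apply(Count.perElement());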

This section comprises the following chapters:

Chapter 1, Introduction to Data Processing with Apache Beam

Chapter 2, Implementing, Testing, and Deploying Basic Pipelines

Chapter 3, Implementing Pipelines Using Stateful Processing