Building ETL Pipelines with Python - Brij Kishore Pandey - E-Book

Description

Modern extract, transform, and load (ETL) data engineering has favored Python for its broad range of uses and its large ecosystem of tools, applications, and open source components. With its simplicity and extensive library support, Python has emerged as a leading choice for data processing.
In this book, you’ll walk through the end-to-end process of ETL data pipeline development, starting with an introduction to the fundamentals of data pipelines and establishing a Python development environment in which to create them. Once you've explored ETL pipeline design principles and the ETL development process, you'll be equipped to design custom ETL pipelines. Next, you'll get to grips with the steps in the ETL process: extracting valuable data; transforming it through cleaning and manipulation while ensuring data integrity; and ultimately loading the processed data into storage systems. You’ll also review several ETL modules in Python, comparing their pros and cons for building data pipelines, and leverage cloud tools, such as AWS, to create scalable data pipelines. Lastly, you’ll learn about test-driven development for ETL pipelines to ensure safe deployments.
By the end of this book, you’ll have worked through several hands-on examples and created high-performance ETL pipelines that are robust, scalable, and resilient, all using Python.
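As a rough illustration of those three steps (a minimal sketch only, not code from the book; the file, column, and table names and the SQLite destination are all hypothetical), an ETL pipeline in Python can be as small as three chained functions:

# Minimal ETL sketch: extract a CSV, clean it with pandas, load it into SQLite.
# All names here (orders.csv, order_id, warehouse.db) are illustrative only.
import sqlite3

import pandas as pd


def extract(path: str) -> pd.DataFrame:
    # Extract: pull raw records from a source file
    return pd.read_csv(path)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: clean and reshape while preserving data integrity
    df = df.drop_duplicates()
    df = df.dropna(subset=["order_id"])
    df["order_date"] = pd.to_datetime(df["order_date"])
    return df


def load(df: pd.DataFrame, db_path: str, table: str) -> None:
    # Load: write the processed data into a storage system
    with sqlite3.connect(db_path) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)


if __name__ == "__main__":
    load(transform(extract("orders.csv")), "warehouse.db", "orders")

Run end to end, this sketch reads the source file, cleans it, and writes an orders table into the SQLite database.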





Building ETL Pipelines with Python

Create and deploy enterprise-ready ETL pipelines by employing modern methods

Brij Kishore Pandey

Emily Ro Schoof

BIRMINGHAM—MUMBAI

Building ETL Pipelines with Python

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Reshma Raman

Publishing Product Managers: Birjees Patel and Heramb Bhavsar

Content Development Editor: Shreya Moharir

Project Coordinator: Hemangi Lotlikar

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Proofreader: Safis Editing

Indexer: Subalakshmi Govindhan

Production Designer: Prashant Ghare

DevRel Marketing Coordinator: Nivedita Singh

First published: September 2023

Production reference: 1250923

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB

ISBN 978-1-80461-525-6

www.packtpub.com

To my daughter, Yashvi, who lights up my life; to Khushboo, my wife, my rock, and my inspiration; to my parents, Madhwa Nand and Veena, who taught me everything I know; and to my brothers, who have always stood by my side.

– Brij Kishore Pandey

Contributors

About the authors

Brij Kishore Pandey stands as a testament to dedication, innovation, and mastery in the vast domains of software engineering, data engineering, machine learning, and architectural design. His illustrious career, spanning over 14 years, has seen him wear multiple hats, transitioning seamlessly between roles and consistently pushing the boundaries of technological advancement.

Brij hails from the renowned SRM Institute of Science and Technology in Chennai, India, and his academic foundation in electrical and electronics engineering has served as the bedrock upon which he built his dynamic career. He has had the privilege of collaborating with industry behemoths such as JP Morgan Chase, American Express, 3M Company, Alaska Airlines, and Cigna Healthcare, contributing immensely with his diverse skill set. Presently, Brij assumes a dual role, guiding teams as a principal software engineer and providing visionary architectural solutions at ADP (Automatic Data Processing Inc.).

A fervent believer in continuous learning and sharing knowledge, Brij has graced various international platforms as a speaker, sharing insights, experiences, and best practices with budding engineers and seasoned professionals alike. His influence doesn’t end there; he has also taken on mentorship roles, guiding the next generation of tech aficionados, in association with Mentor Cruise Inc.

Beyond the world of code, algorithms, and systems, Brij finds profound solace in spiritual pursuits. He devotes time to the ardent practice of meditation and myriad yoga disciplines, echoing his belief in a holistic approach to well-being. Deep spiritual guidance from his revered guru, Avdhoot Shivanand, has been pivotal in shaping his inner journey and perspective.

Originally from India, Brij Kishore Pandey resides in Parsippany, New Jersey, USA, with his wife and daughter.

Emily Ro Schoof is a dedicated data specialist with a global perspective, showcasing her expertise as a data scientist and data engineer on both national and international platforms. Drawing from a background rooted in healthcare and experimental design, she brings a unique perspective to her data analytics roles. Emily’s multifaceted career ranges from working with UNICEF to design automated forecasting algorithms that identify conflict anomalies using near-real-time media monitoring, to serving as a subject matter expert for General Assembly’s Data Engineering course content and design. Her mission is to empower individuals to leverage data for positive impact. Emily holds the strong belief that providing easy access to resources that merge theory and real-world applications is the essential first step in this process.

About the reviewers

Adonis Castillo Cordero has been working in software engineering, data engineering, and business intelligence for the last five years. He is passionate about systems engineering, data, and leadership. His recent focus areas include the cloud-native landscape, business strategy, and data engineering and analytics. Based in Alajuela, Costa Rica, Adonis currently works as a lead data engineer and has worked for Fortune 500 companies such as Experian and 3M.

I’m grateful for my family and friends’ unwavering support during this project. Thanks to the publisher for their professionalism and guidance. I sincerely hope the book brings joy and is useful to readers.

Dr. Bipul Kumar is an AI consultant who brings over seven years of experience in deep learning and machine learning to the table. His journey in AI has encompassed various domains, including conversational AI, computer vision, and speech recognition. Bipul has had the privilege to work on impactful projects, including contributing to developing software as a medical device as the head of AI at Kaliber Labs. He also served as an AI consultant at AIX, specializing in developing conversational AI. His academic pursuits led him to earn a PhD from IIM Ranchi and a B.Tech from SRMIST. With a passion for research and innovation, Bipul has authored numerous publications and contributed to a patent application, humbly making his mark on the AI landscape.

Table of Contents

Preface

Part 1: Introduction to ETL, Data Pipelines, and Design Principles

1

A Primer on Python and the Development Environment

Introducing Python fundamentals

An overview of Python data structures

Python if…else conditions or conditional statements

Python looping techniques

Python functions

Object-oriented programming with Python

Working with files in Python

Establishing a development environment

Version control with Git tracking

Documenting environment dependencies with requirements.txt

Utilizing module management systems (MMSs)

Configuring a Pipenv environment in PyCharm

Summary

2

Understanding the ETL Process and Data Pipelines

What is a data pipeline?

How do we create a robust pipeline?

Pre-work – understanding your data

Design planning – planning your workflow

Architecture development – developing your resources

Putting it all together – project diagrams

What is an ETL data pipeline?

Batch processing

Streaming method

Cloud-native

Automating ETL pipelines

Exploring use cases for ETL pipelines

Summary

References

3

Design Principles for Creating Scalable and Resilient Pipelines

Technical requirements

Understanding the design patterns for ETL

Basic ETL design pattern

ETL-P design pattern

ETL-VP design pattern

ELT two-phase pattern

Preparing your local environment for installations

Open source Python libraries for ETL pipelines

Pandas

NumPy

Scaling for big data packages

Dask

Numba

Summary

References

Part 2: Designing ETL Pipelines with Python

4

Sourcing Insightful Data and Data Extraction Strategies

Technical requirements

What is data sourcing?

Accessibility to data

Types of data sources

Getting started with data extraction

CSV and Excel data files

Parquet data files

API connections

Databases

Data from web pages

Creating a data extraction pipeline using Python

Data extraction

Logging

Summary

References

5

Data Cleansing and Transformation

Technical requirements

Scrubbing your data

Data transformation

Data cleansing and transformation in ETL pipelines

Understanding the downstream applications of your data

Strategies for data cleansing and transformation in Python

Preliminary tasks – the importance of staging data

Transformation activities in Python

Creating data pipeline activity in Python

Summary

6

Loading Transformed Data

Technical requirements

Introduction to data loading

Choosing the load destination

Types of load destinations

Best practices for data loading

Optimizing data loading activities by controlling the data import method

Creating demo data

Full data loads

Incremental data loads

Precautions to consider

Tutorial – preparing your local environment for data loading activities

Downloading and installing PostgreSQL

Creating data schemas in PostgreSQL

Summary

7

Tutorial – Building an End-to-End ETL Pipeline in Python

Technical requirements

Introducing the project

The approach

The data

Creating tables in PostgreSQL

Sourcing and extracting the data

Transformation and data cleansing

Loading data into PostgreSQL tables

Making it deployable

Summary

8

Powerful ETL Libraries and Tools in Python

Technical requirements

Architecture of Python files

Configuring your local environment

config.ini

config.yaml

Part 1 – ETL tools in Python

Bonobo

Odo

Mito ETL

Riko

pETL

Luigi

Part 2 – pipeline workflow management platforms in Python

Airflow

Summary

Part 3: Creating ETL Pipelines in AWS

9

A Primer on AWS Tools for ETL Processes

Common data storage tools in AWS

Amazon RDS

Amazon Redshift

Amazon S3

Amazon EC2

Discussion – Building flexible applications in AWS

Leveraging S3 and EC2

Computing and automation with AWS

AWS Glue

AWS Lambda

AWS Step Functions

AWS big data tools for ETL pipelines

AWS Data Pipeline

Amazon Kinesis

Amazon EMR

Walk-through – creating a Free Tier AWS account

Prerequisites for running AWS from your device in AWS

AWS CLI

Docker

LocalStack

AWS SAM CLI

Summary

10

Tutorial – Creating an ETL Pipeline in AWS

Technical requirements

Creating a Python pipeline with Amazon S3, Lambda, and Step Functions

Setting the stage with the AWS CLI

Creating a “proof of concept” data pipeline in Python

Using Boto3 and Amazon S3 to read data

AWS Lambda functions

AWS Step Functions

An introduction to a scalable ETL pipeline using Bonobo, EC2, and RDS

Configuring your AWS environment with EC2 and RDS

Creating an RDS instance

Creating an EC2 instance

Creating a data pipeline locally with Bonobo

Adding the pipeline to AWS

Summary

11

Building Robust Deployment Pipelines in AWS

Technical requirements

What is CI/CD and why is it important?

The six key elements of CI/CD

Essential steps for CI/CD adoption

CI/CD is a continual process

Creating a robust CI/CD process for ETL pipelines in AWS

Creating a CI/CD pipeline

Building an ETL pipeline using various AWS services

Setting up a CodeCommit repository

Orchestrating with AWS CodePipeline

Testing the pipeline

Summary

Part 4: Automating and Scaling ETL Pipelines

12

Orchestration and Scaling in ETL Pipelines

Technical requirements

Performance bottlenecks

Inflexibility

Limited scalability

Operational overheads

Exploring the types of scaling

Vertical scaling

Horizontal scaling

Choose your scaling strategy

Processing requirements

Data volume

Cost

Complexity and skills

Reliability and availability

Data pipeline orchestration

Task scheduling

Error handling and recovery

Resource management

Monitoring and logging

Putting it together with a practical example

Summary

13

Testing Strategies for ETL Pipelines

Technical requirements

Benefits of testing data pipeline code

How to choose the right testing strategies for your ETL pipeline

How often should you test your ETL pipeline?

Creating tests for a simple ETL pipeline

Unit testing

Validation testing

Integration testing

End-to-end testing

Performance testing

Resilience testing

Best practices for a testing environment for ETL pipelines

Defining testing objectives

Establishing a testing framework

Automating ETL tests

Monitoring ETL pipelines

ETL testing challenges

Data privacy and security

Environment parity

Top ETL testing tools

Summary

14

Best Practices for ETL Pipelines

Technical requirements

Data quality

Poor scalability

Lack of error-handling and recovery methods

ETL logging in Python

Debugging and issue resolution

Auditing and compliance

Performance monitoring

Including contextual information

Handling exceptions and errors

The Goldilocks principle

Implementing logging in Python

Checkpoint for recovery

Avoiding SPOFs

Modularity and auditing

Modularity

Auditing

Summary

15

Use Cases and Further Reading

Technical requirements

New York Yellow Taxi data, ETL pipeline, and deployment

Step 1 – configuration

Step 2 – ETL pipeline script

Step 3 – unit tests

Building a robust ETL pipeline with US construction data in AWS

Prerequisites

Step 1 – data extraction

Step 2 – data transformation

Step 3 – data loading

Running the ETL pipeline

Bonus – deploying your ETL pipeline

Summary

Further reading

Index

Other Books You May Enjoy

Part 1: Introduction to ETL, Data Pipelines, and Design Principles

For the first part of this book, we will introduce the fundamentals of data pipelines in Python and set up your local development environment with Integrated Development Environments (IDEs), virtual environments, and Git version control. We will provide you with an overview of what Extract, Transform, Load (ETL) data pipelines are and how to design them yourself. As a word of caution, Python is at the core of this book; you must have a basic familiarity with Python to follow along.
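As a quick, illustrative preview of that setup (a minimal sketch only; the book itself walks through PyCharm, Pipenv, and Git in detail, and these steps are usually performed from a shell rather than from Python), a project environment can be created and its dependencies recorded like this:

# Illustrative only: create an isolated virtual environment with the standard
# library, install a package into it, and record dependencies in requirements.txt
# so the file can be tracked with Git. The project layout and paths shown are
# hypothetical and assume macOS/Linux.
import subprocess
import venv
from pathlib import Path

env_dir = Path("etl_project") / ".venv"        # hypothetical project folder
env_dir.parent.mkdir(exist_ok=True)
venv.create(env_dir, with_pip=True)            # isolated virtual environment

pip = env_dir / "bin" / "pip"                  # Scripts\pip.exe on Windows
subprocess.run([str(pip), "install", "pandas"], check=True)

# Freeze the environment's packages so the dependency list is reproducible
frozen = subprocess.run([str(pip), "freeze"], check=True,
                        capture_output=True, text=True)
(env_dir.parent / "requirements.txt").write_text(frozen.stdout)

The resulting requirements.txt file is what you would commit to Git so that collaborators can recreate the same environment.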

This section contains the following chapters:

Chapter 1, A Primer on Python and the Development Environment
Chapter 2, Understanding the ETL Process and Data Pipelines
Chapter 3, Design Principles for Creating Scalable and Resilient Pipelines