Data Ingestion with Python Cookbook

A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process

Gláucia Esppenchutz

BIRMINGHAM—MUMBAI

Data Ingestion with Python Cookbook

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Reshma Raman

Publishing Product Manager: Arindam Majumdar

Senior Editor: Tiksha Lad

Technical Editor: Devanshi Ayare

Copy Editor: Safis Editing

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Sejal Dsilva

Production Designer: Jyoti Chauhan

Marketing Coordinator: Nivedita Singh

First published: May 2023

Production reference: 1300523

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-83763-260-2

www.packtpub.com

This book represents a lot and wouldn’t have been possible without my loving husband, Lincoln, and his support and understanding during this challenging endeavor. I want to thank all my friends who didn’t let me give up and always boosted my spirits, along with my grandmother, who always believed in me, helped, and said I would do big things one day. Finally, I want to thank my beloved and four-pawed best friend, who is at peace, Minduim, for “helping” me to write this book.

– Gláucia Esppenchutz

Contributors

About the author

Gláucia Esppenchutz is a data engineer with expertise in managing data pipelines and vast amounts of data using cloud and on-premises technologies. She has worked at companies such as Globo.com, BMW Group, and Cloudera. Currently, she works at AiFi, specializing in data operations for autonomous systems.

She comes from the biomedical field and shifted her career ten years ago to chase the dream of working closely with technology and data. She is in constant contact with the open source community, mentoring people and helping to manage projects, and has collaborated with the Apache, PyLadies group, FreeCodeCamp, Udacity, and MentorColor communities.

I want to thank my patient and beloved husband and my friends. Thanks also to my mentors in the Python open source community and the DataBootCamp founders, who guided me at the beginning of my journey.

Thanks to the Packt team, who helped me through some hard times; you were terrific!

About the reviewers

Bitthal Khaitan is currently working as a big data and cloud engineer with CVS Health, a Fortune 4 organization. He has a demonstrated history of working in the cloud, data, and analytics industry for 12+ years. His primary certified skills are Google Cloud Platform (GCP), the big data ecosystem (Hadoop, Spark, and so on), and data warehousing on Teradata. He has worked in all phases of the SDLC of DW/BI and big data projects, with strong expertise in the US healthcare, insurance, and retail domains. He actively helps new graduates with mentoring, resume reviews, and job-hunting tips in the data engineering domain. Over 20,000 people follow Bitthal on LinkedIn. He is currently based in Dallas, Texas, USA.

Jagjeet Makhija is a highly accomplished technology leader with over 20 years of experience. They are skilled in various domains, including AI, data warehouse architecture, and business analytics, and have a strong passion for staying ahead of technology trends such as AI and ChatGPT. Jagjeet is recognized for their significant contributions to the industry, particularly in complex proofs of concept and in integrating Microsoft products with ChatGPT. They are also an avid book reviewer and have actively shared their extensive knowledge and expertise through presentations, blog articles, and online forums.

Krishnan Raghavan is an IT professional with over 20 years of experience in software development and delivery excellence across multiple domains and technologies, ranging from C++ and Java to Python, data warehousing, and big data tools and technologies. Krishnan tries to give back to the community by being part of the GDG Pune volunteer group, helping the team organize events. When not working, Krishnan likes to spend time with his wife and daughter, as well as reading fiction, non-fiction, and technical books. Currently, he is unsuccessfully trying to learn how to play the guitar.

You can connect with Krishnan via email at [email protected] or on LinkedIn at www.linkedin.com/in/krishnan-raghavan.

I would like to thank my wife, Anita, and daughter, Ananya, for giving me the time and space to review this book.

Table of Contents

Preface

Part 1: Fundamentals of Data Ingestion

1

Introduction to Data Ingestion

Technical requirements

Setting up Python and its environment

Getting ready

How to do it…

How it works…

There’s more…

See also

Installing PySpark

Getting ready

How to do it…

How it works…

There’s more…

See also

Configuring Docker for MongoDB

Getting ready

How to do it…

How it works…

There’s more…

See also

Configuring Docker for Airflow

Getting ready

How to do it…

How it works…

See also

Creating schemas

Getting ready

How to do it…

How it works…

See also

Applying data governance in ingestion

Getting ready

How to do it…

How it works…

See also

Implementing data replication

Getting ready

How to do it…

How it works…

There’s more…

Further reading

2

Principles of Data Access – Accessing Your Data

Technical requirements

Implementing governance in a data access workflow

Getting ready

How to do it…

How it works…

See also

Accessing databases and data warehouses

Getting ready

How to do it…

How it works…

There’s more…

See also

Accessing SSH File Transfer Protocol (SFTP) files

Getting ready

How to do it…

How it works…

There’s more…

See also

Retrieving data using API authentication

Getting ready

How to do it…

How it works…

There’s more…

See also

Managing encrypted files

Getting ready

How to do it…

How it works…

There’s more…

See also

Accessing data from AWS using S3

Getting ready

How to do it…

How it works…

There’s more…

See also

Accessing data from GCP using Cloud Storage

Getting ready

How to do it…

How it works…

There’s more…

Further reading

3

Data Discovery – Understanding Our Data Before Ingesting It

Technical requirements

Documenting the data discovery process

Getting ready

How to do it…

How it works…

Configuring OpenMetadata

Getting ready

How to do it…

How it works…

There’s more…

See also

Connecting OpenMetadata to our database

Getting ready

How to do it…

How it works…

Further reading

Other tools

4

Reading CSV and JSON Files and Solving Problems

Technical requirements

Reading a CSV file

Getting ready

How to do it…

How it works…

There’s more…

See also

Reading a JSON file

Getting ready

How to do it…

How it works…

There’s more…

See also

Creating a SparkSession for PySpark

Getting ready

How to do it…

How it works…

There’s more…

See also

Using PySpark to read CSV files

Getting ready

How to do it…

How it works…

There’s more…

See also

Using PySpark to read JSON files

Getting ready

How to do it…

How it works…

There’s more…

See also

Further reading

5

Ingesting Data from Structured and Unstructured Databases

Technical requirements

Configuring a JDBC connection

Getting ready

How to do it…

How it works…

There’s more…

See also

Ingesting data from a JDBC database using SQL

Getting ready

How to do it…

How it works…

There’s more…

See also

Connecting to a NoSQL database (MongoDB)

Getting ready

How to do it…

How it works…

There’s more…

See also

Creating our NoSQL table in MongoDB

Getting ready

How to do it…

How it works…

There’s more…

See also

Ingesting data from MongoDB using PySpark

Getting ready

How to do it…

How it works…

There’s more…

See also

Further reading

6

Using PySpark with Defined and Non-Defined Schemas

Technical requirements

Applying schemas to data ingestion

Getting ready

How to do it…

How it works…

There’s more…

See also

Importing structured data using a well-defined schema

Getting ready

How to do it…

How it works…

There’s more…

See also

Importing unstructured data without a schema

Getting ready

How to do it…

How it works…

Ingesting unstructured data with a well-defined schema and format

Getting ready

How to do it…

How it works…

There’s more…

See also

Inserting formatted SparkSession logs to facilitate your work

Getting ready

How to do it…

How it works…

There’s more…

See also

Further reading

7

Ingesting Analytical Data

Technical requirements

Ingesting Parquet files

Getting ready

How to do it…

How it works…

There’s more…

See also

Ingesting Avro files

Getting ready

How to do it…

How it works…

There’s more…

See also

Applying schemas to analytical data

Getting ready

How to do it…

How it works…

There’s more…

See also

Filtering data and handling common issues

Getting ready

How to do it…

How it works…

There’s more…

See also

Ingesting partitioned data

Getting ready

How to do it…

How it works…

There’s more…

See also

Applying reverse ETL

Getting ready

How to do it…

How it works…

There’s more…

See also

Selecting analytical data for reverse ETL

Getting ready

How to do it…

How it works…

See also

Further reading

Part 2: Structuring the Ingestion Pipeline

8

Designing Monitored Data Workflows

Technical requirements

Inserting logs

Getting ready

How to do it…

How it works…

See also

Using log-level types

Getting ready

How to do it…

How it works…

There’s more…

See also

Creating standardized logs

Getting ready

How to do it…

How it works…

There’s more…

See also

Monitoring our data ingest file size

Getting ready

How to do it…

How it works…

There’s more…

See also

Logging based on data

Getting ready

How to do it…

How it works…

There’s more…

See also

Retrieving SparkSession metrics

Getting ready

How to do it…

How it works…

There’s more…

See also

Further reading

9

Putting Everything Together with Airflow

Technical requirements

Installing Airflow

Configuring Airflow

Getting ready

How to do it…

How it works…

See also

Creating DAGs

Getting ready

How to do it…

How it works…

There’s more…

See also

Creating custom operators

Getting ready

How to do it…

How it works…

There’s more…

See also

Configuring sensors

Getting ready

How to do it…

How it works…

See also

Creating connectors in Airflow

Getting ready

How to do it…

How it works…

There’s more…

See also

Creating parallel ingest tasks

Getting ready

How to do it…

How it works…

There’s more…

See also

Defining ingest-dependent DAGs

Getting ready

How to do it…

How it works…

There’s more…

See also

Further reading

10

Logging and Monitoring Your Data Ingest in Airflow

Technical requirements

Installing and running Airflow

Creating basic logs in Airflow

Getting ready

How to do it…

How it works…

See also

Storing log files in a remote location

Getting ready

How to do it…

How it works…

See also

Configuring logs in airflow.cfg

Getting ready

How to do it…

How it works…

There’s more…

See also

Designing advanced monitoring

Getting ready

How to do it…

How it works…

There’s more…

See also

Using notification operators

Getting ready

How to do it…

How it works…

There’s more…

Using SQL operators for data quality

Getting ready

How to do it…

How it works…

There’s more…

See also

Further reading

11

Automating Your Data Ingestion Pipelines

Technical requirements

Installing and running Airflow

Scheduling daily ingestions

Getting ready

How to do it…

How it works…

There’s more…

See also

Scheduling historical data ingestion

Getting ready

How to do it…

How it works…

There’s more…

Scheduling data replication

Getting ready

How to do it…

How it works…

There’s more…

Setting up the schedule_interval parameter

Getting ready

How to do it…

How it works…

See also

Solving scheduling errors

Getting ready

How to do it…

How it works…

There’s more…

Further reading

12

Using Data Observability for Debugging, Error Handling, and Preventing Downtime

Technical requirements

Docker images

Setting up StatsD for monitoring

Getting ready

How to do it…

How it works…

See also

Setting up Prometheus for storing metrics

Getting ready

How to do it…

How it works…

There’s more…

Setting up Grafana for monitoring

Getting ready

How to do it…

How it works…

There’s more…

Creating an observability dashboard

Getting ready

How to do it…

How it works…

There’s more…

Setting custom alerts or notifications

Getting ready

How to do it…

How it works…

Further reading

Index

Other Books You May Enjoy

Preface

Welcome to Data Ingestion with Python Cookbook. I hope you are as excited as I am to enter the world of data engineering.

Data Ingestion with Python Cookbook is a practical guide that will empower you to design and implement efficient data ingestion pipelines. With real-world examples and renowned open-source tools, this book addresses your queries and hurdles head-on.

Beginning with designing pipelines, you’ll explore working with and without data schemas, constructing monitored workflows using Airflow, and embracing data observability principles while adhering to best practices. You’ll also tackle the challenges of reading diverse data sources and formats, gaining a comprehensive understanding of how to handle each of them.

Our journey continues with essential insights into error logging, identification, resolution, data orchestration, and effective monitoring. You’ll discover optimal approaches for storing logs so that they remain easy to access and reference in the future.

By the end of this book, you’ll possess a fully automated setup to initiate data ingestion and pipeline monitoring. This streamlined process will seamlessly integrate into the subsequent stages of the Extract, Transform, and Load (ETL) process, propelling your data integration capabilities to new heights. Get ready to embark on an enlightening and transformative data ingestion journey.

Who this book is for

This comprehensive book is specifically designed for Data Engineers, Data Integration Specialists, and passionate data enthusiasts seeking a deeper understanding of data ingestion processes, data flows, and the typical challenges encountered along the way. It provides valuable insights, best practices, and practical knowledge to enhance your skills and proficiency in handling data ingestion tasks effectively.

Whether you are a beginner in the data world or an experienced developer, this book will suit you. Familiarity with Python programming fundamentals and a basic knowledge of Docker are recommended to read and run this book’s code.

What this book covers

Chapter 1, Introduction to Data Ingestion, introduces you to data ingestion best practices and the challenges of working with diverse data sources. It explains the importance of the tools covered in the book, presents them, and provides installation instructions.

Chapter 2, Principles of Data Access – Accessing Your Data, explores data access concepts related to data governance, covering workflows and management of familiar sources such as SFTP servers, APIs, and cloud providers. It also provides examples of creating data access policies in databases, data warehouses, and the cloud.

Chapter 3, Data Discovery – Understanding Our Data Before Ingesting It, teaches you the significance of carrying out the data discovery process before data ingestion. It covers manual discovery, documentation, and using an open-source tool, OpenMetadata, for local configuration.

Chapter 4, Reading CSV and JSON Files and Solving Problems, introduces you to ingesting CSV and JSON files using Python and PySpark. It demonstrates handling varying data volumes and infrastructures while addressing common challenges and providing solutions.
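
As a brief preview (a minimal sketch rather than one of the book’s recipes; the file path and options are placeholders), reading a CSV file with PySpark typically looks like this:

from pyspark.sql import SparkSession

# Build or reuse a local SparkSession for this example
spark = SparkSession.builder.appName("csv-preview").getOrCreate()

# Read a CSV file with a header row, letting Spark infer the column types;
# data/example.csv is a placeholder path
df = spark.read.csv("data/example.csv", header=True, inferSchema=True)
df.show(5)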

Chapter 5, Ingesting Data from Structured and Unstructured Databases, covers fundamental concepts of relational and non-relational databases, including everyday use cases. You will learn how to read and handle data from these models, understand vital considerations, and troubleshoot potential errors.
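
To give a flavor of database ingestion (a hedged sketch only; the connection URL, table name, and credentials are placeholders, not values used in the book), a JDBC read in PySpark can be written as:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-preview").getOrCreate()

# Read a table over JDBC; the matching JDBC driver JAR must be available to Spark
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")  # placeholder URL
    .option("dbtable", "public.customers")                   # placeholder table
    .option("user", "postgres")                               # placeholder credentials
    .option("password", "password")
    .load()
)
df.printSchema()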

Chapter 6, Using PySpark with Defined and Non-Defined Schemas, delves deeper into common PySpark use cases, focusing on handling defined and non-defined schemas. It also explores reading and understanding complex logs from Spark (PySpark core) and formatting techniques for easier debugging.
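
To illustrate the idea (an illustrative sketch; the column names and file path are invented for this example), a defined schema in PySpark is typically declared like this:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("schema-preview").getOrCreate()

# An explicit schema avoids the cost and surprises of schema inference
schema = StructType([
    StructField("id", IntegerType(), False),   # non-nullable integer column
    StructField("name", StringType(), True),   # nullable string column
])

df = spark.read.csv("data/people.csv", header=True, schema=schema)
df.printSchema()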

Chapter 7, Ingesting Analytical Data, introduces you to analytical data and common formats for reading and writing. It explores reading partitioned data for improved performance and discusses Reverse ETL theory with real-life application workflows and diagrams.
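
As a small illustration of partitioned reads (a sketch under assumed names; the directory layout and the year column are hypothetical), reading partitioned Parquet data with PySpark might look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-preview").getOrCreate()

# Read a directory of Parquet files; partition columns encoded in the folder
# structure (for example, year=2023/month=05) are discovered automatically
df = spark.read.parquet("data/events")

# Filtering on a partition column lets Spark prune partitions and read less data
df.filter(df.year == 2023).show(5)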

Chapter 8, Designing Monitored Data Workflows, covers logging best practices for data ingestion, facilitating error identification and debugging. Techniques such as monitoring file size, row count, and object count enable improved dashboards, alerts, and insights.
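
As a taste of what a standardized, metric-carrying log can look like (a minimal sketch; the function name and log format are illustrative, not the book’s exact recipe):

import logging
import os

# A consistent log format makes logs easier to parse for dashboards and alerts
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("ingestion")

def log_file_metrics(path: str, row_count: int) -> None:
    # Emit simple ingestion metrics (file size and row count) in one structured line
    size_bytes = os.path.getsize(path)
    logger.info("file=%s size_bytes=%d rows=%d", path, size_bytes, row_count)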

Chapter 9, Putting Everything Together with Airflow, consolidates the previously presented information and guides you in building a real-life data ingestion application using Airflow. It covers essential components, configuration, and issue resolution in the process.
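
For orientation (a minimal sketch, not the application built in the chapter; the DAG ID, schedule, and task are placeholders), a basic Airflow DAG looks like this:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    # Placeholder ingestion step
    print("ingesting data")

# A minimal daily DAG with a single task
with DAG(
    dag_id="example_ingestion",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)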

Chapter 10, Logging and Monitoring Your Data Ingest in Airflow, explores advanced logging and monitoring in data ingestion with Airflow. It covers creating custom operators, setting up notifications, and monitoring for data anomalies. Configuration of notifications for tools such as Slack is also covered to stay updated on the data ingestion process.

Chapter 11, Automating Your Data Ingestion Pipelines, focuses on automating data ingestion using the best practices learned previously, enabling you to work autonomously. It addresses common challenges with schedulers or orchestration tools and provides solutions to avoid problems in production clusters.

Chapter 12, Using Data Observability for Debugging, Error Handling, and Preventing Downtime, explores data observability concepts, popular monitoring tools such as Grafana, and best practices for log storage and data lineage. It also covers creating visualization graphs to monitor data source issues using Airflow configuration and data ingestion scripts.

To get the most out of this book

To execute the code in this book, you must have at least a basic knowledge of Python. We will use Python as the core language for the code examples, which have been tested using Python 3.8. However, they are expected to work with future language versions.

Along with Python, this book uses Docker to emulate data systems and applications, such as PostgreSQL, MongoDB, and Airflow, on your local machine. Therefore, a basic knowledge of Docker is recommended to edit container image files and run and stop containers.

Please remember that some commands may need adjustments depending on your local settings or operating system. The commands in the code examples are based on Linux command-line syntax and might need some adaptation to run on Windows PowerShell.

Software/Hardware covered in the book        OS Requirements

Python 3.8 or higher                         Windows, Mac OS X, and Linux (any)

Docker Engine 24.0 / Docker Desktop 4.19     Windows, Mac OS X, and Linux (any)

For almost all recipes in this book, you can use a Jupyter Notebook to execute the code. Although installing it is not mandatory, this tool can help you test the code and experiment with it, thanks to its friendly interface.

If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook. In case there’s an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/xwl0U

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Then we proceeded with the with open statement.”

A block of code is set as follows:

def gets_csv_first_line(csv_file):
    logging.info(f"Starting function to read first line")
    try:
        with open(csv_file, 'r') as file:
            logging.info(f"Reading file")

Any command-line input or output is written as follows:

$ python3 --version
Python 3.8.10

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: “Then, when we selected showString at NativeMethodAccessorImpl.java:0, which redirected us to the Stages page.”

Tips or important notes

Appear like this.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There’s more..., and See also).

To give clear instructions on how to complete a recipe, these sections are used as follows:

Getting ready

This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.

How to do it…

This section contains the steps required to follow the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There’s more…

This section consists of additional information about the recipe that will deepen your knowledge of it.

See also

This section provides helpful links to other useful information for the recipe.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packtpub.com/support/errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read Data Ingestion with Python Cookbook, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/9781837632602

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly.

Part 1: Fundamentals of Data Ingestion

In this part, you will be introduced to the fundamentals of data ingestion and data engineering, covering the basic definition of an ingestion pipeline, the common types of data sources, and the technologies involved.

This part has the following chapters:

Chapter 1, Introduction to Data Ingestion

Chapter 2, Principles of Data Access – Accessing Your Data

Chapter 3, Data Discovery – Understanding Our Data Before Ingesting It

Chapter 4, Reading CSV and JSON Files and Solving Problems

Chapter 5, Ingesting Data from Structured and Unstructured Databases

Chapter 6, Using PySpark with Defined and Non-Defined Schemas

Chapter 7, Ingesting Analytical Data