Data Ingestion with Python Cookbook offers a practical approach to designing and implementing data ingestion pipelines. It presents real-world examples with the most widely recognized open source tools on the market to answer commonly asked questions and overcome challenges.
You’ll be introduced to designing and working with or without data schemas, as well as creating monitored pipelines with Airflow and data observability principles, all while following industry best practices. The book also addresses challenges associated with reading different data sources and data formats. As you progress through the book, you’ll gain a broader understanding of error logging best practices, troubleshooting techniques, data orchestration, monitoring, and storing logs for further consultation.
By the end of the book, you’ll have a fully automated setup that enables you to start ingesting and monitoring your data pipeline effortlessly, facilitating seamless integration with subsequent stages of the ETL process.
A practical guide to ingesting, monitoring, and identifying errors in the data ingestion process
Gláucia Esppenchutz
BIRMINGHAM—MUMBAI
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Reshma Raman
Publishing Product Manager: Arindam Majumdar
Senior Editor: Tiksha Lad
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Sejal Dsilva
Production Designer: Jyoti Chauhan
Marketing Coordinator: Nivedita Singh
First published: May 2023
Production reference: 1300523
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-83763-260-2
www.packtpub.com
This book represents a lot and wouldn’t be possible without my loving husband, Lincoln, and his support and understanding during this challenging endeavor. I want to thank all my friends who didn’t let me give up and always boosted my spirits, along with my grandmother, who always believed in me, helped me, and said I would do big things one day. Finally, I want to thank my beloved four-pawed best friend, Minduim, now at peace, for “helping” me to write this book.
– Gláucia Esppenchutz
Gláucia Esppenchutz is a data engineer with expertise in managing data pipelines and vast amounts of data using cloud and on-premises technologies. She has worked at companies such as Globo.com, BMW Group, and Cloudera. Currently, she works at AiFi, specializing in data operations for autonomous systems.
She comes from the biomedical field and shifted her career ten years ago to chase the dream of working closely with technology and data. She is in constant contact with the open source community, mentoring people and helping to manage projects, and has collaborated with the Apache, PyLadies group, FreeCodeCamp, Udacity, and MentorColor communities.
I want to thank my patient and beloved husband and my friends. Thanks also to my mentors in the Python open source community and the DataBootCamp founders, who guided me at the beginning of my journey.
Thanks to the Packt team, who helped me through some hard times; you were terrific!
Bitthal Khaitan is currently working as a big data and cloud engineer with CVS Health, a Fortune 4 organization. He has a demonstrated history of working in the cloud, data, and analytics industry for 12+ years. His primary certified skills are Google Cloud Platform (GCP), the big data ecosystem (Hadoop, Spark, etc.), and data warehousing on Teradata. He has worked in all phases of the SDLC of DW/BI and big data projects, with strong expertise in the US healthcare, insurance, and retail domains. He actively helps new graduates with mentoring, resume reviews, and job-hunting tips in the data engineering domain. Over 20,000 people follow Bitthal on LinkedIn. He is currently based out of Dallas, Texas, USA.
Jagjeet Makhija is a highly accomplished technology leader with over 20 years of experience. They are not only skilled in various domains, including AI, data warehouse architecture, and business analytics, but also have a strong passion for staying ahead of technology trends such as AI and ChatGPT. Jagjeet is recognized for their significant contributions to the industry, particularly in complex proof of concepts and in integrating Microsoft products with ChatGPT. They are also an avid book reviewer and have actively shared their extensive knowledge and expertise through presentations, blog articles, and online forums.
Krishnan Raghavan is an IT professional with over 20 years of experience in software development and delivery excellence across multiple domains and technologies, ranging from C++ to Java, Python, data warehousing, and big data tools and technologies. Krishnan tries to give back to the community as part of the GDG Pune volunteer group, helping the team organize events. When not working, Krishnan likes to spend time with his wife and daughter, as well as read fiction, non-fiction, and technical books. Currently, he is unsuccessfully trying to learn how to play the guitar.
You can connect with Krishnan via email at [email protected] or via LinkedIn at www.linkedin.com/in/krishnan-raghavan.
I would like to thank my wife, Anita, and daughter, Ananya, for giving me the time and space to review this book.
Welcome to Data Ingestion with Python Cookbook. I hope you are as excited as I am to enter the world of data engineering.
Data Ingestion with Python Cookbook is a practical guide that will empower you to design and implement efficient data ingestion pipelines. With real-world examples and renowned open-source tools, this book addresses your queries and hurdles head-on.
Beginning with designing pipelines, you’ll explore working with and without data schemas, constructing monitored workflows using Airflow, and embracing data observability principles while adhering to best practices. You’ll also tackle the challenges of reading diverse data sources and formats, gaining a comprehensive understanding of each.
Our journey continues with essential insights into error logging, identification, resolution, data orchestration, and effective monitoring. You’ll discover optimal approaches for storing logs so that they remain easy to access and reference in the future.
By the end of this book, you’ll possess a fully automated setup to initiate data ingestion and pipeline monitoring. This streamlined process will seamlessly integrate into the subsequent stages of the Extract, Transform, and Load (ETL) process, propelling your data integration capabilities to new heights. Get ready to embark on an enlightening and transformative data ingestion journey.
This comprehensive book is specifically designed for Data Engineers, Data Integration Specialists, and passionate data enthusiasts seeking a deeper understanding of data ingestion processes, data flows, and the typical challenges encountered along the way. It provides valuable insights, best practices, and practical knowledge to enhance your skills and proficiency in handling data ingestion tasks effectively.
Whether you are a beginner in the data world or an experienced developer, this book will suit you. It is recommended that you know Python programming fundamentals and have a basic knowledge of Docker to read and run this book’s code.
Chapter 1, Introduction to Data Ingestion, introduces you to data ingestion best practices and the challenges of working with diverse data sources. It explains the importance of the tools covered in the book, presents them, and provides installation instructions.
Chapter 2, Principles of Data Access – Accessing Your Data, explores data access concepts related to data governance, covering workflows and management of familiar sources such as SFTP servers, APIs, and cloud providers. It also provides examples of creating data access policies in databases, data warehouses, and the cloud.
Chapter 3, Data Discovery – Understanding Our Data Before Ingesting It, teaches you the significance of carrying out the data discovery process before data ingestion. It covers manual discovery, documentation, and using an open-source tool, OpenMetadata, for local configuration.
Chapter 4, Reading CSV and JSON Files and Solving Problems, introduces you to ingesting CSV and JSON files using Python and PySpark. It demonstrates handling varying data volumes and infrastructures while addressing common challenges and providing solutions.
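As a small preview of the techniques in that chapter, here is a minimal sketch of reading both formats with Python’s standard library; the file names events.csv and events.json are illustrative placeholders, not files from the book:

import csv
import json

# Read a CSV file into a list of dictionaries, one per row.
with open("events.csv", newline="") as csv_file:
    rows = list(csv.DictReader(csv_file))

# Read a JSON file into native Python objects.
with open("events.json") as json_file:
    payload = json.load(json_file)

print(f"CSV rows: {len(rows)}; top-level JSON type: {type(payload).__name__}")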
Chapter 5, Ingesting Data from Structured and Unstructured Databases, covers fundamental concepts of relational and non-relational databases, including everyday use cases. You will learn how to read and handle data from these models, understand vital considerations, and troubleshoot potential errors.
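To illustrate the kind of reads that chapter covers, the following minimal sketch pulls a few records from a relational source (PostgreSQL, via psycopg2) and a non-relational one (MongoDB, via pymongo); the connection strings, table, and collection names are placeholder assumptions:

import psycopg2
from pymongo import MongoClient

# Structured source: fetch rows from a PostgreSQL table.
with psycopg2.connect("postgresql://user:password@localhost:5432/mydb") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT id, name FROM customers LIMIT 10;")
        relational_rows = cur.fetchall()

# Unstructured source: fetch documents from a MongoDB collection.
client = MongoClient("mongodb://localhost:27017/")
documents = list(client["mydb"]["customers"].find().limit(10))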
Chapter 6, Using PySpark with Defined and Non-Defined Schemas, delves deeper into common PySpark use cases, focusing on handling defined and non-defined schemas. It also explores reading and understanding complex logs from Spark (PySpark core) and formatting techniques for easier debugging.
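As a preview, this minimal PySpark sketch contrasts an explicitly defined schema with schema inference; the file people.csv and its columns are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("schema-example").getOrCreate()

# Defined schema: Spark skips inference and enforces these exact types.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df_defined = spark.read.csv("people.csv", header=True, schema=schema)

# Non-defined schema: Spark samples the file to infer column types.
df_inferred = spark.read.csv("people.csv", header=True, inferSchema=True)

df_defined.printSchema()
df_inferred.printSchema()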
Chapter 7, Ingesting Analytical Data, introduces you to analytical data and common formats for reading and writing. It explores reading partitioned data for improved performance and discusses Reverse ETL theory with real-life application workflows and diagrams.
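For instance, reading data partitioned by date lets Spark prune directories it does not need. The following is a minimal sketch, with the bucket path and partition columns as placeholder assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-read").getOrCreate()

# A Parquet dataset laid out as .../sales/year=2023/month=5/...
# Filtering on partition columns lets Spark skip irrelevant directories.
sales = spark.read.parquet("s3://my-bucket/sales/")
may_sales = sales.filter((sales.year == 2023) & (sales.month == 5))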
Chapter 8, Designing Monitored Data Workflows, covers logging best practices for data ingestion, facilitating error identification, and debugging. Techniques such as monitoring file size, row count, and object count enable improved monitoring of dashboards, alerts, and insights.
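A minimal sketch of this idea: logging a file’s size and row count at ingestion time so dashboards and alerts have concrete numbers to watch. The function name and log format are illustrative, not the book’s code:

import logging
import os

logging.basicConfig(level=logging.INFO)

def log_ingestion_metrics(path):
    """Log simple health metrics for an ingested file."""
    size_bytes = os.path.getsize(path)
    with open(path) as f:
        row_count = sum(1 for _ in f)
    logging.info("file=%s size_bytes=%d row_count=%d", path, size_bytes, row_count)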
Chapter 9, Putting Everything Together with Airflow, consolidates the previously presented information and guides you in building a real-life data ingestion application using Airflow. It covers essential components, configuration, and issue resolution in the process.
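To give a flavor of the components involved, here is a minimal, illustrative Airflow DAG with a single Python task; the DAG id, schedule, and callable are placeholder assumptions rather than the book’s actual pipeline:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("ingesting data...")

with DAG(
    dag_id="example_ingestion",
    start_date=datetime(2023, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest_task", python_callable=ingest)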
Chapter 10, Logging and Monitoring Your Data Ingest in Airflow, explores advanced logging and monitoring in data ingestion with Airflow. It covers creating custom operators, setting up notifications, and monitoring for data anomalies. Configuration of notifications for tools such as Slack is also covered to stay updated on the data ingestion process.
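As a sketch of the notification pattern, Airflow lets you attach a failure callback to a task; here the callback only logs, but in practice it could post to Slack (for example, through the Slack provider package). All names below are assumptions:

import logging

def notify_failure(context):
    # Airflow passes a context dictionary with task instance details.
    ti = context["task_instance"]
    logging.error("Task %s in DAG %s failed", ti.task_id, ti.dag_id)

# Attach to any operator, for example:
# PythonOperator(task_id="ingest_task", python_callable=ingest,
#                on_failure_callback=notify_failure)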
Chapter 11, Automating Your Data Ingestion Pipelines, focuses on automating data ingests using previously learned best practices, enabling reader autonomy. It addresses common challenges with schedulers or orchestration tools and provides solutions to avoid problems in production clusters.
Chapter 12, Using Data Observability for Debugging, Error Handling, and Preventing Downtime, explores data observability concepts, popular monitoring tools such as Grafana, and best practices for log storage and data lineage. It also covers creating visualization graphs to monitor data source issues using Airflow configuration and data ingestion scripts.
To execute the code in this book, you must have at least a basic knowledge of Python. We will use Python as the core language to execute the code. The code examples have been tested using Python 3.8; however, they are expected to work with later language versions as well.
Along with Python, this book uses Docker to emulate data systems and applications on your local machine, such as PostgreSQL, MongoDB, and Airflow. Therefore, a basic knowledge of Docker is recommended to edit container image files and run and stop containers.
Please remember that some command-line commands may need adjustment depending on your local settings or operating system. The commands in the code examples are based on Linux command-line syntax and might need adaptation to run on Windows PowerShell.
Software/Hardware covered in the book | OS Requirements
Python 3.8 or higher | Windows, Mac OS X, and Linux (any)
Docker Engine 24.0 / Docker Desktop 4.19 | Windows, Mac OS X, and Linux (any)
For almost all recipes in this book, you can use a Jupyter notebook to execute the code. Although installing it is not mandatory, its friendly interface can help you test the code and experiment with it.
If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook. In case there’s an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/xwl0U
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Then we proceeded with the with open statement.”
A block of code is set as follows:
import logging

def gets_csv_first_line(csv_file):
    logging.info("Starting function to read first line")
    try:
        with open(csv_file, 'r') as file:
            logging.info("Reading file")
            return file.readline()
    except OSError as err:
        logging.error("Could not read file: %s", err)

Any command-line input or output is written as follows:
$ python3 --version
Python 3.8.10

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: “Then we selected showString at NativeMethodAccessorImpl.java:0, which redirected us to the Stages page.”
Tips or important notes
Appear like this.
In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There’s more..., and See also).
To give clear instructions on how to complete a recipe, use these sections as follows:
Getting ready: This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.
How to do it…: This section contains the steps required to follow the recipe.
How it works…: This section usually consists of a detailed explanation of what happened in the previous section.
There’s more…: This section consists of additional information about the recipe to make you more knowledgeable about it.
See also: This section provides helpful links to other useful information for the recipe.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you’ve read Data Ingestion with Python Cookbook, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below:
https://packt.link/free-ebook/9781837632602
Submit your proof of purchase.
That’s it! We’ll send your free PDF and other benefits to your email directly.

In this part, you will be introduced to the fundamentals of data ingestion and data engineering, passing through the basic definition of an ingestion pipeline, the common types of data sources, and the technologies involved.
This part has the following chapters:
Chapter 1, Introduction to Data Ingestion
Chapter 2, Principles of Data Access – Accessing Your Data
Chapter 3, Data Discovery – Understanding Our Data Before Ingesting It
Chapter 4, Reading CSV and JSON Files and Solving Problems
Chapter 5, Ingesting Data from Structured and Unstructured Databases
Chapter 6, Using PySpark with Defined and Non-Defined Schemas
Chapter 7, Ingesting Analytical Data