Modern extract, transform, and load (ETL) pipelines for data engineering have favored Python for its versatility and its large ecosystem of tools, applications, and open source components. With its simplicity and extensive library support, Python has emerged as a leading choice for data processing.
In this book, you’ll walk through the end-to-end process of ETL data pipeline development, starting with an introduction to the fundamentals of data pipelines and the setup of a Python development environment for building them. Once you've explored ETL pipeline design principles and the ETL development process, you'll be equipped to design custom ETL pipelines. Next, you'll get to grips with the steps in the ETL process: extracting valuable data; transforming it through cleaning and manipulation while ensuring data integrity; and ultimately loading the processed data into storage systems (a minimal sketch of these three stages follows this overview). You’ll also review several ETL modules in Python, comparing their pros and cons for building data pipelines, and leverage cloud tools, such as AWS, to create scalable data pipelines. Lastly, you’ll learn about test-driven development for ETL pipelines to ensure safe deployments.
By the end of this book, you’ll have worked through several hands-on examples, creating high-performance ETL pipelines and building robust, scalable, and resilient environments with Python.
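The extract, transform, and load stages described in the overview above can be illustrated with a minimal, self-contained Python sketch. This is only an assumption-laden outline rather than an example from the book: the file names raw_customers.csv and clean_customers.csv and the id column are hypothetical placeholders.

import csv

def extract(source_path):
    # Extract: read raw records from a CSV source into a list of dictionaries.
    with open(source_path, newline="") as handle:
        return list(csv.DictReader(handle))

def transform(records):
    # Transform: strip whitespace and drop rows that fail a basic integrity check.
    cleaned = []
    for row in records:
        row = {key: (value or "").strip() for key, value in row.items()}
        if row.get("id"):  # keep only rows with a non-empty identifier
            cleaned.append(row)
    return cleaned

def load(records, target_path):
    # Load: write the processed records to a target CSV file.
    if not records:
        return
    with open(target_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)

if __name__ == "__main__":
    load(transform(extract("raw_customers.csv")), "clean_customers.csv")

Keeping each stage as its own function reflects the separation of concerns that makes pipelines easier to test and lets individual stages be swapped out, for example replacing the CSV target with a database or cloud storage.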
Building ETL Pipelines with Python
Create and deploy enterprise-ready ETL pipelines by employing modern methods
Brij Kishore Pandey
Emily Ro Schoof
BIRMINGHAM—MUMBAI
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Reshma Raman
Publishing Product Managers: Birjees Patel and Heramb Bhavsar
Content Development Editor: Shreya Moharir
Project Coordinator: Hemangi Lotlikar
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Proofreader: Safis Editing
Indexer: Subalakshmi Govindhan
Production Designer: Prashant Ghare
DevRel Marketing Coordinator: Nivedita Singh
First published: September 2023
Production reference: 1250923
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB
ISBN 978-1-80461-525-6
www.packtpub.com
To my daughter, Yashvi, who lights up my life; to Khushboo, my wife, my rock, and my inspiration; to my parents, Madhwa Nand and Veena, who taught me everything I know; and to my brothers, who have always stood by my side.
– Brij Kishore Pandey
Brij Kishore Pandey stands as a testament to dedication, innovation, and mastery in the vast domains of software engineering, data engineering, machine learning, and architectural design. His illustrious career, spanning over 14 years, has seen him wear multiple hats, transitioning seamlessly between roles and consistently pushing the boundaries of technological advancement.
Hailing from the renowned SRM Institute of Science and Technology in Chennai, India, Brij built his dynamic career on an academic foundation in electrical and electronics engineering. He has had the privilege of collaborating with industry behemoths such as JP Morgan Chase, American Express, 3M Company, Alaska Airlines, and Cigna Healthcare, contributing immensely with his diverse skill set. Presently, Brij assumes a dual role, guiding teams as a principal software engineer and providing visionary architectural solutions at ADP (Automatic Data Processing Inc.).
A fervent believer in continuous learning and sharing knowledge, Brij has graced various international platforms as a speaker, sharing insights, experiences, and best practices with budding engineers and seasoned professionals alike. His influence doesn’t end there; he has also taken on mentorship roles, guiding the next generation of tech aficionados, in association with Mentor Cruise Inc.
Beyond the world of code, algorithms, and systems, Brij finds profound solace in spiritual pursuits. He devotes time to the ardent practice of meditation and myriad yoga disciplines, echoing his belief in a holistic approach to well-being. Deep spiritual guidance from his revered guru, Avdhoot Shivanand, has been pivotal in shaping his inner journey and perspective.
Originally from India, Brij Kishore Pandey resides in Parsippany, New Jersey, USA, with his wife and daughter.
Emily Ro Schoof is a dedicated data specialist with a global perspective, showcasing her expertise as a data scientist and data engineer on both national and international platforms. Drawing from a background rooted in healthcare and experimental design, she brings a unique perspective to her data analytics roles. Emily’s multifaceted career ranges from working with UNICEF to design automated forecasting algorithms that identify conflict anomalies using near-real-time media monitoring, to serving as a subject matter expert for General Assembly’s Data Engineering course content and design. Her mission is to empower individuals to leverage data for positive impact. Emily holds the strong belief that providing easy access to resources that merge theory and real-world applications is the essential first step in this process.
Adonis Castillo Cordero has been working in software engineering, data engineering, and business intelligence for the last five years. He is passionate about systems engineering, data, and leadership. His recent focus areas include cloud-native landscape, business strategy, and data engineering and analytics. Based in Alajuela, Costa Rica, Adonis currently works as a lead data engineer and has worked for Fortune 500 companies such as Experian and 3M.
I’m grateful for my family and friends’ unwavering support during this project. Thanks to the publisher for their professionalism and guidance. I sincerely hope the book brings joy and is useful to readers.
Dr. Bipul Kumar is an AI consultant who brings over seven years of experience in deep learning and machine learning to the table. His journey in AI has encompassed various domains, including conversational AI, computer vision, and speech recognition. Bipul has had the privilege of working on impactful projects, including contributing to the development of software as a medical device as the head of AI at Kaliber Labs. He also served as an AI consultant at AIX, specializing in the development of conversational AI. His academic pursuits led him to earn a PhD from IIM Ranchi and a B.Tech from SRMIST. With a passion for research and innovation, Bipul has authored numerous publications and contributed to a patent application, humbly making his mark on the AI landscape.
For the first part of this book, we will introduce the fundamentals of data pipelines in Python and set up your local development environment with Integrated Development Environments (IDEs), virtual environments, and Git version control; a small environment-check sketch follows the chapter list below. We will provide you with an overview of what Extract, Transform, Load (ETL) data pipelines are and how to design them yourself. As a word of caution, Python is at the core of this book; you must have a basic familiarity with Python to follow along.
This section contains the following chapters:
Chapter 1, A Primer on Python and the Development Environment
Chapter 2, Understanding the ETL Process and Data Pipelines
Chapter 3, Design Principles for Creating Scalable and Resilient Pipelines
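As a hedged illustration of the environment setup this part covers, the short sketch below (not taken from the book’s examples) reports the Python version in use, whether a virtual environment is active, and whether Git is available on the PATH.

import shutil
import sys

def describe_environment():
    # Interpreter version used to run the pipelines.
    print(f"Python version: {sys.version.split()[0]}")

    # A virtual environment is active when sys.prefix differs from sys.base_prefix.
    in_virtual_env = sys.prefix != sys.base_prefix
    print(f"Virtual environment active: {in_virtual_env}")

    # Git is used for version control; shutil.which looks it up on the PATH.
    git_path = shutil.which("git")
    print(f"Git executable: {git_path or 'not found'}")

if __name__ == "__main__":
    describe_environment()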