dbt Cloud helps professional analytics engineers automate the application of powerful and proven patterns to transform data from ingestion to delivery, enabling real DataOps.
This book begins by introducing you to dbt and its role in the data stack, along with how it uses simple SQL to build your data platform, helping you and your team work better together. You’ll find out how to leverage data modeling, data quality, master data management, and more to build a simple-to-understand and future-proof solution. As you advance, you’ll explore the modern data stack, understand how data-related careers are changing, and see how dbt enables this transition into the emerging role of an analytics engineer. The chapters help you build a sample project using the free version of dbt Cloud, Snowflake, and GitHub to create a professional DevOps setup with continuous integration, automated deployment, ELT run, scheduling, and monitoring, solving practical cases you encounter in your daily work.
By the end of this dbt book, you’ll be able to build an end-to-end pragmatic data platform by ingesting data exported from your source systems, coding the needed transformations, including master data and the desired business rules, and building well-formed dimensional models or wide tables that’ll enable you to build reports with the BI tool of your choice.
You can read this e-book in Legimi apps or in any app that supports the following format:
Page count: 931
Publication year: 2023
A practical guide to building a cloud-based, pragmatic, and dependable data platform with SQL
Roberto Zagni
BIRMINGHAM—MUMBAI
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Reshma Raman
Content Development Editor: Joseph Sunil
Technical Editor: Kavyashree KS
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Hemangini Bari
Production Designer: Aparna Bhagat
Marketing Coordinator: Shifa Ansari
First published: June 2023
Production reference: 1300623
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80324-628-4
www.packtpub.com
To the four females in my daily life: my wife, who supports me every day, my daughters, who keep me grounded in reality, and our dog Lily for the sparkles of love and happiness that she spreads around every day.
To my mother and my late father who through their sacrifices and support allowed me to become what I wanted to be.
I love you all.
– Roberto Zagni
Roberto Zagni is a senior leader with extensive hands-on experience in data architecture, software development, and agile methodologies. Roberto is an electronic engineer by training with a special interest in bringing software engineering best practices to cloud data platforms and growing great teams that enjoy what they do. He has been helping companies to better use their data, and now to transition to cloud-based data automation with an agile mindset and proper software engineering tools and processes, such as DataOps. Roberto also provides hands-on coaching to data teams on practical data architecture and the use of patterns, testing, version control, and agile collaboration.
Since 2019, his go-to tools have been dbt, dbt Cloud, and Snowflake or BigQuery.
I would like to thank my customers and colleagues for all the problems and discussions on the way to a working solution; they helped me become a better software and data engineer and collect a wide array of experiences in software and data engineering.
This book is my little contribution to the data engineering community.
I hope that I have been able to put the core set of knowledge that I would have loved to have in my days as a data engineer in one place, along with a simple, opinionated way to build data platforms using the modern data stack and proven patterns that scale and simplify everyday work.
Hari Krishnan has been in the data space for close to 20 years. He started at Infosys Limited, working on mainframe technology for about 6 years, then moved on to Informatica, and eventually into business intelligence, big data, and the cloud in general. Currently, he is a senior manager of data engineering at Beachbody LLC, where he manages a team of data engineers with Airflow, dbt, and Snowflake as their primary tech stack. He built the data lake and migrated the data warehouse, as well as the ETL/ELT pipelines, from on-premises to the cloud. He spent close to 13 years working for Infosys and has spent the last 7 years with Beachbody. He is a technology enthusiast and always has an appetite to discover, explore, and innovate in new avenues in the data space.
Daniel Joshua Jayaraj S R is a data evangelist and business intelligence engineer, with over six years of experience in the field of data analytics, data modeling, and visualization. He has helped organizations understand the full potential of their data by providing stakeholders with strong business-oriented visuals, thereby enhancing data-driven decisions. He has worked with multiple tools and technologies during his career and completed his master’s in big data and business analytics.
I would like to thank my mother, S J Inbarani, who has been my motivation my whole life. I would also like to thank Roberto Zagni for allowing me to review this wonderful book on dbt.
This book is for data engineers, analytics engineers, BI professionals, and data analysts who want to learn how to build simple, future-proof, and maintainable data platforms in an agile way. Project managers, data team managers, and decision-makers looking to understand the importance of building a data platform and fostering a culture of high-performing data teams will also find this book useful. Basic knowledge of SQL and data modeling will help you get the most out of the many layers of this book. The book also includes primers on many data-related subjects to help juniors get started.
Chapter 1, Basics of SQL to Transform Data, explores the basics of SQL and demystifies this standard, powerful, yet easy-to-read language, which is ubiquitous when working with data.
You will understand the different types of commands in SQL, how to get started with a database, and the SQL commands to work with data. We will look a bit deeper into the SELECT statement and the JOIN logic, as they will be crucial when working with dbt. You will be guided to create a free Snowflake account to experiment with the SQL commands and later use it together with dbt.
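As a small taste of what Chapter 1 covers, a SELECT with a JOIN might look like the following sketch; the CUSTOMERS table and its columns are made up for illustration (the ORDERS table follows the one used in the conventions section later in this preface):

```sql
-- Join each order to its customer and keep only the big orders;
-- table and column names are illustrative
SELECT o.ORDER_ID, o.TOTAL_AMOUNT, c.CUSTOMER_NAME
FROM ORDERS AS o
JOIN CUSTOMERS AS c
  ON c.CUSTOMER_CODE = o.CUSTOMER_CODE
WHERE o.TOTAL_AMOUNT > 1000;
```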
Chapter 2, Setting Up Your dbt Cloud Development Environment, gets started with dbt by creating your GitHub and dbt accounts. You will learn why version control is important and what the data engineering workflow is when working with dbt.
You will also understand the difference between the open source dbt Core and the commercial dbt Cloud. Finally, you will experiment with the default project, set up your environment for running basic SQL with dbt on Snowflake, and understand the key functions of dbt: ref and source.
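For a first glimpse of these two functions, a dbt model uses them roughly as sketched below; the source and model names are invented for illustration, and dbt resolves them to real database names while building its dependency graph:

```sql
-- An illustrative dbt model: source() reads a raw table declared in a
-- .yml file, ref() depends on another dbt model in the same project
SELECT o.*, c.CUSTOMER_NAME
FROM {{ source('landing', 'orders') }} AS o
LEFT JOIN {{ ref('stg_customers') }} AS c
    ON c.CUSTOMER_CODE = o.CUSTOMER_CODE
```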
Chapter 3, Data Modeling for Data Engineering, shows why and how you describe data, and how to travel through different abstraction levels, from business processes to the storage of the data that supports them: conceptual, logical, and physical data models.
You will understand entities, relationships, attributes, entity-relationship (E-R) diagrams, modeling use cases and modeling patterns, Data Vault, dimensional models, wide tables, and business reporting.
Chapter 4, Analytics Engineering as the New Core of Data Engineering, showcases the full data life cycle and the different roles and responsibilities of the people who work with data.
You will understand the modern data stack, the role of dbt, and analytics engineering. You will learn how to adopt software engineering practices to build data platforms (or DataOps), and about working as a team, not in silos.
Chapter 5, Transforming Data with dbt, shows how to develop an example application in dbt, covering all the steps to create, deploy, run, test, and document a data application with dbt.
Chapter 6, Writing Maintainable Code, continues the example started in the previous chapter, guiding you to configure dbt and write some basic but functionally complete code to build the three layers of our reference architecture: staging/storage, refined data, and delivery with data marts.
Chapter 7, Working with Dimensional Data, shows you how to incorporate dimensional data into your data models and utilize it for a multitude of purposes, such as giving context to facts. We will explore how to create data models, edit the data for our reference architecture, and incorporate the dimensional data in data marts. We will also recap everything we learned in the previous chapters with an example.
Chapter 8, Delivering Consistency in Your Code, shows you how to add consistency to your transformations. You will learn how to go beyond basic SQL and bring the power of scripting into your code, write your first macros, and learn how to use external libraries in your projects.
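As a small preview of the macros mentioned above, dbt macros are written in Jinja; the macro name and logic below are invented for illustration (the cast syntax assumes Snowflake):

```sql
-- An illustrative Jinja macro that wraps a column expression, so the
-- same conversion can be reused consistently across many models
{% macro cents_to_eur(column_name) %}
    ({{ column_name }} / 100.0)::NUMBER(16, 2)
{% endmacro %}

-- Used in a model as:
-- SELECT {{ cents_to_eur('AMOUNT_CENTS') }} AS amount_eur FROM ...
```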
Chapter 9, Delivering Reliability in Your Data, shows you how to ensure the reliability of your code by adding tests that verify your expectations and check the results of your transformations.
Chapter 10, Agile Development, teaches you how to develop with agility, mixing philosophy and practical hints, discussing how to keep the backlog agile through the phases of your projects, and taking a deep dive into building data marts.
Chapter 11, Collaboration, touches on a few practices that help developers work as a team and the support that dbt provides toward this.
Chapter 12, Deployment, Execution, and Documentation Automation, helps you learn how to automate the operation of your data platform, by setting up environments and jobs that automate the release and execution of your code following your deployment design.
Chapter 13, Moving beyond Basics, helps you learn how to manage the identity of your entities so that you can apply master data management to combine data from different systems. At the same time, you will review the best practices to apply modularity in your pipelines to simplify their evolution and maintenance. You will also discover macros to implement patterns.
Chapter 14, Enhancing Software Quality, helps you discover and apply more advanced patterns that provide high-quality results in real-life projects, and you will experiment with how to evolve your code with confidence through refactoring.
Chapter 15, Patterns for Frequent Use Cases, presents you with a small library of patterns that are frequently used for ingesting data from external files and storing this ingested data in what we call history tables. You will also get the insights and the code to ingest data in Snowflake.
Software/hardware covered in the book: dbt
Operating system requirements: Windows, macOS, or Linux
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-engineering-with-dbt. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Create the new database using the executor role. We named it PORTFOLIO_TRACKING.”
A block of code is set as follows:
CREATE TABLE ORDERS (
    ORDER_ID NUMBER,
    CUSTOMER_CODE TEXT,
    TOTAL_AMOUNT FLOAT,
    ORDER_DATE DATE,
    CURRENCY TEXT DEFAULT 'EUR'
);

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
CREATE VIEW BIG_ORDERS AS
SELECT * FROM ORDERS
WHERE TOTAL_AMOUNT > 1000;

Any command-line input or output is written as follows:
$ mkdir css
$ cd css

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Select System info from the Administration panel.”
Tips or important notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you’ve read Data Engineering with dbt, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere? Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below:
https://packt.link/free-ebook/9781803246284
Submit your proof of purchase
That’s it! We’ll send your free PDF and other benefits to your email directly.

In this section, you will get started on your path of learning how to build a data platform by learning the basics of SQL, modeling, and data engineering.
This section includes the following chapters:
Chapter 1, Basics of SQL to Transform Data
Chapter 2, Setting Up Your dbt Cloud Development Environment
Chapter 3, Data Modeling for Data Engineering
Chapter 4, Analytics Engineering as the New Core of Data Engineering
Chapter 5, Transforming Data with dbt