


Data Engineering with dbt

A practical guide to building a cloud-based, pragmatic, and dependable data platform with SQL

Roberto Zagni

BIRMINGHAM—MUMBAI

Data Engineering with dbt

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Reshma Raman

Content Development Editor: Joseph Sunil

Technical Editor: Kavyashree KS

Copy Editor: Safis Editing

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Hemangini Bari

Production Designer: Aparna Bhagat

Marketing Coordinator: Shifa Ansari

First published: June 2023

Production reference: 1300623

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80324-628-4

www.packtpub.com

To the four females in my daily life: my wife, who supports me every day, my daughters, who keep me grounded in reality, and our dog Lily for the sparkles of love and happiness that she spreads around every day.

To my mother and my late father who through their sacrifices and support allowed me to become what I wanted to be.

I love you all.

– Roberto Zagni

Contributors

About the author

Roberto Zagni is a senior leader with extensive hands-on experience in data architecture, software development, and agile methodologies. Roberto is an electronic engineer by training, with a special interest in bringing software engineering best practices to cloud data platforms and growing great teams that enjoy what they do. He has been helping companies make better use of their data, and now to transition to cloud-based data automation with an agile mindset and proper software engineering tools and processes, such as DataOps. Roberto also provides hands-on coaching to data teams on practical data architecture and the use of patterns, testing, version control, and agile collaboration.

Since 2019, his go-to tools have been dbt, dbt Cloud, and Snowflake or BigQuery.

I would like to thank my customers and colleagues for all the problems and discussions on the way to a working solution; they helped me become a better software and data engineer and collect a wide array of experiences in software and data engineering.

This book is my little contribution to the data engineering community.

I hope that I have been able to put in one place the core set of knowledge that I would have loved to have in my days as a data engineer, along with a simple, opinionated way to build data platforms using the modern data stack and proven patterns that scale and simplify everyday work.

About the reviewers

Hari Krishnan has been in the data space for close to 20 years now. He started at Infosys Limited, working on mainframe technology for about 6 years, then moved over to Informatica, and eventually into business intelligence, big data, and the cloud in general. Currently, he is a senior manager of data engineering at Beachbody LLC, where he manages a team of data engineers with Airflow, dbt, and Snowflake as their primary tech stack. He built the data lake and migrated the data warehouse as well as the ETL/ELT pipelines from on-premises to the cloud. He spent close to 13 years working for Infosys and has spent the last 7 years with Beachbody. He is a technology enthusiast and always has an appetite to discover, explore, and innovate new avenues in the data space.

Daniel Joshua Jayaraj S R is a data evangelist and business intelligence engineer, with over six years of experience in the field of data analytics, data modeling, and visualization. He has helped organizations understand the full potential of their data by providing stakeholders with strong business-oriented visuals, thereby enhancing data-driven decisions. He has worked with multiple tools and technologies during his career and completed his master’s in big data and business analytics.

I would like to thank my mother, S J Inbarani, who has been my motivation my whole life. I would also like to thank Roberto Zagni for allowing me to review this wonderful book on dbt.

Table of Contents

Preface

Part 1: The Foundations of Data Engineering

1

The Basics of SQL to Transform Data

Technical requirements

Introducing SQL

SQL basics – core concepts and commands

SQL core concepts

Understanding the categories of SQL commands

Setting up a Snowflake database with users and roles

Creating your Snowflake account

Setting up initial users, roles, and a database in Snowflake

Creating and granting your first role

Querying data in SQL – syntax and operators

Snowflake query syntax

SQL operators

Combining data in SQL – the JOIN clause

Combining orders and customers

JOIN types

Visual representation of join types

Advanced – introducing window functions

Window definition

Window frame definition

Summary

Further reading

2

Setting Up Your dbt Cloud Development Environment

Technical requirements

Setting up your GitHub account

Introducing Version Control

Creating your GitHub account

Setting up your first repository for dbt

Setting up your dbt Cloud account

Signing up for a dbt Cloud account

Setting up your first dbt Cloud project

Adding the default project to an empty repository

Comparing dbt Core and dbt Cloud workflows

dbt Core workflows

dbt Cloud workflows

Experimenting with SQL in dbt Cloud

Exploring the dbt Cloud IDE

Executing SQL from the dbt IDE

Introducing the source and ref dbt functions

Exploring the dbt default model

Using ref and source to connect models

Running your first models

Testing your first models

Editing your first model

Summary

Further reading

3

Data Modeling for Data Engineering

Technical requirements

What is and why do we need data modeling?

Understanding data

What is data modeling?

Why we need data modeling

Complementing a visual data model

Conceptual, logical, and physical data models

Conceptual data model

Logical data model

Physical data model

Tools to draw data models

Entity-Relationship modeling

Main notation

Cardinality

Time perspective

An example of an E-R model at different levels of detail

Generalization and specialization

Modeling use cases and patterns

Header-detail use case

Hierarchical relationships

Forecasts and actuals

Libraries of standard data models

Common problems in data models

Fan trap

Chasm trap

Modeling styles and architectures

Kimball method or dimensional modeling or star schema

Unified Star Schema

Inmon design style

Data Vault

Data mesh

Our approach, the Pragmatic Data Platform - PDP

Summary

Further reading

4

Analytics Engineering as the New Core of Data Engineering

Technical requirements

The data life cycle and its evolution

Understanding the data flow

Data creation

Data movement and storage

Data transformation

Business reporting

Feeding back to the source systems

Understanding the modern data stack

The traditional data stack

The modern data stack

Defining analytics engineering

The roles in the modern data stack

The analytics engineer

DataOps – software engineering best practices for data

Version control

Quality assurance

The modularity of the code base

Development environments

Designing for maintainability

Summary

Further reading

5

Transforming Data with dbt

Technical requirements

The dbt Core workflow for ingesting and transforming data

Introducing our stock tracking project

The initial data model and glossary

Setting up the project in dbt, Snowflake, and GitHub

Defining data sources and providing reference data

Defining data sources in dbt

Loading the first data for the portfolio project

How to write and test transformations

Writing the first dbt model

Real-time lineage and project navigation

Deploying the first dbt model

Committing the first dbt model

Configuring our project and where we store data

Re-deploying our environment to the desired schema

Configuring the layers for our architecture

Ensuring data quality with tests

Generating the documentation

Summary

Part 2: Agile Data Engineering with dbt

6

Writing Maintainable Code

Technical requirements

Writing code for humans

Refactoring our initial model to be human-readable

Creating the architectural layers

Creating the Staging layer

Goals and contents of the staging models

Connecting the REF model to the STG

Goals and contents of the refined layer

Creating the first data mart

Saving history is crucial

Saving history with dbt

Saving history using snapshots

Connecting the REF layer with the snapshot

Summary

7

Working with Dimensional Data

Adding dimensional data

Creating clear data models for the refined and data mart layers

Loading the data of the first dimension

Creating and loading a CSV as a seed

Configuring the seeds and loading them

Adding data types and a load timestamp to your seed

Building the STG model for the first dimension

Defining the external data source for seeds

Creating an STG model for the security dimension

Adding the default record to the STG

Saving history for the dimensional data

Saving the history with a snapshot

Building the REF layer with the dimensional data

Adding the dimensional data to the data mart

Exercise – adding a few more hand-maintained dimensions

Summary

8

Delivering Consistency in Your Data

Technical requirements

Keeping consistency by reusing code – macros

Repetition is inherent in data projects

Why copy and paste kills your future self

How to write a macro

Refactoring the “current” CTE into a macro

Fixing data loaded from our CSV file

The basics of macro writing

Building on the shoulders of giants – dbt packages

Creating dbt packages

How to import a package in dbt

Browsing through noteworthy packages for dbt

Adding the dbt-utils package to our project

Summary

9

Delivering Reliability in Your Data

Testing to provide reliability

Types of tests

Singular tests

Generic tests

Defining a generic test

Testing the right things in the right places

What do we test?

Where to test what?

Testing our models to ensure good quality

Summary

10

Agile Development

Technical requirements

Agile development and collaboration

Defining agile development

Applying agile to data engineering

Starting a project in an agile way

Organizing work the agile way

Managing the backlog in an agile way

Building reports in an agile way

S1 – designing a light data model for the data mart

S2 – designing a light data model for the REF layer

S3.x – developing with dbt models the pipeline for the XYZ table

S4 – an acceptance test of the data produced in the data mart

S5 – development and verification of the report in the BI application

Summary

11

Team Collaboration

Enabling collaboration

Core collaboration practices

Collaboration with dbt Cloud

Working with branches and PRs

Working with Git in dbt Cloud

The dbt Cloud Git process

Keeping your development environment healthy

Suggested Git branch naming

Adopting frequent releases

Making your first PR

Summary

Further reading

Part 3: Hands-On Best Practices for Simple, Future-Proof Data Platforms

12

Deployment, Execution, and Documentation Automation

Technical requirements

Designing your deployment automation

Working with dbt environments

Creating our QA and PROD environments

Deciding where to deploy

Creating jobs

Designing the architecture of your data platform

Notifications

Advanced automation – hooks and run-operations

Hooks

Run-operations

Table migrations

Documentation

Lineage graph

dbt-generated documentation

Source freshness report

Exposures

Markdown documentation

Summary

13

Moving Beyond the Basics

Technical requirements

Building for modularity

Modularity in the storage layer

Modularity in the refined layer

Modularity in the delivery layer

Managing identity

Identity and semantics – defining your concepts

Different types of keys

Main uses of keys

Master Data management

Data for Master Data management

A light MDM approach with dbt

Saving history at scale

Understanding the save_history macro

Understanding the current_from_history macro

Summary

14

Enhancing Software Quality

Technical requirements

Refactoring and evolving models

Dealing with technical debt

Implementing real-world code and business rules

Replacing snapshots with HIST tables

Renaming the REF_ABC_BANK_SECURITY_INFO model

Handling orphans in facts

Calculating closed positions

Calculating transactions

Publishing dependable datasets

Managing data marts like APIs

What shape should you use for your data mart?

Self-completing dimensions

History in reports – that is, slowly changing dimensions type two

Summary

Further reading

15

Patterns for Frequent Use Cases

Technical requirements

Ingestion patterns

Basic setup for ingestion

Loading data from files

External tables

Landing tables

History patterns

Storing history with deletions – full load

Storing history with deletion – deletion list

Storing history with multiple versions in the input

Storing history with PII and GDPR compliance

History and schema evolution

Summary

Further reading

Index

Other Books You May Enjoy

Preface

dbt Cloud helps professional analytics engineers automate the application of powerful and proven patterns to transform data from ingestion to delivery, enabling real DataOps.

This book begins by introducing you to dbt and its role in the data stack, along with how it uses simple SQL to build your data platform, helping you and your team work better together. You’ll find out how to leverage data modeling, data quality, master data management, and more to build a simple-to-understand and future-proof solution. As you advance, you’ll explore the modern data stack, understand how data-related careers are changing, and see how dbt enables this transition into the emerging role of an analytics engineer. The chapters help you build a sample project using the free version of dbt Cloud, Snowflake, and GitHub to create a professional DevOps setup with continuous integration, automated deployment, ELT run, scheduling, and monitoring, solving practical cases that you encounter in your daily work.

By the end of this dbt book, you’ll be able to build an end-to-end pragmatic data platform by ingesting data exported from your source systems, coding the needed transformations (including master data and the desired business rules), and building well-formed dimensional models or wide tables that’ll enable you to build reports with the BI tool of your choice.

Who this book is for

This book is for data engineers, analytics engineers, BI professionals, and data analysts who want to learn how to build simple, future-proof, and maintainable data platforms in an agile way. Project managers, data team managers, and decision-makers looking to understand the importance of building a data platform and fostering a culture of high-performing data teams will also find this book useful. Basic knowledge of SQL and data modeling will help you get the most out of the many layers of this book. The book also includes primers on many data-related subjects to help juniors get started.

What this book covers

Chapter 1, The Basics of SQL to Transform Data, explores the basics of SQL and demystifies this standard, powerful, yet easy-to-read language, which is ubiquitous when working with data.

You will understand the different types of commands in SQL, how to get started with a database, and the SQL commands to work with data. We will look a bit deeper into the SELECT statement and the JOIN logic, as they will be crucial when working with dbt. You will be guided through creating a free Snowflake account to experiment with the SQL commands and later use it together with dbt.
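As a first taste of what this chapter enables, here is a minimal sketch of a SELECT with a JOIN; the CUSTOMERS table and the column names are illustrative assumptions, not the book’s exact schema:

-- A minimal sketch: join each order to its customer (assumed schema)
SELECT
    c.CUSTOMER_CODE,
    o.ORDER_ID,
    o.TOTAL_AMOUNT
FROM ORDERS AS o
JOIN CUSTOMERS AS c
    ON c.CUSTOMER_CODE = o.CUSTOMER_CODE
WHERE o.TOTAL_AMOUNT > 1000;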

Chapter 2, Setting Up Your dbt Cloud Development Environment, gets you started with dbt by creating your GitHub and dbt Cloud accounts. You will learn why version control is important and what the data engineering workflow looks like when working with dbt.

You will also understand the difference between the open source dbt Core and the commercial dbt Cloud. Finally, you will experiment with the default project, set up your environment to run basic SQL with dbt on Snowflake, and understand the key functions of dbt: ref and source.
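To give a flavor of these two key functions, the following minimal sketch shows a staging model that reads from a declared source and a downstream model that builds on it; the source, model, and column names are hypothetical, not those of the default project:

-- models/staging/stg_orders.sql (hypothetical name): source() reads a table
-- declared in a .yml file, here from a source named sales_system
SELECT * FROM {{ source('sales_system', 'ORDERS') }}

-- models/marts/big_orders.sql (hypothetical name): ref() creates a dependency
-- on the stg_orders model, so dbt builds the models in the right order
SELECT * FROM {{ ref('stg_orders') }} WHERE TOTAL_AMOUNT > 1000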

Chapter 3, Data Modeling for Data Engineering, shows why and how you describe data, and how to travel through different abstraction levels, from business processes to the storage of the data that supports them: conceptual, logical, and physical data models.

You will understand entities, relationships, attributes, entity-relationship (E-R) diagrams, modeling use cases and modeling patterns, Data Vault, dimensional models, wide tables, and business reporting.

Chapter 4, Analytics Engineering as the New Core of Data Engineering, showcases the full data life cycle and the different roles and responsibilities of the people who work on data.

You will understand the modern data stack, the role of dbt, and analytics engineering. You will learn how to adopt software engineering practices to build data platforms (or DataOps) and how to work as a team, not as a silo.

Chapter 5, Transforming Data with dbt, shows you how to develop an example application in dbt, learning all the steps to create, deploy, run, test, and document a data application with dbt.

Chapter 6, Writing Maintainable Code, continues the example that we started in the previous chapter, guiding you to configure dbt and write some basic but functionally complete code to build the three layers of our reference architecture: staging/storage, refined data, and delivery with data marts.

Chapter 7, Working with Dimensional Data, shows you how to incorporate dimensional data into your data models and use it for fact-checking and many other purposes. We will explore how to create the data models, prepare the data for our reference architecture, and incorporate the dimensional data into data marts. We will also recap everything we learned in the previous chapters with an example.

Chapter 8, Delivering Consistency in Your Data, shows you how to add consistency to your transformations. You will learn how to go beyond basic SQL and bring the power of scripting into your code, write your first macros, and learn how to use external libraries in your projects.
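As a preview of what a macro looks like, here is a minimal sketch of a dbt macro and its use in a model; the macro name and the conversion it performs are illustrative assumptions, not the macros developed in the chapter:

-- macros/cents_to_euros.sql (hypothetical): captures a recurring expression
{% macro cents_to_euros(column_name) %}
    ({{ column_name }} / 100)::NUMBER(16, 2)
{% endmacro %}

-- In a model, the macro call expands to the SQL expression defined above
SELECT {{ cents_to_euros('TOTAL_AMOUNT_CENTS') }} AS TOTAL_AMOUNT
FROM {{ ref('stg_orders') }}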

Chapter 9, Delivering Reliability in Your Data, shows you how to ensure the reliability of your code by adding tests that verify your expectations and check the results of your transformations.
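As a preview, here is a minimal sketch of how generic tests are declared on a model in YAML; the file and model names are illustrative assumptions:

# models/marts/big_orders.yml (hypothetical file name)
version: 2
models:
  - name: big_orders
    columns:
      - name: ORDER_ID
        tests:
          - unique    # no two rows share the same ORDER_ID
          - not_null  # ORDER_ID must always be populated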

Chapter 10, Agile Development, teaches you how to develop with agility by mixing philosophy and practical hints, discussing how to keep the backlog agile through the phases of your projects, and taking a deep dive into building data marts.

Chapter 11, Team Collaboration, touches on a few practices that help developers work as a team and on the support that dbt provides toward this.

Chapter 12, Deployment, Execution, and Documentation Automation, helps you learn how to automate the operation of your data platform by setting up environments and jobs that automate the release and execution of your code, following your deployment design.

Chapter 13, Moving Beyond the Basics, helps you learn how to manage the identity of your entities so that you can apply master data management to combine data from different systems. At the same time, you will review the best practices for applying modularity in your pipelines to simplify their evolution and maintenance. You will also discover macros to implement patterns.

Chapter 14, Enhancing Software Quality, helps you discover and apply more advanced patterns that provide high-quality results in real-life projects, and you will experiment with how to evolve your code with confidence through refactoring.

Chapter 15, Patterns for Frequent Use Cases, presents you with a small library of patterns that are frequently used for ingesting data from external files and storing this ingested data in what we call history tables. You will also get the insights and the code to ingest data in Snowflake.
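As a small preview of ingesting files into Snowflake, here is a minimal sketch using the COPY INTO command; the landing table, stage, and file path are hypothetical, and the chapter’s patterns go well beyond this single command:

-- Load CSV files from a named stage into a landing table (hypothetical names)
COPY INTO LANDING_ORDERS
FROM @PORTFOLIO_STAGE/orders/
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);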

To get the most out of this book

Software/hardware covered in the book: dbt

Operating system requirements: Windows, macOS, or Linux

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-engineering-with-dbt. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Create the new database using the executor role. We named it PORTFOLIO_TRACKING.”

A block of code is set as follows:

CREATE TABLE ORDERS (
    ORDER_ID NUMBER,
    CUSTOMER_CODE TEXT,
    TOTAL_AMOUNT FLOAT,
    ORDER_DATE DATE,
    CURRENCY TEXT DEFAULT 'EUR'
);

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

CREATE VIEW BIG_ORDERS AS
SELECT * FROM ORDERS
WHERE TOTAL_AMOUNT > 1000;

Any command-line input or output is written as follows:

$ mkdir css
$ cd css

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Select System info from the Administration panel.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read Data Engineering with dbt, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?

Don’t worry! Now, with every Packt book, you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application. 

The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/9781803246284

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly.

Part 1: The Foundations of Data Engineering

In this section, you will get started on your path to building a data platform by learning the basics of SQL, data modeling, and data engineering.

This section includes the following chapters:

Chapter 1, The Basics of SQL to Transform Data

Chapter 2, Setting Up Your dbt Cloud Development Environment

Chapter 3, Data Modeling for Data Engineering

Chapter 4, Analytics Engineering as the New Core of Data Engineering

Chapter 5, Transforming Data with dbt