Description

Written by a team of 7 industry experts, Fundamentals of Analytics Engineering will introduce you to everything from foundational concepts to advanced skills to get started as an analytics engineer.
After conquering data ingestion and techniques for data quality and scalability, you’ll learn about techniques such as data cleaning and transformation, data modeling, SQL query optimization and reuse, and serving data across different platforms. Armed with this knowledge, you will implement a simple data platform from ingestion to visualization, using tools like Airbyte Cloud, Google BigQuery, dbt, and Tableau. You’ll also get to grips with strategies for data integrity, with a focus on data quality and observability, along with collaborative coding practices like version control with Git. You’ll learn about advanced principles like CI/CD, automating workflows, gathering, scoping, and documenting business requirements, as well as data governance.
By the end of this book, you’ll be armed with the essential techniques and best practices for developing scalable analytics solutions from end to end.




Fundamentals of Analytics Engineering

An introduction to building end-to-end analytics solutions

Dumky De Wilde, Fanny Kassapian,

Jovan Gligorevic, Juan Manuel Perafan,

Lasse Benninga, Ricardo Angel Granados Lopez,

Taís Laurindo Pereira

Fundamentals of Analytics Engineering

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Kaustubh Manglurkar

Publishing Product Manager: Apeksha Shetty

Book Project Manager: Aparna Nair

Editor: Juan Venegas

Senior Editor: Vandita Grover

Technical Editor: Kavyashree K S

Copy Editor: Safis Editing

Proofreader: Safis Editing

Indexer: Pratik Shirodkar

Production Designer: Jyoti Kadam

DevRel Marketing Coordinator: Nivedita Singh

First published: March 2024

Production reference: 1220324

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-83763-645-7

www.packtpub.com

Dedicated to every Xebian who contributed to this book’s development!

To Juan Venegas, for your meticulous and all-encompassing editing work, as well as organizing writing workshops that shaped our work. Trust us, the book would not be half as coherent without your help.

To our teammates: Guillermo Sanchez, Pádraic Slattery, Thom van Engelenburg, and Thomas van Latum, for your work as technical reviewers and proofreaders. Go and follow them on GitHub and check out their conference talks. All of them are world-class analytics engineers!

To Giovanni Lanzani, for your editorial support. We are deeply grateful for the time you took to refine and transform some of our longest chapters into masterpieces. Your contribution has truly elevated the quality of this book.

And to Bram Ochsendorf, for your continuous support and encouragement. Thank you for keeping us sane during the whole journey.

Thank you all for your contributions.

Foreword

Analytics engineering, as a role, is relatively new. Broadly speaking, it encompasses tasks such as ingesting data into a central location, transforming this data into a format valuable for business use cases, and engaging with the wider organization to promote data-driven decisions. The advent of cloud data warehouses, the explosive growth of modern data stack products, and the widespread adoption of dbt have made the tasks of analytics engineers more complex but also more valuable. Given this pace of change, it is impossible to write a book that covers everything you need to know to become a successful analytics engineer, but the core concepts will remain the same regardless of this change.

This book is a curated collection of knowledge from a group of individuals with a wide set of skills who have learned valuable lessons working with data across a variety of industries and roles. Their collected insights will serve as a solid foundation for the success of your own career in analytics engineering.

Over the course of this book, you will find discussions on data modeling techniques, tutorials on how to set up modern cloud tooling, and best practices concerning data transformation. These are among the growing number of tasks falling under the scope of analytics engineering. As such, no two roles are the same, but I believe that the concepts covered will provide you with guidance as you progress on your journey. Whether you are embarking on your first steps in analytics engineering or you have been walking this road for several years, I wish you the best of luck. 

Pádraic Slattery

Analytics Engineer

Xebia

Contributors

About the authors

Dumky de Wilde is an award-winning analytics engineer with close to 10 years of experience in setting up data pipelines, data models, and cloud infrastructure. Dumky has worked with a multitude of clients from government to fintech and retail. His background is in marketing analytics and web tracking implementations, but he has since branched out to include other areas and deliver value from data and analytics across the entire organization.

Fanny Kassapian has a multidisciplinary background across various industries, giving her a unique perspective on analytics workflows, from engineering pipelines to driving value for the business.

As a consultant, Fanny helps companies translate opportunities and business needs into technical solutions, implement analytics engineering best practices to streamline their pipelines, and treat data as a product. She is an avid promoter of data democratization, through technology and literacy.

I would like to thank my talented Xebia colleagues – in particular, Pádraic for his invaluable technical expertise and Guillermo and Lucy for their precious advice. A special thanks to my parents for their unwavering support and to my beloved sister Flora for her encouragement to pursue my ambitions.

Jovan Gligorevic is an analytics engineer specializing in data modeling and building analytical dashboards. His passion is building and delivering end-to-end analytics solutions and enabling self-service analytics for clients.

Jovan has a background in business and data science. His skills range from machine learning to dashboarding, and he has democratized data in a range of industries, including banking, e-commerce, automotive, FMCG, trading, and seed production. He is accustomed to working with different tooling and programming languages and has acquired vast experience working with the modern data stack.

Jovan also likes to deliver training in dbt and Power BI and share his knowledge.

I am grateful to Ivana for supporting me throughout this journey and would like to thank my dear colleagues for the combined effort to make this book a reality.

Juan Manuel Perafan has worked in the realm of analytics since 2017, specializing in SQL, data governance, and business intelligence (BI). He joined Xebia in 2020 as their first analytics engineer, making him an early adopter of this approach.

Besides helping his clients realize the value of their data, Juan is also very active in the data community. He has spoken at dozens of conferences and meetups around the world (including Coalesce 2023). Additionally, he is the founder of the Analytics Engineering meetup in the Netherlands as well as the Dutch dbt meetup.

To Megan, Daniel, Ella, Nugget, and my dear, happy family: thank you for all the love and support in the difficult moments. I love you all very much!

Lasse Benninga has been working in the data space since 2018, starting as a data engineer at a large airline, then switching to cloud engineering for consulting and working for different clients, ranging from healthcare to ride-sharing. Since 2021, he has been working as an analytics engineer at Xebia Data, where he combines his knowledge of cloud engineering with a love for analyzing data.

For Simone.

Ricardo Angel Granados Lopez is an analytics engineer who specializes in data engineering and analysis. With a master’s in IT management and a focus on data science, he is proficient in using various programming languages and tools. Ricardo is skilled in exploring efficient alternatives and has contributed to multicultural teams, creating business value with data products using modern data stack solutions. As an analytics engineer, he helps companies enhance data value through data modeling, best practices, task automation, and data quality improvement.

I want to thank my loving wife and daughter for their constant support and encouragement while writing this book. I’m also grateful to my family, colleagues at Xebia Data, and all the coauthors for their help during this project. Thank you all for being there and making this book a reality.

Taís Laurindo Pereira is a versatile data professional with experience in a diverse range of organizations – from big corporations to scale-ups. Before her move to Xebia, she had the chance to develop distinct data products, such as dashboards and machine learning implementations. Currently, she is focusing on end-to-end analytics as an analytics engineer.

With a mixed background in engineering and business, her mission is to contribute to data democratization in organizations, by helping them to overcome challenges when working with data at scale.

To Renske, my loving and patient future wife who supported me immensely during this process. Thank you for your understanding and for motivating me during hard times!

I would also like to thank my colleagues at Xebia who made this book a reality – the coauthors and other team members. Thank you for all your dedication and expertise.

About the reviewers

Kaveh Noorbakhsh is a computer scientist and biomedical engineer who has worked in technology fields for over 20 years. He has a B.S.E in biomedical engineering from Case Western Reserve University and an M.S. in computer science from Kent State University. Kaveh has spent his career building innovative products in the HealthIT, SaaS, and big data spaces. He has worked at companies at various levels, from small, 20-person start-ups to large, multinational firms. His work has been recognized by the awarding of two patents in big data analytics and data pipelines (US 20170206255 and US 20170201556).

I want to thank Shambhavi and Akshay who were patient with me and helped guide me through the review process.

Devanshu Tayal is a highly accomplished data scientist with a master’s degree from BITS, Pilani, India. His extensive expertise in data science is evidenced by his contributions to a wide range of industries. Devanshu is deeply committed to mentoring and guiding aspiring data scientists and is an avid researcher of emerging technologies in the field. He is a strong advocate for diversity and inclusion and has shared his insights through various publications. Devanshu is frequently invited to deliver guest lectures at universities throughout India, and his contributions as a technical reviewer have been acknowledged in multiple books. His comprehensive knowledge and experience in the field make him an asset to any team or project.

Table of Contents

Preface

Prologue

Part 1: Introduction to Analytics Engineering

1

What Is Analytics Engineering?

Introducing analytics engineering

Defining analytics engineering

Why do we need analytics engineering?

A supermarket analogy

The shift from ETL to ELT

The difference between analytics engineers, data analysts, and data engineers

Summary

2

The Modern Data Stack

Understanding a Modern Data Stack

Explaining three key differentiators versus legacy stacks

Lowering technical barriers with a SQL-first approach

Improving infrastructure efficiency with cloud-native systems

Simplifying implementation and maintenance with managed and modular solutions

Discussing the advantages and disadvantages of the MDS

Summary

Part 2: Building Data Pipelines

3

Data Ingestion

Digging into the problem of moving data between two systems

The source of all problems

Understanding the eight essential steps of a data ingestion pipeline

Trigger

Connection

State management

Data extraction

Transformations

Validation and data quality

Loading

Archiving and retention

Managing the quality and scalability of data ingestion pipelines – the three key topics

Scalability and resilience

Monitoring, logging, and alerting

Governance

Working with data ingestion – an example pipeline

Summary

4

Data Warehousing

Uncovering the evolution of data warehousing

The problem with transactional databases

The history of data warehouses

Moving to the cloud

Benefits of cloud versus on-premises data warehouses

Cloud data warehouse users – no one-size fits all

Building blocks of a cloud data warehouse

Compute

Knowing the market leaders in cloud data warehousing

Amazon Redshift

Google BigQuery

Snowflake

Databricks

Use case – choosing the right cloud data warehouse

Managed versus self-hosted data warehouses

Summary

5

Data Modeling

The importance of data models

Completeness

Enforcement of business rules

Minimizing redundancy

Data reusability

Stability and flexibility

Elegance

Communication

Integration

Potential trade-offs

The elephant in the room – performance

Designing your data model

Data modeling techniques

Bill Inmon and relational modeling

Ralph Kimball and dimensional modeling

Daniel Linstedt and Data Vault

Comparison of the different data models

Choosing a data model

Summary

6

Transforming Data

Transforming data – the foundation of analytics work

A key step in the data value chain

Challenges in transforming data

Design choices

Where to apply transformations

Specify your data model

Layering transformations

Data transformation best practices

Readability and reusability first, optimization second

Modularity

Other best practices

An example of writing modular code

Tools that facilitate data transformations

Types of transformation tools

Considerations

Summary

7

Serving Data

Exposing data using dashboarding and BI tools

Dashboards

Spreadsheets

Programming environments

Low-code tools

Reverse ETL

Valuable

Usable

Sensible

Serving data – four key topics

Self-serving analytics and report factories

Interactive and static reports

Actionable and vanity metrics

Reusability and bespoke processes

Summary

Part 3: Hands-On Guide to Building a Data Platform

8

Hands-On Analytics Engineering

Technical requirements

Understanding the Stroopwafelshop use case

Business objectives, metrics, and KPIs

Looking at the data

The thing about spreadsheets

What about BI tools?

The tooling

Preparing Google Cloud

ELT using Airbyte Cloud

Loading the Stroopwafelshop data using Airbyte Cloud

Modeling data using dbt Cloud

The shortcomings of conventional analytics

The role of dbt in analytics engineering

Setting up dbt Cloud

Data marts

Additional dbt features

Visualizing data with Tableau

Why Tableau?

Selecting the KPIs

First visualization

Creating measures

Creating the store growth dashboard

What’s next?

Summary

Part 4: DataOps

9

Data Quality and Observability

Understanding the problem of data quality at the source, in transformations, and in data governance

Data quality issues in source systems

Data quality issues in data infrastructure and data pipelines

How data governance impacts data quality

Finding solutions to data quality issues – observability, data catalogs, and semantic layers

Using observability to improve your data quality

The benefits of data catalogs for data quality

Improving data quality with a semantic layer

Summary

10

Writing Code in a Team

Identifying the responsibilities of team members

Tracking tasks and issues

Tools for issue and task tracking

Clear task definition

Categorization and tagging

Managing versions with version control

Working with Git

Git branching

Development workflow for analytics engineers

Working with coding standards

PEP8

ANSI

Linters

Pre-commit hooks

Reviewing code

Pull requests – The four eyes principle

Continuous integration/continuous deployment

Documenting code

Documenting code in dbt

Code comments

READMEs

Documentation on getting started

Conceptual documentation

Working with containers

Refactoring and technical debt

Summary

11

Automating Workflows

Introducing DataOps

Orchestrating data pipelines

Designing an automated workflow – considerations

dbt Cloud

Airflow

Continuous integration

Integration

Continuous

Handling integration issues

Automating testing with a CI pipeline

Continuous deployment

The CD pipeline

Slim CI/CD

Configuring CI/CD in dbt Cloud

Continuous delivery

Continuous delivery versus continuous deployment

Summary

Part 5: Data Strategy

12

Driving Business Adoption

Defining analytics translation

The analytics value chain

Scoping analytics use cases

Identifying stakeholders

Ideating analytics use cases

Prioritizing use cases

Ensuring business adoption

Working incrementally

Gathering feedback

Knowing when to stop developing

Communicating your results

Documenting business logic

Summary

13

Data Governance

Understanding data governance

The objective of data governance

Applying data governance in analytics engineering

Defining data ownership

Data quality and integrity

Managing data assets

Training, enablement, and best practices

Data definitions

Addressing critical areas for seamless data governance

Resistance to change and adoption

Engaging stakeholders and fostering collaboration

Establishing a data governance roadmap

Summary

14

Epilogue

Reviewing the fundamental insights – what you’ve learned so far

Making your career future-proof – how to take it further

Tip #1 – keep learning and developing your skills

Tip #2 – network and engage with the community

Tip #3 – showcase your work and build a portfolio

Closing remarks

Index

Other Books You May Enjoy

Share Your Thoughts

Once you’ve read Fundamentals of Analytics Engineering, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/9781837636457

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly.

Prologue

In my first job, one of my primary responsibilities was manually filling out an Excel sheet that powered a rudimentary dashboard with the most important KPIs. The dashboard was shown at the standup, which meant arriving early at the office every morning. Back then, I had little to no experience with data. But I was going to do anything in my power to sleep 15 minutes longer.

My plan was to find the right dashboarding tool and learn how to automate this report. That was when I came across Tableau. Very quickly, manually refreshing dashboards became a thing of the past. Honestly, I had a lot of fun building dashboards and torturing the data for insights.

One day, a top-level manager passed by my desk on his way to a meeting. When he noticed I was building a dashboard, he approached me and asked, “Are these the results of the latest campaign?” I nodded. He grabbed a chair, sat next to me, and proceeded to ask me multiple follow-up questions. I will never forget the fascination in his eyes when I opened Tableau and started creating visuals in a matter of seconds. That moment changed the rest of my career.

As the requests started growing, I had to ask for help from my systems and database administrators. Unfortunately, they were always busy with other projects, and I was never on their priority list. Getting access to a system or requesting any change in the data warehouse was always a hassle. Often, it was easier to descend into tutorial hell and teach myself how to do it.

The level of support from both the IT department and your management team can vary between companies and even roles. However, you can likely relate to this story. Despite this incident occurring almost a decade before the term “analytics engineer” was coined (during a time when the biggest buzzwords were Hadoop and IoT), two enduring lessons emerged: the ability to analyze data makes you indispensable, and enhancing your engineering skills reduces your reliance on the IT department. 

Please note that this journey, from manual data entry and a lack of IT support, is not just a story of personal growth; it highlights a broader need within organizations. This book is born from those challenges. At first glance, this book is a roadmap to empower individuals with the skills and knowledge to bridge the gap between data analysis and engineering. On a deeper level, it is about overcoming the technical and human challenges to become a data-driven organization.

Juan Manuel Perafan

Analytics Engineer 

Xebia 

Part 1: Introduction to Analytics Engineering

The first two chapters present the fundamental ideas supporting analytics engineering and the Modern Data Stack. They highlight how analytics engineers apply software engineering principles to structure, modify, verify, implement, and document data in a reusable manner, ultimately saving teams effort and time. Meanwhile, the Modern Data Stack prioritizes rapid delivery, cost-efficiency, and data accessibility, yet it can result in a proliferation of tools and fragmented data.

This section has the following chapters:

Chapter 1, What Is Analytics Engineering?
Chapter 2, The Modern Data Stack

1

What Is Analytics Engineering?

In the era of modern data warehousing, the landscape for data teams saw a radical transformation. With affordable storage and the growing popularity of Extract, Load, and Transform (ELT), a new breed of data teams has emerged. No longer dependent solely on data engineers for warehouse management and data transformation, these teams seek to bridge the gap between engineering and business. This change resulted in the rise of a specialized role – the analytics engineer.

The first written mention of the term analytics engineer appeared in early 2019. Since then, the field of analytics engineering has grown in popularity, demand, and standards.

In this chapter, you will learn what an analytics engineer does and how they can contribute to a solid data strategy. Additionally, we will discuss the shift from Extract, Transform, and Load (ETL) to ELT and the difference between the work of data analysts, data engineers, and analytics engineers.

The following main topics will be covered:

Introducing analytics engineering
Why do we need analytics engineering?

Introducing analytics engineering

In early 2019, in a blog post from the Locally Optimistic community, the term analytics engineer emerged to describe the need for a new role born out of new challenges in the data landscape. When working with data at scale, many companies face difficulties with traditional data teams, where the work of data engineers and data analysts is often siloed. Shortly after, dbt Labs – the company that developed dbt (data build tool), an open-source data transformation tool – started a community-wide campaign to promote this new job title.

Defining analytics engineering

Although currently there are many definitions of what this role entails, for this book we decided to present our own, backed up by years of consultancy experience in the field.

A definition of an analytics engineer

Analytics engineers are the bridge between business and data engineering. By adhering to software engineering best practices, they build and maintain robust transformation pipelines to produce high-quality datasets. With this process, they improve the accessibility and reliability of data products. The analytics engineer plays a central role in empowering end users to make truly data-informed decisions.

Other relevant definitions can be easily found on the internet – the most popular being the one by dbt Labs. We believe that the analytics engineer role goes beyond working with dbt and writing SQL. It includes analytics translation skills, such as effective communication, domain knowledge, project management, and entrepreneurial spirit. These skills are essential when talking to business stakeholders to understand business processes, translate these to relevant data models, and finally build pipelines. By delivering reliable and well-documented datasets to downstream users, analytics engineers are essential to data democratization initiatives in an organization.

From the definition presented, it can be inferred that analytics engineering, as the name suggests, sits at the intersection of data engineering and analytics. From data engineering, it borrows software engineering principles – such as version control, testing, and continuous integration and deployment – and applies them to data pipelines; from analytics, it borrows data modeling techniques. A skilled analytics engineer can balance these two areas of expertise, pairing them with business understanding.

Why do we need analytics engineering?

Now that you understand the core definition and skills behind the term, it is time to dive deeper into why the role is necessary. Without analytics engineering, data teams often struggle with the quality and reliability of data models. Data engineers might be overworked trying to maintain infrastructure while working on data modeling. Meanwhile, it is hard for analysts to trust the metrics they report on, since they have limited visibility into the complex transformation process behind them. When metrics are wrong, business stakeholders might lose trust in data. Ultimately, it impacts business performance, since it becomes harder to make data-informed decisions.

Let’s illustrate this situation with an analogy.

A supermarket analogy

Consider the data team as a supermarket supply chain. The data engineer is responsible for procurement, planning the transportation of products from producers and storage points to the supermarket for delivery. The data analyst, like a store manager, identifies shopping trends and presents them to business owners for pricing and promotional decisions.

The analytics engineer, in turn, is responsible for store layout design, product grouping, and ensuring customers can easily find items. They oversee inventory, coordinate product arrivals, avoid damaged goods, and ensure proper placement of new products. They also maintain aisle labels, product freshness, and shelf organization to minimize expired products and promote store sales.

These tasks help the data analyst (store manager) monitor supermarket activities, maintain accurate metrics, and create reports on store conditions. It also helps the data engineer (procurement) to meet consumer expectations and provide an overall pleasant shopping experience.

From the analogy, we can take away that the analytics engineer acts as a bridge between the data engineer and the data analyst’s work, facilitating a smooth customer experience. In a smaller store without this role, the product stocking task would fall to either one of the two, adding to their responsibilities and increasing their workload. In such a store, customers might struggle to find the products they want, which could be expired, damaged, or out of stock. This situation would negatively impact their experience and delay product delivery. Thus, the analytics engineer plays a central role in enabling a sustainable business operation.

The shift from ETL to ELT

In recent decades, the rise of cloud computing contributed to scalable and cost-effective compute and storage systems. This resulted in the wide adoption of cloud data warehouses, enabling affordable and efficient storage for large volumes of data. In turn, it reduced the need to transform data heavily before loading it, contributing to the shift from ETL to ELT.

This shift is one of the factors that influenced the increased demand for analytics engineers. In ETL, which was long the standard approach, data is extracted from various sources, transformed, and then loaded into a target data warehouse. In this process, data engineers typically own the whole cycle, and data analysts often have limited visibility of the logic used. This might result in siloed knowledge about how the data was processed. Conversely, with ELT, raw data is loaded into a data warehouse and transformed afterward. Figure 1.1 illustrates the ETL process:

Figure 1.1 – An ETL process

Figure 1.2 displays the ELT process:

Figure 1.2 – An ELT process

Figure 1.1 and Figure 1.2 illustrate that the shift from ETL to ELT allowed data engineers to focus on managing data sources and loading them into data warehouses. With that, the need for a specialized role focused on transformation emerged, creating the demand for analytics engineers. In ELT, they transform the raw loaded data and provide curated datasets for downstream users, such as analysts and business stakeholders.

Since with ELT all raw data is loaded in the warehouse, it is often an improvement in terms of agility and simplicity. Data ingestion is faster, and transformations can be done in SQL instead of complex Python code. Analytics engineers would then own the transformation layer while enabling data analysts to contribute to the SQL models behind it. This process would reduce siloed knowledge between engineers and analysts and streamline metrics definitions.

Moreover, with the availability of raw data, exploration can be done easily, allowing data teams to move fast and correct bugs quickly when necessary. With ELT, since historical raw data is loaded in the target data warehouse, transformations can be easily redone and rerun.
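To make this concrete, here is a minimal sketch of what an in-warehouse ELT transformation might look like. The schema, table, and column names (raw_sales.orders, analytics.daily_revenue, and so on) are hypothetical, and the exact DDL syntax varies slightly between cloud data warehouses:

```sql
-- Hypothetical example: raw data has already been loaded into raw_sales.orders (the E and L of ELT).
-- The transformation (the T) then happens entirely inside the warehouse, in SQL.
CREATE OR REPLACE TABLE analytics.daily_revenue AS
SELECT
    CAST(order_created_at AS DATE) AS order_date,     -- standardize the raw timestamp
    store_id,
    COUNT(DISTINCT order_id)       AS order_count,
    SUM(order_amount)              AS total_revenue   -- aggregate raw line items
FROM raw_sales.orders
WHERE order_status != 'cancelled'                      -- a simple cleaning rule
GROUP BY 1, 2;
```

Because the raw table stays in the warehouse, a statement like this can be rerun at any time – for instance, after fixing a bug in the cleaning rule – without re-extracting anything from the source system.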

However, it is important to note that ETL is still widely used in many organizations. Whether ETL or ELT – or even Extract, Transform, Load, and Transform (ETLT) – is employed depends on the requirements for the data products and team resources and skills. A common application for the ETL process is when organizations have strict requirements regarding Personally Identifiable Information (PII). In this scenario, PII fields need to be deleted or masked before entering the data warehouse. Although ELT has many advantages regarding flexibility and efficiency (especially in terms of data ingestion speed), ETL processes continue to improve in terms of methods and tools.

The difference between analytics engineers, data analysts, and data engineers

Now that the shift from ETL to ELT has been explained, it is time to understand in depth the differences – and in some cases, the overlap – between the work of analytics engineers, analysts, and data engineers.

To start, it is important to highlight that roles can differ significantly between companies, depending on their maturity stage, industry, and organizational structure. Additionally, individual preferences and skills can also influence how job responsibilities are shaped.

Still, although sometimes the lines may be blurry, differences remain between the work of these data professionals.

If we think of a data supply chain running from data collection to information usage – with cleaned and prepared raw data as its intermediary states (equivalent to the Transform step in Figures 1.1 and 1.2) – the core tasks of analysts and engineers are distinct. However, some responsibilities might overlap, especially in smaller organizations where job specialization is rare.

If their responsibilities can intersect, what about analytics engineers? The following diagram (Figure 1.3, Xebia, https://pages.xebia.com/whitepaper-data-democratization) builds upon Figures 1.1 and 1.2 and explores in depth the steps from data capture to information usage (BI and analytics):

Figure 1.3 – Intersecting the job roles of data engineers, analytics engineers, and analysts

We observe from Figure 1.3 that data analysts focus on analyzing data, developing dashboards and reports, and uncovering insights that influence business decisions. Their work is centered around descriptive, diagnostic, and prescriptive analytics, requiring constant contact with business stakeholders and often leadership. Because of that, besides technical capabilities such as SQL, Python/R, and BI tools, these professionals are expected to possess significant domain knowledge to deliver relevant analyses. In their effort to uncover insights, data analysts are often responsible for cleaning and preparing (modeling) data before information delivery.

Conversely, data engineers are concerned with designing, building, and maintaining the infrastructure that supports analytics engineers and downstream users, such as data analysts and data scientists. They work closely with these roles and other stakeholders to ensure that this infrastructure enables analytics-driven decisions. Data engineers are also often responsible for data extraction pipelines that require custom Python code. In terms of technical skills, high proficiency in Python is expected, including a clear understanding of Object-Oriented Programming (OOP) paradigms, along with SQL, Continuous Integration/Continuous Deployment (CI/CD), orchestration, and infrastructure as code. Depending on the organization and the tasks involved, other programming languages might be necessary in addition to Python, such as Scala and Java.

Therefore, the analytics engineer serves as a bridge between data engineers and analysts, focusing on the entire data domain of the supply chain – raw, cleaned, and prepared data – up to the point of information delivery. Analytics engineers often leverage tools from the Modern Data Stack, which will be discussed in Chapter 2, The Modern Data Stack, to achieve this task. Among these tools, a key one is dbt, which allows analytics engineers to transform data using SQL, testing and documenting these transformations while tracking data lineage. For this role, it is necessary to have a high proficiency in SQL, Python, and data modeling techniques, and an understanding of orchestration and CI/CD principles. Finally, business acumen is also essential to understand how to contribute to business value, through relevant data models that represent business processes and enable key decisions.
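To give a flavor of the testing and documentation workflow mentioned above, dbt lets you express a data quality check as plain SQL: a singular test is simply a query that selects the rows violating an expectation, and dbt marks the test as failed if the query returns any rows. The model and column names below (stg_orders, order_amount) are purely illustrative assumptions:

```sql
-- tests/assert_no_negative_order_amounts.sql (hypothetical file name)
-- A dbt singular test: select every row that breaks the rule.
-- If this query returns one or more rows, `dbt test` reports a failure.
SELECT
    order_id,
    order_amount
FROM {{ ref('stg_orders') }}  -- ref() points at another dbt model and records lineage
WHERE order_amount < 0
```

This is only a sketch of one dbt feature; Chapter 6, Transforming Data, and Chapter 8, Hands-On Analytics Engineering, cover transformation and dbt in much more detail.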

Summary

In this chapter, you learned how analytics engineers act as the bridge between business, analytics, and data engineering, enabling data democratization through the quality and accessibility of data products. By delivering reliable data models, analytics engineers are a centerpiece of data-informed decisions.

You also learned how the shift from ETL to ELT contributed to the rise of analytics engineering, and how both methods are still applicable currently. Additionally, we discussed how roles can differ across organizations, and how analytics engineers fit in the data supply chain.

In the next chapter, The Modern Data Stack, you will get an overview of the collection of tools and technologies that enable this life cycle, and how this shifted roles and technical barriers in the data landscape.

2

The Modern Data Stack

Data’s exponential growth and the need to incorporate analytics-driven decision-making have brought new challenges and requirements to organizations. Information must be reliable, scalable, and delivered quickly to remain competitive. In this scenario, a new set of tools, technologies, and processes starts to gain traction – the Modern Data Stack (MDS).

In this chapter, you will learn what the MDS is, the principles that differentiate it from legacy stacks, and its advantages and disadvantages. Although this set of tools brought significant technological advancements compared to tightly coupled legacy systems, it is important to be aware of its pitfalls and of recent considerations when choosing this type of stack.

The main topics that are going to be covered are as follows:

Understanding a Modern Data Stack
Explaining three key differentiators versus legacy stacks
Discussing the advantages and disadvantages of the MDS

Understanding a Modern Data Stack

As the name suggests, the MDS represents a technological evolution compared to previous systems widely used in recent decades. From the development of the business data warehouse in the 1980s to the rise of cloud technology with Amazon Web Services (AWS) in the early 2000s, on-premises legacy data stacks dominated the landscape. These systems had a monolithic IT infrastructure, resulting in complex maintenance. The MDS transformed this scenario – bringing modularity and cloud-native tools. However, before we dive into the details, let’s first define what a data stack is.

A data stack is a collection of tools and services as part of an extensive technology infrastructure designed to ingest, store, transform, and serve data. It makes data accessible across an organization and is fundamental to delivering business insights through reporting and dashboards, advanced analytics, and Machine Learning (ML) applications. Figure 2.1 illustrates an example of a high-level architecture of a data stack.

Figure 2.1 – An example of a high-level architecture of a data stack

Here, the data flows from left to right. The raw data is ingested, stored in a data warehouse, transformed, and finally, served to data analysts, data scientists, and business users.

Consequently, the MDS is nothing more than a subset of such an architecture – a specific set of tools that democratizes access to the main functionalities of a data stack, reducing the complexity of implementation and improving the scalability of the data life cycle.

In the following table, we compare the main characteristics of the legacy data stack and the MDS.

Characteristic            Legacy Data Stack                    Modern Data Stack
Architecture              Monolithic architecture              Modular tools
Servers                   On-premises servers                  Cloud-based
Maintenance               Complex – many resources required    Simplified, managed solutions
Programming languages     Java/Scala/Python                    SQL-first
Data ingestion            ETL-focused                          ELT-focused

Table 2.1 – A legacy versus Modern Data Stack comparison

Now that we have seen the definition of the MDS and how it compares to legacy stacks, it is time to see how it looks in practice. In Figure 2.2, an example is provided, with an overview of some of the tools that are used for key functionalities within this design:

Figure 2.2 – An example of an MDS

As we can see in Figure 2.2, the main blocks of an MDS can be considered as follows:

- Managed ingestion: This is responsible for the EL in ELT. It helps to streamline data extraction and ingestion through third-party-managed software applications.
- Data warehouse/lakehouse: These are cloud-based systems used to store large volumes of data.
- Transformation: This is what the T in ELT stands for. Transformations help in data cleaning and preparation to meet business intelligence needs.
- Orchestration: Orchestration helps in setting up specific tasks to run automatically at a particular event. As an example, an orchestration tool can be paired up with dbt Core for scheduled transformation.
- Self-service layer: In this block, reports and analysis are provided to business users, enabling these stakeholders to answer their questions and make data-informed decisions.

With managed ingestion, data teams are less dependent on the work of data engineers. Tools such as Stitch, Fivetran, and Airbyte empower analytics engineers to own end-to-end data pipelines, focusing on data modeling and transformation.

The list of tools and blocks in Figure 2.2 is not exhaustive. New tools and MDS companies emerge every year. Other common and important blocks are as follows:

- Data catalog tools are used for search and discovery, helping users to document and democratize access to business logic and data assets. Examples are Atlan, data.world, and DataHub.
- Data quality and observability tools are used to monitor pipelines and ensure the overall quality of data. Tool examples are Soda, Datafold, and Monte Carlo.
- Reverse ETL is used to retrieve data from a data warehouse and publish it to the systems used by business users, such as CRM software. Key market leaders are Census, Hightouch, and RudderStack.

We will now move our focus to how the modern data stack differs from the legacy stacks.

Explaining three key differentiators versus legacy stacks

The MDS brought a series of technological advancements in comparison with precursor (pre-cloud) systems. As shown in Table 2.1, the main differentiators are its SQL-first approach, cloud-native tools, and managed and modular solutions. In this section, we are going to dive deeper into the details of these characteristics.

Lowering technical barriers with a SQL-first approach

Developed in the 1970s by IBM researchers, SQL – or SEQUEL as it was called originally – was designed to access data in an integrated relational database, by both professional programmers and more occasional database users. Here is a link to the full version of the paper, published by Donald D. Chamberlin and Raymond F. Boyce (1976): https://doi.org/10.1145/800296.811515.

Decades later, SQL remains indispensable for this purpose, and its popularity has increased sharply. In the Stack Overflow Developer Survey 2022, it was the third most used programming language, as voted by 49.43% of all respondents – professional and beginner developers.

Despite SQL’s popularity, until recent years, big data professionals focused on programming languages such as Java and Scala. Due to their complexity, these languages demand a high technical skill level. SQL, conversely, is simple, accessible, and flexible, contributing to a lower technical barrier for the required skill set to work with data at scale. Therefore, SQL and Python are the main programming languages currently seen in job postings for analytics engineers, data engineers, data analysts, and data scientists.

Not surprisingly, in the MDS, SQL has a fundamental role – with SQL-first tools such as dbt, analytics engineers (or, in some cases, data analysts) can own a more significant part of the data life cycle. Through models using this language, these professionals can now execute data transformation pipelines following software engineering best practices, such as version control and modular code – once limited to Java/Scala roles in traditional data teams.

The following screenshot displays a dbt model in dbt Cloud:

Figure 2.3 – An example of a dbt model written in SQL

In the figure, using mainly SQL, a staging orders table is created referencing a source (raw) table, as a preliminary step before more complex data transformation. More on data transformation will be discussed in Chapter 6, Transforming Data.
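Since the screenshot cannot convey the code itself, the following sketch shows roughly what such a staging model could look like. The source, model, and column names (shop, raw_orders, stg_orders, and so on) are assumptions for illustration, and a matching source definition in a .yml file is assumed to exist; the actual model in the figure may differ:

```sql
-- models/staging/stg_orders.sql (hypothetical path and names)
-- A staging model: light renaming and type casting on top of a raw source table.
WITH source AS (
    SELECT * FROM {{ source('shop', 'raw_orders') }}  -- source() points at the raw table declared in a .yml file
)

SELECT
    id                            AS order_id,
    customer                      AS customer_id,
    CAST(created_at AS TIMESTAMP) AS ordered_at,
    status                        AS order_status
FROM source
```

Downstream models can then reference this staging model with ref(), which is how dbt tracks lineage between transformations.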

Therefore, the SQL-first character of the MDS is fundamental to the democratization of tasks once limited to technical experts in more complex programming languages.

A quick glance at popular technologies

You can check out the Stack Overflow Developer Survey 2022 at https://survey.stackoverflow.co/2022/#most-popular-technologies-language for a quick review of the popular tools and technologies that developers use these days.

Next, you will learn about another key principle – cloud-native solutions.

Improving infrastructure efficiency with cloud-native systems

As mentioned in the first section, Understanding a Modern Data Stack, legacy systems were often on-premises, making scalability and maintenance harder. With the launch of modern cloud computing systems by Amazon with AWS in 2002, this landscape started to change. Since then, new players have emerged, such as Google, Microsoft, and Alibaba.

According to a recent market share report (https://www.srgresearch.com/articles/q1-cloud-spending-grows-by-over-10-billion-from-2022-the-big-three-account-for-65-of-the-total) developed by the Synergy Research Group and shared by Statista, AWS has the highest global market share in revenue – 32% in Q1 2023. It is followed by Microsoft Azure, with 23%, and Google Cloud, with 10%. These three largest cloud providers account for 65% of the global market. The report also states that the cloud market continues to grow worldwide, where EMEA (Europe, Middle-East, and Africa), North America, and APAC (Asia-Pacific) lead it, with a growth of 20% year over year.

The MDS is cloud-native, meaning that its tools are designed to integrate seamlessly with different clouds. More specifically, they are designed to integrate easily with cloud data warehouses. This type of integration is an improvement over the legacy data stack in terms of overall system infrastructure. In those systems, connecting new applications to the existing on-premises database was often an arduous task.

In cloud-based environments, adaptability, scalability, and maintenance become less complex, resulting in fewer necessary resources – both in cost and people – when compared to traditional monolithic systems. Lastly, since the MDS tools can be integrated with different clouds, vendor lock-in (such as data warehouse vendors) is also reduced.

Next, it will become clear how the combination of being cloud-based and using out-of-the-box managed and modular tools translates into the speed of delivery and cost advantages discussed in the first section.

Simplifying implementation and maintenance with managed and modular solutions

An important innovation in the MDS is the use of managed solutions. Managed refers to software applications that are maintained and hosted by third-party service providers, which, in this case, are the companies that own these products.

In the case of managed ingestion, these tools simplify the deployment and maintenance of data pipelines, preventing the need for custom code and custom deployments. With several pre-made connectors available, it is easy and quick to set up a data source and a destination (e.g., a data warehouse or a lakehouse) and have an almost maintenance-free ingestion.

Another important innovation refers to modularity. Legacy data stacks were often tightly coupled, with a high level of interdependency between components, making these systems very resistant to change. As shown previously in Figure 2.2, in the MDS, different tools serve different purposes, such as ingestion, storage, transformation, orchestration, and business intelligence. With this configuration, it is possible to integrate new functionality faster or change parts of the stack without impacting its core function – to ingest, transform, and serve data. This design reduces vendor lock-in, due to the several available options for each block.

Additionally, it is important to note that some managed tools are based on open-source software. This is the case for Airbyte – an open-source data integration platform that offers a fully managed cloud-based solution called Airbyte Cloud. Another key example is dbt, which offers dbt Core (open-source) and dbt Cloud. In the orchestration block, Dagster also has a cloud-based platform, Dagster Cloud.

You will also come across the term managed open-source software. This has become a popular business model. Many data teams don’t have the expertise or resources to self-host and maintain open-source solutions, turning to the managed version instead. Therefore, managed software democratizes access to functionality once restricted to highly technical software engineering teams.

The combination of managed and modular applications adds an essential layer of flexibility to organizations. By simplifying data ingestion, storage, transformation, and serving, while using modular functionality, companies can increase agility and adapt to changes in business requirements.

Now that you understand the definition, tools, and building blocks, it is time to discuss the Modern Data Stack’s advantages and disadvantages.

Discussing the advantages and disadvantages of the MDS

Although the MDS simplifies the data life cycle significantly, it is important to highlight and discuss its pitfalls.

The advantages can be summarized as follows:

- Speed of delivery: Setting up an MDS can take as little as a few hours to a couple of days, depending on the specific use case. Additionally, its self-service characteristic brings data products and solutions closer to where the business questions arise.
- Cost: As the MDS is cloud-based and uses out-of-the-box tools, costs are cut down significantly due to reduced complexity in architecture and hardware. In addition, cloud-native data warehouses provide cheaper data processing and storage when compared to on-premises systems.
- Democratization: Here, we see two dimensions of democratization – lowered technical barriers and the dissemination of information across an organization. The combination of managed ingestion and SQL-first tools lowers the technical barrier for data professionals. Therefore, the introduction of this stack shifted the required skill set, also making it easier to hire qualified talent. Also, the self-service block democratizes access to data across an organization, enabling business users to make data-informed decisions without relying solely on data analysts to provide information.

Yet, implementing a Modern Data Stack can have some disadvantages. It is important to be aware of the following:

- Tool proliferation: With so many tools available, it might be difficult for companies and data professionals to keep track of the latest developments and decide which set of tools and technologies to use. Moreover, since different tools serve different purposes, organizations might end up with 4-8 different products, adding complexity in managing licenses, vendors, and contracts.
- Siloed information: This fragmented approach can lead to knowledge silos, where different teams or individuals focus on a set of tools and technologies, not sharing knowledge across an organization and reducing efficiency.

An important point to highlight is that most MDS companies are startups or scale-ups backed by VC (venture capital) investors. In this business model, product pricing changes can happen abruptly, and there is a certain risk associated with the company’s continuity of activities. Organizations that adopt these products should be aware of this and make architecture decisions where replacing a tool – or going for an open-source variant if there is one – is feasible and won’t drastically impact key business decisions.

Many argue that the simplicity and speed of configuration of the MDS have led data teams to shift their focus to delivering quick insights, rather than emphasizing best practices. It is important to be aware that the MDS does not substitute the need for proper software engineering, data modeling, and architecture principles. This is why analytics engineering continues to be essential in this scenario.

Summary

In this chapter, you learned what the Modern Data Stack is and how it can transform the work of data teams by lowering technical barriers, improving the speed of information delivery, and bringing modular and managed solutions. With its combination of a SQL-first approach, cloud-based systems, and managed and modular tools, the MDS brings improvements compared to legacy, on-premises, and monolithic stacks. Still, it is important to be aware of its pitfalls, such as the large number of tools and considerations about siloed information.

With the MDS, analytics engineers can leverage managed ingestion and data transformation tools to deliver end-to-end analytics.

In the following chapter, you will get the opportunity to deepen your knowledge of one of the MDS’s components – data ingestion. You will learn common approaches to it and what the steps are to set up an efficient and performant ingestion pipeline.

Part 2: Building Data Pipelines

This section is arguably the book’s core. In these chapters, you will gain hands-on skills in managing data and constructing platforms effectively. Each chapter addresses essential aspects of an analytics engineer’s role. Whether you focus on data ingestion, warehousing, modeling, transformation, or serving data, this section aims to provide a holistic understanding, recognizing the interdependence of each step in the pipeline.

This section has the following chapters:

Chapter 3, Data Ingestion
Chapter 4, Data Warehousing
Chapter 5, Data Modeling
Chapter 6, Transforming Data
Chapter 7, Serving Data

3

Data Ingestion

A data platform is useless without any actual data in it. To access your data, combine it with other sources, enrich it, or share it across an organization, you will first need to get that data into your data platform. This is the process we call data ingestion. Data ingestion comes in all sorts and forms. Everyone is familiar with the age-old process of emailing Excel sheets back and forth, but luckily, there are more advanced and consistent ways of adding data to your platform.

Whether clicking your way through a managed ingestion tool such as Fivetran, Stitch, or Airbyte, or writing scripts to handle the parallel processing of multiple real-time data streams in a distributed system such as Spark, learning the steps of a data ingestion pipeline will help you build robust solutions. Building such a solution will help you guarantee the quality of your data, keep your stakeholders happy, and allow you to spend less time debugging and fixing broken code and more time adding new data to your organization’s data analytics platform.

In this chapter, we’re going to cover the following main topics:

Digging into the problem of moving data between two systems
Understanding the eight essential steps of a data ingestion pipeline
Managing the quality and scalability of your data ingestion pipeline – the three key topics
Working with data ingestion – an example pipeline

Digging into the problem of moving data between two systems

We have talked, in Chapter 1, What Is Analytics Engineering?, about the changing process of extracting, transforming, and loading (ETL) data, but understanding these steps is only part of ingesting data. Whenever you add new data to your data platform, whether that is sales data, currency exchange data, web analytics data, or video footage, you will have to make certain choices around the frequency, quality, reliability, and retention of that data, among many others. If you do not think ahead at the beginning, reality will catch up with you when your data provider makes a change, a pipeline accidentally runs twice, or the requirements from the business change. But why do we need to move data from one system to the other in the first place?

You might consider it a given that you have to manipulate data a bit to make it fit your purpose. Maybe you’re used to pivoting tables in an Excel sheet, adjusting formulas where needed, or you have written your fair share of SQL queries to load and transform data for a dashboard or report, but it is a strange thing. Why is it that we need data engineers, analytics engineers, analysts, and who knows how many people, to move data that already exists from one place to another?

It is easy to say, “Well, it’s in the wrong format in the wrong place, so I’ll just move it and leave it at that.”

But understanding why it is in the wrong format – because it likely is – and why it is in the wrong place – because again, it likely is – will help you build better data ingestion pipelines. Understanding the key differences between the requirements for the data at the source compared to the requirements at the destination will allow you to make sure your pipelines fulfill their purpose. Let’s understand what these differences are.

The source of all problems