Description

With this book, you'll understand how the highly scalable Google Cloud Platform (GCP) enables data engineers to create end-to-end data pipelines, from storing and processing data and orchestrating workflows to presenting data through visualization dashboards.
Starting with a quick overview of the fundamental concepts of data engineering, you'll learn the various responsibilities of a data engineer and how GCP plays a vital role in fulfilling those responsibilities. As you progress through the chapters, you'll be able to leverage GCP products to build a sample data warehouse using Cloud Storage and BigQuery and a data lake using Dataproc. The book gradually takes you through operations such as data ingestion, data cleansing, transformation, and integrating data with other sources. You'll learn how to design IAM for data governance, deploy ML pipelines with Vertex AI, leverage pre-built GCP models as a service, and visualize data with Google Data Studio to build compelling reports. Finally, you'll find tips on how to boost your career as a data engineer, take the Professional Data Engineer certification exam, and get ready to become an expert in data engineering with GCP.
By the end of this data engineering book, you'll have developed the skills to perform core data engineering tasks and build efficient ETL data pipelines with GCP.




Data Engineering with Google Cloud Platform

A practical guide to operationalizing scalable data analytics systems on GCP

Adi Wijaya

BIRMINGHAM—MUMBAI

Data Engineering with Google Cloud Platform

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Devika Battike

Senior Editor: David Sugarman

Content Development Editor: Sean Lobo

Technical Editor: Devanshi Ayare

Copy Editor: Safis Editing

Project Coordinator: Aparna Ravikumar Nair

Proofreader: Safis Editing

Indexer: Hemangini Bari

Production Designer: Jyoti Chauhan

Marketing Coordinator: Priyanka Mhatre

First published: March 2022

Production reference: 2100322

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80056-132-8

www.packt.com

Contributors

About the author

Adi Wijaya is a strategic cloud data engineer at Google. He holds a bachelor's degree in computer science from Binus University and co-founded DataLabs in Indonesia. Currently, he dedicates himself to big data and analytics and has spent a good chunk of his career helping global companies in different industries.

About the reviewer

Fajar Muharandy has over 15 years' experience in the data and analytics space. Throughout his career, he has been involved in some of the largest data warehouse and big data platform designs and implementations. Fajar is a strong believer that every data platform implementation should always start with business questions in mind, and that all stakeholders should strive toward defining the right data and technology to achieve the common goal of getting the answers to those business questions. Aside from his professional career, Fajar is also the co-founder of the Data Science Indonesia community, a community of data science enthusiasts in Indonesia who believe that data should be the foundation to push and drive actions for the greater good.

Table of Contents

Preface

Section 1: Getting Started with Data Engineering with GCP

Chapter 1: Fundamentals of Data Engineering

Understanding the data life cycle

Understanding the need for a data warehouse

Knowing the roles of a data engineer before starting

Data engineer versus data scientist

The focus of data engineers

Foundational concepts for data engineering

ETL concept in data engineering 

The difference between ETL and ELT

What is NOT big data?

A quick look at how big data technologies store data

A quick look at how to process multiple files using MapReduce

Summary

Exercise

See also

Chapter 2: Big Data Capabilities on GCP

Technical requirements

Understanding what the cloud is

The difference between the cloud and non-cloud era

The on-demand nature of the cloud

Getting started with Google Cloud Platform

Introduction to the GCP console

Practicing pinning services

Creating your first GCP project

Using GCP Cloud Shell

A quick overview of GCP services for data engineering

Understanding the GCP serverless service

Service mapping and prioritization

The concept of quotas on GCP services

User account versus service account

Summary

Section 2: Building Solutions with GCP Components

Chapter 3: Building a Data Warehouse in BigQuery

Technical requirements

Introduction to Google Cloud Storage and BigQuery

BigQuery data location

Introduction to the BigQuery console

Creating a dataset in BigQuery using the console

Loading a local CSV file into the BigQuery table

Using public data in BigQuery

Data types in BigQuery compared to other databases

Timestamp data in BigQuery compared to other databases

Preparing the prerequisites before developing our data warehouse

Step 1: Access your Cloud Shell

Step 2: Check the current setup using the command line 

Step 3: The gcloud init command

Step 4: Download example data from Git

Step 5: Upload data to GCS from Git

Practicing developing a data warehouse

Data warehouse in BigQuery – Requirements for scenario 1

Steps and planning for handling scenario 1

Data warehouse in BigQuery – Requirements for scenario 2

Steps and planning for handling scenario 2

Summary

Exercise – Scenario 3

See also

Chapter 4: Building Orchestration for Batch Data Loading Using Cloud Composer

Technical requirements

Introduction to Cloud Composer

Understanding the working of Airflow

Provisioning Cloud Composer in a GCP project

Exercise: Build data pipeline orchestration using Cloud Composer

Level 1 DAG – Creating dummy workflows

Level 2 DAG – Scheduling a pipeline from Cloud SQL to GCS and BigQuery datasets

Level 3 DAG – Parameterized variables

Level 4 DAG – Guaranteeing task idempotency in Cloud Composer

Level 5 DAG – Handling late data using a sensor

Summary

Chapter 5: Building a Data Lake Using Dataproc

Technical requirements

Introduction to Dataproc

A brief history of the data lake and Hadoop ecosystem

A deeper look into Hadoop components

How much Hadoop-related knowledge do you need on GCP?

Introducing the Spark RDD and the DataFrame concept

Introducing the data lake concept

Hadoop and Dataproc positioning on GCP

Exercise – Building a data lake on a Dataproc cluster

Creating a Dataproc cluster on GCP

Using Cloud Storage as an underlying Dataproc file system 

Exercise: Creating and running jobs on a Dataproc cluster

Preparing log data in GCS and HDFS

Developing Spark ETL from HDFS to HDFS

Developing Spark ETL from GCS to GCS

Developing Spark ETL from GCS to BigQuery

Understanding the concept of the ephemeral cluster

Practicing using a workflow template on Dataproc

Building an ephemeral cluster using Dataproc and Cloud Composer

Summary 

Chapter 6: Processing Streaming Data with Pub/Sub and Dataflow

Technical requirements

Processing streaming data

Streaming data for data engineers

Introduction to Pub/Sub

Introduction to Dataflow

Exercise – Publishing event streams to Cloud Pub/Sub

Creating a Pub/Sub topic

Creating and running a Pub/Sub publisher using Python

Creating a Pub/Sub subscription

Exercise – Using Cloud Dataflow to stream data from Pub/Sub to GCS

Creating a HelloWorld application using Apache Beam

Creating a Dataflow streaming job without aggregation

Creating a streaming job with aggregation

Summary

Chapter 7: Visualizing Data for Making Data-Driven Decisions with Data Studio

Technical requirements

Unlocking the power of your data with Data Studio

From data to metrics in minutes with an illustrative use case

Understanding what BigQuery INFORMATION_SCHEMA is

Exercise – Exploring the BigQuery INFORMATION_SCHEMA table using Data Studio

Exercise – Creating a Data Studio report using data from a bike-sharing data warehouse

Understanding how Data Studio can impact the cost of BigQuery

What kind of table could be 1 TB in size?

How can a table be accessed 10,000 times in a month?

Creating materialized views and understanding how BI Engine works

Understanding BI Engine

Summary

Chapter 8: Building Machine Learning Solutions on Google Cloud Platform

Technical requirements

A quick look at machine learning

Exercise – practicing ML code using Python

Preparing the ML dataset by using a table from the BigQuery public dataset

Training the ML model using Random Forest in Python

Creating Batch Prediction using the training dataset's output

The MLOps landscape in GCP

Understanding the basic principles of MLOps

Introducing GCP services related to MLOps

Exercise – leveraging pre-built GCP models as a service 

Uploading the image to a GCS bucket

Creating a detect text function in Python

Exercise – using GCP in AutoML to train an ML model

Exercise – deploying a dummy workflow with Vertex AI Pipeline

Creating a dedicated regional GCS bucket

Developing the pipeline on Python

Monitoring the pipeline on the Vertex AI Pipeline console

Exercise – deploying a scikit-learn model pipeline with Vertex AI

Creating the first pipeline, which will result in an ML model file in GCS

Running the first pipeline in Vertex AI Pipeline

Creating the second pipeline, which will use the model file from the prediction results as a CSV file in GCS

Running the second pipeline in Vertex AI Pipeline

Summary 

Section 3: Key Strategies for Architecting Top-Notch Data Pipelines

Chapter 9: User and Project Management in GCP 

Technical requirements 

Understanding IAM in GCP 

Planning a GCP project structure

Understanding the GCP organization, folder, and project hierarchy

Deciding how many projects we should have in a GCP organization

Controlling user access to our data warehouse

Use-case scenario – planning a BigQuery ACL on an e-commerce organization

Column-level security in BigQuery

Practicing the concept of IaC using Terraform

Exercise – creating and running basic Terraform scripts

Self-exercise – managing a GCP project and resources using Terraform

Summary

Chapter 10: Cost Strategy in GCP

Technical requirements

Estimating the cost of your end-to-end data solution in GCP

Comparing BigQuery on-demand and flat-rate 

Example – estimating data engineering use case

Tips for optimizing BigQuery using partitioned and clustered tables 

Partitioned tables

Clustered tables

Exercise – optimizing BigQuery on-demand cost

Summary

Chapter 11: CI/CD on Google Cloud Platform for Data Engineers

Technical requirements

Introduction to CI/CD

Understanding the data engineer's relationship with CI/CD practices

Understanding CI/CD components with GCP services

Exercise – implementing continuous integration using Cloud Build

Creating a GitHub repository using Cloud Source Repository

Developing the code and Cloud Build scripts

Creating the Cloud Build Trigger

Pushing the code to the GitHub repository

Exercise – deploying Cloud Composer jobs using Cloud Build

Preparing the CI/CD environment

Preparing the cloudbuild.yaml configuration file

Pushing the DAG to our GitHub repository

Checking the CI/CD result in the GCS bucket and Cloud Composer

Summary

Further reading

Chapter 12: Boosting Your Confidence as a Data Engineer

Overviewing the Google Cloud certification

Exam preparation tips

Extra GCP services material

Quiz – reviewing all the concepts you've learned about

Questions

Answers

The past, present, and future of Data Engineering

Boosting your confidence and final thoughts

Summary

Why subscribe?

Other Books You May Enjoy

Preface

Too much information, too many options: it's complicated. We live in a world where too much information is as problematic as too little, and this condition certainly applies when people want to start doing data engineering in the cloud, specifically on Google Cloud Platform (GCP), the subject of this book.

When people want to embark on a career in data, there are so many different roles whose definitions sometimes vary from one company to the next.

When someone chooses to be a data engineer, there are a great number of technology options: cloud versus non-cloud, big data databases versus traditional ones, self-managed versus managed services, and many more.

When they decide to use the cloud on GCP, the public documentation contains a wide variety of product options and tutorials.

Instead of adding yet another dimension to data engineering and GCP products, the main goal of this book is to help you narrow the information down to the most important concepts and components from the vast array available on the internet. The guidance and exercises are based on my experience in the field and will give you a clear focus. By reading the book and following the exercises, you will learn the most relevant and direct path to start and boost your career in data engineering using GCP.

Who this book is for

This book is intended for anyone involved in the data and analytics space, including IT developers, data analysts, data scientists, and anyone in a related role who wants a jump start in the data engineering field.

This book is also intended for data engineers who want to start using GCP, prepare for the certification exam, and work through practical examples based on real-world scenarios.

Finally, this book will be of interest to anyone who wants the thought process, practical guidance, and a clear path through the technology components needed to get started, achieve the certification, and gain a practical perspective on data engineering with GCP.

What this book covers

This book is divided into 3 sections and 12 chapters. Each section is a collection of independent chapters that have one objective:

Chapter 1, Fundamentals of Data Engineering, explains the role of data engineers and how data engineering relates to GCP.

Chapter 2, Big Data Capabilities on GCP, introduces the relevant GCP services related to data engineering.

Chapter 3, Building a Data Warehouse in BigQuery, covers the data warehouse concept using BigQuery.

Chapter 4, Building Orchestration for Batch Data Loading Using Cloud Composer, explains data orchestration using Cloud Composer.

Chapter 5, Building a Data Lake Using Dataproc, details the data lake concept with Hadoop using Dataproc.

Chapter 6, Processing Streaming Data with Pub/Sub and Dataflow, explains the concept of streaming data using Pub/Sub and Dataflow.

Chapter 7, Visualizing Data for Making Data-Driven Decisions with Data Studio, covers how to visualize data from BigQuery as charts in Data Studio.

Chapter 8, Building Machine Learning Solutions on Google Cloud Platform, sets out the concept of MLOps using Vertex AI.

Chapter 9, User and Project Management in GCP, explains the fundamentals of GCP Identity and Access Management and project structures.

Chapter 10, Cost Strategy in GCP, covers how to estimate the cost of an end-to-end data solution on GCP.

Chapter 11, CI/CD on Google Cloud Platform for Data Engineers, explains the concept of CI/CD and its relevance to data engineers.

Chapter 12, Boosting Your Confidence as a Data Engineer, prepares you for the GCP certification and offers some final thoughts in terms of summarizing what's been learned in this book.

To get the most out of this book

To successfully follow the examples in this book, you need a GCP account and project. If, at this point, you don't have a GCP account and project, don't worry. We will cover that as part of the exercises in this book.

Occasionally, we will use the free tier from GCP for practice, but be aware that some products might not have free tiers. Notes will be provided if this is the case.

All the exercises in this book can be completed without any additional software installation. The exercises will be done in the GCP console that you can open from any operating system using your favorite browser.

You should be familiar with basic programming languages. In this book, I will focus on utilizing Python and the Linux command line.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

This book is not positioned to replace GCP public documentation. Hence, comprehensive information on every single feature of GCP services might not be available in this book. We also won't use all the GCP services that are available. For such information, you can always check the public documentation.

Remember that the main goal of this book is to help you narrow down information. Use it as your step-by-step guide to building solutions to the common challenges facing data engineers. Follow the patterns in the exercises, the relationships between concepts, the important GCP services, and the best practices. Always do the hands-on exercises so that you gain experience working with GCP.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform. If there's an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781800561328_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in the text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

html, body, #map {
  height: 100%;
  margin: 0;
  padding: 0
}

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)

Any command-line input or output is written as follows:

$ mkdir css
$ cd css

Bold: Indicates a new term, an important word, or words that you see on screen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Select System info from the Administration panel."

Tips or Important Notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read Data Engineering with Google Cloud Platform, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Section 1: Getting Started with Data Engineering with GCP

This part will talk about the purpose, value, and concepts of big data and cloud computing and how GCP products are relevant to data engineering. You will learn about a data engineer's core responsibilities, how they differ from those of a data scientist, and how to facilitate the flow of data through an organization to derive insights.

This section comprises the following chapters:

Chapter 1, Fundamentals of Data Engineering
Chapter 2, Big Data Capabilities on GCP

Chapter 2: Big Data Capabilities on GCP

One of the most common experiences when people start using Google Cloud Platform (GCP) is getting lost, because there are so many products and services. GCP offers a very broad range of services across multiple disciplines, for example, application development, microservices, security, AI, and, of course, big data. But even within the big data products, there are multiple options to choose from.

As an analogy, GCP is like a supermarket. A supermarket has almost everything you need to support your daily life. For example, if you plan to cook pasta and go to a supermarket to buy the ingredients, no one will tell you which ingredients to buy, and even if you know the ingredients, you will still be offered versions from different brands, at different prices, from different producers. If you fail to make the right decisions, you will end up cooking bad pasta. It's the same in GCP; you need to be able to choose the services yourself. It is also important to know how each service depends on the others, since it's impossible to build a solution using only a single service.

In this chapter, we will learn about the products and services of GCP. But instead of explaining each product one by one, which you can find in the public documentation, the focus of this chapter is to help you narrow down the options. We will map the services into categories and priorities. After finishing this chapter, you will know exactly which products you need to start your journey.

And on top of that, we will start using GCP! We will kick off our hands-on practice with the fundamentals of the GCP console.

By the end of the chapter, you will understand the positioning of GCP products related to big data, be able to start using the GCP console, and be able to plan what services you should focus on as a data engineer.

Here is the list of topics we will discuss in this chapter:

Understanding what the cloud is
Getting started with Google Cloud Platform
A quick overview of GCP services for data engineering

Technical requirements

In this chapter's exercise, we will start using the GCP console, Cloud Shell, and Cloud Editor. All of the tools can be opened using any internet browser.

To use the GCP console, we need to register with a Google account (Gmail). Registration requires a payment method, so please check that you have a valid payment method available so that you can register successfully.

Understanding what the cloud is

Renting someone else's server: this is my favorite definition of the cloud, because it is very simple and to the point. As long as you don't need to buy your own machine to store and process data, you are using the cloud.

But increasingly, as leading cloud providers such as Google Cloud have gained traction and technological maturity, the term has come to represent a set of architectures, managed services, and highly scalable environments that define how we build solutions. For data engineering, it means building data products using collections of services and APIs, and trusting the cloud provider's underlying infrastructure one hundred percent.

The difference between the cloud and non-cloud era

If we want to compare the cloud with the non-cloud era from a data engineering perspective, we will find that almost all the data engineering principles are the same. But from a technology perspective, there are a lot of differences in the cloud.

In the cloud, computation and storage are configured as services, which means that as engineers, we can control them using code and make them available on demand.  
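
To make this concrete, here is a minimal sketch of what controlling computation and storage with code looks like, using the gcloud, gsutil, and bq command-line tools available in Cloud Shell. The project, bucket, and dataset names are hypothetical placeholders, not required setup steps:

$ gcloud config set project my-sample-project      # point the CLI at a project
$ gsutil mb -l us-central1 gs://my-sample-bucket   # create a storage bucket on demand
$ bq mk --dataset my_sample_dataset                # create a BigQuery dataset on demand

Each resource exists seconds after the command returns, and can be deleted just as quickly.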

If you started your data journey directly with cloud services, it may be difficult to imagine what the non-cloud world looks like. As an illustrative example of a non-cloud experience, I once helped a company implement a data warehouse before the cloud era. The customer had data in Oracle databases and wanted to store it in an on-premises data warehouse. How long do you think it took before I could store the first table in that data warehouse? Four months!

We needed to wait for the physical server to be shipped from the manufacturer's continent to ours. And after the server arrived, we waited for the network engineers to plug in the cables, routers, and power, and for the software to be installed, all of which again took months before the data engineers could access the system and store our first table in it.

How long does it take to store a file as a table in BigQuery? Less than 1 minute.
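
As a sketch of what that minute looks like, a local CSV file can be loaded into a BigQuery table with a single bq command (the dataset, table, and file names here are hypothetical):

$ bq load --autodetect --source_format=CSV my_dataset.sales ./sales.csv   # load CSV into a table

The --autodetect flag asks BigQuery to infer the schema from the file, so no upfront schema design is needed for a quick test.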

The on-demand nature of the cloud

Another important aspect of the cloud is the ability to turn services on and off, and to create and delete them, based on your actual usage.
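
As an illustration of this on-demand nature, a Dataproc cluster (covered in Chapter 5) can be created just before a job runs and deleted as soon as the job finishes, so you pay only while it exists. The cluster name and region below are hypothetical placeholders:

$ gcloud dataproc clusters create my-ephemeral-cluster --region=us-central1
# ... submit Spark jobs while the cluster exists ...
$ gcloud dataproc clusters delete my-ephemeral-cluster --region=us-central1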