With this book, you'll understand how the highly scalable Google Cloud Platform (GCP) enables data engineers to create end-to-end data pipelines, from storing and processing data and orchestrating workflows to presenting data through visualization dashboards.
Starting with a quick overview of the fundamental concepts of data engineering, you'll learn the various responsibilities of a data engineer and how GCP plays a vital role in fulfilling those responsibilities. As you progress through the chapters, you'll be able to leverage GCP products to build a sample data warehouse using Cloud Storage and BigQuery and a data lake using Dataproc. The book gradually takes you through operations such as data ingestion, data cleansing, transformation, and integrating data with other sources. You'll learn how to design IAM for data governance, deploy ML pipelines with Vertex AI, leverage pre-built GCP models as a service, and visualize data with Google Data Studio to build compelling reports. Finally, you'll find tips on how to boost your career as a data engineer, take the Professional Data Engineer certification exam, and get ready to become an expert in data engineering with GCP.
By the end of this data engineering book, you'll have developed the skills to perform core data engineering tasks and build efficient ETL data pipelines with GCP.
A practical guide to operationalizing scalable data analytics systems on GCP
Adi Wijaya
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Devika Battike
Senior Editor: David Sugarman
Content Development Editor: Sean Lobo
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Hemangini Bari
Production Designer: Jyoti Chauhan
Marketing Coordinator: Priyanka Mhatre
First published: March 2022
Production reference: 2100322
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80056-132-8
www.packt.com
Adi Wijaya is a strategic cloud data engineer at Google. He holds a bachelor's degree in computer science from Binus University and co-founded DataLabs in Indonesia. Currently, he dedicates himself to big data and analytics and has spent a good chunk of his career helping global companies in different industries.
Fajar Muharandy has over 15 years' experience in the data and analytics space. Throughout his career, he has been involved in some of the largest data warehouse and big data platform designs and implementations. Fajar is a strong believer that every data platform implementation should always start with business questions in mind, and that all stakeholders should strive toward defining the right data and technology to achieve the common goal of getting the answers to those business questions. Aside from his professional career, Fajar is also the co-founder of the Data Science Indonesia community, a community of data science enthusiasts in Indonesia who believe that data should be the foundation to push and drive actions for the greater good.
There is too much information; there are too many plans; it's complicated. We live in a world where too much information is just as problematic as too little, and I'm aware that this applies when people want to start doing data engineering in the cloud, specifically on Google Cloud Platform (GCP), the focus of this book.
When people want to embark on a career in data, there are so many different roles whose definitions sometimes vary from one company to the next.
When someone chooses to be a data engineer, there are a great number of technology options: cloud versus non-cloud, big data databases versus traditional ones, self-managed versus managed services, and many more.
When they decide to use the cloud with GCP, the public documentation presents a wide variety of product options and tutorials.
Instead of adding further dimensions to data engineering and the GCP products, the main goal of this book is to help you narrow down the information: to distill the most important concepts and components from the vast amount of material available on the internet. The guidance and exercises are based on the author's experience in the field and will give you a clear focus. By reading the book and following the exercises, you will learn the most relevant and direct path to start and boost your career in data engineering using GCP.
This book is intended for anyone involved in the data and analytics space, including IT developers, data analysts, data scientists, or any other relevant roles where an individual wants to gain a jump start in the data engineering field.
This book is also intended for data engineers who want to start using GCP, prepare for certification, and work through practical examples based on real-world scenarios.
Finally, this book will be of interest to anyone who wants to understand the thought process, get practical guidance, and follow a clear path through the technology components in order to get started, achieve the certification, and gain a practical perspective on data engineering with GCP.
This book is divided into three sections comprising 12 chapters. Each section is a collection of independent chapters that share a single objective:
Chapter 1, Fundamentals of Data Engineering, explains the role of data engineers and how data engineering relates to GCP.
Chapter 2, Big Data Capabilities on GCP, introduces the relevant GCP services related to data engineering.
Chapter 3, Building a Data Warehouse in BigQuery, covers the data warehouse concept using BigQuery.
Chapter 4, Building Orchestration for Batch Data Loading Using Cloud Composer, explains data orchestration using Cloud Composer.
Chapter 5, Building a Data Lake Using Dataproc, details the data lake concept with Hadoop using Dataproc.
Chapter 6, Processing Streaming Data with Pub/Sub and Dataflow, explains the concept of streaming data using Pub/Sub and Dataflow.
Chapter 7, Visualizing Data for Making Data-Driven Decisions with Data Studio, covers how to visualize data from BigQuery as charts in Data Studio.
Chapter 8, Building Machine Learning Solutions on Google Cloud Platform, sets out the concept of MLOps using Vertex AI.
Chapter 9, User and Project Management in GCP, explains the fundamentals of GCP Identity and Access Management and project structures.
Chapter 10, Cost Strategy in GCP, covers how to estimate the cost of an overall data solution in GCP.
Chapter 11, CI/CD on Google Cloud Platform for Data Engineers, explains the concept of CI/CD and its relevance to data engineers.
Chapter 12, Boosting Your Confidence as a Data Engineer, prepares you for the GCP certification and offers some final thoughts in terms of summarizing what's been learned in this book.
To successfully follow the examples in this book, you need a GCP account and project. If, at this point, you don't have a GCP account and project, don't worry. We will cover that as part of the exercises in this book.
Occasionally, we will use the free tier from GCP for practice, but be aware that some products might not have free tiers. Notes will be provided if this is the case.
All the exercises in this book can be completed without any additional software installation. The exercises will be done in the GCP console that you can open from any operating system using your favorite browser.
You should be familiar with basic programming languages. In this book, I will focus on utilizing Python and the Linux command line.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
This book is not positioned to replace GCP public documentation. Hence, comprehensive information on every single feature of GCP services might not be available in this book. We also won't use all the GCP services that are available. For such information, you can always check the public documentation.
Remember that the main goal of this book is to help you narrow down information. Use this book as your step-by-step guide to build solutions to common challenges facing data engineers. Follow the patterns from the exercises, the relationship between concepts, important GCP services, and best practices. Always use the hands-on exercises so you can experience working with GCP.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform. If there's an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781800561328_ColorImages.pdf.
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in the text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."
A block of code is set as follows:
html, body, #map {
height: 100%;
margin: 0;
padding: 0
}
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)
Any command-line input or output is written as follows:
$ mkdir css
$ cd css
Bold: Indicates a new term, an important word, or words that you see on screen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Select System info from the Administration panel."
Tips or Important Notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you’ve read Data Engineering with Google Cloud Platform, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
This part will talk about the purpose, value, and concepts of big data and cloud computing and how GCP products are relevant to data engineering. You will learn about a data engineer's core responsibilities, how they differ from those of a data scientist, and how to facilitate the flow of data through an organization to derive insights.
This section comprises the following chapters:
Chapter 1, Fundamentals of Data Engineering
Chapter 2, Big Data Capabilities on GCP

One of the most common scenarios when people start using Google Cloud Platform (GCP) is getting lost because there are too many products and services. GCP offers a very broad range of services for multiple disciplines, for example, application development, microservices, security, AI, and, of course, big data. But even among the big data products, there are multiple options to choose from.
As an analogy, GCP is like a supermarket. A supermarket has almost everything you need to support your daily life. For example, if you plan to cook pasta and go to a supermarket to buy the ingredients, no one will tell you which ingredients to buy, and even if you know the ingredients, you will still be offered versions from different brands, at different prices, and from different producers. If you fail to make the right decisions, you will end up cooking bad pasta. It's the same with GCP: you need to choose the right services yourself. It's also important to know how each service depends on the others, since it's impossible to build a solution using only a single service.
In this chapter, we will learn about the products and services of GCP. But instead of explaining each product one by one, which you can read about in the public documentation, the focus of this chapter is to help you narrow down the options. We will map the services into categories and priorities. After finishing this chapter, you will know exactly which products you need to start your journey.
And on top of that, we will start using GCP! We will kick off our hands-on practice with the fundamentals of the GCP console.
By the end of the chapter, you will understand the positioning of GCP products related to big data, be able to start using the GCP console, and be able to plan what services you should focus on as a data engineer.
Here is the list of topics we will discuss in this chapter:
Understanding what the cloud is
Getting started with Google Cloud Platform
A quick overview of GCP services for data engineering

In this chapter's exercise, we will start using the GCP console, Cloud Shell, and Cloud Editor. All of these tools can be opened using any internet browser.
To use the GCP console, we need to register with a Google account (Gmail). Registration requires a payment method, so please check the available payment methods to make sure you can register successfully.
Renting someone else's server: this is my favorite definition of the cloud because it is simple and to the point. As long as you don't need to buy your own machine to store and process data, you are using the cloud.
But increasingly, as leading cloud providers such as Google Cloud have gained traction and technological maturity, the term has come to represent sets of architectures, managed services, and highly scalable environments that define how we build solutions. For data engineering, that means building data products from collections of services and APIs, while trusting the cloud provider's underlying infrastructure one hundred percent.
If we want to compare the cloud with the non-cloud era from a data engineering perspective, we will find that almost all the data engineering principles are the same. But from a technology perspective, there are a lot of differences in the cloud.
In the cloud, computation and storage are configured as services, which means that as engineers, we can control them using code and make them available on demand.
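To make this concrete, here is a minimal sketch of provisioning storage on demand with a few lines of Python. It assumes the google-cloud-storage client library is installed and authenticated; the project ID and bucket name are hypothetical placeholders:

from google.cloud import storage

# Provision a Cloud Storage bucket on demand; no hardware to order or install.
client = storage.Client(project="your-gcp-project")   # hypothetical project ID
bucket = client.create_bucket("your-example-bucket")  # hypothetical, globally unique bucket name
print(f"Bucket {bucket.name} is ready to use")

The same resource can be created, inspected, or removed entirely through code, which is what makes infrastructure in the cloud feel like just another part of your program.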
If you're starting your data journey directly with cloud services, it may be difficult to imagine what the non-cloud world looks like. As an illustration, I once helped a company implement a data warehouse before the cloud era. The customer had data in Oracle databases and wanted to store it in an on-premises data warehouse. How long do you think it took before I could store my first table in the data warehouse? Four months!
We needed to wait for the physical server to be shipped from the manufacturer's continent to ours. And after the server arrived, we waited for the network engineers to plug in the cables, routers, and power, and for all the software to be installed, which again took months before we, the data engineers, could access the software and store our first table in it.
How long does it take to store your file in a BigQuery table? Less than a minute.
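As a rough illustration (not one of this book's exercises), loading a file that already sits in Cloud Storage into BigQuery takes only a few lines of Python with the google-cloud-bigquery client library; the bucket, dataset, and table names below are hypothetical placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,  # let BigQuery infer the schema from the file
)
load_job = client.load_table_from_uri(
    "gs://your-example-bucket/data.csv",          # hypothetical source file
    "your-gcp-project.your_dataset.your_table",   # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete

For a small file, the whole job typically finishes in seconds rather than months.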
Another important aspect of the cloud is the ability to turn on, turn off, create, and delete services based on your actual usage.
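As a small, hypothetical sketch of that idea, deleting a resource you no longer need is a single call, so you stop paying for it immediately (the names are again placeholders):

from google.cloud import storage

# Remove the bucket created earlier once it is no longer needed.
# delete() assumes the bucket is already empty.
client = storage.Client(project="your-gcp-project")
client.get_bucket("your-example-bucket").delete()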
