Official Google Cloud Certified Professional Data Engineer

Study Guide

Dan Sullivan

Copyright © 2020 by John Wiley & Sons, Inc., Indianapolis, Indiana

Published simultaneously in Canada and in the United Kingdom

ISBN: 978-1-119-61843-0
ISBN: 978-1-119-61844-7 (ebk.)
ISBN: 978-1-119-61845-4 (ebk.)

Manufactured in the United States of America

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Web site may provide or recommendations it may make. Further, readers should be aware that Internet Web sites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services or to obtain technical support, please contact our Customer Care Department within the U.S. at (877) 762-2974, outside the U.S. at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Control Number: 2020936943

TRADEMARKS: Wiley, the Wiley logo, and the Sybex logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. Google Cloud and the Google Cloud logo are trademarks of Google, LLC. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

to Katherine

Acknowledgments

I have been fortunate to work again with professionals from Waterside Productions, Wiley, and Google to create this Study Guide.

Carole Jelen, vice president of Waterside Productions, and Jim Minatel, associate publisher at John Wiley & Sons, continue to lead the effort to create Google Cloud certification guides. It was a pleasure to work with Gary Schwartz, project editor, who managed the process that got us from outline to a finished manuscript. Thanks to Christine O’Connor, senior production editor, for making the last stages of book development go as smoothly as they did.

I was also fortunate to work with Valerie Parham-Thompson again. Valerie’s technical review improved the clarity and accuracy of this book tremendously.

Thank you to the Google Cloud subject-matter experts who reviewed and contributed to the material in this book:

Name

Title

Damon A. Runion

Technical Curriculum Developer, Data Engineering

Julianne Cuneo

Data Analytics Specialist, Google Cloud

Geoff McGill

Customer Engineer, Data Analytics

Susan Pierce

Solutions Manager, Smart Analytics and AI

Rachel Levy

Cloud Data Specialist Lead

Dustin Williams

Data Analytics Specialist, Google Cloud

Gbenga Awodokun

Customer Engineer, Data and Marketing Analytics

Dilraj Kaur

Big Data Specialist

Rebecca Ballough

Data Analytics Manager, Google Cloud

Robert Saxby

Staff Solutions Architect

Niel Markwick

Cloud Solutions Architect

Sharon Dashet

Big Data Product Specialist

Barry Searle

Solution Specialist, Cloud Data Management

Jignesh Mehta

Customer Engineer, Cloud Data Platform and Advanced Analytics

My sons James and Nicholas were my first readers, and they helped me to get the manuscript across the finish line.

This book is dedicated to Katherine, my wife and partner in so many adventures.

About the Author

Dan Sullivan is a principal engineer and software architect. He specializes in data science, machine learning, and cloud computing. Dan is the author of the Official Google Cloud Certified Professional Architect Study Guide (Sybex, 2019), Official Google Cloud Certified Associate Cloud Engineer Study Guide (Sybex, 2019), NoSQL for Mere Mortals (Addison-Wesley Professional, 2015), and several LinkedIn Learning courses on databases, data science, and machine learning. Dan has certifications from Google and AWS, along with a Ph.D. in genetics and computational biology from Virginia Tech.

About the Technical Editor

Valerie Parham-Thompson has experience with a variety of open source data storage technologies, including MySQL, MongoDB, and Cassandra, as well as a foundation in web development in software-as-a-service (SaaS) environments. Her work in both development and operations in startups and traditional enterprises has led to solid expertise in web-scale data storage and data delivery.

Valerie has spoken at technical conferences on topics such as database security, performance tuning, and container management. She also often speaks at local meetups and volunteer events.

Valerie holds a bachelor’s degree from the Kenan Flagler Business School at UNC-Chapel Hill, has certifications in MySQL and MongoDB, and is a Google Certified Professional Cloud Architect. She currently works in the Open Source Database Cluster at Pythian, headquartered in Ottawa, Ontario.

Follow Valerie’s contributions to technical blogs on Twitter at dataindataout.

CONTENTS

Cover

Acknowledgments

About the Author

About the Technical Editor

Introduction

Assessment Test

Answers to Assessment Test

Chapter 1 Selecting Appropriate Storage Technologies

From Business Requirements to Storage Systems

Technical Aspects of Data: Volume, Velocity, Variation, Access, and Security

Types of Structure: Structured, Semi-Structured, and Unstructured

Schema Design Considerations

Exam Essentials

Review Questions

Chapter 2 Building and Operationalizing Storage Systems

Cloud SQL

Cloud Spanner

Cloud Bigtable

Cloud Firestore

BigQuery

Cloud Memorystore

Cloud Storage

Unmanaged Databases

Exam Essentials

Review Questions

Chapter 3 Designing Data Pipelines

Overview of Data Pipelines

GCP Pipeline Components

Migrating Hadoop and Spark to GCP

Exam Essentials

Review Questions

Chapter 4 Designing a Data Processing Solution

Designing Infrastructure

Designing for Distributed Processing

Migrating a Data Warehouse

Exam Essentials

Review Questions

Chapter 5 Building and Operationalizing Processing Infrastructure

Provisioning and Adjusting Processing Resources

Monitoring Processing Resources

Exam Essentials

Review Questions

Chapter 6 Designing for Security and Compliance

Identity and Access Management with Cloud IAM

Using IAM with Storage and Processing Services

Data Security

Ensuring Privacy with the Data Loss Prevention API

Legal Compliance

Exam Essentials

Review Questions

Chapter 7 Designing Databases for Reliability, Scalability, and Availability

Designing Cloud Bigtable Databases for Scalability and Reliability

Designing Cloud Spanner Databases for Scalability and Reliability

Designing BigQuery Databases for Data Warehousing

Exam Essentials

Review Questions

Chapter 8 Understanding Data Operations for Flexibility and Portability

Cataloging and Discovery with Data Catalog

Data Preprocessing with Dataprep

Visualizing with Data Studio

Exploring Data with Cloud Datalab

Orchestrating Workflows with Cloud Composer

Exam Essentials

Review Questions

Chapter 9 Deploying Machine Learning Pipelines

Structure of ML Pipelines

GCP Options for Deploying Machine Learning Pipeline

Exam Essentials

Review Questions

Chapter 10 Choosing Training and Serving Infrastructure

Hardware Accelerators

Distributed and Single Machine Infrastructure

Edge Computing with GCP

Exam Essentials

Review Questions

Chapter 11 Measuring, Monitoring, and Troubleshooting Machine Learning Models

Three Types of Machine Learning Algorithms

Deep Learning

Engineering Machine Learning Models

Common Sources of Error in Machine Learning Models

Exam Essentials

Review Questions

Chapter 12 Leveraging Prebuilt Models as a Service

Sight

Conversation

Language

Structured Data

Exam Essentials

Review Questions

Appendix Answers to Review Questions

Chapter 1: Selecting Appropriate Storage Technologies

Chapter 2: Building and Operationalizing Storage Systems

Chapter 3: Designing Data Pipelines

Chapter 4: Designing a Data Processing Solution

Chapter 5: Building and Operationalizing Processing Infrastructure

Chapter 6: Designing for Security and Compliance

Chapter 7: Designing Databases for Reliability, Scalability, and Availability

Chapter 8: Understanding Data Operations for Flexibility and Portability

Chapter 9: Deploying Machine Learning Pipelines

Chapter 10: Choosing Training and Serving Infrastructure

Chapter 11: Measuring, Monitoring, and Troubleshooting Machine Learning Models

Chapter 12: Leveraging Prebuilt Models as a Service

Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1

Table 1.2

Table 1.3

Table 1.4

Table 1.5

Table 1.6

Table 1.7

Chapter 9

Table 9.1

Chapter 11

Table 11.1

Table 11.2

Table 11.3


Introduction

The Google Cloud Certified Professional Data Engineer exam tests your ability to design, deploy, monitor, and adapt services and infrastructure for data-driven decision-making. The four primary areas of focus in this exam are as follows:

Designing data processing systems

Building and operationalizing data processing systems

Operationalizing machine learning models

Ensuring solution quality

Designing data processing systems involves selecting storage technologies, including relational, analytical, document, and wide-column databases, such as Cloud SQL, BigQuery, Cloud Firestore, and Cloud Bigtable, respectively. You will also be tested on designing pipelines using services such as Cloud Dataflow, Cloud Dataproc, Cloud Pub/Sub, and Cloud Composer. The exam will test your ability to design distributed systems that may include hybrid clouds, message brokers, middleware, and serverless functions. Expect to see questions on migrating data warehouses from on-premises infrastructure to the cloud.

The building and operationalizing data processing systems portion of the exam will test your ability to support storage systems, pipelines, and infrastructure in a production environment. This includes using managed services for storage as well as batch and stream processing. It also covers common operations such as data ingestion, data cleansing, transformation, and integrating data with other sources. As a data engineer, you are expected to understand how to provision resources, monitor pipelines, and test distributed systems.

Machine learning is an increasingly important topic. This exam will test your knowledge of prebuilt machine learning models available in GCP as well as the ability to deploy machine learning pipelines with custom-built models. You can expect to see questions about machine learning service APIs and data ingestion, as well as training and evaluating models. The exam uses machine learning terminology, so it is important to understand the nomenclature, especially terms such as model, supervised and unsupervised learning, regression, classification, and evaluation metrics.

The fourth domain of knowledge covered in the exam is ensuring solution quality, which includes security, scalability, efficiency, and reliability. Expect questions on ensuring privacy with data loss prevention techniques, encryption, and identity and access management, as well as questions about compliance with major regulations. The exam also tests a data engineer's ability to monitor pipelines with Stackdriver, improve data models, and scale resources as needed. You may also encounter questions that assess your ability to design portable solutions and plan for future business requirements.

In your day-to-day experience with GCP, you may spend more time working on some data engineering tasks than others. This is expected. It does, however, mean that you should be aware of the exam topics about which you may be less familiar. Machine learning questions can be especially challenging to data engineers who work primarily on ingestion and storage systems. Similarly, those who spend a majority of their time developing machine learning models may need to invest more time studying schema modeling for NoSQL databases and designing fault-tolerant distributed systems.

What Does This Book Cover?

This book covers the topics outlined in the Google Cloud Professional Data Engineer exam guide available here:

cloud.google.com/certification/guides/data-engineer

Chapter 1: Selecting Appropriate Storage Technologies  This chapter covers selecting appropriate storage technologies, including mapping business requirements to storage systems; understanding the distinction between structured, semi-structured, and unstructured data models; and designing schemas for relational and NoSQL databases. By the end of the chapter, you should understand the various criteria that data engineers consider when choosing a storage technology.

Chapter 2: Building and Operationalizing Storage Systems  This chapter discusses how to deploy storage systems and perform data management operations, such as importing and exporting data, configuring access controls, and doing performance tuning. The services included in this chapter are as follows: Cloud SQL, Cloud Spanner, Cloud Bigtable, Cloud Firestore, BigQuery, Cloud Memorystore, and Cloud Storage. The chapter also includes a discussion of working with unmanaged databases, understanding storage costs and performance, and performing data lifecycle management.

Chapter 3: Designing Data Pipelines  This chapter describes high-level design patterns, along with some variations on those patterns, for data pipelines. It reviews how GCP services like Cloud Dataflow, Cloud Dataproc, Cloud Pub/Sub, and Cloud Composer are used to implement data pipelines. It also covers migrating data pipelines from an on-premises Hadoop cluster to GCP.

Chapter 4: Designing a Data Processing Solution  In this chapter, you learn about designing infrastructure for data engineering and machine learning, including how to do several tasks, such as choosing an appropriate compute service for your use case; designing for scalability, reliability, availability, and maintainability; using hybrid and edge computing architecture patterns and processing models; and migrating a data warehouse from on-premises data centers to GCP.

Chapter 5: Building and Operationalizing Processing Infrastructure  This chapter discusses managed processing resources, including those offered by App Engine, Cloud Functions, and Cloud Dataflow. The chapter also includes a discussion of how to use Stackdriver Metrics, Stackdriver Logging, and Stackdriver Trace to monitor processing infrastructure.

Chapter 6: Designing for Security and Compliance  This chapter introduces several key topics of security and compliance, including identity and access management, data security, encryption and key management, data loss prevention, and compliance.

Chapter 7: Designing Databases for Reliability, Scalability, and Availability  This chapter provides information on designing for reliability, scalability, and availability of three GCP databases: Cloud Bigtable, Cloud Spanner, and BigQuery. It also covers how to apply best practices for designing schemas, querying data, and taking advantage of the physical design properties of each database.

Chapter 8: Understanding Data Operations for Flexibility and Portability  This chapter describes how to use the Data Catalog, a metadata management service supporting the discovery and management of data in Google Cloud. It also introduces Cloud Dataprep, a preprocessing tool for transforming and enriching data, as well as Data Studio for visualizing data and Cloud Datalab for interactive exploration and scripting.

Chapter 9: Deploying Machine Learning Pipelines  Machine learning pipelines include several stages that begin with data ingestion and preparation and then perform data segregation followed by model training and evaluation. GCP provides multiple ways to implement machine learning pipelines. This chapter describes how to deploy ML pipelines using general-purpose computing resources, such as Compute Engine and Kubernetes Engine. Managed services, such as Cloud Dataflow and Cloud Dataproc, are also available, as well as specialized machine learning services, such as AI Platform, formerly known as Cloud ML.

Chapter 10: Choosing Training and Serving Infrastructure  This chapter focuses on choosing the appropriate training and serving infrastructure for your needs when serverless or specialized AI services are not a good fit for your requirements. It discusses distributed and single-machine infrastructure, the use of edge computing for serving machine learning models, and the use of hardware accelerators.

Chapter 11: Measuring, Monitoring, and Troubleshooting Machine Learning Models  This chapter focuses on key concepts in machine learning, including machine learning terminology and core concepts and common sources of error in machine learning. Machine learning is a broad discipline with many areas of specialization. This chapter provides you with a high-level overview to help you pass the Professional Data Engineer exam, but it is not a substitute for learning machine learning from resources designed for that purpose.

Chapter 12: Leveraging Prebuilt ML Models as a Service  This chapter describes Google Cloud Platform options for using pretrained machine learning models to help developers build and deploy intelligent services quickly. The services are broadly grouped into sight, conversation, language, and structured data. These services are available through APIs or through Cloud AutoML services.

Interactive Online Learning Environment and TestBank

Learning the material in the Official Google Cloud Certified Professional Data Engineer Study Guide is an important part of preparing for the Professional Data Engineer certification exam, but we also provide additional tools to help you prepare. The online TestBank will help you understand the types of questions that will appear on the certification exam.

The sample tests in the TestBank include all the questions in each chapter as well as the questions from the assessment test. In addition, there are two practice exams with 50 questions each. You can use these tests to evaluate your understanding and identify areas that may require additional study.

The flashcards in the TestBank will push the limits of what you should know for the certification exam. Over 100 questions are provided in digital format. Each flashcard has one question and one correct answer.

The online glossary is a searchable list of key terms introduced in this Study Guide that you should know for the Professional Data Engineer certification exam.

To start using these to study for the Google Cloud Certified Professional Data Engineer exam, go to www.wiley.com/go/sybextestprep and register your book to receive your unique PIN. Once you have the PIN, return to www.wiley.com/go/sybextestprep, find your book, and click Register, or log in and follow the link to register a new account or add this book to an existing account.

Additional Resources

People learn in different ways. For some, a book is an ideal way to study, whereas other learners may find video and audio resources a more efficient way to study. A combination of resources may be the best option for many of us. In addition to this Study Guide, here are some other resources that can help you prepare for the Google Cloud Professional Data Engineer exam:

The Professional Data Engineer Certification Exam Guide:

https://cloud.google.com/certification/guides/data-engineer/

Exam FAQs:

https://cloud.google.com/certification/faqs/

Google's Practice Exam:

https://cloud.google.com/certification/practice-exam/data-engineer

Google Cloud Platform documentation:

https://cloud.google.com/docs/

Coursera's on-demand courses in “Architecting with Google Cloud Platform Specialization” and “Data Engineering with Google Cloud” are both relevant to data engineering:

www.coursera.org/specializations/gcp-architecture

https://www.coursera.org/professional-certificates/gcp-data-engineering

QwikLabs Hands-on Labs:

https://google.qwiklabs.com/quests/25

Linux Academy Google Cloud Certified Professional Data Engineer video course:

https://linuxacademy.com/course/google-cloud-data-engineer/

The best way to prepare for the exam is to perform the tasks of a data engineer and work with the Google Cloud Platform.

 Exam objectives are subject to change at any time without prior notice and at Google’s sole discretion. Please visit the Google Cloud Professional Data Engineer website (https://cloud.google.com/certification/data-engineer) for the most current listing of exam objectives.

Objective Map

Objective

Chapter

Section 1: Designing data processing systems

1.1 Selecting the appropriate storage technologies

1

1.2 Designing data pipelines

2, 3

1.3 Designing a data processing solution

4

1.4 Migrating data warehousing and data processing

4

Section 2: Building and operationalizing data processing systems

2.1 Building and operationalizing storage systems

2

2.2 Building and operationalizing pipelines

3

2.3 Building and operationalizing infrastructure

5

Section 3: Operationalizing machine learning models

3.1 Leveraging prebuilt ML models as a service

12

3.2 Deploying an ML pipeline

9

3.3 Choosing the appropriate training and serving infrastructure

10

3.4 Measuring, monitoring, and troubleshooting machine learning models

11

Section 4: Ensuring solution quality

4.1 Designing for security and compliance

6

4.2 Ensuring scalability and efficiency

7

4.3 Ensuring reliability and fidelity

8

4.4 Ensuring flexibility and portability

8

Assessment Test

You are migrating your machine learning operations to GCP and want to take advantage of managed services. You have been managing a Spark cluster because you use the MLlib library extensively. Which GCP managed service would you use?

A. Cloud Dataprep
B. Cloud Dataproc
C. Cloud Dataflow
D. Cloud Pub/Sub

Your team is designing a database to store product catalog information. They have determined that you need to use a database that supports flexible schemas and transactions. What service would you expect to use?

A. Cloud SQL
B. BigQuery
C. Cloud Firestore
D. Cloud Storage

Your company has been losing market share because competitors are attracting your customers with a more personalized experience on their e-commerce platforms, including providing recommendations for products that might be of interest to them. The CEO has stated that your company will provide equivalent services within 90 days. What GCP service would you use to help meet this objective?

A. Cloud Bigtable
B. Cloud Storage
C. AI Platform
D. Cloud Datastore

The finance department at your company has been archiving data on premises. They no longer want to maintain a costly dedicated storage system. They would like to store up to 300 TB of data for 10 years. The data will likely not be accessed at all. They also want to minimize cost. What storage service would you recommend?

A. Cloud Storage multi-regional storage
B. Cloud Storage Nearline storage
C. Cloud Storage Coldline storage
D. Cloud Bigtable

You will be developing machine learning models using sensitive data. Your company has several policies regarding protecting sensitive data, including requiring enhanced security on virtual machines (VMs) processing sensitive data. Which GCP service would you look to for meeting those requirements?

A. Identity and access management (IAM)
B. Cloud Key Management Service
C. Cloud Identity
D. Shielded VMs

You have developed a machine learning algorithm for identifying objects in images. Your company has a mobile app that allows users to upload images and get back a list of identified objects. You need to implement the mechanism to detect when a new image is uploaded to Cloud Storage and invoke the model to perform the analysis. Which GCP service would you use for that?

A. Cloud Functions
B. Cloud Storage Nearline
C. Cloud Dataflow
D. Cloud Dataproc

An IoT system streams data to a Cloud Pub/Sub topic for ingestion, and the data is processed in a Cloud Dataflow pipeline before being written to Cloud Bigtable. Latency is increasing as more data is added, even though nodes are not at maximum utilization. What would you look for first as a possible cause of this problem?

A. Too many nodes in the cluster
B. A poorly designed row key
C. Too many column families
D. Too many indexes being updated during write operations

A health and wellness startup in Canada has been more successful than expected. Investors are pushing the founders to expand into new regions outside of North America. The CEO and CTO are discussing the possibility of expanding into Europe. The app offered by the startup collects personal information, storing some locally on the user’s device and some in the cloud. What regulation will the startup need to plan for before expanding into the European market?

A. HIPAA
B. PCI-DSS
C. GDPR
D. SOX

Your company has been collecting vehicle performance data for the past year and now has 500 TB of data. Analysts at the company want to analyze the data to understand performance differences better across classes of vehicles. The analysts are advanced SQL users, but not all have programming experience. They want to minimize administrative overhead by using a managed service, if possible. What service might you recommend for conducting preliminary analysis of the data?

A. Compute Engine
B. Kubernetes Engine
C. BigQuery
D. Cloud Functions

An airline is moving its luggage-tracking applications to Google Cloud. There are many requirements, including support for SQL and strong consistency. The database will be accessed by users in the United States, Europe, and Asia. The database will store approximately 50 TB in the first year and grow at approximately 10 percent a year after that. What managed database service would you recommend?

A. Cloud SQL
B. BigQuery
C. Cloud Spanner
D. Cloud Dataflow

You are using Cloud Firestore to store data about online game players’ state while in a game. The state information includes health score, a set of possessions, and a list of team members collaborating with the player. You have noticed that the size of the raw data in the database is approximately 2 TB, but the amount of space used by Cloud Firestore is almost 5 TB. What could be causing the need for so much more space?

A. The data model has been denormalized.
B. There are multiple indexes.
C. Nodes in the database cluster are misconfigured.
D. There are too many column families in use.

You have a BigQuery table with data about customer purchases, including the date of purchase, the type of product purchased, the product name, and several other descriptive attributes. There is approximately three years of data. You tend to query data by month and then by customer. You would like to minimize the amount of data scanned. How would you organize the table?

A. Partition by purchase date and cluster by customer
B. Partition by purchase date and cluster by product
C. Partition by customer and cluster by product
D. Partition by customer and cluster by purchase date

You are currently using Java to implement an ELT pipeline in Hadoop. You’d like to replace your Java programs with a managed service in GCP. Which would you use?

A. Data Studio
B. Cloud Dataflow
C. Cloud Bigtable
D. BigQuery

A group of attorneys has hired you to help them categorize over a million documents in an intellectual property case. The attorneys need to isolate documents that are relevant to a patent that the plaintiffs argue has been infringed. The attorneys have 50,000 labeled examples of documents, and when the model is evaluated on training data, it performs quite well. However, when evaluated on test data, it performs quite poorly. What would you try to improve the performance?

A. Perform feature engineering
B. Perform validation testing
C. Add more data
D. Regularization

Your company is migrating from an on-premises pipeline that uses Apache Kafka for ingesting data and MongoDB for storage. What two managed services would you recommend as replacements for these?

A. Cloud Dataflow and Cloud Bigtable
B. Cloud Dataprep and Cloud Pub/Sub
C. Cloud Pub/Sub and Cloud Firestore
D. Cloud Pub/Sub and BigQuery

A group of data scientists is using Hadoop to store and analyze IoT data. They have decided to use GCP because they are spending too much time managing the Hadoop cluster. They are particularly interested in using services that would allow them to port their models and machine learning workflows to other clouds. What service would you use as a replacement for their existing platform?

A. BigQuery
B. Cloud Storage
C. Cloud Dataproc
D. Cloud Spanner

You are analyzing several datasets and will likely use them to build regression models. You will receive additional datasets, so you’d like to have a workflow to transform the raw data into a form suitable for analysis. You’d also like to work with the data in an interactive manner using Python. What services would you use in GCP?

A. Cloud Dataflow and Data Studio
B. Cloud Dataflow and Cloud Datalab
C. Cloud Dataprep and Data Studio
D. Cloud Datalab and Data Studio

You have a large number of files that you would like to store for several years. The files will be accessed frequently by users around the world. You decide to store the data in multi-regional Cloud Storage. You want users to be able to view files and their metadata in a Cloud Storage bucket. What role would you assign to those users? (Assume you are practicing the principle of least privilege.)

A. roles/storage.objectCreator
B. roles/storage.objectViewer
C. roles/storage.admin
D. roles/storage.bucketList

You have built a deep learning neural network to perform multiclass classification. You find that the model is overfitting. Which of the following would not be used to reduce overfitting?

A. Dropout
B. L2 Regularization
C. L1 Regularization
D. Logistic regression

Your company would like to start experimenting with machine learning, but no one in the company is experienced with ML. Analysts in the marketing department have identified some data in their relational database that they think may be useful for training a model. What would you recommend that they try first to build proof-of-concept models?

A. AutoML Tables
B. Kubeflow
C. Cloud Firestore
D. Spark MLlib

You have several large deep learning networks that you have built using TensorFlow. The models use only standard TensorFlow components. You have been running the models on an n1-highcpu-64 VM, but the models are taking longer to train than you would like. What would you try first to accelerate the model training?

A. GPUs
B. TPUs
C. Shielded VMs
D. Preemptible VMs

Your company wants to build a data lake to store data in its raw form for extended periods of time. The data lake should provide access controls, virtually unlimited storage, and the lowest cost possible. Which GCP service would you suggest?

A. Cloud Bigtable
B. BigQuery
C. Cloud Storage
D. Cloud Spanner

Auditors have determined that your company's processes for storing, processing, and transmitting sensitive data are insufficient. They believe that additional measures must be taken to ensure that sensitive information, such as personally identifiable government-issued numbers, is not disclosed. They suggest masking or removing sensitive data before it is transmitted outside the company. What GCP service would you recommend?

A. Data loss prevention API
B. In-transit encryption
C. Storing sensitive information in Cloud Key Management
D. Cloud Dataflow

You are using Cloud Functions to start the processing of images as they are uploaded into Cloud Storage. In the past, there have been spikes in the number of images uploaded, and many instances of the Cloud Function were created at those times. What can you do to prevent too many instances from starting?

A. Use the --max-limit parameter when deploying the function.
B. Use the --max-instances parameter when deploying the function.
C. Configure the --max-instance parameter in the resource hierarchy.
D. Nothing. There is no option to limit the number of instances.

You have several analysis programs running in production. Sometimes they are failing, but there is no apparent pattern to the failures. You’d like to use a GCP service to record custom information from the programs so that you can better understand what is happening. Which service would you use?

A. Stackdriver Debugger
B. Stackdriver Logging
C. Stackdriver Monitoring
D. Stackdriver Trace

The CTO of your company is concerned about the rising costs of maintaining your company’s enterprise data warehouse. The current data warehouse runs in a PostgreSQL instance. You would like to migrate to GCP and use a managed service that reduces operational overhead and one that will scale to meet future needs of up to 3 PB. What service would you recommend?

A. Cloud SQL using PostgreSQL
B. BigQuery
C. Cloud Bigtable
D. Cloud Spanner

Answers to Assessment Test

B. Cloud Dataproc is a Hadoop and Spark managed service. Option A is incorrect; Cloud Dataprep is a service for preparing data for analysis. Option C is incorrect; Cloud Dataflow is an implementation of Apache Beam, a stream and batch processing service. Option D is incorrect; Cloud Pub/Sub is a messaging service that can buffer data in a topic until a service is ready to process the data.

C. Cloud Firestore is a managed document database that supports flexible schemas and transactions. Option A is incorrect; Cloud SQL does not support flexible schemas. Option B is incorrect; BigQuery is an analytical database, not a NoSQL database with a flexible schema. Option D is incorrect; Cloud Storage is an object storage system, not a NoSQL database.

C. The AI Platform is a managed service for machine learning, which is needed to provide recommendations. Options A and B are incorrect because, although they are useful for storing data, they do not provide managed machine learning services. Option D is incorrect; Cloud Datastore is a NoSQL database.

C. Cloud Storage Coldline is the lowest-cost option, and it is designed for data that is accessed less than once a year. Options A and B are incorrect because they cost more than Coldline storage. Option D is incorrect because Cloud Bigtable is a low-latency, wide-column database.

D. Shielded VMs are instances with additional security controls. Option A is incorrect; IAM is used for managing identities and authorizations. Option B is incorrect; the Cloud Key Management Service is a service for managing encryption keys. Option C is incorrect; Cloud Identity is used for authentication.

A. Cloud Functions is a managed serverless product that is able to respond to events in the cloud, such as creating a file in Cloud Storage. Option B is incorrect; Cloud Storage Nearline is a class of object storage. Option C is incorrect; Cloud Dataflow is a stream and batch processing service that does not respond to events. Option D is incorrect; Cloud Dataproc is a managed Hadoop and Spark service.
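To make this concrete, here is a minimal sketch of a Python background Cloud Function triggered by a Cloud Storage finalize event; the entry point name and the analysis call are hypothetical.

```python
# Hypothetical entry point for a background Cloud Function triggered by
# google.storage.object.finalize events on a Cloud Storage bucket.
def on_image_uploaded(event, context):
    """Runs each time a new object is created in the bucket."""
    bucket = event["bucket"]   # bucket that received the upload
    name = event["name"]       # object path within the bucket
    print(f"New image uploaded: gs://{bucket}/{name}")
    # ...invoke the object-identification model on the image here...
```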

B. A poorly designed row key could be causing hot spotting. Option A is incorrect; more nodes in a cluster will not increase latency. Option C is incorrect; the number of column families on its own would not lead to higher latency. Option D is incorrect; Bigtable does not have indexes.
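For intuition, the sketch below contrasts a timestamp-first row key, which concentrates writes on a single node, with a hash-prefixed key that spreads writes across the keyspace; the key formats are illustrative, not a prescription.

```python
import hashlib

def hot_key(device_id: str, ts: int) -> str:
    # Monotonically increasing keys sort new writes next to one another,
    # so a single Cloud Bigtable node absorbs all of the write load.
    return f"{ts}#{device_id}"

def distributed_key(device_id: str, ts: int) -> str:
    # A short hash prefix distributes consecutive writes across tablets
    # while keeping each device's rows contiguous and scannable.
    prefix = hashlib.md5(device_id.encode()).hexdigest()[:4]
    return f"{prefix}#{device_id}#{ts}"
```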

C. The General Data Protection Regulation (GDPR) is a European Union regulation protecting the personal information of persons in and citizens of the European Union. Option A is incorrect; HIPAA is a U.S. healthcare regulation. Option B is incorrect; PCI-DSS is a self-imposed global security standard by major brands in the credit card industry, not a government regulation. Although not necessarily law, the standard may apply to the startup in Europe if it accepts payment cards for brands that require PCI-DSS compliance. Option D is incorrect; SOX is a U.S. regulation that applies to all publicly traded companies in the United States, as well as to wholly owned subsidiaries and foreign companies that are publicly traded and do business in the United States. The company may be subject to that regulation already, and expanding to Europe will not change its status.

C. BigQuery is an analytical database that supports SQL. Options A and B are incorrect because, although they could be used for ad hoc analysis, doing so would require more administrative overhead. Option D is incorrect; Cloud Functions is intended for running short programs in response to events in GCP.

C. Cloud Spanner is a globally scalable, strongly consistent relational database that can be queried using SQL. Option A is incorrect because Cloud SQL does not scale globally the way Cloud Spanner does, and it does not support storing 50 TB of data. Option B is incorrect; the requirements call for a transaction processing system, and BigQuery is designed for analytics and data warehousing. Option D is incorrect; Cloud Dataflow is a stream and batch processing service.

B. Cloud Firestore stores data redundantly when multiple indexes are used, so having more indexes will lead to greater storage sizes. Option A is incorrect; Cloud Firestore is a NoSQL document database that supports a denormalized data model without using excessive storage. Option C is incorrect; you do not configure nodes in Cloud Firestore. Option D is incorrect; column families are not used with document databases such as Cloud Firestore.

A. Partitioning by purchase date will keep all data for a day in a single partition. Clustering by customer will order the data in a partition by customer. This strategy will minimize the amount of data that needs to be scanned in order to answer a query by purchase date and customer. Option B is incorrect; clustering by product does not help reduce the amount of data scanned for date and customer-based queries. Option C is incorrect because partitioning by customer is not helpful in reducing the amount of data scanned. Option D is incorrect because partitioning by customer would spread data from one date over many partitions, and that would lead to scanning more data than partitioning by purchase date.
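As a sketch of this table design using the google-cloud-bigquery Python client (the project, dataset, and column names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
schema = [
    bigquery.SchemaField("purchase_date", "DATE"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("product_name", "STRING"),
]
table = bigquery.Table("my-project.sales.purchases", schema=schema)

# Partition by day on the purchase date, then cluster rows within each
# partition by customer to minimize the bytes scanned per query.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="purchase_date",
)
table.clustering_fields = ["customer_id"]
client.create_table(table)
```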

B. Cloud Dataflow is a stream and batch processing managed service that is a good replacement for Java ELT programs. Option A is incorrect; Data Studio is a reporting tool. Option C is incorrect; Cloud Bigtable is a NoSQL, wide-column database. Option D is incorrect; BigQuery is an analytical database.

D. This is a case of the model overfitting the training data. Regularization is a set of methods used to reduce the risk of overfitting. Option A is incorrect; feature engineering could be used to create new features if the existing set of features was not sufficient, but that is not a problem in this case. Option B is incorrect; validation testing will not improve the quality of the model, but it will measure the quality. Option C is incorrect; the existing dataset has a sufficient number of training instances.

C. Cloud Pub/Sub is a good replacement for Kafka, and Cloud Firestore is a good replacement for MongoDB, which is another document database. Option A is incorrect; Cloud Dataflow is for stream and batch processing, not ingestion. Option B is incorrect; there is no database in that option. Option D is incorrect; BigQuery is an analytical database and not a good replacement for a document database such as MongoDB.

C. Cloud Dataproc is a managed Hadoop and Spark service; Spark has a machine learning library called MLlib, and Spark is an open source platform that can run in other clouds. Option A is incorrect; BigQuery is a managed data warehouse and analytical database that is not available in other clouds. Option B is incorrect; Cloud Storage is used for unstructured data and not a substitute for a Hadoop/Spark platform. Option D is incorrect; Cloud Spanner is used for global transaction-processing systems, not large-scale analytics and machine learning.

B. Cloud Dataflow is well suited to transforming batch data, and Cloud Datalab is a Jupyter Notebook managed service, which is useful for ad hoc analysis using Python. Options A, C, and D are incorrect; Data Studio is a reporting tool, and reporting is not needed in this use case.

B. The roles/storage.objectViewer role allows users to view objects and list metadata. Option A is incorrect; roles/storage.objectCreator allows a user to create an object only. Option C is incorrect; the roles/storage.admin role gives a user full control over buckets and objects, which is more privilege than needed. Option D is incorrect; there is no such role as roles/storage.bucketList.
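To make the grant concrete, here is a minimal sketch using the google-cloud-storage Python client; the bucket name and member are placeholders.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-bucket")

# Grant view-only access to objects and their metadata in the bucket.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectViewer",
    "members": {"user:analyst@example.com"},
})
bucket.set_iam_policy(policy)
```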

D. Logistic regression is a binary classifier algorithm. Options A, B, and C are all regularization techniques.
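For reference, dropout and L1/L2 penalties appear in a Keras network roughly as in this minimal sketch; the layer sizes, rates, and penalty weights are arbitrary.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(0.01)),  # L2 penalty
    tf.keras.layers.Dropout(0.5),  # randomly zero half the activations
    tf.keras.layers.Dense(
        32, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l1(0.01)),  # L1 penalty
    tf.keras.layers.Dense(10, activation="softmax"),  # multiclass output
])
```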

A. AutoML Tables is a service for generating machine learning models from structured data. Option B is incorrect; Kubeflow is an orchestration platform for running machine learning workloads in Kubernetes, which is more than is needed for this use case. Option C is incorrect; Cloud Firestore is a document database, not a machine learning service. Option D is incorrect because Spark MLlib requires more knowledge of machine learning than AutoML Tables, and therefore it is not as good an option for this use case.

B. TPUs are the correct accelerator because they are designed specifically to accelerate TensorFlow models. Option A is incorrect because, although GPUs would accelerate the model training, GPUs are not optimized for the low-precision matrix math that is performed when training deep learning networks. Option C is incorrect; shielded VMs have additional security controls, but they do not accelerate model training. Option D is incorrect; preemptible machines cost less than non-preemptible machines, but they do not provide acceleration.

C. Cloud Storage is an object storage system that meets all of the requirements. Option A is incorrect; Cloud Bigtable is a wide-column database. Option B is incorrect; BigQuery is an analytical database. Option D is incorrect; Cloud Spanner is a horizontally scalable relational database.

A. A data loss prevention API can be used to remove many forms of sensitive data, such as government identifiers. Option B is incorrect; encryption can help keep data from being read, but it does not remove or mask sensitive data. Option C is incorrect; Cloud Key Management is a service for storing and managing encryption keys. Option D is incorrect; Cloud Dataflow is a batch and stream processing service.
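As an illustration of the API, this sketch replaces a U.S. Social Security number with its info type using the google-cloud-dlp Python client; the project ID and sample text are placeholders.

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
response = client.deidentify_content(
    request={
        "parent": "projects/my-project",
        "inspect_config": {
            "info_types": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
        },
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [{
                    "primitive_transformation": {
                        "replace_with_info_type_config": {}
                    },
                }],
            },
        },
        "item": {"value": "Customer SSN: 123-45-6789"},
    }
)
# Prints "Customer SSN: [US_SOCIAL_SECURITY_NUMBER]"
print(response.item.value)
```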

B. The --max-instances parameter limits the number of concurrently executing function instances. Option A is incorrect; --max-limit is not a parameter used with function deployments. Option C is incorrect; there is no --max-instance parameter to set in the resource hierarchy. Option D is incorrect; there is a way to specify a limit using the --max-instances parameter.

B. Stackdriver Logging is used to collect semi-structured data about events. Option A is incorrect; Stackdriver Debugger is used to inspect the state of running code. Option C is incorrect because Stackdriver Monitoring collects performance metrics, not custom data. Option D is incorrect; Stackdriver Trace is used to collect information about the time required to execute functions in a call stack.
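A minimal sketch of writing structured custom log entries with the google-cloud-logging Python client follows; the logger name and payload fields are hypothetical.

```python
from google.cloud import logging

client = logging.Client()
logger = client.logger("analysis-jobs")

# Structured payloads make intermittent failures easy to filter,
# correlate, and aggregate in the logging console.
logger.log_struct({
    "job_id": "batch-17",
    "status": "failed",
    "rows_processed": 10432,
})
```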

B. BigQuery is a managed service that is well suited to data warehousing, and it can scale to petabytes of storage. Option A is incorrect; Cloud SQL will not scale to meet future needs. Option C is incorrect; Bigtable is a NoSQL, wide-column database, which is not suitable for use with a data warehouse design that uses a relational data model. Option D is incorrect; Cloud Spanner is a transactional, scalable relational database.

Chapter 1 Selecting Appropriate Storage Technologies

Google Cloud Professional Data Engineer Exam objectives covered in this chapter include the following:

Designing data processing systems

✔ 1.1 Selecting the appropriate storage technologies

Mapping storage systems to business requirements

Data modeling

Tradeoffs involving latency, throughput, transactions

Distributed systems

Schema design

Data engineers choose how to store data for many different situations. Sometimes data is written to a temporary staging area, where it stays only seconds or less before it is read by an application and deleted. In other cases, data engineers arrange long-term archival storage for data that needs to be retained for years. Data engineers are increasingly called on to work with data that streams into storage constantly and in high volumes. Internet of Things (IoT) devices are an example of a source of streaming data.

Another common use case is storing large volumes of data for batch processing, including using data to train machine learning models. Data engineers also consider the range of variety in the structure of data. Some data, like the kind found in online transaction processing, is highly structured and varies little from one datum to the next. Other data, like product descriptions in a product catalog, can have a varying set of attributes. Data engineers consider these and other factors when choosing a storage technology.

This chapter covers objective 1.1 of the Google Cloud Professional Data Engineer exam—Selecting appropriate storage technologies. In this chapter, you will learn about the following:

The business aspects of choosing a storage system

The technical aspects of choosing a storage system

The distinction between structured, semi-structured, and unstructured data models

Designing schemas for relational and NoSQL databases

By the end of this chapter, you should understand the various criteria data engineers consider when choosing a storage technology. In Chapter 2, “Building and Operationalizing Storage Systems,” we will delve into the details of Google Cloud storage services.

From Business Requirements to Storage Systems

Business requirements are the starting point for choosing a data storage system. Data engineers will use different types of storage systems for different purposes. The specific storage system you should choose is determined, in large part, by the stage of the data lifecycle for which the storage system is used.

The data lifecycle consists of four stages:

Ingest

Store

Process and analyze

Explore and visualize

Ingestion is the first stage in the data lifecycle, and it entails acquiring data and bringing data into the Google Cloud Platform (GCP). The storage stage is about persisting data to a storage system from which it can be accessed for later stages of the data lifecycle. The process and analyze stage begins with transforming data into a usable format for analysis applications. Explore and visualize is the final stage, in which insights are derived from analysis and presented in tables, charts, and other visualizations for use by others.

Ingest

The three broad ingestion modes with which data engineers typically work are as follows:

Application data

Streaming data

Batch data

Application Data

Application data is generated by applications, including mobile apps, and pushed to backend services. This data includes user-generated data, like a name and shipping address collected as part of a sales transaction. It also includes data generated by the application, such as log data. Event data, like clickstream data, is also a type of application-generated data. The volume of this kind of data depends on the number of users of the application, the types of data the application generates, and the duration of time the application is in use. The size of application data that is sent in a single operation can vary widely. A clickstream event may have less than 1 KB of data, whereas an image upload could be multiple megabytes. Examples of application data include the following:

Transactions from an online retail application

Clickstream data from users reading articles on a news site

Log data from a server running computer-aided design software

User registration data from an online service

Application data can be ingested by services running in Compute Engine, Kubernetes Engine, or App Engine, for example. Application data can also be written to Stackdriver Logging or one of the managed databases, such as Cloud SQL or Cloud Datastore.

Streaming Data

Streaming data is a set of data that is typically sent in small messages that are transmitted continuously from the data source. Streaming data may be sensor data, which is generated at regular intervals, or event data, which is generated in response to a particular event. Examples of streaming data include the following:

Virtual machine monitoring data, such as CPU utilization rates and memory consumption data

An IoT device that sends temperature, humidity, and pressure data every minute

A customer adding an item to an online shopping cart, which then generates an event with data about the customer and the item

Streaming data often includes a timestamp indicating the time that the data was generated. This is often called the event time. Some applications will also track the time that data arrives at the beginning of the ingestion pipeline. This is known as the process time. Time-series data may require some additional processing early in the ingestion process. If a stream of data needs to be in time order for processing, then late arriving data will need to be inserted in the correct position in the stream. This can require buffering of data for a short period of time in case the data arrives out of order. Of course, there is a maximum amount of time to wait before processing data. These and other issues related to processing streaming data are discussed in Chapter 4, “Designing a Data Processing Solution.”
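To make the buffering idea concrete, the following is a minimal sketch, assuming a fixed maximum wait window: events are held in a min-heap keyed by event time and released in order once they are older than the window.

```python
import heapq
import itertools
import time

MAX_WAIT_SECONDS = 60          # assumed maximum time to wait for late data
_buffer = []                   # min-heap ordered by event time
_tiebreak = itertools.count()  # keeps heap comparisons away from payloads

def ingest(event_time, payload):
    """Buffer an event; late arrivals sort into their correct position."""
    process_time = time.time()  # when the event reached the pipeline
    heapq.heappush(_buffer, (event_time, next(_tiebreak), process_time, payload))

def release_ready(now):
    """Emit events in event-time order once the wait window has passed."""
    ready = []
    while _buffer and _buffer[0][0] <= now - MAX_WAIT_SECONDS:
        ready.append(heapq.heappop(_buffer))
    return ready
```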

Streaming data is well suited for Cloud Pub/Sub ingestion, which can buffer data while applications process it. During spikes in data ingestion, when application instances cannot keep up with the rate at which data is arriving, the data can be preserved in a Cloud Pub/Sub topic and processed later after application instances have a chance to catch up. Cloud Pub/Sub has global endpoints and uses GCP's global frontend load balancer to support ingestion. The messaging service scales automatically to meet the demands of the current workload.
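As a sketch of the ingestion side, publishing a sensor reading with the google-cloud-pubsub Python client might look like this; the project and topic names are placeholders.

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "iot-telemetry")

# Pub/Sub buffers the message in the topic until subscribers catch up.
future = publisher.publish(
    topic_path,
    data=b'{"temperature_c": 21.5, "humidity": 0.43}',
    device_id="sensor-42",  # message attribute for downstream filtering
)
print(future.result())  # blocks until Pub/Sub returns the message ID
```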

Batch Data

Batch data is ingested in bulk, typically in files. Examples of batch data ingestion include uploading files of data exported from one application to be processed by another. Examples of batch data include the following:

Transaction data that is collected from applications may be stored in a relational database and later exported for use by a machine learning pipeline

Archiving data in long-term storage to comply with data retention regulations

Migrating an application from on premises to the cloud by uploading files of exported data

Google Cloud Storage is typically used for batch uploads. It may also be used in conjunction with Storage Transfer Service and Transfer Appliance when uploading large volumes of data.
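For example, a single exported file might be uploaded with the google-cloud-storage Python client as in this minimal sketch; the bucket and file names are placeholders.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-archive")

# Upload one exported file; very large batch loads are typically handed
# off to Storage Transfer Service or a Transfer Appliance instead.
blob = bucket.blob("exports/transactions-2020-01.csv")
blob.upload_from_filename("transactions-2020-01.csv")
```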

Once data enters the GCP platform through ingestion, it can be stored for longer-term access by other applications or services.

Store

The focus of the storage stage of the data lifecycle is to make data available for transformation and analysis. Several factors influence the choice of storage system, including

How the data is accessed—by individual record (row) or by an aggregation of columns across many records (rows)

The way access controls need to be implemented, at the schema or database level or finer-grained level

How long the data will be stored

These three characteristics are the minimum that should be considered when choosing a storage system; there may be additional criteria for some use cases. (Structure is another factor and is discussed later in this chapter.)

Data Access Patterns

Data is accessed in different ways. Online transaction processing systems often query for specific records using a set of filtering parameters. For example, an e-commerce application may need to look up a customer shipping address from a data store table that holds tens of thousands of addresses. Databases, like Cloud SQL and Cloud Datastore, provide that kind of query functionality.

In another example, a machine learning pipeline might begin by accessing files with thousands of rows of data that is used for training the model. Since machine learning models are often trained in batch mode, all of the training data is needed. Cloud Storage is a good option for storing data that is accessed in bulk.

If you need to access files using filesystem operations, then Cloud Filestore is a good option.

Access Controls

Security, and access control in particular, also influences how data is stored.

Relational databases, like Cloud SQL and Cloud Spanner, provide mechanisms to restrict access to tables and views. Some users can be granted permission to update data, whereas others can only view data, and still others are not allowed any direct access to data in the database. Fine-grained security can be implemented at the application level or by creating views that limit the data available to some users.

Some access controls are coarse grained. For example, Cloud Storage can limit access based on bucket permissions and access control lists on objects stored in a bucket. If a user has access to a file in the bucket, then they will have access to all the data in that file. Cloud Storage treats files as atomic objects; there is no concept of a row of data, for example, in Cloud Storage as there is in a relational database.