A comprehensive and accessible roadmap to performing data analytics in the AWS cloud
In Data Analytics in the AWS Cloud: Building a Data Platform for BI and Predictive Analytics on AWS, accomplished software engineer and data architect Joe Minichino delivers an expert blueprint for storing, processing, and analyzing data on the Amazon Web Services cloud platform. In the book, you'll explore every relevant aspect of data analytics—from data engineering to analysis, business intelligence, DevOps, and MLOps—as you discover how to integrate machine learning predictions with analytics engines and visualization tools.
A can't-miss for data architects, analysts, engineers and technical professionals, Data Analytics in the AWS Cloud will also earn a place on the bookshelves of business leaders seeking a better understanding of data analytics on the AWS cloud platform.
Page count: 496
Publication year: 2023
Cover
Title Page
Introduction
What Is a Data Lake?
The Data Platform
The End of the Beginning
Note
Chapter 1: AWS Data Lakes and Analytics Technology Overview
Why AWS?
What Does a Data Lake Look Like in AWS?
Analytics on AWS
Skills Required to Build and Maintain an AWS Analytics Pipeline
Chapter 2: The Path to Analytics: Setting Up a Data and Analytics Team
The Data Vision
DA Team Roles
Analytics Flow at a Process Level
The DA Team Mantra: “Automate Everything”
Analytics Models in the Wild: Centralized, Distributed, Center of Excellence
Summary
Chapter 3: Working on AWS
Accessing AWS
Everything Is a Resource
IAM: Policies, Roles, and Users
Working with the Web Console
The AWS Command‐Line Interface
Infrastructure‐as‐Code: CloudFormation and Terraform
Chapter 4: Serverless Computing and Data Engineering
Serverless vs. Fully Managed
AWS Serverless Technologies
AWS Serverless Application Model (SAM)
Summary
Chapter 5: Data Ingestion
AWS Data Lake Architecture
Sample Processing Architecture: Cataloging Images into DynamoDB
Serverless Ingestion
Fully Managed Ingestion with AppFlow
Operational Data Ingestion with Database Migration Service
Summary
Chapter 6: Processing Data
Phases of Data Preparation
Overview of ETL in AWS
ETL Job Design Concepts
AWS Glue for ETL
Connectors
Creating ETL Jobs with AWS Glue Visual Editor
Creating ETL Jobs with AWS Glue Visual Editor (without Source and Target)
Creating ETL Jobs with the Spark Script Editor
Developing ETL Jobs with AWS Glue Notebooks
Creating ETL Jobs with AWS Glue Interactive Sessions
Streaming Jobs
Chapter 7: Cataloging, Governance, and Search
Cataloging with AWS Glue
Search with Amazon Athena: The Heart of Analytics in AWS
Governing: Athena Workgroups, Lake Formation, and More
AWS Lake Formation
Summary
Chapter 8: Data Consumption: BI, Visualization, and Reporting
QuickSight
Data Consumption: Not Only Dashboards
Summary
Chapter 9: Machine Learning at Scale
Machine Learning and Artificial Intelligence
Amazon SageMaker
Summary
Appendix: Example Data Architectures in AWS
Modern Data Lake Architecture
Batch Processing
Stream Processing
Architecture Design Recommendations
Summary
Index
Copyright
About the Author
About the Technical Editor
Acknowledgments
End User License Agreement
Chapter 6
Table 6.1: The magics available in AWS Glue
Chapter 7
Table 7.1: Permissions needed
Chapter 2
Figure 2.1: An example structure of an early‐stages DA team
Chapter 3
Figure 3.1: The Console Home screen
Figure 3.2: The Web Console
Figure 3.3: DynamoDB table creation form
Figure 3.4: Verifying the changes to the bucket in the Web Console
Figure 3.5: A Cloudcraft diagram
Chapter 4
Figure 4.1: Configuration of Lambdas
Figure 4.2: Node.js hello‐world blueprint
Figure 4.3: The changeset applied to the stack
Figure 4.4: HelloWorldFunction invoked from the Web Console
Chapter 5
Figure 5.1: Data lake architecture
Figure 5.2: Application architecture
Figure 5.3: Fargate‐based periodic batch import
Figure 5.4: Backend Service infrastructure
Figure 5.5: Two‐pronged delivery
Figure 5.6: Create Replication Instance
Figure 5.7: Specifying the security group
Figure 5.8: Create Endpoint
Figure 5.9: Test Endpoint Connection
Figure 5.10: Endpoint Configuration
Figure 5.11: Endpoint Settings
Figure 5.12: For Parquet and CSV
Figure 5.13: For CSV only
Figure 5.14: Test run
Figure 5.15: Create Database Migration Task
Figure 5.16: Table Mappings
Figure 5.17: Inspecting the migration task status
Figure 5.18: Full load successful
Figure 5.19: Exploring the content of the Parquet file
Figure 5.20: Inspecting the downloaded file
Chapter 6
Figure 6.1: AWS Glue interface in the Web Console
Figure 6.2: Connectors screen
Figure 6.3: Connector Access
Figure 6.4: Secrets Manager
Figure 6.5: Connection Properties section
Figure 6.6: Custom connectors list
Figure 6.7: ETL diagram in the Editor
Figure 6.8: Inspecting parquet
Figure 6.9: Bookmark Enable/Disable option
Figure 6.10: Available transformations
Figure 6.11: Mapping transformation
Figure 6.12: Edited diagram
Figure 6.13: Node Properties
Figure 6.14: Viewing local files
Figure 6.15: Inspecting parquet
Figure 6.16: What you see when you load a notebook
Figure 6.17: Verifying the first entries in the file
Figure 6.18: Notebook example
Figure 6.19: Available kernels
Figure 6.20: Interactive sessions in a notebook
Figure 6.21: Kinesis Create Data Stream
Figure 6.22: S3 bucket exploration
Figure 6.23: Job type option
Figure 6.24: Node properties
Figure 6.25: Target node properties
Figure 6.26: Seeing data stored in S3
Figure 6.27: Table and database selection
Figure 6.28: Setting Kinesis as the source
Figure 6.29: File format selection
Figure 6.30: Sample schema
Chapter 7
Figure 7.1: Adding a table
Figure 7.2: Table schema
Figure 7.3: Object summary
Figure 7.4: Crawler Name field
Figure 7.5: Crawler creation, source, and stores options
Figure 7.6: Crawler creation, folder, and path field
Figure 7.7: Crawler creation, prefix field
Figure 7.8: Crawler list
Figure 7.9: Crawler list, run information
Figure 7.10: Generated tables
Figure 7.11: Generated schema
Figure 7.12: Crawler created with the CLI
Figure 7.13: Crawler‐generated schema, single array field
Figure 7.14: Adding a classifier
Figure 7.15: Add the classifier to the crawler
Figure 7.16: Newly generated schema with classifier
Figure 7.17: Query editor
Figure 7.18: Table options in Athena
Figure 7.19: Copy and Download Results buttons
Figure 7.20: Query Stats graph
Figure 7.21: Saved queries
Figure 7.22: Result of query
Figure 7.23: Save button drop‐down options
Figure 7.24: Save Query dialog box
Figure 7.25: Query editor
Figure 7.26: Parameterized query
Figure 7.27: Connection Details pane
Figure 7.28: Connection error
Figure 7.29: Databases and tables in Athena
Figure 7.30: Create Workgroup
Figure 7.31: Workgroup details
Figure 7.32: Setting query limits for the workgroup
Figure 7.33: Lake Formation menu
Figure 7.34: Registering location in Lake Formation
Figure 7.35: List of registered locations
Figure 7.36: Create Database form
Figure 7.37: List of databases
Figure 7.38: Add LF‐Tag button
Figure 7.39: LF‐Tag creation form
Figure 7.40: Adding key and values
Figure 7.41: Empty tag list
Figure 7.42: Edit LF‐Tag form
Figure 7.43: LF‐Tag validation
Figure 7.44: Grant data permissions form
Figure 7.45: LF‐Tag‐based permission
Figure 7.46: Database Permissions
Figure 7.47: Data filter creation form
Figure 7.48: LF‐Tag available in form
Chapter 8
Figure 8.1: User invitation
Figure 8.2: Create New Group
Figure 8.3: Cost analysis in QuickSight
Figure 8.4: SPICE usage graph
Figure 8.5: QuickSight access to AWS services
Figure 8.6: Public access to dashboards
Figure 8.7: QuickSight's navigation menu
Figure 8.8: QuickSight available data sources
Figure 8.9: New Athena Data Source
Figure 8.10: New data source available
Figure 8.11: Choose Your Table
Figure 8.12: Enter Custom SQL Query
Figure 8.13: Apply query to data source
Figure 8.14: Duplicate Dataset
Figure 8.15: Available resource categories in the UI
Figure 8.16: Refresh Now button
Figure 8.17: Refresh schedule and history of a dataset
Figure 8.18: Common SQL error message
Figure 8.19: Available services in QuickSight
Figure 8.20: Editor view
Figure 8.21: Dataset options
Figure 8.22: Field options
Figure 8.23: Inspecting a script function
Figure 8.24: Placing a function in a script by selecting it from the list
Figure 8.25: bodyLength field now available
Figure 8.26: Data type icon changed
Figure 8.27: Add data to current dataset
Figure 8.28: Newly added dataset
Figure 8.29: Relationship UI
Figure 8.30: Field search
Figure 8.31: Specifying an INNER join
Figure 8.32: Recommended join
Figure 8.33: Single table joining to others
Figure 8.34: Complex relationship diagram
Figure 8.35: Excluded fields at the bottom of the list
Figure 8.36: Filter view
Figure 8.37: Available filter conditions
Figure 8.38: Add Field To Hierarchy
Figure 8.39: Adding to an existing hierarchy
Figure 8.40: Newly created visual default view
Figure 8.41: Add options
Figure 8.42: Autogenerated graph
Figure 8.43: Fields information bar
Figure 8.44: Field wells
Figure 8.45: The various graph type icons
Figure 8.46: Aggregation options
Figure 8.47: Example dashboard with one graph
Figure 8.48: Filtering values
Figure 8.49: Null Options
Figure 8.50: Group/Color field well
Figure 8.51: Drilling down
Figure 8.52: Navigation between levels of drill‐down
Figure 8.53: Create New Parameter
Figure 8.54: Add Control
Figure 8.55: Using a parameter in a filter
Figure 8.56: Application of parameters affecting graphs
Figure 8.57: Gauge control
Figure 8.58: Edit Action
Figure 8.59: Action available in menu
Figure 8.60: Before action trigger
Figure 8.61: After action is triggered
Figure 8.62: New Action
Figure 8.63: List of actions in context menu
Figure 8.64: Specifying a destination URL for the action
Figure 8.65: Suggested Insights
Figure 8.66: Example autonarratives
Figure 8.67: Narrative context menu
Figure 8.68: Edit Narrative
Figure 8.69: Dot indicating that ML‐Insight is available
Figure 8.70: Visual menu including
Figure 8.71: Forecast added to timeline
Figure 8.72: Integration with SageMaker
Figure 8.73: Example dashboard
Figure 8.74: Publishing options
Chapter 9
Figure 9.1: Domain creation form
Figure 9.2: IAM role for SageMaker execution
Figure 9.3: Launch Studio
Figure 9.4: View user
Figure 9.5: SageMaker Studio
Figure 9.6: Example prediction
Figure 9.7: Models list
Figure 9.8: Endpoints interface
Figure 9.9: Endpoints in SageMaker interface
Figure 9.10: Create Batch Transform Job
Figure 9.11: Input and output data configuration
Appendix
Figure A.1: Modern Data Lake Architecture
Figure A.2: Batch processing architecture
Figure A.3: Stream processing architecture
Joe Minichino
Welcome to your journey to AWS‐powered cloud‐based analytics!
If you need to build data lakes and ingestion pipelines, or perform large‐scale analytics and then display the results with state‐of‐the‐art visualization tools, all through the AWS ecosystem, then you are in the right place.
I will spare you an introduction on how we live in a connected world where businesses thrive on data‐driven decisions based on powerful analytics. Instead, I will open by saying that this book is for people who need to build a data platform to turn their organization into a data‐driven one, or who need to improve their current architectures in the real world. This book may help you gain the knowledge to pass an AWS certification exam, but this is most definitely not its only aim.
I will be covering a number of tools provided by AWS for building a data lake and analytics pipeline, but I will cover these tools insofar as they are applicable to data lakes and analytics, and I will deliberately omit features that are not relevant or particularly important. This is not a comprehensive guide to such tools—it's a guide to the features of those tools that are relevant to our topic.
It is my personal opinion that analytics, be they in the form of looking back at the past (business intelligence [BI]) or trying to predict the future (data science and predictive analytics), are the key to success.
You may think marketing is a key to success. It is, but only when your analytics direct your marketing efforts in the right direction, to the right customers, with the right approach for those customers.
You may think pricing, product features, and customer support are keys. They are, but only when your analytics reveal the correct prices and the right features to strengthen customer retention and success, and your support team possesses the necessary skills to adequately satisfy your customers' requests and complaints.
That is why you need analytics.
Even in the extremely unlikely case that your data all resides in one data store, you are probably keeping it in a relational database that's there to back your customer‐facing applications. Traditional RDBs are not made for large‐scale1 storage and analysis, and I have seen very few cases of storing the entire history of records of an RDB in the RDB itself.
So you need a massively scalable storage solution with a query engine that can deal with different data sources and formats, and you probably need a lot of preparation and clean‐up before your data can be used for large‐scale analysis.
You need a data lake.
A data lake is a centralized repository of structured, semi‐structured, and unstructured data, upon which you can run insightful analytics. This is my ultra‐short version of the definition.
While in the past we referred to a data lake strictly as the facility where all of our data was stored, nowadays the definition has extended to include all of the possible data stores that can be linked to the centralized data storage, in a kind of hybrid data lake that comprises flat‐file storage, data warehouses, and operational data stores.
If all your data resides in a single data store, you're not interested in analyzing it, or the size and velocity of your data are such that you can afford to record the entire history of all your records in the same data store and perform your analysis there without impacting customer‐facing services, then you do not need a data lake. I'll confess I never came across such a scenario. So, unless you are running some kind of micro and very particular business that does not benefit from analysis, most likely you will want to have a data lake in place and an analytics pipeline powering your decisions.
Really? Always?
Almost always, and they are generally cheap solutions to maintain. In this book we will explore ways to store and analyze vast quantities of data for very little money.
One of the most common mistakes companies make is to put analysts to work before they have data engineers in place. If you do that, you will only cause the following effects, in this order:
Your analysts will waste their time trying to work around engineering problems or, worse, trying their hand at data engineering themselves.
Your analysts will get frustrated, as most of their time will be spent procuring, transforming, and cleaning the data instead of analyzing it.
Your analysts will produce analyses, but they are not likely to set up automation for the data engineering side of the work, meaning they will spend hours rerunning data acquisition, filtering, cleaning, and transforming rather than analyzing.
Your analysts will leave for a company that has an analytics team in place that includes both data analysts and data engineers.
So just skip that part and do things the right way. Get a vision for your analytics, put data engineers in place, and then put analysts to work so they can dedicate 100 percent of their time to analyzing data and nothing else. We will explore designing and setting up a data analytics team in Chapter 2, "The Path to Analytics: Setting Up a Data and Analytics Team."
In this book, I will guide you through the extensive but extremely interesting and rewarding journey of creating a data platform that will allow you to produce analytics of all kinds: look at the past and visualize it through business intelligence and BI tools and predict the future with intelligent forecasting and machine learning models, producing metrics and the likelihood of events happening.
We will do so in a scalable, extensible way that grants your organization the agility needed for fast turnaround on analytics requests and for dealing with change in real time, by building a platform centered on the best technologies for the task at hand, with the right resources in place to accomplish those tasks.
I hope you enjoy this book, which is the fruit of my many years of experience collected in the “battlefield” of work. Hopefully you will gain knowledge and insights that will help you in your job and personal projects, and you may reduce or altogether skip some of the common issues and problems I have encountered throughout the years.
1. Everything is relative, but generally speaking, if you tried to store all the versions of all the records in a large RDBMS, you would put the database itself under unnecessary pressure, and you would be doing so at the higher cost of the I/O‐optimized storage that databases use in AWS (read about provisioned IOPS), rather than utilizing a cheap storage facility that scales to virtually infinite size, like S3.
In the introduction I explained why you need analytics. Really powerful analytics require large amounts of data. "Large" here is relative to the context of your business or task, but the bottom line is that you should produce analytics based on a comprehensive dataset rather than a small (and inaccurate) sample of the entire body of data you possess.
But first let's address our choice of cloud computing provider. As of this writing (early 2022) there are a number of cloud computing providers, with three competitors leading the race: Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. I recommend AWS as your provider of choice, and I'll tell you why.
The answer for me lies in the fact that analytics is a vast realm of computing spanning numerous technologies and areas: business analysis, data engineering, data analytics, data science, data storage (including transactional databases, data lakes, and warehouses), data mining/crawling, data cataloging, data governance and strategy, security, visualization, business intelligence, and reporting.
Although AWS may not win out on some of the costs of running services and has to cover some ground to catch up to its competitors in terms of user interface/user experience (UI/UX), it remains the only cloud provider that has a solid and stable solution for each area of the business, all seamlessly integrated through the AWS ecosystem.
It is true that other cloud providers are ideal for some use cases and that leveraging their strength in certain areas (for example, GCP tends to be very developer‐friendly) can make for easy and cost‐effective solutions. However, when it comes to running an entire business on it, AWS is the clear winner.
Also, AWS encourages businesses to use their resources in an optimal fashion by providing a free tier of operation, which means that for each tool you use there will be a certain amount of usage below a specified threshold provided for free. Free‐tier examples are 1 million AWS Lambda invocations per month, or 750 hours of small Relational Database Service (RDS) databases.
As far as this book's use case, which is setting up and delivering large‐scale analytics, AWS is clearly the leader in the field at this time.
For the most part, you will be dealing with Amazon Simple Storage Service (S3), with which you should be familiar, but if you aren't, fear not, because we've got you covered in the next chapters.
S3 is the storage facility of choice for the following reasons:
It can hold a virtually infinite amount of data.
It is inexpensive, and you can adopt storage classes that make it up to 50 times cheaper.
It is seamlessly integrated with all data and analytics‐related tools in AWS, from tools like Kinesis that store data in S3 to tools like Athena that query the data in it.
Data can be protected through access permissions, it can be encrypted in a variety of ways, or it can be made publicly accessible.
There are other solutions for storage in AWS, but aside from one that has some use cases (the EMR File System, or EMRFS), you should rely on S3. Note that EMRFS is actually based on S3, too. Other storage solutions like Amazon Elastic Block Store (EBS) are not ideal for data lake and analytics purposes, and since I discourage their use in this context, I will not cover them in the book.
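Much of this S3 setup can be scripted. The following boto3 sketch creates an encrypted, non‐public bucket and adds a lifecycle rule that moves aging raw data to cheaper storage classes. The bucket name, region, and prefix are hypothetical placeholders rather than values prescribed by the book.

import boto3

s3 = boto3.client("s3", region_name="eu-west-1")
bucket = "example-analytics-data-lake"  # hypothetical bucket name

# Create the bucket (outside us-east-1 a LocationConstraint is required).
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Enforce default server-side encryption (SSE-S3).
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)

# Block all forms of public access.
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Transition raw data to cheaper storage classes as it ages.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-raw-zone",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }]
    },
)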
If you log into the AWS console, you will see the following products listed under the Analytics heading:
Athena
EMR
CloudSearch
Kinesis
QuickSight
Data Pipeline
AWS Data Exchange
AWS Glue
AWS Lake Formation
MSK
The main actors in the realm of analytics in the context of big data and data lakes are undoubtedly S3, Athena, and Kinesis.
EMR is useful for data preparation/transformation, and the output is generally data that is made available to Athena and QuickSight.
Other tools, like AWS Glue and Lake Formation, are no less important (Glue in particular is vital to the creation and maintenance of an analytics pipeline), but they do not directly generate or perform analytics. MSK is AWS's fully managed version of Apache Kafka, and we will take a quick look at it, but we will generally favor Kinesis (as it performs a similar role in the stack).
Opting for MSK or plain Kafka comes down to cost and performance choices.
CloudSearch is a search engine for websites, and therefore is of limited interest to us in this context.
Finally, SageMaker can be a valuable addition if you want to power your analytics with predictive models or any other machine learning/artificial intelligence (ML/AI) task.
First of all, you need familiarity with AWS tools. You will gain that familiarity through this book. For anything that goes beyond the creation of resources through the AWS console, you will need general AWS SysOps skills. Other skills you'll need include the following:
Knowledge of AWS Identity and Access Management (IAM) is necessary to understand the permissions requirements for each task.
DevOps skills are required if you want to automate the creation and destruction of resources using CloudFormation or Terraform (or any other infrastructure‐as‐code tool).
SQL skills are needed to write Athena queries (a short example follows this list), and basic database administrator (DBA) skills are needed to understand Athena data types and schemas.
Data analysis and data science skills are required for SageMaker models.
A basic business understanding of charts and graphs is required to create QuickSight visualizations.
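To make the Athena requirement concrete, here is a minimal boto3 sketch that submits a query and polls for its result. The database, table, and output bucket names are hypothetical, and in practice you would add error handling and result pagination.

import time
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString=(
        "SELECT customer_id, COUNT(*) AS orders "
        "FROM orders "
        "WHERE order_date >= DATE '2023-01-01' "
        "GROUP BY customer_id ORDER BY orders DESC LIMIT 10"
    ),
    QueryExecutionContext={"Database": "sales_db"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    # The first row returned is the header row.
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:
        print([col.get("VarCharValue") for col in row["Data"]])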
Creating analytics, especially in a large organization, can be a monumental effort, and a business needs to be prepared to invest time and resources, an investment that will repay the company manifold by enabling data‐driven decisions. The people who will make this shift toward data‐driven decision making are your Data and Analytics team, sometimes referred to as the Data Analytics team or even simply as the Data team (although this last name tends to confuse people, as it may seem related to database administration). This book will refer to the Data and Analytics team as the DA team.
Although the focus of this book is architectural patterns and designs that will help you turn your organization into a data‐driven one, a high‐level overview of the skills and people you will need to make this happen is necessary.
Funny anecdote: At Teamwork, our DA team goes by the odd‐sounding name DANDA. We create resources on AWS with the identifier D&A, but because AWS has a habit of converting some characters into full text, & became AND. Needless to say, it stuck, and since then we have been known as DANDA.
The first step in delivering analytics is to create a data vision, a statement for your business as a whole. This can be a simple quote that works as a compass for all the projects your DA team will work on.
A vision does not have to be immutable. However, you should change it only if it applies solely to certain conditions or periods of time and those conditions have been satisfied or that time has passed.
A vision is the North Star of your data journey. It should always be a factor when you're making decisions about what kind of work to carry out or how to prioritize a current backlog. An example of a data vision is “to create a unified analytics facility that enables business management to slice and dice data at will.”
It's important to create the vision, and it's also vital for the vision to have the support of all the involved stakeholders. Management will be responsible for allocating resources to the DA team, so these managers need to be behind the vision and the team's ability to carry it out. You should have a vision statement ready and submit it to management, or have management create it in the first place.
I won't linger any further on this topic because this book is more technical than business‐oriented in nature, but be sure not to skip this vital step.
Before diving into the steps for creating analytics, allow me to give you some friendly advice on how you should not go about it. I will do so by recounting a fictional yet all too common story of failure by businesses and companies.
Data Undriven Inc. is a successful company with hundreds of employees, but it's in dire need of analytics to reverse some worrying revenue trends. The leadership team recognizes the need for a far more accurate kind of analytics than what they currently have available, since it appears the company is unable to pinpoint exactly what side of the business is hemorrhaging money. Gemma, a member of the leadership team, decides to start a project to create analytics for the company, which will find its ultimate manifestation in a dashboard illustrating all sorts of useful metrics. Gemma thinks Bob is a great Python/SQL data analyst and tasks Bob with the creation of reports. The ideas are good, but data for these reports resides in various data sources. This data is unsuitable for analysis because it is sparse and inaccurate, some integrity is broken, there are holes due to temporary system failures, and the DBA team has been hit with large and unsustainable queries run against their live transactional databases, which are meant to serve data to customers, not to be reported on.
Bob collects the data from all the sources and, after weeks of wrangling, cleaning, filtering, and general massaging of the data, delivers the analytics to Gemma in the form of a spreadsheet with graphs in it.
Gemma is happy with the result, although she notices some incongruence with the expected figures. She asks Bob to automate this analysis into a dashboard that managers can consult and that will contain up‐to‐date information.
Bob is in a state of panic, looking up how to automate his analytics scripts, while also trying to understand why his numbers do not match Gemma's expectations—not to mention the fact that his Python program takes between 3 and 4 hours to run every time, so the development cycle is horrendously slow.
The following weeks are a harrowing story of misunderstandings, failed attempts at automation, frustration, and degraded database performance, with the ultimate result that Gemma has no analytics and Bob has quit his job to join a DA team elsewhere.
What is the moral of the story? Do not put any analyst to work before you have a data engineer in place. This cannot be stated strongly enough. Resist the temptation to want analytics now. Go about it the right way. Set up a DA team, even if it's small and you suffer from resource constraints in the beginning, and let analysts come into the picture when the data is ready for analytics and not before. Let's see what kind of skills and roles you should rely on to create a successful DA team and achieve analytics even at scale.
There are two groups of roles for a DA team: early‐stage roles and mature‐stage roles. The definitions for these are not strict and vary from business to business. Make sure core roles are covered before advancing to more niche and specialized ones.
By “early stage roles” we refer to a set of roles that will constitute the nucleus of your nascent DA team and that will help the team grow. At the very beginning, it is to be expected that the people involved will have to exercise some flexibility and open‐mindedness in terms of the scope and authority of their roles, because the priority is to build the foundation for a data platform. So a team lead will most likely be hands‐on, actively contributing to engineering, and the same can be said of the data architect, whereas data engineers will have to perform a lot of work in the realms of data platform engineering to enable the construction and monitoring of pipelines.
Your DA team should have, at least at the beginning, strong leadership in the form of a team lead. This is a person who is clearly technically proficient in the realm of analytics and is able to create tasks and delegate them to the right people, oversee the technical work that's being carried out, and act as a liaison between management and the DA team.
Analytics is a vast domain that has more business implications than other strictly technical areas (like feature development, for example), and yet the technical aspects can be incredibly challenging, normally requiring engineers with years of experience to carry out the work. For this reason, it is good to have a person spearheading the work in terms of workflow and methodology to avoid early‐stage fragmentation, discrepancies, and general disruption of the work due to lack of cohesion within the team. The team can potentially evolve into something more of a flat‐hierarchy unit later on, when every member is working with similar methods and practices that can be—at that later point—questioned and changed.
A data architect is a fundamental figure for a DA team and one the team cannot do without. Even if you don't elect someone to be officially recognized as the architect in the team, it is advisable to elect the most experienced and architecturally minded engineer to the role of supervisor of all the architectures designed and implemented by the DA team. Ideally the architect is a full‐time role, not only designing pipeline architectures but also completing work on the technology adoption front, which is a hefty and delicate task at the same time.
Deciding whether you should adopt a serverless architecture over an Airflow‐ or Hadoop‐based one is something that requires careful attention. Elements such as in‐house skills and maintenance costs are also involved in the decision‐making process.
The business can—especially under resource constraints—decide to combine the architect and team lead roles. I suggest making the data architect/team lead a full‐time role before the analytics demand volume in the company becomes too large to be handled by a single team lead or data architect.
Every DA team should have a data engineering (DE) subteam, which is the beating heart of data analytics. Data engineers are responsible for implementing systems that move, transform, and catalog data in order to render the data suitable for analytics.
In the context of analytics powered by AWS, data engineers nowadays are necessarily multifaceted engineers with skills spanning various areas of technology. They are cloud computing engineers, DevOps engineers, and database/data lake/data warehouse experts, and they are knowledgeable in continuous integration/continuous deployment (CI/CD).
You will find that most DEs have particular strengths and interests, so it would be wise to create a team of DEs with some diversity of skills. Cross‐functionality can be built over time; it's much more important to start with people who, on top of the classic extract, transform, load (ETL) work, can also complete infrastructure work, CI/CD pipelines, and general DevOps.
At its core, the Data Engineer’s job is to perform ETL operations. They can be of varied natures, dealing with different sources of data and targeting various data stores, and they can perform some kind of transformation, like flattening/unnesting, filtering, and computing values. Ultimately, the broad description of the work is to extract (data from a source), transform (the data that was extracted), and load (the transformed data into a target store).
You can view all the rest of the tasks as ancillary tasks to this fundamental operation.
Another classic subteam of a DA team is the Data Analysts team. The team consists of a number of data analysts who are responsible for the exploratory and investigative work that identifies trends and patterns through the use of statistical models and provides management with metrics and numbers that help decision making. At the early stages of a DA team, data analysts may also cover the role of business intelligence developers, responsible for visualizing data in the form of reports and dashboards, using descriptive analytics to give an easy‐to‐understand view of what happened in the business in the past.
When the team's workflow is established, it is a good idea to better define the scope of each role and include figures responsible for specialist areas of expertise, such as data science or cloud and data platform engineering, and let every member of the team focus on the areas they are best suited for.
A data scientist (DS) is the ultimate data “nerd” and responsible for work in the realm of predictive and prescriptive analytics. A DS usually analyzes a dataset and, through the use of machine‐learning (ML) techniques, is able to produce various predictive models, such as regression models that produce the likelihood of a certain outcome given certain conditions (for example, the likelihood of a prospective customer to convert from a trial user to a paying user). The DS may also produce forecasting models that use modern algorithms to predict the trend of a certain metric (such as revenue of the business), or even simply group records in clusters based on some of the records' features.
A data scientist's work is to investigate and resolve complex challenges that often involve a number of unknowns, and to identify patterns and trends not immediately evident to the human eye (or mind). An ideally structured centralized DA team will have a Data Science subteam at some point. The common ratio found in the industry is to have one DS for every four data analysts, but this is by no means a hard‐and‐fast rule. If the business is heavily involved in statistical models, or it leverages machine‐learning predictions as a main feature of its product(s), then it may have more data scientists than data analysts.
If your team's volume of work is large enough to justify a single dedicated engineer responsible for maintaining infrastructure, then having a cloud engineer is a good idea. I strongly encourage DEs to get familiar with infrastructure and "own" the resources that their code leverages, creates, and consumes. A cloud engineer, then, is a subject matter expert who is responsible for the domain, oversees the cloud engineering work that DEs already perform as part of their tasks, and completes work of their own. These kinds of engineers, in an AWS context, will take care of aspects such as the following:
Networking (VPCs, VPN access, subnets, and so on)
Security (encryption, parameter stores and secrets vault, security groups for applications, as well as role/user permission management with IAM)
Tools like CloudFormation (or similar ones such as Terraform) for writing and maintaining infrastructure
Once your DA team is mature enough, you will probably want to restrict the scope of the data analysts' work to exploration and investigation and leave the visualization and reporting to developers who are specialized in the use of business intelligence (BI) tools (such as Amazon QuickSight, Power BI, or Tableau) and who can more easily and quickly report their findings to stakeholders.
A machine learning engineer (MLE) is a close relative of the DE, specialized in ML‐focused operations, such as the setup and maintenance of ML‐oriented pipelines, including their development and deployment, and the creation and maintenance of specialized data stores (such as feature stores) exclusively aimed at the production of ML models. Since the tools used in ML engineering differ from classic DE tools and are more niche, they require a high level of understanding of ML processes. A person working as an MLE is normally a DE with an interest in data science, or a data scientist who can double as a DE and who has found their ideal place as an MLE.
The practice of automating the training and deployment of ML models is called MLOps, or machine learning operations.
A business analyst (BA) is the ideal point of contact between a technical team and the business/management. The main task of a BA is to gather requirements from the business and turn these requirements into tasks that the technical personnel can execute. I consider a BA a maturity stage role, because in the beginning this is work that the DA team lead should be able to complete, albeit at not as high a standard as a BA proper.
Other roles that you might consider including in your DA team, depending on the nature of the business and the size/resources of the team itself, are as follows:
AI Developer
All too often anything ML related is also referred to as artificial intelligence (AI). Although there are various schools of thought and endless debates on the subject, I agree with Microsoft in summarizing the matter like so: machine learning is how a system develops intelligence, whereas AI is the intelligence itself that allows a computer to perform a task on its own and make independent decisions. In this respect, ML is a subset of AI and a gear in a larger intelligent machine. If your business needs someone responsible for developing algorithms aimed at resolving an analytics problem, then an AI developer is what you need.
TechOps / DevOps Engineer
If your team is sizable, and the workload on the CI/CD and cloud infrastructure side is too much for DEs to tackle on top of their main function (creating pipelines), then you might want to have dedicated TechOps/DevOps personnel for the DA team.
MLOps Engineer
This is a subset role of the greater DevOps specialty, a DevOps engineer who specializes in CI/CD and infrastructure dedicated to putting ML models into production.
There are many ways to design the process to request and complete analytics in a business. However, I've found the following to be generally applicable to most businesses:
A stakeholder formulates a request, a business question that needs answering.
The BA (or team lead at early stages) translates this into a technical task for a data analyst.
The data analyst conducts some investigation and exploration, leading to a conclusion. The data analyst identifies the portion of their work that can be automated to produce up‐to‐date insights and designs a spec (if a BI developer is available, they will do this last part).
A DE picks up the spec, then designs and implements an ETL job/pipeline that will produce a dataset and store it in the suitable target database.
The BI developer utilizes the data made available by the DE at step 4 and visualizes it or creates reports from it.
The BA reviews the outcome with the stakeholder for final approval and sign‐off.
There are many available software development methodologies for managing the team's workload and achieving a satisfactory level of productivity and velocity. The methodology adopted by your team will greatly depend on the skills you have on your team and even the personalities of the various team members. However, I've found a number of common traits throughout the years:
Cloud engineering tends to be mostly planned work, such as enabling the team to create resources, setting up monitoring and alerting, creating CI/CD pipelines, and so on.
Data analytics tends to be mostly reactive work, whereby a stakeholder asks for a certain piece of work and analysts pick it up.
Data engineering is a mixed bag: on one hand, it is reactive insofar as it supports the work cascading from analysts and is destined to be used by BI developers; on the other hand, some tasks, such as developing utilities and tooling to help the team scale operations, are planned and would normally be associated with a traditional delivery deadline.
Data architects tend to have more planned work than reactive, but at the beginning of a DA team's life there may be a lot of real‐time prioritization to be done.
So given these conditions, what software development methodology should you choose? Realistically it would be one of the many Agile methodologies available, but which one?
A good rule of thumb is as follows: if it's planned work, use Scrum; if it's reactive work, use Kanban. If in doubt, or you want to use one method for everyone, use Kanban.
Let me explain the reason for this guideline. Scrum's central concept for time estimation is user stories that can be scored. This is a very useful idea that enables teams to plan their sprints with just the right amount of work to be completed within that time frame. Planned work normally starts with specifications, and leadership/management will have an expectation for its completion. Therefore, planning the work ahead and dividing it into small stories that can be estimated will also yield an overall time estimate that can serve as the deadline.
In my opinion, Scrum is better suited to this kind of work, as it lends itself to feature‐oriented development (as in most product teams).
Kanban, on the other hand, is an extremely versatile methodology meant to combine agility and flexibility with team velocity and productivity. When a team is constantly dealing with a flow of requests, how do you go about completing them? The key is in real‐time prioritization, which in turn depends on breaking down tasks to the smallest possible unit.
Limits and constraints that I've found useful are as follows:
No task should ever exceed 3 days of work, with 1 being ideal.
There should never be more than one task per member in the In Progress column of your Kanban board.
There should never be more than one task per member in the Review/Demo column of your board.
Encourage cooperation by setting a “work in progress” limit that is less than twice the number of team members, so at least one task must have more than one person assigned to it. For example, if you only want this constraint to be applied to one task, you could set the WIP limit at
Also, I strongly encourage code‐based work to require the approval of at least one other team member before any one code contribution is merged into the codebase. This is true for DEs and data analysts alike.
Applying these constraints, you will immediately notice that if an urgent task lands in the team's backlog (the “drop what you're doing” kind of task), you should always be at most three days away from being able to assign the task and have it completed.
And aside from those business‐critical anomalies that require immediate attention (which, by the way, should never be the case in a DA team since they are rarely a customer‐facing team), real‐time prioritization and management of the backlog is relatively easy, especially in the realms of data analytics and BI, where demands for investigations and reports are an ever‐flowing stream.
In conclusion, Kanban is a versatile methodology, suitable for real‐time prioritization that can be applied to the whole team. If you have subteams only completing planned work, they could be more optimally managed with Scrum.
If there is one thing I wish readers would learn from my experience, it's the vital importance of automation. If you are dealing with terabytes of data across several data sources, vast data lakes and data warehouses, countless ETL pipelines, dashboards, and tables to catalog in metadata stores, you cannot expect to maintain the operation manually. Neither should you aspire to. On the contrary, you should strive to achieve complete automation where the data lake practically maintains itself.
Here is a list of aspects of the work that are better managed through automation:
Infrastructure Creation, Update, and Destruction
There are many tools to accomplish this. The main infrastructure‐as‐code solutions available are CloudFormation, Terraform, and the AWS CDK (the last of which uses CloudFormation under the hood but is easier to write and maintain).
Data Cataloging
As data flows into your data lake, new partitions and new tables are better discovered automatically. The umbrella tool AWS Glue covers this part of your automation by scanning newly deposited data with so‐called crawlers (see the sketch after this list).
Pipeline Execution
Amazon EventBridge allows pipelines to execute on particular triggers; these may be simple schedules or more complex events, such as the creation of a new object in storage (also covered in the sketch after this list).
Visualizations/Dashboard Update
Amazon QuickSight bases its dashboards on datasets that can be set to refresh on a schedule, so reports are always up to date.
Test and Deployment
You should treat data engineering and analytics the same way you would a product, by setting up a CI/CD pipeline that tests code and deploys it upon a successful code review following a pull request. The de facto standard for version control of the code is Git, although other solutions are available.
Monitoring and Alerting
Whatever your delivery system of choice is (a message in a chat application, an email, an SMS), be sure to automate monitoring and alerting so that you are immediately notified when something has gone wrong. Especially in data engineering, missing a day's worth of data can result in problems and a lot of hassle to backfill the information.
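To make the cataloging, scheduling, and alerting items above more tangible, here is a boto3 sketch that creates a Glue crawler, a nightly EventBridge rule, and a CloudWatch alarm that notifies an SNS topic. All names, ARNs, addresses, and schedules are illustrative placeholders rather than values prescribed by the book, and in a real setup you would manage these resources with infrastructure‐as‐code instead of ad hoc scripts.

import boto3

glue = boto3.client("glue")
events = boto3.client("events")
sns = boto3.client("sns")
cloudwatch = boto3.client("cloudwatch")

# 1. Cataloging: crawl newly landed data so tables and partitions
#    appear in the Glue Data Catalog.
glue.create_crawler(
    Name="raw-zone-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="data_lake_raw",
    Targets={"S3Targets": [{"Path": "s3://example-analytics-data-lake/raw/"}]},
    TablePrefix="raw_",
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
)
glue.start_crawler(Name="raw-zone-crawler")

# 2. Pipeline execution: trigger the ETL orchestrator every night at 02:00 UTC.
events.put_rule(
    Name="nightly-etl-trigger",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)
events.put_targets(
    Rule="nightly-etl-trigger",
    Targets=[{
        "Id": "etl-orchestrator",
        # Hypothetical Lambda; it also needs a resource-based permission
        # allowing EventBridge to invoke it.
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:start-etl",
    }],
)

# 3. Monitoring and alerting: email the team when the ETL Lambda reports errors.
topic_arn = sns.create_topic(Name="data-platform-alerts")["TopicArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="da-team@example.com")

cloudwatch.put_metric_alarm(
    AlarmName="etl-lambda-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "start-etl"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[topic_arn],
)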
Finally, let's take a look at how the DA team may be placed within the organization and how it could interact with the other functions.
There are plenty of models available, but there are three models that are in a way the basic version of every other variation available: centralized, distributed, and center of excellence, or CoE (which is ideal for a hybrid structure).
A centralized DA team is a place where all the analytics needs of an organization are satisfied. It not only means that every single piece of data engineering or analytics will be performed by the DA team, but it also means no data engineering, data analysis, or data science should happen outside of the DA team.
This may not be suitable for all organizations, but I do find that, at least at the beginning of a business's transformation to data‐driven, a centralized approach brings order and method to the chaos. Rogue initiatives outside of it only create duplication and fragmentation of methodology, practices, and tools, and may even produce results that conflict with similar work conducted within the DA team, which can result in poor buy‐in from the business, slow down the production of analytics, or call its accuracy into question. If you do not have analytics in your company, start with a centralized team.
If you do have analysts in your company because you made the very common mistake of putting analysts to work before data engineering was in place, bring your analysts into the DA team and transform what may be a community of practice into a structured team.
An early‐stages DA team works mainly in three areas: architecture, engineering, and analysis. Data science may come soon after but not right away. For this reason, I believe an early‐stages DA team and indeed a centralized DA team may have the structure shown in Figure 2.1.
It is important to note that, as specified earlier, the architect role can be covered by a team lead, but it is not the same thing. A competent person who can design resilient, maintainable, and extensible architectures is needed to review the work done by all the teams, but especially the data engineering team.
Later in the data journey, you may drift more toward a hub‐and‐spoke model. If so, your centralized team may in time become the core team of the center of excellence, which we will explore soon.
The main disadvantage of centralized teams in the long term is that they may produce slower lead times from request to analytics, as the analytics requests coming from the business will have to join a prioritized queue and there are no resources dedicated to each function.
Figure 2.1: An example structure of an early‐stages DA team
A main advantage of a centralized team is that it inherently encourages cross‐functionality among the members of each subteam; therefore, if resources are not available or for some reason temporarily constrained, it means work can proceed (albeit at a slower pace) rather than coming to a grinding halt. So a centralized team has a certain degree of resilience.
A distributed DA team is especially suitable for those organizations whose analytical needs are so large, and there is so much domain knowledge to be acquired by the people carrying out engineering and analysis work, that it is faster and more productive to split the team out. The main advantage of distributed teams is the quicker turnaround. If Finance is in need of a piece of analytics, they don't need to share resources with Marketing. The Finance DA team will swiftly produce the analytics requested without having to go to a centralized team and share resources with the entire business.
But there are drawbacks. Inevitably, teams will drift apart and develop practices and adopt methodologies that in time are going to diverge, especially given the different domains of work, and transferring resources or regulating analytics at the business level may become very challenging.
Distributed teams may have a structure that internally is similar to the centralized team but on a smaller scale.
There is a third model, which combines the benefits of centralized and distributed models: the center of excellence. This model requires a high level of data maturity in the business, because it involves a great deal of agility while remaining regulated, and it addresses domain knowledge, quick iterations, and data governance.
Instead of aggregating all of the DA resources into one team, you form a center of excellence containing the people with high‐value skills and experience. From this center of excellence, you can regulate activity and establish a rhythm to analytics production. You can also review work carried out in the distributed units and establish communities of practice to contain the drift between the various functions.
A center of excellence is at the core of a hub‐and‐spoke model where the central unit (the hub) is responsible for overseeing and regulating activities, as well as performing tasks that are to be considered business‐wide or business‐critical (for example, managing and regulating access to the business's centralized data lake). The spokes are units (teams) embedded within the various functions that can perform work at a higher pace while having their activity reviewed and approved by the people in the center of excellence.
As mentioned, this model suits businesses and organizations that are far down the road of analytics, and it is one model that allows quick iterations on producing insights and analytics while limiting fragmentation and duplication of work.
In this chapter we discussed the formation of a DA team, which is a vital prerequisite for the successful creation and maintenance of a data platform in any organization. While not all organizations are the same, the general advice is to start with an embryonic unit with strong leadership, and gradually and iteratively add specialist roles to your growing team.
AWS is an incredibly vast ecosystem of tools and components, and—especially if you are not familiar with it—learning to work with it may seem like a daunting task.
Therefore, it seems only fitting that we should take a look at the basics of how to work in an AWS environment and build your understanding of cloud computing and engineering.
Since you are reading this book to implement analytics on AWS, it would seem logical that you are already using AWS for other parts of your business. Therefore, we will take a quick look at initial steps (sign‐up and user creation), but we will dive into deeper detail on the subsequent steps. We will discuss the following:
Accessing AWS
Managing users
Interacting with AWS through the Web Console
Interacting with the command line
Interacting with AWS CloudShell
Creating virtual private clouds to secure your resources
Managing roles and policies with IAM
Using CloudFormation to manage infrastructure
First things first: you need to access AWS. To do this, you will need to create an AWS account, with credentials that will grant you root access. The URL to create an AWS account (and for subsequent sign‐ins) is https://aws.amazon.com.
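Before we move on to the Web Console, note that once the account exists and you have configured credentials locally (for example, with the AWS CLI's aws configure), a quick sanity check from Python confirms that programmatic access works. This is only an illustrative sketch; it assumes the caller is allowed to call STS and to list IAM users.

import boto3

# Who am I? Works with any valid set of credentials.
sts = boto3.client("sts")
identity = sts.get_caller_identity()
print("Account:", identity["Account"])
print("Caller ARN:", identity["Arn"])

# List IAM users in the account (requires the iam:ListUsers permission).
iam = boto3.client("iam")
for user in iam.list_users()["Users"]:
    print(user["UserName"], user["CreateDate"])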
Once in, you will be prompted with the Console Home screen (shown in Figure 3.1