Data Analytics in the AWS Cloud

Joe Minichino

Description

A comprehensive and accessible roadmap to performing data analytics in the AWS cloud

In Data Analytics in the AWS Cloud: Building a Data Platform for BI and Predictive Analytics on AWS, accomplished software engineer and data architect Joe Minichino delivers an expert blueprint for storing, processing, and analyzing data on the Amazon Web Services cloud platform. In the book, you’ll explore every relevant aspect of data analytics—from data engineering to analysis, business intelligence, DevOps, and MLOps—as you discover how to integrate machine learning predictions with analytics engines and visualization tools.

You’ll also find:

  • Real-world use cases of AWS architectures that demystify the applications of data analytics
  • Accessible introductions to data acquisition, importation, storage, visualization, and reporting
  • Expert insights into serverless data engineering and how to use it to reduce overhead and costs, improve stability, and simplify maintenance

A can't-miss for data architects, analysts, engineers, and technical professionals, Data Analytics in the AWS Cloud will also earn a place on the bookshelves of business leaders seeking a better understanding of data analytics on the AWS cloud platform.


Page count: 496

Publication year: 2023




Table of Contents

Cover

Title Page

Introduction

What Is a Data Lake?

The Data Platform

The End of the Beginning

Note

Chapter 1: AWS Data Lakes and Analytics Technology Overview

Why AWS?

What Does a Data Lake Look Like in AWS?

Analytics on AWS

Skills Required to Build and Maintain an AWS Analytics Pipeline

Chapter 2: The Path to Analytics: Setting Up a Data and Analytics Team

The Data Vision

DA Team Roles

Analytics Flow at a Process Level

The DA Team Mantra: “Automate Everything”

Analytics Models in the Wild: Centralized, Distributed, Center of Excellence

Summary

Chapter 3: Working on AWS

Accessing AWS

Everything Is a Resource

IAM: Policies, Roles, and Users

Working with the Web Console

The AWS Command‐Line Interface

Infrastructure‐as‐Code: CloudFormation and Terraform

Chapter 4: Serverless Computing and Data Engineering

Serverless vs. Fully Managed

AWS Serverless Technologies

AWS Serverless Application Model (SAM)

Summary

Chapter 5: Data Ingestion

AWS Data Lake Architecture

Sample Processing Architecture: Cataloging Images into DynamoDB

Serverless Ingestion

Fully Managed Ingestion with AppFlow

Operational Data Ingestion with Database Migration Service

Summary

Chapter 6: Processing Data

Phases of Data Preparation

Overview of ETL in AWS

ETL Job Design Concepts

AWS Glue for ETL

Connectors

Creating ETL Jobs with AWS Glue Visual Editor

Creating ETL Jobs with AWS Glue Visual Editor (without Source and Target)

Creating ETL Jobs with the Spark Script Editor

Developing ETL Jobs with AWS Glue Notebooks

Creating ETL Jobs with AWS Glue Interactive Sessions

Streaming Jobs

Chapter 7: Cataloging, Governance, and Search

Cataloging with AWS Glue

Search with Amazon Athena: The Heart of Analytics in AWS

Governing: Athena Workgroups, Lake Formation, and More

AWS Lake Formation

Summary

Chapter 8: Data Consumption: BI, Visualization, and Reporting

QuickSight

Data Consumption: Not Only Dashboards

Summary

Chapter 9: Machine Learning at Scale

Machine Learning and Artificial Intelligence

Amazon SageMaker

Summary

Appendix: Example Data Architectures in AWS

Modern Data Lake Architecture

Batch Processing

Stream Processing

Architecture Design Recommendations

Summary

Index

Copyright

About the Author

About the Technical Editor

Acknowledgments

End User License Agreement

List of Tables

Chapter 6

Table 6.1: The magics available in AWS Glue

Chapter 7

Table 7.1: Permissions needed

List of Illustrations

Chapter 2

Figure 2.1: An example structure of an early‐stages DA team

Chapter 3

Figure 3.1: The Console Home screen

Figure 3.2: The Web Console

Figure 3.3: DynamoDB table creation form

Figure 3.4: Verifying the changes to the bucket in the Web Console

Figure 3.5: A Cloudcraft diagram

Chapter 4

Figure 4.1: Configuration of Lambdas

Figure 4.2: Node.js hello‐world blueprint

Figure 4.3: The changeset applied to the stack

Figure 4.4: HelloWorldFunction invoked from the Web Console

Chapter 5

Figure 5.1: Data lake architecture

Figure 5.2: Application architecture

Figure 5.3: Fargate‐based periodic batch import

Figure 5.4: Backend Service infrastructure

Figure 5.5: Two‐pronged delivery

Figure 5.6: Create Replication Instance

Figure 5.7: Specifying the security group

Figure 5.8: Create Endpoint

Figure 5.9: Test Endpoint Connection

Figure 5.10: Endpoint Configuration

Figure 5.11: Endpoint Settings

Figure 5.12: For Parquet and CSV

Figure 5.13: For CSV only

Figure 5.14: Test run

Figure 5.15: Create Database Migration Task

Figure 5.16: Table Mappings

Figure 5.17: Inspecting the migration task status

Figure 5.18: Full load successful

Figure 5.19: Exploring the content of the Parquet file

Figure 5.20: Inspecting the downloaded file

Chapter 6

Figure 6.1: AWS Glue interface in the Web Console

Figure 6.2: Connectors screen

Figure 6.3: Connector Access

Figure 6.4: Secrets Manager

Figure 6.5: Connection Properties section

Figure 6.6: Custom connectors list

Figure 6.7: ETL diagram in the Editor

Figure 6.8: Inspecting parquet

Figure 6.9: Bookmark Enable/Disable option

Figure 6.10: Available transformations

Figure 6.11: Mapping transformation

Figure 6.12: Edited diagram

Figure 6.13: Node Properties

Figure 6.14: Viewing local files

Figure 6.15: Inspecting parquet

Figure 6.16: What you see when you load a notebook

Figure 6.17: Verifying the first entries in the file

Figure 6.18: Notebook example

Figure 6.19: Available kernels

Figure 6.20: Interactive sessions in a notebook

Figure 6.21: Kinesis Create Data Stream

Figure 6.22: S3 bucket exploration

Figure 6.23: Job type option

Figure 6.24: Node properties

Figure 6.25: Target node properties

Figure 6.26: Seeing data stored in S3

Figure 6.27: Table and database selection

Figure 6.28: Setting Kinesis as the source

Figure 6.29: File format selection

Figure 6.30: Sample schema

Chapter 7

Figure 7.1: Adding a table

Figure 7.2: Table schema

Figure 7.3: Object summary

Figure 7.4: Crawler Name field

Figure 7.5: Crawler creation, source, and stores options

Figure 7.6: Crawler creation, folder, and path field

Figure 7.7: Crawler creation, prefix field

Figure 7.8: Crawler list

Figure 7.9: Crawler list, run information

Figure 7.10: Generated tables

Figure 7.11: Generated schema

Figure 7.12: Crawler created with the CLI

Figure 7.13: Crawler‐generated schema, single array field

Figure 7.14: Adding a classifier

Figure 7.15: Add the classifier to the crawler

Figure 7.16: Newly generated schema with classifier

Figure 7.17: Query editor

Figure 7.18: Table options in Athena

Figure 7.19: Copy and Download Results buttons

Figure 7.20: Query Stats graph

Figure 7.21: Saved queries

Figure 7.22: Result of query

Figure 7.23: Save button drop‐down options

Figure 7.24: Save Query dialog box

Figure 7.25: Query editor

Figure 7.26: Parameterized query

Figure 7.27: Connection Details pane

Figure 7.28: Connection error

Figure 7.29: Databases and tables in Athena

Figure 7.30: Create Workgroup

Figure 7.31: Workgroup details

Figure 7.32: Setting query limits for the workgroup

Figure 7.33: Lake Formation menu

Figure 7.34: Registering location in Lake Formation

Figure 7.35: List of registered locations

Figure 7.36: Create Database form

Figure 7.37: List of databases

Figure 7.38: Add LF‐Tag button

Figure 7.39: LF‐Tag creation form

Figure 7.40: Adding key and values

Figure 7.41: Empty tag list

Figure 7.42: Edit LF‐Tag form

Figure 7.43: LF‐Tag validation

Figure 7.44: Grant data permissions form

Figure 7.45: LF‐Tag‐based permission

Figure 7.46: Database Permissions

Figure 7.47: Data filter creation form

Figure 7.48: LF‐Tag available in form

Chapter 8

Figure 8.1: User invitation

Figure 8.2: Create New Group

Figure 8.3: Cost analysis in QuickSight

Figure 8.4: SPICE usage graph

Figure 8.5: QuickSight access to AWS services

Figure 8.6: Public access to dashboards

Figure 8.7: QuickSight's navigation menu

Figure 8.8: QuickSight available data sources

Figure 8.9: New Athena Data Source

Figure 8.10: New data source available

Figure 8.11: Choose Your Table

Figure 8.12: Enter Custom SQL Query

Figure 8.13: Apply query to data source

Figure 8.14: Duplicate Dataset

Figure 8.15: Available resource categories in the UI

Figure 8.16: Refresh Now button

Figure 8.17: Refresh schedule and history of a dataset

Figure 8.18: Common SQL error message

Figure 8.19: Available services in QuickSight

Figure 8.20: Editor view

Figure 8.21: Dataset options

Figure 8.22: Field options

Figure 8.23: Inspecting a script function

Figure 8.24: Placing a function in a script by selecting it from the list

Figure 8.25: bodyLength field now available

Figure 8.26: Data type icon changed

Figure 8.27: Add data to current dataset

Figure 8.28: Newly added dataset

Figure 8.29: Relationship UI

Figure 8.30: Field search

Figure 8.31: Specifying an INNER join

Figure 8.32: Recommended join

Figure 8.33: Single table joining to others

Figure 8.34: Complex relationship diagram

Figure 8.35: Excluded fields at the bottom of the list

Figure 8.36: Filter view

Figure 8.37: Available filter conditions

Figure 8.38: Add Field To Hierarchy

Figure 8.39: Adding to an existing hierarchy

Figure 8.40: Newly created visual default view

Figure 8.41: Add options

Figure 8.42: Autogenerated graph

Figure 8.43: Fields information bar

Figure 8.44: Field wells

Figure 8.45: The various graph type icons

Figure 8.46: Aggregation options

Figure 8.47: Example dashboard with one graph

Figure 8.48: Filtering values

Figure 8.49: Null Options

Figure 8.50: Group/Color field well

Figure 8.51: Drilling down

Figure 8.52: Navigation between levels of drill‐down

Figure 8.53: Create New Parameter

Figure 8.54: Add Control

Figure 8.55: Using a parameter in a filter

Figure 8.56: Application of parameters affecting graphs

Figure 8.57: Gauge control

Figure 8.58: Edit Action

Figure 8.59: Action available in menu

Figure 8.60: Before action trigger

Figure 8.61: After action is triggered

Figure 8.62: New Action

Figure 8.63: List of actions in context menu

Figure 8.64: Specifying a destination URL for the action

Figure 8.65: Suggested Insights

Figure 8.66: Example autonarratives

Figure 8.67: Narrative context menu

Figure 8.68: Edit Narrative

Figure 8.69: Dot indicating that ML‐Insight is available

Figure 8.70: Visual menu including

Figure 8.71: Forecast added to timeline

Figure 8.72: Integration with SageMaker

Figure 8.73: Example dashboard

Figure 8.74: Publishing options

Chapter 9

Figure 9.1: Domain creation form

Figure 9.2: IAM role for SageMaker execution

Figure 9.3: Launch Studio

Figure 9.4: View user

Figure 9.5: SageMaker Studio

Figure 9.6: Example prediction

Figure 9.7: Models list

Figure 9.8: Endpoints interface

Figure 9.9: Endpoints in SageMaker interface

Figure 9.10: Create Batch Transform Job

Figure 9.11: Input and output data configuration

Appendix

Figure A.1: Modern Data Lake Architecture

Figure A.2: Batch processing architecture

Figure A.3: Stream processing architecture

Guide

Cover

Table of Contents

Title Page

Copyright

About the Author

About the Technical Editor

Acknowledgments

Introduction

Begin Reading

Appendix: Example Data Architectures in AWS

Index

End User License Agreement


Data Analytics in the AWS Cloud

Building a Data Platform for BI and Predictive Analytics on AWS

 

Joe Minichino

 

 

 

 

 

Introduction

Welcome to your journey to AWS‐powered cloud‐based analytics!

If you need to build data lakes, import pipelines, or perform large‐scale analytics and then display them with state‐of‐the‐art visualization tools, all through the AWS ecosystem, then you are in the right place.

I will spare you an introduction on how we live in a connected world where businesses thrive on data-driven decisions based on powerful analytics. Instead, I will open by saying that this book is for people who need to build a data platform to turn their organization into a data-driven one, or who need to improve their existing real-world architectures. This book may help you gain the knowledge to pass an AWS certification exam, but that is most definitely not its only aim.

I will be covering a number of tools provided by AWS for building a data lake and analytics pipeline, but I will cover these tools insofar as they are applicable to data lakes and analytics, and I will deliberately omit features that are not relevant or particularly important. This is not a comprehensive guide to such tools—it's a guide to the features of those tools that are relevant to our topic.

It is my personal opinion that analytics, be they in the form of looking back at the past (business intelligence [BI]) or trying to predict the future (data science and predictive analytics), are the key to success.

You may think marketing is a key to success. It is, but only when your analytics direct your marketing efforts in the right direction, to the right customers, with the right approach for those customers.

You may think pricing, product features, and customer support are keys. They are, but only when your analytics reveal the correct prices and the right features to strengthen customer retention and success, and your support team possesses the necessary skills to adequately satisfy your customers' requests and complaints.

That is why you need analytics.

Even in the extremely unlikely case that your data all resides in one data store, you are probably keeping it in a relational database that's there to back your customer‐facing applications. Traditional RDBs are not made for large‐scale1 storage and analysis, and I have seen very few cases of storing the entire history of records of an RDB in the RDB itself.

So you need a massively scalable storage solution with a query engine that can deal with different data sources and formats, and you probably need a lot of preparation and clean‐up before your data can be used for large‐scale analysis.

You need a data lake.

What Is a Data Lake?

A data lake is a centralized repository of structured, semi‐structured, and unstructured data, upon which you can run insightful analytics. This is my ultra‐short version of the definition.

While in the past we referred to a data lake strictly as the facility where all of our data was stored, nowadays the definition has extended to include all of the possible data stores that can be linked to the centralized data storage, in a kind of hybrid data lake that comprises flat‐file storage, data warehouses, and operational data stores.

When You Do Not Need a Data Lake

If all your data resides in a single data store, if you're not interested in analyzing it, or if the size and velocity of your data are such that you can afford to record the entire history of all your records in that same data store and perform your analysis there without impacting customer-facing services, then you do not need a data lake. I'll confess I have never come across such a scenario. So, unless you are running some kind of micro and very particular business that does not benefit from analysis, you will most likely want a data lake in place and an analytics pipeline powering your decisions.

When Do You Need Analytics?

Really, always.

When Do You Need a Data Lake for Analytics?

Almost always, and they are generally cheap solutions to maintain. In this book we will explore ways to store and analyze vast quantities of data for very little money.

How About an Analytics Team?

One of the most common mistakes companies make is to put analysts to work before they have data engineers in place. If you do that, you will only cause the following effects, in this order:

Your analysts will waste their time either working around engineering problems or, worse, trying their hand at data engineering themselves.

Your analysts will get frustrated, as most of their time will be spent procuring, transforming, and cleaning the data instead of analyzing it.

Your analysts will produce analyses, but they are not likely to set up automation for the data engineering side of the work, meaning they will spend hours rerunning data acquisition, filtering, cleaning, and transforming rather than analyzing.

Your analysts will leave for a company that has an analytics team in place that includes both data analysts and data engineers.

So just skip that part and do things the right way. Get a vision for your analytics, put data engineers in place, and only then put analysts to work, so they can dedicate 100 percent of their time to analyzing data and nothing else. We will explore designing and setting up a data analytics team in Chapter 2, “The Path to Analytics: Setting Up a Data and Analytics Team.”

The Data Platform

In this book, I will guide you through the extensive but extremely interesting and rewarding journey of creating a data platform that will allow you to produce analytics of all kinds: looking at the past and visualizing it through business intelligence (BI) tools, and predicting the future with intelligent forecasting and machine learning models that produce metrics and the likelihood of events occurring.

We will do so in a scalable, extensible way that grants your organization the agility needed for fast turnaround on analytics requests and for dealing with change in real time, by building a platform centered on the best technologies for the task at hand, with the right resources in place to accomplish those tasks.

The End of the Beginning

I hope you enjoy this book, which is the fruit of my many years of experience collected in the “battlefield” of work. Hopefully you will gain knowledge and insights that will help you in your job and personal projects, and you may reduce or altogether skip some of the common issues and problems I have encountered throughout the years.

Note

1. Everything is relative, but generally speaking, if you tried to store all the versions of all the records in a large RDBMS you would put the database itself under unnecessary pressure, and you would be doing so at the higher cost of the I/O optimized storage that databases use in AWS (read about I/O provisioning), rather than utilizing a cheap storage facility that scales to virtually infinite size, like S3.

CHAPTER 1
AWS Data Lakes and Analytics Technology Overview

In the introduction I explained why you need analytics. Really powerful analytics require large amounts of data. “Large” here is relative to the context of your business or task, but the bottom line is that you should produce analytics based on a comprehensive dataset rather than a small (and inaccurate) sample of the entire body of data you possess.

Why AWS?

But first let's address our choice of cloud computing provider. As of this writing (early 2022) there are a number of cloud computing providers, with three competitors leading the race: Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. I recommend AWS as your provider of choice, and I'll tell you why.

The answer for me lies in the fact that analytics is a vast realm of computing spanning numerous technologies and domains: business analysis, data engineering, data analytics, data science, data storage (including transactional databases, data lakes, and warehouses), data mining/crawling, data cataloging, data governance and strategy, security, visualization, business intelligence, and reporting.

Although AWS may not always win on the cost of running services and has some ground to cover to catch up to its competitors in terms of user interface/user experience (UI/UX), it remains the only cloud provider that has a solid and stable solution for each area of the business, all seamlessly integrated through the AWS ecosystem.

It is true that other cloud providers are ideal for some use cases and that leveraging their strength in certain areas (for example, GCP tends to be very developer‐friendly) can make for easy and cost‐effective solutions. However, when it comes to running an entire business on it, AWS is the clear winner.

Also, AWS encourages businesses to use their resources in an optimal fashion by providing a free tier of operation, which means that for each tool you use there will be a certain amount of usage below a specified threshold provided for free. Free‐tier examples are 1 million AWS Lambda invocations per month, or 750 hours of small Relational Database Service (RDS) databases.

As far as this book's use case, which is setting up and delivering large‐scale analytics, AWS is clearly the leader in the field at this time.

What Does a Data Lake Look Like in AWS?

For the most part, you will be dealing with Amazon Simple Storage Service (S3), with which you should be familiar, but if you aren't, fear not, because we've got you covered in the next chapters.

S3 is the storage facility of choice for the following reasons:

It can hold a virtually infinite amount of data.

It is inexpensive, and you can adopt storage solutions that make it up to 50 times cheaper.

It is seamlessly integrated with all data and analytics‐related tools in AWS, from tools like Kinesis that store data in S3 to tools like Athena that query the data in it.

Data can be protected through access permissions, it can be encrypted in a variety of ways, or it can be made publicly accessible.

There are other solutions for storage in AWS, but aside from one that has some use cases (the EMR File System, or EMRFS), you should rely on S3. Note that EMRFS is actually based on S3, too. Other storage solutions like Amazon Elastic Block Store (EBS) are not ideal for data lake and analytics purposes, and since I discourage their use in this context, I will not cover them in the book.
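To make the idea of S3 as data lake storage concrete, here is a minimal sketch, using Python and boto3, that creates a bucket and deposits a raw file under a date-partitioned prefix. The bucket name, region, prefix layout, and file are hypothetical placeholders rather than AWS requirements, and the sketch assumes your AWS credentials are already configured.

import boto3

# Hypothetical names -- adjust to your own account and naming conventions.
BUCKET = "my-company-data-lake-raw"
REGION = "eu-west-1"

s3 = boto3.client("s3", region_name=REGION)

# Outside us-east-1, S3 requires an explicit location constraint.
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},
)

# Deposit a raw export under a date-partitioned prefix so that downstream
# tools (Glue crawlers, Athena) can later discover the partitions.
with open("orders.json", "rb") as f:
    s3.put_object(
        Bucket=BUCKET,
        Key="sales/ingest_date=2023-01-31/orders.json",
        Body=f,
    )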

Analytics on AWS

If you log into the AWS console, you will see the following products listed under the Analytics heading:

Athena

EMR

CloudSearch

Kinesis

QuickSight

Data Pipeline

AWS Data Exchange

AWS Glue

AWS Lake Formation

MSK

The main actors in the realm of analytics in the context of big data and data lakes are undoubtedly S3, Athena, and Kinesis.

EMR is useful for data preparation/transformation, and the output is generally data that is made available to Athena and QuickSight.

Other tools, like AWS Glue and Lake Formation, are no less important (Glue in particular is vital to the creation and maintenance of an analytics pipeline), but they do not directly generate or perform analytics. MSK is AWS's fully managed version of Kafka, and we will take a quick look at it, but we will generally favor Kinesis (as it performs a similar role in the stack).

Opting for MSK or plain Kafka comes down to cost and performance choices.

CloudSearch is a search engine for websites, and therefore is of limited interest to us in this context.

In addition, SageMaker can be a nice addition if you want to power your analytics with predictive models or any other machine learning/artificial intelligence (ML/AI) task.
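As a preview of how these services fit together, here is a minimal sketch, using boto3, of running an Athena query over data cataloged in the data lake and printing the result rows. The database name, table, and results bucket are hypothetical placeholders; Athena itself is covered in detail in Chapter 7.

import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Database, table, and output location are hypothetical placeholders.
response = athena.start_query_execution(
    QueryString="SELECT country, COUNT(*) AS orders FROM sales GROUP BY country",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-company-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Athena queries run asynchronously: poll until a terminal state is reached.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])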

Skills Required to Build and Maintain an AWS Analytics Pipeline

First of all, you need familiarity with AWS tools. You will gain that familiarity through this book. For anything that goes beyond the creation of resources through the AWS console, you will need general AWS SysOps skills. Other skills you'll need include the following:

Knowledge of AWS Identity and Access Management (IAM) is necessary to understand the permissions requirements for each task (a minimal policy sketch follows this list).

DevOps skills are required if you want to automate the creation and destruction of resources using CloudFormation or Terraform (or any other infrastructure‐as‐code tool).

SQL skills are needed to write Athena queries, and basic database administrator (DBA) skills to understand Athena data types and schemas.

Data analysis and data science skills are required for SageMaker models.

A basic business understanding of charts and graphs is required to create QuickSight visualizations.
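Because IAM permissions come up in nearly every task in this book, here is a minimal, illustrative sketch, using boto3, of creating a read-only policy for an analyst who runs Athena queries against a single data lake bucket. The policy name, bucket, and actions shown are a hypothetical starting point, not a complete or recommended policy; in practice Athena also needs Glue Data Catalog permissions and write access to its query results location.

import json
import boto3

iam = boto3.client("iam")

# Deliberately narrow, illustrative policy; resource names are placeholders.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution",
                "athena:GetQueryExecution",
                "athena:GetQueryResults",
            ],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-company-data-lake-raw",
                "arn:aws:s3:::my-company-data-lake-raw/*",
            ],
        },
    ],
}

iam.create_policy(
    PolicyName="analytics-read-only",
    PolicyDocument=json.dumps(policy_document),
)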

CHAPTER 2
The Path to Analytics: Setting Up a Data and Analytics Team

Creating analytics, especially in a large organization, can be a monumental effort, and a business needs to be prepared to invest time and resources, which will repay the company manifold by enabling data-driven decisions. The people who will make this shift toward data-driven decision making are your Data and Analytics team, sometimes referred to as the Data Analytics team or even simply as the Data team (although this last name tends to confuse people, as it may seem related to database administration). This book will refer to the Data and Analytics team as the DA team.

Although the focus of this book is architectural patterns and designs that will help you turn your organization into a data‐driven one, a high‐level overview of the skills and people you will need to make this happen is necessary.

 A funny anecdote: at Teamwork, our DA team goes by the funny-sounding name DANDA. We create resources on AWS with the identifier D&A, but because AWS has a habit of converting some characters into full text, the & became AND. Needless to say, it stuck, and since then we have been known as DANDA.

The Data Vision

The first step in delivering analytics is to create a data vision, a statement for your business as a whole. This can be a simple quote that works as a compass for all the projects your DA team will work on.

A vision does not have to be immutable. However, you should change it only if it applies solely to certain conditions or periods of time, and those conditions have been satisfied or that time has passed.

A vision is the North Star of your data journey. It should always be a factor when you're making decisions about what kind of work to carry out or how to prioritize a current backlog. An example of a data vision is “to create a unified analytics facility that enables business management to slice and dice data at will.”

Support

It's important to create the vision, and it's also vital for the vision to have the support of all the involved stakeholders. Management will be responsible for allocating resources to the DA team, so these managers need to be behind the vision and the team's ability to carry it out. You should have a vision statement ready and submit it to management, or have management create it in the first place.

I won't linger any further on this topic because this book is of a more technical nature than a business one, but be sure not to skip this vital step.

REDUCTIO AD ABSURDUM: HOW NOT TO GO ABOUT CREATING ANALYTICS

Before diving into the steps for creating analytics, allow me to give you some friendly advice on how you should not go about it. I will do so by recounting a fictional yet all too common story of failure by businesses and companies.

Data Undriven Inc. is a successful company with hundreds of employees, but it's in dire need of analytics to reverse some worrying revenue trends. The leadership team recognizes the need for a far more accurate kind of analytics than what they currently have available, since it appears the company is unable to pinpoint exactly what side of the business is hemorrhaging money. Gemma, a member of the leadership team, decides to start a project to create analytics for the company, which will find its ultimate manifestation in a dashboard illustrating all sorts of useful metrics. Gemma thinks Bob is a great Python/SQL data analyst and tasks Bob with the creation of reports. The ideas are good, but data for these reports resides in various data sources. This data is unsuitable for analysis because it is sparse and inaccurate, some integrity is broken, there are holes due to temporary system failures, and the DBA team has been hit with large and unsustainable queries run against their live transactional databases, which are meant to serve data to customers, not to be reported on.

Bob collects the data from all the sources and, after weeks of wrangling, cleaning, filtering, and general massaging of the data, delivers analytics to Gemma in the form of a spreadsheet with graphs in it.

Gemma is happy with the result, although she notices some discrepancies from the expected figures. She asks Bob to automate this analysis into a dashboard that managers can consult and that will contain up-to-date information.

Bob is in a state of panic, looking up how to automate his analytics scripts, while also trying to understand why his numbers do not match Gemma's expectations—not to mention the fact that his Python program takes between 3 and 4 hours to run every time, so the development cycle is horrendously slow.

The following weeks are a harrowing story of misunderstandings, failed attempts at automation, frustration, and degraded database performance, with the ultimate result that Gemma has no analytics and Bob has quit his job to join a DA team elsewhere.

What is the moral of the story? Do not put any analyst to work before you have a data engineer in place. This cannot be stated strongly enough. Resist the temptation to want analytics now. Go about it the right way. Set up a DA team, even if it's small and you suffer from resource constraints in the beginning, and let analysts come into the picture when the data is ready for analytics and not before. Let's see what kind of skills and roles you should rely on to create a successful DA team and achieve analytics even at scale.

DA Team Roles

There are two groups of roles for a DA team: early stage and mature stage. The definitions for these are not strict and vary from business to business. Make sure core roles are covered before advancing to more niche and specialized ones.

Early Stage Roles

By “early stage roles” we refer to a set of roles that will constitute the nucleus of your nascent DA team and that will help the team grow. At the very beginning, it is to be expected that the people involved will have to exercise some flexibility and open‐mindedness in terms of the scope and authority of their roles, because the priority is to build the foundation for a data platform. So a team lead will most likely be hands‐on, actively contributing to engineering, and the same can be said of the data architect, whereas data engineers will have to perform a lot of work in the realms of data platform engineering to enable the construction and monitoring of pipelines.

Team Lead

Your DA team should have, at least at the beginning, strong leadership in the form of a team lead. This is a person who is clearly technically proficient in the realm of analytics and is able to create tasks and delegate them to the right people, oversee the technical work that's being carried out, and act as a liaison between management and the DA team.

Analytics is a vast domain that has more business implications than other strictly technical areas (like feature development, for example), and yet the technical aspects can be incredibly challenging, normally requiring engineers with years of experience to carry out the work. For this reason, it is good to have a person spearheading the work in terms of workflow and methodology to avoid early‐stage fragmentation, discrepancies, and general disruption of the work due to lack of cohesion within the team. The team can potentially evolve into something more of a flat‐hierarchy unit later on, when every member is working with similar methods and practices that can be—at that later point—questioned and changed.

Data Architect

A data architect is a fundamental figure for a DA team and one the team cannot do without. Even if you don't officially recognize someone as the team's architect, it is advisable to appoint the most experienced and most architecturally minded engineer as supervisor of all the architectures designed and implemented by the DA team. Ideally the architect is a full-time role, not only designing pipeline architectures but also completing work on the technology adoption front, a task that is both hefty and delicate.

Deciding whether you should adopt a serverless architecture over an Airflow‐ or Hadoop‐based one is something that requires careful attention. Elements such as in‐house skills and maintenance costs are also involved in the decision‐making process.

The business can—especially under resource constraints—decide to combine the architect and team lead roles. I suggest making the data architect/team lead a full‐time role before the analytics demand volume in the company becomes too large to be handled by a single team lead or data architect.

Data Engineer

Every DA team should have a data engineering (DE) subteam, which is the beating heart of data analytics. Data engineers are responsible for implementing systems that move, transform, and catalog data in order to render the data suitable for analytics.

In the context of analytics powered by AWS, data engineers nowadays are necessarily multifaceted engineers with skills spanning various areas of technology. They are cloud computing engineers, DevOps engineers, and database/data lake/data warehouse experts, and they are knowledgeable in continuous integration/continuous deployment (CI/CD).

You will find that most DEs have particular strengths and interests, so it would be wise to create a team of DEs with some diversity of skills. Cross‐functionality can be built over time; it's much more important to start with people who, on top of the classic extract, transform, load (ETL) work, can also complete infrastructure work, CI/CD pipelines, and general DevOps.

At its core, the data engineer's job is to perform ETL operations. These can vary in nature, dealing with different sources of data and targeting various data stores, and they can perform some kind of transformation, like flattening/unnesting, filtering, and computing values. Ultimately, the broad description of the work is to extract (data from a source), transform (the data that was extracted), and load (the transformed data into a target store).

You can view all the rest of the tasks as ancillary tasks to this fundamental operation.
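To illustrate the extract-transform-load pattern in its simplest possible form, here is a deliberately tool-agnostic sketch in Python using pandas; the file paths and column names are hypothetical, and later chapters implement the same pattern with AWS Glue and Spark rather than a local script.

import pandas as pd

# Extract: read a raw export (path and schema are hypothetical).
orders = pd.read_csv("raw/orders.csv", parse_dates=["created_at"])

# Transform: filter out test records and compute a derived column.
orders = orders[~orders["is_test"]]
orders["order_value"] = orders["quantity"] * orders["unit_price"]

# Load: write the cleaned dataset in a columnar format suited to analytics
# (to_parquet requires the pyarrow or fastparquet package).
orders.to_parquet("curated/orders.parquet", index=False)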

Data Analyst

Another classic subteam of a DA team is the Data Analysts team. The team consists of a number of data analysts who are responsible for the exploratory and investigative work that identifies trends and patterns through the use of statistical models and provides management with metrics and numbers that help decision making. At the early stages of a DA team, data analysts may also cover the role of business intelligence developers, responsible for visualizing data in the form of reports and dashboards, using descriptive analytics to give an easy‐to‐understand view of what happened in the business in the past.

Maturity Stage Roles

When the team's workflow is established, it is a good idea to better define the scope of each role and include figures responsible for specialist areas of expertise, such as data science or cloud and data platform engineering, and let every member of the team focus on the areas they are best suited for.

Data Scientist

A data scientist (DS) is the ultimate data “nerd” and responsible for work in the realm of predictive and prescriptive analytics. A DS usually analyzes a dataset and, through the use of machine-learning (ML) techniques, is able to produce various predictive models, such as regression models that produce the likelihood of a certain outcome given certain conditions (for example, the likelihood that a prospective customer will convert from a trial user to a paying user). The DS may also produce forecasting models that use modern algorithms to predict the trend of a certain metric (such as the revenue of the business), or even simply group records in clusters based on some of the records' features.

A data scientist's work is to investigate and resolve complex challenges that often involve a number of unknowns, and to identify patterns and trends not immediately evident to the human eye (or mind). An ideally structured centralized DA team will have a Data Science subteam at some point. The common ratio found in the industry is to have one DS for every four data analysts, but this is by no means a hard‐and‐fast rule. If the business is heavily involved in statistical models, or it leverages machine‐learning predictions as a main feature of its product(s), then it may have more data scientists than data analysts.

Cloud Engineer

If your team has such a large volume of work that a single dedicated engineer responsible for maintaining infrastructure is justified, then having a cloud engineer is a good idea. I strongly encourage DEs to get familiar with infrastructure and “own” the resources that their code leverages/creates/consumes. So a cloud engineer would be a subject matter expert who is responsible for the domain and who oversees the cloud engineering work that DEs are already performing as part of their tasks, as well as completing work of their own. These kinds of engineers, in an AWS context, will be taking care of aspects such as the following:

Networking (VPCs, VPN access, subnets, and so on)

Security (encryption, parameter stores and secrets vault, security groups for applications, as well as role/user permission management with IAM)

Tools like CloudFormation (or similar ones such as Terraform) for writing and maintaining infrastructure

Business Intelligence (BI) Developer

Once your DA team is mature enough, you will probably want to restrict the scope of the data analysts' work to exploration and investigation and leave the visualization and reporting to developers who are specialized in the use of business intelligence (BI) tools (such as Amazon QuickSight, Power BI, or Tableau) and who can more easily and quickly report their findings to stakeholders.

Machine Learning Engineer

A machine learning engineer (MLE) is a close relative of the DE, specialized in ML‐focused operations, such as the setup and maintenance of ML‐oriented pipelines, including their development and deployment, and the creation and maintenance of specialized data stores (such as feature stores) exclusively aimed at the production of ML models. Since the tools used in ML engineering differ from classic DE tools and are more niche, they require a high level of understanding of ML processes. A person working as an MLE is normally a DE with an interest in data science, or a data scientist who can double as a DE and who has found their ideal place as an MLE.

The practice of automating the training and deployment of ML models is called MLOps, or machine learning operations.

Business Analyst

A business analyst (BA) is the ideal point of contact between a technical team and the business/management. The main task of a BA is to gather requirements from the business and turn these requirements into tasks that the technical personnel can execute. I consider a BA a maturity stage role, because in the beginning this is work that the DA team lead should be able to complete, albeit at not as high a standard as a BA proper.

Niche Roles

Other roles that you might consider including in your DA team, depending on the nature of the business and the size/resources of the team itself, are as follows:

AI Developer

  All too often anything ML related is also referred to as artificial intelligence (AI). Although there are various schools of thought and endless debates on the subject, I agree with Microsoft in summarizing the matter like so: machine learning is how a system develops intelligence, whereas AI is the intelligence itself that allows a computer to perform a task on its own and makes independent decisions. In this respect ML is a subset of AI and a gear in a larger intelligent machine. If your business has a need for someone who is responsible for developing algorithms aimed at resolving an analytics problem, then an AI developer is what you need.

TechOps / DevOps Engineer

  If your team is sizable, and the workload on the CI/CD and cloud infrastructure side is too much for DEs to tackle on top of their main function (creating pipelines), then you might want to have dedicated TechOps/DevOps personnel for the DA team.

MLOps Engineer

  This is a subset role of the greater DevOps specialty, a DevOps engineer who specializes in CI/CD and infrastructure dedicated to putting ML models into production.

Analytics Flow at a Process Level

There are many ways to design the process to request and complete analytics in a business. However, I've found the following to be generally applicable to most businesses:

1. A stakeholder formulates a request, a business question that needs answering.

2. The BA (or team lead at early stages) translates this into a technical task for a data analyst.

3. The data analyst conducts some investigation and exploration, leading to a conclusion. The data analyst identifies the portion of their work that can be automated to produce up-to-date insights and designs a spec (if a BI developer is available, they will do this last part).

4. A DE picks up the spec, then designs and implements an ETL job/pipeline that will produce a dataset and store it in the suitable target database.

5. The BI developer utilizes the data made available by the DE at step 4 and visualizes it or creates reports from it.

6. The BA reviews the outcome with the stakeholder for final approval and sign-off.

Workflow Methodology

There are many available software development methodologies for managing the team's workload and achieving a satisfactory level of productivity and velocity. The methodology adopted by your team will greatly depend on the skills you have on your team and even the personalities of the various team members. However, I've found a number of common traits throughout the years:

Cloud engineering tends to be mostly planned work, such as enabling the team to create resources, setting up monitoring and alerting, creating CI/CD pipelines, and so on.

Data analytics tends to be mostly reactive work, whereby a stakeholder asks for a certain piece of work and analysts pick it up.

Data engineering is a mixed bag: on one hand, it is reactive insofar as it supports the work cascading from analysts and is destined to be used by BI developers; on the other hand, some tasks, such as developing utilities and tooling to help the team scale operations, are planned and would normally be associated with a traditional delivery deadline.

Data architects tend to have more planned work than reactive, but at the beginning of a DA team's life there may be a lot of real‐time prioritization to be done.

So given these conditions, what software development methodology should you choose? Realistically it would be one of the many Agile methodologies available, but which one?

A good rule of thumb is as follows: if it's planned work, use Scrum; if it's reactive work, use Kanban. If in doubt, or you want to use one method for everyone, use Kanban.

Let me explain the reason for this guideline. Scrum's central concept for time estimation is user stories that can be scored. This is a very useful idea that enables teams to plan their sprints with just the right amount of work to be completed within that time frame. Planned work normally starts with specifications, and leadership/management will have an expectation for its completion. Therefore, planning the work ahead and dividing it into small stories that can be estimated will also produce a final time estimate that can serve as the deadline.

In my opinion, Scrum is better suited to this kind of work, as I find it a natural fit for feature-oriented development (as in most product teams).

Kanban, on the other hand, is an extremely versatile methodology meant to combine agility and flexibility with team velocity and productivity. When a team is constantly dealing with a flow of requests, how do you go about completing them? The key is in real‐time prioritization, which in turn depends on breaking down tasks to the smallest possible unit.

Limits and constraints that I've found useful are as follows:

No task should ever exceed 3 days of work, with 1 being ideal.

There should never be more than one task per member in the In Progress column of your Kanban board.

There should never be more than one task per member in the Review/Demo column of your board.

Encourage cooperation by setting a “work in progress” limit that is less than twice the number of team members, so at least one task must have more than one person assigned to it. For example, if you only want this constraint to be applied to one task, you could set the WIP limit at

Also, I strongly encourage code-based work to require the approval of at least one other team member before any code contribution is merged into the codebase. This is true for DEs and data analysts alike.

Applying these constraints, you will immediately notice that if an urgent task lands in the team's backlog (the “drop what you're doing” kind of task), you should always be at most three days away from being able to assign the task and have it completed.

And aside from those business‐critical anomalies that require immediate attention (which, by the way, should never be the case in a DA team since they are rarely a customer‐facing team), real‐time prioritization and management of the backlog is relatively easy, especially in the realms of data analytics and BI, where demands for investigations and reports are an ever‐flowing stream.

In conclusion, Kanban is a versatile methodology, suitable for real‐time prioritization that can be applied to the whole team. If you have subteams only completing planned work, they could be more optimally managed with Scrum.

The DA Team Mantra: “Automate Everything”

If there is one thing I wish readers would learn from my experience, it's the vital importance of automation. If you are dealing with terabytes of data across several data sources, vast data lakes and data warehouses, countless ETL pipelines, dashboards, and tables to catalog in metadata stores, you cannot expect to maintain the operation manually. Neither should you aspire to. On the contrary, you should strive to achieve complete automation where the data lake practically maintains itself.

Here is a list of aspects of the work that are better managed through automation:

Infrastructure Creation, Update, and Destruction

  There are many tools to accomplish this. The main infrastructure-as-code solutions available are CloudFormation, Terraform, and the AWS CDK (the CDK generates CloudFormation under the hood but is easier to write and maintain).

Data Cataloging

  As data flows into your data lake, new partitions and new tables are better discovered automatically. The umbrella tool AWS Glue covers this part of your automation by scanning newly deposited data with so-called crawlers (a minimal crawler-creation sketch follows this list).

Pipeline Execution

  AWS EventBridge allows pipelines to execute on particular triggers; these may be simple schedules or more complex events, such as the creation of a new object in storage.

Visualizations/Dashboard Update

  AWS QuickSight bases its dashboards on datasets that can be set to have a refresh rate, so reports are always up to date.

Test and Deployment

  You should treat data engineering and analytics the same way you would a product, by setting up a CI/CD pipeline that tests code and deploys it upon a successful code review following a pull request. The de facto standard for version control of the code is Git, although other solutions are available.

Monitoring and Alerting

  Whatever your delivery system of choice is (a message in a chat application, an email, an SMS), be sure to automate monitoring and alerting so that you are immediately notified when something has gone wrong. Especially in data engineering, missing a day's worth of data can result in problems and a lot of hassle to backfill the information.
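As an example of the cataloging and scheduling points above, here is a minimal sketch, using boto3, that creates a Glue crawler over a data lake prefix and schedules it to run every morning. The crawler name, IAM role, database, S3 path, and schedule are hypothetical placeholders; crawlers themselves are covered in depth in Chapter 7.

import boto3

glue = boto3.client("glue", region_name="eu-west-1")

# All names, the role ARN, and the S3 path are hypothetical placeholders.
glue.create_crawler(
    Name="sales-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://my-company-data-lake-raw/sales/"}]},
    # Run daily at 06:00 UTC so new partitions are cataloged automatically.
    Schedule="cron(0 6 * * ? *)",
)

# A crawler can also be started on demand, for example at the end of an
# ingestion job, instead of waiting for the schedule.
glue.start_crawler(Name="sales-raw-crawler")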

Analytics Models in the Wild: Centralized, Distributed, Center of Excellence

Finally, let's take a look at how the DA team may be placed within the organization and how it could interact with the other functions.

There are plenty of models available, but there are three models that are in a way the basic version of every other variation available: centralized, distributed, and center of excellence, or CoE (which is ideal for a hybrid structure).

Centralized

A centralized DA team is a place where all the analytics needs of an organization are satisfied. It not only means that every single piece of data engineering or analytics will be performed by the DA team, but it also means no data engineering, data analysis, or data science should happen outside of the DA team.

This may not be suitable for all organizations, but I find that, at least at the beginning of a business's transformation into a data-driven one, a centralized approach brings order and method to the chaos. Rogue initiatives outside of it only create duplication and fragmentation of work methodology, practices, and tools, and may even produce results that conflict with similar work conducted within the DA team, which can result in poor buy-in from the business and either slow down the production of analytics or call its accuracy into question. If you do not have analytics in your company, start with a centralized team.

If you do have analysts in your company because you made the very common mistake of putting analysts to work before data engineering was in place, bring your analysts into the DA team and transform what may be a community of practice into a structured team.

An early‐stages DA team works mainly in three areas: architecture, engineering, and analysis. Data science may come soon after but not right away. For this reason, I believe an early‐stages DA team and indeed a centralized DA team may have the structure shown in Figure 2.1.

It is important to note that, as specified earlier, the architect role can be covered by a team lead, but it is not the same thing. A competent person who can design resilient, maintainable, and extensible architectures is needed to review the work done by all the teams, but especially the data engineering team.

Later in the data journey, you may drift more toward a hub‐and‐spoke model. If so, your centralized team may in time become the core team of the center of excellence, which we will explore soon.

The main disadvantage of centralized teams in the long term is that they may produce slower lead times from request to analytics, as the analytics requests coming from the business will have to join a prioritized queue and there are no resources dedicated to each function.

Figure 2.1: An example structure of an early‐stages DA team

A main advantage of a centralized team is that it inherently encourages cross‐functionality among the members of each subteam; therefore, if resources are not available or for some reason temporarily constrained, it means work can proceed (albeit at a slower pace) rather than coming to a grinding halt. So a centralized team has a certain degree of resilience.

Distributed

A distributed DA team is especially suitable for organizations whose analytical needs are so large, and whose domain knowledge requirements for the people carrying out engineering and analysis work are so extensive, that it is faster and more productive to split the team up. The main advantage of distributed teams is the quicker turnaround. If Finance is in need of a piece of analytics, they don't need to share resources with Marketing. The Finance DA team will swiftly produce the analytics requested without having to go to a centralized team and share resources with the entire business.

But there are drawbacks. Inevitably, teams will drift apart and develop practices and adopt methodologies that in time are going to diverge, especially given the different domains of work, and transferring resources or regulating analytics at the business level may become very challenging.

Distributed teams may have a structure that internally is similar to the centralized team but on a smaller scale.

Center of Excellence

There is a third model, which combines the benefits of centralized and distributed models: the center of excellence. This model requires a high level of data maturity in the business, because it involves a great deal of agility while remaining regulated, and it addresses domain knowledge, quick iterations, and data governance.

Instead of aggregating all of the DA resources into one team, you form a center of excellence containing the people with high‐value skills and experience. From this center of excellence, you can regulate activity and establish a rhythm to analytics production. You can also review work carried out in the distributed units and establish communities of practice to contain the drift between the various functions.

A center of excellence is at the core of a hub‐and‐spoke model where the central unit (the hub) is responsible for overseeing and regulating activities, as well as performing tasks that are to be considered business‐wide or business‐critical (for example, managing and regulating access to the business's centralized data lake). The spokes are units (teams) embedded within the various functions that can perform work at a higher pace while having their activity reviewed and approved by the people in the center of excellence.

As mentioned, this model suits businesses and organizations that are far down the road of analytics, and it is one model that allows quick iterations on producing insights and analytics while limiting fragmentation and duplication of work.

Summary

In this chapter we discussed the formation of a DA team, which is a vital prerequisite for the successful creation and maintenance of a data platform in any organization. While not all organizations are the same, the general advice is to start with an embryonic unit with strong leadership, and gradually and iteratively add specialist roles to your growing team.

CHAPTER 3
Working on AWS

AWS is an incredibly vast ecosystem of tools and components, and—especially if you are not familiar with it—learning to work with it may seem like a daunting task.

Therefore, it seems only fitting that we should take a look at the basics of how to work in an AWS environment and build your understanding of cloud computing and engineering.

Since you are reading this book to implement analytics on AWS, it would seem logical that you are already using AWS for other parts of your business. Therefore, we will take a quick look at initial steps (sign‐up and user creation), but we will dive into deeper detail on the subsequent steps. We will discuss the following:

Accessing AWS

Managing users

Interacting with AWS through the Web Console

Interacting with the command line

Interacting with AWS CloudShell

Creating virtual private clouds to secure your resources

Managing roles and policies with IAM

Using CloudFormation to manage infrastructure

Accessing AWS

First things first: you need to access AWS. To do this, you will need to create an AWS account, with credentials that will grant you root access. The URL to create an AWS account (and for subsequent sign‐ins) is https://aws.amazon.com.

Once in, you will be prompted with the Console Home screen (shown in Figure 3.1