Take a dive into data lakes

"Data lakes" is the latest buzzword in the world of data storage, management, and analysis. Data Lakes For Dummies decodes and demystifies the concept and helps you get a straightforward answer to the question: "What exactly is a data lake, and do I need one for my business?" Written for an audience of technology decision makers tasked with keeping up with the latest and greatest data options, this book provides the perfect introductory survey of these novel and growing features of the information landscape. It explains how they can help your business, what they can (and can't) achieve, and what you need to do to create the lake that best suits your particular needs. With a minimum of jargon, prolific tech author and business intelligence consultant Alan Simon explains how data lakes differ from other data storage paradigms. Once you've got the background picture, he maps out ways you can add a data lake to your business systems; migrate existing information and switch on the fresh data supply; clean up the product; and open channels to the best intelligence software for interpreting what you've stored.

* Understand and build data lake architecture
* Store, clean, and synchronize new and existing data
* Compare the best data lake vendors
* Structure raw data and produce usable analytics

Whatever your business, data lakes are going to form an ever more prominent part of the information universe that every business should have access to. Dive into this book to start exploring the deep competitive advantage they make possible, and make sure your business isn't left standing on the shore.
Data Lakes For Dummies®
Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com
Copyright © 2021 by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.
For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit https://hub.wiley.com/community/support/dummies.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2021939570
ISBN 978-1-119-78616-0 (pbk); ISBN 978-1-119-78617-7 (ebk); ISBN 978-1-119-78618-4 (ebk)
Cover
Title Page
Copyright
Introduction
About This Book
Foolish Assumptions
Icons Used in This Book
Beyond the Book
Where to Go from Here
Part 1: Getting Started with Data Lakes
Chapter 1: Jumping into the Data Lake
What Is a Data Lake?
The Data Lake Olympics
Data Lakes and Big Data
The Data Lake Water Gets Murky
Chapter 2: Planning Your Day (and the Next Decade) at the Data Lake
Carpe Diem: Seizing the Day with Big Data
Managing Equal Opportunity Data
Building Today’s — and Tomorrow’s — Enterprise Analytical Data Environment
Reducing Existing Stand-Alone Data Marts
Eliminating Future Stand-Alone Data Marts
Establishing a Migration Path for Your Data Warehouses
Aligning Data with Decision Making
Speedboats, Canoes, and Lake Cruises: Traversing the Variable-Speed Data Lake
Managing Overall Analytical Costs
Chapter 3: Break Out the Life Vests: Tackling Data Lake Challenges
That’s Not a Data Lake, This Is a Data Lake!
Exposing Data Lake Myths and Misconceptions
Navigating Your Way through the Storm on the Data Lake
Building the Data Lake of Dreams
Performing Regular Data Lake Tune-ups — Or Else!
Technology Marches Forward
Part 2: Building the Docks, Avoiding the Rocks
Chapter 4: Imprinting Your Data Lake on a Reference Architecture
Playing Follow the Leader
Guiding Principles of a Data Lake Reference Architecture
A Reference Architecture for Your Data Lake Reference Architecture
Incoming! Filling Your Data Lake
Supporting the Fleet Sailing on Your Data Lake
The Old Meets the New at the Data Lake
Bringing Outside Water into Your Data Lake
Playing at the Edge of the Lake
Chapter 5: Anybody Hungry? Ingesting and Storing Raw Data in Your Bronze Zone
Ingesting Data with the Best of Both Worlds
Joining the Data Ingestion Fraternity
Storing Data in Your Bronze Zone
Just Passing Through: The Cross-Zone Express Lane
Taking Inventory at the Data Lake
Bringing Analytics to Your Bronze Zone
Chapter 6: Your Data Lake’s Water Treatment Plant: The Silver Zone
Funneling Data Further into the Data Lake
Bringing Master Data into Your Data Lake
Impacting the Bronze Zone
Getting Clever with Your Storage Options
Working Hand-in-Hand with Your Gold Zone
Chapter 7: Bottling Your Data Lake Water in the Gold Zone
Laser-Focusing on the Purpose of the Gold Zone
Looking Inside the Gold Zone
Deciding What Data to Curate in Your Gold Zone
Seeing What Happens When Your Curated Data Becomes Less Useful
Chapter 8: Playing in the Sandbox
Developing New Analytical Models in Your Sandbox
Comparing Different Data Lake Architectural Options
Experimenting and Playing Around with Data
Chapter 9: Fishing in the Data Lake
Starting with the Latest Guidebook
Taking It Easy at the Data Lake
Staying in Your Lane
Doing a Little Bit of Exploring
Putting on Your Gear and Diving Underwater
Chapter 10: Rowing End-to-End across the Data Lake
Keeping versus Discarding Data Components
Getting Started with Your Data Lake
Shifting Your Focus to Data Ingestion
Finishing Up with the Sandbox
Part 3: Evaporating the Data Lake into the Cloud
Chapter 11: A Cloudy Day at the Data Lake
Rushing to the Cloud
Running through Some Cloud Computing Basics
The Big Guys in the Cloud Computing Game
Chapter 12: Building Data Lakes in Amazon Web Services
The Elite Eight: Identifying the Essential Amazon Services
Looking at the Rest of the Amazon Data Lake Lineup
Building Data Pipelines in AWS
Chapter 13: Building Data Lakes in Microsoft Azure
Setting Up the Big Picture in Azure
The Magnificent Seven, Azure Style
Filling Out the Azure Data Lake Lineup
Assembling the Building Blocks
Part 4: Cleaning Up the Polluted Data Lake
Chapter 14: Figuring Out If You Have a Data Swamp Instead of a Data Lake
Designing Your Report Card and Grading System
Looking at the Raw Data Lockbox
Knowing What to Do When Your Data Lake Is Out of Order
Too Fast, Too Slow, Just Right: Dealing with Data Lake Velocity and Latency
Dividing the Work in Your Component Architecture
Tallying Your Scores and Analyzing the Results
Chapter 15: Defining Your Data Lake Remediation Strategy
Setting Your Key Objectives
Doing Your Gap Analysis
Identifying Resolutions
Establishing Timelines
Defining Your Critical Success Factors
Chapter 16: Refilling Your Data Lake
The Three S’s: Setting the Stage for Success
Refining and Enriching Existing Raw Data
Making Better Use of Existing Refined Data
Building New Pipelines with Newly Ingested Raw Data
Part 5: Making Trips to the Data Lake a Tradition
Chapter 17: Checking Your GPS: The Data Lake Road Map
Getting an Overhead View of the Road to the Data Lake
Assessing Your Current State of Data and Analytics
Putting Together a Lofty Vision
Building Your Data Lake Architecture
Deciding on Your Kickoff Activities
Expanding Your Data Lake
Chapter 18: Booking Future Trips to the Data Lake
Searching for the All-in-One Data Lake
Spreading Artificial Intelligence Smarts throughout Your Data Lake
Part 6: The Part of Tens
Chapter 19: Top Ten Reasons to Invest in Building a Data Lake
Supporting the Entire Analytics Continuum
Bringing Order to Your Analytical Data throughout Your Enterprise
Retiring Aging Data Marts
Bringing Unfulfilled Analytics Ideas out of Dry Dock
Laying a Foundation for Future Analytics
Providing a Region for Experimentation
Improving Your Master Data Efforts
Opening Up New Business Possibilities
Keeping Up with the Competition
Getting Your Organization Ready for the Next Big Thing
Chapter 20: Ten Places to Get Help for Your Data Lake
Cloud Provider Professional Services
Major Systems Integrators
Smaller Systems Integrators
Individual Consultants
Training Your Internal Staff
Industry Analysts
Data Lake Bloggers
Data Lake Groups and Forums
Data-Oriented Associations
Academic Resources
Chapter 21: Ten Differences between a Data Warehouse and a Data Lake
Types of Data Supported
Data Volumes
Different Internal Data Models
Architecture and Topology
ETL versus ELT
Data Latency
Analytical Uses
Incorporating New Data Sources
User Communities
Hosting
Index
About the Author
Connect with Dummies
End User License Agreement
Chapter 1
TABLE 1-1 Data Lake Zones
Chapter 2
TABLE 2-1 Matching Analytics and Business Questions
Chapter 9
TABLE 9-1 Hospital Data Lake Permissions
Chapter 13
TABLE 13-1 ADLS Storage Tiers
Chapter 15
TABLE 15-1 Data Lake Remediation Priorities
TABLE 15-2 Defining Data Lake Remediation Success
Chapter 17
TABLE 17-1 Your Five-Phase A LAKE Data Lake Road Map
TABLE 17-2 A LAKE Confirmation Loopbacks
Chapter 1
FIGURE 1-1: A logically centralized data lake with underlying physical decentra...
FIGURE 1-2: Cloud-based data lake solutions.
FIGURE 1-3: Different types of data in your data lake.
FIGURE 1-4: Source applications feeding data into your data lake.
Chapter 2
FIGURE 2-1: The vision of an enterprise data warehouse.
FIGURE 2-2: The reality of numerous stand-alone data marts.
FIGURE 2-3: Using a data lake to retire data marts.
FIGURE 2-4: Leaving a data mart intact and alongside your data lake.
FIGURE 2-5: Incorporating a data mart into your data lake.
FIGURE 2-6: Migrating your data warehouse into your new data lake.
FIGURE 2-7: A data pipeline into, through, and then out of the data lake.
FIGURE 2-8: An easy way to understand data pipelines and data lakes.
Chapter 3
FIGURE 3-1: Playing “find the data lake.”
Chapter 4
FIGURE 4-1: A reference architecture for data lake reference architectures.
FIGURE 4-2: Two classes of inbound data flows for your data lake.
FIGURE 4-3: Object storage as the fundamental storage technology for your data ...
FIGURE 4-4: Incorporating database technology along with object storage.
FIGURE 4-5: Embedding a data warehouse into your data lake environment.
FIGURE 4-6: Adding heterogeneity to your data lake’s bronze zone.
FIGURE 4-7: Adding heterogeneity to your data lake’s bronze zone.
FIGURE 4-8: Incorporating the user layer of a legacy data warehouse into your d...
FIGURE 4-9: Subsuming an end-to-end legacy data warehouse into your new data la...
FIGURE 4-10: Your data lake feeding your data warehouse.
FIGURE 4-11: Split-streaming data feeds to support both your data lake and your...
FIGURE 4-12: Ongoing data interchange between your data lake and your data ware...
FIGURE 4-13: A data lake that is much larger than a data warehouse.
FIGURE 4-14: A data warehouse that is much larger than a data lake.
FIGURE 4-15: Feeding external data into the data lake.
FIGURE 4-16: On-demand access to external data for your analytics.
FIGURE 4-17: Drilling-site sensors and a data lake at an energy exploration com...
FIGURE 4-18: Edge analytics existing outside the control of the data lake.
FIGURE 4-19: Remote data from edge analytics can also be sent to the data lake.
Chapter 5
FIGURE 5-1: Data flowing into your data lake bronze zone.
FIGURE 5-2: Three different operational data feeds into your data lake bronze z...
FIGURE 5-3: Multiple subscribers to sensor and video data streams.
FIGURE 5-4: Using a streaming service to split-stream data into both a data lak...
FIGURE 5-5: Under-the-covers “micro-batching” within streaming input to your da...
FIGURE 5-6: The Lambda data ingestion architecture for your data lake.
FIGURE 5-7: The Kappa data ingestion architecture for your data lake.
FIGURE 5-8: Going for storage simplicity with only object storage in your bronz...
FIGURE 5-9: Implementing a multi-component bronze zone.
FIGURE 5-10: Ingesting data from a database: object storage versus database in ...
FIGURE 5-11: Carrying a bronze zone database through to your data lake gold zon...
FIGURE 5-12: Carrying bronze zone object storage through to your data lake gold...
FIGURE 5-13: Going back to a database in a multi-component gold zone.
FIGURE 5-14: Data streaming doing double duty as bronze zone storage for raw da...
FIGURE 5-15: Three different models for linking your analytics with streaming d...
Chapter 6
FIGURE 6-1: Refining an image between the bronze zone and the silver zone.
FIGURE 6-2: Enriching an image for storage in the data lake silver zone.
FIGURE 6-3: Enriching a tweet by determining and attaching sentiment analysis.
FIGURE 6-4: Building a master data taxonomy for your data lake.
FIGURE 6-5: Decisions, decisions: What should you do with bronze zone data dest...
FIGURE 6-6: Redefining your data lake zone boundaries rather than unnecessarily...
FIGURE 6-7: Ingesting a raw tweet.
FIGURE 6-8: Enriching a tweet followed by shifting your zone boundary rather th...
FIGURE 6-9: Step 1: Ingesting raw data into your bronze zone.
FIGURE 6-10: Step 2: Moving data into the silver zone rather than copying data.
FIGURE 6-11: Deciding whether to keep a raw image after refinement and enhancem...
FIGURE 6-12: Your data lake silver zone using Amazon S3.
FIGURE 6-13: Dividing your silver zone content among three different flavors of...
FIGURE 6-14: Carrying hierarchical storage back into your data lake bronze zone...
FIGURE 6-15: Step 1: Refine and enrich an image in your data lake silver zone.
FIGURE 6-16: Step 2: Move bronze zone image to S3 Glacier to save on storage co...
Chapter 7
FIGURE 7-1: Peeking inside the gold zone.
FIGURE 7-2: Building a curated gold zone data package.
FIGURE 7-3: Adding database data to object store data inside a gold zone curate...
FIGURE 7-4: Using persistent data streams for your gold zone curated data.
FIGURE 7-5: Using a specialized data store in your data lake gold zone.
FIGURE 7-6: Relocating an infrequently used or retired data package to less-exp...
Chapter 8
FIGURE 8-1: Using the data lake sandbox for analytical development.
FIGURE 8-2: Migrating curated data from the sandbox to the gold zone as analyti...
FIGURE 8-3: Using a data lake sandbox to explore architectural options.
FIGURE 8-4: Moving a graph database curated data package from the sandbox into ...
FIGURE 8-5: Exploratory analytics and your data lake sandbox.
Chapter 9
FIGURE 9-1: Data lakes and passive analytics users.
FIGURE 9-2: Light analytics user access to a data lake gold zone.
FIGURE 9-3: Light analytics user access to a database within the data lake gold...
FIGURE 9-4: A multistep gold zone integration process for a light analytics use...
FIGURE 9-5: Using a data abstraction tool for data lake access simplicity.
FIGURE 9-6: Using a data abstraction tool to integrate database and object data...
Chapter 10
FIGURE 10-1: Your hospital’s legacy systems environment.
FIGURE 10-2: Selecting data mart dimensional models to retain for your new data...
FIGURE 10-3: Replacing best-of-breed applications with an integrated EHR packag...
FIGURE 10-4: Pairing your new EHR system with a data lake.
FIGURE 10-5: Setting up curated data packages in your data lake gold zone.
FIGURE 10-6: Delaying platform decisions until you gain a broader view of your ...
FIGURE 10-7: Your EHR system using both streaming and batch feeds into your dat...
FIGURE 10-8: Making key ingestion and bronze zone data set decisions.
FIGURE 10-9: Streaming persistent data into the gold zone.
FIGURE 10-10: Making different architectural decisions for various data streams...
FIGURE 10-11: Putting your silver zone to work.
FIGURE 10-12: Adding data pipelines to your data lake buildout.
FIGURE 10-13: Bringing your data lake sandbox into the picture.
Chapter 11
FIGURE 11-1: Public versus private clouds: a visual analogy.
FIGURE 11-2: Allocation of responsibilities for SaaS, PaaS, and IaaS.
Chapter 12
FIGURE 12-1: The fundamental structure of Amazon S3.
FIGURE 12-2: Mimicking folders in Amazon S3 through filenames.
FIGURE 12-3: Building your entire AWS data lake using only S3 for data storage.
FIGURE 12-4: Using Glue Crawler and Glue Data Catalog to maintain up-to-date da...
FIGURE 12-5: Using a Lake Formation blueprint for data lake ingestion.
FIGURE 12-6: Using Amazon Kinesis Data Streams for hospital patient vital signs...
FIGURE 12-7: Athena using the Glue Data Catalog to access S3 data with SQL.
FIGURE 12-8: Using Amazon Redshift in your data lake’s gold zone.
FIGURE 12-9: An end-to-end hospital data lake built on AWS services.
Chapter 13
FIGURE 13-1: Organization of the Azure cloud.
FIGURE 13-2: An Azure data lake framework.
FIGURE 13-3: ADLS Gen2, the best of both worlds.
FIGURE 13-4: ADLS containers, folders, and files.
FIGURE 13-5: Ingesting, copying, and sinking data along an ADF pipeline.
FIGURE 13-6: Using Azure Event Hubs for a publish-and-subscribe model.
FIGURE 13-7: Bidirectional messaging and streaming with Azure IoT Hub.
FIGURE 13-8: Using Azure SQL Database in your Azure data lake.
FIGURE 13-9: Azure data lake architecture for IoT analytics.
FIGURE 13-10: Azure data lake architecture for industrial IoT predictive mainte...
FIGURE 13-11: Azure data lake architecture for defect analysis and prevention.
FIGURE 13-12: Azure data lake architecture for rideshare company forecasting.
Chapter 14
FIGURE 14-1: Your data lake four-element scorecard.
FIGURE 14-2: Dividing each data lake evaluation criteria into scoreable element...
FIGURE 14-3: Focus only on your raw data.
FIGURE 14-4: Identifying your raw data hot spots.
FIGURE 14-5: Diving deep into your data lake’s quality and governance.
FIGURE 14-6: The ominous results.
FIGURE 14-7: Grading your data velocity and latency.
FIGURE 14-8: Good news on the data velocity and latency front.
FIGURE 14-9: Grading your component architecture.
FIGURE 14-10: Bringing together all of your data lake evaluation scores.
Chapter 15
FIGURE 15-1: The current hospital operational applications.
FIGURE 15-2: Peer analytical solutions, one for administrative data and one for...
FIGURE 15-3: A downstream data warehouse taking feeds from both Hadoop and AWS.
FIGURE 15-4: The current state survey results.
FIGURE 15-5: Cataloging and assigning data lake issues.
FIGURE 15-6: A two-step process to migrate the hospital’s entire data lake onto...
FIGURE 15-7: Introducing streaming to benefit both the medical operations appli...
FIGURE 15-8: Adding shells for the silver and gold zones.
FIGURE 15-9: Adding a data warehouse component into the overall data lake archi...
FIGURE 15-10: Placing master data management in your silver zone.
FIGURE 15-11: Addressing the data warehouse–versus–data lake controversy withou...
FIGURE 15-12: The data lake remediation timeline.
FIGURE 15-13: The inevitable trio of technology, human and organizational facto...
Chapter 16
FIGURE 16-1: The starting point for the operating room efficiency study.
FIGURE 16-2: The first data pipeline to feed existing raw data into a curated g...
FIGURE 16-3: Batch ETL of patient bedside data in the current hospital data lak...
FIGURE 16-4: Streaming data and streaming analytics for the real-time patient d...
FIGURE 16-5: Emergency room data fed through the bronze zone into the silver zo...
FIGURE 16-6: Building the first emergency room and inpatient cross-reference wi...
FIGURE 16-7: Replacing a batch data feed with split-streaming.
FIGURE 16-8: The starting point for analyzing message content versus patient ou...
FIGURE 16-9: Building a batch interface between the app and the data lake for m...
FIGURE 16-10: Enriching semi-structured data and then repositioning the data in...
FIGURE 16-11: Completing the curated data package and the associated analytics.
Chapter 17
FIGURE 17-1: Dividing your current-state assessment into data and analytics.
FIGURE 17-2: Harvey balls for scoring.
FIGURE 17-3: Parallel paths of your analytics assessment.
FIGURE 17-4: A sample analytics scorecard.
FIGURE 17-5: Your data architecture and governance parallel paths.
FIGURE 17-6: A sample data architecture and governance scorecard.
FIGURE 17-7: Analyzing every scrap of data about an insurance customer: today v...
FIGURE 17-8: Your data lake and data warehouse as peers.
FIGURE 17-9: Your data warehouse feeding certain data into your data lake.
FIGURE 17-10: Progressively turning your data lake vision into a solid blueprin...
FIGURE 17-11: A multiphase, multiyear, high-level data lake road map.
Chapter 18
FIGURE 18-1: Your data lake doing double-duty for transactional and analytical ...
FIGURE 18-2: Equipping your data lake with an AI-enabled insights and analytics...
In December 1995, I wrote an article for Database Programming & Design magazine entitled “I Want a Data Warehouse, So What Is It Again?” A few months later, I began writing Data Warehousing For Dummies (Wiley), building on the article’s content to help readers make sense of first-generation data warehousing.
Fast-forward a quarter of a century, and I could very easily write an article entitled “I Want a Data Lake, So What Is It Again?” This time, I’m cutting right to the chase with Data Lakes For Dummies. To quote a famous former baseball player named Yogi Berra, it’s déjà vu all over again!
Nearly every large and upper-midsize company and governmental agency is building a data lake or at least has an initiative on the drawing board. That’s the good news.
The not-so-good news, though, is that you’ll find a disturbing lack of agreement about data lake architecture, best practices for data lake development, data lake internal data flows, and even what a data lake actually is! In fact, many first-generation data lakes have fallen short of original expectations and need to be rearchitected and rebuilt.
As with data warehousing in the mid-’90s, the data lake concept today is still a relatively new one. Consequently, almost everything about data lakes — from their very definition to alternatives for integration with or migration from existing data warehouses — is still very much a moving target. Software product vendors, cloud service providers, consulting firms, industry analysts, and academics often have varying — and sometimes conflicting — perspectives on data lakes. So, how do you navigate your way across a data lake when the waters are especially choppy and you’re being tossed from side to side?
That’s where Data Lakes For Dummies comes in.
Data Lakes For Dummies helps you make sense of the ABCs — acronym anarchy, buzzword bingo, and consulting confusion — of today’s and tomorrow’s data lakes.
This book is not only a tutorial about data lakes; it also serves as a reference that you may find yourself consulting on a regular basis. So, you don’t need to memorize large blocks of content (there’s no final exam!) because you can always go back to take a second or third or fourth look at any particular point during your own data lake efforts.
Right from the start, you find out what your organization should expect from all the time, effort, and money you’ll put into your data lake initiative, as well as see what challenges are lurking. You’ll dig deep into data lake architecture and leading cloud platforms and get your arms around the big picture of how all the pieces fit together.
One of the disadvantages of being an early adopter of any new technology is that you sometimes make mistakes or at least have a few false starts. Plenty of early data lake efforts have turned into more of a data dump, with tons of data that just isn’t very accessible or well organized. If you find yourself in this situation, fear not: You’ll see how to turn that data dump into the data lake you originally envisioned.
I don’t use many special conventions in this book, but you should be aware that sidebars (the gray boxes you see throughout the book) and anything marked with the Technical Stuff icon are all skippable. So, if you’re short on time, you can pass over these pieces without losing anything essential. On the other hand, if you have the time, you’re sure to find fascinating information here!
Within this book, you may note that some web addresses break across two lines of text. If you’re reading this book in print and want to visit one of these web pages, simply key in the web address exactly as it’s noted in the text, pretending as though the line break doesn’t exist. If you’re reading this as an e-book, you’ve got it easy — just click the web address to be taken directly to the web page.
The most relevant assumption I’ve made is that if you’re reading this book, you either are or will soon be working on a data lake initiative.
Maybe you’re a data strategist and architect, and what’s most important to you is sifting through mountains of sometimes conflicting — and often incomplete — information about data lakes. Your organization already makes use of earlier-generation data warehouses and data marts, and now it’s time to take that all-important next step to a data lake. If that’s the case, you’re definitely in the right place.
If you’re a developer or data architect who is working on a small subset of the overall data lake, your primary focus is how a particular software package or service works. Still, you’re curious about where your daily work fits into your organization’s overall data lake efforts. That’s where this book comes in: to provide context and that “aha!” factor to the big picture that surrounds your day-to-day tasks.
Or maybe you’re on the business and operational side of a company or governmental agency, working side by side with the technology team as they work to build an enterprise-scale data environment that will finally support the entire spectrum of your organization’s analytical needs. You don’t necessarily need to know too much about the techie side of data lakes, but you absolutely care about building an environment that meets today’s and tomorrow’s needs for data-driven insights.
The common thread is that data lakes are part of your organization’s present and future, and you’re seeking an unvarnished, hype-free, grounded-in-reality view of data lakes today and where they’re headed.
In any event, you don’t need to be a technical whiz with databases, programming languages such as Python, or specific cloud platforms such as Amazon Web Services (AWS) or Microsoft Azure. I cover many different technical topics in this book, but you’ll find clear explanations and diagrams that don’t presume any prerequisite knowledge on your part.
As you read this book, you encounter icons in the margins that indicate material of particular interest. Here’s what the icons mean:
These are the tricks of the data lake trade. You can save yourself a great deal of time and avoid more than a few false starts by following specific tips collected from the best practices (and learned from painful experiences) of those who preceded you on the path to the data lake.
Data lakes are often filled with dangerous icebergs. (Okay, bad analogy, but you hopefully get the idea.) When you’re working on your organization’s data lake efforts, pay particular attention to situations that are called out with this icon.
If you’re more interested in the conceptual and architectural aspects of data lakes than the nitty-gritty implementation details, you can skim or even skip material that is accompanied by this icon.
Some points are so critically important that you’ll be well served by committing them to memory. You’ll even see some of these points repeated later in the book because they tie in with other material. This icon calls out this crucial content.
In addition to the material in the print or e-book you’re reading right now, this product comes with a free Cheat Sheet for the three types of data for your data lake, four zones inside your data lake, five phases to building your data lake, and more. To access the Cheat Sheet, go to www.dummies.com and type Data Lakes For Dummies Cheat Sheet in the Search box.
Now it’s time to head off to the lake — the data lake, that is! If you’re totally new to the subject, you don’t want to skip the chapters in Part 1 because they’ll provide the foundation for the rest of the book. If you already have some exposure to data lakes, I still recommend that you at least skim Part 1 to get a sense of how to get beyond all the hype, buzzwords, and generalities related to data lakes.
You can then read the book sequentially from front to back or jump around as needed. Whatever path works best for you is the one you should take.
Part 1
IN THIS PART …
Separate the data lake reality from the hype.
Steer your data lake efforts in the right direction.
Diagnose and avoid common pitfalls that can dry up your data lake.
Chapter 1
IN THIS CHAPTER
Defining and scoping the data lake
Diving underwater in the data lake
Dividing up the data lake
Making sense of conflicting terminology
The lake is the place to be this season — the data lake, that is!
Just like the newest and hottest vacation destination, everyone is booking reservations for a trip to the data lake. Unlike a vacation, though, you won’t just be spending a long weekend or a week or even the entire summer at the data lake. If you and your work colleagues do a good job, your data lake will be your go-to place for a whole decade or even longer.
Ask a friend this question: “What’s a lake?” Your friend thinks for a moment, and then gives you this answer: “Well, it’s a big hole in the ground that’s filled with water.”
Technically, your friend is correct, but that answer also is far from detailed enough to really tell you what a lake actually is. You need more specifics, such as:
How big, dimension-wise (how long and how wide)
How deep that “big hole in the ground” goes
How much variability there is from one lake to another in terms of those length, width, and depth dimensions (the Great Lakes, anyone?)
How much water you’ll find in the lake and how much that amount of water may vary among different lakes
Whether a lake contains freshwater or saltwater
Some follow-up questions may pop into your mind as well:
A pond is also a big hole in the ground that’s filled with water, so is a lake the same as a pond?
What distinguishes a lake from an ocean or a sea?
Can a lake be physically connected to another lake?
Can the dividing line between two states or two countries be in the middle of a lake?
If a lake is empty, is it still considered a lake?
If one lake leaves Chicago heading east at 100 miles per hour, and another lake heads west from New York … oh wait, wrong kind of word problem, never mind… .
So many missing pieces of the puzzle, all arising from one simple question!
You’ll find the exact same situation if you ask someone this question: “What’s a data lake?” In fact, go ahead and ask your favorite search engine that question. You’ll find dozens of high-level definitions that will almost certainly spur plenty of follow-up questions as you try to get your arms around the idea of a data lake.
Here’s a better idea: Instead of filtering through all that varying — and even conflicting — terminology and then trying to consolidate all of it into a single comprehensive definition, just think of a data lake as the following:
A solidly architected, logically centralized, highly scalable environment filled with different types of analytic data that are sourced from both inside and outside your enterprise with varying latency, and which will be the primary go-to destination for your organization’s data-driven insights
Wow, that’s a mouthful! No worries: Just as if you were eating a gourmet fireside meal while camping at your favorite lake, you can break up that definition into bite-size pieces.
A data lake should remain viable and useful for a long time after it becomes operational. Also, you’ll be continually expanding and enhancing your data lake with new types and forms of data, new underlying technologies, and support for new analytical uses.
Building a data lake is more than just loading massive amounts of data into some storage location.
To support this near-constant expansion and growth, you need to ensure that your data lake is well architected and solidly engineered, which means that the data lake
Enforces standards and best practices for data ingestion, data storage, data transmission, and interchange among its components and data delivery to end users
Minimizes workarounds and temporary interfaces that have a tendency to stick around longer than planned and weaken your overall environment
Continues to meet your predetermined metrics and thresholds for overall technical performance, such as data loading and interchange, as well as user response time
Think about a resort that builds docks, a couple of lakeside restaurants, and other structures at various locations alongside a large lake. You wouldn’t just hand out lumber, hammers, and nails to a bunch of visitors and tell them to start building without detailed blueprints and engineering diagrams. The same is true with a data lake. From the first piece of data that arrives, you need as solid a foundation as possible to help keep your data lake viable for a long time.
You’ll come across definitions and descriptions that tell you a data lake is a centralized store of data, but that definition is only partially correct.
A data lake is logically centralized. You can certainly think of a data lake as a single place for your data, instead of having your data scattered among different databases. But in reality, even though your data lake is logically centralized, its data is physically decentralized and distributed among many different underlying servers.
The data services that you use for your data lake, such as the Amazon Simple Storage Service (S3), Microsoft Azure Data Lake Storage (ADLS), or the Hadoop Distributed File System (HDFS), manage the distribution of data among potentially numerous servers where your data is actually stored. These services hide the physical distribution from almost everyone other than those who need to manage the data at the server storage level. Instead, they present the data as being logically part of a single data lake. Figure 1-1 illustrates how logical centralization accompanies physical decentralization.
FIGURE 1-1: A logically centralized data lake with underlying physical decentralization.
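To see what this looks like in practice, here's a minimal sketch in Python, assuming a hypothetical bucket name and the boto3 library (the AWS SDK for Python): you address your data logically, by bucket and key, while S3 quietly decides which physical servers actually hold the bytes.

```python
import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# "example-data-lake" and the key prefix are hypothetical names.
# The request is purely logical (bucket plus key); S3 manages the
# physical placement and replication behind the scenes.
response = s3.list_objects_v2(
    Bucket="example-data-lake",
    Prefix="bronze/sales/2021/",
)

for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```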
How big can your data lake get? To quote the old saying (and to answer a question with a question), how many angels can dance on the head of a pin?
Scalability is best thought of as “the ability to expand capacity, workload, and missions without having to go back to the drawing board and start all over.” Your data lake will almost always be a cloud-based solution (see Figure 1-2). Cloud-based platforms give you, in theory, infinite scalability for your data lake. New servers and storage devices (discs, solid state devices, and so on) can be incorporated into your data lake on demand, and the software services manage and control these new resources along with those that you’re already using. Your data lake contents can then expand from hundreds of terabytes to petabytes, and then to exabytes, and then zettabytes, and even into the ginormousbyte range. (Just kidding about that last one.)
FIGURE 1-2: Cloud-based data lake solutions.
Cloud providers give you pricing for data storage and access that increases as your needs grow or decreases if you cut back on your functionality. Basically, your data lake will be priced on a pay-as-you-go basis.
Some of the very first data lakes that were built in the Hadoop environment may reside in your corporate data center and be categorized as on-prem (short for on-premises, meaning “on your premises”) solutions. But most of today’s data lakes are built in the Amazon Web Services (AWS) or Microsoft Azure cloud environments. Given the ever-increasing popularity of cloud computing, it’s highly unlikely that this trend of cloud-based data lakes will reverse for a long time, if ever.
As long as Amazon, Microsoft, and other cloud platform providers can keep expanding their existing data centers and building new ones, as well as enhancing the capabilities of their data management services, then your data lake should be able to avoid scalability issues.
A multiple-component data lake architecture (see Chapter 4) further helps overcome performance and capacity constraints as your data lake grows in size and complexity, providing even greater scalability.
Think of a data lake as being closer to a lake resort rather than just the lake — the body of water — in its natural state. If you were a real estate developer, you might buy the property that includes the lake itself, along with plenty of acreage surrounding the lake. You’d then develop the overall property by building cabins, restaurants, boat docks, and other facilities. The lake might be the centerpiece of the overall resort, but its value is dramatically enhanced by all the additional assets that you’ve built surrounding the lake.
A data lake is an entire environment, not just a gigantic collection of data that is stored within a data service such as Amazon S3 or Microsoft ADLS.
In addition to data storage, a data lake also includes the following:
One or (usually) more mechanisms to move data from one part of the data lake to another.
A catalog or directory that helps keep track of what data is where, as well as the associated rules that apply to different groups of data; this is known as metadata.
Capabilities that help unify meanings and business rules for key data subjects that may come into the data lake from different applications and systems; this is known as master data management.
Monitoring services to track data quality and accuracy, response time when users access data, billing services to charge different organizations for their usage of the data lake, and plenty more.
If your data lake had a motto, it might be “All data are created equal.”
In a data lake, data is data is data. In other words, you don’t need to make special accommodations for more complex types of data than you would for simpler forms of data.
Your data lake will contain structured data, unstructured data, and semi-structured data (see Figure 1-3). The following sections cover these types of data in more detail.
You’re probably most familiar with structured data, which is made up of numbers, shorter-length character strings, and dates. Traditionally, most of the applications you’ve worked with have been based on structured data. Structured data is commonly stored in a relational database such as Microsoft SQL Server, MySQL, or Oracle Database.
FIGURE 1-3: Different types of data in your data lake.
In a database, you define columns (basically, fields) for each of your pieces of structured data, and each column is rigidly and precisely defined with the following:
A data type, such as INTEGER, DECIMAL, CHARACTER, DATE, DATETIME, or something similar
The size of the field, either explicitly declared (for example, how many characters a CHARACTER column will contain) or implicitly declared (the system-defined maximum number for an INTEGER or how a DATE column is structured)
Any specific rules that apply to a data column or field, such as the permissible range of values (for example, a customer’s age must be between 18 and 130) or a list of allowable values (for example, an employee’s current status can only be FULL-TIME, PART-TIME, TERMINATED, or RETIRED)
Any additional constraints, such as primary and foreign key designations, or referential integrity (rules that specify consistency for certain columns across multiple database tables)
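Here's a small sketch of those column-level rules in action, using Python's built-in sqlite3 module. The table and its constraints are hypothetical, but they mirror the age-range and status-list examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Every column gets a data type, and the CHECK constraints encode the
# kinds of rules described above: a permissible range of values and a
# list of allowable values. (The schema is made up for illustration.)
conn.execute("""
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        full_name   VARCHAR(100) NOT NULL,
        age         INTEGER CHECK (age BETWEEN 18 AND 130),
        status      VARCHAR(10) CHECK (status IN
                    ('FULL-TIME', 'PART-TIME', 'TERMINATED', 'RETIRED'))
    )
""")

# This row satisfies every rule ...
conn.execute("INSERT INTO employee VALUES (1, 'Pat Jones', 41, 'FULL-TIME')")

# ... and this one violates the age range, so the database rejects it.
try:
    conn.execute("INSERT INTO employee VALUES (2, 'Sam Roe', 11, 'RETIRED')")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)
```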
Unstructured data is, by definition, data that lacks a formally defined structure. Images (such as JPEGs), audio (such as MP3s), and videos (such as MP4s or MOVs) are common forms of unstructured data.
Semi-structured data sort of falls in between structured and unstructured data. Examples include a blog post, a social media post, a text message, an email message, or a message from Slack or Microsoft Teams. Leaving aside any embedded or attached images or videos for a moment, all these examples consist of a long string of letters, numbers, and special characters. However, there’s no particular structure assigned to most of these text strings, other than perhaps a couple of lines of heading information. The body of an email may be very short — only a line or two — while another email can go on for many long paragraphs.
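To make that concrete, here's a hypothetical email message represented as JSON (a common semi-structured format) in Python. The handful of header fields are predictable; the body is free-form text of any length. The field names are illustrative, not any standard.

```python
import json

# A couple of lines of heading information, then free-form text.
email_message = {
    "from": "pat@example.com",
    "to": "sam@example.com",
    "sent": "2021-03-15T09:42:00Z",
    "subject": "Q1 sales anomalies",
    "body": "Quick note: the department-level numbers look odd. "
            "Can we pull the in-store video for aisle 7 from last week?",
}

print(json.dumps(email_message, indent=2))
```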
In your data lake, you need to have all these types of data sitting side by side. Why? Because you’ll be running analytics against the data lake that may need more than one form of data. For example, suppose you receive and then analyze a detailed report of sales by department in a large department store during the past month.
Then, after noticing a few anomalies in the sales numbers, you pull up in-store surveillance video to analyze traffic versus sales to better understand how many customers may be looking at merchandise but deciding not to make a purchase. You can even combine structured data from scanners with your unstructured video data as part of your analysis.
If you had to go to different data storage environments for your sales results (structured data) and then the video surveillance (unstructured data), your overall analysis is dramatically slowed down, especially if you need to integrate and cross-reference different types of data. With a data lake, all this data is sitting side by side, ready to be delivered for analysis and decision-making.
In their earliest days, relational databases only stored structured data. Later, they were extended with capabilities to store structured and unstructured data. Binary large objects (BLOBs) were a common way to store images and even video in a relational database. However, even an object-extended relational database doesn’t make a good platform for a data lake when compared with modern data services such as Amazon S3 or Microsoft ADLS.
A common misconception is that you store “all your data” in your data lake. Actually, you store all or most of your analytic data in a data lake. Analytic data is, as you may suspect from the name, data that you’re using for analytics. In contrast, you use operational data to run your business.
What’s the difference? From one perspective, operational and analytic data are one and the same. Suppose you work for a large retailer. A customer comes into one of your stores and makes some purchases. Another customer goes onto your company’s website and buys some items there. The records of those sales — which customers made the purchases, which products they bought, how many of each product, the dates of the sales, whether the sales were online or in a store, and so on — are all stored away as official records of those transactions, which are necessary for running your company’s operations.
But you also want to analyze that data, right? You want to understand which products are selling the best and where. You want to understand which customers are spending the most. You have dozens or even hundreds of questions you want to ask about your customers and their purchasing activity.
Here’s the catch: You need to make copies of your operational data for the deep analysis that you need to undertake, and the copies of that operational data are what goes into the data lake (see Figure 1-4).
FIGURE 1-4: Source applications feeding data into your data lake.
Wait a minute! Why in the world do you need to copy data into your data lake? Why can’t you just analyze the data right where it is, in the source applications and their databases?
Data lakes, at least as you need to build them today and for the foreseeable future, are a continuation of the same model that has been used for data warehousing since the early 1990s. For many technical reasons related to performance, deep analysis involving large data volumes and significant cross-referencing directly in your source applications isn’t a workable solution for the bulk of your analytics.
Consequently, you need to make copies of the operational data that you want for analytical purposes and store that data in your data lake. Think of the data inside your data lake as (in used-car terminology) previously owned data that has been refurbished and is now ready for a brand-new owner.
But if you can’t adequately do complex analytics directly from source applications and their databases, what about this idea: Run your applications off your data lake instead! This way, you can avoid having to copy your data, right? Unfortunately, that idea won’t work, at least with today’s technology.
Operational applications almost always use a relational database, which manages concurrency control among their users and applications. In simple terms, hundreds or even thousands of users can add new data and make changes to a relational database without interfering with each other’s work and corrupting the database. A data lake, however, is built on storage technology that is optimized for retrieving data for analysis and doesn’t support concurrency control for update operations.
Many vendors are working on new technology that will allow you to build a data lake for operational, as well as analytical purposes. This technology is still a bit down the road from full operational viability. For the time being, you’ll build a data lake by copying data from many different source applications.
What exactly does “copying data” look like, and how frequently do you need to copy data into the data lake?
Data lakes mostly use a technique called ELT, which stands for either extract, load, and transform or extraction, loading, and transformation. With ELT, you “blast” your data into a data lake without having to spend a great deal of time profiling and understanding the particulars of your data. You extract data (the E part of ELT) from its original home in a source application, and then, after that data has been transmitted to the data lake, you load the data (the L) into its initial storage location. Eventually, when it’s time for you to use the data for analytical purposes, you’ll need to transform the data (the T) into whatever format is needed for a specific type of analysis.
For data warehousing — the predecessor to data lakes that you’re almost certainly still also using — data is copied from source applications to the data warehouse using a technique called ETL, rather than ELT. With ETL, you need to thoroughly understand the particulars of your data on its way into the data warehouse, which requires the transformation (T) to occur before the data is loaded (L) into its usable form.
With ELT, you can control the latency, or “freshness,” of data that is brought into the data lake. Some data needed for critical, real-time analysis can be streamed into the data lake, which means that a copy is sent to the data lake immediately after data is created or updated within a source application. (This is referred to as a low-latency data feed.) You essentially push data into your data lake piece by piece immediately upon the creation of that data.
Other data may be less time-critical and can be “batched up” in a source application and then periodically transmitted in bulk to the data lake.
You can specify the latency requirements for every single data feed from every single source application.
The ELT model also allows you to identify a new source of data for your data lake and then very quickly bring in the data that you need. You don’t need to spend days or weeks dissecting the ins and outs of the new data source to understand its structure and business rules. You “blast” the data into your data lake in the natural form of the data: database tables, MP4 files, or however the data is stored. Then, when it’s time to use that data for analysis, you can proceed to dig into the particulars and get the data ready for reports, machine learning, or however you’re going to be using and analyzing the data.
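Here's a deliberately simple sketch of the ELT pattern in Python, under stated assumptions: local file paths stand in for a source application and the data lake's bronze zone, and the transformation step is deferred until analysis time. It illustrates the idea, not any particular vendor's pipeline.

```python
import json
import shutil
from pathlib import Path

# Hypothetical locations; in a real data lake, these would be a source
# application's database and cloud object storage, not local folders.
SOURCE = Path("source_app/orders.json")
BRONZE = Path("data_lake/bronze/orders/orders.json")

def extract_and_load():
    """E and L: blast the raw data into the bronze zone, untouched."""
    BRONZE.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(SOURCE, BRONZE)  # no profiling, no cleansing

def transform_for_analysis():
    """T: only when analysis needs it, reshape the raw data."""
    orders = json.loads(BRONZE.read_text())
    # Example transformation: keep just the fields one report needs.
    return [{"order_id": o["order_id"], "total": o["total"]} for o in orders]
```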
Take a look around your organization today. Chances are, you have dozens or even hundreds of different places to go for reports and analytics. At one time, your company probably had the idea of building an enterprise data warehouse that would provide data for almost all the analytical needs across the entire company. Alas, for many reasons, you instead wound up with numerous data marts and other environments, very few of which work together. Even enterprise data warehouses are often accompanied by an entire portfolio of data marts in the typical organization.
Great news! The data lake will finally be that one-stop shopping place for the data to meet almost all the analytical needs across your entire enterprise.
Enterprise-scale data warehousing fell short for many different reasons, including the underlying technology platforms. Data lakes overcome those shortfalls and provide the foundation for an entirely new generation of integrated, enterprise-wide analytics.
Even with a data lake, you’ll almost certainly still have other data environments outside the data lake that support analytics. Your data lake objective should be to satisfy almost all your organization’s analytical needs and be the go-to place for data. If a few other environments pop up here and there, that’s okay. Just be careful about the overall proliferation of systems outside your data lake; otherwise, you’ll wind up right back in the same highly fragmented data mess that you had before beginning work on your data lake.
Suppose you head off for a weeklong vacation to your favorite lake resort. The people who run the resort have divided the lake into different zones, each for a different recreational purpose. One zone is set aside for water-skiing; a second zone is for speedboats, but no water-skiing is permitted in that zone; a third zone is only for boats without motors; and a fourth zone allows only swimming but no water vessels at all.
The operators of the resort could’ve said, “What the heck, let’s just have a free-for-all out on the lake and hope for the best.” Instead, they wisely established different zones for different purposes, resulting in orderly, peaceful vacations (hopefully!) rather than chaos.
A data lake is also divided into different zones. The exact number of zones may vary from one organization’s data lake to another’s, but you’ll always find at least three zones in use — bronze, silver, and gold — and sometimes a fourth zone, the sandbox.
Bronze, silver, and gold aren’t “official” standardized names, but they are catchy and easy to remember. Other names that you may find are shown in Table 1-1.
TABLE 1-1 Data Lake Zones
Recommended Zone Name    Other Names
Bronze zone              Raw zone, landing zone
Silver zone              Cleansed zone, refined zone
Gold zone                Performance zone, curated zone, data model zone
Sandbox                  Experimental zone, short-term analytics zone
All the data lake zones, including the sandbox, are discussed in more detail in Part 2, but the following sections provide a brief overview.
The boundaries and borders between your data lake zones can be fluid (Fluid? Get it?), especially with streaming data, as I explain in Part 2.
You load your data into the bronze zone when the data first enters the data lake. First, you extract the data from a source application (the E part of ELT), and then the data is transmitted into the bronze zone in raw form (thus, one of the alternative names for this zone). You don’t correct any errors or otherwise transform or modify the data at all. The original operational data should look identical to the copy of that data now in the bronze zone.
Your catchphrase for loading data into the bronze zone is “the need for speed.” You may be trickling one piece of data at a time or bulk-loading hundreds of gigabytes or even terabytes of data. Your objective is to transmit the data into the data lake environment as quickly as possible. You’ll worry about checking out and refining that data later.
The silver zone consists of data that has been error-checked and cleansed but still remains in its original format. Data may be copied from a source application in JavaScript Object Notation (JSON) format and land in the bronze zone in raw form, looking exactly as the data was in the source system itself — errors and all.
You’ll patch up any known errors, handle missing data, and otherwise cleanse the data. Then you’ll store the cleansed data in the silver zone, still in JSON format.
Not all data from your bronze zone will be cleansed and copied into your silver zone. The data lake model calls for loading massive amounts of data into the bronze zone without having to do upfront analysis to determine which data is definitely or likely needed for analysis. When you decide what data you need, you do the necessary data cleansing and move only the cleansed data into the silver zone.
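As a hypothetical sketch of that selective cleansing, the following Python snippet error-checks raw JSON records from the bronze zone and writes the cleansed results, still in JSON format, into the silver zone. The paths, field names, and rules are made up for illustration.

```python
import json
from pathlib import Path

# Hypothetical bronze and silver zone locations.
BRONZE = Path("data_lake/bronze/customers/customers.json")
SILVER = Path("data_lake/silver/customers/customers.json")

def cleanse_to_silver():
    """Patch known errors and handle missing data; keep the JSON format."""
    cleansed = []
    for rec in json.loads(BRONZE.read_text()):
        rec.setdefault("middle_name", "")            # handle missing data
        rec["state"] = rec.get("state", "").upper()  # normalize casing
        if rec.get("age") is not None and not (18 <= rec["age"] <= 130):
            rec["age"] = None                        # flag implausible value
        cleansed.append(rec)
    SILVER.parent.mkdir(parents=True, exist_ok=True)
    SILVER.write_text(json.dumps(cleansed, indent=2))
```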
The gold zone is the final home for your most valuable analytical data. You’ll curate data coming from the silver zone, meaning that you’ll group and restructure data into “packages” dedicated to your organization’s high-value analytical needs.
The following figure shows the progressive pipelines of data among the various zones, including the sandbox. Notice how not every piece or group of data is cleansed and then sent from the bronze zone to the silver zone. You’ll spend time refurbishing, refining, and transmitting data to the silver zone that you definitely or likely need for analytics.
