Take a dive into data lakes

"Data lakes" is the latest buzzword in the world of data storage, management, and analysis. Data Lakes For Dummies decodes and demystifies the concept and helps you get a straightforward answer to the question: "What exactly is a data lake, and do I need one for my business?" Written for an audience of technology decision makers tasked with keeping up with the latest and greatest data options, this book provides the perfect introductory survey of these novel and growing features of the information landscape. It explains how they can help your business, what they can (and can't) achieve, and what you need to do to create the lake that best suits your particular needs. With a minimum of jargon, prolific tech author and business intelligence consultant Alan Simon explains how data lakes differ from other data storage paradigms. Once you've got the background picture, he maps out ways you can add a data lake to your business systems; migrate existing information and switch on the fresh data supply; clean up the product; and open channels to the best intelligence software for interpreting what you've stored.

* Understand and build data lake architecture
* Store, clean, and synchronize new and existing data
* Compare the best data lake vendors
* Structure raw data and produce usable analytics

Whatever your business, data lakes are going to form an ever more prominent part of the information universe that every business should have access to. Dive into this book to start exploring the deep competitive advantage they make possible, and make sure your business isn't left standing on the shore.
Data Lakes For Dummies®
Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com
Copyright © 2021 by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.
For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit https://hub.wiley.com/community/support/dummies.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2021939570
ISBN 978-1-119-78616-0 (pbk); ISBN 978-1-119-78617-7 (ebk); ISBN 978-1-119-78618-4 (ebk)
Cover
Title Page
Copyright
Introduction
About This Book
Foolish Assumptions
Icons Used in This Book
Beyond the Book
Where to Go from Here
Part 1: Getting Started with Data Lakes
Chapter 1: Jumping into the Data Lake
What Is a Data Lake?
The Data Lake Olympics
Data Lakes and Big Data
The Data Lake Water Gets Murky
Chapter 2: Planning Your Day (and the Next Decade) at the Data Lake
Carpe Diem: Seizing the Day with Big Data
Managing Equal Opportunity Data
Building Today’s — and Tomorrow’s — Enterprise Analytical Data Environment
Reducing Existing Stand-Alone Data Marts
Eliminating Future Stand-Alone Data Marts
Establishing a Migration Path for Your Data Warehouses
Aligning Data with Decision Making
Speedboats, Canoes, and Lake Cruises: Traversing the Variable-Speed Data Lake
Managing Overall Analytical Costs
Chapter 3: Break Out the Life Vests: Tackling Data Lake Challenges
That’s Not a Data Lake, This Is a Data Lake!
Exposing Data Lake Myths and Misconceptions
Navigating Your Way through the Storm on the Data Lake
Building the Data Lake of Dreams
Performing Regular Data Lake Tune-ups — Or Else!
Technology Marches Forward
Part 2: Building the Docks, Avoiding the Rocks
Chapter 4: Imprinting Your Data Lake on a Reference Architecture
Playing Follow the Leader
Guiding Principles of a Data Lake Reference Architecture
A Reference Architecture for Your Data Lake Reference Architecture
Incoming! Filling Your Data Lake
Supporting the Fleet Sailing on Your Data Lake
The Old Meets the New at the Data Lake
Bringing Outside Water into Your Data Lake
Playing at the Edge of the Lake
Chapter 5: Anybody Hungry? Ingesting and Storing Raw Data in Your Bronze Zone
Ingesting Data with the Best of Both Worlds
Joining the Data Ingestion Fraternity
Storing Data in Your Bronze Zone
Just Passing Through: The Cross-Zone Express Lane
Taking Inventory at the Data Lake
Bringing Analytics to Your Bronze Zone
Chapter 6: Your Data Lake’s Water Treatment Plant: The Silver Zone
Funneling Data Further into the Data Lake
Bringing Master Data into Your Data Lake
Impacting the Bronze Zone
Getting Clever with Your Storage Options
Working Hand-in-Hand with Your Gold Zone
Chapter 7: Bottling Your Data Lake Water in the Gold Zone
Laser-Focusing on the Purpose of the Gold Zone
Looking Inside the Gold Zone
Deciding What Data to Curate in Your Gold Zone
Seeing What Happens When Your Curated Data Becomes Less Useful
Chapter 8: Playing in the Sandbox
Developing New Analytical Models in Your Sandbox
Comparing Different Data Lake Architectural Options
Experimenting and Playing Around with Data
Chapter 9: Fishing in the Data Lake
Starting with the Latest Guidebook
Taking It Easy at the Data Lake
Staying in Your Lane
Doing a Little Bit of Exploring
Putting on Your Gear and Diving Underwater
Chapter 10: Rowing End-to-End across the Data Lake
Keeping versus Discarding Data Components
Getting Started with Your Data Lake
Shifting Your Focus to Data Ingestion
Finishing Up with the Sandbox
Part 3: Evaporating the Data Lake into the Cloud
Chapter 11: A Cloudy Day at the Data Lake
Rushing to the Cloud
Running through Some Cloud Computing Basics
The Big Guys in the Cloud Computing Game
Chapter 12: Building Data Lakes in Amazon Web Services
The Elite Eight: Identifying the Essential Amazon Services
Looking at the Rest of the Amazon Data Lake Lineup
Building Data Pipelines in AWS
Chapter 13: Building Data Lakes in Microsoft Azure
Setting Up the Big Picture in Azure
The Magnificent Seven, Azure Style
Filling Out the Azure Data Lake Lineup
Assembling the Building Blocks
Part 4: Cleaning Up the Polluted Data Lake
Chapter 14: Figuring Out If You Have a Data Swamp Instead of a Data Lake
Designing Your Report Card and Grading System
Looking at the Raw Data Lockbox
Knowing What to Do When Your Data Lake Is Out of Order
Too Fast, Too Slow, Just Right: Dealing with Data Lake Velocity and Latency
Dividing the Work in Your Component Architecture
Tallying Your Scores and Analyzing the Results
Chapter 15: Defining Your Data Lake Remediation Strategy
Setting Your Key Objectives
Doing Your Gap Analysis
Identifying Resolutions
Establishing Timelines
Defining Your Critical Success Factors
Chapter 16: Refilling Your Data Lake
The Three S’s: Setting the Stage for Success
Refining and Enriching Existing Raw Data
Making Better Use of Existing Refined Data
Building New Pipelines with Newly Ingested Raw Data
Part 5: Making Trips to the Data Lake a Tradition
Chapter 17: Checking Your GPS: The Data Lake Road Map
Getting an Overhead View of the Road to the Data Lake
Assessing Your Current State of Data and Analytics
Putting Together a Lofty Vision
Building Your Data Lake Architecture
Deciding on Your Kickoff Activities
Expanding Your Data Lake
Chapter 18: Booking Future Trips to the Data Lake
Searching for the All-in-One Data Lake
Spreading Artificial Intelligence Smarts throughout Your Data Lake
Part 6: The Part of Tens
Chapter 19: Top Ten Reasons to Invest in Building a Data Lake
Supporting the Entire Analytics Continuum
Bringing Order to Your Analytical Data throughout Your Enterprise
Retiring Aging Data Marts
Bringing Unfulfilled Analytics Ideas out of Dry Dock
Laying a Foundation for Future Analytics
Providing a Region for Experimentation
Improving Your Master Data Efforts
Opening Up New Business Possibilities
Keeping Up with the Competition
Getting Your Organization Ready for the Next Big Thing
Chapter 20: Ten Places to Get Help for Your Data Lake
Cloud Provider Professional Services
Major Systems Integrators
Smaller Systems Integrators
Individual Consultants
Training Your Internal Staff
Industry Analysts
Data Lake Bloggers
Data Lake Groups and Forums
Data-Oriented Associations
Academic Resources
Chapter 21: Ten Differences between a Data Warehouse and a Data Lake
Types of Data Supported
Data Volumes
Different Internal Data Models
Architecture and Topology
ETL versus ELT
Data Latency
Analytical Uses
Incorporating New Data Sources
User Communities
Hosting
Index
About the Author
Connect with Dummies
End User License Agreement
Chapter 1
TABLE 1-1 Data Lake Zones
Chapter 2
TABLE 2-1 Matching Analytics and Business Questions
Chapter 9
TABLE 9-1 Hospital Data Lake Permissions
Chapter 13
TABLE 13-1 ADLS Storage Tiers
Chapter 15
TABLE 15-1 Data Lake Remediation Priorities
TABLE 15-2 Defining Data Lake Remediation Success
Chapter 17
TABLE 17-1 Your Five-Phase A LAKE Data Lake Road Map
TABLE 17-2 A LAKE Confirmation Loopbacks
Chapter 1
FIGURE 1-1: A logically centralized data lake with underlying physical decentra...
FIGURE 1-2: Cloud-based data lake solutions.
FIGURE 1-3: Different types of data in your data lake.
FIGURE 1-4: Source applications feeding data into your data lake.
Chapter 2
FIGURE 2-1: The vision of an enterprise data warehouse.
FIGURE 2-2: The reality of numerous stand-alone data marts.
FIGURE 2-3: Using a data lake to retire data marts.
FIGURE 2-4: Leaving a data mart intact and alongside your data lake.
FIGURE 2-5: Incorporating a data mart into your data lake.
FIGURE 2-6: Migrating your data warehouse into your new data lake.
FIGURE 2-7: A data pipeline into, through, and then out of the data lake.
FIGURE 2-8: An easy way to understand data pipelines and data lakes.
Chapter 3
FIGURE 3-1: Playing “find the data lake.”
Chapter 4
FIGURE 4-1: A reference architecture for data lake reference architectures.
FIGURE 4-2: Two classes of inbound data flows for your data lake.
FIGURE 4-3: Object storage as the fundamental storage technology for your data ...
FIGURE 4-4: Incorporating database technology along with object storage.
FIGURE 4-5: Embedding a data warehouse into your data lake environment.
FIGURE 4-6: Adding heterogeneity to your data lake’s bronze zone.
FIGURE 4-7: Adding heterogeneity to your data lake’s bronze zone.
FIGURE 4-8: Incorporating the user layer of a legacy data warehouse into your d...
FIGURE 4-9: Subsuming an end-to-end legacy data warehouse into your new data la...
FIGURE 4-10: Your data lake feeding your data warehouse.
FIGURE 4-11: Split-streaming data feeds to support both your data lake and your...
FIGURE 4-12: Ongoing data interchange between your data lake and your data ware...
FIGURE 4-13: A data lake that is much larger than a data warehouse.
FIGURE 4-14: A data warehouse that is much larger than a data lake.
FIGURE 4-15: Feeding external data into the data lake.
FIGURE 4-16: On-demand access to external data for your analytics.
FIGURE 4-17: Drilling-site sensors and a data lake at an energy exploration com...
FIGURE 4-18: Edge analytics existing outside the control of the data lake.
FIGURE 4-19: Remote data from edge analytics can also be sent to the data lake.
Chapter 5
FIGURE 5-1: Data flowing into your data lake bronze zone.
FIGURE 5-2: Three different operational data feeds into your data lake bronze z...
FIGURE 5-3: Multiple subscribers to sensor and video data streams.
FIGURE 5-4: Using a streaming service to split-stream data into both a data lak...
FIGURE 5-5: Under-the-covers “micro-batching” within streaming input to your da...
FIGURE 5-6: The Lambda data ingestion architecture for your data lake.
FIGURE 5-7: The Kappa data ingestion architecture for your data lake.
FIGURE 5-8: Going for storage simplicity with only object storage in your bronz...
FIGURE 5-9: Implementing a multi-component bronze zone.
FIGURE 5-10: Ingesting data from a database: object storage versus database in ...
FIGURE 5-11: Carrying a bronze zone database through to your data lake gold zon...
FIGURE 5-12: Carrying bronze zone object storage through to your data lake gold...
FIGURE 5-13: Going back to a database in a multi-component gold zone.
FIGURE 5-14: Data streaming doing double duty as bronze zone storage for raw da...
FIGURE 5-15: Three different models for linking your analytics with streaming d...
Chapter 6
FIGURE 6-1: Refining an image between the bronze zone and the silver zone.
FIGURE 6-2: Enriching an image for storage in the data lake silver zone.
FIGURE 6-3: Enriching a tweet by determining and attaching sentiment analysis.
FIGURE 6-4: Building a master data taxonomy for your data lake.
FIGURE 6-5: Decisions, decisions: What should you do with bronze zone data dest...
FIGURE 6-6: Redefining your data lake zone boundaries rather than unnecessarily...
FIGURE 6-7: Ingesting a raw tweet.
FIGURE 6-8: Enriching a tweet followed by shifting your zone boundary rather th...
FIGURE 6-9: Step 1: Ingesting raw data into your bronze zone.
FIGURE 6-10: Step 2: Moving data into the silver zone rather than copying data.
FIGURE 6-11: Deciding whether to keep a raw image after refinement and enhancem...
FIGURE 6-12: Your data lake silver zone using Amazon S3.
FIGURE 6-13: Dividing your silver zone content among three different flavors of...
FIGURE 6-14: Carrying hierarchical storage back into your data lake bronze zone...
FIGURE 6-15: Step 1: Refine and enrich an image in your data lake silver zone.
FIGURE 6-16: Step 2: Move bronze zone image to S3 Glacier to save on storage co...
Chapter 7
FIGURE 7-1: Peeking inside the gold zone.
FIGURE 7-2: Building a curated gold zone data package.
FIGURE 7-3: Adding database data to object store data inside a gold zone curate...
FIGURE 7-4: Using persistent data streams for your gold zone curated data.
FIGURE 7-5: Using a specialized data store in your data lake gold zone.
FIGURE 7-6: Relocating an infrequently used or retired data package to less-exp...
Chapter 8
FIGURE 8-1: Using the data lake sandbox for analytical development.
FIGURE 8-2: Migrating curated data from the sandbox to the gold zone as analyti...
FIGURE 8-3: Using a data lake sandbox to explore architectural options.
FIGURE 8-4: Moving a graph database curated data package from the sandbox into ...
FIGURE 8-5: Exploratory analytics and your data lake sandbox.
Chapter 9
FIGURE 9-1: Data lakes and passive analytics users.
FIGURE 9-2: Light analytics user access to a data lake gold zone.
FIGURE 9-3: Light analytics user access to a database within the data lake gold...
FIGURE 9-4: A multistep gold zone integration process for a light analytics use...
FIGURE 9-5: Using a data abstraction tool for data lake access simplicity.
FIGURE 9-6: Using a data abstraction tool to integrate database and object data...
Chapter 10
FIGURE 10-1: Your hospital’s legacy systems environment.
FIGURE 10-2: Selecting data mart dimensional models to retain for your new data...
FIGURE 10-3: Replacing best-of-breed applications with an integrated EHR packag...
FIGURE 10-4: Pairing your new EHR system with a data lake.
FIGURE 10-5: Setting up curated data packages in your data lake gold zone.
FIGURE 10-6: Delaying platform decisions until you gain a broader view of your ...
FIGURE 10-7: Your EHR system using both streaming and batch feeds into your dat...
FIGURE 10-8: Making key ingestion and bronze zone data set decisions.
FIGURE 10-9: Streaming persistent data into the gold zone.
FIGURE 10-10: Making different architectural decisions for various data streams...
FIGURE 10-11: Putting your silver zone to work.
FIGURE 10-12: Adding data pipelines to your data lake buildout.
FIGURE 10-13: Bringing your data lake sandbox into the picture.
Chapter 11
FIGURE 11-1: Public versus private clouds: a visual analogy.
FIGURE 11-2: Allocation of responsibilities for SaaS, PaaS, and IaaS.
Chapter 12
FIGURE 12-1: The fundamental structure of Amazon S3.
FIGURE 12-2: Mimicking folders in Amazon S3 through filenames.
FIGURE 12-3: Building your entire AWS data lake using only S3 for data storage.
FIGURE 12-4: Using Glue Crawler and Glue Data Catalog to maintain up-to-date da...
FIGURE 12-5: Using a Lake Formation blueprint for data lake ingestion.
FIGURE 12-6: Using Amazon Kinesis Data Streams for hospital patient vital signs...
FIGURE 12-7: Athena using the Glue Data Catalog to access S3 data with SQL.
FIGURE 12-8: Using Amazon Redshift in your data lake’s gold zone.
FIGURE 12-9: An end-to-end hospital data lake built on AWS services.
Chapter 13
FIGURE 13-1: Organization of the Azure cloud.
FIGURE 13-2: An Azure data lake framework.
FIGURE 13-3: ADLS Gen2, the best of both worlds.
FIGURE 13-4: ADLS containers, folders, and files.
FIGURE 13-5: Ingesting, copying, and sinking data along an ADF pipeline.
FIGURE 13-6: Using Azure Event Hubs for a publish-and-subscribe model.
FIGURE 13-7: Bidirectional messaging and streaming with Azure IoT Hub.
FIGURE 13-8: Using Azure SQL Database in your Azure data lake.
FIGURE 13-9: Azure data lake architecture for IoT analytics.
FIGURE 13-10: Azure data lake architecture for industrial IoT predictive mainte...
FIGURE 13-11: Azure data lake architecture for defect analysis and prevention.
FIGURE 13-12: Azure data lake architecture for rideshare company forecasting.
Chapter 14
FIGURE 14-1: Your data lake four-element scorecard.
FIGURE 14-2: Dividing each data lake evaluation criteria into scoreable element...
FIGURE 14-3: Focus only on your raw data.
FIGURE 14-4: Identifying your raw data hot spots.
FIGURE 14-5: Diving deep into your data lake’s quality and governance.
FIGURE 14-6: The ominous results.
FIGURE 14-7: Grading your data velocity and latency.
FIGURE 14-8: Good news on the data velocity and latency front.
FIGURE 14-9: Grading your component architecture.
FIGURE 14-10: Bringing together all of your data lake evaluation scores.
Chapter 15
FIGURE 15-1: The current hospital operational applications.
FIGURE 15-2: Peer analytical solutions, one for administrative data and one for...
FIGURE 15-3: A downstream data warehouse taking feeds from both Hadoop and AWS.
FIGURE 15-4: The current state survey results.
FIGURE 15-5: Cataloging and assigning data lake issues.
FIGURE 15-6: A two-step process to migrate the hospital’s entire data lake onto...
FIGURE 15-7: Introducing streaming to benefit both the medical operations appli...
FIGURE 15-8: Adding shells for the silver and gold zones.
FIGURE 15-9: Adding a data warehouse component into the overall data lake archi...
FIGURE 15-10: Placing master data management in your silver zone.
FIGURE 15-11: Addressing the data warehouse–versus–data lake controversy withou...
FIGURE 15-12: The data lake remediation timeline.
FIGURE 15-13: The inevitable trio of technology, human and organizational facto...
Chapter 16
FIGURE 16-1: The starting point for the operating room efficiency study.
FIGURE 16-2: The first data pipeline to feed existing raw data into a curated g...
FIGURE 16-3: Batch ETL of patient bedside data in the current hospital data lak...
FIGURE 16-4: Streaming data and streaming analytics for the real-time patient d...
FIGURE 16-5: Emergency room data fed through the bronze zone into the silver zo...
FIGURE 16-6: Building the first emergency room and inpatient cross-reference wi...
FIGURE 16-7: Replacing a batch data feed with split-streaming.
FIGURE 16-8: The starting point for analyzing message content versus patient ou...
FIGURE 16-9: Building a batch interface between the app and the data lake for m...
FIGURE 16-10: Enriching semi-structured data and then repositioning the data in...
FIGURE 16-11: Completing the curated data package and the associated analytics.
Chapter 17
FIGURE 17-1: Dividing your current-state assessment into data and analytics.
FIGURE 17-2: Harvey balls for scoring.
FIGURE 17-3: Parallel paths of your analytics assessment.
FIGURE 17-4: A sample analytics scorecard.
FIGURE 17-5: Your data architecture and governance parallel paths.
FIGURE 17-6: A sample data architecture and governance scorecard.
FIGURE 17-7: Analyzing every scrap of data about an insurance customer: today v...
FIGURE 17-8: Your data lake and data warehouse as peers.
FIGURE 17-9: Your data warehouse feeding certain data into your data lake.
FIGURE 17-10: Progressively turning your data lake vision into a solid blueprin...
FIGURE 17-11: A multiphase, multiyear, high-level data lake road map.
Chapter 18
FIGURE 18-1: Your data lake doing double-duty for transactional and analytical ...
FIGURE 18-2: Equipping your data lake with an AI-enabled insights and analytics...
In December 1995, I wrote an article for Database Programming & Design magazine entitled “I Want a Data Warehouse, So What Is It Again?” A few months later, I began writing Data Warehousing For Dummies (Wiley), building on the article’s content to help readers make sense of first-generation data warehousing.
Fast-forward a quarter of a century, and I could very easily write an article entitled “I Want a Data Lake, So What Is It Again?” This time, I’m cutting right to the chase with Data Lakes For Dummies. To quote a famous former baseball player named Yogi Berra, it’s déjà vu all over again!
Nearly every large and upper-midsize company and governmental agency is building a data lake or at least has an initiative on the drawing board. That’s the good news.
The not-so-good news, though, is that you’ll find a disturbing lack of agreement about data lake architecture, best practices for data lake development, data lake internal data flows, and even what a data lake actually is! In fact, many first-generation data lakes have fallen short of original expectations and need to be rearchitected and rebuilt.
As with data warehousing in the mid-’90s, the data lake concept today is still a relatively new one. Consequently, almost everything about data lakes — from their very definition to alternatives for integration with or migration from existing data warehouses — is still very much a moving target. Software product vendors, cloud service providers, consulting firms, industry analysts, and academics often have varying — and sometimes conflicting — perspectives on data lakes. So, how do you navigate your way across a data lake when the waters are especially choppy and you’re being tossed from side to side?
That’s where Data Lakes For Dummies comes in.
Data Lakes For Dummies helps you make sense of the ABCs — acronym anarchy, buzzword bingo, and consulting confusion — of today’s and tomorrow’s data lakes.
This book is not only a tutorial about data lakes; it also serves as a reference that you may find yourself consulting on a regular basis. So, you don’t need to memorize large blocks of content (there’s no final exam!) because you can always go back to take a second or third or fourth look at any particular point during your own data lake efforts.
Right from the start, you find out what your organization should expect from all the time, effort, and money you’ll put into your data lake initiative, as well as see what challenges are lurking. You’ll dig deep into data lake architecture and leading cloud platforms and get your arms around the big picture of how all the pieces fit together.
One of the disadvantages of being an early adopter of any new technology is that you sometimes make mistakes or at least have a few false starts. Plenty of early data lake efforts have turned into more of a data dump, with tons of data that just isn’t very accessible or well organized. If you find yourself in this situation, fear not: You’ll see how to turn that data dump into the data lake you originally envisioned.
I don’t use many special conventions in this book, but you should be aware that sidebars (the gray boxes you see throughout the book) and anything marked with the Technical Stuff icon are all skippable. So, if you’re short on time, you can pass over these pieces without losing anything essential. On the other hand, if you have the time, you’re sure to find fascinating information here!
Within this book, you may note that some web addresses break across two lines of text. If you’re reading this book in print and want to visit one of these web pages, simply key in the web address exactly as it’s noted in the text, pretending as though the line break doesn’t exist. If you’re reading this as an e-book, you’ve got it easy — just click the web address to be taken directly to the web page.
The most relevant assumption I’ve made is that if you’re reading this book, you either are or will soon be working on a data lake initiative.
Maybe you’re a data strategist and architect, and what’s most important to you is sifting through mountains of sometimes conflicting — and often incomplete — information about data lakes. Your organization already makes use of earlier-generation data warehouses and data marts, and now it’s time to take that all-important next step to a data lake. If that’s the case, you’re definitely in the right place.
If you’re a developer or data architect who is working on a small subset of the overall data lake, your primary focus is how a particular software package or service works. Still, you’re curious about where your daily work fits into your organization’s overall data lake efforts. That’s where this book comes in: to provide context and that “aha!” factor to the big picture that surrounds your day-to-day tasks.
Or maybe you’re on the business and operational side of a company or governmental agency, working side by side with the technology team as they work to build an enterprise-scale data environment that will finally support the entire spectrum of your organization’s analytical needs. You don’t necessarily need to know too much about the techie side of data lakes, but you absolutely care about building an environment that meets today’s and tomorrow’s needs for data-driven insights.
The common thread is that data lakes are part of your organization’s present and future, and you’re seeking an unvarnished, hype-free, grounded-in-reality view of data lakes today and where they’re headed.
In any event, you don’t need to be a technical whiz with databases, programming languages such as Python, or specific cloud platforms such as Amazon Web Services (AWS) or Microsoft Azure. I cover many different technical topics in this book, but you’ll find clear explanations and diagrams that don’t presume any prerequisite knowledge on your part.
As you read this book, you encounter icons in the margins that indicate material of particular interest. Here’s what the icons mean:
These are the tricks of the data lake trade. You can save yourself a great deal of time and avoid more than a few false starts by following specific tips collected from the best practices (and learned from painful experiences) of those who preceded you on the path to the data lake.
Data lakes are often filled with dangerous icebergs. (Okay, bad analogy, but you hopefully get the idea.) When you’re working on your organization’s data lake efforts, pay particular attention to situations that are called out with this icon.
If you’re more interested in the conceptual and architectural aspects of data lakes than the nitty-gritty implementation details, you can skim or even skip material that is accompanied by this icon.
Some points are so critically important that you’ll be well served by committing them to memory. You’ll even see some of these points repeated later in the book because they tie in with other material. This icon calls out this crucial content.
In addition to the material in the print or e-book you’re reading right now, this product comes with a free Cheat Sheet for the three types of data for your data lake, four zones inside your data lake, five phases to building your data lake, and more. To access the Cheat Sheet, go to www.dummies.com and type Data Lakes For Dummies Cheat Sheet in the Search box.
Now it’s time to head off to the lake — the data lake, that is! If you’re totally new to the subject, you don’t want to skip the chapters in Part 1 because they’ll provide the foundation for the rest of the book. If you already have some exposure to data lakes, I still recommend that you at least skim Part 1 to get a sense of how to get beyond all the hype, buzzwords, and generalities related to data lakes.
You can then read the book sequentially from front to back or jump around as needed. Whatever path works best for you is the one you should take.
Part 1
IN THIS PART …
Separate the data lake reality from the hype.
Steer your data lake efforts in the right direction.
Diagnose and avoid common pitfalls that can dry up your data lake.
Chapter 1
IN THIS CHAPTER
Defining and scoping the data lake
Diving underwater in the data lake
Dividing up the data lake
Making sense of conflicting terminology
The lake is the place to be this season — the data lake, that is!
Just like the newest and hottest vacation destination, everyone is booking reservations for a trip to the data lake. Unlike a vacation, though, you won’t just be spending a long weekend or a week or even the entire summer at the data lake. If you and your work colleagues do a good job, your data lake will be your go-to place for a whole decade or even longer.
Ask a friend this question: “What’s a lake?” Your friend thinks for a moment, and then gives you this answer: “Well, it’s a big hole in the ground that’s filled with water.”
Technically, your friend is correct, but that answer also is far from detailed enough to really tell you what a lake actually is. You need more specifics, such as:
How big, dimension-wise (how long and how wide)
How deep that “big hole in the ground” goes
How much variability there is from one lake to another in terms of those length, width, and depth dimensions (the Great Lakes, anyone?)
How much water you’ll find in the lake and how much that amount of water may vary among different lakes
Whether a lake contains freshwater or saltwater
Some follow-up questions may pop into your mind as well:
A pond is also a big hole in the ground that’s filled with water, so is a lake the same as a pond?
What distinguishes a lake from an ocean or a sea?
Can a lake be physically connected to another lake?
Can the dividing line between two states or two countries be in the middle of a lake?
If a lake is empty, is it still considered a lake?
If one lake leaves Chicago heading east at 100 miles per hour, and another lake heads west from New York … oh wait, wrong kind of word problem, never mind… .
So many missing pieces of the puzzle, all arising from one simple question!
You’ll find the exact same situation if you ask someone this question: “What’s a data lake?” In fact, go ahead and ask your favorite search engine that question. You’ll find dozens of high-level definitions that will almost certainly spur plenty of follow-up questions as you try to get your arms around the idea of a data lake.
Here’s a better idea: Instead of filtering through all that varying — and even conflicting — terminology and then trying to consolidate all of it into a single comprehensive definition, just think of a data lake as the following:
A solidly architected, logically centralized, highly scalable environment filled with different types of analytic data that are sourced from both inside and outside your enterprise with varying latency, and which will be the primary go-to destination for your organization’s data-driven insights
Wow, that’s a mouthful! No worries: Just as if you were eating a gourmet fireside meal while camping at your favorite lake, you can break up that definition into bite-size pieces.
A data lake should remain viable and useful for a long time after it becomes operational. Also, you’ll be continually expanding and enhancing your data lake with new types and forms of data, new underlying technologies, and support for new analytical uses.
Building a data lake is more than just loading massive amounts of data into some storage location.
To support this near-constant expansion and growth, you need to ensure that your data lake is well architected and solidly engineered, which means that the data lake
Enforces standards and best practices for data ingestion, data storage, data transmission, and interchange among its components and data delivery to end users
Minimizes workarounds and temporary interfaces that have a tendency to stick around longer than planned and weaken your overall environment
Continues to meet your predetermined metrics and thresholds for overall technical performance, such as data loading and interchange, as well as user response time
Think about a resort that builds docks, a couple of lakeside restaurants, and other structures at various locations alongside a large lake. You wouldn’t just hand out lumber, hammers, and nails to a bunch of visitors and tell them to start building without detailed blueprints and engineering diagrams. The same is true with a data lake. From the first piece of data that arrives, you need as solid a foundation as possible to help keep your data lake viable for a long time.
You’ll come across definitions and descriptions that tell you a data lake is a centralized store of data, but that definition is only partially correct.
A data lake is logically centralized. You can certainly think of a data lake as a single place for your data, instead of having your data scattered among different databases. But in reality, even though your data lake is logically centralized, its data is physically decentralized and distributed among many different underlying servers.
The data services that you use for your data lake, such as the Amazon Simple Storage Service (S3), Microsoft Azure Data Lake Storage (ADLS), or the Hadoop Distributed File System (HDFS), manage the distribution of data among potentially numerous servers where your data is actually stored. These services hide the physical distribution from almost everyone other than those who need to manage the data at the server storage level. Instead, they present the data as being logically part of a single data lake. Figure 1-1 illustrates how logical centralization accompanies physical decentralization.
FIGURE 1-1: A logically centralized data lake with underlying physical decentralization.
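To see what this looks like in practice, here's a minimal sketch in Python, assuming a hypothetical bucket name and the boto3 library (the AWS SDK for Python): you address your data logically, by bucket and key, while S3 quietly decides which physical servers actually hold the bytes.

```python
import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# "example-data-lake" and the key prefix are hypothetical names.
# The request is purely logical (bucket plus key); S3 manages the
# physical placement and replication behind the scenes.
response = s3.list_objects_v2(
    Bucket="example-data-lake",
    Prefix="bronze/sales/2021/",
)

for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```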
How big can your data lake get? To quote the old saying (and to answer a question with a question), how many angels can dance on the head of a pin?
Scalability is best thought of as “the ability to expand capacity, workload, and missions without having to go back to the drawing board and start all over.” Your data lake will almost always be a cloud-based solution (see Figure 1-2). Cloud-based platforms give you, in theory, infinite scalability for your data lake. New servers and storage devices (discs, solid state devices, and so on) can be incorporated into your data lake on demand, and the software services manage and control these new resources along with those that you’re already using. Your data lake contents can then expand from hundreds of terabytes to petabytes, and then to exabytes, and then zettabytes, and even into the ginormousbyte range. (Just kidding about that last one.)
FIGURE 1-2: Cloud-based data lake solutions.
Cloud providers give you pricing for data storage and access that increases as your needs grow or decreases if you cut back on your functionality. Basically, your data lake will be priced on a pay-as-you-go basis.
Some of the very first data lakes that were built in the Hadoop environment may reside in your corporate data center and be categorized as on-prem (short for on-premises, meaning “on your premises”) solutions. But most of today’s data lakes are built in the Amazon Web Services (AWS) or Microsoft Azure cloud environments. Given the ever-increasing popularity of cloud computing, it’s highly unlikely that this trend of cloud-based data lakes will reverse for a long time, if ever.
As long as Amazon, Microsoft, and other cloud platform providers can keep expanding their existing data centers and building new ones, as well as enhancing the capabilities of their data management services, then your data lake should be able to avoid scalability issues.
A multiple-component data lake architecture (see Chapter 4) further helps overcome performance and capacity constraints as your data lake grows in size and complexity, providing even greater scalability.
Think of a data lake as being closer to a lake resort rather than just the lake — the body of water — in its natural state. If you were a real estate developer, you might buy the property that includes the lake itself, along with plenty of acreage surrounding the lake. You’d then develop the overall property by building cabins, restaurants, boat docks, and other facilities. The lake might be the centerpiece of the overall resort, but its value is dramatically enhanced by all the additional assets that you’ve built surrounding the lake.
A data lake is an entire environment, not just a gigantic collection of data that is stored within a data service such as Amazon S3 or Microsoft ADLS.
In addition to data storage, a data lake also includes the following:
One or (usually) more mechanisms to move data from one part of the data lake to another.
A catalog or directory that helps keep track of what data is where, as well as the associated rules that apply to different groups of data; this is known as metadata.
Capabilities that help unify meanings and business rules for key data subjects that may come into the data lake from different applications and systems; this is known as master data management.
Monitoring services to track data quality and accuracy, response time when users access data, billing services to charge different organizations for their usage of the data lake, and plenty more.
If your data lake had a motto, it might be “All data are created equal.”
In a data lake, data is data is data. In other words, you don’t need to make special accommodations for more complex types of data than you would for simpler forms of data.
Your data lake will contain structured data, unstructured data, and semi-structured data (see Figure 1-3). The following sections cover these types of data in more detail.
You’re probably most familiar with structured data, which is made up of numbers, shorter-length character strings, and dates. Traditionally, most of the applications you’ve worked with have been based on structured data. Structured data is commonly stored in a relational database such as Microsoft SQL Server, MySQL, or Oracle Database.
FIGURE 1-3: Different types of data in your data lake.
In a database, you define columns (basically, fields) for each of your pieces of structured data, and each column is rigidly and precisely defined with the following:
A data type, such as INTEGER, DECIMAL, CHARACTER, DATE, DATETIME, or something similar
The size of the field, either explicitly declared (for example, how many characters a CHARACTER column will contain) or implicitly declared (the system-defined maximum number for an INTEGER or how a DATE column is structured)
Any specific rules that apply to a data column or field, such as the permissible range of values (for example, a customer’s age must be between 18 and 130) or a list of allowable values (for example, an employee’s current status can only be FULL-TIME, PART-TIME, TERMINATED, or RETIRED)
Any additional constraints, such as primary and foreign key designations, or referential integrity (rules that specify consistency for certain columns across multiple database tables)
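Here's a small sketch of those column-level rules in action, using Python's built-in sqlite3 module. The table and its constraints are hypothetical, but they mirror the age-range and status-list examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Every column gets a data type, and the CHECK constraints encode the
# kinds of rules described above: a permissible range of values and a
# list of allowable values. (The schema is made up for illustration.)
conn.execute("""
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        full_name   VARCHAR(100) NOT NULL,
        age         INTEGER CHECK (age BETWEEN 18 AND 130),
        status      VARCHAR(10) CHECK (status IN
                    ('FULL-TIME', 'PART-TIME', 'TERMINATED', 'RETIRED'))
    )
""")

# This row satisfies every rule ...
conn.execute("INSERT INTO employee VALUES (1, 'Pat Jones', 41, 'FULL-TIME')")

# ... and this one violates the age range, so the database rejects it.
try:
    conn.execute("INSERT INTO employee VALUES (2, 'Sam Roe', 11, 'RETIRED')")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)
```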
Unstructured data is, by definition, data that lacks a formally defined structure. Images (such as JPEGs), audio (such as MP3s), and videos (such as MP4s or MOVs) are common forms of unstructured data.
Semi-structured data sort of falls in between structured and unstructured data. Examples include a blog post, a social media post, a text message, an email message, or a message from Slack or Microsoft Teams. Leaving aside any embedded or attached images or videos for a moment, all these examples consist of a long string of letters, numbers, and special characters. However, there’s no particular structure assigned to most of these text strings, other than perhaps a couple of lines of heading information. The body of an email may be very short — only a line or two — while another email can go on for many long paragraphs.
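To make that concrete, here's a hypothetical email message represented as JSON (a common semi-structured format) in Python. The handful of header fields are predictable; the body is free-form text of any length. The field names are illustrative, not any standard.

```python
import json

# A couple of lines of heading information, then free-form text.
email_message = {
    "from": "pat@example.com",
    "to": "sam@example.com",
    "sent": "2021-03-15T09:42:00Z",
    "subject": "Q1 sales anomalies",
    "body": "Quick note: the department-level numbers look odd. "
            "Can we pull the in-store video for aisle 7 from last week?",
}

print(json.dumps(email_message, indent=2))
```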
In your data lake, you need to have all these types of data sitting side by side. Why? Because you’ll be running analytics against the data lake that may need more than one form of data. For example, suppose you receive and then analyze a detailed report of sales by department in a large department store during the past month.
Then, after noticing a few anomalies in the sales numbers, you pull up in-store surveillance video to analyze traffic versus sales to better understand how many customers may be looking at merchandise but deciding not to make a purchase. You can even combine structured data from scanners with your unstructured video data as part of your analysis.
If you had to go to different data storage environments for your sales results (structured data) and then the video surveillance (unstructured data), your overall analysis is dramatically slowed down, especially if you need to integrate and cross-reference different types of data. With a data lake, all this data is sitting side by side, ready to be delivered for analysis and decision-making.
In their earliest days, relational databases only stored structured data. Later, they were extended with capabilities to store structured and unstructured data. Binary large objects (BLOBs) were a common way to store images and even video in a relational database. However, even an object-extended relational database doesn’t make a good platform for a data lake when compared with modern data services such as Amazon S3 or Microsoft ADLS.
A common misconception is that you store “all your data” in your data lake. Actually, you store all or most of your analytic data in a data lake. Analytic data is, as you may suspect from the name, data that you’re using for analytics. In contrast, you use operational data to run your business.
What’s the difference? From one perspective, operational and analytic data are one and the same. Suppose you work for a large retailer. A customer comes into one of your stores and makes some purchases. Another customer goes onto your company’s website and buys some items there. The records of those sales — which customers made the purchases, which products they bought, how many of each product, the dates of the sales, whether the sales were online or in a store, and so on — are all stored away as official records of those transactions, which are necessary for running your company’s operations.
But you also want to analyze that data, right? You want to understand which products are selling the best and where. You want to understand which customers are spending the most. You have dozens or even hundreds of questions you want to ask about your customers and their purchasing activity.
Here’s the catch: You need to make copies of your operational data for the deep analysis that you need to undertake, and the copies of that operational data are what goes into the data lake (see Figure 1-4).
FIGURE 1-4: Source applications feeding data into your data lake.
Wait a minute! Why in the world do you need to copy data into your data lake? Why can’t you just analyze the data right where it is, in the source applications and their databases?
Data lakes, at least as you need to build them today and for the foreseeable future, are a continuation of the same model that has been used for data warehousing since the early 1990s. For many technical reasons related to performance, deep analysis involving large data volumes and significant cross-referencing directly in your source applications isn’t a workable solution for the bulk of your analytics.
Consequently, you need to make copies of the operational data that you want for analytical purposes and store that data in your data lake. Think of the data inside your data lake as (in used-car terminology) previously owned data that has been refurbished and is now ready for a brand-new owner.
But if you can’t adequately do complex analytics directly from source applications and their databases, what about this idea: Run your applications off your data lake instead! This way, you can avoid having to copy your data, right? Unfortunately, that idea won’t work, at least with today’s technology.
Operational applications almost always use a relational database, which manages concurrency control among their users and applications. In simple terms, hundreds or even thousands of users can add new data and make changes to a relational database without interfering with each other’s work and corrupting the database. A data lake, however, is built on storage technology that is optimized for retrieving data for analysis and doesn’t support concurrency control for update operations.
Many vendors are working on new technology that will allow you to build a data lake for operational, as well as analytical purposes. This technology is still a bit down the road from full operational viability. For the time being, you’ll build a data lake by copying data from many different source applications.
What exactly does “copying data” look like, and how frequently do you need to copy data into the data lake?
Data lakes mostly use a technique called ELT, which stands for either extract, load, and transform or extraction, loading, and transformation. With ELT, you “blast” your data into a data lake without having to spend a great deal of time profiling and understanding the particulars of your data. You extract data (the E part of ELT) from its original home in a source application, and then, after that data has been transmitted to the data lake, you load the data (the L) into its initial storage location. Eventually, when it’s time for you to use the data for analytical purposes, you’ll need to transform the data (the T) into whatever format is needed for a specific type of analysis.
For data warehousing — the predecessor to data lakes that you’re almost certainly still also using — data is copied from source applications to the data warehouse using a technique called ETL, rather than ELT. With ETL, you need to thoroughly understand the particulars of your data on its way into the data warehouse, which requires the transformation (T) to occur before the data is loaded (L) into its usable form.
With ELT, you can control the latency, or “freshness,” of data that is brought into the data lake. Some data needed for critical, real-time analysis can be streamed into the data lake, which means that a copy is sent to the data lake immediately after data is created or updated within a source application. (This is referred to as a low-latency data feed.) You essentially push data into your data lake piece by piece immediately upon the creation of that data.
Other data may be less time-critical and can be “batched up” in a source application and then periodically transmitted in bulk to the data lake.
You can specify the latency requirements for every single data feed from every single source application.
The ELT model also allows you to identify a new source of data for your data lake and then very quickly bring in the data that you need. You don’t need to spend days or weeks dissecting the ins and outs of the new data source to understand its structure and business rules. You “blast” the data into your data lake in the natural form of the data: database tables, MP4 files, or however the data is stored. Then, when it’s time to use that data for analysis, you can proceed to dig into the particulars and get the data ready for reports, machine learning, or however you’re going to be using and analyzing the data.
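Here's a deliberately simple sketch of the ELT pattern in Python, under stated assumptions: local file paths stand in for a source application and the data lake's bronze zone, and the transformation step is deferred until analysis time. It illustrates the idea, not any particular vendor's pipeline.

```python
import json
import shutil
from pathlib import Path

# Hypothetical locations; in a real data lake, these would be a source
# application's database and cloud object storage, not local folders.
SOURCE = Path("source_app/orders.json")
BRONZE = Path("data_lake/bronze/orders/orders.json")

def extract_and_load():
    """E and L: blast the raw data into the bronze zone, untouched."""
    BRONZE.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(SOURCE, BRONZE)  # no profiling, no cleansing

def transform_for_analysis():
    """T: only when analysis needs it, reshape the raw data."""
    orders = json.loads(BRONZE.read_text())
    # Example transformation: keep just the fields one report needs.
    return [{"order_id": o["order_id"], "total": o["total"]} for o in orders]
```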
Take a look around your organization today. Chances are, you have dozens or even hundreds of different places to go for reports and analytics. At one time, your company probably had the idea of building an enterprise data warehouse that would provide data for almost all the analytical needs across the entire company. Alas, for many reasons, you instead wound up with numerous data marts and other environments, very few of which work together. Even enterprise data warehouses are often accompanied by an entire portfolio of data marts in the typical organization.
Great news! The data lake will finally be that one-stop shopping place for the data to meet almost all the analytical needs across your entire enterprise.
Enterprise-scale data warehousing fell short for many different reasons, including the underlying technology platforms. Data lakes overcome those shortfalls and provide the foundation for an entirely new generation of integrated, enterprise-wide analytics.
Even with a data lake, you’ll almost certainly still have other data environments outside the data lake that support analytics. Your data lake objective should be to satisfy almost all your organization’s analytical needs and be the go-to place for data. If a few other environments pop up here and there, that’s okay. Just be careful about the overall proliferation of systems outside your data lake; otherwise, you’ll wind up right back in the same highly fragmented data mess that you had before beginning work on your data lake.
Suppose you head off for a weeklong vacation to your favorite lake resort. The people who run the resort have divided the lake into different zones, each for a different recreational purpose. One zone is set aside for water-skiing; a second zone is for speedboats, but no water-skiing is permitted in that zone; a third zone is only for boats without motors; and a fourth zone allows only swimming but no water vessels at all.
The operators of the resort could’ve said, “What the heck, let’s just have a free-for-all out on the lake and hope for the best.” Instead, they wisely established different zones for different purposes, resulting in orderly, peaceful vacations (hopefully!) rather than chaos.
A data lake is also divided into different zones. The exact number of zones may vary from one organization’s data lake to another’s, but you’ll always find at least three zones in use — bronze, silver, and gold — and sometimes a fourth zone, the sandbox.
Bronze, silver, and gold aren’t “official” standardized names, but they are catchy and easy to remember. Other names that you may find are shown in Table 1-1.
TABLE 1-1 Data Lake Zones
Recommended Zone Name    Other Names
Bronze zone              Raw zone, landing zone
Silver zone              Cleansed zone, refined zone
Gold zone                Performance zone, curated zone, data model zone
Sandbox                  Experimental zone, short-term analytics zone
All the data lake zones, including the sandbox, are discussed in more detail in Part 2, but the following sections provide a brief overview.
The boundaries and borders between your data lake zones can be fluid (Fluid? Get it?), especially with streaming data, as I explain in Part 2.
You load your data into the bronze zone when the data first enters the data lake. First, you extract the data from a source application (the E part of ELT), and then the data is transmitted into the bronze zone in raw form (thus, one of the alternative names for this zone). You don’t correct any errors or otherwise transform or modify the data at all. The original operational data should look identical to the copy of that data now in the bronze zone.
Your catchphrase for loading data into the bronze zone is “the need for speed.” You may be trickling one piece of data at a time or bulk-loading hundreds of gigabytes or even terabytes of data. Your objective is to transmit the data into the data lake environment as quickly as possible. You’ll worry about checking out and refining that data later.
The silver zone consists of data that has been error-checked and cleansed but still remains in its original format. Data may be copied from a source application in JavaScript Object Notation (JSON) format and land in the bronze zone in raw form, looking exactly as the data was in the source system itself — errors and all.
You’ll patch up any known errors, handle missing data, and otherwise cleanse the data. Then you’ll store the cleansed data in the silver zone, still in JSON format.
Not all data from your bronze zone will be cleansed and copied into your silver zone. The data lake model calls for loading massive amounts of data into the bronze zone without having to do upfront analysis to determine which data is definitely or likely needed for analysis. When you decide what data you need, you do the necessary data cleansing and move only the cleansed data into the silver zone.
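As a hypothetical sketch of that selective cleansing, the following Python snippet error-checks raw JSON records from the bronze zone and writes the cleansed results, still in JSON format, into the silver zone. The paths, field names, and rules are made up for illustration.

```python
import json
from pathlib import Path

# Hypothetical bronze and silver zone locations.
BRONZE = Path("data_lake/bronze/customers/customers.json")
SILVER = Path("data_lake/silver/customers/customers.json")

def cleanse_to_silver():
    """Patch known errors and handle missing data; keep the JSON format."""
    cleansed = []
    for rec in json.loads(BRONZE.read_text()):
        rec.setdefault("middle_name", "")            # handle missing data
        rec["state"] = rec.get("state", "").upper()  # normalize casing
        if rec.get("age") is not None and not (18 <= rec["age"] <= 130):
            rec["age"] = None                        # flag implausible value
        cleansed.append(rec)
    SILVER.parent.mkdir(parents=True, exist_ok=True)
    SILVER.write_text(json.dumps(cleansed, indent=2))
```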
The gold zone is the final home for your most valuable analytical data. You’ll curate data coming from the silver zone, meaning that you’ll group and restructure data into “packages” dedicated to your organization’s high-value analytical needs.
The following figure shows the progressive pipelines of data among the various zones, including the sandbox. Notice how not every piece or group of data is cleansed and then sent from the bronze zone to the silver zone. You’ll spend time refurbishing, refining, and transmitting data to the silver zone that you definitely or likely need for analytics.
