Data Lake Development with Big Data

Pradeep Pasupuleti
Description

A Data Lake is a highly scalable platform for storing huge volumes of multistructured data from disparate sources with centralized data management services. This book explores the potential of Data Lakes and the architectural approaches to building them so that they ingest, index, manage, and analyze massive amounts of data using batch and real-time processing frameworks. It guides you through building a Data Lake that is managed by Hadoop and accessed as required by other Big Data applications.
This book will guide readers, using best practices, in developing a Data Lake's capabilities. It focuses on architecting data governance, security, data quality, data lineage tracking, metadata management, and semantic data tagging. By the end of this book, you will have a good understanding of how to build a Data Lake for Big Data.

You can read this e-book in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 229

Year of publication: 2015




Table of Contents

Data Lake Development with Big Data
Credits
About the Authors
Acknowledgement
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. The Need for Data Lake
Before the Data Lake
Need for Data Lake
Defining Data Lake
Key benefits of Data Lake
Challenges in implementing a Data Lake
When to go for a Data Lake implementation
Data Lake architecture
Architectural considerations
Architectural composition
Architectural details
Understanding Data Lake layers
The Data Governance and Security Layer
The Information Lifecycle Management layer
The Metadata Layer
Understanding Data Lake tiers
The Data Intake tier
The Source System Zone
The Transient Zone
The Raw Zone
Batch Raw Storage
The Real-time Raw Storage
The Data Management tier
The Integration Zone
The Enrichment Zone
The Data Hub Zone
The Data Consumption tier
The Data Discovery Zone
The Data Provisioning Zone
Summary
2. Data Intake
Understanding Intake tier zones
Source System Zone functionalities
Understanding connectivity processing
Understanding Intake Processing for data variety
Structured data
The need for integrating Structured Data in the Data Lake
Structured data loading approaches
Semi-structured data
The need for integrating semi-structured data in the Data Lake
Semi-structured data loading approaches
Unstructured data
The need for integrating Unstructured data in the Data Lake
Unstructured data loading approaches
Transient Landing Zone functionalities
File validation checks
File duplication checks
File integrity checks
File size checks
File periodicity checks
Data Integrity checks
Checking record counts
Checking for column counts
Schema validation checks
Raw Storage Zone functionalities
Data lineage processes
Watermarking process
Metadata capture
Deep Integrity checks
Bit Level Integrity checks
Periodic checksum checks
Security and governance
Information Lifecycle Management
Practical Data Ingestion scenarios
Architectural guidance
Structured data use cases
Semi-structured and unstructured data use cases
Big Data tools and technologies
Ingestion of structured data
Sqoop
Use case scenarios for Sqoop
WebHDFS
Use case scenarios for WebHDFS
Ingestion of streaming data
Apache Flume
Use case scenarios for Flume
Fluentd
Use case scenarios for Fluentd
Kafka
Use case scenarios for Kafka
Amazon Kinesis
Use case scenarios for Kinesis
Apache Storm
Use case scenarios for Storm
Summary
3. Data Integration, Quality, and Enrichment
Introduction to the Data Management Tier
Understanding Data Integration
Introduction to Data Integration
Prominent features of Data Integration
Loosely coupled Integration
Ease of use
Secure access
High-quality data
Lineage tracking
Practical Data Integration scenarios
The workings of Data Integration
Raw data discovery
Data quality assessment
Profiling the data
Data cleansing
Deletion of missing, null, or invalid values
Imputation of missing, null, or invalid values
Data transformations
Unstructured text transformation techniques
Structured data transformations
Data enrichment
Collect metadata and track data lineage
Traditional Data Integration versus Data Lake
Data pipelines
Addressing the limitations using Data Lake
Data partitioning
Addressing the limitations using Data Lake
Scale on demand
Addressing the limitations using Data Lake
Data ingest parallelism
Addressing the limitations using Data Lake
Extensibility
Addressing the limitations using Data Lake
Big Data tools and technologies
Syncsort
Use case scenarios for Syncsort
Talend
Use case scenarios for Talend
Pentaho
Use case scenarios for Pentaho
Summary
4. Data Discovery and Consumption
Understanding the Data Consumption tier
Data Consumption – Traditional versus Data Lake
An introduction to Data Consumption
Practical Data Consumption scenarios
Data Discovery and metadata
Enabling Data Discovery
Data classification
Classifying unstructured data
Named entity recognition
Topic modeling
Text clustering
Applications of data classification
Relation extraction
Extracting relationships from unstructured data
Feature-based methods
Understanding how feature-based methods work
Implementation
Semantic technologies
Understanding how semantic technologies work
Implementation
Extracting Relationships from structured data
Applications of relation extraction
Indexing data
Inverted index
Understanding how inverted index works
Implementation
Applications of Indexing
Performing Data Discovery
Semantic search
Word sense disambiguation
Latent Semantic Analysis
Faceted search
Fuzzy search
Edit distance
Wildcard and regular expressions
Data Provisioning and metadata
Data publication
Data subscription
Data Provisioning functionalities
Data formatting
Data selection
Data Provisioning approaches
Post-provisioning processes
Architectural guidance
Data Discovery
Big Data tools and technologies
Elasticsearch
Use case scenarios for Elasticsearch
IBM InfoSphere Data Explorer
Use case scenarios for IBM InfoSphere Data Explorer
Tableau
Use case scenarios for Tableau
Splunk
Use case scenarios for Splunk
Data Provisioning
Big Data tools and technologies
Data Dispatch
Use case scenarios for Data Dispatch
Summary
5. Data Governance
Understanding Data Governance
Introduction to Data Governance
The need for Data Governance
Governing Big Data in the Data Lake
Data Governance – Traditional versus Data Lake
Practical Data Governance scenarios
Data Governance components
Metadata management and lineage tracking
Data security and privacy
Big Data implications for security and privacy
Security issues in the Data Lake tiers
The Intake Tier
The Management Tier
The Consumption Tier
Information Lifecycle Management
Big Data implications for ILM
Implementing ILM using Data Lake
The Intake Tier
The Management Tier
The Consumption Tier
Architectural guidance
Big Data tools and technologies
Apache Falcon
Understanding how Falcon works
Use case scenarios for Falcon
Apache Atlas
Understanding how Atlas works
Use case scenarios for Atlas
IBM Big Data platform
Understanding how governance is provided in IBM Big Data platform
Use case scenarios for IBM Big Data platform
The current and future trends
Data Lake and future enterprise trajectories
Future Data Lake technologies
Summary
Index

Data Lake Development with Big Data

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: November 2015

Production reference: 1241115

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78588-808-3

www.packtpub.com

Credits

Authors

Pradeep Pasupuleti

Beulah Salome Purra

Reviewer

Dr. Kornel Amadeusz Skałkowski

Commissioning Editor

Priya Singh

Acquisition Editor

Ruchita Bhansali

Content Development Editor

Rohit Kumar Singh

Technical Editor

Saurabh Malhotra

Copy Editor

Trishya Hajare

Project Coordinator

Izzat Contractor

Proofreader

Safis Editing

Indexer

Hemangini Bari

Graphics

Jason Monteiro

Kirk D'Penha

Production Coordinator

Shantanu N. Zagade

Cover Work

Shantanu N. Zagade

About the Authors

Pradeep Pasupuleti has 18 years of experience in architecting and developing distributed and real-time data-driven systems. He constantly explores ways to use the power and promise of advanced analytics-driven platforms to solve the problems of the common man. He founded Datatma, a consulting firm, with a mission to humanize Big Data analytics, putting it to use to solve simple problems that serve a higher purpose.

He architected robust Big Data-enabled automated learning engines that enterprises regularly use in production in order to save time, money, and the lives of humans.

He built solid interdisciplinary data science teams that bridged the gap between theory and practice, thus creating compelling data products. His primary focus is always to ensure his customers are delighted, by helping to address their business problems through data products that use Big Data technologies and algorithms. He has consistently demonstrated thought leadership by solving high-dimensional data problems and achieving phenomenal results.

He has held strategic leadership roles in technology consulting, advising Fortune 100 companies on Big Data strategy and creating Big Data Centers of Excellence.

He has worked on use cases such as enterprise Data Lakes, fraud detection, patient readmission prediction, student performance prediction, claims optimization, sentiment mining, cloud infrastructure SLA violation prediction, data leakage prevention, and mainframe-offloaded ETL on Hadoop.

In his book Pig Design Patterns, published by Packt Publishing, he compiled his learnings and experiences from the challenges involved in building Hadoop-driven data products, covering data ingestion, data cleansing and validation, data transformation, dimensionality reduction, and many other interesting Big Data war stories.

Outside office hours, he enjoys running marathons, exploring archeological sites, finding patterns in unrelated data sources, and mentoring start-ups and budding researchers.

He can be reached at <[email protected]> and https://in.linkedin.com/in/pradeeppasupuleti.

Acknowledgement

This book is dedicated to the loving memory of my mother, Smt. Sumathy; without her never-failing encouragement and everlasting love I would have never been half as good.

First and foremost, I have to thank my father, Sri. Prabhakar Pasupuleti, who never ceases to be a constant source of inspiration, a ray of hope, humility and strength, and whose support and guidance have given me the courage to chase my dreams.

I should also express my deep sense of gratitude to each of my family members, Sushma, Sresht, and Samvruth, who stood by me at every moment through very tough times and enabled me to complete this book.

I would like to sincerely thank all my teachers who were instrumental in shaping me. Among them, I would like to thank Usha Madam, Vittal Rao Sir, Gopal Krishna Sir, and Brindavan Sir for their stellar role in improving me.

I would also like to thank all my friends for their understanding in many ways. Their friendship makes my life a wonderful experience. I cannot list all the names here, but you are always on my mind.

Special thanks to the team at Packt for their contribution to this book.

Finally, I would like to thank my team, Salome et al., who have placed immense faith in the power of Big Data analytics and built cutting-edge data products.

Thank you, Lord, for always being there for me.

Beulah Salome Purra has over 11 years of experience and specializes in building highly scalable distributed systems. She has worked extensively on architecting multiple large-scale Big Data solutions for Fortune 100 companies, and her core expertise lies in Big Data analytics. In her current role at ATMECS, her focus is on building robust and scalable data products that extract value from huge data assets.

She can be reached at https://www.linkedin.com/in/beulahsalomep.

I am grateful to my parents, Rathnam and Padma, who have constantly encouraged and supported me throughout. I would like to thank my husband, Pratap, for his help on this book, his patience, love, and support; my brothers, Joel and Michael, for all their support.

I would like to profusely thank Pradeep Pasupuleti for mentoring me; working with him has been an enriching experience. I can't thank him enough for his constant encouragement, guidance, support, and for providing me an opportunity to work with him on this book.

Special thanks to David Hawke, Sanjay Singh, and Ravi Velagapudi—the leadership team at ATMECS—for their encouragement and support while I was writing this book.

Thanks to the editors and reviewers at Packt for all their effort in making this book better.

About the Reviewer

Dr. Kornel Amadeusz Skałkowski has a solid academic and industrial background. For more than 5 years, he worked as an assistant at the AGH University of Science and Technology in Krakow. In 2015, he obtained his PhD on machine learning-based adaptation of SOA systems. He has cooperated with several companies on various projects concerning intelligent systems, machine learning, and Big Data. Currently, he works as a Big Data developer for SAP SE.

He is the co-author of 19 papers concerning software engineering, SOA systems, and machine learning. He also works as a reviewer for the American Journal of Software Engineering and Applications. He has participated in numerous European and national scientific projects. His research interests include machine learning, Big Data, and software engineering.

I would like to kindly thank my family, relatives, and friends, for their endless patience and support during the reviewing of this book.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

Preface

The book Data Lake Development with Big Data is a practical guide to help you learn the essential architectural approaches to designing and building Data Lakes. It walks you through the various components of Data Lakes, such as data intake, management, consumption, and governance, with a specific focus on practical implementation scenarios.

A Data Lake is a highly scalable data platform for better search, analytical processing, and cheaper storage of huge volumes of multistructured data acquired from disparate sources.

Traditional Data Management systems are constrained by data silos, upfront data modeling, rigid data structures, and schema-on-write approaches to storing and processing data. This hampers the holistic analysis of data residing in multiple silos and excludes unstructured data sources from analysis. The data is generally modeled to answer known business questions.

With a Data Lake, there are no more data silos; all the data can be utilized to get a coherent view that can power a new generation of data-aware analytics applications. With a Data Lake, you don't have to know all the business questions in advance: the data can be modeled later using the schema-less (schema-on-read) approach, and it is possible to ask complex, far-reaching questions on all the data at any time to uncover hidden patterns and complex relationships in the data.
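
To make the schema-on-read idea concrete, here is a minimal PySpark sketch. It is our own illustration rather than an example from the book, and the path and field names are hypothetical; it loads raw JSON exactly as it landed in the lake and applies structure only at query time:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# No upfront modeling: Spark infers the schema when the data is read,
# not when it is written into the lake.
events = spark.read.json("hdfs:///datalake/raw/clickstream/")  # hypothetical path

# A question nobody anticipated at ingestion time can still be answered
(events
    .filter(F.col("event_type") == "purchase")   # hypothetical field names
    .groupBy("product_category")
    .count()
    .show())

In a schema-on-write warehouse, the same question would have required the table design, and therefore the question itself, to be known before the data was loaded.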

After reading this book, you will be able to address the shortcomings of traditional data systems through the best practices highlighted here for building a Data Lake. You will understand the complete lifecycle of architecting and building a Data Lake with Big Data technologies such as Hadoop, Storm, Spark, and Splunk. You will gain comprehensive knowledge of the various stages in a Data Lake, such as data intake, data management, and data consumption, with a focus on the practical use cases at each stage. You will benefit from the book's detailed coverage of data governance, data security, data lineage tracking, metadata management, data provisioning, and consumption.

As the Data Lake is such an advanced and complex topic, we are honored and excited to author the first book of its kind. At the same time, because the topic is so vast and there is no one-size-fits-all Data Lake architecture, it is very challenging to appeal to a wide audience. As this is a mini-series book with a limited page count, it is extremely difficult to cover every topic in detail. Given these constraints, we have taken a reader-centric approach in writing this book, because a broad understanding of the overall concept of a Data Lake is far more important than an in-depth understanding of all the technologies and architectural possibilities that go into building one.

Using this guiding principle, we refrained from in-depth coverage of any single topic, because we could not possibly do justice to it. At the same time, we organized the chapters to mimic the sequential flow of data in a typical organization, so that the reader can intuitively and quickly grasp the concepts of the Data Lake from an organizational data flow perspective. To make the abstract concepts relatable to the real world, we have followed a use case-based approach in which practical implementation scenarios of each key Data Lake component are explained. This, we believe, will help the reader quickly understand the architectural implications of the various Big Data technologies that are used for building these components.

What this book covers

Chapter 1, The Need for Data Lake, helps you understand what a Data Lake is, its architecture and key components, and the business contexts in which a Data Lake can be successfully deployed. You will also learn the limitations of traditional data architectures and how a Data Lake addresses some of these inadequacies and provides significant benefits.

Chapter 2, Data Intake, helps you understand the Intake Tier in detail, where we explore the process of bringing huge volumes of data into the Data Lake. You will learn the technology perspective of the various external data sources and the Hadoop-based data transfer mechanisms that pull or push data into the Data Lake.

Chapter 3, Data Integration, Quality, and Enrichment, explores the processes that are performed on vast quantities of data in the Management Tier. You will get a deeper understanding of the key technology aspects and components such as profiling, validation, integration, cleansing, standardization, and enrichment using Hadoop ecosystem components.

Chapter 4, Data Discovery and Consumption, helps you understand how data can be discovered, packaged, and provisioned for consumption by downstream systems. You will learn the key technology aspects, architectural guidance, and tools for data discovery and data provisioning functionalities.

Chapter 5, Data Governance, explores the details, need, and utility of data governance in a Data Lake environment. You will learn how metadata management, lineage tracking, and data lifecycle management govern the usability, security, integrity, and availability of the data through the data governance processes applied in the Data Lake. This chapter also explores how the current Data Lake can evolve in a futuristic setting.

What you need for this book

As this book covers only the architectural details and acts as a guide for decision-making, we have not provided any code examples. Hence, there is no explicit software prerequisite.

Who this book is for

Data Lake Development with Big Data is intended for architects and senior managers who are responsible for building a strategy around their current data architecture, helping them identify the need for Data Lake implementation in an organizational business context.

Good knowledge of master data management, information lifecycle management, data governance, data product design, data engineering, and systems architecture, as well as experience with Big Data technologies such as Hadoop, Spark, Splunk, and Storm, is necessary.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can include other contexts through the use of the include directive."

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Clicking the Next button moves you to the next screen."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. The Need for Data Lake

In this chapter, we will understand the rationale behind building a Data Lake in an organization that has huge data assets. The following topics will be covered in this chapter:

Explore the emerging need for a Data Lake by understanding the limitations of traditional architectures
Decipher how a Data Lake addresses the inadequacies of traditional architectures and provides significant benefits in terms of time and cost
Understand what a Data Lake is, along with its architecture
Get practical guidance on the key points to consider before deciding to build a Data Lake
Understand the key components that could be a part of a Data Lake and comprehend how crucial each of these components is to building a successful Data Lake

Before the Data Lake

In this section, let us quickly look at how the Data Lake has evolved from a historical perspective.

From the time data-intensive applications were used to solve business problems, we have seen many evolutionary steps in the way data has been stored, managed, analyzed, and visualized.

The earlier systems were designed to answer questions about the past. Questions such as "What were my total sales in the last year?" were answered by machines built around monolithic processors that ran COBOL, accessing data from tapes and disks. Since the dawn of faster processors and better storage, businesses have been able to slice and dice data to find fine-grained answers from subsets of data; these questions resembled: "What was the sales performance of x unit in y geography in z timeframe?"

If we extract one common pattern, it is that all the earlier systems were developed for business users, in order to help them make decisions for their businesses. The current breed of data systems empowers people like you and me to make decisions and improve the way we live. This is the ultimate paradigm shift, brought about by advances in myriad technologies.

For many of us, the technologies that run in the background are transparent, while we consult applications that help us make decisions that alter our immediate future profoundly. We use applications to help us navigate to an address (mapping), decide on our holidays (weather and holiday planning sites), get a summary of product reviews (review sites), get similar products (recommendation engines), connect and grow professionally (professional social networks), and the list goes on.

All these applications use enabling technologies that understand natural languages, process humongous amounts of data, store and effortlessly process our personal data such as images and audio, and even extract intelligence from them by tagging our faces and finding relationships. Each of us, in a way, contributes to the flooding of these application servers with our personal data in the form of our preferences, likes, affiliations, networks, hobbies, friends, images, and videos.

If we can attribute one fundamental cause for today's explosion of data, it should be the proliferation of ubiquitous internet connectivity and the smartphone; with them comes an exponential number of applications that transmit and store a variety of data.

Juxtaposing the growth of smartphones and the internet with the rapid decline of storage costs and the rise of cloud computing, which also brings down processing costs, we can immediately comprehend that the traditional data architectures do not scale to handle this volume and variety of data, and thus cannot answer the questions that you and I want answered. They work well, extremely well, for business users, but not directly for us.

In order to democratize the value hidden in data, and thus empower common customers to use data for day-to-day decision making, organizations should first store and extract value from the different types of data being collected in such huge quantities. For all this to happen, the following two key developments have had a revolutionary impact:

The development of distributed computing architectures that can scale linearly and perform computations at an unbelievable pace
The development of new-age algorithms that can analyze natural languages, comprehend the semantics of spoken words and special types, run Neural Nets, perform deep learning, graph social network interactions, perform constraint-based stochastic optimization, and so on

Earlier systems were simply not architected to scale linearly and store or analyze these many types of data. They are good for the purpose they were initially built for. They excelled as historical data stores that could offload structured data from Online Transaction Processing (OLTP) systems, perform transformations, cleanse it, slice, dice, and summarize it, and then feed it to Online Analytical Processing (OLAP) systems. Business Intelligence tools consume the exhaust of the OLAP systems and religiously spew out good-looking reports at regular intervals so that business users can make decisions.

We can immediately grasp the glaring differences between the earlier systems and the new-age systems by looking at these major aspects:

The storage and processing differ in the way they scale (distributed versus monolithic)
In earlier systems, data is managed in relational systems, versus NoSQL, MPP, and CEP systems in the new-age Big Data systems
Traditional systems cannot handle the high-velocity data that is efficiently ingested and processed by Big Data applications
Structured data is predominantly used in earlier systems, versus unstructured data being used in Big Data systems alongside structured data
Traditional systems have limitations on the scale of data that they can handle; Big Data systems are scalable and can handle humongous amounts of data
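
As a rough illustration of the last two differences, a single distributed engine can process structured records and unstructured text side by side. The following PySpark sketch is our own hypothetical example (the paths and column names are assumptions, not taken from the book):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mixed-data").getOrCreate()

# Structured data: CSV sales records, the kind a warehouse handles well
sales = (spark.read.option("header", True)
         .csv("hdfs:///datalake/raw/sales/")              # hypothetical path
         .withColumn("amount", F.col("amount").cast("double")))

# Unstructured data: free-text support tickets, stored in the same cluster
tickets = spark.read.text("hdfs:///datalake/raw/tickets/")  # hypothetical path

# Both computations scale out across nodes rather than up on one machine
sales.groupBy("region").agg(F.sum("amount").alias("total_sales")).show()
print(tickets.filter(F.col("value").contains("refund")).count())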