Description

A comprehensive guide to design, build and execute effective Big Data strategies using Hadoop


Key Features

Get an in-depth view of the Apache Hadoop ecosystem and an overview of the architectural patterns pertaining to the popular Big Data platform

Conquer different data processing and analytics challenges using a multitude of tools such as Apache Spark, Elasticsearch, Tableau and more

A comprehensive, step-by-step guide that will teach you everything you need to know, to be an expert Hadoop Architect

Book Description


The complex structure of data these days requires sophisticated solutions for data transformation to make the information more accessible to users. This book empowers you to build such solutions with relative ease with the help of Apache Hadoop, along with a host of other Big Data tools.


This book will give you a complete understanding of data lifecycle management with Hadoop, followed by modeling of structured and unstructured data in Hadoop. It will also show you how to design real-time streaming pipelines by leveraging tools such as Apache Spark, and build efficient enterprise search solutions using Elasticsearch. You will learn to build enterprise-grade analytics solutions on Hadoop, and how to visualize your data using tools such as Apache Superset. This book also covers techniques for deploying your Big Data solutions on the cloud with Apache Ambari, as well as expert techniques for managing and administering your Hadoop cluster.


By the end of this book, you will have all the knowledge you need to build expert Big Data systems.


What you will learn

Build an efficient enterprise Big Data strategy centered around Apache Hadoop

Gain a thorough understanding of using Hadoop with various Big Data frameworks such as Apache Spark, Elasticsearch and more

Set up and deploy your Big Data environment on premises or on the cloud with Apache Ambari

Design effective streaming data pipelines and build your own enterprise search solutions

Utilize the historical data to build your analytics solutions and visualize them using popular tools such as Apache Superset

Plan, set up and administer your Hadoop cluster efficiently

Who this book is for


This book is for Big Data professionals who want to fast-track their career in the Hadoop industry and become expert Big Data architects. Project managers and mainframe professionals looking to build a career in Big Data and Hadoop will also find this book useful. Some understanding of Hadoop is required to get the best out of this book.






Modern Big Data Processing with Hadoop

 

 

 

Expert techniques for architecting end-to-end big data solutions to get valuable insights

 

 

 

 

 

 

 

 

V. Naresh Kumar
Prashant Shindgikar

 

 

 

 

 

 

 

 

 

 

BIRMINGHAM - MUMBAI

Modern Big Data Processing with Hadoop

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Varsha Shetty
Content Development Editor: Cheryl Dsa
Technical Editor: Sagar Sawant
Copy Editors: Vikrant Phadke, Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Graphics: Tania Dutta
Production Coordinator: Arvindkumar Gupta

First published: March 2018

Production reference: 1280318

Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.

ISBN 978-1-78712-276-5

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the authors

V. Naresh Kumar has more than a decade of professional experience in designing, implementing, and running very-large-scale Internet applications in Fortune 500 companies. He is a full-stack architect with hands-on experience in e-commerce, web hosting, healthcare, big data, analytics, data streaming, advertising, and databases. He admires open source and contributes to it actively. He keeps himself updated with emerging technologies, from Linux system internals to frontend technologies. He studied at BITS Pilani, Rajasthan, earning a dual degree in computer science and economics.

 

 

 

 

 

 

Prashant Shindgikar is an accomplished big data architect with over 20 years of experience in data analytics. He specializes in data innovation and resolving data challenges for major retail brands. He is a hands-on architect with an innovative approach to solving data problems. He provides thought leadership and pursues strategies for engagement with senior executives on innovation in data processing and analytics. He presently works for a large USA-based retail company.

About the reviewers

Sumit Pal is a published author with Apress. He has 22+ years of experience in software from startups to enterprises and is an independent consultant working with big data, data visualization, and data science. He builds end-to-end data-driven analytic systems.

He has worked for Microsoft (SQLServer), Oracle (OLAP Kernel), and Verizon. He advises clients on their data architectures and builds solutions in Spark and Scala. He has spoken at many conferences in North America and Europe and has developed a big data analyst training for Experfy. He has an MS and BS in computer science.

 

 

 

 

Manoj R. Patil is a big data architect at TatvaSoft—an IT services and consulting firm. He has a bachelor's degree in engineering from COEP, Pune. He is a proven and highly skilled business intelligence professional with 18 years of experience in IT. He is a seasoned BI and big data consultant with exposure to all the leading platforms.

Earlier, he served organizations such as Tech Mahindra and Persistent Systems. Apart from authoring a book on Pentaho and big data, he has been an avid reviewer for different titles in the respective fields from Packt and other leading publishers.

 

Manoj would like to thank his entire family, especially his two beautiful angels, Ayushee and Ananyaa, for understanding him during the review process. He would also like to thank Packt for this opportunity, as well as the project coordinator and the author.

 

 

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Modern Big Data Processing with Hadoop

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the authors

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Enterprise Data Architecture Principles

Data architecture principles

Volume

Velocity

Variety

Veracity

The importance of metadata

Data governance

Fundamentals of data governance

Data security

Application security

Input data

Big data security

RDBMS security

BI security

Physical security

Data encryption

Secure key management

Data as a Service

Data architecture evolution with Hadoop

Hierarchical database architecture

Network database architecture

Relational database architecture

Employees

Devices

Department

Department and employee mapping table

Hadoop data architecture

Data layer

Data management layer

Job execution layer

Summary

Hadoop Life Cycle Management

Data wrangling

Data acquisition

Data structure analysis

Information extraction

Unwanted data removal

Data transformation

Data standardization

Data masking

Substitution

Static 

Dynamic

Encryption

Hashing

Hiding

Erasing

Truncation

Variance

Shuffling

Data security

What is Apache Ranger?

Apache Ranger installation using Ambari

Ambari admin UI

Add service

Service placement

Service client placement

Database creation on master

Ranger database configuration

Configuration changes

Configuration review

Deployment progress

Application restart

Apache Ranger user guide

Login to UI

Access manager

Service details

Policy definition and auditing for HDFS

Summary

Hadoop Design Considerations

Understanding data structure principles

Installing Hadoop cluster

Configuring Hadoop on NameNode

Format NameNode

Start all services

Exploring HDFS architecture

Defining NameNode

Secondary NameNode

NameNode safe mode

DataNode

Data replication

Rack awareness

HDFS WebUI

Introducing YARN

YARN architecture

Resource manager

Node manager

Configuration of YARN

Configuring HDFS high availability

During Hadoop 1.x

During Hadoop 2.x and onwards

HDFS HA cluster using NFS

Important architecture points

Configuration of HA NameNodes with shared storage

HDFS HA cluster using the quorum journal manager

Important architecture points

Configuration of HA NameNodes with QJM

Automatic failover

Important architecture points

Configuring automatic failover

Hadoop cluster composition

Typical Hadoop cluster

Best practices for Hadoop deployment

Hadoop file formats

Text/CSV file

JSON

Sequence file

Avro

Parquet

ORC

Which file format is better?

Summary

Data Movement Techniques

Batch processing versus real-time processing

Batch processing

Real-time processing

Apache Sqoop

Sqoop Import

Import into HDFS

Import a MySQL table into an HBase table

Sqoop export

Flume

Apache Flume architecture

Data flow using Flume

Flume complex data flow architecture

Flume setup

Log aggregation use case

Apache NiFi

Main concepts of Apache NiFi

Apache NiFi architecture

Key features

Real-time log capture dataflow

Kafka Connect

Kafka Connect – a brief history

Why Kafka Connect?

Kafka Connect features

Kafka Connect architecture

Kafka Connect workers modes

Standalone mode

Distributed mode

Kafka Connect cluster distributed architecture

Example 1

Example 2

Summary

Data Modeling in Hadoop

Apache Hive

Apache Hive and RDBMS

Supported datatypes

How Hive works

Hive architecture

Hive data model management

Hive tables

Managed tables

External tables

Hive table partition

Hive static partitions and dynamic partitions

Hive partition bucketing

How Hive bucketing works

Creating buckets in a non-partitioned table

Creating buckets in a partitioned table

Hive views

Syntax of a view

Hive indexes

Compact index

Bitmap index

JSON documents using Hive

Example 1 – Accessing simple JSON documents with Hive (Hive 0.14 and later versions)

Example 2 – Accessing nested JSON documents with Hive (Hive 0.14 and later versions)

Example 3 – Schema evolution with Hive and Avro (Hive 0.14 and later versions)

Apache HBase

Differences between HDFS and HBase

Differences between Hive and HBase

Key features of HBase

HBase data model

Difference between RDBMS table and column-oriented data store

HBase architecture

HBase architecture in a nutshell

HBase rowkey design

Example 4 – loading data from MySQL table to HBase table

Example 5 – incrementally loading data from MySQL table to HBase table

Example 6 – Load the MySQL customer changed data into the HBase table

Example 7 – Hive HBase integration

Summary

Designing Real-Time Streaming Data Pipelines

Real-time streaming concepts

Data stream

Batch processing versus real-time data processing

Complex event processing 

Continuous availability

Low latency

Scalable processing frameworks

Horizontal scalability

Storage

Real-time streaming components

Message queue

So what is Kafka?

Kafka features

Kafka architecture

Kafka architecture components

Kafka Connect deep dive

Kafka Connect architecture

Kafka Connect workers standalone versus distributed mode

Install Kafka

Create topics

Generate messages to verify the producer and consumer

Kafka Connect using file Source and Sink

Kafka Connect using JDBC and file Sink Connectors

Apache Storm

Features of Apache Storm

Storm topology

Storm topology components

Installing Storm on a single node cluster

Developing a real-time streaming pipeline with Storm

Streaming a pipeline from Kafka to Storm to MySQL

Streaming a pipeline with Kafka to Storm to HDFS

Other popular real-time data streaming frameworks

Kafka Streams API

Spark Streaming

Apache Flink

Apache Flink versus Spark

Apache Spark versus Storm

Summary

Large-Scale Data Processing Frameworks

MapReduce

Hadoop MapReduce

Streaming MapReduce

Java MapReduce

Summary

Apache Spark 2

Installing Spark using Ambari

Service selection in Ambari Admin

Add Service Wizard

Server placement

Clients and Slaves selection

Service customization

Software deployment

Spark installation progress

Service restarts and cleanup

Apache Spark data structures

RDDs, DataFrames and datasets

Apache Spark programming

Sample data for analysis

Interactive data analysis with pyspark

Standalone application with Spark

Spark streaming application

Spark SQL application

Summary

Building an Enterprise Search Platform

The data search concept

The need for an enterprise search engine

Tools for building an enterprise search engine

Elasticsearch

Why Elasticsearch?

 Elasticsearch components

Index

Document

Mapping

Cluster

Type

How to index documents in Elasticsearch?

Elasticsearch installation

Installation of Elasticsearch

Create index

Primary shard

Replica shard

Ingest documents into index

Bulk Insert

Document search

Meta fields

Mapping

Static mapping

Dynamic mapping

Elasticsearch-supported data types

Mapping example

Analyzer

Elasticsearch stack components

Beats

Logstash

Kibana

Use case

Summary

Designing Data Visualization Solutions

Data visualization

Bar/column chart

Line/area chart

Pie chart

Radar chart

Scatter/bubble chart

Other charts

Practical data visualization in Hadoop

Apache Druid

Druid components

Other required components

Apache Druid installation

Add service

Select Druid and Superset

Service placement on servers

Choose Slaves and Clients

Service configurations

Service installation

Installation summary

Sample data ingestion into Druid

MySQL database

Sample database

Download the sample dataset

Copy the data to MySQL

Verify integrity of the tables

Single Normalized Table

Apache Superset

Accessing the Superset application

Superset dashboards

Understanding Wikipedia edits data

Create Superset Slices using Wikipedia data

Unique users count

Word Cloud for top US regions

Sunburst chart – top 10 cities

Top 50 channels and namespaces via directed force layout

Top 25 countries/channels distribution

Creating wikipedia edits dashboard from Slices

Apache Superset with RDBMS

Supported databases

Understanding employee database

Employees table

Departments table

Department manager table

Department Employees Table

Titles table

Salaries table

Normalized employees table

Superset Slices for employees database

Register MySQL database/table

Slices and Dashboard creation

Department salary breakup

Salary Diversity

Salary Change Per Role Per Year

Dashboard creation

Summary

Developing Applications Using the Cloud

What is the Cloud?

Available technologies in the Cloud

Planning the Cloud infrastructure

Dedicated servers versus shared servers

Dedicated servers

Shared servers

High availability

Business continuity planning

Infrastructure unavailability

Natural disasters

Business data

BCP design example

The Hot–Hot system

The Hot–Cold system

Security

Server security

Application security

Network security

Single Sign On

The AAA requirement

Building a Hadoop cluster in the Cloud

Google Cloud Dataproc

Getting a Google Cloud account

Activating the Google Cloud Dataproc service

Creating a new Hadoop cluster

Logging in to the cluster

Deleting the cluster 

Data access in the Cloud

Block storage

File storage

Encrypted storage

Cold storage

Summary

Production Hadoop Cluster Deployment

Apache Ambari architecture

The Ambari server

Daemon management

Software upgrade

Software setup

LDAP/PAM/Kerberos management

Ambari backup and restore

Miscellaneous options

Ambari Agent

Ambari web interface

Database

Setting up a Hadoop cluster with Ambari

Server configurations

Preparing the server 

Installing the Ambari server 

Preparing the Hadoop cluster

Creating the Hadoop cluster 

Ambari web interface

The Ambari home page

Creating a cluster

Managing users and groups

Deploying views

The cluster install wizard

Naming your cluster

Selecting the Hadoop version 

Selecting a server 

Setting up the node

Selecting services

Service placement on nodes

Selecting slave and client nodes 

Customizing services

Reviewing the services

Installing the services on the nodes

Installation summary

The cluster dashboard

Hadoop clusters

A single cluster for the entire business

Multiple Hadoop clusters

Redundancy

A fully redundant Hadoop cluster

A data redundant Hadoop cluster

Cold backup

High availability

Business continuity

Application environments

Hadoop data copy

HDFS data copy

Summary

Preface

The complex structure of data these days requires sophisticated solutions for data transformation and its semantic representation to make information more accessible to users. Apache Hadoop, along with a host of other big data tools, empowers you to build such solutions with relative ease. This book lists some unique ideas and techniques that enable you to conquer different data processing and analytics challenges on your path to becoming an expert big data architect.

The book begins by quickly laying down the principles of enterprise data architecture and showing how they are related to the Apache Hadoop ecosystem. You will get a complete understanding of data life cycle management with Hadoop, followed by modeling structured and unstructured data in Hadoop. The book will also show you how to design real-time streaming pipelines by leveraging tools such as Apache Spark, as well as building efficient enterprise search solutions using tools such as Elasticsearch. You will build enterprise-grade analytics solutions on Hadoop and learn how to visualize your data using tools such as Tableau and Python.

This book also covers techniques for deploying your big data solutions on-premise and on the cloud, as well as expert techniques for managing and administering your Hadoop cluster. 

By the end of this book, you will have all the knowledge you need to build expert big data systems that cater to any data or insight requirements, leveraging the full suite of modern big data frameworks and tools. You will have the necessary skills and know-how to become a true big data expert.

Who this book is for

This book is for big data professionals who want to fast-track their career in the Hadoop industry and become expert big data architects. Project managers and mainframe professionals looking to build a career in big data and Hadoop will also find this book useful. Some understanding of Hadoop is required to get the best out of this book.

What this book covers

Chapter 1, Enterprise Data Architecture Principles, describes the architecture principles of enterprise data and the importance of governing and securing that data.

Chapter 2, Hadoop Life Cycle Management, covers various data life cycle stages, including when the data is created, shared, maintained, archived, retained, and deleted. It also further details data security tools and patterns.

Chapter 3, Hadoop Design Considerations, covers key data architecture principles and practices. The reader will learn how modern data architectures adapt to big data use cases.

Chapter 4, Data Movement Techniques, covers different methods to transfer data to and from our Hadoop cluster to utilize its real power.

Chapter 5, Data Modeling in Hadoop, shows how to store and model data in Hadoop clusters.

Chapter 6, Designing Real-Time Streaming Data Pipelines, covers different tools and techniques of designing real-time data analytics.

Chapter 7, Large-Scale Data Processing Frameworks, covers different data processing solutions to derive value out of our data.

Chapter 8, Building an Enterprise Search Platform, gives a detailed architecture design to build search solutions using Elasticsearch.

Chapter 9, Designing Data Visualization Solutions, covers different ways to visualize your data and the factors involved in choosing the correct visualization method.

Chapter 10, Developing Applications Using the Cloud, shows how to build enterprise applications using cloud infrastructure.

Chapter 11, Production Hadoop Cluster Deployment, shows how to deploy your Hadoop cluster using Apache Ambari.

To get the most out of this book

It helps to have Hadoop properly installed, as explained in the earlier chapters. Any knowledge of Hadoop, whether detailed or basic, will serve as an added advantage.

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packtpub.com.

2. Select the SUPPORT tab.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Modern-Big-Data-Processing-with-Hadoop. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/ModernBigDataProcessingwithHadoop_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-$YARN_HOME/etc/hadoop}"
export HADOOP_COMMON_HOME="${HADOOP_COMMON_HOME:-$YARN_HOME}"
export HADOOP_HDFS_HOME="${HADOOP_HDFS_HOME:-$YARN_HOME}"

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

$ hadoop fs -cat /tmp/output-7/part*
NewDelhi, 440
Kolkata, 390
Bangalore, 270

Any command-line input or output is written as follows:

useradd hadoop
passwd hadoop1

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.

Enterprise Data Architecture Principles

Traditionally, enterprises have embraced data warehouses to store, process, and access large volumes of data. These warehouses are typically large RDBMS databases capable of storing a wide variety of datasets at very large scale. As data complexity, volume, and access patterns have increased, many enterprises have started adopting big data as a model to redesign their data organization and define the necessary policies around it.

This figure depicts how a typical data warehouse looks in an Enterprise:

As Enterprises have many different departments, organizations, and geographies, each one tends to own a warehouse of its own, which presents a variety of challenges to the Enterprise as a whole. For example:

Multiple sources and destinations of data

Data duplication and redundancy

Data access regulatory issues

Non-standard data definitions across the Enterprise

Software and hardware scalability and reliability issues

Data movement and auditing

Integration between various warehouses

It is becoming very easy to build very-large-scale systems at a much lower cost than was possible a few decades ago, due to several advancements in technology, such as:

Cost per terabyte

Computation power per nanometer

Gigabits of network bandwidth

Cloud

With globalization, markets and consumers have gone global, which has increased an Enterprise's reach manifold. These advancements also pose several challenges to Enterprises in terms of:

Human capital management

Warehouse management

Logistics management

Data privacy and security

Sales and billing management

Understanding demand and supply

In order to stay on top of market demands, Enterprises have started collecting more and more metrics about themselves; as a result, the number of dimensions along which data is collected and analyzed keeps increasing.

In this chapter, we will learn:

Data architecture principles

The importance of metadata

Data governance

Data security

Data as a Service

Data architecture evolution with Hadoop

Data architecture principles

Data in its current state can be described along the following four dimensions (the four Vs).

Volume

The volume of data is an important measure when designing a big data system. It is a key factor that decides the investment an Enterprise has to make to cater to present and future storage requirements.

Different types of data in an enterprise need different capacities for storage, archiving, and processing. Petabyte-scale storage systems are very common in the industry today, a scale that was almost impossible to reach a few decades ago.

Velocity

This is another dimension of the data, one that decides its mobility. The varieties of data that exist within organizations fall under the following categories:

Streaming data:

Real-time/near-real-time data

Data at rest:

Immutable data

Mutable data

This dimension has some impact on the network architecture that an Enterprise uses to consume and process data.

Variety

This dimension talks about the form and shape of the data. We can further classify this into the following categories:

Streaming data:

On-wire data format (for example, JSON, MPEG, and Avro)

Data at rest:

Immutable data (for example, media files and customer invoices)

Mutable data (for example, customer details, product inventory, and employee data)

Application data:

Configuration files, secrets, passwords, and so on

As an organization, it's very important to standardize on as few technologies as possible to reduce the variety of data. Having many different types of data poses a big challenge to an Enterprise in terms of managing and consuming it all.

Veracity

This dimension talks about the accuracy of the data. Without a solid understanding of the guarantees that each system within an Enterprise provides to keep the data safe, available, and reliable, it becomes very difficult to trust the analytics generated out of this data and to further generate insights.

Necessary auditing should be in place to make sure that the data that flows through the system passes all the quality checks before it finally lands in the big data system.

Let's see how a typical big data system looks:

As you can see, many different types of applications are interacting with the big data system to store, process, and generate analytics.

The importance of metadata

Before we try to understand the importance of metadata, let's try to understand what metadata is. Metadata is simply data about data. This may sound confusing, since the definition is recursive.

In a typical big data system, we have these three levels of verticals:

Applications writing data to a big data system

Organizing data within the big data system

Applications consuming data from the big data system

This brings up a few challenges as we are talking about millions (even billions) of data files/segments that are stored in the big data system. We should be able to correctly identify the ownership and usage of these data files across the Enterprise.

Let's take the example of a TV broadcasting company that owns a TV channel; it creates television shows and broadcasts them to its target audience over wired cable networks, satellite networks, the internet, and so on. If we look carefully, the source of the content is only one, but it travels through all possible mediums and finally reaches the user's location for viewing on a TV, mobile phone, tablet, and so on.

Since the viewers are accessing this TV content on a variety of devices, the applications running on these devices can generate several messages to indicate various user actions and preferences, and send them back to the application server. This data is pretty huge and is stored in a big data system.

Depending on how the data is organized within the big data system, it can be almost impossible for outside applications or peer applications to know about the different types of data being stored within the system. In order to make this easier, we need to describe and define how data organization takes place within the big data system. This will help us better understand data organization and access within the big data system.

Let's extend this example even further and say there is another application that reads from the big data system to understand the best times to advertise in a given TV series. This application should have a better understanding of all other data that is available within the big data system. So, without having a well-defined metadata system, it's very difficult to do the following things:

Understand the diversity of data that is stored, accessed, and processed

Build interfaces across different types of datasets

Correctly tag the data from a security perspective as highly sensitive or insensitive data

Connect the dots between the given sets of systems in the big data ecosystem

Audit and troubleshoot issues that might arise because of data inconsistency
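
As a small illustration of the tagging point above, HDFS extended attributes can carry ownership and sensitivity labels alongside the data itself. The commands below are a minimal sketch; the directory and attribute names are hypothetical, and a real deployment would usually pair this with a dedicated metadata catalog rather than rely on raw attributes alone:

# Tag a dataset directory with its owning team and a sensitivity label (names are illustrative)
hdfs dfs -setfattr -n user.owning_team -v streaming-apps /data/tv/viewer-events
hdfs dfs -setfattr -n user.sensitivity -v high /data/tv/viewer-events

# Read the tags back when auditing access or building interfaces on top of the data
hdfs dfs -getfattr -d /data/tv/viewer-events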

Data governance

Having very large volumes of data is not enough to make very good decisions that have a positive impact on the success of a business. It's very important to make sure that only quality data is collected, preserved, and maintained. The data collection process also evolves over time as new types of data need to be collected. During this process, we might break a few interfaces that read from the previous generation of data. Without a well-defined process and the right people, handling data becomes a big challenge for organizations of all sizes.

To excel in managing data, we should consider the following qualities:

Good policies and processes

Accountability

Formal decision structures

Enforcement of rules in management

The implementation of these types of qualities is called data governance. At a high level, we'll define data governance as managing data well. This definition also helps us clarify that data management and data governance are not the same thing. Managing data is concerned with the use of data to make good business decisions and ultimately run organizations. Data governance is concerned with the degree to which we use disciplined behavior across our entire organization in how we manage that data.

It's an important distinction. So what's the bottom line? Most organizations manage data, but far fewer govern those management techniques well.

Fundamentals of data governance

Let's try to understand the fundamentals of data governance:

Accountability

Standardization

Transparency

Transparency ensures that everyone within and outside the organization understands their role when interacting with the data that is related to the organization. This will ensure the following things:

Building trust

Avoiding surprises

Accountability makes sure that it is clearly defined what teams and employees who have access to data can and cannot do with it.

Standardization deals with how the data is properly labeled, described, and categorized. One example is how email addresses are generated for the employees within an organization. One way is to use [email protected], or some other combination of first name and last name. This ensures that everyone who sees these email addresses understands which part is the first name and which is the last name, without anybody explaining it in person.

Standardization improves the quality of data and brings order to multiple data dimensions.

Data security

Security is not a new concept; it has been practiced since the design of the early UNIX time-sharing operating systems. In the recent past, security awareness among individuals and organizations has increased due to widespread data breaches that have led to significant revenue loss.

Security, as a general concept, can be applied to many different things. When it comes to data security, we need to understand the following fundamental questions:

What types of data exist?

Who owns the data?

Who has access to the data?

When does the data exit the system?

Is the data physically secured?

Let's have a look at a simple big data system and try to understand these questions in more detail. The scale of these systems makes security a nightmare for everyone, so we should have proper policies in place to keep everyone on the same page:

In this example, we have the following components:

Heterogeneous applications running across the globe in multiple geographical regions.

Large volume and variety of input data is generated by the applications.

All the data is ingested into a big data system.

ETL/ELT applications consume the data from a big data system and put the consumable results into RDBMS (this is optional).

Business intelligence applications read from this storage and further generate insights into the data. These are the ones that power the leadership team's decisions.

You can imagine the scale and volume of data that flows through this system. We can also see that the number of servers, applications, and employees that participate in this whole ecosystem is very large. If we do not have proper policies in place, it's not an easy task to secure such a complicated system.

Also, if an attacker uses social engineering to gain access to the system, we should make sure that data access is limited to the lowest possible level. When poor security implementations are in place, attackers can gain access to virtually all business secrets, which could be a serious loss to the business.

Consider an example: a start-up building a next-generation computing device hosts all its data on the cloud and does not have proper security policies in place. When an attacker compromises the security of its cloud servers, they can easily figure out what is being built by the start-up and steal that intelligence. Once the intelligence is stolen, we can imagine how hackers might use it for their own benefit.

With this understanding of security's importance, let's define what needs to be secured.

Application security

Applications are the front line of product-based organizations, since consumers use these applications to interact with the products and services the organization provides. We have to ensure that proper security standards are followed while programming these application interfaces.

Since these applications send data to backend systems, we should make sure that only proper access mechanisms are allowed, for example through firewalls.

Also, since these applications interact with many other backend systems, we have to ensure that the correct data related to the user is shown. This boils down to implementing proper authentication and authorization, not only for the user but also for the application when it accesses different types of an organization's resources.

Without proper auditing in place, it is very difficult to analyze the data access patterns of the applications. All logs should be collected at a central place, away from the application servers, and can be further ingested into the big data system.

Input data

The metrics generated by applications can be temporarily stored locally and later consumed by periodic processes, or they can be pushed directly to streaming systems such as Kafka.

In this case, we should carefully think through and design where the data is stored and which users can have access to it. If we are further writing this data to systems like Kafka or MQ, we have to make sure that additional authentication, authorization, and access controls are in place.

Here we can leverage the operating-system-provided security measures such as process user ID, process group ID, filesystem user ID, group ID, and also advanced systems (such as SELinux) to further restrict access to the input data.
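
As a minimal sketch of these operating-system-level controls on a single application host (the user, group, and directory names are hypothetical):

# Dedicated service account and group for the metrics collection agent
groupadd metrics
useradd -g metrics -s /sbin/nologin metrics-agent

# Spool directory readable only by the agent and members of the metrics group
mkdir -p /var/spool/app-metrics
chown metrics-agent:metrics /var/spool/app-metrics
chmod 750 /var/spool/app-metrics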

Big data security

Depending on which data warehouse solution is chosen, we have to ensure that only authorized applications and users can write to and read from the data warehouse. Proper security policies and auditing should be in place to make sure that this large scale of data is not easily accessible to everyone.

In order to implement all these access policies, we can use operating-system-provided mechanisms such as file access controls and user access controls. Since we're talking about geographically distributed big data systems, we also have to design centralized authentication systems to provide a seamless experience for employees when interacting with these big data systems.
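
For HDFS-backed storage, the basic permission model plus extended ACLs covers most of these needs. The following is a minimal sketch, assuming ACLs are enabled on the cluster (dfs.namenode.acls.enabled=true); the paths, users, and groups are hypothetical:

# Restrict a warehouse directory to its owning service account and group
hdfs dfs -chown -R etl:warehouse /warehouse/sales
hdfs dfs -chmod -R 750 /warehouse/sales

# Grant a single analyst read-only access without widening group membership
hdfs dfs -setfacl -R -m user:analyst1:r-x /warehouse/sales
hdfs dfs -getfacl /warehouse/sales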

RDBMS security

Many RDBMSes are highly secure and can provide the following access levels to users:

Database

Table

Usage pattern

They also have built-in auditing mechanisms to tell which users have accessed what types of data and when. This data is vital to keeping the systems secure, and proper monitoring should be in place to keep a watch on these systems' health and safety.
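
To make the database- and table-level access controls listed above concrete, here is a minimal sketch using the MySQL client; the account, database, table names, and password are hypothetical:

# Create a read-only reporting account (illustrative names and password)
mysql -u root -p -e "CREATE USER 'bi_reader'@'%' IDENTIFIED BY 'S3cret!';"

# Table-level grant: the account may only read one reporting table
mysql -u root -p -e "GRANT SELECT ON sales.daily_summary TO 'bi_reader'@'%';"

# Review what the account is allowed to do
mysql -u root -p -e "SHOW GRANTS FOR 'bi_reader'@'%';"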

BI security

These can be applications built in-house for specific needs of the company, or external applications that can power the insights that business teams are looking for. These applications should also be properly secured by practicing single sign-on, role-based access control, and network-based access control.

Since the insights these applications provide are crucial to the success of the organization, proper security measures should be taken to protect them.

So far, we have seen the different parts of an enterprise system and the practices that can be followed to improve the security of the overall enterprise data design. Let's now talk about some common measures that can be applied everywhere in the data design.

Physical security

This deals with physical device access, data center access, server access, and network access. If an unauthorized person gains access to the equipment owned by an Enterprise, they can gain access to all the data that is present in it.

As we have seen in the previous sections, when an operating system is running, we are able to protect the resources by leveraging the security features of the operating system. When an intruder gains physical access to the devices (or even decommissioned servers), they can connect these devices to another operating system that's in their control and access all the data that is present on our servers.

Care must be taken when we decommission servers, as there are ways in which data that's written to these devices (even after formatting) can be recovered. So we should follow industry-standard device erasing techniques to properly clean all of the data that is owned by enterprises.
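
As one example of such an erasing step for a decommissioned spinning disk, a multi-pass overwrite with a tool like shred can be used (SSDs generally require the vendor's secure-erase mechanism instead); the device name below is a placeholder:

# Overwrite the whole device three times and finish with a pass of zeros, showing progress
shred -v -n 3 -z /dev/sdX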

In order to prevent those, we should consider encrypting data.

Data encryption

Encrypting data will ensure that even if unauthorized persons gain access to the devices, they will not be able to recover the data. This is a standard practice that is followed nowadays due to the increased mobility of data and employees. Many big Enterprises encrypt the hard disks of laptops and mobile phones.
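
Within a Hadoop cluster, HDFS transparent encryption offers comparable at-rest protection for selected directories. The sketch below assumes a Hadoop KMS is already configured; the key name and path are hypothetical:

# Create an encryption key in the Hadoop KMS
hadoop key create warehouse-key -size 256

# Turn an empty directory into an encryption zone backed by that key
hdfs dfs -mkdir -p /secure/finance
hdfs crypto -createZone -keyName warehouse-key -path /secure/finance

# Files written under /secure/finance are now encrypted at rest
hdfs crypto -listZones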

Secure key management

If you have worked with any applications that need authentication, you will have used a combination of username and password to access the services. Typically, these secrets are stored within the source code itself. This poses a particular challenge for programs that are not compiled, as attackers can easily read the username and password and gain access to our resources.

Many enterprises have started adopting centralized key management systems (KMS), which applications can query to gain access to resources that are protected by authentication. All these access patterns are properly audited by the KMS.

Employees should also use their own credentials when accessing these systems and resources. This makes sure that secret keys are protected and accessible only to authorized applications and users.
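
Within the Hadoop ecosystem, the credential provider framework is one way to keep such secrets out of source code and configuration files. A minimal sketch, with a hypothetical keystore path and alias, looks like this:

# Store a database password in a Java keystore on HDFS (the command prompts for the value)
hadoop credential create db.password -provider jceks://hdfs/user/etl/creds.jceks

# List the aliases held in the store; the secret values themselves are never printed
hadoop credential list -provider jceks://hdfs/user/etl/creds.jceks

Components such as Sqoop can then reference the alias through the provider path instead of receiving the password directly on the command line.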

Data as a Service

Data as a Service (DaaS) is a concept that has become popular in recent times due to the increased adoption of the cloud. When it comes to data, it might seem a little confusing how data can fit into an as-a-service model.

DaaS offers great flexibility to users of the service: they do not need to worry about the scale, performance, and maintenance of the underlying infrastructure that the service runs on. The infrastructure takes care of this automatically and, since we are dealing with a cloud model, we get all the benefits of the cloud, such as pay-as-you-go pricing and simpler capacity planning. This reduces the burden of data management.