A comprehensive guide to design, build and execute effective Big Data strategies using Hadoop
Key Features
Get an in-depth view of the Apache Hadoop ecosystem and an overview of the architectural patterns pertaining to the popular Big Data platform
Conquer different data processing and analytics challenges using a multitude of tools such as Apache Spark, Elasticsearch, Tableau and more
A comprehensive, step-by-step guide that will teach you everything you need to know to be an expert Hadoop Architect
Book Description
The complex structure of data these days requires sophisticated solutions for data transformation, to make the information more accessible to users. This book empowers you to build such solutions with relative ease with the help of Apache Hadoop, along with a host of other Big Data tools.
This book will give you a complete understanding of data lifecycle management with Hadoop, followed by modeling of structured and unstructured data in Hadoop. It will also show you how to design real-time streaming pipelines by leveraging tools such as Apache Spark, and build efficient enterprise search solutions using Elasticsearch. You will learn to build enterprise-grade analytics solutions on Hadoop, and how to visualize your data using tools such as Apache Superset. This book also covers techniques for deploying your Big Data solutions on the cloud with Apache Ambari, as well as expert techniques for managing and administering your Hadoop cluster.
By the end of this book, you will have all the knowledge you need to build expert Big Data systems.
What you will learn
Build an efficient enterprise Big Data strategy centered around Apache Hadoop
Gain a thorough understanding of using Hadoop with various Big Data frameworks such as Apache Spark, Elasticsearch and more
Set up and deploy your Big Data environment on premises or on the cloud with Apache Ambari
Design effective streaming data pipelines and build your own enterprise search solutions
Utilize the historical data to build your analytics solutions and visualize them using popular tools such as Apache Superset
Plan, set up and administer your Hadoop cluster efficiently
Who this book is for
This book is for Big Data professionals who want to fast-track their career in the Hadoop industry and become expert Big Data architects. Project managers and mainframe professionals looking to build a career in Big Data and Hadoop will also find this book useful. Some understanding of Hadoop is required to get the best out of this book.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Varsha Shetty
Content Development Editor: Cheryl Dsa
Technical Editor: Sagar Sawant
Copy Editors: Vikrant Phadke, Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Graphics: Tania Dutta
Production Coordinator: Arvindkumar Gupta
First published: March 2018
Production reference: 1280318
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78712-276-5
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
V. Naresh Kumar has more than a decade of professional experience in designing, implementing, and running very-large-scale Internet applications in Fortune 500 companies. He is a full-stack architect with hands-on experience in e-commerce, web hosting, healthcare, big data, analytics, data streaming, advertising, and databases. He admires open source and contributes to it actively. He keeps himself updated on emerging technologies, from Linux system internals to frontend technologies. He studied at BITS Pilani, Rajasthan, earning a dual degree in computer science and economics.
Prashant Shindgikar is an accomplished big data architect with over 20 years of experience in data analytics. He specializes in data innovation and resolving data challenges for major retail brands. He is a hands-on architect with an innovative approach to solving data problems. He provides thought leadership and pursues strategies for engagements with senior executives on innovation in data processing and analytics. He presently works for a large US-based retail company.
Sumit Pal is a published author with Apress. He has 22+ years of experience in software from startups to enterprises and is an independent consultant working with big data, data visualization, and data science. He builds end-to-end data-driven analytic systems.
He has worked for Microsoft (SQL Server), Oracle (OLAP Kernel), and Verizon. He advises clients on their data architectures and builds solutions in Spark and Scala. He has spoken at many conferences in North America and Europe and has developed big data analyst training for Experfy. He has an MS and a BS in computer science.
Manoj R. Patil is a big data architect at TatvaSoft—an IT services and consulting firm. He has a bachelor's degree in engineering from COEP, Pune. He is a proven and highly skilled business intelligence professional with 18 years of experience in IT. He is a seasoned BI and big data consultant with exposure to all the leading platforms.
Earlier, he served organizations such as Tech Mahindra and Persistent Systems. Apart from authoring a book on Pentaho and big data, he has been an avid reviewer for different titles in the respective fields from Packt and other leading publishers.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Modern Big Data Processing with Hadoop
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Enterprise Data Architecture Principles
Data architecture principles
Volume
Velocity
Variety
Veracity
The importance of metadata
Data governance
Fundamentals of data governance
Data security
Application security
Input data
Big data security
RDBMS security
BI security
Physical security
Data encryption
Secure key management
Data as a Service
Data architecture evolution with Hadoop
Hierarchical database architecture
Network database architecture
Relational database architecture
Employees
Devices
Department
Department and employee mapping table
Hadoop data architecture
Data layer
Data management layer
Job execution layer
Summary
Hadoop Life Cycle Management
Data wrangling
Data acquisition
Data structure analysis
Information extraction
Unwanted data removal
Data transformation
Data standardization
Data masking
Substitution
Static 
Dynamic
Encryption
Hashing
Hiding
Erasing
Truncation
Variance
Shuffling
Data security
What is Apache Ranger?
Apache Ranger installation using Ambari
Ambari admin UI
Add service
Service placement
Service client placement
Database creation on master
Ranger database configuration
Configuration changes
Configuration review
Deployment progress
Application restart
Apache Ranger user guide
Login to UI
Access manager
Service details
Policy definition and auditing for HDFS
Summary
Hadoop Design Considerations
Understanding data structure principles
Installing Hadoop cluster
Configuring Hadoop on NameNode
Format NameNode
Start all services
Exploring HDFS architecture
Defining NameNode
Secondary NameNode
NameNode safe mode
DataNode
Data replication
Rack awareness
HDFS WebUI
Introducing YARN
YARN architecture
Resource manager
Node manager
Configuration of YARN
Configuring HDFS high availability
During Hadoop 1.x
During Hadoop 2.x and onwards
HDFS HA cluster using NFS
Important architecture points
Configuration of HA NameNodes with shared storage
HDFS HA cluster using the quorum journal manager
Important architecture points
Configuration of HA NameNodes with QJM
Automatic failover
Important architecture points
Configuring automatic failover
Hadoop cluster composition
Typical Hadoop cluster
Best practices Hadoop deployment
Hadoop file formats
Text/CSV file
JSON
Sequence file
Avro
Parquet
ORC
Which file format is better?
Summary
Data Movement Techniques
Batch processing versus real-time processing
Batch processing
Real-time processing
Apache Sqoop
Sqoop Import
Import into HDFS
Import a MySQL table into an HBase table
Sqoop export
Flume
Apache Flume architecture
Data flow using Flume
Flume complex data flow architecture
Flume setup
Log aggregation use case
Apache NiFi
Main concepts of Apache NiFi
Apache NiFi architecture
Key features
Real-time log capture dataflow
Kafka Connect
Kafka Connect – a brief history
Why Kafka Connect?
Kafka Connect features
Kafka Connect architecture
Kafka Connect workers modes
Standalone mode
Distributed mode
Kafka Connect cluster distributed architecture
Example 1
Example 2
Summary
Data Modeling in Hadoop
Apache Hive
Apache Hive and RDBMS
Supported datatypes
How Hive works
Hive architecture
Hive data model management
Hive tables
Managed tables
External tables
Hive table partition
Hive static partitions and dynamic partitions
Hive partition bucketing
How Hive bucketing works
Creating buckets in a non-partitioned table
Creating buckets in a partitioned table
Hive views
Syntax of a view
Hive indexes
Compact index
Bitmap index
JSON documents using Hive
Example 1 – Accessing simple JSON documents with Hive (Hive 0.14 and later versions)
Example 2 – Accessing nested JSON documents with Hive (Hive 0.14 and later versions)
Example 3 – Schema evolution with Hive and Avro (Hive 0.14 and later versions)
Apache HBase
Differences between HDFS and HBase
Differences between Hive and HBase
Key features of HBase
HBase data model
Difference between RDBMS table and column-oriented data store
HBase architecture
HBase architecture in a nutshell
HBase rowkey design
Example 4 – loading data from MySQL table to HBase table
Example 5 – incrementally loading data from MySQL table to HBase table
Example 6 – Load the MySQL customer changed data into the HBase table
Example 7 – Hive HBase integration
Summary
Designing Real-Time Streaming Data Pipelines
Real-time streaming concepts
Data stream
Batch processing versus real-time data processing
Complex event processing 
Continuous availability
Low latency
Scalable processing frameworks
Horizontal scalability
Storage
Real-time streaming components
Message queue
So what is Kafka?
Kafka features
Kafka architecture
Kafka architecture components
Kafka Connect deep dive
Kafka Connect architecture
Kafka Connect workers standalone versus distributed mode
Install Kafka
Create topics
Generate messages to verify the producer and consumer
Kafka Connect using file Source and Sink
Kafka Connect using JDBC and file Sink Connectors
Apache Storm
Features of Apache Storm
Storm topology
Storm topology components
Installing Storm on a single node cluster
Developing a real-time streaming pipeline with Storm
Streaming a pipeline from Kafka to Storm to MySQL
Streaming a pipeline with Kafka to Storm to HDFS
Other popular real-time data streaming frameworks
Kafka Streams API
Spark Streaming
Apache Flink
Apache Flink versus Spark
Apache Spark versus Storm
Summary
Large-Scale Data Processing Frameworks
MapReduce
Hadoop MapReduce
Streaming MapReduce
Java MapReduce
Summary
Apache Spark 2
Installing Spark using Ambari
Service selection in Ambari Admin
Add Service Wizard
Server placement
Clients and Slaves selection
Service customization
Software deployment
Spark installation progress
Service restarts and cleanup
Apache Spark data structures
RDDs, DataFrames and datasets
Apache Spark programming
Sample data for analysis
Interactive data analysis with pyspark
Standalone application with Spark
Spark streaming application
Spark SQL application
Summary
Building an Enterprise Search Platform
The data search concept
The need for an enterprise search engine
Tools for building an enterprise search engine
Elasticsearch
Why Elasticsearch?
 Elasticsearch components
Index
Document
Mapping
Cluster
Type
How to index documents in Elasticsearch?
Elasticsearch installation
Installation of Elasticsearch
Create index
Primary shard
Replica shard
Ingest documents into index
Bulk Insert
Document search
Meta fields
Mapping
Static mapping
Dynamic mapping
Elasticsearch-supported data types
Mapping example
Analyzer
Elasticsearch stack components
Beats
Logstash
Kibana
Use case
Summary
Designing Data Visualization Solutions
Data visualization
Bar/column chart
Line/area chart
Pie chart
Radar chart
Scatter/bubble chart
Other charts
Practical data visualization in Hadoop
Apache Druid
Druid components
Other required components
Apache Druid installation
Add service
Select Druid and Superset
Service placement on servers
Choose Slaves and Clients
Service configurations
Service installation
Installation summary
Sample data ingestion into Druid
MySQL database
Sample database
Download the sample dataset
Copy the data to MySQL
Verify integrity of the tables
Single Normalized Table
Apache Superset
Accessing the Superset application
Superset dashboards
Understanding Wikipedia edits data
Create Superset Slices using Wikipedia data
Unique users count
Word Cloud for top US regions
Sunburst chart – top 10 cities
Top 50 channels and namespaces via directed force layout
Top 25 countries/channels distribution
Creating wikipedia edits dashboard from Slices
Apache Superset with RDBMS
Supported databases
Understanding employee database
Employees table
Departments table
Department manager table
Department Employees Table
Titles table
Salaries table
Normalized employees table
Superset Slices for employees database
Register MySQL database/table
Slices and Dashboard creation
Department salary breakup
Salary Diversity
Salary Change Per Role Per Year
Dashboard creation
Summary
Developing Applications Using the Cloud
What is the Cloud?
Available technologies in the Cloud
Planning the Cloud infrastructure
Dedicated servers versus shared servers
Dedicated servers
Shared servers
High availability
Business continuity planning
Infrastructure unavailability
Natural disasters
Business data
BCP design example
The Hot–Hot system
The Hot–Cold system
Security
Server security
Application security
Network security
Single Sign On
The AAA requirement
Building a Hadoop cluster in the Cloud
Google Cloud Dataproc
Getting a Google Cloud account
Activating the Google Cloud Dataproc service
Creating a new Hadoop cluster
Logging in to the cluster
Deleting the cluster 
Data access in the Cloud
Block storage
File storage
Encrypted storage
Cold storage
Summary
Production Hadoop Cluster Deployment
Apache Ambari architecture
The Ambari server
Daemon management
Software upgrade
Software setup
LDAP/PAM/Kerberos management
Ambari backup and restore
Miscellaneous options
Ambari Agent
Ambari web interface
Database
Setting up a Hadoop cluster with Ambari
Server configurations
Preparing the server 
Installing the Ambari server 
Preparing the Hadoop cluster
Creating the Hadoop cluster 
Ambari web interface
The Ambari home page
Creating a cluster
Managing users and groups
Deploying views
The cluster install wizard
Naming your cluster
Selecting the Hadoop version 
Selecting a server 
Setting up the node
Selecting services
Service placement on nodes
Selecting slave and client nodes 
Customizing services
Reviewing the services
Installing the services on the nodes
Installation summary
The cluster dashboard
Hadoop clusters
A single cluster for the entire business
Multiple Hadoop clusters
Redundancy
A fully redundant Hadoop cluster
A data redundant Hadoop cluster
Cold backup
High availability
Business continuity
Application environments
Hadoop data copy
HDFS data copy
Summary
The complex structure of data these days requires sophisticated solutions for data transformation and its semantic representation to make information more accessible to users. Apache Hadoop, along with a host of other big data tools, empowers you to build such solutions with relative ease. This book lists some unique ideas and techniques that enable you to conquer different data processing and analytics challenges on your path to becoming an expert big data architect.
The book begins by quickly laying down the principles of enterprise data architecture and showing how they are related to the Apache Hadoop ecosystem. You will get a complete understanding of data life cycle management with Hadoop, followed by modeling structured and unstructured data in Hadoop. The book will also show you how to design real-time streaming pipelines by leveraging tools such as Apache Spark, as well as building efficient enterprise search solutions using tools such as Elasticsearch. You will build enterprise-grade analytics solutions on Hadoop and learn how to visualize your data using tools such as Tableau and Python.
This book also covers techniques for deploying your big data solutions on-premise and on the cloud, as well as expert techniques for managing and administering your Hadoop cluster.
By the end of this book, you will have all the knowledge you need to build expert big data systems that cater to any data or insight requirements, leveraging the full suite of modern big data frameworks and tools. You will have the necessary skills and know-how to become a true big data expert.
This book is for big data professionals who want to fast-track their career in the Hadoop industry and become expert big data architects. Project managers and mainframe professionals looking to build a career in big data and Hadoop will also find this book useful. Some understanding of Hadoop is required to get the best out of this book.
Chapter 1, Enterprise Data Architecture Principles, describes the architecture principles of enterprise data and the importance of governing and securing that data.
Chapter 2, Hadoop Life Cycle Management, covers various data life cycle stages, including when the data is created, shared, maintained, archived, retained, and deleted. It also details data security tools and patterns.
Chapter 3, Hadoop Design Considerations, covers key data architecture principles and practices. The reader will learn how modern data architects adapt to big data use cases.
Chapter 4, Data Movement Techniques, covers different methods to transfer data to and from our Hadoop cluster to utilize its real power.
Chapter 5, Data Modeling in Hadoop, shows how to store and model data in Hadoop clusters.
Chapter 6, Designing Real-Time Streaming Data Pipelines, covers different tools and techniques for designing real-time data analytics.
Chapter 7, Large-Scale Data Processing Frameworks, covers different data processing solutions to derive value out of our data.
Chapter 8, Building an Enterprise Search Platform, gives a detailed architecture design for building search solutions using Elasticsearch.
Chapter 9, Designing Data Visualization Solutions, covers different ways to visualize your data and the factors involved in choosing the correct visualization method.
Chapter 10, Developing Applications Using the Cloud, shows how to build enterprise applications using cloud infrastructure.
Chapter 11, Production Hadoop Cluster Deployment, shows how to deploy your Hadoop cluster using Apache Ambari.
It would be ideal if Hadoop is properly installed as explained in the earlier chapters. Detailed, or even basic, knowledge of Hadoop will serve as an added advantage.
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Modern-Big-Data-Processing-with-Hadoop. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/ModernBigDataProcessingwithHadoop_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."
A block of code is set as follows:
export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-$YARN_HOME/etc/hadoop}"
export HADOOP_COMMON_HOME="${HADOOP_COMMON_HOME:-$YARN_HOME}"
export HADOOP_HDFS_HOME="${HADOOP_HDFS_HOME:-$YARN_HOME}"
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
$ hadoop fs -cat /tmp/output-7/part*
NewDelhi, 440
Kolkata, 390
Bangalore, 270
Any command-line input or output is written as follows:
useradd hadoop
passwd hadoop1
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
Traditionally, enterprises have embraced data warehouses to store, process, and access large volumes of data. These warehouses are typically large RDBMS databases capable of storing very large and varied datasets. As data complexity, volume, and access patterns have increased, many enterprises have started adopting big data as a model to redesign their data organization and define the necessary policies around it.
The following figure depicts how a typical data warehouse looks in an enterprise:
As enterprises have many different departments, organizations, and geographies, each one tends to own a warehouse of its own, which presents a variety of challenges to the enterprise as a whole. For example:
Multiple sources and destinations of data
Data duplication and redundancy
Data access regulatory issues
Non-standard data definitions across the enterprise
Software and hardware scalability and reliability issues
Data movement and auditing
Integration between various warehouses
It has become very easy to build very-large-scale systems at a much lower cost than a few decades ago, thanks to several advancements in technology, such as:
Cost per terabyte
Computation power per nanometer
Gigabits of network bandwidth
Cloud
With globalization, markets have gone global and so have consumers, which has increased the reach of enterprises manifold. These advancements also pose several challenges to enterprises in terms of:
Human capital management
Warehouse management
Logistics management
Data privacy and security
Sales and billing management
Understanding demand and supply
In order to stay on top of market demands, enterprises have started collecting more and more metrics about themselves; as a result, the number of dimensions that data brings into play keeps increasing.
In this chapter, we will learn:
Data architecture principles
The importance of metadata
Data governance
Data security
Data as a Service
Data architecture evolution with Hadoop
Data in its current state can be described along the following four dimensions (the four Vs).
The volume of data is an important measure when designing a big data system. It decides the investment an enterprise has to make to cater to its present and future storage requirements.
Different types of data in an enterprise need different capacities for storage, archival, and processing. Petabyte-scale storage systems are very common in the industry today, a scale that was almost impossible to reach a few decades ago.
This is another dimension of data, one that decides its mobility. The varieties of data that exist within organizations fall under the following categories:
Streaming data:
Real-time/near-real-time data
Data at rest:
Immutable data
Mutable data
This dimension has some impact on the network architecture that an enterprise uses to consume and process data.
This dimension talks about the form and shape of the data. We can further classify this into the following categories:
Streaming data:
On-wire data format (for example, JSON, MPEG, and Avro)
Data at rest:
Immutable data (for example, media files and customer invoices)
Mutable data (for example, customer details, product inventory, and employee data)
Application data:
Configuration files, secrets, passwords, and so on
As an organization, it is very important to embrace only a few technologies in order to reduce the variety of data. Having many different types of data poses a very big challenge to an enterprise in terms of managing and consuming it all.
This dimension talks about the accuracy of the data. Without a solid understanding of the guarantees that each system within an enterprise provides to keep the data safe, available, and reliable, it becomes very difficult to trust the analytics generated from this data and the insights derived from it.
Necessary auditing should be in place to make sure that the data flowing through the system passes all the quality checks before it finally lands in the big data system.
Let's see how a typical big data system looks:
As you can see, many different types of applications are interacting with the big data system to store, process, and generate analytics.
Before we try to understand the importance of metadata, let's try to understand what metadata is. Metadata is simply data about data. This may sound confusing, as the definition is recursive.
In a typical big data system, we have these three levels of verticals:
Applications writing data to a big data system
Organizing data within the big data system
Applications consuming data from the big data system
This brings up a few challenges, as we are talking about millions (even billions) of data files/segments that are stored in the big data system. We should be able to correctly identify the ownership and usage of these data files across the enterprise.
Let's take the example of a TV broadcasting company that owns a TV channel; it creates television shows and broadcasts them to its target audience over wired cable networks, satellite networks, the internet, and so on. If we look carefully, there is only one source of content, but it travels through all possible mediums and finally reaches the user's location for viewing on a TV, mobile phone, tablet, and so on.
Since the viewers are accessing this TV content on a variety of devices, the applications running on these devices can generate several messages to indicate various user actions and preferences, and send them back to the application server. This data is pretty huge and is stored in a big data system.
Depending on how the data is organized within the big data system, it can be almost impossible for outside applications or peer applications to know about the different types of data being stored within it. In order to make this process easier, we need to describe and define how data is organized within the big data system. This will help us better understand data organization and access within the big data system.
Let's extend this example even further and say there is another application that reads from the big data system to understand the best times to advertise in a given TV series. This application should have a better understanding of all other data that is available within the big data system. So, without having a well-defined metadata system, it's very difficult to do the following things:
Understand the diversity of data that is stored, accessed, and processed
Build interfaces across different types of datasets
Correctly tag the data from a security perspective as highly sensitive or insensitive data
Connect the dots between the given sets of systems in the big data ecosystem
Audit and troubleshoot issues that might arise because of data inconsistency
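As a minimal sketch of how such descriptive metadata can be attached to a dataset, a Hive table can carry ownership, sensitivity, and retention tags that any consuming application can read back. The table name, connection string, and property values below are hypothetical, and table properties are only one of several possible mechanisms:
# Hypothetical example: tag a dataset with ownership, sensitivity, and
# retention metadata so that other applications can discover and respect it.
beeline -u jdbc:hive2://hive-server:10000 -e "
CREATE TABLE IF NOT EXISTS viewer_events (
  device_id  STRING,
  show_id    STRING,
  action     STRING,
  event_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
  'owner'          = 'broadcast-analytics',
  'sensitivity'    = 'high',
  'retention.days' = '365',
  'source.system'  = 'viewer-apps'
);"
# Any consuming application can read the same tags back:
beeline -u jdbc:hive2://hive-server:10000 -e "SHOW TBLPROPERTIES viewer_events;"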
Having very large volumes of data is not enough to make very good decisions that have a positive impact on the success of a business. It's very important to make sure that only quality data is collected, preserved, and maintained. The data collection process also evolves as new types of data need to be collected. During this process, we might break a few interfaces that read from the previous generation of data. Without a well-defined process and the right people, handling data becomes a big challenge for organizations of all sizes.
To excel in managing data, we should consider the following qualities:
Good policies and processes
Accountability
Formal decision structures
Enforcement of rules in management
The implementation of these types of qualities is called data governance. At a high level, we'll define data governance as data that is managed well. This definition also helps us to clarify that data management and data governance are not the same thing. Managing data is concerned with the use of data to make good business decisions and ultimately run organizations. Data governance is concerned with the degree to which we use disciplined behavior across our entire organization in how we manage that data.
It's an important distinction. So what's the bottom line? Most organizations manage data, but far fewer govern those management techniques well.
Let's try to understand the fundamentals of data governance:
Accountability
Standardization
Transparency
Transparency ensures that all employees, within and outside an organization, understand their role when interacting with data that is related to the organization. This will ensure the following things:
Building trust
Avoiding surprises
Accountability makes sure that teams and employees who have access to data describe what they can do and cannot do with the data.
Standardization deals with how the data is properly labeled, described, and categorized. One example is how email addresses are generated for employees within an organization; a common convention combines the first name and the last name. This will ensure that everyone who has access to these email addresses understands which part is the first name and which is the last name, without anybody having to explain it in person.
Standardization improves the quality of data and brings order to multiple data dimensions.
Security is not a new concept; it has been around since the early UNIX time-sharing operating system designs. In the recent past, security awareness among individuals and organizations has increased due to widespread data breaches that have led to significant revenue losses for organizations.
Security, as a general concept, can be applied to many different things. When it comes to data security, we need to understand the following fundamental questions:
What types of data exist?
Who owns the data?
Who has access to the data?
When does the data exit the system?
Is the data physically secured?
Let's have a look at a simple big data system and try to understand these questions in more detail. The scale of the systems makes security a nightmare for everyone. So, we should have proper policies in place to keep everyone on the same page:
In this example, we have the following components:
Heterogeneous applications running across the globe in multiple geographical regions.
Large volume and variety of input data is generated by the applications.
All the data is ingested into a big data system.
ETL/ELT applications consume the data from a big data system and put the consumable results into RDBMS (this is optional).
Business intelligence applications read from this storage and further generate insights into the data. These are the ones that power the leadership team's decisions.
You can imagine the scale and volume of data that flows through this system. We can also see that the number of servers, applications, and employees participating in this whole ecosystem is very large. If we do not have proper policies in place, it is not an easy task to secure such a complicated system.
Also, if an attacker uses social engineering to gain access to the system, we should make sure that the data access is limited only to the lowest possible level. When poor security implementations are in place, attackers can have access to virtually all the business secrets, which could be a serious loss to the business.
Consider an example: a start-up building a next-generation computing device hosts all its data on the cloud and does not have proper security policies in place. When an attacker compromises the security of its cloud servers, they can easily figure out what is being built by this start-up and steal the intelligence. Once the intelligence is stolen, we can imagine how hackers could use it for their personal benefit.
With this understanding of security's importance, let's define what needs to be secured.
Applications are the front line of product-based organizations, since consumers use these applications to interact with the products and services the organization provides. We have to ensure that proper security standards are followed while programming these application interfaces.
Since these applications send data to the backend systems, we should make sure that only proper access mechanisms are allowed, enforced with firewalls.
Also, since these applications interact with many other backend systems, we have to ensure that only the correct data related to the user is shown. This boils down to implementing proper authentication and authorization, not only for the user but also for the application when accessing different types of an organization's resources.
Without proper auditing in place, it is very difficult to analyze the data access patterns by the applications. All the logs should be collected at a central place away from the application servers and can be further ingested into the big data system.
The metrics that applications generate can be stored temporarily on local storage, to be consumed by periodic processes, or pushed to streaming systems such as Kafka.
In this case, we should carefully think through and design where the data is stored and which users can have access to it. If we are writing this data to systems such as Kafka or MQ, we have to make sure that proper authentication, authorization, and access controls are in place.
Here we can leverage the operating-system-provided security measures such as process user ID, process group ID, filesystem user ID, group ID, and also advanced systems (such as SELinux) to further restrict access to the input data.
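As a hedged sketch of what such controls can look like on the application side (the broker address, credentials, file paths, and topic name are assumptions, and the Kafka cluster must already have SASL/SSL enabled), the client configuration carries the authentication settings while the file itself is locked down with ordinary filesystem permissions:
# Client configuration with authentication and transport encryption
cat > /etc/myapp/kafka-client.properties <<'EOF'
bootstrap.servers=broker1.example.com:9093
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-256
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="metrics-producer" \
  password="change-me";
ssl.truststore.location=/etc/myapp/kafka.client.truststore.jks
ssl.truststore.password=change-me
EOF
# Restrict the secrets file to the application's service account
chown appuser:appgroup /etc/myapp/kafka-client.properties
chmod 600 /etc/myapp/kafka-client.properties
# Verify connectivity using the console producer with the same configuration
kafka-console-producer.sh --broker-list broker1.example.com:9093 \
  --topic app-metrics --producer.config /etc/myapp/kafka-client.properties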
Depending on which data warehouse solution is chosen, we have to ensure that authorized applications and users can write to and read from the data warehouse. Proper security policies and auditing should be in place to make sure that this large scale of data is not easily accessible to everyone.
In order to implement all these access policies, we can use operating-system-provided mechanisms such as file access controls and user access controls. Since we're talking about geographically distributed big data systems, we have to design centralized authentication systems to provide a seamless experience for employees when they interact with these big data systems.
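A minimal sketch of such file-level controls on HDFS is shown below; the paths, users, and groups are hypothetical, and ACL support has to be enabled on the cluster (dfs.namenode.acls.enabled=true):
# Restrict a raw-data directory to its owning user and group
hdfs dfs -chown -R ingest:datalake /data/raw/sales
hdfs dfs -chmod -R 750 /data/raw/sales
# Grant an additional team read-only access through an ACL entry
hdfs dfs -setfacl -R -m group:analytics:r-x /data/raw/sales
# Review the effective permissions and ACLs
hdfs dfs -getfacl /data/raw/sales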
Many RDBMSes are highly secure and can provide the following access levels to users:
Database
Table
Usage pattern
They also have built-in auditing mechanisms that tell us which users have accessed what types of data and when. This data is vital to keeping the systems secure, and proper monitoring should be in place to keep a watch on these systems' health and safety.
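As a hedged example of what database-level and table-level access can look like in MySQL (the schema, table, user names, host patterns, and passwords are illustrative):
# Create a reporting user and grant read-only access to a single table
mysql -u root -p -e "CREATE USER IF NOT EXISTS 'report_user'@'%' IDENTIFIED BY 'change-me';"
mysql -u root -p -e "GRANT SELECT ON warehouse.daily_sales TO 'report_user'@'%';"
# Grant an ETL service account broader, database-level access
mysql -u root -p -e "CREATE USER IF NOT EXISTS 'etl_user'@'10.0.%' IDENTIFIED BY 'change-me';"
mysql -u root -p -e "GRANT SELECT, INSERT ON warehouse.* TO 'etl_user'@'10.0.%';"
# Review what a user is actually allowed to do
mysql -u root -p -e "SHOW GRANTS FOR 'report_user'@'%';"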
These can be applications built in-house for specific needs of the company, or external applications that can power the insights that business teams are looking for. These applications should also be properly secured by practicing single sign-on, role-based access control, and network-based access control.
Since the insights these applications provide are crucial to the success of the organization, proper security measures should be taken to protect them.
So far, we have seen the different parts of an enterprise system and understood what things can be followed to improve the security of the overall enterprise data design. Let's talk about some of the common things that can be applied everywhere in the data design.
This deals with physical device access, data center access, server access, and network access. If an unauthorized person gains access to the equipment owned by an enterprise, they can gain access to all the data present on it.
As we have seen in the previous sections, when an operating system is running, we are able to protect the resources by leveraging the security features of the operating system. When an intruder gains physical access to the devices (or even decommissioned servers), they can connect these devices to another operating system that's in their control and access all the data that is present on our servers.
Care must be taken when we decommission servers, as there are ways in which data that's written to these devices (even after formatting) can be recovered. So we should follow industry-standard device erasing techniques to properly clean all of the data that is owned by enterprises.
In order to prevent this, we should consider encrypting the data.
Encrypting data will ensure that even when unauthorized persons gain access to the devices, they will not be able to recover the data. This is a standard practice that is followed nowadays due to the increased mobility of data and employees. Many big enterprises encrypt the hard disks of laptops and mobile phones.
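On a Hadoop cluster, one common way to achieve this for data at rest is HDFS transparent encryption. The following is a minimal sketch, assuming a Hadoop KMS is already configured; the key name and path are illustrative:
# Create an encryption key in the configured Hadoop KMS
hadoop key create finance-key
# Create a directory and turn it into an encryption zone
hdfs dfs -mkdir -p /secure/finance
hdfs crypto -createZone -keyName finance-key -path /secure/finance
# Files written under /secure/finance are now encrypted at rest transparently
hdfs crypto -listZones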
If you have worked with any application that needs authentication, you will have used a combination of username and password to access its services. Typically, these secrets are stored within the source code itself. This poses a challenge for programs that are not compiled, as attackers can easily read the username and password and gain access to our resources.
Many enterprises have started adopting centralized key management systems (KMS); applications can query these services to gain access to resources that are protected by authentication. All these access patterns are properly audited by the KMS.
Employees should also use their own credentials when accessing these systems. This makes sure that secret keys are protected and accessible only to authorized applications.
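A minimal sketch of this idea using the Hadoop credential provider (the alias and store path are assumptions) looks like the following; applications then reference the credential store instead of embedding the password in source code or configuration files:
# Store a database password in an encrypted credential store on HDFS
# (the command prompts for the secret instead of taking it on the command line)
hadoop credential create mydb.password \
  -provider jceks://hdfs/user/appsvc/creds.jceks
# List the aliases held in the store; the values themselves are never printed
hadoop credential list -provider jceks://hdfs/user/appsvc/creds.jceks
# Applications point at the store, for example via:
#   hadoop.security.credential.provider.path=jceks://hdfs/user/appsvc/creds.jceks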
Data as a Service (DaaS) is a concept that has become popular in recent times due to the increased adoption of the cloud. When it comes to data, it might seem a little confusing how data can fit into an as-a-service model.
DaaS offers great flexibility to the users of the service, who need not worry about the scale, performance, and maintenance of the underlying infrastructure that the service runs on. The infrastructure takes care of that automatically, and since we are dealing with a cloud model, we get all the benefits of the cloud, such as pay-as-you-go pricing and capacity planning. This reduces the burden of data management.
