Big Data Analytics with Hadoop 3

Sridhar Alla

Description

Explore big data concepts, platforms, analytics, and their applications using the power of Hadoop 3

Key Features

  • Learn Hadoop 3 to build effective big data analytics solutions on-premise and on cloud
  • Integrate Hadoop with other big data tools such as R, Python, Apache Spark, and Apache Flink
  • Exploit big data using Hadoop 3 with real-world examples

Book Description

Apache Hadoop is the most popular platform for big data processing, and can be combined with a host of other big data tools to build powerful analytics solutions. Big Data Analytics with Hadoop 3 shows you how to do just that, by providing insights into the software as well as its benefits with the help of practical examples.

Once you have taken a tour of Hadoop 3’s latest features, you will get an overview of HDFS, MapReduce, and YARN, and how they enable faster, more efficient big data processing. You will then move on to learning how to integrate Hadoop with open source tools, such as Python and R, to analyze and visualize data and perform statistical computing on big data. As you become acquainted with all of this, you will explore how to use Hadoop 3 with Apache Spark and Apache Flink for real-time data analytics and stream processing. In addition to this, you will understand how to use Hadoop to build analytics solutions in the cloud and an end-to-end pipeline to perform big data analysis using practical use cases.

By the end of this book, you will be well-versed with the analytical capabilities of the Hadoop ecosystem. You will be able to build powerful solutions to perform big data analytics and get insights effortlessly.

What you will learn

  • Explore the new features of Hadoop 3 along with HDFS, YARN, and MapReduce
  • Get well-versed with the analytical capabilities of the Hadoop ecosystem using practical examples
  • Integrate Hadoop with R and Python for more efficient big data processing
  • Learn to use Hadoop with Apache Spark and Apache Flink for real-time data analytics
  • Set up a Hadoop cluster on AWS cloud
  • Perform big data analytics on AWS using Elastic MapReduce

Who this book is for

Big Data Analytics with Hadoop 3 is for you if you are looking to build high-performance analytics solutions for your enterprise or business using Hadoop 3’s powerful features, or if you’re new to big data analytics. A basic understanding of the Java programming language is required.

Sridhar Alla is a big data expert helping companies solve complex problems in distributed computing, large-scale data science, and analytics. He presents regularly at several prestigious conferences and provides training and consulting to companies. He holds a bachelor's in computer science from JNTU, India. He loves writing code in Python, Scala, and Java. He also has extensive hands-on knowledge of several Hadoop-based technologies, TensorFlow, NoSQL, IoT, and deep learning.




Big Data Analytics with Hadoop 3

Build highly effective analytics solutions to gain valuable insight into your big data

Sridhar Alla

BIRMINGHAM - MUMBAI

Big Data Analytics with Hadoop 3

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Varsha Shetty
Content Development Editor: Cheryl Dsa
Technical Editor: Sagar Sawant
Copy Editors: Vikrant Phadke, Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Tania Dutta
Production Coordinator: Arvindkumar Gupta

First published: May 2018

Production reference: 1280518

Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.

ISBN 978-1-78862-884-6

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Sridhar Alla is a big data expert helping companies solve complex problems in distributed computing, large-scale data science, and analytics. He presents regularly at several prestigious conferences and provides training and consulting to companies. He holds a bachelor's in computer science from JNTU, India.

He loves writing code in Python, Scala, and Java. He also has extensive hands-on knowledge of several Hadoop-based technologies, TensorFlow, NoSQL, IoT, and deep learning.

I thank my loving wife, Rosie Sarkaria, for all the love and patience during the many months I spent writing this book. I thank my parents, Ravi and Lakshmi Alla, for all their support and encouragement. I am very grateful to my wonderful niece Niharika and nephew Suman Kalyan, who helped me with screenshots, proofreading, and testing the code snippets.

About the reviewers

V. Naresh Kumar has more than a decade of professional experience in designing, implementing, and running very large-scale internet applications in Fortune 500 companies. He is a full-stack architect with hands-on experience in e-commerce, web hosting, healthcare, big data, analytics, data streaming, advertising, and databases. He admires open source and contributes to it actively. He keeps himself updated with emerging technologies, from Linux system internals to frontend technologies. He studied at BITS Pilani, Rajasthan, with a joint degree in computer science and economics.

Manoj R. Patil is a big data architect at TatvaSoft, an IT services and consulting firm. He has a bachelor's degree in engineering from COEP, Pune. He is a proven and highly skilled business intelligence professional with 18 years' experience in IT. He is a seasoned BI and big data consultant with exposure to all the leading platforms.

Previously, he worked for numerous organizations, including Tech Mahindra and Persistent Systems. Apart from authoring a book on Pentaho and big data, he has been an avid reviewer of various titles in the respective fields from Packt and other leading publishers.

Manoj would like to thank his entire family, especially his two beautiful angels, Ayushee and Ananyaa, for their understanding during the review process. He would also like to thank Packt for this opportunity, as well as the project coordinator and the author.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Big Data Analytics with Hadoop 3

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the author

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Introduction to Hadoop

Hadoop Distributed File System

High availability

Intra-DataNode balancer

Erasure coding

Port numbers

MapReduce framework

Task-level native optimization

YARN

Opportunistic containers

Types of container execution 

YARN timeline service v.2

Enhancing scalability and reliability

Usability improvements

Architecture

Other changes

Minimum required Java version 

Shell script rewrite

Shaded-client JARs

Installing Hadoop 3 

Prerequisites

Downloading

Installation

Setup password-less ssh

Setting up the NameNode

Starting HDFS

Setting up the YARN service

Erasure Coding

Intra-DataNode balancer

Installing YARN timeline service v.2

Setting up the HBase cluster

Simple deployment for HBase

Enabling the co-processor

Enabling timeline service v.2

Running timeline service v.2

Enabling MapReduce to write to timeline service v.2

Summary

Overview of Big Data Analytics

Introduction to data analytics

Inside the data analytics process

Introduction to big data

Variety of data

Velocity of data

Volume of data

Veracity of data

Variability of data

Visualization

Value

Distributed computing using Apache Hadoop

The MapReduce framework

Hive

Downloading and extracting the Hive binaries

Installing Derby

Using Hive

Creating a database

Creating a table

SELECT statement syntax

WHERE clauses

INSERT statement syntax

Primitive types

Complex types

Built-in operators and functions

Built-in operators

Built-in functions

Language capabilities

A cheat sheet on retrieving information 

Apache Spark

Visualization using Tableau

Summary

Big Data Processing with MapReduce

The MapReduce framework

Dataset

Record reader

Map

Combiner

Partitioner

Shuffle and sort

Reduce

Output format

MapReduce job types

Single mapper job

Single mapper reducer job

Multiple mappers reducer job

SingleMapperCombinerReducer job

Scenario

MapReduce patterns

Aggregation patterns

Average temperature by city

Record count

Min/max/count

Average/median/standard deviation

Filtering patterns

Join patterns

Inner join

Left anti join

Left outer join

Right outer join

Full outer join

Left semi join

Cross join

Summary

Scientific Computing and Big Data Analysis with Python and Hadoop

Installation

Installing standard Python

Installing Anaconda

Using Conda

Data analysis

Summary

Statistical Big Data Computing with R and Hadoop

Introduction

Install R on workstations and connect to the data in Hadoop

Install R on a shared server and connect to Hadoop

Utilize Revolution R Open

Execute R inside of MapReduce using RMR2

Summary and outlook for pure open source options

Methods of integrating R and Hadoop

RHADOOP – install R on workstations and connect to data in Hadoop

RHIPE – execute R inside Hadoop MapReduce

R and Hadoop Streaming

RHIVE – install R on workstations and connect to data in Hadoop

ORCH – Oracle connector for Hadoop

Data analytics

Summary

Batch Analytics with Apache Spark

SparkSQL and DataFrames

DataFrame APIs and the SQL API

Pivots

Filters

User-defined functions

Schema – structure of data

Implicit schema

Explicit schema

Encoders

Loading datasets

Saving datasets

Aggregations

Aggregate functions

count

first

last

approx_count_distinct

min

max

avg

sum

kurtosis

skewness

Variance

Standard deviation

Covariance

groupBy

Rollup

Cube

Window functions

ntiles

Joins

Inner workings of join

Shuffle join

Broadcast join

Join types

Inner join

Left outer join

Right outer join

Outer join

Left anti join

Left semi join

Cross join

Performance implications of join

Summary

Real-Time Analytics with Apache Spark

Streaming

At-least-once processing

At-most-once processing

Exactly-once processing

Spark Streaming

StreamingContext

Creating StreamingContext

Starting StreamingContext

Stopping StreamingContext

Input streams

receiverStream

socketTextStream

rawSocketStream

fileStream

textFileStream

binaryRecordsStream

queueStream

textFileStream example

twitterStream example

Discretized Streams

Transformations

Windows operations

Stateful/stateless transformations

Stateless transformations

Stateful transformations

Checkpointing

Metadata checkpointing

Data checkpointing

Driver failure recovery

Interoperability with streaming platforms (Apache Kafka)

Receiver-based

Direct Stream

Structured Streaming

Getting deeper into Structured Streaming

Handling event time and late date

Fault-tolerance semantics

Summary

Batch Analytics with Apache Flink

Introduction to Apache Flink

Continuous processing for unbounded datasets

Flink, the streaming model, and bounded datasets

Installing Flink

Downloading Flink

Installing Flink

Starting a local Flink cluster

Using the Flink cluster UI

Batch analytics

Reading file

File-based

Collection-based

Generic

Transformations

GroupBy

Aggregation

Joins

Inner join

Left outer join

Right outer join

Full outer join

Writing to a file

Summary

Stream Processing with Apache Flink

Introduction to streaming execution model

Data processing using the DataStream API

Execution environment

Data sources

Socket-based

File-based

Transformations

map

flatMap

filter

keyBy

reduce

fold

Aggregations

window

Global windows

Tumbling windows

Sliding windows

Session windows

windowAll

union

Window join

split

Select

Project

Physical partitioning

Custom partitioning

Random partitioning

Rebalancing partitioning

Rescaling

Broadcasting

Event time and watermarks

Connectors

Kafka connector

Twitter connector

RabbitMQ connector

Elasticsearch connector

Cassandra connector

Summary

Visualizing Big Data

Introduction

Tableau

Chart types

Line charts

Pie chart

Bar chart

Heat map

Using Python to visualize data

Using R to visualize data

Big data visualization tools

Summary

Introduction to Cloud Computing

Concepts and terminology

Cloud

IT resource

On-premise

Cloud consumers and Cloud providers

Scaling

 Types of scaling

Horizontal scaling

Vertical scaling

Cloud service

Cloud service consumer

Goals and benefits

Increased scalability

Increased availability and reliability

Risks and challenges

Increased security vulnerabilities

Reduced operational governance control

Limited portability between Cloud providers

Roles and boundaries

Cloud provider

Cloud consumer

Cloud service owner

Cloud resource administrator

Additional roles

Organizational boundary

Trust boundary

Cloud characteristics

On-demand usage

Ubiquitous access

Multi-tenancy (and resource pooling)

Elasticity

Measured usage

Resiliency

Cloud delivery models

Infrastructure as a Service

Platform as a Service

Software as a Service

Combining Cloud delivery models

IaaS + PaaS

IaaS + PaaS + SaaS

Cloud deployment models

Public Clouds

Community Clouds

Private Clouds

Hybrid Clouds

Summary

Using Amazon Web Services

Amazon Elastic Compute Cloud

Elastic web-scale computing

Complete control of operations

Flexible Cloud hosting services

Integration

High reliability

Security

Inexpensive

Easy to start

Instances and Amazon Machine Images

Launching multiple instances of an AMI

Instances

AMIs

Regions and availability zones

Region and availability zone concepts

Regions

Availability zones

Available regions

Regions and endpoints

Instance types

Tag basics

Amazon EC2 key pairs

Amazon EC2 security groups for Linux instances

Elastic IP addresses

Amazon EC2 and Amazon Virtual Private Cloud

Amazon Elastic Block Store

Amazon EC2 instance store

What is AWS Lambda?

When should I use AWS Lambda?

Introduction to Amazon S3

Getting started with Amazon S3

Comprehensive security and compliance capabilities

Query in place

Flexible management

Most supported platform with the largest ecosystem

Easy and flexible data transfer

Backup and recovery

Data archiving

Data lakes and big data analytics

Hybrid Cloud storage

Cloud-native application data

Disaster recovery

Amazon DynamoDB

Amazon Kinesis Data Streams

What can I do with Kinesis Data Streams?

Accelerated log and data feed intake and processing

Real-time metrics and reporting

Real-time data analytics

Complex stream processing

Benefits of using Kinesis Data Streams

AWS Glue

When should I use AWS Glue?

Amazon EMR

Practical AWS EMR cluster

Summary

Preface

Apache Hadoop is the most popular platform for big data processing, and can be combined with a host of other big data tools to build powerful analytics solutions. Big Data Analytics with Hadoop 3 shows you how to do just that, by providing insights into the software as well as its benefits with the help of practical examples.

Once you have taken a tour of Hadoop 3's latest features, you will get an overview of HDFS, MapReduce, and YARN, and how they enable faster, more efficient big data processing. You will then move on to learning how to integrate Hadoop with open source tools, such as Python and R, to analyze and visualize data and perform statistical computing on big data. As you become acquainted with all of this, you will explore how to use Hadoop 3 with Apache Spark and Apache Flink for real-time data analytics and stream processing. In addition to this, you will understand how to use Hadoop to build analytics solutions in the cloud and an end-to-end pipeline to perform big data analysis using practical use cases.

By the end of this book, you will be well-versed with the analytical capabilities of the Hadoop ecosystem. You will be able to build powerful solutions to perform big data analytics and get insights effortlessly.

Who this book is for

Big Data Analytics with Hadoop 3 is for you if you are looking to build high-performance analytics solutions for your enterprise or business using Hadoop 3's powerful features, or if you’re new to big data analytics. A basic understanding of the Java programming language is required.

What this book covers

Chapter 1, Introduction to Hadoop, introduces you to the world of Hadoop and its core components, namely, HDFS and MapReduce.

Chapter 2, Overview of Big Data Analytics, introduces the process of examining large datasets to uncover patterns in data, generating reports, and gathering valuable insights.

Chapter 3, Big Data Processing with MapReduce, introduces the concept of MapReduce, which is the fundamental concept behind most big data computing/processing systems.

Chapter 4, Scientific Computing and Big Data Analysis with Python and Hadoop, provides an introduction to Python and an analysis of big data using Hadoop with the aid of Python packages.

Chapter 5, Statistical Big Data Computing with R and Hadoop, provides an introduction to R and demonstrates how to use R to perform statistical computing on big data using Hadoop.

Chapter 6, Batch Analytics with Apache Spark, introduces you to Apache Spark and demonstrates how to use Spark for big data analytics based on a batch processing model.

Chapter 7, Real-Time Analytics with Apache Spark, introduces the stream processing model of Apache Spark and demonstrates how to build streaming-based, real-time analytical applications.

Chapter 8, Batch Analytics with Apache Flink, covers Apache Flink and how to use it for big data analytics based on a batch processing model.

Chapter 9, Stream Processing with Apache Flink, introduces you to DataStream APIs and stream processing using Flink. Flink will be used to receive and process real-time event streams and store the aggregates and results in a Hadoop cluster.

Chapter 10, Visualizing Big Data, introduces you to the world of data visualization using various tools and technologies such as Tableau.

Chapter 11, Introduction to Cloud Computing, introduces Cloud computing and various concepts such as IaaS, PaaS, and SaaS. You will also get a glimpse into the top Cloud providers.

Chapter 12, Using Amazon Web Services, introduces you to AWS and the various AWS services useful for performing big data analytics, using Elastic MapReduce (EMR) to set up a Hadoop cluster in the AWS Cloud.

To get the most out of this book

The examples have been implemented using Scala, Java, R, and Python on a 64-bit Linux platform. You will also need, or be prepared to install, the following on your machine (preferably the latest version):

Spark 2.3.0 (or higher)

Hadoop 3.1 (or higher)

Flink 1.4

Java (JDK and JRE) 1.8+

Scala 2.11.x (or higher)

Python 2.7+/3.4+

R 3.1+ and RStudio 1.0.143 (or higher)

Eclipse Mars or IntelliJ IDEA (latest)

Regarding the operating system: Linux distributions are preferable (including Debian, Ubuntu, Fedora, RHEL, and CentOS). More specifically, for Ubuntu it is recommended to have a complete 14.04 (LTS) 64-bit (or later) installation, with VMware Player 12 or VirtualBox. You can also run the code on Windows (XP/7/8/10) or macOS X (10.4.7+).

Regarding hardware configuration: a Core i3 processor is sufficient, a Core i5 is recommended, and a Core i7 will give the best results; multicore processors provide faster data processing and better scalability. Use at least 8 GB of RAM (recommended) for standalone mode and at least 32 GB of RAM for a single VM, with more for a cluster. You will also need enough storage to run heavy jobs (depending on the size of the datasets you will be handling), preferably at least 50 GB of free disk space (for standalone mode and an SQL warehouse).

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Big-Data-Analytics-with-Hadoop-3. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/BigDataAnalyticswithHadoop3_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "This file, temperatures.csv, is available as a download and once downloaded, you can move it into hdfs by running the command, as shown in the following code."

A block of code is set as follows:

hdfs dfs -copyFromLocal temperatures.csv /user/normal

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

Map-Reduce Framework -- output average temperature per city name
Map input records=35
Map output records=33
Map output bytes=208
Map output materialized bytes=286

Any command-line input or output is written as follows:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Clicking on the Datanodes tab shows all the nodes."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.

Introduction to Hadoop

This chapter introduces the reader to the world of Hadoop and the core components of Hadoop, namely the Hadoop Distributed File System (HDFS) and MapReduce. We will start by introducing the changes and new features in the Hadoop 3 release. In particular, we will talk about the new features of HDFS and Yet Another Resource Negotiator (YARN), and changes to client applications. Furthermore, we will also install a Hadoop cluster locally and demonstrate new features such as erasure coding (EC) and the timeline service. As a quick note, Chapter 12, Using Amazon Web Services, shows you how to create a Hadoop cluster in AWS.

In a nutshell, the following topics will be covered throughout this chapter:

HDFS

High availability

Intra-DataNode balancer

EC

Port numbers

MapReduce

Task-level optimization

YARN

Opportunistic containers

Timeline service v.2

Docker containerization

Other changes

Installation of Hadoop 3.1

HDFS

YARN

EC

Timeline service v.2

Hadoop Distributed File System

HDFS is a software-based filesystem implemented in Java that sits on top of the native filesystem. The main concept behind HDFS is that it divides a file into blocks (typically 128 MB) instead of dealing with the file as a whole. This allows features such as distribution, replication, and failure recovery, and, more importantly, distributed processing of the blocks across multiple machines. Block sizes can be 64 MB, 128 MB, 256 MB, or 512 MB, whatever suits the purpose. For a 1 GB file with 128 MB blocks, there will be 1024 MB / 128 MB = 8 blocks. With a replication factor of three, this makes 24 blocks. HDFS thus provides a distributed storage system with fault tolerance and failure recovery.

HDFS has two main components: the NameNode and the DataNode. The NameNode holds all the metadata for the filesystem's content: filenames, file permissions, and the location of each block of each file. Hence, it is the most important machine in HDFS. DataNodes connect to the NameNode and store the blocks within HDFS. They rely on the NameNode for all metadata about the content in the filesystem; if the NameNode has no information about a file, the DataNodes cannot serve it to any client that wants to read from or write to HDFS.

It is possible to run the NameNode and DataNode processes on a single machine; however, HDFS clusters generally consist of a dedicated server running the NameNode process and thousands of machines running the DataNode process. To serve the content information quickly, the NameNode stores the entire metadata structure in memory. It prevents data loss from machine failures by keeping track of the replication factor of blocks. Since the NameNode is a single point of failure, a secondary NameNode can be used to generate snapshots of the primary NameNode's memory structures, reducing the risk of data loss if the NameNode fails.

DataNodes have large storage capacities and, unlike the NameNode, HDFS will continue to operate normally if a DataNode fails. When a DataNode fails, the NameNode automatically takes care of the now diminished replication of all the data blocks in the failed DataNode and makes sure the replication is built back up. Since the NameNode knows all locations of the replicated blocks, any clients connected to the cluster are able to proceed with little to no hiccups.

In order to make sure that each block meets the minimum required replication factor, the NameNode replicates the lost blocks.
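As a quick, hedged illustration, you can inspect this block and replica layout with the fsck tool once you have a running cluster (the file path here is just the sample file used later in this book):

hdfs fsck /user/normal/temperatures.csv -files -blocks -locations

This prints each block of the file, its replication factor, and the DataNodes holding each replica.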

The following diagram depicts the mapping of files to blocks in the NameNode, and the storage of blocks and their replicas within the DataNodes:

The NameNode, as shown in the preceding diagram, has been the single point of failure since the beginning of Hadoop.

High availability

The loss of NameNodes can crash the cluster in both Hadoop 1.x as well as Hadoop 2.x. In Hadoop 1.x, there was no easy way to recover, whereas Hadoop 2.x introduced high availability (active-passive setup) to help recover from NameNode failures.

The following diagram shows how high availability works:

In Hadoop 3.x you can have two passive NameNodes along with the active node, as well as five JournalNodes to assist with recovery from catastrophic failures:

NameNode machines: The machines on which you run the active and standby NameNodes. They should have equivalent hardware to each other and to what would be used in a non-HA cluster.

JournalNode machines: The machines on which you run the JournalNodes. The JournalNode daemon is relatively lightweight, so these daemons may reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager. 

Intra-DataNode balancer

HDFS has a way to balance the data blocks across the DataNodes, but there is no such balancing inside a single DataNode with multiple hard disks. Hence, a DataNode with, say, 12 spindles can end up with unbalanced physical disks. Why does this matter for performance? With unbalanced disks, a DataNode may hold the same number of blocks as its peers, but reads and writes will be skewed toward the fuller disks. Hadoop 3.x therefore introduces the intra-DataNode balancer to balance the physical disks inside each DataNode and reduce data skew.

This improves the read and write performance of any process running on the cluster, such as a mapper or a reducer.
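As a sketch of how this is operated (the hostname and the plan file path are hypothetical examples; the plan step prints the actual path, and dfs.disk.balancer.enabled must be true in hdfs-site.xml if it is not already):

# Generate a plan describing how to move blocks between this node's disks
hdfs diskbalancer -plan datanode1.example.com
# Execute the plan file that the previous step wrote to HDFS
hdfs diskbalancer -execute /system/diskbalancer/2018-May-28/datanode1.example.com.plan.json
# Check on the progress of the move
hdfs diskbalancer -query datanode1.example.com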

Erasure coding

HDFS has been the fundamental component since the inception of Hadoop. In Hadoop 1.x as well as Hadoop 2.x, a typical HDFS installation uses a replication factor of three.

Compared to the default replication factor of three, EC is probably the biggest change in HDFS in years, and it fundamentally doubles the usable capacity for many datasets by bringing the effective storage overhead down from 3x to about 1.4x. Let's now understand what EC is all about.

EC is a method of data protection in which data is broken into fragments, expanded, encoded with redundant data pieces, and stored across a set of different locations or storage media. If data is lost to corruption at some point, it can be reconstructed from the information stored elsewhere. Although EC is more CPU intensive, it greatly reduces the storage needed to reliably store large amounts of data in HDFS. HDFS traditionally uses replication to provide reliable storage, and this is expensive, typically requiring three copies of the data and thus causing a 200% overhead in storage space.
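A minimal sketch of the admin workflow, assuming a hypothetical /data/cold directory (the policy names come from the -listPolicies output):

# List the EC policies the cluster knows about
hdfs ec -listPolicies
# Enable the Reed-Solomon 6 data + 3 parity policy cluster-wide
hdfs ec -enablePolicy -policy RS-6-3-1024k
# Apply it to a directory; new files under it are striped and encoded
hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k
# Confirm the policy now in effect
hdfs ec -getPolicy -path /data/cold

Files written under that directory are then stored with roughly 1.5x overhead instead of the 3x incurred by triple replication.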

Port numbers

In Hadoop 3.x, many of the ports for various services have been changed.

Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768–61000). This meant that at startup, a service would sometimes fail to bind to its port because another application had already claimed it.

These conflicting ports have been moved out of the ephemeral range, affecting the NameNode, Secondary NameNode, DataNode, and KMS. 

The changes are listed as follows:

NameNode ports: 50470 → 9871, 50070 → 9870, and 8020 → 9820

Secondary NameNode ports: 50091 → 9869 and 50090 → 9868

DataNode ports: 50020 → 9867, 50010 → 9866, 50475 → 9865, and 50075 → 9864
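If you need to pin one of these services to a particular address or port, the corresponding property can be overridden in hdfs-site.xml. A minimal sketch for the NameNode web UI (the value shown simply restates the new default):

<property>
  <name>dfs.namenode.http-address</name>
  <value>0.0.0.0:9870</value>
</property>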

MapReduce framework

An easy way to understand this concept is to imagine that you and your friends want to sort piles of fruit into boxes. You assign each person one raw basket of mixed fruit and have them separate the fruit by type into various boxes. In the end, you have many boxes of sorted fruit from all your friends. You can then assign a group to combine the boxes of the same kind of fruit, weigh each box, and seal it for shipping. A classic example of the MapReduce framework at work is the word count example. The following are the various stages of processing the input data, first splitting the input across multiple worker nodes and then finally generating the output, the word counts:

The MapReduce framework consists of a single ResourceManager and multiple NodeManagers (usually, NodeManagers coexist with the DataNodes of HDFS). 
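As a hedged, concrete illustration, the word count example ships with the Hadoop examples JAR, so once the installation later in this chapter is complete you can run it against a local input directory (the directory names are just examples):

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar wordcount input output_wc
cat output_wc/*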

Task-level native optimization

MapReduce has added support for a native implementation of the map output collector. This new support can result in a performance improvement of about 30% or more, particularly for shuffle-intensive jobs.

The native library will build automatically with the -Pnative Maven profile. Users may choose the new collector on a job-by-job basis by setting mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator in their job configuration.
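A sketch of such a per-job submission (the JAR and driver class names are hypothetical, and the -D option requires a driver that uses ToolRunner/GenericOptionsParser):

hadoop jar my-analytics-job.jar com.example.MyJobDriver \
  -D mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator \
  input output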

The basic idea is to add a NativeMapOutputCollector to handle the key/value pairs emitted by the mapper. As a result, sort, spill, and IFile serialization can all be done in native code. A preliminary test (on a Xeon E5410 with jdk6u24) showed promising results, as follows:

sort is about 3-10 times faster than Java (only binary string compare is supported)

IFile serialization speed is about three times faster than Java: about 500 MB per second. If CRC32C hardware is used, things can get much faster, in the range of 1 GB per second or higher

Merge code is not completed yet, so the test uses enough io.sort.mb to prevent mid-spill

YARN

When an application wants to run, the client launches the ApplicationMaster, which then negotiates with the ResourceManager to get resources in the cluster in the form of containers. A container represents CPUs (cores) and memory allocated on a single node to be used to run tasks and processes. Containers are supervised by the NodeManager and scheduled by the ResourceManager.

Examples of containers:

One core and 4 GB RAM

Two cores and 6 GB RAM

Four cores and 20 GB RAM

Some containers are assigned to be mappers and others to be reducers; all this is coordinated by the ApplicationMaster in conjunction with the ResourceManager. This framework is called YARN:

Using YARN, several different applications can request and execute tasks on containers, sharing the cluster resources fairly well. However, as the cluster grows and the variety of applications and requirements changes, resource utilization becomes less efficient over time.

Opportunistic containers

Opportunistic containers can be dispatched to a NodeManager even if their execution cannot begin immediately, unlike regular YARN containers, which are scheduled on a node if and only if there are unallocated resources.

In these scenarios, opportunistic containers are queued at the NodeManager until the required resources become available. The ultimate goal of these containers is to improve cluster resource utilization and, in turn, task throughput.
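A minimal yarn-site.xml sketch for turning this behavior on, using the standard opportunistic-container properties (the queue length value is illustrative):

<property>
  <name>yarn.resourcemanager.opportunistic-container-allocation.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.opportunistic-containers-max-queue-length</name>
  <value>10</value>
</property>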

Types of container execution 

There are two types of container, as follows:

Guaranteed containers

:

 These containers 

correspond to the existing YARN containers. They are assigned by the capacity scheduler. They are transmitted to a node if and only if there are resources available to begin their execution immediately. 

Opportunistic containers

: Unlike guaranteed containers, in this case we cannot guarantee that there will be resources available to begin their execution once they are dispatched to a node. On the contrary, they will be queued at the NodeManager itself until resources become available.

YARN timeline service v.2

The YARN timeline service v.2 addresses the following two major challenges:

Enhancing the scalability and reliability of the timeline service

Improving usability by introducing flows and aggregation

Enhancing scalability and reliability

Version 2 adopts a more scalable distributed writer architecture and backend storage, as opposed to v.1, which does not scale well beyond small clusters because it uses a single instance of the writer/reader and backend storage.

Since Apache HBase scales well even to larger clusters and continues to maintain good read and write response times, v.2 selects it as the primary backend storage.

Usability improvements

Users are often more interested in information at the level of flows, that is, logical groups of YARN applications, since it is common to launch a series of YARN applications to complete one logical workflow.

In order to achieve this, v.2 supports the notion of flows and aggregates metrics at the flow level.

Architecture

YARN Timeline Service v.2 uses a set of collectors (writers) to write data to the backend storage. The collectors are distributed and co-located with the application masters they are dedicated to. All data belonging to an application is sent to the application-level timeline collectors, with the exception of the resource manager timeline collector.

For a given application, the application master can write data for the application to the co-located timeline collectors (which is an NM auxiliary service in this release). In addition, node managers of other nodes that are running the containers for the application also write data to the timeline collector on the node that is running the application master. 

The resource manager also maintains its own timeline collector. It emits only YARN-generic life-cycle events to keep its volume of writes reasonable.

The timeline readers are daemons separate from the timeline collectors, and they are dedicated to serving queries via the REST API. The following diagram illustrates the design at a high level:
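To actually switch a cluster over to v.2, a minimal yarn-site.xml sketch looks like the following (the ZooKeeper quorum value assumes a local, single-node HBase setup):

<property>
  <name>yarn.timeline-service.version</name>
  <value>2.0f</value>
</property>
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.hbase.zookeeper.quorum</name>
  <value>localhost:2181</value>
</property>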

Other changes

Hadoop 3 includes other changes, mainly to make it easier to maintain and operate. In particular, the command-line tools have been revamped to better suit the needs of operations teams.

Minimum required Java version 

All Hadoop JARs are now compiled to target a runtime version of Java 8. Hence, users that are still using Java 7 or lower must upgrade to Java 8.

Shell script rewrite

The Hadoop shell scripts have been rewritten to fix many long-standing bugs and include some new features. 

Incompatible changes are documented in the release notes. You can find them at https://issues.apache.org/jira/browse/HADOOP-9902.

There are more details available in the documentation at https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-common/UnixShellGuide.html. The documentation present at https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-common/UnixShellAPI.html will appeal to power users, as it describes most of the new functionalities, particularly those related to extensibility.

Shaded-client JARs

The new hadoop-client-api and hadoop-client-runtime artifacts have been added, as referred to by https://issues.apache.org/jira/browse/HADOOP-11804. These artifacts shade Hadoop's dependencies into a single JAR. As a result, it avoids leaking Hadoop's dependencies onto the application's classpath.
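A sketch of how a build might depend on these artifacts (Maven coordinates as published by the Hadoop project; the version is an example):

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-api</artifactId>
  <version>3.1.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-runtime</artifactId>
  <version>3.1.0</version>
  <scope>runtime</scope>
</dependency>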

Hadoop now also supports integration with Microsoft Azure Data Lake and Aliyun Object Storage System as an alternative for Hadoop-compatible filesystems.

Installing Hadoop 3 

In this section, we shall see how to install a single-node Hadoop 3 cluster on your local machine. In order to do this, we will be following the documentation given at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html.

This document gives a detailed description of how to install and configure a single-node Hadoop setup so that you can quickly carry out simple operations using Hadoop MapReduce and HDFS.

Prerequisites

Java 8 must be installed for Hadoop to run. If Java 8 is not present on your machine, you can download and install it from https://www.java.com/en/download/.

The following will appear on your screen when you open the download link in the browser:

Downloading

Download the Hadoop 3.1 version using the following link: http://apache.spinellicreations.com/hadoop/common/hadoop-3.1.0/.

The following screenshot is the page shown when the download link is opened in the browser:

When you get this page in your browser, simply download the hadoop-3.1.0.tar.gz file to your local machine.

Installation

Perform the following steps to install a single-node Hadoop cluster on your machine:

Extract the downloaded file using the following command:

tar -xvzf hadoop-3.1.0.tar.gz

Once you have extracted the Hadoop binaries, run the following commands to test them and make sure they work on your local machine:

cd hadoop-3.1.0

mkdir input

cp etc/hadoop/*.xml input

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar grep input output 'dfs[a-z.]+'

cat output/*

If everything runs as expected, you will see an output directory containing some results, which shows that the sample command worked.

A typical error at this point is missing Java. You might want to check whether you have Java installed on your machine and the JAVA_HOME environment variable set correctly.
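A common fix is to set JAVA_HOME explicitly in etc/hadoop/hadoop-env.sh; the JDK path below is just an example and will differ from machine to machine:

# In etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64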

Setup password-less ssh

Now check whether you can ssh to localhost without a passphrase by running a simple command, as follows:

$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

Setting up the NameNode

Make the following changes to the configuration file etc/hadoop/core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Make the following changes to the configuration file etc/hadoop/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value><YOURDIRECTORY>/hadoop-3.1.0/dfs/name</value>
  </property>
</configuration>

Starting HDFS

Follow these steps as shown to start HDFS (NameNode and DataNode):

Format the filesystem:

$ ./bin/hdfs namenode -format

Start the NameNode daemon and the DataNode daemon:

$ ./sbin/start-dfs.sh

The Hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

Browse the web interface for the NameNode; by default it is available at http://localhost:9870/.

Make the HDFS directories required to execute MapReduce jobs:

$ ./bin/hdfs dfs -mkdir /user
$ ./bin/hdfs dfs -mkdir /user/<username>

When you're done, stop the daemons with the following:

$ ./sbin/stop-dfs.sh

Open a browser to check your local Hadoop, which can be viewed in the browser at http://localhost:9870/. The following is what the HDFS installation looks like:

Clicking on the Datanodes tab shows the nodes, as shown in the following screenshot:

Figure: Screenshot showing the nodes in the Datanodes tab

Clicking on the logs link will show the various logs in your cluster, as shown in the following screenshot:

As shown in the following screenshot, you can also look at the various JVM metrics of your cluster components:

As shown in the following screenshot, you can also check the configuration. This is a good place to look at the entire configuration and all the default settings:

You can also browse the filesystem of your newly installed cluster, as shown in the following screenshot:

Figure: Screenshot showing the Browse Directory view and how you can browse the filesystem in your newly installed cluster

At this point, we should all be able to see and use a basic HDFS cluster. But this is just an HDFS filesystem with some directories and files; we also need a job/task scheduling service to actually use the cluster for computation rather than just storage.

Setting up the YARN service

In this section, we will set up a YARN service and start the components needed to run and operate a YARN cluster:
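Before starting the daemons, the single-node setup guide that this installation follows also has you point MapReduce at YARN and enable the shuffle auxiliary service; a minimal sketch of the two files is as follows.

In etc/hadoop/mapred-site.xml:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

In etc/hadoop/yarn-site.xml:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>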

Start the ResourceManager daemon and the NodeManager daemon:

$ sbin/start-yarn.sh

Browse the web interface for the ResourceManager; by default it is available at http://localhost:8088/

Run a MapReduce job (see the example after these steps)

When you're done, stop the daemons with the following:

$ sbin/stop-yarn.sh
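For the third step, a quick hedged example is the pi estimator from the same examples JAR used earlier (the two arguments are the number of maps and the number of samples per map):

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar pi 2 5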

The following is the YARN ResourceManager, which you can view by putting the URL http://localhost:8088/ into the browser:

Figure: Screenshot of the YARN ResourceManager