Apache Hadoop 3 Quick Start Guide

Hrishikesh Vijay Karambelkar
Description

Apache Hadoop is a widely used distributed data platform. It enables large datasets to be processed efficiently across clusters of machines, instead of relying on one large computer to store and process the data. This book will get you started with the Hadoop ecosystem, and introduce you to the main technical topics, including MapReduce, YARN, and HDFS.

The book begins with an overview of big data and Apache Hadoop. Then, you will set up a pseudo Hadoop development environment and a multi-node enterprise Hadoop cluster. You will see how parallel programming paradigms such as MapReduce can solve many complex data processing problems.

The book also covers the important aspects of the big data software development lifecycle, including quality assurance and control, performance, administration, and monitoring.

You will then learn about the Hadoop ecosystem, and tools such as Kafka, Sqoop, Flume, Pig, Hive, and HBase. Finally, you will look at advanced topics, including real time streaming using Apache Storm, and data analytics using Apache Spark.

By the end of the book, you will be well versed with different configurations of the Hadoop 3 cluster.

Apache Hadoop 3 Quick Start Guide
Learn about big data processing and analytics
Hrishikesh Vijay Karambelkar
BIRMINGHAM - MUMBAI

Apache Hadoop 3 Quick Start Guide

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Reshma Raman
Content Development Editor: Kirk Dsouza
Technical Editor: Jinesh Topiwala
Copy Editor: Safis Editing
Project Coordinator: Hardik Bhinde
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Alishon Mendonsa
Production Coordinator: Deepika Naik

First published: October 2018

Production reference: 1311018

Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78899-983-0

www.packtpub.com

To my lovely wife, Dhanashree, for her unconditional support and endless love.
– Hrishikesh Vijay Karambelkar
mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

Packt.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Hrishikesh Vijay Karambelkar is an innovator and an enterprise architect with 16 years of software design and development experience, specifically in the areas of big data, enterprise search, data analytics, text mining, and databases. He is passionate about architecting new software implementations for the next generation of software solutions for various industries, including oil and gas, chemicals, manufacturing, utilities, healthcare, and government infrastructure. In the past, he has authored three books for Packt Publishing: two editions of Scaling Big Data with Hadoop and Solr and one of Scaling Apache Solr. He has also worked with graph databases, and some of his work has been published at international conferences such as VLDB and ICDE.

Writing a book is harder than I thought and more rewarding than I could have ever imagined. None of this would have been possible without support from my wife, Dhanashree. I'm eternally grateful to my parents, who have always encouraged me to work sincerely and respect others. Special thanks to my editor, Kirk, who ensured that the book was completed within the stipulated time and to the highest quality standards. I would also like to thank all the reviewers.

About the reviewer

Dayong Du has led a career dedicated to enterprise data and analytics for more than 10 years, with a focus on enterprise use cases of open source big data technologies such as Hadoop, Hive, HBase, and Spark. Dayong is a big data practitioner, as well as an author and coach. He has published the first and second editions of Apache Hive Essentials and has coached many people who are interested in learning about and using big data technology. In addition, he is a seasoned blogger, contributor, and adviser for big data start-ups, and a co-founder of the Toronto Big Data Professionals Association.

I would like to sincerely thank my wife and daughter for their sacrifices and encouragement during my time spent on the big data community and technology.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Apache Hadoop 3 Quick Start Guide

Dedication

Packt Upsell

Why subscribe?

Packt.com

Contributors

About the author

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Code in action

Conventions used

Get in touch

Reviews

Hadoop 3.0 - Background and Introduction

How it all started

What Hadoop is and why it is important

How Apache Hadoop works

Resource Manager

Node Manager

YARN Timeline Service version 2

NameNode

DataNode

Hadoop 3.0 releases and new features

Choosing the right Hadoop distribution

Cloudera Hadoop distribution

Hortonworks Hadoop distribution

MapR Hadoop distribution

Summary

Planning and Setting Up Hadoop Clusters

Technical requirements

Prerequisites for Hadoop setup

Preparing hardware for Hadoop

Readying your system

Installing the prerequisites

Working across nodes without passwords (keyless SSH)

Downloading Hadoop

Running Hadoop in standalone mode

Setting up a pseudo Hadoop cluster

Planning and sizing clusters

Initial load of data

Organizational data growth

Workload and computational requirements

High availability and fault tolerance

Velocity of data and other factors

Setting up Hadoop in cluster mode

Installing and configuring HDFS in cluster mode

Setting up YARN in cluster mode

Diagnosing the Hadoop cluster

Working with log files

Cluster debugging and tuning tools

JPS (Java Virtual Machine Process Status)

JStack

Summary

Deep Dive into the Hadoop Distributed File System

Technical requirements

How HDFS works

Key features of HDFS

Achieving multi-tenancy in HDFS

Snapshots of HDFS

Safe mode

Hot swapping

Federation

Intra-DataNode balancer

Data flow patterns of HDFS

HDFS as primary storage with cache

HDFS as archival storage

HDFS as historical storage

HDFS as a backbone

HDFS configuration files

Hadoop filesystem CLIs

Working with HDFS user commands

Working with Hadoop shell commands

Working with data structures in HDFS

Understanding SequenceFile

MapFile and its variants

Summary

Developing MapReduce Applications

Technical requirements

How MapReduce works

What is MapReduce?

An example of MapReduce

Configuring a MapReduce environment

Working with mapred-site.xml

Working with Job history server

RESTful APIs for Job history server

Understanding Hadoop APIs and packages

Setting up a MapReduce project

Setting up an Eclipse project

Deep diving into MapReduce APIs

Configuring MapReduce jobs

Understanding input formats

Understanding output formats

Working with Mapper APIs

Working with the Reducer API

Compiling and running MapReduce jobs

Triggering the job remotely

Using Tool and ToolRunner

Unit testing of MapReduce jobs

Failure handling in MapReduce

Streaming in MapReduce programming

Summary

Building Rich YARN Applications

Technical requirements

Understanding YARN architecture

Key features of YARN

Resource models in YARN

YARN federation

RESTful APIs

Configuring the YARN environment in a cluster

Working with YARN distributed CLI

Deep dive with YARN application framework

Setting up YARN projects

Writing your YARN application with YarnClient

Writing a custom application master

Building and monitoring a YARN application on a cluster

Building a YARN application

Monitoring your application

Summary

Monitoring and Administration of a Hadoop Cluster

Roles and responsibilities of Hadoop administrators

Planning your distributed cluster

Hadoop applications, ports, and URLs

Resource management in Hadoop

Fair Scheduler

Capacity Scheduler

High availability of Hadoop

High availability for NameNode

High availability for Resource Manager

Securing Hadoop clusters

Securing your Hadoop application

Securing your data in HDFS

Performing routine tasks

Working with safe mode

Archiving in Hadoop

Commissioning and decommissioning of nodes

Working with Hadoop Metrics

Summary

Demystifying Hadoop Ecosystem Components

Technical requirements

Understanding Hadoop's Ecosystem

Working with Apache Kafka

Writing Apache Pig scripts

Pig Latin

User-defined functions (UDFs)

Transferring data with Sqoop

Writing Flume jobs

Understanding Hive

Interacting with Hive – CLI, beeline, and web interface

Hive as a transactional system

Using HBase for NoSQL storage

Summary

Advanced Topics in Apache Hadoop

Technical requirements

Hadoop use cases in industries

Healthcare

Oil and Gas

Finance

Government Institutions

Telecommunications

Retail

Insurance

Advanced Hadoop data storage file formats

Parquet

Apache ORC

Avro

Real-time streaming with Apache Storm

Data analytics with Apache Spark

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

This book is a quick-start guide for learning Apache Hadoop version 3. It is targeted at readers with no prior knowledge of Apache Hadoop, and covers key big data concepts, such as data manipulation using MapReduce, flexible model utilization with YARN, and storing different datasets with Hadoop Distributed File System (HDFS). This book will teach you about different configurations of Hadoop version 3 clusters, from a lightweight developer edition to an enterprise-ready deployment. Throughout your journey, this guide will demonstrate how parallel programming paradigms such as MapReduce can be used to solve many complex data processing problems, using case studies and code to do so. Along with development, the book will also cover the important aspects of the big data software development life cycle, such as quality assurance and control, performance, administration, and monitoring. This book serves as a starting point for those who wish to master the Apache Hadoop ecosystem.

Who this book is for

Hadoop 3 Quick Start Guide is intended for those who wish to learn about Apache Hadoop version 3 as quickly as possible, including its most important areas, such as MapReduce, YARN, and HDFS. This book serves as a starting point for programmers who want to analyze datasets of any kind with the help of big data, quality teams interested in evaluating MapReduce programs for functionality and performance, administrators setting up enterprise-ready Hadoop clusters with horizontal scaling, and anyone who wishes to enhance their expertise in Apache Hadoop version 3 in order to solve complex problems.

What this book covers

Chapter 1, Hadoop 3.0 – Background and Introduction, gives you an overview of big data and Apache Hadoop. You will go through the history of Apache Hadoop's evolution, learn about what Hadoop offers today, and explore how it works. Also, you'll learn about the architecture of Apache Hadoop, as well as its new features and releases. Finally, you'll cover the commercial implementations of Hadoop.

Chapter 2, Planning and Setting Up Hadoop Clusters, covers the installation and setup of Apache Hadoop. We will start with learning about the prerequisites for setting up a Hadoop cluster. You will go through the different Hadoop configurations available for users, covering development mode, pseudo-distributed single nodes, and cluster setup. You'll learn how each of these configurations can be set up, and also run an example application of the configuration. Toward the end of the chapter, we will cover how you can diagnose Hadoop clusters by understanding log files and the different debugging tools available.

Chapter 3, Deep Diving into the Hadoop Distributed File System, goes into how HDFS works and its key features. We will look at the different data flow patterns of HDFS, examining HDFS in different roles. Also, we'll take a look at various command-line interface commands for HDFS and the Hadoop shell. Finally, we'll look at the data structures that are used by HDFS, with some examples.

Chapter 4, Developing MapReduce Applications, looks in depth at various topics pertaining to MapReduce. We will start by understanding the concept of MapReduce. We will take a look at the Hadoop application URL ports. Also, we'll study the different data formats needed for MapReduce. Then, we'll take a look at job compilation, remote job runs, and using utilities such as Tool. Finally, we'll learn about unit testing and failure handling.

Chapter 5, Building Rich YARN Applications, teaches you about the YARN architecture and the key features of YARN, such as resource models, federation, and RESTful APIs. Then, you'll configure a YARN environment in a Hadoop distributed cluster. Also, you'll study some of the additional properties of yarn-site.xml. You'll learn about the YARN distributed command-line interface. After this, we will delve into building YARN applications and monitoring them.

Chapter 6, Monitoring and Administration of a Hadoop Cluster, explores the different activities performed by Hadoop administrators for the monitoring and optimization of a Hadoop cluster. You'll learn about the roles and responsibilities of an administrator, followed by cluster planning. You'll dive deep into key management aspects of Hadoop clusters, such as resource management through job scheduling with algorithms such as Fair Scheduler and Capacity Scheduler. Also, you'll discover how to ensure high availability and security for an Apache Hadoop cluster.

Chapter 7, Demystifying Hadoop Ecosystem Components, covers the different components that constitute Hadoop's overall ecosystem offerings for solving complex industrial problems. We will give a brief overview of the tools and software that run on Hadoop, and take a look at some components, such as Apache Kafka, Apache Pig, Apache Sqoop, and Apache Flume. After that, we'll cover the SQL and NoSQL Hadoop-based databases: Hive and HBase, respectively.

Chapter 8, Advanced Topics in Apache Hadoop, gets into advanced topics, such as the use of Hadoop for analytics using Apache Spark and processing streaming data using an Apache Storm pipeline. It will provide an overview of real-world use cases for different industries, with some sample code for you to try out independently.

To get the most out of this book

You won't need too much hardware to set up Hadoop. The minimum setup is a single machine / virtual machine, and the recommended setup is three machines.

It is better to have some hands-on experience of writing and running basic programs in Java, as well as some experience of using developer tools such as Eclipse.

Some understanding of the standard software development life cycle would be a plus.

As this is a quick-start guide, it does not provide complete coverage of all topics. Therefore, you will find links provided throughout the book to take you to more in-depth coverage of the given topic.

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at www.packt.com.

Select the SUPPORT tab.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Apache-Hadoop-3-Quick-Start-Guide. In case there's an update to the code, it will be updated in the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Code in action

Visit the following link to check out videos of the code being run:

http://bit.ly/2AznxS3

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "You will need the hadoop-client-<version>.jar file to be added".

A block of code is set as follows:

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.0</version>
    </dependency>
</dependencies>

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://<master-host>:9000</value>
    </property>
</configuration>

Any command-line input or output is written as follows:

hrishikesh@base0:/$ df -m

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Right-click on the project and run Maven install, as shown in the following screenshot".

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

How it all started

In the early 2000s, search engines on the World Wide Web were competing to deliver improved and more accurate results. One of the key challenges was indexing this large volume of data while keeping hardware costs under control. Doug Cutting and Mike Cafarella started development of Nutch in 2002, which would include a search engine and web crawler. However, the biggest challenge was indexing billions of pages, due to the lack of mature cluster management systems. In 2003, Google published a research paper on the Google File System (GFS) (https://ai.google/research/pubs/pub51). This helped Cutting and Cafarella devise a distributed filesystem for Nutch, called NDFS. In 2004, Google introduced MapReduce programming to the world. The concept of MapReduce was inspired by the Lisp programming language. In 2006, Hadoop was created under the Lucene umbrella. In the same year, Doug was employed by Yahoo to solve some of the most challenging issues with Yahoo Search, which was barely surviving. The following is a timeline of these and later events:

In 2007, many companies, such as LinkedIn, Twitter, and Facebook, started working on this platform, while Yahoo's production Hadoop cluster reached the 1,000-node mark. In 2008, the Apache Software Foundation (ASF) moved Hadoop out of Lucene and graduated it to a top-level project. This was the time when the first Hadoop-based commercial system integration company, Cloudera, was formed.

In 2009, AWS started offering MapReduce hosting capabilities, while Yahoo reached the 24,000-node production cluster mark. This was the year when another system integrator (SI), MapR, was founded. In 2010, the ASF released HBase, Hive, and Pig to the world. In 2011, the road ahead for Yahoo looked difficult, so the original Hadoop developers separated from Yahoo and formed a company called Hortonworks, which offers a 100% open source implementation of Hadoop. The same team also became part of the Project Management Committee of the ASF.

In 2012, the ASF published the first major release, Hadoop 1.0, and the very next year it released Hadoop 2.X. In subsequent years, the Apache open source community continued with minor releases of Hadoop, thanks to its dedicated, diverse community of developers. In 2017, the ASF released Apache Hadoop version 3.0. Along similar lines, companies such as Hortonworks, Cloudera, MapR, and Greenplum provide their own distributions of the Apache Hadoop ecosystem.

What Hadoop is and why it is important

Apache Hadoop is a collection of open source software that enables the distributed storage and processing of large datasets across clusters of different types of computer systems. The Apache Hadoop framework consists of the following four key modules:

Apache Hadoop Common

Apache Hadoop Distributed File System (HDFS)

Apache Hadoop MapReduce

Apache Hadoop YARN (Yet Another Resource Negotiator)

Each of these modules covers different capabilities of the Hadoop framework. The following diagram depicts their positioning in terms of applicability for Hadoop 3.X releases:

Apache Hadoop Common consists of shared libraries that are consumed across all other modules, including key management, generic I/O packages, libraries for metric collection, and utilities for the registry, security, and streaming. Apache Hadoop HDFS provides a highly fault-tolerant distributed filesystem that runs across clustered computers.
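As a quick, hands-on illustration of how client applications talk to HDFS, the following is a minimal Java sketch using the standard org.apache.hadoop.fs.FileSystem API (pulled in by the hadoop-client dependency shown in the Preface). The NameNode URI (hdfs://localhost:9000), class name, and file path are assumptions for a local pseudo-distributed setup, not values prescribed by this book:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; adjust the URI for your cluster.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Create a file in HDFS and write one line to it; HDFS transparently
        // splits the file into blocks and replicates them across DataNodes.
        Path file = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("Hello, HDFS!\n");
        }

        // Query the NameNode for the file's metadata.
        System.out.println("Exists: " + fs.exists(file));
        System.out.println("Size:   " + fs.getFileStatus(file).getLen() + " bytes");
    }
}

The same code runs unchanged against a multi-node cluster; only the fs.defaultFS setting needs to change.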

Apache Hadoop provides a distributed data processing framework for large datasets through a simple programming model called MapReduce. A programming task is divided into multiple identical subtasks, which are distributed among multiple machines for processing; each of these subtasks is called a map task. The results of these map tasks are then combined into one or more reduce tasks. Overall, this approach to computing tasks is called the MapReduce approach. The MapReduce programming paradigm forms the heart of the Apache Hadoop framework, and any application that is deployed on this framework must comply with MapReduce programming. Each task is divided into a mapper task, followed by a reducer task. The following diagram demonstrates how MapReduce uses the divide-and-conquer methodology to solve complex problems in a simplified way:

Apache Hadoop MapReduce provides a framework for writing applications that process large amounts of data in parallel on Hadoop clusters in a reliable manner. The following diagram describes the placement of the multiple layers of the Hadoop framework. Apache Hadoop YARN provides a new runtime for MapReduce (also called MapReduce 2) for running distributed applications across clusters; this module was introduced from Hadoop version 2 onward. We will discuss these modules further in later chapters. Together, these components provide a base platform on which to build and run applications from scratch. To speed up the overall application-building experience and to provide efficient mechanisms for large data processing, storage, and analytics, the Apache Hadoop ecosystem comprises additional software. We will cover this in the last section of this chapter.
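To make the split between map and reduce tasks concrete, here is a minimal word-count job in Java, closely modeled on the canonical example from the Apache Hadoop documentation; the class names and command-line paths are illustrative, not taken from this book:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: emit a (word, 1) pair for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce task: sum the counts collected for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate map output locally
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each map task tokenizes one input split and emits (word, 1) pairs; the framework then groups the pairs by key, and each reduce task sums the counts for its share of the words.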

Now that we have given a quick overview of the Apache Hadoop framework, let's understand why Hadoop-based systems are needed in the real world.

Apache Hadoop was invented to solve large data problems that no existing system or commercial software could address. With the help of Apache Hadoop, data that used to be archived on tape backups or simply lost is now being utilized in the system. This data offers immense opportunities to provide historical insights and to predict the best course of action. Hadoop is targeted at solving problems involving the four Vs of data: Volume, Variety, Velocity, and Veracity. The following diagram shows the key reasons why Apache Hadoop is useful for business: