Hadoop Real-World Solutions Cookbook - Second Edition

Tanmay Deshpande
Description

Over 90 hands-on recipes to help you learn and master the intricacies of Apache Hadoop 2.X, YARN, Hive, Pig, Oozie, Flume, Sqoop, Apache Spark, and Mahout

About This Book

  • Implement outstanding machine learning use cases with your own analytics models and processes.
  • Solve common problems encountered when working with the Hadoop ecosystem.
  • Follow step-by-step implementations of end-to-end big data use cases.

Who This Book Is For

Readers who have a basic knowledge of big data systems and want to advance their knowledge with hands-on recipes.

What You Will Learn

  • Install and maintain a Hadoop 2.X cluster and its ecosystem.
  • Write advanced Map Reduce programs and understand design patterns.
  • Perform advanced data analysis using Hive, Pig, and Map Reduce programs.
  • Import and export data from various sources using Sqoop and Flume.
  • Store data in file formats such as Text, Sequential, Parquet, ORC, and RC files.
  • Apply machine learning principles with libraries such as Mahout.
  • Process batch and streaming data using Apache Spark.

In Detail

Big data is the need of the day, and most organizations produce huge amounts of data every day. With the arrival of Hadoop-like tools, it has become easier for everyone to solve big data problems with great efficiency and at minimal cost. Grasping machine learning techniques will help you greatly in building predictive models and using this data to make the right decisions for your organization.

Hadoop Real-World Solutions Cookbook gives readers insights into learning and mastering big data through recipes. The book not only covers most of the big data tools currently on the market but also provides best practices for using them. The recipes are based on the latest versions of Apache Hadoop 2.X, YARN, Hive, Pig, Sqoop, Flume, Apache Spark, Mahout, and many other ecosystem tools. This real-world-solutions cookbook is packed with handy recipes you can apply to your own everyday issues. Each chapter provides in-depth recipes that can be referenced easily. The book gives detailed practice on the latest technologies, such as YARN and Apache Spark, and readers will be able to consider themselves big data experts on completing it.

This guide is an invaluable tutorial if you are planning to implement a big data warehouse for your business.

Style and approach

An easy-to-follow guide that walks you through the world of big data. Each tool in the Hadoop ecosystem is explained in detail, and the recipes are arranged so that readers can implement them sequentially. Plenty of reference links are provided for further reading.




Table of Contents

Hadoop Real-World Solutions Cookbook Second Edition
Credits
About the Author
Acknowledgements
About the Reviewer
www.PacktPub.com
eBooks, discount offers, and more
Why Subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Getting Started with Hadoop 2.X
Introduction
Installing a single-node Hadoop Cluster
Getting ready
How to do it...
How it works...
Hadoop Distributed File System (HDFS)
Yet Another Resource Negotiator (YARN)
There's more
Installing a multi-node Hadoop cluster
Getting ready
How to do it...
How it works...
Adding new nodes to existing Hadoop clusters
Getting ready
How to do it...
How it works...
Executing the balancer command for uniform data distribution
Getting ready
How to do it...
How it works...
There's more...
Entering and exiting from the safe mode in a Hadoop cluster
How to do it...
How it works...
Decommissioning DataNodes
Getting ready
How to do it...
How it works...
Performing benchmarking on a Hadoop cluster
Getting ready
How to do it...
TestDFSIO
NNBench
MRBench
How it works...
2. Exploring HDFS
Introduction
Loading data from a local machine to HDFS
Getting ready
How to do it...
How it works...
Exporting HDFS data to a local machine
Getting ready
How to do it...
How it works...
Changing the replication factor of an existing file in HDFS
Getting ready
How to do it...
How it works...
Setting the HDFS block size for all the files in a cluster
Getting ready
How to do it...
How it works...
Setting the HDFS block size for a specific file in a cluster
Getting ready
How to do it...
How it works...
Enabling transparent encryption for HDFS
Getting ready
How to do it...
How it works...
Importing data from another Hadoop cluster
Getting ready
How to do it...
How it works...
Recycling deleted data from trash to HDFS
Getting ready
How to do it...
How it works...
Saving compressed data in HDFS
Getting ready
How to do it...
How it works...
3. Mastering Map Reduce Programs
Introduction
Writing the Map Reduce program in Java to analyze web log data
Getting ready
How to do it...
How it works...
Executing the Map Reduce program in a Hadoop cluster
Getting ready
How to do it
How it works...
Adding support for a new writable data type in Hadoop
Getting ready
How to do it...
How it works...
Implementing a user-defined counter in a Map Reduce program
Getting ready
How to do it...
How it works...
Map Reduce program to find the top X
Getting ready
How to do it...
How it works
Map Reduce program to find distinct values
Getting ready
How to do it
How it works...
Map Reduce program to partition data using a custom partitioner
Getting ready
How to do it...
How it works...
Writing Map Reduce results to multiple output files
Getting ready
How to do it...
How it works...
Performing Reduce side Joins using Map Reduce
Getting ready
How to do it
How it works...
Unit testing the Map Reduce code using MRUnit
Getting ready
How to do it...
How it works...
4. Data Analysis Using Hive, Pig, and Hbase
Introduction
Storing and processing Hive data in a sequential file format
Getting ready
How to do it...
How it works...
Storing and processing Hive data in the RC file format
Getting ready
How to do it...
How it works...
Storing and processing Hive data in the ORC file format
Getting ready
How to do it...
How it works...
Storing and processing Hive data in the Parquet file format
Getting ready
How to do it...
How it works...
Performing FILTER By queries in Pig
Getting ready
How to do it...
How it works...
Performing Group By queries in Pig
Getting ready
How to do it...
How it works...
Performing Order By queries in Pig
Getting ready
How to do it...
How it works...
Performing JOINS in Pig
Getting ready
How to do it...
How it works
Replicated Joins
Skewed Joins
Merge Joins
Writing a user-defined function in Pig
Getting ready
How to do it...
How it works...
There's more...
Analyzing web log data using Pig
Getting ready
How to do it...
How it works...
Performing the Hbase operation in CLI
Getting ready
How to do it
How it works...
Performing Hbase operations in Java
Getting ready
How to do it
How it works...
Executing the MapReduce programming with an Hbase Table
Getting ready
How to do it
How it works
5. Advanced Data Analysis Using Hive
Introduction
Processing JSON data in Hive using JSON SerDe
Getting ready
How to do it...
How it works...
Processing XML data in Hive using XML SerDe
Getting ready
How to do it...
How it works
Processing Hive data in the Avro format
Getting ready
How to do it...
How it works...
Writing a user-defined function in Hive
Getting ready
How to do it
How it works...
Performing table joins in Hive
Getting ready
How to do it...
Left outer join
Right outer join
Full outer join
Left semi join
How it works...
Executing map side joins in Hive
Getting ready
How to do it...
How it works...
Performing context Ngram in Hive
Getting ready
How to do it...
How it works...
Call Data Record Analytics using Hive
Getting ready
How to do it...
How it works...
Twitter sentiment analysis using Hive
Getting ready
How to do it...
How it works
Implementing Change Data Capture using Hive
Getting ready
How to do it
How it works
Multiple table inserting using Hive
Getting ready
How to do it
How it works
6. Data Import/Export Using Sqoop and Flume
Introduction
Importing data from RDBMS to HDFS using Sqoop
Getting ready
How to do it...
How it works...
Exporting data from HDFS to RDBMS
Getting ready
How to do it...
How it works...
Using query operator in Sqoop import
Getting ready
How to do it...
How it works...
Importing data using Sqoop in compressed format
Getting ready
How to do it...
How it works...
Performing Atomic export using Sqoop
Getting ready
How to do it...
How it works...
Importing data into Hive tables using Sqoop
Getting ready
How to do it...
How it works...
Importing data into HDFS from Mainframes
Getting ready
How to do it...
How it works...
Incremental import using Sqoop
Getting ready
How to do it...
How it works...
Creating and executing Sqoop job
Getting ready
How to do it...
How it works...
Importing data from RDBMS to Hbase using Sqoop
Getting ready
How to do it...
How it works...
Importing Twitter data into HDFS using Flume
Getting ready
How to do it...
How it works
Importing data from Kafka into HDFS using Flume
Getting ready
How to do it...
How it works
Importing web logs data into HDFS using Flume
Getting ready
How to do it...
How it works...
7. Automation of Hadoop Tasks Using Oozie
Introduction
Implementing a Sqoop action job using Oozie
Getting ready
How to do it...
How it works
Implementing a Map Reduce action job using Oozie
Getting ready
How to do it...
How it works...
Implementing a Java action job using Oozie
Getting ready
How to do it
How it works
Implementing a Hive action job using Oozie
Getting ready
How to do it...
How it works...
Implementing a Pig action job using Oozie
Getting ready
How to do it...
How it works
Implementing an e-mail action job using Oozie
Getting ready
How to do it...
How it works...
Executing parallel jobs using Oozie (fork)
Getting ready
How to do it...
How it works...
Scheduling a job in Oozie
Getting ready
How to do it...
How it works...
8. Machine Learning and Predictive Analytics Using Mahout and R
Introduction
Setting up the Mahout development environment
Getting ready
How to do it...
How it works...
Creating an item-based recommendation engine using Mahout
Getting ready
How to do it...
How it works...
Creating a user-based recommendation engine using Mahout
Getting ready
How to do it...
How it works...
Predictive analytics on Bank Data using Mahout
Getting ready
How to do it...
How it works...
Text data clustering using K-Means using Mahout
Getting ready
How to do it...
How it works...
Population Data Analytics using R
Getting ready
How to do it...
How it works...
Twitter Sentiment Analytics using R
Getting ready
How to do it...
How it works...
Performing Predictive Analytics using R
Getting ready
How to do it...
How it works...
9. Integration with Apache Spark
Introduction
Running Spark standalone
Getting ready
How to do it...
How it works...
Running Spark on YARN
Getting ready
How to do it...
How it works...
Performing Olympics Athletes analytics using the Spark Shell
Getting ready
How to do it...
How it works...
Creating Twitter trending topics using Spark Streaming
Getting ready
How to do it...
How it works...
Twitter trending topics using Spark streaming
Getting ready
How to do it...
How it works...
Analyzing Parquet files using Spark
Getting ready
How to do it...
How it works...
Analyzing JSON data using Spark
Getting ready
How to do it...
How it works...
Processing graphs using Graph X
Getting ready
How to do it...
How it works...
Conducting predictive analytics using Spark MLlib
Getting ready
How to do it...
How it works...
10. Hadoop Use Cases
Introduction
Call Data Record analytics
Getting ready
How to do it...
Problem Statement
Solution
How it works...
Web log analytics
Getting ready
How to do it...
Problem statement
Solution
How it works...
Sensitive data masking and encryption using Hadoop
Getting ready
How to do it...
Problem statement
Solution
How it works...
Index

Hadoop Real-World Solutions Cookbook Second Edition

Hadoop Real-World Solutions Cookbook Second Edition

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2013

Second edition: March 2016

Production reference: 1220316

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78439-550-6

www.packtpub.com

Credits

Authors

Tanmay Deshpande

Jonathan R. Owens

Jon Lentz

Brian Femiano

Reviewer

Shashwat Shriparv

Commissioning Editor

Akram Hussain

Acquisition Editor

Manish Nainani

Content Development Editor

Sumeet Sawant

Technical Editor

Gebin George

Copy Editor

Sonia Cheema

Project Coordinator

Shweta H Birwatkar

Proofreader

Safis Editing

Indexer

Tejal Daruwale Soni

Production Coordinator

Manu Joseph

Cover Work

Manu Joseph

About the Author

Tanmay Deshpande is a Hadoop and big data evangelist. He's interested in a wide range of technologies, such as Apache Spark, Hadoop, Hive, Pig, NoSQL databases, Mahout, Sqoop, Java, cloud computing, and so on. He has vast experience in application development in various domains, such as finance, telecoms, manufacturing, security, and retail. He enjoys solving machine-learning problems and spends his time reading anything that he can get his hands on. He has a great interest in open source technologies and promotes them through his lectures. He has been invited to various computer science colleges to conduct brainstorming sessions with students on the latest technologies. Through his innovative thinking and dynamic leadership, he has successfully completed various projects. Tanmay is currently working with Schlumberger as the lead developer of big data. Before Schlumberger, Tanmay worked with Lumiata, Symantec, and Infosys.

He currently blogs at http://hadooptutorials.co.in.

Acknowledgements

This is my fourth book, and I can't thank the Almighty enough, without whom this wouldn't have been possible. I would like to take this opportunity to thank my wife, Sneha, my parents, Avinash and Manisha Deshpande, and my brother, Sakalya Deshpande, for being with me through thick and thin. Without you, I am nothing!

I would like to take this opportunity to thank my colleagues, friends, and family for appreciating my work and making it a grand success so far. I'm truly blessed to have each one of you in my life.

I am thankful to the authors of the first edition of this book, Jonathan R. Owens, Brian Femiano, and Jon Lentz, for setting the stage for me, and I hope this effort lives up to the expectations you set in the first edition. I am also thankful to each person in Packt Publishing who has worked to make this book happen! You guys are family to me!

Above all, I am thankful to my readers for their love, appreciation, and criticism, and I assure you that I have tried to give you my best. Hope you enjoy this book! Happy learning!

About the Reviewer

Shashwat Shriparv has over six years of experience in the IT industry, including more than four years in Big Data technologies. He holds a master's degree in computer applications. He has experience with technologies such as Hadoop, HBase, Hive, Pig, Flume, Sqoop, Mongo, Cassandra, Java, C#, Linux, scripting, PHP, C++, C, and web technologies, and he has handled various real-life Big Data use cases as a developer and administrator.

He has worked with companies such as CDAC, Genilok, HCL, and UIDAI (Aadhaar), and he is currently working with CenturyLink Cognilytics. He is the author of Learning HBase, Packt Publishing, and a reviewer of the Pig Design Patterns book, also from Packt Publishing.

I want to acknowledge everyone I know.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why Subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Preface

Big Data is the need of the day. Many organizations are producing huge amounts of data every day. With the advancement of Hadoop-like tools, it has become easier for everyone to solve Big Data problems with great efficiency and at a very low cost. When you are handling such a massive amount of data, even a small mistake can cost you dearly in terms of performance and storage. It's very important to learn the best practices of handling such tools before you start building an enterprise Big Data Warehouse, which will be greatly advantageous in making your project successful.

This book gives you insights into learning and mastering Big Data through recipes. It not only explores a majority of the Big Data tools that are currently being used in the market, but also provides the best practices for implementing them. The book provides recipes based on the latest versions of Apache Hadoop 2.X, YARN, Hive, Pig, Sqoop, Flume, Apache Spark, Mahout, and many more ecosystem tools. This real-world-solutions cookbook is packed with handy recipes that you can apply to your own everyday issues. Each chapter talks about recipes in great detail, and these can be referred to easily. The book provides detailed practice on the latest technologies, such as YARN and Apache Spark. This guide is an invaluable tutorial if you are planning to implement a Big Data Warehouse for your business.

What this book covers

Chapter 1, Getting Started with Hadoop 2.x, introduces you to the installation details needed for single and multi-node Hadoop clusters. It also contains the recipes that will help you understand various important cluster management techniques, such as decommissioning, benchmarking, and so on.

Chapter 2, Exploring HDFS, provides you with hands-on recipes to manage and maintain the Hadoop Distributed File System (HDFS) in an efficient way. You will learn some important practices, such as transparent encryption, saving data in a compressed format, recycling deleted data from HDFS, and so on.

Chapter 3, Mastering Map Reduce Programs, enlightens you about very important recipes for Map Reduce programming, which take you beyond the simple Word Count program. You will learn about various customization techniques in detail.

Chapter 4, Data Analysis Using Hive, Pig, and Hbase, takes you to the analytical world of Hive, Pig, and Hbase. This chapter talks about the use of various file formats, such as RC, ORC, Parquet, and so on. You will also be introduced to the Hbase NoSQL database.

Chapter 5, Advanced Data Analysis Using Hive, provides insights on the usage of serializers and deserializers (SerDe) in Hive for JSON and XML data operations. This chapter will provide you with a detailed explanation for Twitter sentiment analysis using Hive.

Chapter 6, Data Import/Export Using Sqoop and Flume, covers various recipes to import and export data from sources, such as RDBMS, Kafka, web log servers, and so on, using Sqoop and Flume.

Chapter 7, Automation of Hadoop Tasks Using Oozie, introduces you to a very rich scheduling tool called Oozie, which will help you build automated production-ready Big Data applications.

Chapter 8, Machine Learning and Predictive Analytics Using Mahout and R, gives you an end-to-end implementation of predictive analytics applications using Mahout and R. It covers the various visualization options available in R as well.

Chapter 9, Integration with Apache Spark, introduces you to a very important distributed computing framework called Apache Spark. It covers basic to advanced topics such as installation, Spark application development and execution, usage of the Spark Machine Learning Library, MLlib, and graph processing using Spark.

Chapter 10, Hadoop Use Cases, provides you with end-to-end implementations of Hadoop use cases from various domains, such as telecom, finance, e-commerce, and so on.

What you need for this book

To get started with this hands-on recipe-driven book, you should have a laptop/desktop with any OS, such as Windows, Linux, or Mac. It's good to have an IDE, such as Eclipse or IntelliJ, and of course, you need a lot of enthusiasm to learn.

Who this book is for

This book is for those of you who have basic knowledge of Big Data systems and want to advance your knowledge with hands-on recipes.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/HadoopRealWorldSolutionsCookbookSecondEdition_ColoredImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. Getting Started with Hadoop 2.X

This chapter covers the following topics:

  • Installing a single-node Hadoop cluster
  • Installing a multi-node Hadoop cluster
  • Adding new nodes to existing Hadoop clusters
  • Executing the balancer command for uniform data distribution
  • Entering and exiting from the safe mode in a Hadoop cluster
  • Decommissioning DataNodes
  • Performing benchmarking on a Hadoop cluster

Introduction

Hadoop has been the primary platform for many people who deal with big data problems. It is the heart of big data. Hadoop was developed back in 2003 and 2004, when Google published its research papers on the Google File System (GFS) and Map Reduce. Hadoop was structured around the crux of these research papers, from which it derived its shape. With the advancement of the Internet and social media, people slowly started realizing the power that Hadoop had, and it soon became the top platform used to handle big data. With a lot of hard work from dedicated contributors and open source groups, Hadoop 1.0 was released, and the IT industry welcomed it with open arms.

A lot of companies started using Hadoop as the primary platform for their Data Warehousing and Extract-Transform-Load (ETL) needs. As they deployed thousands of nodes, they ran into scalability issues beyond clusters of around 4,000 nodes, because JobTracker was not able to handle that many TaskTrackers. There was also a need for high availability in order to make sure that clusters were reliable to use. This gave birth to Hadoop 2.0.

In this introductory chapter, we are going to learn interesting recipes such as installing a single/multi-node Hadoop 2.0 cluster, its benchmarking, adding new nodes to existing clusters, and so on. So, let's get started.

Installing a single-node Hadoop Cluster

In this recipe, we are going to learn how to install a single-node Hadoop cluster, which can be used for development and testing.

Getting ready

To install Hadoop, you need to have a machine with the UNIX operating system installed on it. You can choose from any well known UNIX OS such as Red Hat, CentOS, Ubuntu, Fedora, and Amazon Linux (this is in case you are using Amazon Web Service instances).

Here, we will be using the Ubuntu distribution for demonstration purposes.

How to do it...

Let's start installing Hadoop:

First of all, you need to download the required installers from the Internet. Here, we need to download Java and Hadoop installers. The following are the links to do this:

For the Java download, choose the latest version of the available JDK from http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html.

You can also use OpenJDK instead of the Oracle JDK.

For the Hadoop 2.7 Download, go to

http://www.eu.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz.

We will first install Java. Here, I am using /usr/local as the installation directory and the root user for all installations. You can choose a directory of your choice.

Extract tar.gz like this:

tar -xzf java-7-oracle.tar.gz

Rename the extracted folder to the shorter name java instead of java-7-oracle. Doing this will help you remember the folder name easily.

Alternatively, you can install Java using the apt-get package manager if your machine is connected to the Internet:

sudo apt-get update
sudo apt-get install openjdk-7-jdk
Similarly, we will extract and configure Hadoop. We will also rename the extracted folder for easier accessibility. Here, we will extract Hadoop to path /usr/local:
tar -xzf hadoop-2.7.0.tar.gz
mv hadoop-2.7.0 hadoop
Next, in order to use Java and Hadoop from any folder, we would need to add these paths to the ~/.bashrc file. The contents of the file get executed every time a user logs in:
cd ~
vi .bashrc

Once the file is open, append the following environment variable settings to it. These variables are used by Java and Hadoop at runtime:

export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
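If you would rather apply these settings without opening a new terminal, you can reload the file in the current session; this shortcut is a convenience note and not part of the original steps:

source ~/.bashrc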
In order to verify whether our installation is perfect, close the terminal and restart it again. Also, check whether the Java and Hadoop versions can be seen:
$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) Server VM (build 24.45-b08, mixed mode)
$ hadoop version
Hadoop 2.7.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r d4c8d4d4d203c934e8074b31289a28724c0842cf
Compiled by jenkins on 2015-04-10T18:40Z
Compiled with protoc 2.5.0
From source with checksum a9e90912c37a35c3195d23951fd18f

This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.7.0.jar.

Now that Hadoop and Java are installed and verified, we need to install ssh (Secure Shell) if it's not already available by default. If you are connected to the Internet, execute the following commands. SSH is used to secure data transfers between nodes:
sudo apt-get install openssh-client
sudo apt-get install openssh-server
Once the ssh installation is done, we need to complete the ssh configuration in order to enable passwordless access to remote hosts. Note that even though we are installing Hadoop on a single node, we need to perform an ssh configuration in order to securely access the localhost.

First of all, we need to generate public and private keys by executing the following command:

ssh-keygen -t rsa -P ""

This will generate the private and public keys, by default, in the $HOME/.ssh folder. In order to provide passwordless access, we need to append the public key to the authorized_keys file:

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
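If ssh still asks for a password after this, the permissions on the .ssh directory and the authorized_keys file are a common culprit. The following tightening step is an addition on my part and not part of the original recipe, but it usually resolves the issue:

chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys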

Let's check whether the ssh configuration is okay. To test it, connect to the localhost like this:

ssh localhost

This will prompt you to confirm whether to add this connection to the known_hosts file. Type yes, and you should be connected over ssh without being prompted for a password.

Once the ssh configuration is done and verified, we need to configure Hadoop. The Hadoop configuration begins with adding various configuration parameters to the following default files:
  • hadoop-env.sh: This is where we need to perform the Java environment variable configuration.
  • core-site.xml: This is where we need to perform NameNode-related configurations.
  • yarn-site.xml: This is where we need to perform configurations related to Yet Another Resource Negotiator (YARN).
  • mapred-site.xml: This is where we need to set the map reduce processing engine to YARN.
  • hdfs-site.xml: This is where we need to perform configurations related to Hadoop Distributed File System (HDFS).

These configuration files can be found in the /usr/local/hadoop/etc/hadoop folder. If you install Hadoop as the root user, you will have access to edit these files, but if not, you will first need to get access to this folder before editing.

So, let's take a look at the configurations one by one.

Configure hadoop-env.sh and update the Java path like this:

export JAVA_HOME=/usr/local/java

Edit core-site.xml, and add the host and port on which you wish to install NameNode. As this is a single-node installation, we need to add the localhost here:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000/</value>
  </property>
</configuration>
Edit yarn-site.xml and add the following properties to it:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

The yarn.nodemanager.aux-services property tells NodeManager that an auxiliary service named mapreduce_shuffle is present and needs to be implemented. The second property tells NodeManager which class it should use to implement that shuffle auxiliary service. This specific configuration is needed because a MapReduce job involves the shuffling of key-value pairs.

Next, edit mapred-site.xml to set the map reduce processing engine as YARN:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Edit hdfs-site.xml to set the folder paths that can be used by NameNode and DataNode, along with the block replication factor:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/store/hdfs/datanode</value>
  </property>
</configuration>
I am also setting the HDFS block replication factor to 1 as this is a single node cluster installation.

We also need to make sure that we create the previously mentioned folders and change their ownership to suit the current user. To do this, you can choose a folder path of your own choice:

sudo mkdir -p /usr/local/store/hdfs/namenode
sudo mkdir -p /usr/local/store/hdfs/datanode
sudo chown root:root -R /usr/local/store
Now, it's time to format namenode so that it creates the required folder structure by default:
hadoop namenode -format
The final step involves starting Hadoop daemons; here, we will first execute two scripts to start HDFS daemons and then start YARN daemons:
/usr/local/hadoop/sbin/start-dfs.sh

This will start NameNode, the secondary NameNode, and then DataNode daemons:

/usr/local/hadoop/sbin/start-yarn.sh

This will start NodeManager and ResourceManager. You can execute the jps command to take a look at the running daemons:

$ jps
2184 DataNode
2765 NodeManager
2835 Jps
2403 SecondaryNameNode
2025 NameNode
2606 ResourceManager

We can also access the web portals for HDFS and YARN by accessing the following URLs:

For HDFS: http://<hostname>:50070/
For YARN: http://<hostname>:8088/
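When you want to shut the cluster down cleanly, the matching stop scripts can be used; this note is an addition for convenience and assumes the same /usr/local/hadoop install path used above:

/usr/local/hadoop/sbin/stop-yarn.sh
/usr/local/hadoop/sbin/stop-dfs.sh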

How it works...

Hadoop 2.0 was substantially reworked in order to solve issues of scalability and high availability. Earlier, in Hadoop 1.0, Map Reduce was the only means of processing data stored in HDFS. With the introduction of YARN, Map Reduce is now just one of the ways of processing data on Hadoop. Here is a pictorial difference between Hadoop 1.x and Hadoop 2.x:

Now, let's try to understand how HDFS and YARN works.

Hadoop Distributed File System (HDFS)

HDFS is a redundant, reliable storage layer for Hadoop. It consists of three important parts: NameNode, the secondary NameNode, and DataNodes. When a file needs to be processed on Hadoop, it first needs to be saved on HDFS. HDFS distributes the file across the data nodes in chunks of 64/128 MB data blocks. The blocks are replicated across data nodes for reliability. NameNode stores the metadata about the blocks and their replicas. Periodically, the metadata is backed up on the secondary NameNode; by default, this check runs every 60 seconds. We can modify this by setting the dfs.namenode.checkpoint.check.period property in hdfs-site.xml.
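To make the properties mentioned above concrete, a minimal hdfs-site.xml fragment might look like the following sketch. The dfs.blocksize property name is my addition (it is not discussed in this recipe), and the values shown are purely illustrative:

<configuration>
  <!-- Block size used for newly written files; 134217728 bytes = 128 MB -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <!-- How often (in seconds) the secondary NameNode checks whether a checkpoint is due -->
  <property>
    <name>dfs.namenode.checkpoint.check.period</name>
    <value>60</value>
  </property>
</configuration>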

Yet Another Resource Negotiator (YARN)

YARN has been developed to address scalability issues and to provide better management of jobs in Hadoop, and to date it has proved itself to be an effective solution. It is responsible for the management of the resources available in a cluster. It consists of two important components: ResourceManager (the master) and NodeManager (the worker). NodeManager provides a node-level view of the cluster, while ResourceManager takes a cluster-level view. When an application is submitted by an application client, the following things happen:

1. The application talks to ResourceManager and provides details about itself.
2. ResourceManager makes a container request on behalf of the application to one of the worker nodes, and ApplicationMaster starts running within that container.
3. ApplicationMaster then makes subsequent requests for containers to execute tasks on other nodes.
4. These tasks then take care of all the communication. Once all the tasks are complete, the containers are deallocated and ApplicationMaster exits.
5. After this, the application client also exits.
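To see this flow in action, you can submit one of the example MapReduce jobs bundled with the Hadoop distribution and watch the application's containers in the ResourceManager UI on port 8088. The jar path below assumes the Hadoop 2.7.0 tarball extracted to /usr/local/hadoop, as earlier in this chapter:

yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar pi 2 10

This runs the bundled pi estimator with 2 map tasks and 10 samples per map, which is enough to exercise the full ApplicationMaster life cycle described above.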

There's more

Now that your single-node Hadoop cluster is up and running, you can try some HDFS file operations on it, such as creating a directory, copying a file from a local machine to HDFS, and so on. Here are some sample commands.

To list all the files in the HDFS root directory, take a look at this:

hadoop fs -ls /

To create a new directory, take a look at this:

hadoop fs -mkdir /input

To copy a file from the local machine to HDFS, take a look at this:

hadoop fs -copyFromLocal /usr/local/hadoop/LICENSE.txt /input
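To confirm that the copy worked, you can list the directory and read the file back; this quick check is an addition to the original list of commands:

hadoop fs -ls /input
hadoop fs -cat /input/LICENSE.txt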

In order to access all the command options that are available, go to https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html.