Description

This book is a guide for developers and engineers to use Hadoop and Hive for scalable big data applications. It covers reading, writing, and managing large datasets with Hive and provides a concise introduction to Apache Hadoop and Hive, detailing how the two work together to simplify development. Through clear examples, the book explains the logic, code, and configurations needed for building successful distributed applications.
The book starts with an introduction to big data and Apache Hadoop fundamentals. It then covers the Hadoop Distributed Filesystem and how to get started with Hadoop. The journey continues with interfaces for accessing HDFS files, resource management with Yet Another Resource Negotiator (YARN), and MapReduce for data processing. The book also explores Hive architecture, storage types, and the Hive query language.
Mastering these concepts is vital for creating scalable big data solutions. This book ensures a smooth transition from novice to proficient Hadoop and Hive user, providing practical skills and comprehensive knowledge. By the end, readers will be able to set up, configure, and optimize Hadoop, utilize Hive for data management, and effectively solve big data challenges.




BIG DATA USING HADOOP™ AND HIVE™

LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY

By purchasing or using this book (the “Work”), you agree that this license grants permission to use the contents contained herein, but does not give you the right of ownership to any of the textual content in the book or ownership to any of the information or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.

MERCURY LEARNING AND INFORMATION (“MLI” or “the Publisher”) and anyone involved in the creation, writing, production, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to ensure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold “as is” without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).

The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.

The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state, and might not apply to the purchaser of this product.

BIG DATA USING HADOOP™ AND HIVE™

NITIN KUMAR

MERCURY LEARNING AND INFORMATION

Dulles, Virginia

Boston, Massachusetts

New Delhi

Copyright ©2021 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.

This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.

Publisher: David Pallai
MERCURY LEARNING AND INFORMATION
22841 Quicksilver Drive
Dulles, VA
[email protected]

Nitin Kumar. Big Data Using Hadoop™ and Hive™.
ISBN: 978-1-68392-645-0

The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.

Library of Congress Control Number: 2021934303

21 22 23 3 2 1

This book is printed on acid-free paper in the United States of America.

Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).

All of our titles are available in digital format at academiccourseware.com and other digital vendors. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the book, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.

To my wife Sarika, my children, Shaurya and Irima, and to my parents

CONTENTS

Preface

Chapter 1: Big Data

Big Data Challenges for Organizations

How We Are Using Big Data

Big Data: An Opportunity

Hadoop: A Big Data Solution

Big Data in the Real World

Chapter 2: What is Apache Hadoop?

Hadoop History

Hadoop Benefits

Hadoop’s Ecosystem: Components

Hadoop Core Component Architecture

Summary

Chapter 3: The Hadoop Distributed Filesystem

HDFS Core Components

HDFS Architecture

Data Replication

Data Locality

Data Storage

Failure Handling on the HDFS

Erasure Coding (EC)

HDFS Disk Balancer

HDFS Federation

HDFS Architecture and Its Challenges

Hadoop Federation: A Rescue

Benefits of the HDFS Federation

HDFS Processes: Read and Write

Failure Handling During Read and Write

Chapter 4: Getting Started with Hadoop

Hadoop Configuration

Command-Line Interface

Generic Filesystem CLI Command

Distributed Copy (distcp)

Hadoop’s Other User Commands

HDFS Permissions

HDFS Quotas Guide

HDFS Short-Circuit Local Reads

Offline Edits Viewer Guide

Offline Image Viewer Guide

Chapter 5: Interfaces to Access HDFS Files

WebHDFS REST API

FileSystem URIs

Error Responses

Authentication

Java FileSystem API

URI and Path

FSDataInputStream

FSDataOutputStream

FileStatus

Directories

Delete Files

C API libhdfs

Chapter 6: Yet Another Resource Negotiator

YARN Architecture

YARN Process Flow

YARN Failures

YARN High Availability

YARN Schedulers

The Fair Scheduler

The Capacity Scheduler

The YARN Timeline Server

Application Timeline Server (ATS)

ATS Data Model Structure

ATS V2

YARN Federation

Chapter 7: MapReduce

MapReduce Process

Key Features

Different Phases in the MapReduce Process

MapReduce Architecture

MapReduce Sample Program

MapReduce Composite Key Operation

Mapper Program

MapReduce Configuration

Chapter 8: Hive

Hive History

Hive Query

Data Storage

Data Model

Complex Data Types

Hive DDL (Data Definition Language)

Tables

View

Partition

Bucketing

Hive Architecture

Serialization/Deserialization (SerDe)

Metastore

Query Compiler

HiveServer2

Chapter 9: Getting Started with Hive

Hive Set-up

Hive Configuration Settings

Loading and Inserting Data into Tables

Insert from a Select Query

Load Table Data into File

Create and Load Data into a Table

Hive Transactions

Enable Transactions

Insert Values

Update

Delete

Merge

Locks

Hive Select Query

Select Basic Query

Hive QL File

Hive Select on Complex Datatypes

Order By and Sort By

Distribute By and Cluster By

Group By and Having

Built-in Aggregate Functions

Enhanced Aggregation

Table-Generating Functions

Built-In Utility Functions

Collection Functions

Date Functions

Conditional Functions

String Functions

Hive Query Language-Join

Chapter 10: File Format

File Format Characteristics

Columnar Format

Schema Evolution

Splittable

Compression

File Formats

RC (Row-Columnar) File Input Format

Optimized Row Columnar (ORC) File Format

Parquet

File Format Comparisons

ORC vs. Parquet

Chapter 11: Data Compression

Data Compression Benefits

Data Compression in Hadoop

Splitting

Compression Codec

Data Compressions

References

Index

PREFACE

Big Data Using Hadoop and Hive is the essential guide for developers, architects, engineers, and anyone who wants to start leveraging Hadoop to build distributed, scalable, and concurrent applications. It is a concise guide to getting started with Hadoop and Hive. It provides an overall understanding of Hadoop and how it works, while giving sample code to speed up development with minimal effort. It relies on simple concepts and examples, as they are likely to be the best teaching aids. It explains the logic, code, and configurations needed to build a successful, distributed, concurrent application, and the reasons behind those decisions.

Intended Audience

The primary audience of this book is developers who would like to start development using Hadoop and Hive, architects who want an architectural understanding of Hadoop technologies, and application designers looking for a detailed understanding of what Hadoop can do for them.

What this book covers

Big Data Using Hadoop and Hive covers Hadoop 3 exclusively. The Hadoop 3.3.x release series is the current active release series and contains the most stable version of Hadoop. This edition also covers the latest Hive 3.x version with its new features. The book contains four main parts: Chapters 1 through 5 introduce Big Data, Hadoop, and the architecture around it, including the HDFS. Chapters 6 and 7 cover resource management with YARN and data processing with MapReduce. Chapters 8 and 9 cover Hive and its architecture. Chapters 10 and 11 discuss the various file formats and compression methodologies used in the Hadoop environment.

Chapter 1: This chapter explains Big Data and the difficulties organizations face with it. It explores the key benefits of Big Data and how Big Data is utilized. It also sheds light on how Hadoop solves the challenges of Big Data.

Chapter 2: This chapter covers Hadoop history and answers questions regarding the key benefits of Apache Hadoop. It also provides a brief introduction to the Hadoop ecosystem and explains Hadoop architecture.

Chapter 3: This chapter covers the Hadoop Distributed File System (HDFS) and its architecture. It also includes the HDFS core components and how Hadoop reads/writes with the HDFS.

Chapter 4: This chapter explains the basic Hadoop set-up and essential Hadoop configuration and optimization. It also covers the command-line interface to access the HDFS.

Chapter 5: This chapter covers the Java API and the WebHDFS REST API used to access HDFS files.

Chapter 6: This chapter explains the challenges of Hadoop v1 and how YARN solves them. It covers YARN's basic architecture and various failover scenarios on YARN. It also describes the YARN Timeline Server, including the ATS v2 introduced in Hadoop v3.

Chapter 7: This chapter provides a detailed understanding of MapReduce jobs and how the Hadoop core framework executes them.

Chapter 8: This chapter covers Hive and its architecture with various storage types. It also includes the Hive query language.

Chapter 9: This chapter covers the step-by-step Hive set-up. It includes different types of tables and explains how to query the table data. It also covers various joins and built-in utility functions.

Chapters 10 and 11: These chapters explain various file formats, such as Avro, ORC, and Parquet. They also cover different compression mechanisms.

Nitin Kumar
March 2021

CHAPTER 1

BIG DATA

There has been explosive growth in social interactions, including online relationships, collaboration, information sharing, and feedback. Social networks such as Twitter, Facebook, LinkedIn, YouTube, and WhatsApp connect millions of users, allowing them to share vast amounts of information. Consumers spend more time on the Internet, which enables organizations to collect massive amounts of data. Every new generation of devices is connected and sharing massive amounts of sensor data. This data can be anything: user transaction logs, sales logs, user history, machine-generated data, video streams, and sensor data.

These large amounts (terabytes, petabytes, and exabytes) of structured and unstructured data are known as Big Data. Today, businesses have endless opportunities to transform these structured and unstructured data into practical information for decision-making, predictive analysis, recommendation systems, and analytic platforms.

Organizations collect and transform data into actionable insights that can be used to contribute to revenue generation. There is an enormous demand for professionals in these areas to build models, extract features and patterns, and perform statistical analysis to transform the data into meaningful information.

Big Data has spawned various fields, such as artificial intelligence (AI), machine learning, and deep learning, which apply learned patterns to predict outcomes and make recommendations. For example, video streaming services hold your interest by using your previous activity to recommend similar videos for future viewing. Online shopping sites provide recommendations based on your interests, trends, and events.

BIG DATA CHALLENGES FOR ORGANIZATIONS

Internet users do everything online, from business communications to shopping and social networking. Billions of connected devices and embedded systems create, collect, and share a wealth of data every day, all over the world. According to IDC, 75% of the world’s population will be interacting with online data every day by 2025, and the amount of digital data is expected to more than double.

Organizations are collecting large amounts of data but struggling to manage and utilize it. This creates a new challenge: how to effectively manage, process, and analyze vast amounts of data. The challenge is not only extracting value from the data but also managing its ever-expanding volume.

Traditional RDBMS-based databases were not capable of handling this massive amount of unstructured data, so organizations were not able to use it. Data archived on tapes is expensive to recover for further use. There was a need for technologies that could store and process massive amounts of data.

HOW WE ARE USING BIG DATA

Big Data isn’t only about the quantity of information, but how we utilize it to improve statistical and computational processes. Organizations are adopting new technologies to visualize new patterns that can improve their strategic decision-making. Big Data helps organizations drive innovation by providing new insights about their customers. An organization can build an understanding of a customer from that customer’s transaction history and provide better recommendations about products or services. Behavioral analytics is particularly important: it visualizes business sentiment and accelerates business outcomes.

BIG DATA: AN OPPORTUNITY

Big Data has brought about a new era in the economy, which is all about information and converting it into revenue. Organizations need to understand how to deal with all this data. Their growth depends on the quality of their predictive mechanisms and how they leverage machine learning. In one survey, 97% of companies responded that they plan to increase their spending on analytics, with estimates that companies will drive $432 billion in IT spending through 2025. By 2023, the demand for jobs related to Big Data is estimated to soar by 38%.

HADOOP: A BIG DATA SOLUTION

Astronomical amounts of unstructured data are a challenge to store, process, and extract information from. Traditional RDBMS systems are not capable of handling such large data sets.

Apache Hadoop is the solution for such types of data sets. Hadoop is an open-source project started by Doug Cutting. It was initiated based on papers published by Google, describing how their engineers were dealing with the challenge of storing and processing massive amounts of data. Yahoo! and other IT companies have driven the development of Hadoop.

BIG DATA IN THE REAL WORLD

Hadoop provides highly scalable, reliable, and distributed data processing computing platforms. Below are some of the critical business use cases for Big Data.

•Large Data Transformation: Big organizations constantly face the challenge of processing and moving extensive data from one source to another.

•Market Trends: Organizations need to analyze market trends based on the feedback they receive, and Big Data analysis provides an optimized view of those trends.

•Machine Learning: Data can be obtained from different sources. Developers can build a system that can learn from data, such as Artificial Intelligence (AI), speech recognition, or understanding human behaviors.

•Making Recommendations: Organizations utilize user-buying patterns to identify recommendations for proactive sales.

•Decision-making: Organizations use data to predict future outcomes and drive decisions.

•Campaigning: Big Data processing provides a way to customize the marketing campaigns based on user activity and behavior.

•User Behavior: Organizations are interested in how their customers use their product, and Big Data provides a window to this behavior.

•Predictive Mechanism: Organizations collect extensive data and use it to extrapolate and make predictions.

CHAPTER 2

WHAT IS APACHE HADOOP?

Hadoop provides the capabilities to store a massive amount of data in a distributed environment and process it effectively. It’s a distributed data processing system that supports distributed file systems, and it offers a way to parallelize and execute programs on a cluster of machines. It can be installed on clusters built from large numbers of commodity hardware machines, which optimizes the overall solution cost.

Apache Hadoop has been adopted by technology giants such as Yahoo, Facebook, Twitter, and LinkedIn to address their data needs, and it’s making inroads across all industrial sectors.

The Apache Hadoop software library is a distributed framework for processing massive data sets in batches and streams across clusters. It scales from a few servers to thousands of machines to support large data set computation, with a robust failure-handling mechanism. Each machine in a Hadoop cluster offers local computation and storage, as well as failure detection and handling.

HADOOP HISTORY

Doug Cutting originally developed Nutch, an open-source Web search engine, as part of the Lucene project. However, it was a challenge to scale this solution up in a distributed environment. The Nutch Distributed Filesystem (NDFS) was based on the Google File System paper published in 2003 [http://research.google.com/archive/gfs.html].

In 2005, MapReduce with NDFS was implemented in Nutch. Later, Doug Cutting joined Yahoo!, which provided a dedicated team to build Hadoop, based on Nutch’s NDFS, on top of Lucene. Yahoo! later announced that a 10,000-core Hadoop cluster generated its production search index. In 2008, Hadoop became a top-level open-source project of the Apache Software Foundation. In 2009, Hadoop successfully sorted a petabyte of data in less than 17 hours while handling billions of searches. In 2011, Doug Cutting joined Cloudera and spread Hadoop to other organizations. In the same year, the Apache foundation released Apache Hadoop version 1.0; Hadoop version 2 was released in 2013, and Apache Hadoop version 3.0 was released in 2017.

HADOOP BENEFITS

Hadoop is one of the best solutions for the distributed storage and processing of vast datasets (terabytes or petabytes of information) in a cluster environment. It is fault-tolerant, scalable, and easy to use.

It divides files into small parts and distributes them to multiple parallel processing nodes to accelerate processing time. Hadoop leverages clusters of machines to provide ample storage and processing power at a price businesses can afford. It provides a scalable and reliable mechanism for processing large amounts of data over cluster environments, and it offers novel analytic techniques that enable the sophisticated processing of multi-structured data. Data is replicated across multiple nodes for reliability and higher availability. Hadoop processes data on the local node before aggregating it on remote nodes, which minimizes bandwidth bottlenecks.

The following are some of the salient features of Hadoop:

•Massive data processing: Apache Hadoop can process massive amounts of data.

•Cost: Apache Hadoop is a low-cost solution for high-performance computing applications.

•Easily scalable: Apache Hadoop distributes data in small chunks across the cluster, where they can be processed independently. It provides the flexibility to scale storage without impacting the application.

•Distributed processing: Hadoop breaks large datasets into smaller, fixed-size chunks and distributes those chunks across the cluster. It allows users to access and process data in a distributed environment.

•Reliability: Reliability is a big challenge in a distributed cluster environment. Hadoop was designed to detect failures and retry processing on other nodes using replicated data.

•Separation of concerns: Hadoop separates business-logic processing from the infrastructure concerns it handles itself, such as fault tolerance, high availability, parallel processing, and storage.

•Fast processing: It breaks the data processing task into multiple smaller jobs, which run in parallel in distributed cluster environments to reduce the overall data processing time.

•Fault tolerance & high availability: HDFS replicates data across the cluster, which gives it the power to handle failures and protect against data loss in case of a node crash. If any node fails during processing, HDFS efficiently serves the data from replicas, and since each task is independent, MapReduce simply re-processes the failed task.

•Highly configurable: Apache Hadoop is highly configurable and ships with sensible defaults that can be overridden per site or per job (see the sketch after this list).

•Status monitoring: Apache Hadoop includes built-in Web servers that make it easy to monitor the clusters and jobs through a Web application.
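To make the configurability point above concrete, here is a minimal sketch of reading and overriding Hadoop configuration programmatically. It assumes only that the Hadoop client libraries are on the classpath; the property names shown (fs.defaultFS, dfs.replication) are standard Hadoop keys.

import org.apache.hadoop.conf.Configuration;

public class ConfigDemo {
    public static void main(String[] args) {
        // new Configuration() loads the bundled defaults (core-default.xml)
        // plus any site overrides (core-site.xml) found on the classpath.
        Configuration conf = new Configuration();
        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));

        // Defaults can also be overridden programmatically; such an override
        // applies only to the client or job that sets it.
        conf.set("dfs.replication", "2");
        System.out.println("dfs.replication = " + conf.get("dfs.replication"));
    }
}

In practice, most settings live in the site XML files (core-site.xml, hdfs-site.xml); programmatic overrides are typically reserved for per-job tuning.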

HADOOP’S ECOSYSTEM: COMPONENTS

Various components have been developed for Apache Hadoop to assist with Big Data solutions, and these form the Hadoop ecosystem.

The Hadoop Distributed File System (HDFS) is the core component for storing and accessing large datasets. MapReduce is the heart of Hadoop, and provides the capability to process files stored in the HDFS. HBase assists with real-time data processing. Hive allows data to be queried without writing MapReduce programs. New components are always being added to the Apache Hadoop ecosystem.

We can visualize the Hadoop Ecosystem as a Hadoop platform, a combination of various components stitched together to provide a Big Data solution.

•HDFS: The Hadoop Distributed File System partitions data and stores it across cluster nodes. HDFS is used to store a massive amount of data over a distributed environment. HDFS stores file metadata and file data separately. Data stored in the HDFS is written once but read many times. It provides a base for other tools, such as Hive, Pig, HBase, and MapReduce, to process data.

•YARN: Introduced in Hadoop 2 and available in later versions, YARN decouples the functionalities of resource management and job scheduling/monitoring into separate daemons.

FIGURE 2.1 Key Hadoop ecosystem components and the layered architecture

•MapReduce: MapReduce is Hadoop’s key component for processing a massive amount of data in parallel. It provides mechanisms to handle large datasets as batches in highly reliable, available, and fault-tolerant environments. MapReduce breaks the data into independent parts, which are processed in parallel by map tasks that pass <key, value> pairs to reducers, which aggregate them before storing the results in the HDFS (see the word-count sketch after this list).

•Apache Spark: Apache Spark provides fast in-memory data processing for the Hadoop environment, as well as support for a wide range of processing, including ETL, machine learning, stream processing, and graph computation.

•Apache Tez: Apache Tez is an alternative to MapReduce, introduced in Hadoop 2, used to process HDFS data in both batch and interactive ways based on directed acyclic graphs (DAGs).

•Zookeeper: Zookeeper is Apache Hadoop’s coordination service designed to manage Hadoop operations.

•Oozie: Oozie is a workflow system for MapReduce designed to manage multiple MapReduce job flows.

•Pig: Pig provides a scripting language (Pig Latin) to analyze datasets. Pig Latin makes it easy to create a sequence of MapReduce programs.

•Hive: Apache Hive provides a SQL-like language to retrieve data stored in Hadoop. Developers can write a SQL-like query, which gets translated into MapReduce jobs in Hadoop. Hive is most useful to developers who are already familiar with SQL (a small JDBC sketch follows the ecosystem lists below).

•Sqoop: Sqoop is an integration framework used to transfer data from relational databases to Hadoop and vice versa. Sqoop uses the database to describe the schema and MapReduce for parallel operation and fault tolerance.

•Flume: Apache Flume collects, aggregates, and transfers extensive data from multiple machines to HDFS. It provides a distributed, reliable and highly available service to transfer data from various devices to Hadoop.
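To illustrate the map-and-reduce flow described in the MapReduce bullet above, here is a minimal sketch of the classic word-count job, written against the org.apache.hadoop.mapreduce API. It assumes the Hadoop client libraries are on the classpath and that the input and output HDFS paths are supplied on the command line; it is a teaching sketch, not a production template.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: emit a <word, 1> pair for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: aggregate the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The mapper emits one <word, 1> pair per token, the combiner/reducer sums the counts per word, and the framework handles the partitioning, shuffling, and sorting between the two phases.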

Beyond the core components, the Apache Hadoop ecosystem includes other tools to address particular needs, as explained below.

•Whirr: Whirr is a set of libraries for running a Hadoop cluster on top of Amazon EC2, Rackspace, or other virtual infrastructure.

•Mahout: This is a machine learning and data-mining library that provides MapReduce implementations of popular algorithms used for analyzing and modeling data.

•BigTop: This is a framework for the packaging and testing of Hadoop’s sub-projects and related components.

•HBase: HBase is a distributed, column-oriented database built on top of the HDFS that provides real-time read and write access to data stored in HDFS.

•HCatalog: HCatalog is a metadata abstraction layer for referencing data without using the underlying file names or formats. It insulates users and scripts from how and where the data is physically stored.

•Ambari: A Web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, including support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.

•Avro™: A component for data serialization.
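As a taste of the Hive bullet above, the following sketch issues a SQL-like query through HiveServer2’s JDBC interface. It assumes a HiveServer2 instance listening on localhost:10000 and a hypothetical table named page_views; both are placeholders to replace with your own cluster and schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // Needed only on pre-JDBC-4 setups; modern drivers auto-register.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        String url = "jdbc:hive2://localhost:10000/default"; // assumed HiveServer2 endpoint
        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT country, COUNT(*) AS views "
                 + "FROM page_views GROUP BY country")) { // hypothetical table
            while (rs.next()) {
                System.out.println(rs.getString("country") + "\t" + rs.getLong("views"));
            }
        }
    }
}

Behind this simple SQL-like statement, Hive compiles the query into one or more jobs on the cluster, which is exactly the translation step described above.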

HADOOP CORE COMPONENT ARCHITECTURE

Hadoop stores a massive amount of data and provides a framework to process and manage it. Hadoop stores datasets in the HDFS cluster environment, which breaks files into multiple fixed-size blocks and replicates them across the cluster. Its replication and clustering capabilities enhance the reliability and scalability of extensive data storage management.

FIGURE 2.2 Hadoop’s core components

Hadoop works on a multi-layer architecture where it receives data from various sources, processes it, and then optimizes it for further use. Hadoop uses batch processes (MapReduce, Hive, and Pig) to process large datasets and store the aggregated results in the HDFS or NoSQL stores, or export them to OLTP systems. The source data is transferred to the Hadoop HDFS via various mechanisms, such as FTP, Flume, Sqoop, messaging, logs, and applications. Oozie is a workflow system that manages and schedules Hadoop jobs. The HDFS stores data across the cluster (DataNodes), while the NameNode maintains the metadata; the fsimage and edit-log files persist and synchronize the namespace state on the NameNode.
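The block and replica placement that the NameNode tracks can be inspected from any client. The following is a minimal sketch using the public FileSystem API; it assumes a reachable cluster configured via core-site.xml and that the path of an existing HDFS file is passed as the first command-line argument.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]); // path to an existing HDFS file

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());

        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                block.getOffset(), block.getLength(),
                String.join(",", block.getHosts()));
        }
    }
}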

Various processing frameworks allow processing stored HDFS data, such as batch processing, event processing, and streaming.

ZooKeeper manages Hadoop operation and provides high availability.

However, given the amount of data stored in Hadoop, real-time transactional access is a challenge; HBase can be useful for real-time access to a subset of the data stored on the cluster.

SUMMARY

This chapter has provided a high-level overview of Hadoop and its ecosystem (core components). We have briefly discussed each component, its utilization as part of the platform, and how the components fit together to solve Big Data challenges.

CHAPTER 3

THE HADOOP DISTRIBUTED FILESYSTEM

The HDFS is Hadoop’s distributed filesystem storage layer, designed to hold a large amount of data and provide access to the filesystem’s namespace in a distributed, clustered environment. The HDFS abstracts away the network complexity of storing and processing data files, so it is easy to use without thinking about the underlying network.
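As a preview of the interfaces covered in Chapter 5, here is a minimal sketch of writing and then reading a small HDFS file through the Java FileSystem API. It assumes a reachable cluster configured via core-site.xml; the path /tmp/hdfs-demo.txt is a hypothetical example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up fs.defaultFS from core-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/hdfs-demo.txt"); // hypothetical path

        // Write once: create the file (overwriting any existing copy) and close it.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello from HDFS\n".getBytes(StandardCharsets.UTF_8));
        }

        // ...read many times: open the same file and print its contents.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}

Note the write-once pattern: the file is created and closed before being opened for reading, since HDFS does not support random in-place updates.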