This book is a guide for developers and engineers to use Hadoop and Hive for scalable big data applications. It covers reading, writing, and managing large datasets with Hive and provides a concise introduction to Apache Hadoop and Hive, detailing their collaboration to simplify development. Through clear examples, the book explains the logic, code, and configurations needed for building successful distributed applications.
The book starts with an introduction to big data and Apache Hadoop fundamentals. It then covers the Hadoop Distributed Filesystem and how to get started with Hadoop. The journey continues with interfaces to access HDFS files, resource management with Yet Another Resource Negotiator (YARN), and MapReduce for data processing. The book also explores Hive architecture, storage types, and the Hive query language.
Mastering these concepts is vital for creating scalable big data solutions. This book ensures a smooth transition from novice to proficient Hadoop and Hive user, providing practical skills and comprehensive knowledge. By the end, readers will be able to set up, configure, and optimize Hadoop, utilize Hive for data management, and effectively solve big data challenges.
BIG DATA USING HADOOP™ AND HIVE™
LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY
By purchasing or using this book (the “Work”), you agree that this license grants permission to use the contents contained herein, but does not give you the right of ownership to any of the textual content in the book or ownership to any of the information or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.
MERCURY LEARNING AND INFORMATION (“MLI” or “the Publisher”) and anyone involved in the creation, writing, production, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to ensure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold “as is” without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).
The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.
The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state, and might not apply to the purchaser of this product.
BIG DATA USING HADOOP™ AND HIVE™
NITIN KUMAR
MERCURY LEARNING AND INFORMATION
Dulles, Virginia
Boston, Massachusetts
New Delhi
Copyright ©2021 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.
This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.
Publisher: David Pallai
MERCURY LEARNING AND INFORMATION
22841 Quicksilver Drive
Dulles, VA
[email protected]
Nitin Kumar. Big Data Using Hadoop™ and Hive™.
ISBN: 978-1-68392-645-0
The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.
Library of Congress Control Number: 2021934303
21 22 23   3 2 1   This book is printed on acid-free paper in the United States of America.
Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).
All of our titles are available in digital format at academiccourseware.com and other digital vendors. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the book, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.
To my wife Sarika, my children Shaurya and Irima, and to my parents
CONTENTS
Preface
Chapter 1: Big Data
Big Data Challenges for Organizations
How We Are Using Big Data
Big Data: An Opportunity
Hadoop: A Big Data Solution
Big Data in the Real World
Chapter 2: What is Apache Hadoop?
Hadoop History
Hadoop Benefits
Hadoop’s Ecosystem: Components
Hadoop Core Component Architecture
Summary
Chapter 3: The Hadoop Distributed Filesystem
HDFS Core Components
HDFS Architecture
Data Replication
Data Locality
Data Storage
Failure Handling on the HDFS
Erasure Coding (EC)
HDFS Disk Balancer
HDFS Federation
HDFS Architecture and Its Challenges
Hadoop Federation: A Rescue
Benefits of the HDFS Federation
HDFS Processes: Read and Write
Failure Handling During Read and Write
Chapter 4: Getting Started with Hadoop
Hadoop Configuration
Command-Line Interface
Generic Filesystem CLI Command
Distributed Copy (distcp)
Hadoop’s Other User Commands
HDFS Permissions
HDFS Quotas Guide
HDFS Short-Circuit Local Reads
Offline Edits Viewer Guide
Offline Image Viewer Guide
Chapter 5: Interfaces to Access HDFS Files
WebHDFS REST API
FileSystem URIs
Error Responses
Authentication
Java FileSystem API
URI and Path
FSDataInputStream
FSDataOutputStream
FileStatus
Directories
Delete Files
C API libhdfs
Chapter 6: Yet Another Resource Negotiator
YARN Architecture
YARN Process Flow
YARN Failures
YARN High Availability
YARN Schedulers
The Fair Scheduler
The Capacity Scheduler
The YARN Timeline Server
Application Timeline Server (ATS)
ATS Data Model Structure
ATS V2
YARN Federation
Chapter 7: MapReduce
MapReduce Process
Key Features
Different Phases in the MapReduce Process
MapReduce Architecture
MapReduce Sample Program
MapReduce Composite Key Operation
Mapper Program
MapReduce Configuration
Chapter 8: Hive
Hive History
Hive Query
Data Storage
Data Model
Complex Data Types
Hive DDL (Data Definition Language)
Tables
View
Partition
Bucketing
Hive Architecture
Serialization/Deserialization (SerDe)
Metastore
Query Compiler
HiveServer2
Chapter 9: Getting Started with Hive
Hive Set-up
Hive Configuration Settings
Loading and Inserting Data into Tables
Insert from a Select Query
Load Table Data into File
Create and Load Data into a Table
Hive Transactions
Enable Transactions
Insert Values
Update
Delete
Merge
Locks
Hive Select Query
Select Basic Query
Hive QL File
Hive Select on Complex Datatypes
Order By and Sort By
Distribute By and Cluster By
Group By and Having
Built-in Aggregate Functions
Enhanced Aggregation
Table-Generating Functions
Built-In Utility Functions
Collection Functions
Date Functions
Conditional Functions
String Functions
Hive Query Language-Join
Chapter 10: File Format
File Format Characteristics
Columnar Format
Schema Evolution
Splittable
Compression
File Formats
RC (Row-Columnar) File Input Format
Optimized Row Columnar (ORC) File Format
Parquet
File Format Comparisons
ORC vs. Parquet
Chapter 11: Data Compression
Data Compression Benefits
Data Compression in Hadoop
Splitting
Compression Codec
Data Compressions
References
Index
PREFACE
Big Data Using Hadoop and Hive is the essential guide for developers, architects, engineers, and anyone who wants to start leveraging Hadoop to build distributed, scalable, and concurrent applications. It is a concise guide to getting started with Hadoop and Hive. It provides an overall understanding of Hadoop and how it works, while giving sample code to speed up development with minimal effort. It relies on simple concepts and examples, as they are likely to be the best teaching aids. It explains the logic, code, and configurations needed to build a successful, distributed, concurrent application, and the reasoning behind those decisions.
Intended Audience
The primary audience of this book is developers who would like to start development using Hadoop and Hive, architects who want an architectural understanding of Hadoop technologies, and application designers looking for a detailed understanding of what Hadoop can do for them.
What this book covers
Big Data Using Hadoop and Hive covers Hadoop 3 exclusively. The Hadoop 3.3.x release series is the current active release series and contains the most stable version of Hadoop. This edition also covers the latest Hive 3.x version with all of its new features. The book contains four main parts: Chapters 1 through 5 are an introduction to Hadoop, the HDFS, and the architecture around them. Chapters 6 and 7 discuss YARN and MapReduce. Chapters 8 and 9 cover Hive and its architecture. Chapters 10 and 11 discuss the various file formats and compression methodologies used in the Hadoop environment.
Chapter 1: This chapter explains Big Data and how organizations face difficulties with it. This chapter explores the key benefits of Big Data and how Big Data is utilized. It also shines light on how Hadoop solves the challenges of Big Data.
Chapter 2: This chapter covers Hadoop history and answers questions regarding the key benefits of Apache Hadoop. It also provides a brief introduction to the Hadoop ecosystem and explains Hadoop architecture.
Chapter 3: This chapter covers the Hadoop Distributed File System (HDFS) and its architecture. It also includes the HDFS core components and how Hadoop reads/writes with the HDFS.
Chapter 4: This chapter explains the basic Hadoop set-up and essential Hadoop configuration and optimization. It also covers the command-line interface to access the HDFS.
Chapter 5: This chapter covers the Java API and the WebHDFS REST API used to access HDFS files.
Chapter 6: This chapter explains the challenges of Hadoop v1 and how YARN solves them. It explains YARN's basic architecture and various failover scenarios on YARN, and it also describes the YARN Timeline Server, including the ATS v2 introduced in Hadoop 3.
Chapter 7: This chapter provides a detailed understanding of MapReduce jobs and how the Hadoop core framework executes them.
Chapter 8: This chapter covers Hive and its architecture with various storage types. It also includes the Hive query language.
Chapter 9: This chapter covers the step-by-step Hive set-up. It includes the different types of tables and explains how to query the table data. It also emphasizes the various joins and built-in utility functions.
Chapters 10 and 11: These chapters explain various file formats, such as Avro, ORC, and Parquet. They also cover different compression mechanisms.
Nitin Kumar
March 2021
CHAPTER 1
BIG DATA
There has been an explosive growth of social interactions, including online relationships, collaboration, information sharing, and feedback. Social networks such as Twitter, Facebook, LinkedIn, YouTube, and WhatsApp connect millions of users, allowing them to share vast amounts of information. Consumers spend more and more time on the Internet, which enables organizations to collect massive amounts of data. Every new generation of devices is connected and sharing massive amounts of sensor data. This data could be anything: user transaction logs, sales logs, user history, machine-generated data, video streams, and sensor data.
These large amounts (terabytes, petabytes, and exabytes) of structured and unstructured data are known as Big Data. Today, businesses have endless opportunities to transform these structured and unstructured data into practical information for decision-making, predictive analysis, recommendation systems, and analytic platforms.
Organizations collect and transform data into actionable insights that can be used to contribute to revenue generation. There is an enormous demand for professionals in these areas to build models, extract features and patterns, and perform statistical analysis to transform the data into meaningful information.
Various fields were spawned from Big Data, such as artificial intelligence (AI), machine learning, and deep learning, which apply patterns to predict outcomes and make recommendations. For example, video streaming services track your interests based on your previous activity and recommend similar videos for future viewing. Online shopping sites provide recommendations based on your interests, trends, and events.
BIG DATA CHALLENGES FOR ORGANIZATIONS
Internet users are doing everything online, from business communications to shopping and social networking. Billions of connected devices and embedded systems create, collect, and share a wealth of data every day, all over the world. According to IDC, 75% of the world’s population will be interacting with online data every day by 2025, and the amount of digital data is expected to more than double.
Organizations are collecting large amounts of data but struggling to manage and utilize it. This leads to a new challenge: how to effectively manage, process, and analyze vast amounts of data. The challenge is not only extracting value from the data but also managing its ever-expanding volume.
Traditional RDBMS-based databases were not capable of handling this massive amount of unstructured data, and so organizations were not able to use it. If data is archived on tapes, it is expensive to recover for further use. There was a need for technologies that store and process massive amounts of data.
HOW WE ARE USING BIG DATA
Big Data isn’t only about the quantity of information, but how we utilize it to improve statistical and computational processes. Organizations are adopting new technologies to visualize new patterns that could improve their strategic decision-making. Big Data helps organizations drive innovation by providing new insights about their customers. An organization can build an understanding of a customer based on that customer’s transaction history and provide improved recommendations about products or services. Behavioral analytics is also important, as it visualizes business sentiment and accelerates business outcomes.
BIG DATA: AN OPPORTUNITY
Big Data has brought about a new era in the economy, which is all about information and converting it into revenue. Organizations need to understand how to deal with all this data. Their growth depends on the quality of their predictive mechanisms and how they leverage machine learning. In one survey, 97% of companies responded that they plan to increase their spending on analytics, with estimates that companies will drive $432 billion in IT spending through 2025. By 2023, the demand for jobs related to Big Data is estimated to soar by 38%.
HADOOP: A BIG DATA SOLUTION
Astronomical amounts of unstructured data are a challenge to store, process, and extract information from. Traditional RDBMS systems are not capable of handling such large data sets.
Apache Hadoop is the solution for such types of data sets. Hadoop is an open-source project started by Doug Cutting. It was initiated based on papers published by Google, describing how their engineers were dealing with the challenge of storing and processing massive amounts of data. Yahoo! and other IT companies have driven the development of Hadoop.
BIG DATA IN THE REAL WORLD
Hadoop provides a highly scalable, reliable, and distributed data processing platform. Below are some of the critical business use cases for Big Data.
•Large Data Transformation: Big organizations constantly face challenges in processing and moving extensive data from one source to another.
•Market Trends: Organizations need to analyze market trends based on the feedback they receive. Big Data analysis provides an optimized view of those trends.
•Machine Learning: Data can be obtained from different sources. Developers can build systems that learn from data, such as artificial intelligence (AI), speech recognition, or applications that understand human behavior.
•Making Recommendations: Organizations utilize user-buying patterns to identify recommendations for proactive sales.
•Decision-making: Organizations use data to predict the future and make decisions.
•Campaigning: Big Data processing provides a way to customize the marketing campaigns based on user activity and behavior.
•User Behavior: Organizations are interested in how their customers use their product, and Big Data provides a window to this behavior.
•Predictive Mechanism: Organizations collect extensive data and gain ways to extrapolate from it and make predictions.
CHAPTER 2
WHAT IS APACHE HADOOP?
Hadoop provides the capabilities to store a massive amount of data in a distributed environment and process it effectively. It’s a distributed data processing system that supports distributed file systems, and it offers a way to parallelize and execute programs on a cluster of machines. It can be installed on a cluster built from a large number of commodity hardware machines, which optimizes overall solution costs.
Apache Hadoop has been adopted by technology giants such as Yahoo, Facebook, Twitter, and LinkedIn to address their data needs, and it’s making inroads across all industrial sectors.
The Apache Hadoop software library is a distributed framework for processing massive data sets in batches and streams across clusters. It allows for scaling servers to thousands of machines to support large data set computation with a robust failure handling mechanism. Each device on a Hadoop cluster offers local computation and storage, as well as failure detection and handling.
HADOOP HISTORY
Doug Cutting originally developed Nutch, an open-source Web search engine that was part of the Lucene project. However, it was a challenge to scale this solution up in a distributed environment. The Nutch Distributed Filesystem (NDFS) was based on a paper published in 2003 [http://research.google.com/archive/gfs.html].
In 2005, MapReduce with NDFS was implemented in Nutch. Later, Doug Cutting joined Yahoo!, which provided a dedicated team to build Hadoop, based on Nutch’s NDFS, as a Lucene subproject. Yahoo! later announced that a 10,000-core Hadoop cluster generated its production search index. In 2008, Hadoop became a top-level Apache open-source project (Apache Software Foundation). In 2009, Hadoop successfully sorted a petabyte of data in less than 17 hours and handled billions of searches. Doug Cutting later joined Cloudera and spread Hadoop to other organizations. In 2011, the Apache foundation released Apache Hadoop version 1.0; Hadoop version 2 was released in 2013, and Apache Hadoop version 3.0 was released in 2017.
HADOOP BENEFITS
Hadoop is one of the best solutions for the distributed storage and processing of a vast dataset (with terabytes or petabytes of information) in a cluster environment. It is fault-tolerant, scalable, and easy to use.
It divides files into small parts and distributes them across multiple nodes for parallel processing, which accelerates processing time. Hadoop leverages clusters of machines to provide ample storage and processing power at a price that businesses can afford. Hadoop provides a scalable and reliable mechanism for processing large amounts of data over cluster environments. It offers novel analytic techniques that enable the sophisticated processing of multi-structured data. Data is replicated across multiple nodes for reliability and higher availability. Hadoop processes data on the local node before aggregating it on a remote node, which minimizes bandwidth bottlenecks.
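To make the block-splitting and replication behavior concrete, here is a minimal Java sketch; the file path is hypothetical and the cluster settings are assumed to come from the standard configuration files, so this is an illustration rather than code from this book. It uses Hadoop's FileSystem API to report a file's block size, replication factor, and the DataNodes that hold each block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in core-site.xml points at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, used only for illustration.
        Path file = new Path("/data/logs/transactions.log");
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        // Each BlockLocation lists the DataNodes holding a replica of one block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}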
The following are some of the salient features of Hadoop:
•Massive data processing: Apache Hadoop can process massive amounts of data.
•Cost: Apache Hadoop is a low-cost solution for high-performance computing applications.
•Easily scalable: Apache Hadoop distributes data in small chunks across the cluster, and each node can work on its chunks independently. This provides the flexibility to scale storage without impacting the application.
•Distributed processing: Hadoop breaks large datasets into smaller, fixed-size chunks and distributes those chunks across the cluster. It allows users to access and process data in a distributed environment.
•Reliability: Reliability is a big challenge in a distributed cluster environment. Hadoop was designed to detect failures and retry processing on other nodes using replicated data.
•Separation of concerns: Hadoop keeps business-logic processing separate from the underlying infrastructure, which provides fault tolerance, high availability, parallel processing, and storage.
•Fast processing: It breaks the data processing task into multiple smaller jobs, which run in parallel in distributed cluster environments to reduce the overall data processing time.
•Fault tolerance & high availability: HDFS replicates data across the cluster, which gives it the power to handle failures and protect against data loss when a node crashes. If any node fails during processing, HDFS efficiently serves the data from replicas. Since each task is independent, MapReduce easily re-processes the failed task.
•Highly configurable: Apache Hadoop is highly configurable and provides a default configuration.
•Status monitoring: Apache Hadoop includes built-in Web servers that make it easy to monitor the clusters and jobs through a Web application.
HADOOP’S ECOSYSTEM: COMPONENTS
Various components have been developed for Apache Hadoop to assist with Big Data solutions, and these form the Hadoop ecosystem.
The Hadoop Distributed File System (HDFS) is the core component for storing and accessing large file systems. MapReduce is the heart of Hadoop and provides the capability to process files stored in the HDFS. HBase assists with real-time data processing. Hive allows data to be queried without writing MapReduce programs. New components are always being added to the Apache Hadoop ecosystem.
We can visualize the Hadoop Ecosystem as a Hadoop platform, a combination of various components stitched together to provide a Big Data solution.
•HDFS: The Hadoop Distributed File System partitions data and stores it across cluster nodes. HDFS is used to store a massive amount of data over a distributed environment. HDFS stores file metadata and file data separately. Data stored in the HDFS is written once but read many times. It provides a base for other tools, such as Hive, Pig, HBase, and MapReduce, to process data.
•YARN: It was introduced in Hadoop 2 and is available in higher versions. It decouples the functionalities of resource management and job scheduling/monitoring into separate daemons.
FIGURE 2.1 Key Hadoop ecosystem components and the layered architecture
•MapReduce: MapReduce is Hadoop’s key component for processing a massive amount of data in parallel. It provides mechanisms to handle large datasets as batches in a highly reliable, available, and fault-tolerant environment. MapReduce breaks the data into independent parts, which are processed in parallel by map tasks; each map task passes <key, value> pairs to a reducer, which aggregates them before storing the results in the HDFS (a minimal word-count sketch appears after this list).
•Apache Spark: Apache Spark provides fast in-memory data processing for the Hadoop environment, as well as support for a wide range of processing, including ETL, machine learning, stream processing, and graph computation.
•Apache Tez: Apache Tez is an alternative to MapReduce in Hadoop 2, used to process HDFS data in both batch and interactive ways by expressing each job as a directed acyclic graph (DAG) of tasks.
•Zookeeper: Zookeeper is Apache Hadoop’s coordination service designed to manage Hadoop operations.
•Oozie: Oozie is a workflow system for MapReduce designed to manage multiple MapReduce job flows.
•Pig: Pig provides a scripting language (Pig Latin) to analyze datasets. Pig Latin makes it easy to create a sequence of MapReduce programs.
•Hive: Apache Hive provides a SQL-like language to retrieve data stored in Hadoop. Developers can write a SQL-like query, which gets translated into MapReduce jobs in Hadoop. Hive is more useful to developers who are familiar with SQL.
•Sqoop: Sqoop is an integration framework used to transfer data from relational databases to Hadoop and vice versa. Sqoop uses the database to describe the schema and MapReduce for parallel operation and fault tolerance.
•Flume: Apache Flume collects, aggregates, and transfers extensive data from multiple machines to HDFS. It provides a distributed, reliable and highly available service to transfer data from various devices to Hadoop.
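As promised above, here is a minimal word-count sketch of the map and reduce roles, written against the org.apache.hadoop.mapreduce API; the class names and whitespace tokenization are illustrative assumptions rather than code from this book. The mapper emits a <word, 1> pair for every token in its input split, and the reducer sums the counts for each word before the result is written to the HDFS.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: runs in parallel on each input split and emits <word, 1> pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // handed to the shuffle, grouped by key
            }
        }
    }
}

// Reducer: receives all values for one key and aggregates them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));  // final <word, total> pair
    }
}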
Beyond the core components, the Apache Hadoop ecosystem includes other tools to address particular needs, as explained below.
•Whirr: Whirr is a set of libraries for running Hadoop clusters on top of Amazon EC2, Rackspace, or other virtual infrastructure.
•Mahout: This is a machine learning and data-mining library that provides MapReduce implementations of popular algorithms used for data analysis and modeling.
•BigTop: This is a framework for the packaging and testing of Hadoop’s sub-projects and related components.
•HBase: HBase is a column-oriented, non-relational database built on top of the HDFS that provides random, real-time read and write access to data stored in Hadoop.
•HCatalog: HCatalog is a metadata abstraction layer for referencing data without using the underlying file names or formats. It insulates users and scripts from how and where the data is physically stored.
•Ambari: A Web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, including support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.
•Avro™: A component for data serialization.
HADOOP CORE COMPONENT ARCHITECTURE
Hadoop stores a massive amount of data and provides a framework to process and manage it. Hadoop stores datasets in the HDFS cluster environment, which breaks files into multiple fixed-size blocks, stores them, and replicates them across the cluster. Its replication and clustering capabilities enhance the reliability and scalability of extensive data storage management.
FIGURE 2.2 Hadoop’s core components
Hadoop works on a multi-layer architecture in which it receives data from various sources, processes it, and then optimizes it for further use. Hadoop uses batch processing (MapReduce, Hive, and Pig) to process large datasets and stores the aggregated results in the HDFS or a NoSQL store, or exports them to OLTP systems. The source data is transferred to the Hadoop HDFS via various mechanisms, such as FTP, Flume, Sqoop, messaging, logs, and applications. Oozie is a workflow system that manages and schedules Hadoop jobs. The HDFS stores data across the cluster on DataNodes, while the NameNode maintains the metadata. The fsimage and edit logs persist and synchronize the namespace state on the NameNode.
Various processing frameworks, such as batch processing, event processing, and streaming, allow the data stored in the HDFS to be processed.
ZooKeeper manages Hadoop operation and provides high availability.
However, due to the amount of data stored in Hadoop, real-time transactional access is a challenge; HBase can be useful for real-time access to a subset of the data stored on the cluster.
SUMMARY
This chapter has provided a high-level overview of Hadoop and its ecosystem (core components). We have briefly discussed each component and its utilization as part of the platform, and how the components fit together to solve Big Data challenges.
CHAPTER 3
THE HADOOP DISTRIBUTED FILESYSTEM
The HDFS is Hadoop’s distributed storage system, designed to hold a large amount of data and provide access to the filesystem’s namespace in a distributed, clustered environment. The HDFS abstracts away the network complexity of storing and processing data files, so it is easy to use without thinking about where the data physically resides.
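As a small, assumed illustration of that abstraction (the path is hypothetical; the Java FileSystem API is covered in detail in Chapter 5), a client can read an HDFS file through an ordinary stream interface while the block lookups and DataNode transfers happen behind the scenes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadHdfsFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml settings
        FileSystem fs = FileSystem.get(conf);       // the configured (HDFS) filesystem

        // Hypothetical path; the NameNode lookup and DataNode reads are hidden here.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);  // stream contents to stdout
        }
        fs.close();
    }
}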