Apache Hive Essentials E-Book

Dayong Du

Publisher: Packt Publishing
Category: Science and new technologies
Language: English

Description

This book takes you on a fantastic journey to discover the attributes of big data using Apache Hive.

Key Features

Grasp the skills needed to write efficient Hive queries to analyze the Big Data

Discover how Hive can coexist and work with other tools within the Hadoop ecosystem

Uses practical, example-oriented scenarios to cover all the newly released features of Apache Hive 2.3.3

Book Description

In this book, we prepare you for your journey into big data by first introducing you to the background of the big data domain, along with the process of setting up and getting familiar with your Hive working environment.

Next, the book guides you through discovering and transforming the values of big data with the help of examples. It also hones your skills in using the Hive language in an efficient manner. Toward the end, the book focuses on advanced topics, such as performance, security, and extensions in Hive, which will guide you on exciting adventures on this worthwhile big data journey.

By the end of the book, you will be familiar with Hive and able to work efficiently to find solutions to big data problems.

What you will learn

Create and set up the Hive environment

Discover how to use Hive's definition language to describe data

Discover interesting data by joining and filtering datasets in Hive

Transform data by using Hive sorting, ordering, and functions

Aggregate and sample data in different ways

Boost Hive query performance and enhance data security in Hive

Customize Hive to your needs by using user-defined functions and integrate it with other tools

Who this book is for

If you are a data analyst, developer, or simply someone who wants to quickly get started with Hive to explore and analyze Big Data in Hadoop, this is the book for you. Since Hive is an SQL-like language, some previous experience with SQL will be useful to get the most out of this book.

Dayong Du is a big data practitioner, author, and coach with over 10 years' experience in technology consulting, designing, and implementing enterprise big data architecture and analytics in various industries, including finance, media, travel, and telecoms. He has a master's degree in computer science from Dalhousie University and is a Cloudera certified Hadoop developer. He is a cofounder of Toronto Big Data Professional Association and the founder of DataFiber website.

Details

Page count: 229

Year of publication: 2018

Sample

Apache Hive Essentials Second Edition

Essential techniques to help you process, and get unique insights from, big data

Dayong Du

BIRMINGHAM - MUMBAI

Apache Hive Essentials Second Edition

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Noyonika Das
Content Development Editor: Mohammed Yusuf Imaratwale
Technical Editor: Jinesh Topiwala
Copy Editor: Safis Editing
Project Coordinator: Hardik Bhinde
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Jason Monteiro
Production Coordinator: Aparna Bhagat

First published: February 2015
Second edition: June 2018

Production reference: 1290618

Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.

ISBN 978-1-78899-509-2

www.packtpub.com

I dedicate this book to my daughter, Elaine

Contributors

About the author

About the reviewers

Deepak Kumar Sahu is a big data technology-driven professional with extensive experience in data gathering, modeling, analysis, validation, and architecture design to build next-generation analytics platforms. He has a strong analytical and technical background with good problem-solving skills to develop effective, complex business solutions. He enjoys developing high-quality software and designing secure and scalable data systems. He has written blogs on machine learning, data science, big data management, and Blockchain. He can be reached at linkedin deepakkumarsahu.

Shuguang Li is a big data professional with extensive experience in designing and implementing complete end-to-end Hadoop infrastructure using MapReduce, Spark, Hive, Atlas, Kafka, Sqoop, and HBase. The whole lifecycle covers data ingestion, data streaming, data analysis, and data mining. He also has hands-on experience in blockchain technology, including Fabric and Sawtooth. Shuguang has more than 20 years' experience in the financial industry, including banks, stock exchanges, and mutual fund companies. He can be reached at linkedin michael-li-12016915.

Title Page

Apache Hive Essentials Second Edition

Dedication

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the author

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Overview of Big Data and Hive

A short history

Introducing big data

The relational and NoSQL databases versus Hadoop

Batch, real-time, and stream processing

Overview of the Hadoop ecosystem

Hive overview

Summary

Setting Up the Hive Environment

Installing Hive from Apache

Installing Hive from vendors

Using Hive in the cloud 

Using the Hive command

Using the Hive IDE

Summary

Data Definition and Description

Understanding data types

Data type conversions

Data Definition Language

Database

Tables

Table creation

Table description

Table cleaning

Table alteration

Partitions

Buckets

Views

Summary

Data Correlation and Scope

Project data with SELECT

Filtering data with conditions

Linking data with JOIN

INNER JOIN

OUTER JOIN

Special joins

Combining data with UNION

Summary

Data Manipulation

Data exchanging with LOAD

Data exchange with INSERT

Data exchange with [EX|IM]PORT

Data sorting

Functions

Function tips for collections

Function tips for date and string

Virtual column functions

Transactions and locks

Transactions

UPDATE statement

DELETE statement

MERGE statement

Locks

Summary

Data Aggregation and Sampling

Basic aggregation 

Enhanced aggregation

Grouping sets

Rollup and Cube

Aggregation condition

Window functions

Window aggregate functions

Window sort functions

Window analytics functions

Window expression

Sampling

Random sampling

Bucket table sampling

Block sampling

Summary

Performance Considerations

Performance utilities

EXPLAIN statement

ANALYZE statement

Logs

Design optimization

Partition table design

Bucket table design

Index design

Use skewed/temporary tables

Data optimization

File format

Compression

Storage optimization

Job optimization

Local mode

JVM reuse

Parallel execution

Join optimization

Common join

Map join

Bucket map join

Sort merge bucket (SMB) join

Sort merge bucket map (SMBM) join

Skew join

Job engine

Optimizer

Vectorization optimization

Cost-based optimization

Summary

Extensibility Considerations

User-defined functions

UDF code template

UDAF code template

UDTF code template

Development and deployment

HPL/SQL

Streaming

SerDe

Summary

Security Considerations

Authentication

Metastore authentication

Hiveserver2 authentication

Authorization

Legacy mode

Storage-based mode

SQL standard-based mode

Mask and encryption

The data-hashing function

The data-masking function

The data-encryption function

Other methods

Summary

Working with Other Tools

The JDBC/ODBC connector

NoSQL

The Hue/Ambari Hive view

HCatalog

Oozie

Spark

Hivemall

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

With increasing interest in big data analysis, Hive over Hadoop has become a cutting-edge data solution for storing, computing, and analyzing big data. Its SQL-like syntax makes Hive easy to learn, and it is widely accepted as a standard for interactive SQL queries over big data. The variety of features available within Hive lets us do complex big data analysis without advanced coding skills. As Hive has matured, it has gradually come to share its architecture and functionality with computing frameworks beyond Hadoop.

Apache Hive Essentials, Second Edition prepares you for your journey into big data. The first two chapters introduce the background and concepts of the big data domain, along with the process of setting up and getting familiar with your Hive working environment. The next four chapters guide you through discovering and transforming the value behind big data using examples and the Hive query language. The last four chapters highlight selected advanced topics, such as performance, security, and extensions, as exciting adventures on this worthwhile big data journey.

Who this book is for

If you are a data analyst, developer, or user who wants to use Hive for exploring and analyzing data in Hadoop, this is the right book for you. Whether you are new to big data or already an experienced user, you will be able to master both basic and advanced functions of Hive. Since HQL is quite similar to SQL, some previous experience with SQL and databases will help with getting a better understanding of this book.

What this book covers

Chapter 1, Overview of Big Data and Hive, begins with the evolution of big data, Hadoop ecosystem, and Hive. You will also learn the Hive architecture and advantages of using Hive in big data analysis.

Chapter 2, Setting Up the Hive Environment, presents the Hive environment setup and configuration. It also covers using Hive through the command line and development tools.

Chapter 3, Data Definition and Description, outlines the basic data types and data definition language for tables, partitions, buckets, and views in Hive.

Chapter 4, Data Correlation and Scope, shows you ways to discover the data by querying, linking, and scoping the data in Hive.

Chapter 5, Data Manipulation, focuses on the process of exchanging, moving, sorting, and transforming the data in Hive.

Chapter 6, Data Aggregation and Sampling, explains how to aggregate and sample data using aggregation functions, analytic functions, windowing, and sample clauses.

Chapter 7, Performance Considerations, introduces performance best practices in terms of design, file format, compression, storage, query, and job.

Chapter 8, Extensibility Considerations, describes ways to extend Hive through user-defined functions, streaming, serializers, and deserializers.

Chapter 9, Security Considerations, introduces the area of Hive security in terms of authentication, authorization, and encryption.

Chapter 10, Working with Other Tools, discusses how Hive works with other big data tools.

To get the most out of this book

This book will give you maximum benefit if you have some experience with SQL. If you are a data analyst, developer, or simply someone who wants to quickly get started with Hive to explore and analyze Big Data in Hadoop, this is the book for you. Additionally, install the following on your system:

JDK 1.8

Hadoop 2.x.y

Ubuntu 16.04/CentOS 7

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Visit www.packtpub.com.

Select the SUPPORT tab.

Click on Code Downloads & Errata.

Enter the name of the book in the box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Apache-Hive-Essentials-Second-Edition. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/ApacheHiveEssentialsSecondEdition_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Add the necessary system path variables in the ~/.profile or ~/.bashrc file"

A block of code is set as follows:

export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=/opt/hadoop/conf
export HIVE_HOME=/opt/hive
export HIVE_CONF_DIR=/opt/hive/conf
export PATH=$PATH:$HIVE_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Any command-line or beeline interactive input or output is written as follows:

$ hive

$ beeline -u "jdbc:hive2://localhost:10000"

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select Preference from the interface."

Warnings or important notes appear like this.

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.

Overview of Big Data and Hive

This chapter is an overview of big data and Hive, especially in the Hadoop ecosystem. It briefly introduces the evolution of big data so that readers know where they are in the journey of big data and can find out their preferred areas in future learning. This chapter also covers how Hive has become one of the leading tools in the big data ecosystem and why it is still competitive.

In this chapter, we will cover the following topics:

A short history, from databases and data warehouses to big data

Introducing big data

Relational and NoSQL databases versus Hadoop

Batch, real-time, and stream processing

Hadoop ecosystem overview

Hive overview

A short history

In the 1960s, when computers became a more cost-effective option for businesses, people started to use databases to manage data. Later on, in the 1970s, relational databases became more popular for business needs since they connected physical data with the logical business easily and closely. In the next decade, Structured Query Language (SQL) became the standard query language for databases. The effectiveness and simplicity of SQL motivated lots of people to use databases and brought databases closer to a wide range of users and developers. For a long period afterward, databases remained the main tool for data applications and data management.

Once plenty of data was collected, people started to think about how to deal with the historical data. Then, the term data warehousing came up in the 1990s. From that time onward, people started discussing how to evaluate current performance by reviewing the historical data. Various data models and tools were created to help enterprises effectively manage, transform, and analyze their historical data. Traditional relational databases also evolved to provide more advanced aggregation and analytic functions as well as optimizations for data warehousing. The leading query language was still SQL, but it was more intuitive and powerful compared to the previous versions. The data was still well-structured and the model was normalized.

As we entered the 2000s, the internet gradually became the leading source of data in terms of both variety and volume. Newer technologies, such as social media analytics, web mining, and data visualizations, helped lots of businesses and companies process massive amounts of data for a better understanding of their customers, products, competition, and markets. The data volume grew and the data format changed faster than ever before, which forced people to search for new solutions, especially in the research and open source areas. As a result, big data became a hot topic and a challenging field for many researchers and companies.

However, in every challenge there lies great opportunity. In the 2010s, Hadoop, one of the open source big data projects, started to gain wide attention due to its open source license, active communities, and power to deal with large volumes of data. This was one of the few times that an open source project led a change in technology trends ahead of any commercial software product. Soon after, NoSQL databases, real-time analytics, and machine learning followed and quickly became important components on top of the Hadoop big data ecosystem. Armed with these big data technologies, companies were able to review the past, evaluate the present, and grasp future opportunities.

Introducing big data

Big data is not simply a big volume of data. Here, the word big refers to the big scope of data. A well-known way to describe big data in this domain uses three words starting with the letter V: volume, velocity, and variety. But the analytical and data science world has seen data varying in other dimensions in addition to the fundamental three Vs of big data, such as veracity, variability, volatility, visualization, and value. The different Vs mentioned so far are explained as follows:

Volume: This refers to the amount of data generated in seconds. 90% of the world's data today has been created in the last two years. Since that time, the data in the world doubles every two years. Such big volumes of data are mainly generated by machines, networks, social media, and sensors, including structured, semi-structured, and unstructured data.

Velocity: This refers to the speed at which the data is generated, stored, analyzed, and moved around. With the availability of internet-connected devices, wireless or wired machines and sensors can pass on their data as soon as it is created. This leads to real-time data streaming and helps businesses to make valuable and fast decisions.

Variety: This refers to the different data formats. Data used to be stored in the .txt, .csv, and .dat formats from data sources such as filesystems, spreadsheets, and databases. This type of data, which resides in a fixed field within a record or file, is called structured data. Nowadays, data is not always in the traditional structured format. The newer semi-structured or unstructured forms of data are also generated by various methods such as email, photos, audio, video, PDFs, SMSes, or even something we have no idea about. These varieties of data formats create problems for storing and analyzing data. This is one of the major challenges we need to overcome in the big data domain.

Veracity: This refers to the quality of data, such as trustworthiness, biases, noise, and abnormality in data. Corrupted data is quite normal. It could originate due to a number of reasons, such as typos, missing or uncommon abbreviations, data reprocessing, and system failures. However, ignoring this corrupted data could lead to inaccurate data analysis and eventually a wrong decision. Therefore, making sure the data is correct through data auditing and correction is very important for big data analysis.

Variability: This refers to the changing of data. It means that the same data could have different meanings in different contexts. This is particularly important when carrying out sentiment analysis. The analysis algorithms must be able to understand the context and discover the exact meaning and values of data in that context.

Volatility: This refers to how long the data is valid and stored. This is particularly important for real-time analysis. It requires a target time window of data to be determined so that analysts can focus on particular questions and gain good performance out of the analysis.

Visualization: This refers to the way of making data well understood. Visualization does not only mean ordinary graphs or pie charts; it also makes vast amounts of data comprehensible in a multidimensional view that is easy to understand. Visualization is an innovative way to show changes in data. It requires lots of interaction, conversations, and joint efforts between big data analysts and business-domain experts to make the visualization meaningful.

Value: This refers to the knowledge gained from data analysis on big data. The value of big data is how organizations turn themselves into big data-driven companies and use the insight from big data analysis for their decision-making.

In summary, big data is not just about lots of data; it is a practice of discovering new insight from existing data and guiding the analysis of new data. A big-data-driven business will be more agile and competitive in overcoming challenges and winning against competitors.

The relational and NoSQL databases versus Hadoop

To better understand the differences among the relational database, NoSQL database, and Hadoop, let's compare them with ways of traveling. You will be surprised to find that they have many similarities. When people travel, they take either cars or airplanes, depending on the travel distance and cost. For example, when you travel to Vancouver from Toronto, an airplane is always the first choice in terms of travel time versus cost. When you travel to Niagara Falls from Toronto, a car is always a good choice. When you travel to Montreal from Toronto, some people may prefer taking a car to an airplane. The distance and cost here are like the big data volume and investment.

The traditional relational database is like the car, and the Hadoop big data tool is like the airplane. When you deal with a small amount of data (short distance), a relational database (like the car) is usually the best choice, since it is fast and agile with small or moderate amounts of data. When you deal with a large amount of data (long distance), Hadoop (like the airplane) is usually the best choice, since it scales more linearly and stays fast and stable with big volumes of data. You could drive from Toronto to Vancouver, but it takes too much time. You could also take an airplane from Toronto to Niagara Falls, but getting to the airport would take more time, and the trip would cost more than traveling by car.

In addition, you could take a ship or a train. This is like a NoSQL database, which balances characteristics of both a relational database and Hadoop, offering good performance and support for various data formats for moderate to large amounts of data.

Batch, real-time, and stream processing

Batch processing is used to process data in batches. It reads data from the input, processes it, and writes it to the output. Apache Hadoop is the most well-known and popular open source implementation of the distributed batch processing system using the MapReduce paradigm. The data is stored in a shared and distributed file system, called Hadoop Distributed File System (HDFS), and divided into splits, which are the logical data divisions for MapReduce processing.

To process these splits using the MapReduce paradigm, the map task reads a split, passes each of its key/value pairs to a map function, and writes the results to intermediate files. After the map phase is completed, the reduce task reads the intermediate files sent through the shuffle process and passes them to the reduce function. Finally, the reduce task writes results to the final output files. The advantages of the MapReduce model include easier distributed programming, near-linear speed-up, good scalability, and fault tolerance. The disadvantage of this batch processing model is that it cannot execute recursive or iterative jobs efficiently. In addition, because of its batch behavior, all map output must be ready before the reduce phase starts, which makes MapReduce unsuitable for online and stream-processing use cases.
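The map, shuffle, and reduce steps described above can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop's API: the function names (map_fn, shuffle, reduce_fn, run_job) and the sample splits are hypothetical, and real MapReduce distributes the splits across machines and writes intermediate files to disk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(line: str) -> Iterator[Tuple[str, int]]:
    """Map function: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all intermediate values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce function: collapse all values of one key into a single count."""
    return key, sum(values)


def run_job(splits: List[List[str]]) -> Dict[str, int]:
    # Map phase: each split is processed independently of the others.
    intermediate: List[Tuple[str, int]] = []
    for split in splits:
        for line in split:
            intermediate.extend(map_fn(line))
    # The shuffle starts only after every map result is available,
    # which is the batch behavior described in the text.
    grouped = shuffle(intermediate)
    # Reduce phase: one call per key.
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


# Hypothetical input: two splits of a small text file.
splits = [
    ["hive runs on hadoop", "hadoop stores data"],
    ["hive queries data"],
]
word_counts = run_job(splits)
print(word_counts)
```

Because run_job cannot begin reducing until the map loop has finished, the sketch also shows why this model fits batch jobs better than continuous streams.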

Real-time processing is used to process data and get the result almost immediately. In the area of real-time ad hoc queries over big data, this concept was first implemented in Dremel by Google. Dremel uses a novel columnar storage format for nested structures, with fast indexing and scalable aggregation algorithms that compute query results in parallel instead of in batch sequences. These two techniques are the major characteristics of real-time processing and are used by similar implementations, such as Impala (https://impala.apache.org/), Presto (https://prestodb.io/), and Drill (https://drill.apache.org/), powered by columnar storage data formats, such as Parquet (https://parquet.apache.org/), ORC (https://orc.apache.org/), CarbonData (https://carbondata.apache.org/), and Arrow (https://arrow.apache.org/).

On the other hand, in-memory computing no doubt offers faster solutions for real-time processing. It offers very high bandwidth, more than 10 gigabytes/second, compared to a hard disk's 200 megabytes/second. Its latency is also much lower: nanoseconds versus the milliseconds of hard disks. With the price of RAM getting lower and lower, in-memory computing is becoming more affordable as a real-time solution. Apache Spark (https://spark.apache.org/) is a popular open source implementation of in-memory computing. Spark can be easily integrated with Hadoop, and its in-memory data structure, the Resilient Distributed Dataset (RDD), can be generated from data sources, such as HDFS and HBase, for efficient caching.
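As a rough back-of-the-envelope sketch, the following Python snippet combines the two ideas above: a columnar format lets a query read only the columns it needs, and memory offers far higher bandwidth than a hard disk. The bandwidth figures come from the text; the 1 TB table, its 10 equally sized columns, and the scan_seconds helper are assumptions made purely for illustration. The estimate ignores seek time, compression, and parallelism.

```python
# Bandwidth figures quoted in the text (decimal units).
DISK_BYTES_PER_SECOND = 200 * 10**6    # 200 megabytes/second
MEMORY_BYTES_PER_SECOND = 10 * 10**9   # 10 gigabytes/second


def scan_seconds(table_bytes: int, columns_read: int, total_columns: int,
                 bytes_per_second: int) -> float:
    """Estimate the time to scan the requested columns of a table.

    Assumes all columns are the same size (a simplification).
    """
    bytes_read = table_bytes * columns_read // total_columns
    return bytes_read / bytes_per_second


# Hypothetical table: 1 TB with 10 equally sized columns.
TABLE_BYTES = 10**12
TOTAL_COLUMNS = 10

# A row-oriented layout has to read every column, even if the query uses one.
row_on_disk = scan_seconds(TABLE_BYTES, TOTAL_COLUMNS, TOTAL_COLUMNS,
                           DISK_BYTES_PER_SECOND)
# A columnar layout reads only the single column the query needs.
columnar_on_disk = scan_seconds(TABLE_BYTES, 1, TOTAL_COLUMNS,
                                DISK_BYTES_PER_SECOND)
# The same single column, already cached in memory.
columnar_in_memory = scan_seconds(TABLE_BYTES, 1, TOTAL_COLUMNS,
                                  MEMORY_BYTES_PER_SECOND)

print(f"row layout on disk:      {row_on_disk:,.0f} s")
print(f"columnar on disk:        {columnar_on_disk:,.0f} s")
print(f"columnar in memory:      {columnar_in_memory:,.0f} s")
```

Under these assumptions, the estimate drops from 5,000 seconds to 500 seconds by reading one column instead of ten, and to 10 seconds when that column is served from memory.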

Stream processing is used to continuously process and act on live stream data to get a result. There are two commonly used general-purpose stream processing frameworks: Storm (https://storm.apache.org/) and Flink (https://flink.apache.org/). Both frameworks run on the Java Virtual Machine (JVM) and both process keyed streams. In terms of the programming model, Storm gives you the basic tools to build a framework, while Flink gives you a well-defined and easy-to-use framework. In addition, Samza (http://samza.apache.org/) and Kafka Streams (https://kafka.apache.org/documentation/streams/) leverage Kafka for both message caching and transformation. More recently, Spark has also provided a type of stream processing through its continuous-processing mode.
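To make the idea of processing a keyed stream more concrete, here is a minimal Python sketch that counts events per key in fixed-size (tumbling) time windows and emits each result as soon as its window closes. It does not use the Storm or Flink APIs; the function name, the event format, and the sample events are hypothetical, and the sketch assumes that events arrive in timestamp order.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Optional, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, key)


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: int
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_start, counts per key) each time a window closes.

    Assumes events arrive in timestamp order (a simplification; real
    frameworks also handle late and out-of-order events).
    """
    current_start: Optional[int] = None
    counts: Counter = Counter()
    for timestamp, key in events:
        window_start = timestamp - timestamp % window_seconds
        if current_start is None:
            current_start = window_start
        elif window_start != current_start:
            # The previous window is complete: emit it without waiting
            # for the rest of the stream.
            yield current_start, dict(counts)
            current_start = window_start
            counts = Counter()
        counts[key] += 1
    # Flush the last open window when the (finite) sample stream ends.
    if current_start is not None:
        yield current_start, dict(counts)


# Hypothetical event stream.
events = [(0, "click"), (3, "view"), (7, "click"),
          (12, "click"), (14, "view"), (21, "view")]
results = list(tumbling_window_counts(events, window_seconds=10))
for window_start, per_key in results:
    print(window_start, per_key)
```

Unlike the batch model, results here are produced while input is still arriving, since each window is emitted once a later event shows that it has ended.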