Easy, hands-on recipes to help you understand Hive and its integration with frameworks that are used widely in today's big data world
The book is intended for those who want to get started with Hive or who have a basic understanding of the Hive framework. Prior knowledge of basic SQL commands is also required.
Hive was developed by Facebook and later open sourced to the Apache community. Hive provides a SQL-like interface to run queries on big data frameworks. This SQL-like syntax, called HiveQL, includes SQL capabilities such as analytical functions, which are in high demand in today's big data world.
This book provides easy installation steps for the different types of metastores supported by Hive. It contains simple, easy-to-learn recipes for configuring Hive clients and services. You will also learn about different Hive optimizations, including partitions and bucketing, and the book covers a source code explanation of the latest Hive version.
The Hive Query Language is used by other frameworks, including Spark. Toward the end of the book, you will cover the integration of Hive with these frameworks.
Starting with the basics and covering the core concepts with practical usage, this book is a complete guide to learning and exploring what Hive has to offer.
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing and its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: April 2016
Production reference: 1260416
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78216-108-0
www.packtpub.com
Authors
Hanish Bansal
Saurabh Chauhan
Shrey Mehrotra
Reviewer
Aristides Villarreal Bravo
Commissioning Editor
Wilson D'souza
Acquisition Editor
Tushar Gupta
Content Development Editor
Anish Dhurat
Technical Editor
Vishal K. Mewada
Copy Editor
Dipti Mankame
Project Coordinator
Bijal Patel
Proofreader
Safis Editing
Indexer
Priya Sane
Graphics
Kirk D'Penha
Production Coordinator
Shantanu N. Zagade
Cover Work
Shantanu N. Zagade
Hanish Bansal is a software engineer with over 4 years of experience in developing big data applications. He loves to study emerging solutions and applications mainly related to big data processing, NoSQL, natural language processing, and neural networks. He has worked on various technologies such as Spring Framework, Hibernate, Hadoop, Hive, Flume, Kafka, Storm, and NoSQL databases, which include HBase, Cassandra, MongoDB, and search engines such as Elasticsearch.
In 2012, he graduated in information technology from Jaipur Engineering College and Research Center, Jaipur, India. He was also the technical reviewer of the book Apache ZooKeeper Essentials. In his spare time, he loves to travel and listen to music.
You can read his blog at http://hanishblogger.blogspot.in/ and follow him on Twitter at https://twitter.com/hanishbansal786.
I would like to thank my parents for their love, support, encouragement and the amazing chances they've given me over the years.
Saurabh Chauhan is a module lead with close to 8 years of experience in data warehousing and big data applications. He has worked on multiple extract, transform, and load (ETL) tools, such as Oracle Data Integrator and Informatica, as well as on big data technologies such as Hadoop, Hive, Pig, Sqoop, and Flume.
He completed his bachelor of technology in 2007 from Vishveshwarya Institute of Engineering and Technology. In his spare time, he loves to travel and discover new places. He also has a keen interest in sports.
I would like to thank everyone who has supported me throughout my life.
Shrey Mehrotra has 6 years of IT experience, the past 4 of them spent designing and architecting cloud and big data solutions for the governance and financial domains.
Having worked with big data R&D Labs and Global Data and Analytical Capabilities, he has gained insights into Hadoop, focusing on HDFS, MapReduce, and YARN. His technical strengths also include Hive, Pig, Spark, Elasticsearch, Sqoop, Flume, Kafka, and Java.
He likes spending time performing R&D on different big data technologies. He is the co-author of the book Learning YARN, a certified Hadoop developer, and has also written various technical papers. In his free time, he listens to music, watches movies, and spends time with friends.
I would like to thank my mom and dad for giving me support to accomplish anything I wanted. Also, I would like to thank my friends, who bear with me while I am busy writing.
Aristides Villarreal Bravo is a Java developer, a member of the NetBeans Dream Team, and a Java User Groups leader.
He has organized and participated in various conferences and seminars related to Java, Java EE, NetBeans, the NetBeans Platform, free software, and mobile devices, both nationally and internationally.
He has written tutorials and blogs about Java, NetBeans, and web development. He has participated in several interviews on sites such as NetBeans, NetBeans Dzone, and JavaHispano. He has developed plugins for NetBeans. He has been a technical reviewer for the book PrimeFaces Blueprints.
Aristides is the CEO of Javscaz Software Developers. He lives in Panamá.
To my mother, father, and all family and friends.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Hive is an open source big data framework in the Hadoop ecosystem. It provides a SQL-like interface to query data stored in HDFS. Under the hood, it runs MapReduce programs corresponding to the SQL query. Hive was initially developed by Facebook and later added to the Hadoop ecosystem.
Hive is currently the most preferred framework to query data in Hadoop. Because most historical data is stored in RDBMS data stores, including Oracle and Teradata, it is convenient for developers to run similar SQL statements in Hive to query data.
Along with simple SQL statements, Hive supports a wide variety of windowing and analytical functions, including rank, row_number, dense_rank, lead, and lag.
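As a quick illustration, the following is a minimal sketch of such a query; the employees table and its columns are hypothetical:

    -- Hypothetical table: employees(name, dept, salary)
    SELECT name,
           dept,
           salary,
           rank() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank,
           lag(salary) OVER (PARTITION BY dept ORDER BY salary DESC) AS prev_salary
    FROM employees;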
Hive is considered the de facto big data warehouse solution. It provides a number of techniques to optimize the storage and processing of terabytes or petabytes of data in a cost-effective way.
Hive can be easily integrated with a majority of other frameworks, including Spark and HBase. Hive allows developers or analysts to execute SQL on it. Hive also supports querying data stored in different formats, such as JSON.
Chapter 1, Developing Hive, helps you configure Hive on a Hadoop platform. This chapter explains the different modes of Hive installation. It also provides pointers for debugging Hive and brief information about compiling the Hive source code and the different modules in the Hive source code.
Chapter 2, Services in Hive, gives a detailed description of the configuration and usage of the different services provided by Hive, such as HiveServer2. This chapter also explains the different clients of Hive, including the Hive CLI and Beeline.
Chapter 3, Understanding the Hive Data Model, takes you through the details of the different data types provided by Hive, which will be helpful in data modeling.
Chapter 4, Hive Data Definition Language, helps you understand the syntax and semantics of creating, altering, and dropping different objects in Hive, including databases, tables, functions, views, indexes, and roles.
Chapter 5, Hive Data Manipulation Language, gives you a complete understanding of Hive interfaces for data manipulation. This chapter also includes some of the latest features in Hive related to CRUD operations. It explains row-level insert, update, and delete operations, available in Hive 0.14 and later versions.
Chapter 6, Hive Extensibility Features, covers a majority of the advanced concepts in Hive. This chapter explains concepts such as SerDes, partitions, bucketing, windowing and analytics, and file formats in Hive, with detailed examples.
Chapter 7, Joins and Join Optimization, gives you a detailed explanation of the types of joins supported by Hive. It also provides detailed information about the different join optimizations available in Hive.
Chapter 8, Statistics in Hive, shows you how to capture and analyze table, partition, and column-level statistics. This chapter covers the configurations and commands used to capture these statistics.
Chapter 9, Functions in Hive, gives you a detailed overview of the extensive set of built-in functions supported by Hive, which can be used directly in queries. This chapter also covers how to create a custom user-defined function and register it in Hive.
Chapter 10, Hive Tuning, helps you optimize complex queries to reduce throughput time. It covers different optimization techniques, such as predicate pushdown, reducing the number of maps, and sampling.
Chapter 11, Hive Security, covers concepts for securing data from unauthorized access. It explains the different mechanisms of authentication and authorization that can be implemented in Hive for security purposes. In the case of critical or sensitive data, security is the first thing that needs to be considered.
Chapter 12, Hive Integration with Other Frameworks, takes you through the integration mechanisms of Hive with some other popular frameworks, such as Spark, HBase, Accumulo, and Apache Drill.
To practice in parallel with reading the book, you need a machine or set of machines on which Hadoop is installed in either pseudo-distributed or clustered mode.
To have a better understanding of the metastore concept, you should have configured Hive with a local or remote metastore, using MySQL at the backend.
You also need a sample dataset to practice different windowing and analytical functions available in Hive and to optimize queries using concepts such as partitions and bucketing.
This book covers almost all the concepts of Hive. So, if you are a beginner in the big data Hadoop domain, you can start by installing Hive, understanding Hive services and clients, and using Hive data modeling concepts to design your data model. If you have basic knowledge of Hive, you can deep dive into some of the advanced concepts covered in the book, such as partitions, bucketing, file formats, security, and windowing and analytics.
In a nutshell, this book is helpful for both Hadoop developers and Hadoop analysts who want to explore Hive.
In this book, you will find several headings that appear frequently (Getting ready, How to do it, How it works, There's more, and See also).
To give clear instructions on how to complete a recipe, we use these sections as follows:
This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.
This section contains the steps required to follow the recipe.
This section usually consists of a detailed explanation of what happened in the previous section.
This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.
This section provides helpful links to other useful information for the recipe.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: By default, this location is set to metastore_db in the conf/hive-default.xml file.
A block of code is set as follows:
Any command-line input or output is written as follows:
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Create a Maven project in Eclipse by going to File | New | Project.
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/ApacheHiveCookbook_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
In this chapter, we will cover the following recipes:
Hive, an Apache Hadoop ecosystem component, was developed by Facebook to query the data stored in the Hadoop Distributed File System (HDFS). Here, HDFS is the data storage layer of Hadoop that, at a very high level, divides the data into small blocks (128 MB by default) and stores these blocks on different nodes.
Hive provides a SQL-like query model named Hive Query Language (HQL) to access and analyze big data. It is also termed the data warehousing framework of Hadoop and provides various analytical features, such as windowing and partitioning.
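To give a flavor of HQL, a simple aggregation reads just like standard SQL; the sales table and its columns here are hypothetical:

    -- Hypothetical table: sales(product, region, amount)
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region;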
Hive is supported by a wide variety of platforms. GNU/Linux and Windows are commonly used as the production environment, whereas Mac OS X is commonly used as the development environment.
In this book, we will assume a GNU/Linux-based installation of Apache Hive for installation and other instructions.
Before installing Hive, the first step is to make sure that a Java SE environment is installed properly. Hive requires Java version 6 or later, which can be downloaded from http://www.oracle.com/technetwork/java/javase/downloads/index.html.
To install Hive, just download it from http://hive.apache.org/downloads.html and unpack it. Choose the latest stable version.
At the time of writing this book, Hive 1.2.1 was the latest stable version available.
By default, Hive is configured to use an embedded Derby database, whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default, this location is set to metastore_db in the conf/hive-default.xml file. Hive with Derby as the metastore in embedded mode allows at most one user at a time.
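For reference, the stock Derby-backed default in Hive 1.2.1 looks like the following in the configuration file; this is the out-of-the-box value, not something you need to change:

    <!-- Embedded Derby metastore (Hive's default); the metastore_db
         directory is created in the current working directory -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
    </property>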
The other modes of installation are Hive with local metastore and Hive with remote metastore, which will be discussed later.
Apache Hive is a client-side library that provides a table-like abstraction on top of the data in HDFS for data processing. Hive jobs are converted into a MapReduce plan, which is then submitted to the Hadoop cluster. A Hadoop cluster is a set of nodes or machines with HDFS, MapReduce, and YARN deployed on them. MapReduce works on the distributed data stored in HDFS and processes large datasets in parallel, as compared with traditional processing engines that process a whole task on a single machine and take hours or days for a single query. Yet Another Resource Negotiator (YARN) is used to manage the RAM and CPU cores of the whole cluster, which are critical for running any process on a node.
The Hive table and database definitions, and their mapping to the data in HDFS, are stored in a metastore. A metastore is a central repository for Hive metadata. A metastore consists of two main components, which are really important for working with Hive. Let's take a look at these components:
In Hive, the metastore (service and RDBMS database) can be configured in one of the following ways:
When we install Hive on a preinstalled Hadoop cluster, Hive, by default, gets the embedded database. This means that we need not configure any database as the Hive metastore. Let's check out what these configurations are and why we call them the embedded and remote metastores.
By default, the metastore service and the Hive service run in the same JVM. Hive needs a database to store metadata. In the default mode, it uses an embedded Derby database stored on the local file system. This embedded mode of Hive has the limitation that only one session can be opened at a time from the same location on a machine, as only one embedded Derby database can acquire the lock and access the database files on disk.
An Embedded Metastore has a single service and a single JVM that cannot work with multiple nodes at a time.
To solve this limitation, a separate RDBMS database runs on the same node. The metastore service and the Hive service still run in the same JVM. This configuration mode is named local metastore. Here, local means that the database runs on the same node, in the same environment as the Hive service JVM.
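A minimal hive-site.xml sketch for a MySQL-backed local metastore could look like this; the database name, user name, and password are placeholders to adapt:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value> <!-- placeholder user -->
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value> <!-- placeholder password -->
    </property>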
There is one more configuration, where one or more metastore servers run in JVM processes separate from the Hive service and connect to a database on a remote machine. This configuration is named remote metastore.
The Hive service is configured to use a remote metastore by setting hive.metastore.uris to the metastore server URIs, separated by commas. The Hive metastore can be configured using the properties specified in the following sections.
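For example, a client-side hive-site.xml entry for a remote metastore might look like the following; the host names are placeholders, and 9083 is the metastore's default Thrift port:

    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://metastore-host1:9083,thrift://metastore-host2:9083</value>
    </property>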
The following diagram gives a pictorial representation of the metastore and the driver:
We will now take a look at installing Hive along with all the prerequisites.
Let's download the stable version from one of the mirrors:
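For example, version 1.2.1 can be fetched from the Apache archive; any other official mirror works equally well:

    wget http://archive.apache.org/dist/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz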
This can be achieved in three ways.
Once you have downloaded the Hive tarball file, installing and setting up Hive is pretty simple and straightforward. Extract the compressed tar:
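A typical extraction command, assuming the tarball name above, is:

    tar -xzvf apache-hive-1.2.1-bin.tar.gz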
Export the location where Hive is extracted as the environment variable HIVE_HOME:
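For example, assuming the archive was extracted under hduser's home directory (the path is a placeholder):

    export HIVE_HOME=/home/hduser/apache-hive-1.2.1-bin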
Hive has all its installation scripts in the $HIVE_HOME/bin directory. Export this location to the PATH environment variable so that you can run all the scripts from any location directly from the command line:
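That is:

    export PATH=$PATH:$HIVE_HOME/bin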
Alternatively, if you want to set the Hive path permanently for the user, add the Hive environment variables to the .bashrc or .bash_profile file available (or created, if missing) in the user's home folder:
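A sketch of the entries to append, assuming the same placeholder path as above:

    # Hive environment variables
    export HIVE_HOME=/home/hduser/apache-hive-1.2.1-bin
    export PATH=$PATH:$HIVE_HOME/bin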
Follow these steps to configure Hive with the local metastore. Here, we are using the MySQL database as a metastore:
Here, hduser is the user name, and apache-hive-1.2.1-bin is the Hive directory extracted from the tar file.
Install a SQL database such as MySQL on the same machine where you want to run Hive. On Ubuntu, MySQL can be installed by running the first of the two commands shown after this step in the node's terminal.

Note that there is a known JLine JAR conflict between Hadoop 2.6.0 and Hive 1.2.1. If you get the error "unable to load class jline.Terminal", you need to remove the older version of the JLine JAR from the YARN lib folder, using the second command shown after this step:
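Both commands, as a sketch; the exact JLine JAR version shipped under $HADOOP_HOME may differ, with jline-0.9.94.jar being the commonly bundled one:

    # Install the MySQL server on Ubuntu
    sudo apt-get install mysql-server

    # Remove the older JLine JAR from the YARN lib folder
    rm $HADOOP_HOME/share/hadoop/yarn/lib/jline-0.9.94.jar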
Follow these steps to configure Hive with a remote metastore.
Here, hduser is the user name, and apache-hive-1.2.1-bin is the Hive directory extracted from the tar file.
