Learning Elasticsearch

Abhishek Andhavarapu

Description

Store, search, and analyze your data with ease using Elasticsearch 5.x

About This Book

  • Get to grips with the basics of Elasticsearch concepts and its APIs, and use them to create efficient applications
  • Create large-scale Elasticsearch clusters and perform analytics using aggregation
  • This comprehensive guide will get you up and running with Elasticsearch 5.x in no time

Who This Book Is For

If you want to build efficient search and analytics applications using Elasticsearch, this book is for you. It will also benefit developers who have worked with Lucene or Solr before and now want to work with Elasticsearch. No previous knowledge of Elasticsearch is expected.

What You Will Learn

  • See how to set up and configure Elasticsearch and Kibana
  • Know how to ingest structured and unstructured data using Elasticsearch
  • Understand how a search engine works and the concepts of relevance and scoring
  • Find out how to query Elasticsearch with a high degree of performance and scalability
  • Improve the user experience by using autocomplete, geolocation queries, and much more
  • See how to slice and dice your data using Elasticsearch aggregations.
  • Grasp how to use Kibana to explore and visualize your data
  • Know how to host on Elastic Cloud and how to use the latest X-Pack features such as Graph and Alerting

In Detail

Elasticsearch is a modern, fast, distributed, scalable, fault-tolerant, and open source search and analytics engine. You can use Elasticsearch for small or large applications with billions of documents. It is built to scale horizontally and can handle both structured and unstructured data. Packed with easy-to-follow examples, this book will ensure you have a firm understanding of the basics of Elasticsearch and know how to utilize its capabilities efficiently.

You will install and set up Elasticsearch and Kibana, and handle documents using the Distributed Document Store. You will see how to query, search, and index your data, and perform aggregation-based analytics with ease. You will see how to use Kibana to explore and visualize your data.

Further on, you will learn to handle document relationships, work with geospatial data, and much more, with this easy-to-follow guide. Finally, you will see how you can set up and scale your Elasticsearch clusters in production environments.

Style and approach

This comprehensive guide will get you started with Elasticsearch 5.x, so you build a solid understanding of the basics. Every topic is explained in depth and is supplemented with practical examples to enhance your understanding.

Page count: 364

Publication year: 2017



Learning Elasticsearch
Distributed real-time search and analytics with Elasticsearch 5.x
Abhishek Andhavarapu

BIRMINGHAM - MUMBAI


Learning Elasticsearch

 

Copyright © 2017 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: June 2017

Production reference: 1290617

 

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78712-845-3

www.packtpub.com

Credits

Author

Abhishek Andhavarapu

Copy Editor

Manisha Sinha

Reviewers

Dan Noble

Marcelo Ochoa

Project Coordinator

Manthan Patel

Commissioning Editor

Amey Varangaonkar

Proofreader

Safis Editing

Acquisition Editor

Varsha Shetty

Indexer

Tejal Daruwale Soni

Content Development Editor

Jagruti Babaria

Graphics

Tania Dutta

Technical Editor

Danish Shaikh

Production Coordinator

Deepika Naik

 

About the Author

Abhishek Andhavarapu is a software engineer at eBay who enjoys working on highly scalable distributed systems. He has a master's degree in Distributed Computing and has worked on multiple enterprise Elasticsearch applications, which are currently serving hundreds of millions of requests per day. He began his journey with Elasticsearch in 2012 to build an analytics engine to power dashboards and quickly realized that Elasticsearch is like nothing else out there for search and analytics. He has been a strong advocate ever since and wrote this book to share the practical knowledge he gained along the way.

 

Writing a book is a humongous task, I want to thank my wife Ashwini for her patience and support during the nights and weekends I spent writing this book. I am thankful to my parents Govinda Rajulu, Jaya Lakshmi, my brother Sarat Kiran and my in-laws Satya Rao and Suguna for the constant motivation and encouragement throughout the writing of this book. I'm grateful to all my friends and colleagues, whom I couldn't mention by name, for their valuable feedback and inputs. I also would like to thank my publisher and editors at Packt for the continuous support.

About the Reviewers

Dan Noble is a software engineer with a passion for writing secure, clean, and articulate code. He enjoys working with a variety of programming languages and software frameworks, particularly Python, Elasticsearch, and various JavaScript frontend technologies. Dan currently works on geospatial web applications and data processing systems. Dan has been a user and advocate of Elasticsearch since 2011. He has given several talks about Elasticsearch, is the author of the book Monitoring Elasticsearch, and was a technical reviewer for the book The Elasticsearch Cookbook, Second Edition, by Alberto Paro. Dan is also the author of the Python Elasticsearch client rawes.

Marcelo Ochoa works at the system laboratory of Facultad de Ciencias Exactas of the Universidad Nacional del Centro de la Provincia de Buenos Aires and is the CTO at Scotas.com, a company that specializes in near real-time search solutions using Apache Solr and Oracle. He divides his time between university jobs and external projects related to Oracle and big data technologies. He has worked on several Oracle-related projects, such as the translation of Oracle manuals and multimedia CBTs. His background is in database, network, web, and Java technologies. In the XML world, he is known as the developer of the DB Generator for the Apache Cocoon project. He has worked on the open source projects DBPrism and DBPrism CMS, the Lucene-Oracle integration using the Oracle JVM Directory implementation, and the Restlet.org project, where he worked on the Oracle XDB Restlet Adapter, which is an alternative to writing native REST web services inside a database resident JVM.

Since 2006, he has been part of an Oracle ACE program and recently incorporated into a Docker Mentor program.

He has coauthored Oracle Database Programming using Java and Web Services by Digital Press and Professional XML Databases by Wrox Press and been a technical reviewer for several PacktPub books, such as Mastering Elastic Stack, Mastering Elasticsearch 5.x - Third Edition, Elasticsearch 5.x Cookbook - Third Edition, and others.

 

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787128458.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

Introduction to Elasticsearch

Basic concepts of Elasticsearch

Document

Index

Type

Cluster and node

Shard

Interacting with Elasticsearch

Creating a document

Retrieving an existing document

Updating an existing document

Updating a partial document

Deleting an existing document

How does search work?

Importance of information retrieval

Simple search query

Inverted index

Stemming

Synonyms

Phrase search

Apache Lucene

Scalability and availability

Relation between node, index, and shard

Three shards with zero replicas

Six shards with zero replicas

Six shards with one replica

Distributed search

Failure handling

Strengths and limitations of Elasticsearch

Summary

Setting Up Elasticsearch and Kibana

Installing Elasticsearch

Installing Java

Windows

Starting and stopping Elasticsearch

Mac OS X

Starting and stopping Elasticsearch

DEB and RPM packages

Debian package

RPM package

Starting and stopping Elasticsearch

Sample configuration files

Verifying Elasticsearch is running

Installing Kibana

Mac OS X

Starting and stopping Kibana

Windows

Starting and stopping Kibana

Query format used in this book (Kibana Console)

Using cURL or Postman

Health of the cluster

Summary

Modeling Your Data and Document Relations

Mapping

Dynamic mapping

Create index with mapping

Adding a new type/field

Getting the existing mapping

Mapping conflicts

Data type

Metafields

How to handle null values

Storing the original document

Searching all the fields in the document

Difference between full-text search and exact match

Core data types

Text

Keyword

Date

Numeric

Boolean

Binary

Complex data types

Array

Object

Nested

Geo data type

Geo-point data type

Specialized data type

IP

Mapping the same field with different mappings

Handling relations between different document types

Parent-child document relation

How are parent-child documents stored internally?

Nested

Routing

Summary

Indexing and Updating Your Data

Indexing your data

Indexing errors

Node/shards errors

Serialization/mapping errors

Thread pool rejection error

Managing an index

What happens when you index a document?

Updating your data

Update using an entire document

Partial updates

Scripted updates

Upsert

NOOP

What happens when you update a document?

Merging segments

Using Kibana to discover

Using Elasticsearch in your application

Java

Transport client

Dependencies

Initializing the client

Sniffing

Node client

REST client

Third party clients

Indexing using Java client

Concurrency

Translog

Async versus sync

CRUD from translog

Primary and Replica shards

Primary preference

More replicas for query throughput

Increasing/decreasing the number of replicas

Summary

Organizing Your Data and Bulk Data Ingestion

Bulk operations

Bulk API

Multi Get API

Update by query

Delete by query

Reindex API

Change mappings/settings

Combining documents from one or more indices

Copying only missing documents

Copying a subset of documents into a new index

Copying top N documents

Copying the subset of fields into new index

Ingest Node

Organizing your data

Index alias

Index templates

Managing time-based indices

Shrink API

Summary

All About Search

Different types of queries

Sample data

Querying Elasticsearch

Basic query (finding the exact value)

Pagination

Sorting based on existing fields

Selecting the fields in the response

Querying based on range

Handling dates

Analyzed versus non-analyzed fields

Term versus Match query

Match phrase query

Prefix and match phrase prefix query

Wildcard and Regular expression query

Exists and missing queries

Using more than one query

Routing

Debugging search query

Relevance

Queries versus Filters

How to boost relevance based on a single field

How to boost score based on queries

How to boost relevance using decay functions

Rescoring

Debugging relevance score

Searching for same value across multiple fields

Best matching fields

Most matching fields

Cross-matching fields

Caching

Node Query cache

Shard request cache

Summary

More Than a Search Engine (Geofilters, Autocomplete, and More)

Sample data

Correcting typos and spelling mistakes

Fuzzy query

Making suggestions based on the user input

Implementing "did you mean" feature

Term suggester

Phrase suggester

Implementing the autocomplete feature

Highlighting

Handling document relations using parent-child

The has_parent query

The has_child query

Inner hits for parent-child

How parent-child works internally

Handling document relations using nested

Inner hits for nested documents

Scripting

Script Query

Post Filter

Reverse search using the percolate query

Geo and Spatial Filtering

Geo Distance

Using Geolocation to rank the search results

Geo Bounding Box

Sorting

Multi search

Search templates

Querying Elasticsearch from Java application

Summary

How to Slice and Dice Your Data Using Aggregations

Aggregation basics

Sample data

Query structure

Multilevel aggregations

Types of aggregations

Terms aggregations (group by)

Size and error

Order

Minimum document count

Missing values

Aggregations based on filters

Aggregations on dates ( range, histogram )

Aggregations on numeric values (range, histogram)

Aggregations on geolocation (distance, bounds)

Geo distance

Geo bounds

Aggregations on child documents

Aggregations on nested documents

Reverse nested aggregation

Post filter

Using Kibana to visualize aggregations

Caching

Doc values

Field data

Summary

Production and Beyond

Configuring Elasticsearch

The directory structure

zip/tar.gz

DEB/RPM

Configuration file

Cluster and node name

Network configuration

Memory configuration

Configuring file descriptors

Types of nodes

Multinode cluster

Inspecting the logs

How nodes discover each other

Node failures

X-Pack

Windows

Mac OS X

Debian/RPM

Authentication

X-Pack basic license

Monitoring

Monitoring Elasticsearch clusters

Monitoring indices

Monitoring nodes

Thread pools

Elasticsearch server logs

Slow logs

Summary

Exploring Elastic Stack (Elastic Cloud, Security, Graph, and Alerting)

Elastic Cloud

High availability

Data reliability

Security

Authentication and roles

Securing communications using SSL

Graph

Graph UI

Alerting

Summary

Preface

Welcome to Learning Elasticsearch. We will start by describing the basic concepts of Elasticsearch. You will see how to install Elasticsearch and Kibana and learn how to index and update your data. We will use an e-commerce site as an example to explain how a search engine works and how to query your data. The real power of Elasticsearch is aggregations. You will see how to perform aggregation-based analytics with ease. You will also see how to use Kibana to explore and visualize your data. Finally, we will discuss how to use Graph to discover relations in your data and use alerting to set up alerts and notifications for different trends in your data.

To better explain various concepts, lots of examples have been used throughout the book. Detailed instructions on how to install Elasticsearch and Kibana and how to execute the examples are included in Chapter 2, Setting Up Elasticsearch and Kibana.

What this book covers

Chapter 1, Introduction to Elasticsearch, describes the building blocks of Elasticsearch and what makes Elasticsearch scalable and distributed. In this chapter, we also discuss the strengths and limitations of Elasticsearch.

Chapter 2, Setting Up Elasticsearch and Kibana, covers the installation of Elasticsearch and Kibana.

Chapter 3, Modeling Your Data and Document Relations, focuses on modeling your data. To support text search, Elasticsearch preprocesses the data before indexing it. This chapter describes why preprocessing is necessary and the various analyzers Elasticsearch supports. In addition, we discuss how to handle relationships between different document types.

Chapter 4, Indexing and Updating Your Data, covers how to index and update your data and what happens internally when you do. Data indexed in Elasticsearch is only available for search after a small delay; we discuss the reason for this delay and how to control it.

Chapter 5, Organizing Your Data and Bulk Data Ingestion, describes how to organize and manage indices in Elasticsearch using aliases, templates, and more. This chapter also covers the various bulk APIs Elasticsearch supports and how to rebuild your existing indices using the Reindex and Shrink APIs.

Chapter 6, All About Search, covers how to search, sort, and paginate your data. The concept of relevance is introduced, and we discuss how to tune the relevance score to get the most relevant search results at the top.

Chapter 7, More Than a Search Engine (Geofilters, Autocomplete and More), covers how to filter based on geolocation, use Elasticsearch suggesters for autocomplete, correct user typos, and a lot more.

Chapter 8, How to Slice and Dice Your Data Using Aggregations, covers different kinds of aggregations Elasticsearch supports and how to visualize the data using Kibana.

Chapter 9, Production and Beyond, covers important settings to configure and monitor in production.

Chapter 10, Exploring Elastic Stack (Elastic Cloud, Security, Graph, and Alerting), covers Elastic Cloud, a managed cloud-hosting service, and other products that are part of X-Pack.

What you need for this book

The book was written using Elasticsearch 5.1.2, and all the examples used in the book should work with it. The request format used in this book is based on the Kibana Console, and you'll need the Kibana Console or the Sense Chrome plugin to execute the examples. Please refer to the Query format used in this book section of Chapter 2, Setting Up Elasticsearch and Kibana, for more details. If using Kibana or Sense is not an option, you can use other HTTP clients, such as cURL or Postman; the request format is slightly different and is explained in the Using cURL or Postman section of Chapter 2, Setting Up Elasticsearch and Kibana.

Who this book is for

This book is for software developers who are planning to build a search and analytics engine or are trying to learn Elasticsearch.

Some familiarity with web technologies (JavaScript, REST, JSON) would be helpful.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, path names, dummy URLs, user input, and Twitter handles are shown as follows: "We can include other contexts through the use of the include directive."

A block of code is set as follows:

{ "articleid": 1, "name": "Introduction to Elasticsearch"}

 

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

{ "articleid": 1,

"name": "Introduction to Elasticsearch"

}

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Clicking the Next button moves you to the next screen."

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  1. Log in or register to our website using your e-mail address and password.
  2. Hover the mouse pointer on the SUPPORT tab at the top.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box.
  5. Select the book for which you're looking to download the code files.
  6. Choose from the drop-down menu where you purchased this book from.
  7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Elasticsearch. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Introduction to Elasticsearch

In this chapter, we will focus on the basic concepts of Elasticsearch. We will start by explaining the building blocks and then discuss how to create, modify and query in Elasticsearch. Getting started with Elasticsearch is very easy; most operations come with default settings. The default settings can be overridden when you need more advanced features.

I first started using Elasticsearch in 2012 as a backend search engine to power our Analytics dashboards. It has been more than five years, and I never looked for any other technologies for our search needs. Elasticsearch is much more than just a search engine; it supports complex aggregations, geo filters, and the list goes on. Best of all, you can run all your queries at a speed you have never seen before. To understand how this magic happens, we will briefly discuss how Elasticsearch works internally and then discuss how to talk to Elasticsearch. Knowing how it works internally will help you understand its strengths and limitations. Elasticsearch, like any other open source technology, is very rapidly evolving, but the core fundamentals that power Elasticsearch don't change. By the end of this chapter, we will have covered the following:

Basic concepts of Elasticsearch

How to interact with Elasticsearch

How to create, read, update, and delete

How does search work

Availability and horizontal scalability

Failure handling

Strengths and limitations

Basic concepts of Elasticsearch

Elasticsearch is a highly scalable open source search engine. Although it started as a text search engine, it is evolving into an analytical engine that can support not only search but also complex aggregations. Its distributed nature and ease of use make it very easy to get started with and to scale as you have more data.

One might ask what makes Elasticsearch different from any other document store out there. Elasticsearch is a search engine, not just a key-value store. It's also a very powerful analytical engine; all the queries that you would usually run in a batch or offline mode can be executed in real time. Support for features such as autocomplete, geolocation-based filters, and multilevel aggregations, coupled with its user friendliness, has resulted in industry-wide acceptance. That being said, I always believe it is important to have the right tool for the right job. Towards the end of the chapter, we will discuss its strengths and limitations.

In this section, we will go through the basic concepts and terminology of Elasticsearch. We will start by explaining how to insert, update, and perform a search. If you are familiar with the SQL language, the following table shows the equivalent terms in Elasticsearch:

SQL      | Elasticsearch
---------|--------------
Database | Index
Table    | Type
Row      | Document
Column   | Field

Document

Your data in Elasticsearch is stored as JSON (JavaScript Object Notation) documents. Most NoSQL data stores use JSON, as the JSON format is very concise, flexible, and readily understood by humans. A document in Elasticsearch is very similar to a row in a relational database. Let's say we have a User table with the following information:

id | name | age | gender | email
---|------|-----|--------|---------------------
1  | Luke | 100 | M      | [email protected]
2  | Leia | 100 | F      | [email protected]

The users in the preceding table, when represented in JSON format, will look like the following:

{ "id": 1, "name": "Luke", "age": 100, "gender": "M", "email": "[email protected]" }

{ "id": 2, "name": "Leia", "age": 100, "gender": "F", "email": "[email protected]" }

A row contains columns; similarly, a document contains fields. Elasticsearch documents are very flexible and support storing nested objects. For example, an existing user document can be easily extended to include the address information. To capture similar information using a table structure, you need to create a new address table and manage the relations using a foreign key. The user document with the address is shown here:

{ "id": 1, "name": "Luke", "age": 100, "gender": "M", "email": "[email protected]",

"address": {

"street": "123 High Lane",

"city": "Big City",

"state": "Small State",

"zip": 12345

}

}

Reading similar information without the JSON structure would also be difficult, as the information would have to be read from multiple tables. Elasticsearch allows you to store the entire JSON as it is. For a database table, the schema has to be defined before you can use the table. Elasticsearch is built to handle unstructured data and can automatically determine the data types for the fields in the document. You can index new documents or add new fields without adding or changing the schema. This process is also known as dynamic mapping. We will discuss how this works and how to define a schema in Chapter 3, Modeling Your Data and Document Relations.
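The type inference behind dynamic mapping can be sketched in a few lines of Python. This is a simplified illustration, not Elasticsearch's actual implementation: real dynamic mapping also detects dates, maps JSON strings to text with a keyword sub-field, and handles arrays and geo types.

```python
import json

def infer_field_types(document):
    """Infer an Elasticsearch-style type for each field of a JSON document.

    Simplified sketch: nested objects become "object" mappings with their
    own properties, mirroring how dynamic mapping walks the document.
    """
    mapping = {}
    for field, value in document.items():
        if isinstance(value, bool):          # check bool before int:
            mapping[field] = "boolean"       # bool is a subclass of int
        elif isinstance(value, int):
            mapping[field] = "long"
        elif isinstance(value, float):
            mapping[field] = "float"
        elif isinstance(value, dict):
            mapping[field] = {"properties": infer_field_types(value)}
        else:
            mapping[field] = "text"
    return mapping

user = json.loads('''{ "id": 1, "name": "Luke", "age": 100,
                       "address": { "city": "Big City", "zip": 12345 } }''')
print(infer_field_types(user))
```

Running this on the user document above yields a nested mapping: numeric fields become long, strings become text, and the address object gets its own properties block.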

Index

An index is similar to a database. The term index should not be confused with a database index, as someone familiar with traditional SQL might assume. Your data is stored in one or more indexes, just as you would store it in one or more databases. The word indexing means inserting or updating documents in an Elasticsearch index. The name of an index must be unique and all lowercase. For example, in an e-commerce world, you would have an index for items, one for orders, one for customer information, and so on.
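The lowercase requirement is one of several restrictions on index names. The following Python sketch checks a few commonly documented ones; it is illustrative and not exhaustive (for example, the length limit on names is omitted).

```python
def is_valid_index_name(name):
    """Rough check of some Elasticsearch index-name rules:
    lowercase only, no reserved characters, must not start with
    -, _, or +, and must not be "." or "..". Not exhaustive."""
    forbidden = set('\\/*?"<>| ,#')
    if not name or name != name.lower():
        return False
    if name[0] in '-_+' or name in ('.', '..'):
        return False
    return not any(ch in forbidden for ch in name)

print(is_valid_index_name("orders"))   # True: lowercase, no reserved chars
print(is_valid_index_name("Orders"))   # False: contains an uppercase letter
```

Validating names up front like this gives a clearer error than letting the server reject the request at index-creation time.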

Type

A type is similar to a database table; an index can have one or more types. A type is a logical separation of different kinds of data. For example, if you are building a blog application, you would have a type defined for articles in the blog and a type defined for comments in the blog. Let's say we have two types--articles and comments.

The following is the document that belongs to the article type:

{ "articleid": 1, "name": "Introduction to Elasticsearch" }

The following is the document that belongs to the comment type:

{ "commentid": "AVmKvtPwWuEuqke_aRsm", "articleid": 1, "comment": "Its Awesome !!" }

We can also define relations between different types. For example, a parent/child relation can be defined between articles and comments. An article (parent) can have one or more comments (children). We will discuss relations further in Chapter 3, Modeling Your Data and Document Relations.

Cluster and node

In a traditional database system, we usually have only one server serving all the requests. Elasticsearch is a distributed system, meaning it is made up of one or more nodes (servers) that act as a single application, which enables it to scale and handle load beyond what a single server can handle. Each node (server) has part of the data. You can start running Elasticsearch with just one node and add more nodes, or, in other words, scale the cluster when you have more data. A cluster with three nodes is shown in the following diagram:

In the preceding diagram, the cluster has three nodes named elasticsearch1, elasticsearch2, and elasticsearch3. These three nodes work together to handle all the indexing and query requests on the data. Each cluster is identified by a unique name, which defaults to elasticsearch. It is common to have multiple clusters, one for each environment, such as staging, pre-production, and production.

Just like a cluster, each node is identified by a unique name. Elasticsearch will automatically assign a unique name to each node if the name is not specified in the configuration. Depending on your application needs, you can add and remove nodes (servers) on the fly. Adding and removing nodes is seamlessly handled by Elasticsearch.

We will discuss how to set up an Elasticsearch cluster in Chapter 2, Setting Up Elasticsearch and Kibana.

Shard

An index is a collection of one or more shards. All the data that belongs to an index is distributed across multiple shards. By spreading the data that belongs to an index to multiple shards, Elasticsearch can store information beyond what a single server can store. Elasticsearch uses Apache Lucene internally to index and query the data. A shard is nothing but an Apache Lucene instance. We will discuss Apache Lucene and why Elasticsearch uses Lucene in the How search works section later.
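As a toy illustration of how data ends up spread across shards, Elasticsearch routes each document to a shard based on a hash of its identifier (by default, a murmur3 hash modulo the number of primary shards). The sketch below imitates that idea with a CRC32 checksum standing in for murmur3; `route_to_shard` is a hypothetical helper, not an Elasticsearch API:

```python
import zlib

def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    # A stable checksum stands in for Elasticsearch's murmur3 hash;
    # the modulo picks one of the primary shards.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

# The same document id always lands on the same shard, which is why
# the number of primary shards cannot change after index creation
# without re-indexing the data.
assert route_to_shard("1", 5) == route_to_shard("1", 5)
assert 0 <= route_to_shard("1", 5) < 5
```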

I know we introduced a lot of new terms in this section. For now, just remember that all data that belongs to an index is spread across one or more shards. We will discuss how shards work in the Scalability and Availability section towards the end of this chapter.

Interacting with Elasticsearch

The primary way of interacting with Elasticsearch is via the REST API. Elasticsearch provides a JSON-based REST API over HTTP. By default, the Elasticsearch REST API runs on port 9200. Anything from creating an index to shutting down a node is a simple REST call. The APIs are broadly classified into the following:

Document APIs: CRUD (Create, Retrieve, Update, Delete) operations on documents

Search APIs: For all the search operations

Indices APIs: For managing indices (creating an index, deleting an index, and so on)

Cat APIs: Instead of JSON, the data is returned in tabular form

Cluster APIs: For managing the cluster

We have a chapter dedicated to each of them, where they are discussed in more detail; for example, indexing documents in Chapter 4, Indexing and Updating Your Data, and search in Chapter 6, All About Search. In this section, we will go through some basic CRUD operations using the Document APIs. This section is simply a brief introduction to manipulating data using the Document APIs. To use Elasticsearch in your application, clients for all major languages, such as Java and Python, are also provided. The majority of the clients act as wrappers around the REST API.

To better explain the CRUD operations, imagine we are building an e-commerce site and want to use Elasticsearch to power its search functionality. We will use an index named chapter1 and store all the products in a type called product. Each product we want to index is represented by a JSON document. We will start by creating a new product document; then we will retrieve a product by its identifier, update a product's category, and finally delete a product using its identifier.

Creating a document

A new document can be added using the Document APIs. For the e-commerce example, to add a new product, we execute the following command. The body of the request is the product document we want to index.

PUT http://localhost:9200/chapter1/product/1
{
  "title": "Learning Elasticsearch",
  "author": "Abhishek Andhavarapu",
  "category": "books"
}

Let's inspect the request:

Index: chapter1
Type: product
Identifier: 1
Document: JSON
HTTP method: PUT

The document's properties, such as title, author, and category, are also known as fields, which are similar to SQL columns.

Elasticsearch will automatically create the index chapter1 and type product if they don't exist already. It will create the index with the default settings.

When we execute the preceding request, Elasticsearch responds with a JSON response, shown as follows: 

{
  "_index": "chapter1",
  "_type": "product",
  "_id": "1",
  "_version": 1,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

In the response, you can see that Elasticsearch created the document and the version of the document is 1. Since you are creating the document using the HTTP PUT method, you are required to specify the document identifier. If you don’t specify the identifier, Elasticsearch will respond with the following error message:

No handler found for uri [/chapter1/product/] and method [PUT]

If you don’t have a unique identifier, you can let Elasticsearch assign an identifier for you, but you should use the POST HTTP method. For example, if you are indexing log messages, you will not have a unique identifier for each log message, and you can let Elasticsearch assign the identifier for you. 

In general, we use the HTTP POST method for creating an object. The HTTP PUT method can also be used for object creation, where the client provides the unique identifier instead of the server assigning the identifier.
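The PUT-with-client-supplied-id versus POST-with-server-generated-id distinction can be sketched with a toy in-memory store. `TinyIndex` is purely illustrative; it is not how Elasticsearch is implemented:

```python
import uuid

class TinyIndex:
    """Toy in-memory stand-in for an Elasticsearch index (illustration only)."""

    def __init__(self):
        self.docs = {}

    def put(self, doc_id, doc):
        # PUT: the client supplies the identifier; re-using an
        # existing identifier replaces the stored document.
        created = doc_id not in self.docs
        self.docs[doc_id] = doc
        return {"_id": doc_id, "created": created}

    def post(self, doc):
        # POST: the server generates a unique identifier.
        doc_id = uuid.uuid4().hex
        self.docs[doc_id] = doc
        return {"_id": doc_id, "created": True}

index = TinyIndex()
print(index.put("1", {"title": "Learning Elasticsearch"}))  # created: True
print(index.put("1", {"title": "Learning Elasticsearch"}))  # created: False
print(index.post({"title": "Learning Elasticsearch"}))      # server-assigned _id
```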

We can index a document without specifying a unique identifier as shown here:

POST http://localhost:9200/chapter1/product/
{
  "title": "Learning Elasticsearch",
  "author": "Abhishek Andhavarapu",
  "category": "books"
}

In the preceding request, the URL doesn't contain the unique identifier, and we are using the HTTP POST method. Let's inspect the request:

Index: chapter1
Type: product
Document: JSON
HTTP method: POST

The response from Elasticsearch is shown as follows:

{
  "_index": "chapter1",
  "_type": "product",
  "_id": "AVmKvtPwWuEuqke_aRsm",
  "_version": 1,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

You can see from the response that Elasticsearch assigned the unique identifier AVmKvtPwWuEuqke_aRsm to the document and that the created flag is set to true. If a document with the same unique identifier already exists, Elasticsearch replaces the existing document and increments the document version. If you were to run the same PUT request from the beginning of the section again, the response from Elasticsearch would be this:

{
  "_index": "chapter1",
  "_type": "product",
  "_id": "1",
  "_version": 2,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "created": false
}

In the response, you can see that the created flag is false since the document with id: 1 already exists. Also, observe that the version is now 2.

Retrieving an existing document

To retrieve an existing document, we need the index, type, and unique identifier of the document. Let's try to retrieve the document we just indexed. To retrieve a document, we use the HTTP GET method, as shown here:

GET http://localhost:9200/chapter1/product/1

Let’s inspect the request:

Index: chapter1
Type: product
Identifier: 1
HTTP method: GET

The response from Elasticsearch, shown as follows, contains the product document we indexed in the previous section:

{
  "_index": "chapter1",
  "_type": "product",
  "_id": "1",
  "_version": 2,
  "found": true,
  "_source": {
    "title": "Learning Elasticsearch",
    "author": "Abhishek Andhavarapu",
    "category": "books"
  }
}

The actual JSON document is stored in the _source field. Also note the version in the response; every time the document is updated, the version is incremented.

Updating an existing document

Updating a document in Elasticsearch is more complicated than in a traditional SQL database. Internally, Elasticsearch retrieves the old document, applies the changes, and re-inserts the document as a new document. The update operation is very expensive. There are different ways of updating a document. We will talk about updating a partial document here and in more detail in the Updating your data section in Chapter 4, Indexing and Updating Your Data.
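The retrieve-modify-reinsert behaviour can be sketched with a toy versioned store. `VersionedStore` is a hypothetical illustration of the semantics, not Elasticsearch's actual implementation:

```python
class VersionedStore:
    """Toy illustration of update-as-reinsert with version counting."""

    def __init__(self):
        self.docs = {}      # id -> current document
        self.versions = {}  # id -> version number

    def index(self, doc_id, doc):
        # Every (re-)insert bumps the document version.
        self.docs[doc_id] = doc
        self.versions[doc_id] = self.versions.get(doc_id, 0) + 1
        return self.versions[doc_id]

    def update(self, doc_id, partial):
        # Retrieve the old document, apply the partial changes,
        # and re-insert the merged result as a new version.
        merged = {**self.docs[doc_id], **partial}
        return self.index(doc_id, merged)

store = VersionedStore()
store.index("1", {"title": "Learning Elasticsearch", "category": "books"})
version = store.update("1", {"category": "technical books"})
print(version)                      # 2
print(store.docs["1"]["category"])  # technical books
```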

Updating a partial document

We already indexed the document with the unique identifier 1, and now we need to update the category of the product from just books to technical books. We can update the document as shown here:

POST http://localhost:9200/chapter1/product/1/_update
{
  "doc": {
    "category": "technical books"
  }
}

The body of the request contains the fields of the document we want to update, and the unique identifier is passed in the URL.

Please note the _update endpoint at the end of the URL.

The response from Elasticsearch is shown here:

{
  "_index": "chapter1",
  "_type": "product",
  "_id": "1",
  "_version": 3,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  }
}

As you can see in the response, the operation is successful, and the version of the document is now 3. More complicated update operations are possible using scripts and upserts.

Deleting an existing document

For creating and retrieving documents, we used the PUT, POST, and GET methods. For deleting an existing document, we need to use the HTTP DELETE method and pass the unique identifier of the document in the URL, as shown here:

DELETE http://localhost:9200/chapter1/product/1

Let's inspect the request:

Index: chapter1
Type: product
Identifier: 1
HTTP method: DELETE

The response from Elasticsearch is shown here:

{
  "found": true,
  "_index": "chapter1",
  "_type": "product",
  "_id": "1",
  "_version": 4,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  }
}

In the response, you can see that Elasticsearch was able to find the document with the unique identifier 1 and was successful in deleting the document. 

How does search work?

In the previous section, we discussed how to create, update, and delete documents. In this section, we will briefly discuss how search works internally and explain the basic query APIs; in particular, I want to talk about the inverted index and Apache Lucene. All the data in Elasticsearch is internally stored in Apache Lucene as an inverted index. Although the data is stored in Apache Lucene, Elasticsearch is what makes it distributed and provides easy-to-use APIs. We will discuss the Search API in detail in Chapter 6, All About Search.

Importance of information retrieval

As computational power increases and the cost of storage decreases, the amount of data we deal with day to day grows exponentially. But without a way to retrieve the information and query it, the data we collect doesn't help us.

Information retrieval systems are very important to make sense of the data. Imagine how hard it would be to find some information on the Internet without Google or other search engines out there. Information is not knowledge without information retrieval systems.

Simple search query

Let's say we have a User table as shown here:

Now, we want to query for all the users with the name Luke. A SQL query to achieve this would be something like this:

select * from user where name like '%luke%'

To do a similar task in Elasticsearch, you can use the search API and execute the following command:

GET http://127.0.0.1:9200/chapter1/user/_search?q=name:luke

Let's inspect the request:

Index: chapter1
Type: user
Field: name

Just like you would get all the rows in the User table as a result of the SQL query, the response to the Elasticsearch query would be JSON documents:

{
  "id": 1,
  "name": "Luke",
  "age": 100,
  "gender": "M",
  "email": "[email protected]"
}

Querying using URL parameters works for simple queries, as shown above. For more involved queries, you should pass the query as JSON in the request body. The same query passed in the request body is shown here:

POST http://127.0.0.1:9200/chapter1/user/_search
{
  "query": {
    "term": {
      "name": "luke"
    }
  }
}

The Search API is very flexible and supports different kinds of filters, sorting, pagination, and aggregations.

Inverted index

Before we talk more about search, I want to talk about the inverted index. Knowing about the inverted index will help you understand the limitations and strengths of Elasticsearch compared with traditional database systems. The inverted index is, at its core, what makes Elasticsearch different from other NoSQL stores, such as MongoDB, Cassandra, and so on.

We can compare an inverted index to an old library card catalog. When you need a book in a library, you use the card catalog, usually at the entrance of the library, to find it. An inverted index is similar to a card catalog. Imagine you were to build a system like Google to search for web pages mentioning your search keywords. We have three web pages with Yoda quotes from Star Wars, and you are searching for all the documents containing the word fear.

Document1: Fear leads to anger

Document2: Anger leads to hate

Document3: Hate leads to suffering

In a library, without a card catalog to find the book you need, you would have to go to every shelf row by row, look at each book title, and see whether it's the book you need. Computer-based information retrieval systems do the same.

Without an inverted index, the application has to go through each web page and check whether the word exists in it. An inverted index is similar to the following table. It is like a map, with the term as the key and the list of documents the term appears in as the value.

Term       Documents
Fear       1
Anger      1, 2
Hate       2, 3
Suffering  3
Leads      1, 2, 3
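The inverted index above can be built with a few lines of Python. This sketch lower-cases and whitespace-splits the text, a drastic simplification of what a real analyzer does:

```python
documents = {
    1: "Fear leads to anger",
    2: "Anger leads to hate",
    3: "Hate leads to suffering",
}

inverted_index = {}
for doc_id, text in documents.items():
    # Naive analysis: lower-case and split on whitespace.
    for term in text.lower().split():
        inverted_index.setdefault(term, []).append(doc_id)

# Finding every document containing a term is now a single lookup.
print(inverted_index["fear"])   # [1]
print(inverted_index["leads"])  # [1, 2, 3]
```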

Once we construct an index, as shown in this table, finding all the documents with the term fear is just a lookup. Just as a library adds a new book to the card catalog when it arrives, we keep building the inverted index as we encounter new web pages. The preceding inverted index takes care of simple use cases, such as searching for a single term. But in reality, we query for much more complicated things, and we don't use the exact words. Now let's say we encountered a document containing the following:

Yosemite national park may be closed for the weekend due to forecast of substantial rainfall

We want to visit Yosemite National Park, and we are looking for the weather forecast in the park. But when we query in human language, we might type something like weather in yosemite or rain in yosemite. With the current approach, we will not be able to answer this query, as there are no common terms between the query and the document, as shown:

Document   Query
rainfall   rain

To be able to answer queries like this and to improve the search quality, we employ various techniques, such as stemming and synonyms, discussed in the following sections.
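A minimal sketch of both ideas, assuming a hand-written suffix stripper and a tiny synonym map (real analyzers use algorithms such as the Porter stemmer and configurable synonym filters):

```python
# Hypothetical synonym map bridging different words with related meanings.
SYNONYMS = {"rainfall": "rain"}

def naive_stem(term: str) -> str:
    # Crude suffix stripping, for illustration only.
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def normalize(term: str) -> str:
    # Lower-case, stem, then map through the synonym table, so that
    # the query "rain" matches a document containing "rainfall".
    term = naive_stem(term.lower())
    return SYNONYMS.get(term, term)

print(normalize("rainfall"))  # rain
print(normalize("raining"))   # rain
print(normalize("rain"))      # rain
```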

Stemming