Store, search, and analyze your data with ease using Elasticsearch 5.x
If you want to build efficient search and analytics applications using Elasticsearch, this book is for you. It will also benefit developers who have worked with Lucene or Solr before and now want to work with Elasticsearch. No previous knowledge of Elasticsearch is expected.
Elasticsearch is a modern, fast, distributed, scalable, fault-tolerant, and open source search and analytics engine. You can use Elasticsearch for small or large applications with billions of documents. It is built to scale horizontally and can handle both structured and unstructured data. Packed with easy-to-follow examples, this book will ensure you have a firm understanding of the basics of Elasticsearch and know how to utilize its capabilities efficiently.
You will install and set up Elasticsearch and Kibana, and handle documents using the Distributed Document Store. You will see how to query, search, and index your data, and perform aggregation-based analytics with ease. You will see how to use Kibana to explore and visualize your data.
Further on, you will learn to handle document relationships, work with geospatial data, and much more, with this easy-to-follow guide. Finally, you will see how you can set up and scale your Elasticsearch clusters in production environments.
This comprehensive guide will get you started with Elasticsearch 5.x, so you can build a solid understanding of the basics. Every topic is explained in depth and is supplemented with practical examples to enhance your understanding.
Page count: 364
Publication year: 2017
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: June 2017
Production reference: 1290617
ISBN 978-1-78712-845-3
www.packtpub.com
Authors
Abhishek Andhavarapu
Copy Editor
Manisha Sinha
Reviewers
Dan Noble
Marcelo Ochoa
Project Coordinator
Manthan Patel
Commissioning Editor
Amey Varangaonkar
Proofreader
Safis Editing
Acquisition Editor
Varsha Shetty
Indexer
Tejal Daruwale Soni
Content Development Editor
Jagruti Babaria
Graphics
Tania Dutta
Technical Editor
Danish Shaikh
Production Coordinator
Deepika Naik
Abhishek Andhavarapu is a software engineer at eBay who enjoys working on highly scalable distributed systems. He has a master's degree in Distributed Computing and has worked on multiple enterprise Elasticsearch applications, which are currently serving hundreds of millions of requests per day. He began his journey with Elasticsearch in 2012 to build an analytics engine to power dashboards and quickly realized that Elasticsearch is like nothing else out there for search and analytics. He has been a strong advocate ever since and wrote this book to share the practical knowledge he gained along the way.
Dan Noble is a software engineer with a passion for writing secure, clean, and articulate code. He enjoys working with a variety of programming languages and software frameworks, particularly Python, Elasticsearch, and various JavaScript frontend technologies. Dan currently works on geospatial web applications and data processing systems. Dan has been a user and advocate of Elasticsearch since 2011. He has given several talks about Elasticsearch, is the author of the book Monitoring Elasticsearch, and was a technical reviewer for the book The Elasticsearch Cookbook, Second Edition, by Alberto Paro. Dan is also the author of the Python Elasticsearch client rawes.
Marcelo Ochoa works at the system laboratory of Facultad de Ciencias Exactas of the Universidad Nacional del Centro de la Provincia de Buenos Aires and is the CTO at Scotas.com, a company that specializes in near real-time search solutions using Apache Solr and Oracle. He divides his time between university jobs and external projects related to Oracle and big data technologies. He has worked on several Oracle-related projects, such as the translation of Oracle manuals and multimedia CBTs. His background is in database, network, web, and Java technologies. In the XML world, he is known as the developer of the DB Generator for the Apache Cocoon project. He has worked on the open source projects DBPrism and DBPrism CMS, the Lucene-Oracle integration using the Oracle JVM Directory implementation, and the Restlet.org project, where he worked on the Oracle XDB Restlet Adapter, which is an alternative to writing native REST web services inside a database resident JVM.
Since 2006, he has been part of the Oracle ACE program and was recently inducted into the Docker Mentor program.
He has coauthored Oracle Database Programming using Java and Web Services (Digital Press) and Professional XML Databases (Wrox Press), and has been a technical reviewer for several Packt books, such as Mastering Elastic Stack, Mastering Elasticsearch 5.x - Third Edition, Elasticsearch 5.x Cookbook - Third Edition, and others.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787128458.
If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
Introduction to Elasticsearch
Basic concepts of Elasticsearch
Document
Index
Type
Cluster and node
Shard
Interacting with Elasticsearch
Creating a document
Retrieving an existing document
Updating an existing document
Updating a partial document
Deleting an existing document
How does search work?
Importance of information retrieval
Simple search query
Inverted index
Stemming
Synonyms
Phrase search
Apache Lucene
Scalability and availability
Relation between node, index, and shard
Three shards with zero replicas
Six shards with zero replicas
Six shards with one replica
Distributed search
Failure handling
Strengths and limitations of Elasticsearch
Summary
Setting Up Elasticsearch and Kibana
Installing Elasticsearch
Installing Java
Windows
Starting and stopping Elasticsearch
Mac OS X
Starting and stopping Elasticsearch
DEB and RPM packages
Debian package
RPM package
Starting and stopping Elasticsearch
Sample configuration files
Verifying Elasticsearch is running
Installing Kibana
Mac OS X
Starting and stopping Kibana
Windows
Starting and stopping Kibana
Query format used in this book (Kibana Console)
Using cURL or Postman
Health of the cluster
Summary
Modeling Your Data and Document Relations
Mapping
Dynamic mapping
Create index with mapping
Adding a new type/field
Getting the existing mapping
Mapping conflicts
Data type
Metafields
How to handle null values
Storing the original document
Searching all the fields in the document
Difference between full-text search and exact match
Core data types
Text
Keyword
Date
Numeric
Boolean
Binary
Complex data types
Array
Object
Nested
Geo data type
Geo-point data type
Specialized data type
IP
Mapping the same field with different mappings
Handling relations between different document types
Parent-child document relation
How are parent-child documents stored internally?
Nested
Routing
Summary
Indexing and Updating Your Data
Indexing your data
Indexing errors
Node/shards errors
Serialization/mapping errors
Thread pool rejection error
Managing an index
What happens when you index a document?
Updating your data
Update using an entire document
Partial updates
Scripted updates
Upsert
NOOP
What happens when you update a document?
Merging segments
Using Kibana to discover
Using Elasticsearch in your application
Java
Transport client
Dependencies
Initializing the client
Sniffing
Node client
REST client
Third party clients
Indexing using Java client
Concurrency
Translog
Async versus sync
CRUD from translog
Primary and Replica shards
Primary preference
More replicas for query throughput
Increasing/decreasing the number of replicas
Summary
Organizing Your Data and Bulk Data Ingestion
Bulk operations
Bulk API
Multi Get API
Update by query
Delete by query
Reindex API
Change mappings/settings
Combining documents from one or more indices
Copying only missing documents
Copying a subset of documents into a new index
Copying top N documents
Copying the subset of fields into new index
Ingest Node
Organizing your data
Index alias
Index templates
Managing time-based indices
Shrink API
Summary
All About Search
Different types of queries
Sample data
Querying Elasticsearch
Basic query (finding the exact value)
Pagination
Sorting based on existing fields
Selecting the fields in the response
Querying based on range
Handling dates
Analyzed versus non-analyzed fields
Term versus Match query
Match phrase query
Prefix and match phrase prefix query
Wildcard and Regular expression query
Exists and missing queries
Using more than one query
Routing
Debugging search query
Relevance
Queries versus Filters
How to boost relevance based on a single field
How to boost score based on queries
How to boost relevance using decay functions
Rescoring
Debugging relevance score
Searching for same value across multiple fields
Best matching fields
Most matching fields
Cross-matching fields
Caching
Node Query cache
Shard request cache
Summary
More Than a Search Engine (Geofilters, Autocomplete, and More)
Sample data
Correcting typos and spelling mistakes
Fuzzy query
Making suggestions based on the user input
Implementing "did you mean" feature
Term suggester
Phrase suggester
Implementing the autocomplete feature
Highlighting
Handling document relations using parent-child
The has_parent query
The has_child query
Inner hits for parent-child
How parent-child works internally
Handling document relations using nested
Inner hits for nested documents
Scripting
Script Query
Post Filter
Reverse search using the percolate query
Geo and Spatial Filtering
Geo Distance
Using Geolocation to rank the search results
Geo Bounding Box
Sorting
Multi search
Search templates
Querying Elasticsearch from Java application
Summary
How to Slice and Dice Your Data Using Aggregations
Aggregation basics
Sample data
Query structure
Multilevel aggregations
Types of aggregations
Terms aggregations (group by)
Size and error
Order
Minimum document count
Missing values
Aggregations based on filters
Aggregations on dates (range, histogram)
Aggregations on numeric values (range, histogram)
Aggregations on geolocation (distance, bounds)
Geo distance
Geo bounds
Aggregations on child documents
Aggregations on nested documents
Reverse nested aggregation
Post filter
Using Kibana to visualize aggregations
Caching
Doc values
Field data
Summary
Production and Beyond
Configuring Elasticsearch
The directory structure
zip/tar.gz
DEB/RPM
Configuration file
Cluster and node name
Network configuration
Memory configuration
Configuring file descriptors
Types of nodes
Multinode cluster
Inspecting the logs
How nodes discover each other
Node failures
X-Pack
Windows
Mac OS X
Debian/RPM
Authentication
X-Pack basic license
Monitoring
Monitoring Elasticsearch clusters
Monitoring indices
Monitoring nodes
Thread pools
Elasticsearch server logs
Slow logs
Summary
Exploring Elastic Stack (Elastic Cloud, Security, Graph, and Alerting)
Elastic Cloud
High availability
Data reliability
Security
Authentication and roles
Securing communications using SSL
Graph
Graph UI
Alerting
Summary
Welcome to Learning Elasticsearch. We will start by describing the basic concepts of Elasticsearch. You will see how to install Elasticsearch and Kibana and learn how to index and update your data. We will use an e-commerce site as an example to explain how a search engine works and how to query your data. The real power of Elasticsearch is aggregations. You will see how to perform aggregation-based analytics with ease. You will also see how to use Kibana to explore and visualize your data. Finally, we will discuss how to use Graph to discover relations in your data and use alerting to set up alerts and notification on different trends in your data.
To better explain various concepts, many examples are used throughout the book. Detailed instructions for installing Elasticsearch and Kibana and for executing the examples are included in Chapter 2, Setting Up Elasticsearch and Kibana.
Chapter 1, Introduction to Elasticsearch, describes the building blocks of Elasticsearch and what makes Elasticsearch scalable and distributed. In this chapter, we also discuss the strengths and limitations of Elasticsearch.
Chapter 2, Setting Up Elasticsearch and Kibana, covers the installation of Elasticsearch and Kibana.
Chapter 3, Modeling Your Data and Document Relations, focuses on modeling your data. To support text search, Elasticsearch preprocesses the data before indexing it. This chapter describes why preprocessing is necessary and the various analyzers Elasticsearch supports. In addition, we discuss how to handle relationships between different document types.
Chapter 4, Indexing and Updating Your Data, covers how to index and update your data and what happens internally when you do so. The data indexed in Elasticsearch is only available after a small delay; we discuss the reason for this delay and how to control it.
Chapter 5, Organizing Your Data and Bulk Data Ingestion, describes how to organize and manage indices in Elasticsearch using aliases, templates, and more. This chapter also covers the various bulk APIs Elasticsearch supports and how to rebuild your existing indices using the Reindex and Shrink APIs.
Chapter 6, All About Search, covers how to search, sort, and paginate your data. The concept of relevance is introduced, and we discuss how to tune the relevance score to get the most relevant search results at the top.
Chapter 7, More Than a Search Engine (Geofilters, Autocomplete, and More), covers how to filter based on geolocation, use Elasticsearch suggesters for autocomplete, correct user typos, and a lot more.
Chapter 8, How to Slice and Dice Your Data Using Aggregations, covers different kinds of aggregations Elasticsearch supports and how to visualize the data using Kibana.
Chapter 9, Production and Beyond, covers important settings to configure and monitor in production.
Chapter 10, Exploring Elastic Stack (Elastic Cloud, Security, Graph, and Alerting), covers Elastic Cloud, which is a managed cloud hosting service, and the other products that are part of X-Pack.
The book was written using Elasticsearch 5.1.2, and all the examples in the book should work with it. The request format used in this book is based on the Kibana Console, and you'll need the Kibana Console or the Sense Chrome plugin to execute the examples. Please refer to the Query format used in this book section of Chapter 2, Setting Up Elasticsearch and Kibana, for more details. If using Kibana or Sense is not an option, you can use other HTTP clients, such as cURL or Postman; the request format is slightly different and is explained in the Using cURL or Postman section of Chapter 2, Setting Up Elasticsearch and Kibana.
This book is for software developers who are planning to build a search and analytics engine or are trying to learn Elasticsearch.
Some familiarity with web technologies (JavaScript, REST, JSON) would be helpful.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, path names, dummy URLs, user input, and Twitter handles are shown as follows: "We can include other contexts through the use of the include directive."
A block of code is set as follows:
{ "articleid": 1, "name": "Introduction to Elasticsearch"}
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
{ "articleid": 1,
"name": "Introduction to Elasticsearch"
}
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Clicking the Next button moves you to the next screen."
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the
SUPPORT
tab at the top.
Click on
Code Downloads & Errata
.
Enter the name of the book in the
Search
box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on
Code Download
.
You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Elasticsearch. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
In this chapter, we will focus on the basic concepts of Elasticsearch. We will start by explaining the building blocks and then discuss how to create, modify, and query documents in Elasticsearch. Getting started with Elasticsearch is very easy; most operations come with default settings, which can be overridden when you need more advanced features.
I first started using Elasticsearch in 2012 as a backend search engine to power our analytics dashboards. It has been more than five years, and I have never looked for any other technology for our search needs. Elasticsearch is much more than just a search engine; it supports complex aggregations, geo filters, and the list goes on. Best of all, you can run all your queries at a speed you have never seen before. To understand how this magic happens, we will briefly discuss how Elasticsearch works internally and then discuss how to talk to it. Knowing how it works internally will help you understand its strengths and limitations. Elasticsearch, like any other open source technology, is evolving very rapidly, but the core fundamentals that power it don't change. By the end of this chapter, we will have covered the following:
Basic concepts of Elasticsearch
How to interact with Elasticsearch
How to create, read, update, and delete
How does search work
Availability and horizontal scalability
Failure handling
Strengths and limitations
Elasticsearch is a highly scalable open source search engine. Although it started as a text search engine, it is evolving into an analytical engine that can support not only search but also complex aggregations. Its distributed nature and ease of use make it very easy to get started and to scale as you have more data.
One might ask what makes Elasticsearch different from other document stores out there. Elasticsearch is a search engine, not just a key-value store. It's also a very powerful analytical engine; all the queries that you would usually run in a batch or offline mode can be executed in real time. Support for features such as autocomplete, geolocation-based filters, and multilevel aggregations, coupled with its user-friendliness, has resulted in industry-wide acceptance. That being said, I always believe it is important to have the right tool for the right job. Toward the end of the chapter, we will discuss its strengths and limitations.
In this section, we will go through the basic concepts and terminology of Elasticsearch. We will start by explaining how to insert, update, and perform a search. If you are familiar with SQL, the following table shows the equivalent terms in Elasticsearch:

SQL        Elasticsearch
Database   Index
Table      Type
Row        Document
Column     Field
Your data in Elasticsearch is stored as JSON (JavaScript Object Notation) documents. Most NoSQL data stores use JSON to store their data, as the JSON format is very concise, flexible, and readily understood by humans. A document in Elasticsearch is very similar to a row in a relational database. Let's say we have a User table with the following information:
The users in the preceding user table, when represented in JSON format, will look like the following:
{ "id": 1, "name": "Luke", "age": 100, "gender": "M", "email": "[email protected]" }
{ "id": 2, "name": "Leia", "age": 100, "gender": "F", "email": "[email protected]" }
A row contains columns; similarly, a document contains fields. Elasticsearch documents are very flexible and support storing nested objects. For example, an existing user document can be easily extended to include the address information. To capture similar information using a table structure, you need to create a new address table and manage the relations using a foreign key. The user document with the address is shown here:
{ "id": 1, "name": "Luke", "age": 100, "gender": "M", "email": "[email protected]",
"address": {
"street": "123 High Lane",
"city": "Big City",
"state": "Small State",
"zip": 12345
}
}
Reading similar information without the JSON structure would also be difficult, as the information would have to be read from multiple tables. Elasticsearch allows you to store the entire JSON document as it is. For a database table, the schema has to be defined before you can use the table. Elasticsearch is built to handle unstructured data and can automatically determine the data types for the fields in a document. You can index new documents or add new fields without adding or changing the schema. This process is known as dynamic mapping. We will discuss how it works and how to define a schema in Chapter 3, Modeling Your Data and Document Relations.
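Dynamic mapping can be pictured as Elasticsearch walking the JSON document and choosing a type for every field it has not seen before. The following Python sketch is only a rough illustration of that idea; Elasticsearch's actual detection rules (covered in Chapter 3) are more involved and also handle dates, numeric strings, and more:

```python
import json

# Rough sketch of dynamic mapping: infer a type for each field of a JSON
# document. This is an illustration only, not Elasticsearch's real algorithm.
def infer_mapping(document):
    mapping = {}
    for field, value in document.items():
        if isinstance(value, bool):      # check bool before int: bool is an int subclass
            mapping[field] = "boolean"
        elif isinstance(value, int):
            mapping[field] = "long"
        elif isinstance(value, float):
            mapping[field] = "double"
        elif isinstance(value, str):
            mapping[field] = "text"
        elif isinstance(value, dict):
            # nested JSON objects get their own sub-mapping
            mapping[field] = {"object": infer_mapping(value)}
    return mapping

user = json.loads('''{ "id": 1, "name": "Luke", "age": 100,
                       "address": { "city": "Big City", "zip": 12345 } }''')
print(infer_mapping(user))
```

Running this prints a mapping with id and age inferred as long, name as text, and address as an object with its own nested mapping, which mirrors how an index picks up new fields without a predefined schema.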
An index is similar to a database. The term index should not be confused with a database index, as someone familiar with traditional SQL might assume. Your data is stored in one or more indices, just like you would store it in one or more databases. The word indexing means inserting or updating documents in an Elasticsearch index. The name of the index must be unique and typed in all lowercase letters. For example, in an e-commerce world, you would have one index for items, one for orders, one for customer information, and so on.
A type is similar to a database table; an index can have one or more types. A type is a logical separation of different kinds of data. For example, if you are building a blog application, you would define one type for the articles in the blog and another type for the comments. Let's say we have two types: articles and comments.
The following is the document that belongs to the article type:
{ "articleid": 1, "name": "Introduction to Elasticsearch" }
The following is the document that belongs to the comment type:
{ "commentid": "AVmKvtPwWuEuqke_aRsm", "articleid": 1, "comment": "Its Awesome !!" }
We can also define relations between different types. For example, a parent/child relation can be defined between articles and comments. An article (parent) can have one or more comments (children). We will discuss relations further in Chapter 3, Modeling Your Data and Document Relations.
In a traditional database system, we usually have only one server serving all the requests. Elasticsearch is a distributed system, meaning it is made up of one or more nodes (servers) that act as a single application, which enables it to scale and handle load beyond what a single server can handle. Each node (server) has part of the data. You can start running Elasticsearch with just one node and add more nodes, or, in other words, scale the cluster when you have more data. A cluster with three nodes is shown in the following diagram:
In the preceding diagram, the cluster has three nodes, named elasticsearch1, elasticsearch2, and elasticsearch3. These three nodes work together to handle all the indexing and query requests on the data. Each cluster is identified by a unique name, which defaults to elasticsearch. It is common to have multiple clusters, one for each environment, such as staging, pre-production, and production.
Just like a cluster, each node is identified by a unique name. Elasticsearch will automatically assign a unique name to each node if the name is not specified in the configuration. Depending on your application needs, you can add and remove nodes (servers) on the fly. Adding and removing nodes is seamlessly handled by Elasticsearch.
We will discuss how to set up an Elasticsearch cluster in Chapter 2, Setting Up Elasticsearch and Kibana.
An index is a collection of one or more shards. All the data that belongs to an index is distributed across multiple shards. By spreading the data that belongs to an index across multiple shards, Elasticsearch can store information beyond what a single server can hold. Elasticsearch uses Apache Lucene internally to index and query the data. A shard is nothing but an Apache Lucene instance. We will discuss Apache Lucene and why Elasticsearch uses it in the How does search work? section later in this chapter.
I know we introduced a lot of new terms in this section. For now, just remember that all the data that belongs to an index is spread across one or more shards. We will discuss how shards work in the Scalability and availability section toward the end of this chapter.
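The way documents are spread across shards is deterministic: Elasticsearch computes shard = hash(_routing) % number_of_primary_shards, where _routing defaults to the document's identifier. The sketch below illustrates the idea in Python, using crc32 as a stand-in for the murmur3 hash Elasticsearch actually uses, so the shard numbers it prints will not match a real cluster:

```python
from zlib import crc32

# Sketch of shard routing: a document id is hashed and reduced modulo the
# number of primary shards. crc32 is a stand-in here for Elasticsearch's
# murmur3 hash, so this only illustrates the mechanism, not real placements.
def route_to_shard(doc_id, num_primary_shards):
    return crc32(doc_id.encode("utf-8")) % num_primary_shards

# The same id always lands on the same shard. This determinism is also why
# the number of primary shards cannot be changed after an index is created:
# changing it would invalidate where every existing document lives.
for doc_id in ["1", "2", "AVmKvtPwWuEuqke_aRsm"]:
    print(doc_id, "-> shard", route_to_shard(doc_id, 3))
```

Because the formula depends only on the id and the shard count, any node can work out which shard holds a given document without asking the others.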
The primary way of interacting with Elasticsearch is via its REST API. Elasticsearch provides a JSON-based REST API over HTTP. By default, the REST API runs on port 9200. Anything from creating an index to shutting down a node is a simple REST call. The APIs are broadly classified into the following:
Document APIs: CRUD (Create, Retrieve, Update, Delete) operations on documents
Search APIs: For all the search operations
Indices APIs: For managing indices (creating an index, deleting an index, and so on)
Cat APIs: Instead of JSON, the data is returned in tabular form
Cluster APIs: For managing the cluster
We have a chapter dedicated to each of them: for example, indexing documents is covered in Chapter 4, Indexing and Updating Your Data, search in Chapter 6, All About Search, and so on. In this section, we will go through some basic CRUD operations using the Document APIs. This section is simply a brief introduction to manipulating data using the Document APIs. To use Elasticsearch in your application, clients for all major languages, such as Java and Python, are also provided. The majority of these clients act as wrappers around the REST API.
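As a quick illustration of the URL scheme behind the Document APIs, the following Python sketch builds the endpoints used in this chapter's examples. The helper function and its names are ours, not part of any Elasticsearch client, and actually sending the requests would of course require a running Elasticsearch node:

```python
# Minimal sketch of the Document API URL scheme: every CRUD operation
# targets /{index}/{type}/{id} on the REST port (9200 by default).
# This only builds the URLs; it does not perform any HTTP calls.
BASE_URL = "http://localhost:9200"

def document_url(index, doc_type, doc_id=None):
    path = "/%s/%s/" % (index, doc_type)
    if doc_id is not None:
        path += str(doc_id)
    return BASE_URL + path

# PUT with an explicit id creates (or replaces) document 1:
print("PUT", document_url("chapter1", "product", 1))
# POST without an id lets Elasticsearch generate one:
print("POST", document_url("chapter1", "product"))
```

The same URLs work with a GET to retrieve a document or a DELETE to remove it, which is what makes the Document APIs easy to explore with any HTTP client.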
To better explain the CRUD operations, imagine we are building an e-commerce site and want to use Elasticsearch to power its search functionality. We will use an index named chapter1 and store all the products in a type called product. Each product we want to index is represented by a JSON document. We will start by creating a new product document; then we will retrieve a product by its identifier, update a product's category, and finally delete a product using its identifier.
A new document can be added using the Document APIs. For the e-commerce example, to add a new product, we execute the following command. The body of the request is the product document we want to index:
PUT http://localhost:9200/chapter1/product/1
{
  "title": "Learning Elasticsearch",
  "author": "Abhishek Andhavarapu",
  "category": "books"
}
Let's inspect the request:

INDEX: chapter1
TYPE: product
ID: 1
HTTP METHOD: PUT

The document's properties, such as title, author, and category, are also known as fields, which are similar to SQL columns.
When we execute the preceding request, Elasticsearch responds with a JSON response, shown as follows:
{
  "_index": "chapter1",
  "_type": "product",
  "_id": "1",
  "_version": 1,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "created": true
}
In the response, you can see that Elasticsearch created the document and the version of the document is 1. Since you are creating the document using the HTTP PUT method, you are required to specify the document identifier. If you don’t specify the identifier, Elasticsearch will respond with the following error message:
No handler found for uri [/chapter1/product/] and method [PUT]
If you don’t have a unique identifier, you can let Elasticsearch assign an identifier for you, but you should use the POST HTTP method. For example, if you are indexing log messages, you will not have a unique identifier for each log message, and you can let Elasticsearch assign the identifier for you.
We can index a document without specifying a unique identifier as shown here:
POST http://localhost:9200/chapter1/product/
{
  "title": "Learning Elasticsearch",
  "author": "Abhishek Andhavarapu",
  "category": "books"
}
In the above request, the URL doesn't contain the unique identifier, and we are using the HTTP POST method. Let's inspect the request:
INDEX: chapter1
TYPE: product
DOCUMENT: JSON
HTTP METHOD: POST

The response from Elasticsearch is shown as follows:
{
  "_index": "chapter1",
  "_type": "product",
  "_id": "AVmKvtPwWuEuqke_aRsm",
  "_version": 1,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "created": true
}
You can see from the response that Elasticsearch assigned the unique identifier AVmKvtPwWuEuqke_aRsm to the document and that the created flag is set to true. If a document with the same unique identifier already exists, Elasticsearch replaces the existing document and increments the document version. If you were to run the same PUT request from the beginning of the section again, the response from Elasticsearch would be this:
{
  "_index": "chapter1",
  "_type": "product",
  "_id": "1",
  "_version": 2,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "created": false
}
In the response, you can see that the created flag is false since the document with id: 1 already exists. Also, observe that the version is now 2.
To retrieve an existing document, we need the index, the type, and the unique identifier of the document. Let's try to retrieve the document we just indexed. To retrieve a document, we use the HTTP GET method as shown below:
GET http://localhost:9200/chapter1/product/1
Let's inspect the request:

INDEX: chapter1
TYPE: product
ID: 1
HTTP METHOD: GET

The response from Elasticsearch, shown below, contains the product document we indexed in the previous section:
{
  "_index": "chapter1",
  "_type": "product",
  "_id": "1",
  "_version": 2,
  "found": true,
  "_source": {
    "title": "Learning Elasticsearch",
    "author": "Abhishek Andhavarapu",
    "category": "books"
  }
}
The actual JSON document is stored in the _source field. Also note the version in the response; every time the document is updated, the version is incremented.
Updating a document in Elasticsearch is more complicated than in a traditional SQL database. Internally, Elasticsearch retrieves the old document, applies the changes, and re-inserts it as a new document, which makes the update operation relatively expensive. There are different ways of updating a document. We will talk about updating a partial document here, and in more detail in the Updating your data section in Chapter 4, Indexing and Updating Your Data.
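The retrieve-merge-reindex cycle just described can be sketched in plain Python. This is an illustration of the semantics only, not Elasticsearch's actual implementation (the real merge handles nested objects, and Lucene deletes and re-adds the document rather than mutating it); partial_update is a hypothetical helper name.

```python
def partial_update(stored_doc, partial):
    """Sketch of a partial update: copy the old source, merge in the
    changed fields, and return a reindexed document with an
    incremented version."""
    new_source = dict(stored_doc["_source"])  # copy the old document
    new_source.update(partial["doc"])         # apply only the changed fields
    return {
        "_source": new_source,
        "_version": stored_doc["_version"] + 1,
    }


# The product document as stored after the earlier PUT (version 2)
stored = {
    "_source": {
        "title": "Learning Elasticsearch",
        "author": "Abhishek Andhavarapu",
        "category": "books",
    },
    "_version": 2,
}

updated = partial_update(stored, {"doc": {"category": "technical books"}})
print(updated["_version"])               # 3
print(updated["_source"]["category"])    # technical books
```

Note that the untouched fields (title, author) survive the update because the whole old source is carried forward; only category is overwritten.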
We already indexed the document with the unique identifier 1, and now we need to update the category of the product from just books to technical books. We can update the document as shown here:
POST http://localhost:9200/chapter1/product/1/_update
{
  "doc": {
    "category": "technical books"
  }
}
The body of the request contains the fields of the document we want to update, and the unique identifier is passed in the URL.
The response from Elasticsearch is shown here:
{
  "_index": "chapter1",
  "_type": "product",
  "_id": "1",
  "_version": 3,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  }
}
As you can see in the response, the operation is successful, and the version of the document is now 3. More complicated update operations are possible using scripts and upserts.
For creating and retrieving a document, we used the PUT, POST, and GET methods. For deleting an existing document, we use the HTTP DELETE method and pass the unique identifier of the document in the URL as shown here:
DELETE http://localhost:9200/chapter1/product/1
Let's inspect the request:

INDEX: chapter1
TYPE: product
ID: 1
HTTP METHOD: DELETE
The response from Elasticsearch is shown here:
{
  "found": true,
  "_index": "chapter1",
  "_type": "product",
  "_id": "1",
  "_version": 4,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  }
}
In the response, you can see that Elasticsearch was able to find the document with the unique identifier 1 and was successful in deleting the document.
In the previous section, we discussed how to create, update, and delete documents. In this section, we will briefly discuss how search works internally and explain the basic query APIs. In particular, I want to talk about the inverted index and Apache Lucene. All the data in Elasticsearch is internally stored in Apache Lucene as an inverted index. Although the data lives in Apache Lucene, Elasticsearch is what makes it distributed and provides the easy-to-use APIs. We will discuss the Search API in detail in Chapter 6, All About Search.
As computation power increases and the cost of storage decreases, the amount of data we deal with day to day is growing exponentially. But without a way to query and retrieve it, the information we collect doesn't help.
Information retrieval systems are very important to make sense of the data. Imagine how hard it would be to find some information on the Internet without Google or other search engines out there. Information is not knowledge without information retrieval systems.
Let's say we have a User table as shown here:
Now, we want to query for all the users with the name Luke. A SQL query to achieve this would be something like this:
select * from user where name like '%luke%'
To do a similar task in Elasticsearch, you can use the search API and execute the following command:
GET http://127.0.0.1:9200/chapter1/user/_search?q=name:luke
Let's inspect the request:
INDEX: chapter1
TYPE: user
FIELD: name
Just like you would get all the rows in the User table as a result of the SQL query, the response to the Elasticsearch query would be JSON documents:
{
  "id": 1,
  "name": "Luke",
  "age": 100,
  "gender": "M",
  "email": "[email protected]"
}
Querying using URL parameters works for simple queries as shown above. For more practical queries, you should pass the query as JSON in the request body. The same query passed in the request body is shown here:
POST http://127.0.0.1:9200/chapter1/user/_search
{
  "query": {
    "term": {
      "name": "luke"
    }
  }
}
The Search API is very flexible and supports different kinds of filters, sorting, pagination, and aggregations.
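For illustration, here is how a richer request body combining a query with sorting and pagination might look, built as a Python dictionary. The sort on the age field assumes the User example above; from, size, and sort are Elasticsearch's standard request-body parameters.

```python
import json

# A fuller search request body: the term query from this section,
# plus sorting and pagination.
search_body = {
    "query": {"term": {"name": "luke"}},
    "sort": [{"age": {"order": "desc"}}],  # order hits by age, descending
    "from": 0,    # offset of the first hit to return
    "size": 10,   # maximum number of hits to return
}

# This JSON would be POSTed to /chapter1/user/_search
print(json.dumps(search_body, indent=2))
```

Paging deeper simply means increasing "from" in steps of "size"; the query itself stays unchanged.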
Before we talk more about search, I want to talk about the inverted index. Knowing how the inverted index works will help you understand the limitations and strengths of Elasticsearch compared with traditional database systems. The inverted index, at its core, is what makes Elasticsearch different from other NoSQL stores, such as MongoDB, Cassandra, and so on.
We can compare an inverted index to an old library card catalog. When you need a book in a library, you use the card catalog, usually placed at the entrance, to find it. An inverted index is similar to that card catalog. Imagine that you were to build a system like Google to search for the web pages mentioning your search keywords. We have three web pages with Yoda quotes from Star Wars, and you are searching for all the documents containing the word fear:
Document1: Fear leads to anger
Document2: Anger leads to hate
Document3: Hate leads to suffering
In a library, without a card catalog to find the book you need, you would have to go to every shelf row by row, look at each book title, and see whether it's the book you need. Computer-based information retrieval systems do the same.
Without the inverted index, the application would have to go through each web page and check whether the word exists in it. An inverted index is like a map with the term as the key and the list of documents the term appears in as the value, similar to the following table:

Term       Documents
anger      1, 2
fear       1
hate       2, 3
leads      1, 2, 3
suffering  3
to         1, 2, 3
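The term-to-documents mapping just described can be sketched in a few lines of Python. This is a deliberately simplified illustration; Lucene's actual on-disk index structures are far more sophisticated.

```python
from collections import defaultdict

# The three Yoda quotes from this section.
docs = {
    1: "Fear leads to anger",
    2: "Anger leads to hate",
    3: "Hate leads to suffering",
}

# Build the inverted index: lowercase each term (so "Fear" and "fear"
# match) and map it to the set of documents it appears in.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted[term].add(doc_id)

# Searching for a term is now a single dictionary lookup, not a scan
# over every document.
print(sorted(inverted["fear"]))   # [1]
print(sorted(inverted["leads"]))  # [1, 2, 3]
```

Adding a new document only requires appending its identifier to the entry of each of its terms, just as a new book gets one new card per catalog heading.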
Once we construct an index like the preceding table, finding all the documents with the term fear is just a lookup. Just as a new book is added to the card catalog when a library acquires it, we keep building the inverted index as we encounter new web pages. The preceding inverted index takes care of simple use cases, such as searching for a single term. But in reality, we query for much more complicated things, and we don't use the exact words. Now let's say we encountered a document containing the following:
Yosemite national park may be closed for the weekend due to forecast of substantial rainfall
We want to visit Yosemite National Park, and we are looking for the weather forecast in the park. But when we query for it in human language, we might query something like weather in yosemite or rain in yosemite. With the current approach, we will not be able to answer these queries, as the query and the document share none of the terms we are actually interested in.
To be able to answer queries like this and to improve the search quality, we employ various techniques, such as stemming and synonyms, discussed in the following sections.
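As a toy sketch of the synonym idea (this is not how Elasticsearch implements it; Elasticsearch applies synonym token filters at analysis time, and the synonym pairs below are our own illustrative choices):

```python
# Hypothetical synonym table: each query term maps to terms that
# should be treated as equivalent.
SYNONYMS = {
    "weather": {"forecast", "rainfall"},
    "rain": {"rainfall"},
}


def expand(terms):
    """Expand a set of query terms with their synonyms."""
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded


# Terms of the Yosemite document from this section, lowercased.
doc_terms = set(
    "yosemite national park may be closed for the weekend "
    "due to forecast of substantial rainfall".split()
)

query_terms = {"weather", "in", "yosemite"}

# Without expansion, only "yosemite" overlaps with the document;
# after expansion, "forecast" and "rainfall" match as well.
print(sorted(query_terms & doc_terms))
print(sorted(expand(query_terms) & doc_terms))
```

The query term weather never appears in the document, yet after expansion it matches through forecast and rainfall, which is exactly the gap synonym handling closes.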
