Leverage Elasticsearch to create a robust, fast, and flexible search solution with ease
If you are a competent developer and want to learn about the great and exciting world of Elasticsearch, then this book is for you. No prior knowledge of Java or Apache Lucene is needed.
Elasticsearch is a very fast and scalable open source search engine, designed with distribution and the cloud in mind, complete with all the goodies that Apache Lucene has to offer. Elasticsearch's schema-free architecture allows developers to index and search unstructured content, making it perfectly suited for both small projects and large big data warehouses, even those with petabytes of unstructured data.
This book will guide you through the world of the most commonly used Elasticsearch server functionalities. You'll start off by getting an understanding of the basics of Elasticsearch and its data indexing functionality. Next, you will see the querying capabilities of Elasticsearch, followed by a thorough explanation of scoring and search relevance. After this, you will explore the aggregation and data analysis capabilities of Elasticsearch and will learn how cluster administration and scaling can be used to boost your application performance. You'll find out how to use the friendly REST APIs and how to tune Elasticsearch to make the most of it. By the end of this book, you will be able to create amazing search solutions as per your project's specifications.
This step-by-step guide is full of screenshots and real-world examples to take you on a journey through the wonderful world of full text search provided by Elasticsearch.
Page count: 718
Year of publication: 2016
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2013
Second edition: February 2015
Third edition: February 2016
Production reference: 1230216
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-881-6
www.packtpub.com
Authors
Rafał Kuć
Marek Rogoziński
Reviewer
Paige Cook
Commissioning Editor
Nadeem Bagban
Acquisition Editor
Divya Poojari
Content Development Editor
Kirti Patil
Technical Editor
Utkarsha S. Kadam
Copy Editor
Alpha Singh
Project Coordinator
Nidhi Joshi
Proofreader
Safis Editing
Indexer
Rekha Nair
Graphics
Jason Monteiro
Production Coordinator
Manu Joseph
Cover Work
Manu Joseph
Rafał Kuć is a software engineer, trainer, speaker, and consultant. He works as a consultant and software engineer at Sematext Group Inc., where he concentrates on open source technologies such as Apache Lucene, Solr, and Elasticsearch. He has more than 14 years of experience in various software domains—from banking software to e-commerce products. He is mainly focused on Java; however, he is open to every tool and programming language that might help him achieve his goals easily and quickly. Rafał is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people solve their Solr and Lucene problems. He is also a speaker at various conferences around the world, such as Lucene Eurocon, Berlin Buzzwords, ApacheCon, Lucene/Solr Revolution, Velocity, and DevOps Days.
Rafał began his journey with Lucene in 2002; however, it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then Solr came and that was it. He started working with Elasticsearch in the middle of 2010. At present, Lucene, Solr, Elasticsearch, and information retrieval are his main areas of interest.
Rafał is also the author of the Solr Cookbook series, ElasticSearch Server and its second edition, and the first and second editions of Mastering ElasticSearch, all published by Packt Publishing.
Marek Rogoziński is a software architect and consultant with more than 10 years of experience. His specialization concerns solutions based on open source search engines, such as Solr and Elasticsearch, and the software stack for big data analytics, including Hadoop, HBase, and Twitter Storm.
He is also a cofounder of the solr.pl site, which publishes information and tutorials about Solr and Lucene libraries. He is the coauthor of ElasticSearch Server and its second edition, and the first and second editions of Mastering ElasticSearch, all published by Packt Publishing.
He is currently the chief technology officer and lead architect at ZenCard, a company that processes and analyzes large quantities of payment transactions in real time, allowing automatic and anonymous identification of retail customers on all retailer channels (m-commerce/e-commerce/brick&mortar) and giving retailers a customer retention and loyalty tool.
Paige Cook works as a software architect for Videa, part of the Cox Family of Companies, and lives near Atlanta, Georgia. He has twenty years of experience in software development, primarily with the Microsoft .NET Framework. His career has been largely focused on building enterprise solutions for the media and entertainment industry. He is especially interested in search technologies using the Apache Lucene search engine and has experience with both Elasticsearch and Apache Solr. Apart from his work, he enjoys DIY home projects and spending time with his wife and two daughters.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Welcome to Elasticsearch Server, Third Edition. This is the third installment of the book, dedicated to yet another major release of Elasticsearch—this time, version 2.2. In the third edition, we decided to take a similar route to the one we took when we wrote the second edition of the book. We not only updated the content to match the new version of Elasticsearch, but also restructured the book by removing and adding sections and chapters. We read the suggestions we received from you—the readers of the book—and carefully tried to incorporate the comments sent to us since the release of the first and second editions.
While reading this book, you will be taken on a journey through the wonderful world of full-text search provided by the Elasticsearch server. We will start with a general introduction to Elasticsearch, which covers how to start and run Elasticsearch, its basic concepts, and how to index and search your data in the most basic way. This book will also discuss the query language, the so-called Query DSL, which allows you to create complicated queries and filter returned results. In addition to all of this, you'll see how you can use the aggregation framework to calculate aggregated data based on the results returned by your queries. We will implement the autocomplete functionality together and learn how to use Elasticsearch's spatial capabilities and prospective search.
Finally, this book will show you Elasticsearch's administration API capabilities with features such as shard placement control, cluster handling, and more, ending with a dedicated chapter that discusses how to prepare Elasticsearch for small and large deployments—both ones that concentrate on indexing and ones that concentrate on querying.
Chapter 1, Getting Started with Elasticsearch Cluster, covers what full-text searching is, what Apache Lucene is, what text analysis is, how to run and configure Elasticsearch, and finally, how to index and search your data in the most basic way.
Chapter 2, Indexing Your Data, shows how indexing works, how to prepare index structure, what data types we are allowed to use, how to speed up indexing, what segments are, how merging works, and what routing is.
Chapter 3, Searching Your Data, introduces the full-text search capabilities of Elasticsearch by discussing how to query it, how the querying process works, and what types of basic and compound queries are available. In addition to this, we will show how to use position-aware queries in Elasticsearch.
Chapter 4, Extending Your Query Knowledge, shows how to efficiently narrow down your search results by using filters, how highlighting works, how to sort your results, and how query rewrite works.
Chapter 5, Extending Your Index Structure, shows how to index more complex data structures. We learn how to index tree-like data types, how to index data with relationships between documents, and how to modify index structure.
Chapter 6, Make Your Search Better, covers Apache Lucene scoring and how to influence it in Elasticsearch, the scripting capabilities of Elasticsearch, and its language analysis capabilities.
Chapter 7, Aggregations for Data Analysis, introduces you to the great world of data analysis by showing you how to use the Elasticsearch aggregation framework. We will discuss all types of aggregations—metrics, buckets, and the new pipeline aggregations that have been introduced in Elasticsearch.
Chapter 8, Beyond Full-text Searching, discusses non full-text search-related functionalities such as percolator—reversed search, and the geo-spatial capabilities of Elasticsearch. This chapter also discusses suggesters, which allow us to build a spellchecking functionality and an efficient autocomplete mechanism, and we will show how to handle deep-paging efficiently.
Chapter 9, Elasticsearch Cluster in Detail, discusses the node discovery mechanism, the recovery and gateway Elasticsearch modules, templates, caches, and the settings update API.
Chapter 10, Administrating Your Cluster, covers the Elasticsearch backup functionality, rebalancing, and shard moving. In addition to this, you will learn how to use the warm-up functionality, use the Cat API, and work with aliases.
Chapter 11, Scaling by Example, is dedicated to scaling and tuning. We will start with hardware preparations and considerations, and tuning related to a single Elasticsearch node. We will go through cluster setup and vertical scaling, ending the chapter with high querying and indexing use cases and cluster monitoring.
This book was written using Elasticsearch server 2.2, and all the examples and functions should work with it. In addition to this, you'll need a tool that allows you to send HTTP requests, such as curl, which is available for most operating systems. Please note that all the examples in this book use the previously mentioned curl tool. If you want to use another tool, please remember to format the request in an appropriate way that is understood by the tool of your choice.
In addition to this, some chapters may require additional software, such as Elasticsearch plugins, but when needed it has been explicitly mentioned.
If you are a beginner to the world of full-text search and Elasticsearch, then this book is especially for you. You will be guided through the basics of Elasticsearch and you will learn how to use some of the advanced functionalities.
If you know Elasticsearch and have worked with it, then you may find this book interesting, as it provides a nice overview of all the functionalities with examples and descriptions. However, you may encounter sections that you already know.
If you know the Apache Solr search engine, this book can also be used to compare some functionalities of Apache Solr and Elasticsearch. This may give you the knowledge about which tool is more appropriate for your use case.
If you know all the details about Elasticsearch and you know how each of the configuration parameters work, then this is definitely not the book you are looking for.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "If you use the Linux or OS X command line, the cURL package should already be available."
A block of code is set as follows:
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
Any command-line input or output is written as follows:
Warnings or important notes appear in a box like this.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/ElasticsearchServerThirdEdition_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
Welcome to the wonderful world of Elasticsearch—a great full text search and analytics engine. It doesn't matter if you are new to Elasticsearch and full text searches in general, or if you already have some experience in this. We hope that, by reading this book, you'll be able to learn and extend your knowledge of Elasticsearch. As this book is also dedicated to beginners, we decided to start with a short introduction to full text searches in general, and after that, a brief overview of Elasticsearch.
Please remember that Elasticsearch is a rapidly changing piece of software. Not only are features added, but the Elasticsearch core functionality is also constantly evolving and changing. We try to keep up with these changes, and because of this, we are giving you the third edition of the book, dedicated to Elasticsearch 2.x.
The first thing we need to do with Elasticsearch is install and configure it. With many applications, you start with the installation and configuration and usually forget the importance of these steps. We will try to guide you through these steps so that they become easier to remember. In addition to this, we will show you the simplest way to index and retrieve data without going into too much detail. The first chapter will take you on a quick ride through Elasticsearch and the full text search world. By the end of this chapter, you will have learned about the following topics:
What full text searching is
What Apache Lucene is and how it performs text analysis
How to run and configure Elasticsearch
How to index and search your data in the most basic way
Back in the days when full text searching was a term known to a small percentage of engineers, most of us used SQL databases to perform search operations. Using SQL databases to search for the data stored in them was okay to some extent. Such a search wasn't fast, especially on large amounts of data. Even now, small applications are usually good with a standard LIKE %phrase% search in a SQL database. However, as we go deeper and deeper, we start to see the limits of such an approach—a lack of scalability, not enough flexibility, and a lack of language analysis. Of course, there are additional modules that extend SQL databases with full text search capabilities, but they are still limited compared to dedicated full text search libraries and search engines such as Elasticsearch. Some of those reasons led to the creation of Apache Lucene (http://lucene.apache.org/), a library written completely in Java (http://java.com/en/), which is very fast, light, and provides language analysis for a large number of languages spoken throughout the world.
Before going into the details of the analysis process, we would like to introduce you to the glossary and overall architecture of Apache Lucene. We decided that this information is crucial for understanding how Elasticsearch works, and even though the book is not about Apache Lucene, knowing the foundation of the Elasticsearch analytics and indexing engine is vital to fully understand how this great search engine works.
The basic concepts of the mentioned library are as follows:
Document: the main data carrier used during indexing and searching, built of one or more fields, which contain the data we put in and get from Lucene
Field: a section of the document, which is built of two parts: the name and the value
Term: a unit of search representing a word from the text
Token: an occurrence of a term in the text of a field; it consists of the term's text, its start and end offsets, and a type
Apache Lucene writes all the information to a structure called the inverted index. It is a data structure that maps the terms in the index to the documents, and not the other way around, as a relational database does in its tables. You can think of an inverted index as a data structure where data is term-oriented rather than document-oriented. Let's see how a simple inverted index will look. For example, let's assume that we have documents with only a single field called title to be indexed, and the values of that field are as follows:
Elasticsearch Server (document 1)
Mastering Elasticsearch Second Edition (document 2)
Apache Solr Cookbook Third Edition (document 3)
A very simplified visualization of the Lucene inverted index could look as follows:
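Term           Count   Document(s)
apache         1       <3>
cookbook       1       <3>
edition        2       <2> <3>
elasticsearch  2       <1> <2>
mastering      1       <2>
second         1       <2>
server         1       <1>
solr           1       <3>
third          1       <3>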
Each term points to the number of documents it is present in. For example, the term edition is present in two documents, the second and the third. Such a structure allows for very efficient and fast search operations in term-based queries (but not exclusively). Because the occurrences of a term are connected to the term itself, Lucene can use the information about term occurrences to perform fast and precise scoring, giving each document a value that represents how well it matched the query.
Of course, the actual index created by Lucene is much more complicated and advanced because of additional files that include information such as term vectors (per-document inverted index), doc values (column-oriented field information), stored fields (the original and not the analyzed value of the field), and so on. However, all you need to know for now is how the data is organized and not what exactly is stored.
Each index is divided into multiple write-once, read-many structures called segments. Each segment is a miniature Apache Lucene index on its own. When indexing, after a single segment is written to disk it can't be updated, or rather it can't be fully updated; documents can't be removed from it, they can only be marked as deleted in a separate file. The reason that Lucene doesn't allow segments to be updated is the nature of the inverted index: after the fields are analyzed and put into the inverted index, there is no easy way of rebuilding the original document structure. When deleting, Lucene would have to remove the information from the segment, which translates to updating all the information within the inverted index itself.
Because segments are write-once structures, Lucene is able to merge segments together in a process called segment merging. During indexing, if Lucene decides that there are too many segments matching its merge criteria, a new, bigger segment is created—one that contains the data from the other segments. During that process, Lucene will try to remove deleted data and reclaim the space needed to hold information about those documents. Segment merging is a demanding operation, both in terms of I/O and CPU. What we have to remember for now is that searching with one large segment is faster than searching with multiple smaller ones holding the same data. That's because, in general, searching translates to just matching the query terms to the ones that are indexed. You can imagine how searching through multiple small segments and merging those results will be slower than having a single segment prepare the results.
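If you are curious about the segments of your indices, Elasticsearch exposes them through the indices segments API. A minimal sketch, assuming a local node and a hypothetical index called library:
curl -XGET 'http://localhost:9200/library/_segments?pretty'
The response lists every segment of every shard, together with its document count, the number of deleted documents, and its size on disk.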
The transformation of a document that comes to Lucene, is processed, and is put into the inverted index format is called indexing. One of the things Lucene has to do during this is data analysis. You may want some of your fields to be processed by a language analyzer so that words such as car and cars are treated as the same by your index. On the other hand, you may want other fields to be split only on whitespace characters or to be only lowercased.
Analysis is done by the analyzer, which is built of a tokenizer and zero or more token filters, and it can also have zero or more character mappers.
A tokenizer in Lucene is used to split the text into tokens, which are basically terms with additional information, such as position in the original text and length. The result of the tokenizer's work is called a token stream, where the tokens are put one by one and are ready to be processed by the filters.
Apart from the tokenizer, the Lucene analyzer is built of zero or more token filters that are used to process tokens in the token stream. Some examples of filters are as follows:
Lowercase filter: makes all the tokens lowercased
Synonyms filter: changes one token to another on the basis of synonym rules
Language stemming filters: reduce tokens to their root or base forms (stems)
Filters are processed one after another, so we have almost unlimited analytical possibilities with the addition of multiple filters, one after another.
Finally, the character mappers operate on non-analyzed text—they are used before the tokenizer. Therefore, we can easily remove HTML tags from whole parts of text without worrying about tokenization.
You may wonder how all the information we've described so far affects indexing and querying when using Lucene and all the software that is built on top of it. During indexing, Lucene will use an analyzer of your choice to process the contents of your document; of course, different analyzers can be used for different fields, so the name field of your document can be analyzed differently compared to the summary field. For example, the name field may only be tokenized on whitespaces and lowercased, so that exact matches are done and the summary field is stemmed in addition to that. We can also decide to not analyze the fields at all—we have full control over the analysis process.
During a query, your query text can be analyzed as well. However, you can also choose not to analyze your queries. This is crucial to remember because some Elasticsearch queries are analyzed and some are not. For example, prefix and term queries are not analyzed, and match queries are analyzed (we will get to that in Chapter 3, Searching Your Data). Having queries that are analyzed and not analyzed is very useful; sometimes, you may want to query a field that is not analyzed, while sometimes you may want a full text search analysis. For example, if we search for the LightRed term and the query is analyzed by an analyzer that splits words on case changes and lowercases them (note that the standard analyzer alone does not split on case changes and would produce the single term lightred), then the terms that will be searched are light and red. If we use a query type that is not analyzed, then we will explicitly search for the LightRed term. We may not want to analyze the content of the query if we are only interested in exact matches.
What you should remember about indexing and querying analysis is that the indexed terms should match the query terms. If they don't match, Lucene won't return the desired documents. For example, if you use stemming and lowercasing during indexing, you need to ensure that the terms in the query are also lowercased and stemmed, or your queries won't return any results at all. For example, let's get back to our LightRed term that we analyzed during indexing; we have it as two terms in the index: light and red. If we run a LightRed query against that data and don't analyze it, we won't get the document in the results—the query term does not match the indexed terms. It is important to keep the token filters in the same order during indexing and query time analysis so that the terms resulting from such an analysis are the same.
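A convenient way to see what a given analyzer does with your text is the _analyze API. As a quick sketch, assuming a local Elasticsearch instance running on the default port, the following request shows the terms the standard analyzer produces for the text Light Red:
curl -XGET 'http://localhost:9200/_analyze?analyzer=standard&text=Light+Red&pretty'
The response contains two tokens, light and red, together with their positions and offsets—exactly what would end up in the inverted index for a field using this analyzer.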
There is one additional thing that we only mentioned once till now—scoring. What is the score of a document? The score is a result of a scoring formula that describes how well the document matches the query. By default, Apache Lucene uses the TF/IDF (term frequency/inverse document frequency) scoring mechanism, which is an algorithm that calculates how relevant the document is in the context of our query. Of course, it is not the only algorithm available, and we will mention other algorithms in the Mappings configuration section of Chapter 2, Indexing Your Data.
If you want to read more about the Apache Lucene TF/IDF scoring formula, please visit Apache Lucene Javadocs for the TFIDF. The similarity class is available at http://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html.
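To give you a rough feeling of how it works, Lucene's practical scoring function can be written in a simplified form as follows (this is a sketch of the formula from the TFIDFSimilarity documentation, omitting some normalization factors):
score(q, d) = coord(q, d) · Σ over each term t in q of [ tf(t, d) · idf(t)² · boost(t) · norm(t, d) ]
where tf(t, d) = sqrt(frequency of t in d) and idf(t) = 1 + ln(numDocs / (docFreq(t) + 1)). In plain words, a document scores higher when the query terms occur in it more often, when those terms are rare across the whole index, and when they occur in shorter fields.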
Elasticsearch is an open source search server project started by Shay Banon and published in February 2010. Since then, the project has grown into a major player in the field of search and data analysis solutions and is widely used in many common and lesser-known search and data analysis platforms. In addition, due to its distributed nature and real-time search and analytics capabilities, many organizations use it as a document store.
In the next few pages, we will take you through the basic concepts of Elasticsearch. You can skip this section if you are already familiar with the Elasticsearch architecture. However, if you are not, we strongly advise you to read it. We will refer to the key terms introduced in this section throughout the rest of the book, and understanding them is crucial to fully utilizing Elasticsearch.
An index is the logical place where Elasticsearch stores the data. Each index can be spread onto multiple Elasticsearch nodes and is divided into one or more smaller pieces called shards that are physically placed on the hard drives. If you are coming from the relational database world, you can think of an index like a table. However, the index structure is prepared for fast and efficient full text searching and, in particular, does not store original values. That structure is called an inverted index (https://en.wikipedia.org/wiki/Inverted_index).
If you know MongoDB, you can think of the Elasticsearch index as a collection in MongoDB. If you are familiar with CouchDB, you can think about an index as you would about the CouchDB database. Elasticsearch can hold many indices located on one machine or spread them over multiple servers. As we have already said, every index is built of one or more shards, and each shard can have many replicas.
The main entity stored in Elasticsearch is a document. A document can have multiple fields, each having its own type and treated differently. Using the analogy to relational databases, a document is a row of data in a database table. When you compare an Elasticsearch document to a MongoDB document, you will see that both can have different structures. The thing to keep in mind when it comes to Elasticsearch is that fields that share a name across multiple document types in the same index need to have the same data type. This means that all the documents with a field called title need to have the same data type for it, for example, string.
Documents consist of fields, and each field may occur several times in a single document (such a field is called multivalued). Each field has a type (text, number, date, and so on). The field types can also be complex—a field can contain other subdocuments or arrays. The field type is important to Elasticsearch because type determines how various operations such as analysis or sorting are performed. Fortunately, this can be determined automatically (however, we still suggest using mappings; take a look at what follows).
Unlike relational databases, documents don't need to have a fixed structure—every document may have a different set of fields, and in addition to this, the fields don't have to be known during application development. Of course, one can force a document structure with the use of schema. From the client's point of view, a document is a JSON object (see more about the JSON format at https://en.wikipedia.org/wiki/JSON). Each document is stored in one index and has its own unique identifier, which can be generated automatically by Elasticsearch, and a document type. The thing to remember is that the document identifier needs to be unique inside an index only for a given type. This means that, in a single index, two documents can have the same identifier if they are not of the same type.
In Elasticsearch, one index can store many objects serving different purposes. For example, a blog application can store articles and comments. The document type lets us easily differentiate between the objects in a single index. Every document can have a different structure, but in real-world deployments, dividing documents into types significantly helps in data manipulation. Of course, one needs to keep the limitations in mind. That is, different document types can't set different types for the same property. For example, a field called title must have the same type across all document types in a given index.
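As a minimal sketch of these concepts, assuming a local instance and hypothetical index, type, and field names, indexing one article and one comment into the same blog index could look like this:
curl -XPUT 'http://localhost:9200/blog/article/1' -d '{"title": "New version released", "content": "..."}'
curl -XPUT 'http://localhost:9200/blog/comment/1' -d '{"title": "New version released", "author": "John"}'
Both documents live in the blog index and have the identifier 1, which is allowed because they are of different types; they also share the title field, which therefore must have the same data type in both types.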
In the section about the basics of full text searching (the Full text searching section), we wrote about the process of analysis—the preparation of the input text for indexing and searching done by the underlying Apache Lucene library. Every field of the document must be properly analyzed depending on its type. For example, a different analysis chain is required for the numeric fields (numbers shouldn't be sorted alphabetically) and for the text fetched from web pages (for example, the first step would require you to omit the HTML tags as it is useless information). To be able to properly analyze at indexing and querying time, Elasticsearch stores the information about the fields of the documents in so-called mappings. Every document type has its own mapping, even if we don't explicitly define it.
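For example, a mapping for the hypothetical article type from the previous sketch could be defined explicitly as follows (the field names are, again, made up for illustration):
curl -XPUT 'http://localhost:9200/blog/_mapping/article' -d '{
  "article": {
    "properties": {
      "title": { "type": "string" },
      "published": { "type": "date" }
    }
  }
}'
If we don't provide such a definition, Elasticsearch derives the mapping automatically from the first occurrence of each field.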
Now, we already know that Elasticsearch stores its data in one or more indices and every index can contain documents of various types. We also know that each document has many fields and how Elasticsearch treats these fields is defined by the mappings. But there is more. From the beginning, Elasticsearch was created as a distributed solution that can handle billions of documents and hundreds of search requests per second. This is due to several important key features and concepts that we are going to describe in more detail now.
Elasticsearch can work as a standalone, single-search server. Nevertheless, to be able to process large sets of data and to achieve fault tolerance and high availability, Elasticsearch can be run on many cooperating servers. Collectively, these servers connected together are called a cluster and each server forming a cluster is called a node.
When we have a large number of documents, we may come to a point where a single node may not be enough—for example, because of RAM limitations, hard disk capacity, insufficient processing power, and an inability to respond to client requests fast enough. In such cases, an index (and the data in it) can be divided into smaller parts called shards (where each shard is a separate Apache Lucene index). Each shard can be placed on a different server, and thus your data can be spread among the cluster nodes. When you query an index that is built from multiple shards, Elasticsearch sends the query to each relevant shard and merges the result in such a way that your application doesn't know about the shards. In addition to this, having multiple shards can speed up indexing, because documents end up in different shards and thus the indexing operation is parallelized.
In order to increase query throughput or achieve high availability, shard replicas can be used. A replica is just an exact copy of the shard, and each shard can have zero or more replicas. In other words, Elasticsearch can have many identical shards and one of them is automatically chosen as a place where the operations that change the index are directed. This special shard is called a primary shard, and the others are called replica shards. When the primary shard is lost (for example, a server holding the shard data is unavailable), the cluster will promote the replica to be the new primary shard.
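The number of shards and replicas is specified when an index is created. As a sketch, assuming the hypothetical blog index again, the following request creates it with two primary shards, each having one replica:
curl -XPUT 'http://localhost:9200/blog' -d '{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}'
Note that the number of replicas can be changed on a live index, while the number of shards cannot.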
The cluster state is held by the gateway, which stores the cluster state and indexed data across full cluster restarts. By default, every node has this information stored locally; it is synchronized among nodes. We will discuss the gateway module in The gateway and recovery modules section of Chapter 9, Elasticsearch Cluster in Detail.
You may wonder how you can tie all the indices, shards, and replicas together in a single environment. Theoretically, it would be very difficult to fetch data from the cluster when you have to know where your document is: on which server, and in which shard. Even more difficult would be searching when one query can return documents from different shards placed on different nodes in the whole cluster. In fact, this is a complicated problem; fortunately, we don't have to care about this at all—it is handled automatically by Elasticsearch. Let's look at the following diagram:
When you send a new document to the cluster, you specify a target index and send it to any of the nodes. The node knows how many shards the target index has and is able to determine which shard should be used to store your document. Elasticsearch can alter this behavior; we will talk about this in the Introduction to routing section in Chapter 2, Indexing Your Data. The important information that you have to remember for now is that Elasticsearch calculates the shard in which the document should be placed using the unique identifier of the document—this is one of the reasons each document needs a unique identifier. After the indexing request is sent to a node, that node forwards the document to the target node, which hosts the relevant shard.
Now, let's look at the following diagram on searching request execution:
When you try to fetch a document by its identifier, the node you send the query to uses the same routing algorithm to determine the shard and the node holding the document and again forwards the request, fetches the result, and sends the result to you. On the other hand, the querying process is a more complicated one. The node receiving the query forwards it to all the nodes holding the shards that belong to a given index and asks for minimal information about the documents that match the query (by default, the document identifier and score), unless routing is used, in which case the query will go directly to a single shard only. This is called the scatter phase. After receiving this information, the aggregator node (the node that receives the client request) sorts the results and sends a second request to get the documents that are needed to build the results list (all the other information apart from the document identifier and score). This is called the gather phase. After this phase is executed, the results are returned to the client.
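From the client's perspective, all of this is hidden behind a simple request. For example, fetching a document by its identifier from our hypothetical blog index is a single call to any node of the cluster:
curl -XGET 'http://localhost:9200/blog/article/1?pretty'
The node that receives the request routes it to the shard holding the document, retrieves it, and returns it to us.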
Now the question arises: what is the replica's role in the previously described process? While indexing, replicas are only used as an additional place to store the data. When executing a query, by default, Elasticsearch will try to balance the load among the shard and its replicas so that they are evenly stressed. Also, remember that we can change this behavior; we will discuss this in the Understanding the querying process section in Chapter 3, Searching Your Data.
Installing and running Elasticsearch, even in production environments, is very easy nowadays, compared to how it was in the days of Elasticsearch 0.20.x. Going from a system without Elasticsearch to one running it takes only a few steps, which we will explore in the following sections.
Elasticsearch is a Java application and to use it we need to make sure that the Java SE environment is installed properly. Elasticsearch requires Java Version 7 or later to run. You can download it from http://www.oracle.com/technetwork/java/javase/downloads/index.html. You can also use OpenJDK (http://openjdk.java.net/) if you wish. You can, of course, use Java Version 7, but it is not supported by Oracle anymore, at least without commercial support. For example, you can't expect new, patched versions of Java 7 to be released. Because of this, we strongly suggest that you install Java 8, especially given that Java 9 seems to be right around the corner, with general availability planned for September 2016.
To install Elasticsearch, you just need to go to https://www.elastic.co/downloads/elasticsearch, choose the latest stable version of Elasticsearch, download it, and unpack it. That's it! The installation is complete.
At the time of writing, we used a snapshot of Elasticsearch 2.2. Because of this, we've skipped describing some properties that have been marked as deprecated and have been or will be removed in future versions of Elasticsearch.
The main interface to communicate with Elasticsearch is based on the HTTP protocol and REST. This means that you can even use a web browser for some basic queries and requests, but for anything more sophisticated you'll need to use additional software, such as the cURL command. If you use the Linux or OS X command line, the cURL package should already be available. If you use Windows, you can download the package from http://curl.haxx.se/download.html.
Let's run our first instance that we just downloaded as the ZIP archive and unpacked. Go to the bin directory and run the following commands depending on the OS:
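On Linux or OS X: ./elasticsearch
On Windows: elasticsearch.bat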
Congratulations! Now, you have your Elasticsearch instance up-and-running. During its work, the server usually uses two port numbers: the first one for communication with the REST API using the HTTP protocol, and the second one for the transport module used for communication in a cluster and between the native Java client and the cluster. The default port used for the HTTP API is 9200, so we can check whether Elasticsearch is ready by pointing a web browser to http://127.0.0.1:9200/. The browser should show a code snippet similar to the following:
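{
  "name" : "Blob",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.2.0",
    "lucene_version" : "5.4.1"
  },
  "tagline" : "You Know, for Search"
}
The node name is chosen at random at each startup, and a few build-related fields are omitted from this sketch.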
The output is structured as a JavaScript Object Notation (JSON) object. If you are not familiar with JSON, please take a minute and read the article available at https://en.wikipedia.org/wiki/JSON.
Elasticsearch is smart. If the default port is not available, the engine binds to the next free port. You can find information about this on the console during booting as follows:
Note the fragment with [http]. Elasticsearch uses a few ports for various tasks. The interface that we are using is handled by the HTTP module.
Now, we will use the cURL program to communicate with Elasticsearch. For example, to check the cluster health, we will use the following command:
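curl -XGET 'http://localhost:9200/_cluster/health?pretty'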
The -X parameter is a definition of the HTTP request method. The default value is GET (so in this example, we can omit this parameter). For now, do not worry about the GET value; we will describe it in more detail later in this chapter.
As a standard, the API returns information in a JSON object in which new line characters are omitted. The pretty parameter added to our requests forces Elasticsearch to add a new line character to the response, making the response more user-friendly. You can try running the preceding query with and without the ?pretty parameter to see the difference.
Elasticsearch is useful in small and medium-sized applications, but it has been built with large clusters in mind. So, now we will set up our big two-node cluster. Unpack the Elasticsearch archive in a different directory and run the second instance. If we look at the log, we will see the following:
This means that our second instance (named Big Man) discovered the previously running instance (named Blob). Here, Elasticsearch automatically formed a new two-node cluster. Starting from Elasticsearch 2.0, this will only work with nodes running on the same physical machine—because Elasticsearch 2.0 no longer supports multicast. To allow your cluster to form, you need to inform Elasticsearch about the nodes that should be contacted initially using the discovery.zen.ping.unicast.hosts array in elasticsearch.yml. For example, like this:
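discovery.zen.ping.unicast.hosts: ["192.168.2.1", "192.168.2.2"]  # hypothetical addresses of your initial nodes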
Even though we expect our cluster (or node) to run flawlessly for a lifetime, we may need to restart it or shut it down properly (for example, for maintenance). The following are the two ways in which we can shut down Elasticsearch:
If your node is attached to the console, just press Ctrl + C
Otherwise, kill the server process by sending the TERM signal (for example, using the kill command on Linux systems)
The previous versions of Elasticsearch exposed a dedicated shutdown API but, in 2.0, this option has been removed because of security reasons.
Now, let's go to the directory created by unpacking the Elasticsearch archive. We should see the following directory structure:
Directory   Description
bin         The scripts needed to run Elasticsearch instances and for plugin management
config      The directory where configuration files are located
lib         The libraries used by Elasticsearch
modules     The plugins bundled with Elasticsearch
After Elasticsearch starts, it will create the following directories (if they don't exist):
Directory   Description
data        The directory used by Elasticsearch to store all the data
logs        The files with information about events and errors
plugins     The location to store the installed plugins
work        The temporary files used by Elasticsearch
One of the reasons—of course, not the only one—why Elasticsearch is gaining more and more popularity is that getting started with Elasticsearch is quite easy. Because of the reasonable default values and automatic settings for simple environments, we can skip the configuration and go straight to indexing and querying (or to the next chapter of the book). We can do all this without changing a single line in our configuration files. However, in order to truly understand Elasticsearch, it is worth understanding some of the available settings.
We will now explore the default directories and the layout of the files provided with the Elasticsearch tar.gz archive. The entire configuration is located in the config directory. We can see two files here: elasticsearch.yml (or elasticsearch.json, which will be used if present) and logging.yml. The first file is responsible for setting the default configuration values for the server. This is important because some of these values can be changed at runtime and can be kept as a part of the cluster state, so the values in this file may not be accurate. The two values that we cannot change at runtime are cluster.name and node.name.
The cluster.name property is responsible for holding the name of our cluster. The cluster name separates different clusters from each other. Nodes configured with the same cluster name will try to form a cluster.
The second value is the instance name (the node.name property). We can leave this parameter undefined. In this case, Elasticsearch automatically chooses a unique name for itself. Note that this name is chosen during each startup, so it can be different on each restart. Defining the name can be helpful when referring to concrete instances by the API or when using monitoring tools to see what is happening to a node over long periods of time and between restarts. Think about giving descriptive names to your nodes.
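As a sketch, both properties can be set in elasticsearch.yml like this (the values are, of course, just examples):
cluster.name: books-cluster
node.name: node-1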
Other parameters are commented well in the file, so we advise you to look through it; don't worry if you do not understand the explanation. We hope that everything will become clearer after reading the next few chapters.
