Leverage Elasticsearch to create a robust, fast, and flexible search solution with ease
If you are a competent developer and want to learn about the great and exciting world of Elasticsearch, then this book is for you. No prior knowledge of Java or Apache Lucene is needed.
Elasticsearch is a very fast and scalable open source search engine, designed with distribution and the cloud in mind, complete with all the goodies that Apache Lucene has to offer. Elasticsearch's schema-free architecture allows developers to index and search unstructured content, making it perfectly suited for both small projects and large big data warehouses, even those with petabytes of unstructured data.
This book will guide you through the world of the most commonly used Elasticsearch server functionalities. You'll start off by getting an understanding of the basics of Elasticsearch and its data indexing functionality. Next, you will see the querying capabilities of Elasticsearch, followed by a thorough explanation of scoring and search relevance. After this, you will explore the aggregation and data analysis capabilities of Elasticsearch and will learn how cluster administration and scaling can be used to boost your application performance. You'll find out how to use the friendly REST APIs and how to tune Elasticsearch to make the most of it. By the end of this book, you will be able to create amazing search solutions as per your project's specifications.
This step-by-step guide is full of screenshots and real-world examples to take you on a journey through the wonderful world of full text search provided by Elasticsearch.
Page count: 718
Year of publication: 2016
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2013
Second edition: February 2015
Third edition: February 2016
Production reference: 1230216
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-881-6
www.packtpub.com
Authors
Rafał Kuć
Marek Rogoziński
Reviewer
Paige Cook
Commissioning Editor
Nadeem Bagban
Acquisition Editor
Divya Poojari
Content Development Editor
Kirti Patil
Technical Editor
Utkarsha S. Kadam
Copy Editor
Alpha Singh
Project Coordinator
Nidhi Joshi
Proofreader
Safis Editing
Indexer
Rekha Nair
Graphics
Jason Monteiro
Production Coordinator
Manu Joseph
Cover Work
Manu Joseph
Rafał Kuć is a software engineer, trainer, speaker, and consultant. He works as a consultant and software engineer at Sematext Group Inc., where he concentrates on open source technologies such as Apache Lucene, Solr, and Elasticsearch. He has more than 14 years of experience in various software domains—from banking software to e-commerce products. He is mainly focused on Java; however, he is open to every tool and programming language that might help him achieve his goals easily and quickly. Rafał is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people solve their Solr and Lucene problems. He is also a speaker at various conferences around the world, such as Lucene Eurocon, Berlin Buzzwords, ApacheCon, Lucene/Solr Revolution, Velocity, and DevOps Days.
Rafał began his journey with Lucene in 2002; however, it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then Solr came and that was it. He started working with Elasticsearch in the middle of 2010. At present, Lucene, Solr, Elasticsearch, and information retrieval are his main areas of interest.
Rafał is also the author of the Solr Cookbook series, ElasticSearch Server and its second edition, and the first and second editions of Mastering ElasticSearch, all published by Packt Publishing.
Marek Rogoziński is a software architect and consultant with more than 10 years of experience. His specialization concerns solutions based on open source search engines, such as Solr and Elasticsearch, and the software stack for big data analytics, including Hadoop, HBase, and Twitter Storm.
He is also a cofounder of the solr.pl site, which publishes information and tutorials about Solr and Lucene libraries. He is the coauthor of ElasticSearch Server and its second edition, and the first and second editions of Mastering ElasticSearch, all published by Packt Publishing.
He is currently the chief technology officer and lead architect at ZenCard, a company that processes and analyzes large quantities of payment transactions in real time, allowing automatic and anonymous identification of retail customers on all retailer channels (m-commerce/e-commerce/brick&mortar) and giving retailers a customer retention and loyalty tool.
Paige Cook works as a software architect for Videa, part of the Cox Family of Companies, and lives near Atlanta, Georgia. He has twenty years of experience in software development, primarily with the Microsoft .NET Framework. His career has been largely focused on building enterprise solutions for the media and entertainment industry. He is especially interested in search technologies using the Apache Lucene search engine and has experience with both Elasticsearch and Apache Solr. Apart from his work, he enjoys DIY home projects and spending time with his wife and two daughters.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Welcome to Elasticsearch Server, Third Edition. This is the third installment of the book, dedicated to yet another major release of Elasticsearch—this time, version 2.2. In the third edition, we decided to take a similar route to the one we took when we wrote the second edition of the book. We not only updated the content to match the new version of Elasticsearch, but also restructured the book by removing and adding sections and chapters. We read the suggestions we received from you—the readers of the book—and carefully tried to incorporate the comments sent to us since the release of the first and second editions.
While reading this book, you will be taken on a journey through the wonderful world of full-text search provided by the Elasticsearch server. We will start with a general introduction to Elasticsearch, which covers how to start and run Elasticsearch, its basic concepts, and how to index and search your data in the most basic way. This book will also discuss the query language, the so-called Query DSL, which allows you to create complicated queries and filter returned results. In addition to all of this, you'll see how you can use the aggregation framework to calculate aggregated data based on the results returned by your queries. We will implement the autocomplete functionality together and learn how to use Elasticsearch's spatial capabilities and prospective search.
Finally, this book will show you Elasticsearch's administration API capabilities with features such as shard placement control, cluster handling, and more, ending with a dedicated chapter that discusses how to prepare Elasticsearch for small and large deployments—both ones that concentrate on indexing and ones that concentrate on querying.
Chapter 1, Getting Started with Elasticsearch Cluster, covers what full-text searching is, what Apache Lucene is, what text analysis is, how to run and configure Elasticsearch, and finally, how to index and search your data in the most basic way.
Chapter 2, Indexing Your Data, shows how indexing works, how to prepare index structure, what data types we are allowed to use, how to speed up indexing, what segments are, how merging works, and what routing is.
Chapter 3, Searching Your Data, introduces the full-text search capabilities of Elasticsearch by discussing how to query it, how the querying process works, and what types of basic and compound queries are available. In addition to this, we will show how to use position-aware queries in Elasticsearch.
Chapter 4, Extending Your Query Knowledge, shows how to efficiently narrow down your search results by using filters, how highlighting works, how to sort your results, and how query rewrite works.
Chapter 5, Extending Your Index Structure, shows how to index more complex data structures. We learn how to index tree-like data types, how to index data with relationships between documents, and how to modify index structure.
Chapter 6, Make Your Search Better, covers Apache Lucene scoring and how to influence it in Elasticsearch, the scripting capabilities of Elasticsearch, and its language analysis capabilities.
Chapter 7, Aggregations for Data Analysis, introduces you to the great world of data analysis by showing you how to use the Elasticsearch aggregation framework. We will discuss all types of aggregations—metrics, buckets, and the new pipeline aggregations that have been introduced in Elasticsearch.
Chapter 8, Beyond Full-text Searching, discusses non full-text search-related functionalities such as percolator—reversed search, and the geo-spatial capabilities of Elasticsearch. This chapter also discusses suggesters, which allow us to build a spellchecking functionality and an efficient autocomplete mechanism, and we will show how to handle deep-paging efficiently.
Chapter 9, Elasticsearch Cluster in Detail, discusses the node discovery mechanism, the recovery and gateway Elasticsearch modules, templates, caches, and the settings update API.
Chapter 10, Administrating Your Cluster, covers the Elasticsearch backup functionality, rebalancing, and shard moving. In addition to this, you will learn how to use the warm-up functionality, use the Cat API, and work with aliases.
Chapter 11, Scaling by Example, is dedicated to scaling and tuning. We will start with hardware preparations and considerations, and tuning related to a single Elasticsearch node. We will go through cluster setup and vertical scaling, ending the chapter with high querying and indexing use cases and cluster monitoring.
This book was written using Elasticsearch server 2.2, and all the examples and functions should work with it. In addition to this, you'll need a tool that allows you to send HTTP requests, such as curl, which is available for most operating systems. Please note that all the examples in this book use the previously mentioned curl tool. If you want to use another tool, please remember to format the request in an appropriate way that is understood by the tool of your choice.
In addition to this, some chapters may require additional software, such as Elasticsearch plugins, but when needed it has been explicitly mentioned.
If you are a beginner to the world of full-text search and Elasticsearch, then this book is especially for you. You will be guided through the basics of Elasticsearch and you will learn how to use some of the advanced functionalities.
If you know Elasticsearch and have worked with it, then you may find this book interesting, as it provides a nice overview of all the functionalities with examples and descriptions. However, you may encounter sections that you already know.
If you know the Apache Solr search engine, this book can also be used to compare some functionalities of Apache Solr and Elasticsearch. This may give you the knowledge about which tool is more appropriate for your use case.
If you know all the details about Elasticsearch and you know how each of the configuration parameters work, then this is definitely not the book you are looking for.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "If you use the Linux or OS X command line, the cURL package should already be available."
A block of code is set as follows:
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
Any command-line input or output is written as follows:
Warnings or important notes appear in a box like this.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/ElasticsearchServerThirdEdition_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
Welcome to the wonderful world of Elasticsearch—a great full text search and analytics engine. It doesn't matter if you are new to Elasticsearch and full text searches in general, or if you already have some experience in this. We hope that, by reading this book, you'll be able to learn and extend your knowledge of Elasticsearch. As this book is also dedicated to beginners, we decided to start with a short introduction to full text searches in general, and after that, a brief overview of Elasticsearch.
Please remember that Elasticsearch is a rapidly changing piece of software. Not only are features added, but the Elasticsearch core functionality is also constantly evolving and changing. We try to keep up with these changes, and because of this, we are giving you the third edition of the book, dedicated to Elasticsearch 2.x.
The first thing we need to do with Elasticsearch is install and configure it. With many applications, you start with the installation and configuration and usually forget the importance of these steps. We will try to guide you through these steps so that they become easier to remember. In addition to this, we will show you the simplest way to index and retrieve data without going into too much detail. The first chapter will take you on a quick ride through Elasticsearch and the full text search world. By the end of this chapter, you will have learned about the following topics:
What full text searching is
What Apache Lucene is and how it performs text analysis
How to run and configure Elasticsearch
How to index and search your data in the most basic way
Back in the days when full text searching was a term known to a small percentage of engineers, most of us used SQL databases to perform search operations. Using SQL databases to search for the data stored in them was okay to some extent. Such a search wasn't fast, especially on large amounts of data. Even now, small applications are usually good with a standard LIKE %phrase% search in a SQL database. However, as we go deeper and deeper, we start to see the limits of such an approach—a lack of scalability, not enough flexibility, and a lack of language analysis. Of course, there are additional modules that extend SQL databases with full text search capabilities, but they are still limited compared to dedicated full text search libraries and search engines such as Elasticsearch. Some of those reasons led to the creation of Apache Lucene (http://lucene.apache.org/), a library written completely in Java (http://java.com/en/), which is very fast, light, and provides language analysis for a large number of languages spoken throughout the world.
Before going into the details of the analysis process, we would like to introduce you to the glossary and overall architecture of Apache Lucene. We decided that this information is crucial for understanding how Elasticsearch works, and even though the book is not about Apache Lucene, knowing the foundation of the Elasticsearch analytics and indexing engine is vital to fully understand how this great search engine works.
The basic concepts of the mentioned library are as follows:
Document: the main data carrier used during indexing and searching, built of one or more fields, which contain the data we put in and get from Lucene
Field: a section of the document, which is built of two parts: the name and the value
Term: a unit of search representing a word from the text
Token: an occurrence of a term in the text of a field; it consists of the term's text, its start and end offsets, and a type
Apache Lucene writes all the information to a structure called the inverted index. It is a data structure that maps the terms in the index to the documents, and not the other way around, as a relational database does in its tables. You can think of an inverted index as a data structure where data is term-oriented rather than document-oriented. Let's see how a simple inverted index will look. For example, let's assume that we have documents with only a single field called title to be indexed, and the values of that field are as follows:
Elasticsearch Server (document 1)
Mastering Elasticsearch Second Edition (document 2)
Apache Solr Cookbook Third Edition (document 3)
A very simplified visualization of the Lucene inverted index could look as follows:
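Term           Count   Document(s)
apache         1       <3>
cookbook       1       <3>
edition        2       <2> <3>
elasticsearch  2       <1> <2>
mastering      1       <2>
second         1       <2>
server         1       <1>
solr           1       <3>
third          1       <3>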
Each term points to the number of documents it is present in. For example, the term edition is present in two documents, the second and the third. Such a structure allows for very efficient and fast search operations in term-based queries (but not exclusively). Because the occurrences of a term are connected to the term itself, Lucene can use the information about term occurrences to perform fast and precise scoring, giving each document a value that represents how well it matched the query.
Of course, the actual index created by Lucene is much more complicated and advanced because of additional files that include information such as term vectors (per-document inverted index), doc values (column-oriented field information), stored fields (the original and not the analyzed value of the field), and so on. However, all you need to know for now is how the data is organized and not what exactly is stored.
Each index is divided into multiple write-once, read-many structures called segments. Each segment is a miniature Apache Lucene index on its own. When indexing, after a single segment is written to disk it can't be updated, or rather it can't be fully updated; documents can't be removed from it, they can only be marked as deleted in a separate file. The reason that Lucene doesn't allow segments to be updated is the nature of the inverted index: after the fields are analyzed and put into the inverted index, there is no easy way of rebuilding the original document structure. When deleting, Lucene would have to remove the information from the segment, which translates to updating all the information within the inverted index itself.
Because segments are write-once structures, Lucene is able to merge segments together in a process called segment merging. During indexing, if Lucene decides that there are too many segments matching its merge criteria, a new, bigger segment is created—one that contains the data from the other segments. During that process, Lucene will try to remove deleted data and reclaim the space needed to hold information about those documents. Segment merging is a demanding operation, both in terms of I/O and CPU. What we have to remember for now is that searching with one large segment is faster than searching with multiple smaller ones holding the same data. That's because, in general, searching translates to just matching the query terms to the ones that are indexed. You can imagine how searching through multiple small segments and merging those results will be slower than having a single segment prepare the results.
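If you are curious about the segments of your indices, Elasticsearch exposes them through the indices segments API. A minimal sketch, assuming a local node and a hypothetical index called library:
curl -XGET 'http://localhost:9200/library/_segments?pretty'
The response lists every segment of every shard, together with its document count, the number of deleted documents, and its size on disk.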
The transformation of a document that comes to Lucene, is processed, and is put into the inverted index format is called indexing. One of the things Lucene has to do during this is data analysis. You may want some of your fields to be processed by a language analyzer so that words such as car and cars are treated as the same by your index. On the other hand, you may want other fields to be split only on whitespace characters or to be only lowercased.
Analysis is done by the analyzer, which is built of a tokenizer and zero or more token filters, and it can also have zero or more character mappers.
A tokenizer in Lucene is used to split the text into tokens, which are basically terms with additional information, such as position in the original text and length. The result of the tokenizer's work is called a token stream, where the tokens are put one by one and are ready to be processed by the filters.
Apart from the tokenizer, the Lucene analyzer is built of zero or more token filters that are used to process tokens in the token stream. Some examples of filters are as follows:
Lowercase filter: makes all the tokens lowercased
Synonyms filter: changes one token to another on the basis of synonym rules
Language stemming filters: reduce tokens to their root or base forms (stems)
Filters are processed one after another, so we have almost unlimited analytical possibilities with the addition of multiple filters, one after another.
Finally, the character mappers operate on non-analyzed text—they are used before the tokenizer. Therefore, we can easily remove HTML tags from whole parts of text without worrying about tokenization.
You may wonder how all the information we've described so far affects indexing and querying when using Lucene and all the software that is built on top of it. During indexing, Lucene will use an analyzer of your choice to process the contents of your document; of course, different analyzers can be used for different fields, so the name field of your document can be analyzed differently compared to the summary field. For example, the name field may only be tokenized on whitespaces and lowercased, so that exact matches are done and the summary field is stemmed in addition to that. We can also decide to not analyze the fields at all—we have full control over the analysis process.
During a query, your query text can be analyzed as well. However, you can also choose not to analyze your queries. This is crucial to remember because some Elasticsearch queries are analyzed and some are not. For example, prefix and term queries are not analyzed, and match queries are analyzed (we will get to that in Chapter 3, Searching Your Data). Having queries that are analyzed and not analyzed is very useful; sometimes, you may want to query a field that is not analyzed, while sometimes you may want a full text search analysis. For example, if we search for the LightRed term and the query is analyzed by an analyzer that splits words on case changes and lowercases them (note that the standard analyzer alone does not split on case changes and would produce the single term lightred), then the terms that will be searched are light and red. If we use a query type that is not analyzed, then we will explicitly search for the LightRed term. We may not want to analyze the content of the query if we are only interested in exact matches.
What you should remember about indexing and querying analysis is that the indexed terms should match the query terms. If they don't match, Lucene won't return the desired documents. For example, if you use stemming and lowercasing during indexing, you need to ensure that the terms in the query are also lowercased and stemmed, or your queries won't return any results at all. For example, let's get back to our LightRed term that we analyzed during indexing; we have it as two terms in the index: light and red. If we run a LightRed query against that data and don't analyze it, we won't get the document in the results—the query term does not match the indexed terms. It is important to keep the token filters in the same order during indexing and query time analysis so that the terms resulting from such an analysis are the same.
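A convenient way to see what a given analyzer does with your text is the _analyze API. As a quick sketch, assuming a local Elasticsearch instance running on the default port, the following request shows the terms the standard analyzer produces for the text Light Red:
curl -XGET 'http://localhost:9200/_analyze?analyzer=standard&text=Light+Red&pretty'
The response contains two tokens, light and red, together with their positions and offsets—exactly what would end up in the inverted index for a field using this analyzer.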
There is one additional thing that we only mentioned once till now—scoring. What is the score of a document? The score is a result of a scoring formula that describes how well the document matches the query. By default, Apache Lucene uses the TF/IDF (term frequency/inverse document frequency) scoring mechanism, which is an algorithm that calculates how relevant the document is in the context of our query. Of course, it is not the only algorithm available, and we will mention other algorithms in the Mappings configuration section of Chapter 2, Indexing Your Data.
If you want to read more about the Apache Lucene TF/IDF scoring formula, please visit Apache Lucene Javadocs for the TFIDF. The similarity class is available at http://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html.
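To give you a rough feeling of how it works, Lucene's practical scoring function can be written in a simplified form as follows (this is a sketch of the formula from the TFIDFSimilarity documentation, omitting some normalization factors):
score(q, d) = coord(q, d) · Σ over each term t in q of [ tf(t, d) · idf(t)² · boost(t) · norm(t, d) ]
where tf(t, d) = sqrt(frequency of t in d) and idf(t) = 1 + ln(numDocs / (docFreq(t) + 1)). In plain words, a document scores higher when the query terms occur in it more often, when those terms are rare across the whole index, and when they occur in shorter fields.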
Elasticsearch is an open source search server project started by Shay Banon and published in February 2010. Since then, the project has grown into a major player in the field of search and data analysis solutions and is widely used in many common and lesser-known search and data analysis platforms. In addition, due to its distributed nature and real-time search and analytics capabilities, many organizations use it as a document store.
In the next few pages, we will take you through the basic concepts of Elasticsearch. You can skip this section if you are already familiar with the Elasticsearch architecture. However, if you are not, we strongly advise you to read it. We will refer to the key terms introduced in this section throughout the rest of the book, and understanding them is crucial to fully utilizing Elasticsearch.
An index is the logical place where Elasticsearch stores the data. Each index can be spread onto multiple Elasticsearch nodes and is divided into one or more smaller pieces called shards that are physically placed on the hard drives. If you are coming from the relational database world, you can think of an index like a table. However, the index structure is prepared for fast and efficient full text searching and, in particular, does not store original values. That structure is called an inverted index (https://en.wikipedia.org/wiki/Inverted_index).
If you know MongoDB, you can think of the Elasticsearch index as a collection in MongoDB. If you are familiar with CouchDB, you can think about an index as you would about the CouchDB database. Elasticsearch can hold many indices located on one machine or spread them over multiple servers. As we have already said, every index is built of one or more shards, and each shard can have many replicas.
The main entity stored in Elasticsearch is a document. A document can have multiple fields, each having its own type and treated differently. Using the analogy to relational databases, a document is a row of data in a database table. When you compare an Elasticsearch document to a MongoDB document, you will see that both can have different structures. The thing to keep in mind when it comes to Elasticsearch is that fields that share a name across multiple document types in the same index need to have the same data type. This means that all the documents with a field called title need to have the same data type for it, for example, string.
Documents consist of fields, and each field may occur several times in a single document (such a field is called multivalued). Each field has a type (text, number, date, and so on). The field types can also be complex—a field can contain other subdocuments or arrays. The field type is important to Elasticsearch because type determines how various operations such as analysis or sorting are performed. Fortunately, this can be determined automatically (however, we still suggest using mappings; take a look at what follows).
Unlike relational databases, documents don't need to have a fixed structure—every document may have a different set of fields, and in addition to this, the fields don't have to be known during application development. Of course, one can force a document structure with the use of schema. From the client's point of view, a document is a JSON object (see more about the JSON format at https://en.wikipedia.org/wiki/JSON). Each document is stored in one index and has its own unique identifier, which can be generated automatically by Elasticsearch, and a document type. The thing to remember is that the document identifier needs to be unique inside an index only for a given type. This means that, in a single index, two documents can have the same identifier if they are not of the same type.
In Elasticsearch, one index can store many objects serving different purposes. For example, a blog application can store articles and comments. The document type lets us easily differentiate between the objects in a single index. Every document can have a different structure, but in real-world deployments, dividing documents into types significantly helps in data manipulation. Of course, one needs to keep the limitations in mind. That is, different document types can't set different types for the same property. For example, a field called title must have the same type across all document types in a given index.
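As a minimal sketch of these concepts, assuming a local instance and hypothetical index, type, and field names, indexing one article and one comment into the same blog index could look like this:
curl -XPUT 'http://localhost:9200/blog/article/1' -d '{"title": "New version released", "content": "..."}'
curl -XPUT 'http://localhost:9200/blog/comment/1' -d '{"title": "New version released", "author": "John"}'
Both documents live in the blog index and have the identifier 1, which is allowed because they are of different types; they also share the title field, which therefore must have the same data type in both types.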
In the section about the basics of full text searching (the Full text searching section), we wrote about the process of analysis—the preparation of the input text for indexing and searching done by the underlying Apache Lucene library. Every field of the document must be properly analyzed depending on its type. For example, a different analysis chain is required for the numeric fields (numbers shouldn't be sorted alphabetically) and for the text fetched from web pages (for example, the first step would require you to omit the HTML tags as it is useless information). To be able to properly analyze at indexing and querying time, Elasticsearch stores the information about the fields of the documents in so-called mappings. Every document type has its own mapping, even if we don't explicitly define it.
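For example, a mapping for the hypothetical article type from the previous sketch could be defined explicitly as follows (the field names are, again, made up for illustration):
curl -XPUT 'http://localhost:9200/blog/_mapping/article' -d '{
  "article": {
    "properties": {
      "title": { "type": "string" },
      "published": { "type": "date" }
    }
  }
}'
If we don't provide such a definition, Elasticsearch derives the mapping automatically from the first occurrence of each field.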
Now, we already know that Elasticsearch stores its data in one or more indices and every index can contain documents of various types. We also know that each document has many fields and how Elasticsearch treats these fields is defined by the mappings. But there is more. From the beginning, Elasticsearch was created as a distributed solution that can handle billions of documents and hundreds of search requests per second. This is due to several important key features and concepts that we are going to describe in more detail now.
Elasticsearch can work as a standalone, single-search server. Nevertheless, to be able to process large sets of data and to achieve fault tolerance and high availability, Elasticsearch can be run on many cooperating servers. Collectively, these servers connected together are called a cluster and each server forming a cluster is called a node.
When we have a large number of documents, we may come to a point where a single node may not be enough—for example, because of RAM limitations, hard disk capacity, insufficient processing power, and an inability to respond to client requests fast enough. In such cases, an index (and the data in it) can be divided into smaller parts called shards (where each shard is a separate Apache Lucene index). Each shard can be placed on a different server, and thus your data can be spread among the cluster nodes. When you query an index that is built from multiple shards, Elasticsearch sends the query to each relevant shard and merges the result in such a way that your application doesn't know about the shards. In addition to this, having multiple shards can speed up indexing, because documents end up in different shards and thus the indexing operation is parallelized.
In order to increase query throughput or achieve high availability, shard replicas can be used. A replica is just an exact copy of the shard, and each shard can have zero or more replicas. In other words, Elasticsearch can have many identical shards and one of them is automatically chosen as a place where the operations that change the index are directed. This special shard is called a primary shard, and the others are called replica shards. When the primary shard is lost (for example, a server holding the shard data is unavailable), the cluster will promote the replica to be the new primary shard.
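The number of shards and replicas is specified when an index is created. As a sketch, assuming the hypothetical blog index again, the following request creates it with two primary shards, each having one replica:
curl -XPUT 'http://localhost:9200/blog' -d '{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}'
Note that the number of replicas can be changed on a live index, while the number of shards cannot.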
The cluster state is held by the gateway, which stores the cluster state and indexed data across full cluster restarts. By default, every node has this information stored locally; it is synchronized among nodes. We will discuss the gateway module in The gateway and recovery modules section of Chapter 9, Elasticsearch Cluster in Detail.
You may wonder how you can tie all the indices, shards, and replicas together in a single environment. Theoretically, it would be very difficult to fetch data from the cluster when you have to know where your document is: on which server, and in which shard. Even more difficult would be searching when one query can return documents from different shards placed on different nodes in the whole cluster. In fact, this is a complicated problem; fortunately, we don't have to care about this at all—it is handled automatically by Elasticsearch. Let's look at the following diagram:
When you send a new document to the cluster, you specify a target index and send it to any of the nodes. The node knows how many shards the target index has and is able to determine which shard should be used to store your document. Elasticsearch can alter this behavior; we will talk about this in the Introduction to routing section in Chapter 2, Indexing Your Data. The important information that you have to remember for now is that Elasticsearch calculates the shard in which the document should be placed using the unique identifier of the document—this is one of the reasons each document needs a unique identifier. After the indexing request is sent to a node, that node forwards the document to the target node, which hosts the relevant shard.
Now, let's look at the following diagram on searching request execution:
When you try to fetch a document by its identifier, the node you send the query to uses the same routing algorithm to determine the shard and the node holding the document and again forwards the request, fetches the result, and sends the result to you. On the other hand, the querying process is a more complicated one. The node receiving the query forwards it to all the nodes holding the shards that belong to a given index and asks for minimal information about the documents that match the query (by default, the document identifier and score), unless routing is used, in which case the query will go directly to a single shard only. This is called the scatter phase. After receiving this information, the aggregator node (the node that receives the client request) sorts the results and sends a second request to get the documents that are needed to build the results list (all the other information apart from the document identifier and score). This is called the gather phase. After this phase is executed, the results are returned to the client.
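From the client's perspective, all of this is hidden behind a simple request. For example, fetching a document by its identifier from our hypothetical blog index is a single call to any node of the cluster:
curl -XGET 'http://localhost:9200/blog/article/1?pretty'
The node that receives the request routes it to the shard holding the document, retrieves it, and returns it to us.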
Now the question arises: what is the replica's role in the previously described process? While indexing, replicas are only used as an additional place to store the data. When executing a query, by default, Elasticsearch will try to balance the load among the shard and its replicas so that they are evenly stressed. Also, remember that we can change this behavior; we will discuss this in the Understanding the querying process section in Chapter 3, Searching Your Data.
Installing and running Elasticsearch, even in production environments, is very easy nowadays, compared to how it was in the days of Elasticsearch 0.20.x. Going from a system without Elasticsearch to one running it takes only a few steps, which we will explore in the following sections.
Elasticsearch is a Java application and to use it we need to make sure that the Java SE environment is installed properly. Elasticsearch requires Java Version 7 or later to run. You can download it from http://www.oracle.com/technetwork/java/javase/downloads/index.html. You can also use OpenJDK (http://openjdk.java.net/) if you wish. You can, of course, use Java Version 7, but it is not supported by Oracle anymore, at least without commercial support. For example, you can't expect new, patched versions of Java 7 to be released. Because of this, we strongly suggest that you install Java 8, especially given that Java 9 seems to be right around the corner, with general availability planned for September 2016.
To install Elasticsearch, you just need to go to https://www.elastic.co/downloads/elasticsearch, choose the latest stable version of Elasticsearch, download it, and unpack it. That's it! The installation is complete.
At the time of writing, we used a snapshot of Elasticsearch 2.2. Because of this, we've skipped describing some properties that have been marked as deprecated and have been or will be removed in future versions of Elasticsearch.
The main interface to communicate with Elasticsearch is based on the HTTP protocol and REST. This means that you can even use a web browser for some basic queries and requests, but for anything more sophisticated you'll need to use additional software, such as the cURL command. If you use the Linux or OS X command line, the cURL package should already be available. If you use Windows, you can download the package from http://curl.haxx.se/download.html.
Let's run our first instance that we just downloaded as the ZIP archive and unpacked. Go to the bin directory and run the following commands depending on the OS:
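On Linux or OS X: ./elasticsearch
On Windows: elasticsearch.bat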
Congratulations! Now, you have your Elasticsearch instance up-and-running. During its work, the server usually uses two port numbers: the first one for communication with the REST API using the HTTP protocol, and the second one for the transport module used for communication in a cluster and between the native Java client and the cluster. The default port used for the HTTP API is 9200, so we can check whether Elasticsearch is ready by pointing a web browser to http://127.0.0.1:9200/. The browser should show a code snippet similar to the following:
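{
  "name" : "Blob",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.2.0",
    "lucene_version" : "5.4.1"
  },
  "tagline" : "You Know, for Search"
}
The node name is chosen at random at each startup, and a few build-related fields are omitted from this sketch.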
The output is structured as a JavaScript Object Notation (JSON) object. If you are not familiar with JSON, please take a minute and read the article available at https://en.wikipedia.org/wiki/JSON.
Elasticsearch is smart. If the default port is not available, the engine binds to the next free port. You can find information about this on the console during booting as follows:
Note the fragment with [http]. Elasticsearch uses a few ports for various tasks. The interface that we are using is handled by the HTTP module.
Now, we will use the cURL program to communicate with Elasticsearch. For example, to check the cluster health, we will use the following command:
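curl -XGET 'http://localhost:9200/_cluster/health?pretty'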
The -X parameter is a definition of the HTTP request method. The default value is GET (so in this example, we can omit this parameter). For now, do not worry about the GET value; we will describe it in more detail later in this chapter.
As a standard, the API returns information in a JSON object in which new line characters are omitted. The pretty parameter added to our requests forces Elasticsearch to add a new line character to the response, making the response more user-friendly. You can try running the preceding query with and without the ?pretty parameter to see the difference.
Elasticsearch is useful in small and medium-sized applications, but it has been built with large clusters in mind. So, now we will set up our big two-node cluster. Unpack the Elasticsearch archive in a different directory and run the second instance. If we look at the log, we will see the following:
This means that our second instance (named Big Man) discovered the previously running instance (named Blob). Here, Elasticsearch automatically formed a new two-node cluster. Starting from Elasticsearch 2.0, this will only work with nodes running on the same physical machine—because Elasticsearch 2.0 no longer supports multicast. To allow your cluster to form, you need to inform Elasticsearch about the nodes that should be contacted initially using the discovery.zen.ping.unicast.hosts array in elasticsearch.yml. For example, like this:
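discovery.zen.ping.unicast.hosts: ["192.168.2.1", "192.168.2.2"]  # hypothetical addresses of your initial nodes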
Even though we expect our cluster (or node) to run flawlessly for a lifetime, we may need to restart it or shut it down properly (for example, for maintenance). The following are the two ways in which we can shut down Elasticsearch:
If your node is attached to the console, just press Ctrl + C
Otherwise, kill the server process by sending the TERM signal (for example, using the kill command on Linux systems)
The previous versions of Elasticsearch exposed a dedicated shutdown API but, in 2.0, this option has been removed because of security reasons.
Now, let's go to the directory created by unpacking the Elasticsearch archive. We should see the following directory structure:
Directory   Description
bin         The scripts needed to run Elasticsearch instances and for plugin management
config      The directory where configuration files are located
lib         The libraries used by Elasticsearch
modules     The plugins bundled with Elasticsearch
After Elasticsearch starts, it will create the following directories (if they don't exist):
Directory   Description
data        The directory used by Elasticsearch to store all the data
logs        The files with information about events and errors
plugins     The location to store the installed plugins
work        The temporary files used by Elasticsearch
One of the reasons—of course, not the only one—why Elasticsearch is gaining more and more popularity is that getting started with Elasticsearch is quite easy. Because of the reasonable default values and automatic settings for simple environments, we can skip the configuration and go straight to indexing and querying (or to the next chapter of the book). We can do all this without changing a single line in our configuration files. However, in order to truly understand Elasticsearch, it is worth understanding some of the available settings.
We will now explore the default directories and the layout of the files provided with the Elasticsearch tar.gz archive. The entire configuration is located in the config directory. We can see two files here: elasticsearch.yml (or elasticsearch.json, which will be used if present) and logging.yml. The first file is responsible for setting the default configuration values for the server. This is important because some of these values can be changed at runtime and can be kept as a part of the cluster state, so the values in this file may not be accurate. The two values that we cannot change at runtime are cluster.name and node.name.
The cluster.name property is responsible for holding the name of our cluster. The cluster name separates different clusters from each other. Nodes configured with the same cluster name will try to form a cluster.
The second value is the instance name (the node.name property). We can leave this parameter undefined. In this case, Elasticsearch automatically chooses a unique name for itself. Note that this name is chosen during each startup, so it can be different on each restart. Defining the name can be helpful when referring to concrete instances by the API or when using monitoring tools to see what is happening to a node over long periods of time and between restarts. Think about giving descriptive names to your nodes.
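As a sketch, both properties can be set in elasticsearch.yml like this (the values are, of course, just examples):
cluster.name: books-cluster
node.name: node-1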
Other parameters are commented well in the file, so we advise you to look through it; don't worry if you do not understand the explanation. We hope that everything will become clearer after reading the next few chapters.
