Elasticsearch is a modern, fast, distributed, scalable, fault-tolerant, and open source search and analytics engine. Elasticsearch leverages the capabilities of Apache Lucene and provides a new level of control over how you can index and search even huge sets of data.
This book will give you a brief recap of the basics and also introduce you to the new features of Elasticsearch 5. We will guide you through the intermediate and advanced functionalities of Elasticsearch, such as querying, indexing, searching, and modifying data. We’ll also explore advanced concepts, including aggregation, index control, sharding, replication, and clustering.
We’ll show you the modules of monitoring and administration available in Elasticsearch, and will also cover backup and recovery. You will get an understanding of how you can scale your Elasticsearch cluster to contextualize it and improve its performance. We’ll also show you how you can create your own analysis plugin in Elasticsearch.
By the end of the book, you will have all the knowledge necessary to master Elasticsearch and put it to efficient use.
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2013
Second edition: February 2015
Third edition: February 2017
Production reference: 1160217
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78646-018-9
www.packtpub.com
Author: Bharvi Dixit
Copy Editor: Safis Editing
Reviewer: Marcelo Ochoa
Project Coordinator: Nidhi Joshi
Commissioning Editor: Amey Varangaonkar
Proofreader: Safis Editing
Acquisition Editor: Divya Poojari
Indexer: Tejal Daruwale Soni
Content Development Editor: Cheryl Dsa
Graphics: Tania Dutta
Technical Editor: Prasad Ramesh
Production Coordinator: Nilesh Mohite
Bharvi Dixit is an IT professional with extensive experience of working on search servers, NoSQL databases, and cloud services. He holds a master's degree in computer science and is currently working with Sentieo, a USA-based financial data and equity research platform, where he leads the overall platform and architecture of the company, spanning hundreds of servers. At Sentieo, he also plays a key role in the search and data team.
He is also the organizer of Delhi's Elasticsearch Meetup Group, where he speaks about Elasticsearch and Lucene and is continuously building the community around these technologies.
Bharvi also works as a freelance Elasticsearch consultant and has helped more than half a dozen organizations adopt Elasticsearch to solve their complex search problems around different use cases, such as creating search solutions for big data automated intelligence platforms in the area of counter-terrorism and risk management, as well as in other domains, such as recruitment, e-commerce, finance, social search, and log monitoring.
He has a keen interest in creating scalable backend platforms. His other areas of interest are search engineering, data analytics, and distributed computing. Java and Python are the primary languages in which he loves to write code. He has also built proprietary software for consultancy firms.
In 2013, he started working on Lucene and Elasticsearch, and in 2016, he authored his first book, Elasticsearch Essentials, which was published by Packt. He has also worked as a technical reviewer for the book Learning Kibana 5.0 by Packt.
You can connect with him on LinkedIn at https://in.linkedin.com/in/bharvidixit or can follow him on Twitter @d_bharvi.
This is my second book on Elasticsearch, and I am really delighted by the love and feedback I received from the readers of my first book, Elasticsearch Essentials. The book you are holding covers Elasticsearch 5.x, the release of Elasticsearch that brings a whole lot of features and improvements to this great search server. Hopefully, after reading this book, you will not only get to know the underlying architecture of Lucene and Elasticsearch, but also possess a command over many advanced concepts, such as scripting, improving cluster performance, writing custom Java-based plugins, and more.
Now it is time to say thank you.
I would like to thank my family for their continuous support, especially my brother, Patanjali Dixit, who has been a pillar of strength for me at each step throughout my career. I extend my big thanks to Lavleen for the love, support, and encouragement she gave during all those days when I was busy writing this book or solving complex problems at work.
I would like to extend my thanks to the Packt team working on this book, including our technical reviewer. Without their incredible support, the book wouldn't have been as great as it is now.
I would also like to thank all the people I'm working with at Sentieo for all their love and for creating a culture that helps make work more fun. At Sentieo, I extend my special thanks to Atul Shah, who always inspired me to go into the intricacies of Lucene and Elasticsearch and solve some really complex problems using these technologies.
Finally, thanks to Shay Banon for creating Elasticsearch and to all the people who contributed to the libraries and modules published around this project.
Once again, thank you.
Marcelo Ochoa works at the system laboratory of Facultad de Ciencias Exactas of the Universidad Nacional del Centro de la Provincia de Buenos Aires and is the CTO at Scotas, a company that specializes in near real-time search solutions using Apache Solr and Oracle. He divides his time between university jobs and external projects related to Oracle and big data technologies. He has worked on several Oracle-related projects, such as the translation of Oracle manuals and multimedia CBTs. His background is in database, network, web, and Java technologies. In the XML world, he is known as the developer of the DB Generator for the Apache Cocoon project. He has worked on open source projects such as DBPrism and DBPrism CMS, the Lucene-Oracle integration using the Oracle JVM Directory implementation, and the Restlet.org project, where he worked on the Oracle XDB Restlet Adapter, an alternative to writing native REST web services inside a database-resident JVM. Since 2006, he has been part of the Oracle ACE program; Oracle ACEs are known for their strong credentials as Oracle community enthusiasts and advocates, with candidates nominated by ACEs in the Oracle technology and applications communities. He has coauthored Oracle Database Programming using Java and Web Services by Digital Press and Professional XML Databases by Wrox Press, and has worked as a technical reviewer for several Packt books, such as Apache Solr 4 Cookbook, ElasticSearch Server, and others.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1786460181.
If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Welcome to the world of Elasticsearch and Mastering Elasticsearch 5.x, Third Edition. While reading the book, you'll be taken through different topics, all connected to Elasticsearch. Please remember, though, that this book is not meant for beginners; we really treat it as a follow-up to Mastering Elasticsearch, Second Edition, which was based on Elasticsearch version 1.4.x. There is a lot of new content in this book, since Elasticsearch has gone through many changes between versions 1.x and 5.x.
Throughout the book, we will discuss different topics related to Elasticsearch and Lucene. We start with an introduction to the world of Lucene and Elasticsearch, and then move on to the queries provided by Elasticsearch, where we discuss different query-related topics, such as filtering and which query to choose in a particular situation. Of course, querying is not everything, and because of that, the book you are holding in your hands provides information on the newly introduced aggregations and features that will help you give meaning to the data you have indexed in your Elasticsearch indices and provide a better search experience for your users.
We have also decided to cover approaches to data modeling and handling relational data in Elasticsearch, along with taking you through the scripting module of Elasticsearch and showing some examples of using the new default scripting language, Painless.
Even though, for most users, querying and data analysis are the most interesting parts of Elasticsearch, they are not all that we need to discuss. Because of this, the book tries to bring you additional information when it comes to index architecture, such as choosing the right number of shards and replicas, adjusting the shard allocation behavior, and so on. We will also get into places where Elasticsearch meets Lucene, and we will discuss topics such as different scoring algorithms, choosing the right store mechanism, what the differences between them are, and why choosing the proper one matters.
Last but not least, we touch on the administration part of Elasticsearch by discussing the discovery and recovery modules and the human-friendly cat API, which allows us to very quickly get relevant administrative information in a form that most humans should be able to read without parsing JSON responses. We also talk about ingest nodes, which allow you to preprocess data within Elasticsearch before indexing takes place, and about tribe nodes, which give you the ability to create federated searches across many clusters.
Because of the title of the book, we couldn't omit performance-related topics, and we decided to dedicate a whole chapter to it.
Just as with the second edition of the book, we decided to include a chapter dedicated to the development of Elasticsearch plugins, showing you how to set up an Apache Maven project and develop two types of plugins: a custom REST action and a custom analysis plugin.
At the end, we have included a chapter discussing the components of the complete Elastic Stack; after reading it, you should have a good overview of how to start with tools such as Logstash, Kibana, and Beats.
If you think that you are interested in these topics after reading about them, we think this is a book for you, and hopefully, you will like the book after reading the last words of the summary in Chapter 12, Introducing Elastic Stack 5.0.
Chapter 1, Revisiting Elasticsearch and the Changes, guides you through how Apache Lucene works and introduces you to Elasticsearch 5.x, describing the basic concepts and showing you the important changes in Elasticsearch from version 1.x to 5.x.
Chapter 2, The Improved Query DSL, describes the new default scoring algorithm, BM25, and how it improves on the previous TF-IDF algorithm. In addition to that, it explains various Elasticsearch features, such as query rewriting, query templates, changes in the query modules, and which queries to choose in a given scenario.
Chapter 3, Beyond Full Text Search, describes query rescoring, multi-match control, and the function score query. In addition to that, this chapter covers the scripting module of Elasticsearch.
Chapter 4, Data Modeling and Analytics, discusses different approaches to data modeling in Elasticsearch and also covers how to handle relationships among documents using parent-child and nested data types, with a focus on practical considerations. It further discusses the aggregation module of Elasticsearch for the purpose of data analytics.
Chapter 5, Improving the User Search Experience, focuses on improving the user search experience using suggesters, which allow you to correct spelling mistakes in user queries and build efficient autocomplete mechanisms. In addition to that, it covers how to improve query relevance and how to use synonyms for searching.
Chapter 6, The Index Distribution Architecture, covers techniques for choosing the right number of shards and replicas, how routing works, how shard allocation works, and how to alter its behavior. In addition to that, we discuss what query execution preference is and how it allows us to choose where queries are going to be executed.
Chapter 7, Low-Level Index Control, describes how to alter the Apache Lucene scoring and how to choose an alternative scoring algorithm. It also covers NRT searching and indexing, transaction log usage, and segment merging and how to tune it for your use case, along with details about the merge policies removed in Elasticsearch 5.x. At the end of the chapter, you will also find information about I/O throttling and Elasticsearch caching.
Chapter 8, Elasticsearch Administration, focuses on concepts related to administering Elasticsearch. It describes what the discovery, gateway, and recovery modules are, how to configure them, and why you should bother. We also describe what the cat API is and how to back up and restore your data to different cloud services (such as Amazon AWS and Microsoft Azure).
Chapter 9, Data Transformation and Federated Search, covers the ingest node, a new feature of Elasticsearch 5 that allows us to preprocess data inside the Elasticsearch cluster itself before indexing. It further explains how federated search works across different clusters using tribe nodes.
Chapter 10, Improving Performance, discusses Elasticsearch performance under different loads and the right way of scaling production clusters, along with insights into garbage collection and hot threads issues and how to deal with them. It further covers query profiling and query benchmarking. In the end, it gives general Elasticsearch cluster tuning advice for high query rate scenarios versus high indexing throughput scenarios.
Chapter 11, Developing Elasticsearch Plugins, covers Elasticsearch plugin development by showing and describing in depth how to write your own REST action and language analysis plugins.
Chapter 12, Introducing Elastic Stack 5.0, introduces you to the components of Elastic Stack 5.0, covering Elasticsearch, Logstash, Kibana, and Beats.
This book was written using Elasticsearch 5.0.x, and all the examples and functions should work with it. In addition to that, you'll need a command-line tool that allows you to send HTTP requests, such as curl, which is available for most operating systems. Please note that all examples in this book use the mentioned curl tool. If you want to use another tool, please remember to format the request in an appropriate way that is understood by the tool of your choice.
In addition to that, to run the examples in Chapter 11, Developing Elasticsearch Plugins, you will need a Java Development Kit (JDK) version 1.8.0_73 or above installed and an editor that will allow you to develop your code (or a Java IDE such as Eclipse). To build the code and manage dependencies in Chapter 11, Developing Elasticsearch Plugins, we are using Apache Maven.
The last chapter of this book was written using Elastic Stack 5.0.0, so you will need Logstash, Kibana, and Metricbeat, all of the same version.
This book was written for Elasticsearch users and enthusiasts who are already familiar with the basic concepts of this great search server and want to extend their knowledge of Elasticsearch. It covers topics such as how Apache Lucene and Elasticsearch work, along with the changes from Elasticsearch 1.x to 5.x. In addition to that, readers who want to see how to improve their query relevancy and learn how to extend Elasticsearch with their own plugin may find this book interesting and useful.
If you are new to Elasticsearch and you are not familiar with basic concepts, such as querying and data indexing, you may find it a little difficult to use this book as most of the chapters assume that you have this knowledge already.
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text are shown as follows: "but not the Elasticsearch term in the document field"
A block of code is set as follows:
public class CustomRestActionPlugin extends Plugin implements ActionPlugin {
  @Override
  public List<Class<? extends RestHandler>> getRestHandlers() {
    return Collections.singletonList(CustomRestAction.class);
  }
}

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
curl -XGET 'localhost:9200/clients/_search?pretty' -d '{
  "query" : {
    "prefix" : {
      "name" : {
        "prefix" : "j",
        "rewrite" : "constant_score_boolean"
      }
    }
  }
}'

Any command-line input or output is written as follows:
curl -XPUT 'localhost:9200/mastering_meta/_settings' -d '{
  "index" : {
    "auto_expand_replicas" : "0-all"
  }
}'

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "field and hit the Create button"
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of your archive extraction software.
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-ElasticSearch-5.x-Third-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringElasticSearch5dotxThirdEdition_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to the list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
Welcome to Mastering Elasticsearch 5.x, Third Edition. Elasticsearch has progressed rapidly from version 1.x, released in 2014, to version 5.x, released in 2016. During the two-and-a-half-year period since 1.0.0, adoption has skyrocketed, and both vendors and the community have committed bug-fixes, interoperability enhancements, and rich feature upgrades to ensure Elasticsearch remains the most popular NoSQL storage, indexing, and search utility for both structured and unstructured documents, as well as gaining popularity as a log analysis tool as part of the Elastic Stack.
We treat Mastering Elasticsearch as a book that will systematize your knowledge about Elasticsearch, and extend it by showing some examples of how to leverage your knowledge in certain situations. If you are looking for a book that will help you start your journey into the world of Elasticsearch, please take a look at Elasticsearch Essentials, also published by Packt.
Before going further into the book, we assume that you already know the basic concepts of Elasticsearch: how to index documents, how to send queries to get the documents you are interested in, how to narrow down the results of your queries by using filters, and how to calculate statistics for your data with the use of the aggregation mechanism. However, before getting to the exciting functionality that Elasticsearch offers, we should start with a quick overview of Apache Lucene, the full text search library that Elasticsearch uses to build and search its indices. Understanding Lucene correctly is required for mastering Elasticsearch. By the end of this chapter, we will have covered the basics of Apache Lucene, including its inverted index, analysis chain, and query language, as well as the core concepts of Elasticsearch and the important changes introduced between versions 1.x and 5.x.
In order to fully understand how Elasticsearch works, especially when it comes to indexing and query processing, it is crucial to understand how the Apache Lucene library works. Under the hood, Elasticsearch uses Lucene to handle document indexing. The same library is also used to perform a search against the indexed documents. In the next few pages, we will try to show you the basics of Apache Lucene, just in case you've never used it.
Lucene is a mature, open source, high-performance, scalable, light, and yet very powerful library written in Java. Its core comes as a single Java library file with no dependencies, and it allows you to index documents and search them with its out-of-the-box full text search capabilities. Of course, there are extensions to Apache Lucene that allow different language handling and enable spellchecking, highlighting, and much more, but if you don't need those features, you can download a single file and use it in your application.
In order to fully understand Lucene, the following terminology needs to be understood first: a document is the main data carrier used during indexing and searching, comprising one or more fields; a field is a section of a document, built of a name and a value; a term is a unit of search representing a word from the text; and a token is an occurrence of a term in the text of a field.
Apache Lucene writes all the information to a structure called the inverted index. It is a data structure that maps the terms in the index to the documents, and not the other way round, as a relational database does. You can think of an inverted index as a data structure where data is term-oriented rather than document-oriented.
Let's see how a simple inverted index can look. For example, let's assume that we have documents with only the title field to be indexed, and they look like the following:

Elasticsearch Server (document 1)
Mastering Elasticsearch (document 2)
Elasticsearch Essentials (document 3)
So, the index (in a very simple way) could be visualized as shown in the following table:
Term            Count   Document : Position
Elasticsearch   3       1:1, 2:2, 3:1
Essentials      1       3:2
Mastering       1       2:1
Server          1       1:2
As you can see, each term points to the number of documents it is present in, along with its positions in those documents. This allows for very efficient and fast searching, such as term-based queries. In addition to this, each term has a number connected to it, the count, telling Lucene how often it occurs.
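If you would like to reproduce this tiny index yourself, you can index the three example documents with curl. This is a minimal sketch, assuming a hypothetical books index, which Elasticsearch will auto-create with default settings on the first request:

curl -XPUT 'localhost:9200/books/book/1' -d '{"title": "Elasticsearch Server"}'
curl -XPUT 'localhost:9200/books/book/2' -d '{"title": "Mastering Elasticsearch"}'
curl -XPUT 'localhost:9200/books/book/3' -d '{"title": "Elasticsearch Essentials"}'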
Each index is divided into multiple write-once, read-many-times segments. Once a single segment is written to disk during indexing, it can't be updated. For example, the information about deleted documents is stored in a separate file, but the segment itself is not updated.
However, multiple segments can be merged together in a process called segment merging. Segments are merged either when a merge is forced or when Lucene decides it is time for merging to be performed, and the result is a smaller number of larger segments. Merging can be I/O demanding; however, it is needed to clean up information that is not required anymore, for example, the deleted documents. In addition to this, searching with the use of one larger segment is faster than searching against multiple smaller ones holding the same data.
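You can observe segments and trigger merging yourself through the REST layer. The following is a hedged sketch using the _cat/segments and _forcemerge APIs, reusing the hypothetical books index from the previous example:

curl -XGET 'localhost:9200/_cat/segments/books?v'
curl -XPOST 'localhost:9200/books/_forcemerge?max_num_segments=1'

The first command lists the segments backing each shard; the second asks Lucene to merge them down to at most one segment per shard, which, as described, can be I/O intensive and is best reserved for indices that are no longer being written to.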
Of course, the actual index created by Lucene is much more complicated and advanced, and consists of more than the terms, their counts, and documents, in which they are present. We would like to tell you about a few of these additional index pieces because even though they are internal, it is usually good to know about them, as they can be very useful.
A norm is a factor associated with each indexed document; it stores normalization factors used to compute the score relative to the query. Norms are computed on the basis of index-time boosts and are indexed along with the documents. With the use of norms, Lucene is able to provide index-time boosting functionality at the cost of some additional space needed for storing the norms and some additional memory.
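If a field is used only for filtering or exact matching and never contributes to scoring, you can reclaim that space by disabling norms in the mapping. A minimal sketch with hypothetical index and field names:

curl -XPUT 'localhost:9200/logs' -d '{
  "mappings": {
    "log": {
      "properties": {
        "level": { "type": "text", "norms": false }
      }
    }
  }
}'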
Term vectors are small inverted indices per document. They consist of pairs (a term and its frequency) and can optionally include information about term positions. By default, Lucene and Elasticsearch don't enable term vector indexing, but some functionalities, such as fast vector highlighting, require them to be present.
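Term vector indexing can be enabled per field in the mapping when a functionality such as the fast vector highlighter needs it. A hedged example, again with hypothetical names:

curl -XPUT 'localhost:9200/articles' -d '{
  "mappings": {
    "article": {
      "properties": {
        "content": { "type": "text", "term_vector": "with_positions_offsets" }
      }
    }
  }
}'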
With the release of Lucene 4.0, the library introduced the so-called codec architecture, giving developers control over how the index files are written onto the disk. One of the parts of the index is the posting format, which stores fields, terms, documents, term positions and offsets, and, finally, the payloads (a byte array stored at an arbitrary position in the Lucene index, which can contain any information we want). Lucene contains different posting formats for different purposes; for example, one that is optimized for high-cardinality fields, such as a unique identifier.
As we have already mentioned, the Lucene index is a so-called inverted index. However, for certain features, such as aggregations, such an architecture is not the best one. The mentioned functionality operates on the document level and not the term level, so Elasticsearch would need to uninvert the index before calculations could be done. Because of that, doc values were introduced: an additional structure used for sorting and aggregations. The doc values store uninverted data for the fields they are turned on for. Both Lucene and Elasticsearch allow us to configure the implementation used to store them, giving us the possibility of memory-based doc values, disk-based doc values, or a combination of the two. Doc values have been enabled by default in Elasticsearch since the 2.x release.
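Doc values can likewise be controlled per field. If you are certain a field will never be sorted or aggregated on, you can switch them off to save disk space; the following sketch uses hypothetical names:

curl -XPUT 'localhost:9200/sessions' -d '{
  "mappings": {
    "session": {
      "properties": {
        "session_id": { "type": "keyword", "doc_values": false }
      }
    }
  }
}'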
When we index a document into Elasticsearch, it goes through an analysis phase, which is necessary in order to create the inverted indices. It is a series of steps performed by Lucene, which are depicted in the following image:
Analysis is done by the analyzer, which is built of a tokenizer and zero or more filters, and can also have zero or more character filters.
A tokenizer in Lucene is used to divide the text into tokens, which are basically terms with additional information, such as their position in the original text and their length. The result of the tokenizer's work is a so-called token stream, where the tokens are put one by one and are ready to be processed by the filters.
Apart from the tokenizer, a Lucene analyzer is built of zero or more filters that are used to process tokens in the token stream. For example, a filter can remove tokens from the stream, change them, or even produce new ones. There are numerous filters, and you can easily create new ones. Some examples are the lowercase filter, which lowercases all the tokens; the synonyms filter, which changes one token to another on the basis of synonym rules; and language-stemming filters, which reduce tokens to their root forms.
Filters are processed one after another, so we have almost unlimited analysis possibilities by chaining multiple filters.
The last thing is character filtering, which is applied before the tokenizer and is responsible for processing text before any tokenization is done. One example of a character filter is the HTML tag removal process.
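You can inspect the whole analysis chain with the _analyze API, passing a tokenizer and filters and looking at the tokens that come out. A minimal sketch (the components shown are the standard built-in ones; the text is arbitrary):

curl -XGET 'localhost:9200/_analyze?pretty' -d '{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Mastering Elasticsearch 5.x"
}'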
The same analysis phase is also applied during query time. However, you can also choose the other path and not analyze your queries. This is crucial to remember, because some Elasticsearch queries are analyzed and some are not. For example, the prefix query is not analyzed, while the match query is analyzed.
What you should remember about indexing and querying analysis is that the indexed terms must match the query terms. If they don't match, Lucene won't return the desired documents. For example, if you are using stemming and lowercasing during indexing, you need to be sure that the terms in the query are also lowercased and stemmed, or your queries will return no results at all.
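To see this mismatch in practice, compare a term query, which is not analyzed, with a match query, which is. Assuming the hypothetical books index from earlier, whose title field is analyzed with the default, lowercasing analyzer, the following sketch illustrates the difference:

curl -XGET 'localhost:9200/books/_search?pretty' -d '{
  "query": { "term": { "title": "Mastering" } }
}'
curl -XGET 'localhost:9200/books/_search?pretty' -d '{
  "query": { "match": { "title": "Mastering" } }
}'

The first query returns no hits, because the index contains the lowercased term mastering while the term query looks for Mastering verbatim; the second query analyzes its input the same way the field was analyzed and therefore matches.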
Some of the query types provided by Elasticsearch support Apache Lucene query parser syntax. Because of this, it is crucial to understand the Lucene query language.
A query is divided by Apache Lucene into terms and operators. A term, in Lucene, can be a single word or a phrase (a group of words surrounded by double quote characters). If the query is set to be analyzed, the defined analyzer will be used on each of the terms that form the query.
A query can also contain Boolean operators that connect terms to each other, forming clauses. The Boolean operators are AND (both of the connected terms must match), OR (at least one of the connected terms must match), and NOT (the term following the operator must not match).
In addition to these, we may use the + operator, which requires the given term to be present in the document, and the - operator, which requires the term to be absent from the document.
When no operator is specified, the default OR operator is used.
In addition to all these, there is one more thing: you can use parentheses to group clauses together; for example, with something like the following query:
Elasticsearch AND (mastering OR book)

Of course, just like in Elasticsearch, in Lucene all your data is stored in fields that build the document. In order to run a query against a field, you need to provide the field name, add the colon character, and provide the clause that should be run against that field. For example, if you would like to match documents with the term Elasticsearch in the title field, you would run the following query:
title:Elasticsearch

You can also group multiple clauses. For example, if you would like your query to match all the documents having the Elasticsearch term and the mastering book phrase in the title field, you could run a query like the following code:
title:(+Elasticsearch +"mastering book")

The previous query can also be expressed in the following way:
+title:Elasticsearch +title:"mastering book"

In addition to the standard field query with a simple term or clause, Lucene allows us to modify the terms we pass in the query with modifiers. The most common modifiers, which you will be familiar with, are wildcards. There are two wildcards supported by Lucene, ? and *. The first one matches any single character, and the second one matches zero or more characters.
In addition to this, Lucene supports fuzzy and proximity searches with the use of the ~ character and an integer following it. When used with a single word term, it means that we want to search for terms that are similar to the one we've modified (the so-called fuzzy search). The integer after the ~ character specifies the maximum number of edits that can be done to consider the term similar. For example, if we would run a query, such as writer~2, both the terms writer and writers would be considered a match.
When the ~ character is used on a phrase, the integer number we provide tells Lucene how much distance between the words is acceptable. For example, let's take the following query:
title:"mastering Elasticsearch"It would match the document with the title field containing mastering Elasticsearch, but not mastering book Elasticsearch. However, if we ran a query, such as title:"mastering Elasticsearch"~2, it would result in both example documents being matched.
We can also use boosting to increase a term's importance by using the ^ character followed by a float number. Boosts lower than 1 decrease the document's importance, while boosts higher than 1 increase it. The default boost value is 1. Please refer to The changed default text scoring in Lucene - BM25 section in Chapter 2, The Improved Query DSL, for further information on what boosting is and how it is taken into consideration during document scoring.
In addition to all these, we can use square and curly brackets to allow range searching. For example, if we would like to run a range search on a numeric field, we could run the following query:
price:[10.00 TO 15.00]

The preceding query would result in all documents with the price field between 10.00 and 15.00 inclusive.
In the case of string-based fields, we can also run a range query; for example:

name:[Adam TO Adria]

The preceding query would result in all documents containing terms between Adam and Adria in the name field, including the bounds.
If you would like your range bound or bounds to be exclusive, use curly brackets instead of the square ones. For example, in order to find documents with the price field between 10.00 inclusive and 15.00 exclusive, we would run the following query:
price:[10.00 TO 15.00}

If you would like your range to be bound on one side only, for example, querying for documents with a price of 10.00 or higher, you would run the following query:
price:[10.00 TO *]

In case you want to search for one of the special characters (which are +, -, &&, ||, !, (, ), { }, [ ], ^, ", ~, *, ?, :, \, /), you need to escape it with the use of the backslash (\) character. For example, to search for the abc"efg term, you need to write abc\"efg.
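Within Elasticsearch, this Lucene syntax is available through the query_string query. A hedged sketch against the hypothetical books index used earlier:

curl -XGET 'localhost:9200/books/_search?pretty' -d '{
  "query": {
    "query_string": {
      "default_field": "title",
      "query": "Elasticsearch AND (mastering OR essentials)"
    }
  }
}'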
Although we've said that we expect the reader to be familiar with Elasticsearch, we would really like to give you a short introduction to the concepts of this great search engine.
As you probably know, Elasticsearch is a distributed full text search and analytics engine that is built on top of Lucene and used to build search- and analysis-oriented applications. It was originally started by Shay Banon and first published in February 2010. Since then, it has rapidly gained popularity and has become an important alternative to other open source and commercial solutions. It is one of the most downloaded open source projects.
There are a few concepts that come with Elasticsearch, and their understanding is crucial to fully understand how Elasticsearch works and operates: the index, a logical collection of documents; the document, the main data carrier, expressed as JSON; the type, a logical grouping of documents inside an index; the mapping, which defines how document fields are analyzed and stored; the node, a single Elasticsearch server instance; the cluster, a set of cooperating nodes; and shards and replicas, which we describe next.
A shard can be either a primary or a replica. A primary shard is the one to which all the operations that change the index are directed. A replica shard contains a duplicate of the primary shard's data and helps in speeding up searches as well as in providing high availability; in case the machine that holds the primary shard goes down, a replica shard is promoted to primary automatically.
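The number of primary shards is fixed per index at creation time, while the number of replicas can also be changed later. A minimal sketch with hypothetical values, creating an index spread over three primary shards with one replica each (six shards in total):

curl -XPUT 'localhost:9200/library' -d '{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}'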
Elasticsearch uses the zen discovery module for cluster formation. In 1.x, multicast was the default discovery type used in Elasticsearch, but in 2.x, unicast became the default. Multicast was still available in Elasticsearch 2.x as a plugin, and its support has been completely removed from Elasticsearch 5.0.
When an Elasticsearch node starts, it performs discovery and searches for the list of unicast hosts (master-eligible nodes), which are configured in the elasticsearch.yml configuration file using the discovery.zen.ping.unicast.hosts parameter. The default list of unicast hosts is ["127.0.0.1", "[::1]"], which means that, when starting, each node will only try to form a cluster with nodes running on the same machine. We will have a detailed section on zen discovery and node configuration in Chapter 8, Elasticsearch Administration.
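In practice, forming a multinode cluster therefore comes down to listing the master-eligible nodes in elasticsearch.yml. A hedged sketch with hypothetical addresses; the minimum_master_nodes setting is the usual companion to avoid split-brain situations:

discovery.zen.ping.unicast.hosts: ["192.168.1.10", "192.168.1.11", "192.168.1.12"]
discovery.zen.minimum_master_nodes: 2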
In 2015, the company behind Elasticsearch, having brought Kibana, Logstash, Beats, and Found under one roof, rebranded itself as Elastic. According to Shay Banon, the name change was part of an initiative to better align the company with the broad solutions it provides: future products and new innovations created by Elastic's massive community of developers and the enterprises that utilize the ELK Stack for everything from real-time search, to sophisticated analytics, to building modern data applications.
But having several products under one roof led to mismatched release cycles among them and started creating confusion for users. As a result, the ELK Stack was renamed the Elastic Stack, and the company decided to release all components of the Elastic Stack together, sharing the same version number, to keep pace with your deployments, simplify compatibility testing, and make it even easier for developers to add new functionality across the stack.
The very first GA release under the Elastic Stack is 5.0.0, which will be covered throughout this book. Further, Elasticsearch keeps pace with Lucene version releases to incorporate bug fixes and the latest features. Elasticsearch 5.0 is based on Lucene 6, a major Lucene release with some awesome new features and a focus on improving search speed. We will discuss Lucene 6 in upcoming chapters to show how Elasticsearch gains some awesome improvements, from both the search and storage points of view.
Elasticsearch 5.x has many improvements and has gone through a great deal of refactoring, which caused the removal or deprecation of some features. We will keep discussing the removed, improved, and new features in upcoming chapters, but for now, let's take an overview of what is new and improved in Elasticsearch.
Following are some of the most important features introduced in Elasticsearch version 5.0:
We will cover the ingest node and the shrink API in detail in Chapter 9, Data Transformation and Federated Search.
Apart from the features just discussed, you can also benefit from all of the new features that came with Elasticsearch version 2.x. For those who have not had a look at the 2.x series, let's have a quick recap of the new features that came with it:
