


Elasticsearch 8.x Cookbook

Fifth Edition

Over 180 recipes to perform fast, scalable, and reliable searches for your enterprise

Alberto Paro

BIRMINGHAM—MUMBAI

Elasticsearch 8.x Cookbook

Fifth Edition

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Devika Battike

Senior Editor: Nathanya Dias

Content Development Editor: Sean Lobo

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Project Coordinator: Aparna Ravikumar Nair

Proofreader: Safis Editing

Indexer: Manju Arasan

Production Designer: Ponraj Dhandapani

Marketing Coordinator: Priyanka Mhatre

First published: December 2013

Second edition: January 2015

Third edition: February 2017

Fourth edition: April 2019

Fifth edition: May 2022

Production reference: 1280422

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80107-981-5

www.packt.com

Contributors

About the author

Alberto Paro is an engineer, manager, and software developer. He currently works as the technology architecture delivery associate director of the Accenture Cloud First data and AI team in Italy. He loves to study emerging solutions and applications, mainly related to cloud and big data processing, NoSQL, Natural Language Processing (NLP), software development, and machine learning. In 2000, he graduated in computer science engineering from Politecnico di Milano. Then, he worked with many companies, mainly using Scala/Java and Python on knowledge management solutions and advanced data mining products, using state-of-the-art big data software. A lot of his time is spent teaching others how to effectively use big data solutions, NoSQL data stores, and related technologies.

About the reviewers

Kyle Davis is the senior developer advocate with OpenSearch and Open Distro for Elasticsearch at Amazon Web Services (AWS). Kyle has a long history of working in software development, starting in the late 1990s. His experience runs the gamut from frontend development to microcontrollers, but his most passionate area of interest is NoSQL databases. He has blogged and presented extensively about technology and is the author of Redis Microservices for Dummies. Kyle is based out of Edmonton, Alberta, Canada.

Mahipalsinh Rana is currently chief technology officer (CTO) of Inexture Solutions LLP. At Inexture, he specializes in enterprise searching, Python, Java, and ML/AI. He has 15 years of experience. His stint with search technologies started in 2010 when he started working with Solr. He then started working with Elastic and has done various large-scale implementations and consultations. At the start of his career, he worked for Sun Microsystems, where he worked on internationalization (i18n). He likes exploring emerging technology trends such as NLP and intuitive searching for e-commerce. He plans to develop a search engine for people who are still in the early stages of technological advancement to provide them with information at ease. He has also worked on Liferay Beginner's Guide by Packt.

Arpit Dubey is a big data engineer with over 14 years of experience in building large-scale, data-intensive applications. He has experience in envisioning enterprise-wide data strategies, roadmaps, and architecture for large internet companies, with varied use cases. He specializes in building event-driven architectures and real-time analytical solutions, using distributed systems such as Kafka, Flink, Spark, the Hadoop stack, NoSQL databases, and graph databases. He has been an active public speaker on various technology topics and has spoken at Kafka Summit, Druid Summit, and several other technology meetups.

I would like to thank my entire family for always being my guiding light for every path I choose and every step I take.

Table of Contents

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Sections

Getting ready

How to do it…

How it works…

There's more…

See also

Get in touch

Share Your Thoughts

Chapter 1: Getting Started

Technical requirements

Downloading and installing Elasticsearch

Getting ready

How to do it…

How it works…

There's more…

See also

Setting up networking

Getting ready

How to do it…

How it works…

See also

Setting up a node

Getting ready

How to do it…

How it works…

See also

Setting up Linux systems

Getting ready

How to do it…

How it works…

There's more…

Setting up different node roles

Getting ready

How to do it…

How it works…

There's more…

See also

Setting up a coordinating-only node

Getting ready

How to do it…

How it works…

Setting up an ingestion node

Getting ready

How to do it…

How it works…

There's more…

Installing plugins in Elasticsearch

Getting ready

How to do it…

How it works…

There's more…

See also

Removing a plugin

Getting ready

How to do it…

How it works…

Changing logging settings

Getting ready

How to do it…

How it works…

Setting up a node via Docker

Getting ready

How to do it…

How it works…

There's more…

See also

Deploying on Elastic Cloud Enterprise

Getting ready

How to do it…

How it works…

See also

Chapter 2: Managing Mappings

Technical requirements

Using explicit mapping creation

Getting ready

How to do it…

How it works…

There's more…

See also

Mapping base types

Getting ready

How to do it...

How it works...

There's more...

See also

Mapping arrays

Getting ready

How to do it…

How it works…

Mapping an object

Getting ready

How to do it…

How it works…

See also

Mapping a document

Getting ready

How to do it…

How it works…

See also

Using dynamic templates in document mapping

Getting ready

How to do it…

How it works…

There's more...

See also

Managing nested objects

Getting ready

How to do it…

How it works…

There's more...

See also

Managing a child document with a join field

Getting ready

How to do it…

How it works…

There's more...

See also

Adding a field with multiple mappings

Getting ready

How to do it…

How it works…

There's more...

See also

Mapping a GeoPoint field

Getting ready

How to do it…

How it works…

There's more...

Mapping a GeoShape field

Getting ready

How to do it…

How it works…

See also

Mapping an IP field

Getting ready

How to do it…

How it works…

Mapping an Alias field

Getting ready

How to do it...

How it works…

Mapping a Percolator field

Getting ready

How to do it...

How it works…

Mapping the Rank Feature and Feature Vector fields

Getting ready

How to do it…

How it works…

Mapping the Search as you type field

Getting ready

How to do it…

How it works…

See also

Using the Range Fields type

Getting ready

How to do it...

How it works…

See also

Using the Flattened field type

Getting ready

How to do it…

How it works…

See also

Using the Point and Shape field types

Getting ready

How to do it…

How it works…

See also

Using the Dense Vector field type

Getting ready

How to do it…

How it works...

Using the Histogram field type

Getting ready

How to do it…

How it works…

See also

Adding metadata to a mapping

Getting ready

How to do it…

How it works…

Specifying different analyzers

Getting ready

How to do it…

How it works…

See also

Using index components and templates

Getting ready

How to do it…

How it works…

See also

Chapter 3: Basic Operations

Technical requirements

Creating an index

Getting ready

How to do it...

How it works...

There's more...

See also

Deleting an index

Getting ready

How to do it...

How it works...

See also

Opening or closing an index

Getting ready

How to do it...

How it works...

There's more...

See also

Putting a mapping in an index

Getting ready

How to do it...

How it works...

There's more...

See also

Getting a mapping

Getting ready

How to do it...

How it works...

See also

Reindexing an index

Getting ready

How to do it...

How it works...

See also

Refreshing an index

Getting ready

How to do it...

How it works...

See also

Flushing an index

Getting ready

How to do it...

How it works...

See also

Using ForceMerge on an index

Getting ready

How to do it...

How it works...

There's more...

See also

Shrinking an index

Getting ready

How to do it...

How it works...

There's more...

See also

Checking whether an index exists

Getting ready

How to do it...

How it works...

Managing index settings

Getting ready

How to do it...

How it works...

There's more...

See also

Using index aliases

Getting ready

How to do it...

How it works...

There's more...

Managing dangling indices

Getting ready

How to do it…

How it works...

See also

Resolving index names

Getting ready

How to do it…

How it works...

See also

Rolling over an index

Getting ready

How to do it…

How it works...

There's more...

See also

Indexing a document

Getting ready

How to do it...

How it works...

There's more...

See also

Getting a document

Getting ready

How to do it...

How it works...

There's more...

See also

Deleting a document

Getting ready

How to do it...

How it works...

See also

Updating a document

Getting ready

How to do it...

How it works...

See also

Speeding up atomic operations (bulk operations)

Getting ready

How to do it...

How it works...

Speeding up GET operations (multi-GET)

Getting ready

How to do it...

How it works...

See also

Chapter 4: Exploring Search Capabilities

Technical requirements

Executing a search

Getting ready

How to do it...

How it works...

There's more...

See also

Sorting results

Getting ready

How to do it...

How it works...

There's more...

See also

Highlighting results

Getting ready

How to do it...

How it works...

See also

Executing a scrolling query

Getting ready

How to do it...

How it works...

There's more...

See also

Using the search_after functionality

Getting ready

How to do it…

How it works...

See also

Returning inner hits in results

Getting ready

How to do it...

How it works...

See also

Suggesting a correct query

Getting ready

How to do it...

How it works...

See also

Counting matched results

Getting ready

How to do it...

How it works...

There's more...

See also

Explaining a query

Getting ready

How to do it...

How it works...

There's more...

See also

Query profiling

Getting ready

How to do it...

How it works...

Deleting by query

Getting ready

How to do it...

How it works...

There's more...

See also

Updating by query

Getting ready

How to do it...

How it works...

There's more...

See also

Matching all of the documents

Getting ready

How to do it...

How it works...

See also

Using a Boolean query

Getting ready

How to do it...

How it works...

There's more...

Using the search template

Getting ready

How to do it...

How it works...

See also

Chapter 5: Text and Numeric Queries

Technical requirements

Using a term query

Getting ready

How to do it...

How it works...

There's more...

Using a terms query

Getting ready

How to do it...

How it works...

There's more...

See also

Using a terms set query

Getting ready

How to do it...

How it works...

See also

Using a prefix query

Getting ready

How to do it...

How it works...

There's more...

See also

Using a wildcard query

Getting ready

How to do it...

How it works...

See also

Using a regexp query

Getting ready

How to do it...

How it works...

See also

Using span queries

Getting ready

How to do it...

How it works...

See also

Using a match query

Getting ready

How to do it...

How it works...

See also

Using a query string query

Getting ready

How to do it...

How it works...

There's more…

See also

Using a simple query string query

Getting ready

How to do it...

How it works...

See also

Using the range query

Getting ready

How to do it...

How it works...

There's more...

Using an IDs query

Getting ready

How to do it...

How it works...

See also

Using the function score query

Getting ready

How to do it...

How it works...

See also

Using the exists query

Getting ready

How to do it...

How it works...

See also

Using a pinned query (XPACK)

Getting ready

How to do it...

How it works...

See also

Chapter 6: Relationships and Geo Queries

Technical requirements

Using the has_child query

Getting ready

How to do it...

How it works...

There's more...

See also

Using the has_parent query

Getting ready

How to do it...

How it works...

See also

Using the nested query

Getting ready

How to do it...

How it works...

See also

Using the geo_bounding_box query

Getting ready

How to do it...

How it works...

See also

Using the geo_shape query

Getting ready

How to do it...

How it works...

See also

Using the geo_distance query

Getting ready

How to do it...

How it works...

See also

Chapter 7: Aggregations

Executing an aggregation

Getting ready

How to do it...

How it works...

See also

Executing a stats aggregation

Getting ready

How to do it...

How it works...

See also

Executing a terms aggregation

Getting ready

How to do it...

How it works...

There’s more...

See also

Executing a significant terms aggregation

Getting ready

How to do it...

How it works...

Executing a range aggregation

Getting ready

How to do it...

How it works...

There’s more...

See also

Executing a histogram aggregation

Getting ready

How to do it...

How it works...

There’s more...

See also

Executing a date histogram aggregation

Getting ready

How to do it...

How it works...

There’s more...

See also

Executing a filter aggregation

Getting ready

How to do it...

How it works...

There’s more...

See also

Executing a filters aggregation

Getting ready

How to do it...

How it works...

Executing a global aggregation

Getting ready

How to do it...

How it works...

Executing a geo distance aggregation

Getting ready

How to do it...

How it works...

See also

Executing a children aggregation

Getting ready

How to do it...

How it works...

Executing a nested aggregation

Getting ready

How to do it...

How it works...

There’s more...

Executing a top hit aggregation

Getting ready

How to do it...

How it works...

See also

Executing a matrix stats aggregation

Getting ready

How to do it...

How it works...

Executing a geo bounds aggregation

Getting ready

How to do it...

How it works...

See also

Executing a geo centroid aggregation

Getting ready

How to do it...

How it works...

See also

Executing a geotile grid aggregation

Getting ready

How to do it...

How it works...

See also

Executing a sampler aggregation

Getting ready

How to do it...

How it works...

Executing a pipeline aggregation

Getting ready

How to do it...

How it works...

See also

Chapter 8: Scripting in Elasticsearch

Painless scripting

Getting ready

How to do it...

How it works...

There’s more...

See also

Installing additional scripting languages

Getting ready

How to do it...

How it works...

There’s more...

Managing scripts

Getting ready

How to do it...

How it works...

There’s more...

See also

Sorting data using scripts

Getting ready

How to do it...

How it works...

There’s more...

Computing return fields with scripting

Getting ready

How to do it...

How it works...

See also

Filtering a search using scripting

Getting ready

How to do it...

How it works...

See also

Using scripting in aggregations

Getting ready

How to do it...

How it works...

Updating a document using scripts

Getting ready

How to do it...

How it works...

There’s more...

Reindexing with a script

Getting ready

How to do it...

How it works...

Scripting in ingest processors

Getting ready

How to do it...

How it works...

See also

Chapter 9: Managing Clusters

Controlling the cluster health using the health API

Getting ready

How to do it...

How it works...

There's more...

See also

Controlling the cluster state using the API

Getting ready

How to do it...

How it works...

There's more...

See also

Getting cluster node information using the API

Getting ready

How to do it...

How it works...

There's more...

See also

Getting node statistics using the API

Getting ready

How to do it...

How it works...

There's more...

Using the task management API

Getting ready

How to do it...

How it works...

There's more...

See also

Using the hot threads API

Getting ready

How to do it...

How it works...

Managing the shard allocation

Getting ready

How to do it...

How it works...

There's more...

See also

Monitoring segments with the segment API

Getting ready

How to do it...

How it works...

See also

Cleaning the cache

Getting ready

How to do it...

How it works...

Chapter 10: Backups and Restoring Data

Managing repositories

Getting ready

How to do it...

How it works...

There's more...

See also

Executing a snapshot

Getting ready

How to do it...

How it works...

There's more...

Restoring a snapshot

Getting ready

How to do it...

How it works...

Setting up an NFS share for backups

Getting ready

How to do it...

How it works...

Reindexing from a remote cluster

Getting ready

How to do it...

How it works...

See also

Chapter 11: User Interfaces

Installing Kibana

Getting ready

How to do it...

How it works...

See also

Managing Kibana Discover

Getting ready

How to do it...

How it works...

Visualizing data with Kibana

Getting ready

How to do it...

How it works...

Using Kibana Dev Tools

Getting ready

How to do it...

How it works...

There's more...

See also

Chapter 12: Using the Ingest Module

Pipeline definition

Getting ready

How to do it...

How it works...

There's more...

See also

Inserting an ingest pipeline

Getting ready

How to do it...

How it works...

Getting an ingest pipeline

Getting ready

How to do it...

How it works...

There's more...

Deleting an ingest pipeline

Getting ready

How to do it...

How it works...

Simulating an ingest pipeline

Getting ready

How to do it...

How it works...

There's more...

Built-in processors

Getting ready

How to do it...

How it works...

See also

The grok processor

Getting ready

How to do it...

How it works...

See also

Using the ingest attachment plugin

Getting ready

How to do it...

How it works...

Using the ingest GeoIP processor

Getting ready

How to do it...

How it works...

See also

Using the enrichment processor

Getting ready

How to do it...

How it works...

See also

Chapter 13: Java Integration

Creating a standard Java HTTP client

Getting ready

How to do it...

How it works...

See also

Creating a low-level Elasticsearch client

Getting ready

How to do it...

How it works...

See also

Using the Elasticsearch official Java client

Getting ready

How to do it...

How it works...

See also

Managing indices

Getting ready

How to do it...

How it works...

See also

Managing mappings

Getting ready

How to do it...

How it works...

There's more...

See also

Managing documents

Getting ready

How to do it...

How it works...

See also

Managing bulk actions

Getting ready

How to do it...

How it works...

Building a query

Getting ready

How to do it...

How it works...

There's more...

Executing a standard search

Getting ready

How to do it...

How it works...

See also

Executing a search with aggregations

Getting ready

How to do it...

How it works...

See also

Executing a scroll search

Getting ready

How to do it...

How it works...

See also

Integrating with DeepLearning4j

Getting ready

How to do it...

How it works...

See also

Chapter 14: Scala Integration

Creating a client in Scala

Getting ready

How to do it…

How it works...

See also

Managing indices

Getting ready

How to do it...

How it works...

See also

Managing mappings

Getting ready

How to do it...

How it works...

See also

Managing documents

Getting ready

How to do it...

How it works...

There's more...

See also

Executing a standard search

Getting ready

How to do it...

How it works...

See also

Executing a search with aggregations

Getting ready

How to do it...

How it works...

See also

Integrating with DeepLearning.scala

Getting ready

How to do it...

How it works...

See also

Chapter 15: Python Integration

Creating a client

Getting ready

How to do it...

How it works…

See also

Managing indices

Getting ready

How to do it…

How it works…

There's more…

See also

Managing mappings

Getting ready

How to do it…

How it works…

See also

Managing documents

Getting ready

How to do it…

How it works…

See also

Executing a standard search

Getting ready

How to do it…

How it works…

See also

Executing a search with aggregations

Getting ready

How to do it…

How it works…

See also

Integrating with NumPy and scikit-learn

Getting ready

How to do it...

How it works...

See also

Using AsyncElasticsearch

Getting ready

How to do it...

How it works...

See also

Using Elasticsearch with FastAPI

Getting ready

How to do it...

How it works...

See also

Chapter 16: Plugin Development

Creating a plugin

Getting ready

How to do it...

How it works...

There's more...

Creating an analyzer plugin

Getting ready

How to do it...

How it works...

There's more...

Creating a REST plugin

Getting ready

How to do it...

How it works...

See also

Creating a cluster action

Getting ready

How to do it...

How it works...

See also

Creating an ingest plugin

Getting ready

How to do it...

How it works...

See also

Chapter 17: Big Data Integration

Installing Apache Spark

Getting ready

How to do it...

How it works...

There's more...

Indexing data using Apache Spark

Getting ready

How to do it...

How it works...

See also

Indexing data with meta using Apache Spark

Getting ready

How to do it...

How it works...

There's more...

Reading data with Apache Spark

Getting ready

How to do it...

How it works...

Reading data using Spark SQL

Getting ready

How to do it...

How it works...

Indexing data with Apache Pig

Getting ready

How to do it...

How it works...

Using Elasticsearch with Alpakka

Getting ready

How to do it...

How it works...

See also

Using Elasticsearch with MongoDB

Getting ready

How to do it...

How it works...

See also

Chapter 18: X-Pack

ILM – managing the index life cycle

Getting ready

How to do it...

How it works...

There's more...

See also

ILM – automating rollover

Getting ready

How to do it...

How it works...

There's more...

See also

Using the SQL Rest API

Getting ready

How to do it…

How it works...

There's more...

See also

Using SQL via JDBC

Getting ready

How to do it…

How it works...

See also

Using X-Pack Security

Getting ready

How to do it…

How it works...

See also

Using alerting to monitor data events

Getting ready

How to do it…

How it works...

See also

Why subscribe?

Other Books You May Enjoy

Preface

Welcome to the fifth edition of Elasticsearch Cookbook, targeting Elasticsearch 8.x. It has been a long journey (about 12 years) that I have shared with Elasticsearch and with the readers of my books. Every version of Elasticsearch brings breaking changes and new functionality, and the evolution of existing components follows a continuous cycle of product and market evolution.

Elasticsearch, once a very niche product, is now one of the most used databases in the world (ranked seventh in April 2022 – source: https://db-engines.com/en/ranking), and both the on-premises deployments (bare metal, Docker, or K8S) and the multi-cloud offerings provided by Elastic on Amazon, Azure, and Google make it one of the leading solutions for cloud search and storage.

The growth of Elasticsearch is mainly due to it being one of the best solutions for searching, storing, and analyzing unstructured content in petabyte-sized datasets; these are the main pillars of modern data-centered companies.

In this book, you'll be guided through comprehensive recipes on Elasticsearch 8.x and see how you can create and run complex queries and analytics.

Packed with recipes on performing index mapping, aggregation, and scripting using Elasticsearch, this fifth edition of Elasticsearch Cookbook will get you acquainted with numerous solutions and quick techniques to perform both everyday and uncommon tasks, such as how to deploy Elasticsearch nodes, integrate other tools into Elasticsearch, and create different visualizations with Kibana. Finally, you will integrate your Java, Scala, Python, and big data applications, such as Apache Spark and Pig, and create efficient data applications powered by enhanced functionalities and custom plugins.

By the end of this book, you will have gained in-depth knowledge of implementing Elasticsearch architecture, and you'll be able to manage, search, and store data efficiently and effectively using Elasticsearch.

IMHO, this book is the latest of a long series and, thanks to continuous refinements, technical and stylistic improvements, and the suggestions of readers over about 10 years, it's probably one of the most complete and effective books on Elasticsearch.

Dear reader, this is a technical book, but I hope you'll enjoy it from the bottom of your heart!

Sincerely,

Alberto

Who this book is for

If you're a software engineer, big data infrastructure engineer, or Elasticsearch developer, you'll find this book useful. This Elasticsearch book will also help data professionals working in the e-commerce and FMCG industries who use Elasticsearch for metrics evaluation and search analytics to get deeper insights for better business decisions.

Prior experience with Elasticsearch will help you get the most out of this book in the latter chapters, which cover more advanced topics.

What this book covers

Chapter 1, Getting Started, covers the basic steps to start using Elasticsearch, from the simple installation to the cloud. We also cover several setup cases.

Chapter 2, Managing Mappings, covers the correct definition of the data fields to improve both indexing and searching quality.

Chapter 3, Basic Operations, introduces the most common actions that are required to ingest data in Elasticsearch and manage it.

Chapter 4, Exploring Search Capabilities, talks about executing searches, sorting, and related API calls. The APIs discussed in this chapter are the essential ones.

Chapter 5, Text and Numeric Queries, talks about the search DSL part of text and numeric fields – the core of the search functionalities of Elasticsearch.

Chapter 6, Relationships and Geo Queries, talks about queries that work on related documents (child/parent and nested) and geo-located fields.

Chapter 7, Aggregations, covers another capability of Elasticsearch: executing analytics on search results, both to improve the user experience and to drill down into the information contained in Elasticsearch.

Chapter 8, Scripting in Elasticsearch, shows how to customize Elasticsearch with scripting and how to use the scripting capabilities in different parts of Elasticsearch (search, aggregation, and ingestion) using different languages. The chapter is mainly focused on Painless, the new scripting language developed by the Elastic team.

Chapter 9, Managing Clusters, shows how to analyze the behavior of a cluster/node to understand common pitfalls.

Chapter 10, Backups and Restoring Data, covers one of the most important components in managing data: backing up. It shows how to manage a distributed backup and the restoration of snapshots.

Chapter 11, User Interfaces, describes two of the most common user interfaces for Elasticsearch: Cerebro, mainly used for admin activities, and Kibana, with X-Pack as a common UI extension for Elasticsearch.

Chapter 12, Using the Ingest Module, talks about the ingest functionality for importing data into Elasticsearch via an ingestion pipeline.

Chapter 13, Java Integration, describes how to integrate Elasticsearch in a Java application using both REST and native protocols.

Chapter 14, Scala Integration, describes how to integrate Elasticsearch in Scala using elastic4s – an advanced type-safe and feature-rich Scala library based on the native Java API.

Chapter 15, Python Integration, covers the usage of the official Elasticsearch Python client.

Chapter 16, Plugin Development, describes how to create native plugins to extend Elasticsearch functionalities. Some examples show the plugin skeletons, the setup process, and how to build them.

Chapter 17, Big Data Integration, covers how to integrate Elasticsearch in common big data tools, such as Apache Spark and Apache Pig.

Chapter 18, X-Pack, covers the extra functionalities provided by X-Pack, including security, machine learning, SQL, and reporting.

To get the most out of this book

Basic knowledge of Java, Scala, and Python would be beneficial.

If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Elasticsearch-8.x-Cookbook. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801079815_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in the text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

html, body, #map {

height: 100%;

margin: 0;

padding: 0

}

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

[default]

exten => s,1,Dial(Zap/1|30)

exten => s,2,Voicemail(u100)

exten => s,102,Voicemail(b100)

exten => i,1,Voicemail(s0)

Any command-line input or output is written as follows:

$ mkdir css

$ cd css

Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."

Tips or Important Notes

Appear like this.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There's more..., and See also).

To give clear instructions on how to complete a recipe, use these sections as follows:

Getting ready

This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.

How to do it…

This section contains the steps required to follow the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There's more…

This section consists of additional information about the recipe in order to enhance your knowledge of it.

See also

This section provides helpful links to other useful information for the recipe.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you've read Elasticsearch 8.x Cookbook, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.

Chapter 1: Getting Started

In this chapter, we will start using Elasticsearch by downloading the correct version for our operating system, configuring it to perform at its best, and extending it via plugins. By the end of the chapter, we will see how to set it up on Docker, and in a cluster using Elastic Cloud Enterprise (Docker/Kubernetes).

We will cover the following recipes:

Downloading and installing Elasticsearch

Setting up networking

Setting up a node

Setting up Linux systems

Setting up different node roles

Setting up a coordinating-only node

Setting up an ingestion node

Installing plugins in Elasticsearch

Removing a plugin

Changing logging settings

Setting up a node via Docker

Deploying on Elastic Cloud Enterprise

Technical requirements

Elasticsearch runs on Linux/macOS/Windows; a browser is required to access Kibana.

All the examples and code in this book are available at https://github.com/PacktPublishing/Elasticsearch-8.x-Cookbook.

If you don't want to go into the details of installing and configuring your Elasticsearch instance, and instead want to quickly set up your environment for development or experimentation, you can skip ahead to the Setting up a node via Docker recipe to fire it up via Docker Compose. That recipe will quickly give you an Elasticsearch instance together with Kibana and other tools.
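
If all you need right away is a disposable local instance, a single container can be enough. The following is a minimal sketch that assumes the official image; the container name and heap size are arbitrary choices:

# Start a throwaway single-node cluster (development only)

docker run -d --name es-dev -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "ES_JAVA_OPTS=-Xms1g -Xmx1g" docker.elastic.co/elasticsearch/elasticsearch:8.0.0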

Downloading and installing Elasticsearch

Elasticsearch has an active community, and the release cycles are very fast; generally, new minor releases are available every 2 or 3 weeks.

Since Elasticsearch depends on many common Java libraries (Lucene, Guice, and Jackson are the most famous ones), the Elasticsearch community tries to keep them updated and fix bugs that are discovered in them and in the Elasticsearch core.

The large user base is also a source of new ideas and features for improving Elasticsearch use cases.

For these reasons, if possible, it's best to use the latest available release; this is usually the most stable, feature-rich, and least buggy version. At the time of writing this book, the version is 8.0.0.

Getting ready

To install Elasticsearch, you need a supported operating system (Linux/macOS/Windows) and a web browser to download the Elasticsearch binary release. At least 1 GB of free disk space is required to install Elasticsearch.

How to do it…

The following steps will show how Elasticsearch can be downloaded and successfully installed:

We will start by downloading Elasticsearch from the web.

Elasticsearch is distributed as a single package with X-Pack integrated; the latest version is always downloadable at https://www.elastic.co/downloads/elasticsearch.

The versions that are available for different operating systems are as follows:

elasticsearch-{version-number}-windows-x86_64.zip and elasticsearch-{version-number}.msi are for Windows operating systems.

elasticsearch-{version-number}-darwin-x86_64.tar.gz is for macOS.

elasticsearch-{version-number}-linux-x86_64.tar.gz is for Linux.

elasticsearch-{version-number}-x86_64.deb is for Debian-based Linux distributions (this also covers the Ubuntu family); it can be installed by using the dpkg -i elasticsearch-*.deb command.

elasticsearch-{version-number}-x86_64.rpm is for Red Hat-based Linux distributions (this also covers the CentOS family); it can be installed by using the rpm -i elasticsearch-*.rpm command.

The preceding packages contain everything needed to start Elasticsearch (the application and a bundled Java Virtual Machine (JVM) for running it). This book targets version 8.x or higher. At the time of writing, the latest and most stable version of Elasticsearch is 8.0.0. To check whether a newer version is available as you read this, visit https://www.elastic.co/downloads/elasticsearch.

Extract the binary content. After downloading the correct release for your platform, the installation involves expanding the archive in a working directory.
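
On Linux, for example, downloading and unpacking can be scripted as follows (a sketch; the version number is illustrative, so check the download page for the current one):

# Download and unpack the Linux tar.gz release

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.0.0-linux-x86_64.tar.gz

tar -xzf elasticsearch-8.0.0-linux-x86_64.tar.gz

cd elasticsearch-8.0.0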

Choose a working directory that is safe from charset problems and does not have a long path. This prevents problems when Elasticsearch creates its directories to store index data.

A good directory in which to install Elasticsearch could be c:\es on the Windows platform, and /opt/es on Linux and macOS.

Let's start Elasticsearch to check whether everything is working. To start your Elasticsearch server, go to the installation directory and, on Linux and macOS, execute the following command:

# bin/elasticsearch

Alternatively, you can type the following command line for Windows:

# bin\elasticsearch.bat

Your server should now start up and show logs similar to the following (I have annotated the most important parts; pay attention to the credentials for accessing Elasticsearch/Kibana):

[2022-02-13T11:18:17,230][INFO ][o.e.n.Node               ] [iMacParo] version[8.0.0], pid[57579], build[default/tar/1b6a7ece17463df5ff54a3e1302d825889aa1161/2022-02-03T16:47:57.507843096Z], OS[Mac OS X/11.1/x86_64], JVM[Eclipse Adoptium/OpenJDK 64-Bit Server VM/17.0.1/17.0.1+12]

[2022-02-13T11:18:17,235][INFO ][o.e.n.Node               ] [iMacParo] JVM home [/opt/elasticsearch-8.x-cookbook/elasticsearch/jdk.app/Contents/Home], using bundled JDK [true] …

Module and plugin loading:

[2022-02-13T11:18:20,382][INFO ][o.e.p.PluginsService     ] [iMacParo] loaded module [aggs-matrix-stats] …

Setup node networking functionalities:

[2022-02-13T11:18:20,454][INFO ][o.e.e.NodeEnvironment    ] [iMacParo] using [1] data paths, mounts [[/System/Volumes/Data (/dev/disk1s1)]], net usable_space [141.7gb], net total_space [931.6gb], types [apfs]

[2022-02-13T11:18:20,454][INFO ][o.e.e.NodeEnvironment    ] [iMacParo] heap size [31gb], compressed ordinary object pointers [true] …

Current license:

[2022-02-13T11:18:26,646][INFO ][o.e.x.s.a.Realms         ] [iMacParo] license mode is [trial], currently licensed security realms are [reserved/reserved,file/default_file,native/default_native] …

Binding Transport Protocol Network address:

[2022-02-13T11:18:29,642][INFO ][o.e.t.TransportService   ] [iMacParo] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300} …

Binding HTTP Protocol Network address:

[2022-02-13T11:18:30,550][INFO ][o.e.h.AbstractHttpServerTransport] [iMacParo] publish_address {192.168.1.31:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}, {192.168.1.31:9200}

[2022-02-13T11:18:30,551][INFO ][o.e.n.Node               ] [iMacParo] started …

Registering new index patterns:

[2022-02-13T11:18:30,972][INFO ][o.e.c.m.MetadataIndexTemplateService] [iMacParo] adding template [.monitoring-kibana] for index patterns [.monitoring-kibana-7-*]  …

Registering license check:

[2022-02-13T11:18:35,079][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [iMacParo] adding index lifecycle policy [.fleet-actions-results-ilm-policy]

[2022-02-13T11:18:35,335][INFO ][o.e.l.LicenseService     ] [iMacParo] license [880f6db9-75b6-4106-8e2e-0c06cb0e8b30] mode [basic] - valid

[2022-02-13T11:18:35,336][INFO ][o.e.x.s.a.Realms         ] [iMacParo] license mode is [basic], currently licensed security realms are [reserved/reserved,file/default_file,native/default_native]

[2022-02-13T11:18:36,244][INFO ][o.e.c.m.MetadataCreateIndexService] [iMacParo] [.geoip_databases] creating index, cause [auto(bulk api)], templates [], shards [1]/[0]

Generation of token to connect other nodes:

[2022-02-13T11:18:39,862][INFO ][o.e.x.s.e.InternalEnrollmentTokenGenerator] [iMacParo] Will not generate node enrollment token because node is only bound on localhost for transport and cannot connect to nodes from other hosts

[2022-02-13T11:18:39,950][INFO ][o.e.c.m.MetadataCreateIndexService] [iMacParo] [.security-7] creating index, cause [api], templates [], shards [1]/[0]…

Credentials:

Elasticsearch security features have been automatically configured!

Authentication is enabled and cluster connections are encrypted.    … truncated…

i  Configure Kibana to use this cluster:

• Run Kibana and click the configuration link in the terminal when Kibana starts.

• Copy the following enrollment token and paste it into Kibana in your browser (valid for the next 30 minutes):

eyJ2ZXIiOiI4LjAuMCIsImFkciI6WyIxOTIuMTY4LjEuMzE6OTIwM CJdLCJmZ3IiOiJjNDRkMTZmNWEzODljODhkMDhlY2MxNjNmZDEyM GQyNGUzMzYwOTBlOTRmNTc3NjQ1MWVhNzU5MDY4MWE1MTAyIiwia2V 5IjoiREt1WDhuNEJRYl9MRXFtN2Q5YkY6UnZzNVU1Wk1UY3l1Qm9SZ HRtTG5DdyJ9 … truncated…

Download of geoip data:

[2022-02-13T11:18:41,922][INFO ][o.e.i.g.GeoIpDownloader  ] [iMacParo] successfully downloaded geoip database [GeoLite2-City.mmdb]

… truncated…
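
Once the node is up, you can verify that it answers on the REST port. The following is a minimal check that assumes the default self-signed certificate (hence the -k flag) and uses the elastic password printed during the first startup; replace <password> with your generated value:

# Returns a JSON document with the cluster name and version

curl -k -u elastic:'<password>' https://localhost:9200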

How it works…

The Elasticsearch package generally contains the following directories:

bin: This contains the scripts to start and manage Elasticsearch.

elasticsearch.bat: This is the main executable script to start Elasticsearch.

elasticsearch-plugin.bat: This is a script to manage plugins.

config: This contains the Elasticsearch configurations. The most important ones are as follows:

elasticsearch.yml: This is the main config file for Elasticsearch.

log4j2.properties: This is the logging config file.

data: This stores all the data ingested in Elasticsearch.

jdk.app: The name of this directory can change based on the operating system. It contains the bundled JVM used to run Elasticsearch (version 17 in the 8.0.0 release, as shown in the startup log).

lib: This contains all the libraries required to run Elasticsearch.

logs: This directory is empty at installation time, but in the future, it will contain the application logs.

modules: This contains the Elasticsearch default plugin modules.

plugins: This directory is empty at installation time, but it's the place where custom plugins will be installed.

During Elasticsearch startup, the following events happen:

A node name is taken from the hostname of the machine.

The default installed modules are loaded. The most important ones are as follows:

aggs-matrix-stats: This provides support for aggregation matrix statistics.

analysis-common: This is a common analyzer that extends the language processing capabilities of Elasticsearch.

ingest-common/ingest-geoip/ingest-user-agent: These include common functionalities for the ingest module plus geo/user agent management.

kibana: This sets up special indices for Kibana functionalities, including .kibana*, .reporting*, and .apm*.

lang-expression/lang-mustache/lang-painless: These are the default supported scripting languages of Elasticsearch.

mapper-extras/mapper-version: These provide extra mapper types to be used, such as token_count and scaled_float.

parent-join: This provides extra queries, such as has_child and has_parent.

percolator: This provides percolator capabilities.

rank-eval: This provides support for the experimental rank evaluation API, which is used to evaluate hit scoring based on queries.

reindex: This provides support for reindex actions (reindex/update by query).

repository-*: These modules allow the use of external cloud services as repository storage (Azure, Google Cloud Storage, and S3).

x-pack-*: All the X-Pack modules depend on a subscription for their activation.

If there are plugins, they are loaded.

If not configured, Elasticsearch automatically binds the following two ports on 127.0.0.1 (localhost):

9300: This port is used for internal intranode communication.

9200: This port is used for the HTTP REST API.

After starting, if indices are available, they are restored and ready to be used.

There are more events that are fired during the Elasticsearch startup. We'll see them in detail in other recipes.

There's more…

During a node's startup, a lot of required services are automatically started. The most important ones are as follows:

Cluster services: These help you manage the cluster state and intranode communication and synchronization.

Indexing service: This helps you manage all the index operations, initializing all active indices and shards.

Mapping service: This helps you manage the document types stored in the cluster (we'll discuss mapping in Chapter 2, Managing Mappings).

Network services: These include services such as the HTTP REST services (by default on port 9200) and the internal Elasticsearch protocol (on port 9300).

Plugin service: This manages the loading of the plugins.

Aggregation services: These provide advanced analytics on stored Elasticsearch documents, such as statistics, histograms, and document grouping.

Ingesting services: These provide support for document preprocessing before ingestion, such as field enrichment, Natural Language Processing (NLP), type conversion, and automatic field population.

Language scripting services: These allow you to add new language scripting support to Elasticsearch.

See also

The Setting up networking recipe we're going to cover next will help you with the initial network setup. Check the official Elasticsearch download page at https://www.elastic.co/downloads/elasticsearch to get the latest version.

Setting up networking

Correctly setting up networking is very important for your nodes and cluster.

There are a lot of different installation scenarios and networking issues. The first step in configuring the nodes to build a cluster is to correctly set up node discovery.

Getting ready

To change configuration files, you will need a working Elasticsearch installation and a simple text editor, as well as your current networking configuration (your IP address).

How to do it…

To set up the networking, use the following steps:

Use the standard Elasticsearch configuration file (config/elasticsearch.yml); by default, your node binds to the localhost interface so that it can't be accessed by external machines or nodes.

To allow another machine to connect to our node, we need to set network.host to our IP address (for example, mine is 192.168.1.164).

To be able to discover other nodes, we need to list them in the discovery.seed_hosts parameter (the legacy discovery.zen.ping.unicast.hosts name was deprecated in 7.x and removed in 8.x). The node sends signals to every machine in the list and waits for a response; if a node responds, it can join the cluster.

In general, nodes running the same major version of Elasticsearch are compatible. You must have the same cluster name (the cluster.name option in elasticsearch.yml) to let nodes join with each other.

The best practice is to have all the nodes installed with the same Elasticsearch version (major.minor.release). This suggestion is also valid for third-party plugins.

To customize the network preferences, you need to change some parameters in the elasticsearch.yml file, as follows:

cluster.name: ESCookBook

node.name: "Node1"

network.host: 192.168.1.164

discovery.seed_hosts: ["192.168.1.164","192.168.1.165:[9300-9400]"]

This configuration sets the cluster name to ESCookBook, defines the node name and the network address, and tries to contact the hosts listed in the discovery section. We can check the configuration during node loading.

We can now start the server and check whether networking is configured, as follows:

[2020-12-06T17:42:16,386][INFO ][o.e.c.s.MasterService ] [Node1] zen-disco-elected-as-master ([0] nodes joined)[, ], reason: new_master {Node1}{fyBySLMcR3uqKiYC32P5Sg}{IX1wpA01QSKkruZeSRPlFg}{192.168.1.164}{192.168.1.164:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}

[2020-12-06T17:42:16,390][INFO ][o.e.c.s.ClusterApplierService] [Node1] new_master {Node1}{fyBySLMcR3uqKiYC32P5Sg}{IX1wpA01QSKkruZeSRPlFg}{192.168.1.164}{192.168.1.164:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, reason: apply cluster state (from master [master {Node1}{fyBySLMcR3uqKiYC32P5Sg}{IX1wpA01QSKkruZeSRPlFg}{192.168.1.164}{192.168.1.164:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [1] source [zen-disco-elected-as-master ([0] nodes joined)[, ]]])

[2020-12-06T17:42:16,403][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [Node1] publish_address {192.168.1.164:9200}, bound_addresses {192.168.1.164:9200}

[2020-12-06T17:42:16,403][INFO ][o.e.n.Node ] [Node1] started

[2020-12-06T17:42:16,600][INFO ][o.e.l.LicenseService ] [Node1] license [b2754b17-a4ec-47e4-9175-4b2e0d714a45] mode [basic] - valid

As you can see from my screen dump, the transport is bound to 192.168.1.164:9300. The REST HTTP interface is bound to 192.168.1.164:9200.
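
To double-check which nodes have joined, you can query the _cat/nodes API against the bound HTTP address (a sketch; the -k and -u flags apply only when security is enabled, as it is by default in 8.x):

# List the cluster nodes with their roles and addresses

curl -k -u elastic:'<password>' 'https://192.168.1.164:9200/_cat/nodes?v'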

How it works…

The following are the most important configuration keys for networking management:

cluster.name: This sets up the name of the cluster. Only nodes with the same name can join together.

node.name: If not defined, this is automatically assigned by Elasticsearch.

node.name allows you to define a name for the node. If you have a lot of nodes on different machines, it is useful to set their names to something meaningful so that you can easily locate them. A meaningful name is easier to remember than a generated one such as fyBySLMcR3uqKiYC32P5Sg.

You must always set up node.name if you need to monitor your server. Generally, a node name is the same as a host server name for easy maintenance.

network.host defines the IP address of your machine that is used to bind the node. If your server is on different LANs, or you want to limit the bind to only one LAN, you must set this value to your server's IP address.
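
Besides a literal IP address, network.host also accepts special values that are resolved at startup; for example, the following sketch binds to a site-local address:

network.host: _site_ # special values include _local_, _site_, and _global_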

discovery.seed_hosts (formerly discovery.zen.ping.unicast.hosts) allows you to define a list of hosts (with ports or a port range) to be used to discover other nodes that can join the cluster. The preferred port is the transport one, usually 9300.

The addresses of the host list can be a mix of the following:

A hostname, that is, myhost1

An IP address, that is, 192.168.1.12

An IP address or hostname with the port, that is, myhost1:9300 or 192.168.1.2:9300

An IP address or hostname with a range of ports, that is, myhost1:[9300-9400] or 192.168.1.2:[9300-9400]

See also

For more details, refer to the Setting up a node recipe in this chapter.

Setting up a node

Elasticsearch allows the customization of several parameters in an installation. In this recipe, we'll look at the most commonly used ones, which define where data is stored and improve overall performance.

Getting ready

As described in the Downloading and installing Elasticsearch recipe, you need a working Elasticsearch installation and a simple text editor to change configuration files.

How to do it…

The steps required for setting up a simple node are as follows:

Open the config/elasticsearch.yml file with an editor of your choice.

Set up the directories that will store your server data, as follows:

For Linux or macOS, add the following path entries (using /opt/data as the base path):

path.conf: /opt/data/es/conf

path.data: /opt/data/es/data1,/opt2/data/data2

path.work: /opt/data/work

path.logs: /opt/data/logs

path.plugins: /opt/data/plugins

For Windows, add the following path entries (using c:\Elasticsearch as the base path):

path.conf: c:\Elasticsearch\conf

path.data: c:\Elasticsearch\data

path.work: c:\Elasticsearch\work

path.logs: c:\Elasticsearch\logs

path.plugins: c:\Elasticsearch\plugins

Set up the parameters that control the default index shard and replica counts at creation time. Note that recent Elasticsearch versions (5.x and later) reject index-level settings placed in elasticsearch.yml; define these defaults through an index template instead, as shown in the sketch after the following values:

index.number_of_shards: 1

index.number_of_replicas: 1
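
A sketch of such a template using the composable index template API follows; the template name and the catch-all pattern are illustrative, so adapt the pattern to your own indices:

# Define default shard/replica counts for new indices via an index template

curl -k -u elastic:'<password>' -X PUT 'https://localhost:9200/_index_template/default-shards' -H 'Content-Type: application/json' -d '{"index_patterns":["*"],"template":{"settings":{"index.number_of_shards":1,"index.number_of_replicas":1}}}'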

How it works…

The path.conf parameter defines the directory that contains your configurations, mainly elasticsearch.yml and log4j2.properties. The default is $ES_HOME/config, where ES_HOME is the installation directory of your Elasticsearch server. (Note that in recent Elasticsearch versions, the configuration directory is set through the ES_PATH_CONF environment variable rather than a path.conf setting.)

It's important to set up the config directory outside your application directory so that you don't need to copy the configuration files whenever you upgrade your Elasticsearch server.

The path.data parameter is the most important one. It allows you to define one or more directories (on different disks) where you can store your index data. When you define more than one directory, they are managed similarly to RAID 0 (their space is summed up), favoring locations with the most free space.

The path.work parameter is a location in which Elasticsearch stores temporary files.

The path.logs parameter is where log files are put. How logs are managed is controlled in log4j2.properties.

The path.plugins parameter allows you to override the plugins path (the default is $ES_HOME/plugins). This is useful for putting system-wide plugins in a shared path, usually via Network File System (NFS), when you want a single place to store the plugins for all of your clusters.

The main parameters used to control indices and shards are index.number_of_shards, which controls the default number of shards for a newly created index, and index.number_of_replicas, which controls the initial number of replicas.

See also

Refer to the following points to learn more about topics related to this recipe:

The Setting up Linux systems recipeThe official Elasticsearch documentation at https://www.elastic.co/guide/en/elasticsearch/reference/master/setup.html

Setting up Linux systems

If you are using a Linux system (generally in a production environment), you need to apply some extra setup to improve performance or to resolve production problems with many indices.

This recipe covers the following two common errors that happen in production:

- Too many open files, which can corrupt your indices and your data
- Slow performance in search and indexing due to the garbage collector

Big problems arise when you run out of disk space: in this scenario, some files can become corrupted. To prevent your indices from corruption and possible data loss, it is best to monitor the available storage space. The default settings prevent index writing and block the cluster if your storage is over 95% full.

Getting ready

As we described in the Downloading and installing Elasticsearch recipe in this chapter, you need a working Elasticsearch installation and a simple text editor to change configuration files.

How to do it…

To improve the performance of Linux systems, we will perform the following steps:

1. First, you need to change the current limits for the user that runs the Elasticsearch server. In these examples, we will call this user elasticsearch.
2. To allow Elasticsearch to manage a large number of files, you need to increase the number of file descriptors (the number of open files) that a user can manage. To do so, edit your /etc/security/limits.conf file and add the following lines at the end:

elasticsearch - nofile 65536

elasticsearch - memlock unlimited

3. Then, a machine restart is required to be sure that the changes have been applied.
4. Newer versions of Ubuntu (that is, version 16.04 or later) may skip the /etc/security/limits.conf file in the init.d scripts. In these cases, you need to edit the files in /etc/pam.d/ and remove the comment (#) from the following line:

# session required pam_limits.so

5. To control memory swapping, you need to set up the following parameter in elasticsearch.yml:

bootstrap.memory_lock: true

6. To fix the amount of memory used by the Elasticsearch server, set the same value for Xms and Xmx in $ES_HOME/config/jvm.options (here, we set 1 GB of memory), as follows:

-Xms1g
-Xmx1g
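After a restart, you can verify that the limits are in effect. This is a sketch that assumes a local node on plain HTTP and that the elasticsearch user has a login shell:

# Check the file descriptor limit for the elasticsearch user
sudo -u elasticsearch bash -c 'ulimit -n'

# Ask the running node what it sees (file descriptor limit and memory lock status)
curl -s 'localhost:9200/_nodes/process?filter_path=**.max_file_descriptors,**.mlockall'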

How it works…

The standard limit of file descriptors (https://www.bottomupcs.com/file_descriptors.xhtml) (the maximum number of open files for a user) is typically 1,024 or 8,192. When you store a lot of records in several indices, you run out of file descriptors very quickly, so your Elasticsearch server becomes unresponsive and your indices may become corrupted, causing you to lose your data.

Changing the limit to a very high number means that your Elasticsearch doesn't hit the maximum number of open files.

The other setting, for memory, prevents Elasticsearch from swapping memory and gives a performance boost in production environments. This setting is required because, during indexing and searching, Elasticsearch creates and destroys a lot of objects in memory. This large number of create/destroy actions fragments the memory and reduces performance: the memory becomes full of holes (https://en.wikipedia.org/wiki/Fragmentation_(computing)) and, when the system needs to allocate more memory, it suffers the overhead of finding compacted memory. If you don't set bootstrap.memory_lock: true, the operating system is free to swap parts of the Elasticsearch process memory to disk and page them back in later, which can freeze the system. With this setting, the whole process memory is locked in RAM, with a huge performance boost.

There's more…

Generally, developers' machines do not have a lot of free disk space, and this can prevent Elasticsearch from starting in write mode. To change the free disk space thresholds that Elasticsearch uses, the following configuration can be applied:

# not safe for production

cluster.routing.allocation.disk.threshold_enabled: false

cluster.routing.allocation.disk.watermark.high: 99%

cluster.routing.allocation.disk.watermark.flood_stage: 99%
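The same settings can also be applied at runtime, without editing elasticsearch.yml, via the cluster settings API. A sketch (local node, plain HTTP assumed):

# Not safe for production: disables the disk allocation thresholds cluster-wide
curl -s -X PUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "persistent": { "cluster.routing.allocation.disk.threshold_enabled": false }
}'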

Setting up different node roles

Elasticsearch is natively designed for the cloud, so when you need to release a production environment with a huge number of records, and you need high availability (HA) and good performance, you need to join more nodes into a cluster.

Elasticsearch allows you to assign different roles to nodes to balance the load and improve overall performance.

Getting ready

As described in the Downloading and installing Elasticsearch recipe, you need a working Elasticsearch installation and a simple text editor to change the configuration files.

How to do it…

For the advanced setup of a cluster, there are some parameters that must be configured to define different node types.

These parameters are in the config/elasticsearch.yml file, and they can be set with the following steps:

1. Set up whether the node can only be a master, as follows:

node.roles: [ master ]

2. Set up whether the node can only contain data, as follows:

node.roles: [ data ]

3. Set up whether the node can only work as an ingest node, as follows:

node.roles: [ ingest ]
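After restarting each node, you can verify which roles are active with the cat API. A sketch (local node, plain HTTP assumed):

# node.role prints one letter per active role (for example, m=master, d=data, i=ingest)
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role'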

How it works…

The node.roles parameter defines which roles are associated with the current node.

The default is to enable all the possible roles for the node, which are as follows:

- master: This is an arbiter for the cluster; it makes decisions about shard management, keeps the cluster status, and is the main controller of every index action. If your master nodes are overloaded, all the nodes in the cluster will suffer performance penalties. The master node is also the node that distributes a search across all the data nodes and aggregates/rescores the results to return them to the user. In big data terms, it's the reduce layer in Elasticsearch's map/reduce search.

The number of master nodes must always be odd.

- data: This allows you to store data in the node. This node will be a worker that is responsible for indexing and searching data.
- data_content, data_hot, data_warm, data_cold, and data_frozen: These roles allow you to define different scopes for how the data is managed. Hot data generally lives on Solid State Drives (SSDs) or faster storage to serve frequently ingested/searched data, while data_warm and data_cold nodes are used for infrequent searches (see the sketch after this list for combining roles).
- ingest: This role enables the usage of the Elasticsearch ingest capabilities. See Chapter 12, Using the Ingest Module.
- ml: This role enables machine learning capabilities. This node will be able to run machine learning jobs and answer machine learning API calls.
- remote_cluster_client: This role enables cross-cluster integration. The node can connect to other clusters and execute searches on them.
- transform: This role enables the transform functionalities, which are automatic presets used to copy data between indices.
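Roles are combined in a single node.roles list, so, for instance, a hypothetical hot-tier data node that also performs ingestion could be configured as follows (a sketch; adapt the role mix to your own topology):

# Placeholder role mix for a hot-tier node (run from $ES_HOME)
cat >> config/elasticsearch.yml <<'EOF'
node.roles: [ data_hot, data_content, ingest ]
EOF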

The most frequent usage is to mix the master and data roles. This allows you to have different node types with different scopes, as shown in the following table:

Table 1.1 – Different setups between master and data settings

The most frequently used configuration is the first one, but if you have a very big cluster or special needs (such as defining a large group of data nodes), you can change the scopes of your nodes to better serve searches and aggregations.

There's more…

Related to the number of master nodes, there are settings that require at least half of them plus one to be available to ensure that the cluster is in a safe state (in order to avoid the risk of split-brain; see https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-node.html#split-brain). This setting is cluster.initial_master_nodes, and the required quorum is given by the following equation:

(master_eligible_nodes / 2) + 1

To have an HA cluster, you need at least three nodes that are masters with the value of minimum_master_nodes set to 2.
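For the initial bootstrap of such a three-master cluster, each master-eligible node lists the master names. A sketch with placeholder names follows; the values must match each node's node.name, and the setting should be removed once the cluster has formed for the first time:

# Placeholder master names for cluster bootstrapping (run from $ES_HOME)
cat >> config/elasticsearch.yml <<'EOF'
cluster.initial_master_nodes: ["master-1", "master-2", "master-3"]
EOF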

See also

Refer to the following point to learn more about topics related to this recipe:

The official Elasticsearch documentation about node setup at https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html

Setting up a coordinating-only node

The master nodes that we have seen previously are the most important ones for cluster stability because they control node joins/leaves, index creation, mapping changes, and the allocation of resources. To prevent queries and aggregations from creating instability in your cluster, coordinator (or client/proxy) nodes can be used to provide safe communication with the cluster.

Getting ready

You need a working Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in this chapter, and a simple text editor to change configuration files.

How to do it…

For the advanced setup of a cluster, there are some parameters that must be configured to define different node roles.

The parameter is in the config/elasticsearch.yml file, and you need to set the node.roles property to an empty list, as follows:

node.roles: []

How it works…

The coordinator node is a special node that works as a proxy/pass-through for the cluster. Its main advantages are as follows:

- It can easily be killed or removed from the cluster without causing any problems. It's not a master, so it doesn't participate in cluster functionalities, and it doesn't contain data, so there are no data relocations/replications due to its failure.
- It prevents instability in the cluster due to a developer's/user's bad queries. Sometimes, a user executes aggregations that are too large (that is, date histograms with a range of some years and intervals of 10 seconds), and the Elasticsearch node could crash. (In its newest versions, Elasticsearch has a structure called the circuit breaker to prevent similar issues, but there are always borderline cases that can bring instability, for example, when using scripting.) The coordinator node is not a master, and its overload doesn't cause any problems for cluster stability.
- If the coordinator or client node is embedded in the application, there are fewer round trips for the data, speeding up the application.
- You can add coordinator nodes to balance the search and aggregation throughput without generating changes and data relocation in the cluster.
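You can confirm that a node is coordinating-only using the cat API. A sketch (local node, plain HTTP assumed):

# A coordinating-only node (node.roles: []) reports '-' in the node.role column
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role'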

Setting up an ingestion node

The main goals of Elasticsearch are indexing, searching, and analytics, but it's often necessary to modify or enhance the documents before storing them in Elasticsearch.

The following are the most common scenarios in this case: 

- Preprocessing the log string to extract meaningful data
- Enriching the content of textual fields with NLP tools
- Enriching the content using ML-computed fields
- Applying data modifications or transformations during ingestion, such as the following:
  - Converting IP addresses into geolocation data
  - Adding DateTime fields at ingestion time
  - Building custom fields (via scripting) at ingestion time

Getting ready

You need a working Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe, as well as a simple text editor to change configuration files.

How to do it…

To set up an ingest node, you need to edit the config/elasticsearch.yml file and set the node.roles property to ingest, as follows:

node.roles: [ ingest ]

Every time you change your elasticsearch.yml file, a node restart is required.

How it works…

The default configuration for Elasticsearch is to set the node as an ingest node (refer to Chapter 12, Using the Ingest Module, for more information on the ingestion pipeline).
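As a small preview of that chapter, the following hedged sketch defines a pipeline that adds a DateTime field at ingestion time and indexes a document through it (the pipeline name, field name, and index are placeholders; a local node on plain HTTP is assumed):

# Create a pipeline that stamps documents with the ingestion time
curl -s -X PUT 'localhost:9200/_ingest/pipeline/add-ingest-time' -H 'Content-Type: application/json' -d '{
  "description": "Add an ingestion timestamp",
  "processors": [
    { "set": { "field": "ingested_at", "value": "{{_ingest.timestamp}}" } }
  ]
}'

# Index a document through the pipeline
curl -s -X PUT 'localhost:9200/myindex/_doc/1?pipeline=add-ingest-time' -H 'Content-Type: application/json' -d '{ "message": "hello" }'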

As with the coordinator node, using the ingest node is a way to provide extra functionality to Elasticsearch without compromising cluster safety.

It's a best practice to disable this role on the master and data nodes in order to prevent ingestion errors and to protect the cluster. The coordinator node is the best candidate to be an ingest node.

If you are using NLP, attachment extraction (via the attachment ingest plugin), or logs ingestion, the best practice is to have a pool of coordinator nodes (no master, no data) with ingestion active.

In previous versions of Elasticsearch, the attachment and NLP plugins ran on the standard data or master nodes. They caused a lot of problems for Elasticsearch due to the following reasons:

- High CPU usage by the NLP algorithms, which saturates all the CPUs on the data node, resulting in poor indexing and searching performance.
- Instability due to badly formatted attachments and/or bugs in Apache Tika (the library used for managing document extraction).