Master the intricacies of Elasticsearch 7.0 and use it to create flexible and scalable search solutions
Key Features
Book Description
Building enterprise-grade distributed applications and executing systematic search operations call for a strong understanding of Elasticsearch and expertise in using its core APIs and latest features. This book will help you master the advanced functionalities of Elasticsearch and understand how you can develop a sophisticated, real-time search engine confidently. In addition to this, you'll also learn to run machine learning jobs in Elasticsearch to speed up routine tasks.
You'll get started by learning to use Elasticsearch features on Hadoop and Spark and make search results faster, thereby improving the speed of query results and enhancing the customer experience. You'll then get up to speed with performing analytics by building a metrics pipeline, defining queries, and using Kibana for intuitive visualizations that help provide decision-makers with better insights. The book will later guide you through using Logstash with examples to collect, parse, and enrich logs before indexing them in Elasticsearch.
By the end of this book, you will have comprehensive knowledge of advanced topics such as Apache Spark support, machine learning using Elasticsearch and scikit-learn, and real-time analytics, along with the expertise you need to increase business productivity, perform analytics, and get the very best out of Elasticsearch.
What you will learn
Who this book is for
This book is for Elasticsearch developers and data engineers who want to take their basic knowledge of Elasticsearch to the next level and use it to build enterprise-grade distributed search applications. Prior experience of working with Elasticsearch will be useful to get the most out of this book.
You can read this e-book in Legimi apps or in any app that supports the following format:
Number of pages: 453
Year of publication: 2019
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Pravin Dhandre
Acquisition Editor: Nelson Morris
Content Development Editor: Roshan Kumar
Senior Editor: Jack Cummings
Technical Editor: Dinesh Chaudhary
Copy Editor: Safis Editing
Project Coordinator: Namrata Swetta
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Nilesh Mohite
First published: August 2019
Production reference: 1220819
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78995-775-4
www.packtpub.com
Packt.com
Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Fully searchable for easy access to vital information
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Wai Tak Wong is a faculty member in the Department of Computer Science at Kean University, NJ, USA. He has more than 15 years' professional experience in cloud software design and development. He obtained his PhD in computer science at NJIT, NJ, USA. Wai Tak has served as an associate professor in the Information Management Department of Chung Hua University, Taiwan. A co-founder of Shanghai Shellshellfish Information Technology, Wai Tak acted as the chief scientist of the R&D team, and he has published more than a dozen algorithms in prestigious journals and conferences. Wai Tak began his search and analytics technology career with Elasticsearch in the real estate market and later applied it to data management and FinTech data services.
Marcelo Ochoa works for Dirección TICs of Facultad de Ciencias Exactas at Universidad Nacional del Centro de la Prov. de Buenos Aires and is the CTO at Scotas, a company that specializes in near real-time search solutions using Apache Solr and Oracle. He divides his time between university jobs and external projects related to Oracle, open source, and big data technologies. Since 2006, he has been part of an Oracle ACE program and was recently incorporated into a Docker Mentor program.
He has co-authored Oracle Database Programming Using Java and Web Services and Professional XML Databases, and has served as a technical reviewer for several books and videos, including Mastering Apache Solr 7, Mastering Elastic Stack, Learning Elasticsearch 6, and others.
Saurabh Chhajed is a machine learning and big data engineer with 9 years of professional experience in the enterprise application development life cycle, using the latest frameworks, tools, and design patterns. He has experience in designing and implementing some of the most widely used and scalable customer-facing recommendation systems, making extensive use of the big data ecosystem across batch, real-time, and machine learning pipelines. He has also worked for some of the largest investment banks, credit card companies, and manufacturing companies around the world, implementing a range of robust and scalable product suites. He has written Learning ELK Stack and reviewed Mastering Kibana and Python Machine Learning for Packt Publishing.
Craig Brown is an independent consultant, offering services for Elasticsearch and other big data software. He is a core Java developer with 25+ years of experience and more than 10 years of Elasticsearch experience. He is also experienced with machine learning, Hadoop, Apache Spark; is a co-founder of the Big Mountain Data user group in Utah; and is a speaker on Elasticsearch and other big data topics. Craig founded NosqlRevolution LLC, focused on Elasticsearch and big data services, and PicoCluster LLC, a desktop data center designed for learning and prototyping cluster computing and big data frameworks.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Advanced Elasticsearch 7.0
Dedication
About Packt
Why subscribe?
Contributors
About the author
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Section 1: Fundamentals and Core APIs
Overview of Elasticsearch 7
Preparing your environment
Running Elasticsearch
Basic Elasticsearch configuration
Important system configuration
Talking to Elasticsearch
Using Postman to work with the Elasticsearch REST API
Elasticsearch architectural overview
Elastic Stack architecture
Elasticsearch architecture
Between the Elasticsearch index and the Lucene index
Key concepts
Mapping concepts across SQL and Elasticsearch
Mapping
Analyzer
Standard analyzer
API conventions
New features
New features to be discussed
New features with description and issue number
Breaking changes
Aggregations changes
Analysis changes
API changes
Cluster changes
Discovery changes
High-level REST client changes
Low-level REST client changes 
Indices changes
Java API changes
Mapping changes
ML changes
Packaging changes
Search changes
Query DSL changes
Settings changes
Scripting changes
Migration between versions
Summary
Index APIs
Index management APIs
Basic CRUD APIs
Index settings
Index templates
Index aliases
Reindexing with zero downtime
Grouping multiple indices
Views on a subset of documents
Miscellaneous
Monitoring indices
Indices stats
Indices segments, recovery, and shard stores
Index persistence
Advanced index management APIs
Split index 
Shrink index 
Rollover index 
Summary
Document APIs
The Elasticsearch document life cycle
What is a document?
The document life cycle
Single document management APIs
Sample documents
Indexing a document
Retrieving a document by identifier
Updating a document
Removing a document by identifier
Multi-document management APIs
Retrieving multiple documents
Bulk API
Update by query API
Delete by query API
Reindex API
Copying documents
Migration from a multiple mapping types index
Summary
Mapping APIs
Dynamic mapping
Mapping rules
Dynamic templates
Meta fields in mapping
Field datatypes
Static mapping for the sample document 
Mapping parameters
Refreshing mapping changes for static mapping
Typeless APIs working with old custom index types
Summary
Anatomy of an Analyzer
An analyzer's components
Character filters
The html_strip filter
The mapping filter
The pattern_replace filter
Tokenizers
Token filters
Built-in analyzers
Custom analyzers
Normalizers
Summary
Search APIs
Indexing sample documents
Search APIs
URI search
Request body search
The sort parameter
The scroll parameter
The search_after parameter
The rescore parameter
The _name parameter
The collapse parameter
The highlighting parameter
Other search parameters
Query DSL
Full text queries
The match keyword
The query string keyword
The intervals keyword
Term-level queries
Compound queries
The script query
The multi-search API
Other search-related APIs
The _explain API
The _validate API
The _count API
The field capabilities API
Profiler
Suggesters
Summary
Section 2: Data Modeling, Aggregations Framework, Pipeline, and Data Analytics
Modeling Your Data in the Real World
The Investor Exchange Cloud
Modeling data and the approaches
Data denormalization
Using an array of objects datatype
Nested object mapping datatypes
Join datatypes
Parent ID query
has_child query
has_parent query
Practical considerations
Summary
Aggregation Frameworks
ETF historical data preparation
Aggregation query syntax
Matrix aggregations
Matrix stats
Metrics aggregations
avg
weighted_avg
cardinality
value_count
sum
min
max
stats
extended_stats
top_hits
percentiles
percentile_ranks
median_absolute_deviation
geo_bounds
geo_centroid
scripted_metric
Bucket aggregations
histogram
date_histogram
auto_date_histogram
ranges
date_range
ip_range
filter
filters
term
significant_terms
significant_text
sampler
diversified_sampler
nested
reverse_nested
global
missing
composite
adjacency_matrix
parent
children
geo_distance
geohash_grid
geotile_grid
Pipeline aggregations
Sibling family
avg_bucket 
max_bucket
min_bucket
sum_bucket
stats_bucket
extended_stats_bucket
percentiles_bucket
Parent family
cumulative_sum
derivative
bucket_script
bucket_selector
bucket_sort
serial_diff
Moving average aggregation
simple
linear
ewma
holt
holt_winters
Moving function aggregation
max
min
sum
stdDev
unweightedAvg
linearWeightedAvg
ewma
holt
holtWinters
Post filter on aggregations
Summary
Preprocessing Documents in Ingest Pipelines
Ingest APIs
Accessing data in pipelines
Processors
Conditional execution in pipelines
Handling failures in pipelines
Summary
Using Elasticsearch for Exploratory Data Analysis
Business analytics
Operational data analytics
Sentiment analysis
Summary
Section 3: Programming with the Elasticsearch Client
Elasticsearch from Java Programming
Overview of Elasticsearch Java REST client
The Java low-level REST client
The Java low-level REST client workflow
REST client initialization
Performing requests using a REST client 
Handling responses
Testing with Swagger UI
New features
The Java high-level REST client
The Java high-level REST client workflow
REST client initialization
Performing requests using the REST client
Handling responses
Testing with Swagger UI
New features
Spring Data Elasticsearch
Summary
Elasticsearch from Python Programming
Overview of the Elasticsearch Python client
The Python low-level Elasticsearch client
Workflow for the Python low-level Elasticsearch client
Client initialization
Performing requests
Handling responses
The Python high-level Elasticsearch library
Illustrating the programming concept
Initializing a connection
Performing requests 
Handling responses
The query class 
The aggregations class
Summary
Section 4: Elastic Stack
Using Kibana, Logstash, and Beats
Overview of the Elastic Stack
Running the Elastic Stack with Docker
Running Elasticsearch in a Docker container
Running Kibana in a Docker container
Running Logstash in a Docker container
Running Beats in a Docker container
Summary
Working with Elasticsearch SQL
 Overview
Getting started
Elasticsearch SQL language
Reserved keywords
Data type
Operators
Functions
Aggregate
Grouping
Date-time
Full-text search 
Mathematics
String
Type conversion
Conditional
System
Elasticsearch SQL query syntax
New features
Elasticsearch SQL REST API
Elasticsearch SQL JDBC
Upgrading Elasticsearch from a basic to a trial license
Workflow of Elasticsearch SQL JDBC 
Testing with Swagger UI
Summary
Working with Elasticsearch Analysis Plugins
What are Elasticsearch plugins?
Plugin management
Working with the ICU Analysis plugin
Examples
Working with the Smart Chinese Analysis plugin
Examples
Working with the IK Analysis plugin
Examples
Configuring a custom dictionary in the IK Analysis plugin
Summary
Section 5: Advanced Features
Machine Learning with Elasticsearch
Machine learning with Elastic Stack
Machine learning APIs
Machine learning jobs
Sample data
Running a single-metric job
Creating index patterns
Creating a new machine learning job
Examining the result
Machine learning using Elasticsearch and scikit-learn
Summary
Spark and Elasticsearch for Real-Time Analytics
Overview of ES-Hadoop
Apache Spark support
Real-time analytics using Elasticsearch and Apache Spark
Building a virtual environment to run the sample ES-Hadoop project
Running the sample ES-Hadoop project
Running the sample ES-Hadoop project using a prepared Docker image
Source code
Summary
Building Analytics RESTful Services
Building a RESTful web service with Spring Boot
Project program structure
Running the program and examining the APIs
Main workflow anatomy
Building the analytic model
Performing daily update data
Getting the registered symbols
Building the scheduler
Integration with the Bollinger Band
Building a Java Spark ML module for k-means anomaly detection
Source code
Testing Analytics RESTful services
Testing the build-analytics-model API
Testing the get-register-symbols API
Working with Kibana to visualize the analytics results
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
You will do the following:
Pre-process documents before indexing in ingest pipelines
Learn how to model your data in the real world
Get to grips with using Elasticsearch for exploratory data analysis
Understand how to build analytics and RESTful services
Use Kibana, Logstash, and Beats for dashboard applications
Get up to speed with Spark and Elasticsearch for real-time analytics
Explore the Java high/low-level REST client and learn how to index, search, and query in a Spring application
The book is aimed at beginners with no prior experience of Elasticsearch and gradually introduces intermediate and advanced topics. The chapters walk through the most important aspects of the engine to help readers build and master a powerful search engine. Search engine data engineers, software engineers, and database engineers who want to take their basic knowledge of Elasticsearch to the next level will learn to use it to its full potential in their daily core tasks.
Chapter 1, Overview of Elasticsearch 7, takes beginners through some basic features in minutes. We just take a few steps to launch the new version of the Elasticsearch server. An architectural overview and a core concept introduction will make it easy to understand the workflow in Elasticsearch.
Chapter 2, Index APIs, discusses how to use index APIs to manage individual indices, index settings, aliases, and templates. It also involves monitoring statistics for operations that occur on an index. Index management operations including refreshing, flushing, and clearing the cache are also discussed.
Chapter 3, Document APIs, begins with the basic information about a document and its life cycle. Then we learn how to access it. After that, we look at accessing multiple documents with the bulk API. Finally, we discuss migrating indices from the old version to version 7.0.
Chapter 4, Mapping APIs, introduces the schema in Elasticsearch. The mapping rules for both dynamic mappings and explicit static mappings will be discussed. It also provides the idea and details of creating static mapping for an index. We also step into the details of the meta fields and field data types in index mapping.
Chapter 5, Anatomy of an Analyzer, drills down into the anatomy of the analyzer and provides in-depth practice with different analyzers. We discuss different character filters, tokenizers, and token filters in order to understand the building blocks of an analyzer. We also practice creating a custom analyzer and using it in the analyze API.
Chapter 6, Search APIs, covers different types of searches: from term-based to full-text, from exact search to fuzzy search, from single-field search to multi-search, and then to compound search. We also discuss Query DSL and search-related APIs for tuning, validating, and troubleshooting queries.
Chapter 7, Modeling Your Data in the Real World, discusses data modeling with Elasticsearch. It focuses on some common issues users may encounter when working with different techniques. It helps you understand some of the conventions and contains insights from real-world examples involving denormalizing complex objects and using nested objects to handle relationships.
Chapter 8, Aggregation Framework, discusses data analytics using the aggregation framework. We learn how to perform aggregations with examples and delve into most of the types of aggregations. We also use IEX ETF historical data to plot a graph for different types of moving averages, including forecasted data supported by the model.
Chapter 9, Preprocessing Documents in Ingest Pipelines, discusses the preprocessing of a document through predefined pipeline processors before the actual indexing operation begins. We also learn how to access data in documents through the pipeline processors. Finally, we cover exception handling when an error occurs during pipeline processing.
Chapter 10, Using Elasticsearch for Exploratory Data Analysis, uses the aggregation framework to perform data analysis. We first carry out a comprehensive exploratory data analysis and a simple financial analysis of business strategies. In addition, we provide step-by-step instructions for calculating Bollinger Bands using daily operational data. Finally, we conduct a brief survey of sentiment analysis using Elasticsearch.
Chapter 11, Elasticsearch from Java Programming, focuses on the basics of two supported Java REST clients. We explore the main features and operations of each approach. A sample project is provided to demonstrate the high-level and low-level REST clients integrated with Spring Boot programming.
Chapter 12, Elasticsearch from Python Programming, introduces the Python Elasticsearch client. We learn about two Elasticsearch client packages, elasticsearch-py and elasticsearch-dsl-py. We learn how the clients work and incorporate them into a Python application. We implement Bollinger Bands by using elasticsearch-dsl-py.
Chapter 13, Using Kibana, Logstash, and Beats, outlines the components of the Elastic Stack, including Kibana, Logstash, and Beats. We learn how to use Logstash to collect and parse log data from system log files. In addition, we use Filebeat to extend the use of Logstash to a central log processing center. All work is run on officially supported Elastic Stack Docker images.
Chapter 14, Working with Elasticsearch SQL, introduces Elasticsearch SQL. With Elasticsearch SQL, we can access full-text search using familiar SQL syntax. We can even obtain results in tabular view format. We perform search and aggregation using different approaches, such as using the SQL REST API interface, the command-line interface, and JDBC.
Chapter 15, Working with Elasticsearch Analysis Plugins, introduces built-in analysis plugins. We practice using the ICU Analysis plugin, the Smart Chinese Analysis plugin, and the IK Analysis plugin to analyze Chinese texts. We also add a new custom dictionary to improve word segmentation and generate better results.
Chapter 16, Machine Learning with Elasticsearch, discusses the machine learning feature supported by Elasticsearch. This feature automatically analyzes time series data by running a metric job. This type of job contains one or more detectors (the analyzed fields). We also introduce the Python scikit-learn library and the unsupervised learning algorithm K-means clustering and use it for comparison.
Chapter 17, Spark and Elasticsearch for Real-Time Analytics, focuses on ES-Hadoop's Apache Spark support. We practice reading data from the Elasticsearch index, performing some computations using Spark, and then writing the results back to Elasticsearch through ES-Hadoop. We build a real-time anomaly detection routine based on the K-means model created from past data by using the Spark ML library.
Chapter 18, Building Analytics RESTful Services, explains how to construct a project providing a search analytics REST service powered by Elasticsearch. We combine lots of material and source code from different chapters to build a real-world end-to-end project and present the result on a Kibana Visualize page.
Readers should have a basic knowledge of Linux, Java, Python, Virtualenv, SQL, Spark, and Docker.
All installation steps are described in detail in each relevant chapter.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Advanced-Elasticsearch-7.0. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781789957754_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."
A block of code is set as follows:
html, body, #map { height: 100%; margin: 0; padding: 0}
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)
Any command-line input or output is written as follows:
$ mkdir css
$ cd css
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
In this section, you will get an overview of Elasticsearch 7 by looking into various concepts and examining Elasticsearch services and core APIs. You will also look at the new distributed, scalable, real-time search and analytics engine.
This section comprises the following chapters:
Chapter 1, Overview of Elasticsearch 7
Chapter 2, Index APIs
Chapter 3, Document APIs
Chapter 4, Mapping APIs
Chapter 5, Anatomy of an Analyzer
Chapter 6, Search APIs
Welcome to Advanced Elasticsearch 7.0. Elasticsearch evolved quickly from version 1.0.0, released in February 2014, to version 6.0.0 GA, released in November 2017. We will use the 7.0.0 release as the base of this book. Without making any assumptions about your knowledge of Elasticsearch, this opening chapter provides instructions for setting up an Elasticsearch development environment. To help beginners try out some basic features within a few minutes, we walk through the steps needed to launch the new version of the Elasticsearch server. An architectural overview and some core concepts will help you understand the workflow within Elasticsearch and straighten your learning path.
Keep in mind that you can reap potential benefits by reading the API conventions section and becoming familiar with it. The New features section that follows lists the new features you can explore in this release. Because major changes are often introduced between major versions, you must check whether an upgrade breaks compatibility and affects your application. Go through the Migration between versions section to find out how to minimize the impact on your upgrade project.
In this chapter, you'll learn about the following topics:
Preparing your environment
Running Elasticsearch
Talking to Elasticsearch
Elasticsearch architectural overview
Key concepts
API conventions
New features
Breaking changes
Migration between versions
The first step for a novice is to set up the Elasticsearch server, while an experienced user may just need to upgrade the server to the new version. If you are going to upgrade your server software, read through the Breaking changes and Migration between versions sections to discover the changes that require your attention.
Elasticsearch is developed in Java. As of writing this book, it is recommended that you use a specific Oracle JDK, version 1.8.0_131. By default, Elasticsearch will use the Java version defined by the JAVA_HOME environment variable. Before installing Elasticsearch, please check the installed Java version.
Elasticsearch is supported on many popular operating systems such as RHEL, Ubuntu, Windows, and Solaris. For information on supported operating systems and product compatibility, see the Elastic Support Matrix at https://www.elastic.co/support/matrix. The installation instructions for all the supported platforms can be found in the Installing Elasticsearch documentation (https://www.elastic.co/guide/en/elasticsearch/reference/7.0/install-elasticsearch.html). Although there are many ways to properly install Elasticsearch on different operating systems, it is simplest for novices to run Elasticsearch from the command line. Please follow the instructions on the official download site (https://www.elastic.co/downloads/past-releases/elasticsearch-7-0-0). In this book, we'll use the Ubuntu 16.04 operating system to host the Elasticsearch service. For example, use the following command line to check the Java version on Ubuntu 16.04:
java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
The following is a step-by-step guide to installing the 7.0.0 release from the official download site:
Select the correct package for your operating system (WINDOWS, MACOS, LINUX, DEB, RPM, or MSI (BETA)) and download the 7.0.0 release. For Linux, the filename is elasticsearch-7.0.0-linux-x86_64.tar.gz.
Extract the GNU zipped file into the target directory, which will generate a folder called elasticsearch-7.0.0, using the following command:
tar -zxvf elasticsearch-7.0.0-linux-x86_64.tar.gz
Go to the folder and run Elasticsearch with the -p parameter to create a pid file at the specified path:
cd elasticsearch-7.0.0
./bin/elasticsearch -p pid
Elasticsearch runs in the foreground when launched with the preceding command line. To shut it down, press Ctrl + C, or use the process ID stored in the pid file in the working directory to terminate the process:
kill -15 `cat pid`
Check the log file to make sure the process has stopped. You should see messages such as Native controller process has stopped, stopped, closing, and closed near the end of the file:
tail logs/elasticsearch.log
To run Elasticsearch as a daemon in background mode, specify -d on the command line:
./bin/elasticsearch -d -p pid
In the next section, we will show you how to run an Elasticsearch instance.
Elasticsearch does not start automatically after installation. On Windows, to start it automatically at boot time, you can install Elasticsearch as a service. On Ubuntu, it's best to use the Debian package, which installs everything you need to configure Elasticsearch as a service. If you're interested, please refer to the official website (https://www.elastic.co/guide/en/elasticsearch/reference/master/deb.html).
Elasticsearch has two working modes: development mode and production mode. You'll work in development mode with a fresh installation. If you reconfigure a setting such as network.host, Elasticsearch switches to production mode. In production mode, some system settings must be taken care of; you can check them against the Elasticsearch Reference at https://www.elastic.co/guide/en/elasticsearch/reference/master/system-config.html. We will discuss the file descriptor and virtual memory settings here:
File descriptors: Elasticsearch uses a large number of file descriptors, and running out of them can result in data loss. Use the ulimit command to set the maximum number of open files for the current session, or in a runtime script file:
ulimit -n 65536
If you want to set the value permanently, add the following line to the /etc/security/limits.conf file:
elasticsearch - nofile 65536
Ubuntu ignores the limits.conf file for processes started by init.d. To enable the limits, edit the /etc/pam.d/su file and uncomment the pam_limits.so line shown here:
# Sets up user limits according to /etc/security/limits.conf
# (Replaces the use of /etc/limits in old login)
#session    required   pam_limits.so
Virtual memory: By default, Elasticsearch uses an mmapfs directory to store its indices; however, the default operating system limit on mmap counts is low. If the setting is below the standard, increase the limit to 262144 or higher:
sudo sysctl -w vm.max_map_count=262144
sudo sysctl -p
cat /proc/sys/vm/max_map_count
262144
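The sudo sysctl -p command reloads settings from /etc/sysctl.conf, so to make the limit survive a reboot, the same line can be added to that file (a sketch assuming a stock Ubuntu 16.04 layout):

```
# /etc/sysctl.conf -- persist the mmap count limit across reboots
vm.max_map_count=262144
```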
By default, the Elasticsearch security features are disabled for open source downloads or basic licensing. Since Elasticsearch binds to localhost only by default, it is safe to run the installed server as a local development server. The changed setting only takes effect after the Elasticsearch server instance has been restarted. In the next section, we will discuss several ways to communicate with Elasticsearch.
Many programming languages (including Java, Python, and .NET) have official clients written and supported by Elasticsearch (https://www.elastic.co/guide/en/elasticsearch/client/index.html). However, by default, only two protocols are really supported, HTTP (via a RESTful API) and native. You can talk to Elasticsearch via one of the following ways:
Transport client: One of the native ways to connect to Elasticsearch.
Node client: Similar to the transport client. In most cases, if you're using Java, you should choose the transport client instead of the node client.
HTTP client: For most programming languages, HTTP is the most common way to connect to Elasticsearch.
Other protocols: It's possible to create a new client interface to Elasticsearch simply by writing a plugin.
You can communicate with Elasticsearch via the default 9200 port using the RESTful API. An example of using the curl command to communicate with Elasticsearch from the command line is shown in the following code block. You should see the instance details and the cluster information in the response. Before running the following command, make sure the installed Elasticsearch server is running. In the response, the machine's hostname is wai. The default Elasticsearch cluster name is elasticsearch. The version of Elasticsearch that is running is 7.0.0. The downloaded Elasticsearch software is in TAR format. The version of Lucene used is 8.0.0:
curl -XGET 'http://localhost:9200'
{
"name" : "wai",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "7-fjLIFkQrednHgFh0Ufxw",
"version" : {
"number" : "7.0.0",
"build_flavor" : "default",
"build_type" : "tar",
"build_hash" : "a30e8c2",
"build_date" : "2018-12-17T12:33:32.311168Z",
"build_snapshot" : false,
"lucene_version" : "8.0.0",
"minimum_wire_compatibility_version" : "6.6.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
The Postman app is a handy tool for testing the REST API. In this book, we'll use Postman to illustrate the examples. The following are step-by-step instructions for installing Postman from the official download site (https://www.getpostman.com/apps):
Select your platform (Windows, macOS, or Linux) and download the appropriate 32-/64-bit version for your operating system. For 64-bit Linux, the filename is Postman-linux-x64-6.6.1.tar.gz.
Extract the GNU zipped file into your target directory, which will generate a folder called Postman:
tar -zxvf Postman-linux-x64-6.6.1.tar.gz
Go to the folder and run Postman; you'll see a pop-up window:
cd Postman
./Postman
In the pop-up window, use the same URL as in the previous curl command and press the Send button. You will get the same output, shown as follows:
In the next section, let's dive into the architectural overview of Elasticsearch.
The story of how the ELK Stack (Elasticsearch, Logstash, and Kibana) became the Elastic Stack is a long one (https://www.elastic.co/about/history-of-elasticsearch). At Elastic{ON} 2015 in San Francisco, Elasticsearch Inc. was renamed Elastic and announced the next evolution of the stack, the Elastic Stack. Elasticsearch will still play an important role, no matter what happens.
Elastic Stack is an end-to-end software stack for search and analysis solutions. It is designed to help users get data from any type of source in any format to allow for searching, analyzing, and visualizing data in real time. The full stack consists of the following:
Beats: A lightweight data conveyor that can send data directly to Elasticsearch or via Logstash
APM Server: Used for measuring and monitoring the performance of applications
Elasticsearch: A highly scalable full-text search and analytics engine
Elasticsearch-Hadoop: A two-way fast data mover between Apache Hadoop and Elasticsearch
Kibana: A tool for data exploration, visualization, and dashboarding
Logstash: A data-collection engine with real-time pipelining capabilities
Each individual product has its own purpose and features, as shown in the following diagram:
Elasticsearch is a real-time distributed search and analytics engine with high availability. It is used for full-text search, structured search, analytics, or all three in combination. It is built on top of the Apache Lucene library. It is a schema-free, document-oriented data store. However, unless you fully understand your use case, the general recommendation is not to use it as the primary data store. One of the advantages is that the RESTful API uses JSON over HTTP, which allows you to integrate, manage, and query index data in a variety of ways.
An Elasticsearch cluster is a group of one or more Elasticsearch nodes that are connected together. Let's first outline how it is laid out, as shown in the following diagram:
Although each node has its own purpose and responsibility, each node can forward client requests (coordination) to the appropriate nodes. The following are the nodes used in an Elasticsearch cluster:
Master-eligible node: The master node is primarily responsible for lightweight cluster-wide operations, including creating or deleting an index, tracking the cluster nodes, and determining the location of the allocated shards. By default, the master-eligible role is enabled. A master-eligible node can be elected to become the master node (the node with the asterisk) by the master-election process. You can disable this role for a node by setting node.master to false in the elasticsearch.yml file.
Data node: A data node holds the shards that contain the indexed documents, and handles data-related operations such as CRUD, search, and aggregation. By default, the data role is enabled; you can disable it for a node by setting node.data to false in the elasticsearch.yml file.
Ingest node: An ingest node can pre-process a document in pipeline mode before the document is indexed. By default, the ingest role is enabled; you can disable it for a node by setting node.ingest to false in the elasticsearch.yml file.
Coordinating-only node: If all three roles (master-eligible, data, and ingest) are disabled, the node acts only as a coordinating node, routing requests, handling the search reduce phase, and distributing work for bulk indexing.
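Combining the settings named above, a coordinating-only node would disable all three roles in its elasticsearch.yml file (a minimal sketch; all other settings keep their defaults):

```
node.master: false
node.data: false
node.ingest: false
```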
When you launch an instance of Elasticsearch, you actually launch the Elasticsearch node. In our installation, we are running a single node of Elasticsearch, so we have a cluster with one node. Let's retrieve the information for all nodes from our installed server using the Elasticsearch cluster nodes info API, as shown in the following screenshot:
The cluster name is elasticsearch. The total number of nodes is 1. The node ID is V1P0a-tVR8afUqJW86Hnrw. The node name is wai. The wai node has three roles, which are master, data, and ingest. The Elasticsearch version running on the node is 7.0.0.
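The cluster nodes info API shown in the screenshot is the GET /_nodes endpoint. As an offline sketch, the following extracts the cluster name and node roles from a trimmed sample response (values copied from the description above; a real response has many more fields) using standard shell tools. Against a live node, you would pipe curl -XGET 'http://localhost:9200/_nodes' instead of the sample string:

```shell
# Trimmed sample of a GET /_nodes response, using the values described above.
sample='{"_nodes":{"total":1},"cluster_name":"elasticsearch","nodes":{"V1P0a-tVR8afUqJW86Hnrw":{"name":"wai","version":"7.0.0","roles":["master","data","ingest"]}}}'
# Extract the cluster name and the roles list with grep and cut (no jq needed).
cluster=$(printf '%s' "$sample" | grep -o '"cluster_name":"[^"]*"' | cut -d'"' -f4)
roles=$(printf '%s' "$sample" | grep -o '"roles":\[[^]]*\]')
echo "$cluster"   # elasticsearch
echo "$roles"     # "roles":["master","data","ingest"]
```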
The data in Elasticsearch is organized into indices. Each index is a logical namespace for organizing data. The document is a basic unit of data in Elasticsearch. An inverted index is created by tokenizing the terms in the document, creating a sorted list of all unique terms, and associating the document list with the location where the terms can be found. An index consists of one or more shards. A shard is a Lucene index that uses a data structure (inverted index) to store data. Each shard can have zero or more replicas. Elasticsearch ensures that the primary and the replica of the same shard will not collocate in the same node, as shown in the following screenshot, where Data Node 1 contains primary shard 1 of Index 1 (I1P1), primary shard 2 of Index 2 (I2P2), replica shard 2 of Index 1 (I1R2), and replica shard 1 of Index 2 (I2R1).
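A layout like the one described, with two primary shards and one replica each per index, could be requested at index-creation time. The following assembles a settings body for such an index (the index name index1 is illustrative, and the curl call in the comment assumes a live local node):

```shell
# Settings body for an index with 2 primary shards and 1 replica per shard;
# the index name "index1" is illustrative.
body='{"settings": {"number_of_shards": 2, "number_of_replicas": 1}}'
# Against a live node, the index would be created with:
#   curl -XPUT 'http://localhost:9200/index1' \
#     -H 'Content-Type: application/json' -d "$body"
echo "$body"
```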
A Lucene index consists of one or more immutable index segments, and a segment is a fully functional inverted index. Because segments are immutable, Lucene can incrementally add new documents to the index without rebuilding it. To keep the number of segments manageable, Elasticsearch merges small segments into a larger segment, commits the new merged segment to disk, and eliminates the old, smaller segments at the appropriate time. For each search request, all Lucene segments of a given shard of an Elasticsearch index will be searched. Let's examine the query process in a cluster, as shown in the following diagram:
In the next section, let's drill down into the key concepts.
In the previous section, we learned some core concepts such as clusters, nodes, shards, replicas, and so on. We will briefly introduce the other key concepts in this section. Then, we'll drill down into the details in subsequent chapters.
In the early stages of Elasticsearch, mapping types were a way to divide the documents in an index into logical groups, which meant that an index could have any number of types. In the past, it was popular to compare an index in Elasticsearch to a database in SQL, and a mapping type to a table. According to the official Elastic website (https://www.elastic.co/guide/en/elasticsearch/reference/5.6/removal-of-types.html), the removal of mapping types was announced in the documentation of version 5.6. Later, in Elasticsearch 6.0.0, an index could contain only one mapping type. Mapping types were completely removed in Elasticsearch 7.0.0. The main reason is that tables are independent of each other in an SQL database, whereas in an Elasticsearch index, fields with the same name in different mapping types are backed by the same Lucene field internally.
Let's take a look at the terminology in SQL and Elasticsearch in the following table(https://www.elastic.co/guide/en/elasticsearch/reference/master/_mapping_concepts_across_sql_and_elasticsearch.html), showing how the data is organized:
SQL: Column / Elasticsearch: Field
A column is a set of data values of the same data type, with one value for each row of the table; Elasticsearch refers to this as a field. A field is the smallest unit of data in Elasticsearch. It can contain a list of multiple values of the same type.
SQL: Row / Elasticsearch: Document
A row represents a structured data item, which contains a series of data values from each column of the table. A document is like a row that groups fields (columns in SQL). A document is a JSON object in Elasticsearch.
SQL: Table / Elasticsearch: Index
A table consists of columns and rows. An index is the largest unit of data in Elasticsearch. Compared to a database in SQL, an index is a logical partition of the indexed documents and the target against which the search queries get executed.
SQL: Schema / Elasticsearch: Implicit
In a relational database management system (RDBMS), a schema contains schema objects, which can be tables, columns, data types, views, and so on. A schema is typically owned by a database user. Elasticsearch does not provide an equivalent concept.
SQL: Catalog/database / Elasticsearch: Cluster
In SQL, a catalog or database represents a set of schemas. In Elasticsearch, a cluster contains a set of indices.
A schema could mean an outline, diagram, or model, and is often used to describe the structure of different types of data. Elasticsearch is reputed to be schema-less, in contrast to traditional relational databases, where you must explicitly specify tables, fields, and field types. In Elasticsearch, schema-less simply means that a document can be indexed without specifying a schema in advance. Under the hood, though, when no explicit static mapping is specified, Elasticsearch dynamically derives a schema from the structure of the first indexed document and decides how to index its fields. Elasticsearch's version of a schema is called a mapping, which defines how Lucene stores the indexed documents and the fields they contain. When you add a new field to your document, the mapping is also updated automatically.
Starting from Elasticsearch 6.0.0, only one mapping type is allowed for each index. The mapping type has fields defined by data types and meta fields. Elasticsearch supports many different data types for fields in a document. Each document has meta-fields associated with it. We can customize the behavior of the meta-fields when creating a mapping type. We'll cover this in Chapter 4, Mapping APIs.
Elasticsearch comes with a variety of built-in analyzers that can be used in any index without further configuration. If the built-in analyzers are not suitable for your use case, you can create a custom analyzer. Whether it is a built-in analyzer or a customized analyzer, it is just a package of the three following lower-level building blocks:
Character filter: Receives the raw text as a stream of characters and can transform the stream by adding, removing, or changing its characters
Tokenizer: Splits the given stream of characters into a token stream
Token filter: Receives the token stream and may add, remove, or change tokens
The same analyzer should normally be used both at index time and at search time, but you can set search_analyzer in the field mapping to perform different analyses while searching.
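For illustration, such a split might look like this in a field mapping (the field name and the analyzer choices here are hypothetical, not taken from the text):

```
"title": {
  "type": "text",
  "analyzer": "standard",
  "search_analyzer": "simple"
}
```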
The standard analyzer is the default analyzer, which is used if none is specified. A standard analyzer consists of the following:
Character filter: None
Tokenizer: Standard tokenizer
Token filters: Lowercase token filter and stop token filter (disabled by default)
A standard tokenizer provides grammar-based tokenization. A lowercase token filter normalizes the token text to lowercase, while a stop token filter removes stop words from token streams. For a list of English stop words, you can refer to https://www.ranks.nl/stopwords. Let's test the standard analyzer with the input text You'll love Elasticsearch 7.0.
Since it is a POST request, you need to set the Content-Type to application/json:
The URL is http://localhost:9200/_analyze and the request Body has a raw JSON string, {"text": "You'll love Elasticsearch 7.0."}. You can see that the response has four tokens: you'll, love, elasticsearch, and 7.0, all in lowercase, which is due to the lowercase token filter:
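As a rough offline sketch of what the response shows, lowercasing (the lowercase token filter), stripping the trailing period, and splitting on spaces reproduces those four tokens. Note that the real standard tokenizer is grammar-based and far more sophisticated than this approximation:

```shell
# Approximate the analysis of the input text with plain shell tools:
# lowercase everything, drop the trailing period, emit one token per line.
text="You'll love Elasticsearch 7.0."
tokens=$(printf '%s\n' "$text" | tr '[:upper:]' '[:lower:]' | sed 's/\.$//' | tr ' ' '\n')
echo "$tokens"
# Prints:
#   you'll
#   love
#   elasticsearch
#   7.0
```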
In the next section, let's get familiar with the API conventions.
We will only discuss some of the major conventions. For others, please refer to the Elasticsearch reference (https://www.elastic.co/guide/en/elasticsearch/reference/master/api-conventions.html). The following list can be applied throughout the REST API:
Access across multiple indices: This convention cannot be used in the single-document APIs:
_all: For all indices
comma: A separator between two indices
wildcard (*, -): The asterisk character, *, matches any sequence of characters in an index name; prefixing an index name with the minus character, -, excludes that index
Common options:
Boolean values: false means the mentioned value is false; true means the value is true.
Number values: A number is passed as a string on top of the native JSON number type.
Time units for duration: The supported time units are d for days, h for hours, m for minutes, s for seconds, ms for milliseconds, micros for microseconds, and nanos for nanoseconds.
Byte size units: The supported data units are b for bytes, kb for kilobytes, mb for megabytes, gb for gigabytes, tb for terabytes, and pb for petabytes.
Distance units: The supported distance units are mi for miles, yd for yards, ft for feet, in for inches, km for kilometers, m for meters, cm for centimeters, mm for millimeters, and nmi or NM for nautical miles.
Unit-less quantities: If the value specified is large enough, we can use a quantity as a multiplier. The supported quantities are k for kilo, m for mega, g for giga, t for tera, and p for peta. For instance, 10m represents the value 10,000,000.
Human-readable output: Values can be converted to human-readable form, such as 1h for 1 hour and 1kb for 1,024 bytes. This option can be turned on by adding ?human=true to the query string. The default value is false.
Pretty results: If you append ?pretty=true to the request URL, the JSON string in the response will be pretty-formatted.
REST parameters: Follow the convention of using underscore delimiting.
Content type: The type of the content in the request body must be specified in the request header using the Content-Type key. Check the reference as to whether the content type you use is supported. In all our POST/PUT/PATCH request examples, application/json is used.
Request body in query string: If the client library does not accept a request body for non-POST requests, you can use the source query string parameter to pass the request body, and specify the source_content_type parameter with a supported media type.
Stack traces: If the error_trace=true request URL parameter is set, the error stack trace will be included in the response when an exception is raised.
Date math in a formatted date value: In range queries or in date range aggregations, you can format date fields using date math:
A date math expression starts with an anchor date (now, or a date string ending with a double vertical bar, ||), followed by one or more sub-expressions, such as +1h, -1d, or /d.
The supported time units are different from the duration time units in the Common options list above: y is for years, M is for months, w is for weeks, d is for days, h or H is for hours, m is for minutes, and s is for seconds. + is for addition, - is for subtraction, and / is for rounding down to the nearest time unit; for example, /d means rounding down to the nearest day.
Date math in index names: If you index time series data, such as logs, you can use a pattern with date fields in the index names to manage daily logging information. Date math then gives you a way to search through a series of time-based indices. The date math syntax for an index name is as follows:
<static_name{date_math_expr{date_format|time_zone}}>
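As a worked example, a template such as <logstash-{now/d}> names the current day's daily index; because the name contains characters that are special in URLs, it must be percent-encoded before being used in a request path (the index name here is illustrative):

```shell
# Percent-encode the date math index name <logstash-{now/d}> for use in a
# request URL: < > { } / become %3C %3E %7B %7D %2F respectively.
name='<logstash-{now/d}>'
encoded=$(printf '%s' "$name" | sed -e 's/</%3C/g' -e 's/>/%3E/g' -e 's/{/%7B/g' -e 's/}/%7D/g' -e 's,/,%2F,g')
echo "$encoded"   # %3Clogstash-%7Bnow%2Fd%7D%3E
```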
