Master the intricacies of Elasticsearch 7.0 and use it to create flexible and scalable search solutions
Key Features
Book Description
Building enterprise-grade distributed applications and executing systematic search operations call for a strong understanding of Elasticsearch and expertise in using its core APIs and latest features. This book will help you master the advanced functionalities of Elasticsearch and understand how you can develop a sophisticated, real-time search engine confidently. In addition to this, you'll also learn to run machine learning jobs in Elasticsearch to speed up routine tasks.
You'll get started by learning to use Elasticsearch features on Hadoop and Spark and make search results faster, thereby improving the speed of query results and enhancing the customer experience. You'll then get up to speed with performing analytics by building a metrics pipeline, defining queries, and using Kibana for intuitive visualizations that help provide decision-makers with better insights. The book will later guide you through using Logstash with examples to collect, parse, and enrich logs before indexing them in Elasticsearch.
By the end of this book, you will have comprehensive knowledge of advanced topics such as Apache Spark support, machine learning using Elasticsearch and scikit-learn, and real-time analytics, along with the expertise you need to increase business productivity, perform analytics, and get the very best out of Elasticsearch.
What you will learn
Who this book is for
This book is for Elasticsearch developers and data engineers who want to take their basic knowledge of Elasticsearch to the next level and use it to build enterprise-grade distributed search applications. Prior experience of working with Elasticsearch will be useful to get the most out of this book.
You can read this e-book in Legimi apps or in any app that supports the following format:
Number of pages: 453
Year of publication: 2019
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Pravin Dhandre
Acquisition Editor: Nelson Morris
Content Development Editor: Roshan Kumar
Senior Editor: Jack Cummings
Technical Editor: Dinesh Chaudhary
Copy Editor: Safis Editing
Project Coordinator: Namrata Swetta
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Nilesh Mohite
First published: August 2019
Production reference: 1220819
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78995-775-4
www.packtpub.com
Packt.com
Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Fully searchable for easy access to vital information
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Wai Tak Wong is a faculty member in the Department of Computer Science at Kean University, NJ, USA. He has more than 15 years' professional experience in cloud software design and development. He obtained his PhD in computer science at NJIT, NJ, USA. Wai Tak has served as an associate professor in the Information Management Department of Chung Hua University, Taiwan. A co-founder of Shanghai Shellshellfish Information Technology, Wai Tak acted as the chief scientist of the R&D team, and he has published more than a dozen algorithms in prestigious journals and conferences. Wai Tak began his search and analytics technology career with Elasticsearch in the real estate market and later applied it to data management and FinTech data services.
Marcelo Ochoa works for Dirección TICs of Facultad de Ciencias Exactas at Universidad Nacional del Centro de la Prov. de Buenos Aires and is the CTO at Scotas, a company that specializes in near real-time search solutions using Apache Solr and Oracle. He divides his time between university jobs and external projects related to Oracle, open source, and big data technologies. Since 2006, he has been part of an Oracle ACE program and was recently incorporated into a Docker Mentor program.
He has co-authored Oracle Database Programming Using Java and Web Services and Professional XML Databases, and has served as a technical reviewer for several books and videos, including Mastering Apache Solr 7, Mastering Elastic Stack, Learning Elasticsearch 6, and others.
Saurabh Chhajed is a machine learning and big data engineer with 9 years of professional experience in the enterprise application development life cycle, using the latest frameworks, tools, and design patterns. He has experience in designing and implementing some of the most widely used and scalable customer-facing recommendation systems, making extensive use of the big data ecosystem across batch, real-time, and machine learning pipelines. He has also worked for some of the largest investment banks, credit card companies, and manufacturing companies around the world, implementing a range of robust and scalable product suites. He has written Learning ELK Stack and reviewed Mastering Kibana and Python Machine Learning for Packt Publishing.
Craig Brown is an independent consultant, offering services for Elasticsearch and other big data software. He is a core Java developer with 25+ years of experience and more than 10 years of Elasticsearch experience. He is also experienced with machine learning, Hadoop, Apache Spark; is a co-founder of the Big Mountain Data user group in Utah; and is a speaker on Elasticsearch and other big data topics. Craig founded NosqlRevolution LLC, focused on Elasticsearch and big data services, and PicoCluster LLC, a desktop data center designed for learning and prototyping cluster computing and big data frameworks.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Advanced Elasticsearch 7.0
Dedication
About Packt
Why subscribe?
Contributors
About the author
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Section 1: Fundamentals and Core APIs
Overview of Elasticsearch 7
Preparing your environment
Running Elasticsearch
Basic Elasticsearch configuration
Important system configuration
Talking to Elasticsearch
Using Postman to work with the Elasticsearch REST API
Elasticsearch architectural overview
Elastic Stack architecture
Elasticsearch architecture
Between the Elasticsearch index and the Lucene index
Key concepts
Mapping concepts across SQL and Elasticsearch
Mapping
Analyzer
Standard analyzer
API conventions
New features
New features to be discussed
New features with description and issue number
Breaking changes
Aggregations changes
Analysis changes
API changes
Cluster changes
Discovery changes
High-level REST client changes
Low-level REST client changes 
Indices changes
Java API changes
Mapping changes
ML changes
Packaging changes
Search changes
Query DSL changes
Settings changes
Scripting changes
Migration between versions
Summary
Index APIs
Index management APIs
Basic CRUD APIs
Index settings
Index templates
Index aliases
Reindexing with zero downtime
Grouping multiple indices
Views on a subset of documents
Miscellaneous
Monitoring indices
Indices stats
Indices segments, recovery, and shard stores
Index persistence
Advanced index management APIs
Split index 
Shrink index 
Rollover index 
Summary
Document APIs
The Elasticsearch document life cycle
What is a document?
The document life cycle
Single document management APIs
Sample documents
Indexing a document
Retrieving a document by identifier
Updating a document
Removing a document by identifier
Multi-document management APIs
Retrieving multiple documents
Bulk API
Update by query API
Delete by query API
Reindex API
Copying documents
Migration from a multiple mapping types index
Summary
Mapping APIs
Dynamic mapping
Mapping rules
Dynamic templates
Meta fields in mapping
Field datatypes
Static mapping for the sample document 
Mapping parameters
Refreshing mapping changes for static mapping
Typeless APIs working with old custom index types
Summary
Anatomy of an Analyzer
An analyzer's components
Character filters
The html_strip filter
The mapping filter
The pattern_replace filter
Tokenizers
Token filters
Built-in analyzers
Custom analyzers
Normalizers
Summary
Search APIs
Indexing sample documents
Search APIs
URI search
Request body search
The sort parameter
The scroll parameter
The search_after parameter
The rescore parameter
The _name parameter
The collapse parameter
The highlighting parameter
Other search parameters
Query DSL
Full text queries
The match keyword
The query string keyword
The intervals keyword
Term-level queries
Compound queries
The script query
The multi-search API
Other search-related APIs
The _explain API
The _validate API
The _count API
The field capabilities API
Profiler
Suggesters
Summary
Section 2: Data Modeling, Aggregations Framework, Pipeline, and Data Analytics
Modeling Your Data in the Real World
The Investor Exchange Cloud
Modeling data and the approaches
Data denormalization
Using an array of objects datatype
Nested object mapping datatypes
Join datatypes
Parent ID query
has_child query
has_parent query
Practical considerations
Summary
Aggregation Frameworks
ETF historical data preparation
Aggregation query syntax
Matrix aggregations
Matrix stats
Metrics aggregations
avg
weighted_avg
cardinality
value_count
sum
min
max
stats
extended_stats
top_hits
percentiles
percentile_ranks
median_absolute_deviation
geo_bounds
geo_centroid
scripted_metric
Bucket aggregations
histogram
date_histogram
auto_date_histogram
ranges
date_range
ip_range
filter
filters
term
significant_terms
significant_text
sampler
diversified_sampler
nested
reverse_nested
global
missing
composite
adjacency_matrix
parent
children
geo_distance
geohash_grid
geotile_grid
Pipeline aggregations
Sibling family
avg_bucket 
max_bucket
min_bucket
sum_bucket
stats_bucket
extended_stats_bucket
percentiles_bucket
Parent family
cumulative_sum
derivative
bucket_script
bucket_selector
bucket_sort
serial_diff
Moving average aggregation
simple
linear
ewma
holt
holt_winters
Moving function aggregation
max
min
sum
stdDev
unweightedAvg
linearWeightedAvg
ewma
holt
holtWinters
Post filter on aggregations
Summary
Preprocessing Documents in Ingest Pipelines
Ingest APIs
Accessing data in pipelines
Processors
Conditional execution in pipelines
Handling failures in pipelines
Summary
Using Elasticsearch for Exploratory Data Analysis
Business analytics
Operational data analytics
Sentiment analysis
Summary
Section 3: Programming with the Elasticsearch Client
Elasticsearch from Java Programming
Overview of Elasticsearch Java REST client
The Java low-level REST client
The Java low-level REST client workflow
REST client initialization
Performing requests using a REST client 
Handling responses
Testing with Swagger UI
New features
The Java high-level REST client
The Java high-level REST client workflow
REST client initialization
Performing requests using the REST client
Handling responses
Testing with Swagger UI
New features
Spring Data Elasticsearch
Summary
Elasticsearch from Python Programming
Overview of the Elasticsearch Python client
The Python low-level Elasticsearch client
Workflow for the Python low-level Elasticsearch client
Client initialization
Performing requests
Handling responses
The Python high-level Elasticsearch library
Illustrating the programming concept
Initializing a connection
Performing requests 
Handling responses
The query class 
The aggregations class
Summary
Section 4: Elastic Stack
Using Kibana, Logstash, and Beats
Overview of the Elastic Stack
Running the Elastic Stack with Docker
Running Elasticsearch in a Docker container
Running Kibana in a Docker container
Running Logstash in a Docker container
Running Beats in a Docker container
Summary
Working with Elasticsearch SQL
 Overview
Getting started
Elasticsearch SQL language
Reserved keywords
Data type
Operators
Functions
Aggregate
Grouping
Date-time
Full-text search 
Mathematics
String
Type conversion
Conditional
System
Elasticsearch SQL query syntax
New features
Elasticsearch SQL REST API
Elasticsearch SQL JDBC
Upgrading Elasticsearch from a basic to a trial license
Workflow of Elasticsearch SQL JDBC 
Testing with Swagger UI
Summary
Working with Elasticsearch Analysis Plugins
What are Elasticsearch plugins?
Plugin management
Working with the ICU Analysis plugin
Examples
Working with the Smart Chinese Analysis plugin
Examples
Working with the IK Analysis plugin
Examples
Configuring a custom dictionary in the IK Analysis plugin
Summary
Section 5: Advanced Features
Machine Learning with Elasticsearch
Machine learning with Elastic Stack
Machine learning APIs
Machine learning jobs
Sample data
Running a single-metric job
Creating index patterns
Creating a new machine learning job
Examining the result
Machine learning using Elasticsearch and scikit-learn
Summary
Spark and Elasticsearch for Real-Time Analytics
Overview of ES-Hadoop
Apache Spark support
Real-time analytics using Elasticsearch and Apache Spark
Building a virtual environment to run the sample ES-Hadoop project
Running the sample ES-Hadoop project
Running the sample ES-Hadoop project using a prepared Docker image
Source code
Summary
Building Analytics RESTful Services
Building a RESTful web service with Spring Boot
Project program structure
Running the program and examining the APIs
Main workflow anatomy
Building the analytic model
Performing daily update data
Getting the registered symbols
Building the scheduler
Integration with the Bollinger Band
Building a Java Spark ML module for k-means anomaly detection
Source code
Testing Analytics RESTful services
Testing the build-analytics-model API
Testing the get-register-symbols API
Working with Kibana to visualize the analytics results
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
You will do the following:
Pre-process documents before indexing in ingest pipelines
Learn how to model your data in the real world
Get to grips with using Elasticsearch for exploratory data analysis
Understand how to build analytics and RESTful services
Use Kibana, Logstash, and Beats for dashboard applications
Get up to speed with Spark and Elasticsearch for real-time analytics
Explore the Java high/low-level REST client and learn how to index, search, and query in a Spring application
The book is aimed at beginners with no prior experience of Elasticsearch and gradually introduces intermediate and advanced topics. The chapters walk through the most important aspects of the engine to help readers build and master a powerful search engine. Search engine data engineers, software engineers, and database engineers who want to take their basic knowledge of Elasticsearch to the next level will learn to use it to its full potential in their daily core tasks.
Chapter 1, Overview of Elasticsearch 7, takes beginners through some basic features in minutes. We just take a few steps to launch the new version of the Elasticsearch server. An architectural overview and a core concept introduction will make it easy to understand the workflow in Elasticsearch.
Chapter 2, Index APIs, discusses how to use index APIs to manage individual indices, index settings, aliases, and templates. It also involves monitoring statistics for operations that occur on an index. Index management operations including refreshing, flushing, and clearing the cache are also discussed.
Chapter 3, Document APIs, begins with the basic information about a document and its life cycle. Then we learn how to access it. After that, we look at accessing multiple documents with the bulk API. Finally, we discuss migrating indices from the old version to version 7.0.
Chapter 4, Mapping APIs, introduces the schema in Elasticsearch. The mapping rules for both dynamic mappings and explicit static mappings will be discussed. It also provides the idea and details of creating static mapping for an index. We also step into the details of the meta fields and field data types in index mapping.
Chapter 5, Anatomy of an Analyzer, drills down into the anatomy of the analyzer and provides in-depth practice with different analyzers. We discuss different character filters, tokenizers, and token filters in order to understand the building blocks of an analyzer. We also practice creating a custom analyzer and using it in the analyze API.
Chapter 6, Search APIs, covers different types of searches: from term-based to full-text, from exact search to fuzzy search, from single-field search to multi-search, and then to compound search. We also discuss Query DSL and search-related APIs for tuning, validating, and troubleshooting queries.
Chapter 7, Modeling Your Data in the Real World, discusses data modeling with Elasticsearch. It focuses on some common issues users may encounter when working with different techniques. It helps you understand some of the conventions and contains insights from real-world examples involving denormalizing complex objects and using nested objects to handle relationships.
Chapter 8, Aggregation Framework, discusses data analytics using the aggregation framework. We learn how to perform aggregations with examples and delve into most of the types of aggregations. We also use IEX ETF historical data to plot a graph for different types of moving averages, including forecasted data supported by the model.
Chapter 9, Preprocessing Documents in Ingest Pipelines, discusses the preprocessing of a document through predefined pipeline processors before the actual indexing operation begins. We also learn how to access data in documents through the pipeline processors. Finally, we cover exception handling when an error occurs during pipeline processing.
Chapter 10, Using Elasticsearch for Exploratory Data Analysis, uses the aggregation framework to perform data analysis. We first carry out a comprehensive exploratory data analysis and a simple financial analysis of business strategies. In addition, we provide step-by-step instructions for calculating Bollinger Bands using daily operational data. Finally, we conduct a brief survey of sentiment analysis using Elasticsearch.
Chapter 11, Elasticsearch from Java Programming, focuses on the basics of two supported Java REST clients. We explore the main features and operations of each approach. A sample project is provided to demonstrate the high-level and low-level REST clients integrated with Spring Boot programming.
Chapter 12, Elasticsearch from Python Programming, introduces the Python Elasticsearch client. We learn about two Elasticsearch client packages, elasticsearch-py and elasticsearch-dsl-py. We learn how the clients work and incorporate them into a Python application. We implement Bollinger Bands by using elasticsearch-dsl-py.
Chapter 13, Using Kibana, Logstash, and Beats, outlines the components of the Elastic Stack, including Kibana, Logstash, and Beats. We learn how to use Logstash to collect and parse log data from system log files. In addition, we use Filebeat to extend the use of Logstash to a central log processing center. All work is run on officially supported Elastic Stack Docker images.
Chapter 14, Working with Elasticsearch SQL, introduces Elasticsearch SQL. With Elasticsearch SQL, we can access full-text search using familiar SQL syntax. We can even obtain results in tabular view format. We perform search and aggregation using different approaches, such as using the SQL REST API interface, the command-line interface, and JDBC.
Chapter 15, Working with Elasticsearch Analysis Plugins, introduces built-in analysis plugins. We practice using the ICU Analysis plugin, the Smart Chinese Analysis plugin, and the IK Analysis plugin to analyze Chinese texts. We also add a new custom dictionary to improve word segmentation and generate better results.
Chapter 16, Machine Learning with Elasticsearch, discusses the machine learning feature supported by Elasticsearch. This feature automatically analyzes time series data by running a metric job. This type of job contains one or more detectors (the analyzed fields). We also introduce the Python scikit-learn library and the unsupervised learning algorithm K-means clustering and use it for comparison.
Chapter 17, Spark and Elasticsearch for Real-Time Analytics, focuses on ES-Hadoop's Apache Spark support. We practice reading data from the Elasticsearch index, performing some computations using Spark, and then writing the results back to Elasticsearch through ES-Hadoop. We build a real-time anomaly detection routine based on the K-means model created from past data by using the Spark ML library.
Chapter 18, Building Analytics RESTful Services, explains how to construct a project providing a search analytics REST service powered by Elasticsearch. We combine lots of material and source code from different chapters to build a real-world end-to-end project and present the result on a Kibana Visualize page.
Readers should have a basic knowledge of Linux, Java, Python, Virtualenv, SQL, Spark, and Docker.
All installation steps are described in detail in each relevant chapter.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Advanced-Elasticsearch-7.0. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781789957754_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."
A block of code is set as follows:
html, body, #map { height: 100%; margin: 0; padding: 0}
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)
Any command-line input or output is written as follows:
$ mkdir css
$ cd css
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
In this section, you will get an overview of Elasticsearch 7 by looking into various concepts and examining Elasticsearch services and core APIs. You will also look at the new distributed, scalable, real-time search and analytics engine.
This section comprises the following chapters:
Chapter 1, Overview of Elasticsearch 7
Chapter 2, Index APIs
Chapter 3, Document APIs
Chapter 4, Mapping APIs
Chapter 5, Anatomy of an Analyzer
Chapter 6, Search APIs
Welcome to Advanced Elasticsearch 7.0. Elasticsearch evolved quickly from version 1.0.0, released in February 2014, to version 6.0.0 GA, released in November 2017. We will use the 7.0.0 release as the base of this book. Without making any assumptions about your knowledge of Elasticsearch, this opening chapter provides instructions for setting up an Elasticsearch development environment. To help beginners try out some basic features within a few minutes, we walk through the steps needed to launch the new version of the Elasticsearch server. An architectural overview and some core concepts will help you understand the workflow within Elasticsearch and straighten your learning path.
Keep in mind that you can reap potential benefits by reading the API conventions section and becoming familiar with it. The New features section that follows lists the new features you can explore in this release. Because major changes are often introduced between major versions, you must check whether an upgrade breaks compatibility and affects your application. Go through the Migration between versions section to find out how to minimize the impact on your upgrade project.
In this chapter, you'll learn about the following topics:
Preparing your environment
Running Elasticsearch
Talking to Elasticsearch
Elasticsearch architectural overview
Key concepts
API conventions
New features
Breaking changes
Migration between versions
The first step for a novice is to set up the Elasticsearch server, while an experienced user may just need to upgrade the server to the new version. If you are going to upgrade your server software, read through the Breaking changes and Migration between versions sections to discover the changes that require your attention.
Elasticsearch is developed in Java. As of writing this book, it is recommended that you use a specific Oracle JDK, version 1.8.0_131. By default, Elasticsearch will use the Java version defined by the JAVA_HOME environment variable. Before installing Elasticsearch, please check the installed Java version.
Elasticsearch is supported on many popular operating systems such as RHEL, Ubuntu, Windows, and Solaris. For information on supported operating systems and product compatibility, see the Elastic Support Matrix at https://www.elastic.co/support/matrix. The installation instructions for all the supported platforms can be found in the Installing Elasticsearch documentation (https://www.elastic.co/guide/en/elasticsearch/reference/7.0/install-elasticsearch.html). Although there are many ways to properly install Elasticsearch on different operating systems, it is simplest for novices to run Elasticsearch from the command line. Please follow the instructions on the official download site (https://www.elastic.co/downloads/past-releases/elasticsearch-7-0-0). In this book, we'll use the Ubuntu 16.04 operating system to host the Elasticsearch service. For example, use the following command line to check the Java version on Ubuntu 16.04:
java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
The following is a step-by-step guide to installing the 7.0.0 release from the official download site:
Select the correct package for your operating system (WINDOWS, MACOS, LINUX, DEB, RPM, or MSI (BETA)) and download the 7.0.0 release. For Linux, the filename is elasticsearch-7.0.0-linux-x86_64.tar.gz.
Extract the GNU zipped file into the target directory, which will generate a folder called elasticsearch-7.0.0, using the following command:
tar -zxvf elasticsearch-7.0.0-linux-x86_64.tar.gz
Go to the folder and run Elasticsearch with the -p parameter to create a pid file at the specified path:
cd elasticsearch-7.0.0
./bin/elasticsearch -p pid
Elasticsearch runs in the foreground when launched with the preceding command line. To shut it down, press Ctrl + C, or use the process ID stored in the pid file in the working directory to terminate the process:
kill -15 `cat pid`
Check the log file to make sure the process has stopped. You should see messages such as Native controller process has stopped, stopped, closing, and closed near the end of the file:
tail logs/elasticsearch.log
To run Elasticsearch as a daemon in background mode, specify -d on the command line:
./bin/elasticsearch -d -p pid
In the next section, we will show you how to run an Elasticsearch instance.
Elasticsearch does not start automatically after installation. On Windows, to start it automatically at boot time, you can install Elasticsearch as a service. On Ubuntu, it's best to use the Debian package, which installs everything you need to configure Elasticsearch as a service. If you're interested, please refer to the official website (https://www.elastic.co/guide/en/elasticsearch/reference/master/deb.html).
Elasticsearch has two working modes: development mode and production mode. You'll work in development mode with a fresh installation. If you reconfigure a setting such as network.host, Elasticsearch switches to production mode. In production mode, some system settings must be taken care of; you can check them against the Elasticsearch Reference at https://www.elastic.co/guide/en/elasticsearch/reference/master/system-config.html. We will discuss the file descriptor and virtual memory settings here:
File descriptors: Elasticsearch uses a large number of file descriptors, and running out of them can result in data loss. Use the ulimit command to set the maximum number of open files for the current session, or in a runtime script file:
ulimit -n 65536
If you want to set the value permanently, add the following line to the /etc/security/limits.conf file:
elasticsearch - nofile 65536
Ubuntu ignores the limits.conf file for processes started by init.d. To enable the limits, edit the /etc/pam.d/su file and uncomment the pam_limits.so line shown here:
# Sets up user limits according to /etc/security/limits.conf
# (Replaces the use of /etc/limits in old login)
#session    required   pam_limits.so
Virtual memory: By default, Elasticsearch uses an mmapfs directory to store its indices; however, the default operating system limit on mmap counts is low. If the setting is below the standard, increase the limit to 262144 or higher:
sudo sysctl -w vm.max_map_count=262144
sudo sysctl -p
cat /proc/sys/vm/max_map_count
262144
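The sudo sysctl -p command reloads settings from /etc/sysctl.conf, so to make the limit survive a reboot, the same line can be added to that file (a sketch assuming a stock Ubuntu 16.04 layout):

```
# /etc/sysctl.conf -- persist the mmap count limit across reboots
vm.max_map_count=262144
```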
By default, the Elasticsearch security features are disabled for open source downloads or basic licensing. Since Elasticsearch binds to localhost only by default, it is safe to run the installed server as a local development server. The changed setting only takes effect after the Elasticsearch server instance has been restarted. In the next section, we will discuss several ways to communicate with Elasticsearch.
Many programming languages (including Java, Python, and .NET) have official clients written and supported by Elasticsearch (https://www.elastic.co/guide/en/elasticsearch/client/index.html). However, by default, only two protocols are really supported, HTTP (via a RESTful API) and native. You can talk to Elasticsearch via one of the following ways:
Transport client: One of the native ways to connect to Elasticsearch.
Node client: Similar to the transport client. In most cases, if you're using Java, you should choose the transport client instead of the node client.
HTTP client: For most programming languages, HTTP is the most common way to connect to Elasticsearch.
Other protocols: It's possible to create a new client interface to Elasticsearch simply by writing a plugin.
You can communicate with Elasticsearch via the default 9200 port using the RESTful API. An example of using the curl command to communicate with Elasticsearch from the command line is shown in the following code block. You should see the instance details and the cluster information in the response. Before running the following command, make sure the installed Elasticsearch server is running. In the response, the machine's hostname is wai. The default Elasticsearch cluster name is elasticsearch. The version of Elasticsearch that is running is 7.0.0. The downloaded Elasticsearch software is in TAR format. The version of Lucene used is 8.0.0:
curl -XGET 'http://localhost:9200'
{
"name" : "wai",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "7-fjLIFkQrednHgFh0Ufxw",
"version" : {
"number" : "7.0.0",
"build_flavor" : "default",
"build_type" : "tar",
"build_hash" : "a30e8c2",
"build_date" : "2018-12-17T12:33:32.311168Z",
"build_snapshot" : false,
"lucene_version" : "8.0.0",
"minimum_wire_compatibility_version" : "6.6.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
The Postman app is a handy tool for testing the REST API. In this book, we'll use Postman to illustrate the examples. The following are step-by-step instructions for installing Postman from the official download site (https://www.getpostman.com/apps):
Select your platform (Windows, macOS, or Linux) and download the appropriate 32-/64-bit version for your operating system. For 64-bit Linux, the filename is Postman-linux-x64-6.6.1.tar.gz.
Extract the GNU zipped file into your target directory, which will generate a folder called Postman:
tar -zxvf Postman-linux-x64-6.6.1.tar.gz
Go to the folder and run Postman; you'll see a pop-up window:
cd Postman
./Postman
In the pop-up window, use the same URL as in the previous curl command and press the Send button. You will get the same output, shown as follows:
In the next section, let's dive into the architectural overview of Elasticsearch.
The story of how the ELK Stack (Elasticsearch, Logstash, and Kibana) became the Elastic Stack is a long one (https://www.elastic.co/about/history-of-elasticsearch). At Elastic{ON} 2015 in San Francisco, Elasticsearch Inc. was renamed Elastic and announced the next evolution of the stack, the Elastic Stack. Elasticsearch will still play an important role, no matter what happens.
Elastic Stack is an end-to-end software stack for search and analysis solutions. It is designed to help users get data from any type of source in any format to allow for searching, analyzing, and visualizing data in real time. The full stack consists of the following:
Beats: A lightweight data conveyor that can send data directly to Elasticsearch or via Logstash
APM Server: Used for measuring and monitoring the performance of applications
Elasticsearch: A highly scalable full-text search and analytics engine
Elasticsearch-Hadoop: A two-way fast data mover between Apache Hadoop and Elasticsearch
Kibana: A tool for data exploration, visualization, and dashboarding
Logstash: A data-collection engine with real-time pipelining capabilities
Each individual product has its own purpose and features, as shown in the following diagram:
Elasticsearch is a real-time distributed search and analytics engine with high availability. It is used for full-text search, structured search, analytics, or all three in combination. It is built on top of the Apache Lucene library. It is a schema-free, document-oriented data store. However, unless you fully understand your use case, the general recommendation is not to use it as the primary data store. One of the advantages is that the RESTful API uses JSON over HTTP, which allows you to integrate, manage, and query index data in a variety of ways.
An Elasticsearch cluster is a group of one or more Elasticsearch nodes that are connected together. Let's first outline how it is laid out, as shown in the following diagram:
Although each node has its own purpose and responsibility, each node can forward client requests (coordination) to the appropriate nodes. The following are the nodes used in an Elasticsearch cluster:
Master-eligible node: The master node is primarily responsible for lightweight cluster-wide operations, including creating or deleting an index, tracking the cluster nodes, and determining the location of the allocated shards. By default, the master-eligible role is enabled. A master-eligible node can be elected to become the master node (the node with the asterisk) by the master-election process. You can disable this role for a node by setting node.master to false in the elasticsearch.yml file.
Data node: A data node holds the shards that contain the indexed documents, and handles data-related operations such as CRUD, search, and aggregation. By default, the data role is enabled; you can disable it for a node by setting node.data to false in the elasticsearch.yml file.
Ingest node: An ingest node can pre-process a document in pipeline mode before the document is indexed. By default, the ingest role is enabled; you can disable it for a node by setting node.ingest to false in the elasticsearch.yml file.
Coordinating-only node: If all three roles (master-eligible, data, and ingest) are disabled, the node acts only as a coordinating node, routing requests, handling the search reduce phase, and distributing work for bulk indexing.
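Combining the settings named above, a coordinating-only node would disable all three roles in its elasticsearch.yml file (a minimal sketch; all other settings keep their defaults):

```
node.master: false
node.data: false
node.ingest: false
```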
When you launch an instance of Elasticsearch, you actually launch the Elasticsearch node. In our installation, we are running a single node of Elasticsearch, so we have a cluster with one node. Let's retrieve the information for all nodes from our installed server using the Elasticsearch cluster nodes info API, as shown in the following screenshot:
The cluster name is elasticsearch. The total number of nodes is 1. The node ID is V1P0a-tVR8afUqJW86Hnrw. The node name is wai. The wai node has three roles, which are master, data, and ingest. The Elasticsearch version running on the node is 7.0.0.
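The cluster nodes info API shown in the screenshot is the GET /_nodes endpoint. As an offline sketch, the following extracts the cluster name and node roles from a trimmed sample response (values copied from the description above; a real response has many more fields) using standard shell tools. Against a live node, you would pipe curl -XGET 'http://localhost:9200/_nodes' instead of the sample string:

```shell
# Trimmed sample of a GET /_nodes response, using the values described above.
sample='{"_nodes":{"total":1},"cluster_name":"elasticsearch","nodes":{"V1P0a-tVR8afUqJW86Hnrw":{"name":"wai","version":"7.0.0","roles":["master","data","ingest"]}}}'
# Extract the cluster name and the roles list with grep and cut (no jq needed).
cluster=$(printf '%s' "$sample" | grep -o '"cluster_name":"[^"]*"' | cut -d'"' -f4)
roles=$(printf '%s' "$sample" | grep -o '"roles":\[[^]]*\]')
echo "$cluster"   # elasticsearch
echo "$roles"     # "roles":["master","data","ingest"]
```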
The data in Elasticsearch is organized into indices. Each index is a logical namespace for organizing data. The document is a basic unit of data in Elasticsearch. An inverted index is created by tokenizing the terms in the document, creating a sorted list of all unique terms, and associating the document list with the location where the terms can be found. An index consists of one or more shards. A shard is a Lucene index that uses a data structure (inverted index) to store data. Each shard can have zero or more replicas. Elasticsearch ensures that the primary and the replica of the same shard will not collocate in the same node, as shown in the following screenshot, where Data Node 1 contains primary shard 1 of Index 1 (I1P1), primary shard 2 of Index 2 (I2P2), replica shard 2 of Index 1 (I1R2), and replica shard 1 of Index 2 (I2R1).
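A layout like the one described, with two primary shards and one replica each per index, could be requested at index-creation time. The following assembles a settings body for such an index (the index name index1 is illustrative, and the curl call in the comment assumes a live local node):

```shell
# Settings body for an index with 2 primary shards and 1 replica per shard;
# the index name "index1" is illustrative.
body='{"settings": {"number_of_shards": 2, "number_of_replicas": 1}}'
# Against a live node, the index would be created with:
#   curl -XPUT 'http://localhost:9200/index1' \
#     -H 'Content-Type: application/json' -d "$body"
echo "$body"
```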
A Lucene index consists of one or more immutable index segments, and a segment is a fully functional inverted index. Because segments are immutable, Lucene can incrementally add new documents to the index without rebuilding it. To keep the number of segments manageable, Elasticsearch merges small segments into a larger segment, commits the new merged segment to disk, and eliminates the old, smaller segments at the appropriate time. For each search request, all Lucene segments of a given shard of an Elasticsearch index will be searched. Let's examine the query process in a cluster, as shown in the following diagram:
In the next section, let's drill down into the key concepts.
In the previous section, we learned some core concepts such as clusters, nodes, shards, replicas, and so on. We will briefly introduce the other key concepts in this section. Then, we'll drill down into the details in subsequent chapters.
In the early stages of Elasticsearch, mapping types were a way to divide the documents in an index into logical groups, which meant that an index could have any number of types. In the past, it was popular to compare an index in Elasticsearch to a database in SQL, and a mapping type to a table. According to the official Elastic website (https://www.elastic.co/guide/en/elasticsearch/reference/5.6/removal-of-types.html), the removal of mapping types was announced in the documentation of version 5.6. Later, in Elasticsearch 6.0.0, an index could contain only one mapping type. Mapping types were completely removed in Elasticsearch 7.0.0. The main reason is that tables are independent of each other in an SQL database, whereas in an Elasticsearch index, fields with the same name in different mapping types are backed by the same Lucene field internally.
Let's take a look at the terminology in SQL and Elasticsearch in the following table(https://www.elastic.co/guide/en/elasticsearch/reference/master/_mapping_concepts_across_sql_and_elasticsearch.html), showing how the data is organized:
SQL: Column / Elasticsearch: Field
A column is a set of data values of the same data type, with one value for each row of the table; Elasticsearch refers to this as a field. A field is the smallest unit of data in Elasticsearch. It can contain a list of multiple values of the same type.
SQL: Row / Elasticsearch: Document
A row represents a structured data item, which contains a series of data values from each column of the table. A document is like a row that groups fields (columns in SQL). A document is a JSON object in Elasticsearch.
SQL: Table / Elasticsearch: Index
A table consists of columns and rows. An index is the largest unit of data in Elasticsearch. Compared to a database in SQL, an index is a logical partition of the indexed documents and the target against which the search queries get executed.
SQL: Schema / Elasticsearch: Implicit
In a relational database management system (RDBMS), a schema contains schema objects, which can be tables, columns, data types, views, and so on. A schema is typically owned by a database user. Elasticsearch does not provide an equivalent concept.
SQL: Catalog/database / Elasticsearch: Cluster
In SQL, a catalog or database represents a set of schemas. In Elasticsearch, a cluster contains a set of indices.
A schema could mean an outline, diagram, or model, and is often used to describe the structure of different types of data. Elasticsearch is reputed to be schema-less, in contrast to traditional relational databases, where you must explicitly specify tables, fields, and field types. In Elasticsearch, schema-less simply means that a document can be indexed without specifying a schema in advance. Under the hood, though, when no explicit static mapping is specified, Elasticsearch dynamically derives a schema from the structure of the first indexed document and decides how to index its fields. Elasticsearch's version of a schema is called a mapping, which defines how Lucene stores the indexed documents and the fields they contain. When you add a new field to your document, the mapping is also updated automatically.
Starting from Elasticsearch 6.0.0, only one mapping type is allowed for each index. The mapping type has fields defined by data types and meta fields. Elasticsearch supports many different data types for fields in a document. Each document has meta-fields associated with it. We can customize the behavior of the meta-fields when creating a mapping type. We'll cover this in Chapter 4, Mapping APIs.
Elasticsearch comes with a variety of built-in analyzers that can be used in any index without further configuration. If the built-in analyzers are not suitable for your use case, you can create a custom analyzer. Whether it is a built-in analyzer or a customized analyzer, it is just a package of the three following lower-level building blocks:
Character filter: Receives the raw text as a stream of characters and can transform the stream by adding, removing, or changing its characters
Tokenizer: Splits the given stream of characters into a token stream
Token filter: Receives the token stream and may add, remove, or change tokens
The same analyzer should normally be used both at index time and at search time, but you can set search_analyzer in the field mapping to perform different analyses while searching.
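For illustration, such a split might look like this in a field mapping (the field name and the analyzer choices here are hypothetical, not taken from the text):

```
"title": {
  "type": "text",
  "analyzer": "standard",
  "search_analyzer": "simple"
}
```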
The standard analyzer is the default analyzer, which is used if none is specified. A standard analyzer consists of the following:
Character filter: None
Tokenizer: Standard tokenizer
Token filters: Lowercase token filter and stop token filter (disabled by default)
A standard tokenizer provides grammar-based tokenization. A lowercase token filter normalizes the token text to lowercase, while a stop token filter removes stop words from token streams. For a list of English stop words, you can refer to https://www.ranks.nl/stopwords. Let's test the standard analyzer with the input text You'll love Elasticsearch 7.0.
Since it is a POST request, you need to set the Content-Type to application/json:
The URL is http://localhost:9200/_analyze and the request Body has a raw JSON string, {"text": "You'll love Elasticsearch 7.0."}. You can see that the response has four tokens: you'll, love, elasticsearch, and 7.0, all in lowercase, which is due to the lowercase token filter:
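As a rough offline sketch of what the response shows, lowercasing (the lowercase token filter), stripping the trailing period, and splitting on spaces reproduces those four tokens. Note that the real standard tokenizer is grammar-based and far more sophisticated than this approximation:

```shell
# Approximate the analysis of the input text with plain shell tools:
# lowercase everything, drop the trailing period, emit one token per line.
text="You'll love Elasticsearch 7.0."
tokens=$(printf '%s\n' "$text" | tr '[:upper:]' '[:lower:]' | sed 's/\.$//' | tr ' ' '\n')
echo "$tokens"
# Prints:
#   you'll
#   love
#   elasticsearch
#   7.0
```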
In the next section, let's get familiar with the API conventions.
We will only discuss some of the major conventions. For others, please refer to the Elasticsearch reference (https://www.elastic.co/guide/en/elasticsearch/reference/master/api-conventions.html). The following list can be applied throughout the REST API:
Access across multiple indices: This convention cannot be used in the single-document APIs:
_all: For all indices
comma: A separator between two indices
wildcard (*, -): The asterisk character, *, matches any sequence of characters in an index name; prefixing an index name with the minus character, -, excludes that index
Common options:
Boolean values: false means the mentioned value is false; true means the value is true.
Number values: A number is passed as a string on top of the native JSON number type.
Time units for duration: The supported time units are d for days, h for hours, m for minutes, s for seconds, ms for milliseconds, micros for microseconds, and nanos for nanoseconds.
Byte size units: The supported data units are b for bytes, kb for kilobytes, mb for megabytes, gb for gigabytes, tb for terabytes, and pb for petabytes.
Distance units: The supported distance units are mi for miles, yd for yards, ft for feet, in for inches, km for kilometers, m for meters, cm for centimeters, mm for millimeters, and nmi or NM for nautical miles.
Unit-less quantities: If the value specified is large enough, we can use a quantity as a multiplier. The supported quantities are k for kilo, m for mega, g for giga, t for tera, and p for peta. For instance, 10m represents the value 10,000,000.
Human-readable output: Values can be converted to human-readable form, such as 1h for 1 hour and 1kb for 1,024 bytes. This option can be turned on by adding ?human=true to the query string. The default value is false.
Pretty results: If you append ?pretty=true to the request URL, the JSON string in the response will be pretty-formatted.
REST parameters: Follow the convention of using underscore delimiting.
Content type: The type of the content in the request body must be specified in the request header using the Content-Type key. Check the reference as to whether the content type you use is supported. In all our POST/PUT/PATCH request examples, application/json is used.
Request body in query string: If the client library does not accept a request body for non-POST requests, you can use the source query string parameter to pass the request body, and specify the source_content_type parameter with a supported media type.
Stack traces: If the error_trace=true request URL parameter is set, the error stack trace will be included in the response when an exception is raised.
Date math in a formatted date value: In range queries or in date range aggregations, you can format date fields using date math:
A date math expression starts with an anchor date (now, or a date string ending with a double vertical bar, ||), followed by one or more sub-expressions, such as +1h, -1d, or /d.
The supported time units are different from the duration time units in the Common options list above: y is for years, M is for months, w is for weeks, d is for days, h or H is for hours, m is for minutes, and s is for seconds. + is for addition, - is for subtraction, and / is for rounding down to the nearest time unit; for example, /d means rounding down to the nearest day.
Date math in index names: If you index time series data, such as logs, you can use a pattern with date fields in the index names to manage daily logging information. Date math then gives you a way to search through a series of time-based indices. The date math syntax for an index name is as follows:
<static_name{date_math_expr{date_format|time_zone}}>
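As a worked example, a template such as <logstash-{now/d}> names the current day's daily index; because the name contains characters that are special in URLs, it must be percent-encoded before being used in a request path (the index name here is illustrative):

```shell
# Percent-encode the date math index name <logstash-{now/d}> for use in a
# request URL: < > { } / become %3C %3E %7B %7D %2F respectively.
name='<logstash-{now/d}>'
encoded=$(printf '%s' "$name" | sed -e 's/</%3C/g' -e 's/>/%3E/g' -e 's/{/%7B/g' -e 's/}/%7D/g' -e 's,/,%2F,g')
echo "$encoded"   # %3Clogstash-%7Bnow%2Fd%7D%3E
```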
