A beginner's guide to storing, managing, and analyzing data with the updated features of Elastic 7.0
Key Features
Book Description
The Elastic Stack is a powerful combination of tools for techniques such as distributed search, analytics, logging, and visualization of data. Elastic Stack 7.0 encompasses new features and capabilities that will enable you to find unique insights into analytics using these techniques. This book will give you a fundamental understanding of what the stack is all about, and help you use it efficiently to build powerful real-time data processing applications.
The first few sections of the book will help you understand how to set up the stack by installing tools, and exploring their basic configurations. You'll then get up to speed with using Elasticsearch for distributed searching and analytics, Logstash for logging, and Kibana for data visualization. As you work through the book, you will discover the technique of creating custom plugins using Kibana and Beats. This is followed by coverage of the Elastic X-Pack, a useful extension for effective security and monitoring. You'll also find helpful tips on how to use Elastic Cloud and deploy Elastic Stack in production environments.
By the end of this book, you'll be well versed with the fundamental Elastic Stack functionalities and the role of each component in the stack to solve different data processing problems.
What you will learn
Who this book is for
This book is for entry-level data professionals, software engineers, e-commerce developers, and full-stack developers who want to learn about Elastic Stack and how the real-time processing and search engine works for business analytics and enterprise search applications. Previous experience with Elastic Stack is not required, however knowledge of data warehousing and database concepts will be helpful.
Page count: 445
Year of publication: 2019
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Yogesh Deokar
Content Development Editor: Unnati Guha
Technical Editor: Manikandan Kurup
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Jisha Chirayil
Production Coordinator: Aparna Bhagat
First published: December 2017
Second edition: May 2019
Production reference: 2310519
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78995-439-5
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Pranav Shukla is the founder and CEO of Valens DataLabs, a technologist, husband, and father of two. He is a big data architect and software craftsman who uses JVM-based languages. Pranav has over 15 years' experience in architecting enterprise applications for Fortune 500 companies and start-ups. His core expertise lies in building JVM-based, scalable, reactive, and data-driven applications using Java/Scala, the Hadoop ecosystem, Apache Spark, and NoSQL databases. Pranav founded Valens DataLabs with the vision of helping companies leverage data for competitive advantage. In his spare time, he enjoys reading books, playing musical instruments, and playing tennis.
Sharath Kumar M N did his master's in computer science at the University of Texas, Dallas, USA. He is currently working as a senior principal architect at Broadcom. Prior to this, he was working as an Elasticsearch solutions architect at Oracle. He has given several tech talks at conferences such as Oracle Code events. Sharath is a certified trainer – an Elastic Certified Instructor – one of the few technology experts in the world who has been certified by Elastic Inc. to deliver their official training. He is also a data science and machine learning enthusiast. In his free time, he likes playing with his lovely niece, Monisha; nephew, Chirayu; and his pet, Milo.
Tan-Vinh Nguyen is a Switzerland-based Java, Elasticsearch, and Kafka enthusiast. He has more than 15 years' experience in enterprise software development. As an Elastic Certified Engineer, he currently works for mimacom ag on international Elasticsearch projects. He runs a blog named Cinhtau, where he evaluates technology, concepts, and best practices. His blog posts enable and empower application developers to accomplish their missions.
Marcelo Ochoa works for Dirección TICs of Facultad de Ciencias Exactas at Universidad Nacional del Centro de la Prov. de Buenos Aires and is the CTO at Scotas.com, a company that specializes in near real-time search solutions using Apache Solr and Oracle. He divides his time between university jobs and external projects related to Oracle, open source and big data technologies. Since 2006, he has been part of an Oracle ACE program and was recently incorporated into a Docker Mentor program.
He has coauthored Oracle Database Programming using Java and Web Services and Professional XML Databases, and worked as a technical reviewer on several books, such as Mastering Apache Solr 7, Learning Elastic Search 6 Video, Mastering Elastic Stack, and more.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Learning Elastic Stack 7.0 Second Edition
About Packt
Why subscribe?
Packt.com
Contributors
About the authors
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Section 1: Introduction to Elastic Stack and Elasticsearch
Introducing Elastic Stack
What is Elasticsearch, and why use it?
Schemaless and document-oriented
Searching capability
Analytics
Rich client library support and the REST API
Easy to operate and easy to scale 
Near real-time capable
Lightning-fast
Fault-tolerant
Exploring the components of the Elastic Stack
Elasticsearch
Logstash
Beats
Kibana
X-Pack
Security
Monitoring
Reporting
Alerting
Graph
Machine learning
Elastic Cloud
Use cases of Elastic Stack
Log and security analytics
Product search
Metrics analytics
Web search and website search
Downloading and installing
Installing Elasticsearch
Installing Kibana
Summary
Getting Started with Elasticsearch
Using the Kibana Console UI
Core concepts of Elasticsearch
Indexes
Types
Documents
Nodes
Clusters
Shards and replicas
Mappings and datatypes
Datatypes
Core datatypes
Complex datatypes
Other datatypes
Mappings
Creating an index with the name catalog
Defining the mappings for the type of product
Inverted indexes
CRUD operations
Index API
Indexing a document by providing an ID
Indexing a document without providing an ID
Get API
Update API
Delete API
Creating indexes and taking control of mapping
Creating an index
Creating type mapping in an existing index
Updating a mapping
REST API overview
Common API conventions
Formatting the JSON response
Dealing with multiple indexes
Searching all documents in one index
Searching all documents in multiple indexes
Searching all the documents of a particular type in all indexes
Summary
Section 2: Analytics and Visualizing Data
Searching - What is Relevant
The basics of text analysis
Understanding Elasticsearch analyzers
Character filters
Tokenizer
Standard tokenizer
Token filters
Using built-in analyzers
Standard analyzer
Implementing autocomplete with a custom analyzer
Searching from structured data
Range query
Range query on numeric types
Range query with score boosting
Range query on dates
Exists query
Term query
Searching from the full text
Match query
Operator
Minimum should match
Fuzziness
Match phrase query
Multi match query
Querying multiple fields with defaults
Boosting one or more fields
With types of multi match queries
Writing compound queries
Constant score query
Bool query
Combining OR conditions
Combining AND and OR conditions
Adding NOT conditions
Modeling relationships
has_child query
has_parent query
parent_id query
Summary
Analytics with Elasticsearch
The basics of aggregations
Bucket aggregations
Metric aggregations
Matrix aggregations
Pipeline aggregations
Preparing data for analysis
Understanding the structure of the data
Loading the data using Logstash
Metric aggregations
Sum, average, min, and max aggregations
Sum aggregation
Average aggregation
Min aggregation
Max aggregation
Stats and extended stats aggregations
Stats aggregation
Extended stats aggregation
Cardinality aggregation
Bucket aggregations
Bucketing on string data
Terms aggregation
Bucketing on numerical data
Histogram aggregation
Range aggregation
Aggregations on filtered data
Nesting aggregations
Bucketing on custom conditions
Filter aggregation
Filters aggregation
Bucketing on date/time data
Date Histogram aggregation
Creating buckets across time periods
Using a different time zone
Computing other metrics within sliced time intervals
Focusing on a specific day and changing intervals
Bucketing on geospatial data
Geodistance aggregation
GeoHash grid aggregation
Pipeline aggregations
Calculating the cumulative sum of usage over time
Summary
Analyzing Log Data
Log analysis challenges
Using Logstash
Installation and configuration
Prerequisites
Downloading and installing Logstash
Installing on Windows
Installing on Linux
Running Logstash
The Logstash architecture
Overview of Logstash plugins
Installing or updating plugins
Input plugins
Output plugins
Filter plugins
Codec plugins
Exploring plugins
Exploring input plugins
File
Beats
JDBC
IMAP
Output plugins
Elasticsearch
CSV
Kafka
PagerDuty
Codec plugins
JSON
Rubydebug 
Multiline
Filter plugins
Ingest node
Defining a pipeline 
Ingest APIs
Put pipeline API
Get pipeline API
Delete pipeline API
Simulate pipeline API
Summary
Building Data Pipelines with Logstash
Parsing and enriching logs using Logstash
Filter plugins
CSV filter 
Mutate filter
Grok filter
Date filter
Geoip filter
Useragent filter
Introducing Beats
Beats by Elastic.co
Filebeat
Metricbeat
Packetbeat
Heartbeat
Winlogbeat
Auditbeat
Journalbeat
Functionbeat
Community Beats
Logstash versus Beats
Filebeat
Downloading and installing Filebeat
Installing on Windows
Installing on Linux
Architecture
Configuring Filebeat
Filebeat inputs
Filebeat general/global options
Output configuration 
Logging
Filebeat modules
Summary
Visualizing Data with Kibana
Downloading and installing Kibana
Installing on Windows
Installing on Linux
Configuring Kibana
Preparing data
Kibana UI
User interaction
Configuring the index pattern
Discover
Elasticsearch query string/Lucene query
Elasticsearch DSL query
KQL
Visualize
Kibana aggregations
Bucket aggregations
Metric
Creating a visualization
Visualization types
Line, area, and bar charts
Data tables
Markdown widgets
Metrics
Goals
Gauges
Pie charts
Co-ordinate maps
Region maps
Tag clouds
Visualizations in action
Response codes over time
Top 10 requested URLs
Bandwidth usage of the top five countries over time
Web traffic originating from different countries
Most used user agent
Dashboards
Creating a dashboard
Saving the dashboard 
Cloning the dashboard
Sharing the dashboard 
Timelion
Timelion 
Timelion expressions
Using plugins
Installing plugins
Removing plugins
Summary
Section 3: Elastic Stack Extensions
Elastic X-Pack
Installing Elasticsearch and Kibana with X-Pack
Installation
Activating X-Pack trial account
Generating passwords for default users
Configuring X-Pack
Securing Elasticsearch and Kibana
User authentication
User authorization
Security in action
Creating a new user
Deleting a user
Changing the password
Creating a new role
Deleting or editing a role
Document-level security or field-level security
X-Pack security APIs
User Management APIs
Role Management APIs
Monitoring Elasticsearch
Monitoring UI
Elasticsearch metrics
Overview tab
Nodes tab
The Indices tab
Alerting
Anatomy of a watch
Alerting in action
Creating a new alert
Threshold Alert
Advanced Watch
Deleting/deactivating/editing a watch
Summary
Section 4: Production and Server Infrastructure
Running Elastic Stack in Production
Hosting Elastic Stack on a managed cloud
Getting up and running on Elastic Cloud
Using Kibana
Overriding configuration 
Recovering from a snapshot
Hosting Elastic Stack on your own
Selecting hardware
Selecting an operating system
Configuring Elasticsearch nodes
JVM heap size
Disable swapping
File descriptors
Thread pools and garbage collector
Managing and monitoring Elasticsearch
Running in Docker containers
Special considerations while deploying to a cloud
Choosing instance type
Changing default ports; do not expose ports!
Proxy requests
Binding HTTP to local addresses
Installing EC2 discovery plugin
Installing the S3 repository plugin
Setting up periodic snapshots
Backing up and restoring
Setting up a repository for snapshots
Shared filesystem
Cloud or distributed filesystems
Taking snapshots
Restoring a specific snapshot
Setting up index aliases
Understanding index aliases
How index aliases can help
Setting up index templates
Defining an index template
Creating indexes on the fly
Modeling time series data
Scaling the index with unpredictable volume over time
Unit of parallelism in Elasticsearch
The effect of the number of shards on the relevance score
The effect of the number of shards on the accuracy of aggregations
Changing the mapping over time
New fields get added
Existing fields get removed
Automatically deleting older documents
How index-per-timeframe solves these issues
Scaling with index-per-timeframe
Changing the mapping over time
Automatically deleting older documents
Summary
Building a Sensor Data Analytics Application
Introduction to the application
Understanding the sensor-generated data
Understanding the sensor metadata
Understanding the final stored data
Modeling data in Elasticsearch
Defining an index template
Understanding the mapping
Setting up the metadata database
Building the Logstash data pipeline
Accepting JSON requests over the web
Enriching the JSON with the metadata we have in the MySQL database
The jdbc_streaming plugin 
The mutate plugin
Moving the looked-up fields that are under lookupResult directly in JSON
Combining the latitude and longitude fields under lookupResult as a location field
Removing the unnecessary fields
Store the resulting documents in Elasticsearch
Sending data to Logstash over HTTP
Visualizing the data in Kibana
Setting up an index pattern in Kibana
Building visualizations
How does the average temperature change over time?
How does the average humidity change over time?
How do temperature and humidity change at each location over time?
Can I visualize temperature and humidity over a map?
How are the sensors distributed across departments?
Creating a dashboard
Summary
Monitoring Server Infrastructure
Metricbeat
Downloading and installing Metricbeat
Installing on Windows
Installing on Linux
Architecture
Event structure
Configuring Metricbeat
Module configuration
Enabling module configs in the modules.d directory
Enabling module configs in the metricbeat.yml file
General settings
Output configuration 
Logging
Capturing system metrics
Running Metricbeat with the system module
Specifying aliases
Visualizing system metrics using Kibana
Deployment architecture
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
The Elastic Stack is a powerful combination of tools for techniques including distributed searching, analytics, logging, and the visualization of data. Elastic Stack 7.0 encompasses new features and capabilities that will enable you to find unique insights into analytics using these techniques. This book will give you a fundamental understanding of what the stack is all about, and help you use it efficiently to build powerful real-time data processing applications.
The first few sections of the book will help you to understand how to set up the stack by installing tools and exploring their basic configurations. You'll then get up to speed with using Elasticsearch for distributed searching and analytics, Logstash for logging, and Kibana for data visualization. As you work through the book, you will discover a technique for creating custom plugins using Kibana and Beats. This is followed by coverage of Elastic X-Pack, a useful extension for effective security and monitoring. You'll also find helpful tips on how to use Elastic Cloud and deploy Elastic Stack in production environments.
By the end of this book, you'll be well versed in the fundamental Elastic Stack functionalities and the role of each component in the stack in solving different data processing problems.
This book is for entry-level data professionals, software engineers, e-commerce developers, and full-stack developers who want to learn about Elastic Stack and how the real-time processing and search engine works for business analytics and enterprise search applications.
Chapter 1, Introducing Elastic Stack, motivates you by introducing the core components of Elastic Stack, and the importance of the distributed, scalable search and analytics that Elastic Stack offers by means of use cases involving Elasticsearch. The chapter provides a brief introduction to all the core components, where they fit into the overall stack, and the purpose of each component. It concludes with instructions for downloading and installing Elasticsearch and Kibana to get started.
Chapter 2, Getting Started with Elasticsearch, introduces the core concepts involved in Elasticsearch, which form the backbone of the Elastic Stack. Concepts such as indexes, types, nodes, and clusters are introduced. You will also be introduced to the REST API to perform essential operations, datatypes, and mappings.
Chapter 3, Searching – What is Relevant, focuses on the search use case of Elasticsearch. It introduces the concepts of text analysis, tokenizers, analyzers, and the need for analysis and relevance-based searches. The chapter highlights an example use case to cover the relevance-based search topics.
Chapter 4, Analytics with Elasticsearch, covers various types of aggregations by means of examples in order for you to acquire an in-depth understanding. This chapter covers very simple to complex aggregations to get powerful insights from terabytes of data. The chapter also covers the motivation behind using different types of aggregations.
Chapter 5, Analyzing Log Data, establishes the foundation for the motivation behind Logstash, its architecture, and installing and configuring Logstash to set up basic data pipelines. Elastic 5 introduced ingest nodes, which can be used instead of a dedicated Logstash setup. This chapter also covers building pipelines using Elastic ingest nodes.
Chapter 6, Building Data Pipelines with Logstash, builds on the fundamental knowledge of Logstash by means of transformations and aggregation-related filters. It covers how the rich set of filters brings Logstash closer to the other real-time and near real-time stream processing frameworks with zero coding. It introduces the Beats platform, along with FileBeat components, to transport log files from edge machines.
Chapter 7, Visualizing Data with Kibana, covers how to effectively use Kibana to build beautiful dashboards for effective story telling regarding your data. It uses a sample dataset and provides step-by-step guidance on creating visualizations with just a few clicks.
Chapter 8, Elastic X-Pack, covers how to add the extensions required for specific use cases. Elastic X-Pack is a set of extensions developed and maintained by Elastic Stack developers. These extensions are maintained with consistent versioning.
Chapter 9, Running Elastic Stack in Production, covers recommendations on how to deploy the Elastic Stack to production. Elasticsearch can be deployed to solve a variety of use cases, such as product search, log analytics, and sensor data analytics. This chapter provides recommendations for taking your application to production, along with guidelines on the typical configurations that need to be looked at for different use cases. It also covers deployment with cloud-based hosted providers such as Elastic Cloud.
Chapter 10, Building a Sensor Data Analytics Application, puts together a complete application for sensor data analytics using the concepts learned so far. It relies almost entirely on Elastic Stack components, with close to zero programming. It shows how to model your data in Elasticsearch, build the data pipeline to ingest data, and then visualize it using Kibana. It also demonstrates how to effectively use X-Pack components to secure, monitor, and get alerts when certain conditions are met in this real-world example.
Chapter 11, Monitoring Server Infrastructure, shows how you can use Elastic Stack to set up a real-time monitoring solution for your servers and applications that is built entirely using Elastic Stack. This can help prevent and minimize downtime while also improving the end user experience.
Previous experience with Elastic Stack is not required. However, some knowledge of data warehousing and database concepts will be beneficial.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Elastic-Stack-7.0-Second-Edition. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781789954395_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."
A block of code is set as follows:
POST _analyze
{
  "tokenizer": "standard",
  "text": "Tokenizer breaks characters into tokens!"
}
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
GET /amazon_products/_search
{
  "query": {
    "term": {
      "manufacturer.raw": "victory multimedia"
    }
  }
}
Any command-line input or output is written as follows:
$> tar -xzf filebeat-7.0.0-linux-x86_64.tar.gz
$> cd filebeat
Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Click on the Management icon on the left-hand menu and then click on License Management."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
This section covers the basics of Elasticsearch and the Elastic Stack. It highlights the importance of the distributed and scalable search and analytics that the Elastic Stack offers. It introduces concepts such as indexes, types, nodes, and clusters, and provides insights into the REST API, which can be used to perform essential operations, as well as into datatypes and mappings.
This section includes the following chapters:
Chapter 1, Introducing Elastic Stack
Chapter 2, Getting Started with Elasticsearch
The emergence of the web, mobiles, social networks, blogs, and photo sharing has created a massive amount of data in recent years. These new data sources create information that cannot be handled using traditional data storage technology, typically relational databases. As an application developer or business intelligence developer, your job is to fulfill the search and analytics needs of the application.
A number of data stores, capable of big data scale, have emerged in the last few years. These include Hadoop ecosystem projects, several NoSQL databases, and search and analytics engines such as Elasticsearch.
The Elastic Stack is a rich ecosystem of components serving as a full search and analytics stack. The main components of the Elastic Stack are Kibana, Logstash, Beats, X-Pack, and Elasticsearch.
Elasticsearch is at the heart of the Elastic Stack, providing storage, search, and analytical capabilities. Kibana, also referred to as a window into the Elastic Stack, is a user interface for the Elastic Stack with great visualization capabilities. Logstash and Beats help get the data into the Elastic Stack. X-Pack provides powerful features including monitoring, alerting, security, graph, and machine learning to make your system production-ready. Since Elasticsearch is at the heart of the Elastic Stack, we will cover the stack inside-out, starting from the heart and moving on to the surrounding components.
In this chapter, we will cover the following topics:
What is Elasticsearch, and why use it?
A brief history of Elasticsearch and Apache Lucene
Elastic Stack components
Use cases of Elastic Stack
We will look at what Elasticsearch is and why you should consider it as your data store. Once you know the key strengths of Elasticsearch, we will look at the history of Elasticsearch and its underlying technology, Apache Lucene. We will then look at some use cases of the Elastic Stack, and provide an overview of the Elastic Stack's components.
Since you are reading this book, you probably already know what Elasticsearch is. For the sake of completeness, let's define Elasticsearch:
Elasticsearch is at the core of the Elastic Stack, playing the central role of a search and analytics engine. It is built on Apache Lucene, a fundamentally different technology from traditional relational databases and other NoSQL solutions, and this is what sets it apart. Let's look at the key benefits of using Elasticsearch as your data store:
Schemaless, document-oriented
Searching
Analytics
Rich client library support and the REST API
Easy to operate and easy to scale
Near real-time
Lightning-fast
Fault-tolerant
Let's look at each benefit one by one.
Elasticsearch does not impose a strict structure on your data; you can store any JSON documents. JSON documents are first-class citizens in Elasticsearch as opposed to rows and columns in a relational database. A document is roughly equivalent to a record in a relational database table. Traditional relational databases require a schema to be defined beforehand to specify a fixed set of columns and their data types and sizes. Often the nature of data is very dynamic, requiring support for new or dynamic columns. JSON documents naturally support this type of data. For example, take a look at the following document:
{ "name": "John Smith", "address": "121 John Street, NY, 10010", "age": 40 }
This document may represent a customer's record. Here the record has the name, address, and age fields of the customer. Another record may look like the following:
{ "name": "John Doe", "age": 38, "email": "[email protected]" }
Note that the second customer doesn't have the address field but, instead, has an email address. In fact, other customer documents may have completely different sets of fields. This provides a tremendous amount of flexibility in terms of what can be stored.
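To make this concrete, here is a minimal sketch of how the two customer documents above could be indexed into the same index using the Kibana Console (the index name customers is an assumption made purely for illustration; Elasticsearch creates the index and infers the field mappings automatically on the first request):
PUT customers/_doc/1
{
  "name": "John Smith",
  "address": "121 John Street, NY, 10010",
  "age": 40
}
PUT customers/_doc/2
{
  "name": "John Doe",
  "age": 38,
  "email": "[email protected]"
}
Both documents are accepted without any upfront schema definition, even though they contain different sets of fields.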
The core strength of Elasticsearch lies in its text-processing capabilities. Elasticsearch is great at searching, especially full-text searches. Let's understand what a full-text search is:
Full-text search means searching through all the terms of all the documents available in the database. This requires the entire contents of all documents to be parsed and stored beforehand. When you hear full-text search, think of Google Search. You can enter any search term and Google looks through all of the web pages on the internet to find the best-matching web pages. This is quite different from simple SQL queries run against columns of type string in relational databases. Normal SQL queries with a WHERE clause and an equals (=) or LIKE clause try to do an exact or wildcard match with the underlying data. SQL queries can, at best, just match the search term to a substring within the text column.
When you want to perform a search similar to a Google search on your own data, Elasticsearch is your best bet. You can index emails, text documents, PDF files, web pages, or practically any unstructured text documents and search across all your documents with search terms.
At a high level, Elasticsearch breaks up text data into terms and makes every term searchable by building Lucene indexes. You can build your own fast and flexible Google-like search for your application.
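As a rough sketch of what such a search looks like, the following Kibana Console request runs a full-text match query (the index name customers and the field name address are assumptions carried over from the earlier illustration):
GET customers/_search
{
  "query": {
    "match": {
      "address": "john street"
    }
  }
}
Unlike a SQL LIKE clause, the query is matched against the analyzed terms of the field, so documents are found and ranked by relevance even when the text does not match exactly.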
In addition to supporting text data, Elasticsearch also supports other data types such as numbers, dates, geolocations, IP addresses, and many more. We will take an in-depth look at searching in Chapter 3, Searching - What is Relevant.
Apart from searching, the second most important functional strength of Elasticsearch is analytics. Yes, what was originally known as just a full-text search engine is now used as an analytics engine in a variety of use cases. Many organizations are running analytics solutions powered by Elasticsearch in production.
Conducting a search is like zooming in and finding a needle in a haystack, that is, locating precisely what is needed within huge amounts of data. Analytics is exactly the opposite of a search; it is about zooming out and taking a look at the bigger picture. For example, you may want to know how many visitors on your website are from the United States as opposed to every other country, or you may want to know how many of your website's visitors use macOS, Windows, or Linux.
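As an illustrative sketch, a zoomed-out question such as how many visitors come from each country maps naturally to a terms aggregation (the index name weblogs and the field name country.keyword are hypothetical):
GET weblogs/_search
{
  "size": 0,
  "aggs": {
    "visitors_per_country": {
      "terms": {
        "field": "country.keyword"
      }
    }
  }
}
The response returns one bucket per country along with a document count, which is exactly the bigger-picture view described above.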
Elasticsearch supports a wide variety of aggregations for analytics. Elasticsearch aggregations are quite powerful and can be applied to various data types. We will take a look at the analytics capabilities of Elasticsearch in Chapter 4, Analytics with Elasticsearch.
Elasticsearch has very rich client library support to make it accessible to many programming languages. There are client libraries available for Java, C#, Python, JavaScript, PHP, Perl, Ruby, and many more. Apart from the official client libraries, there are community-driven libraries for more than 20 programming languages.
Additionally, Elasticsearch has a very rich REST (Representational State Transfer) API, which works on the HTTP protocol. The REST API is very well documented and quite comprehensive, making all operations available over HTTP.
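Because every operation is exposed over HTTP, any HTTP client can talk to Elasticsearch. The following is a minimal sketch using curl against a local node (the index name catalog and the document body are assumptions; port 9200 is the default):
curl -X POST "http://localhost:9200/catalog/_doc" -H "Content-Type: application/json" -d '{ "title": "Learning Elastic Stack 7.0" }'
The same operation can be performed through any of the official or community client libraries mentioned above.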
All this means that Elasticsearch is very easy to integrate into any application to fulfill your search and analytics needs.
Elasticsearch can run on a single node and easily scale out to hundreds of nodes. It is very easy to start a single node instance of Elasticsearch; it works out of the box without any configuration changes and scales to hundreds of nodes.
Unlike most traditional databases that only allow vertical scaling, Elasticsearch can be scaled horizontally. It can run on tens or hundreds of commodity nodes instead of one extremely expensive server. Adding a node to an existing Elasticsearch cluster is as easy as starting up a new node in the same network, with virtually no extra configuration. The client application doesn't need to change, whether it is running against a single-node or a hundred-node cluster.
Typically, data is available for queries within a second after being indexed (saved). Not all big data storage systems are real-time capable. Elasticsearch allows you to index thousands to hundreds of thousands of documents per second and makes them available for searching almost immediately.
Elasticsearch uses Apache Lucene as its underlying technology. By default, Elasticsearch indexes all the fields of your documents. This is invaluable, as you can query or search by any field in your records. You will never be in a situation in which you think, If only I had chosen to create an index on this field. Elasticsearch contributors have leveraged Apache Lucene to its best advantage, and there are other optimizations that make it lightning-fast.
Elasticsearch clusters can keep running even when there are hardware failures such as node failure and network failure. In the case of a node failure, it replicates all the data that was on the failed node to another node in the cluster. In the case of a network failure, Elasticsearch seamlessly elects a new master and promotes replica shards to keep the cluster running. Whether it is a case of node or network failure, you can rest assured that your data is safe.
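Replication is what makes this possible. As a hedged sketch, the following request creates an index with one replica per primary shard (the index name orders is illustrative; one shard and one replica also happen to be the defaults in Elasticsearch 7.0):
PUT orders
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
With at least two nodes in the cluster, each primary shard has a copy on a different node, so losing a single node does not lose any data.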
Now that you know when and why Elasticsearch could be a great choice, let's take a high-level view of the ecosystem – the Elastic Stack.
The Elastic Stack components are shown in the following diagram. It is not necessary to include all of them in your solution. Some components are general-purpose and can be used outside the Elastic Stack without using any other components.
Let's look at the purpose of each component and how they fit into the stack:
Elasticsearch is at the heart of the Elastic Stack. It stores all your data and provides search and analytics capabilities in a scalable way. We have already looked at the strengths of Elasticsearch and why you would want to use it. Elasticsearch can be used without using any other components to power your application in terms of search and analytics. We will cover Elasticsearch in great detail in Chapter 2, Getting Started with Elasticsearch, Chapter 3, Searching - What is Relevant, and Chapter 4, Analytics with Elasticsearch.
Logstash helps centralize event data such as logs, metrics, or any other data in any format. It can perform a number of transformations before sending it to a stash of your choice. It is a key component of the Elastic Stack, used to centralize the collection and transformation processes in your data pipeline.
Logstash is a server-side component. Its role is to centralize the collection of data from a wide number of input sources in a scalable way, and transform and send the data to an output of your choice. Typically, the output is sent to Elasticsearch, but Logstash is capable of sending it to a wide variety of outputs. Logstash has a plugin-based, extensible architecture. It supports three types of plugin: input plugins, filter plugins, and output plugins. Logstash has a collection of 200+ supported plugins and the count is ever increasing.
Logstash is an excellent general-purpose data flow engine that helps in building real-time, scalable data pipelines.
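To give a feel for what this looks like in practice, here is a minimal, hedged sketch of a Logstash pipeline configuration that reads events shipped by Beats, parses them, and writes them to Elasticsearch (the port, grok pattern, and index name are assumptions made for illustration only):
input {
  beats {
    port => 5044                  # listen for events shipped by Beats agents
  }
}
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }  # parse Apache-style access log lines
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "weblogs"            # hypothetical target index
  }
}
Each of the three sections corresponds to one of the plugin types mentioned above: input, filter, and output.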
Beats is a platform of open source lightweight data shippers. Its role is complementary to Logstash. Logstash is a server-side component, whereas Beats has a role on the client side. Beats consists of a core library, libbeat, which provides an API for shipping data from the source, configuring the input options, and implementing logging. Beats is installed on machines that are not part of server-side components such as Elasticsearch, Logstash, or Kibana. These agents reside on non-cluster nodes, which are sometimes called edge nodes.
Many Beat components have already been built by the Elastic team and the open source community. The Elastic team has built Beats including Packetbeat, Filebeat, Metricbeat, Winlogbeat, Auditbeat, and Heartbeat.
Filebeat is a single-purpose Beat built to ship log files from your servers to a centralized Logstash server or Elasticsearch server. Metricbeat is a server monitoring agent that periodically collects metrics from the operating systems and services running on your servers. There are already around 40 community Beats built for specific purposes, such as monitoring Elasticsearch, Cassandra, the Apache web server, JVM performance, and so on. You can build your own beat using libbeat, if you don't find one that fits your needs.
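As a hedged sketch of what configuring a Beat typically involves, the following filebeat.yml fragment ships log files straight to Elasticsearch (the log path and host are assumptions; Filebeat can just as easily send its events to Logstash instead):
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log      # hypothetical application log location
output.elasticsearch:
  hosts: ["localhost:9200"]
Pointing an output.logstash section at a Logstash server instead of output.elasticsearch is the usual choice when events need parsing or enrichment before indexing.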
We will explore Logstash and Beats in Chapter 5, Analyzing Log Data, and Chapter 6, Building Data Pipelines with Logstash.
Kibana is the visualization tool for the Elastic Stack, and can help you gain powerful insights about your data in Elasticsearch. It is often called a window into the Elastic Stack. It offers many visualizations including histograms, maps, line charts, time series, and more. You can build visualizations with just a few clicks and interactively explore data. It lets you build beautiful dashboards by combining different visualizations, sharing with others, and exporting high-quality reports.
Kibana also has management and development tools. You can manage settings and configure X-Pack security features for the Elastic Stack. Kibana also has development tools that enable developers to build and test REST API requests.
We will explore Kibana in Chapter 7, Visualizing Data with Kibana.
X-Pack adds essential features to make the Elastic Stack production-ready. It adds security, monitoring, alerting, reporting, graph, and machine learning capabilities to the Elastic Stack.
The security plugin within X-Pack adds authentication and authorization capabilities to Elasticsearch and Kibana so that only authorized people can access data, and they can only see what they are allowed to. The security plugin works across components seamlessly, securing access to Elasticsearch and Kibana.
The security extension also lets you configure field-level and document-level security with the licensed version.
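As a rough illustration only, a role restricted to certain fields and documents can be sketched with the Elasticsearch security API as follows (the role name, index pattern, fields, and query are all hypothetical, and the exact capabilities available depend on your license level):
POST /_security/role/restricted_reader
{
  "indices": [
    {
      "names": ["customers*"],
      "privileges": ["read"],
      "field_security": { "grant": ["name", "age"] },
      "query": { "term": { "region": "US" } }
    }
  ]
}
A user assigned this role can read only the granted fields, and only from documents matching the query.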
You can monitor your Elastic Stack components so that there is no downtime. The monitoring component in X-Pack lets you monitor your Elasticsearch clusters and Kibana.
You can monitor clusters, nodes, and index-level metrics. The monitoring plugin maintains a history of performance so you can compare current metrics with past metrics. It also has a capacity planning feature.
The reporting plugin within X-Pack allows for generating printable, high-quality reports from Kibana visualizations. The reports can be scheduled to run periodically or on a per-event basis.
X-Pack has sophisticated alerting capabilities that can alert you in multiple possible ways when certain conditions are met. It gives tremendous flexibility in terms of when, how, and who to alert.
You may be interested in detecting security breaches, such as when someone has five login failures within an hour from different locations or finding out when your product is trending on social media. You can use the full power of Elasticsearch queries to check when complex conditions are met.
Alerting provides a wide variety of options in terms of how alerts are sent. It can send alerts via email, Slack, Hipchat, and PagerDuty.
Graph lets you explore relationships in your data. Data in Elasticsearch is generally perceived as a flat list of entities without connections to other entities. Graph can surface relationships among entities that share common properties, such as people, places, products, or preferences, and exploring these relationships opens up the possibility of new use cases.
Graph consists of the Graph API and a UI within Kibana that lets you explore these relationships. Under the hood, it leverages distributed querying, indexing at scale, and the relevance capabilities of Elasticsearch.
X-Pack has a machine learning module, which is for learning from patterns within data. Machine learning is a vast field that includes supervised learning, unsupervised learning, reinforcement learning, and other specialized areas such as deep learning. The machine learning module within X-Pack is limited to anomaly detection in time series data, which falls under the unsupervised learning branch of machine learning.
We will look at some X-Pack components in Chapter 8, Elastic X-Pack.
Elastic Cloud is the cloud-based, hosted, and managed setup of the Elastic Stack components. The service is provided by Elastic (https://www.elastic.co/), which is behind the development of Elasticsearch and other Elastic Stack components. All Elastic Stack components are open source except X-Pack (and Elastic Cloud). Elastic, the company, provides services for Elastic Stack components including training, development, support, and cloud hosting.
Apart from Elastic Cloud, other hosted solutions are available for Elasticsearch, including one from Amazon Web Services (AWS). The advantage of Elastic Cloud is that it is developed and maintained by the original creators of Elasticsearch and other Elastic Stack components.
Elastic Stack components have a variety of practical use cases, and new use cases are emerging as more plugins are added to existing components. As mentioned earlier, you may use a subset of the components for your use case. The following list of example use cases is by no means exhaustive, but highlights some of the most common ones:
Log and security analytics
Product search
Metrics analytics
Web searches and website searches
Let's look at each use case.
The Elasticsearch, Logstash, and Kibana trio, also known as ELK, was previously very popular as a stack in its own right. The presence of these three components makes the Elastic Stack an excellent choice for aggregating and analyzing logs in a central place.
Application support teams face a great challenge in administering and managing large numbers of applications deployed across tens or hundreds of servers. The application infrastructure could have the following components:
Web servers
Application servers
Database servers
Message brokers
Typically, enterprise applications have all, or most, of the types of servers described earlier, and there are multiple instances of each server. In the event of an error or production issue, the support team has to log in to individual servers and look at the errors. It is quite inefficient to log in to individual servers and look at the raw log files. The Elastic Stack provides a complete toolset to collect, centralize, analyze, visualize, alert, and report errors as they occur. Each component can be used to solve this problem as follows:
The Beats framework, Filebeat in particular, can run as a lightweight agent to collect and forward logs.
Logstash can centralize events received from Beats, and parse and transform each log entry before sending it to the Elasticsearch cluster.
Elasticsearch indexes logs. It enables both search and analytics on the parsed logs.
Kibana then lets you create visualizations based on errors, warnings, and other information logs. It lets you create dashboards on which you can centrally monitor events as they occur, in real time.
With X-Pack, you can secure the solution, configure alerts, get reports, and analyze relationships in data.
As you can see, you can get a complete log aggregation and monitoring solution using Elastic Stack.
A security analytics solution would be very similar to this; the logs and events being fed into the system would pertain to firewalls, switches, and other key network elements.
A product search involves searching for the most relevant product from thousands or tens of thousands of products and presenting the most relevant products at the top of the list before other, less relevant, products. You can directly relate this problem to e-commerce websites, which sell huge numbers of products sold by many vendors or resellers.
Elasticsearch's full-text and relevance search capabilities can find the best-matching results. Presenting the best matches on the first page has great value as it increases the chances of the customer actually buying the product. Imagine a customer searching for the iPhone 7, and the results on the first page showing different cases, chargers, and accessories for previous iPhone versions. Text analysis capabilities backed by Lucene, and innovations added by Elasticsearch, ensure that the search shows iPhone 7 chargers and cases as the best match.
This problem, however, is not limited to e-commerce websites. Any application that needs to find the most relevant item from millions, or billions, of items, can use Elasticsearch to solve this problem.
Elastic Stack has excellent analytics capabilities, thanks to the rich Aggregations API in Elasticsearch. This makes it a perfect tool for analyzing data with lots of metrics. Metric data consists of numeric values as opposed to unstructured text such as documents and web pages. Some examples are data generated by sensors, Internet of Things (IoT) devices, metrics generated by mobile devices, servers, virtual machines, network routers, switches, and so on. The list is endless.
Metric data is, typically, also time series; that is, values or measures are recorded over a period of time. Metrics that are recorded are usually related to some entity. For example, a temperature reading (which is a metric) is recorded for a particular sensor device with a certain identifier. The type, name of the building, department, floor, and so on are the dimensions associated with the metric. The dimensions may also include the location of the sensor device, that is, the longitude and latitude.
Elasticsearch and Kibana allow for slicing and dicing metric data along different dimensions to provide a deep insight into your data. Elasticsearch is very powerful at handling time series and geospatial data, which means you can plot your metrics on line charts and area charts aggregating millions of metrics. You can also conduct geospatial analysis on a map.
We will build a metrics analytics application using the Elastic Stack in Chapter 10, Building a Sensor Data Analytics Application.
Elasticsearch can serve as a search engine for your website and perform a Google-like search across the entire content of your site. GitHub, Wikipedia, and many other platforms power their searches using Elasticsearch.
Elasticsearch can be leveraged to build content aggregation platforms. What is a content aggregator or a content aggregation platform? Content aggregators scrape/crawl multiple websites, index the web pages, and provide a search functionality on the underlying content. This is a powerful way to build domain-specific, aggregated platforms.
Apache Nutch, an open source, large-scale web crawler, was created by Doug Cutting, the original creator of Apache Lucene. Apache Nutch crawls the web, parses HTML pages, stores them, and also builds indexes to make the content searchable. Apache Nutch supports indexing into Elasticsearch or Apache Solr for its search engine.
As is evident, Elasticsearch and the Elastic Stack have many practical use cases. The Elastic Stack is a platform with a complete set of tools to build end-to-end search and analytics solutions. It is a very approachable platform for developers, architects, business intelligence analysts, and system administrators. It is possible to put together an Elastic Stack solution with almost zero coding and only configuration. At the same time, Elasticsearch is very customizable, that is, developers and programmers can build powerful applications using its rich programming language support and REST API.
Now that we have enough motivation and reasons to learn about Elasticsearch and the Elastic Stack, let's start by downloading and installing the key components. Firstly, we will download and install Elasticsearch and Kibana. We will install the other components as we need them over the course of our journey. We also need Kibana because, apart from visualizations, it has a UI for developer tools and for interacting with Elasticsearch.
Starting from Elastic Stack 5.x, all Elastic Stack components are now released together; they share the same version and are tested for compatibility with each other. This is also true for Elastic Stack 6.x components.
At the time of writing, the current version of Elastic Stack is 7.0.0. We will use this version for all components.
Elasticsearch can be downloaded as a ZIP, TAR, DEB, or RPM package. If you are on Ubuntu, Red Hat, or CentOS Linux, it can be directly installed using apt or yum.
We will use the ZIP format as it is the least intrusive and the easiest for development purposes:
1. Go to https://www.elastic.co/downloads/elasticsearch and download the ZIP distribution. You can also download an older version if you are looking for an exact version.
2. Extract the file and change your directory to the top-level extracted folder. Run bin/elasticsearch or bin/elasticsearch.bat.
3. Run curl http://localhost:9200 or open the URL in your favorite browser.
You should see an output like this:
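The exact values will differ on your machine, but the response is a small JSON document along these lines (the node name and UUID shown here are placeholders):
{
  "name" : "your-node-name",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "xxxxxxxxxxxxxxxxxxxxxx",
  "version" : {
    "number" : "7.0.0",
    "build_flavor" : "default",
    "build_type" : "zip",
    "lucene_version" : "8.0.0"
  },
  "tagline" : "You Know, for Search"
}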
Congratulations! You have just set up a single-node Elasticsearch cluster.
Kibana is also available in a variety of packaging formats, such as ZIP, TAR.GZ, RPM, and DEB, for 32-bit and 64-bit architecture machines:
1. Go to https://www.elastic.co/downloads/kibana and download the ZIP or TAR.GZ distribution for the platform that you are on.
2. Extract the file and change your directory to the top-level extracted folder. Run bin/kibana or bin/kibana.bat.
3. Open http://localhost:5601 in your favorite browser.
Congratulations! You have a working setup of Elasticsearch and Kibana.
In this chapter, we started by understanding the motivations of various search and analytics technologies other than relational databases and NoSQL stores. We looked at the strengths of Elasticsearch, which is at the heart of the Elastic Stack. We then looked at the rest of the components of the Elastic Stack and how they fit into the ecosystem. We also looked at real-world use cases of the Elastic Stack. We successfully downloaded and installed Elasticsearch and Kibana to begin the journey of learning about the Elastic Stack.