Description

The Elastic Stack helps you work with massive volumes of data to power use cases in the search, observability, and security solution areas.
This three-part book starts with an introduction to the Elastic Stack, with high-level commentary on the solutions the stack can be leveraged for. The second section focuses on each core component, giving you a detailed understanding of the component and the role it plays. You'll start by working with Elasticsearch to ingest, search, analyze, and store data for your use cases. Next, you'll look at Logstash, Beats, and Elastic Agent as components that can collect, transform, and load data. Later chapters help you use Kibana as an interface to consume Elastic solutions and interact with data on Elasticsearch. The last section explores the three main use cases offered on top of the Elastic Stack. You'll start with full-text search and look at real-world outcomes powered by search capabilities. Then, you'll learn how the stack can be used to monitor and observe large and complex IT environments. Finally, you'll understand how to detect, prevent, and respond to security threats across your environment. The book ends by highlighting architecture best practices for successful Elastic Stack deployments.
By the end of this book, you’ll be able to implement the Elastic Stack and derive value from it.




Getting Started with Elastic Stack 8.0

Run powerful and scalable data platforms to search, observe, and secure your organization

Asjad Athick

BIRMINGHAM—MUMBAI

Getting Started with Elastic Stack 8.0

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Sunith Shetty

Senior Editor: Nazia Shaikh

Content Development Editor: Sean Lobo

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Project Coordinator: Aparna Ravikumar Nair

Proofreader: Safis Editing

Indexer: Tejal Daruwale Soni

Production Designer: Nilesh Mohite

Marketing Coordinator: Priyanka Mhatre

First published: March 2022

Production reference: 1080222

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80056-949-2

www.packt.com

To my parents, for their unconditional love, inspiration, and all their sacrifices. To Inez and James, for their unwavering support through thick and thin. And to the brilliant people at Elastic, for being generous with their time and knowledge.

– Asjad Athick

Foreword

In 2009, I sat down to write Elasticsearch with the goal of helping people find what they are looking for, and I am still in awe of the breadth of the ecosystem that has been built around it. The Elastic Stack is already at version 8 (we did cheat a bit and jumped from version 2/3 to 5 to align versions across all core components). It has grown from Elasticsearch to ELK (Elasticsearch, Logstash, Kibana) to the Elastic Stack, which now extends beyond the core products to address complete solutions as well.

It has also been at the confluence of major structural, cultural, and technological evolutions: from the breadth of Enterprise Search and the ability to put a search box on a website, application, or workspace, through to the emergence of Observability to help monitor applications regardless of the "how", like Logging and APM. And now, in the early days of the same movement in security, between SIEM, Endpoint, Container, and Cloud security, the convergence of Observability and Security capabilities is becoming the status quo in any modern IT environment. These evolutions are as much social and organizational ones (DevOps, SecOps, and so on) as they are technological ones. The best ones often are.

And at the core of it all are still products like Elasticsearch and Kibana. To some level, their fundamentals have remained the same. In other ways, they have evolved tremendously over the years to address many new and exciting use cases. Data is all around us, growing faster than ever before, making the need to search across this data natural and timeless. The Elastic Stack acts both as an end-to-end solution (like in Observability or Security) and also as an extensible platform for your data. It is flexible enough to handle so many other use cases with strong building blocks: Elasticsearch to store and search data, Kibana to visualize it, and various ways to bring data into the stack. The best and most fun products are ones that are easy to use for the obvious, and flexible enough to tackle the unobvious. We built the Elastic Stack to do both.

In this book, Asjad goes through the various components of the Elastic Stack and the possibilities you can unlock in the form of holistic solutions for your organization. The focus on providing the best possible "Get Started" experience is important to make it even easier to build with the stack while making the right decisions early on in the project. Asjad works with a broad range of customers at Elastic, ranging from tech startups to commercial and large enterprise organizations. This book encompasses a wealth of knowledge and experience acquired over the years from production customer deployments used every day for mission-critical applications.

I recommend reading chapters in Part 2 to familiarize yourself with the various components of the stack and the problems they can solve for you. Then use Part 3 to implement the solutions that deliver the value you need to create for your end users. Add this book to your reading list if you are looking for a great starting point on how to best leverage search and analytics to extract the most value from the data around you.

I hope you enjoy reading this book.

Shay Banon

Founder and Chief Technology Officer at Elastic

Contributors

About the author

Asjad Athick is a security specialist at Elastic with demonstrable experience in architecting enterprise-scale solutions on the cloud. He believes in empowering people with the right tools to help them achieve their goals. At Elastic, he works with a broad range of customers across Australia and New Zealand to help them understand their environment; this allows them to build robust threat detection, prevention, and response capabilities. He previously worked in the telecommunications space to build a security capability to help analysts identify and contextualize unknown cyber threats. With a background in application development and technology consulting, he has worked with various small businesses and start-up organizations across Australia.

About the reviewers

Liza Katz started her software engineering career at the age of 17, and now, with almost 20 years of experience under her belt, she is well versed in building software, delivering products, and effectively sharing knowledge. She enjoys slow travel, fitness, and baking.

Ravindra Ramnani is a principal solutions architect at Elastic. Ravi has over 14 years of solution architecture, design, and consulting experience across multiple industries and technologies. He has deep experience with the Elastic Stack, having built solutions on the stack for many years now. He is a fintech thought leader with rich experience in helping banks in their digitization journey.

Table of Contents

Preface

Section 1: Core Components

Chapter 1: Introduction to the Elastic Stack

An overview of the Elastic Stack

The evolution of the Elastic Stack

A note about licensing

What is Elasticsearch?

When to use Elasticsearch

Architectural characteristics of Elasticsearch

When Elasticsearch may not be the right tool

Introducing Kibana

Collecting and ingesting data

Collecting data from across your environment using Beats

Centralized extraction, transformation, and loading of your data with Logstash

Deciding between using Beats and Logstash

Running the Elastic Stack

Standalone deployments

Elastic Cloud

Solutions built on the stack

Enterprise Search

Security

Observability

Summary

Chapter 2: Installing and Running the Elastic Stack

Technical requirements

Manual installation of the stack

Installing on Linux

Automating the installation

Using Ansible for automation

Using Elastic Cloud Enterprise (ECE) for orchestration

ECE architecture

Proxies

ECE installation size

Installing ECE

Creating your deployment on ECE

Running on Kubernetes

Configuration of your lab environment

Summary

Section 2: Working with the Elastic Stack

Chapter 3: Indexing and Searching for Data

Technical requirements

Understanding the internals of an Elasticsearch index

Inside an index

Elasticsearch nodes

Master-eligible nodes

Voting-only nodes

Data nodes

Ingest nodes

Coordinator nodes

Machine learning nodes

Elasticsearch clusters

Searching for data

Indexing sample logs

Running queries on your data

Summary

Chapter 4: Leveraging Insights and Managing Data on Elasticsearch

Technical requirements

Getting insights from data using aggregations

Managing the life cycle of time series data

The usefulness of data over time

Index Lifecycle Management

Using data streams to manage time series data

Manipulating incoming data with ingest pipelines

Common use cases for ingest pipelines

Responding to changing data with Watcher

Getting started with Watcher

Common use cases for Watcher

Summary

Chapter 5: Running Machine Learning Jobs on Elasticsearch

Technical requirements

The value of running machine learning on Elasticsearch

Preparing data for machine learning jobs

Machine learning concepts

Looking for anomalies in time series data

Looking for anomalous event rates in application logs

Looking for anomalous data transfer volumes

Comparing the behavior of source IP addresses against the population

Running classification on data

Predicting maliciously crafted requests using classification

Inferring against incoming data using machine learning

Summary

Chapter 6: Collecting and Shipping Data with Beats

Technical requirements

Introduction to Beats agents

Collecting logs using Filebeat

Using Metricbeat to monitor system and application metrics

Monitoring operating system audit data using Auditbeat

Monitoring the uptime and availability of services using Heartbeat

Collecting network traffic data using Packetbeat

Summary

Chapter 7: Using Logstash to Extract, Transform, and Load Data

Technical requirements

Introduction to Logstash

Understanding how Logstash works

Configuring your Logstash instance

Running your first pipeline

Looking at pipelines for real-world data-processing scenarios

Loading data from CSV files into Elasticsearch

Parsing Syslog data sources

Enriching events with contextual data

Aggregating event streams into a single event

Processing custom logs collected by Filebeat using Logstash

Summary

Chapter 8: Interacting with Your Data on Kibana

Technical requirements

Getting up and running on Kibana

Solutions in Kibana

Kibana data views

Visualizing data with dashboards

Creating data-driven presentations with Canvas

Working with geospatial datasets using Maps

Responding to changes in data with alerting

The anatomy of an alert

Creating alerting rules

Summary

Chapter 9: Managing Data Onboarding with Elastic Agent

Technical requirements

Tackling the challenges in onboarding new data sources

Unified data collection using a single agent

Managing Elastic Agent at scale with Fleet

Agent policies and integrations

Setting up your environment

Preparing your Elasticsearch deployment for Fleet

Setting up Fleet Server to manage your agents

Collecting data from your web server using Elastic Agent

Using integrations to collect data

Summary

Section 3: Building Solutions with the Elastic Stack

Chapter 10: Building Search Experiences Using the Elastic Stack

Technical requirements

An introduction to full-text searching

Analyzing text for a search

Running searches

Implementing features to improve the search experience

Autocompleting search queries

Suggesting search terms for queries

Using filters to narrow down search results

Paginating large result sets

Ordering search results

Putting it all together to implement recipe search functionality

Summary

Chapter 11: Observing Applications and Infrastructure Using the Elastic Stack

Technical requirements

An introduction to observability

Metrics

Logs

Traces

Synthetic and real user monitoring

Observing your environment

Infrastructure-level visibility

Platform-level visibility

Host- and operating system-level visibility

Monitoring your software workloads

Leveraging out-of-the-box content for observability data

Instrumenting your application performance

Configuring APM to instrument your code

Summary

Chapter 12: Security Threat Detection and Response Using the Elastic Stack

Technical requirements

Building security capability to protect your organization

Confidentiality

Integrity

Availability

Building a SIEM for your SOC

Collecting data from a range of hosts and source systems

Monitoring and detecting security threats in near real time

Allowing analysts to work and investigate collaboratively

Applying threat intelligence and data enrichment to contextualize your alerts

Enabling teams to hunt for adversarial behavior in the environment

Providing alerting, integrations, and response actions

Easily scaling with data volumes over suitable data retention periods

Leveraging endpoint detection and response in your SOC

Malware

Ransomware

Memory threats

Malicious behavior

Summary

Chapter 13: Architecting Workloads on the Elastic Stack

Architecting workloads on Elastic Stack

Designing for high availability

Scaling your workloads with your data

Recovering your workloads from disaster

Securing your workloads on Elastic Stack

Architectures to handle complex requirements

Federating searches across different Elasticsearch deployments

Replicating data between your Elasticsearch deployments

Using tiered data architectures for your deployment

Implementing successful deployments of the Elastic Stack

Summary

Other Books You May Enjoy

Preface

A core aspect of working in any IT environment is the ability to make sense of and use large amounts of data. Every single component in your environment generates data about its state, warnings or errors that were encountered, and vital health and diagnostic information about the component. The ability to collect, analyze, correlate, and visualize this data is key to the operational resiliency as well as security of your organization.

The Elastic Stack has deep roots in the world of search. Elasticsearch is a powerful and ultra-scalable search engine and data store that gives users the ability to ingest and search across massive volumes of data. The flexibility of Elasticsearch allows users to build simple experiences to find what they are looking for in large repositories of data.

The Elastic Stack is a collection of technologies that can collect data from any source system, transform the data to make it useful, and give users the ability to understand and derive insights from the data to enable a range of use cases. Today, the Elastic Stack consists of Beats, Logstash, and Elastic Agent as collection and transformation tools; Elasticsearch as a search and analytics engine; and Kibana as a tool to build solutions around your data. The Elastic Stack has become a de facto standard when it comes to collecting and analyzing data, used widely in open source as well as enterprise and commercial projects.

The main goal of this book is to simplify and optimize your experience as you get started with this technology. The flexibility of the Elastic Stack means there is more than one way to solve a given problem. The nature of the individual core components also means that the guides and reference materials available focus on technical capability and not the solutions or outcomes that can be built.

This book aims to give you a robust introduction and understanding of the core components and how they work together to solve problems in the realms of search, observability, and security. It also focuses on the most up-to-date best practices and approaches to implementing your solution using the stack.

Use this book to give yourself a head start on your Elastic Stack projects. You will understand the capabilities of the stack and build your solutions to evolve and grow alongside your environment, as well as use the insights in your data to best serve your users while delivering value to your organization.

Who this book is for

This book is designed for those with little to no experience with the Elastic Stack. It does, however, expect you to have the curiosity to learn and explore new technologies and be comfortable with basic Linux and system administration and simple scripting. You are also encouraged to supplement the content in this book with further online research where appropriate for best outcomes.

Developers, engineers, and analysts can use this book to learn how use cases and solutions can be implemented to solve their data problems.

Solution architects and tech leads can understand how the components work at a high level and where the capability may fit in their environment.

The book makes it easy to wrap your head around the various core technologies in the Elastic Stack with structured content, story-based explanations, and hands-on exercises to expedite your learning.

What this book covers

Chapter 1, Introduction to the Elastic Stack, gives you an overview of the core components of the stack and the solutions they can enable.

Chapter 2, Installing and Running the Elastic Stack, shows you how the core components can be installed, orchestrated, and run.

Chapter 3, Indexing and Searching for Data, explores Elasticsearch fundamentals for indexing and full-text search.

Chapter 4, Leveraging Insights and Managing Data on Elasticsearch, dives deeper into Elasticsearch, exploring aggregations, the data life cycle, and alerting.

Chapter 5, Running Machine Learning Jobs on Elasticsearch, looks at how supervised and unsupervised machine learning jobs can be configured to run on your data.

Chapter 6, Collecting and Shipping Data with Beats, introduces you to commonly used Beats agents and the different types of data sources they can collect on the stack.

Chapter 7, Using Logstash to Extract, Transform, and Load Data, explores the use of Logstash to build ETL pipelines for your data.

Chapter 8, Interacting with Your Data on Kibana, focuses on the use cases and solutions that can be built on top of your data.

Chapter 9, Managing Data Onboarding with Elastic Agent, looks at the use of a unified agent to continuously onboard and manage the collection of your data.

Chapter 10, Building Search Experiences Using the Elastic Stack, dives deep into the different aspects of building powerful and rich search experiences for your applications.

Chapter 11, Observing Applications and Infrastructure Using the Elastic Stack, focuses on building end-to-end observability solutions using logs, metrics, and APM traces to drive operational resiliency in your environment.

Chapter 12, Security Threat Detection and Response Using the Elastic Stack, looks at implementing security detection and response capability using Elastic's SIEM and EDR solutions to protect your environment from cyber-attacks.

Chapter 13, Architecting Workloads on the Elastic Stack, explores various best practices and reference architectures when it comes to running Elastic Stack workloads in production settings.

To get the most out of this book

You do not need to be an expert in a range of technologies to get the most out of this book. While a number of tools are used in the hands-on examples, all the core concepts are introduced and explained along the way. Some additional online research may be required where appropriate.

Most chapters in this book include relevant setup instructions and technical requirements related to the contents of the chapter. Read these instructions before continuing with the chapter to follow along with any hands-on exercises or examples.

You can also visit the book's website using the following link: https://www.elasticstackbook.com/

All code used in this book can be accessed from the GitHub repository for this book. A link to the repository is available in the next section. This will help avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Getting-Started-with-Elastic-Stack-8.0. If there's an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781800569492_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

html, body, #map {
  height: 100%;
  margin: 0;
  padding: 0
}

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)

Any command-line input or output is written as follows:

$ mkdir css
$ cd css

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Select System info from the Administration panel."

Tips or Important Notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you've read Getting Started with Elastic Stack 8.0, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.

Section 1: Core Components

This section offers a quick introduction to the core components of the Elastic Stack: Elasticsearch, Kibana, Logstash, and Beats.

This section includes the following chapters:

Chapter 1, Introduction to the Elastic Stack

Chapter 2, Installing and Running the Elastic Stack

Chapter 1: Introduction to the Elastic Stack

Welcome to Getting Started with Elastic Stack 8.0. The Elastic Stack has exploded in popularity over the last couple of years, becoming the de facto standard for centralized logging and "big data"-related use cases. The stack is leveraged by organizations, both big and small, across the world to solve a range of data-related problems. Hunting for adversaries in your network, looking for fraudulent transactions, real-time monitoring and alerting in systems, and searching for relevant products in catalogs are some of the real-world applications of the Elastic Stack.

The Elastic Stack is a bundle of multiple core products that integrate with each other. We will look at each product briefly in this chapter, and then dive into each one in later chapters in this book. The Elastic Stack attracts a great deal of interest from developers and architects working for organizations of all sizes. This book aims to serve as the go-to guide for those looking to get started with the best practices when it comes to building real-world search, security, and observability platforms using this technology.

In this chapter, you will learn a little bit about each component that makes up the Elastic Stack, and how they can be leveraged for your use cases. Those of you who are beginners or intermediate learners of this subject will benefit from this content to gain useful context for Chapter 3, Indexing and Searching for Data, to Chapter 13, Architecting Workloads on the Elastic Stack, of this book.

Specifically, we will cover the following topics:

- An overview of the Elastic Stack
- An introduction to Elasticsearch
- Visualizing and interacting with data on Kibana
- Ingesting various data sources using Logstash and Beats
- End-to-end solutions on the Elastic Stack

An overview of the Elastic Stack

The Elastic Stack is made up of four core products:

- Elasticsearch is a full-text search engine and a versatile data store. It can store and allow you to search and compute aggregations on large volumes of data quickly.
- Kibana provides a user interface for Elasticsearch. Users can search for and create visualizations, and then administer Elasticsearch, using this tool. Kibana also offers out-of-the-box solutions (in the form of apps) for use cases such as search, security, and observability.
- Beats can be used to collect and ship data directly from a range of source systems (such as different types of endpoints, network and infrastructure appliances, or cloud-based API sources) into Logstash or Elasticsearch.
- Logstash is an Extract, Transform, and Load (ETL) tool that's used to process and ingest data from various sources (such as log files on servers, Beats agents in your environment, or message queues and streaming platforms) into Elasticsearch.

This diagram shows how the core components of the Elastic Stack work together to ingest, store, and search data:

Figure 1.1 – Components of the Elastic Stack

Each core component solves a single, common data-related problem. This genericity makes the stack flexible and domain-agnostic, allowing it to be adopted in multiple solution areas. Most users start with a simple logging use case where data is collected, parsed, and stored in Elasticsearch to create dashboards and alerts. Others might create more sophisticated capabilities, such as workplace search to make information across a range of data sources accessible to their teams; leveraging SIEM and machine learning to look for anomalous user/machine behavior and hunt for adversaries on their company network; understanding performance bottlenecks in applications; and monitoring infrastructure logs/metrics to respond to issues on critical systems.

The evolution of the Elastic Stack

Multiple independent projects have evolved over the years to create the present-day version of the Elastic Stack. Knowing how these components evolved indicates some of the functional gaps that existed in the big data space and how the Elastic Stack components come together to solve these challenges. Let's take a look:

- An open source transactional Object/Search Engine Mapping (OSEM) framework for Java called Compass was released. Compass leveraged Lucene, an open source search engine library for implementing high-performance full-text search and indexing functionality.
- To address scalability concerns in Compass, it was rewritten as a distributed search engine called Elasticsearch. Elasticsearch implemented RESTful APIs over HTTP using JSON, allowing programming languages other than Java to interact with it. Elasticsearch quickly gained popularity in the open source community.
- As Elasticsearch was adopted by the community, a modular tool called Logstash was being developed to collect, transform, and send logs to a range of target systems. Elasticsearch was one of the target systems supported by Logstash.
- Kibana was written to act as a user interface for using the REST APIs on Elasticsearch to search for and visualize data. Elasticsearch, Logstash, and Kibana were commonly referred to as the ELK Stack.
- Elastic started providing managed Elasticsearch clusters on the cloud. Elastic Cloud Enterprise (ECE) was offered for customers to orchestrate and manage Elasticsearch deployments on-premises or on private cloud infrastructure.
- An open source tool called Packetbeat was created to collect and ship network packet data to Elasticsearch. This later evolved into the Beats project, a collection of lightweight agents designed to collect and ship several types of data into Elasticsearch.
- Machine learning capabilities were added to Elasticsearch and Kibana to support anomaly detection use cases on data residing on Elasticsearch.
- Application Performance Monitoring (APM) capabilities were added to the Elastic Stack. The APM app on Kibana, together with the Logs, Metrics, and Uptime apps, formed the Observability solution.
- Kibana added security analytics functionality as part of the Security Information and Event Management (SIEM) app.
- A collection of proprietary features known as X-Pack was made open source under the Elastic licensing model.
- Endpoint Detection and Response (EDR) capabilities were added to the Elastic Stack. EDR and SIEM capabilities formed the Security solution.
- Out-of-the-box website, application, and content search functionality was offered as part of the Enterprise Search solution.

A note about licensing

The core components of the stack are open source software projects, licensed under a mix of the Apache 2, Elastic License version 2 (ELv2), and Server Side Public License (SSPL) licensing agreements. The LICENSE.txt file in the root of each product's GitHub repository should explain how the code is licensed.

A paid license is not required to learn about and explore the Elastic Stack features covered in this book. A trial license can be activated for full access to all the features for a limited period upon installing the software.
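As a brief sketch of what this looks like in practice (assuming a self-managed cluster and access to the Kibana Dev Tools console), the trial can be started with a single request to the license API:

POST _license/start_trial?acknowledge=true

The trial can also be activated from the license management page in Kibana, if you prefer a UI-driven approach.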

To focus on learning about the features and technical aspects of the product, there will be no notes on licensing implications after this section. Please refer to the Elastic Subscriptions page to understand what kind of license you might need for a production deployment of the technology:

https://www.elastic.co/subscriptions

What is Elasticsearch?

Elasticsearch is often described as a distributed search engine that can be used to search through and aggregate enormous amounts of data. Some describe Elasticsearch as an analytics engine, while others have used the term document store or NoSQL database. The reason for the wide-ranging definitions for Elasticsearch is that it is quite a flexible product. It can be used to store JSON documents, with or without a predefined schema (allowing for unstructured data); it can be used to compute aggregations on document values (to calculate metrics or group data into buckets); and it can be used to implement relevant, free-text search functionality across a large corpus.

Elasticsearch builds on top of Apache Lucene, a popular and fast full-text search library for Java applications. Lucene is not distributed in any way and does not manage resources/handle requests natively. At its core, Elasticsearch abstracts away the complexities and intricacies of working directly with a library such as Lucene by providing user-friendly APIs to help index, search for, and aggregate data. It also introduces concepts such as the following:

- A method to organize and group related data as indices
- Replica shards to improve search performance and add redundancy in the case of hardware failure
- Thread pools for managing node resources while servicing several types of requests and cluster tasks
- Features such as Index Lifecycle Management (ILM) and data streams to manage the size and movement of indices on a cluster

Elasticsearch exposes RESTful APIs using JSON format, allowing for interoperability between different programming languages and technology stacks.
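As a minimal, illustrative sketch (the books index and its fields are made up for this example), the following Kibana Dev Tools console requests index a JSON document and then run a full-text search over it:

# Index a document; the books index is created on the fly
POST books/_doc
{
  "title": "Getting Started with Elastic Stack 8.0",
  "topics": ["search", "observability", "security"]
}

# Run a full-text search against the title field
GET books/_search
{
  "query": {
    "match": { "title": "elastic stack" }
  }
}

Elasticsearch creates the index on the fly, infers field types, and returns matching documents ranked by relevance.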

Elasticsearch today is a feature-rich and complex piece of software. Do not worry if you do not fully understand or appreciate some of the terms used to explain Elasticsearch. We will dive into these, as well as the features on offer, in Chapter 3, Indexing and Searching for Data.

When to use Elasticsearch

Selecting the right tool for the job is an important aspect of any project. This section describes some scenarios where Elasticsearch may be suited for use.

Ingesting, storing, and searching through large volumes of data

Elasticsearch is a horizontally scalable data store where additional nodes can easily be added to a cluster to increase the available resources. Each node can store multiple primary shards of data, and each shard can be replicated (as replica shards) on other nodes. Primary shards handle read and write requests, while replica shards only handle read requests.

The following diagram shows how primary and replica shards are distributed across Elasticsearch nodes to achieve scalable and redundant reading and writing of data:

Figure 1.2 – Shards of data distributed across nodes

The preceding diagram shows the following:

- Three Elasticsearch nodes: node A, node B, and node C
- Two indices: index A and index B
- Each index with two primary and two replica shards

High volume ingest can mean either of the following things:

- A singular index or data source with a large number of events emitted per second (EPS)
- A group of indices or data sources receiving a large number of aggregate events per second

Elasticsearch can also store large volumes of data for search and aggregation. To retain data cost-efficiently over long retention periods, clusters can be architected with hot, warm, and cold tiers of data. During its life cycle, data can be moved across nodes with different disk or Input/Output Operations Per Second (IOPS) specifications to take advantage of slower disk drives and their associated lower costs. We will look at these sorts of architectures in Chapter 3, Indexing and Searching for Data, and Chapter 13, Architecting Workloads on the Elastic Stack.

Some examples of where you need to handle large volumes of data include the following:

- Centralized logging platforms (ingesting various application, security, event, and audit logs from multiple sources)
- When handling metrics/traces/telemetry data from many devices
- When ingesting data from large document repositories or crawling a large number of web pages

Getting relevant search results from textual data

As we discussed previously, Elasticsearch leverages Lucene for indexing and searching operations. As documents are ingested into Elasticsearch, unstructured textual components from the document are analyzed to extract some structure in the form of terms. Terms are maintained in an inverted index data structure. In simple terms, an index (such as the table of contents in a book) is a list of topics (or documents) and the corresponding page numbers for each topic. An index is great for retrieving page content, given you already know what the chapter is called. An inverted index, however, is a collection of words (or terms) in topics and a corresponding list of pages that contain them. Therefore, an inverted index can make it easier to find all the relevant pages, given the search term you are interested in.

The following table visualizes an inverted index for a collection of documents containing recipes:

Table 1.1 – Visualization of an inverted index

A search string containing multiple terms goes through a similar process of analysis to extract terms, to then look up all the matching terms and their occurrences in the inverted index. A score is calculated for each field match based on the similarity module. By default, the BM25 ranking function (based on term frequency/inverse document frequency) is used to estimate the relevance of a document for a search query. Elasticsearch then returns a union of the results if an OR operator is used (by default) or an intersection of the results if an AND operator is used. The results are sorted by score, with the highest score appearing first.
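To illustrate the operator behavior described above (assuming a hypothetical recipes index with a title field), a match query uses OR semantics by default unless the operator is set explicitly:

# Matches documents containing "vegetarian" OR "pasta" (the default)
GET recipes/_search
{
  "query": {
    "match": { "title": "vegetarian pasta" }
  }
}

# Requires both terms to be present (AND semantics)
GET recipes/_search
{
  "query": {
    "match": {
      "title": { "query": "vegetarian pasta", "operator": "and" }
    }
  }
}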

Aggregating data

Elasticsearch can aggregate large volumes of data with speed thanks to its distributed nature. There are primarily two types of aggregations:

- Bucket aggregations: Bucket aggregations allow you to group (and sub-group) documents based on the values of fields or where the value sits in a range.
- Metrics aggregations: Metrics aggregations can calculate metrics based on the values of fields in documents. Supported metrics aggregations include avg, min, max, count, and cardinality, among others. Metrics can be computed for buckets/groups of data, as shown in the sketch after this list.
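The following console request sketches both types together: a terms bucket aggregation grouping documents by host, with an avg metrics aggregation computed per bucket. The my-logs index and the host.name and system.cpu.total.pct fields are illustrative assumptions (ECS-style names), not a fixed schema:

GET my-logs/_search
{
  "size": 0,
  "aggs": {
    "per_host": {
      "terms": { "field": "host.name" },
      "aggs": {
        "average_cpu": { "avg": { "field": "system.cpu.total.pct" } }
      }
    }
  }
}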

Tools such as Kibana heavily use aggregations to visualize the data on Elasticsearch. We will dive deeper into aggregations in Chapter 4, Leveraging Insights and Managing Data on Elasticsearch.

Acting on data in near real time

One of the benefits of quickly ingesting and retrieving data is the ability to respond to the latest information quickly. Imagine a scenario where uptime information for business-critical services is ingested into Elasticsearch. Alerting would work by continually querying Elasticsearch (at a predefined interval) to return any documents that indicate degrading service performance or downtime. If the query returns any results, actions can be configured to alert a Site Reliability Engineer (SRE) or trigger automated remediation processes as appropriate.
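A query of the kind such an alert might run on a schedule could look like the following sketch; the uptime-checks index and the monitor.status field are hypothetical names used for illustration:

GET uptime-checks/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "monitor.status": "down" } },
        { "range": { "@timestamp": { "gte": "now-5m" } } }
      ]
    }
  }
}

Any hits returned represent services reported as down in the last five minutes, which an alerting rule could then act on.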

Watcher and Kibana alerting are two ways in which this can be achieved; we will look at this in detail in Chapter 4, Leveraging Insights and Managing Data on Elasticsearch, and Chapter 8, Interacting with Your Data on Kibana.

Working with unstructured/semi-structured data

Elasticsearch does not require predefined schemas for the documents you want to work with. Schemas on indices can be preconfigured, if known in advance, to control storage/memory consumption and how the data can be used later on. Schemas (also known as index mappings) can be dynamically or strictly configured, depending on your flexibility and the maturity of your document's structure.

By default, Elasticsearch will dynamically update these index mappings based on the documents that have been ingested. Where no mapping exists for a field, Elasticsearch will guess the data type based on its value. This flexibility makes it extremely easy for developers to get up and running, while also making it suitable for use in environments where document schemas may evolve over time.
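As a brief sketch, an index can be created with a partial, explicitly defined mapping while leaving dynamic mapping enabled for any fields not listed; the application-logs index and its fields are assumptions for illustration:

PUT application-logs
{
  "mappings": {
    "dynamic": true,
    "properties": {
      "@timestamp": { "type": "date" },
      "message": { "type": "text" },
      "status_code": { "type": "integer" }
    }
  }
}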

We'll look at index mappings in Chapter 3, Indexing and Searching for Data.

Architectural characteristics of Elasticsearch

Elasticsearch can be configured to work as a distributed system where groups of nodes (Elasticsearch instances) work together to form a cluster. Clusters can be set up for the various architectural characteristics when deployed in mission-critical environments. We will take a look at some of these in this section.

Horizontally scalable

As we mentioned previously, Elasticsearch is a horizontally scalable system. Read/write throughput, as well as storage capacity, can be increased almost linearly by adding additional nodes to the Elasticsearch cluster. Adding nodes to a cluster is relatively effortless and can be done without any downtime. The cluster can automatically redistribute shards evenly across nodes (subject to shard allocation filtering rules) as the number of nodes available changes to optimize performance and improve resource utilization across nodes.
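The number of primary shards and replicas is configured per index at creation time, as in the following sketch (the my-logs-1 index name is an example):

PUT my-logs-1
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

With three primary shards, indexing and search work for this index can be spread across up to three data nodes, with replicas adding further read capacity and redundancy.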

Highly available and resilient

A primary shard in Elasticsearch can handle both read and write operations, while a replica shard is a read-only copy of a primary shard. By default, Elasticsearch will allocate one replica for every primary shard on different nodes in the cluster, making Elasticsearch a highly available system where requests can still be completed when one or more nodes experience failures.

If a node holding a primary shard is lost, a replica shard will be selected and promoted to become the primary shard, and a new replica shard will be allocated to another node in the cluster.

If a node holding a replica shard is lost, the replica shard will simply be allocated to another node in the cluster.

Indexing and search requests can be handled seamlessly while shards are being allocated, with clients experiencing little to no downtime. Even if a search request fails, subsequent search requests will likely succeed because of this architecture.

Shard allocation on Elasticsearch can also consider node attributes to help us make more informed allocation decisions. For example, a cluster deployed in a cloud region with three availability zones can be configured so that replicas are always allocated on a different availability zone (or even a server rack in an on-premises data center) to the primary shard to protect against failures at the zone level.
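As a sketch of zone-aware allocation: each node first declares its zone with a custom attribute (for example, node.attr.zone: zone-a in its elasticsearch.yml), and the cluster is then told to consider that attribute when allocating shards:

# Assumes node.attr.zone is set on each node in elasticsearch.yml
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone"
  }
}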

Recoverable from disasters

Elasticsearch allows us to persistently store or snapshot data, making it a recoverable system in the event of a disaster. Snapshots can be configured to write data to a traditional filesystem or an object store such as AWS S3. Snapshots are a point-in-time copy of the data and must be taken at regular intervals, depending on your Recovery Point Objective (RPO), for an effective disaster recovery plan to be created.
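A minimal sketch of snapshotting to a shared filesystem repository follows; the repository name and location are assumptions, and a filesystem repository also requires the location to be registered under path.repo in elasticsearch.yml:

# Register a snapshot repository
PUT _snapshot/my_backups
{
  "type": "fs",
  "settings": { "location": "/mnt/snapshots" }
}

# Take a point-in-time snapshot of the cluster
PUT _snapshot/my_backups/snapshot-1?wait_for_completion=true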

Cross-cluster operations

Elasticsearch can search for and replicate data across remote clusters to enable more sophisticated architectural patterns.

Cross-Cluster Search (CCS) is a feature that allows you to search data that resides on an external or remote Elasticsearch cluster. A single search request can be run on the local cluster, as well as one or more remote clusters. Each cluster will run the search independently on its own data before returning a response to the coordinator node (the node handling the search request). The coordinator node then combines the results from the different clusters into a single response for the client. The local node does not join remote clusters, allowing for higher network latencies for inter-cluster communication, compared to intra-cluster communication. This is useful in scenarios where multiple independent clusters in different geographic regions need to search on each other to have a unified search capability.

The following diagram shows how Elasticsearch clusters can search across multiple clusters and combine results into a single search response for the user:

Figure 1.3 – How CCS requests are handled
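Once a remote cluster has been configured (under a connection alias such as sydney, used here purely for illustration), a single search request can target local and remote indices together:

GET my-logs,sydney:my-logs/_search
{
  "query": { "match_all": {} }
}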

Cross-Cluster Replication (CCR) allows an index on a local cluster to be replicated to a remote cluster. CCR implements a leader/follower model, where all the changes that have been made to a leader index are replicated on the follower index. This feature allows for fast searching on the same dataset in different geographical regions by replicating data closer to where it will be consumed. CCR is also sometimes used for cross-region redundancy requirements:

Figure 1.4 – How CCR works
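A sketch of setting up a follower index on the local cluster is shown below; the us-cluster alias and the index names are assumptions for illustration:

# Start replicating a leader index from a configured remote cluster
PUT my-logs-copy/_ccr/follow
{
  "remote_cluster": "us-cluster",
  "leader_index": "my-logs"
}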

CCS and CCR enable more complex use cases where multiple regional clusters can be used to independently store and search for data, while also allowing unified search and geographical redundancy.

Security

Elasticsearch offers security features to help authenticate and authorize user requests, as well as encrypt network communications to and within the cluster:

- Encryption in transit: TLS can be used to encrypt inter-node communications, as well as REST API requests.
- Access control: Role-Based Access Control (RBAC) or Attribute-Based Access Control (ABAC) can be used to control access to the data and functionality on Elasticsearch. RBAC works by associating a user with a role, where a role contains a list of privileges (such as read/write/update), as well as the resources these privileges can be applied to (such as an index; for example, my-logs-1). ABAC works by using attributes the user has (such as their location, security clearance, or job role) in conjunction with an access policy to determine what the user can do or access on the cluster. ABAC is generally a more fine-grained authorization control compared to RBAC.
- Document security: A security role in Elasticsearch can specify what subset of data a user can access on an index. For example, an employee with a security clearance of baseline can only access documents where the value of the classification field is either UNOFFICIAL or OFFICIAL.
- Field security: Elasticsearch can also control what fields a user has access to as part of a document. Building on the example in the previous point, field-level security can be used so that the user can only view fields that start with the metadata- string. A role definition combining document and field security is sketched after this list.
- Authentication providers: In addition to local/native authentication, Elasticsearch can use external services such as Active Directory, LDAP, SAML, and Kerberos for user authentication. API key-based authentication is also available for system accounts and programmatic access.
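The following sketch defines a role that combines index privileges, document-level security, and field-level security, mirroring the classification and metadata- examples above; the role name, index pattern, and granted fields are illustrative assumptions:

POST _security/role/baseline_analyst
{
  "indices": [
    {
      "names": ["my-logs-*"],
      "privileges": ["read"],
      "query": {
        "terms": { "classification": ["UNOFFICIAL", "OFFICIAL"] }
      },
      "field_security": {
        "grant": ["metadata-*", "classification", "@timestamp"]
      }
    }
  ]
}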

When Elasticsearch may not be the right tool

It is also important to understand the limitations of Elasticsearch. This section describes some scenarios where Elasticsearch alone may not be the best tool for the job.

Handling relational datasets

Elasticsearch, unlike databases such as MySQL, was not designed to handle relational data. Elasticsearch allows you to have simple relationships in your data, such as parent-child and nested relationships, at a performance cost (at search time and indexing time, respectively). Data on Elasticsearch must be de-normalized (duplicating or adding redundant fields to documents, to avoid having to join data) to improve search and indexing/update performance.

If you need to have the database manage relationships and enforce rules of consistency across different types of linked data, as well as maintain normalized records of data, Elasticsearch may not be the right tool for the job.

Performing ACID transactions

Individual requests in Elasticsearch support ACID properties. However, Elasticsearch does not have the concept of transactions, so it does not offer ACID transactions.

At the individual request level, ACID properties can be achieved as follows:

- Atomicity is achieved by sending a write request, which will either succeed on all active shards or fail. There is no way for the request to partially succeed.
- Consistency is achieved by writing to the primary shard. Data replication happens synchronously before a success response is returned. This means that all the read requests on all shards after a write request will see the same response.
- Isolation is offered since concurrent writes or updates (which are deletes and writes) can be handled successfully without any interference.
- Durability is achieved since once a document is written into Elasticsearch, it will persist, even in the case of a system failure. Writes on Elasticsearch are not immediately persisted onto Lucene segments on disk, as Lucene commits are relatively expensive operations. Instead, documents are written to a transaction log (referred to as a translog) and flushed to disk periodically. If a node crashes before the data is flushed, operations from the translog will be recovered into the Lucene index upon startup.

If ACID transactions are important to your use case, Elasticsearch may not be suitable for you.

Important Note

In the case of relational data or ACID transaction requirements, Elasticsearch is often used alongside a traditional RDBMS solution such as MySQL. In such architectures, the RDBMS would act as the source of truth and handle writes/updates from the application. These updates can then be replicated to Elasticsearch using tools such as Logstash for fast/relevant searches and visualization/analytics use cases.

With that, we have explored some of the core concepts of Elasticsearch and the role it plays in ingesting and storing our data. Now, let's look at how we can interact with the data on Elasticsearch using Kibana.

Introducing Kibana

Kibana was created primarily as a visualization tool for data residing on Elasticsearch and is bundled together as part of the stack. Since its inception, Kibana has evolved to cater to use cases such as alerting, reporting, and monitoring Elastic Stack components, as well as administrating and managing the Elasticsearch cluster in use.

More importantly, Kibana provides the interface and functionality for the solutions that Elastic Stack offers, in addition to administration and management options for the core components. Functionality in Kibana is organized using applications (or apps, for short).

Apps on Kibana can be solution-specific or part of the general stack. The SIEM app, for example, powers the security solution, enabling security analysts and threat hunters to defend their organization from attacks. The APM app is another solution-specific app that, in this case, allows developers and SREs to observe their applications to look for issues or performance bottlenecks.

On the other hand, general Kibana apps such as Discover, Visualize, and Dashboard can be used to explore, interrogate, and visualize data, regardless of the solution the data enables. Ingest Manager is another example of an app that allows you to configure Elastic Agent to collect any kind of data from across an environment, agnostic of the solution the data may be used in.

Solution-specific apps on Kibana provide a great out-of-the-box user experience, as well as targeted features and functionality for the solution in question. General or stack-based apps such as Discover and Dashboard provide powerful, unified capabilities that are useful across all use cases, including custom solutions that you might build on the Elastic Stack. Kibana is usually considered a core component of the Elastic Stack and is often installed, even if the cluster is not used for data analysis.

We will dive deeper into Kibana's features in Chapter 8, Interacting with Your Data on Kibana. Now, let's look at how data can be collected and ingested into Elasticsearch using Logstash and Beats.

Collecting and ingesting data

So far, we have looked at Elasticsearch, a scalable search and analytics engine for all kinds of data. We also have Kibana to interface with Elasticsearch, helping us explore and use our data effectively. The final capability to make it all work together is ingestion.

The Elastic Stack provides two products for ingestion, depending on your use cases.

Collecting data from across your environment using Beats

Useful data is generated all over the place in present-day environments, often from varying technology stacks, as well as legacy and new systems. As such, it makes sense to collect data directly from, or closer to, the source system and ship it into your centralized logging or analytics platform. This is where Beats come in; Beats are lightweight applications (also referred to as agents) that can collect and ship several types of data to destinations such as Elasticsearch, Logstash, or Kafka.

Elastic offers a few types of Beats today for various use cases:

- Filebeat: Collecting log data
- Metricbeat: Collecting metric data
- Packetbeat: Decoding and collecting network packet metadata
- Heartbeat: Collecting system/service uptime and latency data
- Auditbeat: Collecting OS audit data
- Winlogbeat: Collecting Windows event, application, and security logs
- Functionbeat: Running data collection on serverless compute infrastructure such as AWS Lambda

Beats use an open source library called libbeat that provides generic APIs for configuring inputs and destinations for data output. Beats implement the data collection functionality that's specific to the type of data (such as logs and metrics) that they collect. A range of community-developed Beats are available, in addition to the officially produced Beats agents.

Beats modules and the Elastic Common Schema

The modules available in Beats allow for the collection of consistent datasets and the distribution of out-of-the-box dashboards, machine learning jobs, and alerts for users to leverage in their use cases.

Importance of a unified data model

One of the most important aspects of ingesting data into a centralized logging platform is paying attention to the data format in use. A Unified Data Model (UDM) is an especially useful tool to have, ensuring data can be easily consumed by end users once ingested into a logging platform. Enterprises typically follow a mixture of two approaches to ensure the log data complies with their unified data model:

Enforcing a logging standard or specification for log-producing applications in the company.

This approach is often considerably costly to implement, maintain, and scale. Changes in the log schema at the source can also have unintended downstream implications in other applications consuming the data. It is common to see UDMs evolve quite rapidly as the nature and the content of the logs that have been collected change. The use of different technology stacks or frameworks in an organization can also make it challenging to log with consistency and uniformity across the environment.

Transforming/renaming fields in incoming data using an ETL tool such as Logstash to comply with the UDM. Organizations can achieve relatively successful results using this approach, with considerably fewer upfront costs when reworking logging formats and schemas. However, the approach does come with some downsides:

(a) Parsers need to be maintained and constantly updated to make sure the logs are extracted and stored correctly.