The Elastic Stack helps you work with massive volumes of data to power use cases in the search, observability, and security solution areas.
This three-part book starts with an introduction to the Elastic Stack with high-level commentary on the solutions the stack can be leveraged for. The second section focuses on each core component, giving you a detailed understanding of the component and the role it plays. You’ll start by working with Elasticsearch to ingest, search, analyze, and store data for your use cases. Next, you’ll look at Logstash, Beats, and Elastic Agent as components that can collect, transform, and load data. Later chapters help you use Kibana as an interface to consume Elastic solutions and interact with data on Elasticsearch. The last section explores the three main use cases offered on top of the Elastic Stack. You’ll begin with full-text search and look at real-world outcomes powered by search capabilities. Furthermore, you’ll learn how the stack can be used to monitor and observe large and complex IT environments. Finally, you’ll understand how to detect, prevent, and respond to security threats across your environment. The book ends by highlighting architecture best practices for successful Elastic Stack deployments.
By the end of this book, you’ll be able to implement the Elastic Stack and derive value from it.
The e-book can be read in Legimi apps or any app that supports the following format:
Page count: 410
Run powerful and scalable data platforms to search, observe, and secure your organization
Asjad Athick
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Sunith Shetty
Senior Editor: Nazia Shaikh
Content Development Editor: Sean Lobo
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Production Designer: Nilesh Mohite
Marketing Coordinator: Priyanka Mhatre
First published: March 2022
Production reference: 1080222
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80056-949-2
www.packt.com
To my parents, for their unconditional love, inspiration, and all their sacrifices. To Inez and James, for their unwavering support through thick and thin. And to the brilliant people at Elastic, for being generous with their time and knowledge.
– Asjad Athick
In 2009 I sat down to write Elasticsearch with the goal of helping people find what they are looking for, and I am still in awe of the breadth of the ecosystem that has been built around it. The Elastic Stack is already at version 8 (we did cheat a bit and jumped from version 2/3 to 5 to align versions across all core components). It has grown from Elasticsearch to ELK (Elasticsearch, Logstash, Kibana) to the Elastic Stack, which reaches beyond the set of core products to address whole solutions as well.
It has also been at the confluence of major structural, cultural, and technological evolutions: from the breadth of Enterprise Search and the ability to put a search box on a website, application, or workspace, through to the emergence of Observability to help monitor applications regardless of the "how", like Logging and APM. And now, in the early days of the same movement in security, between SIEM, Endpoint, Container, and Cloud security, the convergence of Observability and Security capabilities is becoming the status quo in any modern IT environment. These evolutions are as much social and organizational ones (DevOps, SecOps, and so on) as they are technological ones. The best ones often are.
And at the core of it all are still products like Elasticsearch and Kibana. To some level, their fundamentals have remained the same. In other ways, they have evolved tremendously over the years to address many new and exciting use cases. Data is all around us, growing faster than ever before, making the need to search across this data natural and timeless. The Elastic Stack acts both as an end-to-end solution (like in Observability or Security) and also as an extensible platform for your data. It is flexible to handle so many other use cases with strong building blocks: Elasticsearch to store and search data, Kibana to visualize it, and various ways to bring data into the stack. The best and most fun products are ones that are easy to use for the obvious, and flexible enough to tackle the unobvious. We built the Elastic Stack to do both.
In this book, Asjad goes through the various components of the Elastic Stack and the possibilities you can unlock in the form of holistic solutions for your organization. The focus on providing the best possible "Get Started" experience is important to make it even easier to build with the stack while making the right decisions early on in the project. Asjad works with a broad range of customers at Elastic, ranging from tech startups to commercial and large enterprise organizations. This book encompasses a wealth of knowledge and experience acquired over the years from production customer deployments used every day for mission-critical applications.
I recommend reading chapters in Part 2 to familiarize yourself with the various components of the stack and the problems they can solve for you. Then use Part 3 to implement the solutions that deliver the value you need to create for your end users. Add this book to your reading list if you are looking for a great starting point on how to best leverage search and analytics to extract the most value from the data around you.
I hope you enjoy reading this book.
Shay Banon
Founder and Chief Technology Officer at Elastic
Asjad Athick is a security specialist at Elastic with demonstrable experience in architecting enterprise-scale solutions on the cloud. He believes in empowering people with the right tools to help them achieve their goals. At Elastic, he works with a broad range of customers across Australia and New Zealand to help them understand their environment; this allows them to build robust threat detection, prevention, and response capabilities. He previously worked in the telecommunications space to build a security capability to help analysts identify and contextualize unknown cyber threats. With a background in application development and technology consulting, he has worked with various small businesses and start-up organizations across Australia.
Liza Katz started her software engineering career at the age of 17, and now, with almost 20 years of experience under her belt, she has vast experience with building software, delivering products, and effectively sharing knowledge. She enjoys slow travel, fitness, and baking.
Ravindra Ramnani is a principal solutions architect at Elastic. Ravi has over 14 years of solution architecture, design, and consulting experience across multiple industries and technologies. He has deep experience with the Elastic Stack, having built solutions on the stack for many years now. He is a fintech thought-leader with rich experience in helping banks in their digitization journey.
A core aspect of working in any IT environment is the ability to make sense of and use large amounts of data. Every single component in your environment generates data about its state, warnings or errors that were encountered, and vital health and diagnostic information about the component. The ability to collect, analyze, correlate, and visualize this data is key to the operational resiliency as well as security of your organization.
The Elastic Stack has deep roots in the world of search. Elasticsearch is a powerful and ultra-scalable search engine and data store that gives users the ability to ingest and search across massive volumes of data. The flexibility of Elasticsearch allows users to build simple experiences to find what they are looking for in large repositories of data.
The Elastic Stack is a collection of technologies that can collect data from any source system, transform the data to make it useful, and give users the ability to understand and derive insights from the data to enable a range of use cases. Today, the Elastic Stack consists of Beats, Logstash, and Elastic Agent as collection and transformation tools; Elasticsearch as a search and analytics engine; and Kibana as a tool to build solutions around your data. The Elastic Stack has become a de facto standard when it comes to collecting and analyzing data, used widely in open source as well as enterprise and commercial projects.
The main goal of this book is to simplify and optimize your experience as you get started with this technology. The flexibility of the Elastic Stack means there is more than one way to solve a given problem. The nature of the individual core components also means that the guides and reference materials available focus on technical capability and not the solutions or outcomes that can be built.
This book aims to give you a robust introduction and understanding of the core components and how they work together to solve problems in the realms of search, observability, and security. It also focuses on the most up-to-date best practices and approaches to implementing your solution using the stack.
Use this book to give yourself a head start on your Elastic Stack projects. You will understand the capabilities of the stack and build your solutions to evolve and grow alongside your environment, while using the insights in your data to best serve your users and deliver value to your organization.
This book is designed for those with little to no experience with the Elastic Stack. It does, however, expect you to have the curiosity to learn and explore new technologies and be comfortable with basic Linux and system administration and simple scripting. You are also encouraged to supplement the content in this book with further online research where appropriate for best outcomes.
Developers, engineers, and analysts can use this book to learn how use cases and solutions can be implemented to solve their data problems.
Solution architects and tech leads can understand how the components work at a high level and where the capability may fit in their environment.
The book makes it easy to wrap your head around the various core technologies in the Elastic Stack with structured content, story-based explanations, and hands-on exercises to expedite your learning.
Chapter 1, Introduction to the Elastic Stack, gives you an overview of the core components of the stack and the solutions they can enable.
Chapter 2, Installing and Running the Elastic Stack, shows you how the core components can be installed, orchestrated, and run.
Chapter 3, Indexing and Searching for Data, explores Elasticsearch fundamentals for indexing and full-text search.
Chapter 4, Leveraging Insights and Managing Data on Elasticsearch, dives deeper into Elasticsearch, exploring aggregations, the data life cycle, and alerting.
Chapter 5, Running Machine Learning Jobs on Elasticsearch, looks at how supervised and unsupervised machine learning jobs can be configured to run on your data.
Chapter 6, Collecting and Shipping Data with Beats, introduces you to commonly used Beats agents and the different types of data sources they can collect on the stack.
Chapter 7, Using Logstash to Extract, Transform, and Load Data, explores the use of Logstash to build ETL pipelines for your data.
Chapter 8, Interacting with Your Data on Kibana, focuses on the use cases and solutions that can be built on top of your data.
Chapter 9, Managing Data Onboarding with Elastic Agent, looks at the use of a unified agent to continuously onboard and manage the collection of your data.
Chapter 10, Building Search Experiences Using the Elastic Stack, dives deep into the different aspects of building powerful and rich search experiences for your applications.
Chapter 11, Observing Applications and Infrastructure Using the Elastic Stack, focuses on building end-to-end observability solutions using logs, metrics, and APM traces to drive operational resiliency in your environment.
Chapter 12, Security Threat Detection and Response Using the Elastic Stack, looks at implementing security detection and response capability using Elastic's SIEM and EDR solutions to protect your environment from cyber-attacks.
Chapter 13, Architecting Workloads on the Elastic Stack, explores various best practices and reference architectures when it comes to running Elastic Stack workloads in production settings.
You do not need to be an expert in a range of technologies to get the most out of this book. While the following tools are used in the hands-on examples, all the core concepts are introduced and explained in the book. Some additional online research may be required where appropriate.
Most chapters in this book include relevant setup instructions and technical requirements related to the contents of the chapter. Read these instructions before continuing with the chapter to follow along with any hands-on exercises or examples.
You can also visit the book's website using the following link: https://www.elasticstackbook.com/
All code used in this book can be accessed from the GitHub repository for this book. A link to the repository is available in the next section. This will help avoid any potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Getting-Started-with-Elastic-Stack-8.0. If there's an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781800569492_ColorImages.pdf.
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."
A block of code is set as follows:
html, body, #map {
height: 100%;
margin: 0;
padding: 0
}
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)
Any command-line input or output is written as follows:
$ mkdir css
$ cd css
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Select System info from the Administration panel."
Tips or Important Notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you've read Getting Started with Elastic Stack 8.0, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.
This section offers a quick introduction to the core components of the Elastic Stack: Elasticsearch, Kibana, Logstash, and Beats.
This section includes the following chapters:
Chapter 1, Introduction to the Elastic Stack
Chapter 2, Installing and Running the Elastic Stack

Welcome to Getting Started with Elastic Stack 8.0. The Elastic Stack has exploded in popularity over the last couple of years, becoming the de facto standard for centralized logging and "big data"-related use cases. The stack is leveraged by organizations, both big and small, across the world to solve a range of data-related problems. Hunting for adversaries in your network, looking for fraudulent transactions, real-time monitoring and alerting in systems, and searching for relevant products in catalogs are some of the real-world applications of the Elastic Stack.
The Elastic Stack is a bundle of multiple core products that integrate with each other. We will look at each product briefly in this chapter, and then dive into each one in later chapters in this book. The Elastic Stack attracts a great deal of interest from developers and architects that are working for organizations of all sizes. This book aims to serve as the go-to guide for those looking to get started with the best practices when it comes to building real-world search, security, and observability platforms using this technology.
In this chapter, you will learn a little bit about each component that makes up the Elastic Stack, and how they can be leveraged for your use cases. Those of you who are beginners or intermediary learners of this subject will benefit from this content to gain useful context for Chapter 3, Indexing and Searching for Data, to Chapter 13, Architecting Workloads on the Elastic Stack, of this book.
Specifically, we will cover the following topics:
- An overview of the Elastic Stack
- An introduction to Elasticsearch
- Visualizing and interacting with data on Kibana
- Ingesting various data sources using Logstash and Beats
- End-to-end solutions on the Elastic Stack

The Elastic Stack is made up of four core products:
- Elasticsearch is a full-text search engine and a versatile data store. It can store large volumes of data and lets you search and compute aggregations on it quickly.
- Kibana provides a user interface for Elasticsearch. Users can search for and create visualizations, and then administer Elasticsearch, using this tool. Kibana also offers out-of-the-box solutions (in the form of apps) for use cases such as search, security, and observability.
- Beats can be used to collect and ship data directly from a range of source systems (such as different types of endpoints, network and infrastructure appliances, or cloud-based API sources) into Logstash or Elasticsearch.
- Logstash is an Extract, Transform, and Load (ETL) tool that's used to process and ingest data from various sources (such as log files on servers, Beats agents in your environment, or message queues and streaming platforms) into Elasticsearch.

This diagram shows how the core components of the Elastic Stack work together to ingest, store, and search data:
Figure 1.1 – Components of the Elastic Stack
Each core component solves a single, common data-related problem. This genericity makes the stack flexible and domain-agnostic, allowing it to be adopted in multiple solution areas. Most users start with a simple logging use case where data is collected, parsed, and stored in Elasticsearch to create dashboards and alerts. Others build more sophisticated capabilities, such as workplace search to make information across a range of data sources accessible to your team; leveraging SIEM and machine learning to look for anomalous user/machine behavior and hunt for adversaries on your company network; understanding performance bottlenecks in applications; and monitoring infrastructure logs/metrics to respond to issues on critical systems.
Multiple independent projects have evolved over the years to create the present-day version of the Elastic Stack. Knowing how these components evolved indicates some of the functional gaps that existed in the big data space and how the Elastic Stack components come together to solve these challenges. Let's take a look:
- An open source transactional Object/Search Engine Mapping (OSEM) framework for Java called Compass was released. Compass leveraged Lucene, an open source search engine library for implementing high-performance full-text search and indexing functionality.
- To address scalability concerns in Compass, it was rewritten as a distributed search engine called Elasticsearch. Elasticsearch implemented RESTful APIs over HTTP using JSON, allowing programming languages other than Java to interact with it. Elasticsearch quickly gained popularity in the open source community.
- As Elasticsearch was adopted by the community, a modular tool called Logstash was being developed to collect, transform, and send logs to a range of target systems. Elasticsearch was one of the target systems supported by Logstash.
- Kibana was written to act as a user interface for using the REST APIs on Elasticsearch to search for and visualize data. Elasticsearch, Logstash, and Kibana were commonly referred to as the ELK Stack.
- Elastic started providing managed Elasticsearch clusters on the cloud. Elastic Cloud Enterprise (ECE) was offered for customers to orchestrate and manage Elasticsearch deployments on-premises or on private cloud infrastructure.
- An open source tool called Packetbeat was created to collect and ship network packet data to Elasticsearch. This later evolved into the Beats project, a collection of lightweight agents designed to collect and ship several types of data into Elasticsearch.
- Machine learning capabilities were added to Elasticsearch and Kibana to support anomaly detection use cases on data residing on Elasticsearch.
- Application Performance Monitoring (APM) capabilities were added to the Elastic Stack. The APM app on Kibana, together with the Logs, Metrics, and Uptime apps, formed the Observability solution.
- Kibana added security analytics functionality as part of the Security Information and Event Management (SIEM) app.
- A collection of proprietary features known as X-Pack was made open source under the Elastic licensing model.
- Endpoint Detection and Response (EDR) capabilities were added to the Elastic Stack. EDR and SIEM capabilities formed the Security solution.
- Out-of-the-box website, application, and content search functionality was offered as part of the Enterprise Search solution.

The core components of the stack are open source software projects, licensed under a mix of the Apache 2, Elastic License version 2 (ELv2), and Server Side Public License (SSPL) licensing agreements. The LICENSE.txt file in the root of each product's GitHub repository should explain how the code is licensed.
A paid license is not required to learn about and explore the Elastic Stack features covered in this book. A trial license can be activated for full access to all the features for a limited period upon installing the software.
To focus on learning about the features and technical aspects of the product, there will be no notes on licensing implications after this section. Please refer to the Elastic Subscriptions page to understand what kind of license you might need for a production deployment of the technology:
https://www.elastic.co/subscriptions
Elasticsearch is often described as a distributed search engine that can be used to search through and aggregate enormous amounts of data. Some describe Elasticsearch as an analytics engine, while others have used the term document store or NoSQL database. The reason for the wide-ranging definitions for Elasticsearch is that it is quite a flexible product. It can be used to store JSON documents, with or without a predefined schema (allowing for unstructured data); it can be used to compute aggregations on document values (to calculate metrics or group data into buckets), and it can be used to implement relevant, free text search functionality across a large corpus.
Elasticsearch builds on top of Apache Lucene, a popular and fast full-text search library for Java applications. Lucene is not distributed in any way and does not manage resources/handle requests natively. At its core, Elasticsearch abstracts away the complexities and intricacies of working directly with a library such as Lucene by providing user-friendly APIs to help index, search for, and aggregate data. It also introduces concepts such as the following:
- A method to organize and group related data as indices
- Replica shards to improve search performance and add redundancy in the case of hardware failure
- Thread pools for managing node resources while servicing several types of requests and cluster tasks
- Features such as Index Lifecycle Management (ILM) and data streams to manage the size and movement of indices on a cluster

Elasticsearch exposes RESTful APIs using JSON format, allowing for interoperability between different programming languages and technology stacks.
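As a brief illustration, creating an index through these JSON APIs is just a matter of sending a request body describing the settings and mappings. The following is a minimal sketch: the index name, field names, and shard counts are illustrative examples, and no live cluster is needed to build the body itself.

```python
import json

# Illustrative body for Elasticsearch's create-index API (e.g. PUT /my-index).
# The index name "my-index" and the fields below are invented for the example.
create_index_body = {
    "settings": {
        "number_of_shards": 2,    # primary shards for the index
        "number_of_replicas": 1,  # one replica per primary shard
    },
    "mappings": {
        "properties": {
            "title": {"type": "text"},         # analyzed for full-text search
            "published_at": {"type": "date"},
        }
    },
}

# Against a running cluster, this body would be sent with any HTTP client,
# for example: PUT http://localhost:9200/my-index
print(json.dumps(create_index_body, indent=2))
```

Because the API is plain JSON over HTTP, any language with an HTTP client can build and send requests like this one.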
Elasticsearch today is a feature-rich and complex piece of software. Do not worry if you do not fully understand or appreciate some of the terms used to explain Elasticsearch. We will dive into these, as well as the features on offer, in Chapter 3, Indexing and Searching for Data.
Selecting the right tool for the job is an important aspect of any project. This section describes some scenarios where Elasticsearch may be suited for use.
Elasticsearch is a horizontally scalable data store where additional nodes can easily be added to a cluster to increase the available resources. Each node can store multiple primary shards on data, and each shard can be replicated (as replica shards) on other nodes. Primary shards handle read and write requests, while replica shards only handle read requests.
The following diagram shows how primary and replica shards are distributed across Elasticsearch nodes to achieve scalable and redundant reading and writing of data:
Figure 1.2 – Shards of data distributed across nodes
The preceding diagram shows the following:
- Three Elasticsearch nodes: node A, node B, and node C
- Two indices: index A and index B
- Each index with two primary and two replica shards

High-volume ingest can mean either of the following things:
- A single index or data source receiving a large number of events per second (EPS)
- A group of indices or data sources receiving a large number of aggregate events per second

Elasticsearch can also store large volumes of data for search and aggregation. To retain data cost-efficiently over long retention periods, clusters can be architected with hot, warm, and cold tiers of data. During its life cycle, data can be moved across nodes with different disk or Input/Output Operations Per Second (IOPS) specifications to take advantage of slower disk drives and their associated lower costs. We will look at these sorts of architectures in Chapter 3, Indexing and Searching for Data, and Chapter 13, Architecting Workloads on the Elastic Stack.
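To give a flavor of how such tiering is expressed, an ILM policy is itself just a JSON document describing phases and actions. The phase thresholds and sizes below are illustrative assumptions rather than recommendations; refer to the ILM documentation for the full set of supported actions.

```python
import json

# A hedged sketch of an ILM policy that moves data through hot -> warm ->
# delete phases. Ages and sizes here are invented for illustration.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_age": "7d", "max_primary_shard_size": "50gb"}
                }
            },
            "warm": {
                "min_age": "30d",
                # Fewer replicas on cheaper hardware can reduce storage costs.
                "actions": {"allocate": {"number_of_replicas": 1}},
            },
            "delete": {
                "min_age": "90d",
                "actions": {"delete": {}},
            },
        }
    }
}
print(json.dumps(ilm_policy, indent=2))
```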
Some examples of where you need to handle large volumes of data include the following:
- Centralized logging platforms (ingesting various application, security, event, and audit logs from multiple sources)
- Handling metrics/traces/telemetry data from many devices
- Ingesting data from large document repositories or crawling a large number of web pages

As we discussed previously, Elasticsearch leverages Lucene for indexing and searching operations. As documents are ingested into Elasticsearch, unstructured textual components from the document are analyzed to extract some structure in the form of terms. Terms are maintained in an inverted index data structure. In simple terms, an index (such as the table of contents in a book) is a list of topics (or documents) and the corresponding page numbers for each topic. An index is great for retrieving page content, given you already know what the chapter is called. An inverted index, however, is a collection of words (or terms) across topics and a corresponding list of the pages that contain them. Therefore, an inverted index can make it easier to find all the relevant pages, given the search term you are interested in.
The following table visualizes an inverted index for a collection of documents containing recipes:
Table 1.1 – Visualization of an inverted index
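The idea behind Table 1.1 can be sketched in a few lines of code. This toy example only lowercases and splits text on whitespace; real analyzers in Elasticsearch also handle stemming, stop words, punctuation, and more.

```python
# A toy inverted index: map each term to the set of documents containing it.
# Document IDs and recipe titles are invented for the example.
documents = {
    1: "Tomato soup recipe",
    2: "Roast tomato and garlic pasta",
    3: "Garlic bread",
}

inverted_index = {}
for doc_id, text in documents.items():
    for term in text.lower().split():
        inverted_index.setdefault(term, set()).add(doc_id)

# Looking up a term returns every matching document in a single step.
print(sorted(inverted_index["tomato"]))  # [1, 2]
print(sorted(inverted_index["garlic"]))  # [2, 3]
```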
A search string containing multiple terms goes through a similar process of analysis to extract terms, to then look up all the matching terms and their occurrences in the inverted index. A score is calculated for each field match based on the similarity module. By default, the BM25 ranking function (based on term frequency/inverse document frequency) is used to estimate the relevance of a document for a search query. Elasticsearch then returns a union of the results if an OR operator is used (by default) or an intersection of the results if an AND operator is used. The results are sorted by score, with the highest score appearing first.
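Setting scoring aside, the OR/AND behavior described above reduces to set unions and intersections over the inverted index. A toy sketch, with invented terms and document IDs:

```python
# Hard-coded toy inverted index (term -> set of matching document IDs).
inverted_index = {
    "tomato": {1, 2},
    "garlic": {2, 3},
    "bread": {3},
}

def search(terms, operator="OR"):
    """Return document IDs matching the terms; OR = union, AND = intersection."""
    postings = [inverted_index.get(t, set()) for t in terms]
    if not postings:
        return set()
    if operator == "AND":
        return set.intersection(*postings)
    return set.union(*postings)

print(sorted(search(["tomato", "garlic"])))          # OR:  [1, 2, 3]
print(sorted(search(["tomato", "garlic"], "AND")))   # AND: [2]
```

In the real engine, each matching document also receives a BM25 relevance score, and results are sorted by that score rather than returned as an unordered set.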
Elasticsearch can aggregate large volumes of data with speed thanks to its distributed nature. There are primarily two types of aggregations:
- Bucket aggregations: Bucket aggregations allow you to group (and sub-group) documents based on the values of fields or where the value sits in a range.
- Metrics aggregations: Metrics aggregations can calculate metrics based on the values of fields in documents. Supported metrics aggregations include avg, min, max, count, and cardinality, among others. Metrics can be computed for buckets/groups of data.

Tools such as Kibana heavily use aggregations to visualize the data on Elasticsearch. We will dive deeper into aggregations in Chapter 4, Leveraging Insights and Managing Data on Elasticsearch.
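Conceptually, a bucket aggregation with a nested metrics aggregation works like the following toy example, which groups invented documents by one field and averages another field per group:

```python
# Toy documents; the field names and values are invented for illustration.
docs = [
    {"service": "checkout", "response_ms": 120},
    {"service": "checkout", "response_ms": 80},
    {"service": "search", "response_ms": 40},
]

# Bucket step: group documents by the "service" field.
buckets = {}
for doc in docs:
    buckets.setdefault(doc["service"], []).append(doc["response_ms"])

# Metrics step: compute an avg per bucket.
avg_by_service = {svc: sum(vals) / len(vals) for svc, vals in buckets.items()}
print(avg_by_service)  # {'checkout': 100.0, 'search': 40.0}
```

Elasticsearch performs the equivalent work in a distributed fashion across shards, which is what makes aggregations over very large datasets fast.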
One of the benefits of quickly ingesting and retrieving data is the ability to respond to the latest information quickly. Imagine a scenario where uptime information for business-critical services is ingested into Elasticsearch. Alerting would work by continually querying Elasticsearch (at a predefined interval) to return any documents that indicate degrading service performance or downtime. If the query returns any results, actions can be configured to alert a Site Reliability Engineer (SRE) or trigger automated remediation processes as appropriate.
Watcher and Kibana alerting are two ways in which this can be achieved; we will look at this in detail in Chapter 4, Leveraging Insights and Managing Data on Elasticsearch, and Chapter 8, Interacting with Your Data on Kibana.
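The polling pattern described above can be sketched generically. The query and notification below are placeholders for illustration, not real Watcher or Kibana alerting APIs:

```python
# Generic alerting loop body: run a query, fire an action if anything matches.
def check_and_alert(run_query, notify):
    """run_query returns matching documents; notify is called when any exist."""
    hits = run_query()  # e.g. "services reporting 'down' in the last 5 minutes"
    if hits:
        notify(f"{len(hits)} service(s) reporting downtime: {hits}")
    return bool(hits)

# Stand-ins: a fake query result and a list collecting notifications.
alerts = []
fired = check_and_alert(
    run_query=lambda: ["payments-api"],  # placeholder for a real ES query
    notify=alerts.append,
)
print(fired, alerts)
```

In practice, a scheduler runs this check at a predefined interval, and the notify step might page an SRE or trigger an automated remediation process.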
Elasticsearch does not require predefined schemas for the documents you want to work with. Schemas on indices can be preconfigured, if known, to control storage/memory consumption and how the data can be used later on. Schemas (also known as index mappings) can be dynamically or strictly configured, depending on your flexibility and the maturity of your document's structure.
By default, Elasticsearch will dynamically update these index mappings based on the documents that have been ingested. Where no mapping exists for a field, Elasticsearch will guess the data type based on its value. This flexibility makes it extremely easy for developers to get up and running, while also making it suitable for use in environments where document schemas may evolve over time.
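Where schemas are known upfront, they can be declared explicitly. The following is a sketch of an explicit index mapping with dynamic mapping disabled; the field names are hypothetical:

```python
# Sketch of an explicit index mapping. "dynamic": "strict" rejects documents
# containing fields not declared under "properties".
index_mapping = {
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "message":     {"type": "text"},     # full-text analyzed field
            "service":     {"type": "keyword"},  # exact-match, aggregatable field
            "timestamp":   {"type": "date"},
            "duration_ms": {"type": "long"},
        },
    }
}
```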
We'll look at index mappings in Chapter 3, Indexing and Searching for Data.
Elasticsearch can be configured to work as a distributed system, where groups of nodes (Elasticsearch instances) work together to form a cluster. Clusters can be set up to achieve various architectural characteristics when deployed in mission-critical environments. We will take a look at some of these in this section.
As we mentioned previously, Elasticsearch is a horizontally scalable system. Read/write throughput, as well as storage capacity, can be increased almost linearly by adding additional nodes to the Elasticsearch cluster. Adding nodes to a cluster is relatively effortless and can be done without any downtime. The cluster can automatically redistribute shards evenly across nodes (subject to shard allocation filtering rules) as the number of nodes available changes to optimize performance and improve resource utilization across nodes.
A primary shard in Elasticsearch can handle both read and write operations, while a replica shard is a read-only copy of a primary shard. By default, Elasticsearch will allocate one replica for every primary shard on different nodes in the cluster, making Elasticsearch a highly available system where requests can still be completed when one or more nodes experience failures.
If a node holding a primary shard is lost, a replica shard will be selected and promoted to become a primary shard, and a replica shard will be allocated to another node in the cluster.
If a node holding a replica shard is lost, the replica shard will simply be allocated to another node in the cluster.
Indexing and search requests can be handled seamlessly while shards are being allocated, with clients experiencing little to no downtime. Even if a search request fails, subsequent search requests will likely succeed because of this architecture.
Shard allocation on Elasticsearch can also consider node attributes to help us make more informed allocation decisions. For example, a cluster deployed in a cloud region with three availability zones can be configured so that replicas are always allocated on a different availability zone (or even a server rack in an on-premises data center) to the primary shard to protect against failures at the zone level.
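The following sketch shows the settings involved in zone-aware shard allocation; the attribute name (zone) and its value are illustrative:

```python
# Sketch of shard allocation awareness. Each node declares a custom attribute
# (set in its elasticsearch.yml), and the cluster is told to spread primary
# and replica copies across the values of that attribute.
node_settings = {
    "node.attr.zone": "us-east-1a",  # set per node, e.g. its availability zone
}

cluster_settings = {
    "persistent": {
        "cluster.routing.allocation.awareness.attributes": "zone"
    }
}
```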
Elasticsearch allows us to persistently store or snapshot data, making it a recoverable system in the event of a disaster. Snapshots can be configured to write data to a traditional filesystem or an object store such as AWS S3. Snapshots are a point-in-time copy of the data and must be taken at regular intervals, depending on your Recovery Point Objective (RPO), for an effective disaster recovery plan to be created.
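Registering a snapshot repository might look like the following sketch; the filesystem path is hypothetical, and an S3 repository would instead use a type of s3 with a bucket name:

```python
# Sketch of a snapshot repository registration body (sent via the
# _snapshot API) pointing at a shared filesystem location.
fs_repository = {
    "type": "fs",  # shared filesystem repository
    "settings": {"location": "/mnt/es-backups"},
}
```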
Elasticsearch can search for and replicate data across remote clusters to enable more sophisticated architectural patterns.
Cross-Cluster Search (CCS) is a feature that allows you to search data that resides on an external or remote Elasticsearch cluster. A single search request can be run on the local cluster, as well as on one or more remote clusters. Each cluster runs the search independently on its own data before returning a response to the coordinator node (the node handling the search request). The coordinator node then combines the results from the different clusters into a single response for the client. The local cluster does not join the remote clusters, which tolerates higher network latencies for inter-cluster communication compared to intra-cluster communication. This is useful in scenarios where multiple independent clusters in different geographic regions need to be searched together to provide a unified search capability.
The following diagram shows how Elasticsearch clusters can search across multiple clusters and combine results into a single search response for the user:
Figure 1.3 – How CCS requests are handled
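The way a CCS request addresses remote data can be sketched as follows; the cluster aliases (cluster_eu, cluster_ap) and index pattern are hypothetical:

```python
# Sketch of a cross-cluster search target: indices on remote clusters are
# addressed as "<cluster_alias>:<index_pattern>".
search_target = ",".join([
    "logs-*",             # index pattern on the local cluster
    "cluster_eu:logs-*",  # same pattern on a remote cluster registered as "cluster_eu"
    "cluster_ap:logs-*",
])
# The request would be sent to the local cluster as POST /<search_target>/_search
print(search_target)  # logs-*,cluster_eu:logs-*,cluster_ap:logs-*
```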
Cross-cluster replication (CCR) allows an index to be replicated in a local cluster to a remote cluster. CCR implements a leader/follower model, where all the changes that have been made to a leader index are replicated on the follower index. This feature allows for fast searching on the same dataset in different geographical regions by replicating data closer to where it will be consumed. CCR is also sometimes used for cross-region redundancy requirements:
Figure 1.4 – How CCR works
CCS and CCR enable more complex use cases where multiple regional clusters can be used to independently store and search for data, while also allowing unified search and geographical redundancy.
Elasticsearch offers security features to help authenticate and authorize user requests, as well as encrypting network communications to and within the cluster:
Encryption in transit: TLS can be used to encrypt inter-node communications, as well as REST API requests.

Access control: Role-Based Access Control (RBAC) or Attribute-Based Access Control (ABAC) can be used to control access to the data and functionality on Elasticsearch:

RBAC works by associating a user with a role, where a role contains a list of privileges (such as read/write/update), as well as the resources these privileges can be applied to (such as an index; for example, my-logs-1).

ABAC works by using attributes the user has (such as their location, security clearance, or job role) in conjunction with an access policy to determine what the user can do or access on the cluster. ABAC generally provides more fine-grained authorization control than RBAC.

Document security: A security role in Elasticsearch can specify what subset of data a user can access on an index. For example, an employee with a security clearance of baseline can only access documents where the value of the classification field is either UNOFFICIAL or OFFICIAL.

Field security: Elasticsearch can also control what fields a user has access to as part of a document. Building on the example in the previous point, field-level security can be used so that the user can only view fields that start with the metadata- string.

Authentication providers: In addition to local/native authentication, Elasticsearch can use external services such as Active Directory, LDAP, SAML, and Kerberos for user authentication. API key-based authentication is also available for system accounts and programmatic access.

It is also important to understand the limitations of Elasticsearch. This section describes some scenarios where Elasticsearch alone may not be the best tool for the job.
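The document security and field security examples above can be sketched as a single role definition; the index name (hr-records-*) is hypothetical:

```python
# Sketch of an Elasticsearch security role combining index privileges,
# document-level security (the "query" clause), and field-level security
# (the "field_security" clause).
baseline_role = {
    "indices": [
        {
            "names": ["hr-records-*"],
            "privileges": ["read"],
            # Document security: only UNOFFICIAL/OFFICIAL documents are visible
            "query": {
                "terms": {"classification": ["UNOFFICIAL", "OFFICIAL"]}
            },
            # Field security: only fields starting with "metadata-" are returned
            "field_security": {"grant": ["metadata-*"]},
        }
    ]
}
```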
Elasticsearch, unlike databases such as MySQL, was not designed to handle relational data. Elasticsearch allows you to have simple relationships in your data, such as parent-child and nested relationships, at a performance cost (at search time and indexing time, respectively). Data on Elasticsearch must be de-normalized (duplicating or adding redundant fields to documents, to avoid having to join data) to improve search and indexing/update performance.
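The idea of de-normalizing data can be sketched as follows; the document shapes are hypothetical:

```python
# Sketch of de-normalization: instead of joining an order to a separate
# customer record at search time, the relevant customer fields are copied
# into each order document when it is indexed.
customer = {"id": "c-42", "name": "Acme Pty Ltd", "tier": "gold"}

order = {
    "order_id": "o-1001",
    "amount": 250.0,
    # redundant copies of customer fields, duplicated for search performance
    "customer_name": customer["name"],
    "customer_tier": customer["tier"],
}
```

The trade-off is that if the customer record changes, every order document carrying the copied fields must be updated.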
If you need the database to manage relationships and enforce consistency rules across different types of linked data, as well as maintain normalized records of data, Elasticsearch may not be the right tool for the job.
Individual requests in Elasticsearch support ACID properties. However, Elasticsearch does not have the concept of transactions, so it does not offer ACID transactions.
At the individual request level, ACID properties can be achieved as follows:
Atomicity is achieved by sending a write request, which will either succeed on all active shards or fail. There is no way for the request to partially succeed.

Consistency is achieved by writing to the primary shard. Data replication happens synchronously before a success response is returned. This means that all read requests on all shards after a write request will see the same response.

Isolation is offered since concurrent writes or updates (which are deletes and writes) can be handled successfully without any interference.

Durability is achieved since once a document is written into Elasticsearch, it will persist, even in the case of a system failure. Writes on Elasticsearch are not immediately persisted onto Lucene segments on disk, as Lucene commits are relatively expensive operations. Instead, documents are written to a transaction log (referred to as a translog) and flushed to disk periodically. If a node crashes before the data is flushed, operations from the translog will be recovered into the Lucene index upon startup.

If ACID transactions are important to your use case, Elasticsearch may not be suitable for you.
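The translog behavior described under Durability can be sketched as follows (illustratively, not as Elasticsearch internals):

```python
# Illustrative sketch of the translog idea: writes are appended to a durable
# log immediately, flushed into the more expensive segment store periodically,
# and any unflushed operations are replayed from the log after a crash.
class TranslogStore:
    def __init__(self):
        self.translog = []   # append-only log, durable as soon as written
        self.segments = []   # periodically flushed "Lucene" store

    def index(self, doc):
        self.translog.append(doc)  # cheap append makes the write durable

    def flush(self):
        self.segments.extend(self.translog)  # expensive commit, done periodically
        self.translog.clear()

    def recover(self):
        """Replay any operations still in the translog after a restart."""
        self.flush()

store = TranslogStore()
store.index({"id": 1})
store.flush()
store.index({"id": 2})      # not yet flushed when the "crash" happens
store.recover()             # replayed from the translog on startup
print(len(store.segments))  # 2 -- no writes were lost
```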
Important Note
In the case of relational data or ACID transaction requirements, Elasticsearch is often used alongside a traditional RDBMS solution such as MySQL. In such architectures, the RDBMS would act as the source of truth and handle writes/updates from the application. These updates can then be replicated to Elasticsearch using tools such as Logstash for fast/relevant searches and visualization/analytics use cases.
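The replication pattern described in the note can be sketched as follows; the updated_at tracking column and row shapes are hypothetical (tools such as Logstash implement this pattern with a JDBC input that tracks the last value seen):

```python
# Sketch of incremental RDBMS-to-Elasticsearch sync: the RDBMS is the source
# of truth, and only rows changed since the last run are copied across.
def rows_to_sync(rows, last_sync_time):
    """Select rows modified since the previous sync run."""
    return [r for r in rows if r["updated_at"] > last_sync_time]

rows = [
    {"id": 1, "name": "alpha", "updated_at": 100},
    {"id": 2, "name": "beta",  "updated_at": 205},
]
print([r["id"] for r in rows_to_sync(rows, last_sync_time=150)])  # [2]
```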
With that, we have explored some of the core concepts of Elasticsearch and the role it plays in ingesting and storing our data. Now, let's look at how we can interact with the data on Elasticsearch using Kibana.
Kibana was created primarily as a visualization tool for data residing on Elasticsearch and is bundled together as part of the stack. Since its inception, Kibana has evolved to cater to use cases such as alerting, reporting, and monitoring Elastic Stack components, as well as administrating and managing the Elasticsearch cluster in use.
More importantly, Kibana provides the interface and functionality for the solutions that Elastic Stack offers, in addition to administration and management options for the core components. Functionality in Kibana is organized using applications (or apps, for short).
Apps on Kibana can be solution-specific or part of the general stack. The SIEM app, for example, powers the security solution, enabling security analysts and threat hunters to defend their organization from attacks. The APM app is another solution-specific app that, in this case, allows developers and SREs to observe their applications to look for issues or performance bottlenecks.
On the other hand, general Kibana apps such as Discover, Visualize, and Dashboard can be used to explore, interrogate, and visualize data, regardless of the solution the data enables. Ingest Manager is another example of an app that allows you to configure Elastic Agent to collect any kind of data from across an environment, agnostic of the solution the data may be used in.
Solution-specific apps on Kibana provide a great out-of-the-box user experience, as well as targeted features and functionality for the solution in question. General or stack-based apps such as Discover and Dashboard provide powerful, unified capabilities that can be used across all solutions, even custom solutions that you might build on the Elastic Stack. Kibana is usually considered a core component of the Elastic Stack and is often installed even if the cluster is not used for data analysis.
We will dive deeper into Kibana's features in Chapter 8, Interacting with Your Data on Kibana. Now, let's look at how data can be collected and ingested into Elasticsearch using Logstash and Beats.
So far, we have looked at Elasticsearch, a scalable search and analytics engine for all kinds of data, and at Kibana, the interface to Elasticsearch that helps us explore and use our data effectively. The final capability that makes it all work together is ingestion.
The Elastic Stack provides two products for ingestion, depending on your use cases.
Useful data is generated all over the place in present-day environments, often from varying technology stacks, as well as legacy and new systems. As such, it makes sense to collect data directly from, or closer to, the source system and ship it into your centralized logging or analytics platform. This is where Beats come in; Beats are lightweight applications (also referred to as agents) that can collect and ship several types of data to destinations such as Elasticsearch, Logstash, or Kafka.
Elastic offers a few types of Beats today for various use cases:
Filebeat: Collecting log data

Metricbeat: Collecting metric data

Packetbeat: Decoding and collecting network packet metadata

Heartbeat: Collecting system/service uptime and latency data

Auditbeat: Collecting OS audit data

Winlogbeat: Collecting Windows event, application, and security logs

Functionbeat: Running data collection on serverless compute infrastructure such as AWS Lambda

Beats use an open source library called libbeat that provides generic APIs for configuring inputs and destinations for data output. Beats implement the data collection functionality that's specific to the type of data (such as logs and metrics) that they collect. A range of community-developed Beats are available, in addition to the officially produced Beats agents.
Modules available in Beats allow for the collection of consistent datasets and the distribution of out-of-the-box dashboards, machine learning jobs, and alerts for users to leverage in their use cases.
One of the most important aspects of ingesting data into a centralized logging platform is paying attention to the data format in use. A Unified Data Model (UDM) is an especially useful tool to have, ensuring data can be easily consumed by end users once ingested into a logging platform. Enterprises typically follow a mixture of two approaches to ensure the log data complies with their unified data model:
Enforcing a logging standard or specification for log-producing applications in the company.

This approach is often considerably costly to implement, maintain, and scale. Changes in the log schema at the source can also have unintended downstream implications in other applications consuming the data. It is common to see UDMs evolve quite rapidly as the nature and the content of the logs that have been collected change. The use of different technology stacks or frameworks in an organization can also make it challenging to log with consistency and uniformity across the environment.
Transforming/renaming fields in incoming data using an ETL tool such as Logstash to comply with the UDM.

Organizations can achieve relatively successful results using this approach, with considerably fewer upfront costs compared to reworking logging formats and schemas. However, the approach does come with some downsides:

(a) Parsers need to be maintained and constantly updated to make sure the logs are extracted and stored correctly.