The Definitive Guide to OpenSearch - Jon Handler - E-Book

Description

From seasoned data professionals managing billions of records to aspiring analysts exploring diverse datasets, this guide is for users at all levels who want to make the most of OpenSearch's capabilities and functionalities. Written by distinguished AWS Solutions Architects Jon Handler, Ph.D., a former search engine developer, Prashant Agrawal, a search specialist, and Soujanya Konka, an expert in large-scale data migrations, this guide brings together deep technical expertise with practical, hands-on knowledge of implementing OpenSearch in real-world scenarios.
Starting with an introduction to OpenSearch, you’ll get to grips with the key features before delving into essential topics such as installing OpenSearch, ingesting data, crafting queries, visualizing results, ensuring security, and optimizing performance. Each concept is accompanied by practical examples and tutorials, allowing you to grasp the material through hands-on experience.
Keeping up with OpenSearch’s new releases and updates, this book equips you to fully leverage its potential through real-world scenarios and examples that demonstrate how OpenSearch works.
Whether enhancing your search experience or extracting insightful analytics from data, The Definitive Guide to OpenSearch provides developers, engineers, data scientists, and system administrators with the tools needed to thrive.

The e-book can be read in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 518

Year of publication: 2025




The Definitive Guide to OpenSearch

Discover advanced techniques and best practices for efficient search and analytics with OpenSearch

Jon Handler

Soujanya Konka

Prashant Agrawal

The Definitive Guide to OpenSearch

Copyright © 2025 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Portfolio Director: Sunith Shetty

Relationship Lead: Apeksha Shetty

Project Manager: Gowri Rekha

Content Engineer: Gowri Rekha

Technical Editor: Seemanjay Ameriya

Copy Editor: Safis Editing

Indexer: Hemangini Bari

Proofreader: Gowri Rekha

Production Designer: Ganesh Bhadwalkar

Growth Lead: Ankur Mulasi

First published: August 2025

Production reference: 1050825

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-83588-578-9

www.packtpub.com

This book is dedicated to the many people who use and grow OpenSearch every day.

– Jon, Soujanya, and Prashant

Foreword

The landscape of search engines and databases is large and growing, offering both classical and AI-driven approaches to querying and interacting with data, along with a whole lot of marketing hype. Within it, OpenSearch consistently stands out as a rock-solid, open, and scalable search engine capable of handling both simple and complex data workloads across enterprise, e-commerce, observability, legal, financial, and consumer-facing use cases. Built on Apache Lucene, the open source search library that has “launched a thousand search engines,” such as Apache Solr and Elasticsearch, OpenSearch stands on the shoulders of giants while leaving its own mark on the space with numerous extensions that make implementing high-quality search easier than ever. Given this capability and its place in the market, we are long overdue for a book that walks us, hand in hand, through the many different facets of taking OpenSearch from “Hello World” to “Hello Production.” The Definitive Guide to OpenSearch by Jon, Soujanya, and Prashant provides both the breadth and depth required to fill this void with a well-written and well-organized walk-through of all things OpenSearch.

In the early chapters, the three search-engine-veteran authors act as gentle guides, mixing in key background knowledge about how OpenSearch (and, really, most search engines) works by helping readers understand everything from key use cases to how to get up and running, all the while mixing in practical examples that allow readers to warm up their proverbial search legs. With the warm-up complete, readers are then given just the right amount of guidance to take them through the heart of any search application: indexing, searching, aggregations, and the visualization of results—again backed by practical code examples, diagrams, and clear, concise explanations that show them what good looks like. Finally, in the home stretch of the journey, the now-trusted guides take us through the latest in artificial intelligence capabilities in OpenSearch, as well as advanced topics on how to extend OpenSearch through plugins and effectively and efficiently migrate to OpenSearch from other engines. Last but not least, if readers simply want a managed service that does the heavy operational lifting sometimes required of search engines, the authors take us through the usage of said service and the trade-offs inherent in such a choice.

Whether you are new to search due to requirements to build an AI chatbot using Retrieval-Augmented Generation (RAG) or are a seasoned search veteran looking to upskill on a new engine, Jon, Soujanya, and Prashant bring a wealth of experience as implementers of search in the real world that shines through across the core content, the well-placed callouts, and the detailed instructions and examples.

I believe this book effectively fulfills its promise of guiding you through advanced techniques and best practices for search and analytics with OpenSearch. Beyond that, I hope it inspires your exploration of information retrieval and discovery, a field that has profoundly shaped my own career as an engineer, a start-up founder, and a CTO. I’ve always believed that search, information retrieval, and related disciplines are transformative, empowering individuals to make more informed decisions by providing swift and effective access to data, regardless of its volume or location.

Happy searching,

Grant Ingersoll

CEO & Founder of Develomentor LLC, OpenSearch Leadership Committee

Contributors

About the authors

Jon Handler is a senior principal solutions architect at Amazon Web Services (AWS) based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, e-commerce search engine. Jon holds a Bachelor of Arts from the University of Pennsylvania, and a Master of Science and a Ph.D. in computer science and artificial intelligence from Northwestern University.

First and foremost, thanks to my amazing family, Cindy, Lori, Six, and Sophia, for putting up with the long hours and late-night meetings. Thanks as well to my many mentors and co-workers at AWS who continue to dream big and make those dreams a reality. Special thanks to Asif Makhani, Mehul A. Shah, Jules Graybill, Mukul Karnik, Carl Meadows, and Mike McCandless. You all have inspired and guided me throughout my career at AWS. The OpenSearch project team has been a constant source of support and inspiration. Thanks to Grant Ingersoll for writing the foreword, Michael Froh and Muhammad Ali for their valuable reviews, and the whole Packt team, who have been amazing partners. Finally, thanks to my co-authors, Prashant and Soujanya, for helping to make this book a reality.

 

Soujanya Konka is a Senior Solutions Architect at Amazon Web Services (AWS), bringing over 17 years of expertise in analytics and cloud technologies. She specializes in Amazon OpenSearch, vector search, and agentic AI solutions, helping organizations optimize their data architecture and search capabilities. Before AWS, she spearheaded enterprise search implementations and orchestrated large-scale data warehouse migrations to cloud platforms. At AWS, Soujanya focuses on helping customers navigate big data challenges, implement cost-effective search architectures, and leverage advanced analytics at scale. Her commitment to continuous learning and innovation drives her success in tackling complex data challenges.

This book is dedicated to my pillars of strength—my parents, whose perfect blend of intelligence and determination shapes my journey; my loving husband, Vijay, for his unwavering encouragement to explore new opportunities; and my teenage son, Soureesh, whose determination to excel is beautifully balanced with his innate kindness. Thank you also to the amazing leaders I have worked with, Gaurav Sahi, Imtiaz Ali, and Muhammed Ali, for their guidance and trust in me.

 

Prashant Agrawal is a senior search specialist solutions architect at AWS based out of Seattle, bringing over 14 years of invaluable experience in search technologies and log analytics, as well as recently expanding into generative AI and vector search implementations. His deep expertise in search technologies and passion for solving complex search challenges have made him a trusted advisor in the field. As a technical leader, he collaborates closely with clients to facilitate seamless migrations and optimize OpenSearch clusters for peak performance and cost efficiency. Through this book, he shares his extensive knowledge gained from more than a decade of hands-on experience with search technologies, helping readers master OpenSearch from fundamentals to advanced implementations.

When not immersed in the world of technology, Prashant is an avid explorer who lives by the mantra “Eat -> Travel -> Repeat,” finding joy in discovering new places and experiences. His approach to both technology and life reflects his belief that every challenge is an opportunity for a rewarding adventure.

This book is dedicated to my father, who taught me resilience, wisdom, and the strength to keep going; my loving wife, Deepika, for her unwavering support; and my beautiful daughter, Ivani, whose endless curiosity inspires me every day.

About the reviewers

Ishneet Kaur Dua is a prominent advocate for sustainability in technology, specializing in the intersection of artificial intelligence (AI), generative AI, machine learning (ML), and cloud computing. As a senior solutions architect, she has made significant contributions to promoting environmentally responsible practices within the tech industry. With a robust portfolio of influential blogs focused on AI/ML and sustainability in cloud computing, Ishneet has reached over 600,000 readers, sharing insights on optimizing cloud workloads to minimize carbon footprints and designing well-architected AI systems. Her expertise has been recognized on major platforms, as she has been a featured speaker at multiple high-profile tech conferences, where she presented strategies for sustainable cloud practices and responsible AI development. Ishneet’s commitment to environmental innovation is exemplified by her organization of the “Code Green” hackathon, which attracted over 600 participants dedicated to developing solutions for environmental challenges. She has also trained hundreds of engineers and technologists through nationwide bootcamps and local sessions in the San Francisco area, focusing on eco-friendly approaches to AI and ML. Ishneet’s work goes beyond advocacy, directly helping organizations succeed with:

Cloud-native designs: She guides companies in implementing efficient, scalable architectures that leverage cloud technologies to their fullest potential while minimizing environmental impact.

ML model optimization: Ishneet provides expertise in effectively training and inferencing ML models, balancing performance with energy efficiency.

Sustainable practices: She assists organizations in integrating sustainability into their core operations, from data center management to software development practices.

AI system architecture: Her knowledge extends to designing robust, ethical, and environmentally conscious AI systems that meet both business and sustainability goals.

Ishneet’s approach emphasizes the practical application of cutting-edge technologies to combat climate change and pollution. She helps organizations understand the long-term benefits of sustainable tech practices, in terms of both environmental impact and operational efficiency and cost savings. As the tech industry increasingly prioritizes sustainability, Ishneet’s advocacy and expertise continue to inspire change and innovation in creating a greener future. Her work demonstrates that technological advancement and environmental responsibility can go hand in hand, setting a new standard for the industry.

Parth Girish Patel is an architect with extensive experience in management consulting and cloud computing, specializing in AI/ML, generative AI, sustainability, and cloud-native solutions. His background encompasses software engineering, consulting at Deloitte, and work at AWS. As an architect, he assists customers with cloud adoption and AI implementation, providing insights into scalable architectures and ML solutions. He is proficient in AWS, Azure, GCP, and various ML skills.

Parth is passionate about AI, including sustainable and ethical AI, and AI-enabled services, emphasizing transparency. He also mentors teams and individuals.

Michael Froh is an OpenSearch maintainer and an Apache Lucene committer. He has worked on Lucene-based distributed search systems since 2011, with a focus on OpenSearch since 2022. He writes lessons on Lucene under his lucene-university repository on GitHub and teaches OpenSearch internals on YouTube.

Muhammad Ali is a principal OpenSearch specialist solutions architect at AWS, with over 20 years of experience in content management and information retrieval. He has helped AWS customers design scalable solutions serving hundreds of millions of users. His expertise spans distributed data systems, internet-scale applications, and advanced information retrieval. He is passionate about generative and agentic AI and is excited to see information retrieval become a central challenge across a growing range of applications.

Table of Contents

The Definitive Guide to OpenSearch

Foreword

Contributors

About the authors

About the reviewers

Preface

Who this book is for

What this book covers

To get the most out of this book

Conventions used

Share your thoughts

Get in touch

Your Book Comes with Exclusive Perks — Here's How to Unlock Them

Unlock this book’s exclusive benefits now

Step 1

Step 2

Step 3

Part 1: Getting Started with OpenSearch: Fundamentals and Deployment

1

Overview of OpenSearch

Introducing OpenSearch and its evolution journey

Evolution of OpenSearch

Understanding the core capabilities of OpenSearch

Distributed database

Lexical search

Semantic search with vector embeddings

Log analytics

Real-world examples and use cases

Revolutionizing e-commerce search with OpenSearch

Transformative search: a fashionable journey with Iva

Maximizing operational efficiency with OpenSearch log analytics and observability

Hello OpenSearch

Summary

Join our community on Discord

2

Installing and Configuring OpenSearch

Understanding key terminology

Nodes basics

Cluster basics: the backbone of OpenSearch

Index insights: organizing data in OpenSearch

How shards work

How segments work

System requirements and compatibility

Operating system compatibility matrix

Java compatibility matrix

Network configuration

Recommended filesystem setup for better performance

Installation guide for OpenSearch

Using a tarball

Using Docker

Setting up OpenSearch Dashboards

Using a tarball (locally)

Using Docker

Setting the foundation for advanced cluster configuration

OpenSearch cluster settings: static and dynamic

OpenSearch Dashboards settings

Security considerations and setup

Introducing authentication and authorization

Initial exploration of OpenSearch functionalities

Indexing Iva’s fashionable finds

Searching for fashion inspiration

Summary

3

Deployment Options: Amazon OpenSearch Service and Amazon OpenSearch Serverless

Introduction to Amazon OpenSearch Service

Architecture and components

Key features

Infrastructure of Amazon OpenSearch Service Domains

Managing Amazon OpenSearch Service Domains

Rightsizing

Scaling Amazon OpenSearch Service Domains

Snapshots in Amazon OpenSearch Service Managed Clusters

Storage management

Amazon OpenSearch Serverless

Creating and managing Amazon OpenSearch Serverless collections

Ingesting data into Amazon OpenSearch Serverless collections

Security in Amazon OpenSearch Serverless

Supported operations and plugins in Amazon OpenSearch Serverless

Monitoring Amazon OpenSearch Serverless

Choosing between OpenSearch Service-managed clusters and OpenSearch Serverless

OpenSearch hosting partners

Summary

Join our community on Discord

Part 2: Data Management and Discovery: Indexing, Querying, and Visualization

4

Indexing Data

Technical requirements

Overview of indexing

Hands-on: Connecting to OpenSearch Dashboards

Creating an index

The _bulk API

Mapping your data

Creating your index via an API

Understanding index settings

Diving into mappings

Mapping types

String mapping types

Advanced mapping types

Summary

5

Searching: Core APIs

Technical requirements

Query processing

Matching

Merging

Scoring and sorting

Fetching

Hands-on: loading data

OpenSearch’s query API and supported languages

Format of a Query DSL query

match_all: the most basic query

Pagination

Leaf queries

Text queries

Term queries

Highlighting in OpenSearch queries

Completions and suggestions

Search templates

Summary

Join our community on Discord

6

Advanced Querying

Technical requirements

Compound queries and filters

bool queries

Geospatial queries and aggregations

Faceted search

Query percolation

How to run the profile API

Summary

7

Analyze and Visualize OpenSearch Data

Technical requirements

Introduction to Dashboards

Management

OpenSearch Plugins

Types of aggregation

Metric aggregations

Bucket aggregations

Nested aggregations

Pipeline aggregations

Visualizations

Total bytes over time

Traffic

Traffic by country

Error codes by request

Traffic flows

Logging and Observability

Step 1: The Management section

Step 2: The Observability section

Step 3: The Observability Plugins section

Specialized query languages

Low-cost logging and observability

Key features of Flint indexing

OpenSearch Assistant for Dashboards

Best practices for log workloads with OpenSearch

Additional resources and references

Configuring Amazon S3 as a data source with OpenSearch

Summary

Join our community on Discord

Part 3: Extending OpenSearch: Plugins, AI Integration, and Application Development

8

Introduction to OpenSearch Plugins

Built-in plugins and custom plugins

Key OpenSearch plugins and their functions

OpenSearch SQL plugin

OpenSearch Job Scheduler plugin

OpenSearch Alerting plugin

OpenSearch Index State Management (ISM) plugin

Security plugin

Security Analytics plugin

KNN plugin

Neural Search plugin

Learning to Rank (LTR) plugin

Installing and managing plugins

Installing plugins

Managing plugins

Building your own plugins

Advanced plugin architecture

Plugin lifecycle management

Dependency injection

Custom plugin development

Do you want to develop a plugin?

Plugin best practices

The future of OpenSearch plugins

Summary

References

9

OpenSearch in Action: Making Apps Awesome

Meet Iva — a developer on a mission to build a smarter movie search app

API-driven development — making your app talk to OpenSearch

Understanding OpenSearch APIs

Setting up a Python virtual environment

Connecting to OpenSearch using Python

Testing API queries before UI integration

Autocomplete and fuzzy search — making search more user-friendly

Implementing autocomplete

Implementing fuzzy search to handle typos

Combining autocomplete and fuzzy search

Filtering and faceted search — giving users more control

Filtering by genre

Filtering by release year

Applying multiple filters together

Implementing faceted search for dynamic filtering

Bringing it all together in a UI

Setting up Streamlit

Connecting Streamlit to OpenSearch

Implementing the search bar with autocomplete

Performing the search and displaying results

Adding filters for genre and release year

Running the app

Summary

Join our community on Discord

10

OpenSearch Vectors and Generative AI

Technical requirements

Vectorization of data

Dense vectors

Sparse vectors

Semantic search

ML Commons and ML models

Exact K-Nearest-Neighbor

Approximate nearest neighbor

Sparse vectors and hybrid search

Generative AI (gen AI) architectures and components

Summary

11

Migrate to OpenSearch

Why OpenSearch?

Open source, community-driven, and vendor-neutral

Familiar APIs and an easy transition path

Expanding ecosystem and tooling

From Apache Solr (enterprise search)

From Algolia (Search as a Service)

From Splunk (logs and SIEM)

From Elasticsearch (all use cases)

From Amazon CloudSearch (Search as a Service)

Stages of migration

Planning

Proof of concept (POC)

Set up a test cluster

Compatibility testing

Performance and scalability testing

Deploy

Deploy in phases

Migrate data and indexes

Cutover to OpenSearch

Continuous monitoring and optimization

Patterns for minimal or no-downtime migration

Dual-write pattern: migrating e-commerce search without losing a beat

Shadow read pattern: matching relevance in travel search

Blue-green deployment: taking control of logging from Splunk

Canary deployment: search reinvented for a national news site

Cold data replay: compliance-first migration for a fintech company

OpenSearch Migration Assistant

Pre-migration checks and metadata review

Traffic replay for near zero-downtime testing

Historical data migration

Migration management console

Deployment options

How teams are moving to OpenSearch – without missing a beat

AnyMovie’s search migration and modernization

Final outcome

AnyLog’s live logs migration—a cleaner path to observability

Outcome

Summary

Join our community on Discord

Part 4: Securing and Optimizing OpenSearch: Administration Best Practices

12

Security in OpenSearch

OpenSearch’s security framework and components

The core components of security

Authentication and authorization mechanisms

Multi-tenant security architecture

Auditing and compliance

Summary

13

Monitoring, Backup, and Recovery

Monitoring an OpenSearch domain

Monitoring tools

Key metrics and dashboards for monitoring

Dashboards for monitoring and alarms

Admission control and backpressure mechanisms

Admission control

Backpressure mechanisms

Troubleshooting and scaling

Performance tuning

Backup strategies for data resilience

Disaster recovery architecture in OpenSearch

Post-recovery validation

Summary

Join our community on Discord

14

Scaling and Performance Optimization

Understanding OpenSearch as a distributed system

OpenSearch distributed architecture

Amazon OpenSearch Service

Amazon OpenSearch Serverless

Data lifecycle in OpenSearch

OpenSearch request processing

Threads and queues

Strategies for sizing your cluster

Storage

RAM

CPU

Shards and networking

Completing the examples

Search

Logs

Vectors

Optimizing OpenSearch clusters for high performance

Running a POC

Tenancy

Shard skew

Scaling per node resources

Summary

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share your thoughts

Index

Preface

OpenSearch is a “Swiss Army knife” that touches diverse use cases spanning application features, operations, and generative AI. If there’s one unifying theme of the software, it is that it enables storing and retrieving data to support intelligent decision-making. It’s a database, but it’s a funny kind of database that emphasizes speed and volume processing over consistency. It’s a logs store, but a funny kind of logs store that emphasizes aggregations and log-line search. It’s a data source for generative AI, but it’s a funny kind of data source that brings rich search to the retrieval of information for prompts. In all these cases, OpenSearch provides high-volume request processing and intelligent retrieval of data.

In this book, you’ll learn in depth the capabilities of OpenSearch, how and when to apply them, and where you can get the most benefits. You’ll also learn about Amazon OpenSearch Service, its managed clusters and serverless deployment options, and how to get the most out of your OpenSearch Service domain or OpenSearch Serverless collection.

We’ll begin with introductory chapters that give you a history and overview of OpenSearch and show you how to deploy OpenSearch and how to use OpenSearch Service. We’ll then dive deep into OpenSearch’s core capabilities—indexing and querying data and building aggregations and visualizations. We’ll cover OpenSearch’s large collection of plugins that deliver additional features, such as Structured Query Language (SQL), alerting, and k-nearest neighbor search. We’ll dive deep into application-building and delivering AI-powered applications with generative AI. We will then move on to operational topics, including migrations, security, monitoring, backups, and recovery. We will round out the book with a deep dive on scaling and performance optimization.

In writing this book, we wanted to distill our years of experience and thousands of hours of customer interaction for you. We wish you every success, and happy OpenSearching!

Who this book is for

This book is for developers, operators, and DevOps engineers who want to add or modernize search for their applications, and who want to monitor those applications for uptime and diagnose and remediate errors. Experience with Amazon Web Services, the Python programming language, Docker, and Kubernetes will be helpful but is not necessary.

What this book covers

Chapter 1, Overview of OpenSearch, covers OpenSearch’s history, its core capabilities, and the main use cases for OpenSearch, with real-world examples. It also introduces the topic of operational efficiency.

Chapter 2, Installing and Configuring OpenSearch, gives an overview of OpenSearch distributed system basics. It guides you through deploying OpenSearch via tarball and Docker, and covers OpenSearch Dashboards and the basics of securing your cluster.

Chapter 3, Deployment Options: Amazon OpenSearch Service and Amazon OpenSearch Serverless, guides you through deploying and running OpenSearch in the Amazon Web Services cloud, using Amazon OpenSearch Service, and operational basics such as scaling, storage management, and security.

Chapter 4, Indexing Data, details how to create and maintain OpenSearch indexes, including creating indexes, index settings, setting a mapping, different mapping types, and mapping templates.

Chapter 5, Searching: Core APIs, explains query processing in OpenSearch, leaf queries, hit highlighting, search suggestions, and search templates.

Chapter 6, Advanced Querying, covers OpenSearch’s query APIs in depth, as well as compound queries, geospatial queries, faceted search, query percolation, and query performance and profiling.

Chapter 7, Analyze and Visualize OpenSearch Data, dives into aggregations, OpenSearch Dashboards, dashboards and visualizations, working with time-series data such as logs, and the Observability plugin.

Chapter 8, Introduction to OpenSearch Plugins, covers the key OpenSearch plugins, including SQL, alerting, security analytics, k-nearest neighbor, and the Neural plugin. It then details how to install, manage, and build your own plugins for OpenSearch.

Chapter 9, OpenSearch in Action: Making Apps Awesome, moves from the theoretical to the practical, integrating the topics covered to help you bring the power of OpenSearch to your application with faceted search, autocompletion, and connections to OpenSearch’s APIs from your application. It brings everything together in a Streamlit application.

Chapter 10, OpenSearch Vectors and Generative AI, provides a theoretical foundation on dense vectors, sparse vectors, and the large language models that produce them. It goes into depth on exact and approximate k-nearest neighbor search, with the algorithms and engines OpenSearch provides, closing with a generative AI example.

Chapter 11, Migrate to OpenSearch, guides you through why, whether, and how to migrate from other search solutions, including planning for your migration, executing a proof of concept, deploying your target, and moving data and traffic with and without OpenSearch Migration Assistant. It closes with two examples of migrations.

Chapter 12, Security in OpenSearch, explains OpenSearch’s security features and guides you in using them to best effect to secure your data and cluster.

Chapter 13, Monitoring, Backup, and Recovery, enters the world of operations to help you use Amazon OpenSearch Service managed clusters efficiently. It covers the metrics that the service generates, how to monitor them, and how best to respond to issues with troubleshooting and backups.

Chapter 14, Scaling and Performance Optimization, explains OpenSearch as a distributed system and walks through the core resources your cluster provides and how OpenSearch maps your workload onto those resources. It finishes with best practices to optimize your cluster infrastructure for maximum efficiency.

To get the most out of this book

Some of the code examples provided are in Python. A working knowledge of the language, and a working Python installation for your system, will allow you to run those examples.

Some knowledge of distributed systems and other database systems will help you follow the discussion.

Knowledge of Amazon Web Services, Amazon Elastic Compute Cloud, and Docker will enable you to more easily deploy OpenSearch for the examples.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter/X handles. For example: “The _bulk API reduces overhead.”

A block of code is set as follows:

POST _bulk
{ "create": { "_index": "first_index", "_id": "2" } }
{ "an_integer_field": 23456, "a_string_field": "the quick brown fox" }
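The _bulk body is newline-delimited JSON (NDJSON): each operation is an action line followed, for operations that carry one, by a document line, and the whole body must end with a newline. As a minimal sketch in Python (reusing the index and field names from the example above; the standard json module is all that is needed), such a payload can be assembled like this:

```python
import json

# Each bulk operation is a pair of NDJSON lines: the action, then the document.
operations = [
    ({"create": {"_index": "first_index", "_id": "2"}},
     {"an_integer_field": 23456, "a_string_field": "the quick brown fox"}),
]

lines = []
for action, document in operations:
    lines.append(json.dumps(action))
    lines.append(json.dumps(document))

# The _bulk API requires a trailing newline after the last line.
bulk_body = "\n".join(lines) + "\n"
print(bulk_body)
```

Sending this string as the request body (with a Content-Type of application/x-ndjson) is what a client library does on your behalf when you call its bulk helper.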

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

PUT index_with_mapping
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "an_integer_field": { "type": "integer" },
      "a_string_field": { "type": "text" }
    }
  }
}
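Setting "dynamic": "strict" tells OpenSearch to reject any document that contains a field not declared in the mapping, rather than adding the field dynamically. A minimal local sketch of that behavior (the validate_strict helper is illustrative only, not an OpenSearch API; the property names mirror the mapping above):

```python
# The declared properties from the mapping above.
properties = {
    "an_integer_field": {"type": "integer"},
    "a_string_field": {"type": "text"},
}

def validate_strict(document, properties):
    """Mimic strict dynamic mapping: fail on any undeclared field."""
    unknown = set(document) - set(properties)
    if unknown:
        raise ValueError(
            "strict_dynamic_mapping_exception: unknown fields "
            + ", ".join(sorted(unknown))
        )
    return True

# A document that matches the mapping passes...
validate_strict({"an_integer_field": 23456, "a_string_field": "fox"}, properties)

# ...while an undeclared field raises, just as OpenSearch rejects the write.
try:
    validate_strict({"an_unmapped_field": True}, properties)
except ValueError as err:
    print(err)
```

The alternative settings, "dynamic": "true" (add new fields with guessed types) and "dynamic": "false" (store but don't index new fields), trade that safety for flexibility.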

Bold: Indicates a new term, an important word, or words that you see on the screen. For instance, words in menus or dialog boxes appear in the text like this. For example: “Select Dev Tools from the left navigation panel.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book or have any general feedback, please email us at [email protected] and mention the book’s title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you reported this to us. Please visit http://www.packt.com/submit-errata, and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packt.com/.

Share your thoughts

Once you’ve read The Definitive Guide to OpenSearch, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Your Book Comes with Exclusive Perks — Here's How to Unlock Them

Unlock this book’s exclusive benefits now

Scan this QR code or go to packtpub.com/unlock, then search this book by name. Ensure it’s the correct edition.

Note: Keep your purchase invoice ready before you start.

Figure X.1: Next-Gen Reader, AI Assistant (Beta), and Free PDF access

Enhanced reading experience with our Next-gen Reader:

Multi-device progress sync: Learn from any device with seamless progress sync.

Highlighting and Notetaking: Turn your reading into lasting knowledge.

Bookmarking: Revisit your most important learnings anytime.

Dark mode: Focus with minimal eye strain by switching to dark or sepia modes.

Learn smarter using our AI assistant (Beta):

Summarize it: Summarize key sections or an entire chapter.

AI code explainers: In Packt Reader, click the “Explain” button above each code block for AI-powered code explanations.

Note: AI Assistant is part of next-gen Packt Reader and is still in beta.

Learn anytime, anywhere:

Access your content offline with DRM-free PDF and ePub versions—compatible with your favorite e-readers.

Unlock Your Book’s Exclusive Benefits

Your copy of this book comes with the following exclusive benefits:

Next-gen Packt Reader

AI assistant (beta)

DRM-free PDF/ePub downloads

Use the following guide to unlock them if you haven’t already. The process takes just a few minutes and needs to be done only once.

How to unlock these benefits in three easy steps

Step 1

Have your purchase invoice for this book ready, as you’ll need it in Step 3. If you received a physical invoice, scan it on your phone and have it ready as either a PDF, JPG, or PNG.

For more help on finding your invoice, visit https://www.packtpub.com/unlock-benefits/help.

Note

Did you buy this book directly from Packt? You don’t need an invoice. After completing Step 2, you can jump straight to your exclusive content.

Step 2

Scan this QR code or go to packtpub.com/unlock.

On the page that opens (which will look similar to Figure X.1 if you’re on desktop), search for this book by name. Make sure you select the correct edition.

Figure X.1: Packt unlock landing page on desktop

Step 3

Sign in to your Packt account or create a new one for free. Once you’re logged in, upload your invoice. It can be in PDF, PNG, or JPG format and must be no larger than 10 MB. Follow the rest of the instructions on the screen to complete the process.

Need help?

If you get stuck and need help, visit https://www.packtpub.com/unlock-benefits/help for a detailed FAQ on how to find your invoices and more. The following QR code will take you to the help page directly.

Note

If you are still facing issues, reach out to [email protected].

Part 1: Getting Started with OpenSearch: Fundamentals and Deployment

In this first part of the book, you’ll develop a solid foundation in OpenSearch and its fundamental architecture. We’ll explore OpenSearch’s evolution from its origins, understand its core capabilities in organizing and searching data, and see how it transforms real-world applications across industries. You’ll gain hands-on experience with installation, configuration, and security implementations across various platforms, along with understanding different deployment strategies, including the AWS OpenSearch Service and Serverless solutions. By the end of this part, you’ll have the practical knowledge and skills needed to deploy, manage, and troubleshoot OpenSearch implementations with confidence, preparing you for more advanced topics in subsequent sections.

This part of the book includes the following chapters:

Chapter 1, Overview of OpenSearch
Chapter 2, Installing and Configuring OpenSearch
Chapter 3, Deployment Options: Amazon OpenSearch Service and Amazon OpenSearch Serverless

1

Overview of OpenSearch

Welcome to OpenSearch! This book aims to be a comprehensive guide to OpenSearch for both beginners and more experienced users. With the adoption of OpenSearch growing rapidly, this book comes at an opportune time to help developers, engineers, data scientists, and system administrators leverage this powerful open source search and analytics engine to build robust search experiences and gain insightful analytics on data.

In this chapter, you’ll learn about OpenSearch’s story, its main features, and how it’s used in real life. You’ll start by understanding where OpenSearch came from and how it grew over time, as a community project. Then, you’ll explore the important things OpenSearch can do, such as organizing information in a special way, handling a lot of data, and being really good at returning relevant results for your queries. This chapter also shares stories about how people are using OpenSearch in everyday situations, such as finding products online, keeping track of computer system health, and suggesting interesting content for us. You’ll get to know a helpful tool that can do a lot of cool things!

We will cover the following main topics in this chapter:

Introducing OpenSearch and its evolution journey
Understanding the core capabilities of OpenSearch
Real-world examples and use cases
Revolutionizing e-commerce search with OpenSearch
Maximizing operational efficiency with OpenSearch log analytics and observability

Introducing OpenSearch and its evolution journey

In this section, we will dive into the history of search engines to understand their role in day-to-day activities.

Search engines came to the fore in the early 1990s, along with the advent of the web. The dawn of the World Wide Web brought with it the democratization of publishing—people could write and create available information, opinions, scientific data, and more. Hyperlinking made reading interactive within and across sites. If you knew where to look, you could dig in and learn anything!

Knowing where to look for a keyword/string or an item was a challenge. Web search engines such as AltaVista, Ask Jeeves, Yahoo!, and Google aimed to catalog web pages and make them available in response to user queries. These web search engines indexed information on web pages and replied to user queries with lists of uniform resource locators (URLs). Responses to a query were ranked according to a relevance measure that drew on information about the collection of documents.

Search is an integral part of our daily lives. When we need to find information, products, places, or even people, we often rely on search engines and search functionality in apps and websites.

Whether searching online, looking up contacts, or finding files, effective search is vital. It’s estimated that humanity produced 2.5 quintillion bytes of data daily in 2020. As data volumes grow exponentially each year, developing scalable search techniques has become crucial.

Let’s take a closer look at a few examples from our daily routine. Searching on Google has become synonymous with looking something up. Whether you need quick facts, locations of nearby restaurants, directions to a destination, or answers to random questions, Google’s search engine provides relevant information from its vast index of web pages. Google is so ubiquitous that Googling has become a commonly used verb for performing a web search.

Searching also plays a major role in online shopping. E-commerce sites and apps such as Amazon rely heavily on their search features to help users find products among their massive digital inventories. Consumers can search by product name, category, brand, price range, and other filters to narrow down options. The more relevant the search results, the more likely shoppers will find and purchase what they need.

Social media apps such as Facebook and Instagram also use search to connect users with the people, groups, photos, and updates they seek. Using the search bar, users can access content directly from friends and public figures based on usernames, hashtags, or keywords. That’s easier than sorting through a chronological news feed—you can locate specific posts, images, and profiles so much faster!

Whether you’re searching the boundless depths of the internet or your personal collections of messages, photos, and notes, you use search to power your ability to access relevant information and connections. The many search innovations we use each day have transformed how we consume knowledge and interact with the digital world. While search may seem like a basic utility, its significance and underlying technology are complex and constantly evolving.

Techniques such as lexical indexing, neural networks, and ranking algorithms enable robust searches across massive databases. Lexical methods allow you to narrow results quickly through metadata tagging, where descriptive labels are systematically applied to content for efficient categorization and retrieval. More advanced neural models parse contextual meaning from text and speech, enabling flexible natural language searches. Ranking algorithms also process signals to determine the optimal results order based on past interactions. Combining these paradigms allows modern applications to handle imprecise searches across huge datasets. Even a short text or voice query undergoes complex retrieval processes nearly instantly. The same capabilities allow databases and content systems to ingest new content and make it discoverable through intelligent indexing. As reliance increases on mobile apps, web services, business analytics, and personalized recommendations, the need for sophisticated search also increases. Thankfully, machine learning provides increasingly powerful tools to make sense of exponentially expanding seas of data. Maintaining accessibility and relevance is crucial for realizing the value of low-latency, accurate search responses.

Now that we have the background of search, let’s iron out an understanding of a typical search architecture and what it takes to build one.

Figure 1.1: Component architecture for search

A typical search architecture consists of several key components working together to enable efficient data ingestion, indexing, and retrieval.

The crawler (collection agent) is responsible for discovering and fetching data from various sources, such as web pages or enterprise documents; popular examples include Apache Nutch and Googlebot. Next, the buffering/transform component processes and normalizes raw data, handling tasks such as filtering, enrichment, or format conversion; tools such as Apache Kafka or the OpenSearch project's Data Prepper are often used here. The catalog component maintains metadata about the datasets and indexes, organizing and tracking data schemas and versions—this role can be fulfilled by systems such as AWS Glue Data Catalog or Apache Hive Metastore. The indexing module takes processed data and builds optimized indexes for fast querying, exemplified by OpenSearch. Finally, the search engine executes user queries against these indexes, ranks results, and returns relevant data; OpenSearch, Elasticsearch, and Solr are prominent examples. Together, these components form a scalable, fault-tolerant pipeline that transforms raw data into actionable search results.

Supplementary data stores are additional databases such as blob stores (S3), data warehouses (Redshift), Hadoop (HDFS), and so on that complement search capabilities and workloads. The key principles are separating ingestion, storage, indexing, metadata, and search query functionalities for scalability and flexibility. Components communicate through APIs and standardized interfaces. Common technologies such as OpenSearch, Elasticsearch, and Solr provide indexing, storage, and search capabilities in a single integrated platform, while custom search infrastructures may combine specialized tools for each component of the search architecture.

For example, a modern e-commerce search system might use OpenSearch as its core search engine, with product catalog data indexed for fast full-text search. Images and videos would be stored in S3, with their metadata and descriptors in OpenSearch for searchability. User behavior data might flow through Kafka to both OpenSearch (for real-time personalization) and Redshift (for deeper analytics). The search API layer would orchestrate queries across these systems, perhaps using OpenSearch for text search while simultaneously querying a recommendation service built on user behavior data. This modular architecture allows each component to scale independently and enables specialized optimizations for different types of content and search patterns.

Evolution of OpenSearch

OpenSearch originated from Elasticsearch, an open source search and analytics engine first released in 2010 under the Apache v2 license. Elasticsearch quickly gained popularity for its ability to combine speed and scalability with a developer-friendly API. In January 2021, Amazon Web Services forked the Apache License V2-licensed version of Elasticsearch, 7.10.2, to create the OpenSearch project. In July of 2021, Amazon made OpenSearch V1 generally available. OpenSearch is a fully open source, community-driven fork; Amazon’s goal was to maintain a free and open alternative for users wanting to use and help grow the engine’s capabilities. Since its launch, OpenSearch has seen strong adoption and active open source community contributions. The project focuses on providing production-ready search, analytics, and vector database capabilities, with enterprise security, alerting, and reporting features. In September 2024, Amazon, along with premier members SAP and Uber and 12 general members, created the OpenSearch Software Foundation (https://opensearch.org/foundation), under the Linux Foundation, providing open governance and a vendor-neutral home for the future growth of the project.

Now that you know a bit about OpenSearch’s past, let’s find out what it offers by exploring its core capabilities.

Understanding the core capabilities of OpenSearch

In this section, you will explore the core capabilities of OpenSearch from a builder lens, focusing on service components of the project.

The OpenSearch Project (https://github.com/opensearch-project) is a collection of sub-projects comprising a software suite, composed of OpenSearch, which is a representational state transfer (REST) API-driven, distributed system; OpenSearch Dashboards, which is a graphical user interface (GUI) frontend; and about 40 plugins that deliver advanced capabilities such as anomaly detection, fine-grained access control, alerting, SQL query support, vector storage and retrieval, and much more.

Distributed database

OpenSearch is in the database family of technologies. Like other databases, OpenSearch stores and organizes information for its efficient retrieval. OpenSearch provides a REST API that supports create, read, update, delete (CRUD) operations, along with a host of administrative operations. You send your data to OpenSearch via its indexing APIs, and you query OpenSearch via its search APIs. OpenSearch stores your data in indexes that you create and manage.

OpenSearch and search engines more generally were first designed to work as an adjunct to a relational database management system (RDBMS). Relational databases provide important features that make them durable, predictable, and compositional. Relational databases have ACID (which stands for atomicity, consistency, isolation, and durability) properties:

Atomic operations in a transaction succeed or fail together
Consistency helps guarantee consistent reads
Isolation means that operations don’t interfere with one another
Durability ensures that once operations succeed, they are persisted

In addition, a relational database provides the ability to design a data representation with foreign keys that link tables together, and to run queries that join across these relational keys.

In order to maintain ACID properties, relational databases can be limited in their ability to scale to high volumes of traffic. Search engines don’t guarantee ACID properties to provide higher scale and lower latency for query processing. OpenSearch provides eventual consistency, no atomic operations, and no transactions. OpenSearch works best when its data is stored in a flat (denormalized) form. In exchange for these limitations, OpenSearch can scale to process 100,000 operations per second, or more, across terabytes, or even petabytes, of data. It’s common to see average, server-side processing times in the single-digit milliseconds, with 90th-percentile (p90) latencies in the 20–30 ms range.

OpenSearch is a distributed database. You deploy software in a cluster, across a set of nodes deployed on containers or servers. OpenSearch nodes provide resources that OpenSearch consumes to store and process queries for your data. Each node has one or many responsibilities, as listed here:

cluster_manager: Maintains the state of the cluster—the indices you have deployed and the locations of shards of those indices—and manages cluster-wide operations such as creating new indices, taking cluster backups, and allocating shards.
data: Provides storage, compute, and RAM. Nodes with the data role directly process your indexing and search requests.
coordinating_only: Like a front desk that receives client requests and directs them to the right places in the cluster. It doesn’t store data itself, but knows where to find everything and collects all the results before sending a complete answer back to the client.
ingest: Provides resources to handle special processing of indexing requests, such as processing PDF files.
search (referred to as warm from OpenSearch 3.0): Supports searching snapshots by providing caching for remote store data.
ml: Provides dedicated resources to host ML Commons. The ML Commons plugin provides a set of REST APIs for machine learning (ML) features in OpenSearch.
remote_cluster_client: Provides resources to connect clusters.

In most clusters, you will use the data and cluster_manager roles. For higher scale, you will likely use coordinating_only nodes. Finally, for AI/ML use cases where you want to host ML models within your cluster, you will use ml nodes.
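Roles are assigned per node through the node.roles setting in opensearch.yml. The following sketch shows illustrative configurations for three node types (only one uncommented setting applies per node, of course):

```yaml
# opensearch.yml sketches for three node types (illustrative values)

# Dedicated cluster manager node:
node.roles: [ cluster_manager ]

# Data node that also runs ingest pipelines:
# node.roles: [ data, ingest ]

# Coordinating-only node (an empty list assigns no specialized role):
# node.roles: [ ]
```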

By relaxing ACID constraints and providing a means to scale horizontally, search engine designers were able to provide a system that can store and query against much more data. They can also retrieve results with much higher throughput and at much lower latency than relational systems. OpenSearch clusters, in production today, are serving 100,000 or more queries per second at average latencies in single-digit milliseconds.

Lexical search

OpenSearch’s first use case is the one that most clearly defines it. OpenSearch makes it possible to retrieve results for text queries that match against stored blocks of text. For instance, e-commerce websites that sell products can take words that you type in the search box and match them against words in product titles and descriptions. Or, if you have a large repository of documents such as wiki pages, PDFs, Microsoft Word files, and so on, OpenSearch can match a text query to all of your documents and bring you the ones that best match what you had in mind when you started searching.

In basic terms, OpenSearch can import structured data from other database systems and then perform searches to find matching information within that data. A relational database stores information as rows in a table. Each row contains a set of typed columns, supporting numeric, text, binary, and other types of fields. OpenSearch indices are analogous to database tables, search documents are analogous to table rows, and fields in search documents are analogous to database columns. OpenSearch queries can specify exact matching of values to field values in the same way that databases provide direct matching for cell values. In fact, one common use of OpenSearch is to offload query traffic from your database: you copy the database’s information into OpenSearch to take advantage of its low-latency, high-throughput query processing. This can substantially reduce the load on your database and reduce query processing time from seconds/minutes to milliseconds/seconds.
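To make the table/row/column analogy concrete, here is a minimal sketch in the Dev Tools request style used throughout this book; the products index and its fields are invented for illustration:

```
PUT products/_doc/1
{ "title": "quarter turn angle stop valve", "price": 8.49 }

GET products/_search
{ "query": { "match": { "title": "angle valve" } } }
```

The first request indexes a document (the "row"), and the second searches the title field (a "column") for matching terms.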

OpenSearch goes further than exact matching by providing tools to work with large blocks of text (free-text). Lemmatization is the process of taking free text, parsing it into single words (technically, terms) for matching, applying natural language rules to terms, and storing them to support matching. Lemmatization reduces words to their core meaning (such as changing running to run), helping OpenSearch better match what users are searching for with the actual content in the database, even when words appear in different forms. For example, lemmatizing the quick brown fox jumped over the lazy dog produces quick brown fox jump over lazi dog.

This enables matching queries such as jumping foxes (lemmatized as jump fox), the fox’s laziness (lemmatized as fox lazi), and Lazy Foxes and Dogs (lemmatized as lazi fox dog). In all of these cases, OpenSearch’s English analyzer produces source-text and query terms that preserve meaning but broaden the possibility of matching beyond exact string-to-string equality (you’ll learn much more about lemmatization in Chapter 4).
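A toy sketch of this analysis chain in Python may help. The suffix rules below are invented stand-ins for a real stemmer (OpenSearch’s English analyzer uses Porter-style stemming), just enough to reproduce the example above:

```python
# Toy text analysis: lowercase, drop stop words, reduce terms to stems.
# The suffix rules are simplified stand-ins for a real Porter stemmer.
STOP_WORDS = {"a", "an", "the"}

def stem(term):
    if term.endswith("ing") and len(term) > 5:
        term = term[:-3]
        if term[-1] == term[-2]:       # running -> runn -> run
            term = term[:-1]
    elif term.endswith("ed") and len(term) > 4:
        term = term[:-2]               # jumped -> jump
    if term.endswith("y"):
        term = term[:-1] + "i"         # lazy -> lazi
    return term

def analyze(text):
    terms = text.lower().split()
    return [stem(t) for t in terms if t not in STOP_WORDS]

print(analyze("the quick brown fox jumped over the lazy dog"))
# -> ['quick', 'brown', 'fox', 'jump', 'over', 'lazi', 'dog']
```

Both the indexed documents and the incoming query pass through the same analyzer, which is why jumping foxes can match jumped over the lazy dog.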

That brings us to an urgent question: matching more text also means matching text that could be further away in meaning, so how does OpenSearch produce matches that are good? A good match is one that answers the question posed by the query in a way that matches the terms, directly and indirectly, based on what they mean. Relevance is the relationship of each search result to the query that produced it. OpenSearch uses a scoring function called TF/IDF (Okapi BM25: https://en.wikipedia.org/wiki/Okapi_BM25, more precisely) to produce a score for each matching piece of text, and sorts its results based on that score.

The key idea of BM25 is to take the value of a term, represented by its commonality across all terms, and multiply it by the number of times that term occurs in the document. Rare terms score highly, since they likely provide the most relevant and discriminating power for that query. Common terms don’t provide much information, so they receive lower scores. The most common terms (such as a, an, and the) are typically removed with stop word removal, since matching them provides little to no value. Sorting by BM25 score has been the predominant choice in search engines for most of their history. Recently, advances in natural language processing have brought a new scoring methodology to the fore. Using large language models (LLMs) to produce vector embeddings has brought new tools into play.
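The scoring idea above can be sketched in a few lines of Python. This is a simplified rendering of the Okapi BM25 formula; real implementations differ in details such as IDF flooring and analysis:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document against a query with a simplified Okapi BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)              # document frequency
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # rare terms score high
        tf = doc_terms.count(term)                            # term frequency in this doc
        # Length normalization dampens the effect of long documents.
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avg_len))
        score += idf * norm
    return score

corpus = [
    ["quick", "brown", "fox"],
    ["lazy", "dog"],
    ["quick", "dog", "dog"],
]
# "fox" appears in only one document, so a "fox" match outweighs a "quick" match.
for doc in corpus:
    print(doc, round(bm25_score(["quick", "fox"], doc, corpus), 3))
```

Note how the document containing both query terms scores highest, the one containing only the more common term scores lower, and the one containing neither scores zero.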

Semantic search with vector embeddings

Semantic search extends and enhances OpenSearch’s core ability to match documents, meaning-for-meaning. In the last five or so years, the AI field has made a step change in working with natural language text. The Transformer architecture has led the way to exciting results for text generation and the rise of chatbots, AI assistants, and agents. Bidirectional Encoder Representations from Transformers (BERT) models are LLMs that produce vectors representing natural language, projecting text into a multi-dimensional space.

Models such as Amazon Titan Text Embeddings and Anthropic’s Claude take text (some models encode images, video, or audio) and transform it into an array of floating-point numbers. This large array (typically 384, 768, or 1,536 values) is a vector in a space defined by the number of dimensions of the array. Embedding models generate embeddings so that pieces of text that mean the same thing are close together in the vector space, while texts that have different meanings are far apart. For example, the embedding for dog is close to the embedding for puppy, while it’s far apart from the embedding for skyscraper.
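The “close together in the vector space” intuition is typically measured with cosine similarity. A toy sketch, with invented 4-dimensional vectors standing in for real embeddings (which have 384 to 1,536 dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

# Invented 4-dimensional "embeddings", purely to illustrate the geometry.
emb = {
    "dog":        [0.9, 0.8, 0.1, 0.0],
    "puppy":      [0.8, 0.9, 0.2, 0.1],
    "skyscraper": [0.0, 0.1, 0.9, 0.8],
}
print(cosine(emb["dog"], emb["puppy"]))       # close to 1.0: similar meaning
print(cosine(emb["dog"], emb["skyscraper"]))  # much smaller: unrelated meaning
```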

To perform semantic search, as shown in Figure 1.2, you find the nearest neighbors to the embedding of your query among the embeddings of your documents. During indexing, you create embeddings for each document you are sending to OpenSearch. At query time, you create an embedding for the query and send a knn query with the query embedding. OpenSearch finds nearest neighbors for your query vector, scoring documents based on their distance from the query vector. Because the vectors preserve much of the meaning of the words, matches are highly relevant. Semantic search can deliver 10–20% better relevance than BM25 on standard benchmarks.

Figure 1.2: Semantic search with Amazon OpenSearch Service

OpenSearch’s K-Nearest Neighbor (KNN) plugin provides the engines and algorithms for storing, matching, and scoring queries for vector embeddings. At the low end of the scale, you can employ exact KNN. In exact KNN, you match the query vector to every document in the index, providing a guaranteed best match. But matching every vector becomes prohibitively expensive as the number of vectors increases. With approximate KNN, OpenSearch employs heuristic algorithms that reduce the number of vectors matched, lowering latency and trading off accuracy.
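Exact KNN is simply a brute-force scan of every stored vector. A minimal sketch with Euclidean distance and toy 2-dimensional vectors (the index contents are invented):

```python
import heapq
import math

def exact_knn(query, index, k=2):
    """Brute-force exact KNN: compare the query against every stored vector."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Smallest distance = closest neighbor = best match.
    return heapq.nsmallest(k, index.items(), key=lambda item: dist(query, item[1]))

index = {
    "doc1": [0.10, 0.20],
    "doc2": [0.90, 0.80],
    "doc3": [0.15, 0.25],
}
print(exact_knn([0.12, 0.22], index, k=2))  # doc1 and doc3 are nearest
```

Because this scan grows linearly with the number of stored vectors, approximate algorithms such as HNSW and IVF exist to trade a little accuracy for much lower latency.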

OpenSearch’s KNN plugin supports three different storage and query engines: Non-Metric Space Library (NMSLib), Facebook AI Similarity Search (FAISS), and Lucene’s vector engine. These engines provide two algorithms for storing and matching vectors:

Hierarchical Navigable Small World (HNSW), supported by Lucene, NMSLib, and FAISS: The HNSW algorithm relies on hierarchical graphs at deepening granularities to store and find neighbors. The graph nodes are the vectors, and edges provide connections to close neighbors. The HNSW algorithm finds neighbors at the coarsest granularity, taking large steps. It deepens in the graph to take smaller steps, eventually retrieving points at the leaf level of the graph.
Inverted File (IVF): Inverted file takes a clustering approach to approximation. At indexing time, it builds clusters of close points, picking a representative centroid point. At query time, it examines the cluster with the representative centroid closest to the query point. You can use a technique called product quantization to reduce the number of bytes used to store the vectors, saving space and latency, with a corresponding loss of accuracy.

OpenSearch’s newest technique for semantic search is the use of sparse vectors (opensearch.org/docs/latest/search-plugins/neural-sparse-search/). A sparse vector representation of text stores a number of term-like tokens that is fewer than the full vocabulary of the corpus but larger than the embeddings produced for HNSW and IVF dense vectors. You can think of a spectrum: on one end, the vector has dimensions equivalent to the number of terms in the corpus. Each vector has a single 1, exactly representing that term. These one-hot vectors are accurate but have the least amount of overlap in their points. Moving along the spectrum, sparse vectors compress the number of dimensions, adding floating-point values for some subset of the dimensions. This promotes generalization through the overlapping of terms into groups. At the dense end of the spectrum, the direct correlation of terms and dimensions is lost, providing maximum generalization with loss of direct matching.

OpenSearch’s Neural plugin simplifies the process of adding vector embeddings during indexing and query by employing models hosted on the cluster or connecting to third-party model hosting services. OpenSearch supports custom models on a node with the ml role. When you use OpenSearch for semantic search, you run with one or more dedicated ml nodes to minimize the impact on the rest of the cluster. OpenSearch can also connect with model hosting services such as Amazon Bedrock, Amazon SageMaker, Cohere, and OpenAI.

Examples of custom models that you can upload to an OpenSearch cluster, including dense and sparse models provided by Hugging Face, can be found at https://docs.opensearch.org/latest/ml-commons-plugin/pretrained-models/.

So, which is better, lexical or semantic search? Not surprisingly, each has its place. In many search scenarios and for many search use cases, exact matching and BM25 relevance bring results that are as good as, or better than, semantic search with dense vectors. For example, if you are shopping for the Eastman 1/2 inch FIP x 3/8 inch OD Compression Quarter Turn Angle Stop Valve, Brass Plumbing Fitting, Chrome, 10733LF, and are querying with something such as Eastman 10733LF, you don’t need (or want!) a match based on the meaning of the terms that you typed. However, if you are searching for a movie to watch with a query such as a wholesome family movie with a funny plot line, semantic search with dense vectors is your best bet. Sparse vectors can succeed in more cases, since they retain a closer relationship from the source terms to the indexed vector. OpenSearch provides additional crossover capabilities with the Neural plugin’s hybrid search (https://opensearch.org/docs/latest/search-plugins/hybrid-search/). Hybrid search runs both a BM25-scored query and a vector-scored query, normalizing and combining scores from both. Hybrid search delivers 14–17% improved accuracy on standard benchmarks.
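Because BM25 scores and vector scores live on different scales, hybrid search must normalize the two score distributions before combining them (OpenSearch supports min-max and L2 normalization, with several combination techniques). A minimal min-max sketch with invented scores:

```python
def min_max(scores):
    """Min-max normalize a dict of doc -> score into the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid dividing by zero when all scores are equal
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid(bm25_scores, vector_scores, weight=0.5):
    """Combine normalized lexical and vector scores with a weighted mean."""
    b, v = min_max(bm25_scores), min_max(vector_scores)
    docs = set(b) | set(v)
    return sorted(
        ((doc, weight * b.get(doc, 0.0) + (1 - weight) * v.get(doc, 0.0))
         for doc in docs),
        key=lambda item: item[1], reverse=True)

# Invented raw scores: note the very different scales of the two queries.
bm25_scores   = {"d1": 12.0, "d2": 3.0, "d3": 7.5}
vector_scores = {"d1": 0.62, "d2": 0.91, "d3": 0.40}
print(hybrid(bm25_scores, vector_scores))  # d1 first: strong on both signals
```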

Log analytics

So far, we’ve been detailing OpenSearch’s capabilities for searching your data and finding relevant results. OpenSearch also provides capabilities for storing and analyzing time-series data. About half of OpenSearch’s use cases are pure search. The other half is for storing, searching, and analyzing trace, log, and metric data. People use OpenSearch to store this time-series data for DevOps, security monitoring, and monitoring their applications and infrastructure.

OpenSearch’s monitoring capabilities grew out of its core search capabilities. As application builders developed search experiences, they built UI and API elements to support faceted search. When you use a site such as Amazon.com and enter a query into the search box, the website summarizes information in various fields of the documents it retrieves. For example, if you search for golf shirt on Amazon, the site will show you statistics about the items that match, such as brands, price ranges, colors, and sizes. Most search-backed websites show a list of these per-attribute statistics to the left of the search results. These UI elements are called facets, and you can click on them to add a filter for one or more of the values to your query.

But facets are really just histograms of the possible values for a field. OpenSearch aggregates these values into buckets and counts how many items are in the bucket. When you send log data to OpenSearch, you first parse the original log line, which is usually a string. You apply field names and structure the data into fields in a JSON object that you send to OpenSearch for indexing. Here’s an (obfuscated) Apache httpd log line from a NASA dataset captured during the July 1995 shuttle launch:

192.168.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245

Quick tip: Enhance your coding experience with the AI Code Explainer and Quick Copy features. Open this book in the next-gen Packt Reader. Click the Copy button (1) to quickly copy code into your coding environment, or click the Explain button (2) to get the AI assistant to explain a block of code to you.

The next-gen Packt Reader is included for free with the purchase of this book. Scan the QR code OR go to packtpub.com/unlock, then use the search bar to find this book by name. Double-check the edition shown to make sure you get the right one.

In actual usage, you would parse the preceding data further, separating out elements such as the HTTP method, resource name, HTTP version, and so on. You might also enrich the data with location information derived from the host’s IP address. For now, you can render the log line in JSON as follows:

{
  "host": "192.168.81.55",
  "client": "-",
  "user": "-",
  "timestamp": "[01/Jul/1995:00:00:01 -0400]",
  "request": "GET /history/apollo/ HTTP/1.0",
  "status_code": 200,
  "bytes": 6245
}
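The parsing step described above can be sketched with a small regular expression over the Apache common log format. This is a minimal illustration, not production log tooling; note that it captures the timestamp without its surrounding brackets and coerces the numeric fields to integers:

```python
import re

# Minimal parser for the Apache common-log-format line shown above.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<client>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]+)" '
    r'(?P<status_code>\d{3}) (?P<bytes>\d+|-)'
)

def parse_log_line(line: str) -> dict:
    """Turn one raw log line into a JSON-ready dict of named fields."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unparseable log line: {line!r}")
    doc = match.groupdict()
    doc["status_code"] = int(doc["status_code"])
    # Apache logs "-" when no bytes were sent; treat that as zero.
    doc["bytes"] = 0 if doc["bytes"] == "-" else int(doc["bytes"])
    return doc

line = ('192.168.81.55 - - [01/Jul/1995:00:00:01 -0400] '
        '"GET /history/apollo/ HTTP/1.0" 200 6245')
doc = parse_log_line(line)
# doc is now ready to index, e.g. POST to an apache-logs index
```

In a real pipeline, a tool such as Data Prepper, Logstash, or Fluent Bit would perform this parsing (and the IP-based geolocation enrichment) before the document reaches OpenSearch.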

This small piece of information proves tremendously useful when combined with filtering and aggregation. As the technology has matured, its developers have added mathematical capabilities to the aggregations feature, enabling sums, averages, standard deviations, mins, maxes, and so on. Taken together with OpenSearch’s ability to filter for values in fields, aggregations provide the means of monitoring values in particular fields. When you have hundreds or thousands of web servers running and need a high-level view of what they are doing, you can send all of their access logs, such as the preceding ones, to OpenSearch. Once indexed, you can run queries to find out the following:

What was the most commonly accessed resource?
For each 5-minute interval, from midnight on July 1 through to 10:00 a.m., how many bytes did servers send over the wire?
How many times did host 192.168.81.55 access the website?
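Each of these questions translates into a small search body. The sketches below assume the parsed document shape shown earlier, plus two assumptions worth flagging: that a separate resource field was split out of the request string during parsing, and that timestamp was indexed with a date mapping so it supports range queries and date histograms.

```python
# 1. Most commonly accessed resource: a terms aggregation, no hits needed.
top_resources = {
    "size": 0,
    "aggs": {"top_resources": {"terms": {"field": "resource", "size": 1}}},
}

# 2. Bytes sent per 5-minute interval from midnight to 10:00 a.m.:
#    a range filter plus a date_histogram with a nested sum.
bytes_over_time = {
    "size": 0,
    "query": {
        "range": {
            "timestamp": {
                "gte": "1995-07-01T00:00:00-04:00",
                "lt": "1995-07-01T10:00:00-04:00",
            }
        }
    },
    "aggs": {
        "per_5m": {
            "date_histogram": {"field": "timestamp", "fixed_interval": "5m"},
            "aggs": {"total_bytes": {"sum": {"field": "bytes"}}},
        }
    },
}

# 3. How many times one host hit the site: the total hit count of a
#    term query is the answer; no documents need to be returned.
host_hits = {
    "size": 0,
    "track_total_hits": True,
    "query": {"term": {"host": "192.168.81.55"}},
}
```

Setting "size": 0 in each body tells OpenSearch to skip returning matching documents and compute only the aggregations or hit count, which keeps these analytical queries cheap.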

The value of this time-series data is very closely related to its age. When you are monitoring your application infrastructure to make sure that it is available and functioning correctly, you need to know what’s happening as soon as possible. If your website is down but your monitoring software is 15 minutes out of date, you’ll have a 15-minute lag before you can react and correct the failure. OpenSearch supports near-real-time updating of information for search and log data: it can scale to ingest hundreds of millions of log lines, with newly indexed data becoming searchable in as little as one second (the index refresh interval).

OpenSearch Dashboards (https://github.com/opensearch-project/OpenSearch-Dashboards), an open source data visualization tool and part of the OpenSearch project, provides a GUI for administrative functions along with plugin-driven solutions for alerting, anomaly detection, observability, and security analytics. It also provides visualizations and dashboards for monitoring your software and hardware. OpenSearch Dashboards makes it easy to build and collect these visualizations and to monitor your infrastructure in near-real time.

The following graphic shows OpenSearch Dashboards monitoring for sample Apache web logs. You can see the kinds of visualizations you can build, including pie charts, a gauge, an area graph, and a histogram.

Figure 1.3: OpenSearch Dashboard for log analytics

Quick tip: Need to see a high-resolution version of this image? Open this book in the next-gen Packt Reader or view it in the PDF/ePub copy.


With OpenSearch and OpenSearch Dashboards, you get a full suite for searching, analyzing, and visualizing your data. OpenSearch can be the key to unlocking the information in your log data, finding the product you want to buy, or making sure that your application is running and secure. We will now turn to some real-world examples of people using OpenSearch to solve these problems, and more.

Real-world examples and use cases

In this section, we explore diverse real-world applications of OpenSearch, demonstrating its adaptability across various industries and use cases. From revolutionizing product searches in e-commerce platforms to providing crucial insights in healthcare environments, OpenSearch proves to be a versatile tool. Going one step further, we will talk about use cases related to semantic search, where OpenSearch utilizes natural language processing (NLP) to grasp user query intent, delivering highly relevant results. For instance, a query such as formal attire for a beach wedding