To overcome application monitoring and observability challenges, Grafana Labs offers a modern, highly scalable, cost-effective Loki, Grafana, Tempo, and Mimir (LGTM) stack along with Prometheus for the collection, visualization, and storage of telemetry data.
Beginning with an overview of observability concepts, this book teaches you how to instrument code and monitor systems in practice using standard protocols and Grafana libraries. As you progress, you’ll create a free Grafana cloud instance and deploy a demo application to a Kubernetes cluster to delve into the implementation of the LGTM stack. You’ll learn how to connect Grafana Cloud to AWS, GCP, and Azure to collect infrastructure data, build interactive dashboards, make use of service level indicators and objectives to produce great alerts, and leverage the AI & ML capabilities to keep your systems healthy. You’ll also explore real user monitoring with Faro and performance monitoring with Pyroscope and k6. Advanced concepts like architecting a Grafana installation, using automation and infrastructure as code tools for DevOps processes, troubleshooting strategies, and best practices to avoid common pitfalls will also be covered.
After reading this book, you’ll be able to use the Grafana stack to deliver amazing operational results for the systems your organization uses.
Observability with Grafana
Monitor, control, and visualize your Kubernetes and cloud platforms using the LGTM stack
Rob Chapman
Peter Holmes
BIRMINGHAM—MUMBAI
Copyright © 2023 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Preet Ahuja
Publishing Product Manager: Surbhi Suman
Book Project Manager: Ashwini Gowda
Senior Editor: Shruti Menon
Technical Editor: Nithik Cheruvakodan
Copy Editor: Safis Editing
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Production Designer: Ponraj Dhandapani
DevRel Marketing Coordinator: Rohan Dobhal
Senior DevRel Marketing Coordinator: Linda Pearlson
First published: December 2023
Production reference: 1141223
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK
ISBN 978-1-80324-800-4
www.packtpub.com
To all my children, for making me want to be more. To Heather, for making that possible.
– Rob Chapman
For every little moment that brought me to this point.
– Peter Holmes
Rob Chapman is a creative IT engineer and founder at The Melt Cafe, with two decades of experience in the full application life cycle. Working over the years for companies such as the Environment Agency, BT Global Services, Microsoft, and Grafana, Rob has built a wealth of experience on large complex systems. More than anything, Rob loves saving energy, time, and money and has a track record for bringing production-related concerns forward so that they are addressed earlier in the development cycle, when they are cheaper and easier to solve. In his spare time, Rob is a Scout leader, and he enjoys hiking, climbing, and, most of all, spending time with his family and six children.
A special thanks to Peter, my co-author/business partner, and to our two reviewers, Flick and Brad – we did this!
Heather, my best friend – thank you for believing in me and giving me space to grow.
Thank you to my friend and coach, Sam, for guiding me to be all I can be and not being afraid to show that to the world.
Last but not least, thanks to Phil for the sanctuary over the years – you kept me sane.
Peter Holmes is a senior engineer with a deep interest in digital systems and how to use them to solve problems. With over 16 years of experience, he has worked in various roles in operations. Working at organizations such as Boots UK, Fujitsu Services, Anaplan, Thomson Reuters, and the NHS, he has experience in complex transformational projects, site reliability engineering, platform engineering, and leadership. Peter has a history of taking time to understand the customer and ensuring Day-2+ operations are as smooth and cost-effective as possible.
A special thanks to Rob, my co-author, for bringing me along on this writing journey, and to our reviewers.
Rania, my wife – thank you for helping me stay sane while writing this book.
Felicity (Flick) Ratcliffe has almost 20 years of experience in IT operations. Starting her career in technical support for an internet service provider, Flick has used her analytical and problem-solving superpowers to grow her career. She has worked for various companies over the past two decades in frontline operational positions, such as systems administrator, site reliability engineer, and now, for Post Office Ltd in the UK, as a cloud platform engineer. Continually developing her strong interest in emerging technologies, Flick has become a specialist in the area of observability over the past few years and champions the cause for OpenTelemetry within the organizations she works at or comes into contact with.
I have done so much with my career in IT, thanks to my colleagues, past and present. Thank you for your friendship and mentorship. You encouraged me to develop my aptitude for technology, and I cannot imagine myself in any other type of work now.
Bradley Pettit has over 10 years of experience in the technology industry. His expertise spans a range of roles, including hands-on engineering and solution architecture. Bradley excels in addressing complex technical challenges, thanks to his strong foundation in platform and systems engineering, automation, and DevOps practices. Recently, Bradley has specialized in observability, working as a senior solutions architect at Grafana Labs. He is a highly analytical, dedicated, and results-oriented professional. Bradley’s customer-centric delivery approach empowers organizations and the O11y community to achieve transformative outcomes.
Hello and welcome! Observability with Grafana is a book about the tools offered by Grafana Labs for observability and monitoring. Grafana Labs is an industry-leading provider of open source tools to collect, store, and visualize data collected from IT systems. This book is primarily aimed toward IT engineers who will interact with these systems, whatever discipline they work in.
We have written this book as we have seen some common problems across organizations:
- Systems that were designed without a strategy for scaling are being pushed to handle additional data load or teams using the system
- Operational costs are not being attributed correctly in the organization, leading to poor cost analysis and management
- Incident management processes that treat the humans involved as robots without sleep schedules or parasympathetic nervous systems

In this book, we will use the OpenTelemetry Demo application to simulate a real-world environment and send the collected data to a free Grafana Cloud account that we will create. This will guide you through the Grafana tools for collecting telemetry and also give you hands-on experience using the administration and support tools offered by Grafana. This approach will teach you how to run the Grafana tools in a way that anyone can experiment and learn independently.
This is an exciting time for Grafana, identified as a visionary in the 2023 Gartner Magic Quadrant for Observability (https://www.gartner.com/en/documents/4500499). They recently delivered change in two trending areas:
- Cost reduction: This has seen Grafana become the first vendor in the observability space to release tools that not only help you understand your costs but also reduce them.
- Artificial intelligence (AI): Grafana has introduced generative AI tools that assist daily operations in simple yet effective ways – for example, writing an incident summary automatically. Grafana Labs also recently purchased Asserts.ai to simplify root cause analysis and accelerate problem detection.

We hope you enjoy learning some new things with us and have fun doing it!
IT engineers, support teams, and leaders can gain practical insights into bringing the huge power of an observability platform to their organization. The book will focus on engineers in disciplines such as the following:
- Software development: Learn how to quickly instrument applications and produce great visualizations, enabling applications to be easily supported
- Operational teams (DevOps, Operations, Site Reliability, Platform, or Infrastructure): Learn to manage an observability platform or other key infrastructure platform, and how to manage such platforms in the same way as any other application
- Support teams: Learn how to work closely with development and operational teams to have great visualizations and alerting in place to quickly respond to customers’ needs and IT incidents

This book will also clearly establish the role of leadership in incident management, cost management, and establishing an accurate data model for this powerful dataset.
Chapter 1, Introducing Observability and the Grafana Stack, provides an introduction to the Grafana product stack in relation to observability as a whole. You will learn about the target audiences and how that impacts your design. We will take a look at the roadmap for observability tooling and how Grafana compares to alternative solutions. We will explore architectural deployment models, from self-hosted to cloud offerings. By the end, you will have answers to the question “Why choose Grafana?”.
Chapter 2, Instrumenting Applications and Infrastructure, takes you through the common protocols and best practices for each telemetry type at a high level. You will be introduced to widely used libraries for multiple programming languages that make instrumenting an application simple. Common protocols and strategies for collecting data from infrastructural components will also be discussed. This chapter provides a high-level overview of the technology space and aims to be valuable for quick reference.
Chapter 3, Setting Up a Learning Environment with Demo Applications, explains how to install and set up a learning environment that will support you through later sections of the book. You will also learn how to explore the telemetry produced by the demo app and add monitoring for your own service.
Chapter 4, Looking at Logs with Loki, takes you through working examples to understand LogQL. You will then be introduced to common log formats, and their benefits and drawbacks. Finally, you will be taken through the important architectural designs of Loki, and best practices when working with it.
Chapter 5, Monitoring with Metrics Using Grafana Mimir and Prometheus, discusses working examples to understand PromQL with real data. Detailed information about the different metric protocols will be discussed. Finally, you will be taken through important architectural designs backing Mimir, Prometheus, and Graphite that guide best practices when working with the tools.
Chapter 6, Tracing Technicalities with Grafana Tempo, shows you working examples to understand TraceQL with real data. Detailed information about the different tracing protocols will be discussed. Finally, you will be taken through the important architectural designs of Tempo, and best practices when working with it.
Chapter 7, Interrogating Infrastructure with Kubernetes, AWS, GCP, and Azure, describes the setup and configuration used to capture telemetry from infrastructure. You will learn about the different options available for Kubernetes. Additionally, you will investigate the main plugins that allow Grafana to query data from cloud vendors such as AWS, GCP, and Azure. You will look at solutions to handle large volumes of telemetry where direct connections are not scalable. The chapter will also cover options for filtering and selecting telemetry data before it gets to Grafana for security and cost optimization.
Chapter 8, Displaying Data with Dashboards, explains how you can set up your first dashboard in the Grafana UI. You will also learn how to present your telemetry data in an effective and meaningful way. The chapter will also teach you how to manage your Grafana dashboards to be organized and secure.
Chapter 9, Managing Incidents Using Alerts, describes how to set up your first Grafana alert with Alert Manager. You will learn how to design an alert strategy that prioritizes business-critical alerts over ordinary notifications. Additionally, you will learn about alert notification policies, different delivery methods, and what to look for.
Chapter 10, Automation with Infrastructure as Code, gives you the tools and approaches to automate parts of your Grafana stack deployments while introducing standards and quality checks. You will gain a deep dive into the Grafana API, working with Terraform, and how to protect changes with validation.
Chapter 11, Architecting an Observability Platform, will show those of you who are responsible for offering an efficient and easy-to-use observability platform how you can structure your platform so you can delight your internal customers. In the environment we operate in, it is vital to offer these platform services as quickly and efficiently as possible, so more time can be dedicated to the production of customer-facing products. This chapter aims to build on the ideas already covered to get you up and running quickly.
Chapter 12, Real User Monitoring with Grafana, introduces you to frontend application observability, using Grafana Faro and Grafana Cloud Frontend Observability for real user monitoring (RUM). This chapter will discuss instrumenting your frontend browser applications. You will learn how to capture frontend telemetry and link this with backend telemetry for full stack observability.
Chapter 13, Application Performance with Grafana Pyroscope and k6, introduces you to application performance and profiling using Grafana Pyroscope and k6. You will obtain a high-level overview that discusses the various aspects of k6 for smoke, spike, stress, and soak tests, as well as using Pyroscope to continuously profile an application both in production and test environments.
Chapter 14, Supporting DevOps Processes with Observability, takes you through DevOps processes and how they can be supercharged with observability using Grafana. You will learn how the Grafana stack can be used in the development stages to speed up the feedback loop for engineers. You will understand how to prepare engineers to operate the product in production. Finally, you will learn when and how to implement CLI and automation tools to enhance the development workflow.
Chapter 15, Troubleshooting, Implementing Best Practices, and More with Grafana, closes the book by taking you through best practices when working with Grafana in production. You will also learn some valuable troubleshooting tips to support you with high-traffic systems in day-to-day operations. You will also learn about additional considerations for your telemetry data with security and business intelligence.
The following table presents the operating system requirements for the software that will be used in this book:
Software/hardware covered in the book    Operating system requirements
Kubernetes v1.26                         Windows, macOS, or Linux with dual CPU and 4 GB RAM
Docker v23                               Windows, macOS, or Linux with dual CPU and 4 GB RAM
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Observability-with-Grafana. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
The Code in Action videos for this book can be viewed at https://packt.link/v59Jp.
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “The function used to get this information is the rate() function.”
A block of code is set as follows:
histogram_quantile(
  0.95,
  sum(
    rate(
      http_server_duration_milliseconds_bucket{}[$__rate_interval]
    )
  ) by (le)
)

Any command-line input or output is written as follows:
$ helm upgrade owg open-telemetry/opentelemetry-collector -f OTEL-Collector.yaml

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “From the dashboard’s Settings screen, you can add or remove tags from individual dashboards.”
Tips or important notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you’ve read Observability with Grafana, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
1. Scan the QR code or visit the link below:

https://packt.link/free-ebook/9781803248004

2. Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly.

In this part of the book, you will get an introduction to Grafana and observability. You will learn about the producers and consumers of telemetry data. You will explore how to instrument applications and infrastructure. Then, you will set up a learning environment that will be enhanced throughout the chapters to provide comprehensive examples of all parts of observability with Grafana.
This part has the following chapters:
- Chapter 1, Introducing Observability and the Grafana Stack
- Chapter 2, Instrumenting Applications and Infrastructure
- Chapter 3, Setting Up a Learning Environment with Demo Applications

The modern computer systems we work with have moved from the realm of complicated into the realm of complex, where the number of interacting variables makes them ultimately unknowable and uncontrollable. We are using the terms complicated and complex as per systems theory. A complicated system, like an engine, has clear causal relationships between components. A complex system, such as the flow of traffic in a city, shows emergent behavior from the interactions of its components.
With the average cost of downtime estimated to be $9,000 per minute by Ponemon Institute in 2016, this complexity can cause significant financial loss if organizations do not take steps to manage this risk. Observability offers a way to mitigate these risks, but making systems observable comes with its own financial risks if implemented poorly or without a clear business goal.
In this book, we will give you a good understanding of what observability is and who its customers might be. We will explore how to use the tools available from Grafana Labs to gain visibility of your organization. These tools include Loki, Prometheus, Mimir, Tempo, Frontend Observability, Pyroscope, and k6. You will learn how to use Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to obtain clear, transparent signals of when a service is operating correctly, and how to use the Grafana incident response tools to handle incidents. Finally, you will learn about managing your observability platform using automation tools such as Ansible, Terraform, and Helm.
This chapter aims to introduce observability to all audiences, using examples outside of the computing world. We’ll introduce the types of telemetry used by observability tools, which will give you an overview of how to use them to quickly understand the state of your services. The various personas who might use observability systems will be outlined so that you can explore complex ideas later with a clear grounding on who will benefit from their correct implementation. Finally, we’ll investigate Grafana’s Loki, Grafana, Tempo, Mimir (LGTM) stack, how to deploy it, and what alternatives exist.
In this chapter, we’re going to cover the following main topics:
- Observability in a nutshell
- Telemetry types and technologies
- Understanding the customers of observability
- Introducing the Grafana stack
- Alternatives to the Grafana stack
- Deploying the Grafana stack

The term observability is borrowed from control theory. It’s common to use the term interchangeably with the term monitoring in IT systems, as the concepts are closely related. Monitoring is the ability to raise an alarm when something is wrong, while observability is the ability to understand a system and determine whether something is wrong, and why.
Control theory was formalized in the late 1800s on the topic of centrifugal governors in steam engines. This diagram shows a simplified view of such a system:
Figure 1.1 – James Watt’s steam engine flyweight governor (source: https://www.mpoweruk.com)
Steam engines use a boiler to boil water in a pressure vessel. The steam pushes a piston backward and forward, which converts heat energy into reciprocating motion. In steam engines that use centrifugal governors, this reciprocating motion is converted into rotational motion via a wheel connected to a piston. The centrifugal governor provides a physical link backward through the system to the throttle. This means that the speed of rotation controls the throttle, which, in turn, controls the speed of rotation. Physically, this is observed by the balls on the governor flying outward and dropping inward until the system reaches equilibrium.
Monitoring defines the metrics or events that are of interest in advance. For instance, the governor measures the pre-defined metric of drive shaft revolutions. The controllability of the throttle is then provided by the pivot and actuator rod assembly. Assuming the actuator rod is adjusted correctly, the governor should control the throttle from fully open to fully closed.
In contrast, observability is achieved by allowing the internal state of the system to be inferred from its external outputs. If the operating point adjustment is incorrectly set, the governor may spin too fast or too slowly, rendering the throttle control ineffective. A governor spinning too fast or too slowly could also indicate that the sliding ring is stuck in place and needs oiling. Importantly, this insight can be gained without defining in advance what too fast or too slow means. The insight that the governor is spinning too fast or too slowly also needs very little knowledge of the full steam engine.
Fundamentally, both monitoring and observability are used to improve the reliability and performance of the system in question.
Now that we have introduced the high-level concepts, let’s explore a practical example outside of the world of software services.
Let’s imagine a ship traversing the Agua Clara locks on the Panama Canal. This can be illustrated using the following figure:
Figure 1.2 – The Agua Clara locks on the Panama Canal
There are a few aspects of these locks that we might want to monitor:
- The successful opening and closing of each gate
- The water level inside each lock
- How long it takes for a ship to traverse the locks

Monitoring these aspects may highlight situations that we need to be alerted about:
- A gate is stuck open because of a mechanical failure
- The water level is rapidly descending because of a leak
- A ship is taking too long to exit the locks because it is stuck

There may be situations where the data we are monitoring is within acceptable limits, but we can still observe a deviation from what is considered normal, which should prompt further action:
- A small leak has formed near the top of the lock wall:
  - We would see the water level drop, but only when it is above the leak
  - This could prompt maintenance work on the lock wall
- A gate in one lock is opening more slowly because it needs maintenance:
  - We would see the time between opening and closing the gate increase
  - This could prompt maintenance on the lock gate
- Ships take longer to traverse the locks when the wind is coming from a particular direction:
  - We could compare hourly average traversal rates
  - This could prompt work to reduce the impact of wind from one direction

Now that we’ve seen an example of measuring a real-world system, we can group these types of measurements into different data types to best suit the application. Let’s introduce those now.
The boring but important part of observability tools is telemetry – capturing data that is useful, shipping it from place to place, and producing visualizations, alerts, and reports that offer value to the organization.
Three main types of telemetry are used to build monitoring and observability systems – metrics, logs, and distributed traces. Other telemetry types may be used by some vendors and in particular circumstances. We will touch on these here, but they will be explored in more detail in Chapters 12 and 13 of this book.
Metrics can be thought of as numeric data that is recorded at a point in time and enriched with labels or dimensions to enable analysis. Metrics are frequently generated and are easy to search, making them ideal for determining whether something is wrong or unusual. Let’s look at an example of metrics showing temporal changes:
Figure 1.3 – Metrics showing changes over time
Taking our example of the Panama Canal, we could represent the water level in each lock as a metric, to be measured at regular intervals. To be able to use the data effectively, we might add some of these labels:
- The lock name: Agua Clara
- The lock chamber: Lower lock
- The canal: Panama Canal

Logs are considered to be unstructured string data types. They are recorded at a point in time and usually contain a huge amount of information about what is happening. While logs can be structured, there is no guarantee of that structure persisting, because the log producer has control over the structure of the log. Let’s look at an example:
Jun 26 2016 20:31:01 pc-ac-g1 gate-events no obstructions seen
Jun 26 2016 20:32:01 pc-ac-g1 gate-events starting motors
Jun 26 2016 20:32:30 pc-ac-g1 gate-events motors engaged successfully
Jun 26 2016 20:35:30 pc-ac-g1 gate-events stopping motors
Jun 26 2016 20:35:30 pc-ac-g1 gate-events gate open complete

In our example, the various operations involved in opening or closing a lock gate could be represented as logs.
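To make the idea of turning unstructured log lines into structured fields concrete, here is a minimal sketch that parses the lock-gate log lines above. The field names (host, stream, message) are our own illustration, not a format defined by the book or by Loki:

```python
import re
from datetime import datetime

# Illustrative pattern for the log lines above:
# "<Mon DD YYYY HH:MM:SS> <host> <stream> <message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\w{3} \d{2} \d{4} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<stream>\S+) (?P<message>.+)"
)

def parse_log_line(line: str) -> dict:
    """Extract structured fields from a single log line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unparseable log line: {line!r}")
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%b %d %Y %H:%M:%S"
    )
    return fields

parsed = parse_log_line(
    "Jun 26 2016 20:32:01 pc-ac-g1 gate-events starting motors"
)
print(parsed["host"])     # pc-ac-g1
print(parsed["message"])  # starting motors
```

Once fields like these are extracted, the log data becomes searchable by label rather than by free-text string, which is exactly the problem described next.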
Almost every system produces logs, and they often give very detailed information. This is great for understanding what happened. However, the volume of data presents two problems:
- Searching can be inefficient and slow
- As the data is in text format, knowing what to search for can be difficult. For example, error occurred, process failed, and action did not complete successfully could all be used to describe a failure, but there are no shared strings to search for

Let’s consider a real log entry from a computer system to see how log data is usually represented:
Figure 1.4 – Logs showing discrete events in time
We can clearly see that we have a number of fields that have been extracted from the log entry by the system. These fields detail where the log entry originated from, what time it occurred, and various other items.
Distributed traces show the end-to-end journey of an action. They are captured from every step that is taken to complete the action. Let’s imagine a trace that covers the passage of a ship through the lock system. We will be interested in the time a ship enters and leaves each lock, and we will want to be able to compare different ships using the system. A full passage can be given an identifier, usually called a trace ID. Traces are made up of spans. In our example, a span would cover the entry and exit for each individual lock. These spans are given a second identifier, called a span ID. To tie these two together, each span in a trace references the trace ID for the whole trace. The following screenshot shows an example of how a distributed trace is represented for a computer application:
Figure 1.5 – Traces showing the relationship of actions over time
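The trace-and-span relationship can be sketched as plain records that all share one trace ID while each carries its own span ID. This is a simplified illustration of the concept, not the API of a real tracing SDK:

```python
import uuid

def new_id() -> str:
    # Real tracing systems use fixed-width hex IDs; uuid4 is fine for a sketch
    return uuid.uuid4().hex[:16]

def make_span(trace_id: str, name: str, parent_id=None) -> dict:
    """A span records one step of the journey and points back to the trace."""
    return {
        "trace_id": trace_id,   # shared by every span in the passage
        "span_id": new_id(),    # unique to this step
        "parent_id": parent_id,
        "name": name,
        "events": [],           # timestamped key events within this step
    }

# One trace ID covers the whole passage through the lock system
trace_id = new_id()
spans = [make_span(trace_id, f"lock-{n}") for n in (1, 2, 3)]

# Every span references the same trace, but has its own span ID
assert all(s["trace_id"] == trace_id for s in spans)
assert len({s["span_id"] for s in spans}) == 3
```

Because each span only needs the trace ID and its own span ID, the three locks can record their spans independently and the recording system can reassemble the full journey afterward.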
Now that we have introduced metrics, logs, and traces, let’s consider a more detailed example of a ship passing through the locks, and how each telemetry type would be produced in this process:
1. Ship enters the first lock:
   - Span ID created
   - Trace ID created
   - Contextual information is added to the span, for example, a ship identification
   - Key events are recorded in the span with time stamps, for example, gates are opened and closed
2. Ship exits the first lock:
   - Span closed and submitted to the recording system
   - Second lock notified of trace ID and span ID
3. Ship enters the second lock:
   - Span ID created
   - Trace ID added to span
   - Contextual information is added to the span
   - Key events recorded in the span with time stamps
4. Ship exits the second lock:
   - Span closed and submitted to the recording system
   - Third lock notified of trace ID and span ID
5. Ship enters the third lock:
   - Repeat step 3
6. Ship exits the third lock:
   - Span closed and submitted to the recording system

Now let’s look at some other telemetry types.
Metrics, logs, and traces are often called the three pillars or the golden triangle of observability. As we outlined earlier, observability is the ability to understand a system. While metrics, logs, and traces give us a very good ability to understand a system, they are not the only signals we might need, as this depends on the abstraction layer at which we need to observe the system. For instance, when looking at a very detailed level, we may be very interested in the stack trace of an application’s activity at the CPU and RAM level. Conversely, if we are interested in the execution of a CI/CD pipeline, we may just be interested in whether a deployment occurred and nothing more.
Profiling data (stack traces) can give us a very detailed technical view of the system’s use of resources such as CPU cycles or memory. With cloud services often charged per hour for these resources, this kind of detailed analysis can easily create cost savings.
Similarly, events can be consumed from a platform such as CI/CD. These can offer a huge amount of insight that can reduce the Mean Time to Recovery (MTTR). Imagine responding to an out-of-hours alert and seeing that a new version of a service was deployed immediately before the issues started occurring. Even better, imagine not having to wake up because the deployment process could check for failures and roll back automatically. Events differ from logs only in that an event represents a whole action. In our earlier example in the Logs section, we created five logs, but all of these referred to stages of the same event (opening the lock gate). Note that event is a relatively generic term and is also used with other meanings.
Now that we’ve introduced the fundamental concepts of the technology, let’s talk about the customers who will use observability data.
Observability deals with understanding a system, identifying whether something is wrong with that system, and understanding why it is wrong. But what do we mean by understanding a system? The simple answer would be knowing the state of a single application or infrastructure component.
In this section, we will introduce the user personas that we will use throughout this book. These personas will help to distinguish the different types of questions that people use observability systems to answer.
Let’s take a quick look at the user personas that will be used throughout the book as examples, and their roles:
Name and role       Description
Diego Developer     Frontend, backend, full stack, and so on
Ophelia Operator    SRE, DevOps, DevSecOps, customer success, and so on
Steven Service      Service manager and other tasks
Pelé Product        Product manager, product owner, and so on
Masha Manager       Manager, senior leadership, and so on

Table 1.1 – User persona introductions
Now let’s look at each of these users in greater detail.
Diego Developer works on many types of systems, from frontend applications that customers directly interact with, to backend systems that let his organization store data in ways that delight its customers. You might even find him working on platforms that other developers use to get their applications integrated, built, delivered, and deployed safely and speedily.
He writes great software that is well tested and addresses customers’ actual needs.
When he is not writing code, he works with Ophelia Operator to address any questions and issues that occur.
Pelé Product works in his team and provides insight into the customer’s needs. They work together closely, taking those needs and turning them into detailed plans on how to deliver software that addresses them.
Steven Service is keen to ensure that the changes Diego makes are not impacting customer commitments. He’s also the one who wakes Diego up if there is an incident that needs attention. The data provided to Masha Manager gives her a breakdown of costs. When Diego is working on developer platforms, he also collects data that helps her get investment from the business into teams that are not performing as expected.
Diego really needs easy-to-use libraries for the languages he uses to instrument the code he produces. He does not have time to become an expert. He wants to be able to add a few lines of code and get results quickly.
Having a clear standard for acceptable performance measures makes it easy for him to get the right results.
When Diego’s systems produce too much data, he finds it difficult to sort signal from noise. He also gets frustrated having to change his code because of an upstream decision to change tooling.
Ophelia Operator works in an operations-focused environment. You might find her in a customer-facing role or as part of a development team as a DevOps engineer. She could be part of a group dedicated to the reliability of an organization’s systems, or she could be working in security or finance to ensure the business runs securely and smoothly.
Ophelia wants to make sure a product is functioning as expected. She also likes it when she is not woken up early in the morning by an incident.
Ophelia works a lot with Diego Developer: sometimes escalating customer tickets when she doesn’t have the data available to understand the problem, and at other times developing runbooks to keep the systems running. Sometimes she will need to give Diego clear information on acceptable performance measures so that her team can make sure systems perform well for customers.
Steven Service works closely with Ophelia. They work together to ensure there are not many incidents, and that they are quickly resolved. Steven makes sure that business data on changes and incidents is tracked, and tweaks processes when things aren’t working.
Pelé Product likes to have data showing the problematic areas of his products.
Good data is necessary to do the job effectively. Being able to see that a customer has encountered an error can make the difference between resolving a problem straight away or having them wait maybe weeks for a response.
During an incident, seeing that a new version of a service was deployed at the time a problem started can turn an hours-long incident into a brief blip, and keep customers happy.
Getting continuous alerts but not being empowered to fix the underlying issue is a big problem. Ophelia has seen colleagues burn out, and it makes her want to leave the organization when this happens.
Steven Service works in service delivery. He is interested in making sure the organization’s services are delivered smoothly. Jumping in on critical incidents and coordinating actions to get them resolved as quickly as possible is part of the job. So is ensuring that changes are made using processes that help others do it as safely as possible. Steven also works with third parties who provide services that are critical to the running of the organization.
He wants services to run as smoothly as possible so that the organization can spend more time focused on customers.
Diego Developer and Ophelia Operator work a lot with the change management processes created by Steven and the support processes he manages. Having accurate data to hand during change management really helps to make the process as smooth as possible.
Steven works very closely with Masha Manager to make sure she has access to data showing where processes are working smoothly and where they need to spend time improving them.
He needs to be able to compare the delivery of different products and provide that data to Masha and the business.
During incidents, he needs to be able to get the right people on the call as quickly as possible and keep a record of what happened for the incident post-mortem.
Being able to identify the right person to get on a call during an incident is a common problem he faces. Watching incidents drag on while teams compare different systems and argue about who can fix the problem is also a big concern for him.
Pelé Product works in the product team. You’ll find him working with customers to understand their needs, keeping product roadmaps in order, and communicating requirements back to developers such as Diego Developer so they can build them. You might also find him understanding and shaping the product backlog for the internal platforms used by developers in the organization.
Pelé wants to understand customers, give them products that delight them, and keep them coming back.
He spends a lot of time working with Diego when they can look at the same information to really understand what customers are doing and how they can help them do it better.
Ophelia Operator and Steven Service help Pelé keep products on track. If too many incidents occur, they ask everyone to refocus on getting stability right. There is no point in providing customers with lots of features on a system that they can’t trust.
Pelé works closely with Masha Manager to ensure the organization has the right skills in the teams that build products. The business depends on her leadership to make sure that these developers have the best tools to help them get their code live in front of customers where it can be used.
Pelé needs to be able to understand customers’ pain points even when they do not articulate them clearly during user research.
He needs data that gives him a common language with Diego and Ophelia. Sometimes they can get too focused on specific numbers such as shaving off a couple of milliseconds from a request, when improving a poor workflow would improve the customer experience more significantly.
Pelé hates not being able to see at a high level what customers are doing. Understanding which bits of an application have the most usage, and which bits are not used at all, lets him know where to focus time and resources.
While customers never tell him they want stability, if it’s not there they will lose trust very quickly and start to look at alternatives.
Masha works in management. You might find her leading a team and working closely with them daily. She also represents middle management, setting strategy and making tactical choices, and she is involved, to some extent, in senior leadership. Much of her role involves managing budgets and people. If something can make that process easier, then she is usually interested in hearing about it. What Masha does not want to do is waste the organization’s money, because that can directly impact jobs.
Her primary goals are to keep the organization running smoothly and ensure the budget is balanced.
As a leader, Masha needs accurate data and needs to be able to trust the teams who provide that data. The data could be the end-to-end cycle time of feature concept to delivery from Pelé Product, the lead time for changes from Diego Developer, or even the MTTR from Steven Service. Having that data helps her to understand where focus and resources can have the biggest impact.
Masha works regularly with the financial operations staff and needs to make sure they have accurate information on the organization’s expenditure and the value that expenditure provides.
She needs good data in a place where she can view it and make good decisions. This usually means she consumes information from a business intelligence system. To use such tools effectively, she needs to be clear on what the organization’s goals are, so that the correct data can be collected to help her understand how her teams are tracking to that goal.
She also needs to know that the teams she is responsible for have the correct data and tools to excel in their given areas.
High failure rates and long recovery time usually result in her having to speak with customers to apologize. Masha really hates these calls!
Poor visibility of cloud systems is a particular concern. Masha has too many horror stories of huge overspending caused by a lack of monitoring; she would rather spend that budget on something more useful.
You now know about the customers who use observability data, and the types of data you will be using to meet their needs. As the main focus of this book is on Grafana as the underlying technology, let’s now introduce the tools that make up the Grafana stack.
Grafana was born in 2013 when a developer was looking for a new user interface to display metrics from Graphite. Initially forked from Kibana, the Grafana project was developed to make it easy to build quick, interactive dashboards that were valuable to organizations. In 2014, Grafana Labs was formed with the core value of building a sustainable business with a strong commitment to open source projects. From that foundation, Grafana has grown into a strong company supporting more than 1 million active installations. Grafana Labs is a huge contributor to open source projects, from their own tools to widely adopted technologies such as Prometheus, and recent initiatives with a lot of traction such as OpenTelemetry.
Grafana offers many tools, which we’ve grouped into the following categories:
- The core Grafana stack: LGTM and the Grafana Agent
- Grafana Enterprise plugins
- Incident response tools
- Other Grafana tools

Let’s explore these tools in the following sections.
The core Grafana stack consists of Mimir, Loki, Tempo, and Grafana; the acronym LGTM is often used to refer to this tech stack.
Mimir is a Time Series Database (TSDB) for the storage of metric data. It uses low-cost object storage such as S3, GCS, or Azure Blob Storage. First announced for general availability in March 2022, Mimir is the newest of the four products we’ll discuss here, although it’s worth highlighting that Mimir initially forked from another project, Cortex, which was started in 2016. Parts of Cortex also form the core of Loki and Tempo.
Mimir is a fully Prometheus-compatible solution that addresses the common scalability problems encountered with storing and searching huge quantities of metric data. In 2021 Mimir was load tested to 1 billion active time series. An active time series is a metric with a value and unique labels that has reported a sample in the last 20 minutes. We will explore Mimir and Prometheus in much greater detail in Chapter 5.
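To make the idea of an active time series concrete, here is a small, standard-library-only Python sketch. This is our own illustration of the definition, not Mimir’s implementation: each unique combination of metric name and label set is one series, and a series counts as active if it has reported a sample within the last 20 minutes. The metric and label names are invented for the lock example:

```python
import time

ACTIVE_WINDOW = 20 * 60  # seconds; the window used to define "active"

# Each unique (metric name, label set) pair is one time series;
# we track the timestamp of the most recent sample per series.
last_sample = {}

def record_sample(metric, labels, ts):
    series_key = (metric, tuple(sorted(labels.items())))
    last_sample[series_key] = ts

def active_series_count(now):
    return sum(1 for ts in last_sample.values() if now - ts <= ACTIVE_WINDOW)

now = time.time()
record_sample("lock_transit_seconds", {"lock": "first", "ship": "SS Grafana"}, now)
record_sample("lock_transit_seconds", {"lock": "second", "ship": "SS Grafana"}, now)
# This series last reported an hour ago, so it is no longer active
record_sample("lock_transit_seconds", {"lock": "first", "ship": "SS Tempo"}, now - 3600)

print(active_series_count(now))  # prints 2
```

This also shows why high-cardinality labels are expensive: every new label value creates a new series that the backend must track.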
Loki is a set of components that offer a full-featured logging stack. Loki uses lower-cost object storage such as S3 or GCS, and only indexes label metadata. Loki entered general availability in November 2019.
Log aggregation tools typically use two data structures to store log data: an index that contains references to the location of the raw data, paired with searchable metadata, and the raw data itself, stored in a compressed form. Loki differs from a lot of other log aggregation tools by keeping the index data relatively small and scaling search by horizontally scaling the querying component. The process of selecting the best index fields is one we will cover in Chapter 4.
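The split between a small label index and compressed raw chunks can be sketched as follows. This is a conceptual, standard-library-only Python illustration of the approach, not Loki’s actual code: only labels go into the index, so a query first narrows down chunks by label set and then brute-force scans their contents. All names here are invented for the example:

```python
import zlib

# "Object storage": chunk name -> compressed raw log lines
chunks = {}
# Small index: label set -> chunk names (only labels are indexed, never content)
index = {}

def store_chunk(labels, lines, name):
    chunks[name] = zlib.compress("\n".join(lines).encode())
    index.setdefault(tuple(sorted(labels.items())), []).append(name)

def query(labels, needle):
    """Index lookup by labels, then a scan of only the matching chunks."""
    key = tuple(sorted(labels.items()))
    hits = []
    for name in index.get(key, []):
        for line in zlib.decompress(chunks[name]).decode().splitlines():
            if needle in line:  # the scan is what real Loki scales horizontally
                hits.append(line)
    return hits

store_chunk({"app": "lock-controller"}, ["gate opened", "gate closed"], "chunk-1")
store_chunk({"app": "billing"}, ["invoice sent"], "chunk-2")

print(query({"app": "lock-controller"}, "gate"))  # only chunk-1 is scanned
```

Because content is never indexed, ingest stays cheap regardless of what the log lines contain; the trade-off is that query speed depends on how well the labels narrow down the chunks to scan.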
Tempo is a storage backend for high-scale distributed trace telemetry, designed with the aim of supporting 100% sampling of trace data. Like Loki and Mimir, it leverages lower-cost object storage such as S3, GCS, or Azure Blob Storage. Tempo went into general availability in June 2021.
When Tempo released 1.0, it was tested at a sustained ingestion of >2 million spans per second (about 350 MB per second). Tempo also offers the ability to generate metrics from spans as they are ingested; these metrics can be written to any backend that supports Prometheus remote write. Tempo is explored in detail in Chapter 6.
Grafana has been a staple for fantastic visualization of data since 2014. It can connect to a huge variety of data sources, from TSDBs to relational databases and even other observability tools, with over 150 data source plugins available. Grafana has a huge community using it for many different purposes. This community supports over 6,000 dashboards, which means there is a starting place for most available technologies, with minimal time to value.
Collecting telemetry from many places is one of the fundamental aspects of observability. Grafana Agent is a collection of tools for collecting logs, metrics, and traces. There are many other collection tools that Grafana integrates well with. Different collection tools offer different advantages and disadvantages, which is not a topic we will explore in this book. We will highlight other tools in the space later in this chapter and in Chapter 2 to give you a starting point for learning more about this topic. We will also briefly discuss architecting a collection infrastructure in Chapter 11.
The Grafana stack is a fantastic group of open source software for observability. The commitment of Grafana Labs to open source is supported by great enterprise plugins. Let’s explore them now.
Grafana offers Enterprise plugins as part of its Cloud Pro, Cloud Advanced, and Enterprise license offerings; they are included in any paid subscription to Grafana.
The Enterprise data source plugins allow organizations to read data from many other storage tools they may use, from software development tools such as GitLab and Azure DevOps to business intelligence tools such as Snowflake, Databricks, and Looker. Grafana also offers tools to read data from many other observability tools, which enables organizations to build comprehensive operational coverage while offering individual teams a choice of the tools they use.
Alongside the data source plugins, Grafana offers premium tools for logs, metrics, and traces. These include access policies and tokens for log data to secure sensitive information, in-depth health monitoring for the ingest and storage of cloud stacks, and management of tenants.
Grafana offers three products in the incident response and management (IRM) space:
At the foundation of IRM are