Discover the power of open source observability for your enterprise environment
In Mastering OpenTelemetry and Observability: Enhancing Application and Infrastructure Performance and Avoiding Outages, accomplished engineering leader and open source contributor Steve Flanders unlocks the secrets of enterprise application observability with a comprehensive guide to OpenTelemetry (OTel). Explore how OTel transforms observability, providing a robust toolkit for capturing and analyzing telemetry data across your environment.
You will learn how OTel delivers unmatched flexibility, extensibility, and vendor neutrality, freeing you from vendor lock-in and enabling data sovereignty and portability.
Whether you are a novice or a seasoned professional, Mastering OpenTelemetry and Observability is your roadmap to troubleshooting availability and performance problems by learning to detect anomalies, interpret data, and proactively optimize performance in your enterprise environment. Embark on your journey to observability mastery today!
Page count: 634
Year of publication: 2024
Cover
Table of Contents
Title Page
Foreword
Introduction
The Mastering Series
Chapter 1: What Is Observability?
Definition
Background
Metadata
Signals
Collecting Signals
Observability
The Bottom Line
Notes
Chapter 2: Introducing OpenTelemetry!
Background
OpenTelemetry Components
OpenTelemetry Concepts
The Bottom Line
Notes
Chapter 3: Getting Started with the Astronomy Shop
Background
Getting Started
Beyond the Basics
Astronomy Shop Scenarios
The Bottom Line
Notes
Chapter 4: Understanding the OpenTelemetry Specification
Background
API Specification
SDK Specification
Data Specification
General Specification
The Bottom Line
Notes
Chapter 5: Managing the OpenTelemetry Collector
Background
Deployment Modes
The Basics
Configuration
Observing
Troubleshooting
Beyond the Basics
The Bottom Line
Notes
Chapter 6: Leveraging OpenTelemetry Instrumentation
Environment Setup
Python Trace Instrumentation
Python Metrics Instrumentation
Python Log Instrumentation
Language Considerations
Distributions
The Bottom Line
Notes
Chapter 7: Adopting OpenTelemetry
The Basics
General Process
Brownfield Deployment
Greenfield Deployment
Other Considerations
The Bottom Line
Notes
Chapter 8: The Power of Context and Correlation
Background
Context
Correlation
The Bottom Line
Notes
Chapter 9: Choosing an Observability Platform
Primary Considerations
Primary Factors
Implementation Basics
The Bottom Line
Notes
Chapter 10: Observability Antipatterns and Pitfalls
Telemetry Data Missteps
Observability Platform Missteps
Company Culture Implications
Prioritization Framework
The Bottom Line
Notes
Chapter 11: Observability at Scale
Understanding the Challenges
Strategies for Scaling Observability
Best Practices for Managing Scale
The Bottom Line
Notes
Chapter 12: The Future of Observability
Challenges and Opportunities
Emerging Trends and Innovations
The Future of OpenTelemetry
The Bottom Line
Notes
Appendix A: The Bottom Line
Chapter 1: What Is Observability?
Chapter 2: Introducing OpenTelemetry!
Chapter 3: Getting Started with the Astronomy Shop
Chapter 4: Understanding the OpenTelemetry Specification
Chapter 5: Managing the OpenTelemetry Collector
Chapter 6: Leveraging OpenTelemetry Instrumentation
Chapter 7: Adopting OpenTelemetry
Chapter 8: The Power of Context and Correlation
Chapter 9: Choosing an Observability Platform
Chapter 10: Observability Antipatterns and Pitfalls
Chapter 11: Observability at Scale
Chapter 12: The Future of Observability
Appendix B: Introduction
Chapter 2: Introducing OpenTelemetry!
Chapter 3: Getting Started with the Astronomy Shop
Chapter 5: Managing the OpenTelemetry Collector
Chapter 12: The Future of Observability
Notes
Index
Copyright
Dedication
About the Author
Acknowledgments
End User License Agreement
Chapter 1
TABLE 1.1: A 2×2 matrix showing states of awareness and understanding. Monit...
Chapter 2
TABLE 2.1: Timeline of major open source observability projects through 2023...
TABLE 2.2: OTel component status and maturity
TABLE 2.3: Specification signal and component status as of 2024. In short, t...
TABLE 2.4: Collector core components and signal status as of mid 2024. Note ...
TABLE 2.5: OTel instrumentation and signal status as of mid 2024. Informatio...
TABLE 2.6: OTel instrumentation language and semconv category status as of m...
TABLE 2.7: OTel component and pipeline flow
TABLE 2.8: Resource and semconv status as of mid 2024
Chapter 4
TABLE 4.1: The metric instruments available in OTel, along with their associ...
TABLE 4.2: Field names available to log records and what they mean
Chapter 5
TABLE 5.1: Common telemetry flow options from instrumentation generation to ...
TABLE 5.2: Tiers of Collector support, including guarantees and OS types sup...
TABLE 5.3: General categories and examples (not a complete list) of processo...
TABLE 5.4: Extensions grouped into categories and with examples
Chapter 7
TABLE 7.1: Example of the components that need to be supported for each tele...
TABLE 7.2: The types of OTel components that may need to be deployed and man...
Chapter 9
TABLE 9.1: A simplified example of a prioritized checklist of requirements. ...
TABLE 9.2: Example scenarios that could be used to test the capabilities of ...
TABLE 9.3: The different approaches when choosing an observability platform ...
TABLE 9.4: Some critical decisions that need to be made when choosing an obs...
TABLE 9.5: Some OTel decision factors that may influence the observability p...
TABLE 9.6: Key stakeholders to engage when making an observability platform ...
Chapter 10
TABLE 10.1: A 2×2 matrix that could be used to determine priority. On the le...
TABLE 10.2: The results of the prioritization framework for scenarios in thi...
Chapter 11
TABLE 11.1: A 2x2 matrix suggesting the order in which to instrument telemet...
Chapter 1
FIGURE 1.1 An example of a monolithic application experiencing an issue. The...
FIGURE 1.2 An example of a microservice-based architecture experiencing an i...
FIGURE 1.3 A Grafana dashboard displaying aggregate metric information.
FIGURE 1.4 An example of a generic trace represented as a waterfall. This wa...
FIGURE 1.5 An example of a generic trace represented as a service map. As yo...
FIGURE 1.6 Example of head-based sampling. Each circle represents a microser...
FIGURE 1.7 Example of tail-based sampling. Each circle represents a microser...
FIGURE 1.8 An application emitting telemetry data directly to a vendor-based...
FIGURE 1.9 An application that emits its telemetry data to an agent running ...
FIGURE 1.10 An application that emits its telemetry data to an agent running...
Chapter 2
FIGURE 2.1 High-level grouping of OTel components. The specification is an o...
Chapter 3
FIGURE 3.1 A simplified (not all services are listed) service map of the Ast...
FIGURE 3.2 The OTel Astronomy Shop demo application home page.
FIGURE 3.3 Load generation tool powered by Locust available as part of the A...
FIGURE 3.4 Load generation settings and statistics.
FIGURE 3.5 Load generator settings to control the number of concurrent users...
FIGURE 3.6 The home page for the Tracetest UI that is available as part of t...
FIGURE 3.7 Jaeger UI searching for traces from the product catalog service (...
FIGURE 3.8 Example trace containing the product catalog service and at least...
FIGURE 3.9 Tags from a product catalog service span accessed by selecting th...
FIGURE 3.10 Tags from a checkout service span accessed by selecting the span...
FIGURE 3.11 Jaeger UI Monitor tab for the product catalog service (1). Notic...
FIGURE 3.12 Grafana Demo Dashboard for the product catalog service (1). Noti...
FIGURE 3.13 Grafana Demo Dashboard for the checkout service (1). Notice the ...
FIGURE 3.14 Grafana Demo Dashboard for the frontend service (1). Notice the ...
FIGURE 3.15 Grafana Spanmetrics Demo Dashboard. Notice the error rate for th...
FIGURE 3.16 Example of the demo dashboard for the recommendation after runni...
FIGURE 3.17 A Jaeger search (3) for traces that contain the recommendation s...
FIGURE 3.18 A long duration trace for the previous Jaeger query. Notice that...
FIGURE 3.19 A Jaeger search (4) for traces that contain the recommendation s...
FIGURE 3.20 A long duration trace for the previous Jaeger query. Notice that...
Chapter 4
FIGURE 4.1 Visualization of the generic design for OTel instrumentation.
FIGURE 4.2 Visualization of how the OTel API would be implemented in an appl...
FIGURE 4.3 Visualization of how the OTel SDK would be implemented in an appl...
FIGURE 4.4 An example of an OTel telemetry schema version change and how a s...
FIGURE 4.5 An example of an OTel telemetry schema version change and how the...
Chapter 5
FIGURE 5.1 Different types of telemetry data the Collector supports.
FIGURE 5.2 An example of the OTel SDK sending data to a Collector running in...
FIGURE 5.3 An example of the OTel SDK sending data to a Collector running in...
FIGURE 5.4 Common agent deployment architectures. The Agent Binary represent...
FIGURE 5.5 Common gateway deployment architecture. In this image, the Gatewa...
FIGURE 5.6 A high-level architecture diagram of the Collector. A trace servi...
FIGURE 5.7 An example of the zPages pipelines page provided by the Collector...
FIGURE 5.8 Visualization of a Collector configuration. This makes it easier ...
FIGURE 5.9 An example of the zPages Trace Spans page after receiving a singl...
FIGURE 5.10 Selecting a non-zero number link next to the path will display i...
FIGURE 5.11 Selecting a non-zero number link next to the receiver will displ...
FIGURE 5.12 Selecting a non-zero number link next to the exporter will displ...
FIGURE 5.13 The two modes of operating OpAMP. On the left, the read-write mo...
Chapter 6
FIGURE 6.1 A diagram of how the OTel API and SDK are leveraged to provide tr...
FIGURE 6.2 A diagram of how the OTel API and SDK are leveraged to provide me...
FIGURE 6.3 A diagram of how the OTel API and SDK are leveraged to provide lo...
Chapter 7
FIGURE 7.1 Flow chart of the general approach for OTel adoption. While every...
FIGURE 7.2 Example of how to initially approach adopting OTel in a brownfiel...
FIGURE 7.3 Common agent architecture for brownfield deployments. Typically, ...
FIGURE 7.4 One possible side-by-side agent migration state. In this example,...
FIGURE 7.5 Example of how to approach adopting OTel in a brownfield deployme...
Chapter 8
FIGURE 8.1 An example of the context that may be helpful when troubleshootin...
FIGURE 8.2 An example of the types of information different signals provide ...
FIGURE 8.3 An example of an OTel instrumented service calling an uninstrumen...
FIGURE 8.4 An example of what is possible when analyzing trace data provided...
FIGURE 8.5 An example of metadata analysis of telemetry data for a particula...
FIGURE 8.6 An example of a Kibana dashboard containing multiple signals and ...
FIGURE 8.7 An example of a service dashboard from OpenSearch displaying trac...
FIGURE 8.8 An example of metadata analysis of traces from a specific service...
FIGURE 8.9 An example in Grafana Tempo of being able to query for logs assoc...
FIGURE 8.10 An example of OTel span events shown in Jaeger. Jaeger records O...
FIGURE 8.11 An example of linking or launch in context between logs and trac...
FIGURE 8.12 An example of an exemplar. On the left is a graph of the results...
Steve Flanders
To build and operate any complex system, whether it be inventory in warehouses, money in bank accounts, or large computer systems, you need to be able to understand what you have built and how it is currently operating. The observability tools that we rely on today have a long history; in one way or another, they have existed since the beginning of the computing industry. As relatively high-scale (for their era) computing services started to come about in the 1980s and early 1990s, commercial tools that analyzed their performance also became available. As the dot-com boom of the late 1990s and then the proliferation of easily accessible cloud infrastructure drove more and more firms to build high-scale web services, the market and capabilities of what we now call observability tools increased dramatically.
Throughout this period, one of the biggest challenges that these tools faced was how to get the right data into them. This is harder than it seems, as early solutions would capture some combination of logs and metrics, requiring integrations with a handful of operating systems and known technologies like databases and message queues. Getting visibility into a modern microservices environment requires distributed traces, application metrics, profiles, and other types of data that must be captured from every web framework, RPC system, database client, and so forth, each of which is different for each programming language. Each of these integrations must be maintained to ensure that it does not break when the data source gets updated; this is extremely expensive for vendors to build and for customers to set up, leading to poor coverage and leaving customers semipermanently locked in to their vendors.
We created OpenTelemetry to break this logjam. By providing a single set of APIs, agents, and a protocol, we allowed software developers to both emit and capture distributed traces, metrics, profiles, and logs easily and with the strong semantic conventions needed to gain valuable insights from analyzing it. This has fundamentally changed people’s relationships with observability tools. Thanks to OpenTelemetry, they are more accessible and widely used than ever before—and of higher quality, as vendors and open source solutions have redirected the effort that they used to spend on data collection to providing better solutions. Both end users and those who want to emit data from shared code are no longer locked in to vendor-specific interfaces, and anyone can take control of creating custom telemetry, filtering their data, and sending it wherever they would like. OpenTelemetry now has over 1,200 developers contributing to it every month, making it one of the largest open source projects in the world—a testament to its utility and how much it has changed things.
That being said, tools are only as good as one’s ability to properly use them, and OpenTelemetry is no exception. OpenTelemetry is now an essential part of building and operating services of any scale, and this book will guide you through the problems that it can be used to solve (and those that it should not), OpenTelemetry’s various components, best practices and examples of using OpenTelemetry successfully, and how to apply it to your codebase and organization to achieve your goals.
—Morgan McLean, Senior Director of Product Management, Splunk
Welcome to Mastering OpenTelemetry and Observability, a comprehensive guide designed to help you navigate the complex and ever-evolving landscape of observability. As organizations increasingly rely on distributed systems and microservices architectures, the need for robust observability solutions has never been greater. OpenTelemetry, or OTel as it is called, is an open source and vendor-agnostic observability framework. It has emerged as a critical technology in this field, providing standardized tools for collecting and analyzing telemetry data across various platforms and technologies. In addition, OTel is extensible, with the ability to handle the telemetry needs and observability platforms of today and the observability landscape of the future.
This book aims to equip you with the knowledge and skills necessary to harness the full potential of OTel and build a solid observability foundation. Whether you are a developer, DevOps engineer, site reliability engineer (SRE), sales engineer, support engineer, information technology (IT) manager, engineering manager (EM), product manager (PM), C-level executive, or really any role that involves software or infrastructure, the insights and practical guidance offered in this book will empower you to observe, diagnose, and optimize your systems effectively.
You will begin by exploring the fundamental concepts of observability, tracing its evolution from traditional monitoring practices to modern, holistic approaches. You will gain a deep understanding of the three pillars of observability—metrics, logs, and traces—and how they interrelate to provide a comprehensive view of system health and performance. The core of this book delves into OpenTelemetry, starting with its architecture and components, including the specification, instrumentation, and the Collector. Next, the OTel demo environment, known as the Astronomy Shop, is explored so you can experience the power of OTel firsthand. Deep dives into all the major components are provided, including step-by-step instructions on how to instrument your applications and collect, process, and send your telemetry data using OTel. You will also learn about important topics such as context propagation, distributions, and integrating OTel with popular observability platforms like Prometheus and Jaeger.
With a solid foundation in observability and OTel, you will move on to adopting and scaling observability in large and complex environments. From obtaining stakeholder buy-in to handling high volumes of telemetry data to ensuring performance and reliability, you will discover practical solutions to common challenges faced by organizations today. This is followed by considerations for observability platforms, whether existing or new.
Beyond technical guidance, this book also addresses the human and organizational aspects of observability. This is because building a culture of observability within your team and organization is crucial for success. This book discusses strategies for fostering collaboration, continuous improvement, and proactive incident response, ensuring that observability becomes an integral part of your operational practices. Finally, this book explores emerging trends and innovations in observability, including the role of artificial intelligence (AI) and machine learning (ML) in predictive analytics, the evolution of observability standards, and the potential impact of new technologies on the industry.
Mastering OpenTelemetry and Observability is more than just a technical manual; it is a journey into the heart of modern system monitoring and optimization. By the end of this book, you will have the knowledge and confidence to implement robust observability solutions that enhance your system’s reliability, performance, and overall user experience.
Before you begin reading, there are a few things to know:
This book has been written in a way that tries to make it approachable to the largest audience possible. Examples of this include:
To make it easier for non-native English speakers, the book does not use contractions.
Every abbreviation used in every chapter is defined first.
Relatable examples and metaphors will be found throughout the book.
Hyperlinks to additional information are provided throughout the book so you can learn more about the topics being discussed.
A fictitious but likely relatable story is embedded into every chapter. Through it, you will learn how an enterprise company migrating to the cloud was struggling to achieve observability. With each challenge experienced, you will see how a determined site reliability engineer (SRE) helps her company embrace OTel and improve observability.
Some terminology is used throughout this book that you should be aware of, including:
Back end: The data access layer of an application, which often includes processing and persistence of data.
Framework: A structure on which other things are built. For example, OTel is a telemetry framework that can be extended to support various use cases.
Front end: The presentation layer of an application, which is often a user interface (UI) or user-facing way to interact with an application.
Instrumentation: Software added to an application to generate telemetry data. Various forms of instrumentation are available, including automatic, which is injected at runtime; manual, which is added with the existing code; and programmatic, which is a particular form of manual instrumentation where specific libraries or frameworks have already been instrumented (also called instrumentation libraries).
Platform: An environment in which software is executed. An observability or monitoring platform typically consists of one or more back end and front end components.
Telemetry: Data used to determine the health, performance, and usage of applications. Examples of telemetry include metrics, logs, and traces. This data is typically sent to a platform or back end.
The OTel project is constantly evolving, and changes are frequently released. The examples provided in this book were tested against specific versions of OTel. Where possible, they were created in a generic way that should work as the project advances. With that said, it is possible that changes have been made that will result in differences from what is documented. If this occurs, checking the GitHub repository associated with this book (covered next) and reading the latest OTel documentation and release notes is recommended. The minimal recommended and maximum tested versions of OTel components for this book are as follows:
OTel Demo, also known as the Astronomy Shop, version 1.11 is the minimum supported version. This is to get OpenSearch support. Up to version 1.11.1 has been tested.
Collector (core and contrib) version 0.95.0 is the minimum supported version. This is to get JSON encoding for the OTLP receiver and exporter. Up to version 0.109.0 has been tested.
Python instrumentation version 1.23.0/0.44b0 is the minimum supported version. This is to get support for Flask and Werkzeug 3.0 or higher. Up to version 1.27.0/0.48b0 has been tested.
This book is accompanied by a GitHub repository, which can be found at https://github.com/flands/mastering-otel-book and will be updated at least annually. If you notice any issues with the information presented in this book, please open a GitHub issue. The contents of this repository include:
All code examples provided in the book
Status information about OTel components
Post-production modifications
Changes to support the latest OTel advancements
The Mastering series from Sybex provides outstanding instruction for readers with intermediate and advanced skills, in the form of top-notch training and development for those already working in their field and clear, serious education for those aspiring to become pros. Every Mastering book includes:
Real-World Scenarios, ranging from case studies to interviews, that show how the tool, technique, or knowledge presented is applied in actual practice
Skill-based instruction, with chapters organized around real tasks rather than abstract concepts or subjects
Self-review test questions, so you can be certain you’re equipped to do the job right
In modern software development and operations, observability has emerged as a fundamental concept essential for maintaining and improving the performance, reliability, and scalability of complex systems. But what exactly is observability? At its core, observability is the practice of gaining insights into the internal states and behaviors of systems through the collection, analysis, and visualization of telemetry data. Unlike traditional monitoring, which primarily focuses on predefined metrics and thresholds, observability offers a more comprehensive and dynamic approach, enabling teams to proactively detect, diagnose, and resolve issues.
This chapter will explore the principles and components of observability, highlighting its significance in today’s distributed and microservices-based architectures. Through a deep dive into the three pillars of observability—metrics, logs, and traces—you will understand the groundwork for how observability can transform the way resilient systems are built and managed.
IN THIS CHAPTER, YOU WILL LEARN TO:
Differentiate between monitoring and observability
Explain the importance of metadata
Identify the differences between telemetry signals
Distinguish between instrumentation and data collection
Analyze the requirements for choosing an observability platform
So, what is observability in the realm of modern software development and operations? While many definitions exist, they all generally refer to observability providing the ability to quickly identify availability and performance problems, regardless of whether they have been experienced before, and help perform problem isolation, root cause analysis, and remediation. Because observability is about making it easier to understand complex systems and address unperceived issues, often referred to in the software industry as unknown unknowns,1 the data collected must be correlated across different telemetry types and be rich enough and immediately accessible to answer questions during a live incident.
The Cloud Native Computing Foundation (CNCF), described more fully later in this chapter, provides a definition for the term observability:2
Observability is a system property that defines the degree to which the system can generate actionable insights. It allows users to understand a system’s state from these external outputs and take (corrective) action.
Computer systems are measured by observing low-level signals such as CPU time, memory, disk space, and higher-level and business signals, including API response times, errors, transactions per second, etc. These observable systems are observed (or monitored) through specialized tools, so-called observability tools. A list of these tools can be viewed in the Cloud Native Landscape’s observability section.3
Observable systems yield meaningful, actionable data to their operators, allowing them to achieve favorable outcomes (faster incident response, increased developer productivity) and less toil and downtime.
Consequently, the observability of a system will significantly impact its operating and development costs.
While the CNCF’s definition is good, it is missing a few critical aspects:
The goal of observability should be where a system's state can be fully understood from its external output without the need to ship code. This means you should be able to ask novel questions about your observability data, especially questions you had not thought of beforehand.
Observability is not just about collecting data but about collecting meaningful data, such as data with context and correlated across different sources, and storing it on a platform that offers rich analytics and query capabilities across signals.
A system is truly observable when you can troubleshoot without prior knowledge of the system.
The OpenTelemetry project, which will be introduced in Chapter 2, “Introducing OpenTelemetry!,” provides a definition of observability that is worth highlighting:
Observability lets you understand a system from the outside, by letting us ask questions about that system without knowing its inner workings. Furthermore, it allows you to easily troubleshoot and handle novel problems—that is, “unknown unknowns.” It also helps you answer the question, “Why is this happening?”
To ask those questions about your system, your application must be properly instrumented. That is, the application code must emit signals such as traces, metrics, and logs. An application is properly instrumented when developers don’t need to add more instrumentation to troubleshoot an issue, because they have all of the information they need.4
In short, observability is about collecting critical telemetry data with relevant context and using that data to quickly determine your systems’ behavior and health. Observability goes beyond mere monitoring by enabling a proactive and comprehensive understanding of system behavior, facilitating quicker detection, diagnosis, and resolution of issues. This capability is crucial in today’s fast-paced, microservices-driven, distributed environments, where the complexity and dynamic nature of systems demand robust and flexible observability solutions. Through the lens of the CNCF and OpenTelemetry, you can see observability is not just defined as a set of tools and practices but as a fundamental shift toward more resilient, reliable, and efficient system management.
Riley (she/her) is an experienced site reliability engineer (SRE) with deep observability and operations experience. She recently joined Jupiterian to address their observability problems and work with a new vendor. Riley joined Jupiterian from a large private equity (PE) advertising company, where she was the technical lead of the SRE team and was responsible for a large-scale, globally distributed, cloud native architecture. Before that, she was the founding member of a growth startup where she developed observability practices and culture while helping scale the business to over three million dollars in annual recurring revenue (ARR). Riley was excited about the challenge and opportunity of building observability practices from the ground up at a public enterprise company transitioning to the cloud.
Jupiterian is an e-commerce company that has been around for more than two decades. Over the last five years, the company has seen a massive influx of customers and has been on a journey to modernize its tech stack to keep up with demand and the competition. As part of these changes, it has been migrating from its on-premises monolithic application to a microservices-based architecture running on Kubernetes (K8s) and deployed in the cloud. Recently, outages have been plaguing the new architecture—a problem threatening the company and one that needed to be resolved before the annual peak traffic expected during the upcoming holiday season.
For the original architecture, the company had been using Zabbix, an open source monitoring solution to monitor the environment. The IT team was beginning to learn about DevOps practices and had set up Prometheus for the new architecture. Given organizational constraints and priorities, they did not have the time to develop the skill set to manage it and the ever-increasing number of collected metrics. In short, a critical piece of the new architecture was without ownership. On top of this, engineering teams continued to add data, dashboards, and alerts without defined standards or processes. Not surprisingly, this resulted in the company having difficulty proactively identifying availability and performance issues. It also resulted in various observability issues, including Prometheus availability, blind spots, and alert storms. In terms of observability, the company frequently experienced infrastructure issues and could not tell if it was because of an architecture limitation or an improper use of the new infrastructure. As a result, engineers feared going on-call, and innovation velocity was significantly below average.
The Jupiterian engineering team had been pushing management to invest more in observability and SRE. Instead, head count remained flat, and the product roadmaps, driven primarily by the sales team, continued to take priority. With the service missing its service-level agreement (SLA) target for the last three months, leadership demanded a focus on resiliency. To address the problem, the Chief Technology Officer (CTO) signed a three-year deal with Watchwhale, an observability vendor, so the company could focus on its core intellectual property (IP) instead of managing third-party software. An architect in the office of the CTO vetted the vendor and its technology. Given other organizational priorities, the engineering team was largely uninvolved in the proof of concept (PoC). The Vice President (VP) of Engineering was tasked with ensuring the service’s SLA was consistently hit ahead of the holiday period as well as the adoption and success of the Watchwhale product. He allocated one of his budget IDs (BIDs) for a senior SRE position, which led to Riley being hired.
The term observability has been around since at least the mid-20th century and is mainly credited to Rudolf E. Kálmán, a Hungarian American engineer who used it in a paper about control theory.5 Since then, the term has been used in various fields, including quantum mechanics, physics, statistics, and perhaps most recently, software development. Kálmán’s definition of observability can be summarized as a measure of how well the internal states of a system can be inferred from knowledge of its external outputs.6
Observability is often abbreviated as O11y (the letter O, the number 11, and the letter y), as there are 11 characters between the letter O and the letter y. While it is the number 11, the ones are pronounced as the letter l—thus, the abbreviation is pronounced Ollie. This abbreviation standard is common for longer words in software. For example, Kubernetes, a popular cloud native open source project, is often referred to as K8s and pronounced kay-ates for the same reason.
In software, the term observability has become popular due to the rise of cloud native workloads. Since the turn of the century, the software industry has seen a progression that has included moving from bare metal machines to virtual machines (VMs) to containers. In addition, there has been a shift from owning, deploying, and managing hardware to leasing data center equipment to deploying in the cloud. But what does cloud native mean? One way to answer this question is to look to the CNCF. The foundation is part of the Linux Foundation and defines itself as:
The open source, vendor-neutral hub of cloud native computing, hosting projects like Kubernetes and Prometheus to make cloud native universal and sustainable.7
Perhaps not surprisingly, the CNCF has created a definition for the term cloud native:
Cloud native practices empower organizations to develop, build, and deploy workloads in computing environments (public, private, hybrid cloud) to meet their organizational needs at scale in a programmatic and repeatable manner. They are characterized by loosely coupled systems that interoperate in a manner that is secure, resilient, manageable, sustainable, and observable.
Cloud native technologies and architectures typically consist of some combination of containers, service meshes, multi-tenancy, microservices, immutable infrastructure, serverless, and declarative APIs—this list is non-exhaustive.
Before the cloud native era, it was common to see patterns including on-premises software, monoliths, separate development and operations teams, and waterfall software development with long release cycles. In this prior generation, the term observability had not been adopted yet, and instead, the term monitoring was used. Sometimes, these terms are used interchangeably, but their meanings are not identical. The Merriam-Webster dictionary defines monitoring as the ability “to watch, keep track of, or check usually for a special purpose.”8 It defines observability as the ability “to come to realize or know especially through consideration of noted facts.”9 The distinction between monitoring and observability is important. With monitoring, you track items but must infer why something occurred or how it is related to another event. With observability, you use information to prove facts and use that knowledge to determine how or why something behaves the way it does. Observability allows for first principle thinking, or the ability to validate assumptions not deduced from another assumption.10
In software, both observability and monitoring rely on specific data types—primarily metrics and logs with some tracing—but the usage of the data differs. Before the cloud native era, most software ran on-premises and was often developed and deployed as a monolith or single code base or application. As a result, problem isolation, or where the problem originated, was easy to identify when issues occurred, and scaling typically consisted of adding more resources to the monolith, known as scaling up or scaling vertically. When issues arose, the problem was either the monolith, the infrastructure the monolith was running on top of, or whatever application was calling into or called by the monolith (see Figure 1.1). To monitor the monolith, operational teams needed the ability to be alerted about specific, known symptoms, sometimes referred to as known knowns. Monitoring systems did exactly that.
To provide monitoring, either your application needs to be instrumented to emit health data or you are required to infer the health of the application by watching its external behavior. In either case, the data collected needs to be able to track and answer questions about availability, performance, and security. This data collection needs to be added before issues happen; otherwise, you cannot proactively determine nor quickly resolve the problems as they arise.
FIGURE 1.1 An example of a monolithic application experiencing an issue. The square represents the monolith, while the circles represent different functions or features within the monolith. In this example, the B function is experiencing problems, denoted by the service’s gray shading. This may or may not result in issues with the A and C functions.
There are two different types of monitoring. First, there is monitoring based on data exposed from the internals of the system. This means the application makes specific data available for external systems to gather. This type of monitoring is sometimes called white box monitoring because you can see into the system,11 though a better name would be internally provided monitoring. Second, there is monitoring based on external behavior. This means the application does not make any data available beyond what is required for the application to function. As such, an external system must infer what an application is doing. This type of monitoring is sometimes called black box monitoring because you cannot see into the system,12 though a better name would be externally provided monitoring.13
In many cases, application developers add instrumentation as necessary, including to measure performance and investigate issues during development and operations. Engineers responsible for monitoring the health and performance of these applications would typically send telemetry data to a monitoring platform. Based on this telemetry data, the engineer would then define alerts with static thresholds. To determine these thresholds, an engineer would need to know what problems to expect beforehand, thus enabling proactive monitoring; otherwise, new thresholds would have to be defined after an issue is identified, which is known as reactive monitoring. One way to think about monitoring is like a doctor who collects certain pieces of information from a person and compares that data against known baselines to understand the symptoms being experienced and to determine the health of the person. The monitoring of heart rate, blood pressure, and temperature in humans is like the monitoring of CPU (central processing unit), memory, and disk usage in applications.
While monitoring with static thresholds provides some awareness of potential system issues, it is not without its limitations. Take, for example, CPU utilization, which represents the rate at which an application is operating expressed as a percentage. If CPU utilization is very high, this could be a symptom of a system issue and, as such, something you want to be notified about. For example, you could define an alert when the CPU utilization exceeds 95 percent for some period of time. In fact, such a definition is common in traditional monitoring applications. The problem is, such an alert may not indicate a problem but instead indicate that the application is using its resources efficiently. What is missing from this symptom is context, including how other related components are behaving, and correlation, including changes within the environment. Another limitation of traditional monitoring tools is the difficulty in alerting on issues that do not manifest as high resource consumption or latency.
The introduction of cloud native workloads made traditional monitoring even less effective. In this new world, workloads are run in the cloud and often consist of many small applications, called microservices, that are isolated to individual functionality. For example, an authentication service or a notification service. Microservices make it easier to deploy more instances, known as scaling out or scaling horizontally, and allow for specific components to be scaled as needed. These microservices typically run on immutable infrastructure using declarative APIs (application programming interfaces). In addition, they are run with DevOps practices and with the help of site reliability engineers (SREs).14 Software release cycles are also more frequent and leverage continuous integration and continuous deployment or CI/CD pipelines. The decoupling and elasticity of applications enable developers to reduce duplicated efforts and scale to meet demand, but often at the cost of being able to troubleshoot the system and keep it available. In this era, it is the “unknown unknowns” that need to be addressed.
Due to the difficulty in troubleshooting microservice-based architectures, a popular meme was shared throughout the community:
“We replaced our monolith with micro services so that every outage could be more like a murder mystery.” @honest_update15
With cloud native workloads, problem isolation became a problem. This is because when one microservice has an issue, it could impact upstream or downstream services, causing them to have problems as well (see Figure 1.2). Using traditional monitoring, the net result is alert storms and the need to investigate every issue on every service in order to get to the root cause and remediation. Of course, there are other issues with cloud native workloads as well. For example, there is an inability to have complete visibility into the infrastructure as it is being managed by a third party and prone to dynamic changes.
FIGURE 1.2 An example of a microservice-based architecture experiencing an issue. Each circle represents a different microservice. In this example, multiple microservices are experiencing an issue, denoted with gray lines, though one service is the root cause of the problem, denoted with gray shading. Note not all services called by the root cause service are impacted.
Going back to the doctor analogy, assume you have a large group of people who are all part of the same community, and multiple people become sick around the same time. While you may want to help everyone experiencing symptoms concurrently, it requires many doctors and resources. In addition, focusing on the symptoms of the patients does not address the root cause issue, which is that people are getting sick from something, and it is spreading instead of being contained. The sickness may cause other problems to arise as well. For example, doctors may become sick and thus become unable to care for patients, or businesses might need to shut down because they do not have enough employees to work. Without containment, an infectious disease can spread uncontrollably. This analogy is similar to the changes necessary due to the shift to cloud native workloads. For example, instead of paging all service owners during an outage and burning out engineers, the more sustainable approach is to contain the problem and page the root cause service. Observability helps with containment.
When dealing with complex systems, it is ideal when you can address things you are aware of and understand, referred to as known knowns, as well as things you are not aware of and do not understand, referred to as unknown unknowns, using the same solution. See Table 1.1 for different states of awareness and understanding. A goal of observability is to provide the ability to answer the “unknown unknowns.” At the same time, it contains the building blocks necessary to address the “known knowns.” As a result, observability may be considered a superset of monitoring.
TABLE 1.1: A 2×2 matrix showing states of awareness and understanding. Monitoring systems are optimized to address "known knowns," whereas observability systems can address all aspects but especially "unknown unknowns."
                      UNDERSTANDING: KNOWN            UNDERSTANDING: UNKNOWN
AWARENESS: KNOWN      Aware of and understand         Aware of but do not understand
AWARENESS: UNKNOWN    Understand but not aware of     Neither aware of nor understand
When you hear the term observability, you may initially think about data sources such as metrics, logs, and traces. These terms will be introduced in the next section, but something just as important as the data source information is metadata. While a fancy word, metadata is just data about other data. For example, if you generate and collect a metric, such as the total number of HTTP requests, it may also be helpful to know other information about that metric, such as which host it is running on or what HTTP status code was returned for that request. These additional pieces of information are known as metadata and are typically attached to traditional data source information, such as metrics, logs, and traces. Metadata may go by other names as well, including tags, labels, attributes, and resources.
Metadata is powerful because it provides additional information to data sources, which helps with problem isolation and remediation. This information may even contain context and correlation, topics explored in Chapter 8, "The Power of Context and Correlation." Without metadata, observability is harder to achieve. Metadata is typically represented as a key-value pair, such as foo="bar". The key is the name for the piece of metadata and is often referred to as a dimension. The value can be of various forms, including numbers or strings, and the uniqueness of the values is referred to as cardinality. Other ways to represent metadata also exist. For example, in unstructured log records, metadata is sometimes presented as just a value where the name is inferred—an example is provided in the "Logs" section later in this chapter.
In observability, Dimensionality refers to the number of unique keys (sometimes called names) within a set. It is represented by attributes or labels associated with telemetry data, allowing for more granular and detailed analysis. Each piece of telemetry data can have multiple dimensions that provide context about the data. These dimensions enable the grouping, filtering, and slicing of data along various axes, which is crucial for deep analysis and troubleshooting. Examples of dimensions include:
Time
Application, such as service.name and service.version
Host, such as host.name and host.arch
User, such as enduser.id and enduser.role
HTTP, such as http.route and http.response.status_code
These dimensions would allow you to ask your telemetry to show you data such as:
All 502 errors in the last half hour for host foo
All 403 requests against the /export endpoint made by user bar
Dimensionality, which may also be referred to as the width of telemetry data, is a foundational concept in observability that greatly enhances the depth and utility of telemetry data. It matters because it enables more detailed, contextualized, and actionable insights, which are essential for maintaining and improving the performance and reliability of modern distributed systems. In practice, dimensions are indexed by observability platforms to support capabilities, including auto-complete and real-time analysis of key-value pairs, that assist with troubleshooting.
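To make this concrete, here is a minimal sketch of attaching several of the dimensions listed above to a span using the OTel Python API, which Chapter 6, "Leveraging OpenTelemetry Instrumentation," covers in depth. The names and values shown are assumptions for illustration, and exporter configuration is omitted:

# A minimal sketch of attaching dimensions (attributes) to a span with the
# OTel Python API. Assumes the opentelemetry-api and opentelemetry-sdk
# packages are installed; exporter setup is omitted for brevity.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("example.dimensions")

with tracer.start_as_current_span("GET /export") as span:
    # Each key-value pair is a dimension that an observability platform can
    # index, filter, and group by. The key names follow common semantic
    # conventions, discussed later in this section.
    span.set_attribute("http.route", "/export")
    span.set_attribute("http.response.status_code", 403)
    span.set_attribute("enduser.id", "bar")

With attributes like these in place, a question such as "all 403 requests against the /export endpoint made by user bar" becomes a simple filter on indexed keys.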
In observability, Cardinality refers to the number of unique values for a given key within a set. High cardinality refers to a large number of unique values, whereas low cardinality indicates fewer unique values. For example, a dimension like HTTP status code, which includes values such as 404 or 500, is bounded and has low cardinality, whereas a dimension like a user, session, or transaction ID is unbounded and is likely to have high cardinality. Monitoring and observability platforms care about cardinality. For example, if a platform supports indexing of keys, it likely needs to return values for those indexed keys quickly. For high cardinality metadata, this can prove challenging to visualize and very expensive to compute. In short, cardinality affects the storage, performance, and usability of telemetry data. High cardinality presents both opportunities for detailed insights and challenges in terms of resource consumption and data management. Effectively managing cardinality is essential for maintaining scalable, efficient, and actionable observability systems.
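As a rough, hypothetical illustration of why cardinality matters, the sketch below shows how the worst-case number of distinct time series grows multiplicatively with the number of unique values per dimension:

# Hypothetical cardinalities for three dimensions attached to a single metric.
status_codes = 5          # low cardinality, bounded
http_routes = 200         # moderate cardinality
user_ids = 1_000_000      # high cardinality, unbounded

# In the worst case, every combination becomes its own time series that the
# platform must store and index.
max_series = status_codes * http_routes * user_ids
print(f"Up to {max_series:,} distinct series")  # Up to 1,000,000,000 distinct series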
Another concept you should be aware of is semantic conventions, or semconvs for short. These are standardized dimensions, or keys, for metadata and ensure consistency in how data is recorded, labeled, and interpreted across different systems and services. It may also contain standardized cardinality, or values for these dimensions. For example, it is common to have semconvs for HTTP-related data. An example of this may include the key for the HTTP route, such as http.route, or the response status code, such as http.response.status_code. Semconvs can be grouped into multiple different categories, such as the aforementioned HTTP. Other categories would include databases, exceptions, host metrics, function as a service, and messaging, to name a few. Each category would have multiple semconvs defined. Semconvs may be signal specific or apply to more than one signal type. Semconvs matter because they enable context and correlation and provide data portability. For example, if the same key is used to represent the same data, then it is easy to see its behavior across systems and environments. In addition, if keys are consistently named, they can be leveraged identically across different platforms.
Metadata can contain sensitive information. For example, names or email addresses may be attached to data sources and leak personally identifiable information (PII). In addition, internal business logic, such as Internet protocol (IP) addresses or hostnames, may be considered sensitive information. This information would generally only be sent to the configured observability platforms, but that configuration can change over time. In addition, while only restricted users may have access to the platform data, for example, employees of a company authenticated via Security Assertion Markup Language (SAML), without proper data permissions, such as role-based access control (RBAC), it is possible that sensitive information is exposed to employees who should not have access to such information. Given that metadata can contain virtually anything, care must be taken to ensure proper data configuration, scrubbing, and access control.
The three pillars of observability is an industry phrase that you have likely encountered. The three pillars refer to metrics, logs, and traces. While these pillars are just data sources and do not inherently provide observability, they are recognized as fundamental types of telemetry data needed to understand the behavior and performance of systems. Another comparable term or acronym in the observability space is MELT, which stands for metrics, events, logs, and traces. These are the most common data sources, but they are far from exhaustive. Other examples include profiling and sessions. Data sources have a variety of names in the industry, including diagnostics, telemetry, signals, or data sources. For the purposes of this book, and in alignment with OpenTelemetry, the term signals will be used going forward. It is important to note that signals do not inherently provide observability, though they are necessary to enable it.
A metric, sometimes referred to as a metric record, measurement, or metric time series (MTS), is a set of data points represented as a time series with metadata. A time series is a set of data points over some period of time. To generate a time series, an instrument takes one or more measurements. For example, a speedometer measures speed, and a measurement could be taken every tenth of a second but recorded every minute. Metrics also have signal-specific metadata terms. For example, attributes, dimensions, labels, and resources are all terms used with metrics that refer to some kind of metadata.
A metric contains a name, value, timestamp, and optionally additional metadata. Note that multiple types of metric values exist. For example, it may be a single value, such as a counter, or a multi-value, such as a histogram. Here is an example of a metric from Prometheus, an open source metric solution that will be described in more detail later:
http_requests_total{method="post",code="200"} 1027 | 1395066363000
The example Prometheus metric is made up of various components, including:
Name—http_requests_total
Metadata—{method="post",code="200"}
Value—1027
Timestamp—1395066363000
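For comparison, here is a sketch of how a similar counter could be recorded through the OTel metrics API, which later chapters cover in detail. The meter name and setup shown here are assumptions, and a real configuration would also attach a metric exporter:

# A minimal sketch of recording a counter with metadata (attributes) using the
# OTel Python metrics API. Assumes the opentelemetry-api and opentelemetry-sdk
# packages are installed.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider

metrics.set_meter_provider(MeterProvider())
meter = metrics.get_meter("example.metrics")

http_requests_total = meter.create_counter(
    "http_requests_total",
    description="Total number of HTTP requests",
)

# The attributes play the same role as the {method="post",code="200"} metadata
# in the Prometheus example above; the timestamp is added by the SDK.
http_requests_total.add(1, {"method": "post", "code": "200"})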
Metrics are one of the primary data sources used to engage on-call engineers as well as troubleshoot availability and performance issues. It is pervasive for alerts and dashboards to be configured based on metric data. Generally, aggregated metrics, like those shown in Figure 1.3, provide the most value because they identify behaviors over time and can be used to determine anomalies. Some popular methods for analyzing aggregated metrics include:
RED, which stands for requests, errors, and duration, and was popularized by Tom Wilkie.16 The idea is for every object to monitor the number of requests, the number of those requests that result in an error, and the amount of time those requests take. In general, this information can be used to determine user experience (see the sketch after this list).
USE, which stands for utilization, saturation, and errors, and was popularized by Brendan Gregg.17 The idea is for every object to monitor the percentage of time the object was busy, the amount of work (queue size) for the object, and the number of errors. In general, this information can be used to determine object experience.
Four golden signals, which include latency, traffic, errors, and saturation, and were popularized by the Google SRE Handbook.18 The idea is for every object to monitor the time it takes to service a request, the amount of demand placed on the object, the rate of requests that fail, and the fullness of the object. This is like RED but includes saturation.
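The following small sketch computes the three RED values for one service from a batch of hypothetical request records; real platforms derive the same values from metric or span data rather than raw records:

# A sketch of computing RED (requests, errors, duration) for one service over a
# time window using hypothetical request records.
from statistics import quantiles

requests = [
    {"status": 200, "duration_ms": 45},
    {"status": 200, "duration_ms": 61},
    {"status": 500, "duration_ms": 900},
    {"status": 200, "duration_ms": 38},
]

rate = len(requests)                                             # R: requests in the window
errors = sum(1 for r in requests if r["status"] >= 500)          # E: requests that failed
p95 = quantiles([r["duration_ms"] for r in requests], n=20)[-1]  # D: ~95th percentile duration

print(f"requests={rate} error_rate={errors / rate:.1%} p95={p95:.0f}ms")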
In addition, metrics are used to define service-level indicators (SLIs) that measure the performance of applications. These SLIs are used to define and measure service-level objectives (SLOs), which determine whether applications are operating within acceptable bounds. Service-level agreements (SLAs) are also defined and calculated based on metrics to determine whether applications are meeting specified customer expectations.
FIGURE 1.3 A Grafana dashboard displaying aggregate metric information.
SLIs, SLOs, and SLAs are critical topics that are outside the scope of this book. If you are looking to learn more about these concepts, be sure to read the Google Site Reliability Engineering (SRE) book, which is freely available online.19
Given the ever-increasing number of objects in an environment and the need to collect more and more data, metric platforms need to be able to process and store a large number of metrics quickly. Various techniques are used to control the amount of data generated, processed, and stored. For example, the interval at which metrics are generated within the application or stored within an observability platform can be different from the resolution displayed in charts. Aggregation techniques are used to achieve these different granularities, including aggregation policies and rollups. In short, these strategies provide a summarized view of granular data over specific time intervals. Regardless of the techniques used, end users consume charts or alerts from this collected, analyzed, and queried data.
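A simplified sketch of a rollup is shown below, collapsing hypothetical 10-second samples into 1-minute averages; production systems apply similar logic at much larger scale and with more aggregation functions:

# A sketch of a time-based rollup: aggregate 10-second samples into 1-minute averages.
from collections import defaultdict

# (unix_timestamp, value) pairs collected every 10 seconds (hypothetical data).
samples = [(1395066300 + i * 10, 50 + i) for i in range(12)]  # two minutes of samples

buckets = defaultdict(list)
for ts, value in samples:
    buckets[ts - (ts % 60)].append(value)  # group by the start of each minute

one_minute_rollup = {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}
print(one_minute_rollup)  # two entries, one average per minute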
Several open source metric instrumentation frameworks and standards have become popular over the years. For example, the following solutions were popular in the monitoring era:
StatsD (https://github.com/statsd/statsd)
Graphite (https://graphite.readthedocs.io/en/stable/overview.html)
Nagios (https://www.nagios.org)
Telegraf (https://www.influxdata.com/time-series-platform/telegraf)
Zabbix (https://www.zabbix.com)
In the observability era, the following projects have gained popularity:
Grafana (https://grafana.com/grafana)
OpenTelemetry (https://opentelemetry.io)
Prometheus (https://prometheus.io)
M3 (https://m3db.io)
Programming languages also have their own frameworks that can be leveraged to generate metric data. For example:
Java Management Extensions (JMX) (https://en.wikipedia.org/wiki/Java_Management_Extensions) for Java
System.Diagnostics.Metrics for .NET
There are also a variety of open source third-party frameworks, such as Micrometer (https://micrometer.io) for Java.
A log, sometimes called a log record, is a time-based event with metadata. A log typically contains a timestamp, severity, message, and optionally additional metadata. Logs have signal-specific metadata terms. For example, attributes, fields, and resources are all terms used with logs that refer to some kind of metadata.
A log can be either:
Structured, meaning the message and other components are stated in a known regular syntax, making them easily recognizable and parsable. Structured logs are becoming the standard in cloud native workloads. Here is an example of a structured log in JavaScript Object Notation (JSON) format:
"@timestamp":"2024-07-01T10:07:13.425Z", "log.level": "INFO", "message":
"Tomcat started on port(s): 8080 (http) with context path ''" "service.name":
"springpetclinic","process.thread.name":"restartedMain","log.logger":
"org.springframework.boot.web.embedded.tomcat.TomcatWebServer"}
or
Unstructured, meaning the message is a string that could contain almost anything and whose metadata may be irregular and inconsistent, requiring the parsed format to be dynamically inferred and often incomplete. Unstructured logs were more common before cloud native workloads, but they are still present given the broad adoption of unstructured syslog from legacy systems. Here is an example of an unstructured syslog message:
212.87.37.154 - - [01/Jul/2024:10:07:13 +0000] "GET /favicon.ico HTTP/1.1" 200
3638 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
In the preceding example, the 200 is an example of metadata where the name is missing and must be inferred. The value represents the HTTP response status code.
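As an illustration of inferring that structure, the sketch below uses a regular expression (an assumption for this example, not part of any particular logging product) to turn the unstructured line above into named fields, including the bare 200 status code:

# A sketch of parsing an unstructured access log line so that implied metadata,
# such as the HTTP response status code, becomes explicitly named fields.
import re

line = ('212.87.37.154 - - [01/Jul/2024:10:07:13 +0000] '
        '"GET /favicon.ico HTTP/1.1" 200 3638 "-" "Mozilla/5.0 ..."')

pattern = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

match = pattern.match(line)
if match:
    fields = match.groupdict()
    print(fields["status"], fields["path"])  # 200 /favicon.ico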
In addition to the format, logs can also be of different types. For example, events may be thought of as a unique signal but could also be thought of as a specific kind of log. Events contain the same information as a log record and could be structured or unstructured. What makes events unique is that they indicate that something happened. For example, a deployment to production could be denoted as an event and used to determine whether key performance indicators (KPIs) changed because of the deployment. Aligned with OpenTelemetry, events will be treated as a subtype of log records throughout this book.
Individual logs can be helpful in determining the root cause of issues as well as for security use cases, including providing an audit log of changes. Logs can also contain metric data. If properly parsed, aggregate data extracted from logs can be helpful to determine the health and behavior of a system. In either case, log payloads are significantly larger than metrics, and as such, more data needs to be processed and stored.
It is common for log data required for security purposes to have requirements about collecting all the data for a minimal period of time and even guaranteeing the data is not lost through, for example, a disk-based queue. In addition, keeping logs that identify root causes is also essential. As a result, collecting all logs or at least all logs at a certain severity level is common. Severity is a term used in logging to determine the type of logs to collect. The general severity levels are TRACE, DEBUG, INFO, WARN, ERROR, and FATAL. Care should be taken when deciding which severity level to collect, as improper log collection can impact observability.
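To illustrate, here is a small sketch of applying a severity threshold when deciding which log records to collect; the numeric ranks are assumptions that loosely follow the severity numbers used by the OpenTelemetry log data model:

# A sketch of filtering log records by severity before collection, using the
# TRACE < DEBUG < INFO < WARN < ERROR < FATAL ordering described above.
SEVERITY_RANK = {"TRACE": 1, "DEBUG": 5, "INFO": 9, "WARN": 13, "ERROR": 17, "FATAL": 21}

def should_collect(record_severity: str, minimum: str = "INFO") -> bool:
    """Return True if a record at record_severity meets the collection threshold."""
    return SEVERITY_RANK[record_severity] >= SEVERITY_RANK[minimum]

print(should_collect("DEBUG"))  # False: dropped, which may hide root-cause detail
print(should_collect("ERROR"))  # True: collected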
Logs have developed open standards over time thanks in part to syslog. Multiple Requests for Comments (RFCs)20 have been created for syslog, including:
RFC 3164 (https://datatracker.ietf.org/doc/html/rfc3164)
RFC 5424 (https://datatracker.ietf.org/doc/html/rfc5424)
RFC 5425 (https://datatracker.ietf.org/doc/html/rfc5425)
RFC 5426 (https://datatracker.ietf.org/doc/html/rfc5426)
RFC 6587 (https://datatracker.ietf.org/doc/html/rfc6587)