Description

As distributed systems become more complex and dynamic, their observability needs to grow to aid the development of holistic solutions for performance or usage analysis and debugging. Distributed tracing brings structure, correlation, causation, and consistency to your telemetry, thus allowing you to answer arbitrary questions about your system and creating a foundation for observability vendors to build visualizations and analytics.
Modern Distributed Tracing in .NET is your comprehensive guide to observability that focuses on tracing and performance analysis using a combination of telemetry signals and diagnostic tools. You'll begin by learning how to instrument your apps automatically as well as manually in a vendor-neutral way. Next, you’ll explore how to produce useful traces and metrics for typical cloud patterns and get insights into your system and investigate functional, configurational, and performance issues. The book is filled with instrumentation examples that help you grasp how to enrich auto-generated telemetry or produce your own to get the level of detail your system needs, along with controlling your costs with sampling, aggregation, and verbosity.
By the end of this book, you'll be ready to adopt and leverage tracing and other observability signals and tools and tailor them to your needs as your system evolves.




Modern Distributed Tracing in .NET

A practical guide to observability and performance analysis for microservices

Liudmila Molkova

BIRMINGHAM—MUMBAI

Modern Distributed Tracing in .NET

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Kunal Sawant

Publishing Product Manager: Akash Sharma

Book Project Manager: Manisha Singh

Senior Editor: Rohit Singh

Technical Editor: Maran Fernandes

Copy Editor: Safis Editing

Proofreader: Safis Editing

Indexer: Subalakshmi Govindhan

Production Designer: Shankar Kalbhor

Developer Relations Marketing Executive: Sonia Chauhan

First published: June 2023

Production reference: 1160623

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-83763-613-6

www.packtpub.com

To Sasha and Nick, the constant source of inspiration and joy.

- Liudmila Molkova

Foreword

I have had the pleasure of knowing and working with Liudmila for many years. She is a visionary architect with a gift for practical implementation. This book is a testament to her unique combination of skills. If you want to get familiar with the concepts of distributed tracing, this is the book for you. If you are tasked with observing your .NET application, this book is a good start. And if you are working on implementing and tweaking telemetry, this book will guide you so you won't get lost.

Distributed tracing is a powerful tool for understanding how your applications work. It can help you identify performance bottlenecks, troubleshoot errors, and improve the overall reliability of your system. Historically, distributed tracing was just an add-on, an afterthought. There are a slew of great tools that you can use to enable it. But like any add-on, it was often a big project for any decent-size application. With the rise of microservices and cloud environments, and with the increased pace of framework development, many solutions started lagging behind. This is when it became clear that distributed tracing must be a common language, spoken natively by all frameworks, clouds, and apps. You can find all you need to know about distributed tracing and its common language in the first chapter of this book.

There are many languages in the world. What makes .NET stand out is that beyond the extensibility points allowing for a great ecosystem of libraries and tools, there are many carefully integrated and well-supported built-in primitives. This way, most of the needs of any app are covered, and developers can concentrate on business logic. Distributed tracing became this built-in component. Liudmila herself designed primitives and integrated those primitives with .NET. So she knows what she is writing about in Chapters 2 and 3 of the book, showing how easy it is to get started with .NET applications observability.

I also enjoyed the often overlooked aspect of instrumenting brownfield applications. Liudmila knows how hard change is, especially in the world of .NET applications, where the backward compatibility standards are so high. This is why every .NET developer will appreciate Chapter 15 of the book.

Whether you’re an architect, a seasoned developer, or just getting started with distributed tracing, this book is an essential resource. I highly recommend it to anyone who wants to improve the performance and reliability of their .NET applications.

Sincerely,

Sergey Kanzhelev

Co-founder of OpenTelemetry and Co-chair of the W3C Distributed Tracing Working Group

Contributors

About the author

Liudmila Molkova is a Principal Software Engineer at Microsoft working on observability and client libraries. She is a co-author of distributed tracing implementations across the .NET ecosystem including HTTP client, Azure Functions, and Application Insights SDK. She currently works in the Azure SDK team on improving the developer experience and plays an Observability Architect role in the team. She’s also an active contributor to OpenTelemetry semantic conventions and instrumentation working groups.

Liudmila’s love of observability started at Skype, where she got first-hand experience in running complex systems at a high scale and was fascinated by how much telemetry can reveal even to those deeply familiar with the code.

I’m deeply grateful to those who made this book possible. I would like to thank Sergey for his mentorship and support; Vance and David for trailblazing the distributed tracing on .NET; Noah, Sourabh, Jamie, and Joy for their insightful feedback; Irina for making me believe in myself; and the amazing Packt team for their support and encouragement along the way. Most of all, my husband Pavel, as his reassurance and care were indispensable. Thank you!

About the reviewers

Joy Rathnayake is a Solutions Architect with over 20 years of experience and part of the Solution Architecture team at WSO2 Inc., Colombo. He has experience in architecting, designing, and developing software solutions using Microsoft and related technologies. Joy has a professional diploma in software engineering from NIIT.

He is a recognized Microsoft Most Valuable Professional (MVP) and Microsoft Certified Trainer (MCT). He has contributed to developing content for Microsoft Certifications and worked as an SME for several Microsoft exam development projects. He is a passionate speaker and has presented at various events.

In his spare time, Joy enjoys writing blogs, making videos, and reading. You can connect with him on LinkedIn.

Jamie Taylor (@podcasterJay) is an accomplished software developer, esteemed host of The .NET Core Podcast, a Microsoft MVP, and the proud recipient of the prestigious Managing Director of the Year award for 2023. With over a decade of experience in the industry, Jamie has garnered a reputation as a skilled and visionary leader in the field of modern .NET development.

Jamie’s expertise in software engineering, cross-platform development, and cloud hosting has positioned him as a sought-after speaker and thought leader in the .NET community. Through his podcast, he shares his wealth of knowledge, engaging listeners with his ability to simplify complex concepts and provide practical insights.

Sourabh Shirhatti is a product manager specializing in developer tools and frameworks. He currently works on the Developer Platform Team at Uber, building compelling experiences for Uber developers. Previously, he worked in the Developer Division at Microsoft, where he actively worked on observability features, including OpenTelemetry support for .NET. Sourabh’s passion lies in creating intuitive and efficient experiences for developers, enabling them to build high-quality software. Outside of work, he enjoys exploring Seattle’s vibrant culture and beautiful outdoors with his wife.

Table of Contents

Preface

Part 1: Introducing Distributed Tracing

1

Observability Needs of Modern Applications

Understanding why logs and counters are not enough

Logs

Events

Metrics and counters

What’s missing?

Introducing distributed tracing

Span

Tracing – building blocks

Reviewing context propagation

In-process propagation

Out-of-process propagation

Ensuring consistency and structure

Building application topology

Resource attributes

Performance analysis overview

The baseline

Investigating performance issues

Summary

Questions

Further reading

2

Native Monitoring in .NET

Technical requirements

Building a sample application

Log correlation

On-demand logging with dotnet-monitor

Monitoring with runtime counters

Enabling auto-collection with OpenTelemetry

Installing and configuring OpenTelemetry

Exploring auto-generated telemetry

Debugging

Performance

Summary

Questions

3

The .NET Observability Ecosystem

Technical requirements

Configuring cloud storage

Using instrumentations for popular libraries

Instrumenting application

Leveraging infrastructure

Configuring secrets

Configuring observability on Dapr

Tracing

Metrics

Instrumenting serverless environments

AWS Lambda

Azure Functions

Summary

Questions

4

Low-Level Performance Analysis with Diagnostic Tools

Technical requirements

Investigating common performance problems

Memory leaks

Thread pool starvation

Profiling

Inefficient code

Debugging locks

Using diagnostics tools in production

Continuous profiling

The dotnet-monitor tool

Summary

Questions

Part 2: Instrumenting .NET Applications

5

Configuration and Control Plane

Technical requirements

Controlling costs with sampling

Head-based sampling

Tail-based sampling

Enriching and filtering telemetry

Span processors

Customizing instrumentations

Resources

Metrics

Customizing context propagation

Processing a pipeline with the OpenTelemetry Collector

Summary

Questions

6

Tracing Your Code

Technical requirements

Tracing with System.Diagnostics or the OpenTelemetry API shim

Tracing with System.Diagnostics

Tracing with the OpenTelemetry API shim

Using ambient context

Recording events

When to use events

The ActivityEvent API

Correlating spans with links

Using links

Testing your instrumentation

Intercepting activities

Filtering relevant activities

Summary

Questions

7

Adding Custom Metrics

Technical requirements

Metrics in .NET – past and present

Cardinality

When to use metrics

Reporting metrics

Using counters

The Counter class

The UpDownCounter class

The ObservableCounter class

The ObservableUpDownCounter class

Using an asynchronous gauge

Using histograms

Summary

Questions

8

Writing Structured and Correlated Logs

Technical requirements

Logging evolution in .NET

Console

Trace

EventSource

ILogger

Logging with ILogger

Optimizing logging

Capturing logs with OpenTelemetry

Managing logging costs

Pipelines

Backends

Summary

Questions

Part 3: Observability for Common Cloud Scenarios

9

Best Practices

Technical requirements

Choosing the right signal

Getting more with less

Building a new application

Evolving applications

Performance-sensitive scenarios

Staying consistent with semantic conventions

Semantic conventions for HTTP requests

General considerations

Summary

Questions

10

Tracing Network Calls

Technical requirements

Instrumenting client calls

Instrumenting unary calls

Configuring instrumentation

Instrumenting server calls

Instrumenting streaming calls

Basic instrumentation

Tracing individual messages

Observability in action

Summary

Questions

11

Instrumenting Messaging Scenarios

Technical requirements

Observability in messaging scenarios

Messaging semantic conventions

Instrumenting the producer

Trace context propagation

Tracing a publish call

Producer metrics

Instrumenting the consumer

Tracing consumer operations

Consumer metrics

Instrumenting batching scenarios

Batching on a transport level

Processing batches

Performance analysis in messaging scenarios

Summary

Questions

12

Instrumenting Database Calls

Technical requirements

Instrumenting database calls

OpenTelemetry semantic conventions for databases

Tracing implementation

Tracing cache calls

Instrumenting composite calls

Adding metrics

Recording Redis metrics

Analyzing performance

Summary

Questions

Part 4: Implementing Distributed Tracing in Your Organization

13

Driving Change

Understanding the importance of observability

The cost of insufficient observability

The cost of an observability solution

The onboarding process

The pilot phase

Avoiding pitfalls

Continuous observability

Incorporating observability into the design process

Housekeeping

Summary

Questions

Further reading

14

Creating Your Own Conventions

Technical requirements

Defining custom conventions

Naming attributes

Sharing common schema and code

Sharing setup code

Codifying conventions

Using OpenTelemetry schemas and tools

Semantic conventions schema

Defining event conventions

Summary

Questions

15

Instrumenting Brownfield Applications

Technical requirements

Instrumenting legacy services

Legacy service as a leaf node

A legacy service in the middle

Choosing a reasonable level of instrumentation

Propagating context

Leveraging existing correlation formats

Passing context through a legacy service

Consolidating telemetry from legacy monitoring tools

Summary

Questions

Assessments

Index

Other Books You May Enjoy

Part 1: Introducing Distributed Tracing

In this part, we’ll introduce the core concepts of distributed tracing and demonstrate how it makes running cloud applications easier. We’ll auto-instrument our first service and explore the .NET approach to observability, built around OpenTelemetry.

This part has the following chapters:

Chapter 1, Observability Needs of Modern Applications

Chapter 2, Native Monitoring in .NET

Chapter 3, The .NET Observability Ecosystem

Chapter 4, Low-Level Performance Analysis with Diagnostic Tools

1

Observability Needs of Modern Applications

With the increasing complexity of distributed systems, we need better tools to build and operate our applications. Distributed tracing is one such technique that allows you to collect structured and correlated telemetry with minimum effort and enables observability vendors to build powerful analytics and automation.

In this chapter, we’ll explore common observability challenges and see how distributed tracing brings observability to our systems where logs and counters can’t. We’ll see how correlation and causation along with structured and consistent telemetry help answer arbitrary questions about the system and mitigate issues faster.

Here’s what you will learn:

An overview of monitoring techniques using counters, logs, and events

Core concepts of distributed tracing – the span and its structure

Context propagation standards

How to generate meaningful and consistent telemetry

How to use distributed tracing along with metrics and logs for performance analysis and debugging

By the end of this chapter, you will become familiar with the core concepts and building blocks of distributed tracing, which you will be able to use along with other telemetry signals to debug functional issues and investigate performance issues in distributed applications.

Understanding why logs and counters are not enough

Monitoring and observability cultures vary across the industry; some teams use ad hoc debugging with printf while others employ sophisticated observability solutions and automation. Still, almost every system uses a combination of common telemetry signals: logs, events, metrics or counters, and profiles. Telemetry collection alone is not enough. A system is observable if we can detect and investigate issues, and to achieve this, we need tools to store, index, visualize, and query the telemetry, navigate across different signals, and automate repetitive analysis.

Before we begin exploring tracing and discovering how it helps, let’s talk about other telemetry signals and their limitations.

Logs

A log is a record of some event. Logs typically have a timestamp, level, class name, and formatted message, and may also have a property bag with additional context.

Logs are a low-ceremony tool, with plenty of logging libraries and tools for any ecosystem.

Common problems with logging include the following:

Verbosity: Initially, we won’t have enough logs, but eventually, as we fill gaps, we will have too many. They become hard to read and expensive to store.

Performance: Logging is a common performance issue even when used wisely. It’s also very common to serialize objects or allocate strings for logging even when the logging level is disabled.

One new log statement can take your production down; I did it once. The log I added was written every millisecond. Multiplied by a number of service instances, it created an I/O bottleneck big enough to significantly increase latency and the error rate for users.

Not queryable: Logs coming from applications are intended for humans. We can add context and unify the format within our application and still only be able to filter logs by context properties. Logs change with every refactoring, disappear, or become out of date. New people joining a team need to learn logging semantics specific to a system, and the learning curve can be steep.

No correlation: Logs for different operations are interleaved. The process of finding logs describing certain operations is called correlation. In general, log correlation, especially across services, must be implemented manually (spoiler: not in ASP.NET Core).

Note

Logs are easy to produce but are verbose, and then can significantly impact performance. They are also difficult to filter, query, or visualize.

To be accessible and useful, logs are sent to some central place, a log management system, which stores, parses, and indexes them so they can be queried. This implies that your logs need to have at least some structure.

ILogger in .NET supports structured logging, as we’ll see in Chapter 8, Writing Structured and Correlated Logs, so you get the human-readable message, along with the context. Structured logging, combined with structured storage and indexing, converts your logs into rich events that you can use for almost anything.

Events

An event is a structured record of something. It has a timestamp and a property bag. It may have a name, or that could just be one of the properties.

The difference between logs and events is semantical – an event is structured and usually follows a specific schema.

For example, an event that describes adding an item to a shopping bag should have a well-known name, such as shopping_bag_add_item with user-id and item-id properties. Then, you can query them by name, item, and user. For example, you can find the top 10 popular items across all users.

If you write it as a log message, you’d probably write something like this:

logger.LogInformation("Added '{item-id}' to shopping bag for '{user-id}'", itemId, userId);

If your logging provider captures individual properties, you would get the same context as with events. So, now we can find every log for this user and item, which probably includes other logs not related to adding an item.

Note

Events with consistent schema can be queried efficiently but have the same verbosity and performance problems as logs.

Metrics and counters

Logs and events share the same problem – verbosity and performance overhead. One way to solve them is aggregation.

A metric is a value of something aggregated by dimensions and over a period of time. For example, a request latency metric can have an HTTP route, status code, method, service name, and instance dimensions.

Common problems with metrics include the following:

Cardinality: Each combination of dimensions is a time series, and aggregation happens within one time series. Adding a new dimension causes a combinatorial explosion, so metrics must have low cardinality – that is, they cannot have too many dimensions, and each one must have a small number of distinct values. As a result, you can’t measure granular things such as per-user experience with metrics.

No causation: Metrics only show correlation and no cause and effect, so they are not a great tool to investigate issues.

As an expert on your system, you might use your intuition to come up with possible reasons for certain types of behavior and then use metrics to confirm your hypothesis.

Verbosity: Metrics have problems with verbosity too. It’s common to add metrics that measure just one thing, such as queue_is_full or queue_is_empty. Something such as queue_utilization would be more generic. Over time, the number of metrics grows along with the number of alerts, dashboards, and team processes relying on them.

Note

Metrics have low impact on performance, low volume that doesn’t grow much with scale, low storage costs, and low query time. They are great for dashboards and alerts but not for issue investigation or granular analytics.

A counter is a single time series – it’s a metric without dimensions, typically used to collect resource utilization such as CPU load or memory usage. Counters don’t work well for application performance or usage, as you need a dedicated counter per each combination of attributes, such as HTTP route, status code, and method. It is difficult to collect and even harder to use. Luckily, .NET supports metrics with dimensions, and we will discuss them in Chapter 7, Adding Custom Metrics.
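The difference is easy to see in code. Here is a minimal sketch of a dimensional metric using the System.Diagnostics.Metrics API mentioned above (available since .NET 6); the meter and instrument names are hypothetical, chosen for illustration:

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// One Meter per component; the name is an illustrative placeholder.
var meter = new Meter("MemeService");

// A single histogram instrument covers all routes and status codes -
// no need for a dedicated counter per attribute combination.
var duration = meter.CreateHistogram<double>("http.server.duration", unit: "ms");

duration.Record(18.7,
    new KeyValuePair<string, object?>("http.route", "/memes/{id}"),
    new KeyValuePair<string, object?>("http.status_code", 200));
```

Each distinct combination of tag values becomes its own time series at aggregation time, which is why keeping cardinality low matters.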

What’s missing?

Now you know all you need to monitor a monolith or small distributed system – use metrics for system health analysis and alerts, events for usage, and logs for debugging. This approach has taken the tech industry far, and there is nothing essentially wrong with it.

With up-to-date documentation, a few key performance and usage metrics, concise, structured, correlated, and consistent events, common conventions, and tools across all services, anyone operating your system can do performance analysis and debug issues.

Note

So, the ultimate goal is to efficiently operate a system, and the problem is not a specific telemetry signal or its limitations but a lack of standard solutions and practices, correlation, and structure for existing signals.

Before we jump into distributed tracing and see how its ecosystem addresses these gaps, let’s summarize the requirements for an ideal observability solution that we intend to address with tracing, and the new capabilities it brings. We should also keep in mind the old capabilities – low performance overhead and manageable costs.

Systematic debugging

We need to be able to investigate issues in a generic way. From an error report to an alert on a metric, we should be able to drill down into the issue, follow specific requests end to end, or bubble up from an error deep in the stack to understand its effect on users.

All this should be reasonably easy to do when you’re on call and paged at 2AM to resolve an incident in production.

Answering ad hoc questions

I might want to understand whether users from Redmond, WA, who purchased a product from my website are experiencing longer delivery times than usual and why – because of the shipment company, rain, cloud provider issues in this region, or anything else.

It should not be required to add more telemetry to answer most of the usage or performance questions. Occasionally, you’d need to add a new context property or an event, but it should be rare on a stable code path.

Self-documenting systems

Modern systems are dynamic – with continuous deployments, feature flag changes in runtime, and dozens of external dependencies with their own instabilities, nobody can know everything.

Telemetry becomes your single source of truth. Assuming it has enough context and common semantics, an observability vendor should be able to visualize it reasonably well.

Auto-instrumentation

It’s difficult to instrument everything in your system – it’s repetitive, error-prone, and hard to keep up to date, test, and enforce common schema and semantics. We need shared instrumentations for common libraries, while we would only add application-specific telemetry and context.

With an understanding of these requirements, we will move on to distributed tracing.

Introducing distributed tracing

Distributed tracing is a technique that brings structure, correlation and causation to collected telemetry. It defines a special event called span and specifies causal relationships between spans. Spans follow common conventions that are used to visualize and analyze traces.

Span

A span describes an operation such as an incoming or outgoing HTTP request, a database call, an expensive I/O call, or any other interesting call. It has just enough structure to represent anything and still be useful. Here are the most important span properties:

The span’s name should describe the operation type, be human-readable, and have low cardinality.

The span’s start time and duration.

The status indicates success, failure, or no status.

The span kind distinguishes the client, server, and internal calls, or the producer and consumer for async scenarios.

Attributes (also known as tags or annotations) describe specific operations.

Span context identifies spans and is propagated everywhere, enabling correlation. A parent span identifier is also included on child spans for causation.

Events provide additional information about operations within a span.

Links connect traces and spans when parent-child relationships don’t work – for example, for batching scenarios.

Note

In .NET, the tracing span is represented by System.Diagnostics.Activity. The System.Span&lt;T&gt; type is not related to distributed tracing.

Relationships between spans

A span is a unit of tracing, and to trace more complex operations, we need multiple spans.

For example, a user may attempt to get an image and send a request to the service. The image is not cached, and the service requests it from the cold storage (as shown in Figure 1.1):

Figure 1.1 – A GET image request flow

To make this operation debuggable, we should report multiple spans:

The incoming request

The attempt to get the image from the cache

Image retrieval from the cold storage

Caching the image

These spans form a trace – a set of related spans fully describing a logical end-to-end operation sharing the same trace-id. Within the trace, each span is identified by span-id. Spans include a pointer to a parent span – it’s just their parent’s span-id.

trace-id, span-id, and parent-span-id allow us to not only correlate spans but also record relationships between them. For example, in Figure 1.2, we can see that Redis GET, SETEX, and HTTP GET spans are siblings and the incoming request is their parent:
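These relationships are wired up automatically by the Activity API: a span started while another span is current becomes its child. A minimal sketch (the source and span names are illustrative):

```csharp
using System;
using System.Diagnostics;

// Register a listener so that StartActivity creates activities.
using var listener = new ActivityListener
{
    ShouldListenTo = _ => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
        ActivitySamplingResult.AllData
};
ActivitySource.AddActivityListener(listener);

var source = new ActivitySource("Demo");

using var parent = source.StartActivity("incoming-request");
// Activity.Current (the parent) is picked up implicitly here.
using var child = source.StartActivity("redis-get");

// Both spans share the same trace-id; the child records its parent's span-id.
Console.WriteLine(parent!.TraceId == child!.TraceId);   // True
Console.WriteLine(child.ParentSpanId == parent.SpanId); // True
```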

Figure 1.2 – Trace visualization showing relationships between spans

Spans can have more complicated relationships, which we’ll talk about later in Chapter 6, Tracing Your Code.

Span context (aka trace-id and span-id) enables even more interesting cross-signal scenarios. For example, you can stamp the parent span context on logs (spoiler: just configure ILogger to do it) and correlate logs to traces. If you use ConsoleProvider, you will see something like this:

Figure 1.3 – Logs include span context and can be correlated to other signals
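Output like that in Figure 1.3 can be produced with a configuration along these lines, assuming the Microsoft.Extensions.Logging and console logging packages are referenced:

```csharp
using Microsoft.Extensions.Logging;

using var loggerFactory = LoggerFactory.Create(builder =>
{
    // Scopes must be included for the console formatter to print span context.
    builder.AddSimpleConsole(options => options.IncludeScopes = true);

    // Stamp the current span context on every log record.
    builder.Configure(options => options.ActivityTrackingOptions =
        ActivityTrackingOptions.TraceId | ActivityTrackingOptions.SpanId);
});

var logger = loggerFactory.CreateLogger("Demo");
logger.LogInformation("Hello from inside a span");
```

When a log is written inside an active Activity, its trace-id and span-id appear in the log scope, enabling correlation with traces.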

You could also link metrics to traces using exemplars – metric metadata containing the trace context of operations that contributed to a recorded measurement. For instance, you can check examples of spans that correspond to the long tail of your latency distribution.

Attributes

Span attributes are a property bag that contains details about the operation.

Span attributes should describe this specific operation well enough to understand what happened. OpenTelemetry semantic conventions specify attributes for popular technologies to help with this, which we’ll talk about in the Ensuring consistency and structure section later in this chapter.

For example, an incoming HTTP request is identified with at least the following attributes: the HTTP method, path, query, API route, and status code:

Figure 1.4 – The HTTP server span attributes

Instrumentation points

So, we have defined a span and its properties, but when should we create spans? Which attributes should we put on them? While there is no strict standard to follow, here’s the rule of thumb:

Create a new span for every incoming and outgoing network call and use standard attributes for the protocol or technology whenever available.

This is what we’ve done previously with the memes example, and it allows us to see what happened on the service boundaries and detect common problems: dependency issues, status, latency, and errors on each service. This also allows us to correlate logs, events, and anything else we collect. Plus, observability backends are aware of HTTP semantics and will know how to interpret and visualize your spans.

There are exceptions to this rule, such as socket calls, where requests could be too small to be instrumented. In other cases, you might still be rightfully concerned with verbosity and the volume of generated data – we’ll see how to control it with sampling in Chapter 5, Configuration and Control Plane.

Tracing – building blocks

Now that you are familiar with the core concepts of tracing and its methodology, let’s talk about implementation. We need a set of convenient APIs to create and enrich spans and pass context around. Historically, every Application Performance Monitoring (APM) tool had its own SDKs to collect telemetry with their own APIs. Changing the APM vendor meant rewriting all your instrumentation code.

OpenTelemetry solves this problem – it’s a cross-language telemetry platform for tracing, metrics, events, and logs that unifies telemetry collection. Most of the APM tools, log management, and observability backends support OpenTelemetry, so you can change vendors without rewriting any instrumentation code.

.NET tracing implementation conforms to the OpenTelemetry API specification, and in this book, .NET tracing APIs and OpenTelemetry APIs are used interchangeably. We’ll talk about the difference between them in Chapter 6, Tracing Your Code.

Even though the tracing primitives are baked into .NET, so instrumentation code does not depend on OpenTelemetry, we still need to add the OpenTelemetry SDK to collect telemetry from the application – it has everything we need to configure collection and export. You could also write your own solution compatible with the .NET tracing APIs.

OpenTelemetry has become an industry standard for tracing and beyond; it’s available in multiple languages, and in addition to unified collection APIs, it provides configurable SDKs and a standard wire format for telemetry – the OpenTelemetry protocol (OTLP). You can send telemetry to any compatible vendor, either by adding a vendor-specific exporter or, if the backend supports OTLP, by configuring the vendor’s endpoint.

As shown in Figure 1.5, the application configures the OpenTelemetry SDK to export telemetry to the observability backend. Application code, .NET libraries, and various instrumentations use .NET tracing APIs to create spans, which the OpenTelemetry SDK listens to, processes, and forwards to an exporter.

Figure 1.5 – Tracing building blocks

So, OpenTelemetry decouples instrumentation code from the observability vendor, but it does much more than that. Now, different applications can share instrumentation libraries and observability vendors have unified and structured telemetry on top of which they can build rich experiences.

Instrumentation

Historically, all APM vendors had to instrument popular libraries: HTTP clients, web frameworks, Entity Framework, SQL clients, Redis client libraries, RabbitMQ, cloud providers’ SDKs, and so on. That did not scale well. But with .NET tracing APIs and OpenTelemetry semantics, instrumentation became common for all vendors. You can find a growing list of shared community instrumentations in the OpenTelemetry Contrib repo: https://github.com/open-telemetry/opentelemetry-dotnet-contrib.

Moreover, since OpenTelemetry is a vendor-neutral standard and baked into .NET, it’s now possible for libraries to implement native instrumentation – HTTP and gRPC clients, ASP.NET Core, and several other libraries support it.

Even when a library supports tracing natively, it’s off by default – you need to install and register specific instrumentation (which we’ll cover in Chapter 2, Native Monitoring in .NET). Otherwise, the tracing code does nothing and, thus, does not add any performance overhead.

Backends

The observability backend (aka monitoring, APM, or log management system) is a set of tools responsible for ingestion, storage, indexing, visualization, querying, and probably other things that help you monitor your system, investigate issues, and analyze performance.

Observability vendors build these tools and provide rich user experiences to help you use traces along with other signals.

Collecting traces for common libraries became easy with the OpenTelemetry ecosystem. As you’ll see in Chapter 2, Native Monitoring in .NET, most of it can be done automatically with just a few lines of code at startup. But how do we use them?

While you can send spans to stdout and store them on the filesystem, this would not leverage all tracing benefits. Traces can be huge, but even when they are small, grepping them is not convenient.

Trace visualization (such as a Gantt chart, trace viewer, or trace timeline) is one of the common features tracing providers offer. Figure 1.6 shows a trace timeline in Jaeger – an open source distributed tracing platform:

Figure 1.6 – Trace visualization in Jaeger, with errors marked with an exclamation point

While it may take a while to find an error log, the visualization shows what’s important – where the failures are, the latency, and the sequence of steps. As we can see in Figure 1.6, the frontend call failed because of a failure on the storage side, which we can drill into further.

However, we can also see that the frontend made four consecutive calls into storage, which potentially could be done in parallel to speed things up.

Another common feature is filtering or querying by any of the span properties, such as name, trace-id, span-id, parent-id, attribute name, status, timestamp, duration, or anything else. An example of such a query is shown in Figure 1.7:

Figure 1.7 – A custom Azure Monitor query that calculates the Redis hit rate

For example, we don’t report a metric for the cache hit rate, but we can estimate it from traces. While such estimates are not precise because of sampling, and querying traces might be more expensive than querying metrics, we can still do it ad hoc, especially when investigating specific failures.

Since traces, metrics, and logs are correlated, you will fully leverage observability capabilities if your vendor supports multiple signals or integrates well with other tools.

Reviewing context propagation

Correlation and causation are the foundation of distributed tracing. We’ve just covered how related spans share the same trace-id and have a pointer to the parent recorded in parent-span-id, forming a causal chain of operations. Now, let’s explore how it works in practice.

In-process propagation

Even within a single service, we usually have nested spans. For example, if we trace a request to a REST service that just reads an item from a database, we’d want to see at least two spans – one for an incoming HTTP request and another for a database query. To correlate them properly, we need to pass span context from ASP.NET Core to the database driver.

One option is to pass context explicitly as a function argument. It’s a viable solution in Go, where explicit context propagation is a standard, but in .NET, it would make onboarding onto distributed tracing difficult and would ruin the auto-instrumentation magic.

The .NET Activity (aka the span) is propagated implicitly. The current activity can always be accessed via the Activity.Current property, which is backed by System.Threading.AsyncLocal&lt;T&gt;.

Using our previous example of a service reading from the database, ASP.NET Core creates an Activity for the incoming request, and it becomes current for anything that happens within the scope of this request. Instrumentation for the database driver creates another one that uses Activity.Current as its parent, without knowing anything about ASP.NET Core and without the user application passing the Activity around. The logging framework would stamp trace-id and span-id from Activity.Current, if configured to do so.

This works for sync and async code, but if you process items in the background using in-memory queues, or manipulate threads explicitly, you will have to help the runtime by propagating activities explicitly. We’ll talk more about it in Chapter 6, Tracing Your Code.

Out-of-process propagation

In-process correlation is awesome, and for monolithic applications, it would be almost sufficient. But in the microservice world, we need to trace requests end to end and, therefore, propagate context over the wire, and here’s where standards come into play.

You can find multiple practices in this space – every complex system used to support something custom, such as x-correlation-id or x-request-id. You can find x-cloud-trace-context or grpc-trace-bin in old Google systems, X-Amzn-Trace-Id on AWS, and Request-Id variations and ms-cv in the Microsoft ecosystem. If your system is heterogeneous and uses a variety of cloud providers and tracing tools, correlation becomes difficult.

Trace context (which you can explore in more detail at https://www.w3.org/TR/trace-context) is a relatively new standard covering context propagation over HTTP, but it’s already widely adopted and used by default in OpenTelemetry and .NET.

W3C Trace Context

The trace context standard defines the traceparent and tracestate HTTP headers and the format for populating context in them.

The traceparent header

The traceparent is an HTTP request header that carries the protocol version, trace-id, parent-id, and trace-flags in the following format:

traceparent: {version}-{trace-id}-{parent-id}-{trace-flags}

version: The protocol version – only 00 is defined at the moment.

trace-id: The logical end-to-end operation ID.

parent-id: Identifies the client span and serves as a parent for the corresponding server span.

trace-flags: Represents the sampling decision (which we’ll talk about in Chapter 5, Configuration and Control Plane). For now, we can determine that 00 indicates that the parent span was sampled out and 01 means it was sampled in.

All identifiers must be present – that is, traceparent has a fixed length and is easy to parse. Figure 1.8 shows an example of context propagation with the traceparent header:

Figure 1.8 – traceparent is populated from the outgoing span context and becomes a parent for the incoming span
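Because the format is fixed, a traceparent value can be parsed with the built-in ActivityContext helper; the header value below is an arbitrary example:

```csharp
using System.Diagnostics;

const string traceparent =
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01";

// TryParse validates the version, lengths, and hex encoding.
if (ActivityContext.TryParse(traceparent, traceState: null, out var context))
{
    Console.WriteLine(context.TraceId);    // 0af7651916cd43dd8448eb211c80319c
    Console.WriteLine(context.SpanId);     // b7ad6b7169203331
    Console.WriteLine(context.TraceFlags); // Recorded (trace-flags 01)
}
```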

Note

The protocol does not require creating spans and does not specify instrumentation points. The common practice is to create a span per outgoing and incoming request, and to put the client span context into the request headers.

The tracestate header

The tracestate is another request header, which carries additional context for the tracing tool to use. It’s designed for OpenTelemetry or an APM tool to carry additional control information and not for application-specific context (covered in detail later in the Baggage section).

The tracestate consists of a list of key-value pairs, serialized to a string with the following format: "vendor1=value1,vendor2=value2".

The tracestate can be used to propagate incompatible legacy correlation IDs, or any additional identifiers a vendor needs.

OpenTelemetry, for instance, uses it to carry a sampling probability and score. For example, tracestate: "ot=r:3;p:2" represents a key-value pair, where the key is ot (the OpenTelemetry tag) and the value is r:3;p:2.

The tracestate header has a soft limitation on size (512 characters) and can be truncated.
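Splitting a tracestate value into vendor entries is a one-liner; this sketch uses the OpenTelemetry example from the text plus a hypothetical second vendor entry:

```csharp
const string tracestate = "ot=r:3;p:2,vendor2=value2";

// Entries are comma-separated; each is key=value, where the value
// may itself contain '=' – hence the split count of 2.
foreach (var entry in tracestate.Split(','))
{
    var kv = entry.Split('=', 2);
    Console.WriteLine($"{kv[0]} -> {kv[1]}");
}
// ot -> r:3;p:2
// vendor2 -> value2
```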

The traceresponse (draft) header

Unlike traceparent and tracestate, traceresponse is a response header. At the time of writing, it’s defined in W3C Trace-Context Level 2 (https://www.w3.org/TR/trace-context-2/) and has reached W3C Editor’s Draft status. There is no support for it in .NET or OpenTelemetry.

traceresponse is very similar to traceparent. It has the same format, but instead of client-side identifiers, it returns the trace-id and span-id values of the server span: