Modern Network Observability - David Flores - E-Book

Modern Network Observability E-Book

David Flores

Description

As modern IT services and software architectures such as microservices rely increasingly on network performance, the relevance of networks has never been greater. Network observability has emerged as a critical evolution of traditional monitoring, providing the deep visibility needed to manage today’s complex, dynamic environments. In Modern Network Observability, authors David Flores, Christian Adell, and Josh VanDeraa share their extensive experience to guide you through building and deploying a flexible observability stack using open-source tools.
This book begins by addressing the limitations of monolithic monitoring solutions, showing you how to transform them into a composable, flexible observability stack. Through practical implementations, you’ll learn how to collect, normalize, and analyze network data from diverse sources, build intuitive dashboards, and set up actionable alerts that help you stay ahead of potential issues. Later, you’ll cover advanced topics, such as integrating observability data into your network automation strategy, ensuring your network operations align with business objectives.
By the end of this book, you'll be able to proactively manage your network, minimize downtime, and ensure resilient, efficient, and future-proof operations.

You can read the e-book in the Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 683

Publication year: 2024




Modern Network Observability

A hands-on approach using open source tools such as Telegraf, Prometheus, and Grafana

David Flores

Christian Adell

Josh VanDeraa

Modern Network Observability

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Pavan Ramchandani

Publishing Product Manager: Prachi Sawant

Book Project Manager: Ashwin Kharwa

Senior Editor: Runcil Rebello

Technical Editor: Yash Bhanushali

Copy Editor: Safis Editing

Proofreader: Runcil Rebello

Indexer: Rekha Nair

Production Designer: Gokul Raj S.T

DevRel Marketing Coordinator: Marylou De Mello

First published: October 2024

Production reference: 1120924

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK

ISBN 978-1-83508-106-8

www.packtpub.com

To my wife, Yaquelin, for your unwavering support and patience, especially during the writing of this book. To my family, for your endless love and encouragement. And to my friends and colleagues, for inspiring me and pushing me to finally discuss networks and observability—thank you!

- David Flores

To everyone I have worked or collaborated with during my career. You have been instrumental in allowing me to satisfy my never-ending curiosity.

- Christian Adell

To my terrific family, thank you for your support, love, and connection. This book shows my kids that you can do anything that you set your mind to. To those who have been part of my growth from youth through various stages of my life and career, I know I could not be here and do things like this without the path I have taken with you.

- Josh VanDeraa

Foreword 1

Imagine cruising down a winding road at midnight without turning on your headlights. Sure, you might enjoy the excitement and the adrenaline rush, but it is only a matter of time before you end up in an ambulance or worse. In the world of networking, operating a network without monitoring and observability is the equivalent of driving at night without headlights. We all understand the importance of visibility and insights into our networks, but for many years, we have relied on rudimentary tools such as ping and traceroute. Those tools were invented over 30 years ago when the network was much, much simpler. We desperately need to gain visibility into our networks, but we face the challenges of data collection, data processing, storage, scalability, vendor incompatibilities, visualization, and the list goes on. Many advancements have been made in recent years to address those challenges, such as streaming telemetry, accurate sampling of flow data, collecting and storing data at scale, and accurate visualization. However, we lack a comprehensive, non-biased form to illustrate how the individual pieces fit into the architecture and how they can be applied in a modern network.

I first met David and Christian while working at Network to Code and had known Josh for years before then. They were all brilliant engineers, but what impressed me the most beyond their technical abilities was their passion for helping others and empowering the community to succeed. In this book, they combine their deep knowledge of the world of modern network observability with years of practical, battle-tested experience. When I say operational experience, I mean the type that is wake-up-at-2 AM mission critical, leaving the scars that are beer-discussion and conference-talk worthy that normally require admission fees to hear about.

The book addresses the complex world of network observability challenges from architectural concepts to practical configuration examples with a focus on vendor-neutral, open source tools. You will gain the necessary knowledge, from collecting data in various forms, normalizing and enriching it, to visualizing and gaining the necessary insights to help you do your job, all with open source tools. If that sounds too good to be true, it is not. You can even take the architecture and tools introduced in the book and swap out the elements with other tools. That is the beauty of the vendor-neutral, open source tools introduced in this book. I believe the benefits you will gain from the first few chapters will outweigh the investment you’ve made. Better yet, the knowledge will pay dividends for years to come. I have been waiting for a book of this kind for years, and David, Christian, and Josh have done a masterful job in the delivery.

What are you waiting for? Let’s dive into the modern world of network observability!

Eric Chou

Network Automation Advocate, Network to Code

Host, Network Automation Nerds Podcast

Founder, Network Automation Nerds, LLC.

Foreword 2

Observability has revolutionized how we manage infrastructure in the DevOps and cloud era, but its application to networking is still catching up. While network observability shares many similarities with broader infrastructure observability, there are specific challenges that have slowed its adoption in the networking field.

In this book, David, Christian, and Josh have done an outstanding job of covering everything network engineers and network reliability engineers need to know about network observability. They guide you from basic concepts to advanced examples, providing a comprehensive framework that has been successfully used in various production environments.

What makes this book special, in my view, is how network observability is presented not as an isolated component but as part of a larger automation stack. In observability, the quality of the data is everything. One of the benefits of modern observability is the ability to capture additional context, allowing us to slice and dice the data until we extract the most valuable information.

One of the key differences between monitoring and observability is the shift from a vertically integrated stack with limited flexibility to an extensible stack where users are encouraged to explore data and extract valuable insights to improve infrastructure reliability. This shift is important because it requires network engineers to develop new skills in data analysis.

At first, the amount of information may seem overwhelming, but it’s important to remember that in a production environment, multiple roles contribute to defining and implementing observability solutions. Network reliability engineers are responsible for building the platform, while network engineers are the main users. However, network engineers should also take the time to understand how the platform works, as the quality of the data collected is crucial to the effectiveness of the observability stack.

A network engineer with a deep understanding of the types of data that can be collected from the infrastructure, and the context in which it is collected, will be an invaluable partner to the network reliability team in building the best possible platform. Whether you are a network engineer or a network reliability engineer, this book covers all the skills and technologies needed for both roles.

As you dive into this book and begin your journey, remember that the tools you use are just one part of the equation. It’s more important to focus on understanding the core concepts and functional building blocks that form the foundation of effective observability practices. Tools may change and evolve, but the principles you’ll learn here will remain relevant.

Don’t be intimidated by the sheer volume of new information. It’s natural to feel overwhelmed at first, but with time and practice, these concepts will become second nature. Approach this learning process with curiosity and an open mind, and you’ll soon find yourself confidently navigating the exciting world of network observability.

For those new to this field, I recommend starting with a high-level understanding, which will help you identify which topics require deeper exploration and which ones are less critical for your role.

Over the years, I’ve had the opportunity to help many people discover and learn about network observability, and I’ve yet to see anyone who has invested time into it who wasn’t genuinely excited by the new capabilities that come with these platforms.

Welcome to the exciting field of network observability.

Damien Garros

Infrastructure automation and observability architect

Co-Founder and CEO, OpsMill

Contributors

About the authors

David Flores is passionate about solving complex problems in network infrastructure, software architectures, automation, and observability. With experience with service providers, cloud providers, and system integrators, David has gained expertise in managing, automating, and building observability stacks for network infrastructure. Currently at CoreWeave, he focuses on enhancing automation and observability. David has also contributed to open source projects such as gns3fy, and actively shares his knowledge through blogs, workshops, and technical events. David is always curious and eager to keep himself updated and open to new ideas in the field.

Christian Adell is a principal architect at Network to Code. He is focused on building network automation solutions for diverse use cases, with great emphasis on open source software. He is passionate about learning and helping others to grow, but also has more hobbies than hours in the day, so working remotely from Barcelona gives him the time and the space to achieve his dreams. Christian is a co-author of O’Reilly’s Network Programmability and Automation book and a co-author of Network Automation with Nautobot by Packt. Also in relation to sharing knowledge, he is the organizer of the NetBCN community in Barcelona and has been collaborating with several universities for almost 20 years.

Josh VanDeraa is a network engineer and automation leader. Currently, he is a services director at Network to Code, driving value from network automation solutions. Josh has experience in automation and networking across retail, transportation, and managed services. In his free time, he enjoys being with his family and enjoying the Minnesota seasons. Josh co-authored Network Automation with Nautobot and self-published Open Source Network Management.

About the reviewers

Brad Haas is a seasoned professional in network automation, serving as vice president of professional services at Network to Code. With over two decades of experience in the field, Brad leads high-performing teams in complex, customer-focused projects, emphasizing a blend of technical skill and strategic insight. He advocates a data-informed approach to ensure technology aligns with business goals. His career features numerous technical certifications, including multiple CCIEs and cloud credentials. Previously, Brad contributed significantly to network operations and the adaptation of cloud-native applications and microservices, using technology to drive organizational transformation.

Thank you to my wife and children for your patience and love as I ventured into this technical review adventure. While it was fun for me reviewing chapters and testing labs, I know it meant more solo puzzles and one less player for game night at home. Thanks for not hiding my laptop, and for all the laughter and support that kept me going! Love you guys - Always.

Suhaib Saeed is a cloud network engineer with many years of experience in network design, automation, and observability and is currently focused on all things AWS. He has spent most of his career at various ISPs such as BT, working on large-scale projects for FTSE 100 clients. His current role is at Samsara, an IoT company on a mission to improve the safety, efficiency, and sustainability of the operations that power the global economy. Suhaib holds a BSc in Computer Networks and has authored multiple blogs on network automation.

Table of Contents

Preface

Part 1: Understanding Monitoring and Observability

1

Introduction to Monitoring and Observability

Defining network observability

Network monitoring evolution

What has worked so far

Trends and requirements

Network observability pillars

Data quality

Scalability and interoperability

Actionable data

Assisted analysis

Benefits

Summary

2

Role of Monitoring and Observability in Network Infrastructure

Networking in the 2020s

Technological changes

Cultural changes

Transforming data into information

The importance of using business terms

Defining KPIs

From data to information

Expectations for network observability

Heterogeneous and enriched data

Proactive role in network automation

Full visibility of network state

Faster, more accurate, and at scale

Summary

3

Data’s Role in Network Observability

Network monitoring and telemetry

Challenges of traditional network monitoring

Network telemetry

Network observability framework

Collecting data, in practice

Agent-based versus Agentless approach

Network data collection methods

Setting up the lab environment

Summary

Part 2: Building an Effective Observability Stack

4

Observability Stack Architecture

The components of an observability platform

The importance of a well-designed observability stack

Why does an observability stack need to be well designed?

What does it mean to be a well-designed platform?

Understanding data pipelines for observability

The versatility of data pipelines

Unpacking ETL in data pipelines

Challenges and best practices

Scalability

Reliability

Flexibility, extensibility, and customization

Cost management

Other tips and best practices

Setting up a lab environment

Lab scenarios

Summary

5

Data Collectors

A deep dive into data collectors

Key characteristics

A look into Telegraf

Telegraf architecture

Telegraf configuration

Telegraf SNMP input plugin

Telegraf synthetic monitoring input plugins

Telegraf gNMI input plugin

Telegraf exec input plugins

A look into Logstash

Logstash architecture

Logstash syslog input

Summary

6

Data Distribution and Processing

Understanding data normalization

Observability data models

Breaking down metrics and the data model

Enhancing insights with data enrichment

Data enrichment injection

Data enrichment at query time

The scale of the observability data pipeline

Why message brokers/buses matter in observability

Summary

7

Data Storage Solutions for Network Observability

Databases for observability

Time series databases

Matching databases with observability needs

A look into Prometheus TSDB

Prometheus architecture

Writing to Prometheus TSDB

Reading from Prometheus TSDB (PromQL)

Prometheus rules

A look at Grafana Loki

Grafana Loki architecture

Writing to Loki

Reading from Loki (LogQL)

Loki rules

Persistence tips and best practices

Performance and scale

Automation is your best friend

Summary

8

Visualization – Bringing Network Observability to Life

Data visualization principles

A look into Grafana

Architecture

Setting up the lab environment

Creating your first Grafana dashboard

Visualization tips and best practices

Summary

9

Alerting – Network Monitoring and Incident Management

Incident management and alerts

Challenges and considerations on alerting

Alert aggregation and correlation

Alert engine architecture

A look into rulers and Alertmanager

Architecture

Creating your first alerts

Grafana for alerts

External integrations

Alerting tips and best practices

Addressing common alert challenges

Build on top of communication and transparency

Healthy incident management process

The role of AI in alerting

Summary

10

Real-World Observability Architectures

Observability stack options

All-in-one open source tools

Commercial off-the-shelf tools

Controller-based systems

Time series versus snapshot observability

Comparing build versus buy decision points

Defining requirements

Evaluating in-house capabilities and resources

Cost analysis

Assessing risks

Comparing features and flexibility

Making a decision

Orchestrating an observability platform

Deployment methodologies and orchestration

Summary

Part 3: Using Your Network Observability Data

11

Applications of Your Observability Data – Driving Business Success

The business value of observability data

Capacity planning

Percentiles

Forecasting

Defining health status

Treating your network as a service

Monitoring SLIs, SLOs, and SLAs for optimal network performance

How to treat a network as a service

Architecting dashboards

Network-related personas

Dashboard types

Summary

12

Automation Powered by Observability Data – Streamlining Network Operations

Setting up the lab environment

Advanced automation techniques with event-driven automation

Event-driven automation

Closed-loop automation

Event-driven automation with Prefect

Summary

13

Leveraging Artificial Intelligence for Enhanced Network Observability

AI and ML fundamentals

ML algorithms

Neural networks and language models

Real-world AIOps

Lab requirements

Validating operational changes

Assisted root cause analysis

Summary

Appendix A

A lab environment

Hardware requirements

Software requirements

Step 0 – Git repository setup

Step 1 – VM provisioning

Step 2 – interacting with the lab scenarios

Step 3 – removing the lab environment

Step 4 – managing lab scenarios

Summary

Index

Other Books You May Enjoy

Part 1: Understanding Monitoring and Observability

The first part of the book introduces you to monitoring and observability, taking us from where network observability began to where it is now. After reviewing where we have been and where monitoring and observability are going, we dive into the role of monitoring in organizations. The first part of the book wraps up with the different data types, such as logs, metrics, and traces, and how those types fit into the modern observability stack.

This part contains the following chapters:

Chapter 1, Introduction to Monitoring and Observability
Chapter 2, Role of Monitoring and Observability in Network Infrastructure
Chapter 3, Data’s Role in Network Observability

1

Introduction to Monitoring and Observability

Since the early days of computer networks, we have needed to detect failures on the different network components (e.g., hardware interface issues, cable cuts, or web service down) to determine outages that require corrective actions. This field has been known as network monitoring.

Interestingly, the last decade has witnessed numerous innovations in the field, especially related to new tools and practices around the DevOps culture. This culture emphasizes merging development and operations responsibilities, which requires a better understanding of the operational state. Moreover, there has been a significant adoption of network automation. This advancement drives network operations, transforming monitoring from a passive component to an enabler of closed-loop processes. These changes have been the main drivers behind the evolution from network monitoring to network observability, and this book aims to help you understand and apply it to improve your network operations.

Note

Network observability is a broader topic, especially since the rise of running network applications directly in the host with technologies such as extended Berkeley Packet Filter (eBPF) and Data Plane Development Kit (DPDK). This kind of observability is not covered in detail in the book, even though most of the concepts are applicable too.

In this book, you will begin by understanding the basic concepts related to network observability, and then, for the majority of it, we will explain how to build a modern network observability stack, with a practical, but not limited, emphasis on the Telegraf (https://github.com/influxdata/telegraf)/Prometheus (https://github.com/prometheus/prometheus)/Grafana (https://github.com/grafana/grafana) (TPG) stack (details about how to spin up a development environment are in Appendix A). Finally, you will learn how to solve real network operations challenges using the flexible observability stack presented.

In this first chapter, we will cover the following topics:

Defining network observability
Describing network monitoring evolution
Exposing the key aspects of network observability

Defining network observability

Let’s go straight to the point: what is network observability about?

To answer this, it’s convenient to understand first what network monitoring is because network observability supersedes it. Network monitoring is the part of wider IT operations monitoring that focuses on the network infrastructure.

Even though you are likely used to the term network monitoring, there is no academic definition of it, and everyone understands it slightly differently. We define network monitoring as measuring the performance and availability of the network infrastructure.

Related to this goal, you may be familiar with some of the technologies that have provided information about the operational state of the network:

Simple Network Management Protocol (SNMP) polls and traps
Internet Control Message Protocol (ICMP) requests (e.g., ping)
Flow analysis (e.g., NetFlow)
Packet capture (e.g., tcpdump)
Logs (e.g., Syslog)

These technologies make up network monitoring, which provides support for diagnostics and service monitoring, with state visualization and alert generation. Network operation teams leverage network monitoring to detect when something is wrong in the network, but this is not enough anymore.

Nowadays, IT operations have raised the bar, and the focus is not only on the infrastructure status but on translating it to the business level. Therefore, observability is about the end user’s experience, and this encompasses many layers, from infrastructure to applications.

This convergence of responsibilities materialized in the DevOps culture (i.e., bringing together Development and Operations) that coordinates all the IT efforts around the same business outcome. One basic practice is to consolidate different monitoring systems to enable data correlation. The DevOps movement has broken long-time silos in IT departments, and this new collaboration has produced a lot of innovations, which we will explore in this book.

Moreover, it has transformed the reactive approach of traditional monitoring into a proactive one that helps handle issues before they impact the services. Ironically, this leads to simpler (but more effective) systems, capable of getting the data to provide the insights that help solve these issues. This is what IT observability is about: helping to identify the unknown unknowns and having a holistic view.

Within this observability realm, network observability encompasses all the technological trends that support overall IT observability in the network domain.

In networking, this trend toward adopting network observability has translated into more flexibility in different aspects:

Interoperable specialized solutions (e.g., open source solutions provide more flexibility)
More efficient data retrieval methods (e.g., network streaming telemetry)
More scalable and advanced data processing (e.g., artificial intelligence)
Richer context and analysis via data integrations (e.g., source of truth integration)

Note

That being said, we will use both terms (i.e., monitoring and observability) interchangeably in this book, with the same meaning.

This is what this book is about. We want you to understand how to evolve from traditional network monitoring systems to the new network observability approach, tightly connected with the DevOps culture, and how it connects with the other big revolution in network operations: network automation.

Network monitoring evolution

As already mentioned, modern network observability has evolved from network monitoring, a practice that has been in place for several decades. Before delving into the new approach it introduces, it’s important to review what has been effective so far and to understand the trends and requirements that have driven its transformation.

What has worked so far

Networks have been monitored to understand their status since the beginning. ARPANET (which stands for Advanced Research Projects Agency Network), the first packet-switched network, which started in 1966, used Interface Message Processors (IMPs) as its packet-switching nodes, and these provided a few monitoring features. Fast-forwarding some years to the rise of TCP/IP networks, in 1988, SNMP was defined by the IETF (its latest version is SNMPv3) to address this need.

SNMP provides a mechanism to manage networks, but it has been mostly used to monitor networks, and not to manage configuration changes (which have been mostly done via CLIs, until the rise of newer management interfaces). The main characteristics of SNMP can be summarized in a few aspects:

The UDP transport protocol is stateless, which is useful for state and status polling
Management information bases (MIBs) provide structured data to access specific content
Massive adoption in all network devices, supporting standard and proprietary MIBs
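
To make the idea of MIBs as structured data more concrete, the following is a minimal sketch of how a numeric object identifier (OID) can be resolved into an object name plus an instance suffix. The `MIB_EXCERPT` dictionary and the `resolve_oid` helper are illustrative assumptions, not part of any SNMP library; the four OIDs listed are the standard MIB-II ones for those objects, but a real MIB defines far more objects along with their types.

```python
# Hypothetical, hand-written excerpt of OID-to-name mappings (illustration only).
MIB_EXCERPT = {
    (1, 3, 6, 1, 2, 1, 1, 5): "sysName",
    (1, 3, 6, 1, 2, 1, 2, 2, 1, 2): "ifDescr",
    (1, 3, 6, 1, 2, 1, 2, 2, 1, 8): "ifOperStatus",
    (1, 3, 6, 1, 2, 1, 2, 2, 1, 10): "ifInOctets",
}


def resolve_oid(oid: str):
    """Return (object name, instance suffix) for a dotted numeric OID."""
    parts = tuple(int(p) for p in oid.strip(".").split("."))
    # Try the longest prefix first so the most specific object wins.
    for length in range(len(parts), 0, -1):
        name = MIB_EXCERPT.get(parts[:length])
        if name is not None:
            return name, parts[length:]
    raise KeyError(f"OID {oid} is not in the excerpt")


# The trailing ".3" is the instance: here, the row with index 3 in the interface table.
name, instance = resolve_oid("1.3.6.1.2.1.2.2.1.10.3")
print(name, instance)
```

In this sketch, the suffix left over after the matched prefix identifies which row of a table (or which scalar instance) the value belongs to, which is how a poller can ask for one specific counter on one specific interface.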

However, not all that glitters is gold, and SNMP has some limitations, such as poor performance when retrieving large amounts of data and limited coverage for push mechanisms (i.e., SNMP traps).

Note

This book doesn’t cover SNMP in detail (there are many books dedicated to the topic). We will reference it as one of the available methods to retrieve operational data within a holistic network observability strategy in Chapter 3.

Similarly to SNMP, event logs using Syslog have been widely used, not only for network monitoring but also for applications. Logs are generated when a device sees a specific event, and they bring together several pieces of information, such as the generation time, the source, the level, and a meaningful message related to the event. This grouping of data is what we refer to as multidomain data. This contrasts with the simple SNMP metrics (integers or strings).
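
As a rough illustration of the pieces of information a log event groups together, the sketch below parses one BSD-style Syslog line into its generation time, source, level, and message. The sample line, the regular expression, and the `LogEvent` and `parse_syslog` names are assumptions made for this example; real devices emit several format variants, so treat this as a sketch rather than a general parser. The only protocol detail relied on is that the leading priority value encodes facility * 8 + severity.

```python
import re
from dataclasses import dataclass

# Severity names indexed by their numeric value (0 is the most severe).
SEVERITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "informational", "debug"]

# Assumed layout for this example: <PRI>Mmm dd hh:mm:ss host message
LINE_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<message>.*)$"
)


@dataclass
class LogEvent:
    timestamp: str
    source: str
    facility: int
    level: str
    message: str


def parse_syslog(line: str) -> LogEvent:
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized syslog line: {line!r}")
    # The priority value packs two fields: facility * 8 + severity.
    facility, severity = divmod(int(match["pri"]), 8)
    return LogEvent(match["timestamp"], match["host"], facility,
                    SEVERITIES[severity], match["message"])


# Hypothetical log line from a device named router1.
event = parse_syslog(
    "<187>Oct 11 22:14:15 router1 %LINK-3-UPDOWN: Interface Gi0/1, changed state to down"
)
print(event)
```

Compared with a single SNMP counter, the resulting record carries several related fields at once, which is what makes logs useful for correlation later in the pipeline.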

Flow export mechanisms such as NetFlow, sFlow, and IPFIX are also common in network analysis. With some small differences between them, they carry the basic information that defines a packet flow, including the source and destination IP addresses and ports, and some other information. Again, like logs, this is multidomain data.
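
The following sketch shows one way to model the identifying fields of a flow and to add up the bytes reported per flow. The `FlowKey` type, the `bytes_per_flow` helper, the sample records, and the inclusion of the protocol number as part of the key are assumptions for illustration; they are not the record layout of NetFlow, sFlow, or IPFIX, which export many more fields.

```python
from collections import Counter
from typing import NamedTuple


class FlowKey(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # IP protocol number, e.g. 6 = TCP, 17 = UDP (assumed extra field)


def bytes_per_flow(records):
    """Sum the byte counts of all records that share the same flow key."""
    totals = Counter()
    for key, byte_count in records:
        totals[key] += byte_count
    return totals


# Hypothetical exported records: (flow key, bytes seen in that export).
records = [
    (FlowKey("10.0.0.1", "10.0.1.5", 51000, 443, 6), 1200),
    (FlowKey("10.0.0.2", "10.0.1.5", 51001, 53, 17), 90),
    (FlowKey("10.0.0.1", "10.0.1.5", 51000, 443, 6), 800),
]

totals = bytes_per_flow(records)
top_flow, top_bytes = totals.most_common(1)[0]
print(top_flow, top_bytes)
```

Using the tuple of addresses and ports as a dictionary key is a simple design choice that lets records from separate exports be merged into the same flow.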

An important benefit of all these methods is their ubiquitous adoption. It’s more than likely that any network device you have supports them. However, the implementation of the monitoring solutions, usually in the form of monolithic platforms, makes it harder to combine and correlate data that may be related.

Also, the initial technologies to manage and persist this data, such as RRDtool, came with limitations that modern options such as Time Series Databases (TSDBs) have overcome. Understanding what a modern network observability stack looks like and how to design and build one is the main goal of Part 2 of this book.

These methods, together with others such as packet capturing or synthetic monitoring (e.g., ping), have been, and still are, solid pillars upon which to build a network monitoring strategy. However, the expectations for observability have disrupted the status quo and, in many environments, traditional network monitoring is no longer enough.

Trends and requirements

Here, we summarize the main trends that have influenced and motivated the evolution of network observability:

Networks are heterogeneous and abstract. Today, networking takes many forms: campus networks, hyper-scale data centers, cloud-based network services, or service mesh. This variety implies supporting different protocols and interfaces and being able to correlate many data types.
Network operations, following the DevOps approach, have adopted automation to transform how networks are managed. For most network automation tasks, operational data insights are key and require common data models between developers and operation engineers to get a mutual understanding.
Focus on application performance has become more predominant, and network monitoring needs to contribute to the common view with all the related information. Moreover, microservice architectures increase the complexity of correlating the data.
Better visibility requires more data, so we need more efficient retrieval methods and data reusability.
The volume of the data aggregate can be huge, making it impossible to analyze without the aid of artificial intelligence for IT operations (AIOps).

In this book, we will explain the basic concepts to architect solutions to address these challenges (in Part 2), and practical examples to implement them (in Parts 2 and 3).

Network observability pillars

With all these expectations, there are four pillars that sustain the network observability solutions (depicted in Figure 1.1):

Figure 1.1 – Network observability pillars

We summarize these as follows:

Data quality: Any observation is going to be as good as the data it is based on. We must ensure that the data we collect adheres to some principles that grant its quality.
Scalability and interoperability: To address complex questions, the architecture needs to incorporate specialized tooling that works together in a distributed manner that can scale out as needed.
Actionable data: The active role of network observability within a network automation strategy requires providing data that other components can leverage to act on it.
Assisted analysis: To improve the insights generated, the data needs to be analyzed by machines that can process a large amount of data and apply intelligence at scale.

Data quality

It may seem obvious, but in any process (even more so if it’s automated), the quality of the output depends directly on the quality of the input. In network monitoring, operational data is king. Everything depends on the collected data, so it’s necessary to carefully select and manage the data that will be used to generate the insights.

But what is data quality exactly? There are many definitions of data quality, but all of them pivot around a few dimensions:

Relevance: Is the data useful (and used) for the purpose it was collected?
Accuracy: Is the data precise enough to give insights into it?
Timeliness: Is the data current enough to provide almost real-time conclusions?
Comparability: Can the data be compared with other datasets?
Completeness: Are there any records missing?

In the context of network observability, we could summarize quality data as the data that is fit for answering the relevant questions about your network.
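Two of these dimensions, completeness and timeliness, are easy to quantify. The following is a minimal sketch, assuming metrics arrive as timestamped samples collected at a fixed interval; the `Sample` class, the function names, and the sample values are illustrative and not taken from any specific tool.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One metric observation; the field names are illustrative."""
    timestamp: float  # seconds since the start of the observation window
    value: float


def completeness(samples, start, end, interval):
    """Fraction of expected collection slots that hold at least one sample."""
    expected = int((end - start) // interval)
    if expected <= 0:
        return 0.0
    seen = {
        int((s.timestamp - start) // interval)
        for s in samples
        if start <= s.timestamp < end
    }
    return len(seen) / expected


def is_timely(samples, now, max_age):
    """True when the newest sample is no older than max_age seconds."""
    if not samples:
        return False
    newest = max(s.timestamp for s in samples)
    return (now - newest) <= max_age


# Hypothetical 60-second polling where the poll at t=180 was missed.
samples = [Sample(t, 1.0) for t in (0, 60, 120, 240)]
ratio = completeness(samples, start=0, end=300, interval=60)
fresh = is_timely(samples, now=290, max_age=90)
```

Here, four of the five expected slots contain data, so the completeness ratio is 0.8, and the newest sample is 50 seconds old, which is within the 90-second limit assumed for this example.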

This book doesn’t cover data quality in depth (there is a myriad of books focused only on this topic), but we will implicitly refer to data quality characteristics in many sections, as in these instances:

In Chapter 3, we will introduce the relevance of streaming telemetry to get almost real-time metrics from network devices. We will be tackling the timeliness dimension.
In Chapter 5, we will explain the importance of normalizing data from different sources to make it comparable.
In Chapter 6, the concept of data enrichment increases the data’s relevance, adding more context so data can be easily correlated (not only by its timestamp).
In Chapter 7, when explaining the persistency layer, we will tackle how some databases can fill missing data records with probable data to help run processing on top, helping with completeness.

Scalability and interoperability

Traditional network monitoring systems have been implemented as monolithic ones performing all the necessary functionalities (e.g., collection, storage, and visualization). This all-inclusive approach may seem convenient to get started. However, as more specialized features are required (e.g., a new database type or a new collector agent), it becomes evident that it’s unlikely that one tool would be capable of addressing all your needs.

On the contrary, we need to acknowledge that evolving the network observability stack requires plugging in different components, specialized on some of the functionalities. These distributed architectures require clear interfaces to connect the different components, and a solid orchestration to deploy them in a repeatable manner at scale.

Moreover, to make this composable approach efficient, we should avoid duplicating roles. For instance, we have seen many cases where, to leverage the analysis from two different monitoring tools, each one has to run SNMP collectors and collect the very same data. Why not collect it once, and reuse it many times?
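The collect-once, reuse-many idea can be sketched as a simple fan-out: a single collector publishes each metric, and any number of consumers subscribe to it. This is only an in-process illustration under stated assumptions; the `MetricBus` class, the collector function, and the counter values are hypothetical, and a real stack would typically place a message bus or a shared database between the components.

```python
class MetricBus:
    """Minimal in-process fan-out: one collector, many consumers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, metric):
        for callback in self._subscribers:
            callback(metric)


def collect_interface_counters():
    """Stand-in for a single SNMP or telemetry poll; the values are made up."""
    return [
        {"device": "leaf1", "interface": "Ethernet1", "in_octets": 1500},
        {"device": "leaf1", "interface": "Ethernet2", "in_octets": 300},
    ]


bus = MetricBus()
stored = []   # plays the role of a storage component
alerts = []   # plays the role of an alerting component

bus.subscribe(stored.append)
bus.subscribe(lambda m: alerts.append(m) if m["in_octets"] > 1000 else None)

# The data is collected once and delivered to every subscriber.
for metric in collect_interface_counters():
    bus.publish(metric)
```

Both consumers see the same data, but the devices are polled only once.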

Related to architecture requirements, the new stack has to allow high scalability to handle the increase in the amount of data and processing required to provide the expected insights. This scalability should allow per-component upgrades (i.e., scale out) instead of the whole stack required by monolithic systems (i.e., scale up).

We will cover these topics in Part 2, from a general introduction of the recommended stack in Chapter 4 to per-component details in the other chapters.

Actionable data

The adoption of network automation has revolutionized how networks are managed. This movement started with the software-defined networking hype (do you remember OpenFlow?) around 2010, and, from there, it evolved in different ways, strongly influenced by DevOps practices (some people refer to it as NetDevOps). When we talk about network automation, we refer to replacing the manual changes on the network infrastructure (e.g., the CLI in network devices or the GUI on controller-based ones) by using a repeatable approach that could replace (most) human intervention.

Network automation is a big topic that we don’t pretend to cover in this book. However, in every chapter, we will highlight the role of network observability in the network automation space.

One key requirement for network observability within automated network operations is the need to provide actionable data to feed into closed-loop systems (i.e., systems that automatically adjust depending on their output). As we will see later, the heart of network automation is defining the intended state of the network, which drives the whole system, from the configuration to the expected operational state. Using this as a reference, network observability collects the actual state and checks whether a mitigation action is needed.
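A closed loop of this kind boils down to comparing the intended state with the observed one and turning each mismatch into an action. The sketch below assumes both states are flat key-value dictionaries; the keys, values, and the text of the resulting actions are hypothetical and only illustrate the pattern.

```python
def find_drift(intended, actual):
    """Return (key, intended_value, actual_value) for every mismatch."""
    drift = []
    for key, want in intended.items():
        have = actual.get(key)
        if have != want:
            drift.append((key, want, have))
    return drift


def plan_mitigation(drift):
    """Turn each mismatch into a hypothetical action for an automation system."""
    return [f"set {key} to {want} (currently {have})" for key, want, have in drift]


# Intended state comes from the automation system; actual state from observability.
intended = {"Ethernet1.admin_state": "up", "Ethernet1.mtu": 9214}
actual = {"Ethernet1.admin_state": "up", "Ethernet1.mtu": 1500}

actions = plan_mitigation(find_drift(intended, actual))
```

In this example, only the MTU diverges, so a single mitigation action is produced. When the states match, the list is empty and no action is needed.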

Part 2 of the book covers how to design and implement the functionalities that support automated operations, and in Part 3, we explain how to leverage the recommended stack to solve real problems.

Also, you should not forget that the network automation solutions need to be observed themselves to understand how they behave and influence the network state.

Assisted analysis

Network monitoring has always supported humans with data to understand patterns or provide insights about how the network services were running. Since the early stages, network monitoring has also aspired to provide capable insights on top of the data collected. However, until recently, the analysis of monitoring data had limited (but still useful) use cases. For instance, you have likely defined threshold alerts in your monitoring systems to raise an alarm when the CPU level exceeds some threshold.

Nowadays, with the massive scale of IT systems and their complexity, the need for assisted analysis is more relevant. Answering this call, the advent of AI/ML (Artificial Intelligence and Machine Learning) technologies has transformed the game, and the promise of implementing Artificial Intelligence for IT Operations (AIOps) is becoming a reality.

AI/ML provides solutions to various problems, including the identification of data clusters with shared characteristics, such as the number of access control list (ACL) hits according to device role, and the detection of anomalies by comparing current metrics against historical data while considering seasonality. Even more popular, large language models (LLMs), used by chatbots such as OpenAI's ChatGPT (https://openai.com/chatgpt), allow reproducing human language processing to provide complex answers based on all the previous training. For instance, you can ask for the potential impact of a log message that has not been classified before in your system to get an educated insight into the related implications.
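To make the seasonality idea concrete, here is a minimal sketch of an anomaly check that compares a new value only against values seen at the same hour on previous days. It is a simple statistical baseline, not a machine learning model; the function name, the threshold of three standard deviations, and the utilization figures are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, hour, value, threshold=3.0):
    """Compare value against past samples taken at the same hour of day.

    history maps an hour (0-23) to the values seen at that hour on previous
    days, which is a simple way to account for daily seasonality.
    """
    baseline = history.get(hour, [])
    if len(baseline) < 2:
        return False  # not enough data to judge
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Made-up link utilization (%) at 03:00 and 12:00 over five previous days.
history = {
    3: [5, 6, 4, 5, 5],
    12: [70, 72, 68, 71, 69],
}

night = is_anomalous(history, hour=3, value=70)   # unusual at 03:00
noon = is_anomalous(history, hour=12, value=70)   # normal at 12:00
```

The same 70% utilization is flagged at night but not at noon, which a single static threshold could not express.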

The potential of these new tools to contribute to the analysis of network operations is immense, but so are the challenges. There is a learning curve to understand when to use one technique or another, and more importantly, how to select and manage the data (including anonymizing it before sharing it with an external system) to achieve the desired results. We won’t go deep into the foundations of AI/ML, but in Chapter 13, we will provide some inspirational examples of leveraging them to improve the insights produced by network observability.

Benefits

After this brief introduction to network observability, we want to finish it by highlighting some of the key benefits it brings to the table:

Reduced time to solve incidents: More and better data is available for deeper analysis of multi-dimensional issues that affect the services running on top of the network infrastructure, providing educated suggestions to resolve the issues.
Better end-user experience: Due to the shorter time to identify users’ issues, and also by including the user’s perspective in the analysis.
More accurate capacity planning: By combining all the data generated, it is possible to reduce the over-provisioning of network services by tailoring them to the actual needs.
Accelerated network operations: Being able to validate the state of the network against its intended state enables faster configuration deployments that are validated by observability. It also supports canary deployments, where a big network change is only rolled out to a small subset of the network to reduce the blast radius effect, and incrementally rolled out to the rest once the state is validated.
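The canary deployment mentioned in the last benefit can be sketched as a rollout gated by a validation step. This is an illustration under stated assumptions: `apply_change` and `validate` are hypothetical callables standing in for the automation system and the observability check, and the device names are invented.

```python
def canary_rollout(devices, apply_change, validate, canary_size=1):
    """Apply a change to a small subset first, then to the rest if it validates.

    Returns (changed_devices, completed).
    """
    canary, rest = devices[:canary_size], devices[canary_size:]
    changed = []
    for device in canary:
        apply_change(device)
        changed.append(device)
    # Observability decides whether the rollout may continue.
    if not all(validate(device) for device in canary):
        return changed, False
    for device in rest:
        apply_change(device)
        changed.append(device)
    return changed, True


# Hypothetical stand-ins: record the change and read back a health flag.
state = {}
healthy = {"leaf1": False, "leaf2": True, "leaf3": True}

changed, completed = canary_rollout(
    ["leaf1", "leaf2", "leaf3"],
    apply_change=lambda d: state.update({d: "new-config"}),
    validate=lambda d: healthy[d],
)
```

Because the canary device fails validation, the change never reaches the other two devices. A fuller implementation would also roll back the canary and validate the remaining devices after the change.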

Understanding the key network observability pillars and the benefits they provide will help you navigate through this book. When we present our proposed architecture and stack in Part 2, you will notice the influence of these pillars behind every recommendation.

Summary

In this chapter, we presented the network observability topic and how it evolved from traditional network monitoring. We looked at its evolution and the main trends that have influenced its transformation. Finally, we introduced the four pillars on top of which we will build the observability stack, and some of the expected benefits of this approach.

In the next chapter, we expand on the importance of providing business-level insights and the role of observability in modern network operations.

2

Role of Monitoring and Observability in Network Infrastructure

As previously introduced in Chapter 1, the relevance and scope of observability in IT infrastructure goes beyond networking, and most of the ideas and topics proposed in this book are easily portable to other infrastructure realms. However, we want to provide a closer look at networking and its own needs.

For this reason, before going into the pure observability topics, we think it’s important to give you a high-level overview of how networks have evolved in the last decade. With this context, it will be easier to later understand the requirements that we expect from a modern network observability solution.

Also in this chapter, we present a shift in how you understand your network. We encourage you to see it as a product – either it has a direct impact on your company’s revenue or it’s supporting it. This mindset will guide how we approach the transformation of the raw data we get from the network, in the form of metrics, logs, and other types, into actual information that can help others set expectations about how to consume the network services.

The chapter covers the following topics:

The state of networking in the 2020s
How to transform data into information
Recapping the expectations for network observability

Let’s start with a quick recap of what networks look like nowadays.

Networking in the 2020s

We have to admit that networking is not the spearhead of rapid innovation in IT. There are good reasons for that. Networks require proven interoperability, which has usually been sustained by standards that may take a long time to implement, and the blast radius of a network failure is worse than that of an electric power outage (no battery mode exists). Thus, networking has resisted change until change became absolutely necessary to support the evolution of the applications running on top.

Changes in networking in the last 20 years have taken many flavors, both technological and cultural, and they have impacted the expectations for network observability. Even though this book is not about network architectures or solutions, we believe that having a 10,000-foot view of these factors will help you better understand their impact on observability solutions.

Technological changes

Everything depends on the perspective. If we use the 1990s as our reference, we could say that the current networks forming the internet are more homogeneous than before because the network protocols converged into a few of them, such as IP, TCP/UDP, or HTTP. However, it’s also true that today’s networks are no longer built only around closed boxes by a few vendors and are adopting open Linux operating systems (OS) and virtual network services running on different platforms, making these networks more heterogeneous.

These networks have evolved in different directions, depending on their main purpose. We can find network service providers that connect many networks, campus networks that provide access to end users, or data center networks that support backend application connectivity. Moreover, there are different nuances depending on the nature of the applications they support, from the fintech use case, where stable and low latency is crucial, to video content delivery, where multicast support allows scalability.

Without going into low-level details, we are going to analyze a few of the most relevant technological changes, starting with how the architecture of these networks has changed.

Network architectures

The authors of this book have been managing networks since the 2000s. We can still remember the days when the networks were connecting a bunch of clients (i.e., PCs/personal computers) to a few servers (in the order of hundreds at most), and finally providing internet access.

In that context, a popular architecture stood out: the three-tier architecture. The name comes from the three layers that compose it: the core, distribution, and access layers. It has been, and still is, the most common architecture (with different variants) for multi-purpose Local Area Networks (LANs) because it works well for the north-south traffic pattern (i.e., when most of the traffic goes from one access layer into another domain up in the architecture):

Figure 2.1 – Three-tier architecture

At large-scale companies, such as Google or Meta, it became evident that this architecture had scalability limitations to support new application traffic patterns. The new network architecture had to provide a predictable network latency, a higher connectivity capacity, and an easier way to scale via scaling out (i.e., adding new devices to the network) instead of scaling up (i.e., replacing devices with ones with more capacity). This architecture is known as leaf and spine (though, in some cases, it may have different names), but the actual concept comes from the old Clos network design, which provides a consistent flattened network design.

Many articles and books have been written about these topics, but a blog from Meta (formerly Facebook) in 2014 (https://engineering.fb.com/2014/11/14/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/), about the data center fabric network topology, was one of the first explaining it to the public. In the blog, you can notice how the design follows a regular pattern replicated at different levels, and there is always the same distance from one server to another.

On top of this standard architecture, different virtual networks can be constructed. The architecture described previously is just the underlay network, and on top of it, many different overlays can be built to provide custom networks. This separation of concerns (i.e., underlay and overlay) allows more flexibility in creating networks dynamically, but it also increases the number of dimensions to consider about the network state, as there is not only one network but a combination of many.

Note

A common implementation of this architecture in the data center is based on Ethernet VPN (EVPN) and Virtual Extensible LAN (VXLAN) protocols (i.e., EVPN controls how the overlay VXLAN tunnels are created). In the Wide Area Network (WAN) area, the Software-Defined WAN (SD-WAN) solutions use a controller to orchestrate overlay connections between branches connected over the internet (the underlay).

Both network designs are still in use because they serve different purposes, and in many environments, the network is a combination of both. One way or another, the reality is a higher number of connected devices (think about IoT devices, for example) and, thus, a higher number of network devices (i.e., switches and access points) to connect them, all of which need to be monitored.

Virtualization in networking

Twenty years ago, servers and network devices were directly mapped to physical boxes. Every box ran proprietary applications on top of a closed OS, consuming the hardware resources directly.

In the late 2000s, the virtualization of servers became mainstream, leading to a much more flexible way to provision new servers on the same physical box. Running different OSs on top of a hypervisor (abstracting the physical resources) allows each OS to behave as if it were running alone while sharing the same physical host.

This transformation on the server side motivated the appearance of similar solutions for networking, and other trends, such as Network Function Virtualization (NFV), which pushed toward taking network functions traditionally implemented in hardware into software.

In addition, containerized solutions came into play as an interesting option to create development environments, which also require observing the resulting state before changes are accepted.

Note

In this book, we use containerized network OSs to create the lab environment. More details are in Appendix A.

You can already notice how all this flexibility requires dynamic network monitoring that is not statically set once and kept for the whole life of the device. It has to change dynamically as the network changes.

Network automation

If you have been working in networking for a while, you have likely been managing your network mostly via the famous Command-Line Interfaces (CLIs), which provide a human-readable language to define how the network should behave. This worked well for small-scale networks, but as we already mentioned, with the increase in size and heterogeneity of modern networks, manually operating network devices doesn’t scale and has other limitations.

In this context, around 2010, we witnessed two new paradigms that eventually converged into what we understand as network automation:

Rise of the Software-Defined Networking (SDN) movement. It proposed that the network behavior should be managed via software.
DevOps culture, where development and operations are deeply connected, so the software and infrastructure evolve closer.

Nowadays, most new networks are managed in an automated way. Instead of using human language constructs via CLI commands, network devices have adopted new interfaces for software-driven management: Application Programming Interfaces (APIs) such as NETCONF, RESTCONF, gRPC Network Management Interface (gNMI), or general-purpose REST APIs. These interfaces also came with new features to support advanced observability use cases. The data (configuration and operational) comes modeled, usually using Yet Another Next Generation (YANG), which enables more effective data processing. This modeling also enables model-driven telemetry (i.e., a continuous flow of data) to get data at a higher rate (more details in Chapter 3).
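The practical benefit of modeled data is that it can be processed as structured data instead of being scraped from CLI text. The sketch below parses a small, hand-written JSON document shaped after the OpenConfig interfaces model; the payload is an assumption made for illustration (it was not captured from a device), and the function name and output field names are hypothetical. Note that the JSON encoding of YANG data carries 64-bit counters as strings, so they are converted explicitly.

```python
import json

# Hand-written payload shaped after the OpenConfig interfaces model
# (JSON encoding of YANG data, where 64-bit counters travel as strings).
payload = """
{
  "openconfig-interfaces:interfaces": {
    "interface": [
      {"name": "Ethernet1",
       "state": {"oper-status": "UP", "counters": {"in-octets": "1200"}}},
      {"name": "Ethernet2",
       "state": {"oper-status": "DOWN", "counters": {"in-octets": "0"}}}
    ]
  }
}
"""


def interface_states(document):
    """Flatten the modeled data into one record per interface."""
    data = json.loads(document)
    records = []
    for item in data["openconfig-interfaces:interfaces"]["interface"]:
        state = item["state"]
        records.append({
            "name": item["name"],
            "oper_status": state["oper-status"],
            "in_octets": int(state["counters"]["in-octets"]),
        })
    return records


records = interface_states(payload)
down = [r["name"] for r in records if r["oper_status"] != "UP"]
```

Because the structure is defined by the model, the same parsing logic works for any device that implements it, with no per-vendor text parsing.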

Network observability has a key role in this context because it’s no longer a passive component that only monitors the network’s operational state. It goes a step further and becomes the catalyst of the automation tasks that remediate the network when issues (i.e., divergence between reality and expectations) are detected.

Note

The Network Programmability and Automation, 2nd Edition book by authors Matt Oswalt, Christian Adell, Scott S. Lowe, and Jason Edelman provides a high-level architecture of the role of network monitoring in a complete network automation solution.

Linux networking

Traditional network boxes have been running proprietary OSs with no or very limited access to all their capabilities. This is the reason most network monitoring has been done off the box instead of running a software agent within the box to collect operational data, as has been done with servers.

In the early 2010s, new players changed this rule (e.g., Cumulus Networks and Arista). They started to run network functions on top of a Linux OS while allowing access to running processes directly in the box and, in some cases (e.g., Cumulus), without any strong coupling with the hardware platform.

This openness has allowed running the network OS on many different platforms, and has led to the appearance of network OSs decoupled from any specific hardware vendor, such as SONiC or VyOS.

In terms of observability, this means that we are no longer only interested in the network control state but also in the state of the OS and the processes running in the box.

Cloud networking

Around the same time, the IT industry was radically transformed by cloud services where the IT infrastructure services (e.g., compute, storage, and networking) were abstracted and provisioned on-demand.

This shift in the way of interacting with IT infrastructure allowed a more rapid go-to-market of new applications as you no longer need to provision and manage your infrastructure directly. Instead, you can leverage APIs to embrace the Infrastructure as a Service paradigm and start paying as you grow without worrying about capacity management.

The adoption of cloud services (private and public) is massive today. And, as you may infer, it also involves cloud networking services that need to interact seamlessly with the physical network infrastructure. Most of the network environments are hybrid. This heterogeneity increases the complexity of what to observe and correlate to understand the actual state.

On top of this, if your company is actually running the cloud, you must be ready to manage the underlay and overlay networks to support it. For example, if your company manages its own Kubernetes cluster, the network observability has to cover it to provide a complete network state coverage.

All these technological changes have transformed the network and brought new requirements. However, they came along with another important transformation: a cultural one.

Cultural changes

The biggest cultural shift in IT infrastructure has been the change from a slow and static provisioning process to a fast-changing and dynamic approach. Everything in IT is consumed as a service (via APIs). There are many flavors, but we can simplify them into two:

Software as a Service (SaaS), where you simply use an application without caring about how it is built or operated
Infrastructure as a Service (IaaS), where the computing, storage, and networking resources are consumed without having to buy, transport, rack, and connect

Note

IaaS is not magic. There are teams behind these services who do the real work to allow others (the users of the IaaS) to consume it with this paradigm.

The flavor that applies to networking is IaaS, which we could call Networking as a Service. Aside from the technical challenges (which we introduced earlier), the key points of the cultural transformation are as follows:

Network users want to get their requests fulfilled in the order of seconds or minutes, not days or weeks as we used to.
No one (i.e., users) cares about the network heterogeneity (i.e., physical and cloud network services). People just want the network to work, and this requires offering a proper abstraction.
Adding human interventions slows down the process and only automated networks can implement this paradigm properly. Adopting automation and all its implications is no longer an option.

All these drivers led to establishing the network as a product (or a bunch of them, depending on the different purposes they have). And, when something becomes a product, it has to be managed as such. This is why we move next into how we evolve from just gathering operational state data into transforming it into business-level information that represents the state of the network as a product.

Transforming data into information

A key aspect of the modern network observability approach is to focus not only on collecting and visualizing operational data but also, on going a step further, transforming the data into actual information with a purpose.

The importance of using business terms

Networks (and most IT infrastructure components) have been seen, in many cases, as necessary IT actors without a strategic role in organizations’ businesses. Despite being a crucial actor in sustaining almost every part of every organization, IT infrastructure departments haven’t been able to properly communicate the value they provide.

IT infrastructure teams have to change this to become more relevant in the business strategy. The business’ success is built around all the teams, but some of them can explain more clearly how they contribute because they speak the business language.

So, the question is, how could we speak the business language? We recommend focusing on expressing the network information in business terms. For instance, the next table shows a few examples:

| Business Goal | Network Operational Data |
| --- | --- |
| How much impact a network incident has on the revenue | How much network downtime there was and which business services were impacted |
| How much better users’ experiences are, and how this translates to customer satisfaction | The level of packet loss and latency seen by the end users |
| Control the revenue per utilization of services | Use network stats to charge users per consumed bandwidth |
| How much faster the company processes can be delivered and contribute to increasing the company revenue | How many times a network service is automated versus run manually |
| How much spare network capacity is available to expand the business? | How many access ports are available, and how much capacity in the uplink links is left |
| When will we need to increase the capacity of the network according to the current usage trend? | Use data forecasting analysis to infer the trends of data consumption and determine the breaking point in the future. |

Table 2.1 – Business goals mapped to network operation data examples
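As an illustration of the last row, the breaking point can be estimated by fitting a trend line to historical usage and extrapolating it until it reaches the available capacity. The sketch below uses a plain least-squares line, which is only one of many possible forecasting methods; the function names, the monthly usage figures, and the 100 Gbps capacity are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_until_full(months, usage_gbps, capacity_gbps):
    """Estimate the month index at which the trend reaches capacity."""
    slope, intercept = fit_line(months, usage_gbps)
    if slope <= 0:
        return None  # usage is flat or shrinking; no breaking point
    return (capacity_gbps - intercept) / slope


# Made-up monthly peak usage of an uplink with 100 Gbps of capacity.
months = [0, 1, 2, 3, 4]
usage = [40, 45, 50, 55, 60]
breaking_point = months_until_full(months, usage, capacity_gbps=100)
```

With usage growing by 5 Gbps per month from a 40 Gbps starting point, the trend reaches the 100 Gbps capacity at month 12, which is a figure the business can plan around.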

Mapping the business goals into network observability data is a key task that shouldn’t be procrastinated. In this section, we will use Key Performance Indicators (KPIs) to