Today's network is about agility, automation, and continuous improvement. In Kafka Up and Running for Network DevOps, we will go on a journey to learn and set up the hugely popular Apache Kafka data messaging system. Kafka is unique in its principle of treating network data as a continuous flow of information that can adapt to ever-changing business requirements. Whether you need a system to aggregate log messages, collect metrics, or something else, Kafka can be the reliable, highly redundant system you want.
We will begin by learning the core concepts of Kafka, followed by detailed steps for setting up a Kafka system in a lab environment. For the production environment, we will take advantage of the various public cloud provider offerings. Next, we will use the Amazon Managed Kafka Service to host our Kafka cluster in the AWS cloud. We will also learn about AWS Kinesis, Azure Event Hub, and Google Cloud Pub/Sub. Finally, the book will illustrate several use cases of integrating Kafka with our network, from data enhancement and monitoring to event-driven architecture.
The Network DevOps Series is a series of books targeted at the next generation of Network Engineers who want to take advantage of the powerful tools and projects in modern software development and the open-source communities.
This book is for sale at http://leanpub.com/network-devops-kafka-up-and-running
This version was published on 2021-11-12
* * * * *
This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and many iterations to get reader feedback, pivot until you have the right book and build traction once you do.
* * * * *
ISBN for EPUB version: 978-1-957046-01-3
ISBN for MOBI version: 978-1-957046-02-0
For my family, you are my ‘why’ for everything I do.
I would like to thank the open-source software community. My life would be very different without the many dedicated, talented individuals in the open-source community. Thank you all.
Introduction
What is Kafka
Why do we need Kafka
Prerequisites for this book
Who this book is for
What this book covers
Download the example code files
Conventions used
Get in touch
Chapter 1. Kafka Introduction
History of Kafka
Kafka Use Cases
Disadvantages of Kafka
Kafka Concepts
Conclusion
Chapter 2. Kafka Installation and Testing
Network Lab Setup
Kafka Installation Overview
Install Java
Download Kafka
Configure Zookeeper
Configure Kafka
Start Zookeeper and Kafka manually
Test the Kafka operations
Configure System Services
Conclusion
Chapter 3. Kafka Concepts and Examples
Producers: Writing Messages
Consumers: Receiving Messages
Offsets in Action
Kafka Topic Administration
Replication
Conclusion
Chapter 4. Hosted Kafka Services
AWS Managed Kafka Service
Amazon MSK Costs
Launch Amazon MSK Cluster
Client Setup
Produce and Consume Data
Conclusion
Chapter 5. Cloud Provider Messaging Services
Amazon Kinesis
Amazon Kinesis Example
Azure Event Hub
Azure Event Hub Example
Google Cloud Pub/Sub
GCP Pub/Sub Python Example
Conclusion
Chapter 6. Network Operations with Kafka
Install Docker
Install Elasticsearch
Install Kibana
Network Data Feed
Network Data Pipeline
Network Log as a Service
Conclusion
Chapter 7. Other Kafka Considerations and Looking Ahead
Hardware Considerations
Kafka Broker and Topic Configurations
Schema Registry
Kafka Stream Processing
Cross-Cluster Data Mirroring
Additional Resources
Conclusion
Appendix A. Installing Lab Instance in Public Cloud
Begin Reading
Welcome to the world of data!
Unless you have been living under a rock for the last few years, you know data processing, machine learning, and artificial intelligence are taking over the world. Data exists everywhere around us. We can now check real-time traffic information from online cameras before we even leave the house. We can connect to our thermostats remotely to adjust house temperatures automatically. Better yet, the thermostats can teach themselves to adjust the temperature all on their own. Before our family weekend movie nights, my kids love to use the WiFi-enabled lights to match the lighting with our mood.
How are these cameras, lights, and thermostats able to take measurements and generate data? It turns out the cost of small sensors and tiny computing units has been coming down steadily, to the point where they can now be integrated into everyday items. However, the data generated by one or two devices might not be sufficient to yield meaningful results. After all, traffic information on one street might only benefit the tiny fraction of people who travel on that street, but aggregated traffic information on all streets can help everyone. Generally, it is by aggregating the dispersed data sets across hundreds of devices that we are able to derive useful information that helps us in our daily lives. This data is constantly flowing between producers and consumers of data.
Have you ever wondered how this data is exchanged between data producers and consumers? Does each device provide an API (Application Programming Interface) to be queried? Does each have a local database that persists the data? What about data integrity, transmission latency, or scalability?
There are many tools and projects that address these data streaming and exchange issues. One of the most popular open-source tools widely used by companies large and small alike is Apache Kafka.
You might be thinking, “Don’t we already have lots of data storage systems? Why do we need yet-another-storage-system?” You are right; we do have lots of storage solutions, such as relational and non-relational databases, cache systems, big data storage clusters, search solutions, and many more. But in most of those cases, the data is entered once, stored in the database, and retrieved later when needed. For example, when I visited my dentist for the first time, they asked for my personal information and entered it into a database so that, on future visits, they could pull up my record. This is very different from the traffic sensor data example that we discussed.
What sets Kafka apart is that it was built from the ground up to treat data as continuous flows of information that are constantly being produced, enhanced, manipulated, and consumed. Instead of focusing on holding data, as databases, key-value stores, search indexes, or caches do, Kafka is architected as a system that allows data to be a continually evolving stream of information.
According to the Apache Kafka project page:
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Companies known for handling large amounts of data, such as Airbnb, Datadog, Etsy, and many others across different industries, use Kafka to build their data pipelines. These data pipelines use a variety of services that both produce and consume data in a continuous format.
Don’t worry if you have not heard of Kafka before or are not sure how this tool can help us as network DevOps engineers. We will go a lot deeper into Kafka in this book.
As a general overview, there are many use cases for Kafka for network engineers:
- We can use Kafka to stream data, such as logs and NetFlow data, once and have it consumed by multiple receivers. Kafka takes care of ordering messages, acknowledging receipt to producers, confirming delivery to consumers, and balancing the data between different recipients.
- We can separate data into logical partitions called Topics within a single Kafka cluster. This allows subscribers to receive only the data they are interested in, so the log receiver does not need to receive flow data.
- Kafka allows for an event-driven architecture, triggering actions based on different types of events. For example, a log receiver can page an on-call engineer if it notices a BGP neighbor of a core device going down.
- Kafka allows us to build a centralized pipeline for network data processing instead of having dispersed teams process bits and pieces of data separately.

These are just some of the use cases of Kafka. By the end of this book, I am sure we will be able to find many more creative use cases.
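As a concrete sketch of the topic idea above, here is a toy in-memory model, not a real Kafka client, with made-up names throughout. It shows how one published message can be read independently by multiple subscribers of a topic, while subscribers of other topics never see it:

```python
from collections import defaultdict

class ToyBroker:
    """A minimal in-memory stand-in for a Kafka broker: append-only
    logs keyed by topic name, with a per-consumer read position."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic -> list of messages
        self.offsets = defaultdict(int)   # (consumer, topic) -> next index to read

    def produce(self, topic, message):
        # Producers simply append to the end of the topic log.
        self.topics[topic].append(message)

    def consume(self, consumer, topic):
        """Return the messages this consumer has not yet seen on the topic."""
        start = self.offsets[(consumer, topic)]
        messages = self.topics[topic][start:]
        self.offsets[(consumer, topic)] = len(self.topics[topic])
        return messages

broker = ToyBroker()
broker.produce("syslog", "BGP neighbor 10.0.0.2 down")
broker.produce("netflow", "flow record 1")

# Two independent consumers of "syslog" each receive the log message,
# while the flow collector subscribed to "netflow" never sees log data.
print(broker.consume("pager", "syslog"))            # ['BGP neighbor 10.0.0.2 down']
print(broker.consume("archiver", "syslog"))         # ['BGP neighbor 10.0.0.2 down']
print(broker.consume("flow-collector", "netflow"))  # ['flow record 1']
```

A real Kafka cluster adds partitioning, replication, and durable offset commits on top of this basic append-and-read-from-offset idea, which we will cover in later chapters.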
Basic knowledge of the Linux command line is required to get the most out of this book. We will use command-line tools such as cd for changing directories, ls for listing directory contents, and pwd to find out where in the directory tree you are currently operating.
We will be using Python 3 as the programming language in this book. Python is a popular language amongst network engineers, with a large ecosystem of tools and libraries. We will use Python to create Kafka producers and consumers, and to interface with public cloud providers. However, I do not believe you need to be an expert in Python 3 to understand the scripts in this book. If you need a refresher on Python, a good place to go would be the official Python Tutorial.
This book is ideal for IT professionals and engineers who want to take advantage of Kafka’s distributed, fault-tolerant streaming data platform. This book can also be used by management to gain a general understanding of Kafka and how it fits into the general IT infrastructure.
Chapter 1. Kafka Introduction, In this chapter, we will cover the general concepts of Kafka: the core architecture, components, and tools. We will look at the idea behind Kafka, how it was built, and how its components can help maintain data streams at scale.
Chapter 2. Kafka Installation and Testing, In this chapter, we will install Zookeeper and Kafka on a single Virtual Machine and configure both components. We will also prepare our network lab to be used for future examples. After installation, we will work on a few producer-consumer examples using Kafka command-line tools.
Chapter 3. Kafka Concepts and Examples, In this chapter, we will provide examples of Kafka usage for Producers and Consumers. The producers will write messages to a Topic with consumers receiving the messages. We will look at examples of offset, commit, and acknowledgment for data in the topics.
Chapter 4. Hosted Kafka Services, When we want to move Kafka from our lab setup into production, we can use the Kafka-hosting-as-a-service provided by various cloud providers, such as Amazon AWS or Confluent Cloud. In this chapter, we will provide a step-by-step guide to launch our Kafka cluster using Amazon Managed Streaming for Apache Kafka.
Chapter 5. Cloud Provider Messaging Services, If we are not ready for a managed Kafka cluster, the top public cloud providers, Amazon AWS, Microsoft Azure, and Google Cloud, offer their own versions of message streaming services. These messaging services have various degrees of Kafka compatibility. In this chapter, we will look at examples of AWS Kinesis, Azure Event Hub, and Google Cloud Pub/Sub.
Chapter 6. Network Operations with Kafka, In this chapter, we will explore examples of Kafka in network engineering. We will look at data feeds, data enhancement, and Kafka Connect, which reuses code provided by the community. We will look at the File and Elasticsearch Kafka Connect plugins.
Chapter 7. Other Kafka Considerations and Looking Ahead, In this chapter, we will discuss other Kafka considerations, such as hardware requirements, Broker and Topic configuration, Schema registry, and many more. This chapter will provide additional resources for readers to explore Kafka.
The code examples used in this book can be downloaded from GitHub at https://github.com/ericchou1/network-devops-kafka-up-and-running.
There are a number of text conventions used in this book to help organize the flow. Information in bold and italic is used to indicate important or special terms.
Code blocks are shown below:
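For instance, a code block looks like the following (this particular snippet is purely illustrative; the broker address and topic name are placeholders):

```python
# An illustrative code block: a minimal client configuration dictionary.
# "localhost:9092" and "network-logs" are placeholder values.
config = {
    "bootstrap_servers": "localhost:9092",
    "topic": "network-logs",
}
print(f"Connecting to {config['bootstrap_servers']}")
```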
Command-line input or output will be shown as follows:
Warning, tips, and information will be specified in their own special block:
This is a tip section. It will include useful tips and tricks in relation to the topic discussed at hand.
This is an information section. It will provide additional information to help you explore the topic further.
This is a warning blurb. Please pay special attention to this section when they appear, as they will contain important warnings.
Feedback from our readers is always welcome and appreciated. Please consider leaving a review on the various platforms; reviews really help others discover the book.
All feedback can be submitted to [email protected].
As mentioned in the introduction section, Apache Kafka is a high-throughput, low-latency platform for handling real-time data feeds.
At first glance, ‘low-latency, high-throughput for real-time data feeds’ might not sound like much. After all, every open-source project and commercial vendor (and their brother) can claim to be low-latency and high-throughput. But once you consider the type of companies using Kafka in their products and services, such as Uber, Netflix, and LinkedIn, you quickly realize how significant that claim is. When we click the like button on a LinkedIn post, it needs to appear on the post right away. That is low latency. If we consider how many Netflix movies are streaming every second, that is high throughput. And, of course, the customers of these companies expect all of these operations to take place in real time.
According to Netflix’s Kafka Inside Keystone Pipeline post, “700 Billion messages are ingested on an average day” by their 400+ Kafka brokers. Did they just say they process 700 billion messages a day in real time? Or consider Uber’s use case, Real-Time Exactly-Once Ad Event Processing, for the two-way marketplace of UberEats. There, the messages need to be fast and reliable, but Uber also needs to ensure each event is processed only once, with no overcount or undercount. The events need to be processed exactly once amongst all the consumers, full stop.
Kafka is excellent at how it can achieve its goals for these demanding projects. But how did this fantastic tool come about? First, let’s look into the history of Kafka.
Kafka was originally developed at LinkedIn by Jay Kreps, Neha Narkhede, and Jun Rao (Wikipedia). As the story goes, Jay Kreps named the project Kafka because he likes the author Franz Kafka’s work, and because Apache Kafka, much like the writer, is “a system optimized for writing.”
The project was released as an open-sourced project with the Apache Software Foundation in early 2011 and went from incubation to top-level apache project on October 23, 2012. It is written in Java and Scala with significant community backing.
The three original developers left LinkedIn and founded the company Confluent in 2014. The company aims to “set data in motion,” with (surprise!) Kafka at the center of that idea. As a result, many of the Kafka-related projects, documentation, products, and initiatives are actively developed and sponsored by Confluent.
Within the Kafka architecture, at the center is the idea of event streaming. Software systems drive our world. These systems are interconnected, always-on, and automated. Kafka provides the centralized middle ground for these systems to exchange information, or events, in the form of topics (or categories). The producer systems can send events to a particular topic, while the consumer systems can receive these events via subscription.
We will use the term events and messages interchangeably in this book to refer to the data being exchanged by producers, consumers, and Kafka.
In the words of the Kafka documentation, event streaming is analogous to the central nervous system of the human body, which connects the tissues in different parts of the body.
In terms of network engineering, in my opinion, we can use Kafka event streaming in a few different scenarios:
- We can use Kafka to process transactions in real time, such as device provisioning from warehouse shipment to fully functional in a data center.
- We can use Kafka to implement an event-driven architecture. Kafka can be used to track and analyze changes in network events, such as BGP neighbor relationships or interface flapping.
- We can use Kafka to capture and analyze IoT and wireless sensor data continuously. This process can be done in a distributed fashion, with Kafka servers across different regions.
- We can use Kafka to connect, store, and make available data produced by a single source to multiple destinations. An example would be to store a single set of network SNMP data in a Kafka topic, which multiple monitoring systems can consume. This allows us to poll the network device only once, reducing CPU and network overhead.

If we combine the above use cases, Kafka allows us to:

- Continuously capture events
- Connect different parts of the system
- Immediately react to a change in system state
- Minimize the impact on the network devices

We will look at some of the disadvantages of Kafka in the next section.
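To make the event-driven idea above concrete, here is a tiny Python sketch. It is an in-memory simulation with made-up names, not a Kafka API: a handler subscribes to one type of network event and reacts as soon as a matching event is published.

```python
# Toy event-driven dispatcher: handlers register for an event type
# and fire whenever a matching event arrives. All names are illustrative.
handlers = {}
alerts = []  # stands in for a paging system

def subscribe(event_type, handler):
    """Register a handler function for one event type."""
    handlers.setdefault(event_type, []).append(handler)

def publish(event):
    """Deliver an event to every handler subscribed to its type."""
    for handler in handlers.get(event["type"], []):
        handler(event)

def page_oncall(event):
    # React to the event: record a page for the on-call engineer.
    alerts.append(f"PAGE: {event['detail']}")

subscribe("bgp_neighbor_down", page_oncall)

# An interface-flap event has no subscriber here, so nothing fires;
# the BGP event triggers the on-call page immediately.
publish({"type": "interface_flap", "detail": "core1 Gi0/1"})
publish({"type": "bgp_neighbor_down", "detail": "core1 lost 10.0.0.2"})
print(alerts)  # ['PAGE: core1 lost 10.0.0.2']
```

With real Kafka, the publish side would be a producer writing to a topic and each reaction would be a consumer subscribed to that topic, but the decoupling between the event source and its reactions is the same.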
If Kafka is so great, why doesn’t everybody use it? Of course, no system is perfect. Like many, if not all, system design approaches, the design of Kafka is a story of tradeoffs. What are some of the disadvantages of Kafka? Let’s take a look at a few of them:
Kafka clusters can be complex and hard to set up. Managing a Kafka cluster can have a high learning curve.