Design and secure highly available, cost-effective data streaming applications with Amazon Kinesis
Tarik Makota
Brian Maguire
Danny Gagne
Rajeev Chakrabarti
BIRMINGHAM—MUMBAI
Copyright © 2021 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Kunal Parikh
Publishing Product Manager: Devika Battike
Senior Editor: Mohammed Yusuf Imaratwale
Content Development Editors: Sean Lobo and Tazeen Shaikh
Technical Editor: Devanshi Deepak Ayare
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Production Designer: Shankar Kalbhor
First published: March 2021
Production reference: 1300321
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80056-540-1
www.packt.com
Tarik Makota hails from a small town in Bosnia. He is a principal solutions architect with AWS, a builder, a writer, and the self-proclaimed best fly fisherman at AWS. Never a perfect student, he managed to earn an MSc in software development and management from RIT. When he is not "doing the cloud" or writing, Tarik spends most of his time fly fishing to pursue slippery trout. He feeds his addiction by spending summers in Montana. Tarik lives in New Jersey with his family, Mersiha, Hana, and two exceptionally perfect dogs.
Brian Maguire is a solutions architect at AWS, where he is focused on helping customers build solutions in the cloud. He is a technologist, writer, teacher, and student who loves learning. Brian lives in New Hope, Pennsylvania, with his family, Lorna, Ciara, Chris, and several cats.
Danny Gagne is a solutions architect at AWS. He has extensive experience in the design and implementation of large-scale, high-performance analysis systems. He lives in New York City.
Rajeev Chakrabarti is a principal developer advocate with the Amazon Kinesis and the Amazon MSK team. He has worked for many years in the big data and data streaming space. Before joining the Amazon Kinesis team, he was a streaming specialist solutions architect helping customers build streaming pipelines. He lives in New Jersey with his family, Shaifalee and Anushka.
Ritesh Gupta works as a software development manager with AWS, leading the control plane and data plane teams on the Kinesis Data Streams service. He has over 20 years of experience in leading and delivering geographically distributed web-scale applications and highly available distributed systems supporting millions of transactions per second; he has 10 years of experience in managing engineers and managers. Prior to Amazon, he worked at Microsoft, EA Games, Dell, and a few successful start-ups. His technical expertise cuts across building web-scale applications, enterprise software, and big data. I thank my wife, Jyothi, and daughter, Udita, for putting up with the late-night learning sessions that have allowed me to be where I am.
Randy Ridgley is an experienced technology generalist working with organizations in the media and entertainment, casino gaming, and public sector fields that are looking to adopt cloud technologies. He started his journey into software development at a young age, building BASIC programs on the Commodore 64. In his professional career, he started by building Windows applications, eventually graduating to Linux with multiple programming languages. Currently, you can find Randy spending most of his time building end-to-end real-time streaming solutions on AWS using serverless technologies and IoT.
Amazon Kinesis is a collection of secure, serverless, durable, and highly available purpose-built data streaming services. These data streaming services provide APIs and client SDKs to enable you to produce and consume data at scale.
Scalable Data Streaming with Amazon Kinesis begins with a quick overview of the core concepts of data streams along with the essentials of the AWS Kinesis landscape. You'll then explore the requirements of the use cases shown throughout the book to help you get started, and cover the key pain points encountered in the data stream life cycle. As you advance, you'll get to grips with the architectural components of Kinesis, understand how they are configured to build data pipelines, and delve into the applications that connect to them for consumption and processing. You'll also build a Kinesis data pipeline from scratch and learn how to implement and apply practical solutions. Moving on, you'll learn how to configure Kinesis on a cloud platform. Finally, you'll learn how other AWS services can be integrated into Kinesis. These services include Amazon Redshift, Amazon DynamoDB, Amazon S3, Elasticsearch, and third-party applications such as Splunk.
By the end of this AWS book, you'll be able to build and deploy your own Kinesis data pipelines with Kinesis Data Streams (KDS), Kinesis Data Firehose (KFH), Kinesis Video Streams (KVS), and Kinesis Data Analytics (KDA).
This book is for solutions architects, developers, system administrators, data engineers, and data scientists looking to evaluate and choose the most performant, secure, scalable, and cost-effective data streaming technology to overcome their data ingestion and processing challenges on AWS. Prior knowledge of cloud architectures on AWS, data streaming technologies, and architectures is expected.
Chapter 1, What Are Data Streams?, covers core streaming concepts so that you will have a detailed understanding of their application in distributed systems.
Chapter 2, Messaging and Data Streaming in AWS, takes a brief look at the ecosystem of AWS services in the messaging space. After reading this chapter, you will have a good understanding of the various services, be able to differentiate them, and understand the strengths of each service.
Chapter 3, The SmartCity Bike-Sharing Service, reviews the existing bike-sharing application and how the city plans to modernize it. This chapter will provide the background information for the examples used throughout the book.
Chapter 4, Kinesis Data Streams, teaches concepts and capabilities, common deployment patterns, monitoring and scaling, and how to secure KDS. We will step through a data streaming solution that will ingest, process, and feed data from multiple SmartCity data systems.
Chapter 5, Kinesis Firehose, teaches the concepts, common deployment patterns, monitoring and scaling, and security in KFH.
Chapter 6, Kinesis Data Analytics, covers the concepts and capabilities, approaches for common deployment patterns, monitoring and scaling, and security in KDA. You will learn how real-time streaming data can be queried like a database with SQL or code.
Chapter 7, Amazon Kinesis Video Streams, explores the concepts, monitoring and scaling, security, and deployment patterns for real-time communication and data ingestion. We will step through a solution that will provide real-time access to a video stream and ingest video data for the SmartCity data system.
Chapter 8, Kinesis Integrations, reviews how to integrate Kinesis with several Amazon services, such as Amazon Redshift, Amazon DynamoDB, AWS Glue, Amazon Aurora, Amazon Athena, and other third-party services such as Splunk. We will integrate a wide variety of services to create a serverless data lake.
All of the examples in the chapters in this book are run using an AWS account to access services such as Amazon Kinesis, DynamoDB, and Amazon S3. Readers will need a Windows, Mac, or Linux computer with an internet connection. Many of the examples in the book use a command-line terminal such as PuTTY, macOS Terminal, GNOME Terminal, or iTerm2 to run commands and change configuration. The examples written in Python are written for the Python 3 interpreter and may not work with Python 2. For the examples written for the Java platform, readers are encouraged to use Java version 11 and AWS Java SDK version 1.11. We make extensive use of the AWS CLI v2 and will also use Docker for some examples. In addition to software, a webcam or IP camera and Android device will be needed to fully execute some of the examples.
If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Streaming-Data-Solutions-with-Amazon-Kinesis. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781800565401_ColorImages.pdf.
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "In this command, we'll send the test2.mkv file we downloaded to the KVS stream."
A block of code is set as follows:
aws glue create-database --database-input "{\"Name\":\"smartcitybikes\"}"
aws glue create-table --database-name smartcitybikes --table-input file://SmartCityGlueTable.json
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
mediaSource.start();
Any command-line input or output is written as follows:
aws rekognition start-stream-processor --name kvsprocessor
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Once you have entered the appropriate information, all that's left is to click Create signaling channel."
Tips or important notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
In this section, you will be introduced to the concept of data streams and how they are used to create scalable data solutions.
This section comprises the following chapters:
Chapter 1, What Are Data Streams?
Chapter 2, Messaging and Data Streaming in AWS
Chapter 3, The SmartCity Bike-Sharing Service

A data stream is a system where data continuously flows from multiple sources, just like water flows through a stream. The data is often produced and collected simultaneously in a continuous flow of many small files or records. Data streams are utilized by a wide range of business, medical, government, social media, and mobile applications. These applications include financial applications for the stock market and e-commerce ordering systems that collect orders and cover the fulfillment of deliveries.
In the entertainment space, live data is produced by sensing devices embedded in player equipment, video game players generate data at a massive scale, and new social media posts appear thousands of times per second. Governments also leverage streaming data and geospatial services to monitor land, wildlife, and other activities.
Data volume and velocity are increasing at faster rates, creating new challenges in data processing and analytics. This book will detail these challenges and demonstrate how Amazon Kinesis can be used to address them. We will begin by discussing key concepts related to messaging in a technology-agnostic form to provide a solid foundation for building your Kinesis knowledge.
Incorporating data streams into your application architecture will allow you to deliver high-performance solutions that are secure, scalable, and fast. In this chapter, we will cover core streaming concepts so that you will have a detailed understanding of their application to distributed systems. You will learn what a data stream is, how to leverage data streams to scale, and examine a number of high-level use cases.
This chapter covers the following topics:
Introducing data streams
Challenges associated with distributed systems
Overview of messaging concepts
Examples of data streaming

Data streams are a way of storing a sequence of messages. They enable us to design systems where we think about state as a series of events instead of only entities and values, or rows and columns in a database. This shift in mindset and technology enables real-time analytics that extract value from data by acting on it before it is stale. Data streams also enable organizations to design and develop resilient software based on microservice architectures by helping them decouple systems. We will begin with an overview of streaming data sources, why real-time data analysis is valuable, and how data streams can be used architecturally to decouple systems. We will then review the core challenges associated with distributed systems, and conclude with an overview of key messaging concepts and some high-level examples. Messages can contain a wide variety of information and come from different sources, so let's look at the primary sources and data formats.
The proliferation of data steadily increases from sources such as social media, IoT devices, web clickstreams, application logs, and video cameras. This data poses challenges to most systems, since it is typically high-velocity, intermittent, and bursty, making it difficult to adequately provision and design downstream systems. Payloads are generally small, except when containing audio or video data, and come in a variety of formats.
In this book, we will be focusing on three data formats. These formats include the following:
JavaScript Object Notation (JSON)
Log files
Time-encoded binary files such as video

JSON has become the dominant format for message serialization over the past 10 years. It is a lightweight data interchange format that is easy for humans to read and write and is based on the JavaScript object syntax. It has two data structures: hash tables and lists. A hash table consists of key-value pairs, {"key":"value"}, where the keys must be unique. A list is a set of values in a specific order, ["value 1", "value 2"]. The following code sample shows a sample IoT JSON message:
{
    "deviceid": "device001",
    "eventTime": -192778200,
    "temp": 68.4,
    "humidity": 77.3,
    "coords": {
        "latitude": 32.779039,
        "longitude": -96.808660
    }
}
Log files come in a variety of formats. Common ones include Apache Commons Logging, Apache Combined Log, Apache Error Log, and RFC3164 Syslog. They are plain text, and usually each line, delineated by a newline ('\n') character, is a separate log entry. In the following sample log entry, we see an HTTP GET request: the requester's IP address (10.13.37.01), the datetime of the request, the HTTP verb, the URL fragment, the HTTP version, the response code, and the size of the result.
The sample log line in Apache Commons Logging format is as follows:
10.13.37.01 - - [03/Sep/2017:12:00:01 +0830] "GET /mailman/listinfo/test HTTP/1.1" 200 2457
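To make this field breakdown concrete, here is a minimal Python sketch (illustrative, not part of the book's code bundle) that parses the line above with a regular expression; the group names are our own:

import re

# Illustrative pattern for the log line format shown above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

line = ('10.13.37.01 - - [03/Sep/2017:12:00:01 +0830] '
        '"GET /mailman/listinfo/test HTTP/1.1" 200 2457')

match = LOG_PATTERN.match(line)
if match:
    print(match.groupdict())
    # {'ip': '10.13.37.01', 'timestamp': '03/Sep/2017:12:00:01 +0830',
    #  'verb': 'GET', 'path': '/mailman/listinfo/test',
    #  'protocol': 'HTTP/1.1', 'status': '200', 'size': '2457'}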
Time-encoded binary streams consist of a time series of records where each record is related to the adjacent records (prior and subsequent records). These can be used for a wide variety of sensor data, from audio streams and RADAR signals to video streams. Throughout this book, the primary focus will be video streams and their applications.
Figure 1.1 – Time-encoded video data
As shown in Figure 1.1, video streams are composed of fragments, where each fragment is a self-contained sequence of media frames. There are no dependencies between fragments. We will discuss video streams in more detail in Chapter 7, Kinesis Video Streams. Now that we've covered the types of data that we'll be processing, let's take a step back to understand the value of real-time data in analytics.
Analysis is done to support decision making by individuals, organizations, or computer programs. Traditionally, data analysis has been done on batches of data, usually in long-running jobs that run overnight or at other predetermined intervals: nightly, weekly, quarterly, and so on. This not only limits the scope of actions available to decision makers, but also provides them with only a representation of a past environment. Information is now available seconds after it is produced, so we need to design systems that provide decision makers with the freshest data available to make timely decisions.
The OODA (Observe, Orient, Decide, Act) loop is a conceptual decision-making framework that describes how decisions are made when reacting to an event. By breaking the process down into these four components, we can optimize each one to reduce the overall cycle time. The key idea is that if we make better decisions more quickly than our opponent, we can outmaneuver them and win. By moving from batch to real-time analytics, we reduce the Observe portion of this cycle.
John Boyd
John Boyd was a USAF colonel and military strategist. He developed the OODA loop to better understand pilot combat operations. It has since been expanded and is used at a more strategic level by the military, sports teams, and businesses.
By reducing the OODA loop cycle time, new actions become available: they can be taken while events are still unfolding, not merely in response after the event has occurred. These time-critical decisions can range from responding to security log anomalies to providing customer recommendations based on a user's recently viewed items. These actions are extremely valuable because they allow us to respond quickly to changing events, and they are only possible because we can process the data in near real time. The following diagram, inspired by the Perishable Insights report by Mike Gualtieri, shows how time to action correlates with the data's perishability. Each insight has a corresponding action that can only be taken if the data is processed quickly enough, before the insight perishes:
Figure 1.2 – Perishable insights
The preceding diagram uses shopping as an example to highlight the key distinction between time-critical and historical analysis. Combining historical data and recent data is extremely valuable since it allows deeper insights and can be used to detect patterns and anomalies. The goal of stream analysis is to reduce the amount of time between an event occurring and the appropriate response.
A distributed system is composed of multiple networked servers that work together by sending messages between each other. They allow applications to be built that require more compute, storage, or resiliency than is available on a single instance. Some common distributed systems are the World Wide Web, distributed databases, and scientific computing clusters. Distributed systems are often fractal. For example, the three-tier web application, perhaps the most common architecture you will see in the wild, is often constructed of distributed databases, log analysis systems, and payment providers.
The need for distributed systems has increased dramatically over the past 10 years. There are three primary drivers for this: data scale, computational requirements, and organization design and coordination. At first, these systems were brittle and challenging to manage, but over time, certain key patterns emerged that have enabled them to scale by reducing complexity.
The first key in managing complexity was adopting standardized interfaces and common data formats and encodings. This allowed the development of microservice-based architectures where different teams could manage functionality and provide it as a service to the rest of the organization. This reduced the amount of coordination among teams and allowed them to iterate and release at their own appropriate speed, thereby acknowledging and leveraging Conway's Law.
Conway's Law
In 1967, Melvin Conway stated: "Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure." This is based on the observation that people need to communicate in order to design and develop systems. When this is applied to microservices, it allows the groups to own their services directly and explicitly model the organization/communication/software architecture correspondence.
The second was to separate the program into different fault domains by moving to a loosely coupled architecture. This is often achieved by having one system send another system a message. However, when messages are sent directly from one fault domain to another, it is difficult to reason about and understand the complex failure modes of these systems. By introducing asynchronous message brokers, we can define clear boundaries between different fault domains, making it possible to reason about them. The message queue acts as an invariant in the system: it provides a clean interface through which messages can be sent and retrieved. If another system is unavailable, the message broker can cache the messages, called a backlog, and that system is responsible for handling them when it resumes service.
There are still many challenges to the design, deployment, and orchestration of these decoupled systems. However, the introduction of modern highly available message brokers has been key in reducing their complexity.
Now that we've seen how asynchronous messaging can separate fault domains, let's learn how they fit into distributed systems.
The fundamental challenge of distributed systems is intra-system communication. When possible, a messaging system can provide a core decoupling function, allowing intermittent and transient failures not to cascade or cross fault boundaries. These systems must be highly available, scalable, and durable. The following core concepts are essential to understand and reason about these systems: transactions per second, scaling, latency, and high availability. They allow us to understand the system's key dynamics so that resources can be provisioned to support the workload.
The most important metric for all messaging systems is Transactions Per Second (TPS). This metric is not as simple as it may seem initially, as the maximum TPS is constrained by either a discrete number of transactions or the maximum size of data that can be processed. This max TPS is called capacity. In general, messaging systems have different capacity for the inbound side and the outbound side, with the outbound side normally having a greater capacity to support multiple consumers and prevent large message backlogs.
Backpressure refers to a system state in which the producer TPS is higher than the consumer TPS. The input is coming in faster than it can be processed. There are multiple strategies for handling backpressure. The easiest is to reduce the number of messages being sent, for example, having a temperature sensor send data once a minute instead of once a second. The second is to scale the compute for the consumers to increase the consumer TPS. If the flow of messages is intermittent or bursty, a buffer can temporarily hold the messages and allow the consumers to catch up. Buffers are often used in conjunction with scaling to store messages while compute is scaled up. The last method is to drop messages. Depending on the message type, this can be unacceptable (you don't want to drop customer orders), but in the case of sensor data, sampling can be used to process a fixed percentage of the data, for example, 5%.
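As a rough illustration (a sketch, not taken from the book's code bundle), the following Python snippet shows two of these strategies: buffering bursts in a bounded queue and falling back to sampling once the buffer is full.

import queue
import random

# Bounded in-memory buffer standing in for a stream that absorbs bursts.
buffer = queue.Queue(maxsize=1000)
SAMPLE_RATE = 0.05  # when saturated, keep roughly 5% of messages

def produce(message):
    """Buffer the message; once the buffer is full, fall back to sampling."""
    try:
        buffer.put_nowait(message)
    except queue.Full:
        # Backpressure: consumers are behind. Dropping is acceptable for
        # sensor readings, but not for data such as customer orders.
        if random.random() < SAMPLE_RATE:
            buffer.get_nowait()         # evict the oldest message
            buffer.put_nowait(message)  # keep a sampled recent one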
Messaging systems need to present an access point that hides the complexity of the internal system. In general, messaging systems consist of multiple independent channels and shards. A shard is an independent unit of capacity. This internal complexity cannot be completely hidden from users since the way data is distributed to the different shards needs to be understood by both senders and receivers. Scaling is used to increase the capacity of the messaging system. One way to think about scaling is to consider cables supporting a bridge. If it has 5 strands, it can support 50 tons, and with 50 strands it can support 500 tons. Thus, one unit of scaling for bridges would be the number of cable strands.
As a system scales, its ability to maintain the global order of records becomes limited. In general, order will only be maintained at a sub-global group level. This is an important design consideration that must be covered when designing real-time solutions. If global order is needed, there are fundamental limits on the system's maximum throughput, and if the throughput required is higher, the system will have to be rearchitected.
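To make the relationship between partition keys, shards, and ordering concrete, here is a small Python sketch. Kinesis itself maps each record's partition key onto a shard by taking an MD5 hash of the key; the helper below is a simplified stand-in that uses a modulo over a shard count rather than the service's actual hash key ranges.

import hashlib

NUM_SHARDS = 5

def assign_shard(partition_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a partition key to a shard by hashing it (simplified stand-in for
    Kinesis, which maps the 128-bit MD5 value onto each shard's key range)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# All records for one device land on one shard, so their order is preserved
# within that shard; records for different devices may interleave across shards.
for device in ["device001", "device002", "device003"]:
    print(device, "-> shard", assign_shard(device))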
Etymology of the shard
In 1997, the game Ultima Online was released. In order to reduce latency and handle scale, there were multiple servers around the world that a player could log in to. Each server functioned independently and existed as its own universe in a multi-verse. This was explained in-game by the wizard Mondain capturing the world of Sosaria in the Gem of Immortality. This gem was then shattered by the Avatar into multiple shards. The player then selected which shard they wanted to play in. The term shard, or sharding, is another way to talk about the horizontal partitioning of data, that is, spreading data across multiple servers.
Latency is the amount of time between a cause and effect in a system. In the context of a messaging system, there are multiple measures of latency that are important in understanding its behavior. In general, it is the time between when a message enters the system and when the message leaves the system. For example, it can be thought of as the time between pressing the brakes in a car and the vehicle stopping. Some workloads, for example, real-time audio/video communication, are especially latency-sensitive and care must be taken to minimize this across all aspects of the system.
The two primary measures of latency in a messaging system are propagation delay and age of message.
The propagation delay is the amount of time from when a message is written to the message broker to when it is read by the consumer application. In most cases, propagation delay is a reflection of how often producers or consumers are polling the message broker. The time it takes for a producer's request to complete, including network effects on its connection to the messaging system and the acknowledgment of the put, is known as producer latency; correspondingly, the time it takes for a request to complete on the consumer side is consumer latency.
The last measure of latency that is extremely important is understanding how long a message has been in the system before it is retrieved, that is, the age of the message. If the average age of messages is increasing, that indicates a backlog and means that messages are being added faster than they are being retrieved.
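As a quick illustration (a sketch; it assumes each record carries the time at which it was written to the stream, similar to the arrival time Kinesis attaches to each record as ApproximateArrivalTimestamp), the age of a message can be computed when it is retrieved:

import time

def message_age_seconds(arrival_epoch: float) -> float:
    """Age of a message: time elapsed since it was written to the stream.
    arrival_epoch is assumed to be a Unix timestamp recorded at write time."""
    return time.time() - arrival_epoch

# A rising average age across retrieved records indicates a growing backlog.
arrival_times = [time.time() - 12.0, time.time() - 7.5]  # two sample records
ages = [message_age_seconds(ts) for ts in arrival_times]
print(f"average age: {sum(ages) / len(ages):.1f}s")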
Messaging systems are foundational to modern distributed systems and need to be designed in such a way to be highly available.
"Everything fails all the time."
– Werner Vogels, Amazon CTO
The preceding quote hints at the difficulty of building highly available systems. To avoid single points of failure, redundancy is required, and messages, once acknowledged, need to be durably stored. Even though messaging systems present a simple interface, to achieve this level of performance, they are actually composed of many systems configured as a cluster.
Now that we have the vocabulary to talk about inter-system communication, let's introduce the components of messaging systems.
In this section, we will review the concept of message brokers in a high-level, implementation-agnostic manner. First, we will go over the core components of all messaging systems and then we will review some key terminology and concepts related to their use.
There are four components in all messaging systems: producers, consumers, streams, and messages. The following diagram shows a logical breakdown of producers sending messages to a stream, the stream buffering them, and consumers receiving them:
Figure 1.3 – High-level view of messaging
Despite this design's relative simplicity, there is a substantial amount of configuration and optimization that is possible. Now, let's dive a little deeper into each component.
The stream is the system that stores the messages or records sent by the producers and retrieved by the consumers. They can be ordered in a First In First Out (FIFO) model. Messages in the stream that have been received, but not yet retrieved, are referred to as a backlog.
The retention period is the length of time that the records are accessible after they are added to the stream. This is the maximum size the backlog can be, and it is also the maximum time a new, slow, or intermittent consumer can access the records.
A message consists of a payload and header information. The header information consists of information set by the producer, and it includes a unique identifier assigned by the message broker when it is inserted into the stream. In general, messages are relatively small, in the order of kilobytes, and messaging systems generally have a maximum payload size.
The producer is an application that is the source of data that will go into the message or record. It connects to the message broker and puts the data into the stream. There can often be multiple producers sending data to the same message broker.
The consumer is an application that receives the messages that are sent by the producer. It connects to the message broker and retrieves the data from the stream. The responsibility for keeping track of the last read message, so that the consumer can retrieve the next message, can be handled either by the message broker (RabbitMQ or SQS) or by the consumer (Kinesis or Kafka). There can be multiple consumers for a message broker.
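To ground these two roles, here is a minimal sketch using the AWS SDK for Python (boto3) against a hypothetical stream named smartcity-events. It is illustrative only; a production consumer would typically use the Kinesis Client Library or enhanced fan-out rather than polling a single shard like this.

import json
import boto3

kinesis = boto3.client("kinesis")
STREAM = "smartcity-events"  # hypothetical stream name

# Producer: put a record; the partition key determines which shard receives it.
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps({"deviceid": "device001", "temp": 68.4}).encode("utf-8"),
    PartitionKey="device001",
)

# Consumer: read from the start of one shard. Note that the consumer, not the
# broker, tracks its position in the stream.
shard_id = kinesis.describe_stream(
    StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]:
    print(record["PartitionKey"], record["Data"])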
When thinking about real-time analytics, it can be useful to expand it from the producer, stream, consumer model to a five-stage model (Figure 1.3): 1) source of data; 2) data ingestion mechanisms; 3) stream storage; 4) real-time stream processing; and 5) destination, data sink, or action. This model helps us elevate our thinking from the structural communication level to the data processing level. For instance, filtering can be applied at every stage to reduce compute downstream.
The source of data refers to where the data is coming from. For example, it could be mobile devices, web clickstreams, log analytics, IoT devices, or smart devices. Once you have the source, the data needs to be ingested into the stream. This requires a solution that can capture data coming from hundreds of thousands of devices, in a scalable and reliable manner, into a stream for analysis. You then need a platform that can reliably and durably store the data while simultaneously reading from any point in the stream. This refers to the stream storage platform. The stored data is then processed by real-time applications to generate actionable insights, perform actions, and execute real-time extract-transform-load (ETL) operations that deliver the stream of data to an end destination, such as a data lake.
Next, let's see how systems can be designed in a resilient manner.
While relatively simple, the implementation of the four components can be nuanced. In all networked systems, failure is complicated. Every network call can have issues, and the systems need to be resilient to handle them. In the following sub-sections, we will review a few key concepts related to resilient systems and also a few advanced stream processing features.
Here are eight fallacies associated with distributed computing. In 1994, Peter Deutsch identified the fact that everyone who builds distributed systems initially gets into trouble by making the assumptions listed here:
The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
The topology doesn't change.
There is one administrator.
The cost of transport is zero.
The network is homogeneous (added by James Gosling in 1997).
Note
All systems should be designed with those fallacies in mind, and with special attention to the unreliability of the network. Systems that don't properly handle these issues will exhibit complicated and confusing behavior as well as error modes that are challenging to debug.
Timeouts allow for efficient allocation of resources and help prevent cascading failures. If an individual process has an error, it can fail to return a value and hang. In this case, the client may continue to wait indefinitely for a response. Timeouts help prevent server resource exhaustion by ending the connection after a maximum amount of time has passed. This allows the server to free up limited resources, for example, memory, connections, and ports, and use them to handle new requests. The client can retry the request again.
Many errors are ephemeral, and merely retrying the exact same request again will succeed. In order for retries to be safe, the system handling them must be idempotent, meaning that it is designed in such a way that the same input will cause the same side effects. At a more systemic level, to prevent a server from being inundated with retry requests, each client should implement back-off and jitter.
Back-off is the process of increasing the time between subsequent retries. Jitter is the process of adding a bit of random delay to retries. Together, these two mechanisms spread out message requests over time so that the server is able to handle the number of requests.
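A minimal Python sketch of these two mechanisms (illustrative, not the book's code) might look like this:

import random
import time

def call_with_retries(request, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a callable with exponential back-off and full jitter.
    The request must be idempotent for retries to be safe."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Back-off: double the delay ceiling on each attempt.
            # Jitter: pick a random delay below the ceiling so many clients
            # do not retry in lockstep.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, ceiling))

# Usage (with the hypothetical producer from earlier):
# call_with_retries(lambda: kinesis.put_record(StreamName=STREAM, ...))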
When a producer has to retry due to a timeout, it will send the request again. There is the possibility that a duplicate record could be created. If a record should only be processed once, it is important that the payload of the record has a unique ID that the final system can use to remove duplicates. When a consumer fails, it can fetch the same records again. Consumer retries tend to happen more often than producer retries. It is up to the final application to handle the message payload data properly and in an idempotent manner.
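For example, duplicates can be discarded at the final application by keying on a unique ID carried in the payload. The sketch below keeps seen IDs in memory purely for illustration; a real system would persist them in a durable store:

seen_ids = set()  # in practice, a durable store such as a database table

def process_once(record):
    """Idempotent consumer: repeated deliveries of the same record are no-ops."""
    record_id = record["id"]        # unique ID assigned by the producer
    if record_id in seen_ids:
        return                      # duplicate caused by a retry; skip it
    seen_ids.add(record_id)
    handle(record)                  # application-specific processing

def handle(record):
    print("processing", record["id"])

process_once({"id": "order-123"})
process_once({"id": "order-123"})   # second delivery is ignored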
A backlog is the number of messages that the stream contains that have yet to be received by a consumer. Backlogs occur when the number of messages a producer sends into a stream is higher than the number of messages received by a consumer. This often happens when the system consuming the messages has an error and the messages keep being added to the stream. This can quickly go from a small backlog to a large backlog. Large backlogs increase the overall system latency by a large amount as the backlog must be processed before the recently arriving messages are processed. This typically results in a bimodal distribution of message latencies, where the latency is low when the system is working correctly and high when the system is having errors.
Large backlogs are a hidden risk that needs to be considered when designing asynchronous applications because they can increase the recovery time following an outage: instead of merely restarting the system and it being down for a brief period of time, the system has to work through the large backlog before it can function properly again.
Dead letter queues store messages that cannot be processed correctly by the message broker for some reason or another. It could be that it is an invalid message, it is too big, or, for some reason, it fails a certain number of retry attempts. It is important to periodically review dead letter queues because they represent errors in the system.
Replay is the ability to read, or replay, the same records in the same order multiple times. This means that a new consumer can be added and re-read messages that have already been consumed. Replay is limited to data in the stream. Data is aged out of the stream after it has existed for a specified period of time, for example, 1 hour, 1 day, or 7 days. This retention period affects the amount of storage required to support the stream.
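With Kinesis Data Streams, for instance, replay comes down to choosing where a new shard iterator starts. The sketch below (reusing the hypothetical smartcity-events stream) re-reads a shard either from the oldest retained record or from a specific point in time:

import datetime
import boto3

kinesis = boto3.client("kinesis")
STREAM = "smartcity-events"  # hypothetical stream name
shard_id = kinesis.describe_stream(
    StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]

# Replay everything still within the retention period...
from_start = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

# ...or replay from a specific point in time.
from_timestamp = kinesis.get_shard_iterator(
    StreamName=STREAM,
    ShardId=shard_id,
    ShardIteratorType="AT_TIMESTAMP",
    Timestamp=datetime.datetime(2021, 3, 1, 12, 0, 0),
)["ShardIterator"]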
When processing records, there are multiple approaches depending on the type of data in the payload and the type of analysis required. In the simplest of systems, each record is processed one at a time, that is, record by record. A more complicated approach is to aggregate records by a sliding time window, where records are accessed by the consumer over a period of time, for instance, calculating the highest, lowest, and the average message value over the last 10 seconds.
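The following sketch (illustrative, not the book's code) shows the sliding-window approach: it keeps only the last 10 seconds of readings and reports the minimum, maximum, and average value.

import time
from collections import deque

WINDOW_SECONDS = 10
window = deque()  # (timestamp, value) pairs within the window

def add_reading(value, now=None):
    """Add a reading and drop anything older than the window."""
    now = now if now is not None else time.time()
    window.append((now, value))
    while window and window[0][0] < now - WINDOW_SECONDS:
        window.popleft()

def window_stats():
    if not window:
        return None
    values = [v for _, v in window]
    return min(values), max(values), sum(values) / len(values)

add_reading(68.4)
add_reading(69.1)
add_reading(67.9)
print(window_stats())  # (67.9, 69.1, 68.46...)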
Filtering allows consumers to receive only the messages that they are interested in. This reduces the amount of data that is needed to be processed and transmitted, which helps the system scale. Messages can be filtered at multiple stages: source, ingestion, stream storage, stream processing, and in the consumer stage. In general, it is best to filter messages as early in the five-stage model as possible as it reduces compute and storage requirements in all subsequent stages. Filtering is determined by the message contents, the source, or the destination. For instance, the producer can send different types of messages to different streams.
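As a trivial sketch of filtering at the earliest possible stage, the hypothetical producer below forwards only readings that cross a threshold, so the later stages never see the rest:

TEMP_ALERT_THRESHOLD = 80.0  # hypothetical threshold

def maybe_send(reading, send):
    """Producer-side filter: forward only readings worth processing downstream."""
    if reading["temp"] >= TEMP_ALERT_THRESHOLD:
        send(reading)  # e.g. put the record onto an "alerts" stream

maybe_send({"deviceid": "device001", "temp": 68.4}, print)  # filtered out
maybe_send({"deviceid": "device002", "temp": 91.2}, print)  # forwarded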
Now that we've covered the core concepts, let's see them applied in some example use cases.
Data streams are essential for supporting a wide variety of workloads. This section will go into detail on how data streams can be used for near real-time monitoring of applications through log aggregation, how they support bursty IoT workloads, how they quickly insert recommendations into web applications, and how they enable machine learning on video. The following diagram shows the data flow of these workloads:
Figure 1.4 – Examples of data stream applications
While these workloads have different performance requirements and scale, the fundamental architecture is the same – producing and consuming messages. Now, let's look at an example of real-time monitoring.
Near real-time monitoring of applications and systems can be used to identify usage patterns, troubleshoot operation events, detect and monitor security incidents, and ensure compliance. Log events are generated on multiple systems and are pushed to a centralized system for analysis. Messaging systems enable this by decoupling the log processing and the analysis systems. In general, for log analysis, there are two different systems consuming the messages: one for near real-time analysis and one for larger historical batch analysis. The near real-time analysis system, often Elasticsearch, contains only fresh data as specified by a data retention policy, and might only hold an hour, a day, or a week's worth of information. The historical system is often an Apache Spark cluster processing data in a data lake (data stored in S3).
Log events are generated in real time and are pushed to the messaging system. The two consumers access the data and perform ETL operations on the data to convert it into the appropriate format for further analysis. For instance, an Apache Commons Logging format can be converted to JSON for insertion into Elasticsearch. The message broker simplifies the system by providing a clear boundary between the log collection and log analysis systems. Since it's designed in a highly available manner, it can cache events if the log analysis system goes down.
There are many sources of log events; two common ones are CloudWatch Logs and agents that can be installed on a machine, for example, Kinesis Agent. CloudWatch is an AWS service that collects logs, metrics, and events from AWS resources and user applications. The logs are sent to streams based on subscriptions and subscription filters that define patterns to determine which log events should be sent. The events are Base64-encoded and compressed with gzip. Agents monitor sets of files and stream events normally delineated by a new line (\n) character.
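For reference, a consumer receiving such a subscription record could unpack it roughly as follows (a sketch; the payload layout follows the CloudWatch Logs subscription format, and the function names are our own):

import base64
import gzip
import json

def decode_log_record(data):
    """Unpack a CloudWatch Logs subscription record: Base64 -> gzip -> JSON.
    (When reading via the AWS SDK, the Base64 step may already have been
    handled, in which case data is just the gzip-compressed bytes.)"""
    if isinstance(data, str):
        data = base64.b64decode(data)
    return json.loads(gzip.decompress(data))

def extract_log_events(payload):
    """Yield (timestamp, message) pairs ready for transformation, for example
    into JSON documents that can be indexed in Elasticsearch."""
    for event in payload.get("logEvents", []):
        yield event["timestamp"], event["message"]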
By bringing all