Apache Kafka is a great open source platform for handling your real-time data pipeline to ensure high-speed filtering and pattern matching on the fly. In this book, you will learn how to use Apache Kafka for efficient processing of distributed applications and will become familiar with solving everyday problems in fast data and processing pipelines.
This book focuses on programming rather than the configuration management of Kafka clusters or DevOps. It starts off with the installation and setting up the development environment, before quickly moving on to performing fundamental messaging operations such as validation and enrichment.
Here you will learn about message composition with the pure Kafka API and Kafka Streams. You will look into the transformation of messages in different formats, such as text, binary, XML, JSON, and AVRO. Next, you will learn how to expose the schemas contained in Kafka with the Schema Registry. You will then learn how to work with all relevant connectors with Kafka Connect. While working with Kafka Streams, you will perform various interesting operations on streams, such as windowing, joins, and aggregations. Finally, through KSQL, you will learn how to retrieve, insert, modify, and delete data streams, and how to manipulate watermarks and windows.
Page count: 178
Year of publication: 2018
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Siddharth Mandal
Content Development Editor: Smit Carvalho
Technical Editor: Niral Almeida
Copy Editor: Safis Editing
Project Coordinator: Pragati Shukla
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Jason Monteiro
Production Coordinator: Deepika Naik
First published: December 2018
Production reference: 1261218
Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.
ISBN 978-1-78899-782-9
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Raúl Estrada has been a programmer since 1996 and a Java developer since 2001. He loves all topics related to computer science. With more than 15 years of experience in high-availability and enterprise software, he has been designing and implementing architectures since 2003. His specialization is in systems integration, and he mainly participates in projects related to the financial sector. He has been an enterprise architect for BEA Systems and Oracle Inc., but he also enjoys web, mobile, and game programming. Raúl is a supporter of free software and enjoys experimenting with new technologies, frameworks, languages, and methods.
Raúl is the author of other Packt Publishing titles, such as Fast Data Processing Systems with SMACK and Apache Kafka Cookbook.
Isaac RuizGuerra has been a Java programmer since 2001 and an IT consultant since 2003. Isaac is specialized in systems integration, and has participated in projects to do with the financial sector. Isaac has worked mainly on the backend side, using languages such as Java, Python, and Elixir. For more than 10 years, he has worked with different application servers for the Java world, including JBoss, Glassfish, and WLS. Isaac is currently interested in topics such as microservices, cloud native, and serverless. He is a regular lecturer, mainly at conferences related to the JVM. Isaac is interested in the formation of interdisciplinary and high-performance teams.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title page
Copyright and Credits
Apache Kafka Quick Start Guide
Dedication
About Packt
Why subscribe?
Packt.com
Contributors
About the author
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Reviews
Configuring Kafka
Kafka in a nutshell
Kafka installation
Kafka installation on Linux
Kafka installation on macOS
Confluent Platform installation
Running Kafka
Running Confluent Platform
Running Kafka brokers
Running Kafka topics
A command-line message producer
A command-line message consumer
Using kafkacat
Summary
Message Validation
Enterprise service bus in a nutshell
Event modeling
Setting up the project
Reading from Kafka
Writing to Kafka
Running the processing engine
Coding a validator in Java
Running the validation
Summary
Message Enrichment
Extracting the geographic location
Enriching the messages
Extracting the currency price
Enriching with currency price
Running the engine
Extracting the weather data
Summary
Serialization
Kioto, a Kafka IoT company
Project setup
The constants
HealthCheck message
Java PlainProducer
Running the PlainProducer
Java plain consumer
Java PlainProcessor
Running the PlainProcessor
Custom serializer
Java CustomProducer
Running the CustomProducer
Custom deserializer
Java custom consumer
Java custom processor
Running the custom processor
Summary
Schema Registry
Avro in a nutshell
Defining the schema
Starting the Schema Registry
Using the Schema Registry
Registering a new version of a schema under a -value subject
Registering a new version of a schema under a -key subject
Registering an existing schema into a new subject
Listing all subjects
Fetching a schema by its global unique ID
Listing all schema versions registered under the healthchecks-value subject
Fetching version 1 of the schema registered under the healthchecks-value subject
Deleting version 1 of the schema registered under the healthchecks-value subject
Deleting the most recently registered schema under the healthchecks-value subject
Deleting all the schema versions registered under the healthchecks-value subject
Checking whether a schema is already registered under the healthchecks-key subject
Testing schema compatibility against the latest schema under the healthchecks-value subject
Getting the top-level compatibility configuration
Globally updating the compatibility requirements
Updating the compatibility requirements under the healthchecks-value subject
Java AvroProducer
Running the AvroProducer
Java AvroConsumer
Java AvroProcessor
Running the AvroProcessor
Summary
Kafka Streams
Kafka Streams in a nutshell
Project setup
Java PlainStreamsProcessor
Running the PlainStreamsProcessor
Scaling out with Kafka Streams
Java CustomStreamsProcessor
Running the CustomStreamsProcessor
Java AvroStreamsProcessor
Running the AvroStreamsProcessor
Late event processing
Basic scenario
Late event generation
Running the EventProducer
Kafka Streams processor
Running the Streams processor
Stream processor analysis
Summary
KSQL
KSQL in a nutshell
Running KSQL
Using the KSQL CLI
Processing data with KSQL
Writing to a topic
Summary
Kafka Connect
Kafka Connect in a nutshell
Project setup
Spark Streaming processor
Reading Kafka from Spark
Data conversion
Data processing
Writing to Kafka from Spark
Running the SparkProcessor
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Since 2011, Kafka has grown explosively. More than one third of Fortune 500 companies use Apache Kafka. These companies include travel companies, banks, insurance companies, and telecom companies.
Uber, Twitter, Netflix, Spotify, Blizzard, LinkedIn, and PayPal process their messages with Apache Kafka every day.
Today, Apache Kafka is used to collect data, do real-time data analysis, and perform real-time data streaming. Kafka is also used to feed events to Complex Event Processing (CEP) architectures, is deployed in microservice architectures, and is implemented in Internet of Things (IoT) systems.
In the realm of streaming, there are several competitors to Kafka Streams, including Apache Spark, Apache Flink, Akka Streams, Apache Pulsar, and Apache Beam. All of them compete to outperform Kafka. However, Apache Kafka has one key advantage over them all: its ease of use. Kafka is easy to implement and maintain, and its learning curve is not very steep.
This book is a practical quick start guide. It is focused on showing practical examples and does not get involved in theoretical explanations or discussions of Kafka's architecture. This book is a compendium of hands-on recipes, solutions to everyday problems faced by those implementing Apache Kafka.
This book is for data engineers, software developers, and data architects looking for a quick hands-on Kafka guide.
This guide is about programming; it is an introduction for those with no previous knowledge about Apache Kafka.
All the examples are written in Java 8; experience with Java 8 is the only requirement for following this guide.
Chapter 1, Configuring Kafka, explains the basics for getting started with Apache Kafka. It discusses how to install, configure, and run Kafka. It also discusses how to make basic operations with Kafka brokers and topics.
Chapter 2, Message Validation, explores how to program data validation for your enterprise service bus, covering how to filter messages from an input stream.
Chapter 3, Message Enrichment, looks at message enrichment, another important task for an enterprise service bus. Message enrichment is the process of incorporating additional information into the messages of your stream.
Chapter 4, Serialization, talks about how to build serializers and deserializers for writing, reading, or converting messages in binary, raw string, JSON, or AVRO formats.
Chapter 5, Schema Registry, covers how to validate, serialize, deserialize, and keep a history of versions of messages using the Kafka Schema Registry.
Chapter 6, Kafka Streams, explains how to obtain information about a group of messages – in other words, a message stream – and how to obtain additional information, such as that to do with the aggregation and composition of messages, using Kafka Streams.
Chapter 7, KSQL, talks about how to manipulate event streams without a single line of code using SQL over Kafka Streams.
Chapter 8, Kafka Connect, talks about other fast data processing tools and how to make a data processing pipeline with them in conjunction with Apache Kafka. Tools such as Apache Spark and Apache Beam are covered in this chapter.
The reader should have some experience of programming with Java 8.
The minimum configuration required for executing the recipes in this book is an Intel® Core i3 processor, 4 GB of RAM, and 128 GB of disk space. Linux or macOS is recommended, as Windows is not fully supported.
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
Log in or register at www.packt.com.
Select the SUPPORT tab.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Apache-Kafka-Quick-Start-Guide. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "The --topic parameter sets the name of the topic; in this case, amazingTopic."
A block of code is set as follows:
{
  "event": "CUSTOMER_CONSULTS_ETHPRICE",
  "customer": {
    "id": "14862768",
    "name": "Snowden, Edward",
    "ipAddress": "95.31.18.111"
  },
  "currency": {
    "name": "ethereum",
    "price": "RUB"
  },
  "timestamp": "2018-09-28T09:09:09Z"
}
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
dependencies {
    compile group: 'org.apache.kafka', name: 'kafka_2.12', version: '2.0.0'
    compile group: 'com.maxmind.geoip', name: 'geoip-api', version: '1.3.1'
    compile group: 'com.fasterxml.jackson.core', name: 'jackson-core', version: '2.9.7'
}
Any command-line input or output is written as follows:
> <confluent-path>/bin/kafka-topics.sh --list --zookeeper localhost:2181
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "To differentiate among them, the events on t1 have one stripe, the events on t2 have two stripes, and the events on t3 have three stripes."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
This chapter describes what Kafka is and the concepts related to this technology: brokers, topics, producers, and consumers. It also talks about how to build a simple producer and consumer from the command line, as well as how to install Confluent Platform. The information in this chapter is fundamental to the following chapters.
In this chapter, we will cover the following topics:
Kafka in a nutshell
Installing Kafka (Linux and macOS)
Installing the Confluent Platform
Running Kafka
Running Confluent Platform
Running Kafka brokers
Running Kafka topics
A command-line message producer
A command-line message consumer
Using kafkacat
Apache Kafka is an open source streaming platform. If you are reading this book, maybe you already know that Kafka scales very well in a horizontal way without compromising speed and efficiency.
The Kafka core is written in Scala, and Kafka Streams and KSQL are written in Java. A Kafka server can run in several operating systems: Unix, Linux, macOS, and even Windows. As it usually runs in production on Linux servers, the examples in this book are designed to run on Linux environments. The examples in this book also consider bash environment usage.
This chapter explains how to install, configure, and run Kafka. As this is a Quick Start Guide, it does not cover Kafka's theoretical details. At the moment, it is appropriate to mention these three points:
Kafka is a service bus: To connect heterogeneous applications, we need to implement a message publication mechanism to send and receive messages among them. A message router is known as a message broker. Kafka is a message broker, a solution for routing messages among clients in a quick way.
Kafka architecture has two directives: The first is to not block the producers (in order to deal with back pressure). The second is to isolate producers and consumers. The producers should not know who their consumers are; hence, Kafka follows the dumb broker and smart clients model.
Kafka is a real-time messaging system: Moreover, Kafka is a software solution with a publish-subscribe model: open source, distributed, partitioned, replicated, and commit-log-based.
There are some concepts and nomenclature in Apache Kafka:
Cluster: This is a set of Kafka brokers.
Zookeeper: This is a cluster coordinator, a tool with different services that are part of the Apache ecosystem.
Broker: This is a Kafka server, also the Kafka server process itself.
Topic: This is a queue (that has log partitions); a broker can run several topics.
Offset: This is an identifier for each message.
Partition: This is an immutable and ordered sequence of records continually appended to a structured commit log.
Producer: This is the program that publishes data to topics.
Consumer: This is the program that processes data from the topics.
Retention period: This is the time to keep messages available for consumption.
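The relationship between topics, partitions, offsets, producers, and consumers can be pictured with a toy in-memory model. The sketch below is illustrative only: it is not the Kafka API, and the class and method names (ToyTopic, send, read) are invented for this example. Real Kafka also uses a different partitioner (murmur2) than the simple hash shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a topic: a fixed set of partitions, each an
// append-only list of records. An offset is simply a record's
// position in its partition.
class ToyTopic {
    private final List<List<String>> partitions = new ArrayList<>();

    ToyTopic(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // A producer appends a record; the key decides the partition,
    // so records with the same key keep their relative order.
    long send(String key, String value) {
        List<String> log = partitions.get(partitionFor(key));
        log.add(value);
        return log.size() - 1; // the record's offset in its partition
    }

    // A consumer reads a record by (partition, offset); reading does
    // not remove it -- it stays until the retention period expires.
    String read(int partition, long offset) {
        return partitions.get(partition).get((int) offset);
    }

    int partitionFor(String key) {
        // Simplified partitioner for illustration only.
        return Math.abs(key.hashCode()) % partitions.size();
    }
}
```

Note that offsets are per partition, not per topic: two records in different partitions can share the same offset, which is why a consumer always tracks (partition, offset) pairs.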
In Kafka, there are three types of clusters:
Single node–single broker
Single node–multiple broker
Multiple node–multiple broker
In Kafka, there are three (and just three) ways to deliver messages:
Never redelivered: The messages may be lost because, once delivered, they are not sent again.
May be redelivered: The messages are never lost because, if a message is not received, it can be sent again.
Delivered once: The message is delivered exactly once. This is the most difficult form of delivery; since the message is sent once and never redelivered, it implies that there is zero loss of any message.
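These three modes correspond to what the Kafka documentation calls at-most-once, at-least-once, and exactly-once delivery. The difference between the first two can be sketched with a toy channel that drops a delivery attempt. This is illustrative only; nothing here is Kafka API, and the class names are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not Kafka code): a channel that drops the
// first delivery attempt, used to contrast two delivery strategies.
class FlakyChannel {
    private boolean dropNext = true;
    final List<String> received = new ArrayList<>();

    // Returns true only when the message actually arrived.
    boolean deliver(String msg) {
        if (dropNext) {
            dropNext = false;
            return false; // this attempt is lost in transit
        }
        received.add(msg);
        return true;
    }

    // "Never redelivered" (at-most-once): one attempt, losses possible.
    static void atMostOnce(FlakyChannel ch, String msg) {
        ch.deliver(msg);
    }

    // "May be redelivered" (at-least-once): retry until acknowledged,
    // so nothing is lost; in a real system, duplicates can appear
    // when an acknowledgment (rather than the message) is lost.
    static void atLeastOnce(FlakyChannel ch, String msg) {
        while (!ch.deliver(msg)) {
            // retry
        }
    }
}
```

Exactly-once is the hard case precisely because it must combine the retry behavior of the second strategy with deduplication, so that a redelivered message is never processed twice.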
The message log can be compacted in two ways:
Coarse-grained: Log compacted by time
Fine-grained: Log compacted by message
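Fine-grained compaction is what the Kafka documentation calls log compaction: for each message key, only the most recent value is retained, so the compacted log remains a complete snapshot of the final state. A minimal sketch of the idea (not Kafka code; the class name is invented):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of key-based log compaction: replay the log in
// order, letting later records overwrite earlier ones with the same
// key. What survives is the latest value per key.
class LogCompaction {
    static Map<String, String> compact(List<SimpleEntry<String, String>> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (SimpleEntry<String, String> record : log) {
            latest.put(record.getKey(), record.getValue()); // later wins
        }
        return latest;
    }
}
```

This is why a compacted topic can serve as a changelog: a consumer that reads it from the beginning reconstructs the current value of every key without needing the full history.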
There are three ways to install a Kafka environment:
Downloading the executable files
Using brew (in macOS) or yum (in Linux)
Installing Confluent Platform
For all three ways, the first step is to install Java; we need Java 8. Download and install the latest JDK 8 from Oracle's website:
http://www.oracle.com/technetwork/java/javase/downloads/index.html
At the time of writing, the latest Java 8 JDK version is 8u191.
For Linux users, follow these steps:
Change the file mode to executable as follows:
> chmod +x jdk-8u191-linux-x64.rpm
Go to the directory in which you want to install Java:
> cd <directory path>
Run the rpm installer with the following command:
> rpm -ivh jdk-8u191-linux-x64.rpm
Add the JAVA_HOME variable to your environment. The following command writes the JAVA_HOME environment variable to the /etc/profile file:
> echo "export JAVA_HOME=/usr/java/jdk1.8.0_191" >> /etc/profile
Validate the Java installation as follows:
> java -version java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
At the time of writing, the latest Scala version is 2.12.6. To install Scala in Linux, perform the following steps:
Download the latest Scala binary from http://www.scala-lang.org/download
Extract the downloaded file, scala-2.12.6.tgz, as follows:
> tar xzf scala-2.12.6.tgz
Add the SCALA_HOME variable to your environment as follows:
> export SCALA_HOME=/opt/scala
Add the Scala bin directory to your PATH environment variable as follows:
> export PATH=$PATH:$SCALA_HOME/bin
To validate the Scala installation, do the following:
> scala -version Scala code runner version 2.12.6 -- Copyright 2002-2018, LAMP/EPFL and Lightbend, Inc.