Designing and writing a real-time streaming application with Apache Apex
This book assumes knowledge of application development with Java and familiarity with distributed systems. Familiarity with other real-time streaming frameworks is not required, but some practical experience with other big data processing utilities might be helpful.
Apache Apex is a next-generation stream processing framework designed to operate on data at large scale, with minimum latency, maximum reliability, and strict correctness guarantees.
Half of the book consists of Apex applications, showing you key aspects of data processing pipelines such as connectors for sources and sinks, and common data transformations. The other half of the book is evenly split into explaining the Apex framework, and tuning, testing, and scaling Apex applications.
Much of our economic world depends on growing streams of data, such as social media feeds, financial records, data from mobile devices, sensors and machines (the Internet of Things - IoT). The projects in the book show how to process such streams to gain valuable, timely, and actionable insights. Traditional use cases, such as ETL, that currently consume a significant chunk of data engineering resources are also covered.
The final chapter shows you future possibilities emerging in the streaming space, and how Apache Apex can contribute to them.
This book is divided into two major parts: first it explains what Apex is, what its relevant parts are, and how to write well-built Apex applications. The second part is entirely application-driven, walking you through Apex applications of increasing complexity.
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: November 2017
Production reference: 2061217
ISBN 978-1-78829-640-3
www.packtpub.com
Authors
Thomas Weise
Munagala V. Ramanath
David Yan
Kenneth Knowles
Copy Editors
Safis Editing
Tom Jacob
Reviewer
Ananth Gundabattula
Project Coordinator
Suzanne Coutinho
Acquisition Editor
Frank Pohlmann
Indexer
Pratik Shirodkar
Content Development Editor
Monika Sangwan
Graphics
Jason Monteiro
Technical Editor
Joel D'Souza
Production Coordinator
Arvindkumar Gupta
Thomas Weise is the Apache Apex PMC Chair and cofounder at Atrato. Earlier, he worked at a number of other technology companies in the San Francisco Bay Area, including DataTorrent, where he was a cofounder of the Apex project. Thomas is also a committer to Apache Beam and has contributed to several other projects in the ecosystem. He has been working on distributed systems for 20 years and has been a speaker at international big data conferences. Thomas received the degree of Diplom-Informatiker (MSc in computer science) from TU Dresden, Germany. He can be reached on Twitter at @thweise.
Dr. Munagala V. Ramanath got his PhD in Computer Science from the University of Wisconsin, USA and an MSc in Mathematics from Carleton University, Ottawa, Canada. After that, he taught Computer Science courses as Assistant/Associate Professor at the University of Western Ontario in Canada for a few years, before transitioning to the corporate sphere. Since then, he has worked as a senior software engineer at a number of technology companies in California including SeeBeyond, EMC, Sun Microsystems, DataTorrent, and Cloudera. He has published papers in peer reviewed journals in several areas including code optimization, graph theory, and image processing.
David Yan is based in the Silicon Valley, California. He is a senior software engineer at Google. Prior to Google, he worked at DataTorrent, Yahoo!, and the Jet Propulsion Laboratory. David holds a master of science in Computer Science from Stanford University and a bachelor of science in Electrical Engineering and Computer Science from the University of California at Berkeley.
Kenneth Knowles is a founding PMC member of Apache Beam. Kenn has been working on Google Cloud Dataflow—Google’s Beam backend—since 2014. Prior to that, he built backends for startups such as Cityspan, Inkling, and Dimagi. Kenn holds a PhD in Programming Language Theory from the University of California, Santa Cruz.
Ananth Gundabattula is a senior application architect in the Decisioning and Advanced Analytics architecture team at the Commonwealth Bank of Australia (CBA). Ananth holds a PhD in Computer Science, in the area of security, and is interested in all things data, including low-latency distributed processing systems, machine learning, and data engineering. He holds three patents granted by the USPTO and has one application pending.
Prior to joining CBA, Ananth was an architect at Threatmetrix and a member of the core team that scaled the Threatmetrix architecture to 100 million transactions per day, running at very low latencies using Cassandra, ZooKeeper, and Kafka. He also migrated the Threatmetrix data warehouse to a next-generation architecture based on Hadoop and Impala. Before Threatmetrix, he worked for IBM Software Labs and IBM CIO Labs, enabling some of the first IBM CIO projects to onboard the HBase, Hadoop, and Mahout stack.
Ananth is a committer for Apache Apex and is currently working on the next-generation architectures for the CBA fraud platform and the Advanced Analytics Omnia platform at CBA.
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1788296400.
If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Introduction to Apex
Unbounded data and continuous processing
Stream processing
Stream processing systems
What is Apex and why is it important?
Use cases and case studies
Real-time insights for Advertising Tech (PubMatic)
Industrial IoT applications (GE)
Real-time threat detection (Capital One)
Silver Spring Networks (SSN)
Application Model and API
Directed Acyclic Graph (DAG)
Apex DAG Java API
High-level Stream Java API
SQL
JSON
Windowing and time
Value proposition of Apex
Low latency and stateful processing
Native streaming versus micro-batch
Performance
Where Apex excels
Where Apex is not suitable
Summary
Getting Started with Application Development
Development process and methodology
Setting up the development environment
Creating a new Maven project
Application specifications
Custom operator development
The Apex operator model
CheckpointListener/CheckpointNotificationListener
ActivationListener
IdleTimeHandler
Application configuration
Testing in the IDE
Writing the integration test
Running the application on YARN
Execution layer components
Installing Apex Docker sandbox
Running the application
Working on the cluster
YARN web UI
Apex CLI
Logging
Dynamically adjusting logging levels
Summary
The Apex Library
An overview of the library
Integrations
Apache Kafka
Kafka input
Kafka output
Other streaming integrations
JMS (ActiveMQ, SQS, and so on)
Kinesis streams
Files
File input
File splitter and block reader
File writer
Databases
JDBC input
JDBC output
Other databases
Transformations
Parser
Filter
Enrichment
Map transform
Custom functions
Windowed transformations
Windowing
Global Window
Time Windows
Sliding Time Windows
Session Windows
Window propagation
State
Accumulation
Accumulation Mode
State storage
Watermarks
Allowed lateness
Triggering
Merging of streams
The windowing example
Dedup
Join
State Management
Summary
Scalability, Low Latency, and Performance
Partitioning and how it works
Elasticity
Partitioning toolkit
Configuring and triggering partitioning
StreamCodec
Unifier
Custom dynamic partitioning
Performance optimizations
Affinity and anti-affinity
Low-latency versus throughput
Sample application for dynamic partitioning
Performance – other aspects for custom operators
Summary
Fault Tolerance and Reliability
Distributed systems need to be resilient
Fault-tolerance components and mechanism in Apex
Checkpointing
When to checkpoint
How to checkpoint
What to checkpoint
Incremental state saving
Incremental recovery
Processing guarantees
Example – exactly-once counting
The exactly-once output to JDBC
Summary
Example Project – Real-Time Aggregation and Visualization
Streaming ETL and beyond
The application pattern in a real-world use case
Analyzing Twitter feed
Top Hashtags
TweetStats
Running the application
Configuring Twitter API access
Enabling WebSocket output
The Pub/Sub server
Grafana visualization
Installing Grafana
Installing Grafana Simple JSON Datasource
The Grafana Pub/Sub adapter server
Setting up the dashboard
Summary
Example Project – Real-Time Ride Service Data Processing
The goal
Datasource
The pipeline
Simulation of a real-time feed using historical data
Parsing the data
Looking up of the zip code and preparing for the windowing operation
Windowed operator configuration
Serving the data with WebSocket
Running the application
Running the application on GCP Dataproc
Summary
Example Project – ETL Using SQL
The application pipeline
Building and running the application
Application configuration
The application code
Partitioning
Application testing
Understanding application logs
Calcite integration
Summary
Introduction to Apache Beam
Introduction to Apache Beam
Beam concepts
Pipelines, PTransforms, and PCollections
ParDo – elementwise computation
GroupByKey/CombinePerKey – aggregation across elements
Windowing, watermarks, and triggering in Beam
Windowing in Beam
Watermarks in Beam
Triggering in Beam
Advanced topic – stateful ParDo
WordCount in Apache Beam
Setting up your pipeline
Reading the works of Shakespeare in parallel
Splitting each line on spaces
Eliminating empty strings
Counting the occurrences of each word
Format your results
Writing to a sharded text file in parallel
Testing the pipeline at small scale with DirectRunner
Running Apache Beam WordCount on Apache Apex
Summary
The Future of Stream Processing
Lower barrier for building streaming pipelines
Visual development tools
Streaming SQL
Better programming API
Bridging the gap between data science and engineering
Machine learning integration
State management
State query and data consistency
Containerized infrastructure
Management tools
Summary
With business demand for faster insights derived from a growing number of information sources, stream data processing technology is gaining popularity. Open-source-based products have become the prevailing implementation choice, and there has been a shift from the early MapReduce-based batch processing paradigm to newer, more expressive frameworks designed to process data as streams with minimal latency, high reliability, and accuracy guarantees.
Apache Apex is a large-scale, stream-first big data processing framework that can be used to build low-latency, high-throughput, complex analytics pipelines that execute on clusters. Development of Apex started in 2012, and it is continuously improving; today it is used in production by a number of companies for real-time and batch processing at scale.
The big data landscape has a wide array of technology components and choices, and it remains a challenge for end users to piece everything together to be successful with their big data projects and realize value from their technology investments.
This book will focus on how to apply Apex to big data processing use cases. It is written by experts in the area, including key contributors of Apache Apex who built the platform and have extensive experience working in the field with users who run Apex in their enterprise solutions. This book is an instructional and example-driven guide on how to build Apex applications, aimed at developers and hands-on enterprise architects. It will help identify use cases, the building blocks needed to put together solutions, and the process of implementing, testing, and tuning applications for production. Fully functional example projects are provided to cover key aspects of data processing pipelines, such as connectors for sources and sinks and common transformations. These projects can also be used as starting points for custom development.
To connect with the Apache Apex project, please visit the website (http://apex.apache.org/), subscribe to the mailing lists mentioned there, or follow @ApacheApex on Twitter (https://twitter.com/apacheapex) or on SlideShare (http://www.slideshare.net/ApacheApex/presentations).
Chapter 1, Introduction to Apex, tells us how processing of data-in-motion is realized by Apache Apex. It also gives us a few Apex stream processing use cases and applications, and talks about their value propositions.
Chapter 2, Getting Started with Application Development, shows us how the Apex development process works from project creation to application deployment; the result is a simple working application.
Chapter 3, The Apex Library, talks about the Malhar library, which contains functional building blocks for writing real-world Apex applications.
Chapter 4, Scalability, Low Latency, and Performance, teaches us how Apex can scale and parallelize processing, how to achieve dynamic scaling and better resource allocation in general, and why low latency and high throughput are both achievable without trading one off against the other. Operator partitioning and related techniques are central to this endeavor and are shown in practice in a sample application.
Chapter 5, Fault Tolerance and Reliability, explores the implementation of fault-tolerance and reliability in Apex including exactly-once semantics via distributed checkpointing and effective state management.
Chapter 6, Example Project – Real-Time Aggregation and Visualization, puts together all the building blocks to show a streaming analytics project and how to integrate it with a UI and existing infrastructure.
Chapter 7, Example Project – Real-Time Ride Service Data Processing, relies on a historical dataset to simulate a real-time ride service data stream. We are using event time and out-of-order processing, in particular, to build a simple analytics application that can serve as a template for more complicated event stream data pipelines.
Chapter 8, Example Project – ETL Using SQL, shows how to build a classic ETL application using Apex and Apache Calcite.
Chapter 9, Introduction to Apache Beam, introduces the Beam stream processing framework and an approach that allows a stream application engine such as Apex to be swapped in if needed.
Chapter 10, The Future of Stream Processing, looks at the road ahead for Apex and stream processing in general. We are going to examine the role of machine learning, as well as the role of SQL and why it is important for streaming.
Apex applications can be built and run locally on the user’s development machine via a properly written JUnit test. To do this, the user need only ensure that recent versions of the following software packages are present:
Java JDK (please note that the JRE alone is not adequate)
Maven build system
Git revision control system (optional)
A Java IDE such as Eclipse or IntelliJ (optional)
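As a concrete illustration of the JUnit-based local run mentioned above, the following minimal sketch launches an application in Apex local mode from a test. It assumes an Application class implementing StreamingApplication exists in the project (the Apex Maven archetype generates one) and that application properties are packaged under META-INF/properties.xml; adjust these names to your project.

import org.apache.hadoop.conf.Configuration;
import org.junit.Test;

import com.datatorrent.api.LocalMode;

public class ApplicationTest
{
  @Test
  public void testApplication() throws Exception
  {
    // Load the application configuration packaged with the project (assumed path)
    Configuration conf = new Configuration(false);
    conf.addResource(this.getClass().getResourceAsStream("/META-INF/properties.xml"));

    // Build the DAG and run it in-process for roughly 10 seconds
    LocalMode lma = LocalMode.newInstance();
    lma.prepareDAG(new Application(), conf);
    LocalMode.Controller lc = lma.getController();
    lc.run(10000);
  }
}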
To run Apex applications on a cluster, one needs a cluster with Hadoop installed and a client to launch them. This client needs to be installed on the edge node (sometimes referred to as the gateway node or the client node); there is no need to install anything on the entire cluster.
There are several options to install the client, and some of them are listed on the Apex download page: http://apex.apache.org/downloads.html.
Without an existing Hadoop cluster, an easy way to get started for experimentation is a sandbox VM that already has a single node cluster configured (sandbox VMs are available from Hadoop vendors, as docker images and so on).
This book is a practical and example-oriented guide on how to use Apache Apex to successfully implement solutions for real-world big data problems. The book assumes knowledge of application development with Java and familiarity with distributed systems. It does not require prior knowledge of Apex or stream processing, although knowledge of other big data processing frameworks is helpful.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an email to [email protected], and mention the book title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
Log in or register to our website using your email address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Apache-Apex. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/LearningApacheApex_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
You can contact us at [email protected] if you are having a problem with any aspect of the book, and we will do our best to address it.
The world is producing data at unprecedented levels, with a rapidly growing number of mobile devices, sensors, industrial machines, financial transactions, web logs, and so on. Often, the streams of data generated by these sources can offer valuable insights if processed quickly and reliably, and companies are finding it increasingly important to take action on this data-in-motion in order to remain competitive. MapReduce and Apache Hadoop were among the first technologies to enable processing of very large datasets on clusters of commodity hardware. The prevailing paradigm at the time was batch processing, which evolved from MapReduce's heavy reliance on disk I/O to Apache Spark's more efficient, memory-based approach.
Still, the downside of batch processing systems is that they accumulate data into batches, sometimes over hours, and cannot address use cases that require a short time to insight for continuous data in motion. Such requirements can be handled by newer stream processing systems, which can process data in real time, sometimes with latency as low as a few milliseconds. Apache Storm was the first ecosystem project to offer this capability, albeit with prohibitive trade-offs such as reliability versus latency. Today, there are newer and production-ready frameworks that don't force the user to make such choices. Rather, they enable low latency, high throughput, reliability, and a unified architecture that can be applied to both streaming and batch use cases. This book will introduce Apache Apex, a next-generation platform for processing data in motion.
In this chapter, we will cover the following topics:
Unbounded data and continuous processing
Use cases and case studies
Application Model and API
Value proposition of Apex
Datasets can be classified as unbounded or bounded. Bounded data is finite; it has a beginning and an end. Unbounded data is an ever-growing, essentially infinite dataset. The distinction is independent of how the data is processed. Often, unbounded data is equated to stream processing and bounded data to batch processing, but this is starting to change. We will see how state-of-the-art stream processors, such as Apache Apex, can be used to process (and are very capable of processing) both unbounded and bounded data, and there is no need for a batch processing system just because the dataset happens to be finite.
Most big datasets (high volume) that are eventually processed by big data systems are unbounded. There is a rapidly increasing volume of such infinite data from sources such as IoT sensors (such as industrial gauge sensors, automobile data ports, connected home, and quantified self), stock markets and financial transactions, telecommunications towers and satellites, and so on. At the same time, the legacy processing and storage systems are either nearing performance and capacity limits, or total cost of ownership (TCO) is becoming prohibitive.
Businesses need to convert the available data into meaningful insights and make data-driven, real-time decisions to remain competitive.
Organizations are increasingly relying on very fast processing (high velocity), as the value of data diminishes as it ages:
How were these unbounded datasets processed before streaming architectures existed?
To be consumable by a batch processor, they had to be divided into bounded data, often at intervals of hours. Before processing could begin, the earliest events would wait for a long time for their batch to be ready. At the time of processing, data would already be old and less valuable.
Stream processing means processing event by event, as soon as it is available. Because there is no waiting for more input after an event arrives, there is no artificially added latency (unlike with batching). This is important for real-time use cases, where information should be processed and results available with minimum latency or delay. However, stream processing is not limited to real-time data. We will see there are benefits to applying this continuous processing in a uniform manner to historical data as well.
Consider data that is stored in a file. By reading the file line by line and processing each line as soon as it is read, subsequent processing steps can be performed while the file is still being read, instead of waiting for the entire input to be read before initiating the next stage. Stream processing is a pipeline, and each item can be acted upon immediately. Apart from low latency, this can also lead to even resource consumption (memory, CPU, network) and steady (versus bursty) throughput, as long as the operations performed don't inherently require blocking:
Data flows through the pipeline as individual events, and all processing steps are active at the same time. In a distributed system, operations are performed on different nodes and data flows through the system, allowing for parallelism and high throughput. Processing is decentralized and without inherent bottlenecks, in contrast to architectures that attempt to move processing to where the data resides.
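As a small, framework-independent illustration of this pipelining idea, the following Java sketch hands each line to the next stage as soon as it is read, instead of collecting the whole file first. The file path and the downstream step are hypothetical placeholders.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.Consumer;

public class LineByLine
{
  /**
   * Reads the file one line at a time and hands each line to the next stage
   * immediately, so downstream work overlaps with reading the input.
   */
  static void process(String path, Consumer<String> nextStage) throws IOException
  {
    try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
      String line;
      while ((line = reader.readLine()) != null) {
        nextStage.accept(line); // acted upon as soon as it is available
      }
    }
  }

  public static void main(String[] args) throws IOException
  {
    // Hypothetical input; the "next stage" here simply prints the line length
    process("/tmp/input.txt", line -> System.out.println(line.length()));
  }
}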
Stream processing is a natural fit for how events occur in the real world. Sources generate data continuously (mobile devices, financial transactions, web traffic, sensors, and so on). It therefore makes sense to also process them that way instead of artificially breaking the processing into batches (or micro-batches).
The meaning of real time, or time for fast decision making, varies significantly between businesses. Some use cases, such as online fraud detection, may require processing to complete within milliseconds, but for others multiple seconds or even minutes might be sufficiently fast. In any case, the underlying platform needs to be equipped for fast and correct low-latency processing.
Streaming applications can process data fast, with low latency. Stream processing has gained popularity along with the growing demand for faster processing of current data, but it is not a synonym for real-time processing. Input data does not need to arrive in real time: older data can also be processed as a stream (for example, by reading from a file), and results are not always emitted in real time either. Stream processing can also perform operations such as sum, average, or top that span multiple events before a result becomes available.
To perform such operations, the stream needs to be sliced at temporal boundaries. This is called windowing. It demarcates finite datasets for computations. All data belonging to a window needs to be observed before a result can be emitted and windowing provides these boundaries. There are different strategies to define such windows over a data stream, and these will be covered in Chapter 3, The Apex Library:
In the preceding diagram we see the sum of incoming readings computed over tumbling (non-overlapping) and sliding (overlapping) windows. At the end of each window, the result is emitted.
With windowing, the final result of an operation for a given window is only known after all its data elements are processed. However, many windowed operations still benefit from event-by-event arrival of data and incremental computation. Windowing doesn't always mean that processing can only start once all input has arrived. In our example, the sum can be updated whenever the next event arrives, rather than storing all individual events and deferring the computation until the end of the window. Sometimes, even the intermediate result of a windowed computation is of interest and can be made available for downstream consumption and subsequently refined with the final result.
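To make the tumbling versus sliding distinction concrete, here is a small, framework-independent Java sketch (the windowing APIs provided by Apex are covered in Chapter 3, The Apex Library). The Reading class, timestamps, and window sizes are illustrative only; a streaming engine would maintain each window's sum incrementally as events arrive rather than scanning stored events.

import java.util.ArrayList;
import java.util.List;

public class WindowedSums
{
  /** A reading with an event timestamp (in seconds) and a numeric value. */
  static class Reading
  {
    final long timestampSec;
    final int value;

    Reading(long timestampSec, int value)
    {
      this.timestampSec = timestampSec;
      this.value = value;
    }
  }

  /** Sums readings over tumbling (non-overlapping) windows of the given size. */
  static List<Integer> tumblingSums(List<Reading> readings, long windowSizeSec, long endSec)
  {
    List<Integer> sums = new ArrayList<>();
    for (long start = 0; start < endSec; start += windowSizeSec) {
      sums.add(sumBetween(readings, start, start + windowSizeSec));
    }
    return sums;
  }

  /** Sums readings over sliding (overlapping) windows: same size, advanced by a smaller slide interval. */
  static List<Integer> slidingSums(List<Reading> readings, long windowSizeSec, long slideSec, long endSec)
  {
    List<Integer> sums = new ArrayList<>();
    for (long start = 0; start + windowSizeSec <= endSec; start += slideSec) {
      sums.add(sumBetween(readings, start, start + windowSizeSec));
    }
    return sums;
  }

  /** Sum of all readings whose timestamp falls in [fromSec, toSec). */
  private static int sumBetween(List<Reading> readings, long fromSec, long toSec)
  {
    int sum = 0;
    for (Reading r : readings) {
      if (r.timestampSec >= fromSec && r.timestampSec < toSec) {
        sum += r.value;
      }
    }
    return sum;
  }
}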
The first open source stream processing framework in the big data ecosystem was Apache Storm. Since then, several other Apache projects for stream processing have emerged. Next-generation, streaming-first architectures such as Apache Apex and Apache Flink come with stronger capabilities and are more broadly applicable. They are not only able to process data with low latency, but also provide for state management (for data that an operation may require across individual events), strong processing guarantees (correctness), fault tolerance, scalability, and high performance.
Users can now also expect such frameworks to come with comprehensive libraries of connectors, other building blocks, and APIs that make development of non-trivial streaming applications productive and allow for predictable project implementation cycles. Equally importantly, next-generation frameworks should cater to aspects such as operability, security, and the ability to run on shared infrastructure (multi-tenancy) to satisfy DevOps requirements for a successful production launch and uptime.
Streaming can do it all!
Limitations of early stream processing systems led to the so-called Lambda Architecture, essentially a parallel setup of stream and batch processing paths to obtain fast but potentially unreliable results through the stream processor and, in parallel, correct but slow results through a batch processing system like Apache Hadoop MapReduce:
The fast processing path in the preceding diagram can potentially produce incorrect results, hence the need to re-compute the same results in an alternate batch processing path. Correctness issues are caused by previous technical limitations of stream processing, not by the paradigm itself. For example, if events are processed multiple times or lost, it leads to double or under counting, which would be a problem for an application that relies on accurate results, for example, in the financial sector.
This setup requires the same functionality to be implemented with two different frameworks, as well as extra infrastructure and operational skills, and therefore results in a longer time to production and a higher total cost of ownership (TCO). With recent advances in stream processing, the Lambda Architecture is no longer necessary. Instead, a unified streaming architecture can be used for reliable processing in a much more cost-effective solution.
This approach based on a single system was outlined in 2014 as Kappa Architecture, and today there are several stream processing technology options, including Apache Apex, that support batch as a special case of streaming.
These newer systems are fault-tolerant, produce correct results, can achieve low latency as well as high throughput, and provide options for enterprise-grade operability and support. Potential users are no longer confronted with the shortcomings that previously justified a parallel batch processing system. We will later see how Apache Apex ensures correct processing, including its support for exactly-once processing.
Apache Apex (http://apex.apache.org/) is a stream processing platform and framework that can process data in-motion with low latency in a way that is highly scalable, highly performant, fault-tolerant, stateful, secure, distributed, and easily operable. Apex is written in Java, and Java is the primary application development environment.
In a typical streaming data pipeline, events from sources are stored and transported through a system such as Apache Kafka. The events are then processed by a stream processor and the results delivered to sinks, which are frequently databases, distributed file systems or message buses that link to downstream pipelines or services.
The following figure illustrates this:
In the end-to-end scenario depicted in this illustration, we see Apex as the processing component. The processing can be complex logic, with operations performed in sequence or in parallel in a distributed environment.
Apex runs on cluster infrastructure and currently supports and depends on Apache Hadoop, for which it was originally written. Support for Apache Mesos and other Docker-based infrastructure is on the roadmap.
Apex supports integration with many external systems out of the box, with connectors that are maintained and released by the project, including but not limited to the systems shown in the preceding diagram. The most frequently used connectors include Kafka and file readers. Frequently used sinks for the computed results are files and databases, though results can also be delivered directly to frontend systems for purposes such as real-time reporting directly from the Apex application, a use case that we will look at later.
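To give a first impression of what the processing component looks like in code, here is a minimal sketch of an Apex application assembled with the low-level DAG Java API (covered in detail in the following chapters). The random-number input operator is made up purely for illustration; ConsoleOutputOperator comes from the Malhar library, and the operator and stream names are assumptions to be adjusted for a real application.

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.common.util.BaseOperator;
import com.datatorrent.lib.io.ConsoleOutputOperator;

@ApplicationAnnotation(name = "MyFirstApplication")
public class Application implements StreamingApplication
{
  /** Illustrative source operator that emits random numbers. */
  public static class RandomNumberGenerator extends BaseOperator implements InputOperator
  {
    public final transient DefaultOutputPort<Double> out = new DefaultOutputPort<>();

    @Override
    public void emitTuples()
    {
      out.emit(Math.random());
    }
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Two operators connected by a stream: source -> console sink
    RandomNumberGenerator random = dag.addOperator("randomGenerator", new RandomNumberGenerator());
    ConsoleOutputOperator console = dag.addOperator("console", new ConsoleOutputOperator());
    dag.addStream("randomData", random.out, console.input);
  }
}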
Origin of Apex
The development of the Apex project started in 2012, with the original vision of enabling fast, performant, and scalable real-time processing on Hadoop. At that time, batch processing and MapReduce-based frameworks such as Apache Pig, Hive, or Cascading were still the standard options for processing data. Hadoop 2.x with YARN (Yet Another Resource Negotiator) was about to be released to pave the way for a number of new processing frameworks and paradigms to become available as alternatives to MapReduce. Due to its roots in the Hadoop ecosystem, Apex is very well integrated with YARN, and since its earliest days has offered features such as dynamic resource allocation for scaling and efficient recovery. It is also leading in high performance (with low latency), scalability and operability, which were focus areas from the very beginning.
The technology was donated to the Apache Software Foundation (ASF) in 2015, at which time it entered the Apache incubation program and graduated after only eight months to achieve Top Level Project status in April 2016.
Apex had its first production deployments in 2014 and today is used in mission-critical deployments in various industries for processing at scale. Use cases range from very low-latency processing in the real-time category to large-scale batch processing; a few examples will be discussed in the next section. Some of the organizations that use Apex can be found on the Powered by Apache Apex page on the Apex project web site at https://apex.apache.org/powered-by-apex.html.
Apex is a platform and framework on top of which specific applications (or solutions) are built.
As such, Apex is applicable to a wide range of use cases, including real-time machine learning model scoring, real-time ETL (Extract, Transform, and Load), predictive analytics, anomaly detection, real-time recommendations, and systems monitoring:
As organizations realize the financial and competitive importance of making data-driven decisions in real time, the number of industries and use cases will grow.
In the remainder of this section, we will discuss how companies in various industries are using Apex to solve important problems. These companies have presented their particular use cases, implementations and findings at conferences and meetups, and references to this source material are provided with each case study when available.
Companies in the advertising technology (AdTech) industry need to handle data volumes that are increasing at breakneck speed, along with customers demanding faster insights and analytical reporting.
PubMatic is a leading AdTech company providing marketing automation for publishers and is driven by data at a massive scale. On a daily basis, the company processes over 350 billion bids, serves over 40 billion ad impressions, and processes over 50 terabytes of data. Through real-time analytics, yield management, and workflow automation, PubMatic enables publishers to make smarter inventory decisions and improve revenue performance. Apex is used for real-time reporting and for the allocation engine.
In PubMatic's legacy batch processing system, there could be a delay of five hours to obtain updated data for their key metrics (revenues, impressions and clicks) and a delay of nine hours to obtain data for auction logs.
PubMatic decided to pursue a real-time streaming solution so that it could provide publishers, demand side platforms (DSPs), and agencies with actionable insights as close to the time of event generation as possible. PubMatic's streaming implementation had to achieve the following:
Ingest and analyze a high volume of clicks and views (200,000 events/sec) to help their advertising customers improve revenues
Utilize auction and client log data (22 TB/day) to report critical metrics for campaign monetization
Handle rapidly increasing network traffic with efficient utilization of resources
Provide a feedback loop to the ad server for making efficient ad serving decisions
This high volume data would need to be processed in real-time to derive actionable insights, such as campaign decisions and audience targeting.
PubMatic decided to implement its real-time streaming solution with Apex based on the following factors:
Time to value - the solution could be implemented within a short time frame
The Apex applications could run on PubMatic's existing Hadoop infrastructure
Apex had important connectors (files, Apache Kafka, and so on) available out of the box
Apex supported event time dimensional aggregations with real-time query capability
With the Apex-based solution, deployed to production in 2014, PubMatic's end-to-end latency to obtain updated data and metrics for their two use cases fell from hours to seconds. This enabled real-time visibility into successes and shortcomings of its campaigns and timely tuning of models to maximize successful auctions.
General Electric (GE) is a large, diversified company with business units in energy, power, aviation, transportation, healthcare, finance, and other industries. Many of these business units deal in industrial machinery and devices such as wind turbines, aviation components, locomotive components, healthcare imaging machines, and so on. Such industrial devices continually generate high volumes of real-time data, and GE decided to provide advanced IoT analytics solutions to the thousands of customers using these devices and sensors across its various business units and industries.
The GE Predix platform enables users to develop and execute Industrial IoT applications to gain real-time insights about their devices and their usage, as well as take actions based on these insights. Certain services offered by Predix are powered by Apache Apex. GE selected Apex for these services based on the following features (feature details will be covered later in this book):
High performance and distributed computing
Dynamic partitioning
Rich library of existing operators
Support for at-least-once, at-most-once, and exactly-once processing
Hadoop/YARN compatibility
Fault tolerance and platform stability
Ease of deployment and operability
Enterprise grade security
One Predix service that runs on Apex is the Time Series service, which leverages Apex due to its speed, scalability, high performance, and fault tolerance capabilities.
The service provides:
Efficient storage of time series data
Data indexing for quick retrieval
Industrial focused query modes
High availability and horizontal scalability
Millisecond data point precision
By running Apex, users of the Time Series service are able to:
Ingest and analyze high-volume, high-speed data from thousands of devices and sensors per customer in real time without data loss
Run predictive analytics to reduce costly maintenance and improve customer service
Conduct unified monitoring of all connected sensors and devices to minimize disruptions
Have fast application development cycles
