Learning Apache Apex

Thomas Weise

Description

Designing and writing a real-time streaming application with Apache Apex

About This Book

  • Get a clear, practical approach to real-time data processing
  • Program Apache Apex streaming applications
  • Explore Apex integration with the open source big data ecosystem

Who This Book Is For

This book assumes knowledge of application development with Java and familiarity with distributed systems. Familiarity with other real-time streaming frameworks is not required, but some practical experience with other big data processing utilities might be helpful.

What You Will Learn

  • Put together a functioning Apex application from scratch
  • Scale an Apex application and configure it for optimal performance
  • Understand how to deal with failures via the fault tolerance features of the platform
  • Use Apex via other frameworks such as Beam
  • Understand the DevOps implications of deploying Apex

In Detail

Apache Apex is a next-generation stream processing framework designed to operate on data at large scale, with minimum latency, maximum reliability, and strict correctness guarantees.

Half of the book consists of Apex applications, showing you key aspects of data processing pipelines such as connectors for sources and sinks, and common data transformations. The other half of the book is evenly split into explaining the Apex framework, and tuning, testing, and scaling Apex applications.

Much of our economic world depends on growing streams of data, such as social media feeds, financial records, data from mobile devices, sensors and machines (the Internet of Things - IoT). The projects in the book show how to process such streams to gain valuable, timely, and actionable insights. Traditional use cases, such as ETL, that currently consume a significant chunk of data engineering resources are also covered.

The final chapter shows you future possibilities emerging in the streaming space, and how Apache Apex can contribute to them.

Style and approach

This book is divided into two major parts: the first explains what Apex is, what its relevant parts are, and how to write well-built Apex applications; the second is entirely application-driven, walking you through Apex applications of increasing complexity.




Learning Apache Apex


Real-time streaming applications with Apex


Thomas Weise
Munagala V. Ramanath
David Yan
Kenneth Knowles


BIRMINGHAM - MUMBAI

Learning Apache Apex

 

Copyright © 2017 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: November 2017

 

Production reference: 2061217

 

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78829-640-3

 

www.packtpub.com

Credits

Authors

Thomas Weise

Munagala V. Ramanath

David Yan

Kenneth Knowles

Copy Editors

Safis Editing

Tom Jacob

 

Reviewer

Ananth Gundabattula

Project Coordinator

Suzanne Coutinho

Acquisition Editor

Frank Pohlmann

Indexer

Pratik Shirodkar

Content Development Editor

Monika Sangwan

Graphics

Jason Monteiro

Technical Editor

Joel D'Souza

Production Coordinator

Arvindkumar Gupta

About the Authors

Thomas Weise is the Apache Apex PMC Chair and cofounder at Atrato. Earlier, he worked at a number of other technology companies in the San Francisco Bay Area, including DataTorrent, where he was a cofounder of the Apex project. Thomas is also a committer to Apache Beam and has contributed to several more of the ecosystem projects. He has been working on distributed systems for 20 years and has been a speaker at international big data conferences. Thomas received the degree of Diplom-Informatiker (MSc in computer science) from TU Dresden, Germany. He can be reached on Twitter at: @thweise.

I'd like to thank the Apache Apex community and project cofounder, Chetan Narsude, for creating this advanced streaming platform. I'd also like to thank Dean Lockgaard, Ananth G., and my coauthors for their help with the book, and Packt editors for their support.

Dr. Munagala V. Ramanath got his PhD in Computer Science from the University of Wisconsin, USA, and an MSc in Mathematics from Carleton University, Ottawa, Canada. After that, he taught Computer Science courses as Assistant/Associate Professor at the University of Western Ontario in Canada for a few years, before transitioning to the corporate sphere. Since then, he has worked as a senior software engineer at a number of technology companies in California, including SeeBeyond, EMC, Sun Microsystems, DataTorrent, and Cloudera. He has published papers in peer-reviewed journals in several areas, including code optimization, graph theory, and image processing.

David Yan is based in Silicon Valley, California. He is a senior software engineer at Google. Prior to Google, he worked at DataTorrent, Yahoo!, and the Jet Propulsion Laboratory. David holds a Master of Science in Computer Science from Stanford University and a Bachelor of Science in Electrical Engineering and Computer Science from the University of California at Berkeley.

Kenneth Knowles is a founding PMC member of Apache Beam. Kenn has been working on Google Cloud Dataflow—Google’s Beam backend—since 2014. Prior to that, he built backends for startups such as Cityspan, Inkling, and Dimagi. Kenn holds a PhD in Programming Language Theory from the University of California, Santa Cruz.

About the Reviewer

Ananth Gundabattula is a senior application architect in the Decisioning and Advanced Analytics architecture team at the Commonwealth Bank of Australia (CBA). Ananth holds a PhD in Computer Science, with a focus on security, and is interested in all things data, including low-latency distributed processing systems, machine learning, and data engineering. He holds three patents granted by the USPTO and has one application pending.

Prior to joining CBA, Ananth was an architect at Threatmetrix and a member of the core team that scaled the Threatmetrix architecture to 100 million transactions per day, running at very low latencies using Cassandra, ZooKeeper, and Kafka. He also migrated the Threatmetrix data warehouse to a next-generation architecture based on Hadoop and Impala. Before Threatmetrix, he worked for the IBM software labs and IBM CIO labs, enabling some of the first IBM CIO projects to onboard the HBase, Hadoop, and Mahout stack.

Ananth is a committer for Apache Apex and is currently working on the next-generation architectures for the CBA fraud platform and the Advanced Analytics Omnia platform at CBA.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

 

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1788296400.

 

If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book 

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Introduction to Apex

Unbounded data and continuous processing

Stream processing

Stream processing systems

What is Apex and why is it important?

Use cases and case studies

Real-time insights for Advertising Tech (PubMatic)

Industrial IoT applications (GE)

Real-time threat detection (Capital One)

Silver Spring Networks (SSN)

Application Model and API

Directed Acyclic Graph (DAG)

Apex DAG Java API

High-level Stream Java API

SQL

JSON

Windowing and time

Value proposition of Apex

Low latency and stateful processing

Native streaming versus micro-batch

Performance

Where Apex excels

Where Apex is not suitable

Summary

Getting Started with Application Development

Development process and methodology

Setting up the development environment

Creating a new Maven project

Application specifications

Custom operator development

The Apex operator model

CheckpointListener/CheckpointNotificationListener

ActivationListener

IdleTimeHandler

Application configuration

Testing in the IDE

Writing the integration test

Running the application on YARN

Execution layer components

Installing Apex Docker sandbox

Running the application

Working on the cluster

YARN web UI

Apex CLI

Logging

Dynamically adjusting logging levels

Summary

The Apex Library

An overview of the library

Integrations

Apache Kafka

Kafka input

Kafka output

Other streaming integrations

JMS (ActiveMQ, SQS, and so on)

Kinesis streams

Files

File input

File splitter and block reader

File writer

Databases

JDBC input

JDBC output

Other databases

Transformations

Parser

Filter

Enrichment

Map transform

Custom functions

Windowed transformations

Windowing

Global Window

Time Windows

Sliding Time Windows

Session Windows

Window propagation

State

Accumulation

Accumulation Mode

State storage

Watermarks

Allowed lateness

Triggering

Merging of streams

The windowing example

Dedup

Join

State Management

Summary

Scalability, Low Latency, and Performance

Partitioning and how it works

Elasticity

Partitioning toolkit

Configuring and triggering partitioning

StreamCodec

Unifier

Custom dynamic partitioning

Performance optimizations

Affinity and anti-affinity

Low-latency versus throughput

Sample application for dynamic partitioning

Performance – other aspects for custom operators

Summary

Fault Tolerance and Reliability

Distributed systems need to be resilient

Fault-tolerance components and mechanism in Apex

Checkpointing

When to checkpoint

How to checkpoint

What to checkpoint

Incremental state saving

Incremental recovery

Processing guarantees

Example – exactly-once counting

The exactly-once output to JDBC

Summary

Example Project – Real-Time Aggregation and Visualization

Streaming ETL and beyond

The application pattern in a real-world use case

Analyzing Twitter feed

Top Hashtags

TweetStats

Running the application

Configuring Twitter API access

Enabling WebSocket output

The Pub/Sub server

Grafana visualization

Installing Grafana

Installing Grafana Simple JSON Datasource

The Grafana Pub/Sub adapter server

Setting up the dashboard

Summary

Example Project – Real-Time Ride Service Data Processing

The goal

Datasource

The pipeline

Simulation of a real-time feed using historical data

Parsing the data

Looking up of the zip code and preparing for the windowing operation

Windowed operator configuration

Serving the data with WebSocket

Running the application

Running the application on GCP Dataproc

Summary

Example Project – ETL Using SQL

The application pipeline

Building and running the application

Application configuration

The application code

Partitioning

Application testing

Understanding application logs

Calcite integration

Summary

Introduction to Apache Beam

Introduction to Apache Beam

Beam concepts

Pipelines, PTransforms, and PCollections

ParDo – elementwise computation

GroupByKey/CombinePerKey – aggregation across elements

Windowing, watermarks, and triggering in Beam

Windowing in Beam

Watermarks in Beam

Triggering in Beam

Advanced topic – stateful ParDo

WordCount in Apache Beam

Setting up your pipeline

Reading the works of Shakespeare in parallel

Splitting each line on spaces

Eliminating empty strings

Counting the occurrences of each word

Format your results

Writing to a sharded text file in parallel

Testing the pipeline at small scale with DirectRunner

Running Apache Beam WordCount on Apache Apex

Summary

The Future of Stream Processing

Lower barrier for building streaming pipelines

Visual development tools

Streaming SQL

Better programming API

Bridging the gap between data science and engineering

Machine learning integration

State management

State query and data consistency

Containerized infrastructure

Management tools

Summary

Preface

With business demand for faster insights derived from a growing number of information sources, stream data processing technology is gaining popularity. Open-source-based products have become the prevailing implementation choice, and there has been a shift from the early MapReduce-based batch processing paradigm to newer, more expressive frameworks designed to process data as streams with minimal latency, high reliability, and accuracy guarantees.

Apache Apex is a large-scale, stream-first big data processing framework that can be used to build low-latency, high-throughput, complex analytics pipelines that execute on clusters. Development of Apex started in 2012, it has been improving continuously, and today it is used in production by a number of companies for real-time and batch processing at scale.

The big data landscape has a wide array of technology components and choices, and it remains a challenge for end users to piece everything together to be successful with their big data projects and realize value from their technology investments. 

This book will focus on how to apply Apex to big data processing use cases. It is written by experts in the area, including key contributors of Apache Apex who built the platform and have extensive experience working with users in the field who use Apex in their enterprise solutions. This book is an instructional and example-driven guide on how to build Apex applications for developers and hands-on enterprise architects. It will help identify use cases, the building blocks needed to put together solutions, and the process of implementing, testing, and tuning applications for production. Fully functional example projects are provided to cover key aspects of data processing pipelines, such as connectors for sources and sinks, and common transformations. These projects can also be used as starting points for custom development.

To connect with the Apache Apex project, please visit the website (http://apex.apache.org/), subscribe to the mailing lists mentioned there, or follow @ApacheApex on Twitter (https://twitter.com/apacheapex) or on SlideShare (http://www.slideshare.net/ApacheApex/presentations).

What this book covers

Chapter 1, Introduction to Apex, tells us how processing of data-in-motion is realized by Apache Apex. It also gives us a few Apex stream processing use cases and applications, and talks about their value propositions.

Chapter 2, Getting Started with Application Development, shows us how the Apex development process works from project creation to application deployment; the result is a simple working application.

Chapter 3, The Apex Library, talks about the Malhar library, which contains functional building blocks for writing real-world Apex applications.

Chapter 4, Scalability, Low Latency, and Performance, teaches us how Apex can scale and parallelize processing, how to achieve dynamic scaling and better resource allocation in general, and why low latency and high throughput are both achievable without trading one off against the other. Operator partitioning and related techniques are central to this endeavor and are shown in practice in a sample application.

Chapter 5, Fault Tolerance and Reliability, explores the implementation of fault-tolerance and reliability in Apex including exactly-once semantics via distributed checkpointing and effective state management.

Chapter 6, Example Project – Real-Time Aggregation and Visualization, puts together all the building blocks to show a streaming analytics project and how to integrate it with a UI and existing infrastructure.

Chapter 7, Example Project – Real-Time Ride Service Data Processing, relies on a historical dataset to simulate a real-time ride service data stream. We are using event time and out-of-order processing, in particular, to build a simple analytics application that can serve as a template for more complicated event stream data pipelines.

Chapter 8, Example Project – ETL Using SQL, shows how to build a classic ETL application using Apex and Apache Calcite.

Chapter 9, Introduction to Apache Beam, introduces the Beam stream processing framework and an approach that allows a stream application engine such as Apex to be swapped in if needed.

Chapter 10, The Future of Stream Processing, looks at the road ahead for Apex and stream processing in general. We are going to examine the role of machine learning, as well as the role of SQL and why it is important for streaming.

What you need for this book 

Apex applications can be built and run locally on the user's development machine via a properly written JUnit test (a minimal sketch of such a test follows the list below). To do this, the user need only ensure that recent versions of the following software packages are present:

Java JDK (please note that the JRE alone is not adequate)

Maven build system

Git revision control system (optional)

A Java IDE such as Eclipse or IntelliJ (optional)
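
For orientation, here is a minimal sketch of such a JUnit test, along the lines of the test generated by the Apex Maven application archetype; the Application class is assumed to be your project's StreamingApplication implementation, and the run duration is arbitrary:

    import org.apache.hadoop.conf.Configuration;
    import org.junit.Test;

    import com.datatorrent.api.LocalMode;

    public class ApplicationTest {

      @Test
      public void testApplication() throws Exception {
        // embedded, single-JVM mode of the Apex engine; no cluster required
        LocalMode lma = LocalMode.newInstance();
        Configuration conf = new Configuration(false);
        // Application is assumed to be the project's StreamingApplication implementation
        lma.prepareDAG(new Application(), conf);
        LocalMode.Controller lc = lma.getController();
        lc.run(10000); // run for about 10 seconds, then shut down
      }
    }

Running this test (for example, with mvn test) exercises the complete pipeline on the development machine.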

To run Apex applications on a cluster, one needs a cluster with Hadoop installed and a client to launch them. This client needs to be installed on the edge node (sometimes referred to as the gateway node or the client node); there is no need to install anything on the entire cluster.

There are several options to install the client, and some of them are listed on the Apex download page: http://apex.apache.org/downloads.html.

Without an existing Hadoop cluster, an easy way to get started for experimentation is a sandbox VM that already has a single node cluster configured (sandbox VMs are available from Hadoop vendors, as docker images and so on).

Who this book is for

This book is a practical and example-oriented guide on how to use Apache Apex to successfully implement solutions for real-world big data problems. The book assumes knowledge of application development with Java and familiarity with distributed systems. It does not require prior knowledge of Apex or stream processing, although knowledge of other big data processing frameworks is helpful. 

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an email to [email protected], and mention the book title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your email address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Apache-Apex. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/LearningApacheApex_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at [email protected] if you are having a problem with any aspect of the book, and we will do our best to address it.

Introduction to Apex

The world is producing data at unprecedented levels, with a rapidly growing number of mobile devices, sensors, industrial machines, financial transactions, web logs, and so on. Often, the streams of data generated by these sources can offer valuable insights if processed quickly and reliably, and companies are finding it increasingly important to take action on this data-in-motion in order to remain competitive. MapReduce and Apache Hadoop were among the first technologies to enable processing of very large datasets on clusters of commodity hardware. The prevailing paradigm at the time was batch processing, which evolved from MapReduce's heavy reliance on disk I/O to Apache Spark's more efficient, memory-based approach.

Still, the downside of batch processing systems is that they accumulate data into batches, sometimes over hours, and cannot address use cases that require a short time to insight for continuous data in motion. Such requirements can be handled by newer stream processing systems, which can process data in real time, sometimes with latency as low as a few milliseconds. Apache Storm was the first ecosystem project to offer this capability, albeit with prohibitive trade-offs such as reliability versus latency. Today, there are newer and production-ready frameworks that don't force the user to make such choices. Rather, they enable low latency, high throughput, reliability, and a unified architecture that can be applied to both streaming and batch use cases. This book will introduce Apache Apex, a next-generation platform for processing data in motion.

In this chapter, we will cover the following topics:

Unbounded data and continuous processing

Use cases and case studies

Application Model and API

Value proposition of Apex

Unbounded data and continuous processing

Datasets can be classified as unbounded or bounded. Bounded data is finite; it has a beginning and an end. Unbounded data is an ever-growing, essentially infinite dataset. The distinction is independent of how the data is processed. Often, unbounded data is equated to stream processing and bounded data to batch processing, but this is starting to change. We will see how state-of-the-art stream processors, such as Apache Apex, can be used to (and are very capable of) processing both unbounded and bounded data, and there is no need for a batch processing system just because the dataset happens to be finite.

For more details on these data processing concepts, you can visit the following link: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101.

Most big datasets (high volume) that are eventually processed by big data systems are unbounded. There is a rapidly increasing volume of such infinite data from sources such as IoT sensors (such as industrial gauge sensors, automobile data ports, connected home, and quantified self), stock markets and financial transactions, telecommunications towers and satellites, and so on. At the same time, the legacy processing and storage systems are either nearing performance and capacity limits, or total cost of ownership (TCO) is becoming prohibitive.

Businesses need to convert the available data into meaningful insights and make data-driven, real-time decisions to remain competitive.

Organizations are increasingly relying on very fast processing (high velocity), as the value of data diminishes as it ages.

How were these unbounded datasets processed without a streaming architecture?

To be consumable by a batch processor, they had to be divided into bounded data, often at intervals of hours. Before processing could begin, the earliest events would wait for a long time for their batch to be ready. By the time it was processed, the data would already be old and less valuable.

Stream processing

Stream processing means processing event by event, as soon as it is available. Because there is no waiting for more input after an event arrives, there is no artificially added latency (unlike with batching). This is important for real-time use cases, where information should be processed and results available with minimum latency or delay. However, stream processing is not limited to real-time data. We will see there are benefits to applying this continuous processing in a uniform manner to historical data as well.

Consider data that is stored in a file. By reading the file line by line and processing each line as soon as it is read, subsequent processing steps can be performed while the file is still being read, instead of waiting for the entire input to be read before initiating the next stage. Stream processing is a pipeline, and each item can be acted upon immediately. Apart from low latency, this can also lead to even resource consumption (memory, CPU, network) with steady (versus bursty) throughput, when operations performed inherently don't require any blocking:

An example of a data pipeline

Data flows through the pipeline as individual events, and all processing steps are active at the same time. In a distributed system, operations are performed on different nodes and data flows through the system, allowing for parallelism and high throughput. Processing is decentralized and without inherent bottlenecks, in contrast to architectures that attempt to move processing to where the data resides.
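
To make the file-reading example above concrete in plain Java (independent of any stream processing framework), the following sketch hands each line to the next step as soon as it is read; the file name and the process method are placeholders for a real source and a real processing stage:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class LinePipeline {

      public static void main(String[] args) throws IOException {
        // hypothetical input file standing in for any event source
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("events.log"))) {
          String line;
          while ((line = reader.readLine()) != null) {
            process(line); // the next stage acts on each line immediately
          }
        }
      }

      private static void process(String line) {
        // placeholder for the next processing step in the pipeline
        System.out.println("processed: " + line);
      }
    }

Because each line flows onward immediately, downstream work overlaps with reading the input, which is exactly the pipelining effect described above.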

Stream processing is a natural fit for how events occur in the real world. Sources generate data continuously (mobile devices, financial transactions, web traffic, sensors, and so on). It therefore makes sense to also process them that way instead of artificially breaking the processing into batches (or micro-batches).

The meaning of real time, or time for fast decision making, varies significantly between businesses. Some use cases, such as online fraud detection, may require processing to complete within milliseconds, but for others multiple seconds or even minutes might be sufficiently fast. In any case, the underlying platform needs to be equipped for fast and correct low-latency processing.

Streaming applications can process data fast, with low latency. Stream processing has gained popularity along with the growing demand for faster processing of current data, but it is not a synonym for real-time processing. Input data does not need to be real-time; older data can also be processed as a stream (for example, by reading from a file), and results are not always emitted in real time either. Stream processing can perform operations such as sum, average, or top, which are computed over multiple events before the result becomes available.

To perform such operations, the stream needs to be sliced at temporal boundaries; this is called windowing. Windowing demarcates finite datasets for computation: all data belonging to a window needs to be observed before a result can be emitted, and the window provides these boundaries. There are different strategies to define such windows over a data stream, and these will be covered in Chapter 3, The Apex Library:

Windowing of a stream

In the preceding diagram we see the sum of incoming readings computed over tumbling (non-overlapping) and sliding (overlapping) windows. At the end of each window, the result is emitted.

With windowing, the final result of an operation for a given window is only known after all its data elements are processed. However, many windowed operations still benefit from event-by-event arrival of data and incremental computation. Windowing doesn't always mean that processing can only start once all input has arrived. In our example, the sum can be updated whenever the next event arrives, versus storing all individual events and deferring computation until the end of the window. Sometimes, even the intermediate result of a windowed computation is of interest and can be made available for downstream consumption and subsequently refined with the final result.
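
As a plain Java illustration of this incremental style, the following sketch keeps a running sum per tumbling window and emits it when the window closes; the five-second window length and the assumption that events arrive in timestamp order are simplifications for the example:

    import java.util.concurrent.TimeUnit;

    // Incremental sum over tumbling windows: update on every event, emit at window end.
    public class TumblingSum {

      private static final long WINDOW_MILLIS = TimeUnit.SECONDS.toMillis(5);

      private long windowStart;
      private long sum;

      // called once per incoming reading, in arrival (timestamp) order
      public void onEvent(long timestamp, long reading) {
        if (windowStart == 0) {
          windowStart = timestamp - (timestamp % WINDOW_MILLIS); // align to window boundary
        }
        while (timestamp >= windowStart + WINDOW_MILLIS) {
          emit(windowStart, sum);       // final result for the closed window
          windowStart += WINDOW_MILLIS; // advance (also emits zero for empty windows)
          sum = 0;
        }
        sum += reading; // incremental update; no individual events are buffered
      }

      private void emit(long windowStart, long total) {
        System.out.println("window@" + windowStart + " sum=" + total);
      }
    }

Only the running sum and the current window boundary are kept as state, rather than every event in the window.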

Stream processing systems

The first open source stream processing framework in the big data ecosystem was Apache Storm. Since then, several other Apache projects for stream processing have emerged. Next-generation, streaming-first architectures such as Apache Apex and Apache Flink come with stronger capabilities and are more broadly applicable. They are not only able to process data with low latency, but also provide for state management (for data that an operation may require across individual events), strong processing guarantees (correctness), fault tolerance, scalability, and high performance.

Users can now also expect such frameworks to come with comprehensive libraries of connectors, other building blocks and APIs that make development of non-trivial streaming applications productive and allow for predictable project implementation cycles. Equally importantly, next-generation frameworks should cater to aspects such as operability, security, and the ability to run on shared infrastructure (multi-tenancy) to satisfy DevOps requirements for successful production launch and uptime.

Streaming can do it all!

Limitations of early stream processing systems led to the so-called Lambda Architecture, essentially a parallel setup of stream and batch processing paths to obtain fast but potentially unreliable results through the stream processor and, in parallel, correct but slow results through a batch processing system like Apache Hadoop MapReduce.

The fast processing path in such a setup can potentially produce incorrect results, hence the need to re-compute the same results in an alternate batch processing path. Correctness issues are caused by previous technical limitations of stream processing, not by the paradigm itself. For example, if events are processed multiple times or lost, it leads to double or under counting, which would be a problem for an application that relies on accurate results, for example, in the financial sector.

This setup requires the same functionality to be implemented with two different frameworks, as well as extra infrastructure and operational skills, and therefore results in longer time to production and higher total cost of ownership (TCO). With recent advances in stream processing, the Lambda Architecture is no longer necessary. Instead, a unified streaming architecture can be used for reliable processing in a much more cost-effective solution.

This approach based on a single system was outlined in 2014 as the Kappa Architecture, and today there are several stream processing technology options, including Apache Apex, that support batch as a special case of streaming.

To know more about the Kappa Architecture, please refer to following link: https://www.oreilly.com/ideas/questioning-the-lambda-architecture.

These newer systems are fault-tolerant, produce correct results, can achieve low latency as well as high throughput, and provide options for enterprise-grade operability and support. Potential users are no longer confronted with the shortcomings that previously justified a parallel batch processing system. We will later see how Apache Apex ensures correct processing, including its support for exactly-once processing.

What is Apex and why is it important?

Apache Apex (http://apex.apache.org/) is a stream processing platform and framework that can process data in-motion with low latency in a way that is highly scalable, highly performant, fault-tolerant, stateful, secure, distributed, and easily operable. Apex is written in Java, and Java is the primary application development environment.

In a typical streaming data pipeline, events from sources are stored and transported through a system such as Apache Kafka. The events are then processed by a stream processor and the results delivered to sinks, which are frequently databases, distributed file systems or message buses that link to downstream pipelines or services.

The following figure illustrates this:

In the end-to-end scenario depicted in this illustration, we see Apex as the processing component. The processing can be complex logic, with operations performed in sequence or in parallel in a distributed environment.
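
As a first glimpse of how such a pipeline is expressed in code, here is a minimal sketch using the Apex DAG Java API, which later chapters cover in detail; the LineReader and LineCounter operator classes are hypothetical stand-ins for a real source and real processing logic:

    import org.apache.hadoop.conf.Configuration;

    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;
    import com.datatorrent.api.annotation.ApplicationAnnotation;

    @ApplicationAnnotation(name = "MyFirstApplication")
    public class Application implements StreamingApplication {

      @Override
      public void populateDAG(DAG dag, Configuration conf) {
        // hypothetical operators standing in for a real source and computation
        LineReader reader = dag.addOperator("reader", new LineReader());
        LineCounter counter = dag.addOperator("counter", new LineCounter());

        // a stream connects an output port of one operator to an input port of another
        dag.addStream("lines", reader.output, counter.input);
      }
    }

The engine takes care of distributing the operators across the cluster and moving data between them.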

Apex runs on cluster infrastructure and currently supports and depends on Apache Hadoop, for which it was originally written. Support for Apache Mesos and other Docker-based infrastructure is on the roadmap.

Apex supports integration with many external systems out of the box, with connectors that are maintained and released by the project, including but not limited to the systems shown in the preceding diagram. The most frequently used connectors include Kafka and file readers. Frequently used sinks for the computed results are files and databases, though results can also be delivered directly to frontend systems for purposes such as real-time reporting directly from the Apex application, a use case that we will look at later.

Origin of Apex

The development of the Apex project started in 2012, with the original vision of enabling fast, performant, and scalable real-time processing on Hadoop. At that time, batch processing and MapReduce-based frameworks such as Apache Pig, Hive, or Cascading were still the standard options for processing data. Hadoop 2.x with YARN (Yet Another Resource Negotiator) was about to be released to pave the way for a number of new processing frameworks and paradigms to become available as alternatives to MapReduce. Due to its roots in the Hadoop ecosystem, Apex is very well integrated with YARN, and since its earliest days has offered features such as dynamic resource allocation for scaling and efficient recovery. It is also leading in high performance (with low latency), scalability and operability, which were focus areas from the very beginning.

The technology was donated to the Apache Software Foundation (ASF) in 2015, at which time it entered the Apache incubation program and graduated after only eight months to achieve Top Level Project status in April 2016.

Apex had its first production deployments in 2014 and today is used in mission-critical deployments in various industries for processing at scale. Use cases range from very low-latency processing in the real-time category to large-scale batch processing; a few examples will be discussed in the next section. Some of the organizations that use Apex can be found on the Powered by Apache Apex page on the Apex project web site at https://apex.apache.org/powered-by-apex.html.

Use cases and case studies

Apex is a platform and framework on top of which specific applications (or solutions) are built.

As such, Apex is applicable to a wide range of use cases, including real-time machine learning model scoring, real-time ETL (Extract, Transform, and Load), predictive analytics, anomaly detection, real-time recommendations, and systems monitoring.

As organizations realize the financial and competitive importance of making data-driven decisions in real time, the number of industries and use cases will grow.

In the remainder of this section, we will discuss how companies in various industries are using Apex to solve important problems. These companies have presented their particular use cases, implementations and findings at conferences and meetups, and references to this source material are provided with each case study when available.

Real-time insights for Advertising Tech (PubMatic)

Companies in the advertising technology (AdTech) industry need to address data volumes that are increasing at breakneck speed, along with customers demanding faster insights and analytical reporting.

PubMatic is a leading AdTech company providing marketing automation for publishers and is driven by data at a massive scale. On a daily basis, the company processes over 350 billion bids, serves over 40 billion ad impressions, and processes over 50 terabytes of data. Through real-time analytics, yield management, and workflow automation, PubMatic enables publishers to make smarter inventory decisions and improve revenue performance. Apex is used for real-time reporting and for the allocation engine.

In PubMatic's legacy batch processing system, there could be a delay of five hours to obtain updated data for their key metrics (revenues, impressions and clicks) and a delay of nine hours to obtain data for auction logs.

PubMatic decided to pursue a real-time streaming solution so that it could provide publishers, demand side platforms (DSPs), and agencies with actionable insights as close to the time of event generation as possible. PubMatic's streaming implementation had to achieve the following:

Ingest and analyze a high volume of clicks and views (200,000 events/sec) to help their advertising customers improve revenues

Utilize auction and client log data (22 TB/day) to report critical metrics for campaign monetization

Handle rapidly increasing network traffic with efficient utilization of resources

Provide a feedback loop to the ad server for making efficient ad serving decisions.

This high volume data would need to be processed in real-time to derive actionable insights, such as campaign decisions and audience targeting.

PubMatic decided to implement its real-time streaming solution with Apex based on the following factors:

Time to value: the solution could be implemented within a short time frame

The Apex applications could run on PubMatic's existing Hadoop infrastructure

Apex had important connectors (files, Apache Kafka, and so on) available out of the box

Apex supported event time dimensional aggregations with real-time query capability

With the Apex-based solution, deployed to production in 2014, PubMatic's end-to-end latency to obtain updated data and metrics for their two use cases fell from hours to seconds. This enabled real-time visibility into successes and shortcomings of its campaigns and timely tuning of models to maximize successful auctions.

Additional resources:

Video: PubMatic presents High Performance AdTech Use Cases with Apache Apex at https://www.youtube.com/watch?v=JSXpgfQFcU8

Slides: https://www.slideshare.net/ashishtadose1/realtime-adtech-reporting-targeting-with-apache-apex

Industrial IoT applications (GE)

General Electric (GE) is a large, diversified company with business units in energy, power, aviation, transportation, healthcare, finance, and other industries. Many of these business units deal in industrial machinery and devices such as wind turbines, aviation components, locomotive components, healthcare imaging machines, and so on. Such industrial devices continually generate high volumes of real-time data, and GE decided to provide advanced IoT analytics solutions to the thousands of customers using these devices and sensors across its various business units and industries.

The GE Predix platform enables users to develop and execute Industrial IoT applications to gain real-time insights about their devices and their usage, as well as take actions based on these insights. Certain services offered by Predix are powered by Apache Apex. GE selected Apex for these services based on the following features (feature details will be covered later in this book):

High performance and distributed computing

Dynamic partitioning

Rich library of existing operators

Support for at-least-once, at-most-once, and exactly-once processing

Hadoop/YARN compatibility

Fault tolerance and platform stability

Ease of deployment and operability

Enterprise grade security

One Predix service that runs on Apex is the Time Series service, which leverages Apex due to its speed, scalability, high performance, and fault tolerance capabilities.

The service provides:

Efficient storage of time series data

Data indexing for quick retrieval

Industrial focused query modes

High availability and horizontal scalability

Millisecond data point precision

By running Apex, users of the Time Series service are able to:

Ingest and analyze high-volume, high-speed data from thousands of devices and sensors per customer in real time, without data loss

Run predictive analytics to reduce costly maintenance and improve customer service

Conduct unified monitoring of all connected sensors and devices to minimize disruptions

Have fast application development cycles