With the rise of Big Data, there is an increasing need to process large amounts of data continuously, with a shorter turnaround time. Real-time data processing involves continuous input, processing and output of data, with the condition that the time required for processing is as short as possible.
This book covers the majority of the existing and evolving open source technology stack for real-time processing and analytics. You will get to know about all the real-time solution aspects, from the source to the presentation to persistence. Through this practical book, you’ll be equipped with a clear understanding of how to solve challenges on your own.
We’ll cover topics such as how to set up components, basic executions, integrations, advanced use cases, alerts, and monitoring. You’ll be exposed to the popular tools used in real-time processing today such as Apache Spark, Apache Flink, and Storm. Finally, you will put your knowledge to practical use by implementing all of the techniques in the form of a practical, real-world use case.
By the end of this book, you will have a solid understanding of all the aspects of real-time data processing and analytics, and will know how to deploy the solutions in production environments in the best possible manner.
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: September 2017
Production reference: 1250917
ISBN 978-1-78728-120-2
www.packtpub.com
Author
Shilpi Saxena, Saurabh Gupta
Copy Editor
Safis Editing
Reviewers
Ruben Oliva Ramos, Juan Tomás Oliva Ramos, Prateek Bhati
Project Coordinator
Nidhi Joshi
Commissioning Editor
Amey Varangaonkar
Proofreader
Safis Editing
Acquisition Editor
Tushar Gupta
Indexer
Tejal Daruwale Soni
Content Development Editor
Mayur Pawanikar
Graphics
Tania Dutta
Technical Editor
Karan Thakkar
Production Coordinator
Arvindkumar Gupta
Shilpi Saxena is an IT professional and a technology evangelist. She is an engineer who has had exposure to various domains (the machine-to-machine space, healthcare, telecom, hiring, and manufacturing). She has experience in all aspects of the conception and execution of enterprise solutions. She has been architecting, managing, and delivering solutions in the big data space for the last 3 years, and she also handles a high-performance, geographically distributed team of elite engineers.
Shilpi has more than 12 years (3 years in the big data space) of experience in the development and execution of various facets of enterprise solutions, in both the products and services dimensions of the software industry. An engineer by degree and profession, she has worn varied hats, such as developer, technical leader, product owner, and tech manager, and she has seen all the flavors that the industry has to offer. She has architected and worked through some of the pioneering production implementations of big data on Storm and Impala with auto-scaling in AWS.
Shilpi also authored Real-time Analytics with Storm and Cassandra with Packt Publishing.
Saurabh Gupta is a software engineer who has worked on all aspects of software requirements, design, execution, and delivery. Saurabh has more than 3 years of experience working in the big data domain. He designs and manages real-time as well as batch processing projects running in production, working with technologies such as Impala, Storm, NiFi, and Kafka, and deploying on AWS using Docker. Saurabh has also worked in product development and delivery.
Saurabh has a total of 10 years of rich experience in the IT industry (3+ years in big data). He has exposure to various IoT use cases, including telecom, healthcare, smart cities, smart cars, and so on.
Ruben Oliva Ramos is a computer systems engineer from Tecnologico de Leon Institute, with a master's degree in computer and electronic systems engineering, teleinformatics, and networking specialization from the University of Salle Bajio in Leon, Guanajuato, Mexico. He has more than 5 years of experience in developing web applications to control and monitor devices connected with Arduino and Raspberry Pi, using web frameworks and cloud services to build Internet of Things applications.
He is a mechatronics teacher at the University of Salle Bajio and teaches students of the master's degree in design and engineering of mechatronics systems. Ruben also works at Centro de Bachillerato Tecnologico Industrial 225 in Leon, Guanajuato, Mexico, teaching subjects such as electronics, robotics and control, automation, and microcontrollers for the Mechatronics Technician Career. He is a consultant and developer for projects in areas such as monitoring systems and datalogger data, using technologies such as Android, iOS, Windows Phone, HTML5, PHP, CSS, Ajax, JavaScript, Angular, and ASP.NET, databases such as SQLite, MongoDB, and MySQL, web servers such as Node.js and IIS, hardware programming such as Arduino, Raspberry Pi, Ethernet Shield, GPS, GSM/GPRS, and ESP8266, and control and monitoring systems for data acquisition and programming.
He has authored the book Internet of Things Programming with JavaScript by Packt Publishing. He is also involved in monitoring, controlling, and the acquisition of data with Arduino and Visual Basic .NET for Alfaomega.
Juan Tomás Oliva Ramos is an environmental engineer from the University of Guanajuato, Mexico, with a master's degree in administrative engineering and quality. He has more than 5 years of experience in the management and development of patents, technological innovation projects, and the development of technological solutions through the statistical control of processes. He has been a teacher of statistics, entrepreneurship, and the technological development of projects since 2011. He became an entrepreneur mentor and started a new department of technology management and entrepreneurship at Instituto Tecnologico Superior de Purisima del Rincon.
Juan is an Alfaomega reviewer and has worked on the book Wearable designs for Smart watches, Smart TVs and Android mobile devices.
He has developed prototypes through programming and automation technologies for the improvement of operations, which have been registered for patents.
Prateek Bhati currently works at Accenture. He has a total of 4 years of experience in real-time data processing, lives in New Delhi, and completed his graduation from Amity University.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787281205.
If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
Introducing Real-Time Analytics
What is big data?
Big data infrastructure
Real–time analytics – the myth and the reality
Near real–time solution – an architecture that works
NRT – The Storm solution
NRT – The Spark solution
Lambda architecture – analytics possibilities
IOT – thoughts and possibilities
Edge analytics
Cloud – considerations for NRT and IOT
Summary
Real Time Applications – The Basic Ingredients
The NRT system and its building blocks
Data collection
Stream processing
Analytical layer – serve it to the end user
NRT – high-level system view
NRT – technology view
Event producer
Collection
Broker
Transformation and processing
Storage
Summary
Understanding and Tailing Data Streams
Understanding data streams
Setting up infrastructure for data ingestion
Apache Kafka
Apache NiFi
Logstash
Fluentd
Flume
Tapping data from source to the processor - expectations and caveats
Comparing and choosing what works best for your use case
Do it yourself
Setting up Elasticsearch
Summary
Setting up the Infrastructure for Storm
Overview of Storm
Storm architecture and its components
Characteristics
Components
Stream grouping
Setting up and configuring Storm
Setting up Zookeeper
Installing
Configuring
Standalone
Cluster
Running
Setting up Apache Storm
Installing
Configuring
Running
Real-time processing job on Storm
Running job
Local
Cluster
Summary
Configuring Apache Spark and Flink
Setting up and a quick execution of Spark
Building from source
Downloading Spark
Running an example
Setting up and a quick execution of Flink
Build Flink source
Download Flink
Running example
Setting up and a quick execution of Apache Beam
Beam model
Running example
MinimalWordCount example walk through
Balancing in Apache Beam
Summary
Integrating Storm with a Data Source
RabbitMQ – messaging that works
RabbitMQ exchanges
Direct exchanges
Fanout exchanges
Topic exchanges
Headers exchanges
RabbitMQ setup
RabbitMQ — publish and subscribe
RabbitMQ – integration with Storm
AMQPSpout
PubNub data stream publisher
String together Storm-RMQ-PubNub sensor data topology
Summary
From Storm to Sink
Setting up and configuring Cassandra
Setting up Cassandra
Configuring Cassandra
Storm and Cassandra topology
Storm and IMDB integration for dimensional data
Integrating the presentation layer with Storm
Setting up Grafana with the Elasticsearch plugin
Downloading Grafana
Configuring Grafana
Installing the Elasticsearch plugin in Grafana
Running Grafana
Adding the Elasticsearch datasource in Grafana
Writing code
Executing code
Visualizing the output on Grafana
Do It Yourself
Summary
Storm Trident
State retention and the need for Trident
Transactional spout
Opaque transactional Spout
Basic Storm Trident topology
Trident internals
Trident operations
Functions
map and flatMap
peek
Filters
Windowing
Tumbling window
Sliding window
Aggregation
Aggregate
Partition aggregate
Persistence aggregate
Combiner aggregator
Reducer aggregator
Aggregator
Grouping
Merge and joins
DRPC
Do It Yourself
Summary
Working with Spark
Spark overview
Spark framework and schedulers
Distinct advantages of Spark
When to avoid using Spark
Spark – use cases
Spark architecture - working inside the engine
Spark pragmatic concepts
RDD – the name says it all
Spark 2.x – advent of data frames and datasets
Summary
Working with Spark Operations
Spark – packaging and API
RDD pragmatic exploration
Transformations
Actions
Shared variables – broadcast variables and accumulators
Broadcast variables
Accumulators
Summary
Spark Streaming
Spark Streaming concepts
Spark Streaming - introduction and architecture
Packaging structure of Spark Streaming
Spark Streaming APIs
Spark Streaming operations
Connecting Kafka to Spark Streaming
Summary
Working with Apache Flink
Flink architecture and execution engine
Flink basic components and processes
Integration of source stream to Flink
Integration with Apache Kafka
Example
Integration with RabbitMQ
Running example
Flink processing and computation
DataStream API
DataSet API
Flink persistence
Integration with Cassandra
Running example
FlinkCEP
Pattern API
Detecting pattern
Selecting from patterns
Example
Gelly
Gelly API
Graph representation
Graph creation
Graph transformations
DIY
Summary
Case Study
Introduction
Data modeling
Tools and frameworks
Setting up the infrastructure
Implementing the case study
Building the data simulator
Hazelcast loader
Building Storm topology
Parser bolt
Check distance and alert bolt
Generate alert Bolt
Elasticsearch Bolt
Complete Topology
Running the case study
Load Hazelcast
Generate Vehicle static value
Deploy topology
Start simulator
Visualization using Kibana
Summary
This book provides basic to advanced recipes on real-time computing. We cover technologies such as Flink, Spark, and Storm. The book includes practical recipes to help you process unbounded streams of data, thus doing for real-time processing what Hadoop did for batch processing. You will begin by setting up the development environment and proceed to implement stream processing. This will be followed by recipes on real-time problems using RabbitMQ, Kafka, and NiFi along with Storm, Spark, Flink, Beam, and more. By the end of this book, you will have gained a thorough understanding of the fundamentals of NRT and its applications, and be able to identify and apply those fundamentals to any suitable problem.
This book is written in a cookbook style, with plenty of practical recipes, well-explained code examples, and relevant screenshots and diagrams.
Section – A: Introduction – Getting Familiar
This section gives the readers basic familiarity with the real-time analytics spectrum and its domains. We talk about the basic components and their building blocks. This section consists of the following chapters:
Chapter 1
: Introducing Real-Time Analytics
Chapter 2
: Real-Time Application – The Basic Ingredients
Section – B: Setup and Infrastructure
This section is predominantly setup-oriented, where we get the basic components set up. This section consists of the following chapters:
Chapter 3
: Understanding and Tailing Data Streams
Chapter 4
: Setting Up the Infrastructure for Storm
Chapter 5
: Configuring Apache Spark and Flink
Section – C: Storm Computations
This section predominantly focuses on exploring Storm, its compute capabilities, and its various features. This section consists of the following chapters:
Chapter 6
: Integrating Storm with a Data Source
Chapter 7
: From Storm to Sink
Chapter 8
: Storm Trident
Section – D: Using Spark Compute In Real Time
This section predominantly focuses on exploring Spark, its compute capabilities, and its various features. This section consists of the following chapters:
Chapter 9
: Working with Spark
Chapter 10
: Working with Spark Operations
Chapter 11
: Spark Streaming
Section – E: Flink for Real-Time Analytics
This section focuses on exploring Flink, its compute capabilities, and its various features.
Chapter 12
: Working with Apache Flink
Section – F: Let’s Have It Stringed Together
This section consists of the following chapter:
Chapter 13
: Case Study 1
The book is intended to graduate our readers into real-time streaming technologies. We expect the readers to have fundamental knowledge of Java and Scala. In terms of setup, we expect readers to have a basic Maven, Java, and Eclipse setup in place to run the examples.
If you are a Java developer who would like to be equipped with all the tools required to devise an end-to-end practical solution on real-time data streaming, then this book is for you. Basic knowledge of real-time processing will be helpful, and knowing the fundamentals of Maven, Shell, and Eclipse would be great.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, path names, dummy URLs, user input, and Twitter handles are shown as follows: "Once the kafka_2.11-0.10.1.1.tgz file is downloaded, extract the files."
A block of code is set as follows:
cp kafka_2.11-0.10.1.1.tgz /home/ubuntu/demo/kafka
cd /home/ubuntu/demo/kafka
tar -xvf kafka_2.11-0.10.1.1.tgz
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "In order to download new modules, we will go to Files | Settings | Project Name | Project Interpreter."
Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:
Log in or register to our website using your email address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Real-time-Processing-and-Analytics. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
This chapter sets the context for the reader by providing an overview of the big data technology landscape in general and real–time analytics in particular. This provides an outline for the book conceptually, with an attempt to ignite the spark for inquisitiveness that will encourage readers to undertake the rest of the journey through the book.
The following topics will be covered:
What is big data?
Big data infrastructure
Real–time analytics – the myth and the reality
Near real–time solution – an architecture that works
Analytics – a plethora of possibilities
IOT – thoughts and possibilities
Cloud – considerations for NRT and IOT
Well, to begin with, in simple terms, big data helps us deal with three V's – volume, velocity, and variety. Recently, two more V's were added to it, making it a five–dimensional paradigm: veracity and value.
Volume
: This dimension refers to the amount of data; look around you, huge amounts of data are being generated every second – it may be the email you send, Twitter, Facebook, or other social media, or it can just be all the videos, pictures, SMS messages, call records, and data from varied devices and sensors. We have scaled up the data–measuring metrics to terabytes, zettabytes, and yottabytes – they are all humongous figures. Look at Facebook alone; it sees roughly 10 billion messages, around 5 billion likes, and about 400 million photograph uploads every day. Data statistics in terms of volume are startling: all of the data generated from the beginning of time until 2008 is roughly equivalent to what we generate in a single day today, and soon it may well be an hour. This volume alone dwarfs the ability of traditional databases to store and process data in reasonable and useful time frames, whereas a big data stack can store, process, and compute on amazingly large data sets in a cost–effective, distributed, and reliably efficient manner.
Velocity
: This refers to the data generation speed, or the rate at which data is being generated. In today's world, where the volume of data has undergone a tremendous surge, this aspect is not lagging behind; we have loads of data because we are able to generate it so fast. Look at social media: things are circulated in seconds and become viral, and insights from social media are analysed in milliseconds by stock traders, which can trigger lots of buying or selling activity. At a point–of–sale counter, a credit card swipe takes only a few seconds, and within that window fraudulent transaction checks, payment, bookkeeping, and acknowledgement are all done. Big data gives us the power to analyse data at tremendous speed.
Variety
: This dimension tackles the fact that data can be unstructured. In the traditional database world, and even before that, we were used to a very structured form of data that fitted neatly into tables. Today, more than 80% of data is unstructured – quotable examples are photos, video clips, social media updates, data from a variety of sensors, voice recordings, and chat conversations. Big data lets you store and process this unstructured data in a very structured manner; in fact, it effaces the variety.
Veracity
: It's all about the validity and correctness of data. How accurate and usable is the data? Not everything out of millions and zillions of data records is correct, accurate, and referable. That's what veracity actually is: how trustworthy the data is and what the quality of the data is. Examples of data with veracity issues include Facebook and Twitter posts with nonstandard acronyms or typos. Big data has brought the ability to run analytics on this kind of data to the table. One of the strong reasons for the volume of data is its veracity.
Value
: This is what the name suggests: the value that the data actually holds. It is unarguably the most important V or dimension of big data. The only motivation for going towards big data for processing super large data sets is to derive some valuable insight from it. In the end, it's all about cost and benefits.
Big data is a much talked about technology across businesses and the technical world today. There are myriad domains and industries that are convinced of its usefulness, but the implementation focus is primarily application-oriented, rather than infrastructure-oriented. The next section predominantly walks you through the same.
Before delving further into big data infrastructure, let's have a look at the big data high–level landscape.
The following figure captures high–level segments that demarcate the big data space:
It clearly depicts the various segments and verticals within the big data technology canvas (bottom up).
The key is the bottom layer that holds the data in scalable and distributed mode:
Technologies
: Hadoop, MapReduce, Mahout, HBase, Cassandra, and so on
Then, the next level is the infrastructure framework layer, which enables the developers to choose from myriad infrastructural offerings depending upon the use case and its solution design:
Analytical Infrastructure
: EMC, Netezza, Vertica, Cloudera, Hortonworks
Operational Infrastructure
: Couchbase, Teradata, Informatica and many more
Infrastructure as a Service (IaaS)
: AWS, Google Cloud, and many more
Structured Databases
: Oracle, SQL Server, MySQL, Sybase, and many more
The next level specializes in catering to very specific needs in terms of:
Data as a Service (DaaS)
: Kaggle, Azure, Factual, and many more
Business Intelligence (BI)
: QlikView, Cognos, SAP BO, and many more
Analytics and Visualizations
: Pentaho, Tableau, Tibco and many more
Today, we see traditional, robust RDBMSs struggling to survive in a cost–effective manner as tools for data storage and processing. Scaling a traditional RDBMS to the compute power needed to process huge amounts of data with low latency comes at a very high price. This led to the emergence of new technologies that were low cost, low latency, highly scalable, and often open source. To our rescue came the yellow elephant, Hadoop, which took the data storage and computation arena by surprise. It is designed and developed as a distributed framework for data storage and computation on commodity hardware in a highly reliable and scalable manner. The key computational methodology Hadoop works on involves distributing the data in chunks over all the nodes in a cluster, and then processing the data concurrently on all the nodes.
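As an illustration of this distribute-then-compute methodology, the following is a minimal word count sketch against the classic Hadoop MapReduce Java API. The class name and the HDFS input/output paths passed as arguments are hypothetical; the sketch is only meant to show how the map phase runs concurrently on the nodes holding the data blocks, while the reduce phase aggregates the shuffled partial results:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on every node that holds a block of the input
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE); // emit (word, 1)
      }
    }
  }

  // Reduce phase: aggregates the partial counts shuffled in from all the mappers
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum)); // emit (word, total)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Each mapper works only on the chunk of data local to its node, so adding nodes scales both storage and compute horizontally.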
Now that you are acquainted with the basics of big data and the key segments of the big data technology landscape, let's take a deeper look at the big data concept with the Hadoop framework as an example. Then, we will move on to take a look at the architecture and methods of implementing a Hadoop cluster; this will be a close analogy to the high–level infrastructure and typical storage requirements for a big data cluster. One of the key and critical aspects that we will delve into is information security in the context of big data.
A couple of key aspects that drive and dictate the move to big data infraspace are highlighted in the following figure:
Cluster design
: This is the most significant and decisive aspect of infrastructural planning. The cluster design strategy of the infrastructure is basically the backbone of the solution; the key deciding elements for it are the application use cases and requirements, the workload, the resource computation (depending upon whether the computations are memory intensive or compute intensive), and security considerations.
Apart from compute, memory, and network utilization, another very important aspect to be considered is storage, which will be either cloud–based or on premises. In terms of the cloud, the option could be public, private, or hybrid, depending upon the considerations and requirements of the use case and the organization.
Hardware architecture
: A lot of the storage cost is driven by the volume of the data to be stored, the archival policy, and the longevity of the data. The decisive factors are as follows:
The computational needs of the implementations (whether the commodity components would suffice, or if the need is for high–performance GPUs).
What are the memory needs? Are they low, moderate, or high? This depends upon the in–memory computation needs of the application implementations.
Network architecture
: This may not sound important, but it is a significant driver in the big data computational space. The reason is that the key aspect of big data is distributed computation, and thus network utilization is much higher than it would be in a single–server, monolithic implementation. In distributed computation, loads of data and intermediate compute results travel over the network; the network bandwidth therefore becomes the throttling agent for the overall solution and a key deciding aspect in the selection of the infrastructure strategy. Bad design approaches sometimes lead to network chokes, where data spends less time in processing and more in shuttling across the network or waiting to be transferred over the wire for the next step in execution.
Security architecture
: Security is a very important aspect of any application space; in big data, it becomes all the more significant due to the volume and diversity of the data, and due to the network traversal of the data owing to the compute methodologies. The cloud computing and storage options add further complexity, making security a critical and strategic aspect of the big data infraspace.
One of the biggest truths about real–time analytics is that nothing is actually real–time; it's a myth. In reality, it's close to real time. Depending upon the performance and ability of a solution and the reduction of operational latencies, the analytics can get close to real time, but, while day by day we are bridging the gap between real time and near real time, it's practically impossible to eliminate the gap due to computational, operational, and network latencies.
Before we go further, let's have a quick overview of the high–level expectations from these so–called real–time analytics solutions. The following figure captures the high–level intercept of the same, where, in terms of data, we are looking for a system that can process millions of transactions with a variety of structured and unstructured data sets. My processing engine should be ultra–fast and capable of handling very complex, joined–up, and diverse business logic, and at the end it is also expected to generate astonishingly accurate reports, respond to my ad–hoc queries in a split second, and render my visualizations and dashboards with no latency:
As if the previous expectations from a real–time solution were not sufficient, one of the basic expectations for rolling it out to production, in today's data–generating and zero–downtime era, is that the system should be self–managed (or managed with minimal effort) and should be inherently built in a fault–tolerant and auto–recovering manner to handle most, if not all, scenarios. It should also be able to provide a familiar, basic SQL–like interface in a similar or close format.
However outrageously ridiculous the previous expectations sound, they are perfectly normal, minimalistic expectations from any big data solution today. Nevertheless, coming back to our topic of real–time analytics: now that we have touched briefly upon the system–level expectations in terms of data, processing, and output, systems are being devised and designed to process zillions of transactions and apply complex data science and machine learning algorithms on the fly, to compute results as close to real time as possible. The new terms being used are close to real time, near real time, or human real time. Let's dedicate a moment to the following figure, which captures the concept of computation time and the context and significance of the final insight:
As evident in the previous figure, in the context of time:
Ad–hoc queries over zettabytes of data take computation time of the order of hours and are thus typically described as batch. A noteworthy aspect of the previous figure is that the size of each circle is an analogy for the size of the data being processed.
Ad impressions/Hashtag trends/deterministic workflows/tweets
: These use cases are predominantly termed online, and the compute time is generally of the order of 500 milliseconds to 1 second. Although the compute time is considerably reduced compared to the previous use case, the volume of data being processed is also significantly smaller – a very rapidly arriving data stream of a few GBs in magnitude.
Financial tracking/mission critical applications
: Here, the data volume is low, the data arrival rate is extremely high, the processing demands are extremely stringent, and low–latency compute results are yielded in time windows of a few milliseconds.
Apart from the computation time, there are other significant differences between batch and real–time processing and solution design, as the following comparison and the short sketch after it show:
Batch processing | Real–time processing
Data is at rest | Data is in motion
Batch size is bounded | Data comes in as a stream and is essentially unbounded
Access to the entire data set | Access to data in the current transaction/sliding window
Data is processed in batches | Processing is done at the event, window, or at most micro–batch level
Efficient, easier administration | Real–time insights, but systems are fragile compared to batch
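To make this contrast concrete, the following is a minimal, hypothetical sketch against the Spark 2.x Java API; it counts words first over data at rest (a bounded RDD read from files) and then over data in motion (an unbounded DStream read from a socket). The file paths, host, and port are purely illustrative:
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class BatchVersusStream {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("batch-vs-stream").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Batch: the data set is bounded and fully available before processing starts
    JavaPairRDD<String, Integer> batchCounts = sc
        .textFile("hdfs:///data/historical/*.log")                  // data at rest
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);                                  // sees the entire data set
    batchCounts.saveAsTextFile("hdfs:///data/output/wordcount");

    // Streaming: the data is unbounded, so it is processed in small time windows
    JavaStreamingContext ssc = new JavaStreamingContext(sc, Durations.seconds(1));
    JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999); // data in motion
    JavaPairDStream<String, Integer> streamCounts = lines
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);                                  // counts per micro-batch only
    streamCounts.print();

    ssc.start();
    ssc.awaitTermination();                                          // runs until explicitly stopped
  }
}
The batch job terminates once its bounded input is exhausted, whereas the streaming job keeps emitting per-micro-batch counts until it is stopped.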
Towards the end of this section, all I would like to emphasize is that a near real–time (NRT) solution is as close to true real time as it is practically possible to attain. So, as said, RT is actually a myth (or hypothetical), while NRT is a reality. We deal with and see NRT applications on a daily basis, in connected vehicles, prediction and recommendation engines, healthcare, and wearable appliances.
There are some critical aspects that actually introduce latency into the total turnaround time, or TAT as we call it: the time lapse between the occurrence of an event and the generation of actionable insight from it.
The data/events generally travel from diverse geographical locations over the wire (internet/telecom channels) to the processing hub, and some time is lost in this activity.
Processing:
Data landing
: Due to security aspects, data generally lands on an edge node and is then ingested into the cluster
Data cleansing
: The data veracity aspect needs to be catered for, to eliminate bad/incorrect data before processing
Data massaging and enriching
: Binding and enriching transactional data with dimensional data
Actual processing
Storing the results
All previous aspects of processing incur:
CPU cycles
Disk I/O
Network I/O
Active marshalling and unmarshalling of data due to serialization aspects
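As a purely illustrative, hypothetical breakdown: if moving an event over the wire takes 200 ms, landing on the edge node takes 100 ms, cleansing takes 150 ms, massaging and enrichment take 200 ms, the actual processing takes 300 ms, and storing the result takes 50 ms, the total TAT already adds up to roughly 1 second, even though the actual processing accounts for less than a third of it.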
So, now that we understand the reality of real–time analytics, let's look a little deeper into the architectural segments of such solutions.
In this section, we will learn about the architectural patterns that make it possible to build a scalable, sustainable, and robust real–time solution.
A high–level NRT solution recipe looks very straightforward and simple, with a data collection funnel, a distributed processing engine, and a few other ingredients such as an in–memory cache, stable storage, and dashboard plugins.
At a high level, the basic analytics process can be segmented into three shards, which are depicted well in the previous figure and illustrated by the short sketch after this list:
Real–time data collection of the streaming data
Distributed high–performance computation on flowing data
Exploring and visualizing the generated insights in the form of query–able consumable layer/dashboards
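As a purely illustrative sketch of these three shards, the following minimal Java program stands in for the whole pipeline: it collects events from a hypothetical Kafka topic, performs a trivial computation (a running count per event value), and prints the insight that a serving layer or dashboard would normally expose. The broker address, group ID, and topic name are assumptions, and a real solution would use a distributed processing engine such as Storm or Spark in place of the single loop:
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NrtPipelineSketch {
  public static void main(String[] args) {
    // 1. Real-time data collection: subscribe to the stream of raw events
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "nrt-demo");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Collections.singletonList("sensor-events"));

    // 2. Computation on flowing data: a trivial stand-in for the distributed engine
    Map<String, Long> runningCounts = new HashMap<>();
    while (true) {
      ConsumerRecords<String, String> records = consumer.poll(100); // poll every 100 ms
      for (ConsumerRecord<String, String> record : records) {
        runningCounts.merge(record.value(), 1L, Long::sum);
      }
      // 3. Serving the insight: in a real solution this would feed a store or dashboard
      System.out.println("Insight so far: " + runningCounts);
    }
  }
}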
If we delve a level deeper, there are two contending proven streaming computation technologies on the market, which are Storm and Spark. In the coming section we will take a deeper look at a high–level NRT solution that's derived from these stacks.