Description

With the rise of Big Data, there is an increasing need to process large amounts of data continuously, with a shorter turnaround time. Real-time data processing involves continuous input, processing and output of data, with the condition that the time required for processing is as short as possible.
This book covers the majority of the existing and evolving open source technology stack for real-time processing and analytics. You will get to know about all the real-time solution aspects, from the source to the presentation to persistence. Through this practical book, you’ll be equipped with a clear understanding of how to solve challenges on your own.
We’ll cover topics such as how to set up components, basic executions, integrations, advanced use cases, alerts, and monitoring. You’ll be exposed to the popular tools used in real-time processing today such as Apache Spark, Apache Flink, and Storm. Finally, you will put your knowledge to practical use by implementing all of the techniques in the form of a practical, real-world use case.
By the end of this book, you will have a solid understanding of all the aspects of real-time data processing and analytics, and will know how to deploy the solutions in production environments in the best possible manner.




Practical Real-Time Data Processing and Analytics


Distributed Computing and Event Processing using Apache Spark, Flink, Storm, and Kafka


Shilpi Saxena
Saurabh Gupta


BIRMINGHAM - MUMBAI

Practical Real-Time Data Processing and Analytics

 

Copyright © 2017 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: September 2017

 

Production reference: 1250917

 

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78728-120-2

 

www.packtpub.com

Credits

Author: Shilpi Saxena, Saurabh Gupta

Copy Editor: Safis Editing

Reviewers: Ruben Oliva Ramos, Tomas Oliva, Prateek Bhati

Project Coordinator: Nidhi Joshi

Commissioning Editor: Amey Varangaonkar

Proofreader: Safis Editing

Acquisition Editor: Tushar Gupta

Indexer: Tejal Daruwale Soni

Content Development Editor: Mayur Pawanikar

Graphics: Tania Dutta

Technical Editor: Karan Thakkar

Production Coordinator: Arvindkumar Gupta

About the Authors

Shilpi Saxena is an IT professional and a technology evangelist. She is an engineer who has had exposure to various domains (the machine-to-machine space, healthcare, telecom, hiring, and manufacturing). She has experience in all aspects of the conception and execution of enterprise solutions. She has been architecting, managing, and delivering solutions in the big data space for the last 3 years; she also handles a high-performance and geographically-distributed team of elite engineers.

Shilpi has more than 12 years (3 years in the big data space) of experience in the development and execution of various facets of enterprise solutions, both in the product and services dimensions of the software industry. An engineer by degree and profession, she has worn varied hats, such as developer, technical leader, product owner, and tech manager, and she has seen all the flavors that the industry has to offer. She has architected and worked through some pioneering production implementations of big data on Storm and Impala, with auto-scaling in AWS.

Shilpi is also the author of Real-time Analytics with Storm and Cassandra, published by Packt Publishing.


Saurabh Gupta is a software engineer who has worked on all aspects of software requirements, design, execution, and delivery. He has more than 3 years of experience in the big data domain, designing and managing real-time as well as batch processing projects running in production, involving technologies such as Impala, Storm, NiFi, and Kafka, with deployment on AWS using Docker. Saurabh has also worked in product development and delivery.

Saurabh has a total of 10 years (3+ years in big data) of rich experience in the IT industry. He has exposure to various IoT use cases, including telecom, healthcare, smart cities, smart cars, and so on.

About the Reviewers

Ruben Oliva Ramos is a computer systems engineer from Tecnologico de Leon Institute, with a master's degree in computer and electronic systems engineering, teleinformatics, and networking specialization from the University of Salle Bajio in Leon, Guanajuato, Mexico. He has more than 5 years of experience in developing web applications to control and monitor devices connected with Arduino and Raspberry Pi, using web frameworks and cloud services to build Internet of Things applications.

He is a mechatronics teacher at the University of Salle Bajio and teaches students of the master's degree in design and engineering of mechatronics systems. Ruben also works at Centro de Bachillerato Tecnologico Industrial 225 in Leon, Guanajuato, Mexico, teaching subjects such as electronics, robotics and control, automation, and microcontrollers for the Mechatronics Technician career. He is a consultant and developer for projects in areas such as monitoring systems and datalogger data, using technologies such as Android, iOS, Windows Phone, HTML5, PHP, CSS, Ajax, JavaScript, Angular, and ASP.NET, databases such as SQLite, MongoDB, and MySQL, web servers such as Node.js and IIS, hardware programming on Arduino, Raspberry Pi, Ethernet Shield, GPS, GSM/GPRS, and ESP8266, and control and monitoring systems for data acquisition and programming.

He has authored the book Internet of Things Programming with JavaScript, published by Packt Publishing. He is also involved in monitoring, controlling, and the acquisition of data with Arduino and Visual Basic .NET for Alfaomega.

I would like to thank my savior and Lord, Jesus Christ, for giving me the strength and courage to pursue this project; my dearest wife, Mayte; our two lovely sons, Ruben and Dario; my dear father, Ruben; my dearest mom, Rosalia; my brother, Juan Tomas; and my sister, Rosalia, whom I love, for all their support while reviewing this book, for allowing me to pursue my dream, and for tolerating my not being with them after my busy day job.

Juan Tomás Oliva Ramos is an environmental engineer from the University of Guanajuato, Mexico, with a master's degree in administrative engineering and quality. He has more than 5 years of experience in the management and development of patents, technological innovation projects, and the development of technological solutions through the statistical control of processes. He has been a teacher of statistics, entrepreneurship, and the technological development of projects since 2011. He became an entrepreneur mentor and started a new department of technology management and entrepreneurship at Instituto Tecnologico Superior de Purisima del Rincon.

Juan is an Alfaomega reviewer and has worked on the book Wearable Designs for Smart Watches, Smart TVs and Android Mobile Devices.

He has developed prototypes through programming and automation technologies for the improvement of operations, which have been registered for patents.

I want to thank God for giving me the wisdom and humility to review this book. I thank Packt for giving me the opportunity to review this amazing book and to collaborate with a group of committed people. I want to thank my beautiful wife, Brenda; our two magic princesses, Regina and Renata; and our next member, Angel Tadeo; all of you give me the strength, happiness, and joy to start a new day. Thanks for being my family.

 

Prateek Bhati is currently working at Accenture. He has a total of 4 years of experience in real-time data processing and currently lives in New Delhi. He completed his graduation at Amity University.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787281205.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

Introducing Real-Time Analytics

What is big data?

Big data infrastructure

Real–time analytics – the myth and the reality

Near real–time solution – an architecture that works

NRT – The Storm solution

NRT – The Spark solution

Lambda architecture – analytics possibilities

IOT – thoughts and possibilities

Edge analytics

Cloud – considerations for NRT and IOT

Summary

Real Time Applications – The Basic Ingredients

The NRT system and its building blocks

Data collection

Stream processing

Analytical layer – serve it to the end user

NRT – high-level system view

NRT – technology view

Event producer

Collection

Broker

Transformation and processing

Storage

Summary

Understanding and Tailing Data Streams

Understanding data streams

Setting up infrastructure for data ingestion

Apache Kafka

Apache NiFi

Logstash

Fluentd

Flume

Tapping data from source to the processor - expectations and caveats

Comparing and choosing what works best for your use case

Do it yourself

Setting up Elasticsearch

Summary

Setting up the Infrastructure for Storm

Overview of Storm

Storm architecture and its components

Characteristics

Components

Stream grouping

Setting up and configuring Storm

Setting up Zookeeper

Installing

Configuring

Standalone

Cluster

Running

Setting up Apache Storm

Installing

Configuring

Running

Real-time processing job on Storm

Running job

Local

Cluster

Summary

Configuring Apache Spark and Flink

Setting up and a quick execution of Spark

Building from source

Downloading Spark

Running an example

Setting up and a quick execution of Flink

Build Flink source

Download Flink

Running example

Setting up and a quick execution of Apache Beam

Beam model

Running example

MinimalWordCount example walk through

Balancing in Apache Beam

Summary

Integrating Storm with a Data Source

RabbitMQ – messaging that works

RabbitMQ exchanges

Direct exchanges

Fanout exchanges

Topic exchanges

Headers exchanges

RabbitMQ setup

RabbitMQ — publish and subscribe

RabbitMQ – integration with Storm

AMQPSpout

PubNub data stream publisher

String together Storm-RMQ-PubNub sensor data topology

Summary

From Storm to Sink

Setting up and configuring Cassandra

Setting up Cassandra

Configuring Cassandra

Storm and Cassandra topology

Storm and IMDB integration for dimensional data

Integrating the presentation layer with Storm

Setting up Grafana with the Elasticsearch plugin

Downloading Grafana

Configuring Grafana

Installing the Elasticsearch plugin in Grafana

Running Grafana

Adding the Elasticsearch datasource in Grafana

Writing code

Executing code

Visualizing the output on Grafana

Do It Yourself

Summary

Storm Trident

State retention and the need for Trident

Transactional spout

Opaque transactional Spout

Basic Storm Trident topology

Trident internals

Trident operations

Functions

map and flatMap

peek

Filters

Windowing

Tumbling window

Sliding window

Aggregation

Aggregate

Partition aggregate

Persistence aggregate

Combiner aggregator

Reducer aggregator

Aggregator

Grouping

Merge and joins

DRPC

Do It Yourself

Summary

Working with Spark

Spark overview

Spark framework and schedulers

Distinct advantages of Spark

When to avoid using Spark

Spark – use cases

Spark architecture - working inside the engine

Spark pragmatic concepts

RDD – the name says it all

Spark 2.x – advent of data frames and datasets

Summary

Working with Spark Operations

Spark – packaging and API

RDD pragmatic exploration

Transformations

Actions

Shared variables – broadcast variables and accumulators

Broadcast variables

Accumulators

Summary

Spark Streaming

Spark Streaming concepts

Spark Streaming - introduction and architecture

Packaging structure of Spark Streaming

Spark Streaming APIs

Spark Streaming operations

Connecting Kafka to Spark Streaming

Summary

Working with Apache Flink

Flink architecture and execution engine

Flink basic components and processes

Integration of source stream to Flink

Integration with Apache Kafka

Example

Integration with RabbitMQ

Running example

Flink processing and computation

DataStream API

DataSet API

Flink persistence

Integration with Cassandra

Running example

FlinkCEP

Pattern API

Detecting pattern

Selecting from patterns

Example

Gelly

Gelly API

Graph representation

Graph creation

Graph transformations

DIY

Summary

Case Study

Introduction

Data modeling

Tools and frameworks

Setting up the infrastructure

Implementing the case study

Building the data simulator

Hazelcast loader

Building Storm topology

Parser bolt

Check distance and alert bolt

Generate alert Bolt

Elasticsearch Bolt

Complete Topology

Running the case study

Load Hazelcast

Generate Vehicle static value

Deploy topology

Start simulator

Visualization using Kibana

Summary

Preface

This book provides basic to advanced recipes for real-time computing. We will cover technologies such as Flink, Spark, and Storm. The book includes practical recipes to help you process unbounded streams of data, thus doing for real-time processing what Hadoop did for batch processing. You will begin by setting up the development environment and proceed to implement stream processing. This will be followed by recipes on solving real-time problems using RabbitMQ, Kafka, and NiFi, along with Storm, Spark, Flink, Beam, and more. By the end of this book, you will have gained a thorough understanding of the fundamentals of NRT and its applications, and will be able to identify and apply those fundamentals to any suitable problem.

This book is written in a cookbook style, with plenty of practical recipes, well-explained code examples, and relevant screenshots and diagrams.

What this book covers

Section – A:  Introduction – Getting Familiar

This section gives readers basic familiarity with the real-time analytics spectrum and its domains. We talk about the basic components and their building blocks. This section consists of the following chapters:

Chapter 1: Introducing Real-Time Analytics

Chapter 2: Real-Time Application – The Basic Ingredients

Section – B:  Setup and Infrastructure

This section is predominantly setup-oriented, where we set up the basic components. It consists of the following chapters:

Chapter 3: Understanding and Tailing Data Streams

Chapter 4: Setting Up the Infrastructure for Storm

Chapter 5: Configuring Apache Spark and Flink

Section – C: Storm Computations

This section predominantly focuses on exploring Storm, its compute capabilities, and its various features. It consists of the following chapters:

Chapter 6: Integration of Source with Storm

Chapter 7: From Storm to Sink

Chapter 8: Storm Trident

Section – D: Using Spark Compute In Real Time

This section predominantly focuses on exploring Spark, its compute capabilities, and its various features. It consists of the following chapters:

Chapter 9: Working with Spark

Chapter 10: Working with Spark Operations

Chapter 11: Spark Streaming

Section – E: Flink for Real-Time Analytics

This section focuses on exploring Flink, its compute capabilities, and its various features. It consists of the following chapter:

Chapter 12: Working with Apache Flink

Section – F: Let’s Have It Stringed Together

This section consists of the following chapter:

Chapter 13: Case Study 1

What you need for this book

The book is intended to graduate our readers into real-time streaming technologies. We expect readers to have fundamental knowledge of Java and Scala. In terms of setup, we expect readers to have a basic Maven, Java, and Eclipse setup in place to run the examples.

Who this book is for

If you are a Java developer who would like to be equipped with all the tools required to devise an end-to-end practical solution on real-time data streaming, then this book is for you. Basic knowledge of real-time processing will be helpful, and knowing the fundamentals of Maven, Shell, and Eclipse would be great.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, path names, dummy URLs, user input, and Twitter handles are shown as follows: "Once the kafka_2.11-0.10.1.1.tgz file is downloaded, extract the files."

A block of code is set as follows:

cp kafka_2.11-0.10.1.1.tgz /home/ubuntu/demo/kafka
cd /home/ubuntu/demo/kafka
tar -xvf kafka_2.11-0.10.1.1.tgz

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "In order to download new modules, we will go to Files | Settings | Project Name | Project Interpreter."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:

Log in or register to our website using your email address and password.

Hover the mouse pointer on the SUPPORT tab at the top.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box.

Select the book for which you're looking to download the code files.

Choose from the drop-down menu where you purchased this book from.

Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Real-time-Processing-and-Analytics. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Introducing Real-Time Analytics

This chapter sets the context for the reader by providing an overview of the big data technology landscape in general, and real–time analytics in particular. It provides a conceptual outline for the book, with an attempt to ignite the spark of inquisitiveness that will encourage readers to undertake the rest of the journey through the book.

The following topics will be covered:

What is big data?

Big data infrastructure

Real–time analytics – the myth and the reality

Near real–time solution – an architecture that works

Analytics – a plethora of possibilities

IOT – thoughts and possibilities

Cloud – considerations for NRT and IOT

What is big data?

Well, to begin with, in simple terms, big data helps us deal with three V's: volume, velocity, and variety. Recently, two more V's were added to it, making it a five–dimensional paradigm: veracity and value.

Volume: This dimension refers to the amount of data. Look around you: huge amounts of data are being generated every second – it may be the email you send, Twitter, Facebook, or other social media, or it can just be all the videos, pictures, SMS messages, call records, and data from varied devices and sensors. We have scaled up the data–measuring metrics to terabytes, zettabytes, and yottabytes – they are all humongous figures. Look at Facebook alone: around 10 billion messages are sent each day, consolidated across all users, with around 5 billion likes a day and around 400 million photographs uploaded each day. Data statistics in terms of volume are startling; all of the data generated from the beginning of time to 2008 is roughly equivalent to what we generate in a single day today, and I am sure soon it will be an hour. This volume alone dwarfs the ability of traditional databases to store and process data in reasonable and useful time frames, whereas a big data stack can store, process, and compute on amazingly large data sets in a cost–effective, distributed, and reliably efficient manner.

Velocity: This refers to the data generation speed, or the rate at which data is being generated. In today's world, where the volume of data has undergone a tremendous surge, this aspect is not lagging behind. We have loads of data because we are able to generate it so fast. Look at social media: things are circulated in seconds and become viral, and the insight from social media is analysed in milliseconds by stock traders, which can trigger lots of buying or selling activity. At a point–of–sale counter, a credit card swipe takes only a few seconds, and within that time, fraudulent transaction checks, payment, bookkeeping, and acknowledgement are all completed. Big data gives us the power to analyse data at tremendous speed.

Variety: This dimension tackles the fact that data can be unstructured. In the traditional database world, and even before that, we were used to a very structured form of data that fitted neatly into tables. Today, more than 80% of data is unstructured – quotable examples are photos, video clips, social media updates, data from a variety of sensors, voice recordings, and chat conversations. Big data lets you store and process this unstructured data in a very structured manner; in fact, it effaces the variety.

Veracity: This is all about the validity and correctness of data. How accurate and usable is the data? Not everything out of millions and zillions of data records is correct, accurate, and referable. That's what veracity actually is: how trustworthy the data is, and what the quality of the data is. Examples of data with veracity challenges include Facebook and Twitter posts with nonstandard acronyms or typos. Big data has brought the ability to run analytics on this kind of data to the table. One of the strong reasons for the volume of data is its variable veracity.

Value: This is what the name suggests: the value that the data actually holds. It is unarguably the most important V, or dimension, of big data. The only motivation for turning to big data to process super large data sets is to derive some valuable insight from it. In the end, it's all about costs and benefits.

Big data is a much talked–about technology across business and the technical world today. There are myriad domains and industries that are convinced of its usefulness, but the implementation focus is primarily application-oriented, rather than infrastructure-oriented. The next section predominantly walks you through the infrastructure aspect.

Big data infrastructure

Before delving further into big data infrastructure, let's have a look at the big data high–level landscape.

The following figure captures high–level segments that demarcate the big data space:

It clearly depicts the various segments and verticals within the big data technology canvas (bottom up).

The key is the bottom layer, which holds the data in a scalable and distributed mode:

Technologies: Hadoop, MapReduce, Mahout, HBase, Cassandra, and so on

Then, the next level is the infrastructure framework layer, which enables developers to choose from myriad infrastructural offerings, depending upon the use case and its solution design:

Analytical Infrastructure: EMC, Netezza, Vertica, Cloudera, Hortonworks

Operational Infrastructure: Couchbase, Teradata, Informatica, and many more

Infrastructure as a Service (IaaS): AWS, Google Cloud, and many more

Structured Databases: Oracle, SQL Server, MySQL, Sybase, and many more

The next level specializes in catering to very specific needs in terms of:

Data as a Service (DaaS): Kaggle, Azure, Factual, and many more

Business Intelligence (BI): QlikView, Cognos, SAP BO, and many more

Analytics and Visualizations: Pentaho, Tableau, TIBCO, and many more

Today, we see traditional, robust RDBMSs struggling to survive in a cost–effective manner as tools for data storage and processing. Scaling traditional RDBMSs to the compute power expected to process huge amounts of data with low latency came at a very high price. This led to the emergence of new technologies that were low cost, low latency, and highly scalable, often open source. To our rescue came the yellow elephant, Hadoop, which took the data storage and computation arena by surprise. It is designed and developed as a distributed framework for data storage and computation on commodity hardware in a highly reliable and scalable manner. The key computational methodology Hadoop works on involves distributing the data in chunks over all the nodes in a cluster, and then processing the data concurrently on all the nodes.
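
To make this distribute–then–aggregate model concrete, here is the classic word count job sketched against Hadoop's MapReduce API. This is a minimal, illustrative sketch rather than an example from this book; the class names (WordCount, WordCountMapper, WordCountReducer) are our own, while the Hadoop types and calls are the standard ones:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Runs on every node, against the locally stored chunk (block) of the input
  public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // emit (word, 1) for every token seen
      }
    }
  }

  // Receives all the counts for a given word, whichever node emitted them
  public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class); // local pre-aggregation per node
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework splits the input into blocks, schedules each mapper close to where its block is stored, and shuffles the intermediate (word, count) pairs to the reducers: exactly the distribute–and–process–concurrently pattern described above.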

Now that you are acquainted with the basics of big data and the key segments of the big data technology landscape, let's take a deeper look at the big data concept with the Hadoop framework as an example. Then, we will move on to look at the architecture and methods of implementing a Hadoop cluster; this will be a close analogy to the high–level infrastructure and typical storage requirements of a big data cluster. One of the key and critical aspects that we will delve into is information security in the context of big data.

A couple of key aspects that drive and dictate the move to big data infraspace are highlighted in the following figure:

Cluster design: This is the most significant and deciding aspect of infrastructural planning. The cluster design strategy of the infrastructure is basically the backbone of the solution; the key deciding elements are the application use cases and requirements, the workload, resource computation (depending upon memory–intensive and compute–intensive computations), and security considerations.

Apart from compute, memory, and network utilization, another very important aspect to be considered is storage, which will be either cloud–based or on premises. In terms of the cloud, the option could be public, private, or hybrid, depending upon the considerations and requirements of the use case and the organization.

Hardware architecture: A lot of the storage cost is driven by the volume of the data to be stored, the archival policy, and the longevity of the data. The decisive factors are as follows:

The computational needs of the implementations (whether the commodity components would suffice, or if the need is for high–performance GPUs).

What are the memory needs? Are they low, moderate, or high? This depends upon the in–memory computation needs of the application implementations.

Network architecture: This may not sound important, but it is a significant driver in the big data computational space. The reason is that the key aspect of big data is distributed computation, and thus network utilization is much higher than it would be in the case of a single–server, monolithic implementation. In distributed computation, loads of data and intermediate compute results travel over the network; thus, network bandwidth becomes the throttling agent for the overall solution and a key deciding aspect in the selection of the infrastructure strategy. Bad design approaches sometimes lead to network chokes, where data spends less time in processing and more in shuttling across the network or waiting to be transferred over the wire for the next step in execution.

Security architecture: Security is a very important aspect of any application space; in big data, it becomes all the more significant due to the volume and diversity of the data, and due to the network traversal of the data owing to the compute methodologies. The cloud computing and storage options add further complexity, making security a critical and strategic aspect of the big data infraspace.

Real–time analytics – the myth and the reality

One of the biggest truths about real–time analytics is that nothing is actually real–time; it's a myth. In reality, it's close to real–time. Depending upon the performance and ability of a solution and the reduction of operational latencies, analytics can come close to real–time, but, while day by day we are bridging the gap between real–time and near real–time, it is practically impossible to eliminate the gap entirely, due to computational, operational, and network latencies.

Before we go further, let's have a quick overview of the high–level expectations of these so–called real–time analytics solutions. The following figure captures the high–level intercept of the same: in terms of data, we are looking for a system that can process millions of transactions over a variety of structured and unstructured data sets. The processing engine should be ultra–fast and capable of handling very complex, joined–up, and diverse business logic, and, at the end, it is also expected to generate astonishingly accurate reports, answer ad–hoc queries in a split second, and render visualizations and dashboards with no latency:

As if the previous expectations were not sufficient, to have such solutions rolling out to production, one of the basic expectations in today's data–generating, zero–downtime era is that the system should be self–managed, or managed with minimal effort, and inherently built to be fault tolerant and auto–recovering, handling most, if not all, scenarios. It should also be able to provide a familiar, basic SQL–like interface in a similar or close format.

However outrageously ridiculous the previous expectations sound, they are perfectly normal, minimal expectations of any big data solution today. Nevertheless, coming back to our topic of real–time analytics, now that we have briefly touched upon the system–level expectations in terms of data, processing, and output, systems are being devised and designed to process zillions of transactions and apply complex data science and machine learning algorithms on the fly, to compute results as close to real time as possible. The new terms being used are close to real–time, near real–time, or human real–time. Let's dedicate a moment to the following figure, which captures the concept of computation time and the context and significance of the final insight:

As evident in the previous figure, in the context of time:

Ad–hoc queries over zettabytes of data take computation time in the order of hour(s) and are thus typically described as batch. A noteworthy aspect of the previous figure is that the size of each circle is an analogy for the size of the data being processed, captured in diagrammatic form.

Ad impressions/hashtag trends/deterministic workflows/tweets: These use cases are predominantly termed online, and the compute time is generally in the order of 500 ms to 1 second. Though the compute time is considerably reduced compared to the previous use case, the volume of data being processed is also significantly reduced: a very rapidly arriving data stream of a few GBs in magnitude.

Financial tracking/mission–critical applications: Here, the data volume is low, the data arrival rate is extremely high, the processing demand is extremely high, and low–latency compute results are yielded in time windows of a few milliseconds.

Apart from the computation time, there are other significant differences between batch and real–time processing and solution design:

Batch processing                     | Real–time processing
-------------------------------------+--------------------------------------------------------------
Data is at rest                      | Data is in motion
Batch size is bounded                | Data is essentially coming in as a stream and is un–bounded
Access to entire data                | Access to data in the current transaction/sliding window
Data processed in batches            | Processing is done at the event, window, or at most micro–batch level
Efficient, easier administration     | Real–time insights, but systems are fragile as compared to batch

 

Towards the end of this section, all I would like to emphasize is that a near real–time (NRT) solution is as close to true real–time as it is practically possible to attain. So, as said, RT is actually a myth (or hypothetical), while NRT is a reality. We deal with and see NRT applications on a daily basis in the form of connected vehicles, prediction and recommendation engines, healthcare, and wearable appliances.

There are some critical aspects that actually introduce latency into the total turnaround time, or TAT as we call it: the time lapse between the occurrence of an event and the generation of actionable insight from it. The contributors are listed below, with a simple additive view of them after the list.

The data/events generally travel from diverse geographical locations over the wire (internet/telecom channels) to the processing hub. Some time elapses in this activity.

Processing:

Data landing: Due to security aspects, data generally lands on an edge node and is then ingested into the cluster

Data cleansing: The data veracity aspect needs to be catered for, to eliminate bad/incorrect data before processing

Data massaging and enriching: Binding and enriching transactional data with dimensional data

Actual processing

Storing the results

All previous aspects of processing incur:

CPU cycles

Disk I/O

Network I/O

Active marshalling and unmarshalling of data (serialization aspects)
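
As a rough mental model (our own shorthand, not a formal metric), the total turnaround time is approximately the sum of the stage latencies listed above:

TAT ≈ t(transport) + t(landing) + t(cleansing) + t(enrichment) + t(processing) + t(storage)

Every CPU cycle, disk seek, network hop, and (de)serialization adds to one of these terms, which is why a true zero–latency solution is unattainable and NRT is the practical goal.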

So, now that we understand the reality of real–time analytics, let's look a little deeper into the architectural segments of such solutions.

Near real–time solution – an architecture that works

In this section, we will learn about the architectural patterns that make it possible to build a scalable, sustainable, and robust real–time solution.

A high–level NRT solution recipe looks very straightforward and simple, with a data collection funnel, a distributed processing engine, and a few other ingredients, such as an in–memory cache, stable storage, and dashboard plugins.

At a high level, the basic analytics process can be segmented into three shards, which are depicted well in the previous figure (a minimal code sketch of this pipeline shape follows the list):

Real–time data collection of the streaming data

Distributed high–performance computation on flowing data

Exploring and visualizing the generated insights in the form of query–able consumable layer/dashboards
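
To make the three shards tangible, here is a minimal, illustrative Storm topology skeleton in Java (Storm 1.x API). The spout and bolt, SensorSpout and ThresholdAlertBolt, are hypothetical placeholders of our own, standing in for the collection funnel and the distributed computation; a real solution would tail a broker such as Kafka or RabbitMQ and end in a sink bolt feeding the dashboard layer:

import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class NrtPipelineSketch {

  // Shard 1 – real–time collection: a real spout would tail Kafka/RabbitMQ;
  // this placeholder just emits a random sensor reading
  public static class SensorSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    @Override
    public void nextTuple() {
      Utils.sleep(100); // throttle the dummy source
      collector.emit(new Values("sensor-1", Math.random() * 100));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("sensorId", "reading"));
    }
  }

  // Shard 2 – distributed computation on the flowing data
  public static class ThresholdAlertBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
      double reading = input.getDoubleByField("reading");
      if (reading > 90.0) { // stand-in for real business logic
        collector.emit(new Values(input.getStringByField("sensorId"), reading));
      }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("sensorId", "alertReading"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sensor-spout", new SensorSpout(), 2);
    builder.setBolt("alert-bolt", new ThresholdAlertBolt(), 4)
           .shuffleGrouping("sensor-spout");
    // Shard 3 – a further bolt would write alerts to a store feeding the dashboards

    new LocalCluster().submitTopology("nrt-sketch", new Config(), builder.createTopology());
  }
}

Here, the spout plays the role of the real–time collection funnel, the bolt performs the distributed computation on the flowing data, and a further sink bolt (omitted here) would persist the insights for the query–able/dashboard layer.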

If we delve a level deeper, there are two contending, proven streaming computation technologies on the market: Storm and Spark. In the coming section, we will take a deeper look at a high–level NRT solution derived from these stacks.