Architect and design data-intensive applications and, in the process, learn how to collect, process, store, govern, and expose data for a variety of use cases
Key Features
Integrate the data-intensive approach into your application architecture
Create a robust application layout with effective messaging and data querying architecture
Enable smooth data flow and make the data of your application intensive and fast
Book Description
Are you an architect or developer who looks warily at your own applications while browsing Facebook, silently applauding its data-intensive yet fluent and efficient behaviour? This book is your gateway to building smart data-intensive systems by incorporating the core data-intensive architectural principles, patterns, and techniques directly into your application architecture.
This book starts by taking you through the primary design challenges involved with architecting data-intensive applications. You will learn how to implement data curation and data dissemination, depending on the volume of your data. You will then implement your application architecture one step at a time. You will get to grips with implementing the correct message delivery protocols and creating a data layer that doesn’t fail when running high traffic. This book will show you how you can divide your application into layers, each of which adheres to the single responsibility principle. By the end of this book, you will learn to streamline your thoughts and make the right choice in terms of technologies and architectural principles based on the problem at hand.
What you will learn
Understand how to envision a data-intensive system
Identify and compare the non-functional requirements of a data collection component
Understand patterns involving data processing, as well as technologies that help to speed up the development of data processing systems
Understand how to implement Data Governance policies at design time using various Open Source Tools
Recognize the anti-patterns to avoid while designing a data store for applications
Understand the different data dissemination technologies available to query the data in an efficient manner
Implement a simple data governance policy that can be extended using Apache Falcon
Who this book is for
This book is for developers and data architects who have to code, test, deploy, and/or maintain large-scale, high data volume applications. It is also useful for system architects who need to understand various non-functional aspects revolving around Data Intensive Systems.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amarabha Banerjee
Acquisition Editor: Nigel Fernandes
Content Development Editor: Roshan Kumar
Technical Editor: Diksha Wakode
Copy Editor: Safis Editing
Project Coordinator: Hardik Bhinde
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Jason Monteiro
Production Coordinator: Arvindkumar Gupta
First published: July 2018
Production reference: 1310718
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78646-509-2
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Anuj Kumar is a senior enterprise architect with FireEye, a cyber security service provider, where he is involved in the architecture, strategy, and design of various systems that deal with huge amounts of data on a regular basis. Anuj has more than 15 years of professional IT industry experience spanning development, design, architecture, management, and strategy. He is an active member of the OASIS Technical Committee on the STIX/TAXII specification. He is a firm believer in agile methodology, modular/staged event-driven architecture, the API-first approach, and continuous integration/deployment/delivery. Anuj is also the author of Easy Test Framework, a data-driven testing framework used by more than 50 companies.
Anindita Basak is a cloud solution architect specializing in data analytics and AI platforms. She has worked with Microsoft Azure since its inception, including with Microsoft teams as an FTE in the roles of Azure development support engineer, pro-direct delivery manager, and technical consultant. She coauthored Stream Analytics with Microsoft Azure and was a technical reviewer for five Packt books on Azure HDInsight, SQL Server Business Intelligence, Hadoop Development, and Smart Learning with Internet of Things and Decision Science. She has also authored two video courses on Azure Stream Analytics for Packt.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Architecting Data-Intensive Applications
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the author
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Get in touch
Reviews
Exploring the Data Ecosystem
What is a data ecosystem?
A complex set of interconnected data
Data environment
What constitutes a data ecosystem?
Data sharing
Traffic light protocol
Information exchange policy
Handling policy statements
Action policy statements
Sharing policy statements
Licensing policy statements
Metadata policy statements
The 3 V's
Volume
Variety
Velocity
Use cases
Use case 1 – Security
Use case 2 – Modem data collection
Summary
Defining a Reference Architecture for Data-Intensive Systems
What is a reference architecture?
Problem statement
Reference architecture for a data-intensive system
Component view
Data ingest
Data preparation
Data processing
Workflow management
Data access
Data insight
Data governance
Data pipeline
Oracle's information management conceptual reference architecture
Conceptual view
Oracle's information management reference architecture
Data process view
Reference architecture – business view
Real-life use case examples
Machine learning use case 
Data enrichment use case
Extract transform load use case
Desired properties of a data-intensive system
Defining architectural principles
Principle 1
Principle 2
Principle 3
Principle 4
Principle 5
Principle 6
Principle 7
Listing architectural assumptions
Architectural capabilities
UI capabilities
Content mashup
Multi-channel support
User workflow
AR/VR support
Service gateway/API gateway capabilities
Security
Traffic control
Mediation
Caching
Routing
Service orchestration
Business service capabilities
Microservices
Messaging
Distributed (batch/stream) processing
Data capabilities
Data partitioning
Data replication
Summary
Patterns of the Data Intensive Architecture
Application styles
API Platform
Message-oriented application style
Micro Services application styles
Communication styles
Combining different application styles
Architectural patterns
The retry pattern
The circuit breaker
Throttling
Bulk heads
Event-sourcing
Command and Query Responsibility Segregation
Summary
Discussing Data-Centric Architectures
Coordination service
Reliable messaging
Distributed processing
Distributed storage
Lambda architecture
Kappa architecture
A brief comparison of different leading NoSQL data stores
Summary
Understanding Data Collection and Normalization Requirements and Techniques
Data lineage
Apache Atlas
Apache Atlas high-level architecture
Apache Falcon
Data quality
Types of data sources
Data collection system requirements
Data collection system architecture principles
High-level component architecture
High-level architecture
Service gateway
Discovery server
Architecture technology mapping
An introduction to ETCD
Scheduler
Designing the Micro Service
Summary
Creating a Data Pipeline for Consistent Data Collection, Processing, and Dissemination
Query-Data pipelines
Event-Data Pipelines
Topology 1
Topology 2
Topology 3
Resilience
High-availability
Availability Chart
Clustering
Clustering and Network Partitions
Mirrored queues
Persistent Messages
Data Manipulation and Security
Use Case 1
Use Case 2
Exchanges
Guidelines on choosing the right Exchange Type
Headers versus Topic Exchanges
Routing
Header-Based Content Routing
Topic-Based Content Routing
Alternate Exchanges
Dead-Letter Exchanges
Summary
Building a Robust and Fault-Tolerant Data Collection System
Apache Flume
Flume event flow reliability
Flume multi-agent flow
Flow multiplexer
Apache Sqoop
ELK
Beats
Load-balancing
Logstash
Back pressure
High-availability
Centralized collection of distributed data
Apache Nifi
Summary
Challenges of Data Processing
Making sense of the data
What is data processing?
The 3 + 1 Vs and how they affect choice in data processing design
Cost associated with latency
Classic way of doing things
Sharing resources among processing applications
How to perform the processing
Where to perform the processing
Quality of data
Networks are everywhere
Effective consumption of the data
Summary
Let Us Process Data in Batches
What do we mean by batch processing
Lambda architecture and batch processing
Batch layer components and subcomponents
Read/extract component
Normalizer component
Validation component
Processing component
Writer/formatter component
Basic shell component
Scheduler/executor component
Processing strategy
Data partitioning
Range-based partitioning
Hash-based partitioning
Distributed processing
What are Hadoop and HDFS
NameNode
DataNode
MapReduce
Data pipeline
Luigi
Azkaban
Oozie
AirFlow
Summary
Handling Streams of Data
What is a streaming system?
Capabilities (and non-capabilities) of a streaming application
Lambda architecture's speed layer
Computing real time views
High-level reference architecture
Samza architecture
Architectural concepts
Event-streaming layer
Apache Kafka as an event bus
Message persistence
Persistent Queue Design
Message batch
Kafka and the sendfile operation
Compression
Kafka streams
Stream processing topology
Notion of time in stream processing
Samza's stream processing API
The scheduler/executor component of the streaming architecture
Processing concepts and tradeoffs
Processing guarantees
Micro-batch stream processing
Windowing
Types of windows
Summary
References
Let Us Store the Data
The data explosion problem
Relational Database Management Systems and Big data
Introducing Hadoop, the Big Elephant
Apache YARN
Hadoop Distributed Filesystem
HDFS architecture principles (and assumptions)
High-level architecture of HDFS
HDFS file formats
HBase
Understanding the basics of HBase
HBase data model
HBase architecture
Horizontal scaling with automatic sharding of HBase tables
HMaster, region assignment, and balancing
Components of Apache HBase architecture
Tips for improved performance from your HBase cluster
Graph stores
Background of the use case
Scenario
Solution discussion
Bank fraud data model (as can be designed in a property graph data store such as Neo4J)
Semantic graph
Linked data
Vocabularies
Semantic Query Language
Inference
Stardog
GraphQL queries
Gremlin
Virtual Graphs – a Unifying DAO
Structured data
CSV
BITES – Unstructured/Semistructured document store
Structured data extraction
Text extraction
Document queries
Highly-available clusters
Guarantees
Scaling up
Integration with SPARQL
Data Formats
Data integrity and validating constraints
Strict parsing of RDF
Integrity Constraint Validation
Monitoring and operation
Performance
Summary
Further reading
When Data Dissemination is as Important as Data Itself
Data dissemination
Communication protocol
Target audience
Use case
Response schema
Communication channel
Data dissemination architecture in a threat intel sharing system
Threat intel share – backend
RT query processor
View builder
Threat intel share – frontend
AWS Lambda
AWS API gateway
Cache population
Cache eviction
Discussing the non-functional aspects of the preceding architecture
Non-functional use cases for dissemination architecture
Elastic search and free text search queries
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Architecting Data-Intensive Applications is all about exploring the principles, capabilities, and patterns of a system that is architected and designed to handle a variety of workloads, such as reading, processing, writing, and analyzing data from a variety of sources that emit different volumes of data at a consistent pace. This book educates its readers about the various aspects and pitfalls to avoid, and presents use cases that point to the need for a system capable of handling large amounts of data.
It deliberately avoids comparisons with Big Data systems. The reason is that, in the author's opinion, the phrase "Big Data" is already quite overloaded. How "big" is really "big" depends on the context in which the application is being built. Something that is "big" for an organization with three employees handling the Twitter feeds of 10,000 users may not be "big" for Twitter, which handles millions of feeds every day. Therefore, this book avoids any mention of, or comparison with, Big Data terminology.
Readers will find this book to be both a technical guide and a go-to reference for understanding the various aspects of dealing with data, such as data collection, data processing, data dissemination, and data governance. The book also contains example code in various places, written mostly in Java. Care has been taken to keep the examples simple and easy to understand, with sufficient description; therefore, a working knowledge of Java is not mandatory, although it will speed up the process of grasping the concepts. Knowledge of object-oriented programming is, however, essential.
This book is for developers and data architects who have to code, test, deploy, and/or maintain large-scale, high data volume applications. It is also useful for system architects who need to understand various non-functional aspects revolving around Data Intensive Systems.
Chapter 1, Exploring the Data Ecosystem, introduces the data ecosystem and helps us understand its characteristics. You will take a look at the 3 V's of the data ecosystem and discuss some data and information sharing standards and frameworks.
Chapter 2, Defining a Reference Architecture for Data-Intensive Systems, will give you an insight into a reference architecture for a data-intensive system and then provide a variety of possible implementations of that architecture in different scenarios. You will also take a look at the architectural principles and capabilities involved.
Chapter 3, Patterns of the Data Intensive Architecture, will focus on various architectural patterns and discuss application and communication styles in detail. You will learn how to combine different application styles and dive deep into various architectural patterns, enabling you to understand the why as well as the how of a data-centric architecture.
Chapter 4, Discussing Data-Centric Architectures, will discuss the various reference architectures for a data-intensive system. It will also look at the functional components that form the foundation of a distributed system and explain why the Lambda architecture is so popular with distributed systems. It will also provide an insight into the Kappa architecture, a simplified version of the Lambda architecture.
Chapter 5, Understanding Data Collection and Normalization Requirements and Techniques, will provide an in-depth design of a data collection system built from scratch, along with its requirements and the techniques involved.
Chapter 6, Creating a Data Pipeline for Consistent Data Collection, Processing, and Dissemination, will help you learn how to create a scalable and highly available architecture for designing and implementing a data pipeline within your overall architecture. It will also delve deeper into the different considerations involved in designing the data pipeline and take a look at various design patterns that will help you create a resilient data pipeline.
Chapter 7, Building a Robust and Fault-Tolerant Data Collection System, will focus on the data collection systems available in the open source community, including NiFi, a highly scalable and user-friendly system for defining data flows. It will also cover Sqoop, which addresses a very specific use case: transferring data between HDFS and relational systems.
Chapter 8, Challenges of Data Processing, will act as a backbone for the chapters that follow. It discusses the various challenges an architect can face while creating a data processing system within their organization. You will learn how to enable large-scale processing of data while keeping overall system costs low, and how to keep the overall processing time within the defined SLA as the load on the processing system increases. You will also learn how to consume the processed data effectively.
Chapter 9, Let Us Process Data in Batches, will explore the creation of a batch processing system and the criteria necessary for designing one. It will also discuss the Lambda architecture and its batch processing layer. You will then learn how distributed processing works and why Hadoop and MapReduce are the go-to technologies for implementing a batch processing system.
Chapter 10, Handling Streams of Data, will explore the concepts and capabilities of a streaming application and its association with the Lambda architecture. It also discusses the various subcomponents of a stream-based system, examines the design considerations involved in building a stream-based application, and walks through the different components of a stream-based system in action.
Chapter 11, Let Us Store the Data, will help you understand how to store a huge dataset. It discusses HDFS and its storage formats, covers HBase, a columnar data store, and takes a look at graph databases.
Chapter 12, When Data Dissemination is as Important as Data Itself, will explore how you can disseminate your data efficiently using indexing technologies and caching techniques. It will also take a look at data governance and teach you how to design a dissemination architecture.
To get the most out of this book, you should have a good understanding of object-oriented programming. A working knowledge of Java will also help you follow the example code more quickly, but it is not mandatory, since the examples are kept simple and are accompanied by sufficient description.
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
Until a few years ago, a successful organization was one that had access to superior technology, which, in most cases, was either proprietary to the organization or was acquired at great expense. These technologies enabled organizations to define complex business process flows, addressing specific use cases that helped them to generate revenue. In short, technology drove the business. Data did not constitute any part of the decision-making process. With such an approach, organizations could only utilize a part of their data. This resulted in lost opportunities and, in some cases, unsatisfied customers. One of the reasons for these missed opportunities was the fact that there was no reliable and economical way to store such huge quantities of data, especially when organizations didn't know how to make business out of it. Hardware costs were a prohibitive factor.
Things started to change a few years ago when Google published its white paper on GFS (https://static.googleusercontent.com/media/research.google.com/en/archive/gfs-sosp2003.pdf), which was picked up by Doug Cutting, who created Apache Hadoop, an open source framework whose distributed filesystem is capable of storing large volumes of data on commodity hardware.
Suddenly, organizations, both big and small, realized its potential and started storing any piece of data in Hadoop that had the potential to turn itself into a source of revenue later. The industry coined a term for such a huge, raw store of data, calling it a data lake.
Wikipedia defines a data lake, in essence, as a repository of data stored in its natural or raw format. In short, a data lake is a collection of various pieces of data that may, or may not, be important to an organization. The key phrase here is natural format. What this means is that the data that is collected is seldom processed prior to being stored. The reasoning behind this is that any processing may potentially lead to a loss of information which, in turn, may have an effect in terms of generating new sources of revenue for the organization. This does not mean that we do not process the data while in flight. What it does mean is that at least one copy of the data is stored in the manner in which it was received from the external system.
But how do organizations fill this data lake and what data should be stored there? The answer to this question lies in understanding the data ecosystem that exists today. Understanding where the data originates from, and which data to persist, helps organizations to become data-driven instead of process-driven. This ability helps organizations to not only explore new business opportunities, but also helps them to react more quickly in the face of an ever-changing business landscape.
In this introductory chapter, we will:
Try to understand what we mean by a data ecosystem
Try to understand the characteristics of a data ecosystem, in other words, what constitutes a data ecosystem
Talk about some data and information sharing standards and frameworks, such as the traffic light protocol and the information exchange policy framework
Continue our exploration of the 3V's of the data ecosystem
Conclude with a couple of use cases to prove our point
I hope that this chapter will pique your interest and get you excited about exploring the rest of the book.
An ecosystem is defined as a complex set of relationships between interconnected elements and their environments. For example, the social construct around our daily lives is an ecosystem. We depend on the state to provide us with basic necessities, including food, water, and gas. We rely on our local stores for our daily needs, and so on. Our livelihood is directly or indirectly dependent upon the social construct of our society. The inter-dependency, as well as the inter-connectivity of these social elements, is what defines a society.
Along the same lines, a data ecosystem can be defined as a complex set of possibly interconnected data and the environment from which that data originates. Data from social websites, such as Twitter, Facebook, and Instagram; data from connected devices, such as sensors; data from the (Industrial) Internet of Things; SCADA systems; data from your phone; and data from your home router, all constitute a data ecosystem to some extent. As we will see in the following sections, this huge variety of data, when connected, can be really useful in providing insights into previously undiscovered business opportunities.
What this section implies is that data can be a collection of structured, semi-structured, or unstructured data (hence, a complex set). Additionally, data collected from different sources may relate to one another, in some form or other. To put it in perspective, let's look at a very simple use case, where data from different sources can be connected. Imagine you have an online shopping website and you would like to recommend to your visitors the things that they would most probably want to buy. For the recommendation to succeed, you may need a lot of relevant information about the person. You may want to know what a person likes/dislikes, what they have been searching for in the last few days, what they have been tweeting about, and what topics they are discussing in public forums. All these constitute different sources of data and, even though, at first glance, it may appear that the data from individual sources is not connected, the reality is that all the data pertains to one individual, and their likes and dislikes. Establishing such connections in different data sources is key for an organization when it comes to quickly turning an idea into a business opportunity.
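As a minimal sketch of this idea (the class name, user ID, and interest lists below are invented purely for illustration), the following Java snippet merges interest signals about the same person from several hypothetical sources into a single profile that a recommendation engine could reason over:

import java.util.*;

public class InterestProfileBuilder {
    public static void main(String[] args) {
        // Hypothetical interest signals about the same user, each from a different data source.
        Map<String, List<String>> searchHistory = new HashMap<>();
        searchHistory.put("user-42", Arrays.asList("running shoes", "gps watch"));

        Map<String, List<String>> tweets = new HashMap<>();
        tweets.put("user-42", Arrays.asList("marathon training"));

        Map<String, List<String>> forumTopics = new HashMap<>();
        forumTopics.put("user-42", Arrays.asList("injury prevention"));

        // Connect the sources: every signal ultimately pertains to one individual.
        Map<String, Set<String>> profile = new HashMap<>();
        for (Map<String, List<String>> source : Arrays.asList(searchHistory, tweets, forumTopics)) {
            for (Map.Entry<String, List<String>> entry : source.entrySet()) {
                profile.computeIfAbsent(entry.getKey(), k -> new LinkedHashSet<>())
                       .addAll(entry.getValue());
            }
        }

        // The combined profile is what a recommendation engine would reason over.
        System.out.println(profile.get("user-42"));
    }
}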
The environment in which the data originates is as important as the data itself. The environment provides us with the contextual information to attach to the data, which may further help us in making the correct decision. Having contextual information helps us to understand the relevancy as well as the reliability of the data source, which ultimately feeds into the decision-making process. The environment also tells us about the data lineage (to be discussed in detail in Chapter 12, When Data Dissemination Is as Important as Data Itself), which helps us to understand whether the data has been modified during its journey or not and, if it has, how it affects our use case.
Each organization has its own set of data sources that constitute their specific data ecosystem. Remember that one organization's data sources may not be the same as another organization's.
The data evangelist within the organization should always focus on identifying which sources of data are more relevant than others for a given set of use cases that the organization is trying to resolve.
This feeds into our next topic, what constitutes a data ecosystem?
Nowadays, data comes from a variety of sources, at varying speeds, and in a number of different formats. Understanding data and its relevance is the most important task for any data-driven organization.
To understand the importance of data, the data czars in an organization should look at all possible sources of data that may be important to them. Being far-sighted helps, although, given the pace of modern society, it is almost impossible to gather data from every relevant source. Hence, it is important that the person/people involved in identifying relevant data sources are also well aware of the business landscape in which they operate. This knowledge will help tremendously in averting problems later. Data source identifiers should also be aware that data can be sourced both inside and outside of an organization, since, at the broadest level, data is first classified as being either internal or external data.
Given the opportunity, internal data should first be converted into information. Handling internal data first helps the organization to understand its importance early in the life cycle, without needing to set up a complex system, thereby making the process agile. In addition, it also gives the technical team an opportunity to understand what technology and architecture would be most appropriate in their situation. Such a distinction also helps organizations to not only put a reliability rating on data, but also to define any security rules in connection with the data.
So, what are the different sources of data that an organization can utilize to its benefit? The following diagram depicts part of the landscape that constitutes the data ecosystem. I say "part" because the landscape is so huge that listing every source would be impossible:
The previously mentioned data sources can be categorized as internal or external, depending upon the business segment in which an organization operates. For example, for an organization such as Facebook, all of the social media-related data on its website constitutes an internal source, whereas the same data for an advertising firm represents an external source of data.
As you may have already noticed, the preceding set of data can broadly be classified into three sub-categories:
Structured data
This type of data has a well-defined structure that can be parsed easily by any standard machine parser. It usually comes with a schema that defines the structure of the data. For example, incoming data in XML format with an associated XML schema constitutes what is known as structured data. Examples of such data include Customer Relationship Management (CRM) data and ERP data.
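To make the idea of schema-backed structured data concrete, here is a minimal Java sketch, assuming hypothetical customer.xml and customer.xsd files, that uses the standard javax.xml.validation API to check an incoming document against its schema before the data is accepted:

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.File;

public class StructuredDataCheck {
    public static void main(String[] args) throws Exception {
        // Load the XML schema (customer.xsd is a placeholder) describing the expected structure.
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("customer.xsd"));

        // Validate the incoming document; a document that does not conform raises an exception.
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(new File("customer.xml")));
        System.out.println("customer.xml conforms to customer.xsd");
    }
}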
Semi-structured data
Semi-structured data consists of data that does not have a formal schema associated with it. Log data from different machines can be regarded as semi-structured data. For example, a firewall log statement consists of the following fields as a minimum: the timestamp, host IP, destination IP, host port, and destination port, as well as some free text describing the event that took place resulting in the generation of the log statement.
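As an illustration, the following sketch parses one such log line with a regular expression. The exact line layout shown is an assumption made for the example; real appliances each use their own format, which is precisely what makes this data semi-structured:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirewallLogParser {
    // Assumed layout: "<timestamp> <srcIp>:<srcPort> -> <dstIp>:<dstPort> <free text>"
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+)\\s+(\\d{1,3}(?:\\.\\d{1,3}){3}):(\\d+)\\s+->\\s+(\\d{1,3}(?:\\.\\d{1,3}){3}):(\\d+)\\s+(.*)$");

    public static void main(String[] args) {
        String line = "2018-07-31T10:15:30Z 10.0.0.5:51432 -> 203.0.113.7:443 connection allowed";
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            System.out.println("timestamp   = " + m.group(1));
            System.out.println("source      = " + m.group(2) + ":" + m.group(3));
            System.out.println("destination = " + m.group(4) + ":" + m.group(5));
            System.out.println("description = " + m.group(6));
        } else {
            // Lines that do not match still carry information and would typically
            // be retained as raw free text rather than discarded.
            System.out.println("unparsed: " + line);
        }
    }
}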
Unstructured data
Finally, we have data that is unstructured. When I say unstructured, what I really mean is that, looking at the data, it is hard to derive any structured information directly from the data itself. It does not mean that we can't get information from the unstructured data. Examples of unstructured data include video files, audio files, and blogs, while most of the data generated on social media also falls under the category of unstructured data.
One thing to note about any kind of data is that, more often than not, each piece of data will have metadata associated with it. For example, when we take a picture using our cellphone, the picture itself constitutes the data, whereas its properties, such as when it was taken, where it was taken, what the focal length was, its brightness, and whether it was modified by software such as Adobe Photoshop, constitutes its metadata.
Sometimes, it is also difficult to categorize data clearly. Consider, for example, a security firm that sells hardware appliances that are installed at customer locations and collect access log data. The data belongs to the end customer, who has given permission for it to be used for a specific purpose: detecting security threats. Thus, even though the data resides with the security organization, it still cannot be used (without consent) for any purpose other than to detect a threat for that specific customer.
This brings us to our next topic: data sharing.
Whenever we collect data from an external source, there is always a clause about how that data can be used. At times, this aspect is implicit, but there are times when you need to provide an explicit mechanism for how the data can be shared by the collecting organization, both within and outside the organization. This distinction becomes important when data is shared between specific organizations. For example, one particular financial institution may decide to share certain information with another financial institution because both are part of a larger consortium that requires them to work collectively towards combating cyber threats. Now, the data on cyber threats that is collected and shared by these organizations may come with certain restrictions. Namely:
When should the shared data be used?
How may this data be shared with other parties, both within and outside an organization?
There are numerous ways in which such a sharing agreement can be reached between organizations. Two such ways, defined and used by many organizations, are:
The traffic light protocol
The information exchange policy framework from first.org
Let's discuss each of these briefly.
The traffic light protocol (hereinafter referred to as TLP, https://www.us-cert.gov/tlp and https://www.first.org/tlp) is a set of designations used to ensure that sensitive information is shared with the appropriate audience. TLP was created to facilitate the increased sharing of information between organizations. It employs four colors to indicate the expected sharing boundaries to be applied by the recipient(s):
RED
AMBER
GREEN
WHITE
TLP provides a simple and intuitive schema for indicating when and how sensitive information can be shared, thereby facilitating more frequent and effective collaboration. TLP is not a control marking or classification scheme. TLP was not designed to handle licensing terms, handling and encryption rules, and restrictions on action or instrumentation of information. TLP labels and their definitions are not intended to have any effect on freedom of information or sunshine laws in any jurisdiction.
TLP is optimized for ease of adoption, human readability, and person-to-person sharing; it may be used in automated sharing exchanges, but is not optimized for such use.
The source is responsible for ensuring that recipients of TLP information understand and can follow TLP sharing guidance.
If a recipient needs to share the information more widely than is indicated by the original TLP designation, they must obtain explicit permission from the original source.
The United States Computer Emergency Readiness Team provides the following definition of TLP, along with its usage and sharing guidelines:
RED: Not for disclosure, restricted to participants only.
When it should be used: Sources may use TLP:RED when information cannot be effectively acted upon by additional parties, and could impact on a party's privacy, reputation, or operations if misused.
How it may be shared: Recipients may not share TLP:RED information with any parties outside of the specific exchange, meeting, or conversation in which it was originally disclosed. In the context of a meeting, for example, TLP:RED information is limited to those present at the meeting. In most circumstances, TLP:RED should be exchanged verbally or in person.

AMBER: Limited disclosure, restricted to participants' organizations.
When it should be used: Sources may use TLP:AMBER when information requires support to be effectively acted upon, yet carries risks to privacy, reputation, or operations if shared outside of the organizations involved.
How it may be shared: Recipients may only share TLP:AMBER information with members of their own organization, and with clients or customers who need to know the information to protect themselves or prevent further harm. Sources are at liberty to specify additional intended limits associated with the sharing: these must be adhered to.

GREEN: Limited disclosure, restricted to the community.
When it should be used: Sources may use TLP:GREEN when information is useful for making all participating organizations, as well as peers within the broader community or sector, aware.
How it may be shared: Recipients may share TLP:GREEN information with peers and partner organizations within their sector or community, but not via publicly accessible channels. Information in this category can be circulated widely within a particular community. TLP:GREEN information may not be released outside of the community.

WHITE: Disclosure is not limited.
When it should be used: Sources may use TLP:WHITE when information carries minimal or no foreseeable risk of misuse, in accordance with applicable rules and procedures for public release.
How it may be shared: Subject to standard copyright rules, TLP:WHITE information may be distributed without restriction.
Remember that this is guidance and not a rule. Therefore, if an organization feels the need for further types of restriction, it may certainly add them, provided the receiving entity is made aware of them and is not opposed to the extension.
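If TLP designations need to travel with the data through an automated pipeline, they can be modelled as a simple type. The following enum is an illustrative sketch based on the definitions above, not an official API:

public enum TlpColor {

    RED("Not for disclosure, restricted to participants only."),
    AMBER("Limited disclosure, restricted to participants' organizations."),
    GREEN("Limited disclosure, restricted to the community."),
    WHITE("Disclosure is not limited.");

    private final String sharingBoundary;

    TlpColor(String sharingBoundary) {
        this.sharingBoundary = sharingBoundary;
    }

    public String sharingBoundary() {
        return sharingBoundary;
    }

    // Only TLP:WHITE information may be distributed without restriction,
    // subject to standard copyright rules.
    public boolean publiclyReleasable() {
        return this == WHITE;
    }
}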
The information exchange policy framework (https://www.first.org/iep) was put together by FIRST for Computer Security Incident Response Teams (CSIRT), security communities, organizations, and vendors who may consider implementation with a view to supporting their information sharing and information exchange initiatives.
The IEP framework is composed of four different policy types: Handling, Action, Sharing, and Licensing (HASL).
Let's look at each of these briefly.
Policy statement: ENCRYPT IN TRANSIT
Type: HANDLING
Description: States whether the information received has to be encrypted when it is retransmitted by the recipient.
Enumerations:
MUST: Recipients MUST encrypt the information received when it is retransmitted or redistributed.
MAY: Recipients MAY encrypt the information received when it is retransmitted or redistributed.
Required: NO

Policy statement: ENCRYPT IN REST
Type: HANDLING
Description: States whether the information received has to be encrypted by the recipient when it is stored.
Enumerations:
MUST: Recipients MUST encrypt the information received when it is stored.
MAY: Recipients MAY encrypt the information received when it is stored.
Required: NO
Policy statement: PERMITTED ACTIONS
Type: ACTION
Description: States the permitted actions that recipients can take upon receiving information.
Enumerations:
NONE: Recipients MUST NOT act upon the information received.
CONTACT FOR INSTRUCTION: Recipients MUST contact the providers before acting upon the information received. An example is where information redacted by the provider could be derived by the recipient and the affected parties identified.
INTERNALLY VISIBLE ACTIONS: Recipients MAY conduct actions on the information received that are only visible on the recipient's internal networks and systems, and MUST NOT conduct actions that are visible outside of the recipient's networks and systems, or that are visible to third parties.
EXTERNALLY VISIBLE INDIRECT ACTIONS: Recipients MAY conduct indirect, or passive, actions on the information received that are externally visible and MUST NOT conduct direct, or active, actions.
EXTERNALLY VISIBLE DIRECT ACTIONS: Recipients MAY conduct direct, or active, actions on the information received that are externally visible.
Required: NO

Policy statement: AFFECTED PARTY NOTIFICATIONS
Type: ACTION
Description: Recipients are permitted to notify affected third parties of a potential compromise or threat. Examples include permitting National CSIRTs to send notifications to affected constituents, or a service provider contacting affected customers.
Enumerations:
MAY: Recipients MAY notify affected parties of a potential compromise or threat.
MUST NOT: Recipients MUST NOT notify affected parties of potential compromises or threats.
Required: NO
Policy statement: TRAFFIC LIGHT PROTOCOL
Type: SHARING
Description: Recipients are permitted to redistribute the information received within the scope of redistribution, as defined by the enumerations. The enumerations "RED", "AMBER", "GREEN", and "WHITE" in this document are to be interpreted as described in the FIRST traffic light protocol policy.
Enumerations:
RED: Personal for identified recipients only.
AMBER: Limited sharing based on a need-to-know basis.
GREEN: Community-wide sharing.
WHITE: Unlimited sharing.
Required: NO

Policy statement: PROVIDER ATTRIBUTION
Type: SHARING
Description: Recipients could be required to attribute or anonymize the provider when redistributing the information received.
Enumerations:
MAY: Recipients MAY attribute the provider when redistributing the information received.
MUST: Recipients MUST attribute the provider when redistributing the information received.
MUST NOT: Recipients MUST NOT attribute the provider when redistributing the information received.
Required: NO

Policy statement: OBFUSCATE AFFECTED PARTIES
Type: SHARING
Description: Recipients could be required to obfuscate or anonymize information that could be used to identify the affected parties before redistributing the information received. Examples include removing affected parties' IP addresses, or removing the affected parties' names but leaving the affected parties' industry vertical prior to sending a notification.
Enumerations:
MAY: Recipients MAY obfuscate information concerning the specific parties affected.
MUST: Recipients MUST obfuscate information concerning the specific parties affected.
MUST NOT: Recipients MUST NOT obfuscate information concerning the specific parties affected.
Required: NO
Policy statement: EXTERNAL REFERENCE
Type: LICENSING
Description: This statement can be used to convey a description or reference to any applicable licenses, agreements, or conditions between the producer and receiver, for example, specific terms of use, contractual language, agreement name, or a URL.
Enumerations: There are no EXTERNAL REFERENCE enumerations; this is a free-form text field.
Required: NO

Policy statement: UNMODIFIED RESALE
Type: LICENSING
Description: States whether the recipient MAY or MUST NOT resell the information received unmodified, or in a semantically equivalent format; for example, transposing the information from a .csv file format to a .json file format would be considered semantically equivalent.
Enumerations:
MAY: Recipients MAY resell the information received.
MUST NOT: Recipients MUST NOT resell the information received unmodified or in a semantically equivalent format.
Required: NO
Policy statement: POLICY ID
Type: METADATA
Description: Provides a unique ID to identify a specific IEP implementation.
Required: YES

Policy statement: POLICY VERSION
Type: METADATA
Description: States the version of the IEP framework that has been used, for instance, 1.0.
Required: NO

Policy statement: POLICY NAME
Type: METADATA
Description: This statement can be used to provide a name for an IEP implementation, for instance, FIRST Mailing List IEP.
Required: NO

Policy statement: POLICY START DATE
Type: METADATA
Description: States the UTC date from when the IEP is effective.
Required: NO

Policy statement: POLICY END DATE
Type: METADATA
Description: States the UTC date that the IEP is effective until.
Required: NO

Policy statement: POLICY REFERENCE
Type: METADATA
Description: This statement can be used to provide a URL reference to the specific IEP implementation.
Required: NO
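As a rough sketch of how these statements could be represented in application code (the class names and the handful of statement values shown are illustrative choices, not part of the IEP specification itself), consider the following:

import java.util.LinkedHashMap;
import java.util.Map;

public class IepPolicy {

    // The four HASL policy types, plus METADATA for statements describing the policy itself.
    enum PolicyType { HANDLING, ACTION, SHARING, LICENSING, METADATA }

    static final class Statement {
        final String name;
        final PolicyType type;
        final String value;

        Statement(String name, PolicyType type, String value) {
            this.name = name;
            this.type = type;
            this.value = value;
        }
    }

    private final Map<String, Statement> statements = new LinkedHashMap<>();

    IepPolicy add(String name, PolicyType type, String value) {
        statements.put(name, new Statement(name, type, value));
        return this;
    }

    String get(String statementName) {
        Statement s = statements.get(statementName);
        return s == null ? null : s.value;
    }

    public static void main(String[] args) {
        IepPolicy policy = new IepPolicy()
                .add("POLICY ID", PolicyType.METADATA, "example-iep-001")
                .add("ENCRYPT IN TRANSIT", PolicyType.HANDLING, "MUST")
                .add("PERMITTED ACTIONS", PolicyType.ACTION, "INTERNALLY VISIBLE ACTIONS")
                .add("TRAFFIC LIGHT PROTOCOL", PolicyType.SHARING, "AMBER")
                .add("UNMODIFIED RESALE", PolicyType.LICENSING, "MUST NOT");

        // A consumer would consult the policy before acting on or redistributing the data.
        System.out.println("Sharing scope: TLP:" + policy.get("TRAFFIC LIGHT PROTOCOL"));
    }
}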
It is very important for any organization to understand where it is gathering data from and what obligations are associated with that data before using it, whether for internal purposes or for sharing. A lack of clear understanding here could lead to breaches of trust, which is never a desirable situation.
Now that data sharing is behind us, let's talk a little about the nature of the data itself. Data usually exhibits three characteristics that are essential to understand when designing a data collection system (to be discussed over the next couple of chapters). The industry calls these the 3 V's of data. Let's briefly look at what the 3 V's stand for and why they are important to bear in mind when designing the system.
The 3 V's stand for:
Volume
Variety
Velocity
Today's world consists of petabytes of data being emitted by a variety of sources, be it social media, sensors, blockchains, video, audio, or even transactional systems. The volume of data collected can be huge, depending on the nature of the business, and if you are reading this book, chances are that you already have huge volumes of data and need to understand how to handle them effectively.
Variety refers to the different data formats. Relational databases, Excel files, or even simple text files are all examples of different data formats. A system should be capable of handling new varieties of data as and when they arrive. Extensibility is the key component for a data-intensive system when it comes to handling varieties of data. Data variety can be broadly classified into three major blocks:
Structured: Data that has a well-defined schema associated with it, for example, relational data and XML-formatted data.
Semi-structured: Data whose structure can be anticipated but that does not always conform to a set standard. Examples include JSON-formatted data and columnar data.
Unstructured: Binary large object (BLOB) data, for example, video and audio.
Velocity denotes the speed at which the data arrives and becomes stale. There was a time when even one month-old data was considered fresh. In today's world, where social media has taken the place of traditional information sources and sensors have replaced human log books, we can't even rely on yesterday's data as it may have already become stale. The data moves at near real time and, if not processed properly and in time, may represent a lost opportunity for the business.
Until now, we have only discussed the data ecosystem, what it consists of, what requirements are associated with it in terms of the ability to share, and the types of data you can expect to collect. None of this will make sense unless we associate the data ecosystem and collection with the value drivers associated with that data for an organization.
Broadly speaking, any data that an organization decides to collect or use has two motivations/intentions behind it. Either the organization wants to use it for improving its own system/processes, or it wants to place itself strategically in a situation where it can generate new opportunities for itself.
Better decision-making, whether quicker or more proactive, translates directly into revenue for a company.
Improvements in internal capabilities, either via automation or improved business process management, save time and money, thereby giving organizations more opportunities to innovate and, in turn, reducing costs further and opening up new business opportunities.
As you may have already noticed, this is a circle of dependencies and, once an organization can find a balance within this circle, the only way for it is upward.
Having understood the data ecosystem and its constituent elements, let's finally look at some practical use cases that could lead an organization to start thinking in terms of data rather than processes.
Until a few years ago, the best way to combat external cyber security threats was to create a series of firewalls that were assumed to be impenetrable and would thereby secure the systems behind them. To combat internal cyber attacks, anti-virus software was considered more than sufficient. This traditional defense gave a sense of security, but it was more of an illusion than a reality. Attackers are well versed in hiding in plain sight, so looking for "known bad" signatures did not help in combating Advanced Persistent Threats (APTs). As systems grew in complexity, attack patterns also became more sophisticated, with coordinated hacking efforts persisting over long periods and exploiting every aspect of a vulnerable system.
For example, one use case within the security domain is the detection of anomalies within generated machine data, where the data is explored to identify any non-homogeneous event or transaction within a seemingly homogeneous set of events. An example of anomaly detection is when banks perform sophisticated transformations and context association on incoming credit card transactions to identify whether a transaction looks suspicious. Banks do this to prevent fraudsters from stealing from them, either directly or indirectly.
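A heavily simplified sketch of that idea is shown below; it flags a credit card transaction whose amount lies far outside the cardholder's historical spending pattern. Real systems combine many more signals and far richer context than this single statistic:

import java.util.Arrays;

public class TransactionAnomalyCheck {

    // Flags an amount as anomalous if it lies more than three standard
    // deviations away from the cardholder's historical mean (a toy rule).
    static boolean isAnomalous(double[] history, double amount) {
        double mean = Arrays.stream(history).average().orElse(0.0);
        double variance = Arrays.stream(history)
                .map(x -> (x - mean) * (x - mean))
                .average().orElse(0.0);
        double stdDev = Math.sqrt(variance);
        return stdDev > 0 && Math.abs(amount - mean) > 3 * stdDev;
    }

    public static void main(String[] args) {
        double[] history = {23.10, 41.75, 18.20, 35.00, 27.80, 30.45};
        System.out.println(isAnomalous(history, 29.99));   // false: a typical purchase
        System.out.println(isAnomalous(history, 2400.00)); // true: far outside the usual range
    }
}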
Organizations responded by creating hunting teams that looked at various data (for example, system logs, network packets, and firewall access logs) with a view to doing the following:
Hunting for undetected intrusions/breaches
Detecting anomalies and raising alerts in connection with any malicious activity
The main challenges for organizations in terms of creating these hunting teams were the following:
The fact that data is scattered throughout the organization's IT landscape
Data quality issues and multiple data versioning issues
Access and contractual limitations
All of these requirements and challenges created the need for a platform that can support various data formats and that is capable of:
Long-term data retention
Correlating different data sources
Providing fast access to correlated data
Real-time analysis
Just as it is important to capture data for various efficiencies and insights, it is equally important to understand what data an organization does not want. You may think that you need everything, but the truth is that you do not. Understanding what you actually need is critical to hastening the journey toward becoming a data-driven organization.
