Architect and design data-intensive applications and, in the process, learn how to collect, process, store, govern, and expose data for a variety of use cases
Key Features
Integrate the data-intensive approach into your application architecture
Create a robust application layout with effective messaging and data querying architecture
Enable smooth data flow and make the data of your application intensive and fast
Book Description
Are you an architect or developer who looks warily at your own applications while browsing Facebook, silently applauding its data-intensive yet fluent and efficient behaviour? This book is your gateway to building smart data-intensive systems by incorporating the core data-intensive architectural principles, patterns, and techniques directly into your application architecture.
This book starts by taking you through the primary design challenges involved with architecting data-intensive applications. You will learn how to implement data curation and data dissemination, depending on the volume of your data. You will then implement your application architecture one step at a time. You will get to grips with implementing the correct message delivery protocols and creating a data layer that doesn’t fail when running high traffic. This book will show you how you can divide your application into layers, each of which adheres to the single responsibility principle. By the end of this book, you will learn to streamline your thoughts and make the right choice in terms of technologies and architectural principles based on the problem at hand.
What you will learn
Understand how to envision a data-intensive system
Identify and compare the non-functional requirements of a data collection component
Understand patterns involving data processing, as well as technologies that help to speed up the development of data processing systems
Understand how to implement Data Governance policies at design time using various Open Source Tools
Recognize the anti-patterns to avoid while designing a data store for applications
Understand the different data dissemination technologies available to query the data in an efficient manner
Implement a simple data governance policy that can be extended using Apache Falcon
Who this book is for
This book is for developers and data architects who have to code, test, deploy, and/or maintain large-scale, high data volume applications. It is also useful for system architects who need to understand various non-functional aspects revolving around Data Intensive Systems.
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Amarabha Banerjee
Acquisition Editor: Nigel Fernandes
Content Development Editor: Roshan Kumar
Technical Editor: Diksha Wakode
Copy Editor: Safis Editing
Project Coordinator: Hardik Bhinde
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Jason Monteiro
Production Coordinator: Arvindkumar Gupta
First published: July 2018
Production reference: 1310718
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78646-509-2
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Anuj Kumar is a senior enterprise architect with FireEye, a cyber security service provider, where he is involved in the architecture, strategy, and design of various systems that deal with huge amounts of data on a regular basis. Anuj has more than 15 years of professional IT industry experience spanning development, design, architecture, management, and strategy. He is an active member of the OASIS Technical Committee on the STIX/TAXII specification. He is a firm believer in agile methodology, modular/staged event-driven architecture, the API-first approach, and continuous integration/deployment/delivery. Anuj is also the author of Easy Test Framework, a data-driven testing framework used by more than 50 companies.
Anindita Basak is a cloud solution architect specializing in data analytics and AI platforms. She has worked with Microsoft Azure since its inception, including with Microsoft teams as an FTE in the roles of Azure development support engineer, pro-direct delivery manager, and technical consultant. She coauthored Stream Analytics with Microsoft Azure and was a technical reviewer for five Packt books on Azure HDInsight, SQL Server Business Intelligence, Hadoop Development, and Smart Learning with Internet of Things and Decision Science. She has also authored two video courses on Azure Stream Analytics for Packt.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Architecting Data-Intensive Applications
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the author
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Get in touch
Reviews
Exploring the Data Ecosystem
What is a data ecosystem?
A complex set of interconnected data
Data environment
What constitutes a data ecosystem?
Data sharing
Traffic light protocol
Information exchange policy
Handling policy statements
Action policy statements
Sharing policy statements
Licensing policy statements
Metadata policy statements
The 3 V's
Volume
Variety
Velocity
Use cases
Use case 1 – Security
Use case 2 – Modem data collection
Summary
Defining a Reference Architecture for Data-Intensive Systems
What is a reference architecture?
Problem statement
Reference architecture for a data-intensive system
Component view
Data ingest
Data preparation
Data processing
Workflow management
Data access
Data insight
Data governance
Data pipeline
Oracle's information management conceptual reference architecture
Conceptual view
Oracle's information management reference architecture
Data process view
Reference architecture – business view
Real-life use case examples
Machine learning use case 
Data enrichment use case
Extract transform load use case
Desired properties of a data-intensive system
Defining architectural principles
Principle 1
Principle 2
Principle 3
Principle 4
Principle 5
Principle 6
Principle 7
Listing architectural assumptions
Architectural capabilities
UI capabilities
Content mashup
Multi-channel support
User workflow
AR/VR support
Service gateway/API gateway capabilities
Security
Traffic control
Mediation
Caching
Routing
Service orchestration
Business service capabilities
Microservices
Messaging
Distributed (batch/stream) processing
Data capabilities
Data partitioning
Data replication
Summary
Patterns of the Data Intensive Architecture
Application styles
API Platform
Message-oriented application style
Micro Services application styles
Communication styles
Combining different application styles
Architectural patterns
The retry pattern
The circuit breaker
Throttling
Bulk heads
Event-sourcing
Command and Query Responsibility Segregation
Summary
Discussing Data-Centric Architectures
Coordination service
Reliable messaging
Distributed processing
Distributed storage
Lambda architecture
Kappa architecture
A brief comparison of different leading NoSQL data stores
Summary
Understanding Data Collection and Normalization Requirements and Techniques
Data lineage
Apache Atlas
Apache Atlas high-level architecture
Apache Falcon
Data quality
Types of data sources
Data collection system requirements
Data collection system architecture principles
High-level component architecture
High-level architecture
Service gateway
Discovery server
Architecture technology mapping
An introduction to ETCD
Scheduler
Designing the Micro Service
Summary
Creating a Data Pipeline for Consistent Data Collection, Processing, and Dissemination
Query-Data pipelines
Event-Data Pipelines
Topology 1
Topology 2
Topology 3
Resilience
High-availability
Availability Chart
Clustering
Clustering and Network Partitions
Mirrored queues
Persistent Messages
Data Manipulation and Security
Use Case 1
Use Case 2
Exchanges
Guidelines on choosing the right Exchange Type
Headers versus Topic Exchanges
Routing
Header-Based Content Routing
Topic-Based Content Routing
Alternate Exchanges
Dead-Letter Exchanges
Summary
Building a Robust and Fault-Tolerant Data Collection System
Apache Flume
Flume event flow reliability
Flume multi-agent flow
Flow multiplexer
Apache Sqoop
ELK
Beats
Load-balancing
Logstash
Back pressure
High-availability
Centralized collection of distributed data
Apache Nifi
Summary
Challenges of Data Processing
Making sense of the data
What is data processing?
The 3 + 1 Vs and how they affect choice in data processing design
Cost associated with latency
Classic way of doing things
Sharing resources among processing applications
How to perform the processing
Where to perform the processing
Quality of data
Networks are everywhere
Effective consumption of the data
Summary
Let Us Process Data in Batches
What do we mean by batch processing
Lambda architecture and batch processing
Batch layer components and subcomponents
Read/extract component
Normalizer component
Validation component
Processing component
Writer/formatter component
Basic shell component
Scheduler/executor component
Processing strategy
Data partitioning
Range-based partitioning
Hash-based partitioning
Distributed processing
What are Hadoop and HDFS
NameNode
DataNode
MapReduce
Data pipeline
Luigi
Azkaban
Oozie
AirFlow
Summary
Handling Streams of Data
What is a streaming system?
Capabilities (and non-capabilities) of a streaming application
Lambda architecture's speed layer
Computing real time views
High-level reference architecture
Samza architecture
Architectural concepts
Event-streaming layer
Apache Kafka as an event bus
Message persistence
Persistent Queue Design
Message batch
Kafka and the sendfile operation
Compression
Kafka streams
Stream processing topology
Notion of time in stream processing
Samza's stream processing API
The scheduler/executor component of the streaming architecture
Processing concepts and tradeoffs
Processing guarantees
Micro-batch stream processing
Windowing
Types of windows
Summary
References
Let Us Store the Data
The data explosion problem
Relational Database Management Systems and Big data
Introducing Hadoop, the Big Elephant
Apache YARN
Hadoop Distributed Filesystem
HDFS architecture principles (and assumptions)
High-level architecture of HDFS
HDFS file formats
HBase
Understanding the basics of HBase
HBase data model
HBase architecture
Horizontal scaling with automatic sharding of HBase tables
HMaster, region assignment, and balancing
Components of Apache HBase architecture
Tips for improved performance from your HBase cluster
Graph stores
Background of the use case
Scenario
Solution discussion
Bank fraud data model (as can be designed in a property graph data store such as Neo4J)
Semantic graph
Linked data
Vocabularies
Semantic Query Language
Inference
Stardog
GraphQL queries
Gremlin
Virtual Graphs – a Unifying DAO
Structured data
CSV
BITES – Unstructured/Semistructured document store
Structured data extraction
Text extraction
Document queries
Highly-available clusters
Guarantees
Scaling up
Integration with SPARQL
Data Formats
Data integrity and validating constraints
Strict parsing of RDF
Integrity Constraint Validation
Monitoring and operation
Performance
Summary
Further reading
When Data Dissemination is as Important as Data Itself
Data dissemination
Communication protocol
Target audience
Use case
Response schema
Communication channel
Data dissemination architecture in a threat intel sharing system
Threat intel share – backend
RT query processor
View builder
Threat intel share – frontend
AWS Lambda
AWS API gateway
Cache population
Cache eviction
Discussing the non-functional aspects of the preceding architecture
Non-functional use cases for dissemination architecture
Elastic search and free text search queries
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Architecting Data-Intensive Applications is all about exploring the principles, capabilities, and patterns of a system that is architected and designed to handle a variety of workloads, such as reading, processing, writing, and analyzing data from a variety of sources that emit different volumes of data at a consistent pace. This book educates its readers about the various aspects and pitfalls to avoid, and presents use cases that point to the need for a system capable of handling large amounts of data.
It deliberately avoids comparisons with Big Data systems. The reason is that, in the author's opinion, the phrase "Big Data" is already quite overloaded. How "big" is really "big" depends on the context in which the application is being built. Something that is "big" for an organization with three employees handling the Twitter feeds of 10,000 users may not be "big" for Twitter, which handles millions of feeds every day. Therefore, this book avoids any mention of, or comparison with, Big Data terminology.
Readers will find this book to be both a technical guide and a go-to reference for understanding the various aspects of dealing with data, such as data collection, data processing, data dissemination, and data governance. The book also contains example code in various places, written mostly in Java. Care has been taken to keep the examples simple and easy to understand, with sufficient description; therefore, a working knowledge of Java is not mandatory, although it will speed up the process of grasping the concepts. Knowledge of object-oriented programming is, however, essential.
This book is for developers and data architects who have to code, test, deploy, and/or maintain large-scale, high data volume applications. It is also useful for system architects who need to understand various non-functional aspects revolving around Data Intensive Systems.
Chapter 1, Exploring the Data Ecosystem, introduces the data ecosystem and helps us understand its characteristics. You will take a look at the 3 V's of the data ecosystem and discuss some data and information sharing standards and frameworks.
Chapter 2, Defining a Reference Architecture for Data-Intensive Systems, will give you an insight into a reference architecture for a data-intensive system and then provide a variety of possible implementations of that architecture in different scenarios. You will also take a look at the architectural principles and capabilities involved.
Chapter 3, Patterns of the Data Intensive Architecture, will focus on various architectural patterns and discuss application and communication styles in detail. You will learn how to combine different application styles and dive deep into various architectural patterns, enabling you to understand the why as well as the how of a data-centric architecture.
Chapter 4, Discussing Data-Centric Architectures, will discuss the various reference architectures for a data-intensive system. It will also look at the functional components that form the foundation of a distributed system and explain why the Lambda architecture is so popular with distributed systems. It will also provide an insight into the Kappa architecture, a simplified version of the Lambda architecture.
Chapter 5, Understanding Data Collection and Normalization Requirements and Techniques, will provide an in-depth design of a data collection system built from scratch, along with its requirements and the techniques involved.
Chapter 6, Creating a Data Pipeline for Consistent Data Collection, Processing, and Dissemination, will help you learn how to create a scalable and highly available architecture for designing and implementing a data pipeline within your overall architecture. It will also delve deeper into the different considerations involved in designing the data pipeline and take a look at various design patterns that will help you create a resilient data pipeline.
Chapter 7, Building a Robust and Fault-Tolerant Data Collection System, will focus on the data collection systems available in the open source community, including NiFi, a highly scalable and user-friendly system for defining data flows. It will also cover Sqoop, which addresses a very specific use case: transferring data between HDFS and relational systems.
Chapter 8, Challenges of Data Processing, will act as a backbone for the chapters that follow. It discusses the various challenges an architect can face while creating a data processing system within their organization. You will learn how to enable large-scale processing of data while keeping overall system costs low, and how to keep the overall processing time within the defined SLA as the load on the processing system increases. You will also learn how to consume the processed data effectively.
Chapter 9, Let Us Process Data in Batches, will explore the creation of a batch processing system and the criteria necessary for designing one. It will also discuss the Lambda architecture and its batch processing layer. You will then learn how distributed processing works and why Hadoop and MapReduce are the go-to technologies for implementing a batch processing system.
Chapter 10, Handling Streams of Data, will explore the concepts and capabilities of a streaming application and its association with the Lambda architecture. It also discusses the various subcomponents of a stream-based system, examines the design considerations involved in building a stream-based application, and walks through the different components of a stream-based system in action.
Chapter 11, Let Us Store the Data, will help you understand how to store a huge dataset. It discusses HDFS and its storage formats, covers HBase, a columnar data store, and takes a look at graph databases.
Chapter 12, When Data Dissemination is as Important as Data Itself, will explore how you can disseminate your data efficiently using indexing technologies and caching techniques. It will also take a look at data governance and teach you how to design a dissemination architecture.
To get the most out of this book, you should have a good understanding of object-oriented programming. A working knowledge of Java will also help you follow the example code more quickly, but it is not mandatory, since the examples are kept simple and are accompanied by sufficient description.
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
Until a few years ago, a successful organization was one that had access to superior technology, which, in most cases, was either proprietary to the organization or was acquired at great expense. These technologies enabled organizations to define complex business process flows, addressing specific use cases that helped them to generate revenue. In short, technology drove the business. Data did not constitute any part of the decision-making process. With such an approach, organizations could only utilize a part of their data. This resulted in lost opportunities and, in some cases, unsatisfied customers. One of the reasons for these missed opportunities was the fact that there was no reliable and economical way to store such huge quantities of data, especially when organizations didn't know how to make business out of it. Hardware costs were a prohibitive factor.
Things started to change a few years ago when Google published its white paper on GFS (https://static.googleusercontent.com/media/research.google.com/en/archive/gfs-sosp2003.pdf), which was picked up by Doug Cutting, who created Apache Hadoop, an open source framework whose distributed filesystem is capable of storing large volumes of data on commodity hardware.
Suddenly, organizations, both big and small, realized its potential and started storing any piece of data in Hadoop that had the potential to turn itself into a source of revenue later. The industry coined a term for such a huge, raw store of data, calling it a data lake.
Wikipedia defines a data lake, in essence, as a repository of data stored in its natural or raw format. In short, a data lake is a collection of various pieces of data that may, or may not, be important to an organization. The key phrase here is natural format. What this means is that the data that is collected is seldom processed prior to being stored. The reasoning behind this is that any processing may potentially lead to a loss of information which, in turn, may have an effect in terms of generating new sources of revenue for the organization. This does not mean that we do not process the data while in flight. What it does mean is that at least one copy of the data is stored in the manner in which it was received from the external system.
But how do organizations fill this data lake and what data should be stored there? The answer to this question lies in understanding the data ecosystem that exists today. Understanding where the data originates from, and which data to persist, helps organizations to become data-driven instead of process-driven. This ability helps organizations to not only explore new business opportunities, but also helps them to react more quickly in the face of an ever-changing business landscape.
In this introductory chapter, we will:
Try to understand what we mean by a data ecosystem
Try to understand the characteristics of a data ecosystem, in other words, what constitutes a data ecosystem
Talk about some data and information sharing standards and frameworks, such as the traffic light protocol and the information exchange policy framework
Continue our exploration of the 3V's of the data ecosystem
Conclude with a couple of use cases to prove our point
I hope that this chapter will pique your interest and get you excited about exploring the rest of the book.
An ecosystem is defined as a complex set of relationships between interconnected elements and their environments. For example, the social construct around our daily lives is an ecosystem. We depend on the state to provide us with basic necessities, including food, water, and gas. We rely on our local stores for our daily needs, and so on. Our livelihood is directly or indirectly dependent upon the social construct of our society. The inter-dependency, as well as the inter-connectivity of these social elements, is what defines a society.
Along the same lines, a data ecosystem can be defined as a complex set of possibly interconnected data and the environment from which that data originates. Data from social websites, such as Twitter, Facebook, and Instagram; data from connected devices, such as sensors; data from the (Industrial) Internet of Things; SCADA systems; data from your phone; and data from your home router, all constitute a data ecosystem to some extent. As we will see in the following sections, this huge variety of data, when connected, can be really useful in providing insights into previously undiscovered business opportunities.
What this section implies is that data can be a collection of structured, semi-structured, or unstructured data (hence, a complex set). Additionally, data collected from different sources may relate to one another, in some form or other. To put it in perspective, let's look at a very simple use case, where data from different sources can be connected. Imagine you have an online shopping website and you would like to recommend to your visitors the things that they would most probably want to buy. For the recommendation to succeed, you may need a lot of relevant information about the person. You may want to know what a person likes/dislikes, what they have been searching for in the last few days, what they have been tweeting about, and what topics they are discussing in public forums. All these constitute different sources of data and, even though, at first glance, it may appear that the data from individual sources is not connected, the reality is that all the data pertains to one individual, and their likes and dislikes. Establishing such connections in different data sources is key for an organization when it comes to quickly turning an idea into a business opportunity.
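As a minimal sketch of this idea (the class name, user ID, and interest lists below are invented purely for illustration), the following Java snippet merges interest signals about the same person from several hypothetical sources into a single profile that a recommendation engine could reason over:

import java.util.*;

public class InterestProfileBuilder {
    public static void main(String[] args) {
        // Hypothetical interest signals about the same user, each from a different data source.
        Map<String, List<String>> searchHistory = new HashMap<>();
        searchHistory.put("user-42", Arrays.asList("running shoes", "gps watch"));

        Map<String, List<String>> tweets = new HashMap<>();
        tweets.put("user-42", Arrays.asList("marathon training"));

        Map<String, List<String>> forumTopics = new HashMap<>();
        forumTopics.put("user-42", Arrays.asList("injury prevention"));

        // Connect the sources: every signal ultimately pertains to one individual.
        Map<String, Set<String>> profile = new HashMap<>();
        for (Map<String, List<String>> source : Arrays.asList(searchHistory, tweets, forumTopics)) {
            for (Map.Entry<String, List<String>> entry : source.entrySet()) {
                profile.computeIfAbsent(entry.getKey(), k -> new LinkedHashSet<>())
                       .addAll(entry.getValue());
            }
        }

        // The combined profile is what a recommendation engine would reason over.
        System.out.println(profile.get("user-42"));
    }
}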
The environment in which the data originates is as important as the data itself. The environment provides us with the contextual information to attach to the data, which may further help us in making the correct decision. Having contextual information helps us to understand the relevancy as well as the reliability of the data source, which ultimately feeds into the decision-making process. The environment also tells us about the data lineage (to be discussed in detail in Chapter 12, When Data Dissemination Is as Important as Data Itself), which helps us to understand whether the data has been modified during its journey or not and, if it has, how it affects our use case.
Each organization has its own set of data sources that constitute their specific data ecosystem. Remember that one organization's data sources may not be the same as another organization's.
The data evangelist within the organization should always focus on identifying which sources of data are more relevant than others for a given set of use cases that the organization is trying to resolve.
This feeds into our next topic, what constitutes a data ecosystem?
Nowadays, data comes from a variety of sources, at varying speeds, and in a number of different formats. Understanding data and its relevance is the most important task for any data-driven organization.
To understand the importance of data, the data czars in an organization should look at all possible sources of data that may be important to them. Being far-sighted helps, although, given the pace of modern society, it is almost impossible to gather data from every relevant source. Hence, it is important that the person/people involved in identifying relevant data sources are also well aware of the business landscape in which they operate. This knowledge will help tremendously in averting problems later. Data source identifiers should also be aware that data can be sourced both inside and outside of an organization, since, at the broadest level, data is first classified as being either internal or external data.
Given the opportunity, internal data should first be converted into information. Handling internal data first helps the organization to understand its importance early in the life cycle, without needing to set up a complex system, thereby making the process agile. In addition, it also gives the technical team an opportunity to understand what technology and architecture would be most appropriate in their situation. Such a distinction also helps organizations to not only put a reliability rating on data, but also to define any security rules in connection with the data.
So, what are the different sources of data that an organization can utilize to its benefit? The following diagram depicts part of the landscape that constitutes the data ecosystem. I say "part" because the landscape is so huge that listing every source would be impossible:
The previously mentioned data sources can be categorized as internal or external, depending upon the business segment in which an organization operates. For example, for an organization such as Facebook, all of the social media-related data on its website constitutes an internal source, whereas the same data for an advertising firm represents an external source of data.
As you may have already noticed, the preceding set of data can broadly be classified into three sub-categories:
Structured data
This type of data has a well-defined structure that can be parsed easily by any standard machine parser. It usually comes with a schema that defines the structure of the data. For example, incoming data in XML format with an associated XML schema constitutes what is known as structured data. Examples of such data include Customer Relationship Management (CRM) data and ERP data.
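To make the idea of schema-backed structured data concrete, here is a minimal Java sketch, assuming hypothetical customer.xml and customer.xsd files, that uses the standard javax.xml.validation API to check an incoming document against its schema before the data is accepted:

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.File;

public class StructuredDataCheck {
    public static void main(String[] args) throws Exception {
        // Load the XML schema (customer.xsd is a placeholder) describing the expected structure.
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("customer.xsd"));

        // Validate the incoming document; a document that does not conform raises an exception.
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(new File("customer.xml")));
        System.out.println("customer.xml conforms to customer.xsd");
    }
}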
Semi-structured data
Semi-structured data consists of data that does not have a formal schema associated with it. Log data from different machines can be regarded as semi-structured data. For example, a firewall log statement consists of the following fields as a minimum: the timestamp, host IP, destination IP, host port, and destination port, as well as some free text describing the event that took place resulting in the generation of the log statement.
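As an illustration, the following sketch parses one such log line with a regular expression. The exact line layout shown is an assumption made for the example; real appliances each use their own format, which is precisely what makes this data semi-structured:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirewallLogParser {
    // Assumed layout: "<timestamp> <srcIp>:<srcPort> -> <dstIp>:<dstPort> <free text>"
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+)\\s+(\\d{1,3}(?:\\.\\d{1,3}){3}):(\\d+)\\s+->\\s+(\\d{1,3}(?:\\.\\d{1,3}){3}):(\\d+)\\s+(.*)$");

    public static void main(String[] args) {
        String line = "2018-07-31T10:15:30Z 10.0.0.5:51432 -> 203.0.113.7:443 connection allowed";
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            System.out.println("timestamp   = " + m.group(1));
            System.out.println("source      = " + m.group(2) + ":" + m.group(3));
            System.out.println("destination = " + m.group(4) + ":" + m.group(5));
            System.out.println("description = " + m.group(6));
        } else {
            // Lines that do not match still carry information and would typically
            // be retained as raw free text rather than discarded.
            System.out.println("unparsed: " + line);
        }
    }
}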
Unstructured data
Finally, we have data that is unstructured. When I say unstructured, what I really mean is that, looking at the data, it is hard to derive any structured information directly from the data itself. It does not mean that we can't get information from the unstructured data. Examples of unstructured data include video files, audio files, and blogs, while most of the data generated on social media also falls under the category of unstructured data.
One thing to note about any kind of data is that, more often than not, each piece of data will have metadata associated with it. For example, when we take a picture using our cellphone, the picture itself constitutes the data, whereas its properties, such as when it was taken, where it was taken, what the focal length was, its brightness, and whether it was modified by software such as Adobe Photoshop, constitutes its metadata.
Sometimes, it is also difficult to categorize data clearly. Consider, for example, a security firm that sells hardware appliances that are installed at customer locations and collect access log data. The data belongs to the end customer, who has given permission for it to be used for a specific purpose: detecting security threats. Thus, even though the data resides with the security organization, it still cannot be used (without consent) for any purpose other than to detect a threat for that specific customer.
This brings us to our next topic: data sharing.
Whenever we collect data from an external source, there is always a clause about how that data can be used. At times, this aspect is implicit, but there are times when you need to provide an explicit mechanism for how the data can be shared by the collecting organization, both within and outside the organization. This distinction becomes important when data is shared between specific organizations. For example, one particular financial institution may decide to share certain information with another financial institution because both are part of a larger consortium that requires them to work collectively towards combating cyber threats. Now, the data on cyber threats that is collected and shared by these organizations may come with certain restrictions. Namely:
When should the shared data be used?
How may this data be shared with other parties, both within and outside an organization?
There are numerous ways in which such a sharing agreement can be reached between organizations. Two such ways, defined and used by many organizations, are:
The traffic light protocol
The information exchange policy framework from first.org
Let's discuss each of these briefly.
The traffic light protocol (hereinafter referred to as TLP, https://www.us-cert.gov/tlp and https://www.first.org/tlp) is a set of designations used to ensure that sensitive information is shared with the appropriate audience. TLP was created to facilitate the increased sharing of information between organizations. It employs four colors to indicate the expected sharing boundaries to be applied by the recipient(s):
RED
AMBER
GREEN
WHITE
TLP provides a simple and intuitive schema for indicating when and how sensitive information can be shared, thereby facilitating more frequent and effective collaboration. TLP is not a control marking or classification scheme. TLP was not designed to handle licensing terms, handling and encryption rules, and restrictions on action or instrumentation of information. TLP labels and their definitions are not intended to have any effect on freedom of information or sunshine laws in any jurisdiction.
TLP is optimized for ease of adoption, human readability, and person-to-person sharing; it may be used in automated sharing exchanges, but is not optimized for such use.
The source is responsible for ensuring that recipients of TLP information understand and can follow TLP sharing guidance.
If a recipient needs to share the information more widely than is indicated by the original TLP designation, they must obtain explicit permission from the original source.
The United States Computer Emergency Readiness Team provides the following definition of TLP, along with its usage and sharing guidelines:
RED: Not for disclosure, restricted to participants only.
When it should be used: Sources may use TLP:RED when information cannot be effectively acted upon by additional parties, and could impact on a party's privacy, reputation, or operations if misused.
How it may be shared: Recipients may not share TLP:RED information with any parties outside of the specific exchange, meeting, or conversation in which it was originally disclosed. In the context of a meeting, for example, TLP:RED information is limited to those present at the meeting. In most circumstances, TLP:RED should be exchanged verbally or in person.

AMBER: Limited disclosure, restricted to participants' organizations.
When it should be used: Sources may use TLP:AMBER when information requires support to be effectively acted upon, yet carries risks to privacy, reputation, or operations if shared outside of the organizations involved.
How it may be shared: Recipients may only share TLP:AMBER information with members of their own organization, and with clients or customers who need to know the information to protect themselves or prevent further harm. Sources are at liberty to specify additional intended limits associated with the sharing: these must be adhered to.

GREEN: Limited disclosure, restricted to the community.
When it should be used: Sources may use TLP:GREEN when information is useful for making all participating organizations, as well as peers within the broader community or sector, aware.
How it may be shared: Recipients may share TLP:GREEN information with peers and partner organizations within their sector or community, but not via publicly accessible channels. Information in this category can be circulated widely within a particular community. TLP:GREEN information may not be released outside of the community.

WHITE: Disclosure is not limited.
When it should be used: Sources may use TLP:WHITE when information carries minimal or no foreseeable risk of misuse, in accordance with applicable rules and procedures for public release.
How it may be shared: Subject to standard copyright rules, TLP:WHITE information may be distributed without restriction.
Remember that this is guidance and not a rule. Therefore, if an organization feels the need for further types of restriction, it may certainly add them, provided the receiving entity is made aware of them and is not opposed to the extension.
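If TLP designations need to travel with the data through an automated pipeline, they can be modelled as a simple type. The following enum is an illustrative sketch based on the definitions above, not an official API:

public enum TlpColor {

    RED("Not for disclosure, restricted to participants only."),
    AMBER("Limited disclosure, restricted to participants' organizations."),
    GREEN("Limited disclosure, restricted to the community."),
    WHITE("Disclosure is not limited.");

    private final String sharingBoundary;

    TlpColor(String sharingBoundary) {
        this.sharingBoundary = sharingBoundary;
    }

    public String sharingBoundary() {
        return sharingBoundary;
    }

    // Only TLP:WHITE information may be distributed without restriction,
    // subject to standard copyright rules.
    public boolean publiclyReleasable() {
        return this == WHITE;
    }
}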
The information exchange policy framework (https://www.first.org/iep) was put together by FIRST for Computer Security Incident Response Teams (CSIRT), security communities, organizations, and vendors who may consider implementation with a view to supporting their information sharing and information exchange initiatives.
The IEP framework is composed of four different policy types: Handling, Action, Sharing, and Licensing (HASL).
Let's look at each of these briefly.
Policy statement: ENCRYPT IN TRANSIT
Type: HANDLING
Description: States whether the information received has to be encrypted when it is retransmitted by the recipient.
Enumerations:
MUST: Recipients MUST encrypt the information received when it is retransmitted or redistributed.
MAY: Recipients MAY encrypt the information received when it is retransmitted or redistributed.
Required: NO

Policy statement: ENCRYPT IN REST
Type: HANDLING
Description: States whether the information received has to be encrypted by the recipient when it is stored.
Enumerations:
MUST: Recipients MUST encrypt the information received when it is stored.
MAY: Recipients MAY encrypt the information received when it is stored.
Required: NO
Policy statement: PERMITTED ACTIONS
Type: ACTION
Description: States the permitted actions that recipients can take upon receiving information.
Enumerations:
NONE: Recipients MUST NOT act upon the information received.
CONTACT FOR INSTRUCTION: Recipients MUST contact the providers before acting upon the information received. An example is where information redacted by the provider could be derived by the recipient and the affected parties identified.
INTERNALLY VISIBLE ACTIONS: Recipients MAY conduct actions on the information received that are only visible on the recipient's internal networks and systems, and MUST NOT conduct actions that are visible outside of the recipient's networks and systems, or that are visible to third parties.
EXTERNALLY VISIBLE INDIRECT ACTIONS: Recipients MAY conduct indirect, or passive, actions on the information received that are externally visible and MUST NOT conduct direct, or active, actions.
EXTERNALLY VISIBLE DIRECT ACTIONS: Recipients MAY conduct direct, or active, actions on the information received that are externally visible.
Required: NO

Policy statement: AFFECTED PARTY NOTIFICATIONS
Type: ACTION
Description: Recipients are permitted to notify affected third parties of a potential compromise or threat. Examples include permitting National CSIRTs to send notifications to affected constituents, or a service provider contacting affected customers.
Enumerations:
MAY: Recipients MAY notify affected parties of a potential compromise or threat.
MUST NOT: Recipients MUST NOT notify affected parties of potential compromises or threats.
Required: NO
Policy statement: TRAFFIC LIGHT PROTOCOL
Type: SHARING
Description: Recipients are permitted to redistribute the information received within the scope of redistribution, as defined by the enumerations. The enumerations "RED", "AMBER", "GREEN", and "WHITE" in this document are to be interpreted as described in the FIRST traffic light protocol policy.
Enumerations:
RED: Personal for identified recipients only.
AMBER: Limited sharing based on a need-to-know basis.
GREEN: Community-wide sharing.
WHITE: Unlimited sharing.
Required: NO

Policy statement: PROVIDER ATTRIBUTION
Type: SHARING
Description: Recipients could be required to attribute or anonymize the provider when redistributing the information received.
Enumerations:
MAY: Recipients MAY attribute the provider when redistributing the information received.
MUST: Recipients MUST attribute the provider when redistributing the information received.
MUST NOT: Recipients MUST NOT attribute the provider when redistributing the information received.
Required: NO

Policy statement: OBFUSCATE AFFECTED PARTIES
Type: SHARING
Description: Recipients could be required to obfuscate or anonymize information that could be used to identify the affected parties before redistributing the information received. Examples include removing affected parties' IP addresses, or removing the affected parties' names but leaving the affected parties' industry vertical prior to sending a notification.
Enumerations:
MAY: Recipients MAY obfuscate information concerning the specific parties affected.
MUST: Recipients MUST obfuscate information concerning the specific parties affected.
MUST NOT: Recipients MUST NOT obfuscate information concerning the specific parties affected.
Required: NO
Policy statement: EXTERNAL REFERENCE
Type: LICENSING
Description: This statement can be used to convey a description or reference to any applicable licenses, agreements, or conditions between the producer and receiver, for example, specific terms of use, contractual language, agreement name, or a URL.
Enumerations: There are no EXTERNAL REFERENCE enumerations; this is a free-form text field.
Required: NO

Policy statement: UNMODIFIED RESALE
Type: LICENSING
Description: States whether the recipient MAY or MUST NOT resell the information received unmodified, or in a semantically equivalent format; for example, transposing the information from a .csv file format to a .json file format would be considered semantically equivalent.
Enumerations:
MAY: Recipients MAY resell the information received.
MUST NOT: Recipients MUST NOT resell the information received unmodified or in a semantically equivalent format.
Required: NO
Policy statement: POLICY ID
Type: METADATA
Description: Provides a unique ID to identify a specific IEP implementation.
Required: YES

Policy statement: POLICY VERSION
Type: METADATA
Description: States the version of the IEP framework that has been used, for instance, 1.0.
Required: NO

Policy statement: POLICY NAME
Type: METADATA
Description: This statement can be used to provide a name for an IEP implementation, for instance, FIRST Mailing List IEP.
Required: NO

Policy statement: POLICY START DATE
Type: METADATA
Description: States the UTC date from when the IEP is effective.
Required: NO

Policy statement: POLICY END DATE
Type: METADATA
Description: States the UTC date that the IEP is effective until.
Required: NO

Policy statement: POLICY REFERENCE
Type: METADATA
Description: This statement can be used to provide a URL reference to the specific IEP implementation.
Required: NO
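As a rough sketch of how these statements could be represented in application code (the class names and the handful of statement values shown are illustrative choices, not part of the IEP specification itself), consider the following:

import java.util.LinkedHashMap;
import java.util.Map;

public class IepPolicy {

    // The four HASL policy types, plus METADATA for statements describing the policy itself.
    enum PolicyType { HANDLING, ACTION, SHARING, LICENSING, METADATA }

    static final class Statement {
        final String name;
        final PolicyType type;
        final String value;

        Statement(String name, PolicyType type, String value) {
            this.name = name;
            this.type = type;
            this.value = value;
        }
    }

    private final Map<String, Statement> statements = new LinkedHashMap<>();

    IepPolicy add(String name, PolicyType type, String value) {
        statements.put(name, new Statement(name, type, value));
        return this;
    }

    String get(String statementName) {
        Statement s = statements.get(statementName);
        return s == null ? null : s.value;
    }

    public static void main(String[] args) {
        IepPolicy policy = new IepPolicy()
                .add("POLICY ID", PolicyType.METADATA, "example-iep-001")
                .add("ENCRYPT IN TRANSIT", PolicyType.HANDLING, "MUST")
                .add("PERMITTED ACTIONS", PolicyType.ACTION, "INTERNALLY VISIBLE ACTIONS")
                .add("TRAFFIC LIGHT PROTOCOL", PolicyType.SHARING, "AMBER")
                .add("UNMODIFIED RESALE", PolicyType.LICENSING, "MUST NOT");

        // A consumer would consult the policy before acting on or redistributing the data.
        System.out.println("Sharing scope: TLP:" + policy.get("TRAFFIC LIGHT PROTOCOL"));
    }
}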
It is very important for any organization to understand where it is gathering data from and what obligations are associated with that data before using it, whether for internal purposes or for sharing. A lack of clear understanding here could lead to breaches of trust, which is never a desirable situation.
Now that data sharing is behind us, let's talk a little about the nature of the data itself. Data usually exhibits three characteristics that are essential to understand when designing a data collection system (to be discussed over the next couple of chapters). The industry calls these the 3 V's of data. Let's briefly look at what the 3 V's stand for and why they are important to bear in mind when designing the system.
The 3 V's stand for:
Volume
Variety
Velocity
Today's world consists of petabytes of data being emitted by a variety of sources, be it social media, sensors, blockchains, video, audio, or even transactional systems. The volume of data collected can be huge, depending on the nature of the business, and if you are reading this book, chances are that you already have huge volumes of data and need to understand how to handle them effectively.
Variety refers to the different data formats. Relational databases, Excel files, or even simple text files are all examples of different data formats. A system should be capable of handling new varieties of data as and when they arrive. Extensibility is the key component for a data-intensive system when it comes to handling varieties of data. Data variety can be broadly classified into three major blocks:
Structured: Data that has a well-defined schema associated with it, for example, relational data and XML-formatted data.
Semi-structured: Data whose structure can be anticipated but that does not always conform to a set standard. Examples include JSON-formatted data and columnar data.
Unstructured: Binary large object (BLOB) data, for example, video and audio.
Velocity denotes the speed at which the data arrives and becomes stale. There was a time when even one month-old data was considered fresh. In today's world, where social media has taken the place of traditional information sources and sensors have replaced human log books, we can't even rely on yesterday's data as it may have already become stale. The data moves at near real time and, if not processed properly and in time, may represent a lost opportunity for the business.
Until now, we have only discussed the data ecosystem, what it consists of, what requirements are associated with it in terms of the ability to share, and the types of data you can expect to collect. None of this will make sense unless we associate the data ecosystem and collection with the value drivers associated with that data for an organization.
Broadly speaking, any data that an organization decides to collect or use has two motivations/intentions behind it. Either the organization wants to use it for improving its own system/processes, or it wants to place itself strategically in a situation where it can generate new opportunities for itself.
Better decision-making, whether quicker or more proactive, translates directly into revenue for a company.
Improvements in internal capabilities, either via automation or improved business process management, save time and money, thereby giving organizations more opportunities to innovate and, in turn, reducing costs further and opening up new business opportunities.
As you may have already noticed, this is a circle of dependencies and, once an organization can find a balance within this circle, the only way for it is upward.
Having understood the data ecosystem and its constituent elements, let's finally look at some practical use cases that could lead an organization to start thinking in terms of data rather than processes.
Until a few years ago, the best way to combat external cyber security threats was to create a series of firewalls that were assumed to be impenetrable and would thereby secure the systems behind them. To combat internal cyber attacks, anti-virus software was considered more than sufficient. This traditional defense gave a sense of security, but it was more of an illusion than a reality. Attackers are well versed in hiding in plain sight, so looking for "known bad" signatures did not help in combating Advanced Persistent Threats (APTs). As systems grew in complexity, attack patterns also became more sophisticated, with coordinated hacking efforts persisting over long periods and exploiting every aspect of a vulnerable system.
For example, one use case within the security domain is the detection of anomalies within generated machine data, where the data is explored to identify any non-homogeneous event or transaction within a seemingly homogeneous set of events. An example of anomaly detection is when banks perform sophisticated transformations and context association on incoming credit card transactions to identify whether a transaction looks suspicious. Banks do this to prevent fraudsters from stealing from them, either directly or indirectly.
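A heavily simplified sketch of that idea is shown below; it flags a credit card transaction whose amount lies far outside the cardholder's historical spending pattern. Real systems combine many more signals and far richer context than this single statistic:

import java.util.Arrays;

public class TransactionAnomalyCheck {

    // Flags an amount as anomalous if it lies more than three standard
    // deviations away from the cardholder's historical mean (a toy rule).
    static boolean isAnomalous(double[] history, double amount) {
        double mean = Arrays.stream(history).average().orElse(0.0);
        double variance = Arrays.stream(history)
                .map(x -> (x - mean) * (x - mean))
                .average().orElse(0.0);
        double stdDev = Math.sqrt(variance);
        return stdDev > 0 && Math.abs(amount - mean) > 3 * stdDev;
    }

    public static void main(String[] args) {
        double[] history = {23.10, 41.75, 18.20, 35.00, 27.80, 30.45};
        System.out.println(isAnomalous(history, 29.99));   // false: a typical purchase
        System.out.println(isAnomalous(history, 2400.00)); // true: far outside the usual range
    }
}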
Organizations responded by creating hunting teams that looked at various data (for example, system logs, network packets, and firewall access logs) with a view to doing the following:
Hunting for undetected intrusions/breaches
Detecting anomalies and raising alerts in connection with any malicious activity
The main challenges for organizations in terms of creating these hunting teams were the following:
The fact that data is scattered throughout the organization's IT landscape
Data quality issues and multiple data versioning issues
Access and contractual limitations
All of these requirements and challenges created the need for a platform that can support various data formats and that is capable of:
Long-term data retention
Correlating different data sources
Providing fast access to correlated data
Real-time analysis
Just as it is important to capture data for various efficiencies and insights, it is equally important to understand what data an organization does not want. You may think that you need everything, but the truth is that you do not. Understanding what you actually need is critical to hastening the journey toward becoming a data-driven organization.
