DESIGNING BIG DATA PLATFORMS

Provides expert guidance and valuable insights on getting the most out of Big Data systems.

An array of tools is currently available for managing and processing data; some are ready‐to‐go solutions that can be deployed immediately, while others require complex and time‐intensive setups. With such a vast range of options, choosing the right tool to build a solution can be complicated, as can determining which tools work well with each other. Designing Big Data Platforms provides clear and authoritative guidance on the critical decisions necessary for successfully deploying, operating, and maintaining Big Data systems. This highly practical guide helps readers understand how to process large amounts of data with well‐known Linux tools and database solutions, use effective techniques to collect and manage data from multiple sources, transform data into meaningful business insights, and much more.

Author Yusuf Aytas, a software engineer with extensive big data experience, discusses the design of the ideal Big Data platform: one that meets the needs of data analysts, data engineers, data scientists, software engineers, and a spectrum of other stakeholders across an organization. Detailed yet accessible chapters cover key topics such as stream data processing, data analytics, data science, data discovery, and data security. This real‐world manual for Big Data technologies:

* Provides up‐to‐date coverage of the tools currently used in Big Data processing and management
* Offers step‐by‐step guidance on building a data pipeline, from basic scripting to distributed systems
* Highlights and explains how data is processed at scale
* Includes an introduction to the foundation of a modern data platform

Designing Big Data Platforms: How to Use, Deploy, and Maintain Big Data Systems is a must‐have for all professionals working with Big Data, as well as researchers and students in computer science and related fields.
Pages: 511
Year of publication: 2021
Cover
Title Page
Copyright
List of Contributors
Preface
Acknowledgments
Acronyms
Introduction
1 An Introduction: What's a Modern Big Data Platform
1.1 Defining Modern Big Data Platform
1.2 Fundamentals of a Modern Big Data Platform
2 A Bird's Eye View on Big Data
2.1 A Bit of History
2.2 What Makes Big Data
2.3 Components of Big Data Architecture
2.4 Making Use of Big Data
3 A Minimal Data Processing and Management System
3.1 Problem Definition
3.2 Processing Large Data with Linux Commands
3.3 Processing Large Data with PostgreSQL
3.4 Cost of Big Data
4 Big Data Storage
4.1 Big Data Storage Patterns
4.2 On‐Premise Storage Solutions
4.3 Cloud Storage Solutions
4.4 Hybrid Storage Solutions
5 Offline Big Data Processing
5.1 Defining Offline Data Processing
5.2 MapReduce Technologies
5.3 Apache Spark
5.4 Apache Flink
5.5 Presto
6 Stream Big Data Processing
6.1 The Need for Stream Processing
6.2 Defining Stream Data Processing
6.3 Streams via Message Brokers
6.4 Streams via Stream Engines
7 Data Analytics
7.1 Log Collection
7.2 Transferring Big Data Sets
7.3 Aggregating Big Data Sets
7.4 Data Pipeline Scheduler
7.5 Patterns and Practices
7.6 Exploring Data Visually
8 Data Science
8.1 Data Science Applications
8.2 Data Science Life Cycle
8.3 Data Science Toolbox
8.4 Productionalizing Data Science
9 Data Discovery
9.1 Need for Data Discovery
9.2 Data Governance
9.3 Data Discovery Tools
10 Data Security
10.1 Infrastructure Security
10.2 Data Privacy
10.3 Law Enforcement
10.4 Data Security Tools
11 Putting All Together
11.1 Platforms
11.2 Big Data Systems and Tools
11.3 Challenges
12 An Ideal Platform
12.1 Event Sourcing
12.2 Kappa Architecture
12.3 Data Mesh
12.4 Data Reservoirs
12.5 Data Catalog
12.6 Self‐service Platform
12.7 Abstraction
12.8 Data Guild
12.9 Trade‐offs
12.10 Data Ethics
Appendix A: Further Systems and Patterns
A.1 Lambda Architecture
A.2 Apache Cassandra
A.3 Apache Beam
Appendix B: Recipes
B.1 Activity Tracking Recipe
B.2 Data Quality Assurance
B.3 Estimating Time to Delivery
B.4 Incident Response Recipe
B.5 Leveraging Spark SQL Metrics
B.6 Airbnb Price Prediction
Bibliography
Index
End User License Agreement
Chapter 4
Table 4.1 Comparison of big data storage patterns
Chapter 10
Table 10.1 Gateway to component mapping.
Chapter 2
Figure 2.1 MapReduce execution steps.
Figure 2.2 HDFS architecture.
Figure 2.3 YARN architecture.
Figure 2.4 Components of Big Data architecture.
Chapter 4
Figure 4.1 Provisioned data warehouse architecture.
Figure 4.2 Tree data warehouse architecture.
Figure 4.3 Virtual warehouse architecture.
Chapter 5
Figure 5.1 Offline Big Data processing overview.
Figure 5.2 Pig compilation and execution steps.
Figure 5.3 Pig Latin to MapReduce.
Figure 5.4 Hive architecture overview.
Figure 5.5 Spark RDD flow.
Figure 5.6 Narrow vs wide transformations.
Figure 5.7 Spark layers.
Figure 5.8 Spark execution plan.
Figure 5.9 Spark high‐level architecture.
Figure 5.10 Spark stages.
Figure 5.11 Spark cluster in detail.
Figure 5.12 Presto architecture.
Figure 5.13 Presto logical plan.
Figure 5.14 Presto stages.
Chapter 6
Figure 6.1 Average page views by five‐minute intervals.
Figure 6.2 A message broker.
Figure 6.3 Kafka topic.
Figure 6.4 Kafka offset.
Figure 6.5 Kafka producer/consumer.
Figure 6.6 Samza job structure.
Figure 6.7 Samza architecture.
Figure 6.8 Anatomy of Kafka Streams application.
Figure 6.9 Pulsar topic subscription modes.
Figure 6.10 Pulsar architecture.
Figure 6.11 Pulsar functions programming model.
Figure 6.12 Pulsar functions worker.
Figure 6.13 Flink architecture.
Figure 6.14 Flink barriers.
Figure 6.15 Flink task scheduling.
Figure 6.16 Execution graph.
Figure 6.17 Storm layers.
Figure 6.18 Storm spouts and bolts.
Figure 6.19 Storm architecture.
Figure 6.20 Heron architecture.
Figure 6.21 Spark streaming micro‐batches.
Chapter 7
Figure 7.1 Flume agent deployment.
Figure 7.2 Fluentd data pipeline.
Figure 7.3 Fluentd deployment.
Figure 7.4 Gobblin architecture.
Figure 7.5 Data aggregation stages.
Figure 7.6 Celery executor architecture.
Figure 7.7 Domain‐driven data sets.
Chapter 8
Figure 8.1 Data science life cycle.
Figure 8.2 A sample data science model deployment.
Figure 8.3 k‐means usage segments.
Figure 8.4 TensorFlow architecture.
Figure 8.5 Apache PredictionIO architecture.
Chapter 9
Figure 9.1 Metacat architecture.
Figure 9.2 Amundsen architecture.
Figure 9.3 Atlas types.
Figure 9.4 Atlas architecture.
Chapter 10
Figure 10.1 Ranger access conditions.
Figure 10.2 Ranger architecture.
Figure 10.3 Sentry architecture.
Figure 10.4 Knox services.
Figure 10.5 Knox architecture.
Chapter 11
Figure 11.1 Big Data platform verticals.
Chapter 12
Figure 12.1 Shopping cart events.
Figure 12.2 Kappa architecture.
Figure 12.3 Subdomain data sets.
Figure 12.4 Multiple data reservoirs.
Figure 12.5 Data catalog feedback loop.
Figure 12.6 Self‐service Big Data platform overview.
Appendix A
Figure A.1 Lambda architecture.
Figure A.2 Cassandra architecture.
Figure A.3 Apache Beam programming model.
Figure A.4 Apache Beam pipeline.
Figure A.5 Apache Beam ParDo processing.
Appendix B
Figure B.1 Data ingestion components.
Figure B.2 Computation components.
Figure B.3 Streaming reference architecture.
Figure B.4 Streaming feature vector. Given a trained ML model and feature ve...
Figure B.5 The time‐to‐delivery estimation data flow.
Figure B.6 Spark SQL metrics.
Figure B.7 Spark SQL metrics pipeline.
Figure B.8 Airbnb histogram.
Yusuf Aytas
Dublin, Ireland
This edition first published 2021
© 2021 John Wiley and Sons, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Yusuf Aytas to be identified as the author of this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
The contents of this work are intended to further general scientific research, understanding, and discussion only and are not intended and should not be relied upon as recommending or promoting scientific method, diagnosis, or treatment by physicians for any particular patient. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of medicines, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each medicine, equipment, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. 
Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging‐in‐Publication Data Applied for:
ISBN: 9781119690924
Cover design by Wiley
Cover image: © monsitj/iStock/Getty Images
Ömer E. Avşaroğulları
San Francisco, CA
United States
Louis Calisi
Acton, MA
United States
Ender Demirkaya
Seattle, WA
United States
Alperen Eraslan
Ankara
Turkey
Ang Gao
Dublin
Ireland
Zehra Kavasoğlu
London
United Kingdom
David Kjerrumgaard
Henderson, Nevada
United States
Ari Miller
Los Angeles, CA
United States
Alper Mermer
Manchester
United Kingdom
Will Schneider
Waltham, MA
United States
Big data has been one of the buzzwords of the last couple of decades. Although there is a certain hype around big data, it is now seen as increasingly essential for organizations, and new systems pop up regularly. This book does not focus on any particular new system or project, but rather on how these systems are built and used. The book gives an overview of the systems and lays out their architecture. It mentions, and sometimes goes into detail about, both new and old big data systems such as Apache Hadoop and Apache Spark, and it attempts to give an overall view of designing big data platforms regardless of the technology.
The book offers general knowledge about designing big data platforms. Whether you are a front‐end engineer, backend engineer, data engineer, data analyst, or data scientist, you will see different aspects of designing big data platforms. The book at times goes into technical detail on some subjects through code or design, but this should not prevent the non‐specialist from understanding the underlying concepts. If you are an expert on big data platforms, the book may serve as a refresher on things you are already aware of.
The book gives a general overview of big data technologies as they relate to designing big data platforms. It covers many interesting technologies, but it is not a reference book for any of them. It dives deep into certain technologies, but overall it tries to establish a framework rather than focusing on particular tools.
The book begins by going over big data, big data platforms, and a simple data processing system. It then discusses various aspects of big data such as storage, processing, discovery, security, and so forth. At the end, it summarizes the systems it has covered and discusses some useful patterns for designing big data platforms. Appendix A covers other technologies and patterns that do not quite fit the flow of the book, and Appendix B presents recipes, each offering a solution to a particular big data problem.
5 February 2021
Yusuf Aytas
Dublin
This book is an overview of many systems, patterns, and ideas. It attempts to combine experiences from academia and industry, with reference to many great articles, posts, and books, to provide a consolidated view on designing big data platforms. While writing this book, I gained enormous knowledge from many exceptional authors, engineers, analysts, scientists, and editors. I am very grateful for all the information that people have made available.
Yusuf Aytas
ACL
access control list
API
application programming interface
CI
continuous integration
CPU
central processing unit
CQL
Cassandra Query Language
CSV
comma‐separated values
DAG
directed acyclic graph
DBMS
database management system
DB
database
DDD
domain‐driven design
DS
date string
ETL
extract, transform, load
FIFO
first in first out
GPU
graphics processing unit
HDFS
Hadoop distributed file system
HH
hour string
HQL
Hive query language
HTML
hypertext markup language
IO
input/output
IP
Internet Protocol
JSON
JavaScript Object Notation
LB
load balancer
ML
machine learning
NoSQL
not only SQL
OS
operating system
RDD
resilient distributed dataset
SDK
software development kit
SLA
service level agreement
SLO
service level objective
SQL
structured query language
TCP
transmission control protocol
TH
threshold
URL
uniform resource locator
VM
virtual machine
XML
extensible markup language
YARN
yet another resource negotiator
The value of big data, and of processing data to get actionable insights, has become a competitive advantage across the world. Organizations are still in a rush to collect, process, and derive value from big data. Designing big data platforms to collect, store, process, and discover big data is a significant challenge in many respects. Thanks to a collaborative push from engineers across the world and from several organizations, we have many great systems we can use to design a big data platform.
A big data platform consists of many components, and we often have many alternatives for the same job. Our task is to design a platform that caters to the needs and requirements of the organization. In doing so, we should choose the right tool for each job. Ideally, the platform should adapt to, accept, and evolve with new expectations. The challenge is to design a simple platform while keeping it cost‐efficient in terms of development, maintenance, deployment, and the actual running expense.
In this book, I present many different technologies for the same job. Some of them are off the shelf, while others are cutting edge. I want to give perspective on these systems so that we can create solutions based on the experience of others. Hopefully, we will all have a better grasp of designing big data platforms after reading this book.
After reading this chapter, you should be able to:
Define a modern Big Data platform
Describe expectations from data
Describe expectations from a platform
This chapter discusses the different aspects of designing Big Data platforms, in order to define what makes a Big Data platform modern and to set expectations for these platforms.
The key factor in defining a Big Data platform is the extent of the data. Big Data platforms involve amounts of data so large that they cannot be processed or stored by a few nodes. Thus, a Big Data platform is defined here as an infrastructure layer that can serve and process large amounts of data requiring many nodes. The requirements of the workload shape the number of nodes required for the job; for example, some workloads require tens of nodes for a few hours, while others need fewer nodes for days of work. The nature of the workloads depends on the use case.
Organizations use Big Data platforms for business intelligence, data analytics, and data science, among other purposes, because these disciplines identify, extract, and forecast information from the collected data, helping companies make informed decisions, improve their strategies, and evaluate parts of their business. The more data recorded across different aspects of the business, the better the understanding. The solutions for Big Data processing vary based on company strategy.
Companies can use either on‐site or cloud‐based solutions for their Big Data computing and storage needs. In either case, the various parts can be considered together as a Big Data platform. The cogs of the platform might differ in storage type, compute power, and life span; nevertheless, the platform as a whole remains responsible for the business needs.
What makes a Big Data platform modern remains unclear. A modern Big Data platform has several requirements, and to meet them correctly, expectations with regard to the data should be set first. Once a base of expectations from data is established, we can reason about a modern platform that can serve them.
Big Data may be structured, semi‐structured, or unstructured in a modern Big Data platform and may come from various sources with different frequencies and volumes. A modern Big Data platform should accept each data source in its current format and process it according to a set of rules. After processing, the prepared data should meet the following expectations.
Accessing prepared data depends on internal customer groups. The users of the platform can have a very diverse set of technical abilities. Some of them are engineers, who would like to get very deep and technical with the platform. On the other hand, some may be less technically savvy. The Big Data platform should ideally serve both ends of the customer spectrum.
Engineers dealing with the platform expect an application programming interface (API) to communicate with the platform at various integration points, since some of their tasks require coding or automation. Data analysts, meanwhile, expect to access the data through standard tooling like SQL, or to write an extract, transform, load (ETL) job to extract or analyze information. Lastly, the platform should offer a graphical user interface to those who simply want to see a performance metric or a business insight without a technical background.
Data is an invaluable asset for organizations, and securing it has become a crucial aspect of a modern Big Data platform. Safeguarding against a possible data breach is a big concern because a leak would result in financial losses, reduced customer trust, and damage to the overall reputation of the company.
Security risks should be eliminated, but users should be able to leverage the platform easily. Achieving both user‐friendliness and data protection requires a combination of different security measures such as authentication, access control, and encryption.
The organization should identify who can access the platform. At the same time, access to a particular class of data should be restricted to certain users or user groups. Furthermore, some of the data might contain critical information such as personally identifiable information (PII), which should be encrypted.
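The combination of group‐based access control and field‐level protection described above can be sketched as follows. The policy tables, group names, data sets, and masking scheme are hypothetical, purely for illustration; a real platform would use a dedicated authorization service and proper encryption rather than masking.

```python
# Hypothetical sketch: group-based access control plus masking of PII
# columns before rows are returned. All names here are invented.

# Access policy: data set -> groups allowed to read it
ACCESS_POLICY = {
    "clickstream": {"data-eng", "analytics"},
    "customer_profiles": {"data-eng"},
}

# Columns classified as PII per data set
PII_COLUMNS = {"customer_profiles": {"email", "phone"}}

def can_read(user_groups, dataset):
    """Check whether any of the user's groups may read the data set."""
    return bool(ACCESS_POLICY.get(dataset, set()) & set(user_groups))

def mask_pii(dataset, row):
    """Redact PII fields before returning a row to the caller."""
    pii = PII_COLUMNS.get(dataset, set())
    return {k: ("***" if k in pii else v) for k, v in row.items()}

row = {"email": "a@example.com", "phone": "555-0100", "country": "IE"}
print(can_read(["analytics"], "clickstream"))        # True
print(can_read(["analytics"], "customer_profiles"))  # False
print(mask_pii("customer_profiles", row)["country"]) # IE
```

The point of the sketch is that the access decision and the data classification live in one place, so restricting a user group or reclassifying a column requires no change to the consuming code.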
High‐quality data enables businesses to make healthier decisions, opens up new opportunities, and provides a competitive advantage. The data quality depends on factors such as accuracy, consistency, reliability, and visibility. A modern Big Data platform should support ways to accomplish accurate and consistent data between data sources to produce visible data definition and reliable processed data. The domain is the driving factor for a Big Data platform when it comes to data quality. Hence, the number of resources allocated to the data quality changes according to the domain. Some of the business domains might be quite flexible, while others would require strict rules or regulations.
Iterative development is an essential part of software engineering, so it is no surprise that it is also part of Big Data processing. A modern Big Data platform should make reprocessing easy. Once the data is produced, the platform should provide the infrastructure to extend the data easily. This is an important aspect because there are many ways things can go wrong when dealing with data, and one or more iterations may be necessary.
Moreover, previously obtained results should be reproducible: the platform should reprocess the data and achieve the same results when the given parameters are the same. It is also important that the platform offer mechanisms to detect deviations from the expected result.
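As a sketch of this reproducibility idea, the toy job and fingerprinting scheme below (all names invented) show how a platform could verify that a rerun with the same parameters produces the same result, and thereby detect deviations:

```python
# Sketch: fingerprint a job's output so reruns can be compared.
# The aggregate() job and its parameters are hypothetical.
import hashlib
import json

def fingerprint(records):
    """Stable hash of a result set, independent of record order."""
    canonical = json.dumps(sorted(records, key=json.dumps), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def aggregate(events, min_count):
    """A deterministic toy job: count events per user above a threshold."""
    counts = {}
    for e in events:
        counts[e["user"]] = counts.get(e["user"], 0) + 1
    return [{"user": u, "count": c} for u, c in counts.items() if c >= min_count]

events = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
first = fingerprint(aggregate(events, min_count=2))
second = fingerprint(aggregate(list(reversed(events)), min_count=2))
print(first == second)  # same parameters, same fingerprint
```

If a rerun's fingerprint differs from the stored one, the platform can alert the owner that the result deviated from what was expected.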
After establishing expectations regarding the data, we should discuss how the platform can meet them. Before starting, the importance of the human factor should be noted: ideal tooling can be built, but it is useful only in a collaborative environment, and much critical business information and processing depends on good communication and methods. This section presents an overview of the features of our ideal Big Data platform; we will not explain each feature in detail, since later chapters discuss them.
Ideally, the platform provides a storage layer that can scale in capacity, process an increasing number of reads and writes, accept different data types, and provide access permissions. Typical Big Data storage systems handle the capacity problem by scaling horizontally: new nodes can be introduced transparently to the applications backed by the system. With the advent of cloud providers, one can also employ cloud storage to deal with growing storage needs; a hybrid solution, where the platform uses both on‐site and cloud storage, is another option. While providing scalability in terms of volume and velocity, the platform should also provide solutions for backup, disaster recovery, and cleanup.
Backups are one of the hard problems of Big Data, as the vast amount of storage involved is overwhelming. One option is magnetic tapes, which are resilient to failures and do not require power when not in use. A practical option is relying on durable, low‐cost cloud storage. An expensive but very fast alternative is a secondary system that holds part or all of the data. With one of these solutions in place, the platform can perform periodic backups.
For disaster recovery from backups, keeping separate sets of data sorted by priority is an option, since retrieving backup data takes quite some time. Having different data sets also provides the ability to spin up multiple clusters to process critical data in parallel; the clusters can be spun up on separate hardware or, again, using a cloud provider. The key is to be able to define which data sets are business‐critical: categorizing and assigning a priority to each data set makes the recovery execution process‐driven.
The storage layer can lose space when data is replicated in many different ways with no process to clean it up. There are two ways to deal with data cleanup. The first is a retention policy: if all data sets have a retention policy, one can build processes that flush expired data on each execution. The second is proactive reclaiming of unused data space: to understand which data is not accessed, a process might look at the access logs and determine unused data. A reclaiming process should then warn the owners of the data; once the owners approve, the process should reclaim the space.
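A minimal sketch of the retention‐policy approach, with made‐up data set names, retention periods, and an in‐memory stand‐in for the storage layer:

```python
# Sketch: a retention-driven cleanup pass over partitioned data sets.
# Data set names, retention periods, and "storage" are hypothetical.
from datetime import datetime, timedelta

RETENTION_DAYS = {"raw_logs": 30, "daily_aggregates": 365}

def expired_partitions(storage, now):
    """Return (dataset, partition_date) pairs older than their retention."""
    stale = []
    for dataset, partitions in storage.items():
        cutoff = now - timedelta(days=RETENTION_DAYS[dataset])
        for day in partitions:
            if day < cutoff:
                stale.append((dataset, day))
    return stale

now = datetime(2021, 6, 1)
storage = {
    "raw_logs": [datetime(2021, 4, 1), datetime(2021, 5, 20)],
    "daily_aggregates": [datetime(2020, 1, 1)],
}
for dataset, day in expired_partitions(storage, now):
    print(f"flush {dataset} partition {day:%Y-%m-%d}")
```

A real flush job would run this check on a schedule and delete the returned partitions; the access‐log‐based reclaiming described above would feed a similar list, but gated on owner approval.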
The workload management consists of managing resources across multiple requests, prioritization of tasks, meeting service‐level agreements (SLAs), and assessing the cost. The platform should enable important tasks to finish on time, respond to ad hoc requests promptly, and use available resources judiciously to complete tasks quickly and measure the cost. To accomplish these, the platform should provide an approach for resource sharing, visibility for the entire platform, monitoring around individual tasks, and cost reporting structure.
Resource‐sharing strategies affect both the performance of the platform and fairness toward individual jobs. On one hand, when no other task is running, the platform should use as many resources as possible to perform a given task; on the other hand, such a previously initiated job then slows down all requests that start after it. Therefore, most Big Data systems provide a queuing mechanism to separate resources, enabling sharing across different business units. The problem is less dramatic when the platform uses cloud‐based technologies: a cloud solution can give the platform the versatility to run tasks on short‐lived clusters that automatically scale to meet demand. With this option, the platform can employ as many nodes as needed to perform tasks faster.
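The queuing idea can be illustrated with a weighted‐share sketch. The queue names, weights, and slot counts are invented, and real schedulers (such as those in YARN) are far more sophisticated, but the principle is the same: active queues get their configured share, and idle queues donate theirs.

```python
# Sketch of weighted queues: each business unit owns a share of the
# cluster, and idle queues donate their share to the active ones.
# Queue names, weights, and slot counts are made up for illustration.
TOTAL_SLOTS = 100
QUEUE_WEIGHTS = {"analytics": 0.5, "data-science": 0.3, "ad-hoc": 0.2}

def slots_for(queue, active_queues):
    """Weighted share of slots for `queue`, given which queues are busy."""
    active_weight = sum(QUEUE_WEIGHTS[q] for q in active_queues)
    return round(TOTAL_SLOTS * QUEUE_WEIGHTS[queue] / active_weight)

# All queues busy: everyone gets their configured share.
print(slots_for("analytics", ["analytics", "data-science", "ad-hoc"]))  # 50
# Only one queue busy: it may use the whole cluster.
print(slots_for("ad-hoc", ["ad-hoc"]))  # 100
```

This captures the trade‐off in the text: full utilization when the cluster is otherwise idle, and bounded interference when several business units compete.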
Oftentimes, the visibility of the platform in terms of usage is not a priority, yet making good judgments is difficult without easily accessible performance information. Furthermore, the platform can consist of different sets of clusters, which makes it even harder to visualize platform activity at a snapshot in time. For each technology used under the hood, the platform should be able to access performance metrics, or calculate them itself, and report them in graphical dashboards.
A large number of tasks performed on the platform can slow down a cluster or even bring it down. It is important to set SLAs for each task and monitor individual tasks for their runtime and resource allocation. When there is an oddity in an executing task, the platform should notify the owner of the task or abort it entirely. If the platform makes use of cloud computing technologies, it is extremely important to abort tasks, or not start executing them at all, based on their estimated costs.
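A sketch of this SLA monitoring with escalating actions; the task names and thresholds are illustrative only, and the "notify at 1× SLA, abort at 2× SLA" policy is an assumption, not a rule from the book:

```python
# Sketch: compare each task's runtime against its SLA and choose an
# escalating action. Task names and thresholds are made up.
TASK_SLAS = {"hourly_rollup": 600, "daily_report": 3600}  # seconds

def check_task(task, runtime_seconds):
    """Return 'ok', 'notify-owner', or 'abort' based on SLA overrun."""
    sla = TASK_SLAS[task]
    if runtime_seconds <= sla:
        return "ok"
    if runtime_seconds <= 2 * sla:
        return "notify-owner"  # odd but tolerable: alert the task owner
    return "abort"             # runaway task: stop it before costs grow

print(check_task("hourly_rollup", 300))   # ok
print(check_task("hourly_rollup", 900))   # notify-owner
print(check_task("hourly_rollup", 1500))  # abort
```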
I believe cost should be an integral part of the platform: it is extremely important to be transparent with customers. If the platform can tell how much it costs, or would cost, to run their workloads, it is up to the customers of the platform to decide how much money they can spend, and the team maintaining the platform is no longer responsible for the cost. If one business unit wants to spin up a big cluster or buy new hardware, it is their problem to justify the need.
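Such cost transparency could be sketched as simple usage attribution per business unit; the hourly rate and the usage records below are fabricated numbers:

```python
# Sketch: attribute platform cost to business units from recorded
# node-hours. The rate and the usage records are fabricated.
NODE_HOUR_RATE = 0.35  # currency units per node-hour (made-up rate)

usage = [
    {"unit": "marketing", "nodes": 20, "hours": 3},
    {"unit": "finance", "nodes": 5, "hours": 8},
    {"unit": "marketing", "nodes": 10, "hours": 1},
]

def cost_by_unit(records):
    """Sum node-hours per business unit and price them."""
    totals = {}
    for r in records:
        totals[r["unit"]] = totals.get(r["unit"], 0) + r["nodes"] * r["hours"]
    return {unit: round(node_hours * NODE_HOUR_RATE, 2)
            for unit, node_hours in totals.items()}

print(cost_by_unit(usage))  # marketing billed for 70 node-hours, finance for 40
```

With a report like this, each unit sees what its workloads cost, which is exactly the transparency argued for above.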
ETL stands for extract, transform, and load. ETL is the core of Big Data processing and therefore the heart of a modern Big Data platform. The platform should provide one or more ETL solutions that manage the experience end to end: it should control the flow from data generation through processing to deriving meaning from the data. ETL developers should be able to develop, test, stage, and deploy their changes. Besides, the platform should hide technical details where possible and provide advanced features.
The size of the company is a factor in the number of storage systems required, because the ETL engine should be able to support multiple sources and targets. The more integration points it offers, the more useful the ETL engine becomes. Ideally, the ETL engine should have a plug‐in capability where each kind of data source or target is configured by an additional plug‐in; when there is demand for a new source or target, the platform simply requires another plug‐in to support it.
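The plug‐in capability might look like the following registry sketch. The class and plug‐in names are hypothetical, not an API from any particular ETL engine; the point is that adding a source requires only registering a new class, with no change to the engine itself.

```python
# Sketch: a plug-in registry for ETL sources. Targets would mirror
# this pattern. All class and plug-in names are invented.
SOURCE_PLUGINS = {}

def register_source(name):
    """Decorator that registers a source plug-in under a name."""
    def wrap(cls):
        SOURCE_PLUGINS[name] = cls
        return cls
    return wrap

@register_source("postgres")
class PostgresSource:
    def read(self):
        return ["row-from-postgres"]

@register_source("s3")
class S3Source:
    def read(self):
        return ["object-from-s3"]

def open_source(name):
    """The engine looks a source up by name; new sources need only a plug-in."""
    return SOURCE_PLUGINS[name]()

print(open_source("postgres").read())
```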
The platform should encourage ownership of flows and data. The ETL engine should make it obvious that the underlying data is owned by the same user group: if a user does not have rights to modify the flow, the right to access the data is not granted either, and vice versa. The ETL engine itself may require exclusive rights on the data storage layer to manage access permissions for users and user groups.
Support for the development life cycle is an important aspect. The platform should let developers build their ETLs locally, test the flow, review the changes, stage the changes, and finally deploy to production. The key to local development is the ability to generate partial test data. Since the platform already accepts the data, partial creation of source data should be made easy by supplying a sampling percentage; once the test data is available, testing becomes much easier.
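Generating partial test data from a sampling percentage could be as simple as the following sketch; the fixed seed makes the sample reproducible, which matters when local tests must behave the same on every run:

```python
# Sketch: create partial test data by sampling a percentage of the
# source records. Seeding makes the sample deterministic.
import random

def sample_records(records, percent, seed=42):
    """Keep roughly `percent`% of the records, deterministically."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < percent / 100.0]

source = list(range(10_000))
test_data = sample_records(source, percent=1)
print(len(test_data))  # roughly 100 of the 10,000 records
```

In practice, sampling would be pushed down to the storage layer (e.g. a SQL `TABLESAMPLE`) rather than done in application code, but the developer‐facing knob stays the same: a single percentage.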
In most modern ETL tooling, an intuitive user interface for creating flows is missing. Common ETL engines require some understanding of technical details such as source control and column mapping. Some users of the platform may not be technically savvy, or it might simply be easier to work with a user interface rather than code. A user interface for dragging and dropping data from multiple sources and merging it in different ways would help configure trivial flows faster.
The meaning of data can get lost quickly in big organizations. As the amount of data grows, so does the metadata. To ensure that metadata definitions are shared across the company, the platform should offer metadata discovery capabilities. The metadata should be collected from various sources into a single repository, where its definition can be updated or modified to reflect the context. Additional information such as owner, lineage, and related data sets is useful when reviewing metadata. Moreover, the repository should be quickly searchable by various dimensions.
Nobody likes manual jobs. The platform should provide a data discovery tool that automatically updates metadata by crawling each configured data source. When crawling for metadata, the discovery tool should collect information such as attribute definitions, types, and technology‐specific details, e.g. partition keys. Once the information is stored in a single repository, the relevant information should be shown to users, who can then update anything related to the metadata.
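A single crawler pass over one source might be sketched as follows; the `Table`/`Column` classes are hypothetical stand-ins for a real source's catalog API, and a plain dictionary stands in for the central repository.

```python
from dataclasses import dataclass

@dataclass
class Column:
    name: str
    type: str

@dataclass
class Table:
    name: str
    columns: list
    partition_key: str = None

def crawl_source(tables, repository):
    """Upsert attribute definitions, types, and technology-specific
    details for every table of one source into the central repository
    (a plain dict here; a real platform would use a database)."""
    for table in tables:
        repository[table.name] = {
            "columns": [(c.name, c.type) for c in table.columns],
            "partition_key": table.partition_key,
        }

repo = {}
crawl_source(
    [Table("orders", [Column("id", "bigint"), Column("ts", "timestamp")],
           partition_key="ts")],
    repo,
)
print(repo["orders"]["partition_key"])  # ts
```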
The discovery tool can use other information, such as queries or foreign keys, to form the lineage where possible. Additional processing and storage will be necessary, as most storage engines do not keep queries forever. If queries are associated with a piece of metadata, a sample of them can be shown when the metadata is viewed. Finding the owner of the data can be tricky: the nominal owner of a table does not reveal much, because the metadata may really belong to another team or group. Thus, ownership may be handled in a semi‐automated fashion by having people from the organization confirm the owning group for a given set of metadata.
A single repository brings the ability to search for everything in one place. The search should be able to filter metadata by type and data source. The metadata should also be searchable by any attribute or definition.
Reporting is a necessary process for any business to quickly review the performance and status of its different areas. A modern Big Data platform should provide a tool that offers rich visualizations, a user‐friendly interface for exploration, dashboard creation and sharing, and aggregation functions over the displayed data sources. Furthermore, the tooling should integrate seamlessly with the existing data storage layer.
The ability to show data in a wide array of visualizations helps users quickly grasp a summary. To make visualizations load faster, the tooling can rely on client‐side caching to avoid querying the underlying data storage on every view. This is a significant optimization, as it both saves computing power and allows the requested dashboard to load swiftly.
Once the reporting tooling supports a common paradigm like SQL, it is easy to integrate most SQL‐capable storage engines. The tooling should support various drivers to communicate with the storage layer and retrieve data from various sources. The tooling itself should understand SQL well enough to generate the queries that load a dashboard and apply aggregation functions.
As in any other platform, many system or user errors will occur in a Big Data platform. An upgrade to the storage layer may change how currencies are handled, or someone may calculate the item price in euros incorrectly. On top of this, there might be node failures, changes to connection settings, and more. All of these problems can delay or interrupt data processing. As the complexity of the system grows, so does the number of edge cases; Big Data platforms are quite complex, being built on distributed systems. Preparing for failures is the only viable strategy, since even the detection of problems is complex. The platform should have protection against node/process failures, validation for schema and data, and SLAs per task or flow.
Nodes fail and processes lag. Even though most Big Data systems are designed to deal with occasional hiccups, failures still become problematic in practice. The Big Data platform should monitor the health of each Big Data system. The best way to verify that everything is functional is to execute small tasks against each of the systems. If these small tasks fail for any reason, manual intervention should be undertaken. If problems are not detected early, they can disrupt one or more data flows. Sometimes the small tasks execute with no problem, but bigger tasks lag for various reasons. Such problems should be resolved at the flow level, since it is the specific big tasks that cannot be processed that cause the errors. We should have SLAs for the delivery of each full flow; if a flow does not meet its agreement, the problem should be escalated within the platform.
The platform should also check schema changes for data flowing through the systems. A schema validation framework is necessary to ensure that changes to the schema are backward compatible. Moreover, validating the schema itself is not enough: the data can be corrupted even if it conforms to all validations, since a new change may corrupt the data at its core. To deal with such problems, basic anomaly detection should be performed, and complex anomaly detection might be required. Basic anomaly detection might only check row counts or the number of duplicates, while complex anomaly detection requires complex queries over time. The platform should offer both as protection mechanisms.
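A basic count-and-duplicate check of the kind described might look like the following sketch; the function name, interface, and tolerance threshold are illustrative, not taken from any specific platform.

```python
def basic_anomaly_check(rows, key, expected_count, tolerance=0.1):
    """Flag a batch whose row count deviates from expectations by more
    than `tolerance` (as a fraction), or that contains duplicate keys."""
    issues = []
    if abs(len(rows) - expected_count) > expected_count * tolerance:
        issues.append(f"row count {len(rows)} deviates from expected {expected_count}")
    keys = [row[key] for row in rows]
    duplicates = len(keys) - len(set(keys))
    if duplicates:
        issues.append(f"{duplicates} duplicate value(s) for key '{key}'")
    return issues

# A suspiciously small batch with a duplicated key trips both checks.
batch = [{"id": 1}, {"id": 2}, {"id": 2}]
print(basic_anomaly_check(batch, "id", expected_count=100))
```

Complex anomaly detection would replace these in-memory checks with queries over historical data, comparing today's counts and distributions against a trailing window.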
Ideally, each part of the Big Data platform should have relevant test suites. However, testing is often skipped at many stages under pressure to be productive, and errors follow. Beyond decent unit test coverage, the platform should perform integration tests between the systems, performance tests, and failure testing, and automate the running of all tests.
The importance of isolating a component and ensuring that it produces the expected behavior for a given input is indisputable. Yet such suites can seem cumbersome to implement for Big Data platforms, one reason being the need to stub various external systems during testing. However, there is no way other than unit testing to verify that a component behaves as we expect. Thus, it is necessary to have unit tests in place to continuously validate behavior against new changes.
Big Data platforms have many systems underneath. Each system communicates in different ways, and these systems also need to talk to each other. Sometimes an upgrade or a new change breaks the contract between them. To detect such issues before they make it to production, we should have integration test suites between the systems. The integration test suites should ideally run for every change that is pushed to any of the systems. If running per change is difficult, the integration tests can be scheduled to run multiple times a day to detect potential issues.
Load testing is crucial when a new system is introduced to a Big Data platform. Since we are working with Big Data systems, a virtual load should be created in a staging environment by streaming the expected volume of data to the new system. The expected volume can be estimated through prediction analysis. Once the volume is confirmed, the data should be fed in and the system validated to ensure it can cope. Moreover, we should also stress the system with extra load. We would like to answer questions about the throughput versus latency trade‐off in different scenarios, or find the point at which the system becomes unresponsive.
Occasionally, testing the system under extreme scenarios is a beneficial exercise to see the worst case. Users may want to see how the system behaves in the presence of a kernel panic on multiple nodes or in a split‐brain scenario. Moreover, it is interesting to observe how the system handles CPU slowdown, high packet loss, and slow disk access. One can add many other exercises to test against. Lastly, we would like to see how the system degrades under random problems.
Designing, developing, and maintaining Big Data systems is complicated and requires an all‐around team effort and coordination. We need to draw the big picture and coordinate teams or team members according to the needs. Otherwise, we end up in situations where nobody knows what to do next, or we get lost in rabbit holes. To prevent such frustrations, a structured plan is needed, in which the progress of each component is visible and the next steps are clear. Hence, I propose a common structure with the following phases: planning, design, development, maintenance, and deprecation.
The planning phase involves resource allocation, cost estimation, and scheduling of the Big Data systems. In the design phase, the requirements are worked out and a prototype is built. Once we have a working system, integrations and interactions with other systems or end users are designed. The next phase is development, where the software is built and the deployment pipelines, including several test suites, are prepared. Once the software is ready, the maintenance phase begins. If for some reason we decide not to invest in the system any longer, we move to the deprecation phase, where our clients/customers are migrated from the system to an alternative offering.
After reading this chapter, you should be able to
Recount the chronology of Big Data processing
List qualities that characterize Big Data
List components of Big Data platforms
Describe use cases for Big Data
The development of Big Data platforms has spanned more than two decades. To provide an overview of this evolution, this chapter presents the qualities of Big Data, the components of a Big Data platform, and use cases of Big Data.
Computing has advanced drastically in the past two decades, with significant improvements from networking to data storage. Despite the rapid changes, the definition of Big Data remains relevant. This section presents the evolution of Big Data chronologically.
The term Big Data was used and described by Cox and Ellsworth (1997), who presented two ideas: Big Data collections and Big Data objects. Big Data collections are streamed by remote sensors as well as satellites. The challenge is quite similar to today's Big Data, where data is unstructured and comes from different sources. Big Data objects are produced by large‐scale simulations of computational dynamics and weather modeling. The combined problem of Big Data objects and Big Data collections is again comparable to today's Big Data challenges, where data is too large to fit in the memory or on the disk of a single machine.
In his presentations regarding Big Data, Mashey (1998) noted that the need for storage was growing faster and more data was being created on the Internet. Given that the explosion of widely accessible data led to problems in creating, understanding, and moving it, Mashey (1998) concluded that processing large amounts of data would require more computing, network, and disk capacity, and thus more machines to distribute the data across.
By that time, Big Data had become popular. Weiss and Indurkhya (1998) reported that at the start of the Big Data revolution, running data mining algorithms was similar to operating a warehouse, and discussed concepts related to extract, transform, load (ETL) for data mining purposes. Law et al. (1999) refer to a multi‐threaded streaming pipeline architecture for large structured data sets used to create visualization systems. The popularity of the term has increased even more since 2000, and it has been cited in many academic articles such as Friedman et al. (2000) and Ang and Teo (2000), among others.
The uses of Big Data were limited until Dean and Ghemawat (2004) introduced MapReduce, a paradigm that drastically shifted the perspective on processing Big Data. It is a simple yet very powerful programming model that can process large sets of data. A MapReduce program takes a set of input key/value pairs and produces a set of output key/value pairs. The programmer specifies two functions: Map and Reduce. The Map function takes the user input and produces intermediary key/value pairs. The framework then shuffles the intermediary pairs so that pairs with the same intermediary key end up on the same node. Once shuffled, the Reduce function takes an intermediary key and the set of values associated with that key and merges them into a smaller set of values.
Let us see this powerful programming model in action. Consider the following code, where we count the occurrences of each word in a given set of documents.
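A minimal Python sketch of this word count follows; the `run_mapreduce` driver is merely an in‐process stand‐in for the framework's grouping and shuffling, not part of the programming model itself.

```python
from collections import defaultdict

def map_fn(document_name, contents):
    # Emit a (word, 1) pair for every word in the document.
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Merge all counts emitted for the same word into one sum.
    yield word, sum(counts)

def run_mapreduce(documents):
    # The framework groups intermediary pairs by key before reducing.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            groups[word].append(count)
    return dict(pair for word, counts in groups.items()
                for pair in reduce_fn(word, counts))

docs = {"d1": "big data big plans", "d2": "big wins"}
print(run_mapreduce(docs))  # {'big': 3, 'data': 1, 'plans': 1, 'wins': 1}
```

In a real deployment the Map calls run on many machines, and the grouped pairs are shipped over the network to the reducers, but the Map/Reduce contract is exactly what the two small functions above express.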
In the Map function, we simply take the document name and its contents. We iterate through each word and emit the result as a key(word)/value(count) pair. Afterward, the framework collects all pairs with the same key(word) from all outputs and groups them together. The grouped pairs are then fed into the Reduce function, which simply sums the counts and emits the result as a key(word)/value(sum) pair.
Figure 2.1 MapReduce execution steps.
The Map function calls are partitioned across multiple machines into M splits. Once the mapping phase is complete, the intermediary keys are partitioned across machines into R pieces using a Partition function. This sequence of actions is illustrated in Figure 2.1.
The MapReduce framework splits the inputs into M pieces of typically block size and then begins running copies of the program on a cluster of machines. One copy of the program becomes the master copy, which is used when assigning tasks to other machines.
In the mapping phase, machines pull the data locally, run the Map function, and simply emit the result as intermediary key/value pairs. The intermediary pairs are then partitioned into R partitions by the Partition function.
The MapReduce framework then shuffles and sorts the data by the intermediary key.
When a reduce slave has read all intermediate data, it runs the Reduce function and outputs the final data.
The MapReduce framework is expected to handle very large data sets, so the master keeps track of the slaves and checks in with each one periodically. If it does not receive a response from a slave within a given timeout period, it marks the task as failed and reschedules it on another machine. Rescheduling tasks is possible because each task is idempotent. In the case of a master failure, the program can be restarted from the last checkpoint.
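This failure detection can be modeled as a simple timeout check over the last heartbeat seen from each slave; the following is a toy illustration with hypothetical names, not Hadoop's actual implementation.

```python
import time

def tasks_to_reschedule(last_heartbeat, timeout_s, now=None):
    """Return the tasks whose slave has not reported a heartbeat within
    `timeout_s` seconds; the master would mark these failed and
    reschedule them on other machines."""
    now = now if now is not None else time.time()
    return [task for task, seen_at in last_heartbeat.items()
            if now - seen_at > timeout_s]

# map-2 last reported 5 seconds ago, past the 3-second timeout.
heartbeats = {"map-1": 100.0, "map-2": 95.0, "reduce-1": 99.5}
print(tasks_to_reschedule(heartbeats, timeout_s=3, now=100.0))  # ['map-2']
```

Idempotence is what makes this safe: re-running `map-2` elsewhere produces the same intermediary output even if the original slave was merely slow rather than dead.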
Shortly after Google published the paper on MapReduce, Doug Cutting, facing the same issues while developing Apache Nutch (2004), a web crawler, decided to use this new programming model. He switched the infrastructure behind the web crawler to his own implementation of MapReduce, Hadoop, named after his son's elephant toy (Olson, 2010). As a direct competitor to Google, Yahoo! decided to support the development of Hadoop, and Hadoop became the open‐source solution for the Big Data explosion.
Weblogs, which recorded user activity on a website, became, along with structured data, a valuable source of information for companies. As there was no commercially available software to process such large sets of data, Hadoop became the tool for processing Big Data. Hadoop enabled companies to use commodity hardware to run jobs over Big Data. Instead of relying on the hardware to deliver high availability, Hadoop assumes all hardware is prone to failure and handles these failures automatically. It consists of two major components: a distributed file system and a framework to process jobs in a distributed fashion.
The Apache Hadoop (2006) Distributed File System (HDFS) is a fault‐tolerant, massively scalable, distributed file system designed to run on commodity hardware. HDFS aims to deliver the following promises:
Failure recovery
Streaming data access
Support very large data sets
Write once and read many times
Collocate the computation with data
Portability
HDFS is based on a master/slave architecture, where the NameNode is the master and the DataNodes are the slaves. The NameNode coordinates communication with clients and organizes their access to files. It manages the file system namespace, executes operations like opening, closing, renaming, and deleting files, and determines the mapping of data over DataNodes. The DataNodes, on the other hand, are many; they live on the nodes holding the data and serve read/write requests from file system clients as well as instructions from the NameNode.
HDFS supports a traditional file system organization. A client can create a directory and store multiple files or directories in it. The NameNode keeps the information about the file system namespace, and any update to the file system has to go through the NameNode. HDFS supports user quotas and access permissions. Moreover, HDFS allows clients to specify the replication count per file.
The NameNode keeps the entire file system metadata in memory, including the BlockMap, which maps blocks to files. It logs every transaction to a transaction log called the EditLog; a new file creation or a replication factor change results in an entry in the EditLog. Periodically, the NameNode saves the in‐memory data to a file called the FsImage and truncates the EditLog. This process is called a checkpoint, and the period for creating checkpoints is configurable. When the NameNode restarts, it reads everything from the FsImage and applies the additional changes recorded in the EditLog. Once the transactions are safely applied, the NameNode can truncate the EditLog and create another checkpoint.
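The interplay of the EditLog, the FsImage, and checkpointing can be sketched as a toy model; this deliberately ignores HDFS's actual on‐disk formats and durability machinery and only shows the recovery idea.

```python
import json

class TinyNameNode:
    """Toy model of checkpoint-based recovery: state is rebuilt from
    the last FsImage snapshot plus a replay of the EditLog."""

    def __init__(self):
        self.namespace = {}  # in-memory metadata: path -> replication
        self.edit_log = []   # transactions since the last checkpoint

    def create_file(self, path, replication=3):
        self.edit_log.append(("create", path, replication))  # log first
        self.namespace[path] = replication                   # then apply

    def checkpoint(self):
        fs_image = json.dumps(self.namespace)  # persist the snapshot
        self.edit_log = []                     # truncate the EditLog
        return fs_image

    @classmethod
    def restart(cls, fs_image, edit_log):
        node = cls()
        node.namespace = json.loads(fs_image)   # load the checkpoint
        for op, path, replication in edit_log:  # replay later edits
            if op == "create":
                node.namespace[path] = replication
        return node

nn = TinyNameNode()
nn.create_file("/logs/app.log")
image = nn.checkpoint()
nn.create_file("/logs/other.log")
recovered = TinyNameNode.restart(image, nn.edit_log)
print(sorted(recovered.namespace))  # ['/logs/app.log', '/logs/other.log']
```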
HDFS can store very large files over a cluster of machines. HDFS divides each file into blocks and replicates them over machines according to the replication factor. Files are written once and read many times, so modification of files is not supported except for appends and truncation. The NameNode keeps track of all the blocks for a given file. Moreover, the NameNode periodically receives a block report from each DataNode containing the list of data blocks that the DataNode holds.
The placement of data over machines is critical for HDFS's reliability. HDFS employs a rack‐aware replication policy to improve the resiliency of the system against node and rack failures. For the common case of a replication count of 3, HDFS writes one replica of the block to the local node, a second replica to a node on a different rack, and a third replica to another node on that same remote rack. This replication behavior is illustrated in Figure 2.2.
Figure 2.2 HDFS architecture.
HDFS has to be resilient to failures. For this reason, DataNodes send heartbeat messages to the NameNode. If the NameNode does not receive the heartbeat message, due to node failure or a network partition, it marks those DataNodes as failed. The death of a DataNode can cause the number of replicas for a block to drop below the minimum; the NameNode tracks these events and initiates block replication when necessary.
HDFS can occasionally move data from one DataNode to another when the free space on a DataNode drops below a certain threshold. Moreover, HDFS can decide to create more replicas if there is fast‐growing demand for a certain file. When interacting with DataNodes, the client calculates a checksum of the data it receives and retrieves another copy of the block if the data is corrupt.
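The client‐side integrity check amounts to recomputing a checksum of the received bytes and comparing it to the one stored alongside the block. HDFS uses CRC‐based checksums per chunk; the SHA‐256 digest below is purely illustrative.

```python
import hashlib

def verify_block(data: bytes, expected_checksum: str) -> bool:
    """Recompute the checksum of a received block and compare it with
    the checksum stored alongside the block; a mismatch means the
    client should fetch another replica."""
    return hashlib.sha256(data).hexdigest() == expected_checksum

block = b"block contents as stored"
stored = hashlib.sha256(block).hexdigest()
print(verify_block(block, stored))          # True
print(verify_block(b"corrupted", stored))   # False
```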
The NameNode is a single point of failure for HDFS: if it fails, manual intervention is needed. To prevent catastrophic failures, HDFS has the option to keep multiple copies of the FsImage and EditLog. A common approach is to write the FsImage and EditLog to highly available storage such as NFS or a distributed edit log.
As HDFS stores files in blocks, applications write their data to these blocks once and read them repeatedly. When a client begins writing data to HDFS, the NameNode provides the list of DataNodes to which the client will write. The client starts writing data to the first DataNode, which replicates the data to the next one on the list, which in turn replicates it to the third on the list. HDFS pipelines data replication to any number of replicas, with each node receiving data and replicating it to the next node on the list.
Writes from the client do not reach a DataNode directly. The client stages data up to an HDFS block size and then contacts the NameNode for the DataNodes to write to. After that, the client writes to the specified destination. When the write operation ends, the client contacts the NameNode, which commits the file creation.
Hadoop MapReduce is an open‐source implementation of Google's MapReduce programming model. Hadoop MapReduce runs compute nodes next to DataNodes to allow the framework to execute tasks on the same node where the data lives. The MapReduce framework executes MapReduce jobs with the help of YARN. YARN, Yet Another Resource Negotiator, is a job scheduling and resource management system built to cater to the needs of Big Data.
The rising adoption of MapReduce led to new applications of Big Data, owing to the data's availability in HDFS and the multiple possible ways to process the same data. However, MapReduce is batch‐oriented and hence lacks support for real‐time applications. Making an existing Hadoop cluster do more is cost‐effective in terms of administration and maintenance (Murthy et al., 2014). Hence, the Hadoop community wanted a true multitenancy solution for these requirements.
Separating resource management from job scheduling and monitoring is a primary design concern for YARN. YARN provides a global resource manager and an application master per application. The resource manager is the arbitrator that shares resources among applications and employs a node manager per node to control containers and monitor resource usage. The application master, in turn, negotiates resources with the resource manager and works alongside the node managers to execute jobs.
The resource manager has two major components: the scheduler and the applications manager. The scheduler is responsible for allocating resources to applications; it is a pure scheduler in that it is limited to scheduling, and it neither tracks nor monitors nor restarts failed tasks. The scheduler allocates containers, which are an abstraction over CPU, network, disk, memory, and so forth, and it has a pluggable policy for dividing resources among applications based on their queues.
The ApplicationsManager accepts job submissions, negotiates for ApplicationMaster containers, monitors them, and restarts these containers upon failure. Each ApplicationMaster negotiates resources with the scheduler and keeps track of its containers.
As shown in Figure 2.3, this architecture allows YARN to scale better because there is no bottleneck at the resource manager, which no longer has to deal with monitoring the various applications. Furthermore, it enables moving all application‐specific code into the application master, so that one can run workloads other than MapReduce. Nevertheless, YARN has to protect itself from ApplicationMasters, since they run user code.
Figure 2.3 YARN architecture.
YARN provides a fairly generic resource model. The YARN resource manager can track any countable resource. By default, it monitors CPU and memory for all applications and queues. YARN is designed to handle multiple applications at once; to do so, the scheduler has extensive knowledge about the status of the resources. Each ApplicationMaster negotiates with the resource manager for resources via a resource request with the following attributes:
Priority of the request.
Resource name: the host or rack on which the allocation is desired.
Resource required for each request, e.g. CPU/memory.
Number of containers, of the above specifications, which are required by the application.
Relax locality, defaulting to true, which tells the ResourceManager whether the application wants locality to be loose.
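For illustration, such a request can be pictured as a simple record; the field names below mirror the attributes just listed and are not tied to any specific YARN client library's API.

```python
# Hypothetical shape of one ApplicationMaster resource request.
resource_request = {
    "priority": 1,                                   # ordering among requests
    "resource_name": "/rack-17",                     # host, rack, or "*" for any
    "capability": {"memory_mb": 2048, "vcores": 2},  # resources per container
    "num_containers": 4,                             # containers of this spec
    "relax_locality": True,                          # allow looser placement
}
print(resource_request["num_containers"])  # 4
```

The ApplicationMaster would send many such requests to the scheduler, which matches them against available containers according to its queue policy.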
Big Data has been described by many authors from diverse perspectives. Mayer‐Schönberger and Cukier (2013) refer to Big Data as things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value, in ways that change markets, organizations, the relationship between citizens and governments, and more. Jacobs (2009) defines Big Data as data whose size forces one to look beyond the tried‐and‐true methods that are prevalent at the time, while for Pope et al. (2014), it is large and complex digital datasets that typically require nonstandard computational facilities for storage, management, and analysis. Russom et al. (2011) observed that Big Data has become a growing challenge for organizations as they deal with large and fast‐growing sources of data that also present a complex range of analysis and use problems. Lohr (2012) describes Big Data as a meme and a marketing term, but also shorthand for advancing trends in technology that open the door to a new approach to understanding the world and making decisions. I think all of the above definitions describe Big Data quite well from different aspects. I define Big Data as a new breed of data, resulting from technological advancements in storage and a growing understanding of value extraction, that differs from traditional structured data in terms of volume, velocity, variety, and complexity.
The volume of data is rapidly growing. Simply put, the world generates more data year over year, ranging from monitoring to logging. There are more applications, more devices, and more sensors. As the volume of data keeps getting bigger, it remains one of the predominant characteristics of Big Data.
What we consider large today may not be as large in the future, since data storage technologies continue to advance. Moreover, the type of data also determines whether it is relatively big. If we store video files, it is rather easy to accumulate high volumes rapidly; if we collect activity logs, the storage need might not grow as fast. So, defining "big" depends on the type of data and the amount of storage used.
While the common argument for defining what is big has been the number of machines needed to store the data, I argue that the number of dedicated machines is the defining factor. This is a subtle distinction, since in the age of cloud computing several virtual machines may share the hardware that stores the data. Moreover, the number of machines that makes data big is unclear and difficult to define, but I argue it should be at least in the order of tens.
The rate at which data accumulates per second increases drastically every day, owing to more sales transactions and more social activity per second. The number of new devices and sensors drives the need for real‐time processing. Companies want to deliver the best experience to their users within a fraction of a second by making use of the generated data. Real‐time data provides valuable information about demographics, location, and past activities that companies use to deliver real customer value.
The push for real‐time analytics creates the need to capture and process data as fast as possible. Conventional data platforms are not suited to dealing with huge, spontaneous data feeds. With the growing demand, Big Data technologies are becoming part of the puzzle. Hence, velocity is another defining factor for Big Data.
The Internet generates data from various sources, ranging from social networks, database tables, spreadsheets, activity feeds, and emails to sensors and IoT devices, covering activities from user tracking to activity feeds. The variety of data sources results in heterogeneous data: structured, semi‐structured, and unstructured. Most of these sources cannot be consumed readily and require preprocessing before integration into an application. To make holistic decisions, businesses want to use all the data they can get.
