PEEK "UNDER THE HOOD" OF BIG DATA ANALYTICS

The world of big data analytics grows ever more complex. And while many people can work superficially with specific frameworks, far fewer understand the fundamental principles of large-scale, distributed data processing systems and how they operate.

In Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood, renowned big-data experts and computer scientists Drs. Supun Kamburugamuve and Saliya Ekanayake deliver a practical guide to applying the principles of big data to software development for optimal performance. The authors discuss the foundational components of large-scale data systems and walk readers through the major software design decisions that define performance, application type, and usability. You'll learn how to recognize performance and distributed-operation problems in your applications, diagnose them, and effectively eliminate them by relying on the bedrock big data principles explained within.

Moving beyond individual frameworks and APIs for data processing, this book unlocks the theoretical ideas that operate under the hood of every big data processing system. Ideal for data scientists, data architects, DevOps engineers, and developers, Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood shows readers how to:

* Identify the foundations of large-scale, distributed data processing systems
* Make major software design decisions that optimize performance
* Diagnose performance problems and distributed operation issues
* Understand state-of-the-art research in big data
* Explain and use the major big data frameworks and understand what underpins them
* Use big data analytics in the real world to solve practical problems
Page count: 586
Publication year: 2021
Cover
Title Page
Introduction
History of Data-Intensive Applications
Data Processing Architecture
Foundations of Data-Intensive Applications
Who Should Read This Book?
Organization of the Book
Scope of the Book
References
CHAPTER 1: Data Intensive Applications
Anatomy of a Data-Intensive Application
Parallel Applications
Application Classes and Frameworks
What Makes It Difficult?
Summary
References
Notes
CHAPTER 2: Data and Storage
Storage Systems
Data Formats
Data Replication
Data Partitioning
NoSQL Databases
Message Queuing
Summary
References
Notes
CHAPTER 3: Computing Resources
A Demonstration
Computer Clusters
Data Analytics in Clusters
Distributed Application Life Cycle
Computing Resources
Cluster Resource Managers
Job Scheduling
Summary
References
Notes
CHAPTER 4: Data Structures
Virtual Memory
The Need for Data Structures
Object and Text Data
Vectors and Matrices
Table
Summary
References
Notes
CHAPTER 5: Programming Models
Introduction
Data Structures and Operations
Message Passing Model
Distributed Data Model
Task Graphs (Dataflow Graphs)
Batch Dataflow
Streaming Dataflow
SQL
Summary
References
Notes
CHAPTER 6: Messaging
Network Services
Messaging for Data Analytics
Distributed Operations
Distributed Operations on Arrays
Distributed Operations on Tables
Advanced Topics
Summary
References
Notes
CHAPTER 7: Parallel Tasks
CPUs
Accelerators
Task Execution
Batch Tasks
Streaming Tasks
Summary
References
CHAPTER 8: Case Studies
Apache Hadoop
Apache Spark
Apache Storm
Kafka Streams
PyTorch
Cylon
Rapids cuDF
Summary
References
Notes
CHAPTER 9: Fault Tolerance
Dependable Systems and Failures
Recovering from Faults
Checkpointing
Streaming Systems
Batch Systems
Summary
References
CHAPTER 10: Performance and Productivity
Performance Metrics
Performance Factors
Finding Issues
Programming Languages
Productivity
Summary
References
Notes
Index
Copyright
Dedication
About the Authors
About the Editor
Acknowledgments
End User License Agreement
Chapter 1
Table 1-1: CSV File Structure for User's Data
Chapter 2
Table 2-1: Storage Area Network Channels and Protocols
Table 2-2: Original Dataset with Four Rows
Table 2-3: First Table with Only First Three Columns of Original Table
Table 2-4: Second Table with the First Column and the Last Two Columns from the ...
Table 2-5: Horizontal Partition with the Year 2019
Table 2-6: Horizontal Partition with the Year 2020
Table 2-11: Key Value Data Store
Table 2-12: Document Store
Table 2-13: Wide Column Store
Chapter 3
Table 3-1: Data Analytics Frameworks and Resource Scheduling
Table 3-2: Different Resource Managers
Table 3-3: Data Center Tiers
Table 3-4: Massive Data Centers of the World
Table 3-5: Top 5 Supercomputers in the World
Chapter 4
Table 4-1: Time for Accessing 100 Million Records in Three Methods
Table 4-2: Time for Accessing 200 Million Records in Three Methods
Table 4-3: CSV File with Attributes
Table 4-4: Column Arrays
Table 4-5: Fundamental Relational Algebra Operations of Pandas DataFrame
Chapter 5
Table 5-1: Applications, Data Structures, and Programming Models
Table 5-2: Common Primitive Types Supported by Data Systems
Table 5-3: Common Complex Types Supported by Data Systems
Table 5-4: Distributed Operations on Arrays
Table 5-5: Basic Relational Algebra Operations
Table 5-6: Auxiliary Operations on Tables
Table 5-7: Message Passing Implementations
Table 5-8: Table Operations
Table 5-9: Input Records of Two Datasets
Table 5-10: Streaming Operations
Table 5-11: Streaming Distributed Operations
Table 5-12: Popular Operations on Windowed Data
Chapter 6
Table 6-1: Popular MPI Implementations
Table 6-2: Common Distributed Operations on Arrays
Table 6-3: Distributed Operations for Tables
Table 6-4: Table-Based Operations and Their Implementations
Chapter 7
Table 7-1: Synchronization Primitives
Table 7-2: Use of Task Graph Model for Data Analytics
Table 7-5: Partitioning Strategies
Table 7-6: Local Operations on Streams
Table 7-7: Windowed Operations
Table 7-8: Distributed Operations
Chapter 10
Table 10-1: Profiling Tools
Table 10-2: Frameworks and Languages
Table 10-3: Frameworks and Target Applications
Table 10-4: Cloud vs. On-Premises for Data Processing
Table 10-5: CPUs and GPUs for Data-Intensive Applications
Table 10-6: Cloud Services for Data Pipelines
Introduction
Figure I-1: Overall data processing architecture
Figure I-2: Data science workflow
Chapter 1
Figure 1-1: Linear partitioning of files among processes
Figure 1-2: Runtime components of a data-intensive application
Figure 1-3: Steps in making a parallel program
Figure 1-4: The logical view of the shared memory model
Figure 1-5: Shared memory implementation over distributed resources
Figure 1-6: The logical view of the distributed memory model
Figure 1-7: Distributed memory implementation over distributed resources
Figure 1-8: The logical view of hybrid memory model
Figure 1-9: The logical view of PGAS memory model
Figure 1-10: PGAS example with three tasks
Figure 1-11: Four application classes
Figure 1-12: An example workflow
Chapter 2
Figure 2-1: Three ways to connect storage to computer clusters: direct-attac...
Figure 2-2: SAN system with a Fibre Channel network
Figure 2-3: Storage abstractions: block storage (left), file storage (middle...
Figure 2-4: Storing values of a row in consecutive memory spaces
Figure 2-5: Storing values of a column in consecutive memory spaces
Figure 2-6: Parquet file structure
Figure 2-7: Avro file structure
Figure 2-8: Pipelined replication
Figure 2-9: CAP theorem
Figure 2-10: Message brokers in enterprises
Figure 2-11: Message queue
Figure 2-12: Message queue replication
Chapter 3
Figure 3-1: Cluster with separate storage
Figure 3-2: Cluster with compute and storage nodes
Figure 3-3: Cloud-based cluster using a storage service
Figure 3-4: Apache Spark architecture
Figure 3-5: Life cycle of distributed applications
Figure 3-6: Virtual machine architecture
Figure 3-7: Docker architecture
Figure 3-8: Hierarchical memory access with caches
Figure 3-9: NUMA memory access
Figure 3-10: Kubernetes architecture
Figure 3-11: Slurm architecture
Figure 3-12: Yarn architecture
Figure 3-13: Backfill scheduling and FIFO scheduling
Chapter 4
Figure 4-1: Virtual memory layout of a Linux program
Figure 4-2: Mapping of virtual memory to physical memory
Figure 4-3: Translating virtual address to physical address
Figure 4-4: Cache structure of an 8-core Intel CPU with L1, L2, and L3 cache...
Figure 4-5: Array memory layout versus list memory layout
Figure 4-6: Matrix and indexes
Figure 4-7: Row major representation of a matrix
Figure 4-8: Column major representation of a matrix
Figure 4-9: Time difference of access patterns
Figure 4-10: Memory representation of 3D tensor
Figure 4-11: Dense representation of a sparse matrix
Figure 4-12: CSR representation of the sparse matrix
Chapter 5
Figure 5-1: Distributed operation
Figure 5-2: AllReduce on arrays on two processes
Figure 5-3: Table partitioned in two processes
Figure 5-4: Tables after a distributed sort on Year column
Figure 5-5: Graph representation
Figure 5-6: Graph relationships
Figure 5-7: BSP program with computations and communications
Figure 5-8: Large data processing occurring in a streaming fashion
Figure 5-9: Program to execution graph
Figure 5-10: A graph that reads a set of files and does two separate computa...
Figure 5-11: Execute and cache (left); execute left path (middle); execute r...
Figure 5-12: Loop outside the task graph. Central driver submits the graph i...
Figure 5-13: DOACROSS parallel tasks (left); PIPELINE parallel tasks (right)
Figure 5-14: Distributed streaming architecture with message queues
Figure 5-15: Streaming processing with all the tasks active at the same time
Figure 5-16: Nonoverlapping time windows for grouping the events
Figure 5-17: Overlapping time windows for grouping the events
Chapter 6
Figure 6-1: A message with a header. The header is sent through the network ...
Figure 6-2: Processes 1, 2, and 3 send TCP messages to Process 0 with their ...
Figure 6-3: Dataflow graph with Source, Reduce Op, and Reduce Tasks (left); ...
Figure 6-4: MPI process model for computing and communication
Figure 6-5: Broadcast operation
Figure 6-6: Reduce operation
Figure 6-7: AllReduce operation
Figure 6-8: Gather operation
Figure 6-9: AllGather operation
Figure 6-10: Scatter operation
Figure 6-11: AllToAll operation
Figure 6-12: Broadcast with a flat tree
Figure 6-13: Broadcast with a binary tree
Figure 6-14: AllReduce with Ring
Figure 6-15: AllReduce with recursive doubling
Figure 6-16: Shuffle operation
Figure 6-17: Shuffle with data that does not fit into memory
Figure 6-18: Fetch-based shuffle
Figure 6-19: Chaining shuffle with four processes
Figure 6-20: GroupBy operation
Figure 6-21: Aggregate function sum on a distributed table
Figure 6-22: Sorted merge join
Figure 6-23: Distributed join on two tables distributed in two processes
Chapter 7
Figure 7-1: False sharing with threads
Figure 7-2: An example of array addition using scalar and vector instruction...
Figure 7-3: Vector instruction extensions in Intel processors
Figure 7-4: Layered architecture of programming
Figure 7-5: Transformation of a program into an executable plan
Figure 7-6: Work stealing
Figure 7-7: Synchronous system
Figure 7-8: Asynchronous system
Figure 7-9: Actor model
Figure 7-10: Execution of tasks
Figure 7-11: Remote execution from a driver program
Figure 7-12: SPMD execution (left); MPMD execution (right)
Figure 7-13: Spark task graph
Figure 7-14: The architecture of an accelerator cluster
Figure 7-15: Task graph in parallel stages
Figure 7-16: Scheduling with data locality
Figure 7-17: Scheduling without data locality
Figure 7-18: Streaming execution of tasks
Figure 7-19: Flink streaming graph
Figure 7-20: Stream graph scheduling
Figure 7-21: Back pressure on a set of streaming tasks
Figure 7-22: Apache Storm back-pressure handling
Chapter 8
Figure 8-1: Hadoop fixed graph
Figure 8-2: Hadoop execution of map-reduce
Figure 8-3: Apache Spark cluster
Figure 8-4: Overall architecture of a streaming application in Spark
Figure 8-5: Spark streaming with mini-batches
Figure 8-6: Example Storm topology with two spouts and one bolt
Figure 8-7: Apache Storm cluster
Figure 8-8: Kafka streams execution
Figure 8-9: An example tensor graph for matrix multiplication
Figure 8-10: Topological order for computation graph
Figure 8-11: Perceptron learning
Figure 8-12: Backward pass implementation in Autograd using a shared context
Figure 8-13: PyTorch Distributed Data Parallel
Figure 8-14: Different types of parallelism in deep learning
Figure 8-15: Pipeline parallelism of Resnet50 using RPC support in PyTorch
Figure 8-16: Cylon architecture
Figure 8-17: Cylon program execution on two processes
Figure 8-18: CUDF architecture
Chapter 9
Figure 9-1: Globally consistent state
Figure 9-2: Domino effect
Figure 9-3: Program synchronization point
Figure 9-4: Program synchronization point
Figure 9-5: Apache Storm architecture
Figure 9-6: Apache Storm acknowledgments
Figure 9-7: Apache Flink architecture
Figure 9-8: Apache Flink barriers and checkpointing
Figure 9-9: Iterative program checkpointing
Figure 9-10: Apache Spark architecture
Figure 9-11: Spark persistence
Chapter 10
Figure 10-1: Parallel speedup
Figure 10-2: Parallel efficiency
Figure 10-3: An analogy to Amdahl's law
Figure 10-4: The values are following a normal distribution (left); latency ...
Figure 10-5: Garbage collection
Figure 10-6: Java heap space monitoring
Figure 10-7: Java GC and heap space breakdown
Figure 10-8: Effects of garbage collection for distributed messaging
Figure 10-9: Spark memory structure
Figure 10-10: PySpark architecture
Figure 10-11: Data formats for data processing, ML/deep learning
Figure 10-12: Components of end-to-end data pipeline
Cover Page
Title Page
Copyright
Dedication
About the Authors
About the Editor
Acknowledgments
Introduction
Table of Contents
Begin Reading
Index
WILEY END USER LICENSE AGREEMENT
Supun Kamburugamuve
Saliya Ekanayake
Many would say the byword for success in the modern era has to be data. The sheer amount of information stored, processed, and moved around over the past 50 years has seen a staggering increase, without any end in sight. Enterprises are hungry to acquire and process more data to get a leg up on the competition. Scientists especially are looking to use data-intensive methods to advance research in ways that were not possible only a few decades ago.
With this worldwide demand, data-intensive applications have gone through a remarkable transformation since the start of the 21st century. We have seen wide adoption of big data frameworks such as Apache Hadoop, Apache Spark, and Apache Flink. The amazing advances being made in the fields of machine learning (ML) and deep learning (DL) are taking the big data era to new heights for both enterprise and research communities. These fields have further broadened the scope of data-intensive applications, which demand faster and more integrable systems that can operate on both specialized and commodity hardware.
Data-intensive applications deal with storing and extracting information, which involves disk access. They can be compute-intensive as well: deep learning and machine learning applications not only consume massive amounts of data but also perform a substantial number of computations. Because of their memory, storage, and computational requirements, these applications need resources beyond what a single computer can provide.
There are two main branches of data-intensive processing applications: streaming data and batch data. Streaming data analytics is defined as the continuous processing of data. Batch data analytics involves processing data as a complete unit. In practice, we can see streaming and batch applications working together. Machine learning and deep learning fall under the batch application category. There are some streaming ML algorithms as well. When we say machine learning or deep learning, we mostly focus on the training of the models, as it is the most compute-intensive aspect.
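The batch/streaming distinction can be sketched in a few lines of plain Python (a hypothetical sensor-reading example, not from the book):

```python
# Batch: the dataset is complete before processing starts.
def batch_average(values):
    return sum(values) / len(values)

# Streaming: events arrive one at a time; we keep a running state
# and produce an updated result after every event.
def streaming_average(event_stream):
    count, total = 0, 0.0
    for value in event_stream:
        count += 1
        total += value
        yield total / count

readings = [4.0, 6.0, 8.0]
print(batch_average(readings))            # 6.0 -- one answer for the whole unit
print(list(streaming_average(readings)))  # [4.0, 5.0, 6.0] -- an answer per event
```

In practice the streaming version never sees "the whole dataset"; it must express the computation incrementally, which is exactly why streaming frameworks are built around per-event state rather than whole-dataset operators.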
Users run these applications on their laptops, clouds, high-performance clusters, graphics processing unit (GPU) clusters, and even supercomputers. Such systems have different hardware and software environments. While we may develop our applications to deploy in one environment, the framework we use to construct them needs to work in a variety of settings.
Data-intensive applications have a long history from even before the start of the map-reduce era. With the increasing popularity of internet services, the need for storing and processing large datasets became vitally important starting around the turn of the century. In February 2001, Doug Laney published a research note titled “3D Data Management: Controlling Data Volume, Velocity, and Variety” [1]. Since then, we have used the so-called “3 Vs” to describe the needs of large-scale data processing.
From a technology perspective, the big data era began with the MapReduce paper from Jeffrey Dean and Sanjay Ghemawat at Google [2]. Apache Hadoop was created as an open source implementation of the map-reduce framework, along with the Hadoop Distributed File System following the Google File System [3]. The simple map-reduce interface for data processing attracted many companies and researchers. At the time, the network was often a bottleneck, and the primary focus was to bring the computation to where the data were kept.
A whole ecosystem of storage solutions and processing systems rose around these ideas. Some of the more notable processing systems include Apache Spark, Apache Flink, Apache Storm, and Apache Tez. Storage system examples are Apache HBase, Apache Cassandra, and MongoDB.
With more data came the need to learn from them. Machine learning algorithms have been in development since the 1950s when the first perceptron [4] was created and the nearest neighbor algorithm [5] was introduced. Since then, many algorithms have appeared steadily over the years to better learn from data. Indeed, most of the deep learning theory was created in the 1980s and 1990s.
Despite a long evolution, it is fair to say that modern deep learning as we know it took off around 2006 with the introduction of the Deep Belief Network (DBN). This was followed by the remarkable success of AlexNet in 2012 [6]. The primary reason for this shift was the increase in computational power in the form of parallel computing, which allowed neural networks to grow several orders of magnitude larger than what had been achieved in the past. A direct consequence of this has been the increase in the number of layers in a neural network, which is why it is called deep learning.
With machine learning and deep learning, users needed more interactive systems to explore data and do experiments quickly. This paved the way for Python-based APIs for data processing. The success of Python libraries such as NumPy and Pandas contributed to its popularity among data-intensive applications. But while all these changes were taking place, computer hardware was going through a remarkable transformation as well.
Since the introduction of the first microprocessor from Intel in 1971, there have been tremendous advances in CPU architectures, leading to quantum leaps in performance. Moore's law and Dennard scaling coupled together drove the performance of microprocessors until recently. Moore's law is an observation made by Gordon Moore that the number of transistors on a chip doubles roughly every two years. Dennard scaling states that the power density of MOSFET transistors stays roughly the same through each generation. The combination of these two principles suggested that the number of computations these microprocessors could perform for the same amount of energy would double every 1.5 years.
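As a back-of-the-envelope illustration of Moore's law in code (the starting figure of roughly 2,300 transistors on Intel's first microprocessor is an assumption used only for this example):

```python
def projected_transistors(start_count, start_year, year, doubling_period=2):
    """Project a transistor count under Moore's law:
    doubling roughly every `doubling_period` years."""
    return start_count * 2 ** ((year - start_year) / doubling_period)

# Fifty years of doubling every two years is 25 doublings,
# taking ~2,300 transistors to the tens of billions.
print(f"{projected_transistors(2_300, 1971, 2021):,.0f}")
```

The same exponential form, with a 1.5-year period, captures the combined Moore-plus-Dennard computations-per-joule trend mentioned above.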
For half a century, this phenomenon helped speed up applications across the board with each new generation of processors. Dennard scaling, however, broke down around 2006, and clock frequencies of microprocessors hit a wall around 4 GHz. This led to the multicore era of microprocessors, with some motherboards even supporting more than one CPU socket. The result of this evolution is single computers with core counts that can reach 128 today. Programming multicore computers requires more consideration than programming traditional single-core CPUs, which we will explore in detail later.
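As a rough sketch of the chunk-and-combine pattern that multicore programming encourages, using only the Python standard library (the data and worker count are illustrative; for CPU-bound pure-Python work one would swap in ProcessPoolExecutor to sidestep the GIL):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, workers=4):
    """Split `data` into one chunk per worker, sum the chunks
    concurrently, then combine the partial results."""
    size = (len(data) + workers - 1) // workers  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)

data = list(range(1_000_000))
assert parallel_sum(data) == sum(data)
```

Even in this toy form, the considerations unique to multicore code are visible: how to partition the data, how many workers to use, and how to combine partial results — the same questions that distributed data processing frameworks answer at cluster scale.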
Alongside the multicore evolution came the rise of GPUs as general-purpose computing processors. This trend was boosted by the exponential growth of machine learning and deep learning. GPUs pack thousands of lightweight cores compared to CPUs, paving the way for accelerated computing kernels that speed up computations.
GPUs were not the only type of accelerator to emerge. Intel's Xeon Phi processors initially came out as accelerators. Field-programmable gate arrays (FPGAs) are now being used to develop custom accelerators, especially to improve deep learning training and inference. The trend has gone further with the development of custom chips known as application-specific integrated circuits (ASICs). Google's Tensor Processing Unit (TPU) is a popular ASIC solution for accelerating the training of large models.
The future of hardware evolution is leaning further toward accelerators, to the point where devices will carry multiple accelerators designed specifically for different application needs. This is known as accelerator-level parallelism. For instance, Apple's A14 chip has accelerators for graphics, neural network processing, and cryptography.
Modern data-intensive applications consist of data storage and management systems as well as data science workflows, as illustrated in Figure I-1. Data management is mostly an automated process, covering development and deployment. Data science, on the other hand, is a process that calls for human involvement.
Figure I-1: Overall data processing architecture
Figure I-1 shows a typical data processing architecture. The first step is to ingest data from various sources, such as web services and devices, into raw storage. The sources can be internal or external to an organization, the most common being web services and IoT devices, both of which produce streaming data. Further data can be accumulated from batch processes, such as the output of an application run. Depending on our requirements, we can analyze this data to some degree before it is stored.
The raw data storage serves as a rich trove for extracting structured data. Ingesting and extracting data via such resources requires data processing tools that can work at massive scale.
From the raw storage, we support more specialized use cases that require subsets of this data. We can store such data in specialized formats for efficient querying and model building, a practice known as data warehousing. In the early stages of data-intensive applications, large-scale data storage and querying were the dominant use cases.
Data science workflows have grown into an integral part of modern data-intensive applications. They involve data preparation, analysis, reflection, and dissemination, as shown in Figure I-2. Data preparation works with sources like databases and files to retrieve, clean, and reformat data.
Figure I-2: Data science workflow
Data analysis includes modeling, training, debugging, and model validations. Once the analysis is complete, we can make comparisons in the reflection step to see whether such an analysis is what we need. This is an iterative process where we check different models and tweak them to get the best results.
After the models are finalized, we can deploy them, create reports, and catalog the steps we took in the experiments. The actual code related to learning algorithms may be small compared to all the other systems and applications surrounding and supporting it [7].
The data science workflow is an interactive process with human involvement every step of the way. Scripting languages like Python are a good fit for such interactive environments, with the ability to quickly prototype and test various hypotheses. Other technologies like Python Notebooks are extensively used by data scientists in their workflows.
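A minimal sketch of this interactive prepare–analyze–reflect loop with pandas (the column names, values, and the check in the reflection step are invented for illustration):

```python
import pandas as pd

# Prepare: retrieve raw records, then clean and reformat them.
raw = pd.DataFrame({"user": ["a", "b", "b", None],
                    "amount": ["10", "20", "5", "7"]})
clean = raw.dropna(subset=["user"]).astype({"amount": "int64"})

# Analyze: a trivial "model" -- per-user spending totals.
totals = clean.groupby("user")["amount"].sum()

# Reflect: check whether the result matches expectations
# before iterating on the preparation or the analysis.
assert totals["b"] == 25
print(totals)
```

In a notebook, each of these steps would be a cell the data scientist runs, inspects, and revises, which is precisely why interactive Python environments fit this workflow so well.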
Whether it is large-scale data management or a data scientist evaluating results in a small cluster, we use frameworks and APIs to develop and run data-intensive applications. These frameworks provide APIs and the means to run applications at scale, handling failures and various hardware features. The frameworks are designed to run different workloads and applications:
Streaming data processing frameworks—Process continuous data streams.
Batch data processing frameworks—Manage large datasets and perform extract, transform, and load (ETL) operations.
Machine/deep learning frameworks—Designed to run models that learn from data through iterative training.
Workflow systems—Combine data-intensive applications to create larger applications.
There are many frameworks available for the classes of applications found today. From the outside, even frameworks designed to solve a single specific class of applications might look quite diverse, with different APIs and architectures. But if we look at the core of these frameworks and the applications built on top of them, we find similar techniques in use. Since they are trying to solve the same problems, all of them rely on similar data abstractions, techniques for running at scale, and even fault handling. Within these similarities, there are differences that create distinct frameworks for various application classes.
It is hard to imagine one framework to solve all our data-intensive application needs. Building frameworks that work at scale for a complex area such as data-intensive applications is a colossal challenge. We need to keep in mind that, like any other software project, frameworks are built with finite resources and time constraints. Various programming languages are used when developing them, each with their own benefits and limitations. There are always trade-offs between usability and performance. Sometimes the most user-friendly APIs may not be the most successful.
Some software designs and architectures for data-intensive applications are best suited to only certain application classes. Frameworks built according to these architectures solve specific classes of problems well but may not be as effective when applied to others. What we see in practice is that frameworks are built for one purpose and adapted to other use cases as they mature.
Data processing is a complicated space with demanding hardware, software, and application requirements. On top of this, those demands have been evolving rapidly. At times, there are so many options and variables that it seems impossible to choose the correct hardware and software for our data-intensive problems. A deep understanding of how things work beneath the surface can help us make better choices when designing our data processing architectures.
Data-intensive applications incorporate ideas from many domains of computer science. This includes areas such as computer systems, databases, high-performance computing, cloud computing, distributed computing, programming languages, computer networks, statistics, data structures, and algorithms.
We can study data-intensive applications from three perspectives: data storage and management, learning from data, and scaling data processing. Data storage and management is the first step in any data analytics pipeline. It can include techniques ranging from sequential databases to large-scale data lakes.
Learning from data can take many forms depending on the data type and use case. For example, computing basic statistics, clustering data into groups, finding interesting patterns in graph data, joining tables of data to enrich information, and fitting functions to existing data are several ways we learn from data. The overarching goal of the algorithmic perspective is to form models from data that can be used to predict something in the future.
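As a toy instance of the last example — fitting a function to existing data and using the resulting model to predict — here is a least-squares line fit with NumPy (the data points are invented for illustration and happen to lie exactly on a line):

```python
import numpy as np

# Observed inputs and outputs; these lie exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Fit a degree-1 polynomial (a line) by least squares.
slope, intercept = np.polyfit(x, y, deg=1)

def predict(v):
    """The fitted model, usable on unseen inputs."""
    return slope * v + intercept
```

Scaling this idea — fitting far richer models to far larger datasets — is exactly where the third perspective, distributed data processing, comes in.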
A key enabler for these two perspectives is the third, where data processing is done at scale. Therefore, in this book we will primarily look at data-intensive applications from a distributed processing perspective. Our focus is to understand how data-intensive applications run at scale utilizing the various computing resources available. We take principles from databases, parallel computing, and distributed computing to delve deep into the ideas behind data-intensive applications operating at scale.
This book is aimed at data engineers and data scientists. If you develop data-intensive applications or are planning to do so, the information contained herein can provide insight into how things work regarding various frameworks and tools. This book can be helpful if you are trying to make decisions about what frameworks to choose in your applications.
You should have a foundational knowledge of computer science in general before reading any further. If you have a basic understanding of networks and computer architecture, that will be helpful to understand the content better.
Our target audience is the curious reader who wants to understand how data-intensive applications function at scale. Whether you are considering developing applications using certain frameworks or are developing your own applications from scratch for highly specialized use cases, this book will help you understand the inner workings of these systems. If you are already familiar with data processing tools, it can deepen your understanding of the underlying principles.
The book starts by introducing the challenges of developing and executing data-intensive applications at scale. It then gives an introduction to data and storage systems and cluster resources. The next few chapters describe the internals of data-intensive frameworks, covering data structures, programming models, messaging, and task execution, along with a few case studies of existing systems. Finally, we talk about fault tolerance techniques and finish the book with performance implications.
Chapter 1: Scaling Data-Intensive Applications—Describes serial and parallel applications for data-intensive workloads and the challenges faced when running them at scale.
Chapter 2: Data and Storage—An overview of the data storage systems used in data processing. Both hardware and software solutions are discussed.
Chapter 3: Computing Resources—Introduces the computing resources used in data-intensive applications and how they are managed in large-scale environments.
Chapter 4: Data Structures—Describes the data abstractions used in data analytics applications and the importance of using memory correctly to speed up applications.
Chapter 5: Programming Models—Discusses the various programming models and APIs available for data analytics applications.
Chapter 6: Messaging—Examines how data analytics applications use the network to process data at scale by exchanging data between distributed processes.
Chapter 7: Parallel Tasks—Shows how to execute tasks in parallel for data analytics applications, combining them with messaging.
Chapter 8: Case Studies—Studies of a few widely used systems, highlighting how the principles discussed in the book are applied in practice.
Chapter 9: Fault Tolerance—Illustrates practical techniques for handling faults in data-intensive applications.
Chapter 10: Performance and Productivity—Defines the metrics used for measuring performance and discusses productivity when choosing tools.
Our focus here is the fundamentals of parallel and distributed computing and how they are applied in data processing systems. We take examples from existing systems to explain how these are used in practice; we are not focusing on any specific system or programming language. Our main goal is to help you understand the trade-offs among the available techniques and how they are used in practice. This book does not try to teach parallel programming.
You will also be introduced to a few examples of deep learning systems. Although DL at scale works according to the principles we describe here, we will not be going into any great depth on the topic. We also will not be covering any specific data-intensive framework on configuration and deployment. There are plenty of specific books and online resources on these frameworks.
SQL is a popular choice for querying large datasets. Still, you will not find much about it here because it is a complex subject of its own, involving query parsing and optimization, which is not the focus of this book. Instead, we look at how the programs are executed at scale.
At the end of each chapter, we list important references that paved the way for some of the discussions we included. Most of the content you will find here is the result of work done by researchers and software engineers over a long period. We include these references for any reader interested in learning more about the topics discussed.
1. D. Laney, "3D Data Management: Controlling Data Volume, Velocity and Variety," META Group Research Note, vol. 6, no. 70, p. 1, 2001.
2. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Sixth Symposium on Operating Systems Design and Implementation, pp. 137–150, 2004.
3. S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," presented at the 19th ACM Symposium on Operating Systems Principles, 2003.
4. F. Rosenblatt, The Perceptron, a Perceiving and Recognizing Automaton (Project Para). Cornell Aeronautical Laboratory, 1957.
5. T. Cover and P. Hart, "Nearest Neighbor Pattern Classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
6. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
7. D. Sculley et al., "Hidden Technical Debt in Machine Learning Systems," Advances in Neural Information Processing Systems, vol. 28, pp. 2503–2511, 2015.
