PEEK "UNDER THE HOOD" OF BIG DATA ANALYTICS

The world of big data analytics grows ever more complex. And while many people can work superficially with specific frameworks, far fewer understand the fundamental principles of large-scale, distributed data processing systems and how they operate.

In Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood, renowned big-data experts and computer scientists Drs. Supun Kamburugamuve and Saliya Ekanayake deliver a practical guide to applying the principles of big data to software development for optimal performance. The authors discuss the foundational components of large-scale data systems and walk readers through the major software design decisions that define performance, application type, and usability. You'll learn how to recognize performance and distributed-operation problems in your applications, diagnose them, and effectively eliminate them by relying on the bedrock big data principles explained within.

Moving beyond individual frameworks and APIs for data processing, this book unlocks the theoretical ideas that operate under the hood of every big data processing system. Ideal for data scientists, data architects, DevOps engineers, and developers, Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood shows readers how to:

* Identify the foundations of large-scale, distributed data processing systems
* Make major software design decisions that optimize performance
* Diagnose performance problems and distributed operation issues
* Understand state-of-the-art research in big data
* Explain and use the major big data frameworks and understand what underpins them
* Use big data analytics in the real world to solve practical problems
Page count: 586
Publication year: 2021
Cover
Title Page
Introduction
History of Data-Intensive Applications
Data Processing Architecture
Foundations of Data-Intensive Applications
Who Should Read This Book?
Organization of the Book
Scope of the Book
References
CHAPTER 1: Data Intensive Applications
Anatomy of a Data-Intensive Application
Parallel Applications
Application Classes and Frameworks
What Makes It Difficult?
Summary
References
Notes
CHAPTER 2: Data and Storage
Storage Systems
Data Formats
Data Replication
Data Partitioning
NoSQL Databases
Message Queuing
Summary
References
Notes
CHAPTER 3: Computing Resources
A Demonstration
Computer Clusters
Data Analytics in Clusters
Distributed Application Life Cycle
Computing Resources
Cluster Resource Managers
Job Scheduling
Summary
References
Notes
CHAPTER 4: Data Structures
Virtual Memory
The Need for Data Structures
Object and Text Data
Vectors and Matrices
Table
Summary
References
Notes
CHAPTER 5: Programming Models
Introduction
Data Structures and Operations
Message Passing Model
Distributed Data Model
Task Graphs (Dataflow Graphs)
Batch Dataflow
Streaming Dataflow
SQL
Summary
References
Notes
CHAPTER 6: Messaging
Network Services
Messaging for Data Analytics
Distributed Operations
Distributed Operations on Arrays
Distributed Operations on Tables
Advanced Topics
Summary
References
Notes
CHAPTER 7: Parallel Tasks
CPUs
Accelerators
Task Execution
Batch Tasks
Streaming Tasks
Summary
References
CHAPTER 8: Case Studies
Apache Hadoop
Apache Spark
Apache Storm
Kafka Streams
PyTorch
Cylon
Rapids cuDF
Summary
References
Notes
CHAPTER 9: Fault Tolerance
Dependable Systems and Failures
Recovering from Faults
Checkpointing
Streaming Systems
Batch Systems
Summary
References
CHAPTER 10: Performance and Productivity
Performance Metrics
Performance Factors
Finding Issues
Programming Languages
Productivity
Summary
References
Notes
Index
Copyright
Dedication
About the Authors
About the Editor
Acknowledgments
End User License Agreement
Chapter 1
Table 1-1: CSV File Structure for User's Data
Chapter 2
Table 2-1: Storage Area Network Channels and Protocols
Table 2-2: Original Dataset with Four Rows
Table 2-3: First Table with Only First Three Columns of Original Table
Table 2-4: Second Table with the First Column and the Last Two Columns from the ...
Table 2-5: Horizontal Partition with the Year 2019
Table 2-6: Horizontal Partition with the Year 2020
Table 2-11: Key Value Data Store
Table 2-12: Document Store
Table 2-13: Wide Column Store
Chapter 3
Table 3-1: Data Analytics Frameworks and Resource Scheduling
Table 3-2: Different Resource Managers
Table 3-3: Data Center Tiers
Table 3-4: Massive Data Centers of the World
Table 3-5: Top 5 Supercomputers in the World
Chapter 4
Table 4-1: Time for Accessing 100 Million Records in Three Methods
Table 4-2: Time for Accessing 200 Million Records in Three Methods
Table 4-3: CSV File with Attributes
Table 4-4: Column Arrays
Table 4-5: Fundamental Relational Algebra Operations of Pandas DataFrame
Chapter 5
Table 5-1: Applications, Data Structures, and Programming Models
Table 5-2: Common Primitive Types Supported by Data Systems
Table 5-3: Common Complex Types Supported by Data Systems
Table 5-4: Distributed Operations on Arrays
Table 5-5: Basic Relational Algebra Operations
Table 5-6: Auxiliary Operations on Tables
Table 5-7: Message Passing Implementations
Table 5-8: Table Operations
Table 5-9: Input Records of Two Datasets
Table 5-10: Streaming Operations
Table 5-11: Streaming Distributed Operations
Table 5-12: Popular Operations on Windowed Data
Chapter 6
Table 6-1: Popular MPI Implementations
Table 6-2: Common Distributed Operations on Arrays
Table 6-3: Distributed Operations for Tables
Table 6-4: Table-Based Operations and Their Implementations
Chapter 7
Table 7-1: Synchronization Primitives
Table 7-2: Use of Task Graph Model for Data Analytics
Table 7-5: Partitioning Strategies
Table 7-6: Local Operations on Streams
Table 7-7: Windowed Operations
Table 7-8: Distributed Operations
Chapter 10
Table 10-1: Profiling Tools
Table 10-2: Frameworks and Languages
Table 10-3: Frameworks and Target Applications
Table 10-4: Cloud vs. On-Premises for Data Processing
Table 10-5: CPUs and GPUs for Data-Intensive Applications
Table 10-6: Cloud Services for Data Pipelines
Introduction
Figure I-1: Overall data processing architecture
Figure I-2: Data science workflow
Chapter 1
Figure 1-1: Linear partitioning of files among processes
Figure 1-2: Runtime components of a data-intensive application
Figure 1-3: Steps in making a parallel program
Figure 1-4: The logical view of the shared memory model
Figure 1-5: Shared memory implementation over distributed resources
Figure 1-6: The logical view of the distributed memory model
Figure 1-7: Distributed memory implementation over distributed resources
Figure 1-8: The logical view of hybrid memory model
Figure 1-9: The logical view of PGAS memory model
Figure 1-10: PGAS example with three tasks
Figure 1-11: Four application classes
Figure 1-12: An example workflow
Chapter 2
Figure 2-1: Three ways to connect storage to computer clusters: direct-attac...
Figure 2-2: SAN system with a Fibre Channel network
Figure 2-3: Storage abstractions: block storage (left), file storage (middle...
Figure 2-4: Storing values of a row in consecutive memory spaces
Figure 2-5: Storing values of a column in consecutive memory spaces
Figure 2-6: Parquet file structure
Figure 2-7: Avro file structure
Figure 2-8: Pipelined replication
Figure 2-9: CAP theorem
Figure 2-10: Message brokers in enterprises
Figure 2-11: Message queue
Figure 2-12: Message queue replication
Chapter 3
Figure 3-1: Cluster with separate storage
Figure 3-2: Cluster with compute and storage nodes
Figure 3-3: Cloud-based cluster using a storage service
Figure 3-4: Apache Spark architecture
Figure 3-5: Life cycle of distributed applications
Figure 3-6: Virtual machine architecture
Figure 3-7: Docker architecture
Figure 3-8: Hierarchical memory access with caches
Figure 3-9: NUMA memory access
Figure 3-10: Kubernetes architecture
Figure 3-11: Slurm architecture
Figure 3-12: Yarn architecture
Figure 3-13: Backfill scheduling and FIFO scheduling
Chapter 4
Figure 4-1: Virtual memory layout of a Linux program
Figure 4-2: Mapping of virtual memory to physical memory
Figure 4-3: Translating virtual address to physical address
Figure 4-4: Cache structure of an 8-core Intel CPU with L1, L2, and L3 cache...
Figure 4-5: Array memory layout versus list memory layout
Figure 4-6: Matrix and indexes
Figure 4-7: Row major representation of a matrix
Figure 4-8: Column major representation of a matrix
Figure 4-9: Time difference of access patterns
Figure 4-10: Memory representation of 3D tensor
Figure 4-11: Dense representation of a sparse matrix
Figure 4-12: CSR representation of the sparse matrix
Chapter 5
Figure 5-1: Distributed operation
Figure 5-2: AllReduce on arrays on two processes
Figure 5-3: Table partitioned in two processes
Figure 5-4: Tables after a distributed sort on Year column
Figure 5-5: Graph representation
Figure 5-6: Graph relationships
Figure 5-7: BSP program with computations and communications
Figure 5-8: Large data processing occurring in a streaming fashion
Figure 5-9: Program to execution graph
Figure 5-10: A graph that reads a set of files and does two separate computa...
Figure 5-11: Execute and cache (left); execute left path (middle); execute r...
Figure 5-12: Loop outside the task graph. Central driver submits the graph i...
Figure 5-13: DOACROSS parallel tasks (left); PIPELINE parallel tasks (right)
Figure 5-14: Distributed streaming architecture with message queues
Figure 5-15: Streaming processing with all the tasks active at the same time
Figure 5-16: Nonoverlapping time windows for grouping the events
Figure 5-17: Overlapping time windows for grouping the events
Chapter 6
Figure 6-1: A message with a header. The header is sent through the network ...
Figure 6-2: Processes 1, 2, and 3 send TCP messages to Process 0 with their ...
Figure 6-3: Dataflow graph with Source, Reduce Op, and Reduce Tasks (left); ...
Figure 6-4: MPI process model for computing and communication
Figure 6-5: Broadcast operation
Figure 6-6: Reduce operation
Figure 6-7: AllReduce operation
Figure 6-8: Gather operation
Figure 6-9: AllGather operation
Figure 6-10: Scatter operation
Figure 6-11: AllToAll operation
Figure 6-12: Broadcast with a flat tree
Figure 6-13: Broadcast with a binary tree
Figure 6-14: AllReduce with Ring
Figure 6-15: AllReduce with recursive doubling
Figure 6-16: Shuffle operation
Figure 6-17: Shuffle with data that does not fit into memory
Figure 6-18: Fetch-based shuffle
Figure 6-19: Chaining shuffle with four processes
Figure 6-20: GroupBy operation
Figure 6-21: Aggregate function sum on a distributed table
Figure 6-22: Sorted merge join
Figure 6-23: Distributed join on two tables distributed in two processes
Chapter 7
Figure 7-1: False sharing with threads
Figure 7-2: An example of array addition using scalar and vector instruction...
Figure 7-3: Vector instruction extensions in Intel processors
Figure 7-4: Layered architecture of programming
Figure 7-5: Transformation of a program into an executable plan
Figure 7-6: Work stealing
Figure 7-7: Synchronous system
Figure 7-8: Asynchronous system
Figure 7-9: Actor model
Figure 7-10: Execution of tasks
Figure 7-11: Remote execution from a driver program
Figure 7-12: SPMD execution (left); MPMD execution (right)
Figure 7-13: Spark task graph
Figure 7-14: The architecture of an accelerator cluster
Figure 7-15: Task graph in parallel stages
Figure 7-16: Scheduling with data locality
Figure 7-17: Scheduling without data locality
Figure 7-18: Streaming execution of tasks
Figure 7-19: Flink streaming graph
Figure 7-20: Stream graph scheduling
Figure 7-21: Back pressure on a set of streaming tasks
Figure 7-22: Apache Storm back-pressure handling
Chapter 8
Figure 8-1: Hadoop fixed graph
Figure 8-2: Hadoop execution of map-reduce
Figure 8-3: Apache Spark cluster
Figure 8-4: Overall architecture of a streaming application in Spark
Figure 8-5: Spark streaming with mini-batches
Figure 8-6: Example Storm topology with two spouts and one bolt
Figure 8-7: Apache Storm cluster
Figure 8-8: Kafka streams execution
Figure 8-9: An example tensor graph for matrix multiplication
Figure 8-10: Topological order for computation graph
Figure 8-11: Perceptron learning
Figure 8-12: Backward pass implementation in Autograd using a shared context
Figure 8-13: PyTorch Distributed Data Parallel
Figure 8-14: Different types of parallelism in deep learning
Figure 8-15: Pipeline parallelism of Resnet50 using RPC support in PyTorch
Figure 8-16: Cylon architecture
Figure 8-17: Cylon program execution on two processes
Figure 8-18: CUDF architecture
Chapter 9
Figure 9-1: Globally consistent state
Figure 9-2: Domino effect
Figure 9-3: Program synchronization point
Figure 9-4: Program synchronization point
Figure 9-5: Apache Storm architecture
Figure 9-6: Apache Storm acknowledgments
Figure 9-7: Apache Flink architecture
Figure 9-8: Apache Flink barriers and checkpointing
Figure 9-9: Iterative program checkpointing
Figure 9-10: Apache Spark architecture
Figure 9-11: Spark persistence
Chapter 10
Figure 10-1: Parallel speedup
Figure 10-2: Parallel efficiency
Figure 10-3: An analogy to Amdahl's law
Figure 10-4: The values are following a normal distribution (left); latency ...
Figure 10-5: Garbage collection
Figure 10-6: Java heap space monitoring
Figure 10-7: Java GC and heap space breakdown
Figure 10-8: Effects of garbage collection for distributed messaging
Figure 10-9: Spark memory structure
Figure 10-10: PySpark architecture
Figure 10-11: Data formats for data processing, ML/deep learning
Figure 10-12: Components of end-to-end data pipeline
Cover Page
Title Page
Copyright
Dedication
About the Authors
About the Editor
Acknowledgments
Introduction
Table of Contents
Begin Reading
Index
WILEY END USER LICENSE AGREEMENT
Supun Kamburugamuve
Saliya Ekanayake
Many would say the byword for success in the modern era has to be data. The sheer amount of information stored, processed, and moved around over the past 50 years has seen a staggering increase, without any end in sight. Enterprises are hungry to acquire and process more data to get a leg up on the competition. Scientists especially are looking to use data-intensive methods to advance research in ways that were not possible only a few decades ago.
With this worldwide demand, data-intensive applications have gone through a remarkable transformation since the start of the 21st century. We have seen wide adoption of big data frameworks such as Apache Hadoop, Apache Spark, and Apache Flink. The amazing advances being made in the fields of machine learning (ML) and deep learning (DL) are taking the big data era to new heights for both enterprise and research communities. These fields have further broadened the scope of data-intensive applications, which demand faster and more integrable systems that can operate on both specialized and commodity hardware.
Data-intensive applications deal with storing and extracting information, which involves disk access. They can be compute-intensive as well: deep learning and machine learning applications not only consume massive amounts of data but also perform a substantial number of computations. Because of their memory, storage, and computational requirements, these applications need resources beyond what a single computer can provide.
There are two main branches of data-intensive processing applications: streaming data and batch data. Streaming data analytics is defined as the continuous processing of data. Batch data analytics involves processing data as a complete unit. In practice, we can see streaming and batch applications working together. Machine learning and deep learning fall under the batch application category. There are some streaming ML algorithms as well. When we say machine learning or deep learning, we mostly focus on the training of the models, as it is the most compute-intensive aspect.
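The batch/streaming distinction can be sketched in a few lines of plain Python (a hypothetical sensor-reading example, not from the book):

```python
# Batch: the dataset is complete before processing starts.
def batch_average(values):
    return sum(values) / len(values)

# Streaming: events arrive one at a time; we keep a running state
# and produce an updated result after every event.
def streaming_average(event_stream):
    count, total = 0, 0.0
    for value in event_stream:
        count += 1
        total += value
        yield total / count

readings = [4.0, 6.0, 8.0]
print(batch_average(readings))            # 6.0 -- one answer for the whole unit
print(list(streaming_average(readings)))  # [4.0, 5.0, 6.0] -- an answer per event
```

In practice the streaming version never sees "the whole dataset"; it must express the computation incrementally, which is exactly why streaming frameworks are built around per-event state rather than whole-dataset operators.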
Users run these applications on their laptops, clouds, high-performance clusters, graphics processing unit (GPU) clusters, and even supercomputers. Such systems have different hardware and software environments. While we may develop our applications to deploy in one environment, the framework we use to construct them needs to work in a variety of settings.
Data-intensive applications have a long history from even before the start of the map-reduce era. With the increasing popularity of internet services, the need for storing and processing large datasets became vitally important starting around the turn of the century. In February 2001, Doug Laney published a research note titled “3D Data Management: Controlling Data Volume, Velocity, and Variety” [1]. Since then, we have used the so-called “3 Vs” to describe the needs of large-scale data processing.
From a technology perspective, the big data era began with the MapReduce paper from Jeffrey Dean and Sanjay Ghemawat at Google [2]. Apache Hadoop was created as an open source implementation of the map-reduce framework, along with the Hadoop Distributed File System following the Google File System [3]. The simple map-reduce interface for data processing attracted many companies and researchers. At the time, the network was often a bottleneck, and the primary focus was to bring the computation to where the data were kept.
A whole ecosystem of storage solutions and processing systems rose around these ideas. Some of the more notable processing systems include Apache Spark, Apache Flink, Apache Storm, and Apache Tez. Storage system examples are Apache HBase, Apache Cassandra, and MongoDB.
With more data came the need to learn from them. Machine learning algorithms have been in development since the 1950s when the first perceptron [4] was created and the nearest neighbor algorithm [5] was introduced. Since then, many algorithms have appeared steadily over the years to better learn from data. Indeed, most of the deep learning theory was created in the 1980s and 1990s.
Despite a long evolution, it is fair to say that modern deep learning as we know it took off around 2006 with the introduction of the Deep Belief Network (DBN). This was followed by the remarkable success of AlexNet in 2012 [6]. The primary reason for this shift was the increase in computational power in the form of parallel computing, which allowed neural networks to grow several orders of magnitude larger than what had been achieved in the past. A direct consequence of this has been the increase in the number of layers in a neural network, which is why it is called deep learning.
With machine learning and deep learning, users needed more interactive systems to explore data and do experiments quickly. This paved the way for Python-based APIs for data processing. The success of Python libraries such as NumPy and Pandas contributed to its popularity among data-intensive applications. But while all these changes were taking place, computer hardware was going through a remarkable transformation as well.
Since the introduction of the first microprocessor from Intel in 1971, there have been tremendous advances in CPU architectures, leading to quantum leaps in performance. Moore's law and Dennard scaling coupled together drove the performance of microprocessors until recently. Moore's law is an observation made by Gordon Moore that the number of transistors on a chip doubles roughly every two years. Dennard scaling states that the power density of MOSFET transistors stays roughly the same through each generation. The combination of these two principles suggested that the number of computations these microprocessors could perform for the same amount of energy would double every 1.5 years.
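As a back-of-the-envelope illustration of Moore's law in code (the starting figure of roughly 2,300 transistors on Intel's first microprocessor is an assumption used only for this example):

```python
def projected_transistors(start_count, start_year, year, doubling_period=2):
    """Project a transistor count under Moore's law:
    doubling roughly every `doubling_period` years."""
    return start_count * 2 ** ((year - start_year) / doubling_period)

# Fifty years of doubling every two years is 25 doublings,
# taking ~2,300 transistors to the tens of billions.
print(f"{projected_transistors(2_300, 1971, 2021):,.0f}")
```

The same exponential form, with a 1.5-year period, captures the combined Moore-plus-Dennard computations-per-joule trend mentioned above.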
For half a century, this phenomenon helped speed up applications across the board with each new generation of processors. Dennard scaling, however, broke down around 2006, and clock frequencies of microprocessors hit a wall around 4 GHz. This led to the multicore era of microprocessors, with some motherboards even supporting more than one CPU socket. The result of this evolution is single computers with core counts that can reach 128 today. Programming multicore computers requires more consideration than programming traditional single-core CPUs, which we will explore in detail later.
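As a rough sketch of the chunk-and-combine pattern that multicore programming encourages, using only the Python standard library (the data and worker count are illustrative; for CPU-bound pure-Python work one would swap in ProcessPoolExecutor to sidestep the GIL):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, workers=4):
    """Split `data` into one chunk per worker, sum the chunks
    concurrently, then combine the partial results."""
    size = (len(data) + workers - 1) // workers  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)

data = list(range(1_000_000))
assert parallel_sum(data) == sum(data)
```

Even in this toy form, the considerations unique to multicore code are visible: how to partition the data, how many workers to use, and how to combine partial results — the same questions that distributed data processing frameworks answer at cluster scale.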
Alongside the multicore evolution came the rise of GPUs as general-purpose computing processors. This trend was boosted by the exponential growth of machine learning and deep learning. GPUs pack thousands of lightweight cores compared to CPUs, paving the way for accelerated computing kernels that speed up computations.
GPUs were not the only type of accelerator to emerge. Intel's Xeon Phi processors initially came out as accelerators. Field-programmable gate arrays (FPGAs) are now being used to develop custom accelerators, especially to improve deep learning training and inference. The trend has gone further with the development of custom chips known as application-specific integrated circuits (ASICs). Google's Tensor Processing Unit (TPU) is a popular ASIC solution for accelerating the training of large models.
The future of hardware evolution is leaning further toward accelerators, to the point where devices will carry multiple accelerators designed specifically for different application needs. This is known as accelerator-level parallelism. For instance, Apple's A14 chip has accelerators for graphics, neural network processing, and cryptography.
Modern data-intensive applications consist of data storage and management systems as well as data science workflows, as illustrated in Figure I-1. Data management is mostly an automated process, covering development and deployment. Data science, on the other hand, is a process that calls for human involvement.
Figure I-1: Overall data processing architecture
Figure I-1 shows a typical data processing architecture. The first step is to ingest data from various sources, such as web services and devices, into raw storage. The sources can be internal or external to an organization, the most common being web services and IoT devices, both of which produce streaming data. Further data can be accumulated from batch processes, such as the output of an application run. Depending on our requirements, we can analyze this data to some degree before it is stored.
The raw data storage serves as a rich trove for extracting structured data. Ingesting and extracting data via such resources requires data processing tools that can work at massive scale.
From the raw storage, we support more specialized use cases that require subsets of this data. We can store such data in specialized formats for efficient querying and model building, a practice known as data warehousing. In the early stages of data-intensive applications, large-scale data storage and querying were the dominant use cases.
Data science workflows have grown into an integral part of modern data-intensive applications. They involve data preparation, analysis, reflection, and dissemination, as shown in Figure I-2. Data preparation works with sources like databases and files to retrieve, clean, and reformat data.
Figure I-2: Data science workflow
Data analysis includes modeling, training, debugging, and model validations. Once the analysis is complete, we can make comparisons in the reflection step to see whether such an analysis is what we need. This is an iterative process where we check different models and tweak them to get the best results.
After the models are finalized, we can deploy them, create reports, and catalog the steps we took in the experiments. The actual code related to learning algorithms may be small compared to all the other systems and applications surrounding and supporting it [7].
The data science workflow is an interactive process with human involvement every step of the way. Scripting languages like Python are a good fit for such interactive environments, with the ability to quickly prototype and test various hypotheses. Other technologies like Python Notebooks are extensively used by data scientists in their workflows.
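A minimal sketch of this interactive prepare–analyze–reflect loop with pandas (the column names, values, and the check in the reflection step are invented for illustration):

```python
import pandas as pd

# Prepare: retrieve raw records, then clean and reformat them.
raw = pd.DataFrame({"user": ["a", "b", "b", None],
                    "amount": ["10", "20", "5", "7"]})
clean = raw.dropna(subset=["user"]).astype({"amount": "int64"})

# Analyze: a trivial "model" -- per-user spending totals.
totals = clean.groupby("user")["amount"].sum()

# Reflect: check whether the result matches expectations
# before iterating on the preparation or the analysis.
assert totals["b"] == 25
print(totals)
```

In a notebook, each of these steps would be a cell the data scientist runs, inspects, and revises, which is precisely why interactive Python environments fit this workflow so well.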
Whether it is large-scale data management or a data scientist evaluating results in a small cluster, we use frameworks and APIs to develop and run data-intensive applications. These frameworks provide APIs and the means to run applications at scale, handling failures and various hardware features. The frameworks are designed to run different workloads and applications:
Streaming data processing frameworks—Process continuous data streams.
Batch data processing frameworks—Manage large datasets and perform extract, transform, and load (ETL) operations.
Machine/deep learning frameworks—Designed to run models that learn from data through iterative training.
Workflow systems—Combine data-intensive applications to create larger applications.
There are many frameworks available for the classes of applications found today. From the outside, even frameworks designed to solve a single specific class of applications might look quite diverse, with different APIs and architectures. But if we look at the core of these frameworks and the applications built on top of them, we find similar techniques in use. Since they are trying to solve the same problems, all of them rely on similar data abstractions, techniques for running at scale, and even fault handling. Within these similarities, there are differences that create distinct frameworks for various application classes.
It is hard to imagine one framework to solve all our data-intensive application needs. Building frameworks that work at scale for a complex area such as data-intensive applications is a colossal challenge. We need to keep in mind that, like any other software project, frameworks are built with finite resources and time constraints. Various programming languages are used when developing them, each with their own benefits and limitations. There are always trade-offs between usability and performance. Sometimes the most user-friendly APIs may not be the most successful.
Some software designs and architectures for data-intensive applications are best suited to only certain application classes. Frameworks built according to these architectures solve specific classes of problems well but may not be as effective when applied to others. What we see in practice is that frameworks are built for one purpose and adapted to other use cases as they mature.
Data processing is a complicated space with demanding hardware, software, and application requirements. On top of this, those demands have been evolving rapidly. At times, there are so many options and variables that it seems impossible to choose the correct hardware and software for our data-intensive problems. A deep understanding of how things work beneath the surface can help us make better choices when designing our data processing architectures.
Data-intensive applications incorporate ideas from many domains of computer science. This includes areas such as computer systems, databases, high-performance computing, cloud computing, distributed computing, programming languages, computer networks, statistics, data structures, and algorithms.
We can study data-intensive applications from three perspectives: data storage and management, learning from data, and scaling data processing. Data storage and management is the first step in any data analytics pipeline. It can include techniques ranging from sequential databases to large-scale data lakes.
Learning from data can take many forms depending on the data type and use case. For example, computing basic statistics, clustering data into groups, finding interesting patterns in graph data, joining tables of data to enrich information, and fitting functions to existing data are several ways we learn from data. The overarching goal of the algorithmic perspective is to form models from data that can be used to predict something in the future.
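As a toy instance of the last example — fitting a function to existing data and using the resulting model to predict — here is a least-squares line fit with NumPy (the data points are invented for illustration and happen to lie exactly on a line):

```python
import numpy as np

# Observed inputs and outputs; these lie exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Fit a degree-1 polynomial (a line) by least squares.
slope, intercept = np.polyfit(x, y, deg=1)

def predict(v):
    """The fitted model, usable on unseen inputs."""
    return slope * v + intercept
```

Scaling this idea — fitting far richer models to far larger datasets — is exactly where the third perspective, distributed data processing, comes in.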
A key enabler for these two perspectives is the third, where data processing is done at scale. Therefore, in this book we will primarily look at data-intensive applications from a distributed processing perspective. Our focus is to understand how data-intensive applications run at scale utilizing the various computing resources available. We take principles from databases, parallel computing, and distributed computing to delve deep into the ideas behind data-intensive applications operating at scale.
This book is aimed at data engineers and data scientists. If you develop data-intensive applications or are planning to do so, the information contained herein can provide insight into how things work regarding various frameworks and tools. This book can be helpful if you are trying to make decisions about what frameworks to choose in your applications.
You should have a foundational knowledge of computer science in general before reading any further. If you have a basic understanding of networks and computer architecture, that will be helpful to understand the content better.
Our target audience is the curious reader who wants to understand how data-intensive applications function at scale. Whether you are considering developing applications using certain frameworks or are developing your own applications from scratch for highly specialized use cases, this book will help you understand the inner workings of these systems. If you are already familiar with data processing tools, it can deepen your understanding of the underlying principles.
The book starts by introducing the challenges of developing and executing data-intensive applications at scale. It then gives an introduction to data and storage systems and cluster resources. The next few chapters describe the internals of data-intensive frameworks, covering data structures, programming models, messaging, and task execution, along with a few case studies of existing systems. Finally, we talk about fault tolerance techniques and finish the book with performance implications.
Chapter 1: Scaling Data-Intensive Applications—Describes serial and parallel applications for data-intensive workloads and the challenges faced when running them at scale.
Chapter 2: Data and Storage—An overview of the data storage systems used in data processing. Both hardware and software solutions are discussed.
Chapter 3: Computing Resources—Introduces the computing resources used in data-intensive applications and how they are managed in large-scale environments.
Chapter 4: Data Structures—Describes the data abstractions used in data analytics applications and the importance of using memory correctly to speed up applications.
Chapter 5: Programming Models—Discusses the various programming models and APIs available for data analytics applications.
Chapter 6: Messaging—Examines how data analytics applications use the network to process data at scale by exchanging data between distributed processes.
Chapter 7: Parallel Tasks—Shows how to execute tasks in parallel for data analytics applications, combining them with messaging.
Chapter 8: Case Studies—Studies of a few widely used systems, highlighting how the principles discussed in the book are applied in practice.
Chapter 9: Fault Tolerance—Illustrates practical techniques for handling faults in data-intensive applications.
Chapter 10: Performance and Productivity—Defines the metrics used for measuring performance and discusses productivity when choosing tools.
Our focus here is the fundamentals of parallel and distributed computing and how they are applied in data processing systems. We take examples from existing systems to explain how these are used in practice; we are not focusing on any specific system or programming language. Our main goal is to help you understand the trade-offs among the available techniques and how they are used in practice. This book does not try to teach parallel programming.
You will also be introduced to a few examples of deep learning systems. Although DL at scale works according to the principles we describe here, we will not be going into any great depth on the topic. We also will not be covering any specific data-intensive framework on configuration and deployment. There are plenty of specific books and online resources on these frameworks.
SQL is a popular choice for querying large datasets. Still, you will not find much about it here because it is a complex subject of its own, involving query parsing and optimization, which is not the focus of this book. Instead, we look at how the programs are executed at scale.
At the end of each chapter, we list important references that paved the way for some of the discussions we included. Most of the content you will find here is the result of work done by researchers and software engineers over a long period. We include these references for any reader interested in learning more about the topics discussed.
1. D. Laney, "3D Data Management: Controlling Data Volume, Velocity and Variety," META Group Research Note, vol. 6, no. 70, p. 1, 2001.
2. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Sixth Symposium on Operating Systems Design and Implementation, pp. 137–150, 2004.
3. S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," presented at the 19th ACM Symposium on Operating Systems Principles, 2003.
4. F. Rosenblatt, The Perceptron, a Perceiving and Recognizing Automaton (Project Para). Cornell Aeronautical Laboratory, 1957.
5. T. Cover and P. Hart, "Nearest Neighbor Pattern Classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
6. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
7. D. Sculley et al., "Hidden Technical Debt in Machine Learning Systems," Advances in Neural Information Processing Systems, vol. 28, pp. 2503–2511, 2015.
