Modern Big Data Architectures - Dominik Ryzko - E-Book

Description

Provides an up-to-date analysis of big data and multi-agent systems

The term Big Data refers to cases where data sets are too large or too complex for traditional data-processing software. With the spread of new concepts such as Edge Computing and the Internet of Things, the production, processing, and consumption of this data are becoming more and more distributed. As a result, applications increasingly require multiple agents that can work together. A multi-agent system (MAS) is a self-organized computer system that comprises multiple intelligent agents interacting to solve problems that are beyond the capacities of individual agents. Modern Big Data Architectures examines modern concepts and architectures for Big Data processing and analytics.

This unique, up-to-date volume provides a joint analysis of big data and multi-agent systems, with emphasis on distributed, intelligent processing of very large data sets. Each chapter contains practical examples and detailed solutions suitable for a wide variety of applications. The author, an internationally recognized expert in Big Data and distributed Artificial Intelligence, demonstrates how base concepts such as agent, actor, and micro-service have reached a point of convergence, enabling next-generation systems to be built by incorporating the best aspects of each. This book:

  • Illustrates how data sets are produced and how they can be utilized in various areas of industry and science
  • Explains how to apply common computational models and state-of-the-art architectures to process Big Data tasks
  • Discusses current and emerging Big Data applications of Artificial Intelligence

Modern Big Data Architectures: A Multi-Agent Systems Perspective is a timely and important resource for data science professionals and students involved in Big Data analytics, machine learning, and artificial intelligence.

Page count: 329

Year of publication: 2020

Table of Contents

COVER

LIST OF FIGURES

LIST OF TABLES

PREFACE

ACKNOWLEDGMENTS

ACRONYMS

CHAPTER 1: Introduction

1.1 Motivation

1.2 Assumptions

1.3 For Whom Is This Book?

1.4 Book Structure

CHAPTER 2: Evolution of IT Architectures and Paradigms

2.1 Evolution of IT Architectures

2.2 Actors and Agents

2.3 From ACID to BASE, CAP, and NoSQL – The Database (R)evolution

2.4 The Cloud

2.5 From Distributed Sensor Networks to the Internet of Things and Cyber-Physical Systems

2.6 The Rise of Big Data

CHAPTER 3: Sources of Data

3.1 The Internet

3.2 Scientific Data

3.3 Industrial Data

3.4 Internet of Things

CHAPTER 4: Big Data Tasks

4.1 Recommender Systems

4.2 Search

4.3 Ad-tech and RTB Algorithms

4.4 Cross-Device Graph Generation

4.5 Forecasting and Prediction Systems

4.6 Social Media Big Data

4.7 Anomaly and Fraud Detection

4.8 New Drug Discovery

4.9 Smart Grid Control and Monitoring

4.10 IoT and Big Data Applications

CHAPTER 5: Cloud Computing

5.1 Cloud Enabled Architectures

5.2 Agents and the Cloud

CHAPTER 6: Big Data Architectures

6.1 Big Data Computation Models

6.2 Publish-Subscribe Systems

6.3 Stream Processing

6.4 Higher Level Big Data Architectures

6.5 Industry and Other Approaches

6.6 Actor and Agent-Based Big Data Architectures

CHAPTER 7: Big Data Analytics, Mining, and Machine Learning

7.1 To SQL or Not to SQL

7.2 Big Data Mining and Machine Learning

CHAPTER 8: Physically Distributed Systems – Mobile Cloud, Internet of Things, Edge Computing

8.1 Mobile Cloud

8.2 Edge and Fog Computing

8.3 Internet of Things

CHAPTER 9: Summary

BIBLIOGRAPHY

INDEX

END USER LICENSE AGREEMENT

Founded in 1807, John Wiley & Sons is the oldest independent publishing company in the United States. With offices in North America, Europe, Asia, and Australia, Wiley is globally committed to developing and marketing print and electronic products and services for our customers' professional and personal knowledge and understanding.

The Wiley CIO series provides information, tools, and insights to IT executives and managers. The products in this series cover a wide range of topics that supply strategic and implementation guidance on the latest technology trends, leadership, and emerging best practices.

Titles in the Wiley CIO series include:

The Agile Architecture Revolution: How Cloud Computing, REST-Based SOA, and Mobile Computing Are Changing Enterprise IT by Jason Bloomberg

Architecting the Cloud: Design Decisions for Cloud Computing Service Models (SaaS, PaaS, and IaaS) by Michael Kavis

Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses by Michael Minelli, Michele Chambers, and Ambiga Dhiraj

The CEO of Technology: Lead, Reimagine, and Reinvent to Drive Growth and Create Value in Unprecedented Times by Hunter Muller

The Chief Information Officer's Body of Knowledge: People, Process, and Technology by Dean Lane

Cloud Computing and Electronic Discovery by James P. Martin and Harry Cendrowski

Confessions of a Successful CIO: How the Best CIOs Tackle Their Toughest Business Challenges by Dan Roberts and Brian Watson

CIO Best Practices: Enabling Strategic Value with Information Technology (Second Edition) by Joe Stenzel, Randy Betancourt, Gary Cokins, Alyssa Farrell, Bill Flemming, Michael H. Hugos, Jonathan Hujsak, and Karl Schubert

The CIO Playbook: Strategies and Best Practices for IT Leaders to Deliver Value by Nicholas R. Colisto

The Complete Software Project Manager: Mastering Technology from Planning to Launch and Beyond by Anna P. Murray

Decoding the IT Value Problem: An Executive Guide for Achieving Optimal ROI on Critical IT Investments by Gregory J. Fell

Enterprise Performance Management Done Right: An Operating System for Your Organization by Ron Dimon

Information Governance: Concepts, Strategies and Best Practices, Second Edition by Robert F. Smallwood

IT Leadership Manual: Roadmap to Becoming a Trusted Business Partner by Alan R. Guibord

Leading the Epic Revolution: How CIOs Drive Innovation and Create Value Across the Enterprise by Hunter Muller

Managing Electronic Records: Methods, Best Practices, and Technologies by Robert F. Smallwood

On Top of the Cloud: How CIOs Leverage New Technologies to Drive Change and Build Value Across the Enterprise by Hunter Muller

Straight to the Top: CIO Leadership in a Mobile, Social, and Cloud-based World, Second Edition by Gregory S. Smith

Strategic IT: Best Practices for Managers and Executives, Second Edition by Arthur M. Langer and Lyle Yorks

Trust and Partnership: Strategic IT Management for Turbulent Times by Robert Benson

Transforming IT Culture: How to Use Social Intelligence, Human Factors, and Collaboration to Create an IT Department That Outperforms by Frank Wander

Unleashing the Power of IT: Bringing People, Business, and Technology Together, Second Edition by Dan Roberts

The U.S. Technology Skills Gap: What Every Technology Executive Must Know to Save America's Future by Gary J. Beach

MODERN BIG DATA ARCHITECTURES

A MULTI-AGENT SYSTEMS PERSPECTIVE

Dominik Ryżko

© 2020 by John Wiley & Sons, Inc.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993, or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Cataloging-in-Publication Data is Available:

ISBN 978-1-119-59784-1 (hardback)
ISBN 978-1-119-59794-0 (ePDF)
ISBN 978-1-119-59793-3 (ePub)

Cover Design: Wiley
Cover Image: © Pobytov/Getty Images

LIST OF FIGURES

Figure 2.1 BI in monolith architecture

Figure 2.2 Data warehouse architecture

Figure 2.3 Akka actor hierarchy

Figure 2.4 BDI architecture

Figure 2.5 Vertical layered architecture

Figure 2.6 Horizontal layered architecture

Figure 3.1 Semantic web stack

Figure 3.2 RDF graph

Figure 3.3 LOD cloud

Figure 3.4 The cost of sequencing per genome over time

Figure 3.5 The Millennium simulation

Figure 3.6 Big data value potential index

Figure 4.1 Search engine – indexing

Figure 4.2 Search engine – query

Figure 4.3 Ad-tech ecosystem

Figure 4.4 RTB message flow

Figure 5.1 OpenStack map

Figure 5.2 Deployment method evolution

Figure 5.3 Auto scaling groups

Figure 5.4 Distributed DB architectures

Figure 5.5 HDFS architecture

Figure 6.1 MapReduce

Figure 6.2 Graph operation in Dryad

Figure 6.3 Finding minimal value with Pregel

Figure 6.4 Roaring bitmaps

Figure 6.5 Summary reduction

Figure 6.6 Kafka topic

Figure 6.7 Spark Streaming flow

Figure 6.8 Unbounded table

Figure 6.9 Flink architecture

Figure 6.10 Storm topology

Figure 6.11 Storm task grouping

Figure 6.12 S4 processing node

Figure 6.13 Mantis architecture

Figure 6.14 Spark stack

Figure 6.15 The lambda architecture

Figure 6.16 Architecture for multi-agent big data processing

Figure 6.17 Kappa architecture

Figure 6.18 Delta architecture

Figure 6.19 Realtime data processing at Facebook

Figure 6.20 Starfish ecosystem

Figure 7.1 Hive architecture (MapReduce)

Figure 7.2 Tez data processing

Figure 7.3 High-level architecture of Kylin

Figure 7.4 Shark architecture

Figure 7.5 SparkSQL data flows

Figure 7.6 SparkSQL query planning

Figure 7.7 Twitter ML architecture – integration of learners into Pig storage functions

Figure 7.8 Pipelined MapReduce

Figure 8.1 Mobile cloud computing architecture

Figure 8.2 Mobile cloud computing architecture with backend and local clouds

Figure 8.3 Cloud of things architecture

Figure 8.4 IoTCloud architecture

LIST OF TABLES

Table 3.1 Social media users

Table 3.2 The amount of data collected by 1 million metering devices in a year

Table 5.1 Cloud computing versus multi-agent systems

Table 6.1 Spark versus Flink versus Storm

PREFACE

Over several years of my career in IT, I have observed how various ideas and technologies have come and gone, taking different paths from being new and innovative to maturity and adoption, only to be replaced by even newer concepts as they arrive. Some gained popularity very quickly and became the buzzwords of their time, something that everybody tries and claims to be doing. Such is the case of big data, whose popularity skyrocketed and which was embraced by research, industry, and governments alike. In 2012 the Obama Administration announced the Big Data Research and Development Initiative [2012], acknowledging big data as a key enabler to accelerate the pace of discovery in science and engineering, strengthen national security, and transform teaching and learning. Only recently has big data been overshadowed by the widespread adoption of artificial intelligence (AI), which, by the way, builds on the foundations of big data. However, big data will remain strong for the foreseeable future.

Other promising technologies from the past have not stood the test of time. The rise of statistical approaches to AI, and of deep learning in particular, over the past decade dealt the final blow to the symbolic methods, which I found elegant and which fascinated me during my undergraduate studies in the 1990s. Have the logical systems passed forever? Possibly not; after all, there are still open questions as to how humans analyze facts, reason, and make decisions, which we are not yet able to model purely by statistical methods. Only the future will show us in which direction science will progress.

Another interesting story is related to Multi-Agent Systems (MAS), the field I picked for my PhD and later research. While the concept of an agent, or the closely related actor, goes back to the 1970s, it never gained wide popularity outside a relatively narrow research community and some niche business applications. Despite bringing innovative views on information system paradigms and promising to solve some of their challenges, the idea never became widely adopted.

As my focus started to switch toward industry projects and I became more and more involved with building large-scale big data and AI systems, I observed that some of the fundamental assumptions behind MAS have made their way into mainstream information systems. Monolithic systems, dominant at the beginning of my career, started giving way to services and later to micro-services. Devices distributed in the physical environment gained computational power and thus built-in intelligence and increased autonomy. These changes were happening not because MAS were gaining in popularity, but because it made sense to solve real issues in this way.

Finally, a few years back, I was able to formulate a more concrete conclusion, which can be used as a working thesis for this book: mainstream computer science is on a convergence path with multi-agent paradigms. Or, to be more specific, the fundamental building blocks of modern information systems have been gaining the properties attributed to agents in MAS, and thus whole systems have become more adaptive, autonomous, and intelligent. I decided to devote some time to studying these analogies, by comparing the fundamental assumptions and paradigms as well as by looking at the applications of MAS in solving various problems in the big data area. This book summarizes this research by taking a journey through modern big data architectures viewed through the eyes of the MAS domain.

I hope the view taken in this book will be fresh and interesting and will inspire further critical thinking about the evolution of contemporary information systems and the direction they are heading.

Dominik Ryżko

Warsaw

August 2019

ACKNOWLEDGMENTS

As the work on this book, from the initial idea to its completion, stretched over a period of a few years, it is not possible to mention all the people with whom I have discussed the ideas and the book itself during this period. However, a few of them have had significantly more influence on my thoughts and the final shape of the work.

Most of all, I want to thank my family for supporting me and accepting the effort and time needed for such an endeavor. I want to thank my supervisors and directors at the Institute of Computer Science, Warsaw University of Technology, Professors Marzena Kryszkiewicz, Henryk Rybiński, Mieczysław Muraszkiewicz, and Jarosław Arabas, for encouraging me to pick up this project and for their valuable advice. Special thanks go to my friend and colleague Bartłomiej Trwardowski, with whom I have spent numerous hours exchanging thoughts and ideas on various scientific topics and who was kind enough to provide feedback on an early draft. Last but not least, I thank my past and future students, who are among the main recipients of this work. Your open and curious minds were a big motivator to make this book insightful, covering the most important ideas but also focusing on practical topics. I hope you will find it so.

ACRONYMS

ACID: Atomicity, Consistency, Isolation, Durability

ACL: Asynchronous Connection-Less

AI: Artificial Intelligence

AMI: Advanced Metering Infrastructure

API: Application Programming Interface

BASE: Basically Available, Soft state, Eventually consistent

BDI: Belief Desire Intention

BI: Business Intelligence

BPEL: Business Process Execution Language

CAP: Consistency, Availability, Partition tolerance

CEP: Complex Event Processing

CPS: Cyber-Physical Systems

CRM: Customer Relationship Management

CSO: Cooperative Smart Object

CUDA: Compute Unified Device Architecture

DAG: Directed Acyclic Graph

DBMS: DataBase Management System

DFS: Distributed File System

DL: Description Logic

DMP: Data Management Platform

DNS: Domain Name System

DSN: Distributed Sensor Networks

DSP: Demand-Side Platform

EDA: Event-Driven Architecture

EHR: Electronic Health Records

ERP: Enterprise Resource Planning

ESB: Enterprise Service Bus

ETL: Extract Transform Load

FAP: Femto-Access Points

GIS: Geographical Information Systems

GPGPU: General Purpose computing on Graphics Processing Units

GPS: Global Positioning System

HDFS: Hadoop Distributed File System

HEP: High Energy Physics

IaaS: Infrastructure as a Service

IDS: Intrusion Detection Systems

IFP: Information Flow Processing

JSON: JavaScript Object Notation

KPI: Key Performance Indicator

LHC: Large Hadron Collider

MAS: Multi-Agent System

MCC: Mobile Cloud Computing

MEC: Mobile Edge Computing

ML: Machine Learning

MRI: Magnetic Resonance Imaging

M2M: Machine to Machine

NFC: Near-Field Communication

NLP: Natural Language Processing

OWL: Web Ontology Language

OS: Operating System

PaaS: Platform as a Service

PCA: Principal Component Analysis

PET: Positron Emission Tomography

PSL: Probabilistic Soft Logic

QoS: Quality of Service

RDF: Resource Description Framework

REST: REpresentational State Transfer

RFID: Radio Frequency IDentification

RPC: Remote Procedure Call

RTB: Real-Time Bidding

SaaS: Software as a Service

SEM: Search Engine Marketing

SEO: Search Engine Optimization

SGD: Stochastic Gradient Descent

SLA: Service Level Agreement

SOA: Service Oriented Architecture

SPARQL: SPARQL Protocol and RDF Query Language

SRL: Statistical Relational Learning

SQL: Structured Query Language

SSP: Supply-Side Platform

URI: Uniform Resource Identifier

VM: Virtual Machine

VPN: Virtual Private Network

WSN: Wireless Sensor Networks

XML: eXtensible Markup Language

CHAPTER 1: Introduction

1.1 Motivation

In recent years, big data has emerged as one of the leading trends not only in computer science but, due to its potential, also in the economy, science, and major branches of industry. People have realized that huge data sets have become a key asset that should be taken into account in evaluating business opportunities, company valuations, or product development. Several major mergers and acquisitions in recent years have been driven not only by the desire to gain synergies, customer base, or market access, but also by the desire to obtain access to valuable customer data. For example, Microsoft's acquisition of LinkedIn gave it data on jobs, skills, career paths, and a contact network of millions of workers across the globe.

For technology vendors and consultancies, as well as numerous startups, this rapid growth opened up huge new business opportunities. According to IDC, the market value of big data and business analytics is expected to grow beyond $200 billion by the year 2020 (Forbes [2017]). These forecasts have fueled huge investments in big data related research and development efforts, both in academia and in industry, leading to a wide range of proposed architectures, solutions, models, algorithms, as well as commercial products.

Large industry players have made the big data concept fundamental to their products, architectures, and strategies. Every day, new ventures emerge that concentrate solely on big data as an opportunity for innovation and growth. Those who failed to follow the trend early face rising competition and disruption, even in well-established and heavily regulated industries such as banking or insurance, as can be seen from the growing number of fintech and insurtech ventures.

Academia has been intensively updating curricula to educate the next generation of data scientists, big data engineers, DevOps specialists, etc. The research areas and goals of computer science departments have followed suit. New dedicated MOOCs (Massive Open Online Courses) become available every month and gather thousands of attendees. The number of conference tracks and entirely new events around the subjects of analytics and processing of big data is growing rapidly each year.

While there is no single agreed-upon definition of big data, it is commonly regarded as a general move towards analytics and applications that rely heavily on the processing of extremely large data sets in order to provide intelligent, personalized services to users and to other services in the ecosystem. This trend has been largely supported by recent advances in parallel computing architectures, the emergence of NoSQL databases, cloud computing technologies, and continuous improvements in machine learning and other branches of Artificial Intelligence (AI).

Multi-Agent Systems (MAS) use the concept of the agent as a central entity for building systems. This is often confusing, as the term is heavily overloaded even within computer science, not to mention its use in multiple other disciplines such as economics, sociology, cognitive science, etc. MAS, however, enumerates specifically the properties an agent should implement. It should be autonomous, understood as making its own decisions based on internal state, goals, and observations. An agent should be proactive, so it should act when it believes it is appropriate, not only when explicitly called. Finally, it should be intelligent in the AI sense of intelligence, therefore capable of solving complex tasks and learning from past experience. Building on such components, MAS tries to assemble complex systems in which agents communicate asynchronously and collaboratively solve given tasks.
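To make these properties more tangible, the following is a minimal sketch in Python, not taken from any particular MAS framework. The `Agent` class, its `goal` and `level` fields, and the `simulate` function are hypothetical names introduced only for illustration. Each agent keeps private state, decides on its own when to act (autonomy and proactivity), and talks to its peer only through an asynchronous mailbox. The learning aspect of intelligence is deliberately left out.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Toy agent (hypothetical): private state, a goal, and an asynchronous mailbox."""
    name: str
    goal: int    # load level the agent tries not to exceed
    level: int   # internal state; other agents never modify it directly
    inbox: asyncio.Queue = field(default_factory=asyncio.Queue)

    async def run(self, peer: "Agent", steps: int) -> None:
        for _ in range(steps):
            # Observe: absorb whatever messages arrived since the last step.
            while not self.inbox.empty():
                self.level += self.inbox.get_nowait()
            # Decide and act proactively: nobody calls the agent to do this;
            # it compares its own state with its own goal and offloads the surplus.
            surplus = self.level - self.goal
            if surplus > 0:
                self.level -= surplus
                await peer.inbox.put(surplus)  # asynchronous message, no direct call
            await asyncio.sleep(0)  # yield control so the other agent can run


async def simulate() -> dict:
    # Agents are created inside the event loop so their queues belong to it.
    a = Agent("a", goal=5, level=10)
    b = Agent("b", goal=5, level=0)
    await asyncio.gather(a.run(b, steps=3), b.run(a, steps=3))
    return {a.name: a.level, b.name: b.level}


if __name__ == "__main__":
    print(asyncio.run(simulate()))  # → {'a': 5, 'b': 5}
```

Neither agent is told what to do by a central controller; the balanced outcome emerges from local decisions and message passing, which is the essence of the collaborative problem solving described above.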

Even though MAS emerged as a separate field of research much earlier than big data, it failed to achieve such wide adoption and popularity. We can identify several reasons for this. One is that, until recently, there were no advanced and mature architectures for efficient distributed asynchronous processing. Only in the last decade have the limits of Moore's Law increased efforts towards parallel computation. Another reason is the radical approach to the distribution of control in MAS. Agents were proposed as highly independent, autonomous, proactive entities communicating with the use of “soft” protocols, e.g. negotiation and argumentation. These assumptions were not in line with the available means of monitoring such systems, and so were unacceptable for several practical industry applications where strict control and risk minimization are key, e.g. energy grid management, financial systems, traffic monitoring, etc.

This publication argues that the fields of big data and MAS have a lot in common. If we track the evolution of IT systems from monoliths, through SOA, to micro-services and most recently cyber-physical systems, we can see that the elementary system components more and more resemble agents as proposed many decades back. We rely more and more on loosely coupled components centered around some well-defined functionality and capable of autonomous and flexible operation even if other components fail or are temporarily out of reach. Now that distributed, cloud-based computing has become standard, database paradigms have shifted away from a strict transactional approach, and physical objects are obtaining built-in intelligence, the MAS approach no longer looks radical and infeasible.

It seems we have arrived at the point where several research results achieved in both fields can be combined and benefit from cross-fertilization of ideas, tools, and architectures. Mobile agents for sensor networks can be applied to real-time analytics in the fast-growing area of the Internet of Things (IoT). Distributed machine learning algorithms can be coordinated with multi-agent cooperation protocols. Mobile and IoT cloud computing environments experience challenges related to resources and latency similar to the ones present in MAS, especially for mobile agents.

On the other hand, modern big data environments offer unprecedented possibilities for performing large-scale computations in both batch and streaming mode, which can greatly enhance the capabilities of MAS. Cloud resources supporting mobile and IoT devices might well be used to empower intelligent agents located in the environment. On the lower level, modern distributed programming libraries (e.g. Akka for Scala) can greatly improve the performance of MAS, which often use less advanced environments that are not capable of efficient thread and resource management.

1.2 Assumptions

While establishing the scope and focus of this book, several assumptions and compromises had to be made. Firstly, when describing a field such as big data, where new concepts and projects emerge on a daily basis, it is difficult to resist the temptation to include every new finding so that the book is as up to date as possible at the time of publishing. On the other hand, it is difficult to predict the future of freshly proposed solutions before they become more mature and are hardened by real-life applications.

Therefore, difficult choices have been made, and some might argue that a particular important architecture, project, or framework has been left out. In general, I have followed the rule of writing about topics that have some proven maturity, e.g. have become mainstream Apache projects, have been followed by highly cited publications, or have been applied by at least one of the large and recognized industry players.

Secondly, since the book title refers to big data architectures, the contents concentrate on large-scale solutions capable of solving practical problems experienced in industry. Therefore, specific tools applicable at particular points in the larger architectures are described only to the extent that they are relevant to the big picture they take part in, rather than in their internal and technical details. For example, Hadoop, which is often regarded as a technological synonym for big data, is described as a component for batch processing used in larger big data architectures. MapReduce, Hadoop's underlying algorithm, is presented as one of the generic computational models for processing extremely large data sets. Similarly, Spark is an example of stream processing and plays that role in larger big data setups.

In the field of MAS things have been somewhat easier, since the field is more mature in general and several comprehensive textbooks have been published to date, which summarize the research and development efforts in this area. Therefore, major agent models and architectures are described in line with the state of the art long established in the field. This is complemented with some more recent and more specific examples of applications of multi-agent paradigms in solving various big data problems.

1.3 For Whom Is This Book?

This book could be of interest to both researchers and practitioners from the fields of big data, analytics, machine learning, MAS, distributed computing, cloud computing, distributed artificial intelligence, as well as a number of other related fields.

The intention is for anyone from the fields mentioned above to see the current state of the art in distributed, asynchronous processing of massive data sets. In addition, it will be shown how various fields and areas of research relate to each other by tackling similar issues and challenges from their respective perspectives.

For big data practitioners not familiar with MAS research it may come as a surprise how many relevant ideas and concepts have already been analyzed several years back. MAS researchers will find several big data environments, libraries, and tools very useful for taking their systems to the next level of efficiency.

In the end I hope that this book will initiate mutual discussion and exchange of ideas, which is to some extent already present but could become much more intense and fruitful.

1.4 Book Structure

The book is organized as follows. Chapter 2 discusses how major paradigms and concepts have changed over the last few decades, leading to the current landscape. Specifically we will analyze how the evolution of IT architectures influenced storage and analytics of the data. We will also look at the shift of paradigms in database systems, the growing role of the cloud, the Internet, and the IoT. Also the concepts of an agent and an actor are introduced. We conclude by discussing how all these trends led to the rise of big data.

In Chapter 3 we look at where the data comes from in the big data setups. We start with the Internet as the most commonly available data source today. Then we iterate over various branches of science and industry looking at how much data they generate and what is specific about each of them. Finally, the IoT as the fast growing source of huge data streams is described.

Once we are familiar with the data sources, the book dives into specific tasks which need to be performed with the use of the data. Chapter 4 looks at the most important challenges that research and industry are working on in the big data area. This covers recommender systems, search, real time bidding, as well as multiple other topics.

Cloud computing is discussed in Chapter 5. It deserves a separate chapter as a major trend shaping the creation of the next generation of information systems. We look at the advantages and challenges of utilizing cloud resources and how it enables the building of scalable, distributed big data systems. The means for efficient cloud management both in VM and container based setups are described.

In Chapter 6 several big data architectures are presented. We start with fundamental computational models and move towards more complex setups. This includes among others Lambda and Kappa architectures, which have recently emerged as important design patterns for building scalable big data processing and analytics. A separate section is devoted to stream processing.

The means for data analytics and building machine learning models are the subject of Chapter 7. The role of SQL versus other forms of ad-hoc interaction with the data is analyzed, together with tools and architectures for providing SQL capabilities in NoSQL environments. We also look at frameworks and tools for efficient building, deploying, and testing of machine learning models.

Geographically distributed systems are the topic of Chapter 8. We will take a look at how the latest trends driven by mobile computing and the IoT led to the emergence of edge and fog computing as new paradigms for extending the cloud towards the distributed elements of the cyber-physical systems.

The work is closed by Chapter 9 with a summary and conclusions. References to the literature complete the volume.

CHAPTER 2 Evolution of IT Architectures and Paradigms

2.1 Evolution of IT Architectures

Over recent decades corporate IT architectures have evolved significantly. Starting from the large monolith application, through the introduction of web services and the emergence of the Service Oriented Architecture (SOA), which has evolved into microservices, we went through the wide adoption of cloud computing and have now reached the popularity of edge computing, the Internet of Things and cyber-physical systems. Each of these steps required a change in the way we produced, processed, stored, and analyzed the data, which will be explored in the subsequent sections of this chapter.

2.1.1 Monolith

Back in the 1990s corporate systems were built mainly as large monolith applications. They were based on a number of tightly coupled modules with strong interdependencies. This caused high development and maintenance costs. At the beginning of the software development process it is beneficial to have all the building blocks in one place, but as the system grows, it becomes tedious to track all the internal dependencies and the code base becomes hard to manage. The growing size and complexity of a monolith impacts all software life cycle steps influencing design, development, testing, and deployment.

Each design and development decision taken in a monolith system has long lasting consequences. This phenomenon is well described by the term technical debt, coined by Cunningham [1993]. The larger the system, the more reluctant we are to introduce necessary changes, and the debt grows.

Figure 2.1 BI in monolith architecture.

In monolith systems scalability is limited. More instances of the system can be set up to introduce load balancing. However, replicating the entire functionality each time is very costly. Demand for different functionalities can vary, and we do not have the tools to scale them separately.

On the other hand it is relatively easy to manage and analyze the data processed by such systems. We usually have a single underlying database with a relational schema, which can be easily exported to an analytical environment, typically a data warehouse, where a set of BI tools produce reports, KPI visuals, dashboards, etc. In the worst case we have to deal with a handful of monolith systems (e.g. ERP, CRM, Billing, etc.) and introduce some form of Extract Transform Load (ETL) processing, in order to combine them before loading into the warehouse. Figure 2.1 shows the overall reporting architecture in the world of monolith systems.

The methodology for creating and maintaining a data warehouse is well researched by now. Typically, the following layers can be identified in such a system:

Data Source Layer – systems and sources which feed the data into the warehouse

Data Extraction Layer – responsible for pulling the data into the warehouse

Staging Area – the area where data stays before the major transformations (ETL) begin

ETL Layer – a set of processes which transform the data into a format usable for reporting and analysis

Figure 2.2 Data warehouse architecture.

Source: Kimball and Ross (2011). Reproduced with permission.

Data Storage Layer – stores the data after it has been transformed and cleaned

Data Logic Layer – gives semantics to the data by defining the report structure

Data Presentation Layer – provides an interface to the user

Metadata Layer – describes the data stored in the warehouse

System Operations Layer – allows administrators to manage the data warehouse
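
The first few layers listed above can be illustrated with a minimal sketch. It assumes two hypothetical source exports (a CRM and a billing system) and uses an in-memory SQLite database as a stand-in for the warehouse; the table names, columns, and sample rows are invented for illustration and are not taken from any particular product.

```python
import sqlite3

# Hypothetical source rows, standing in for exports from two monolith systems.
CRM_ROWS = [(1, " Alice ", "PL"), (2, "Bob", "de")]
BILLING_ROWS = [(1, "2020-01-15", 120.0), (1, "2020-02-15", 80.0), (2, "2020-01-20", 50.0)]


def extract(conn):
    """Data extraction layer: copy raw source rows into the staging area unchanged."""
    conn.execute("CREATE TABLE stg_customer (id INTEGER, name TEXT, country TEXT)")
    conn.execute("CREATE TABLE stg_invoice (customer_id INTEGER, issued TEXT, amount REAL)")
    conn.executemany("INSERT INTO stg_customer VALUES (?, ?, ?)", CRM_ROWS)
    conn.executemany("INSERT INTO stg_invoice VALUES (?, ?, ?)", BILLING_ROWS)


def transform_and_load(conn):
    """ETL and storage layers: clean the staged rows and join them into one fact table."""
    conn.execute(
        """
        CREATE TABLE fact_revenue AS
        SELECT TRIM(c.name) AS customer,
               UPPER(c.country) AS country,
               SUBSTR(i.issued, 1, 7) AS month,
               i.amount AS amount
        FROM stg_invoice AS i
        JOIN stg_customer AS c ON c.id = i.customer_id
        """
    )


def revenue_by_country(conn):
    """Data logic layer: a report definition expressed as a query."""
    rows = conn.execute(
        "SELECT country, SUM(amount) FROM fact_revenue GROUP BY country ORDER BY country"
    )
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")
extract(conn)
transform_and_load(conn)
report = revenue_by_country(conn)
print(report)
```

In a real warehouse each of these steps would be a separate, scheduled process, and the staging tables would typically live in a different schema from the cleaned data.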

In large organizations data marts are usually created, which are subsets of the overall data, limited and optimized for specific groups of users. Data marts are efficient for analysis across multiple predefined dimensions such as time, region, product, etc. (Kimball and Ross [2011]). A data warehouse architecture is shown in Figure 2.2.
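
Analysis across predefined dimensions essentially means aggregating a measure over chosen combinations of dimension values. The sketch below shows this on a handful of hypothetical fact rows; the `roll_up` helper and the sample data are assumptions made for illustration, not part of any specific BI tool.

```python
from collections import defaultdict

# Hypothetical fact rows: (month, region, product, revenue).
FACTS = [
    ("2020-01", "EU", "basic", 100.0),
    ("2020-01", "EU", "premium", 250.0),
    ("2020-01", "US", "basic", 80.0),
    ("2020-02", "EU", "basic", 120.0),
]
DIMENSIONS = ("month", "region", "product")


def roll_up(facts, by):
    """Sum revenue over every combination of the dimensions named in `by`."""
    positions = [DIMENSIONS.index(name) for name in by]
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[p] for p in positions)
        totals[key] += row[-1]  # the last column holds the measure
    return dict(totals)


by_region = roll_up(FACTS, by=("region",))
by_month_region = roll_up(FACTS, by=("month", "region"))
print(by_region)
print(by_month_region)
```

A data mart would typically precompute or index such aggregates for the dimensions its users query most often.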

While ETL processes in a large organization can become quite complex, entities coming from a single monolith system are well structured and related to each other. What remains is managing the relations between the data sets from various monoliths and from external sources, if we wish to include them in our reporting setup.

2.1.2 Service Oriented Architecture

In the 2000s Service Oriented Architecture (SOA) paradigms were introduced. The idea was to break the large systems into reusable components, implementing specific groups of functionalities accessible through strictly defined APIs. In SOA the services are more loosely coupled than in the monolith systems. In other words, services are self-describing, open components that support rapid, low-cost composition of distributed applications (Papazoglou [2003]).

The Open Group formally defines SOA in the following way:

SOA is an architectural style that supports service-orientation.

Service-orientation is a way of thinking in terms of services and service-based development and the outcomes of services.

A service:

Is a logical representation of a repeatable business activity that has a specified outcome (e.g. check customer credit, provide weather data, consolidate drilling reports)

Is self-contained

May be composed of other services

Is a “black box” to consumers of the service
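
The properties listed above can be sketched in a few lines of code. The example below is a simplified, in-process illustration only: the `Service` protocol, the `CreditCheck` and `LoanOffer` classes, and the scoring rule are all hypothetical, and a real SOA deployment would expose such services over the network through Web Service interfaces.

```python
from typing import Protocol


class Service(Protocol):
    """Contract seen by consumers: a request goes in, an outcome comes out."""

    def call(self, request: dict) -> dict: ...


class CreditCheck:
    """Self-contained service; the (made-up) scoring rule is hidden from consumers."""

    def call(self, request: dict) -> dict:
        return {"approved": request["income"] >= 3 * request["amount"]}


class LoanOffer:
    """Composite service built on top of another service."""

    def __init__(self, credit_check: Service) -> None:
        self._credit_check = credit_check

    def call(self, request: dict) -> dict:
        # The consumer of LoanOffer never sees how the credit decision is made.
        decision = self._credit_check.call(request)
        rate = 0.05 if decision["approved"] else None
        return {"approved": decision["approved"], "rate": rate}


offer = LoanOffer(CreditCheck()).call({"income": 9000, "amount": 2000})
print(offer)
```

Because `LoanOffer` depends only on the `Service` contract, the credit check could be replaced by a different implementation without changing the composite service.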

Such a setup requires a composition layer, which provides coordination, monitoring, conformance, and QoS functionalities in order to provide composite services to the clients. The backbone of the SOA system which allows it to do this is called the Enterprise Service Bus (ESB). The following specific tasks can be handled by the ESB (Josuttis [2007]):

Providing connectivity

Data transformation

(Intelligent) routing

Dealing with security

Dealing with reliability

Service management

Monitoring and logging
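
To make some of these tasks more tangible, the following toy sketch shows connectivity, data transformation, routing, and logging in a single in-process class. It is an illustration under strong simplifying assumptions: the `ServiceBus` class, the topic name, and the message format are hypothetical, and a production ESB additionally handles security, reliability, and service management across a network.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("esb")


class ServiceBus:
    """Toy in-process bus: registration, routing, transformation, and logging."""

    def __init__(self):
        self._routes = {}      # topic -> service handler
        self._transforms = {}  # topic -> function normalizing incoming messages

    def register(self, topic, handler, transform=None):
        """Connectivity: attach a service to a topic, with an optional message transform."""
        self._routes[topic] = handler
        if transform is not None:
            self._transforms[topic] = transform

    def send(self, topic, message):
        """Route a message to the service registered for its topic."""
        if topic not in self._routes:
            raise LookupError(f"no service registered for topic {topic!r}")
        transform = self._transforms.get(topic)
        if transform is not None:
            message = transform(message)  # data transformation
        log.info("routing %s -> %s", topic, message)  # monitoring and logging
        return self._routes[topic](message)


bus = ServiceBus()
bus.register(
    "weather",
    handler=lambda msg: {"city": msg["city"], "temp_c": 21},
    transform=lambda msg: {"city": msg["City"].strip().lower()},
)
reply = bus.send("weather", {"City": " Warsaw "})
print(reply)
```

The client only knows the topic and the bus; which service answers, and in what format it expects its input, is decided by the bus configuration.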

In order to manage the business processes, specific languages, e.g. XML-based BPEL (Business Process Execution Language) and business process servers have been introduced. The services can be built in various technologies as long as their APIs follow Web Service standards.

As the number of services and potential interactions in SOA increases, new problems arise. The dynamic nature of collaborating services means several issues can be experienced at run-time. The network can lag, messages can be lost, and services can experience performance problems or crash entirely.
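
One common way for a client to cope with such transient failures is to retry the call with an increasing delay. The sketch below is one possible approach rather than a prescribed one; the `ServiceUnavailable` exception, the `call_with_retry` helper, and the simulated flaky service are all hypothetical names introduced for illustration.

```python
import time


class ServiceUnavailable(Exception):
    """Raised by a (hypothetical) service client when a call fails or times out."""


def call_with_retry(call, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Invoke `call`, retrying with exponential backoff on ServiceUnavailable."""
    for attempt in range(attempts):
        try:
            return call()
        except ServiceUnavailable:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)


# Simulated flaky service: fails twice, then answers.
outcomes = iter([ServiceUnavailable(), ServiceUnavailable(), "ok"])


def flaky_service():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = call_with_retry(flaky_service)
print(result)
```

Retries help only with transient problems, and they are safe only when repeating the call has no unwanted side effects on the service.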