Data Lakes - E-Book

Description

The concept of a data lake is less than 10 years old, yet it is already widely implemented in large companies. The goal of a data lake is to deal efficiently with ever-growing volumes of heterogeneous data, while also meeting increasingly sophisticated user needs. However, defining and building a data lake is still a challenge, as no consensus has been reached so far. Data Lakes presents recent outcomes and trends in the field of data repositories. The main topics discussed are the data-driven architecture of a data lake; the management of metadata – supplying key information about the stored data, master data and reference data; the roles of linked data and fog computing in a data lake ecosystem; and how gravity principles apply in the context of data lakes. A variety of case studies are also presented, providing the reader with practical examples of data lake management.


Pages: 315

Publication year: 2020




Table of Contents

Cover

Preface

1 Introduction to Data Lakes: Definitions and Discussions

1.1. Introduction to data lakes

1.2. Literature review and discussion

1.3. The data lake challenges

1.4. Data lakes versus decision-making systems

1.5. Urbanization for data lakes

1.6. Data lake functionalities

1.7. Summary and concluding remarks

2 Architecture of Data Lakes

2.1. Introduction

2.2. State of the art and practice

2.3. System architecture

2.4. Use case: the Constance system

2.5. Concluding remarks

3 Exploiting Software Product Lines and Formal Concept Analysis for the Design of Data Lake Architectures

3.1. Our expectations

3.2. Modeling data lake functionalities

3.3. Building the knowledge base of industrial data lakes

3.4. Our formalization approach

3.5. Applying our approach

3.6. Analysis of our first results

3.7. Concluding remarks

4 Metadata in Data Lake Ecosystems

4.1. Definitions and concepts

4.2. Classification of metadata by NISO

4.3. Other categories of metadata

4.4. Sources of metadata

4.5. Metadata classification

4.6. Why metadata are needed

4.7. Business value of metadata

4.8. Metadata architecture

4.9. Metadata management

4.10. Metadata and data lakes

4.11. Metadata management in data lakes

4.12. Metadata and master data management

4.13. Conclusion

5 A Use Case of Data Lake Metadata Management

5.1. Context

5.2. Related work

5.3. Metadata model

5.4. Metadata implementation

5.5. Concluding remarks

6 Master Data and Reference Data in Data Lake Ecosystems

6.1. Introduction to master data management

6.2. Deciding what to manage

6.3. Why should I manage master data?

6.4. What is master data management?

6.5. Master data and the data lake

6.6. Conclusion

7 Linked Data Principles for Data Lakes

7.1. Basic principles

7.2. Using Linked Data in data lakes

7.3. Limitations and issues

7.4. The smart cities use case

7.5. Take-home message

8 Fog Computing

8.1. Introduction

8.2. A little bit of context

8.3. Every machine talks

8.4. The volume paradox

8.5. The fog, a shift in paradigm

8.6. Constraint environment challenges

8.7. Calculations and local drift

8.8. Quality is everything

8.9. Fog computing versus cloud computing and edge computing

8.10. Concluding remarks: fog computing and data lake

9 The Gravity Principle in Data Lakes

9.1. Applying the notion of gravitation to information systems

9.2. Impact of gravitation on the architecture of data lakes

Glossary

References

List of Authors

Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1. Data warehouses versus data lakes

Chapter 4

Table 4.1. Metadata classification by NISO [RIL 17]

Chapter 5

Table 5.1. List of implemented datasets

Table 5.2. Comparison of relational and graph DBMS

List of Illustrations

Chapter 1

Figure 1.1. Queries about “data lake” on Google

Figure 1.2. Baseline architecture of a data lake as proposed by IBM [IBM 14]

Figure 1.3. Baseline architecture of a data lake of [SUR 16]

Figure 1.4. Data lake software prices

Figure 1.5. Data lake sales: positive and negative aspects and challenges

Figure 1.6. Data lake position in the information system

Figure 1.7. Data lake interaction in information systems

Figure 1.8. Urbanization of the information system

Figure 1.9. Functional architecture of a data lake

Figure 1.10. Applicative architecture of a data lake (Hortonworks™)

Figure 1.11. Acquisition module shared between data lake and decision-support sy...

Chapter 2

Figure 2.1. Architecture of a data lake

Figure 2.2. Schema management workflow in a data lake

Figure 2.3. Constance system overview

Figure 2.4. Major components of Constance user interface: (a) data source manage...

Chapter 3

Figure 3.1. Feature model of the categorization functionality

Figure 3.2. Modeling a data lake through functionalities

Figure 3.3. Knowledge base - acquisition functionality

Figure 3.4. A concept

Figure 3.5. A concept lattice

Figure 3.6. An equivalent class feature diagram

Figure 3.7. Our process for the production of product lines

Figure 3.8. Creation of the formal context for the categorization functionality

Figure 3.9. Formal context and the associated concept in the case of secure func...

Figure 3.10. The FCA and AOC-poset for the secure functionality

Chapter 4

Figure 4.1. Metadata subject areas

Figure 4.2. Scenario 1: point-to-point metadata architecture

Figure 4.3. Scenario 2: hub and spoke metadata architecture

Figure 4.4. Scenario 3: tool of record metadata architecture

Figure 4.5. Scenario 4: hybrid metadata architecture

Figure 4.6. Scenario 5: federated metadata architecture

Figure 4.7. Metadata management system context

Figure 4.8. Metadata layers in an ecosystem based on a data lake

Chapter 5

Figure 5.1. Data lake definitions

Figure 5.2. Data lake functional architecture evolution

Figure 5.3. Data lake functional architecture

Figure 5.4. An implementation of the data lake functional architecture

Figure 5.5. Metadata classification

Figure 5.6. Metadata classification

Figure 5.7. Class diagram of metadata conceptual model

Figure 5.8. Search datasets by keywords

Figure 5.9. Logical data model

Figure 5.10. Request for finding all the datasets related to “brain trauma”

Figure 5.11. Request for finding source or target dataset

Figure 5.12. Request for finding all the datasets that have not been processed

Figure 5.13. Mapping between UML class diagram and Neo4J Cypher query language

Figure 5.14. Neo4j data model

Figure 5.15. Find the datasets concerning “brain trauma” – Neo4j

Figure 5.16. Find the relevant dataset – Neo4j

Figure 5.17. Find the datasets concerning “brain trauma” and show all the inform...

Figure 5.18. Find the datasets that have not been processed – Neo4j

Chapter 6

Figure 6.1. Illustration of master data, reference data and metadata

Figure 6.2. Scenario 1: MDM hub

Figure 6.3. Scenario 2: MDM in the data lake

Chapter 8

Figure 8.1. Edge/fog/cloud computing layers

Chapter 9

Figure 9.1. The gravitational force keeps the planets in their orbits

Figure 9.2. Space–time distortion

Figure 9.3. Gravitation in an information system

Figure 9.4. Data-process attraction

Figure 9.5. Hybrid architecture of data lakes


To Christine Collet

Databases and Big Data Set

coordinated by

Dominique Laurent and Anne Laurent

Volume 2

Data Lakes

Edited by

Anne Laurent

Dominique Laurent

Cédrine Madera

First published 2020 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd, 27-37 St George’s Road, London SW19 4EU, UK

www.iste.co.uk

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

www.wiley.com

© ISTE Ltd 2020

The rights of Anne Laurent, Dominique Laurent and Cédrine Madera to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Control Number: 2019954836

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

ISBN 978-1-78630-585-5

Preface

This book is part of a series entitled Databases and Big Data (DB & BD), the content of which is motivated by the radical and rapid evolution (not to say revolution) of database systems over the last decade. Indeed, since the 1970s – inspired by the relational database model – many research topics have emerged in the database community, such as deductive databases, object-oriented databases, semi-structured databases, the resource description framework (RDF), open data, linked data, data warehouses, data mining and, more recently, cloud computing, NoSQL and Big Data, to name just a few. Currently, the last three topics are increasingly important and attract most research efforts in the domain of databases. Consequently, considering that Big Data environments must now be handled in most current applications, the goal of this series is to address some of the latest issues arising in such environments. In doing so, in addition to reporting on specific recent research results, we aim to provide readers with evidence that database technology is changing significantly, so as to face the important challenges arising in most applications. More precisely, although relational databases are still commonly used in traditional applications, it is clear that most current Big Data applications cannot be handled by relational database management systems (RDBMSs), mainly for the following reasons:

– efficiency in the face of Big Data in a distributed and replicated environment is now a key requirement that RDBMSs fail to meet, in particular when it comes to joining very large tables;

– there is a strong need for handling heterogeneous data, whether structured, semi-structured or even unstructured, for which no common schema exists;

– data warehouses are not flexible enough to handle such a variety of data and usages.

Data lakes appeared a few years ago in industrial applications and are now deployed in most big companies to extract value from their data.

More recently, academics have focused their interest on this concept and proposed several contributions towards formalizing and exploiting data lakes, but the scientific literature is not yet very rich.

In this book, we bring together several points of view on this emerging concept, offering a panorama of the field.

This book is a tribute to our departed friend and colleague Christine Collet.

Anne LAURENT

Dominique LAURENT

Cédrine MADERA

December 2019

1Introduction to Data Lakes: Definitions and Discussions

As stated by Power [POW 08, POW 14], a new component of information systems is emerging when considering data-driven decision support systems. This is because enhancing the value of data requires that information systems contain a new data-driven component, rather than an information-driven one. This new component is precisely what is called a data lake.

In this chapter, we first briefly review existing work on data lakes and then introduce a global architecture for information systems in which data lakes appear as a new additional component, when compared to existing systems.

1.1. Introduction to data lakes

The interest in the emerging concept of data lake is increasing, as shown in Figure 1.1, which depicts the number of times the expression “data lake” has been searched for during the last five years on Google. One of the earliest research works on the topic of data lakes was published in 2015 by Fang [FAN 15].

The term data lake was first introduced in 2010 by James Dixon, then CTO of Pentaho, in a blog post [DIX 10]. In this seminal work, Dixon envisioned data lakes as huge sets of raw data, structured or not, which users could access for sampling, mining or analytical purposes.

Figure 1.1. Queries about “data lake” on Google

In 2014, Gartner [GAR 14] considered the concept of a data lake to be nothing but a new way of storing data at low cost. However, a few years later, this claim was revised, based on the fact that data lakes had proved valuable in many companies [MAR 16a]. Consequently, Gartner now considers the concept of a data lake as something of a grail in information management, when it comes to innovating through the value of data.

In the following, we review the industrial and academic literature on data lakes, aiming to better understand the emergence of this concept. Note that this review should not be considered an exhaustive state of the art on the topic, given the recent surge in published papers about data lakes.

1.2. Literature review and discussion

In [FAN 15], which is considered one of the earliest academic papers about data lakes, the author lists the following characteristics:

– storing data, in their native form, at low cost. Low cost is achieved because (1) data servers are cheap (typically based on standard x86 technology) and (2) no data transformation, cleaning or preparation is required (thus avoiding very costly steps);

– storing various types of data, such as blobs, data from relational DBMSs, semi-structured data or multimedia data;

– transforming the data only on exploitation. This makes it possible to reduce the cost of data modeling and integration, as done in standard data warehouse design. This feature is known as the schema-on-read approach;

– requiring specific analysis tools to use the data. This is required because data lakes store raw data;

– allowing for identifying or eliminating data;

– providing users with information on data provenance, such as the data source, the history of changes or data versioning.
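The provenance information mentioned in the last point can be pictured as a minimal record. The following Python sketch is purely illustrative; its field names are assumptions, not a format proposed in the literature reviewed here.

```python
from dataclasses import dataclass, field

# Hypothetical provenance record for one dataset in a lake; the field
# names are invented for illustration, not taken from any standard.
@dataclass
class Provenance:
    source: str                 # data source (e.g. an export file or feed)
    ingested_at: str            # when the data entered the lake
    version: int = 1            # data versioning
    history: list = field(default_factory=list)  # history of changes

    def record_change(self, note: str) -> None:
        # Keep the old version number with the note, then bump it.
        self.history.append((self.version, note))
        self.version += 1

prov = Provenance(source="crm_export.csv", ingested_at="2020-01-15")
prov.record_change("normalized column names on load")
```

Each change note is paired with the version it modified, so the history alone is enough to reconstruct how the dataset evolved.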

According to Fang [FAN 15], no particular architecture characterizes data lakes, and creating a data lake is closely related to the deployment of an Apache Hadoop environment. Moreover, in the same work, the author anticipates the decline of decision-making systems in favor of data lakes hosted in a cloud environment.

As emphasized in [MAD 17], considering data lakes as outlined in [FAN 15] leads to the following four limitations:

1) only Apache Hadoop technology is considered;

2) criteria for preventing the movement of the data are not taken into account;

3) data governance is decoupled from data lakes;

4) data lakes are seen as data warehouse “killers”.

In 2016, Bill Inmon published a book on data lake architecture [INM 16] in which the issue of storing useless or unusable data is addressed. More precisely, in this book, Inmon advocates that the data lake architecture should evolve towards information systems, so as to store not only raw data but also “prepared” data, obtained through a process such as ETL (Extract-Transform-Load), which is widely used in data warehouses. We also stress that, in this book, the use of metadata and the specific profile of data lake users (namely that of data scientists) are emphasized. It is proposed that the data be organized according to three types, namely analog data, application data and textual data. However, the issue of how to store the data is not addressed.

In [RUS 17], Russom first mentioned the limitations of considering Apache Hadoop technology as the only possible environment for data lakes, which explains why Russom’s proposal relies on a hybrid technology, i.e. not only Apache Hadoop but also relational database technology. Therefore, a few years after Fang’s proposal [FAN 15], data lakes, like data warehouses, are becoming multi-platform and hybrid software components.

The work in [SUR 16] considers the problems of data lineage and traceability before their transformation in the data lake. The authors propose a baseline architecture that can take these features into account in the context of huge volumes of data, and they assess their proposal through a prototype, based on Apache Hadoop tools, such as Hadoop HDFS, Spark and Storm. This architecture is shown in Figure 1.3, from which it can be seen that elements of the IBM architecture (as introduced in [IBM 14] and shown in Figure 1.2) are present.

In [ALR 15], the authors introduced what they call a personal data lake, as a means to query and analyze personal data. To this end, the considered option is to store the data in a single place so as to optimize data management and security. This work thus addresses the problem of data confidentiality, a crucial issue with regard to the General Data Protection Regulation.

Figure 1.2. Baseline architecture of a data lake as proposed by IBM [IBM 14]. For a color version of this figure, see www.iste.co.uk/laurent/data.zip

In [MIL 16], the authors referred to the three Vs cited by Gartner [GAR 11] (Volume, Variety, Velocity), considered the additional V (Veracity) introduced by IBM and proposed three more Vs, namely Variability, Value and Visibility. In this context, the authors of [MIL 16] stated that the data lake should be part of IT systems, and then studied the three standard modes of data acquisition, namely batch or pseudo real time, real time (or streaming), and hybrid. However, the same authors did not study the impact of these different modes on the data lake architecture. In this work, a data lake is seen as a data pool, gathering historical data along with new data produced by some pseudo real-time processes, in a single place and without a specific schema, as long as the data are not queried. A catalog containing data lineage is thus necessary in this context.

The most successful work on data lake architecture, components and positioning is presented in [IBM 14], because the emphasis is on data governance and, more specifically, on the metadata catalog. In [IBM 14], the authors highlighted, in this respect, that the metadata catalog is a major component of data lakes that prevents them from turning into data “swamps”. This explains why metadata and their catalog currently motivate important research efforts, some of which are mentioned below:

– in [NOG 18a], the authors presented a data vault approach (a data modeling approach for storing historical data coming from different sources) for storing data lake metadata;

– in [TER 15], the importance of metadata as a key challenge is emphasized. It is then proposed that semantic information obtained from domain ontologies and vocabularies be part of metadata, in addition to traditional data structure descriptions;

– in [HAI 16], the authors proposed an approach for handling metadata called Constance. This approach focuses on discovering and summarizing structural metadata and their annotation using semantic information;

– in [ANS 18], the author introduced a semantic profiling approach to data lakes to prevent them from being transformed into “data swamps”. To this end, it is shown that the semantic web provides improvements to data usability and the detection of integrated data in a data lake.

Regarding data storage, it is argued in [IBM 14] that the exclusive use of Apache Hadoop is giving way to hybrid approaches, both for data storage (in particular using relational or NoSQL techniques in addition to Apache Hadoop) and for platforms (considering different servers, either local or in the cloud). As mentioned earlier, these changes were first noted in [RUS 17].

An attempt to unify these different approaches to data lakes can be found in [MAD 17] as the following definition:

A data lake is a collection of data such that:

– the data have no fixed schema;

– all data formats should be possible;

– the data have not been transformed;

– the data are conceptually present in one single place, but can be physically distributed;

– the data are used by one or multiple experts in data science;

– the data must be associated with a metadata catalog;

– the data must be associated with rules and methods for their governance.

Based on this definition, the main objective of a data lake is to allow full exploitation of its content, so as to provide value to the company’s data. In this context, the data lake is a data-driven system that is part of the decision-making system.

1.3. The data lake challenges

Initially regarded as low-cost storage environments [GAR 14], data lakes are now considered by companies as strategic tools due to their potential ability to give data a high value [MAR 16b].

As shown in Figure 1.4, data lake sales are increasing rapidly. Sales are expected to reach $8.81 billion in 2021, an increase of 32.1%.

Figure 1.3. Baseline architecture of a data lake of [SUR 16]

Figure 1.4. Data lake software prices

According to the report [MAR 16b] on the analysis of worldwide data lake sales, the main obstacle to increasing these sales is a lack of information about novel techniques for storing and analyzing data and about long-term data governance. This report also identifies, in this respect, the problems of data security and data confidentiality, as well as a lack of experts able to understand the new challenges related to the increasing importance of data lakes.

Conversely, the increasing need of companies to fully benefit from their data explains why more and more data lakes are used. In light of the report [MAR 16b], data lakes can be seen as an important and challenging issue for companies, regardless of their size. Figure 1.5 shows an overview of these phenomena.

Figure 1.5. Data lake sales: positive and negative aspects and challenges. For a color version of this figure, see www.iste.co.uk/laurent/data.zip

We now study how to position data lakes in information systems, based on Le Moigne’s approach [LEM 84].

Summarizing the works in [FAN 15, IBM 14, HAI 16, RUS 17, NOG 18a], the following points should be considered:

– the importance of metadata management;

– the importance of handling data security and confidentiality;

– the importance of handling the data lifecycle;

– the importance of data lineage and data processing.

Based on these points, it turns out that data lakes are associated with projects on data governance rather than those on decision-making. In other words, this means that data lakes are data driven, while decision-making systems are information driven.

As a result, data lakes should be positioned in the information system, next to the decision-making system, and thus a data lake appears as a new component of the information system. This is shown in Figure 1.7, which is an expansion of Figure 1.6, by considering external data in order to better take into account that massive data are handled in a data lake environment.

Figure 1.6. Data lake position in the information system. For a color version of this figure, see www.iste.co.uk/laurent/data.zip

1.4. Data lakes versus decision-making systems

While we previously discussed the positioning of the data lake inside the information system, we now emphasize the differences between data lakes and decision-making systems.

Data lakes are often compared to data warehouses, because both concepts allow for storing huge volumes of data in order to transform them into information. However, data lakes are expected to be more flexible than data warehouses, because data lakes do not require that all data be integrated according to a single schema. Therefore, any data can be inserted into a data lake, regardless of their nature and origin, which implies that one of the major challenges is to allow for any processing dealing with these data.

Figure 1.7. Data lake interaction in information systems. For a color version of this figure, see www.iste.co.uk/laurent/data.zip

This way of integrating data differs from data integration in data warehouses in the sense that data lakes are said to be schema on read, whereas data warehouses are said to be schema on write. This terminology illustrates that:

– in a data warehouse, the schema is fixed before integration because the expected information is known in advance. This explains why, in the case of a data warehouse, the different steps of data integration are known as ETL (Extract-Transform-Load);

– in a data lake, the information to be extracted from the integrated data is unknown in advance; it is up to users to express their needs and thus to define the data schema they require. This explains why the data integration steps for a data lake are known as ELT (Extract-Load-Transform).
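This schema-on-read/schema-on-write contrast can be sketched in a few lines of Python. The records, field names and mapping below are invented for illustration: raw records are loaded into the lake untouched (Extract-Load), and a schema chosen by the user is only imposed when the data are read (Transform).

```python
import json

# Extract-Load: heterogeneous raw records land in the lake as-is;
# no common schema is imposed at write time.
lake = [
    '{"customer": "Ada", "amount": "42.5"}',
    '{"client_name": "Bob", "total": 17}',
]

# Transform on read: the user decides, at query time, which schema a
# given analysis needs and maps each raw record onto it on demand.
def read_with_schema(raw: str) -> dict:
    rec = json.loads(raw)
    return {
        "name": rec.get("customer") or rec.get("client_name"),
        "amount": float(rec.get("amount") or rec.get("total")),
    }

rows = [read_with_schema(r) for r in lake]
```

In a warehouse, the equivalent mapping would have been applied once, before loading, and the divergent source formats would never reach storage; here they do, and every analysis is free to impose a different schema.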

It is important to note that to ensure data veracity when integrating several sources, data cleaning is a crucial step. In a data warehouse environment, this step is achieved during the transformation process, i.e. before loading, which is not possible in a data lake environment, since the data are transformed “on demand” after loading. However, in practice, in a data lake environment, data cleaning may also be achieved on loading through normalization and extraction of the metadata. This explains the importance of data governance, and more specifically of the maintenance of a metadata catalog, when it comes to guaranteeing data lineage and veracity.

In conclusion, the main differences between data warehouses and data lakes are summarized in Table 1.1.

Table 1.1.Data warehouses versus data lakes

                          Data lakes                            Data warehouses
Data storage              HDFS, NoSQL, relational database      Relational database
Data qualification        No                                    Yes
Data value                High                                  High
Data granularity          Raw                                   Aggregated
Data preparation          On the fly                            Before integration
Data integration          No treatment                          Quality control, filtering
Data transformation       No transformation (ELT)               Transformation (ETL)
Schema                    On read                               On write
Information architecture  Horizontal                            Vertical
Model                     On the fly                            Star, snowflake
Metadata                  Yes                                   Optional
Conception                Data driven                           Information driven
Data analysis method      Unique                                Repetitive
Users                     Computer/data scientists, developers  Decision-makers
Update frequency          Real time/batch                       Batch
Architecture              Centralized, federated or hybrid      Centralized

1.5. Urbanization for data lakes

According to Servigne [SER 10], the urbanization of an information system should provide users and information providers with a common view of the information system. Moreover, urbanization should ensure that the information system supports the goals and the transformation of the company, while reducing expenses and easing the implementation of evolutions and strategy changes.

The expected outcomes of urbanization for the information system are more precisely stated as follows:

– make the system responsive to the global business project;

– align the information system with strategic targets;

– ensure consistency between data and processes;

– easily take into account technological innovations;

– provide full value to the data and the knowledge in the information system;

– reduce the costs of maintenance and exploitation;

– improve the functional quality;

– improve the quality of service;

– make data more reliable;

– make the information system flexible.

Referring to “urbanization”, as used in architecture and geography, the goal of “urbanizing” the information system is to structure the system so as to improve its performance and upgradability. As shown in Figure 1.8, the process of “urbanizing” the information system is achieved using the following four-layer model:

1) the business architecture;

2) the functional architecture;

3) the application architecture;

4) the technical architecture.

Figure 1.8. Urbanization of the information system

In the following, we detail these four architectures in the specific case of data lake urbanization.

The business architecture of a data lake addresses the issue of knowledge capitalization and value, with the goal of digital transformation. Regarding the functional architecture, the data lake is meant to collect all data in a single place (with regard to capitalization), at a conceptual level, so as to allow various software tools to exploit them (thus ensuring data value). The functional schema of a data lake can be considered, as shown in Figure 1.9, to guarantee the following needs:

– accessibility to all data sources;

– centralization of all data sources;

– provision of a catalog of available data.

The applicative architecture is a computational view of the functional architecture described above. An example of an applicative architecture is shown in Figure 1.10, and the main components of such an architecture are listed as follows:

– data storage: relational DBMSs, NoSQL systems and file systems such as HDFS;

– data manipulation: architecture frameworks such as MapReduce or Apache Spark;

– metadata management: software such as Informatica or IBM Metadata Catalogue;

– suites for data lakes, based on HDFS such as Cloudera or Hortonworks;

– machine learning software such as Apache Spark and IBM Machine Learning.

Figure 1.9. Functional architecture of a data lake

The technological architecture provides the description of the hardware components that support those from the applicative architecture. These components are typically:

– servers;

– workstations;

– storage devices (storage units, SAN, filers, etc.);

– backup systems;

– networking equipment (routers, firewalls, switches, load balancers, SSL accelerators).

Figure 1.10. Applicative architecture of a data lake (Hortonworks™). For a color version of this figure, see www.iste.co.uk/laurent/data.zip

The choice of these components is driven by specific needs for data lakes regarding the characteristics of the data to be handled. These needs are mainly:

– security;

– availability;

– scalability;

– auditability;

– performance;

– volume;

– integrity;

– robustness;

– maintenance;

– reliability.

We note in this respect that the influence of these needs on the performance of the data lake has not yet been investigated, owing to the lack of maturity of most current data lake projects; the impact of these needs on the technological architecture of data lakes is expected to grow in the near future. As an example of such impact, in this book we address the issue of data gravity by investigating how technological constraints on data storage may affect components of the applicative architecture.

1.6. Data lake functionalities

From our point of view, data lake functionalities should address the following issues: data acquisition, metadata catalog, data storage, data exploration, data governance, data lifecycle and data quality. In this section, we briefly review each of these issues.

The functionality of data acquisition deals with data integration from various data sources (internal or external, structured or not). This is much like the ODS (Operational Data Store) component of an industrial decision-making system. As mentioned earlier, this functionality must allow for the integration of any data stream, but another possibility is to merge it with that of the associated decision-making system. This option is shown in Figure 1.11.
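The acquisition functionality just described can be sketched in code. The following is a minimal, illustrative sketch only, not a design from this chapter: files from any source are copied into a "raw zone" unchanged (schema-on-read), and an append-only catalog records their origin. All names (`ingest`, `raw_zone`, the catalog layout) are our own assumptions.

```python
import json
import shutil
import time
from pathlib import Path

def ingest(source_file: Path, raw_zone: Path, catalog: Path) -> Path:
    """Copy a source file into the raw zone as-is and log it in the catalog."""
    raw_zone.mkdir(parents=True, exist_ok=True)
    target = raw_zone / source_file.name
    shutil.copy2(source_file, target)  # no transformation: schema-on-read
    entry = {
        "name": source_file.name,
        "source": str(source_file),
        "ingested_at": time.time(),
    }
    with catalog.open("a") as f:  # append-only catalog of ingested data
        f.write(json.dumps(entry) + "\n")
    return target
```

The key design point, in contrast with a data warehouse ETL pipeline, is that no schema is imposed at ingestion time; the catalog entry is what later prevents the lake from degrading into a "swamp".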

Regarding the metadata catalog, all approaches in [FAN 15, HAI 16, ANS 18, STE 14, IBM 14] put the emphasis on this functionality, because without it, the data lake would be like a data "swamp". As mentioned earlier, metadata are among the most important information in a data lake because they allow for data governance, data security, data quality and lifecycle management. Although no consensus exists regarding their format, it is agreed that metadata should provide answers to the following typical questions regarding any piece of data:

– Who created it? Who uses it? Whom does it belong to? Who maintains it?
– What is its business definition? What are its associated business rules? Which is its security level? Which standard terms define it within the databases?

– Where is it stored? Where is it from? Where is it used or shared? Which regulatory or legal standard does it comply with?

– Why is it stored? What is its usage and utility? What is its associated business lever?

– When has it been created or updated? When will it be erased?

– How has it been formatted?

– In how many databases or data sources can it be found?

Figure 1.11.Acquisition module shared between data lake and decision-support system. For a color version of this figure, see www.iste.co.uk/laurent/data.zip

Among the basic expectations regarding metadata, we mention the following:

– access to information: any user, including non-expert users, must be able to collect information on data structure, content or quality;

– data quality: assessing data quality must be eased, so that users can concentrate on exploiting the data rather than on searching for information about them;

– saving time: metadata save time by providing a full data profile;

– data security: security must be guaranteed, specifically in the context of the GDPR regulation regarding personal data;

– data exploitation: exploitation must be eased, in particular through the data lineage information contained in metadata;

– hidden data: companies have reservoirs of valuable hidden data that are not exploited. These data can be produced by operating systems or complex applications. Once registered in the data lake with their metadata, these hidden data can be used and maintained properly.

Now turning to the data storage functionality, we recall that according to [GAR 14] and [FAN 15], a data lake is reduced to data storage using Apache Hadoop technology; in [RUS 17], the expression "Hadoop Data Lake" is used to refer to these approaches.

As noted previously regarding Russom's work [RUS 17], alternative and complementary technologies can be used, such as relational or NoSQL DBMSs. When dealing with gravity, we also show that the technological and applicative architectures can have a significant impact on this functionality.

The aim of the data exploration functionality is to allow users to exploit the content of the data lake. This functionality differs significantly between data lakes and data warehouses: data lake users are typically fewer in number and more expert than data warehouse users, because data lake technology involves more sophisticated tools and more heterogeneous data than data warehouses do.

The data lake governance functionality relies on the management of the metadata catalog. However, we stress that security, lifecycle and quality of the data are also fundamental regarding data lake governance. The data security functionality deals with data protection, confidentiality and privacy. These are hot topics in the context of data lakes; they are taken into account by rules, processes or ad hoc technologies for data encryption or anonymization.

The data lifecycle functionality should make it possible to efficiently manage the data during their “life” in the data lake, starting when they are first stored, used and then ending when they become obsolete and thus are either archived or erased. Such a management is, of course, crucial in a data lake, given the huge volume of data to be managed. To be effective and efficient, this functionality relies on metadata that contain the required information.
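The lifecycle management just described can be driven directly by the metadata. The sketch below is illustrative only: it assumes each dataset's catalog record carries a last-use date, and applies a policy with invented thresholds (one year idle before archiving, three years before erasure) that are not taken from the chapter.

```python
from datetime import date, timedelta

def lifecycle_action(last_used: date, today: date,
                     archive_after: timedelta = timedelta(days=365),
                     erase_after: timedelta = timedelta(days=365 * 3)) -> str:
    """Decide, from metadata alone, what to do with a dataset."""
    idle = today - last_used
    if idle >= erase_after:    # obsolete: remove from the lake
        return "erase"
    if idle >= archive_after:  # rarely used: move to cheaper storage
        return "archive"
    return "keep"              # still in active use
```

Given the volumes involved, the decisive property is that this check reads only the catalog, never the data themselves.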

The last functionality to be mentioned here is the data quality functionality, which is considered challenging in a data lake environment. This is because any type of data can be accepted in a data lake, namely either primary data (which might have been transformed and/or cleaned) or raw data (neither transformed nor cleaned). Here again, metadata play a key role in informing the user about the lineage of these data.
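The role of lineage metadata in assessing quality can be sketched as follows. Under the assumption, made for illustration only, that the catalog stores one parent link per derived dataset, a user can trace any dataset back to its raw source and thus know whether it has been cleaned; dataset names here are hypothetical.

```python
# Parent links: each derived dataset points to the dataset it was built from;
# raw data have no parent (None).
lineage = {
    "sales_raw": None,
    "sales_clean": "sales_raw",
    "sales_report": "sales_clean",
}

def trace(dataset: str) -> list:
    """Return the chain of ancestors from a dataset back to its raw source."""
    chain = [dataset]
    while lineage.get(chain[-1]) is not None:
        chain.append(lineage[chain[-1]])
    return chain
```

A user querying `trace("sales_report")` immediately sees that the report derives from a cleaned intermediate rather than directly from raw data, which is exactly the kind of quality information the metadata must carry.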

1.7. Summary and concluding remarks

As stated previously [SER 10], the data lake should be considered a new component of the information system, with its own business, functional, applicative and technological architectures, just like any other component. Its content is composed of raw data (in their original format) that can be accessed by "smart" users through specific and innovative tools. A data lake must be agile and flexible, unlike data warehouses.

In order to avoid the transformation of a data lake into a data “swamp”, rules for data governance must be considered in terms of the management of data quality, data security, data lifecycle and of metadata. We emphasize again that the metadata catalog is the key component of the data lake architecture for ensuring the consistency of data sources as well as efficient data governance.

Chapter written by Anne LAURENT, Dominique LAURENT and Cédrine MADERA.

1. https://www.gartner.com/smarterwithgartner/the-key-to-establishing-a-data-driven-culture/

2. https://www.gartner.com/webinar/3745620

3