Big Data is a new field, with many technological challenges that must be understood in order to use it to its full potential. These challenges arise at all stages of working with Big Data, beginning with data generation and acquisition. The storage and management phase presents two critical challenges: infrastructure, for storage and transportation, and conceptual models. Finally, extracting meaning from Big Data requires complex analysis. Here the authors propose metaheuristics as a solution to these challenges: first, they are able to deal with large-size problems; second, they are flexible and therefore easily adaptable to different types of data and different contexts.
The use of metaheuristics to overcome some of these data mining challenges is introduced and justified in the first part of the book, alongside a specific protocol for the performance evaluation of algorithms. An introduction to metaheuristics follows. The second part of the book details a number of data mining tasks, including clustering, association rules, supervised classification and feature selection, before explaining how metaheuristics can be used to deal with them. This book is designed to be self-contained, so that readers can understand all of the concepts discussed within it, and to provide an overview of recent applications of metaheuristics to knowledge discovery problems in the context of Big Data.
Page count: 285
Publication year: 2016
Cover
Title
Copyright
Acknowledgments
Introduction
1 Optimization and Big Data
1.1. Context of Big Data
1.2. Knowledge discovery in Big Data
1.3. Performance analysis of data mining algorithms
1.4. Conclusion
2 Metaheuristics – A Short Introduction
2.1. Introduction
2.2. Common concepts of metaheuristics
2.3. Single solution-based/local search methods
2.4. Population-based metaheuristics
2.5. Multi-objective metaheuristics
2.6. Conclusion
3 Metaheuristics and Parallel Optimization
3.1. Parallelism
3.2. Parallel metaheuristics
3.3. Infrastructure and technologies for parallel metaheuristics
3.4. Quality measures
3.5. Conclusion
4 Metaheuristics and Clustering
4.1. Task description
4.2. Big Data and clustering
4.3. Optimization model
4.4. Overview of methods
4.5. Validation
4.6. Conclusion
5 Metaheuristics and Association Rules
5.1. Task description and classical approaches
5.2. Optimization model
5.3. Overview of metaheuristics for the association rules mining problem
5.4. General table
5.5. Conclusion
6 Metaheuristics and (Supervised) Classification
6.1. Task description and standard approaches
6.2. Optimization model
6.3. Metaheuristics to build standard classifiers
6.4. Metaheuristics for classification rules
6.5. Conclusion
7 On the Use of Metaheuristics for Feature Selection in Classification
7.1. Task description
7.2. Optimization model
7.3. Overview of methods
7.4. Conclusion
8 Frameworks
8.1. Frameworks for designing metaheuristics
8.2. Framework for data mining
8.3. Framework for data mining with metaheuristics
8.4. Conclusion
Conclusion
Bibliography
Index
End User License Agreement
Introduction
Figure I.1. Main phases of a Big Data process
1 Optimization and Big Data
Figure 1.1. Evolution of Google requests for “Big Data” (Google source)
Figure 1.2. Overview of the KDD process
Figure 1.3. Overview of main tasks and approaches in data mining
Figure 1.4. Statistical test summary [JAC 13b]
2 Metaheuristics – A Short Introduction
Figure 2.1. Solving a problem from the class
Figure 2.2. Neighborhood operator for the TSP
Figure 2.3. Objective space and specific points of a bi-objective problem
3 Metaheuristics and Parallel Optimization
Figure 3.1. Parallel multi-start model: several single solution-based metaheuristics are launched in parallel
Figure 3.2. Move acceleration model: the solution is evaluated in parallel
Figure 3.3. Sub-linear, linear and super-linear speedup
4 Metaheuristics and Clustering
Figure 4.1. An example of dendrogram
Figure 4.2. Optimizing both objectives simultaneously [GAR 12]
Figure 4.3. Multi-objective clustering a Pareto set of solutions [GAR 12]
Figure 4.4. Binary encoding with a fixed number of clusters from [JOS 16]
Figure 4.5. Binary encoding for representative from [JOS 16]
Figure 4.6. Integer encoding: label-based representation from [JOS 16]
Figure 4.7. Integer encoding: graph-based representation from [JOS 16]
6 Metaheuristics and (Supervised) Classification
Figure 6.1. Classification task
Figure 6.2. K-nearest neighbor method
Figure 6.3. Example of a decision tree to predict the flu
Figure 6.4. A three-layer artificial neural network
Figure 6.5. Linear support vector machine
Figure 6.6. Performance evaluation methodology in supervised classification
Figure 6.7. Cross validation (example of a 10-fold)
Figure 6.8. Receiver operating characteristic (ROC) curve
Figure 6.9. Venn diagram illustrating the distribution of observations [IGL 06]
7 On the Use of Metaheuristics for Feature Selection in Classification
Figure 7.1. Filter model for feature selection: learned on the training set and tested on the test dataset
Figure 7.2. Wrapper model for feature selection
Figure 7.3. Some representations for metaheuristics in feature selection for the selection of attributes 1, 3, 7, 9. a) binary representation; b) fixed-length representation; c) variable-length representation
8 Frameworks
Figure 8.1. Clustering and tree exploration with Orange
Figure 8.2. Tree exploration with Rattle GUI
Figure 8.3. Tree exploration with RapidMiner
Figure 8.4. Decision tree with WEKA
Figure 8.5. LIONoso
Metaheuristics Set
coordinated by
Nicolas Monmarché and Patrick Siarry
Volume 5
Clarisse Dhaenens
Laetitia Jourdan
First published 2016 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd, 27-37 St George’s Road, London SW19 4EU, UK
www.iste.co.uk
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
www.wiley.com
© ISTE Ltd 2016
The rights of Clarisse Dhaenens and Laetitia Jourdan to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Control Number: 2016944993
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-84821-806-2
This book is an overview of metaheuristics for Big Data. It is based on a large literature review conducted by the authors at the CRIStAL laboratory (Research Center in Computer Science, Signal and Automatics), University of Lille and CNRS, France, and at the Lille Nord Europe research center of INRIA (the French National Institute for Computer Science and Applied Mathematics) between 2000 and the present. We are grateful to our former and current PhD students and colleagues for all the work they have done with us that has led to this book.
We are particularly grateful to Aymeric Blot, Fanny Dufossé, Lucien Mousin and Maxence Vandromme, who read and corrected the first versions of this book. A special word of gratitude goes to Marie-Éléonore Marmion, who carefully read and commented on several chapters.
We would like to thank Nicolas Monmarché and Patrick Siarry for their proposal to write this book and for their patience! Sorry for the time we took.
Finally, we would like to thank our families for their support and love.
Clarisse DHAENENS and Laetitia JOURDAN
Big Data: a buzzword or a real challenge?
Both answers are suitable. On the one hand, the term Big Data has not yet been well defined, although several attempts have been made to define it. Indeed, the term Big Data does not have the same meaning depending on who uses it. It could be seen as a buzzword: everyone talks about Big Data, but no one really manipulates it.
On the other hand, the characteristics of Big Data, often reduced to the three “Vs” – volume, variety and velocity – introduce plenty of new technological challenges at different phases of the Big Data process. These phases are presented in a very simple way in Figure I.1.
Starting from the generation of data, followed by its storage and management, analyses can then be made to help decision-making. This process may be repeated if additional information is required. At each phase, important challenges arise.
Indeed, during the generation and capture of data, some challenges are technological, linked for example to the acquisition of real-time data. However, at this phase, challenges are also related to the identification of meaningful data.
The storage and management phase leads to two critical challenges: first, the infrastructures for the storage of data and its transportation; second, conceptual models to provide well-formed available data that may be used for analysis.
Figure I.1. Main phases of a Big Data process
Then, the analysis phase has its own challenges, involving the manipulation of heterogeneous massive data. In particular, when considering knowledge extraction, in which unknown patterns have to be discovered, the analysis may be very complex due to the nature of the data manipulated. This is at the heart of data mining. A way to address data mining problems is to model them as optimization problems. In the context of Big Data, most of these problems are large-scale ones; hence metaheuristics seem to be good candidates to tackle them. However, as we will see in the following, metaheuristics are suitable not only for addressing the large size of the problem, but also for dealing with other aspects of Big Data, such as variety and velocity.
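As a simple illustration (one classical model among many; the notation here is ours, not a formula from a specific chapter), the clustering task of Chapter 4 can be cast as an optimization problem: find the partition of the observations into clusters C_1, …, C_k that minimizes the within-cluster sum of squares,

\[ \min_{C_1,\dots,C_k} \ \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2, \qquad \mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x, \]

where \( \mu_j \) denotes the centroid of cluster \( C_j \); a metaheuristic then searches the space of partitions for a low-cost solution.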
The aim of this book is to present how metaheuristics can provide answers to some of the challenges induced by the Big Data context and particularly within the data analytics phase.
This book is composed of three parts. The first part is an introductory one consisting of three chapters; its aim is to provide the reader with the elements needed to understand the rest of the book.
Chapter 1, Optimization and Big Data, provides elements to understand the main issues raised by the Big Data context. It presents what characterizes Big Data and focuses on the analysis phase and, more precisely, on the data mining task. This chapter indicates how data mining problems may be seen as combinatorial optimization problems and justifies the use of metaheuristics to address some of them. A section is also dedicated to the performance evaluation of algorithms, as a specific protocol has to be followed in data mining.
Chapter 2 presents an introduction to metaheuristics, to make this book self-contained. First, common concepts of metaheuristics are presented; then, the most widely known metaheuristics are described, with a distinction between single solution-based and population-based methods. A section is also dedicated to multi-objective metaheuristics, as many of these have been proposed to deal with data mining problems.
Chapter 3 provides indications on parallel optimization and the way metaheuristics may be parallelized to tackle very large problems. As will be shown, parallelization is used not only to deal with large problems, but also to provide better quality solutions.
The second part, composed of the following four chapters, is the heart of the book. Each of these chapters details a data mining task and indicates how metaheuristics can be used to deal with it.
Chapter 4 begins the second part of the book and is dedicated to clustering. This chapter first presents the clustering task, which aims to group similar objects, and some of the classical approaches to solve it. Then, the chapter provides indications on modeling the clustering task as an optimization problem and focuses on the quality measures that are commonly used, on the interest of a multi-objective resolution approach and on the representation of a solution in metaheuristics. An overview of multi-objective methods is then proposed. The chapter ends with a specific and difficult point of the clustering task: how to estimate the quality of a clustering solution and how to validate it.
Chapter 5 deals with association rules. It first describes the corresponding data mining task and the classical approach: the Apriori algorithm. Then, the chapter indicates how this task may be modeled as an optimization task and focuses on metaheuristics proposed to deal with it. It differentiates the metaheuristics according to the type of rules considered: categorical, quantitative or fuzzy association rules. A general table summarizes the most important works in the literature.
Chapter 6 is dedicated to supervised classification. This task is of great importance as it allows the class of a new observation to be predicted using information from observations whose classes are known. The chapter first gives a description of the classification task and briefly presents standard classification methods. Then, an optimization perspective on some of these standard methods is presented, as well as the use of metaheuristics to optimize some of them. The last part of the chapter is dedicated to the use of metaheuristics for the search for classification rules, viewed as a special case of association rules.
Chapter 7 deals with feature selection for classification, which aims to reduce the number of attributes and to improve classification performance. The chapter uses several notions presented in Chapter 6 on classification. After presenting generalities on feature selection, the chapter shows how it can be modeled as an optimization problem. Different representations of solutions and their associated search mechanisms are then presented. An overview of metaheuristics for feature selection is finally proposed.
Finally, the last part is composed of a single chapter (Chapter 8), which presents frameworks dedicated to data mining and/or metaheuristics. A short comparative survey is provided for each kind of framework.
Browsing the different chapters, the reader will gain an overview of the way metaheuristics have been applied so far to tackle problems present in the Big Data context, with a focus on the data mining part, which provides the optimization community with many challenging application opportunities.
The term Big Data refers to vast amounts of information that come from different sources. Hence Big Data refers not only to a huge data volume, but also to a diversity of data types, delivered at various speeds and frequencies. This chapter attempts to provide definitions of Big Data, presents the main challenges induced by this context and focuses on Big Data analytics.
As depicted in Figure 1.1, the number of Google searches for the term “Big Data” has grown dramatically since 2011.
Figure 1.1. Evolution of Google requests for “Big Data” (Google source)
How can we explain the increasing interest in this subject? Some answers emerge when we realize that 2.5 quintillion bytes of data are generated every day – so much that 90% of the data in the world today have been created in the last two years. These data come from everywhere, depending on the industry and organization: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records and cellphone GPS signals, to name but a few [IBM 16b]. Such data are recorded, stored and analyzed.
Big Data appears in many situations where large amounts of complex data are generated, each presenting its own challenges. We may cite some examples of such situations:
– Social networks: the quantity of data generated in social networks is huge. Indeed, monthly estimations indicate that 12 billion tweets are sent by about 200 million active users, 4 billion hours of video are watched on YouTube and 30 billion pieces of content are shared on Facebook [IBM 16a]. Moreover, such data come in different formats/types.
– Traffic management: in the context of the creation of smart cities, traffic within cities is an important issue. Managing it has become feasible, as the widespread adoption in recent years of technologies such as smartphones, smartcards and various sensors has made it possible to collect, store and visualize information on urban activities such as people and traffic flows. However, this also represents a huge amount of collected data that needs to be managed.
– Healthcare: in 2011, the global size of data in healthcare was estimated at 150 exabytes. Such data are unique and difficult to deal with because: 1) data are in multiple places (different source systems in different formats, including text as well as images); 2) data are both structured and unstructured; 3) data may be inconsistent (they may have different definitions according to the person in charge of entering them); 4) data are complex (it is difficult to identify standard processes); 5) data are subject to regulatory requirement changes [LES 16].
– Genomic studies: with the rapid progress of DNA sequencing techniques, which now allow us to identify more than 1 million SNPs (genetic variations), large-scale genome-wide association studies (GWAS) have become practical. The aim is to track genetic variations that may, for example, explain genetic susceptibility to a disease. In their analysis of the new challenges induced by these massive data, Moore et al. first indicate the necessity of developing new biostatistical methods for quality control, imputation and analysis [MOO 10]. They also point to the challenge of recognizing the complexity of the genotype–phenotype relationship, which is characterized by significant heterogeneity.
In all these contexts, Big Data has now become a widely used term. It therefore needs to be defined clearly; some definitions are presented below.
Many definitions of the term Big Data have been proposed; Ward and Barker survey these definitions [WAR 13]. As a common aspect, all of them indicate that size is not the only characteristic.
A historical definition was given by Laney of META Group in 2001 [LAN 01]. Even though he did not mention the term “Big Data”, he identified, mostly in the context of e-commerce, new data management challenges along three dimensions – the three “Vs”: volume, velocity and variety:
– Data volume: as illustrated earlier, the amount of data created and collected is huge and the growth of information size is exponential. It is estimated that 40 zettabytes (40 trillion gigabytes) of data will have been created by 2020.
– Data velocity: data collected from connected devices, websites and sensors require specific data management, not only because of real-time analytics needs (analysis of streaming data) but also to deal with data arriving at different speeds.
– Data variety: data come from several types of sources. Dealing simultaneously with such different data is also a difficult challenge.
This initial definition has since been extended. First, a fourth “V” has been proposed: veracity. Indeed, another important challenge is the uncertainty of data: around 1 in 3 business leaders do not trust the information they use to make decisions [IBM 16a]. In addition, a fifth “V” may also be associated with Big Data: value, in the sense that the main interest of dealing with data is to produce additional value from the information collected [NUN 14].
More recently, following the line of “V” definitions, Laney and colleagues from Gartner [BEY 12] proposed the following definition:
“Big data” is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making.
This definition has been reinforced and completed by the work of De Mauro et al., who analyzed a recent corpus of industry and academic articles [DEM 16]. They found that the main themes of Big Data are information, technology, methods and impact, and they propose a new definition:
Big Data is the Information asset characterized by such a high-volume, -velocity and -variety to require specific technology and analytical methods for its transformation into value.
Even if these 3V, 4V or 5V definitions are the most widely used to explain the context of Big Data to a general public, other definitions have been attempted. Their common point is to reduce the importance of the size characteristic in favor of complexity.
For example, the definition proposed by MIKE2.0 indicates that elements of Big Data include [MIK 15]:
– the degree of complexity within the dataset;
– the amount of value that can be derived from innovative versus noninnovative analysis techniques;
– the use of longitudinal information to supplement the analysis.
They indicate that “big” refers to big complexity rather than big volume. Of course, valuable and complex datasets of this sort naturally tend to grow rapidly, and so Big Data quickly becomes truly massive. However, Big Data can be very small, and not all large datasets are big. As an example, they consider that the data streaming from a hundred thousand sensors on an aircraft is Big Data, even though the size of the dataset is not as large as might be expected: a hundred thousand sensors, each producing an eight-byte reading every second, would produce less than 3 GB of data in an hour of flying (100,000 sensors × 60 minutes × 60 seconds × 8 bytes ≈ 2.88 GB).
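This back-of-the-envelope calculation can be checked directly; the short snippet below (illustrative only) reproduces the arithmetic:

```python
# Check of the aircraft-sensor example: 100,000 sensors, one
# eight-byte reading per sensor per second, over one hour of flight.
sensors = 100_000
reading_bytes = 8
seconds_per_hour = 60 * 60

total_bytes = sensors * reading_bytes * seconds_per_hour
print(f"{total_bytes / 1e9:.2f} GB per flying hour")  # 2.88 GB, i.e. less than 3 GB
```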
Big Data is thus not a single technology; rather, it is a combination of data management technologies that have evolved over time. Big Data enables organizations to store, manage and manipulate vast amounts of data at the right speed and at the right time to get the right insights.
Hence the different steps of the value chain of Big Data may be organized in three stages:
1) data generation and acquisition;
2) data storage and management;
3) data analysis.
Each of these stages raises challenges of the highest importance. Many books and articles are dedicated to this subject (see, for example, [CHE 14, HU 14, JAG 14]).
The generation of data is not a problem anymore, given the huge number of sources that can generate data: all kinds of sensors, customer purchases, astronomical observations, text messages and so on. One of the challenges may be to identify, a priori, which data are worth generating. What should be measured? This is directly linked to the analysis that needs to be carried out. Much of these data – for example, data generated by sensor networks, which are highly redundant – can be filtered and compressed by orders of magnitude without compromising our ability to reason about the underlying activity of interest. One challenge is to define these online filters in such a way that they do not discard useful information, since the raw data are often too voluminous to even allow the option of storing them all [JAG 14]. Conversely, the generated data may offer a rich context for further analyses (but may make those analyses very complex).
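As a minimal sketch of such an online filter (one simple possibility among many; the “deadband” rule and the threshold value below are illustrative assumptions, not a method from [JAG 14]), a reading is stored only when it deviates sufficiently from the last stored value:

```python
def deadband_filter(readings, threshold=0.5):
    """Keep a reading only if it differs from the last kept one
    by more than `threshold`; highly redundant values are dropped."""
    kept = []
    last = None
    for value in readings:
        if last is None or abs(value - last) > threshold:
            kept.append(value)
            last = value
    return kept

# A slowly drifting signal is compressed while abrupt changes are kept.
stream = [20.0, 20.1, 20.1, 20.2, 23.5, 23.6, 23.5, 19.9]
print(deadband_filter(stream))  # [20.0, 23.5, 19.9]
```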
Before data are stored, an information extraction process is required that extracts the relevant information from the underlying sources and expresses it in a structured form suitable for storage and analysis. Indeed, most data sources are notoriously unreliable: sensors can be faulty, humans may provide biased opinions, remote websites might be stale and so on. Understanding and modeling these sources of error is a first step toward developing data cleaning techniques. Unfortunately, much of this is data source and application dependent and is still a technical challenge [JAG 14].
Many companies use one or several relational database management systems to store their data. This allows them to know what data are stored and where. However, these systems are less adapted to a Big Data context, and one of the challenges linked to Big Data is the development of efficient technologies to store the available data.
These technologies must be able to deal with specificities of Big Data, such as scalability (limitations of the underlying physical infrastructure), variety of data (including unstructured data), velocity of data (taking into account non-synchronous acquisition), etc. Hence, non-relational database technologies, such as NoSQL, have been developed. These technologies do not rely on tables and may be more flexible.
Among these technologies, we may cite the following (the first two models are illustrated in a short sketch after this list):
– key-value pair databases, based on the key-value pair model, where most of the data are stored as strings;
– document databases, repositories for full document-style content, where the structure of the documents and their parts may be described using JavaScript Object Notation (JSON) and/or Binary JSON (BSON);
– columnar or column-oriented databases, where data are stored and accessed by column rather than by row (e.g. HBase from Apache). This offers great flexibility, performance and scalability in terms of volume and variety of data;
– graph databases, based on node relationships, which have been proposed to deal with highly interconnected data;
– spatial databases, which incorporate spatial data. Let us note that spatial data itself is standardized through the efforts of the Open Geospatial Consortium (OGC), which establishes OpenGIS (geographic information system) and a number of other standards for spatial data.
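To make the first two models above concrete, here is a small sketch (the field names and values are invented for illustration) of the same sensor record stored as an opaque key-value pair and as a self-describing JSON document:

```python
import json

# Key-value model: an opaque value indexed by a unique key.
key_value_store = {"sensor:42:2016-01-01T12:00:00": "23.5"}

# Document model: a self-describing document (rendered here as JSON)
# whose structure may vary from one document to the next.
document = {
    "sensor_id": 42,
    "timestamp": "2016-01-01T12:00:00",
    "reading": 23.5,
    "unit": "celsius",
    "tags": ["cabin", "temperature"],
}
print(json.dumps(document, indent=2))
```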
Big Data management includes data transportation [CHE 14]: transportation of data from data sources to data centers or transportation of data within data centers. For both types of transportation, technical challenges arise:
– efficiency of the physical network infrastructure in dealing with the growth of traffic demand (in most regions around the world, the physical network infrastructure consists of high-volume, high-rate and cost-effective optic fiber transmission systems, but other technologies are under study);
– security of transmission, to ensure the ownership of data as well as its provenance.
These technological challenges related to data acquisition, storage and management are crucial to obtain well-formed available data that may be used for analysis.
(Big) Data analysis aims at extracting knowledge from the data. Regarding the knowledge to be extracted, Maimon et al. identify three levels of analysis [MAI 07]:
– Reports: the simplest level deals with report generation. This may be achieved by descriptive statistics as well as by simple database queries.
– Multi-level analysis: this requires an advanced database organization (OLAP multi-level analysis).
– Complex analysis: this is used to discover unknown patterns. It specifically concerns data mining, as will be defined later, and requires efficient approaches. This book focuses on this level of analysis.
In contrast to traditional data, Big Data varies in terms of volume, variety, velocity, veracity and value. Thus, it becomes difficult to analyze Big Data with traditional analytics tools, which were not designed for it. Developing adequate Big Data analytics techniques may help discover more valuable information. Let us note that Big Data brings not only new challenges but also opportunities: interconnected Big Data with complex and heterogeneous contents bears new sources of knowledge and insights.
We can observe that while Big Data has become a highlighted buzzword over the last few years, Big Data mining, i.e. mining from Big Data, has almost immediately followed as an emerging, interrelated research area [CHE 13].
Typically, the aim of data mining is to uncover interesting patterns and relationships hidden in a large volume of raw data. Applying existing data mining algorithms and techniques to real-world problems has recently been running into many challenges. Current data mining techniques and algorithms are not ready to meet the new challenges of Big Data. Mining Big Data requires highly scalable strategies and algorithms, more efficient preprocessing steps such as data filtering and integration, advanced parallel computing environments, and intelligent and effective user interaction.
Hence the goals of Big Data mining techniques go beyond fetching the requested information or even uncovering hidden relationships and patterns between numerous parameters. Analyzing fast and massive stream data may lead to new valuable insights and theoretical concepts [CHE 13]. In particular, the need to design and implement very-large-scale parallel machine learning and data mining (ML-DM) algorithms has increased remarkably, in parallel with the emergence of powerful parallel and very-large-scale data processing platforms, e.g. Hadoop MapReduce [LAN 15].
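As a reminder of the programming model behind platforms such as Hadoop MapReduce, here is a toy, single-process sketch (the word-count example is the usual illustration, not code from [LAN 15]; a real deployment distributes the map and reduce phases over a cluster):

```python
from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs; here, word occurrences in a line of text.
    for word in record.split():
        yield word, 1

def reduce_phase(key, values):
    # Aggregate all values sharing the same key.
    return key, sum(values)

records = ["big data needs big methods", "metaheuristics scale to big data"]
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

counts = dict(reduce_phase(k, vs) for k, vs in groups.items())
print(counts["big"])  # 3
```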
In this book, we are mainly interested in this stage of the Big Data value chain, that is to say, in how Big Data can be analyzed and, in particular, in how metaheuristics may be used for this purpose. Hence the analysis stage is detailed in the following sections.
A common definition of metaheuristics is:
Techniques and methods used for solving various optimization problems, especially large-scale ones.
By this definition, metaheuristics seem to be good candidates to solve large-scale problems induced by the Big Data context.
However, metaheuristics are able to provide answers not only to the large-scale characteristic, but also to the other ones:
– Data volume: metaheuristics are mostly developed for large-scale problems. Moreover, their ability to be parallelized gives opportunities to deal with very large instances.
– Data velocity: in a context where data are regularly updated and/or the response must be produced in real time, metaheuristics are anytime methods that can rapidly propose a good solution, even if it is not optimal (see the sketch after this list).
– Data variety: working simultaneously with different types of data may be difficult for some standard methods, for example those coming from statistics. Metaheuristics offer encodings that can handle several types of data simultaneously, which gives the opportunity to jointly analyze data coming from different sources.
– Data veracity: working with uncertainty (or, more precisely, with unknown data) may also be difficult for classical methods. Metaheuristics can integrate stochastic approaches or partial analyses in order to extract information from such imprecise data.
– Data value: metaheuristics are optimization methods based on an objective function. Hence they enable us to evaluate the interest – the value – of the extracted knowledge. Using different objective functions gives the opportunity to express value in different ways according to the context.
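To illustrate the anytime behavior mentioned in the velocity item above, here is a minimal Python sketch (the toy objective and neighborhood are invented for illustration; this is not an algorithm from the book): a simple hill climber that stops when its time budget expires and always returns its best-so-far solution.

```python
import random
import time

def anytime_hill_climb(objective, solution, neighbor, budget_seconds):
    """Minimal anytime local search (minimization): it can be stopped
    whenever the time budget expires and always returns the best
    solution found so far, even if that solution is not optimal."""
    best, best_value = solution, objective(solution)
    deadline = time.monotonic() + budget_seconds
    while time.monotonic() < deadline:
        candidate = neighbor(best)
        value = objective(candidate)
        if value < best_value:  # keep only improving moves
            best, best_value = candidate, value
    return best, best_value

# Toy problem: minimize the squared distance to 10 over the integers.
objective = lambda x: (x - 10) ** 2
neighbor = lambda x: x + random.choice([-1, 1])
print(anytime_hill_climb(objective, 0, neighbor, budget_seconds=0.1))
```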
In the context of Big Data, metaheuristics have mostly been used within the data analysis step for solving data mining tasks. However, some of them have been proposed to solve other kinds of optimization problems that are related to the Big Data context.
For example, Stanimirovic and Miskovic study a problem of exploration of online social networks [STA 13]. The goal is to choose locations for installing control devices and to assign users to active control devices. They formulate the problem as several optimization problems, with several objective functions, and propose a metaheuristic (a pure evolutionary algorithm, EA) and two hybrid metaheuristics (an EA with a local search and an EA with a tabu search) to solve them (for more information about metaheuristics, see Chapter 2). To this end, they define all the necessary components (encodings, operators, etc.). They compare their methods on large-scale instances (up to 20,000 user nodes and 500 potential locations) in terms of the quality of the solution produced and the time required to obtain a good solution. The results obtained are very convincing.
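As a purely structural illustration of such hybrids (not a reproduction of the algorithms in [STA 13], whose encodings and operators are problem-specific), an evolutionary algorithm hybridized with a local search, often called a memetic algorithm, may be sketched as follows:

```python
import random

def hybrid_ea(objective, init, mutate, local_search,
              pop_size=20, generations=50):
    """Skeleton of an evolutionary algorithm hybridized with a local
    search: each offspring is locally improved before survivor
    selection (a (mu + lambda) scheme, minimization)."""
    population = [init() for _ in range(pop_size)]
    for _ in range(generations):
        offspring = [local_search(mutate(random.choice(population)))
                     for _ in range(pop_size)]
        # Keep the best individuals among parents and offspring.
        population = sorted(population + offspring, key=objective)[:pop_size]
    return min(population, key=objective)

# Toy usage: minimize (x - 10)^2 with a one-step greedy local search.
obj = lambda x: (x - 10) ** 2
one_step = lambda x: min((x - 1, x, x + 1), key=obj)
best = hybrid_ea(obj, init=lambda: random.randint(-50, 50),
                 mutate=lambda x: x + random.randint(-5, 5),
                 local_search=one_step)
print(best, obj(best))  # typically prints 10 0
```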
The relationship between metaheuristics and Big Data is strongly tied to the data analysis step, which consists of extracting knowledge from available data. Hence we will focus on this data mining aspect. This section first situates data mining in the whole context of knowledge discovery and then briefly presents the main data mining tasks. These tasks will be discussed in detail in the following chapters of the book, one chapter being dedicated to each of them; each chapter will present an optimization point of view of the data mining task concerned and show how metaheuristics have been used to deal with it.
