A timely overview of cutting-edge technologies for multimedia retrieval, with a special emphasis on scalability.

The amount of multimedia data available every day is enormous and is growing at an exponential rate, creating a great need for new and more efficient approaches to large-scale multimedia search. This book addresses that need, covering the area of multimedia retrieval and placing a special emphasis on scalability. It reports recent work in large-scale multimedia search, including research methods and applications, and is structured so that readers with basic knowledge can grasp the core message, while experts and specialists can drill further down into the analytical sections. Big Data Analytics for Large-Scale Multimedia Search covers: representation learning; concept- and event-based video search in large collections; big data multimedia mining; large-scale video understanding; big multimedia data fusion; large-scale social multimedia analysis; privacy and audiovisual content; data storage and management for big multimedia; large-scale multimedia search; multimedia tagging using deep learning; interactive interfaces for big multimedia; and medical decision support applications using large multimodal data.

* Addresses the area of multimedia retrieval and pays close attention to the issue of scalability
* Presents problem-driven techniques with solutions that are demonstrated through realistic case studies and user scenarios
* Includes tables, illustrations, and figures
* Offers a Wiley-hosted book companion site (BCS) that features links to open source algorithms, data sets, and tools

Big Data Analytics for Large-Scale Multimedia Search is an excellent book for academics, industrial researchers, and developers interested in big multimedia data search and retrieval. It will also appeal to consultants working on computer science problems and to professionals in the multimedia industry.
Page count: 736
Year of publication: 2019
Cover
Introduction
List of Contributors
About the Companion Website
Part I: Feature Extraction from Big Multimedia Data
1 Representation Learning on Large and Small Data
1.1 Introduction
1.2 Representative Deep CNNs
1.3 Transfer Representation Learning
1.4 Conclusions
References
2 Concept‐Based and Event‐Based Video Search in Large Video Collections
2.1 Introduction
2.2 Video Preprocessing and Machine Learning Essentials
2.3 Methodology for Concept Detection and Concept‐Based Video Search
2.4 Methods for Event Detection and Event‐Based Video Search
2.5 Conclusions
2.6 Acknowledgments
References
3 Big Data Multimedia Mining: Feature Extraction Facing Volume, Velocity, and Variety
3.1 Introduction
3.2 Scalability through Parallelization
3.3 Scalability through Feature Engineering
3.4 Deep Learning‐Based Feature Learning
3.5 Benchmark Studies
3.6 Closing Remarks
Acknowledgements
References
Part II: Learning Algorithms for Large-Scale Multimedia
4 Large‐Scale Video Understanding with Limited Training Labels
4.1 Introduction
4.2 Video Retrieval with Hashing
4.3 Graph‐Based Model for Video Understanding
4.4 Conclusions and Future Work
References
5 Multimodal Fusion of Big Multimedia Data
5.1 Multimodal Fusion in Multimedia Retrieval
5.2 Multimodal Fusion in Multimedia Classification
5.3 Conclusions
References
6 Large‐Scale Social Multimedia Analysis
6.1 Social Multimedia in Social Media Streams
6.2 Large‐Scale Analysis of Social Multimedia
6.3 Large‐Scale Multimedia Opinion Mining System
6.4 Conclusion
References
7 Privacy and Audiovisual Content: Protecting Users as Big Multimedia Data Grows Bigger
7.1 Introduction
7.2 Protecting User Privacy
7.3 Multimedia Privacy
7.4 Privacy‐Related Multimedia Analysis Research
7.5 The Larger Research Picture
7.6 Outlook on Multimedia Privacy Challenges
References
Part III: Scalability in Multimedia Access
8 Data Storage and Management for Big Multimedia
8.1 Introduction
8.2 Media Storage
8.3 Processing Media
8.4 Multimedia Delivery
8.5 Case Studies: Facebook
8.6 Conclusions and Future Work
References
9 Perceptual Hashing for Large‐Scale Multimedia Search
9.1 Introduction
9.2 Unsupervised Perceptual Hash Algorithms
9.3 Supervised Perceptual Hash Algorithms
9.4 Constructing Perceptual Hash Algorithms
9.5 Conclusion and Discussion
References
Part IV: Applications of Large-Scale Multimedia Search
10 Image Tagging with Deep Learning: Fine‐Grained Visual Analysis
10.1 Introduction
10.2 Basic Deep Learning Models
10.3 Deep Image Tagging for Fine‐Grained Image Recognition
10.4 Deep Image Tagging for Fine‐Grained Sentiment Analysis
10.5 Conclusion
References
11 Visually Exploring Millions of Images using Image Maps and Graphs
11.1 Introduction and Related Work
11.2 Algorithms for Image Sorting
11.3 Improving SOMs for Image Sorting
11.4 Quality Evaluation of Image Sorting Algorithms
11.5 2D Sorting Results
11.6 Demo System for Navigating 2D Image Maps
11.7 Graph‐Based Image Browsing
11.8 Conclusion and Future Work
References
12 Medical Decision Support Using Increasingly Large Multimodal Data Sets
12.1 Introduction
12.2 Methodology for Reviewing the Literature in this Chapter
12.3 Data, Ground Truth, and Scientific Challenges
12.4 Techniques Used for Multimodal Medical Decision Support
12.5 Application Types of Image‐Based Decision Support
12.6 Discussion on Multimodal Medical Decision Support
12.7 Outlook on the Next Steps of Multimodal Medical Decision Support
References
Conclusions and Future Trends
Index
End User License Agreement
Chapter 1
Table 1.1 Image classification performance on the ImageNet subset designated for...
Table 1.2 OM classification experimental results.
Table 1.3 Melanoma classification experimental results.
Chapter 2
Table 2.1 Performance (MXinfAP, %) for each of the stage classifiers used in the...
Table 2.2 Performance (MXinfAP, %) for different classifier combination approach...
Table 2.3 MXInfAP (%) for different STL and MTL methods, trained on the features...
Table 2.4 Performance (MXinfAP, %) for a typical single‐layer concept detection ...
Table 2.5 MXinfAP for different configurations of our concept detection approach...
Table 2.6 Training complexity: (a) the required number of classifier combination...
Table 2.7 Relative amount of classifier evaluations (%) for different classifier...
Table 2.8 MED pre-specified events.
Table 2.9 Learning from positive example results.
Table 2.10 Zero‐example learning results.
Chapter 3
Table 3.1 Common handcrafted features.
Table 3.2 Overview of the architectures of the two CNNs used for the extraction ...
Table 3.3 Convolutional neural network speed up through process parallelization.
Chapter 4
Table 4.1 Comparison of MAP on CC_WEB_VIDEO.
Table 4.2 Comparison of MAP and time for the Combined Dataset.
Chapter 5
Table 5.1 Notations and definitions.
Table 5.2 Some special cases of the unifying unsupervised fusion model of Eq. 5....
Table 5.3 Evaluation results under MAP and P@20 18.
Table 5.4 Notation table.
Chapter 6
Table 6.1 This table shows the major social media streams characterized by the f...
Table 6.2 The first row shows the total number of collected tweets aggregated pe...
Table 6.3 The top 20 most frequently extracted concepts of all analyzed images. ...
Table 6.4 Tweets with GPS tags aggregated by country name. A significant number ...
Table 6.5 The language distribution among all shares and posts. English is the m...
Table 6.6 Evaluation results of image type classification approaches.
Table 6.7 The performance of features extracted from state‐of‐the‐art CNNs for t...
Chapter 8
Table 8.1 Key observations and principles on which the Lambda Architecture is fo...
Chapter 9
Table 9.1 Typical pros and cons of the presented perceptual hash algorithms.
Chapter 10
Table 10.1 The comparison of different CNN architectures on model size, error ra...
Table 10.2 Comparison of attention localization in terms of classification accur...
Table 10.3 Comparison results on the CUB‐200‐2011 dataset.
Table 10.4 Comparison results on the Stanford Dogs dataset without an extra boun...
Table 10.5 The precision, recall, F1, and accuracy of different approaches to th...
Chapter 11
Table 11.1 Percentage of the neighborhood for varying w values.
Table 11.2 Test image sets.
Chapter 12
Table 12.1 Overview of papers using multimodal data sets for medical decision su...
Chapter 1
Figure 1.1 Naive version of the Inception module, refined from [24].
Figure 1.2 Inception module with dimension reduction, refined from [24...
Figure 1.3 Residual learning: a building block, refined from [25].
Figure 1.4 The flowchart of our transfer representation learning algori...
Figure 1.5 Four classification flows (the OM photos are from [58]).
Figure 1.6 The visualization of helpful features from different classes...
Figure 1.7 The visualization of helpful features for different patterns...
Chapter 2
Figure 2.1 (a) Block diagram of the developed cascade architecture for ...
Figure 2.2 Threshold assignment and stage ordering of the proposed casc...
Figure 2.3 A two‐layer stacking architecture instantiated with the LP ...
Figure 2.4 The event kit text for the event class Attempting a bike tri...
Figure 2.5 The proposed pipeline for zero‐example event detection.
Chapter 3
Figure 3.1 The Vs of big data.
Figure 3.2 SVM kernels illustrated using web‐based implementation of Li...
Figure 3.3 In the process parallelization scheme, a task is split into ...
Figure 3.4 Difference between PCA and LDA feature reduction techniques,...
Figure 3.5 Spectral clustering takes into account data adjacencies in a...
Figure 3.6 Bag of words approach. When the size of the moving window ma...
Figure 3.7 The multilayer perceptron model. For every node, the output
Figure 3.8 The autoencoder (compression encoder). The hidden layers rep...
Figure 3.9 In a convolution operation, the receptive field kernel (size...
Figure 3.10 In a recurrent neural network, output from a node is fed ba...
Figure 3.11 Overview of the system with CNN‐based feature extraction th...
Figure 3.12 Convolutional neural network speed up through process paral...
Chapter 4
Figure 4.1 The proposed framework for video retrieval.
Figure 4.2 The proposed framework of SVH for video retrieval.
Figure 4.3 Comparison of recall on MED dataset.
Figure 4.4 Video retrieval performance on the CCV dataset.
Figure 4.5 The overview of OGL.
Figure 4.6 Framework of our context memory. The sensory memory is prese...
Figure 4.7 Context structure in evolving units of LCM. A snapshot is al...
Figure 4.8 Different definitions of the context cluster. The similar in...
Figure 4.9 The flowchart of the proposed event video mashup approach.
Figure 4.10 The instructions for the crowdworkers.
Figure 4.11 Experiment results on four datasets.
Chapter 5
Figure 5.1 A multimodal fusion approach of similarities, using a to...
Figure 5.2 Social Active Learning for Image Classification (SALIC) sche...
Figure 5.3 Probability of selecting a sample based on (a) visual and (b...
Figure 5.4 Iteratively adding positive, negative or both examples. Trai...
Figure 5.5 Comparing with sample selection baselines. Training set ,...
Figure 5.6 Comparing with fusion baselines. Training set , pool of c...
Figure 5.7 Parameter sensitivity. Training set: , pool of candidates...
Figure 5.8 Comparing with active learning. Training set , pool of ca...
Figure 5.9 Comparing with weakly supervised learning. Training set (a)
Chapter 6
Figure 6.1 The occurrences of links and their frequency in retweets. Th...
Figure 6.2 This plot shows the top ten countries determined by GPS vers...
Figure 6.3 Architectural overview of the developed system. There are th...
Figure 6.4 These images illustrate two detected clusters of duplicate i...
Chapter 8
Figure 8.1 Simplified big multimedia system architecture. The metadata ...
Figure 8.2 Overview of the Metadata Storage architecture (based on the ...
Figure 8.3 Performance comparison of a high‐end HDD (600 GB, $360) and ...
Chapter 9
Figure 9.1 Schematic diagrams of perceptual hashing: (a) hash generatio...
Chapter 10
Figure 10.1 The framework of an RA‐CNN. The inputs are from coarse full...
Figure 10.2 Five bird examples of the learned region attention at diffe...
Figure 10.3 The learning process in each iteration of region attention ...
Figure 10.4 The learned region attention of dogs at different scales.
Figure 10.5 The deep‐coupled adjective and noun neural network (DCAN). ...
Chapter 11
Figure 11.1 1024 images tagged with “flower” shown in random order. Ove...
Figure 11.2 The same 1024 images projected by the t‐SNE algorithm.
Figure 11.3 1024 “flower” images projected by the proposed image sortin...
Figure 11.4 The weighting function for the normalized rank r' for diffe...
Figure 11.5 Projection quality for the 1024 flower images for different...
Figure 11.6 Projection quality for 4096 RGB color images for different ...
Figure 11.7 Sorting results for the hierarchical SOM, original SSM, and...
Figure 11.8 The running times of the implementations on Lab color vecto...
Figure 11.9 www.picsbuffet.com allows users to visually browse all imag...
Figure 11.10 Pyramid structure of the picsbuffet image browsing scheme....
Figure 11.11 Left: initial image graph based on the image position of t...
Figure 11.12 Building the graph of the next hierarchy level.
Figure 11.13 Schematic sequence of the improved edge swapping algorithm...
Figure 11.14 Quality over time for building a graph with 10000 RGB colo...
Figure 11.15 Navigating the graph. Dragging the map (indicated by the a...
Figure 11.16 Examples of successive views of a user's visual exploratio...
Chapter 12
Figure 12.1 Screen shot of a tool for 3D lung tissue categorization (de...
Figure 12.2 Screen shot of an image retrieval application that allows f...
Figure 12.3 Architecture of an image retrieval system that exploits mul...
Figure 12.4 Breakdown of papers included in the survey by year and data...
Edited by
Stefanos Vrochidis
Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
Benoit Huet
EURECOM, Sophia-Antipolis, France
Edward Y. Chang
HTC Research & Healthcare, San Francisco, USA
Ioannis Kompatsiaris
Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
This edition first published 2019
© 2019 John Wiley & Sons Ltd.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Stefanos Vrochidis, Benoit Huet, Edward Y. Chang and Ioannis Kompatsiaris to be identified as the authors of the editorial material in this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging‐in‐Publication Data
Names: Vrochidis, Stefanos, 1975- editor. | Huet, Benoit, editor. | Chang,
Edward Y., editor. | Kompatsiaris, Ioannis, editor.
Title: Big Data Analytics for Large‐Scale Multimedia Search / Stefanos
Vrochidis, Information Technologies Institute, Centre for Research and
Technology Hellas, Thessaloniki, Greece; Benoit Huet, EURECOM,
Sophia-Antipolis, France; Edward Y. Chang, HTC Research & Healthcare, San
Francisco, USA; Ioannis Kompatsiaris, Information Technologies Institute,
Centre for Research and Technology Hellas, Thessaloniki, Greece.
Description: Hoboken, NJ, USA : Wiley, [2018] | Includes bibliographical
references and index. |
Identifiers: LCCN 2018035613 (print) | LCCN 2018037546 (ebook) | ISBN
9781119376989 (Adobe PDF) | ISBN 9781119377009 (ePub) | ISBN 9781119376972
(hardcover)
Subjects: LCSH: Multimedia data mining. | Big data.
Classification: LCC QA76.9.D343 (ebook) | LCC QA76.9.D343 V76 2018 (print) |
DDC 005.7 – dc23
LC record available at https://lccn.loc.gov/2018035613
Cover design: Wiley
Cover image: © spainter_vfx/iStock.com
In recent years, the rapid development of digital technologies, including the low cost of recording, processing, and storing media, and the growth of high-speed communication networks enabling large-scale content sharing, has led to a rapid increase in the availability of multimedia content worldwide. The availability of such content, together with the increasing user need to analyse and search large multimedia collections, raises the demand for advanced search and analytics techniques for big multimedia data. Although multimedia is defined as a combination of different media (e.g., audio, text, video, images), this book mainly focuses on textual, visual, and audiovisual content, which are considered the most characteristic types of multimedia.
In this context, the big multimedia data era brings a plethora of challenges to the fields of multimedia mining, analysis, searching, and presentation. These are best described by the Vs of big data: volume, variety, velocity, veracity, variability, value, and visualization. A modern multimedia search and analytics algorithm and/or system has to be able to handle large databases with varying formats at extreme speed, while coping with unreliable "ground truth" information and "noisy" conditions. In addition, multimedia analysis and content understanding algorithms based on machine learning and artificial intelligence have to be employed. Further, the interpretation of the content may change over time, leading to a "drifting target", with multimedia content being perceived differently at different times and individual data points often carrying low value. Finally, the assessed information needs to be presented to human users in comprehensive and transparent ways.
The main challenges for big multimedia data analytics and search are identified in the areas of:
multimedia representation by extracting low‐ and high‐level conceptual features
application of machine learning and artificial intelligence for large‐scale multimedia
scalability in multimedia access and retrieval.
Feature extraction is an essential step in any computer vision and multimedia data analysis task. Though progress has been made in past decades, it is still quite difficult for computers to accurately recognize an object or comprehend the semantics of an image or a video. Thus, feature extraction is expected to remain an active research area in advancing computer vision and multimedia data analysis for the foreseeable future. The traditional approach of feature extraction is model‐based in that researchers engineer useful features based on heuristics, and then conduct validations via empirical studies. A major shortcoming of the model‐based approach is that exceptional circumstances such as different lighting conditions and unexpected environmental factors can render the engineered features ineffective. The data‐driven approach complements the model‐based approach. Instead of human‐engineered features, the data‐driven approach learns representation from data. In principle, the greater the quantity and diversity of data, the better the representation can be learned.
An additional layer of analysis and automatic annotation of big multimedia data involves the extraction of high‐level concepts and events. Concept‐based multimedia data indexing refers to the automatic annotation of multimedia fragments with specific simple labels, e.g., “car”, “sky”, “running” etc., from large‐scale collections. In this book we mainly deal with video as a characteristic multimedia example for concept‐based indexing. To deal with this task, concept detection methods have been developed that automatically annotate images and videos with semantic labels referred to as concepts. A recent trend in video concept detection is to learn features directly from the raw keyframe pixels using deep convolutional neural networks (DCNNs). On the other hand, event‐based video indexing aims to represent video fragments with high‐level events in a given set of videos. Typically, events are more complex than concepts, i.e., they may include complex activities, occurring at specific places and times, and involving people interacting with other people and/or object(s), such as “opening a door”, “making a cake”, etc. The event detection problem in images and videos can be addressed either with a typical video event detection framework, including feature extraction and classification, and/or by effectively combining textual and visual analysis techniques.
When it comes to multimedia analysis, machine learning is considered one of the most widely applied families of techniques. These include CNNs for representation learning on data such as imagery and acoustic data, as well as recurrent neural networks for sequential data, e.g., speech and video. The challenge of video understanding lies in the gap between large-scale video data and the limited resources we can afford in both the label collection and online computing stages.
An additional step in the analysis and retrieval of large-scale multimedia is the fusion of heterogeneous content. Due to the diverse modalities that form a multimedia item (e.g., the visual and textual modalities), multiple features are available to represent each modality. The fusion of multiple modalities may take place at the feature level (early fusion) or at the decision level (late fusion). Early fusion techniques usually rely on the linear (weighted) combination of multimodal features, although non-linear fusion approaches have recently prevailed. Another fusion strategy relies on graph-based techniques, allowing the construction of random walks, generalized diffusion processes, and cross-media transitions on the formulated graph of multimedia items. In the case of late fusion, the fusion takes place at the decision level and can be based on (i) linear/non-linear combinations of the decisions from each modality, (ii) voting schemes, and (iii) rank diffusion processes.

Scalability issues in multimedia processing systems typically occur for two reasons: (i) the lack of labelled data, which limits the scalability with respect to the number of supported concepts, and (ii) the high computational overhead in terms of both processing time and memory complexity. For the first problem, methods that learn primarily on weakly labelled data (weakly supervised learning, semi-supervised learning) have been proposed. For the second problem, methodologies typically rely on reducing the data space they work on by using smartly selected subsets of the data, so that the computational requirements of the systems are optimized.
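To make the distinction concrete, the short sketch below (not drawn from any particular chapter; the feature dimensions, weights, and scores are purely hypothetical) contrasts early fusion of feature vectors with late fusion of per-modality classifier scores.

    import numpy as np

    # Hypothetical visual and textual representations of one multimedia item.
    visual_feat = np.random.rand(128)    # e.g., a CNN descriptor
    textual_feat = np.random.rand(300)   # e.g., a word-embedding descriptor

    # Early fusion: combine modalities at the feature level, then classify once.
    early_vector = np.concatenate([0.6 * visual_feat, 0.4 * textual_feat])
    # early_score = classifier.predict(early_vector)   # a single (hypothetical) model on the fused vector

    # Late fusion: classify each modality separately, then combine the decisions.
    visual_score, textual_score = 0.82, 0.55                 # assumed per-modality outputs
    late_score = 0.6 * visual_score + 0.4 * textual_score    # linear weighted combination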
Another important aspect of multimedia nowadays is the social dimension and the user interaction associated with the data. The internet abounds with opinions, sentiments, and reflections of society about products, brands, and institutions, hidden under large amounts of heterogeneous and unstructured data. Such analysis includes the contextual augmentation of events in social media streams in order to fully leverage the knowledge present in social media, taking into account temporal, visual, textual, geographical, and user-specific dimensions. In addition, the social dimension includes an important privacy aspect. As big multimedia data continues to grow, it is essential to understand the risks users face during online multimedia sharing and the implications for multimedia privacy. Specifically, as multimedia data gets bigger, automatic privacy attacks can become increasingly dangerous. Two classes of algorithms for privacy protection in a large-scale online multimedia sharing environment are involved: the first is based on multimedia analysis and includes classification approaches that are used as filters, while the second is based on obfuscation techniques.
The challenge of data storage is also very important for big multimedia data. At this scale, data storage, management, and processing become very challenging. At the same time, there has been a proliferation of big data management techniques and tools, which have been developed mostly in the context of much simpler business and logging data. These tools and techniques include a variety of NoSQL and NewSQL data management systems, as well as automatically distributed computing frameworks (e.g., Hadoop and Spark). The question is which of these big data techniques apply to today's big multimedia collections. The answer is not trivial, since the big data repository has to store a variety of multimedia data, including raw data (images, video, or audio), metadata (including social interaction data) associated with the multimedia items, derived data, such as low-level concepts and semantic features extracted from the raw data, and supplementary data structures, such as high-dimensional indices or inverted indices. In addition, the big data repository must serve a variety of parallel requests with different workloads, ranging from simple queries to detailed data-mining processes, and with a variety of performance requirements, ranging from response-time driven online applications to throughput-driven offline services. Although several different techniques have been developed, there is no single technology that can cover all the requirements of big multimedia applications.
Finally, the book discusses the two main challenges of large-scale multimedia search: accuracy and scalability. Conventional techniques typically focus on the former; recently, however, attention has mainly been paid to the latter, since the amount of multimedia data is rapidly increasing. Due to the curse of dimensionality, conventional high-dimensional feature representations do not lend themselves to fast search. The big data era requires new solutions for multimedia indexing and retrieval based on efficient hashing. One robust solution is perceptual hash algorithms, which generate hash values from multimedia objects in big data collections, such as images, audio, and video. A content-based multimedia search can then be achieved by comparing hash values. The main advantage of using hash values instead of other content representations is that hash values are compact and facilitate fast in-memory indexing and search, which is very important for large-scale multimedia search.
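As a minimal sketch of this idea (not a specific algorithm from Chapter 9; the hash values and threshold below are made up), two items can be compared through the Hamming distance between their binary hash codes and treated as near-duplicates when the distance falls below a threshold.

    def hamming_distance(h1: int, h2: int) -> int:
        """Number of differing bits between two hash values."""
        return bin(h1 ^ h2).count("1")

    # Hypothetical 64-bit perceptual hashes of two images.
    hash_a = 0xF3A915CC00FF10AB
    hash_b = 0xF3A915CC00FF90AB
    is_near_duplicate = hamming_distance(hash_a, hash_b) <= 10   # assumed threshold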
Given the aforementioned challenges, the book is organized as follows. Chapters 1, 2, and 3 deal with feature extraction from big multimedia data, while Chapters 4, 5, 6, and 7 discuss techniques relevant to machine learning for multimedia analysis and fusion. Chapters 8 and 9 deal with scalability in multimedia access and retrieval, while Chapters 10, 11, and 12 present applications of large-scale multimedia retrieval. Finally, we conclude the book by summarizing and presenting future trends and challenges.
Laurent Amsaleg
Univ Rennes, Inria, CNRS
IRISA
France
Shahin Amiriparian
ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing
University of Augsburg
Germany
Kai Uwe Barthel
Visual Computing Group
HTW Berlin
University of Applied Sciences
Berlin
Germany
Benjamin Bischke
German Research Center for Artificial Intelligence and TU Kaiserslautern
Germany
Philippe Bonnet
IT University of Copenhagen
Copenhagen
Denmark
Damian Borth
University of St. Gallen
Switzerland
Edward Y. Chang
HTC Research & Healthcare
San Francisco, USA
Elisavet Chatzilari
Information Technologies Institute
Centre for Research and Technology Hellas
Thessaloniki
Greece
Liangliang Cao
College of Information and Computer Sciences
University of Massachusetts Amherst
USA
Chun‐Nan Chou
HTC Research & Healthcare
San Francisco, USA
Jaeyoung Choi
Delft University of Technology
Netherlands
and
International Computer Science Institute
USA
Fu‐Chieh Chang
HTC Research & Healthcare
San Francisco, USA
Jocelyn Chang
Johns Hopkins University
Baltimore
USA
Wen‐Huang Cheng
Department of Electronics Engineering and Institute of Electronics
National Chiao Tung University
Taiwan
Andreas Dengel
German Research Center for Artificial Intelligence and TU Kaiserslautern
Germany
Arjen P. de Vries
Radboud University
Nijmegen
The Netherlands
Zekeriya Erkin
Delft University of Technology and
Radboud University
The Netherlands
Gerald Friedland
University of California
Berkeley
USA
Jianlong Fu
Multimedia Search and Mining Group
Microsoft Research Asia
Beijing
China
Damianos Galanopoulos
Information Technologies Institute
Centre for Research and Technology Hellas
Thessaloniki
Greece
Lianli Gao
School of Computer Science and Center for Future Media
University of Electronic Science and Technology of China
Sichuan
China
Ilias Gialampoukidis
Information Technologies Institute
Centre for Research and Technology Hellas
Thessaloniki
Greece
Gylfi Þór Guðmundsson
Reykjavik University
Iceland
Nico Hezel
Visual Computing Group
HTW Berlin
University of Applied Sciences
Berlin
Germany
I‐Hong Jhuo
Center for Open‐Source Data & AI Technologies
San Francisco
California
Björn Þór Jónsson
IT University of Copenhagen
Denmark
and
Reykjavik University
Iceland
Ioannis Kompatsiaris
Information Technologies Institute
Centre for Research and Technology Hellas
Thessaloniki
Greece
Martha Larson
Radboud University and
Delft University of Technology
The Netherlands
Amr Mousa
Chair of Complex and Intelligent Systems
University of Passau
Germany
Foteini Markatopoulou
Information Technologies Institute
Centre for Research and Technology Hellas
Thessaloniki
Greece
and
School of Electronic Engineering and Computer Science
Queen Mary University of London
United Kingdom
Henning Müller
University of Applied Sciences Western Switzerland (HES‐SO)
Sierre
Switzerland
Tao Mei
JD AI Research
China
Vasileios Mezaris
Information Technologies Institute
Centre for Research and Technology Hellas
Thessaloniki
Greece
Spiros Nikolopoulos
Information Technologies Institute
Centre for Research and Technology Hellas
Thessaloniki
Greece
Ioannis Patras
School of Electronic Engineering and Computer Science
Queen Mary University of London
United Kingdom
Vedhas Pandit
ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing
University of Augsburg
Germany
Maximilian Schmitt
ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing
University of Augsburg
Germany
Björn Schuller
ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing
University of Augsburg
Germany
and
GLAM ‐ Group on Language, Audio and Music
Imperial College London
United Kingdom
Chuen‐Kai Shie
HTC Research & Healthcare
San Francisco, USA
Manel Slokom
Delft University of Technology
The Netherlands
Jingkuan Song
School of Computer Science and Center for Future Media
University of Electronic Science and Technology of China
Sichuan
China
Christos Tzelepis
Information Technologies Institute
Centre for Research and Technology Hellas
Thessaloniki
Greece
and
School of Electronic Engineering and Computer Science
QMUL, UK
Devrim Ünay
Department of Biomedical Engineering
Izmir University of Economics
Izmir
Turkey
Stefanos Vrochidis
Information Technologies Institute
Centre for Research and Technology Hellas
Thessaloniki
Greece
Li Weng
Hangzhou Dianzi University
China
and
French Mapping Agency (IGN)
Saint‐Mande
France
Xu Zhao
Department of Automation
Shanghai Jiao Tong University
China
This book is accompanied by a companion website:
www.wiley.com/go/vrochidis/bigdata
The website includes:
Open source algorithms
Data sets
Tools and materials for demonstration purposes
Scan this QR code to visit the companion website.
Chun-Nan Chou, Chuen-Kai Shie, Fu-Chieh Chang, Jocelyn Chang, and Edward Y. Chang
Extracting useful features from a scene is an essential step in any computer vision and multimedia data analysis task. Though progress has been made in past decades, it is still quite difficult for computers to comprehensively and accurately recognize an object or pinpoint the more complicated semantics of an image or a video. Thus, feature extraction is expected to remain an active research area in advancing computer vision and multimedia data analysis for the foreseeable future.
The approaches in feature extraction can be divided into two categories: model‐centric and data‐driven. The model‐centric approach relies on human heuristics to develop a computer model (or algorithm) to extract features from an image. (We use imagery data as our example throughout this chapter.) Some widely used models are Gabor filter, wavelets, and scale‐invariant feature transform (SIFT) [1]. These models were engineered by scientists and then validated via empirical studies. A major shortcoming of the model‐centric approach is that unusual circumstances that a model does not take into consideration during its design, such as different lighting conditions and unexpected environmental factors, can render the engineered features less effective. In contrast to the model‐centric approach, which dictates representations independent of data, the data‐driven approach learns representations from data [2]. Examples of data‐driven algorithms are multilayer perceptron (MLP) and convolutional neural networks (CNNs), which belong to the general category of neural networks and deep learning [3,4].
Both model‐centric and data‐driven approaches employ a model (algorithm or machine). The differences between model‐centric and data‐driven can be described in two related aspects:
Can data affect model parameters? With a model-centric approach, training data does not affect the model. With a data-driven approach, such as MLP or CNN, the internal parameters are changed/learned based on the structure discovered in large data sets [5].
Can more data help improve representations? Whereas more data can help a data‐driven approach to improve representations, more data cannot change the features extracted by a model‐centric approach. For example, the features of an image can be affected by the other images in the CNN (because the structure parameters modified through back‐propagation are affected by all training images), but the feature set of an image is invariant of the other images in a model‐centric pipeline such as SIFT.
The greater the quantity and diversity of data, the better the representations that can be learned by a data-driven pipeline. In other words, if a learning algorithm has seen enough training instances of an object under various conditions, e.g., in different postures and under partial occlusion, then the features learned from the training data will be more comprehensive.
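The contrast can be illustrated with a minimal sketch (the loss, learning rate, and dimensions are arbitrary): a data-driven model updates its parameters from training examples, whereas a model-centric extractor behaves identically no matter how much data is available.

    import numpy as np

    # Data-driven: parameters move in response to training data (one SGD step).
    w = np.zeros(128)                    # learnable parameters
    x, y = np.random.rand(128), 1.0      # a hypothetical training example
    grad = 2 * (w @ x - y) * x           # gradient of a squared-error loss
    w -= 0.01 * grad                     # w now depends on the data it has seen

    # Model-centric: a hand-engineered extractor (e.g., SIFT) ignores training data;
    # its output for a given image never changes, regardless of how many images exist.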
The focus of this chapter is on how neural networks, specifically CNNs, achieve effective representation learning. Neural networks, a class of neuroscience-motivated models, were based on Hubel and Wiesel's research on the cat visual cortex [6], and were subsequently formulated into computational models by scientists in the early 1980s. Pioneer neural network models include Neocognitron [7] and the shift-invariant neural network [8]. Widely cited enhanced models include LeNet-5 [9] and Boltzmann machines [10]. However, the popularity of neural networks surged only in 2012, after large training data sets became available. In 2012, Krizhevsky [11] applied deep convolutional networks to the ImageNet dataset, and their AlexNet achieved breakthrough accuracy in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2012 competition. This work convinced the research community and related industries that representation learning with big data is promising. Subsequently, several efforts have aimed to further improve the learning capability of neural networks. Today, the top-5 error rate for the ILSVRC competition has dropped to around 3%, a remarkable achievement considering that the error rate was around 26% before AlexNet [11] was proposed.
We divide the remainder of this chapter into two parts before suggesting related reading in the concluding remarks. The first part reviews representative CNN models proposed since 2012. These key representatives are discussed in terms of three aspects addressed in He's tutorial presentation [14] at ICML 2016: (i) representation ability, (ii) optimization ability, and (iii) generalization ability. The representation ability is the ability of a CNN to learn/capture representations from the training data, assuming the optimum could be found. Here, the optimum refers to attaining the best solution of the underlying learning algorithm, modeled as an optimization problem. This leads to the second aspect that He's tutorial addresses: the optimization ability, i.e., the feasibility of finding an optimum. For CNNs specifically, the optimization problem is solved by stochastic gradient descent. Finally, the generalization ability is the quality of the test performance once the model parameters have been learned from the training data.
The second part of this chapter deals with the small data problem. We present how features learned from one source domain with big data can be transferred to a different target domain with small data. This transfer representation learning approach is critical for remedying the small data challenge often encountered in the medical domain. We use the Otitis Media detector, designed and developed for our XPRIZE Tricorder [15] device (code name DeepQ), to demonstrate how learning on a small dataset can be bolstered by transferring over learned representations from ImageNet, a dataset that is entirely irrelevant to otitis media.
Deep learning has its roots in neuroscience. Strongly driven by the fact that the human visual system can effortlessly recognize objects, neuroscientists have been developing vision models based on physiological evidence that can be applied to computers. Though such research may still be in its infancy and several hypotheses remain to be validated, some widely accepted theories have been established. Building on the pioneering neuroscience work of Hubel [6], all recent models are founded on the theory that visual information is transmitted from the primary visual cortex (V1) over extrastriate visual areas (V2 and V4) to the inferotemporal cortex (IT). The IT in turn is a major source of input to the prefrontal cortex (PFC), which is involved in linking perception to memory and action [16].
The pathway from V1 to the IT, called the ventral visual pathway [17], consists of a number of simple and complex layers. The lower layers detect simple features (e.g., oriented lines) at the pixel level. The higher layers aggregate the responses of these simple features to detect complex features at the object‐part level. Pattern reading at the lower layers is unsupervised, whereas recognition at the higher layers involves supervised learning. Pioneer computational models developed based on the scientific evidence include Neocognitron [7] and the shift‐invariant neural network [8]. Widely cited enhanced models include LeNet‐5 [9] and Boltzmann machines [10]. The remainder of this chapter uses representative CNN models, which stem from LeNet‐5 [9], to present three design aspects: representation, optimization, and generalization.
CNNs are composed of two major components: feature extraction and classification. For feature extraction, a standard structure consists of stacked convolutional layers, which are followed by optional layers of contrast normalization or pooling. For classification, there are two widely used structures. One structure employs one or more fully connected layers. The other structure uses a global average pooling layer, which is illustrated in section 1.2.2.2.
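As an illustrative sketch only (not a model described in this chapter; the channel widths, input size, and class count are placeholders), the structure above can be written in a few lines of PyTorch: stacked convolutional layers for feature extraction, followed by either fully connected layers or a global average pooling head for classification.

    import torch.nn as nn

    # Feature extraction: stacked convolution + nonlinearity + pooling.
    features = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    )

    # Classification, variant 1: a fully connected layer (assumes 32x32 inputs).
    fc_head = nn.Sequential(nn.Flatten(), nn.Linear(64 * 8 * 8, 10))

    # Classification, variant 2: global average pooling, one feature map per class.
    gap_head = nn.Sequential(nn.Conv2d(64, 10, kernel_size=1),
                             nn.AdaptiveAvgPool2d(1), nn.Flatten())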
The accuracy of several computer vision tasks, such as house number recognition [18], traffic sign recognition [19], and face recognition [20], has been substantially improved recently, thanks to advances in CNNs. For many similar object-recognition tasks, the advantage of CNNs over other methods is that CNNs join classification with feature extraction. Several works, such as [21], show that CNNs can learn superior representations to boost the performance of classification. Table 1.1 presents four top-performing CNN models proposed over the past four years, together with their top-5 error rates. These representative models mainly differ in their number of layers or parameters. (Parameters refer to the variables learned by supervised training, including the weight and bias parameters of the CNN models.) Besides the four CNN models depicted in Table 1.1, Lin et al. [22] proposed the network in network (NIN), which has considerably influenced subsequent models such as GoogLeNet, Visual Geometry Group (VGG), and ResNet. In the following sections, we present these five models' novel ideas and key techniques, which have had significant impacts on the design of subsequent CNN models.
Table 1.1 Image classification performance on the ImageNet subset designated for ILSVRC [13].

Model            Year   Rank   Error (top-5)   Number of parameter layers   Number of parameters in a single model
AlexNet [11]     2012   1st    16.4%           8                            60m
VGG [23]         2014   2nd    7.3%            19                           144m
GoogLeNet [24]   2014   1st    6.7%            22                           5m
ResNet [25]      2015   1st    3.57%           152                          60m
Krizhevsky [11] proposed AlexNet, which was the winner of the ILSVRC-2012 competition and outperformed the runner-up significantly (a top-5 error rate of 16.4% compared with 26.2% for the runner-up). The outstanding performance of AlexNet led to the increased prevalence of CNNs in the computer vision field. AlexNet achieved this breakthrough performance by combining several novel ideas and effective techniques. Based on He's three aspects of deep learning models [14], these novel ideas and effective techniques can be categorized as follows:
1) Representation ability. In contrast to prior CNN models such as LeNet-5 [9], AlexNet was deeper and wider, in the sense that both the number of parameter layers and the number of parameters are larger than those of its predecessors.
2) Optimization ability. AlexNet utilized a non-saturating activation function, the rectified linear unit (ReLU), to make training faster.
3) Generalization ability. AlexNet employed two effective techniques, data augmentation and dropout, to alleviate overfitting.
AlexNet's three key ingredients according to the description in [11] are ReLU nonlinearity, data augmentation, and dropout.
In order to model nonlinearity, the neural network introduces an activation function during the evaluation of neuron outputs. The traditional way to evaluate a neuron output f as a function of its input x is with f(x), where f can be a sigmoid function f(x) = 1/(1 + e^(-x)) or a hyperbolic tangent function f(x) = tanh(x). Both of these functions are saturating nonlinearities; that is, the ranges of these two functions are fixed between a minimum value and a maximum value.
Instead of using saturating activation functions, however, AlexNet adopted the non-saturating activation function ReLU proposed in [26]. ReLU computes the function f(x) = max(0, x), which has a threshold at zero. Using ReLU offers two benefits. First, ReLU requires less computation than the sigmoid and hyperbolic tangent functions, which involve expensive exponential operations. The other benefit is that ReLU, in comparison to the sigmoid and hyperbolic tangent functions, is found to accelerate the convergence of stochastic gradient descent (SGD). As demonstrated in the first figure of [11], a CNN with ReLU trains six times faster than one with a hyperbolic tangent function. Due to these two advantages, recent CNN models have adopted ReLU as their activation function.
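A quick numerical illustration of the saturating versus non-saturating behaviour discussed above (the input values are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = np.array([-10.0, 0.0, 10.0, 100.0])
    print(sigmoid(z))         # saturates near 0 and 1 for large |z|
    print(np.tanh(z))         # saturates near -1 and 1
    print(np.maximum(0, z))   # ReLU: unbounded above, so large inputs keep useful gradients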
As shown in Table 1.1, the AlexNet architecture has 60 million parameters. This huge number of parameters makes overfitting highly possible if training data is not sufficient. To combat overfitting, AlexNet incorporates two schemes: data augmentation and dropout.
Thanks to ImageNet, AlexNet is the first model that enjoys big data and takes advantage of the benefits of the data-driven feature learning approach advocated by [2]. However, even the 1.2 million labeled ImageNet instances are still considered insufficient given that the number of parameters is 60 million. (From simple algebra, 1.2 million equations are insufficient for solving 60 million variables.) Conventionally, when the training dataset is limited, the common practice for image data is to artificially enlarge the dataset by using label-preserving transformations [27–29]. In order to enlarge the training data, AlexNet employs two distinct forms of data augmentation, both of which can produce the transformed images from the original images with very little computation [11,30].
The first scheme of data augmentation consists of random cropping and horizontal reflection. Data augmentation can be applied to both the training and testing stages. For the training stage, AlexNet randomly extracts smaller 224×224 image patches and their horizontal reflections from the original 256×256 images. The AlexNet model is trained on these extracted patches instead of the original images in the ImageNet dataset. In theory, this scheme is capable of increasing the training data by a factor of 2048. Although the resultant training examples are highly interdependent, Krizhevsky [11] claimed that without this data augmentation scheme the AlexNet model would suffer from substantial overfitting. (This is evident from our algebra example.) For the testing stage, AlexNet generates ten patches, namely four corner patches, one center patch, and the horizontal reflections of these five patches, from each test image. Based on these ten patches, AlexNet first derives temporary results from the network's softmax layer and then makes a prediction by averaging the ten results.
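A minimal numpy sketch of this first scheme follows; the 256×256 source size, 224×224 crop size, and the resulting factor of 2048 come from the description in [11], while the function itself is only an assumed illustration.

    import numpy as np

    def random_crop_and_flip(image, crop=224):
        """Randomly crop a patch and reflect it horizontally with probability 0.5."""
        h, w, _ = image.shape                        # e.g., a 256 x 256 x 3 array
        top = np.random.randint(0, h - crop + 1)
        left = np.random.randint(0, w - crop + 1)
        patch = image[top:top + crop, left:left + crop, :]
        if np.random.rand() < 0.5:
            patch = patch[:, ::-1, :]                # horizontal reflection
        return patch

    # (256 - 224)^2 crop positions x 2 reflections = 2048 variants per training image.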
The second scheme of data augmentation alters the intensities of the RGB channels in training images by using principal component analysis (PCA). This scheme is used to capture an important property of natural images: the invariance of object identity to changes in the intensity and color of the illumination. The detailed implementation is as follows. First, the principal components of RGB pixel values are acquired by performing PCA on a set of RGB pixel values throughout the ImageNet training set. When a particular training image is chosen to train the network, each RGB pixel of this chosen training image is refined by adding the following quantity:
[p1, p2, p3] [α1 λ1, α2 λ2, α3 λ3]^T,

where p_i and λ_i represent the ith eigenvector and eigenvalue of the 3×3 covariance matrix of RGB pixel values, respectively, and α_i is a random variable drawn from a Gaussian model with mean zero and standard deviation 0.1. Note that each time a training image is chosen to train the network, each α_i is redrawn. Thus, during training, the data augmentation of the same training image varies from time to time. Once α_i is drawn, the same quantity is applied to all the pixels of the chosen training image.
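The following numpy sketch implements the quantity above under the stated assumptions (a 3×3 matrix of eigenvectors as columns and the three eigenvalues, both obtained from PCA over the training set's RGB pixels); the function name is ours.

    import numpy as np

    def pca_color_shift(image, eigvecs, eigvals, sigma=0.1):
        """Add [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T to every RGB pixel of the image."""
        alphas = np.random.normal(0.0, sigma, size=3)   # redrawn each time the image is used
        shift = eigvecs @ (alphas * eigvals)            # one offset per colour channel
        return image + shift                            # broadcast over all pixels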
Model ensembles such as bagging [31], boosting [32], and random forests [33] have long been shown to effectively reduce class-prediction variance and hence testing error. Model ensembles rely on combining the predictions from several different models. However, this method is impractical for large-scale CNNs such as AlexNet, since training even one CNN can take several days or even weeks.
Rather than training multiple large CNNs, Krizhevsky [11] employed the "dropout" technique introduced in [34] to efficiently perform model combination. This technique simply sets the output of each hidden neuron to zero with a probability p (e.g., p = 0.5 in [11]). Afterwards, the dropped-out neurons neither contribute to the forward pass nor participate in the subsequent back-propagation pass. In this manner, a different network architecture is sampled each time a training instance is presented, but all these sampled architectures share the same parameters. In addition to combining models efficiently, the dropout technique has the effect of reducing the complex co-adaptations of neurons, since a neuron cannot depend on the presence of particular other neurons. In this way, more robust features are forcibly learned. At test time, all neurons are used, but their outputs are multiplied by 0.5, which is a reasonable approximation of the geometric mean of the predictive distributions produced by the exponentially many dropout networks [34].
In [11], dropout was only applied to the first two fully connected layers of AlexNet and roughly doubled the number of iterations required for convergence. Krizhevsky [11] also claimed that AlexNet suffered from substantial overfitting without dropout.
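A short sketch of the behaviour described above, following the original formulation (drop units at training time, multiply outputs by the keep factor at test time); the array shapes and rates are arbitrary.

    import numpy as np

    def dropout(activations, p_drop=0.5, training=True):
        """Original dropout: zero out units during training, rescale outputs at test time."""
        if training:
            mask = np.random.rand(*activations.shape) >= p_drop
            return activations * mask             # dropped neurons output zero
        return activations * (1.0 - p_drop)       # test time: multiply outputs by 0.5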
Although NIN, presented in [22], has not ranked among the best in recent ILSVRC competitions, its novel designs have significantly influenced subsequent CNN models, especially its 1×1 convolutional filters. The 1×1 convolutional filters are widely used by current CNN models and have been incorporated into VGG, GoogLeNet, and ResNet. Based on He's three aspects of learning deep models, the novel designs proposed in NIN can be categorized as follows:
1) Representation ability. In order to enhance the model's discriminability, NIN adopted MLP convolutional layers with more complex structures to abstract the data within the receptive field.
2) Optimization ability. Optimization in NIN remained typical compared to that of the other models.
3) Generalization ability. NIN utilized global average pooling over feature maps in the classification layer, because global average pooling is less prone to overfitting than traditional fully connected layers.
The work of Lin et al. [22] argued that the conventional CNNs [9] implicitly make the assumption that the samples of the latent concepts are linearly separable. Thus, typical convolutional layers generate feature maps with linear convolutional filters followed by nonlinear activation functions. This kind of feature map can be calculated as follows:
f_{i,j,k} = σ(w_k^T x_{i,j} + b_k)

Here, (i, j) is the pixel index and k is the filter index. x_{i,j} stands for the input patch centered at location (i, j). w_k and b_k represent the weight and bias parameters of the kth filter, respectively. The term w_k^T x_{i,j} + b_k denotes the result of the convolutional layer and the input to the activation function, while σ denotes the activation function, which can be a sigmoid σ(z) = 1/(1 + e^(-z)), hyperbolic tangent σ(z) = tanh(z), or ReLU σ(z) = max(0, z).
However, instances of the same concept often live on a nonlinear manifold. Hence, the representations that capture these concepts are generally highly nonlinear functions of the input. In NIN, the linear convolutional filter is replaced with an MLP. This new type of layer is called mlpconv in [22], where MLP convolves over the input. There are two reasons for choosing an MLP. First, an MLP is a general nonlinear function approximator. Second, an MLP can be trained by using back‐propagation, and is therefore compatible with conventional CNN models. The first figure in [22] depicts the difference between a linear convolutional layer and an mlpconv layer. The calculation for an mlpconv layer is performed as follows:
f^1_{i,j,k1} = max((w^1_{k1})^T x_{i,j} + b_{k1}, 0)
...
f^n_{i,j,kn} = max((w^n_{kn})^T f^{n-1}_{i,j} + b_{kn}, 0)

Here, n is the number of layers in the MLP, and k_l is the filter index of the lth layer. Lin et al. [22] used ReLU as the activation function in the MLP.
From a pooling point of view, Eq. 1.2 is equivalent to performing cross-channel parametric pooling on a typical convolutional layer. Traditionally, there is no learnable parameter involved in the pooling operation. Moreover, conventional pooling is performed within one particular feature map, and is thus not cross-channel. However, Eq. 1.2 performs a weighted linear recombination of the input feature maps, which then goes through a nonlinear activation function; therefore Lin et al. [22] interpreted Eq. 1.2 as a cross-channel parametric pooling operation. They also suggested that Eq. 1.2 can be viewed as a convolutional layer with a 1×1 filter.
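The equivalence can be made concrete with a small PyTorch sketch (layer sizes are illustrative, not taken from [22]): an mlpconv block is an ordinary convolution followed by 1×1 convolutions, i.e., a small MLP shared across all spatial locations.

    import torch.nn as nn

    # An mlpconv block: a 3x3 convolution followed by two 1x1 convolutions,
    # which act as a two-layer MLP applied at every spatial position.
    mlpconv = nn.Sequential(
        nn.Conv2d(3, 96, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(96, 96, kernel_size=1), nn.ReLU(),   # cross-channel recombination
        nn.Conv2d(96, 96, kernel_size=1), nn.ReLU(),
    )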
Lin [22] made the following remarks. The traditional CNN adopts fully connected layers for classification. Specifically, the feature maps of the last convolutional layer are flattened into a vector, and this vector is fed into some fully connected layers followed by a softmax layer [11,35,36]. In this fashion, convolutional layers are treated as feature extractors, and the resulting features are classified with a traditional neural network. However, traditional neural networks are prone to overfitting, thereby degrading the generalization ability of the overall network.
Instead of using fully connected layers with regularization methods such as dropout, Lin [22] proposed global average pooling to replace the traditional fully connected layers in CNNs. The idea is to derive one feature map from the last mlpconv layer for each category of the classification task. The values of each derived feature map are averaged spatially, and all the average values are flattened into a vector which is then fed directly into the softmax layer. The second figure in [22] delineates the design of global average pooling. One advantage of global average pooling over fully connected layers is that there is no parameter to optimize in global average pooling, which prevents overfitting at this layer. Another advantage is that the linkage between the feature maps of the last convolutional layer and the categories of the classification task can be easily interpreted, which allows for better understanding. Finally, global average pooling aggregates spatial information and is thus more robust to spatial translations of the input.
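A minimal sketch of this classification head (channel and class counts are placeholders): the last mlpconv layer emits one feature map per category, each map is averaged spatially, and the resulting vector goes straight to softmax.

    import torch
    import torch.nn as nn

    num_classes = 10                                            # placeholder
    last_mlpconv = nn.Conv2d(96, num_classes, kernel_size=1)    # one map per category

    def classify(feature_maps):
        maps = last_mlpconv(feature_maps)       # shape: (batch, num_classes, H, W)
        pooled = maps.mean(dim=(2, 3))          # global average pooling, no parameters
        return torch.softmax(pooled, dim=1)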
VGG, proposed by Simonyan and Zisserman [23], ranked first and second in the localization and classification tracks of the ImageNet Challenge 2014, respectively. VGG reduced the top-5 error rate of AlexNet from 16.4% to 7.3%, an improvement of more than 50%. Using very small (3×3) convolutional filters makes a substantial contribution to this improvement. Consequently, very small (3×3) convolutional filters have become very popular in recent CNN models. Here, a convolutional filter is considered small or large depending on the size of its receptive field. According to He's three aspects of learning deep models, the essential ideas in VGG can be described as follows:
1) Representation ability. VGG used very small (3×3) convolutional filters, which makes the decision function more discriminative. Additionally, the depth of VGG was increased steadily to 19 parameter layers by adding more convolutional layers, an increase that is feasible due to the use of very small (3×3) convolutional filters in all layers.
2) Optimization ability. VGG used very small (3×3) convolutional filters, thereby decreasing the number of parameters.
3) Generalization ability. VGG employed multi-scale training to recognize objects over a wide range of scales.
According to [23], instead of using relatively large convolutional filters in the first convolutional layers (e.g., 11×11 with stride 4 in [11] or 7×7 with stride 2 in [21,37]), VGG used very small 3×3 convolutional filters with stride 1 throughout the network. The output dimension of a stack of two 3×3 convolutional filters (without a spatial pooling operation in between) is equal to the output dimension of one 5×5 convolutional filter. Thus, [23] claimed that a stack of two 3×3 convolutional filters has an effective receptive field of 5×5. By following the same rule, we can conclude that three such filters construct a 7×7 effective receptive field.
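The receptive field arithmetic above can be checked with the standard recurrence for stacked stride-1 convolutions (a worked verification, not code from [23]).

    def stacked_receptive_field(kernel_size, num_layers):
        """Effective receptive field of `num_layers` stacked stride-1 convolutions."""
        rf = 1
        for _ in range(num_layers):
            rf += kernel_size - 1
        return rf

    assert stacked_receptive_field(3, 2) == 5   # two 3x3 layers cover a 5x5 region
    assert stacked_receptive_field(3, 3) == 7   # three 3x3 layers cover a 7x7 region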
The reasons for using smaller convolutional filters are twofold. First, the decision function is more discriminative. For example, using a stack of three
