The advent of increasingly large consumer collections of audio (e.g., iTunes), imagery (e.g., Flickr), and video (e.g., YouTube) is driving a need not only for multimedia retrieval but also for information extraction from and across media. Furthermore, industrial and government collections fuel requirements for stock media access, media preservation, broadcast news retrieval, identity management, and video surveillance. While significant advances have been made in language processing for information extraction from unstructured multilingual text and in the extraction of objects from imagery and video, these advances have been explored in largely independent research communities, each addressing extraction from a single medium (e.g., text, imagery, or audio). Yet users need to search for concepts across individual media, author multimedia artifacts, and perform multimedia analysis in many domains.
This collection is intended to serve several purposes, including reporting the current state of the art, stimulating novel research, and encouraging cross-fertilization of distinct research disciplines. The collection and integration of a common base of intellectual material will provide an invaluable foundation from which to teach a future generation of cross-disciplinary media scientists and engineers.
Table of Contents
COVER
IEEE COMPUTER SOCIETY
TITLE PAGE
COPYRIGHT PAGE
FOREWORD
PREFACE
ACKNOWLEDGMENTS
CONTRIBUTORS
CHAPTER 1 INTRODUCTION
1.1 MOTIVATION
1.2 DEFINITIONS
1.3 COLLECTION OVERVIEW
1.4 CONTENT INDEX
1.5 MAPPING TO CORE CURRICULUM
1.6 SUMMARY
ACKNOWLEDGMENTS
CHAPTER 2 MULTIMEDIA INFORMATION EXTRACTION: HISTORY AND STATE OF THE ART
2.1 ORIGINS
2.2 TEXT EXTRACTION
2.3 AUDIO EXTRACTION
2.4 IMAGE EXTRACTION
2.5 VIDEO EXTRACTION
2.6 AFFECT EXTRACTION: EMOTIONS AND SENTIMENTS
2.7 SOCIAL MEDIA EXTRACTION
2.8 SENSOR EXTRACTION
2.9 FUSION
2.10 A ROADMAP TO THE FUTURE
2.11 CONCLUSION
ACKNOWLEDGMENT
SECTION 1: IMAGE EXTRACTION
CHAPTER 3 VISUAL FEATURE LOCALIZATION FOR DETECTING UNIQUE OBJECTS IN IMAGES
3.1 INTRODUCTION
3.2 SCENE MATCHING IN CONSUMER IMAGES
3.3 LOGO DETECTION IN CONSUMER IMAGES
3.4 CONCLUSIONS
ACKNOWLEDGMENTS
CHAPTER 4 ENTROPY-BASED ANALYSIS OF VISUAL AND GEOLOCATION CONCEPTS IN IMAGES
4.1 INTRODUCTION
4.2 RELATED WORK
4.3 ENTROPY ANALYSIS
4.4 EXPERIMENTS
4.5 CONCLUSIONS AND FUTURE WORK
CHAPTER 5 THE MEANING OF 3D SHAPE AND SOME TECHNIQUES TO EXTRACT IT
5.1 INTRODUCTION TO 3D OBJECTS
5.2 GEOMETRY AND SEMANTICS: RECIPE WITH SOME OPEN QUESTIONS
5.3 WHY ARE SHAPE SEMANTICS SO URGENTLY NEEDED?
5.4 DESCRIPTION OF GEOMETRICAL KNOWLEDGE
5.5 REVERSE ENGINEERING AS INVERSE PROBLEM
5.6 EXAMPLES OF 3D INFORMATION EXTRACTION
5.7 CONCLUSION
CHAPTER 6 A DATA-DRIVEN MEANINGFUL REPRESENTATION OF EMOTIONAL FACIAL EXPRESSIONS
6.1 INTRODUCTION
6.2 RELATED WORK AND CONTRIBUTION
6.3 CONSTRUCTION OF AN APPEARANCE SPACE
6.4 SIMPLIFICATION OF THE APPEARANCE SPACE
6.5 APPLICATIONS
6.6 CONCLUSION
SECTION 2: VIDEO EXTRACTION
CHAPTER 7 VISUAL SEMANTICS FOR REDUCING FALSE POSITIVES IN VIDEO SEARCH
7.1 INTRODUCTION
7.2 EVENT DETECTION IN VIDEO
7.3 VISUAL SEMANTICS
7.4 EXPERIMENTATION WITH NOMINAL EVENT DETECTION
7.5 IMPACT OF NOMINAL EVENTS IN KEYWORD-BASED SEARCH OF VIDEO CLIPS
7.6 SUMMARY
CHAPTER 8 AUTOMATED ANALYSIS OF IDEOLOGICAL BIAS IN VIDEO
8.1 INTRODUCTION
8.2 VIDEO CORPUS
8.3 VISUAL SEMANTIC CONCEPTS FOR DESCRIBING VIDEO
8.4 JOINT TOPIC AND PERSPECTIVE MODEL
8.5 EXPERIMENTS ON DIFFERENTIATING IDEOLOGICAL VIDEO
8.6 SUMMARY
CHAPTER 9 MULTIMEDIA INFORMATION EXTRACTION IN A LIVE MULTILINGUAL NEWS MONITORING SYSTEM
9.1 INTRODUCTION
9.2 EVITAP SYSTEM OVERVIEW
9.3 TRANSMEDIA INFORMATION FUSION
9.4 JOINT TRAINING OF INFORMATION EXTRACTION COMPONENTS
9.5 CONCLUSION
ACKNOWLEDGMENTS
CHAPTER 10 SEMANTIC MULTIMEDIA EXTRACTION USING AUDIO AND VIDEO
10.1 INTRODUCTION AND MOTIVATION
10.2 RELATED RESEARCH
10.3 SEMANTIC MULTIMEDIA EXTRACTION USING AUDIO OR CLOSED-CAPTIONS
10.4 SEMANTIC MULTIMEDIA EXTRACTION USING VIDEO
10.5 CONCLUSION AND FUTURE WORK
CHAPTER 11 ANALYSIS OF MULTIMODAL NATURAL LANGUAGE CONTENT IN BROADCAST VIDEO
11.1 INTRODUCTION
11.2 OVERVIEW OF SYSTEM FOR CONTENT EXTRACTION FROM AUDIO AND VIDEO TEXT
11.3 METHODOLOGY FOR COMPARING CONTENT IN AUDIO AND VIDEO TEXT
11.4 EXPERIMENTAL RESULTS AND ANALYSIS
11.5 CONCLUSIONS AND FUTURE WORK
CHAPTER 12 WEB-BASED MULTIMEDIA INFORMATION EXTRACTION BASED ON SOCIAL REDUNDANCY
12.1 INTRODUCTION
12.2 REDUNDANCY DETECTION AND GENERATION OF THE VISUAL AFFINITY GRAPH
12.3 SOCIAL SUMMARIZATION
12.4 IMPROVING ANNOTATIONS
12.5 CONCLUSIONS
CHAPTER 13 INFORMATION FUSION AND ANOMALY DETECTION WITH UNCALIBRATED CAMERAS IN VIDEO SURVEILLANCE
13.1 INTRODUCTION
13.2 GEOMETRY INDEPENDENCE OF ACTIVITY
13.3 DENSE MULTI-CAMERA MATCHING
13.4 MULTI-CAMERA INFORMATION FUSION AND ANOMALY DETECTION
13.5 SUMMARY AND CONCLUSIONS
SECTION 3: AUDIO, GRAPHICS, AND BEHAVIOR EXTRACTION
CHAPTER 14 AUTOMATIC DETECTION, INDEXING, AND RETRIEVAL OF MULTIPLE ATTRIBUTES FROM CROSS-LINGUAL MULTIMEDIA DATA
14.1 INTRODUCTION
14.2 DETECTING AND USING MULTIPLE ATTRIBUTES FROM THE AUDIO
14.3 KEYWORD RETRIEVAL USING WORD-BASED AND PHONEME-BASED RECOGNITION ENGINES
14.4 QUERY EXPANSION
14.5 AHS RESEARCH PROTOTYPE
14.6 CONCLUSION
CHAPTER 15 INFORMATION GRAPHICS IN MULTIMODAL DOCUMENTS
15.1 INTRODUCTION
15.2 ROLE OF INFORMATION GRAPHICS IN MULTIMODAL DOCUMENTS
15.3 METHODOLOGY FOR PROCESSING INFORMATION GRAPHICS
15.4 IMPLEMENTATION OF OUR MESSAGE RECOGNITION SYSTEM
15.5 RELATED WORK
15.6 CONCLUSION
ACKNOWLEDGMENT
CHAPTER 16 EXTRACTING INFORMATION FROM HUMAN BEHAVIOR
16.1 INTRODUCTION
16.2 THE MISSION SURVIVAL CORPORA
16.3 AUTOMATIC DETECTION OF GROUP FUNCTIONAL ROLES
16.4 AUTOMATIC PREDICTION OF PERSONALITY TRAIT
16.5 CONCLUSION
SECTION 4: AFFECT EXTRACTION FROM AUDIO AND IMAGERY
CHAPTER 17 RETRIEVAL OF PARALINGUISTIC INFORMATION IN BROADCASTS
17.1 INTRODUCTION
17.2 EMOTIONS IN TV BROADCASTS: THE VAM CORPUS
17.3 CLASSIFICATION AND FEATURE SELECTION METHODS
17.4 METHODS FOR ACOUSTIC AND LINGUISTIC ANALYSIS
17.5 PERFORMANCE ON MEDIA BROADCASTS
17.6 CONCLUSION AND OUTLOOK
ACKNOWLEDGMENT
CHAPTER 18 AUDIENCE REACTIONS FOR INFORMATION EXTRACTION ABOUT PERSUASIVE LANGUAGE IN POLITICAL COMMUNICATION
18.1 INTRODUCTION
18.2 PERSUASION AND NLP
18.3 CORPS
18.4 EXPLOITING THE CORPUS
18.5 CORPS AND PERSUASIVE EXPRESSION MINING
18.6 CORPS AND QUALITATIVE ANALYSIS OF PERSUASIVE COMMUNICATION
18.7 PREDICTING AUDIENCE REACTION
18.8 CONCLUSIONS AND FUTURE WORK
CHAPTER 19 THE NEED FOR AFFECTIVE METADATA IN CONTENT-BASED RECOMMENDER SYSTEMS FOR IMAGES
19.1 INTRODUCTION
19.2 AFFECTIVE-BASED CBR SYSTEMS
19.3 EXPERIMENT
19.4 CONCLUSION AND OUTLOOK
ACKNOWLEDGMENTS
CHAPTER 20 AFFECT-BASED INDEXING FOR MULTIMEDIA DATA
20.1 INTRODUCTION
20.2 AFFECT REPRESENTATION AND COMPUTING
20.3 AFFECT ANALYSIS FOR CONTENT-BASED VIDEO INDEXING
20.4 DESIGN OF A NOVEL AFFECT-BASED VIDEO INDEXING AND RETRIEVAL SYSTEM
20.5 EXPERIMENTAL INVESTIGATION OF AFFECT LABELING
20.6 CONCLUSIONS AND NEXT STEPS
SECTION 5: MULTIMEDIA ANNOTATION AND AUTHORING
CHAPTER 21 MULTIMEDIA ANNOTATION, QUERYING, AND ANALYSIS IN ANVIL
21.1 INTRODUCTION
21.2 ANVIL: A MULTIMEDIA ANNOTATION TOOL
21.3 RELATED ANNOTATION TOOLS
21.4 DATABASE INTEGRATION
21.5 INTEGRATING MOTION CAPTURE
21.6 ANALYSIS
21.7 CONCLUSIONS
ACKNOWLEDGMENTS
CHAPTER 22 TOWARD FORMALIZATION OF DISPLAY GRAMMAR FOR INTERACTIVE MEDIA PRODUCTION WITH MULTIMEDIA INFORMATION EXTRACTION
22.1 INTRODUCTION
22.2 DISPLAY GRAMMAR BACKGROUND
22.3 DISPLAY GRAMMAR DEFINITION AND CHARACTERISTICS
22.4 DISPLAY GRAMMAR METHODOLOGY
22.5 DISPLAY GRAMMAR USE CASE
22.6 FUTURE DIRECTIONS: APPLICATION OF DISPLAY GRAMMARS TO MMIE
ACKNOWLEDGMENTS
CHAPTER 23 MEDIA AUTHORING WITH ONTOLOGICAL REASONING: USE CASE FOR MULTIMEDIA INFORMATION EXTRACTION
23.1 INTRODUCTION
23.2 INTERACTIVE MEDIA AND MEDIA AUTHORING: IMPLICATIONS FOR MMIE
23.3 PROTOTYPE SYSTEM FOR MEDIA AUTHORING WITH ONTOLOGICAL REASONING
23.4 ONTOLOGICAL DATA DESIGN FOR NAVIGATING MEDIA RESOURCES OF MULTIPLE TYPES
23.5 EXAMPLE USE CASES: SOUND AUTHORING DATA WITH AUDIO MMIE
23.6 CLOSING STATEMENTS AND FUTURE DIRECTION
ACKNOWLEDGMENTS
CHAPTER 24 ANNOTATING SIGNIFICANT RELATIONS ON MULTIMEDIA WEB DOCUMENTS
24.1 INTRODUCTION
24.2 RELATED WORK
24.3 TWO SCENARIOS
24.4 MADCOW BASIC ARCHITECTURE
24.5 IMPLEMENTATION OF MULTISTRUCTURES
24.6 INTERACTIVE CREATION AND USAGE OF ANNOTATIONS ON MULTISTRUCTURES
24.7 CONCLUSIONS
ACKNOWLEDGMENTS
ABBREVIATIONS AND ACRONYMS
REFERENCES
INDEX
Press Operating Committee
Chair
James W. Cortada
IBM Institute for Business Value
Board Members
Richard E. (Dick) Fairley, Founder and Principal Associate, Software Engineering Management Associates (SEMA)
Cecilia Metra, Associate Professor of Electronics, University of Bologna
Linda Shafer, former Director, Software Quality Institute, The University of Texas at Austin
Evan Butterfield, Director of Products and Services
Kate Guillemette, Product Development Editor, CS Press
IEEE Computer Society Publications
The world-renowned IEEE Computer Society publishes, promotes, and distributes a wide variety of authoritative computer science and engineering texts. These books are available from most retail outlets. Visit the CS Store at http://computer.org/store for a list of products.
IEEE Computer Society / Wiley Partnership
The IEEE Computer Society and Wiley partnership allows the CS Press authored book program to produce a number of exciting new titles in areas of computer science, computing and networking with a special focus on software engineering. IEEE Computer Society members continue to receive a 15% discount on these titles when purchased through Wiley or at wiley.com/ieeecs.
To submit questions about the program or send proposals, please email [email protected] or write to Books, IEEE Computer Society, 10662 Los Vaqueros Circle, Los Alamitos, CA 90720-1314. Telephone +1-714-816-2169.
Additional information regarding the Computer Society authored book program can also be accessed from our web site at http://computer.org/cspress.
Copyright © 2012 by IEEE Computer Society. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400, fax 978-646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic format.
Library of Congress Cataloging-in-Publication Data:
Maybury, Mark T.
Multimedia information extraction : advances in video, audio, and imagery analysis for search, data mining, surveillance, and authoring / by Mark T. Maybury.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-118-11891-7 (hardback)
1. Data mining. 2. Metadata harvesting. 3. Computer files. I. Title.
QA76.9.D343M396 2012
006.3'12–dc23
2011037229
FOREWORD
I was delighted when I was asked to write a foreword for this book as, apart from the honor, it gives me the chance to stand back and think a bit more deeply about multimedia information extraction than I would normally do and also to get a sneak preview of the book. One of the first things I did when preparing to write this was to dig out a copy of one of Mark T. Maybury’s previous edited books, Intelligent Multimedia Information Retrieval from 1997.1 The bookshelves in my office don’t actually have many books anymore—a copy of Keith van Rijsbergen’s Information Retrieval from 1979 (well, he was my PhD supervisor!); Negroponte’s book Being Digital; several generations of TREC, SIGIR, and LNCS proceedings from various conferences; and some old database management books from when I taught that topic to undergraduates. Intelligent Multimedia Information Retrieval was there, though, and had survived the several culls that I had made to the bookshelves’ contents over the years, each time I’ve had to move office or felt claustrophobic and wanted to dump stuff out of the office. All that the modern professor, researcher, student, or interested reader might need these days is accessible at our fingertips anyway; and it says a great deal about Mark T. Maybury and his previous edited collection that it survived these culls; that can only be because it still has value to me. I would expect the same to be true for this book, Multimedia Information Extraction.
Finding that previous edited collection on my bookshelf was fortunate for me because it gave me the chance to reread the foreword that Karen Spärck Jones had written. In that foreword, she raised the age-old question of whether a picture was worth a thousand words or not. She concluded that the question doesn’t actually need answering anymore, because now you can have both. That conclusion was in the context of discussing the natural hierarchy of information types—multimedia types if you wish—and the challenge of having to look at many different kinds of information at once on your screen. Karen’s conclusion has grown to be even more true over the years, but I’ll bet that not even she could have foreseen exactly how true it would become today. The edited collection of chapters, published in 1997, still has many chapters that are relevant and good reading today, covering the various types of content-based information access we aspired to then, and, in the case of some of those media, the kind of access to which we still aspire. That collection helped to define the field of using intelligent, content-based techniques in multimedia information retrieval, and the collection as a whole has stood the test of time.
Over the years, content-based information access has changed, however; or rather, it has had to shift sideways in order to work around the challenges posed by analyzing and understanding information encoded in some types of media, notably visual media. Even in 1997, we had more or less solved the technical challenges of capturing, storing, transmitting, and rendering multimedia, specifically text, image, audio, and moving video; and seemingly the only major challenges remaining were multimedia analysis, so that we could achieve content-based access and navigation, and, of course, scaling it all up. Standards for encoding and transmission were in place, network infrastructure and bandwidth were improving, mobile access was becoming easy, and all we needed was a growing market of people to want the content and somebody to produce it. Well, we got both; but we didn’t realize that the two needs would be satisfied by the same source—the ordinary user. Users generating their own content introduced a flood of material; and professional content-generators, like broadcasters and musicians, responded by opening the doors to their own content so that within a short time, we have become overwhelmed by the sheer choice of multimedia material available to us.
Unfortunately, those of us who were predicting back in 1997 that content-based multimedia access would be based on the true content are still waiting for this to happen in the case of large-scale, generic, domain-independent applications. Content-based multimedia retrieval does work to some extent on smaller, personal, or domain-dependent collections, but not on the larger scale. Fully understanding media content to the level whereby the content we identify automatically in a video or image can be used directly for indexing has proven to be much more difficult than we anticipated for large-scale applications, like searching the Internet. For achieving multimedia information access, searching, summarizing, and linking, we now leverage more from the multimedia collateral—the metadata, user-assigned tags, user commentary, and reviews—than from the actual encoded content. YouTube videos, Flickr images, and iTunes music, like most large multimedia archives, are navigated more often based on what people say about a video, image, or song than what it actually contains. That means that we need to be clever about using this collateral information, like metadata, user tags, and commentaries. The challenges of intelligent multimedia information retrieval in 1997 have now grown into the challenges of multimedia information mining in 2012, developing and testing techniques to exploit the information associated with multimedia information to best effect. That is the subject of the present collection of articles—identifying and mining useful information from text, image, graphics, audio, and video, in applications as far apart as surveillance or broadcast TV.
In 1997, when the first of this series of books edited by Mark T. Maybury was published, I did not know him. I first encountered him in the early 2000s, and I remember my first interactions with him were in discussions about inviting a keynote speaker for a major conference I was involved in organizing. Mark suggested somebody named Tim Berners-Lee who was involved in starting some initiative he called the “semantic web,” in which he intended to put meaning representations behind the content in web pages. That was in 2000 and, as always, Mark had his finger on the pulse of what was happening and what was important in the broad information field. In the years that followed, we worked together on a number of program committees—SIGIR, RIAO, and others—and we were both involved in the development of LSCOM, the Large-Scale Concept Ontology for Multimedia for broadcast TV news, though his involvement was much greater than mine. In all the interactions we have had, Mark’s inputs have always shown an ability to recognize important things at the right time, and his place in the community of multimedia researchers has grown in importance as a result.
That brings us to this book. When Karen Spärck Jones wrote her foreword to Mark’s edited book in 1997 and alluded to pictures worth a thousand words, she may have foreseen how creating and consuming multimedia, as we do each day, would become easy and ingrained in our society. The availability, the near absence of technical problems, the volume of material, the ease of access to it, and the ease of creation and upload were perhaps predictable to some extent by visionaries. However, the way in which this media is now enriched as a result of its intertwining with social networks, blogging, tagging and folksonomies, user-generated content, and the wisdom of crowds—that was not predicted. It means that being able to mine information from multimedia, information culled from the raw content as well as the collateral or metadata information, is a big challenge.
This book is a timely addition to the literature on multimedia information mining, arriving at this precise time as we try to wrestle with the problems of leveraging the “collateral” and the metadata associated with multimedia content. The five sections, covering extraction from image, from video, from audio/graphics/behavior, the extraction of affect, and finally the annotation and authoring of multimedia content, collectively represent the leading edge of research work in this area. The more than 80 coauthors of the 24 chapters have come together to produce a collection that, like the previous volumes edited by Mark T. Maybury, will help to define the field.
I won’t be so bold, or foolhardy, as to predict what the multimedia field will be like in 10 or 15 years’ time, what the problems and challenges will be and what the achievements will have been between now and then. I won’t even guess what books might look like or whether we will still have bookshelves. I would expect, though, that like its predecessors, this volume will still be on my bookshelf in whatever form; and, for that, we have Mark T. Maybury to thank.
Thanks, Mark!
ALAN F. SMEATON
Note
1 Maybury, M.T., ed., Intelligent Multimedia Information Retrieval (AAAI Press, 1997).
PREFACE
This collection is an outgrowth of the Association for the Advancement of Artificial Intelligence’s (AAAI) Fall Symposium on Multimedia Information Extraction, organized by Mark T. Maybury (The MITRE Corporation) and Sharon Walter (Air Force Research Laboratory) and held at the Westin Arlington Gateway in Arlington, Virginia, November 7–9, 2008. The program committee included Kelcy Allwein, Elisabeth Andre, Thom Blum, Shih-Fu Chang, Bruce Croft, Alex Hauptmann, Andy Merlino, Ram Nevatia, Prem Natarajan, Kirby Plessas, David Palmer, Mubarak Shah, Rohini K. Srihari, Oliviero Stock, John Smith, and Rick Steinheiser. The symposium brought together scientists from the United States and Europe to report on recent advances in extracting information from growing personal, organizational, and global collections of audio, imagery, and video. Experts from industry, academia, government, and nonprofit organizations joined together with the objective of collaborating across the speech, language, image, and video processing communities to report advances and to chart future directions for multimedia information extraction theories and technologies.
The symposium included three invited speakers from government and academia. Dr. Nancy Chinchor of the Emerging Media Group in the Director of National Intelligence’s Open Source Center described open source collection and how exploitation of social, mobile, citizen, and virtual gaming media could provide early indicators of global events (e.g., increased sales of medicine can indicate a flu outbreak). Professor Ruzena Bajcsy (UC Berkeley) described understanding human gestures and body language using environmental and body sensors, enabling the transfer of body movement to robots or virtual choreography. Finally, John Garofolo (NIST) described multimodal metrology research and discussed challenges such as multimodal meeting diarization and affect/emotion recognition. Papers from the symposium were published as AAAI Press Technical Report FS-08-05 (Maybury and Walter 2008).
In this collection, extended versions of six selected papers from the symposium are augmented with more than twice as many new contributions. All submissions were critically peer reviewed, and those chosen were revised to ensure coherence with related chapters. The collection complements the preceding AAAI Press and MIT Press collections on Intelligent Multimedia Interfaces (1993), Intelligent Multimedia Information Retrieval (1997), Advances in Automatic Text Summarization (1999), New Directions in Question Answering (2004), as well as Readings in Intelligent User Interfaces (1998).
Multimedia Information Extraction serves multiple purposes. First, it aims to motivate and define the field of multimedia information extraction. Second, by providing a collection of some of the most innovative approaches and methods, it aims to become a standard reference text. Third, it aims to inspire new application areas, as well as to motivate continued research through the articulation of remaining gaps. The book can be used as a reference for students, researchers, and practitioners or as a collection of papers for use in undergraduate and graduate seminars.
To facilitate these multiple uses, Multimedia Information Extraction is organized into five sections, representing key areas of research and development:
Section 1: Image Extraction
Section 2: Video Extraction
Section 3: Audio, Graphics, and Behavior Extraction
Section 4: Affect Extraction in Audio and Imagery
Section 5: Multimedia Annotation and Authoring
The book begins with an introduction that defines key terminology, describes an integrated architecture for multimedia information extraction, and provides an overview of the collection. To facilitate research, the introduction includes a content index to augment the back-of-the-book index. To assist instruction, a mapping to core curricula is provided. A second chapter outlines the history, the current state of the art, and a community-created roadmap of future multimedia information extraction research. Each remaining section in the book is framed with an editorial introduction that summarizes and relates each of the chapters, places them in historical context, and identifies remaining challenges for future research in that particular area. References are provided in an integrated listing.
Taken as a whole, this book articulates a collective vision of the future of multimedia. We hope it will help promote further advances in multimedia information extraction, making it possible for all of us to benefit more effectively and efficiently from the rapidly growing collections of multimedia materials in our homes, schools, hospitals, and offices.
MARK T. MAYBURY
Cape Cod, Massachusetts
ACKNOWLEDGMENTS
I thank Jackie Hargest for her meticulous proofreading and Paula MacDonald for her indefatigable pursuit of key references. I also thank each of the workshop participants who launched this effort and each of the authors for their interest, energy, and excellence in peer review to create what we hope will become a valued collection.
Most importantly, I dedicate this collection to my inspiration, Michelle, not only for her continual encouragement and selfless support, but even more so for her creation of our most enduring multimedia legacies: Zach, Max, and Julia. May they learn to extract what is most meaningful in life.
MARK T. MAYBURY
Cape Cod, Massachusetts
CONTRIBUTORS
MATUSALA ADDISU, Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Roma, Italy 00198, [email protected]
GEETU AMBWANI, StreamSage/Comcast, 1110 Vermont Avenue NW, Washington, DC 20005, USA, [email protected]
DANILO AVOLA, Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Roma, Italy 00198, [email protected], [email protected]
AMIT BAGGA, StreamSage/Comcast, 1110 Vermont Avenue NW, Washington, DC 20005, USA, [email protected]
ERHAN BAKI ERMIS, Boston University, 8 Saint Mary’s Street, Boston, MA 02215, USA, [email protected]
ROBIN BARGAR, Dean, School of Media Arts, Columbia College of Chicago, 33 E. Congress, Chicago, IL 60606, [email protected]
KOBUS BARNARD, University of Arizona, Tucson, AZ 85721, USA, [email protected]
PAOLA BIANCHI, Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Roma, Italy 00198, [email protected]
ANDREW C. BLOSE, Kodak Research Laboratories, Eastman Kodak Company, Rochester, NY 14650, USA, [email protected]
PAOLO BOTTONI, Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Roma, Italy 00198, [email protected]
STANLEY M. BOYKIN, The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA, [email protected]
GASPARD BRETON, Orange Labs, 4 rue du Clos Courtel, 35510 Cesson-Sevigne, France, [email protected]
RICHARD BURNS, University of Delaware, Department of Computer and Information Sciences, Newark, DE 19716, USA, [email protected]
ALESSANDRO CAPPELLETTI, FBK-IRST, Via Sommarive, 18, 38123 Trento, Italy, [email protected]
SANDRA CARBERRY, University of Delaware, Department of Computer and Information Sciences, Newark, DE 19716, USA, [email protected]
CHING HAU CHAN, MIMOS Berhad, Technology Park Malaysia, 57000 Kuala Lumpur, Malaysia, [email protected]
DANIEL CHESTER, University of Delaware, Department of Computer and Information Sciences, Newark, DE 19716, USA, [email protected]
LESLIE CHIPMAN, StreamSage/Comcast, 1110 Vermont Avenue NW, Washington, DC 20005, USA, [email protected]
INSOOK CHOI, Emerging Media Program, Department of Entertainment Technology, New York City College of Technology of the City University of New York, 300 Jay Street, Brooklyn, NY 11201, USA, [email protected]
MADIRAKSHI DAS, Kodak Research Laboratories, Eastman Kodak Company, Rochester, NY 14650, USA, [email protected]
ANTHONY R. DAVIS, StreamSage/Comcast, 1110 Vermont Avenue NW, Washington, DC 20005, USA, [email protected]
SENIZ DEMIR, University of Delaware, Department of Computer and Information Sciences, Newark, DE 19716, USA, [email protected]
STEPHANIE ELZER, Millersville University, Department of Computer Science, Millersville, PA 17551, USA, [email protected]
FLORIAN EYBEN, Technische Universität München, Theresienstrasse 90, 80333 München, Germany, [email protected]
RYAN FARRELL, StreamSage/Comcast, 1110 Vermont Avenue NW, Washington, DC 20005, USA, [email protected]
DIETER W. FELLNER, Fraunhofer Austria Research GmbH, Geschäftsbereich Visual Computing, Inffeldgasse 16c, 8010 Graz, Austria; Fraunhofer IGD and GRIS, TU Darmstadt, Fraunhoferstrasse 5, D-64283 Darmstadt, Germany, [email protected]
RANDALL K. FISH, The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA, [email protected]
FRED J. GOODMAN, The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA, [email protected]
WARREN R. GREIFF, The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA, [email protected]
MARCO GUERINI, FBK-IRST, I-38050, Povo, Trento, Italy, [email protected]
ALEXANDER G. HAUPTMANN, Carnegie Mellon University, School of Computer Science, 5000 Forbes Ave, Pittsburgh, PA 15213, USA, [email protected]
SVEN HAVEMANN, Fraunhofer Austria Research GmbH, Geschäftsbereich Visual Computing, Inffeldgasse 16c, 8010 Graz, Austria, [email protected]
DAVID HOUGHTON, StreamSage/Comcast, 1110 Vermont Avenue NW, Washington, DC 20005, USA, [email protected]
QIAN HU, The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA, [email protected]
PIERRE-MARC JODOIN, Université de Sherbrooke, 2500 boulevard de l’Université, Sherbrooke, QC J1K2R1, Canada, [email protected]
OLIVER JOJIC, StreamSage/Comcast, 1110 Vermont Avenue NW, Washington, DC 20005, USA, [email protected]
GARETH J. F. JONES, Centre for Digital Video Processing, School of Computing, Dublin City University, Dublin 9, Ireland, [email protected]
STEPHEN R. JONES, The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA, [email protected]
VAIVA KALNIKAITE, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK, [email protected]
HIDETOSHI KAWAKUBO, The University of Electro-Communications, Tokyo, 1-5-1 Chofugaoka, Chofu-shi, Tokyo, 182-8585, Japan, [email protected]
MICHAEL KIPP, DFKI, Campus D3.2, Saarbrücken, Germany, [email protected]
ANDREJ KOŠIR, University of Ljubljana, Faculty of Electrical Engineering, Tržaška 25, 1000 Ljubljana, Slovenia, [email protected]
BRUNO LEPRI, FBK-IRST, Via Sommarive, 18, 38123 Trento, Italy, [email protected]
STEFANO LEVIALDI, Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Roma, Italy 00198, [email protected]
WEI-HAO LIN, Carnegie Mellon University, School of Computer Science, 5000 Forbes Ave, Pittsburgh, PA 15213, USA, [email protected]
ALEXANDER C. LOUI, Kodak Research Laboratories, Eastman Kodak Company, Rochester, NY 14650, USA, [email protected]
EHRY MACROSTIE, Raytheon BBN Technologies, 10 Moulton Street, Cambridge, MA 02138, USA, [email protected]
NADIA MANA, FBK-IRST, Via Sommarive, 18, 38123 Trento, Italy, [email protected]
MARK T. MAYBURY, The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA, [email protected]
STEPHEN R. MOORE, The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA, [email protected]
PREM NATARAJAN, Raytheon BBN Technologies, 10 Moulton Street, Cambridge, MA 02138, USA, [email protected]
JAN NEUMANN, StreamSage/Comcast, 1110 Vermont Avenue NW, Washington, DC 20005, USA, [email protected]
ADRIAN NOVISCHI, Janya Inc., 1408 Sweet Home Road, Amherst, NY 14228, USA, [email protected]
DAVID D. PALMER, Autonomy Virage Advanced Technology Group, 1 Memorial Drive, Cambridge, MA 02142, USA, [email protected]
EMANUELE PANIZZI, Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Roma, Italy 00198, [email protected]
FABIO PIANESI, FBK-IRST, Via Sommarive, 18, 38123 Trento, Italy, [email protected]
ROHIT PRASAD, Raytheon BBN Technologies, 10 Moulton Street, Cambridge, MA 02138, USA, [email protected]
MARC B. REICHMAN, Autonomy Virage Advanced Technology Group, 1 Memorial Drive, Cambridge, MA 02142, USA, [email protected]
GERHARD RIGOLL, Technische Universität München, Theresienstrasse 90, 80333 München, Germany, [email protected]
ROBERT RUBINOFF, StreamSage/Comcast, 1110 Vermont Avenue NW, Washington, DC 20005, USA, [email protected]
VENKATESH SALIGRAMA, Boston University, 8 Saint Mary’s Street, Boston, MA 02215, USA, [email protected]
JOSE SAN PEDRO, Telefonica Research, Via Augusta 177, 08021 Barcelona, Spain, [email protected]
BJÖRN SCHULLER, Technische Universität München, Theresienstrasse 90, 80333 München, Germany, [email protected]
RENAUD SEGUIER, Supelec, La Boulaie, 35510 Cesson-Sevigne, France, [email protected]
BAGESHREE SHEVADE, StreamSage/Comcast, 1110 Vermont Avenue NW, Washington, DC 20005, USA, [email protected]
STEFAN SIERSDORFER, L3S Research Centre, Appelstr. 9a, 30167 Hannover, Germany, [email protected]
ALAN SMEATON, CLARITY: Centre for Sensor Web Technologies, Dublin City University, Glasnevin, Dublin 9, Ireland, [email protected]
ROHINI K. SRIHARI, Dept. of Computer Science & Engineering, State University of New York at Buffalo, 338 Davis Hall, Buffalo, NY, USA, [email protected]
OLIVIERO STOCK, FBK-IRST, I-38050, Povo, Trento, Italy, [email protected]
NICOLAS STOIBER, Orange Labs, 4 rue du Clos Courtel, 35510 Cesson-Sevigne, France, [email protected]
CARLO STRAPPARAVA, FBK-IRST, I-38050, Povo, Trento, Italy, [email protected]
JURIJ TASIČ, University of Ljubljana, Faculty of Electrical Engineering, Tržaška 25, 1000 Ljubljana, Slovenia, [email protected]
MARKO TKALČIČ, University of Ljubljana, Faculty of Electrical Engineering, Tržaška 25, 1000 Ljubljana, Slovenia, [email protected]
EVELYNE TZOUKERMANN, The MITRE Corporation, 7525 Colshire Drive, McLean, VA 22102, USA, [email protected]
TORSTEN ULLRICH, Fraunhofer Austria Research GmbH, Geschäftsbereich Visual Computing, Inffeldgasse 16c, 8010 Graz, Austria, [email protected]
JONATHAN WATSON, Raytheon BBN Technologies, 10 Moulton Street, Cambridge, MA 02138, USA, [email protected]
NOAH WHITE, Autonomy Virage Advanced Technology Group, 1 Memorial Drive, Cambridge, MA 02142, USA, [email protected]
STEVE WHITTAKER, University of California Santa Cruz, 1156 High Street, Santa Cruz, CA 95064, USA, [email protected]
MARTIN WÖLLMER, Technische Universität München, Theresienstrasse 90, 80333 München, Germany, [email protected]
PENG WU, University of Delaware, Department of Computer and Information Sciences, Newark, DE 19716, USA, [email protected]
KEIJI YANAI, The University of Electro-Communications, Tokyo, 1-5-1 Chofugaoka, Chofu-shi, Tokyo, 182-8585, Japan, [email protected]
MASSIMO ZANCANARO, FBK-IRST, Via Sommarive, 18, 38123 Trento, Italy, [email protected]
HONGZHONG ZHOU, StreamSage/Comcast, 1110 Vermont Avenue NW, Washington, DC 20005, USA, [email protected]
CHAPTER 1
INTRODUCTION
MARK T. MAYBURY
Our world has become massively multimedia. In addition to rapidly growing personal and industrial collections of music, photography, and video, media sharing sites have exploded in recent years. The growth of social media sites, used not only for social networking but also for information sharing, has further fueled the broad and deep availability of media sources. Even collections once accessible only to a few privileged users are increasingly becoming widely available: special industrial collections once limited to proprietary access (e.g., Time-Life images), precious books and esoteric scientific materials once restricted to special collections, massive scientific collections (e.g., genetics, astronomy, and medical), and sensor data (traffic, meteorology, and space imaging).
Rapid growth of global and mobile telecommunications and the Web has accelerated both the growth of and access to media. As of 2012, over one-third of the world’s population (2.3 billion users) is online, although some regions of the world (e.g., Africa) have less than 15% of their potential users online. The World Wide Web runs over the Internet and provides easy hyperlinked access to pages of text, images, and video—in fact, to over 800 million websites, a majority of which are commercial (.com). The most visited site in the world, Google (Yahoo! is second), performs hundreds of millions of Internet searches on millions of servers that process many petabytes of user-generated content daily. Google has discovered over one trillion unique URLs. Wikis, blogs, Twitter, and other social media (e.g., MySpace and LinkedIn) have grown exponentially. Professional imagery sharing on Flickr now encompasses over 6 billion images. In social networking, more than 6 billion photos and more than 12 million videos are uploaded each month to Facebook by its over 800 million users. In audio, IP telephony, podcasting and broadcasting, and digital music have similarly exploded. For example, over 16 billion songs and over 25 billion apps have been downloaded from iTunes alone since its 2003 launch, with as many as 20 million songs downloaded in a single day. In a simple form of extraction, loudness and frequency spectrum analysis are used to generate music visualizations.
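That last form of extraction is simple enough to sketch concretely. Below is a minimal illustration in Python (assuming NumPy and a mono PCM signal; the frame length, hop size, and demo tone are illustrative choices, not drawn from the text): it computes per-frame RMS energy as a loudness proxy plus an FFT magnitude spectrum, the two raw ingredients a music visualizer typically maps to motion and color.

```python
# Minimal sketch of loudness and frequency-spectrum extraction from audio,
# the raw ingredients of a music visualization. Assumes a mono PCM signal
# in a NumPy array; frame and hop sizes are illustrative choices.
import numpy as np

def loudness_and_spectrum(signal, sample_rate, frame_len=2048, hop=512):
    """Yield (time_sec, rms_loudness, magnitude_spectrum) per frame."""
    window = np.hanning(frame_len)
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len] * window
        rms = np.sqrt(np.mean(frame ** 2))        # loudness proxy
        spectrum = np.abs(np.fft.rfft(frame))     # magnitude spectrum
        yield start / sample_rate, rms, spectrum

# Demo on a synthetic 440 Hz tone whose amplitude ramps up over one second.
sr = 44100
t = np.linspace(0, 1.0, sr, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t) * t
for time_sec, rms, spec in list(loudness_and_spectrum(tone, sr))[:3]:
    peak_hz = np.argmax(spec) * sr / 2048         # frequency of loudest bin
    print(f"t={time_sec:.2f}s loudness={rms:.3f} peak~{peak_hz:.0f} Hz")
```

A visualizer would then map the RMS value to brightness or bar height and the spectrum bins to position or hue, frame by frame.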
Parallel to the Internet, television consumption in developed countries is impressive. According to the A.C. Nielsen Co., the average American watches more than 4 hours of TV each day. This corresponds to 28 hours each week, or 2 months of nonstop TV watching per year. In an average 65-year lifespan, a person will have spent 9 years watching television. Online video access has also skyrocketed. In April 2009, over 150 million U.S. viewers watched an average of 111 videos each, spending on average about six and a half hours watching online video. Nearly 17 billion online videos were viewed in June 2009, 40 percent of them at YouTube (107 million viewers, averaging 3–5 minutes per video), a site to which approximately 20 hours of video were uploaded every minute, twice the rate of the previous year. By March 2012, this had grown to 48 hours of video uploaded every minute, with over 3 billion views per day. Network traffic involving YouTube accounts for 20% of web traffic and 10% of all Internet traffic. With billions of mobile device subscriptions, and with mobile devices outnumbering PCs five to one, access will increasingly be mobile. Furthermore, in the United States, four billion hours of surveillance video are recorded every week. Even if one person could monitor 10 cameras simultaneously for 40 hours a week, monitoring all the footage would require 10 million surveillance staff, roughly 3.3% of the U.S. population. As collections of personal media, web media, cultural heritage content, multimedia news, meetings, and others grow from gigabytes to terabytes to petabytes, the need for accurate, rapid, cross-media extraction to serve a variety of user retrieval and reuse needs will only increase. This massive volume of media is driving a need for more automated processing to support a range of educational, entertainment, medical, industrial, law enforcement, defense, historical, environmental, economic, political, and social needs.
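The surveillance staffing figure follows from simple division; a quick back-of-envelope check in Python (the population denominator of roughly 305 million is an assumption consistent with the quoted 3.3%):

```python
# Back-of-envelope check of the surveillance staffing arithmetic above.
hours_recorded_per_week = 4e9      # four billion hours of video per week
cameras_per_person = 10            # one monitor watching 10 feeds at once
hours_worked_per_week = 40         # a standard 40-hour work week

camera_hours_per_person = cameras_per_person * hours_worked_per_week  # 400
staff_needed = hours_recorded_per_week / camera_hours_per_person
us_population = 305e6              # assumed ~305M, consistent with the text

print(f"staff needed: {staff_needed:,.0f}")                        # 10,000,000
print(f"share of population: {staff_needed / us_population:.1%}")  # ~3.3%
```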
