Crowdsourcing for Speech Processing

Maxine Eskenazi

Description

Provides an insightful and practical introduction to crowdsourcing as a means of rapidly processing speech data. It is intended both for those who want to get started in the domain and learn how to set up a task, what interfaces are available, and how to assess the work, and for those who have already used crowdsourcing and want to create better tasks and obtain better assessments of the work of the crowd. The book includes screenshots of good and poor interfaces and case studies of speech processing tasks that walk through the task creation process, the options in the interface, the choice of medium (MTurk or other), and the reasoning behind each choice.

* Provides an insightful and practical introduction to crowdsourcing as a means of rapidly processing speech data.
* Addresses important aspects of this new technique that should be mastered before attempting a crowdsourcing application.
* Offers speech researchers the hope that they can spend much less time dealing with the data gathering and annotation bottleneck, leaving them free to focus on the scientific issues.
* Readers will directly benefit from the book's successful examples of how crowdsourcing was implemented for speech processing, discussions of interface and processing choices that worked and choices that did not, and guidelines on how to play and record speech over the Internet, how to design tasks, and how to assess workers.

Essential reading for researchers and practitioners in speech research groups involved in speech processing.

Contents

Cover

Title Page

Copyright

List of Contributors

Preface

Chapter 1: An Overview

1.1 Origins of Crowdsourcing

1.2 Operational Definition of Crowdsourcing

1.3 Functional Definition of Crowdsourcing

1.4 Some Issues

1.5 Some Terminology

1.6 Acknowledgments

References

Chapter 2: The Basics

2.1 An Overview of the Literature on Crowdsourcing for Speech Processing

2.2 Alternative Solutions

2.3 Some Ready-Made Platforms for Crowdsourcing

2.4 Making Task Creation Easier

2.5 Getting Down to Brass Tacks

2.6 Quality Control

2.7 Judging the Quality of the Literature

2.8 Some Quick Tips

2.9 Acknowledgments

References

Further Reading

Chapter 3: Collecting Speech from Crowds

3.1 A Short History of Speech Collection

3.2 Technology for Web-Based Audio Collection

3.3 Example: WAMI Recorder

3.4 Example: The WAMI Server

3.5 Example: Speech Collection on Amazon Mechanical Turk

3.6 Using the Platform Purely for Payment

3.7 Advanced Methods of Crowdsourced Audio Collection

3.8 Summary

3.9 Acknowledgments

References

Chapter 4: Crowdsourcing for Speech Transcription

4.1 Introduction

4.2 Transcribing Speech

4.3 Preparing the Data

4.4 Setting Up the Task

4.5 Submitting the Open Call

4.6 Quality Control

4.7 Conclusion

4.8 Acknowledgments

References

Chapter 5: How to Control and Utilize Crowd-Collected Speech

5.1 Read Speech

5.2 Multimodal Dialog Interactions

5.3 Games for Speech Collection

5.4 Quizlet

5.5 Voice Race

5.6 Voice Scatter

5.7 Summary

5.8 Acknowledgments

References

Chapter 6: Crowdsourcing in Speech Perception

6.1 Introduction

6.2 Previous Use of Crowdsourcing in Speech and Hearing

6.3 Challenges

6.4 Tasks

6.5 BIGLISTEN: A Case Study in the Use of Crowdsourcing to Identify Words in Noise

6.6 Issues for Further Exploration

6.7 Conclusions

References

Chapter 7: Crowdsourced Assessment of Speech Synthesis

7.1 Introduction

7.2 Human Assessment of TTS

7.3 Crowdsourcing for TTS: What Worked and What Did Not

7.4 Related Work: Detecting and Preventing Spamming

7.5 Our Experiences: Detecting and Preventing Spamming

7.6 Conclusions and Discussion

References

Chapter 8: Crowdsourcing for Spoken Dialog System Evaluation

8.1 Introduction

8.2 Prior Work on Crowdsourcing: Dialog and Speech Assessment

8.3 Prior Work in SDS Evaluation

8.4 Experimental Corpus and Automatic Dialog Classification

8.5 Collecting User Judgments on Spoken Dialogs with Crowdsourcing

8.6 Collected Data and Analysis

8.7 Conclusions and Future Work

8.8 Acknowledgments

References

Chapter 9: Interfaces for Crowdsourcing Platforms

9.1 Introduction

9.2 Technology

9.3 Crowdsourcing Platforms

9.4 Interfaces to Crowdsourcing Platforms

9.5 Summary

References

Chapter 10: Crowdsourcing for Industrial Spoken Dialog Systems

10.1 Introduction

10.2 Architecture

10.3 Transcription

10.4 Semantic Annotation

10.5 Subjective Evaluation of Spoken Dialog Systems

10.6 Conclusion

References

Chapter 11: Economic and Ethical Background of Crowdsourcing for Speech

11.1 Introduction

11.2 The Crowdsourcing Fauna

11.3 Economic and Ethical Issues

11.4 Under-Resourced Languages: A Case Study

11.5 Toward Ethically Produced Language Resources

11.6 Conclusion

11.7 Disclaimer

References

Index

This edition first published 2013 © 2013 John Wiley & Sons, Ltd

Registered office: John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data

Eskénazi, Maxine.
Crowdsourcing for speech processing : applications to data collection, transcription, and assessment / Maxine Eskénazi, Gina-Anne Levow, Helen Meng, Gabriel Parent, David Suendermann.
pages cm
Includes bibliographical references and index.
ISBN 978-1-118-35869-6 (hardback : alk. paper) – ISBN 978-1-118-54127-2 (ebook/epdf) – ISBN 978-1-118-54125-8 (epub) – ISBN 978-1-118-54126-5 (mobi) – ISBN 978-1-118-54124-1
1. Speech processing systems–Research. 2. Human computation. 3. Data mining. I. Title. II. Title: Crowd sourcing for speech processing.
TK7882.S65E85 2013
006.4′54–dc23
2012036598

A catalogue record for this book is available from the British Library.

ISBN: 978-1-118-35869-6

List of Contributors

Gilles Adda, LIMSI-CNRS, France

Jon Barker, University of Sheffield, UK

Laurent Besacier, LIG-CNRS, France

Sabine Buchholz, SynapseWork Ltd, UK

Martin Cooke, Ikerbasque, Spain; University of the Basque Country, Spain

Christoph Draxler, Ludwig-Maximilian University, Germany

Maxine Eskénazi, Carnegie Mellon University, USA

Hadrien Gelas, LIG-CNRS, France; DDL-CNRS, France

Javier Latorre, Toshiba Research Europe Ltd, UK

Gina-Anne Levow, University of Washington, USA

Maria Luisa Garcia Lecumberri, University of the Basque Country, Spain

Joseph J. Mariani, LIMSI-CNRS, France; IMMI-CNRS, France

Ian McGraw, Massachusetts Institute of Technology, USA

Helen Meng, The Chinese University of Hong Kong, China

Gabriel Parent, Carnegie Mellon University, USA

Roberto Pieraccini, ICSI, USA

Joseph Polifroni, Quanta Research, USA

David Suendermann, Synchronoss, USA; DHBW Stuttgart, Germany; ICSI, USA

Kayoko Yanagisawa, Toshiba Research Europe Ltd, UK

Zhaojun Yang, University of Southern California, USA

Preface

This book came about as a result of the standing-room-only special session on crowdsourcing for speech processing at Interspeech 2011. There has been a great amount of interest in this new technique as a means to solve some persistent issues. Some researchers dived in headfirst and have been using crowdsourcing for a few years now. Others waited to see if it was reliable, and yet others waited for some service to exist in their country. The first results are very encouraging: crowdsourcing can be a solution that approaches expert results. However, it also comes with warnings: the incoming data must go through quality assessment.

This book is a hands-on, how-to manual that is directed at several groups of readers:

Experienced users: Those who have already used crowdsourcing for speech processing should find a good set of references to the literature as well as some novel approaches that they may not be familiar with.

Speech processing users who have not yet used crowdsourcing: The information in this book should help you get up to speed rapidly and avoid reinventing the wheel for common interface and assessment issues.

Users who are not speech processing experts but who also need to use crowdsourcing for their speech data: This book should also help you get started, since you will have many of the same issues in dealing with your data.

We start the book with an overview of the principles of crowdsourcing. This is followed by some basic concepts and an overview of research in the area. The following chapters in the book cover most of the present types of speech processing. Chapter 3 covers the acquisition of speech. Chapter 4 covers speech labeling. Chapter 5 covers the variability of crowd speech and how to acquire and label speech in one effort. Chapter 6 explains how to run perception experiments using crowdsourcing. Chapter 7 explains how to use crowdsourcing for speech synthesis. Chapter 8 describes how to use crowdsourcing for assessment of spoken dialog systems. Chapter 9 covers the variety of platforms that are used for crowdsourcing and how they work. Chapter 10 covers industrial applications of crowdsourcing for speech processing. Finally, Chapter 11 covers the legal and ethical issues surrounding the use of crowdsourcing.

We express our wholehearted gratitude to all the contributing authors for their hard work and extremely timely delivery. Their extraordinary support helped us meet the unprecedented deadlines from proposal to production of not much more than a year. Also, the continuous guidance by Wiley's editorial team—Alex King, Liz Wingett, and Richard Davies—was essential for the success of this project.

Maxine Eskénazi, Gina-Anne Levow, Helen Meng, Gabriel Parent and David Suendermann

1

An Overview

Maxine Eskénazi

Carnegie Mellon University, USA

In the early days of automatic speech processing, researchers dealt with relatively small sets of speech data. They used them mainly to build small automatic systems and to test the systems’ validity. The data was often obtained by recording speakers in an anechoic chamber on magnetic tape. It was manually sent to a freestanding spectrogram machine in the 1960s and 1970s or input to a computer in the late 1970s and thereafter. Getting speakers (other than colleagues and students) took time, and labeling the speech that was acquired took much more time. Both endeavors were very costly. These difficulties entered into consideration every time a researcher planned a project, often imposing limitations on the amount of data collected and the scientific goals.

As time went on, one factor that dramatically increased the need for more data was the success of statistically based methods for automatic speech processing. The models for these systems, which quickly became ubiquitous, needed large amounts of data. The expression “more data is better data” was born. As automatic speech processing researchers switched from one application, like Broadcast News, to another, like Communicator, they found that the data from the former application was not very useful for the new one. As data specific to the new application was collected, processed, and fed into speech systems, results improved.

At the same time, other speech research publications (speech synthesis, spoken dialog systems, perception, etc.) also included some assessment. This required increasing numbers of speakers, callers, and judges, and thus a significant investment in data resources. This investment involved researcher time as they found and recorded subjects, as they trained transcribers to write down exactly what had been said, and as they found subjects to try out the resulting systems. Besides the time of the researcher, the investment also included the payment of the speakers and the transcribers, and sometimes a company was engaged to either recruit speakers or to manage transcription, thus adding to the costs.

As Google became a major search engine of choice, it gathered a very large amount of data, larger than any that had ever been used before. Google researchers produced conference papers that demonstrated to their colleagues that some previously unsolved issues in natural language processing became surmountable just by using several orders of magnitude more data (Brants et al. 2007). The field became ripe for a solution that would provide more processed data at significantly lower cost. The present estimate of the cost of transcribing 1 hour of speech data by an expert (for ASR training) is 6 hours of transcription time for each actual hour of speech that is processed, at a cost of $90–$150 per hour (Williams et al. 2011).

At the same time, linguists and sociolinguists, freed from the use of magnetic tape and the onerous postprocessing that accompanied it, found that recording speech directly on computers enabled them to rapidly obtain large samples of the speech that they wanted to study. Many speakers with more diverse backgrounds could be recorded. Several groups of speakers with varying characteristics could be recorded instead of just one. However, as the need for more speakers and more transcriptions of their speech increased, these communities ran up against the same obstacles that the automatic speech processing community had encountered.

What seems like the answer to these needs has come in the form of crowdsourcing. This technique offers the promise of dramatically lowering the cost of collecting and annotating speech data. Some of the automatic speech processing community has quickly embraced crowdsourcing. This chapter and the next will give a short history and description of crowdsourcing, some basic guidelines, and then review the uses of crowdsourcing for speech that have been published in the past few years.

1.1 Origins of Crowdsourcing

What may be one of the earliest examples of the use of the crowd is the open call that the Oxford English Dictionary (OED) made to the community in the 1800s for volunteers to index all of the words in the English language and to find example quotations for each of the uses of each word (Wikipedia 2012).

More recently, James Surowiecki’s (2004) book, The Wisdom of Crowds, gives an explanation of the power of the wisdom of the crowd. It maintains that a diverse collection of opinions from people who are making independent decisions can produce some types of decisions better than obtaining them from experts. Surowiecki sees three advantages to what he terms disorganized decisions: cognition (thinking and information processing), coordination (optimization of actions), and cooperation (forming networks of trust with no central control).

A good example of cooperation in a disorganized decision is the US Defense Advanced Research Projects Agency (DARPA) experiment in crowdsourcing to mark the 40th anniversary of the Internet. The goal was to locate 10 balloon markers that had been placed in a variety of locations across the United States. Teams were formed, each vying to be the first to find all 10 markers. This required collaborative efforts with networks of informers in many locations across the country. The team from MIT had the shortest time (under 9 hours). Its groups, comprising friends and friends of friends, signed up to help locate the balloons. This underlines the observation that, in a crowd situation, where each person is independent and fairly anonymous, an individual will give their knowledge and opinions more freely. The success of the endeavor centered on this generous participation. Indeed, authors of crowdsourcing tasks who ask the members of their crowd if they have suggestions on how to improve a task (without giving them additional remuneration) often find that some of the crowd will take the time to make very insightful and helpful suggestions.

1.2 Operational Definition of Crowdsourcing

The operational basis of crowdsourcing rests on the idea that a task is to be done, there is a means to attract many nonexperts to accomplish this task, and that some open call has gone out to advertise the task to the nonexperts (Wikipedia 2012). The presence of the Internet and cellphones facilitates not only the open call for nonexperts but also the presentation of the task, its accomplishment, and the accumulation of the nonexperts’ opinions. The nonexperts also possess some relevant knowledge, be it only that they are native speakers of a given language. From these assumptions, it is believed that the aggregate opinion of many nonexperts will approach the quality of the opinion of an expert. It is also believed that the use of nonexperts in this manner will be less onerous and more rapid than the use of experts. Given this economy of means, it is understandable that the speech and language processing communities have seen crowdsourcing as a possible solution to their large data dilemma.

To illustrate the operational aspects of crowdsourcing, consider a task that comes up at barbeques and other social events. A total of 672 jellybeans have been put into a clear jar. Each person attending the barbeque is asked to estimate how many jellybeans are in the jar. They are aware that there is something that the organizer of the barbeque wants them to do (the open call for nonexperts). They are also aware that they are to provide an estimate of the number of jellybeans in the jar (the task), and they know how to count jellybeans or make estimates (have some expertise). They put their answers on a piece of paper and put that into a box. The organizer looks at all of the tickets in the box and finds answers like 300, 575, 807, 653, 678, 599, and 775. The aggregate answer, such as the average (626) or the median (653), is very close to the real number of jellybeans in the jar. Thus, the conditions that characterize crowdsourcing are:

A task.
An open call.
Attracting many nonexperts.

So we will define a crowd as a group of nonexperts who have answered an open call to perform a given task.
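To make the aggregation step in the jellybean example concrete, here is a minimal Python sketch. The estimates are the hypothetical ones listed above; whether to use the mean or the median as the aggregate is a choice left to the requester.

```python
# Minimal sketch: aggregating independent crowd estimates (jellybean example).
# The estimates are the hypothetical ones from the text; the true count is 672.
from statistics import mean, median

estimates = [300, 575, 807, 653, 678, 599, 775]

print(f"mean   = {mean(estimates):.0f}")    # about 627, close to the true count of 672
print(f"median = {median(estimates):.0f}")  # 653, closer still
```

When a few estimates are wildly off (or maliciously entered), the median is usually the more robust of the two aggregates.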

1.3 Functional Definition of Crowdsourcing

A functional view of crowdsourcing, from Surowiecki, defines four characteristics of the wise crowd. First, the members of any crowd have a diversity of opinions. The opinions may be only slightly different from one another, and some may be correct while others are wrong. Second, each member of the crowd has an opinion that is independent of all of the other members of the crowd. No member’s opinion is influenced by that of any other member. Third, information that the crowd may have is decentralized. Everyone has some local information, but no one in the crowd has access to all of the information that may be pertinent to the task. Finally, the opinions of the members of the crowd can be merged to form an aggregate, one collaborative solution. To illustrate this, here is an example where the crowd has the task of translating some text. If we have the following text in French,

Je pense qu’il est temps de partir. On prendra congé de ma mère et de ma sœur et on se mettra en route au plus tard à neuf heures.

we can ask a crowd that speaks both English and French for its translation. Some of its members may offer these four solutions:

S1:   I think it’s time to leave. We will say goodbye to my mother and my sister and get going at 9 a.m. at the latest.
S2:   The time has come to leave. Let’s say goodbye to my mother and my sister and be on our way by 9 a.m.
S3:   We need to go. We’ll say goodbye to my mother and my sister and leave by 9 a.m.
S4:   Let’s go. Take leave of my mother and my sister and be on our way by 9 a.m.

The four aspects of the functional nature of crowdsourcing are illustrated here. We can see that the four solutions offered by members of the crowd (S1–S4) reflect diverse opinions on exactly what the right translation is. We can also imagine that these opinions have been arrived at independently of one another. Each member of this crowd possesses some individual pieces of information that they use when forming their opinion. S1, for example, may reflect the idea that “il est temps de partir” should be translated literally as “it’s time to leave,” while S4 may reflect a broader reading, which, in this context, results in the expression “let’s go.” Finally, we can merge these opinions to form one solution by, for example, asking the members of another crowd to vote on which one they like the best, by choosing the one that is most frequently produced, or by using a string-merging algorithm such as that of Banerjee and Lavie (2005). There has also been work (Kittur et al. 2011; CastingWords 2012) on having the crowd collaborate to make the translation evolve into something on which they can all agree.
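One simple way to form the aggregate solution described above is to let a second crowd vote on the candidate translations and keep the most frequent choice. The following is a minimal sketch under that assumption; the votes are invented for illustration and refer to the S1–S4 labels above.

```python
# Minimal sketch: merging crowd opinions by plurality vote over candidate translations.
# The votes below are hypothetical choices made by a second, verifying crowd.
from collections import Counter

votes = ["S1", "S2", "S1", "S3", "S1", "S2", "S1"]

winner, count = Counter(votes).most_common(1)[0]
print(f"Aggregate choice: {winner} ({count} of {len(votes)} votes)")
```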

Thus, according to Surowiecki, the four characteristics of a wise crowd are:

Has a diversity of opinions.
Each individual works independently of the others.
The information is decentralized.
An aggregate solution can be formed.

1.4 Some Issues

While crowdsourcing seems to be a remarkable solution to the problems plaguing the speech and linguistics communities, it must be approached with care since misleading or incorrect results can also easily be obtained from crowdsourcing. Several issues should be kept in mind to prevent this.

The first issue concerns the amount of information given to the crowd. The crowd should be given just enough information to be able to complete the task, but not enough to influence their decisions. For the translation task above, for example, although the creator of the task could ask for a translation that is as literal and close to the original text as possible, this additional information may make the final result less desirable. The workers’ opinions should not be influenced by information from the task creator.

The second issue is having a crowd that is too homogeneous. A crowd that is too homogeneous will not give a superior result. Oinas-Kukkonen (2008) has found that the best decisions come when there is disagreement and contest within the crowd. Note that giving too much information is one way that a crowd may be rendered too homogeneous.

Oinas-Kukkonen mentions another issue that contributes to homogeneity: too much communication. When members of the crowd have less anonymity and the creator of the task has more communication with the crowd, too much information may gradually be transmitted.

Linked to the issue of communication is that of imitation. If participants are given access to the opinions of other workers, they may be influenced by them and, consciously or not, imitate what they have seen (thus leading us back to a more homogeneous crowd). While some tasks are given to one crowd and then the result is sent to another crowd for verification, some mechanism should be in place to avoid having this influence.

A fifth issue that should be addressed concerns the prerequisites of the crowd. We have seen that the members of the crowd are presumed to have some local knowledge. It is not evident that everyone who responds to a call has that knowledge. For example, in the above translation task, the creator of the task will assume that the participants speak both French and English. Since there may be some impostors in the crowd, it is wise to give some sort of pretest. We will discuss this further in Chapter 2. Pretests of performance on the specific task at hand are a reasonable way to winnow out those who may not be able to perform the task. However, creation and checking of the pretest is onerous in itself and it may be more time- and cost-saving to let all who respond complete the task and then eliminate outlier answers later on.
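If no pretest is used, the fallback just described is to let everyone work and filter the results afterwards. Below is a minimal sketch of one such filter, assuming each unit task is answered by several workers and a worker is kept only if they usually agree with the majority answer; the data and the 0.5 threshold are illustrative assumptions, not taken from the text.

```python
# Minimal sketch: dropping outlier workers by their agreement with the majority answer.
# The answers and the 0.5 threshold are illustrative assumptions.
from collections import Counter

# answers[unit_task][worker] = that worker's answer for the unit task
answers = {
    "utt1": {"w1": "yes", "w2": "yes", "w3": "no"},
    "utt2": {"w1": "no",  "w2": "no",  "w3": "yes"},
    "utt3": {"w1": "yes", "w2": "yes", "w3": "yes"},
}

def agreement_with_majority(worker, answers):
    """Fraction of unit tasks on which this worker matches the most common answer."""
    hits, total = 0, 0
    for by_worker in answers.values():
        if worker not in by_worker:
            continue
        majority = Counter(by_worker.values()).most_common(1)[0][0]
        hits += by_worker[worker] == majority
        total += 1
    return hits / total if total else 0.0

workers = {w for by_worker in answers.values() for w in by_worker}
kept = sorted(w for w in workers if agreement_with_majority(w, answers) >= 0.5)
print(kept)  # ['w1', 'w2'] -- w3 disagrees with the majority too often and is dropped
```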

Another issue is motivation. Why should someone participate in a task? Are they learning something, playing a game, being remunerated? There should be some reason for an individual not only to sign up to work on a task but also to want to continue to work on it. This is linked to a seventh issue, keeping a reasonable expectation of the work effort. If members of the crowd are led to believe that there is less work than what is actually expected, especially in the case of remunerated work, they will quit the task and recommend to others (via worker forums and blogs) that they also avoid it.

Finally, as we will see in several chapters in this book, it is absolutely necessary to carry out some form of quality control. This control can come in many forms, and it is meant to weed out the work of poor workers (those who have good intentions but furnish work that is not of good quality) and of malicious workers (those who enter answers at random, as well as automated bots).
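One widely used form of such control, offered here only as an illustrative sketch rather than a prescription, is to seed the task with "gold" items whose answers are known in advance and to reject workers whose accuracy on them falls below a threshold. The worker IDs, labels, and the 0.7 cutoff below are hypothetical.

```python
# Minimal sketch: quality control with gold items whose answers are known in advance.
# Worker IDs, labels, and the 0.7 acceptance threshold are hypothetical.
gold = {"g1": "cat", "g2": "dog", "g3": "bird"}

worker_answers = {
    "workerA": {"g1": "cat", "g2": "dog", "g3": "bird"},  # careful worker
    "workerB": {"g1": "dog", "g2": "cat", "g3": "fish"},  # random answers or a bot
}

def gold_accuracy(answers, gold):
    """Fraction of gold items this worker answered correctly."""
    scored = [answers[item] == truth for item, truth in gold.items() if item in answers]
    return sum(scored) / len(scored) if scored else 0.0

for worker, answers in worker_answers.items():
    accuracy = gold_accuracy(answers, gold)
    verdict = "accept" if accuracy >= 0.7 else "reject"
    print(f"{worker}: gold accuracy {accuracy:.2f} -> {verdict}")
```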

When reading research papers that incorporate crowdsourcing results, it is wise to determine whether these issues have been dealt with, since this may affect how well founded a paper is.

Therefore, before proposing a task, researchers should deal with the following issues:

Giving the crowd too much information.
A crowd that is too homogeneous.
Having too much communication with the crowd.
Avoiding the possibility of imitation.
Requesting prerequisites from the crowd.
Maintaining crowd motivation.
Presenting a reasonable expectation of workload.
Conducting quality control.

1.5 Some Terminology

At this point, a short discussion of terminology is useful. The person who creates the task and submits it is called the requester at Amazon Mechanical Turk (MTurk), the crowdsourcing platform most often referred to in this book; this person may be called the client at other crowdsourcing sites. Herein we will use the term requester. The person in the crowd who does the work is appropriately called the worker (some also say turker) at MTurk and other sites, a freelancer at MiniFreelance, and a contributor at CrowdFlower and elsewhere. We will use the term worker. The individual task itself is called a Human Intelligence Task or HIT at MTurk, a mission at AgentAnything.com, a microjob at MicroWorkers, and a task at CrowdFlower. We will use the term task here, but the reader will also see this term broken down into three types, according to granularity:

Set of tasks: the complete set of items that the requester wants to have done; for example, the transcription of 2000 hours of speech.

Unit task: one item within the set of tasks; for example, transcribing one utterance out of the 2000 hours of speech above.

Assignment: one piece of work within the unit task; that is, the transcription of the single utterance above may be assigned to three workers, so there would be three assignments for that one unit task.

Also, when referring to the number of unit tasks completed per hour, we will use the term throughput. When referring to when the requester makes a set of tasks available to the workers, we will use the term submission. When talking about the agreement between multiple workers in an annotation task, we will use the term interannotator agreement (ITA).
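To make the interannotator agreement (ITA) term concrete, the sketch below computes average pairwise percent agreement between workers who labeled the same ordered list of unit tasks. The labels are hypothetical, and in practice chance-corrected measures such as Cohen's kappa are often preferred.

```python
# Minimal sketch: average pairwise percent agreement as a simple ITA measure.
# Labels are hypothetical; chance-corrected measures (e.g., Cohen's kappa) are often preferred.
from itertools import combinations

# labels[worker] = labels that worker assigned to the same ordered list of unit tasks
labels = {
    "w1": ["noise", "speech", "speech", "noise"],
    "w2": ["noise", "speech", "noise",  "noise"],
    "w3": ["noise", "speech", "speech", "noise"],
}

def percent_agreement(a, b):
    """Fraction of unit tasks on which two workers gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

pairs = list(combinations(labels, 2))
ita = sum(percent_agreement(labels[a], labels[b]) for a, b in pairs) / len(pairs)
print(f"Average pairwise agreement: {ita:.2f}")  # 0.83 for the labels above
```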

1.6 Acknowledgments

This work was supported by National Science Foundation grant IIS0914927. Any opinions, findings, and conclusions and recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF.

References

Amazon Mechanical Turk—Artificial Artificial Intelligence. http://mturk.com (accessed 9 July 2012).

Banerjee S and Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.

Brants T, Popat AC, Xu P, Och FJ and Dean J (2007) Large language models in machine translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2007).

CastingWords. http://castingwords.com (accessed 9 July 2012).

Kittur A, Smus B and Kraut R (2011) CrowdForge: crowdsourcing complex work. Proceedings of the ACM 2011 Annual Conference on Human Factors in Computing Systems.

Oinas-Kukkonen H (2008) Network analysis and crowds of people as sources of new organizational knowledge, in Knowledge Management: Theoretical Foundation (eds A Koohang et al.). Informing Science Press, Santa Rosa, CA, pp. 173–189.

Surowiecki J (2004) The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. Doubleday Anchor.

Wikipedia—Crowdsourcing. http://en.wikipedia.org/wiki/Crowdsourcing (accessed 9 July 2012).

Williams JD, Melamed ID, Alonso T, Hollister B and Wilpon J (2011) Crowd-sourcing for difficult transcription of speech. Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2011).

2

The Basics

Maxine Eskénazi

Carnegie Mellon University, USA

This chapter contains some basic general information about crowdsourcing that is useful for all speech applications. It uses a review of the literature to show what areas of research have used crowdsourcing and to show what has been done to deal with specific concerns, like quality control. It begins with an overview of the literature on crowdsourcing for speech processing. Then it maps out the types of alternatives to crowdsourcing that the reader might consider and describes the crowdsourcing platforms that exist. From there it discusses how to make task creation easier. Then it gives details about some basic considerations such as getting the audio in and out and payment. Prequalification and native language of the worker are also presented in this section and can serve as an introduction to the issue of quality control that comes later in Section 2.6. The chapter finishes with some useful tips.

2.1 An Overview of the Literature on Crowdsourcing for Speech Processing

Any overview in this area can only serve as an indication of trends, since, by the time that it is published, additional knowledge will have been accumulated. We have found speech-related publications both in the usual automatic speech processing venues and in others not specific to speech processing. Readers should also note that speech crowdsourcing has adapted techniques that have been successful in other areas such as machine translation.

In this section, we will look at what has been published mainly in acquiring, labeling, and transcribing speech and assessing automatic speech processing systems. In the chapters that follow, there will be more specific information in each of these areas. This overview is intended as a census, laying out the land: in what areas of automatic speech processing has crowdsourcing been used, how much effort has there been in each area, what has met with success, and which new areas are emerging.

Papers in the realm of natural language processing that use crowdsourcing date back to about 2007. However, we see the first publications on speech processing appearing about 2009. With this short time-frame, it would be difficult to draw strong conclusions about trends in the publications. However, it is possible to see how greatly the number of publications has grown, where the publications come from, and how the use of quality control has been reported. There are many interesting papers in the slightly more mature area of text/natural language processing that reveal new techniques that can be applied to speech processing. Since that area of research is not reviewed here, the reader may want to look at the overview paper of the 2010 NAACL workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk (Callison-Burch and Dredze 2010) for a good general description revealing the breadth of research that was presented at that workshop.

In the few years that crowdsourcing for speech has existed, the venues where papers appear have been varied. The reader can find papers in the following venues:

The Association for Computational Linguistics (ACL).
The North American Chapter of the Association for Computational Linguistics (NAACL).
The European Chapter of the Association for Computational Linguistics (EACL).
The International Speech Communication Association (ISCA) Interspeech Conference.
The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
The ISCA Speech and Language Technologies for Education (SLaTE) Special Interest Group.
The Association for the Advancement of Artificial Intelligence (AAAI).

Continue reading in the full edition.