Speech in Mobile and Pervasive Environments - Nitendra Rajput - E-Book

Description

This book provides a cross-disciplinary reference to speech in mobile and pervasive environments

Speech in Mobile and Pervasive Environments addresses the issues related to speech processing on resource-constrained mobile devices. These include speech recognition in noisy environments, specialised hardware for speech recognition and synthesis, the use of context to enhance recognition and user experience, and the emerging software standards required for interoperability. This book takes a multi-disciplinary look at these matters, while offering an insight into the opportunities and challenges of speech processing in mobile environs. In developing regions, speech-on-mobile is set to play a momentous role, socially and economically; the authors discuss how voice-based solutions and applications offer a compelling and natural solution in this setting.

Key Features

  • Provides a holistic overview of all speech technology related topics in the context of mobility
  • Brings together the latest research in a logically connected way in a single volume
  • Covers hardware, embedded recognition and synthesis, distributed speech recognition, software technologies, contextual interfaces
  • Discusses multimodal dialogue systems and their evaluation
  • Introduces speech in mobile and pervasive environments for developing regions

This book provides a comprehensive overview for beginners and experts alike. It can be used as a textbook for advanced undergraduate and postgraduate students in electrical engineering and computer science. Students, practitioners or researchers in the areas of mobile computing, speech processing, voice applications, human-computer interfaces, and information and communication technologies will also find this reference insightful. For experts in the above domains, this book complements their strengths. In addition, the book will serve as a guide to practitioners working in telecom-related industries.


Pages: 500

Publication year: 2012




Table of Contents

Series Page

Title Page

Copyright

Dedication

About the Series Editors

List of Contributors

Foreword

Preface

What this book is about

Why we wrote this book

Chapters

The audience for this book

Acknowledgments

Chapter 1: Introduction

1.1 Application Design

1.2 Interaction Modality

1.3 Speech Processing

1.4 Evaluations

Chapter 2: Mobile Speech Hardware: The Case For Custom Silicon

2.1 Introduction

2.2 Mobile Hardware: Capabilities and Limitations

2.3 Profiling Existing Software Systems

2.4 Recognizers for Mobile Hardware: Conventional Approaches

2.5 Custom Hardware for Mobile Speech Recognition

2.6 Conclusion

Bibliography

Chapter 3: Embedded Automatic Speech Recognition and Text-to-speech Synthesis

3.1 Automatic Speech Recognition

3.2 Mathematical Formulation

3.3 Acoustic Parameterization

3.4 Acoustic Modeling

3.5 Language Modeling

3.6 Modifications for Embedded Speech Recognition

3.7 Applications

3.8 Text-to-speech Synthesis

3.9 Text to Speech in a Nutshell

3.10 Front End

3.11 Back End

3.12 Embedded Text-to-speech

3.13 Evaluation

3.14 Summary

Bibliography

Chapter 4: Distributed Speech Recognition

4.1 Elements of Distributed Speech Processing

4.2 Front-end Processing

4.3 ETSI Standards

4.4 Transfer Protocol

4.5 Energy-aware Distributed Speech Recognition

4.6 ESR, NSR, DSR

Chapter 5: Context in Conversation

5.1 Context Modeling and Aggregation

5.2 Context-based Speech Applications: Conspeakuous

5.3 Context-based Speech Applications: Responsive Information Architect

5.4 Conclusion

Bibliography

Chapter 6: Software: Infrastructure, Standards, Technologies

6.1 Introduction

6.2 Mobile Operating Systems

6.3 Voice Over Internet Protocol

6.4 Standards

6.5 Standards: VXML

6.6 Standards: VoiceFleXML

6.7 SAMVAAD

6.8 Conclusion

6.9 Summary and Future Work

Bibliography

Chapter 7: Architecture of Mobile Speech-based and Multimodal Dialog Systems

7.1 Introduction

7.2 Multimodal Architectures

7.3 Multimodal Frameworks

7.4 Multimodal Mobile Applications

7.5 Architectural Models

7.6 Distribution in the Stopman System

7.7 Conclusions

Bibliography

Chapter 8: Evaluation of Mobile and Pervasive Speech Applications

8.1 Introduction

8.2 Evaluation of Mobile Speech-based Systems

8.3 Case Studies

8.4 Theoretical Measures for Dialog Call-flows

8.5 Conclusions

Bibliography

Chapter 9: Developing Regions

9.1 Introduction

9.2 Applications and studies

9.3 Systems

9.4 Challenges

Bibliography

Index

Wiley Series on Wireless Communications and Mobile Computing

Series Editors: Dr Xuemin (Sherman) Shen, University of Waterloo, Canada

Dr Yi Pan, Georgia State University, USA

The “Wiley Series on Wireless Communications and Mobile Computing” is a series of comprehensive, practical and timely books on wireless communication and network systems. The series focuses on topics ranging from wireless communication and coding theory to wireless applications and pervasive computing. The books provide engineers and other technical professionals, researchers, educators, and advanced students in these fields with invaluable insight into the latest developments and cutting-edge research.

Other titles in the series:

Misic and Misic: Wireless Personal Area Networks: Performance, Interconnection, and Security with IEEE 802.15.4, January 2008, 978-0-470-51847-2

Takagi and Walke: Spectrum Requirement Planning in Wireless Communications: Model and Methodology for IMT-Advanced, April 2008, 978-0-470-98647-9

Pérez-Fontán and Espiñeira: Modeling the Wireless Propagation Channel: A simulation approach with MATLAB®, August 2008, 978-0-470-72785-0

Ippolito: Satellite Communications Systems Engineering: Atmospheric Effects, Satellite Link Design and System Performance, August 2008, 978-0-470-72527-6

Lin and Sou: Charging for Mobile All-IP Telecommunications, September 2008, 978-0-470-77565-3

Myung and Goodman: Single Carrier FDMA: A New Air Interface for Long Term Evolution, October 2008, 978-0-470-72449-1

Wang, Kondi, Luthra and Ci: 4G Wireless Video Communications, April 2009, 978-0-470-77307-9

Cai, Shen and Mark: Multimedia Services in Wireless Internet: Modeling and Analysis, June 2009, 978-0-470-77065-8

Stojmenovic: Wireless Sensor and Actuator Networks: Algorithms and Protocols for Scalable Coordination and Data Communication, February 2010, 978-0-470-17082-3

Liu and Weiss: Wideband Beamforming: Concepts and Techniques, March 2010, 978-0-470-71392-1

Richharia and Westbrook: Satellite Systems for Personal Applications: Concepts and Technology, July 2010, 978-0-470-71428-7

Qian, Muller and Chen: Security in Wireless Networks and Systems, March 2014, 978-0-470-51212-8

This edition first published 2012

© 2012 John Wiley & Sons Ltd.

Registered office

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data

Rajput, Nitendra.

Speech in mobile and pervasive environments / Nitendra Rajput and Amit A. Nanavati.

p. cm.

Includes bibliographical references and index.

ISBN 978-0-470-69435-0 (cloth)

1. Speech processing systems. 2. Cell phone systems. I. Nanavati, Amit A. II. Title.

TK7882.S65R334 2012

006.5–dc23

2011033626

A catalogue record for this book is available from the British Library.

ISBN: 9780470694350 (H/B)

For,

To,

Family & Friends

About the Series Editors

Xuemin (Sherman) Shen (M'97-SM'02) received a BSc degree in electrical engineering from Dalian Maritime University, China in 1982, and the MSc and PhD degrees (both in electrical engineering) from Rutgers University, New Jersey, USA, in 1987 and 1990 respectively. He is a Professor and University Research Chair, and the Associate Chair for Graduate Studies, at the Department of Electrical and Computer Engineering, University of Waterloo, Canada. His research focuses on mobility and resource management in interconnected wireless/wired networks, UWB wireless communications systems, wireless security, and ad hoc and sensor networks. He is a co-author of three books, and has published more than 300 papers and book chapters on wireless communications and networks, control and filtering. Dr. Shen serves as a founding area editor for IEEE Transactions on Wireless Communications; editor-in-chief for Peer-to-Peer Networking and Application; associate editor for IEEE Transactions on Vehicular Technology, KICS/IEEE Journal of Communications and Networks, Computer Networks, ACM/Wireless Networks and Wireless Communications and Mobile Computing. He has also served as a guest editor for IEEE JSAC, IEEE Wireless Communications and IEEE Communications Magazine. Dr. Shen received the Excellent Graduate Supervision Award in 2006, and the Outstanding Performance Award in 2004 from the University of Waterloo, the Premier's Research Excellence Award (PREA) in 2003 from the Province of Ontario, Canada, and the Distinguished Performance Award in 2002 from the Faculty of Engineering, University of Waterloo. Dr. Shen is a registered Professional Engineer of Ontario, Canada.

Yi Pan is the Chair and a Professor in the Department of Computer Science at Georgia State University, USA. Dr. Pan received his BEng and MEng degrees in computer engineering from Tsinghua University, China, in 1982 and 1984, respectively, and his PhD degree in computer science from the University of Pittsburgh, USA, in 1991. Dr. Pan's research interests include parallel and distributed computing, optical networks, wireless networks and bioinformatics. Dr. Pan has published more than 100 journal papers, with over 30 papers published in IEEE journals. In addition, he has published over 130 papers in refereed conferences (including IPDPS, ICPP, ICDCS, INFOCOM, and GLOBECOM). He has also co-edited over 30 books. Dr. Pan has served as an editor-in-chief or an editorial board member for 15 journals including five IEEE Transactions journals and has organized many international conferences and workshops. Dr. Pan has delivered over 10 keynote speeches at international conferences. Dr. Pan is an IEEE Distinguished Speaker (2000–2002), a Yamacraw Distinguished Speaker (2002), and a Shell Oil Colloquium Speaker (2002). He is listed in Men of Achievement, Who's Who in America, Who's Who in American Education, Who's Who in Computational Science and Engineering, and Who's Who of Asian Americans.

List of Contributors

Patrick J. Bourke, Carnegie Mellon University, USA

Om D. Deshmukh, IBM Research, India

Jaakko Hakulinen, University of Tampere, Finland

Amit A. Nanavati, IBM Research, India

Nitendra Rajput, IBM Research, India

Rob A. Rutenbar, Carnegie Mellon University, USA

Markku Turunen, University of Tampere, Finland

Kai Yu, Carnegie Mellon University, USA

Foreword

Advances in computing, in terms of both the creation of novel mathematical techniques and the design of data-driven technologies, have fuelled the ubiquitous development and deployment of speech technologies over the last two decades. Some of the core speech technologies and their applications to coding, recognition, synthesis, enhancement and such have been well documented in several excellent books, and have since been incorporated in university courses across the world. With the recent trends toward convergence of computing and communication, well exemplified by the global proliferation of mobile devices in the past decade, there has been significant speech technology research and development effort focused on algorithms and multimodal interface applications targeting, and tailored to, the new requirements of mobile platforms and interaction paradigms. This book is a natural and timely contribution that summarizes the state of the art in this domain of mobile speech technologies and can be useful both as a teaching resource and as a reference for the practitioner.

In this book, the authors have done a fantastic job in providing a comprehensive account, offering an end-to-end story for enabling speech interfaces on mobile devices. The challenge in undertaking to write a book of this nature is to have the right breadth to do justice to a multifaceted area. As the founders and organizers of the yearly workshops on Speech in Mobile and Pervasive Environments since 2006, the authors were indeed well positioned to take on this challenging assignment. I have been a close and participating witness to the growth of interest and the forming of a community in this field. Furthermore, I believe that several core technology pieces have matured well enough to warrant the deployment of speech applications on mobile devices in the marketplace. I am hence delighted to see a book on this topic at this time.

While mobile and pervasive environments provide an opportunity to build novel speech-based applications, the computing limitations of the device pose a challenge to the design and implementation of the underlying speech technology. This book provides an excellent overview of both the technologies from a device point of view and the possible applications from an enabling technology point of view. Chapters 2, 3 and 4 describe the implications and research in core speech technology owing to the mobile device restrictions. Chapters 5 and 6 move towards the application side of the story and describe the special requirements of designing speech applications on mobile devices. Chapters 7, 8 and 9 focus on the possible speech-based applications on mobile devices. The earlier chapters are structured such that there is a natural transition from the descriptions of core technology that refer to the standard ways of speech recognition and synthesis to those tailored for the mobile device. Similarly, the later chapters on applications constantly refer to specific real-world solutions to drive home the vast technology possibilities.

I see this book as having a wide audience in both academia and industry. In the university, it can be used for advanced courses on mobile technologies and speech technologies, targeting not just engineers but also application designers and mobile technology developers. Students can pick a specific sub-area (hardware, speech recognition, speech synthesis or interface design) and do course projects. I also believe this book will be of value to industry since it brings together both the core technology and the wide range of application possibilities.

I hope that the reader will be able to appreciate and benefit from the comprehensive account that aims to bridge technology and applications. This book does well to achieve the dual goals as an educational resource for the student and as a reference for the practitioner. This book is indeed a compelling and useful contribution to the speech research and development world.

Shrikanth Narayanan

University of Southern California

Preface

What this book is about

If the early 21st century is to be remembered for one global phenomenon, the rise of the mobile phone is a good contender. Mobiles are here to stay. Speech in Mobile and Pervasive Environments deals with issues related to speech processing on resource-constrained, wireless, mobile devices, including: speech recognition in noisy environments, specialized hardware for speech recognition and synthesis, the use of context to enhance recognition, the emerging new standards required for interoperability, multimodal applications on mobile devices, distributed processing between the client and the server, and the relevance of speech for developing regions.

While speech processing has been an independent area of research for many years, the proliferation of the mobile device is making speech mainstream. Many novel and interesting applications are being offered on mobile devices, and the mobile platform brings its own opportunities and constraints: new sources of context, resource constraints and noisy environments. There is a rapidly growing interest, as indicated in academic conferences, as well as considerable investment in the telecoms and related industries.

Why we wrote this book

A multi-disciplinary approach to this topic is the primary motivation for writing this book. Further,

  • The mobile is a convergent platform: a personal communication device, and an interface for applications and services.
  • We feel that it is useful for speech recognition experts to be able to leverage context and for people working in developing regions to learn about embedded recognition; it is therefore helpful to have all these topics between two covers.
  • The topics covered in this book vary widely in flavor (from distributed speech recognition to applications in developing regions) and in maturity (the oldest being hardware and developing regions being the youngest). As a result of this, the chapters are not uniform in length or presentation, or in the approach we have taken toward the selection of material.
  • Even for topics that are well-established, we hope that the expert will find something of use in this book. For example, rather than cover the issues that generally arise in gathering and managing context, our focus here has been to address context as it relates to speech in mobile and pervasive environments.
  • We hope that this book will fulfill its ambition of bringing together people working across disciplines, increasing interactions among them and advancing this field further.

Chapters

The chapters of the book are as follows.

  • The chapter on hardware looks in detail at speech recognition from a hardware-centric viewpoint. It investigates the possibility of rendering the recognizer in the hardware itself.
  • The chapter on embedded recognition and synthesis provides an overview and discusses acoustic parameterization, speech recognition algorithms and acoustic modeling.
  • The chapter on distributed speech recognition includes the various protocols and standards used for distributing the recognition effort.
  • The chapter on context discusses ways of modeling context and aggregating it for use in speech applications. It also describes a few context-based applications.
  • The chapter on software talks about infrastructure, standards and technologies. VoIP and VoiceXML are a part of this chapter. Some possible extensions to VoiceXML from a mobile-speech perspective are also explored. This chapter also considers the question of restructuring a call-flow so that it can be accommodated on devices with constraints.
  • The chapter on multimodal dialogue systems provides some examples of distributed multimodal applications and the software architectures needed to support them.
  • The chapter on evaluation describes the need to develop appropriate methods for evaluating mobile speech systems. Generic methods of usability studies are discussed first, followed by a consideration of more specific subjects relevant to the evaluation of mobile and speech-based systems. A theoretical measure for dialogue call-flow is also presented.
  • A large number of people on our planet cannot read or write, but find mobile telephones very usable and useful. The chapter on developing regions focusses on the needs of this population and the applications and interfaces that are needed to serve them.

The audience for this book

Students

This book can be used as a textbook for a one-semester postgraduate or advanced undergraduate course on the subject for computer science, electrical engineering, and human–computer-interface students. The book might be used as follows.

  • A one-semester course for electrical engineers could include the chapters on hardware (Chapter 2), embedded recognition and synthesis (Chapter 3), distributed speech recognition (Chapter 4), multimodality (Chapter 7) and developing regions (Chapter 9). While the first three chapters are core, the last two could feed back into the design of hardware.
  • A one-semester course for computer science students could include the chapters on distributed speech recognition (Chapter 4), software (Chapter 6), multimodality (Chapter 7), context in conversation (Chapter 5) and developing regions (Chapter 9).
  • A one-semester course for human–computer-interface students could include context in conversation (Chapter 5), multimodality (Chapter 7), software (Chapter 6), evaluation (Chapter 8) and developing regions (Chapter 9).

Professionals

There is a large number of professionals, across many disciplines, who are creating applications and solutions for the mobile platform. Many domain-specific (e.g. healthcare, agriculture) mobile application developers and software developers in the mobile/telecoms industries will also find this book useful.

Nitendra Rajput

Amit A. Nanavati

Acknowledgments

We are extremely grateful to our collaborators, because of whom we had enough to say. To Rob, Patrick and Kai of Carnegie Mellon University: we still recall the enthusiastic response we received from Rob for the book, and how he thought this effort was very timely. We deeply appreciate their following through in such a timely manner. To the ever-supportive Markku and Jaakko, who quickly suggested their topics when we talked to them about our proposal, and were prompt with the drafts. Also to Om, who became our saviour in the hour of need; we wonder where he hides his white wings. Our deepest and most heartfelt gratitude to each one of you! When you decide to write your books, think of us :-)

The wisdom that writing a book is not at all like what one thinks it is going to be before writing it, came to us too. Enabling you to read these words took not just patience and drive but also dealing with promotions, transfers and happy family additions on the part of many of our friends at Wiley: Sarah Tilley, Susan Barclay and Sarah Hinton, who helped us conclude that patience is indeed a virtue (we are still alive), and Anna Smart, who risked her career and braved a couple of engineers' suggestions on cover design. We are grateful to our friends at Laserwords and Anglosphere Editing for their meticulous efforts.

Nitendra: To Samvit, who timed his arrival to perfection to ensure that I would have yet another excuse to give my coauthor and the publishers. To Praneeta, who has been patient throughout my life, and with respect to the book as well, and never asked tough questions regarding the snail's pace of my writing. To my parents, who always maintain their faith in me in whatever I undertake, a faith that I have not yet been able to extend to my son.

Amit: To my uncle, Prof. H.C. Dholakia, who wrote ten odd (well, law) law books, and misled me into feeling that it was easy to write one. To my young nephews Arnav and Setu, who, through their innocent questioning, made us realize that my uncle had actually cheated us. Poto, for gently advising and encouraging me to complete this work—it would be nice if he took his own advice more often. And most of all, my Teen Deviyaan, who define me.

Also to our collaborators in our various papers that led us into this area and finally to this effort. To our fathers-in-law who were unblushingly punctual in checking on the progress. Finally, to the umpteen friends and all our family members who will start showering their wishes on us as soon as they see this in print, without ever worrying about its content. In the end it is only for these wishes that we have toiled so hard, truly, in embarking upon this adventure.

Nitendra Rajput and Amit A. Nanavati

Chapter 1

Introduction

Nitendra Rajput and Amit A. Nanavati

IBM Research, India

Speech is the most natural and widely used modality of interaction. In the devices world, mobile phones have surpassed television by a huge margin. Mobile devices and speech interaction therefore form a uniquely pervasive platform to access any information technology application. This book addresses the technology related to speech interaction on such mobile and pervasive devices. While speech interfaces and mobile devices have separately been key areas of research and study, the two have not been studied together in such detail. This is the gap that the book wishes to address.

Speech in mobile and pervasive environments is an exciting and very promising area from the perspectives of timing, technology and adoption. We will now elaborate on these three perspectives.

Increased processing speeds and reduced dimensions of computing devices over the last two decades have made computing more pervasive. While one of the first general-purpose electronic computers (ENIAC, in the 1940s) filled a large room, the same amount of computing is now possible in chips that are measured in millimeters. The effect of Moore's law in reducing the size and increasing computational speed is clearly evident in the computing devices that are available today. The computers of the 1950s were mostly huge machines used by enterprises or research organizations. By the 1990s, most computers in the world were personal computers that people used in homes or offices. The last decade has seen the computing world move to laptops and smart phones. This journey of reduced size and increased speed in the computing world has been complemented by a corresponding acceptance of more personalized and pervasive devices in the consumer space. Computing has thus penetrated every aspect of human life. Devices that are easy to use and carry have increased the pervasiveness of technology. The timing is therefore right to look at such devices, which are the doors enabling access to the computing world in the 2010s.

From a technology perspective, research in speech technologies started its journey way back in the 1950s in the area of digit recognition and later through the IBM Shoebox. The technology has advanced significantly since then, and speech recognition and synthesis are now becoming increasingly available on mobile devices. Speech recognition systems now understand natural language and use the context of the application to increase the recognition accuracy. Speech recognition and synthesis are also available in a large number of languages. The technology has now reached such a level of maturity that we can consider speech as a main medium of interaction on the mobile platform.

In terms of adoption, mobile devices clearly outnumber any other computing devices. This is true not only in the developed nations, but also in developing regions, where cost and literacy are a challenge. We spend a significant amount of time on the move. Smart phones and other pervasive computing technologies are finding their way into the market to address the needs of people who are not always near a computer. Not only do these devices provide comfortable access, but they also penetrate the market to reach out to populations who do not necessarily use computers. Pervasive environments, including mobile devices, are therefore now connecting more people on this planet than computers did. Being pervasive, such devices are more accessible than their computer counterparts. So not only do more users now own such devices, they are accessing them for longer durations than they would traditional computers.

In such an evolving environment, mobile and pervasive devices are expected to attract the attention of most users for the maximum time.

We tend to rely on a keyboard to interact with computers and mobile devices. Keyboards are designed from a machine perspective and the machine is able to parse any keyboard entry with complete accuracy. However, keyboards are not a natural means of interaction for humans. Other natural modalities such as gestures and emotions are easier for a human, but parsing such inputs is difficult for a machine, given the state of the art. Speech provides a good balance, being natural to humans and still parsable by the machine.

Using speech as the medium of interaction with the pervasive environment therefore promises a very natural and highly usable way for humans to interact with machines.

From a technical perspective, speech on mobile devices has implications across the entire life-cycle of the system. This includes the application design, the modality of interaction, the processing of interaction and the eventual evaluation. We will now look at the implications for each of these components of the life-cycle and present the manner in which they need to be addressed from a speech and a mobile perspective. These discussions will enable a reader to relate each of these topics to the chapters of the book that describe it. The goal of this chapter is therefore twofold: first, to introduce the reader to the challenges associated with speech in mobile and pervasive environments; and second, to describe the layout of the book so that the chapters can be read in a different sequence if preferred.

1.1 Application Design

When a mobile application uses the speech modality, it can be designed to derive more intelligence from the context of the user since such devices are mostly personal. The rich context of the user and their environment can make the application communicate more intelligently with the user. An application design should therefore incorporate the context of the user and the environment. Speech interfaces tend to gain further from this context since the machine is able to parse human speech more effectively if it knows the context. The implications of the availability of context and its usage are described in Chapter 5.
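As a rough illustration of how context might help a machine parse speech, the sketch below re-ranks a recognizer's candidate transcriptions by rewarding words that the user's context makes likely. It is not taken from the book: the function name, the scores, the `boost` value and the idea of a simple additive bonus are all assumptions made for the example.

```python
# Illustrative sketch only: the recognizer output, the context terms and the
# boost value are hypothetical, not drawn from the book.

def rerank_with_context(hypotheses, context_terms, boost=2.0):
    """Re-rank candidate transcriptions using words suggested by user context.

    hypotheses    -- list of (text, log_score) pairs from a hypothetical recognizer
    context_terms -- set of lowercase words made likely by context
                     (for example location or calendar entries)
    boost         -- log-score bonus added per matching word (assumed value)
    """
    rescored = []
    for text, log_score in hypotheses:
        # Count how many words of this candidate appear in the context set.
        matches = sum(1 for word in text.lower().split() if word in context_terms)
        rescored.append((text, log_score + boost * matches))
    # Highest adjusted score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    n_best = [("call the bank", -4.0), ("call the band", -3.5)]
    context = {"bank", "account"}  # e.g. the user is inside a banking application
    print(rerank_with_context(n_best, context)[0][0])
```

Without context the second candidate scores higher; with the banking context the first one is preferred. A deployed system would more likely fold context into the language model itself rather than apply a fixed bonus afterwards.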

Since mobile applications need to run on a variety of devices, it is important that speech applications run on the multitude of operating systems and that they use standard authoring languages. Chapter 6 focuses on the various standards that have implications for speech applications on mobile devices. It also provides details about the way such standards need to be engineered to enable efficient processing of speech signals on resource-constrained mobile devices.

1.2 Interaction Modality

Speech, though natural, is not the most efficient modality for delivering every kind of information. Research on mobile interaction has shown that a multimodal interface involving both speech and visual interaction can be better than a speech-only or a visual-only interface. Adding the speech modality to an application therefore results in a multimodal application. The architecture and framework for building such multimodal applications need to be studied differently from traditional application development frameworks, such as those for web applications. Many mobile applications can either perform the processing on the device or push it to the server. Architectural models that can support applications in such distributed processing modes also need to be studied. We describe the multimodal architecture and illustrate it through several multimodal applications that run on mobile devices. This explanation, together with more detail on the various client-server architectural models, forms the basis of Chapter 7.

1.3 Speech Processing

Speech interaction itself has many broad areas of research. One of the more intensively studied areas relates to converting user speech to text and converting text to speech. Research in these two areas has been one of the key applications in the core fields of digital signal processing and pattern recognition. Researchers have worked extensively on developing systems that can process a variety of human speech pronunciations and determine the underlying text. The difficulty of this problem is attributed to the nuances of the language and its multiple variations in terms of dialects, varying pronunciations and spontaneity of the speaker. Traditional approaches to solving the speech-to-text problem are therefore compute and data intensive, as they try to determine a pattern from a sea of possibilities. Similarly, converting text to natural speech is also data intensive, as this involves determining the most appropriate audio sample from a large sea of possible sounds in a particular language. Speech processing in the mobile world is therefore a non-trivial adaptation from the computer world. Several techniques to efficiently process speech on resource-constrained devices are outlined in Chapters 2 and 3.

Owing to the high computing and data requirements, speech to text and text to speech need novel solutions when they are to be performed in a mobile environment. Most mobile devices cannot support huge amounts of processing and they do not have large amounts of working memory either. In a mobile context, it therefore becomes critical to build speech recognition systems that can work on resource-constrained devices. Chapter 2 provides a brief overview of a simple speech recognition system and then elaborates on the various techniques that are used to reduce the footprint. Embedded speech recognition provides methods to perform speech recognition on such low-end devices with the least degradation in accuracy. The chapter also illustrates techniques to convert text into natural speech by using a smaller data set, which can be incorporated in pervasive devices.
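To give a feel for what reducing the footprint can mean, the sketch below stores model parameters as 8-bit codes instead of floating-point values. Linear quantization is chosen here purely as an illustrative assumption; the excerpt does not say which techniques the chapter covers, and the function names and parameter values are hypothetical.

```python
from typing import List, Tuple


def quantize_8bit(values: List[float]) -> Tuple[bytes, float, float]:
    """Map floats onto 256 evenly spaced levels between min and max."""
    lo, hi = min(values), max(values)
    # Fall back to a scale of 1.0 when all values are identical.
    scale = (hi - lo) / 255 or 1.0
    codes = bytes(round((v - lo) / scale) for v in values)
    return codes, lo, scale


def dequantize_8bit(codes: bytes, lo: float, scale: float) -> List[float]:
    """Recover approximate floats from the 8-bit codes."""
    return [lo + c * scale for c in codes]


# Hypothetical model parameters (for example, values of an acoustic model).
params = [-1.0, -0.5, 0.0, 0.5, 1.0]
codes, lo, scale = quantize_8bit(params)
restored = dequantize_8bit(codes, lo, scale)

# One byte per parameter instead of four or eight for a float.
print(len(codes), max(abs(a - b) for a, b in zip(params, restored)))
```

The trade-off is the one the text describes: memory shrinks while each parameter is only recovered to within half a quantization step, so accuracy may degrade slightly.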

Since a large number of applications are in a client-server environment, a server can also be used to perform some speech processing. There are standards that enable such distribution of speech processing across the client and the server devices. We describe this concept of distributing speech processing in Chapter 4. The chapter also provides details on the underlying protocols that are used to communicate the processing between the client and the server. We compare the different distributed processing techniques at the end of the chapter.
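The idea of splitting the work between client and server can be pictured as follows: the device reduces the audio to a compact set of features and sends only those, while the server does the heavier processing. This is a toy sketch under stated assumptions, not one of the standards or protocols discussed in Chapter 4; the frame size, the single log-energy feature and the function names are all hypothetical, and real front ends compute much richer features.

```python
import math
import struct
from typing import List

FRAME_SIZE = 4  # hypothetical samples per frame, tiny for illustration


def client_extract_features(samples: List[float]) -> bytes:
    """Client side: reduce each full frame to one log-energy value."""
    feats = []
    for start in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE):
        frame = samples[start:start + FRAME_SIZE]
        energy = sum(s * s for s in frame) / FRAME_SIZE
        # Small constant avoids taking the log of zero on silent frames.
        feats.append(math.log(energy + 1e-10))
    # Pack as little-endian 32-bit floats for transmission.
    return struct.pack(f"<{len(feats)}f", *feats)


def server_decode_features(payload: bytes) -> List[float]:
    """Server side: unpack the features before further processing."""
    count = len(payload) // 4
    return list(struct.unpack(f"<{count}f", payload))


# Hypothetical audio: one silent frame followed by one loud frame.
samples = [0.0] * FRAME_SIZE + [1.0] * FRAME_SIZE
payload = client_extract_features(samples)
features = server_decode_features(payload)
print(len(payload), features)
```

Here eight samples are reduced to two 32-bit values before transmission, which illustrates why sending features rather than raw audio can suit constrained devices and links.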

1.4 Evaluations

The last leg of a mobile application is its evaluation. Because such an interface is multimodal in nature, its evaluation requires redesigning standard evaluation techniques, which usually work on specific modalities. Since we are focusing on speech applications in pervasive devices, the evaluations need to mimic the real-world conditions in which such applications are expected to be used. In most pervasive environments, a user is surrounded by other activities in addition to the device. Therefore, securing the full attention of the user for the application is not always possible. We outline the various evaluation techniques for multimodal applications in Chapter 8.

We describe the distinction between field studies and real-world studies and describe their use in separate applications.

Toward the end of this book, we present a specific case of speech applications as applied in the context of the developing world. This deserves a separate chapter since we believe that, owing to the pervasiveness of these devices and their easy modality, the applications are well suited to low-literate users in the developing world. In Chapter 9 we describe the challenges that current technologies pose for low-literate users and then outline how speech-based mobile applications can bridge this gap.

The book thus covers the various aspects of a mobile application life-cycle when speech processing is used. We believe that the reader will be able to get a broad understanding of all the inter-related issues and also a detailed understanding of the specific challenges in this domain. The authors also conduct yearly workshops on this topic and the reader is encouraged to participate in these² and join the community at our wiki.³

1 Davies, K.H., Biddulph, R. and Balashek, S. (1952) Automatic Speech Recognition of Spoken Digits, J. Acoust. Soc. Am. 24(6), pp. 637–642.

2 http://research.ihost.com/SiMPE.

3 http://simpe.wikispaces.com.

Chapter 2

Mobile Speech Hardware: The Case For Custom Silicon

Patrick J. Bourke, Kai Yu and Rob A. Rutenbar

Carnegie Mellon University, USA

Mobile platforms offer limited computational resources—a significant side effect of constraints on their cost, size and batteries. As a consequence, speech recognizers on mobile platforms usually make one of two unavoidable decisions: (1) to reduce the capability of a software-based recognizer hosted on this platform or (2) to move some part of the recognition computation off this platform.

In this chapter we suggest a third alternative: render the recognizer itself in hardware on the mobile platform. This is the path taken by graphics applications such as video playback, which are no longer handled in software, even on mobile phones. This chapter describes the hardware-based solution, explains its novel constraints and opportunities, and discusses recent results targeting low-power custom silicon recognizers.
