Language and Speech Processing

Description

Speech processing addresses various scientific and technological areas. It includes speech analysis and variable-rate coding, in order to store or transmit speech. It also covers speech synthesis, especially from text; speech recognition, including speaker and language identification; and spoken language understanding. This book covers how to realize speech production and perception systems, and how to synthesize and understand speech using state-of-the-art methods in signal processing, pattern recognition, stochastic modelling, computational linguistics and human factors studies.


Page count: 810

Publication year: 2013




Table of Contents

Preface

Chapter 1 Speech Analysis

1.1. Introduction

1.2. Linear prediction

1.3. Short-term Fourier transform

1.4. A few other representations

1.5. Conclusion

1.6. References

Chapter 2 Principles of Speech Coding

2.1. Introduction

2.2. Telephone-bandwidth speech coders

2.3. Wideband speech coding

2.4. Audiovisual speech coding

2.5. References

Chapter 3 Speech Synthesis

3.1. Introduction

3.2. Key goal: speaking for communicating

3.3. Synoptic presentation of the elementary modules in speech synthesis systems

3.4. Description of linguistic processing

3.5. Acoustic processing methodology

3.6. Speech signal modeling

3.7. Control of prosodic parameters: the PSOLA technique

3.8. Towards variable-size acoustic units

3.9. Applications and standardization

3.10. Evaluation of speech synthesis

3.11. Conclusions

3.12. References

Chapter 4 Facial Animation for Visual Speech

4.1. Introduction

4.2. Applications of facial animation for visual speech

4.3. Speech as a bimodal process

4.4. Synthesis of visual speech

4.5. Animation

4.6. Conclusion

4.7. References

Chapter 5 Computational Auditory Scene Analysis

5.1. Introduction

5.2. Principles of auditory scene analysis

5.3. CASA principles

5.4. Critique of the CASA approach

5.5. Perspectives

5.6. References

Chapter 6 Principles of Speech Recognition

6.1. Problem definition and approaches to the solution

6.2. Hidden Markov models for acoustic modeling

6.3. Observation probabilities

6.4. Composition of speech unit models

6.5. The Viterbi algorithm

6.6. Language models

6.7. Conclusion

6.8. References

Chapter 7 Speech Recognition Systems

7.1. Introduction

7.2. Linguistic model

7.3. Lexical representation

7.4. Acoustic modeling

7.5. Decoder

7.6. Applicative aspects

7.7. Systems

7.8. Perspectives

7.9. References

Chapter 8 Language Identification

8.1. Introduction

8.2. Language characteristics

8.3. Language identification by humans

8.4. Language identification by machines

8.5. LId resources

8.6. LId formulation

8.7. LId modeling

8.8. Discussion

8.9. References

Chapter 9 Automatic Speaker Recognition

9.1. Introduction

9.2. Typology and operation of speaker recognition systems

9.3. Fundamentals

9.4. Performance evaluation

9.5. Applications

9.6. Conclusions

9.7. Further reading

Chapter 10 Robust Recognition Methods

10.1. Introduction

10.2. Signal pre-processing methods

10.3. Robust parameters and distance measures

10.4. Adaptation methods

10.5. Compensation of the Lombard effect

10.6. Missing data scheme

10.7. Conclusion

10.8. References

Chapter 11 Multimodal Speech: Two or Three Senses are Better than One

11.1. Introduction

11.2. Speech is a multimodal process

11.3. Architectures for Audiovisual fusion in speech perception

11.4. Audiovisual speech recognition systems

11.5. Conclusions

11.6. References

Chapter 12 Speech and Human-Computer Communication

12.1. Introduction

12.2. Context

12.3. Specificities of speech

12.4. Application domains with voice-only interaction

12.5. Application domains with multimodal interaction

12.6. Conclusions

12.7. References

Chapter 13 Voice Services in the Telecom Sector

13.1. Introduction

13.2. Automatic speech processing and telecommunications

13.3. Speech coding in the telecommunication sector

13.4. Voice command in telecom services

13.5. Speaker verification in telecom services

13.6. Text-to-speech synthesis in telecommunication systems

13.7. Conclusions

13.8. References

List of Authors

Index

First published in France in 2002 by Hermes Science/Lavoisier under the title Traitement automatique du langage parlé 1 et 2 © LAVOISIER, 2002

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd, 27–37 St George's Road, London SW19 4EU, UK

www.iste.co.uk

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

www.wiley.com

© ISTE Ltd, 2009

The rights of Joseph Mariani to be identified as the author of this work have been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Cataloging-in-Publication Data

Traitement automatique du langage parlé 1 et 2. English

  Spoken language processing / edited by Joseph Mariani.

  p. cm.

Includes bibliographical references and index.

ISBN 978-1-84821-031-8

1. Automatic speech recognition. 2. Speech processing systems. I. Mariani, Joseph. II. Title.

TK7895.S65T7213 2008


006.4'54--dc22

2008036758

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

ISBN: 978-1-84821-031-8

Preface

This book, entitled Spoken Language Processing, addresses all aspects of the automatic processing of spoken language: how to automate its production and perception, and how to synthesize and understand it. It draws on existing know-how in the fields of signal processing, pattern recognition, stochastic modeling, computational linguistics and human factors, but also relies on knowledge specific to spoken language.

The automatic processing of spoken language covers activities related to the analysis of speech, including variable-rate coding to store or transmit it; to its synthesis, especially from text; and to its recognition and understanding, whether for transcription (possibly followed by automatic indexing), for human-machine dialog, or for machine-assisted human-human interaction. It also includes speaker recognition and spoken language identification. These tasks may have to be carried out in noisy environments, which makes the problem even more difficult.

Activities in the field of automatic spoken language processing started around the time of the Second World War with the work of Dudley and colleagues at Bell Labs on the Vocoder and Voder, and were made possible by the availability of electronic devices. Initial research on basic recognition systems was carried out in the 1950s with very limited computing resources. The computer facilities that became available to researchers in the 1970s made initial progress possible within laboratories, and microprocessors then led to the early commercialization of the first voice recognition and speech synthesis systems at an affordable price. Steady progress in computing speed and storage capacity has accompanied the scientific advances in the field.

Research in the 1970s, including the large DARPA “Speech Understanding Systems” (SUS) program in the USA, suffered from a lack of speech data and of means and methods for evaluating the performance of different approaches and systems. The establishment by DARPA, as part of its follow-on program launched in 1984, of a national language resources center, the Linguistic Data Consortium (LDC), and of a system assessment center within the National Institute of Standards and Technology (NIST, formerly NBS), brought this area of research to maturity. The evaluation campaigns in speech recognition, launched in 1987, made it possible to compare the different approaches that had coexisted until then, based either on “Artificial Intelligence” methods or on stochastic modeling methods trained on large amounts of data, with a clear advantage to the latter. This led progressively to a near-generalization of stochastic approaches in most laboratories around the world. The progress made by researchers has constantly kept pace with the increasing difficulty of the tasks handled: from the recognition of sentences read aloud with a limited vocabulary of 1,000 words, either speaker-dependent or speaker-independent, to the dictation of newspaper articles with vocabularies of 5,000, 20,000 and 64,000 words, and then to the transcription of radio and television broadcast news with unlimited vocabularies. These evaluations were opened to the international community in 1992. They first focused on American English, but early initiatives were also carried out on French, German and British English in French or European contexts.
Other campaigns were subsequently held on speaker recognition, language identification and speech synthesis in various contexts, allowing a better understanding of the pros and cons of each approach, and a measure of the state of the technology and of the progress achieved or still to be achieved. They led to the conclusion that a sufficient level of maturity had been reached to bring the technology to market, for example in the field of voice dictation systems. However, they also revealed the difficulty of more challenging problems, such as the recognition of conversational speech, justifying the need to keep supporting fundamental research in this area.

This book consists of two parts: the first discusses the analysis and synthesis of speech, the second speech recognition and understanding. The first part starts with a brief introduction to the principles of speech production, followed by a broad overview of the methods for analyzing speech: linear prediction, the short-term Fourier transform, time representations, wavelets, the cepstrum, etc. The main methods for speech coding are then developed, for the telephone bandwidth, such as the CELP coder, or for broadband communication, such as transform coding and quantization methods. The audiovisual coding of speech is also introduced. The various operations carried out in a text-to-speech synthesis system are then presented, covering the linguistic processes (grapheme-to-phoneme transcription, syntactic and prosodic analysis) and the acoustic processes, using rule-based approaches or approaches based on the concatenation of variable-length acoustic units. The different types of speech signal modeling – articulatory, formant-based, auto-regressive, harmonic-plus-noise or PSOLA-like – are then described. The evaluation of speech synthesis systems receives specific attention in this chapter. The extension of speech synthesis to talking-face animation is the subject of the next chapter, with a presentation of the application fields, of the interest of a bimodal approach and of the models used to synthesize and animate the face. Finally, computational auditory scene analysis opens prospects in the signal processing of speech, especially in noisy environments.

The second part of the book focuses on speech recognition. The principles of speech recognition are first presented. Hidden Markov models are introduced, as well as their use for the acoustic modeling of speech. The Viterbi algorithm is described, before language modeling and the estimation of its probabilities are introduced. There follows a presentation of recognition systems based on these principles, on the integration of these methodologies, and on lexical and acoustic-phonetic knowledge. Applicative aspects such as efficiency, portability and confidence measures are highlighted, before three types of recognition systems are described: for text dictation, for the indexing of audio documents, and for oral dialog. Research in language identification aims at recognizing which language is spoken, using acoustic, phonetic, phonotactic or prosodic information. The characteristics of languages are introduced, the ways humans and machines can achieve this task are described, and the current performance of such systems is presented at length. Speaker recognition addresses the recognition and verification of the identity of a person based on their voice. After an introduction to what characterizes a voice, the different types and designs of systems are presented, as well as their theoretical background. How to evaluate the performance of speaker recognition systems, and the applications of this technology, are specific topics of interest. The use of speech or speaker recognition systems in noisy environments raises especially difficult problems, but these must be taken into account in any operational use of such systems. Various methods are available: pre-processing the signal, using specific distances or robust parameters during the parameterization phase, or applying adaptation methods. The Lombard effect, by which the noisy environment surrounding the speaker changes the production of the voice signal itself, receives special attention.
Along with recognition based solely on the acoustic signal, bimodal recognition combines two acquisition channels, auditory and visual. The value added by bimodal processing in a noisy environment is emphasized, and architectures for audiovisual fusion in speech recognition are presented. Finally, applications of automatic spoken language processing systems, generally for human-machine communication and particularly in telecommunications, are described. Applications of speech coding, recognition and synthesis exist in many fields, and the market is growing rapidly. However, technological and psychological barriers remain that require further work on modeling human factors and on ergonomics, in order to make those systems widely accepted.

The reader, whether undergraduate or graduate student, engineer or researcher, will find in this book many contributions from leading French experts of international renown who share the same enthusiasm for this exciting field: the processing by machines of a capacity that used to be specific to humans, namely language.

Finally, as editor, I would like to warmly thank Anna and Frédéric Bimbot for their excellent work in translating Traitement automatique du langage parlé, the book on which this volume is based.

Joseph Mariani November 2008

Chapter 1

Speech Analysis1

1.1. Introduction

1.1.1. Source-filter model

Speech, the acoustic manifestation of language, is probably the main means of communication between human beings. The invention of telecommunications and the development of digital information processing have therefore motivated vast amounts of research aimed at understanding the mechanisms of speech communication.

Speech can be approached from different angles. In this chapter, we will consider speech as a signal: a one-dimensional, real-valued function of the time variable t (as in [BOI 87, OPP 89, PAR 86, RAB 75, RAB 77]). The acoustic speech signal is captured at a given point in space by a sensor (microphone) and converted into electrical values, denoted s(t), analogous to the variation of the acoustic pressure. Even if the acoustic form of the speech signal is the most widespread (it is the only signal transmitted over the telephone), other types of analysis also exist, based on alternative physiological signals (for instance, the electroglottographic signal, the palatographic signal or the airflow) or on other modalities (for example, the image of the face or the gestures of the articulators). The field of speech analysis covers the set of methods aimed at extracting information from this signal, in various applications, such as:

– speech coding: the compression of information carried by the acoustic signal, in order to save data storage or to reduce transmission rate;

– speech recognition and understanding, speaker and spoken language recognition;

– speech synthesis or automatic speech generation, from an arbitrary text;

– speech signal processing, which covers many applications, such as auditory aid, denoising, speech encrypting, echo cancellation, post-processing for audiovisual applications;

– phonetic and linguistic analysis, speech therapy, voice monitoring in professional situations (for instance, singers, speakers, teachers, managers, etc.).

Two approaches to signal analysis can be distinguished: the model-based approach and the representation-based approach. When a model of the voice signal (or of its production or perception) is assumed, the goal of the analysis step is to identify the parameters of that model. Many analysis methods, referred to as parametric methods, are thus based on the source-filter model of speech production; linear prediction is one example. On the other hand, when no particular hypothesis is made about the signal, mathematical representations equivalent to its time representation can be defined, so that new information can be drawn from the coefficients of the representation. An example of a non-parametric method is the short-term Fourier transform (STFT). Finally, there are hybrid methods (sometimes referred to as semi-parametric), which estimate some parameters from non-parametric representations. The sinusoidal and cepstral representations are examples of semi-parametric representations.
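The contrast between the two approaches can be illustrated with a minimal NumPy sketch: a non-parametric STFT frame (a windowed Fourier magnitude spectrum) next to a parametric LPC analysis using the classical autocorrelation method with the Levinson-Durbin recursion. The frame length, model order and test signals below are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def stft_frame(frame, n_fft=512):
    """Non-parametric view: magnitude spectrum of one Hann-windowed frame."""
    return np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))

def lpc(frame, order=10):
    """Parametric view: coefficients of the prediction-error filter
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order, estimated by the
    autocorrelation method (Levinson-Durbin recursion)."""
    n = len(frame)
    # autocorrelation sequence r[0..order]
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / err   # reflection coefficient
        a[1:i + 1] += k * a[i - 1::-1]        # order-i update of a[1..i]
        err *= 1.0 - k * k                    # prediction-error energy
    return a, err

# A 500 Hz sine sampled at 8 kHz falls exactly on STFT bin 500*512/8000 = 32.
fs, n = 8000, 512
sine = np.sin(2 * np.pi * 500 * np.arange(n) / fs)
print(np.argmax(stft_frame(sine)))  # 32
```

For a first-order autoregressive signal x[t] = 0.5 x[t-1] + e[t], `lpc(x, order=1)` recovers a[1] close to -0.5, i.e. the prediction-error filter 1 - 0.5 z^-1, which is the parametric summary the source-filter model is after.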

This chapter is centered on the linear acoustic source-filter model of speech production. It presents the most common speech signal analysis techniques, together with a few illustrations. The reader is assumed to be familiar with the fundamentals of digital signal processing: discrete-time signals, the Fourier, Laplace and Z-transforms, and digital filters.

1.1.2. Speech sounds

The human speech apparatus can be broken down into three functional parts [HAR 76]: 1) the lungs and trachea, 2) the larynx and 3) the vocal tract. The abdomen and thorax muscles are the engine of the breathing process. Compressed by the muscular system, the lungs act as bellows and supply some air under pressure which travels through the trachea (subglottic pressure). The airflow thus expired is then modulated by the movements of the larynx and those of the vocal tract.

The larynx is composed of the set of muscles, articulated cartilages, ligaments and mucous membranes located between the trachea on one side and the pharyngeal cavity on the other. The cartilages, ligaments and muscles of the larynx can set the vocal cords in motion; the opening between them is called the glottis. When the vocal cords lie apart from each other, the air circulates freely through the glottis and no sound is produced. When the two membranes are close to each other, they can join and modulate the subglottic airflow and pressure, generating isolated pulses or vibrations. The fundamental frequency (F0) of these vibrations governs the pitch of the voice signal.
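A rough sketch of how F0 can be measured from the signal itself is autocorrelation peak picking: search for the lag, within a plausible pitch-period range, at which the signal best matches a shifted copy of itself. The 16 kHz sampling rate, the 60-400 Hz search band and the pulse-train stand-in for a voiced sound are illustrative assumptions, not values from the text.

```python
import numpy as np

def estimate_f0(signal, fs, fmin=60.0, fmax=400.0):
    """Estimate F0 by locating the autocorrelation peak whose lag lies
    between the shortest (fs/fmax) and longest (fs/fmin) plausible
    pitch periods, then converting the winning lag back to Hz."""
    x = signal - np.mean(signal)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0, 1, 2, ...
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

# Crude stand-in for a voiced sound: a 120 Hz square pulse train at 16 kHz.
fs = 16000
t = np.arange(int(0.05 * fs)) / fs
voiced = np.sign(np.sin(2 * np.pi * 120 * t))
```

Running `estimate_f0(voiced, fs)` returns a value close to 120 Hz, the period of the glottal-like excitation. Real voiced speech needs more care (octave errors, unvoiced frames), which is why practical analyzers add voicing decisions on top of this idea.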

The vocal tract can be subdivided into three cavities: the pharynx (from the larynx to the velum and the back of the tongue), the oral tract (from the pharynx to the lips) and the nasal cavity. When it is open, the velum is able to divert some air from the pharynx to the nasal cavity. The geometrical configuration of the vocal tract depends on the organs responsible for the articulation: jaws, lips, tongue.

Each language uses a certain subset of sounds, among those that the speech apparatus can produce [MAL 74]. The smallest distinctive sound units used in a given language are called phonemes. The phoneme is the smallest spoken unit which, when substituted with another one, changes the linguistic content of an utterance. For instance, changing the initial /p/ sound of “pig” (/pIg/) into /b/ yields a different word: “big” (/bIg/). Therefore, the phonemes /p/ and /b/ can be distinguished from each other.

A set of phonemes, which can be used for the description of various languages [WEL 97], is given in Table 1.1 (described both by the International Phonetic Alphabet, IPA, and the computer readable Speech Assessment Methodologies Phonetic Alphabet, SAMPA). The first subdivision that is observed relates to the excitation mode and to the vocal tract stability: the distinction between vowels and consonants. Vowels correspond to a periodic vibration of the vocal cords and to a stable configuration of the vocal tract. Depending on whether the nasal branch is open or not (as a result of the lowering of the velum), vowels have either a nasal or an oral character. Semivowels are produced when the periodic glottal excitation occurs simultaneously with a fast movement of the vocal tract, between two vocalic positions.

Continue reading in the full edition!
