DIGITAL SPEECH TRANSMISSION AND ENHANCEMENT
Enables readers to understand the latest developments in speech enhancement/transmission due to advances in computational power and device miniaturization
The Second Edition of Digital Speech Transmission and Enhancement has been updated throughout to provide all the necessary details on the latest advances in the theory and practice of speech signal processing and its applications, including many new research results, standards, algorithms, and developments which have recently appeared and are on their way into state-of-the-art applications.
Besides mobile communications, which constituted the main application domain of the first edition, speech enhancement for hearing instruments and man-machine interfaces has gained significantly more prominence in the past decade, and as such receives greater focus in this updated and expanded second edition.
Digital Speech Transmission and Enhancement is a single-source, comprehensive guide to the fundamental issues, algorithms, standards, and trends in speech signal processing and speech communication technology, and as such is an invaluable resource for engineers, researchers, academics, and graduate students in the areas of communications, electrical engineering, and information technology.
Second edition
Peter Vary
Institute of Communication Systems
RWTH Aachen University
Aachen, Germany
Rainer Martin
Institute of Communication Acoustics
Ruhr‐Universität Bochum
Bochum, Germany
This second edition first published 2024
© 2024 John Wiley & Sons Ltd.
Edition History
John Wiley & Sons Ltd. (1e, 2006)
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Peter Vary and Rainer Martin to be identified as the authors of this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging‐in‐Publication Data applied for
Hardback: 9781119060963
ePdf: 9781119060994
epub: 9781119060987
Cover Design: Wiley
Cover Image: © BAIVECTOR/Shutterstock
Digital processing, storage, and transmission of speech signals have gained great practical importance. The main areas of application are digital mobile radio, audio‐visual conferencing, acoustic human–machine communication, and hearing aids. In fact, these applications are the driving forces behind many scientific and technological developments in this field. A specific feature of these application areas is that theory and implementation are closely linked; there is a seamless transition from theory and algorithms to system simulations using general‐purpose computers and to implementations on embedded processors.
This book has been written for engineers and engineering students specializing in speech and audio processing. It summarizes fundamental theory and recent developments in the broad field of digital speech transmission and enhancement and includes joint research of the authors and their PhD students. This book is used in graduate courses at RWTH Aachen University, Ruhr‐Universität Bochum, and other universities.
This second edition also reflects progress in digital speech transmission and enhancement since the publication of the first edition [Vary, Martin 2006]. In this respect, new speech coding standards have been included, such as the Enhanced Voice Services (EVS) codec. Throughout this book, the term enhancement comprises not only noise reduction but also error concealment, artificial bandwidth extension, echo cancellation, and the new topic of near‐end listening enhancement.
Furthermore, summaries of essential tools such as spectral analysis, digital filter banks, including the so‐called filter bank equalizer, as well as stochastic signal processing and estimation theory are provided. Recent trends of applying machine learning techniques in speech signal processing are addressed.
As a supplement to the first and second editions, the companion book Advances in Digital Speech Transmission [Martin et al. 2008] deserves mention; it covers specific topics in speech quality assessment, acoustic signal processing, speech coding, joint source‐channel coding, and speech processing in hearing instruments and human–machine interfaces.
Furthermore, the reader will find supplementary information, publications, programs, and audio samples, the Aachen databases (single and multichannel room impulse responses, active noise cancellation impulse responses), and a database of simulated room impulse responses for acoustic sensor networks on the following web sites:
http://www.iks.rwth-aachen.de
http://www.rub.de/ika
The scope of the individual subjects treated in the book chapters exceeds that of graduate lectures; recent research results, standards, problems of realization, and applications have been included, as well as many suggestions for further reading. The reader should be familiar with the fundamentals of digital signal processing and statistical signal processing.
The authors are grateful to all current and former members of their groups and students who contributed to the book through research results, discussions, or editorial work. In particular, we would like to thank Dr.‐Ing. Christiane Antweiler, Dr.‐Ing. Colin Breithaupt, Prof. Gerald Enzner, Prof. Tim Fingscheidt, Prof. Timo Gerkmann, Prof. Peter Jax, Dr.‐Ing. Heiner Löllmann, Prof. Nilesh Madhu, Dr.‐Ing. Anil Nagathil, Dr.‐Ing. Markus Niermann, Dr.‐Ing. Bastian Sauert, and Dr.‐Ing. Thomas Schlien for fruitful discussions and valuable contributions. Furthermore, we would especially like to thank Dr.‐Ing. Christiane Antweiler for her tireless support of this project, and Horst Krott and Dipl.‐Geogr. Julia Ringeis for preparing most of the diagrams.
Finally, we would like to express our sincere thanks to the managing editors and staff of John Wiley & Sons for their kind and patient assistance.
Aachen and Bochum
October 2023
Peter Vary and Rainer Martin
Martin, R.; Heute, U.; Antweiler, C. (2008). Advances in Digital Speech Transmission, John Wiley & Sons.
Vary, P.; Martin, R. (2006). Digital Speech Transmission – Enhancement, Coding and Error Concealment, John Wiley & Sons.
Language is the most essential means of human communication. It is used in two modes: as spoken language (speech communication) and as written language (textual communication). In our modern information society both modes are greatly enhanced by technical systems and devices. E‐mail, short messaging, and the worldwide web have revolutionized textual communication, while
digital cellular radio systems,
audio–visual conference systems,
acoustic human–machine communication, and
digital hearing aids
have significantly expanded the possibilities and convenience of speech and audio–visual communication.
Digital processing and enhancement of speech signals for the purpose of transmission (or storage) is a branch of information technology and an engineering science which draws on various other disciplines, such as physiology, phonetics, linguistics, acoustics, and psychoacoustics. It is this multidisciplinary aspect which makes digital speech processing a challenging as well as rewarding task.
The goal of this book is a comprehensive discussion of fundamental issues, standards, and trends in speech communication technology. Speech communication technology helps to mitigate a number of physical constraints and technological limitations, most notably
bandwidth limitations of the telephone channel,
shortage of radio frequencies,
acoustic background noise at the near‐end (receiving side),
acoustic background noise at the far‐end (transmitting side),
(residual) transmission errors and packet losses caused by the transmission channel,
interfering acoustic echo signals from loudspeaker(s).
The enormous advances in signal processing technology have contributed to the success of speech signal processing. At present, integrated digital signal processors allow economic real‐time implementations of complex algorithms, which require several thousand operations per speech sample. For this reason, advanced speech signal processing functions can be implemented in cellular phones and audio–visual terminals, as illustrated in Figure 1.1.
Figure 1.1 Speech signal processing in a handsfree cellular terminal. BF: beamforming, AEC: acoustic echo cancellation, NR: noise reduction, SC: speech coding, ETC: equivalent transmission channel, EC: error concealment, SD: speech decoding, BWE: bandwidth extension, and NELE: near‐end listening enhancement.
The handsfree terminal in Figure 1.1 facilitates communication via microphones and loudspeakers. Handsfree telephone devices are installed in motor vehicles in order to enhance road safety and to increase convenience in general.
At the far end of the transmission system, three different pre‐processing steps are taken to improve communication in the presence of ambient noise and loudspeaker signals. In the first step, two or more microphones are used to enhance the near‐end speech signal by beamforming (BF). Specific characteristics of the interference, such as the spatial distribution of the sound sources and the statistics of the spatial sound field, are exploited.
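For illustration only, the following Python sketch implements the simplest member of this class of algorithms, a delay-and-sum beamformer for a uniform linear array; the array spacing, sampling rate, and steering angle are assumptions of the sketch, not the specific BF stage of Figure 1.1.

```python
import numpy as np

def delay_and_sum(mics, fs, d, theta, c=343.0):
    """Delay-and-sum beamformer for a uniform linear array.

    mics : (M, N) array holding M time-aligned microphone signals
    fs   : sampling rate in Hz
    d    : microphone spacing in meters (assumed geometry)
    theta: steering angle in radians relative to broadside
    """
    M, N = mics.shape
    X = np.fft.rfft(mics, axis=1)                # per-channel spectra
    f = np.fft.rfftfreq(N, 1.0 / fs)
    for m in range(M):
        tau = m * d * np.sin(theta) / c          # relative delay of mic m
        X[m] *= np.exp(2j * np.pi * f * tau)     # fractional-delay alignment
    return np.fft.irfft(X.sum(axis=0), n=N) / M  # coherent average
```

Signals arriving from the steering direction add coherently, while sound from other directions, and thus diffuse interference, is attenuated.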
Acoustic echoes occur when the far‐end signal leaks at the near‐end from the loudspeaker of the handsfree set into the microphone(s) via the acoustic path. As a consequence, the far‐end speakers will hear their own voice delayed by twice the signal propagation time of the telephone network. Therefore, in a second step, the acoustic echo must be compensated by an adaptive digital filter, the acoustic echo canceller (AEC).
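A common realization of such an adaptive filter is the normalized least-mean-square (NLMS) algorithm; the filter length and step size below are illustrative assumptions, and the sketch omits the double-talk detection a practical AEC requires.

```python
import numpy as np

def nlms_echo_canceller(x, d, L=256, mu=0.5, eps=1e-8):
    """Remove the echo of the far-end signal x from the microphone signal d.

    x, d : equal-length 1-D arrays (far-end signal and near-end microphone signal)
    L    : length of the echo-path model in samples (assumed)
    mu   : normalized step size, 0 < mu < 2 for convergence
    """
    h = np.zeros(L)                        # estimated acoustic echo path
    e = np.zeros(len(d))                   # echo-compensated output
    for n in range(L, len(d)):
        x_vec = x[n - L:n][::-1]           # most recent L far-end samples
        e[n] = d[n] - h @ x_vec            # subtract estimated echo
        h += mu * e[n] * x_vec / (x_vec @ x_vec + eps)  # NLMS update
    return e
```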
The third module of the pre‐processing chain is noise reduction (NR) aiming at an improvement of speech quality prior to coding and transmission. Single‐channel NR systems rely on spectral modifications and are most effective for short‐term stationary noise.
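A minimal single-channel NR sketch using power spectral subtraction is shown below; estimating the noise from a few leading (assumed speech-free) frames, as well as the chosen frame length and gain floor, are simplifications for illustration.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(y, fs, noise_frames=10, floor=0.1):
    """Reduce short-term stationary noise by per-bin spectral gains."""
    _, _, Y = stft(y, fs, nperseg=512)
    # Noise PSD estimated from the first frames (assumed noise-only).
    noise_psd = np.mean(np.abs(Y[:, :noise_frames]) ** 2, axis=1, keepdims=True)
    # Power-subtraction gain, lower-limited to reduce musical noise artifacts.
    gain = np.maximum(1.0 - noise_psd / (np.abs(Y) ** 2 + 1e-12), floor)
    _, y_hat = istft(gain * Y, fs, nperseg=512)
    return y_hat
```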
Speech coding (SC), error concealment (EC), and speech decoding (SD) facilitate the efficient use of the transmission channel. SC algorithms for cellular communications with typical bit rates between 4 and 24 kbit/s are explicitly based upon a model of speech production and exploit properties of the hearing mechanism.
At the receiving side of the transmission system, speech quality is ensured by means of error correction (channel decoding), which is not within the scope of this book. In Figure 1.1, the (inner) channel coding/decoding as well as modulation/demodulation and transmission over the physical channel are modeled as an equivalent transmission channel (ETC). In spite of channel coding, quite frequently residual errors remain. The negative auditive effects of these errors can be mitigated by error concealment (EC) techniques. In many cases, these effects can be reduced by exploiting both residual source redundancy and information about the instantaneous quality of the transmission channel.
Finally, the decoded signal might be subjected to artificial bandwidth extension (BWE), which expands narrowband (0.3–3.4 kHz) speech to wideband (0.05–7.0 kHz) speech, or wideband speech to super wideband (0.05–14.0 kHz) speech. With the introduction of true wideband and super wideband speech and audio coding into telephone networks, this step will be of significant importance as, for a long transition period, narrowband and wideband speech terminals will coexist.
At the receiving end (near‐end), the perception of the decoded (and possibly bandwidth-extended) speech signal might be disturbed by acoustic background noise. The task of the last module in the transmission chain is to improve intelligibility or at least to reduce the listening effort. The received speech signal is modified, taking the near‐end background noise into account, which can be captured with a microphone. This method is called near‐end listening enhancement (NELE).
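The principle can be sketched as a power-constrained spectral reshaping; the simple SNR-equalizing rule below is an illustrative assumption, not the specific NELE algorithm referred to above.

```python
import numpy as np
from scipy.signal import stft, istft

def nele_reshape(s, n, fs, nperseg=512):
    """Reshape far-end speech s given captured near-end noise n,
    redistributing - but not increasing - the total speech power."""
    _, _, S = stft(s, fs, nperseg=nperseg)
    _, _, N = stft(n, fs, nperseg=nperseg)
    speech_psd = np.mean(np.abs(S) ** 2, axis=1, keepdims=True)
    noise_psd = np.mean(np.abs(N) ** 2, axis=1, keepdims=True)
    g = (noise_psd / (speech_psd + 1e-12)) ** 0.25   # boost masked bands
    # Normalize so that the overall speech power stays (approximately) fixed.
    g /= np.sqrt(np.sum(g ** 2 * speech_psd) / (np.sum(speech_psd) + 1e-12))
    _, s_out = istft(g * S, fs, nperseg=nperseg)
    return s_out
```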
Some of these processing functions find also applications in audio–visual conferencing devices and digital hearing aids.
The book is organized as follows. The first part, Fundamentals (Chapters 2–5), deals with models of speech production and hearing, spectral transformations, filter banks, and stochastic processes.
The second part, Speech Coding (Chapters 6–8), covers quantization and differential waveform coding and discusses, in particular, the concepts of code excited linear prediction (CELP). Finally, some of the most relevant speech codec standards are presented, including recent developments such as the Adaptive Multi-Rate (AMR) codec and the Enhanced Voice Services (EVS) codec for cellular and IP communication.
The third part, Speech Enhancement (Chapters 9–15), is concerned with error concealment, bandwidth extension, near‐end listening enhancement, single- and dual-channel noise and reverberation reduction, acoustic echo cancellation, and beamforming.
Digital speech communication systems are largely based on knowledge of speech production, hearing, and perception. In this chapter, we will discuss some fundamental aspects in so far as they are of importance for optimizing speech‐processing algorithms such as speech coding, speech enhancement, or feature extraction for automatic speech recognition.
In particular, we will study the mechanism of speech production and the typical characteristics of speech signals. The digital speech production model will be derived from acoustical and physical considerations. The resulting all‐pole model of the vocal tract is the key element of most of the current speech‐coding algorithms and standards.
Furthermore, we will provide insights into the human auditory system and we will focus on perceptual fundamentals which can be exploited to improve the quality and the effectiveness of speech‐processing algorithms to be discussed in later chapters. With respect to perception, the main aspects to be considered in digital speech transmission are the masking effect and the spectral resolution of the auditory system.
As a detailed discussion of the acoustic theory of speech production, phonetics, psychoacoustics, and perception is beyond the scope of this book, the reader is referred to the literature (e.g., [Fant 1970], [Flanagan 1972], [Rabiner, Schafer 1978], [Pickett 1980], and [Zwicker, Fastl 2007]).
Sound is a mechanical vibration that propagates through matter in the form of waves. Sound waves may be described in terms of a sound pressure field $p(\mathbf{x},t)$ and a sound velocity vector field $\mathbf{v}(\mathbf{x},t)$, which are both functions of a spatial co‐ordinate vector $\mathbf{x}$ and time $t$. While the sound pressure characterizes the density variations (we do not consider the DC component, also known as atmospheric pressure), the sound velocity describes the velocity of dislocation of the physical particles of the medium which carries the waves. This velocity is different from the speed $c$ of the traveling sound wave.
In the context of our applications, i.e., sound waves in air, sound pressure and the resulting density variations $\varrho_\Delta(\mathbf{x},t)$ are related by

$$p(\mathbf{x},t) = c^2\,\varrho_\Delta(\mathbf{x},t) \qquad (2.1)$$

and also the relation between $p$ and $\mathbf{v}$ may be linearized. Then, in the general case of three spatial dimensions, these two quantities are related via differential operators in an infinitesimally small volume of air particles as

$$\varrho_0\,\frac{\partial \mathbf{v}}{\partial t} = -\nabla p \qquad \text{and} \qquad \frac{\partial p}{\partial t} = -\varrho_0\, c^2\,\nabla \cdot \mathbf{v}\,, \qquad (2.2)$$

where $c$ and $\varrho_0$ are the speed of sound and the density at rest, respectively. These equations, also known as Euler's equation and continuity equation [Xiang, Blauert 2021], may be combined into the wave equation

$$\Delta p = \frac{1}{c^2}\,\frac{\partial^2 p}{\partial t^2}\,, \qquad (2.3)$$

where the Laplace operator $\Delta$ in Cartesian coordinates is

$$\Delta = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2}\,. \qquad (2.4)$$

A solution of the wave equation (2.3) is given by plane waves, which feature surfaces of constant sound pressure propagating in a given spatial direction. A harmonic plane wave of angular frequency $\omega$ which propagates in positive or negative $x$ direction may be written in complex notation as

$$p(x,t) = \hat{p}_+\, e^{\,j(\omega t - kx)} + \hat{p}_-\, e^{\,j(\omega t + kx)}\,, \qquad (2.5)$$

where $k = \omega/c = 2\pi/\lambda$ is the wave number, $\lambda$ is the wavelength, and $\hat{p}_+$, $\hat{p}_-$ are the (possibly complex‐valued) amplitudes. Using (2.2), the $x$ component of the sound velocity is then given by

$$v_x(x,t) = \frac{1}{\varrho_0\, c}\left(\hat{p}_+\, e^{\,j(\omega t - kx)} - \hat{p}_-\, e^{\,j(\omega t + kx)}\right). \qquad (2.6)$$

Thus, for a plane wave, the sound velocity is proportional to the sound pressure.
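A short numeric check makes these quantities concrete (the frequency and air parameters are example values):

```python
import numpy as np

c, rho0 = 343.0, 1.2        # speed of sound (m/s) and density of air (kg/m^3)
f = 1000.0                  # example frequency in Hz
omega = 2 * np.pi * f
k = omega / c               # wave number: about 18.3 rad/m
lam = 2 * np.pi / k         # wavelength: about 0.343 m at 1 kHz
p_hat = 1.0                 # assumed pressure amplitude in Pa
v_hat = p_hat / (rho0 * c)  # proportional velocity amplitude, cf. Eq. (2.6)
print(k, lam, v_hat)
```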
In our applications, waves which have a constant sound pressure on concentric spheres are also of interest. Indeed, the wave equation (2.3) delivers a solution for the spherical wave which propagates in radial direction $r$ as

$$p(r,t) = \frac{1}{r}\, f\!\left(t - \frac{r}{c}\right)\,, \qquad (2.7)$$

where $f(\cdot)$ is the propagating waveform. The amplitude of the sound wave diminishes with increasing distance $r$ from the source. We may then use the abstraction of a point source to explain the generation of such spherical waves.
An ideal point source may be represented by its source strength $Q(t)$, the volume velocity of the source [Xiang, Blauert 2021]. Furthermore, with (2.2) we have

$$\frac{\partial v_r}{\partial t} = -\frac{1}{\varrho_0}\,\frac{\partial p}{\partial r} = \frac{1}{\varrho_0}\left(\frac{1}{r^2}\, f\!\left(t-\frac{r}{c}\right) + \frac{1}{rc}\, f'\!\left(t-\frac{r}{c}\right)\right). \qquad (2.8)$$

Then, the radial component of the velocity vector may be integrated over a sphere of radius $r$ to yield the volume velocity $Q(t) = 4\pi r^2\, v_r(r,t)$. For small $r$, the second term on the right‐hand side of (2.8) is smaller than the first. Therefore, for an infinitesimally small sphere, we find with (2.8)

$$f(t) = \frac{\varrho_0}{4\pi}\,\dot{Q}(t) \qquad (2.9)$$

and, with (2.7), for any $r$

$$p(r,t) = \frac{\varrho_0}{4\pi r}\,\dot{Q}\!\left(t - \frac{r}{c}\right), \qquad (2.10)$$

which characterizes, again, a spherical wave. The sound pressure is inversely proportional to the radial distance $r$ from the point source. For a harmonic excitation

$$Q(t) = \hat{Q}\, e^{\,j\omega t} \qquad (2.11)$$

we find the sound pressure

$$p(r,t) = \frac{j\omega\,\varrho_0\,\hat{Q}}{4\pi r}\, e^{\,j(\omega t - kr)} \qquad (2.12)$$

and hence, with (2.8) and an integration with respect to time, the sound velocity

$$v_r(r,t) = \frac{j\omega\,\hat{Q}}{4\pi r c}\left(1 + \frac{1}{jkr}\right) e^{\,j(\omega t - kr)}\,. \qquad (2.13)$$

Clearly, (2.12) and (2.13) satisfy (2.8). Because of the second term in the parentheses in (2.13), sound pressure and sound velocity are not in phase. Depending on the distance $r$ of the observation point to the point source, the behavior of the wave is distinctly different. When the second term cannot be neglected, i.e., for $kr \lesssim 1$, the observation point is in the nearfield of the source. For $kr \gg 1$, the observation point is in the farfield. The transition from the nearfield to the farfield depends on the wave number $k$ and, as such, on the wavelength or the frequency of the harmonic excitation.
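To get a feeling for these regions, the distance at which $kr = 1$ can be evaluated for a few example frequencies:

```python
import numpy as np

c = 343.0                                # speed of sound in m/s
for f in (100.0, 1000.0, 8000.0):        # example frequencies in Hz
    r = c / (2 * np.pi * f)              # distance where kr = 1
    print(f"{f:6.0f} Hz: nearfield/farfield transition near r = {r:.3f} m")
```

At 100 Hz the transition lies at about 0.55 m, at 8 kHz at less than 1 cm, so a microphone close to the mouth picks up pronounced nearfield components at low frequencies.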
The production of speech sounds involves the manipulation of an airstream. The acoustic representation of speech is a sound pressure wave originating from the physiological speech production system. A simplified schematic of the human speech organs is given in Figure 2.1. The main components and their functions are:
lungs: the energy generator,
trachea: for energy transport,
larynx with vocal cords: the signal generator, and
vocal tract with pharynx, oral and nasal cavities: the acoustic filter.
By contraction, the lungs produce an airflow which is modulated by the larynx, processed by the vocal tract, and radiated via the lips and the nostrils. The larynx provides several biological and sound production functions. In the context of speech production, its purpose is to control the stream of air that enters the vocal tract via the vocal cords.
Speech sounds are produced by means of various mechanisms. Voiced sounds are produced when the airflow is interrupted periodically by the movements (vibration) of the vocal cords (see Figure 2.2). This self‐sustained oscillation, i.e., the repeated opening and closing of the vocal cords, can be explained by the Bernoulli effect known from fluid dynamics: as airflow velocity increases, local pressure decreases. At the beginning of each cycle, the area between the vocal cords, which is called the glottis, is almost closed by means of appropriate tension of the vocal cords. Then an increased air pressure builds up below the glottis, forcing the vocal cords to open. As the vocal cords diverge, the velocity of the air flowing through the glottis increases steadily, which causes a drop in the local pressure. Then, the vocal cords snap back to their initial position and the next cycle can start if the airflow from the lungs and the tension of the vocal cords are sustained. Due to the abrupt periodic interruptions of the glottal airflow, as schematically illustrated in Figure 2.2, the resulting excitation (pressure wave) of the vocal tract has a fundamental frequency $f_0$ and a large number of harmonics. These are spectrally shaped according to the frequency response of the acoustic vocal tract. The duration $T_0 = 1/f_0$ of a single cycle is called the pitch period.
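The line structure of such an excitation is easily verified numerically; the fundamental frequency and sampling rate below are assumed example values, chosen so that the pitch period is an integer number of samples:

```python
import numpy as np

fs, f0 = 8000, 125                      # sampling rate and fundamental (Hz)
excitation = np.zeros(fs)               # one second of signal
excitation[::fs // f0] = 1.0            # idealized glottal pulse train
spectrum = np.abs(np.fft.rfft(excitation))
freqs = np.fft.rfftfreq(len(excitation), 1 / fs)
lines = freqs[spectrum > 0.5 * spectrum.max()]
print(lines[:5])                        # [0, 125, 250, 375, 500]: f0 harmonics
```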
Figure 2.1 Organs of speech production.
Unvoiced sounds are generated by a constriction at the open glottis or along the vocal tract causing a nonperiodic turbulent airflow.
Plosive sounds (also known as stops) are caused by building up the air pressure behind a complete constriction somewhere in the vocal tract, followed by a sudden opening. The released airflow may create a voiced or an unvoiced sound or even a mixture of both, depending on the actual constellation of the articulators.
The vocal tract can be subdivided into three sections: the pharynx, the oral cavity, and the nasal cavity. As the entrance to the nasal cavity can be closed by the velum, a distinction is often made in the literature between the nasal tract (from velum to nostrils) and the other two sections (from trachea to lips, including the pharynx cavity). In this chapter, we will define the vocal tract as a variable acoustic resonator including the nasal cavity with the velum either open or closed, depending on the specific sound to be produced. From the engineering point of view, the resonance frequencies are varied by changing the size and the shape of the vocal tract using different constellations and movements of the articulators, i.e., tongue, teeth, lips, velum, lower jaw, etc. Thus, humans can produce a variety of different sounds based on different vocal tract constellations and different acoustic excitations.
Figure 2.2 Glottal airflow during voiced sounds.
Finally, the acoustic waves carrying speech sounds are radiated via the mouth and head. In a first approximation, we may model the radiating head as a spherical source in free space. The (complex‐valued) acoustic load at the lips may then be approximated by the radiation load of a spherical source of radius $r_0$, where $r_0$ represents the head radius. Following (2.13), this load exhibits a high‐pass characteristic,

$$Z(\omega) = \frac{p}{v_r}\bigg|_{r=r_0} = \varrho_0\, c\,\frac{j\omega r_0/c}{1 + j\omega r_0/c}\,, \qquad (2.14)$$

where $\omega$ denotes the angular frequency and $c$ is the speed of sound. This model suggests an acoustic “short circuit” at very low frequencies, i.e., little acoustic radiation at low frequencies, which is also supported by measurements [Flanagan 1960]. For an assumed head radius of $r_0 = 8.5$ cm and $c = 343$ m/s, the 3‐dB cutoff frequency is about 640 Hz.
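The quoted cutoff follows directly from the first-order high-pass behavior, as the small calculation below confirms (head radius as assumed above):

```python
import numpy as np

c, r0 = 343.0, 0.085        # speed of sound (m/s), assumed head radius (m)
f_c = c / (2 * np.pi * r0)  # 3-dB point of j*w*r0/c / (1 + j*w*r0/c)
print(round(f_c))           # about 642 Hz, in line with the text
```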
Most languages can be described as a set of elementary linguistic units, which are called phonemes. A phoneme is defined as the smallest unit which differentiates the meaning of two words in one language. The acoustic representation associated with a phoneme is called a phone. American English, for instance, consists of about 42 phonemes, which are subdivided into four classes:
Vowels are voiced and belong to the speech sounds with the largest energy. They exhibit a quasi-periodic time structure caused by the oscillation of the vocal cords. Their duration varies from 40 to 400 ms. Vowels can be distinguished by the time‐varying resonance characteristics of the vocal tract. The resonance frequencies are also called formant frequencies. Examples: /a/ as in “father” and /i/ as in “eve.”
Diphthongs involve a gliding transition of the articulators from one vowel to another vowel. Examples: /oU/ as in “boat” and /ju/ as in “you.”
Approximants are a group of voiced phonemes for which the airstream escapes through a relatively narrow aperture in the vocal tract. They can, thus, be regarded as intermediate between vowels and consonants [Gimson, Cruttenden 1994]. Examples: /w/ in “wet” and /r/ in “ran.”
Consonants are produced with a stronger constriction of the vocal tract than vowels. All kinds of excitation can be observed. Consonants are subdivided into nasals, stops, fricatives, aspirates, and affricates. Examples of these five subclasses: /m/ as in “more,” /t/ as in “tea,” /f/ as in “free,” /h/ as in “hold,” and /tʃ/ as in “chase.”
Each of these classes may be further divided into subclasses, which are related to the interaction of the articulators within the vocal tract. The phonemes can further be classified as either continuant (excitation of a more or less time-invariant vocal tract) or non-continuant (rapid vocal tract changes). The class of continuant sounds consists of vowels and fricatives (voiced and unvoiced). The non-continuant sounds are represented by diphthongs, semivowels, stops, and affricates.
For the purpose of speech‐signal processing, specific articulatory and phonetic aspects are not as important as the typical characteristics of the waveforms, namely, the basic categories:
voiced,
unvoiced,
mixed voiced/unvoiced,
plosive, and
silence.
Voiced sounds are characterized by their fundamental frequency, i.e., the frequency of vibration of the vocal cords, and by the specific pattern of amplitudes of the spectral harmonics.
In the speech signal processing literature, the fundamental frequency is often called pitch and the respective period is called pitch period. It should be noted, however, that in psychoacoustics the term pitch is used differently, i.e., for the perceived fundamental frequency of a sound, whether or not that frequency is actually present in the waveform (e.g., [Deller Jr. et al. 2000]). The fundamental frequency of young men ranges from 85 to 155 Hz and that of young women from 165 to 255 Hz [Fitch, Holbrook 1970]. Fundamental frequency, also in combination with vocal tract length, is indicative of sex, age, and size of the speaker [Smith, Patterson 2005].
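Given these ranges, the fundamental frequency of a voiced frame can be estimated, for example, with a simple autocorrelation method; the search range below follows the quoted values, while frame length and windowing details are left out for brevity:

```python
import numpy as np

def estimate_f0(frame, fs, f_min=80.0, f_max=260.0):
    """Autocorrelation-based pitch estimate for one voiced frame
    (the frame must be longer than fs / f_min samples)."""
    frame = frame - np.mean(frame)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min, lag_max = int(fs / f_max), int(fs / f_min)
    lag = lag_min + np.argmax(r[lag_min:lag_max + 1])
    return fs / lag                      # f0 estimate in Hz
```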
Unvoiced sounds are determined mainly by their characteristic spectral envelopes. Voiced and unvoiced excitation do not exclude each other. They may occur simultaneously, e.g., in fricative sounds.
The distinctive feature of plosive sounds is the dynamically transient change of the vocal tract. Immediately before the transition, a total constriction in the vocal tract stops sound radiation from the lips for a short period. There might be a small amount of low‐frequency components radiated through the throat. Then, the sudden change with release of the constriction produces a plosive burst.
Some typical speech waveforms are shown in Figure 2.3.
The purpose of developing a model of speech production is not to obtain an accurate description of the anatomy and physiology of human speech production but rather to achieve a simplifying mathematical representation for reproducing the essential characteristics of speech signals.
In analogy to the organs of human speech production as discussed in Section 2.2, it seems reasonable to design a parametric two‐stage model consisting of an excitation source and a vocal tract filter, see also [Rabiner, Schafer 1978], [Parsons 1986], [Quatieri 2001], [Deller Jr. et al. 2000]. The resulting digital source‐filter model, as illustrated in Figure 2.4, will be derived below.
Figure 2.3 Characteristic waveforms of speech signals: (a) Voiced (vowel with transition to voiced consonant); (b) Unvoiced (fricative); (c) Transition: pause–plosive–vowel.
The model consists of two components:
the excitation source, featuring mainly the influence of the lungs and the vocal cords (voiced, unvoiced, mixed), and
the time‐varying digital vocal tract filter, approximating the behavior of the vocal tract (spectral envelope and dynamic transitions).
In this first, simple model, the excitation generator only has to deliver either white noise or a periodic sequence of pitch pulses for synthesizing unvoiced and voiced sounds, respectively, whereas the vocal tract is modeled as a time‐varying discrete‐time filter.
Figure 2.4 Digital source‐filter model.
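This first model already suffices to synthesize vowel-like sounds; in the sketch below, the formant frequencies and bandwidths are assumed textbook-style values for an /a/-like vowel, not parameters taken from this chapter:

```python
import numpy as np
from scipy.signal import lfilter

fs, f0 = 8000, 125                                  # sampling rate, pitch (Hz)
formants = [(730, 90), (1090, 110), (2440, 170)]    # assumed (freq, bw) in Hz

# Vocal tract as an all-pole filter: cascade of second-order resonators.
a = np.array([1.0])
for fc, bw in formants:
    r = np.exp(-np.pi * bw / fs)                    # pole radius from bandwidth
    a = np.convolve(a, [1.0, -2 * r * np.cos(2 * np.pi * fc / fs), r * r])

# Voiced excitation: pitch pulse train (use white noise for unvoiced sounds).
excitation = np.zeros(fs)
excitation[::fs // f0] = 1.0
speech = lfilter([1.0], a, excitation)              # source shaped by the filter
```

Replacing the pulse train by `np.random.randn(fs)` yields an unvoiced, fricative-like sound with the same spectral envelope.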
The digital source‐filter model of Figure 2.4, especially the vocal tract filter, will be derived from the physics of sound propagation inside an acoustic tube. To estimate the necessary filter order, we start with the strongly simplified physical model of Figure 2.5.