Audio Source Separation and Speech Enhancement

Description

Learn the technology behind hearing aids, Siri, and Echo.

Audio source separation and speech enhancement aim to extract one or more source signals of interest from an audio recording involving several sound sources. These technologies are among the most studied in audio signal processing today and play a critical role in the success of hearing aids, hands-free phones, voice command and other noise-robust audio analysis systems, and music post-production software. Research on this topic has followed three convergent paths, originating respectively in sensor array processing, computational auditory scene analysis, and machine learning based approaches such as independent component analysis. This book is the first to provide a comprehensive overview by presenting the common foundations and the differences between these techniques in a unified setting.

Key features:

* A consolidated perspective on audio source separation and speech enhancement.
* Both a historical perspective and the latest advances in the field, e.g. deep neural networks.
* Diverse disciplines: array processing, machine learning, and statistical signal processing.
* Coverage of the most important techniques for both single-channel and multichannel processing.

This book provides both introductory and advanced material suitable for readers with basic knowledge of signal processing and machine learning. Thanks to its comprehensiveness, it will help students select a promising research track, researchers leverage the acquired cross-domain knowledge to design improved techniques, and engineers and developers choose the right technology for their target application scenario. It will also be useful for practitioners from other fields (e.g., acoustics, multimedia, phonetics, and musicology) willing to exploit audio source separation or speech enhancement as pre-processing tools for their own needs.




Table of Contents

Cover

List of Authors

Preface

Acknowledgment

Notations

Acronyms

About the Companion Website

Part I: Prerequisites

Chapter 1: Introduction

1.1 Why are Source Separation and Speech Enhancement Needed?

1.2 What are the Goals of Source Separation and Speech Enhancement?

1.3 How can Source Separation and Speech Enhancement be Addressed?

1.4 Outline

Bibliography

Chapter 2: Time‐Frequency Processing: Spectral Properties

2.1 Time‐Frequency Analysis and Synthesis

2.2 Source Properties in the Time‐Frequency Domain

2.3 Filtering in the Time‐Frequency Domain

2.4 Summary

Bibliography

Chapter 3: Acoustics: Spatial Properties

3.1 Formalization of the Mixing Process

3.2 Microphone Recordings

3.3 Artificial Mixtures

3.4 Impulse Response Models

3.5 Summary

Bibliography

Chapter 4: Multichannel Source Activity Detection, Localization, and Tracking

4.1 Basic Notions in Multichannel Spatial Audio

4.2 Multi‐Microphone Source Activity Detection

4.3 Source Localization

4.4 Summary

Bibliography

Part II: Single‐Channel Separation and Enhancement

Chapter 5: Spectral Masking and Filtering

5.1 Time‐Frequency Masking

5.2 Mask Estimation Given the Signal Statistics

5.3 Perceptual Improvements

5.4 Summary

Bibliography

Chapter 6: Single‐Channel Speech Presence Probability Estimation and Noise Tracking

6.1 Speech Presence Probability and its Estimation

6.2 Noise Power Spectrum Tracking

6.3 Evaluation Measures

6.4 Summary

Bibliography

Chapter 7: Single‐Channel Classification and Clustering Approaches

7.1 Source Separation by Computational Auditory Scene Analysis

7.2 Source Separation by Factorial HMMs

7.3 Separation Based Training

7.4 Summary

Bibliography

Chapter 8: Nonnegative Matrix Factorization

8.1 NMF and Source Separation

8.2 NMF Theory and Algorithms

8.3 NMF Dictionary Learning Methods

8.4 Advanced NMF Models

8.5 Summary

Bibliography

Chapter 9: Temporal Extensions of Nonnegative Matrix Factorization

9.1 Convolutive NMF

9.2 Overview of Dynamical Models

9.3 Smooth NMF

9.4 Nonnegative State‐Space Models

9.5 Discrete Dynamical Models

9.6 The Use of Dynamic Models in Source Separation

9.7 Which Model to Use?

9.8 Summary

9.9 Standard Distributions

Bibliography

Part III: Multichannel Separation and Enhancement

Chapter 10: Spatial Filtering

10.1 Fundamentals of Array Processing

10.2 Array Topologies

10.3 Data‐Independent Beamforming

10.4 Data‐Dependent Spatial Filters: Design Criteria

10.5 Generalized Sidelobe Canceler Implementation

10.6 Postfilters

10.7 Summary

Bibliography

Chapter 11: Multichannel Parameter Estimation

11.1 Multichannel Speech Presence Probability Estimators

11.2 Covariance Matrix Estimators Exploiting SPP

11.3 Methods for Weakly Guided and Strongly Guided RTF Estimation

11.4 Summary

Bibliography

Chapter 12: Multichannel Clustering and Classification Approaches

12.1 Two‐Channel Clustering

12.2 Multichannel Clustering

12.3 Multichannel Classification

12.4 Spatial Filtering Based on Masks

12.5 Summary

Bibliography

Chapter 13: Independent Component and Vector Analysis

13.1 Convolutive Mixtures and their Time‐Frequency Representations

13.2 Frequency‐Domain Independent Component Analysis

13.3 Independent Vector Analysis

13.4 Example

13.5 Summary

Bibliography

Chapter 14: Gaussian Model Based Multichannel Separation

14.1 Gaussian Modeling

14.2 Library of Spectral and Spatial Models

14.3 Parameter Estimation Criteria and Algorithms

14.4 Detailed Presentation of Some Methods

14.5 Summary

Acknowledgment

Bibliography

Chapter 15: Dereverberation

15.1 Introduction to Dereverberation

15.2 Reverberation Cancellation Approaches

15.3 Reverberation Suppression Approaches

15.4 Direct Estimation

15.5 Evaluation of Dereverberation

15.6 Summary

Bibliography

Part IV: Application Scenarios and Perspectives

Chapter 16: Applying Source Separation to Music

16.1 Challenges and Opportunities

16.2 Nonnegative Matrix Factorization in the Case of Music

16.3 Taking Advantage of the Harmonic Structure of Music

16.4 Nonparametric Local Models: Taking Advantage of Redundancies in Music

16.5 Taking Advantage of Multiple Instances

16.6 Interactive Source Separation

16.7 Crowd‐Based Evaluation

16.8 Some Examples of Applications

16.9 Summary

Bibliography

Chapter 17: Application of Source Separation to Robust Speech Analysis and Recognition

17.1 Challenges and Opportunities

17.2 Applications

17.3 Robust Speech Analysis and Recognition

17.4 Integration of Front‐End and Back‐End

17.5 Use of Multimodal Information with Source Separation

17.6 Summary

Bibliography

Chapter 18: Binaural Speech Processing with Application to Hearing Devices

18.1 Introduction to Binaural Processing

18.2 Binaural Hearing

18.3 Binaural Noise Reduction Paradigms

18.4 The Binaural Noise Reduction Problem

18.5 Extensions for Diffuse Noise

18.6 Extensions for Interfering Sources

18.7 Summary

Bibliography

Chapter 19: Perspectives

19.1 Advancing Deep Learning

19.2 Exploiting Phase Relationships

19.3 Advancing Multichannel Processing

19.4 Addressing Multiple‐Device Scenarios

19.5 Towards Widespread Commercial Use

Acknowledgment

Bibliography

Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1 Evaluation software and metrics.

Chapter 3

Table 3.1 Range of RT60 reported in the literature for different environments (Ribas et al., 2016).

Table 3.2 Example artificial mixing effects, from Sturmel et al. (2012).

Chapter 5

Table 5.1 Bayesian probability distributions for the observation and the searched quantity.

Table 5.2 Criteria for the estimation of the searched quantity.

Table 5.3 Overview of the discussed estimation schemes.

Chapter 6

Table 6.1 Summary of noise estimation in the minima controlled recursive averaging approach (Cohen, 2003).

Chapter 7

Table 7.1 SDR (dB) achieved on the CHiME‐2 dataset using supervised training (2ch: average of two input channels).

Chapter 14

Table 14.1 Categorization of existing approaches according to the underlying mixing model, spectral model, spatial model, estimation criterion, and algorithm.

Chapter 17

Table 17.1 Recent noise robust speech recognition tasks: ASpIRE (Harper, 2015), AMI (Hain et al., 2007), CHiME-1 (Barker et al., 2013), CHiME-2 (Vincent et al., 2013), CHiME-3 (Barker et al., 2015), CHiME-4 (Vincent et al., 2017), and REVERB (Kinoshita et al., 2016). See Le Roux and Vincent (2014) for a more detailed list of robust speech processing datasets.

Table 17.2 ASR word accuracy achieved by a GMM-HMM acoustic model on MFCCs with delta and double-delta features. The data are enhanced by FD-ICA followed by time-frequency masking. We compare two different masking schemes, based on phase or interference estimates (Kolossa et al., 2010). UD and MI stand for uncertainty decoding and modified imputation with estimated uncertainties, respectively, while UD* and MI* stand for uncertainty decoding and modified imputation with ideal uncertainties, respectively. Bold font indicates the best results achievable in practice, i.e. without the use of oracle knowledge.

Table 17.3 ASR performance on the AMI meeting task using a single distant microphone and enhanced signals obtained by the DS beamformer and the joint training‐based beamforming network.

Table 17.4 Human recognition rates (%) of a listening test. Each score is based on averaging about 260 unique utterances. “Ephraim–Malah” refers to the log‐spectral amplitude estimator of Ephraim and Malah (1985) in the implementation by Loizou (2007).

Chapter 19

Table 19.1 Average signal-to-distortion ratio (SDR) achieved by the computational auditory scene analysis (CASA) method of Hu and Wang (2013), a DRNN trained to separate the foreground speaker, and two variants of deep clustering for the separation of mixtures of two speakers with all gender combinations at random signal-to-noise ratios (SNRs) between 0 and 10 dB (Hershey et al., 2016; Isik et al., 2016). The test speakers are not in the training set.

List of Illustrations

Chapter 1

Figure 1.1 General mixing process, illustrated in the case of four sources, comprising three point sources and one diffuse source, and multiple channels.

Figure 1.2 General processing scheme for single‐channel and multichannel source separation and speech enhancement.

Chapter 2

Figure 2.1 STFT analysis.

Figure 2.2 STFT synthesis.

Figure 2.3 STFT and Mel spectrograms of an example music signal. High energies are illustrated with dark color and low energies with light color.

Figure 2.4 Set of triangular filter responses distributed uniformly on the Mel scale.

Figure 2.5 Independent sound sources are sparse in the time-frequency domain. The top row illustrates the magnitude STFT spectrograms of two speech signals. The bottom left panel illustrates the histogram of the magnitude STFT coefficients of the first signal, and the bottom right panel the bivariate histogram of the coefficients of both signals.

Figure 2.6 Magnitude spectrum of an exemplary harmonic sound. The fundamental frequency is marked with a cross. The other harmonics are at integer multiples of the fundamental frequency.

Figure 2.7 Example spectrograms of a stationary noise signal (top left), a note played by a piano (top right), a sequence of drum hits (bottom left), and a speech signal (bottom right).

Chapter 3

Figure 3.1 Schematic illustration of the shape of an acoustic impulse response for a room of dimensions 8.00 × 5.00 × 3.10 m, an RT60 of 230 ms, and a source distance of … m. All reflections are depicted as Dirac impulses.

Figure 3.2 First 100 ms of a pair of real acoustic impulse responses from the Aachen Impulse Response Database (Jeub et al., 2009) recorded in a meeting room with the same characteristics as in Figure 3.1 and sampled at 48 kHz.

Figure 3.3 DRR as a function of the RT60 and the source distance based on Eyring's formula (Gustafsson et al., 2003). These curves assume that there is no obstacle between the source and the microphone, so that the direct path exists. The room dimensions are the same as in Figure 3.1.

Figure 3.4 IC of the reverberant part of an acoustic impulse response as a function of microphone distance and frequency.

Figure 3.5 ILD and IPD corresponding to the pair of real acoustic impulse responses in Figure 3.2. Dashed lines denote the theoretical ILD and IPD in the free field, as defined by the relative steering vector in (3.15).

Figure 3.6 Geometrical illustration of the position of a far-field source with respect to a pair of microphones on the horizontal plane, showing the azimuth, the elevation, the angle of arrival, the microphone distance, the source-to-microphone distances, and the unit-norm vector pointing to the source.

Chapter 4

Figure 4.1 Example of GCC-PHAT computed for a microphone pair and represented with gray levels in the case of a speech utterance in a noisy and reverberant environment. The left part highlights the GCC-PHAT at a single frame, with a clear peak at the lag corresponding to the source.

Figure 4.2 Two examples of global coherence field acoustic maps. (a) 2D spatial localization using a distributed network of three microphone pairs, represented by circles. Observe the hyperbolic high correlation lines departing from the three pairs and crossing at the source position. (b) DOA likelihood for upper hemisphere angles using an array of five microphones in the horizontal plane. A high likelihood region exists around the source azimuth and elevation, i.e. the unit-norm vector pointing to the source, see Figure 3.6.

Figure 4.3 1-σ position error distributions (68.3% confidence) resulting from TDOA errors of one sample for two microphone pairs in three different geometries.

Figure 4.4 Graphical example of the particle filter procedure.

Chapter 5

Figure 5.1 Separation of speech from cafe noise by binary vs. soft masking. The masks shown in this example are oracle masks.

Figure 5.2 Illustration of the Rician posterior (5.45) for a given set of parameters. The red dashed line shows the mode of the posterior and thus the MAP estimate of the target spectral magnitude. The purple dotted line corresponds to the approximate MAP estimate (5.47), and the yellow dash-dotted line corresponds to the posterior mean (5.46) and thus the MMSE estimate.

Figure 5.3 Histogram of the real part of complex speech coefficients (Gerkmann and Martin, 2010).

Figure 5.4 Input-output characteristics of different spectral filtering masks for a fixed set of parameters. “Wiener” refers to the Wiener filter, “Ephraim–Malah” to the short-time spectral amplitude estimator of Ephraim and Malah (1984), and “approx. MAP” to the approximate MAP amplitude estimator (5.47) of Wolfe and Godsill (2003). While “Wiener”, “Ephraim–Malah”, and “approx. MAP” are based on a Gaussian speech model, “Laplace prior” refers to an estimator of complex speech coefficients with a super-Gaussian speech prior (Martin and Breithaupt, 2003). Compared to the linear Wiener filter, amplitude estimators tend to apply less attenuation for low inputs, while super-Gaussian estimators tend to apply less attenuation for high inputs.

Figure 5.5 Examples of estimated filters for the noisy speech signal in Figure 5.1. The filters were computed in the STFT domain but are displayed on a nonlinear frequency scale for visualization purposes.

Chapter 6

Figure 6.1 State-of-the-art single-channel noise reduction system operating in the STFT domain. The application of a spectral gain function to the output of the STFT block results in an estimate of the clean speech spectral coefficients. The spectral gain is controlled by the estimated SNR, which in turn requires the noise power tracker as a central component.

Figure 6.2 Spectrograms of (a) a clean speech sample, (b) clean speech mixed with traffic noise at 5 dB SNR, (c) SPP based on the Gaussian model, and (d) SPP based on the Gaussian model with a fixed a priori SNR prior. Single bins in the time-frequency plane are considered. The fixed a priori SNR was set to 15 dB.

Figure 6.3 Power of the noisy speech signal (thin solid line) and estimated noise power (thick solid line) using the noise power tracking approach according to (6.19). The slope parameter is set such that the maximum slope is 5 dB /s. Top: stationary white Gaussian noise; bottom: nonstationary multiple‐speaker babble noise.

Figure 6.4 Probability distribution of short-time power (χ² distribution with 10 degrees of freedom (DoF)) and the corresponding distribution of the minimum of several independent power values.

Figure 6.5 Optimal smoothing parameter as a function of the power ratio.

Figure 6.6 Power of the noisy speech signal (thin solid line) and estimated noise power (thick solid line) using the MMSE noise power tracking approach (Hendriks et al., 2010). Top: stationary white Gaussian noise; bottom: nonstationary multiple-speaker babble noise.

Figure 6.7 Spectrogram of (a) clean speech, (b) clean speech plus additive amplitude‐modulated white Gaussian noise, (c) minimum statistics noise power estimate, (d) log‐error for the minimum statistics estimator, (e) MMSE noise power estimate, and (f) log‐error for the MMSE estimator. For the computation of these noise power estimates we concatenated three identical phrases of which only the last one is shown in the figure. Note that in (d) and (f) light color indicates noise power overestimation while dark color indicates noise power underestimation.

Chapter 7

Figure 7.1 An implementation of a feature‐based CASA system for source separation.

Figure 7.2 Architecture of a GMM‐HMM. The GMM models the typical spectral patterns produced by each source, while the HMM models the spectral continuity.

Figure 7.3 Schematic depiction of an exemplary DNN with two hidden layers and three neurons per layer.

Figure 7.4 DRNN and unfolded DRNN.

Figure 7.5 Architecture of regression DNN for speech enhancement.

Chapter 8

Figure 8.1 An example signal played by a piano consists of a sequence of the notes C, E, and G, followed by the three notes played simultaneously. The basic NMF models the magnitude spectrogram of the signal (top left) as a sum of components having fixed spectra (rightmost panels) and activation coefficients (lowest panels). Each component represents parts of the spectrogram corresponding to an individual note.

Figure 8.2 The NMF model can be used to generate time-frequency masks in order to separate sources from a mixture. Top row: the spectrogram of the mixture signal in Figure 8.1 is modeled with an NMF. Middle row: the model for an individual source can be obtained using a specific set of components. In this case, only component 1 is used to represent an individual note in the mixture. Bottom row: the mixture spectrogram is elementwise multiplied by the time-frequency mask matrix, resulting in a separated source spectrogram.

Figure 8.3 Illustration of the separation of two sources, where source-specific models are obtained in the training stage. Dictionaries consisting of source-specific basis vectors are obtained for sources 1 and 2 using isolated samples of each source. The mixture spectrogram is modeled as a weighted linear combination of all the basis vectors. The activation matrices contain the activations of the basis vectors in all frames.

Figure 8.4 Illustration of the model where a basis vector matrix is obtained for the target source at the training stage and kept fixed, and a basis vector matrix that represents the other sources in the mixture is estimated from the mixture. The two activation matrices are both estimated from the mixture.

Figure 8.5 Gaussian composite model (IS-NMF) by Févotte et al. (2009).

Figure 8.6 Harmonic NMF model by Vincent et al. (2010) and Bertin et al. (2010).

Figure 8.7 Illustration of the coupled factorization model, where basis vectors acquired from training data are elementwise multiplied with an equalization filter response to better model the observed data at test time.

Chapter 9

Figure 9.1 Learning temporal dependencies. The top right plot shows the input matrix, which has a consistent left-right structure. The top left plot shows the learned basis matrices and the bottom right plot shows the learned activations. The bottom left plot shows the bases again, only this time we concatenate the corresponding columns from each basis matrix. We can clearly see that this sequence of columns learns bases that extend over time.

Figure 9.2 Extraction of time‐frequency sources. The input to this case is shown in the top right plot. It is a drum pattern composed of four distinct drum sounds. The set of four top left plots shows the extracted time‐frequency templates using 2D convolutive NMF. Their corresponding activations are shown in the step plots in the bottom right, and the individual convolutions of each template with its activation as the lower left set of plots. As one can see this model learns the time‐frequency profile of the four drum sounds and correctly identifies where they are located.

Figure 9.3 Convolutive NMF dictionary elements for a speech recording. Note that each component has the form of a short phoneme-like speech inflection.

Figure 9.4 Convolutive NMF decomposition for a violin recording. Note how the one extracted basis corresponds to a constant-Q spectrum that, when 2D convolved with the activation, approximates the input. The peaks in the activation produce a pitch transcription of the recording by indicating energy at each pitch and time offset.

Figure 9.5 Effect of regularization. A segment of one of the rows of the activation matrix is displayed, corresponding to the activations of the accompaniment (piano and double bass). A trumpet solo occurs in the middle of the displayed time interval, where the accompaniment vanishes; the regularization smoothes out coefficients with small energies that remain in unpenalized IS-NMF.

Figure 9.6 Dictionaries were learned from the speech data of a given speaker. Shown are the dictionaries learned for 18 of the 40 states. Each dictionary is composed of 10 elements that are stacked next to each other. Each of these dictionaries roughly corresponds to a subunit of speech, either a voiced or unvoiced phoneme.

Figure 9.7 Example of dynamic models for source separation. The four spectrograms show the mixture and the extracted speech for three different approaches. D-PLCA denotes dynamic PLCA and N-HMM denotes nonnegative HMM. The bar plot shows a quantitative evaluation of the separation performance of each approach. Adapted from Smaragdis et al. (2014).

Chapter 10

Figure 10.1 General block diagram of a beamformer with its filters.

Figure 10.2 Beampower (10.6) in polar coordinates of the DS beamformer for a uniform linear array. The beampower is normalized to unity at the look direction. The different behavior of the beampower as a function of the array look direction and the signal wavelength is clearly demonstrated.

Figure 10.3 Harmonically nested linear arrays designed for frequencies up to 8 kHz.

Figure 10.4 Constant-beamwidth distortionless response beamformer design with desired response towards broadside and constrained white noise gain after FIR approximation (Mabande et al., 2009) for the array of Figure 10.3.

Figure 10.5 GSC structure for implementing the LCMV beamformer.

Chapter 11

Figure 11.1 High‐level block diagram of the common estimation framework for constructing data‐dependent spatial filters.

Figure 11.2 Example of single‐channel and multichannel SPP in a noisy scenario.

Figure 11.3 Example of a DDR estimator in a noisy and reverberant scenario.

Figure 11.4 Postfilter incorporating spatial information (Cohen et al., 2003; Gannot and Cohen, 2004).

Chapter 12

Figure 12.1 Left: spectrograms of a three-source mixture recorded on two directional microphones spaced approximately 6 cm apart (Sawada et al., 2011). Right: ILD and IPD of this recording.

Figure 12.2 Histogram of ITD and ILD features extracted by DUET from the recording in Figure 12.1 along with separation masks estimated using manually selected parameters.

Figure 12.3 Probabilistic masks estimated by MESSL from the mixture shown in Figure 12.1 using Markov random field mask smoothing (Mandel and Roman, 2015).

Figure 12.4 Multichannel MESSL algorithm: masks are computed for each pair of microphones in the E-step, then combined across pairs, and the parameters for each pair are re-estimated from the global masks.

Figure 12.5 Narrowband clustering followed by permutation alignment.

Figure 12.6 Example spectra and masks obtained by a multichannel clustering system on reverberant mixtures (Sawada et al., 2011).

Figure 12.7 System diagram of a multichannel classification system.

Figure 12.8 Block diagram of spatial filtering based on masks.

Chapter 13

Figure 13.1 Multichannel time‐frequency representation of an observed signal (left) and its slices: frequency‐wise and microphone‐wise (right). Methods discussed in this chapter are shown in red.

Figure 13.2 The scatter plot on the left-hand side illustrates the joint probability distribution of two normalized and uniformly distributed signals. The histograms correspond to the marginal distributions of the two variables. The right-hand side corresponds to the same signals when mixed by a mixing matrix. In the latter (mixed) case, the histograms are closer to a Gaussian.

Figure 13.3 (a) Examples of generalized Gaussian distributions with shape parameters corresponding to the gamma, Laplacian, and Gaussian distributions, as well as a large shape parameter. The latter case demonstrates the fact that as the shape parameter grows the distribution tends to a uniform one. (b) Histogram of a female utterance, which can be modeled by a Laplacian or gamma distribution.

Figure 13.4 Complex-valued source models. (a) and (b) are based on (13.16). (c) is a complex Gaussian distribution.

Figure 13.5 A single‐trial SIR achieved by EFICA and BARBI when separating 15 different signals (five i.i.d. sequences, three nonwhite AR Gaussian processes, three piecewise Gaussian i.i.d. processes (nonstationary) and four speech signals).

Figure 13.6 Flow of FD‐ICA for separating convolutive mixtures.

Figure 13.7 Comparison of two criteria for permutation alignment: magnitude spectrum and power ratio. Permutations are aligned as each color (blue or green) corresponds to the same source. Power ratios generally exhibit a higher correlation coefficient for the same source and a more negative correlation coefficient for different sources.

Figure 13.8 TDOA estimation and permutation alignment. For a two‐microphone two‐source situation (left), ICA is applied in each frequency bin and TDOAs for two sources between the two microphones are estimated (right upper). Each color (navy or orange) corresponds to the same source. Clustering for TDOAs aligns the permutation ambiguities (right lower).

Figure 13.9 Experimental setups: source positions a and b for a two-source setup, a, b, and c for a three-source setup, and a through d for a four-source setup. The number of microphones used is always the same as the number of sources.

Figure 13.10 Experimental results (dataset A): SDR averaged over separated spatial images of the sources.

Figure 13.11 Spectrograms of the separated signals (dataset A, two sources, STFT window size of 1024). Here, IVA failed to solve the permutation alignment. These two sources are difficult to separate because the signals share the same silent period at around 2.5 s.

Figure 13.12 Experimental results with less active sources (dataset B): SDR averaged over separated spatial images of the sources.

Chapter 14

Figure 14.1 Illustration of multichannel NMF, showing the complex-valued spectrograms of the sources and the mixture channels, and the complex-valued mixing coefficients. NMF factors the power spectrogram of each source into a dictionary of spectra and their activations (see Chapter 8 and Section 14.2.1). The mixing system is represented by a rank-1 spatial model (see Section 14.2.2).

Figure 14.2 Block diagram of multichannel Gaussian model based source separation.

Figure 14.3 Illustration of the HMM and multichannel NMF spectral models.

Figure 14.4 Graphical illustration of the EM algorithm for MAP estimation.

Chapter 15

Figure 15.1 Example room acoustic impulse response.

Figure 15.2 Schematic view of the linear‐predictive multiple‐input equalization method, after Habets (2016).

Figure 15.3 Data‐dependent spatial filtering approach to perform reverberation suppression.

Figure 15.4 Single‐channel spectral enhancement approach to perform reverberation suppression.

Chapter 16

Figure 16.1 STFT and constant‐Q representation of a trumpet signal composed of three musical notes of different pitch.

Figure 16.2 Schematic illustration of the filter part of the source-filter model. Durrieu et al. (2011) defined the dictionary of filter atomic elements as a set of 30 Hann functions with 75% overlap.

Figure 16.3 The source‐filter model in the magnitude frequency domain.

Figure 16.4 Local regularities in the spectrograms of percussive (vertical) and harmonic (horizontal) sounds.

Figure 16.5 REPET: building the repeating background model. In stage 1, we analyze the mixture spectrogram and identify the repeating period. In stage 2, we split the mixture into patches of the identified length and take the median of them. This allows their common part and hence the repeating pattern to be extracted. In stage 3, we use this repeating pattern in each segment to construct a mask for separation.

Figure 16.6 Examples of kernels to use in KAM for modeling (a) percussive, (b) harmonic, (c) repetitive, and (d) spectrally smooth sounds.

Figure 16.7 Using the audio tracks of multiple related videos to perform source separation. Each circle on the left represents a mixture containing music and vocals in the language associated with the flag. The music is the same in all mixtures; only the language varies. Given multiple copies of a mixture in which one element is fixed, one can separate out this stable element (the music) from the varied elements (the speech in various languages).

Figure 16.8 Interferences of different sources in real‐world multitrack recordings. Left: microphone setup. Right: interference pattern. (Courtesy of R. Bittner and T. Prätzlich.)

Chapter 17

Figure 17.1 Block diagram of an ASR system.

Figure 17.2 Calculation of MFCC and log Mel‐filterbank features. “DCT” is the discrete cosine transform.

Figure 17.3 Flow chart of feature extraction in a typical speaker recognition system.

Figure 17.4 Flow chart of robust multichannel ASR with source separation, adaptation, and testing blocks.

Figure 17.5 Comparison of the DS and MVDR beamformers on the CHiME‐3 dataset. The results are obtained with a DNN acoustic model applied to the feature pipeline in Figure 17.6 and trained with the state‐level minimum Bayes risk cost. Real‐dev and Simu‐dev denote the real and simulation development sets, respectively. Real‐test and Simu‐test denote the real and simulation evaluation sets, respectively.

Figure 17.6 Pipeline of state‐of‐the‐art feature extraction, normalization, and transformation procedure for noise robust speech analysis and recognition. CMN, LDA, STC, and fMLLR stand for cepstral mean normalization, linear discriminant analysis, semitied covariance transform, and feature‐space ML linear regression, respectively.

Figure 17.7 Observation uncertainty pipeline.

Figure 17.8 Joint training of a unified network for beamforming and acoustic modeling. Adapted from Xiao et al. (2016).

Chapter 18

Figure 18.1 Block diagram for binaural spectral postfiltering based on a common spectro‐temporal gain: (a) direct gain computation (one microphone on each hearing device) and (b) indirect gain computation (two microphones on each hearing device).

Figure 18.2 Block diagram for binaural spatial filtering: (a) incorporating constraints into spatial filter design and (b) mixing with scaled reference signals.

Figure 18.3 Considered acoustic scenario, consisting of a desired source, an interfering source, and background noise in a reverberant room. The signals are received by the microphones on both hearing devices of the binaural hearing system.

Figure 18.4 Schematic overview of this chapter.

Figure 18.5 Psychoacoustically motivated lower and upper MSC boundaries: (a) frequency range 0–8000 Hz and (b) frequency range 0–500 Hz. For frequencies below 500 Hz, the boundaries depend on the desired MSC while for frequencies above 500 Hz the boundaries are independent of the desired MSC.

Figure 18.6 MSC error of the noise component, intelligibility‐weighted speech distortion, and intelligibility‐weighted output SNR for the MWF, MWF‐N, and MWF‐IC.

Figure 18.7 Performance measures for the binaural MWF, MWF-RTF, MWF-IR-0, and MWF-IR-0.2 for a desired speech source at 5° and different interfering source positions. The global input SINR was equal to 3 dB.

Figure 18.8 Performance measures for the binaural MWF, MWF-RTF, MWF-IR-0.2, and MWF-IR-0 for a desired speech source at 35° and different interfering source positions. The global input SINR was equal to 3 dB.

Chapter 19

Figure 19.1 Short‐term magnitude spectrum and various representations of the phase spectrum of a speech signal for an STFT analysis window size of 64 ms. For easier visualization, the deviation of the instantaneous frequency from the center frequency of each band is shown rather than the instantaneous frequency itself.

Figure 19.2 Interchannel level difference (ILD) and IPD for two different source positions (plain curve and dashed curve) 10 cm apart from each other at 1.70 m distance from the microphone pair, with different source DOAs. The reverberation time is 230 ms and the microphone distance is 15 cm.

Figure 19.3 IPD between two microphones spaced by 15 cm belonging (a) to the same device or (b) to two distinct devices with a relative sampling rate mismatch. For illustration purposes, the recorded sound scene consists of a single speech source at a distance of 1.70 m in a room with a reverberation time of 230 ms, without any interference or noise, and the two devices have zero temporal offset at the start of the recording.


Audio Source Separation and Speech Enhancement

Edited by

Emmanuel Vincent
Inria, France

Tuomas Virtanen
Tampere University of Technology, Finland

Sharon Gannot
Bar-Ilan University, Israel

Copyright

This edition first published 2018

© 2018 John Wiley & Sons Ltd

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Emmanuel Vincent, Tuomas Virtanen & Sharon Gannot to be identified as authors of the editorial material in this work has been asserted in accordance with law.

Registered Offices

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office

The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: Vincent, Emmanuel (Research scientist), editor. | Virtanen, Tuomas, editor. | Gannot, Sharon, editor.

Title: Audio source separation and speech enhancement / edited by Emmanuel Vincent, Tuomas Virtanen, Sharon Gannot.

Description: Hoboken, NJ : John Wiley & Sons, 2018. | Includes bibliographical references and index. |

Identifiers: LCCN 2018013163 (print) | LCCN 2018021195 (ebook) | ISBN 9781119279884 (pdf) | ISBN 9781119279914 (epub) | ISBN 9781119279891 (cloth)

Subjects: LCSH: Speech processing systems. | Automatic speech recognition.

Classification: LCC TK7882.S65 (ebook) | LCC TK7882.S65 .A945 2018 (print) | DDC 006.4/54-dc23

LC record available at https://lccn.loc.gov/2018013163

Cover Design: Wiley

Cover Images: © 45RPM/iStockphoto;

© franckreporter/iStockphoto

List of Authors

Shoko Araki, NTT Communication Science Laboratories, Japan
Roland Badeau, Institut Mines-Télécom, France
Alessio Brutti, Fondazione Bruno Kessler, Italy
Israel Cohen, Technion, Israel
Simon Doclo, Carl von Ossietzky-Universität Oldenburg, Germany
Jun Du, University of Science and Technology of China, China
Zhiyao Duan, University of Rochester, NY, USA
Cédric Févotte, CNRS, France
Sharon Gannot, Bar-Ilan University, Israel
Tian Gao, University of Science and Technology of China, China
Timo Gerkmann, Universität Hamburg, Germany
Emanuël A.P. Habets, International Audio Laboratories Erlangen, Germany
Elior Hadad, Bar-Ilan University, Israel
Hirokazu Kameoka, The University of Tokyo, Japan
Walter Kellermann, Friedrich-Alexander Universität Erlangen-Nürnberg, Germany
Zbyněk Koldovský, Technical University of Liberec, Czech Republic
Dorothea Kolossa, Ruhr-Universität Bochum, Germany
Antoine Liutkus, Inria, France
Michael I. Mandel, City University of New York, NY, USA
Erik Marchi, Technische Universität München, Germany
Shmulik Markovich-Golan, Bar-Ilan University, Israel
Daniel Marquardt, Carl von Ossietzky-Universität Oldenburg, Germany
Rainer Martin, Ruhr-Universität Bochum, Germany
Nasser Mohammadiha, Chalmers University of Technology, Sweden
Gautham J. Mysore, Adobe Research, CA, USA
Tomohiro Nakatani, NTT Communication Science Laboratories, Japan
Patrick A. Naylor, Imperial College London, UK
Maurizio Omologo, Fondazione Bruno Kessler, Italy
Alexey Ozerov, Technicolor, France
Bryan Pardo, Northwestern University, IL, USA
Pasi Pertilä, Tampere University of Technology, Finland
Gaël Richard, Institut Mines-Télécom, France
Hiroshi Sawada, NTT Communication Science Laboratories, Japan
Paris Smaragdis, University of Illinois at Urbana-Champaign, IL, USA
Piergiorgio Svaizer, Fondazione Bruno Kessler, Italy
Emmanuel Vincent, Inria, France
Tuomas Virtanen, Tampere University of Technology, Finland
Shinji Watanabe, Johns Hopkins University, MD, USA
Felix Weninger, Nuance Communications, Germany

Preface

Source separation and speech enhancement are some of the most studied technologies in audio signal processing. Their goal is to extract one or more source signals of interest from an audio recording involving several sound sources. This problem arises in many everyday situations. For instance, spoken communication is often obscured by concurrent speakers or by background noise, outdoor recordings feature a variety of environmental sounds, and most music recordings involve a group of instruments. When facing such scenes, humans are able to perceive and listen to individual sources so as to communicate with other speakers, navigate in a crowded street or memorize the melody of a song. Source separation and speech enhancement technologies aim to empower machines with similar abilities.

These technologies are already present in our lives today. Beyond “clean” single‐source signals recorded with close microphones, they allow the industry to extend the applicability of speech and audio processing systems to multi‐source, reverberant, noisy signals recorded with distant microphones. Some of the most striking examples include hearing aids, speech enhancement for smartphones, and distant‐microphone voice command systems. Current technologies are expected to keep improving and spread to many other scenarios in the next few years.

Traditionally, speech enhancement has referred to the problem of segregating speech and background noise, while source separation has referred to the segregation of multiple speech or audio sources. Most textbooks focus on one of these problems and on one of three historical approaches, namely sensor array processing, computational auditory scene analysis, or independent component analysis. These communities now routinely borrow ideas from each other and other approaches have emerged, most notably based on deep learning.

This textbook is the first to provide a comprehensive overview of these problems and approaches by presenting their shared foundations and their differences using common language and notations. Starting with prerequisites (Part I), it proceeds with single‐channel separation and enhancement (Part II), multichannel separation and enhancement (Part III), and applications and perspectives (Part IV). Each chapter provides both introductory and advanced material.

We designed this textbook for people in academia and industry with basic knowledge of signal processing and machine learning. Thanks to its comprehensiveness, we hope it will help students select a promising research track, researchers leverage the acquired cross‐domain knowledge to design improved techniques, and engineers and developers choose the right technology for their application scenario. We also hope that it will be useful for practitioners from other fields (e.g., acoustics, multimedia, phonetics, musicology) willing to exploit audio source separation or speech enhancement as a pre‐processing tool for their own needs.

Emmanuel Vincent, Tuomas Virtanen, and Sharon Gannot

May 2017

Acknowledgment

We would like to thank all the chapter authors, as well as the following people who helped with proofreading: Sebastian Braun, Yaakov Buchris, Emre Cakir, Aleksandr Diment, Dylan Fagot, Nico Gößling, Tomoki Hayashi, Jakub Janský, Ante Jukić, Václav Kautský, Martin Krawczyk‐Becker, Simon Leglaive, Bochen Li, Min Ma, Paul Magron, Zhong Meng, Gaurav Naithani, Zhaoheng Ni, Aditya Arie Nugraha, Sanjeel Parekh, Robert Rehr, Lea Schönherr, Georgina Tryfou, Ziteng Wang, and Mehdi Zohourian

Emmanuel Vincent, Tuomas Virtanen, and Sharon Gannot

May 2017

Notations

Linear algebra

scalar
vector
vector with given entries
entry of a vector
vector of zeros
vector of ones
matrix
matrix with given entries
entry of a matrix
identity matrix
tensor/array (with three or more dimensions) or set
tensor with given entries
diagonal matrix whose entries are those of a given vector
entrywise product of two matrices
trace of a matrix
determinant of a matrix
transpose of a vector
conjugate-transpose of a vector
conjugate of a scalar
real part of a scalar
imaginary unit

Statistics

probability distribution of a continuous random variable
conditional probability distribution of one continuous random variable given another
probability value of a discrete random variable
conditional probability value of one discrete random variable given another
expectation of a random variable
conditional expectation of a random variable
entropy of a random variable
real Gaussian distribution with given mean and covariance
complex Gaussian distribution with given mean and covariance
estimated value of a random variable (e.g., first-order statistics)
variance of a random variable
estimated second-order statistics of a random variable
autocovariance of a random vector
estimated second-order statistics of a random vector
covariance of two random vectors
estimated second-order statistics of two random vectors
cost function to be minimized w.r.t. the vector of parameters
objective function to be maximized w.r.t. the vector of parameters
auxiliary function to be minimized or maximized, depending on the context

Common indexes

number of microphones or channels
microphone or channel index
number of sources
source index
number of time-domain samples
sample index
time-domain filter length
tap index
number of time frames
time frame index
number of frequency bins
frequency bin index

Acronyms

AR: autoregressive
ASR: automatic speech recognition
BSS: blind source separation
CASA: computational auditory scene analysis
DDR: direct-to-diffuse ratio
DFT: discrete Fourier transform
DNN: deep neural network
DOA: direction of arrival
DRNN: deep recurrent neural network
DRR: direct-to-reverberant ratio
DS: delay-and-sum
ERB: equivalent rectangular bandwidth
EM: expectation-maximization
EUC: Euclidean
FD-ICA: frequency-domain independent component analysis
FIR: finite impulse response
GCC: generalized cross-correlation
GCC-PHAT: generalized cross-correlation with phase transform
GMM: Gaussian mixture model
GSC: generalized sidelobe canceler
HMM: hidden Markov model
IC: interchannel (or interaural) coherence
ICA: independent component analysis
ILD: interchannel (or interaural) level difference
IPD: interchannel (or interaural) phase difference
ITD: interchannel (or interaural) time difference
IVA: independent vector analysis
IS: Itakura–Saito
KL: Kullback–Leibler
LCMV: linearly constrained minimum variance
LSTM: long short-term memory
MAP: maximum a posteriori
MFCC: Mel-frequency cepstral coefficient
ML: maximum likelihood
MM: majorization-minimization
MMSE: minimum mean square error
MSC: magnitude squared coherence
MSE: mean square error
MVDR: minimum variance distortionless response
MWF: multichannel Wiener filter
NMF: nonnegative matrix factorization
PLCA: probabilistic latent component analysis
RNN: recurrent neural network
RT60: reverberation time
RTF: relative transfer function
SAR: signal-to-artifacts ratio
SDR: signal-to-distortion ratio
SINR: signal-to-interference-plus-noise ratio
SIR: signal-to-interference ratio
SNR: signal-to-noise ratio
SPP: speech presence probability
SRP: steered response power
SRP-PHAT: steered response power with phase transform
SRR: signal-to-reverberation ratio
STFT: short-time Fourier transform
TDOA: time difference of arrival
VAD: voice activity detection
VB: variational Bayesian

About the Companion Website

This book is accompanied by a companion website:

https://project.inria.fr/ssse/

The website includes:

Implementations of algorithms

Audio samples

Part I: Prerequisites

Chapter 1: Introduction

Emmanuel Vincent, Sharon Gannot, and Tuomas Virtanen

Source separation and speech enhancement are core problems in the field of audio signal processing, with applications to speech, music, and environmental audio. Research in this field has accompanied technological trends, such as the move from landline to mobile or hands‐free phones, the gradual replacement of stereo by 3D audio, and the emergence of connected devices equipped with one or more microphones that can execute audio processing tasks which were previously regarded as impossible. In this short introductory chapter, after a brief discussion of the application needs in Section 1.1, we define the problems of source separation and speech enhancement and introduce relevant terminology regarding the scenarios and the desired outcome in Section 1.2. We then present the general processing scheme followed by most source separation and speech enhancement approaches and categorize these approaches in Section 1.3. Finally, we provide an outline of the book in Section 1.4.

1.1 Why are Source Separation and Speech Enhancement Needed?

The problems of source separation and speech enhancement arise from several application needs in the context of speech, music, and environmental audio processing.

Real‐world speech signals are often contaminated by interfering speakers, environmental noise, and/or reverberation. These phenomena deteriorate speech quality and, in adverse scenarios, speech intelligibility and automatic speech recognition (ASR) performance. Source separation and speech enhancement are therefore required in such scenarios. For instance, spoken communication over mobile phones or hands‐free systems requires the separation or enhancement of the near‐end speaker's voice with respect to interfering speakers and environmental noises before it is transmitted to the far‐end listener. Conference call systems or hearing aids face the same problem, except that several speakers may be considered as targets. Source separation and speech enhancement are also crucial preprocessing steps for robust distant‐microphone ASR, as available in today's personal assistants, car navigation systems, televisions, video game consoles, medical dictation devices, and meeting transcription systems. Finally, they are necessary components in providing humanoid robots, assistive listening devices, and surveillance systems with “super‐hearing” capabilities, which may exceed the hearing capabilities of humans.

Besides speech, music and movie soundtracks are another important application area for source separation. Indeed, music recordings typically involve several instruments playing together live or mixed together in a studio, while movie soundtracks involve speech overlapped with music and sound effects. Source separation has been successfully used to upmix mono or stereo recordings to 3D sound formats and/or to remix them. It lies at the core of object‐based audio coders, which encode a given recording as the sum of several sound objects that can then easily be rendered and manipulated. It is also useful for music information retrieval purposes, e.g. to transcribe the melody or the lyrics of a song from the separated singing voice.

Finally, the analysis of general sound scenes, involving the detection of sound events, their localization and tracking, and the inference of acoustic environment properties, is an emerging research field with many real-life applications.

1.2 What are the Goals of Source Separation and Speech Enhancement?

The goal of source separation and speech enhancement can be defined in layman's terms as that of recovering the signal of one or more sound sources from an observed signal involving other sound sources and/or reverberation. This definition turns out to be ambiguous. In order to address the ambiguity, the notion of source and the process leading to the observed signal must be characterized more precisely. In this section and in the rest of this book we adopt the general notations defined on p. xxv–xxvii.

1.2.1 Single‐Channel vs. Multichannel

Let us assume that the observed signal has one or more channels, indexed by a channel index. By channel, we mean the output of one microphone in the case when the observed signal has been recorded by one or more microphones, or the input of one loudspeaker in the case when it is destined to be played back on one or more loudspeakers.1 A signal with a single channel is called single-channel and is represented by a scalar, while a signal with two or more channels is called multichannel and is represented by a vector whose dimension equals the number of channels. The explanation below employs multichannel notation, but is also valid in the single-channel case.
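To make the channel conventions above concrete, here is a minimal NumPy sketch (the array names, sampling rate, and sizes are illustrative choices, not taken from the book) showing a single-channel signal stored as a 1-D array of samples and a multichannel signal stored as a channels-by-samples array, so that each time index yields either a scalar or a vector.

```python
import numpy as np

fs = 16000                    # assumed sampling rate in Hz (illustrative)
T = fs                        # one second of audio, i.e. T samples

# Single-channel signal: one scalar per time index t.
x_single = np.random.randn(T)         # shape (T,)

# Multichannel signal with I channels: one I-dimensional vector per time index t,
# stored here as an (I, T) array.
I = 2
x_multi = np.random.randn(I, T)       # shape (I, T)

print(x_single.shape)    # (16000,)   -> scalar value x_single[t] at each time index
print(x_multi.shape)     # (2, 16000) -> vector x_multi[:, t] at each time index
```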

1.2.2 Point vs. Diffuse Sources

Furthermore, let us assume that there are one or more sound sources, each identified by a source index. The word “source” can refer to two different concepts. A point source such as a human speaker, a bird, or a loudspeaker is considered to emit sound from a single point in space. It can be represented as a single-channel signal. A diffuse source such as a car, a piano, or rain simultaneously emits sound from a whole region in space. The sounds emitted from different points of that region are different but not always independent of each other. Therefore, a diffuse source can be thought of as an infinite collection of point sources. The estimation of the individual point sources in this collection can be important for the study of vibrating bodies, but it is considered irrelevant for source separation or speech enhancement. A diffuse source is therefore typically represented by the corresponding signal recorded at the microphone(s) and it is processed as a whole.
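As a rough illustration of the remark that a diffuse source behaves like a large collection of point sources, the following toy sketch (an illustrative construction with arbitrary delays and gains, not a method described in this chapter) sums many independently delayed and attenuated contributions at two microphones; the resulting two-channel recording is the kind of spatial image that would be kept and processed as a whole.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, T, I = 16000, 16000, 2     # sampling rate, number of samples, number of channels
num_points = 50                 # finite stand-in for the "infinite collection" of point sources

diffuse_image = np.zeros((I, T))
for _ in range(num_points):
    s = rng.standard_normal(T)              # one point contribution (e.g., a single raindrop)
    for i in range(I):
        delay = int(rng.integers(0, 40))    # arbitrary propagation delay per channel (samples)
        gain = rng.uniform(0.01, 0.05)      # arbitrary attenuation per channel
        diffuse_image[i, delay:] += gain * s[:T - delay]

# The diffuse source is represented by this two-channel recorded signal and
# processed as a whole, rather than by its individual point contributions.
print(diffuse_image.shape)    # (2, 16000)
```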

1.2.3 Mixing Process

The mixing process leading to the observed signal can generally be expressed in two steps. First, each single-channel point source signal is transformed into a multichannel source spatial image signal (Vincent et al., 2012) by means of a possibly nonlinear spatialization operation. This operation can describe the acoustic propagation from the point source to the microphone(s), including reverberation, or some artificial mixing effects. Diffuse sources are directly represented by their spatial images instead. Second, the spatial images of all sources are summed to yield the observed signal called the mixture:

\[ \mathbf{x}(t) = \sum_{j=1}^{J} \mathbf{c}_j(t) \tag{1.1} \]

where \(\mathbf{x}(t)\) denotes the mixture signal at time \(t\), \(\mathbf{c}_j(t)\) the spatial image of source \(j\), and \(J\) the number of sources.

This summation is due to the superposition of the sources in the case of microphone recording or to explicit summation in the case of artificial mixing. This implies that the spatial image of each source represents the contribution of the source to the mixture signal. A schematic overview of the mixing process is depicted in Figure 1.1. More specific details are given in Chapter 3.
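The two-step mixing model culminating in (1.1) can be sketched in a few lines of NumPy: each point source signal is turned into a multichannel spatial image (here by convolution with made-up impulse responses standing in for the spatialization operation), and the spatial images are then summed to form the mixture. All names, sizes, and impulse responses below are illustrative placeholders rather than data from the book.

```python
import numpy as np

rng = np.random.default_rng(1)
fs, T = 16000, 16000          # sampling rate and number of samples (illustrative)
I, J = 2, 3                   # number of channels and of point sources (illustrative)

# Single-channel point source signals, e.g. three talkers.
sources = [rng.standard_normal(T) for _ in range(J)]

# Toy impulse responses from each source to each microphone:
# a decaying random tail standing in for acoustic propagation and reverberation.
rirs = [[rng.standard_normal(256) * np.exp(-np.arange(256) / 50.0)
         for _ in range(I)] for _ in range(J)]

# Step 1: spatial image of source j, i.e. its contribution at all I microphones.
spatial_images = []
for j in range(J):
    c_j = np.stack([np.convolve(sources[j], rirs[j][i])[:T] for i in range(I)])
    spatial_images.append(c_j)             # each has shape (I, T)

# Step 2: equation (1.1), the mixture is the sum of the source spatial images.
mixture = np.sum(spatial_images, axis=0)   # shape (I, T)
print(mixture.shape)
```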

Note that target sources, interfering sources, and noise are treated in the same way in this formulation. All these signals can be either point or diffuse sources. The choice of target sources depends on the use case. Also, the distinction between interfering sources and noise may or may not be relevant depending on the use case. In the context of speech processing, these terms typically refer to undesired speech vs. nonspeech sources, respectively. In the context of music or environmental sound processing, this distinction is most often irrelevant and the former term is preferred to the latter.

Figure 1.1 General mixing process, illustrated in the case of four sources, comprising three point sources and one diffuse source, and multiple channels.

In the following, we assume that all signals are digital, meaning that the time variable is discrete. We also assume that quantization effects are negligible, so that we can operate on continuous amplitudes. Regarding the conversion of acoustic signals to analog audio signals and analog signals to digital, see, for example, Havelock et al. (2008, Part XII) and Pohlmann (1995, pp. 22–49).

1.2.4 Separation vs. Enhancement

The above mixing process implies one or more distortions of the target signals: interfering sources, noise, reverberation, and echo emitted by the loudspeakers (if any). In this context, source separation refers to the problem of extracting one or more target sources while suppressing interfering sources and noise. It explicitly excludes dereverberation and echo cancellation. Enhancement is more general, in that it refers to the problem of extracting one or more target sources while suppressing all types of distortion, including reverberation and echo. In practice, though, this term is mostly used in the case when the target sources are speech. In the audio processing literature, these two terms are often interchanged, especially when referring to the problem of suppressing both interfering speakers and noise from a speech signal. Note that, for either source separation or enhancement tasks, the extracted source(s) can be either the spatial image of the source or its direct path component, namely the delayed and attenuated version of the original source signal (Vincent et al., 2012; Gannot et al., 2001).
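To illustrate the last distinction above between a source's spatial image and its direct-path component, the short sketch below builds the direct-path component as a delayed and attenuated copy of the original source signal; the delay and gain values are arbitrary placeholders chosen for the example, and the full spatial image would additionally contain the reflections that dereverberation aims to remove.

```python
import numpy as np

fs = 16000
t = np.arange(fs) / fs
source = np.sin(2 * np.pi * 440.0 * t)     # original single-channel point source signal

# Direct-path component at one microphone: a delayed and attenuated version of the source.
delay_samples = 24     # roughly 0.5 m of propagation at 340 m/s and fs = 16 kHz (illustrative)
gain = 0.3             # arbitrary attenuation
direct_path = np.zeros_like(source)
direct_path[delay_samples:] = gain * source[:-delay_samples]

# Separation or enhancement may target either the spatial image of the source
# (direct path plus reflections) or this direct-path component alone.
print(direct_path.shape)
```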