Learn the technology behind hearing aids, Siri, and Echo.
Audio source separation and speech enhancement aim to extract one or more source signals of interest from an audio recording involving several sound sources. These technologies are among the most studied in audio signal processing today and play a critical role in the success of hearing aids, hands-free phones, voice command and other noise-robust audio analysis systems, and music post-production software. Research on this topic has followed three convergent paths, starting with sensor array processing, computational auditory scene analysis, and machine learning based approaches such as independent component analysis, respectively. This book is the first to provide a comprehensive overview by presenting the common foundations and the differences between these techniques in a unified setting.
Key features:
* Consolidated perspective on audio source separation and speech enhancement.
* Both a historical perspective and the latest advances in the field, e.g. deep neural networks.
* Diverse disciplines: array processing, machine learning, and statistical signal processing.
* Covers the most important techniques for both single-channel and multichannel processing.
This book provides both introductory and advanced material suitable for people with basic knowledge of signal processing and machine learning. Thanks to its comprehensiveness, it will help students select a promising research track, researchers leverage the acquired cross-domain knowledge to design improved techniques, and engineers and developers choose the right technology for their target application scenario. It will also be useful for practitioners from other fields (e.g., acoustics, multimedia, phonetics, and musicology) willing to exploit audio source separation or speech enhancement as pre-processing tools for their own needs.
Number of pages: 954
Year of publication: 2018
Cover
List of Authors
Preface
Acknowledgment
Notations
Acronyms
About the Companion Website
Part I: Prerequisites
Chapter 1: Introduction
1.1 Why are Source Separation and Speech Enhancement Needed?
1.2 What are the Goals of Source Separation and Speech Enhancement?
1.3 How can Source Separation and Speech Enhancement be Addressed?
1.4 Outline
Bibliography
Chapter 2: Time‐Frequency Processing: Spectral Properties
2.1 Time‐Frequency Analysis and Synthesis
2.2 Source Properties in the Time‐Frequency Domain
2.3 Filtering in the Time‐Frequency Domain
2.4 Summary
Bibliography
Chapter 3: Acoustics: Spatial Properties
3.1 Formalization of the Mixing Process
3.2 Microphone Recordings
3.3 Artificial Mixtures
3.4 Impulse Response Models
3.5 Summary
Bibliography
Chapter 4: Multichannel Source Activity Detection, Localization, and Tracking
4.1 Basic Notions in Multichannel Spatial Audio
4.2 Multi‐Microphone Source Activity Detection
4.3 Source Localization
4.4 Summary
Bibliography
Part II: Single‐Channel Separation and Enhancement
Chapter 5: Spectral Masking and Filtering
5.1 Time‐Frequency Masking
5.2 Mask Estimation Given the Signal Statistics
5.3 Perceptual Improvements
5.4 Summary
Bibliography
Chapter 6: Single‐Channel Speech Presence Probability Estimation and Noise Tracking
6.1 Speech Presence Probability and its Estimation
6.2 Noise Power Spectrum Tracking
6.3 Evaluation Measures
6.4 Summary
Bibliography
Chapter 7: Single‐Channel Classification and Clustering Approaches
7.1 Source Separation by Computational Auditory Scene Analysis
7.2 Source Separation by Factorial HMMs
7.3 Separation Based Training
7.4 Summary
Bibliography
Chapter 8: Nonnegative Matrix Factorization
8.1 NMF and Source Separation
8.2 NMF Theory and Algorithms
8.3 NMF Dictionary Learning Methods
8.4 Advanced NMF Models
8.5 Summary
Bibliography
Chapter 9: Temporal Extensions of Nonnegative Matrix Factorization
9.1 Convolutive NMF
9.2 Overview of Dynamical Models
9.3 Smooth NMF
9.4 Nonnegative State‐Space Models
9.5 Discrete Dynamical Models
9.6 The Use of Dynamic Models in Source Separation
9.7 Which Model to Use?
9.8 Summary
9.9 Standard Distributions
Bibliography
Part III: Multichannel Separation and Enhancement
Chapter 10: Spatial Filtering
10.1 Fundamentals of Array Processing
10.2 Array Topologies
10.3 Data‐Independent Beamforming
10.4 Data‐Dependent Spatial Filters: Design Criteria
10.5 Generalized Sidelobe Canceler Implementation
10.6 Postfilters
10.7 Summary
Bibliography
Chapter 11: Multichannel Parameter Estimation
11.1 Multichannel Speech Presence Probability Estimators
11.2 Covariance Matrix Estimators Exploiting SPP
11.3 Methods for Weakly Guided and Strongly Guided RTF Estimation
11.4 Summary
Bibliography
Chapter 12: Multichannel Clustering and Classification Approaches
12.1 Two‐Channel Clustering
12.2 Multichannel Clustering
12.3 Multichannel Classification
12.4 Spatial Filtering Based on Masks
12.5 Summary
Bibliography
Chapter 13: Independent Component and Vector Analysis
13.1 Convolutive Mixtures and their Time‐Frequency Representations
13.2 Frequency‐Domain Independent Component Analysis
13.3 Independent Vector Analysis
13.4 Example
13.5 Summary
Bibliography
Chapter 14: Gaussian Model Based Multichannel Separation
14.1 Gaussian Modeling
14.2 Library of Spectral and Spatial Models
14.3 Parameter Estimation Criteria and Algorithms
14.4 Detailed Presentation of Some Methods
14.5 Summary
Acknowledgment
Bibliography
Chapter 15: Dereverberation
15.1 Introduction to Dereverberation
15.2 Reverberation Cancellation Approaches
15.3 Reverberation Suppression Approaches
15.4 Direct Estimation
15.5 Evaluation of Dereverberation
15.6 Summary
Bibliography
Part IV: Application Scenarios and Perspectives
Chapter 16: Applying Source Separation to Music
16.1 Challenges and Opportunities
16.2 Nonnegative Matrix Factorization in the Case of Music
16.3 Taking Advantage of the Harmonic Structure of Music
16.4 Nonparametric Local Models: Taking Advantage of Redundancies in Music
16.5 Taking Advantage of Multiple Instances
16.6 Interactive Source Separation
16.7 Crowd‐Based Evaluation
16.8 Some Examples of Applications
16.9 Summary
Bibliography
Chapter 17: Application of Source Separation to Robust Speech Analysis and Recognition
17.1 Challenges and Opportunities
17.2 Applications
17.3 Robust Speech Analysis and Recognition
17.4 Integration of Front‐End and Back‐End
17.5 Use of Multimodal Information with Source Separation
17.6 Summary
Bibliography
Chapter 18: Binaural Speech Processing with Application to Hearing Devices
18.1 Introduction to Binaural Processing
18.2 Binaural Hearing
18.3 Binaural Noise Reduction Paradigms
18.4 The Binaural Noise Reduction Problem
18.5 Extensions for Diffuse Noise
18.6 Extensions for Interfering Sources
18.7 Summary
Bibliography
Chapter 19: Perspectives
19.1 Advancing Deep Learning
19.2 Exploiting Phase Relationships
19.3 Advancing Multichannel Processing
19.4 Addressing Multiple‐Device Scenarios
19.5 Towards Widespread Commercial Use
Acknowledgment
Bibliography
Index
End User License Agreement
Chapter 1
Table 1.1 Evaluation software and metrics.
Chapter 3
Table 3.1 Range of RT60 reported in the literature for different environments (Ribas et al., 2016).
Table 3.2 Example artificial mixing effects, from Sturmel et al. (2012).
Chapter 5
Table 5.1 Bayesian probability distributions for the observation and the searched quantity.
Table 5.2 Criteria for the estimation of the searched quantity.
Table 5.3 Overview of the discussed estimation schemes.
Chapter 6
Table 6.1 Summary of noise estimation in the minima controlled recursive averaging approach (Cohen, 2003).
Chapter 7
Table 7.1 SDR (dB) achieved on the CHiME‐2 dataset using supervised training (2ch: average of two input channels).
Chapter 14
Table 14.1 Categorization of existing approaches according to the underlying mixing model, spectral model, spatial model, estimation criterion, and algorithm.
Chapter 17
Table 17.1 Recent noise robust speech recognition tasks: ASpIRE (Harper, 2015), AMI (Hain et al., 2007), CHiME-1 (Barker et al., 2013), CHiME-2 (Vincent et al., 2013), CHiME-3 (Barker et al., 2015), CHiME-4 (Vincent et al., 2017), and REVERB (Kinoshita et al., 2016). See Le Roux and Vincent (2014) for a more detailed list of robust speech processing datasets.
Table 17.2 ASR word accuracy achieved by a GMM-HMM acoustic model on MFCCs with delta and double-delta features. The data are enhanced by FD-ICA followed by time-frequency masking. We compare two different masking schemes, based on phase or interference estimates (Kolossa et al., 2010). UD and MI stand for uncertainty decoding and modified imputation with estimated uncertainties, respectively, while UD* and MI* stand for uncertainty decoding and modified imputation with ideal uncertainties, respectively. Bold font indicates the best results achievable in practice, i.e. without the use of oracle knowledge.
Table 17.3 ASR performance on the AMI meeting task using a single distant microphone and enhanced signals obtained by the DS beamformer and the joint training‐based beamforming network.
Table 17.4 Human recognition rates (%) of a listening test. Each score is based on averaging about 260 unique utterances. “Ephraim–Malah” refers to the log‐spectral amplitude estimator of Ephraim and Malah (1985) in the implementation by Loizou (2007).
Chapter 19
Table 19.1 Average signal-to-distortion ratio (SDR) achieved by the computational auditory scene analysis (CASA) method of Hu and Wang (2013), a DRNN trained to separate the foreground speaker, and two variants of deep clustering for the separation of mixtures of two speakers with all gender combinations at random signal-to-noise ratios (SNRs) between 0 and 10 dB (Hershey et al., 2016; Isik et al., 2016). The test speakers are not in the training set.
Chapter 1
Figure 1.1 General mixing process, illustrated in the case of four sources, including three point sources and one diffuse source, and multiple channels.
Figure 1.2 General processing scheme for single‐channel and multichannel source separation and speech enhancement.
Chapter 2
Figure 2.1 STFT analysis.
Figure 2.2 STFT synthesis.
Figure 2.3 STFT and Mel spectrograms of an example music signal. High energies are illustrated with dark color and low energies with light color.
Figure 2.4 Set of triangular filter responses distributed uniformly on the Mel scale.
Figure 2.5 Independent sound sources are sparse in the time-frequency domain. The top row illustrates the magnitude STFT spectrograms of two speech signals. The bottom left panel illustrates the histogram of the magnitude STFT coefficients of the first signal, and the bottom right panel the bivariate histogram of the coefficients of both signals.
Figure 2.6 Magnitude spectrum of an exemplary harmonic sound. The fundamental frequency is marked with a cross. The other harmonics are at integer multiples of the fundamental frequency.
Figure 2.7 Example spectrograms of a stationary noise signal (top left), a note played by a piano (top right), a sequence of drum hits (bottom left), and a speech signal (bottom right).
Chapter 3
Figure 3.1 Schematic illustration of the shape of an acoustic impulse response for a room of dimensions 8.00 × 5.00 × 3.10 m, an RT60 of 230 ms, and a given source-to-microphone distance. All reflections are depicted as Dirac impulses.
Figure 3.2 First 100 ms of a pair of real acoustic impulse responses from the Aachen Impulse Response Database (Jeub et al., 2009) recorded in a meeting room with the same characteristics as in Figure 3.1 and sampled at 48 kHz.
Figure 3.3 DRR as a function of the RT60 and the source distance, based on Eyring's formula (Gustafsson et al., 2003). These curves assume that there is no obstacle between the source and the microphone, so that the direct path exists. The room dimensions are the same as in Figure 3.1.
Figure 3.4 IC of the reverberant part of an acoustic impulse response as a function of microphone distance and frequency.
Figure 3.5 ILD and IPD corresponding to the pair of real acoustic impulse responses in Figure 3.2. Dashed lines denote the theoretical ILD and IPD in the free field, as defined by the relative steering vector in (3.15).
Figure 3.6 Geometrical illustration of the position of a far-field source with respect to a pair of microphones on the horizontal plane, showing the azimuth, the elevation, the angle of arrival, the microphone distance, the source-to-microphone distances, and the unit-norm vector pointing to the source.
Chapter 4
Figure 4.1 Example of GCC-PHAT computed for a microphone pair and represented with gray levels in the case of a speech utterance in a noisy and reverberant environment. The left part highlights the GCC-PHAT at a single frame, with a clear peak at the lag corresponding to the source TDOA.
Figure 4.2 Two examples of global coherence field acoustic maps. (a) 2D spatial localization using a distributed network of three microphone pairs, represented by circles. Observe the hyperbolic high correlation lines departing from the three pairs and crossing at the source position. (b) DOA likelihood for upper hemisphere angles using an array of five microphones lying in a plane. A high likelihood region exists around the source azimuth and elevation, i.e. the corresponding unit-norm vector (see Figure 3.6).
Figure 4.3 1-σ position error distributions (68.3% confidence) resulting from TDOA errors of one sample for two microphone pairs in three different geometries.
Figure 4.4 Graphical example of the particle filter procedure.
Chapter 5
Figure 5.1 Separation of speech from cafe noise by binary vs. soft masking. The masks shown in this example are oracle masks.
Figure 5.2 Illustration of the Rician posterior (5.45). The red dashed line shows the mode of the posterior and thus the MAP estimate of the target spectral magnitude. The purple dotted line corresponds to the approximate MAP estimate (5.47), and the yellow dash-dotted line corresponds to the posterior mean (5.46) and thus the MMSE estimate.
Figure 5.3 Histogram of the real part of complex speech coefficients (Gerkmann and Martin, 2010).
Figure 5.4 Input-output characteristics of different spectral filtering masks for a fixed a priori SNR. "Wiener" refers to the Wiener filter, "Ephraim–Malah" to the short-time spectral amplitude estimator of Ephraim and Malah (1984), and "approx. MAP" to the approximate MAP amplitude estimator (5.47) of Wolfe and Godsill (2003). While "Wiener", "Ephraim–Malah", and "approx. MAP" are based on a Gaussian speech model, "Laplace prior" refers to an estimator of complex speech coefficients with a super-Gaussian speech prior (Martin and Breithaupt, 2003). Compared to the linear Wiener filter, amplitude estimators tend to apply less attenuation for low inputs, while super-Gaussian estimators tend to apply less attenuation for high inputs.
Figure 5.5 Examples of estimated filters for the noisy speech signal in Figure 5.1. The filters were computed in the STFT domain but are displayed on a nonlinear frequency scale for visualization purposes.
Chapter 6
Figure 6.1 State-of-the-art single-channel noise reduction system operating in the STFT domain. The application of a spectral gain function to the output of the STFT block results in an estimate of the clean speech spectral coefficients. The spectral gain is controlled by the estimated SNR, which in turn requires the noise power tracker as a central component.
Figure 6.2 Spectrograms of (a) a clean speech sample, (b) clean speech mixed with traffic noise at 5 dB SNR, (c) SPP based on the Gaussian model, and (d) SPP based on the Gaussian model with a fixed a priori SNR prior. Single bins in the time-frequency plane are considered. The fixed a priori SNR was set to 15 dB.
Figure 6.3 Power of the noisy speech signal (thin solid line) and estimated noise power (thick solid line) using the noise power tracking approach according to (6.19). The slope parameter is set such that the maximum slope is 5 dB /s. Top: stationary white Gaussian noise; bottom: nonstationary multiple‐speaker babble noise.
Figure 6.4 Probability distribution of short-time power (a χ² distribution with 10 degrees of freedom (DoF)) and the corresponding distribution of the minimum of several independent power values.
Figure 6.5 Optimal smoothing parameter as a function of the power ratio.
Figure 6.6 Power of the noisy speech signal (thin solid line) and estimated noise power (thick solid line) using the MMSE noise power tracking approach (Hendriks et al., 2010). Top: stationary white Gaussian noise; bottom: nonstationary multiple-speaker babble noise.
Figure 6.7 Spectrogram of (a) clean speech, (b) clean speech plus additive amplitude‐modulated white Gaussian noise, (c) minimum statistics noise power estimate, (d) log‐error for the minimum statistics estimator, (e) MMSE noise power estimate, and (f) log‐error for the MMSE estimator. For the computation of these noise power estimates we concatenated three identical phrases of which only the last one is shown in the figure. Note that in (d) and (f) light color indicates noise power overestimation while dark color indicates noise power underestimation.
Chapter 7
Figure 7.1 An implementation of a feature‐based CASA system for source separation.
Figure 7.2 Architecture of a GMM‐HMM. The GMM models the typical spectral patterns produced by each source, while the HMM models the spectral continuity.
Figure 7.3 Schematic depiction of an exemplary DNN with two hidden layers and three neurons per layer.
Figure 7.4 DRNN and unfolded DRNN.
Figure 7.5 Architecture of regression DNN for speech enhancement.
Chapter 8
Figure 8.1 An example piano signal consisting of a sequence of the notes C, E, and G, followed by the three notes played simultaneously. The basic NMF models the magnitude spectrogram of the signal (top left) as a sum of components having fixed spectra (rightmost panels) and activation coefficients (lowest panels). Each component represents the parts of the spectrogram corresponding to an individual note.
Figure 8.2 The NMF model can be used to generate time-frequency masks in order to separate sources from a mixture. Top row: the spectrogram of the mixture signal in Figure 8.1 is modeled with an NMF. Middle row: the model for an individual source can be obtained using a specific set of components; in this case, only component 1 is used to represent an individual note in the mixture. Bottom row: the mixture spectrogram is elementwise multiplied by the time-frequency mask matrix, resulting in a separated source spectrogram.
Figure 8.3 Illustration of the separation of two sources, where source-specific models are obtained in the training stage. Dictionaries consisting of source-specific basis vectors are obtained for sources 1 and 2 using isolated samples of each source. The mixture spectrogram is modeled as a weighted linear combination of all the basis vectors. The activation matrices contain the activations of the basis vectors in all frames.
Figure 8.4 Illustration of the model where a basis vector matrix is obtained for the target source at the training stage and kept fixed, and another basis vector matrix that represents the other sources in the mixture is estimated from the mixture. The two activation matrices are both estimated from the mixture.
Figure 8.5 Gaussian composite model (IS-NMF) by Févotte et al. (2009).
Figure 8.6 Harmonic NMF model by Vincent et al. (2010) and Bertin et al. (2010).
Figure 8.7 Illustration of the coupled factorization model, where basis vectors acquired from training data are elementwise multiplied with an equalization filter response to better model the observed data at test time.
Chapter 9
Figure 9.1 Learning temporal dependencies. The top right plot shows the input matrix, which has a consistent left-right structure. The top left plot shows the learned matrices and the bottom right plot shows the learned activations. The bottom left plot shows the bases again, only this time we concatenate the corresponding columns from each learned matrix. We can clearly see that this sequence of columns learns bases that extend over time.
Figure 9.2 Extraction of time‐frequency sources. The input to this case is shown in the top right plot. It is a drum pattern composed of four distinct drum sounds. The set of four top left plots shows the extracted time‐frequency templates using 2D convolutive NMF. Their corresponding activations are shown in the step plots in the bottom right, and the individual convolutions of each template with its activation as the lower left set of plots. As one can see this model learns the time‐frequency profile of the four drum sounds and correctly identifies where they are located.
Figure 9.3 Convolutive NMF dictionary elements for a speech recording. Note that each component has the form of a short phoneme-like speech inflection.
Figure 9.4 Convolutive NMF decomposition for a violin recording. Note how the single extracted basis corresponds to a constant-Q spectrum that, when 2D convolved with the activation, approximates the input. The peaks in the activation produce a pitch transcription of the recording by indicating energy at each pitch and time offset.
Figure 9.5 Effect of regularization. A segment of one of the rows of the activation matrix is displayed, corresponding to the activations of the accompaniment (piano and double bass). A trumpet solo occurs in the middle of the displayed time interval, where the accompaniment vanishes; the regularization smoothes out coefficients with small energies that remain in unpenalized IS-NMF.
Figure 9.6 Dictionaries were learned from the speech data of a given speaker. Shown are the dictionaries learned for 18 of the 40 states. Each dictionary is composed of 10 elements that are stacked next to each other. Each of these dictionaries roughly corresponds to a subunit of speech, either a voiced or unvoiced phoneme.
Figure 9.7 Example of dynamic models for source separation. The four spectrograms show the mixture and the extracted speech for three different approaches. D-PLCA denotes dynamic PLCA and N-HMM denotes nonnegative HMM. The bar plot shows a quantitative evaluation of the separation performance of each approach. Adapted from Smaragdis et al. (2014).
Chapter 10
Figure 10.1 General block diagram of a beamformer with its filters.
Figure 10.2 Beampower (10.6) in polar coordinates of the DS beamformer for a uniform linear array. The beampower is normalized to unity at the look direction. The different behavior of the beampower as a function of the array look direction and the signal wavelength is clearly demonstrated.
Figure 10.3 Harmonically nested linear arrays covering frequencies up to 8 kHz.
Figure 10.4 Constant-beamwidth distortionless response beamformer design with desired response towards broadside and constrained white noise gain after FIR approximation (Mabande et al., 2009) for the array of Figure 10.3.
Figure 10.5 GSC structure for implementing the LCMV beamformer.
Chapter 11
Figure 11.1 High‐level block diagram of the common estimation framework for constructing data‐dependent spatial filters.
Figure 11.2 Example of single‐channel and multichannel SPP in a noisy scenario.
Figure 11.3 Example of a DDR estimator in a noisy and reverberant scenario.
Figure 11.4 Postfilter incorporating spatial information (Cohen et al., 2003; Gannot and Cohen, 2004).
Chapter 12
Figure 12.1 Left: spectrograms of a three-source mixture recorded on two directional microphones spaced approximately 6 cm apart (Sawada et al., 2011). Right: ILD and IPD of this recording.
Figure 12.2 Histogram of ITD and ILD features extracted by DUET from the recording in Figure 12.1 along with separation masks estimated using manually selected parameters.
Figure 12.3 Probabilistic masks estimated by MESSL from the mixture shown in Figure 12.1 using Markov random field mask smoothing (Mandel and Roman, 2015).
Figure 12.4 Multichannel MESSL algorithm: masks are computed for each pair of microphones in the E-step, then combined across pairs, and the parameters of each pair are finally re-estimated from the global masks.
Figure 12.5 Narrowband clustering followed by permutation alignment.
Figure 12.6 Example spectra and masks obtained by a multichannel clustering system on reverberant mixtures (Sawada et al., 2011).
Figure 12.7 System diagram of a multichannel classification system.
Figure 12.8 Block diagram of spatial filtering based on masks.
Chapter 13
Figure 13.1 Multichannel time‐frequency representation of an observed signal (left) and its slices: frequency‐wise and microphone‐wise (right). Methods discussed in this chapter are shown in red.
Figure 13.2 The scatter plot on the left-hand side illustrates the joint probability distribution of two normalized and uniformly distributed signals. The histograms correspond to the marginal distributions of the two variables. The right-hand side corresponds to the same signals after mixing by a mixing matrix. In the latter (mixed) case, the histograms are closer to a Gaussian.
Figure 13.3 (a) Examples of generalized Gaussian distributions for four different shape parameters, corresponding to the gamma, Laplacian, and Gaussian distributions; the fourth case demonstrates that as the shape parameter grows the distribution becomes uniform. (b) Histogram of a female utterance, which can be modeled by a Laplacian or gamma distribution.
Figure 13.4 Complex-valued source models. (a) and (b) are based on (13.16). (c) is a complex Gaussian distribution.
Figure 13.5 A single‐trial SIR achieved by EFICA and BARBI when separating 15 different signals (five i.i.d. sequences, three nonwhite AR Gaussian processes, three piecewise Gaussian i.i.d. processes (nonstationary) and four speech signals).
Figure 13.6 Flow of FD‐ICA for separating convolutive mixtures.
Figure 13.7 Comparison of two criteria for permutation alignment: magnitude spectrum and power ratio. Permutations are aligned as each color (blue or green) corresponds to the same source. Power ratios generally exhibit a higher correlation coefficient for the same source and a more negative correlation coefficient for different sources.
Figure 13.8 TDOA estimation and permutation alignment. For a two‐microphone two‐source situation (left), ICA is applied in each frequency bin and TDOAs for two sources between the two microphones are estimated (right upper). Each color (navy or orange) corresponds to the same source. Clustering for TDOAs aligns the permutation ambiguities (right lower).
Figure 13.9 Experimental setups: source positions a and b for a two-source setup, a through c for a three-source setup, and a through d for a four-source setup. The number of microphones used is always the same as the number of sources.
Figure 13.10 Experimental results (dataset A): SDR averaged over separated spatial images of the sources.
Figure 13.11 Spectrograms of the separated signals (dataset A, two sources, STFT window size of 1024). Here, IVA failed to solve the permutation alignment. These two sources are difficult to separate because the signals share the same silent period at around 2.5 s.
Figure 13.12 Experimental results with less active sources (dataset B): SDR averaged over separated spatial images of the sources.
Chapter 14
Figure 14.1 Illustration of multichannel NMF. The complex-valued spectrograms of the sources, the complex-valued spectrograms of the mixture channels, and the complex-valued mixing coefficients are depicted. NMF factors the power spectrogram of each source as the product of a nonnegative dictionary and an activation matrix (see Chapter 8 and Section 14.2.1). The mixing system is represented by a rank-1 spatial model (see Section 14.2.2).
Figure 14.2 Block diagram of multichannel Gaussian model based source separation.
Figure 14.3 Illustration of the HMM and multichannel NMF spectral models.
Figure 14.4 Graphical illustration of the EM algorithm for MAP estimation.
Chapter 15
Figure 15.1 Example room acoustic impulse response.
Figure 15.2 Schematic view of the linear‐predictive multiple‐input equalization method, after Habets (2016).
Figure 15.3 Data‐dependent spatial filtering approach to perform reverberation suppression.
Figure 15.4 Single‐channel spectral enhancement approach to perform reverberation suppression.
Chapter 16
Figure 16.1 STFT and constant‐Q representation of a trumpet signal composed of three musical notes of different pitch.
Figure 16.2 Schematic illustration of the filter part. Durrieu et al. (2011) defined the dictionary of filter atomic elements as a set of 30 Hann functions with 75% overlap.
Figure 16.3 The source‐filter model in the magnitude frequency domain.
Figure 16.4 Local regularities in the spectrograms of percussive (vertical) and harmonic (horizontal) sounds.
Figure 16.5 REPET: building the repeating background model. In stage 1, we analyze the mixture spectrogram and identify the repeating period. In stage 2, we split the mixture into patches of the identified length and take the median of them. This allows their common part and hence the repeating pattern to be extracted. In stage 3, we use this repeating pattern in each segment to construct a mask for separation.
Figure 16.6 Examples of kernels to use in KAM for modeling (a) percussive, (b) harmonic, (c) repetitive, and (d) spectrally smooth sounds.
Figure 16.7 Using the audio tracks of multiple related videos to perform source separation. Each circle on the left represents a mixture containing music and vocals in the language associated with the flag. The music is the same in all mixtures; only the language varies. Having multiple copies of a mixture in which one element is fixed lets one separate out this stable element (the music) from the varied elements (the speech in various languages).
Figure 16.8 Interferences of different sources in real‐world multitrack recordings. Left: microphone setup. Right: interference pattern. (Courtesy of R. Bittner and T. Prätzlich.)
Chapter 17
Figure 17.1 Block diagram of an ASR system.
Figure 17.2 Calculation of MFCC and log Mel‐filterbank features. “DCT” is the discrete cosine transform.
Figure 17.3 Flow chart of feature extraction in a typical speaker recognition system.
Figure 17.4 Flow chart of robust multichannel ASR with source separation, adaptation, and testing blocks.
Figure 17.5 Comparison of the DS and MVDR beamformers on the CHiME‐3 dataset. The results are obtained with a DNN acoustic model applied to the feature pipeline in Figure 17.6 and trained with the state‐level minimum Bayes risk cost. Real‐dev and Simu‐dev denote the real and simulation development sets, respectively. Real‐test and Simu‐test denote the real and simulation evaluation sets, respectively.
Figure 17.6 Pipeline of state‐of‐the‐art feature extraction, normalization, and transformation procedure for noise robust speech analysis and recognition. CMN, LDA, STC, and fMLLR stand for cepstral mean normalization, linear discriminant analysis, semitied covariance transform, and feature‐space ML linear regression, respectively.
Figure 17.7 Observation uncertainty pipeline.
Figure 17.8 Joint training of a unified network for beamforming and acoustic modeling. Adapted from Xiao et al. (2016).
Chapter 18
Figure 18.1 Block diagram for binaural spectral postfiltering based on a common spectro‐temporal gain: (a) direct gain computation (one microphone on each hearing device) and (b) indirect gain computation (two microphones on each hearing device).
Figure 18.2 Block diagram for binaural spatial filtering: (a) incorporating constraints into spatial filter design and (b) mixing with scaled reference signals.
Figure 18.3 Considered acoustic scenario, consisting of a desired source, an interfering source, and background noise in a reverberant room. The signals are received by the microphones on both hearing devices of the binaural hearing system.
Figure 18.4 Schematic overview of this chapter.
Figure 18.5 Psychoacoustically motivated lower and upper MSC boundaries: (a) frequency range 0–8000 Hz and (b) frequency range 0–500 Hz. For frequencies below 500 Hz, the boundaries depend on the desired MSC while for frequencies above 500 Hz the boundaries are independent of the desired MSC.
Figure 18.6 MSC error of the noise component, intelligibility‐weighted speech distortion, and intelligibility‐weighted output SNR for the MWF, MWF‐N, and MWF‐IC.
Figure 18.7 Performance measures for the binaural MWF, MWF-RTF, MWF-IR-0, and MWF-IR-0.2 for a desired speech source at −5° and different interfering source positions. The global input SINR was equal to −3 dB.
Figure 18.8 Performance measures for the binaural MWF, MWF-RTF, MWF-IR-0.2, and MWF-IR-0 for a desired speech source at −35° and different interfering source positions. The global input SINR was equal to −3 dB.
Chapter 19
Figure 19.1 Short‐term magnitude spectrum and various representations of the phase spectrum of a speech signal for an STFT analysis window size of 64 ms. For easier visualization, the deviation of the instantaneous frequency from the center frequency of each band is shown rather than the instantaneous frequency itself.
Figure 19.2 Interchannel level difference (ILD) and IPD for two different source positions (plain and dashed curves) 10 cm apart from each other at a distance of 1.70 m from the microphone pair, resulting in two slightly different source DOAs. The reverberation time is 230 ms and the microphone distance is 15 cm.
Figure 19.3 IPD between two microphones spaced by 15 cm belonging (a) to the same device or (b) to two distinct devices with a relative sampling rate mismatch. For illustration purposes, the recorded sound scene consists of a single speech source at a distance of 1.70 m in a room with a reverberation time of 230 ms, without any interference or noise, and the two devices have zero temporal offset at the start of the recording.
Edited by
Emmanuel Vincent
Inria
France
Tuomas Virtanen
Tampere University of Technology
Finland
Sharon Gannot
Bar-Ilan University
Israel
This edition first published 2018
© 2018 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Emmanuel Vincent, Tuomas Virtanen & Sharon Gannot to be identified as authors of the editorial material in this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data
Names: Vincent, Emmanuel (Research scientist), editor. | Virtanen, Tuomas, editor. | Gannot, Sharon, editor.
Title: Audio source separation and speech enhancement / edited by Emmanuel Vincent, Tuomas Virtanen, Sharon Gannot.
Description: Hoboken, NJ : John Wiley & Sons, 2018. | Includes bibliographical references and index. |
Identifiers: LCCN 2018013163 (print) | LCCN 2018021195 (ebook) | ISBN 9781119279884 (pdf) | ISBN 9781119279914 (epub) | ISBN 9781119279891 (cloth)
Subjects: LCSH: Speech processing systems. | Automatic speech recognition.
Classification: LCC TK7882.S65 (ebook) | LCC TK7882.S65 .A945 2018 (print) | DDC 006.4/54-dc23
LC record available at https://lccn.loc.gov/2018013163
Cover Design: Wiley
Cover Images: © 45RPM/iStockphoto;
© franckreporter/iStockphoto
Shoko Araki
NTT Communication Science Laboratories
Japan
Roland Badeau
Institut Mines‐Télécom
France
Alessio Brutti
Fondazione Bruno Kessler
Italy
Israel Cohen
Technion
Israel
Simon Doclo
Carl von Ossietzky‐Universität Oldenburg
Germany
Jun Du
University of Science and Technology of China
China
Zhiyao Duan
University of Rochester
NY
USA
Cédric Févotte
CNRS
France
Sharon Gannot
Bar‐Ilan University
Israel
Tian Gao
University of Science and Technology of China
China
Timo Gerkmann
Universität Hamburg
Germany
Emanuël A.P. Habets
International Audio Laboratories Erlangen
Germany
Elior Hadad
Bar‐Ilan University
Israel
Hirokazu Kameoka
The University of Tokyo
Japan
Walter Kellermann
Friedrich‐Alexander Universität Erlangen‐Nürnberg
Germany
Zbyněk Koldovský
Technical University of Liberec
Czech Republic
Dorothea Kolossa
Ruhr‐Universität Bochum
Germany
Antoine Liutkus
Inria
France
Michael I. Mandel
City University of New York
NY
USA
Erik Marchi
Technische Universität München
Germany
Shmulik Markovich‐Golan
Bar‐Ilan University
Israel
Daniel Marquardt
Carl von Ossietzky‐Universität Oldenburg
Germany
Rainer Martin
Ruhr‐Universität Bochum
Germany
Nasser Mohammadiha
Chalmers University of Technology
Sweden
Gautham J. Mysore
Adobe Research
CA
USA
Tomohiro Nakatani
NTT Communication Science Laboratories
Japan
Patrick A. Naylor
Imperial College London
UK
Maurizio Omologo
Fondazione Bruno Kessler
Italy
Alexey Ozerov
Technicolor
France
Bryan Pardo
Northwestern University
IL
USA
Pasi Pertilä
Tampere University of Technology
Finland
Gaël Richard
Institut Mines‐Télécom
France
Hiroshi Sawada
NTT Communication Science Laboratories
Japan
Paris Smaragdis
University of Illinois at Urbana‐Champaign
IL
USA
Piergiorgio Svaizer
Fondazione Bruno Kessler
Italy
Emmanuel Vincent
Inria
France
Tuomas Virtanen
Tampere University of Technology
Finland
Shinji Watanabe
Johns Hopkins University
MD
USA
Felix Weninger
Nuance Communications
Germany
Source separation and speech enhancement are among the most studied technologies in audio signal processing. Their goal is to extract one or more source signals of interest from an audio recording involving several sound sources. This problem arises in many everyday situations. For instance, spoken communication is often obscured by concurrent speakers or by background noise, outdoor recordings feature a variety of environmental sounds, and most music recordings involve a group of instruments. When facing such scenes, humans are able to perceive and listen to individual sources so as to communicate with other speakers, navigate a crowded street, or memorize the melody of a song. Source separation and speech enhancement technologies aim to empower machines with similar abilities.
These technologies are already present in our lives today. Beyond “clean” single‐source signals recorded with close microphones, they allow the industry to extend the applicability of speech and audio processing systems to multi‐source, reverberant, noisy signals recorded with distant microphones. Some of the most striking examples include hearing aids, speech enhancement for smartphones, and distant‐microphone voice command systems. Current technologies are expected to keep improving and spread to many other scenarios in the next few years.
Traditionally, speech enhancement has referred to the problem of segregating speech and background noise, while source separation has referred to the segregation of multiple speech or audio sources. Most textbooks focus on one of these problems and on one of three historical approaches, namely sensor array processing, computational auditory scene analysis, or independent component analysis. These communities now routinely borrow ideas from each other and other approaches have emerged, most notably based on deep learning.
This textbook is the first to provide a comprehensive overview of these problems and approaches by presenting their shared foundations and their differences using common language and notations. Starting with prerequisites (Part I), it proceeds with single‐channel separation and enhancement (Part II), multichannel separation and enhancement (Part III), and applications and perspectives (Part IV). Each chapter provides both introductory and advanced material.
We designed this textbook for people in academia and industry with basic knowledge of signal processing and machine learning. Thanks to its comprehensiveness, we hope it will help students select a promising research track, researchers leverage the acquired cross‐domain knowledge to design improved techniques, and engineers and developers choose the right technology for their application scenario. We also hope that it will be useful for practitioners from other fields (e.g., acoustics, multimedia, phonetics, musicology) willing to exploit audio source separation or speech enhancement as a pre‐processing tool for their own needs.
Emmanuel Vincent, Tuomas Virtanen, and Sharon Gannot
May 2017
We would like to thank all the chapter authors, as well as the following people who helped with proofreading: Sebastian Braun, Yaakov Buchris, Emre Cakir, Aleksandr Diment, Dylan Fagot, Nico Gößling, Tomoki Hayashi, Jakub Janský, Ante Jukić, Václav Kautský, Martin Krawczyk-Becker, Simon Leglaive, Bochen Li, Min Ma, Paul Magron, Zhong Meng, Gaurav Naithani, Zhaoheng Ni, Aditya Arie Nugraha, Sanjeel Parekh, Robert Rehr, Lea Schönherr, Georgina Tryfou, Ziteng Wang, and Mehdi Zohourian.
Emmanuel Vincent, Tuomas Virtanen, and Sharon Gannot
May 2017
scalar
vector
vector with entries
th entry of vector
vector of zeros
vector of ones
matrix
matrix with entries
th entry of matrix
identity matrix
tensor/array (with three or more dimensions) or set
tensor with entries
diagonal matrix whose entries are those of vector
entrywise product of two matrices
trace of matrix
determinant of matrix
transpose of vector
conjugate‐transpose of vector
conjugate of scalar
real part of scalar
imaginary unit
probability distribution of continuous random variable
conditional probability distribution of a continuous random variable given another
probability value of discrete random variable
conditional probability value of a discrete random variable given another
expectation of random variable
conditional expectation of a random variable given another
entropy of random variable
real Gaussian distribution with mean and covariance
complex Gaussian distribution with mean and covariance
estimated value of a random variable (e.g., first-order statistics)
variance of random variable
estimated second‐order statistics of random variable
autocovariance of random vector
estimated second‐order statistics of random vector
covariance of two random vectors
estimated second-order statistics of two random vectors
cost function to be minimized w.r.t. the vector of parameters
objective function to be maximized w.r.t. the vector of parameters
auxiliary function to be minimized or maximized, depending on the context
number of microphones or channels
microphone or channel index
number of sources
source index
number of time-domain samples
sample index
time-domain filter length
tap index
number of time frames
time frame index
number of frequency bins
frequency bin index
AR
autoregressive
ASR
automatic speech recognition
BSS
blind source separation
CASA
computational auditory scene analysis
DDR
direct‐to‐diffuse ratio
DFT
discrete Fourier transform
DNN
deep neural network
DOA
direction of arrival
DRNN
deep recurrent neural network
DRR
direct‐to‐reverberant ratio
DS
delay‐and‐sum
ERB
equivalent rectangular bandwidth
EM
expectation‐maximization
EUC
Euclidean
FD‐ICA
frequency‐domain independent component analysis
FIR
finite impulse response
GCC
generalized cross‐correlation
GCC‐PHAT
generalized cross‐correlation with phase transform
GMM
Gaussian mixture model
GSC
generalized sidelobe canceler
HMM
hidden Markov model
IC
interchannel (or interaural) coherence
ICA
independent component analysis
ILD
interchannel (or interaural) level difference
IPD
interchannel (or interaural) phase difference
ITD
interchannel (or interaural) time difference
IVA
independent vector analysis
IS
Itakura–Saito
KL
Kullback–Leibler
LCMV
linearly constrained minimum variance
LSTM
long short‐term memory
MAP
maximum a posteriori
MFCC
Mel‐frequency cepstral coefficient
ML
maximum likelihood
MM
majorization‐minimization
MMSE
minimum mean square error
MSC
magnitude squared coherence
MSE
mean square error
MVDR
minimum variance distortionless response
MWF
multichannel Wiener filter
NMF
nonnegative matrix factorization
PLCA
probabilistic latent component analysis
RNN
recurrent neural network
RT60
reverberation time
RTF
relative transfer function
SAR
signal‐to‐artifacts ratio
SDR
signal‐to‐distortion ratio
SINR
signal‐to‐interference‐plus‐noise ratio
SIR
signal‐to‐interference ratio
SNR
signal‐to‐noise ratio
SPP
speech presence probability
SRP
steered response power
SRP‐PHAT
steered response power with phase transform
SRR
signal‐to‐reverberation ratio
STFT
short‐time Fourier transform
TDOA
time difference of arrival
VAD
voice activity detection
VB
variational Bayesian
This book is accompanied by a companion website:
https://project.inria.fr/ssse/
The website includes:
Implementations of algorithms
Audio samples
Emmanuel Vincent, Sharon Gannot, and Tuomas Virtanen
Source separation and speech enhancement are core problems in the field of audio signal processing, with applications to speech, music, and environmental audio. Research in this field has accompanied technological trends, such as the move from landline to mobile or hands‐free phones, the gradual replacement of stereo by 3D audio, and the emergence of connected devices equipped with one or more microphones that can execute audio processing tasks which were previously regarded as impossible. In this short introductory chapter, after a brief discussion of the application needs in Section 1.1, we define the problems of source separation and speech enhancement and introduce relevant terminology regarding the scenarios and the desired outcome in Section 1.2. We then present the general processing scheme followed by most source separation and speech enhancement approaches and categorize these approaches in Section 1.3. Finally, we provide an outline of the book in Section 1.4.
The problems of source separation and speech enhancement arise from several application needs in the context of speech, music, and environmental audio processing.
Real‐world speech signals are often contaminated by interfering speakers, environmental noise, and/or reverberation. These phenomena deteriorate speech quality and, in adverse scenarios, speech intelligibility and automatic speech recognition (ASR) performance. Source separation and speech enhancement are therefore required in such scenarios. For instance, spoken communication over mobile phones or hands‐free systems requires the separation or enhancement of the near‐end speaker's voice with respect to interfering speakers and environmental noises before it is transmitted to the far‐end listener. Conference call systems or hearing aids face the same problem, except that several speakers may be considered as targets. Source separation and speech enhancement are also crucial preprocessing steps for robust distant‐microphone ASR, as available in today's personal assistants, car navigation systems, televisions, video game consoles, medical dictation devices, and meeting transcription systems. Finally, they are necessary components in providing humanoid robots, assistive listening devices, and surveillance systems with “super‐hearing” capabilities, which may exceed the hearing capabilities of humans.
Besides speech, music and movie soundtracks are another important application area for source separation. Indeed, music recordings typically involve several instruments playing together live or mixed together in a studio, while movie soundtracks involve speech overlapped with music and sound effects. Source separation has been successfully used to upmix mono or stereo recordings to 3D sound formats and/or to remix them. It lies at the core of object‐based audio coders, which encode a given recording as the sum of several sound objects that can then easily be rendered and manipulated. It is also useful for music information retrieval purposes, e.g. to transcribe the melody or the lyrics of a song from the separated singing voice.
Environmental sound analysis is an emerging research field with many real-life applications. It concerns the analysis of general sound scenes and involves the detection of sound events, their localization and tracking, and the inference of the acoustic environment properties.
The goal of source separation and speech enhancement can be defined in layman's terms as that of recovering the signal of one or more sound sources from an observed signal involving other sound sources and/or reverberation. This definition turns out to be ambiguous. In order to address the ambiguity, the notion of source and the process leading to the observed signal must be characterized more precisely. In this section and in the rest of this book we adopt the general notations defined on p. xxv–xxvii.
Let us assume that the observed signal has $I$ channels indexed by $i$. By channel, we mean the output of one microphone in the case when the observed signal has been recorded by one or more microphones, or the input of one loudspeaker in the case when it is destined to be played back on one or more loudspeakers. A signal with $I = 1$ channel is called single-channel and is represented by a scalar $x(t)$, while a signal with $I \geq 2$ channels is called multichannel and is represented by an $I \times 1$ vector $\mathbf{x}(t)$. The explanation below employs multichannel notation, but is also valid in the single-channel case.
Furthermore, let us assume that there are $J$ sound sources indexed by $j$. The word "source" can refer to two different concepts. A point source such as a human speaker, a bird, or a loudspeaker is considered to emit sound from a single point in space. It can be represented as a single-channel signal. A diffuse source such as a car, a piano, or rain simultaneously emits sound from a whole region in space. The sounds emitted from different points of that region are different but not always independent of each other. Therefore, a diffuse source can be thought of as an infinite collection of point sources. The estimation of the individual point sources in this collection can be important for the study of vibrating bodies, but it is considered irrelevant for source separation or speech enhancement. A diffuse source is therefore typically represented by the corresponding signal recorded at the microphone(s) and it is processed as a whole.
The mixing process leading to the observed signal can generally be expressed in two steps. First, each single-channel point source signal $s_j(t)$ is transformed into an $I \times 1$ source spatial image signal $\mathbf{c}_j(t)$ (Vincent et al., 2012) by means of a possibly nonlinear spatialization operation. This operation can describe the acoustic propagation from the point source to the microphone(s), including reverberation, or some artificial mixing effects. Diffuse sources are directly represented by their spatial images instead. Second, the spatial images of all sources are summed to yield the observed signal $\mathbf{x}(t)$, called the mixture:
$$\mathbf{x}(t) = \sum_{j=1}^{J} \mathbf{c}_j(t).$$
This summation is due to the superposition of the sources in the case of microphone recording or to explicit summation in the case of artificial mixing. This implies that the spatial image of each source represents the contribution of the source to the mixture signal. A schematic overview of the mixing process is depicted in Figure 1.1. More specific details are given in Chapter 3.
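To make this two-step mixing process concrete, here is a minimal NumPy sketch of a linear convolutive instance of it: each point source signal is convolved with one impulse response per channel to form its spatial image, and the spatial images are then summed into the mixture. The randomly generated impulse responses, the signal lengths, and the function names are illustrative assumptions, not material from this book or its companion software.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatial_image(source, impulse_responses):
    """Convolve a single-channel point source signal with one impulse
    response per channel, yielding its I-channel spatial image."""
    return np.stack([fftconvolve(source, h)[: source.shape[0]] for h in impulse_responses])

def mix(spatial_images):
    """Sum the spatial images of all J sources into the I-channel mixture."""
    return np.sum(spatial_images, axis=0)

# Toy example with J = 2 point sources and I = 2 channels.
rng = np.random.default_rng(0)
T, L = 16000, 2048                            # signal and impulse response lengths (samples)
sources = rng.standard_normal((2, T))         # two point source signals
decay = np.exp(-np.arange(L) / 400.0)         # crude exponential decay mimicking reverberation
irs = rng.standard_normal((2, 2, L)) * decay  # one impulse response per source and channel
images = np.stack([spatial_image(s, h) for s, h in zip(sources, irs)])  # shape (J, I, T)
x = mix(images)                               # mixture, shape (I, T)
```

In a real recording, the impulse responses would capture the acoustic path from each source to each microphone (see Chapter 3), and diffuse sources would enter the sum directly through their recorded spatial images.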
Note that target sources, interfering sources, and noise are treated in the same way in this formulation. All these signals can be either point or diffuse sources. The choice of target sources depends on the use case. Also, the distinction between interfering sources and noise may or may not be relevant depending on the use case. In the context of speech processing, these terms typically refer to undesired speech vs. nonspeech sources, respectively. In the context of music or environmental sound processing, this distinction is most often irrelevant and the former term is preferred to the latter.
Figure 1.1 General mixing process, illustrated in the case of four sources, including three point sources and one diffuse source, and multiple channels.
In the following, we assume that all signals are digital, meaning that the time variable is discrete. We also assume that quantization effects are negligible, so that we can operate on continuous amplitudes. Regarding the conversion of acoustic signals to analog audio signals and analog signals to digital, see, for example, Havelock et al. (2008, Part XII) and Pohlmann (1995, pp. 22–49).
The above mixing process implies one or more distortions of the target signals: interfering sources, noise, reverberation, and echo emitted by the loudspeakers (if any). In this context, source separation refers to the problem of extracting one or more target sources while suppressing interfering sources and noise. It explicitly excludes dereverberation and echo cancellation. Enhancement is more general, in that it refers to the problem of extracting one or more target sources while suppressing all types of distortion, including reverberation and echo. In practice, though, this term is mostly used in the case when the target sources are speech. In the audio processing literature, these two terms are often interchanged, especially when referring to the problem of suppressing both interfering speakers and noise from a speech signal. Note that, for either source separation or enhancement tasks, the extracted source(s) can be either the spatial image of the source or its direct path component, namely the delayed and attenuated version of the original source signal (Vincent et al., 2012; Gannot et al., 2001).
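To illustrate the difference between these two possible targets, the following sketch reuses spatial_image and the toy signals from the previous snippet and approximates the direct-path component by keeping only a short window around the strongest peak of each impulse response before convolving. Both the truncation and the 2 ms window length are illustrative assumptions, not a procedure prescribed by this book.

```python
def direct_path_image(source, impulse_responses, fs=16000, window_ms=2.0):
    """Approximate the direct-path component of a source by truncating each
    impulse response to a short window around its strongest peak."""
    half = max(int(fs * window_ms / 1000) // 2, 1)
    truncated = []
    for h in impulse_responses:
        k = int(np.argmax(np.abs(h)))        # assumed location of the direct path
        g = np.zeros_like(h)
        g[max(k - half, 0): k + half] = h[max(k - half, 0): k + half]
        truncated.append(g)
    return spatial_image(source, truncated)

# Direct-path component of the first toy source from the previous sketch:
# a delayed and attenuated copy of that source at each channel.
d0 = direct_path_image(sources[0], irs[0])   # shape (I, T)
```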
