Single Channel Phase-Aware Signal Processing in Speech Communication (E-Book)

Pejman Mowlaee

Description

An overview of the challenging new topic of phase-aware signal processing

Speech communication technology is a key factor in human-machine interaction, digital hearing aids, mobile telephony, and automatic speech/speaker recognition. With the proliferation of these applications, there is a growing requirement for advanced methodologies that can push the limits of the conventional solutions relying on processing the signal magnitude spectrum.

Single-Channel Phase-Aware Signal Processing in Speech Communication provides a comprehensive guide to phase signal processing: it reviews the history of phase importance in the literature, the basic problems in phase processing, and the fundamentals of phase estimation, together with several applications that demonstrate the usefulness of phase processing.

Key features:

  • Analyzes recent advances demonstrating the positive impact of phase-based processing in pushing the limits of conventional methods.
  • Offers unique coverage of the historical context and fundamentals of phase processing, with several examples from speech communication.
  • Provides a detailed review of the literature and discusses the existing signal processing techniques required to deal with phase information in different speech applications.
  • Supplies various examples and MATLAB® implementations delivered within the PhaseLab toolbox.

Single-Channel Phase-Aware Signal Processing in Speech Communication is a valuable single-source reference for students, non-expert DSP engineers, academics, and graduate students.


Page count: 446

Year of publication: 2016




Table of Contents

Cover

Title Page

Copyright

About the Authors

Dr Pejman Mowlaee (main author) Graz University of Technology, Graz, Austria

Dipl. Ing. Josef Kulmer (co-author) Graz University of Technology, Graz, Austria

Dipl. Ing. Johannes Stahl (co-author) Graz University of Technology, Graz, Austria

Florian Mayer (co-author) Graz University of Technology, Graz, Austria

Preface

Purpose and scope

Book outline

Intended audience

Acknowledgments

List of Symbols

Part I: History, Theory and Concepts

Chapter 1: Introduction: Phase Processing, History

1.1 Chapter Organization

1.2 Conventional Speech Communication

1.3 Historical Overview of the Importance or Unimportance of Phase

1.4 Importance of Phase in Speech Processing

1.5 Structure of the Book

1.6 Experiments

1.7 Summary

References

Chapter 2: Fundamentals of Phase-Based Signal Processing

2.1 Chapter Organization

2.2 STFT Phase: Background and Some Remarks

2.3 Phase Unwrapping

2.4 Useful Phase-Based Representations

2.5 Experiments

2.6 Summary

References

Chapter 3: Phase Estimation Fundamentals

3.1 Chapter Organization

3.2 Phase Estimation Fundamentals

3.3 Existing Solutions

3.4 Experiments

3.5 Summary

References

Part II: Applications

Chapter 4: Phase Processing for Single-Channel Speech Enhancement

4.1 Introduction and Chapter Organization

4.2 Speech Enhancement in the STFT Domain: General Concepts

4.3 Conventional Speech Enhancement

4.4 Phase-Sensitive Speech Enhancement

4.5 Experiments

4.6 Summary

References

Chapter 5: Phase Processing for Single-Channel Source Separation

5.1 Chapter Organization

5.2 Why Single-Channel Source Separation?

5.3 Conventional Single-Channel Source Separation

5.4 Phase Processing for Single-Channel Source Separation

5.5 Experiments

5.6 Summary

References

Chapter 6: Phase-Aware Speech Quality Estimation

6.1 Chapter Organization

6.2 Introduction: Speech Quality Estimation

6.3 Conventional Instrumental Metrics for Speech Quality Estimation

6.4 Why Phase-Aware Metrics?

6.5 New Phase-Aware Metrics

6.6 Subjective Tests

6.7 Experiments

6.8 Summary

References

Chapter 7: Conclusion and Future Outlook

7.1 Chapter Organization

7.2 Renaissance of Phase-Aware Signal Processing: Decline and Rise

7.3 Directions for Future Research

7.4 Summary

References

Appendix A: MATLAB Toolbox

A.1 Chapter Organization

A.2 PhaseLab Toolbox

References

Index

End User License Agreement



List of Illustrations

Chapter 1: Introduction: Phase Processing, History

Figure 1.1 Speech communication devices used in everyday life scenarios are expected to function robustly in adverse noisy conditions.

Figure 1.2 Block diagram for speech communication from transmitter (microphone) to receiver end (loudspeaker) composed of a chain of blocks: beamforming, echo cancellation, de-reverberation, noise reduction, speech coding, channel coding, speech decoding, artificial bandwidth extension, near-end listening enhancement.

Figure 1.3 Block diagram of the processing chain in speech communication applications: analysis–modification–synthesis.

Figure 1.4 The experimental setup in Vary (1985) comprised three stages: a spectral analyzer, either a polyphase network (PPN) or fast Fourier transform (FFT), followed by an adaptive processor (amplitude/phase modification) and a spectral synthesizer.

Figure 1.5 Block diagram for Vary's experiment to study the effects of phase modification (Vary 1985).

Figure 1.6 Block diagram for Wang and Lim's experiment (Wang and Lim 1982), where stimuli of phase-modified speech are constructed in the framework of analysis–modification–synthesis.

Figure 1.7 Vector diagram inspired by Vary (1985) showing the phase deviation resulting from noise added to speech in a given frequency subband and frame.

Figure 1.8 Phase deviation upper bound versus the spectral SNR in Vary's experiment given in (1.5).

Figure 1.9 Block diagram to construct stimuli of phase-modified speech in analysis–modification–synthesis (Paliwal et al. 2011).

Figure 1.10 Vector diagrams showing phase spectrum compensation (PSC) inspired by Stark et al. (2008), where modification of the noisy STFT is shown for (a) large and (b) low conjugate-pair signal-to-noise ratios.

Figure 1.11 Speech enhancement using phase spectrum compensation (PSC; Stark et al. 2008). Spectrograms in dB are shown for (left) clean, (middle) noisy, and (right) enhanced speech signals. PESQ and output SNR scores are shown at the top of each panel.

Chapter 2: Fundamentals of Phase-Based Signal Processing

Figure 2.1 Time domain (left), magnitude spectrogram in dB (middle), and phase spectrogram (right) of female speech. While the magnitude spectrum presents a detailed harmonic structure of speech in time and frequency, the instantaneous phase spectrum shows no apparent pattern or useful detail.

Figure 2.2 Example showing phase wrapping in the STFT phase spectrum for the vowel “e”: (a) waveform; (b) STFT magnitude; (c) STFT phase using a causal window; (d) STFT phase using an acausal symmetric window with zero phase; (e) waveform representation as a sum of harmonics; (f) amplitude of the hth harmonic; (g) harmonic instantaneous phases; (h) unwrapped phase.

Figure 2.3 An example of two zeros located close to the unit circle. Such zeros are the main source of difficulty for DFT-based phase unwrapping methods (Drugman and Stylianou 2015).

Figure 2.4 Different branches of the arctan function are used in McGowan and Kuc (1982) to determine the term in (2.13) for adding or subtracting the multiples of 2π required in the time series phase unwrapping method.

Figure 2.5 The baseband representation of one band for a symbolic spectrum composed of one harmonic. The prototype window function spectrum suppresses the impact of the adjacent frequency bands, but not of the one closest to the frequency bin of interest (Krawczyk and Gerkmann 2012).

Figure 2.6 Spectrogram in dB (left) and baseband phase difference (BPD; right) calculated for a clean speech signal used in the short-time Fourier transform phase improvement (STFTPI) method (Krawczyk and Gerkmann 2012).

Figure 2.7 Non-uniform distribution for the spectral phase in the form of a von Mises distribution, characterized by its mean and concentration parameters and ranging between a uniform distribution and a Dirac delta.

Figure 2.8 (a) Time domain signal for female speech, (b) spectrogram in dB, (c) RPS, (d) fundamental frequency.

Figure 2.9 Example showing how phase distortion features, the mean and the deviation, are used to classify different voicing states: (left) onset, (middle) voiced, (right) offset. The results are shown as (top) time domain, (middle) phase distortion mean (PDM), and (bottom) phase distortion standard deviation (PDD).

Figure 2.10 Example inspired by Gdeisat and Lilley (2011) to show the process of phase unwrapping using the DD method applied to a cosine waveform: starting from the wrapped phase (b), 2π jumps are added or subtracted to remove the wraps, sequentially shown in (c)–(f) for the four wraps in the wrapped phase signal in (b).

Figure 2.11 Example inspired by Gdeisat and Lilley (2011) to show the process of phase unwrapping using the DD method applied to a cosine waveform corrupted with additive noise. One-dimensional phase unwrapping problem: top panel: (a) continuous phase, (b) wrapped phase, (c) mild noisy version, (d) wrapped phase, (e) unwrapped phase for mild noise, (f) intense noisy version, (g) wrapped phase, (h) unwrapped phase for intense noise.

Figure 2.12 Computation time (top) and error rate (bottom) for different phase unwrapping methods applied to speech signals.

Figure 2.13 Group delay representations listed in Table 2.3 shown for a voiced speech segment: (a) FFT log-magnitude, (b) modified group delay (MGD; Hegde et al. 2007), (c) LPC, (d) CGD (Bozkurt et al. 2007), and (e) LPGD (Rajan et al. 2013).

Figure 2.14 (a) Time domain representation for clean speech; (b) unwrapped phase shown for the first three harmonics, (c) mean, and (d) circular variance.

Figure 2.15 Phase variance presentation for (a) clean speech and (b) speech deteriorated by additive white Gaussian noise. In highly voiced regions the harmonic structure is visible due to its low phase variance. Additive noise increases the phase variance.

Figure 2.16 Time–frequency information for clean (left) and noisy (right) signals: (a) amplitude spectrogram in dB, (b) instantaneous phase, (c) group delay, (d) instantaneous frequency (IF), (e) phasegram, (f) phase distortion deviation (PDD), and (g) relative phase shift (RPS).

Chapter 3: Phase Estimation Fundamentals

Figure 3.1 Visualization of the window impact on a sinusoid in the time and frequency domains for a rectangular window: the window DTFT is shifted according to the sinusoid's frequency and weighted by its amplitude and phase, as shown in the phase response of the DTFT.

Figure 3.2 Relation of sinusoidal period and window length and its impact on amplitude and phase. (a) A sinusoid multiplied by a boxcar window with a length of one period: the Dirichlet kernels do not interfere at the sinusoid's frequency, which yields an unbiased phase estimate. (b) The more general case of a window length that is not an integer multiple of the sinusoid's period: neither the amplitude nor the phase approaches the true value, and thus the outcome is biased.

Figure 3.3 Illustration of three different windows' impact on the magnitude and phase response of one sinusoid. The improved sidelobe suppression comes at the cost of a larger mainlobe width, resulting in a lower frequency resolution. For windows with higher sidelobe suppression, the phase response at a given frequency is increasingly dominated by the phase within the mainlobe width.

Figure 3.4 Illustration of the impact of additive white Gaussian noise on the magnitude and phase response of three neighboring sinusoids windowed with a Hamming window: without noise [(a),(d)] and at two decreasing SNRs [(b),(e)] and [(c),(f)]. The left column shows the noise-free case: the mainlobes of the Hamming windows are sufficiently separated, and the phase response shows that the frequencies neighboring each sinusoid are dominated by that sinusoid's phase. The middle and right columns present the impact of ten realizations of additive noise at the two SNRs; with an increased noise level, the phase values at the neighboring frequencies are more affected by noise.

Figure 3.5 Spectral magnitude for a signal composed of three harmonics using (a) Hamming and (b) Blackman windows. The broader mainlobe of the Blackman window demands a larger window length in order to suppress the neighboring harmonics.

Figure 3.6 Iterative framework for signal reconstruction showing the GLA update procedure at each iteration, following Griffin and Lim (1984).

Figure 3.7 Spectrogram consistency concept used in Griffin–Lim iterative signal reconstruction: a consistent spectrogram is one that can be generated from some time domain signal via the STFT, whereas an inconsistent spectrogram cannot.

Figure 3.8 Speech enhancement by maintaining phase continuity and phase reconstruction across time, as proposed in Mehmetcik and Ciloglu (2012) for voiced frames.

Figure 3.9 Block diagram for phase randomization proposed for zooming noise suppression (Sugiyama and Miyahara 2013).

Figure 3.10 Proof-of-concept experiment for phase randomization: (top) clean versus noisy speech, (bottom) phase randomization with blind and oracle SNRs.

Figure 3.11 Representation of the noisy spectrum as the sum of speech and noise in the complex plane. Due to the sign ambiguity in (3.57), there is an ambiguity in the set of phase candidates.

Figure 3.12 Phase constraints across time (IFD), harmonics (RPS), and frequency (GDD). The arrows show the coordinate along which each proposed constraint is applied to the phase spectrum.

Figure 3.13 Comparison between phase estimation error criteria: squared error and cyclic error measures shown versus the phase estimation error. For further details, see Nitzan et al. (2016).

Figure 3.14 Temporal smoothing of unwrapped phase to estimate the clean phase from a noisy speech input. The steps are: fundamental frequency estimation, phase decomposition, temporal smoothing, and signal reconstruction.

Figure 3.15 SNR-based smoothing (Mowlaee and Kulmer 2015b): phase deviation cosine as a function of the a priori and a posteriori local SNRs (left); the regions for the two hypotheses depending on the values of these SNRs (right).

Figure 3.16 Performance evaluation of the maximum likelihood phase estimator (3.33) and the maximum a posteriori estimator (3.41) with regard to fundamental frequency and signal-to-noise ratio for two window lengths [(a),(c) and (b),(d)]. The non-zero phase error in low-noise scenarios is caused by the approximation in (3.31).

Figure 3.17 Impact of different window functions on phase estimation for an accurate fundamental frequency estimate (top row) and for increasingly underestimated fundamental frequency (middle and bottom rows), shown for two window lengths (left and right columns), revealing the importance of an accurate estimate. Given an accurate fundamental frequency (top), the rectangular window performs best due to its high frequency resolution. For inaccurate estimates (bottom), window functions with wider mainlobes, e.g. the Blackman window, are favored.

Figure 3.18 Griffin and Lim algorithm (GLA; Griffin and Lim 1984: dashed) and fast Griffin and Lim algorithm (solid; Perraudin et al. 2013) used for phase recovery. The results are shown as SSNR in dB for (left) female and (right) male speech versus the number of iterations.

Figure 3.19 Block diagram for the single-channel speech enhancement example used in Experiment 3.4 to demonstrate the effectiveness of phase estimators when used to replace the noisy spectral phase at signal reconstruction. Phase modification refers to any selected phase estimator listed in Table 3.1; “A” and “S” denote the analysis and synthesis steps.

Figure 3.20 Phase-only enhancement results for (left) white, (middle) babble, (right) factory noise, reported in PESQ (top row), STOI (middle), and unRMSE (bottom).

Chapter 4: Phase Processing for Single-Channel Speech Enhancement

Figure 4.1 The typical blocks needed for STFT speech enhancement: noise PSD estimation, a priori SNR estimation, and speech spectral estimation.

Figure 4.2 Log-histogram plots in dB inspired by Breithaupt et al. (2007) for the residual noise DFT magnitude distributions (neglecting the DC and Nyquist bins) for different choices of the parameter in (4.9).

Figure 4.3 Conventional speech enhancement, illustrated by a block diagram: a gain function is applied to the STFT representation of the noisy signal to obtain an estimate of the clean amplitude spectrum.

Figure 4.4 Three ways to incorporate the spectral phase information into the overall spectral estimation procedure. (a) Use independently obtained amplitude and phase estimates for reconstruction. Some of the phase estimators described in Chapter 3 need a spectral amplitude estimate, which in general is not derived from the same cost function and does not comprise any phase information in this scenario. (b) Phase information is used in order to refine the amplitude estimate; it is optional to use the phase estimate or the noisy phase for reconstruction (indicated by the dashed line, see Section 4.4.2). (c) Amplitude and phase are obtained jointly; there are several ways to accomplish this, and both estimates are employed for synthesis.

Figure 4.5 Spectrograms of (a) the clean speech, (b) the noise-corrupted speech, and (c)–(f) enhanced speech signals: (c) a parameter choice for which (4.33) reduces to (4.23), (d) the phase estimate obtained by Krawczyk and Gerkmann (2014), (e) the phase estimate obtained by Kulmer and Mowlaee (2015), and (f) (4.33) with the clean phase given. Depending on the choice of phase, harmonics are restored and artifacts introduced.

Figure 4.6 The iterative closed-loop method (Mowlaee and Saeidi 2013). The amplitude estimate is given by (4.34) and the estimated phase is obtained by the geometry method presented in Mowlaee et al. (2012).

Figure 4.7 Spectrograms of the iterative method and the relative inconsistency across iterations: (a) clean speech, (b) noise-corrupted speech, (c) the iterative method run blind, (d) the iterative method with the noise magnitude assumed known for the initial geometry-based phase estimate, (e) the normalized change in inconsistency across iterations. The rectangle shows the speech reconstruction obtained provided a reliable phase estimate is given.

Figure 4.8 (a) Clean speech, (b) noise corrupted speech, (c) estimated phase, (d) oracle phase.

Figure 4.9 Spectral coefficients of a purely stochastic signal (data points centered around zero) and a deterministic signal with uncertainty (decentered data assembly). The phase of the deterministic signal is normalized (achieved by linear phase removal, as in Chapter 3). Figure inspired by Hendriks et al. (2007).

Figure 4.10 Graphical representation of (4.58), illustrating that both the phase and the amplitude of the estimate are altered: the resulting phasor is a weighted sum of the complex mean and the noisy observation. Therefore, the ML estimate of the spectral phase is no longer the noisy phase but a value between the noisy phase and the prior information, depending on the certainty of the deterministic model (McCallum and Guillemin 2013).

Figure 4.11 Spectrograms of (a) clean speech, (b) noise corrupted speech, (c) MMSE-STSA, (d) MMSE-STSA with given STFT phase, (e) CUP, and (f) the iterative approach.

Figure 4.12 Relative inconsistency for 20 randomly selected, gender-balanced utterances from the TIMIT database (Garofolo et al. 1993) mixed at global SNRs of 0 and 5 dB as well as a lower SNR. The noise types utilized are white, babble, and factory noise. The inconsistency is normalized to (a) the outcome of the estimator in (4.33) together with STFTPI. If the amplitude of (4.33) is used together with the noisy phase for reconstruction, we obtain (b). (c) is the phase-unaware baseline estimator in (4.26), and (d) is the inconsistency of the CUP estimator in (4.46).

Figure 4.13 Sensitivity analysis of the phase-aware estimator in (4.34) (black curve) and its phase-unaware counterpart in (4.26) (gray curve). The left and right columns present the corresponding NMSE results. The a priori SNRs are lowest in (a) and (b), 0 dB in (c) and (d), and 15 dB in (e) and (f).

Chapter 5: Phase Processing for Single-Channel Source Separation

Figure 5.1 A general scenario with mixed sources: A voice is masked by a guitar sound in the background, both recorded with one microphone. SCSS is capable of separating both underlying sources from their mixture.

Figure 5.2 Geometry of the SCSS problem.

Figure 5.3 Conventional SCSS principle using the mixture phase at the signal reconstruction stage.

Figure 5.4 Block diagram of a CASA system inspired by Wang (2005), comprising segmentation and grouping stages.

Figure 5.5 Computation of the local SSR for the target source at a given frequency bin for (a) the ideal ratio mask (IRM) and (b) the ideal binary mask (IBM). Below, the time–frequency representations of the IRM (c) and the IBM (d) are shown, respectively.

Figure 5.6 Different approaches in deep learning inspired by Zöhrer et al. (2015), using (a) one model directly or (b) indirect learning of the ideal time–frequency mask using two models. The models learn the time–frequency mask used to separate the sources from the mixture.

Figure 5.7 Decomposition of a speech signal, combining trained basis vectors and estimated activations.

Figure 5.8 Multiplicative update for the basis matrix and activation matrix in NMF to approximate the underlying source magnitude spectrum.

Figure 5.9 Schematic representation of the MISI algorithm (Gunawan and Sen 2010). The spectral magnitudes are combined with the estimated phases to produce time domain signal estimates. These source estimates are then subtracted from the observed mixture to produce the remixing error, which is used to refine the phase estimates over the iterations.

Figure 5.10 Ideal Wiener filter along with the magnitude spectrum estimated using the confidence domain in PPR (Sturmel and Daudet 2012) with a fixed threshold (left); sinusoidal confidence domain and the estimated magnitude spectrum (middle); speech presence probability of the sinusoidal confidence domain (right). Results are shown for the (top) first and (bottom) second speaker.

Figure 5.11 Proof-of-concept result for the consistent Wiener filter (Le Roux and Vincent 2013) applied to a noisy speech utterance at 10 dB in street noise with a known noise power spectrum. Spectrograms shown in dB for (top) clean, (middle) noisy, and (bottom) the CWF outcome.

Figure 5.12 Comparison of a ranged (neglecting outliers) and non-ranged (including outliers) frequency distribution. The bars illustrate the histogram of the pre-separated input.

Figure 5.13 Proof of concept showing the outcome of applying temporal smoothing phase estimation on the ideal ratio mask (d). The clean target reference (a), mixture (b), and the ideal ratio mask outcome (c) are shown for comparison.

Figure 5.14 Convergence analysis for GL-based methods and their performance reported in terms of SDR, SIR, and SAR compared to Wiener filtering using the mixture phase as baseline (Watanabe and Mowlaee 2013).

Figure 5.15 Quantized magnitude spectrum inspired by Mowlaee et al. (2012a), obtained by adding white Gaussian noise to each signal source.

Figure 5.16 BSS EVAL results reported for different GL-based phase reconstruction methods versus different quantization levels: (top) SDR, (middle) SIR, (bottom) SAR (all in dB).

Figure 5.17 SDR (left) and SIR (right) in dB for different masks applied to the mixture spectrum, assuming that the phase spectrum is known.

Figure 5.18 Proof-of-concept result obtained by a complex mask applied to a noisy male utterance at 3 dB. Shown are the clean speech, the noisy speech, the IRM outcome, and the complex mask outcome.

Figure 5.19 The mixture amplitude obtained from different signal interaction functions (left). Mean square error averaged over speech frames achieved by different signal interaction functions (right; Mowlaee and Martin 2012).

Figure 5.20 Proof-of-concept result for (a) clean male utterance corrupted with female utterance as masker, (b) mixed signal at 0 dB, (c) NMF outcome, (d) CMF, and (e) CMF-WISA.

Chapter 6: Phase-Aware Speech Quality Estimation

Figure 6.1 Block diagram for (top) conventional instrumental metric without using phase information, (middle) phase-only instrumental metrics, and (bottom) joint amplitude and phase metric.

Figure 6.2 Segmentation of a TIMIT sentence based on its RMS levels into the low, mid, and high regions used in the CSII method.

Figure 6.3 Geometric representation of the single-channel speech enhancement problem, showing the noisy, clean, and noise spectra. The phase deviation is defined as the phase difference between the clean phase and the noisy phase.

Figure 6.4 Mean opinion scores (MOS) of the MUSHRA test inspired by Gaich and Mowlaee (2015a), shown for 11 participants for white noise (left) and babble noise (right). The results are grouped into (top) low (SNR = 0 dB), (middle) mid (SNR = 5 dB), and (bottom) high (SNR = 10 dB) conditions.

Figure 6.5 Correlation results for speech intelligibility measures with the subjective listening results.

Figure 6.6 Speech signal (top) and spectrogram (bottom) of the utterance “bin blue at L four soon.”

Figure 6.7 Circular variance and spectrogram for the clean phase signal, a phase-modified signal, and a randomized phase.

Figure 6.8 Mean objective scores for the best performing instrumental measures evaluated over 50 GRID utterances corrupted by controlled phase distortions. The results are shown for (top) quality measures (PESQ, IFD, PD, UnMSE) and (bottom) intelligibility measures (STOI, CSII, CSIIm, UnRMSE).

Figure 6.9 Experiment 6.2: Noisy (left), STFTPI (Krawczyk and Gerkmann 2014; middle), and clean (right) speech signals. Results shown as spectrogram (top), group delay (middle), and phase variance (bottom). The predicted quality using PESQ and frequency-weighted SNR are shown for each outcome at the top of each panel.

Figure 6.10 Results shown for clean speech, noisy speech, the estimated phase, and the clean phase: (top) spectrogram, (middle) group delay, (bottom) phase variance. The speech intelligibility outcome is predicted by STOI and CSII for a phase-enhanced signal using STFTPI.

List of Tables

Chapter 1: Introduction: Phase Processing, History

Table 1.1 Results for the Wang and Lim experiment in terms of SNR amplitude versus SNR phase, showing the equivalent SNR (Wang and Lim 1982). The results are shown for a window length of 512 samples

Table 1.2 Subjective speech quality for different treatment types at SNRs of 0 and 10 dB (Paliwal et al. 2011)

Chapter 2: Fundamentals of Phase-Based Signal Processing

Table 2.1 List of phase unwrapping solutions

Table 2.2 List of useful phase representations explained in this chapter

Table 2.3 Group delay functions and variants

Chapter 3: Phase Estimation Fundamentals

Table 3.1 Categorization of phase estimation methods with citations

Chapter 4: Phase Processing for Single-Channel Speech Enhancement

Table 4.1 Spectral amplitude estimators that are special cases of the parametrized estimator in (4.26)

Table 4.2 Settings for the estimators used in the proof-of-concept experiments

Chapter 5: Phase Processing for Single-Channel Source Separation

Table 5.1 Phase estimation methods proposed for signal reconstruction in SCSS

Table 5.2 List of time–frequency masks used for SCSS, considering two sources

Table 5.3 List of signal interaction functions

Chapter 6: Phase-Aware Speech Quality Estimation

Table 6.1 List of instrumental metrics to predict perceived speech quality

Table 6.2 List of speech intelligibility metrics

Table 6.3 Subjective evaluation results for intelligibility test, reported in percentages comparing LSA and LSA + PE methods

Table 6.4 Statistical analysis of the top performing perceived quality metrics for different noise types and SNRs, averaged over both SNRs and noise types

Table 6.5 Statistical analysis of the top performing speech intelligibility metrics for different noise types and SNRs, averaged over both SNRs and noise types

Table 6.6 Results for different phase modification scenarios in terms of conventional and phase-aware measures. The phase modification methods are: (A) noisy (unprocessed), (B) STFTPI, (C) maximum a posteriori (MAP), (D) clean STFT phase (upper bound)

Appendix A: MATLAB Toolbox

Table A.1 Filename, description, and experiment number for each MATLAB® implementation used in the book and included in the PhaseLab toolbox

Single Channel Phase-Aware Signal Processing in Speech Communication: Theory and Practice

Pejman Mowlaee

Josef Kulmer

Johannes Stahl

Florian Mayer

 

Graz University of Technology, Austria

 

 

 

 

This edition first published 2017

© 2017 by John Wiley & Sons, Ltd

Registered office:

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book's use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

Library of Congress Cataloging-in-Publication Data

Names: Mowlaee, Pejman, 1983- author. | Kulmer, Josef, author. | Stahl, Johannes, 1989- author. | Mayer, Florian, 1986- author.

Title: Single channel phase-aware signal processing in speech communication : theory and practice / [compiled and written by] Pejman Mowlaee, Josef Kulmer, Johannes Stahl, Florian Mayer.

Description: Chichester, UK ; Hoboken, NJ : John Wiley & Sons, Inc., 2016. | Includes bibliographical references and index.

Identifiers: LCCN 2016024931 (print) | LCCN 2016033469 (ebook) | ISBN 9781119238812 (cloth) | ISBN 9781119238829 (pdf) | ISBN 9781119238836 (epub)

Subjects: LCSH: Speech processing systems. | Signal processing. | Oral communication. | Phase modulation.

Classification: LCC TK7882.S65 S575 2016 (print) | LCC TK7882.S65 (ebook) | DDC 006.4/54-dc23

LC record available at https://lccn.loc.gov/2016024931

ISBN: 9781119238812

A catalogue record for this book is available from the British Library.

Cover Image: Gettyimages/lestyan4

About the Authors

Dr Pejman Mowlaee (main author) Graz University of Technology, Graz, Austria

Pejman Mowlaee was born in Anzali, Iran. He received his BSc and MSc degrees in telecommunication engineering in Iran in 2005 and 2007, and his PhD degree from Aalborg University, Denmark, in 2010. From January 2011 to September 2012 he was a Marie Curie post-doctoral fellow for digital signal processing in audiology at Ruhr University Bochum, Germany. He is currently an assistant professor at the Signal Processing and Speech Communication (SPSC) Laboratory, Graz University of Technology, Austria.

Dr. Mowlaee has received several awards, including young researcher's awards for his MSc studies in 2005 and 2006 and a best MSc thesis award. His PhD work was supported by the Marie Curie EST-SIGNAL Fellowship during 2009–2010. He is a Senior Member of the IEEE. He organized a special session in 2014 and a tutorial session in 2015, was the editor for a special issue of the Elsevier journal Speech Communication, and is a project leader for the Austrian Science Fund.

Dipl. Ing. Josef Kulmer (co-author) Graz University of Technology, Graz, Austria

Josef Kulmer was born in Birkfeld, Austria, in 1985. He received his MSc degree from Graz University of Technology, Austria, in 2014. In the same year he joined the Signal Processing and Speech Communication Laboratory at Graz University of Technology, where he is currently pursuing his PhD in the field of signal processing.

Dipl. Ing. Johannes Stahl (co-author) Graz University of Technology, Graz, Austria

Johannes Stahl was born in Graz, Austria, in 1989. In 2009, he started studying electrical engineering and audio engineering at Graz University of Technology, and in 2015 he received his Dipl.-Ing. (MSc) degree with distinction. In the same year he joined the Signal Processing and Speech Communication Laboratory at Graz University of Technology, where he is currently pursuing his PhD in the field of speech processing.

Florian Mayer (co-author) Graz University of Technology, Graz, Austria

Florian Mayer was born in Dobl, Austria, in 1986. In 2006, he started studying electrical engineering and audio engineering at Graz University of Technology, and received his Dipl.-Ing. (MSc) in 2015.

Preface

Purpose and scope

Speech communication technology has been studied intensively for more than a century, since the invention of the telephone in 1876. Today's main target applications are acoustic human–machine communication, digital telephony, and digital hearing aids. Specific applications in speech communication include, to name a few, artificial bandwidth extension, speech enhancement, source separation, echo cancellation, speech synthesis, speaker recognition, automatic speech recognition, and speech coding. The signal processing methods used in these applications mostly rely on the short-time Fourier transform. While the Fourier transform spectrum contains both an amplitude and a phase part, the phase spectrum has often been neglected or dismissed as unimportant. Since the spectral phase is typically wrapped due to its periodic nature, the main difficulty in phase processing lies in extracting a continuous phase representation. In addition, compared to the spectral amplitude, the spectral phase is considerably harder to model across frames.
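To make the wrapping difficulty concrete, here is a minimal MATLAB® sketch (purely illustrative: the tone frequency, frame length, and hop size are arbitrary choices, and the code is not part of the PhaseLab toolbox) that tracks the STFT phase of a pure sinusoid at its dominant frequency bin from frame to frame. The values returned by angle() jump within (−π, π], while unwrap() recovers a continuous phase trajectory:

    % Wrapped vs. unwrapped STFT phase of a sinusoid across frames (illustrative).
    fs = 8000;                                   % sampling frequency in Hz
    t  = (0:fs-1)'/fs;                           % one second of samples
    x  = cos(2*pi*200*t + pi/3);                 % sinusoid with initial phase pi/3
    frameLen = 256; hop = 128;
    win = 0.54 - 0.46*cos(2*pi*(0:frameLen-1)'/(frameLen-1));  % Hamming window, written out
    numFrames = floor((length(x) - frameLen)/hop) + 1;
    phi = zeros(numFrames, 1);
    for l = 1:numFrames
        frame = x((l-1)*hop + (1:frameLen)) .* win;
        X = fft(frame);
        [~, k] = max(abs(X(1:frameLen/2)));      % bin of the dominant sinusoid
        phi(l) = angle(X(k));                    % wrapped phase in (-pi, pi]
    end
    plot(phi); hold on; plot(unwrap(phi));
    legend('wrapped', 'unwrapped');

The unwrapped curve grows linearly with the frame index, since the sinusoid's phase advances by a fixed amount per hop; the wrapped curve carries the same information folded into (−π, π], which is what makes the raw phase spectrogram look so unstructured.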

This book is, in part, an outgrowth of five years of research conducted by the first author, which started with the publication of the first paper on “Phase Estimation for Signal Reconstruction in Single-Channel Source Separation” back in 2012. It is also a product of the research actively conducted in this area by all the authors at the PhaseLab research group. The fact that there is no text book on phase-aware signal processing for speech communication made it paramount to explain its fundamental principles. The need for such a book was even more pronounced as a follow-up to the success of a series of events organized/co-organized by myself, amongst them: a special session on “Phase Importance in Speech Processing Applications” at the International Conference on Spoken Language Processing (INTERSPEECH) 2014, a tutorial session on “Phase Estimation from Theory to Practice” at the International Conference on Spoken Language Processing (INTERSPEECH) 2015, and an editorial for a special issue on “phase-aware signal processing in speech communication” in Speech Communication (Elsevier, 2016), all receiving considerable attention from researchers from diverse speech processing fields. The intention of this book is to unify the recent individual advances made by researchers toward incorporating phase-aware signal processing methods into speech communication applications.

This book develops the tools and methodologies necessary to deal with phase-based signal processing and its applications, in particular in single-channel speech processing. It is intended to provide its readers with solid fundamental tools and a detailed overview of the controversial insights regarding the importance and unimportance of phase in speech communication. Phase wrapping, the main difficulty in analyzing the spectral phase, will be presented in detail, with solutions provided. Several useful representations derived from the phase spectrum will be presented. An in-depth analysis of the estimation of a signal's phase observed in noise, together with an overview of existing methods, will be given. The positive impact of phase-aware processing is demonstrated for three selected applications: speech enhancement, source separation, and speech quality estimation. Through several proof-of-concept examples and computer simulations, we demonstrate the importance and potential of phase processing in each application. Our hope is to provide a sufficient basis for researchers aiming to start their research projects in different applications in speech communication with a special focus on phase processing.
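As a small preview of what estimating a signal's phase observed in noise involves, the following hedged MATLAB® sketch (again purely illustrative: the tone frequency, window, and SNR values are arbitrary, and the snippet is not taken from the book's experiments) measures how additive white noise perturbs the spectral phase at the DFT bin of a pure tone:

    % Phase deviation at a tone's DFT bin under additive white noise (illustrative).
    fs = 8000; N = 256;
    n  = (0:N-1)';
    x  = cos(2*pi*1000*n/fs + pi/4);             % 1000 Hz falls exactly on a DFT bin
    k  = round(1000/fs*N) + 1;                   % index of that bin
    win = 0.54 - 0.46*cos(2*pi*n/(N-1));         % Hamming window, written out explicitly
    Xc  = fft(x .* win);                         % clean reference spectrum
    for snrdB = [30 10 0]
        noise = randn(N, 1);
        noise = noise/norm(noise)*norm(x)*10^(-snrdB/20);  % scale noise to the target SNR
        Y   = fft((x + noise) .* win);
        dev = angle(Y(k)*conj(Xc(k)));           % wrapped phase deviation at the tone bin
        fprintf('SNR %2d dB: phase deviation = %+.3f rad\n', snrdB, dev);
    end

At high SNR the deviation stays close to zero, while at low SNR the observed phase becomes an increasingly unreliable stand-in for the clean phase; this is exactly the estimation problem that Chapter 3 formalizes.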

Book outline

The book is divided into two parts and consists of seven chapters and an appendix. Part I (Chapters 1–3) gives an introduction to phase-based signal processing, providing the fundamentals and key concepts: an overview of the history of phase processing and the phase importance/unimportance arguments (Chapter 1); the definitions and tools required for phase-based signal processing, such as phase unwrapping and the many representations that make the phase spectrum more accessible (Chapter 2); and finally the fundamentals, limits, and potential of phase estimation and its application to speech signals (Chapter 3).

Part II (Chapters 4–7) deals with three applications to demonstrate the benefit of phase processing: single-channel speech enhancement (Chapter 4), single-channel source separation (Chapter 5), and speech quality estimation (Chapter 6). Chapter 7 concludes the book and provides several future prospects to pursue. The appendix describes the MATLAB® implementations, collected as the PhaseLab toolbox, that reproduce the experiments included in the book.

Intended audience

The book is mainly targeted at researchers and graduate students with some background in signal processing theory and applications, focused on speech signal processing. Although it is not primarily intended as a text book, the chapters may be used as supplementary material for a special-topics course at second-year graduate level. As an academic instrument, the book can be used to strengthen the understanding of the often mystical field of phase-aware signal processing, and it provides several interesting applications where phase knowledge is successfully incorporated. To get the maximal benefit from this book, the reader is expected to have a fundamental knowledge of digital signal processing, signals and systems, and statistical signal processing. For the sake of completeness, a summary of phase-based signal processing is provided in Chapter 2.

The book contains a detailed overview of phase processing and a collection of phase estimation methods. We hope that these provide a set of useful tools that will help new researchers entering the field of phase-aware signal processing and inspire them to solve problems related to phase processing. As theory and practice are linked in speech communication applications, the book is supplemented by various examples and contains a number of MATLAB® experiments. The reader will find the MATLAB® implementations for the simulations presented in the book, along with some audio samples, online at https://www.spsc.tugraz.at/PhaseLab.

These implementations are provided in a toolbox called PhaseLab which is explained in the appendix. The authors believe that each chapter of the book itself serves as a valuable resource and reference for researchers and students. The topics covered within the seven chapters cross-link with each other and contribute to the progress of the field of phase-aware signal processing for speech communication.

Acknowledgments

The intense collaboration during the year of working on this book project together with the three contributors, Josef Kulmer, Johannes Stahl, and Florian Mayer, was a unique experience, and I would like to express my deepest gratitude for all their individual efforts. Apart from their very careful and insightful proofreading, their endlessly helpful discussions in improving the contents of the chapters and in our regular meetings led to a successful outcome that was only possible within such a great team. In particular, I would like to thank Johannes Stahl and Josef Kulmer for their full contribution in preparing Chapters 3 and 4. I would like to thank Florian Mayer for his valuable contribution to Chapter 5 and his endless efforts in preparing all the figures in the book.

Last, but not least, a number of people contributed in various ways and I would like to thank them: Prof. Gernot Kubin, Prof. Rainer Martin, Prof. Peter Vary, Prof. Bastian Kleijn, Prof. Tim Fingscheidt, and Dr. Christiane Antweiler for their enlightening discussions, for providing several helpful hints, and for sharing their experience with the first author. I would like to thank Dr. Thomas Drugman, Dr. Gilles Degottex, and Dr. Rahim Saeidi for their support regarding the experiments in Chapter 2. Special thanks go to Andreas Gaich for his support in preparing the results in Chapter 6. I am also thankful to several of my former Master's students who graduated at PhaseLab at TU Graz, Carlos Chacón, Anna Maly, and Mario Watanabe, for their valuable insights and outstanding support. I am grateful to Nasrin Ordoubazari, Fereydoun, Kamran, Solmaz, Hana, and Fatemeh Mowlaee, and the Almirdamad family, who provided support and encouragement during this book project.

I would also like to thank the editorial team at John Wiley & Sons for their friendly assistance. Finally, I acknowledge the financial support from the Austrian Science Fund (FWF) project number P28070-N33.

P. Mowlaee

Graz, Austria

April 4, 2016

List of Symbols

absolute value

angle

clean speech phase spectrum

tuning parameter for modified smoothed group delay

mean value of the von Mises distribution

perturbed clean speech phase

clean speech amplitude spectrum

amplitude of harmonic h

scale factor in the z-transform X(z)

clean speech amplitude spectrum estimate

coefficients in the numerator polynomial of X(z)

continuous phase function

principal value of phase

coefficients in the denominator polynomial of X(z)

basis matrix for the qth source in NMF

smoothing parameter for decision-directed a priori SNR estimation

smoothing parameter for the uncertainty in unvoiced speech

compression parameter of the parametric speech spectrum estimators

coherent gain of a window function

compression function

baseband phase difference (BPD)

−3 dB bandwidth of the window mainlobe

distance metric used in geometry-based phase estimator

GDD-based distance metric used in geometry-based phase estimator

parabolic cylinder function

additive noise signal in time domain

additive noise along time with applied window function

divergence measure

DFT coefficient for noise

DTFT of additive noise

DTFT of windowed noise frame

distance measure as squared error between two spectra

mask approximation objective measure

signal approximation objective measure

change in inconsistency

group delay deviation

phase deviation between the observation and the noisy signal

cyclic mean phase error

remixing error in MISI for the ith iteration

expected value operator

conditional expected value operator

relative change of inconsistency

sampling frequency in Hz

fundamental frequency in Hz

fundamental frequency of the qth source in a mixture

phase deviation

instantaneous phase from STFT

relative phase shift

confluent hypergeometric function

gain function of a speech spectrum estimation scheme

STFT(iSTFT(·))

tuning parameter for modified smoothed group delay

key adjustment parameter in CWF

magnitude-squared coherence (MSC)

Gamma function

phase-sensitive filter

complex mask filter

complex ratio mask filter

harmonic index

desired harmonic

number of harmonics

hypothesis of no harmonic structure in the phase

hypothesis of harmonic structure in the phase

iteration index

maximum number of iterations

modified Bessel function of the first kind and order ν

inconsistency operator

discretized IF

confidence domain for the qth source in the PPR approach

ideal binary mask

ideal ratio mask

instantaneous frequency deviation

imaginary unit

frequency index

von Mises distribution concentration parameter

frame index

integer-valued function used in time series phase unwrapping

number of frames

local criterion used in IBM

phase spectrum compensation function

number of periods per window length

integer value as phase wrapping number

number of atoms used in NMF

number of zeros inside of the unit circle

number of zeros outside of the unit circle

shape parameter of the parametric speech amplitude distribution

circular mean parameter for the hth harmonic

circular mean parameter of the von Mises distribution

mean of the Gaussian distribution fitted to the qth source fundamental frequency

standard deviation of the Gaussian distribution fitted to the qth source fundamental frequency

sample index

instantaneous attack time

length of a window function

length of a frame

number of DFT points

normalized mean square error

normalized angular frequency

fundamental radian frequency

instantaneous frequency (IF)

closest sinusoid to bin k in STFTPI

tuning factor to scale mask in IRM

phase change in Nashi's phase unwrapping method

phase increment in Nashi's phase unwrapping method

voicing probability

linear phase along time

frequency derivative of phase

phase value of harmonic h

estimated phase value of harmonic h

phase distortion

probability density function

phase spectrum of the analysis window

source index in a mixture

number of audio sources in a mixture

radial step size

Pearson's correlation coefficient

constant threshold used in ISSIR

phase randomization index

absolute value of noisy speech signal STFT

relative phase shift

set of frames for von Mises parameter estimation

frame shift, hop size in samples

speech variance

speech intelligibility

signal-to-signal ratio (SSR)

SNR amplitude

SNR phase

local SNR

normalized root-mean-square error

circular variance

noise variance

instantaneous harmonic phase

objective function used in CWF

unwrapped harmonic phase

Part I: History, Theory and Concepts

Chapter 1: Introduction: Phase Processing, History

Pejman Mowlaee

Graz University of Technology, Graz, Austria

1.1 Chapter Organization

This chapter provides the historical background on phase-aware signal processing. We will review the controversial viewpoints on this topic so that the chapter in particular addresses two fundamental questions:

Is the spectral phase important?

To what extent does the phase spectrum affect human auditory perception?

To answer the first question, the chapter covers the up-to-date literature on the significance of phase information in signal processing in general and speech or audio signal processing in particular. We provide examples of phase importance in diverse applications in speech communication. The wide diversity in the range of applications highlights the significance of phase information and the momentum developed in recent years to incorporate phase information in speech signal processing. To answer the second question, we will present several key experiments made by researchers in the literature, in order to examine the importance of the phase spectrum in signal processing. Throughout these experiments, we will examine the key statements made by the researchers in favor of or against phase importance. Finally, the structure of the book with regard to its chapters will be explained.

1.2 Conventional Speech Communication

Speech is the most common method of communication between humans. Technology is moving toward incorporating more listening devices in assisted living, using digital signal processing solutions, and these innovations show increasingly accurate and robust performance, in particular in adverse noisy conditions. The latest advances in technology have brought new possibilities for voice-automated applications where acoustic human–machine communication is involved, in the form of different speech communication devices including digital telephony, digital hearing aids, and cochlear implants. The end user expects all these devices and applications to function robustly in adverse noise scenarios, such as driving in a car, inside a restaurant, in a factory, or other everyday-life situations (see Figure 1.1). These applications are required to perform robustly in order to maintain a certain quality of service and to guarantee a reliable speech communication experience. Digital processing of speech signals draws on several disciplines, including linguistics, psychoacoustics, physiology, and phonetics. Therefore, the design of a speech processing algorithm is a multi-disciplinary task which requires multiple criteria to be met.1

Figure 1.1 Speech communication devices used in everyday life scenarios are expected to function robustly in adverse noisy conditions.

The desired clean speech signal is rarely accessible and is often observed only as a corrupted noisy version. There may also be distortion due to failures in the communication channel, introduced as acoustic echoes or room reverberation. Figure 1.2 shows an end-to-end speech communication chain consisting of the different blocks required to mitigate the detrimental effects that impair the desired speech signal. Some conventional blocks are de-reverberation, noise reduction including single/multi-channel signal enhancement/separation, artificial bandwidth extension, speech coding/decoding, near-end listening enhancement, and acoustic echo cancellation. Depending on the target application, several other blocks might be considered, including speech synthesis, speaker verification, or automatic speech recognition (ASR), where the aim is the classification of the speech or the speaker, e.g. for forensics or security purposes.2

Figure 1.2 Block diagram for speech communication from transmitter (microphone) to receiver end (loudspeaker) composed of a chain of blocks: beamforming, echo cancellation, de-reverberation, noise reduction, speech coding, channel coding, speech decoding, artificial bandwidth extension, near-end listening enhancement.

Independent of which speech application is of interest, the underlying signal processing technique falls into the unified framework of an analysis–modification–synthesis (AMS) chain, as shown in Figure 1.3. The