A comprehensive guide that addresses the theory and practice of spatial audio.

This book provides readers with the principles and best practices in spatial audio signal processing. It describes how sound fields and their perceptual attributes are captured and analyzed within the time–frequency domain, how essential representation parameters are coded, and how such signals are efficiently reproduced for practical applications. The book is split into four parts, starting with an overview of the fundamentals. It then explains the reproduction of spatial sound, before examining signal-dependent spatial filtering. It finishes with coverage of both current and future applications and the directions in which spatial audio research is heading. Parametric Time-frequency Domain Spatial Audio focuses on applications in entertainment audio, including music, home cinema, and gaming, covering the capture and reproduction of spatial sound as well as its generation, transduction, representation, transmission, and perception. The book teaches readers the tools needed for such processing, provides an overview of existing research, and presents recent projects and commercial applications built on top of these systems.

* Provides an in-depth presentation of the principles, past developments, state-of-the-art methods, and future research directions of spatial audio technologies
* Includes contributions from leading researchers in the field
* Offers MATLAB code with selected chapters

An advanced book aimed at readers who are capable of digesting mathematical expressions about digital signal processing and sound field analysis, Parametric Time-frequency Domain Spatial Audio is best suited to researchers in academia and in the audio industry.
Page count: 754
Year of publication: 2017
Edited by
Ville Pulkki, Symeon Delikaris-Manias, and Archontis Politis
Aalto University Finland
This edition first published 2018 © 2018 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Ville Pulkki, Symeon Delikaris-Manias and Archontis Politis to be identified as the authors of the editorial material in this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data
Names: Pulkki, Ville, editor. | Delikaris-Manias, Symeon, editor. | Politis, Archontis, editor.
Title: Parametric time-frequency domain spatial audio / edited by Ville Pulkki, Symeon Delikaris-Manias, Archontis Politis, Aalto University, Aalto, Finland.
Description: First edition. | Hoboken, NJ, USA : Wiley, 2018. | Includes bibliographical references and index.
Identifiers: LCCN 2017020532 (print) | LCCN 2017032223 (ebook) | ISBN 9781119252580 (pdf) | ISBN 9781119252610 (epub) | ISBN 9781119252597 (hardback)
Subjects: LCSH: Surround-sound systems--Mathematical models. | Time-domain analysis. | Signal processing. | BISAC: TECHNOLOGY & ENGINEERING / Electronics / General.
Classification: LCC TK7881.83 (ebook) | LCC TK7881.83 .P37 2018 (print) | DDC 621.382/2--dc23
LC record available at https://lccn.loc.gov/2017020532
Cover Design: Wiley Cover Image: © Vectorig/Gettyimages
List of Contributors
Preface
Notes
About the Companion Website
Part I Analysis and Synthesis of Spatial Sound
1 Time–Frequency Processing: Methods and Tools
1.1 Introduction
1.2 Time–Frequency Processing
1.3 Processing of Spatial Audio
Note
References
2 Spatial Decomposition by Spherical Array Processing
2.1 Introduction
2.2 Sound Field Measurement by a Spherical Array
2.3 Array Processing and Plane-Wave Decomposition
2.4 Sensitivity to Noise and Standard Regularization Methods
2.5 Optimal Noise-Robust Design
2.6 Spatial Aliasing and High Frequency Performance Limit
2.7 High Frequency Bandwidth Extension by Aliasing Cancellation
2.8 High Performance Broadband PWD Example
2.9 Summary
2.10 Acknowledgment
References
3 Sound Field Analysis Using Sparse Recovery
3.1 Introduction
3.2 The Plane-Wave Decomposition Problem
3.3 Bayesian Approach to Plane-Wave Decomposition
3.4 Calculating the IRLS Noise-Power Regularization Parameter
3.5 Numerical Simulations
3.6 Experiment: Echoic Sound Scene Analysis
3.7 Conclusions
Appendix
References
Part II Reproduction of Spatial Sound
4 Overview of Time–Frequency Domain Parametric Spatial Audio Techniques
4.1 Introduction
4.2 Parametric Processing Overview
References
5 First-Order Directional Audio Coding (DirAC)
5.1 Representing Spatial Sound with First-Order B-Format Signals
5.2 Some Notes on the Evolution of the Technique
5.3 DirAC with Ideal B-Format Signals
5.4 Analysis of Directional Parameters with Real Microphone Setups
5.5 First-Order DirAC with Monophonic Audio Transmission
5.6 First-Order DirAC with Multichannel Audio Transmission
5.7 DirAC Synthesis for Headphones and for Hearing Aids
5.8 Optimizing the Time–Frequency Resolution of DirAC for Critical Signals
5.9 Example Implementation
5.10 Summary
References
6 Higher-Order Directional Audio Coding
6.1 Introduction
6.2 Sound Field Model
6.3 Energetic Analysis and Estimation of Parameters
6.4 Synthesis of Target Setup Signals
6.5 Subjective Evaluation
6.6 Conclusions
Note
References
7 Multi-Channel Sound Acquisition Using a Multi-Wave Sound Field Model
7.1 Introduction
7.2 Parametric Sound Acquisition and Processing
7.3 Multi-Wave Sound Field and Signal Model
7.4 Direct and Diffuse Signal Estimation
7.5 Parameter Estimation
7.6 Application to Spatial Sound Reproduction
7.7 Summary
Notes
References
8 Adaptive Mixing of Excessively Directive and Robust Beamformers for Reproduction of Spatial Sound
8.1 Introduction
8.2 Notation and Signal Model
8.3 Overview of the Method
8.4 Loudspeaker-Based Spatial Sound Reproduction
8.5 Binaural-Based Spatial Sound Reproduction
8.6 Conclusions
References
9 Source Separation and Reconstruction of Spatial Audio Using Spectrogram Factorization
9.1 Introduction
9.2 Spectrogram Factorization
9.3 Array Signal Processing and Spectrogram Factorization
9.4 Applications of Spectrogram Factorization in Spatial Audio
9.5 Discussion
9.6 Matlab Example
Note
References
Part III Signal-Dependent Spatial Filtering
10 Time–Frequency Domain Spatial Audio Enhancement
10.1 Introduction
10.2 Signal-Independent Enhancement
10.3 Signal-Dependent Enhancement
References
11 Cross-Spectrum-Based Post-Filter Utilizing Noisy and Robust Beamformers
11.1 Introduction
11.2 Notation and Signal Model
11.3 Estimation of the Cross-Spectrum-Based Post-Filter
11.4 Implementation Examples
11.5 Conclusions and Further Remarks
11.6 Source Code
Note
References
12 Microphone-Array-Based Speech Enhancement Using Neural Networks
12.1 Introduction
12.2 Time–Frequency Masks for Speech Enhancement Using Supervised Learning
12.3 Artificial Neural Networks
12.4 Mask Learning: A Simulated Example
12.5 Mask Learning: A Real-World Example
12.6 Conclusions
12.7 Source Code
Notes
References
Part IV Applications
13 Upmixing and Beamforming in Professional Audio
13.1 Introduction
13.2 Stereo-to-Multichannel Upmix Processor
13.3 Digitally Enhanced Shotgun Microphone
13.4 Surround Microphone System Based on Two Microphone Elements
13.5 Summary
References
14 Spatial Sound Scene Synthesis and Manipulation for Virtual Reality and Audio Effects
14.1 Introduction
14.2 Parametric Sound Scene Synthesis for Virtual Reality
14.3 Spatial Manipulation of Sound Scenes
14.4 Summary
References
15 Parametric Spatial Audio Techniques in Teleconferencing and Remote Presence
15.1 Introduction and Motivation
15.2 Background
15.3 Immersive Audio Communication System (ImmACS)
15.4 Capture and Reproduction of Crowded Acoustic Environments
15.5 Conclusions
Notes
References
Index
EULA
List of Tables
Chapter 9: Table 9.1, Table 9.2
Chapter 12: Table 12.1, Table 12.2
Ahonen, Jukka
Akukon Ltd, Finland
Alexandridis, Anastasios
Foundation for Research and Technology-Hellas, Institute of Computer Science (FORTH-ICS), Heraklion, Crete, Greece
Alon, David Lou
Department of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Israel
Bäckström, Tom
Department of Signal Processing and Acoustics, Aalto University, Finland
Delikaris-Manias, Symeon
Department of Signal Processing and Acoustics, Aalto University, Finland
Epain, Nicolas
CARLab, School of Electrical and Information Engineering, University of Sydney, Australia
Faller, Christof
Illusonic GmbH, Switzerland and École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Habets, Emanuël
International Audio Laboratories Erlangen, Germany
Jin, Craig T.
CARLab, School of Electrical and Information Engineering, University of Sydney, Australia
Laitinen, Mikko-Ville
Nokia Technologies, Finland
Mouchtaris, Athanasios
Foundation for Research and Technology-Hellas, Institute of Computer Science (FORTH-ICS), Heraklion, Crete, Greece
Nikunen, Joonas
Department of Signal Processing, Tampere University of Technology, Finland
Noohi, Tahereh
CARLab, School of Electrical and Information Engineering, University of Sydney, Australia
Pavlidi, Despoina
Foundation for Research and Technology-Hellas, Institute of Computer Science (FORTH-ICS), Heraklion, Crete, Greece
Pertilä, Pasi
Department of Signal Processing, Tampere University of Technology, Finland
Pihlajamäki, Tapani
Nokia Technologies, Finland
Politis, Archontis
Department of Signal Processing and Acoustics, Aalto University, Finland
Pulkki, Ville
Department of Signal Processing and Acoustics, Aalto University, Finland
Rafaely, Boaz
Department of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Israel
Stefanakis, Nikolaos
Foundation for Research and Technology-Hellas, Institute of Computer Science (FORTH-ICS), Heraklion, Crete, Greece
Thiergart, Oliver
International Audio Laboratories Erlangen, Germany
Vilkamo, Juha
Nokia Technologies, Finland
Virtanen, Tuomas
Department of Signal Processing, Tampere University of Technology, Finland
Throughout the history of audio, a plethora of methods for capturing, storing, and reproducing monophonic sound signals has been developed, starting from early mechanical devices and progressing via analog electronic devices to faithful digital representation. In recent decades there has also been considerable effort to capture and recreate the spatial characteristics of sound scenes for a listener. When reproducing a sound scene, the locations of sound sources and the responses of listening spaces should be perceived as in the original conditions, either in faithful replication or with deliberate modification. A vast number of research articles have been published suggesting methods to capture, store, and recreate spatial sound over headphone or loudspeaker listening setups. However, one cannot say that the field has matured yet, as new techniques and paradigms are still actively being published.
Another important task in spatial sound reproduction is the directional filtering of sound, where unwanted sound coming from other directions is attenuated when compared to the sound arriving from the direction of the desired sound source. Such techniques have applications in surround sound, teleconferencing, and head-mounted virtual reality displays.
This book covers a number of techniques that utilize signal-dependent time–frequency domain processing of spatial audio for both tasks: spatial sound reproduction and directional filtering. The application of time–frequency domain techniques in spatial audio is relatively new, as the first attempts were published about 15 years ago. A common property of the techniques is that the sound field is captured with multiple microphones, and its properties are analyzed for each time instance and individually for different frequency bands. These properties can be described by a set of parameters, which are subsequently used in processing to achieve different tasks, such as perceptually motivated reproduction of spatial sound, spatial filtering, or spatial sound synthesis. The techniques are loosely gathered under the title “time–frequency domain parametric spatial audio.”
The term “parameter” generally denotes any characteristic that can help in defining or classifying a particular system. In spatial audio techniques, the parameter quantifies, in some manner, the properties of the sound field as a function of frequency and time. In some techniques described in this book, measures having a physical meaning are used, such as the direction of arrival or the diffuseness of the sound field. Many techniques measure the similarity or dissimilarity of signals from closely located microphones, which also quantifies the spatial attributes of the sound field, although the mapping from parameter value to physical quantities is not always straightforward. In all cases, the time- and frequency-dependent parameter directly affects the reproduction of sound, which makes the outputs of the methods depend on the spatial characteristics of the captured sound field. In most cases, such signal-dependent processing yields a significant improvement over more traditional signal-independent processing when an input with relatively few audio channels is processed.
Signal-dependent processing often relies on implicit assumptions about the properties of the spatial and spectral resolution of the listener, and/or of the sound field. In spatial sound reproduction, the systems should relay sound signals to the ear canals of the listener such that the desired perception of the acoustical surroundings is obtained. The resolution of all perceivable attributes, such as sound spectrum, direction of arrival, or characteristics of reverberation, should be as high as required so that no difference from the original is perceived. On the other hand, the attributes should not be reproduced with an accuracy that is higher than needed, so that the use of computational resources is optimal. Optimally, an authentic reproduction is obtained with a moderate amount of resources, i.e., only a few microphones are needed, the computational requirements are not excessive, and the listening setup consists of only a few electroacoustic transducers.
When the captured acoustical scene deviates from the assumed model, the benefit obtained by the parametric processing may be lost, in addition to potential undesired audible degradations of the audio. An important theme in all the methods presented is how to make them robust to such degradations, by assuming extended and complex models, and/or by handling estimation errors and deviations without detrimental perceptual effects by allowing the result to deviate from reality. Such an optimization requires a deep knowledge of sound field analysis, microphone array processing, statistical signal processing, and spatial hearing. That makes the research topic rich in technological approaches.
The composition of this book was motivated by work on parametric spatial audio at Aalto University. A large number of publications and theses are condensed in this book, aiming to make the core of the work easily accessible. In addition, several chapters are contributed by established international researchers in this topic, offering a wide view of the approaches and solutions in this field.
The first part of the book concerns the analysis and synthesis of spatial sound. The first chapter reviews the methods that are commonly used in industrial audio applications for transforming signals to the time–frequency domain. It also provides background knowledge for methods of reproducing sound with controllable spatial attributes, which are utilized in several chapters of the book. The other two chapters in this part consider methods for the analysis of spatial sound captured with a spherical microphone array: how to decompose the sound field recording into plane waves.
The second part considers systems that comprise a whole sound reproduction chain: capture with a microphone array; time–frequency domain analysis, processing, and synthesis; and often also subjective evaluation of the result. The basic question is how to reproduce a spatial sound scene in such a way that a listener would not notice a difference between the original and the reproduction. All the methods are parametric in some sense; however, with different assumptions about the sound field and the listener, and with different microphone arrays utilized, the solutions end up being very different.
The third part starts with a review of current signal-dependent spatial filtering approaches. After this, two chapters with new contributions to the field follow. The second chapter discusses a method based on stochastic estimates between higher-order directional patterns, and the third chapter suggests using machine learning and neural networks to perform spatial filtering tasks.
The fourth part extends the theoretical framework to more practical approaches. The first chapter presents a number of commercial devices that utilize parametric time–frequency domain audio techniques. The second chapter discusses the application of the techniques in the synthesis of spatial sound for virtual acoustic environments, and the third chapter covers applications in teleconferencing and remote presence.
The reader should possess a good knowledge of the fields of acoustics, audio, psychoacoustics, and digital signal processing; introductions to these fields can be found in other sources.1 Finally, the working principles of many of the proposed techniques are demonstrated with code examples, written in Matlab®, which focus mostly on the parametric part of the processing. Tools for time–frequency domain transforms and for the linear processing of spatial audio are typically also needed when implementing complete audio systems, and the reader may find useful the broad range of tools developed by our research group at Aalto University.2 In addition, a range of research-related demos are available.3
Ville Pulkki, Archontis Politis, and Symeon Delikaris-Manias
Otaniemi, Espoo 2017
1. See, for example, Communication Acoustics: An Introduction to Speech, Audio and Psychoacoustics, V. Pulkki and M. Karjalainen, Wiley, 2015.
2. See http://spa.aalto.fi/en/research/research_groups/communication_acoustics/acoustics_software/.
3. See http://spa.aalto.fi/en/research/research_groups/communication_acoustics/demos/.
Don't forget to visit the companion website for this book:
www.wiley.com/go/pulkki/parametrictime-frequency
There you will find valuable material designed to enhance your learning, including:
Juha Vilkamo¹ and Tom Bäckström²
¹ Nokia Technologies, Finland
² Department of Signal Processing and Acoustics, Aalto University, Finland
In most audio applications, the purpose is to reproduce sounds for human listening, whereby it is essential to design and optimize systems for perceptual quality. To achieve such optimal quality with given resources, we often use principles in the processing of signals that are motivated by the processes involved in hearing. In the big picture, human hearing processes the sound entering the ears in frequency bands (Moore, 1995). Hearing is thus sensitive to the spectral content of the ear canal signals, which changes quickly with time in a complex way. As a result of frequency-band processing, the ear is not particularly sensitive to small differences in a weaker sound in the presence of a stronger masking sound nearby in frequency and time (Fastl and Zwicker, 2007). Therefore, a representation of audio signals where we have access to both time and frequency information is a well-motivated choice.
A prerequisite for efficient audio processing methods is a representation of the signal that presents features desirable to hearing in an accessible form and also allows high-quality playback of signals. Useful properties of such a representation are, for example, that its coefficients have physically or perceptually relevant interpretations, and that the coefficients can be processed independently from each other. The time–frequency domain is such a domain, and it is commonly used in audio processing (Smith, 2011). Spectral coefficients in this domain explain the signal content in terms of frequency components as a function of time, which is an intuitive and unambiguous physical interpretation. Moreover, time–frequency components are approximately uncorrelated, whereby they can be independently processed and the effect on the output is deterministic. These properties make the spectrum a popular domain for audio processing, and all the techniques discussed in this book utilize it. The first part of this chapter will give an overview of the theory and practice of the tools typically needed in time–frequency processing of audio channels.
The time–frequency domain is also useful when processing the spatial characteristics of sound, for example in microphone array processing. Differences in directions of arrival of wavefronts are visible as differences in time of arrival and amplitude between microphone signals. When the microphone signals are transformed to the time–frequency domain, the differences directly correspond to differences in phase and magnitude in a similar fashion to the way spatial cues used by a human listener are encoded in the ear canal signals (Blauert, 1997). The time–frequency domain differences between microphone channels have proven to be very useful in the capture, analysis, and reproduction of spatial audio, as is shown in the other chapters of this book. The second part of this chapter introduces a few signal processing techniques commonly used, and serves as background information for the reader.
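To make the last point concrete, the sketch below delays a sinusoid between two hypothetical microphone channels and recovers the delay from the inter-channel phase difference of one frequency domain frame. This is a minimal Python/NumPy sketch rather than code from the book (whose examples are in Matlab); the sampling rate, tone frequency, and delay are arbitrary illustrative values.

```python
import numpy as np

# Two channels carry the same sinusoid; the second copy is delayed by tau.
# In the frequency domain this delay appears as an inter-channel phase
# difference: angle(X2) - angle(X1) = -2*pi*f0*tau at the tone frequency.
fs = 48000           # sampling rate (Hz), illustrative
f0 = 1000.0          # test-tone frequency (Hz), illustrative
tau = 2.0e-4         # true inter-channel delay (s), illustrative
n = np.arange(2048)
x1 = np.sin(2 * np.pi * f0 * n / fs)
x2 = np.sin(2 * np.pi * f0 * (n / fs - tau))

# One windowed analysis frame per channel (Hann suppresses leakage).
w = np.hanning(len(n))
X1 = np.fft.rfft(w * x1)
X2 = np.fft.rfft(w * x2)

# Read the cross-spectral phase at the bin nearest the tone.
k = int(round(f0 * len(n) / fs))
phase_diff = np.angle(X2[k] * np.conj(X1[k]))
tau_est = -phase_diff / (2 * np.pi * f0)
print(tau_est)  # close to 2.0e-4
```

The same principle underlies the direction-of-arrival estimators in later chapters: with a known microphone spacing, the per-band time difference maps directly to an arrival direction.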
This chapter assumes understanding of basic digital signal processing techniques from the reader, which can be obtained from such basic resources as Oppenheim and Schafer (1975) or Mitra and Kaiser (1993).
A block diagram of a typical parametric time–frequency processing algorithm is shown in Figure 1.1. The processing involves transforms between the time domain input signal xi(t), the time–frequency domain signal xi(k, n), and the time domain output signal yj(t), where t is the time index in the time domain, and k and n are the indices for the frequency and the time frame in the time–frequency domain, respectively; i and j are the channel indices in the case of multi-channel input and/or output. Additionally, the processing involves short-time stochastic analysis and parameter-driven processing, where the time–frequency domain signal y(k, n) is formed based on the parameters and x(k, n). The parametric data consists of any information describing the frequency band signals, for example stochastic properties, information based on the audio objects, or user input parameters. In some use cases, such as in the decoders of parametric spatial audio coding, the stochastic estimation block is not applied, and the processing acts entirely on the parametric data provided in the bit stream.
Figure 1.1 Block diagram of a typical parametric time–frequency processing algorithm. The processing operates on three sampling rates: that of the wide-band signal, that of the frequency band signal, and that of the parametric information.
The parametric processing techniques typically operate on several different sampling rates: the sampling rate Fs of the wide-band signal; the sampling rate Fs/K of the frequency band signals, where K is the downsampling factor; and the sampling rate of the parametric information. Since the samples in the parametric information typically describe the signal properties over time frames, it potentially operates at a sampling rate below Fs/K. The parametric processing can also use a varying sampling rate, for example when the frame size adapts to the observed onsets in the audio signals. In the following sections, the required background for the processing blocks in Figure 1.1 is discussed in detail.
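The three sampling rates can be illustrated with a minimal STFT analysis/synthesis chain. The following Python/NumPy sketch is not code from the book (the frame length, hop, and the choice of frame energy as the "parameter" are illustrative assumptions): the wide-band signal runs at Fs, one spectrum x(k, n) and one parameter value are produced per hop of K samples, and the output is rebuilt by weighted overlap-add.

```python
import numpy as np

fs = 16000                    # wide-band sampling rate Fs, illustrative
N = 512                       # frame length
K = N // 2                    # hop size: band signals run at rate Fs/K
wp = np.hanning(N + 1)[:N]    # periodic Hann window
w = np.sqrt(wp)               # sqrt-Hann used for analysis AND synthesis

rng = np.random.default_rng(0)
x = rng.standard_normal(8 * K)

frames, params = [], []
for start in range(0, len(x) - N + 1, K):
    X = np.fft.rfft(w * x[start:start + N])   # x(k, n): one spectrum per hop
    frames.append(X)
    params.append(np.sum(np.abs(X) ** 2))     # frame-rate side information

# Synthesis: inverse transform, synthesis window, overlap-add. The two
# sqrt-Hann windows multiply to a periodic Hann, which sums to unity at
# 50% overlap, so interior samples are reconstructed exactly.
y = np.zeros(len(x))
for i, X in enumerate(frames):
    y[i * K:i * K + N] += w * np.fft.irfft(X, N)

err = np.max(np.abs(y[K:-K] - x[K:-K]))  # near machine precision
```

The square-root window split is one common way to satisfy the overlap-add constraint; a parameter-driven system would modify each X (or the synthesis gains) per band according to the parameter stream before the inverse transform.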
Audio signals are generally time-varying signals, whereby the spectrum is not constant in time. Should we analyze a long segment, its spectrum would contain a mixture of all the different sounds within that segment. We could then not easily access the individual sounds, but only see their mixture, and the application of efficient processing methods would become difficult. It is therefore important to choose segments of the signal of such a length that we obtain good temporal separation of the audio content. Other properties, such as constraints on algorithmic delay and requirements on spectral resolution, also impose demands on the length of the analysis windows. It is then clear that while the spectrum, or the frequency domain, is an efficient domain for audio processing, the time axis also has to be factored into the representation.
Computationally efficient algorithms for time–frequency analysis have enabled their current widespread use. Namely, the basis of most time–frequency algorithms is the fast Fourier transform (FFT), which is a practical implementation of the discrete Fourier transform (DFT). It belongs to the class of super-fast algorithms that have an algorithmic complexity of O(N log N), where N is the length of the transform.
