An Introduction to Audio Content Analysis enables readers to understand the algorithmic analysis of musical audio signals with AI-driven approaches.
An Introduction to Audio Content Analysis serves as a comprehensive guide to audio content analysis, explaining how signal processing and machine learning approaches can be used to extract musical content from audio. It gives readers the algorithmic understanding needed to teach a computer to interpret music signals and thus allows for the design of tools for interacting with music. The work ties together topics from audio signal processing and machine learning, showing how audio content analysis can pick up musical characteristics automatically. A multitude of audio content analysis tasks related to the extraction of tonal, temporal, timbral, and intensity-related characteristics of the music signal are presented. Each task is introduced from both a musical and a technical perspective, detailing the algorithmic approach and providing practical guidance on implementation details and evaluation. To aid reader comprehension, each task description begins with a short introduction to the most important musical and perceptual characteristics of the covered topic, followed by a detailed algorithmic model and its evaluation, and concludes with questions and exercises. For the interested reader, updated supplemental materials are provided via an accompanying website.
Written by a well-known expert in the music industry, An Introduction to Audio Content Analysis covers sample topics including:
* Digital audio signals and their representation, common time-frequency transforms, and audio features
* Pitch and fundamental frequency detection, key and chord detection
* Representation of dynamics in music and intensity-related features
* Onset and tempo detection, beat histograms, detection of structure in music, and sequence alignment
* Audio fingerprinting and musical genre, mood, and instrument classification
An invaluable guide for newcomers to audio signal processing and industry experts alike, An Introduction to Audio Content Analysis covers a wide range of introductory topics pertaining to music information retrieval and machine listening, allowing students and researchers to quickly gain core, holistic knowledge in audio analysis and to dig deeper into specific aspects of the field with the help of a large number of references.
Cover
Title Page
Copyright
Dedication
Author Biography
Preface
Acronyms
List of Symbols
Source Code Repositories
1 Introduction
1.1 A Short History of Audio Content Analysis
1.2 Applications and Use Cases
References
Part I: Fundamentals of Audio Content Analysis
2 Analysis of Audio Signals
2.1 Audio Content
2.2 Audio Content Analysis Process
2.3 Exercises
References
Notes
3 Input Representation
3.1 Audio Signals
3.2 Audio Preprocessing
3.3 Time‐Frequency Representations
3.4 Other Input Representations
3.5 Instantaneous Features
3.6 Learned Features
3.7 Feature Postprocessing
3.8 Exercises
References
Notes
4 Inference
4.1 Classification
4.2 Regression
4.3 Clustering
4.4 Distance and Similarity
4.5 Underfitting and Overfitting
4.6 Exercises
References
Note
5 Data
5.1 Data Split
5.2 Training Data Augmentation
5.3 Utilization of Data From Related Tasks
5.4 Reducing Accuracy Requirements for Data Annotation
5.5 Semi‐, Self‐, and Unsupervised Learning
5.6 Exercises
References
6 Evaluation
6.1 Metrics
6.2 Exercises
References
Note
Part II: Music Transcription
7 Tonal Analysis
7.1 Human Perception of Pitch
7.2 Representation of Pitch in Music
7.3 Fundamental Frequency Detection
7.4 Tuning Frequency Estimation
7.5 Key Detection
7.6 Chord Recognition
7.7 Exercises
References
Notes
8 Intensity
8.1 Human Perception of Intensity and Loudness
8.2 Representation of Dynamics in Music
8.3 Features
8.4 Exercises
References
Note
9 Temporal Analysis
9.1 Human Perception of Temporal Events
9.2 Representation of Temporal Events in Music
9.3 Onset Detection
9.4 Beat Histogram
9.5 Detection of Tempo and Beat Phase
9.6 Detection of Meter and Downbeat
9.7 Structure Detection
9.8 Automatic Drum Transcription
9.9 Exercises
References
Notes
10 Alignment
10.1 Dynamic Time Warping
10.2 Audio‐to‐Audio Alignment
10.3 Audio‐to‐Score Alignment
10.4 Evaluation
10.5 Exercises
References
Notes
Part III: Music Identification, Classification, and Assessment
11 Audio Fingerprinting
11.1 Fingerprint Extraction
11.2 Fingerprint Matching
11.3 Fingerprinting System: Example
11.4 Evaluation
References
12 Music Similarity Detection and Music Genre Classification
12.1 Music Similarity Detection
12.2 Musical Genre Classification
References
Notes
13 Mood Recognition
13.1 Approaches to Mood Recognition
13.2 Evaluation
References
14 Musical Instrument Recognition
14.1 Evaluation
References
15 Music Performance Assessment
15.1 Music Performance
15.2 Music Performance Analysis
15.3 Approaches to Music Performance Assessment
References
Part IV: Appendices
Appendix A: Fundamentals
A.1 Sampling and Quantization
A.2 Convolution
A.3 Correlation Function
References
Notes
Appendix B: Fourier Transform
B.1 Properties of the Fourier Transformation
B.2 Spectrum of Example Time Domain Signals
B.3 Transformation of Sampled Time Signals
B.4 Short Time Fourier Transform of Continuous Signals
B.5 Discrete Fourier Transform
B.6 Frequency Reassignment: Instantaneous Frequency
References
Notes
Appendix C: Principal Component Analysis
C.1 Computation of the Transformation Matrix
C.2 Interpretation of the Transformation Matrix
Appendix D: Linear Regression
Appendix E: Software for Audio Analysis
E.1 Frameworks and Libraries
E.2 Data Annotation and Visualization
References
Notes
Appendix F: Datasets
References
Index
End User License Agreement
Chapter 3
Table 3.1 Properties of three popular MFCC implementations, Davis and Merme...
Chapter 6
Table 6.1 Confusion matrix.
Chapter 7
Table 7.1 Names and distance in semitones of diatonic pitch classes.
Table 7.2 Names of musical intervals, their enharmonic equivalents, and the...
Table 7.3 Deviations of the Pythagorean, meantone, and two diatonic tempera...
Table 7.4 Frequency resolution of the STFT for different block lengths at a...
Table 7.5 Typical range of deviation of the tuning frequency from 440 Hz ov...
Table 7.6 Deviation (in Cent) of seven harmonics from the nearest equal‐tem...
Table 7.7 Pitch class order in the original and the rearranged pitch chroma...
Table 7.8 Various key profile templates, normalized to a vector length of 1...
Chapter 9
Table 9.1 Hypothetical example for annotator disagreements on musical struc...
Chapter 11
Table 11.1 Main properties of fingerprinting and watermarking in comparison...
Chapter 12
Table 12.1 Confusion matrix for an example evaluation of a speech/music cla...
Chapter 13
Table 13.1 Mood clusters as presented by Schubert.
Table 13.2 Mood clusters derived from metadata and used in MIREX.
Appendix B
Table B.1 Frequency domain properties of the most common windows.
Appendix F
Table F.1 List of datasets for audio content analysis.
Chapter 2
Figure 2.1 General processing stages of a system for audio content analysis....
Chapter 3
Figure 3.1 Snippet of a periodic audio signal with indication of its fundame...
Figure 3.2 Approximation of periodic signals: sawtooth (top) and square wave...
Figure 3.3 Probability density function of a square wave (a), a sinusoidal (...
Figure 3.4 Distribution function estimated from a music signal compared to a...
Figure 3.5 RFD (b) of a series of feature values (a) with its arithmetic mea...
Figure 3.6 RFD (b) of a series of feature values (a) with the standard devia...
Figure 3.7 Two probability distributions, Gaussian (a) and Chi‐squared (b), ...
Figure 3.8 Schematic visualization of block‐based processing: the input sign...
Figure 3.9 Example visualization of blocking: the input signal (top) is spli...
Figure 3.10 Short‐Time Fourier Transform: time domain block (a), magnitude s...
Figure 3.11 Time‐domain waveform (a) and spectrogram visualization (b); each...
Figure 3.12 Two approaches of implementing blocking for the CQT: compute mul...
Figure 3.13 Time‐domain waveform (a) and corresponding Log‐Mel Spectrogram (...
Figure 3.14 Normalized impulse response of a gammatone filter with a center ...
Figure 3.15 Frequency response of a resonance filterbank spanning four octav...
Figure 3.16 Waveform of excerpts from a speech recording (a), a string quart...
Figure 3.17 Visualization of the feature extraction process.
Figure 3.18 Spectrogram (a), waveform (b, background), and spectral centroid...
Figure 3.19 Spectrogram (a), waveform (b, background), and spectral spread (...
Figure 3.20 Spectrogram (a), waveform (b, background), and spectral skewness...
Figure 3.21 Spectrogram (a), waveform (b, background), and spectral kurtosis...
Figure 3.22 Spectrogram (a), waveform (b, background), and spectral rolloff ...
Figure 3.23 Spectrogram (a), waveform (b, background), and spectral decrease...
Figure 3.24 Spectrogram (a), waveform (b, background), and spectral slope (b...
Figure 3.25 Warped cosine‐shaped transformation basis functions for the comp...
Figure 3.26 Spectrogram (a) and Mel frequency cepstral coefficients 1–4 (b) ...
Figure 3.27 Magnitude transfer function of the filterbank for MFCC computati...
Figure 3.28 Spectrogram (a), waveform (b, background), and spectral flux (b,...
Figure 3.29 Spectrogram (a), waveform (b, background), and spectral crestfac...
Figure 3.30 Spectrogram (a), waveform (b, background), and spectral flatness...
Figure 3.31 Spectrogram (a), waveform (b, background), and spectral tonalpow...
Figure 3.32 Spectrogram (a), waveform (b, background), and feature maxacf (b...
Figure 3.33 Spectrogram (a), waveform (b, background), and feature zerocross...
Figure 3.34 Example aggregation with a texture window length of ... with the a...
Figure 3.35 Feature aggregation with texture windows: example audio input (t...
Figure 3.36 Accuracy over number of features selected by Sequential Forward ...
Chapter 4
Figure 4.1 Example for classifying the blocks of a drum loop into the classe...
Figure 4.2 A music/speech dataset visualized in a two‐dimensional feature sp...
Figure 4.3 Nearest neighbor classification of a query data point (center) fo...
Figure 4.4 The feature space from Figure 4.2 and a corresponding Gaussian Mi...
Figure 4.5 Linear Regression of two feature/target pairs, (a) RMS and peak e...
Figure 4.6 Example illustrating the importance of the similarity definition ...
Figure 4.7 Three iterations of K‐means clustering. The colors indicate the c...
Figure 4.8 Example of visualization of an overfitted model exactly matching ...
Chapter 5
Figure 5.1 Typical percentages for splitting a dataset into train, validatio...
Figure 5.2 Visualization of the data splits for training and testing in the ...
Figure 5.3 Schematic visualization of training data augmentation.
Chapter 6
Figure 6.1 Multiple hypothetical ROCs and their corresponding AUCs.
Chapter 7
Figure 7.1 Prototype visualization of harmonics as integer multiples of the ...
Figure 7.2 Different models for the nonlinear mapping of frequency to Mel (a...
Figure 7.3 Helix visualizing the two facets of pitch perception: pitch heigh...
Figure 7.4 Chromatic pitches in musical score notation with pitch class in...
Figure 7.5 One octave on a piano keyboard with annotated pitch class names....
Figure 7.6 Musical intervals in musical score notation.
Figure 7.7 Six harmonics of the pitch in musical score notation.
Figure 7.8 Detection error in Cent resulting from quantization of the period...
Figure 7.9 Detection error in Cent resulting from quantization of the fundam...
Figure 7.10 Very short excerpt of an original and a signal interpolated for ...
Figure 7.11 Magnitude spectrum and zeropadded magnitude spectrum (a) and mag...
Figure 7.12 F0 estimation via the zero‐crossing distance. Time series (a) an...
Figure 7.13 F0 estimation via the (center‐clipped) Autocorrelation Function....
Figure 7.14 Nonlinear preprocessing for ACF‐based pitch period estimation: s...
Figure 7.15 F0 estimation via the Average Magnitude Difference Function (lef...
Figure 7.16 Original power spectrum (top) and compressed spectra (mid) for t...
Figure 7.17 F0 estimation via the Harmonic Product Spectrum: compressed spec...
Figure 7.18 F0 estimation via the Autocorrelation Function of the magnitude ...
Figure 7.19 F0 estimation via the Cepstrum: input magnitude spectrum (a) and...
Figure 7.20 F0 estimation templates: template functions with different pitch...
Figure 7.21 Time domain (left) and frequency domain (right) visualization of...
Figure 7.22 Visualization of Eq. (7.37) with example dimensions.
Figure 7.23 Example for the utilization of NMF for detecting the individual ...
Figure 7.24 Distribution of tuning frequencies.
Figure 7.25 Adaptation of the tuning frequency estimate from an initial sett...
Figure 7.26 Different modes in musical score notation starting at the tonic
Figure 7.27 The twelve major scales in musical score notation, notated in th...
Figure 7.28 Circle of fifths for both major keys and minor keys, plus the nu...
Figure 7.29 Mask function for pitch class for pitch chroma computation (oc...
Figure 7.30 Magnitude spectrogram (a) and the pitch chromagram (b) of a mono...
Figure 7.31 Pitch chroma of a hypothetical pitch with 10 harmonics.
Figure 7.32 Key profile vectors (top left) and the resulting interkey distan...
Figure 7.33 Common chords in musical score notation on a root note of C.
Figure 7.34 The two inversions of a D Major triad in musical score notation....
Figure 7.35 Flowchart of a simple chord detection system.
Figure 7.36 Simple chord template matrix for chord estimation based on pitch...
Figure 7.37 Chord emission probability matrix (a), transition probability ma...
Chapter 8
Figure 8.1 Level error introduced by adding a small constant to the argume...
Figure 8.2 Spectrogram (top), waveform (bottom background), and RMS output (...
Figure 8.3 Flowchart of the frequency‐weighted RMS calculation.
Figure 8.4 (a) Frequency weighting transfer functions applied before RMS mea...
Figure 8.5 Flowchart of a Peak program meter.
Figure 8.6 Spectrogram (top), waveform (bottom background), and PPM output c...
Figure 8.7 Flowchart of Zwicker's model for loudness computation.
Chapter 9
Figure 9.1 Visualization of an envelope, attack time, and possible location ...
Figure 9.2 Different hierarchical levels related to tempo, beat, and meter....
Figure 9.3 Frequently used time signatures.
Figure 9.4 Note values (a) and corresponding rest values (b) with decreasing...
Figure 9.5 General flowchart of an onset detection system.
Figure 9.6 Audio signal and extracted envelope (a) and novelty function with...
Figure 9.7 Beat histogram of a string quartet performance (a) and of a piece...
Figure 9.8 Visualization of Onset, Beat, and Downbeat times for a drum loop ...
Figure 9.9 Flow chart of a typical (real‐time) beat‐tracking system.
Figure 9.10 Example structure of a popular song (here: Pink's So What).
Figure 9.11 Self‐Similarity Matrix of Michael Jackson's Bad.
Figure 9.12 Self‐Similarity Matrices of the same song based on different fea...
Figure 9.13 Checker Board Filter Kernel with high‐pass characteristics to de...
Figure 9.14 Self‐Similarity Matrix (left) and extracted Novelty Function....
Figure 9.15 Self‐Similarity Matrix (left) and Low‐pass filtered Diagonal as ...
Figure 9.16 Rotated Self Similarity Matrix (a) and accompanying Ground truth...
Figure 9.17 Two baseline drum transcription systems by extending a standard ...
Chapter 10
Figure 10.1 Visualization of the mapping of two similar sequences to each ot...
Figure 10.2 Distance matrix and alignment path for two example sequences; da...
Figure 10.3 Distance matrix (a and b in two different visualizations) and co...
Figure 10.4 Path restrictions for performance optimizations of DTW: original...
Figure 10.5 Comparison of different features for computing the distance matr...
Chapter 11
Figure 11.1 General framework for audio fingerprinting. The upper part visua...
Figure 11.2 Flowchart of the extraction process of subfingerprints in the Ph...
Figure 11.3 Two fingerprints and their difference. (a) Original, (b) mp3 enc...
Chapter 12
Figure 12.1 Two examples for author‐defined genre taxonomies. (a) Tzanetakis...
Figure 12.2 Scatterplot of the feature space for the 10 music classes of the...
Chapter 13
Figure 13.1 Russell's two‐dimensional model of mood.
Figure 13.2 Scatterplot of the feature space for the valence/energy ratings ...
Chapter 15
Figure 15.1 Chain of musical communication.
Appendix A
Figure A.1 Continuous audio signal (a) and corresponding sample values (b) a...
Figure A.2 Continuous (top) and sampled (below) sinusoidal signals with the ...
Figure A.3 Unquantized input signal (a) and quantized signal at a word lengt...
Figure A.4 Characteristic line of a quantizer with word length showing the...
Figure A.5 Magnitude frequency response of a moving average low‐pass filter ...
Figure A.6 Example of zero phase filtering: input signal (a), low‐pass filte...
Figure A.7 ACF of a sinusoid (a) and white noise (b).
Appendix B
Figure B.1 Schematic visualization of the spectrum of a continuous time doma...
Figure B.2 Windows in time domain (left) and frequency domain (right).
Figure B.3 Phasor representation and visualization of the phase difference w...
Figure B.4 Magnitude spectrum of a signal composed of three sinusoidals with...
Appendix C
Figure C.1 Scatter plot of a two‐dimensional data set with variables , and ...
Figure C.2 Five input features (a) and the two resulting principal component...
IEEE Press
445 Hoes Lane, Piscataway, NJ 08854
IEEE Press Editorial Board
Sarah Spurgeon, Editor in Chief
Jón Atli Benediktsson
Andreas Molisch
Diomidis Spinellis
Anjan Bose
Saeid Nahavandi
Ahmet Murat Tekalp
Adam Drobot
Jeffrey Reed
Peter (Yong) Lian
Thomas Robertazzi
Second Edition
Alexander Lerch, Georgia Institute of Technology, Atlanta, USA
Copyright © 2023 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.
Edition History: Wiley (1e, 2012)
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging‐in‐Publication Data
Names: Lerch, Alexander, author. | John Wiley & Sons, publisher.
Title: An introduction to audio content analysis : music information retrieval tasks & applications / Alexander Lerch.
Description: Second edition. | Hoboken, New Jersey : Wiley-IEEE Press, [2023] | Includes bibliographical references and index.
Identifiers: LCCN 2022046787 (print) | LCCN 2022046788 (ebook) | ISBN 9781119890942 (cloth) | ISBN 9781119890966 (adobe pdf) | ISBN 9781119890973 (epub)
Subjects: LCSH: Computer sound processing. | Computational auditory scene analysis. | Content analysis (Communication)--Data processing.
Classification: LCC TK7881.4 .L486 2023 (print) | LCC TK7881.4 (ebook) | DDC 006.4/5-dc23/eng/20221018
LC record available at https://lccn.loc.gov/2022046787
LC ebook record available at https://lccn.loc.gov/2022046788
Cover image: © Weiquan Lin/Getty Images
Cover design by Wiley
For Mila
Alexander Lerch is Associate Professor at the School of Music, Georgia Institute of Technology. He studied Electrical Engineering at the Berlin Institute of Technology and Tonmeister (music production) at the University of Arts, Berlin, and received his PhD (Audio Communications) from the Berlin Institute of Technology in 2008. He currently leads the Georgia Tech Music Informatics Group with a research focus on the design and implementation of machine learning and signal processing methods targeting audio and music in the areas of audio content analysis, music information retrieval, machine listening, and music meta‐creation. Lerch served as co‐chair of ISMIR 2021, the 22nd International Society for Music Information Retrieval Conference.
Before joining Georgia Tech, he co‐founded the company zplane.development, an industry leader providing advanced music technology to the music industry. zplane technologies are nowadays used by millions of musicians and producers world‐wide in a wide variety of products.
The fields of Audio Content Analysis (ACA) and Music Information Retrieval (MIR) have seen rapid growth over the past decade, as indicated by a rising number of publications, growing conference attendance, and an increasing number of commercial applications in the field. This growth is driven by the need to intelligently browse, retrieve, and process the large amount of music data at our fingertips and is also reflected by the growing interest of students, engineers, and software developers in learning about methods for the analysis of audio signals.
Inspired by the feedback received from readers of the first edition of this book, my goals for the second edition were threefold: (i) remain focused on the introduction of baseline systems rather than state‐of‐the‐art approaches, (ii) widen the scope of tasks introduced, and (iii) provide additional materials enhancing the learning experience. While the baseline systems presented in this book cannot be considered state‐of‐the‐art, knowledge of them is, in my opinion, crucial for an introduction to and understanding of the field. Not only does this allow us to learn about the wide variety of tasks that mirror the variety of content we handle in the analysis of musical audio, but by looking at each of these tasks, we gain insights into the task‐specific challenges and problems, thus gaining an understanding of inherent principles as well as parameters of interest. In addition, baseline systems allow for hands‐on exercises and provide an important reference data point clarifying the expected performance of an analysis system.
The second edition of this book comes with noticeable changes from the first edition. Some parts have been shortened by removing unnecessary detail, while others have been extended to provide more useful detail. In addition, evaluation methodologies are summarized for most tasks; while evaluation often seems like a chore and an uninteresting part of designing a system, it is crucial: without evaluation, no one knows how well a system works. Most chapters now also conclude with a set of questions and assignments assessing the learning outcomes. As this text intends to provide a starting point and guidance for the design of novel musical audio content analysis systems, every chapter contains many references to relevant work. Last but not least, several new topics are introduced, and the whole text has been restructured into three main parts plus an appendix.
The first part covers fundamentals shared by many audio analysis systems. It starts with a closer look at musical audio content and a general flowchart of an audio content analysis system. After a quick coverage of signals and general preprocessing concepts, typical time–frequency representations are introduced, which nowadays serve as the input representations of many analysis systems. A closer look at the most commonly used instantaneous or low‐level features is then followed by an introduction of feature postprocessing methods covering normalization, aggregation, and general dimensionality reduction. The second part introduces systems that transcribe individual musical properties of the audio signal. These are mostly properties that are more or less explicitly defined in a musical score, such as pitch, dynamics, and tempo. More specifically, these approaches are grouped into the content dimensions of tonal analysis (fundamental frequency, tuning frequency, musical key, and chords), intensity (level and loudness), and temporal analysis (onsets, beats, tempo, meter, and structure). Audio analysis approaches related to classification and identification are presented in Part Three. Many of these systems do not focus on individual musical content dimensions but aim at extracting high‐level content from the audio signal. Thus, the classifications of musical genre and mood, as well as the assessment of student music performances, are introduced. In addition, audio fingerprinting is discussed as arguably one of the audio analysis technologies with the greatest impact on consumers to date. The appendix provides a more in‐depth reference to basic concepts such as audio sampling and quantization and the Fourier transform, introduces correlation and convolution, and provides quick references to principal component analysis and linear regression.
Note that this edition also comes with a tighter integration with online resources such as slides and example code in both MATLAB and Python, as well as a website with resources such as a list of available datasets. The entry point for all these additional (and freely available) resources is https://www.audiocontentanalysis.org.
Atlanta, Georgia
Alexander Lerch
ACA
Audio Content Analysis
ACF
Autocorrelation Function
ADT
Automatic Drum Transcription
AMDF
Average Magnitude Difference Function
ANN
Artificial Neural Network
AOT
Acoustic Onset Time
AUC
Area Under Curve
BPM
Beats per Minute
CCF
Cross Correlation Function
CCIR
Comité Consultatif International des Radiocommunications
CiCF
Circular Correlation Function
CD
Compact Disc
COG
Center of Gravity
CQT
Constant Q Transform
DCT
Discrete Cosine Transform
DFT
Discrete Fourier Transform
DJ
Disk Jockey
DNN
Deep Neural Network
DP
Dynamic Programming
DTW
Dynamic Time Warping
EBU
European Broadcasting Union
EM
Expectation Maximization
ERB
Equivalent Rectangular Bandwidth
FFT
Fast Fourier Transform
FT
Fourier Transform
FN
False Negative
FNR
False Negative Rate
FP
False Positive
FPR
False Positive Rate
FWR
Full‐Wave Rectification
GMM
Gaussian Mixture Model
HFC
High Frequency Content
HMM
Hidden Markov Model
HPS
Harmonic Product Spectrum
HSS
Harmonic Sum Spectrum
HTK
HMM Toolkit
HWR
Half‐Wave Rectification
IBI
Inter‐Beat Interval
ICA
Independent Component Analysis
IDFT
Inverse Discrete Fourier Transform
IFT
Inverse Fourier Transform
IIR
Infinite Impulse Response
IO
Input/Output
IOI
Inter‐Onset Interval
ITU
International Telecommunication Union
JNDL
Just Noticeable Difference in Level
K‐NN
K‐Nearest Neighbor
LDA
Linear Discriminant Analysis
MA
Moving Average
MAE
Mean Absolute Error
MFCC
Mel Frequency Cepstral Coefficient
MIDI
Musical Instrument Digital Interface
MIR
Music Information Retrieval
MIREX
Music Information Retrieval Evaluation eXchange
ML
Machine Learning
MPA
Music Performance Analysis
MP3
MPEG‐1 Layer 3
MSE
Mean Squared Error
NOT
Note Onset Time
NMF
Nonnegative Matrix Factorization
PAT
Perceptual Attack Time
PCA
Principal Component Analysis
PDF
Probability Density Function
POT
Perceptual Onset Time
PPM
Peak Program Meter
PSD
Peak Structure Distance
RFD
Relative Frequency Distribution
RLB
Revised Low‐Frequency B Curve
RMS
Root Mean Square
RNN
Recurrent Neural Network
ROC
Receiver Operating Characteristic Curve
SIMD
Single Instruction Multiple Data
SNR
Signal‐to‐Noise Ratio
SOM
Self‐Organizing Map
SSM
Self Similarity Matrix
STFT
Short Time Fourier Transform
SVD
Singular Value Decomposition
SVM
Support Vector Machine
SVR
Support Vector Regression
TN
True Negative
TNR
True Negative Rate
TP
True Positive
TPR
True Positive Rate
Amplitude
Filter Coefficient (Recursive)
Accuracy
Number of Beats
Filter Coefficient (Transversal)
Parametrization Factor or Exponent
Number of (Audio) Channels
Center Clipping Function
Cost Matrix for the Distance Matrix between Two Sequences
Overall Cost of a Path through the Cost Matrix
Cepstrum of the Signal
Distance Matrix between Two Sequences
Distance Measure
Quantization Step Size
Delta Impulse Function
Delta Pulse Function
Novelty Function
Prediction Error
Quantization Error
Equivalent Rectangular Bandwidth
(Correlation) Lag
F‐Measure
Frequency in Hz
Fundamental Frequency in Hz
Sample Rate
Tuning Frequency in Hz
Number of Features
Instantaneous Frequency in Hz
(Discrete) Fourier Transform
Threshold
Chord Transformation Matrix
Central Moment of Order
of Signal
Transfer Function
Impulse Response
Hop Size
Sample Index
Impulse Response Length
Integer (Loop) Variable
Objective Function
Block Size
Frequency Bin Index
Percentage
Weighting Factor
Number of (Quantization) Steps
Slope
Pitch (Mel)
Geometric Mean of Signal
Harmonic Mean of Signal
Arithmetic Mean of Signal
Number of Observations or Blocks
Block Index
Order (e.g. Filter Order)
Block Overlap Ratio
Angular Velocity in radians per second
Number of Onsets
Precision
Alignment Path
Pitch Class Index
Phase Spectrum of the Signal
Gaussian Function
(MIDI) Pitch
Pitch Chroma Vector/Key Profile
Power of the Signal
Probability Density Function of the Signal
Chord Probability Vector
Quality Factor (Mid‐Frequency divided by Bandwidth)
Evaluation Metric
Quantile Boundary
Recall
Repo 1: https://github.com/alexanderlerch/ACA-Plots (Matlab code to generate plots)
Repo 2: https://github.com/alexanderlerch/ACA-Code (Matlab scripts)
Repo 3: https://github.com/alexanderlerch/pyACA (Python scripts)
Repo 4: https://github.com/alexanderlerch/ACA-Slides (PDF and code for lecture slides)
Audio is an integral and ubiquitous aspect of our daily lives; we intentionally produce sound (e.g. when communicating through speech or playing an instrument), we actively listen (e.g. to music or podcasts), we can focus on a specific sound source in a mixture of sources, and we (even unconsciously) suppress sound sources internally (e.g. traffic noise). Similar to humans, algorithms can also generate, analyze, and process audio. This book focuses on the algorithmic analysis of audio signals, more specifically the extraction of information from musical audio signals.
Audio signals contain a wealth of information: by simply listening to an audio signal, humans are able to infer a variety of content information. A speech signal, for example, obviously transports the textual information, but it might also reveal information about the speaker (gender, age, accent, mood, etc.), the recording environment (e.g., indoors vs. outdoors), and much more. A music signal might allow us to derive melodic and harmonic characteristics, understand the musical structure, identify the instruments playing, perceive the projected emotion, categorize the music genre, and assess characteristics of the performance as well as the proficiency of the performers. An audio signal can contain and transport a wide variety of content beyond these simple examples. This content information is sometimes referred to as metadata: data about (audio) data.
The field of Audio Content Analysis (ACA) aims at designing and applying algorithms for the automatic extraction of content information from the raw (digital) audio signal. This enables content‐driven and content‐adaptive services which describe, categorize, sort, retrieve, segment, process, and visualize the signal and its content.
The wide range of possible audio sources and the multi‐faceted nature of audio signals result in a variety of distinct ACA problems, leading to various areas of research, including:
speech analysis, covering topics such as automatic speech recognition [1, 2] or recognizing emotion in speech [3, 4],
urban sound analysis, with applications in noise pollution monitoring [5] and audio surveillance, i.e., the detection of dangerous events [6],
industrial sound analysis, such as monitoring the state of mechanical devices like engines [7] or monitoring the health of livestock [8], and, last but not least,
musical audio analysis, targeting the understanding and extraction of musical parameters and properties from the audio signal [9].
This book focuses on the analysis of musical audio signals and the extraction of musical content from audio. There are many similarities and parallels to the areas above, but there are also many differences that distinguish musical audio from other signals beyond simple technical properties such as audio bandwidth. Like an urban sound signal, music is a polytimbral mixture of multiple sound sources, but unlike urban sound, its sound sources are clearly related (e.g., melodically, harmonically, or rhythmically). Like a speech signal, a music signal is a sequence in a language with rules and constraints, but unlike speech, the musical language is abstract and has no singular meaning. Like an industrial audio signal, music has both tonal and noise‐like components which may repeat themselves, but unlike the industrial signal, it conveys a (musical) form based on hierarchical grouping of elements not only through repetition but also through precise variation of rhythmic, dynamic, tonal, and timbral elements.
As we will see throughout the chapters, the design of systems for the analysis of musical audio often requires knowledge and understanding from multiple disciplines. While this text approaches the topic mostly from an engineering and Digital Signal Processing (DSP) perspective, the proper formulation of research questions and task definitions often requires methods, or at least a synthesis of knowledge, from fields as diverse as music theory, music perception, and psychoacoustics. Researchers working on ACA thus come from various backgrounds such as computer science, engineering, psychology, and musicology.
The diversity in musical audio analysis is also exemplified by the wide variety of terms referring to it. Overall, musical ACA is situated in the broader area of Music Information Retrieval (MIR). MIR is a broader field that covers not only the analysis of musical audio but also symbolic (nonaudio) music formats such as musical scores and files or signals compliant with the so‐called Musical Instrument Digital Interface (MIDI) protocol [10]. MIR also covers the analysis and retrieval of information that is music‐related but cannot be (easily) extracted from the audio signal, such as artist names, user ratings, performance instructions in the score, or bibliographical information such as publisher, publishing date, and the work's title. Other areas of research, such as music source separation and automatic music generation, are often also considered to belong within MIR. Various overview articles clarify how the understanding of the field of MIR has evolved over time [11–16]. Other, related terms are also in use. Audio event detection, nowadays often related to urban sound analysis, is sometimes described as computational analysis of sound scenes [17]. The analysis of sound scenes from a perceptual point of view has been described as Computational Auditory Scene Analysis (CASA) [18]. In the past, other terms have been used more or less synonymously with the term "audio content analysis." Examples of such synonyms are machine listening and computer audition. Finally, there is the term music informatics, which encompasses essentially any aspect of algorithmic analysis, synthesis, and processing of music (although in some circles, its meaning is restricted to the creation of musical artifacts with software).
Historically, the first systems analyzing the content of audio signals appeared shortly after technology provided the means of storing and reproducing recordings on media in the twentieth century. One early example is Seashore's Tonoscope, which enabled pitch analysis by visualizing the fundamental frequency of the incoming audio signal on a rotating drum [19]. However, the more recent evolution of digital storage media, DSP methods, and machine learning during the last decades, along with the growing amount of digital audio data available through downloads and streaming services, has significantly increased both the need for and the possibilities of automatic systems for analyzing audio content, resulting in a lively and growing research field.
Early systems for audio analysis were frequently so‐called "expert systems" [20], designed by experts who implement their task‐specific knowledge into a set of rules. Such systems can be very successful if there is a clear and simple relation between the knowledge and the implemented algorithms. Good examples of such systems are some of the pitch‐tracking approaches introduced in Section 7.3: as the goal is the detection of periodicity in the signal in a specific range, an approach such as the Autocorrelation Function (ACF), combined with multiple assumptions and constraints, can be used to estimate this periodicity and thus the fundamental frequency.
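To make this concrete, the following minimal Python sketch (an illustrative simplification written for this text, not the reference implementation from the book's accompanying repositories) estimates the fundamental frequency of a single audio block by picking the lag of the autocorrelation maximum within a plausible pitch range:

```python
import numpy as np

def estimate_f0_acf(x, f_s, f_min=50.0, f_max=2000.0):
    """Rule-based F0 estimate for one block: lag of the ACF maximum."""
    x = x - np.mean(x)                      # remove DC offset
    acf = np.correlate(x, x, mode="full")   # autocorrelation function
    acf = acf[len(x) - 1:]                  # keep non-negative lags only
    lag_min = int(f_s / f_max)              # shortest plausible pitch period
    lag_max = int(f_s / f_min)              # longest plausible pitch period
    lag = lag_min + np.argmax(acf[lag_min:lag_max])
    return f_s / lag

if __name__ == "__main__":
    f_s = 44100
    t = np.arange(4096) / f_s               # one 4096-sample block
    x = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
    print(estimate_f0_acf(x, f_s))          # approximately 220 Hz
```

The "expert knowledge" here consists of the assumptions encoded in the search range and in the decision to trust the global ACF maximum; real systems add further constraints to handle octave errors and noise.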
Later, data‐driven systems became increasingly popular and traditional machine learning approaches started to show superior performance on many tasks. These systems extract so‐called features from the audio to achieve a task‐dependent representation of the signal. Then, training data are used to build a model of the feature space and of how it maps to the inferred outcome. The role of the expert becomes less influential in the design of these systems, as it is restricted to selecting or designing a fitting set of features, curating a representative set of data, and choosing and parametrizing the machine learning approach. One prototypical example of these approaches is musical genre classification as introduced in Section 12.2.
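A minimal sketch of such a feature-plus-classifier pipeline is given below; the dataset, the choice of two hand-designed features (spectral centroid and RMS, aggregated over blocks), and the use of scikit-learn's k-nearest-neighbor classifier are all illustrative assumptions rather than the book's prescribed setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def extract_features(x, f_s, block_size=2048, hop_size=1024):
    """Aggregate two instantaneous features (spectral centroid, RMS)
    over all blocks into one 4-dimensional feature vector per file."""
    centroid, rms = [], []
    window = np.hanning(block_size)
    freqs = np.fft.rfftfreq(block_size, d=1.0 / f_s)
    for start in range(0, len(x) - block_size, hop_size):
        block = x[start:start + block_size]
        mag = np.abs(np.fft.rfft(block * window))
        centroid.append(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))
        rms.append(np.sqrt(np.mean(block ** 2)))
    # feature aggregation: mean and standard deviation over all blocks
    return np.array([np.mean(centroid), np.std(centroid),
                     np.mean(rms), np.std(rms)])

def train_genre_classifier(dataset):
    """dataset: hypothetical list of (audio array, sample rate, genre label) tuples."""
    X = np.vstack([extract_features(x, f_s) for x, f_s, _ in dataset])
    y = [label for _, _, label in dataset]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    scaler = StandardScaler().fit(X_train)            # feature normalization
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(scaler.transform(X_train), y_train)
    print("test accuracy:", clf.score(scaler.transform(X_test), y_test))
    return scaler, clf
```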
Modern machine learning approaches include trainable feature extraction (also referred to as feature learning, see Section 3.6) as part of deep neural networks. These approaches have consistently shown superior performance for nearly all ACA tasks. The researcher seldom imparts much domain knowledge beyond choosing the input representation, the data, and the system architecture. An example of such an end‐to‐end system could be a music genre classification system based on a convolutional neural network with a Mel spectrogram input.
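The following toy sketch shows what such an end-to-end system might look like, assuming librosa for the log-Mel spectrogram input and PyTorch for a deliberately small convolutional network; the layer sizes and the number of classes are illustrative assumptions, not a recommended architecture:

```python
import librosa
import torch
import torch.nn as nn

def log_mel(path, sr=22050, n_mels=64):
    """Log-Mel spectrogram as the learned system's input representation."""
    x, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=x, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)          # shape: (n_mels, n_frames)

class GenreCNN(nn.Module):
    """Minimal convolutional classifier operating on log-Mel spectrograms."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling over time and frequency
        )
        self.fc = nn.Linear(32, n_classes)

    def forward(self, spec):                  # spec: (batch, 1, n_mels, n_frames)
        return self.fc(self.conv(spec).flatten(1))

# hypothetical usage with an assumed file name:
# spec = torch.tensor(log_mel("song.wav")).float().unsqueeze(0).unsqueeze(0)
# logits = GenreCNN()(spec)                   # trained with cross-entropy in practice
```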
It should be pointed out that while modern systems tend to have superior performance, they also tend to be less interpretable and explainable than traditional systems. For example, deducing the reason for a false classification result in a network‐based system can be difficult, while the reason is usually easily identifiable in a rule‐based system.
The content extracted from music signals improves or enables various forms of content‐based and content‐adaptive services, which allow users to sort, categorize, find, segment, process, and visualize the audio signal based on its content.
One of the most intuitive applications is the content‐based search for music in large databases, for which consistent manual annotation by humans is often infeasible. This allows, for example, Disk Jockeys (DJs) to retrieve songs at a specific tempo or in a specific key, to find music pieces in a specific genre or mood, or to search for specific musical characteristics by example. The same information can be used in end‐consumer applications such as audio‐based music recommendation and playlist generation systems using an in‐depth understanding of the musical content [21].
The content‐based annotation and representation of the audio signal can also be used to design new interfaces to access data. Fingerprinting (see Chapter 11), for example, makes it possible to identify a currently playing song in a large database of songs, while Query‐by‐Humming systems identify songs through a user‐hummed melody. Content can also be utilized for new ways of sound visualization and user interaction; a database of music could, for example, be explored by a user navigating a virtual music similarity space [22].
Audio analysis has already started to transform consumer‐facing industries such as streaming services as mentioned above. So far, most services focus on recommending music to be played back, but in the near future, we might see the rise of creative music listening applications that enable the listener to interact with the content itself instead of being restricted to only choosing content to be played. This could include, for example, the gain adjustment for individual voices, replacing instruments or vocalists, or interactively changing the musical arrangement or even stylistic properties of the music piece.
Knowledge of the audio content can improve music production tools in various dimensions. On the one hand, content information can enable a more “musical” software interface, e.g. by displaying score‐like information synchronized with the audio data, and thus enabling an intuitive approach to editing the audio data. On the other hand, production software usage could be enhanced in terms of productivity and efficiency: the better a system understands the details and properties of incoming audio streams or files, the better it can adapt, for instance, by applying default gain and equalization parameters [23] or by suggesting compatible audio from a sound library. Intelligent music software might also support editors by offering automatic artifact‐free splicing of multiple recordings from one session or selecting error‐free recordings from a set of recordings.
Modern tools also enhance the creative possibilities in the production process. For example, creating harmonically meaningful background choirs by analyzing the lead vocals and the harmony track is already technically feasible and commercially available today. Knowing and isolating sound sources in a recording could enable new ways of modifying or morphing different sounds for the creation of new soundscapes, effects, and auditory scenes.
The potential of utilizing technology to assist music (instrument) education has been recognized as early as the 1930s, when Seashore pointed out the educational value of scientific observation of music performances [24]. The availability of automatic and fast audio analysis can support the creation of artificially intelligent music tutoring software, which aims at supplementing teachers by providing students with insights and interactive feedback by analyzing and assessing the audio of practice sessions. An interactive music tutor with in‐depth understanding of the musical score and performance content can highlight problematic parts of the students' performance, provide a concise yet easily understandable analysis, give objective and specific feedback on how to improve, and individualize the curriculum depending on the students' mistakes and general progress. At the same time, music tutoring software can provide an accessible, objective, and reproducible analysis.
Machine‐interpretable content information can also feed generative algorithms. The automatic composition and rendition of music is emerging as a challenging yet popular research direction [25], gaining interest from both research institutions and industry. While bigger questions concerning the capabilities and restrictions of computational creativity as well as the aesthetic evaluation of algorithmically generated music remain largely unanswered, practical applications such as generating background music for user videos and commercial advertisements are currently in the focus of many researchers. The interactive and adaptive generation of soundtracks for video games as well as the individualized generation of license‐free music content for streaming are additional long‐term goals of considerable commercial interest.
1 Xuedong Huang, James Baker, and Raj Reddy. A Historical Perspective of Speech Recognition. Communications of the ACM, 57(1):94–103, January 2014. ISSN 0001‐0782. doi: 10.1145/2500887. URL https://doi.org/10.1145/2500887.
2 Dong Yu and Li Deng. Automatic Speech Recognition: A Deep Learning Approach. Signals and Communication Technology. Springer, London, 2015. ISBN 978‐1‐4471‐5779‐3. URL https://doi.org/10.1007/978-1-4471-5779-3.
3 Frank Dellaert, Thomas Polzin, and Alex Waibel. Recognizing Emotion in Speech. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), volume 3, pages 1970–1973, Philadelphia, PA, October 1996. doi: 10.1109/ICSLP.1996.608022.
4 Moataz El Ayadi, Mohamed S Kamel, and Fakhri Karray. Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases. Pattern Recognition, 44(3):572–587, March 2011. ISSN 0031‐3203. doi: 10.1016/j.patcog.2010.09.020. URL https://www.sciencedirect.com/science/article/pii/S0031320310004619.
5 Juan Pablo Bello, Charlie Mydlarz, and Justin Salamon. Sound Analysis in Smart Cities. In Computational Analysis of Sound Scenes and Events, pages 373–397. Springer, Cham, 2018. ISBN 978‐3‐319‐63449‐4. URL https://link.springer.com/chapter/10.1007/978-3-319-63450-0_13.
6 Marco Crocco, Marco Cristani, Andrea Trucco, and Vittorio Murino. Audio Surveillance: A Systematic Review. ACM Computing Surveys, 48(4):52:1–52:46, February 2016. ISSN 0360‐0300. doi: 10.1145/2871183. URL https://doi.org/10.1145/2871183.
7 Sascha Grollmisch, Jakob Abeßer, Judith Liebetrau, and Hanna Lukashevich. Sounding Industry: Challenges and Datasets for Industrial Sound Analysis. In Proceedings of the European Signal Processing Conference (EUSIPCO), pages 1–5, A Coruna, Spain, September 2019. doi: 10.23919/EUSIPCO.2019.8902941. ISSN 2076‐1465.
8 Dries Berckmans, Martijn Hemeryck, Daniel Berckmans, Erik Vranken, and Toon van Waterschoot. Animal Sound… Talks! Real‐time Sound Analysis for Health Monitoring in Livestock. In Proceedings of the International Symposium on Animal Environment and Welfare (ISAEW), Chongqing, China, 2015.
9 Daniel P W Ellis. Extracting Information from Music Audio. Communications of the ACM, 49(8):32–37, August 2006. ISSN 0001‐0782. doi: 10.1145/1145287.1145310. URL http://portal.acm.org/citation.cfm?doid=1145287.1145310.
10 MIDI Manufacturers Association. Complete MIDI 1.0 Detailed Specification V96.1, 2nd edition. Standard, MMA, 2001.
11 J Stephen Downie. Music Information Retrieval. Annual Review of Information Science and Technology, 37:295–340, 2003.
12 Nicola Orio. Music Retrieval: A Tutorial and Review. Foundations and Trends in Information Retrieval, 1(1):1–90, 2006.
13 Michael A Casey, Remco Veltkamp, Masataka Goto, Marc Leman, Christophe Rhodes, and Malcolm Slaney. Content‐based Music Information Retrieval: Current Directions and Future Challenges. Proceedings of the IEEE, 96(4):668–696, April 2008. ISSN 1558‐2256. doi: 10.1109/JPROC.2008.916370.
14 Markus Schedl, Emilia Gómez, and Julián Urbano. Music Information Retrieval: Recent Developments and Applications. Foundations and Trends in Information Retrieval, 8(2‐3):127–261, September 2014. ISSN 1554‐0669. doi: 10.1561/1500000042. URL http://www.nowpublishers.com/article/Details/INR-042.
15 Alexander Lerch. Music Information Retrieval. In Stefan Weinzierl, editor, Akustische Grundlagen der Musik, number 5 in Handbuch der Systematischen Musikwissenschaft, pages 79–102. Laaber, 2014. ISBN 978‐3‐89007‐699‐7.
16 John Ashley Burgoyne, Ichiro Fujinaga, and J Stephen Downie. Music Information Retrieval. In Susan Schreibman, Ray Siemens, and John Unsworth, editors, A New Companion to Digital Humanities, pages 213–228. John Wiley & Sons, Ltd, 2015. ISBN 978‐1‐118‐68060‐5. doi: 10.1002/9781118680605.ch15. URL http://onlinelibrary.wiley.com/doi/10.1002/9781118680605.ch15/summary.
17 Tuomas Virtanen, Mark D Plumbley, and Dan Ellis, editors. Computational Analysis of Sound Scenes and Events. Springer International Publishing, Cham, 2018. ISBN 978‐3‐319‐63449‐4. doi: 10.1007/978‐3‐319‐63450‐0. URL http://link.springer.com/10.1007/978-3-319-63450-0.
18 DeLiang Wang and Guy J Brown. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley‐IEEE Press, 2006. ISBN 978‐0‐470‐04338‐7. URL https://ieeexplore.ieee.org/book/5769523.
19 Carl E Seashore. A Voice Tonoscope. Studies in Psychology, 3:18–28, 1902.
20 Peter J F Lucas and Linda C van der Gaag. Principles of Expert Systems. Addison‐Wesley, 1991.
21 Peter Knees, Markus Schedl, and Masataka Goto. Intelligent User Interfaces for Music Discovery: The Past 20 Years and What's to Come. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 44–53, Delft, Netherlands, 2019. URL http://archives.ismir.net/ismir2019/paper/000003.pdf.
22 Masahiro Hamasaki, Masataka Goto, and Tomoyasu Nakano. Songrium: Browsing and Listening Environment for Music Content Creation Community. In Proceedings of the Sound and Music Computing Conference (SMC), page 8, Maynooth, Ireland, 2015.
23 Joshua D Reiss and Øyvind Brandtsegg. Applications of Cross‐Adaptive Audio Effects: Automatic Mixing, Live Performance and Everything in Between. Frontiers in Digital Humanities, 5, 2018. ISSN 2297‐2668. doi: 10.3389/fdigh.2018.00017. URL https://www.frontiersin.org/articles/10.3389/fdigh.2018.00017/full.
24 Carl E Seashore. Psychology of Music. McGraw‐Hill, New York, 1938.
25 Jean‐Pierre Briot, Gaëtan Hadjeres, and François‐David Pachet. Deep Learning Techniques for Music Generation. Computational Synthesis and Creative Systems. Springer International Publishing, Cham, 2020. ISBN 978‐3‐319‐70162‐2. doi: 10.1007/978‐3‐319‐70163‐9. URL http://link.springer.com/10.1007/978-3-319-70163-9.
Taking a closer look at Audio Content Analysis (ACA), this chapter introduces the musical content to be extracted from the audio signal and then summarizes the basic processing steps of an ACA system.
The introduction already provided some examples of content that can be extracted from an audio signal. For a recording of Western music, content originates from three different sources:
Composition: The term composition will be used broadly as a definition and conceptualization of musical ideas. It can refer to any form of musical notation or description. This includes traditional ways from the basso continuo (a historic way of defining the harmonic structure) to the classic Western score notation as well as the modern lead sheet and other forms of notation used for contemporary and popular music. The content related to the composition allows us to recognize different renderings of the same song or symphony as being the same piece. In most genres of Western music, this can encompass musical elements such as melody, harmony, instrumentation, structure and form, and rhythm.
Performance: Music generally requires a performer or group of performers to create a unique acoustical rendition of the composition. The performance not only communicates the explicit information from the composition but also interprets and modifies it. This happens through the performance‐related content, which includes, for example, the tempo and its variation as well as the micro‐timing, the realization of musical dynamics, accents, and instantaneous dynamic modulations such as tremolo, the use of expressive intonation and vibrato, and specific playing (e.g. bowing) techniques influencing the sound quality.
Production: As the input of an ACA system is an audio recording, the choices made during recording and production impact certain characteristics of the recording. Production‐related content mostly impacts the sound quality of the recording (microphone positioning, equalization, and the application of effects such as reverb to the signal) and the dynamics (by applying manual or automatic gain adjustments). However, changes in timing and pitch may occur as well during the process of editing the recording and applying software for pitch correction.
Note that these content sources can be hard to separate from the final audio recording. For example, the timbre of a recording can be determined by the instrumentation indicated by the score, by the performers' choice of instruments (e.g. historical instruments, specific guitar amps), by specific playing techniques, and by sound‐processing choices made by the sound engineer or the producer.
Furthermore, it should be noted that the distinction between composition and performance becomes less clearly defined in improvisatory and non‐Western forms of music.
While helpful for a better understanding of musical audio, the categorization by content source has only limited use from a practical audio analysis point of view. A more useful categorization of musical content is driven by musical and perceptual considerations, leading to four main categories:
timbre characteristics: information related to sound quality such as spectral shape (see Section 3.5),
tonal characteristics: pitch‐related information such as melody, harmony, key, and intonation (see Chapter 7),
intensity‐related characteristics: dynamics‐related information such as envelope, level, and loudness (see Chapter 8), and
temporal characteristics: information comprising the tempo, timing, and meter of the signal (see Chapter 9).
Obviously, there are additional ways to quantitatively describe audio signals that do not fall squarely in one of these music‐driven categories, for example, statistical or other technical descriptions of the audio data such as an amplitude distribution.
The categories listed above cover the basic musical information that can be found in a music signal, and analysis tasks such as tempo detection, key detection, or envelope extraction fall unambiguously into one of the above categories. However, cues from various categories are necessary to deduce, for example, the structure of a musical piece or to extract performance characteristics. Furthermore, there exist many high‐level perceptual concepts that humans use when categorizing music, such as the music genre or the projected affective content. Such concepts draw from a combination of several or all categories mentioned above, although the impact of specific musical parameters on the perception of these concepts is hard to quantify.
Note that there is also content that is not easily extractable from the signal itself but may still be useful for Music Information Retrieval (MIR) systems. Examples include the year of the composition or recording, the record label, the song title, and information about the artists.
Systems for audio content analysis can often be represented as a two‐step process as depicted in Figure 2.1, where the signal is first converted into a meaningful input or feature representation that is then fed into a rule‐based or data‐driven system for inference such as a classifier. A perfect, maximally meaningful, input representation requires only a trivial inference method: inference could be as simple as just applying a threshold to the representation. A very powerful method of inference, on the other hand, is able to utilize an input representation even if the task‐relevant information is not directly or easily accessible from the representation. In traditional systems, the design of the input representation used to require considerable effort in feature design and engineering. Nowadays, the input processing tends to become less sophisticated (e.g. simply computing a spectrogram), which is counterbalanced by the increasing complexity of the inference system.
A preprocessing step can precede the feature extraction stage; in this step, the input data might be converted into a specific format, normalized, or processed for increased robustness and stability.
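As a deliberately trivial illustration of these two stages (an example constructed for this text, not a system from the book), the sketch below uses a per-block RMS value as the input representation and a simple threshold as the inference step, separating silent from non-silent blocks:

```python
import numpy as np

def rms_per_block(x, block_size=1024, hop_size=512):
    """Stage 1 - input representation: one RMS value per block."""
    starts = range(0, max(len(x) - block_size, 1), hop_size)
    return np.array([np.sqrt(np.mean(x[i:i + block_size] ** 2)) for i in starts])

def detect_activity(x, threshold_db=-40.0):
    """Stage 2 - inference: a threshold applied to the feature representation."""
    rms_db = 20 * np.log10(rms_per_block(x) + 1e-12)   # avoid log of zero
    return rms_db > threshold_db                        # True where the block is 'active'
```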
For traditional feature‐based approaches, the following properties are generally desirable for the feature input:
Task‐relevance without noise or other task‐irrelevant information: features necessarily have to represent information that is relevant to the task at hand. A tempo detection system, for example, might need to focus on the rhythmic parts instead of pitch‐related information. Therefore, a meaningful feature captures task‐relevant (rhythmic) information and discards task‐irrelevant (pitch‐related) information. Note that capturing task‐relevant information does not necessarily mean that such features are humanly interpretable; rather, it is sufficient if they can be properly interpreted by the inference system.
Compact and nonredundant: the dimensionality of raw audio data tends to be too high for traditional machine learning approaches. One channel of a digital audio file in Compact Disc (CD) quality (44 100 samples per second, 16 bits per sample) with a length of five minutes contains 5 · 60 · 44 100 = 13 230 000 samples. A compact feature aims at representing these data with fewer values.
Easy to analyze: Although all extractable information is contained in the raw audio data, observing meaningful information directly from the time‐domain audio data is complicated. For example, a spectrogram contains a similar amount of data as a time‐domain signal, but it is often easier to parse due to its more sparse representation of the content (a minimal spectrogram computation is sketched below, after Figure 2.1).
Figure 2.1 General processing stages of a system for audio content analysis.
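For reference, the spectrogram mentioned in the last property above can be computed, for example, with SciPy's STFT; the parameter choices in this sketch are illustrative:

```python
import numpy as np
from scipy.signal import stft

def spectrogram_db(x, f_s, block_size=2048, hop_size=512):
    """Magnitude spectrogram in dB: an input representation that exposes the
    time-frequency content far more readably than the raw waveform."""
    f, t, X = stft(x, fs=f_s, window="hann", nperseg=block_size,
                   noverlap=block_size - hop_size)
    return f, t, 20 * np.log10(np.abs(X) + 1e-12)
```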
While these qualities continue to be desirable for modern approaches based on neural networks, the features are not explicitly designed but rather learned from data. The input of the machine learning system is, thus, not a custom‐designed representation but the more or less unprocessed raw data. Some argue that end‐to‐end systems without input processing are generally preferable [1]; however, this statement is debatable [2