An Introduction to Audio Content Analysis

Alexander Lerch

Description

An Introduction to Audio Content Analysis enables readers to understand the algorithmic analysis of musical audio signals with AI-driven approaches.

An Introduction to Audio Content Analysis serves as a comprehensive guide on audio content analysis, explaining how signal processing and machine learning approaches can be utilized to extract musical content from audio. It gives readers the algorithmic understanding to teach a computer to interpret music signals and thus allows for the design of tools for interacting with music. The work ties together topics from audio signal processing and machine learning, showing how to use audio content analysis to pick up musical characteristics automatically. A multitude of audio content analysis tasks related to the extraction of tonal, temporal, timbral, and intensity-related characteristics of the music signal are presented. Each task is introduced from both a musical and a technical perspective, detailing the algorithmic approach as well as providing practical guidance on implementation details and evaluation. To aid reader comprehension, each task description begins with a short introduction to the most important musical and perceptual characteristics of the covered topic, followed by a detailed algorithmic model and its evaluation, and concludes with questions and exercises. For the interested reader, updated supplemental materials are provided via an accompanying website.

Written by a well-known expert in the music industry, sample topics covered in An Introduction to Audio Content Analysis include:

* Digital audio signals and their representation, common time-frequency transforms, and audio features
* Pitch and fundamental frequency detection, key detection, and chord recognition
* Representation of dynamics in music and intensity-related features
* Onset and tempo detection, beat histograms, detection of structure in music, and sequence alignment
* Audio fingerprinting and musical genre, mood, and instrument classification

An invaluable guide for newcomers to audio signal processing and industry experts alike, An Introduction to Audio Content Analysis covers a wide range of introductory topics pertaining to music information retrieval and machine listening, allowing students and researchers to quickly gain core holistic knowledge in audio analysis and dig deeper into specific aspects of the field with the help of a large number of references.




Table of Contents

Cover

Title Page

Copyright

Dedication

Author Biography

Preface

Acronyms

List of Symbols

Source Code Repositories

1 Introduction

1.1 A Short History of Audio Content Analysis

1.2 Applications and Use Cases

References

Part I: Fundamentals of Audio Content Analysis

2 Analysis of Audio Signals

2.1 Audio Content

2.2 Audio Content Analysis Process

2.3 Exercises

References

Notes

3 Input Representation

3.1 Audio Signals

3.2 Audio Preprocessing

3.3 Time‐Frequency Representations

3.4 Other Input Representations

3.5 Instantaneous Features

3.6 Learned Features

3.7 Feature PostProcessing

3.8 Exercises

References

Notes

4 Inference

4.1 Classification

4.2 Regression

4.3 Clustering

4.4 Distance and Similarity

4.5 Underfitting and Overfitting

4.6 Exercises

References

Note

5 Data

5.1 Data Split

5.2 Training Data Augmentation

5.3 Utilization of Data From Related Tasks

5.4 Reducing Accuracy Requirements for Data Annotation

5.5 Semi‐, Self‐, and Unsupervised Learning

5.6 Exercises

References

6 Evaluation

6.1 Metrics

6.2 Exercises

References

Note

Part II: Music Transcription

7 Tonal Analysis

7.1 Human Perception of Pitch

7.2 Representation of Pitch in Music

7.3 Fundamental Frequency Detection

7.4 Tuning Frequency Estimation

7.5 Key Detection

7.6 Chord Recognition

7.7 Exercises

References

Notes

8 Intensity

8.1 Human Perception of Intensity and Loudness

8.2 Representation of Dynamics in Music

8.3 Features

8.4 Exercises

References

Note

9 Temporal Analysis

9.1 Human Perception of Temporal Events

9.2 Representation of Temporal Events in Music

9.3 Onset Detection

9.4 Beat Histogram

9.5 Detection of Tempo and Beat Phase

9.6 Detection of Meter and Downbeat

9.7 Structure Detection

9.8 Automatic Drum Transcription

9.9 Exercises

References

Notes

10 Alignment

10.1 Dynamic Time Warping

10.2 Audio‐to‐Audio Alignment

10.3 Audio‐to‐Score Alignment

10.4 Evaluation

10.5 Exercises

References

Notes

Part III: Music Identification, Classification, and Assessment

11 Audio Fingerprinting

11.1 Fingerprint Extraction

11.2 Fingerprint Matching

11.3 Fingerprinting System: Example

11.4 Evaluation

References

12 Music Similarity Detection and Music Genre Classification

12.1 Music Similarity Detection

12.2 Musical Genre Classification

References

Notes

13 Mood Recognition

13.1 Approaches to Mood Recognition

13.2 Evaluation

References

14 Musical Instrument Recognition

14.1 Evaluation

References

15 Music Performance Assessment

15.1 Music Performance

15.2 Music Performance Analysis

15.3 Approaches to Music Performance Assessment

References

Part IV: Appendices

Appendix A: Fundamentals

A.1 Sampling and Quantization

A.2 Convolution

A.3 Correlation Function

References

Notes

Appendix B: Fourier Transform

B.1 Properties of the Fourier Transformation

B.2 Spectrum of Example Time Domain Signals

B.3 Transformation of Sampled Time Signals

B.4 Short Time Fourier Transform of Continuous Signals

B.5 Discrete Fourier Transform

B.6 Frequency Reassignment: Instantaneous Frequency

References

Notes

Appendix C: Principal Component Analysis

C.1 Computation of the Transformation Matrix

C.2 Interpretation of the Transformation Matrix

Appendix D: Linear Regression

Appendix E: Software for Audio Analysis

E.1 Frameworks and Libraries

E.2 Data Annotation and Visualization

References

Notes

Appendix F: Datasets

References

Index

End User License Agreement

List of Tables

Chapter 3

Table 3.1 Properties of three popular MFCC implementations, Davis and Merme...

Chapter 6

Table 6.1 Confusion matrix.

Chapter 7

Table 7.1 Names and distance in semitones of diatonic pitch classes.

Table 7.2 Names of musical intervals, their enharmonic equivalents, and the...

Table 7.3 Deviations of the Pythagorean, meantone, and two diatonic tempera...

Table 7.4 Frequency resolution of the STFT for different block lengths at a...

Table 7.5 Typical range of deviation of the tuning frequency from 440 Hz ov...

Table 7.6 Deviation (in Cent) of seven harmonics from the nearest equal‐tem...

Table 7.7 Pitch class order in the original and the rearranged pitch chroma...

Table 7.8 Various key profile templates, normalized to a vector length of 1...

Chapter 9

Table 9.1 Hypothetical example for annotator disagreements on musical struc...

Chapter 11

Table 11.1 Main properties of fingerprinting and watermarking in comparison...

Chapter 12

Table 12.1 Confusion matrix for an example evaluation of a speech/music cla...

Chapter 13

Table 13.1 Mood clusters as presented by Schubert.

Table 13.2 Mood clusters derived from metadata and used in MIREX.

Appendix B

Table B.1 Frequency domain properties of the most common windows.

Appendix F

Table F.1 List of datasets for audio content analysis.

List of Illustrations

Chapter 2

Figure 2.1 General processing stages of a system for audio content analysis....

Chapter 3

Figure 3.1 Snippet of a periodic audio signal with indication of its fundame...

Figure 3.2 Approximation of periodic signals: sawtooth (top) and square wave...

Figure 3.3 Probability density function of a square wave (a), a sinusoidal (...

Figure 3.4 Distribution function estimated from a music signal compared to a...

Figure 3.5 RFD (b) of a series of feature values (a) with its arithmetic mea...

Figure 3.6 RFD (b) of a series of feature values (a) with the standard devia...

Figure 3.7 Two probability distributions, Gaussian (a) and Chi‐squared (b), ...

Figure 3.8 Schematic visualization of block‐based processing: the input sign...

Figure 3.9 Example visualization of blocking: the input signal (top) is spli...

Figure 3.10 Short‐Time Fourier Transform: time domain block (a), magnitude s...

Figure 3.11 Time‐domain waveform (a) and spectrogram visualization (b); each...

Figure 3.12 Two approaches of implementing blocking for the CQT: compute mul...

Figure 3.13 Time‐domain waveform (a) and corresponding Log‐Mel Spectrogram (...

Figure 3.14 Normalized impulse response of a gammatone filter with a center ...

Figure 3.15 Frequency response of a resonance filterbank spanning four octav...

Figure 3.16 Waveform of excerpts from a speech recording (a), a string quart...

Figure 3.17 Visualization of the feature extraction process.

Figure 3.18 Spectrogram (a), waveform (b, background), and spectral centroid ...

Figure 3.19 Spectrogram (a), waveform (b, background), and spectral spread (...

Figure 3.20 Spectrogram (a), waveform (b, background), and spectral skewness ...

Figure 3.21 Spectrogram (a), waveform (b, background), and spectral kurtosis ...

Figure 3.22 Spectrogram (a), waveform (b, background), and spectral rolloff ...

Figure 3.23 Spectrogram (a), waveform (b, background), and spectral decrease ...

Figure 3.24 Spectrogram (a), waveform (b, background), and spectral slope (b...

Figure 3.25 Warped cosine‐shaped transformation basis functions for the comp...

Figure 3.26 Spectrogram (a) and Mel frequency cepstral coefficients 1–4 (b) ...

Figure 3.27 Magnitude transfer function of the filterbank for MFCC computati...

Figure 3.28 Spectrogram (a), waveform (b, background), and spectral flux (b,...

Figure 3.29 Spectrogram (a), waveform (b, background), and spectral crestfac...

Figure 3.30 Spectrogram (a), waveform (b, background), and spectral flatness ...

Figure 3.31 Spectrogram (a), waveform (b, background), and spectral tonalpow ...

Figure 3.32 Spectrogram (a), waveform (b, background), and feature maxacf (b...

Figure 3.33 Spectrogram (a), waveform (b, background), and feature zerocross ...

Figure 3.34 Example aggregation with a texture window length of with the a...

Figure 3.35 Feature aggregation with texture windows: example audio input (t...

Figure 3.36 Accuracy over number of features selected by Sequential Forward ...

Chapter 4

Figure 4.1 Example for classifying the blocks of a drum loop into the classe...

Figure 4.2 A music/speech dataset visualized in a two‐dimensional feature sp...

Figure 4.3 Nearest neighbor classification of a query data point (center) fo...

Figure 4.4 The feature space from Figure 4.2 and a corresponding Gaussian Mi...

Figure 4.5 Linear Regression of two feature/target pairs, (a) RMS and peak e...

Figure 4.6 Example illustrating the importance of the similarity definition ...

Figure 4.7 Three iterations of K‐means clustering. The colors indicate the c...

Figure 4.8 Example of visualization of an overfitted model exactly matching ...

Chapter 5

Figure 5.1 Typical percentages for splitting a dataset into train, validatio...

Figure 5.2 Visualization of the data splits for training and testing in the ...

Figure 5.3 Schematic visualization of training data augmentation.

Chapter 6

Figure 6.1 Multiple hypothetical ROCs and their corresponding AUCs.

Chapter 7

Figure 7.1 Prototype visualization of harmonics as integer multiples of the ...

Figure 7.2 Different models for the nonlinear mapping of frequency to Mel (a...

Figure 7.3 Helix visualizing the two facets of pitch perception: pitch heigh...

Figure 7.4 Chromatic pitches in musical score notation with pitch class in...

Figure 7.5 One octave on a piano keyboard with annotated pitch class names....

Figure 7.6 Musical intervals in musical score notation.

Figure 7.7 Six harmonics of the pitch in musical score notation.

Figure 7.8 Detection error in Cent resulting from quantization of the period...

Figure 7.9 Detection error in Cent resulting from quantization of the fundam...

Figure 7.10 Very short excerpt of an original and a signal interpolated for ...

Figure 7.11 Magnitude spectrum and zeropadded magnitude spectrum (a) and mag...

Figure 7.12 F0 estimation via the zero‐crossing distance. Time series (a) an...

Figure 7.13 F0 estimation via the (center‐clipped) Autocorrelation Function....

Figure 7.14 Nonlinear preprocessing for ACF‐based pitch period estimation: s...

Figure 7.15 F0 estimation via the Average Magnitude Difference Function (lef...

Figure 7.16 Original power spectrum (top) and compressed spectra (mid) for t...

Figure 7.17 F0 estimation via the Harmonic Product Spectrum: compressed spec...

Figure 7.18 F0 estimation via the Autocorrelation Function of the magnitude ...

Figure 7.19 F0 estimation via the Cepstrum: input magnitude spectrum (a) and...

Figure 7.20 F0 estimation templates: template functions with different pitch...

Figure 7.21 Time domain (left) and frequency domain (right) visualization of...

Figure 7.22 Visualization of Eq. (7.37) with example dimensions.

Figure 7.23 Example for the utilization of NMF for detecting the individual ...

Figure 7.24 Distribution of tuning frequencies.

Figure 7.25 Adaptation of the tuning frequency estimate from an initial sett...

Figure 7.26 Different modes in musical score notation starting at the tonic

Figure 7.27 The twelve major scales in musical score notation, notated in th...

Figure 7.28 Circle of fifths for both major keys and minor keys, plus the nu...

Figure 7.29 Mask function for pitch class for pitch chroma computation (oc...

Figure 7.30 Magnitude spectrogram (a) and the pitch chromagram (b) of a mono...

Figure 7.31 Pitch chroma of a hypothetical pitch with 10 harmonics.

Figure 7.32 Key profile vectors (top left) and the resulting interkey distan...

Figure 7.33 Common chords in musical score notation on a root note of C.

Figure 7.34 The two inversions of a D Major triad in musical score notation....

Figure 7.35 Flowchart of a simple chord detection system.

Figure 7.36 Simple chord template matrix for chord estimation based on pitch...

Figure 7.37 Chord emission probability matrix (a), transition probability ma...

Chapter 8

Figure 8.1 Level error introduced by adding a small constant to the argume...

Figure 8.2 Spectrogram (top), waveform (bottom background), and RMS output (...

Figure 8.3 Flowchart of the frequency‐weighted RMS calculation.

Figure 8.4 (a) Frequency weighting transfer functions applied before RMS mea...

Figure 8.5 Flowchart of a Peak program meter.

Figure 8.6 Spectrogram (top), waveform (bottom background), and PPM output c...

Figure 8.7 Flowchart of Zwicker's model for loudness computation.

Chapter 9

Figure 9.1 Visualization of an envelope, attack time, and possible location ...

Figure 9.2 Different hierarchical levels related to tempo, beat, and meter....

Figure 9.3 Frequently used time signatures.

Figure 9.4 Note values (a) and corresponding rest values (b) with decreasing...

Figure 9.5 General flowchart of an onset detection system.

Figure 9.6 Audio signal and extracted envelope (a) and novelty function with...

Figure 9.7 Beat histogram of a string quartet performance (a) and of a piece...

Figure 9.8 Visualization of Onset, Beat, and Downbeat times for a drum loop ...

Figure 9.9 Flow chart of a typical (real‐time) beat‐tracking system.

Figure 9.10 Example structure of a popular song (here: Pink's So What).

Figure 9.11 Self‐Similarity Matrix of Michael Jackson's Bad.

Figure 9.12 Self‐Similarity Matrices of the same song based on different fea...

Figure 9.13 Checker Board Filter Kernel with high‐pass characteristics to de...

Figure 9.14 Self‐Similarity Matrix (left) and extracted Novelty Function....

Figure 9.15 Self‐Similarity Matrix (left) and Low‐pass filtered Diagonal as ...

Figure 9.16 Rotated Self Similarity Matrix (a) and accompanying Ground truth...

Figure 9.17 Two baseline drum transcription systems by extending a standard ...

Chapter 10

Figure 10.1 Visualization of the mapping of two similar sequences to each ot...

Figure 10.2 Distance matrix and alignment path for two example sequences; da...

Figure 10.3 Distance matrix (a and b in two different visualizations) and co...

Figure 10.4 Path restrictions for performance optimizations of DTW: original...

Figure 10.5 Comparison of different features for computing the distance matr...

Chapter 11

Figure 11.1 General framework for audio fingerprinting. The upper part visua...

Figure 11.2 Flowchart of the extraction process of subfingerprints in the Ph...

Figure 11.3 Two fingerprints and their difference. (a) Original, (b) mp3 enc...

Chapter 12

Figure 12.1 Two examples for author‐defined genre taxonomies. (a) Tzanetakis...

Figure 12.2 Scatterplot of the feature space for the 10 music classes of the...

Chapter 13

Figure 13.1 Russell's two‐dimensional model of mood.

Figure 13.2 Scatterplot of the feature space for the valence/energy ratings ...

Chapter 15

Figure 15.1 Chain of musical communication.

Appendix A

Figure A.1 Continuous audio signal (a) and corresponding sample values (b) a...

Figure A.2 Continuous (top) and sampled (below) sinusoidal signals with the ...

Figure A.3 Unquantized input signal (a) and quantized signal at a word lengt...

Figure A.4 Characteristic line of a quantizer with word length showing the...

Figure A.5 Magnitude frequency response of a moving average low‐pass filter ...

Figure A.6 Example of zero phase filtering: input signal (a), low‐pass filte...

Figure A.7 ACF of a sinusoid (a) and white noise (b).

Appendix B

Figure B.1 Schematic visualization of the spectrum of a continuous time doma...

Figure B.2 Windows in time domain (left) and frequency domain (right).

Figure B.3 Phasor representation and visualization of the phase difference w...

Figure B.4 Magnitude spectrum of a signal composed of three sinusoidals with...

Appendix C

Figure C.1 Scatter plot of a two‐dimensional data set with variables , and ...

Figure C.2 Five input features (a) and the two resulting principal component...



IEEE Press
445 Hoes Lane, Piscataway, NJ 08854

IEEE Press Editorial Board
Sarah Spurgeon, Editor in Chief

Jón Atli Benediktsson
Andreas Molisch
Diomidis Spinellis
Anjan Bose
Saeid Nahavandi
Ahmet Murat Tekalp
Adam Drobot
Jeffrey Reed
Peter (Yong) Lian
Thomas Robertazzi

An Introduction to Audio Content Analysis

Music Information Retrieval Tasks & Applications

 

Second Edition

 

Alexander Lerch
Georgia Institute of Technology
Atlanta, USA

 

 

 

 

Copyright © 2023 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

Edition History: Wiley (1e, 2012)

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per‐copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750‐8400, fax (978) 750‐4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748‐6011, fax (201) 748‐6008, or online at http://www.wiley.com/go/permission.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762‐2974, outside the United States at (317) 572‐3993 or fax (317) 572‐4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging‐in‐Publication Data

Names: Lerch, Alexander, author. | John Wiley & Sons, publisher.
Title: An introduction to audio content analysis : music information retrieval tasks & applications / Alexander Lerch.
Description: Second edition. | Hoboken, New Jersey : Wiley-IEEE Press, [2023] | Includes bibliographical references and index.
Identifiers: LCCN 2022046787 (print) | LCCN 2022046788 (ebook) | ISBN 9781119890942 (cloth) | ISBN 9781119890966 (adobe pdf) | ISBN 9781119890973 (epub)
Subjects: LCSH: Computer sound processing. | Computational auditory scene analysis. | Content analysis (Communication)--Data processing.
Classification: LCC TK7881.4 .L486 2023 (print) | LCC TK7881.4 (ebook) | DDC 006.4/5-dc23/eng/20221018
LC record available at https://lccn.loc.gov/2022046787
LC ebook record available at https://lccn.loc.gov/2022046788

Cover image: © Weiquan Lin/Getty Images
Cover design by Wiley

 

 

 

For Mila

Author Biography

Alexander Lerch is Associate Professor at the School of Music, Georgia Institute of Technology. He studied Electrical Engineering at the Berlin Institute of Technology and Tonmeister (music production) at the University of Arts, Berlin, and received his PhD (Audio Communications) from the Berlin Institute of Technology in 2008. He currently leads the Georgia Tech Music Informatics Group with a research focus on the design and implementation of machine learning and signal processing methods targeting audio and music in the areas of audio content analysis, music information retrieval, machine listening, and music meta‐creation. Lerch has served as co‐chair of ISMIR 2021, the 22nd International Society for Music Information Retrieval Conference.

Before joining Georgia Tech, he co‐founded the company zplane.development, an industry‐leading provider of advanced music technology. zplane technologies are nowadays used by millions of musicians and producers worldwide in a wide variety of products.

Preface

The fields of Audio Content Analysis (ACA) and Music Information Retrieval (MIR) have seen rapid growth over the past decade, as indicated by a rising number of publications, growing conference attendance, and an increasing number of commercial applications in the field. This growth is driven by the need to intelligently browse, retrieve, and process the large amount of music data at our fingertips and is also reflected by the growing interest of students, engineers, and software developers in learning about methods for the analysis of audio signals.

Inspired by the feedback received from readers of the first edition of this book, my goals for the second edition were threefold: (i) remain focused on the introduction of baseline systems rather than state‐of‐the‐art approaches, (ii) widen the scope of tasks introduced, and (iii) provide additional materials enhancing the learning experience. While the baseline systems presented in this book cannot be considered state‐of‐the‐art, familiarity with them is, in my opinion, crucial for an introduction to and understanding of the field. Not only does this allow us to learn about the wide variety of tasks that mirror the variety of content we handle in the analysis of musical audio, but by looking at each of these tasks, we gain insights into the task‐specific challenges and problems, thus gaining an understanding of inherent principles and challenges as well as parameters of interest. In addition, baseline systems allow for hands‐on exercises and provide an important reference data point clarifying the expected performance of an analysis system.

The second edition of this book comes with noticeable changes from the first edition. Some parts have been shortened by removing unnecessary detail, while others have been extended to provide more useful detail. In addition, evaluation methodologies are summarized for most tasks; while evaluation often seems like a chore and an uninteresting part of designing a system, it is crucial: without evaluation, no one knows how well a system works. Most chapters now also conclude with a set of questions and assignments assessing the learning outcomes. As this text intends to provide a starting point and guidance for the design of novel musical audio content analysis systems, every chapter contains many references to relevant work. Last but not least, several new topics are introduced, and the whole text has been restructured into three main parts plus an appendix.

The first part covers fundamentals shared by many audio analysis systems. It starts with a closer look at musical audio content and a general flowchart of an audio content analysis system. After a quick coverage of signals and general preprocessing concepts, typical time–frequency representations are introduced, which nowadays serve as the input representations of many analysis systems. A closer look at the most commonly used instantaneous or low‐level features is then followed by an introduction to feature postprocessing methods covering normalization, aggregation, and general dimensionality reduction. The second part introduces systems that transcribe individual musical properties of the audio signal. These are mostly properties that are more or less explicitly defined in a musical score, such as pitch, dynamics, and tempo. More specifically, these approaches are grouped into the content dimensions of tonal analysis (fundamental frequency, tuning frequency, musical key, and chords), intensity (level and loudness), and temporal analysis (onsets, beats, tempo, meter, and structure). Audio analysis approaches related to classification and identification are presented in Part Three. Many of these systems do not focus on individual musical content dimensions but aim at extracting high‐level content from the audio signal. Thus, the classification of musical genre and mood, as well as the assessment of student music performances, are introduced. In addition, audio fingerprinting is discussed as arguably one of the audio analysis technologies with the most impact on consumers to date. The appendix provides a more in‐depth reference to basic concepts such as audio sampling and quantization, the Fourier transform, correlation, and convolution, as well as quick references to principal component analysis and linear regression.

Note that this edition also comes with a tighter integration with online resources such as slides and example code in both MATLAB and Python, as well as a website with resources such as a list of available datasets. The entry point for all these additional (and freely available) resources is https://www.audiocontentanalysis.org.

 

Atlanta, Georgia

Alexander Lerch

Acronyms

 

ACA

Audio Content Analysis

ACF

Autocorrelation Function

ADT

Automatic Drum Transcription

AMDF

Average Magnitude Difference Function

ANN

Artificial Neural Network

AOT

Acoustic Onset Time

AUC

Area Under Curve

BPM

Beats per Minute

CCF

Cross Correlation Function

CCIR

Comité Consultatif International des Radiocommunications

CiCF

Circular Correlation Function

CD

Compact Disc

COG

Center of Gravity

CQT

Constant‐Q Transform

DCT

Discrete Cosine Transform

DFT

Discrete Fourier Transform

DJ

Disk Jockey

DNN

Deep Neural Network

DP

Dynamic Programming

DTW

Dynamic Time Warping

EBU

European Broadcasting Union

EM

Expectation Maximization

ERB

Equivalent Rectangular Bandwidth

FFT

Fast Fourier Transform

FT

Fourier Transform

FN

False Negative

FNR

False Negative Rate

FP

False Positive

FPR

False Positive Rate

FWR

Full‐Wave Rectification

GMM

Gaussian Mixture Model

HFC

High Frequency Content

HMM

Hidden Markov Model

HPS

Harmonic Product Spectrum

HSS

Harmonic Sum Spectrum

HTK

HMM Toolkit

HWR

Half‐Wave Rectification

IBI

Inter‐Beat Interval

ICA

Independent Component Analysis

IDFT

Inverse Discrete Fourier Transform

IFT

Inverse Fourier Transform

IIR

Infinite Impulse Response

IO

Input/Output

IOI

Inter‐Onset Interval

ITU

International Telecommunication Union

JNDL

Just Noticeable Difference in Level

K‐NN

K‐Nearest Neighbor

LDA

Linear Discriminant Analysis

MA

Moving Average

MAE

Mean Absolute Error

MFCC

Mel Frequency Cepstral Coefficient

MIDI

Musical Instrument Digital Interface

MIR

Music Information Retrieval

MIREX

Music Information Retrieval Evaluation eXchange

ML

Machine Learning

MPA

Music Performance Analysis

MP3

MPEG‐1 Layer 3

MSE

Mean Squared Error

NOT

Note Onset Time

NMF

Nonnegative Matrix Factorization

PAT

Perceptual Attack Time

PCA

Principal Component Analysis

PDF

Probability Density Function

POT

Perceptual Onset Time

PPM

Peak Program Meter

PSD

Peak Structure Distance

RFD

Relative Frequency Distribution

RLB

Revised Low‐Frequency B Curve

RMS

Root Mean Square

RNN

Recurrent Neural Network

ROC

Receiver Operating Characteristic Curve

SIMD

Single Instruction Multiple Data

SNR

Signal‐to‐Noise Ratio

SOM

Self‐Organizing Map

SSM

Self Similarity Matrix

STFT

Short Time Fourier Transform

SVD

Singular Value Decomposition

SVM

Support Vector Machine

SVR

Support Vector Regression

TN

True Negative

TNR

True Negative Rate

TP

True Positive

TPR

True Positive Rate

List of Symbols

 

Amplitude

Filter Coefficient (Recursive)

Accuracy

Number of Beats

Filter Coefficient (Transversal)

Parametrization Factor or Exponent

Number of (Audio) Channels

Center Clipping Function

Cost Matrix for the Distance Matrix between Two Sequences

Overall Cost of a Path through the Cost Matrix

Cepstrum of the Signal

Distance Matrix between Two Sequences

Distance Measure

Quantization Step Size

Delta Impulse Function

Delta Pulse Function

Novelty Function

Prediction Error

Quantization Error

Equivalent Rectangular Bandwidth

(Correlation) Lag

‐Measure

Frequency in Hz

Fundamental Frequency in Hz

Sample Rate

Tuning Frequency in Hz

Number of Features

Instantaneous Frequency in Hz

(Discrete) Fourier Transform

Threshold

Chord Transformation Matrix

Central Moment of Order

of Signal

Transfer Function

Impulse Response

Hop Size

Sample Index

Impulse Response Length

Integer (Loop) Variable

Objective Function

Block Size

Frequency Bin Index

Percentage

Weighting Factor

Number of (Quantization) Steps

Slope

Pitch (Mel)

Geometric Mean of Signal

Harmonic Mean of Signal

Arithmetic Mean of Signal

Number of Observations or Blocks

Block Index

Order (e.g. Filter Order)

Block Overlap Ratio

Angular Velocity in radians per second

Number of Onsets

Precision

Alignment Path

Pitch Class Index

Phase Spectrum of the Signal

Gaussian Function

(MIDI) Pitch

Pitch Chroma Vector/Key Profile

Power of the Signal

Probability Density Function of the Signal

Chord Probability Vector

Quality Factor (Mid‐Frequency divided by Bandwidth)

Evaluation Metric

Quantile Boundary

Recall

Source Code Repositories

Repo 1: https://github.com/alexanderlerch/ACA-Plots (MATLAB code to generate plots)

Repo 2: https://github.com/alexanderlerch/ACA-Code (MATLAB scripts)

Repo 3: https://github.com/alexanderlerch/pyACA (Python scripts)

Repo 4: https://github.com/alexanderlerch/ACA-Slides (PDF and code for lecture slides)

1 Introduction

Audio is an integral and ubiquitous aspect of our daily lives; we intentionally produce sound (e.g. when communicating through speech or playing an instrument), we actively listen (e.g. to music or podcasts), we can focus on a specific sound source in a mixture of sources, and we (even unconsciously) suppress sound sources internally (e.g. traffic noise). Similar to humans, algorithms can also generate, analyze, and process audio. This book focuses on the algorithmic analysis of audio signals, more specifically the extraction of information from musical audio signals.

Audio signals contain a wealth of information: by simply listening to an audio signal, humans are able to infer a variety of content information. A speech signal, for example, obviously transports the textual information, but it might also reveal information about the speaker (gender, age, accent, mood, etc.), the recording environment (e.g., indoors vs. outdoors), and much more. A music signal might allow us to derive melodic and harmonic characteristics, understand the musical structure, identify the instruments playing, perceive the projected emotion, categorize the music genre, and assess characteristics of the performance as well as the proficiency of the performers. An audio signal can contain and transport a wide variety of content beyond these simple examples. This content information is sometimes referred to as metadata: data about (audio) data.

The field of Audio Content Analysis (ACA) aims at designing and applying algorithms for the automatic extraction of content information from the raw (digital) audio signal. This enables content‐driven and content‐adaptive services which describe, categorize, sort, retrieve, segment, process, and visualize the signal and its content.

The wide range of possible audio sources and the multi‐faceted nature of audio signals result in a variety of distinct ACA problems, leading to various areas of research, including

* speech analysis, covering topics such as automatic speech recognition [1, 2] or recognizing emotion in speech [3, 4],

* urban sound analysis, with applications in noise pollution monitoring [5] and audio surveillance, i.e. the detection of dangerous events [6],

* industrial sound analysis, such as monitoring the state of mechanical devices like engines [7] or monitoring the health of livestock [8], and, last but not least,

* musical audio analysis, targeting the understanding and extraction of musical parameters and properties from the audio signal [9].

This book focuses on the analysis of musical audio signals and the extraction of musical content from audio. There are many similarities and parallels to the areas above, but there also exist many differences that distinguish musical audio from other signals beyond simple technical properties such as audio bandwidth. Like an urban sound signal, music is a polytimbral mixture of multiple sound sources, but unlike urban sound, its sound sources are clearly related (e.g., melodically, harmonically, or rhythmically). Like a speech signal, a music signal is a sequence in a language with rules and constraints, but unlike speech, the musical language is abstract and has no singular meaning. Like an industrial audio signal, music has both tonal and noise‐like components which may repeat themselves, but unlike the industrial signal, it conveys a (musical) form based on hierarchical grouping of elements not only through repetition but also through precise variation of rhythmic, dynamic, tonal, and timbral elements.

As we will see throughout the chapters, the design of systems for the analysis of musical audio often requires knowledge and understanding from multiple disciplines. While this text approaches the topic mostly from an engineering and Digital Signal Processing (DSP) perspective, the proper formulation of research questions and task definitions often requires methods, or at least a synthesis of knowledge, from fields as diverse as music theory, music perception, and psychoacoustics. Researchers working on ACA thus come from various backgrounds such as computer science, engineering, psychology, and musicology.

The diversity in musical audio analysis is also exemplified by the wide variety of terms referring to it. Overall, musical ACA is situated in the broader area of Music Information Retrieval (MIR). MIR is a broader field that covers not only the analysis of musical audio but also symbolic (non‐audio) music formats such as musical scores and files or signals compliant with the so‐called Musical Instrument Digital Interface (MIDI) protocol [10]. MIR also covers the analysis and retrieval of information that is music‐related but cannot be (easily) extracted from the audio signal, such as artist names, user ratings, performance instructions in the score, or bibliographical information such as publisher, publishing date, or the work's title. Other areas of research, such as music source separation and automatic music generation, are often also considered to belong within MIR. Various overview articles clarify how the understanding of the field of MIR has evolved over time [11–16]. Other, related terms are also in use. Audio event detection, nowadays often related to urban sound analysis, is sometimes described as computational analysis of sound scenes [17]. The analysis of sound scenes from a perceptual point of view has been described as Computational Auditory Scene Analysis (CASA) [18]. In the past, other terms have been used more or less synonymously with the term “audio content analysis.” Examples of such synonyms are machine listening and computer audition. Finally, there is the term music informatics, which encompasses essentially any aspect of algorithmic analysis, synthesis, and processing of music (although in some circles, its meaning is restricted to describing the creation of musical artifacts with software).

1.1 A Short History of Audio Content Analysis

Historically, the first systems analyzing the content of audio signals appeared shortly after technology provided the means of storing and reproducing recordings on media in the twentieth century. One early example is Seashore's Tonoscope, which enabled the pitch analysis of an audio signal by visualizing the fundamental frequency of the incoming audio signal on a rotating drum [19]. However, the more recent evolution of digital storage media, DSP methods, and machine learning during the last decades, along with the growing amount of digital audio data available through downloads and streaming services, has significantly increased both the need for and the possibilities of automatic systems for analyzing audio content, resulting in a lively and growing research field.

Early systems for audio analysis were frequently so‐called “expert systems” [20], designed by experts who implement their task‐specific knowledge into a set of rules. Such systems can be very successful if there is a clear and simple relation between the knowledge and the implemented algorithms. A good example of such systems is given by some of the pitch‐tracking approaches introduced in Section 7.3: as the goal is the detection of periodicity in the signal within a specific range, an approach such as the Autocorrelation Function (ACF), combined with multiple assumptions and constraints, can be used to estimate this periodicity and thus the fundamental frequency.
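To make the expert‐system idea concrete, the following minimal sketch (Python with NumPy; an illustration, not the reference implementation from the book's accompanying repositories) estimates the fundamental frequency of a single frame from the lag of the strongest ACF peak, with the search restricted to a plausible pitch range as an example of a hand‐coded constraint.

```python
# Minimal sketch: ACF-based fundamental frequency estimation for one frame.
import numpy as np

def estimate_f0_acf(x, f_s, f_min=50.0, f_max=2000.0):
    """Return a rough F0 estimate in Hz for the frame x sampled at f_s Hz."""
    x = x - np.mean(x)                                    # remove DC offset
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]    # non-negative lags only
    # restrict the search to plausible pitch periods (hand-coded "expert" constraint)
    lag_min = int(f_s / f_max)
    lag_max = min(int(f_s / f_min), len(acf) - 1)
    lag = lag_min + np.argmax(acf[lag_min:lag_max + 1])   # lag of the strongest peak
    return f_s / lag

# usage: a 440 Hz test tone should yield an estimate close to 440 Hz
f_s = 44100
t = np.arange(2048) / f_s
print(estimate_f0_acf(np.sin(2 * np.pi * 440 * t), f_s))
```

In practice, nonlinear preprocessing such as center clipping, as discussed in Chapter 7, would typically be added to make such a rule‐based estimator more robust.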

Later, data‐driven systems became increasingly popular, and traditional machine learning approaches started to show superior performance on many tasks. These systems extract so‐called features from the audio to achieve a task‐dependent representation of the signal. Then, training data are used to build a model of the feature space and how it maps to the inferred outcome. The role of the expert becomes less influential in the design of these systems, as it is restricted to selecting or designing a fitting set of features, curating a representative set of data, and choosing and parametrizing the machine learning approach. One prototypical example of these approaches is musical genre classification as introduced in Section 12.2.
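As a hedged illustration of such a data‐driven pipeline, the sketch below trains a simple K‐Nearest Neighbor classifier on a feature matrix; the feature values and genre labels are random placeholders standing in for, e.g., aggregated instantaneous features extracted from labeled audio files (scikit‐learn is assumed to be available).

```python
# Minimal sketch of a traditional feature-based classifier (placeholder data only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))     # placeholder: e.g., per-file aggregated MFCCs
y = rng.integers(0, 4, size=200)   # placeholder: four hypothetical genre labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)                       # feature normalization
clf = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_train), y_train)
print("accuracy:", clf.score(scaler.transform(X_test), y_test))
```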

Modern machine learning approaches include trainable feature extraction (also referred to as feature learning, see Section 3.6) as part of deep neural networks. These approaches have consistently shown superior performance for nearly all ACA tasks. The researcher seldom imparts much domain knowledge beyond choosing the input representation, the data, and the system architecture. An example of such an end‐to‐end system could be a music genre classification system based on a neural network with a convolutional architecture and a Mel Spectrogram input.
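The following sketch outlines what such an end‐to‐end model might look like; the small convolutional network and its dimensions are purely illustrative assumptions (PyTorch is assumed to be available), and the input is a dummy batch standing in for log‐Mel spectrograms computed from audio.

```python
# Minimal sketch of an end-to-end classifier operating on log-Mel spectrogram input.
import torch
import torch.nn as nn

class TinyGenreCNN(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling over frequency and time
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):                     # x: (batch, 1, n_mels, n_frames)
        return self.classifier(self.features(x).flatten(1))

# usage with a dummy batch (128 Mel bands, 256 frames) standing in for real input
model = TinyGenreCNN()
logits = model(torch.randn(8, 1, 128, 256))
print(logits.shape)  # torch.Size([8, 4])
```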

It should be pointed out that while modern systems tend to have superior performance, they also tend to be less interpretable and explainable than traditional systems. For example, deducing the reason for a false classification result in a network‐based system can be difficult, while the reason is usually easily identifiable in a rule‐based system.

1.2 Applications and Use Cases

The content extracted from music signals improves or enables various forms of content‐based and content‐adaptive services, which allow users to sort, categorize, find, segment, process, and visualize the audio signal based on its content.

1.2.1 Music Browsing and Music Discovery

One of the most intuitive applications is the content‐based search for music in large databases for which consistent manual annotation by humans is often infeasible. This allows, for example, Disk Jockeys (DJs) to retrieve songs in a specific tempo or key, to find music pieces in a specific genre or mood, or to search for specific musical characteristics by example. The same information can be used in end‐consumer applications such as audio‐based music recommendation and playlist generation systems using an in‐depth understanding of the musical content [21].

The content‐based annotation and representation of the audio signal can also be used to design new interfaces to access data. Fingerprinting (see Chapter 11), for example, allows a system to identify the song currently playing by matching it against a large database of songs, while Query‐by‐Humming systems identify songs through a user‐hummed melody. Content can also be utilized for new ways of sound visualization and user interaction; a database of music could, for example, be explored by a user navigating a virtual music similarity space [22].

1.2.2 Music Consumption

Audio analysis has already started to transform consumer‐facing industries such as streaming services as mentioned above. So far, most services focus on recommending music to be played back, but in the near future, we might see the rise of creative music listening applications that enable the listener to interact with the content itself instead of being restricted to only choosing content to be played. This could include, for example, the gain adjustment for individual voices, replacing instruments or vocalists, or interactively changing the musical arrangement or even stylistic properties of the music piece.

1.2.3 Music Production

Knowledge of the audio content can improve music production tools in various dimensions. On the one hand, content information can enable a more “musical” software interface, e.g. by displaying score‐like information synchronized with the audio data, and thus enabling an intuitive approach to editing the audio data. On the other hand, production software usage could be enhanced in terms of productivity and efficiency: the better a system understands the details and properties of incoming audio streams or files, the better it can adapt, for instance, by applying default gain and equalization parameters [23] or by suggesting compatible audio from a sound library. Intelligent music software might also support editors by offering automatic artifact‐free splicing of multiple recordings from one session or selecting error‐free recordings from a set of recordings.

Modern tools also enhance the creative possibilities in the production process. For example, creating harmonically meaningful background choirs by analyzing the lead vocals and the harmony track is already technically feasible and commercially available today. Knowing and isolating sound sources in a recording could enable new ways of modifying or morphing different sounds for the creation of new soundscapes, effects, and auditory scenes.

1.2.4 Music Education

The potential of utilizing technology to assist music (instrument) education has been recognized as early as the 1930s, when Seashore pointed out the educational value of scientific observation of music performances [24]. The availability of automatic and fast audio analysis can support the creation of artificially intelligent music tutoring software, which aims at supplementing teachers by providing students with insights and interactive feedback by analyzing and assessing the audio of practice sessions. An interactive music tutor with in‐depth understanding of the musical score and performance content can highlight problematic parts of the students' performance, provide a concise yet easily understandable analysis, give objective and specific feedback on how to improve, and individualize the curriculum depending on the students' mistakes and general progress. At the same time, music tutoring software can provide an accessible, objective, and reproducible analysis.

1.2.5 Generative Music

Machine‐interpretable content information can also feed generative algorithms. The automatic composition and rendition of music is emerging as a challenging yet popular research direction [25], gaining interest from both research institutions and industry. While bigger questions concerning the capabilities and restrictions of computational creativity as well as the aesthetic evaluation of algorithmically generated music remain largely unanswered, practical applications such as generating background music for user videos and commercial advertisements are currently in the focus of many researchers. The interactive and adaptive generation of soundtracks for video games as well as the individualized generation of license‐free music content for streaming are additional long‐term goals of considerable commercial interest.

References

1 Xuedong Huang, James Baker, and Raj Reddy. A Historical Perspective of Speech Recognition. Communications of the ACM, 57(1):94–103, January 2014. ISSN 0001-0782. doi: 10.1145/2500887. URL https://doi.org/10.1145/2500887.

2 Dong Yu and Li Deng. Automatic Speech Recognition - A Deep Learning Approach. Signals and Communication Technology. Springer, London, 2015. ISBN 978-1-4471-5779-3. URL https://doi.org/10.1007/978-1-4471-5779-3.

3 Frank Dellaert, Thomas Polzin, and Alex Waibel. Recognizing Emotion in Speech. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), volume 3, pages 1970–1973, Philadelphia, PA, October 1996. doi: 10.1109/ICSLP.1996.608022.

4 Moataz El Ayadi, Mohamed S Kamel, and Fakhri Karray. Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases. Pattern Recognition, 44(3):572–587, March 2011. ISSN 0031-3203. doi: 10.1016/j.patcog.2010.09.020. URL https://www.sciencedirect.com/science/article/pii/S0031320310004619.

5 Juan Pablo Bello, Charlie Mydlarz, and Justin Salamon. Sound Analysis in Smart Cities. In Computational Analysis of Sound Scenes and Events, pages 373–397. Springer, Cham, 2018. ISBN 978-3-319-63449-4. URL https://link.springer.com/chapter/10.1007/978-3-319-63450-0_13.

6 Marco Crocco, Marco Cristani, Andrea Trucco, and Vittorio Murino. Audio Surveillance: A Systematic Review. ACM Computing Surveys, 48(4):52:1–52:46, February 2016. ISSN 0360-0300. doi: 10.1145/2871183. URL https://doi.org/10.1145/2871183.

7 Sascha Grollmisch, Jakob Abeßer, Judith Liebetrau, and Hanna Lukashevich. Sounding Industry: Challenges and Datasets for Industrial Sound Analysis. In Proceedings of the European Signal Processing Conference (EUSIPCO), pages 1–5, A Coruna, Spain, September 2019. doi: 10.23919/EUSIPCO.2019.8902941. ISSN 2076-1465.

8 Dries Berckmans, Martijn Hemeryck, Daniel Berckmans, Erik Vranken, and Toon van Waterschoot. Animal Sound… Talks! Real-time Sound Analysis for Health Monitoring in Livestock. In Proceedings of the International Symposium on Animal Environment and Welfare (ISAEW), Chongqing, China, 2015.

9 Daniel P W Ellis. Extracting Information from Music Audio. Communications of the ACM, 49(8):32–37, August 2006. ISSN 0001-0782. doi: 10.1145/1145287.1145310. URL http://portal.acm.org/citation.cfm?doid=1145287.1145310.

10 MIDI Manufacturers Association. Complete MIDI 1.0 Detailed Specification V96.1, 2nd edition. Standard, MMA, 2001.

11 J Stephen Downie. Music Information Retrieval. Annual Review of Information Science and Technology, 37:295–340, 2003.

12 Nicola Orio. Music Retrieval: A Tutorial and Review. Foundations and Trends in Information Retrieval, 1(1):1–90, 2006.

13 Michael A Casey, Remco Veltkamp, Masataka Goto, Marc Leman, Christophe Rhodes, and Malcolm Slaney. Content-based Music Information Retrieval: Current Directions and Future Challenges. Proceedings of the IEEE, 96(4):668–696, April 2008. ISSN 1558-2256. doi: 10.1109/JPROC.2008.916370.

14 Markus Schedl, Emilia Gómez, and Julián Urbano. Music Information Retrieval: Recent Developments and Applications. Foundations and Trends in Information Retrieval, 8(2-3):127–261, September 2014. ISSN 1554-0669. doi: 10.1561/1500000042. URL http://www.nowpublishers.com/article/Details/INR-042.

15 Alexander Lerch. Music Information Retrieval. In Stefan Weinzierl, editor, Akustische Grundlagen der Musik, number 5 in Handbuch der Systematischen Musikwissenschaft, pages 79–102. Laaber, 2014. ISBN 978-3-89007-699-7.

16 John Ashley Burgoyne, Ichiro Fujinaga, and J Stephen Downie. Music Information Retrieval. In Susan Schreibman, Ray Siemens, and John Unsworth, editors, A New Companion to Digital Humanities, pages 213–228. John Wiley & Sons, Ltd, 2015. ISBN 978-1-118-68060-5. doi: 10.1002/9781118680605.ch15. URL http://onlinelibrary.wiley.com/doi/10.1002/9781118680605.ch15/summary.

17 Tuomas Virtanen, Mark D Plumbley, and Dan Ellis, editors. Computational Analysis of Sound Scenes and Events. Springer International Publishing, Cham, 2018. ISBN 978-3-319-63449-4. doi: 10.1007/978-3-319-63450-0. URL http://link.springer.com/10.1007/978-3-319-63450-0.

18 DeLiang Wang and Guy J Brown. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley-IEEE Press, 2006. ISBN 978-0-470-04338-7. URL https://ieeexplore.ieee.org/book/5769523.

19 Carl E Seashore. A Voice Tonoscope. Studies in Psychology, 3:18–28, 1902.

20 Peter J F Lucas and Linda C van der Gaag. Principles of Expert Systems. Addison-Wesley, 1991.

21 Peter Knees, Markus Schedl, and Masataka Goto. Intelligent User Interfaces for Music Discovery: The Past 20 Years and What's to Come. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 44–53, Delft, Netherlands, 2019. URL http://archives.ismir.net/ismir2019/paper/000003.pdf.

22 Masahiro Hamasaki, Masataka Goto, and Tomoyasu Nakano. Songrium: Browsing and Listening Environment for Music Content Creation Community. In Proceedings of the Sound and Music Computing Conference (SMC), page 8, Maynooth, Ireland, 2015.

23 Joshua D Reiss and Øyvind Brandtsegg. Applications of Cross-Adaptive Audio Effects: Automatic Mixing, Live Performance and Everything in Between. Frontiers in Digital Humanities, 5, 2018. ISSN 2297-2668. doi: 10.3389/fdigh.2018.00017. URL https://www.frontiersin.org/articles/10.3389/fdigh.2018.00017/full.

24 Carl E Seashore. Psychology of Music. McGraw-Hill, New York, 1938.

25 Jean-Pierre Briot, Gaëtan Hadjeres, and François-David Pachet. Deep Learning Techniques for Music Generation. Computational Synthesis and Creative Systems. Springer International Publishing, Cham, 2020. ISBN 978-3-319-70162-2. doi: 10.1007/978-3-319-70163-9. URL http://link.springer.com/10.1007/978-3-319-70163-9.

Part I: Fundamentals of Audio Content Analysis

 

2 Analysis of Audio Signals

Taking a closer look at Audio Content Analysis (ACA), this chapter introduces the musical content to be extracted from the audio signal and then summarizes the basic processing steps of an ACA system.

2.1 Audio Content

The introduction already provided some examples for content that can be extracted from an audio signal. For a recording of Western music, content originates from three different sources:

Composition

: The term composition will be used broadly as a definition and conceptualization of musical ideas.

1

It can refer to any form of musical notation or description. This includes traditional ways from the

basso continuo

(a historic way of defining the harmonic structure) to the classic Western score notation as well as the modern lead sheet and other forms of notation used for contemporary and popular music.

The content related to the composition allows us to recognize different renderings of the same song or symphony as being the same piece. In most genres of Western music, this can encompass musical elements such as melody, harmony, instrumentation, structure and form, and rhythm.

Performance: Music generally requires a performer or group of performers to create a unique acoustical rendition of the composition. The performance not only communicates the explicit information from the composition but also interprets and modifies it.

This happens through performance-related content, which includes, for example, the tempo and its variation as well as micro-timing, the realization of musical dynamics, accents, instantaneous dynamic modulations such as tremolo, the use of expressive intonation and vibrato, and specific playing techniques (e.g. bowing) that influence the sound quality.

Production: As the input of an ACA system is an audio recording, the choices made during recording and production impact certain characteristics of the recording.

Production-related content mostly affects the sound quality of the recording (through microphone positioning, equalization, and the application of effects such as reverb) and its dynamics (through manual or automatic gain adjustments). However, changes in timing and pitch may also be introduced when the recording is edited or software for pitch correction is applied.

Note that these content sources can be hard to separate in the final audio recording. For example, the timbre of a recording can be determined by the instrumentation indicated by the score, by the performers' choice of instruments (e.g. historical instruments, specific guitar amps), by specific playing techniques, and by sound-processing choices made by the sound engineer or the producer.

Furthermore, it should be noted that the distinction between composition and performance becomes less clearly defined in improvisatory and non‐Western forms of music.

While helpful for a better understanding of musical audio, the categorization by content source has only limited use from a practical audio analysis point of view. A more useful categorization of musical content is driven by musical and perceptual considerations, leading to four main categories:

timbre characteristics: information related to sound quality such as spectral shape (see Section 3.5),

tonal characteristics: pitch-related information such as melody, harmony, key, and intonation (see Chapter 7),

intensity-related characteristics: dynamics-related information such as envelope, level, and loudness (see Chapter 8), and

temporal characteristics: information comprising the tempo, timing, and meter of the signal (see Chapter 9).

Obviously, there are additional ways to quantitatively describe audio signals that do not fall squarely into one of these music-driven categories, for example, statistical or other technical descriptions of the audio data such as an amplitude distribution.
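As a minimal sketch of such a purely technical descriptor, the following Python example (NumPy only; the bin count and the toy test signal are arbitrary illustrative choices, not values from the text) estimates the amplitude distribution of a mono signal as a normalized histogram of its sample values.

import numpy as np

def amplitude_distribution(x, num_bins=65):
    """Estimate the amplitude distribution of a mono signal x with samples in [-1, 1].

    Returns the histogram bin centers and the relative frequency of samples per bin.
    """
    hist, edges = np.histogram(x, bins=num_bins, range=(-1.0, 1.0))
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, hist / max(len(x), 1)

# toy example: one second of a 440 Hz sine at a 44.1 kHz sampling rate
fs = 44100
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 440 * t)
centers, p = amplitude_distribution(x)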

The categories listed above cover the basic musical information that can be found in a music signal, and analysis tasks such as tempo detection, key detection, or envelope extraction fall unambiguously into one of them. However, cues from various categories are necessary to deduce, for example, the structure of a musical piece or to extract performance characteristics. Furthermore, there exist many high-level perceptual concepts that humans use when categorizing music, such as the musical genre or the projected affective content. Such concepts draw from a combination of several or all of the categories mentioned above, although the impact of specific musical parameters on the perception of these concepts is hard to quantify.

Note that there is also content that is not easily extractable from the signal itself but may still be useful for music information retrieval (MIR) systems. Examples of such content include the year of the composition or recording, the record label, the song title, and information about the artists.

2.2 Audio Content Analysis Process

Systems for audio content analysis can often be represented as a two-step process as depicted in Figure 2.1: the signal is first converted into a meaningful input or feature representation, which is then fed into a rule-based or data-driven system for inference such as a classifier. A perfect, maximally meaningful input representation requires only a trivial inference method; inference could be as simple as applying a threshold to the representation. A very powerful method of inference, on the other hand, is able to utilize an input representation even if the task-relevant information is not directly or easily accessible from it. In traditional systems, the design of the input representation used to require considerable effort in feature design and engineering. Nowadays, the input processing tends to be less sophisticated (e.g. simply computing a spectrogram), which is counterbalanced by the increasing complexity of the inference system.

A preprocessing step can precede the feature extraction stage; in this step, the input data might be converted into a specific format, normalized, or processed for increased robustness and stability.
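To make this processing chain concrete, the following Python sketch (NumPy only; the block size, hop size, and threshold are illustrative assumptions rather than values from the text) strings the stages together for a toy task: preprocessing (down-mixing and peak normalization), extraction of a simple feature (the frame-wise RMS level), and a trivial rule-based inference step that labels each frame as active or silent by thresholding the feature.

import numpy as np

def preprocess(x):
    """Preprocessing: down-mix a (samples x channels) array to mono and normalize the peak to 1."""
    if x.ndim > 1:
        x = x.mean(axis=1)
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x

def extract_rms(x, block_size=1024, hop_size=512):
    """Feature extraction: frame-wise RMS as a simple, compact intensity feature."""
    num_blocks = max(1 + (len(x) - block_size) // hop_size, 1)
    rms = np.zeros(num_blocks)
    for n in range(num_blocks):
        frame = x[n * hop_size:n * hop_size + block_size]
        rms[n] = np.sqrt(np.mean(frame ** 2)) if len(frame) else 0.0
    return rms

def infer_activity(rms, threshold=0.1):
    """Rule-based inference: label each frame as active (True) or silent (False)."""
    return rms > threshold

# toy input: one second of noise followed by one second of near-silence at 44.1 kHz
fs = 44100
x = np.concatenate([0.3 * np.random.randn(fs), 0.001 * np.random.randn(fs)])
activity = infer_activity(extract_rms(preprocess(x)))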

For traditional feature‐based approaches, the following properties are generally desirable for the feature input:

Task-relevance without noise or other task-irrelevant information: features necessarily have to represent information that is relevant to the task at hand. A tempo detection system, for example, might need to focus on the rhythmic parts instead of pitch-related information. Therefore, a meaningful feature captures task-relevant (rhythmic) information and discards task-irrelevant (pitch-related) information. Note that capturing task-relevant information does not necessarily mean that such features are humanly interpretable; rather, it is sufficient if they can be properly interpreted by the inference system.

Compact and nonredundant: the dimensionality of raw audio data tends to be too high for traditional machine learning approaches. One channel of a digital audio file in Compact Disc (CD) quality (44 100 samples per second, 16 bits per sample²) with a length of five minutes contains

44 100 samples/s · 60 s/min · 5 min = 13 230 000 samples.    (2.1)

A compact feature aims at representing these data with fewer values (see the sketch after this list).

Easy to analyze: Although all extractable information is contained in the raw audio data, observing meaningful information directly from the time-domain audio data is complicated. For example, a spectrogram contains a similar amount of data as a time-domain signal, but it is often easier to parse due to its more sparse representation of the content.
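As a rough illustration of the compactness aspect above, the following Python sketch (NumPy only; the block and hop sizes are arbitrary illustrative choices) condenses a five-minute mono signal in CD quality into one spectral centroid value per block, reducing roughly 13 million samples to a few thousand feature values.

import numpy as np

def spectral_centroid_per_block(x, fs, block_size=4096, hop_size=2048):
    """Compute one spectral centroid (in Hz) per block as a compact timbre-related feature."""
    num_blocks = max(1 + (len(x) - block_size) // hop_size, 1)
    freqs = np.fft.rfftfreq(block_size, d=1 / fs)
    centroid = np.zeros(num_blocks)
    for n in range(num_blocks):
        frame = x[n * hop_size:n * hop_size + block_size]
        mag = np.abs(np.fft.rfft(frame, n=block_size))
        centroid[n] = np.sum(freqs * mag) / np.sum(mag) if np.sum(mag) > 0 else 0.0
    return centroid

fs = 44100
x = np.random.randn(fs * 60 * 5)              # stand-in for five minutes of mono audio
v = spectral_centroid_per_block(x, fs)
print(len(x), "samples ->", len(v), "feature values")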

Figure 2.1 General processing stages of a system for audio content analysis.

While these qualities continue to be desirable for modern approaches based on neural networks, the features are not explicitly designed but rather learned from data. The input of the machine learning system is thus not a custom-designed representation but more or less unprocessed raw data. Some argue that end-to-end systems without input processing are generally preferable [1]; however, this statement is debatable [2