Automatic Music Transcription

Master's Thesis

Rand ASSWAD

Summary

  • Automatic music transcription
  • Audio signal characterization
  • Perception of music
  • Pitch analysis
  • Temporal segmentation
  • Conclusion & questions

Automatic Music Transcription

Automatic Music Transcription: Definition

AMT is the process of converting an acoustic musical signal into some form of musical notation. (Benetos et al. 2013)

Motivation

  • Recording improvised performances
  • Democratizing no-score music genres
  • Score following for music learning
  • Musicological analysis

Background

  • Started in the late 20th century
  • Young discipline (compared to speech processing)
  • ISMIR: International Society for Music Information Retrieval (since 2000)
  • MIREX: Music Information Retrieval Evaluation eXchange (15 years)

Underlying tasks

  • Pitch detection
  • Temporal segmentation
  • Loudness estimation
  • Instrument recognition
  • Rhythm detection
  • Scale detection

Music theory vs. audio signal processing

  • Music theory: studies perceived features of music signals.
  • Audio signal processing: studies the mathematical variables of music signals.

Audio signal characterization

Physical definition

  • Acoustic wave equation (Feynman 1965) \[\Delta p =\frac{1}{c^2}\frac{\partial^2 p}{ {\partial t}^2}\]

  • \(p(\mathbf{x},t)\) pressure function of time and space
  • \(c\) speed of sound propagation
  • Harmonic solutions

Audio signal

  • Audio signal: pressure at the receptor’s position
  • Harmonic function of time
  • \[\tilde{x}(t) = \sum_{h=1}^{\infty} A_h \cos(2\pi hf_0t + \varphi_h)\]

Period and fundamental frequency

[Period is] the smallest positive member of the infinite set of time shifts leaving the signal invariant. (Cheveigné and Kawahara 2002)

  • \(T>0,\forall t, x(t) = x(t+T)\)
  • \(\implies \forall m\in\mathbb{N},\forall t, x(t) = x(t+mT)\)
  • Fundamental frequency: \(f_0 = \frac{1}{T}\)
  • Harmonics: \(f_h = h\cdot f_0, h\in\mathbb{N}\setminus\left\{0\right\}\)
  • Harmonic partials: harmonics \(h>1\)
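The definitions above can be checked numerically. The following is a minimal sketch, not taken from the thesis: it samples a truncated harmonic sum starting at the fundamental (h = 1) and verifies that shifting by one period leaves the samples unchanged. The function name `harmonic_signal`, the sample rate, and the amplitudes are illustrative assumptions.

```python
import math

def harmonic_signal(f0, amplitudes, phases, sample_rate, n_samples):
    """Sample sum_h A_h * cos(2*pi*h*f0*t + phi_h) for h = 1..H."""
    signal = []
    for n in range(n_samples):
        t = n / sample_rate
        value = sum(
            a * math.cos(2 * math.pi * h * f0 * t + phi)
            for h, (a, phi) in enumerate(zip(amplitudes, phases), start=1)
        )
        signal.append(value)
    return signal

# Hypothetical example: f0 = 100 Hz sampled at 8 kHz, so T = 1/f0 is 80 samples.
x = harmonic_signal(100.0, [1.0, 0.5, 0.25], [0.0, 0.0, 0.0], 8000, 160)
period = 80
max_error = max(abs(x[n] - x[n + period]) for n in range(80))
print(max_error)  # numerically negligible: x(t) = x(t + T)
```

The harmonics at 200 Hz and 300 Hz are also invariant under a shift of 80 samples, which is why the period of the sum is set by the fundamental alone.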

Perception of music

Pitch

  • Tonal height of a sound
  • Relative musical concept
  • Logarithmic perception
  • \(\neq\) fundamental frequency
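To illustrate the logarithmic perception of pitch, here is a small sketch that measures the distance between two frequencies in semitones. It assumes 12-tone equal temperament, which the slides do not specify; the helper name `semitone_distance` is hypothetical.

```python
import math

def semitone_distance(f_ref, f):
    """Signed distance from f_ref to f in equal-tempered semitones (both in Hz)."""
    return 12 * math.log2(f / f_ref)

# Equal frequency ratios give equal perceived intervals:
# both pairs below are one octave (12 semitones) apart,
# although the differences in Hz are 220 and 440.
print(semitone_distance(220.0, 440.0))
print(semitone_distance(440.0, 880.0))
```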

Intensity

  • Sound intensity: power carried by sound waves per unit area
  • Sound pressure: local pressure deviation from ambient pressure caused by a sound wave
  • Sound pressure level (SPL): \[\mathrm{SPL} = 20\log_{10}\left(\frac{P}{P_0}\right)\,\mathrm{dB}\] where \(P\) is the sound pressure and \(P_0\) a reference pressure
  • Loudness: subjective perception of sound pressure
    • Function of SPL and frequency
    • Range from quiet to loud
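As a worked example of the SPL formula, the sketch below converts a pressure in pascals to decibels. The reference value of 20 µPa is an assumption (it is the value commonly used for sound in air); the slides do not state which reference is used.

```python
import math

# Assumed reference pressure: 20 micropascals, the usual choice in air.
P0 = 20e-6

def spl_db(pressure, p_ref=P0):
    """Sound pressure level in dB for a pressure given in pascals."""
    return 20 * math.log10(pressure / p_ref)

print(spl_db(P0))                 # a pressure equal to the reference gives 0 dB
print(spl_db(0.2) - spl_db(0.02)) # a tenfold pressure increase adds about 20 dB
```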

Pitch Analysis

General model

(Yeh 2008)

  • Imperfect signals
    • Inharmonicity
    • Resonance
    • Surrounding noise
  • \(x(t) = \tilde{x}(t) + z(t)\)
  • \(x(t)\) is quasi-periodic
  • Analysis is performed on short time segments, called frames, extracted with a sliding window function
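The frame-based analysis can be sketched as follows. This is one possible implementation under stated assumptions: the Hann window, the frame size, and the hop size are illustrative choices, not parameters given in the slides.

```python
import math

def hann_window(n):
    """Hann window of length n (n >= 2); one common window function."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frames(x, frame_size, hop_size):
    """Yield windowed frames of x, sliding forward by hop_size samples."""
    window = hann_window(frame_size)
    for start in range(0, len(x) - frame_size + 1, hop_size):
        yield [x[start + i] * window[i] for i in range(frame_size)]

# Hypothetical usage: 10 samples, frames of 4 samples overlapping by half.
signal = [1.0] * 10
for frame in frames(signal, frame_size=4, hop_size=2):
    print(frame)
```

Within each frame the quasi-periodic signal is treated as approximately periodic, so a pitch estimator can be applied frame by frame.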

Classification

  • Sound: Monophonic (single pitch) vs. Polyphonic (multiple pitches)
  • Analysis: Time domain vs. Spectral domain

Single pitch estimation

\[\tilde{x}(t)=\sum_{h=1}^{\infty} A_h\cos(2\pi h f_0 t + \varphi_h) \approx\sum_{h=1}^{H} A_h\cos(2\pi h f_0 t + \varphi_h)\]

Task: find \(f_0\)

Time domain

  • Analyse signal \(x(t)\) directly with respect to time.
  • Compare signal \(x(t)\) with a delayed version of itself \(x(t+\tau)\)
  • Similarity/dissimilarity functions

Autocorrelation Function (ACF)

\[r[\tau] = \sum_{t=1}^{N-\tau} x[t]x[t+\tau]\]

  • Attains local maximum for \(\tau\approx mT\)
  • Sensitive to structures in signals
    • (+): useful for speech detection
    • (-): resonance structures in music signals
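A minimal sketch of period estimation with the ACF, using zero-based indexing. Restricting the search to a lag range `[min_lag, max_lag]` is an assumption of this sketch, added to skip the trivial maximum at lag 0; the function names are hypothetical.

```python
import math

def acf(x, max_lag):
    """r[tau] = sum over t of x[t] * x[t + tau], on the overlapping samples."""
    n = len(x)
    return [
        sum(x[t] * x[t + tau] for t in range(n - tau))
        for tau in range(max_lag + 1)
    ]

def period_from_acf(x, min_lag, max_lag):
    """Lag in [min_lag, max_lag] where the ACF is largest."""
    r = acf(x, max_lag)
    return max(range(min_lag, max_lag + 1), key=lambda tau: r[tau])

# Hypothetical test signal: a sine with a period of 50 samples.
x = [math.sin(2 * math.pi * t / 50) for t in range(400)]
print(period_from_acf(x, min_lag=20, max_lag=80))
```

Because the sum contains fewer terms as the lag grows, this unnormalized ACF favours short lags, which is one reason a plain maximum search can pick the wrong multiple of the period on real signals.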

Average Magnitude Difference Function (AMDF)

\[d_{\text{AM}}[\tau] = \frac{1}{N} \sum_{t=1}^{N-\tau} \left\lvert x[t]-x[t+\tau]\right\rvert\] (Ross et al. 1974)

  • Attains local minimum for \(\tau\approx mT\)
  • Better suited to music signals
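The AMDF can be sketched in the same way, this time searching for a minimum. The test signal (a fundamental plus its second harmonic) and the lag range are illustrative assumptions.

```python
import math

def amdf(x, max_lag):
    """d_AM[tau] = (1/N) * sum over t of |x[t] - x[t + tau]|."""
    n = len(x)
    return [
        sum(abs(x[t] - x[t + tau]) for t in range(n - tau)) / n
        for tau in range(max_lag + 1)
    ]

def period_from_amdf(x, min_lag, max_lag):
    """Lag in [min_lag, max_lag] where the AMDF is smallest."""
    d = amdf(x, max_lag)
    return min(range(min_lag, max_lag + 1), key=lambda tau: d[tau])

# Hypothetical signal with period 40 samples: fundamental plus second harmonic.
x = [
    math.sin(2 * math.pi * t / 40) + 0.5 * math.sin(4 * math.pi * t / 40)
    for t in range(320)
]
print(period_from_amdf(x, min_lag=10, max_lag=60))
```

The second harmonic alone repeats every 20 samples, but the AMDF only vanishes at the lag where the whole signal repeats.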

Squared difference function (SDF)

\[d[\tau] = \sum_{t=1}^{N-\tau}(x[t]-x[t+\tau])^2\]

  • Attains local minimum for \(\tau\approx mT\)
  • Accentuates dips at corresponding periods
  • Clearer local minima
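The SDF is closely related to the ACF: expanding the square gives the energy of the two overlapping segments minus twice the autocorrelation term. The sketch below checks this identity numerically; the function names and the test signal are assumptions made for illustration.

```python
import math

def sdf(x, tau):
    """d[tau] = sum over t of (x[t] - x[t + tau])**2."""
    n = len(x)
    return sum((x[t] - x[t + tau]) ** 2 for t in range(n - tau))

def sdf_from_acf(x, tau):
    """Same quantity via (a - b)^2 = a^2 + b^2 - 2ab."""
    n = len(x)
    energy_head = sum(x[t] ** 2 for t in range(n - tau))
    energy_tail = sum(x[t + tau] ** 2 for t in range(n - tau))
    r = sum(x[t] * x[t + tau] for t in range(n - tau))
    return energy_head + energy_tail - 2 * r

# Hypothetical signal with a period of 50 samples.
x = [math.sin(2 * math.pi * t / 50) for t in range(400)]
for tau in (10, 25, 50):
    print(tau, sdf(x, tau), sdf_from_acf(x, tau))
```

A maximum of the autocorrelation term therefore corresponds to a minimum of the SDF, provided the energy terms vary slowly with the lag.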

YIN algorithm (Cheveigné and Kawahara 2002)

Cumulative mean normalized difference function: \[d[\tau] = \sum_{t=1}^{N-\tau}(x[t]-x[t+\tau])^2\] \[d_{\text{YIN}}[\tau] = \begin{cases} 1 &\text{if}~\tau = 0\\ d[\tau] \Big/ \left(\frac{1}{\tau}\sum\limits_{j=1}^{\tau} d[j]\right) &\text{otherwise} \end{cases}\]

  • Starts at 1 rather than 0
  • Divides SDF by its average over shorter lags
  • Tends to stay large at short lags
  • Drops when SDF falls under its average
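The normalization can be sketched as follows. The period-picking step (first lag under an absolute threshold, then descent to the local minimum) follows the idea of the YIN paper, but the threshold value of 0.1, the fallback, and the function names are assumptions of this sketch rather than details given in the slides.

```python
import math

def sdf(x, max_lag):
    """Squared difference function for lags 0..max_lag."""
    n = len(x)
    return [
        sum((x[t] - x[t + tau]) ** 2 for t in range(n - tau))
        for tau in range(max_lag + 1)
    ]

def cmndf(x, max_lag):
    """SDF divided by its running mean over lags 1..tau; equals 1 at lag 0."""
    d = sdf(x, max_lag)
    out = [1.0]
    running_sum = 0.0
    for tau in range(1, max_lag + 1):
        running_sum += d[tau]
        # d[tau] / (running_sum / tau); guard against an all-zero signal
        out.append(d[tau] * tau / running_sum if running_sum > 0 else 1.0)
    return out

def yin_period(x, max_lag, threshold=0.1):
    """First dip below threshold, followed down to its local minimum."""
    dn = cmndf(x, max_lag)
    for tau in range(1, max_lag + 1):
        if dn[tau] < threshold:
            while tau + 1 <= max_lag and dn[tau + 1] < dn[tau]:
                tau += 1
            return tau
    # Assumed fallback: global minimum when no lag passes the threshold
    return min(range(1, max_lag + 1), key=lambda tau: dn[tau])

# Hypothetical signal with a period of 50 samples.
x = [math.sin(2 * math.pi * t / 50) for t in range(400)]
print(yin_period(x, max_lag=120))
```

Because the normalized function stays near or above 1 at short lags, no lower bound on the lag is needed to avoid the trivial zero of the SDF at lag 0.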

Spectral domain

  • Analyse the Fourier transform \(X(f)\) of the signal
  • The spectrum of a signal is the magnitude of its Fourier transform: \(S(f)=\left\lvert X(f)\right\rvert\)
  • Local maxima of the spectrum correspond to the frequency components present in the signal
  • Analyse spectrum patterns with adapted similarity/dissimilarity functions
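As a simple illustration of spectral analysis, the sketch below computes the magnitude spectrum with a direct discrete Fourier transform and returns the frequency of its largest peak. The direct O(N²) DFT is used only to keep the example dependency-free; the sample rate, signal, and function names are assumptions.

```python
import cmath
import math

def magnitude_spectrum(x):
    """S[k] = |X[k]| for bins k = 0..N/2, computed with a direct DFT."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def peak_frequency(x, sample_rate):
    """Frequency in Hz of the largest spectral peak, ignoring the DC bin."""
    s = magnitude_spectrum(x)
    k = max(range(1, len(s)), key=lambda i: s[i])
    return k * sample_rate / len(x)

# Hypothetical signal: a 50 Hz sine sampled at 1 kHz for 200 samples.
x = [math.sin(2 * math.pi * 50 * t / 1000) for t in range(200)]
print(peak_frequency(x, sample_rate=1000))
```

For a harmonic sound the strongest peak is not necessarily the fundamental, which is why spectral methods compare the whole pattern of peaks rather than picking a single maximum.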