Automatic Music Transcription

Master's Thesis

Rand ASSWAD

Summary

  • Automatic music transcription
  • Audio signal characterization
  • Perception of music
  • Pitch analysis
  • Temporal segmentation
  • Conclusion & questions

Automatic Music Transcription

Automatic Music Transcription: Definition

AMT is the process of converting an acoustic musical signal into some form of musical notation. (Benetos et al. 2013)

Motivation

  • Recording improvised performances
  • Democratizing no-score music genres
  • Score following for music learning
  • Musicological analysis

Background

  • Started in the late 20th century
  • Young discipline (compared to speech processing)
  • ISMIR: International Society for Music Information Retrieval (since 2000)
  • MIREX: Music Information Retrieval Evaluation eXchange (running for 15 years)

Underlying tasks

  • Pitch detection
  • Temporal segmentation
  • Loudness estimation
  • Instrument recognition
  • Rhythm detection
  • Scale detection

Music theory vs. audio signal processing

  • Music theory: studies the perceived features of music signals.
  • Audio signal processing: studies the measurable mathematical properties of music signals.

Audio signal characterization

Physical definition

  • Acoustic wave equation (Feynman 1965) \[\Delta p =\frac{1}{c^2}\frac{\partial^2 p}{ {\partial t}^2}\]

  • \(p(\mathbf{x},t)\) pressure function of time and space
  • \(c\) speed of sound propagation
  • Harmonic solutions

Audio signal

  • Audio signal: the pressure measured at the receiver's position
  • Harmonic function of time
  • \[\tilde{x}(t) = \sum_{h=0}^{\infty} A_h \cos(2\pi hf_0t + \varphi_h)\]
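The truncated harmonic model above is easy to synthesize numerically. A minimal sketch, assuming NumPy; the function name and parameters are mine, not from the cited work:

```python
import numpy as np

def harmonic_signal(f0, amplitudes, phases, fs=44100, duration=0.1):
    """Synthesize x(t) = sum_h A_h * cos(2*pi*h*f0*t + phi_h), h = 1..H."""
    t = np.arange(int(fs * duration)) / fs
    return sum(A * np.cos(2 * np.pi * h * f0 * t + phi)
               for h, (A, phi) in enumerate(zip(amplitudes, phases), start=1))

# A 220 Hz tone with three harmonics of decreasing amplitude
x = harmonic_signal(220.0, amplitudes=[1.0, 0.5, 0.25], phases=[0.0, 0.0, 0.0])
```

Signals like this serve as convenient ground truth when testing the pitch estimators discussed later.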

Period and fundamental frequency

[Period is] the smallest positive member of the infinite set of time shifts leaving the signal invariant. (Cheveigné and Kawahara 2002)

  • \(T>0,\forall t, x(t) = x(t+T)\)
  • \(\implies \forall m\in\mathbb{N},\forall t, x(t) = x(t+mT)\)
  • Fundamental frequency: \(f_0 = \frac{1}{T}\)
  • Harmonics: \(f_h = h\cdot f_0, h\in\mathbb{N}\setminus\left\{0\right\}\)
  • Harmonic partials: harmonics \(h>1\)

Perception of music

Pitch

  • Tonal height of a sound
  • Relative musical concept
  • Logarithmic perception
  • \(\neq\) fundamental frequency

Intensity

  • Sound intensity: power carried by sound waves per unit area
  • Sound pressure: local pressure deviation from ambient pressure caused by a sound wave
  • Sound pressure level (SPL): \[\mathrm{SPL} = 20\log_{10}\left(\frac{P}{P_0}\right)\mathrm{dB}\]
  • Loudness: subjective perception of sound pressure
    • Function of SPL and frequency
    • Range from quiet to loud
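The SPL formula above can be checked with a couple of lines (the reference pressure \(P_0 = 20\,\mu\mathrm{Pa}\) is the standard value for air; the function name is mine):

```python
import math

P0 = 20e-6  # reference sound pressure in air, 20 micropascals

def spl_db(p_rms):
    """Sound pressure level: SPL = 20 * log10(P / P0), in dB."""
    return 20 * math.log10(p_rms / P0)

# Doubling the pressure adds 20*log10(2) ≈ 6.02 dB, regardless of the base level
delta = spl_db(2e-3) - spl_db(1e-3)
```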

Pitch Analysis

General model

(Yeh 2008)

  • Imperfect signals
    • Inharmonicity
    • Resonance
    • Surrounding noise
  • \(x(t) = \tilde{x}(t) + z(t)\)
  • \(x(t)\) is quasi-periodic
  • Analysis is performed on short-time segments called frames, extracted with a sliding window function

Classification

  • Sound: monophonic (single pitch) vs. polyphonic (multiple pitches)
  • Analysis: Time domain vs. Spectral domain

Single pitch estimation

\[\tilde{x}(t)=\sum_{h=1}^{\infty} A_h\cos(2\pi hf_0 t + \varphi_h) \approx\sum_{h=1}^{H} A_h\cos(2\pi hf_0 t + \varphi_h)\]

Task: find \(f_0\)

Time domain

  • Analyse signal \(x(t)\) directly with respect to time.
  • Compare signal \(x(t)\) with a delayed version of itself \(x(t+\tau)\)
  • Similarity/dissimilarity functions

Autocorrelation Function (ACF)

\[r[\tau] = \sum_{t=1}^{N-\tau} x[t]x[t+\tau]\]

  • Attains local maximum for \(\tau\approx mT\)
  • Sensitive to structures in signals
    • (+): useful for speech detection
    • (-): resonance structures in music signals
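A direct sketch of the ACF estimator above, assuming NumPy (the zero-lag skip of 20 samples is an ad hoc choice for this toy signal):

```python
import numpy as np

def acf(x, tau_max):
    """r[tau] = sum_t x[t] * x[t + tau], for tau = 0 .. tau_max - 1."""
    N = len(x)
    return np.array([np.dot(x[: N - tau], x[tau:]) for tau in range(tau_max)])

fs = 8000
t = np.arange(800) / fs
x = np.cos(2 * np.pi * 200.0 * t)   # 200 Hz -> period T = 40 samples
r = acf(x, 100)
tau_hat = np.argmax(r[20:]) + 20    # skip the trivial peak at tau = 0
f0_hat = fs / tau_hat
```

As the slide notes, \(r[\tau]\) also peaks at multiples \(mT\); picking the first strong peak resolves the ambiguity here.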

Average Magnitude Difference Function (AMDF)

\[d_{\text{AM}}[\tau] = \frac{1}{N} \sum_{t=1}^{N-\tau} \left\lvert x[t]-x[t+\tau]\right\rvert\] (Ross et al. 1974)

  • Attains local minimum for \(\tau\approx mT\)
  • More adapted for music signals

Squared difference function (SDF)

\[d[\tau] = \sum_{t=1}^{N-\tau}(x[t]-x[t+\tau])^2\]

  • Attains local minimum for \(\tau\approx mT\)
  • Accentuates dips at corresponding periods
  • More clear local minima

YIN algorithm (Cheveigné and Kawahara 2002)

Cumulative mean normalized difference function: \[d[\tau] = \sum_{t=1}^{N-\tau}(x[t]-x[t+\tau])^2\] \[d_{\text{YIN}}[\tau] = \begin{cases} 1 &\text{if}~\tau = 0\\ d[\tau] \Big/ \frac{1}{\tau}\sum\limits_{t=1}^{\tau} d[t] &\text{otherwise} \end{cases}\]

  • Starts at 1 rather than 0
  • Divides SDF by its average over shorter lags
  • Tends to stay large at short lags
  • Drops when the SDF falls below its average
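The normalization step can be sketched in a few lines, assuming NumPy (this is only the difference-function core of YIN; the published algorithm adds thresholding and parabolic interpolation):

```python
import numpy as np

def sdf(x, tau_max):
    """Squared difference function d[tau] = sum_t (x[t] - x[t + tau])^2."""
    N = len(x)
    return np.array([np.sum((x[: N - tau] - x[tau:]) ** 2) for tau in range(tau_max)])

def yin_cmnd(d):
    """Cumulative mean normalized difference: 1 at tau = 0, else d[tau] / mean(d[1..tau])."""
    out = np.ones(len(d))
    cumsum = np.cumsum(d[1:])
    out[1:] = d[1:] * np.arange(1, len(d)) / np.where(cumsum == 0, 1e-12, cumsum)
    return out

fs = 8000
t = np.arange(800) / fs
x = np.cos(2 * np.pi * 200.0 * t)       # period T = 40 samples
d_yin = yin_cmnd(sdf(x, 100))
tau_hat = np.argmin(d_yin[1:]) + 1      # deep dip at tau ≈ mT
```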

Spectral domain

  • Analyse the Fourier transform \(X(f)\) of the signal
  • The spectrum of a signal is the magnitude of its Fourier transform \(S(f)=\left\lvert X(f)\right\rvert\)
  • Local maxima of the spectrum correspond to frequencies of the signal
  • Analyse spectrum patterns with adapted similarity/dissimilarity functions

Autocorrelation Function (ACF)

\[R[f] = \sum_{k=1}^{K-f} S[k]S[k+f]\]

  • Attains local maxima at harmonics \(f\approx hf_0\) (Lahat, Niederjohn, and Krubsack 1987)
  • Function is attenuated when partial peaks are not well aligned

Harmonic sum/product (Schroeder 1968)

  • Harmonic product (in log form): \[\Sigma(f)=\sum_{h=1}^H 20\log_{10}S(hf) = 20\log_{10}\prod_{h=1}^H S(hf)\]
  • Harmonic sum: \[\Sigma'(f)=20\log_{10}\sum_{h=1}^H S(hf)\]

  • Weighted frequency histogram
  • Measures the contribution of each harmonic to the histogram
  • Also known as log compression
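A toy sketch of the summed log-magnitude salience over a synthetic spectrum, assuming NumPy (the function name, the `eps` floor, and the candidate range are mine):

```python
import numpy as np

def harmonic_log_salience(S, k0, H=4, eps=1e-12):
    """Sum of log-magnitudes at the first H harmonics of spectral bin k0."""
    ks = k0 * np.arange(1, H + 1)
    ks = ks[ks < len(S)]                      # drop harmonics beyond the spectrum
    return np.sum(20 * np.log10(S[ks] + eps))

# Spectrum with a harmonic comb at bin 10 (partials at bins 10, 20, 30, 40)
S = np.full(512, 1e-3)
S[[10, 20, 30, 40]] = 1.0
scores = [harmonic_log_salience(S, k) for k in range(5, 60)]
best = 5 + int(np.argmax(scores))
```

Only the true fundamental bin aligns with all partials; subharmonic and octave candidates are penalised by the low-magnitude bins they hit.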

Spectral YIN (Brossier 2006)

  • Optimized version of YIN in the frequency domain
  • The squared difference function is defined over spectral magnitudes \[\hat{d}(\tau) = \frac{2}{N} \sum\limits_{k=0}^{\frac{N}{2}+1} \left\lvert\left(1-e^{2\pi jk\tau/N}\right) X[k]\right\rvert^2\]

Multiple pitch estimation

\[\begin{align} \tilde{x}(t)&=\sum_{m=1}^{M}\tilde{x}_m(t)\\ &=\sum_{m=1}^{M}\sum_{h=1}^{\infty} A_{m,h}\cos(2\pi f_{0,m} t + \varphi_{m,h})\\ &\approx\sum_{m=1}^{M}\sum_{h=1}^{H_m} A_{m,h}\cos(2\pi f_{0,m} t + \varphi_{m,h}) \end{align}\]

Task: find \(f_{0,m}\) for \(m\in\left\{1,\ldots,M\right\}\)

Challenges

  • Concurrent music notes
  • Can be produced by several instruments
  • Core difficulty of polyphonic music transcription

Approaches

  • Iterative:
    • Extract most prominent pitch at each iteration
    • Tends to accumulate errors at each iteration
    • Computationally inexpensive
  • Joint:
    • Evaluate \(f_0\) combinations
    • More accurate estimations
    • Increased computational cost

Harmonic Amplitudes Sum (Klapuri 2006)

  1. Spectral whitening: flatten the spectrum to suppress timbral information.
  2. Salience function: the strength of an \(f_0\) candidate is a weighted sum of the amplitudes of its harmonic partials.
  3. Iterative or joint estimators.

Spectral whitening

  • Apply a bandpass filterbank to the spectrum \(X(f)\).
  • Calculate the standard deviation \(\sigma_b\) within each subband \(b\).
  • Calculate compression coefficients \(\gamma_b=\sigma_b^{\nu-1}\), where \(\nu\) is the whitening parameter.
  • Interpolate \(\gamma\) across all frequencies.
  • Whitened spectrum \(Y(f) = \gamma(f)X(f)\)
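A simplified sketch of the whitening steps above, assuming NumPy (rectangular subbands and linear interpolation stand in for Klapuri's filterbank; \(\nu = 0.33\) follows the usual choice):

```python
import numpy as np

def whiten(X_mag, band_edges, nu=0.33):
    """Flatten a magnitude spectrum: gamma_b = sigma_b^(nu - 1) per subband, interpolated."""
    centers, gammas = [], []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = X_mag[lo:hi]
        sigma = np.sqrt(np.mean(band ** 2))   # within-band deviation
        centers.append((lo + hi) / 2)
        gammas.append(sigma ** (nu - 1) if sigma > 0 else 1.0)
    gamma = np.interp(np.arange(len(X_mag)), centers, gammas)
    return gamma * X_mag

# Loud low band vs. quiet high band: whitening compresses the dynamic range
X_mag = np.concatenate([np.full(100, 10.0), np.full(100, 0.1)])
Y = whiten(X_mag, band_edges=[0, 100, 200])
```

Strong bands get \(\gamma_b < 1\) and weak bands \(\gamma_b > 1\), suppressing timbral colouration before salience computation.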

Salience function

\[s(\tau) = \sum_{h=1}^H g(\tau,h)\left\lvert Y(hf_{\tau})\right\rvert\] where \(f_{\tau}=f_s/\tau\) is the \(f_0\) candidate corresponding to the period \(\tau\), and \(g(\tau,h)\) is the weight of the \(h\)-th partial of period \(\tau\).

Iterative estimation

  1. Determine \(f_0=\mathop{\mathrm{argmax}}_{f} s(f)\)
  2. Remove found \(f_0\) from residual spectrum
  3. Repeat until saliences are low
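The salience function and the iterative loop above can be sketched together, assuming NumPy. This is a toy version: \(g(\tau,h)=1/h\) is my simplification of Klapuri's weights, and cancellation simply zeroes the detected partial bins rather than subtracting estimated amplitudes:

```python
import numpy as np

def salience(Y, fs, tau, H=5):
    """s(tau) = sum_h g(tau, h) * |Y(h * f_tau)|, with g = 1/h as a simple weight."""
    f_tau = fs / tau
    bin_width = fs / (2 * len(Y))            # Y holds magnitude bins for 0..fs/2
    s = 0.0
    for h in range(1, H + 1):
        k = int(round(h * f_tau / bin_width))
        if k < len(Y):
            s += (1.0 / h) * Y[k]
    return s

def iterative_f0s(Y, fs, taus, n_pitches, H=5):
    """Pick the strongest f0 candidate, cancel its partials, repeat."""
    Y = Y.copy()
    bin_width = fs / (2 * len(Y))
    found = []
    for _ in range(n_pitches):
        tau = taus[int(np.argmax([salience(Y, fs, t, H) for t in taus]))]
        found.append(fs / tau)
        for h in range(1, H + 1):            # crude cancellation of detected partials
            k = int(round(h * (fs / tau) / bin_width))
            if k < len(Y):
                Y[k] = 0.0
    return found

# Two harmonic combs (200 Hz and 300 Hz) in a 1000-bin spectrum at fs = 8000 Hz
fs = 8000
Y = np.zeros(1000)                            # bin k <-> k * 4 Hz
for f in (200, 400, 600, 800, 1000):
    Y[f // 4] += 1.0
for f in (300, 600, 900, 1200, 1500):
    Y[f // 4] += 0.8
taus = [fs / f for f in range(100, 401)]      # candidate periods for 100..400 Hz
found = iterative_f0s(Y, fs, taus, n_pitches=2)
```

Zeroing shared partials (here the 600 Hz bin) illustrates the error-accumulation weakness of iterative estimation mentioned earlier.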

Spectrogram Factorisation (NMF) (Smaragdis and Brown 2003)

  • Non-negative matrix factorisation is a well-established technique
  • Works best with harmonically fixed spectral profiles (such as piano notes)
  • Joint estimation method

\[\boldsymbol{X}\approx \boldsymbol{W}\boldsymbol{H}\]

  • Variables:
    • \(\boldsymbol{X}\in\mathbb{R}_+^{K\times N}\) input spectrogram
    • \(\boldsymbol{W}\in\mathbb{R}_+^{K\times R}\) spectral bases for each pitch component (template matrix)
    • \(\boldsymbol{H}\in\mathbb{R}_+^{R\times N}\) pitch activity across time (activation matrix)
  • Dimensions:
    • \(K\) number of frequency bins
    • \(N\) number of frames
    • \(R\) number of pitch components (rank), such that \(R\ll K\)
  • Cost function: \[C=\left\lVert\boldsymbol{X}- \boldsymbol{W}\boldsymbol{H}\right\rVert_F\]
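The factorisation can be demonstrated with the classic Lee–Seung multiplicative updates (one standard way to minimise the Frobenius cost; the toy templates and the `eps` guard are mine), assuming NumPy:

```python
import numpy as np

def nmf(X, R, n_iter=500, eps=1e-9, seed=0):
    """Minimise ||X - WH||_F with Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    K, N = X.shape
    W = rng.random((K, R)) + eps             # nonnegative init
    H = rng.random((R, N)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Two "notes" with fixed spectral templates, active in disjoint frames
W_true = np.array([[1.0, 0.0], [0.5, 0.2], [0.0, 1.0], [0.0, 0.5]])  # K=4, R=2
H_true = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])      # R=2, N=4
X = W_true @ H_true
W, H = nmf(X, R=2)
err = np.linalg.norm(X - W @ H)
```

The multiplicative form keeps \(\boldsymbol{W}\) and \(\boldsymbol{H}\) nonnegative throughout, which is exactly the constraint that makes the columns of \(\boldsymbol{W}\) interpretable as spectral templates.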

NMF concept

  • \(C=\left\lVert\boldsymbol{V}-\boldsymbol{W}\boldsymbol{H}\right\rVert_F\) is a nonconvex optimisation problem with respect to \(\boldsymbol{W}\) and \(\boldsymbol{H}\).
  • Let \(\boldsymbol{V}=(v_1,\ldots,v_N)\) and \(\boldsymbol{H}=(h_1,\ldots,h_N)\)
  • \(\boldsymbol{V}=\boldsymbol{W}\boldsymbol{H}\implies v_i = \boldsymbol{W}h_i\)
  • Impose orthogonality constraint \(\boldsymbol{H}\boldsymbol{H}^T=I\)
  • Obtain K-means clustering property

Application on polyphonic music decomposition

  • The rank \(R\) corresponds to the number of pitch components, which for a piano is the MIDI integer range from 20 to 109.
  • Reinforce a sparsity constraint on \(\boldsymbol{H}\)
  • Apply single pitch estimation on rows of \(\boldsymbol{H}\)

Temporal Segmentation

Onset estimation method

  1. Compute an Onset Detection Function
  2. Calculate a threshold function
  3. Peak-picking local maxima above threshold

Onset Detection Function (ODF)

  • Characterize change in energy or harmonic content in the signal
  • Difficult to identify in the time domain
  • Computed in the spectral domain using magnitude and/or phase
  • Onsets are detected from local maxima

High Frequency Content (HFC) (Masri and Bateman 1996)

\[D_{\text{HFC}}[n] = \sum\limits_{k=1}^{N} k\cdot\left\lVert X[n,k]\right\rVert^2\]

Favours wide-band energy bursts and high frequency components
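The bias towards high frequencies is visible on a toy frame, assuming NumPy (one frame only; a full ODF would compute this per STFT frame):

```python
import numpy as np

def hfc(frame):
    """High Frequency Content of one frame: sum_k k * |X[k]|^2."""
    X = np.fft.rfft(frame)
    k = np.arange(len(X))
    return np.sum(k * np.abs(X) ** 2)

fs, n = 8000, 256
t = np.arange(n) / fs
low = np.sin(2 * np.pi * 100.0 * t)     # energy concentrated in a low bin
high = np.sin(2 * np.pi * 3000.0 * t)   # same energy, high bin
```

Both frames carry the same energy, but the linear-in-\(k\) weighting makes `hfc(high)` far larger, which is why HFC responds strongly to percussive, wide-band attacks.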

Phase Deviation (Bello and Sandler 2003)

Evaluates phase difference \[D_{\Phi}[n] = \sum\limits_{k=0}^{N} \left\lvert \hat{\varphi}[n, k] \right\rvert\]

where

  • \(\mathrm{princarg}(\theta) = \pi + ((\theta + \pi) \bmod (-2\pi))\)
  • \(\varphi(t, f) = \mathrm{arg}(X(t, f))\)
  • \(\hat{\varphi}(t, f) = \mathrm{princarg} \left( \frac{\partial^2 \varphi}{\partial t^2}(t, f) \right)\)

Identifies tonal onsets and energy bursts

Complex Distance (Duxbury et al. 2003)

\[D_{\mathbb{C}}[n] = \sum\limits_{k=0}^{N} \left\lVert\hat{X}[n, k] - X[n, k]\right\rVert^2\]

where \(\hat{X}[n, k] = \left\lvert X[n, k] \right\rvert \cdot e^{j\hat{\varphi}[n, k]}\)

Quantifies both tonal onsets and percussive events by combining spectral difference and phase-based approaches.

Thresholding & Peak-picking

  • ODFs are usually sensitive to the slightest perturbations
  • Defining a threshold would eliminate insignificant peaks
  • Suggested threshold: moving average
  • Peak-picking: selecting peaks above the calculated threshold
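The three steps can be sketched on a toy ODF, assuming NumPy (the window length and additive bias are arbitrary tuning parameters, not values from the literature):

```python
import numpy as np

def pick_onsets(odf, window=5, bias=0.1):
    """Keep local maxima of the ODF that exceed a moving-average threshold."""
    pad = window // 2
    padded = np.pad(odf, pad, mode="edge")
    threshold = np.array([padded[i:i + window].mean()
                          for i in range(len(odf))]) + bias
    return [n for n in range(1, len(odf) - 1)
            if odf[n] > threshold[n]
            and odf[n] >= odf[n - 1] and odf[n] >= odf[n + 1]]

odf = np.array([0.0, 0.1, 1.0, 0.2, 0.05, 0.0, 0.9, 0.1, 0.0])
onsets = pick_onsets(odf)
```

The moving average adapts the threshold to the local energy level, so small ripples are rejected while both genuine peaks survive.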

Conclusion

Conclusion

  • Single pitch estimation obtains satisfactory results
  • Multi-pitch estimation remains an open problem
  • Promising results in onset detection

Thank you for your attention

References

Bello, J.P., and M. Sandler. 2003. “Phase-Based Note Onset Detection for Music Signals.” In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03)., 5:V–441. https://doi.org/10.1109/ICASSP.2003.1200001.

Benetos, Emmanouil, Simon Dixon, Dimitrios Giannoulis, Holger Kirchhoff, and Anssi Klapuri. 2013. “Automatic Music Transcription: Challenges and Future Directions.” Journal of Intelligent Information Systems 41 (December). https://doi.org/10.1007/s10844-013-0258-3.

Brossier, Paul M. 2006. Automatic Annotation of Musical Audio for Interactive Applications. PhD thesis, Queen Mary University of London.

Cheveigné, Alain de, and Hideki Kawahara. 2002. “YIN, a Fundamental Frequency Estimator for Speech and Music.” The Journal of the Acoustical Society of America 111 (4): 1917–30. https://doi.org/10.1121/1.1458024.

Duxbury, Chris, Juan Pablo Bello, Mike Davies, and Mark Sandler. 2003. “Complex Domain Onset Detection for Musical Signals,” 4.

Feynman, Richard. 1965. “The Feynman Lectures on Physics Vol. I Ch. 47: Sound. The Wave Equation.” https://www.feynmanlectures.caltech.edu/I_47.html.

Klapuri, Anssi. 2006. “Multiple Fundamental Frequency Estimation by Summing Harmonic Amplitudes,” 6.

Lahat, M., Russell J. Niederjohn, and David A. Krubsack. 1987. “A Spectral Autocorrelation Method for Measurement of the Fundamental Frequency of Noise-Corrupted Speech.” IEEE Trans. Acoustics, Speech, and Signal Processing. https://doi.org/10.1109/TASSP.1987.1165224.

Masri, Paul, and Andrew Bateman. 1996. “Improved Modelling of Attack Transients in Music Analysis-Resynthesis,” 4.

Ross, M., H. Shaffer, A. Cohen, R. Freudberg, and H. Manley. 1974. “Average Magnitude Difference Function Pitch Extractor.” IEEE Transactions on Acoustics, Speech, and Signal Processing 22 (5): 353–62. https://doi.org/10.1109/TASSP.1974.1162598.

Schroeder, Manfred R. 1968. “Period Histogram and Product Spectrum: New Methods for Fundamental-Frequency Measurement.” The Journal of the Acoustical Society of America. https://doi.org/10.1121/1.1910902.

Smaragdis, P., and J.C. Brown. 2003. “Non-Negative Matrix Factorization for Polyphonic Music Transcription.” In 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE Cat. No.03TH8684), 177–80. New Paltz, NY, USA: IEEE. https://doi.org/10.1109/ASPAA.2003.1285860.

Yeh, Chunghsin. 2008. “Multiple Fundamental Frequency Estimation of Polyphonic Recordings,” 153.