Automatic Music Transcription

Master's Thesis

Rand ASSWAD

Summary

  • Automatic music transcription
  • Audio signal characterization
  • Perception of music
  • Pitch analysis
  • Temporal segmentation
  • Conclusion & questions

Automatic Music Transcription

Automatic Music Transcription: Definition

AMT is the process of converting an acoustic musical signal into some form of musical notation. (Benetos et al. 2013)

Motivation

  • Recording improvised performances
  • Democratizing no-score music genres
  • Score following for music learning
  • Musicological analysis

Background

  • Started in the late 20th century
  • Young discipline (compared to speech processing)
  • ISMIR: International Society for Music Information Retrieval (since 2000)
  • MIREX: Music Information Retrieval Evaluation eXchange (running for 15 years)

Underlying tasks

  • Pitch detection
  • Temporal segmentation
  • Loudness estimation
  • Instrument recognition
  • Rhythm detection
  • Scale detection

Music theory vs. audio signal processing

  • Music theory: studies the perceived features of music signals.
  • Audio signal processing: studies the measurable mathematical properties of music signals.

Audio signal characterization

Physical definition

  • Acoustic wave equation (Feynman 1965) \[\Delta p =\frac{1}{c^2}\frac{\partial^2 p}{ {\partial t}^2}\]

  • \(p(\mathbf{x},t)\) pressure function of time and space
  • \(c\) speed of sound propagation
  • Harmonic solutions

Audio signal

  • Audio signal: the pressure measured at the receiver's position
  • Harmonic function of time
  • \[\tilde{x}(t) = \sum_{h=0}^{\infty} A_h \cos(2\pi hf_0t + \varphi_h)\]
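The truncated harmonic model above is easy to synthesize numerically. A minimal sketch, assuming NumPy; the function name and parameters are mine, not from the cited work:

```python
import numpy as np

def harmonic_signal(f0, amplitudes, phases, fs=44100, duration=0.1):
    """Synthesize x(t) = sum_h A_h * cos(2*pi*h*f0*t + phi_h), h = 1..H."""
    t = np.arange(int(fs * duration)) / fs
    return sum(A * np.cos(2 * np.pi * h * f0 * t + phi)
               for h, (A, phi) in enumerate(zip(amplitudes, phases), start=1))

# A 220 Hz tone with three harmonics of decreasing amplitude
x = harmonic_signal(220.0, amplitudes=[1.0, 0.5, 0.25], phases=[0.0, 0.0, 0.0])
```

Signals like this serve as convenient ground truth when testing the pitch estimators discussed later.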

Period and fundamental frequency

[Period is] the smallest positive member of the infinite set of time shifts leaving the signal invariant. (Cheveigné and Kawahara 2002)

  • \(T>0,\forall t, x(t) = x(t+T)\)
  • \(\implies \forall m\in\mathbb{N},\forall t, x(t) = x(t+mT)\)
  • Fundamental frequency: \(f_0 = \frac{1}{T}\)
  • Harmonics: \(f_h = h\cdot f_0, h\in\mathbb{N}\setminus\left\{0\right\}\)
  • Harmonic partials: harmonics \(h>1\)

Perception of music

Pitch

  • Tonal height of a sound
  • Relative musical concept
  • Logarithmic perception
  • \(\neq\) fundamental frequency

Intensity

  • Sound intensity: power carried by sound waves per unit area
  • Sound pressure: local pressure deviation from ambient pressure caused by a sound wave
  • Sound pressure level (SPL): \[\mathrm{SPL} = 20\log_{10}\left(\frac{P}{P_0}\right)\mathrm{dB}\]
  • Loudness: subjective perception of sound pressure
    • Function of SPL and frequency
    • Range from quiet to loud
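The SPL formula above can be checked with a couple of lines (the reference pressure \(P_0 = 20\,\mu\mathrm{Pa}\) is the standard value for air; the function name is mine):

```python
import math

P0 = 20e-6  # reference sound pressure in air, 20 micropascals

def spl_db(p_rms):
    """Sound pressure level: SPL = 20 * log10(P / P0), in dB."""
    return 20 * math.log10(p_rms / P0)

# Doubling the pressure adds 20*log10(2) ≈ 6.02 dB, regardless of the base level
delta = spl_db(2e-3) - spl_db(1e-3)
```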

Pitch Analysis

General model

(Yeh 2008)

  • Imperfect signals
    • Inharmonicity
    • Resonance
    • Surrounding noise
  • \(x(t) = \tilde{x}(t) + z(t)\)
  • \(x(t)\) is quasi-periodic
  • Analysis is performed on short-time segments called frames, extracted with a sliding window function

Classification

  • Sound: monophonic (single pitch) vs. polyphonic (multiple pitches)
  • Analysis: Time domain vs. Spectral domain

Single pitch estimation

\[\tilde{x}(t)=\sum_{h=1}^{\infty} A_h\cos(2\pi hf_0 t + \varphi_h) \approx\sum_{h=1}^{H} A_h\cos(2\pi hf_0 t + \varphi_h)\]

Task: find \(f_0\)

Time domain

  • Analyse signal \(x(t)\) directly with respect to time.
  • Compare signal \(x(t)\) with a delayed version of itself \(x(t+\tau)\)
  • Similarity/dissimilarity functions

Autocorrelation Function (ACF)

\[r[\tau] = \sum_{t=1}^{N-\tau} x[t]x[t+\tau]\]

  • Attains local maximum for \(\tau\approx mT\)
  • Sensitive to structures in signals
    • (+): useful for speech detection
    • (-): resonance structures in music signals
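A direct sketch of the ACF estimator above, assuming NumPy (the zero-lag skip of 20 samples is an ad hoc choice for this toy signal):

```python
import numpy as np

def acf(x, tau_max):
    """r[tau] = sum_t x[t] * x[t + tau], for tau = 0 .. tau_max - 1."""
    N = len(x)
    return np.array([np.dot(x[: N - tau], x[tau:]) for tau in range(tau_max)])

fs = 8000
t = np.arange(800) / fs
x = np.cos(2 * np.pi * 200.0 * t)   # 200 Hz -> period T = 40 samples
r = acf(x, 100)
tau_hat = np.argmax(r[20:]) + 20    # skip the trivial peak at tau = 0
f0_hat = fs / tau_hat
```

As the slide notes, \(r[\tau]\) also peaks at multiples \(mT\); picking the first strong peak resolves the ambiguity here.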

Average Magnitude Difference Function (AMDF)

\[d_{\text{AM}}[\tau] = \frac{1}{N} \sum_{t=1}^{N-\tau} \left\lvert x[t]-x[t+\tau]\right\rvert\] (Ross et al. 1974)

  • Attains local minimum for \(\tau\approx mT\)
  • More adapted for music signals

Squared difference function (SDF)

\[d[\tau] = \sum_{t=1}^{N-\tau}(x[t]-x[t+\tau])^2\]

  • Attains local minimum for \(\tau\approx mT\)
  • Accentuates dips at corresponding periods
  • More clear local minima

YIN algorithm (Cheveigné and Kawahara 2002)

Cumulative mean normalized difference function: \[d[\tau] = \sum_{t=1}^{N-\tau}(x[t]-x[t+\tau])^2\] \[d_{\text{YIN}}[\tau] = \begin{cases} 1 &\text{if}~\tau = 0\\ d[\tau] \Big/ \frac{1}{\tau}\sum\limits_{t=1}^{\tau} d[t] &\text{otherwise} \end{cases}\]

  • Starts at 1 rather than 0
  • Divides SDF by its average over shorter lags
  • Tends to stay large at short lags
  • Drops when the SDF falls below its average
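The normalization step can be sketched in a few lines, assuming NumPy (this is only the difference-function core of YIN; the published algorithm adds thresholding and parabolic interpolation):

```python
import numpy as np

def sdf(x, tau_max):
    """Squared difference function d[tau] = sum_t (x[t] - x[t + tau])^2."""
    N = len(x)
    return np.array([np.sum((x[: N - tau] - x[tau:]) ** 2) for tau in range(tau_max)])

def yin_cmnd(d):
    """Cumulative mean normalized difference: 1 at tau = 0, else d[tau] / mean(d[1..tau])."""
    out = np.ones(len(d))
    cumsum = np.cumsum(d[1:])
    out[1:] = d[1:] * np.arange(1, len(d)) / np.where(cumsum == 0, 1e-12, cumsum)
    return out

fs = 8000
t = np.arange(800) / fs
x = np.cos(2 * np.pi * 200.0 * t)       # period T = 40 samples
d_yin = yin_cmnd(sdf(x, 100))
tau_hat = np.argmin(d_yin[1:]) + 1      # deep dip at tau ≈ mT
```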

Spectral domain

  • Analyse the Fourier transform \(X(f)\) of the signal
  • The spectrum of a signal is the magnitude of its Fourier transform \(S(f)=\left\lvert X(f)\right\rvert\)
  • Local maxima of the spectrum correspond to frequencies of the signal
  • Analyse spectrum patterns with adapted similarity/dissimilarity functions

Autocorrelation Function (ACF)

\[R[f] = \sum_{k=1}^{K-f} S[k]S[k+f]\]

  • Attains local maxima at harmonics \(f\approx hf_0\) (Lahat, Niederjohn, and Krubsack 1987)
  • Function is attenuated when partial peaks are not well aligned

Harmonic sum/product (Schroeder 1968)

  • Harmonic product (in log form): \[\Sigma(f)=\sum_{h=1}^H 20\log_{10}S(hf) = 20\log_{10}\prod_{h=1}^H S(hf)\]
  • Harmonic sum: \[\Sigma'(f)=20\log_{10}\sum_{h=1}^H S(hf)\]

  • Weighted frequency histogram
  • Measures the contribution of each harmonic to the histogram
  • Also known as log compression
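A toy sketch of the summed log-magnitude salience over a synthetic spectrum, assuming NumPy (the function name, the `eps` floor, and the candidate range are mine):

```python
import numpy as np

def harmonic_log_salience(S, k0, H=4, eps=1e-12):
    """Sum of log-magnitudes at the first H harmonics of spectral bin k0."""
    ks = k0 * np.arange(1, H + 1)
    ks = ks[ks < len(S)]                      # drop harmonics beyond the spectrum
    return np.sum(20 * np.log10(S[ks] + eps))

# Spectrum with a harmonic comb at bin 10 (partials at bins 10, 20, 30, 40)
S = np.full(512, 1e-3)
S[[10, 20, 30, 40]] = 1.0
scores = [harmonic_log_salience(S, k) for k in range(5, 60)]
best = 5 + int(np.argmax(scores))
```

Only the true fundamental bin aligns with all partials; subharmonic and octave candidates are penalised by the low-magnitude bins they hit.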

Spectral YIN (Brossier 2006)

  • Optimized version of YIN in the frequency domain
  • The squared difference function is defined over spectral magnitudes \[\hat{d}(\tau) = \frac{2}{N} \sum\limits_{k=0}^{\frac{N}{2}+1} \left\lvert\left(1-e^{2\pi jk\tau/N}\right) X[k]\right\rvert^2\]

Multiple pitch estimation

\[\begin{align} \tilde{x}(t)&=\sum_{m=1}^{M}\tilde{x}_m(t)\\ &=\sum_{m=1}^{M}\sum_{h=1}^{\infty} A_{m,h}\cos(2\pi f_{0,m} t + \varphi_{m,h})\\ &\approx\sum_{m=1}^{M}\sum_{h=1}^{H_m} A_{m,h}\cos(2\pi f_{0,m} t + \varphi_{m,h}) \end{align}\]

Task: find \(f_{0,m}\) for \(m\in\left\{1,\ldots,M\right\}\)

Challenges

  • Concurrent music notes
  • Can be produced by several instruments
  • Core difficulty of polyphonic music transcription

Approaches

  • Iterative:
    • Extract most prominent pitch at each iteration
    • Tends to accumulate errors at each iteration
    • Computationally inexpensive
  • Joint:
    • Evaluate \(f_0\) combinations
    • More accurate estimations
    • Increased computational cost

Harmonic Amplitudes Sum (Klapuri 2006)

  1. Spectral whitening: flatten the spectrum to suppress timbral information.
  2. Salience function: the strength of an \(f_0\) candidate is a weighted sum of the amplitudes of its harmonic partials.
  3. Iterative or joint estimators.

Spectral whitening

  • Apply a bandpass filterbank to the spectrum \(X(f)\).
  • Calculate the standard deviation \(\sigma_b\) within each subband \(b\).
  • Calculate compression coefficients \(\gamma_b=\sigma_b^{\nu-1}\), where \(\nu\) is the whitening parameter.
  • Interpolate \(\gamma\) across all frequencies.
  • Whitened spectrum \(Y(f) = \gamma(f)X(f)\)
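A simplified sketch of the whitening steps above, assuming NumPy (rectangular subbands and linear interpolation stand in for Klapuri's filterbank; \(\nu = 0.33\) follows the usual choice):

```python
import numpy as np

def whiten(X_mag, band_edges, nu=0.33):
    """Flatten a magnitude spectrum: gamma_b = sigma_b^(nu - 1) per subband, interpolated."""
    centers, gammas = [], []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = X_mag[lo:hi]
        sigma = np.sqrt(np.mean(band ** 2))   # within-band deviation
        centers.append((lo + hi) / 2)
        gammas.append(sigma ** (nu - 1) if sigma > 0 else 1.0)
    gamma = np.interp(np.arange(len(X_mag)), centers, gammas)
    return gamma * X_mag

# Loud low band vs. quiet high band: whitening compresses the dynamic range
X_mag = np.concatenate([np.full(100, 10.0), np.full(100, 0.1)])
Y = whiten(X_mag, band_edges=[0, 100, 200])
```

Strong bands get \(\gamma_b < 1\) and weak bands \(\gamma_b > 1\), suppressing timbral colouration before salience computation.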

Salience function

\[s(\tau) = \sum_{h=1}^H g(\tau,h)\left\lvert Y(hf_{\tau})\right\rvert\] where \(f_{\tau}=f_s/\tau\) is the \(f_0\) candidate corresponding to the period \(\tau\), and \(g(\tau,h)\) is the weight of the \(h\)-th partial of period \(\tau\).

Iterative estimation

  1. Determine \(f_0=\mathop{\mathrm{argmax}}_{f} s(f)\)
  2. Remove found \(f_0\) from residual spectrum
  3. Repeat until saliences are low
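The salience function and the iterative loop above can be sketched together, assuming NumPy. This is a toy version: \(g(\tau,h)=1/h\) is my simplification of Klapuri's weights, and cancellation simply zeroes the detected partial bins rather than subtracting estimated amplitudes:

```python
import numpy as np

def salience(Y, fs, tau, H=5):
    """s(tau) = sum_h g(tau, h) * |Y(h * f_tau)|, with g = 1/h as a simple weight."""
    f_tau = fs / tau
    bin_width = fs / (2 * len(Y))            # Y holds magnitude bins for 0..fs/2
    s = 0.0
    for h in range(1, H + 1):
        k = int(round(h * f_tau / bin_width))
        if k < len(Y):
            s += (1.0 / h) * Y[k]
    return s

def iterative_f0s(Y, fs, taus, n_pitches, H=5):
    """Pick the strongest f0 candidate, cancel its partials, repeat."""
    Y = Y.copy()
    bin_width = fs / (2 * len(Y))
    found = []
    for _ in range(n_pitches):
        tau = taus[int(np.argmax([salience(Y, fs, t, H) for t in taus]))]
        found.append(fs / tau)
        for h in range(1, H + 1):            # crude cancellation of detected partials
            k = int(round(h * (fs / tau) / bin_width))
            if k < len(Y):
                Y[k] = 0.0
    return found

# Two harmonic combs (200 Hz and 300 Hz) in a 1000-bin spectrum at fs = 8000 Hz
fs = 8000
Y = np.zeros(1000)                            # bin k <-> k * 4 Hz
for f in (200, 400, 600, 800, 1000):
    Y[f // 4] += 1.0
for f in (300, 600, 900, 1200, 1500):
    Y[f // 4] += 0.8
taus = [fs / f for f in range(100, 401)]      # candidate periods for 100..400 Hz
found = iterative_f0s(Y, fs, taus, n_pitches=2)
```

Zeroing shared partials (here the 600 Hz bin) illustrates the error-accumulation weakness of iterative estimation mentioned earlier.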

Spectrogram Factorisation (NMF) (Smaragdis and Brown 2003)

  • Non-negative matrix factorisation is a well-established technique
  • Works best with harmonically fixed spectral profiles (such as piano notes)
  • Joint estimation method

\[\boldsymbol{X}\approx \boldsymbol{W}\boldsymbol{H}\]

  • Variables:
    • \(\boldsymbol{X}\in\mathbb{R}_+^{K\times N}\) input spectrogram
    • \(\boldsymbol{W}\in\mathbb{R}_+^{K\times R}\) spectral bases for each pitch component (template matrix)
    • \(\boldsymbol{H}\in\mathbb{R}_+^{R\times N}\) pitch activity across time (activation matrix)
  • Dimensions:
    • \(K\) number of frequency bins
    • \(N\) number of frames
    • \(R\) number of pitch components (rank), such that \(R\ll K\)
  • Cost function: \[C=\left\lVert\boldsymbol{X}- \boldsymbol{W}\boldsymbol{H}\right\rVert_F\]
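The factorisation can be demonstrated with the classic Lee–Seung multiplicative updates (one standard way to minimise the Frobenius cost; the toy templates and the `eps` guard are mine), assuming NumPy:

```python
import numpy as np

def nmf(X, R, n_iter=500, eps=1e-9, seed=0):
    """Minimise ||X - WH||_F with Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    K, N = X.shape
    W = rng.random((K, R)) + eps             # nonnegative init
    H = rng.random((R, N)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Two "notes" with fixed spectral templates, active in disjoint frames
W_true = np.array([[1.0, 0.0], [0.5, 0.2], [0.0, 1.0], [0.0, 0.5]])  # K=4, R=2
H_true = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])      # R=2, N=4
X = W_true @ H_true
W, H = nmf(X, R=2)
err = np.linalg.norm(X - W @ H)
```

The multiplicative form keeps \(\boldsymbol{W}\) and \(\boldsymbol{H}\) nonnegative throughout, which is exactly the constraint that makes the columns of \(\boldsymbol{W}\) interpretable as spectral templates.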

NMF concept

  • \(C=\left\lVert\boldsymbol{V}-\boldsymbol{W}\boldsymbol{H}\right\rVert_F\) is a nonconvex optimisation problem with respect to \(\boldsymbol{W}\) and \(\boldsymbol{H}\).
  • Let \(\boldsymbol{V}=(v_1,\ldots,v_N)\) and \(\boldsymbol{H}=(h_1,\ldots,h_N)\)
  • \(\boldsymbol{V}=\boldsymbol{W}\boldsymbol{H}\implies v_i = \boldsymbol{W}h_i\)
  • Impose orthogonality constraint \(\boldsymbol{H}\boldsymbol{H}^T=I\)
  • Obtain K-means clustering property

Application on polyphonic music decomposition

  • The rank \(R\) corresponds to the number of pitch components, which for a piano is the MIDI integer range from 20 to 109.
  • Reinforce a sparsity constraint on \(\boldsymbol{H}\)
  • Apply single pitch estimation on rows of \(\boldsymbol{H}\)

Temporal Segmentation

Onset estimation method

  1. Compute an Onset Detection Function
  2. Calculate a threshold function
  3. Peak-picking local maxima above threshold

Onset Detection Function (ODF)

  • Characterize change in energy or harmonic content in the signal
  • Difficult to identify in the time domain
  • Computed in the spectral domain using magnitude and/or phase
  • Onsets are detected from local maxima

High Frequency Content (HFC) (Masri and Bateman 1996)

\[D_{\text{HFC}}[n] = \sum\limits_{k=1}^{N} k\cdot\left\lVert X[n,k]\right\rVert^2\]

Favours wide-band energy bursts and high frequency components
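The bias towards high frequencies is visible on a toy frame, assuming NumPy (one frame only; a full ODF would compute this per STFT frame):

```python
import numpy as np

def hfc(frame):
    """High Frequency Content of one frame: sum_k k * |X[k]|^2."""
    X = np.fft.rfft(frame)
    k = np.arange(len(X))
    return np.sum(k * np.abs(X) ** 2)

fs, n = 8000, 256
t = np.arange(n) / fs
low = np.sin(2 * np.pi * 100.0 * t)     # energy concentrated in a low bin
high = np.sin(2 * np.pi * 3000.0 * t)   # same energy, high bin
```

Both frames carry the same energy, but the linear-in-\(k\) weighting makes `hfc(high)` far larger, which is why HFC responds strongly to percussive, wide-band attacks.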

Phase Deviation (Bello and Sandler 2003)

Evaluates phase difference \[D_{\Phi}[n] = \sum\limits_{k=0}^{N} \left\lvert \hat{\varphi}[n, k] \right\rvert\]

where

  • \(\mathrm{princarg}(\theta) = \pi + ((\theta + \pi) \bmod (-2\pi))\)
  • \(\varphi(t, f) = \mathrm{arg}(X(t, f))\)
  • \(\hat{\varphi}(t, f) = \mathrm{princarg} \left( \frac{\partial^2 \varphi}{\partial t^2}(t, f) \right)\)

Identifies tonal onsets and energy bursts

Complex Distance (Duxbury et al. 2003)

\[D_{\mathbb{C}}[n] = \sum\limits_{k=0}^{N} \left\lVert\hat{X}[n, k] - X[n, k]\right\rVert^2\]

where \(\hat{X}[n, k] = \left\lvert X[n, k] \right\rvert \cdot e^{j\hat{\varphi}[n, k]}\)

Quantifies both tonal onsets and percussive events by combining spectral difference and phase-based approaches.

Thresholding & Peak-picking

  • ODFs are usually sensitive to the slightest perturbations
  • Defining a threshold would eliminate insignificant peaks
  • Suggested threshold: moving average
  • Peak-picking: selecting peaks above the calculated threshold
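The three steps can be sketched on a toy ODF, assuming NumPy (the window length and additive bias are arbitrary tuning parameters, not values from the literature):

```python
import numpy as np

def pick_onsets(odf, window=5, bias=0.1):
    """Keep local maxima of the ODF that exceed a moving-average threshold."""
    pad = window // 2
    padded = np.pad(odf, pad, mode="edge")
    threshold = np.array([padded[i:i + window].mean()
                          for i in range(len(odf))]) + bias
    return [n for n in range(1, len(odf) - 1)
            if odf[n] > threshold[n]
            and odf[n] >= odf[n - 1] and odf[n] >= odf[n + 1]]

odf = np.array([0.0, 0.1, 1.0, 0.2, 0.05, 0.0, 0.9, 0.1, 0.0])
onsets = pick_onsets(odf)
```

The moving average adapts the threshold to the local energy level, so small ripples are rejected while both genuine peaks survive.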

Conclusion

Conclusion

  • Single pitch estimation obtains satisfactory results
  • Multi-pitch estimation remains an open problem
  • Promising results in onset detection

Thank you for your attention

References

Bello, J.P., and M. Sandler. 2003. “Phase-Based Note Onset Detection for Music Signals.” In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03)., 5:V–441. https://doi.org/10.1109/ICASSP.2003.1200001.

Benetos, Emmanouil, Simon Dixon, Dimitrios Giannoulis, Holger Kirchhoff, and Anssi Klapuri. 2013. “Automatic Music Transcription: Challenges and Future Directions.” Journal of Intelligent Information Systems 41 (December). https://doi.org/10.1007/s10844-013-0258-3.

Brossier, Paul M. 2006. Automatic Annotation of Musical Audio for Interactive Applications. PhD thesis, Queen Mary University of London.

Cheveigné, Alain de, and Hideki Kawahara. 2002. “YIN, a Fundamental Frequency Estimator for Speech and Music.” The Journal of the Acoustical Society of America 111 (4): 1917–30. https://doi.org/10.1121/1.1458024.

Duxbury, Chris, Juan Pablo Bello, Mike Davies, and Mark Sandler. 2003. “Complex Domain Onset Detection for Musical Signals,” 4.

Feynman, Richard. 1965. “The Feynman Lectures on Physics Vol. I Ch. 47: Sound. The Wave Equation.” https://www.feynmanlectures.caltech.edu/I_47.html.

Klapuri, Anssi. 2006. “Multiple Fundamental Frequency Estimation by Summing Harmonic Amplitudes,” 6.

Lahat, M., Russell J. Niederjohn, and David A. Krubsack. 1987. “A Spectral Autocorrelation Method for Measurement of the Fundamental Frequency of Noise-Corrupted Speech.” IEEE Trans. Acoustics, Speech, and Signal Processing. https://doi.org/10.1109/TASSP.1987.1165224.

Masri, Paul, and Andrew Bateman. 1996. “Improved Modelling of Attack Transients in Music Analysis-Resynthesis,” 4.

Ross, M., H. Shaffer, A. Cohen, R. Freudberg, and H. Manley. 1974. “Average Magnitude Difference Function Pitch Extractor.” IEEE Transactions on Acoustics, Speech, and Signal Processing 22 (5): 353–62. https://doi.org/10.1109/TASSP.1974.1162598.

Schroeder, Manfred R. 1968. “Period Histogram and Product Spectrum: New Methods for Fundamental-Frequency Measurement.” The Journal of the Acoustical Society of America. https://doi.org/10.1121/1.1910902.

Smaragdis, P., and J.C. Brown. 2003. “Non-Negative Matrix Factorization for Polyphonic Music Transcription.” In 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE Cat. No.03TH8684), 177–80. New Paltz, NY, USA: IEEE. https://doi.org/10.1109/ASPAA.2003.1285860.

Yeh, Chunghsin. 2008. “Multiple Fundamental Frequency Estimation of Polyphonic Recordings,” 153.