The Living Thing / Notebooks : Machine listening

See also musical corpora, musical metrics synchronisation, sparse basis dictionaries, speech recognition, learning gamelan, analysis/resynthesis, whatever other machine listening posts I forgot.

I’m not going to talk about speech recognition here; that boat is full.

Machine listening: machine learning from audio, especially musical audio. Everything from that damn Shazam app, to teaching computers to recognise speech, to doing artsy shit with sound. I’m mostly concerned with the third one: statistics, features, descriptors, metrics, kernels and affinities, and the spaces and topologies they induce, for musical audio, e.g. your MP3 pop song library. This has considerable overlap with musical metrics, but there I start from scores and transcriptions.

Polyphony and its problems.

Approximate logarithmic perception and its problems.

Should I do a separate psychoacoustics notebook?

Interesting descriptors/features

Audio summaries that attempt to turn raw signals into useful feature vectors reflective of human perception of them. This is a huge industry, because it makes audio convenient for transmission (hello, mobile telephony and MP3), but it’s also useful for understanding speech, music etc. There are as many descriptors as there are IEEE conference slots.

See AlSS16 for an intimidatingly comprehensive summary.

I’m especially interested in

  1. invertible ones, for analysis/resynthesis. If not analytically invertible, convexity would do.
  2. differentiable ones, for leveraging artificial neural infrastructure for easy optimisation.
  3. ones that avoid the windowed DFT (i.e. the STFT), because it sounds awful in the resynthesis phase and is lazy.
  4. ones that capture both the harmonic and percussive parts.

Question: can we construct new ones from compressive system identification, as in Carm13?

Also, ones that can encode noisiness in the signal as well as harmonicity…? I guess I should read AlSS16.
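For instance, the spectral centroid is about the simplest such descriptor, and it ticks the differentiability box: it is a smooth function of the input samples (everywhere the spectrum is nonzero). A minimal numpy sketch, with a Hann window to tame leakage:

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Spectral centroid of a windowed frame: the magnitude-weighted
    mean frequency. A smooth function of the input, hence usable as a
    differentiable feature."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.sum(freqs * mag) / np.sum(mag)

# A pure tone's centroid sits near the tone's frequency.
sr = 22050
t = np.arange(2048) / sr
tone = np.sin(2 * np.pi * 440 * t)
print(spectral_centroid(tone, sr))  # ≈ 440
```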

Deep neural networks

See, e.g. Jordi Pons’ Spectrogram CNN discussion for some introductions to the kind of features a neural network might “discover” in audio recognition tasks.

There is some interesting stuff here; for example, Dieleman and Schrauwen (DiSc14) show that convolutional neural networks trained on raw audio (i.e. not spectrograms) for music classification recover Mel-like frequency bands. Thickstun et al (ThHK16) do some similar work.

And Keunwoo Choi shows that you can listen to what they learn.

Sparse comb filters

Differentiable! Conditionally invertible! Handy for syncing.

Moorer (Moor74) proposed this for harmonic purposes, but Robertson et al (RoSP11) have shown it to be handy for rhythm.
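The mechanism is easy to see in a toy numpy sketch: a feedback comb filter resonates when its delay matches a periodicity in the input, so sweeping the delay (or running a bank of filters, as in RoSP11’s comb filter matrix) reveals pitch or tempo. This is an illustrative sketch, not either paper’s exact formulation:

```python
import numpy as np

def comb_energy(x, lag, g=0.9):
    """Run x through a feedback comb filter y[n] = x[n] + g*y[n-lag]
    and return the output energy. Resonates when lag matches a
    periodicity in x."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + (g * y[n - lag] if n >= lag else 0.0)
    return np.sum(y ** 2)

# An impulse train with period 50 excites the lag-50 comb most strongly.
x = np.zeros(1000)
x[::50] = 1.0
energies = {lag: comb_energy(x, lag) for lag in (30, 40, 50, 60)}
print(max(energies, key=energies.get))  # → 50
```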

Autocorrelation features

Measure the signal’s full or partial autocorrelation with itself.
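In numpy the full autocorrelation is cheapest via the Wiener–Khinchin route: the inverse FFT of the power spectrum, zero-padded so the circular correlation becomes a linear one. A sketch:

```python
import numpy as np

def autocorr(x):
    """Full autocorrelation via the Wiener-Khinchin theorem:
    the inverse FFT of the power spectrum, zero-padded to avoid
    circular wrap-around."""
    n = len(x)
    X = np.fft.rfft(x, 2 * n)          # zero-pad to 2n for linear correlation
    r = np.fft.irfft(np.abs(X) ** 2)[:n]
    return r / r[0]                     # normalise so r[0] == 1

# The first strong peak after lag 0 estimates the period.
sr = 8000
t = np.arange(1024) / sr
x = np.sin(2 * np.pi * 200 * t)         # 200 Hz -> period of 40 samples
r = autocorr(x)
print(np.argmax(r[20:80]) + 20)         # → 40
```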

Linear Predictive coefficients

How do these transform?

I think this always hinges on Skorohod embedding.
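For reference, a minimal sketch of the autocorrelation method: solve the Yule–Walker equations for the predictor coefficients (production implementations use the Levinson–Durbin recursion rather than a dense solve):

```python
import numpy as np

def lpc(x, order):
    """Linear predictive coefficients via the autocorrelation method:
    solve the Yule-Walker equations R a = r for the predictor
    x[n] ~ sum_k a[k] * x[n-k]."""
    n = len(x)
    r = np.array([x[: n - k] @ x[k:] for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])

# An AR(2) process should be recovered (approximately) by order-2 LPC.
rng = np.random.default_rng(0)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for n in range(2, len(x)):
    x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + e[n]
a = lpc(x, 2)
print(np.round(a, 1))  # ≈ [1.3, -0.6]
```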


Cepstra

Classic, but inconvenient to invert.


Mel-frequency cepstral coefficients

Take the perfectly lovely-if-fiddly cepstrum and make it really messy, with a vague psychoacoustic model, in the hope that MFCC distinctions correspond to human perceptual distinctions. Folk wisdom holds they are also Eurocentric, in that they destroy, or at least obscure, tonal-language features. Ubiquitous, but inconsistently implemented; MFCCs are generally not the same across implementations, probably because the Mel scale is itself not universally standardised.

Aside from being loosely psychoacoustically motivated features, what do the coefficients of an MFCC specifically tell me?

Hmm. If I have got this right, these are “generic features”; things we can use in machine learning because we hope they project the spectrum into a space which approximately preserves psychoacoustic dissimilarity, whilst having little redundancy.

This heuristic pro is weighed against the practical cons: they are not practically differentiable, nor invertible except by heroic computational effort, nor humanly interpretable.
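For concreteness, here is a bare-bones MFCC pipeline in numpy: power spectrum, triangular mel filterbank, log, DCT-II. The filterbank layout and constants here are one common choice among many, which is exactly the standardisation problem noted above:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr, n_mels=26, n_ceps=13):
    """Bare-bones MFCCs of one frame: power spectrum -> triangular
    mel filterbank -> log -> DCT-II, keeping low-order coefficients."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Filter centres equally spaced on the mel scale from 0 to Nyquist.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, mid, hi = edges[i : i + 3]
        fb[i] = np.clip(
            np.minimum((freqs - lo) / (mid - lo), (hi - freqs) / (hi - mid)),
            0.0, None,
        )
    logmel = np.log(fb @ power + 1e-10)
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n_mels)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n_mels)) @ logmel

sr = 16000
t = np.arange(512) / sr
frame = np.hanning(512) * np.sin(2 * np.pi * 440 * t)
print(mfcc(frame, sr).shape)  # → (13,)
```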


Filterbanks

Including bandpasses, gammatones… random filterbanks?

Dynamic dictionaries

See sparse basis dictionaries.

Cochlear activation models



ERBs, mels, sones, phons…
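The conversion formulas themselves are short. A sketch of some common parameterisations (noting that the mel scale in particular has several in circulation):

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale, in the common O'Shaughnessy parameterisation
    (one of several in circulation)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def hz_to_erb_rate(f):
    """ERB-rate scale (Glasberg & Moore): the number of equivalent
    rectangular bandwidths below frequency f (Hz)."""
    return 21.4 * np.log10(1.0 + 4.37 * f / 1000.0)

def phon_to_sone(p):
    """Loudness in sones from loudness level in phons: 40 phons is
    1 sone, and each extra 10 phons doubles the loudness."""
    return 2.0 ** ((p - 40.0) / 10.0)

print(round(hz_to_mel(1000)))   # → 1000 (1 kHz is 1000 mels by construction)
print(phon_to_sone(50))         # → 2.0
```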


Here are some options for doing machine listening in practice.

For my part, I find it congenial to use Python and SuperCollider together: Python for the tricky offline computations, SuperCollider for the more “live”, realtime stuff. This feels like it gets me the best of each of those worlds, and especially of their development communities. YMMV.

Spectral peak tracking
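The name more or less is the algorithm: find local maxima in the magnitude spectrum, refine each with parabolic interpolation, then link peaks across frames. A single-frame numpy sketch (the threshold and window choice here are arbitrary):

```python
import numpy as np

def spectral_peaks(frame, sr, thresh_db=-25.0):
    """Naive spectral peak picking: local maxima of the magnitude
    spectrum above a threshold, with parabolic interpolation of the
    peak bin for sub-bin frequency estimates."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    db = 20 * np.log10(mag / mag.max() + 1e-12)
    peaks = []
    for k in range(1, len(mag) - 1):
        if db[k] > thresh_db and mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]:
            # Parabolic interpolation around the peak bin.
            a, b, c = db[k - 1], db[k], db[k + 1]
            offset = 0.5 * (a - c) / (a - 2 * b + c)
            peaks.append((k + offset) * sr / len(frame))
    return peaks

sr = 16000
t = np.arange(2048) / sr
frame = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
print([round(f) for f in spectral_peaks(frame, sr)])  # → [440, 880]
```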

Other specialist tools

Large-Scale Content-Based Matching of MIDI and Audio Files:

MIDI files, when paired with corresponding audio recordings, can be used as ground truth for many music information retrieval tasks. We present a system which can efficiently match and align MIDI files to entries in a large corpus of audio content based solely on content, i.e., without using any metadata. The core of our approach is a convolutional network-based cross-modality hashing scheme which transforms feature matrices into sequences of vectors in a common Hamming space. Once represented in this way, we can efficiently perform large-scale dynamic time warping searches to match MIDI data to audio recordings. We evaluate our approach on the task of matching a huge corpus of MIDI files to the Million Song Dataset.
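The dynamic time warping step in that abstract is, at its core, a short dynamic program. A plain-vanilla sketch (their system runs this over hashed binary vectors; here it is ordinary Euclidean cost):

```python
import numpy as np

def dtw_distance(a, b):
    """Plain dynamic time warping distance between two feature
    sequences (rows = time steps): the minimum-cost monotone
    alignment, found by dynamic programming."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A time-warped copy of a sequence aligns more cheaply than noise does.
x = np.sin(np.linspace(0, 6, 50))[:, None]
warped = np.sin(np.linspace(0, 6, 80))[:, None]   # same curve, stretched
rng = np.random.default_rng(1)
noise = rng.standard_normal((80, 1))
print(dtw_distance(x, warped) < dtw_distance(x, noise))  # → True
```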

See also Dannenberg’s bibliographies on score following.


Abe, T., Kobayashi, T., & Imai, S. (1995) Harmonics tracking and pitch extraction based on instantaneous frequency. In International Conference on Acoustics, Speech, and Signal Processing, 1995. ICASSP-95 (Vol. 1, pp. 756–759 vol.1). DOI.
Alías, F., Socoró, J. C., & Sevillano, X. (2016) A Review of Physical and Perceptual Feature Extraction Techniques for Speech, Music and Environmental Sounds. Applied Sciences, 6(5), 143. DOI.
Anglade, A., Benetos, E., Mauch, M., & Dixon, S. (2010) Improving Music Genre Classification Using Automatically Induced Harmony Rules. Journal of New Music Research, 39, 349–361. DOI.
Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., & Lamere, P. (2011) The Million Song Dataset. In 12th International Society for Music Information Retrieval Conference (ISMIR 2011).
Bigand, E., Parncutt, R., & Lerdahl, F. (1996) Perception of musical tension in short chord sequences: The influence of harmonic function, sensory dissonance, horizontal motion, and musical training. Perception & Psychophysics, 58(1), 125–141. DOI.
Blei, D. M., Cook, P. R., & Hoffman, M. (2010) Bayesian nonparametric matrix factorization for recorded music. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 439–446).
Boulanger-Lewandowski, N., Bengio, Y., & Vincent, P. (2012) Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. In 29th International Conference on Machine Learning.
Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M.(2016) Time series analysis: forecasting and control. (Fifth edition.). Hoboken, New Jersey: John Wiley & Sons, Inc
Carabias-Orti, J. J., Virtanen, T., Vera-Candeas, P., Ruiz-Reyes, N., & Canadas-Quesada, F. J.(2011) Musical Instrument Sound Multi-Excitation Model for Non-Negative Spectrogram Factorization. IEEE Journal of Selected Topics in Signal Processing, 5(6), 1144–1158. DOI.
Carmi, A. Y.(2013) Compressive system identification: Sequential methods and entropy bounds. Digital Signal Processing, 23(3), 751–770. DOI.
Carter, G. C.(1987) Coherence and time delay estimation. Proceedings of the IEEE, 75(2), 236–255. DOI.
Choi, K., Fazekas, G., & Sandler, M. (2016) Explaining Deep Convolutional Neural Networks on Music Classification. arXiv:1607.02444 [Cs].
Cochran, W. T., Cooley, J. W., Favin, D. L., Helms, H. D., Kaenel, R. A., Lang, W. W., … Welch, P. D.(1967) What is the fast Fourier transform?. Proceedings of the IEEE, 55(10), 1664–1674. DOI.
Cooley, J. W., Lewis, P. A. W., & Welch, P. D.(1970) The application of the fast Fourier transform algorithm to the estimation of spectra and cross-spectra. Journal of Sound and Vibration, 12(3), 339–352. DOI.
Demopoulos, R. J., & Katchabaw, M. J.(2007) Music Information Retrieval: A Survey of Issues and Approaches. . Technical Report
Dieleman, S., & Schrauwen, B. (2014) End-to-end learning for music audio. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6964–6968). IEEE DOI.
Dörfler, M., Velasco, G., Flexer, A., & Klien, V. (2010) Sparse Regression in Time-Frequency Representations of Complex Audio.
Du, P., Kibbe, W. A., & Lin, S. M.(2006) Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching. Bioinformatics, 22(17), 2059–2065. DOI.
Fitzgerald, D. (2010) Harmonic/percussive separation using median filtering.
Flamary, R., Févotte, C., Courty, N., & Emiya, V. (2016) Optimal spectral transportation with application to music transcription. In arXiv:1609.09799 [cs, stat] (pp. 703–711). Curran Associates, Inc.
Fu, Z., Lu, G., Ting, K. M., & Zhang, D. (2011) A Survey of Audio-Based Music Classification and Annotation. IEEE Transactions on Multimedia, 13(2), 303–319. DOI.
Glover, J. C., Lazzarini, V., & Timoney, J. (2009) Simpl: A Python library for sinusoidal modelling. In DAFx 09 proceedings of the 12th International Conference on Digital Audio Effects, Politecnico di Milano, Como Campus, Sept. 1-4, Como, Italy (pp. 1–4). Dept. of Electronic Engineering, Queen Mary Univ. of London,
Godsill, S., & Davy, M. (2005) Bayesian computational models for inharmonicity in musical instruments. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005 (pp. 283–286). IEEE DOI.
Gómez, E., & Herrera, P. (2004) Estimating The Tonality Of Polyphonic Audio Files: Cognitive Versus Machine Learning Modelling Strategies. In ISMIR.
Grosche, P., Muller, M., & Kurth, F. (2010) Cyclic Tempogram - a Mid-level Tempo Representation For Music Signals. In 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (pp. 5522–5525). Piscataway, NJ.: IEEE DOI.
Grosche, Peter, Müller, M., & Sapp, C. S.(2010) What makes beat tracking difficult? a case study on Chopin mazurkas. In Proceedings of the International Conference on Music Information Retrieval (ISMIR 2010).
Harris, F. J.(1978) On the use of windows for harmonic analysis with the discrete Fourier transform. Proceedings of the IEEE, 66(1), 51–83.
Hermes, D. J.(1988) Measurement of pitch by subharmonic summation. The Journal of the Acoustical Society of America, 83(1), 257–264. DOI.
Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, R. C., … Wilson, K. (2016) CNN Architectures for Large-Scale Audio Classification. arXiv:1609.09430 [Cs, Stat].
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., … Kingsbury, B. (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82–97. DOI.
Hoffman, M., Bach, F. R., & Blei, D. M.(2010) Online learning for latent dirichlet allocation. In advances in neural information processing systems (pp. 856–864).
Irizarry, R. A.(2001) Local Harmonic Estimation in Musical Sound Signals. Journal of the American Statistical Association, 96(454), 357–367. DOI.
Joël Bensoam, & David Roze. (2013) Solving interactions between nonlinear resonators. In Proceedings of the Sound and Music Computing Conference.
Kailath, T., Sayed, A. H., & Hassibi, B. (2000) Linear estimation. . Upper Saddle River, N.J: Prentice Hall
Kalouptsidis, N., Mileounis, G., Babadi, B., & Tarokh, V. (2011) Adaptive algorithms for sparse system identification. Signal Processing, 91(8), 1910–1919. DOI.
Kereliuk, C., Depalle, P., & Pasquier, P. (2013) Audio Interpolation and Morphing via Structured-Sparse Linear Regression.
Lahat, M., Niederjohn, R. J., & Krubsack, D. (1987) A spectral autocorrelation method for measurement of the fundamental frequency of noise-corrupted speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(6), 741–750. DOI.
Ljung, L. (1999) System identification: theory for the user. (2nd ed.). Upper Saddle River, NJ: Prentice Hall PTR
Luo, Y., Chen, Z., Hershey, J. R., Roux, J. L., & Mesgarani, N. (2016) Deep Clustering and Conventional Networks for Music Separation: Stronger Together. arXiv:1611.06265 [Cs, Stat].
Makhoul, J., Kubala, F., Schwartz, R., & Weischedel, R. (1999) Performance Measures For Information Extraction. In In Proceedings of DARPA Broadcast News Workshop (pp. 249–252).
Marchand, U., & Peeters, G. (2014) The Modulation Scale Spectrum And Its Application To Rhythm-Content Analysis. In DAFX (Digital Audio Effects). Erlangen, Germany
Maxwell, J. B., Pasquier, P., & Whitman, B. (2009) Hierarchical Sequential Memory for Music: A Cognitive Model. In Proceedings of the tenth International Society for Music Information Retrieval Conference (ISMIR 2009) (pp. 429–434).
McFee, B., & Ellis, D. P.(2011) Analyzing song structure with spectral clustering. In IEEE conference on Computer Vision and Pattern Recognition (CVPR).
Mesaros, A., Heittola, T., & Virtanen, T. (2016) Metrics for Polyphonic Sound Event Detection. Applied Sciences, 6(6), 162. DOI.
Moorer, J. A. (1974) The optimum comb method of pitch period analysis of continuous digitized speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 22(5), 330–338. DOI.
Müller, M., Ellis, D. P. W., Klapuri, A., & Richard, G. (2011) Signal Processing for Music Analysis. IEEE Journal of Selected Topics in Signal Processing, 5(6), 1088–1110. DOI.
Müller, Meinard, & Driedger, J. (2012) Data-Driven Sound Track Generation. In Multimodal Music Processing (Vol. 3, pp. 175–194). Dagstuhl, Germany: Schloss Dagstuhl—Leibniz-Zentrum für Informatik
Murthy, H. A., & Gadde, V. (2003) The modified group delay function and its application to phoneme recognition. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03) (Vol. 1, p. I-68-71 vol.1). DOI.
Nussbaum-Thom, M., Cui, J., Ramabhadran, B., & Goel, V. (2016) Acoustic Modeling Using Bidirectional Gated Recurrent Convolutional Units. (pp. 390–394). DOI.
Parncutt, R. (1997) A model of the perceptual root(s) of a chord accounting for voicing and prevailing tonality. In M. Leman (Ed.), Music, Gestalt, and Computing (pp. 181–199). Springer Berlin Heidelberg
Paulus, J., Müller, M., & Klapuri, A. (2010) Audio-Based Music Structure Analysis. In ISMIR (pp. 625–636). ISMIR
Phan, H., Hertel, L., Maass, M., & Mertins, A. (2016) Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks. In Interspeech 2016.
Pickens, J. (2004) Harmonic modeling for polyphonic music retrieval. . Citeseer
Pickens, J., & Iliopoulos, C. S.(2005) Markov Random Fields and Maximum Entropy Modeling for Music Information Retrieval. In ISMIR (pp. 207–214). Citeseer
Plomp, R., & Levelt, W. J.(1965) Tonal consonance and critical bandwidth. The Journal of the Acoustical Society of America, 38(4), 548–560. DOI.
Pons, J., Lidy, T., & Serra, X. (2016) Experimenting with musically motivated convolutional neural networks. In Content-Based Multimedia Indexing (CBMI), 2016 14th International Workshop on (pp. 1–6). IEEE DOI.
Robertson, A. N., & Plumbley, M. D.(2006) Real-time Interactive Musical Systems: An Overview. Proc. of the Digital Music Research Network, Goldsmiths University, London, 65–68.
Robertson, A., & Plumbley, M. (2007) B-Keeper: A Beat-tracker for Live Performance. In Proceedings of the 7th International Conference on New Interfaces for Musical Expression (pp. 234–237). New York, NY, USA: ACM DOI.
Robertson, A., & Plumbley, M. D.(2013) Synchronizing Sequencing Software to a Live Drummer. Computer Music Journal, 37(2), 46–60. DOI.
Robertson, A., Stark, A., & Davies, M. E.(2013) Percussive Beat tracking using real-time median filtering. . Presented at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
Robertson, A., Stark, A. M., & Plumbley, M. D.(2011) Real-Time Visual Beat Tracking Using a Comb Filter Matrix.
Robertson, Andrew N. (2011) A Bayesian approach to drum tracking.
Rochebois, T., & Charbonneau, G. (1997) Cross-synthesis using interverted principal harmonic sub-spaces. In M. Leman (Ed.), Music, Gestalt, and Computing (pp. 375–385). Springer Berlin Heidelberg
Sainath, T. N., & Li, B. (2016) Modeling Time-Frequency Patterns with LSTM vs Convolutional Architectures for LVCSR Tasks. Submitted to Proc. Interspeech.
Salamon, J., Gomez, E., Ellis, D. P., & Richard, G. (2014) Melody Extraction from Polyphonic Music Signals: Approaches, applications, and challenges. IEEE Signal Processing Magazine, 31(2), 118–134. DOI.
Salamon, J., Serrà, J., & Gómez, E. (2013) Tonal representations for music retrieval: from version identification to query-by-humming. International Journal of Multimedia Information Retrieval, 2(1), 45–58. DOI.
Schlüter, J., & Böck, S. (2014) Improved musical onset detection with Convolutional Neural Networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6979–6983). DOI.
Schmidt, E. M., & Kim, Y. E.(2011) Learning emotion-based acoustic features with deep belief networks. In 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (pp. 65–68). DOI.
Scholler, S., & Purwins, H. (2011) Sparse Approximations for Drum Sound Classification. IEEE Journal of Selected Topics in Signal Processing, 5(5), 933–940. DOI.
Sergé, A., Bertaux, N., Rigneault, H., & Marguet, D. (2008) Dynamic multiple-target tracing to probe spatiotemporal cartography of cell membranes. Nature Methods, 5(8), 687–694. DOI.
Serrà, J., Corral, Á., Boguñá, M., Haro, M., & Arcos, J. L.(2012) Measuring the Evolution of Contemporary Western Popular Music. Scientific Reports, 2. DOI.
Smith, E. C., & Lewicki, M. S.(2004) Learning efficient auditory codes using spikes predicts cochlear filters. In Advances in Neural Information Processing Systems (pp. 1289–1296).
Smith, E. C., & Lewicki, M. S.(2006) Efficient auditory coding. Nature, 439(7079), 978–982. DOI.
Smith, E., & Lewicki, M. S.(2005) Efficient Coding of Time-Relative Structure Using Spikes. Neural Computation, 17(1), 19–45. DOI.
Smyth, T., & Elmore, A. R.(2009) Explorations in convolutional synthesis. In Proceedings of the 6th Sound and Music Computing Conference, Porto, Portugal (pp. 23–25).
Terhardt, E., Stoll, G., & Seewann, M. (1982) Algorithm for extraction of pitch and pitch salience from complex tonal signals. The Journal of the Acoustical Society of America, 71(3), 679–688. DOI.
Thickstun, J., Harchaoui, Z., & Kakade, S. (2016) Learning Features of Music from Scratch. arXiv:1611.09827 [Cs, Stat].
Welch, P. D.(1967) The use of fast Fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics, 15(2), 70–73.
Yang, C., He, Z., & Yu, W. (2009) Comparison of public peak detection algorithms for MALDI mass spectrometry data analysis. BMC Bioinformatics, 10, 4. DOI.
Yoshii, K., & Goto, M. (2012) Infinite Composite Autoregressive Models for Music Signal Analysis.