Machine listening

See also musical corpora, musical metrics, synchronisation, sparse basis dictionaries, speech recognition, learning gamelan, analysis/resynthesis, and whatever other machine listening posts I forgot.

I’m not going to talk about speech recognition here; that boat is full.

Machine listening: machine learning from audio, mostly music. Everything from that damn Shazam app, to teaching computers to recognise speech, to doing artsy shit with sound. I’m mostly concerned with the third one: statistics, features, descriptors, metrics, kernels and affinities, and the spaces and topologies they induce, for musical audio, e.g. your MP3 pop song library. This has considerable overlap with musical metrics, but there I start from scores and transcriptions.

Polyphony and its problems.

Approximate logarithmic perception and its problems.

Should I create a separate psychoacoustics notebook?

Interesting descriptors/features

Audio summaries that attempt to turn raw signals into useful feature vectors reflective of human perception of them. This is a huge industry, because it makes audio convenient for transmission (hello mobile telephony, MP3). But it’s also useful for understanding speech, music etc. There are as many descriptors as there are IEEE conference slots.

See AlSS16 for an intimidatingly comprehensive summary.

I’m especially interested in

  1. invertible ones, for analysis/resynthesis. If not analytically invertible, convexity would do.
  2. differentiable ones, for leveraging artificial neural infrastructure for easy optimisation.
  3. ones that avoid windowed DTFT, because it sounds awful in the resynthesis phase and is lazy.
  4. ones that capture both the harmonic and percussive parts.

Also, ones that can encode noisiness in the signal as well as harmonicity…? I guess I should read AlSS16.

Deep neural networks

See, e.g. Jordi Pons’ Spectrogram CNN discussion for some introductions to the kind of features a neural network might “discover” in audio recognition tasks.

There is some interesting stuff here; for example, Dieleman and Schrauwen (DiSc14) show that convolutional neural networks trained on raw audio (i.e. not spectrograms) for music classification recover Mel-like frequency bands. Thickstun et al (ThHK16) do some similar work.

And Keunwoo Choi shows that you can listen to what they learn.

Sparse comb filters

Differentiable! Conditionally invertible! Handy for syncing.

Moorer (Moor74) proposed this for harmonic purposes, but Robertson et al (RoSP11) have shown it to be handy for rhythm.
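A minimal sketch of the comb-filter idea, which is not a reproduction of Moorer’s or Robertson et al’s actual algorithms: run a feedforward comb y[n] = x[n] - g·x[n-D] for each candidate delay D and keep the delay that leaves the least energy. The onset envelope is made up for illustration and the function name is my own. Pure standard-library Python so that it runs anywhere; in practice you would vectorise it.

```python
def comb_residual_energy(x, delay, gain=1.0):
    """Mean energy of the feedforward comb y[n] = x[n] - gain * x[n - delay]."""
    residual = [x[n] - gain * x[n - delay] for n in range(delay, len(x))]
    return sum(r * r for r in residual) / len(residual)

# Hypothetical onset-strength envelope: a decaying hit every 12 frames.
hit = {0: 1.0, 1: 0.5, 2: 0.25}
envelope = [hit.get(n % 12, 0.0) for n in range(96)]

# Scan candidate delays. Multiples of the true period null the signal too;
# min() returns the first minimum, which here is the shortest such delay.
candidates = range(2, 41)
best_delay = min(candidates, key=lambda d: comb_residual_energy(envelope, d))
print(best_delay)
```

On this toy envelope the scan should settle on a delay of 12 frames. The residual energy is a sum of squares of terms linear in the gain and in the samples, so it is differentiable in both; the delay itself is an integer here, which is where the "conditionally" comes in.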

Autocorrelation features

Measure the signal’s full or partial autocorrelation.
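As a rough illustration, here is the biased sample autocorrelation of a toy two-partial tone, used to read off the pitch period. The signal, the sample rate and the minimum-lag cutoff are all assumptions of mine; this is a sketch in plain Python, not a robust pitch tracker.

```python
import math

def autocorrelation(x, max_lag):
    """Biased sample autocorrelation r[0..max_lag], normalised so r[0] == 1."""
    n = len(x)
    mean = sum(x) / n
    centred = [v - mean for v in x]
    r0 = sum(v * v for v in centred)
    return [
        sum(centred[i] * centred[i + lag] for i in range(n - lag)) / r0
        for lag in range(max_lag + 1)
    ]

# Toy signal: 100 Hz fundamental plus one overtone at 8 kHz sample rate,
# so the period is 80 samples. 960 samples is exactly 12 periods.
sr, f0 = 8000, 100.0
x = [
    math.sin(2 * math.pi * f0 * t / sr) + 0.5 * math.sin(2 * math.pi * 2 * f0 * t / sr)
    for t in range(960)
]

r = autocorrelation(x, max_lag=200)
# Skip the shortest lags, where r is trivially high, then take the strongest peak.
min_lag = 20
period = max(range(min_lag, len(r)), key=lambda lag: r[lag])
print(period, sr / period)
```

The strongest peak past the minimum lag lands at 80 samples, i.e. 100 Hz. The biased estimator tapers with lag, which conveniently favours the first period over its multiples.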

Linear Predictive coefficients

How do these transform?

I think this hinges always on Skorohod embedding.
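For reference, linear predictive coefficients are usually estimated by solving the Yule-Walker equations on the autocorrelation sequence, classically with the Levinson-Durbin recursion. A sketch under that assumption; the function name and the AR(2) example values are mine, chosen so the answer can be checked by hand.

```python
def levinson_durbin(r, order):
    """Solve the Yule-Walker equations by the Levinson-Durbin recursion.

    r: autocorrelation sequence r[0..order].
    Returns (a, err), where x[n] is predicted as sum_k a[k] * x[n - 1 - k]
    and err is the remaining prediction-error power.
    """
    a = []
    err = r[0]
    for m in range(order):
        # Reflection coefficient for stepping from order m to order m + 1.
        acc = r[m + 1] - sum(a[k] * r[m - k] for k in range(m))
        k_m = acc / err
        # Update the existing coefficients, then append the new one.
        a = [a[k] - k_m * a[m - 1 - k] for k in range(m)] + [k_m]
        err *= 1.0 - k_m * k_m
    return a, err

# Autocorrelation of the AR(2) process x[n] = 0.5 x[n-1] - 0.25 x[n-2] + noise,
# normalised so that r[0] = 1 (worked out from the Yule-Walker equations).
r = [1.0, 0.4, -0.05]
coeffs, err = levinson_durbin(r, order=2)
print(coeffs, err)
```

Feeding in the autocorrelation of a known AR(2) process recovers its coefficients, 0.5 and -0.25, up to rounding.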


Classic, but inconvenient to invert.


Mel-frequency Cepstral Coefficients, or Mel Cepstral transform. Take the perfectly respectable-if-fiddly cepstrum and make it really messy, with a vague psychoacoustic model, in the hope that the distinctions in the resulting “MFCC” might correspond to human perceptual distinctions.

Folk wisdom holds that MFCC features are Eurocentric, in that they destroy, or at least obscure, tonal language features. Ubiquitous, but inconsistently implemented; MFCCs are generally not the same across implementations, probably because the Mel scale is itself not universally standardised.
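To make the inconsistency concrete, here are two Hz-to-Mel conventions in common circulation, usually labelled "HTK" and "Slaney". The formulas are the standard ones as far as I know; the choice of 10 bands between 0 and 8 kHz is an arbitrary assumption for illustration.

```python
import math

def hz_to_mel_htk(f):
    """HTK-style formula: logarithmic everywhere."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz_htk(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Slaney-style constants: linear below 1 kHz, logarithmic above.
F_SP = 200.0 / 3.0
MIN_LOG_HZ = 1000.0
MIN_LOG_MEL = MIN_LOG_HZ / F_SP
LOGSTEP = math.log(6.4) / 27.0

def hz_to_mel_slaney(f):
    if f < MIN_LOG_HZ:
        return f / F_SP
    return MIN_LOG_MEL + math.log(f / MIN_LOG_HZ) / LOGSTEP

def mel_to_hz_slaney(m):
    if m < MIN_LOG_MEL:
        return m * F_SP
    return MIN_LOG_HZ * math.exp(LOGSTEP * (m - MIN_LOG_MEL))

def band_centres(to_mel, to_hz, fmin, fmax, n_bands):
    """Centre frequencies (Hz) of n_bands filters spaced evenly on a Mel axis."""
    lo, hi = to_mel(fmin), to_mel(fmax)
    step = (hi - lo) / (n_bands + 1)
    return [to_hz(lo + step * i) for i in range(1, n_bands + 1)]

htk = band_centres(hz_to_mel_htk, mel_to_hz_htk, 0.0, 8000.0, 10)
slaney = band_centres(hz_to_mel_slaney, mel_to_hz_slaney, 0.0, 8000.0, 10)
for h, s in zip(htk, slaney):
    print(f"{h:8.1f} Hz   {s:8.1f} Hz")
```

The two columns disagree by tens of Hz in the low bands, before any differences in filter shape, normalisation or DCT convention are even considered.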

Aside from being loosely psychoacoustically motivated features, what do the coefficients of an MFCC specifically tell me?

Hmm. If I have got this right, these are “generic features”; things we can use in machine learning because we hope they project the spectrum into a space which approximately preserves psychoacoustic dissimilarity, whilst having little redundancy.

This heuristic pro is weighed against the practical con that they are not practically differentiable, nor invertible except by heroic computational effort, nor humanly interpretable, and they are riven with poorly-supported, somewhat arbitrary steps. (The cepstrum of the Mel-frequency spectrogram is a weird thing that no longer picks out harmonics in the way that God and Tukey intended.)
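For contrast, here is what the plain cepstrum does on a toy pitched signal: the inverse transform of the log magnitude spectrum puts a peak at the pitch period. The naive DFT, the impulse-train input, the log floor and the search range are all assumptions made to keep the sketch dependency-free and checkable.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (unnormalised)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    return [
        sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def real_cepstrum(x, floor=1e-9):
    """Inverse DFT of the log magnitude spectrum; floor avoids log(0)."""
    log_mag = [math.log(abs(c) + floor) for c in dft(x)]
    n = len(x)
    return [c.real / n for c in dft(log_mag, inverse=True)]

# Impulse train with a period of 32 samples: a crude stand-in for a pitched sound.
period = 32
x = [1.0 if t % period == 0 else 0.0 for t in range(256)]

c = real_cepstrum(x)
# Search a plausible range of pitch periods (quefrencies) for the peak.
peak = max(range(16, 49), key=lambda q: c[q])
print(peak)
```

The peak sits at quefrency 32, the period of the impulse train. Warping the frequency axis onto a Mel scale before the log and the inverse transform destroys the even spacing of the harmonics that this trick relies on.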


Including bandpasses, gammatones… Random filterbanks?

Dynamic dictionaries

See sparse basis dictionaries.

Cochlear activation models



Erbs, Mels, Sones, Phones…


Here are some options for doing it:


musicbricks is an umbrella project to unify (sometimes post hoc) many of the efforts mentioned individually below, plus a few other new ones.

  • Fraunhofer ML software (C++) is part of this project, including such things as

    • Real-Time Pitch Detection
    • MusicBricksTranscriber
    • Goatify Pdf
    • Time Stretch Pitch Shift Library


LibROSA I have been using a lot recently, and I highly recommend it, especially if your pipeline already includes Python. Sleek minimal design, with a curated set of algorithms (compare and contrast with the chaos of the vamp plugins ecosystem). Python-based, but fast enough because it uses the NumPy numerical libraries. The API design meshes well with scikit-learn, the de facto Python machine learning standard, and it’s flexible and hackable.

  • see also talkbox for a nice-looking but abandoned (?) alternative, which is nonetheless worth it for Alexander Schindler’s lovely MIR lecture based around it.
  • amen is a remix program built on librosa


SonicAnnotator seems to be about cobbling together vamp plugins for batch analysis. That is more steps than I want in an already clunky workflow in my current projects. It’s also more about RDF ontologies, where I want matrices of floats.


For C++ and Python there is Essentia, as seen in Freesound, which is a high recommendation IMO. (Watch out, the source download is enormous; just shy of half a gigabyte.) Features Python and vamp integration, and a great many algorithms. I haven’t given it a fair chance because LibROSA has been such a joy to use. However, the intriguing Dunya project is based off it.


echonest is a proprietary system that was used to generate the Million Song Dataset. It seems to be gradually decaying, and was bought up by Spotify. It has great demos, such as autocanonisation.



RP extract


phonological corpus tools

Speech-focussed. Phonological corpus tools is another research library for largeish corpus analysis, similarity classification etc.

Metamorph, smstools

John Glover, SoundCloud staffer, has several analysis libraries culminating in Metamorph:

a new open source library for performing high-level sound transformations based on a sinusoids plus noise plus transients model. It is written in C++, can be built as both a Python extension module and a Csound opcode, and currently runs on Mac OS X and Linux.

It is designed to work primarily on monophonic, quasi-harmonic sound sources and can be used in a non-real-time context to process pre-recorded sound files or can operate in a real-time (streaming) mode.

See also the related spectral modeling and synthesis package, smstools.

Sinusoidal modelling with simplsound

“sinusoidal modelling”: Simplsound (GlLT09) is a python implementation of that technique.


If you use a lot of SuperCollider, you might like SCMIR, a native SuperCollider thingy. It has the virtues that

  • it can run in realtime, which is lovely.
  • it comes with lots of neato bells and whistles, like the author’s quirky breakbeat cut library.

It has the vices that

  • it runs in SuperCollider, which is a backwater language unserviced by modern development infrastructure or decent machine learning libraries, and

  • it has a fraught development process; I can’t even link directly to it because the author doesn’t give it its own anchor tag, let alone a whole web page or source code repository. The release schedule is opaque and sporadic. Consequently, it is effectively a lone guy’s pet project, rather than an active community endeavour.

    That is to say this is the Etsy sweater of code knitting. If on balance this sounds like a good deal to you, you can download SCMIR from somewhere or other on Nick Collins’ homepage.

Other specialist tools

Large-Scale Content-Based Matching of Midi and Audio Files:

MIDI files, when paired with corresponding audio recordings, can be used as ground truth for many music information retrieval tasks. We present a system which can efficiently match and align MIDI files to entries in a large corpus of audio content based solely on content, i.e., without using any metadata. The core of our approach is a convolutional network-based cross-modality hashing scheme which transforms feature matrices into sequences of vectors in a common Hamming space. Once represented in this way, we can efficiently perform large-scale dynamic time warping searches to match MIDI data to audio recordings. We evaluate our approach on the task of matching a huge corpus of MIDI files to the Million Song Dataset.
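The dynamic time warping step in that abstract can be sketched in a few lines. This is the textbook quadratic-time recursion over made-up 4-bit hash codes with a Hamming local cost; it is not the paper’s pipeline, which learns the hashes with a convolutional network and prunes the search for scale. All names and values here are hypothetical.

```python
def hamming(u, v):
    """Number of differing bits between two integer hash codes."""
    return bin(u ^ v).count("1")

def dtw_distance(a, b, dist=hamming):
    """Textbook dynamic time warping cost between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# Made-up hash sequences: the "audio" is a time-stretched copy of the "MIDI".
midi_codes = [0b1010, 0b1110, 0b0111, 0b0001]
audio_codes = [0b1010, 0b1010, 0b1110, 0b0111, 0b0111, 0b0001]
other_codes = [0b0101, 0b0001, 0b1000, 0b1110, 0b0000, 0b1111]

print(dtw_distance(midi_codes, audio_codes))
print(dtw_distance(midi_codes, other_codes))
```

The stretched copy aligns at zero cost, while the unrelated sequence does not, which is the property that makes the warped distance usable for matching performances at different tempi.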

See also Dannenberg’s bibliographies on score following.

mir_eval implements the standard evaluation metrics for MIR tasks.
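To give a flavour of what such a metric looks like, here is a hand-rolled onset precision/recall/F-measure with a tolerance window. It is a sketch with a simple greedy matcher and an assumed 50 ms window, not mir_eval’s own implementation or API; the onset times are invented.

```python
def onset_f_measure(reference, estimated, window=0.05):
    """Precision, recall and F-measure for onset times given in seconds.

    Each reference onset may be matched to at most one estimate that lies
    within `window` seconds of it (greedy left-to-right matching).
    """
    ref = sorted(reference)
    est = sorted(estimated)
    i = j = hits = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= window:
            hits += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1  # spurious estimate, skip it
        else:
            i += 1  # missed reference onset, skip it
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f

reference = [0.5, 1.0, 1.5, 2.0]
estimated = [0.52, 1.2, 1.49, 2.3, 2.6]
precision, recall, f = onset_f_measure(reference, estimated)
print(precision, recall, f)
```

Two of the five estimates fall within the window of a reference onset, giving a precision of 0.4, a recall of 0.5 and an F-measure of about 0.44.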


Abe, T., Kobayashi, T., & Imai, S. (1995) Harmonics tracking and pitch extraction based on instantaneous frequency. In International Conference on Acoustics, Speech, and Signal Processing, 1995. ICASSP-95 (Vol. 1, pp. 756–759 vol.1). DOI.
Alías, F., Socoró, J. C., & Sevillano, X. (2016) A Review of Physical and Perceptual Feature Extraction Techniques for Speech, Music and Environmental Sounds. Applied Sciences, 6(5), 143. DOI.
Anglade, A., Benetos, E., Mauch, M., & Dixon, S. (2010) Improving Music Genre Classification Using Automatically Induced Harmony Rules. Journal of New Music Research, 39, 349–361. DOI.
Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., & Lamere, P. (2011) The Million Song Dataset. In 12th International Society for Music Information Retrieval Conference (ISMIR 2011).
Bigand, E., Parncutt, R., & Lerdahl, F. (1996) Perception of musical tension in short chord sequences: The influence of harmonic function, sensory dissonance, horizontal motion, and musical training. Perception & Psychophysics, 58(1), 125–141. DOI.
Blei, D. M., Cook, P. R., & Hoffman, M. (2010) Bayesian nonparametric matrix factorization for recorded music. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 439–446).
Boulanger-Lewandowski, N., Bengio, Y., & Vincent, P. (2012) Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. In 29th International Conference on Machine Learning.
Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2016) Time series analysis: forecasting and control (Fifth edition). Hoboken, New Jersey: John Wiley & Sons, Inc.
Carabias-Orti, J. J., Virtanen, T., Vera-Candeas, P., Ruiz-Reyes, N., & Canadas-Quesada, F. J.(2011) Musical Instrument Sound Multi-Excitation Model for Non-Negative Spectrogram Factorization. IEEE Journal of Selected Topics in Signal Processing, 5(6), 1144–1158. DOI.
Carmi, A. Y.(2013) Compressive system identification: Sequential methods and entropy bounds. Digital Signal Processing, 23(3), 751–770. DOI.
Carter, G. C.(1987) Coherence and time delay estimation. Proceedings of the IEEE, 75(2), 236–255. DOI.
Choi, K., Fazekas, G., & Sandler, M. (2016) Explaining Deep Convolutional Neural Networks on Music Classification. arXiv:1607.02444 [Cs].
Cochran, W. T., Cooley, J. W., Favin, D. L., Helms, H. D., Kaenel, R. A., Lang, W. W., … Welch, P. D.(1967) What is the fast Fourier transform?. Proceedings of the IEEE, 55(10), 1664–1674. DOI.
Cooley, J. W., Lewis, P. A. W., & Welch, P. D.(1970) The application of the fast Fourier transform algorithm to the estimation of spectra and cross-spectra. Journal of Sound and Vibration, 12(3), 339–352. DOI.
Demopoulos, R. J., & Katchabaw, M. J. (2007) Music Information Retrieval: A Survey of Issues and Approaches. Technical Report.
Dieleman, S., & Schrauwen, B. (2014) End-to-end learning for music audio. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6964–6968). IEEE DOI.
Dörfler, M., Velasco, G., Flexer, A., & Klien, V. (2010) Sparse Regression in Time-Frequency Representations of Complex Audio.
Du, P., Kibbe, W. A., & Lin, S. M.(2006) Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching. Bioinformatics, 22(17), 2059–2065. DOI.
Fitzgerald, D. (2010) Harmonic/percussive separation using median filtering.
Flamary, R., Févotte, C., Courty, N., & Emiya, V. (2016) Optimal spectral transportation with application to music transcription. In arXiv:1609.09799 [cs, stat] (pp. 703–711). Curran Associates, Inc.
Fu, Z., Lu, G., Ting, K. M., & Zhang, D. (2011) A Survey of Audio-Based Music Classification and Annotation. IEEE Transactions on Multimedia, 13(2), 303–319. DOI.
Glover, J. C., Lazzarini, V., & Timoney, J. (2009) Simpl: A Python library for sinusoidal modelling. In DAFx 09 proceedings of the 12th International Conference on Digital Audio Effects, Politecnico di Milano, Como Campus, Sept. 1-4, Como, Italy (pp. 1–4). Dept. of Electronic Engineering, Queen Mary Univ. of London,
Godsill, S., & Davy, M. (2005) Bayesian computational models for inharmonicity in musical instruments. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005 (pp. 283–286). IEEE DOI.
Gómez, E., & Herrera, P. (2004) Estimating The Tonality Of Polyphonic Audio Files: Cognitive Versus Machine Learning Modelling Strategies. In ISMIR.
Grosche, P., Muller, M., & Kurth, F. (2010) Cyclic Tempogram - a Mid-level Tempo Representation For Music Signals. In 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (pp. 5522–5525). Piscataway, NJ.: IEEE DOI.
Grosche, P., Müller, M., & Sapp, C. S. (2010) What makes beat tracking difficult? A case study on Chopin mazurkas. In Proceedings of the International Conference on Music Information Retrieval (ISMIR 2010).
Harris, F. J.(1978) On the use of windows for harmonic analysis with the discrete Fourier transform. Proceedings of the IEEE, 66(1), 51–83.
Hermes, D. J.(1988) Measurement of pitch by subharmonic summation. The Journal of the Acoustical Society of America, 83(1), 257–264. DOI.
Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, R. C., … Wilson, K. (2016) CNN Architectures for Large-Scale Audio Classification. arXiv:1609.09430 [Cs, Stat].
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., … Kingsbury, B. (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82–97. DOI.
Hoffman, M., Bach, F. R., & Blei, D. M.(2010) Online learning for latent dirichlet allocation. In advances in neural information processing systems (pp. 856–864).
Irizarry, R. A.(2001) Local Harmonic Estimation in Musical Sound Signals. Journal of the American Statistical Association, 96(454), 357–367. DOI.
Bensoam, J., & Roze, D. (2013) Solving interactions between nonlinear resonators. In Proceedings of the Sound and Music Computing Conference.
Kailath, T., Sayed, A. H., & Hassibi, B. (2000) Linear estimation. Upper Saddle River, N.J: Prentice Hall.
Kalouptsidis, N., Mileounis, G., Babadi, B., & Tarokh, V. (2011) Adaptive algorithms for sparse system identification. Signal Processing, 91(8), 1910–1919. DOI.
Kereliuk, C., Depalle, P., & Pasquier, P. (2013) Audio Interpolation and Morphing via Structured-Sparse Linear Regression.
Lahat, M., Niederjohn, R. J., & Krubsack, D. (1987) A spectral autocorrelation method for measurement of the fundamental frequency of noise-corrupted speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(6), 741–750. DOI.
Ljung, L. (1999) System identification: theory for the user. (2nd ed.). Upper Saddle River, NJ: Prentice Hall PTR
Luo, Y., Chen, Z., Hershey, J. R., Roux, J. L., & Mesgarani, N. (2016) Deep Clustering and Conventional Networks for Music Separation: Stronger Together. arXiv:1611.06265 [Cs, Stat].
Makhoul, J., Kubala, F., Schwartz, R., & Weischedel, R. (1999) Performance Measures For Information Extraction. In Proceedings of DARPA Broadcast News Workshop (pp. 249–252).
Marchand, U., & Peeters, G. (2014) The Modulation Scale Spectrum And Its Application To Rhythm-Content Analysis. In DAFX (Digital Audio Effects). Erlangen, Germany
Maxwell, J. B., Pasquier, P., & Whitman, B. (2009) Hierarchical Sequential Memory for Music: A Cognitive Model. In Proceedings of the tenth International Society for Music Information Retrieval Conference (ISMIR 2009) (pp. 429–434).
McFee, B., & Ellis, D. P.(2011) Analyzing song structure with spectral clustering. In IEEE conference on Computer Vision and Pattern Recognition (CVPR).
Mesaros, A., Heittola, T., & Virtanen, T. (2016) Metrics for Polyphonic Sound Event Detection. Applied Sciences, 6(6), 162. DOI.
Moorer, J. (1974) The optimum comb method of pitch period analysis of continuous digitized speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 22(5), 330–338. DOI.
Müller, M., Ellis, D. P. W., Klapuri, A., & Richard, G. (2011) Signal Processing for Music Analysis. IEEE Journal of Selected Topics in Signal Processing, 5(6), 1088–1110. DOI.
Müller, M., & Driedger, J. (2012) Data-Driven Sound Track Generation. In Multimodal Music Processing (Vol. 3, pp. 175–194). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik.
Murthy, H. A., & Gadde, V. (2003) The modified group delay function and its application to phoneme recognition. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03) (Vol. 1, p. I-68-71 vol.1). DOI.
Nussbaum-Thom, M., Cui, J., Ramabhadran, B., & Goel, V. (2016) Acoustic Modeling Using Bidirectional Gated Recurrent Convolutional Units. (pp. 390–394). DOI.
Parncutt, R. (1997) A model of the perceptual root(s) of a chord accounting for voicing and prevailing tonality. In M. Leman (Ed.), Music, Gestalt, and Computing (pp. 181–199). Springer Berlin Heidelberg
Paulus, J., Müller, M., & Klapuri, A. (2010) Audio-Based Music Structure Analysis. In ISMIR (pp. 625–636). ISMIR
Phan, H., Hertel, L., Maass, M., & Mertins, A. (2016) Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks. In Interspeech 2016.
Pickens, J. (2004) Harmonic modeling for polyphonic music retrieval. Citeseer.
Pickens, J., & Iliopoulos, C. S.(2005) Markov Random Fields and Maximum Entropy Modeling for Music Information Retrieval. In ISMIR (pp. 207–214). Citeseer
Plomp, R., & Levelt, W. J.(1965) Tonal consonance and critical bandwidth. The Journal of the Acoustical Society of America, 38(4), 548–560. DOI.
Pons, J., Lidy, T., & Serra, X. (2016) Experimenting with musically motivated convolutional neural networks. In Content-Based Multimedia Indexing (CBMI), 2016 14th International Workshop on (pp. 1–6). IEEE DOI.
Robertson, A. N., & Plumbley, M. D.(2006) Real-time Interactive Musical Systems: An Overview. Proc. of the Digital Music Research Network, Goldsmiths University, London, 65–68.
Robertson, A., & Plumbley, M. (2007) B-Keeper: A Beat-tracker for Live Performance. In Proceedings of the 7th International Conference on New Interfaces for Musical Expression (pp. 234–237). New York, NY, USA: ACM DOI.
Robertson, A., & Plumbley, M. D.(2013) Synchronizing Sequencing Software to a Live Drummer. Computer Music Journal, 37(2), 46–60. DOI.
Robertson, A., Stark, A., & Davies, M. E. (2013) Percussive Beat tracking using real-time median filtering. Presented at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases.
Robertson, A., Stark, A. M., & Plumbley, M. D.(2011) Real-Time Visual Beat Tracking Using a Comb Filter Matrix.
Robertson, A. N. (2011) A Bayesian approach to drum tracking.
Rochebois, T., & Charbonneau, G. (1997) Cross-synthesis using interverted principal harmonic sub-spaces. In M. Leman (Ed.), Music, Gestalt, and Computing (pp. 375–385). Springer Berlin Heidelberg
Sainath, T. N., & Li, B. (2016) Modeling Time-Frequency Patterns with LSTM vs Convolutional Architectures for LVCSR Tasks. Submitted to Proc. Interspeech.
Salamon, J., Gomez, E., Ellis, D. P., & Richard, G. (2014) Melody Extraction from Polyphonic Music Signals: Approaches, applications, and challenges. IEEE Signal Processing Magazine, 31(2), 118–134. DOI.
Salamon, J., Serrà, J., & Gómez, E. (2013) Tonal representations for music retrieval: from version identification to query-by-humming. International Journal of Multimedia Information Retrieval, 2(1), 45–58. DOI.
Schlüter, J., & Böck, S. (2014) Improved musical onset detection with Convolutional Neural Networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6979–6983). DOI.
Schmidt, E. M., & Kim, Y. E.(2011) Learning emotion-based acoustic features with deep belief networks. In 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (pp. 65–68). DOI.
Scholler, S., & Purwins, H. (2011) Sparse Approximations for Drum Sound Classification. IEEE Journal of Selected Topics in Signal Processing, 5(5), 933–940. DOI.
Sergé, A., Bertaux, N., Rigneault, H., & Marguet, D. (2008) Dynamic multiple-target tracing to probe spatiotemporal cartography of cell membranes. Nature Methods, 5(8), 687–694. DOI.
Serrà, J., Corral, Á., Boguñá, M., Haro, M., & Arcos, J. L.(2012) Measuring the Evolution of Contemporary Western Popular Music. Scientific Reports, 2. DOI.
Smith, E. C., & Lewicki, M. S.(2004) Learning efficient auditory codes using spikes predicts cochlear filters. In Advances in Neural Information Processing Systems (pp. 1289–1296).
Smith, E. C., & Lewicki, M. S.(2006) Efficient auditory coding. Nature, 439(7079), 978–982. DOI.
Smith, E., & Lewicki, M. S.(2005) Efficient Coding of Time-Relative Structure Using Spikes. Neural Computation, 17(1), 19–45. DOI.
Smyth, T., & Elmore, A. R.(2009) Explorations in convolutional synthesis. In Proceedings of the 6th Sound and Music Computing Conference, Porto, Portugal (pp. 23–25).
Terhardt, E., Stoll, G., & Seewann, M. (1982) Algorithm for extraction of pitch and pitch salience from complex tonal signals. The Journal of the Acoustical Society of America, 71(3), 679–688. DOI.
Thickstun, J., Harchaoui, Z., & Kakade, S. (2016) Learning Features of Music from Scratch. arXiv:1611.09827 [Cs, Stat].
Welch, P. D.(1967) The use of fast Fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics, 15(2), 70–73.
Yang, C., He, Z., & Yu, W. (2009) Comparison of public peak detection algorithms for MALDI mass spectrometry data analysis. BMC Bioinformatics, 10, 4. DOI.
Yoshii, K., & Goto, M. (2012) Infinite Composite Autoregressive Models for Music Signal Analysis.