The Living Thing / Notebooks :

Machine listening

See also musical corpora, musical metrics, synchronisation, sparse basis dictionaries, speech recognition, learning gamelan, analysis/resynthesis, and whatever other machine listening posts I forgot.

I’m not going to talk about speech recognition here; that boat is full.

Machine listening: machine learning, from audio. Everything from that Shazam app doohickey, to teaching computers to recognise speech, to doing artsy things with sound. I’m mostly concerned with the third one. Statistics, features, descriptors, metrics, kernels and affinities, and the spaces and topologies they induce, for musical audio, e.g. your MP3 pop song library. This has considerable overlap with musical metrics, but there I start from scores and transcriptions.

Polyphony and its problems.

Approximate logarithmic perception and its problems.

Should I create a separate psychoacoustics notebook? Yes.

Interesting descriptors/features

Audio summaries that attempt to turn raw signals into useful feature vectors reflective of human perception of them. This is a huge industry, because it makes audio convenient for transmission (hello, mobile telephony and MP3). But it’s also useful for understanding speech, music etc. There are as many descriptors as there are IEEE conference slots.

See AlSS16 for an intimidatingly comprehensive summary.

I’m especially interested in

Also, ones that can encode noisiness in the signal as well as harmonicity…? I guess I should read AlSS16.

Deep neural networks

See, e.g., Jordi Pons’ spectrogram CNN discussion for an introduction to the kind of features a neural network might “discover” in audio recognition tasks.

There is some interesting stuff here; for example, Dieleman and Schrauwen (DiSc14) show that convolutional neural networks trained on raw audio (i.e. not spectrograms) for music classification recover Mel-like frequency bands. Thickstun et al (ThHK16) do some similar work.

And Keunwoo Choi shows that you can listen to what they learn.
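To make that concrete, here is a toy numpy sketch of a strided convolutional frontend on raw audio — not Dieleman and Schrauwen’s actual architecture, and the filters here are random rather than trained; the point of DiSc14 is that training shapes such filters into mel-like bands:

```python
import numpy as np

def conv_frontend(signal, filters, hop=256):
    """Convolve raw audio with a filter bank, rectify, giving the
    spectrogram-like representation a raw-audio CNN frontend computes."""
    n_filters, flen = filters.shape
    frames = []
    for start in range(0, len(signal) - flen, hop):
        frame = signal[start:start + flen]
        activation = filters @ frame        # one inner product per filter
        frames.append(np.maximum(activation, 0.0))  # ReLU rectification
    return np.array(frames).T               # shape (n_filters, n_frames)

rng = np.random.default_rng(0)
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
filters = rng.standard_normal((16, 512)) * 0.01  # untrained, random init
features = conv_frontend(x, filters)
print(features.shape)
```

Each output row plays the role of one frequency band in a spectrogram; the difference from an STFT is just that the analysis filters are learned rather than fixed sinusoids.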

Sparse comb filters

Differentiable! Conditionally invertible! Handy for syncing.

Moorer (Moor74) proposed this for harmonic purposes, but Robertson et al (RoSP11) have shown it to be handy for rhythm.
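A minimal numpy sketch of a feedback comb filter of the kind in question (parameter names are mine), showing the conditional invertibility: for |g| < 1 the inverse is an FIR filter you can apply exactly:

```python
import numpy as np

def comb(x, delay, g):
    """Feedback comb filter: y[n] = x[n] + g * y[n - delay]."""
    y = np.copy(x)
    for n in range(delay, len(x)):
        y[n] += g * y[n - delay]
    return y

def inverse_comb(y, delay, g):
    """Exact inverse: x[n] = y[n] - g * y[n - delay]."""
    x = np.copy(y)
    x[delay:] -= g * y[:-delay]
    return x

rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
y = comb(x, delay=50, g=0.7)
x_rec = inverse_comb(y, delay=50, g=0.7)
print(np.max(np.abs(x_rec - x)))  # round-trip error at float precision
```

The output is smooth in g, which is what makes it differentiable and hence handy for gradient-based syncing.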

Autocorrelation features

Measure the signal’s full or partial autocorrelation with itself.
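For instance, a hypothetical frame-wise feature made of the first few normalised autocorrelation coefficients (names and defaults are mine):

```python
import numpy as np

def autocorr_features(frame, n_lags=12):
    """First n_lags normalised autocorrelation coefficients of a frame."""
    frame = frame - frame.mean()
    full = np.correlate(frame, frame, mode="full")
    ac = full[len(frame) - 1:]           # keep non-negative lags only
    return ac[1:n_lags + 1] / ac[0]      # normalise by lag-0 energy

sr = 8000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 200 * t)      # periodic input
feats = autocorr_features(frame)
print(feats)
```

For a periodic signal like this sine, nearby lags correlate strongly and the coefficients trace out the waveform’s periodicity, which is why such features show up in pitch and rhythm estimation.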

Linear predictive coefficients

How do these transform? Framing this as an all-pole or all-zeros model might be useful, but the fit has many local maxima.
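For reference, the classic all-pole estimate via the Levinson-Durbin recursion — a textbook sketch, not tied to any particular library:

```python
import numpy as np

def lpc(frame, order):
    """All-pole linear predictive coefficients via Levinson-Durbin."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                    # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)              # shrinking prediction error
    return a

# Sanity check on a synthetic AR(1) process x[n] = 0.9 x[n-1] + e[n]:
rng = np.random.default_rng(0)
e = rng.standard_normal(10000)
x = np.zeros(10000)
for n in range(1, len(x)):
    x[n] = 0.9 * x[n - 1] + e[n]
a = lpc(x, order=1)
print(a)  # a[1] should be close to -0.9
</imports-sentinel>```

The recursion is cheap (quadratic in the order) and, because the model is all-pole, the filter it yields is easy to invert, which is partly why LPC stuck around in speech coding.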


The cepstrum: classic, but inconvenient to invert.


Mel-frequency Cepstral Coefficients, or Mel Cepstral transform. Take the perfectly respectable-if-fiddly cepstrum and make it really messy, with a vague psychoacoustic model, in the hope that the distinctions in the resulting “MFCC” might correspond to human perceptual distinctions.

Folk wisdom holds that MFCC features are Eurocentric, in that they destroy, or at least obscure, tonal language features. Ubiquitous, but inconsistently implemented; MFCCs are generally not the same across implementations, probably because the Mel scale is itself not universally standardised.

Aside from being loosely psychoacoustically motivated features, what do the coefficients of an MFCC specifically tell me?

Hmm. If I have got this right, these are “generic features”; things we can use in machine learning because we hope they project the spectrum into a space which approximately preserves psychoacoustic dissimilarity, whilst having little redundancy.

This heuristic pro is weighed against the practical cons that they are not practically differentiable, nor invertible except by heroic computational effort, nor humanly interpretable, and that they are riven with poorly-supported, somewhat arbitrary steps. (The cepstrum of the Mel-frequency spectrogram is a weird thing that no longer picks out harmonics in the way that God and Tukey intended.)
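For concreteness, one common recipe — noting, as above, that implementations disagree on details such as the mel formula (HTK-style here) and filter normalisation:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fb[i - 1, k] = (k - lo) / max(c - lo, 1)   # rising slope
        for k in range(c, hi):
            fb[i - 1, k] = (hi - k) / max(hi - c, 1)   # falling slope
    return fb

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    """Power spectrum -> mel filterbank -> log -> DCT."""
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame)) ** 2
    mel_energy = mel_filterbank(n_filters, n_fft, sr) @ spec
    return dct(np.log(mel_energy + 1e-10), norm="ortho")[:n_coeffs]

sr = 8000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 440 * t) * np.hamming(1024)
coeffs = mfcc(frame, sr)
print(coeffs.shape)
```

Every step after the power spectrum discards information — the filterbank decimates, the log compresses, the DCT truncation smooths — which is exactly why inversion takes heroic effort.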


Including bandpasses, gammatones… random filterbanks?

Dynamic dictionaries

See sparse basis dictionaries.

Cochlear activation models



ERBs, mels, sones, phons…
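The frequency scales, at least, reduce to closed-form conversions (sones and phons, being loudness units, need an actual loudness model); here are the HTK-style mel and Glasberg-Moore ERB-rate formulas as a sketch:

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel scale (one of several conventions in circulation)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def hz_to_erb_rate(f):
    """Glasberg & Moore ERB-rate scale (number of ERBs below f)."""
    return 21.4 * np.log10(1.0 + 4.37 * f / 1000.0)

# Both scales are roughly linear at low frequencies, logarithmic above:
for f in (110.0, 440.0, 1760.0):
    print(f, hz_to_mel(f), hz_to_erb_rate(f))
```

The divergence between the conventions is one concrete source of the MFCC inconsistency complained about above.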


Here are some options for doing machine listening.


MusicBricks is an umbrella project to unify (post hoc) many of the efforts mentioned individually below, plus a few other new ones.


LibROSA is what I have been using a lot recently, and I highly recommend it, especially if your pipeline already includes Python. Sleek minimal design, with a curated set of algorithms (compare and contrast with the chaos of the vamp plugins ecosystem). Python-based, but fast enough because it uses the numpy numerical libraries well. The API design meshes well with scikit-learn, the de facto Python machine learning standard, and it’s flexible and hackable.


SonicAnnotator seems to be about cobbling together vamp plugins for batch analysis. That is more steps than I want in an already clunky workflow in the current projects. It’s also more about RDF ontologies, where I want matrices of floats.


For C++ and Python there is Essentia, as seen in Freesound, which is a high recommendation IMO. (Watch out, the source download is enormous; just shy of half a gigabyte.) It features Python and vamp integration, and a great many algorithms. I haven’t given it a fair chance because LibROSA has been such a joy to use. However, the intriguing Dunya project is based on it.


Echonest is a proprietary system that was used to generate the Million Song Dataset. It seems to be gradually decaying, and was bought up by Spotify. It has great demos, such as autocanonisation.



RP extract


phonological corpus tools

Speech-focussed, phonological corpus tools is another research library for largeish corpus analysis, similarity classification, etc.

Metamorph, smstools

John Glover, Soundcloud staffer, has several analysis libraries, culminating in Metamorph:

a new open source library for performing high-level sound transformations based on a sinusoids plus noise plus transients model. It is written in C++, can be built as both a Python extension module and a Csound opcode, and currently runs on Mac OS X and Linux.

It is designed to work primarily on monophonic, quasi-harmonic sound sources and can be used in a non-real-time context to process pre-recorded sound files or can operate in a real-time (streaming) mode.

See also the related spectral modeling and synthesis package, smstools.

Sinusoidal modelling with simplsound

Simplsound (GlLT09) is a Python implementation of sinusoidal modelling.


If you use a lot of SuperCollider, you might like SCMIR, a native SuperCollider thingy. It has the virtues that

It has the vices that

Other specialist tools

Large-Scale Content-Based Matching of MIDI and Audio Files:

MIDI files, when paired with corresponding audio recordings, can be used as ground truth for many music information retrieval tasks. We present a system which can efficiently match and align MIDI files to entries in a large corpus of audio content based solely on content, i.e., without using any metadata. The core of our approach is a convolutional network-based cross-modality hashing scheme which transforms feature matrices into sequences of vectors in a common Hamming space. Once represented in this way, we can efficiently perform large-scale dynamic time warping searches to match MIDI data to audio recordings. We evaluate our approach on the task of matching a huge corpus of MIDI files to the Million Song Dataset.
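Their hashing scheme aside, the dynamic-time-warping backbone of such matching is simple enough to sketch (a naive quadratic version, nothing like their large-scale implementation):

```python
import numpy as np

def dtw_cost(a, b):
    """Minimal DTW alignment cost between two feature sequences
    (rows = time steps), with the usual step pattern."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local distance
            D[i, j] = d + min(D[i - 1, j],           # insertion
                              D[i, j - 1],           # deletion
                              D[i - 1, j - 1])       # match
    return D[n, m]

seq = np.array([[0.0], [1.0], [2.0], [3.0]])
stretched = np.array([[0.0], [0.0], [1.0], [2.0], [2.0], [3.0]])
print(dtw_cost(seq, stretched))  # 0.0: warping absorbs the time-stretch
```

A time-stretched copy aligns at zero cost, which is the property that lets warped MIDI renditions be matched against audio-derived feature matrices.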

See also Dannenberg’s bibliographies on score following.

mir_eval evaluates MIR metrics.

CLEESE is an IRCAM project to do dataset augmentation by pitch and time transform of audio.