Musical corpora

Datasets of music.

See also machine listening, data sets and, from an artistic angle, sample libraries.

I’m obsessed with music: I make it, and I like using computers to do so. As a statistics guy, I habitually reach for statistics in particular, and that needs data. Here are some corpora and some extra analysis tools for the creation of Mad Science Music.

Universitat Pompeu Fabra is trying to collect large, minutely analyzed data sets from several distinct, not-necessarily-Central-European traditions, and has comprehensive software tools too:

Bonus raw-note datasets I just noticed. Some of these I found while reading about “Deep Learning”, which is a whole ’nother story.

Data with purely automatically generated labels: the annotations were inferred from the raw samples using DSP, which gives larger data sets but more errors than hand-labelled ones.
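To make the idea of labels inferred from raw samples concrete, here is a minimal sketch using only the Python standard library. It synthesizes a tone, estimates its pitch by counting zero crossings, and turns that into a MIDI note number as the "annotation". The function names are hypothetical, and zero-crossing counting is my own stand-in for illustration, not the method any particular dataset is known to use. It is deliberately crude: it works on a clean sine wave and would mislabel noisy or polyphonic audio, which is the kind of error automatic labelling trades for scale.

```python
import math


def synth_sine(freq_hz, sample_rate=8000, duration_s=1.0):
    """Generate a pure sine tone as a list of float samples (toy 'raw audio')."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def estimate_pitch_zero_crossings(samples, sample_rate):
    """Crude pitch estimate in Hz: upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * sample_rate / len(samples)


def hz_to_midi(freq_hz):
    """Map a frequency to the nearest MIDI note number (A4 = 440 Hz = 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


def auto_label(samples, sample_rate):
    """Infer a note label from raw samples, with no human annotator involved."""
    return hz_to_midi(estimate_pitch_zero_crossings(samples, sample_rate))


if __name__ == "__main__":
    sample_rate = 8000
    tone = synth_sine(440.0, sample_rate)
    print("estimated Hz:", estimate_pitch_zero_crossings(tone, sample_rate))
    print("MIDI label:", auto_label(tone, sample_rate))
```

Real pipelines use far more robust pitch and onset estimators, but the shape is the same: signal in, guessed label out, nobody checking each guess by hand.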

Other well-known science-y music datasets:

Freesound doesn’t quite fit in with the rest, but it’s worth knowing anyway. It is an incredible database of raw samples for analysis, including all the Essentia descriptors, plus user tags, descriptions and general good times. It deserves a whole entry of its own, if perhaps under “acoustic” rather than “musical” corpora.

From Christian Walder at Data61: SymbolicMusicMidiDataV1.0

Music data sets of suspicious provenance, via Reddit:


Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., & Lamere, P. (2011) The Million Song Dataset. In 12th International Society for Music Information Retrieval Conference (ISMIR 2011).
Gouyon, F., Klapuri, A., Dixon, S., Alonso, M., Tzanetakis, G., Uhle, C., & Cano, P. (2006) An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech, and Language Processing, 14(5), 1832–1844.
Thickstun, J., Harchaoui, Z., & Kakade, S. (2016) Learning Features of Music from Scratch. arXiv:1611.09827 [Cs, Stat].