Datasets of music.
See also machine listening, data sets and, from an artistic angle, sample libraries.
I’m obsessed with music; I make music, and I like using computers to do it. As a statistics guy, I have a habit of using statistics in particular to do it, and that needs data. Here are some corpora and some extra analysis tools, for the creation of Mad Science Music.
Universitat Pompeu Fabra is trying to collect large, minutely analyzed data sets from several distinct, not-necessarily-Central-European traditions, and has comprehensive software tools too:
Bonus raw-note datasets I just noticed. Some of these I found while reading about “Deep Learning”, which is a whole ‘nother story.
- Piano-midi.de: classical piano pieces
- Nottingham: over 1000 folk tunes
- MuseData: electronic library of classical music scores
- JSB Chorales: set of four-part harmonized chorales
In this document we report on the tempo induction contest held as part of the ISMIR 2004 Audio Description Contests, organized at the University Pompeu Fabra in Barcelona in September 2004 and won by Anssi Klapuri from Tampere University.[…]
BallroomDancers.com gives many informations on ballroom dancing (online lessons, etc.). Some characteristic excerpts of many dance styles are provided in real audio format. Their tempi are also available.
Total number of instances: 698
Duration: ~30 s
Total duration: ~20940 s
Genres:
- Cha Cha: 111
- Jive: 60
- Quickstep: 82
- Rumba: 98
- Samba: 86
- Tango: 86
- Viennese Waltz: 65
- Slow Waltz: 110
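A quick sanity check of the Ballroom numbers quoted above, as a minimal Python sketch. The variable names are mine, and I am assuming every excerpt is exactly 30 s long, which the source only states approximately.

```python
# Genre counts as quoted in the Ballroom dataset description.
genre_counts = {
    "Cha Cha": 111,
    "Jive": 60,
    "Quickstep": 82,
    "Rumba": 98,
    "Samba": 86,
    "Tango": 86,
    "Viennese Waltz": 65,
    "Slow Waltz": 110,
}

# Assumption: each clip lasts exactly 30 s (the source says "~30 s").
CLIP_SECONDS = 30

total_clips = sum(genre_counts.values())
total_seconds = total_clips * CLIP_SECONDS
total_hours = total_seconds / 3600

# Share of the corpus per genre, handy for spotting class imbalance.
genre_shares = {genre: n / total_clips for genre, n in genre_counts.items()}

print(f"{total_clips} clips, {total_seconds} s, about {total_hours:.1f} h")
for genre, share in sorted(genre_shares.items(), key=lambda kv: -kv[1]):
    print(f"{genre:15s} {share:6.1%}")
```

The per-genre counts do add up to the stated 698 instances, and 698 clips of 30 s gives the stated 20940 s, a bit under six hours of audio.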
MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note’s position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping. The labels are verified by trained musicians; we estimate a labeling error rate of 4%. We offer the MusicNet labels to the machine learning and music communities as a resource for training models and a common benchmark for comparing results.
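To get a feel for what aligning a score to a recording by dynamic time warping means, here is a toy sketch in Python. It is not the MusicNet pipeline: the helper `dtw_path`, the made-up pitch sequences and the absolute-difference local cost are all my own assumptions for illustration. A real system would compare audio features against a synthesized score rather than bare pitch numbers.

```python
import math


def dtw_path(score, audio, dist=lambda a, b: abs(a - b)):
    """Align two sequences by classic dynamic time warping.

    Returns (total_cost, path), where path is a list of
    (score_index, audio_index) pairs from start to end.
    """
    n, m = len(score), len(audio)
    # cost[i][j]: minimal cumulative cost of aligning score[:i] with audio[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(score[i - 1], audio[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance in both sequences
                cost[i - 1][j],      # advance in the score only
                cost[i][j - 1],      # advance in the audio only
            )
    # Backtrack from the end to recover the cheapest alignment path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# Hypothetical toy data: three score notes (MIDI pitches) and six
# "frames" of a performance that holds each note for a different time.
score_pitches = [60, 62, 64]
audio_pitches = [60, 60, 62, 62, 62, 64]

total_cost, path = dtw_path(score_pitches, audio_pitches)

# The first audio frame matched to each score note acts as its onset label.
onsets = {}
for score_idx, audio_idx in path:
    onsets.setdefault(score_idx, audio_idx)

print(total_cost, path, onsets)
```

The warping path says which stretch of the recording each score note covers, so score annotations (pitch, instrument, metrical position) can be carried across to time stamps in the audio. That is also where labeling errors would creep in when the alignment drifts.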
Data with purely automatically generated labels (the annotations were inferred from the raw samples using DSP, which gives larger data sets but more errors than hand-labelled stuff)
Other well-known science-y music datasets:
The classic USPOP, CAL500, CAL10K, etc.
RWC (crosschecks MIDI against Audio)
This dataset consists of ~25000 29s long music clips, each of them annotated with a combination of 188 tags. The annotations have been collected through Edith Law’s “TagATune” game. The clips are excerpts of songs published by Magnatune.com.
There is a list of articles using this data set.
Freesound doesn’t quite fit in with the rest, but it’s worth knowing anyway. It is an incredible database of raw samples for analysis, including all the Essentia descriptors, plus user tags, descriptions and general good times. It deserves a whole entry of its own, if perhaps under “acoustic” rather than “musical” corpora.
From Christian Walder at Data61: SymbolicMusicMidiDataV1.0
Music data sets of suspicious provenance, via Reddit:
- Drum percussion midi archive
- The largest midi collection on the internet
- 16Gb of song stems
- More song stems
- Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., & Lamere, P. (2011) The Million Song Dataset. In 12th International Society for Music Information Retrieval Conference (ISMIR 2011).
- Gouyon, F., Klapuri, A., Dixon, S., Alonso, M., Tzanetakis, G., Uhle, C., & Cano, P. (2006) An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech, and Language Processing, 14(5), 1832–1844.
- Thickstun, J., Harchaoui, Z., & Kakade, S. (2016) Learning Features of Music from Scratch. arXiv:1611.09827 [Cs, Stat].