
Audio corpora

Datasets of sound.

See also machine listening, data sets and, from an artistic angle, sample libraries.




Audioset (GEFJ17):

Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets — principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.
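To make the idea of a hierarchical ontology concrete, here is a minimal sketch. The five-node tree, the `PARENT` mapping and the helper names are illustrative assumptions, not the real Audio Set ontology or its file format; the only point is that a label on a specific class implies all of its ancestor classes.

```python
# Toy hierarchy (simplified, hypothetical): child class -> parent class.
PARENT = {
    "Sounds of things": None,
    "Vehicle": "Sounds of things",
    "Car": "Vehicle",
    "Music": None,
    "Guitar": "Music",
}


def ancestors(label):
    """Return the chain of parents of `label`, nearest parent first."""
    chain = []
    parent = PARENT[label]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def expand_labels(labels):
    """Add every ancestor class implied by the labels on one segment."""
    expanded = set(labels)
    for label in labels:
        expanded.update(ancestors(label))
    return expanded


# A 10-second segment that a labeler marked as containing a car and a guitar.
segment_labels = {"Car", "Guitar"}
print(sorted(expand_labels(segment_labels)))
# ['Car', 'Guitar', 'Music', 'Sounds of things', 'Vehicle']
```

Storing only the most specific labels and expanding them on demand keeps annotations compact while still allowing a classifier to be trained at any level of the tree.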



YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs and associated labels from a diverse vocabulary of 4700+ visual entities. It comes with precomputed state-of-the-art audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This makes it possible to get started on this dataset by training a baseline video model in less than a day on a single machine! At the same time, the dataset’s scale and diversity can enable deep exploration of complex audio-visual models that can take weeks to train even in a distributed fashion.

Our goal is to accelerate research on large-scale video understanding, representation learning, noisy data modeling, transfer learning, and domain adaptation approaches for video. More details about the dataset and initial experiments can be found in our technical report. Some statistics from the latest version of the dataset are included below.


I’m obsessed with music, I make music and I like using computers to do it. As a statistics guy, I have a habit of using statistics in particular to do it, and that needs data. Here are some corpora and some extra analysis tools, for the creation of Mad Science Music.

Mumu (ONBS17):

MuMu is a Multimodal Music dataset with multi-label genre annotations that combines information from the Amazon Reviews dataset and the Million Song Dataset (MSD). The former contains millions of album customer reviews and album metadata gathered from Amazon.com. The latter is a collection of metadata and precomputed audio features for a million songs.

To map the information from both datasets we use MusicBrainz. This process yields the final set of 147,295 songs, which belong to 31,471 albums. For the mapped set of albums, there are 447,583 customer reviews from the Amazon Dataset. The dataset has been used for multi-label music genre classification experiments in the related publication. In addition to genre annotations, this dataset provides further information about each album, such as average rating, selling rank, similar products, and cover image URL. For every text review it also provides the helpfulness score of the review, its average rating, and a summary of the review.

The mapping between the three datasets (Amazon, MusicBrainz and MSD), genre annotations, metadata, data splits, text reviews and links to images are available here. Images and audio files cannot be released due to copyright issues.

  • MuMu dataset (mapping, metadata, annotations and text reviews)
  • Data splits and multimodal embeddings for ISMIR multi-label classification experiments


We introduce the Free Music Archive (FMA), an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections. The community’s growing interest in feature and end-to-end learning is however restrained by the limited availability of large audio datasets. The FMA aims to overcome this hurdle by providing 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length and high-quality audio, pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies. We here describe the dataset and how it was created, propose a train/validation/test split and three subsets, discuss some suitable MIR tasks, and evaluate some baselines for genre recognition.

YouTube Music Video 5M:

[…] there are loads of music video online […] unlike in the case you’re looking for a dataset that provide music signal, you can just download music contents. For free. No API blocking. No copyright law bans you to do so (redistribution is restricted though). Not just 30s preview but the damn whole song. Just because it’s not mp3 but mp4.

As a MIR researcher I found it annoying but blessing. {music} is banned but {music video} is not! […]

I beta-released YouTube Music Video 5M dataset. The readme has pretty much everything and will be up-to-date. In short, It’s 5M+ youtube music video URLs that are categorised by Spotify artist IDs which are sorted by some sort of artist popularity.

Unfortunately I can’t redistribute anything further that is crawled from either YouTube (e.g., music video titles..) or Spotify (e.g., artist genre labels) but you can get them by yourself.

Dunya project

Universitat Pompeu-Fabra is trying to collect large and minutely analyzed sets of data from several distinct not-necessarily-central-european traditions, and have comprehensive software tools too:

The ballroom set

the ballroom music data set: (GKDA06)

In this document we report on the tempo induction contest held as part of the ISMIR 2004 Audio Description Contests, organized at the University Pompeu Fabra in Barcelona in September 2004 and won by Anssi Klapuri from Tampere University.[…] gives many informations on ballroom dancing (online lessons, etc.). Some characteristic excerpts of many dance styles are provided in real audio format. Their tempi are also available.

  • Total number of instances: 698
  • Duration: ~30 s per instance
  • Total duration: ~20940 s
  • Genres: Cha Cha, 111; Jive, 60; Quickstep, 82; Rumba, 98; Samba, 86; Tango, 86; Viennese Waltz, 65; Slow Waltz, 110
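The figures above are internally consistent, which is easy to check. The sketch below only re-derives the totals from the per-genre counts quoted in the list; the variable names are mine and the 30 s clip length is the approximate value given above.

```python
# Per-genre clip counts as quoted for the ballroom set.
genre_counts = {
    "Cha Cha": 111,
    "Jive": 60,
    "Quickstep": 82,
    "Rumba": 98,
    "Samba": 86,
    "Tango": 86,
    "Viennese Waltz": 65,
    "Slow Waltz": 110,
}
CLIP_SECONDS = 30  # approximate duration of each excerpt

total_clips = sum(genre_counts.values())
total_seconds = total_clips * CLIP_SECONDS

print(total_clips)    # 698
print(total_seconds)  # 20940

# Class balance matters for tempo and genre experiments: the largest class
# is nearly twice the size of the smallest.
largest = max(genre_counts, key=genre_counts.get)
smallest = min(genre_counts, key=genre_counts.get)
print(largest, smallest)  # Cha Cha Jive
```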



MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note’s position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping. The labels are verified by trained musicians; we estimate a labeling error rate of 4%. We offer the MusicNet labels to the machine learning and music communities as a resource for training models and a common benchmark for comparing results.
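For readers who have not met dynamic time warping, here is a textbook sketch of how it can map score notes onto recording frames. It is not the MusicNet authors' pipeline: the function name `dtw_path`, the absolute-difference cost and the toy inputs (one pitch number per score note and per audio frame) are all assumptions made for illustration, and a real alignment would compare spectral features rather than known pitches.

```python
def dtw_path(a, b):
    """Textbook DTW between two 1-D sequences; returns (total_cost, path)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover the alignment path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Toy data: three score notes, and a "recording" where each note lasts
# a different number of frames.
score = [60, 62, 64]
recording = [60, 60, 62, 62, 62, 64]

total_cost, path = dtw_path(score, recording)

# The first frame matched to each score note serves as its onset label.
onsets = {}
for note_idx, frame_idx in path:
    onsets.setdefault(note_idx, frame_idx)

print(total_cost)  # 0.0
print(onsets)      # {0: 0, 1: 2, 2: 5}
```

Alignment errors in this step are one plausible source of the label noise that the quoted 4% error estimate refers to.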

Data with purely automatically generated labels (they inferred the annotation from the raw samples using DSP, giving larger data sets and more errors than hand-labelled stuff)


MedleyDB (BSTM14):

MedleyDB, a dataset of annotated, royalty-free multitrack recordings. MedleyDB was curated primarily to support research on melody extraction, addressing important shortcomings of existing collections. For each song we provide melody f0 annotations as well as instrument activations for evaluating automatic instrument recognition. The dataset is also useful for research on tasks that require access to the individual tracks of a song such as source separation and automatic mixing…

Dataset Snapshot

  • Size: 122 Multitracks (mix + processed stems + raw audio + metadata)
  • Annotations: Melody f0 (108 tracks), Instrument Activations (122 tracks), Genre (122 tracks)
  • Audio Format: WAV (44.1 kHz,16 bit)
  • Genres: Singer/Songwriter, Classical, Rock, World/Folk, Fusion, Jazz, Pop, Musical Theatre, Rap
  • Track Length: 105 full length tracks (~3 to 5 minutes long), 17 excerpts (7:17 hours total)
  • Instrumentation: 52 instrumental tracks, 70 tracks containing vocals

Mdbdrums (SWLH17)

This repository contains the MDB Drums dataset which consists of drum annotations and audio files for 23 tracks from the MedleyDB dataset.

Two annotation files are provided for each track. The first annotation file, termed class, groups the 7994 onsets into 6 classes based on drum instrument. The second annotation file, termed subclass, groups the onsets into 21 classes based on playing technique and instrument. For further information regarding the dataset please read the MDB Drums paper referenced below.



NSynth is an audio dataset containing 306,043 musical notes, each with a unique pitch, timbre, and envelope. For 1,006 instruments from commercial sample libraries, we generated four second, monophonic 16kHz audio snippets, referred to as notes, by ranging over every pitch of a standard MIDI piano (21-108) as well as five different velocities (25, 50, 75, 100, 127). The note was held for the first three seconds and allowed to decay for the final second.
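The numbers in that description pin down the shape of each example and the size of the full pitch-by-velocity grid. The sketch below is plain arithmetic on the quoted figures; the constant names are mine, and the conclusion that many instruments cover only part of the grid is my inference from the totals, not a statement from the dataset authors.

```python
SAMPLE_RATE = 16_000               # Hz, as quoted
NOTE_SECONDS = 4                   # each snippet is four seconds long
PITCHES = range(21, 109)           # MIDI pitches 21..108 inclusive
VELOCITIES = (25, 50, 75, 100, 127)
N_INSTRUMENTS = 1006
N_NOTES_QUOTED = 306_043           # total note count quoted above

samples_per_note = SAMPLE_RATE * NOTE_SECONDS
notes_per_instrument = len(PITCHES) * len(VELOCITIES)
full_grid = notes_per_instrument * N_INSTRUMENTS

print(samples_per_note)      # 64000
print(notes_per_instrument)  # 440
print(full_grid)             # 442640

# The quoted total is smaller than the full grid, so on average an
# instrument contributes only part of the 440 possible notes.
coverage = N_NOTES_QUOTED / full_grid
print(round(coverage, 2))    # 0.69
```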


Kyle McDonald, Freesound 4 seconds:

A mirror of all 126,900 sounds on Freesound less than 4 seconds long, as of April 4, 2017. Metadata for all sounds is stored in the files, and the high quality mp3s are stored in the files.


Magnatagatune is a data set of pop songs with substantial audio and metadata, good for classification tasks. Announcement. (Data)

This dataset consists of ~25000 29s long music clips, each of them annotated with a combination of 188 tags. The annotations have been collected through Edith’s “TagATune” game. The clips are excerpts of songs published by Magnatune.com.

There is a list of articles using this data set.

Other well-known science-y music datasets:

Freesound doesn’t quite fit in with the rest, but it’s worth knowing anyway. It is an incredible database of raw samples for analysis, annotated with various Essentia descriptors (i.e. hand-crafted features) plus user tags, descriptions and general good times. It deserves a whole entry of its own, if perhaps under “acoustic” rather than “musical” corpora.


From Christian Walder at Data61: SymbolicMusicMidiDataV1.0

Music data sets of suspicious provenance, via Reddit:


Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., & Lamere, P. (2011) The Million Song Dataset. In 12th International Society for Music Information Retrieval Conference (ISMIR 2011).
Bittner, R. M., Salamon, J., Tierney, M., Mauch, M., Cannam, C., & Bello, J. P. (2014) MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research. In ISMIR (Vol. 14, pp. 155–160).
Defferrard, M., Benzi, K., Vandergheynst, P., & Bresson, X. (2016) FMA: A Dataset For Music Analysis. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China.
Fonseca, E., Pons, J., Favory, X., Font, F., Bogdanov, D., Ferraro, A., … Serra, X. (2017) Freesound Datasets: A Platform for the Creation of Open Audio Datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China.
Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., … Ritter, M. (2017) Audio Set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA.
Gillet, O., & Richard, G. (2006) ENST-Drums: an extensive audio-visual database for drum signals processing. In ISMIR.
Gouyon, F., Klapuri, A., Dixon, S., Alonso, M., Tzanetakis, G., Uhle, C., & Cano, P. (2006) An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech, and Language Processing, 14(5), 1832–1844.
Law, E., West, K., & Mandel, M. I. (2009) Evaluation of Algorithms Using Games: The Case of Music Tagging.
Oramas, S., Nieto, O., Barbieri, F., & Serra, X. (2017) Multi-label Music Genre Classification from Audio, Text, and Images Using Deep Features. ArXiv:1707.04916 [Cs].
Southall, C., Wu, C.-W., Lerch, A., & Hockman, J. A. (2017) MDB Drums — An Annotated Subset of MedleyDB for Automatic Drum Transcription. In Late Breaking Demo (Extended Abstract), Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Suzhou: International Society for Music Information Retrieval (ISMIR).
Thickstun, J., Harchaoui, Z., & Kakade, S. (2017) Learning Features of Music from Scratch. In Proceedings of International Conference on Learning Representations (ICLR) 2017.