
Audio/music corpora

Smells like Team Audioset

Usefulness: 🔧 🔧 🔧
Novelty: 💡
Uncertainty: 🤪
Incompleteness: 🚧

Datasets of sound tend to be called audio corpora for reasons of tradition. I’ve listed some audio corpora that are useful to me, which means

See also machine listening, data sets and, from an artistic angle, sample libraries.

General issues

In these datasets, labels are noisy, as exemplified by Audioset. Weakly supervised techniques are worth considering.

General audio

Audioset

Audioset (GEFJ17):

Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets – principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.

youtube8m

Youtube8M:

YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs and associated labels from a diverse vocabulary of 4700+ visual entities. It comes with precomputed state-of-the-art audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This makes it possible to get started on this dataset by training a baseline video model in less than a day on a single machine! At the same time, the dataset’s scale and diversity can enable deep exploration of complex audio-visual models that can take weeks to train even in a distributed fashion.

Our goal is to accelerate research on large-scale video understanding, representation learning, noisy data modeling, transfer learning, and domain adaptation approaches for video. More details about the dataset and initial experiments can be found in our technical report. Some statistics from the latest version of the dataset are included below.

FSDnoisy18k

FSDnoisy18k: (FPEF19)

FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.

The source of audio content is Freesound, a sound sharing site created and maintained by the Music Technology Group, hosting over 400,000 clips uploaded by its community of users, who additionally provide some basic metadata (e.g., tags, and title). More information about the technical aspects of Freesound can be found in [14, 15].

The 20 classes of FSDnoisy18k are drawn from the AudioSet Ontology and are selected based on data availability as well as on their suitability to allow the study of label noise. They are listed in the next Table, along with the number of audio clips per class (split in different data subsets that are defined in section FSDnoisy18k basic characteristics). Every numeric entry in the next table reads: number of clips / duration in minutes (rounded). For instance, the Acoustic guitar class has 102 audio clips in the clean subset of the train set, and the total duration of these clips is roughly 11 minutes.

We defined a clean portion of the dataset consisting of correct and complete labels. The remaining portion is referred to as the noisy portion. Each clip in the dataset has a single ground truth label (singly-labeled data).

The clean portion of the data consists of audio clips whose labels are rated as present in the clip and predominant (almost all with full inter-annotator agreement), meaning that the label is correct and, in most cases, there is no additional acoustic material other than the labeled class. A few clips may contain some additional sound events, but they occur in the background and do not belong to any of the 20 target classes. This is more common for some classes that rarely occur alone, e.g., “Fire”, “Glass”, “Wind” or “Walk, footsteps”.

The noisy portion of the data consists of audio clips that received no human validation. In this case, they are categorized on the basis of the user-provided tags in Freesound. Hence, the noisy portion features a certain amount of label noise.

Musical – complete songs and metadata

tl;dr Use Free Music Archive, which is faster, simpler and higher quality. Youtube Music Videos and Magnatagatune have more noise and less convenience. Which is not to say that FMA is fast, simple and of high quality. It’s still fucking chaos because the creators are all busy doing their own dissertations I guess, but it is better than the alternatives.

Beatstars and other beat markets

Beatstars is a commercial peer-to-peer beatmakers’ market which has lots of tracks without vocals. I’m not sure what analysis can be done upon them without paying the per-track licensing fee, but possibly quite a lot.

Free music archive

FMA (DBVB16) is an annotated, ML-ready dataset constructed from the Free Music Archive

We introduce the Free Music Archive (FMA), an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections. The community’s growing interest in feature and end-to-end learning is however restrained by the limited availability of large audio datasets. The FMA aims to overcome this hurdle by providing 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length and high-quality audio, pre-computed features, together with track- and user- level metadata, tags, and free-form text such as biographies. We here describe the dataset and how it was created, propose a train/validation/test split and three subsets, discuss some suitable MIR tasks, and evaluate some baselines for genre recognition.

Oh wait, now Free Music Archive’s closed down. Please see the archive.org backup of the audio. The metadata in the dataset might be the best option.

Note that at time of writing, the default version of the code did not use the same format as the default version of the data. I needed to check out a particular, elderly version of the code, rc1. I suppose you could rebuild the dataset index from scratch yourself, but this would need much CPU time.
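Before committing to the full download, it can help to turn the headline figures from the abstract (917 GiB, 343 days, 106,574 tracks) into per-track numbers. The following is a back-of-envelope sketch only; the `download_hours` helper and the 100 Mbit/s link speed in the usage note are my own assumptions, and it ignores protocol overhead and the archive's compression.

```python
# Headline figures quoted in the FMA abstract.
GIB = 2 ** 30
total_bytes = 917 * GIB
total_seconds = 343 * 24 * 60 * 60
n_tracks = 106_574

# Average length of a track, in seconds.
mean_track_seconds = total_seconds / n_tracks

# Average encoded bitrate implied by size and duration, in kbit/s.
mean_bitrate_kbps = total_bytes * 8 / total_seconds / 1000


def download_hours(link_mbit_per_s):
    """Hours needed to fetch everything at a sustained link speed.

    Hypothetical helper: assumes no overhead and a perfectly steady link.
    """
    return total_bytes * 8 / (link_mbit_per_s * 1e6) / 3600


if __name__ == "__main__":
    print(f"mean track length: {mean_track_seconds:.0f} s")
    print(f"mean bitrate: {mean_bitrate_kbps:.0f} kbit/s")
    print(f"download at 100 Mbit/s: {download_hours(100):.1f} h")
```

So the average track runs to several minutes at an MP3-like bitrate, and the full set is roughly a day of downloading on a fast domestic link.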

Youtube music video 8m

The musical competitor to the above non-musical one, created by Keunwoo Choi. Youtube Music Video 8m:

[…]there are loads of music video online […]unlike in the case you’re looking for a dataset that provide music signal, you can just download music contents. For free. No API blocking. No copyright law bans you to do so (redistribution is restricted though). Not just 30s preview but the damn whole song. Just because it’s not mp3 but mp4.

As a MIR researcher I found it annoying but blessing. {music} is banned but {music video} is not! […]

I beta-released YouTube Music Video 5M dataset. The readme has pretty much everything and will be up-to-date. In short, It’s 5M+ youtube music video URLs that are categorised by Spotify artist IDs which are sorted by some sort of artist popularity.

Unfortunately I can’t redistribute anything further that is crawled from either YouTube (e.g., music video titles..) or Spotify (e.g., artist genre labels) but you can get them by yourself.

The spotify metadata is pretty good, and the youtube audio quality is OK, so this is a handy dataset. But much assembly required, and much bandwidth.

Source code.

Dunya project

Universitat Pompeu-Fabra is trying to collect large and minutely analyzed sets of data from several distinct not-necessarily-central-european traditions, and have comprehensive software tools too:

See the Dunya project site.

The ballroom set

The ballroom music data set: (GKDA06)

In this document we report on the tempo induction contest held as part of the ISMIR 2004 Audio Description Contests, organized at the University Pompeu Fabra in Barcelona in September 2004 and won by Anssi Klapuri from Tampere University.[…]

BallroomDancers.com gives many informations [sic] on ballroom dancing (online lessons, etc.). Some characteristic excerpts of many dance styles are provided in real audio format. Their tempi are also available.

MusicNet

MusicNet

MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note’s position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping. The labels are verified by trained musicians; we estimate a labeling error rate of 4%. We offer the MusicNet labels to the machine learning and music communities as a resource for training models and a common benchmark for comparing results.

Data with purely automatically generated labels (they inferred the annotation from the raw samples using DSP, giving larger data sets and more errors than hand-labelled stuff.)
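For intuition about how a score gets aligned to a recording by dynamic time warping, here is a minimal textbook DTW on two one-dimensional sequences. It is a sketch of the general technique, not the MusicNet authors' pipeline: the function name `dtw_path`, the use of plain pitch numbers as features, and the absolute-difference cost are all assumptions made for illustration.

```python
def dtw_path(score, audio):
    """Align two numeric sequences with classic dynamic time warping.

    Returns (total_cost, path), where path is a list of
    (score_index, audio_index) pairs from start to end.
    """
    n, m = len(score), len(audio)
    inf = float("inf")
    # acc[i][j] = cheapest cost of aligning score[:i] with audio[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(score[i - 1] - audio[j - 1])
            acc[i][j] = cost + min(
                acc[i - 1][j - 1],  # both sequences advance
                acc[i - 1][j],      # score advances, audio frame is reused
                acc[i][j - 1],      # audio advances, score note is held
            )

    # Backtrack from the final cell to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return acc[n][m], path


if __name__ == "__main__":
    score_pitches = [60, 62, 64]              # three notes in the score
    audio_pitches = [60, 60, 62, 62, 62, 64]  # per-frame pitch estimates
    cost, path = dtw_path(score_pitches, audio_pitches)
    print(cost, path)
```

Each score note ends up mapped to a run of audio frames, and the first frame of each run gives that note's onset time. Real systems compare spectral features rather than single pitch values, and any alignment error becomes label error.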

Magnatagatune

Magnatagatune is a data set of pop songs with substantial audio and metadata, good for classification tasks. Announcement. (Data)

This dataset consists of ~25000 29s long music clips, each of them annotated with a combination of 188 tags. The annotations have been collected through Edith’s “TagATune” game. The clips are excerpts of songs published by Magnatune.com

There is a list of articles using this data set.

Music – multitrack/stems

🚧 see if spotify made their acapella/whole song database available.

Music stems sets of suspicious provenance, via Reddit, were on r/SongStems, but it turned out, unsurprisingly, that they were banned for copyright violations.

They have been replaced by a torrent?

Sisec signal separation

Sisec sigsep did a multitrack musical separation contest

Professionally-produced music recordings

New dataset and Python tools to handle it. 100 training songs, 50 test songs, all stereo@44.1 kHz produced recordings. All songs include drums, bass, other, vocals stems. Songs are encoded in the Native Instruments stems format, with a tool to convert them back and forth to wav and to load them directly in Python. Automatic download of data. Python code to analyze results and produce plots for your own paper.

The actual dataset is AFAICT called musdb.

MedleyDB

MedleyDB (BSTM14):

MedleyDB, a dataset of annotated, royalty-free multitrack recordings. MedleyDB was curated primarily to support research on melody extraction, addressing important shortcomings of existing collections. For each song we provide melody f0 annotations as well as instrument activations for evaluating automatic instrument recognition. The dataset is also useful for research on tasks that require access to the individual tracks of a song such as source separation and automatic mixing…

Dataset Snapshot

Bonus: Mdbdrums

Mdbdrums (SWLH17)

This repository contains the MDB Drums dataset which consists of drum annotations and audio files for 23 tracks from the MedleyDB dataset.

Two annotation files are provided for each track. The first annotation file, termed class, groups the 7994 onsets into 6 classes based on drum instrument. The second annotation file, termed subclass, groups the onsets into 21 classes based on playing technique and instrument.

Musical – just metadata

tl;dr If I am using these datasets, I am using someone else’s outdated metadata. It will probably be better if I just benchmark against published research using these databases than reinvent my own wheel. Or maybe I would use these if I wanted to augment a dataset I already have? I’d want to be sure I could match these datasets together with high certainty in that case. Rumour holds that stitching these datasets together is challenging.

Million songs

You have to mention this one because everyone does, but it’s useless for me.

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Its purposes are:

The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features.[…]

The Million Song Dataset is also a cluster of complementary datasets contributed by the community:

The problem here is that this is a time-wasting circuitous process to get access to the raw data, and someone else’s suboptimal features are useless to me.

Mumu

Mumu (ONBS17):

MuMu is a Multimodal Music dataset with multi-label genre annotations that combines information from the Amazon Reviews dataset and the Million Song Dataset (MSD). The former contains millions of album customer reviews and album metadata gathered from Amazon.com. The latter is a collection of metadata and precomputed audio features for a million songs.

To map the information from both datasets we use MusicBrainz. This process yields the final set of 147,295 songs, which belong to 31,471 albums. For the mapped set of albums, there are 447,583 customer reviews from the Amazon Dataset. The dataset has been used for multi-label music genre classification experiments in the related publication. In addition to genre annotations, this dataset provides further information about each album, such as genre annotations, average rating, selling rank, similar products, and cover image url. For every text review it also provides helpfulness score of the reviews, average rating, and summary of the review.

The mapping between the three datasets (Amazon, MusicBrainz and MSD), genre annotations, metadata, data splits, text reviews and links to images are available here. Images and audio files can not be released due to copyright issues.

Music – individual notes and voices

Nsynth and fs4s are both good. Get them for yourself.

freesound4seconds

Kyle McDonald, Freesound 4 seconds:

A mirror of all 126,900 sounds on Freesound less than 4 seconds long, as of April 4, 2017. Metadata for all sounds is stored in the json.zip files, and the high quality mp3s are stored in the mp3.zip files.

nsynth

Nsynth:

NSynth is an audio dataset containing 306,043 musical notes, each with a unique pitch, timbre, and envelope. For 1,006 instruments from commercial sample libraries, we generated four second, monophonic 16kHz audio snippets, referred to as notes, by ranging over every pitch of a standard MIDI piano (21-108) as well as five different velocities (25, 50, 75, 100, 127). The note was held for the first three seconds and allowed to decay for the final second.
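The sampling grid in that description is easy to enumerate, which gives a quick sanity check on the note count. A small sketch using only the numbers quoted above; my guess that the shortfall comes from instruments that cannot produce every pitch or velocity is an assumption, not something stated in the quoted text.

```python
from itertools import product

PITCHES = range(21, 109)             # MIDI 21-108 inclusive: 88 piano keys
VELOCITIES = (25, 50, 75, 100, 127)  # the five velocities in the quote
N_INSTRUMENTS = 1006
N_NOTES_REPORTED = 306_043

# Every (pitch, velocity) combination a single instrument could be sampled at.
grid = list(product(PITCHES, VELOCITIES))
notes_per_instrument = len(grid)

# If every instrument covered the full grid, this is how many notes there would be.
upper_bound = N_INSTRUMENTS * notes_per_instrument

# Fraction of the full grid that is actually present in the dataset.
coverage = N_NOTES_REPORTED / upper_bound

if __name__ == "__main__":
    print(notes_per_instrument, upper_bound, f"{coverage:.1%}")
```

The full grid would hold more notes than the dataset reports, so roughly seven in ten of the possible (instrument, pitch, velocity) cells are filled. Worth remembering when sampling batches uniformly over pitch.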

Other well-known science-y music datasets

Freesound doesn’t quite fit in with the rest, but it’s worth knowing anyway. It is an incredible database of raw samples for analysis, annotated with various Essentia descriptors (i.e. hand-crafted features) plus user tags, descriptions and general good times. It deserves a whole entry of its own, if perhaps under “acoustic” rather than “musical” corpora.

Other lists of yet more datasets

Music – MIDI

MIDI! a symbolic music representation! Easy! not that flexible! but well crowd-sourced.

Colin Raffel’s Lakh MIDI dataset:

The Lakh MIDI dataset is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. Its goal is to facilitate large-scale music information retrieval, both symbolic (using the MIDI files alone) and audio content-based (using information extracted from the MIDI files as annotations for the matched audio files).

Slightly mysterious, the Musical AI MIDI DATAset

If you have MIDI, but you would prefer to have audio, perhaps you could render it to audio using MrsWatson or some other audio software libraries.
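To show what rendering amounts to in the crudest case, here is a standard-library sketch that turns note events into a mono WAV of decaying sine tones. It is not MrsWatson and it is not a MIDI parser: it assumes the notes have already been extracted as (pitch, start_seconds, duration_seconds) tuples, and the function names, the 16 kHz sample rate and the linear fade-out are all illustrative choices of mine.

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # Hz; arbitrary choice for this sketch


def midi_to_hz(pitch):
    """Equal-tempered frequency of a MIDI note number (69 = A4 = 440 Hz)."""
    return 440.0 * 2.0 ** ((pitch - 69) / 12.0)


def render_notes(notes, sample_rate=SAMPLE_RATE):
    """Mix (pitch, start_s, duration_s) notes into a list of floats in [-1, 1]."""
    total = max(start + dur for _, start, dur in notes)
    samples = [0.0] * int(round(total * sample_rate))
    for pitch, start, dur in notes:
        freq = midi_to_hz(pitch)
        first = int(round(start * sample_rate))
        count = int(round(dur * sample_rate))
        for k in range(min(count, len(samples) - first)):
            envelope = 1.0 - k / count  # linear fade-out to avoid a click
            samples[first + k] += envelope * math.sin(
                2.0 * math.pi * freq * k / sample_rate
            )
    # Normalise only if overlapping notes pushed the mix past full scale.
    peak = max(1.0, max(abs(s) for s in samples))
    return [s / peak for s in samples]


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples as 16-bit mono PCM."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sample_rate)
        f.writeframes(
            b"".join(struct.pack("<h", int(s * 32767)) for s in samples)
        )


if __name__ == "__main__":
    melody = [(60, 0.0, 0.5), (64, 0.5, 0.5), (67, 1.0, 0.5)]
    write_wav("melody.wav", render_notes(melody))
```

A proper renderer swaps the sine for a sampler or synthesizer plugin. The timbre you choose becomes part of the dataset, so it is worth varying it if the audio is for training.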

From Christian Walder at Data61: SymbolicMusicMidiDataV1.0

Music data sets of suspicious provenance, via Reddit:

Voice

Mozilla’s open-source crowd-sourced CommonVoice dataset:

Most of the data used by large companies isn’t available to the majority of people. We think that stifles innovation. So we’ve launched Common Voice, a project to help make voice recognition open and accessible to everyone.

Now you can donate your voice to help us build an open-source voice database that anyone can use to make innovative apps for devices and the web. Read a sentence to help machines learn how real people speak. Check the work of other contributors to improve the quality. It’s that simple!

Refs

Bertin-Mahieux, Thierry, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. 2011. “The Million Song Dataset.” In 12th International Society for Music Information Retrieval Conference (ISMIR 2011).

Bittner, Rachel M., Justin Salamon, Mike Tierney, Matthias Mauch, Chris Cannam, and Juan Pablo Bello. 2014. “MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research.” In ISMIR, 14:155–60.

Defferrard, Michaël, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. 2017. “FMA: A Dataset for Music Analysis.” In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China. http://arxiv.org/abs/1612.01840.

Fonseca, Eduardo, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra. 2019. “Learning Sound Event Classifiers from Web Audio with Noisy Labels,” January. http://arxiv.org/abs/1901.01189.

Fonseca, Eduardo, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra. 2017. “Freesound Datasets: A Platform for the Creation of Open Audio Datasets.” In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China. https://ismir2017.smcnus.org/wp-content/uploads/2017/10/161_Paper.pdf.

Gemmeke, Jort F., Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. “Audio Set: An Ontology and Human-Labeled Dataset for Audio Events.” In Proceedings of ICASSP 2017. New Orleans, LA. https://research.google.com/pubs/pub45857.html.

Gillet, Olivier, and Gaël Richard. 2006. “ENST-Drums: An Extensive Audio-Visual Database for Drum Signals Processing.” In ISMIR. http://ismir2006.ismir.net/PAPERS/ISMIR0627_Paper.pdf.

Gouyon, F., A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, and P. Cano. 2006. “An Experimental Comparison of Audio Tempo Induction Algorithms.” IEEE Transactions on Audio, Speech, and Language Processing 14 (5): 1832–44. https://doi.org/10.1109/TSA.2005.858509.

Law, Edith, Kris West, and Michael I. Mandel. 2009. “Evaluation of Algorithms Using Games: The Case of Music Tagging.” In ISMIR. http://ismir2009.ismir.net/proceedings/OS5-5.pdf.

Oramas, Sergio, Oriol Nieto, Francesco Barbieri, and Xavier Serra. 2017. “Multi-Label Music Genre Classification from Audio, Text, and Images Using Deep Features.” In ISMIR. http://arxiv.org/abs/1707.04916.

Southall, Carl, Chih-Wei Wu, Alexander Lerch, and Jason A. Hockman. 2017. “MDB Drums — an Annotated Subset of MedleyDB for Automatic Drum Transcription.” In Late Breaking Demo (Extended Abstract), Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Suzhou: International Society for Music Information Retrieval (ISMIR). http://www.musicinformatics.gatech.edu/wp-content_nondefault/uploads/2017/10/Wu-et-al_2017_MDB-Drums-An-Annotated-Subset-of-MedleyDB-for-Automatic-Drum-Transcription.pdf.

Thickstun, John, Zaid Harchaoui, and Sham Kakade. 2017. “Learning Features of Music from Scratch.” In Proceedings of International Conference on Learning Representations (ICLR) 2017. http://arxiv.org/abs/1611.09827.