The Living Thing / Notebooks : Audio corpora

Datasets of sound.

See also machine listening, data sets and, from an artistic angle, sample libraries.


AudioSet (GEFJ17):

Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets — principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 635 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.
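AudioSet distributes its labels as CSV files of 10-second YouTube segments, each row carrying a video ID, start and end times, and a quoted list of ontology class IDs. A minimal parsing sketch, assuming that published layout (the video IDs below are hypothetical):

```python
import csv
import io

def parse_audioset_segments(csv_text):
    """Parse AudioSet-style segment rows into (ytid, start, end, labels) tuples.

    Assumes the published layout: ytid, start_seconds, end_seconds, then a
    quoted comma-separated list of ontology label IDs; header lines start
    with '#'.
    """
    rows = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip header/comment lines
        ytid, start, end, labels = row[0], float(row[1]), float(row[2]), row[3]
        rows.append((ytid, start, end, labels.split(",")))
    return rows

# Illustrative rows in the distributed format (made-up video IDs):
example = '''# YTID, start_seconds, end_seconds, positive_labels
-0RWZT-miFs, 420.000, 430.000, "/m/03v3yw"
-0Gj8-vB1q4, 30.000, 40.000, "/m/01jg02,/m/068hy"
'''
segments = parse_audioset_segments(example)
```

Each segment is 10 seconds long, and a segment may carry several labels at once, which is what makes the dataset an audio *event* corpus rather than a single-class one.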


YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs and associated labels from a diverse vocabulary of 4700+ visual entities. It comes with precomputed state-of-the-art audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This makes it possible to get started on this dataset by training a baseline video model in less than a day on a single machine! At the same time, the dataset’s scale and diversity can enable deep exploration of complex audio-visual models that can take weeks to train even in a distributed fashion.

Our goal is to accelerate research on large-scale video understanding, representation learning, noisy data modeling, transfer learning, and domain adaptation approaches for video. More details about the dataset and initial experiments can be found in our technical report. Some statistics from the latest version of the dataset are included below.


I’m obsessed with music, I make music, and I like using computers to do it. As a statistics guy, I habitually use statistics to do it, and that needs data. Here are some corpora and some extra analysis tools, for the creation of Mad Science Music.

Universitat Pompeu Fabra is collecting large and minutely analysed data sets from several distinct, not-necessarily-central-European traditions, and has comprehensive software tools too:

Bonus raw-note datasets I just noticed. Some of these I found while reading about “Deep Learning”, which is a whole ‘nother story.

Data with purely automatically generated labels (the annotations are inferred from the raw samples using DSP, which yields larger data sets but more errors than hand labelling):
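As a toy illustration of how such automatic labels come about: a crude energy-jump onset detector, pure stdlib, standing in for the much more careful DSP pipelines these corpora actually use. Every threshold and frame size here is an arbitrary illustrative choice:

```python
def auto_label_onsets(signal, frame_size=256, threshold=2.0):
    """Crude automatic annotation: mark a frame boundary as an onset when
    frame energy jumps by more than `threshold` times the previous frame's
    energy. A toy stand-in for real DSP-based labelling, errors included."""
    energies = [
        sum(x * x for x in signal[i:i + frame_size])
        for i in range(0, len(signal) - frame_size + 1, frame_size)
    ]
    onsets = []
    for k in range(1, len(energies)):
        # guard against division-by-silence with a tiny floor
        if energies[k] > threshold * max(energies[k - 1], 1e-12):
            onsets.append(k * frame_size)  # onset position in samples
    return onsets

# Silence followed by a burst: exactly one detected onset, at sample 512.
sig = [0.0] * 512 + [1.0] * 512
onsets = auto_label_onsets(sig)
```

The appeal is obvious (label millions of samples for free), and so is the failure mode: any signal that trips the heuristic gets annotated, right or wrong.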

Other well-known science-y music datasets:

Freesound doesn’t quite fit in with the rest, but it’s worth knowing anyway. Incredible database of raw samples for analysis, annotated with various Essentia descriptors (i.e. hand-crafted features) plus user tags, descriptions and general good times, and deserves a whole entry of its own, if perhaps under “acoustic” rather than “musical” corpora.
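Those Essentia descriptors arrive as a deeply nested JSON blob per sound, so in practice you spend time walking dotted paths through it. A small sketch of that, assuming a nested-dict payload; the field names below are illustrative, not a guaranteed Freesound schema:

```python
def get_descriptor(analysis, path):
    """Walk a nested Freesound-style analysis dict by dotted path.

    Returns None when any key along the path is absent, so missing
    descriptors degrade gracefully instead of raising KeyError.
    """
    node = analysis
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

# Hypothetical slice of an analysis payload:
analysis = {
    "lowlevel": {"spectral_centroid": {"mean": 1523.4}},
    "rhythm": {"bpm": 120.0},
}
bpm = get_descriptor(analysis, "rhythm.bpm")
```

Handy when you are bulk-filtering samples by, say, tempo or brightness before pulling down any actual audio.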

From Christian Walder at Data61: SymbolicMusicMidiDataV1.0

Music data sets of suspicious provenance, via Reddit:


Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., & Lamere, P. (2011) The Million Song Dataset. In 12th International Society for Music Information Retrieval Conference (ISMIR 2011).
Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., … Ritter, M. (2017) Audio Set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA
Gillet, O., & Richard, G. (n.d.) ENST-Drums: an extensive audio-visual database for drum signals processing.
Gouyon, F., Klapuri, A., Dixon, S., Alonso, M., Tzanetakis, G., Uhle, C., & Cano, P. (2006) An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech, and Language Processing, 14(5), 1832–1844. DOI.
Thickstun, J., Harchaoui, Z., & Kakade, S. (2017) Learning Features of Music from Scratch. In Proceedings of International Conference on Learning Representations (ICLR) 2017.