Datasets of sound.
See also machine listening, data sets and, from an artistic angle, sample libraries.
Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets — principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 635 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.
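To make the labeling unit concrete, here is a minimal sketch of parsing one of the released segment lists. The CSV layout assumed here (comment lines beginning with `#`, then rows of `YTID, start_seconds, end_seconds, positive_labels`) comes from the public Audio Set download rather than from the excerpt above, so treat it as an assumption.

```python
import csv

def load_audioset_segments(path):
    """Parse an Audio Set segments CSV into (ytid, start, end, [label ids]) tuples.

    Assumes the released format: '#'-prefixed header lines, then rows like
    --4gqARaEJE, 0.000, 10.000, "/m/068hy,/m/07q6cd_"
    """
    segments = []
    with open(path, newline="") as f:
        for row in csv.reader(f, skipinitialspace=True):
            if not row or row[0].startswith("#"):
                continue  # skip comment/header lines
            ytid, start, end = row[0], float(row[1]), float(row[2])
            labels = row[3].split(",")  # machine IDs drawn from the 635-class ontology
            segments.append((ytid, start, end, labels))
    return segments

# e.g. segments = load_audioset_segments("balanced_train_segments.csv")
```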
YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs and associated labels from a diverse vocabulary of 4700+ visual entities. It comes with precomputed state-of-the-art audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This makes it possible to get started on this dataset by training a baseline video model in less than a day on a single machine! At the same time, the dataset’s scale and diversity can enable deep exploration of complex audio-visual models that can take weeks to train even in a distributed fashion.
Our goal is to accelerate research on large-scale video understanding, representation learning, noisy data modeling, transfer learning, and domain adaptation approaches for video. More details about the dataset and initial experiments can be found in our technical report. Some statistics from the latest version of the dataset are included below.
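As a rough sketch of getting started with the precomputed video-level features: everything about the record schema below (feature names `id`, `labels`, `mean_rgb`, `mean_audio` and their sizes) is an assumption about the released TFRecords rather than something stated above, so verify it against the current download.

```python
import tensorflow as tf

# Assumed schema for the video-level records: a string id, a variable-length
# list of integer labels, a 1024-d mean RGB feature and a 128-d mean audio feature.
FEATURES = {
    "id": tf.io.FixedLenFeature([], tf.string),
    "labels": tf.io.VarLenFeature(tf.int64),
    "mean_rgb": tf.io.FixedLenFeature([1024], tf.float32),
    "mean_audio": tf.io.FixedLenFeature([128], tf.float32),
}

def parse_example(serialized):
    ex = tf.io.parse_single_example(serialized, FEATURES)
    return ex["id"], tf.sparse.to_dense(ex["labels"]), ex["mean_rgb"], ex["mean_audio"]

# "train-00.tfrecord" is a hypothetical filename; point this at a real shard.
dataset = tf.data.TFRecordDataset(["train-00.tfrecord"]).map(parse_example)
for vid, labels, rgb, audio in dataset.take(1):
    print(vid.numpy(), labels.numpy(), rgb.shape, audio.shape)
```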
I’m obsessed with music: I make it, and I like using computers to do so. As a statistics guy, I naturally reach for statistical methods in particular, and those need data. Here are some corpora, and some extra analysis tools, for the creation of Mad Science Music.
Universitat Pompeu Fabra is collecting large, minutely analyzed data sets from several distinct, not-necessarily-central-European traditions, and has comprehensive software tools too:
The 78 project is the Internet Archive’s collection of 78 rpm shellac records
Piano-midi.de: classical piano pieces
Nottingham: over 1000 folk tunes
MuseData: electronic library of classical music scores
JSB Chorales: set of four-part harmonized chorales
In this document we report on the tempo induction contest held as part of the ISMIR 2004 Audio Description Contests, organized at the University Pompeu Fabra in Barcelona in September 2004 and won by Anssi Klapuri from Tampere University.[…]
BallroomDancers.com provides a lot of information on ballroom dancing (online lessons, etc.). Some characteristic excerpts of many dance styles are provided in RealAudio format. Their tempi are also available.
Total number of instances: 698
Duration: ~30 s per instance
Total duration: ~20940 s
Genres: Cha Cha, 111; Jive, 60; Quickstep, 82; Rumba, 98; Samba, 86; Tango, 86; Viennese Waltz, 65; Slow Waltz, 110
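The per-genre counts are consistent with the quoted totals; a two-line sanity check:

```python
# Per-genre instance counts from the ISMIR 2004 ballroom set quoted above.
genres = {"Cha Cha": 111, "Jive": 60, "Quickstep": 82, "Rumba": 98,
          "Samba": 86, "Tango": 86, "Viennese Waltz": 65, "Slow Waltz": 110}

total = sum(genres.values())   # 698 instances
print(total, total * 30)       # 698, 20940 -> matches ~20940 s at ~30 s each
```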
MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note’s position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping. The labels are verified by trained musicians; we estimate a labeling error rate of 4%. We offer the MusicNet labels to the machine learning and music communities as a resource for training models and a common benchmark for comparing results.
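The alignment step is dynamic time warping; as a reminder of what that means, here is a generic, deliberately naive DTW over two feature sequences. This is an illustration of the technique, not the MusicNet alignment pipeline itself.

```python
import numpy as np

def dtw(X, Y, dist=lambda a, b: np.linalg.norm(a - b)):
    """Plain O(len(X)*len(Y)) dynamic time warping.

    X, Y: sequences of feature vectors (e.g. chroma frames from a score
    synthesis and from the recording). Returns the total alignment cost
    and the warping path as (i, j) index pairs.
    """
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(X[i - 1], Y[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal path from the end to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]
```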
Data with purely automatically generated labels (the annotations are inferred from the raw samples using DSP, giving larger data sets, but more errors, than hand-labelled collections)
NSynth is an audio dataset containing 306,043 musical notes, each with a unique pitch, timbre, and envelope. For 1,006 instruments from commercial sample libraries, we generated four second, monophonic 16 kHz audio snippets, referred to as notes, by ranging over every pitch of a standard MIDI piano (21-108) as well as five different velocities (25, 50, 75, 100, 127). The note was held for the first three seconds and allowed to decay for the final second.
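To make the combinatorics concrete: pitches 21-108 give 88 notes, and with 5 velocities each instrument contributes at most 440 notes, so 1,006 instruments bound the set at 442,640; the released count is lower, presumably because not every instrument covers the full range. A tiny enumeration of that grid, using only the constants quoted above:

```python
import itertools

PITCHES = range(21, 109)             # standard 88-key MIDI piano range, inclusive
VELOCITIES = (25, 50, 75, 100, 127)  # the five sampled velocities
N_INSTRUMENTS = 1006

grid = list(itertools.product(PITCHES, VELOCITIES))
print(len(grid))                     # 440 pitch/velocity pairs per instrument
print(len(grid) * N_INSTRUMENTS)     # 442640 upper bound vs. 306,043 notes released
```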
Kyle McDonald, Freesound 4 seconds:
A mirror of all 126,900 sounds on Freesound less than 4 seconds long, as of April 4, 2017. Metadata for all sounds is stored in the json.zip files, and the high quality mp3s are stored in the mp3.zip files.
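A sketch of walking the metadata dump; the internal layout of json.zip (one JSON document per sound, with Freesound’s usual fields) is a guess here, so adjust to whatever the archive actually contains:

```python
import json
import zipfile

def iter_metadata(zip_path="json.zip"):
    """Yield one parsed metadata dict per JSON file inside the archive.

    Assumes json.zip holds one JSON document per Freesound sound; the field
    names ('duration', 'tags', ...) are guesses at Freesound's standard
    metadata, not verified against this particular dump.
    """
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith(".json"):
                with zf.open(name) as f:
                    yield json.load(f)

short = [m for m in iter_metadata() if m.get("duration", 0) < 4.0]
print(len(short))
```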
Other well-known science-y music datasets:
The classics: USPOP, CAL500, CAL10K, etc.
RWC (cross-checks MIDI against audio)
MagnaTagATune: this dataset consists of ~25000 music clips, each 29 s long and annotated with a combination of 188 tags. The annotations were collected through Edith Law’s “TagATune” game. The clips are excerpts of songs published by Magnatune.com
There is a list of articles using this data set.
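A minimal loading sketch for the MagnaTagATune annotations; it assumes the commonly distributed tab-separated annotations_final.csv with one binary column per tag plus clip_id and mp3_path columns, none of which is guaranteed by the description above:

```python
import pandas as pd

# Hypothetical path; the file is assumed to have one row per clip and a
# binary column for each of the 188 tags.
ann = pd.read_csv("annotations_final.csv", sep="\t")

tag_cols = [c for c in ann.columns if c not in ("clip_id", "mp3_path")]
print(len(ann), "clips,", len(tag_cols), "tag columns")
print(ann[tag_cols].sum().sort_values(ascending=False).head(10))  # most frequent tags
```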
Freesound doesn’t quite fit in with the rest, but it’s worth knowing anyway. It is an incredible database of raw samples for analysis, annotated with various Essentia descriptors (i.e. hand-crafted features), plus user tags, descriptions and general good times. It deserves a whole entry of its own, perhaps under “acoustic” rather than “musical” corpora.
From Christian Walder at Data61: SymbolicMusicMidiDataV1.0
Music data sets of suspicious provenance, via Reddit:
- Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., & Lamere, P. (2011) The Million Song Dataset. In 12th International Society for Music Information Retrieval Conference (ISMIR 2011).
- Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., … Ritter, M. (2017) Audio Set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA
- Gillet, O., & Richard, G. (n.d.) ENST-Drums: an extensive audio-visual database for drum signals processing.
- Gouyon, F., Klapuri, A., Dixon, S., Alonso, M., Tzanetakis, G., Uhle, C., & Cano, P. (2006) An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech, and Language Processing, 14(5), 1832–1844. DOI.
- Thickstun, J., Harchaoui, Z., & Kakade, S. (2017) Learning Features of Music from Scratch. In Proceedings of International Conference on Learning Representations (ICLR) 2017.