See also musical corpora for some specialised music datasets.
Generic tools for construction thereof
Today’s state-of-the-art machine learning models require massive labeled training sets—which usually do not exist for real-world applications. Instead, Snorkel is based around the new data programming paradigm, in which the developer focuses on writing a set of labeling functions, which are just scripts that programmatically label data. The resulting labels are noisy, but Snorkel automatically models this process—learning, essentially, which labeling functions are more accurate than others—and then uses this to train an end model (for example, a deep neural network in TensorFlow).
Surprisingly, by modeling a noisy training set creation process in this way, we can take potentially low-quality labeling functions from the user, and use these to train high-quality end models. We see Snorkel as providing a general framework for many weak supervision techniques, and as defining a new programming model for weakly-supervised machine learning systems.
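To make the labeling-function idea concrete, here is a minimal sketch in plain Python. It does not use Snorkel's API, and it swaps Snorkel's learned model of labeling-function accuracy for a simple unweighted majority vote. The spam/ham task, the three heuristics, and every name below are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical label values for an illustrative spam/ham task.
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text):
    """Heuristic: messages containing a URL are probably spam."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    """Heuristic: prize-draw vocabulary suggests spam."""
    lowered = text.lower()
    return SPAM if "prize" in lowered or "winner" in lowered else ABSTAIN


def lf_short_message(text):
    """Heuristic: very short messages are probably legitimate."""
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]


def majority_vote(text, lfs=LABELING_FUNCTIONS):
    """Combine noisy votes; abstain when nothing fires or the vote is tied.

    This is a stand-in: Snorkel instead estimates how accurate each
    labeling function is and weights the votes accordingly.
    """
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes)
    best_count = max(counts.values())
    winners = [label for label, n in counts.items() if n == best_count]
    return winners[0] if len(winners) == 1 else ABSTAIN


docs = [
    "You are a winner, claim at http://x.example",
    "hi there",
    "see you at the meeting tomorrow afternoon",
]
noisy_labels = [majority_vote(d) for d in docs]
```

The resulting `noisy_labels` play the role of the noisy training set described above; documents on which every function abstains would simply be left out when training the end model.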
The Engauge Digitizer tool accepts image files (like PNG, JPEG and TIFF) containing graphs, and recovers the data points from those graphs. The resulting data points are usually used as input to other software applications. Conceptually, Engauge Digitizer is the opposite of a graphing tool that converts data points to graphs. [..] an image file is imported, digitized within Engauge, and exported as a table of numeric data to a text file.
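The core arithmetic behind recovering data from a plot image can be sketched as a per-axis linear calibration. This is an assumption about the general technique, not a description of Engauge Digitizer's internals; the function name, the pixel positions, and the axis values are all made up for illustration, and both axes are assumed to be linear (not logarithmic).

```python
def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to a data value.

    Assumes a linear axis calibrated by two reference points:
    pixel_a corresponds to value_a and pixel_b to value_b.
    """
    if pixel_a == pixel_b:
        raise ValueError("calibration points must differ")
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


# Hypothetical calibration: x = 0 at pixel column 100, x = 10 at column 500.
x_of = make_axis_map(100, 0.0, 500, 10.0)
# Image rows grow downward, so y = 0 sits at row 400 and y = 7 at row 50.
y_of = make_axis_map(400, 0.0, 50, 7.0)

# Pixel positions of curve points, e.g. picked by hand from the image.
clicked = [(100, 400), (300, 225), (500, 50)]
points = [(x_of(px), y_of(py)) for px, py in clicked]

# Export as a plain-text table, one "x,y" row per point.
table = "\n".join(f"{x:.3f},{y:.3f}" for x, y in points)
```

Here `table` is the kind of numeric text output the description mentions, ready to be saved to a file and read by other software.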
Prodigy is an interactive dataset annotator for training classifiers.
Rdatasets collates all the most popular R datasets.
Zenodo hosts many published scientific data sets.
Nuit Blanche’s listing of data sets is handy if you want some good inverse-problem signal processing challenges.
Amazon’s list of the datasets they bother to host is a kind of who’s-who of data.
The Social Media Research Toolkit is a list of 50+ social media research tools curated by researchers at the Social Media Lab at Ted Rogers School of Management, Ryerson University.
So: not necessarily data itself, but the software to get it.
The Seshat Global History Databank brings together the most current and comprehensive body of knowledge about human history in one place. Our unique Databank systematically collects what is currently known about the social and political organization of human societies and how civilizations have evolved over time.
Quandl has some databases.