See also musical corpora for some specialised music data sets.
Rdatasets collates all the most popular R datasets.
Zenodo, for example, archives many published scientific data sets.
Nuit Blanche’s listing of data sets is handy if you want some good inverse-problem signal processing challenges.
Amazon’s list of the datasets they bother to host is a kind of who’s-who of data.
The Social Media Research Toolkit is a list of 50+ social media research tools curated by researchers at the Social Media Lab at Ted Rogers School of Management, Ryerson University.
So it is not necessarily data, but rather the software to get it.
The Seshat Global History Databank, in its own words, “brings together the most current and comprehensive body of knowledge about human history in one place. Our unique Databank systematically collects what is currently known about the social and political organization of human societies and how civilizations have evolved over time.”
Quandl has some databases.