See also musical corpora for some specialised music ones.
Zenodo, for example, documents many published scientific data sets.
The Social Media Research Toolkit is a list of 50+ social media research tools curated by researchers at the Social Media Lab at Ted Rogers School of Management, Ryerson University.
So not necessarily data, but the software to get it.
The Seshat Global History Databank brings together the most current and comprehensive body of knowledge about human history in one place. Our unique Databank systematically collects what is currently known about the social and political organization of human societies and how civilizations have evolved over time.
UCI datasets are diverse. Here’s a nice one:
- Buzz prediction in online social media
- This dataset contains two different social networks: Twitter, a micro-blogging platform with exponential growth and extremely fast dynamics, and Tom’s Hardware, a worldwide forum network focusing on new technology with more conservative dynamics but distinctive features.
- J. Yang, J. Leskovec. Temporal Variation in Online Media. ACM International Conference on Web Search and Data Mining (WSDM ’11), 2011.
467 million Twitter posts from 20 million users covering a 7-month period from June 1, 2009 to December 31, 2009. We estimate this is about 20-30% of all public tweets published on Twitter during that time frame.
As per request from Twitter the data is no longer available.
The Higgs dataset has been built after monitoring the spreading processes on Twitter before, during and after the announcement of the discovery of a new particle with the features of the elusive Higgs boson on 4th July 2012. The messages posted in Twitter about this discovery between 1st and 7th July 2012 are considered.
Quandl has some financial and economic databases.
Patent citation networks (these are available and reasonably well annotated)
Wikipedia articles and their references (readily available)
- also includes easily-parseable mathematical data and theorems
- …and edit trails
- …and category annotations
- …and semantic metadata
- probably more data than you can use
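To make the "articles and their references" idea concrete, here is a minimal sketch of pulling outgoing article links from a chunk of wikitext to build an edge list. It assumes you already have raw wikitext in hand (e.g. from a dump); the `outgoing_links` helper, the regex and the sample string are all made up for illustration, and a real project would likely want a proper wikitext parser rather than a regex.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section]].
# A simplification: templates and nested links are not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def outgoing_links(wikitext: str) -> list[str]:
    """Return the distinct article titles linked from wikitext, in order."""
    targets = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # Crude namespace filter (File:, Category:, ...). It also drops
        # genuine article titles that contain a colon.
        if ":" in target:
            continue
        if target not in targets:
            targets.append(target)
    return targets


# Hypothetical snippet of wikitext, not taken from a real article.
SAMPLE = (
    "The [[Higgs boson]] was announced at [[CERN|the CERN lab]]; "
    "see [[Standard Model#History]] and [[File:Atlas.jpg|thumb]] "
    "[[Category:Physics]]."
)

links = outgoing_links(SAMPLE)
edges = [("Sample article", target) for target in links]
for source, target in edges:
    print(source, "->", target)
```

Running this over every article in a dump would give a directed link graph; category annotations and edit trails would need separate handling.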
Source code of large collaborative projects (Linux or BSD kernel, OpenOffice, Python, Perl, GCC etc.)
- can I parse such projects to see how interfaces form?
- Are there odd stylised facts about contribution to these that I might be able to explain?
- Or call-graphs?
This is possibly low-hanging fruit for me: I’ve got a fair bit of experience with parsing and SCM-wrangling. But is software engineering a good proxy for physical engineering? Can I parse other technical standards in the same way? (I know a few engineers, must ask them.)
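As a rough idea of what the call-graph question might look like in practice, here is a small sketch using Python's standard `ast` module on Python source only. The `call_graph` helper and the sample code are hypothetical; this is a name-based approximation (it does not resolve imports, methods or dynamic dispatch), and C projects like the kernel or GCC would need different tooling entirely.

```python
import ast
from collections import defaultdict


def call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function name to the names it calls.

    Calls inside nested functions are attributed to the enclosing
    top-level function. Method calls are recorded by attribute name only.
    """
    tree = ast.parse(source)
    graph = defaultdict(set)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            graph[node.name]  # register the function even if it calls nothing
            for sub in ast.walk(node):
                if isinstance(sub, ast.Call):
                    func = sub.func
                    if isinstance(func, ast.Name):
                        graph[node.name].add(func.id)
                    elif isinstance(func, ast.Attribute):
                        graph[node.name].add(func.attr)
    return dict(graph)


# Hypothetical source to analyse.
SAMPLE = '''
def parse(text):
    return tokenize(text)

def tokenize(text):
    return text.split()

def main():
    print(parse("a b"))
'''

graph = call_graph(SAMPLE)
for caller in sorted(graph):
    print(caller, "->", sorted(graph[caller]))
```

Applied per commit across a repository's history, the same idea would show how the graph, and hence the interfaces, change over time.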
Journal article cross-references. This is over-studied.
Estimating the number of SKUs as a surrogate for the divisions of a modern economy, à la Beinhocker. (There is lots of research into this because of Long Tail theories, though the primary data is rarely included. Might chase this.)
Free-text stuff: some blog data set? http://u.cs.biu.ac.il/~koppel/blogs/blogs.zip