Tips and tricks for collaborative data sharing.
To actually make use of shared data, you want some kind of database. But first you have to get the data into that database.
You can download it from an open data set library, although the workflow for many of these does not suit ongoing research, since the data gets uploaded once and is thereafter static. Nonetheless this basic workflow is familiar and simple. Try, say, Figshare, which hosts the supporting data for many amazing papers. E.g. here's 1.4 GB of synapses firing.
More modern: you can also use a file-sync system for changing data. This is not necessarily ideal for provenance, though, if you are not tracking who changed what, when.
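If you go the sync-folder route, you can bolt provenance on yourself. Here is a minimal sketch, assuming git is installed; the folder name `shared-data` and the CSV file are hypothetical stand-ins for whatever your sync tool is mirroring:

```shell
# Hypothetical synced folder; in practice this would be the
# directory your Dropbox/Syncthing/etc. client mirrors.
mkdir -p shared-data
git init -q shared-data

# Identify yourself so every change is attributed to someone.
git -C shared-data config user.name "Jane Researcher"
git -C shared-data config user.email "jane@example.com"

# Snapshot the current state of the data.
echo "subject,trial,value" > shared-data/trials.csv
git -C shared-data add trials.csv
git -C shared-data commit -q -m "Add first batch of trial data"

# Who changed what, when:
git -C shared-data log --format="%an %ad %s" --date=short
```

This gives you the who/what/when that a bare sync system omits, at the cost of remembering to commit.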
Now, let’s think of some more exotic options.
One hip solution is dat, which tracks updates to your datasets and shares those updates in a distributed fashion. It is very similar to syncthing, with a slightly different emphasis: sharing discoverable datasets with strangers rather than files with friends. You could also use it for backups or other sharing.
> Dat is the package manager for data. Share files with version control, back up data to servers, browse remote files on demand, and automate long-term data preservation. Secure, distributed, fast.
It has a rich ecosystem of distributed servers, and a GUI.
However, you cannot, for example, use it for synchronised collation of logs from many different servers in the cloud, because it is one-writer-many-readers.
```shell
$ npm install -g dat hypercored
...
$ mkdir MyData
$ cd MyData
$ dat create
> Title My Amazing Data
> Description This is a dat
Created empty Dat in /Users/me/MyData/.dat
$ dat share
```
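On the reading side, anyone who has the key that `dat share` prints can mirror the dataset. A sketch, assuming the `dat` CLI is installed; the key and folder name below are placeholders:

```shell
# Clone a remote dat into a local folder (the key here is a
# made-up placeholder; use the one your collaborator shares).
dat clone dat://<64-character-key> ~/MyDataCopy

# Later, fetch whatever updates the writer has published.
cd ~/MyDataCopy
dat pull
```

Only the original creator can write; everyone else pulls, which is exactly the one-writer-many-readers model noted above.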
Qu publishes any old data from a MongoDB store. MongoDB needs more effort to set up than I am usually prepared to tolerate, and isn't great for dense binary blobs, which are my stock in trade, so I won't explore it further.
Google’s open data set protocol, which they call their “Dataset Publishing Language” (DSPL), is a standard for medium-size datasets with EZ visualisations.
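For flavour, a DSPL dataset is essentially an XML metadata file bundled with CSVs. A skeletal sketch from memory, not a complete valid dataset; the table name, columns, and file are invented for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<dspl xmlns="http://schemas.google.com/dspl/2010">
  <info>
    <name><value>My amazing dataset</value></name>
    <description><value>Hypothetical example metadata.</value></description>
  </info>
  <provider>
    <name><value>Me</value></name>
  </provider>
  <tables>
    <!-- Each table points at a CSV shipped alongside this file. -->
    <table id="trials">
      <column id="year" type="date" format="yyyy"/>
      <column id="count" type="integer"/>
      <data><file format="csv" encoding="utf-8">trials.csv</file></data>
    </table>
  </tables>
</dspl>
```

A full dataset also declares concepts and slices that tie tables to visualisable dimensions, which I won't reproduce here.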
- Open Science Framework seems to be like GitHub, but with a focus on preserving dataset assets well, rather than on code change sets.
- rOpenSci is, AFAICT, a way of seamlessly importing disparate online data sets into your analysis.
- Dan Hopkins and Brendan Nyhan on How to make scientific research more trustworthy