Data sharing

for research and science

Tips and tricks for collaborative data sharing, e.g. for open notebook or other reproducible research. To actually use the data, you want some kind of database. But first you have to get hold of the data to put it in that database.

You can download it from some open data set library, although the workflow for many of these is not ideal if you have ongoing research, since the data gets uploaded once and thereafter is static. Nonetheless this basic workflow is familiar and simple. Try, say, Figshare, which hosts the supporting data for many amazing papers. E.g. here's 1.4 GB of synapses firing.
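The download-once workflow can be sketched in a few lines of Python. This is a minimal illustration, not any repository's official client: the function name `fetch_dataset` and its parameters are my own invention, and it assumes the data library gives you a direct file URL and, ideally, a published SHA-256 checksum to compare against.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_dataset(url, dest, expected_sha256=None, chunk_size=1 << 20):
    """Stream `url` to `dest` and return the SHA-256 hex digest of the bytes written.

    `fetch_dataset` is a hypothetical helper, not part of any library.
    If `expected_sha256` is given and does not match, the downloaded file
    is removed and ValueError is raised.
    """
    dest = Path(dest)
    digest = hashlib.sha256()
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Read in chunks so a multi-gigabyte file never sits wholly in memory.
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            out.write(chunk)
    actual = digest.hexdigest()
    if expected_sha256 is not None and actual != expected_sha256.lower():
        dest.unlink()
        raise ValueError(f"checksum mismatch: got {actual}")
    return actual
```

Call it with whatever direct download link the data library exposes for a file. Keeping the returned digest next to your analysis code gives you a cheap way to confirm later that you are still working from the same static snapshot.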

A more modern option is a sync system, which copes with data that changes. This is not necessarily ideal for provenance if you are not tracking who changed what, when. But a lot of systems do support provenance.
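If your sync system does not track provenance, a crude "who changed what, when" log is easy to keep by hand. The following is a sketch under stated assumptions, not how any particular sync tool does it: the names `snapshot` and `record_changes` and the JSON-lines log layout are hypothetical, and it assumes the log file lives outside the data directory and that each file is small enough to hash in memory.

```python
import getpass
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def snapshot(data_dir):
    """Map each file's relative path to the SHA-256 of its contents."""
    data_dir = Path(data_dir)
    return {
        p.relative_to(data_dir).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def record_changes(data_dir, log_path, author=None):
    """Append one JSON line per added, modified, or removed file; return the new entries."""
    log_path = Path(log_path)
    # Rebuild the last known state by replaying the existing log.
    known = {}
    if log_path.exists():
        for line in log_path.read_text().splitlines():
            entry = json.loads(line)
            if entry["change"] == "removed":
                known.pop(entry["path"], None)
            else:
                known[entry["path"]] = entry["sha256"]
    current = snapshot(data_dir)
    when = datetime.now(timezone.utc).isoformat()
    who = author or getpass.getuser()  # assumption: the OS login name identifies the editor
    entries = []
    for path in sorted(set(known) | set(current)):
        if known.get(path) == current.get(path):
            continue  # unchanged since the last recorded state
        if path not in known:
            change = "added"
        elif path not in current:
            change = "removed"
        else:
            change = "modified"
        entries.append({"who": who, "when": when, "path": path,
                        "change": change, "sha256": current.get(path)})
    with open(log_path, "a") as log:
        for entry in entries:
            log.write(json.dumps(entry) + "\n")
    return entries
```

Running `record_changes` before each sync appends only the differences since the previous run, so the log stays readable and can be replayed to reconstruct the state of the dataset at any recorded point. It records who ran the script, which is only a proxy for who made the edit.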

Some campuses offer their own systems, e.g. my university offers resdata.

Now, let's think of some more exotic specialised options.

Dat

One hip solution is dat, which tracks updates to your datasets and shares the updates in a distributed fashion. It is very similar to syncthing, with a slightly different emphasis - sharing discoverable data to strangers rather than to friends, with a focus on datasets. You could also use it for backups or other sharing.

Dat is the package manager for data. Share files with version control, back up data to servers, browse remote files on demand, and automate long-term data preservation. Secure, distributed, fast.

It has a rich ecosystem of distributed servers, and a GUI.

However, you cannot use it for, e.g., synchronised collation of logs from many different servers in the cloud, because it's one-writer-many-readers.

$ npm install -g dat hypercored
…
$ mkdir MyData
$ cd MyData
$ dat create
> Title My Amazing Data
> Title My Awesome Dat
> Description This is a dat

Created empty Dat in /Users/me/MyData/.dat

$ dat share


Qu

Qu publishes any old data from a MongoDB store. MongoDB needs more effort to set up than I am usually prepared to tolerate, and isn't great for dense binary blobs, which are my stock in trade, so I won't explore that further.

Misc

Google's open data set protocol, which they call their “Dataset Publishing Language”, is a standard for medium-size datasets with EZ visualisations.