
Research data sharing

for research and science

Usefulness: 🔧
Novelty: 💡
Uncertainty: 🤪 🤪 🤪
Incompleteness: 🚧 🚧 🚧

Tips and tricks for collaborative data sharing, e.g. for reproducible research.

Related: the problem of organising data efficiently for the task at hand; for that, see database. Also the problem of getting some good data for your research project; for that, you probably want a data set repository.

Web data repositories

You’ve finished writing a paper? Congratulations.

Online services to host supporting data from your finished projects in the name of reproducible research. The data gets uploaded once and thereafter is static.

There is not much to this, except that you might want verification of your data – e.g. someone who will vouch that you have not tampered with it after publication. You might also want a persistent identifier such as a DOI so that other researchers can refer to your work in an academe-endorsed fashion.
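The archive will typically record checksums and mint the DOI for you, but it does not hurt to publish a checksum alongside the archive yourself, so that downloaders can confirm they got the bytes you deposited. A minimal sketch, assuming your release is a tarball called mydata.tar.gz:

$ # record a checksum to publish next to the archive
$ sha256sum mydata.tar.gz > mydata.tar.gz.sha256
$ # a downloader verifies their copy against it
$ sha256sum -c mydata.tar.gz.sha256
mydata.tar.gz: OK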

The question that you are asking about all of these, if you are me, is: can I make a nice HTML front-end for my media examples? The answer is, AFAICT, no; only with the DIY option.

Recommendation: If your data is small, make a DIY site for users and also a Zenodo deposit to host it.
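For the Zenodo half, uploads can be scripted against their REST API rather than clicked through the web form. A rough sketch, assuming a personal access token in $ZENODO_TOKEN (check the current Zenodo API docs for the exact fields):

$ # create an empty deposition; the JSON response contains an id and a bucket URL
$ curl -s -X POST -H "Content-Type: application/json" -d '{}' \
    "https://zenodo.org/api/deposit/depositions?access_token=$ZENODO_TOKEN"
$ # upload the archive into the deposition's bucket
$ curl -s --upload-file mydata.tar.gz \
    "<bucket-url-from-response>/mydata.tar.gz?access_token=$ZENODO_TOKEN"
$ # publishing freezes the files and mints the DOI
$ curl -s -X POST \
    "https://zenodo.org/api/deposit/depositions/<deposition-id>/actions/publish?access_token=$ZENODO_TOKEN"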

If you are sharing data for ongoing collaboration (you are still accumulating data) you might want a different tool, with less focus on DOIs/verification and more on convenient updating and reproducibility of intermediate results.

Realistically, seeing how often data sets are found to be flawed, or how often they can be improved, I’m not especially interested in verifiable one-off spectacular data releases. I’m interested in accessing collaborative, incremental, and improvable data. That is, after all, how research itself progresses.

The next options are solutions to simplify that kind of thing.

DVC

a.k.a. Data Version Control

DVC looks promising. It versions code alongside data assets kept in some external data store such as S3 or whatever, which means they are shareable if you set the permissions right. As it also integrates build tooling, you should read about it there.
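A rough sketch of the workflow, assuming an existing git repository and an S3 bucket you control (my-bucket is a placeholder):

$ # inside the git repo
$ dvc init
$ dvc remote add -d myremote s3://my-bucket/dvc-store
$ dvc add data/raw
$ git add data/raw.dvc data/.gitignore .dvc/config
$ git commit -m "Track raw data with DVC"
$ dvc push
$ # a collaborator with read access to the bucket runs: git pull && dvc pull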

Dat

One hip solution is dat, which tracks updates to your datasets and shares the updates in a distributed fashion. It is similar to syncthing, with a different emphasis: sharing discoverable data with strangers rather than with friends, and a focus on datasets. You could also use it for backups or other sharing.

Dat is the package manager for data. Share files with version control, back up data to servers, browse remote files on demand, and automate long-term data preservation. Secure, distributed, fast.

Rich ecosystem of distributed servers, plus a GUI.

However, you cannot, e.g., use it for synchronised collation of logs from many different servers in the cloud, because it is one-writer-many-readers.

$ npm install -g dat hypercored

$ mkdir MyData
$ cd MyData
$ dat create
> Title My Amazing Data
> Description This is a dat

  Created empty Dat in /Users/me/MyData/.dat

$ dat share
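On the receiving end, a collaborator fetches the dataset with the key that dat share prints, and pulls updates as you publish them. A sketch (the key below is a placeholder):

$ dat clone dat://<64-character-key> MyDataCopy
$ cd MyDataCopy
$ dat pull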

Orbitdb

Not sure yet. TBC.

Qu

Qu publishes any old data from a MongoDB store. MongoDB needs more effort to set up than I am usually prepared to tolerate, and isn’t great for dense binary blobs, which are my stock in trade, so I won’t explore that further.

Misc

Google’s open data set protocol, which they call their “Dataset Publishing Language”, is a standard for medium-size datasets with EZ visualisations.