Tips and tricks for collaborative data sharing, e.g. for reproducible research.
Related: the problems of organising the data efficiently for the task in hand. For that, see database. Also the problem of getting some good data for your research project. For that, you probably want a data set repository.
Web data repositories
You’ve finished writing a paper? Congratulations.
These are online services that host supporting data from your finished projects in the name of reproducible research. The data gets uploaded once and thereafter is static.
There is not much to this, except that you might want verification of your data – e.g. someone who will vouch that you have not tampered with it after publication. You might also want a persistent identifier such as a DOI so that other researchers can refer to your work in an academe-endorsed fashion.
- Figshare, which hosts the supporting data for many researchers. It gives you a DOI for your dataset. Up to 5GB. Free.
- Zenodo is similar. Backed by CERN, on their infrastructure. Uploads get a DOI. Up to 50GB. Free.
- IEEE Dataport is free for IEEE members and happily hosts 2TB datasets. It gives you a DOI and integrates with many IEEE publications, and it allows convenient access from the Amazon cloud via AWS, which might be where your data is anyway. For the beta period (all of 2019) it is free. Thereafter, they charge USD2000 for an open access upload; otherwise only other IEEE Dataport users can get at your data. I know this is not an unusual way for access to journal articles to work, but for data sets it feels like a ham-fisted way of enforcing scarcity.
- Some campuses offer their own systems, e.g. my university offers resdata.
- DIY option. You could probably upload your data, if not too large, to GitHub and, for veracity, get a trusted party to cryptographically sign it. Or indeed you could upload it anywhere and get someone to cryptographically sign it. The problem with such DIY solutions is that they are unstable: very few data sets last more than a few years with this kind of setup. Campus web servers shut down, hosting fees go up, etc. On the plus side, you can make a nice presentational web page explaining everything, with nice formatting for the text and tables and such.
- Open Science Framework doesn’t host data, but it does index data sets in Google Drive or whatever and make them coherently available to other researchers.
The question that you are asking about all of these, if you are me, is: can I make a nice HTML front-end to my media examples? The answer is, AFAICT, no. Only in the DIY option.
Recommendation: If your data is small, make a DIY site for users and also upload the data to Zenodo to host it.
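If you go the DIY route, the thing a trusted party would sign is presumably a list of checksums rather than the data itself. Here is a minimal sketch of that, assuming SHA-256 digests and a flat mapping from relative path to digest. The function names and the manifest layout are my own invention for illustration, not any repository's format, and the signing step itself (e.g. with GPG) is left out.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file's relative POSIX path to its SHA-256 hex digest."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(data_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are now missing or have different contents."""
    current = build_manifest(data_dir)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

Publish the manifest next to the data at release time. Anyone who later downloads the data can run `changed_files` against it, and an empty list means nothing listed in the manifest has been altered or removed.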
If you are sharing data for ongoing collaboration (you are still accumulating data) you might want a different tool, with less focus on DOIs/verification and more on convenient updating and reproducibility of intermediate results.
Realistically, seeing how often data sets are found to be flawed, or how often they can be improved, I’m not especially interested in verifiable one-off spectacular data releases. I’m interested in accessing collaborative, incremental, and improvable data. That is, after all, how research itself progresses.
The next options are solutions to simplify that kind of thing.
a.k.a. Data science Version control
DVC looks promising. It versions code together with data assets kept in some external data store like S3 or whatever, which means they are shareable if you set the permissions right. As it also integrates build tooling, you should read about it there.
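The general idea behind this style of tool, as I understand it, is that the bulky files live in a content-addressed store while only a small hash pointer gets committed alongside the code. Below is a toy sketch of that idea. It is not DVC's implementation or API: the function names, the use of SHA-256 and the directory layout are assumptions made for illustration, and a local folder stands in for S3.

```python
import hashlib
import shutil
from pathlib import Path


def add_to_store(data_file: Path, store: Path) -> str:
    """Copy a file into the store under its content hash and return the hash."""
    # Reads the whole file at once, which is fine for a toy but not for huge data.
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    target = store / digest[:2] / digest[2:]
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(data_file, target)
    return digest


def checkout(digest: str, store: Path, destination: Path) -> None:
    """Restore the version of the data identified by a recorded hash."""
    shutil.copyfile(store / digest[:2] / digest[2:], destination)
```

The returned hash is the only thing that needs to go into version control. Checking out an old commit then tells you which hash, and hence which version of the data, that code was run against.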
One hip solution is dat, which tracks updates to your datasets and shares the updates in a distributed fashion. It is similar to Syncthing, with a different emphasis: sharing discoverable data with strangers rather than with friends, with a focus on datasets. You could also use it for backups or other sharing.
Dat is the package manager for data. Share files with version control, back up data to servers, browse remote files on demand, and automate long-term data preservation. Secure, distributed, fast.
Rich ecosystems of distributed servers. GUI.
However, you cannot use it for, e.g., synchronised collation of logs from many different servers in the cloud, because it is one-writer-many-readers.
Not sure yet. TBC.
Qu publishes any old data from a MongoDB store. MongoDB needs more effort to set up than I am usually prepared to tolerate, and isn’t great for dense binary blobs, which are my stock in trade, so I won’t explore that further.
Google’s open data set protocol, which they call their “Dataset Publishing Language”, is a standard for medium-size datasets with EZ visualisations.