
Open notebook science

and other strategies for reproducible research

Tom Gauld:

Hazard labels used in our laboratory

Methodologies for publishing your methods, not technical details. Technical details are under build/pipelines tools. The painful process of journals and article validation is under academic publishing.

The Knowledge Repository is one open workflow sharing platform. Motivation:

[…]our process combines the code review of engineering with the peer review of academia, wrapped in tools to make it all go at startup speed. As in code reviews, we check for code correctness and best practices and tools. As in peer reviews, we check for methodological improvements, connections with preexisting work, and precision in expository claims. We typically don’t aim for a research post to cover every corner of investigation, but instead prefer quick iterations that are correct and transparent about their limitations.

This is coupled closely with build tools.

Actually existing reproducible research tools

See Scientific workbooks as well as scientific computation build tools.

Basic steps toward reproducible research. See also scientific computation workflow, and build/pipelines tools.

Open notebooks

What do you get when you take your scientific workbooks, or something similar, and publish them online along with the data? An open notebook! These are huge in the machine learning pedagogy world right now, and small-to-medium in applied machine learning, especially at the recruitment end of it. They are noticeable but rare AFAICS in the rest of the world.

If you want in-depth justifications for open notebooks, see Caleb McDaniel or Ben Marwick’s slide deck.

I’m interested in this because it seems like the actual original pitch for how scientific research was supposed to work, with rapid improvement upon each other’s ideas. Whether I get around to fostering such stuff, despite the fact that it is not valued by my employer, is the question.

Containerized workflow

Docker is designed for reproducible deployment, which makes it an approximate fit for reproducible research. See docker for reproducible research.
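Short of building a full container image, a lighter first step in the same spirit is to record which environment produced a result. This is not Docker, and it only captures the Python layer. The following is a minimal sketch using only the Python standard library; the function name `environment_snapshot` is hypothetical, made up for illustration.

```python
import json
import platform
from importlib import metadata


def environment_snapshot():
    """Collect interpreter, OS, and installed-package versions as a dict."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken or missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        # Sort by package name so that two snapshots diff cleanly.
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the snapshot as JSON so it can be saved next to the results.
    print(json.dumps(environment_snapshot(), indent=2))
```

Saving this output alongside a result at least tells a later reader which package versions were in play, even though, unlike a container, it does nothing to recreate them.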

Build tools

A reproducible experiment is closely coupled with build tools, which recreate all the possibly complicated and lengthy steps. Some of the build tools I document have reproducibility as a primary focus, notably DVC, drake, lancet, and pachyderm.
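To make the idea concrete, here is a minimal sketch of the core trick such tools rely on: fingerprint each step's inputs and rerun the step only when the fingerprint changes. It is not how DVC, drake, lancet, or pachyderm are actually implemented. The names `run_step`, `file_digest`, and `build_cache.json` are hypothetical, chosen for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_step(name, inputs, output, action, cache_file="build_cache.json"):
    """Run `action(inputs, output)` only if the inputs changed since the last run.

    Returns True if the step was (re)run, False if it was skipped.
    """
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    # Fingerprint the step by the content hash of every input file.
    fingerprint = {str(p): file_digest(p) for p in inputs}
    if cache.get(name) == fingerprint and Path(output).exists():
        return False  # inputs unchanged and output present: nothing to do
    action(inputs, output)
    # Record the fingerprint only after the action has succeeded.
    cache[name] = fingerprint
    cache_path.write_text(json.dumps(cache, indent=2, sort_keys=True))
    return True
```

A step would be declared as, say, `run_step("summarise", ["raw.csv"], "summary.csv", summarise)`, where `summarise` is whatever function writes the output file. Real tools add much more, such as tracking the code and parameters as well as the data, and resolving dependencies between steps.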

Sundry data sharing ideas

See Data sharing.


codeocean seems to be a major entrant here.

For the first time, researchers, engineers, developers and scientists can upload code and data in any open source programming language and link working code in a computational environment with the associated article for free. We assign a Digital Object Identifier (DOI) to the algorithm, providing correct attribution and a connection to the published research.

The platform provides open access to the published software code and data to view and download for everyone for free. But the real treat is that users can execute all published code without installing anything on their personal computer. Everything runs in the cloud on CPUs or GPUs according to the user needs. We make it easy to change parameters, modify the code, upload data, run it again, and see how the results change.

They also ran a workshop on this.

Possibly Sylabs cloud is a similar project?