The Living Thing / Notebooks :

Build tools for data science

Build tools are for automating script build/execution from the command-line or from your IDE. See also text editors, citation management, academic writing workflow, python, cloud computing a.k.a. cloud wrangling, open notebook science, scientific computation workflows, scientific workbooks.

The classic here is make, which is a universal standard that pretty much everyone hates, designed for compiling code but also used for data workflows. It works, but has an unsatisfactory user experience.

Therefore, as usual in open source, there are too many alternative options and no consensus on which to use to break out of the local optimum lock in. The obvious problems with make are bikeshedded and the subtler problems unaddressed. Each offering provides a selection of marginal improvements over make, a variety of different faults and varying standards of longevity. Therefore, my listing is arranged in descending order of an undisclosed combination of my perception of

Of these many options, I’m sure I can shoehorn anything into doing what I want, more or less, for the first few steps, and only find out the horrible friction points down the line when it is too late. I’d be satisfied with choosing whichever will have the most users to mindshare when I really need it. Unfortunately, that time of need will be in the future, and my ability to predict the future has a poor historical track record.

So! options.


The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. Workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.

Like Make (and unlike, say, doit) it has a custom DSL for specifying command-line jobs, which is an approach I am not a massive fan of, but the explicit support of qsub and other nightmare campus cluster horrors, of remote network files and supports of recent innovations like Singularity containerization make it friendly to data science types. It was originally developed by bioinformaticians, and is even friendlier to those.

They have invented a custom file format for defining the tasks, which is an odd choice. However, its workflow is very close to mine, so I am keen to give it a go.

Warning sign: They depend upon a broken package and the fix for that has been left dangling since at least 2018-08-04. It’s not the maintainers fault, but it looks bad. The workaround is

pip3 install git+

But should I be needing to work out this stuff before even installing the package?


Lancet comes from neuroscience.

Lancet is designed to help you organize the output of your research tools, store it, and dissect the data you have collected. The output of a single simulation run or analysis rarely contains all the data you need; Lancet helps you generate data from many runs and analyse it using your own Python code.

Its special selling point is that it integrates exploratory parameter sweeps. (can it do such interactively, I wonder?)

Parameter spaces often need to be explored for the purpose of plotting, tuning, or analysis. Lancet helps you extract the information you care about from potentially enormous volumes of data generated by such parameter exploration.

Natively supports the over-engineered campus jobs manager PlatformLSF, and it has a thoughtful workflow for reproducible notebook computation. Less focussed on the dependency management side.


“Data science Version control”.

DVC looks hip and solves some problems related to build tools, although it is not one as such. Versions code with data assets in some external data store like S3 or whatever.

DVC runs on top of any Git repository and is compatible with any standard Git server or provider (Github, Gitlab, etc). Data file contents can be shared by network-accessible storage or any supported cloud solution. DVC offers all the advantages of a distributed version control system — lock-free, local branching, and versioning.

The single dvc repro command reproduces experiments end-to-end. DVC guarantees reproducibility by consistently maintaining a combination of input data, configuration, and the code that was initially used to run an experiment.

It resembles git-LFS which is the classic git method of dealing with with Really Big Files, and maybe also git-annex, which is a Big File Handler built on git. However it puts these at the service of reproducible and easily-distributed experiments.

It has a github overlay dagshub that specialises for DVC projects.


doit seemed to be a flavour of the minute a handful of years ago, promising modern task dependency management and such, for a developer audience. Nice tooling. Still pretty good despite low development activity. It’s got a flexible way of specifying dependencies between both tasks and targets, which is nice, but it gets clunky if your build jobs produce many output files, or if they take arguments.


Invoke – claims to be the successor to Fabric.

Like Ruby’s Rake tool and Invoke’s own predecessor Fabric 1.x, it provides a clean, high level API for running shell commands and defining/organizing task functions from a file[…] it offers advanced features as well – namespacing, task aliasing, before/after hooks, parallel execution and more.

AFAICT, unlike, say, doit, it has no support for build-artefact dependencies (“is that file there?”), only task dependencies, which is not ideal for my workflows.



Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualisation, handling failures, command line integration, and much more.

[…] It includes native Python support for running mapreduce jobs in Hadoop, as well as Pig and Jar jobs. It also comes with filesystem abstractions for HDFS and local files that ensures all file system operations are atomic. This is important because it means your data pipeline will not crash in a state containing partial data.

Not so much about facilitating parallelism, as stopping your jobs from clobbering each other. But that is hard.



is a set of tools to provide lightweight pipelining in Python. In particular, joblib offers:

Joblib is optimized to be fast and robust in particular on large data and has specific optimizations for numpy arrays. It is BSD-licensed.

In practice it’s mostly about memoization and robust concurrency for map-reduce style calculations. This is not quite the same as a full data workflow DAG, but it intersects with that idea. You probably want more sophistication for fancy pipelines, although, what you want even more than that is to ignore concurrency.



d6tflow is a free open-source library which makes it easy for you to build highly effective data science workflows.

Instead of linearly chaining functions, data science code is better written as a set of tasks with dependencies between them. That is your data science workflow should be a DAG.

In the end you just need to run TaskTrain() and it will automatically know which dependencies to run.


pachyderm “is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing.”

AFAICT that means it is a cloudimificated build/pipeline tool with data versioning baked in. For the curious, it uses Kubernetes to manage container deployments, which rather presumes you are happy to rent out servers from someone, or have some container-compatibly cluster lying around which it is economic for you to admin and also use.


Sumatra is about tracking and reproducing simulation or analysis parameters for sciency types. Exploratory data pipelines, especially.



[…]is a framework for heterogenous computing. It primarily provides the communication mechanisms for configuring and launching parallel computations across heterogenous resources. Pathos provides stagers and launchers for parallel and distributed computing, where each launcher contains the syntactic logic to configure and launch jobs in an execution environment. Some examples of included launchers are: a queue-less MPI-based launcher, a ssh-based launcher, and a multiprocessing launcher. Pathos also provides a map-reduce algorithm for each of the available launchers, thus greatly lowering the barrier for users to extend their code to parallel and distributed resources. Pathos provides the ability to interact with batch schedulers and queuing systems, thus allowing large computations to be easily launched on high-performance computing resources.

Integrates well with your jupyter notebook which is the main thing, but much like jupyter notebooks themselves, you are on your own when it comes to reproducibility and might want to use it in concert with one of the other solutions here to achieve that.


spark isn’t a build tool, but its parallelisation / execution graph features do overlap. Also many build tools leverage spark.


airflow is Airbnb’s hybrid parallel-happy workflow tool. It has… features. TODO.


pants… “is a build system for Java, Scala and Python. It works particularly well for a source code repository that contains many distinct projects.” Backed by twitter and foursquare. USP list:


Ruffus is about setting up your exploratory simulation and automation pipelines, especially for science.


Scons is a make replacement that is itself damn old, and despite its aspirations to remedy the problems with make, AFAICT, not actually that much easier to use. Oriented toward compiling stuff.


The original, and still the default. For connoisseurs of fragile whitespace handling.