Build tools for data science

July 27, 2016 — June 18, 2019

faster pussycat

premature optimization

Closely-related problem, in this modern data-driven age: versioning and managing data and experiments.

Build tools are for automating code build/execution from the command-line or from your IDE. The classic is make, which is a universal standard that pretty much everyone hates, designed for compiling code but also used for data workflows. It works, but has an unsatisfactory user experience.

Therefore, as usual in open source, there are too many alternative options and no consensus on which to use to break out of the local optimum lock in. The obvious problems with make are bikeshedded and the subtler problems unaddressed. Each offering provides a selection of marginal improvements over make, a variety of different faults and varying standards of longevity. Therefore, my listing is arranged in descending order of an undisclosed combination of my perception of

active community, and therefore chance of community support, and
being useful for my workflow, which is to say,
1. Sure, sometimes I want to compile some code…
2. …but more usually I want to massage some large chunk of data through a rickety processing pipeline…
3. …and graph it, in both lo-res web formats and high-res publication formats…
4. …so please do all the above without getting in my way.
5. …and if somehow it managed to support multiple parallelisation/batching backends (such as wacky supercomputing thingies like Platform LSF HPC and Hadoop or spark or what-have-you) at least for some common use cases I would be inordinately pleased
6. …and if at the end I had a nicely packaged up workflow which I could share with someone else in the name of reproducible research that would be sublime.

Of these many options, I’m sure I can shoehorn anything into doing what I want, more or less, for the first few steps, and only find out the horrible friction points down the line when it is too late. I’d be satisfied with choosing whichever will have the most users to mindshare when I really need it. Unfortunately, that time of need will be in the future, and my ability to predict the future has historically had a bad track record.

So! options.

1 Snakemake

Snakemake: A framework for reproducible data analysis:

The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. Workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.

Like Make (and unlike, say, doit) it has a custom DSL for specifying command-line jobs, which is an approach I am not a massive fan of, but the explicit support of qsub and other nightmare campus cluster horrors, of remote network files and supports of recent innovations like Apptainer containerization make it friendly to data science types. It was originally developed by bioinformaticians, and is even friendlier to those.

They have invented a custom file format for defining the tasks, which is an odd choice. However, its workflow is close to mine, so I am keen to give it a go.

Snakemake sets itself apart from other text-based workflow systems in the following way. Hooking into the Python interpreter, Snakemake offers a definition language that is an extension of Python with syntax to define rules and workflow specific properties. This allows to combine the flexibility of a plain scripting language with a pythonic workflow definition. The Python language is known to be concise yet readable and can appear almost like pseudo-code. The syntactic extensions provided by Snakemake maintain this property for the definition of the workflow. Further, Snakemake’s scheduling algorithm can be constrained by priorities, provided cores and customizable resources and it provides a generic support for distributed computing (e.g., cluster or batch systems). Hence, a Snakemake workflow scales without modification from single core workstations and multi-core servers to cluster or batch systems. Finally, Snakemake integrates with the package manager Conda and the container engine Singularity such that defining the software stack becomes part of the workflow itself.

2 DrWatson

Half build-tool, half experiment tracker.

DrWatson is a scientific project assistant software package. Here is what it can do:

Project Setup : A universal project structure and functions that allow you to consistently and robustly navigate through your project, no matter where it is located on your hard drive.

Naming Simulations : A robust and deterministic scheme for naming and handling your containers.

Saving Tools : Tools for safely saving and loading your data, tagging the Git commit ID to your saved files, safety when tagging with dirty repos, and more.

Running & Listing Simulations: Tools for producing tables of existing simulations/data, adding new simulation results to the tables, preparing batch parameter containers, and more.

Think of these core aspects of DrWatson as independent islands connected by bridges. If you don’t like the approach of one of the islands, you don’t have to use it to take advantage of DrWatson!

Applications of DrWatson are demonstrated the Real World Examples page. All of these examples are taken from code of real scientific projects that use DrWatson.

Please note that DrWatson is not a data management system.

3 Doit

doit seemed to be a flavour of the minute a handful of years ago, promising modern task dependency management and such, for a developer audience. Nice tooling. Still pretty good despite low development activity. It’s got a flexible way of specifying dependencies between both tasks and targets, which is nice, but it gets clunky if your build jobs produce many output files, or if they take arguments.

4 Invoke

Invoke — claims to be the successor to Fabric.

Like Ruby’s Rake tool and Invoke’s own predecessor Fabric 1.x, it provides a clean, high level API for running shell commands and defining/organizing task functions from a tasks.py file […] it offers advanced features as well — namespacing, task aliasing, before/after hooks, parallel execution and more.

AFAICT, unlike, say, doit, it has no support for build-artefact dependencies (“is that file there?”), only task dependencies, which is not ideal for my workflows.

5 Luigi

luigi

Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualisation, handling failures, command line integration, and much more.

[…] It includes native Python support for running mapreduce jobs in Hadoop, as well as Pig and Jar jobs. It also comes with filesystem abstractions for HDFS and local files that ensures all file system operations are atomic. This is important because it means your data pipeline will not crash in a state containing partial data.

Not so much about facilitating parallelism, as stopping your jobs from clobbering each other. But that is hard.

6 joblib

joblib

is a set of tools to provide lightweight pipelining in Python. In particular, joblib offers:

transparent disk-caching of the output values and lazy re-evaluation (memoize pattern)

easy simple parallel computing

logging and tracing of the execution

Joblib is optimized to be fast and robust in particular on large data and has specific optimizations for numpy arrays. It is BSD-licensed.

In practice it’s mostly about memoization and robust concurrency for map-reduce style calculations. This is not quite the same as a full data workflow DAG, but it intersects with that idea. You probably want more sophistication for fancy pipelines, although, what you want even more than that is to ignore concurrency.

7 d6tflow

d6tflow

d6tflow is a free open-source library which makes it easy for you to build highly effective data science workflows.

Instead of linearly chaining functions, data science code is better written as a set of tasks with dependencies between them. That is your data science workflow should be a DAG.

In the end you just need to run TaskTrain() and it will automatically know which dependencies to run.

8 Pachyderm

pachyderm “is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing.”

AFAICT that means it is a cloudimificated build/pipeline tool with data versioning baked in. For the curious, it uses Kubernetes to manage container deployments, which rather presumes you are happy to rent out servers from someone, or have some container-compatibly cluster lying around which it is economic for you to admin and also use.

9 Pathos

Pathos

[…] is a framework for heterogenous computing. It primarily provides the communication mechanisms for configuring and launching parallel computations across heterogenous resources. Pathos provides stagers and launchers for parallel and distributed computing, where each launcher contains the syntactic logic to configure and launch jobs in an execution environment. Some examples of included launchers are: a queue-less MPI-based launcher, a ssh-based launcher, and a multiprocessing launcher. Pathos also provides a map-reduce algorithm for each of the available launchers, thus greatly lowering the barrier for users to extend their code to parallel and distributed resources. Pathos provides the ability to interact with batch schedulers and queuing systems, thus allowing large computations to be easily launched on high-performance computing resources.

Integrates well with your jupyter notebook which is the main thing, but much like jupyter notebooks themselves, you are on your own when it comes to reproducibility and might want to use it in concert with one of the other solutions to achieve that.

10 Spark

spark isn’t a build tool, but its parallelisation / execution graph features do overlap. Also many build tools leverage spark.

11 Airflow

airflow is Airbnb’s hybrid parallel-happy workflow tool. It has… features. TODO.

12 Pants

pants… “is a build system for Java, Scala and Python. It works particularly well for a source code repository that contains many distinct projects.” Backed by twitter and foursquare. USP list:

Builds Java, Scala, and Python.
Adding support for new languages is straightforward.
Supports code generation: thrift, protocol buffers, custom code generators.
Resolves external JVM and Python dependencies.
Runs tests.
Scales to large repos with many interdependent modules.
Designed for incremental builds.

13 SCons

Scons is a make replacement that is itself old, and despite its aspirations to remedy the problems with make, AFAICT, not actually that much easier to use. Oriented toward compiling stuff.

14 Make

The original, and still the default. For connoisseurs of fragile whitespace handling.