# Build tools for data science

Usefulness: 🔧
Novelty: 💡
Uncertainty: 🤪 🤪 🤪
Incompleteness: 🚧 🚧 🚧

Build tools are for automating script build/execution from the command-line or from your IDE. See also text editors, citation management, academic writing workflow, python, cloud computing a.k.a. cloud wrangling, open notebook science, scientific computation workflows, scientific workbooks.

The classic here is make, which is a universal standard that pretty much everyone hates, designed for compiling code but also used for data workflows. It works, but has an unsatisfactory user experience.

Therefore, as usual in open source, there are too many alternative options and no consensus on which to use to break out of the local optimum lock in. The obvious problems with make are bikeshedded and the subtler problems unaddressed. Each offering provides a selection of marginal improvements over make, a variety of different faults and varying standards of longevity. Therefore, my listing is arranged in descending order of an undisclosed combination of my perception of

• active community, and therefore chance of community support, and
• being useful for my workflow, which is to say,

1. Sure, sometimes I want to compile some code…
2. …but more usually I want to massage some large chunk of data through a rickety processing pipeline…
3. …and graph it, in both lo-res web formats and high-res publication formats…
4. …so please do all the above without getting in my way.
5. …and if somehow it managed to support multiple parallelisation/batching backends (such as wacky supercomputing shit like Platform LSF HPC and Hadoop or Spark or what-have-you), at least for some common use cases, I would be inordinately pleased…
6. …and if at the end I had a nicely packaged-up workflow which I could share with someone else in the name of reproducible research, that would be sublime.

Of these many options, I’m sure I can shoehorn anything into doing what I want, more or less, for the first few steps, and only find out the horrible friction points down the line when it is too late. I’d be satisfied with choosing whichever will have the most users and mindshare when I really need it. Unfortunately, that time of need will be in the future, and my ability to predict the future has a poor historical track record.

So! options.

## Snakemake

The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. Workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.

Like Make (and unlike, say, doit) it has a custom DSL for specifying command-line jobs, which is an approach I am not a massive fan of. However, its explicit support for qsub and other nightmare campus cluster horrors, for remote network files, and for recent innovations like Singularity containerization makes it friendly to data science types. It was originally developed by bioinformaticians, and is even friendlier to those.

They have invented a custom file format for defining the tasks, which is an odd choice. However, its workflow is very close to mine, so I am keen to give it a go.
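To make the make-style idea concrete, here is a minimal sketch of wildcard rule resolution: given a requested output file, find the rule whose output pattern matches and fill in the inputs. This is plain Python of my own devising, not Snakemake's syntax or API; the function names, the `{sample}` placeholder and the file paths are all hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Compile 'plots/{sample}.png' into a regex with a named group per placeholder."""
    # re.split with a capturing group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", pattern)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pieces.append(re.escape(part))        # literal text
        else:
            pieces.append(f"(?P<{part}>[^/]+)")   # placeholder, no slashes allowed
    return re.compile("".join(pieces))


def resolve_inputs(target, rules):
    """Return the concrete input files needed to build `target`, or None if no rule fits."""
    for output_pattern, input_patterns in rules:
        match = pattern_to_regex(output_pattern).fullmatch(target)
        if match:
            return [p.format(**match.groupdict()) for p in input_patterns]
    return None


# Hypothetical rule: each plot is made from the matching CSV plus a plotting script.
RULES = [("plots/{sample}.png", ["data/{sample}.csv", "scripts/plot.py"])]
```

Asking for `resolve_inputs("plots/a1.png", RULES)` yields the two files that rule needs for sample `a1`. A real tool would then recurse on those inputs and check whether each is up to date.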

Warning sign: They depend upon a broken package and the fix for that has been left dangling since at least 2018-08-04. It’s not the maintainers’ fault, but it looks bad. The workaround is:

    pip3 install git+https://github.com/pytries/datrie.git

But should I be needing to work out this stuff before even installing the package?

## Lancet

Lancet comes from neuroscience.

Lancet is designed to help you organize the output of your research tools, store it, and dissect the data you have collected. The output of a single simulation run or analysis rarely contains all the data you need; Lancet helps you generate data from many runs and analyse it using your own Python code.

Its special selling point is that it integrates exploratory parameter sweeps. (Can it do so interactively, I wonder?)

Parameter spaces often need to be explored for the purpose of plotting, tuning, or analysis. Lancet helps you extract the information you care about from potentially enormous volumes of data generated by such parameter exploration.
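For reference, the bare-bones version of a parameter sweep looks something like the following sketch. It uses only the standard library and is not Lancet's API; `parameter_grid`, `sweep` and the toy `simulate` function are hypothetical names of my own.

```python
import itertools


def parameter_grid(**axes):
    """Yield one dict per point in the Cartesian product of the given axes."""
    names = sorted(axes)
    for values in itertools.product(*(axes[name] for name in names)):
        yield dict(zip(names, values))


def sweep(run, **axes):
    """Call run(**params) at every grid point; return a list of (params, result)."""
    return [(params, run(**params)) for params in parameter_grid(**axes)]


def simulate(rate, steps):
    """Toy stand-in for an expensive simulation."""
    return rate * steps


results = sweep(simulate, rate=[0.1, 0.5], steps=[10, 100])
best_params, best_value = max(results, key=lambda pair: pair[1])
```

The hard parts, which are presumably what a tool like Lancet is for, are launching these runs somewhere other than your laptop and keeping track of what each run produced.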

Natively supports the over-engineered campus jobs manager PlatformLSF, and it has a thoughtful workflow for reproducible notebook computation. Less focussed on the dependency management side.

## DVC

“Data science Version control”.

DVC looks hip and solves some problems related to build tools, although it is not one as such. It versions code together with data assets kept in some external data store like S3 or whatever.

DVC runs on top of any Git repository and is compatible with any standard Git server or provider (Github, Gitlab, etc). Data file contents can be shared by network-accessible storage or any supported cloud solution. DVC offers all the advantages of a distributed version control system — lock-free, local branching, and versioning.

The single dvc repro command reproduces experiments end-to-end. DVC guarantees reproducibility by consistently maintaining a combination of input data, configuration, and the code that was initially used to run an experiment.
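The underlying trick, as I understand it, is to decide whether a stage needs re-running by comparing content hashes of its inputs against a recorded snapshot, rather than trusting timestamps. Here is a minimal sketch of that idea. The JSON lock file layout and the function names are my own assumptions, not DVC's actual format or API.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Content hash of a file (SHA-256 chosen arbitrarily for this sketch)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(deps):
    """Map each dependency path to its current content hash."""
    return {str(dep): file_digest(dep) for dep in deps}


def stage_is_stale(deps, lock_path):
    """True if there is no recorded snapshot or any dependency's content changed."""
    lock_path = Path(lock_path)
    if not lock_path.exists():
        return True
    return json.loads(lock_path.read_text()) != snapshot(deps)


def record_stage(deps, lock_path):
    """Record the current hashes after a successful run of the stage."""
    Path(lock_path).write_text(json.dumps(snapshot(deps), indent=2, sort_keys=True))
```

If the lock file is committed to Git alongside the code, checking out an old commit tells you exactly which data versions that experiment used.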

It resembles git-LFS, which is the classic git method of dealing with Really Big Files, and maybe also git-annex, which is a Big File Handler built on git. However, it puts these at the service of reproducible and easily-distributed experiments.

It has a GitHub-style overlay, DagsHub, that specialises in DVC projects.

## CaosDB

Some kind of “researcher-friendly” MySQL frontend for managing experiments. I’m not sure how well this truly integrates into a workflow of solving problems I actually have.

CaosDB - Research Data Management for Complex, Changing, and Automated Research Workflows

Here we present CaosDB, a Research Data Management System (RDMS) designed to ensure seamless integration of inhomogeneous data sources and repositories of legacy data. Its primary purpose is the management of data from biomedical sciences, both from simulations and experiments during the complete research data lifecycle. An RDMS for this domain faces particular challenges: Research data arise in huge amounts, from a wide variety of sources, and traverse a highly branched path of further processing. To be accepted by its users, an RDMS must be built around workflows of the scientists and practices and thus support changes in workflow and data structure. Nevertheless it should encourage and support the development and observation of standards and furthermore facilitate the automation of data acquisition and processing with specialized software. The storage data model of an RDMS must reflect these complexities with appropriate semantics and ontologies while offering simple methods for finding, retrieving, and understanding relevant data. We show how CaosDB responds to these challenges and give an overview of the CaosDB Server, its data model and its easy-to-learn CaosDB Query Language. We briefly discuss the status of the implementation, how we currently use CaosDB, and how we plan to use and extend it.

## Forge

TBC. Forge is an attempt to create a custom build tool by Adam Kosiorek.

Forge makes it easier to configure experiments and allows easier model inspection and evaluation due to smart checkpoints. With Forge, you can configure and build your dataset and model in separate files and load them easily in an experiment script or a jupyter notebook. Once the model is trained, it can be easily restored from a snapshot (with the corresponding dataset) without the access to the original config files.

## Estimator

Tensorflow-specific. tf.estimator.

## Artemis

Seems to be mostly an experiment wrapper.

The Artemis Experiment Framework helps you to keep track of your experiments and their results. It is an alternative to Sacred, with the goal of being more intuitive to use. … Using this module, you can turn your main function into an “Experiment”, which, when run, stores all console output, plots, and computed results to disk (in ~/.artemis/experiments)

## Sacred

Sacred is a tool to configure, organize, log and reproduce computational experiments. It is designed to introduce only minimal overhead, while encouraging modularity and configurability of experiments.

The ability to conveniently make experiments configurable is at the heart of Sacred. If the parameters of an experiment are exposed in this way, it will help you to:

• keep track of all the parameters of your experiment
• easily run your experiment for different settings
• save configurations for individual runs in files or a database
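The three points above can be sketched without any library at all, which is useful for seeing what such a tool buys you. The following is a rough standard-library illustration, not Sacred's API; `run_experiment`, the `run_NNNN.json` naming and the parameters are hypothetical.

```python
import json
from pathlib import Path


def run_experiment(main, defaults, overrides=None, log_dir="runs"):
    """Run main(**config) and save the exact config and result as JSON."""
    overrides = overrides or {}
    unknown = set(overrides) - set(defaults)
    if unknown:
        # Refuse typos such as 'learnign_rate' instead of silently ignoring them.
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    config = {**defaults, **overrides}

    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    run_id = len(list(log_dir.glob("run_*.json"))) + 1

    result = main(**config)
    record_path = log_dir / f"run_{run_id:04d}.json"
    record_path.write_text(
        json.dumps({"config": config, "result": result}, indent=2, sort_keys=True)
    )
    return record_path
```

Something like `run_experiment(train, {"learning_rate": 0.01, "epochs": 10}, overrides={"epochs": 50})` would then leave a file per run recording which settings produced which result. This sketch assumes results are JSON-serialisable and ignores concurrent runs, which a real tool has to handle.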

## Dr Watson

DrWatson is a scientific project assistant software package. Here is what it can do:

• Project Setup : A universal project structure and functions that allow you to consistently and robustly navigate through your project, no matter where it is located on your hard drive.
• Naming Simulations : A robust and deterministic scheme for naming and handling your containers.
• Saving Tools : Tools for safely saving and loading your data, tagging the Git commit ID to your saved files, safety when tagging with dirty repos, and more.
• Running & Listing Simulations: Tools for producing tables of existing simulations/data, adding new simulation results to the tables, preparing batch parameter containers, and more.

Think of these core aspects of DrWatson as independent islands connected by bridges. If you don’t like the approach of one of the islands, you don’t have to use it to take advantage of DrWatson!
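The "Naming Simulations" island is the easiest to illustrate. DrWatson itself is a Julia package, so the following is only a Python sketch of the general idea of deriving a deterministic file name from a parameter container; the function and its formatting rules are my own assumptions, not a port of DrWatson's behaviour.

```python
def savename(params, suffix="", digits=3):
    """Build a deterministic name like 'a=0.123_b=2.json' from a parameter dict."""
    parts = []
    for key in sorted(params):           # sorted keys: insertion order cannot matter
        value = params[key]
        if isinstance(value, float):
            value = round(value, digits)  # keep names short and stable
        parts.append(f"{key}={value}")
    name = "_".join(parts)
    return f"{name}.{suffix}" if suffix else name
```

Because the name is a pure function of the parameters, checking whether a simulation has already been run reduces to checking whether the file exists.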

Applications of DrWatson are demonstrated on the Real World Examples page. All of these examples are taken from code of real scientific projects that use DrWatson.

Please note that DrWatson is not a data management system.

## Doit

doit seemed to be a flavour of the minute a handful of years ago, promising modern task dependency management and such, for a developer audience. Nice tooling. Still pretty good despite low development activity. It’s got a flexible way of specifying dependencies between both tasks and targets, which is nice, but it gets clunky if your build jobs produce many output files, or if they take arguments.
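For flavour, a doit build is a plain Python file, conventionally `dodo.py`, containing functions named `task_*` that return dictionaries with keys such as `actions`, `file_dep` and `targets`. The sketch below follows that convention as I understand it; the scripts and file names are hypothetical placeholders.

```python
# dodo.py: doit looks for functions whose names start with "task_".
# All script and data file names below are hypothetical.


def task_clean_data():
    """Turn the raw CSV into a cleaned one."""
    return {
        "actions": ["python clean.py raw.csv clean.csv"],
        "file_dep": ["clean.py", "raw.csv"],
        "targets": ["clean.csv"],
    }


def task_plot():
    """Plot the cleaned data; depends on clean.csv, which task_clean_data produces."""
    return {
        "actions": ["python plot.py clean.csv figure.png"],
        "file_dep": ["plot.py", "clean.csv"],
        "targets": ["figure.png"],
    }
```

Since `clean.csv` is a target of one task and a file dependency of the other, running `doit` should order them correctly and skip whatever is already up to date. You can also see where the clunkiness comes from: every output file has to be spelled out in `targets`, which gets tedious when a job emits dozens of them.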

## Invoke

Invoke – claims to be the successor to Fabric.

Like Ruby’s Rake tool and Invoke’s own predecessor Fabric 1.x, it provides a clean, high level API for running shell commands and defining/organizing task functions from a tasks.py file[…] it offers advanced features as well – namespacing, task aliasing, before/after hooks, parallel execution and more.

AFAICT, unlike, say, doit, it has no support for build-artefact dependencies (“is that file there?”), only task dependencies, which is not ideal for my workflows.
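To spell out what I mean by build-artefact dependencies: the classic make-style rule is that a target needs rebuilding if it is missing or older than any of its inputs. A minimal sketch of that check, which you could bolt onto a task runner that lacks it (the function name is my own):

```python
import os


def needs_rebuild(target, dependencies):
    """Make-style staleness: target is missing or older than some dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

Timestamps are a crude proxy for "has changed" (touching a file triggers a rebuild, and clock skew on network filesystems can confuse it), which is why some tools compare content hashes instead.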

## Luigi

Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualisation, handling failures, command line integration, and much more.

[…] It includes native Python support for running mapreduce jobs in Hadoop, as well as Pig and Jar jobs. It also comes with filesystem abstractions for HDFS and local files that ensures all file system operations are atomic. This is important because it means your data pipeline will not crash in a state containing partial data.

Not so much about facilitating parallelism, as stopping your jobs from clobbering each other. But that is hard.
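The atomicity point deserves a concrete picture. The usual trick is to write to a temporary file in the same directory and rename it into place only once the write has succeeded, so a crashed job never leaves a half-written output that looks finished. A minimal standard-library sketch of that trick (this is not Luigi's API, and `atomic_write` is a name I made up):

```python
import os
import tempfile


def atomic_write(path, data):
    """Write text to `path` so that readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(data)
        os.replace(tmp_path, path)  # the single step that publishes the output
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)     # do not leave partial files behind
        raise
```

With outputs written this way, "does the target file exist?" becomes a trustworthy test for "did the task finish?".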

## joblib

joblib is a set of tools to provide lightweight pipelining in Python. In particular, joblib offers:

• transparent disk-caching of the output values and lazy re-evaluation (memoize pattern)
• easy simple parallel computing
• logging and tracing of the execution

Joblib is optimized to be fast and robust in particular on large data and has specific optimizations for numpy arrays. It is BSD-licensed.

In practice it’s mostly about memoization and robust concurrency for map-reduce style calculations. This is not quite the same as a full data workflow DAG, but it intersects with that idea. You probably want more sophistication for fancy pipelines, although, what you want even more than that is to ignore concurrency.
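The memoize-to-disk pattern itself is simple enough to sketch in the standard library, which helps show what joblib is saving you from. The decorator below is a toy of my own, not joblib's API. It assumes arguments and results are picklable, and it does not notice when the function body changes, which a more careful tool would need to handle.

```python
import functools
import hashlib
import pickle
from pathlib import Path


def disk_cache(cache_dir):
    """Decorator factory: cache a function's results as pickle files in cache_dir."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Key the cache on the function name and the pickled arguments.
            key_bytes = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            path = cache_dir / (hashlib.sha256(key_bytes).hexdigest() + ".pkl")
            if path.exists():
                return pickle.loads(path.read_bytes())  # cache hit: skip the work
            result = func(*args, **kwargs)
            path.write_bytes(pickle.dumps(result))
            return result

        return wrapper

    return decorator
```

Decorating an expensive step with `@disk_cache("cache")` means re-running a script only recomputes the calls whose arguments are new, which gets you a surprising fraction of a build tool for very little ceremony.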

## d6tflow

d6tflow is a free open-source library which makes it easy for you to build highly effective data science workflows.

Instead of linearly chaining functions, data science code is better written as a set of tasks with dependencies between them. That is, your data science workflow should be a DAG.

In the end you just need to run TaskTrain() and it will automatically know which dependencies to run.
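What "automatically know which dependencies to run" amounts to is a topological sort over the part of the DAG that the requested task depends on. Here is a minimal sketch using the standard library's `graphlib` (Python 3.9+). It is not d6tflow's API; the task names and the `run_with_dependencies` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9


def run_with_dependencies(target, requires, actions, done=None):
    """Run `target` and everything it transitively requires, dependencies first.

    requires: dict mapping task name -> list of task names it depends on
    actions:  dict mapping task name -> zero-argument callable
    done:     set of tasks already completed (mutated in place), skipped on re-run
    """
    done = set() if done is None else done

    # Collect only the sub-graph reachable from the target.
    needed, stack = {}, [target]
    while stack:
        task = stack.pop()
        if task not in needed:
            needed[task] = requires.get(task, [])
            stack.extend(needed[task])

    executed = []
    for task in TopologicalSorter(needed).static_order():
        if task not in done:
            actions[task]()
            done.add(task)
            executed.append(task)
    return executed
```

Given `requires = {"train": ["preprocess"], "preprocess": ["load"]}`, asking for `"train"` runs load, preprocess and train in that order. Asking again with the same `done` set runs nothing.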

## Pachyderm

pachyderm “is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing.”

AFAICT that means it is a cloudimificated build/pipeline tool with data versioning baked in. For the curious, it uses Kubernetes to manage container deployments, which rather presumes you are happy to rent out servers from someone, or have some container-compatible cluster lying around which it is economic for you to admin and also use.

## Sumatra

Sumatra is about tracking and reproducing simulation or analysis parameters for sciencey types. Exploratory data pipelines, especially.

## Pathos

Pathos […] is a framework for heterogenous computing. It primarily provides the communication mechanisms for configuring and launching parallel computations across heterogenous resources. Pathos provides stagers and launchers for parallel and distributed computing, where each launcher contains the syntactic logic to configure and launch jobs in an execution environment. Some examples of included launchers are: a queue-less MPI-based launcher, a ssh-based launcher, and a multiprocessing launcher. Pathos also provides a map-reduce algorithm for each of the available launchers, thus greatly lowering the barrier for users to extend their code to parallel and distributed resources. Pathos provides the ability to interact with batch schedulers and queuing systems, thus allowing large computations to be easily launched on high-performance computing resources.

Integrates well with your jupyter notebook which is the main thing, but much like jupyter notebooks themselves, you are on your own when it comes to reproducibility and might want to use it in concert with one of the other solutions here to achieve that.

## Spark

spark isn’t a build tool, but its parallelisation / execution graph features do overlap. Also many build tools leverage spark.

## Airflow

airflow is Airbnb’s hybrid parallel-happy workflow tool. It has… features. TODO.

## Pants

pants… “is a build system for Java, Scala and Python. It works particularly well for a source code repository that contains many distinct projects.” Backed by twitter and foursquare. USP list:

• Builds Java, Scala, and Python.
• Adding support for new languages is straightforward.
• Supports code generation: thrift, protocol buffers, custom code generators.
• Resolves external JVM and Python dependencies.
• Runs tests.
• Scales to large repos with many interdependent modules.
• Designed for incremental builds.

## Ruffus

Ruffus is about setting up your exploratory simulation and automation pipelines, especially for science.

## SCons

SCons is a make replacement that is itself damn old and, despite its aspirations to remedy the problems with make, AFAICT not actually that much easier to use. Oriented toward compiling stuff.

## Make

The original, and still the default. For connoisseurs of fragile whitespace handling.