Build tools are for automating script build/execution from the command-line or from your IDE. See also text editors, citation management, academic writing workflow, python, cloud computing a.k.a. cloud wrangling, open notebook science, scientific computation workflows, scientific workbooks.
The classic here is make, which is a universal standard that pretty much everyone hates.
Therefore, as usual in open source, there are too many alternative options and no consensus on which to use to break out of the lock-in. All offer marginal improvements over make, and all have uncertain lifespans and different faults. My listing is arranged in descending order of an undisclosed combination of my perception of
- trendiness, and
- being flexible enough to assist my workflow, which is to say,
  - Sure, sometimes I want to compile some code…
  - …but usually I want to massage some large chunk of data through a rickety processing pipeline…
  - …and graph it, in both lo-res web formats and high-res publication formats…
  - …so please do all the above without getting in my way.
  - …and if SOMEHOW it managed to support multiple parallelisation/batching backends (such as wacky supercomputing shit like Platform LSF HPC and Hadoop or spark or what-have-you) at least for some common use cases I would be inordinately pleased.
There are a dozen-odd packages for this in my list of things to check out, not including rolling-my-own. They overlap with various cloud options, but I am focussed here on developing a new thing.
That is too many options. I’m sure I can shoehorn anything into doing what I want, more or less, for the first few steps, and only find out the horrible friction points waaaaay down the line when it is too late. I’d be satisfied with choosing whichever will have the most users and mindshare when I really need it. Unfortunately, that time of need will be in the future, and my ability to predict the future has a provably poor track record.
Next best option, I’ll just choose whatever tool seems to have the biggest user-base now and go with that. Here are some tools, each of which does a slightly different thing, overlapping to different degrees:
airflow is Airbnb’s hybrid parallel-happy workflow tool. It has… features. Surely.
Invoke, in its own words: “Like Ruby’s Rake tool and Invoke’s own predecessor Fabric 1.x, it provides a clean, high level API for running shell commands and defining/organizing task functions from a tasks.py file[…] it offers advanced features as well – namespacing, task aliasing, before/after hooks, parallel execution and more.”
Unlike, say, doit, it has no support for build-dependencies.
click is not really a build tool but a command-line parser that is frequently used as one.
Click is a Python package for creating beautiful command line interfaces in a composable way with as little code as necessary. It’s the “Command Line Interface Creation Kit”. It’s highly configurable but comes with sensible defaults out of the box.[…]
- arbitrary nesting of commands
- automatic help page generation
- supports lazy loading of subcommands at runtime
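A minimal sketch of what pressing click into service as a poor man's build tool might look like: one command group, with one subcommand per pipeline stage. The stage names (`simulate`, `plot`) and the `--n-samples` option are hypothetical, picked to match the workflow described earlier; only the `click.group`, `click.option`, `click.argument` and `click.echo` calls are click's own. Note that nothing here tracks dependencies between stages, which is exactly why it is not really a build tool.

```python
import click


@click.group()
def cli():
    """Hypothetical entry point for a small data pipeline."""


@cli.command()
@click.option("--n-samples", default=100, show_default=True,
              help="Number of samples to simulate.")
def simulate(n_samples):
    """Run the (imaginary) simulation stage."""
    click.echo(f"simulating {n_samples} samples")


@cli.command()
@click.argument("source")
def plot(source):
    """Plot the (imaginary) results stored in SOURCE."""
    click.echo(f"plotting {source}")


if __name__ == "__main__":
    cli()
```

Saved as, say, `pipeline.py`, this would be driven with `python pipeline.py simulate --n-samples 5` and `python pipeline.py plot results.csv`, with `--help` pages generated for free.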
doit seemed to be a flavour of the minute two years ago, and promises modern task dependency management and such, for a developer audience. Nice tooling. Seems to be oriented toward “compilation” type targets where some file has to exist or not. It is obscure how this is supposed to work in more general settings.
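To make the “some file has to exist or not” orientation concrete, here is a sketch of a `dodo.py` as I understand doit's documented format: functions whose names start with `task_` return a dictionary with `file_dep`, `targets` and `actions` entries. Every file name and script below is made up for illustration, and I am assuming doit orders the two tasks because a target of one is a `file_dep` of the other.

```python
# dodo.py: a hypothetical two-step pipeline; all file names are invented.


def task_clean_data():
    """Turn the raw data into a cleaned table."""
    return {
        "file_dep": ["raw.csv", "clean.py"],   # inputs that trigger a re-run
        "targets": ["clean.csv"],              # files this task produces
        "actions": ["python clean.py raw.csv clean.csv"],
    }


def task_plot():
    """Draw the figure from the cleaned table."""
    return {
        "file_dep": ["clean.csv", "plot.py"],
        "targets": ["figure.pdf"],
        "actions": ["python plot.py clean.csv figure.pdf"],
    }
```

This works nicely while every step reads files and writes files. It is less obvious what to declare when a step's output is a database row, a remote job, or a parameter sweep.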
Lancet comes from neuroscience.
Lancet is designed to help you organize the output of your research tools, store it, and dissect the data you have collected. The output of a single simulation run or analysis rarely contains all the data you need; Lancet helps you generate data from many runs and analyse it using your own Python code.
Parameter spaces often need to be explored for the purpose of plotting, tuning, or analysis. Lancet helps you extract the information you care about from potentially enormous volumes of data generated by such parameter exploration.
Seems to be a little bit over-engineered, but it is the only one with native support for Platform LSF, and it has a thoughtful workflow for reproducible notebook computation.
Pathos […] is a framework for heterogenous computing. It primarily provides the communication mechanisms for configuring and launching parallel computations across heterogenous resources. Pathos provides stagers and launchers for parallel and distributed computing, where each launcher contains the syntactic logic to configure and launch jobs in an execution environment. Some examples of included launchers are: a queue-less MPI-based launcher, a ssh-based launcher, and a multiprocessing launcher. Pathos also provides a map-reduce algorithm for each of the available launchers, thus greatly lowering the barrier for users to extend their code to parallel and distributed resources. Pathos provides the ability to interact with batch schedulers and queuing systems, thus allowing large computations to be easily launched on high-performance computing resources.
integrates well with your jupyter notebook which is the main thing.
Fabric 1.0 was yesteryear’s hottest thing, still warm. Fabric 2.0 is different: it executes remote ssh commands, but doesn’t build stuff, and thus doesn’t help us here. Long story. Seems to be everywhere, or everywhere that you might be deploying to that cloud thing. Invoke is the successor project. Use that.
Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualisation, handling failures, command line integration, and much more.
[…] It includes native Python support for running mapreduce jobs in Hadoop, as well as Pig and Jar jobs. It also comes with filesystem abstractions for HDFS and local files that ensures all file system operations are atomic. This is important because it means your data pipeline will not crash in a state containing partial data.
Not so much about facilitating parallelism, as stopping your jobs from clobbering each other.
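The “no partial data” guarantee is worth a sketch, because you can borrow the idea even without Luigi. What follows is the generic write-to-a-temporary-file-then-rename trick using only the standard library; it is not Luigi's code, and the helper name `atomic_write` is my own invention.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to path so that readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target, so the final rename
    # stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the new file into place in one step,
        # overwriting any previous version of the target.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, remove the partial temporary file and leave any
        # existing target untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If a job dies halfway through, the target is either absent or still the old complete version, so a downstream step that checks “does the output exist?” is never fooled by a truncated file.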
clip.py comes with a passive-aggressive app name (+1) and is all about wrapping generic Python commands in command-line applications easily. If your code is already smart about filesystem dependencies this might be all you need.
Joblib is a set of tools to provide lightweight pipelining in Python. In particular, joblib offers:
- transparent disk-caching of the output values and lazy re-evaluation (memoize pattern)
- easy simple parallel computing
- logging and tracing of the execution
Joblib is optimized to be fast and robust in particular on large data and has specific optimizations for numpy arrays. It is BSD-licensed.
In practice it’s mostly about memoization and robust concurrency for map-reduce style calculations. You probably want more sophistication for fancy pipelines, although what you want even more than that is to ignore concurrency.
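For the memoization half, the underlying pattern is simple enough to sketch with the standard library. This is a stand-in to show the idea, not joblib's implementation or API: the decorator name `disk_memoize`, the cache location and the toy function are all hypothetical, and the real thing copes far better with large numpy arrays and with changes to the function's code.

```python
import functools
import hashlib
import os
import pickle
import tempfile

# Hypothetical cache location; a real project would use a fixed directory.
CACHE_DIR = tempfile.mkdtemp(prefix="memo_")


def disk_memoize(func):
    """Cache the results of func on disk, keyed by its name and arguments."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        key_bytes = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
        key = hashlib.sha256(key_bytes).hexdigest()
        cache_path = os.path.join(CACHE_DIR, key + ".pkl")
        if os.path.exists(cache_path):
            # Cache hit: load the stored result instead of recomputing.
            with open(cache_path, "rb") as handle:
                return pickle.load(handle)
        result = func(*args, **kwargs)
        with open(cache_path, "wb") as handle:
            pickle.dump(result, handle)
        return result
    return wrapper


calls = []  # records real evaluations, for demonstration only


@disk_memoize
def slow_square(x):
    """Toy stand-in for an expensive computation."""
    calls.append(x)
    return x * x
```

Calling `slow_square(4)` twice runs the body once; the second call is served from the pickle on disk, and would be even after restarting the interpreter if `CACHE_DIR` were a fixed path.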
Sumatra is almost entirely about step 2, and in particular, tracking and reproducing simulation or analysis parameters for sciency types.
Drake “Drake is similar to GNU Make, but designed especially for data workflow management. It has HDFS support, allows multiple inputs and outputs, and includes a host of features designed to help you bring sanity to your otherwise chaotic data processing workflows.” Built in Clojure, but designed for command-line stuff.
pachyderm “is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing.”
That is, it is a cloudimificated build/pipeline tool with data versioning baked in. For the curious, it uses Kubernetes to manage container deployments, which rather presumes you are happy to rent out servers from someone, or have some cluster lying around which it is economic for you to admin and also use.
spark isn’t a build tool, but its parallelisation features do overlap.
pants… “is a build system for Java, Scala and Python. It works particularly well for a source code repository that contains many distinct projects.” Backed by Twitter and Foursquare. USP list:
- Builds Java, Scala, and Python.
- Adding support for new languages is straightforward.
- Supports code generation: thrift, protocol buffers, custom code generators.
- Resolves external JVM and Python dependencies.
- Runs tests.
- Scales to large repos with many interdependent modules.
- Designed for incremental builds.
bazel is the open source cousin of some Google internal build tool. You need this one for building TensorFlow and probably some Android stuff, idk. If you aren’t Google this is more complicated than you want. Close your eyes and use it where they tell you.
gradle … is some Java-ish build tool.
Ruffus is about setting up your simulation and automation pipelines, especially for science.
Paver had a nice syntax for building stuff, including Python extensions, but seems to have been untouched for dangerously long. Paver doesn’t support Python 3, eight years after Python 3 was released, which is suspiciously reactionary behaviour. If it doesn’t move like a corpse and doesn’t quack like a corpse, then by the duck-typing principle…
SCons is a make replacement that is itself damn old and, despite its aspirations to remedy the problems with make, is AFAICT not actually that much easier to use. Oriented toward compiling stuff.
Make. The original, and still the default. For connoisseurs of fragile whitespace handling.