Automating script build/execution from the command-line.
See also text editors, citation management, academic writing workflow, python, cloud computing a.k.a. cloud wrangling, open notebook science, scientific computation workflows, scientific workbooks.
Argh! Too damn many options, that’s for sure, all offering marginal improvements over make, all with uncertain lifespan. Therefore, arranged in descending order of an undisclosed combination of my perception of
- trendiness, and
- being flexible enough to assist my workflow, which is to say,
  - Sure, sometimes I want to compile some code…
  - …but usually I want to massage some large chunk of data through a rickety processing pipeline…
  - …and graph it, in both lo-res web formats and high-res publication formats…
  - …so please do all the above without getting in my way.
  - …and if SOMEHOW it managed to support multiple parallelisation/batching backends (such as wacky supercomputing shit like Platform LSF HPC and Hadoop or spark or what-have-you) at least for some common use cases I would be inordinately pleased.
There are a dozen-odd packages for this in my list of things to check out, not including rolling-my-own. They overlap with various cloud options, but I am focussed here on developing a new thing.
That is too many options. I’m sure I can shoehorn anything into doing what I want, more or less, for the first few steps, and only find out the horrible friction points waaaaay down the line when it is too late. I’d be satisfied with choosing whichever will have the most users and mindshare when I really need it. Unfortunately, that time of need will be in the future, and my ability to predict the future has a provably poor track record.
Next best option, I’ll just choose whatever tool seems to have the biggest user-base now and go with that. Here are some tools, each of which does a slightly different thing, overlapping to different degrees:
airflow is Airbnb’s hybrid parallel-happy workflow tool. It has… features. Surely.
Invoke: “Like Ruby’s Rake tool and Invoke’s own predecessor Fabric 1.x, it provides a clean, high level API for running shell commands and defining/organizing task functions from a tasks.py file […] it offers advanced features as well – namespacing, task aliasing, before/after hooks, parallel execution and more.”
Unlike, say, doit, it has no support for build-dependencies.
doit seemed to be a flavour of the minute last year, and promises modern task dependency management and such, for a developer audience. Nice tooling. Seems to be oriented toward “compilation” type targets where some file has to exist or not. It is obscure how this is supposed to work in more general settings.
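To make the file-target orientation concrete, here is a minimal sketch of a `dodo.py` in the style doit expects: module-level functions named `task_*`, each returning a dict with `actions`, `file_dep` and `targets`. Only those dict keys and the `task_` naming convention are doit's; the file names and the `clean.py` and `plot.py` scripts are hypothetical, invented to mirror the massage-then-graph workflow above.

```python
# dodo.py: a hedged sketch; all file and script names are made up for illustration.

RAW = "data/raw.csv"          # hypothetical input
CLEAN = "data/clean.csv"      # hypothetical intermediate
FIGURE = "figures/plot.pdf"   # hypothetical output


def task_clean():
    """Massage the raw data; the raw file and the script are the dependencies."""
    return {
        "actions": [f"python clean.py {RAW} {CLEAN}"],
        "file_dep": [RAW, "clean.py"],
        "targets": [CLEAN],
    }


def task_plot():
    """Draw the figure; it depends on the file that task_clean produces."""
    return {
        "actions": [f"python plot.py {CLEAN} {FIGURE}"],
        "file_dep": [CLEAN, "plot.py"],
        "targets": [FIGURE],
    }
```

Because `CLEAN` is a target of one task and a file dependency of the other, doit should work out the ordering by itself. What is less obvious from this shape is how to express a step whose output is not a single file, which is the "more general settings" worry.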
Lancet comes from neuroscience.
Lancet is designed to help you organize the output of your research tools, store it, and dissect the data you have collected. The output of a single simulation run or analysis rarely contains all the data you need; Lancet helps you generate data from many runs and analyse it using your own Python code.
Parameter spaces often need to be explored for the purpose of plotting, tuning, or analysis. Lancet helps you extract the information you care about from potentially enormous volumes of data generated by such parameter exploration.
Seems to be a little bit over-engineered, but it is the only one with native support for Platform LSF, and it has a thoughtful workflow for reproducible notebook computation.
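The parameter-exploration idea is easy to sketch without Lancet. What follows is not Lancet's API, just plain `itertools`; the helper `parameter_grid`, the parameter names and the `simulate` function are hypothetical stand-ins for whatever your research tool actually computes.

```python
import itertools


def parameter_grid(**axes):
    """Yield one dict per point in the Cartesian product of the given axes."""
    names = list(axes)
    for values in itertools.product(*(axes[name] for name in names)):
        yield dict(zip(names, values))


def simulate(rate, seed):
    """Hypothetical stand-in for a single simulation run."""
    return rate * 10 + seed


# Run every combination and keep the parameters next to each result,
# so the output can be filtered or plotted by parameter later.
results = [
    {**params, "output": simulate(**params)}
    for params in parameter_grid(rate=[0.1, 0.5], seed=[0, 1, 2])
]
```

Keeping the parameters attached to each output is the part that matters; losing track of which run produced which file is the usual failure.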
Pathos “[…] is a framework for heterogenous computing. It primarily provides the communication mechanisms for configuring and launching parallel computations across heterogenous resources. Pathos provides stagers and launchers for parallel and distributed computing, where each launcher contains the syntactic logic to configure and launch jobs in an execution environment. Some examples of included launchers are: a queue-less MPI-based launcher, a ssh-based launcher, and a multiprocessing launcher. Pathos also provides a map-reduce algorithm for each of the available launchers, thus greatly lowering the barrier for users to extend their code to parallel and distributed resources. Pathos provides the ability to interact with batch schedulers and queuing systems, thus allowing large computations to be easily launched on high-performance computing resources.”
integrates well with your jupyter notebook which is the main thing.
Fabric - Yesteryear’s hottest thing, still warm. Optimised for remote deployment, especially of remote services. Seems to be everywhere, or everywhere that you might be deploying to that cloud thing. However, it doesn’t support Python 3, eight years after Python 3 was released, which is suspiciously reactionary behaviour. If it doesn’t move like a corpse and doesn’t quack like a corpse, then by the duck-typing principle… Invoke claims to be the successor.
Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualisation, handling failures, command line integration, and much more.
[…] It includes native Python support for running mapreduce jobs in Hadoop, as well as Pig and Jar jobs. It also comes with filesystem abstractions for HDFS and local files that ensures all file system operations are atomic. This is important because it means your data pipeline will not crash in a state containing partial data.
Not so much about facilitating parallelism, as stopping your jobs from clobbering each other.
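The partial-data point is worth spelling out. The following is not Luigi's API, just a standard-library sketch of the underlying trick, assuming the temporary file and the final path live on the same filesystem: write to a temporary file, then rename it into place, so a crash mid-write never leaves a half-finished output where the next job will find it. The function name `atomic_write` is hypothetical.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to path so that path only ever holds a complete file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the finished file into place, overwriting any old copy.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the partial temp file; the real target was never touched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A downstream job that checks only "does the output file exist?" can then trust that an existing file is a finished one.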
clip.py comes with a passive-aggressive app name (+1) and is all about wrapping generic python commands in command-line applications easily. If your code is already smart about filesystem dependencies this might be all you need.
joblib is a set of tools to provide lightweight pipelining in Python. In particular, joblib offers:
- transparent disk-caching of the output values and lazy re-evaluation (memoize pattern)
- easy simple parallel computing
- logging and tracing of the execution
Joblib is optimized to be fast and robust in particular on large data and has specific optimizations for numpy arrays. It is BSD-licensed.
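For a feel of what "transparent disk-caching" buys you, here is a standard-library sketch of the memoize-to-disk pattern. It is emphatically not joblib's implementation (joblib's own entry point for this is `Memory.cache`); the decorator name `disk_memoize` and the cache layout are assumptions made up for illustration, and it only copes with picklable arguments and results.

```python
import functools
import hashlib
import os
import pickle


def disk_memoize(cache_dir):
    """Cache a function's return values on disk, keyed by its arguments."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            os.makedirs(cache_dir, exist_ok=True)
            # Hash the function name and pickled arguments into a file name.
            key = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            path = os.path.join(cache_dir, hashlib.sha256(key).hexdigest() + ".pkl")
            if os.path.exists(path):
                with open(path, "rb") as handle:
                    return pickle.load(handle)   # cache hit: skip the work
            result = func(*args, **kwargs)
            with open(path, "wb") as handle:
                pickle.dump(result, handle)      # cache miss: store for next time
            return result
        return wrapper
    return decorator
```

Decorating a slow pipeline stage with `@disk_memoize("some/cache/dir")` means a rerun with the same arguments loads the stored result instead of recomputing. Unlike the real thing, this sketch does not notice when the function's code changes.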
Sumatra is almost entirely about step 2 of my workflow above (massaging data through a processing pipeline), and in particular about tracking and reproducing simulation or analysis parameters for sciency types.
Drake “Drake is similar to GNU Make, but designed especially for data workflow management. It has HDFS support, allows multiple inputs and outputs, and includes a host of features designed to help you bring sanity to your otherwise chaotic data processing workflows.” Built in Clojure, but designed for command-line stuff.
pachyderm “is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing.”
That is, it is a cloudimificated build/pipeline tool with data versioning baked in. For the curious, it uses Kubernetes to manage container deployments, which rather presumes you are happy to rent out servers from someone.
spark isn’t a build tool, but its parallelisation features do overlap.
pants… “is a build system for Java, Scala and Python. It works particularly well for a source code repository that contains many distinct projects.” Backed by Twitter and Foursquare. USP list:
- Builds Java, Scala, and Python.
- Adding support for new languages is straightforward.
- Supports code generation: thrift, protocol buffers, custom code generators.
- Resolves external JVM and Python dependencies.
- Runs tests.
- Scales to large repos with many interdependent modules.
- Designed for incremental builds.
bazel… is the open source cousin of Google’s internal build tool. You need this one for building TensorFlow and probably some Android stuff.
gradle… is some Java-ish build tool.
Ruffus is about setting up your simulation and automation pipelines, especially for science.
Paver - had a nice syntax for building stuff, including python extensions, but seems to have been untouched for dangerously long.
Scons is a make replacement that is itself damn old and, despite its aspirations to remedy the problems with make, is AFAICT not actually that much easier to use. Oriented toward compiling stuff.
Make. The original, and still the default. For connoisseurs of fragile whitespace handling.