The Living Thing / Notebooks : Cloudimificating your artificial data learning intelligence science analyticserising


I get lost in all the options for parallel computing on the cheap. I summarise for myself here.

Fashion dictates this should be “cloud” computing, although I’m also interested in using the same methods without a cloud, as such. In fact, I would prefer frictionless switching between such things according to debugging and processing power needs.

Emphasis for now is on embarrassingly parallel computation, which is what I as a statistician mostly do, mostly in Python, sometimes in other things. That is, I run many calculations/simulations with absolutely no shared state and aggregate them in some way at the end. This avoids much of the complexity of graph computing.

Let’s say I want “easy, optionally local, shared-nothing parallel computing”.
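In Python terms, that ambition is roughly a parallel map with a pluggable backend. A minimal sketch using the standard library’s `concurrent.futures`; the `simulate` function, its parameters, and `run_batch` are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
import random

def simulate(seed):
    # One shared-nothing unit of work: nothing here touches global state,
    # so any worker can run it anywhere.
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(1000)) / 1000

def run_batch(seeds, executor_cls=ThreadPoolExecutor, workers=4):
    # Swap executor_cls for ProcessPoolExecutor (or any cluster-backed
    # executor with the same interface) without touching simulate().
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(simulate, seeds))

results = run_batch(range(8))
estimate = sum(results) / len(results)  # the aggregation step at the end
```

Frictionless switching between local debugging and heavier parallelism then amounts to changing `executor_cls`.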


Additional material on this theme is under scientific computation workflow (for the actual job-management software) and stream processing. I might also need to consider how to store my data.


Algorithms, implementations thereof, and providers of parallel processing services are all coupled closely. Nonetheless I’ll try to draw a distinction between the three.

Since I am not a startup trying to do machine learning on the cheap, but a grad student implementing algorithms, it’s essential that whatever I use gives me access “under the hood”; I can’t just hand in someone else’s library as my dissertation.

  1. Plain old HPC clusters, as seen on campus, which I discuss at scientific workbooks and build tools.

  2. airflow is Airbnb’s hybrid parallel-happy workflow tool:

    While you can get up and running with Airflow in just a few commands, the complete architecture has the following components:

    • The job definitions, in source control.
    • A rich CLI (command line interface) to test, run, backfill, describe and clear parts of your DAGs.
    • A web application, to explore your DAGs’ definition, their dependencies, progress, metadata and logs. The web server is packaged with Airflow and is built on top of the Flask Python web framework.
    • A metadata repository, typically a MySQL or Postgres database that Airflow uses to keep track of task job statuses and other persistent information.
    • An array of workers, running the jobs task instances in a distributed fashion.
    • Scheduler processes, that fire up the task instances that are ready to run.
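    The scheduler/worker division above boils down to walking a DAG and firing tasks whose upstream dependencies have finished. A toy sketch of that idea in pure Python (the task names and the `run_dag` helper are invented; real Airflow dispatches each task instance to a worker and records its state in the metadata database):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# A hypothetical ETL-ish DAG: each key depends on the tasks in its value set.
dag = {
    "transform_a": {"extract"},
    "transform_b": {"extract"},
    "load": {"transform_a", "transform_b"},
}

def run_dag(dag):
    # Run tasks only once their upstream dependencies have completed.
    order = []
    for task in TopologicalSorter(dag).static_order():
        order.append(task)  # a real scheduler hands this to a worker instead
    return order

run_dag(dag)  # extract runs first, load runs last
```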
  3. SystemML, created by IBM, now run by the Apache Foundation:

    SystemML provides declarative large-scale machine learning (ML) that aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single-node, in-memory computations, to distributed computations on Apache Hadoop and Apache Spark.

  4. Tensorflow is the hot new Google one, coming from the training of artificial neural networks but more broadly applicable; see my notes. It handles general optimisation problems, especially for neural networks. Listed here as a parallel option because it has some parallel support, especially on Google’s own infrastructure. C++/Python.

  5. Turi (formerly Dato, formerly GraphLab) claims to automate this stuff, using their own libraries, which are a little… opaque, and have similar, but not identical, APIs to Python’s scikit-learn. Their (open-source) Tensorflow competitor mxnet claims to be the fastest thingy ever, times whatever you said, plus one. It’s not clear which of their other offerings are open-source.

  6. Spark is…

    …an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. […] By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well suited to machine learning algorithms.

    Spark requires a cluster manager and a distributed storage system. […] Spark also supports a pseudo-distributed mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in this scenario, Spark runs on a single machine with one worker per CPU core.

    Spark had over 465 contributors in 2014, making it the most active project in the Apache Software Foundation and among Big Data open source projects.

    Sounds lavishly well-endowed with a community. Not an option for me, since the supercomputer I use has its own weird proprietary job management system. Although possibly I could set up temporary multiprocess clusters on our cluster? It does support python, for example.

    It has, in fact, many cluster modes:

    • Standalone – a simple cluster manager included with Spark that makes it easy to set up a private cluster.
    • Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications.
    • Hadoop YARN – the resource manager in Hadoop 2.
    • basic EC2 launch scripts for free, btw

    Interesting application: Communication-Efficient Distributed Dual Coordinate Ascent (CoCoA):

    By leveraging the primal-dual structure of these optimization problems, COCOA effectively combines partial results from local computation while avoiding conflict with updates simultaneously computed on other machines. In each round, COCOA employs steps of an arbitrary dual optimization method on the local data on each machine, in parallel. A single update vector is then communicated to the master node.

    i.e. cunning optimisation stunts to do efficient distribution of optimisation problems over various machines.
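    As a cartoon of that communication pattern (not the actual dual method: this is a one-parameter least-squares toy with invented names, and plain gradient steps stand in for the local dual solver):

```python
# Each "machine" refines its copy of the parameter on its local data shard,
# then ships a single update back; the master just combines the updates.
def local_update(w, shard, lr=0.1, steps=10):
    w_local = w
    for _ in range(steps):
        grad = sum(2 * (w_local * x - y) * x for x, y in shard) / len(shard)
        w_local -= lr * grad
    return w_local - w  # the lone update vector sent to the master

def cocoa_round(w, shards):
    deltas = [local_update(w, s) for s in shards]  # parallel in real CoCoA
    return w + sum(deltas) / len(deltas)           # combine at the master

data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]  # true slope is 2
shards = [data[:2], data[2:]]
w = 0.0
for _ in range(50):
    w = cocoa_round(w, shards)
# w has converged close to the true slope, 2.0
```

The point of the pattern is the communication budget: each round exchanges one update per machine, however much local work was done.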

    Spark integrates with scikit-learn via spark-sklearn:

    This package distributes simple tasks like grid-search cross-validation. It does not distribute individual learning algorithms (unlike Spark MLlib).

    i.e. this one is for when your data fits in memory, but the optimisation over hyperparameters needs to be parallelised.
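    The pattern is easy to imitate with invented stand-ins: score every hyperparameter combination in parallel while the (small) dataset stays in memory. Here `ThreadPoolExecutor` plays the role spark-sklearn gives to the Spark workers, and the data and “fit and score” step are made up:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Tiny in-memory dataset generated from y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(10)]

def fit_and_score(params):
    # A made-up per-combination task; spark-sklearn farms exactly this
    # kind of independent evaluation out to the cluster.
    slope, intercept = params
    err = sum((slope * x + intercept - y) ** 2 for x, y in data)
    return params, -err  # higher score is better

grid = list(product([1.0, 2.0, 3.0], [0.0, 1.0, 2.0]))
with ThreadPoolExecutor() as pool:
    scored = list(pool.map(fit_and_score, grid))
best_params, best_score = max(scored, key=lambda ps: ps[1])
# best_params recovers (2.0, 1.0), the true slope and intercept
```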

  7. Dataflow/Beam is Google’s job-handling framework as used in Google Cloud (see below), but they open-sourced this bit. Comparison with spark. The claim has been made that Beam subsumes Spark. It’s not clear how easy it is to make a cluster of these things, but it can also use Apache Spark for processing. Here’s an example using docker. Java only, for now.

  8. dask seems to parallelize certain Python tasks well and scale up elastically. It’s purely for Python.
  9. Thunder does time series and image analysis - seems to be a python/spark gig specialising in certain operations, but at scale:

    Spatial and temporal data is all around us, whether images from satellites or time series from electronic or biological sensors. These kinds of data are also the bread and butter of neuroscience. Almost all raw neural data consists of electrophysiological time series, or time-varying images of fluorescence or resonance.

    Thunder is a library for analyzing large spatial and temporal data. Its core components are:

    • Methods for loading distributed collections of images and time series data
    • Data structures and methods for working with these data types
    • Analysis methods for extracting patterns from these data

    It is built on top of the Spark distributed computing platform.

    Not clear how easy it is to extend to do more advanced operations.

  10. Tessera is a map-reduce library for R:

    The Tessera computational environment is powered by a statistical approach, Divide and Recombine. At the front end, the analyst programs in R. At the back end is a distributed parallel computational environment such as Hadoop. In between are three Tessera packages: datadr, Trelliscope, and RHIPE. These packages enable the data scientist to communicate with the back end with simple R commands.

  11. Thrill is a C++ framework for distributed Big Data batch computations on a cluster of machines. It is currently being designed and developed as a research project at Karlsruhe Institute of Technology and is in early testing.

    Some of the main goals for the design are:

    • To create a high-performance Big Data batch processing framework.
    • Expose a powerful C++ user interface, that is efficiently tied to the framework’s internals. The interface supports the Map/Reduce paradigm, but also versatile “dataflow graph” style computations like Apache Spark or Apache Flink with host language control flow.[…]
    • Leverage newest C++11 and C++14 features like lambda functions and auto types to make writing user programs easy and convenient.
    • Enable compilation of binary programs with full compile-time optimization runnable directly on hardware without a virtual machine interpreter. Exploit cache effects due to less indirections than in Java and other languages. Save energy and money by reducing computation overhead.
    • Due to the zero-overhead concept of C++, enable applications to process small datatypes efficiently with no overhead.
    • […]

Computation node suppliers

Cloud, n.
  1. A place of terror and dismay, a mysterious digital onslaught, into which we all quietly moved.
  2. A fictitious place where dreams are stored. Once believed to be free and nebulous, now colonized and managed by monsters. See ‘Castle in the Air’.[…]
  3. […] other people’s computers

via Bryan Alexander’s Devil’s Dictionary of educational computing

If you want a GPU this all becomes incredibly tedious. Anyway…

  1. IBM has a trial cloud offering (documentation here)
  2. Sciencecluster claims fast and easy deployment too.
  3. databricks, spun off from the Apache Spark team, does automated Spark deployment. The product looks tasty, but has a crazy-high baseline rate of USD99/month for a solo developer.
  4. Amazon has stepped up the ease of doing this recently.
  5. google cloud might interoperate well with a bunch of Google products, such as Tensorflow; obviously this works better if you commit to googly storage APIs etc. Note: it supports Python 2.7 only, so party like it’s 2010. Seems to also support VMs? But then why are you bothering to use Google’s prepackaged services if you can’t avoid the complexity of VMs?

    There is the usual thicket of weird service names.

    Basic workflow is

    1. cloud storage stores data
    2. dataflow supplies data
    3. model fitting done using cloudml
    4. control this from datalab


  6. yhat aims at deploying real-timeish analytics for large-scale production use; the marketing folks have got at their front page and made their precise product offerings impossibly hard to discern. It’s possible they sell flash animation about machine learning services rather than actual machine learning services.

    Oh wait, there are yhat docs hidden two links deep. Looks like they especially do model-prediction calls? But this is frustrating if the bottleneck is fitting the model, surely? And isn’t that usually the bottleneck? Confused.

  7. Turi is also in this business, I think? I’ve grown tired of trying to keep track of their doohickeys.

  8. cloudera supplies nodes that even run python, but AFAICT they cost thousands of bucks per year; very enterprisey. Not grad-student appropriate.

Parallel tasks on your awful ancient “High Performance” computing cluster that you hate but your campus spent lots of money on and it IS free so uh…

hanythingondemand provides a set of scripts to easily set up an ad-hoc Hadoop cluster through PBS jobs.

Holy crap, someone procrastinated. Andrea Zonca wrote a script that allows spawning on your PlatformLSF/Torque legacy HPC monstrosity. It’s called remotespawner. This is the most thankless kind of task.

A more ad hoc, probably-slower-but-more-robust approach: Easily distributing a parallel IPython Notebook on a cluster:

Have you ever asked yourself: “Do I want to spend 2 days adjusting this analysis to run on the cluster and wait 2 days for the jobs to finish or do I just run it locally with no extra work and just wait a week.”

Why yes, I have.

Also, if I have no cluster CPU cycles to hand:


“Quickly and easily parallelize Python functions using IPython on a cluster, supporting multiple schedulers. Optimizes IPython defaults to handle larger clusters and simultaneous processes.” […]

ipython-cluster-helper creates a throwaway parallel IPython profile, launches a cluster and returns a view. On program exit it shuts the cluster down and deletes the throwaway profile.

Works on Platform LSF, Sun Grid Engine, Torque and SLURM.

I do not at all understand how you get data back from this; I guess you run it in situ. Strictly python.

Local parallel tasks with jupyter

See also build tools.


pathos is one general tool here.

ipyparallel is the built-in jupyter option with less pluggability but much ease.

You could also launch spark jobs.

Good scientific VM images

(To work out - should I be listing Docker container images instead? Much hipper, seems less tedious.)

  1. AWS has launched modern DL VMs at last:

    We are excited to announce that an AWS Deep Learning AMI for Ubuntu is now available in the AWS Marketplace in addition to the Amazon Linux version.

    The AWS Deep Learning AMI, now available on AWS Marketplace, lets you run deep learning in the Cloud, at any scale. Launch instances of pre-installed, open source deep learning frameworks, including Apache MXNet, to train sophisticated, custom AI models, experiment with new algorithms, and learn new deep learning skills and techniques. The AWS Deep Learning AMI lets you create managed, auto-scaling clusters of GPUs for large-scale training, or run inference on trained models using the latest versions of MXNet, TensorFlow, Caffe, Theano, Torch, and Keras. With the addition of an Ubuntu version, you have the choice to run on the operating system of your choice. There is no additional charge for the AWS Deep Learning AMI – you pay only for the AWS resources needed to store and run your applications.

Here is the download page.
  2. Continuum Analytics has conda python images.
  3. StarCluster is an academically-targeted AWS-compatible multi-computing library. Their VMs are a little bit dated for my purposes, lacking LLVM 3.5 etc.
  4. Tensorflow-happy AWS images:


Docker

The other, lighter, hipper virtual machine standard which, AFAICT, attempts to make VM provisioning more like installing an app than building a computer.