
Cloud machine learning

Cloudimificating my artificial data learning intelligence brain clever science analyticserisation

how it works

I get lost in all the options for parallel computing on the cheap. I summarise for myself here.

There are roadmaps here, e.g. the Cloud Native Computing Foundation's Landscape. However, for me it exemplifies my precise problems with the industry, in that it mistakes an underexplained information deluge for actionable advice.

So, back to the old-skool: let's find some specific things that work, implement solutions to the problems I have, and generalise as needed.

UPDATE: so far nothing has worked for me.

Fashion dictates this should be “cloud” computing, although I’m also interested in using the same methods without a cloud as such. In fact, I would prefer frictionless switching between the two according to my debugging and processing-power needs.

My emphasis is strictly on the cloud doing large data analyses. I don’t care about serving web pages or streaming videos or whatever, or deploying anything.

In particular I mostly want to do embarrassingly parallel computation. That is, I run many calculations/simulations with absolutely no shared state and aggregate them in some way at the end. This avoids much of the complexity of graph computation. Some of my stuff is deep learning now, which is not quite as straightforward.
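For concreteness, a minimal sketch of that pattern using the standard library’s multiprocessing; simulate here is a stand-in for whatever independent calculation I actually run.

from multiprocessing import Pool
import random

def simulate(seed):
    # stand-in for an independent simulation: no shared state between runs
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(1000))

if __name__ == "__main__":
    with Pool() as pool:
        results = pool.map(simulate, range(100))  # fan out
    print(sum(results) / len(results))            # aggregate at the end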

Additional material on this theme is under scientific computation workflow and stream processing. I might also need to consider how to store my data.

Software

Algorithms, implementations thereof, and providers of parallel processing services are all coupled closely. Nonetheless I’ll try to draw a distinction between the three.

Since I am not a startup trying to do machine learning on the cheap, but a grad student implementing algorithms, it’s essential that whatever I use gives me access “under the hood”; I can’t just hand in someone else’s library as my dissertation.

Computation node suppliers

Cloud, n.

via Bryan Alexander’s Devil’s Dictionary of educational computing

See also Julia Carrie Wong and Matthew Cantor’s devil’s dictionary of Silicon Valley

cloud, the (n) – Servers. A way to keep more of your data off your computer and in the hands of big tech, where it can be monetized in ways you don’t understand but may have agreed to when you clicked on the Terms of Service. Usually located in a city or town whose elected officials exchanged tens of millions of dollars in tax breaks for seven full-time security guard jobs.

If you want a GPU this all becomes incredibly tedious. Anyway…

Parallel tasks on your awful ancient “High Performance” computing cluster that you hate but your campus spent lots of money on and it IS free so uh…

See HPC hell.

Local parallel tasks with python

See also the overlapping section on build tools for other pipelining tools with less of a concurrency focus.

ipython native

IPython spawning overview: ipyparallel is the built-in Jupyter option, less pluggable than the alternatives but easy.
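A minimal sketch, assuming a local set of engines has already been started (e.g. with ipcluster start -n 4):

import ipyparallel as ipp

rc = ipp.Client()                  # connect to the running engines
view = rc.load_balanced_view()
results = view.map_sync(lambda x: x ** 2, range(10))
print(results)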

joblib/dask.distributed

joblib is a simple Python scientific computing library with basic map-reduce and some nice caching that integrate well. Not fancy, but super easy, which is what an academic usually wants, since fancy would imply we have a personnel budget.

>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

dask.distributed is a similar project which expands on joblib to handle networked clusters of computers, and also does load management even without a cluster. In fact it integrates with joblib.
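For instance, a sketch of that joblib integration, assuming dask.distributed is installed; Client() with no arguments spins up a local scheduler and workers, and pointing it at a scheduler address scales the same code out to a cluster:

from math import sqrt
import joblib
from dask.distributed import Client

client = Client()  # local workers; pass a scheduler address to use a real cluster
with joblib.parallel_backend("dask"):
    results = joblib.Parallel()(joblib.delayed(sqrt)(i ** 2) for i in range(10))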

pathos

pathos is one general tool here. Looks a little… sedate… in development. Looks more powerful than joblib in principle, but joblib actually ships.
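A minimal sketch; the selling point is that pathos pools serialise with dill, so lambdas and closures that choke the standard multiprocessing pickler work fine:

from pathos.pools import ProcessPool

pool = ProcessPool(nodes=4)
# dill-based serialisation, so lambdas are fine
results = pool.map(lambda x: x ** 2, range(10))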

spark

You could also launch Spark jobs.
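For embarrassingly parallel work that might look something like the sketch below, where run_sim is a stand-in for an actual simulation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("sims").getOrCreate()
sc = spark.sparkContext

def run_sim(seed):
    # stand-in for an independent simulation
    return seed ** 2

results = sc.parallelize(range(1000)).map(run_sim).collect()
spark.stop()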

Scientific VMs

(To work out - should I be listing Docker container images instead? Much hipper, seems less tedious.)

Scientific containers

Only one so far.

Singularity

Singularity promises containerized science infrastructure.

Singularity provides a single universal on-ramp from the laptop, to HPC, to cloud.

Users of singularity can build applications on their desktops and run hundreds or thousands of instances—without change—on any public cloud.


Released in 2016, Singularity is an open source-based container platform designed for scientific and high-performance computing (HPC) environments. Used by more than 25,000 top academic, government, and enterprise users, Singularity is installed on more than 3 million cores and trusted to run over a million jobs each day.

In addition to enabling greater control over the IT environment, Singularity also supports Bring Your Own Environment (BYOE)—where entire Singularity environments can be transported between computational resources (e.g., users’ PCs) with reproducibility.