The Living Thing / Notebooks : Statistical and mathematical python

Kinda hate R, because it is precisely as much a programming nightmare as it is a statistical dream? Is MATLAB too expensive when you try to run it on your cloud server farm, and are you anyway vaguely suspicious that they get kickbacks from the companies that sell RAM, because otherwise why does it eat all your memory like that? Love the speed of C++ but have a nagging feeling that you should not need to recompile your code to do exploratory data analysis? Like the idea of Julia, but wary of depending on yet another bloody language, let alone one without the serious corporate backing or history of the other ones I mentioned?

Python has a refreshingly different set of warts to those other options. Its statistical library support is less broad than R's - probably comparable to MATLAB's. It is, however, generally fast and nicer to debug, and it handles general programming tasks well: web servers, academic blogs, neural networks, weird art projects, online open workbooks.

The matrix mathematics side of python, OTOH, is not as comprehensive as MATLAB's, mind. The core matrix libraries, numpy and scipy, mostly do what you want, but when, for example, you really need some fancy spline type you read about on the internet, or you want someone to have really put in the effort to ensure such-and-such a recursive filter is stable, you might find you need to do it yourself. So far I haven't found any especially serious shortcomings this way, but it's worth bearing in mind.

OTOH, it's ubiquitous and free, free, free, so you don't need to worry about stupid licensing restrictions, and the community is enormous, which makes it pretty easy to get your questions answered, so that's nice.

But in any case, you don’t need to choose. Python interoperates with all these other languages, and indeed, makes a specialty of gluing stuff together.

Aside: A lot of useful machine-learning-type functionality, which I won't discuss in detail here, exists in the python deep learning toolkits such as Tensorflow and Theano; you might want to check those pages too. Also, graphing is a whole separate issue, as is optimisation.

In recent times, a few major styles have been ascendant in the statistical python scene.

Matrix-style ecosystem (scikit-learn)

scikit-learn exemplifies a machine-learning style, with lots of abstract feature construction and predictive-performance-style model selection built around homogeneously-typed (only floats, only ints) matrices instead of dataframes. This style will be more familiar to MATLAB users than to R users.
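For illustration, a minimal sketch of the idiom (the toy data and names here are mine, not from any particular tutorial): estimators are objects with fit/predict methods operating on plain numeric matrices, and model selection is cross-validated predictive scoring.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# toy data: features and targets are plain homogeneous float arrays
X = np.random.randn(100, 5)
y = X @ np.array([1., 0., 2., 0., 0.]) + 0.1 * np.random.randn(100)

# estimators expose fit/predict; model selection is predictive scoring
model = Ridge(alpha=1.0)
print(cross_val_score(model, X, y, cv=5))
print(model.fit(X, y).predict(X[:3]))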

DataFrame-style ecosystem (pandas)

pandas plus statsmodels look a lot more like R, but stripped down. On the minus side, they lack some language features of R (e.g. regression formulae are not first-class language features). On the plus side, they also lack some (anti-)features of R, such as the object model being a box of turds, copy-by-value semantics, broken reference counting, and hating kittens.
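For a taste, here is a minimal sketch of the statsmodels flavour (toy data mine): the R-style formula is just a string handed to the patsy parser, not actual language syntax.

import pandas as pd
import statsmodels.formula.api as smf

# toy data, purely for illustration
df = pd.DataFrame({'xx': [0., 1., 2., 3., 4.],
                   'yy': [1.1, 2.9, 5.2, 7.1, 8.8]})

# the formula is a string parsed by patsy, not a first-class language object
mod = smf.ols('yy ~ xx', data=df).fit()
print(mod.params)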

Then again, there is no ggplot2, and the selection of implemented algorithms is smaller than R's, so you don't get all this stuff for free.

Functional composition ecosystem (Tensorflow)

Somewhat similar to the matrix-style interface of scikit-learn, but a little weirder, is the computation-flow-graph style of the differentiable learning libraries such as Theano and Tensorflow, and of the variational approximation / deep learning communities. The probabilistic modelling libraries, while not offering quite the same interface, are still very matrixey and probably fit in this school.

There isn’t really a consistent API across these, just an emphasis on functional compositionality. But have a look at the approach in Edward and pymc3 and so on for a taste of how this works in practice.
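For a rough taste of the flavour, here is a minimal sketch in the Tensorflow 1.x graph style (names are mine): you compose symbolic ops into a graph, and computation, including differentiation, happens later in a session.

import tensorflow as tf   # 1.x-era graph API

x = tf.placeholder(tf.float32, shape=(None, 3))   # symbolic input
w = tf.Variable(tf.ones((3, 1)))
# composing ops builds a graph; nothing is computed yet
y = tf.reduce_sum(tf.matmul(x, w))
dy_dw = tf.gradients(y, [w])[0]   # differentiation is a graph transform

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(dy_dw, feed_dict={x: [[1., 2., 3.]]}))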

Interoperation with other languages, platforms

R

live

Slow, but survivable. You can do this using rpy2.ipython:

%load_ext rpy2.ipython

%R library(robustbase)
# push python variables into the R session
%Rpush yy xx
# fit a robust linear regression in R
%R mod <- lmrob(yy ~ xx)
%R params <- mod$coefficients
# pull the fitted coefficients back into python
%Rpull params

See the Revolutions blog, or Josh Devlin’s tips for more of that.

via the filesystem

Much faster, weirdly, and better documented. Recommended. Try feather, which is based on Apache Arrow.

import pandas as pd
import feather

# write a pandas DataFrame to disk
path = 'my_data.feather'
feather.write_dataframe(df, path)
# ...and read it back in
df = feather.read_dataframe(path)

And on the R side:

library(feather)

path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)

If that doesn't work, try HDF5 or protobuf or whatever.
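pandas, for example, can round-trip a DataFrame through HDF5 directly; a minimal sketch, assuming the pytables package is installed (file and key names here are mine):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})   # any DataFrame

# write (needs the pytables package installed)
df.to_hdf('my_data.h5', key='df')
# ...and read it back
df2 = pd.read_hdf('my_data.h5', 'df')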

Compiled systems

There are too many options for interfacing with compiled code! FFI, ctypes, Cython, Boost-Python, numba, SWIG… Beware choice paralysis. More on that when I need it.
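If you just want to call a function from a shared library with no build step, though, ctypes is in the standard library. A minimal sketch, calling cos from the system C maths library (the library lookup is platform-dependent; this works on typical Unixes):

import ctypes
import ctypes.util

# find and load the C math library; find_library is platform-dependent
libm = ctypes.CDLL(ctypes.util.find_library('m'))

# declare the C signature so arguments and results convert correctly
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))  # 1.0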

Miscellaneous items

Tips

Big array previews

How I set my numpy arrays to be displayed big and informative:

np.set_printoptions(edgeitems=5,
    linewidth=85, precision=4,
    suppress=True, threshold=500)

Reset to default:

np.set_printoptions(edgeitems=3, infstr='inf',
    linewidth=75, nanstr='nan', precision=8,
    suppress=False, threshold=1000, formatter=None)

Local random number generator state

Seeding your RNG can be a pain in the arse, especially if you are interfacing with an external library that doesn't support passing RNG state through its API. So, use a context manager. Here's one that works for numpy-based code:

import numpy as np
from numpy.random import get_state, set_state, seed

class Seed(object):
    """
    Context manager for reproducible seeding.

    >>> with Seed(5):
    ...     print(np.random.rand())
    0.22199317108973948
    """
    def __init__(self, seed):
        self.seed = seed
        self.state = None

    def __enter__(self):
        # save the global RNG state, then seed deterministically
        self.state = get_state()
        seed(self.seed)

    def __exit__(self, exc_type, exc_value, traceback):
        # restore the saved state on the way out
        set_state(self.state)

Exercise for the student: make it work with the default RNG also.