Kinda hate R, because it is exactly so much a statistical dream as a programming nightmare? Is MATLAB too expensive when you try to run it on your cloud server farm and you’re anyway vaguely suspicious that they get kickback from the companies that sell RAM because otherwise why does it eat all your memory like that? Love the speed of C++ but have a nagging feeling that you should not need to recompile your code to do exploratory data analysis? Like the idea of Julia, but wary of depending on yet another bloody language, let along one without the serious corporate backing or history of the other ones I mentioned?
Python has a refreshingly different set of warts to those other options. Its statistical library support is less broad than R  probably comparable to MATLAB. It is, however, generally fast and nicer to debug, and integrates general programming tasks well — web servers, academic blogs, neural networks, weird art projects, online open workbooks…
The matrix mathematics side of python OTOH is not so comprehensive as MATLAB, mind. The core matrix libraries, numpy and scipy, do what you want mostly, but when, for example, you really need some fancy spline type you read about on the internet, or you want someone to have really put in some effort on ensuring suchandsuch a recursive filter is stable, you might find you need to do it yourself. So far I haven’t found any especially serious shortcomings this way, but it’s worth bearing in mind.
OTOH, it’s ubiquitous and free, free, free so you don’t need to worry about stupid licensing restrictions, and the community is enormous, so it’s pretty easy to answer any questions you may have, so that’s nice.
But in any case, you don’t need to choose. Python interoperates with all these other languages, and indeed, makes a specialty of gluing stuff together.
Aside: A lot of useful machinelearningtype functionality, which I won’t discuss in detail here here, exists in the python deep learning toolkits such as Tensorflow and Theano; you might want to check those pages too. Also graphing is a whole separate issue, as is optimisation.
In recent times, a few major styles have been ascendant in the statistical python scene.
Matrixstyle Ecosystem: (scikitlearn)
scikitlearn exemplifies a machinelearning style, with lots of abstract feature construction and predictiveperformance style model selection built around homogeneouslytyped (only floats, only ints) matrices instead of dataframes. This style wil be more familiar to MATLAB users than to R users.
scikitlearn (sklearn to its friends) is the flagship of this fleet. It is fast, clear and welldesigned. I enjoy using it for implementing MLtype tasks. It has various algorithms such as random forests and linear regression and gaussian processes and reference implementations of many algorithms, both à la mode and passé. Although I miss *sniff* glmnet in R for lasso regression.
Tensorflow includes a sklearnlike API in its library.

SKLL (pronounced “skull”) provides a number of utilities to make it simpler to run common scikitlearn experiments with pregenerated features.

…provides a bridge between sklearn‘s machine learning methods and pandasstyle Data Frames.
In particular, it provides:
 a way to map DataFrame columns to transformations, which are later recombined into features
 a way to crossvalidate a pipeline that takes a pandas DataFrame as input.
libROSA, the machine listening library is more or less in this school.
kernelregression I should check out. Claims to supplement the convolution kernel algorithms in sklearn.
“pystruct aims at being an easytouse structured learning and prediction library.”
Currently it implements only maxmargin methods and a perceptron, but other algorithms might follow. The learning algorithms implemented in PyStruct have various names, which are often used loosely or differently in different communities. Common names are conditional random fields (CRFs), maximummargin Markov random fields (M3N) or structural support vector machines.
DataFramestyle (pandas)
pandas plus statsmodels look a lot more like R, but stripped down. On the minus side, they lack some language features of R (e.g. regression formulae are not first class language features). On the plus side, they lack some language features of R, such as the object model being a box of turds, and copybyvalue semantics, and broken reference counting and hating kittens.
Then again, there is no ggplot2 and a smaller selection of implemented algorithms compared to R, so you don’t get this stuff for free.
 pandas is moreorless a dataframe class for python. Lots of nice things are built on this, such as …
 statsmodels, which is moreorless R, but Python. Implements
 Linear regression models
 Generalized linear models
 Discrete choice models
 Robust linear models
 Many models and functions for time series analysis
 Nonparametric estimators
 A wide range of statistical tests
 Inputoutput tools for producing tables in a number of formats (Text, LaTex,
 HTML) and for reading Stata files into NumPy and Pandas.
 etc
 patsy implements a formula language for pandas.
This does lots of things, but most importantly, it
 builds design matrices (i.e. it knows how to represent `z~x^2+x+y^3` as a matrix, which only sounds trivial if you haven’t tried it)
 statefully preconditions data (e.g. constructs data transforms that will correctly normalise the test set as well as the training data.)
Functional composition ecosystem (Tensorflow)
Somewhat similar to the matrixstyle interface of the scikitlearn, but weirder, is the computationflowgraphstyle of the differentiable learning libraries such as Theano and Tensorflow, and the variational approximation / deep learning communities. Notably not quite the same interface, but still very matrixey, the probabilistic modeling libraries probably fit in this school.
There isn’t really a consistent API across these, just an emphasis on functional compositionality. But have a look at the approach in Edward and pymc3 and so on for a taste of how this works in practice.
Interoperation with other languages, platforms
R
live
Slow, but survivable. You can do this using rpy2.ipython
%load_ext rpy2.ipython %R library(robustbase) %Rpush yy xx %R mod < lmrob(yy ~ xx); %R params < mod$coefficients; %Rpull params
See the Revolutions blog, or Josh Devlin’s tips for more of that.
via the filesystem
Much faster, weirdly, and better documented. Recommended. Try feather, which is based on Apache arrow.
import pandas as pd import feather path = 'my_data.feather' feather.write_dataframe(df, path) df = feather.read_dataframe(path)
library(feather) path < "my_data.feather" write_feather(df, path) df < read_feather(path)
If that doesn’t work, try hdf5 or protobuf or whatever.
Compiled systems
There are too many options for interfacing with external libraries and/or compiling python code.
FFI, ctypes, Cython, BoostPython, numba, SWIG…
Cython
Lowishfriction, well tested, welldocument works everywhere that Cpython extensions can be compiled. Compiles most python code (apart from generators and inner functions). Optimises python code using type defs and extended syntax.
Highlights: It works seamlessly with numpy. It makes calling Ccode easy
Problems: No generic dispatch. Debugging is nasty, like debugging C with extra crap in your way.
numba
More specialised than cython, uses llvm instead of c compiler. Numba make optimising inner numeric loops reasonably easy.
Highlights: jitcompiles plain python, so it’s very easy to use normal debuggers then switch on the compiler for performance improvements using the @jit Generic dispatch using the @generated_jit decorator. Compiles to multicore vectorisations as well as CUDA. In principle this means you can do your calculations on the GPU.
Problems: LLVM is a shifty beast and sensitive version dependencies are annoying. Documentation is a bit crap. Practically, getting performance out of a GPU is trickier than working out you can optimise away one tedious matrix op. There is too much messing with details of how many processors to allocate what to.
Miscellaneous items
 hypertools is a generic dimensionality reduction toolkit. Is this worthwhile?
 Bonus scientific computation is also available through GSL, most easily via CythonGSL
 agate is a stats package is not designed for high performance but for ease of use and reproducibility for journalists.
 savez makes persisting arrays fast and efficient per default. Use that, if you are talking to other python processes.
 biopython is a whole other world of phylogeny and biodata wrangling. I’m not sure if it adheres to one of the aforementioned schools or not. Should check.
 more options for speech/string analysis at natural language processing.
Displaying numbers legibly
Easy, but documentation is hard to find.
Floats
Sven Marnach distills everything adroitly:
print("{:10.4f}".format(x))
means “with 4 decimal points, align x to fill 10 columns”.
Atll conceivable alternatives are displayed at pyformat.info.
Arrays
How I set my numpy arrays to be displayed big and informative:
np.set_printoptions(edgeitems=5, linewidth=85, precision=4, suppress=True, threshold=500)
Reset to default:
np.set_printoptions(edgeitems=3, infstr='inf', linewidth=75, nanstr='nan', precision=8, suppress=False, threshold=1000, formatter=None)
There are a lot of ways to do this one.
See also np.array_str for oneoff formatting.
Tips
Use a knowngood project structure
[cookiecutter data science](https://drivendata.github.io/cookiecutterdatascience/#gettingstarted).
Local random number generator state
Seeding your RNG can be a pain in the arse, especially if you are interfacing with an external library that doesn’t have RNG state passing in the API. So, use a context manager. Here’s one that works for numpybased code:
from numpy.random import get_state, set_state, seed class Seed(object): """ context manager for reproducible seeding. >>> with Seed(5): >>> print(np.random.rand()) 0.22199317108973948 """ def __init__(self, seed): self.seed = seed self.state = None def __enter__(self): self.state = get_state() seed(self.seed) def __exit__(self, exc_type, exc_value, traceback): set_state(self.state)
Exercise for the student: make it work with the default RNG also