R

The statistical programming language, not the letter

Usefulness: 🔧 🔧
Novelty: 💡
Uncertainty: 🤪 🤪 🤪
Incompleteness: 🚧 🚧 🚧

tl;dr R is a powerful, effective, diverse, well-supported, free, nightmare de-facto standard. As far as statistical languages go, this is outstandingly good.

Pros and cons

Good

• Free (beer/speech)
• Combines unparalleled breadth and community, at least as pertains to statisticians, data miners, machine learners and other such assorted folk as I call my colleagues. To get some sense of this thriving scene, check out R-bloggers. That community alone is enough to sell R, whatever you think of the language (cf “Your community is your best asset”). And believe me, I have reservations about everything else.
• Amazing, statistically-useful plotting (cf the awful battle to get error bars in Python’s mayavi)
• Online web-app visualisation: shiny
• Integration into literate coding and reproducible research through knitr – see scientific writing workflow.

Bad

• Seems, from my standpoint, to have been written by statisticians who prioritise delivering statistical functionality right now over making an elegant, fast or consistent language to access that functionality. (“Elegant”, “fast”, “consistent”; you can choose… uh… Oh look, it’s lunch break! Gotta go.) I’d rather access those same libraries through a language which has had as many computer scientists winnowing its ugly chaff as Python or Ruby has had. Or indeed Go or Julia. R is the JavaScript of numerical computing. And, for that matter, I’d like as many amazing third-party libraries for non-statistical things as these other languages promise, even JavaScript. Anyway, it is convenient for many common use cases, which is nice.
• Poetically, R has random scope amongst other parser and syntax weirdness.
• Call-by-value semantics (in a “big-data” processing language?)
• …ameliorated not even by array views,
• …exacerbated by bloaty design
• Object model tacked on after the fact… in fact, several object models, which is fine? I guess? Maybe, but…
• …if the object model stuff is a confused multi-standard compatibility disaster, I’d like the trade-off to be speed, or some other such modern convenience. Nah.
• One of the worst names to google for ever (cf Processing, Pure)
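
To make the call-by-value complaint above concrete, here is a minimal base-R sketch. The function name zero_first is made up for illustration. Modifying an argument inside a function leaves the caller's object alone, so to "update" a big object you have to return a modified copy and reassign it.

```r
# zero_first is a hypothetical helper, not from any package.
zero_first <- function(v) {
  v[1] <- 0   # modifies the function's own copy only
  v
}

x <- c(1, 2, 3)
y <- zero_first(x)

print(x)  # the caller's vector is unchanged
print(y)  # the returned copy carries the change
```

R does at least defer the actual copy until the first write (copy-on-modify), so merely passing a large object around is cheap. It is the writes that cost you.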

Needful packages

The tidyverse

The tidyverse is a miniature ecosystem within R which has certain coding conventions and tooling to make certain data analysis easier and prettier.

Blogging and reproducible research

blogdown, as mentioned elsewhere, is R’s entrant into the fancy equationy bloggy universe. It also does reproducible research and miscellaneous scientific writing.

Hip machine learning

R now plugs into many machine-learning-style algorithms.

For one example, you can run keras, and hence presumably tensorflow, via RStudio’s keras.

R does not define a standardized interface for its machine-learning algorithms. Therefore, for any non-trivial experiments, you need to write lengthy, tedious and error-prone wrappers to call the different algorithms and unify their respective output.

Additionally you need to implement infrastructure to

• resample your models
• optimize hyperparameters
• select features
• cope with pre- and post-processing of data and compare models in a statistically meaningful way.

As this becomes computationally expensive, you might want to parallelize your experiments as well. This often forces users to make crummy trade-offs in their experiments due to time constraints or lacking expert programming skills.

mlr provides this infrastructure so that you can focus on your experiments! The framework provides supervised methods like classification, regression and survival analysis along with their corresponding evaluation and optimization methods, as well as unsupervised methods like clustering. It is written in a way that you can extend it yourself or deviate from the implemented convenience methods and construct your own complex experiments or algorithms.

I think this pitch is more or less the same for caret.
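
To give a sense of the boilerplate these frameworks take off your hands, here is a sketch of the very first chore on that list, resampling, done by hand in base R. This is not how mlr or caret do it internally. The helper cv_rmse, the choice of 5 folds, and the use of the built-in mtcars data are all my own assumptions for illustration.

```r
# Hand-rolled k-fold cross-validation for a linear model, base R only.
# cv_rmse is a hypothetical helper written for this sketch.
cv_rmse <- function(formula, data, k = 5, seed = 1) {
  set.seed(seed)
  # Assign each row to one of k folds at random.
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  errs <- vapply(seq_len(k), function(i) {
    train <- data[fold != i, , drop = FALSE]
    test  <- data[fold == i, , drop = FALSE]
    fit   <- lm(formula, data = train)
    pred  <- predict(fit, newdata = test)
    sqrt(mean((test[[response]] - pred)^2))  # RMSE on the held-out fold
  }, numeric(1))
  mean(errs)
}

small <- cv_rmse(mpg ~ wt, mtcars)
big   <- cv_rmse(mpg ~ wt + hp, mtcars)
print(c(wt_only = small, wt_and_hp = big))
```

Now imagine repeating that for every learner with a different predict signature, plus tuning and feature selection, and the appeal of a unified wrapper is clear.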

There are also externally developed ML algorithms accessible from R that presumably have consistent interfaces by construction: h2o

Misc

data.table offers high performance dataset querying, and approximately the same functionality as dplyr, but seems to be faster at e.g. sorting.

disk.frame is a friendly gigabyte-scale single machine disk-backed data store, for stuff too big for memory.

Testing

Academics are terrible at testing so I do not know how relevant this is, but ttdo + tinytest looks low-difficulty.

Plotting

IMO, the real killer feature of R.

See Plotting in R.

High performance R

Rcpp seems to be how everyone invokes their favoured compiled C++ code.

There are higher-level tools that do this under the hood:

rstan compiles an inner loop this way for Bayesian posterior simulation and a little bit of basic variational inference.

If you want a little more freedom but still want to have automatic differentiation and linear algebra done by magic, try TMB, whose name and description are both awful but which manages pretty neat reduced-rank matrix and optimization tricks for you.

Interacting with Julia

Julia is a nice language that can attain high performance.

I don’t know how to choose between these alternative methods. They both seem to have stalled, but XRJulia seems to be somewhat fresher.

This package provides an interface from R to Julia, based on the XR structure, as implemented in the XR package, in this repository.

rJulia provides an interface between R and Julia. It allows a user to run a script in Julia from R, and maps objects between the two languages.

Intro help

Save my workspace (i.e. current scope and variable definitions) to ./.RData

> save.image()

Load my workspace from ./.RData

> rm(list=ls())  # clear current defs
> load(".RData") # actually load
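
If you would rather not litter the working directory, the same round trip works with an explicit path and a named set of objects. This is a small sketch using save and load from base R. The variable names and the temporary file are made up for the example.

```r
# Round trip through an explicit file instead of ./.RData.
path <- tempfile(fileext = ".RData")

answer <- 42
greeting <- "hello"
save(answer, greeting, file = path)   # save only the named objects

rm(answer, greeting)                  # clear them from the current scope
restored <- load(path)                # load() returns the names it restored
print(restored)
print(answer)
```

save.image also accepts a file argument if you want the whole workspace somewhere other than the default.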

Subsetting hell

To subset a list based object:

x[1]

To subset and optionally downcast the same:

x[[1]]

To subset a matrix-based object:

x[1, , drop=FALSE]

To subset and optionally downcast the same:

x[1, ]
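
A quick base-R check of what each of these forms actually hands back. The objects lst and m are made up for the demonstration.

```r
lst <- list(a = 1:3, b = "text")
print(class(lst[1]))     # "list": still a list, of length 1
print(class(lst[[1]]))   # "integer": the element itself

m <- matrix(1:6, nrow = 2)
print(dim(m[1, , drop = FALSE]))  # 1 3: still a matrix
print(dim(m[1, ]))                # NULL: dropped to a plain vector
print(m[1])                       # 1: a lone index treats the matrix as a vector
```

The last line is the trap: a single index on a matrix does not select a row, it selects one element in column-major order.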

Data exchange

How to pass sparse matrices between R and Python

My hdf5 hack

Counter-intuitively, this FS-backed method was a couple of orders of magnitude faster than rpy2 last time I tried to pass more than a few MB of data.

Apparently you can use feather for this these days, although there is little documentation. Also, you can try RcppCNPy (pronounced Arrsikpeekernoppy), which is a numpy-matrix-loader for R.
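
Another file-backed option, which is not the hdf5 hack mentioned above but a different technique I am swapping in for illustration, is the Matrix Market text format. The Matrix package can write it with writeMM, and on the Python side scipy.io.mmread can read it. A minimal sketch, assuming the Matrix package is installed (it ships with most R distributions); the example matrix is made up.

```r
library(Matrix)  # recommended package, shipped with most R installations

# A small sparse matrix with three non-zero entries.
sm <- sparseMatrix(i = c(1, 3, 4), j = c(2, 1, 4), x = c(1.5, -2, 7),
                   dims = c(4, 4))

path <- tempfile(fileext = ".mtx")
writeMM(sm, path)        # write in Matrix Market coordinate format

back <- readMM(path)     # read it back to check the round trip
print(all(as.matrix(back) == as.matrix(sm)))
```

It is a plain-text format, so I would not expect it to win any speed contests on large data, but it needs no extra dependencies on the R side.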

Writing packages

How do you do that? It’s not so hard, and as Hilary Parker points out, saves you time.

Make a folder called MyCode with a DESCRIPTION file. Make a subfolder called R. Put R code in .R files in there. Edit, load_all("MyCode"), use the functions.

1. Install Hadley Wickham’s devtools.
2. Use the Devtools/RStudio workflow.
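
The folder layout described above can be sketched in a few lines of base R. The DESCRIPTION fields shown are a bare minimum I chose for illustration, and the hello function is made up. The skeleton is built in a temporary directory so that it does not touch your real files.

```r
# Build the minimal layout inside a temporary directory.
# "MyCode" comes from the text; hello() is a hypothetical function.
pkg <- file.path(tempfile("pkgdemo"), "MyCode")
dir.create(file.path(pkg, "R"), recursive = TRUE)

writeLines(c(
  "Package: MyCode",
  "Title: Personal Helper Functions",
  "Version: 0.0.1",
  "Description: Scratch package for my own functions.",
  "License: GPL-3"
), file.path(pkg, "DESCRIPTION"))

writeLines(
  'hello <- function(name = "world") paste0("Hello, ", name, "!")',
  file.path(pkg, "R", "hello.R")
)

print(list.files(pkg, recursive = TRUE))

# With devtools installed, the edit-and-reload loop is then:
# devtools::load_all(pkg)
# hello("R")
```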

Here’s an intro to the OO facilities of R - although I recommend going for a functional style as much as possible to avoid pain.

There are step debuggers and other such modern conveniences

Inspecting frames post hoc

Use recover. In fact, pro-tip, you can invoke it in 3rd party code gracefully:

options(error = utils::recover)

Basic interactive debugger

There is at least one, called browser.
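
A minimal sketch of using browser as a conditional breakpoint. The function scale_by and the NA condition are made up. The interactive() guard is my own addition so that the same code still runs unattended in a script.

```r
# scale_by is a hypothetical function with a conditional breakpoint.
scale_by <- function(x, factor) {
  if (interactive() && any(is.na(x))) {
    browser()   # drops into the debugger only in an interactive session
  }
  x * factor
}

print(scale_by(c(1, 2, 3), 10))
```

Once inside the browser prompt, n steps to the next line, c continues, and Q quits.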

Graphical interactive optionally-web-based debugger

Available in RStudio and if it had any more buzzwords in it would socially tag your instagram and upload it to the NSA’s Internet Of Things to be 3D printed.

Command-line invocation

Rio

Loads CSV from stdin into R as a data.frame, executes given commands, and gets the output as CSV or PNG on stdout
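
The same pipeline idea can be sketched in plain R. This is not Rio's own code, just an illustration of the CSV-in, CSV-out pattern. The helper csv_filter and the sample data are made up. In a real script run with Rscript you would pass file("stdin") as the input.

```r
# A plain-R sketch of the CSV-in, CSV-out idea; this is not Rio itself.
# csv_filter is a hypothetical helper.
csv_filter <- function(input, output, f) {
  df <- read.csv(input, stringsAsFactors = FALSE)
  write.csv(f(df), output, row.names = FALSE)
}

# Keep only the rows with a score of at least 4.
keep_high <- function(df) df[df$score >= 4, ]

# Demonstrate with an in-memory connection standing in for stdin,
# and capture what would have gone to stdout.
input <- textConnection("name,score\nann,3\nbob,5\ncat,4")
out_lines <- capture.output(csv_filter(input, stdout(), keep_high))
close(input)

print(out_lines)
```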

R for Pythonistas

Many things about R are surprising to me, coming as I do most recently from Python. I’m documenting my perpetual surprise here, in order that it may save someone else the inconvenience of going to all that trouble to be personally surprised.

Opaque imports

Importing an R package, unlike importing a Python module, brings in random cruft that may have little to do with the names of the thing you just imported. That is, IMO, poor planning, although history indicates that most language designers don’t agree with me on that:

> npreg
> library("np")
Nonparametric Kernel Methods for Mixed Datatypes (version 0.40-4)
> npreg
function (bws, ...) #etc
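
You can watch this happen with the search path, and sidestep it with the :: operator. The sketch below uses the tools package rather than np, purely because tools ships with R and so the example runs anywhere.

```r
before <- search()
library(tools)                     # attaches every exported name at once
print(setdiff(search(), before))   # what was newly attached, if anything

print(length(ls("package:tools")))  # how many names just came into scope

# The more Python-like alternative: qualify the name and attach nothing.
print(utils::head(letters, 3))
```

Qualifying with pkg::name costs some typing, but it makes it obvious where each function comes from.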

Further, data structures in R can, and are intended to, provide first-class scopes for looking up names. In your explorations into data, you are as apt to bring the names of columns in a data set into scope as the names of functions in a library. This is kind of useful, although it leads to bizarre and unhelpful errors, so watch it.
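
A small base-R illustration of a data frame acting as a scope, using with. The data frame and the variable scale_factor are made up for the example.

```r
df <- data.frame(height = c(1.6, 1.8), weight = c(60, 81))

# Column names are looked up inside the data frame first.
bmi <- with(df, weight / height^2)
print(bmi)

# The lookup falls through to the enclosing scope, which is where the
# confusing bugs come from: `scale_factor` is not a column, so R quietly
# uses the variable defined outside the data frame instead.
scale_factor <- 100
scaled <- with(df, weight * scale_factor)
print(scaled)
```

Misspell a column name and you get whatever happens to share that name in your workspace, or an error about a missing object that never mentions the data frame.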

No scalar types…

A float is a float vector of size 1:

> 5
[1] 5
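
You can confirm this from the inside with a few base-R checks:

```r
x <- 5
print(is.vector(x))        # TRUE
print(length(x))           # 1
print(x[1])                # indexing a "scalar" works like any other vector
print(identical(x, c(5)))  # TRUE: wrapping it in c() changes nothing
```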

…yet verbose vector literal syntax

You make vectors by using a call to a function called c. Witness:

> c('a', 'b', 'c', 'd')
[1] "a" "b" "c" "d"

If you type a literal vector in though, it will throw an error:

> 'a', 'b', 'c', 'd'
Error: unexpected ',' in "'a',"

I’m sure there are Reasons for this; it’s just that they are reasons that I don’t care about.
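
Two more behaviours of c that surprised me coming from Python lists, sketched in base R with made-up values: it flattens rather than nests, and it coerces mixed types to a single common type.

```r
nested <- c(1, c(2, 3), 4)   # c() flattens, it does not nest
print(length(nested))

mixed <- c(1, "a", TRUE)     # mixed types are coerced to one type
print(class(mixed))

print(1:4)                   # ranges at least have their own terse syntax
```

If you actually want a heterogeneous or nested container, that is what list is for.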

Refs

Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016. “Mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1–5. http://jmlr.org/papers/v17/15-066.html.

Kuhn, Max. 2008. “Building Predictive Models in R Using the Caret Package.” Journal of Statistical Software 28 (5): 1–26. https://doi.org/10.18637/jss.v028.i05.

Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. New York: Springer-Verlag. https://www.springer.com/gp/book/9781461468486.