
R (the language)

tl;dr R is a powerful, effective, diverse, well-supported, free, nightmare de-facto standard. As far as statistical languages go, this is outstandingly good.

Pros and cons

Good

Bad

Needful packages

the tidyverse

The most lucid explanation of everything is Hadley Wickham’s Advanced R book, which is free and only “Advanced” in the sense that your knowledge of R will be advanced after reading it, not in the sense of being forbiddingly complicated for beginners.

Kieran Healy advises the following setup for visualisation in the tidyverse style:

my_packages <- c("tidyverse", "broom", "coefplot", "cowplot",
                 "gapminder", "GGally", "ggrepel", "ggridges", "gridExtra",
                 "interplot", "margins", "maps", "mapproj", "mapdata",
                 "MASS", "quantreg", "scales", "survey", "srvyr",
                 "viridis", "viridisLite", "devtools")

install.packages(my_packages,
                 repos = "http://cran.rstudio.com")

Others:

Useful functions: semi_join etc.
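
As a rough illustration of what semi_join does: it is a filtering join, keeping the rows of the first table that have a match in the second, without adding columns or duplicating rows. The sketch below uses made-up `orders` and `paid` tables (both hypothetical) and shows a base-R equivalent next to the dplyr call, which only runs if dplyr is installed.

```r
# Hypothetical example data: four orders, and payments for two of them
orders <- data.frame(id = c(1, 2, 3, 4), item = c("a", "b", "c", "d"))
paid   <- data.frame(id = c(2, 4, 4))   # note the duplicated payment for id 4

# Base-R equivalent of a semi join: keep orders whose id appears in paid
semi_base <- orders[orders$id %in% paid$id, , drop = FALSE]

# The dplyr version, run only when dplyr is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  semi_dplyr <- dplyr::semi_join(orders, paid, by = "id")
  # Same rows either way; the duplicate payment does not duplicate the order
  stopifnot(identical(semi_dplyr$id, semi_base$id))
}

print(semi_base)
```

anti_join is the complement: it keeps the rows that have no match.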

blogging and reproducible research

blogdown, as mentioned elsewhere, is R’s entrant into the fancy equationy bloggy universe. It also does reproducible research and miscellaneous scientific writing.

plotting

More tidyverse

high performance R

Rcpp seems to be how everyone invokes their favoured compiled C++ code.

There are higher-level tools that do this under the hood:

rstan compiles an inner loop this way for Bayesian posterior simulation and a little bit of basic variational inference.

If you want a little more freedom but still want to have automatic differentiation and linear algebra done by magic, try TMB, whose name and description are both awful but which manages pretty neat reduced rank matrix and optimization tricks for you.

Misc

data.table offers high performance dataset querying, and approximately the same functionality as dplyr, but seems to be faster at e.g. sorting.

disk.frame is a friendly gigabyte-scale single machine disk-backed data store, for stuff too big for memory.

Surprising connections

You can run keras, and hence presumably tensorflow, via RStudio’s keras.

General Tips

Intro help

stackexchange meta list.

subsetting hell

To subset a list based object:

x[1]

to subset and optionally downcast the same:

x[[1]]

to subset a matrix-based object:

x[1, , drop=FALSE]

to subset and optionally downcast the same:

x[1, ]
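
A small base-R sketch of how these forms behave in practice. The objects `lst` and `m` are made up for illustration.

```r
lst <- list(a = 1:3, b = "text")
class(lst[1])     # "list": single brackets keep the container
class(lst[[1]])   # "integer": double brackets extract the element itself

m <- matrix(1:6, nrow = 2)
dim(m[1, , drop = FALSE])   # 1 3: still a matrix
dim(m[1, ])                 # NULL: dropped to a plain vector
m[1]                        # 1: a single index treats the matrix as a flat vector
```

The `drop = FALSE` form matters most inside functions, where a one-row result that silently turns into a vector tends to break the code downstream.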

How to pass sparse matrices between R and Python

My hdf5 hack

Counter-intuitively, this FS-backed method was a couple of orders of magnitude faster than rpy2 last time I tried to pass more than a few MB of data.

Apparently you can use feather for this these days, although there is little documentation. Also, you can try RcppCNPy (pronounced Arrsikpeekernoppy), which is a numpy-matrix-loader for R.

Upgrading R breaks the installed packages

update: no longer broken by default in latest R!

This is the fix:

update.packages(checkBuilt=TRUE, ask=FALSE)

Bioconductor’s horrifyingly pwnable install

What, you’d like to install some biostatistics software on your campus supercomputing cluster? Easy! Simply download and run this unverifiable, obligatorily unencrypted, unsigned script from a web server of unknown provenance!

source("http://bioconductor.org/biocLite.R")
biocLite("RBGL")

It is probably usually not often script kiddies spoofing you so as to trojan your campus computing cluster to steal CPU cycles. After all, who would do that?

On an unrelated note, I am looking for investors in a cryptocurrency mining operation. Contact me privately.

NO, WAIT! There are now alternatives, specifically BiocManager, which might possibly be more secure.

BiocManager::install(c("limma", "knitr"))

Writing packages

How do you do that? It’s not so hard, and as Hilary Parker points out, saves you time.

Easy project reload

Devtools for lazy people:

Make a folder called MyCode with a DESCRIPTION file. Make a subfolder called R. Put R code in .R files in there. Edit, load_all("MyCode"), use the functions.
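
A minimal sketch of that recipe, scripted so it can be pasted into a session. It builds the folder in a temporary directory; the function `double_it` and the DESCRIPTION field values are placeholders I made up, and I am assuming that a DESCRIPTION with only Package, Title and Version is enough for load_all. The load step only runs if devtools is installed.

```r
# Build the skeleton: MyCode/DESCRIPTION and MyCode/R/*.R
pkg_dir <- file.path(tempdir(), "MyCode")
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)

# A bare-bones DESCRIPTION (placeholder values)
writeLines(
  c("Package: MyCode",
    "Title: Scratch Functions",
    "Version: 0.0.1"),
  file.path(pkg_dir, "DESCRIPTION")
)

# One hypothetical function in the R subfolder
writeLines(
  "double_it <- function(x) 2 * x",
  file.path(pkg_dir, "R", "double_it.R")
)

# Edit, reload, use: repeat as often as needed
if (requireNamespace("devtools", quietly = TRUE)) {
  devtools::load_all(pkg_dir)
  print(double_it(21))
}
```

After editing any file under R/, calling load_all again picks up the changes without restarting the session.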

Hadley Wickham pro-style

Install his devtools.

Devtools/RStudio workflow

Here’s an intro that explains how to use the OO facilities of R - although I recommend going for a functional style to avoid pain. You don’t want to have to learn about multiple different object model versions.

There are step debuggers and other such modern conveniences

inspecting frames post hoc:

recover. In fact, pro-tip: you can invoke it in 3rd party code gracefully:

options(error = utils::recover)

Interactive debugger

browser
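
A minimal sketch of how browser is typically used: drop a call into the function you want to inspect, and R pauses there with the local variables in scope. The function `running_mean` and its `debug` flag are made up for illustration, and the `interactive()` guard is my addition so that the snippet also runs unattended.

```r
running_mean <- function(x, debug = FALSE) {
  total <- cumsum(x)
  # Pause here when debugging: inspect `total`, then type c to continue or Q to quit
  if (debug && interactive()) browser()
  total / seq_along(x)
}

running_mean(c(2, 4, 6))
# At an interactive prompt, running_mean(c(2, 4, 6), debug = TRUE) opens the debugger
```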

Graphical interactive optionally-web-based debugger

Available in RStudio, and if it had any more buzzwords in it, it would socially tag your Instagram and upload it to the NSA’s Internet of Things to be 3D printed.

easy command-line invocation:

Rio

Loads CSV from stdin into R as a data.frame, executes given commands, and gets the output as CSV or PNG on stdout

R for Pythonistas

Many things about R are surprising to me, coming as I do most recently from Python. I’m documenting my perpetual surprise here, in order that it may save someone else the inconvenience of going to all that trouble to be personally surprised.

Opaque imports

Importing an R package, unlike importing a Python module, brings in random cruft that may have little to do with the name of the thing you just imported. That is, IMO, poor planning, although history indicates that most language designers don’t agree with me on that:

> npreg
Error: object 'npreg' not found
> library("np")
Nonparametric Kernel Methods for Mixed Datatypes (version 0.40-4)
> npreg
function (bws, …) #etc
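
If the implicit name dump bothers you, R does have a rough analogue of Python's qualified access: the `pkg::name` form, which looks a name up without attaching the package. A minimal sketch using `tools`, a package that ships with R (chosen only because it is always installed; it does not come from the example above):

```r
# Qualified access: no library() call, nothing added to the search path
tools::file_ext("notes.txt")

# library() attaches every exported name at once, like `from tools import *`
before <- search()
library(tools)
setdiff(search(), before)    # the newly attached entry, if it was not attached already
head(ls("package:tools"))    # a sample of the names that just arrived
file_ext("notes.txt")        # now resolvable without the prefix
```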

Further, data structures in R can, and are intended to, provide first-class scopes for looking up names. In your explorations into data, you are as apt to bring the names of columns in a data set into scope as the names of functions in a library. This is kind of useful, although it leads to bizarre and unhelpful errors, so watch it.
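
A small base-R sketch of both the convenience and the trap, using a made-up data frame `df`. The misspelt name `hieght` is deliberate.

```r
df <- data.frame(height = c(150, 160, 170), weight = c(50, 62, 71))

# Column names are looked up inside the data frame first
mean_height <- with(df, mean(height))

# Formula interfaces do the same thing through the data argument
fit <- lm(weight ~ height, data = df)

# Trap 1: a global variable with the same name is silently shadowed
height <- 999
shadowed <- with(df, mean(height))      # still uses df$height

# Trap 2: a misspelt column name falls through to the enclosing scope
hieght <- 999
fallthrough <- with(df, mean(hieght))   # quietly uses the global instead
```

Neither trap raises an error, which is what makes them unpleasant to track down.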

No scalar types…

A float is a float vector of size 1:

> 5
[1] 5
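
A few base-R checks that make the point concrete (the variable `x` is just for illustration):

```r
x <- 5
length(x)            # 1
is.vector(x)         # TRUE
typeof(x)            # "double"
identical(x, c(5))   # TRUE: wrapping one number in c() changes nothing
x[1]                 # 5: indexing a "scalar" works like indexing any vector
```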

…yet verbose vector literal syntax

You make vectors by calling a function named c. Witness:

> c('a', 'b', 'c', 'd')
[1] "a" "b" "c" "d"

If you type a literal vector in, though, it will throw an error:

> 'a', 'b', 'c', 'd'
Error: unexpected ',' in "'a',"

I’m sure there are Reasons for this; it’s just that they are reasons that I don’t care about.