R

The statistical programming language, not the letter

August 7, 2011 — December 14, 2021

computers are awful

number crunching

statistics

tl;dr R is a powerful, effective, diverse, well-supported, free, messy, inefficient, de-facto standard. As far as scientific computation goes, this is outstandingly good.

1 Pros and cons

Figure 2: You can use R for anything, if you really really want.

1.1 Good

Free (beer/speech)
Combines unparalleled breadth and community, at least as pertains to statisticians, data miners, machine learners and other such assorted folk as I call my colleagues. To get some sense of this thriving scene, check out R-bloggers. That community alone should be enough to sell R as an ecosystem, whatever you think of the technical chaos of the language itself (cf “Your community is your best asset”). And believe me, I have reservations about everything else.
Amazing, statistically-useful plotting (cf the awful battle to get error bars in python’s mayavi)
Online web-app visualisation: shiny
Integration into literate coding and reproducible research through knitr — see scientific writing workflow.

1.2 Bad

Seems to have been written by statisticians who prioritise delivering statistical functionality right now over making an elegant, fast or consistent language to access that functionality. (“Elegant”, “fast”, or “consistent”? You can choose… uh… Oh look, none of those are performance metrics for my current role. Gotta go! Byeeeeeee!) I’d rather access those same libraries through a language which has had as many computer scientists winnowing its ugly chaff as Python or Ruby has had. Or indeed Go or Julia. And, for that matter, I’d like as many amazing third-party libraries for non-statistical things as these other languages promise, even javascript. Anyway, it is convenient for many common use cases, which is nice.
Poetically, R has random scope amongst other parser and syntax weirdness.
Call-by-value semantics (in a “big-data” processing language?)
…ameliorated not even by array views,
…exacerbated by bloaty design.
Object model tacked on after the fact. In fact, several object models. Which is fine? I guess?
One of the worst names to google for ever (cf Processing, Pure)

2 Installing R

See the R package page.

3 Installing packages

See the R package page.

4 ODEs

Here is a forward solver: Package deSolve: Solving Initial Value Differential Equations in R.

Or invoke Julia using JuliaCall: SciML/diffeqr: Solving differential equations in R using DifferentialEquations.jl and the SciML Scientific Machine Learning ecosystem. This supports adjoints and so on.

5 Command-line scripting

Rscript is the command-line invocation for R. littler is an older one.

Rio:

Loads CSV from stdin into R as a data.frame, executes given commands, and gets the output as CSV or PNG on stdout

6 Recommended config

Pro tip: if I make a per-project .Rprofile, I still execute my user .Rprofile:

source("~/.Rprofile")
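If the user-level file might be missing (on a fresh machine, say), a guarded variant avoids an error at startup. This is only a sketch: `source_if_exists` is a hypothetical helper name, and the option `demo.profile.loaded` exists purely for the demonstration, which uses a throwaway file standing in for ~/.Rprofile.

```r
# Hypothetical helper: source a profile file only if it is actually there.
source_if_exists <- function(path) {
  path <- path.expand(path)
  if (!file.exists(path)) return(invisible(FALSE))
  source(path)
  invisible(TRUE)
}

# Demonstration with a temporary file standing in for ~/.Rprofile.
fake_profile <- tempfile(fileext = ".R")
writeLines("options(demo.profile.loaded = TRUE)", fake_profile)
found <- source_if_exists(fake_profile)
missing <- source_if_exists(file.path(tempdir(), "no-such-profile"))
print(c(found = found, missing = missing))
print(getOption("demo.profile.loaded"))
```

In a per-project .Rprofile the call would then be `source_if_exists("~/.Rprofile")`.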

6.1 UI

Aaron recommends not starting a new X server to ask you to choose a menu item:

options(menu.graphics=FALSE)

6.2 Path surgery for `linuxbrew`

If I use homebrew with system R on e.g. Ubuntu, things can get weird. R will sometimes use the homebrew packages and sometimes the OS packages. Confusing compilation or runtime errors ensue, such as unable to load shared object and undefined symbol. Also, if I switch between using homebrew and not using homebrew, the confusing compilation errors persist and can be hard to track down. In my experience, at least, it is best to keep linuxbrew and R as far away from one another as possible. I try to prevent R seeing linuxbrew in my R startup script, .Rprofile. This puts the .linuxbrew paths last:

# Split PATH into entries and move any entry mentioning "brew" to the end.
.pth = Sys.getenv("PATH")
.pths = unlist(strsplit(.pth, ":"))
.isbrew = grepl("brew", .pths)  # grepl is vectorised, so no lapply is needed
# Joining a single vector avoids a stray trailing ":" when no brew paths exist.
Sys.setenv(PATH=paste(c(.pths[!.isbrew], .pths[.isbrew]), collapse=":"))

This one deletes them entirely and is what I use:

# Split PATH into entries and drop every entry mentioning "brew".
.pth = Sys.getenv("PATH")
.pths = unlist(strsplit(.pth, ":"))
.nbrewpthi = !grepl("brew", .pths)  # grepl is vectorised, so no lapply is needed
Sys.setenv(PATH=paste(.pths[.nbrewpthi], collapse=":"))

Mentioning that you have done this is probably helpful to the user. I note it with the following message:

print("Changed PATH")
print(.pth)
print("to")
print(Sys.getenv("PATH"))

NB this is recommended if one uses linuxbrew or some similar system which leaves additional versions of existing libraries on the library path. By contrast, the packages macOS homebrew provides are generally not otherwise supported and seem to be an OK default, so I do not molest the R library paths on macOS.
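To check the filtering logic without touching the real PATH, the same idea can be tried on a plain string. This is a minimal sketch; `strip_brew` and the example path are made up for illustration, and it assumes a Unix-style ":" separator.

```r
# Hypothetical helper: drop every PATH entry that mentions "brew".
strip_brew <- function(path, sep = ":") {
  parts <- strsplit(path, sep, fixed = TRUE)[[1]]
  paste(parts[!grepl("brew", parts, fixed = TRUE)], collapse = sep)
}

# An invented PATH with one linuxbrew entry at the front.
example_path <- "/home/linuxbrew/.linuxbrew/bin:/usr/bin:/bin"
cleaned <- strip_brew(example_path)
print(cleaned)
```

Matching on the bare substring "brew" is as blunt here as it is in the snippets above; any unrelated directory with "brew" in its name would be dropped too.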

7 Needful packages

Upon setting up a new machine I always run

install.packages(c("blogdown", "renv", "tinytex", "knitr", "devtools", "ggplot2"))
tinytex::install_tinytex()
devtools::install_github("r-lib/hugodown")

That gets the baseline tools I actually use.

Now, details.

7.1 The tidyverse

The tidyverse is a miniature ecosystem within R which has coding conventions and tooling to make certain data analysis easier and prettier, although not necessarily more performant.

install.packages(c("tidyverse"))

7.2 Machine learning

R now plugs into many machine-learning-style algorithms.

For one example, you can run keras, and hence presumably tensorflow, via RStudio’s keras. Other enterprises here include mlr/mlr3, whose pitch runs as follows:

R does not define a standardized interface for its machine-learning algorithms. Therefore, for any non-trivial experiments, you need to write lengthy, tedious and error-prone wrappers to call the different algorithms and unify their respective output.

Additionally you need to implement infrastructure to

resample your models

optimize hyperparameters

select features

cope with pre- and post-processing of data and compare models in a statistically meaningful way.

As this becomes computationally expensive, you might want to parallelize your experiments as well. This often forces users to make crummy trade-offs in their experiments due to time constraints or lacking expert programming skills.

mlr provides this infrastructure so that you can focus on your experiments! The framework provides supervised methods like classification, regression and survival analysis along with their corresponding evaluation and optimization methods, as well as unsupervised methods like clustering. It is written in a way that you can extend it yourself or deviate from the implemented convenience methods and construct your own complex experiments or algorithms.

I think this pitch is more or less the same for caret.

There are also externally developed ML algorithms accessible from R that presumably have consistent interfaces by construction: h2o

7.3 Dataframe alternatives

data.table refines data frames in various ways, including less dunderheaded default constructors, high performance dataset querying, and approximately the same functionality as dplyr, but seems to be faster at e.g. sorting. It has a slightly different syntax to built-in dataframes, usually better syntax. (tutorial, introduction.)

In the tidyverse, tibble provides functional-style table modification.

disk.frame is a friendly gigabyte-scale single machine disk-backed data store, for stuff too big for memory.

8 Reports etc

This is the R killer feature that incorporates all the other killer features. R is extremely good at outputting to many formats, even the ones that the most tedious of office procedures demand.

8.1 Shiny

shiny turns statistical models into interactive web apps. I made a notebook for that.

8.2 Blogging / reports / reproducible research

I recommend R for reproducible research and miscellaneous scientific writing, as it outputs to HTML websites, HTML slides, Word documents, PDFs, PowerPoint slides and probably other stuff too. blogdown, the blogging tool, and knitr, the rendering engine, as mentioned elsewhere, comprise R’s entry into the academic blogoverse.

David Gohel’s officeverse outputs directly to MS Office documents, notably Powerpoint and Word (Manual).

9 Testing

Academics are terrible at testing so I do not know how relevant this is, but ttdo + tinytest looks low-difficulty.

10 Plotting

IMO, the real killer feature of R.

See Plotting in R.

11 Strings

R is distractingly verbose when it comes to text.¹ It has minimal string formatting baked in, although all the needful tools are there, more or less, as not particularly consistent or polished functions. You need to use libraries if you don’t want to have lots of paste invocations everywhere and the usual regexes. A more mnemonic string handling library is stringr.

11.1 Formatting numbers

I can’t help but feel there might be an easier way to represent a proportion as a percentage to two decimal places with the built-in infrastructure?

paste(format(round(x * 100, 2), nsmall = 2), '%')

Depends what I mean by easier, I suppose. There is also sprintf, which does classic C formatting; this is not luxurious by modern standards but works OK.
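For comparison, here is a small sketch of the sprintf route next to the paste route. The value of `x` is made up; note that paste inserts a space before the percent sign by default, whereas in sprintf a literal percent sign is written `%%`.

```r
x <- 0.123456

# The paste/format/round version from above (default separator is a space).
via_paste <- paste(format(round(x * 100, 2), nsmall = 2), '%')

# The sprintf version: two decimal places, then a literal percent sign.
via_sprintf <- sprintf("%.2f%%", x * 100)

print(via_paste)    # "12.35 %"
print(via_sprintf)  # "12.35%"

# sprintf is vectorised over its arguments.
print(sprintf("%.2f%%", c(0.5, 0.25) * 100))  # "50.00%" "25.00%"
```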

11.2 General string mangling

11.3 HTML

I recommend htmltools for generating HTML from structured data. The manual is annoyingly perfunctory and it sends you off to some other web pages which are supposed to tell you more about it but do not. Nonetheless it is pretty easy to work out. Here and here are the least perfunctory introductions. I like using it with withTags, which creates an environment wherein every HTML tag is a function, and behaves in an obvious way. Here is my code to make a custom HTML fragment for every row in a data frame for the ozpowermap project.

library(htmltools)  # withTags and the tag functions
library(stringr)    # str_to_title

# Build one HTML popup fragment (as a string) per row of winners_sf.
popupHtml = apply(winners_sf, 1, function(r){
    as.character(withTags({
        div(
            p(
                h3(str_to_title(paste(r$GivenNm, r$Surname, paste(sep="", '(', r$PartyNm, ')'))))
            ),
            dl(
                dt('Winning Margin'), dd(r$Margin),
                dt('First prefs'), dd(r$FP_OrdinaryVotes),
                dt('Total Votes'), dd(r$TotalVotes)
            )
        )
    }))
})

12 IDEs

See R IDEs.

13 Teaching

There are various tools and tutorials

One can also simply use one of the pre-made courses.

13.1 R for Pythonistas

Many things about R are surprising to me, coming as I do most recently from Python. I’m documenting my perpetual surprise here, in order that it may save someone else the inconvenience of going to all that trouble to be personally surprised.

13.2 Fun introductions

Stackexchange meta tutorial list.

Rstudio.com cheat sheets
Monash University’s bioinformatics-focused intro.
CSIRO’s introduction
Drew Conway’s strata bootcamp
stackoverflow
R cookbook
Jeremy Howard of Kaggle gives a virtuous and improving presentation
Edwin de Jonge and Mark van der Loo, Data cleaning with R
Bob Rudis: Using R to get data out of word documents

14 Saving and loading

Save my workspace (i.e. current scope and variable definitions) to ./.RData

> save.image()

Load my workspace from ./.RData

> rm(list=ls())  # clear current defs
> load(".RData") # actually load

15 Interoperation with other languages

See R interoperation.

16 Debugging

16.1 Introspection

Here is a list of introspection hacks:

str(obj)
typeof(obj)
class(obj)
sapply(obj, class)
sapply(obj, attributes)
attributes(obj)
names(obj)
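To see what these report, here is a sketch on a small made-up data frame; the column names are arbitrary.

```r
# A toy object to poke at.
obj <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

str(obj)                          # compact structural summary
print(typeof(obj))                # storage type: a data frame is a list underneath
print(class(obj))                 # the S3 class used for method dispatch
col_classes <- sapply(obj, class) # class of each column
print(col_classes)
print(names(obj))                 # column names
print(names(attributes(obj)))     # which attributes the object carries
```

The gap between typeof ("list") and class ("data.frame") is a fair summary of how R's object models are layered on top of the base types.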

16.2 Inspecting frames post hoc

Use recover. In fact, pro-tip, you can invoke it in 3rd party code gracefully:

options(error = utils::recover)

16.3 Basic interactive debugger

There is at least one, called browser.

17 Tips and gotchas

17.1 Opaque imports

Importing an R package, unlike importing a python module, brings in random cruft that may have little to do with the names of the thing you just imported. That is, IMO, poor planning, although history indicates that most language designers don’t agree with me on that:

> npreg
Error: object 'npreg' not found
> library("np")
Nonparametric Kernel Methods for Mixed Datatypes (version 0.40-4)
> npreg
function (bws, …) #etc

Further, data structures in R can, and are intended to, provide first-class scopes for looking up names. In your explorations into data, you are as apt to bring the names of columns in a data set into scope as the names of functions in a library. This is kind of useful, although it leads to bizarre and unhelpful errors, so watch it.
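A sketch of both the convenience and the hazard, using base R's with(). The data and variable names are invented. When a column goes missing, the lookup silently falls through to a global of the same name instead of failing.

```r
weight <- c(1000, 2000)  # a global that happens to share a column name

df <- data.frame(height = c(1.5, 2.0), weight = c(60, 80))

# Column names are in scope inside with(): this uses df$weight and df$height.
from_columns <- with(df, weight / height^2)
print(from_columns)

# Remove the column and the same expression still runs, now silently
# picking up the global `weight` instead of raising an error.
df$weight <- NULL
from_global <- with(df, weight / height^2)
print(from_global)
```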

17.2 No scalar types…

A float is a float vector of size 1:

> 5
[1] 5
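A few checks make the point concrete; this is just a sketch of how a bare number behaves like any other vector.

```r
x <- 5

print(is.vector(x))         # TRUE: a bare number is already a vector
print(length(x))            # 1
print(typeof(x))            # "double"
print(identical(x, c(5)))   # TRUE: wrapping it in c() changes nothing
print(x[1])                 # indexing works as for any vector
```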

17.3 …yet verbose vector literal syntax

You make vectors by using a call to a function called c. Witness:

> c('a', 'b', 'c', 'd')
[1] "a" "b" "c" "d"

If you type a literal vector in though, it will throw an error:

> 'a', 'b', 'c', 'd'
Error: unexpected ',' in "'a',"

I’m sure there are Reasons for this; it’s just that they are reasons that I don’t care about.

17.4 Subsetting hell

To subset a list based object:

x[1]

to subset and optionally downcast the same:

x[[1]]

to subset a matrix-based object:

x[1, , drop=FALSE]

to subset and optionally downcast the same:

x[1, ]
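A small sketch of how these forms behave on a toy list and a toy matrix (both invented for illustration). Note that a single index on a matrix is a linear index into its elements, not a row selector.

```r
lst <- list(a = 1:3, b = "text")
print(class(lst[1]))     # "list": single brackets keep the container
print(class(lst[[1]]))   # "integer": double brackets pull out the element

m <- matrix(1:6, nrow = 2)
kept <- m[1, , drop = FALSE]  # still a 1 x 3 matrix
dropped <- m[1, ]             # downcast to a plain vector
print(dim(kept))
print(dim(dropped))           # NULL, since the dimensions are gone
print(m[1])                   # a single element, by linear index
```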

18 Which files do I need to keep?

Here is a good .gitignore file for R which keeps only what you need.

19 Tips

19.1 Quantile transform

quantile.t = function(v) ecdf(v)(v)
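As a quick sanity check, here is that one-liner applied to a made-up vector. Each value is mapped to the fraction of observations less than or equal to it, so tied values share a result.

```r
quantile.t = function(v) ecdf(v)(v)

v <- c(10, 30, 20, 40)
q <- quantile.t(v)
print(q)  # 0.25 0.75 0.50 1.00

# Ties map to the same value.
q_ties <- quantile.t(c(1, 1, 2, 3))
print(q_ties)  # 0.50 0.50 0.75 1.00
```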

19.2 package utils in defaultPackages was not found

Upon a recent restore from backup I found my R would not work any more. The errors looked like this:

During startup - Warning messages:
1: package ‘utils’ in options("defaultPackages") was not found
2: package ‘stats’ in options("defaultPackages") was not found

Weird, eh?

Try this

library("utils")
shared object '‘utils.so’' already loaded
Error: package or namespace load failed for ‘utils’:
 .onLoad failed in loadNamespace() for 'utils', details:
  call: options(op.utils[toset])
  error: invalid value for 'editor'
 >R_ReplConsole(): before "for(;;)" {main.c}

Ah, there is a clue. Turns out that R needs the shell $EDITOR variable to point to a really-existing editor or it freaks out, even if it seems never to invoke said editor.

EDITOR=code-insiders R

should fix it.

20 References

Bischl, Lang, Kotthoff, et al. 2016. “Mlr: Machine Learning in R.” Journal of Machine Learning Research.

Kuhn. 2008. “Building Predictive Models in R Using the Caret Package.” Journal of Statistical Software.

Kuhn, and Johnson. 2013. Applied Predictive Modeling.

Venables. n.d. An Introduction to R.

Footnotes

¹ Much like me; thanks for bearing with my prolixity.
