The Living Thing / Notebooks :

Statistical learning theory

Eventually including structural risk minimisation, risk bounds, hopefully-uniform convergence rates, VC-dimension, generalisation and stability framings

Another area I am taking notes on because I know nothing about it; Please ignore everything I say here accordingly.

Given some amount of noisy data, how complex a model can I learn before I’m going to be failing to generalise to new data? If I can answer this question a priori, I can fit a complex model with some messy hyperparameter and choose that hyperparameter without doing boring cross-validation.

Depending on your training you might think about this in terms of Rademacher complexity, Gaussian complexity, Vapnik-Chernovenkis dimension, bias-variance tradeoffs…

Modern results seem to appeal to concentration inequalities. Also apparently stability of estimators gets you somewhere?

AFAICT, nice statistical learning results apply to very particular classes of models; e.g. SVMs and kernel regressions have built-in sample complexity results under the right loss function, but the ground gets rapidly shakier as you generalise.


Percy Liang’s course notes give a rapid overview: CS229T/STAT231: Statistical Learning Theory (Winter 2014).

See also function approximation, and or model selection for the statisticians’ approach, which is more about working out which model our data can support once you’ve fit it, frequently by appealing to asymptotic large-sample results. In learning theory it seem one always cares about finite sample bounds. Which is reasonable. and the attractiveness of getting them without, e.g. tedious and computationally expensive cross validation, bootstrapping is understandable.

I would like to understand the relationship between the different kinds of convergence rate results that we get here, and different learning theories.

I would like just one or two results that applied to dependent data.

VC dimension

Rademacher complexity


Also apparently stability of estimators gets you somewhere?


Probably approximately correct, you mean?

Non-I.I.D data

See learning for dependent data.

Stein’s method

I just ran into this via Reinert (Rein00) and it looks handy. TBD.

Misc, to file

HaIb90 and GoNu90, give minimax convergence rates in Sobolev class regression.

This looks like fun (DaMY16):

We begin with the setting of multiclass categorization (zero/one loss). We prove that in this case learnability is equivalent to compression of logarithmic sample size, and that uniform convergence implies compression of constant size. We then consider Vapnik’s general learning setting: we show that in order to extend the compressibility-learnability equivalence to this case, it is necessary to consider an approximate variant of compression.