The Living Thing / Notebooks : Statistical learning theory, risk bounds, hopefully-uniform convergence rates etc

Another area I am taking notes on because I know nothing about it; Please ignore everything I say here accordingly.

Given some amount of noisy data, how complex a model can I learn before I’m going to be failing to generalise to new data? If I can answer this question a priori, I can fit a complex model with some messy hyperparameter and choose that hyperparameter without doing boring cross-validation.

Rademacher complexity, Gaussian complexity, Vapnik-Chernovenkis dimension, bias-variance tradeoff.

Machine learning people always talk about this in terms of classification, which is what VC-dimension can be directly applied to. Apparently also regression?

Modern results seem to avoid some of this by appealing to concentration inequalities. Also apparently stability of estimators gets you somewhere?

AFAICT nice statistical learning results apply to very particular classes of models; e.g. SVMs and kernel regressions have built-in sample complexity results under the right loss function, but the ground gets rapidly shakier as you generalise.

Percy Liang’s course notes give a rapid overview: CS229T/STAT231: Statistical Learning Theory (Winter 2014).

See also function approximation, and or model selection for the statisticians’ approach, which is more about working out which model our data can support once you’ve fit it, frequently by appealing to asymptotic large-sample results. In learning theory it seem one always cares about finite sample bounds. Which is reasonable. and the attractiveness of getting them without, e.g. tedious and computationally expensive cross validation, bootstrapping is understandable.

I would like to understand the relationship between the different kinds of convergence rate results that we get here.

I would like just one or two results that applied to dependent data.

VC dimension

Rademacher complexity


Also apparently stability of estimators gets you somewhere?

Non-I.I.D data


Overview in McSS11b:

Yu [22] sets forth many of the uniform ergodic theorems that are needed to derive generalization error bounds for stochastic processes. Meir [11] is one of the first papers to construct risk bounds for time series. […]

More recently, others have provided PAC results for non-IID data. Steinwart and Christmann [20] prove an oracle inequality for generic regularized empirical risk minimization algorithms learning from α-mixing processes, a fairly general sort of weak serial dependence, getting learning rates for least-squares support vector machines (SVMs) close to the optimal IID rates. Mohri and Rostamizadeh [13] prove stability-based generalization bounds when the data are stationary and φ-mixing or β-mixing, strictly generalizing IID results and applying to all stable learning algorithms. […] Karandikar and Vidyasagar [8] show that if an algorithm is “subadditive” and yields a predictor whose risk can be upper bounded when the data are IID, then the same algorithm yields predictors whose risk can be bounded if data are β-mixing. They use this result to derive generalization error bounds in terms of the learning rates for IID data and the β-mixing coefficients.

See also Geer02.

Stein’s method

I just ran into this via Reinert (Rein00) and it looks handy. TBD.

Misc, to file

HaIb90 and GoNu90, give minimax convergence rates in Sobolev class regression.

This looks like fun (DaMY16):

We begin with the setting of multiclass categorization (zero/one loss). We prove that in this case learnability is equivalent to compression of logarithmic sample size, and that uniform convergence implies compression of constant size. We then consider Vapnik’s general learning setting: we show that in order to extend the compressibility-learnability equivalence to this case, it is necessary to consider an approximate variant of compression.


Abbeel, P., Koller, D., & Ng, A. Y.(2006) Learning Factor Graphs in Polynomial Time and Sample Complexity. J. Mach. Learn. Res., 7, 1743–1788.
Banerjee, A., Chen, S., Fazayeli, F., & Sivakumar, V. (2014) Estimation with Norm Regularization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27 (pp. 1556–1564). Curran Associates, Inc.
Barbour, A. D., & Chen, L. H. Y.(2005) An Introduction to Stein’s Method. . World Scientific
Barron, A. R., Huang, C., Li, J. Q., & Luo, X. (2008) MDL, penalized likelihood, and statistical risk. In Information Theory Workshop, 2008. ITW’08. IEEE (pp. 247–257). IEEE DOI.
Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D.(2006) Convexity, Classification, and Risk Bounds. Journal of the American Statistical Association, 101(473), 138–156. DOI.
Bartlett, P. L., & Mendelson, S. (2002) Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research, 3(Nov), 463–482.
Bellec, P. C., & Tsybakov, A. B.(2016) Bounds on the prediction error of penalized least squares estimators with convex penalty. arXiv:1609.06675 [Math, Stat].
Bertin, K., Pennec, E. L., & Rivoirard, V. (2011) Adaptive Dantzig density estimation. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 47(1), 43–74. DOI.
Birgé, L. (2008) Model selection for density estimation with L2-loss. arXiv:0808.1416 [Math, Stat].
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K.(1989) Learnability and the Vapnik-Chervonenkis Dimension. J. ACM, 36(4), 929–965. DOI.
Bousquet, O., Boucheron, S., & Lugosi, G. (2004) Introduction to Statistical Learning Theory. In O. Bousquet, U. von Luxburg, & G. Rätsch (Eds.), Advanced Lectures on Machine Learning (pp. 169–207). Springer Berlin Heidelberg
Bunea, F., Tsybakov, A. B., & Wegkamp, M. H.(2007a) Sparse Density Estimation with ℓ1 Penalties. In N. H. Bshouty & C. Gentile (Eds.), Learning Theory (pp. 530–543). Springer Berlin Heidelberg DOI.
Bunea, F., Tsybakov, A., & Wegkamp, M. (2007b) Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1, 169–194. DOI.
Carmi, A. Y.(2013) Compressive system identification: Sequential methods and entropy bounds. Digital Signal Processing, 23(3), 751–770. DOI.
Daneshmand, H., Gomez-Rodriguez, M., Song, L., & Schölkopf, B. (2014) Estimating Diffusion Network Structures: Recovery Conditions, Sample Complexity & Soft-thresholding Algorithm. In ICML.
David, O., Moran, S., & Yehudayoff, A. (2016) On statistical learning via the lens of compression. arXiv:1610.03592 [Cs, Math].
Dinh, V., Ho, L. S. T., Nguyen, D., & Nguyen, B. T.(2016) Fast learning rates with heavy-tailed losses. arXiv:1609.09481 [Cs, Stat].
D’Souza, A. A.(2004) Towards Tractable Parameter-free Statistical Learning. . University of Southern California, Los Angeles, CA, USA
Flammia, S. T., Gross, D., Liu, Y.-K., & Eisert, J. (2012) Quantum Tomography via Compressed Sensing: Error Bounds, Sample Complexity, and Efficient Estimators. New Journal of Physics, 14(9), 95022. DOI.
Foygel, R., & Srebro, N. (2011) Fast-rate and optimistic-rate error bounds for L1-regularized regression. arXiv:1108.0373 [Math, Stat].
Gnecco, G., & Sanguineti, M. (2008) Approximation Error Bounds via Rademacher’s Complexity. Applied Mathematical Sciences, 2(4), 153–176.
Golubev, G. K., & Nussbaum, M. (1990) A Risk Bound in Sobolev Class Regression. The Annals of Statistics, 18(2), 758–778. DOI.
Goodfellow, I., Papernot, N., & McDaniel, P. (2016) cleverhans v01: an adversarial machine learning library. arXiv:1610.00768 [Cs, Stat].
Hansen, N. R., Reynaud-Bouret, P., & Rivoirard, V. (2015) Lasso and probabilistic inequalities for multivariate point processes. Bernoulli, 21(1), 83–143. DOI.
Hasminskii, R., & Ibragimov, I. (1990) On Density Estimation in the View of Kolmogorov’s Ideas in Approximation Theory. The Annals of Statistics, 18(3), 999–1010. DOI.
Hastie, T. J., Tibshirani, Rob, & Wainwright, M. J.(2015) Statistical Learning with Sparsity: The Lasso and Generalizations. . Boca Raton: Chapman and Hall/CRC
Hastie, T., Tibshirani, R., & Friedman, J. (2009) The elements of statistical learning: data mining, inference and prediction. . Springer
Homrighausen, D., & McDonald, D. J.(2016) Risk estimation for high-dimensional lasso regression. arXiv:1602.01522 [Stat].
Huang, C., Cheang, G. L. H., & Barron, A. R.(2008) Risk of penalized least squares, greedy selection and l1 penalization for flexible function libraries.
Kloft, M., Rückert, U., & Bartlett, P. L.(2010) A Unifying View of Multiple Kernel Learning. In J. L. Balcázar, F. Bonchi, A. Gionis, & M. Sebag (Eds.), Machine Learning and Knowledge Discovery in Databases (pp. 66–81). Springer Berlin Heidelberg DOI.
Krishnapuram, B., Carin, L., Figueiredo, M. A. T., & Hartemink, A. J.(2005) Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds. IEEE Trans. Pattern Anal. Mach. Intell., 27(6), 957–968. DOI.
Lederer, J., & van de Geer, S. (2014) New concentration inequalities for suprema of empirical processes. Bernoulli, 20(4), 2020–2038. DOI.
Liang, P. (2014) CS229T/STAT231: Statistical Learning Theory (Winter 2014).
McDonald, D. J., Shalizi, C. R., & Schervish, M. (2011a) Estimated VC dimension for risk bounds. Eprint arXiv:1111.3404.
McDonald, D. J., Shalizi, C. R., & Schervish, M. (2011b) Generalization error bounds for stationary autoregressive models. arXiv:1103.0942 [Cs, Stat].
McDonald, D. J., Shalizi, C. R., & Schervish, M. (2011c) Risk bounds for time series without strong mixing. arXiv:1106.0730 [Cs, Stat].
Mei, S., Bai, Y., & Montanari, A. (2016) The Landscape of Empirical Risk for Non-convex Losses. arXiv:1607.06534 [Stat].
Meng, Q., Wang, Y., Chen, W., Wang, T., Ma, Z.-M., & Liu, T.-Y. (2016) Generalization Error Bounds for Optimization Algorithms via Stability. In arXiv:1609.08397 [stat] (Vol. 10, pp. 441–474).
Natarajan, B. K.(1989) On learning sets and functions. Machine Learning, 4(1), 67–97. DOI.
Reid, M. D., & Williamson, R. C.(2011) Information, Divergence and Risk for Binary Experiments. Journal of Machine Learning Research, 12(Mar), 731–817.
Reinert, G. (2000) Stein’s Method for Epidemic Processes. In Complex Stochastic Systems (pp. 235–275). Boca Raton: Chapman & Hall/CRC
Reynaud-Bouret, P. (2003) Adaptive estimation of the intensity of inhomogeneous Poisson processes via concentration inequalities. Probability Theory and Related Fields, 126(1). DOI.
Reynaud-Bouret, P., & Schbath, S. (2010) Adaptive estimation for Hawkes processes; application to genome analysis. The Annals of Statistics, 38(5), 2781–2822. DOI.
Schölkopf, B., & Smola, A. J.(2002) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. . MIT Press
Schölkopf, B., & Smola, A. J.(2003) A Short Introduction to Learning with Kernels. In S. Mendelson & A. J. Smola (Eds.), Advanced Lectures on Machine Learning (pp. 41–64). Springer Berlin Heidelberg DOI.
Tropp, J. A.(2015) An Introduction to Matrix Concentration Inequalities. arXiv:1501.01571 [Cs, Math, Stat].
Tulabandhula, T., & Rudin, C. (2014) Generalization bounds for learning with linear, polygonal, quadratic and conic side knowledge. arXiv Preprint arXiv:1405.7764.
Vainsencher, D., Mannor, S., & Bruckstein, A. M.(2011) The Sample Complexity of Dictionary Learning. Journal of Machine Learning Research, 12(Nov), 3259–3281.
van de Geer, S. (2002) On Hoeffdoing’s inequality for dependent random variables. In Empirical Process Techniques for Dependent Data. Birkhhäuser
van de Geer, S. (2014) Worst possible sub-directions in high-dimensional models. In arXiv:1403.7023 [math, stat] (Vol. 131).
van de Geer, S. A.(2008) High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2), 614–645. DOI.
Vapnik, V. (2010) The Nature of Statistical Learning Theory. (Softcover reprint of hardcover 2nd ed. 2000.). Springer
Vapnik, V., & Chervonenkis, A. (1971) On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Probability & Its Applications, 16(2), 264–280. DOI.
Vapnik, V., Levin, E., & Cun, Y. L.(1994) Measuring the VC-Dimension of a Learning Machine. Neural Computation, 6(5), 851–876. DOI.
von Luxburg, U., & Schölkopf, B. (2008) Statistical Learning Theory: Models, Concepts, and Results. arXiv:0810.4752 [Math, Stat].
Wu, Q., & Zhou, D.-X. (2008) Learning with sample dependent hypothesis spaces. Computers & Mathematics with Applications, 56(11), 2896–2907. DOI.