
Statistical learning theory

Eventually including structural risk minimisation, risk bounds, hopefully-uniform convergence rates, VC-dimension, generalisation and stability framings

Another area I am taking notes on because I know nothing about it; please ignore everything I say here accordingly.

Given some amount of noisy data, how complex a model can I learn before I start failing to generalise to new data? If I can answer this question a priori, I can fit a complex model with some messy hyperparameter and choose that hyperparameter without doing boring cross-validation.
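Schematically, this is the structural risk minimisation recipe: take a nested family of model classes \mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots, find a high-probability bound on the gap between empirical and true risk for each class, and minimise the penalised empirical risk. As a generic template (not any particular theorem),

    \hat{f} = \operatorname*{arg\,min}_{k,\; f \in \mathcal{F}_k} \left\{ \hat{R}_n(f) + \operatorname{pen}(\mathcal{F}_k, n, \delta) \right\},
    \quad \text{where} \quad
    \Pr\left[ \sup_{f \in \mathcal{F}_k} \bigl( R(f) - \hat{R}_n(f) \bigr) > \operatorname{pen}(\mathcal{F}_k, n, \delta) \right] \le \delta.

Here R is the population risk, \hat{R}_n the empirical risk on n samples, and the penalty term is where the complexity measures below earn their keep.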

Depending on your training, you might think about this in terms of Rademacher complexity, Gaussian complexity, Vapnik-Chervonenkis dimension, bias-variance tradeoffs…

Modern results seem to appeal to concentration inequalities. Apparently stability of estimators also gets you somewhere; see below.
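For concreteness, the workhorse here is Hoeffding's inequality: for independent random variables X_1, \dots, X_n with X_i \in [a_i, b_i],

    \Pr\left[ \left| \frac{1}{n} \sum_{i=1}^{n} \bigl( X_i - \mathbb{E} X_i \bigr) \right| \ge t \right]
    \le 2 \exp\left( - \frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right).

Applied to a bounded loss on an i.i.d. sample this controls the deviation of the empirical risk of one fixed predictor; making the bound uniform over a whole class of predictors is what the complexity measures are for.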

AFAICT, nice statistical learning results apply to very particular classes of models; e.g. SVMs and kernel regressions have built-in sample complexity results under the right loss function, but the ground gets rapidly shakier as you generalise beyond those settings.

Percy Liang’s course notes (Lian14) give a rapid overview: CS229T/STAT231: Statistical Learning Theory (Winter 2014).

See also function approximation, and/or model selection for the statisticians’ approach, which is more about working out which model your data can support once you’ve fit it, frequently by appealing to asymptotic large-sample results. In learning theory one seems always to care about finite-sample bounds, which is reasonable, and the attraction of getting them without, e.g., tedious and computationally expensive cross-validation or bootstrapping is understandable.
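To make that contrast concrete, here is a toy sketch (Python with scikit-learn; nothing here is specific to the theory above, and the dataset and model are arbitrary choices) that estimates a classifier's error by 5-fold cross-validation and, separately, wraps a Hoeffding-style confidence interval around a held-out error estimate — the crudest finite-sample bound available, valid because the test set is never touched during fitting:

    # Contrast an empirical risk estimate (k-fold CV) with a simple
    # finite-sample bound (Hoeffding interval around the hold-out error).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    clf = SVC(kernel="rbf", C=1.0)

    # Empirical route: refit 5 times, average the validation error.
    cv_err = 1 - cross_val_score(clf, X_train, y_train, cv=5).mean()

    # Analytic route: one fit, then a Hoeffding bound on the hold-out error.
    # With probability >= 1 - delta, |hold-out error - true risk| <= eps.
    clf.fit(X_train, y_train)
    test_err = 1 - clf.score(X_test, y_test)
    delta, m = 0.05, len(y_test)
    eps = np.sqrt(np.log(2 / delta) / (2 * m))

    print(f"5-fold CV error estimate: {cv_err:.3f}")
    print(f"hold-out error {test_err:.3f} +/- {eps:.3f} (Hoeffding, 95%)")

The hold-out interval costs one fit where CV costs k, but it still spends data on a test set; the bounds this notebook is about aim to dispense with held-out data altogether by controlling the gap uniformly over the hypothesis class.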

I would like to understand the relationship between the different kinds of convergence-rate results we get here and the different learning theories.

I would like just one or two results that applied to dependent data.

VC dimension
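Standard definitions, mostly so I don't have to look them up again: a class \mathcal{H} of binary classifiers shatters points x_1, \dots, x_m if it realises all 2^m possible labellings of them, and the VC dimension is the size of the largest shattered set,

    \operatorname{VC}(\mathcal{H}) = \max\bigl\{ m : \exists\, x_1, \dots, x_m \text{ with } \bigl|\{ (h(x_1), \dots, h(x_m)) : h \in \mathcal{H} \}\bigr| = 2^m \bigr\}.

The payoff is uniform convergence at rate (suppressing constants and logarithmic fiddles, which differ between references)

    \sup_{h \in \mathcal{H}} \bigl| R(h) - \hat{R}_n(h) \bigr|
    = O\!\left( \sqrt{ \frac{ d \log(n/d) + \log(1/\delta) }{ n } } \right)

with probability at least 1 - \delta, where d = \operatorname{VC}(\mathcal{H}). See VaCh71, BEHW89; VaLC94 on estimating d empirically.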

Rademacher complexity
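Again for reference (following BaMe02, roughly): the empirical Rademacher complexity of a function class \mathcal{F} on a sample x_1, \dots, x_n is

    \hat{\mathfrak{R}}_n(\mathcal{F}) = \mathbb{E}_{\sigma} \left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right],
    \qquad \sigma_i \ \text{i.i.d. uniform on } \{-1, +1\},

i.e. how well the class can correlate with pure noise on the data at hand. One standard bound (constants as I recall them from the textbook treatments, so verify before leaning on them): for f taking values in [0, 1], with probability at least 1 - \delta, simultaneously for all f \in \mathcal{F},

    R(f) \le \hat{R}_n(f) + 2\, \hat{\mathfrak{R}}_n(\mathcal{F}) + 3 \sqrt{ \frac{ \log(2/\delta) }{ 2n } }.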

Stability-based

Apparently stability of estimators gets you generalisation bounds directly, without going through the capacity of the hypothesis class; MWCW16, for example, derive bounds for optimisation algorithms this way.
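A precise version is uniform stability: an algorithm A mapping a sample S of size n to a predictor A_S is \beta-uniformly stable if deleting any single training point changes the loss at any test point by at most \beta,

    \sup_{S,\, i,\, z} \bigl| \ell(A_S, z) - \ell(A_{S^{\setminus i}}, z) \bigr| \le \beta.

For a loss bounded by M, the classical result then gives (constants quoted from memory, so check them), with probability at least 1 - \delta,

    R(A_S) \le \hat{R}_n(A_S) + 2\beta + (4 n \beta + M) \sqrt{ \frac{ \log(1/\delta) }{ 2n } },

which is non-trivial whenever \beta shrinks faster than 1/\sqrt{n}.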

PAC-learning

Probably approximately correct, you mean?
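Yes. For the record: a hypothesis class \mathcal{H} is (agnostically) PAC-learnable if there is a learner A and a sample-complexity function m(\epsilon, \delta), polynomial in 1/\epsilon and 1/\delta, such that for every distribution over examples and every \epsilon, \delta \in (0, 1), given n \ge m(\epsilon, \delta) i.i.d. samples A returns \hat{h} satisfying

    \Pr\left[ R(\hat{h}) \le \inf_{h \in \mathcal{H}} R(h) + \epsilon \right] \ge 1 - \delta.

The realisable version drops the \inf and simply demands R(\hat{h}) \le \epsilon. See BEHW89, Nata89.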

Non-I.I.D. data

See learning for dependent data.

Stein’s method

I just ran into this via Reinert (Rein00) and it looks handy. TBD.

Misc, to file

HaIb90 and GoNu90 give minimax convergence rates in Sobolev-class regression.
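For orientation, the headline exponent in that one-dimensional setting: if the regression function has Sobolev smoothness s, the minimax mean-squared-error rate is the classical

    \inf_{\hat{f}} \sup_{f \in W^s} \mathbb{E} \, \| \hat{f} - f \|_2^2 \asymp n^{-2s/(2s+1)},

so smoother classes buy faster rates but never the parametric 1/n; the cited papers give the precise statements, and this is just the cartoon.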

This looks like fun (DaMY16):

We begin with the setting of multiclass categorization (zero/one loss). We prove that in this case learnability is equivalent to compression of logarithmic sample size, and that uniform convergence implies compression of constant size. We then consider Vapnik’s general learning setting: we show that in order to extend the compressibility-learnability equivalence to this case, it is necessary to consider an approximate variant of compression.

Refs

AbKN06
Abbeel, P., Koller, D., & Ng, A. Y.(2006) Learning Factor Graphs in Polynomial Time and Sample Complexity. J. Mach. Learn. Res., 7, 1743–1788.
BCFS14
Banerjee, A., Chen, S., Fazayeli, F., & Sivakumar, V. (2014) Estimation with Norm Regularization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27 (pp. 1556–1564). Curran Associates, Inc.
BaCh05
Barbour, A. D., & Chen, L. H. Y.(2005) An Introduction to Stein’s Method. World Scientific
BHLL08
Barron, A. R., Huang, C., Li, J. Q., & Luo, X. (2008) MDL, penalized likelihood, and statistical risk. In Information Theory Workshop, 2008. ITW’08. IEEE (pp. 247–257). IEEE DOI.
BaJM06
Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D.(2006) Convexity, Classification, and Risk Bounds. Journal of the American Statistical Association, 101(473), 138–156. DOI.
BaMe02
Bartlett, P. L., & Mendelson, S. (2002) Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research, 3(Nov), 463–482.
BeTs16
Bellec, P. C., & Tsybakov, A. B.(2016) Bounds on the prediction error of penalized least squares estimators with convex penalty. arXiv:1609.06675 [Math, Stat].
BePR11
Bertin, K., Pennec, E. L., & Rivoirard, V. (2011) Adaptive Dantzig density estimation. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 47(1), 43–74. DOI.
Birg08
Birgé, L. (2008) Model selection for density estimation with L2-loss. arXiv:0808.1416 [Math, Stat].
BEHW89
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K.(1989) Learnability and the Vapnik-Chervonenkis Dimension. J. ACM, 36(4), 929–965. DOI.
BoBL04
Bousquet, O., Boucheron, S., & Lugosi, G. (2004) Introduction to Statistical Learning Theory. In O. Bousquet, U. von Luxburg, & G. Rätsch (Eds.), Advanced Lectures on Machine Learning (pp. 169–207). Springer Berlin Heidelberg
BuTW07a
Bunea, F., Tsybakov, A. B., & Wegkamp, M. H.(2007a) Sparse Density Estimation with ℓ1 Penalties. In N. H. Bshouty & C. Gentile (Eds.), Learning Theory (pp. 530–543). Springer Berlin Heidelberg DOI.
BuTW07b
Bunea, F., Tsybakov, A., & Wegkamp, M. (2007b) Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1, 169–194. DOI.
Carm13
Carmi, A. Y.(2013) Compressive system identification: Sequential methods and entropy bounds. Digital Signal Processing, 23(3), 751–770. DOI.
DGSS14
Daneshmand, H., Gomez-Rodriguez, M., Song, L., & Schölkopf, B. (2014) Estimating Diffusion Network Structures: Recovery Conditions, Sample Complexity & Soft-thresholding Algorithm. In ICML.
DaMY16
David, O., Moran, S., & Yehudayoff, A. (2016) On statistical learning via the lens of compression. arXiv:1610.03592 [Cs, Math].
DHNN16
Dinh, V., Ho, L. S. T., Nguyen, D., & Nguyen, B. T.(2016) Fast learning rates with heavy-tailed losses. arXiv:1609.09481 [Cs, Stat].
Dsou04
D’Souza, A. A.(2004) Towards Tractable Parameter-free Statistical Learning. University of Southern California, Los Angeles, CA, USA
FGLE12
Flammia, S. T., Gross, D., Liu, Y.-K., & Eisert, J. (2012) Quantum Tomography via Compressed Sensing: Error Bounds, Sample Complexity, and Efficient Estimators. New Journal of Physics, 14(9), 95022. DOI.
FoSr11
Foygel, R., & Srebro, N. (2011) Fast-rate and optimistic-rate error bounds for L1-regularized regression. arXiv:1108.0373 [Math, Stat].
GnSa08
Gnecco, G., & Sanguineti, M. (2008) Approximation Error Bounds via Rademacher’s Complexity. Applied Mathematical Sciences, 2(4), 153–176.
GoNu90
Golubev, G. K., & Nussbaum, M. (1990) A Risk Bound in Sobolev Class Regression. The Annals of Statistics, 18(2), 758–778. DOI.
GoPM16
Goodfellow, I., Papernot, N., & McDaniel, P. (2016) cleverhans v0.1: an adversarial machine learning library. arXiv:1610.00768 [Cs, Stat].
HaRR15
Hansen, N. R., Reynaud-Bouret, P., & Rivoirard, V. (2015) Lasso and probabilistic inequalities for multivariate point processes. Bernoulli, 21(1), 83–143. DOI.
HaIb90
Hasminskii, R., & Ibragimov, I. (1990) On Density Estimation in the View of Kolmogorov’s Ideas in Approximation Theory. The Annals of Statistics, 18(3), 999–1010. DOI.
HaTW15
Hastie, T. J., Tibshirani, R., & Wainwright, M. J.(2015) Statistical Learning with Sparsity: The Lasso and Generalizations. Boca Raton: Chapman and Hall/CRC
HaTF09
Hastie, T., Tibshirani, R., & Friedman, J. (2009) The elements of statistical learning: data mining, inference and prediction. Springer
HoMc16
Homrighausen, D., & McDonald, D. J.(2016) Risk estimation for high-dimensional lasso regression. arXiv:1602.01522 [Stat].
HuCB08
Huang, C., Cheang, G. L. H., & Barron, A. R.(2008) Risk of penalized least squares, greedy selection and l1 penalization for flexible function libraries.
KlRB10
Kloft, M., Rückert, U., & Bartlett, P. L.(2010) A Unifying View of Multiple Kernel Learning. In J. L. Balcázar, F. Bonchi, A. Gionis, & M. Sebag (Eds.), Machine Learning and Knowledge Discovery in Databases (pp. 66–81). Springer Berlin Heidelberg DOI.
KCFH05
Krishnapuram, B., Carin, L., Figueiredo, M. A. T., & Hartemink, A. J.(2005) Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds. IEEE Trans. Pattern Anal. Mach. Intell., 27(6), 957–968. DOI.
LeGe14
Lederer, J., & van de Geer, S. (2014) New concentration inequalities for suprema of empirical processes. Bernoulli, 20(4), 2020–2038. DOI.
Lian14
Liang, P. (2014) CS229T/STAT231: Statistical Learning Theory (Winter 2014).
McSS11a
McDonald, D. J., Shalizi, C. R., & Schervish, M. (2011a) Estimated VC dimension for risk bounds. Eprint arXiv:1111.3404.
McSS11b
McDonald, D. J., Shalizi, C. R., & Schervish, M. (2011b) Generalization error bounds for stationary autoregressive models. arXiv:1103.0942 [Cs, Stat].
McSS11c
McDonald, D. J., Shalizi, C. R., & Schervish, M. (2011c) Risk bounds for time series without strong mixing. arXiv:1106.0730 [Cs, Stat].
MeBM16
Mei, S., Bai, Y., & Montanari, A. (2016) The Landscape of Empirical Risk for Non-convex Losses. arXiv:1607.06534 [Stat].
MWCW16
Meng, Q., Wang, Y., Chen, W., Wang, T., Ma, Z.-M., & Liu, T.-Y. (2016) Generalization Error Bounds for Optimization Algorithms via Stability. In arXiv:1609.08397 [stat] (Vol. 10, pp. 441–474).
Nata89
Natarajan, B. K.(1989) On learning sets and functions. Machine Learning, 4(1), 67–97. DOI.
ReWi11
Reid, M. D., & Williamson, R. C.(2011) Information, Divergence and Risk for Binary Experiments. Journal of Machine Learning Research, 12(Mar), 731–817.
Rein00
Reinert, G. (2000) Stein’s Method for Epidemic Processes. In Complex Stochastic Systems (pp. 235–275). Boca Raton: Chapman & Hall/CRC
Reyn03
Reynaud-Bouret, P. (2003) Adaptive estimation of the intensity of inhomogeneous Poisson processes via concentration inequalities. Probability Theory and Related Fields, 126(1). DOI.
ReSc10
Reynaud-Bouret, P., & Schbath, S. (2010) Adaptive estimation for Hawkes processes; application to genome analysis. The Annals of Statistics, 38(5), 2781–2822. DOI.
ScSm02
Schölkopf, B., & Smola, A. J.(2002) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press
ScSm03
Schölkopf, B., & Smola, A. J.(2003) A Short Introduction to Learning with Kernels. In S. Mendelson & A. J. Smola (Eds.), Advanced Lectures on Machine Learning (pp. 41–64). Springer Berlin Heidelberg DOI.
Trop15
Tropp, J. A.(2015) An Introduction to Matrix Concentration Inequalities. arXiv:1501.01571 [Cs, Math, Stat].
TuRu14
Tulabandhula, T., & Rudin, C. (2014) Generalization bounds for learning with linear, polygonal, quadratic and conic side knowledge. arXiv Preprint arXiv:1405.7764.
VaMB11
Vainsencher, D., Mannor, S., & Bruckstein, A. M.(2011) The Sample Complexity of Dictionary Learning. Journal of Machine Learning Research, 12(Nov), 3259–3281.
Geer02
van de Geer, S. (2002) On Hoeffding’s inequality for dependent random variables. In Empirical Process Techniques for Dependent Data. Birkhäuser
Geer14
van de Geer, S. (2014) Worst possible sub-directions in high-dimensional models. In arXiv:1403.7023 [math, stat] (Vol. 131).
Geer08
van de Geer, S. A.(2008) High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2), 614–645. DOI.
Vapn10
Vapnik, V. (2010) The Nature of Statistical Learning Theory. (Softcover reprint of hardcover 2nd ed. 2000.). Springer
VaCh71
Vapnik, V., & Chervonenkis, A. (1971) On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Probability & Its Applications, 16(2), 264–280. DOI.
VaLC94
Vapnik, V., Levin, E., & Cun, Y. L.(1994) Measuring the VC-Dimension of a Learning Machine. Neural Computation, 6(5), 851–876. DOI.
LuSc08
von Luxburg, U., & Schölkopf, B. (2008) Statistical Learning Theory: Models, Concepts, and Results. arXiv:0810.4752 [Math, Stat].
WuZh08
Wu, Q., & Zhou, D.-X. (2008) Learning with sample dependent hypothesis spaces. Computers & Mathematics with Applications, 56(11), 2896–2907. DOI.