Welcome to the probability inequality mines!

When something in your process (measurement, estimation) means that you can be pretty sure that a whole bunch of your stuff is damn likely to be somewhere in particular.

This is basic workhorse stuff in univariate probability, and turns out to be yet more essential in multivariate matrix probability, as seen in matrix factorisation, compressive sensing, PAC-bounds and suchlike.

Contents

## Background

Overviews include

this super simple intro to chaining and controlling maxima by Thomas Lumley

Dasgupta, Asymptotic Theory of Statistics and Probability (Dasg08) is very easy, and despite its name introduces some nice basic non-asymptotic inequalities

Raginsky and Sason, Concentration of Measure Inequalities in Information Theory, Communications and Coding (RaSa12)

Tropp, An Introduction to Matrix Concentration Inequalities (Trop15) high-dimensional data! free!

Boucheron, Bousquet & Lugosi, Concentration inequalities (BoBl04a) (Clear and brisk but missing some newer stuff)

Massart, Concentration inequalities and model section (Mass07). Clear, and focussed, but very quick and further, depressingly, by being applied it also demonstrates the limitations of these techniques. Mass00 is an earlier draft.

Boucheron, Lugosi & Massart, Concentration inequalities: a nonasymptotic theory of independence (BoLM13). Haven’t read it yet.

Lugosi’s Concentration-of-measure Lecture notes:

The inequalities discussed in these notes bound tail probabilities of general functions of independent random variables.

The taxonomy is interesting:

Several methods have been known to prove such inequalities, including martingale methods (see Milman and Schechtman [1] and the surveys of McDiarmid [2, 3]), information-theoretic methods (see Alhswede, Ga ́cs, and Ko ̈rner [4], Marton [5, 6, 7], Dembo [8], Massart [9] and Rio [10]), Talagrand’s induction method [11, 12, 13] (see also Luczak and McDiarmid [14], McDiarmid [15] and Panchenko [16, 17, 18]), the decoupling method surveyed by de la Pen ̃a and Gin ́e [19], and the so-called “entropy method”, based on logarithmic Sobolev inequalities, developed by Ledoux [20, 21], see also Bobkov and Ledoux [22], Massart [23], Rio [10], Klein [24], Boucheron, Lugosi, and Mas- sart [25, 26], Bousquet [27, 28], and Boucheron, Bousquet, Lugosi, and Massart [29].

(actioned in his Combinatorial statistics notes)

Foundational but impenetrable things I won’t read right now: Talagrand’s opus that is commonly credited with kicking off the modern fad especially with the chaining method. (Tala95)

## Finite sample bounds

These are everywhere in statistics.
Special attention will be given here to finite-sample inequalities.
Asymptotic normality is so last season.
These days we care about finite sample performance, and asymptotic results don’t help us there.
Apparently I can construct useful bounds using concentration inequalities?
One suggested keyword to disambiguate: *Ahlswede-Winterfeld bounds*?

## Basic inequalities

the classics

### Markov

TBD

### Chebychev

TBD

### Hoeffding

TBD

### Chernoff

TBD

### Kolmogorov

TBD.

### Gaussian

For the Gaussian distribution. Filed there.

### Martingale type

TBD.

### Khintchine

Let us copy from wikipedia:

Heuristically: if we pick $N$ complex numbers \(x_1,\dots,x_N \in\mathbb{C}$\), and add them together, each multiplied by jointly independent random signs \(\pm 1\), then the expected value of the sum’s magnitude is close to \(\sqrt{|x_1|^{2}+ \cdots + |x_N|^{2}}\).

Let :math:` {varepsilon_n}_{n=1}^N ` i.i.d. random variables with \(P(\varepsilon_n=\pm1)=\frac12\) for \(n=1,\ldots, N\), i.e., a sequence with Rademacher distribution. Let :math:` 0<p<infty` and let :math:` x_1,ldots,x_Nin mathbb{C}`. Then

for some constants \(A_p,B_p>0 \). It is a simple matter to see that \(A_p = 1\) when \(p \ge 2\), and \(B_p = 1\) when \(0 < p \le 2\).

## Empirical process theory

Large deviation inequalities, empirical process inequalities, Talagrand chaining method. Berry-Esseen bound.

## Matrix Chernoff bounds

Nikhil Srivastava’s Discrepancy, Graphs, and the Kadison-Singer Problem has an interesting example of bounds via discrepancy theory (and only indirectly probability). Gros11 is also readable, and gives quantum-mechanical results (i.e. the matrices are complex-valued).

Trop15 summarises:

In recent years, random matrices have come to play a major role in computational mathematics, but most of the classical areas of random matrix theory remain the province of experts. Over the last decade, with the advent of matrix concentration inequalities, research has advanced to the point where we can conquer many (formerly) challenging problems with a page or two of arithmetic.

## Refs

- AuNe11
- Aubrun, G., & Nechita, I. (2011) The multiplicative property characterizes $ell_p$ and $L_p$ norms.
*Confluentes Mathematici*, 03(04), 637–647. DOI. - Bach13
- Bach, F. R.(2013) Sharp analysis of low-rank kernel matrix approximations. In COLT (Vol. 30, pp. 185–209).
- BaBM99
- Barron, A., Birgé, L., & Massart, P. (1999) Risk bounds for model selection via penalization.
*Probability Theory and Related Fields*, 113(3), 301–413. - BeLT17
- Bellec, P. C., Lecué, G., & Tsybakov, A. B.(2017) Towards the study of least squares estimators with convex penalty.
*ArXiv:1701.09120 [Math, Stat]*. - BeHK12
- Berend, D., Harremoës, P., & Kontorovich, A. (2012) Minimum KL-divergence on complements of $L_1$ balls.
*ArXiv:1206.6544 [Cs, Math]*. - BePo05
- Bertsimas, D., & Popescu, I. (2005) Optimal Inequalities in Probability Theory: A Convex Optimization Approach.
*SIAM Journal on Optimization*, 15(3), 780–804. DOI. - BoBL04a
- Boucheron, S., Bousquet, O., & Lugosi, G. (2004a) Concentration inequalities. In O. Bousquet, U. von Luxburg, & G. Rätsch (Eds.), Advanced Lectures in Machine Learning.
- BoLM03
- Boucheron, S., Lugosi, G., & Massart, P. (2003) Concentration inequalities using the entropy method. , 31(3), 1583–1614. DOI.
- BoLM13
- Boucheron, S., Lugosi, G., & Massart, P. (2013) Concentration inequalities: a nonasymptotic theory of independence. (1st ed.). Oxford: Oxford University Press
- BoBL04b
- Bousquet, O., Boucheron, S., & Lugosi, G. (2004b) Introduction to Statistical Learning Theory. In O. Bousquet, U. von Luxburg, & G. Rätsch (Eds.), Advanced Lectures on Machine Learning (pp. 169–207). Springer Berlin Heidelberg
- BüGe11
- Bühlmann, P., & van de Geer, S. (2011) Statistics for High-Dimensional Data: Methods, Theory and Applications. (2011 edition.). Heidelberg ; New York: Springer
- CaRe09
- Candès, E. J., & Recht, B. (2009) Exact matrix completion via convex optimization.
*Foundations of Computational Mathematics*, 9(6), 717–772. DOI. - CaRT06
- Candès, E. J., Romberg, J., & Tao, T. (2006) Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information.
*IEEE Transactions on Information Theory*, 52(2), 489–509. DOI. - DSBN13
- Dasarathy, G., Shah, P., Bhaskar, B. N., & Nowak, R. (2013) Sketching Sparse Matrices.
*ArXiv:1303.6544 [Cs, Math]*. - Dasg08
- DasGupta, A. (2008) Asymptotic Theory of Statistics and Probability. . New York: Springer New York
- Dasg00
- Dasgupta, S. (2000) Experiments with Random Projection. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (pp. 143–151). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
- DaHV12
- Dasgupta, S., Hsu, D., & Verma, N. (2012) A concentration theorem for projections.
*ArXiv Preprint ArXiv:1206.6813*. - DeHW11
- Del Moral, P., Hu, P., & Wu, L. (2011) On the concentration properties of Interacting particle processes.
*Foundations and Trends® in Machine Learning*, 3(3–4), 225–389. DOI. - Elka13
- El Karoui, N. (2013) Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators : rigorous results.
*ArXiv:1311.2445 [Math, Stat]*. - FGLE12
- Flammia, S. T., Gross, D., Liu, Y.-K., & Eisert, J. (2012) Quantum Tomography via Compressed Sensing: Error Bounds, Sample Complexity, and Efficient Estimators.
*New Journal of Physics*, 14(9), 095022. DOI. - GiNi09
- Giné, E., & Nickl, R. (2009) Uniform limit theorems for wavelet density estimators.
*The Annals of Probability*, 37(4), 1605–1646. DOI. - Gros11
- Gross, D. (2011) Recovering Low-Rank Matrices From Few Coefficients in Any Basis.
*IEEE Transactions on Information Theory*, 57(3), 1548–1566. DOI. - GLFB10
- Gross, D., Liu, Y.-K., Flammia, S. T., Becker, S., & Eisert, J. (2010) Quantum state tomography via compressed sensing.
*Physical Review Letters*, 105(15). DOI. - HaRR15
- Hansen, N. R., Reynaud-Bouret, P., & Rivoirard, V. (2015) Lasso and probabilistic inequalities for multivariate point processes.
*Bernoulli*, 21(1), 83–143. DOI. - Horn79
- Horn, M. (1979) Some inequalities for the expectation of a product of functions of a random variable and for the multivariate distribution function at a random point.
*Biometrical Journal*, 21(3), 243–245. DOI. - Houd02
- Houdré, C. (2002) Remarks on deviation inequalities for functions of infinitely divisible random vectors.
*The Annals of Probability*, 30(3), 1223–1237. DOI. - HoPr02
- Houdré, C., & Privault, N. (2002) Concentration and deviation inequalities in infinite dimensions via covariance representations.
*Bernoulli*, 8(6), 697–720. - IsMc16
- Isaev, M., & McKay, B. D.(2016) On a bound of Hoeffding in the complex case.
*Electronic Communications in Probability*, 21(0). DOI. - Kenn15
- Kennedy, E. H.(2015) Semiparametric theory and empirical processes in causal inference.
*ArXiv Preprint ArXiv:1510.04740*. - Kolt11
- Koltchinskii, P. V.(2011) Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. . Springer Berlin Heidelberg DOI.
- KoMa12
- Kontoyiannis, I., & Madiman, M. (2012) Sumset inequalities for differential entropy and mutual information. In Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on (pp. 1261–1265). DOI.
- Krol16
- Kroll, M. (2016) Concentration inequalities for Poisson point processes with application to adaptive intensity estimation.
*ArXiv:1612.07901 [Math, Stat]*. - KuMo14
- Kuznetsov, V., & Mohri, M. (2014) Generalization Bounds for Time Series Prediction with Non-stationary Processes. In P. Auer, A. Clark, T. Zeugmann, & S. Zilles (Eds.), Algorithmic Learning Theory (pp. 260–274). Bled, Slovenia: Springer International Publishing DOI.
- KuMo15
- Kuznetsov, V., & Mohri, M. (2015) Learning Theory and Algorithms for Forecasting Non-Stationary Time Series. In Advances in Neural Information Processing Systems (pp. 541–549). Curran Associates, Inc.
- KuMo16
- Kuznetsov, V., & Mohri, M. (2016) Generalization Bounds for Non-stationary Mixing Processes. In Machine Learning Journal.
- LaGG16
- Lahiri, S., Gao, P., & Ganguli, S. (2016) Random projections of random manifolds.
*ArXiv:1607.04331 [Cs, q-Bio, Stat]*. - LeGe14
- Lederer, J., & van de Geer, S. (2014) New concentration inequalities for suprema of empirical processes.
*Bernoulli*, 20(4), 2020–2038. DOI. - Ligg10
- Liggett, T. M.(2010) Stochastic models for large interacting systems and related correlation inequalities.
*Proceedings of the National Academy of Sciences of the United States of America*, 107(38), 16413–16419. DOI. - Madi08
- Madiman, M. (2008) On the entropy of sums. In Information Theory Workshop, 2008. ITW ’08. IEEE (pp. 303–307). DOI.
- Mass00
- Massart, P. (2000) Some applications of concentration inequalities to statistics. In Annales de la Faculté des sciences de Toulouse: Mathématiques (Vol. 9, pp. 245–303).
- Mass07
- Massart, P. (2007) Concentration inequalities and model selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII - 2003. . Berlin ; New York: Springer-Verlag
- MaMi11
- Maugis, C., & Michel, B. (2011) A non asymptotic penalized criterion for Gaussian mixture model selection.
*ESAIM: Probability and Statistics*, 15, 41–68. DOI. - RaSa12
- Raginsky, M., & Sason, I. (2012) Concentration of Measure Inequalities in Information Theory, Communications and Coding.
*Foundations and Trends in Communications and Information Theory*. - RaRe09
- Rahimi, A., & Recht, B. (2009) Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning. In Advances in neural information processing systems (pp. 1313–1320). Curran Associates, Inc.
- ReRo07
- Reynaud-Bouret, P., & Roy, E. (2007) Some non asymptotic tail estimates for Hawkes processes.
*Bulletin of the Belgian Mathematical Society - Simon Stevin*, 13(5), 883–896. - Tala95
- Talagrand, M. (1995) Concentration of measure and isoperimetric inequalities in product spaces.
*Publications Mathématiques de L’IHÉS*, 81(1), 73–205. DOI. - Tala96
- Talagrand, M. (1996) A new look at independence.
*The Annals of Probability*, 24(1), 1–34. - Trop15
- Tropp, J. A.(2015) An Introduction to Matrix Concentration Inequalities.
*ArXiv:1501.01571 [Cs, Math, Stat]*. - Geer95
- van de Geer, S. (1995) Exponential Inequalities for Martingales, with Application to Maximum Likelihood Estimation for Counting Processes.
*The Annals of Statistics*, 23(5), 1779–1801. DOI. - Geer02
- van de Geer, S. (2002) On Hoeffdoing’s inequality for dependent random variables. In Empirical Process Techniques for Dependent Data. Birkhhäuser
- Geer14
- van de Geer, S. (2014) Statistical Theory for High-Dimensional Models.
*ArXiv:1409.8557 [Math, Stat]*. - GeLe11
- van de Geer, S., & Lederer, J. (2011) The Lasso, correlated design, and improved oracle inequalities.
*ArXiv:1107.0189 [Stat]*. - WuZh16
- Wu, X., & Zhang, J. (2016) Distribution-dependent concentration inequalities for tighter generalization bounds.
*ArXiv:1607.05506 [Stat]*.