Quantifying difference between probability measures. Measuring the distribution itself, for, e.g. badness of approximation of a statistical fit. The theory of binary experiments. You probably care about these because you want to test for independence or do hypothesis testing or model selection, or density estimation, or to prove convergence for some random variable, or probability inequalities inequalities, or to model the distinguishability of the distributions from some process and a generative model of it, as seen in adversarial training. That kind of thing. Frequently the distance here is between a measure and an empirical estimate thereof, but this is no requirement.
A good choice of probability metric might give you a convenient distribution of a test statistic, an efficient loss function to target, simple convergence behaviour for some class of estimator, or simply a warm fuzzy glow.
“Distance” and “metric” both often imply symmetric functions obeying the triangle inequality, but on this page we have a broader church, and include premetrics, metriclike functions which still “go to zero when two things get similar”, without including the other axioms of distances. These are also called divergences. This is still useful for the aforementioned convergence results. I’ll use “true metric” or “true distance” to make it clear when needed. “Contrast” is probably better here, but is less common.
nle;dr Don’t read my summary, read the epic Reid and Williamson paper, [ReWi11], which, in the quiet solitude of my own skull, I refer to as One regret to rule them all and in divergence bound them.
Wait, you are still here?
Norms with respect to Lebesgue measure on the state space
Well now, this is a fancy name. But this is probably the most familiar to many, as it’s a plain old functional approximation metric applied to probability distributions on the state space of the random variable.
The usual norms can be applied to density, Most famously, \(L_p\) norms (which I will call \(L_k\) norms because I am using \(p\)).
When \(p=2\), you get a convenient Hilbert space for free.
When written like this, the norm is taken between densities, i.e. RadonNikodym derivatives, not distributions. (Although see the Kolmogorov metric for an application of the \(p=\infty\) norm to cumulative distributions.)
A little more generally, consider some RV \(X\sim P\) taking values on \(\mathbb{R}\) with a RadonNikodym derivative (a.k.a. density) continuous with respect to the Lebesgue measure \(\lambda\), \(p=dP/d\lambda\).
\(L_2\) norm are classics for kernel density estimates, because it allows you to use all the machinery of function approximation.
\(L_k, k\geq 1\) norms do observe the triangle inequality, and \(L_2\) norms have lots of additional features, such as Wiener filtering formulations, and Parseval’s identity.
There are the standard facts about about \(L_k,\,k\geq 1\) spaces (i.e. expectation of arbitrary measurable functions), e.g. domination
Hölder’s inequality for probabilities
and the Minkowski (i.e. triangle) inequality
However, it’s an awkward choice for a distance on a probability space, the \(L_k\) space on densities.
If you transform the random variable by anything other than a linear transform, then your distances transform in an arbitrary way. And we haven’t exploited the nonnegativity of probability densities so it might feel as if we are wasting some information If our estimated density \(q(x)<0,\;\forall x\in A\) for some nonempty interval \(A\) then we know it’s plain wrong, since probability is never negative.
Also, such norms are not necessarily convenient. Exercise: Given \(N\) i.i.d samples drawn from \(X\sim P= \text{Norm}(\mu,\sigma)\), find a closed form expression for estimates \((\hat{\mu}_N, \hat{\sigma}_N)\) such that the distance \(E_P\(p\hat{p})\_2\) is minimised.
Doing this directly is hard; But indirectly can work —if we try to directly minimise a different distance, such as the KL divergence, we can squeeze the \(L_2\) distance. TODO: come back to this point.
Finally, these feel like setting up an inappropriate problem to solve statistically, since an error is penalised equally everywhere in the statespace; Why are errors penalised just as much for where \(p\simeq 0\) as for \(p\gg 0\)? Surely there are cases where we care more, or less, about such areas? That leads to…
\(\phi\)divergences
Why not call \(P\) close to \(Q\) if closeness depends on the probability weighting of that place? Specifically, some divergence \(R\) like this, using scalar function \(\phi\) and pointwise loss \(\ell\)
If we are going to measure divergence here, we also want the properties that \(P=Q\Rightarrow R(P,Q)=0\), and \(R(P,Q)\gt 0 \Rightarrow P\neq Q\). We can get this if we chose some increasing \(\psi\) and \(\ell(s,t)\) such that
Let \(\psi\) be the identity function for now, and concentrate on the fiddly bit, \(\ell\). We try a form of function that exploits the nonnegativity of densities and penalises the derivative of one distribution with respect to the other (resp. the ratio of densities) :
If \(p(x)=q(x)\) then \(q(x)/p(x)=1\). So to get the right sort of penalty, we choose \(\phi\) to have a minimum where the argument is 1, \(\phi(1)=0\) and \(\phi(t)\geq 0, \forall t\)
It turns out that it’s also wise to take \(\phi\) to be convex. (Exercise: why?) And, note that for these not to explode we will now require \(P\) be dominated by \(Q.\) (i.e. \(Q(A)=0\Rightarrow P(A)=0,\, \forall A \in\text{Borel}(\mathbb{R})\)
Putting this all together, we have a family of divergences
And BAM! These are the \(\phi\)divergences. You get a different one for each choice of \(\phi\).
a.k.a. Csiszárdivergences, \(f\)divergences or AliSilvey distances, after the people who noticed them. ([AlSi66], [Csis72])
These are in general mere premetrics. And note they are no longer in general symmetric We should not necessarily expect
to be equal to
Anyway, back to concreteness, and recall our wellbehaved continuous random variables; we can write, in this case,
Let’s explore some \(\phi\)s.
KullbackLeibler divergence
We take \(\phi(t)=t \ln t\), and write the corresponding divergence, \(D_\text{KL}=\operatorname{KL}\),
Indeed, so long as P is absolutely continuous wrt Q,
This is one of many, many possible derivations of the KullbackLeibler divergence a.k.a. KL divergence, or relative entropy; It pops up because of informationtheoretic significance. e.g. information criteria.
TODO: revisit in a maximum likelihood and variational inference settings, where we have good algorithms exploiting its nice properties.
Total variation distance
Take \(\phi(t)=t1\). We write \(\delta(P,Q)\) for the divergence. I will use the set \(A:=\left\{x:\frac{dP}{dQ}\geq 1\right\}=\{x:dP\geq dQ\}.\)
I have also the standard fact that for any probability measure \(P\) and \(P\)measurable set, \(A\), it holds that \(P(A)=1P(A^C)\).
Equivalently
To see that \(A\) attains that supremum, we note for any set \(B\supseteq A,\, B:=A\cup D\) for some \(Z\) disjoint from \(A\), it follows that \(P(B)  Q(B)\leq P(A)  Q(A)\) since, on \(Z,\, dP/dQ\leq 1\), by construction.
It should be clear that this is symmetric.
Supposedly, [KhFG07] show that this is the only possible fdivergence which is also a true distance, but I can’t access that paper to see how.
TODO: Prove that for myself Is the representation of divergences as “simple” divergences helpful? See in [ReWi09] (credited to Österreicher and Wajda)
TODO: talk about triangle inequalities.
Hellinger divergence
For this one, we write \(H^2(P,Q)\), and take \(\phi(t):=(\sqrt{t}1)^2\). Stepbystep, that becomes
It turns out to be another symmetrical \(\phi\)divergence. The square root of the Hellinger divergence \(H=\sqrt{H^2}\) is the Hellinger distance on the space of probability measures which is a true distance. (exercise: prove)
It doesn’t look intuitive, but has convenient properties for proving inequalities (simple relationships with other norms, triangle inequality) and magically good estimation properties ([Bera77]), e.g. in robust statistics.
TODO: make some of these “convenient properties” explicit.
\(\alpha\)divergence
a.k.a Rényi divergences, which are a sub family of the f divergences with a particular parameteriation. Includes KL, reverseKL and Hellinger as special cases.
We take \(\phi(t):=\frac{4}{1\alpha^2} \left(1t^{(1+\alpha )/2}\right).\)
This gets fiddly to write out in full generality, with various undefined or infinite integrals needing definitions in terms of limits and is supposed to be constructed in terms of “Hellinger integral”…? I will ignore that for now and write out a simple enough version. See [ErHa14] or [LiVa06] for gory details.
\(\chi^2\) divergence
As made famous by count data significance tests.
For this one, we write \(\chi^2\), and take \(\phi(t):=(t1)^2\). Then, by the same old process…
Normally you see this for discrete data indexed by \(i\), in which case we may write
If you have constructed these discrete probability mass functions from \(N\) samples, say, \(p_i:=\frac{n^P_i}{N}\) and \(q_i:=\frac{n^Q_i}{N}\), this becomes
This is probably familiar from some primordial statistics class.
The main use of this one is its ancient pedigree, (used by Pearson in 1900, according to Wikipedia) and its noncontroversiality, so you include it in lists wherein you wish to mention you have a hipper alternative.
TBD: Reverse Pinsker inequalities (e.g. [BeHK12]), and covering numbers and other such horrors.
Hellinger inequalities
Wr/t the total variation distance,
Additionally,
Pinsker inequalities
[BeHK12] attribute this to Csiszár (1967 article I could not find) and Kullback ([Kull70]) instead of [Pins80] (which is in any case in Russian and I haven’t read it).
[ReWi09] derives the bestpossible generalised Pinsker inequalities, in a certain sense of “best” and “generalised”, i.e. they are tight bounds, but not necessarily convenient.
Here are the most useful 3 of their inequalities: (\(P,Q\) arguments omitted)
Integral probability metrics
TBD. For now, see [SGSS07]. Weaponized in [GFTS08] as an independence test.
Also included:
 total variation
 Kantorovich/Wasserstein (Mass transport) These are now hot in, e.g. the Wasserstein GAN, a form of adversarial training
 FourtetMourier
 Lipschitz
 general kernel metrics (e.g. [SGSS07])
Analysed in RKHS distribution embeddings.
Others
 “Pdivergence”

Metrizes convergence in probability. Note this is defined upon random variables with an arbitrary joint distribution, not upon two distributions per se.
 Lèvy metric

This monster metrizes convergence in distribution
$$D_L(P,Q):=\inf\{\epsilon >0: P(x\epsilon)\epsilon \leq Q(x)\leq P(x+\epsilon)+\epsilon\}$$  Kolmogorov metric

the \(L_\infty\) metric between the cumulative distributions (i.e. not between densities)
$$D_K(P,Q):= \sup_x \left\{ P(x)  Q(x) \right\}$$  Wasserstein distance

Earthmover distance.
 Skorokhod

Hmmm.
 Neural Net distance

Wasserstein distance with a baked in notion of the capacity of the discriminators which much measure the distance. ([AGLM17])
What even are the Kuiper and Prokhorov metrics?
To read
 This guy’s study blog
 Anand Sarwate: C.R. Rao and information geometry
 [ReWi11], relating pretty much all these metrics and a few others to a theory of experiments distinguishing data from two distributions.
 [BeHK12] on reverse Pinsker inequalities.
Refs
 Csis72: (1972) A class of measures of informativity of observation channels. Periodica Mathematica Hungarica, 2(1–4), 191–213. DOI
 AlSi66: (1966) A General Class of Coefficients of Divergence of One Distribution from Another. Journal of the Royal Statistical Society. Series B (Methodological), 28(1), 131–142.
 SGSS07: (2007) A Hilbert Space Embedding for Distributions. In Algorithmic Learning Theory (pp. 13–31). Springer Berlin Heidelberg
 GFTS08: (2008) A Kernel Statistical Test of Independence. In Advances in Neural Information Processing Systems 20: Proceedings of the 2007 Conference. Cambridge, MA: MIT Press
 Kull67: (1967) A lower bound for discrimination information in terms of variation (Corresp). IEEE Transactions on Information Theory, 13(1), 126–127. DOI
 KhFG07: (2007) Confliction of the Convexity and Metric Properties in fDivergences. IEICE Trans. Fundam. Electron. Commun. Comput. Sci., E90–A(9), 1848–1853. DOI
 Bill13: (2013) Convergence Of Probability Measures. Wiley
 Bach13: (2013) Convex relaxations of structured matrix factorizations. ArXiv:1309.3117 [Cs, Math].
 Kull70: (1970) Correction to A Lower Bound for Discrimination Information in Terms of Variation. IEEE Transactions on Information Theory, 16(5), 652–652. DOI
 Niel13: (2013) CramerRao Lower Bound and Information Geometry. ArXiv:1301.3578 [Cs, Math].
 Rao87: (1987) Differential metrics in probability spaces. In Differential geometry in statistical inference (Vol. 10, pp. 217–240). IMS Lecture Notes and Monographs Series, Hayward, CA
 Lin91: (1991) Divergence measures based on the Shannon entropy. Information Theory, IEEE Transactions On, 37(1), 145–151. DOI
 SaVe16: (2016) fDivergence Inequalities via Functional Domination. ArXiv:1610.09110 [Cs, Math, Stat].
 ReWi09: (2009) Generalised Pinsker Inequalities. In arXiv:0906.1244 [cs, math].
 AGLM17: (2017) Generalization and Equilibrium in Generative Adversarial Nets (GANs). ArXiv:1703.00573 [Cs].
 SGFS10: (2010) Hilbert Space Embeddings and Metrics on Probability Measures. Journal of Machine Learning Research, 11, 1517−1561.
 SHSF09: (2009) Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 961–968). New York, NY, USA: ACM DOI
 Csis75: (1975) IDivergence Geometry of Probability Distributions and Minimization Problems. The Annals of Probability, 3(1), 146–158.
 ReWi11: (2011) Information, Divergence and Risk for Binary Experiments. Journal of Machine Learning Research, 12(Mar), 731–817.
 HaTu12: (2012) Information Divergence is more chi squared distributed than the chi squared statistics. ArXiv:1202.1125 [Cs, Math, Stat].
 SGFL08: (2008) Injective Hilbert Space Embeddings of Probability Measures. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008).
 ZPJS12: (2012) Kernelbased Conditional Independence Test and Application in Causal Discovery. ArXiv:1202.3775 [Cs, Stat].
 PaVe08: (2008) Lautum Information. IEEE Transactions on Information Theory, 54(3), 964–975. DOI
 Nuss04: (2004) Minimax Risk, Pinsker Bound for. In Encyclopedia of Statistical Sciences. John Wiley & Sons, Inc.
 Bera77: (1977) Minimum Hellinger Distance Estimates for Parametric Models. The Annals of Statistics, 5(3), 445–463. DOI
 BeHK12: (2012) Minimum KLdivergence on complements of \(L_1\) balls. ArXiv:1206.6544 [Cs, Math].
 MoHe14: (2014) Multivariate fDivergence Estimation With Confidence. In NIPS 2014.
 Saga05: (2005) Nonparametric maximumlikelihood estimation of probability measures: existence and consistency. Journal of Statistical Planning and Inference, 133(2), 249–271. DOI
 HaIb90: (1990) On Density Estimation in the View of Kolmogorov’s Ideas in Approximation Theory. The Annals of Statistics, 18(3), 999–1010. DOI
 LiVa06: (2006) On Divergences and Informations in Statistics and Information Theory. IEEE Transactions on Information Theory, 52(10), 4394–4412. DOI
 Hall87: (1987) On KullbackLeibler Loss and Density Estimation. The Annals of Statistics, 15(4), 1491–1519. DOI
 Gila10: (2010) On Pinsker’s Type Inequalities and Csiszár’s fdivergences Part I: Second and FourthOrder Inequalities. IEEE Transactions on Information Theory, 56(11), 5377–5386. DOI
 NiNo13: (2013) On the Chi square and higherorder Chi distances for approximating fdivergences. ArXiv:1309.3029 [Cs, Math].
 Rény59: (1959) On the dimension and entropy of probability distributions. Acta Mathematica Academiae Scientiarum Hungarica, 10(1–2), 193–215. DOI
 SFGS12: (2012) On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6, 1550–1599. DOI
 Pins80: (1980) Optimal filtration of squareintegrable signals in Gaussian noise. Problems in Information Transmiss, 16(2), 120–133.
 GHLY17: (2017) Relaxed Wasserstein with Applications to GANs. ArXiv:1705.07164 [Cs, Stat].
 ErHa14: (2014) Rényi Divergence and KullbackLeibler Divergence. IEEE Transactions on Information Theory, 60(7), 3797–3820. DOI
 GiAL13: (2013) Rényi divergence measures for commonly used univariate continuous distributions. Information Sciences, 249(Supplement C), 124–131. DOI
 Rény00: (n.d.) Rényi Divergence Measures for Commonly Used Univariate Continuous Distributions.
 Geer14: (2014) Statistical Theory for HighDimensional Models. ArXiv:1409.8557 [Math, Stat].
 BüGe11: (2011) Statistics for HighDimensional Data: Methods, Theory and Applications. Heidelberg ; New York: Springer
 MoMC16: (2016) Wasserstein Training of Restricted Boltzmann Machines. In Advances in Neural Information Processing Systems 29 (pp. 3711–3719). Curran Associates, Inc.