
(Probability) distribution distances, metrics, contrasts and divergences

Quantifying the difference between measures, particularly probability measures: measuring the discrepancy between distributions themselves, e.g. the badness of approximation of a statistical fit. The theory of binary experiments. You probably care about these because you want to test for independence, do hypothesis testing or model selection, estimate densities, prove convergence for some random variable, derive probability inequalities, or model the distinguishability of the distributions of some process and a generative model of it, as seen in adversarial training. (TODO: work out how this is used in indirect inference.) That kind of thing.

A good choice of probability metric might give you a convenient distribution of a test statistic, an efficient loss function to target, simple convergence behaviour for some class of estimator, or simply a warm fuzzy glow.

“Distance” and “metric” both often imply symmetric functions obeying the triangle inequality, but on this page we have a broader church, and include premetrics, metric-like functions which still “go to zero when two things get similar”, without including the other axioms of distances. This is still useful for the aforementioned convergence results. I’ll use “true metric” or “true distance” to make it clear when needed. “Contrast” is probably better here, but is less common.

tl;dr Don’t read my summary, read the incredible Reid and Williamson paper, ReWi11, which, in the quiet solitude of my own skull, I refer to as One regret to rule them all and in divergence bound them, because I am a nerd who doesn’t need his pop culture references to be rigorous.

Wait, you are still here?


Norms with respect to Lebesgue measure on the state space

Well now, this is a fancy name. But it is probably the most familiar to many, as it’s a plain old functional-approximation metric applied to probability densities considered as functions on the state space of the random variable.

Let’s be concrete (generalisation is super easy here) and consider some RV \(X\sim P\) taking values on \(\mathbb{R}\) whose law has a continuous density (a.k.a. Radon-Nikodym derivative) with respect to the Lebesgue measure \(\lambda\), \(p=dP/d\lambda\). [1]

Hereafter I will assume that if we need to integrate with respect to some dominating measure it will be \(\lambda\), and I will omit the \(d\lambda\) bit. Say we wish to measure how similar some other density \(q\) is to \(p\). We might imagine \(q\) is some kind of approximation to \(p\), and will assume it meets the same continuity assumptions.

\(L_2\) on probability densities

To get \(R(p,q)\), our measure of badness, we can do the simplest possible thing and just integrate a pointwise loss function \(\ell:\mathbb{R}^+\times\mathbb{R}^+\rightarrow\mathbb{R}^+\) over the state space.

\begin{equation*} R(p, q)=\int_\mathbb{R}\ell(p(x), q(x))dx \end{equation*}

The usual norms can be applied to densities, e.g. \(L_p\) norms (which I will call \(L_k\) norms because I am using \(p\)), or weighted ones. Say we choose \(\ell(s,t):=|s-t|^2\):

\begin{equation*} R(p, q):=\int_\mathbb{R}\left|p(x)-q(x)\right|^2dx \end{equation*}

Oh, cool, squared-\(L_2\). We may as well take the square root so that it’s a proper distance:

\begin{equation*} R(p, q):=\sqrt{\int_\mathbb{R}\left|p(x)-q(x)\right|^2dx}=\|p-q\|_2 \end{equation*}

\(L_2\) norms are classics for kernel density estimates and density regression, because they let you use all the machinery of function approximation.

\(L_k, k\geq 1\) norms do observe the triangle inequality, and \(L_2\) norms have lots of additional features, such as Wiener filtering formulations, and Parseval’s identity (long story, google it).

However, it’s also kind of arbitrary: if you transform the random variable by anything other than a linear transform, your distances transform in an arbitrary way. And we haven’t exploited the non-negativity of probability densities or used much in the way of measure theory, so it might feel as if we are wasting some information: if our estimated density satisfies \(q(x)<0\ \forall x\in A\) for some nonempty interval \(A\), then we know it’s just plain wrong, since probability is never negative. Moreover, outside of a couple of special cases, it’s difficult to approximate densities in one of these norms from random i.i.d. samples, at least directly.

Fun exercise: given \(N\) i.i.d. samples drawn from \(X\sim P= \text{Norm}(\mu,\sigma)\), find a closed-form expression for estimates \((\hat{\mu}_N, \hat{\sigma}_N)\) such that the expected distance \(E_P\|p-\hat{p}_N\|_2\) is minimised, where \(\hat{p}_N\) is the density of \(\text{Norm}(\hat{\mu}_N,\hat{\sigma}_N)\).

Doing this directly is hard. But if we instead minimise a different distance, the KL divergence, we can squeeze the \(L_2\) distance indirectly. I’ll come back to this point.
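To make that concrete, here is a minimal numerical sketch (mine, with hypothetical helper names like `l2_distance`, and assuming NumPy/SciPy): fit a Gaussian by maximum likelihood, which targets the empirical KL divergence, and then report the resulting \(L_2\) distance to the true density by quadrature.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rng = np.random.default_rng(1)
mu, sigma = 0.0, 1.0
x = rng.normal(mu, sigma, size=500)

# Maximum likelihood fit: this minimises the empirical KL divergence, not the L2 distance
mu_hat, sigma_hat = x.mean(), x.std()

def l2_distance(p, q, lo=-10.0, hi=10.0):
    """||p - q||_2 between two densities, by numerical quadrature on [lo, hi]."""
    val, _ = quad(lambda t: (p(t) - q(t)) ** 2, lo, hi)
    return np.sqrt(val)

d = l2_distance(lambda t: norm.pdf(t, mu, sigma),
                lambda t: norm.pdf(t, mu_hat, sigma_hat))
print(mu_hat, sigma_hat, d)  # the KL-minimising fit also ends up with small L2 error
```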

And finally, this feels like setting up an inconvenient problem to solve statistically, since an error is penalised equally everywhere in the state space: why should errors be penalised just as much where \(p\simeq 0\) as where \(p\gg 0\), when we know that there will be almost no data there?

That leads to…

\(\phi\)-divergences

Why not call \(P\) close to \(Q\) if it is close in all the “likely” places? That is, some contrast \(R\) defined in terms of \(\ell(dP(x), dQ(x))\) that weights the state space by how much probability mass sits there, and looks, say, like this:

\begin{equation*} R(P,Q):=\psi(E_Q(\ell(dP(x),dQ(x)))) \end{equation*}

If we are going to measure divergence here then we also want the properties \(P=Q\Rightarrow R(P,Q)=0\) and \(R(P,Q)>0 \Rightarrow P\neq Q\). We can get these if we choose some monotonic increasing \(\psi\) with \(\psi(0)=0\) and \(\ell(s,t)\) such that

\begin{equation*} \begin{array}{rl} \ell(s,t) \geq 0 &\text{ for } s\neq t\\ \ell(s,t)=0 &\text{ for } s=t\ \end{array} \end{equation*}

Let’s let \(\psi\) be the identity function for now, and concentrate on the fiddly bit: for that set-up to work we need to find the right \(\ell\).

One obvious choice here would be

\begin{equation*} \ell(s,t) := \|s-t\|_k \end{equation*}

But I don’t use that because it turns out not to be especially simplifying.

Instead, we try a form of function that exploits the non-negativity of densities and penalises the ratio of one density to the other:

\begin{equation*} \ell(s,t) := \phi(s/t) \end{equation*}

If \(p(x)=q(x)\) then \(p(x)/q(x)=1\). So to get the right sort of penalty, we choose \(\phi\) to have a minimum where its argument is 1, with \(\phi(1)=0\) and \(\phi(t)\geq 0\ \forall t\).

(Figure: a convex \(\phi\) with its minimum at \(\phi(1)=0\).)

It turns out that it’s also wise to take \(\phi\) to be convex. And note that for these not to explode we will now require \(P\) to be dominated by \(Q\) (i.e. \(Q(A)=0\Rightarrow P(A)=0\ \forall A \in\text{Borel}(\mathbb{R})\)).

Putting this all together, we now have a divergence like this

\begin{equation*} D_\phi(P,Q) := E_Q\left[\phi(\frac{dP}{dQ})\right] \end{equation*}

And BAM! These are the \(\phi\)-divergences.

You get a different one for each choice of \(\phi\).
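To fix ideas, here is a minimal sketch (mine; the function name `phi_divergence` and the example vectors are just illustrative) of evaluating \(D_\phi\) for discrete distributions given as probability vectors, for a few choices of \(\phi\) that appear below:

```python
import numpy as np

def phi_divergence(p, q, phi):
    """D_phi(P, Q) = E_Q[phi(dP/dQ)] for discrete distributions p, q.

    Assumes P is dominated by Q, i.e. q[i] == 0 implies p[i] == 0,
    in which case those terms contribute nothing.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = q > 0
    return np.sum(q[mask] * phi(p[mask] / q[mask]))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl    = phi_divergence(p, q, lambda t: t * np.log(t))          # Kullback-Leibler
tv    = phi_divergence(p, q, lambda t: np.abs(t - 1))          # total variation (below)
chi2  = phi_divergence(p, q, lambda t: (t - 1) ** 2)           # chi-squared
hell2 = phi_divergence(p, q, lambda t: (np.sqrt(t) - 1) ** 2)  # squared Hellinger
print(kl, tv, chi2, hell2)
```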

a.k.a. Csiszár-divergences, \(f\)-divergences or Ali-Silvey distances, after the people who noticed them. (AlSi66, Csis72)

These are in general mere premetrics. And note that they are in general no longer symmetric: we should not necessarily expect

\begin{equation*} D_\phi(Q,P) = E_P\left[\phi\left(\frac{dQ}{dP}\right)\right] \end{equation*}

to be equal to

\begin{equation*} D_\phi(P,Q) = E_Q\left[\phi\left(\frac{dP}{dQ}\right)\right] \end{equation*}

Let’s choose some \(\phi\)s.

Kullback-Leibler divergence

Anyway, back to concreteness, and recall our well-behaved continuous random variables; we can write, in this case,

\begin{equation*} D_\phi(P,Q) = \int_\mathbb{R}\phi\left(\frac{p(x)}{q(x)}\right)q(x)dx \end{equation*}

We take \(\phi(t)=t \ln t\), and write the corresponding divergence, \(D_\text{KL}\). (Strictly, \(t\ln t\) dips below zero on \((0,1)\); what really matters is that \(\phi\) is convex with \(\phi(1)=0\), so that Jensen’s inequality gives \(D_\phi\geq 0\). The choice \(\phi(t)=t\ln t - t + 1\) yields the same divergence and satisfies the non-negativity condition too.)

\begin{align*} D_\text{KL}(P,Q) &= E_Q\left[\phi\left(\frac{p(x)}{q(x)}\right)\right] \\ &= \int_\mathbb{R}\phi\left(\frac{p(x)}{q(x)}\right)q(x)dx \\ &= \int_\mathbb{R}\left(\frac{p(x)}{q(x)}\right)\ln \left(\frac{p(x)}{q(x)}\right) q(x)dx \\ &= \int_\mathbb{R} \ln \left(\frac{p(x)}{q(x)}\right) p(x)dx \end{align*}

Indeed, in general, so long as \(P\) is absolutely continuous w.r.t. \(Q\),

\begin{equation*} D_\text{KL}(P,Q) = E_Q\left[\frac{dP}{dQ}\ln \left(\frac{dP}{dQ}\right)\right] = E_P\left[\ln \left(\frac{dP}{dQ}\right)\right] \end{equation*}

This is one of many, many possible derivations of the Kullback-Leibler divergence, a.k.a. KL divergence or relative entropy; it pops up everywhere because of its information-theoretic significance, e.g. in information criteria.

TODO: revisit in a maximum likelihood setting, where we have a good algorithm to minimise its empirical version.
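In the meantime, a small sanity-check sketch (mine, assuming NumPy/SciPy): the Monte Carlo estimate \(\frac{1}{N}\sum_i \ln\frac{p(x_i)}{q(x_i)}\) with \(x_i\sim P\), against the well-known closed form for two univariate Gaussians.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu0, s0, mu1, s1 = 0.0, 1.0, 0.5, 2.0   # P = Norm(mu0, s0), Q = Norm(mu1, s1)

# Closed form: KL(P||Q) = ln(s1/s0) + (s0^2 + (mu0 - mu1)^2) / (2 s1^2) - 1/2
kl_exact = np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5

# Monte Carlo estimate of E_P[ln p(X) - ln q(X)]
x = rng.normal(mu0, s0, size=200_000)
kl_mc = np.mean(norm.logpdf(x, mu0, s0) - norm.logpdf(x, mu1, s1))

print(kl_exact, kl_mc)  # should agree to a couple of decimal places
```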

Total-variation distance

Take \(\phi(t)=|t-1|\). We write \(\delta(P,Q)\) for the divergence. I will use the set \(A:=\left\{x:\frac{dP}{dQ}\geq 1\right\}=\{x:dP\geq dQ\}\). (And I will assume the measurability of this set under both measures without further comment.)

\begin{align*} \delta(P,Q) &= E_Q\left[\left|\frac{dP}{dQ}-1\right|\right] \\ &= \int_A \left(\frac{dP}{dQ}-1 \right)dQ - \int_{A^C} \left(\frac{dP}{dQ}-1 \right)dQ\\ &= \int_A \frac{dP}{dQ} dQ - \int_A 1 dQ - \int_{A^C} \frac{dP}{dQ}dQ + \int_{A^C} 1 dQ\\ &= \int_A dP - \int_A dQ - \int_{A^C} dP + \int_{A^C} dQ\\ &= P(A) - Q(A) - P(A^C) + Q(A^C)\\ &= 2[P(A) - Q(A)] \\ &= 2[Q(A^C) - P(A^C)] \\ \text{ i.e. } &= 2\left[P(\{dP\geq dQ\})-Q(\{dP\geq dQ\})\right] \end{align*}

I have also used the standard fact that for any probability measure \(P\) and \(P\)-measurable set \(A\), it holds that \(P(A)=1-P(A^C)\).

Equivalently

\begin{equation*} \delta(P,Q) = 2\sup_{B} \left\{ |P(B) - Q(B)| \right\} \end{equation*}

where the supremum is over measurable sets \(B\). (Many authors call \(\sup_{B}|P(B)-Q(B)|\) itself the total variation distance; the two conventions differ by this factor of 2.)

To see that \(A\) attains that supremum, split any measurable \(B\) into \(B\cap A\) and \(B\cap A^C\): on \(A^C\) we have \(dP\leq dQ\), so the second piece can only decrease \(P(B)-Q(B)\), and the first piece contributes at most \(P(A)-Q(A)\); the symmetric argument with the roles of \(P\) and \(Q\) swapped bounds \(Q(B)-P(B)\) by \(Q(A^C)-P(A^C)\), which is the same number.

It should be clear that this is symmetric, which might be surprising.

Supposedly, KhFG07 show that this is the only possible f-divergence which is also a true metric, but I can’t access that paper to see how.

TODO: procrastinate by proving that for myself - I bet the representation of divergences as “simple” divergences is helpful, as given in ReWi09 (credited to Österreicher and Vajda)

TODO: talk about triangle inequalities.
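A discrete sketch (mine) of the two characterisations above, checking that they agree:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# phi-divergence form: E_Q|dP/dQ - 1| = sum_i |p_i - q_i|
delta_phi = np.sum(np.abs(p - q))

# supremum form, with the sup attained on A = {i : p_i >= q_i}
A = p >= q
delta_sup = 2 * (p[A].sum() - q[A].sum())

print(delta_phi, delta_sup)  # both 0.2
```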

Hellinger divergence

For this one, we write \(H^2(P,Q)\), and take the possibly bizarre-looking choice \(\phi(t):=(\sqrt{t}-1)^2\). Then, taking it slowly because this is the first time I’ve ever had cause to take the square root of a differential,

\begin{align*} H^2(P,Q) &:=E_Q \left[\left(\sqrt{\frac{dP}{dQ}}-1\right)^2\right] \\ &= \int \left(\sqrt{\frac{dP}{dQ}}-1\right)^2 dQ\\ &= \int \frac{dP}{dQ} dQ -2\int \sqrt{\frac{dP}{dQ}} dQ +\int dQ\\ &= \int dP -2\int \sqrt{\frac{dP}{dQ}} dQ +\int dQ\\ &= \int \sqrt{dP}^2 -2\int \sqrt{dP}\sqrt{dQ} +\int \sqrt{dQ}^2\\ &=\int (\sqrt{dP}-\sqrt{dQ})^2 \end{align*}

Another symmetric \(\phi\)-divergence; the square root of the Hellinger divergence is the Hellinger distance, \(H=\sqrt{H^2}\), which is in fact a true distance. (Exercise: prove it.)

It doesn’t look intuitive, but it has convenient properties for proving inequalities (simple relationships with other norms, a triangle inequality) and magically good estimation properties (Bera77), e.g. in robust statistics.

TODO: make some of these “convenient properties” explicit.
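For now, a discrete sketch (mine) checking that the \(E_Q\) form and the symmetric form above agree, and that \(H^2\leq\delta\) with the (unhalved) conventions used on this page:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

h2_phi = np.sum(q * (np.sqrt(p / q) - 1) ** 2)   # E_Q[(sqrt(dP/dQ) - 1)^2]
h2_sym = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)  # sum (sqrt(dP) - sqrt(dQ))^2

delta = np.sum(np.abs(p - q))                    # total variation, as derived above
print(h2_phi, h2_sym, h2_phi <= delta)           # the two forms agree; H^2 <= delta
```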

\(\chi^2\) divergence

As made famous by count data significance tests.

For this one, we write \(\chi^2\), and take \(\phi(t):=(t-1)^2\). Then, by the same old process…

\begin{align*} \chi^2(P,Q) &:=E_Q \left[\left(\frac{dP}{dQ}-1\right)^2\right] \\ &= \int \left(\frac{dP}{dQ}-1\right)^2 dQ\\ &= \int \left(\frac{dP}{dQ}\right)^2 dQ - 2 \int \frac{dP}{dQ} dQ + \int dQ\\ &= \int \frac{dP}{dQ} dP - 1 \end{align*}

Normally you see this for discrete data indexed by \(i\), in which case we may write

\begin{align*} \chi^2(P,Q) &= \left(\sum_i \frac{p_i}{q_i} p_i\right) - 1\\ &= \sum_i\left( \frac{p_i^2}{q_i} - q_i\right)\\ &= \sum_i \frac{p_i^2-q_i^2}{q_i}\\ \end{align*}

If you have constructed these discrete probability mass functions from \(N\) samples, say, \(p_i:=\frac{n^P_i}{N}\) and \(q_i:=\frac{n^Q_i}{N}\), this becomes

\begin{equation*} \chi^2(P,Q) = \sum_i \frac{(n^P_i)^2-(n^Q_i)^2}{Nn^Q_i} \end{equation*}

This is probably familiar from some primordial statistics class.

The main uses of this one are its ancient pedigree (used by Pearson in 1900, according to Wikipedia) and its non-controversiality, so you include it in lists wherein you wish to mention that you have a hipper alternative.
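For the record, a small sketch (mine) of the link to the familiar Pearson statistic: with observed counts \(n^P_i\) and null probabilities \(q_i\), the Pearson statistic is \(N\) times the \(\chi^2\) divergence between the empirical distribution and the null.

```python
import numpy as np

n_obs = np.array([18, 55, 27])      # observed counts n^P_i, N = 100
N = n_obs.sum()
q = np.array([0.25, 0.5, 0.25])     # null distribution Q

p_hat = n_obs / N
chi2_div = np.sum(p_hat**2 / q) - 1                 # chi^2(P_hat, Q) as above
pearson = np.sum((n_obs - N * q) ** 2 / (N * q))    # sum (observed - expected)^2 / expected

print(chi2_div, pearson, np.isclose(N * chi2_div, pearson))
```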

Integral probability metrics

TBD. For now, see SGSS07. Weaponized in GFTS08 as an independence test.

These are also analysed under RKHS distribution embeddings.
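Pending that write-up, a minimal sketch (mine, with the hypothetical helper `mmd_biased`) of the canonical example: the biased empirical estimate of the squared maximum mean discrepancy, the IPM generated by the unit ball of a Gaussian-kernel RKHS (cf. SGSS07).

```python
import numpy as np

def mmd_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD with a Gaussian kernel, 1-d samples."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth**2))
    return k(x, x).mean() - 2 * k(x, y).mean() + k(y, y).mean()

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=500)
y = rng.normal(0.5, 1.0, size=500)
z = rng.normal(0.0, 1.0, size=500)
print(mmd_biased(x, z), mmd_biased(x, y))  # small for same distribution, larger otherwise
```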

Others

\(L_p\)-distance

For \(p\geq 1\). The preferred (true) metric for density estimation problems, especially when \(p=2\), where you get a convenient Hilbert space for free. When written like this, the norm is taken between densities, not distributions. (Although see the Kolmogorov metric for an application of the \(p=\infty\) norm to cumulative distributions.)

\begin{equation*} L_p(P,Q):= \left\|dP-dQ\right\|_p \end{equation*}
“P-divergence”
Metrizes convergence in probability. Note this is defined upon random variables with an arbitrary joint distribution, not upon two distributions per se.
Lévy metric

This monster metrizes convergence in distribution. Here \(P\) and \(Q\) denote cumulative distribution functions, and the condition must hold for all \(x\):

\begin{equation*} D_L(P,Q):=\inf\{\epsilon >0: P(x-\epsilon)-\epsilon \leq Q(x)\leq P(x+\epsilon)+\epsilon\quad \forall x\} \end{equation*}
Kolmogorov metric

the \(L_\infty\) metric between the cumulative distributions (i.e. not between densities); see the empirical sketch after this list

\begin{equation*} D_K(P,Q):= \sup_x \left\{ |P(x) - Q(x)| \right\} \end{equation*}
Wasserstein distance
Uhhhh
Skorokhod
Hmmm.
Neural Net distance
Wasserstein distance with a baked-in notion of the capacity of the discriminators which must measure the distance. (AGLM17)
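Here is the promised sketch (mine; `kolmogorov_distance` is a hypothetical helper) of the Kolmogorov metric between two empirical CDFs; scipy’s ks_2samp computes the same statistic, among other things.

```python
import numpy as np

def kolmogorov_distance(x, y):
    """sup_t |F_x(t) - F_y(t)| between the empirical CDFs of two samples."""
    grid = np.sort(np.concatenate([x, y]))
    F_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    F_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.max(np.abs(F_x - F_y))

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=1000)
y = rng.normal(0.3, 1.0, size=1000)
print(kolmogorov_distance(x, y))  # cf. scipy.stats.ks_2samp(x, y).statistic
```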

What even are the Kuiper and Prokhorov metrics? Fisher Information apparently induces some kind of metric. Not KL? Skorokhod metric?

Relations

TBD: Reverse Pinsker inequalities (e.g. BeHK12), and covering numbers and other such horrors. Mind the conventions in what follows: the standard statements of these inequalities take \(\delta(P,Q)=\sup_B|P(B)-Q(B)|\) and put a factor of \(\tfrac{1}{2}\) in \(H^2\), i.e. they use half of the \(\phi\)-divergences as derived above; otherwise the constants change.

Hellinger inequalities

With respect to the total variation distance,

\begin{equation*} H^2(P,Q) \leq \delta(P,Q) \leq \sqrt 2 H(P,Q)\,. \end{equation*}
\begin{equation*} H^2(P,Q) \leq D_K(P,Q) \end{equation*}

Additionally,

\begin{equation*} 0\le H(P,Q) \le 1 \end{equation*}

Pinsker’s inequality

BeHK12 attribute this to Csiszár (1967 article I could not find) and Kullback (Kull70) instead of Pins80 (which is in any case in Russian and I haven’t read it). Huh.

\begin{equation*} \delta(P,Q) \leq \sqrt{\frac{1}{2} D_{K L}(P\|Q)} \end{equation*}

ReWi09 derives the “best-possible” generalised Pinsker inequalities, in a certain sense of “best” and “generalised”, in that it relates various other inequalities to this one.
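A quick numeric check (mine) for two Bernoulli distributions, using the \(\sup\) convention for \(\delta\) noted above:

```python
import numpy as np

# P = Bernoulli(p), Q = Bernoulli(q); delta here is sup_B |P(B) - Q(B)| = |p - q|
p, q = 0.3, 0.6
delta = abs(p - q)
kl = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

print(delta, np.sqrt(kl / 2), delta <= np.sqrt(kl / 2))  # Pinsker holds, and is fairly tight here
```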

Standard \(L_p\) inequalities

Applies both with respect to a dominating measure and from one density to another.

Standard fact about \(\ell_p\) spaces, i.e. \(L_p\) with respect to counting measure:

\begin{equation*} p>1 \text{ and } q>p\Rightarrow \|x\|_p\geq\|x\|_q \end{equation*}

On a probability space the inequality runs the other way, \(\|x\|_p\leq\|x\|_q\) for \(p<q\), by Jensen’s inequality.
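A tiny worked example of the direction (mine), with the vector \(x=(1,1,0)\):

\begin{align*} \text{counting measure: } & \|x\|_1 = 2 \geq \|x\|_2 = \sqrt{2}\\ \text{uniform probability on three points: } & \|x\|_1 = \tfrac{2}{3} \leq \|x\|_2 = \sqrt{\tfrac{2}{3}} \end{align*}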

To read

[1] We don’t actually need the densities, or even the distributions, to be continuous, but there are various additional features we might desire from our estimators without those assumptions, which can be ignored for now; relaxing these assumptions is how you motivate Skorokhod metrics for convergence of random variables.

Refs

AlSi66
Ali, S. M., & Silvey, S. D.(1966) A General Class of Coefficients of Divergence of One Distribution from Another. Journal of the Royal Statistical Society. Series B (Methodological), 28(1), 131–142.
AGLM17
Arora, S., Ge, R., Liang, Y., Ma, T., & Zhang, Y. (2017) Generalization and Equilibrium in Generative Adversarial Nets (GANs). arXiv:1703.00573 [Cs].
Bach13
Bach, F. (2013) Convex relaxations of structured matrix factorizations. arXiv:1309.3117 [Cs, Math].
Bera77
Beran, R. (1977) Minimum Hellinger Distance Estimates for Parametric Models. The Annals of Statistics, 5(3), 445–463. DOI.
BeHK12
Berend, D., Harremoës, P., & Kontorovich, A. (2012) Minimum KL-divergence on complements of $L_1$ balls. arXiv:1206.6544 [Cs, Math].
Bill13
Billingsley, P. (2013) Convergence Of Probability Measures. (2 edition.). Wiley
BüGe11
Bühlmann, P., & van de Geer, S. (2011) Statistics for High-Dimensional Data: Methods, Theory and Applications. (2011 edition.). Heidelberg ; New York: Springer
Csis72
Csiszár, I. (1972) A class of measures of informativity of observation channels. Periodica Mathematica Hungarica, 2(1–4), 191–213. DOI.
Csis75
Csiszár, I. (1975) I-Divergence Geometry of Probability Distributions and Minimization Problems. The Annals of Probability, 3(1), 146–158.
Gila10
Gilardoni, G. L.(2010) On Pinsker’s Type Inequalities and Csiszár’s f-divergences Part I: Second and Fourth-Order Inequalities. IEEE Transactions on Information Theory, 56(11), 5377–5386. DOI.
GFTS08
Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., & Smola, A. J.(2008) A Kernel Statistical Test of Independence. In Advances in Neural Information Processing Systems 20: Proceedings of the 2007 Conference. Cambridge, MA: MIT Press
Hall87
Hall, P. (1987) On Kullback-Leibler Loss and Density Estimation. The Annals of Statistics, 15(4), 1491–1519. DOI.
HaTu12
Harremoës, P., & Tusnády, G. (2012) Information Divergence is more chi squared distributed than the chi squared statistics. arXiv:1202.1125 [Cs, Math, Stat].
HaIb90
Hasminskii, R., & Ibragimov, I. (1990) On Density Estimation in the View of Kolmogorov’s Ideas in Approximation Theory. The Annals of Statistics, 18(3), 999–1010. DOI.
KhFG07
Khosravifard, M., Fooladivanda, D., & Gulliver, T. A.(2007) Confliction of the Convexity and Metric Properties in f-Divergences. IEICE Trans. Fundam. Electron. Commun. Comput. Sci., E90–A(9), 1848–1853. DOI.
Kull67
Kullback, S. (1967) A lower bound for discrimination information in terms of variation (Corresp). IEEE Transactions on Information Theory, 13(1), 126–127. DOI.
Kull70
Kullback, S. (1970) Correction to A Lower Bound for Discrimination Information in Terms of Variation. IEEE Transactions on Information Theory, 16(5), 652–652. DOI.
LiVa06
Liese, F., & Vajda, I. (2006) On Divergences and Informations in Statistics and Information Theory. IEEE Transactions on Information Theory, 52(10), 4394–4412. DOI.
Lin91
Lin, J. (1991) Divergence measures based on the Shannon entropy. Information Theory, IEEE Transactions on, 37(1), 145–151. DOI.
MoMC16
Montavon, G., Müller, K.-R., & Cuturi, M. (2016) Wasserstein Training of Restricted Boltzmann Machines. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 3711–3719). Curran Associates, Inc.
MoHe14
Moon, K. R., & Hero III, A. O.(2014) Multivariate f-Divergence Estimation With Confidence. In NIPS 2014.
Niel13
Nielsen, F. (2013) Cramer-Rao Lower Bound and Information Geometry. arXiv:1301.3578 [Cs, Math].
NiNo13
Nielsen, F., & Nock, R. (2013) On the Chi square and higher-order Chi distances for approximating f-divergences. arXiv:1309.3029 [Cs, Math].
Nuss04
Nussbaum, M. (2004) Minimax Risk, Pinsker Bound for. In Encyclopedia of Statistical Sciences. John Wiley & Sons, Inc.
PaVe08
Palomar, D. P., & Verdu, S. (2008) Lautum Information. IEEE Transactions on Information Theory, 54(3), 964–975. DOI.
Pins80
Pinsker, M. S.(1980) Optimal filtration of square-integrable signals in Gaussian noise. Problems in Information Transmiss, 16(2), 120–133.
Rao87
Rao, C. R.(1987) Differential metrics in probability spaces. In Differential geometry in statistical inference (Vol. 10, pp. 217–240). IMS Lecture Notes and Monographs Series, Hayward, CA
ReWi09
Reid, M. D., & Williamson, R. C.(2009) Generalised Pinsker Inequalities. In arXiv:0906.1244 [cs, math].
ReWi11
Reid, M. D., & Williamson, R. C.(2011) Information, Divergence and Risk for Binary Experiments. Journal of Machine Learning Research, 12(Mar), 731–817.
Rény59
Rényi, A. (1959) On the dimension and entropy of probability distributions. Acta Mathematica Academiae Scientiarum Hungarica, 10(1–2), 193–215. DOI.
Saga05
Sagara, N. (2005) Nonparametric maximum-likelihood estimation of probability measures: existence and consistency. Journal of Statistical Planning and Inference, 133(2), 249–271. DOI.
SaVe16
Sason, I., & Verdú, S. (2016) f-Divergence Inequalities via Functional Domination. arXiv:1610.09110 [Cs, Math, Stat].
SGSS07
Smola, A., Gretton, A., Song, L., & Schölkopf, B. (2007) A Hilbert Space Embedding for Distributions. In M. Hutter, R. A. Servedio, & E. Takimoto (Eds.), Algorithmic Learning Theory (pp. 13–31). Springer Berlin Heidelberg
SHSF09
Song, L., Huang, J., Smola, A., & Fukumizu, K. (2009) Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 961–968). New York, NY, USA: ACM DOI.
SGFL08
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G., & Schölkopf, B. (2008) Injective Hilbert Space Embeddings of Probability Measures. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008).
SFGS12
Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., & Lanckriet, G. R. G.(2012) On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6, 1550–1599. DOI.
SGFS10
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., & Lanckriet, G. R. G.(2010) Hilbert Space Embeddings and Metrics on Probability Measures. Journal of Machine Learning Research, 11, 1517–1561.
Geer14
van de Geer, S. (2014) Statistical Theory for High-Dimensional Models. arXiv:1409.8557 [Math, Stat].
ZPJS12
Zhang, K., Peters, J., Janzing, D., & Schölkopf, B. (2012) Kernel-based Conditional Independence Test and Application in Causal Discovery. arXiv:1202.3775 [Cs, Stat].