Quantifying the difference between probability measures: measuring the distributions themselves, e.g. to judge the badness of approximation of a statistical fit. The theory of binary experiments. You probably care about these because you want to work with empirical observations of data drawn from a given distribution, to test for independence, do hypothesis testing, model selection or density estimation, prove convergence for some random variable, prove probability inequalities, or model the distinguishability of the distributions from some process and a generative model of it, as seen in generative adversarial learning. That kind of thing. Frequently the distance here is between a measure and an empirical estimate thereof, but this is no requirement.

A good choice of probability metric might give you a convenient distribution of a test statistic, an efficient loss function to target, simple convergence behaviour for some class of estimator, or simply a warm fuzzy glow.

“Distance” and “metric” both often imply symmetric functions obeying the triangle inequality, but on this page we have a broader church, and include premetrics, metric-like functions which still “go to zero when two things get similar”, without including the other axioms of distances. These are also called divergences. This is still useful for the aforementioned convergence results. I’ll use “true metric” or “true distance” to make it clear when needed. “Contrast” is probably better here, but is less common.

# Overview

**tl;dr** Don’t read my summary, read the epic Reid and Williamson paper, ReWi11, which, in the quiet solitude of my own skull, I refer to as *One regret to rule them all and in divergence bound them*.

Wait, you are still here?

There is also a nifty omnibus of classic relations in Gibbs and Su:

Relationships among probability metrics. A directed arrow from A to B annotated by a function \(h(x)\) means that \(d_A \leq h(d_B)\). The symbol diam Ω denotes the diameter of the probability space Ω; bounds involving it are only useful if Ω is bounded. For Ω finite, \(d_{\text{min}} = \inf_{x,y\in\Omega} d(x,y).\) The probability metrics take arguments μ,ν; “ν dom μ” indicates that the given bound only holds if ν dominates μ. […]

# Norms with respect to Lebesgue measure on the state space

Well now, this is my fancy name. But this is probably the most familiar to many, as it’s a vanilla functional norm-induced metric applied to probability distributions on the state space of the random variable.

The “usual” norms can be applied to densities. Most famously, the \(L_p\) norms (which I will call \(L_k\) norms because I am already using \(p\) for a density).

When written like this, the norm is taken between densities, i.e. Radon-Nikodym derivatives, not distributions. (Although see the Kolmogorov metric for an application of the \(k=\infty\) norm to cumulative distributions.)

A little more generally, consider some RV \(X\sim P\) taking values on \(\mathbb{R}\), where \(P\) is absolutely continuous with respect to the Lebesgue measure \(\lambda\) and so has a Radon-Nikodym derivative (a.k.a. density) \(p=dP/d\lambda\); likewise \(q=dQ/d\lambda\).

\[\begin{aligned} L_k(P,Q)&:= \left\|\frac{dP-dQ}{d\lambda}\right\|_k\\ &=\left[\int \left|\frac{dP-dQ}{d\lambda}\right|^k d\lambda\right]^{1/k}\\ &=\left[\int_\mathbb{R} \left|p(x)-q(x)\right|^k dx\right]^{1/k} \end{aligned}\]
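To make the definition concrete, here is a minimal numerical sketch. The two Gaussians, the grid and the helper names (`gaussian_pdf`, `lk_distance`) are my own illustrative choices, and the integral is approximated by a plain Riemann sum.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of Norm(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def lk_distance(p_vals, q_vals, dx, k):
    """Riemann-sum approximation of [integral |p - q|^k dx]^(1/k) on a uniform grid."""
    return float((np.sum(np.abs(p_vals - q_vals) ** k) * dx) ** (1.0 / k))

# Hypothetical example: P = Norm(0, 1), Q = Norm(1, 1).
x = np.linspace(-12.0, 12.0, 24001)
dx = x[1] - x[0]
p = gaussian_pdf(x, 0.0, 1.0)
q = gaussian_pdf(x, 1.0, 1.0)

l1 = lk_distance(p, q, dx, 1)  # roughly 0.766
l2 = lk_distance(p, q, dx, 2)  # roughly 0.353
```

Both values agree with what you get analytically for this pair of Gaussians; for densities without closed forms the grid approximation is all you have.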

\(L_2\) norms are classics for kernel density estimates, because they allow you to use lots of tasty machinery of spectral function approximation.

\(L_k, k\geq 1\) norms *do* observe the triangle inequality, and \(L_2\) norms have lots of additional features, such as Wiener filtering formulations, and Parseval’s identity, and you get a convenient Hilbert space for free.

There are the standard facts about \(L_k,\,k\geq 1\) spaces over a probability measure (i.e. expectations of arbitrary measurable functions), e.g. domination

\[1\leq k\leq j \Rightarrow \|f\|_k\leq\|f\|_j\]

Hölder’s inequality for probabilities

\[1/k + 1/j \leq 1 \Rightarrow \|fg\|_1\leq \|f\|_k\|g\|_j\]

and the Minkowski (i.e. triangle) inequality

\[\|x+y\|_k \leq \|x\|_k+\|y\|_k\]
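A quick sanity check of these three facts, as a sketch: the empirical measure of a sample is itself a probability measure, so the inequalities should hold exactly for sample averages. The sample distributions and the helper `norm_k` are arbitrary choices of mine.

```python
import numpy as np

def norm_k(f, k):
    """(E|f|^k)^(1/k) under the empirical (probability) measure of the samples."""
    return float(np.mean(np.abs(f) ** k) ** (1.0 / k))

rng = np.random.default_rng(0)
f = rng.normal(size=1000)
g = rng.exponential(size=1000)

# Domination: on a probability space the norms are nondecreasing in k.
domination_ok = norm_k(f, 1) <= norm_k(f, 2) <= norm_k(f, 3)
# Hölder with k = j = 2 (so 1/k + 1/j = 1), i.e. Cauchy-Schwarz.
holder_ok = norm_k(f * g, 1) <= norm_k(f, 2) * norm_k(g, 2)
# Minkowski (triangle inequality) with k = 3.
minkowski_ok = norm_k(f + g, 3) <= norm_k(f, 3) + norm_k(g, 3)
```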

However, the \(L_k\) norm on densities is an awkward choice of distance between probability distributions.

If you transform the random variable by anything other than a linear transform, then your distances transform in an arbitrary way. And we haven’t exploited the non-negativity of probability densities so it might feel as if we are wasting some information – If our estimated density \(q(x)<0,\;\forall x\in A\) for some non empty interval \(A\) then we know it’s plain *wrong*, since probability is never negative.

Also, such norms are not necessarily convenient. Exercise: Given \(N\) i.i.d samples drawn from \(X\sim P= \text{Norm}(\mu,\sigma)\), find a closed form expression for estimates \((\hat{\mu}_N, \hat{\sigma}_N)\) such that the distance \(E_P\|(p-\hat{p})\|_2\) is minimised.

Doing this *directly* is hard, but doing it indirectly can work: if we minimise a *different* distance, such as the KL divergence, we can squeeze the \(L_2\) distance. TODO: come back to this point.

Finally, these feel like setting up an inappropriate problem to solve statistically, since an error is penalised equally everywhere in the state space. Why are errors penalised just as much where \(p\simeq 0\) as where \(p\gg 0\)? Surely there are cases where we care more, or less, about such areas? That leads to, for example…

# \(\phi\)-divergences

Why not call \(P\) close to \(Q\) in a way that depends on the probability weighting of each place? Specifically, some divergence \(R\) like this, using scalar function \(\psi\) and pointwise loss \(\ell\)

\[R(P,Q):=\psi(E_Q(\ell(p(x), q(x))))\]

If we are going to measure divergence here, we also want the properties that \(P=Q\Rightarrow R(P,Q)=0\), and \(R(P,Q)\gt 0 \Rightarrow P\neq Q\). We can get this if we choose some increasing \(\psi\) and \(\ell(s,t)\) such that

\[ \begin{aligned} \begin{array}{rl} \ell(s,t) \geq 0 &\text{ for } s\neq t\\ \ell(s,t)=0 &\text{ for } s=t\\ \end{array} \end{aligned} \]

Let \(\psi\) be the identity function for now, and concentrate on the fiddly bit, \(\ell\). We try a form of function that exploits the non-negativity of densities and penalises the *derivative* of one distribution with respect to the other (resp. the ratio of densities) :

\[\ell(s,t) := \phi(s/t)\]

If \(p(x)=q(x)\) then \(p(x)/q(x)=1\). So to get the right sort of penalty, we choose \(\phi\) to have a minimum where the argument is 1, \(\phi(1)=0\) and \(\phi(t)\geq 0, \forall t\).

It turns out that it’s also wise to take \(\phi\) to be convex. (Exercise: why?) And note that for these not to explode we will now require that \(P\) be dominated by \(Q\) (i.e. \(Q(A)=0\Rightarrow P(A)=0,\, \forall A \in\text{Borel}(\mathbb{R})\)).

Putting this all together, we have a family of divergences

\[D_\phi(P,Q) := E_Q\phi\left(\frac{dP}{dQ}\right)\]

And BAM! These are the \(\phi\)-divergences. You get a different one for each choice of \(\phi\).

a.k.a. Csiszár-divergences, \(f\)-divergences or Ali-Silvey distances, after the people who noticed them. (AlSi66, Csis72)

These are in general mere premetrics. And note they are no longer in general symmetric. We should not necessarily expect

\[D_\phi(Q,P) = E_P\phi\left(\frac{dQ}{dP}\right)\]

to be equal to

\[D_\phi(P,Q) = E_Q\phi\left(\frac{dP}{dQ}\right)\]

Anyway, back to concreteness, and recall our well-behaved continuous random variables; we can write, in this case,

\[D_\phi(P,Q) = \int_\mathbb{R}\phi\left(\frac{p(x)}{q(x)}\right)q(x)dx\]
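For discrete distributions the integral becomes a sum, which makes the whole family easy to play with. The following is a sketch under my own assumptions: strictly positive probability vectors (so that domination is trivially satisfied), and hypothetical helper names. The three \(\phi\)s are the ones explored in this section.

```python
import numpy as np

def phi_divergence(p, q, phi):
    """D_phi(P, Q) = sum_i q_i * phi(p_i / q_i) for discrete P, Q with all q_i > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * phi(p / q)))

def phi_kl(t):
    return t * np.log(t)            # Kullback-Leibler (needs t > 0)

def phi_tv(t):
    return np.abs(t - 1.0)          # total variation

def phi_hellinger(t):
    return (np.sqrt(t) - 1.0) ** 2  # squared Hellinger

# Hypothetical example distributions on three points.
p = [0.5, 0.3, 0.2]
q = [0.1, 0.6, 0.3]

kl_pq = phi_divergence(p, q, phi_kl)   # roughly 0.516
kl_qp = phi_divergence(q, p, phi_kl)   # roughly 0.377, so not symmetric
tv_pq = phi_divergence(p, q, phi_tv)
tv_qp = phi_divergence(q, p, phi_tv)
```

Swapping in a different `phi` is all it takes to get a different divergence; every choice here returns 0 when the two arguments are equal.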

Let’s explore some \(\phi\)s.

## Kullback-Leibler divergence

We take \(\phi(t)=t \ln t\), and write the corresponding divergence, \(D_\text{KL}=\operatorname{KL}\),

\[\begin{aligned} \operatorname{KL}(P,Q) &= E_Q\phi\left(\frac{p(x)}{q(x)}\right) \\ &= \int_\mathbb{R}\phi\left(\frac{p(x)}{q(x)}\right)q(x)dx \\ &= \int_\mathbb{R}\left(\frac{p(x)}{q(x)}\right)\ln \left(\frac{p(x)}{q(x)}\right) q(x)dx \\ &= \int_\mathbb{R} \ln \left(\frac{p(x)}{q(x)}\right) p(x)dx \end{aligned}\]

Indeed, if \(P\) is absolutely continuous wrt \(Q\),

\[\operatorname{KL}(P,Q) = E_P\log \left(\frac{dP}{dQ}\right)\]

This is one of many possible derivations of the Kullback-Leibler divergence, a.k.a. *KL divergence* or *relative entropy*. It pops up because of, e.g., its information-theoretic significance.
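As a worked instance, the KL divergence between two univariate Gaussians has the well-known closed form \(\ln(\sigma_2/\sigma_1) + \frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}-\frac{1}{2}\). The sketch below compares it with a Riemann-sum approximation of \(\int p\ln(p/q)\,dx\); the parameter values, the grid and the function names are arbitrary choices of mine.

```python
import numpy as np

def kl_gaussians(mu1, s1, mu2, s2):
    """Closed-form KL(Norm(mu1, s1^2), Norm(mu2, s2^2)); s1, s2 are standard deviations."""
    return float(np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2.0 * s2**2) - 0.5)

def log_gaussian_pdf(x, mu, s):
    """Log density of Norm(mu, s^2); working in logs avoids 0 * log(0) in the tails."""
    return -0.5 * ((x - mu) / s) ** 2 - np.log(s * np.sqrt(2.0 * np.pi))

x = np.linspace(-15.0, 15.0, 60001)
dx = x[1] - x[0]
log_p = log_gaussian_pdf(x, 0.0, 1.0)
log_q = log_gaussian_pdf(x, 1.0, 2.0)

# Integral of p(x) * ln(p(x) / q(x)) dx, approximated on the grid.
kl_numeric = float(np.sum(np.exp(log_p) * (log_p - log_q)) * dx)
kl_closed = kl_gaussians(0.0, 1.0, 1.0, 2.0)  # roughly 0.443
```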

TODO: revisit in maximum likelihood and variational inference settings, where we have good algorithms exploiting its nice properties.

## Total variation distance

Take \(\phi(t)=|t-1|\). We write \(\delta(P,Q)\) for the divergence. I will use the set \(A:=\left\{x:\frac{dP}{dQ}\geq 1\right\}=\{x:dP\geq dQ\}.\)

\[\begin{aligned} \delta(P,Q) &= E_Q\left|\frac{dP}{dQ}-1\right| \\ &= \int_A \left(\frac{dP}{dQ}-1 \right)dQ - \int_{A^C} \left(\frac{dP}{dQ}-1 \right)dQ\\ &= \int_A \frac{dP}{dQ} dQ - \int_A 1 dQ - \int_{A^C} \frac{dP}{dQ}dQ + \int_{A^C} 1 dQ\\ &= \int_A dP - \int_A dQ - \int_{A^C} dP + \int_{A^C} dQ\\ &= P(A) - Q(A) - P(A^C) + Q(A^C)\\ &= 2[P(A) - Q(A)] \\ &= 2[Q(A^C) - P(A^C)] \\ \text{ i.e. } &= 2\left[P(\{dP\geq dQ\})-Q(\{dP\geq dQ\})\right] \end{aligned}\]

I have also used the standard fact that for any probability measure \(P\) and \(P\)-measurable set, \(A\), it holds that \(P(A)=1-P(A^C)\).

Equivalently, up to a factor of 2,

\[\frac{1}{2}\delta(P,Q) =\sup_{B \in \sigma(Q)} \left\{ |P(B) - Q(B)| \right\}\]

To see that \(A\) attains that supremum, we note that for any set \(B\supseteq A,\, B:=A\cup Z\) for some \(Z\) disjoint from \(A\), it follows that \(|P(B) - Q(B)|\leq |P(A) - Q(A)|\) since, on \(Z,\, dP/dQ\leq 1\), by construction.

It should be clear that this is symmetric.
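A brute-force check on a small discrete example, as a sketch (the two distributions are made up). It compares the \(\phi\)-divergence form, the expression in terms of the set \(A\), and the largest gap \(|P(B)-Q(B)|\) over all events \(B\); note that the \(\phi\) form with \(\phi(t)=|t-1|\) comes out as twice that largest gap.

```python
import itertools
import numpy as np

p = np.array([0.5, 0.2, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

# phi-divergence form: E_Q |dP/dQ - 1|.
tv_phi = float(np.sum(q * np.abs(p / q - 1.0)))

# Via the set A = {x : p(x) >= q(x)}: 2 [P(A) - Q(A)].
A = p >= q
tv_set = 2.0 * float(p[A].sum() - q[A].sum())

# Largest |P(B) - Q(B)| over all 2^4 events B, enumerated as boolean masks.
sup_gap = max(
    abs(float(p[np.array(bits)].sum() - q[np.array(bits)].sum()))
    for bits in itertools.product([False, True], repeat=len(p))
)
```

Here `tv_phi` and `tv_set` are both 0.5, while `sup_gap` is 0.25 and is attained at \(B=A\).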

Supposedly, KhFG07 show that this is the only possible *f*-divergence which is also a true distance, but I can’t access that paper to see how.

TODO: Prove that for myself. Is the representation of divergences as “simple” divergences helpful? See ReWi09 (credited to Österreicher and Wajda).

TODO: talk about triangle inequalities.

## Hellinger divergence

For this one, we write \(H^2(P,Q)\), and take \(\phi(t):=(\sqrt{t}-1)^2\). Step-by-step, that becomes

\[\begin{aligned} H^2(P,Q) &:=E_Q \left(\sqrt{\frac{dP}{dQ}}-1\right)^2 \\ &= \int \left(\sqrt{\frac{dP}{dQ}}-1\right)^2 dQ\\ &= \int \frac{dP}{dQ} dQ -2\int \sqrt{\frac{dP}{dQ}} dQ +\int dQ\\ &= \int dP -2\int \sqrt{\frac{dP}{dQ}} dQ +\int dQ\\ &= \int \sqrt{dP}^2 -2\int \sqrt{dP}\sqrt{dQ} +\int \sqrt{dQ}^2\\ &=\int (\sqrt{dP}-\sqrt{dQ})^2 \end{aligned}\]

It turns out to be another symmetrical \(\phi\)-divergence. The square root of the Hellinger divergence \(H=\sqrt{H^2}\) is the Hellinger distance on the space of probability measures which is a true distance. (exercise: prove)
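Not a proof of the exercise, but a numerical sketch of the claims for discrete distributions: the \(\phi\) form agrees with \(\sum_i(\sqrt{p_i}-\sqrt{q_i})^2\), the divergence is symmetric, and \(H\) satisfies the triangle inequality on a few randomly drawn distributions. The Dirichlet draws and helper names are my own choices.

```python
import numpy as np

def hellinger_sq(p, q):
    """H^2(P, Q) = sum_i (sqrt(p_i) - sqrt(q_i))^2, without a 1/2 normalisation."""
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def hellinger(p, q):
    return float(np.sqrt(hellinger_sq(p, q)))

rng = np.random.default_rng(0)
p, q, r = rng.dirichlet(np.ones(5), size=3)  # three random distributions on 5 points

# The phi-divergence form E_Q (sqrt(dP/dQ) - 1)^2 gives the same number.
h2_phi = float(np.sum(q * (np.sqrt(p / q) - 1.0) ** 2))

symmetric = abs(hellinger_sq(p, q) - hellinger_sq(q, p)) < 1e-12
triangle = hellinger(p, r) <= hellinger(p, q) + hellinger(q, r) + 1e-12
```

The triangle inequality is no accident: \(H\) is just the Euclidean distance between the vectors \(\sqrt{p}\) and \(\sqrt{q}\).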

It doesn’t look intuitive, but has convenient properties for proving inequalities (simple relationships with other norms, triangle inequality) and magically good estimation properties (Bera77), e.g. in robust statistics.

TODO: make some of these “convenient properties” explicit.

## \(\alpha\)-divergence

a.k.a. Rényi divergences (up to a monotone transformation), which are built on a subfamily of the *f*-divergences with a particular parameterisation. Includes KL, reverse-KL and Hellinger as special cases.

We take \(\phi(t):=\frac{4}{1-\alpha^2} \left(1-t^{(1+\alpha )/2}\right).\)

This gets fiddly to write out in full generality, with various undefined or infinite integrals needing definitions in terms of limits and is supposed to be constructed in terms of “Hellinger integral”…? I will ignore that for now and write out a simple enough version. See ErHa14 or LiVa06 for gory details.

\[D_\alpha(P,Q):=\frac{1}{\alpha-1}\log\int \left(\frac{p}{q}\right)^{\alpha-1}dP\]

## \(\chi^2\) divergence

As made famous by count data significance tests.

For this one, we write \(\chi^2\), and take \(\phi(t):=(t-1)^2\). Then, by the same old process…

\[\begin{aligned} \chi^2(P,Q) &:=E_Q \left(\frac{dP}{dQ}-1\right)^2 \\ &= \int \left(\frac{dP}{dQ}-1\right)^2 dQ\\ &= \int \left(\frac{dP}{dQ}\right)^2 dQ - 2 \int \frac{dP}{dQ} dQ + \int dQ\\ &= \int \frac{dP}{dQ} dP - 1 \end{aligned}\]

Normally you see this for discrete data indexed by \(i\), in which case we may write

\[\begin{aligned} \chi^2(P,Q) &= \left(\sum_i \frac{p_i}{q_i} p_i\right) - 1\\ &= \sum_i\left( \frac{p_i^2}{q_i} - q_i\right)\\ &= \sum_i \frac{p_i^2-q_i^2}{q_i}\\ \end{aligned}\]

If you have constructed these discrete probability mass functions from \(N\) samples, say, \(p_i:=\frac{n^P_i}{N}\) and \(q_i:=\frac{n^Q_i}{N}\), this becomes

\[\chi^2(P,Q) = \sum_i \frac{(n^P_i)^2-(n^Q_i)^2}{Nn^Q_i}\]
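A small check with made-up counts, as a sketch: the count-based expression above should agree with the more familiar form \(\sum_i (p_i-q_i)^2/q_i\), since the two differ only by \(\sum_i q_i - 2\sum_i p_i + 1 = 0\).

```python
import numpy as np

# Hypothetical counts over three cells; both samples have the same size N.
n_p = np.array([30, 50, 20])
n_q = np.array([25, 50, 25])
N = int(n_p.sum())
assert N == int(n_q.sum()), "this sketch assumes equal sample sizes"

p = n_p / N
q = n_q / N

chi2_from_counts = float(np.sum((n_p**2 - n_q**2) / (N * n_q)))
chi2_pearson = float(np.sum((p - q) ** 2 / q))
```

Both come out as 0.02 for these counts.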

This is probably familiar from some primordial statistics class.

The main use of this one is its ancient pedigree, (used by Pearson in 1900, according to Wikipedia) and its non-controversiality, so you include it in lists wherein you wish to mention you have a hipper alternative.

TBD: Reverse Pinsker inequalities (e.g. BeHK12), and covering numbers and other such horrors.

## Hellinger inequalities

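With respect to the total variation distance, and in the normalisation where \(H^2(P,Q)=\frac{1}{2}\int(\sqrt{dP}-\sqrt{dQ})^2\) and \(\delta(P,Q)=\sup_B|P(B)-Q(B)|\), so that both lie in \([0,1]\),

\[H^2(P,Q) \leq \delta(P,Q) \leq \sqrt 2 H(P,Q)\,.\]

\[H^2(P,Q) \leq \operatorname{KL}(P,Q)\]

Additionally,

\[0\leq H^2(P,Q) \leq H(P,Q) \leq 1\]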

## Pinsker inequalities

BeHK12 attribute this to Csiszár (1967 article I could not find) and Kullback (Kull70) instead of Pins80 (which is in any case in Russian and I haven’t read it).

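\[\delta(P,Q) \leq \sqrt{\frac{1}{2} D_{K L}(P\|Q)}\]

Here \(\delta(P,Q)=\sup_B|P(B)-Q(B)|\) is the total variation normalised to lie in \([0,1]\), i.e. half of the \(\phi\)-divergence form \(E_Q|dP/dQ-1|\).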
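A numerical spot check of the inequality, as a sketch: draw random pairs of discrete distributions (Dirichlet draws, my arbitrary choice) and compare the total variation in its \([0,1]\) normalisation, \(\frac{1}{2}\sum_i|p_i-q_i|\), against \(\sqrt{\operatorname{KL}/2}\).

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 random pairs of strictly positive distributions on 6 points.
ps = rng.dirichlet(2.0 * np.ones(6), size=1000)
qs = rng.dirichlet(2.0 * np.ones(6), size=1000)

tv = 0.5 * np.sum(np.abs(ps - qs), axis=1)   # sup_B |P(B) - Q(B)|, in [0, 1]
kl = np.sum(ps * np.log(ps / qs), axis=1)    # KL(P, Q), natural log

pinsker_holds = bool(np.all(tv <= np.sqrt(0.5 * kl) + 1e-12))
slack = float(np.min(np.sqrt(0.5 * kl) - tv))  # how close the bound comes to binding
```

This is evidence, not proof; it only shows the bound is not violated on this sample.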

ReWi09 derive the best-possible generalised Pinsker inequalities, in a certain sense of “best” and “generalised”, i.e. they are tight bounds, but not necessarily convenient.

Here are the most useful 3 of their inequalities: (\(P,Q\) arguments omitted; here \(\delta=E_Q|dP/dQ-1|\in[0,2]\) and \(H^2\) is unnormalised)

\[\begin{aligned} H^2 &\geq 2-\sqrt{4-\delta^2} \\ \chi^2 &\geq \mathbb{I}\{\delta\lt 1\}\delta^2+\mathbb{I}\{\delta\geq 1\}\frac{\delta}{2-\delta}\\ \operatorname{KL} &\geq \min_{\beta\in [\delta-2,2-\delta]}\left(\frac{\delta+2-\beta}{4}\right) \log\left(\frac{\beta-2-\delta}{\beta-2+\delta}\right) + \left(\frac{\beta+2-\delta}{4}\right) \log\left(\frac{\beta+2-\delta}{\beta+2+\delta}\right) \end{aligned}\]

# Integral probability metrics

TBD. For now, see SGSS07. Weaponized in GFTS08 as an independence test.

Included:

- Total Variation
- Kantorovich/Wasserstein/Mass transport. (TODO: make precise)
- Fortet-Mourier
- Lipschitz (?)
- Maximum Mean Discrepancy, esp. using RKHS-based methods (e.g. SGSS07). Homework: Can you use RKHS methods in all of these?

Analysed in Maximum Mean Discrepancy.

# Wasserstein distance(s)

a.k.a. Optimal transport metrics. Monge-Kantorovich metrics. Earthmover distances.

Let \((M,d)\) be a metric space for which every probability measure on \(M\) is a Radon measure. For \(p\ge 1\), let \(\mathcal{P}_p(M)\) denote the collection of all probability measures \(P\) on \(M\) with finite \(p^{\text{th}}\) moment for some \(x_0\) in \(M\), \[\int_{M} d(x, x_{0})^{p} \, \mathrm{d} P (x) < +\infty.\]

Then the \(p^{\text{th}}\) *Wasserstein distance* between two probability measures \(P\) and \(Q\) in \(\mathcal{P}_p(M)\) is defined as \[W_{p} ( P , Q ):=
\left( \inf_{\gamma \in \Pi ( P , Q )} \int_{M \times M} d(x, y)^{p} \, \mathrm{d} \gamma (x, y) \right)^{1/p},\]

where \(\Pi( P , Q )\) denotes the collection of all measures on \(M \times M\) with marginal distributions \(P\) and \(Q\) respectively.

Practically, one usually sees \(p\in\{1,2\}\). For \(p=1\) one uses \[W_1(P,Q)=\inf_{\gamma \in \Pi( P , Q )}\mathbb{E}_{(x,y)\sim \gamma}\left[d(x,y)\right]\]

This is frequently intractable, or at least has no closed form, but you can find it for some useful special cases.
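One such special case, as a sketch: on the real line with \(d(x,y)=|x-y|\), the optimal coupling between two empirical measures with the same number of atoms matches sorted samples to sorted samples, so \(W_p\) reduces to a sort. The function name and the test data are my own.

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """W_p between two equal-size empirical measures on the real line.

    Uses the monotone (sorted-to-sorted) coupling, which is optimal in 1D for p >= 1.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = rng.permutation(x) + 3.0   # the same sample, shuffled and shifted by 3

w1 = wasserstein_1d(x, y, p=1)  # a pure shift by 3 costs exactly 3
w2 = wasserstein_1d(x, y, p=2)
```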

Useful: Two Gaussians may be related thusly for a Wasserstein-2 \(W_2(\mu;\nu):=\inf\mathbb{E}(\Vert X-Y\Vert_2^2)^{1/2}\) for \(X\sim\nu\), \(Y\sim\mu\).

\[\begin{aligned} d&:= W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))\\ \Rightarrow d^2&= \Vert m_1-m_2\Vert_2^2 + \mathrm{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\] (ref)
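The Gaussian formula is short enough to sketch in NumPy. The matrix square roots here go via an eigendecomposition, which assumes the covariances are symmetric positive semi-definite; `psd_sqrt` and `gaussian_w2` are hypothetical helper names, not from any library.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    w = np.clip(w, 0.0, None)          # guard against tiny negative eigenvalues
    return (V * np.sqrt(w)) @ V.T

def gaussian_w2(m1, S1, m2, S2):
    """W_2 between N(m1, S1) and N(m2, S2) using the closed form above."""
    m1, m2 = np.asarray(m1, dtype=float), np.asarray(m2, dtype=float)
    S1, S2 = np.asarray(S1, dtype=float), np.asarray(S2, dtype=float)
    root1 = psd_sqrt(S1)
    cross = psd_sqrt(root1 @ S2 @ root1)
    d2 = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sqrt(max(d2, 0.0)))

# Diagonal example: d^2 = 25 + (1 - 3)^2 + (2 - 1)^2 = 30.
d = gaussian_w2([0.0, 0.0], np.diag([1.0, 4.0]), [3.0, 4.0], np.diag([9.0, 1.0]))
```

For commuting (e.g. diagonal) covariances the trace term collapses to the squared difference of the standard deviations, which is what the example exploits.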

But why do you care about such an intractable distance? Because it bounds the errors from approximate distributions.

We know that if \(W_p(\nu,\hat{\nu}) \leq \epsilon\), then for any L-Lipschitz function \(\phi\), \(|\nu(\phi) - \hat{\nu}(\phi)| \leq L\epsilon.\) See [@HugginsPractical2018] for some specifics.

## “Neural Net distance”

Wasserstein distance with a baked in notion of the capacity of the discriminators which approximate Wasserstein in some sense. (Arjovsky et al again) Is this actually used? The name is suspiciously awful.

# \((p,\nu)\)-Fisher distance

This is the terminology of [@HugginsPractical2018]. They use this distance as a computationally tractable proxy for the Wasserstein distance.

For a Borel measure \(\nu\), let \(L_p(\nu)\) denote the space of functions that are \(p\)-integrable with respect to \(\nu\): \(\phi \in L_p(\nu) \Leftrightarrow \|\phi\|_{L_p(\nu)} = \left(\int|\phi(\theta)|^p\nu(d\theta)\right)^{1/p} < \infty\). Let \(U = - \log d\eta/d\theta\) and \(\hat{U} = - \log d\hat{\eta}/d\theta\) denote the potential energy functions associated with, respectively, \(\eta\) and \(\hat{\eta}\).

Then the \((p, \nu)\)-Fisher distance is given by \[\begin{aligned} d_{p,\nu}(\eta,\hat{\eta}) &=\left\|\|\nabla U-\nabla \hat{U}\|_2\right\|_{L_p(\nu)}\\ &= \left(\int\|\nabla U(\theta)-\nabla \hat{U}(\theta)\|_2^p\nu(d\theta)\right)^{1/p}. \end{aligned}\]

This avoids an inconvenient posterior normalising calculation in Bayes.
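A Monte Carlo sketch of the definition, under assumptions of my own: \(\nu\) is a probability measure we can sample from, and both targets are isotropic Gaussians so that the gradients of the potentials are available in closed form. Only gradients of \(U\) and \(\hat{U}\) appear, which is why additive constants (the log normalisers) drop out. The function names are hypothetical.

```python
import numpy as np

def fisher_distance(grad_u, grad_u_hat, nu_samples, p=2):
    """Monte Carlo estimate of the (p, nu)-Fisher distance from samples of nu."""
    diff = grad_u(nu_samples) - grad_u_hat(nu_samples)   # shape (n, d)
    norms = np.linalg.norm(diff, axis=1)                 # Euclidean norm per sample
    return float(np.mean(norms ** p) ** (1.0 / p))

def gaussian_grad_potential(mean, var):
    """Gradient of U = -log density for an isotropic Gaussian N(mean, var * I)."""
    mean = np.asarray(mean, dtype=float)
    return lambda theta: (theta - mean) / var

rng = np.random.default_rng(0)
nu_samples = rng.normal(size=(2000, 2))                  # nu = N(0, I), an arbitrary choice

grad_u = gaussian_grad_potential([0.0, 0.0], 1.0)
grad_u_hat = gaussian_grad_potential([3.0, 4.0], 1.0)

# With equal variances the gradient difference is the constant vector (3, 4),
# so the distance is 5 regardless of nu or p.
d = fisher_distance(grad_u, grad_u_hat, nu_samples, p=2)
```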

# Others

- “P-divergence”: Metrizes convergence in probability. Note this is defined upon random variables with an arbitrary joint distribution, not upon two distributions *per se*.
- Lévy metric: This monster metrizes convergence in distribution. (But doesn’t Wasserstein already do that?) Here \(P\) and \(Q\) denote the cumulative distribution functions.

  \[D_L(P,Q) := \inf\{\epsilon >0: P(x-\epsilon)-\epsilon \leq Q(x)\leq P(x+\epsilon)+\epsilon,\ \forall x\}\]

- Kolmogorov metric: the \(L_\infty\) metric between the cumulative distributions (i.e. not between densities)

  \[D_K(P,Q):= \sup_x \left\{ |P(x) - Q(x)| \right\}\]

  Nonetheless it does *look* similar to Total Variation, doesn’t it?
- Skorokhod: Hmmm.

What even are the Kuiper and Prokhorov metrics?
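Of the metrics in this list, the Kolmogorov one is the easiest to compute between two samples, since empirical CDFs are step functions and the supremum is attained at a sample point. A sketch, with a function name of my own choosing:

```python
import numpy as np

def kolmogorov_distance(x, y):
    """sup_t |F_x(t) - F_y(t)| between the empirical CDFs of two samples."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    pts = np.concatenate([x, y])       # the CDFs only jump at sample points
    cdf_x = np.searchsorted(x, pts, side="right") / x.size
    cdf_y = np.searchsorted(y, pts, side="right") / y.size
    return float(np.max(np.abs(cdf_x - cdf_y)))

d_disjoint = kolmogorov_distance([0.0, 1.0], [2.0, 3.0])            # 1.0
d_interleaved = kolmogorov_distance([0.0, 1.0, 2.0, 3.0],
                                    [0.5, 1.5, 2.5, 3.5])           # 0.25
```

This is the statistic used by the two-sample Kolmogorov-Smirnov test.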

# Induced topologies

There is a synthesis of the importance of the topologies induced by each of these metrics, which I read in Arjovsky et al, and which they credit to Billingsley and Villani.

In this paper, we direct our attention on the various ways to measure how close the model distribution and the real distribution are, or equivalently, on the various ways to define a distance or divergence \(\rho(P_{\theta},P_{r})\). The most fundamental difference between such distances is their impact on the convergence of sequences of probability distributions. A sequence of distributions \((P_{t}) _{t\in \mathbb{N}}\) converges if and only if there is a distribution \(P_{\infty}\) such that \(\rho(P_{t} , P_{\infty} )\) tends to zero, something that depends on how exactly the distance \(\rho\) is defined. Informally, a distance \(\rho\) induces a weaker topology when it makes it easier for a sequence of distribution to converge. […]

In order to optimize the parameter \({\theta}\), it is of course desirable to define our model distribution \(P_{\theta}\) in a manner that makes the mapping \({\theta} \mapsto P_{\theta}\) continuous. Continuity means that when a sequence of parameters \(\theta_t\) converges to \({\theta}\), the distributions \(P_{\theta_t}\) also converge to \(P_{\theta}.\) […] If \(\rho\) is our notion of distance between two distributions, we would like to have a loss function \(\theta \mapsto\rho(P_{\theta},P_r)\) that is continuous[…]

# To read

- This guy’s study blog
- Anand Sarwate: C.R. Rao and information geometry
- ReWi11, relating many of these metrics and a few others to a theory of experiments distinguishing data from two distributions.
- BeHK12 on reverse Pinsker inequalities.

# Refs

- AAPS17: Benjamin Arras, Ehsan Azmoodeh, Guillaume Poly, Yvik Swan (2017) A bound on the 2-Wasserstein distance between linear combinations of independent random variables. *ArXiv:1704.01376 [Math]*.
- Csis72: I. Csiszár (1972) A class of measures of informativity of observation channels. *Periodica Mathematica Hungarica*, 2(1–4), 191–213.
- GiSh84: Clark R. Givens, Rae Michael Shortt (1984) A class of Wasserstein metrics for probability distributions. *The Michigan Mathematical Journal*, 31(2), 231–240.
- AlSi66: S. M. Ali, S. D. Silvey (1966) A General Class of Coefficients of Divergence of One Distribution from Another. *Journal of the Royal Statistical Society. Series B (Methodological)*, 28(1), 131–142.
- SGSS07: Alex Smola, Arthur Gretton, Le Song, Bernhard Schölkopf (2007) A Hilbert Space Embedding for Distributions. In Algorithmic Learning Theory (pp. 13–31). Springer Berlin Heidelberg.
- GFTS08: Arthur Gretton, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, Alexander J Smola (2008) A Kernel Statistical Test of Independence. In Advances in Neural Information Processing Systems 20: Proceedings of the 2007 Conference. Cambridge, MA: MIT Press.
- Kull67: S. Kullback (1967) A lower bound for discrimination information in terms of variation (Corresp). *IEEE Transactions on Information Theory*, 13(1), 126–127.
- AdLu18: Jonas Adler, Sebastian Lunz (2018) Banach Wasserstein GAN.
- Rust19: Raif M. Rustamov (2019) Closed-form Expressions for Maximum Mean Discrepancy with Applications to Wasserstein Auto-Encoders. *ArXiv:1901.03227 [Cs, Stat]*.
- KhFG07: Mohammadali Khosravifard, Dariush Fooladivanda, T. Aaron Gulliver (2007) Confliction of the Convexity and Metric Properties in f-Divergences. *IEICE Trans. Fundam. Electron. Commun. Comput. Sci.*, E90-A(9), 1848–1853.
- Bill13: Patrick Billingsley (2013) *Convergence Of Probability Measures*. Wiley.
- Bach13: Francis Bach (2013) Convex relaxations of structured matrix factorizations. *ArXiv:1309.3117 [Cs, Math]*.
- Kull70: S. Kullback (1970) Correction to A Lower Bound for Discrimination Information in Terms of Variation. *IEEE Transactions on Information Theory*, 16(5), 652–652.
- Niel13: Frank Nielsen (2013) Cramer-Rao Lower Bound and Information Geometry. *ArXiv:1301.3578 [Cs, Math]*.
- Rao87: C R Rao (1987) Differential metrics in probability spaces. In Differential geometry in statistical inference (Vol. 10, pp. 217–240). IMS Lecture Notes and Monographs Series, Hayward, CA.
- Lin91: Jianhua Lin (1991) Divergence measures based on the Shannon entropy. *IEEE Transactions on Information Theory*, 37(1), 145–151.
- SaVe16: Igal Sason, Sergio Verdú (2016) f-Divergence Inequalities via Functional Domination. *ArXiv:1610.09110 [Cs, Math, Stat]*.
- ReWi09: Mark D. Reid, Robert C. Williamson (2009) Generalised Pinsker Inequalities. In arXiv:0906.1244 [cs, math].
- AGLM17: Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, Yi Zhang (2017) Generalization and Equilibrium in Generative Adversarial Nets (GANs). *ArXiv:1703.00573 [Cs]*.
- SGFS10: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R. G. Lanckriet (2010) Hilbert Space Embeddings and Metrics on Probability Measures. *Journal of Machine Learning Research*, 11, 1517–1561.
- SHSF09: Le Song, Jonathan Huang, Alex Smola, Kenji Fukumizu (2009) Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 961–968). New York, NY, USA: ACM.
- Csis75: I. Csiszár (1975) I-Divergence Geometry of Probability Distributions and Minimization Problems. *The Annals of Probability*, 3(1), 146–158.
- ReWi11: Mark D. Reid, Robert C. Williamson (2011) Information, Divergence and Risk for Binary Experiments. *Journal of Machine Learning Research*, 12(Mar), 731–817.
- SGFL08: B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, B. Schölkopf (2008) Injective Hilbert Space Embeddings of Probability Measures. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008).
- ZPJS12: Kun Zhang, Jonas Peters, Dominik Janzing, Bernhard Schölkopf (2012) Kernel-based Conditional Independence Test and Application in Causal Discovery. *ArXiv:1202.3775 [Cs, Stat]*.
- PaVe08: D.P. Palomar, S. Verdu (2008) Lautum Information. *IEEE Transactions on Information Theory*, 54(3), 964–975.
- CaRo12: Guillermo D. Canas, Lorenzo Rosasco (2012) Learning Probability Measures with respect to Optimal Transport Metrics. *ArXiv:1209.1077 [Cs, Stat]*.
- SiPó18: Shashank Singh, Barnabás Póczos (2018) Minimax Distribution Estimation in Wasserstein Distance. *ArXiv:1802.08855 [Cs, Math, Stat]*.
- Nuss04: Michael Nussbaum (2004) Minimax Risk, Pinsker Bound for. In Encyclopedia of Statistical Sciences. John Wiley & Sons, Inc.
- Bera77: Rudolf Beran (1977) Minimum Hellinger Distance Estimates for Parametric Models. *The Annals of Statistics*, 5(3), 445–463.
- BeHK12: Daniel Berend, Peter Harremoës, Aryeh Kontorovich (2012) Minimum KL-divergence on complements of \(L_1\) balls. *ArXiv:1206.6544 [Cs, Math]*.
- MoHe14: Kevin R. Moon, Alfred O. Hero III (2014) Multivariate f-Divergence Estimation With Confidence. In NIPS 2014.
- Saga05: Nobusumi Sagara (2005) Nonparametric maximum-likelihood estimation of probability measures: existence and consistency. *Journal of Statistical Planning and Inference*, 133(2), 249–271.
- GiSu02: Alison L. Gibbs, Francis Edward Su (2002) On Choosing and Bounding Probability Metrics. *International Statistical Review*, 70(3), 419–435.
- HaIb90: Rafael Hasminskii, Ildar Ibragimov (1990) On Density Estimation in the View of Kolmogorov’s Ideas in Approximation Theory. *The Annals of Statistics*, 18(3), 999–1010.
- LiVa06: F Liese, I Vajda (2006) On Divergences and Informations in Statistics and Information Theory. *IEEE Transactions on Information Theory*, 52(10), 4394–4412.
- Hall87: Peter Hall (1987) On Kullback-Leibler Loss and Density Estimation. *The Annals of Statistics*, 15(4), 1491–1519.
- Gila10: Gustavo L. Gilardoni (2010) On Pinsker’s Type Inequalities and Csiszár’s f-divergences Part I: Second and Fourth-Order Inequalities. *IEEE Transactions on Information Theory*, 56(11), 5377–5386.
- NiNo13: Frank Nielsen, Richard Nock (2013) On the Chi square and higher-order Chi distances for approximating f-divergences. *ArXiv:1309.3029 [Cs, Math]*.
- Rény59: A. Rényi (1959) On the dimension and entropy of probability distributions. *Acta Mathematica Academiae Scientiarum Hungarica*, 10(1–2), 193–215.
- SFGS12: Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Gert R. G. Lanckriet (2012) On the empirical estimation of integral probability metrics. *Electronic Journal of Statistics*, 6, 1550–1599.
- Pins80: M. S. Pinsker (1980) Optimal filtration of square-integrable signals in Gaussian noise. *Problems in Information Transmiss*, 16(2), 120–133.
- Maur18: Abhinav Maurya (2018) Optimal Transport in Statistical Machine Learning : Selected Review and Some Open Questions.
- Vill09: Cédric Villani (2009) *Optimal Transport: Old and New*. Berlin Heidelberg: Springer-Verlag.
- HuAB17: Jonathan Huggins, Ryan P Adams, Tamara Broderick (2017) PASS-GLM: polynomial approximate sufficient statistics for scalable Bayesian GLM inference. In Advances in Neural Information Processing Systems 30 (pp. 3611–3621). Curran Associates, Inc.
- HCKB18: Jonathan H. Huggins, Trevor Campbell, Mikołaj Kasprzak, Tamara Broderick (2018) Practical bounds on the error of Bayesian posterior approximations: A nonasymptotic approach. *ArXiv:1809.09505 [Cs, Math, Stat]*.
- BaLK17: Olivier Bachem, Mario Lucic, Andreas Krause (2017) Practical Coreset Constructions for Machine Learning. *ArXiv Preprint ArXiv:1703.06476*.
- GHLY17: Xin Guo, Johnny Hong, Tianyi Lin, Nan Yang (2017) Relaxed Wasserstein with Applications to GANs. *ArXiv:1705.07164 [Cs, Stat]*.
- ErHa14: Tim van Erven, Peter Harremoës (2014) Rényi Divergence and Kullback-Leibler Divergence. *IEEE Transactions on Information Theory*, 60(7), 3797–3820.
- GiAL13: M. Gil, F. Alajaji, T. Linder (2013) Rényi divergence measures for commonly used univariate continuous distributions. *Information Sciences*, 249(Supplement C), 124–131.
- Geer14: Sara van de Geer (2014) Statistical Theory for High-Dimensional Models. *ArXiv:1409.8557 [Math, Stat]*.
- BüGe11: Peter Bühlmann, Sara van de Geer (2011) *Statistics for High-Dimensional Data: Methods, Theory and Applications*. Heidelberg ; New York: Springer.
- ArCB17: Martin Arjovsky, Soumith Chintala, Léon Bottou (2017) Wasserstein Generative Adversarial Networks. In International Conference on Machine Learning (pp. 214–223).
- MoMC16: Grégoire Montavon, Klaus-Robert Müller, Marco Cuturi (2016) Wasserstein Training of Restricted Boltzmann Machines. In Advances in Neural Information Processing Systems 29 (pp. 3711–3719). Curran Associates, Inc.
- BoVi05: François Bolley, Cédric Villani (2005) Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. *Annales de La Faculté Des Sciences de Toulouse Mathématiques*, 14(3), 331–352.