I presume there are other uses for optimal transport distances apart from as probability metrics, but so far I only care about them in that context, so this will be skewed that way.

Let \((M,d)\) be a metric space for which every probability measure on \(M\) is a Radon measure. For \(p\ge 1\), let \(\mathcal{P}_p(M)\) denote the collection of all probability measures \(P\) on \(M\) with finite \(p^{\text{th}}\) moment for some \(x_0\) in \(M\), \[\int_{M} d(x, x_{0})^{p} \, \mathrm{d} P (x) < +\infty.\]

Then the \(p^{\text{th}}\) *Wasserstein distance* between two probability measures \(P\) and \(Q\) in \(\mathcal{P}_p(M)\) is defined as \[W_{p} ( P , Q ):=
\left( \inf_{\gamma \in \Pi ( P , Q )} \int_{M \times M} d(x, y)^{p} \, \mathrm{d} \gamma (x, y) \right)^{1/p},\]

where \(\Pi( P , Q )\) denotes the collection of all measures on \(M \times M\) with marginal distributions \(P\) and \(Q\) respectively.

Practically, one usually sees \(p\in\{1,2\}\). For \(p=1\) one uses \[W_1(P,Q)=\inf_{\gamma \in \Pi( P , Q )}\mathbb{E}_{(x,y)\sim \gamma}\left[d(x,y)\right]\]

This is frequently intractable, or at least has no closed form, but you can find it for some useful special cases, or bound/approximate it in others.
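One such special case, given here as a standard fact rather than something derived above: on the real line, for two empirical measures with the same number of equally weighted atoms, the optimal coupling for any \(p\ge 1\) matches the sorted samples in order. A minimal sketch under those assumptions (the function name `wasserstein_1d` is mine, not from any library):

```python
import random


def wasserstein_1d(xs, ys, p=1):
    """W_p between two empirical measures on the real line.

    Assumes both samples have the same number of atoms, each with weight
    1/n, so that pairing the sorted samples in order is an optimal coupling.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("expected two non-empty samples of equal length")
    pairs = zip(sorted(xs), sorted(ys))
    mean_cost = sum(abs(x - y) ** p for x, y in pairs) / len(xs)
    return mean_cost ** (1.0 / p)


# Shifting every atom by a constant c moves each sorted pair by exactly c,
# so the distance should come out as |c| for every p.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
shifted = [x + 3.0 for x in sample]
w1 = wasserstein_1d(sample, shifted, p=1)
w2 = wasserstein_1d(sample, shifted, p=2)
```

For unequal sample sizes, weighted atoms, or dimension above one, this shortcut no longer applies and one needs a general optimal transport solver.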

TODO: discuss favourable properties of this metric (triangle inequality, bounds on moments etc).

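Useful: the Wasserstein-2 distance between two Gaussians has a closed form. Here \(W_2(\mu;\nu):=\inf\left(\mathbb{E}\Vert X-Y\Vert_2^2\right)^{1/2}\), where the infimum is over all joint distributions of \((X,Y)\) with marginals \(X\sim\nu\), \(Y\sim\mu\).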

\[\begin{aligned} d&:= W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))\\ \Rightarrow d^2&= \Vert m_1-m_2\Vert_2^2 + \mathrm{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\] (ref)
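The closed form above is easy to evaluate numerically. The following is a sketch, not a reference implementation: the names `psd_sqrt` and `gaussian_w2` are my own, and I assume both covariance matrices are symmetric positive semi-definite so that an eigendecomposition gives the matrix square roots.

```python
import numpy as np


def psd_sqrt(A):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    A = np.asarray(A, dtype=float)
    eigvals, eigvecs = np.linalg.eigh((A + A.T) / 2.0)
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T


def gaussian_w2(m1, S1, m2, S2):
    """W_2 between N(m1, S1) and N(m2, S2) via the closed-form expression."""
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    S1 = np.atleast_2d(np.asarray(S1, dtype=float))
    S2 = np.atleast_2d(np.asarray(S2, dtype=float))
    root1 = psd_sqrt(S1)
    cross = psd_sqrt(root1 @ S2 @ root1)  # (S1^{1/2} S2 S1^{1/2})^{1/2}
    d2 = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sqrt(max(d2, 0.0)))  # guard against round-off below zero


# Example with commuting (diagonal) covariances.
m1, S1 = np.zeros(2), np.diag([1.0, 4.0])
m2, S2 = np.array([3.0, 4.0]), np.diag([9.0, 16.0])
d = gaussian_w2(m1, S1, m2, S2)
```

When the covariances commute, as in this diagonal example, the trace term reduces to \(\Vert\Sigma_1^{1/2}-\Sigma_2^{1/2}\Vert_F^2\), so here \(d^2 = 25 + (1-3)^2 + (2-4)^2 = 33\).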

But why do you care about such an intractable distance? Because it bounds the errors from approximate distributions.

We know that if \(W_p(\nu, \hat{\nu}) \leq \epsilon\), then for any \(L\)-Lipschitz function \(\phi\), \(|\nu(\phi) - \hat{\nu}(\phi)| \leq L\epsilon,\) where \(\nu(\phi):=\int \phi \, \mathrm{d}\nu\). See [@HugginsPractical2018] for some specifics.
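A sketch of why this holds, as I understand it: take any coupling \(\gamma\) of \(\nu\) and \(\hat{\nu}\), apply the Lipschitz property, then Jensen's inequality (which needs \(p\ge 1\)):

\[\begin{aligned}
|\nu(\phi)-\hat{\nu}(\phi)|
&= \left|\int_{M\times M} \left(\phi(x)-\phi(y)\right) \, \mathrm{d}\gamma(x,y)\right| \\
&\le L \int_{M\times M} d(x,y) \, \mathrm{d}\gamma(x,y) \\
&\le L \left(\int_{M\times M} d(x,y)^{p} \, \mathrm{d}\gamma(x,y)\right)^{1/p}.
\end{aligned}\]

Taking the infimum over all couplings \(\gamma\) on the right-hand side gives \(|\nu(\phi)-\hat{\nu}(\phi)|\le L\,W_p(\nu,\hat{\nu})\le L\epsilon\).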

## Kantorovich-Rubinstein duality

Vincent Herrmann gives an excellent practical introduction.
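For orientation, the usual statement of the duality (a standard result, written in the notation of the definitions above) is that for \(p=1\) the infimum over couplings can be swapped for a supremum over 1-Lipschitz test functions \(f:M\to\mathbb{R}\):

\[W_1(P,Q)=\sup_{\Vert f\Vert_{\mathrm{Lip}}\le 1}\left(\mathbb{E}_{x\sim P}\left[f(x)\right]-\mathbb{E}_{y\sim Q}\left[f(y)\right]\right).\]

Here \(\Vert f\Vert_{\mathrm{Lip}}\le 1\) means \(|f(x)-f(y)|\le d(x,y)\) for all \(x,y\) in \(M\).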

## “Neural Net distance”

Wasserstein distance with a baked-in notion of the capacity of the function class which approximates the true Wasserstein distance (Arjovsky et al). Is this actually used?

## Fisher distance

Specifically \((p,\nu)\)-Fisher distances, in the terminology of [@HugginsPractical2018]. They use these distances as a computationally tractable proxy for Wasserstein distances. See Fisher distances.

# Refs

- AAPS17: Benjamin Arras, Ehsan Azmoodeh, Guillaume Poly, Yvik Swan (2017) A bound on the 2-Wasserstein distance between linear combinations of independent random variables. *ArXiv:1704.01376 [Math]*.
- GiSh84: Clark R. Givens, Rae Michael Shortt (1984) A class of Wasserstein metrics for probability distributions. *The Michigan Mathematical Journal*, 31(2), 231–240.
- Rust19: Raif M. Rustamov (2019) Closed-form Expressions for Maximum Mean Discrepancy with Applications to Wasserstein Auto-Encoders. *ArXiv:1901.03227 [Cs, Stat]*.
- BoGG12: François Bolley, Ivan Gentil, Arnaud Guillin (2012) Convergence to equilibrium in Wasserstein distance for Fokker–Planck equations. *Journal of Functional Analysis*, 263(8), 2430–2457.
- SGPC15: Justin Solomon, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, … Leonidas Guibas (2015) Convolutional Wasserstein Distances: Efficient Optimal Transportation on Geometric Domains. *ACM Transactions on Graphics*, 34(4), 66:1–66:11.
- MoKu18: Peyman Mohajerin Esfahani, Daniel Kuhn (2018) Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. *Mathematical Programming*, 171(1), 115–166.
- GaKl16: Rui Gao, Anton J. Kleywegt (2016) Distributionally Robust Stochastic Optimization with Wasserstein Distance. *ArXiv:1604.02199 [Math]*.
- AGLM17: Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, Yi Zhang (2017) Generalization and Equilibrium in Generative Adversarial Nets (GANs). *ArXiv:1703.00573 [Cs]*.
- CaRo12: Guillermo D. Canas, Lorenzo Rosasco (2012) Learning Probability Measures with respect to Optimal Transport Metrics. *ArXiv:1209.1077 [Cs, Stat]*.
- FZMA15: Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, Tomaso A Poggio (2015) Learning with a Wasserstein Loss. In Advances in Neural Information Processing Systems 28 (pp. 2053–2061). Curran Associates, Inc.
- SiPó18: Shashank Singh, Barnabás Póczos (2018) Minimax Distribution Estimation in Wasserstein Distance. *ArXiv:1802.08855 [Cs, Math, Stat]*.
- KnSm84: M. Knott, C. S. Smith (1984) On the optimal mapping of distributions. *Journal of Optimization Theory and Applications*, 43(1), 39–49.
- Taka08: Asuka Takatsu (2008) On Wasserstein geometry of the space of Gaussian measures.
- Maur18: Abhinav Maurya (2018) Optimal Transport in Statistical Machine Learning : Selected Review and Some Open Questions.
- Vill09: Cédric Villani (2009) *Optimal Transport: Old and New*. Berlin Heidelberg: Springer-Verlag.
- Zeme12: Yoav Zemel (2012) Optimal Transportation: Continuous and Discrete.
- HuAB17: Jonathan Huggins, Ryan P Adams, Tamara Broderick (2017) PASS-GLM: polynomial approximate sufficient statistics for scalable Bayesian GLM inference. In Advances in Neural Information Processing Systems 30 (pp. 3611–3621). Curran Associates, Inc.
- HCKB18: Jonathan H. Huggins, Trevor Campbell, Mikołaj Kasprzak, Tamara Broderick (2018) Practical bounds on the error of Bayesian posterior approximations: A nonasymptotic approach. *ArXiv:1809.09505 [Cs, Math, Stat]*.
- BaLK17: Olivier Bachem, Mario Lucic, Andreas Krause (2017) Practical Coreset Constructions for Machine Learning. *ArXiv Preprint ArXiv:1703.06476*.
- GHLY17: Xin Guo, Johnny Hong, Tianyi Lin, Nan Yang (2017) Relaxed Wasserstein with Applications to GANs. *ArXiv:1705.07164 [Cs, Stat]*.
- WeBa17: Jonathan Weed, Francis Bach (2017) Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. *ArXiv:1707.00087 [Math, Stat]*.
- Cutu13: Marco Cuturi (2013) Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances.
- PaZe19: Victor M. Panaretos, Yoav Zemel (2019) Statistical Aspects of Wasserstein Distances. *Annual Review of Statistics and Its Application*, 6(1), 405–431.
- ChDJ08: T. Champion, L. De Pascale, P. Juutinen (2008) The \(\infty\)-Wasserstein Distance: Local Solutions and Existence of Optimal Transport Maps. *SIAM Journal on Mathematical Analysis*, 40(1), 1–20.
- OlPu82: I. Olkin, F. Pukelsheim (1982) The distance between two random vectors with given dispersion matrices. *Linear Algebra and Its Applications*, 48, 257–263.
- DoLa82: D. C. Dowson, B. V. Landau (1982) The Fréchet distance between multivariate normal distributions. *Journal of Multivariate Analysis*, 12(3), 450–455.
- ArCB17: Martin Arjovsky, Soumith Chintala, Léon Bottou (2017) Wasserstein Generative Adversarial Networks. In International Conference on Machine Learning (pp. 214–223).
- MoMC16: Grégoire Montavon, Klaus-Robert Müller, Marco Cuturi (2016) Wasserstein Training of Restricted Boltzmann Machines. In Advances in Neural Information Processing Systems 29 (pp. 3711–3719). Curran Associates, Inc.
- BoVi05: François Bolley, Cédric Villani (2005) Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. *Annales de La Faculté Des Sciences de Toulouse Mathématiques*, 14(3), 331–352.
- Zeme00: (n.d.) Zemel - 2012 - Optimal Transportation Continuous and Discrete.pdf.