The Living Thing / Notebooks :

Optimal transport distances

Wasserstein distances, Monge-Kantorovich metrics, Earthmover distances

I presume there are other uses for optimal transport distances apart from as probability metrics, but so far I only care about them in that context, so this will be skewed that way.

Let \((M,d)\) be a metric space for which every probability measure on \(M\) is a Radon measure. For \(p\ge 1\), let \(\mathcal{P}_p(M)\) denote the collection of all probability measures \(P\) on \(M\) with finite \(p^{\text{th}}\) moment for some \(x_0\) in \(M\), \[\int_{M} d(x, x_{0})^{p} \, \mathrm{d} P (x) < +\infty.\]

Then the \(p^{\text{th}}\) Wasserstein distance between two probability measures \(P\) and \(Q\) in \(\mathcal{P}_p(M)\) is defined as \[W_{p} ( P , Q ):= \left( \inf_{\gamma \in \Pi ( P , Q )} \int_{M \times M} d(x, y)^{p} \, \mathrm{d} \gamma (x, y) \right)^{1/p},\]

where \(\Pi( P , Q )\) denotes the collection of all measures on \(M \times M\) with marginal distributions \(P\) and \(Q\) respectively.

Practically, one usually sees \(p\in\{1,2\}\). For \(p=1\) one uses \[W_1(P,Q)=\inf_{\gamma \in \Pi( P , Q )}\mathbb{E}_{(x,y)\sim \gamma}\left[d(x,y)\right]\]

This is frequently intractable, or at least has no closed form, but you can find it for some useful special cases, or bound/approximate it in others.

TODO: discuss favourable properties of this metric (triangle inequality, bounds on moments etc).

Useful: Two Gaussians may be related thusly for a Wasserstein-2 \(W_2(\mu;\nu):=\inf\mathbb{E}(\Vert X-Y\Vert_2^2)^{1/2}\) for \(X\sim\nu\), \(Y\sim\mu\).

\[\begin{aligned} d&:= W_2(\mathcal{N}(m_1,\Sigma_1);\mathcal{N}(m_2,\Sigma_2))\\ \Rightarrow d^2&= \Vert m_1-m_2\Vert_2^2 + \mathrm{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\] (ref)

But why do you are about such an intractable distance? Because it bounds the errors from approximate distributions.

We know that if \(W_p(\nu\hat{nu}) \leq \epsilon\), then for any L-Lipschitz function \(\phi\), \(|\nu(\phi) - \hat{\nu}(\phi)| \leq L\epsilon.\) See [@HugginsPractical2018] for some specifics.

Kontorovich-Rubinstein duality

Vincent Hermann gives an excellent practical introduction.

“Neural Net distance”

Wasserstein distance with a baked in notion of the capacity of the function class which approximate the true Wasserstein. (Arjovsky et al) Is this actually used?

Fisher distance

Specifically \((p,\nu)\)-Fisher distances, in the terminology of [@HugginsPractical2018]. They use these distances as a computationally tractable proxy for Wasserstein distances. See Fisher distances.

Refs