Maximum Mean Discrepancy, Hilbert-Schmidt Independence Criterion

August 21, 2016 — March 1, 2024

approximation
functional analysis
Hilbert space
measure
metrics
nonparametric
optimization
probability
statistics

An integral probability metric at the intersection of reproducing kernel methods, dependence tests and probability metrics, in which we use a kernel embedding to measure differences between probability distributions; typically an RKHS embedding, though any old Hilbert space embedding will do something.
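Concretely (up to notational conventions; see Gretton, Borgwardt, et al. 2012 for the real treatment), for an RKHS \(\mathcal{H}\) with kernel \(k\),

\[
\operatorname{MMD}(P, Q)
  = \sup_{\|f\|_{\mathcal{H}} \le 1}
    \left| \mathbb{E}_{X \sim P} f(X) - \mathbb{E}_{Y \sim Q} f(Y) \right|
  = \left\| \mu_P - \mu_Q \right\|_{\mathcal{H}},
\]

where \(\mu_P\) is the kernel mean embedding of \(P\). The squared MMD expands in terms of the kernel alone,

\[
\operatorname{MMD}^2(P, Q)
  = \mathbb{E}\, k(X, X') - 2\, \mathbb{E}\, k(X, Y) + \mathbb{E}\, k(Y, Y'),
\]

with \(X, X' \sim P\) and \(Y, Y' \sim Q\) independent, which is what makes the sample estimators simple.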

Can be estimated from samples only, which is neat.
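A minimal numpy sketch of the unbiased estimator of squared MMD (again, Gretton, Borgwardt, et al. 2012 give the real treatment), assuming a Gaussian kernel with a hand-picked bandwidth:

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    # k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples x and y."""
    m, n = len(x), len(y)
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    # Drop diagonal terms for the unbiased U-statistic
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * kxy.mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(500, 2))
y = rng.normal(0.5, 1.0, size=(500, 2))
print(mmd2_unbiased(x, y, bandwidth=1.0))
```

In practice the bandwidth matters a lot; see the kernel-choice section below.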

A mere placeholder. For a thorough treatment see the canonical references (Gretton et al. 2008; Gretton, Borgwardt, et al. 2012).


1 Tutorial

Arthur Gretton, Dougal Sutherland, and Wittawat Jitkrittum's presentation Interpretable Comparison of Distributions and Models.

Danica Sutherland’s explanation is clear.

Pierre Alquier’s post Universal estimation with Maximum Mean Discrepancy (MMD) shows how to use MMD in a robust nonparametric estimator.

Gaël Varoquaux's introduction, Comparing distributions: Kernels estimate good representations, l1 distances give good tests, based on (Scetbon and Varoquaux 2019), is friendly and illustrated.

2 Hilbert-Schmidt Independence Criterion

As far as I can tell, the HSIC is the MMD applied to dependence testing: it measures the MMD between the joint distribution of two variables and the product of their marginals, so with a characteristic kernel it vanishes exactly when the variables are independent.
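A minimal sketch of the usual biased plug-in estimator (cf. Gretton, Fukumizu, Teo, et al. 2008; normalisation conventions vary slightly between papers), assuming Gaussian kernels on both variables and a hand-picked bandwidth:

```python
import numpy as np

def gaussian_gram(z, bandwidth=1.0):
    # Gram matrix K_ij = exp(-||z_i - z_j||^2 / (2 * bandwidth^2))
    sq_dists = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def hsic_biased(x, y, bandwidth=1.0):
    """Biased HSIC estimate: tr(K H L H) / (n - 1)^2, with H the centring matrix."""
    n = len(x)
    K = gaussian_gram(x, bandwidth)
    L = gaussian_gram(y, bandwidth)
    H = np.eye(n) - np.ones((n, n)) / n  # centring matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
y = x + 0.1 * rng.normal(size=(300, 1))          # strongly dependent
print(hsic_biased(x, y))
print(hsic_biased(x, rng.normal(size=(300, 1))))  # roughly independent, near zero
```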

3 Connection to optimal transport losses

Husain (2020) connects IPMs to transport metrics, regularisation theory, and classification.

Feydy et al. (2019) connects MMD to optimal transport losses.

Arbel et al. (2019) also looks pertinent and has some connections to Wasserstein gradient flow, which is a thing.

4 Connection to kernelized Stein discrepancy

TBD. See Stein VGD.

5 Choice of kernel

Hmm. See Gretton, Sriperumbudur, et al. (2012).
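Pending a proper read of that paper, the default I usually see is the median heuristic: set the Gaussian kernel bandwidth to the median pairwise distance in the pooled sample. A sketch (the function name is my own):

```python
import numpy as np

def median_heuristic_bandwidth(x, y):
    """Median pairwise distance of the pooled sample -- a common default,
    not the optimized kernel choice of Gretton, Sriperumbudur, et al. (2012)."""
    z = np.concatenate([x, y], axis=0)
    dists = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(-1))
    return np.median(dists[dists > 0])
```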

6 Tooling

MMD is included in the ITE toolbox (estimators).

6.1 GeomLoss

The GeomLoss library provides efficient GPU implementations of kernel norms (i.e. MMDs), Hausdorff divergences and debiased Sinkhorn divergences.

It is hosted on GitHub and distributed under the permissive MIT license.

GeomLoss functions are available through the custom PyTorch layers SamplesLoss, ImagesLoss and VolumesLoss which allow you to work with weighted point clouds (of any dimension), density maps and volumetric segmentation masks.
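If I read the docs right, the MMD-type losses go through SamplesLoss with one of the kernel-norm loss names; something like the following, where the blur value is purely illustrative:

```python
import torch
from geomloss import SamplesLoss

# Kernel-norm (MMD-type) loss between two point clouds;
# blur sets the kernel width, roughly the bandwidth.
mmd_loss = SamplesLoss(loss="gaussian", blur=0.5)

x = torch.randn(500, 2)
y = torch.randn(500, 2) + 1.0
print(mmd_loss(x, y))
```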

7 Incoming

8 References

Arbel, Korba, Salim, et al. 2019. “Maximum Mean Discrepancy Gradient Flow.” In Proceedings of the 33rd International Conference on Neural Information Processing Systems.
Arras, Azmoodeh, Poly, et al. 2017. “A Bound on the 2-Wasserstein Distance Between Linear Combinations of Independent Random Variables.” arXiv:1704.01376 [Math].
Blanchet, Chen, and Zhou. 2018. “Distributionally Robust Mean-Variance Portfolio Selection with Wasserstein Distances.” arXiv:1802.04885 [Stat].
Deb, Ghosal, and Sen. 2020. “Measuring Association on Topological Spaces Using Kernels and Geometric Graphs.” arXiv:2010.01768 [Math, Stat].
Dellaporta, Knoblauch, Damoulas, et al. 2022. “Robust Bayesian Inference for Simulator-Based Models via the MMD Posterior Bootstrap.” arXiv:2202.04744 [Cs, Stat].
Feydy, Séjourné, Vialard, et al. 2019. “Interpolating Between Optimal Transport and MMD Using Sinkhorn Divergences.” In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics.
Gretton, Borgwardt, Rasch, et al. 2012. “A Kernel Two-Sample Test.” The Journal of Machine Learning Research.
Gretton, Fukumizu, Teo, et al. 2008. “A Kernel Statistical Test of Independence.” In Advances in Neural Information Processing Systems 20: Proceedings of the 2007 Conference.
Gretton, Sriperumbudur, Sejdinovic, et al. 2012. “Optimal Kernel Choice for Large-Scale Two-Sample Tests.” In Proceedings of the 25th International Conference on Neural Information Processing Systems. NIPS’12.
Hamzi, and Owhadi. 2021. “Learning Dynamical Systems from Data: A Simple Cross-Validation Perspective, Part I: Parametric Kernel Flows.” Physica D: Nonlinear Phenomena.
Husain. 2020. “Distributional Robustness with IPMs and Links to Regularization and GANs.” arXiv:2006.04349 [Cs, Stat].
Huszár, and Duvenaud. 2016. “Optimally-Weighted Herding Is Bayesian Quadrature.”
Jitkrittum, Xu, Szabo, et al. 2017. “A Linear-Time Kernel Goodness-of-Fit Test.” In Advances in Neural Information Processing Systems.
Long, Cao, Wang, et al. 2015. “Learning Transferable Features with Deep Adaptation Networks.” In Proceedings of the 32nd International Conference on Machine Learning.
Muandet, Fukumizu, Sriperumbudur, et al. 2014. “Kernel Mean Shrinkage Estimators.” arXiv:1405.5505 [Cs, Stat].
Muandet, Fukumizu, Sriperumbudur, et al. 2017. “Kernel Mean Embedding of Distributions: A Review and Beyond.” Foundations and Trends® in Machine Learning.
Nishiyama, and Fukumizu. 2016. “Characteristic Kernels and Infinitely Divisible Distributions.” The Journal of Machine Learning Research.
Pfister, Bühlmann, Schölkopf, et al. 2018. “Kernel-Based Tests for Joint Independence.” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Rustamov. 2021. “Closed-Form Expressions for Maximum Mean Discrepancy with Applications to Wasserstein Auto-Encoders.” Stat.
Scetbon, and Varoquaux. 2019. “Comparing Distributions: \(\ell_1\) Geometry Improves Kernel Two-Sample Testing.” In Advances in Neural Information Processing Systems 32.
Schölkopf, Muandet, Fukumizu, et al. 2015. “Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations.” arXiv:1501.06794 [Cs, Stat].
Sejdinovic, Sriperumbudur, Gretton, et al. 2012. “Equivalence of Distance-Based and RKHS-Based Statistics in Hypothesis Testing.” The Annals of Statistics.
Sheng, and Sriperumbudur. 2023. “On Distance and Kernel Measures of Conditional Dependence.” Journal of Machine Learning Research.
Smola, Gretton, Song, et al. 2007. “A Hilbert Space Embedding for Distributions.” In Algorithmic Learning Theory. Lecture Notes in Computer Science 4754.
Song, Fukumizu, and Gretton. 2013. “Kernel Embeddings of Conditional Distributions: A Unified Kernel Framework for Nonparametric Inference in Graphical Models.” IEEE Signal Processing Magazine.
Song, Huang, Smola, et al. 2009. “Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems.” In Proceedings of the 26th Annual International Conference on Machine Learning. ICML ’09.
Sriperumbudur, Bharath K., Fukumizu, Gretton, et al. 2012. “On the Empirical Estimation of Integral Probability Metrics.” Electronic Journal of Statistics.
Sriperumbudur, B. K., Gretton, Fukumizu, et al. 2008. “Injective Hilbert Space Embeddings of Probability Measures.” In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008).
Sriperumbudur, Bharath K., Gretton, Fukumizu, et al. 2010. “Hilbert Space Embeddings and Metrics on Probability Measures.” Journal of Machine Learning Research.
Strobl, Zhang, and Visweswaran. 2017. “Approximate Kernel-Based Conditional Independence Tests for Fast Non-Parametric Causal Discovery.” arXiv:1702.03877 [Stat].
Sutherland, Tung, Strathmann, et al. 2021. “Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy.”
Szabo, and Sriperumbudur. 2017. “Characteristic and Universal Tensor Product Kernels.” arXiv:1708.08157 [Cs, Math, Stat].
Tolstikhin, Sriperumbudur, and Schölkopf. 2016. “Minimax Estimation of Maximum Mean Discrepancy with Radial Kernels.” In Advances in Neural Information Processing Systems 29.
Zhang, Qinyi, Filippi, Gretton, et al. 2016. “Large-Scale Kernel Methods for Independence Testing.” arXiv:1606.07892 [Stat].
Zhang, Kun, Peters, Janzing, et al. 2012. “Kernel-Based Conditional Independence Test and Application in Causal Discovery.” arXiv:1202.3775 [Cs, Stat].