*TBD.*

I’m working through a small realisation, for my own interest, which has helped my understanding of variational Bayes, and which in fact relates it to non-Bayesian variational inference. Starting from the idea of sufficient statistics, we arrive at variational inference in a natural way, via some other interesting points.

I doubt this insight is novel, but I will work through it as if it is, for the sake of my own education.

See also mixture models, probabilistic deep learning, directed graphical models, other probability metrics. Intuitive connection with the differential privacy of posterior sampling? ([#DNZM13])

To mention: Bayesian likelihood principle, Pitman–Koopman–Darmois theorem, connection to degrees of freedom.

## Sufficient statistics in exponential families

Let’s start with sufficient statistics in exponential families, which, for reasons of historical pedagogy, are the Garden of Eden of Inference, the Garden of Edenference for short. I suspect that deep in their hearts, all statisticians regard themselves as prodigal exiles from the exponential family, and long for the innocence of that Garden of Edenference.

Anyway, informally speaking, here’s what’s going on with inference problems involving sufficient statistics. We are interested in estimating some parameter of interest, \(\theta\), using realisations \(x\) of some random process \(X\sim \mathbb{P}(x|\theta).\)

Then \(T(X)\) is a *sufficient statistic* for \(\theta\) iff

\[
\mathbb{P}(\theta|X=x)=\mathbb{P}(\theta|T(X)=T(x)).
\]

That is, our inference about \(\theta\) depends on the data *only* through the sufficient statistic.
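For concreteness, here is a minimal sketch (all names mine, assuming a flat prior discretised on a grid) showing that two Bernoulli datasets with the same \(T(x)=\sum_i x_i\) yield the identical posterior:

```python
# Two coin-flip datasets of length 5 with the same sufficient
# statistic T(x) = sum(x) = 3, but different orderings.
x1 = [1, 1, 1, 0, 0]
x2 = [0, 1, 0, 1, 1]

def posterior_grid(x, grid):
    """Posterior over theta on a grid, under a flat prior,
    normalised to sum to one."""
    n, t = len(x), sum(x)
    lik = [th**t * (1 - th)**(n - t) for th in grid]
    z = sum(lik)
    return [l / z for l in lik]

grid = [i / 100 for i in range(1, 100)]
p1 = posterior_grid(x1, grid)
p2 = posterior_grid(x2, grid)

# Identical posteriors: the data enters only through (n, T(x)).
assert p1 == p2
```

Any permutation of the data, or indeed any dataset of the same size with the same count of successes, produces the same posterior.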

(mention size of sufficient statistic)

The Fisher–Neyman factorization theorem makes this precise: \(T\) is sufficient for \(\theta\) iff the density factorises as \(p(x|\theta)=h(x)\,g_\theta(T(x)),\) where \(h\) does not depend on \(\theta\).
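As a worked instance of the Fisher–Neyman factorisation, take \(n\) i.i.d. draws from \(\mathcal{N}(\mu, 1)\) with unknown mean \(\mu\):

\[
p(x|\mu)
= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} e^{-(x_i-\mu)^2/2}
= \underbrace{(2\pi)^{-n/2} e^{-\frac{1}{2}\sum_i x_i^2}}_{h(x)}
  \underbrace{e^{\mu \sum_i x_i - n\mu^2/2}}_{g_\mu(T(x))},
\]

so \(T(x)=\sum_i x_i\), a single number whatever the sample size, is sufficient for \(\mu\).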

Famously, maximum likelihood estimators of exponential family models are highly compressible, in that these models have *sufficient statistics*: low-dimensional functions of the data which capture all the information in the complete data, with respect to the parameter estimates. Many models, data sets, and estimation methods do not have this feature, even parametric models with very few parameters.
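To see the compressibility concretely, here is a sketch (synthetic data, all names mine) of accumulating the Gaussian sufficient statistics \((n, \sum_i x_i, \sum_i x_i^2)\) in a single pass over a stream, after which the MLE needs no raw data at all:

```python
import random

# One-pass accumulation of the Gaussian sufficient statistics
# (n, sum of x, sum of x squared); the raw stream is never stored.
random.seed(0)
n, s, s2 = 0, 0.0, 0.0
for _ in range(100_000):
    x = random.gauss(3.0, 2.0)   # pretend this is a huge data stream
    n += 1
    s += x
    s2 += x * x

mu_hat = s / n                    # MLE of the mean
var_hat = s2 / n - mu_hat ** 2    # MLE of the variance
```

Three accumulated numbers stand in for a hundred thousand observations, which is exactly the memory saving the next paragraph worries about losing.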

This can be a PITA when your data is very big and you wish to benefit from all of it, and yet you can’t fit the data in memory. The question then arises: when can I do better? Can I find a “nearly sufficient” statistic, one which is smaller than my data and yet does not worsen my error substantially? Can I quantify this nearness to sufficiency?

## refs

- Grün04: (2004) A tutorial introduction to the minimum description length principle. *Advances in Minimum Description Length: Theory and Applications*, 23–81.
- DNZM13: (2013) Bayesian Differential Privacy through Posterior Sampling. *ArXiv:1306.1066 [Cs, Stat]*.
- Wall90: (1990) Classification by minimum-message-length inference. In *Advances in Computing and Information — ICCI ’90* (pp. 72–81). Springer, Berlin, Heidelberg. DOI
- GBKT17: (2017) Compressive Statistical Learning with Random Feature Moments. *ArXiv:1706.07180 [Cs, Math, Stat]*.
- Mont15: (2015) Computational implications of reducing data to sufficient statistics. *Electronic Journal of Statistics*, 9(2), 2370–2390. DOI
- KKPW14: (2014) Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Divergences and Mutual Informations. *ArXiv:1411.4342 [Stat]*.
- Riss07: (2007) *Information and complexity in statistical modeling*. New York: Springer.
- Mack02: (2002) *Information Theory, Inference & Learning Algorithms*. Cambridge University Press.
- HiCa93: (1993) Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights. In *Proceedings of the Sixth Annual Conference on Computational Learning Theory* (pp. 5–13). New York, NY, USA: ACM. DOI
- Vitá06: (2006) Meaningful information. *IEEE Transactions on Information Theory*, 52(10), 4617–4626. DOI
- Riss78: (1978) Modeling by shortest data description. *Automatica*, 14(5), 465–471. DOI
- KuLe51: (1951) On Information and Sufficiency. *The Annals of Mathematical Statistics*, 22(1), 79–86.
- Wolp08: (2008) Physical limits of inference. *Physica D: Nonlinear Phenomena*, 237(9), 1257–1281. DOI
- CBMF17: (2017) Random Feature Expansions for Deep Gaussian Processes. In *PMLR*.
- Baez11: (2011) Renyi Entropy and Free Energy.
- DeBo15: (2015) Scalable Inference for Gaussian Process Models with Black-box Likelihoods. In *Advances in Neural Information Processing Systems 28* (pp. 1414–1422). Cambridge, MA, USA: MIT Press.
- AdCo09: (2009) Sufficient dimension reduction and prediction in regression. *Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences*, 367(1906), 4385–4405. DOI
- BaSW99: (1999) The Consistency of Posterior Distributions in Nonparametric Problems. *The Annals of Statistics*, 27(2), 536–561. DOI
- BaRY98: (1998) The minimum description length principle in coding and modeling. *IEEE Transactions on Information Theory*, 44(6), 2743–2760. DOI
- Mand62: (1962) The Role of Sufficiency and of Estimation in Thermodynamics. *The Annals of Mathematical Statistics*, 33(3), 1021–1038. DOI
- Riss84: (1984) Universal coding, information, prediction, and estimation. *IEEE Transactions on Information Theory*, 30(4), 629–636. DOI
- BlKM17: (2017) Variational Inference: A Review for Statisticians. *Journal of the American Statistical Association*, 112(518), 859–877. DOI
- HoVa04: (2004) Variational learning and bits-back coding: an information-theoretic view to Bayesian learning. *IEEE Transactions on Neural Networks*, 15(4), 800–810. DOI