🚧

Other fun keywords: rate distortion…

I’m working through a small realisation, for my own interest, which has been helpful to my understanding of variational Bayes; specifically, relating it to non-Bayesian variational inference, and also to sequential Monte Carlo.

By starting from the idea of sufficient statistics, we come to the idea of variational inference in a natural way, via some other interesting stopovers.

I doubt this insight is novel, but if I have time I will work through it as if it is, for the sake of my own education.

I will contend that most Bayesian inference (maybe *also* frequentist, but the setup here is easier with Bayes) is most naturally considered as filtering. We have some system whose parameters we wish to estimate, and we have many experiments done upon that system, throughout time. We would like to update our understanding of that system using the experiments’ data. Possibly the system is changing, possibly it’s not changing. Either way (or both ways) there is a computational challenge.

(Or it’s not changing, but our model is increasing in size as we gain more data about it. How to handle this case?)

We would like the hypothesis updating to happen in a compact way, where we can do inference without keeping the whole data-set about, but rather use some summary statistics. With exponential family models this is trivial to do; just use conjugate updates and you are done. With models where compactness would be more useful, say, big data and big models without conjugate updates, it’s not clear how we can do this exactly.
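For concreteness, here is a minimal sketch of a conjugate update for the Beta–Bernoulli pair (the synthetic data and the uniform prior are my own illustrative choices): the posterior depends on the data only through two counts, however large the data set.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.3, size=1000)  # 1000 flips of a coin with unknown bias

# Beta(a, b) prior on the bias. The conjugate update touches the data only
# through its sufficient statistics: (number of successes, number of trials).
a, b = 1.0, 1.0
successes, trials = int(data.sum()), len(data)
a_post, b_post = a + successes, b + trials - successes

posterior_mean = a_post / (a_post + b_post)  # shrinks towards the prior mean
```

The raw data could be discarded as soon as the counts are computed; a later batch of observations updates the same two numbers, which is exactly the compact hypothesis-updating we want.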

See also mixture models, probabilistic deep learning, directed graphical models, other probability metrics. Intuitive connection with differential privacy of posterior sampling? (Dimitrakakis et al. 2013)

Possibly related: Data summarisation via *inducing sets* or *coresets*. Bounded Memory Learning? To mention: Bayesian likelihood principle, Pitman–Koopman–Darmois theorem.

## Sufficient statistics in exponential families

Let’s start with sufficient statistics in exponential families, which, for reasons of historical pedagogy, are the Garden of Eden of Inference. (The Garden of Edenference for short.) I suspect that deep in their hearts, all statisticians regard themselves as prodigal exiles from the exponential family, and long for the innocence of that Garden of Edenference.

Informally, here’s what’s going on with the inference problems involving sufficient statistics. We are interested in estimating some parameter of interest, \(\theta,\) using realisations \(x\) of some random process \(X\sim \mathbb{P}(x|\theta).\)

Then \(T(x)\) is a *sufficient statistic* for \(\theta\) iff \[\mathbb{P}(x|T(x),\theta)= \mathbb{P}(x|T(x)).\] That is, our inference about \(\theta\) depends on the data *only* through the sufficient statistic.
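A tiny numerical illustration of the definition (the Bernoulli example is my own choice): two different data sets with the same sufficient statistic \(T(x)=\sum_i x_i\) produce identical likelihood functions, so no inference procedure can distinguish them.

```python
import numpy as np

def bernoulli_loglik(theta, x):
    """Log-likelihood of i.i.d. Bernoulli data with success probability theta."""
    t = x.sum()  # T(x): the number of successes is sufficient here
    return t * np.log(theta) + (len(x) - t) * np.log(1 - theta)

x1 = np.array([1, 1, 0, 0, 0])
x2 = np.array([0, 0, 1, 0, 1])  # different data, but the same T(x) = 2

thetas = np.linspace(0.05, 0.95, 19)
# The two likelihood curves coincide everywhere: inference about theta
# sees the data only through T(x).
print(np.allclose(bernoulli_loglik(thetas, x1), bernoulli_loglik(thetas, x2)))  # True
```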

(mention size of sufficient statistic)

Recall the Fisher–Neyman factorization theorem: \(T\) is sufficient for \(\theta\) iff the density factorizes as \[ \mathbb{P}(x;\theta)=h(x)\,g(T(x);\theta). \]
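As a worked instance (my choice of example): for i.i.d. \(X_i\sim\mathcal{N}(\theta,1),\)

\[
\mathbb{P}(x;\theta)
= \underbrace{(2\pi)^{-n/2}\exp\Bigl(-\tfrac{1}{2}\sum_{i} x_i^2\Bigr)}_{h(x)}
\;\underbrace{\exp\Bigl(n\bar{x}\,\theta-\tfrac{n\theta^2}{2}\Bigr)}_{g(T(x);\theta)},
\]

so \(T(x)=\bar{x}\) is sufficient: a single number summarises the whole sample, whatever \(n\) is.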

Famously, exponential family models are highly compressible, in that they possess *sufficient statistics*: low-dimensional functions of the data which capture all the information in the complete data set, as far as the parameter estimates are concerned. Many models, data sets and estimation methods lack this feature, even parametric models with very few parameters.

This can be a PITA when your data is very big and you wish to get the benefit of all of it, and yet you can’t fit it in memory. The question then arises: when can I do better? Can I find a “nearly sufficient” statistic, which is smaller than my data and yet does not substantially worsen my error? Can I quantify this nearness to sufficiency?
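In the exponential-family case, “compressible” is literal: a single streaming pass with constant memory recovers the MLE exactly. A sketch for the Gaussian (synthetic data and variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
stream = rng.normal(loc=2.0, scale=3.0, size=10_000)

# One pass, constant memory: accumulate (n, sum of x, sum of x^2),
# which are jointly sufficient for the Gaussian (mu, sigma^2).
n = s = s2 = 0.0
for x in stream:
    n += 1.0
    s += x
    s2 += x * x

mu_hat = s / n
sigma2_hat = s2 / n - mu_hat**2  # the (biased) MLE of the variance
```

One caveat on the design: the naive sum-of-squares accumulator loses floating-point precision when the variance is small relative to the mean; Welford’s online algorithm is the numerically stable variant. The hard question above is what to accumulate when no such finite-dimensional statistic exists.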

# Refs

Adragni, Kofi P., and R. Dennis Cook. 2009. “Sufficient Dimension Reduction and Prediction in Regression.” *Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences* 367 (1906): 4385–4405. https://doi.org/10.1098/rsta.2009.0110.

Bachem, Olivier, Mario Lucic, and Andreas Krause. 2015. “Coresets for Nonparametric Estimation - the Case of DP-Means.” In *International Conference on Machine Learning*, 209–17. http://proceedings.mlr.press/v37/bachem15.html.

Baez, John C. 2011. “Rényi Entropy and Free Energy,” February. https://arxiv.org/abs/1102.2098.

Barron, Andrew, Mark J. Schervish, and Larry Wasserman. 1999. “The Consistency of Posterior Distributions in Nonparametric Problems.” *The Annals of Statistics* 27 (2): 536–61. https://doi.org/10.1214/aos/1018031206.

Barron, A., J. Rissanen, and Bin Yu. 1998. “The Minimum Description Length Principle in Coding and Modeling.” *IEEE Transactions on Information Theory* 44 (6): 2743–60. https://doi.org/10.1109/18.720554.

Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” *Journal of the American Statistical Association* 112 (518): 859–77. https://doi.org/10.1080/01621459.2017.1285773.

Broderick, Tamara, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael I Jordan. 2013. “Streaming Variational Bayes.” In *Advances in Neural Information Processing Systems 26*, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 1727–35. Curran Associates, Inc. http://papers.nips.cc/paper/4980-streaming-variational-bayes.pdf.

Cortes, Corinna, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. 2016. “Structured Prediction Theory Based on Factor Graph Complexity.” In *Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2514–22. Curran Associates, Inc. http://papers.nips.cc/paper/6485-structured-prediction-theory-based-on-factor-graph-complexity.pdf.

Cutajar, Kurt, Edwin V. Bonilla, Pietro Michiardi, and Maurizio Filippone. 2017. “Random Feature Expansions for Deep Gaussian Processes.” In *PMLR*. http://proceedings.mlr.press/v70/cutajar17a.html.

Dezfouli, Amir, and Edwin V. Bonilla. 2015. “Scalable Inference for Gaussian Process Models with Black-Box Likelihoods.” In *Advances in Neural Information Processing Systems 28*, 1414–22. NIPS’15. Cambridge, MA, USA: MIT Press. http://dl.acm.org/citation.cfm?id=2969239.2969397.

Dimitrakakis, Christos, Blaine Nelson, Zuhe Zhang, Aikaterini Mitrokotsa, and Benjamin Rubinstein. 2013. “Bayesian Differential Privacy Through Posterior Sampling,” June. http://arxiv.org/abs/1306.1066.

Edwards, Harrison, and Amos Storkey. 2017. “Towards a Neural Statistician.” In *Proceedings of ICLR*. https://arxiv.org/abs/1606.02185v2.

Gribonval, Rémi, Gilles Blanchard, Nicolas Keriven, and Yann Traonmilin. 2017. “Compressive Statistical Learning with Random Feature Moments,” June. http://arxiv.org/abs/1706.07180.

Grünwald, Peter D. 2004. “A Tutorial Introduction to the Minimum Description Length Principle.” *Advances in Minimum Description Length: Theory and Applications*, June, 23–81. http://arxiv.org/abs/math/0406077.

Hinton, Geoffrey E., and Drew van Camp. 1993. “Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights.” In *Proceedings of the Sixth Annual Conference on Computational Learning Theory*, 5–13. COLT ’93. New York, NY, USA: ACM. https://doi.org/10.1145/168304.168306.

Honkela, A., and H. Valpola. 2004. “Variational Learning and Bits-Back Coding: An Information-Theoretic View to Bayesian Learning.” *IEEE Transactions on Neural Networks* 15 (4): 800–810. https://doi.org/10.1109/TNN.2004.828762.

Hu, Shell Xu, Pablo G. Moreno, Neil Lawrence, and Andreas Damianou. n.d. “β-BNN: A Rate-Distortion Perspective on Bayesian Neural Networks,” 4.

Huggins, Jonathan, Ryan P Adams, and Tamara Broderick. 2017. “PASS-GLM: Polynomial Approximate Sufficient Statistics for Scalable Bayesian GLM Inference.” In *Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 3611–21. Curran Associates, Inc. http://papers.nips.cc/paper/6952-pass-glm-polynomial-approximate-sufficient-statistics-for-scalable-bayesian-glm-inference.pdf.

Kandasamy, Kirthevasan, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, and James M. Robins. 2014. “Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Divergences and Mutual Informations,” November. http://arxiv.org/abs/1411.4342.

Kullback, S, and R A Leibler. 1951. “On Information and Sufficiency.” *The Annals of Mathematical Statistics* 22 (1): 79–86.

MacKay, David J C. 2002. *Information Theory, Inference & Learning Algorithms*. Cambridge University Press.

Mandelbrot, Benoit. 1962. “The Role of Sufficiency and of Estimation in Thermodynamics.” *The Annals of Mathematical Statistics* 33 (3): 1021–38. https://doi.org/10.1214/aoms/1177704470.

Montanari, Andrea. 2015. “Computational Implications of Reducing Data to Sufficient Statistics.” *Electronic Journal of Statistics* 9 (2): 2370–90. https://doi.org/10.1214/15-EJS1059.

Moreno-Muñoz, Pablo, Antonio Artés-Rodríguez, and Mauricio A. Álvarez. 2019. “Continual Multi-Task Gaussian Processes,” October. http://arxiv.org/abs/1911.00002.

Moshkovitz, Dana, and Michal Moshkovitz. 2017. “Mixing Implies Lower Bounds for Space Bounded Learning.” In *Conference on Learning Theory*, 1516–66. http://proceedings.mlr.press/v65/moshkovitz17a.html.

Moshkovitz, Michal, and Naftali Tishby. 2017a. “Mixing Complexity and Its Applications to Neural Networks,” March. http://arxiv.org/abs/1703.00729.

———. 2017b. “A General Memory-Bounded Learning Algorithm,” December. http://arxiv.org/abs/1712.03524.

Rissanen, J. 1978. “Modeling by Shortest Data Description.” *Automatica* 14 (5): 465–71. https://doi.org/10.1016/0005-1098(78)90005-5.

———. 1984. “Universal Coding, Information, Prediction, and Estimation.” *IEEE Transactions on Information Theory* 30 (4): 629–36. https://doi.org/10.1109/TIT.1984.1056936.

Rissanen, Jorma. 2007. *Information and Complexity in Statistical Modeling*. Information Science and Statistics. New York: Springer. http://www.springer.com/mathematics/probability/book/978-0-387-36610-4.

Vitányi, Paul M. 2006. “Meaningful Information.” *IEEE Transactions on Information Theory* 52 (10): 4617–26. https://doi.org/10.1109/TIT.2006.881729.

Wallace, C. S. 1990. “Classification by Minimum-Message-Length Inference.” In *Advances in Computing and Information — ICCI ’90*, 72–81. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-53504-7_63.

Wolpert, David H. 2008. “Physical Limits of Inference.” *Physica D: Nonlinear Phenomena*, Novel Computing Paradigms: Quo Vadis?, 237 (9): 1257–81. https://doi.org/10.1016/j.physd.2008.03.040.