
Expectation propagation

A method for approximating posterior expectations. Is this only for approximate Bayesian inference, or does it work for indirect inference too? Other things?

Expectation Propagation is exact in the large data limit

Guillaume Dehaene, Neurosciences, UNIGE

Expectation Propagation (EP, Minka 2001) is a popular algorithm for approximating posterior distributions. While it is known empirically to give good approximations at a low computational cost (Nickisch and Rasmussen, 2008), it’s also very poorly understood. In this talk, I will present some new results on EP in the large-data limit. I will show that Gaussian-EP iterations asymptotically behave like the Newton algorithm, from which it follows that Gaussian-EP is asymptotically exact. I will then give an upper bound on the errors of the EP approximation and show that their decay speed compares favorably to the errors of the Laplace approximation of the posterior.

These theoretical results can be used to inform practical use of EP, and I will give some advice for implementation of the method.
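To make the comparison concrete, here is a minimal 1-D sketch of Gaussian EP run alongside the Laplace approximation and brute-force quadrature for the "exact" moments. The model is a toy of my own choosing (a Bernoulli-logit likelihood with one shared scalar parameter), not Dehaene's setup, and the tilted moments are computed by quadrature, which is only feasible because the parameter is scalar.

```python
# Minimal 1-D sketch: Gaussian EP vs Laplace vs quadrature-"exact" moments.
# Toy model (my assumption): theta ~ N(0, 1), y_i ~ Bernoulli(logistic(theta)),
# all observations sharing the single scalar parameter theta.
import numpy as np

rng = np.random.default_rng(0)
n = 100
theta_true = 0.8
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-theta_true))).astype(float)

grid = np.linspace(-6.0, 6.0, 2001)           # quadrature grid (1-D only!)

def moments(log_density):
    """Mean and variance of an unnormalised log-density tabulated on `grid`."""
    w = np.exp(log_density - log_density.max())
    w /= w.sum()
    m = np.sum(w * grid)
    return m, np.sum(w * (grid - m) ** 2)

def site_loglik(theta, yi):
    """log p(y_i | theta) for one Bernoulli-logit observation."""
    return yi * theta - np.log1p(np.exp(theta))

# "Exact" posterior moments by brute-force quadrature.
log_post = -0.5 * grid ** 2 + y.sum() * grid - n * np.log1p(np.exp(grid))
m_exact, v_exact = moments(log_post)

# Laplace approximation: Newton iterations to the posterior mode.
theta_hat = 0.0
for _ in range(50):
    s = 1.0 / (1.0 + np.exp(-theta_hat))
    grad = -theta_hat + y.sum() - n * s
    hess = -1.0 - n * s * (1.0 - s)
    theta_hat -= grad / hess
m_laplace, v_laplace = theta_hat, -1.0 / hess

# Gaussian EP: one Gaussian site t_i(theta) per observation, stored as
# natural parameters (precision tau_i, precision-times-mean nu_i).
tau = np.zeros(n)
nu = np.zeros(n)
prior_tau, prior_nu = 1.0, 0.0
for sweep in range(10):
    for i in range(n):
        # cavity = prior + all sites except i
        tau_cav = prior_tau + tau.sum() - tau[i]
        nu_cav = prior_nu + nu.sum() - nu[i]
        if tau_cav <= 0:                      # skip ill-posed cavities
            continue
        # tilted distribution: cavity Gaussian times the exact likelihood factor
        log_tilted = (-0.5 * tau_cav * grid ** 2 + nu_cav * grid
                      + site_loglik(grid, y[i]))
        m_t, v_t = moments(log_tilted)
        # moment matching: new site makes (cavity * site) match the tilted moments
        tau[i] = 1.0 / v_t - tau_cav
        nu[i] = m_t / v_t - nu_cav

tau_post = prior_tau + tau.sum()
m_ep, v_ep = (prior_nu + nu.sum()) / tau_post, 1.0 / tau_post

print(f"exact   mean {m_exact:.5f}  var {v_exact:.5f}")
print(f"Laplace mean {m_laplace:.5f}  var {v_laplace:.5f}")
print(f"EP      mean {m_ep:.5f}  var {v_ep:.5f}")
```

With a log-concave likelihood like this one, both approximations land close to the quadrature answer; the point of the talk is that EP's error shrinks faster with n than Laplace's.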

Christian Robert on expectation-propagation

Andrew Gelman thereupon:

We revisit expectation propagation (EP) as a prototype for scalable algorithms that partition big datasets into many parts and analyze each part in parallel to perform inference of shared parameters. The algorithm should be particularly efficient for hierarchical models, for which the EP algorithm works on the shared parameters (hyperparameters) of the model.

The central idea of EP is to work at each step with a “tilted distribution” that combines the likelihood for a part of the data with the “cavity distribution,” which is the approximate model for the prior and all other parts of the data. EP iteratively approximates the moments of the tilted distributions and incorporates those approximations into a global posterior approximation. As such, EP can be used to divide the computation for large models into manageable sizes. The computation for each partition can be made parallel with occasional exchanging of information between processes through the global posterior approximation. Moments of multivariate tilted distributions can be approximated in various ways, including MCMC, Laplace approximations, and importance sampling.

I love love love love love this. The idea is to forget about the usual derivation of EP (the Kullback-Leibler discrepancy, etc.) and to instead start at the other end, with Bayesian data-splitting algorithms, with the idea of taking a big problem and dividing it into K little pieces, performing inference on each of the K pieces, and then putting them together to get an approximate posterior inference.

The difficulty with such algorithms, as usually constructed, is that each of the K pieces has only partial information; as a result, for any of these pieces, you’re wasting a lot of computation in places that are contradicted by the other K-1 pieces.
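To see the cavity/tilted machinery of the quoted abstract in a runnable form, here is a sketch of the data-partitioned variant on a toy model of my own devising: a scalar location parameter with a Gaussian prior and heavy-tailed Student-t observations, split into K pieces, with one Gaussian site per piece. The tilted moments are estimated by self-normalised importance sampling from the cavity, which is one of the options the abstract lists. This is an illustration under those assumptions, not the authors' implementation.

```python
# Sketch of data-partitioned EP: K data blocks, one Gaussian site per block,
# cavity = global approximation minus that site, tilted = cavity * exact block
# likelihood, tilted moments by self-normalised importance sampling.
# Toy model (my assumption): theta ~ N(0, 2^2), y_i ~ theta + Student-t(4).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta_true, nu_t = 1.5, 4.0
n, K = 200, 8
y = theta_true + stats.t.rvs(df=nu_t, size=n, random_state=rng)
blocks = np.array_split(y, K)                 # the K "pieces" of the data

prior_tau, prior_nu = 1.0 / 2.0 ** 2, 0.0     # prior in natural parameters
tau = np.zeros(K)                             # site precisions
nu = np.zeros(K)                              # site precision-times-means
damping = 0.5
n_is = 4000                                   # importance samples per tilted moment

def block_loglik(theta, yk):
    """log p(y_block | theta) for Student-t observations of a location."""
    return stats.t.logpdf(yk[:, None] - theta[None, :], df=nu_t).sum(axis=0)

for sweep in range(6):
    for k in range(K):
        # cavity: global approximation with site k removed
        tau_cav = prior_tau + tau.sum() - tau[k]
        nu_cav = prior_nu + nu.sum() - nu[k]
        if tau_cav <= 0:
            continue                          # ill-posed cavity; skip this site
        m_cav, v_cav = nu_cav / tau_cav, 1.0 / tau_cav
        # tilted distribution ~ cavity * exact block likelihood;
        # estimate its moments by importance sampling from the cavity
        theta_s = rng.normal(m_cav, np.sqrt(v_cav), size=n_is)
        logw = block_loglik(theta_s, blocks[k])
        w = np.exp(logw - logw.max())
        w /= w.sum()
        m_t = np.sum(w * theta_s)
        v_t = np.sum(w * (theta_s - m_t) ** 2)
        if v_t <= 0:
            continue
        # moment matching in natural parameters, with damping for stability
        tau_new = (1 - damping) * tau[k] + damping * (1.0 / v_t - tau_cav)
        nu_new = (1 - damping) * nu[k] + damping * (m_t / v_t - nu_cav)
        if prior_tau + tau.sum() - tau[k] + tau_new > 0:   # keep global proper
            tau[k], nu[k] = tau_new, nu_new

tau_post = prior_tau + tau.sum()
print(f"EP posterior for theta: mean {(prior_nu + nu.sum()) / tau_post:.3f}, "
      f"sd {tau_post ** -0.5:.3f}  (true theta = {theta_true})")
```

The per-block moment estimation is the part that could be farmed out to parallel workers, with only the Gaussian natural parameters exchanged between them; conditioning each block's computation on the cavity is exactly what spares it from wasting effort in regions the other K-1 pieces rule out.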