# Variational inference

TBD.

Expectation maximisation, Bayes, graphical models, mumble mumble.

Using optimisation to approximate the posterior semi-parametrically, rather than purely sampling from it. This is nice because, as a message-passing method, it scales up to large data.
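To make the "optimisation" part concrete, here is the standard identity behind most VI objectives, written as a sketch in the notation used further down this page ($$\pi$$ for the target posterior, $$q$$ for the approximation). The unnormalised density $$\tilde{\pi}$$ and its normaliser $$Z$$, with $$\pi = \tilde{\pi}/Z$$, are symbols I am assuming here; they do not appear elsewhere in these notes.

$$
\log Z
= \underbrace{\int q(x) \log \frac{\tilde{\pi}(x)}{q(x)} \mathrm{d}x}_{\text{evidence lower bound}}
+ \underbrace{\int q(x) \log \frac{q(x)}{\pi(x)} \mathrm{d}x}_{\text{KL divergence from } q \text{ to } \pi}
$$

Since $$\log Z$$ does not depend on $$q$$ and the KL term is non-negative, maximising the first term over a family of $$q$$ is the same as minimising the KL term, and it needs only $$\tilde{\pi}$$, never $$Z$$.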

I suspect this is not intrinsically Bayesian, but most of the literature on it is from Bayesians, so I won’t look into it in a frequentist context for now.

See also mixture models, probabilistic deep learning, directed graphical models, and note that lots of the software to do this is filed under Bayesian Statistics HOWTO.

## Why does this always seem to be about mixture models?

idk. Easy to keep them normalised?

However, see the extension into reparameterisation.

## Loss functions

For now see probability metrics.

### Loss function aside

It’s called Operator VI as a fancy way to say that one is flexible in constructing how exactly the objective function uses $$\pi, q$$ and test functions from some family $$\mathcal{F}$$. I completely agree with the motivation: KL-Divergence in the form $$\int q(x) \log \frac{q(x)}{\pi(x)} \mathrm{d}x$$ indeed tends to underestimate the variance of $$\pi$$ and to approximate only one mode. Using KL the other way around, $$\int \pi(x) \log \frac{\pi(x)}{q(x)} \mathrm{d}x$$ takes all modes into account, but it does so by spreading mass over the low-density regions between them, so it tends to over- rather than underestimate the spread of any single mode.
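A small numerical sketch of the two KL directions, not taken from the Operator VI paper: fit a single Gaussian $$q$$ to a bimodal $$\pi$$ by brute-force grid search. The target (an equal mixture of unit-variance Gaussians at $$\pm 3$$), the grids, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def log_normal(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)

# Integration grid and a bimodal target: 0.5 N(-3, 1) + 0.5 N(3, 1).
x = np.linspace(-12.0, 12.0, 2401)
dx = x[1] - x[0]
log_pi = np.logaddexp(log_normal(x, -3.0, 1.0), log_normal(x, 3.0, 1.0)) + np.log(0.5)

def kl(log_a, log_b):
    """Riemann-sum approximation of KL(a || b) = int a log(a / b) dx."""
    return np.sum(np.exp(log_a) * (log_a - log_b)) * dx

# Candidate Gaussian approximations q = N(mean, std^2).
means = np.linspace(-4.0, 4.0, 81)
stds = np.linspace(0.5, 4.0, 71)

def fit_gaussian(direction):
    """Grid search for the Gaussian minimising KL(q||pi) or KL(pi||q)."""
    best_value, best_mean, best_std = np.inf, None, None
    for mean in means:
        for std in stds:
            log_q = log_normal(x, mean, std)
            if direction == "reverse":
                value = kl(log_q, log_pi)   # int q log(q / pi)
            else:
                value = kl(log_pi, log_q)   # int pi log(pi / q)
            if value < best_value:
                best_value, best_mean, best_std = value, mean, std
    return best_mean, best_std

mean_rev, std_rev = fit_gaussian("reverse")
mean_fwd, std_fwd = fit_gaussian("forward")
print("KL(q||pi) fit: mean=%.2f std=%.2f" % (mean_rev, std_rev))
print("KL(pi||q) fit: mean=%.2f std=%.2f" % (mean_fwd, std_fwd))
```

On this toy problem the $$\int q \log (q/\pi)$$ fit should lock onto one of the two modes with a standard deviation near 1, while the $$\int \pi \log (\pi/q)$$ fit should sit near 0 with a standard deviation near $$\sqrt{10}$$, which is what moment matching predicts for this mixture.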

…the authors suggest an objective using what they call the Langevin-Stein Operator, which does not make use of the proposal density $$q$$ at all but uses test functions exclusively. The only requirement is that we be able to draw samples from the proposal. The authors claim that assuming access to $$q$$ limits the applicability of an objective/operator. This claim is not substantiated, however.
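A rough one-dimensional sketch of how an objective can use only samples from the proposal. As I understand it, the Langevin-Stein operator maps a test function $$f$$ to $$\nabla \log \pi(x) f(x) + f'(x)$$, whose expectation under $$\pi$$ is zero for well-behaved $$f$$; the objective is then the largest squared expectation under $$q$$ over the test functions. The standard-normal target, the three hand-picked test functions, and every name below are assumptions for illustration, not the paper's setup (which, as far as I recall, parameterises the test functions with a neural network).

```python
import numpy as np

def score_target(x):
    """d/dx log pi(x) for a standard normal target; no normaliser needed."""
    return -x

# Hypothetical finite test-function family, stored as (f, f') pairs.
test_functions = [
    (lambda x: np.ones_like(x), lambda x: np.zeros_like(x)),
    (lambda x: x, lambda x: np.ones_like(x)),
    (np.tanh, lambda x: 1.0 - np.tanh(x) ** 2),
]

def langevin_stein_objective(samples):
    """Max over test functions of (E_q[score * f + f'])^2, estimated from samples.

    Only draws from the proposal are used; its density is never evaluated.
    """
    values = []
    for f, df in test_functions:
        stein_values = score_target(samples) * f(samples) + df(samples)
        values.append(np.mean(stein_values) ** 2)
    return max(values)

rng = np.random.default_rng(0)
samples_good = rng.normal(0.0, 1.0, size=200_000)  # proposal equal to the target
samples_bad = rng.normal(1.0, 0.5, size=200_000)   # shifted, too-narrow proposal

objective_good = langevin_stein_objective(samples_good)
objective_bad = langevin_stein_objective(samples_bad)
print("proposal = target:  %.5f" % objective_good)
print("proposal != target: %.5f" % objective_bad)
```

The objective should be close to zero when the samples come from the target and clearly positive otherwise, which is what lets it serve as a discrepancy without ever touching $$q(x)$$.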