Variational inference


Expectation maximisation, Bayes, graphical models, mumble mumble.

Using optimisation to approximate the posterior semi-parametrically, rather than purely sampling from it. This is nice because, as a message-passing method, it scales up to large data.
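
To pin down what "optimisation rather than sampling" means here, this is the standard identity behind most of it. The symbols are my own choice for this note, not fixed elsewhere in the notebook: \(x\) is the data, \(z\) the latent variables, and \(q\) a density from whatever tractable family we are searching over.

\[
\log p(x) = \underbrace{\mathbb{E}_{q(z)}\left[\log p(x, z) - \log q(z)\right]}_{\mathrm{ELBO}(q)} + \mathrm{KL}\left(q(z) \,\|\, p(z \mid x)\right)
\]

The left-hand side does not depend on \(q\), so maximising the ELBO over \(q\) is the same as minimising the KL divergence from \(q\) to the posterior, and the ELBO only needs the joint \(p(x, z)\), not the normalised posterior.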

I suspect this is not intrinsically Bayesian, but most of the literature on it is from Bayesians, so I won’t look into it in a frequentist context for now.

See also mixture models, probabilistic deep learning, directed graphical models, and note that lots of the software to do this is filed under Bayesian Statistics HOWTO.

Why does this always seem to be about mixture models?

idk. Easy to keep them normalised?

However, see the extension into reparameterisation.

Loss functions

For now see probability metrics.

Loss function aside

Ingmar Schuster’s critique of black box loss as seen in RTAB16:

It’s called Operator VI as a fancy way to say that one is flexible in constructing how exactly the objective function uses \(\pi, q\) and test functions from some family \(\mathcal{F}\). I completely agree with the motivation: KL-Divergence in the form \(\int q(x) \log \frac{q(x)}{\pi(x)} \mathrm{d}x\) indeed underestimates the variance of \(\pi\) and approximates only one mode. Using KL the other way around, \(\int \pi(x) \log \frac{\pi(x)}{q(x)} \mathrm{d}x\) takes all modes into account, but still tends to underestimate variance.

[…] the authors suggest an objective using what they call the Langevin-Stein Operator which does not make use of the proposal density \(q\) at all but uses test functions exclusively. The only requirement is that we be able to draw samples from the proposal. The authors claim that assuming access to \(q\) limits applicability of an objective/operator. This claim is not substantiated however.
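
To see the two KL directions from the quote behave differently in a toy case, here is a minimal sketch. Everything in it is my own assumption for illustration and comes from neither RTAB16 nor Schuster's post: the bimodal target (an equal mixture of \(N(-3, 1)\) and \(N(3, 1)\)), the single-Gaussian family for \(q\), and the brute-force grid search standing in for a real optimiser.

```python
import numpy as np

def normal_logpdf(x, mean, sd):
    """Log density of N(mean, sd^2) evaluated at x."""
    return -0.5 * ((x - mean) / sd) ** 2 - np.log(sd * np.sqrt(2.0 * np.pi))

# Integration grid, wide enough that the target's truncated tails are negligible.
x = np.linspace(-12.0, 12.0, 2401)
dx = x[1] - x[0]

# Hypothetical bimodal target pi: equal mixture of N(-3, 1) and N(3, 1).
log_pi = np.logaddexp(
    normal_logpdf(x, -3.0, 1.0), normal_logpdf(x, 3.0, 1.0)
) + np.log(0.5)

def kl(log_p, log_q):
    """Riemann-sum approximation of int p(x) log(p(x) / q(x)) dx."""
    return float(np.sum(np.exp(log_p) * (log_p - log_q)) * dx)

def fit(direction):
    """Brute-force search over q = N(mean, sd^2) for one KL direction."""
    best = (np.inf, None, None)
    for mean in np.linspace(-4.0, 4.0, 81):
        for sd in np.linspace(0.5, 4.0, 71):
            log_q = normal_logpdf(x, mean, sd)
            if direction == "reverse":
                value = kl(log_q, log_pi)   # KL(q || pi)
            else:
                value = kl(log_pi, log_q)   # KL(pi || q)
            if value < best[0]:
                best = (value, float(mean), float(sd))
    return best

rev_kl, rev_mean, rev_sd = fit("reverse")
fwd_kl, fwd_mean, fwd_sd = fit("forward")
print(f"reverse KL(q || pi): mean={rev_mean:.2f}, sd={rev_sd:.2f}, KL={rev_kl:.3f}")
print(f"forward KL(pi || q): mean={fwd_mean:.2f}, sd={fwd_sd:.2f}, KL={fwd_kl:.3f}")
```

For a Gaussian \(q\) the forward direction reduces to moment matching, so it should come out near mean 0 and standard deviation \(\sqrt{10} \approx 3.16\), straddling both modes. The reverse direction should settle on one of the modes at \(\pm 3\) with standard deviation close to 1. Which of the two modes it picks is an accident of floating point, since the problem is symmetric.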