# Expectation maximisation

A particular optimisation method in statistics that gets you a maximum likelihood estimate despite various annoyances such as missing data.

Approximate description of the algorithm:

We have an experimental process that generates a random pair $(B, Y)$ according to a parameter $\theta$, which we wish to estimate by maximum likelihood. However, we only observe i.i.d. samples $\{b_i\}$ drawn from $B$; the $Y$ component is unobserved. The likelihood function of the incomplete data, $L(\theta, \{b_i\})$, is tedious or intractable to maximise, but the “complete” joint likelihood of both the observed and unobserved components, $L(\theta, \{b_i\}, y)$, is easier to maximise. Then we are potentially in a situation where expectation maximisation can help.

Call $\theta^{(k)}$ the estimate of $\theta$ at step $k$. Write $\ell(\theta, \{b_i\}, y)\equiv\log L(\theta, \{b_i\}, y)$, because we virtually always work in log-likelihoods, and especially here.

The following form of the algorithm works when the log-likelihood $\ell(\theta, b, y)$ is linear in $y$, the unobserved data. (Which is equivalent to the complete-data model being in an exponential family, I believe, but I should check.)

At time $k=0$ we start with an estimate of $\theta^{(0)}$ chosen arbitrarily or by our favourite approximate method.

We attempt to improve our estimate of the parameter of interest by the following iterative algorithm:

1. “Expectation”: Under the completed data model with joint distribution $F(b,y,\theta^{(k)})$ we estimate $y$ as

\begin{equation*} y^{(k)}=E_{\theta^{(k)}}[Y|b] \end{equation*}
2. “Maximisation”: Solve a (hopefully easier) maximisation problem:

\begin{equation*} \theta^{(k+1)}=\text{argmax}_\theta \ell(\theta, b, y^{(k)}) \end{equation*}

In the case that this log-likelihood is not linear in $y$, you are supposed instead to take

\begin{equation*} \theta^{(k+1)}=\text{argmax}_\theta E_{\theta^{(k)}}[\ell(\theta, b, Y)|b] \end{equation*}

In practice I have seen this latter nicety ignored, apparently without ill effect.
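To see why the plug-in version is exact in the linear case (a standard observation, sketched here with an affine form assumed for illustration): if $\ell(\theta, b, y) = a(\theta, b) + c(\theta, b)^\top y$ for some functions $a$ and $c$, then

\begin{equation*} E_{\theta^{(k)}}[\ell(\theta, b, Y)|b] = a(\theta, b) + c(\theta, b)^\top E_{\theta^{(k)}}[Y|b] = \ell(\theta, b, y^{(k)}), \end{equation*}

so the two maximisation problems coincide, and plugging in $y^{(k)}$ loses nothing.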

Even if you do the right thing, EM may converge slowly, or only to a local maximum, but damn it can be easy and robust to get started with, and at least it never makes things worse: each iteration is guaranteed not to decrease the likelihood.
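As a concrete sketch of the iteration, here is EM for a two-component 1-D Gaussian mixture, where the latent $Y$ is the (one-hot) component label. Names such as `em_gmm` and the crude min/max initialisation are illustrative choices, not from any particular library:

```python
import numpy as np

def em_gmm(x, n_iter=200):
    """EM for a two-component 1-D Gaussian mixture fit to samples x."""
    # theta^{(0)}: a crude but well-separated starting point
    mu = np.array([x.min(), x.max()], dtype=float)
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities, i.e. E[Y | b] under theta^{(k)}.
        # (The 1/sqrt(2*pi) constant cancels in the normalisation.)
        dens = np.stack([
            pi[k] * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2) / sigma[k]
            for k in range(2)
        ])                               # shape (2, n)
        resp = dens / dens.sum(axis=0)   # gamma[k, i]
        # M-step: closed-form argmax of the expected complete log-likelihood
        nk = resp.sum(axis=1)
        pi = nk / len(x)
        mu = (resp * x).sum(axis=1) / nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
    return pi, mu, sigma

# Usage: recover two well-separated components from simulated data
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])
pi, mu, sigma = em_gmm(x)
```

For this model the complete-data log-likelihood is linear in the one-hot indicators, so the plug-in E-step above is exact, and the M-step has a closed form.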

Literature note: apparently the proofs in Dempster, Laird, and Rubin (1977) are dicey; see the Wu (1983) paper for improved (i.e. correct) versions.

*A Transparent Interpretation of the EM Algorithm* by James Coughlan makes a short point:

We write data $z$, latent variable $y$, parameter of interest $\theta$. Then….

[…]maximizing Neal and Hinton’s joint function of $\theta$ and a distribution on $y$ is equivalent to maximum likelihood estimation.

The key point is to note that maximizing $\log P(z|\theta)$ over $\theta$ is equivalent to maximizing

\begin{equation*} \log P(z|\theta)-D(\tilde{P}(y)\|P(y|z,\theta)) \end{equation*}

jointly over $\theta$ and $\tilde{P}(y)$. […]

[…We rewrite this cost function]

\begin{equation*} H(\tilde{P}) + \sum_y\tilde{P}(y)\log \{P(y|z,\theta)P(z|\theta)\}, \end{equation*}

where $H(\tilde{P})=-\sum_y\tilde{P}(y)\log\tilde{P}(y)$ is the entropy of $\tilde{P}$. This expression is in turn equivalent to

\begin{equation*} H(\tilde{P}) +\sum_y\tilde{P}(y)\log P(y,z|\theta), \end{equation*}

which is the same as the function $F(\tilde{P},\theta)$ given in Neal and Hinton. This function is maximized iteratively, where each iteration consists of two separate maximizations, one over $\theta$ and another over $\tilde{P}$.
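The two EM steps drop out of this joint maximisation directly (a standard argument, sketched here). Holding $\theta$ fixed, $F(\tilde{P},\theta) = \log P(z|\theta) - D(\tilde{P}(y)\|P(y|z,\theta))$ is maximised by setting $\tilde{P}(y)=P(y|z,\theta)$, since the divergence is non-negative and zero only at equality; that is the E-step. Holding $\tilde{P}$ fixed, the entropy term is constant in $\theta$, so maximising $F$ over $\theta$ amounts to maximising

\begin{equation*} \sum_y\tilde{P}(y)\log P(y,z|\theta), \end{equation*}

the expected complete-data log-likelihood; that is the M-step.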

My goal is to fill in the details of one key step in the derivation of the EM algorithm in a way that makes it inevitable rather than arbitrary.

## Refs

BiOt98
Bilmes, J. A. (1998) A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. International Computer Science Institute, 4(510), 126.
CeCD95
Celeux, G., Chauveau, D., & Diebolt, J. (1995) On Stochastic Versions of the EM Algorithm (report).
CCFM01
Celeux, G., Chretien, S., Forbes, F., & Mkhadri, A. (2001) A Component-Wise EM Algorithm for Mixtures. Journal of Computational and Graphical Statistics, 10(4), 697–712. DOI.
CeFP03
Celeux, G., Forbes, F., & Peyrard, N. (2003) EM procedures using mean field-like approximations for Markov model-based image segmentation. Pattern Recognition, 36(1), 131–144. DOI.
DeLM99
Delyon, B., Lavielle, M., & Moulines, E. (1999) Convergence of a stochastic approximation version of the EM algorithm. The Annals of Statistics, 27(1), 94–128. DOI.
DeLR77
Dempster, A. P., Laird, N. M., & Rubin, D. B.(1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38.
KuLa04
Kuhn, E., & Lavielle, M. (2004) Coupling a stochastic approximation version of EM with an MCMC procedure. ESAIM: Probability and Statistics, 8, 115–131. DOI.
McKr08
McLachlan, G. J., & Krishnan, T. (2008) The EM algorithm and extensions. Hoboken, N.J.: Wiley-Interscience.
McKN04
McLachlan, G. J., Krishnan, T., & Ng, S. K.(2004) The EM Algorithm (No. 2004,24). Humboldt-Universität Berlin, Center for Applied Statistics and Economics (CASE).
Navi97
Navidi, W. (1997) A Graphical Illustration of the EM Algorithm. The American Statistician, 51(1), 29–31. DOI.
NeHi98
Neal, R. M., & Hinton, G. E.(1998) A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants. In M. I. Jordan (Ed.), Learning in Graphical Models (pp. 355–368). Springer Netherlands
Roch11
Roche, A. (2011) EM algorithm and variants: an informal tutorial. arXiv:1105.1476 [Stat].
WeTa90
Wei, G. C. G., & Tanner, M. A.(1990) A Monte Carlo Implementation of the EM Algorithm and the Poor Man’s Data Augmentation Algorithms. Journal of the American Statistical Association, 85(411), 699–704. DOI.
Wu83
Wu, C. F. J.(1983) On the Convergence Properties of the EM Algorithm. The Annals of Statistics, 11(1), 95–103. DOI.