GBO notes: Expectation Maximization

In this note, we will describe how to estimate the parameters of GMM and HMM models using expectation-maximization method. The equations and discussion is heavily based on Jeff Bilmes’ paper.

Maximum likelihood

A popular method to estimate the parameters of a statistical model is using maximum likelihood. Given a set of observations $X = {\mathbf{x}_1, \ldots,\mathbf{x}_N}$, we assume that the observations were generated using some statistical model (for e.g., a Gaussian) with parameters $\theta \in \Theta$. Then, we compute the most probable $\theta$ by maximizing the probability of generating $X$ with these parameters, i.e.,

\[\hat{\theta} = \text{arg} \max_{\theta \in \Theta} P(X\mid\theta).\]

This is also called the likelihood of $\theta$, denoted as $L(\theta)$. It is a function of $\theta$ given some fixed $X$. For example, for the simple case of a single Gaussian, the likelihood assuming I.I.D. observations is given as

\[L(\theta) = \prod_{i=1}^N \mathcal{N}(\mathbf{x}_i; \mu, \sigma),\]

where $\mu$ and $\sigma$ are the mean and covariance. As calculus suggests, the parameters that maximize the likelihood are computed by taking the derivative and setting it to 0. In practice, it is easier to work with the log-likelihood

\[\log \ell(\theta) = \sum_{i=1}^N \log \mathcal{N}(\mathbf{x}_i; \mu, \sigma).\]

In fact, we can show that the derivative of the log-likelihood w.r.t. $\mu$ is given as

\[\frac{\partial \ell}{\partial \mu} = \sum_{i=1}^N \frac{\mathbf{x}_i - \mu}{\sigma^2}.\]

Latent variable models and expectation-maximization

It is not always so simple to maximize the likelihood function since the derivative may not have an analytical solution. However, we can often simplify such cases by assuming that the data $X$ that we observe is actually incomplete, i.e., there is some latent (or hidden) data $Y$ such that the whole data is $Z = (X,Y)$. In such as case, we define a joint density function for $Z$ as:

\[p(\mathbf{z}\mid\theta) = p(\mathbf{z},\mathbf{y}\mid\theta) = p(\mathbf{y}\mid \mathbf{x},\theta)p(\mathbf{x}\mid \theta).\]

We can then define a “complete” data likelihood as $L(\theta\mid X,Y)$, and the original likelihood $L(\theta\mid X)$ is called the incomplete data likelihood. Note that since $Y$ is a random variable, we cannot find the exact value of the complete-data likelihood, but only an expectation for it. Given the observed data $X$ and the current set of model parameters $\theta^{(i-1)}$, the complete-data likelihood can be given as

\[\begin{aligned} Q(\theta, \theta^{(i-1)}) &= \mathbb{E}[\log p(X,Y\mid \theta)\mid X, \theta^{(i-1)}] \\ &= \int_{\mathbf{y}\in \mathcal{Y}} \log p(X,\mathbf{y}\mid \theta)f(\mathbf{y}|X,\theta^{(i-1)})d\mathbf{y}. \end{aligned}\]

This is called the E-step of the EM algorithm. Once we have the complete-data likelihood, we can maximize it w.r.t. $\theta$ as:

\[\theta^{(i)} = \text{arg}\max_{\theta \in \Theta} Q(\theta, \theta^{(i-1)}),\]

which is known as the M-step. Note that this is a very abstract characterization of the method, and the exact equations would depend on the underlying statistical model. In the remainder of this post, we will see how the EM algorithm works in the case of GMM and HMM models.

Gaussian Mixture Model

Often, a single Gaussian is not sufficient to describe the set of observations, in which case we turn to mixture models. A Gaussian mixture model (GMM) is a mixture of a fixed number of Gaussians. The probability of generating the observations by a GMM containing $K$ Gaussians with means $\mu_k$ and covariances $\sigma_k$ is given as

\[P(X) = \prod_{i=1}^N \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}_i; \mu_k, \sigma_k),\]

where $\pi_k$ are the mixture proportion representing the contribution from the Gaussian $k$.

If we try to directly apply the MLE on this likelihood function, we will run into a problem since the log cannot be taken inside the sum, and so there is no analytical solution (unlike the case of a single Gaussian). In fact, the derivate of the GMM likelihood function (w.r.t $\mu_k$) would look like

\[\sum_{i=1}^N \gamma_{z_i}(k) \frac{\mathbf{x}_i - \mu_k}{\sigma_k^2},\]

where $\gamma_{z_i}(k)$ is the posterior of the component $k$ given the observation $i$, and is computed as

\[\gamma_{z_i}(k) = \frac{\pi_k \mathcal{N}(\mathbf{x}_i;\mu_k,\sigma_k)}{\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}_i;\mu_k,\sigma_k)}.\]

Now, in the derivative above, there is a latent variable ($\gamma_{z_i}(k)$ depends on $\mu_k$), so it is not possible to get an analytical solution. However, if we assume that $\gamma_{z_i}(k)$ is fixed, we can solve for $\mu_k$, obtaining:

\[\mu_k = \frac{\sum_{i=1}^N \gamma_{z_i}(k) \mathbf{x}_i}{\sum_{i=1}^N \gamma_{z_i}(k)},\]

which is essentially a weighted average of the observations, with the posteriors as the weights. Similarly, we can obtain equations for $\sigma_k$ and $\pi_k$.

At this point, it seems like we are done. But remember that these equations involve $\gamma_{z_i}(k)$, which itself depends on the unknown parameters. So, we solve for them in an alternating fashion, as follows:

E-step: Compute the posterior probabilities $\gamma_{z_i}(k)$ using the current values of the parameters $\mu_k$, $\sigma_k$, and $\pi_k$.
M-step: Estimate new values for the parameters using the computed values for $\gamma_{z_i}(k)$.

We repeat this iterative process until the change in log-likelihood is less than some threshold. Note that it is guaranteed that the log-likelihood will keep increasing.

Hidden Markov Model

Note: Some of the material in this section is modified from this blog post.

An HMM is a stochastic model which assumes an underlying latent variable $\mathbf{z}$ which evolves through a Markov process, such that the joint distribution is given as

\[\begin{aligned} P(X,Z) &= p(\mathbf{x}_1,\ldots,\mathbf{x}_N,\mathbf{z}_1,\ldots,\mathbf{z}_N) \\ &= p(\mathbf{z}_1)p(\mathbf{x}_1\mid \mathbf{z}_1)\prod_{i=2}^N p(\mathbf{z}_i\mid \mathbf{z}_1^{i-1}) p(\mathbf{x}_i\mid \mathbf{z}_1^i, \mathbf{x}_1^{i-1}) \\ &= p(\mathbf{z}_1)p(\mathbf{x}_1\mid \mathbf{z}_1)\prod_{i=2}^N p(\mathbf{z}_i\mid \mathbf{z}_{i-1}) p(\mathbf{x}_i\mid \mathbf{z}_i). \end{aligned}\]

The joint probability derived as such can be factorized into transition probabilities for the latent variable $Z$

\[p(Z) = p(\mathbf{z}_1) \prod_{i=2}^N p(\mathbf{z}_i \mid \mathbf{z}_{i-1}),\]

and emission probabilities for generating $X$ from $Z$

\[p(\mathbf{x}_i \mid \mathbf{z}_i = k, \phi_k).\]

For example, the emission probability may be modeled using a single Gaussian or a mixture model. The EM algorithm for computing the MLE parameters of an HMM is also known as the Baum-Welch algorithm.

The parameters we need to estimate are:

the transition matrix $A$ which is an $m \times m$ matrix, where $n$ is the number of possible states that $Z$ can take,
the initial probabilities of the states $\pi$,
and the parameters $\phi$ of the emission distribution,

i.e., $\theta = (A, \pi, \phi)$.

Again, we solve this problem by alternating between an E-step and an M-step, where the E-step involves computing the complete-data likelihood and the M-step maximizes it to get the parameters at that instant. However, in this case, the E-step is a little more complicated since there are an exponential number of parameters. As such, it involves using the forward-backward algorithm (and so Baum-Welch = EM with forward-backward).

In particular, in the E-step we need to compute 2 posteriors: $\gamma(z_{nk})$, i.e., the posterior probability of being in state $k$ at time $n$, and $\xi(z_{n-1,j},z_{nk})$, i.e., the posterior of transitioning from state $j$ to state $k$ at step $n$. We compute these using the forward-backward algorithm.

In the M-step, the posteriors computed above are used to re-estimate the parameters of the model, thus increasing the complete-data likelihood.

I skip over the details of the derivation here since they can be found in the sources linked above.