The EM Algorithm (2): Evidence Lower Bound and KL Divergence

Understanding why the EM algorithm increases log-likelihood through the Evidence Lower Bound (ELBO) and KL divergence framework.

To understand why the EM algorithm increases the log-likelihood, we introduce two concepts: the “Evidence Lower Bound (ELBO)” and the “KL divergence (Kullback-Leibler divergence).”

Evidence Lower Bound (ELBO)

The Evidence Lower Bound $\mathcal{L}(\theta, \hat{\theta})$ in the EM algorithm is defined as:

$$ \mathcal{L}(\theta, \hat{\theta}) = \int p(z|x,\hat{\theta}) \log \frac{p(x,z|\theta)}{p(z|x,\hat{\theta})}dz $$

This definition arises from decomposing the log-likelihood $\log p(x|\theta)$ into the ELBO plus the KL divergence between the fixed posterior $p(z|x,\hat{\theta})$ and the true posterior $p(z|x,\theta)$ of the latent variable $z$:

$$ \log p(x|\theta) = \mathcal{L}(\theta, \hat{\theta}) + KL(p(z|x,\hat{\theta}) || p(z|x,\theta)) $$

Here:

  • $\mathcal{L}(\theta, \hat{\theta})$ is the Evidence Lower Bound (ELBO).
  • $KL(p(z|x,\hat{\theta}) || p(z|x,\theta))$ is the KL divergence.
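\
This decomposition follows directly from Bayes' rule: substituting $p(x,z|\theta) = p(z|x,\theta)\,p(x|\theta)$ into the ELBO integral separates the log-likelihood from the KL term.

$$ \mathcal{L}(\theta, \hat{\theta}) = \int p(z|x,\hat{\theta}) \log \frac{p(z|x,\theta)\,p(x|\theta)}{p(z|x,\hat{\theta})}dz = \log p(x|\theta) - KL(p(z|x,\hat{\theta}) || p(z|x,\theta)) $$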

The ELBO $\mathcal{L}(\theta, \hat{\theta})$ can also be expressed as the sum of the Q function computed in the E-step and the entropy of the posterior distribution of the latent variables.

$$ \mathcal{L}(\theta, \hat{\theta}) = Q(\theta, \hat{\theta}) + H(p(z|x,\hat{\theta})) $$
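\
Written out, the two terms are:

$$ Q(\theta, \hat{\theta}) = \int p(z|x,\hat{\theta}) \log p(x,z|\theta)\,dz, \qquad H(p(z|x,\hat{\theta})) = -\int p(z|x,\hat{\theta}) \log p(z|x,\hat{\theta})\,dz $$

Note that the entropy term does not depend on $\theta$, so maximizing $Q(\theta, \hat{\theta})$ over $\theta$ is equivalent to maximizing the ELBO itself; this is exactly what the M-step exploits.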

KL Divergence

KL divergence $KL(q||p)$ measures the information-theoretic “distance” between two probability distributions $q(x)$ and $p(x)$. It is not a true distance, however, since it is not symmetric: in general $KL(q||p) \neq KL(p||q)$.

$$ KL(q||p) = \mathbb{E}_q[\log\frac{q(x)}{p(x)}] = \int q(x)\log \frac{q(x)}{p(x)}dx $$
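\
For example, between two univariate Gaussian distributions the integral has a well-known closed form:

$$ KL(\mathcal{N}(\mu_1,\sigma_1^2) || \mathcal{N}(\mu_2,\sigma_2^2)) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2} $$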

KL divergence has the following properties:

  • It is always non-negative: $KL(q||p) \ge 0$
  • It equals zero if and only if the two distributions are identical: $KL(q||p) = 0 \iff q(x) = p(x)$
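\
These two properties are easy to check numerically. Below is a minimal sketch for discrete distributions using NumPy; the function name and example vectors are illustrative choices, not from the original text.

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions given as probability vectors.

    Assumes p(x) > 0 wherever q(x) > 0; terms with q(x) = 0 contribute
    nothing, by the convention 0 * log(0/p) = 0.
    """
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

print(kl_divergence(q, p))   # positive: the distributions differ
print(kl_divergence(q, q))   # 0.0: identical distributions
print(kl_divergence(q, p) == kl_divergence(p, q))  # False: KL is asymmetric
```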

The EM Algorithm and Monotonic Increase of the ELBO

Each step of the EM algorithm monotonically increases the log-likelihood. This can be explained through the relationship between the ELBO and KL divergence.

  1. E-step: Fix the current parameters $\hat{\theta}$ and compute the posterior distribution $p(z|x,\hat{\theta})$ of the latent variables $z$. At $\theta = \hat{\theta}$, the KL divergence $KL(p(z|x,\hat{\theta}) || p(z|x,\theta))$ is zero, so the ELBO touches the log-likelihood: $\mathcal{L}(\hat{\theta}, \hat{\theta}) = \log p(x|\hat{\theta})$. In other words, the E-step makes the lower bound tight at the current parameters.

  2. M-step: Using the posterior $p(z|x,\hat{\theta})$ computed in the E-step, find the new parameters $\theta^{\text{new}}$ that maximize the Q function $Q(\theta, \hat{\theta})$. Because the entropy term does not depend on $\theta$, this also maximizes the ELBO $\mathcal{L}(\theta, \hat{\theta})$. Since the KL divergence is non-negative, the log-likelihood increases as well:

$$ \log p(x|\theta^{\text{new}}) \ge \mathcal{L}(\theta^{\text{new}}, \hat{\theta}) \ge \mathcal{L}(\hat{\theta}, \hat{\theta}) = \log p(x|\hat{\theta}) $$

In this way, the EM algorithm monotonically increases the log-likelihood by alternating between the E-step and M-step.
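\
As a concrete illustration, here is a minimal EM sketch for a one-dimensional two-component Gaussian mixture; the synthetic data, initial values, and iteration count are illustrative choices, not from the original text. The printed log-likelihood is non-decreasing across iterations, as the argument above guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data drawn from two Gaussians (illustrative values).
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.5, 300)])

# Initial parameters: mixing weights, means, variances.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def log_gauss(x, mu, var):
    # Log-density of N(mu, var), broadcast over the two components.
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

for it in range(20):
    # E-step: responsibilities r = p(z|x, theta_hat).
    log_w = np.log(pi) + log_gauss(x[:, None], mu, var)  # shape (n, 2)
    log_px = np.logaddexp(log_w[:, 0], log_w[:, 1])      # log p(x|theta_hat)
    r = np.exp(log_w - log_px[:, None])

    # The observed-data log-likelihood never decreases across iterations.
    print(f"iter {it:2d}  log-likelihood = {log_px.sum():.4f}")

    # M-step: closed-form maximizers of Q(theta, theta_hat).
    n_k = r.sum(axis=0)
    pi = n_k / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n_k
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
```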
