The EM Algorithm (2): Evidence Lower Bound and KL Divergence

Understanding why the EM algorithm increases log-likelihood through the Evidence Lower Bound (ELBO) and KL divergence framework.

To understand why the EM algorithm is able to increase the log-likelihood, we introduce the concepts of the “Evidence Lower Bound (ELBO)” and “KL Divergence (Kullback-Leibler Divergence).”

Evidence Lower Bound (ELBO)

The Evidence Lower Bound \(\mathcal{L}(\theta, \hat{\theta})\) in the EM algorithm is defined as:

\[ \mathcal{L}(\theta, \hat{\theta}) = \int p(z|x,\hat{\theta}) \log \frac{p(x,z|\theta)}{p(z|x,\hat{\theta})}dz \]

This expression is derived from the relationship between the log-likelihood \(\log p(x|\theta)\) and the KL divergence between the posterior distributions \(p(z|x,\hat{\theta})\) and \(p(z|x,\theta)\) of the latent variable \(z\).

\[ \log p(x|\theta) = \mathcal{L}(\theta, \hat{\theta}) + KL(p(z|x,\hat{\theta}) || p(z|x,\theta)) \]

Here:

  • \(\mathcal{L}(\theta, \hat{\theta})\) is the Evidence Lower Bound (ELBO).
  • \(KL(p(z|x,\hat{\theta}) || p(z|x,\theta))\) is the KL divergence.
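This decomposition follows by multiplying and dividing inside the logarithm by the posterior at \(\hat{\theta}\) (the first equality uses \(\int p(z|x,\hat{\theta})\,dz = 1\), the second uses \(p(x|\theta) = p(x,z|\theta)/p(z|x,\theta)\)):

\[
\begin{aligned}
\log p(x|\theta) &= \int p(z|x,\hat{\theta}) \log p(x|\theta)\,dz \\
&= \int p(z|x,\hat{\theta}) \log \frac{p(x,z|\theta)}{p(z|x,\theta)}\,dz \\
&= \int p(z|x,\hat{\theta}) \log \frac{p(x,z|\theta)}{p(z|x,\hat{\theta})}\,dz + \int p(z|x,\hat{\theta}) \log \frac{p(z|x,\hat{\theta})}{p(z|x,\theta)}\,dz \\
&= \mathcal{L}(\theta, \hat{\theta}) + KL(p(z|x,\hat{\theta}) || p(z|x,\theta))
\end{aligned}
\]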

The ELBO \(\mathcal{L}(\theta, \hat{\theta})\) can also be expressed as the sum of the Q function computed in the E-step and the entropy of the posterior distribution of the latent variables.

\[ \mathcal{L}(\theta, \hat{\theta}) = Q(\theta, \hat{\theta}) + H(p(z|x,\hat{\theta})) \]
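This identity follows directly by splitting the logarithm in the definition of the ELBO:

\[
\mathcal{L}(\theta, \hat{\theta}) = \underbrace{\int p(z|x,\hat{\theta}) \log p(x,z|\theta)\,dz}_{Q(\theta, \hat{\theta})} \underbrace{- \int p(z|x,\hat{\theta}) \log p(z|x,\hat{\theta})\,dz}_{H(p(z|x,\hat{\theta}))}
\]

Note that the entropy term \(H(p(z|x,\hat{\theta}))\) depends only on \(\hat{\theta}\), not on \(\theta\); this is why maximizing \(Q\) over \(\theta\) in the M-step suffices to maximize the ELBO.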

KL Divergence

KL divergence \(KL(q||p)\) is a measure of the “information-theoretic distance” between two probability distributions \(q(x)\) and \(p(x)\). Note that it is not symmetric, \(KL(q||p) \ne KL(p||q)\) in general, so it is not a distance in the strict metric sense.

\[ KL(q||p) = \mathbb{E}_q\left[\log\frac{q(x)}{p(x)}\right] = \int q(x)\log \frac{q(x)}{p(x)}dx \]

KL divergence has the following properties:

  • It is always non-negative: \(KL(q||p) \ge 0\)
  • It equals zero only when the two distributions are identical: \(KL(q||p) = 0 \iff q(x) = p(x)\)
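Both properties can be checked numerically for discrete distributions. The following is a minimal sketch; the function name `kl_divergence` and the example probability vectors are illustrative choices, not from the text:

```python
import math

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions given as probability lists."""
    # Terms with q_i = 0 contribute 0, by the convention 0 log 0 = 0.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.2, 0.5, 0.3]
p = [0.1, 0.6, 0.3]

print(kl_divergence(q, p))   # strictly positive, since q != p
print(kl_divergence(q, q))   # → 0.0, since KL vanishes only when q = p
```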

The EM Algorithm and Monotonic Increase of the ELBO

Each step of the EM algorithm monotonically increases the log-likelihood. This can be explained through the relationship between the ELBO and KL divergence.


  1. E-step: Fix the current parameters \(\hat{\theta}\) and compute the posterior distribution \(p(z|x,\hat{\theta})\) of the latent variables \(z\). Note that \(\theta\) is not updated in this step; rather, at \(\theta = \hat{\theta}\) the two posteriors \(p(z|x,\hat{\theta})\) and \(p(z|x,\theta)\) coincide, so the KL divergence vanishes and the ELBO touches the log-likelihood: \(\mathcal{L}(\hat{\theta}, \hat{\theta}) = \log p(x|\hat{\theta})\).

  2. M-step: Using the \(p(z|x,\hat{\theta})\) computed in the E-step, find the new parameters \(\theta\) that maximize the Q function \(Q(\theta, \hat{\theta})\). Since the entropy term \(H(p(z|x,\hat{\theta}))\) does not depend on \(\theta\), maximizing the Q function also maximizes the ELBO \(\mathcal{L}(\theta, \hat{\theta})\). Because the KL divergence is non-negative, \(\log p(x|\theta) \ge \mathcal{L}(\theta, \hat{\theta}) \ge \mathcal{L}(\hat{\theta}, \hat{\theta}) = \log p(x|\hat{\theta})\), so the log-likelihood cannot decrease.

In this way, the EM algorithm monotonically increases the log-likelihood by alternating between the E-step and M-step.
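This alternation can be sketched for a one-dimensional two-component Gaussian mixture, where both steps have closed forms. Everything here, the synthetic data, initial values, and update formulas, is an illustrative assumption, not part of the text:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two Gaussians (illustrative, not from the text).
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

# Initial parameters theta = (mixture weights, means, variances).
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def log_likelihood(x, w, mu, var):
    comp = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return float(np.sum(np.log(comp.sum(axis=1))))

lls = []
for _ in range(30):
    # E-step: posterior responsibilities p(z | x, theta_hat).
    comp = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = comp / comp.sum(axis=1, keepdims=True)
    # M-step: closed-form maximizer of Q(theta, theta_hat).
    nk = r.sum(axis=0)
    w = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    lls.append(log_likelihood(x, w, mu, var))

# The log-likelihood never decreases across iterations.
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```

Tracking `lls` over iterations makes the monotonic increase guaranteed by the ELBO/KL argument directly visible.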

References

  • Taro Tezuka, “Understanding Bayesian Statistics and Machine Learning,” Kodansha (2017)