The EM Algorithm (2): Evidence Lower Bound and KL Divergence

Understanding why the EM algorithm increases log-likelihood through the Evidence Lower Bound (ELBO) and KL divergence framework.

To understand why the EM algorithm is able to increase the log-likelihood, we introduce the concepts of the “Evidence Lower Bound (ELBO)” and “KL Divergence (Kullback-Leibler Divergence).”

Evidence Lower Bound (ELBO)

The Evidence Lower Bound \(\mathcal{L}(\theta, \hat{\theta})\) in the EM algorithm is defined as:

\[ \mathcal{L}(\theta, \hat{\theta}) = \int p(z|x,\hat{\theta}) \log \frac{p(x,z|\theta)}{p(z|x,\hat{\theta})}dz \]

This expression is derived from the relationship between the log-likelihood \(\log p(x|\theta)\) and the KL divergence between the posterior distributions \(p(z|x,\hat{\theta})\) and \(p(z|x,\theta)\) of the latent variable \(z\).

\[ \log p(x|\theta) = \mathcal{L}(\theta, \hat{\theta}) + KL(p(z|x,\hat{\theta}) || p(z|x,\theta)) \]

Here:

  • \(\mathcal{L}(\theta, \hat{\theta})\) is the Evidence Lower Bound (ELBO).
  • \(KL(p(z|x,\hat{\theta}) || p(z|x,\theta))\) is the KL divergence.
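This decomposition follows by multiplying and dividing inside the logarithm by the posterior at \(\hat{\theta}\) (the first equality uses \(\int p(z|x,\hat{\theta})\,dz = 1\), the second uses \(p(x|\theta) = p(x,z|\theta)/p(z|x,\theta)\)):

\[
\begin{aligned}
\log p(x|\theta) &= \int p(z|x,\hat{\theta}) \log p(x|\theta)\,dz \\
&= \int p(z|x,\hat{\theta}) \log \frac{p(x,z|\theta)}{p(z|x,\theta)}\,dz \\
&= \int p(z|x,\hat{\theta}) \log \frac{p(x,z|\theta)}{p(z|x,\hat{\theta})}\,dz + \int p(z|x,\hat{\theta}) \log \frac{p(z|x,\hat{\theta})}{p(z|x,\theta)}\,dz \\
&= \mathcal{L}(\theta, \hat{\theta}) + KL(p(z|x,\hat{\theta}) || p(z|x,\theta))
\end{aligned}
\]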

The ELBO \(\mathcal{L}(\theta, \hat{\theta})\) can also be expressed as the sum of the Q function computed in the E-step and the entropy of the posterior distribution of the latent variables.

\[ \mathcal{L}(\theta, \hat{\theta}) = Q(\theta, \hat{\theta}) + H(p(z|x,\hat{\theta})) \]
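This identity follows directly by splitting the logarithm in the definition of the ELBO:

\[
\mathcal{L}(\theta, \hat{\theta}) = \underbrace{\int p(z|x,\hat{\theta}) \log p(x,z|\theta)\,dz}_{Q(\theta, \hat{\theta})} \underbrace{- \int p(z|x,\hat{\theta}) \log p(z|x,\hat{\theta})\,dz}_{H(p(z|x,\hat{\theta}))}
\]

Note that the entropy term \(H(p(z|x,\hat{\theta}))\) depends only on \(\hat{\theta}\), not on \(\theta\); this is why maximizing \(Q\) over \(\theta\) in the M-step suffices to maximize the ELBO.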

KL Divergence

KL divergence \(KL(q||p)\) is a measure of the “information-theoretic distance” between two probability distributions \(q(x)\) and \(p(x)\). Note that it is not symmetric, \(KL(q||p) \ne KL(p||q)\) in general, so it is not a distance in the strict metric sense.

\[ KL(q||p) = \mathbb{E}_q\left[\log\frac{q(x)}{p(x)}\right] = \int q(x)\log \frac{q(x)}{p(x)}dx \]

KL divergence has the following properties:

  • It is always non-negative: \(KL(q||p) \ge 0\)
  • It equals zero only when the two distributions are identical: \(KL(q||p) = 0 \iff q(x) = p(x)\)
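Both properties can be checked numerically for discrete distributions. The following is a minimal sketch; the function name `kl_divergence` and the example probability vectors are illustrative choices, not from the text:

```python
import math

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions given as probability lists."""
    # Terms with q_i = 0 contribute 0, by the convention 0 log 0 = 0.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.2, 0.5, 0.3]
p = [0.1, 0.6, 0.3]

print(kl_divergence(q, p))   # strictly positive, since q != p
print(kl_divergence(q, q))   # → 0.0, since KL vanishes only when q = p
```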

The EM Algorithm and Monotonic Increase of the ELBO

Each step of the EM algorithm monotonically increases the log-likelihood. This can be explained through the relationship between the ELBO and KL divergence.


  1. E-step: Fix the current parameters \(\hat{\theta}\) and compute the posterior distribution \(p(z|x,\hat{\theta})\) of the latent variables \(z\). Note that \(\theta\) is not updated in this step; rather, at \(\theta = \hat{\theta}\) the two posteriors \(p(z|x,\hat{\theta})\) and \(p(z|x,\theta)\) coincide, so the KL divergence vanishes and the ELBO touches the log-likelihood: \(\mathcal{L}(\hat{\theta}, \hat{\theta}) = \log p(x|\hat{\theta})\).

  2. M-step: Using the \(p(z|x,\hat{\theta})\) computed in the E-step, find the new parameters \(\theta\) that maximize the Q function \(Q(\theta, \hat{\theta})\). Since the entropy term \(H(p(z|x,\hat{\theta}))\) does not depend on \(\theta\), maximizing the Q function also maximizes the ELBO \(\mathcal{L}(\theta, \hat{\theta})\). Because the KL divergence is non-negative, \(\log p(x|\theta) \ge \mathcal{L}(\theta, \hat{\theta}) \ge \mathcal{L}(\hat{\theta}, \hat{\theta}) = \log p(x|\hat{\theta})\), so the log-likelihood cannot decrease.

In this way, the EM algorithm monotonically increases the log-likelihood by alternating between the E-step and M-step.
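This alternation can be sketched for a one-dimensional two-component Gaussian mixture, where both steps have closed forms. Everything here, the synthetic data, initial values, and update formulas, is an illustrative assumption, not part of the text:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two Gaussians (illustrative, not from the text).
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

# Initial parameters theta = (mixture weights, means, variances).
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def log_likelihood(x, w, mu, var):
    comp = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return float(np.sum(np.log(comp.sum(axis=1))))

lls = []
for _ in range(30):
    # E-step: posterior responsibilities p(z | x, theta_hat).
    comp = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = comp / comp.sum(axis=1, keepdims=True)
    # M-step: closed-form maximizer of Q(theta, theta_hat).
    nk = r.sum(axis=0)
    w = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    lls.append(log_likelihood(x, w, mu, var))

# The log-likelihood never decreases across iterations.
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```

Tracking `lls` over iterations makes the monotonic increase guaranteed by the ELBO/KL argument directly visible.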

References

  • Taro Tezuka, “Understanding Bayesian Statistics and Machine Learning,” Kodansha (2017)