Variational Bayes

An introduction to variational Bayesian inference, covering how it differs from the EM algorithm, the evidence lower bound, and the mean-field approximation used to estimate the posterior.

Variational Bayes is a method for performing inference with a tractable approximate distribution when computing the exact posterior distribution is infeasible. Like the EM algorithm, it relies on the “Evidence Lower Bound” (ELBO), but unlike EM it estimates a distribution over the parameters rather than a point estimate.

  • Variational: Refers to the calculus of variations, that is, differentiating a functional with respect to the function it takes as its argument.
  • Functional: A function that takes a function as input and outputs a scalar value.

The Evidence Lower Bound in Variational Bayes

The ELBO in the EM Algorithm

In the EM algorithm, we considered the following Evidence Lower Bound $\mathcal{L}(\theta, \hat{\theta})$ as a lower bound on the log-likelihood $\log p(x|\theta)$ of the observed data $x$.

$$ \log p(x|\theta) = \mathcal{L}(\theta, \hat{\theta}) + KL(p(z|x,\hat{\theta})||p(z|x,\theta)) $$

Here, $\hat{\theta}$ is the current parameter estimate, and $\mathcal{L}(\theta, \hat{\theta})$ is the sum of the Q function $Q(\theta, \hat{\theta})$ and the entropy of $p(z|x,\hat{\theta})$.
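
Written out, and assuming for illustration that $z$ is discrete so that integrals become sums, this decomposition into the Q function and the entropy reads as follows (the symbol $H$ for the entropy is notation introduced here):

$$ \mathcal{L}(\theta, \hat{\theta}) = \underbrace{\sum_z p(z|x,\hat{\theta}) \log p(x,z|\theta)}_{Q(\theta, \hat{\theta})} \; \underbrace{- \sum_z p(z|x,\hat{\theta}) \log p(z|x,\hat{\theta})}_{H[p(z|x,\hat{\theta})]} $$

Only the first term depends on $\theta$, which is why the M-step can maximize the Q function alone.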

The ELBO in Variational Bayes

In variational Bayes, the parameters $\theta$ and the latent variables $z$ are treated together as a single set of latent variables $w = (\theta, z)$. An approximate distribution $q(w)$ is introduced to approximate the true posterior distribution $p(w|x)$. The goal is to bring $q(w)$ as close as possible to the true posterior, which is done by maximizing a lower bound on the log marginal likelihood $\log p(x)$ of the observed data $x$. The bound comes from the following decomposition:

$$ \log p(x) = \mathcal{L}(q) + KL(q(w)||p(w|x)) $$

Here, $\mathcal{L}(q)$ is the Evidence Lower Bound in variational Bayes.

$$ \mathcal{L}(q) = \int q(w) \log \frac{p(x,w)}{q(w)} dw $$

This can be expressed in terms of expectations as:

$$ \mathcal{L}(q) = \mathbb{E}_{q(w)}[\log p(x,w)] - \mathbb{E}_{q(w)}[\log q(w)] $$

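Maximizing the ELBO $\mathcal{L}(q)$ is equivalent to minimizing the KL divergence $KL(q(w)||p(w|x))$ from the approximate distribution $q(w)$ to the true posterior $p(w|x)$. This is because $\log p(x)$ is a constant independent of $q(w)$, so any increase in $\mathcal{L}(q)$ is matched by an equal decrease in the KL divergence. Since the KL divergence is always non-negative, $\mathcal{L}(q) \le \log p(x)$, with equality exactly when $q(w) = p(w|x)$.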
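
The decomposition can be checked numerically on a toy model. The sketch below assumes a discrete latent variable $w$ with three states and a made-up table of joint probabilities $p(x, w)$ for one fixed observation $x$; the helper name `elbo_and_kl` and all numbers are illustrative assumptions.

```python
import numpy as np

def elbo_and_kl(joint, q):
    """Return (ELBO, KL(q || p(w|x)), log p(x)) for a discrete latent w.

    joint[k] holds p(x, w=k) for one fixed observation x (assumed toy values).
    q[k] holds the approximate distribution q(w=k); all entries must be > 0.
    """
    joint = np.asarray(joint, dtype=float)
    q = np.asarray(q, dtype=float)
    evidence = joint.sum()                # p(x) = sum_w p(x, w)
    posterior = joint / evidence          # p(w|x)
    elbo = np.sum(q * (np.log(joint) - np.log(q)))
    kl = np.sum(q * (np.log(q) - np.log(posterior)))
    return elbo, kl, np.log(evidence)

joint = np.array([0.10, 0.25, 0.05])      # toy p(x, w) values
q = np.array([0.2, 0.5, 0.3])             # an arbitrary approximation

elbo, kl, log_evidence = elbo_and_kl(joint, q)
print("ELBO + KL matches log p(x):", np.isclose(elbo + kl, log_evidence))

# Setting q to the exact posterior drives the KL term to zero,
# so the ELBO reaches log p(x).
elbo_opt, kl_opt, _ = elbo_and_kl(joint, joint / joint.sum())
print("KL at the exact posterior:", kl_opt)
```

Whatever $q$ is chosen, the ELBO and the KL term sum to the same $\log p(x)$, so raising one lowers the other.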

In variational Bayes, it is typically assumed that the approximate distribution $q(w)$ factorizes as a product over the individual components of the latent variables (mean-field approximation).

$$ q(w) = \prod_i q_i(w_i) $$

Under this assumption, maximizing the ELBO yields update formulas for each $q_i(w_i)$, and the optimal approximate distribution is found by iterating these updates.
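
For reference, the standard form of the mean-field update is $\log q_j^*(w_j) = \mathbb{E}_{i \neq j}[\log p(x,w)] + \text{const}$, where the expectation is taken over all factors except $q_j$. As a minimal sketch of what iterating these updates looks like, the code below applies them to a common textbook model that is an assumption of this example, not something taken from the text above: data $x_n \sim \mathcal{N}(\mu, \tau^{-1})$ with priors $\mu|\tau \sim \mathcal{N}(\mu_0, (\lambda_0 \tau)^{-1})$ and $\tau \sim \text{Gamma}(a_0, b_0)$, and the factorization $q(\mu, \tau) = q(\mu) q(\tau)$. The function name `cavi_gaussian` and the default hyperparameters are hypothetical.

```python
import numpy as np

def cavi_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Mean-field updates for a Gaussian with unknown mean and precision.

    Assumed factors: q(mu) = Normal(mu_n, 1 / lam_n), q(tau) = Gamma(a_n, b_n)
    with b_n a rate parameter, so E[tau] = a_n / b_n.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # These two quantities do not depend on the other factor, so they are fixed.
    mu_n = (lam0 * mu0 + n * x.mean()) / (lam0 + n)
    a_n = a0 + (n + 1) / 2.0
    # Initialize the precision of q(mu) from the prior mean of tau.
    lam_n = (lam0 + n) * (a0 / b0)
    b_n = b0
    for _ in range(n_iter):
        # Update q(tau): expected squared deviations under the current q(mu).
        var_mu = 1.0 / lam_n
        expected_sq = (np.sum((x - mu_n) ** 2) + n * var_mu
                       + lam0 * ((mu_n - mu0) ** 2 + var_mu))
        b_n = b0 + 0.5 * expected_sq
        # Update q(mu): its precision uses E[tau] under the current q(tau).
        lam_n = (lam0 + n) * (a_n / b_n)
    return mu_n, lam_n, a_n, b_n

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=500)   # synthetic data
mu_n, lam_n, a_n, b_n = cavi_gaussian(data)
print("E[mu] =", mu_n, " E[tau] =", a_n / b_n)
```

Each pass updates one factor while holding the other fixed, which is the iteration described above. In this conjugate model each update has a closed form, so no numerical optimization is needed.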

References

  • Taro Tezuka, “Understanding Bayesian Statistics and Machine Learning,” Kodansha (2017)