Variational Autoencoder (VAE)

A detailed explanation of Variational Autoencoders (VAE), covering the encoder-decoder architecture, ELBO derivation, and the reparameterization trick.

The Variational Autoencoder (VAE) is a type of generative model that aims to learn the latent structure of data and generate new data. Similar to how the EM algorithm maximizes the log-likelihood lower bound, VAE also learns by maximizing the Evidence Lower Bound (ELBO).

A key feature of the VAE is that it has a probabilistic encoder (recognition model) and a probabilistic decoder (generative model). Because the encoder maps each input to a distribution over latent variables rather than to a single point, and that distribution is regularized toward a prior (as described below), the latent space becomes smooth and supports meaningful data generation.

  • Recognition model (encoder): Estimates the probability distribution $q_\phi(z|x)$ of latent variables $z$ from input data $x$. It has parameters $\phi$.
  • Generative model (decoder): Generates the probability distribution $p_\theta(x|z)$ of data $x$ from latent variables $z$. It has parameters $\theta$.

The latent variable $z$ can be interpreted as a lower-dimensional, abstract “latent representation” or “latent code” of the information contained in input data $x$.

Autoencoder

An autoencoder is a neural network that learns to compress and encode input data, then reconstruct (decode) it to output the same content as the input. The intermediate layer (latent space) becomes a compressed representation (code) that captures the important features of the input data.
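As a concrete illustration, a minimal autoencoder might look like the following sketch. PyTorch is assumed here, and the class name, layer sizes, and `latent_dim` are hypothetical choices for illustration, not taken from the text above.

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal autoencoder: compress x into a low-dimensional code, then reconstruct it."""
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        # Encoder: input -> compressed code (the latent representation)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: code -> reconstruction of the input
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)      # compressed representation
        return self.decoder(code)   # reconstruction of x

# Training minimizes a reconstruction loss, e.g.:
# loss = nn.functional.mse_loss(model(x), x)
```

Unlike the VAE described below, this plain autoencoder maps each input to a single point in the latent space rather than to a distribution.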

VAE introduces the concept of variational inference into this autoencoder framework.

The ELBO in VAE

VAE learning is performed by maximizing the following Evidence Lower Bound (ELBO):

$$ \mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) || p(z)) $$

This formula consists of two terms:

  1. Reconstruction error (first term): $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$. This represents how accurately the decoder can reconstruct the original input $x$ using the latent variable $z$ generated by the encoder. Maximizing this term corresponds to minimizing the reconstruction error.

  2. Regularization term (second term): $KL(q_\phi(z|x) || p(z))$. This measures how close the distribution $q_\phi(z|x)$ of latent variables $z$ estimated by the encoder is to the predefined prior distribution $p(z)$ of the latent variables. Minimizing this term serves to make the latent space smooth and enable meaningful data generation. A code sketch that computes both terms as a training loss follows below.
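To make the two terms concrete, the following sketch computes the negative ELBO as a training loss (minimizing it maximizes the ELBO). It assumes the specific modeling choices introduced later in this article: a diagonal Gaussian encoder, a standard normal prior, and a Gaussian decoder with unit variance. PyTorch and the function name `negative_elbo` are illustration choices, not from the source.

```python
import torch

def negative_elbo(x, x_recon, mu, log_var):
    """Negative ELBO = reconstruction error + KL regularization (minimized during training).

    x        : original input
    x_recon  : decoder output nu_theta(z)
    mu       : encoder mean mu_phi(x)
    log_var  : encoder log-variance log sigma^2_phi(x)
    """
    # Reconstruction term: -E_q[log p_theta(x|z)], up to an additive constant,
    # for a Gaussian decoder with unit variance.
    recon = 0.5 * torch.sum((x_recon - x) ** 2)
    # Regularization term: KL(q_phi(z|x) || N(0, I)), which has a closed form
    # for diagonal Gaussians (derived later in this article).
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```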

Recognition Model $q_\phi(z|x)$

The recognition model estimates the distribution of latent variables $z$ from input $x$. Typically, a neural network outputs the mean $\mu_\phi(x)$ and variance $\sigma^2_\phi(x)$ (or log-variance $\log \sigma^2_\phi(x)$) of the diagonal multivariate normal distribution that $z$ is assumed to follow.

$$ q_\phi(z|x) = \mathcal{N}(z | \mu_\phi(x), \text{diag}(\sigma^2_\phi(x))) $$

Here, $\mu_\phi(x)$ and $\sigma^2_\phi(x)$ are outputs of the neural network that takes input $x$.
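A sketch of such a recognition model is shown below. PyTorch is assumed, and the hidden size and layer names are hypothetical; the network outputs $\mu_\phi(x)$ and $\log \sigma^2_\phi(x)$, the parameters of the diagonal Gaussian above.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Recognition model q_phi(z|x): maps x to the parameters of a diagonal Gaussian."""
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)       # mu_phi(x)
        self.fc_log_var = nn.Linear(hidden_dim, latent_dim)  # log sigma^2_phi(x)

    def forward(self, x):
        h = self.hidden(x)
        return self.fc_mu(h), self.fc_log_var(h)
```

Outputting the log-variance rather than the variance keeps the predicted variance positive without any constraint on the network output.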

Generative Model $p_\theta(x|z)$

The generative model produces the distribution of data $x$ from latent variables $z$. The choice of probability distribution depends on the type of data $x$.

  • Binary data (e.g., black-and-white images): Bernoulli distribution (or categorical distribution)
  • Continuous data (e.g., grayscale images): Gaussian distribution

For example, for continuous data, a multivariate normal distribution with the covariance fixed to the identity matrix $I$ (unit variance in every dimension) may be used:

$$ p_\theta(x|z) = \mathcal{N}(x | \nu_\theta(z), I) $$

Here, $\nu_\theta(z)$ is the output of the neural network that takes latent variable $z$ as input.
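A matching sketch of the generative model is given below (again PyTorch with hypothetical layer sizes). With the covariance fixed to $I$, maximizing $\log p_\theta(x|z)$ is equivalent, up to a constant, to minimizing the squared error between $x$ and $\nu_\theta(z)$.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Generative model p_theta(x|z) = N(x | nu_theta(z), I): maps z to the mean nu_theta(z)."""
    def __init__(self, latent_dim=20, hidden_dim=256, output_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),  # nu_theta(z), the mean of the Gaussian
        )

    def forward(self, z):
        return self.net(z)

# With unit variance, -log p_theta(x|z) = 0.5 * ||x - nu_theta(z)||^2 + const,
# so the reconstruction loss is (up to constants) a sum-of-squares error.
# For a Bernoulli decoder, the final layer would instead use a sigmoid and the
# reconstruction term would be a binary cross-entropy.
```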

Prior Distribution of Latent Variables $p(z)$

The prior distribution of latent variables is typically defined as a product of standard normal distributions (mean 0, variance 1).

$$ p(z) = \prod_{j=1}^k \mathcal{N}(z_j | 0, 1) $$

This prior is fixed during training and does not depend on parameters $\theta$ or $\phi$.
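Because both $q_\phi(z|x)$ and $p(z)$ are diagonal Gaussians, the regularization term in the ELBO has a well-known closed form (a standard result, summing over the $k$ latent dimensions):

$$ KL(q_\phi(z|x) || p(z)) = \frac{1}{2} \sum_{j=1}^k \left( \mu_{\phi,j}^2(x) + \sigma_{\phi,j}^2(x) - \log \sigma_{\phi,j}^2(x) - 1 \right) $$

This expression depends on $\phi$ only through $\mu_\phi(x)$ and $\sigma^2_\phi(x)$, so it can be differentiated directly without any sampling.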

Gradient Descent and the Reparameterization Trick

VAE training uses gradient descent (optimization algorithms such as Adam) to update parameters $\theta$ and $\phi$ to maximize the ELBO $\mathcal{L}(\theta, \phi)$.

The gradient with respect to $\theta$ is straightforward, and the regularization term $KL(q_\phi(z|x) || p(z))$ can be computed in closed form when both distributions are Gaussian, as shown above. The difficulty lies in the gradient of the reconstruction term $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ with respect to $\phi$: the expectation is taken over $q_\phi(z|x)$, which itself depends on $\phi$, and sampling $z$ from this distribution is not a differentiable operation.

To solve this problem, the reparameterization trick is used. This technique expresses the latent variable as $z = g(\epsilon, x, \phi)$, where $\epsilon$ is a random variable whose distribution does not depend on the parameter $\phi$ and $g$ is a deterministic function that does.

For example, in the case of a Gaussian distribution, $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ (where $\epsilon \sim \mathcal{N}(0, I)$ and $\odot$ denotes the element-wise product). The expectation can then be rewritten as an expectation over $\epsilon$, which does not depend on $\phi$, so a Monte Carlo estimate of the ELBO is differentiable with respect to both $\theta$ and $\phi$ and standard gradient descent can be applied.
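Putting the pieces together, the following sketch implements the reparameterization trick inside a full VAE and shows one training step. PyTorch is assumed, and `Encoder`, `Decoder`, and `negative_elbo` refer to the hypothetical sketches given earlier in this article.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Full model combining the Encoder and Decoder sketches above."""
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        self.encoder = Encoder(input_dim=input_dim, latent_dim=latent_dim)
        self.decoder = Decoder(latent_dim=latent_dim, output_dim=input_dim)

    def reparameterize(self, mu, log_var):
        # z = mu_phi(x) + sigma_phi(x) * eps, with eps ~ N(0, I) independent of phi.
        # All randomness enters through eps, so gradients with respect to phi
        # flow through mu and sigma.
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * log_var) * eps

    def forward(self, x):
        mu, log_var = self.encoder(x)          # parameters of q_phi(z|x)
        z = self.reparameterize(mu, log_var)   # differentiable sample of z
        return self.decoder(z), mu, log_var    # nu_theta(z) plus the encoder outputs


# One training step with Adam, minimizing the negative ELBO defined earlier:
# model = VAE()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# x_recon, mu, log_var = model(x)               # x: a batch of flattened inputs
# loss = negative_elbo(x, x_recon, mu, log_var)
# loss.backward()
# optimizer.step()
# optimizer.zero_grad()
```

Generating new data afterwards only requires the decoder: sample $z$ from the prior $p(z)$ (e.g. `torch.randn(n, latent_dim)`) and pass it through `model.decoder`.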
