Fundamentals of Bayesian Estimation

An introduction to Bayesian estimation covering Bayes' theorem, prior and posterior distributions, MAP estimation, and the Bayesian updating process.

Bayesian estimation is a statistical approach to estimating the parameters of a statistical model that takes into account not only the observed data but also prior knowledge about those parameters. This prior knowledge is expressed as a prior distribution.

Bayes’ Theorem

The foundation of Bayesian estimation is Bayes’ theorem, a fundamental result that follows directly from the definition of conditional probability.

Given an observation $x$ of a random variable $X$ and a parameter $\theta$, the following relationship holds:

$$ p(\theta|x) = \frac{p(x|\theta)p(\theta)}{p(x)} $$

Each term in this theorem is named as follows:

  • $p(\theta|x)$: Posterior Distribution. The probability distribution of the parameter $\theta$ after observing data $x$. It represents the updated belief about the parameter informed by the data.
  • $p(x|\theta)$: Likelihood. The probability (or probability density) of observing data $x$ given that the parameter is $\theta$. This is the same function that is maximized in maximum likelihood estimation.
  • $p(\theta)$: Prior Distribution. The probability distribution representing prior knowledge or beliefs about the parameter $\theta$ before observing data $x$.
  • $p(x)$: Marginal Likelihood / Evidence / Normalizing Constant. The probability of the data $x$ obtained by marginalizing over the parameter $\theta$. It serves as a constant that ensures the posterior distribution integrates (or sums) to 1, and is computed as follows: $$ p(x) = \int p(x|\theta')p(\theta')d\theta' $$
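
As a concrete illustration, the following minimal sketch evaluates each term of Bayes’ theorem numerically for a coin-flipping example. The Bernoulli/binomial likelihood, the uniform prior, the discrete grid of candidate $\theta$ values, and all variable names are illustrative assumptions, not part of the text above.

```python
import numpy as np

# Grid of candidate parameter values theta (probability of heads).
theta = np.linspace(0.01, 0.99, 99)

# Prior p(theta): uniform over the grid.
prior = np.full_like(theta, 1.0 / theta.size)

# Observed data x: 7 heads out of 10 coin flips.
heads, flips = 7, 10

# Likelihood p(x|theta) for each candidate theta
# (binomial coefficient omitted; it cancels in the normalization below).
likelihood = theta**heads * (1.0 - theta)**(flips - heads)

# Marginal likelihood p(x): likelihood averaged over the prior.
evidence = np.sum(likelihood * prior)

# Posterior p(theta|x) via Bayes' theorem.
posterior = likelihood * prior / evidence

print("posterior sums to:", posterior.sum())                # approximately 1.0
print("most probable theta:", theta[np.argmax(posterior)])  # approximately 0.7
```

Restricting $\theta$ to a grid turns the integral defining $p(x)$ into a simple sum, so in this discrete setting the posterior sums to 1 by construction.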

MAP Estimation (Maximum A Posteriori Estimation)

MAP estimation finds the parameter $\hat{\theta}_{MAP}$ that maximizes the posterior distribution $p(\theta|x)$. It can be thought of as maximum likelihood estimation with the addition of a prior distribution.

$$ \hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta|x) = \arg\max_{\theta} \frac{p(x|\theta)p(\theta)}{p(x)} $$

Since $p(x)$ is a constant that does not depend on $\theta$, this effectively maximizes the product of the likelihood $p(x|\theta)$ and the prior distribution $p(\theta)$.

$$ \hat{\theta}_{MAP} = \arg\max_{\theta} p(x|\theta)p(\theta) $$
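
As a rough sketch of this maximization, the example below computes the MAP estimate for the same coin-flipping setup, now with a Beta(2, 2) prior. The Beta/Bernoulli model and the grid search are assumptions made for illustration; the closed-form expression used as a check is the standard MAP estimate for a Beta prior with binomial data.

```python
import numpy as np

# Data: 7 heads out of 10 flips; Beta(2, 2) prior on theta (assumed for illustration).
heads, flips = 7, 10
a, b = 2.0, 2.0

theta = np.linspace(0.001, 0.999, 999)

# Log posterior up to an additive constant: log p(x|theta) + log p(theta).
log_posterior = ((heads + a - 1) * np.log(theta)
                 + (flips - heads + b - 1) * np.log(1.0 - theta))

# MAP estimate: the grid point that maximizes the (log) posterior.
theta_map = theta[np.argmax(log_posterior)]

# Closed form for a Beta(a, b) prior: (heads + a - 1) / (flips + a + b - 2).
theta_map_exact = (heads + a - 1) / (flips + a + b - 2)

print(theta_map, theta_map_exact)  # both approximately 0.667
```

Maximizing the log posterior rather than the posterior itself is numerically more stable and leaves the location of the maximum unchanged.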

Bayesian Estimation

While MAP estimation is a point estimate that uses the peak (most probable point) of the posterior distribution as the parameter estimate, Bayesian estimation uses the posterior distribution $p(\theta|x)$ itself as the inference result rather than seeking a single value for the parameter $\theta$.

This enables predictions that account for parameter uncertainty. For example, when computing the predictive distribution $p(y|x)$ for new data $y$, the posterior distribution $p(\theta|x)$ is used to average predictions over all possible $\theta$.

$$ p(y|x) = \int p(y|\theta)p(\theta|x)d\theta $$

If a single predicted value is desired, the expectation of this predictive distribution $p(y|x)$ can be computed.

$$ \hat{y} = \mathbb{E}[y|x] = \int y p(y|x)dy $$

In this way, Bayesian estimation derives the posterior distribution based on the prior distribution, the likelihood function, and the observed data, and performs inference using the entire posterior distribution.
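
Continuing the illustrative Beta-Bernoulli coin example (an assumption for this sketch, not something fixed by the text), the code below approximates the predictive distribution for the next flip by averaging $p(y|\theta)$ over a grid approximation of the posterior, and compares the result with the known closed form for this model.

```python
import numpy as np

# Same illustrative setup: Beta(2, 2) prior, 7 heads in 10 flips.
heads, flips = 7, 10
a, b = 2.0, 2.0

# Grid approximation of the posterior p(theta|x) = Beta(a + heads, b + tails).
theta = np.linspace(0.001, 0.999, 999)
posterior = theta**(a + heads - 1) * (1.0 - theta)**(b + flips - heads - 1)
posterior /= posterior.sum()

# Predictive probability of heads on the next flip:
# p(y = 1 | x) = integral of p(y = 1 | theta) * p(theta | x) dtheta,
# approximated by a sum over the grid (here p(y = 1 | theta) = theta).
p_next_heads = np.sum(theta * posterior)

# Closed form for this model: the posterior mean of theta.
p_exact = (a + heads) / (a + b + flips)

print(p_next_heads, p_exact)  # both approximately 0.643
```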

Bayesian Updating

Bayesian updating is the process of repeatedly applying Bayes’ theorem, using each updated posterior distribution as the new prior distribution whenever new data is observed.

For example, suppose we obtain the posterior distribution $p(\theta|x^{(1)})$ after observing the first data point $x^{(1)}$. When a new data point $x^{(2)}$ is observed, we treat $p(\theta|x^{(1)})$ as the new prior distribution and apply Bayes’ theorem again, using $p(x^{(2)}|\theta)$ as the likelihood (this assumes the observations are conditionally independent given $\theta$), to obtain the posterior distribution $p(\theta|x^{(1)}, x^{(2)})$ that accounts for both $x^{(1)}$ and $x^{(2)}$.

$$ p(\theta|x^{(1)}, x^{(2)}) = \frac{p(x^{(2)}|\theta)p(\theta|x^{(1)})}{p(x^{(2)}|x^{(1)})} $$

Through this sequential updating, estimation accuracy improves as more data is accumulated, and the uncertainty in the posterior distribution decreases.
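
The following sketch illustrates this sequential updating on the same hypothetical coin-flipping setup with a uniform prior on a grid: the posterior after the first batch of flips is reused as the prior for the second batch, and the result matches the posterior obtained by processing all of the data at once, which holds here because the flips are conditionally independent given $\theta$. The batch sizes and helper function are illustrative assumptions.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)

def update(prior, heads, tails):
    """One application of Bayes' theorem on the grid: posterior ~ likelihood * prior."""
    likelihood = theta**heads * (1.0 - theta)**tails
    posterior = likelihood * prior
    return posterior / posterior.sum()

# Start from a uniform prior over the grid.
uniform = np.full_like(theta, 1.0 / theta.size)

# First batch x(1): 3 heads, 2 tails  ->  posterior p(theta | x(1)).
posterior_1 = update(uniform, 3, 2)

# Second batch x(2): 4 heads, 1 tail, using the previous posterior as the new prior.
posterior_2 = update(posterior_1, 4, 1)

# Processing all 10 flips at once yields the same posterior.
posterior_all = update(uniform, 7, 3)

print(np.allclose(posterior_2, posterior_all))  # True
```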
