The Lagrange multiplier method is a powerful mathematical technique for finding the maximum or minimum of a multivariable function subject to equality constraints.
Principle
Consider the problem of optimizing an objective function $f(x, y)$ subject to the constraint $g(x, y) = 0$. At a constrained optimum, a contour line of the objective function is tangent to the constraint curve, so the gradient vectors of the two functions are parallel there (provided $\nabla g \neq 0$).
The gradient vector points in the direction of steepest increase of a function and is defined using the gradient operator $\nabla$ as follows: $$ \nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) $$
Therefore, at the optimal point, the following relationship holds: $$ \nabla f(x, y) = \lambda \nabla g(x, y) $$ where $\lambda$ is a constant called the Lagrange multiplier.
By solving this equation together with the constraint $g(x, y) = 0$, we can find candidate points for the optimal solution.
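Written out componentwise, the parallel-gradient condition together with the constraint gives a system of three equations in the three unknowns $x$, $y$, and $\lambda$: $$ \frac{\partial f}{\partial x} = \lambda \frac{\partial g}{\partial x}, \quad \frac{\partial f}{\partial y} = \lambda \frac{\partial g}{\partial y}, \quad g(x, y) = 0 $$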
The Lagrangian
The Lagrange multiplier method converts a constrained optimization problem into an unconstrained one by introducing the Lagrangian $\mathcal{L}(x, y, \lambda)$.
$$ \mathcal{L}(x, y, \lambda) = f(x, y) - \lambda g(x, y) $$
Taking partial derivatives of this Lagrangian with respect to each variable ($x, y, \lambda$) and setting them to zero yields the candidate optimal solutions.
$$ \frac{\partial \mathcal{L}}{\partial x} = 0, \quad \frac{\partial \mathcal{L}}{\partial y} = 0, \quad \frac{\partial \mathcal{L}}{\partial \lambda} = 0 $$
The last equation, $\frac{\partial \mathcal{L}}{\partial \lambda} = -g(x, y) = 0$, is simply the original constraint $g(x, y) = 0$ itself.
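As a quick check of this construction, here is a minimal sketch (assuming SymPy is installed; the symbol names are chosen for illustration) that builds the Lagrangian for generic $f$ and $g$ and confirms that differentiating with respect to $\lambda$ reproduces the constraint:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = sp.Function('f')(x, y)   # generic objective
g = sp.Function('g')(x, y)   # generic constraint function
L = f - lam * g              # Lagrangian L = f - lambda * g

# Stationarity conditions: each partial derivative is set to zero
print(sp.diff(L, x))    # Derivative(f(x, y), x) - lambda*Derivative(g(x, y), x)
print(sp.diff(L, y))    # Derivative(f(x, y), y) - lambda*Derivative(g(x, y), y)
print(sp.diff(L, lam))  # -g(x, y): setting this to zero recovers g(x, y) = 0
```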
Worked Example
Minimize the function $f(x, y) = x^2 + y^2$ subject to the constraint $2x + y + 1 = 0$.
The Lagrangian is: $$ \mathcal{L}(x, y, \lambda) = x^2 + y^2 - \lambda (2x + y + 1) $$
Taking partial derivatives and setting them to zero:
- $\frac{\partial \mathcal{L}}{\partial x} = 2x - 2\lambda = 0 \implies x = \lambda$
- $\frac{\partial \mathcal{L}}{\partial y} = 2y - \lambda = 0 \implies y = \frac{1}{2}\lambda$
- $\frac{\partial \mathcal{L}}{\partial \lambda} = -(2x + y + 1) = 0 \implies 2x + y + 1 = 0$
Substituting $x = \lambda$ and $y = \frac{1}{2}\lambda$ into the constraint: $$ 2\lambda + \frac{1}{2}\lambda + 1 = 0 \implies \frac{5}{2}\lambda = -1 \implies \lambda = -\frac{2}{5} $$
Substituting this value back into $x = \lambda$ and $y = \frac{1}{2}\lambda$ gives $x = -\frac{2}{5}$ and $y = -\frac{1}{5}$.
Therefore, the point that minimizes $f(x, y)$ under the constraint is $(-\frac{2}{5}, -\frac{1}{5})$, with minimum value $f = \frac{1}{5}$. Since $f$ is convex and the constraint is linear, this stationary point is indeed the constrained minimum rather than a maximum or saddle point.
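The same stationarity system can be solved mechanically. The sketch below (assuming SymPy; variable names are illustrative) reproduces the candidate point $(-\frac{2}{5}, -\frac{1}{5})$:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = x**2 + y**2          # objective
g = 2*x + y + 1          # constraint g(x, y) = 0
L = f - lam * g          # Lagrangian

# Solve the three stationarity equations simultaneously
sols = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.diff(L, lam)],
                [x, y, lam], dict=True)
print(sols)              # x = -2/5, y = -1/5, lambda = -2/5
print(f.subs(sols[0]))   # minimum value: 1/5
```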
Application to MLE of the Multinomial Distribution
In maximum likelihood estimation of the multinomial distribution, we need to maximize the log-likelihood function subject to the constraint that the probabilities sum to 1. The Lagrange multiplier method is effective in such cases.
The probability mass function of the multinomial distribution, with $x_j$ denoting the count of category $j$ and $\mu_j$ its probability, is: $$ p(x_1, \dots, x_k | \mu_1, \dots, \mu_k) = \frac{(\sum_j x_j)!}{\prod_j x_j!} \prod_{j=1}^k \mu_j^{x_j} $$ Dropping the multinomial coefficient, which does not depend on the parameters, the log-likelihood is $\sum_{j=1}^k x_j \log \mu_j$. The constraint is $\sum_{j=1}^k \mu_j - 1 = 0$.
The Lagrangian is: $$ \mathcal{L}(\mu_1, \dots, \mu_k, \lambda) = \sum_{j=1}^k x_j \log \mu_j - \lambda (\sum_{j=1}^k \mu_j - 1) $$
Taking the partial derivative with respect to $\mu_h$ and setting it to zero: $$ \frac{\partial \mathcal{L}}{\partial \mu_h} = \frac{x_h}{\mu_h} - \lambda = 0 \implies \mu_h = \frac{x_h}{\lambda} $$
Since $\mu_h = x_h / \lambda$ holds for all $h$, we substitute into the constraint $\sum_{j=1}^k \mu_j = 1$: $$ \sum_{j=1}^k \frac{x_j}{\lambda} = 1 \implies \frac{1}{\lambda} \sum_{j=1}^k x_j = 1 $$ Since $\sum_{j=1}^k x_j = m$ is the total number of trials: $$ \frac{m}{\lambda} = 1 \implies \lambda = m $$
Therefore, the MLE of each category probability $\mu_j$ in the multinomial distribution is: $$ \hat{\mu}_j = \frac{x_j}{m} $$ This matches the intuitive frequency obtained by dividing each category count by the total number of trials.
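As a numerical illustration (a sketch assuming NumPy and SciPy; the category counts are made up), the closed-form estimate $x_j / m$ can be compared against a direct constrained maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([12.0, 7.0, 31.0, 50.0])   # hypothetical category counts
m = x.sum()                              # total number of trials

# Closed-form MLE from the Lagrange-multiplier derivation
mu_closed_form = x / m

# Numerical check: maximize sum_j x_j * log(mu_j) subject to sum_j mu_j = 1
def neg_log_lik(mu):
    return -np.sum(x * np.log(mu))

constraint = {"type": "eq", "fun": lambda mu: np.sum(mu) - 1.0}
bounds = [(1e-9, 1.0)] * len(x)          # keep probabilities strictly positive
mu0 = np.full(len(x), 1.0 / len(x))      # uniform starting point
res = minimize(neg_log_lik, mu0, method="SLSQP",
               bounds=bounds, constraints=[constraint])

print(mu_closed_form)   # [0.12 0.07 0.31 0.5 ]
print(res.x)            # agrees with the closed form up to solver tolerance
```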