The Lagrange multiplier method is a powerful mathematical technique for finding the maximum or minimum of a multivariable function subject to equality constraints.
Principle
Consider the problem of optimizing an objective function $f(x, y)$ subject to the constraint $g(x, y) = 0$. At a constrained optimum, a contour line of the objective function is tangent to the constraint curve, so the gradient vectors of the two functions are parallel there (provided $\nabla g \neq 0$).
The gradient vector points in the direction of steepest increase of a function and is defined using the gradient operator $\nabla$ as follows: $$ \nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) $$
Therefore, at the optimal point, the following relationship holds: $$ \nabla f(x, y) = \lambda \nabla g(x, y) $$ where $\lambda$ is a constant called the Lagrange multiplier.
By solving this equation together with the constraint $g(x, y) = 0$, we can find candidate points for the optimal solution.
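Written out componentwise, the parallel-gradient condition together with the constraint gives a system of three equations in the three unknowns $x$, $y$, and $\lambda$: $$ \frac{\partial f}{\partial x} = \lambda \frac{\partial g}{\partial x}, \quad \frac{\partial f}{\partial y} = \lambda \frac{\partial g}{\partial y}, \quad g(x, y) = 0 $$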
The Lagrangian
The Lagrange multiplier method converts a constrained optimization problem into an unconstrained one by introducing the Lagrangian $\mathcal{L}(x, y, \lambda)$.
$$ \mathcal{L}(x, y, \lambda) = f(x, y) - \lambda g(x, y) $$
Taking partial derivatives of this Lagrangian with respect to each variable ($x, y, \lambda$) and setting them to zero yields the candidate optimal solutions.
$$ \frac{\partial \mathcal{L}}{\partial x} = 0, \quad \frac{\partial \mathcal{L}}{\partial y} = 0, \quad \frac{\partial \mathcal{L}}{\partial \lambda} = 0 $$
The last equation, $\frac{\partial \mathcal{L}}{\partial \lambda} = -g(x, y) = 0$, is simply the original constraint $g(x, y) = 0$ itself.
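As a quick check of this construction, here is a minimal sketch (assuming SymPy is installed; the symbol names are chosen for illustration) that builds the Lagrangian for generic $f$ and $g$ and confirms that differentiating with respect to $\lambda$ reproduces the constraint:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = sp.Function('f')(x, y)   # generic objective
g = sp.Function('g')(x, y)   # generic constraint function
L = f - lam * g              # Lagrangian L = f - lambda * g

# Stationarity conditions: each partial derivative is set to zero
print(sp.diff(L, x))    # Derivative(f(x, y), x) - lambda*Derivative(g(x, y), x)
print(sp.diff(L, y))    # Derivative(f(x, y), y) - lambda*Derivative(g(x, y), y)
print(sp.diff(L, lam))  # -g(x, y): setting this to zero recovers g(x, y) = 0
```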
Worked Example
Minimize the function $f(x, y) = x^2 + y^2$ subject to the constraint $2x + y + 1 = 0$.
The Lagrangian is: $$ \mathcal{L}(x, y, \lambda) = x^2 + y^2 - \lambda (2x + y + 1) $$
Taking partial derivatives and setting them to zero:
- $\frac{\partial \mathcal{L}}{\partial x} = 2x - 2\lambda = 0 \implies x = \lambda$
- $\frac{\partial \mathcal{L}}{\partial y} = 2y - \lambda = 0 \implies y = \frac{1}{2}\lambda$
- $\frac{\partial \mathcal{L}}{\partial \lambda} = -(2x + y + 1) = 0 \implies 2x + y + 1 = 0$
Substituting $x = \lambda$ and $y = \frac{1}{2}\lambda$ into the constraint: $$ 2\lambda + \frac{1}{2}\lambda + 1 = 0 \implies \frac{5}{2}\lambda = -1 \implies \lambda = -\frac{2}{5} $$
Substituting this value back into $x = \lambda$ and $y = \frac{1}{2}\lambda$ gives $x = -\frac{2}{5}$ and $y = -\frac{1}{5}$.
Therefore, the point that minimizes $f(x, y)$ under the constraint is $(-\frac{2}{5}, -\frac{1}{5})$, with minimum value $f = \frac{1}{5}$. Since $f$ is convex and the constraint is linear, this stationary point is indeed the constrained minimum rather than a maximum or saddle point.
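The same stationarity system can be solved mechanically. The sketch below (assuming SymPy; variable names are illustrative) reproduces the candidate point $(-\frac{2}{5}, -\frac{1}{5})$:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = x**2 + y**2          # objective
g = 2*x + y + 1          # constraint g(x, y) = 0
L = f - lam * g          # Lagrangian

# Solve the three stationarity equations simultaneously
sols = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.diff(L, lam)],
                [x, y, lam], dict=True)
print(sols)              # x = -2/5, y = -1/5, lambda = -2/5
print(f.subs(sols[0]))   # minimum value: 1/5
```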
Application to MLE of the Multinomial Distribution
In maximum likelihood estimation of the multinomial distribution, we need to maximize the log-likelihood function subject to the constraint that the probabilities sum to 1. The Lagrange multiplier method is effective in such cases.
The probability mass function of the multinomial distribution, with $x_j$ denoting the count of category $j$ and $\mu_j$ its probability, is: $$ p(x_1, \dots, x_k | \mu_1, \dots, \mu_k) = \frac{(\sum_j x_j)!}{\prod_j x_j!} \prod_{j=1}^k \mu_j^{x_j} $$ Dropping the multinomial coefficient, which does not depend on the parameters, the log-likelihood is $\sum_{j=1}^k x_j \log \mu_j$. The constraint is $\sum_{j=1}^k \mu_j - 1 = 0$.
The Lagrangian is: $$ \mathcal{L}(\mu_1, \dots, \mu_k, \lambda) = \sum_{j=1}^k x_j \log \mu_j - \lambda (\sum_{j=1}^k \mu_j - 1) $$
Taking the partial derivative with respect to $\mu_h$ and setting it to zero: $$ \frac{\partial \mathcal{L}}{\partial \mu_h} = \frac{x_h}{\mu_h} - \lambda = 0 \implies \mu_h = \frac{x_h}{\lambda} $$
Since $\mu_h = x_h / \lambda$ holds for all $h$, we substitute into the constraint $\sum_{j=1}^k \mu_j = 1$: $$ \sum_{j=1}^k \frac{x_j}{\lambda} = 1 \implies \frac{1}{\lambda} \sum_{j=1}^k x_j = 1 $$ Since $\sum_{j=1}^k x_j = m$ is the total number of trials: $$ \frac{m}{\lambda} = 1 \implies \lambda = m $$
Therefore, the MLE of each category probability $\mu_j$ in the multinomial distribution is: $$ \hat{\mu}_j = \frac{x_j}{m} $$ This matches the intuitive frequency obtained by dividing each category count by the total number of trials.
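As a numerical illustration (a sketch assuming NumPy and SciPy; the category counts are made up), the closed-form estimate $x_j / m$ can be compared against a direct constrained maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([12.0, 7.0, 31.0, 50.0])   # hypothetical category counts
m = x.sum()                              # total number of trials

# Closed-form MLE from the Lagrange-multiplier derivation
mu_closed_form = x / m

# Numerical check: maximize sum_j x_j * log(mu_j) subject to sum_j mu_j = 1
def neg_log_lik(mu):
    return -np.sum(x * np.log(mu))

constraint = {"type": "eq", "fun": lambda mu: np.sum(mu) - 1.0}
bounds = [(1e-9, 1.0)] * len(x)          # keep probabilities strictly positive
mu0 = np.full(len(x), 1.0 / len(x))      # uniform starting point
res = minimize(neg_log_lik, mu0, method="SLSQP",
               bounds=bounds, constraints=[constraint])

print(mu_closed_form)   # [0.12 0.07 0.31 0.5 ]
print(res.x)            # agrees with the closed form up to solver tolerance
```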