Introduction
Training deep learning models is an optimization problem: finding parameters that minimize a loss function. At the core of this optimization are gradient descent and its variants.
This article compares gradient-based optimization algorithms from basic SGD to the widely-used Adam, examining their mathematical formulations and visualizing convergence behavior.
Gradient Descent (GD)
Update parameters \(\theta\) in the direction of the loss function \(L(\theta)\) gradient:
\[\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) \tag{1}\]where \(\eta\) is the learning rate. Computing gradients over all data makes each step expensive for large datasets.
Stochastic Gradient Descent (SGD)
Use gradient estimates from mini-batch \(\mathcal{B}\):
\[\theta_{t+1} = \theta_t - \eta \nabla L_{\mathcal{B}}(\theta_t) \tag{2}\]Lower computational cost, but high gradient variance leads to unstable convergence.
Momentum
Introduces an exponential moving average of past gradients (velocity term):
\[v_t = \beta v_{t-1} + \nabla L_{\mathcal{B}}(\theta_t) \tag{3}\]\[\theta_{t+1} = \theta_t - \eta v_t \tag{4}\]\(\beta\) (typically 0.9) controls past gradient influence. Suppresses oscillations in valley-shaped loss landscapes and accelerates convergence.
Nesterov Accelerated Gradient (NAG)
An improvement over Momentum that computes gradients at a “look-ahead” position:
\[v_t = \beta v_{t-1} + \nabla L_{\mathcal{B}}(\theta_t - \eta \beta v_{t-1}) \tag{5}\]\[\theta_{t+1} = \theta_t - \eta v_t \tag{6}\]The look-ahead reduces overshoot near the optimum.
AdaGrad
A pioneer of adaptive learning rate methods with per-parameter rates:
\[G_t = G_{t-1} + (\nabla L_{\mathcal{B}}(\theta_t))^2 \tag{7}\]\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \varepsilon}} \nabla L_{\mathcal{B}}(\theta_t) \tag{8}\]\(G_t\) accumulates squared gradients. Frequently updated parameters get smaller learning rates. The issue is that \(G_t\) monotonically increases, causing excessive learning rate decay.
RMSProp
Solves AdaGrad’s decay problem using an exponential moving average of squared gradients:
\[s_t = \rho s_{t-1} + (1 - \rho)(\nabla L_{\mathcal{B}}(\theta_t))^2 \tag{9}\]\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_t + \varepsilon}} \nabla L_{\mathcal{B}}(\theta_t) \tag{10}\]\(\rho\) (typically 0.99) enables forgetting old gradient information.
Adam (Adaptive Moment Estimation)
Combines Momentum and RMSProp; the most widely used optimizer today.
\[m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla L_{\mathcal{B}}(\theta_t) \tag{11}\]\[v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla L_{\mathcal{B}}(\theta_t))^2 \tag{12}\]\[\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{13}\]\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \varepsilon} \hat{m}_t \tag{14}\]- \(m_t\): first moment (mean) estimate of the gradient
- \(v_t\): second moment (variance) estimate of the gradient
- Bias correction in \((13)\) fixes estimation bias in early steps
- Recommended: \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\varepsilon = 10^{-8}\)
Comparison
| Method | Adaptive LR | Momentum | Key Feature |
|---|---|---|---|
| SGD | No | No | Simple, sometimes better generalization |
| Momentum | No | Yes | Oscillation suppression, faster convergence |
| NAG | No | Yes | Look-ahead reduces overshoot |
| AdaGrad | Yes | No | Good for sparse data, LR decay issue |
| RMSProp | Yes | No | Fixes AdaGrad’s decay problem |
| Adam | Yes | Yes | Most widely used, robust |
Python Implementation: Convergence Visualization
Comparing optimizer trajectories on the Rosenbrock function:
import numpy as np
import matplotlib.pyplot as plt
def rosenbrock(x, y):
return (1 - x)**2 + 100 * (y - x**2)**2
def rosenbrock_grad(x, y):
dx = -2 * (1 - x) - 400 * x * (y - x**2)
dy = 200 * (y - x**2)
return np.array([dx, dy])
def optimize(method, x0, lr, steps, **kwargs):
x = np.array(x0, dtype=float)
trajectory = [x.copy()]
m = np.zeros_like(x)
v = np.zeros_like(x)
beta1 = kwargs.get('beta1', 0.9)
beta2 = kwargs.get('beta2', 0.999)
eps = 1e-8
for t in range(1, steps + 1):
g = rosenbrock_grad(x[0], x[1])
if method == 'sgd':
x = x - lr * g
elif method == 'momentum':
m = beta1 * m + g
x = x - lr * m
elif method == 'rmsprop':
v = beta2 * v + (1 - beta2) * g**2
x = x - lr * g / (np.sqrt(v) + eps)
elif method == 'adam':
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g**2
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)
x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
trajectory.append(x.copy())
return np.array(trajectory)
x0 = [-1.0, 1.5]
steps = 5000
trajectories = {
'SGD': optimize('sgd', x0, lr=0.0005, steps=steps),
'Momentum': optimize('momentum', x0, lr=0.0001, steps=steps),
'RMSProp': optimize('rmsprop', x0, lr=0.001, steps=steps),
'Adam': optimize('adam', x0, lr=0.005, steps=steps),
}
fig, ax = plt.subplots(figsize=(10, 8))
X, Y = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-1, 3, 200))
Z = rosenbrock(X, Y)
ax.contour(X, Y, Z, levels=np.logspace(0, 3.5, 20), cmap='gray', alpha=0.3)
colors = {'SGD': 'blue', 'Momentum': 'green', 'RMSProp': 'orange', 'Adam': 'red'}
for name, traj in trajectories.items():
ax.plot(traj[:, 0], traj[:, 1], '-', color=colors[name], label=name, alpha=0.7)
ax.plot(traj[-1, 0], traj[-1, 1], 'o', color=colors[name], markersize=6)
ax.plot(1, 1, 'k*', markersize=15, label='Optimum (1,1)')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Optimizer Trajectories on Rosenbrock Function')
ax.legend()
ax.set_xlim(-2, 2)
ax.set_ylim(-1, 3)
plt.tight_layout()
plt.show()
Practical Guidelines
- Start with Adam: stable performance for most tasks
- When generalization matters: SGD + Momentum + LR scheduling
- Sparse data (NLP etc.): Adam or AdaGrad
- Minimize LR tuning effort: Adam (robust due to adaptive rates)
Related Articles
- Cross-Entropy Method: A Practical Monte Carlo Optimization Technique - Useful comparison with gradient-free black-box optimization.
- Bayesian Optimization: Fundamentals and Python Implementation - Bayesian optimization for hyperparameter tuning (learning rate, etc.).
- Bayesian Linear Regression Fundamentals - Bayesian parameter estimation without gradient descent.
References
- Kingma, D. P., & Ba, J. (2015). “Adam: A Method for Stochastic Optimization”. ICLR 2015.
- Ruder, S. (2016). “An overview of gradient descent optimization algorithms”. arXiv:1609.04747.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8.