From SGD to Adam: Evolution of Gradient-Based Optimization

Introduction

Training deep learning models is an optimization problem: finding parameters that minimize a loss function. At the core of this optimization are gradient descent and its variants.

This article compares gradient-based optimization algorithms from basic SGD to the widely-used Adam, examining their mathematical formulations and visualizing convergence behavior.

Gradient Descent (GD)

Update parameters \(\theta\) in the direction of the loss function \(L(\theta)\) gradient:

\[\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) \tag{1}\]

where \(\eta\) is the learning rate. Computing gradients over all data makes each step expensive for large datasets.

Stochastic Gradient Descent (SGD)

Use gradient estimates from mini-batch \(\mathcal{B}\):

\[\theta_{t+1} = \theta_t - \eta \nabla L_{\mathcal{B}}(\theta_t) \tag{2}\]

Lower computational cost, but high gradient variance leads to unstable convergence.

Momentum

Introduces an exponential moving average of past gradients (velocity term):

\[v_t = \beta v_{t-1} + \nabla L_{\mathcal{B}}(\theta_t) \tag{3}\]\[\theta_{t+1} = \theta_t - \eta v_t \tag{4}\]

\(\beta\) (typically 0.9) controls past gradient influence. Suppresses oscillations in valley-shaped loss landscapes and accelerates convergence.

Nesterov Accelerated Gradient (NAG)

An improvement over Momentum that computes gradients at a “look-ahead” position:

\[v_t = \beta v_{t-1} + \nabla L_{\mathcal{B}}(\theta_t - \eta \beta v_{t-1}) \tag{5}\]\[\theta_{t+1} = \theta_t - \eta v_t \tag{6}\]

The look-ahead reduces overshoot near the optimum.

AdaGrad

A pioneer of adaptive learning rate methods with per-parameter rates:

\[G_t = G_{t-1} + (\nabla L_{\mathcal{B}}(\theta_t))^2 \tag{7}\]\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \varepsilon}} \nabla L_{\mathcal{B}}(\theta_t) \tag{8}\]

\(G_t\) accumulates squared gradients. Frequently updated parameters get smaller learning rates. The issue is that \(G_t\) monotonically increases, causing excessive learning rate decay.

RMSProp

Solves AdaGrad’s decay problem using an exponential moving average of squared gradients:

\[s_t = \rho s_{t-1} + (1 - \rho)(\nabla L_{\mathcal{B}}(\theta_t))^2 \tag{9}\]\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_t + \varepsilon}} \nabla L_{\mathcal{B}}(\theta_t) \tag{10}\]

\(\rho\) (typically 0.99) enables forgetting old gradient information.

Adam (Adaptive Moment Estimation)

Combines Momentum and RMSProp; the most widely used optimizer today.

\[m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla L_{\mathcal{B}}(\theta_t) \tag{11}\]\[v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla L_{\mathcal{B}}(\theta_t))^2 \tag{12}\]\[\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{13}\]\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \varepsilon} \hat{m}_t \tag{14}\]

\(m_t\): first moment (mean) estimate of the gradient
\(v_t\): second moment (variance) estimate of the gradient
Bias correction in \((13)\) fixes estimation bias in early steps
Recommended: \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\varepsilon = 10^{-8}\)

Comparison

Method	Adaptive LR	Momentum	Key Feature
SGD	No	No	Simple, sometimes better generalization
Momentum	No	Yes	Oscillation suppression, faster convergence
NAG	No	Yes	Look-ahead reduces overshoot
AdaGrad	Yes	No	Good for sparse data, LR decay issue
RMSProp	Yes	No	Fixes AdaGrad’s decay problem
Adam	Yes	Yes	Most widely used, robust

Python Implementation: Convergence Visualization

Comparing optimizer trajectories on the Rosenbrock function:

import numpy as np
import matplotlib.pyplot as plt

def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(x, y):
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])

def optimize(method, x0, lr, steps, **kwargs):
    x = np.array(x0, dtype=float)
    trajectory = [x.copy()]
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    beta1 = kwargs.get('beta1', 0.9)
    beta2 = kwargs.get('beta2', 0.999)
    eps = 1e-8

    for t in range(1, steps + 1):
        g = rosenbrock_grad(x[0], x[1])
        if method == 'sgd':
            x = x - lr * g
        elif method == 'momentum':
            m = beta1 * m + g
            x = x - lr * m
        elif method == 'rmsprop':
            v = beta2 * v + (1 - beta2) * g**2
            x = x - lr * g / (np.sqrt(v) + eps)
        elif method == 'adam':
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g**2
            m_hat = m / (1 - beta1**t)
            v_hat = v / (1 - beta2**t)
            x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
        trajectory.append(x.copy())
    return np.array(trajectory)

x0 = [-1.0, 1.5]
steps = 5000
trajectories = {
    'SGD': optimize('sgd', x0, lr=0.0005, steps=steps),
    'Momentum': optimize('momentum', x0, lr=0.0001, steps=steps),
    'RMSProp': optimize('rmsprop', x0, lr=0.001, steps=steps),
    'Adam': optimize('adam', x0, lr=0.005, steps=steps),
}

fig, ax = plt.subplots(figsize=(10, 8))
X, Y = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-1, 3, 200))
Z = rosenbrock(X, Y)
ax.contour(X, Y, Z, levels=np.logspace(0, 3.5, 20), cmap='gray', alpha=0.3)

colors = {'SGD': 'blue', 'Momentum': 'green', 'RMSProp': 'orange', 'Adam': 'red'}
for name, traj in trajectories.items():
    ax.plot(traj[:, 0], traj[:, 1], '-', color=colors[name], label=name, alpha=0.7)
    ax.plot(traj[-1, 0], traj[-1, 1], 'o', color=colors[name], markersize=6)

ax.plot(1, 1, 'k*', markersize=15, label='Optimum (1,1)')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Optimizer Trajectories on Rosenbrock Function')
ax.legend()
ax.set_xlim(-2, 2)
ax.set_ylim(-1, 3)
plt.tight_layout()
plt.show()

Practical Guidelines

Start with Adam: stable performance for most tasks
When generalization matters: SGD + Momentum + LR scheduling
Sparse data (NLP etc.): Adam or AdaGrad
Minimize LR tuning effort: Adam (robust due to adaptive rates)

Simulated Annealing: Theory and Python Implementation - Gradient-free metaheuristic for comparison.
Genetic Algorithms: Fundamentals and Python Implementation - Gradient-free evolutionary optimization.
Bayesian Optimization: Fundamentals and Python Implementation - Bayesian optimization for hyperparameter tuning (learning rate, etc.).
Understanding Self-Attention from Scratch: Math and Python Implementation - Transformers where Adam is the standard optimizer.
Cross-Entropy Method: A Practical Monte Carlo Optimization Technique - Useful comparison with gradient-free black-box optimization.
Bayesian Linear Regression: From Least Squares to Bayesian Estimation - Bayesian parameter estimation without gradient descent.

References

Kingma, D. P., & Ba, J. (2015). “Adam: A Method for Stochastic Optimization”. ICLR 2015.
Ruder, S. (2016). “An overview of gradient descent optimization algorithms”. arXiv:1609.04747.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8.