MPPI (Model Predictive Path Integral): A Unified View with the Cross-Entropy Method

An explanation and Python implementation of Model Predictive Path Integral (MPPI) control, with a comparison to the Cross-Entropy Method (CEM) through the lens of importance sampling.

What is MPPI?

In the article on the Cross-Entropy Method (CEM), we introduced a method that iteratively optimizes the parameters of a sampling (proposal) distribution within the importance-sampling framework.

Model Predictive Path Integral (MPPI) is a sampling-based model predictive control (MPC) algorithm grounded in stochastic optimal control theory. Although CEM and MPPI appear to take different approaches, they can be understood through a unified mathematical framework based on importance sampling and variational inference.

  • CEM: “Hard” selection of elite samples (top \(P\)%)
  • MPPI: “Soft” exponential weighting based on cost

Reference: Williams, G., et al. (2017). “Information Theoretic MPC for Model-Based Reinforcement Learning.” ICRA 2017.

Problem Setup

Consider a discrete-time stochastic dynamical system:

\[ \mathbf{x}_{t+1} = F(\mathbf{x}_t, \mathbf{u}_t) + \boldsymbol{\epsilon}_t, \quad \boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}) \tag{1} \]

where \(\mathbf{x}_t\) is the state, \(\mathbf{u}_t\) is the control input, and \(\boldsymbol{\epsilon}_t\) is the system noise.

The cost function for a control sequence \(\mathbf{U} = (\mathbf{u}_0, \mathbf{u}_1, \ldots, \mathbf{u}_{T-1})\) is defined as:

\[ J(\mathbf{U}) = \phi(\mathbf{x}_T) + \sum_{t=0}^{T-1} q(\mathbf{x}_t, \mathbf{u}_t) \tag{2} \]

where \(\phi\) is the terminal cost and \(q\) is the stage cost.
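Before deriving the update rule, here is a minimal sketch of how Eq. (2) is evaluated in practice: roll the dynamics of Eq. (1) forward under a candidate control sequence and accumulate the costs. The callables `F`, `q`, and `phi` are placeholders standing in for the dynamics model, stage cost, and terminal cost (the noise term is omitted; this is a deterministic rollout of the mean dynamics).

import numpy as np

def rollout_cost(x0, U, F, q, phi):
    """Evaluate J(U) of Eq. (2) by rolling Eq. (1) forward under U.

    F   : callable(x, u) -> next state (mean dynamics, noise omitted)
    q   : callable(x, u) -> stage cost
    phi : callable(x)    -> terminal cost
    """
    x, total = np.asarray(x0, dtype=float), 0.0
    for u in U:                # U is a (T, n_ctrl) control sequence
        total += q(x, u)       # accumulate stage cost
        x = F(x, u)            # propagate the state one step
    return total + phi(x)      # add terminal cost at x_T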

Derivation of MPPI

Optimal Control Distribution

In the stochastic optimal control framework, we treat the control sequence as a random variable and define the optimal control distribution as:

\[ p^*(\mathbf{V}) \propto \exp\left(-\frac{1}{\lambda} J(\mathbf{V})\right) \cdot p(\mathbf{V}) \tag{3} \]

where \(\mathbf{V}\) is the noisy control sequence, \(\lambda > 0\) is the temperature parameter, and \(p(\mathbf{V})\) is the prior distribution (current sampling distribution).

The role of temperature \(\lambda\):

  • \(\lambda \to 0\): Selects only the minimum-cost control sequence (hard selection)
  • \(\lambda \to \infty\): Weights all control sequences equally
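A quick numerical illustration of these two limits (the three costs below are made up for the example):

import numpy as np

costs = np.array([1.0, 2.0, 5.0])          # illustrative costs for three samples

for lam in [0.1, 100.0]:                   # small vs. large temperature
    w = np.exp(-costs / lam)
    w /= w.sum()
    print(f"lambda={lam}: {np.round(w, 3)}")
# lambda=0.1   -> essentially all weight on the lowest-cost sample (hard selection)
# lambda=100.0 -> nearly uniform weights (~1/3 each)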

Weighted Control Update

We draw \(N\) samples \(\mathbf{V}^{(1)}, \ldots, \mathbf{V}^{(N)}\) from the prior \(p(\mathbf{V})\) and compute the cost \(J(\mathbf{V}^{(i)})\) for each.

The importance sampling weights are computed as:

\[ w^{(i)} = \frac{\exp\left(-\frac{1}{\lambda} J(\mathbf{V}^{(i)})\right)}{\sum_{j=1}^N \exp\left(-\frac{1}{\lambda} J(\mathbf{V}^{(j)})\right)} \tag{4} \]

The optimal control input is obtained as a weighted average:

\[ \mathbf{u}_t^* = \sum_{i=1}^N w^{(i)} \mathbf{v}_t^{(i)} \tag{5} \]
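As a sketch, Eqs. (4)–(5) take only a few lines of NumPy. Subtracting the minimum cost before exponentiating leaves the normalized weights unchanged but avoids numerical underflow; `costs` and `V` are assumed to hold the rollout costs and the sampled control sequences.

import numpy as np

def mppi_weights_and_update(costs, V, lam):
    """costs: (N,) rollout costs J(V^(i)); V: (N, T, n_ctrl) sampled sequences."""
    beta = costs.min()                       # baseline for numerical stability
    w = np.exp(-(costs - beta) / lam)        # unnormalized weights, Eq. (4)
    w /= w.sum()                             # normalize so the weights sum to 1
    u_star = np.einsum("i,itk->tk", w, V)    # weighted average, Eq. (5)
    return w, u_star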

Comparison with CEM

Both CEM and MPPI share the structure of sampling control sequences from a current distribution and updating the distribution based on cost information.

CEM (Hard Selection)

CEM selects the top \(P\)% elite samples and updates the distribution parameters:

\[ \boldsymbol{\mu}_{\text{new}} = \frac{1}{|\mathcal{E}|} \sum_{i \in \mathcal{E}} \mathbf{V}^{(i)} \tag{6} \]

\[ \boldsymbol{\Sigma}_{\text{new}} = \frac{1}{|\mathcal{E}|} \sum_{i \in \mathcal{E}} (\mathbf{V}^{(i)} - \boldsymbol{\mu}_{\text{new}})(\mathbf{V}^{(i)} - \boldsymbol{\mu}_{\text{new}})^T \tag{7} \]

where \(\mathcal{E}\) is the set of elite samples.
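For comparison, a minimal sketch of the CEM update of Eqs. (6)–(7), assuming the same cost array and (N, T, n_ctrl) sample array as above and a diagonal covariance:

import numpy as np

def cem_update(costs, V, elite_frac=0.1):
    """Hard selection: keep the lowest-cost fraction of samples, refit mean/variance."""
    n_elite = max(1, int(elite_frac * len(costs)))
    elite = V[np.argsort(costs)[:n_elite]]   # elite set E, lowest-cost samples
    mu_new = elite.mean(axis=0)              # new mean, Eq. (6)
    sigma_new = elite.var(axis=0)            # new diagonal variance, cf. Eq. (7)
    return mu_new, sigma_new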

MPPI (Soft Weighting)

MPPI uses all samples with exponential weights:

\[ \boldsymbol{\mu}_{\text{new}} = \sum_{i=1}^N w^{(i)} \mathbf{V}^{(i)} \tag{8} \]

Unified Understanding: Variational Inference MPC

Both methods can be unified under the variational inference framework:

\[ \min_q \, D(q(\mathbf{U}) \| p^*(\mathbf{U})) \tag{9} \]

where \(D\) is a divergence measure.
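As a sketch of how the MPPI update falls out of this view: if we take \(D\) to be the forward KL divergence and restrict \(q\) to a Gaussian with fixed covariance, minimizing it amounts to matching the mean of \(p^*\), and the self-normalized importance-sampling estimate of that mean (with samples drawn from the prior) is exactly the weighted average of Eq. (8):

\[ \boldsymbol{\mu}_{\text{new}} = \arg\min_{\boldsymbol{\mu}} \, D_{\mathrm{KL}}\big(p^*(\mathbf{U}) \,\|\, \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\big) = \mathbb{E}_{p^*}[\mathbf{U}] \approx \sum_{i=1}^N w^{(i)} \mathbf{V}^{(i)} \]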

|                             | CEM                | MPPI                     |
| --------------------------- | ------------------ | ------------------------ |
| Sample usage                | Top \(P\)% only    | All samples              |
| Weighting                   | Uniform (0 or 1)   | Exponential (continuous) |
| Parameters                  | Elite ratio \(P\)  | Temperature \(\lambda\)  |
| Variance update             | Yes                | No (typically)           |
| Mathematical interpretation | Hard threshold     | Boltzmann distribution   |

MPPI Algorithm

Input: Initial control sequence μ = (μ_0, ..., μ_{T-1}), num samples N, temperature λ, noise variance Σ
Repeat:
  1. Sample N noisy control sequences:
     ε_t^(i) ~ N(0, Σ),  V^(i) = μ + ε^(i)
  2. Compute cost for each sample:
     J(V^(i)) = φ(x_T^(i)) + Σ_t q(x_t^(i), v_t^(i))
  3. Compute weights:
     w^(i) = exp(-J(V^(i))/λ) / Σ_j exp(-J(V^(j))/λ)
  4. Update control input:
     μ_t ← μ_t + Σ_i w^(i) ε_t^(i)
  5. Apply first control input μ_0 and shift the sequence

Python Implementation

We implement MPPI for a 2D goal-reaching problem.

Problem Definition

import numpy as np
import matplotlib.pyplot as plt

# ---- Problem setup ----
n_state = 4     # state dimension [x, y, vx, vy]
n_ctrl = 2      # control dimension [ax, ay]
dt = 0.1        # time step
T = 20          # prediction horizon
N = 500         # number of samples
lam = 1.0       # temperature parameter
sigma = 0.5     # control noise standard deviation
goal = np.array([5.0, 5.0])  # goal position

def dynamics(x, u):
    """Linear dynamics model (2D point mass)"""
    x_next = np.zeros(n_state)
    x_next[0] = x[0] + x[2] * dt  # px
    x_next[1] = x[1] + x[3] * dt  # py
    x_next[2] = x[2] + u[0] * dt  # vx
    x_next[3] = x[3] + u[1] * dt  # vy
    return x_next

def stage_cost(x, u):
    """Stage cost"""
    pos = x[:2]
    dist = np.sum((pos - goal)**2)
    ctrl = 0.01 * np.sum(u**2)
    return dist + ctrl

def terminal_cost(x):
    """Terminal cost"""
    pos = x[:2]
    return 10.0 * np.sum((pos - goal)**2)

MPPI Controller

class MPPIController:
    def __init__(self):
        # Initialize control sequence to zeros
        self.mu = np.zeros((T, n_ctrl))

    def compute_control(self, x0):
        """Compute optimal control input using MPPI"""
        # Step 1: Sample noisy control sequences
        noise = np.random.randn(N, T, n_ctrl) * sigma
        V = self.mu[np.newaxis, :, :] + noise  # (N, T, n_ctrl)

        # Step 2: Compute cost for each sample
        costs = np.zeros(N)
        for i in range(N):
            x = x0.copy()
            for t in range(T):
                costs[i] += stage_cost(x, V[i, t])
                x = dynamics(x, V[i, t])
            costs[i] += terminal_cost(x)

        # Step 3: Compute weights (Eq. 4)
        # Subtract minimum cost for numerical stability
        costs_shifted = costs - np.min(costs)
        weights = np.exp(-costs_shifted / lam)
        weights /= np.sum(weights)  # normalize

        # Step 4: Update control input (Eq. 5)
        weighted_noise = np.sum(
            weights[:, np.newaxis, np.newaxis] * noise, axis=0
        )
        self.mu += weighted_noise

        # Optimal control input (first time step)
        u_opt = self.mu[0].copy()

        # Shift control sequence for next time step
        self.mu = np.roll(self.mu, -1, axis=0)
        self.mu[-1] = 0.0  # set last element to zero

        return u_opt

Simulation

np.random.seed(42)

# Initial state
x = np.array([0.0, 0.0, 0.0, 0.0])
controller = MPPIController()

# Simulation
n_steps = 100
trajectory = [x.copy()]
controls = []

for step in range(n_steps):
    u = controller.compute_control(x)
    controls.append(u.copy())
    x = dynamics(x, u)
    trajectory.append(x.copy())

    # Check goal
    if np.linalg.norm(x[:2] - goal) < 0.1:
        print(f"Goal reached at step {step + 1}")
        break

trajectory = np.array(trajectory)
controls = np.array(controls)

# ---- Plot results ----
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Trajectory
axes[0].plot(trajectory[:, 0], trajectory[:, 1], "b-o",
             markersize=3, label="MPPI trajectory")
axes[0].plot(*goal, "r*", markersize=15, label="Goal")
axes[0].plot(0, 0, "gs", markersize=10, label="Start")
axes[0].set_xlabel("x")
axes[0].set_ylabel("y")
axes[0].set_title("MPPI - Goal Reaching")
axes[0].legend()
axes[0].grid(True)
axes[0].set_aspect("equal")

# Control inputs
axes[1].plot(controls[:, 0], label="$a_x$")
axes[1].plot(controls[:, 1], label="$a_y$")
axes[1].set_xlabel("Step")
axes[1].set_ylabel("Control input")
axes[1].set_title("Control inputs over time")
axes[1].legend()
axes[1].grid(True)

plt.tight_layout()
plt.savefig("mppi_result.png", dpi=150)
plt.show()

Effect of the Temperature Parameter

The temperature \(\lambda\) controls the concentration of the weights:

# Visualize the effect of temperature parameter
costs_example = np.linspace(0, 10, 100)
fig, ax = plt.subplots(figsize=(8, 5))

for lam_val in [0.1, 0.5, 1.0, 5.0]:
    w = np.exp(-costs_example / lam_val)
    w /= np.sum(w)
    ax.plot(costs_example, w, label=f"$\\lambda={lam_val}$")

ax.set_xlabel("Cost $J$")
ax.set_ylabel("Weight $w$")
ax.set_title("Temperature parameter effect on MPPI weights")
ax.legend()
ax.grid(True)
plt.tight_layout()
plt.savefig("mppi_temperature.png", dpi=150)
plt.show()

A smaller \(\lambda\) concentrates weights on low-cost samples (approaching CEM’s hard selection), while a larger \(\lambda\) leads to more uniform weighting.

References

  • Williams, G., Aldrich, A., & Theodorou, E. A. (2017). “Model Predictive Path Integral Control: From Theory to Parallel Computation.” Journal of Guidance, Control, and Dynamics, 40(2), 344-357.
  • Williams, G., et al. (2017). “Information Theoretic MPC for Model-Based Reinforcement Learning.” ICRA 2017.
  • Rubinstein, R. Y., & Kroese, D. P. (2013). The Cross-Entropy Method. Springer.