The Cross-Entropy Method (CEM) is a Monte Carlo technique based on importance sampling, used both for optimization and for rare-event probability estimation.
Monte Carlo Sampling
The Monte Carlo method is a general term for numerical computation techniques that use random numbers to find approximate solutions to complex problems.
For example, suppose we want to find the expected value \(l = \mathbb{E}[H(X)] = \int H(x)f(x)dx\) of a function \(H(X)\), where random variable \(X\) follows probability density function \(f(x)\). In Monte Carlo sampling, this expected value is approximated using \(N\) independent and identically distributed samples \(X_1, \dots, X_N\):
\[ l_{MS} = \frac{1}{N} \sum_{i=1}^{N} H(X_i) \]

Importance Sampling
In Monte Carlo sampling, samples are generated directly from the probability distribution \(f(x)\) for which we want to compute the expected value. However, if the region where \(H(x)\) takes large values falls in a low-probability region of \(f(x)\), efficient sampling becomes difficult.
In importance sampling, instead of sampling from the target distribution \(f(x)\), samples are generated from a different sampling distribution \(g(x)\). The expected value is then estimated by multiplying those samples by weights \(w(x) = f(x)/g(x)\).
\[ l_{IS} = \frac{1}{N} \sum_{i=1}^{N} H(X_i) \frac{f(X_i)}{g(X_i)}, \qquad X_i \sim g(x) \]

This is an unbiased estimator of \(l = \mathbb{E}_g\left[ H(X) \frac{f(X)}{g(X)} \right]\). The efficiency of importance sampling depends heavily on the choice of sampling distribution \(g(x)\). In particular, the optimal sampling distribution \(g^*(x)\) that minimizes variance is known to take the form \(g^*(x) \propto |H(x)| f(x)\).
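A small numerical sketch makes the difference concrete. Here we estimate the rare-event probability \(P(X > 3)\) for \(X \sim N(0, 1)\) (true value about \(1.35 \times 10^{-3}\)), once by plain Monte Carlo sampling and once by importance sampling from a shifted proposal \(N(4, 1)\); the proposal mean and sample count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000

def phi(x, mu=0.0, sigma=1.0):
    """Gaussian probability density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target: l = E_f[H(X)] = P(X > 3) for X ~ N(0, 1), true value ~1.35e-3
H = lambda x: (x > 3).astype(float)

# Plain Monte Carlo: sample directly from f = N(0, 1)
x_f = rng.normal(0.0, 1.0, N)
l_mc = np.mean(H(x_f))

# Importance sampling: sample from g = N(4, 1), weight by w = f/g
x_g = rng.normal(4.0, 1.0, N)
w = phi(x_g, 0.0, 1.0) / phi(x_g, 4.0, 1.0)
l_is = np.mean(H(x_g) * w)

print(f"MC estimate: {l_mc:.5f}")
print(f"IS estimate: {l_is:.5f}")
```

Since almost no samples from \(N(0,1)\) land above 3, the plain Monte Carlo estimate is built from a handful of hits, while the importance-sampling estimate uses nearly every sample and has far lower variance at the same budget.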
Cross-Entropy Method (CEM)
The Cross-Entropy Method is an algorithm for iteratively finding a distribution close to the optimal sampling distribution \(g^*(x)\) in importance sampling.
“Cross-entropy” is a measure of similarity between two probability distributions \(p\) and \(q\), defined as:
\[ H(p, q) = -\int p(x) \log q(x) dx \]

From the relationship \(KL(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} dx = H(p, q) - H(p)\), when the true distribution \(p\) is fixed, minimizing the KL divergence over \(q\) is equivalent to minimizing the cross-entropy.
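The identity \(KL(p \| q) = H(p, q) - H(p)\) is easy to verify numerically for discrete distributions; the particular values of \(p\) and \(q\) below are illustrative:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])  # "true" distribution (illustrative values)
q = np.array([0.4, 0.4, 0.2])  # candidate distribution

cross_entropy = -np.sum(p * np.log(q))   # H(p, q)
entropy = -np.sum(p * np.log(p))         # H(p)
kl = np.sum(p * np.log(p / q))           # KL(p || q)

# Since H(p) does not depend on q, minimizing H(p, q) over q minimizes KL(p||q)
print(f"H(p, q)  = {cross_entropy:.4f}")
print(f"H(p)     = {entropy:.4f}")
print(f"KL(p||q) = {kl:.4f}")
```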
CEM optimizes the parameters \(v\) of a parameterized sampling distribution \(g(x;v)\) through the following procedure:
- Initialization: Initialize parameter \(v\) and set up the sampling distribution \(g(x;v)\).
- Sampling: Generate \(N\) samples \(X_1, \dots, X_N\) from the current sampling distribution \(g(x;v)\).
- Evaluation: Compute the objective function \(H(X_i)\) for each sample \(X_i\).
- Elite Sample Selection: Select the top \(P\) percent of samples with the highest (or lowest, depending on the optimization direction) objective function values as “elite samples.”
- Parameter Update: Update the parameters \(v\) of the sampling distribution \(g(x;v)\) using the elite samples. This update maximizes the likelihood of the elite samples, which is equivalent to minimizing the cross-entropy between the empirical distribution of elite samples and \(g(x;v)\).
- Convergence Check: Repeat from the sampling step until the change in parameter \(v\) is sufficiently small or the maximum number of iterations is reached.
Through this iterative process, the sampling distribution \(g(x;v)\) gradually concentrates on the region of the optimal solution, enabling efficient search.
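The procedure above can be sketched in a few lines for a one-dimensional maximization problem. Here the target \(H(x) = e^{-(x-2)^2}\), the Gaussian sampling distribution, and all parameter values are illustrative choices (the full implementation for a harder problem appears later in this article):

```python
import numpy as np

rng = np.random.default_rng(0)
H = lambda x: np.exp(-(x - 2.0) ** 2)  # illustrative target, peak at x = 2

mu, sigma = -5.0, 5.0                      # Step 1: initialize g(x; v) = N(mu, sigma^2)
for _ in range(30):
    x = rng.normal(mu, sigma, 100)         # Step 2: sampling
    scores = H(x)                          # Step 3: evaluation
    elite = x[np.argsort(scores)[-20:]]    # Step 4: top 20% elites (maximization)
    mu, sigma = elite.mean(), elite.std()  # Step 5: cross-entropy (MLE) update
    if sigma < 1e-3:                       # Step 6: convergence check
        break

print(f"mu = {mu:.3f}")
```

The Gaussian mean drifts toward the maximizer \(x = 2\) while the standard deviation contracts around it.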
Comparison of CEM and MPPI
A closely related method to CEM is MPPI (Model Predictive Path Integral). Both are optimization methods based on importance sampling, but they differ in how they weight samples.
| Property | CEM | MPPI |
|---|---|---|
| Weighting | Hard selection (top \(P\)% elite samples) | Soft weighting (exponential weights based on cost) |
| Update rule | Update distribution with mean/variance of elites | Update distribution with weighted mean of all samples |
| Information usage | Discards information from non-elite samples | Utilizes information from all samples |
| Temperature | None | Yes (\(\lambda\) controls sharpness of weights) |
| Application | General-purpose optimization | Primarily model predictive control (MPC) |
CEM’s operation of “selecting the top \(P\)%” corresponds to the low-temperature limit \(\lambda \to 0\) of MPPI, in which the exponential weights concentrate on the best samples. In this sense, CEM can be viewed as a limiting case of MPPI’s soft weighting.
Practical guidelines for choosing between them:
- Discrete optimization and combinatorial optimization are better suited for CEM
- Continuous control problems and robotics are better suited for MPPI
- MPPI has higher sample efficiency (since it uses information from all samples), but requires tuning the temperature parameter
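The low-temperature limit is easy to visualize by computing MPPI-style exponential weights for a fixed set of sample costs at several temperatures; the cost values below are illustrative:

```python
import numpy as np

costs = np.array([1.0, 1.2, 1.5, 3.0, 5.0])  # illustrative sample costs (lower is better)

def mppi_weights(costs, lam):
    """MPPI-style soft weights: w_i proportional to exp(-cost_i / lambda)."""
    w = np.exp(-(costs - costs.min()) / lam)  # subtract the min for numerical stability
    return w / w.sum()

for lam in [10.0, 1.0, 0.1, 0.01]:
    print(f"lambda = {lam:5.2f}: {np.round(mppi_weights(costs, lam), 3)}")
```

At large \(\lambda\) the weights are nearly uniform; as \(\lambda \to 0\) the weight mass concentrates on the lowest-cost samples, approaching CEM-style hard selection.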
Python Implementation of CEM
Here we present a Python implementation of the Cross-Entropy Method for solving a continuous optimization problem. We use the Rastrigin function as a test function, which has many local minima.
Rastrigin Function
The Rastrigin function is a widely used benchmark for optimization algorithms, defined as:
\[ f(\mathbf{x}) = An + \sum_{i=1}^{n} \left[ x_i^2 - A\cos(2\pi x_i) \right] \]

where \(A = 10\) and \(n\) is the dimensionality. The global minimum is at \(\mathbf{x}^* = \mathbf{0}\) with \(f(\mathbf{0}) = 0\). Due to its many local minima, simple gradient-based methods struggle to find the global minimum, making it ideal for demonstrating the effectiveness of stochastic search methods like CEM.
Implementation
```python
import numpy as np
import matplotlib.pyplot as plt


# --- Define the Rastrigin function ---
def rastrigin(x):
    """Rastrigin function (minimization target)"""
    A = 10
    return A * len(x) + np.sum(x**2 - A * np.cos(2 * np.pi * x))


# --- Cross-Entropy Method implementation ---
def cross_entropy_method(
    objective_fn,        # Objective function (to minimize)
    dim,                 # Dimensionality of search space
    n_samples=100,       # Number of samples per iteration
    elite_frac=0.2,      # Fraction of elite samples
    n_iterations=50,     # Maximum number of iterations
    initial_mean=None,   # Initial mean
    initial_std=5.0,     # Initial standard deviation
):
    """Minimize objective_fn using the Cross-Entropy Method."""
    # Step 1: Initialization
    mean = initial_mean if initial_mean is not None else np.zeros(dim)
    std = np.full(dim, initial_std)
    n_elite = int(n_samples * elite_frac)

    # History tracking
    best_scores = []    # Best score per iteration
    mean_history = []   # Mean trajectory

    for iteration in range(n_iterations):
        # Step 2: Sampling (from a normal distribution)
        samples = np.random.normal(loc=mean, scale=std, size=(n_samples, dim))

        # Step 3: Evaluation (compute objective for each sample)
        scores = np.array([objective_fn(s) for s in samples])

        # Step 4: Elite sample selection (top P% with lowest scores)
        elite_indices = np.argsort(scores)[:n_elite]
        elite_samples = samples[elite_indices]

        # Step 5: Parameter update (mean and std of elite samples)
        mean = np.mean(elite_samples, axis=0)
        std = np.std(elite_samples, axis=0)

        # Record history
        best_scores.append(np.min(scores))
        mean_history.append(mean.copy())

        # Convergence check (stop if std is small enough)
        if np.all(std < 1e-6):
            print(f"Converged at iteration {iteration + 1}.")
            break

    return mean, best_scores, mean_history


# --- Run CEM ---
np.random.seed(42)
dim = 2  # 2D problem

best_solution, best_scores, mean_history = cross_entropy_method(
    objective_fn=rastrigin,
    dim=dim,
    n_samples=200,
    elite_frac=0.1,
    n_iterations=100,
    initial_mean=np.array([3.0, -3.0]),  # Start far from the optimum
    initial_std=3.0,
)

print(f"Optimal solution: {best_solution}")
print(f"Objective value: {rastrigin(best_solution):.6f}")

# --- Visualization ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# (a) Convergence plot
axes[0].plot(best_scores, linewidth=2)
axes[0].set_xlabel("Iteration")
axes[0].set_ylabel("Best Score")
axes[0].set_title("Convergence of CEM")
axes[0].set_yscale("log")
axes[0].grid(True, alpha=0.3)

# (b) Contour plot with search trajectory
x_grid = np.linspace(-5, 5, 200)
y_grid = np.linspace(-5, 5, 200)
X, Y = np.meshgrid(x_grid, y_grid)
Z = np.array([
    [rastrigin(np.array([xi, yi])) for xi in x_grid]
    for yi in y_grid
])
contour = axes[1].contourf(X, Y, Z, levels=30, cmap="viridis", alpha=0.8)
fig.colorbar(contour, ax=axes[1])

# Plot mean trajectory
mean_arr = np.array(mean_history)
axes[1].plot(mean_arr[:, 0], mean_arr[:, 1], "r.-", markersize=8, linewidth=1.5, label="Mean trajectory")
axes[1].plot(mean_arr[0, 0], mean_arr[0, 1], "rs", markersize=12, label="Start")
axes[1].plot(mean_arr[-1, 0], mean_arr[-1, 1], "r*", markersize=15, label="Final")
axes[1].plot(0, 0, "wx", markersize=12, markeredgewidth=3, label="Global optimum")
axes[1].set_xlabel("$x_1$")
axes[1].set_ylabel("$x_2$")
axes[1].set_title("CEM Search Trajectory on Rastrigin Function")
axes[1].legend(loc="upper right")

plt.tight_layout()
plt.savefig("cem_result.png", dpi=150, bbox_inches="tight")
plt.show()
```
Running this code demonstrates the following:
- Convergence behavior: The objective function value rapidly decreases with each iteration, approaching the global minimum \(f(\mathbf{0}) = 0\)
- Search trajectory: The mean of the sampling distribution moves from the initial point \((3, -3)\) toward the origin \((0, 0)\)
- Avoiding local minima: CEM successfully avoids the many local minima of the Rastrigin function to reach the global minimum
Key hyperparameters and their effects:
- `n_samples` (sample count): More samples provide a more stable search but increase computational cost
- `elite_frac` (elite fraction): A smaller fraction increases selection pressure and speeds convergence, but risks premature convergence
- `initial_std` (initial standard deviation): A larger value explores a wider region but takes longer to converge
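One way to see the `elite_frac` trade-off directly is to run a stripped-down CEM on a simple convex function and compare the best score reached under different elite fractions; the function, helper name, and parameter values below are illustrative:

```python
import numpy as np

def sphere(x):
    """Simple convex test function with minimum 0 at the origin."""
    return np.sum(x ** 2)

def cem_final_score(elite_frac, seed=0, dim=2, n_samples=100, n_iters=30):
    """Minimal CEM run (illustrative) returning the best score found."""
    rng = np.random.default_rng(seed)
    mean, std = np.full(dim, 3.0), np.full(dim, 2.0)
    n_elite = max(2, int(n_samples * elite_frac))
    best = np.inf
    for _ in range(n_iters):
        samples = rng.normal(mean, std, size=(n_samples, dim))
        scores = np.array([sphere(s) for s in samples])
        elites = samples[np.argsort(scores)[:n_elite]]
        mean, std = elites.mean(axis=0), elites.std(axis=0)
        best = min(best, scores.min())
    return best

for frac in [0.05, 0.2, 0.5]:
    print(f"elite_frac = {frac:.2f}: best score = {cem_final_score(frac):.2e}")
```

On a unimodal problem like this, a small elite fraction shrinks the distribution fastest; on multimodal problems like Rastrigin, the same aggressiveness can lock the search onto a local minimum before the global basin is found.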
Related Articles
- Bayesian Optimization: Fundamentals and Python Implementation - Another black-box optimization approach that uses Gaussian process surrogate models instead of sampling-based search.
- MPPI (Model Predictive Path Integral): A Unified View with the Cross-Entropy Method - Extends CEM’s importance sampling framework to stochastic optimal control, with a unified mathematical perspective.