Cross-Entropy Method: A Practical Monte Carlo Optimization Technique

An explanation of the Cross-Entropy Method (CEM) for optimization, covering Monte Carlo sampling, importance sampling, iterative elite-based parameter updates, and a Python implementation with the Rastrigin function.

The Cross-Entropy Method (CEM) is a Monte Carlo technique based on importance sampling, used both as an optimization algorithm and for rare-event probability estimation.

Monte Carlo Sampling

The Monte Carlo method is a general term for numerical computation techniques that use random numbers to find approximate solutions to complex problems.

For example, suppose we want to find the expected value \(l = \mathbb{E}[H(X)] = \int H(x)f(x)dx\) of a function \(H(X)\), where random variable \(X\) follows probability density function \(f(x)\). In Monte Carlo sampling, this expected value is approximated using \(N\) independent and identically distributed samples \(X_1, \dots, X_N\):

\[ l_{MS} = \frac{1}{N} \sum_{i=1}^{N} H(X_i) \]
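As a minimal sketch of this estimator, the code below approximates \(\mathbb{E}[H(X)]\) for a hypothetical choice \(H(x) = x^2\) with \(X \sim \mathcal{N}(0,1)\), whose exact value is \(\mathbb{E}[X^2] = 1\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical integrand: H(x) = x^2, with X ~ N(0, 1).
# The exact expectation is E[X^2] = Var(X) = 1.
def H(x):
    return x**2

N = 100_000
X = rng.normal(loc=0.0, scale=1.0, size=N)  # X_i ~ f
l_ms = np.mean(H(X))                        # (1/N) * sum_i H(X_i)

print(f"Monte Carlo estimate: {l_ms:.4f} (exact: 1.0)")
```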

Importance Sampling

In Monte Carlo sampling, samples are generated directly from the probability distribution \(f(x)\) for which we want to compute the expected value. However, if the region where \(H(x)\) takes large values falls in a low-probability region of \(f(x)\), efficient sampling becomes difficult.

In importance sampling, instead of sampling from the target distribution \(f(x)\), samples are generated from a different sampling distribution \(g(x)\). The expected value is then estimated by multiplying those samples by weights \(w(x) = f(x)/g(x)\).

\[ l_{IS} = \frac{1}{N} \sum_{i=1}^{N} H(X_i) \frac{f(X_i)}{g(X_i)} \approx \mathbb{E}_g\!\left[ H(X)\frac{f(X)}{g(X)} \right] = l \]

Here, \(X_i \sim g(x)\). The efficiency of importance sampling depends heavily on the choice of sampling distribution \(g(x)\). In particular, the optimal sampling distribution \(g^*(x)\) that minimizes variance is known to take the form \(g^*(x) \propto |H(x)|f(x)\).
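A classic use case is rare-event estimation. The sketch below estimates \(l = P(X > 4)\) for \(X \sim \mathcal{N}(0,1)\), i.e. \(H(x) = \mathbb{1}\{x > 4\}\); the shifted sampling distribution \(g = \mathcal{N}(4,1)\) is a hypothetical but natural choice that covers the rare region:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Rare-event probability l = P(X > 4) for X ~ N(0, 1), so H(x) = 1{x > 4}.
exact = 0.5 * math.erfc(4 / math.sqrt(2))  # closed form via the error function

# Plain Monte Carlo: almost no samples land in the region x > 4,
# so the estimate is extremely noisy (often exactly zero).
x_f = rng.normal(0.0, 1.0, size=N)
l_mc = np.mean(x_f > 4)

# Importance sampling: draw from g = N(4, 1), which covers the rare
# region, and reweight each sample by w(x) = f(x) / g(x).
def normal_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

x_g = rng.normal(4.0, 1.0, size=N)
w = normal_pdf(x_g, 0.0) / normal_pdf(x_g, 4.0)
l_is = np.mean((x_g > 4) * w)

print(f"exact {exact:.3e}  plain MC {l_mc:.3e}  IS {l_is:.3e}")
```

With the same sample budget, the importance-sampling estimate lands within a fraction of a percent of the true value, while the plain estimate is dominated by noise.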

Cross-Entropy Method (CEM)

The Cross-Entropy Method is an algorithm for iteratively finding a distribution close to the optimal sampling distribution \(g^*(x)\) in importance sampling.

“Cross-entropy” is a measure of the dissimilarity between two probability distributions \(p\) and \(q\), defined as:

\[ H(p, q) = -\int p(x) \log q(x) dx \]

From the relationship \(KL(p||q) = \int p(x) \log \frac{p(x)}{q(x)} dx = H(p,q) - H(p)\), when the true distribution \(p\) is fixed, minimizing the KL divergence over \(q\) is equivalent to minimizing the cross-entropy \(H(p,q)\).
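A quick numerical check of this identity, using two hypothetical discrete distributions:

```python
import numpy as np

# Two hypothetical discrete distributions over three outcomes
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

cross_entropy = -np.sum(p * np.log(q))     # H(p, q)
entropy_p     = -np.sum(p * np.log(p))     # H(p)
kl            =  np.sum(p * np.log(p / q)) # KL(p || q)

# The identity KL(p || q) = H(p, q) - H(p) holds term by term
print(f"H(p,q) = {cross_entropy:.4f}, H(p) = {entropy_p:.4f}, KL = {kl:.4f}")
```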

CEM optimizes the parameters \(v\) of a parameterized sampling distribution \(g(x;v)\) through the following procedure:

  1. Initialization: Initialize parameter \(v\) and set up the sampling distribution \(g(x;v)\).
  2. Sampling: Generate \(N\) samples \(X_1, \dots, X_N\) from the current sampling distribution \(g(x;v)\).
  3. Evaluation: Compute the objective function \(H(X_i)\) for each sample \(X_i\).
  4. Elite Sample Selection: Select the top \(P\) percent of samples with the highest (or lowest, depending on the optimization direction) objective function values as “elite samples.”
  5. Parameter Update: Update the parameters \(v\) of the sampling distribution \(g(x;v)\) using the elite samples. This update maximizes the likelihood of the elite samples, which is equivalent to minimizing the cross-entropy between the empirical distribution of elite samples and \(g(x;v)\).
  6. Convergence Check: Repeat from step 2 until the change in parameter \(v\) is sufficiently small or the maximum number of iterations is reached.

Through this iterative process, the sampling distribution \(g(x;v)\) gradually concentrates on the region of the optimal solution, enabling efficient search.

Comparison of CEM and MPPI

A closely related method to CEM is MPPI (Model Predictive Path Integral). Both are optimization methods based on importance sampling, but they differ in how they weight samples.

| Property | CEM | MPPI |
| --- | --- | --- |
| Weighting | Hard selection (top \(P\)% elite samples) | Soft weighting (exponential weights based on cost) |
| Update rule | Update distribution with mean/variance of elites | Update distribution with weighted mean of all samples |
| Information usage | Discards information from non-elite samples | Utilizes information from all samples |
| Temperature | None | Yes (\(\lambda\) controls sharpness of weights) |
| Application | General-purpose optimization | Primarily model predictive control (MPC) |

CEM’s hard “top \(P\)%” selection can be viewed as the low-temperature limit of MPPI’s exponential weighting: as \(\lambda \to 0\), the soft weights concentrate on the lowest-cost samples, approaching hard selection. In this sense, CEM can be understood as a limiting case of MPPI-style weighting.
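This low-temperature behavior is easy to see numerically. In the sketch below, the cost values, the `mppi_weights` helper, and the 10% elite fraction are all hypothetical; it simply measures how much of MPPI's softmax weight falls on the samples CEM would select as elites:

```python
import numpy as np

rng = np.random.default_rng(0)
costs = rng.uniform(0.0, 10.0, size=100)  # hypothetical per-sample costs

def mppi_weights(costs, lam):
    """Exponential (softmax) weights over costs, as in MPPI."""
    w = np.exp(-(costs - costs.min()) / lam)  # subtract min for stability
    return w / w.sum()

n_elite = 10
elite = np.argsort(costs)[:n_elite]  # CEM: hard top-10% selection

for lam in [5.0, 1.0, 0.1, 0.01]:
    w = mppi_weights(costs, lam)
    mass_on_elite = w[elite].sum()   # weight carried by CEM's elite set
    print(f"lambda={lam:5.2f}: weight on top-{n_elite} samples = {mass_on_elite:.3f}")
```

As \(\lambda\) shrinks, the soft weights pile onto the lowest-cost samples, so the mass on CEM's elite set approaches 1.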

Practical guidelines for choosing between them:

  • Discrete optimization and combinatorial optimization are better suited for CEM
  • Continuous control problems and robotics are better suited for MPPI
  • MPPI has higher sample efficiency (since it uses information from all samples), but requires tuning the temperature parameter

Python Implementation of CEM

Here we present a Python implementation of the Cross-Entropy Method for solving a continuous optimization problem. We use the Rastrigin function as a test function, which has many local minima.

Rastrigin Function

The Rastrigin function is a widely used benchmark for optimization algorithms, defined as:

\[ f(\mathbf{x}) = An + \sum_{i=1}^{n} \left[ x_i^2 - A\cos(2\pi x_i) \right] \]

where \(A = 10\) and \(n\) is the dimensionality. The global minimum is at \(\mathbf{x}^* = \mathbf{0}\) with \(f(\mathbf{0}) = 0\). Due to its many local minima, simple gradient-based methods struggle to find the global minimum, making it ideal for demonstrating the effectiveness of stochastic search methods like CEM.

Implementation

import numpy as np
import matplotlib.pyplot as plt

# --- Define the Rastrigin function ---
def rastrigin(x):
    """Rastrigin function (minimization target)"""
    A = 10
    return A * len(x) + np.sum(x**2 - A * np.cos(2 * np.pi * x))

# --- Cross-Entropy Method implementation ---
def cross_entropy_method(
    objective_fn,   # Objective function (to minimize)
    dim,            # Dimensionality of search space
    n_samples=100,  # Number of samples per iteration
    elite_frac=0.2, # Fraction of elite samples
    n_iterations=50,    # Maximum number of iterations
    initial_mean=None,  # Initial mean
    initial_std=5.0,    # Initial standard deviation
):
    """Minimize using the Cross-Entropy Method"""

    # Step 1: Initialization
    mean = initial_mean if initial_mean is not None else np.zeros(dim)
    std = np.full(dim, initial_std)
    n_elite = int(n_samples * elite_frac)

    # History tracking
    best_scores = []      # Best score per iteration
    mean_history = []     # Mean trajectory

    for iteration in range(n_iterations):
        # Step 2: Sampling (from normal distribution)
        samples = np.random.normal(
            loc=mean, scale=std, size=(n_samples, dim)
        )

        # Step 3: Evaluation (compute objective for each sample)
        scores = np.array([objective_fn(s) for s in samples])

        # Step 4: Elite sample selection (top P% with lowest scores)
        elite_indices = np.argsort(scores)[:n_elite]
        elite_samples = samples[elite_indices]

        # Step 5: Parameter update (mean and std of elite samples)
        mean = np.mean(elite_samples, axis=0)
        std = np.std(elite_samples, axis=0)

        # Record history
        best_scores.append(np.min(scores))
        mean_history.append(mean.copy())

        # Convergence check (stop if std is small enough)
        if np.all(std < 1e-6):
            print(f"Converged at iteration {iteration + 1}.")
            break

    return mean, best_scores, mean_history

# --- Run CEM ---
np.random.seed(42)
dim = 2  # 2D problem

best_solution, best_scores, mean_history = cross_entropy_method(
    objective_fn=rastrigin,
    dim=dim,
    n_samples=200,
    elite_frac=0.1,
    n_iterations=100,
    initial_mean=np.array([3.0, -3.0]),  # Start far from optimum
    initial_std=3.0,
)

print(f"Optimal solution: {best_solution}")
print(f"Objective value: {rastrigin(best_solution):.6f}")

# --- Visualization ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# (a) Convergence plot
axes[0].plot(best_scores, linewidth=2)
axes[0].set_xlabel("Iteration")
axes[0].set_ylabel("Best Score")
axes[0].set_title("Convergence of CEM")
axes[0].set_yscale("log")
axes[0].grid(True, alpha=0.3)

# (b) Contour plot with search trajectory
x_grid = np.linspace(-5, 5, 200)
y_grid = np.linspace(-5, 5, 200)
X, Y = np.meshgrid(x_grid, y_grid)
Z = np.array([
    [rastrigin(np.array([xi, yi])) for xi in x_grid]
    for yi in y_grid
])

contour = axes[1].contourf(X, Y, Z, levels=30, cmap="viridis", alpha=0.8)
fig.colorbar(contour, ax=axes[1])

# Plot mean trajectory
mean_arr = np.array(mean_history)
axes[1].plot(mean_arr[:, 0], mean_arr[:, 1], "r.-", markersize=8, linewidth=1.5, label="Mean trajectory")
axes[1].plot(mean_arr[0, 0], mean_arr[0, 1], "rs", markersize=12, label="Start")
axes[1].plot(mean_arr[-1, 0], mean_arr[-1, 1], "r*", markersize=15, label="Final")
axes[1].plot(0, 0, "wx", markersize=12, markeredgewidth=3, label="Global optimum")
axes[1].set_xlabel("$x_1$")
axes[1].set_ylabel("$x_2$")
axes[1].set_title("CEM Search Trajectory on Rastrigin Function")
axes[1].legend(loc="upper right")

plt.tight_layout()
plt.savefig("cem_result.png", dpi=150, bbox_inches="tight")
plt.show()

Running this code demonstrates the following:

  • Convergence behavior: The objective function value rapidly decreases with each iteration, approaching the global minimum \(f(\mathbf{0}) = 0\)
  • Search trajectory: The mean of the sampling distribution moves from the initial point \((3, -3)\) toward the origin \((0, 0)\)
  • Avoiding local minima: CEM successfully avoids the many local minima of the Rastrigin function to reach the global minimum

Key hyperparameters and their effects:

  • n_samples (sample count): More samples provide more stable search but increase computational cost
  • elite_frac (elite fraction): A smaller fraction increases selection pressure and speeds convergence, but risks premature convergence
  • initial_std (initial standard deviation): A larger value explores a wider region but takes longer to converge

References

  • Rubinstein, R. Y., & Kroese, D. P. (2013). The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. Springer Science & Business Media.
  • Urushihara, T., “Application of Large Deviation Theory to Importance Sampling in Monte Carlo Simulation”