Policy Gradient Methods

An explanation of policy gradient methods in reinforcement learning, covering the policy gradient theorem, log-derivative trick, and variance reduction with baselines.

The policy gradient method is a policy-based approach to reinforcement learning. Instead of learning a value function, it represents the policy itself as a parameterized function and directly optimizes its parameters to find the optimal policy.

The policy is represented as a function that takes state $s$ as input and outputs the probability $\pi_\theta(a|s)$ of selecting each action $a$. Here, $\theta$ represents the policy parameters.
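
For concreteness, the sketch below shows one common parameterization for a discrete action space: a linear softmax policy implemented in NumPy. The function and variable names (`softmax_policy`, `state_features`) are illustrative, not from the source text.

```python
import numpy as np

def softmax_policy(theta, state_features):
    """Return pi_theta(a|s) for a linear softmax policy.

    theta: (n_features, n_actions) parameter matrix.
    state_features: (n_features,) feature vector phi(s) for state s.
    """
    logits = state_features @ theta        # one score per action
    logits = logits - logits.max()         # shift for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()   # probabilities summing to 1

# Example: 4 state features, 2 actions; zero parameters give a uniform policy.
theta = np.zeros((4, 2))
probs = softmax_policy(theta, np.array([1.0, 0.5, -0.2, 0.3]))  # -> [0.5, 0.5]
```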

Policy Evaluation and Optimization

The quality of a policy is evaluated by the expected reward (or expected cumulative reward) obtained when following that policy. The goal of the policy gradient method is to maximize this expected reward.

The expected reward $J(\theta)$ can be expressed as:

$$ J(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a|s) Q^{\pi_\theta}(s,a) $$

Where:

  • $d^{\pi_\theta}(s)$: The stationary state distribution (or discounted state-visitation frequency) of state $s$ under policy $\pi_\theta$.
  • $Q^{\pi_\theta}(s,a)$: The action-value function for taking action $a$ in state $s$ under policy $\pi_\theta$.

To maximize this expected reward $J(\theta)$, the parameter $\theta$ is updated using gradient ascent. That is, the gradient $\nabla J(\theta)$ of the expected reward is computed, and the parameters are incrementally updated in that direction.
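
Concretely, with a learning rate $\alpha$, each gradient ascent step takes the form:

$$ \theta \leftarrow \theta + \alpha \nabla J(\theta) $$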

Policy Gradient Theorem

The policy gradient theorem shows that the gradient $\nabla J(\theta)$ of the expected reward can be expressed in the following simple form:

$$ \nabla J(\theta) \propto \sum_s d^{\pi_\theta}(s) \sum_a \nabla \pi_\theta(a|s) Q^{\pi_\theta}(s,a) $$

Furthermore, using the log-derivative trick, this gradient can be expressed in the form of an expectation:

$$ \nabla J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)\right] $$
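
Here, the log-derivative trick refers to the identity

$$ \nabla \pi_\theta(a|s) = \pi_\theta(a|s) \nabla \log \pi_\theta(a|s) $$

which rewrites the weighted sum over actions in the previous expression as an expectation over actions sampled from $\pi_\theta$.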

This expectation form gives an intuitive interpretation of the policy gradient:

  • $\nabla \log \pi_\theta(a|s)$: The direction in parameter space that increases the probability of selecting action $a$ in state $s$.
  • $Q^{\pi_\theta}(s,a)$: A weight representing the “goodness” of that action.

In other words, the policy gradient method updates the policy to “increase the probability of selecting good actions (actions with high $Q$ values) and decrease the probability of selecting bad actions (actions with low $Q$ values).”
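
The sketch below turns this idea into a REINFORCE-style update for the linear softmax policy from the earlier sketch (it reuses `softmax_policy`). It assumes the Monte Carlo return $G_t$ is used as the estimate of $Q^{\pi_\theta}(s_t, a_t)$, which is discussed in the next section; all names are illustrative.

```python
def grad_log_pi(theta, phi, action):
    """grad_theta log pi_theta(action|s) for the linear softmax policy."""
    probs = softmax_policy(theta, phi)
    one_hot = np.zeros(theta.shape[1])
    one_hot[action] = 1.0
    # d/d theta[i, a] log pi = phi[i] * (1{a == action} - pi(a|s))
    return np.outer(phi, one_hot - probs)

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One gradient ascent step from a single episode.

    episode: list of (phi, action, reward) tuples collected with pi_theta.
    The Monte Carlo return G_t serves as the estimate of Q(s_t, a_t).
    """
    G = 0.0
    grad = np.zeros_like(theta)
    for phi, action, reward in reversed(episode):
        G = reward + gamma * G                       # return from step t onward
        grad += grad_log_pi(theta, phi, action) * G  # grad log pi * Q estimate
    return theta + alpha * grad                      # ascend the gradient of J
```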

Implementation Challenges and Solutions

There are several challenges in implementing the policy gradient method:

  • Estimating $Q^{\pi_\theta}(s,a)$: Since the action-value function $Q^{\pi_\theta}(s,a)$ is unknown, it needs to be estimated from experience. Monte Carlo methods (cumulative rewards after episode completion) or TD learning (TD errors) are used for approximation.
  • Variance Reduction: When Monte Carlo returns are used for gradient estimation, the variance of the estimate can be large, leading to unstable learning. To address this, it is common to subtract a baseline (e.g., the state-value function $V(s)$) and use the advantage function $A(s,a) = Q(s,a) - V(s)$ in place of $Q(s,a)$, as in the sketch below.
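
Below is a minimal sketch of the baseline idea, extending the REINFORCE update above (it reuses `grad_log_pi`). It assumes a linear state-value baseline $V_w(s) = w^\top \phi(s)$ whose weights are fit by regression toward the observed returns; the parameter names and learning rates are illustrative.

```python
def reinforce_with_baseline(theta, w, episode,
                            alpha_theta=0.01, alpha_w=0.1, gamma=0.99):
    """REINFORCE update with a linear state-value baseline V_w(s) = w @ phi(s).

    episode: list of (phi, action, reward) tuples collected with pi_theta.
    Subtracting the baseline does not bias the gradient estimate,
    but it typically reduces its variance.
    """
    G = 0.0
    grad_theta = np.zeros_like(theta)
    grad_w = np.zeros_like(w)
    for phi, action, reward in reversed(episode):
        G = reward + gamma * G
        advantage = G - w @ phi                  # A(s_t, a_t) estimate: G_t - V(s_t)
        grad_theta += grad_log_pi(theta, phi, action) * advantage
        grad_w += advantage * phi                # move V_w(s_t) toward G_t
    return theta + alpha_theta * grad_theta, w + alpha_w * grad_w
```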

The policy gradient method is a highly flexible approach that can also be applied to continuous action space problems and serves as the foundation for more advanced algorithms such as Actor-Critic methods.
