Advantage Actor-Critic (A2C)

An explanation of Advantage Actor-Critic (A2C), covering the advantage function, its role in policy gradients, and how Actor and Critic components learn together.

The Actor-Critic method combines the policy gradient method, a policy-based reinforcement learning approach, with TD learning, a value-based approach. The Actor (policy) selects actions, and the Critic (value function) evaluates those actions, making learning more efficient.
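As a minimal sketch of this division of labor (assuming PyTorch and placeholder state/action dimensions, neither of which the text specifies), the Actor can be a small network that outputs action probabilities and the Critic a small network that outputs a scalar value estimate:

```python
import torch
import torch.nn as nn

# Placeholder dimensions for illustration; the text does not fix an environment.
STATE_DIM, N_ACTIONS = 4, 2

# Actor: the policy pi_theta(a|s), mapping a state to action probabilities.
actor = nn.Sequential(
    nn.Linear(STATE_DIM, 32), nn.ReLU(),
    nn.Linear(32, N_ACTIONS), nn.Softmax(dim=-1),
)

# Critic: the value function V(s), mapping a state to a scalar value estimate.
critic = nn.Sequential(
    nn.Linear(STATE_DIM, 32), nn.ReLU(),
    nn.Linear(32, 1),
)

state = torch.randn(1, STATE_DIM)             # dummy state
action_probs = actor(state)                   # Actor proposes: distribution over actions
action = torch.multinomial(action_probs, 1)   # sample an action from the policy
value = critic(state)                         # Critic evaluates: estimated value of the state
```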

Advantage Function

In the policy gradient method, the action-value function $Q(s,a)$ is used to determine the direction of policy updates. However, the action-value function tends to be heavily dependent on the state’s own value $V(s)$. For example, if a state has very high value, all actions in that state will have high Q-values.

Therefore, we consider evaluating actions using the advantage function $A(s,a)$, which subtracts the state value $V(s)$ from $Q(s,a)$.

$$ A(s,a) = Q(s,a) - V(s) $$

The advantage function indicates “how much better (or worse) taking a specific action $a$ in state $s$ is compared to the average value $V(s)$ of that state.” This enables evaluation of the relative goodness of actions.
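As a small numerical illustration (the Q-values below are made up), consider a single high-value state: every action has a large Q-value, but the advantage exposes which actions are better or worse than the state's average.

```python
import numpy as np

# Hypothetical Q-values for three actions in one high-value state.
q_values = np.array([10.0, 12.0, 11.0])

# Under a uniform policy, V(s) is the average of the Q-values: V(s) = 11.0.
policy = np.array([1 / 3, 1 / 3, 1 / 3])
v = policy @ q_values

# A(s, a) = Q(s, a) - V(s): all Q-values are "high", but only one action
# is actually better than average.
advantage = q_values - v
print(advantage)  # [-1.  1.  0.]
```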

Policy Gradient with the Advantage Function

The policy gradient using the advantage function is expressed as:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, A(s,a)\right] $$

This formula replaces $Q(s,a)$ with $A(s,a)$ in the policy gradient method. Here $V(s)$ acts as a baseline: subtracting it from $Q(s,a)$ reduces the variance of the gradient estimate without changing its expected value, which improves both learning stability and efficiency.
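To see how the advantage enters an implementation, here is a single-sample sketch (PyTorch assumed, with made-up values): $A(s,a)$ weights the gradient of the log-probability and is held constant with respect to the actor's parameters.

```python
import torch

# Illustrative values for a single transition (not from the text).
log_prob = torch.tensor(-0.7, requires_grad=True)  # log pi_theta(a|s) from the actor
advantage = torch.tensor(1.5)                      # A(s, a), treated as a constant

# Maximizing log pi_theta(a|s) * A(s, a) corresponds to the policy gradient above,
# so the loss to minimize is its negative.
actor_loss = -(log_prob * advantage)
actor_loss.backward()
print(log_prob.grad)  # tensor(-1.5000): the gradient is scaled by the advantage
```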

Advantage Actor-Critic (A2C)

The Actor-Critic method using the advantage function is called Advantage Actor-Critic (A2C).

  • Actor: Learns the policy $\pi_\theta(a|s)$ and selects actions. The actor updates the policy parameters $\theta$ using the advantage function $A(s,a)$.
  • Critic: Learns the state-value function $V(s)$. The critic updates the state-value function parameters using the TD error ($r + \gamma V(s') - V(s)$), and this same TD error is provided to the actor as an approximation of the advantage function (see the sketch below).
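
Below is a minimal single-transition sketch of these two updates (PyTorch assumed; the network sizes, discount factor, and learning rate are placeholder choices, not taken from the text). The TD error trains the critic through a squared-error loss and, detached from the critic's graph, serves as the advantage estimate weighting the actor's log-probability.

```python
import torch
import torch.nn as nn

# Placeholder dimensions and hyperparameters for illustration.
STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99

actor = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))  # outputs action logits
critic = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, 1))         # outputs V(s)
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

# Dummy transition (s, a, r, s'); in practice it comes from interacting with the environment.
state, next_state = torch.randn(1, STATE_DIM), torch.randn(1, STATE_DIM)
reward, done = torch.tensor([1.0]), False

dist = torch.distributions.Categorical(logits=actor(state))
action = dist.sample()

# TD error: r + gamma * V(s') - V(s). It updates the critic and approximates A(s, a).
v = critic(state).squeeze()
v_next = critic(next_state).squeeze().detach() * (0.0 if done else 1.0)
td_error = reward + GAMMA * v_next - v

critic_loss = td_error.pow(2).mean()                              # move V(s) toward the TD target
actor_loss = -(dist.log_prob(action) * td_error.detach()).mean()  # policy gradient weighted by A(s, a)

optimizer.zero_grad()
(actor_loss + critic_loss).backward()
optimizer.step()
```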

A2C is a powerful algorithm that improves the stability of the policy gradient method and enables more efficient learning.
