The Actor-Critic method combines the policy gradient method, a policy-based reinforcement learning approach, with TD learning, a value-based approach. The Actor (policy) selects actions, and the Critic (value function) evaluates those actions, making learning more efficient.
Advantage Function
In the policy gradient method, the action-value function $Q(s,a)$ is used to determine the direction of policy updates. However, the action-value function tends to be heavily dependent on the state’s own value $V(s)$. For example, if a state has very high value, all actions in that state will have high Q-values.
Therefore, we evaluate actions using the advantage function $A(s,a)$, obtained by subtracting the state value $V(s)$ from $Q(s,a)$.
$$ A(s,a) = Q(s,a) - V(s) $$
The advantage function indicates “how much better (or worse) taking a specific action $a$ in state $s$ is compared to the average value $V(s)$ of that state.” This enables evaluation of the relative goodness of actions.
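As a concrete illustration, the minimal sketch below computes advantages from hypothetical estimates: the state value and Q-values are made-up numbers, not outputs of any particular algorithm.

```python
# Minimal sketch: advantages from hypothetical Q-value and state-value estimates.
V_s = 10.0                            # estimated value of state s
Q_sa = {"left": 12.0, "right": 9.0}   # estimated action values in state s

A_sa = {a: q - V_s for a, q in Q_sa.items()}
print(A_sa)  # {'left': 2.0, 'right': -1.0}: "left" is better than average, "right" is worse
```

Even though both Q-values are large in absolute terms, the advantage makes it clear that only "left" is better than the state's average.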
Policy Gradient with the Advantage Function
The policy gradient using the advantage function is expressed as:
$$ \nabla J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla \log \pi_\theta(a|s)\, A(s,a)\right] $$
This formula replaces $Q(s,a)$ in the policy gradient method with $A(s,a)$. Here $V(s)$ acts as a baseline: subtracting it leaves the expected gradient unchanged while reducing its variance, which improves both the stability and the efficiency of learning.
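The following is a minimal NumPy sketch of a single-sample estimate of this gradient for a linear softmax policy over discrete actions. The feature vector, action count, learning rate, and advantage value are hypothetical stand-ins, not quantities from the text.

```python
import numpy as np

# Sketch: one-sample estimate of ∇J(θ) ≈ ∇ log π_θ(a|s) · A(s,a)
# for a linear softmax policy. All dimensions and values are hypothetical.
rng = np.random.default_rng(0)
n_features, n_actions = 4, 3
theta = rng.normal(size=(n_features, n_actions))  # policy parameters θ
phi_s = rng.normal(size=n_features)                # feature vector φ(s)

logits = phi_s @ theta
pi = np.exp(logits - logits.max())
pi /= pi.sum()                                     # π_θ(·|s)

a = rng.choice(n_actions, p=pi)                    # sample an action from the policy
A_sa = 1.5                                         # advantage estimate (stand-in value)

# For a linear softmax policy, ∇_θ log π_θ(a|s) = φ(s) ⊗ (one_hot(a) − π_θ(·|s)).
grad_log_pi = np.outer(phi_s, np.eye(n_actions)[a] - pi)

grad_J = grad_log_pi * A_sa                        # single-sample policy gradient
theta += 0.01 * grad_J                             # gradient ascent step
```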
Advantage Actor-Critic (A2C)
The Actor-Critic method using the advantage function is called Advantage Actor-Critic (A2C).
- Actor: Learns the policy $\pi_\theta(a|s)$ and selects actions. The actor updates the policy parameters $\theta$ using the advantage function $A(s,a)$.
- Critic: Learns the state-value function $V(s)$. The critic updates the state-value function parameters using the TD error ($r + \gamma V(s') - V(s)$), and this TD error is provided to the actor as an approximation of the advantage function, as sketched in the code after this list.
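Below is a compact sketch of one A2C-style update with linear function approximation, assuming the TD error is used as the advantage estimate. The feature map, action count, and hyperparameters are hypothetical choices for illustration.

```python
import numpy as np

# Sketch of an Actor-Critic (A2C-style) update with linear function approximation.
# Feature dimensions, action count, and hyperparameters are hypothetical.
class LinearA2C:
    def __init__(self, n_features, n_actions,
                 alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
        self.theta = np.zeros((n_features, n_actions))  # actor: softmax policy parameters
        self.w = np.zeros(n_features)                   # critic: V(s) ≈ w·φ(s)
        self.alpha_actor = alpha_actor
        self.alpha_critic = alpha_critic
        self.gamma = gamma

    def policy(self, phi_s):
        """π_θ(·|s) for a linear softmax policy."""
        logits = phi_s @ self.theta
        p = np.exp(logits - logits.max())
        return p / p.sum()

    def update(self, phi_s, a, r, phi_s_next, done):
        """One update from a transition (s, a, r, s')."""
        # Critic: TD error δ = r + γ V(s') − V(s), also used as the advantage estimate.
        v_s = self.w @ phi_s
        v_next = 0.0 if done else self.w @ phi_s_next
        delta = r + self.gamma * v_next - v_s
        self.w += self.alpha_critic * delta * phi_s     # TD(0) update of the critic

        # Actor: gradient ascent on log π_θ(a|s) weighted by δ.
        pi = self.policy(phi_s)
        grad_log_pi = np.outer(phi_s, np.eye(len(pi))[a] - pi)
        self.theta += self.alpha_actor * delta * grad_log_pi
```

In practice, implementations often add entropy regularization and run several environments in parallel, but the core update is the pair of steps shown here: a TD update for the critic and an advantage-weighted policy-gradient step for the actor.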
A2C is a powerful algorithm that improves the stability of the policy gradient method and enables more efficient learning.
References
- Takahiro Kubo, “Introduction to Reinforcement Learning with Python: From Basics to Practice”, Shoeisha (2019)