The Actor-Critic method combines the policy gradient method, a policy-based reinforcement learning approach, with TD learning, a value-based approach. The Actor (policy) selects actions, and the Critic (value function) evaluates those actions, making learning more efficient.
Advantage Function
In the policy gradient method, the action-value function \(Q(s,a)\) is used to determine the direction of policy updates. However, the action-value function tends to be heavily dependent on the state’s own value \(V(s)\). For example, if a state has very high value, all actions in that state will have high Q-values.
Therefore, we consider evaluating actions using the advantage function \(A(s,a)\), which subtracts the state value \(V(s)\).
\[ A(s,a) = Q(s,a) - V(s) \]

The advantage function indicates “how much better (or worse) taking a specific action \(a\) in state \(s\) is compared to the average value \(V(s)\) of that state.” This enables evaluation of the relative goodness of actions.
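As a minimal sketch of this definition, suppose we have Q-values for the three actions available in some state. Assuming a uniform policy for illustration, \(V(s)\) is the mean of the Q-values, and the advantage of each action is its Q-value minus that mean:

```python
import numpy as np

# Hypothetical action values Q(s, a) for one state with three actions.
q_values = np.array([4.0, 5.0, 6.0])

# V(s) is the expected Q-value under the policy; assume a uniform
# policy here, so it is just the mean.
v = q_values.mean()  # V(s) = 5.0

# A(s, a) = Q(s, a) - V(s): how much better each action is than average.
advantages = q_values - v
print(advantages)  # [-1.  0.  1.]
```

Note that the advantages are centered around zero: actions better than average get a positive advantage, worse-than-average actions a negative one.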
Policy Gradient with the Advantage Function
The policy gradient using the advantage function is expressed as:
\[ \nabla J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)\, A(s,a)] \]

This formula replaces \(Q(s,a)\) with \(A(s,a)\) in the policy gradient method. The state value \(V(s)\) acts as a baseline, so subtracting it reduces the variance of the gradient estimate without introducing bias. This improves both learning stability and efficiency.
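The gradient term inside the expectation can be computed in closed form for a softmax policy: the gradient of \(\log \pi_\theta(a|s)\) with respect to the logits is one-hot(\(a\)) minus the action probabilities. The sketch below, with illustrative numbers throughout, forms the single-sample gradient estimate and takes one ascent step:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Hypothetical softmax policy over 3 actions, parameterized by logits theta.
theta = np.array([0.1, 0.2, 0.3])
probs = softmax(theta)

# Sampled action and an assumed advantage estimate A(s, a) for it.
action = 2
advantage = 1.5

# For a softmax policy, grad of log pi(a|s) w.r.t. the logits is
# one-hot(a) - probs.
grad_log_pi = -probs.copy()
grad_log_pi[action] += 1.0

# Single-sample policy-gradient estimate: grad log pi(a|s) * A(s, a).
grad_J = grad_log_pi * advantage

# Gradient-ascent step on the policy parameters.
theta = theta + 0.1 * grad_J
```

Because the advantage is positive here, the update raises the probability of the sampled action and lowers the others; a negative advantage would do the opposite.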
Advantage Actor-Critic (A2C)
The Actor-Critic method using the advantage function is called Advantage Actor-Critic (A2C).
- Actor: Learns the policy \(\pi_\theta(a|s)\) and selects actions. The actor updates the policy parameters \(\theta\) using the advantage function \(A(s,a)\).
- Critic: Learns the state-value function \(V(s)\). The critic updates the state-value function parameters using the TD error (\(r + \gamma V(s') - V(s)\)), and this TD error is provided to the actor as an approximation of the advantage function.
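The two updates above can be sketched for a single transition \((s, a, r, s')\). This is a minimal tabular illustration, not a full training loop; the state/action counts, learning rates, and the transition fed in at the end are all assumed for the example:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Illustrative sizes and hyperparameters.
n_states, n_actions = 4, 2
gamma, actor_lr, critic_lr = 0.99, 0.1, 0.1

logits = np.zeros((n_states, n_actions))  # actor: policy parameters
values = np.zeros(n_states)               # critic: state values V(s)

def a2c_step(s, a, r, s_next, done):
    # Critic: TD error r + gamma * V(s') - V(s), used both to update
    # V(s) and as an estimate of the advantage A(s, a).
    target = r + (0.0 if done else gamma * values[s_next])
    td_error = target - values[s]
    values[s] += critic_lr * td_error

    # Actor: policy-gradient step with the TD error as the advantage.
    probs = softmax(logits[s])
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    logits[s] += actor_lr * td_error * grad_log_pi

# One update on an assumed transition (s=0, a=1, r=1.0, s'=1).
a2c_step(s=0, a=1, r=1.0, s_next=1, done=False)
```

After this step the critic's estimate of \(V(0)\) moves toward the TD target, and the actor shifts probability toward action 1 in state 0 because the TD error (the advantage estimate) was positive.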
A2C is a powerful algorithm that improves the stability of the policy gradient method and enables more efficient learning.
References
- Takahiro Kubo, “Introduction to Reinforcement Learning with Python: From Basics to Practice”, Shoeisha (2019)