Advantage Actor-Critic (A2C)

An explanation of Advantage Actor-Critic (A2C), covering the advantage function, its role in policy gradients, and how Actor and Critic components learn together.

The Actor-Critic method combines the policy gradient method, a policy-based reinforcement learning approach, with TD learning, a value-based approach. The Actor (policy) selects actions, and the Critic (value function) evaluates those actions, making learning more efficient.
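As a minimal sketch of this division of labor (assuming PyTorch and placeholder state/action dimensions, neither of which the text specifies), the Actor can be a small network that outputs action probabilities and the Critic a small network that outputs a scalar value estimate:

```python
import torch
import torch.nn as nn

# Placeholder dimensions for illustration; the text does not fix an environment.
STATE_DIM, N_ACTIONS = 4, 2

# Actor: the policy pi_theta(a|s), mapping a state to action probabilities.
actor = nn.Sequential(
    nn.Linear(STATE_DIM, 32), nn.ReLU(),
    nn.Linear(32, N_ACTIONS), nn.Softmax(dim=-1),
)

# Critic: the value function V(s), mapping a state to a scalar value estimate.
critic = nn.Sequential(
    nn.Linear(STATE_DIM, 32), nn.ReLU(),
    nn.Linear(32, 1),
)

state = torch.randn(1, STATE_DIM)             # dummy state
action_probs = actor(state)                   # Actor proposes: distribution over actions
action = torch.multinomial(action_probs, 1)   # sample an action from the policy
value = critic(state)                         # Critic evaluates: estimated value of the state
```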

Advantage Function

In the policy gradient method, the action-value function $Q(s,a)$ is used to determine the direction of policy updates. However, the action-value function tends to be heavily dependent on the state’s own value $V(s)$. For example, if a state has very high value, all actions in that state will have high Q-values.

Therefore, we consider evaluating actions using the advantage function $A(s,a)$, which subtracts the state value $V(s)$ from $Q(s,a)$.

$$ A(s,a) = Q(s,a) - V(s) $$

The advantage function indicates “how much better (or worse) taking a specific action $a$ in state $s$ is compared to the average value $V(s)$ of that state.” This enables evaluation of the relative goodness of actions.
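As a small numerical illustration (the Q-values below are made up), consider a single high-value state: every action has a large Q-value, but the advantage exposes which actions are better or worse than the state's average.

```python
import numpy as np

# Hypothetical Q-values for three actions in one high-value state.
q_values = np.array([10.0, 12.0, 11.0])

# Under a uniform policy, V(s) is the average of the Q-values: V(s) = 11.0.
policy = np.array([1 / 3, 1 / 3, 1 / 3])
v = policy @ q_values

# A(s, a) = Q(s, a) - V(s): all Q-values are "high", but only one action
# is actually better than average.
advantage = q_values - v
print(advantage)  # [-1.  1.  0.]
```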

Policy Gradient with the Advantage Function

The policy gradient using the advantage function is expressed as:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, A(s,a)\right] $$

This formula replaces $Q(s,a)$ with $A(s,a)$ in the policy gradient method. Here $V(s)$ acts as a baseline: subtracting it from $Q(s,a)$ reduces the variance of the gradient estimate without changing its expected value, which improves both learning stability and efficiency.
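To see how the advantage enters an implementation, here is a single-sample sketch (PyTorch assumed, with made-up values): $A(s,a)$ weights the gradient of the log-probability and is held constant with respect to the actor's parameters.

```python
import torch

# Illustrative values for a single transition (not from the text).
log_prob = torch.tensor(-0.7, requires_grad=True)  # log pi_theta(a|s) from the actor
advantage = torch.tensor(1.5)                      # A(s, a), treated as a constant

# Maximizing log pi_theta(a|s) * A(s, a) corresponds to the policy gradient above,
# so the loss to minimize is its negative.
actor_loss = -(log_prob * advantage)
actor_loss.backward()
print(log_prob.grad)  # tensor(-1.5000): the gradient is scaled by the advantage
```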

Advantage Actor-Critic (A2C)

The Actor-Critic method using the advantage function is called Advantage Actor-Critic (A2C).

  • Actor: Learns the policy $\pi_\theta(a|s)$ and selects actions. The actor updates the policy parameters $\theta$ using the advantage function $A(s,a)$.
  • Critic: Learns the state-value function $V(s)$. The critic updates the state-value function parameters using the TD error ($r + \gamma V(s') - V(s)$), and this same TD error is provided to the actor as an approximation of the advantage function (see the sketch below).
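
Below is a minimal single-transition sketch of these two updates (PyTorch assumed; the network sizes, discount factor, and learning rate are placeholder choices, not taken from the text). The TD error trains the critic through a squared-error loss and, detached from the critic's graph, serves as the advantage estimate weighting the actor's log-probability.

```python
import torch
import torch.nn as nn

# Placeholder dimensions and hyperparameters for illustration.
STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99

actor = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))  # outputs action logits
critic = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, 1))         # outputs V(s)
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

# Dummy transition (s, a, r, s'); in practice it comes from interacting with the environment.
state, next_state = torch.randn(1, STATE_DIM), torch.randn(1, STATE_DIM)
reward, done = torch.tensor([1.0]), False

dist = torch.distributions.Categorical(logits=actor(state))
action = dist.sample()

# TD error: r + gamma * V(s') - V(s). It updates the critic and approximates A(s, a).
v = critic(state).squeeze()
v_next = critic(next_state).squeeze().detach() * (0.0 if done else 1.0)
td_error = reward + GAMMA * v_next - v

critic_loss = td_error.pow(2).mean()                              # move V(s) toward the TD target
actor_loss = -(dist.log_prob(action) * td_error.detach()).mean()  # policy gradient weighted by A(s, a)

optimizer.zero_grad()
(actor_loss + critic_loss).backward()
optimizer.step()
```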

A2C is a powerful algorithm that improves the stability of the policy gradient method and enables more efficient learning.
