Model-Free Reinforcement Learning

A guide to model-free reinforcement learning covering exploration-exploitation trade-off, Monte Carlo methods, TD learning, Q-learning, SARSA, and Actor-Critic.

Model-free reinforcement learning assumes that the environment model (transition function $T(s'|s,a)$ and reward function $R(s,s')$) is unknown. The agent accumulates experience by interacting with the environment and learns optimal policies directly from that experience.

Exploration-Exploitation Trade-off

In reinforcement learning, the agent must balance maximizing rewards using the knowledge gained so far (exploitation) against trying unknown actions or states to gain new knowledge (exploration). This is called the exploration-exploitation trade-off.

If an unlimited number of trials were possible, the agent could explore exhaustively and then focus on exploitation. In practice, however, the number of interactions is limited, so the two must be balanced throughout learning. A common way to strike this balance is the epsilon-greedy method.

  • Epsilon-greedy method (see the sketch below):
    • With probability epsilon, a random action is selected (exploration).
    • With probability 1 - epsilon, the action considered best based on current knowledge is selected (exploitation).

It is common to set epsilon large in the early stages of learning to encourage exploration and to decrease it gradually as learning progresses, shifting the emphasis toward exploitation.
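
A minimal sketch of epsilon-greedy action selection over a table of action values; the array `q_values`, the three-action setup, and the decay schedule are illustrative assumptions, not part of any particular library:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploitation

# Illustrative decay schedule: start exploratory, become increasingly greedy.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
q_values = [0.0, 0.0, 0.0]  # action values for a single state with 3 actions

for episode in range(1000):
    action = epsilon_greedy(q_values, epsilon)
    # ... interact with the environment and update q_values here ...
    epsilon = max(epsilon_min, epsilon * decay)
```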

Plan Correction: Monte Carlo and TD Methods

There are several methods for how and when the agent updates its value function and policy from experience.

Monte Carlo Method

The Monte Carlo method waits until an episode has ended and then updates values based on the return (the discounted cumulative reward) actually obtained in that episode.

$$ V(s_t) \leftarrow V(s_t) + \alpha (G_t - V(s_t)) $$ Here, $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_T$ is the discounted return from time $t$ onward.

  • Characteristics: Since it uses only actually obtained rewards, the estimates are unbiased (though they can have high variance). However, since no update can be made until an episode ends, learning is slow for problems with long episodes, and the method does not apply to non-terminating tasks.
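
A minimal sketch of the Monte Carlo update above, assuming an episode is available as a list of (state, reward received on leaving that state) pairs; the episode format and the toy data are illustrative assumptions:

```python
from collections import defaultdict

def monte_carlo_update(V, episode, alpha=0.1, gamma=0.99):
    """Every-visit Monte Carlo: after the episode ends, move V(s_t) toward the return G_t."""
    G = 0.0
    # Walk the episode backwards so that G_t = r_{t+1} + gamma * G_{t+1} accumulates easily.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])
    return V

# Illustrative episode: (state, reward received after leaving that state).
V = defaultdict(float)
episode = [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)]
monte_carlo_update(V, episode)
```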

TD Learning (Temporal-Difference Learning)

TD learning updates values at every step. Each update is driven by the TD error: the difference between the TD target (the immediate reward plus the discounted value estimate of the next state) and the current value estimate.

$$ V(s_t) \leftarrow V(s_t) + \alpha (r_{t+1} + \gamma V(s_{t+1}) - V(s_t)) $$ Here, $r_{t+1} + \gamma V(s_{t+1})$ is the TD target.

  • Characteristics: Since updates can be performed during an episode, learning can progress faster than the Monte Carlo method. However, since it uses estimates of future values, bias may arise from bootstrapping (self-reference).
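
A minimal sketch of the TD(0) update above, applied after every transition rather than at the end of the episode; the transition tuple and the toy values are illustrative assumptions:

```python
from collections import defaultdict

def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V(s_t) toward the TD target r_{t+1} + gamma * V(s_{t+1})."""
    target = reward if done else reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])  # alpha * TD error

V = defaultdict(float)
td0_update(V, state="s0", reward=0.0, next_state="s1", done=False)
```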

TD($\lambda$)

TD($\lambda$) is an intermediate method between the Monte Carlo method and TD learning. Instead of looking only one step ahead, it mixes n-step TD targets with weights controlled by the parameter $\lambda$, so it varies continuously from TD learning ($\lambda=0$) to the Monte Carlo method ($\lambda=1$). A sketch of the equivalent eligibility-trace (backward-view) implementation follows.
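
A minimal sketch, assuming the commonly used backward-view implementation with accumulating eligibility traces; the state names, learning rate, and trace variant are assumptions for illustration:

```python
from collections import defaultdict

def td_lambda_update(V, traces, state, reward, next_state, done,
                     alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda): spread the TD error over recently visited states."""
    delta = (reward if done else reward + gamma * V[next_state]) - V[state]
    traces[state] += 1.0                      # accumulating eligibility trace
    for s in list(traces):
        V[s] += alpha * delta * traces[s]
        traces[s] *= gamma * lam              # traces decay each step
    # With lam=0 only the current state is updated (TD learning);
    # with lam=1 the behavior approaches the Monte Carlo method.
    # (Traces should be reset at the start of each episode.)
    return V, traces

V, traces = defaultdict(float), defaultdict(float)
td_lambda_update(V, traces, state="s0", reward=0.0, next_state="s1", done=False)
```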

Value-Based and Policy-Based

Model-free reinforcement learning, like model-based reinforcement learning, is classified by whether learning focuses on estimating values or on the policy itself. Below, the action-value function is denoted $Q(s,a)$.

Value-Based

Learns the value function (especially the action-value function $Q(s,a)$) and selects actions based on that value function.

  • Q-Learning (see the sketch below): $$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)) $$ Q-learning uses the maximum action value at the next state, making it an off-policy method: the policy used for action selection during learning (e.g., epsilon-greedy) differs from the (greedy) policy whose values are being learned.
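
A minimal sketch of one Q-learning update, assuming a tabular $Q$ stored as a dictionary keyed by (state, action); the state names and `n_actions` value are illustrative assumptions:

```python
from collections import defaultdict

def q_learning_update(Q, state, action, reward, next_state, done,
                      n_actions, alpha=0.1, gamma=0.99):
    """Off-policy TD control: the target uses the best next action, whatever is actually taken."""
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])

Q = defaultdict(float)
q_learning_update(Q, state="s0", action=1, reward=0.0, next_state="s1",
                  done=False, n_actions=3)
```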

Policy-Based

Directly learns the policy itself (the action probabilities at each state) rather than deriving it from a value function. SARSA, shown below, is usually contrasted with Q-learning along the on-policy/off-policy axis; its update rule is still value-based, but the values it learns are those of the policy actually being followed.

  • SARSA (see the sketch below): $$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)) $$ SARSA uses the value of the action $a_{t+1}$ actually selected at the next state $s_{t+1}$, making it an on-policy method: the policy used for action selection during learning is the same as the policy being learned.
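
A minimal sketch of one SARSA update with the same tabular representation as in the Q-learning sketch; the state and action values passed in are illustrative assumptions:

```python
from collections import defaultdict

def sarsa_update(Q, state, action, reward, next_state, next_action, done,
                 alpha=0.1, gamma=0.99):
    """On-policy TD control: the target uses the action actually chosen in the next state."""
    next_value = 0.0 if done else Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (reward + gamma * next_value - Q[(state, action)])

Q = defaultdict(float)
sarsa_update(Q, state="s0", action=1, reward=0.0,
             next_state="s1", next_action=2, done=False)
```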

Actor-Critic Method

A method that combines value-based and policy-based approaches.

  • Actor: The part that learns the policy and selects actions.
  • Critic: The part that learns the value function and evaluates the actions selected by the actor.

The actor and the critic are trained together: the critic's evaluation (typically the TD error) tells the actor how much better or worse an action turned out than expected, and the actor adjusts its policy accordingly, while the actor's behavior generates the experience from which the critic learns.
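
A minimal tabular sketch of this interplay, assuming a softmax policy over learned action preferences for the actor and a TD(0) critic; the class name, learning rates, and toy state/action sizes are illustrative assumptions:

```python
import numpy as np

class TabularActorCritic:
    """Minimal tabular actor-critic: the critic's TD error drives both updates."""

    def __init__(self, n_states, n_actions, alpha_actor=0.1, alpha_critic=0.1, gamma=0.99):
        self.prefs = np.zeros((n_states, n_actions))  # actor: action preferences
        self.V = np.zeros(n_states)                   # critic: state values
        self.alpha_actor, self.alpha_critic, self.gamma = alpha_actor, alpha_critic, gamma

    def policy(self, state):
        """Softmax over the actor's preferences gives the action probabilities."""
        z = np.exp(self.prefs[state] - self.prefs[state].max())
        return z / z.sum()

    def act(self, state):
        return np.random.choice(len(self.prefs[state]), p=self.policy(state))

    def update(self, state, action, reward, next_state, done):
        # Critic: the TD error evaluates the action the actor just took.
        target = reward + (0.0 if done else self.gamma * self.V[next_state])
        td_error = target - self.V[state]
        self.V[state] += self.alpha_critic * td_error
        # Actor: raise the preference of the chosen action in proportion to the TD error
        # (a policy-gradient-style update for a softmax policy).
        probs = self.policy(state)
        grad = -probs
        grad[action] += 1.0
        self.prefs[state] += self.alpha_actor * td_error * grad

agent = TabularActorCritic(n_states=5, n_actions=3)
a = agent.act(0)
agent.update(state=0, action=a, reward=1.0, next_state=1, done=False)
```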
