In reinforcement learning, when the state space or action space is very large, it becomes difficult to represent the value function or policy in tabular form. In such cases, neural networks (NNs) are used to approximate value functions and policies.
However, applying NNs to reinforcement learning introduces the challenge of unstable learning. Various techniques have been proposed to address this issue.
Key Techniques for Stabilizing Learning
1. Experience Replay
The experience (state, action, reward, next state) obtained through the agent’s interaction with the environment is stored in a memory called a replay buffer. During learning, mini-batches of experience are randomly sampled from this replay buffer to update the NN.
- Benefits:
  - Reduces the correlation between consecutive experiences, which improves learning stability.
  - Improves data efficiency, since the same experience can be reused for multiple updates.
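As a minimal sketch of the idea, a replay buffer can be implemented as a fixed-size queue plus uniform random sampling (class and method names here are illustrative, not taken from a specific library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer storing (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation between experiences.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```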
2. Fixed Target Q-Network
In Q-learning with a neural network, this technique fixes the parameters of the Q-network used to compute target values (target Q-values) for a fixed number of steps.
- Problem: In standard Q-learning, the same network is used for both target Q-value calculation and Q-network updates, causing the target values to constantly fluctuate and destabilize learning.
- Benefit: Because the target Q-network parameters are fixed, the target values stay stable, which improves learning convergence. The target Q-network parameters are synchronized with the main Q-network parameters at regular intervals.
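A minimal sketch of the synchronization step, assuming the two networks are represented as dictionaries of NumPy parameter arrays (all names and the sync interval are illustrative):

```python
import numpy as np

def sync_target_network(q_params, target_params):
    # Copy the main Q-network parameters into the target network.
    for name, value in q_params.items():
        target_params[name] = value.copy()

# Illustrative training loop: the target network is frozen between syncs.
q_params = {"w": np.random.randn(4, 2), "b": np.zeros(2)}
target_params = {k: v.copy() for k, v in q_params.items()}

SYNC_INTERVAL = 1000  # sync every 1000 steps (illustrative value)
for step in range(1, 5001):
    # ... compute targets with target_params, update q_params by gradient descent ...
    if step % SYNC_INTERVAL == 0:
        sync_target_network(q_params, target_params)
```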
3. Reward Clipping
When the reward scale is very large or the reward distribution is skewed, this technique clips rewards to a fixed range (e.g., [-1, 1]).
- Benefit: Prevents gradient instability caused by excessively large reward scales and stabilizes learning.
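A minimal sketch of the clipping step with NumPy, using the [-1, 1] range mentioned above:

```python
import numpy as np

def clip_reward(reward, low=-1.0, high=1.0):
    # Keep the reward scale bounded before it enters the TD target.
    return float(np.clip(reward, low, high))

print(clip_reward(250.0))  # 1.0
print(clip_reward(-0.3))   # -0.3
```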
Deep Q-Network (DQN) and Its Improvements
Deep Q-Network (DQN) is a groundbreaking method that combines Q-learning with deep learning (neural networks) and the stabilization techniques above (Experience Replay, Fixed Target Q-Network). Developed by Google DeepMind, it achieved performance at or above human level on many Atari games.
Since DQN’s publication, various improvement methods have been proposed to further enhance performance. DeepMind also published Rainbow, a model that combines several of these improvements.
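To make the combination concrete, the following is a rough NumPy sketch of the TD target that DQN regresses its Q-network toward for a sampled mini-batch; the array layout and function name are assumptions for illustration, not the original implementation:

```python
import numpy as np

def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """TD targets: r + gamma * max_a' Q_target(s', a'), with no bootstrap at terminal states.

    rewards:        shape (batch,)
    next_q_target:  shape (batch, n_actions), output of the frozen target network on s'
    dones:          shape (batch,), 1.0 if s' is terminal else 0.0
    """
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)

# Toy batch of 3 transitions with 2 actions.
rewards = np.array([1.0, 0.0, -1.0])
next_q_target = np.array([[0.5, 0.8], [0.1, 0.0], [0.3, 0.2]])
dones = np.array([0.0, 0.0, 1.0])
print(dqn_targets(rewards, next_q_target, dones))  # [1.792, 0.099, -1.0]
```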
Main DQN Improvements
Double DQN (DDQN)
- Purpose: Suppresses overestimation of Q-values and improves value estimation accuracy.
- Mechanism: Separates the networks used for action selection and target Q-value calculation. Action selection is performed by the main Q-network, and the target Q-value of that action is calculated by the target Q-network.
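A minimal sketch of the Double DQN target computation, using the same illustrative mini-batch layout as the DQN sketch above:

```python
import numpy as np

def double_dqn_targets(rewards, next_q_main, next_q_target, dones, gamma=0.99):
    """Double DQN: the main network chooses the action, the target network evaluates it."""
    best_actions = next_q_main.argmax(axis=1)                         # action selection (main net)
    evaluated = next_q_target[np.arange(len(rewards)), best_actions]  # evaluation (target net)
    return rewards + gamma * (1.0 - dones) * evaluated

rewards = np.array([1.0, 0.0])
next_q_main = np.array([[0.9, 0.2], [0.1, 0.4]])    # main network Q-values on s'
next_q_target = np.array([[0.5, 0.8], [0.3, 0.2]])  # target network Q-values on s'
dones = np.zeros(2)
print(double_dqn_targets(rewards, next_q_main, next_q_target, dones))  # [1.495, 0.198]
```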
Prioritized Experience Replay (PER)
- Purpose: Improves learning efficiency.
- Mechanism: Instead of randomly sampling experiences in Experience Replay, experiences with large TD errors (i.e., high learning potential) are sampled preferentially.
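A minimal sketch of proportional prioritization: sampling probabilities are derived from |TD error| (the importance-sampling correction used in the full algorithm is omitted, and all names are illustrative):

```python
import numpy as np

def prioritized_sample(td_errors, batch_size=4, alpha=0.6, eps=1e-6):
    """Sample buffer indices with probability proportional to |TD error|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha  # eps keeps every experience sampleable
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs, replace=False)

td_errors = np.array([0.01, 2.5, 0.3, 0.02, 1.1, 0.05])
print(prioritized_sample(td_errors))  # indices with large TD error are chosen more often
```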
Dueling Network Architectures (Dueling DQN)
- Purpose: Improves value estimation accuracy.
- Mechanism: Adopts a network architecture that decomposes Q-values into “State-Value” and “Advantage” components. This enables more accurate evaluation of each action’s value.
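A minimal sketch of the aggregation step of the dueling architecture; the layers that produce V(s) and A(s, a) are omitted, and subtracting the mean advantage is one common way to make the decomposition identifiable:

```python
import numpy as np

def dueling_q_values(state_value, advantages):
    """Combine V(s) with shape (batch, 1) and A(s, a) with shape (batch, n_actions) into Q(s, a)."""
    return state_value + advantages - advantages.mean(axis=1, keepdims=True)

state_value = np.array([[2.0], [0.5]])
advantages = np.array([[1.0, -1.0, 0.0], [0.2, 0.1, -0.3]])
print(dueling_q_values(state_value, advantages))  # one Q-value per action per state
```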
Multi-step Learning (N-step TD)
- Purpose: Improves value estimation accuracy.
- Mechanism: Instead of the 1-step TD updates used by Q-learning and SARSA, updates use the rewards collected over the next n steps plus a value estimate n steps ahead. Its properties lie between those of Monte Carlo methods and 1-step TD methods.
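A minimal sketch of an n-step TD target computed from a trajectory fragment (NumPy, ignoring episode termination within the n steps; names are illustrative):

```python
import numpy as np

def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of the first n rewards plus a bootstrapped value n steps ahead.

    rewards:          the n rewards r_t, ..., r_{t+n-1}
    bootstrap_value:  e.g. max_a Q_target(s_{t+n}, a)
    """
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(np.dot(discounts, rewards) + gamma ** n * bootstrap_value)

print(n_step_target(rewards=[1.0, 0.0, 1.0], bootstrap_value=0.5))  # 3-step target
```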
Distributional RL (C51, QR-DQN, etc.)
- Purpose: Improves value estimation accuracy.
- Mechanism: Instead of estimating the Q-value as a single expected value, the distribution of returns (cumulative discounted rewards) itself is learned. This allows action decisions based on richer information.
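A minimal sketch of the C51-style representation: each action gets a categorical distribution over a fixed set of return "atoms", and the usual Q-value is recovered as its expectation (the distributional Bellman update itself is omitted, and all values are illustrative):

```python
import numpy as np

N_ATOMS = 51
V_MIN, V_MAX = -10.0, 10.0
atoms = np.linspace(V_MIN, V_MAX, N_ATOMS)  # support of the return distribution

def expected_q(probabilities):
    """probabilities: shape (n_actions, N_ATOMS), each row a categorical distribution."""
    return probabilities @ atoms             # Q(s, a) = sum_i z_i * p_i(s, a)

# Toy distributions for 2 actions (softmax of random logits).
logits = np.random.randn(2, N_ATOMS)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(expected_q(probs))                     # one scalar Q-value per action
```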
Noisy Nets
- Purpose: Improves exploration efficiency.
- Mechanism: Instead of manually setting epsilon as in epsilon-greedy, noise is added to the network weights and the scale of that noise is learned, so the agent itself adjusts how much it explores. This automates the tuning of the exploration balance.
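A minimal sketch of a noisy linear layer in NumPy: each weight has a learnable mean and noise scale, and fresh Gaussian noise is drawn on every forward pass (the factorized-noise variant from the paper is omitted; the class and initialization values are illustrative):

```python
import numpy as np

class NoisyLinear:
    """Linear layer whose weights are perturbed by learned, per-weight noise scales."""

    def __init__(self, in_dim, out_dim, sigma_init=0.5):
        self.w_mu = np.random.randn(out_dim, in_dim) / np.sqrt(in_dim)
        self.w_sigma = np.full((out_dim, in_dim), sigma_init / np.sqrt(in_dim))
        self.b_mu = np.zeros(out_dim)
        self.b_sigma = np.full(out_dim, sigma_init / np.sqrt(in_dim))

    def forward(self, x):
        # Sample noise on every forward pass; exploration comes from this perturbation,
        # and its magnitude (w_sigma, b_sigma) would be learned with the other parameters.
        w = self.w_mu + self.w_sigma * np.random.randn(*self.w_mu.shape)
        b = self.b_mu + self.b_sigma * np.random.randn(*self.b_mu.shape)
        return x @ w.T + b

layer = NoisyLinear(in_dim=4, out_dim=2)
print(layer.forward(np.ones(4)))  # two noisy outputs; repeated calls give different values
```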
References
- Takahiro Kubo, “Introduction to Reinforcement Learning with Python: From Basics to Practice”, Shoeisha (2019)