A Comprehensive Overview of Reinforcement Learning

A systematic overview of reinforcement learning: from planning problems to value-based, policy-based, and actor-critic methods with mathematical formulations.

(Figure: RL map)

Planning Problems

Sequential decision-making problems where the environment dynamics (state transition probabilities and reward functions) are known.

  • Value Iteration: Directly computes the optimal value function by repeatedly applying the Bellman optimality operator. $$ (B_{*}v)(s) = \max_a \left\{ g(s,a) + \gamma \sum_{s'} p_T(s'|s,a)\,v(s') \right\} $$ $$ V^{*}(s) = \lim_{k \to \infty} (B_{*}^{k}v)(s) $$

  • Policy Iteration: Alternates between two steps – “policy evaluation” and “policy improvement” – to find the optimal policy. A code sketch of both planning algorithms follows this list.

    1. Policy Evaluation: Computes the value function $V^\pi$ under the current policy $\pi$ using the Bellman expectation operator. $$ (B_{\pi}v)(s) = \sum_a \pi(a|s) \left( g(s,a) + \gamma \sum_{s'} p_T(s'|s,a)\,v(s') \right) $$ $$ V^{\pi}(s) = \lim_{k \to \infty} (B_{\pi}^{k}v)(s) $$
    2. Policy Improvement: Greedily determines an improved policy $\pi'$ from the computed value function $V^\pi$. $$ \pi'(s) = \arg\max_a \left\{ g(s,a) + \gamma \sum_{s'} p_T(s'|s,a)\,V^\pi(s') \right\} $$
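
The following is a minimal NumPy sketch of both planning algorithms, under assumed conventions that are not from the text above: `p_T[s, a, s_next]` holds the transition probabilities $p_T(s'|s,a)$, `g[s, a]` the expected rewards $g(s,a)$, and policies are arrays `pi[s, a]` of action probabilities.

```python
import numpy as np

def value_iteration(p_T, g, gamma, tol=1e-8):
    """Repeatedly apply the Bellman optimality operator B_* until convergence."""
    n_states, n_actions, _ = p_T.shape
    v = np.zeros(n_states)
    while True:
        # (B_* v)(s) = max_a { g(s, a) + gamma * sum_s' p_T(s'|s, a) v(s') }
        q = g + gamma * p_T @ v              # shape: (n_states, n_actions)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

def policy_evaluation(pi, p_T, g, gamma, tol=1e-8):
    """Repeatedly apply the Bellman expectation operator B_pi until convergence."""
    v = np.zeros(p_T.shape[0])
    while True:
        q = g + gamma * p_T @ v
        v_new = (pi * q).sum(axis=1)         # average q over pi(a|s)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

def policy_improvement(v, p_T, g, gamma):
    """Return the deterministic policy that is greedy with respect to v."""
    q = g + gamma * p_T @ v
    pi = np.zeros_like(q)
    pi[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
    return pi

def policy_iteration(p_T, g, gamma):
    """Alternate evaluation and improvement until the greedy policy stops changing."""
    n_states, n_actions, _ = p_T.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from a uniform policy
    while True:
        v = policy_evaluation(pi, p_T, g, gamma)
        pi_new = policy_improvement(v, p_T, g, gamma)
        if np.array_equal(pi_new, pi):
            return pi, v
        pi = pi_new
```

Both loops terminate for a finite MDP: the Bellman operators are $\gamma$-contractions, and there are only finitely many deterministic greedy policies.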

Reinforcement Learning

Sequential decision-making problems where the environment dynamics are unknown. The agent collects data through interaction with the environment and learns the optimal policy from that data.

Value-Based Methods

These methods estimate a value function (state-value function V or action-value function Q) and derive a policy from it implicitly, for example by acting greedily or ε-greedily with respect to the estimated values.

  • TD Learning (Temporal-Difference Learning): Fixes a policy $\pi$ and estimates the state-value function V under it. The value function is updated online using the TD error $\delta$. $$ \delta_t = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t) $$ $$ \hat{V}(s_t) \leftarrow \hat{V}(s_t) + \alpha_t \delta_t $$

  • Q-Learning: Estimates the action-value function Q. Since the update bootstraps from the value-maximizing action in the next state (the $\max_{a'}$ term) rather than the action actually taken, it is an off-policy method whose target does not depend on the current behavior policy. $$ \delta_t = r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a') - \hat{Q}(s_t, a_t) $$ $$ \hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \alpha_t \delta_t $$

  • SARSA: Estimates the action-value function Q. Unlike Q-Learning, the next action $a_{t+1}$ is selected according to the current policy $\pi$, making it an on-policy method. All three tabular update rules are sketched in code after this list. $$ \delta_t = r_t + \gamma \hat{Q}(s_{t+1}, a_{t+1}) - \hat{Q}(s_t, a_t) $$ $$ \hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \alpha_t \delta_t $$
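
A minimal tabular sketch of the three update rules, under illustrative assumptions: `V` is a NumPy array of shape `(n_states,)`, `Q` an array of shape `(n_states, n_actions)`, and the ε-greedy helper stands in for whatever behavior policy collects the data.

```python
import numpy as np

def td0_update(V, s, r, s_next, gamma, alpha):
    """TD(0): update the state-value estimate for a fixed policy."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta

def q_learning_update(Q, s, a, r, s_next, gamma, alpha):
    """Q-Learning (off-policy): bootstrap from the greedy action in s_next."""
    delta = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * delta

def sarsa_update(Q, s, a, r, s_next, a_next, gamma, alpha):
    """SARSA (on-policy): bootstrap from the action actually selected in s_next."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * delta

def epsilon_greedy(Q, s, epsilon, rng):
    """Example behavior policy; rng is a np.random.Generator."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(Q[s].argmax())
```

SARSA uses the behavior policy both to act and to choose `a_next` for its target, while Q-Learning uses it only to act.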

Policy-Based Methods

These methods directly model and optimize the policy $\pi_\theta$ with parameters $\theta$, without going through a value function.

  • Policy Gradient Methods: Compute the gradient $\nabla_\theta J(\theta)$ of a performance measure $J(\theta)$ and update the policy parameters $\theta$ by gradient ascent. $$ \theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) $$ By the policy gradient theorem, this gradient can be estimated from sampled trajectories without knowing the dynamics, e.g. via the REINFORCE estimator $\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t$, where $G_t$ is the return from time $t$ (see the sketch below).
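
As one concrete instance, here is a minimal REINFORCE-style sketch with a softmax policy over linear state features. The feature map `phi(s)`, the episode format, and the parameter shape `theta: (n_actions, d)` are illustrative assumptions, not anything specified above.

```python
import numpy as np

def softmax_policy(theta, phi):
    """pi_theta(a|s) with logits theta @ phi(s); theta has shape (n_actions, d)."""
    logits = theta @ phi
    logits -= logits.max()              # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, phi, a):
    """Gradient of log pi_theta(a|s) for the softmax-linear policy."""
    p = softmax_policy(theta, phi)
    grad = -np.outer(p, phi)            # expected-feature term
    grad[a] += phi                      # selected-action term
    return grad

def reinforce_episode(theta, episode, gamma, alpha):
    """One REINFORCE update from a recorded episode [(phi_t, a_t, r_t), ...]."""
    G, returns = 0.0, []
    for _, _, r in reversed(episode):   # compute returns G_t backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (phi, a, _), G_t in zip(episode, returns):
        theta += alpha * G_t * grad_log_pi(theta, phi, a)
    return theta
```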

Actor-Critic Methods

These methods combine value-based and policy-based approaches.

  • Actor: Updates the policy $\pi_\theta$ (selects actions)
  • Critic: Learns a value function and evaluates the actions chosen by the actor

Both the actor and critic are updated using the TD error $\delta_t$ (a code sketch follows the update rules below).

  • Critic Update (Value Function Learning): $$ \delta_t = r_t + \gamma \hat{V}_w(s_{t+1}) - \hat{V}_w(s_t) $$ $$ w_{t+1} = w_t + \alpha_w \delta_t \nabla_w \hat{V}_w(s_t) $$
  • Actor Update (Policy Learning): $$ \theta_{t+1} = \theta_t + \alpha_\theta \delta_t \nabla_\theta \log \pi_\theta(a_t|s_t) $$
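
A minimal one-step actor-critic sketch, assuming a linear critic $\hat{V}_w(s) = w^\top \phi(s)$ and a softmax-linear actor. The feature map `phi` and the parameter shapes (`w`: `(d,)`, `theta`: `(n_actions, d)`) are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, phi, a):
    """Gradient of log pi_theta(a|s) for a softmax policy with logits theta @ phi."""
    p = softmax(theta @ phi)
    grad = -np.outer(p, phi)
    grad[a] += phi
    return grad

def actor_critic_step(theta, w, phi_s, a, r, phi_s_next, done,
                      gamma, alpha_theta, alpha_w):
    """One critic update and one actor update from a single transition."""
    v_s = w @ phi_s
    v_next = 0.0 if done else w @ phi_s_next
    delta = r + gamma * v_next - v_s        # TD error shared by both updates

    # Critic: semi-gradient TD(0); for a linear critic, grad_w V_w(s) = phi(s).
    w += alpha_w * delta * phi_s

    # Actor: policy-gradient step scaled by the TD error.
    theta += alpha_theta * delta * grad_log_pi(theta, phi_s, a)
    return delta
```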

Function Approximation

When the state space is large or continuous, value functions and policies are approximated using parameterized functions (such as linear functions or neural networks) with parameters $w$ or $\theta$, instead of tabular representations.

  • Value Function Approximation: Combines function approximation with update rules such as TD learning and Q-Learning. For example, in approximate Q-Learning, the weights $w$ are updated as follows (a linear-approximation sketch appears after this list): $$ \delta_t = r_t + \gamma \max_{a'} \hat{Q}_w(s_{t+1}, a') - \hat{Q}_w(s_t, a_t) $$ $$ w_{t+1} = w_t + \alpha_t \delta_t \nabla_w \hat{Q}_w(s_t, a_t) $$ Deep Q-Network (DQN) uses a neural network to approximate the action-value function and stabilizes learning with techniques such as experience replay and target networks.

  • Policy Function Approximation: Parameterizes the policy itself (e.g., with a linear model or a neural network), as in the policy gradient and actor-critic methods above.
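
As one concrete example of value function approximation, below is a minimal sketch of semi-gradient Q-Learning with a linear approximator $\hat{Q}_w(s,a) = w_a^\top \phi(s)$ (one weight vector per discrete action); the feature map and array shapes are illustrative assumptions. DQN keeps the same TD target but replaces the linear model with a neural network and adds experience replay and a target network.

```python
import numpy as np

def approx_q_learning_update(w, phi_s, a, r, phi_s_next, done, gamma, alpha):
    """One semi-gradient Q-Learning step; w has shape (n_actions, d), phi_* shape (d,)."""
    q_next_max = 0.0 if done else (w @ phi_s_next).max()
    delta = r + gamma * q_next_max - w[a] @ phi_s
    # For a linear approximator, grad_w Q_w(s, a) is phi(s) in the row of action a.
    w[a] += alpha * delta * phi_s
    return delta
```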
