Reinforcement learning methods are broadly divided into two types depending on whether they use a model of the environment (transition function and reward function).
- Model-Based Reinforcement Learning: Methods that explicitly use a model of the environment (transition function $T(s'|s,a)$ and reward function $R(s,a,s')$) to learn actions; a minimal example of such a model is sketched below. This includes cases where the model is not known in advance but is estimated through interaction with the environment.
- Model-Free Reinforcement Learning: Methods where the agent directly learns optimal actions from experience without explicitly constructing or using a model of the environment.
This article explains model-based reinforcement learning.
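To make the notion of “a model of the environment” concrete, here is a minimal sketch of such a model for a hypothetical two-state MDP, written as plain Python dictionaries. The state names, action names, reward values, and dictionary layout ($T$ as nested dictionaries, $R$ keyed by $(s, a, s')$ tuples) are illustrative assumptions, not taken from any particular library; the later code sketches in this article reuse this layout.

```python
# Hypothetical two-state MDP with states "s0", "s1" and actions "stay", "move".
# T[s][a] maps each next state s' to its probability T(s'|s,a).
T = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0},            "move": {"s0": 0.7, "s1": 0.3}},
}

# R[(s, a, s_next)] gives the reward R(s, a, s') for that transition.
R = {
    ("s0", "stay", "s0"): 0.0, ("s0", "stay", "s1"): 1.0,
    ("s0", "move", "s0"): 0.0, ("s0", "move", "s1"): 1.0,
    ("s1", "stay", "s1"): 0.0,
    ("s1", "move", "s0"): -1.0, ("s1", "move", "s1"): 0.0,
}
```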
Definition of Value
In reinforcement learning, the goal is to maximize the sum of future rewards. This sum of future rewards is called “value.”
Sum of Future Rewards: The discounted sum of rewards $G_t$ obtained from time $t$ onward (also called the return) can be defined recursively as follows: $$ G_t = r_{t+1} + \gamma G_{t+1} $$ Here, $r_{t+1}$ is the immediate reward obtained at time $t+1$, and $\gamma$ is the discount factor ($0 \le \gamma \le 1$), a coefficient for converting future rewards to present value.
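As a quick illustration of this recursion, the following sketch computes $G_t$ for a finite reward sequence by folding from the end. The function name and the example rewards are made up for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = r_{t+1} + gamma * G_{t+1} for a finite reward list.

    `rewards` is [r_{t+1}, r_{t+2}, ...]; the recursion bottoms out at 0
    after the last reward.
    """
    g = 0.0
    for r in reversed(rewards):  # fold from the end: G = r + gamma * G
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.9**2 * 2 = 2.62
```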
Value as Expected Value: Since environmental transitions are stochastic, the same action does not always produce the same result. Therefore, value is defined as an expected value. The value function $V_\pi(s)$ of state $s$ under policy $\pi$ represents the expected cumulative reward when starting from state $s$ and following policy $\pi$.
Bellman Equation
The Bellman equation expresses the value function recursively in terms of expected values. It shows that the value of a state is determined by the actions available from that state and the values of the next states resulting from those actions.
$$ V_\pi(s) = \sum_a \pi(a|s) \sum_{s'} T(s'|s,a) (R(s,a,s') + \gamma V_\pi(s')) $$ Here, $R(s,a,s')$ is the reward obtained when taking action $a$ in state $s$ and transitioning to state $s'$.
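The equation translates directly into code. The sketch below computes its right-hand side for one state, assuming the tabular dictionary layout from the model sketch above plus a policy stored as `pi[s][a] = pi(a|s)`; the function name and argument layout are assumptions for illustration.

```python
def bellman_backup(s, V, pi, T, R, gamma=0.9):
    """Return the right-hand side of the Bellman equation for state s.

    Assumed layout: pi[s][a] = pi(a|s), T[s][a][s_next] = T(s'|s,a),
    R[(s, a, s_next)] = R(s,a,s'), V = dict of current value estimates.
    """
    value = 0.0
    for a, p_a in pi[s].items():             # sum over actions, weighted by pi(a|s)
        for s_next, p_s in T[s][a].items():  # sum over next states, weighted by T(s'|s,a)
            value += p_a * p_s * (R[(s, a, s_next)] + gamma * V[s_next])
    return value
```

Applying this backup repeatedly to every state is exactly the iterative policy evaluation used later in policy iteration.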
Learning: Dynamic Programming
In model-based reinforcement learning, since the environment model is known (or has been estimated), dynamic programming can be used to compute optimal policies and value functions. Dynamic programming starts from initial values for the value function and improves their accuracy by repeatedly applying the Bellman equation.
There are two main approaches for obtaining optimal actions through dynamic programming.
Value Iteration
The agent computes the value of each state and chooses actions that lead toward the states with the highest value. Value iteration aims to directly find the optimal value function $V^*(s)$.
$$ V_{k+1}(s) = \max_a \left\{ \sum_{s'} T(s'|s,a) (R(s,a,s') + \gamma V_k(s')) \right\} $$ Through this iterative computation, $V_k(s)$ converges to the optimal value function $V^*(s)$. Once the optimal value function is obtained, the optimal action at each state can be determined as the action that maximizes the value.
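The following is a minimal sketch of value iteration over a tabular model, again assuming the dictionary layout used above; `actions(s)` is assumed to return the list of actions available in state $s$, and `theta` is an illustrative convergence threshold.

```python
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    """Sketch of value iteration; `actions(s)` returns the actions available in s."""
    V = {s: 0.0 for s in states}                      # arbitrary initial values
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: max over actions of the expected value
            new_v = max(
                sum(p * (R[(s, a, s_next)] + gamma * V[s_next])
                    for s_next, p in T[s][a].items())
                for a in actions(s)
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:                             # stop when updates become negligible
            break

    # Extract a greedy policy from the (near-)optimal value function
    policy = {
        s: max(actions(s),
               key=lambda a, s=s: sum(p * (R[(s, a, s_next)] + gamma * V[s_next])
                                      for s_next, p in T[s][a].items()))
        for s in states
    }
    return V, policy
```

With the two-state model sketched earlier, this could be called as `value_iteration(["s0", "s1"], lambda s: list(T[s].keys()), T, R)`.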
Policy Iteration
Policy iteration finds the optimal policy by alternating between two steps: “policy evaluation” and “policy improvement.”
- Policy Evaluation: Computes the value function $V_\pi(s)$ under the current policy $\pi$. This is done by solving the Bellman equation. $$ V_\pi(s) = \sum_a \pi(a|s) \sum_{s'} T(s'|s,a) (R(s,a,s') + \gamma V_\pi(s')) $$
- Policy Improvement: Using the evaluated value function $V_\pi(s)$, greedily determines a better policy $\pi'$ than the current policy $\pi$. $$ \pi'(s) = \arg\max_a \left\{ \sum_{s'} T(s'|s,a) (R(s,a,s') + \gamma V_\pi(s')) \right\} $$ By repeating this process, the policy and value function converge to optimal ones.
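Here is a minimal sketch of the full loop, assuming a deterministic policy stored as `policy[s] = a` and the same tabular model layout as before; all names are illustrative.

```python
def policy_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    """Sketch of policy iteration with a deterministic policy policy[s] = a."""
    policy = {s: actions(s)[0] for s in states}       # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate the Bellman equation under the current policy
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                new_v = sum(p * (R[(s, a, s_next)] + gamma * V[s_next])
                            for s_next, p in T[s][a].items())
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < theta:
                break

        # Policy improvement: act greedily with respect to V_pi
        stable = True
        for s in states:
            best_a = max(actions(s),
                         key=lambda a: sum(p * (R[(s, a, s_next)] + gamma * V[s_next])
                                           for s_next, p in T[s][a].items()))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:                                    # no change anywhere: policy is optimal
            return V, policy
```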
Difference Between Model-Based and Model-Free
In model-based reinforcement learning, once the environment’s transition function and reward function are known (or have been estimated), the optimal policy and value function can be computed from the model alone, without further trial-and-error interaction with the environment.
In contrast, in model-free reinforcement learning, since the environment model is unknown, the agent must actually interact with the environment and learn optimal policies and value functions directly from its experience (sequences of states, actions, rewards, and next states).