Reinforcement learning methods are broadly divided into two types depending on whether they use a model of the environment (transition function and reward function).
- Model-Based Reinforcement Learning: Methods that explicitly use a model of the environment (transition function $T(s'|s,a)$ and reward function $R(s,a,s')$) to learn actions; a minimal example of such a model is sketched below. This includes cases where the model is not known in advance but is estimated through interaction with the environment.
- Model-Free Reinforcement Learning: Methods where the agent directly learns optimal actions from experience without explicitly constructing or using a model of the environment.
This article explains model-based reinforcement learning.
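To make the notion of “a model of the environment” concrete, here is a minimal sketch of such a model for a hypothetical two-state MDP, written as plain Python dictionaries. The state names, action names, reward values, and dictionary layout ($T$ as nested dictionaries, $R$ keyed by $(s, a, s')$ tuples) are illustrative assumptions, not taken from any particular library; the later code sketches in this article reuse this layout.

```python
# Hypothetical two-state MDP with states "s0", "s1" and actions "stay", "move".
# T[s][a] maps each next state s' to its probability T(s'|s,a).
T = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0},            "move": {"s0": 0.7, "s1": 0.3}},
}

# R[(s, a, s_next)] gives the reward R(s, a, s') for that transition.
R = {
    ("s0", "stay", "s0"): 0.0, ("s0", "stay", "s1"): 1.0,
    ("s0", "move", "s0"): 0.0, ("s0", "move", "s1"): 1.0,
    ("s1", "stay", "s1"): 0.0,
    ("s1", "move", "s0"): -1.0, ("s1", "move", "s1"): 0.0,
}
```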
Definition of Value
In reinforcement learning, the goal is to maximize the sum of future rewards. This sum of future rewards is called “value.”
Sum of Future Rewards: The discounted sum of rewards $G_t$ obtained from time $t$ onward (also called the return) can be defined recursively as follows: $$ G_t = r_{t+1} + \gamma G_{t+1} $$ Here, $r_{t+1}$ is the immediate reward obtained at time $t+1$, and $\gamma$ is the discount factor ($0 \le \gamma \le 1$), a coefficient for converting future rewards to present value.
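As a quick illustration of this recursion, the following sketch computes $G_t$ for a finite reward sequence by folding from the end. The function name and the example rewards are made up for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = r_{t+1} + gamma * G_{t+1} for a finite reward list.

    `rewards` is [r_{t+1}, r_{t+2}, ...]; the recursion bottoms out at 0
    after the last reward.
    """
    g = 0.0
    for r in reversed(rewards):  # fold from the end: G = r + gamma * G
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.9**2 * 2 = 2.62
```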
Value as Expected Value: Since environmental transitions are stochastic, the same action does not always produce the same result. Therefore, value is defined as an expected value. The value function $V_\pi(s)$ of state $s$ under policy $\pi$ represents the expected cumulative reward when starting from state $s$ and following policy $\pi$.
Bellman Equation
The Bellman equation expresses the value function recursively in terms of expected values. It shows that the value of a state is determined by the actions available from that state and the values of the next states resulting from those actions.
$$ V_\pi(s) = \sum_a \pi(a|s) \sum_{s'} T(s'|s,a) (R(s,a,s') + \gamma V_\pi(s')) $$ Here, $R(s,a,s')$ is the reward obtained when taking action $a$ in state $s$ and transitioning to state $s'$.
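The equation translates directly into code. The sketch below computes its right-hand side for one state, assuming the tabular dictionary layout from the model sketch above plus a policy stored as `pi[s][a] = pi(a|s)`; the function name and argument layout are assumptions for illustration.

```python
def bellman_backup(s, V, pi, T, R, gamma=0.9):
    """Return the right-hand side of the Bellman equation for state s.

    Assumed layout: pi[s][a] = pi(a|s), T[s][a][s_next] = T(s'|s,a),
    R[(s, a, s_next)] = R(s,a,s'), V = dict of current value estimates.
    """
    value = 0.0
    for a, p_a in pi[s].items():             # sum over actions, weighted by pi(a|s)
        for s_next, p_s in T[s][a].items():  # sum over next states, weighted by T(s'|s,a)
            value += p_a * p_s * (R[(s, a, s_next)] + gamma * V[s_next])
    return value
```

Applying this backup repeatedly to every state is exactly the iterative policy evaluation used later in policy iteration.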
Learning: Dynamic Programming
In model-based reinforcement learning, since the environment model is known (or has been estimated), dynamic programming can be used to compute optimal policies and value functions. Dynamic programming starts from initial values for the value function and improves their accuracy by repeatedly applying the Bellman equation.
There are two main approaches for obtaining optimal actions through dynamic programming.
Value Iteration
The agent computes the value of each state and chooses actions that lead toward the states with the highest value. Value iteration aims to directly find the optimal value function $V^*(s)$.
$$ V_{k+1}(s) = \max_a \left\{ \sum_{s'} T(s'|s,a) (R(s,a,s') + \gamma V_k(s')) \right\} $$ Through this iterative computation, $V_k(s)$ converges to the optimal value function $V^*(s)$. Once the optimal value function is obtained, the optimal action at each state can be determined as the action that maximizes the value.
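The following is a minimal sketch of value iteration over a tabular model, again assuming the dictionary layout used above; `actions(s)` is assumed to return the list of actions available in state $s$, and `theta` is an illustrative convergence threshold.

```python
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    """Sketch of value iteration; `actions(s)` returns the actions available in s."""
    V = {s: 0.0 for s in states}                      # arbitrary initial values
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: max over actions of the expected value
            new_v = max(
                sum(p * (R[(s, a, s_next)] + gamma * V[s_next])
                    for s_next, p in T[s][a].items())
                for a in actions(s)
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:                             # stop when updates become negligible
            break

    # Extract a greedy policy from the (near-)optimal value function
    policy = {
        s: max(actions(s),
               key=lambda a, s=s: sum(p * (R[(s, a, s_next)] + gamma * V[s_next])
                                      for s_next, p in T[s][a].items()))
        for s in states
    }
    return V, policy
```

With the two-state model sketched earlier, this could be called as `value_iteration(["s0", "s1"], lambda s: list(T[s].keys()), T, R)`.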
Policy Iteration
Policy iteration finds the optimal policy by alternating between two steps: “policy evaluation” and “policy improvement.”
- Policy Evaluation: Computes the value function $V_\pi(s)$ under the current policy $\pi$. This is done by solving the Bellman equation. $$ V_\pi(s) = \sum_a \pi(a|s) \sum_{s'} T(s'|s,a) (R(s,a,s') + \gamma V_\pi(s')) $$
- Policy Improvement: Using the evaluated value function $V_\pi(s)$, greedily determines a better policy $\pi'$ than the current policy $\pi$. $$ \pi'(s) = \arg\max_a \left\{ \sum_{s'} T(s'|s,a) (R(s,a,s') + \gamma V_\pi(s')) \right\} $$ By repeating this process, the policy and value function converge to optimal ones.
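Here is a minimal sketch of the full loop, assuming a deterministic policy stored as `policy[s] = a` and the same tabular model layout as before; all names are illustrative.

```python
def policy_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    """Sketch of policy iteration with a deterministic policy policy[s] = a."""
    policy = {s: actions(s)[0] for s in states}       # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate the Bellman equation under the current policy
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                new_v = sum(p * (R[(s, a, s_next)] + gamma * V[s_next])
                            for s_next, p in T[s][a].items())
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < theta:
                break

        # Policy improvement: act greedily with respect to V_pi
        stable = True
        for s in states:
            best_a = max(actions(s),
                         key=lambda a: sum(p * (R[(s, a, s_next)] + gamma * V[s_next])
                                           for s_next, p in T[s][a].items()))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:                                    # no change anywhere: policy is optimal
            return V, policy
```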
Difference Between Model-Based and Model-Free
In model-based reinforcement learning, once the environment’s transition function and reward function are known (or have been estimated), the optimal policy and value function can be computed from the model alone, without further trial-and-error interaction with the environment.
In contrast, in model-free reinforcement learning, since the environment model is unknown, the agent must actually interact with the environment and learn optimal policies and value functions directly from its experience (sequences of states, actions, rewards, and next states).