Reinforcement Learning Basics: Overview and Markov Decision Process

Machine Learning Terminology

In machine learning, a model is a mathematical formula or algorithm that learns from data. It contains adjustable parameters that are optimized based on given data through a process called learning or training.
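As a minimal sketch of what "adjustable parameters optimized through training" means, the following fits a two-parameter linear model by gradient descent. The toy data, the learning rate, and the function name `train_linear_model` are assumptions chosen for illustration and do not come from the source.

```python
def train_linear_model(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # adjustable parameters, initialised arbitrarily
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Training step: move each parameter against its gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (illustrative assumption).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = train_linear_model(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Here the "model" is the formula `w * x + b`, and "training" is the loop that repeatedly adjusts `w` and `b` to reduce the error on the data.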

A representative model is the neural network (NN), whose structure is loosely inspired by the neural circuits of the human brain. A neural network with many layers is called a deep neural network (DNN).

Learning methods are broadly divided into three categories:

Supervised Learning
Unsupervised Learning
Reinforcement Learning

Overview of Reinforcement Learning

In supervised and unsupervised learning, the learner is given a dataset. In reinforcement learning, the learner is instead given an environment to interact with.

Environment: A space where an agent (the learning entity) takes actions, states change according to those actions, and “rewards” are given when certain states are reached or certain actions are taken.

In reinforcement learning, the agent adjusts its model parameters through interaction with the environment so that it obtains more reward. The sequence of actions and state transitions from the start of an interaction until it terminates is called an episode, and the goal of learning is to maximize the cumulative reward obtained over an episode.
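The interaction loop and the notion of an episode can be sketched as follows. `CorridorEnv`, its `reset`/`step` interface, the reward values, and `run_episode` are all hypothetical names and numbers invented for this illustration; they are not the API of any particular library.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: positions 0..4, start at 0, goal at 4."""

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); walls clamp the position.
        self.position = min(max(self.position + action, 0), self.length - 1)
        self.steps += 1
        reached_goal = self.position == self.length - 1
        reward = 1.0 if reached_goal else -0.1  # reward depends on the state reached
        done = reached_goal or self.steps >= self.max_steps
        return self.position, reward, done


def run_episode(env, choose_action):
    """Run one episode and return the cumulative (undiscounted) reward."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = choose_action(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward


rng = random.Random(0)
random_return = run_episode(CorridorEnv(), lambda s: rng.choice([-1, 1]))
always_right_return = run_episode(CorridorEnv(), lambda s: 1)
print(random_return, always_right_return)
```

In this toy setting, always moving right reaches the goal in the fewest steps, so no other behavior can collect a higher cumulative reward in one episode. Learning amounts to finding such behavior without being told it in advance.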

Problem Formulation: Markov Decision Process (MDP)

Reinforcement learning problems are often formulated as a Markov Decision Process (MDP). An MDP is a model of decision-making that has the Markov property: the probability of the next state depends only on the current state and action, not on the earlier history.
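Written as an equation, with \(s_t\) and \(a_t\) denoting the state and action at time step \(t\) (this time-index notation is an assumption introduced here, not taken from the source), the Markov property states:

\[
P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_0, a_0, s_1, a_1, \ldots, s_t, a_t)
\]

In other words, conditioning on the full history gives the same distribution over the next state as conditioning on the current state and action alone.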

The main components of an MDP are the following four elements:

\(S\): The set of States. Represents the current situation of the agent.
\(A\): The set of Actions. The choices available to the agent in each state.
\(T\): Transition Function. Gives the probability \(P(s'|s, a)\) of transitioning to the next state \(s'\) when taking action \(a\) in state \(s\).
\(R\): Reward Function. The reward \(R(s, a, s')\) obtained when taking action \(a\) in state \(s\) and transitioning to the next state \(s'\).
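The four elements above can be written down directly as data. The following is a minimal sketch of a hypothetical two-state MDP; the state names, action names, probabilities, and rewards are invented for illustration only.

```python
import random

# Hypothetical two-state MDP; every name and number is illustrative.
S = ["cool", "hot"]   # set of states
A = ["work", "rest"]  # set of actions

# T[(s, a)] maps each next state s' to P(s' | s, a).
T = {
    ("cool", "work"): {"cool": 0.7, "hot": 0.3},
    ("cool", "rest"): {"cool": 1.0, "hot": 0.0},
    ("hot", "work"): {"cool": 0.0, "hot": 1.0},
    ("hot", "rest"): {"cool": 0.6, "hot": 0.4},
}


def R(s, a, s_next):
    """Reward R(s, a, s'): working pays off, ending up hot is penalised."""
    reward = 1.0 if a == "work" else 0.0
    if s_next == "hot":
        reward -= 2.0
    return reward


def sample_transition(s, a, rng):
    """Draw s' from P(. | s, a) and return (s', reward)."""
    next_states = list(T[(s, a)].keys())
    probs = list(T[(s, a)].values())
    s_next = rng.choices(next_states, weights=probs, k=1)[0]
    return s_next, R(s, a, s_next)


# Each conditional distribution P(. | s, a) must sum to 1.
for (s, a), dist in T.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9

rng = random.Random(0)
print(sample_transition("cool", "work", rng))
```

Note that `sample_transition` looks only at the current state and action, which is exactly what the Markov property permits.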

The “robot” or “AI” in reinforcement learning can be viewed as a function that receives a state and outputs an action. This function is called the policy \(\pi(a|s)\), which gives the probability of choosing action \(a\) in state \(s\). The agent updates its policy to increase the reward it obtains, with the aim of finding the optimal policy.
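A policy \(\pi(a|s)\) can be represented, in the simplest case, as a table of action probabilities per state. The sketch below assumes a hypothetical pair of states and actions; the names `policy`, `sample_action`, and `greedy_action` and all probabilities are illustrative, and no learning is performed here.

```python
import random

# Hypothetical tabular policy: policy[s][a] = pi(a | s).
policy = {
    "cool": {"work": 0.9, "rest": 0.1},
    "hot": {"work": 0.2, "rest": 0.8},
}


def sample_action(policy, state, rng):
    """Draw an action a with probability pi(a | state)."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return rng.choices(actions, weights=probs, k=1)[0]


def greedy_action(policy, state):
    """Return the most probable action in the given state."""
    return max(policy[state], key=policy[state].get)


rng = random.Random(0)
samples = [sample_action(policy, "hot", rng) for _ in range(1000)]
# Empirical frequency of "rest" in state "hot"; expected to be near 0.8.
print(samples.count("rest") / len(samples))
print(greedy_action(policy, "cool"))
```

Updating the policy then means changing the numbers in this table (or, for a neural network policy, the network's parameters) so that actions leading to higher reward become more probable.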

References

Takahiro Kubo, “Introduction to Reinforcement Learning with Python: From Basics to Practice”, Shoeisha (2019)

