Machine Learning Terminology
In machine learning, a model is a mathematical formula or algorithm that learns from data. It contains adjustable parameters that are optimized based on given data through a process called learning or training.
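As a minimal sketch of what "adjustable parameters optimized on data" can mean, the following fits a straight line to a few points by gradient descent. The toy data, the function name `train`, and the learning-rate and step settings are all assumptions made for illustration, not something taken from the text.

```python
# Hypothetical toy data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def train(xs, ys, lr=0.05, steps=2000):
    """Fit the model y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # the adjustable parameters, starting from arbitrary values
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move each parameter a small step against its gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = train(xs, ys)
print(w, b)  # should end up close to 2 and 1
```

Here the "model" is the formula `w * x + b`, and "training" is the loop that repeatedly nudges `w` and `b` to reduce the error on the data.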
A representative model is the neural network (NN), which is loosely inspired by the neural circuits of the human brain. A neural network with many layers is called a deep neural network (DNN).
Learning methods are broadly divided into three categories:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
Overview of Reinforcement Learning
Supervised and unsupervised learning start from a given dataset. Reinforcement learning instead starts from a given environment.
- Environment: A space in which an agent (the learning entity) takes actions, the state changes in response to those actions, and a “reward” is given when a certain state is reached or a certain action is taken.
In reinforcement learning, the agent adjusts its model parameters to obtain more reward through interaction with the environment. The sequence of actions and state transitions from the environment's initial state to its terminal state is called an episode, and the goal of learning is to maximize the cumulative reward obtained over an episode.
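The interaction between agent and environment can be sketched as a simple loop. The corridor environment below, its `reset`/`step` interface, and the two policies are hypothetical stand-ins chosen for illustration. No learning takes place here; the sketch only shows how one episode unfolds and how the cumulative reward is accumulated.

```python
import random

class CorridorEnv:
    """Hypothetical 1-D corridor: start at cell 0, the episode ends at cell 4."""
    GOAL = 4

    def reset(self):
        self.position = 0
        return self.position  # initial state

    def step(self, action):
        # action is -1 (left) or +1 (right); the agent cannot leave the corridor
        self.position = min(max(self.position + action, 0), self.GOAL)
        done = self.position == self.GOAL
        reward = 1.0 if done else 0.0  # reward only when the goal state is reached
        return self.position, reward, done

def run_episode(env, policy, max_steps=100):
    """Run one episode (capped at max_steps) and return the cumulative reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # agent chooses an action
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward
        if done:
            break
    return total_reward

rng = random.Random(0)

def random_policy(state):
    return rng.choice([-1, 1])

def always_right(state):
    return 1

print(run_episode(CorridorEnv(), always_right))
print(run_episode(CorridorEnv(), random_policy))
```

A learning agent would use the rewards collected in such episodes to change how `policy` chooses actions, so that later episodes collect more reward.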
Problem Formulation: Markov Decision Process (MDP)
Reinforcement learning problems are often formulated as a Markov Decision Process (MDP). An MDP is a decision-making process with the Markov property (the next state depends only on the current state and action, not on past history).
The main components of an MDP are the following four elements:
- $S$: The set of states. Each state represents the current situation of the agent.
- $A$: The set of Actions. The choices available to the agent in each state.
- $T$: Transition Probability. The probability $P(s' \mid s, a)$ of transitioning to the next state $s'$ when taking action $a$ in state $s$.
- $R$: Reward Function. The reward $R(s, a, s')$ obtained when taking action $a$ in state $s$ and transitioning to the next state $s'$.
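The four components above can be written down directly as plain data structures. The two-state example below (states `low` and `high`, actions `wait` and `charge`, and all probabilities and rewards) is invented for illustration; only the roles of $S$, $A$, $T$, and $R$ come from the text.

```python
import random

STATES = ["low", "high"]       # S: the set of states
ACTIONS = ["wait", "charge"]   # A: the set of actions

# T[(s, a)] maps each next state s' to the probability P(s' | s, a).
T = {
    ("low", "wait"):    {"low": 1.0},
    ("low", "charge"):  {"high": 0.9, "low": 0.1},
    ("high", "wait"):   {"high": 0.7, "low": 0.3},
    ("high", "charge"): {"high": 1.0},
}

def R(s, a, s_next):
    """Reward R(s, a, s'): +1 for ending up in 'high', 0 otherwise."""
    return 1.0 if s_next == "high" else 0.0

def sample_transition(s, a, rng):
    """Draw s' from P(. | s, a) and return (s', reward)."""
    next_states = list(T[(s, a)])
    probs = [T[(s, a)][s2] for s2 in next_states]
    s_next = rng.choices(next_states, weights=probs, k=1)[0]
    return s_next, R(s, a, s_next)

def expected_reward(s, a):
    """Expected immediate reward: sum over s' of P(s' | s, a) * R(s, a, s')."""
    return sum(p * R(s, a, s2) for s2, p in T[(s, a)].items())

rng = random.Random(0)
print(sample_transition("low", "charge", rng))
print(expected_reward("low", "charge"))
```

Because `T` depends only on the current state and action, sampling from it respects the Markov property by construction.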
The “robot” or “AI” in reinforcement learning can be viewed as a function that receives a state and outputs an action. This function is called the policy $\pi(a \mid s)$, which gives the probability of choosing action $a$ in state $s$. The agent aims to discover the optimal policy, the one that maximizes reward, by repeatedly updating its policy.
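A policy can be sketched as a table of action probabilities per state, from which the agent samples. The states, actions, probabilities, and the helper names `act` and `greedy` below are hypothetical and serve only to illustrate the idea.

```python
import random

# pi[s][a] is the probability of choosing action a in state s (illustrative values).
pi = {
    "low":  {"wait": 0.2, "charge": 0.8},
    "high": {"wait": 0.9, "charge": 0.1},
}

def act(policy, state, rng):
    """Sample an action according to the policy's probabilities for this state."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

def greedy(policy, state):
    """Return the single most probable action in this state."""
    return max(policy[state], key=policy[state].get)

rng = random.Random(0)
print(act(pi, "low", rng))
print(greedy(pi, "high"))
```

Updating the policy then amounts to changing the numbers in `pi` (or the parameters of a model that produces them) so that actions leading to more reward become more likely.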
References
- Takahiro Kubo, “Introduction to Reinforcement Learning with Python: From Basics to Practice”, Shoeisha (2019)