Basic Terminology in Statistics and Machine Learning

Statistics Terminology

Sample: A set of elements (observations) drawn from the population that is the target of analysis or investigation.
Observation: A specific numerical value or data point obtained from a sample.
Parameter: A numerical value that determines the characteristics of a probability distribution or statistical model. It is typically unknown and must be estimated from data. Examples are the mean and variance of a normal distribution.
Hypothesis Testing: A method for judging, based on data, whether a hypothesis (e.g., that there is a difference between two groups) is statistically supported.
Estimation: Inferring unknown parameters or future values from observed data. There are two types: point estimation, which infers a single value, and interval estimation, which infers a range of values.
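
To make the terms above concrete, here is a minimal sketch using only the Python standard library. The sample values are made up for illustration, and the function names `estimate_mean` and `two_sample_z_test` are hypothetical. Both functions assume a normal approximation, which is rough for small samples (a t-distribution would usually be preferred there).

```python
import math
import statistics
from statistics import NormalDist


def estimate_mean(sample, confidence=0.95):
    """Return (point_estimate, (lower, upper)) for the population mean.

    Assumption of this sketch: a normal approximation for the interval.
    """
    n = len(sample)
    point = statistics.mean(sample)                      # point estimation
    std_err = statistics.stdev(sample) / math.sqrt(n)    # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)       # two-sided critical value
    return point, (point - z * std_err, point + z * std_err)  # interval estimation


def two_sample_z_test(a, b):
    """Return (z, p_value) for the hypothesis that both groups share one mean.

    Assumption of this sketch: a large-sample two-sided z-test.
    """
    std_err = math.sqrt(statistics.variance(a) / len(a)
                        + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up observations forming one sample.
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
point, (lower, upper) = estimate_mean(sample)
print(f"point estimate: {point:.3f}")
print(f"95% interval:   ({lower:.3f}, {upper:.3f})")

# A second made-up group, shifted by 1.0, to compare against the first.
shifted = [x + 1.0 for x in sample]
z, p_value = two_sample_z_test(sample, shifted)
print(f"z = {z:.2f}, p-value = {p_value:.3g}")
```

Here the mean of the population is the parameter, the list `sample` is the sample, and each number in it is an observation. A small p-value suggests that the data are hard to reconcile with the hypothesis of equal means.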

Machine Learning Terminology

Learning/Training: The process by which a machine automatically discovers patterns and regularities from data to build a model.
Task: A specific problem or objective to be solved using machine learning.

Supervised Learning

Supervised learning is a method that builds a model from many pairs of input data and their corresponding correct outputs (labels, also called teacher signals). The model learns the relationship between inputs and outputs and is then used to predict outputs for unseen inputs. Such a model is often a discriminative model.

Representative tasks:

Classification: A task that assigns inputs to one of a finite number of predefined categories (classes). Examples: spam email detection, image recognition.
Regression: A task that predicts continuous real-valued outputs from inputs. Examples: stock price prediction, housing price prediction.
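
The two tasks above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than a recommended implementation: the toy data and labels are made up, and the function names `fit_line` and `nearest_neighbor_predict` are hypothetical. Least-squares line fitting stands in for regression and a 1-nearest-neighbor rule stands in for classification.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def nearest_neighbor_predict(train_inputs, train_labels, query):
    """Classification: return the label of the training input closest to query."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    best = min(range(len(train_inputs)),
               key=lambda i: squared_distance(train_inputs[i], query))
    return train_labels[best]


# Regression: toy pairs (input, correct output) that lie on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
predicted_y = slope * 4 + intercept  # prediction for an unseen input x = 4

# Classification: toy 2-D inputs with made-up class labels.
points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
labels = ["ham", "ham", "spam", "spam"]
predicted_label = nearest_neighbor_predict(points, labels, (0.8, 0.9))

print(predicted_y, predicted_label)
```

In both cases the training data consists of inputs paired with correct outputs. The only difference is the type of output: a real number for regression, one of a finite set of classes for classification.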

Unsupervised Learning

Unsupervised learning is a method that discovers structure and patterns in data that has no correct outputs (labels). It often aims to learn the mechanism by which the data was generated (a generative model).

Representative tasks:

Clustering: A task that divides data into multiple groups (clusters) based on similarity. Example: customer segmentation.
Dimensionality Reduction: A task that transforms high-dimensional data into a lower-dimensional representation while minimizing information loss. Examples: feature visualization, noise removal.
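
As a rough illustration of these two tasks, the sketch below uses only the Python standard library. It assumes k-means (Lloyd's algorithm) as the clustering method and a projection onto the first principal axis of 2-D data as the dimensionality reduction method. Neither is named in the text above; they are chosen here as common examples. The function names and toy data are hypothetical.

```python
import math


def k_means(points, centroids, iterations=10):
    """Clustering: Lloyd's algorithm starting from the given initial centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


def project_to_first_component(points):
    """Dimensionality reduction: map 2-D points to 1-D along the axis of largest variance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # Closed-form direction of the major axis of a 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(angle), math.sin(angle)
    return [(x - mean_x) * ux + (y - mean_y) * uy for x, y in points]


# Clustering: two visibly separated groups of unlabeled toy points.
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
final_centroids, final_clusters = k_means(data, [(0.0, 0.0), (10.0, 10.0)])

# Dimensionality reduction: points on the line y = x need only one coordinate.
reduced = project_to_first_component([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)])

print(final_centroids)
print(reduced)
```

No labels appear anywhere in the input: the groups and the one-dimensional coordinates are derived from the data alone. The result of k-means depends on the initial centroids, which are supplied by hand here to keep the sketch deterministic.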

References

Taro Tezuka, “Understanding Bayesian Statistics and Machine Learning,” Kodansha (2017)
