Statistics Terminology
- Sample: Elements or observations drawn from the population targeted for analysis or investigation.
- Observation: Specific numerical values or data obtained from samples.
- Parameter: Unknown numerical values that determine the characteristics of probability distributions or statistical models. For example, the mean and variance of a normal distribution.
- Hypothesis Testing: A method for judging, based on data, whether a hypothesis (e.g., that there is a difference between two groups) is statistically supported.
- Estimation: Inferring unknown parameters or future values from known data. There are two types: point estimation (inferring a single value) and interval estimation (inferring a range of values). Both kinds of estimation, together with a hypothesis test, are illustrated in the sketch after this list.
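The sketch below is not from the referenced book; it is a minimal Python illustration of point estimation, interval estimation, and a hypothesis test using NumPy and SciPy. The synthetic data, the sample sizes, and the 95% confidence level are arbitrary choices for illustration.

```python
# Minimal sketch (illustrative, not from the book): point estimation,
# interval estimation, and a two-sample hypothesis test on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Samples: observations drawn from normal populations whose parameters
# (mean, variance) are treated as unknown by the analysis.
sample_a = rng.normal(loc=5.0, scale=2.0, size=50)
sample_b = rng.normal(loc=5.8, scale=2.0, size=50)

# Point estimation: a single value inferred for each unknown parameter.
mean_hat = sample_a.mean()        # estimate of the mean
var_hat = sample_a.var(ddof=1)    # unbiased estimate of the variance

# Interval estimation: a 95% confidence interval for the mean,
# using the t distribution because the variance is also estimated.
ci_low, ci_high = stats.t.interval(
    0.95, df=len(sample_a) - 1,
    loc=mean_hat, scale=stats.sem(sample_a)
)

# Hypothesis testing: is there a difference between the two groups?
# The two-sample t-test gives a p-value for the null hypothesis
# that both groups share the same population mean.
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)

print(f"point estimates: mean={mean_hat:.2f}, variance={var_hat:.2f}")
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
print(f"t-test: t={t_stat:.2f}, p={p_value:.3f}")
```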
Machine Learning Terminology
- Learning/Training: The process by which a machine automatically discovers patterns and regularities from data to build a model.
- Task: A specific problem or objective to be solved using machine learning.
Supervised Learning
Supervised learning is a method that builds a model (a discriminative model) from many pairs of input data and their corresponding correct outputs (labels, or teacher signals); the model learns the relationship between inputs and outputs and is then used to predict outputs for unknown inputs.
Representative tasks:
- Classification: A task that assigns inputs to one of a finite number of predefined categories (classes). Examples: spam email detection, image recognition.
- Regression: A task that predicts continuous real-valued outputs from inputs. Examples: stock price prediction, housing price prediction. A code sketch of both tasks follows this list.
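The following minimal Python sketch (not from the referenced book) runs one classification task and one regression task with scikit-learn. The choice of the iris and diabetes toy datasets and of logistic/linear regression as models is an illustrative assumption.

```python
# Minimal sketch (illustrative, not from the book): one classification task
# and one regression task on scikit-learn's built-in toy datasets.
from sklearn.datasets import load_iris, load_diabetes
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split

# Classification: assign each input to one of a finite set of classes
# (here, the iris species).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Regression: predict a continuous real value
# (here, a disease-progression score).
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
print("regression R^2:", reg.score(X_test, y_test))
```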
Unsupervised Learning
Unsupervised learning is a method that discovers structure and patterns in data without correct outputs (labels). It often aims to learn the mechanism by which the data were generated (a generative model).
Representative tasks:
- Clustering: A task that divides data into multiple groups (clusters) based on similarity. Example: customer segmentation.
- Dimensionality Reduction: A task that transforms high-dimensional data into a lower-dimensional representation while minimizing information loss. Examples: feature visualization, noise removal. A code sketch of both tasks follows this list.
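The following minimal Python sketch (not from the referenced book) runs both tasks with scikit-learn on the iris features, ignoring the labels; the choice of k-means with three clusters and PCA with two components is an illustrative assumption.

```python
# Minimal sketch (illustrative, not from the book): clustering and
# dimensionality reduction on unlabeled data.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # labels are ignored: no teacher signal

# Clustering: group the data into k clusters by similarity.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((kmeans.labels_ == c).sum()) for c in range(3)])

# Dimensionality reduction: project the 4-dimensional features onto
# 2 dimensions while retaining as much variance as possible.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)
print("reduced shape:", X_2d.shape)
```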
References
- Taro Tezuka, “Understanding Bayesian Statistics and Machine Learning,” Kodansha (2017)