Basic Terminology in Statistics and Machine Learning

An overview of fundamental terms in statistics and machine learning, covering samples, parameters, estimation, and supervised and unsupervised learning tasks.

Statistics Terminology

  • Sample: A set of elements drawn from a population, that is, from the full collection of data targeted for analysis or investigation. In machine learning the word is also used for a single element of that set.
  • Observation: The value or values actually recorded for an element of a sample.
  • Parameter: A numerical value, usually unknown, that determines the characteristics of a probability distribution or statistical model. Examples are the mean and variance of a normal distribution.
  • Hypothesis Testing: A procedure for judging from data whether a hypothesis is statistically supported. Typically one states a null hypothesis (e.g., that two groups do not differ) and rejects it if the observed data would be sufficiently unlikely under it.
  • Estimation: Inferring unknown parameters or future values from observed data. There are two types: point estimation (inferring a single value) and interval estimation (inferring a range of values likely to contain the true value).
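
The difference between point and interval estimation can be illustrated with a minimal sketch. The data below are hypothetical, and the interval uses a normal approximation, which is an assumption made for brevity; for a sample this small a t-based interval would usually be preferred.

```python
import math
import statistics

# Hypothetical sample: ten measurements drawn from some population.
sample = [4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0, 5.1, 4.9]
n = len(sample)

# Point estimation: a single value for each unknown parameter.
mean_hat = statistics.mean(sample)       # estimate of the population mean
var_hat = statistics.variance(sample)    # unbiased estimate of the variance

# Interval estimation: an approximate 95% confidence interval for the mean.
# Assumption: normal approximation with the estimated variance plugged in.
z = statistics.NormalDist().inv_cdf(0.975)   # two-sided 95% quantile
half_width = z * math.sqrt(var_hat / n)
interval = (mean_hat - half_width, mean_hat + half_width)

print(f"point estimate of the mean:     {mean_hat:.3f}")
print(f"point estimate of the variance: {var_hat:.4f}")
print(f"approximate 95% interval:       ({interval[0]:.3f}, {interval[1]:.3f})")
```

The point estimate is one number, while the interval also conveys how uncertain that number is: the interval narrows as the sample size grows or the variance shrinks.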

Machine Learning Terminology

  • Learning/Training: The process by which a machine automatically discovers patterns and regularities from data to build a model.
  • Task: A specific problem or objective to be solved using machine learning.

Supervised Learning

Supervised learning builds a model from many pairs of input data and their corresponding correct outputs (labels, also called teacher signals). The model learns the relationship between inputs and outputs so that it can predict the output for previously unseen inputs. Such a model is often a discriminative model.

Representative tasks:

  • Classification: A task that assigns inputs to one of a finite number of predefined categories (classes). Examples: spam email detection, image recognition.
  • Regression: A task that predicts continuous real-valued outputs from inputs. Examples: stock price prediction, housing price prediction.
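
The two tasks can be contrasted with a small sketch in plain Python. The nearest-centroid classifier and the least-squares line below are illustrative choices, not methods named by the source, and all data, labels, and function names are hypothetical.

```python
def fit_centroids(inputs, labels):
    """Classification, training step: mean input vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(inputs, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def classify(centroids, x):
    """Classification, prediction step: pick the class with the closest centroid."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )


def fit_line(xs, ys):
    """Regression, training step: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Labelled training pairs (input, correct output) for classification.
train_inputs = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_labels = ["ham", "ham", "ham", "spam", "spam", "spam"]
centroids = fit_centroids(train_inputs, train_labels)
predicted_class = classify(centroids, (3.5, 3.7))   # output is a category

# Labelled training pairs for regression; the targets follow y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
predicted_value = slope * 5.0 + intercept            # output is a real number

print(predicted_class, predicted_value)
```

In both cases the model is fitted to labelled pairs and then applied to an unseen input; the only difference is whether the predicted output is a category or a real number.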

Unsupervised Learning

Unsupervised learning discovers structure and patterns in data that has no correct outputs (labels) attached. It often aims to model the mechanism by which the data was generated (a generative model).

Representative tasks:

  • Clustering: A task that divides data into multiple groups (clusters) based on similarity. Example: customer segmentation.
  • Dimensionality Reduction: A task that transforms high-dimensional data into a lower-dimensional representation while minimizing information loss. Examples: feature visualization, noise removal.
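
Both tasks can be sketched without labels in plain Python. The choice of k-means for clustering and of a projection onto the first principal axis for dimensionality reduction is an assumption made for illustration; the source does not name specific algorithms. The data and function names are hypothetical, and the principal-axis code handles only 2-dimensional inputs.

```python
import math


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def kmeans(points, centers, iterations=10):
    """Clustering: alternate between assigning points and moving centers."""
    centers = list(centers)
    for _ in range(iterations):
        assignments = [nearest(p, centers) for p in points]
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return [nearest(p, centers) for p in points], centers


def first_principal_axis(points):
    """Mean and unit direction of largest variance for 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) ** 2 for x, _ in points) / (n - 1)       # var(x)
    c = sum((y - my) ** 2 for _, y in points) / (n - 1)       # var(y)
    b = sum((x - mx) * (y - my) for x, y in points) / (n - 1)  # cov(x, y)
    # Largest eigenvalue of the 2x2 covariance matrix [[a, b], [b, c]].
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    if b == 0:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    else:
        v = (b, lam - a)
    norm = math.hypot(v[0], v[1])
    return (mx, my), (v[0] / norm, v[1] / norm)


def project(points, mean, axis):
    """Dimensionality reduction: one coordinate per 2-D point."""
    return [(x - mean[0]) * axis[0] + (y - mean[1]) * axis[1] for x, y in points]


# Clustering: two visually separate groups, no labels given.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
cluster_ids, centers = kmeans(points, [points[0], points[3]])

# Dimensionality reduction: points lying on the line y = 2x need only one coordinate.
line_points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
mean, axis = first_principal_axis(line_points)
reduced = project(line_points, mean, axis)

print(cluster_ids, axis, reduced)
```

The initial centers are picked by hand here to keep the run deterministic; in practice they are usually chosen at random, and the result can depend on that choice.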

References

  • Taro Tezuka, “Understanding Bayesian Statistics and Machine Learning,” Kodansha (2017)