Introduction
Ensemble learning combines multiple weak learners so that the combined model generalizes better than any of its individual members typically would. Most top solutions in Kaggle competitions employ ensemble methods.
This article systematically covers decision trees as the building block, bagging (Random Forest), and boosting (GBDT).
Decision Trees
CART Algorithm
For classification, recursively find splits minimizing Gini impurity:
\[\text{Gini}(t) = 1 - \sum_{k=1}^{K} p_k^2 \tag{1}\]
where \(p_k\) is the proportion of class \(k\) in node \(t\).
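As a minimal sketch of Eq. (1), assuming integer class labels, the impurity of a node can be computed directly from its label counts:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node from its class labels (Eq. 1)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # class proportions p_k
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; an even 50/50 split gives 0.5,
# the maximum for two classes.
print(gini([0, 0, 0, 0]))   # 0.0
print(gini([0, 0, 1, 1]))   # 0.5
```

CART evaluates candidate splits by the weighted average impurity of the two child nodes and keeps the split that lowers it the most.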
For regression, minimize mean squared error:
\[\text{MSE}(t) = \frac{1}{|t|} \sum_{i \in t} (y_i - \bar{y}_t)^2 \tag{2}\]
where \(\bar{y}_t\) is the mean target value in node \(t\).
Strengths and Weaknesses
| Strengths | Weaknesses |
|---|---|
| Interpretable | Prone to overfitting |
| No preprocessing needed | Axis-aligned boundaries |
| Handles nonlinearity | High variance (unstable) |
Bagging
Principle
Bootstrap Aggregating (bagging) trains multiple decision trees on bootstrap samples independently, then averages (regression) or votes (classification):
\[\hat{f}_{\text{bag}}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(\mathbf{x}) \tag{3}\]
This reduces variance: for \(B\) independent predictors, each with variance \(\sigma^2\),
\[\text{Var}\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(\mathbf{x})\right) = \frac{\sigma^2}{B} \tag{4}\]
Random Forest
In addition to bagging, each split considers only a random subset of features (\(m \approx \sqrt{p}\)). This decorrelates trees, further reducing variance.
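Eq. (4) can be checked numerically. The sketch below stands in for tree predictions with independent Gaussian draws (an assumption; real bagged trees are correlated because they share training data, which is exactly why Random Forest adds feature subsampling to decorrelate them):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0    # variance of a single "tree" prediction (made-up value)
B = 25          # ensemble size

# 100,000 trials: average B independent predictions with variance sigma^2.
single = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, B))
ensemble = single.mean(axis=1)

print(f"single model variance: {single[:, 0].var():.3f}")   # ~ 4.0
print(f"ensemble variance:     {ensemble.var():.3f}")       # ~ sigma^2/B = 0.16
```

With correlation \(\rho\) between trees the variance only falls to \(\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\), so decorrelation (small \(\rho\)) matters as much as ensemble size.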
Boosting
Gradient Boosting (GBDT)
Weak learners are added sequentially, each fitting the residuals (negative gradient) of the previous model:
\[F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \cdot h_m(\mathbf{x}) \tag{5}\]
where \(h_m\) fits the pseudo-residuals \(r_{im} = -\frac{\partial L(y_i, F_{m-1}(\mathbf{x}_i))}{\partial F_{m-1}(\mathbf{x}_i)}\) and \(\eta\) is the learning rate.
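For squared loss \(L = \frac{1}{2}(y - F)^2\), the negative gradient is simply the residual \(y - F\), so Eq. (5) can be sketched from scratch with shallow regression trees as weak learners. The sine toy data below is made up for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

eta, M = 0.1, 100
F = np.full_like(y, y.mean())   # F_0: best constant prediction
trees = []
for m in range(M):
    r = y - F                                  # pseudo-residuals (squared loss)
    h = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, r)
    F += eta * h.predict(X)                    # Eq. (5): F_m = F_{m-1} + eta * h_m
    trees.append(h)

print(f"training MSE: {np.mean((y - F) ** 2):.4f}")
```

Each tree corrects what the ensemble so far got wrong; the learning rate \(\eta\) shrinks each correction, trading more iterations for better generalization.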
Bagging vs Boosting
| Feature | Bagging | Boosting |
|---|---|---|
| Training | Parallel (independent) | Sequential (dependent) |
| Main effect | Variance reduction | Bias reduction |
| Overfitting resistance | High | Lower (needs tuning) |
| Methods | Random Forest | GBDT, XGBoost, LightGBM |
Python Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

models = {
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=5,
                                            random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100,
                                                    max_depth=3,
                                                    learning_rate=0.1,
                                                    random_state=42),
}

for name, model in models.items():
    cv_scores = cross_val_score(model, X_train, y_train, cv=5,
                                scoring='accuracy')
    model.fit(X_train, y_train)
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: CV={cv_scores.mean():.3f}±{cv_scores.std():.3f}, "
          f"Test={test_acc:.3f}")

# --- Feature importance (Random Forest) ---
rf = models['Random Forest']
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1][:10]

plt.figure(figsize=(10, 5))
plt.bar(range(10), importances[indices])
plt.xticks(range(10), [f'Feature {i}' for i in indices], rotation=45)
plt.ylabel('Importance')
plt.title('Random Forest Feature Importance (Top 10)')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Key Hyperparameters
Random Forest
| Parameter | Description | Guideline |
|---|---|---|
| n_estimators | Number of trees | 100-500 (more = stable) |
| max_depth | Tree depth | 5-20 |
| max_features | Features per split | \(\sqrt{p}\) (classification), \(p/3\) (regression) |
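The max_features guideline maps directly onto scikit-learn's parameters; a small sketch (the dataset and counts are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

# 'sqrt' implements the sqrt(p) guideline for classification; for regression,
# RandomForestRegressor(max_features=1/3) would approximate the p/3 rule.
rf = RandomForestClassifier(n_estimators=200, max_features='sqrt',
                            random_state=0).fit(X, y)
print(rf.n_features_in_)   # 16 features seen; only 4 considered per split
```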
Gradient Boosting
| Parameter | Description | Guideline |
|---|---|---|
| n_estimators | Boosting iterations | 100-1000 |
| learning_rate | Step size \(\eta\) | 0.01-0.1 |
| max_depth | Depth per tree | 3-8 (shallow) |
Small learning rate + many iterations is the standard approach.
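One way to apply this recipe without hand-tuning n_estimators is scikit-learn's built-in early stopping (validation_fraction, n_iter_no_change): set a small learning rate, give the model a generous iteration budget, and let it stop when the held-out score plateaus. The values below are illustrative, not a tuned recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Large budget + small learning rate; training halts once the score on an
# internal 10% validation split fails to improve for 10 consecutive rounds.
gbc = GradientBoostingClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=3,
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=0,
).fit(X, y)

print(f"iterations actually used: {gbc.n_estimators_}")
```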
Related Articles
- Support Vector Machines (SVM) - Comparison with another powerful classification method.
- k-means and GMM - Contrast with unsupervised learning; decision trees also compute feature importance.
- SGD to Adam - Connection between gradient boosting’s “gradient” and optimization methods.
- Bayesian Optimization: Fundamentals and Python Implementation - Apply Bayesian optimization to tune GBDT hyperparameters.
References
- Breiman, L. (2001). “Random Forests”. Machine Learning, 45(1), 5-32.
- Friedman, J. H. (2001). “Greedy Function Approximation: A Gradient Boosting Machine”. Annals of Statistics, 29(5), 1189-1232.
- scikit-learn Ensemble Methods