Ensemble Learning: From Decision Trees to Random Forest and Gradient Boosting

From decision tree fundamentals to bagging (Random Forest) and boosting (GBDT, XGBoost, LightGBM), with theory and scikit-learn implementation.

Introduction

Ensemble learning combines multiple weak learners to achieve higher generalization performance than any single model. Most top solutions in Kaggle competitions employ ensemble methods.

This article systematically covers decision trees as the building block, bagging (Random Forest), and boosting (GBDT).

Decision Trees

CART Algorithm

For classification, recursively find splits minimizing Gini impurity:

\[\text{Gini}(t) = 1 - \sum_{k=1}^{K} p_k^2 \tag{1}\]

where \(p_k\) is the proportion of class \(k\) in node \(t\).

For regression, minimize mean squared error:

\[\text{MSE}(t) = \frac{1}{|t|} \sum_{i \in t} (y_i - \bar{y}_t)^2 \tag{2}\]
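Both impurity measures are straightforward to compute directly. The sketch below (helper names `gini` and `node_mse` are illustrative, not from any library) evaluates Equations (1) and (2) for a single node:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2 (Eq. 1)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def node_mse(targets):
    """Mean squared error around the node mean (Eq. 2)."""
    t = np.asarray(targets, dtype=float)
    return np.mean((t - t.mean()) ** 2)

# A pure node has zero impurity; a 50/50 node has Gini = 0.5.
print(gini([0, 0, 0, 0]))    # 0.0
print(gini([0, 0, 1, 1]))    # 0.5
print(node_mse([1.0, 3.0]))  # 1.0
```

CART scores each candidate split by the weighted impurity of the two child nodes and keeps the split with the lowest score.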

Strengths and Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| Interpretable | Prone to overfitting |
| No preprocessing needed | Axis-aligned boundaries |
| Handles nonlinearity | High variance (unstable) |

Bagging

Principle

Bootstrap Aggregating (bagging) independently trains multiple base learners (typically decision trees) on bootstrap samples of the training data, then averages their predictions (regression) or takes a majority vote (classification):

\[\hat{f}_{\text{bag}}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(\mathbf{x}) \tag{3}\]

This reduces variance:

\[\text{Var}\left(\frac{1}{B}\sum_b f_b\right) = \frac{\sigma^2}{B} \quad (\text{if independent}) \tag{4}\]
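Equation (4) can be checked numerically. The sketch below assumes perfectly independent predictors (real bootstrap trees are correlated, so the actual reduction is smaller) and simulates \(B\) noisy predictions before averaging:

```python
import numpy as np

rng = np.random.default_rng(0)
B, n_trials = 25, 10_000
sigma = 1.0

# Each row: B independent "tree predictions" with variance sigma^2.
preds = rng.normal(0.0, sigma, size=(n_trials, B))
ensemble = preds.mean(axis=1)

print(f"single-model variance: {preds[:, 0].var():.3f}")  # ~ sigma^2 = 1.0
print(f"ensemble variance:     {ensemble.var():.3f}")     # ~ sigma^2 / B = 0.04
```

The empirical ensemble variance lands near \(\sigma^2 / B\), matching Equation (4).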

Random Forest

In addition to bagging, each split considers only a random subset of features (\(m \approx \sqrt{p}\)). This decorrelates trees, further reducing variance.
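As a quick sketch of this effect (the synthetic dataset and parameter values here are illustrative, not a rigorous benchmark), scikit-learn lets you toggle feature subsampling via `max_features`: `None` gives plain bagged trees, while `'sqrt'` gives the Random Forest default for classification:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=0)

scores = {}
for mf in (None, 'sqrt'):
    # max_features=None -> every split sees all p features (plain bagging);
    # 'sqrt' -> each split sees a random subset of sqrt(p) features.
    rf = RandomForestClassifier(n_estimators=100, max_features=mf,
                                random_state=0)
    scores[mf] = cross_val_score(rf, X, y, cv=5).mean()
    print(f"max_features={mf}: CV accuracy = {scores[mf]:.3f}")
```

On any single dataset either variant may win, but subsampling tends to help when many features carry signal, since it prevents every tree from splitting on the same dominant features.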

Boosting

Gradient Boosting (GBDT)

Weak learners are added sequentially, each fitting the residuals (negative gradient) of the previous model:

\[F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \cdot h_m(\mathbf{x}) \tag{5}\]

where \(h_m\) fits pseudo-residuals \(r_{im} = -\frac{\partial L(y_i, F_{m-1}(\mathbf{x}_i))}{\partial F_{m-1}(\mathbf{x}_i)}\) and \(\eta\) is the learning rate.
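For the squared loss \(L = \tfrac{1}{2}(y - F)^2\), the negative gradient reduces to the plain residual \(y - F\), which makes the algorithm easy to sketch from scratch (function names and data below are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_rounds=100, eta=0.1, max_depth=2):
    """Minimal gradient boosting for squared loss."""
    f0 = y.mean()                       # F_0: best constant prediction
    F = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - F               # pseudo-residuals r_im
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        F += eta * h.predict(X)         # F_m = F_{m-1} + eta * h_m  (Eq. 5)
        trees.append(h)
    return f0, trees

def gbdt_predict(X, f0, trees, eta=0.1):
    F = np.full(X.shape[0], f0)
    for h in trees:
        F += eta * h.predict(X)
    return F

# Sanity check on a noisy sine curve.
rng = np.random.default_rng(42)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)
f0, trees = gbdt_fit(X, y)
mse = np.mean((gbdt_predict(X, f0, trees) - y) ** 2)
print(f"train MSE: {mse:.4f}")
```

Production libraries (XGBoost, LightGBM) add second-order gradients, regularization, and histogram-based split finding on top of this same loop.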

Bagging vs Boosting

| Feature | Bagging | Boosting |
| --- | --- | --- |
| Training | Parallel (independent) | Sequential (dependent) |
| Main effect | Variance reduction | Bias reduction |
| Overfitting resistance | High | Lower (needs tuning) |
| Methods | Random Forest | GBDT, XGBoost, LightGBM |

Python Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier,
                               GradientBoostingClassifier)
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                            n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                      random_state=42)

models = {
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=5,
                                             random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100,
                                                     max_depth=3,
                                                     learning_rate=0.1,
                                                     random_state=42),
}

for name, model in models.items():
    cv_scores = cross_val_score(model, X_train, y_train, cv=5,
                                 scoring='accuracy')
    model.fit(X_train, y_train)
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: CV={cv_scores.mean():.3f}±{cv_scores.std():.3f}, "
          f"Test={test_acc:.3f}")

# --- Feature importance (Random Forest) ---
rf = models['Random Forest']
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1][:10]

plt.figure(figsize=(10, 5))
plt.bar(range(10), importances[indices])
plt.xticks(range(10), [f'Feature {i}' for i in indices], rotation=45)
plt.ylabel('Importance')
plt.title('Random Forest Feature Importance (Top 10)')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

Key Hyperparameters

Random Forest

| Parameter | Description | Guideline |
| --- | --- | --- |
| n_estimators | Number of trees | 100-500 (more = stable) |
| max_depth | Tree depth | 5-20 |
| max_features | Features per split | \(\sqrt{p}\) (classification), \(p/3\) (regression) |

Gradient Boosting

| Parameter | Description | Guideline |
| --- | --- | --- |
| n_estimators | Boosting iterations | 100-1000 |
| learning_rate | Step size \(\eta\) | 0.01-0.1 |
| max_depth | Depth per tree | 3-8 (shallow) |

A small learning rate combined with many iterations, stopped early on a validation set, is the standard approach.

References

  • Breiman, L. (2001). “Random Forests”. Machine Learning, 45(1), 5-32.
  • Friedman, J. H. (2001). “Greedy Function Approximation: A Gradient Boosting Machine”. Annals of Statistics, 29(5), 1189-1232.
  • scikit-learn Ensemble Methods