Introduction
Ensemble learning combines multiple weak learners so that the combined model generalizes better than any of its individual members typically would. Most top solutions in Kaggle competitions employ ensemble methods.
This article systematically covers decision trees as the building block, bagging (Random Forest), and boosting (GBDT).
Decision Trees
CART Algorithm
For classification, recursively find splits minimizing Gini impurity:
\[\text{Gini}(t) = 1 - \sum_{k=1}^{K} p_k^2 \tag{1}\]
where \(p_k\) is the proportion of class \(k\) in node \(t\).
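As a minimal sketch of Eq. (1), assuming integer class labels, the impurity of a node can be computed directly from its label counts:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node from its class labels (Eq. 1)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # class proportions p_k
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; an even 50/50 split gives 0.5,
# the maximum for two classes.
print(gini([0, 0, 0, 0]))   # 0.0
print(gini([0, 0, 1, 1]))   # 0.5
```

CART evaluates candidate splits by the weighted average impurity of the two child nodes and keeps the split that lowers it the most.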
For regression, minimize mean squared error:
\[\text{MSE}(t) = \frac{1}{|t|} \sum_{i \in t} (y_i - \bar{y}_t)^2 \tag{2}\]
where \(\bar{y}_t\) is the mean target value in node \(t\).
Strengths and Weaknesses
| Strengths | Weaknesses |
|---|---|
| Interpretable | Prone to overfitting |
| No preprocessing needed | Axis-aligned boundaries |
| Handles nonlinearity | High variance (unstable) |
Bagging
Principle
Bootstrap Aggregating (bagging) trains multiple decision trees on bootstrap samples independently, then averages (regression) or votes (classification):
\[\hat{f}_{\text{bag}}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(\mathbf{x}) \tag{3}\]
This reduces variance: for \(B\) independent predictors, each with variance \(\sigma^2\),
\[\text{Var}\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(\mathbf{x})\right) = \frac{\sigma^2}{B} \tag{4}\]
Random Forest
In addition to bagging, each split considers only a random subset of features (\(m \approx \sqrt{p}\)). This decorrelates trees, further reducing variance.
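Eq. (4) can be checked numerically. The sketch below stands in for tree predictions with independent Gaussian draws (an assumption; real bagged trees are correlated because they share training data, which is exactly why Random Forest adds feature subsampling to decorrelate them):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0    # variance of a single "tree" prediction (made-up value)
B = 25          # ensemble size

# 100,000 trials: average B independent predictions with variance sigma^2.
single = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, B))
ensemble = single.mean(axis=1)

print(f"single model variance: {single[:, 0].var():.3f}")   # ~ 4.0
print(f"ensemble variance:     {ensemble.var():.3f}")       # ~ sigma^2/B = 0.16
```

With correlation \(\rho\) between trees the variance only falls to \(\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\), so decorrelation (small \(\rho\)) matters as much as ensemble size.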
Boosting
Gradient Boosting (GBDT)
Weak learners are added sequentially, each fitting the residuals (negative gradient) of the previous model:
\[F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \cdot h_m(\mathbf{x}) \tag{5}\]
where \(h_m\) fits the pseudo-residuals \(r_{im} = -\frac{\partial L(y_i, F_{m-1}(\mathbf{x}_i))}{\partial F_{m-1}(\mathbf{x}_i)}\) and \(\eta\) is the learning rate.
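For squared loss \(L = \frac{1}{2}(y - F)^2\), the negative gradient is simply the residual \(y - F\), so Eq. (5) can be sketched from scratch with shallow regression trees as weak learners. The sine toy data below is made up for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

eta, M = 0.1, 100
F = np.full_like(y, y.mean())   # F_0: best constant prediction
trees = []
for m in range(M):
    r = y - F                                  # pseudo-residuals (squared loss)
    h = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, r)
    F += eta * h.predict(X)                    # Eq. (5): F_m = F_{m-1} + eta * h_m
    trees.append(h)

print(f"training MSE: {np.mean((y - F) ** 2):.4f}")
```

Each tree corrects what the ensemble so far got wrong; the learning rate \(\eta\) shrinks each correction, trading more iterations for better generalization.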
Bagging vs Boosting
| Feature | Bagging | Boosting |
|---|---|---|
| Training | Parallel (independent) | Sequential (dependent) |
| Main effect | Variance reduction | Bias reduction |
| Overfitting resistance | High | Lower (needs tuning) |
| Methods | Random Forest | GBDT, XGBoost, LightGBM |
Python Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

models = {
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=5,
                                            random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100,
                                                    max_depth=3,
                                                    learning_rate=0.1,
                                                    random_state=42),
}

for name, model in models.items():
    cv_scores = cross_val_score(model, X_train, y_train, cv=5,
                                scoring='accuracy')
    model.fit(X_train, y_train)
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: CV={cv_scores.mean():.3f}±{cv_scores.std():.3f}, "
          f"Test={test_acc:.3f}")

# --- Feature importance (Random Forest) ---
rf = models['Random Forest']
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1][:10]

plt.figure(figsize=(10, 5))
plt.bar(range(10), importances[indices])
plt.xticks(range(10), [f'Feature {i}' for i in indices], rotation=45)
plt.ylabel('Importance')
plt.title('Random Forest Feature Importance (Top 10)')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Key Hyperparameters
Random Forest
| Parameter | Description | Guideline |
|---|---|---|
| n_estimators | Number of trees | 100-500 (more = stable) |
| max_depth | Tree depth | 5-20 |
| max_features | Features per split | \(\sqrt{p}\) (classification), \(p/3\) (regression) |
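The max_features guideline maps directly onto scikit-learn's parameters; a small sketch (the dataset and counts are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

# 'sqrt' implements the sqrt(p) guideline for classification; for regression,
# RandomForestRegressor(max_features=1/3) would approximate the p/3 rule.
rf = RandomForestClassifier(n_estimators=200, max_features='sqrt',
                            random_state=0).fit(X, y)
print(rf.n_features_in_)   # 16 features seen; only 4 considered per split
```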
Gradient Boosting
| Parameter | Description | Guideline |
|---|---|---|
| n_estimators | Boosting iterations | 100-1000 |
| learning_rate | Step size \(\eta\) | 0.01-0.1 |
| max_depth | Depth per tree | 3-8 (shallow) |
Small learning rate + many iterations is the standard approach.
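One way to apply this recipe without hand-tuning n_estimators is scikit-learn's built-in early stopping (validation_fraction, n_iter_no_change): set a small learning rate, give the model a generous iteration budget, and let it stop when the held-out score plateaus. The values below are illustrative, not a tuned recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Large budget + small learning rate; training halts once the score on an
# internal 10% validation split fails to improve for 10 consecutive rounds.
gbc = GradientBoostingClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=3,
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=0,
).fit(X, y)

print(f"iterations actually used: {gbc.n_estimators_}")
```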
Related Articles
- Support Vector Machines (SVM) - Comparison with another powerful classification method.
- k-means and GMM - Contrast with unsupervised learning; decision trees also compute feature importance.
- SGD to Adam - Connection between gradient boosting’s “gradient” and optimization methods.
- Bayesian Optimization: Fundamentals and Python Implementation - Apply Bayesian optimization to tune GBDT hyperparameters.
References
- Breiman, L. (2001). “Random Forests”. Machine Learning, 45(1), 5-32.
- Friedman, J. H. (2001). “Greedy Function Approximation: A Gradient Boosting Machine”. Annals of Statistics, 29(5), 1189-1232.
- scikit-learn Ensemble Methods