Introduction
Support Vector Machine (SVM) is a supervised learning method for classification and regression. It constructs a decision boundary that maximizes the margin between classes, achieving high generalization performance.
The kernel trick enables nonlinear decision boundaries, making SVM particularly powerful for high-dimensional, small-to-medium datasets.
Linear SVM
Hard Margin
For linearly separable data \(\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}\) (\(y_i \in \{-1, +1\}\)):
\[\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2 \tag{1}\]\[\text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1, \quad \forall i \tag{2}\]The margin width is \(2 / \|\mathbf{w}\|\); minimizing \(\|\mathbf{w}\|\) maximizes the margin.
Soft Margin
For non-separable data, introduce slack variables \(\xi_i \geq 0\):
\[\min_{\mathbf{w}, b, \xi} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \tag{3}\]\[\text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \tag{4}\]\(C\) controls the trade-off between margin size and misclassification penalty.
Kernel Trick
Dual Problem
The Lagrangian dual shows data appears only through inner products \(\mathbf{x}_i \cdot \mathbf{x}_j\):
\[\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \tag{5}\]\[\text{subject to} \quad 0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{6}\]Kernel Functions
Replacing inner products with kernel functions \(K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)\) enables nonlinear classification without explicitly computing the mapping \(\phi\).
| Kernel | Definition | Use Case |
|---|---|---|
| Linear | \(K(\mathbf{x}, \mathbf{y}) = \mathbf{x} \cdot \mathbf{y}\) | Linearly separable data |
| RBF (Gaussian) | \(K(\mathbf{x}, \mathbf{y}) = \exp(-\gamma\|\mathbf{x}-\mathbf{y}\|^2)\) | Most versatile |
| Polynomial | \(K(\mathbf{x}, \mathbf{y}) = (\gamma \mathbf{x} \cdot \mathbf{y} + r)^d\) | Feature interactions |
Python Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
kernels = ['linear', 'rbf', 'poly']
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, kernel in zip(axes, kernels):
clf = SVC(kernel=kernel, C=1.0, gamma='scale')
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
ax.contourf(xx, yy, Z, levels=[-1, 0, 1], alpha=0.2, colors=['blue', 'red'])
ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors='k', linestyles=['--', '-', '--'])
ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='bwr', s=20, alpha=0.6)
sv = clf.support_vectors_
ax.scatter(sv[:, 0], sv[:, 1], s=80, facecolors='none', edgecolors='k', linewidths=1.5)
ax.set_title(f'{kernel} (acc={acc:.3f})')
ax.set_xlim(-3, 3)
ax.set_ylim(-3, 3)
plt.tight_layout()
plt.show()
Hyperparameters
C (Regularization)
- Large C: Small margin, strict misclassification penalty → overfitting risk
- Small C: Large margin, tolerates some errors → underfitting risk
gamma (RBF Kernel)
- Large gamma: Narrow influence per sample → complex boundary, overfitting
- Small gamma: Wide influence → smooth boundary
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}, Score: {grid.best_score_:.3f}")
SVM Strengths and Limitations
| Strengths | Limitations |
|---|---|
| Strong in high dimensions | Slow on large data (\(O(n^2)\)-\(O(n^3)\)) |
| Nonlinear via kernels | Probability output requires extra processing |
| Resistant to overfitting | Feature scaling is mandatory |
| Decision boundary from SVs only | Multiclass is indirect (OvO or OvR) |
Related Articles
- Bayesian Linear Regression Fundamentals - Useful comparison between Bayesian regression and SVR.
- Bayesian Optimization: Fundamentals and Python Implementation - Automate SVM hyperparameter tuning (C, gamma).
- Cross-Entropy Method: A Practical Monte Carlo Optimization Technique - Alternative optimization approach for comparison.
References
- Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer.
- Cortes, C., & Vapnik, V. (1995). “Support-vector networks”. Machine Learning, 20(3), 273-297.
- scikit-learn SVM documentation