Support Vector Machines (SVM): Kernel Methods and Nonlinear Classification with Python

Introduction

Support Vector Machine (SVM) is a supervised learning method for classification and regression. It constructs a decision boundary that maximizes the margin between classes, achieving high generalization performance.

The kernel trick enables nonlinear decision boundaries, making SVM particularly powerful for high-dimensional, small-to-medium datasets.

Linear SVM

Hard Margin

For linearly separable data \(\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}\) (\(y_i \in \{-1, +1\}\)):

\[\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2 \tag{1}\]\[\text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1, \quad \forall i \tag{2}\]

The margin width is \(2 / \|\mathbf{w}\|\); minimizing \(\|\mathbf{w}\|\) maximizes the margin.

Soft Margin

For non-separable data, introduce slack variables \(\xi_i \geq 0\):

\[\min_{\mathbf{w}, b, \xi} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \tag{3}\]\[\text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \tag{4}\]

\(C\) controls the trade-off between margin size and misclassification penalty.

Kernel Trick

Dual Problem

The Lagrangian dual shows data appears only through inner products \(\mathbf{x}_i \cdot \mathbf{x}_j\):

\[\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \tag{5}\]\[\text{subject to} \quad 0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{6}\]

Kernel Functions

Replacing inner products with kernel functions \(K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)\) enables nonlinear classification without explicitly computing the mapping \(\phi\).

Kernel	Definition	Use Case
Linear	\(K(\mathbf{x}, \mathbf{y}) = \mathbf{x} \cdot \mathbf{y}\)	Linearly separable data
RBF (Gaussian)	\(K(\mathbf{x}, \mathbf{y}) = \exp(-\gamma\\|\mathbf{x}-\mathbf{y}\\|^2)\)	Most versatile
Polynomial	\(K(\mathbf{x}, \mathbf{y}) = (\gamma \mathbf{x} \cdot \mathbf{y} + r)^d\)	Feature interactions

Python Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                      random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

kernels = ['linear', 'rbf', 'poly']
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, kernel in zip(axes, kernels):
    clf = SVC(kernel=kernel, C=1.0, gamma='scale')
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))

    xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    ax.contourf(xx, yy, Z, levels=[-1, 0, 1], alpha=0.2, colors=['blue', 'red'])
    ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors='k', linestyles=['--', '-', '--'])
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='bwr', s=20, alpha=0.6)

    sv = clf.support_vectors_
    ax.scatter(sv[:, 0], sv[:, 1], s=80, facecolors='none', edgecolors='k', linewidths=1.5)

    ax.set_title(f'{kernel} (acc={acc:.3f})')
    ax.set_xlim(-3, 3)
    ax.set_ylim(-3, 3)

plt.tight_layout()
plt.show()

Hyperparameters

C (Regularization)

Large C: Small margin, strict misclassification penalty → overfitting risk
Small C: Large margin, tolerates some errors → underfitting risk

gamma (RBF Kernel)

Large gamma: Narrow influence per sample → complex boundary, overfitting
Small gamma: Wide influence → smooth boundary

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}, Score: {grid.best_score_:.3f}")

SVM Strengths and Limitations

Strengths	Limitations
Strong in high dimensions	Slow on large data (\(O(n^2)\)-\(O(n^3)\))
Nonlinear via kernels	Probability output requires extra processing
Resistant to overfitting	Feature scaling is mandatory
Decision boundary from SVs only	Multiclass is indirect (OvO or OvR)

Ensemble Learning: From Decision Trees to Random Forest and Gradient Boosting - Alternative classification methods including Random Forest and GBDT.
Understanding Self-Attention from Scratch: Math and Python Implementation - Similarity between kernel tricks and Attention.
k-means and GMM: Clustering Theory and Python Implementation - Contrast supervised classification with unsupervised structure discovery.
Bayesian Optimization: Fundamentals and Python Implementation - Automate SVM hyperparameter tuning (C, gamma).
Bayesian Linear Regression: From Least Squares to Bayesian Estimation - Useful comparison between Bayesian regression and SVR.
Cross-Entropy Method: A Practical Monte Carlo Optimization Technique - Alternative optimization approach for comparison.

References

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer.
Cortes, C., & Vapnik, V. (1995). “Support-vector networks”. Machine Learning, 20(3), 273-297.
scikit-learn SVM documentation