Introduction
“Applying machine learning to time-series data” hides a surprising amount of variety: the output you actually want might be forecasting, classification, clustering, or anomaly detection, and even within forecasting alone the right tool changes drastically depending on horizon, linearity, stationarity, and whether you need interpretability. This article is a hub that integrates the four GA4-trending pillars of this site—k-means / GMM clustering, ensemble learning, LSTM time-series forecasting, and time-series anomaly detection—into a single map.
The questions that machine learning faces on time-series, classification, and anomaly tasks can be organized along four axes:
- Supervised vs. unsupervised — do you have labels, or do you have to self-organize?
- Parametric vs. nonparametric — do you assume a distribution or structure, or do you let the data speak?
- Stationary vs. non-stationary — are the statistics of the series constant over time?
- Linear vs. nonlinear — can the dynamics or regression function be approximated linearly, or do you need a neural network or tree ensemble?
For example, k-means and GMM are unsupervised and parametric (they assume the number of clusters \(K\) and a covariance structure), Random Forest and GBDT are supervised and nonparametric, LSTM is supervised, nonlinear, and sequential, while Kalman-based anomaly detection is unsupervised, parametric, and linear (extended to nonlinearity with EKF/UKF). Each occupies a clearly distinct point in the 4-axis space, and hyperparameter selection across all of them is best handled with Bayesian optimization.
This hub gives you three selection axes, one feature comparison matrix, nine decision scenarios, and one Python evaluation framework that runs five methods on the same data, so you can mechanically narrow down to the right tool for your problem. Theoretical details are delegated to the per-method articles; what you get here is the map and the judgment.
Three Selection Axes
Axis 1: Supervised (LSTM / GBDT / Random Forest) vs. Unsupervised (k-means / GMM / Kalman anomaly)
If you have labels \(y\), the problem is supervised; otherwise it is unsupervised.
- Supervised with discrete labels → classification (Random Forest / GBDT, LSTM classification head)
- Supervised with continuous targets (especially future values of a series) → regression / forecasting (LSTM, GBDT)
- Unlabeled, looking for groups → clustering (k-means / GMM)
- Unlabeled, looking for deviations from normal → anomaly detection (Isolation Forest, One-Class SVM, Kalman residual test)
Anomaly detection often sits in the semi-supervised gray zone (train on normal data only), straddling the supervised/unsupervised line.
Axis 2: Sequential vs. i.i.d. samples
Whether samples can be treated as independent and identically distributed, or whether past observations determine the present, dictates which methods apply.
- i.i.d.: k-means / GMM, Random Forest / GBDT. Even time-series problems can usually be cast in this frame with careful feature engineering (lags, rolling statistics).
- Sequential (Markovian): LSTM, Kalman-based models. State transitions are captured explicitly via an internal state \(h_t\) .
In practice “throw lag features at GBDT first” is a strong default; if that is not enough, move on to LSTM or state-space models.
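The "lag features first" default can be sketched as below. This is a minimal illustration that assumes a 1-D NumPy array; the helper name `make_lag_table`, the lag set, and the rolling window length are hypothetical choices for the example, not part of any library.

```python
import numpy as np

def make_lag_table(y, lags=(1, 7, 30), window=7):
    """Turn a 1-D series into an i.i.d.-style table of lag and rolling-mean features.

    Row j describes time t = start + j and only uses values strictly before t,
    so no future information leaks into the features.
    """
    y = np.asarray(y, dtype=float)
    start = max(max(lags), window)           # first t with a full history
    t = np.arange(start, len(y))
    lag_cols = [y[t - k] for k in lags]      # y[t-1], y[t-7], y[t-30]
    # Rolling mean over y[t-window : t], computed from a cumulative sum
    csum = np.concatenate([[0.0], np.cumsum(y)])
    roll_mean = (csum[t] - csum[t - window]) / window
    X = np.column_stack(lag_cols + [roll_mean])
    return X, y[t]

y_demo = np.arange(100, dtype=float)         # toy series: y[t] = t
X_demo, target_demo = make_lag_table(y_demo)
print(X_demo.shape, target_demo.shape)       # (70, 4) (70,)
```

The resulting table can be passed to any tabular model; calendar features such as day of week would be appended as extra columns in the same way.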
Axis 3: Interpretability vs. expressive power (GBDT / RF vs. LSTM / deep)
How much does the decision need to be defensible?
- Highly interpretable: tree models (Random Forest / GBDT). Feature importance, SHAP, partial dependence make the model auditable.
- Moderately interpretable: GMM (per-cluster mean and covariance), Kalman (state-space variables with physical meaning).
- Low interpretability, high expressivity: LSTM, Transformer family. Absorb long-range and nonlinear interactions implicitly.
In healthcare, finance, and public-sector deployments, “high accuracy without explanation” is often not acceptable. The safe pattern is to establish a GBDT baseline first and move to neural nets only when its accuracy is insufficient.
Feature Comparison Matrix: Seven Methods, Six Columns
| Method | Category | Data requirement | Computational cost | Interpretability | Main use cases |
|---|---|---|---|---|---|
| k-means | unsupervised / distance | \(\sim 10^2\) + | \(O(NKd)\) / iter | high | customer segmentation, vector quant. |
| GMM | unsupervised / probabilistic | \(\sim 10^3\) + | \(O(NKd^2)\) / EM | medium | soft clustering, density estimation |
| Random Forest | supervised / bagging | \(\sim 10^3\) + | \(O(MND \log N)\) | high | tabular classification, importance |
| GBDT / XGBoost / LightGBM | supervised / boosting | \(\sim 10^3\) + | \(O(MND)\) histogram | high | Kaggle staple, time-series with lags |
| LSTM | supervised / RNN | \(\sim 10^4\) + | \(O(T H^2)\) / step | low | short–medium horizon, sequence labeling |
| Kalman filter | unsupervised / state-space | model required | \(O(T n^3)\) | medium | tracking, linear forecasting, residuals |
| Isolation Forest | unsupervised / tree | \(\sim 10^3\) + | \(O(M N \log N)\) | medium | point anomaly, outlier scoring |
- \(N\) samples, \(d\) feature dimension, \(K\) clusters, \(M\) trees, \(D\) tree depth, \(T\) sequence length, \(H\) LSTM hidden size, \(n\) state dimension
- “Data requirement” is the practical minimum for stable training. LSTM realistically needs a few thousand samples across tens of series.
- Interpretability ratings reflect the practical applicability of SHAP / partial dependence.
One look at this matrix prevents mismatches like “200 samples but reaching for LSTM” or “regulatory explanation required, but reaching for a neural net”.
Decision Scenarios: Nine Recurring Problems and Their Recommended Methods
Scenario 1: Customer segmentation (marketing, N=tens of thousands)
Features are low-dimensional continuous variables such as purchase frequency, average ticket, recency (RFM). Start with k-means to cut \(K=4\)–\(6\) interpretable segments quickly and visualize statistics per cluster. If boundaries blur, switch to GMM for soft assignment probabilities that quantify “which side” each customer leans toward. Choose \(K\) by combining the elbow method, the silhouette score, and (for GMM) BIC.
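A minimal sketch of the \(K\) sweep described above, assuming scikit-learn and a synthetic stand-in for RFM features (three well-separated 2-D blobs, not real customer data). The variable names are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Assumed toy data: three well-separated blobs standing in for scaled RFM features
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_seg = np.vstack([c + rng.normal(0, 0.5, size=(100, 2)) for c in centers])

sil, bic = {}, {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_seg)
    sil[k] = silhouette_score(X_seg, labels)            # higher is better
    gmm_k = GaussianMixture(n_components=k, random_state=0).fit(X_seg)
    bic[k] = gmm_k.bic(X_seg)                            # lower is better

best_k_sil = max(sil, key=sil.get)
best_k_bic = min(bic, key=bic.get)
print("best K by silhouette:", best_k_sil, "| best K by BIC:", best_k_bic)

# Soft assignment: per-sample membership probabilities from the chosen GMM
gmm = GaussianMixture(n_components=best_k_sil, random_state=0).fit(X_seg)
proba = gmm.predict_proba(X_seg)                         # shape (N, K), rows sum to 1
```

On real RFM data the features usually need standardizing first, and the two criteria will not always agree; when they disagree, the per-cluster statistics are what decide whether a segment is usable.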
Scenario 2: Point anomaly detection (sensors, real time)
To catch single spikes or outliers, Isolation Forest is the first choice: it scores anomalies by tree path length and trains in seconds for \(N=10^4\). Ensembling it with One-Class SVM or Local Outlier Factor (LOF) can improve robustness, and if a few labels exist, use them to validate the detection threshold.
Scenario 3: Sequence anomaly detection (equipment diagnostics, time-structured failures)
Anomalies with temporal structure (gradual drift, vanishing periodicity) cannot be caught by point detectors. Fit a state-space model to normal series, run a Kalman filter, and flag points where the squared Mahalanobis distance of the residual \(\nu_t = y_t - \hat{y}_t\) exceeds the \(\chi^2_{0.99}\) threshold, with degrees of freedom equal to the dimension of \(y_t\): \(\nu_t^\top S_t^{-1} \nu_t > \chi^2_{0.99}\). Here \(S_t\) is the innovation covariance. For nonlinear dynamics, use EKF / UKF, or replace the filter with the reconstruction error of an LSTM autoencoder.
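The residual test can be sketched for the simplest case, a scalar local-level model. The model, the noise levels \(Q\) and \(R\), and the injected anomaly are all assumptions made for the illustration; a real application would use the state-space model fitted to normal data.

```python
import numpy as np

# Assumed local-level model:
#   x_t = x_{t-1} + w_t,  w_t ~ N(0, Q);   y_t = x_t + v_t,  v_t ~ N(0, R)
rng = np.random.default_rng(1)
T, Q, R = 300, 0.01, 0.25
x_true = np.cumsum(rng.normal(0, np.sqrt(Q), T))
y_obs = x_true + rng.normal(0, np.sqrt(R), T)
y_obs[150] += 10.0                           # injected anomaly

CHI2_99_DF1 = 6.635                          # 0.99 quantile of chi-square, 1 degree of freedom

x_hat, P = y_obs[0], R                       # initialize from the first observation
flags = []
for t in range(1, T):
    P_pred = P + Q                           # predict (state transition is the identity)
    nu = y_obs[t] - x_hat                    # innovation nu_t = y_t - y_hat_t
    S = P_pred + R                           # innovation variance S_t
    if nu * nu / S > CHI2_99_DF1:            # scalar form of nu^T S^{-1} nu
        flags.append(t)
    K = P_pred / S                           # Kalman gain
    x_hat = x_hat + K * nu                   # update state estimate
    P = (1.0 - K) * P_pred                   # update estimate variance

print("flagged time steps:", flags)
```

Because the flagged observation is still used in the update, the steps right after a large spike may also be flagged; skipping or down-weighting the update for flagged points is a common design choice. For vector observations, the threshold uses as many degrees of freedom as the observation has dimensions.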
Scenario 4: Short-horizon forecasting (up to 10 steps, N=thousands)
For demand or sensor short-term futures, the “direct” approach with GBDT (LightGBM / XGBoost) plus lag features is a fast, accurate, interpretable workhorse. Just adding lag-1, lag-7, lag-30, rolling means, and one-hot day-of-week / month-of-year often beats ARIMA and naive LSTM. For skewed error profiles, quantile loss yields prediction intervals out of the box.
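The quantile-loss idea can be sketched as follows. This uses scikit-learn's `GradientBoostingRegressor` as a stand-in for LightGBM / XGBoost, on an assumed toy series; the lag count and model settings are illustrative, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 600
t = np.arange(n)
y = 2.0 * np.sin(2 * np.pi * t / 50) + rng.normal(0, 0.3, n)   # assumed toy series

lags = 10
X = np.array([y[i - lags:i] for i in range(lags, n)])          # last 10 values as features
target = y[lags:]
split = int(len(X) * 0.8)                                       # chronological split, no shuffling

def fit_quantile(alpha):
    """Fit one boosted model for the alpha-quantile of the next value."""
    model = GradientBoostingRegressor(loss="quantile", alpha=alpha, n_estimators=100,
                                      max_depth=3, min_samples_leaf=10, random_state=0)
    return model.fit(X[:split], target[:split])

lower = fit_quantile(0.1).predict(X[split:])
upper = fit_quantile(0.9).predict(X[split:])
coverage = np.mean((target[split:] >= lower) & (target[split:] <= upper))
print(f"empirical coverage of the nominal 80% interval: {coverage:.2f}")
```

Held-out coverage is typically somewhat below the nominal level, and the two quantile models are fitted independently, so the bounds can occasionally cross; both are worth checking before the intervals are reported.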
Scenario 5: Long-horizon forecasting (seasonality and trend, N=years of daily data)
The longer the horizon, the more the errors of a recursively applied LSTM accumulate. The usual recipes are STL decomposition followed by GBDT forecasting of the residual, or a Seq2Seq LSTM with attention that receives trend, seasonality, and holidays as exogenous features. Prophet-style additive models remain strong contenders that handle trend changepoints and holidays out of the box.
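The decompose-then-forecast pattern can be sketched in NumPy alone. This is a crude stand-in for STL (a least-squares line for the trend and per-phase averages for seasonality), with the period assumed known; it is meant to show the structure of the approach, not to replace a proper STL implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, period, horizon = 1000, 50, 100           # period is assumed known
t = np.arange(T)
y = 0.01 * t + 2.0 * np.sin(2 * np.pi * t / period) + rng.normal(0, 0.3, T)
train, test = y[:-horizon], y[-horizon:]
t_train, t_test = t[:-horizon], t[-horizon:]

# 1) Trend: least-squares line (STL would use a LOESS smoother instead)
slope, intercept = np.polyfit(t_train, train, deg=1)
detrended = train - (slope * t_train + intercept)

# 2) Seasonality: mean of the detrended series at each phase of the cycle
seasonal = np.array([detrended[t_train % period == p].mean() for p in range(period)])

# 3) Residual: the part a GBDT would be trained on; here it is forecast as zero
residual = detrended - seasonal[t_train % period]

forecast = slope * t_test + intercept + seasonal[t_test % period]
naive = np.full(horizon, train[-1])          # last-value baseline

mse_decomp = np.mean((test - forecast) ** 2)
mse_naive = np.mean((test - naive) ** 2)
print(f"decomposition MSE: {mse_decomp:.3f}  naive MSE: {mse_naive:.3f}")
```

Because trend and seasonality are extrapolated structurally, the error does not compound with the horizon the way a recursive one-step model's does; a residual model only has to explain what is left.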
Scenario 6: Classification with class imbalance (N=thousands, 1% positives)
Tree models are a strong default on imbalanced data: Random Forest / GBDT with class_weight="balanced" or scale_pos_weight. SMOTE-style oversampling raises overfitting risk, so start with class weighting and threshold tuning. Evaluate with PR-AUC / F1 / Matthews correlation, not accuracy.
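A minimal sketch of class weighting, threshold tuning, and imbalance-aware metrics, assuming scikit-learn and a synthetic dataset with roughly 1% positives. For brevity the threshold is tuned on the test split, which a real evaluation should not do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, matthews_corrcoef, precision_recall_curve
from sklearn.model_selection import train_test_split

# Assumed toy data: about 1% positives
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           weights=[0.99], flip_y=0.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# Threshold tuning: pick the cut-off that maximizes F1 along the precision-recall curve.
# (In practice, tune on a separate validation split rather than the test set.)
prec, rec, thr = precision_recall_curve(y_te, scores)
f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
best_thr = thr[np.argmax(f1[:-1])]           # the final (prec, rec) point has no threshold

pr_auc = average_precision_score(y_te, scores)
mcc = matthews_corrcoef(y_te, (scores >= best_thr).astype(int))
trivial_acc = np.mean(y_te == 0)             # accuracy of always predicting the majority class
print(f"PR-AUC={pr_auc:.3f}  MCC={mcc:.3f}  threshold={best_thr:.2f}  "
      f"trivial accuracy={trivial_acc:.3f}")
```

The last number is the reason accuracy is uninformative here: a model that never predicts the positive class already scores about 99%.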
Scenario 7: Feature importance and explainability
To show which variables drive predictions, Random Forest MDI / permutation importance and GBDT SHAP are the workhorses. MDI is biased toward high-cardinality categorical variables, so always combine with permutation. SHAP is heavier but reveals interactions. LSTM can be explained via Integrated Gradients, but if interpretability is a hard requirement, choose a tree model from the start.
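Comparing MDI with permutation importance can be sketched as below, assuming scikit-learn and a synthetic regression problem in which only the first two features matter. SHAP is left out because it requires an extra package.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
# Assumed ground truth: feature 0 dominates, feature 1 is weak, features 2 and 3 are noise
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 500)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[:400], y[:400])

mdi = rf.feature_importances_                # impurity-based, computed from the training data
perm = permutation_importance(rf, X[400:], y[400:], n_repeats=10, random_state=0)

for j in range(X.shape[1]):
    print(f"feature {j}: MDI={mdi[j]:.3f}  permutation={perm.importances_mean[j]:.3f}")
```

Permutation importance is measured on held-out rows, so a feature the forest merely memorized shows up with importance near zero, which MDI alone would not reveal.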
Scenario 8: Online learning (continuous data stream)
When batch retraining is too slow, options include the Kalman filter (closed-form sequential update), SGD-based linear models (sklearn.linear_model.SGDClassifier), Hoeffding trees / VFDT for streaming tree learning, and online fine-tuning of an LSTM. For linear state spaces, the Kalman filter converges fastest with minimal memory.
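The SGD-based option can be sketched with `partial_fit`, using a test-then-train loop on a simulated stream. The data generator and the labeling rule are assumptions for the illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def next_batch(size=100):
    """Simulated stream: label is 1 when x0 + x1 > 0 (assumed toy rule)."""
    X = rng.normal(size=(size, 2))
    return X, (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)
accuracies = []
for step in range(30):
    X_batch, y_batch = next_batch()
    if step > 0:
        # Test-then-train: score each batch before the model has seen it
        accuracies.append(clf.score(X_batch, y_batch))
    # classes must be given on the first call, since later batches may miss a class
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])

print(f"mean accuracy over the last 10 batches: {np.mean(accuracies[-10:]):.3f}")
```

Scoring before updating gives an honest running estimate of performance on unseen data, and a sustained drop in that estimate is a simple signal of drift.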
Scenario 9: Automatic hyperparameter search (Bayesian optimization)
As the number of hyperparameters grows (GBDT’s learning_rate / max_depth / num_leaves, LSTM’s hidden_size / layers / lookback, GMM’s \(K\)), grid search collapses. Bayesian optimization chooses each next evaluation point by maximizing an acquisition function (EI / UCB); Optuna, scikit-optimize, or Ax typically reach a practical optimum in 20–50 trials. TPE (Tree-structured Parzen Estimator), Optuna’s default sampler, is a common choice for tuning tree models.
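The acquisition-function loop can be sketched with a Gaussian process surrogate and expected improvement. This is a hand-rolled illustration using scikit-learn and SciPy, not the API of Optuna, scikit-optimize, or Ax; the one-dimensional `objective` is a hypothetical stand-in for a validation loss.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    """Hypothetical stand-in for a validation loss over one hyperparameter."""
    return (x - 2.0) ** 2

rng = np.random.default_rng(0)
candidates = np.linspace(-5.0, 5.0, 501).reshape(-1, 1)    # discretized search space
X_obs = rng.uniform(-5.0, 5.0, size=(4, 1))                # random initial trials
y_obs = objective(X_obs).ravel()

for _ in range(16):                                         # 20 trials in total
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True, random_state=0).fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)                         # avoid division by zero
    # Expected improvement (EI) for a minimization problem
    improvement = y_obs.min() - mu
    z = improvement / sigma
    ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)].reshape(1, 1)        # next point to evaluate
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next).ravel())

best_x = X_obs[np.argmin(y_obs), 0]
print(f"best x after {len(y_obs)} trials: {best_x:.2f} (true optimum at 2.0)")
```

In a real search, `objective` would train a model and return its validation loss, and the candidate grid would be replaced by an optimizer over the full hyperparameter space.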
Unified Python Evaluation Framework
Apply k-means / GMM / Random Forest / LSTM / Isolation Forest to one synthetic time series with trend + seasonality + noise + point anomalies, and compute forecast MSE, anomaly F1, and clustering silhouette side by side.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.ensemble import RandomForestRegressor, IsolationForest
from sklearn.metrics import mean_squared_error, f1_score, silhouette_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
# Synthetic series: trend + seasonality + noise + point anomalies
rng = np.random.default_rng(0)
T = 1000
t = np.arange(T)
trend = 0.01 * t
season = 2.0 * np.sin(2 * np.pi * t / 50)
noise = rng.normal(0, 0.3, T)
y = trend + season + noise
anomaly_idx = rng.choice(T, 20, replace=False)
y[anomaly_idx] += rng.normal(0, 5, 20)
is_anomaly = np.zeros(T, dtype=int)
is_anomaly[anomaly_idx] = 1
# Lag features (cast the series into an i.i.d.-like table)
L = 10
X = np.array([y[i-L:i] for i in range(L, T)])
y_target = y[L:]
# (1) k-means: segment assignment on lag windows
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
sil_km = silhouette_score(X, km.labels_)
# (2) GMM: hard labels from the fitted mixture, scored with silhouette
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
sil_gmm = silhouette_score(X, gmm.predict(X))
# (3) Random Forest: one-step-ahead forecast
split = int(len(X) * 0.8)
rf = RandomForestRegressor(n_estimators=200, max_depth=8, random_state=0)
rf.fit(X[:split], y_target[:split])
mse_rf = mean_squared_error(y_target[split:], rf.predict(X[split:]))
# (4) LSTM: one-step-ahead forecast on the same lag window
Xn = X.reshape(-1, L, 1)
lstm = Sequential([LSTM(32, input_shape=(L, 1)), Dense(1)])
lstm.compile(optimizer="adam", loss="mse")
lstm.fit(Xn[:split], y_target[:split], epochs=20, batch_size=32, verbose=0)
mse_lstm = mean_squared_error(y_target[split:], lstm.predict(Xn[split:], verbose=0).ravel())
# (5) Isolation Forest: point anomaly detection
iso = IsolationForest(contamination=0.02, random_state=0).fit(y.reshape(-1, 1))
pred_anom = (iso.predict(y.reshape(-1, 1)) == -1).astype(int)
f1_iso = f1_score(is_anomaly, pred_anom)
print(f"KMeans silhouette : {sil_km:.3f}")
print(f"GMM silhouette : {sil_gmm:.3f}")
print(f"RF forecast MSE : {mse_rf:.3f}")
print(f"LSTM forecast MSE : {mse_lstm:.3f}")
print(f"IsolationForest F1 : {f1_iso:.3f}")
Roughly fifty lines of code give you five methods and three metrics in one shot. To run it on your own data, swap the synthetic y for your array (and is_anomaly for your own labels, if you have them). You will see in your own numbers that LSTM does not automatically beat a tree ensemble (Random Forest here, standing in for GBDT), which is the kind of empirical grounding that prevents both over- and under-estimation of deep learning. Implementation details live in the per-method articles: k-means / GMM, ensembles, LSTM, and time-series anomaly detection.
Design Parameter Table
| Method | Key parameters | Recommended starting points |
|---|---|---|
| k-means | number of clusters \(K\) | sweep 2–10 with elbow / silhouette; use n_init=10 to avoid local minima |
| GMM | \(K\), covariance type | covariance_type: full (flexible) / tied / diag (high-dim) / spherical; minimize BIC |
| Random Forest | n_estimators, max_depth | 200–500 trees, depth 8–20; max_features="sqrt" for classification, 1.0 for regression |
| GBDT | learning_rate, num_leaves, n_estimators | LR 0.05, leaves 31, early stopping for tree count; lower LR × more trees ⇒ higher accuracy |
| LSTM | hidden size \(H\), layers, lookback | \(H=32\)–\(128\), 1–2 layers, lookback 1–2× the period; dropout 0.2, Adam LR 1e-3 |
| Kalman filter | process noise \(Q\), observation noise \(R\) | \(Q/R\) ratio controls tracking; estimate by EM or cross-validation |
| Isolation Forest | contamination | 1–3× the expected anomaly rate; tune by inspecting the score histogram |
Rules of thumb: GBDT improves as you lower learning_rate and add more trees (at the cost of compute); LSTM cannot capture seasonality if lookback is shorter than one period and tends to overfit if it is much longer; \(K\) and layer counts should be chosen mechanically via BIC or validation loss. All of these are amenable to Bayesian optimization for automated search.
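The table's note that the \(Q/R\) ratio controls tracking can be made concrete with a small sketch. It assumes a scalar local-level model (\(x_t = x_{t-1} + w_t\), \(y_t = x_t + v_t\)) and iterates the variance recursion until the gain settles; the function name `steady_state_gain` is illustrative.

```python
import numpy as np

def steady_state_gain(q_over_r, n_iter=500):
    """Steady-state Kalman gain of a scalar local-level model, with R fixed to 1."""
    Q, R, P = q_over_r, 1.0, 1.0
    for _ in range(n_iter):
        P_pred = P + Q                 # predict
        K = P_pred / (P_pred + R)      # gain
        P = (1.0 - K) * P_pred         # update
    return K

ratios = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
gains = [steady_state_gain(r) for r in ratios]
for r, k in zip(ratios, gains):
    print(f"Q/R = {r:g}  ->  steady-state gain K = {k:.3f}")
```

A small ratio yields a small gain, so the filter smooths heavily and reacts slowly; a large ratio pushes the gain toward 1, so the estimate follows each new observation closely.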
Related Reading
Entry points for deeper dives from this hub.
The four pillars
- k-means and GMM clustering — hard vs. soft clustering, the EM algorithm
- Ensemble learning (bagging / boosting / stacking) — RF / GBDT / XGBoost / LightGBM
- Time-series forecasting with LSTM — gating, vanishing gradients, Seq2Seq
- Time-series anomaly detection — point vs. sequence anomalies, state-space models
Classical time-series models
- ARIMA and SARIMA time-series forecasting — the linear baseline to benchmark LSTM against
- Kalman smoother (RTS) — fixed-interval smoothing for offline accuracy gains
Automated optimization and adjacent tools
- Bayesian optimization for hyperparameter search — directly applicable to GBDT and LSTM tuning
Signal processing hub (twin article)
- Time-frequency analysis hub (FFT / STFT / Wavelet / Hilbert) — feature engineering (spectrogram features) before feeding data into a model
- EMD, VMD, and SSA mode decomposition — decompose non-stationary signals into IMFs as ML features
- Digital Signal Processing and Machine Learning Roadmap — meta path tying the five hubs together
Conclusion
When applying machine learning to time-series, classification, or anomaly tasks, the cleanest path is: narrow down by the three axes (supervised/unsupervised × sequential dependence × interpretability), sanity-check against sample size and compute in the feature comparison matrix, and start from whichever decision scenario most closely matches your problem.
- Unlabeled, looking for groups → k-means / GMM
- Tabular classification or short–medium horizon forecasting with interpretability → Random Forest / GBDT
- Long-range dependence and nonlinearity → LSTM
- Sequence anomalies or state tracking → Kalman filter family / LSTM-AE
- Automatic hyperparameter search → Bayesian optimization
When in doubt, copy the unified Python evaluation script in this article, swap in your own data, and read MSE / F1 / silhouette across the five methods. From here, the next themes worth deepening are (a) Gaussian process regression for small-data forecasting with calibrated intervals, (b) Transformer-family time-series models (Informer, TimesNet), (c) deep state-space models that mix neural networks with state-space structure, and (d) the bridge to causal inference, where feature importance becomes intervention effect. Use this hub and the linked articles as a two-way map for finding the right tool for your problem.