Introduction
ARIMA has been a classical workhorse for time series forecasting, but it struggles with nonlinear dependencies and long-range patterns. LSTM (Long Short-Term Memory) solves the vanishing gradient problem of vanilla RNNs by introducing a dedicated memory cell, enabling it to capture long-term temporal dependencies.
This article covers the mathematical foundations of LSTM gates and demonstrates practical time series forecasting with PyTorch.
The Limitation of RNNs: Vanishing Gradients
A standard RNN updates its hidden state as:
\[h_t = \tanh(W_h h_{t-1} + W_x x_t + b) \tag{1}\]
During backpropagation through time (BPTT), gradients are multiplied by the recurrent weight matrix at every step. When its eigenvalues are less than 1 in magnitude, gradients shrink exponentially over long sequences: the vanishing gradient problem.
LSTM introduces a cell state \(c_t\) that acts as a “memory highway,” allowing gradients to flow without repeated multiplication.
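The shrinkage is easy to see numerically. As a rough sketch (not from the article), the snippet below runs a tanh recurrence forward, then backpropagates a gradient through it; with a recurrent matrix whose spectral norm is below 1, the gradient norm decays exponentially with depth:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 50
W = rng.standard_normal((d, d))
W *= 0.9 / np.linalg.svd(W, compute_uv=False)[0]  # scale spectral norm to 0.9

# Forward pass: h_t = tanh(W h_{t-1}), storing pre-activations for BPTT
h = rng.standard_normal(d) * 0.1
pre = []
for _ in range(T):
    a = W @ h
    pre.append(a)
    h = np.tanh(a)

# Backward pass: grad_{t-1} = W^T [(1 - tanh(a_t)^2) * grad_t]
grad = np.ones(d)
norms = []
for a in reversed(pre):
    grad = W.T @ ((1 - np.tanh(a) ** 2) * grad)
    norms.append(np.linalg.norm(grad))

print(f"after 1 step:   {norms[0]:.2e}")
print(f"after {T} steps: {norms[-1]:.2e}")  # orders of magnitude smaller
```

Each backward step multiplies the gradient by a matrix of norm at most 0.9 (and by tanh derivatives no larger than 1), so after 50 steps the signal is essentially gone.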
LSTM Architecture
An LSTM cell has three gates operating on the concatenated input \([h_{t-1}, x_t]\).
Forget Gate
Decides what information to discard from the previous cell state.
\[f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{2}\]
Values close to 0 mean “forget”; values close to 1 mean “keep.”
Input Gate
Controls how much new information to write into the cell state.
\[i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{3}\]
\[\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \tag{4}\]
Cell State Update
Combines the forget and input gates to update the cell state.
\[c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{5}\]
where \(\odot\) denotes the Hadamard (element-wise) product. The additive update preserves gradient flow over long sequences.
Output Gate
Determines what to expose as the hidden state.
\[o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{6}\]
\[h_t = o_t \odot \tanh(c_t) \tag{7}\]
Parameter Count
For input dimension \(d_x\) and hidden dimension \(d_h\), one LSTM cell has:
\[4 \times (d_h \times (d_x + d_h) + d_h) \tag{8}\]
parameters: four weight matrices (\(W_f, W_i, W_c, W_o\)), each of size \(d_h \times (d_x + d_h)\), plus a bias vector per gate.
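Equations (2)–(7) and the count in (8) can be checked directly in code. The sketch below runs a manual NumPy LSTM step and compares formula (8) against PyTorch's `nn.LSTM` (note that PyTorch stores two bias vectors per gate, `b_ih` and `b_hh`, so its raw count exceeds (8) by exactly \(4 d_h\)):

```python
import numpy as np
import torch.nn as nn

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

d_x, d_h = 3, 5
rng = np.random.default_rng(0)

# One weight matrix per gate, each (d_h, d_x + d_h), as in Eqs. (2)-(7)
W = {g: rng.standard_normal((d_h, d_x + d_h)) * 0.1 for g in "fico"}
b = {g: np.zeros(d_h) for g in "fico"}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])        # concatenated [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])         # forget gate, Eq. (2)
    i = sigmoid(W["i"] @ z + b["i"])         # input gate,  Eq. (3)
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate,   Eq. (4)
    c = f * c_prev + i * c_tilde             # cell update, Eq. (5)
    o = sigmoid(W["o"] @ z + b["o"])         # output gate, Eq. (6)
    h = o * np.tanh(c)                       # hidden,      Eq. (7)
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((10, d_x)):
    h, c = lstm_step(x_t, h, c)
print("h_t:", h.round(3))

# Parameter count, Eq. (8), vs. PyTorch's bookkeeping
formula = 4 * (d_h * (d_x + d_h) + d_h)
lstm = nn.LSTM(input_size=d_x, hidden_size=d_h)
torch_count = sum(p.numel() for p in lstm.parameters())
print(formula, torch_count)  # PyTorch's extra 4*d_h comes from the second bias
```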
LSTM for Time Series Forecasting
The standard Seq2One setup feeds a window of \(T\) past values \(\{x_{t-T+1}, \ldots, x_t\}\) and predicts \(x_{t+1}\).
| Setup | Input | Output | Use Case |
|---|---|---|---|
| Seq2One | sequence → 1 | single-step | Next-step prediction |
| Seq2Seq | sequence → seq | multi-step | Multi-horizon forecast |
| Encoder-Decoder | variable-length | variable-length | Translation, anomaly detection |
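For the Seq2Seq row, the windowing step simply emits a vector of future values instead of one. A minimal sketch (the `horizon` parameter name is illustrative, not from the article):

```python
import numpy as np

def create_seq2seq_windows(data, seq_len, horizon):
    """Window of seq_len past values -> target of the next `horizon` values."""
    X, y = [], []
    for i in range(len(data) - seq_len - horizon + 1):
        X.append(data[i:i + seq_len])
        y.append(data[i + seq_len:i + seq_len + horizon])
    return np.array(X), np.array(y)

series = np.arange(100, dtype=float)
X, y = create_seq2seq_windows(series, seq_len=30, horizon=5)
print(X.shape, y.shape)  # (66, 30) (66, 5)
```

The output head then predicts `horizon` values at once (e.g. `nn.Linear(hidden_size, horizon)`), instead of the single value used in the Seq2One code below.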
Python Implementation (PyTorch)
Data Preparation
We use a noisy sine wave as a synthetic time series.
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt
# Synthetic time series
np.random.seed(42)
t = np.linspace(0, 8 * np.pi, 1000)
data = np.sin(t) + 0.2 * np.random.randn(len(t))
# Normalize
data_mean, data_std = data.mean(), data.std()
data_norm = (data - data_mean) / data_std
def create_sequences(data, seq_len):
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i:i + seq_len])
        y.append(data[i + seq_len])
    return np.array(X), np.array(y)
SEQ_LEN = 30
X, y = create_sequences(data_norm, SEQ_LEN)
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
# Shape: [batch, seq_len, features]
X_train_t = torch.FloatTensor(X_train).unsqueeze(-1)
X_test_t = torch.FloatTensor(X_test).unsqueeze(-1)
y_train_t = torch.FloatTensor(y_train).unsqueeze(-1)
y_test_t = torch.FloatTensor(y_test).unsqueeze(-1)
train_loader = DataLoader(
    TensorDataset(X_train_t, y_train_t),
    batch_size=32, shuffle=True
)
Model Definition
class LSTMForecaster(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0.0
        )
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])  # last timestep
model = LSTMForecaster(hidden_size=64, num_layers=2)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
Training Loop
EPOCHS = 50
train_losses = []
for epoch in range(EPOCHS):
    model.train()
    epoch_loss = 0.0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        pred = model(X_batch)
        loss = criterion(pred, y_batch)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        epoch_loss += loss.item()
    train_losses.append(epoch_loss / len(train_loader))
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{EPOCHS} Loss: {train_losses[-1]:.6f}")
Prediction and Evaluation
model.eval()
with torch.no_grad():
    pred_norm = model(X_test_t).numpy().squeeze()
pred = pred_norm * data_std + data_mean
true = y_test * data_std + data_mean
rmse = np.sqrt(np.mean((pred - true) ** 2))
print(f"RMSE: {rmse:.4f}")
plt.figure(figsize=(12, 4))
plt.plot(true, label="Ground Truth", alpha=0.7)
plt.plot(pred, label="LSTM Prediction", alpha=0.7)
plt.xlabel("Time Step")
plt.ylabel("Value")
plt.title("LSTM Time Series Forecasting")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
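The model above predicts one step ahead; for multi-horizon forecasts, a common approach is recursive (closed-loop) prediction, feeding each prediction back in as the newest input. A self-contained sketch with a small untrained stand-in network (in the article's flow you would reuse the trained `model` and a window from the test set instead):

```python
import torch
import torch.nn as nn

class TinyForecaster(nn.Module):
    """Stand-in with the same interface as LSTMForecaster above."""
    def __init__(self, hidden_size=16):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])

@torch.no_grad()
def recursive_forecast(model, window, steps):
    """window: [1, seq_len, 1] tensor; returns a list of `steps` predictions."""
    model.eval()
    preds = []
    for _ in range(steps):
        next_val = model(window)  # [1, 1]
        preds.append(next_val.item())
        # Slide the window: drop the oldest value, append the new prediction
        window = torch.cat([window[:, 1:, :], next_val.unsqueeze(1)], dim=1)
    return preds

model = TinyForecaster()
window = torch.randn(1, 30, 1)
preds = recursive_forecast(model, window, steps=10)
print(len(preds))  # 10
```

Note that errors compound at each step, so recursive forecasts degrade with horizon; training a Seq2Seq head that predicts all steps at once avoids this at the cost of a larger output layer.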
ARIMA vs LSTM Comparison
| Aspect | ARIMA | LSTM |
|---|---|---|
| Stationarity | Required (differencing helps) | Not required |
| Nonlinearity | No | Yes |
| Long-range deps | Weak (low-order ARIMA) | Strong (cell state) |
| Parameter count | Few (p, d, q) | Many (thousands) |
| Interpretability | High (coefficient meaning) | Low (black box) |
| Data requirement | Small datasets OK | Needs substantial data |
| Training speed | Fast | Slow (GPU accelerates) |
| Overfitting risk | Low | High (Dropout mitigates) |
Decision guide:
- Small data, linear seasonality → ARIMA/SARIMA
- Nonlinear patterns, multivariate, large data → LSTM
- Real-time inference required → Lightweight LSTM or ARIMA
Multivariate Extension
To use multiple input features, simply change input_size:
# Example: predict temperature from [temp, humidity, pressure]
model = LSTMForecaster(input_size=3, hidden_size=128, num_layers=2)
# X_train_t: [batch, seq_len, 3]
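Building the multivariate input tensor follows the same sliding-window idea, keeping all feature columns in each window. A sketch with synthetic data (the three-feature layout mirrors the comment above; column names are illustrative):

```python
import numpy as np
import torch
import torch.nn as nn

# Synthetic multivariate series: columns = [temp, humidity, pressure]
rng = np.random.default_rng(0)
series = rng.standard_normal((500, 3))

SEQ_LEN, TARGET_COL = 30, 0  # predict the first column (temp)
X, y = [], []
for i in range(len(series) - SEQ_LEN):
    X.append(series[i:i + SEQ_LEN])            # [seq_len, 3]
    y.append(series[i + SEQ_LEN, TARGET_COL])  # next-step target value
X_t = torch.FloatTensor(np.array(X))           # [batch, seq_len, 3]
y_t = torch.FloatTensor(np.array(y)).unsqueeze(-1)

# Any Seq2One head works; a bare nn.LSTM + linear layer shown here
lstm = nn.LSTM(input_size=3, hidden_size=32, batch_first=True)
fc = nn.Linear(32, 1)
out, _ = lstm(X_t)
pred = fc(out[:, -1, :])
print(X_t.shape, pred.shape)
```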
Related Articles
- Time Series Forecasting Fundamentals: ARIMA Theory and Python Implementation — classical linear forecasting method; compare interpretability vs LSTM.
- Self-Attention Theory and Python Implementation — the attention mechanism that succeeded LSTM in sequence modeling.
- SGD and Adam Optimizer Theory and Python Implementation — details on the Adam optimizer used to train LSTM.
- Time Series Anomaly Detection: From Statistical Methods to Kalman Filters — anomaly detection approaches applicable alongside LSTM forecasting.
References
- Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
- PyTorch Documentation: torch.nn.LSTM