LSTM for Time Series Forecasting: Theory and Python Implementation

LSTM (Long Short-Term Memory) tutorial covering forget/input/output gate mathematics, PyTorch implementation for time series forecasting, and comparison with ARIMA.

Introduction

ARIMA has been a classical workhorse for time series forecasting, but it struggles with nonlinear dependencies and long-range patterns. LSTM (Long Short-Term Memory) solves the vanishing gradient problem of vanilla RNNs by introducing a dedicated memory cell, enabling it to capture long-term temporal dependencies.

This article covers the mathematical foundations of LSTM gates and demonstrates practical time series forecasting with PyTorch.

The Limitation of RNNs: Vanishing Gradients

A standard RNN updates its hidden state as:

\[h_t = \tanh(W_h h_{t-1} + W_x x_t + b) \tag{1}\]

During backpropagation through time (BPTT), the gradient is multiplied by the recurrent Jacobian at every time step. When the eigenvalues of \(W_h\) are less than 1 in magnitude, gradients shrink exponentially with sequence length — the vanishing gradient problem.
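To see the effect numerically, here is a small sketch (not from the original article, with arbitrary dimensions) that repeatedly multiplies a gradient vector by a random recurrent matrix rescaled to spectral radius 0.9; the tanh Jacobian is omitted, which would only shrink the gradient further:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# Random recurrent weight matrix, rescaled so its spectral radius is 0.9 (< 1)
W = rng.standard_normal((d, d))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

grad = np.ones(d)  # stand-in for an upstream gradient
norms = []
for step in range(100):
    grad = W.T @ grad  # one BPTT step: multiply by the recurrent Jacobian
    norms.append(np.linalg.norm(grad))

print(f"after 10 steps: {norms[9]:.3e}, after 100 steps: {norms[99]:.3e}")
```

The gradient norm collapses by orders of magnitude over 100 steps, which is exactly why a vanilla RNN cannot learn long-range dependencies.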

LSTM introduces a cell state \(c_t\) that acts as a “memory highway,” allowing gradients to flow without repeated multiplication.

LSTM Architecture

An LSTM cell has three gates operating on the concatenated input \([h_{t-1}, x_t]\).

Forget Gate

Decides what information to discard from the previous cell state.

\[f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{2}\]

Values close to 0 mean “forget”; close to 1 mean “keep.”

Input Gate

Controls how much new information to write into the cell state.

\[i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{3}\]

\[\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \tag{4}\]

Cell State Update

Combines the forget and input gates to update the cell state.

\[c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{5}\]

where \(\odot\) denotes the Hadamard (element-wise) product. The additive update preserves gradient flow over long sequences.
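This is easy to check with autograd: because Eq. (5) is additive, \(\partial c_t / \partial c_{t-1}\) is just \(\mathrm{diag}(f_t)\), not a full weight matrix. A short illustrative sketch (hypothetical values, not from the article):

```python
import torch

torch.manual_seed(0)
d_h = 4
c_prev = torch.randn(d_h, requires_grad=True)
f = torch.sigmoid(torch.randn(d_h))        # forget gate activations
i = torch.sigmoid(torch.randn(d_h))        # input gate activations
c_tilde = torch.tanh(torch.randn(d_h))     # candidate cell values

c = f * c_prev + i * c_tilde               # Eq. (5)
c.sum().backward()

# The gradient flowing back through the cell state is exactly f_t:
print(torch.allclose(c_prev.grad, f))      # True
```

When the network keeps \(f_t\) near 1, the gradient passes through the cell state almost unchanged, step after step.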

Output Gate

Determines what to expose as the hidden state.

\[o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{6}\]

\[h_t = o_t \odot \tanh(c_t) \tag{7}\]
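Equations (2)–(7) can be collected into a single step function. The following NumPy sketch (illustrative dimensions, not the article's code) stacks the four gate matrices into one \(W\) of shape \(4 d_h \times (d_x + d_h)\), matching the counting used below:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Eqs. (2)-(7); W stacks the four gate
    matrices row-wise, b the four gate biases."""
    d_h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b   # shape (4*d_h,)
    f = sigmoid(z[0*d_h:1*d_h])                 # forget gate, Eq. (2)
    i = sigmoid(z[1*d_h:2*d_h])                 # input gate, Eq. (3)
    c_tilde = np.tanh(z[2*d_h:3*d_h])           # candidate, Eq. (4)
    o = sigmoid(z[3*d_h:4*d_h])                 # output gate, Eq. (6)
    c = f * c_prev + i * c_tilde                # cell update, Eq. (5)
    h = o * np.tanh(c)                          # hidden state, Eq. (7)
    return h, c

rng = np.random.default_rng(0)
d_x, d_h = 3, 5
W = rng.standard_normal((4 * d_h, d_x + d_h)) * 0.1
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(10):
    h, c = lstm_step(rng.standard_normal(d_x), h, c, W, b)
print(h.shape, c.shape)  # (5,) (5,)
```

Note that \(|h_t| < 1\) always holds, since \(h_t\) is a sigmoid-gated tanh.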

Parameter Count

For input dimension \(d_x\) and hidden dimension \(d_h\), one LSTM cell has:

\[4 \times (d_h \times (d_x + d_h) + d_h) \tag{8}\]

parameters — four weight matrices (\(W_f, W_i, W_c, W_o\)) each of size \(d_h \times (d_x + d_h)\) plus biases.
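One caveat when checking this against PyTorch: nn.LSTM stores two bias vectors per gate (b_ih and b_hh, for CuDNN compatibility), so its count is \(4(d_h(d_x + d_h) + 2 d_h)\) per layer, slightly more than Eq. (8). A quick check:

```python
import torch.nn as nn

d_x, d_h = 1, 64
lstm = nn.LSTM(input_size=d_x, hidden_size=d_h, num_layers=1)
n_params = sum(p.numel() for p in lstm.parameters())

# Eq. (8): one bias vector per gate
eq8 = 4 * (d_h * (d_x + d_h) + d_h)
# PyTorch: two bias vectors (b_ih and b_hh) per gate
pytorch_count = 4 * (d_h * (d_x + d_h) + 2 * d_h)

print(n_params, eq8, pytorch_count)  # 17152 16896 17152
```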

LSTM for Time Series Forecasting

The standard Seq2One setup feeds a window of \(T\) past values \(\{x_{t-T+1}, \ldots, x_t\}\) and predicts \(x_{t+1}\).

| Setup | Input | Output | Use case |
|---|---|---|---|
| Seq2One | sequence | single step | Next-step prediction |
| Seq2Seq | sequence | multi-step sequence | Multi-horizon forecast |
| Encoder-Decoder | variable-length | variable-length | Translation, anomaly detection |

Python Implementation (PyTorch)

Data Preparation

We use a noisy sine wave as a synthetic time series.

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt

# Synthetic time series
np.random.seed(42)
t = np.linspace(0, 8 * np.pi, 1000)
data = np.sin(t) + 0.2 * np.random.randn(len(t))

# Normalize
data_mean, data_std = data.mean(), data.std()
data_norm = (data - data_mean) / data_std

def create_sequences(data, seq_len):
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i:i + seq_len])
        y.append(data[i + seq_len])
    return np.array(X), np.array(y)

SEQ_LEN = 30
X, y = create_sequences(data_norm, SEQ_LEN)

split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Shape: [batch, seq_len, features]
X_train_t = torch.FloatTensor(X_train).unsqueeze(-1)
X_test_t  = torch.FloatTensor(X_test).unsqueeze(-1)
y_train_t = torch.FloatTensor(y_train).unsqueeze(-1)
y_test_t  = torch.FloatTensor(y_test).unsqueeze(-1)

train_loader = DataLoader(
    TensorDataset(X_train_t, y_train_t),
    batch_size=32, shuffle=True
)

Model Definition

class LSTMForecaster(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0.0
        )
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])  # last timestep

model = LSTMForecaster(hidden_size=64, num_layers=2)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

Training Loop

EPOCHS = 50
train_losses = []

for epoch in range(EPOCHS):
    model.train()
    epoch_loss = 0.0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        pred = model(X_batch)
        loss = criterion(pred, y_batch)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        epoch_loss += loss.item()
    train_losses.append(epoch_loss / len(train_loader))

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{EPOCHS}  Loss: {train_losses[-1]:.6f}")

Prediction and Evaluation

model.eval()
with torch.no_grad():
    pred_norm = model(X_test_t).numpy().squeeze()

pred = pred_norm * data_std + data_mean
true = y_test * data_std + data_mean

rmse = np.sqrt(np.mean((pred - true) ** 2))
print(f"RMSE: {rmse:.4f}")

plt.figure(figsize=(12, 4))
plt.plot(true, label="Ground Truth", alpha=0.7)
plt.plot(pred, label="LSTM Prediction", alpha=0.7)
plt.xlabel("Time Step")
plt.ylabel("Value")
plt.title("LSTM Time Series Forecasting")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

ARIMA vs LSTM Comparison

| Aspect | ARIMA | LSTM |
|---|---|---|
| Stationarity | Required (differencing helps) | Not required |
| Nonlinearity | No | Yes |
| Long-range dependencies | Weak (low-order ARIMA) | Strong (cell state) |
| Parameter count | Few (p, d, q) | Many (thousands) |
| Interpretability | High (coefficients have meaning) | Low (black box) |
| Data requirement | Works on small datasets | Needs substantial data |
| Training speed | Fast | Slow (GPU accelerates) |
| Overfitting risk | Low | High (dropout mitigates) |

Decision guide:

  • Small data, linear seasonality → ARIMA/SARIMA
  • Nonlinear patterns, multivariate, large data → LSTM
  • Real-time inference required → Lightweight LSTM or ARIMA

Multivariate Extension

To use multiple input features, simply change input_size:

# Example: predict temperature from [temp, humidity, pressure]
model = LSTMForecaster(input_size=3, hidden_size=128, num_layers=2)
# X_train_t: [batch, seq_len, 3]
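As a sanity check on the shapes, the forecaster's forward pass can be rebuilt inline (hypothetical batch of 8 windows, so the snippet stands alone):

```python
import torch
import torch.nn as nn

# Hypothetical multivariate batch: 8 windows of 30 timesteps, 3 features each
x = torch.randn(8, 30, 3)

lstm = nn.LSTM(input_size=3, hidden_size=128, num_layers=2, batch_first=True)
fc = nn.Linear(128, 1)

out_seq, _ = lstm(x)               # out_seq: [8, 30, 128]
pred = fc(out_seq[:, -1, :])       # last timestep -> scalar prediction
print(pred.shape)                  # torch.Size([8, 1])
```

The only structural change from the univariate model is the last dimension of the input tensor; the gate equations and parameter shapes scale with input_size automatically.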

References

  • Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
  • PyTorch Documentation: torch.nn.LSTM