Introduction
ARIMA has been a classical workhorse for time series forecasting, but it struggles with nonlinear dependencies and long-range patterns. LSTM (Long Short-Term Memory) solves the vanishing gradient problem of vanilla RNNs by introducing a dedicated memory cell, enabling it to capture long-term temporal dependencies.
This article covers the mathematical foundations of LSTM gates and demonstrates practical time series forecasting with PyTorch.
The Limitation of RNNs: Vanishing Gradients
A standard RNN updates its hidden state as:
\[h_t = \tanh(W_h h_{t-1} + W_x x_t + b) \tag{1}\]
During backpropagation through time (BPTT), gradients are multiplied by the recurrent weight matrix at every step. When its eigenvalues are less than 1 in magnitude, gradients shrink exponentially over long sequences: the vanishing gradient problem.
LSTM introduces a cell state \(c_t\) that acts as a “memory highway,” allowing gradients to flow without repeated multiplication.
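The shrinkage is easy to see numerically. As a rough sketch (not from the article), the snippet below runs a tanh recurrence forward, then backpropagates a gradient through it; with a recurrent matrix whose spectral norm is below 1, the gradient norm decays exponentially with depth:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 50
W = rng.standard_normal((d, d))
W *= 0.9 / np.linalg.svd(W, compute_uv=False)[0]  # scale spectral norm to 0.9

# Forward pass: h_t = tanh(W h_{t-1}), storing pre-activations for BPTT
h = rng.standard_normal(d) * 0.1
pre = []
for _ in range(T):
    a = W @ h
    pre.append(a)
    h = np.tanh(a)

# Backward pass: grad_{t-1} = W^T [(1 - tanh(a_t)^2) * grad_t]
grad = np.ones(d)
norms = []
for a in reversed(pre):
    grad = W.T @ ((1 - np.tanh(a) ** 2) * grad)
    norms.append(np.linalg.norm(grad))

print(f"after 1 step:   {norms[0]:.2e}")
print(f"after {T} steps: {norms[-1]:.2e}")  # orders of magnitude smaller
```

Each backward step multiplies the gradient by a matrix of norm at most 0.9 (and by tanh derivatives no larger than 1), so after 50 steps the signal is essentially gone.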
LSTM Architecture
An LSTM cell has three gates operating on the concatenated input \([h_{t-1}, x_t]\).
Forget Gate
Decides what information to discard from the previous cell state.
\[f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{2}\]
Values close to 0 mean “forget”; values close to 1 mean “keep.”
Input Gate
Controls how much new information to write into the cell state.
\[i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{3}\]
\[\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \tag{4}\]
Cell State Update
Combines the forget and input gates to update the cell state.
\[c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{5}\]
where \(\odot\) denotes the Hadamard (element-wise) product. The additive update preserves gradient flow over long sequences.
Output Gate
Determines what to expose as the hidden state.
\[o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{6}\]
\[h_t = o_t \odot \tanh(c_t) \tag{7}\]
Parameter Count
For input dimension \(d_x\) and hidden dimension \(d_h\), one LSTM cell has:
\[4 \times (d_h \times (d_x + d_h) + d_h) \tag{8}\]
parameters: four weight matrices (\(W_f, W_i, W_c, W_o\)), each of size \(d_h \times (d_x + d_h)\), plus a bias vector per gate.
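Equations (2)–(7) and the count in (8) can be checked directly in code. The sketch below runs a manual NumPy LSTM step and compares formula (8) against PyTorch's `nn.LSTM` (note that PyTorch stores two bias vectors per gate, `b_ih` and `b_hh`, so its raw count exceeds (8) by exactly \(4 d_h\)):

```python
import numpy as np
import torch.nn as nn

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

d_x, d_h = 3, 5
rng = np.random.default_rng(0)

# One weight matrix per gate, each (d_h, d_x + d_h), as in Eqs. (2)-(7)
W = {g: rng.standard_normal((d_h, d_x + d_h)) * 0.1 for g in "fico"}
b = {g: np.zeros(d_h) for g in "fico"}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])        # concatenated [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])         # forget gate, Eq. (2)
    i = sigmoid(W["i"] @ z + b["i"])         # input gate,  Eq. (3)
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate,   Eq. (4)
    c = f * c_prev + i * c_tilde             # cell update, Eq. (5)
    o = sigmoid(W["o"] @ z + b["o"])         # output gate, Eq. (6)
    h = o * np.tanh(c)                       # hidden,      Eq. (7)
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((10, d_x)):
    h, c = lstm_step(x_t, h, c)
print("h_t:", h.round(3))

# Parameter count, Eq. (8), vs. PyTorch's bookkeeping
formula = 4 * (d_h * (d_x + d_h) + d_h)
lstm = nn.LSTM(input_size=d_x, hidden_size=d_h)
torch_count = sum(p.numel() for p in lstm.parameters())
print(formula, torch_count)  # PyTorch's extra 4*d_h comes from the second bias
```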
LSTM for Time Series Forecasting
The standard Seq2One setup feeds a window of \(T\) past values \(\{x_{t-T+1}, \ldots, x_t\}\) and predicts \(x_{t+1}\).
| Setup | Input | Output | Use Case |
|---|---|---|---|
| Seq2One | sequence → 1 | single-step | Next-step prediction |
| Seq2Seq | sequence → seq | multi-step | Multi-horizon forecast |
| Encoder-Decoder | variable-length | variable-length | Translation, anomaly detection |
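For the Seq2Seq row, the windowing step simply emits a vector of future values instead of one. A minimal sketch (the `horizon` parameter name is illustrative, not from the article):

```python
import numpy as np

def create_seq2seq_windows(data, seq_len, horizon):
    """Window of seq_len past values -> target of the next `horizon` values."""
    X, y = [], []
    for i in range(len(data) - seq_len - horizon + 1):
        X.append(data[i:i + seq_len])
        y.append(data[i + seq_len:i + seq_len + horizon])
    return np.array(X), np.array(y)

series = np.arange(100, dtype=float)
X, y = create_seq2seq_windows(series, seq_len=30, horizon=5)
print(X.shape, y.shape)  # (66, 30) (66, 5)
```

The output head then predicts `horizon` values at once (e.g. `nn.Linear(hidden_size, horizon)`), instead of the single value used in the Seq2One code below.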
Python Implementation (PyTorch)
Data Preparation
We use a noisy sine wave as a synthetic time series.
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt
# Synthetic time series
np.random.seed(42)
t = np.linspace(0, 8 * np.pi, 1000)
data = np.sin(t) + 0.2 * np.random.randn(len(t))
# Normalize
data_mean, data_std = data.mean(), data.std()
data_norm = (data - data_mean) / data_std
def create_sequences(data, seq_len):
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i:i + seq_len])
        y.append(data[i + seq_len])
    return np.array(X), np.array(y)
SEQ_LEN = 30
X, y = create_sequences(data_norm, SEQ_LEN)
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
# Shape: [batch, seq_len, features]
X_train_t = torch.FloatTensor(X_train).unsqueeze(-1)
X_test_t = torch.FloatTensor(X_test).unsqueeze(-1)
y_train_t = torch.FloatTensor(y_train).unsqueeze(-1)
y_test_t = torch.FloatTensor(y_test).unsqueeze(-1)
train_loader = DataLoader(
    TensorDataset(X_train_t, y_train_t),
    batch_size=32, shuffle=True
)
Model Definition
class LSTMForecaster(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0.0
        )
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])  # last timestep
model = LSTMForecaster(hidden_size=64, num_layers=2)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
Training Loop
EPOCHS = 50
train_losses = []
for epoch in range(EPOCHS):
    model.train()
    epoch_loss = 0.0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        pred = model(X_batch)
        loss = criterion(pred, y_batch)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        epoch_loss += loss.item()
    train_losses.append(epoch_loss / len(train_loader))
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{EPOCHS} Loss: {train_losses[-1]:.6f}")
Prediction and Evaluation
model.eval()
with torch.no_grad():
    pred_norm = model(X_test_t).numpy().squeeze()
pred = pred_norm * data_std + data_mean
true = y_test * data_std + data_mean
rmse = np.sqrt(np.mean((pred - true) ** 2))
print(f"RMSE: {rmse:.4f}")
plt.figure(figsize=(12, 4))
plt.plot(true, label="Ground Truth", alpha=0.7)
plt.plot(pred, label="LSTM Prediction", alpha=0.7)
plt.xlabel("Time Step")
plt.ylabel("Value")
plt.title("LSTM Time Series Forecasting")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
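The model above predicts one step ahead; for multi-horizon forecasts, a common approach is recursive (closed-loop) prediction, feeding each prediction back in as the newest input. A self-contained sketch with a small untrained stand-in network (in the article's flow you would reuse the trained `model` and a window from the test set instead):

```python
import torch
import torch.nn as nn

class TinyForecaster(nn.Module):
    """Stand-in with the same interface as LSTMForecaster above."""
    def __init__(self, hidden_size=16):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])

@torch.no_grad()
def recursive_forecast(model, window, steps):
    """window: [1, seq_len, 1] tensor; returns a list of `steps` predictions."""
    model.eval()
    preds = []
    for _ in range(steps):
        next_val = model(window)  # [1, 1]
        preds.append(next_val.item())
        # Slide the window: drop the oldest value, append the new prediction
        window = torch.cat([window[:, 1:, :], next_val.unsqueeze(1)], dim=1)
    return preds

model = TinyForecaster()
window = torch.randn(1, 30, 1)
preds = recursive_forecast(model, window, steps=10)
print(len(preds))  # 10
```

Note that errors compound at each step, so recursive forecasts degrade with horizon; training a Seq2Seq head that predicts all steps at once avoids this at the cost of a larger output layer.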
ARIMA vs LSTM Comparison
| Aspect | ARIMA | LSTM |
|---|---|---|
| Stationarity | Required (differencing helps) | Not required |
| Nonlinearity | No | Yes |
| Long-range deps | Weak (low-order ARIMA) | Strong (cell state) |
| Parameter count | Few (p, d, q) | Many (thousands) |
| Interpretability | High (coefficient meaning) | Low (black box) |
| Data requirement | Small datasets OK | Needs substantial data |
| Training speed | Fast | Slow (GPU accelerates) |
| Overfitting risk | Low | High (Dropout mitigates) |
Decision guide:
- Small data, linear seasonality → ARIMA/SARIMA
- Nonlinear patterns, multivariate, large data → LSTM
- Real-time inference required → Lightweight LSTM or ARIMA
Multivariate Extension
To use multiple input features, simply change input_size:
# Example: predict temperature from [temp, humidity, pressure]
model = LSTMForecaster(input_size=3, hidden_size=128, num_layers=2)
# X_train_t: [batch, seq_len, 3]
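Building the multivariate input tensor follows the same sliding-window idea, keeping all feature columns in each window. A sketch with synthetic data (the three-feature layout mirrors the comment above; column names are illustrative):

```python
import numpy as np
import torch
import torch.nn as nn

# Synthetic multivariate series: columns = [temp, humidity, pressure]
rng = np.random.default_rng(0)
series = rng.standard_normal((500, 3))

SEQ_LEN, TARGET_COL = 30, 0  # predict the first column (temp)
X, y = [], []
for i in range(len(series) - SEQ_LEN):
    X.append(series[i:i + SEQ_LEN])            # [seq_len, 3]
    y.append(series[i + SEQ_LEN, TARGET_COL])  # next-step target value
X_t = torch.FloatTensor(np.array(X))           # [batch, seq_len, 3]
y_t = torch.FloatTensor(np.array(y)).unsqueeze(-1)

# Any Seq2One head works; a bare nn.LSTM + linear layer shown here
lstm = nn.LSTM(input_size=3, hidden_size=32, batch_first=True)
fc = nn.Linear(32, 1)
out, _ = lstm(X_t)
pred = fc(out[:, -1, :])
print(X_t.shape, pred.shape)
```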
Related Articles
- Time Series Forecasting Fundamentals: ARIMA Theory and Python Implementation — classical linear forecasting method; compare interpretability vs LSTM.
- Self-Attention Theory and Python Implementation — the attention mechanism that succeeded LSTM in sequence modeling.
- SGD and Adam Optimizer Theory and Python Implementation — details on the Adam optimizer used to train LSTM.
- Time Series Anomaly Detection: From Statistical Methods to Kalman Filters — anomaly detection approaches applicable alongside LSTM forecasting.
References
- Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
- PyTorch Documentation: torch.nn.LSTM