Multi-Step Time Series Forecasting: Recursive, Direct, and Hybrid Strategies

LDS Team · Let's Data Science

Predicting what happens tomorrow is useful, but predicting what happens next week, next month, or next quarter is where the real business value lies. Supply chain managers don't order inventory one day at a time; they plan weeks in advance. Energy grids need 24-hour horizon forecasts, not just the next hour.

This is the domain of Multi-Step Forecasting.

While single-step forecasting is the default for most tutorials, extending it to multiple future time steps ($h > 1$) introduces complex trade-offs between error accumulation and model complexity. Should you reuse one model iteratively? Build ten different models for ten days? Or use a complex neural network that predicts everything at once?

This guide breaks down the mathematical engines, Python implementations, and strategic trade-offs of the three dominant multi-step forecasting architectures.

What is the multi-step forecasting problem?

The multi-step forecasting problem involves predicting a sequence of future values $[y_{t+1}, y_{t+2}, \dots, y_{t+h}]$ given a historical sequence $[y_t, y_{t-1}, \dots, y_{t-n}]$. Unlike single-step forecasting, which outputs a scalar, multi-step forecasting must account for the dependencies between future time steps and the propagation of errors over the forecast horizon.

In a standard single-step regression setup, we map input features $X$ to a target $y$:

$y_{t+1} = f(y_t, y_{t-1}, \dots, y_{t-n}) + \epsilon$

In multi-step settings, we need to solve for a horizon $H$. There are four primary strategies to handle this, each with distinct mathematical properties:

  1. Recursive Strategy (Iterative)
  2. Direct Strategy (Independent)
  3. Multi-Output Strategy (Vector)
  4. Hybrid Strategies (Direct-Recursive)

💡 Pro Tip: Before attempting multi-step forecasting, ensure your series is stationary or properly differenced. Trends and seasonality wreak havoc on long horizons. We cover this in depth in Time Series Forecasting: Mastering Trends, Seasonality, and Stationarity.


How does the Recursive Strategy work?

The Recursive (or Iterative) strategy trains a single model $f$ to predict one step ahead. To forecast multiple steps, we feed the model's prediction back into itself as an input for the next step.

The Algorithm

  1. Train model $f$ to predict $y_{t+1}$ based on history.
  2. Predict $\hat{y}_{t+1}$.
  3. Append $\hat{y}_{t+1}$ to the history.
  4. Use the updated history to predict $\hat{y}_{t+2}$.
  5. Repeat $H$ times.

The Math

For a horizon $h=2$, the forecast looks like this:

$\hat{y}_{t+1} = f(y_t, y_{t-1}, \dots)$
$\hat{y}_{t+2} = f(\hat{y}_{t+1}, y_t, \dots)$

In Plain English: The recursive strategy is like driving a car in heavy fog. You predict where the road is one second ahead, drive there, and then predict the next second based on your new position. If your first prediction is slightly off, you are now starting the second prediction from the wrong spot.

Python Implementation (Recursive)

We can build a simple recursive forecaster using XGBoost.

python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic sine wave data
t = np.linspace(0, 50, 500)
data = np.sin(t) + np.random.normal(0, 0.1, 500)

# Create lag features (Window size = 10)
def create_lags(data, window_size):
    X, y = [], []
    for i in range(len(data) - window_size):
        X.append(data[i:i+window_size])
        y.append(data[i+window_size])
    return np.array(X), np.array(y)

window_size = 10
X, y = create_lags(data, window_size)

# Split Train/Test
train_size = int(len(X) * 0.8)
X_train, y_train = X[:train_size], y[:train_size]
X_test, y_test = X[train_size:], y[train_size:]

# Train the ONE-STEP model
model = XGBRegressor(n_estimators=100, objective='reg:squarederror')
model.fit(X_train, y_train)

# Recursive Forecast Logic
def recursive_forecast(model, initial_sequence, horizon, window_size):
    current_sequence = initial_sequence.copy()
    predictions = []
    
    for _ in range(horizon):
        # Reshape for prediction (1, window_size)
        pred_input = current_sequence[-window_size:].reshape(1, -1)
        pred = model.predict(pred_input)[0]
        
        predictions.append(pred)
        # Append prediction to sequence to use for next step
        current_sequence = np.append(current_sequence, pred)
        
    return np.array(predictions)

# Test on the first sample of the test set
initial_seq = X_test[0] # The last known real values
horizon = 20
forecasts = recursive_forecast(model, initial_seq, horizon, window_size)

print(f"First 5 recursive predictions: {forecasts[:5]}")
# Output will show the predicted values drifting over time

The Fatal Flaw: Error Propagation

The defining characteristic of the recursive strategy is Error Accumulation. Since $\hat{y}_{t+2}$ depends on $\hat{y}_{t+1}$, any error in the first step is carried over and amplified.

If $\epsilon$ is the error at step 1: $\hat{y}_{t+2} = f(y_{t+1} + \epsilon, \dots)$

This often causes recursive forecasts to degrade rapidly over long horizons, eventually converging to the mean or a flat line.
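
We can see this drift quantitatively by comparing the recursive forecasts against the actual future values held in y_test. This is a quick sketch reusing the variables defined above; exact numbers will vary with the random noise.

python
# Per-step absolute error of the recursive forecast.
# y_test[:horizon] holds the true values that immediately follow initial_seq.
actual_future = y_test[:horizon]
abs_errors = np.abs(forecasts - actual_future)

for step in (1, 5, 10, 20):
    print(f"Step {step:2d} absolute error: {abs_errors[step - 1]:.3f}")
# Later steps typically show larger errors as the model feeds on its own mistakes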


How does the Direct Strategy differ?

The Direct strategy handles the horizon problem by training separate models for each specific time step. If you need to forecast 7 days out, you train 7 distinct models.

The Algorithm

  1. Model $f_1$ learns to predict $y_{t+1}$ using $[y_t, y_{t-1}, \dots]$.
  2. Model $f_2$ learns to predict $y_{t+2}$ using $[y_t, y_{t-1}, \dots]$ (skipping $y_{t+1}$).
  3. Model $f_H$ learns to predict $y_{t+H}$ using $[y_t, y_{t-1}, \dots]$.

The Math

$\hat{y}_{t+h} = f_h(y_t, y_{t-1}, \dots)$

In Plain English: The direct strategy is like hiring a team of specialists. One person is an expert at predicting tomorrow's weather. Another person is an expert at predicting the weather specifically for next Tuesday, using only today's data. They don't talk to each other; they just give you their independent answers.

Python Implementation (Direct)

Here, we utilize MultiOutputRegressor from Scikit-Learn, which essentially wraps the Direct strategy logic (training one regressor per target).

python
from sklearn.multioutput import MultiOutputRegressor

# Prepare data for Direct Strategy
# We need y to be a matrix of shape (samples, horizon)
def create_direct_xy(data, window_size, horizon):
    X, y = [], []
    for i in range(len(data) - window_size - horizon + 1):
        X.append(data[i:i+window_size])
        y.append(data[i+window_size : i+window_size+horizon])
    return np.array(X), np.array(y)

horizon = 20
X_direct, y_direct = create_direct_xy(data, window_size, horizon)

# Split
train_size = int(len(X_direct) * 0.8)
X_train_dir, y_train_dir = X_direct[:train_size], y_direct[:train_size]
X_test_dir, y_test_dir = X_direct[train_size:], y_direct[train_size:]

# Train 20 separate models (one per step)
direct_model = MultiOutputRegressor(XGBRegressor(n_estimators=100))
direct_model.fit(X_train_dir, y_train_dir)

# Predict
# No iterative loop needed at inference; each horizon step has its own fitted model
predictions_direct = direct_model.predict(X_test_dir[0].reshape(1, -1))

print(f"Direct predictions shape: {predictions_direct.shape}")
# Output: (1, 20)
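
Under the hood, MultiOutputRegressor stores the fitted per-step regressors in its estimators_ attribute, so you can confirm that the Direct strategy really did train one model per horizon step. A quick sketch reusing the objects above:

python
# MultiOutputRegressor keeps one fitted regressor per forecast step
print(f"Number of underlying models: {len(direct_model.estimators_)}")  # 20

# Each estimator can also be used on its own, e.g. the model specialised for step 10
step_10_model = direct_model.estimators_[9]
step_10_pred = step_10_model.predict(X_test_dir[0].reshape(1, -1))
print(f"Step-10 prediction from its dedicated model: {step_10_pred[0]:.3f}")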

Pros and Cons

  • Pros: No error propagation! The prediction for step 10 does not depend on the potentially wrong prediction for step 9.
  • Cons: It ignores dependencies between future steps (e.g., if tomorrow is hot, the day after is likely hot). It is also computationally expensive (training $H$ models) and can have higher variance because $f_H$ tries to predict far into the future using old data.

What is the Multi-Output Strategy?

The Multi-Output (or Vector Output) strategy uses a single model that outputs the entire forecast sequence vector $[y_{t+1}, \dots, y_{t+H}]$ simultaneously. This is commonly seen in Deep Learning (LSTMs, Transformers) but can also be done with algorithms like K-Nearest Neighbors.

The Math

$[\hat{y}_{t+1}, \dots, \hat{y}_{t+H}] = f(y_t, y_{t-1}, \dots)$

This looks similar to the Direct strategy in terms of inputs, but the internal weights are shared. The model learns the correlations between $y_{t+1}$ and $y_{t+2}$ during training.

In Plain English: This is the "shotgun" approach. Instead of firing one bullet at a time (Recursive) or using 10 different guns (Direct), you use one model that fires a spread of 10 predictions at once. The model understands that the pellets (predictions) should travel together in a certain shape.

We covered this architecture extensively in Mastering LSTMs for Time Series: When Deep Learning Beats Statistics, where the final dense layer has $H$ neurons.
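
XGBoost itself emits one scalar per estimator, but several scikit-learn models accept a matrix target natively. Below is a minimal multi-output sketch using KNeighborsRegressor on the windowed matrices already built for the Direct example (the neighbor count is an arbitrary illustration, not a tuned value):

python
from sklearn.neighbors import KNeighborsRegressor

# A single model that maps one input window to the full 20-step output vector
multi_output_model = KNeighborsRegressor(n_neighbors=5)
multi_output_model.fit(X_train_dir, y_train_dir)  # y has shape (samples, horizon)

# One call returns the whole forecast sequence at once
sequence_forecast = multi_output_model.predict(X_test_dir[0].reshape(1, -1))
print(f"Multi-output forecast shape: {sequence_forecast.shape}")  # (1, 20)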

When to use Multi-Output?

This strategy is ideal when there are strong dependencies between the future time steps. For example, in temperature forecasting, the curve of the temperature throughout the day follows a physics-based shape. A multi-output model can learn this "shape," whereas a Direct strategy might predict a jagged, unrealistic line.


Which strategy should you choose?

The choice depends on your data volume, forecast horizon, and the "cost" of error accumulation versus model variance.

| Feature | Recursive | Direct | Multi-Output |
| --- | --- | --- | --- |
| Model Count | 1 model | $H$ models | 1 model |
| Error Pattern | Accumulates over time (bias) | Independent (variance) | Balanced |
| Dependencies | Captures sequential structure | Ignores forecast dependencies | Captures structure via shared weights |
| Computation | Fast training, slow inference | Slow training, fast inference | Fast training, fast inference |
| Best For | Short horizons, simple patterns | Long horizons, complex seasonality | Neural networks, structured outputs |

🔑 Key Insight: A common "Pro" move is the Direct-Recursive Hybrid. You train separate models for different horizons (Direct), but include the predictions of shorter horizons as inputs for the longer horizons (Recursive). This reduces variance while keeping some dependency structure.
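
Below is a minimal sketch of that Direct-Recursive idea for the first two horizon steps, reusing the windowed matrices from the Direct example above (the feature layout is one illustrative design, not a canonical recipe):

python
from xgboost import XGBRegressor

# Step 1 model: predicts t+1 from the lag window (pure Direct)
f1 = XGBRegressor(n_estimators=100)
f1.fit(X_train_dir, y_train_dir[:, 0])

# Step 2 model: predicts t+2 from the lag window PLUS f1's prediction (Recursive flavour)
f1_train_preds = f1.predict(X_train_dir).reshape(-1, 1)
X_train_hybrid = np.hstack([X_train_dir, f1_train_preds])

f2 = XGBRegressor(n_estimators=100)
f2.fit(X_train_hybrid, y_train_dir[:, 1])

# Inference: chain the two models
x_new = X_test_dir[0].reshape(1, -1)
pred_1 = f1.predict(x_new).reshape(-1, 1)
pred_2 = f2.predict(np.hstack([x_new, pred_1]))
print(f"t+1: {pred_1[0, 0]:.3f}, t+2: {pred_2[0]:.3f}")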


Common Pitfalls in Multi-Step Forecasting

1. The "Flat Line" Forecast

Newcomers often find that their recursive forecasts (especially with ARIMA or simple RNNs) quickly converge to a straight line or the mean of the data after a few steps.

Why it happens: This is usually due to "mean reversion" in stationary models. If your model doesn't explicitly capture trend or seasonality (or if the window size is too short to "see" the cycle), the safest statistical guess for $t+\infty$ is the dataset's average.

The Fix:

  • Ensure seasonality is modeled explicitly, e.g., with Facebook Prophet or by adding seasonal features (see the sketch after this list).
  • Use the Direct strategy, which forces the model to learn the specific value for $t+10$ rather than iterating its way there.
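
For tree-based or linear forecasters, one lightweight way to model seasonality explicitly is to add calendar and seasonal-lag features next to the raw lag window. A sketch, assuming daily data with weekly seasonality (the date range and column names are purely illustrative):

python
import pandas as pd

# Hypothetical daily index attached to the synthetic series used earlier
df = pd.DataFrame({"y": data},
                  index=pd.date_range("2022-01-01", periods=len(data), freq="D"))

# Seasonal/calendar features let the model "see" the cycle beyond a short lag window
df["dayofweek"] = df.index.dayofweek   # 0 = Monday ... 6 = Sunday
df["lag_1"] = df["y"].shift(1)         # yesterday
df["lag_7"] = df["y"].shift(7)         # same weekday last week

features = df.dropna()[["lag_1", "lag_7", "dayofweek"]]
target = df.dropna()["y"]
# These columns can now feed the Recursive or Direct setups shown above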

2. Leaking Future Information

In the Direct strategy, creating the $X$ and $y$ matrices can be tricky. A common bug is accidentally including data from $t+1$ in the input features for the model predicting $t+2$.

The Check: Always verify your timestamps. For a model predicting $y_{t+h}$, the latest allowed input is $y_t$.
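
A cheap programmatic check is to compare the index (or timestamp) of the last input observation against the indices of the targets each model is trained on. A small sketch with a hypothetical check_no_leakage helper:

python
# Hypothetical helper: the last input index must be strictly smaller than every target index
def check_no_leakage(last_input_idx, target_indices):
    assert all(last_input_idx < t for t in target_indices), \
        "Input window overlaps the forecast horizon -- future information is leaking"

# Window covering indices 0..9, forecasting indices 10..29: fine
check_no_leakage(9, range(10, 30))

# Window covering indices 0..10, forecasting indices 10..29: index 10 is both input and target
# check_no_leakage(10, range(10, 30))  # would raise an AssertionError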

3. Improper Validation Scheme

Using standard K-Fold Cross-Validation breaks the temporal order of time series. Even standard TimeSeriesSplit is insufficient for multi-step if you don't account for the "gap."

If you are predicting 7 days out, your validation folds must be separated by at least 7 days, or your model will "peek" at the ground truth of adjacent days which are highly correlated.
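
scikit-learn's TimeSeriesSplit supports a gap argument for exactly this situation. A minimal sketch using the lag matrix X built earlier (the split count is arbitrary):

python
from sklearn.model_selection import TimeSeriesSplit

# Exclude 7 observations between each training fold and its validation fold,
# matching a 7-step forecast horizon
tscv = TimeSeriesSplit(n_splits=5, gap=7)

for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold}: train ends at index {train_idx[-1]}, "
          f"validation starts at index {val_idx[0]}")
# 7 observations are skipped between each training fold and its validation fold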

Conclusion

Multi-step forecasting is not merely repeating a single-step prediction loop; it is a structural decision that defines how your model handles uncertainty over time.

  • Use Recursive strategies (like ARIMA) when the horizon is short and the underlying physics of the system are stable.
  • Use Direct strategies (often with XGBoost or LightGBM) when you have plenty of data and need to avoid the noise amplification of recursive loops.
  • Use Multi-Output (like LSTMs or Transformers) when the shape of the future sequence matters as much as the individual points.

The "best" strategy is rarely obvious without experimentation. Start with a Recursive baseline—it's the easiest to implement. If accuracy degrades too fast over the horizon, switch to the Direct strategy to stabilize those long-range predictions.

To deepen your understanding of the models that power these strategies, explore our guide on Gradient Boosting for direct forecasting, or Exponential Smoothing for a classical recursive approach.


Hands-On Practice

Multi-step time series forecasting is a critical skill for real-world applications where planning horizons extend beyond a single day. In this tutorial, we will move beyond simple next-day predictions and implement the two dominant strategies for predicting sequences: the Recursive Strategy (iterative) and the Direct Strategy (independent models). Using a realistic retail sales dataset, you will build forecasting engines that can predict sales 14 days into the future, learning to balance the trade-offs between error accumulation and model complexity.

Dataset: Retail Sales (Time Series). 3 years of daily retail sales data with clear trend, weekly/yearly seasonality, and related features. Includes sales, visitors, marketing spend, and temperature. Perfect for ARIMA, Exponential Smoothing, and Time Series Forecasting.

Try It Yourself

Retail Time Series: Daily retail sales with trend and seasonality

In this tutorial, you implemented both Recursive and Direct forecasting strategies. You likely observed that the Recursive strategy follows the trend but may drift over time as errors compound, while the Direct strategy often captures specific future points better but requires maintaining multiple models. Experiment by changing the HORIZON variable to 30 days to see how drastically the recursive error accumulation degrades performance compared to the direct method.