|
8 | 8 | "source": [
|
9 | 9 | "# TFT Model for Energy Price in Spain\n",
|
10 | 10 | "\n",
|
11 |
| - "TFT (Temporal Fusion Transformer)\n", |
12 |
| - "N-BeatsX (N-Beats with Exogenous Variables)\n", |
13 |
| - "N-HiTS (Neural Hierarchical Interpolation for Time Series Forecasting)\n", |
| 11 | + "## Model Comparison\n", |
14 | 12 | "\n",
|
15 |
| - "## Why Transformers\n", |
| 13 | + "In this section, I will compare three standout models for time-series forecasting: `TFTs` (Temporal Fusion Transformers), `N-BeatsX`, and `N-HiTS`. \n", |
16 | 14 | "\n",
|
17 |
| - "Transformers, originally developed for natural language processing, have become a powerful tool for time series forecasting. Transformers take a matrix of features x timesteps as input, with each timestep encoded individually using position encoding and the self-attention mechanism. These techniques allow the model to capture both short- and long-range dependencies in the data whilst the temporal and sequential structure is maintained, unlike in FNNs (Feedforward Neural Networks), which collapse the data into a single column before encoding.\n", |
| 15 | + "They are all state-of-the-art models, widely recognized for their effectiveness in handling complex time-series data, capturing both short-term and long-term dependencies. These models are supported by strong research and practical applications, making them top choices for many forecasting tasks across different industries, including energy price prediction.\n", |
18 | 16 | "\n",
|
19 |
| - "For energy demand forecasting, transformers can process multiple input features simultaneously, including lagged demand values, weather conditions, calendar effects, and other exogenous variables, without relying on manually defined temporal dependencies. \n", |
| 17 | + "**TFT (Temporal Fusion Transformer)**\n", |
20 | 18 | "\n",
|
21 |
| - "The table below highlights how a Transformer is well-suited to predicting energy demand compared with other neural-network-based machine learning methods:\n", |
| 19 | + "TFTs adapt and extend the transformer design specifically for time series forecasting. They combine the strengths of sequence models and attention mechanisms, enabling accurate and interpretable multi-horizon forecasting.\n", |
22 | 20 | "\n",
|
23 |
| - "| Feature | Transformer | RNN/LSTM/GRU | FNN (Feedforward Neural Network) | CNN (Convolutional Neural Network) |\n", |
24 |
| - "|------------------------------------|---------------------------------------------|------------------------------------------|------------------------------------------|------------------------------------------|\n", |
25 |
| - "| **Architecture Type** | Attention-based (self-attention) | Sequential (recurrence-based) | Fully connected layers (no recurrence) | Convolutional layers (local receptive fields) |\n", |
26 |
| - "| **Handling of Long-Term Dependencies** | Excellent (self-attention) | Limited (vanishing/exploding gradients) | None (treats all inputs as independent) | Limited (focuses on local features) |\n", |
27 |
| - "| **Scalability** | High (parallel processing of large data) | Low (sequential processing) | High (processing all features in parallel)| High (parallel processing of convolutions) |\n", |
28 |
| - "| **Training Time** | Faster (due to parallelisation) | Slower (sequential training) | Fast (simple architecture) | Moderate (depends on depth of network) |\n", |
29 |
| - "| **Multi-Output Prediction** | Very effective for multi-output tasks | Challenging (often requires multiple models or complex architectures) | Challenging (often requires separate models) | Effective (can use same model for multiple outputs) |\n", |
30 |
| - "| **Handling Complex Temporal Relationships** | Good (learns relationships with long-range dependencies) | Poor for long sequences due to vanishing gradients | Poor (no temporal awareness) | Limited (focuses on local patterns, not temporal) |\n", |
| 21 | + "The TFT encoder is designed to handle the unique structure of time series data by separating inputs into static features, past observed variables, and known future inputs. Each of these is passed through embedding layers and processed by Variable Selection Networks, which use Gated Residual Networks (GRNs) to dynamically weight the importance of each input feature at every timestep. This allows the model to focus only on the most relevant information. The selected inputs are then passed through a sequence of LSTM layers, which capture temporal dependencies and preserve the ordering of events.\n", |
31 | 22 | "\n",
|
| 23 | + "In the decoder, TFTs leverage multi-head attention to identify and focus on the most relevant timesteps from the encoded historical data for each forecasting horizon. \n", |
32 | 24 | "\n",
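| + "The sketch below illustrates the idea of a Gated Residual Network, the building block used by the Variable Selection Networks described above. It is a simplified illustration in plain PyTorch, not the exact implementation inside `pytorch_forecasting`.\n", |
| + "\n", |
| + "```python\n", |
| + "import torch.nn as nn\n", |
| + "import torch.nn.functional as F\n", |
| + "\n", |
| + "class GatedResidualNetwork(nn.Module):\n", |
| + "    # Simplified GRN: two dense layers, a GLU gate, and a residual connection with layer norm.\n", |
| + "    def __init__(self, d_input: int, d_hidden: int):\n", |
| + "        super().__init__()\n", |
| + "        self.fc1 = nn.Linear(d_input, d_hidden)\n", |
| + "        self.fc2 = nn.Linear(d_hidden, d_hidden)\n", |
| + "        self.gate = nn.Linear(d_hidden, 2 * d_hidden)  # produces the two halves consumed by the GLU\n", |
| + "        self.skip = nn.Linear(d_input, d_hidden) if d_input != d_hidden else nn.Identity()\n", |
| + "        self.norm = nn.LayerNorm(d_hidden)\n", |
| + "\n", |
| + "    def forward(self, x):\n", |
| + "        h = F.elu(self.fc1(x))                     # non-linear transform of the inputs\n", |
| + "        h = F.glu(self.gate(self.fc2(h)), dim=-1)  # gating controls how much of the transform passes through\n", |
| + "        return self.norm(self.skip(x) + h)         # residual connection followed by layer normalisation\n", |
| + "```\n", |
| + "\n", |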
|
| 25 | + "**N-BeatsX (N-Beats with Exogenous Variables)**\n", |
33 | 26 | "\n",
|
| 27 | + "N-BeatsX is an extension of the N-BEATS model designed for time series forecasting that incorporates exogenous variables. The model consists of two main blocks which are fully connected feedforward networks (FFNs): the backcast block and the forecast block.\n", |
34 | 28 | "\n",
|
35 |
| - "## Choosing a Transformer Architecture\n", |
| 29 | + "- **Backcast Block:** This block aims to model the historical data by reconstructing past values. It learns the patterns and underlying structure of the time series data, helping the model capture long-term trends and seasonality.\n", |
36 | 30 | "\n",
|
37 |
| - "The model needs to generate 24 hourly predictions for the next day at noon, meaning it must capture sequential dependencies effectively. There are two main transformer architectures to consider:\n", |
| 31 | + "- **Forecast Block:** The forecast block predicts future values based on the patterns learned in the backcast block. It decomposes the forecast into components like trend and seasonality, making the model interpretable.\n", |
38 | 32 | "\n",
|
39 |
| - "- **Sequence-to-Sequence (Seq2Seq) Transformer** : This approach uses both an encoder and a decoder. The encoder processes historical data (past demand, weather, etc), extracting meaningful representations. The decoder autoregressively generates each hour’s demand forecast while attending to both the encoder’s outputs and known future data (weather forecasts, calendar effects, etc).\n", |
40 |
| - " - Pros : Naturally handles multi-step forecasting by generating predictions one at a time. Can incorporate past predictions dynamically in an autoregressive manner.\n", |
41 |
| - " - Cons : Requires masking to ensure predictions don’t leak future information. More computationally expensive due to the decoder’s sequential nature.\n", |
42 |
| - "- **Encoder-Only Transformer** : This approach uses only an encoder (like BERT-style transformers) to map historical and exogenous features directly to all 24 future hourly predictions in a single forward pass.\n", |
43 |
| - " - Pros : Faster and more efficient since it doesn’t require iterative decoding. Avoids error accumulation from autoregressive predictions.\n", |
44 |
| - " - Cons : Can struggle with capturing temporal dependencies between consecutive predicted hours. Less flexible in handling exogenous variables that evolve dynamically.\n", |
| 33 | + "**N-HiTS (Neural Hierarchical Interpolation for Time Series Forecasting)**\n", |
45 | 34 | "\n",
|
46 |
| - "For this task, an encoder-only transformer will be used since it efficiently predicts all 24 future time steps in one forward pass. While Seq2Seq models better capture sequential dependencies, their computational cost and risk of accumulating errors make them less suitable for predicting 36 hours ahead (12 historical + 24 future time steps)." |
| 35 | + "N-HiTS uses hierarchical decomposition and interpolation to efficiently capture both short-term and long-term temporal dependencies in time series data.\n", |
| 36 | + "\n", |
| 37 | + "N-HiTS works by processing the time series data through multiple hierarchical levels, each focused on a different temporal resolution. The model starts with coarse, long-term trends and progressively refines the forecast using interpolation layers, which help generate more fine-grained predictions at each level. This hierarchical approach allows N-HiTS to capture patterns across multiple temporal scales, making it effective for both long-term trends and short-term fluctuations.\n", |
| 38 | + "\n", |
| 39 | + "**Why TFTs**\n", |
| 40 | + "\n", |
| 41 | + "For energy price prediction, where exogenous variables like energy demand and weather forecasts play a critical role, TFTs stand out. Their ability to handle mixed inputs, including historical data and known future inputs, makes them the ideal choice. Additionally, Variable Selection Networks (VSNs) in TFTs ensure that only the most relevant factors influence the predictions. While N-BeatsX and N-HiTS can handle exogenous variables, they lack the flexibility to incorporate known future inputs, making TFTs the better fit for this task." |
47 | 42 | ]
|
48 | 43 | },
|
49 | 44 | {
|
|
71 | 66 | "source": [
|
72 | 67 | "## Feature Engineering\n",
|
73 | 68 | "\n",
|
74 |
| - "The pre-cleaned dataset from [xgboost_demand.ipynb](xgboost_demand.ipynb) can be reused for the transformer model, but adjustments are needed in the feature engineering step. Unlike XGBoost, which treated each time step as a separate row with extracted lag-based features, the transformer model requires the data to be structured as a sequence. This means that instead of having a single feature vector per prediction (num_features), the input to the transformer will be a (24, num_features) matrix, preserving the temporal relationships within each 24-hour window." |
| 69 | + "The pre-cleaned dataset from [xgboost_price.ipynb](xgboost_price.ipynb) can be reused for the TFT model, but adjustments are needed in the feature engineering step. Unlike XGBoost, which treated each time step as a separate row with extracted lag-based features, the TFT model requires the data to be structured as a time series dataset, with the data split into different parts, such as known features (future inputs), observed features (historical data), and target variables. This means the input to the TFT model must be organized as sequences, where each sequence corresponds to a 1-week window that preserves the temporal relationships across time steps. `pytorch_forecasting` has a `TimeSeriesDataSet` which can handle this.\n", |
| 70 | + "\n", |
| 71 | + "The TFT model requires more training data than XGBoost, so using one data point per day is insufficient. To address this, we will create one data point for each hour of the day, increasing the amount of training data and allowing the model to learn richer temporal patterns. While the TFT will still only make predictions for each hour (0-23) of the following day at noon, using data for other hours can still provide valuable context. This extra data can help the model capture long-term trends, making it more generalised and improving its overall forecasting accuracy." |
75 | 72 | ]
|
76 | 73 | },
|
77 | 74 | {
|
|
245 | 242 | "print(f'Test Batches: {len(test_dataloader)}')"
|
246 | 243 | ]
|
247 | 244 | },
|
| 245 | + { |
| 246 | + "cell_type": "markdown", |
| 247 | + "metadata": {}, |
| 248 | + "source": [ |
| 249 | + "## Model Training" |
| 250 | + ] |
| 251 | + }, |
248 | 252 | {
|
249 | 253 | "cell_type": "code",
|
250 | 254 | "execution_count": 4,
|
|
330 | 334 | "name": "stdout",
|
331 | 335 | "output_type": "stream",
|
332 | 336 | "text": [
|
333 |
| - "Epoch 0: 100%|██████████| 542/542 [02:53<00:00, 3.12it/s, v_num=3, train_loss_step=944.0, val_loss=467.0, train_loss_epoch=371.0]\n" |
| 337 | + "Epoch 20: 100%|██████████| 542/542 [02:53<00:00, 3.12it/s, v_num=3, train_loss_step=944.0, val_loss=467.0, train_loss_epoch=371.0]\n" |
334 | 338 | ]
|
335 | 339 | }
|
336 | 340 | ],
|
|
364 | 368 | ")"
|
365 | 369 | ]
|
366 | 370 | },
|
| 371 | + { |
| 372 | + "cell_type": "markdown", |
| 373 | + "metadata": {}, |
| 374 | + "source": [ |
| 375 | + "The model output is a 35-vector, with predictions for each hour from 1pm the day before to 11pm on the day being predicted. Therefore, the prediction needs to be sliced to only include the last 24 hours of the prediction." |
| 376 | + ] |
| 377 | + }, |
367 | 378 | {
|
368 | 379 | "cell_type": "code",
|
369 | 380 | "execution_count": 7,
|
|
422 | 433 | "plt.show()"
|
423 | 434 | ]
|
424 | 435 | },
|
425 |
| - { |
426 |
| - "cell_type": "markdown", |
427 |
| - "metadata": {}, |
428 |
| - "source": [ |
429 |
| - "The transformer model requires more training data than XGBoost, so using one data point per day is insufficient. To address this, we will create one data point for each hour of the day, increasing the amount of training data and allowing the model to learn richer temporal patterns. While the transformer will still only make predictions for each hour (0-23) of the following day at noon, using data for other hours can still provide valuable context. This extra data can help the model capture long-term trends, making it more generalised and improving its overall forecasting accuracy." |
430 |
| - ] |
431 |
| - }, |
432 | 436 | {
|
433 | 437 | "cell_type": "code",
|
434 | 438 | "execution_count": 8,
|
|
465 | 469 | "source": [
|
466 | 470 | "## Summary\n",
|
467 | 471 | "\n",
|
468 |
| - "The Transformer model achieved a Root Mean Squared Error (RMSE) of 899.11 MW, a Mean Absolute Error (MAE) of 681.94 MW, and a Mean Absolute Percentage Error (MAPE) of 2.57%. Therefore, the model is 97.43% accurate. This is significantly better that the benchmark model, however it was still outperformed by the XGBoost model with RMSE of 851.66 MW, MAE of 632.62 MW, and MAPE of 2.39%.\n", |
469 |
| - "\n", |
470 |
| - "The XGBoost model outperformed the Transformer model primarily due to the limited dataset size. The XGBoost model benefits from handcrafted lag-based features, which explicitly encode temporal dependencies. In contrast, the transformer model relies on self-attention to learn these relationships, which requires significantly more data to generalise effectively. Additionally, Transformers have higher model complexity, making them more prone to overfitting on small datasets, whereas XGBoost’s built-in regularisation techniques help maintain robust performance.\n", |
| 472 | + "The TFT model achieved a Root Mean Squared Error (RMSE) of 22.23 EUR/MWh, and a Mean Absolute Error (MAE) of 16.52 EUR/MWh. This is significantly better that the benchmark model, and similar performance to the XGBoost model with RMSE of 21.57 EUR/MWh, MAE of 16.57 EUR/MWh. The TFT's RMSE was 0.66 EUR/MWh (3%) worse, however its MAE was marginally (0.05 EUR/MWh) better than the XGBoost model.\n", |
471 | 473 | "\n",
|
472 |
| - "The histogram of residuals indicates that the Transformer model systematically overpredicts energy demand more often than it underpredicts. Since this effect was not observed in the XGBoost model, it is unlikely to be caused by bias in the training data. Instead, it may be a consequence of the Transformer being trained on a more generalised dataset, where predictions were made at every hour rather than specifically at noon each day, potentially reducing its ability to capture noon-specific demand patterns accurately." |
| 474 | + "Despite the limited dataset size, the TFT model achieved performance comparable to XGBoost, highlighting the advantage of using architectures specifically designed for time-series forecasting. Previously, a general-purpose Transformer struggled in predicting energy demand due to its complexity and data requirements, but the TFT’s tailored design, incorporating components like LSTMs for short-term patterns, attention for long-term dependencies, and variable selection networks for feature relevance, allowed it to make effective use of the available data. Its ability to handle both known future inputs and exogenous variables in a structured way also contributed to its strong performance, even in a data-constrained setting." |
473 | 475 | ]
|
474 | 476 | }
|
475 | 477 | ],
|
|