GAP-LSTM: Graph-Based Autocorrelation Preserving Networks for Geo-Distributed Forecasting

Forecasting methods are important decision support tools in geo-distributed sensor networks. However, challenges such as the multivariate nature of data, the existence of multiple nodes, and the presence of spatio-temporal autocorrelation increase the complexity of the task. Existing forecasting methods are unable to address these challenges in a combined manner, resulting in a suboptimal model accuracy. In this article, we propose GAP-LSTM, a novel geo-distributed forecasting method that leverages the synergic interaction of graph convolution, attention-based long short-term memory (LSTM), 2-D-convolution, and latent memory states to effectively exploit spatio-temporal autocorrelation in multivariate data generated by multiple nodes, resulting in improved modeling capabilities. Our extensive evaluation, involving real-world datasets on traffic, energy, and pollution domains, showcases the ability of our method to outperform state-of-the-art forecasting methods. An ablation study confirms that all method components provide a positive contribution to the accuracy of the extracted forecasts. The method also provides an interpretable visualization that complements forecasts with additional insights for domain experts.


I. INTRODUCTION
T IME series forecasting represents a crucial task in many real-world domains, including renewable energy, traffic, and pollution.Indeed, highly accurate forecasting models can be a powerful decision support tool to domain experts, providing the required knowledge to define new policies, foster operational safety, improve resource planning, and increase revenues.In the energy domain, accurate forecasting tools can lead to efficient integration of renewable energy with fossil sources, aiming at a balanced power grid load Fig. 1.
Overview of the geo-distributed forecasting task addressed by GAP-LSTM.Multiple locations (graph nodes) generate multivariate data (node features) at a fixed time granularity.The model leverages historical data x 1 , x 2 , . . ., x T to extract forecasts for p timesteps.Modeling temporal, spatial, and graph-based interactions is crucial to capture complex and dynamic correlation patterns resulting in accurate multistep-ahead forecasts.distribution, as well as effective energy trading strategies.In the traffic domain, forecasting tools can support traffic redirection policies to effectively distribute traffic across roads.In the pollution domain, they can help improving air quality in areas where high pollution is expected, by restricting traffic on certain roads or resorting to other preventive measures.
A major source of complexity of this task in sensor networks is represented by the geo-distribution of data across multiple locations (as shown in Fig. 1), which violates the typical assumption of independent and identically distributed observations made by machine learning models.The main challenges in this context are the analysis of multivariate data, the necessity to combine information from multiple geo-distributed nodes, and the presence of spatio-temporal autocorrelation described by complex patterns and interactions between sensors.This complexity requires new and sophisticated models that are able to address the aforementioned challenges simultaneously.
Neural network approaches, including gated recurrent unit (GRU) [10], [11], long short-term memory (LSTM)-based models [12], [13], [14], and other encoder-decoder architectures [15], [16], [17], are able to effectively analyze multivariate data while capturing temporal dependencies, but they only partially address the challenge of combining information from multiple nodes by modeling spatial autocorrelation.In fact, all nodes and features are treated as independent features, therefore ignoring the spatial information in the sensor network.Recently, this issue has been partially addressed by neural network models with custom loss functions [18], methods involving spatio-temporal feature extraction [19], [20], and variants of LSTM such as support vector decomposition (SVD)-LSTM [21] and convolutional neural network (CNN)-LSTM [22], [23], [24], [25], [26], [27], which model local spatial dependencies in data.Another alternative is that of LSTM models with spatial attention [28], [29], [30].However, these methods do not take into account the geographical structure of nodes, and only have a limited view of neighboring data, bounded by the kernel size, due to the locality characteristic of the convolution operator [31], [32].
This article proposes GAP-LSTM, a graph-based autocorrelation-preserving neural network method for geo-distributed forecasting, which addresses all the abovementioned challenges.Specifically, our method involves a novel GCN-LSTM cell that performs graph convolutions at each timestep, to propagate spatial information throughout the whole time series modeling process, synergically integrating GCN within the LSTM cell.Our cell presents a dual structure.The first part is an LSTM with the addition of GCN layers before the computation of each gate.The second part is a simplified version of the first part that also contains an additional latent memory state: the output gate of the first part of the cell goes directly into the second part, together with the latent memory state of the previous timestep.This novel model architecture allows us to extract and preserve complex spatio-temporal patterns that were hidden in raw data.Moreover, it overcomes the limitation of the commonly adopted GCN-LSTM workflow, where the spatial information is initially extracted by means of a GCN, and an LSTM carries out the forecasting task on the resulting representation.The proposed model architecture is particularly suitable to tackle predictive tasks in many real-world domains, including renewable energy, air pollution, and traffic.These domains are characterized by geo-distributed sensor data with spatio-temporal dependencies that can be exploited to yield more accurate forecasting models.
Another important challenge addressed by our method is that of model interpretability.Indeed, a high model complexity exacerbates the difficulty of gathering insights and explanations about model predictions.This challenge, which is often overlooked by state-of-the-art forecasting methods, is crucial for their adoption as decision support tools in real-world domains such as smart grids.In our method, we leverage the multinode capabilities of our model to generate interpretable attention maps that reveal the most significant nodes and timesteps for the forecasting process, and their relative importance compared to the others in the same sequence and across the whole observation period.
In summary, the main contributions of this work are as follows.
1) A novel model architecture integrating a custom recurrent cell with graph convolution and latent memory states, coupled with attention-based LSTM and 2-D convolution to exploit and preserve spatio-temporal autocorrelation in multivariate and geo-distributed data.2) An interpretable output visualization that highlights relevant factors that have an impact on predictions, as well as interactions between nodes, supporting domain experts in their decision making.3) An extensive evaluation with five real-world datasets which shows that GAP-LSTM outperforms state-of-theart forecasting methods on energy, traffic, and pollution domains.The code to reproduce the experiments is publicly available at https://github.com/m-altieri/GAP-LSTM/.This article is structured as follows.Section II summarizes related works.Section III describes our proposed method.Section IV describes the experimental settings, and provides a discussion on the results obtained in our experiments.Section V wraps up this article with a summary of the results obtained and outlines relevant directions for future work.

A. Autoregressive Models
Autoregressive models for time series forecasting extract future values for a property of interest based on recent values observed for the same property.Popular models include autoregressive integrated moving average (ARIMA) [1], [2] and Prophet [3].Other works combine autoregressive approaches and evolutionary algorithms to better explore model variants in the search space and select better performing models [4], or combine them with fuzzy or other types of statistical modeling [5], [6], [7].Among their benefits, it is worthwhile mentioning their ease of use and their determinism, given their statistical nature.On the other hand, their main drawback is that, since they only analyze univariate data, they do not consider exogenous variables.Moreover, they only model temporal autocorrelation, without benefiting from the spatial autocorrelation in data generated by multiple nodes.VAR models [8], [9] partially overcome this limitation, allowing to analyze data from multiple sources.
However, they take into account spatio-temporal autocorrelation only considering the target property subject to the forecasting task, and attributing equal importance to all nodes.Moreover, similar to ARIMA, they are unable to model complex nonlinear correlations among different values.As a result, vector autoregressive models, only partially address the exploitation of spatial autocorrelation, and do not consider graph-based relationships among nodes.

B. Neural Networks for Temporal and Spatial Data
RNN models, such as LSTM and GRU, overcome some of these limitations by analyzing multivariate data while leveraging nonlinear activation functions, which allow to model and learn more complex correlations among features [10], [11].LSTM-based modeling has been adopted in a number of applications, including the classification of thermal images of electric motors processed through feature extraction.
Alternative approaches combine a series of multiseasonal decomposition techniques to better capture seasonal patterns in LSTM-based forecasting models [12].A similar approach is followed in [13], where the forecasting task is carried out using an LSTM model with skip connections complemented with exponential smoothing and ensembling.Other approaches are proposed in [14], [15], [16], and [17], which combine an encoder-decoder model architecture with progressive decomposition capacities for complex time series.However, these methods can only model temporal correlations.
Methods that involve custom spatio-temporal loss functions [18] and feature extraction approaches [19], [20] partially overcome this problem by jointly extracting spatial and temporal dependencies.One popular approach is SVD-LSTM, which performs singular value decomposition and carries out the forecasting task through LSTM networks [21].
A hybrid method of CNN and bi-directional LSTM (Bi-LSTM) models is proposed in [22].The method learns shared representation features from multivariate time series for air quality data and it is applied to pollution forecasting.CNN-LSTM merges CNN and LSTM models to jointly capture short and long-term spatio-temporal dependencies, as explored in [23], [25], [26], and [27].A more recent approach in [24] extracts images from sliding windows of time series, and uses saliency as a mixup strategy for data augmentation to train deep models.
The study in [28] proposes a multistep ahead flood forecasting model that features a spatio-temporal attention LSTM, applying attention weights to the hidden layer state of each timestep, while in [29] they use spatio-temporal attention in the encoder, and multiple convolutions in the decoder to extract spatio-temporal features at different resolutions.In [30] spatiotemporal attention is used to predict chlorophyll for the early detection of red tide.More recently, the triangular, variablespecific attentions for long sequence multivariate time series forecasting (Triformer) method [44] proposed a more sophisticated triangular and variable-specific attention mechanism with distinct model parameters for different variables and linear complexity.
Even if these methods support the analysis of multivariate data generated at multiple nodes, an important limitation is that they extract temporal correlations from low-level features in adjacent nodes through an initial modeling step.By doing so, the subsequent operations, which act on top of the extracted high-level features, do not take into account spatial correlations in low-level layers, which are partially unexploited or entirely lost, resulting in a performance degradation on the subsequent downstream task [39], [40], [42], [43].Some effort has been devoted to mitigate this issue with the introduction of skip connections that propagate spatial information [39], [43] as well as spatial memory cells [40].Other limitations include the lack of exploitation of graph-based relationships among nodes, and the lack of model interpretability.

C. Graph Neural Networks
A recent and promising thread of research is that of GCNs, which take into account the network structure and combine the contribution of data from multiple sources, by means of a graph convolutional operator [33], [34].When used in synergy with recurrent networks, GCN can be used to perform time series forecasting tasks.
Some works have focused on the analysis of traffic data.An interesting approach leveraging complete convolutional structures with a limited number of parameters is proposed in [35].A spatio-temporal graph convolution framework for traffic prediction is proposed in [36], in which multiple graphs are built to explicitly model dynamic correlations among road segments, while RNNs capture temporal correlations for each road segment.
The work in [37] proposes a collaborative graph neural network prediction method involving multiple agents and exploiting dynamic interactions in the system.Interactions are represented as a graph, where edge weights reflect the importance of each predictor.The method has shown successful results on trajectory prediction, online human motion prediction, and online traffic speed prediction.
A different approach is followed in TraverseNet [38], where spatial and temporal dependencies are unified in a non-Euclidean space, and a spatial-temporal graph neural network mines spatial-temporal graphs while exploiting the evolving spatial-temporal dependencies for each node via message traverse mechanisms.
A recent and popular method is graph wavenet (GWN) [45], which exploits an adjacency matrix learned with node embeddings, and a stacked dilated 1-D-convolution component with increasingly growing receptive fields.Another powerful method is regularized graph structure learning (RGSL) [46], which extracts a dense similarity matrix through node embeddings, and learns a sparse graph structure using the regularized graph generation (RGG) method.
Other graph-based approaches were proposed to forecast air quality and electricity data.The method in [47] builds graph models to represent contextual information of nodes and aggregates them via a multigraph fusion module while retaining the importance of each node in the different graphs.A graph neural network with attention as a transfer learning mechanism is proposed in [48], where knowledge from multiple sources is used to improve the learning process in new sources.
The method proposed in [49] addresses the behindthe-meter load and photovoltaic forecasting, modeling the residential units as a spatio-temporal graph where the nodes represent the net load measurements and edges reflect their Although all these methods are able to analyze multivariate data from multiple nodes, and both spatial and temporal information is exploited in the learning task, a common pitfall is that they extract spatial and temporal autocorrelation separately.Most frequently, the GCN embeds the spatial component into an encoded vector and the LSTM performs forecasting on top of this representation, which results in the inability to further extract spatial autocorrelation in successive modeling steps.Another common issue of forecasting models is their lack of interpretability, due to the complexity and time-variant nature of the analyzed data.The work in [24] is an interesting attempt in this direction.
A summarized comparison of state-of-the-art forecasting methods is provided in Table I.The comparison emphasizes the novelty of GAP-LSTM with respect to relevant features for geo-distributed forecasting.

III. METHOD
This section is divided into three parts.We first define the problem we tackle in this study.Subsequently, we describe our proposed method in detail, focusing on the contribution provided by each component.Finally, we illustrate the interpretability aspects of the method.

A. Problem Definition
This article addresses the scenario where N geo-distributed nodes generate multivariate observations.Each node is identi- 1 For SA, a gray tick indicates that a method extracts it without preserving it throughout all the modeling steps.For Int, a gray tick indicates that, even if some aspects of the model could support the interpretation of results (e.g., attention weights), they are not exploited to complement predictions. 2This method is not strictly a forecasting method, but it rather provides a general explainability framework for time series analysis.
fied by a pair of latitude and longitude coordinates.A graphical overview of the addressed task is shown in Fig. 1.Together, all nodes generate an observation x t ∈ R N ×F for each discrete time point t, F being the number of features.We note that F is the cardinality of the entire feature space consisting of both descriptive (independent) features and target features.In our work, the discrete timeline is split into non-overlapping sequences of length T , each corresponding to a desired unit of analysis, e.g., a single day described by 24 hourly observations.The kth sequence can be defined as follows: Based on this formulation, it is possible to model data into two data structures: 1) a sequence tensor S ∈ R S×T ×N ×F , containing contiguous sequences: S = [s 0 , s 1 , . . ., s S−1 ], where each sequence s k contains T chronologically ordered observations; 2) a graph matrix A ∈ R N ×N , where A i j is some measure of relationship, or closeness, or correlation, or similarity between nodes i and j.Given a sequence s k and a forecasting horizon P, the forecasting task consists in approximating the function f : which returns the next P observations for the target feature of interest, denoted by y.
Without loss of generality, this formalization of the forecasting task also applies to the context of multitarget forecasting, where the goal is to correctly predict a set of target features F ′ ≤ F, approximating the ground truth y ∈ R P×N ×F ′ .
In addition to the extraction of accurate predictions, our aim is to preserve spatio-temporal autocorrelation throughout the entire learning workflow.Temporal autocorrelation accounts for temporal dependencies between observations at different timesteps [1].The analysis of temporal autocorrelation allows us to identify and exploit repeating patterns in data characterized by periodic behavior, as commonly observed in time series data on geophysical phenomena (e.g., weather, renewable energy, pollution, traffic).Considering a sequence s k;n collected from a single node n and a desired lag h, temporal autocorrelation can be defined as where x(n) is the average value observed for node n.For all sequences, all nodes, and all possible lags, the average temporal autocorrelation can be computed as the average Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.Spatial autocorrelation can be defined as a correlation in a signal among nearby locations in space.A widely adopted way to formalize it is resorting to Moran's I global autocorrelation statistic [52].Considering a sequence s k collected from multiple nodes, and the closeness relationships between nodes defined in A, spatial autocorrelation for a single timestep x (k,t) can be defined as where x(k,t) is the average observed value considering all nodes at timestep t.For all sequences, the average spatial autocorrelation can be computed as the average

B. Proposed Model: GAP-LSTM
The forecasting model proposed in this work involves several deep-learning components.A schematic representation of the method is shown in Fig. 2. In order to simplify the discussion, we first give a general description of the method, based on the sequential information flow, and then we present each component in detail.
1) Method Workflow: The input of the model consists of two tensors: the sequence tensor S and the graph matrix A, as defined in Section III-A.The model leverages a customized encoder-decoder architecture.The rationale is to use an encoder module e to encode an input sequence s k to an embedding representation that captures the most relevant information for the task at hand, mitigating possible collinearity and noise in the original data.All encoder and decoder steps are a modified version of the LSTM cell, which includes a graph convolutional layer before each gate, 3 considering the adjacency matrix A as input, which is used for graph convolution.
Given a sequence s k and T input steps, we denote with e t the encoder step t, which processes the observation x (k,t) .Each encoder step e t is a function that takes as input the observation at the current timestep and the outputs of the previous encoder cell, and yields h t , m t , where h t , m t , and c t denote, respectively, hidden state, latent memory state, and cell state for timestep t.Intermediate encoder hidden states are reused in the attention mechanism.While h t and c t are standard in LSTM-based cell architectures to capture temporal dependencies, the introduction of the latent memory state m t is a novel contribution that allows our model to preserve spatio-temporal autocorrelation more effectively than existing LSTM-based models, as discussed later in this section.
The goal of the decoder is to reconstruct the encoded representations into a series of P steps, which will then lead to the prediction.The decoder has the same output shape as the encoder, but may have a different number of steps (in case T ̸ = P).The input of each decoder cell d p consists of the previous decoder latent memory state m p−1 and cell state ĉ p−1 (for d 1 , we use the last encoder step's latent memory state m T and cell state c T ), as well as the weighted sum of encoder hidden states, as determined by the attention mechanism.For a single timestep p, the decoder cell is a mapping where {α (t) h t } are the attention weights of the previous timestep.The output of the encoder component h T is also provided as input to a 2-D-convolutional layer, and then it is added to the decoder hidden states [ ĥ1 , . . ., ĥ P ].
To extract the final prediction, this tensor then goes through a fully connected layer that reduces the last dimension from F to 1, so that the final output of the model (s k ) ∈ R P×N contains the value of the relevant target feature for each node at each prediction timestep.A graphical overview of this workflow is shown in Fig. 3 and its high-level pseudo-code is described in Algorithm 1.

Algorithm 1 GAP-LSTM Method Workflow
Input: 2) Graph-Based Data Representation: Before introducing the GCN-LSTM cell, we start with describing the graph-based representation for the geo-distributed time series data analyzed in this study.Each node n ∈ {1, 2, . . ., N } has its own time series describing the features of interest, including the target feature over time.
The nodes and these relationships between nodes can be represented as a graph G = (V, E), where V is the set of all nodes (|V | = N ), and E = V × V is the set of edges, or relationships, between nodes.Since the task of interest is to forecast all nodes simultaneously, we can exploit the structural graph information of the data generated at multiple locations to get a better understanding of the whole context, which results in additional spatio-temporal information to support the forecasting task.
An adjacency matrix A ∈ R N ×N is a way to represent relationships between nodes in a compact form.To properly model the intensity of relationships between nodes, in our work, we define entries in the adjacency matrix a i j by exploiting the continuous range of closeness values between all pairs of nodes i, j, i.e., a i j ∈ [0, 1] represents the closeness relationship between nodes according to their distance.It is calculated as a i j = 1 − (d(i, j)/ max i ′ , j ′ d(i ′ , j ′ )), where d(i, j) is the geographical distance between nodes i and j.Using this similarity measure, the edges of the graph are undirected, making the adjacency matrix symmetric.In this way, a i j represents how close nodes i and j are relative to the distance between other nodes.If the distance between i and j is 0, their closeness will be 1.In the following, we describe in detail the proposed GCN-LSTM cell occurring in each encoder and decoder step.
3) GCN-LSTM Cell: Our cell is a modified type of the LSTM cell that incorporates graph convolution (GCN), which allows the model to learn spatio-temporal correlations of historical timesteps.The cell has four inputs: the input observation at the current timestep x t , the previous hidden state h t−1 , the previous cell state c t−1 , and the previous latent memory state m t−1 .The cell also requires the adjacency matrix A for the GCN computation.
Our method supports the adoption of different forms of graph convolution.Within the scope of this article, we evaluated two graph convolution alternatives, giving place to two variants of our method.The first variant is denoted as GAP-LSTM-Default and adopts the standard GCN implementation in [33], where a graph convolutional layer is defined as: (X, A) = ReLU( ÂX W ), with Â = D−(1/2) Ã D−(1/2) for the degree matrix Dii = j Ãi j , Ã = A + I N adjacency matrix with added self-loops and W ∈ R F×F ′ weight matrix.The second variant is denoted as GAP-LSTM-Weighted and employs a slightly different graph convolution implementation, which applies a componentwise weighting process to highlight salient features where W ∈ R F×2F , σ denotes the sigmoid activation, and ⊙ denotes the point-wise (or Hadamard) product.
On the contrary of the typical GCN-LSTM workflow, which first extracts spatial information by means of GCN and then uses LSTM to perform the forecasting task on the resulting representation, we aim to propagate spatio-temporal patterns throughout the entire workflow, synergically integrating GCN within the LSTM cell.This process takes place by means of T graph convolutions, corresponding to T hops in the graph data structure.
The cell is composed of two parts: the first part is similar to a conventional LSTM cell, but with the addition of GCN layers before the computation of each gate.The second part is a simplified version of the LSTM cell, still with the addition of GCN layers before each gate, but also containing an additional latent memory state: the output gate of the first part of the cell goes directly into the second part, together with the latent memory state.The latent memory state consists of three gates.Conceptually, the first gate acts similar to a forget gate in an LSTM architecture.It filters the information contained in the previous latent memory state by selecting how much latent information from m t−1 should be preserved in the current state m t .The second gate can be seen as an input/update gate that combines old information (m t−1 ) with new information (output of the left side of the GCN-LSTM cell) by updating the memory state and propagating the information extracted from the first gate.The outputs of the first and second gates are summed and this output is further propagated and filtered through the third gate, which acts as an output gate of the cell, and is in charge of the final output hidden state h t for the current timestep t.Both hidden states and latent memory states have a size equal to the number of input features F. They are initialized with ones and are updated at each timestep of a given sequence during the inference stage.
An intuitive visual representation of the proposed GCN-LSTM cell is depicted in Fig. 2. Conceptually, the goal of the first part of the cell is to propagate h t−1 with the current spatial (GCN) and temporal (LSTM) information.The second part of the cell acquires the output of the first part as well as the latent memory state m t−1 , and directs it through GCN and LSTM operations, which further extract spatio-temporal patterns from the latent memory representation.This dual process in our proposed cell has the potential to further emphasize useful spatio-temporal information that was hidden in raw data, and it has been extracted by the first part of the cell.Overall, the GCN-LSTM cell allows our method to extract global spatio-temporal patterns by iteratively combining information in local neighborhoods, based on the closeness relationship among nodes.An illustration of the spatial autocorrelation SA(x (k,t) ) measured across the encoder hidden states for different models is shown in Fig. 4. The visualization emphasizes that the proposed GAP-LSTM method presents a better ability to model spatial autocorrelation than other methods due to its cell's ability to jointly propagate temporal and spatial information during all encoding steps.
4) Attention Mechanism: The attention mechanism is used to determine the importance of any encoder step e t when computing the decoding step d p , i.e., how much d p "attends to" e t , for a prediction step p.Following the attention framework described in [53], we denote with query Q the decoder thought vector.The keys K = {k 1 , k 2 , . . ., k T } are aligned with the query, to determine how closely related they are to each other.This comparison is performed with a function a : (k, Q) → scor e, which is usually defined by a trainable neural network, so that each k t is associated with its own network a t that learns to properly weigh the corresponding history step t.The outputs of functions {a i } T 1 are normalized with a softmax to get the attention weights {α i } T 1 .The values V = v 1 , v 2 , . . ., v T are used to compute the decoder input T t=1 α t v t .In our method, global multihead attention, also known as soft attention, is adopted [54].In our formulation, the key and the value are the set of encoder hidden states We denote with K n ∈ R T ×F the key for node n, with K (t) ∈ R N ×F the key for history step t, and with K (t)  n ∈ R F the key for node n and history step t.The same notation is used to select specific nodes and history steps from V .We also denote with Q ∈ R N ×F the query for the current decoding step d p−1 , and with Q n ∈ R F the query for node n.
The whole attention can be defined as Each tensor Q n , K (t) n , V (t) n , is associated with a learnable weight matrix: W Q;n , W (t) K ;n , and W (t) V ;n , respectively.Overall, head n is calculated as where Therefore, the model learns N weight matrices for Q, and T • N weight matrices for K and V , respectively.Keeping the node n fixed, the T matrices for the key and the value can be defined explicitly: for T history steps, we have The embedded key at each step is a vector ∈ R F×T , and the attention scores are represented as a vector , containing a value for each history step.On this vector, a softmax is performed to obtain the attention weights α, which in turn is multiplied by the embedded value V n W V ;n , which, like the weighted key, has size T × F. Finally, head n (Q n , K n , V n ) ∈ R F is a context vector and constitutes the input for the decoder.
Typically, attention mechanisms perform a dot product or extract the embedding of the sole query vector.Instead, in our method, we leverage the embedding of both the query and the key.The advantage is to decouple the encoder information in two parts: the first part h t , which is used to extract spatio-temporal information from historical timesteps, and the second part K , which is useful to model the alignment of each encoder timestep with respect to each decoder state, used to extract predictions.Moreover, in our method, the attention mechanism focuses on each node n individually, in a multihead fashion, learning a different and independent head n (Q n , K n , V n ), which outputs F features for the given node n.
A graphical representation of our attention mechanism is shown in Fig. 5.

5) 2-D Convolution:
A 2-D-convolutional layer is used to modify decoder hidden states ĥ p with a number of filters that extract additional information from the last encoder state h T .In particular, the convolution computes P filters, one for each prediction timestep, with a kernel of size (2, 2) which is applied to h T .Different values for kernel size have been tested with preliminary experiments on a variety of domains, and highlighted (2, 2) as the best configuration.However, it can be easily customized to satisfy specific domain characteristics.
The kernel allows us to look at groups of features learned in adjacent nodes and extract useful localized spatial patterns.Specifically, the model will extract, for each timestep p, useful cross-correlations between different features of adjacent nodes to predict that particular timestep.A zero-padding is added and the stride is set to 1, to guarantee that the convolution output keeps the same shape as the input.The resulting P feature maps are stacked together to form a P × N × F tensor, which is then added to the decoder hidden states [ ĥ1 , ĥ2 , . . ., ĥ P ] to properly fuse relevant spatial information extracted from the convolution with decoder states.

C. Model Interpretability
This step enhances the GAP-LSTM method with interpretable predictive capabilities.Specifically, our method leverages the attention weights α generated at each prediction step p to support the interpretation of the extracted predictions.The rationale is that, for each prediction step, the model generates a different decoding state, which in turn results in different weights α.Combining this information allows us to identify the most relevant encoder hidden states at timesteps t = 1, 2, . . ., T for the current prediction step p.To accomplish this goal, the attention weights matrix α ∈ R N ×T is exploited to generate an insightful visualization that consists of three heatmaps for each p.Let us consider a sequence sk and let us denote with α (k) the attention weights for a generic sequence s k .In Fig. 6, each row r = 1, . . ., 12 shows attention maps computed for decoding step d 2r −1 during the forecasting of a given sequence.Fig. 6(a) shows the difference between attention weights for each decoding step p and average attention weights for this sequence α ( k)  :, p − (1/P) P p=1 α ( k) :, p .Fig. 6(b) shows the difference between the attention weights for decoding step p and average attention weights computed for all sequences subject to prediction.Let S p ⊂ {0, 1, . . ., S − 1} be a set of indices referencing sequences subject to prediction, the average attention weight is computed as α ( k)  :, p − (1/|S p |) k∈S p (1/P) P p=1 α (k) :, p .Fig. 6(c) shows the raw attention weights α ( k) for each decoding step of this sequence.
The difference operations allow us to visually highlight changes in the attention maps corresponding to encoding Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
steps becoming relevant for a given decoding step in a given sequence.From the visualization in Fig. 6, we can highlight salient patterns in different timesteps where, in each heatmap, rows denote nodes, columns denote encoding steps, and lighter colors in a cell correspond to stronger activations.Let us focus on the attention head for node 1, corresponding to the first row in all heatmaps.Observing the first heatmap at the top in Fig. 6(a) highlights that this attention head was highly aligned to e 11 for d 1 (as evident from the white pixel), but then gradually shifted its attention toward e 5 , e 15 , e 17 , and especially e 4 [as evident in the subsequent heatmaps in Fig. 6(a), starting from ŷ(k,T +7) to ŷ(k,T +23) ].This result emphasizes that hidden states h 4 , h 5 , h 15 , h 17 contained particularly useful information for the forecasting of node 1.

IV. EXPERIMENTS A. Research Questions
Our experiments are designed to answer the following research questions.
1) RQ1: Does GAP-LSTM achieve a higher performance than state-of-the-art methods in the multistep ahead geo-distributed forecasting task? 2) RQ2: Is the combination of all components in GAP-LSTM contributing to the achievement of an improved forecasting performance compared to simplified variants of the model?3) RQ3: Can GAP-LSTM provide an effective way for domain experts to interpret the extracted forecasts?

B. Datasets
We perform experiments with the following real-world datasets.
1) Beijing Air Quality [55] includes pollutant levels (PM2.5, PM10, SO2, NO2, CO, and O3) in addition to weather features (temperature, pressure, dew point, rain precipitation, and wind speed).2) Lightsource [20] covers solar energy data for the year 2017 from seven plants located in the U.K. Spot values, collected at a time granularity of 1 min, are aggregated hourly.3) PEMS SF Weather4 describes traffic information in terms of lane occupation, which has been enriched with weather features through the RapidAPI service.4) PV Italy [19] contains hourly observations from 17 solar plants located in Italy, collected from 2:00 A.M. to 8:00 P.M.The time period spans from January 1, 2012 to May 4, 2014.5) Wind-NREL [20] was modeled using the weather research and forecasting (WRF) model.Each plant consists of ten 3-MW turbines (for a total of 30 MW).Hourly aggregated observations range from January 1, 2005 to December 31, 2006.For all energy datasets (Lightsource, PV Italy, Wind NREL), the following input features are represented: temperature, pressure, wind speed, wind bearing, humidity, dew point, and cloud  cover.Lightsource includes additional features for altitude, azimuth, and irradiance.PV Italy includes all features in Lightsource and adds an additional weather summary feature.Lightsource and PV Italy observe a cutoff period between 9:00 P.M. and 2:00 A.M. due to the absence of irradiance at that time, i.e., plants are not operational and no observations are recorded during that time frame.For all datasets, the latitude and longitude coordinates of nodes are used to extract the graph matrix A.
A summary of our analyzed datasets is shown in Table II.A set of representative photos for the considered datasets is shown in Fig. 7.

C. Experimental Setup
We perform a data pre-processing phase consisting in the selection of significant features, conversion of all features to real values (including one-hot encoding of categorical features), min-max normalization of all features in the [0, 1] range, and computation of pair-wise node closeness according to their geographical location [18].Considering the sequential nature of data analyzed in our study, the model receives the sequences (each representing 24 h) one by one (every day) in chronological order.Models are trained considering all historical data sequences s 0 , s 1 , . . ., s S−1 .Given the current sequence s k , the model extracts predictions according to the forecasting horizon P (see Table II).For all experiments, models are optimized through Stochastic Gradient Descent using the Adam optimizer for 50 epochs, and mean absolute error (MAE) as the loss function.Fine-tuning takes place via grid search using validation data (1% of available sequences in each dataset) considering the following sets of hyperparameter values: batch size: {2 i } i∈{2,3,4} ; learning rate (LR): {10 j } j∈{−3,−4,−5} .Each method featured in our experiments is optimized separately for each dataset considering the aforementioned search space for hyperparameters.Once the optimal configuration on validation sequences for a given method is found, it is selected for the actual carried out on prediction (testing set).In order to evaluate the models, 10% of the available 24-h sequences are randomly sampled from each dataset for testing purposes.Once sampled, testing dates are fixed the same dates are used to fairly compare performance across different methods.Moreover, to simulate the outcome of multiple independent executions of the method while ensuring that only historical data from previous timesteps is being used, the evaluation (training on data prior to the prediction sequence and extraction of forecasts for the prediction sequence) is repeated for each prediction sequence.
For a given ground-truth sequence and its respective model forecast, the evaluation of the forecasting accuracy of each model is performed using the following metrics.

D. Results
Experimental results for all methods with all datasets are reported in Table III.Results are averaged over all test executions (10% of sequences). 5The results highlight that at least one variant of the method (GAP-LSTM-Weighted or GAP-LSTM-Default) achieves the best forecasting accuracy in terms of average RMSE, for all the analyzed datasets with the exception of Beijing Air Quality, where our method provides the second-best results.Particularly, GAP-LSTM-Weighted performs better on the Lightsource and PEMS-SF Weather datasets, possibly due to the presence of highly correlated clusters of nodes operating similarly, as well as the high similarity and regularity of solar energy prediction and traffic curves in these datasets.For the other three datasets, the basic weighting provided by the closeness relationship 5 The materials to replicate our experiments are available at the following repository: https://github.com/m-altieri/GAP-LSTM/.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.among nodes is apparently enough to model the actual spatial dependencies.Finally, we observe that the GAT variant or our method, GAP-LSTM-GAT, is not able to achieve the same performance of the other two variants.In summary, results show that GAP-LSTM generally achieves the highest forecasting performance across all methods and all datasets in terms of RMSE.To validate the statistical significance of our results, we perform Wilcoxon Signed Rank tests to all pairwise combinations of methods across multiple executions with all datasets.The results in Table IV highlight that GAP-LSTM-Default is the most robust variant of our method, outperforming competitors in 50 out of 65 configurations, 28 of which are statistically significant.GAP-LSTM-Weighted outperforms other approaches 44 times, 20 of which are statistically significant.Among all competitors, we observe that ESG performs similar to GAP-LSTM-Default.However, it is noteworthy that ESG is not able to exploit the possibility of fully catching spatio-temporal autocorrelation (as evident in Fig. 4) and, in addition, does not generate interpretable results.Moreover, it generally shows a worse performance with large prediction horizons (e.g., Wind NREL, PV Italy), with the exception of Lightsource, where ESG achieves a slightly worse performance than GAP-LSTM-Weighted (RQ1).
A different view of results is obtained considering model calibration by analyzing the number of times each model's predictions overestimate or underestimate target values.This analysis reveals interesting border cases of imbalanced predictions for specific models and datasets.We observe that Attention-LSTM overestimates target values in the 68.36% of the cases with the Beijing Air Quality dataset.CNN-LSTM overestimates 65.75% with the Lightsource dataset.On PV Italy, GRU underestimates target values in 65.55% of the cases, whereas CNN-LSTM overestimates them in 67.59%.Finally, we observe that GWN underestimates target values in 62.89% of the cases with Wind NREL.
Among the competitors considered in our study, autoregressive models (ARIMA) present an unsatisfactory performance, since they do not fully exploit the multivariate nature of data, and do not combine the spatial information residing in the multinode graph structure.We note that ARIMA was not evaluated on PEMS-SF Weather due to the unfeasible computational cost observed with this dataset.As for neural network-based methods, they significantly outperform ARIMA, but generally present a substandard performance.This result is justified by the fact that temporal-focused methods (LSTM, GRU, Bi-LSTM, Attention-LSTM) do not exploit spatial dependencies, whereas spatio-temporal methods (CNN-LSTM and GCN-LSTM) partially exploit them, as discussed in Section II.Shifting the focus to more recent baselines (GWN, multivariate time series forecasting with graph neural network (MTGNN), Triformer, ESG, RGSL), we can observe that they generally achieve very competitive results, significantly outperforming the aforementioned competitor methods, but appear suboptimal when compared to GAP-LSTM, with the exception of Beijing Air Quality dataset, where ESG outperforms all other approaches.A possible reason is that pollution predictions are more accurate with multiscale temporal modeling rather than spatio-temporal autocorrelation preserving modeling.However, with the exception of ESG, GAP-LSTM outperforms all other methods with this dataset.The superiority of GAP-LSTM on all other datasets is due to its competitors' inability to model spatial dependencies (Triformer), or their inability to preserve them throughout all modeling steps (MTGNN, ESG).The other methods (GWN, RGSL) effectively model and preserve spatial dependencies, but their scope is likely limited to one or few types of spatial interactions without considering, for instance, crosscorrelations in the embedding space, and appear penalized when such characteristics are naturally present in the data.
We conducted an ablation study to verify that all components of GAP-LSTM contribute to achieving a higher forecasting performance, and report its results in Table V.In this analysis, we run experiments with different variants of the model where the 2-D-convolutional layer, the attention mechanism, and the second part of the GCN-LSTM cell are mutually excluded.When one of the components is deactivated, the variant GAP-LSTM-NoMemoryState yields the second-best performance, followed by GAP-LSTM-NoAttention and GAP-LSTM-NoConv.We can also observe that deactivating any of the components always leads to a performance degradation ranging from 2.0% to 38.7% with respect to the best variant of GAP-LSTM in terms of RMSE.
The experimental results obtained for all variants of the model are reported at the bottom of Table III, and highlight that all components provide a positive contribution in terms of model accuracy.As a result, model variants where all components are active (GAP-LSTM-Weighted and GAP-LSTM-Default) achieve the best overall performance.From a qualitative viewpoint, Fig. 8 shows multinode prediction (red) and measured (blue) curves (one curve per node) for the target variable of each dataset and for a selected day. 6Overall, GAP-LSTM outperforms state-of-the-art methods, that the synergic work of all components contributes to the extraction of the most accurate predictions in the final model (RQ2).Specifically, the 2-D-convolutional layer fruitfully extracts spatial patterns from the last step of the encoder output using information from all nodes that are useful for the decoder.In addition, the attention mechanism determines the importance of previous timesteps in the input sequence during the generation of each prediction step performed by the decoder.Finally, the latent memory state component in the GCN-LSTM cell enhances spatiotemporal patterns residing in the embedded representation of each timestep of the sequence with a latent representation of the previous timestep.Focusing on model interpretability, we leverage the attention weights α generated at each prediction step p to support the interpretation of the extracted predictions.The rationale is that, for each prediction step, the model generates a different decoding state, which in turn results in different weights α.Combining this information allows to identify the most relevant encoder hidden states at timesteps t = 1, 2, . . ., T for the current prediction step p. Visualizations in Fig. 6 provide domain experts with this information, supporting them in visually understanding the most relevant factors influencing model predictions.This visualization demonstrates our method's ability to extract a succinct and qualitative interpretation of the forecast values, providing an effective way for domain experts to understand the extracted forecasts and support their decision-making process (RQ3).
Another way to gather insights about model's predictions is to demonstrate the effective exploitation of spatio-temporal autocorrelation provided by the different components of GAP-LSTM.
To gain deeper insights into the decoder representations, we adopt t-distributed stochastic neighbor embedding (t-SNE), which effectively extracts 2-D visualizations from the N × F decoder hidden states for a given prediction hour.In Fig. 9(a)-(e), we show four visualizations corresponding to four equally spaced hours.Points correspond to nodes of the Lightsource dataset, whereas the X -and Y -axes correspond to the compressed representation space extracted by t-SNE.
The results highlight that the embedding representation learned by our model preserves spatio-temporal correlations among different nodes, which are evident at different timesteps [see Fig. 9(a)-(d)] during the predictive stage.Similar nodes appear naturally clustered by the method based on their behavior, and resemble the closeness relationship defined by their physical location in the sensor network, as depicted in Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

E. Time Complexity
To evaluate the computational cost of all methods, we computed their execution time as the ratio between the total training time of each method and the number of prediction sequences for each dataset (number of executions).The result is then averaged across all datasets.In the following, we discuss the results for a subset of representative methods from different categories, which yield an accurate performance.
Experiments are run on a workstation equipped with an Intel Xeon W-2145 (3.7 GHz) CPU, 64 GB of RAM (DDR4-2666), 512-GB SSD drive, and an NVIDIA RTX 4090 GPU.Execution times for all methods and datasets are reported in Table VI.We observe that GAP-LSTM trades a linear increase in computational cost over a simpler neural network approach such as SVD-LSTM and Attention-LSTM, and a slight increase over state-of-the-art approaches such as GWN, MTGNN, Triformer, and ESG in exchange for higher forecasting performance.The RGSL method presents a higher execution time due to its layer complexity.Autoregressive models like ARIMA are unable to exploit information from all nodes simultaneously, resulting in a suboptimal forecasting performance and a considerably higher computational cost.
As an additional analysis, we report the GAP-LSTM space and time complexity with an increasing number of nodes in the sensor network.Results in Table VII show the amount of memory and time (absolute, relative, and per-node) required for model training, considering the Wind NREL dataset (originally including five nodes) and a progressively increasing number of synthetically added nodes in the network.We observe that the absolute time (expressed in seconds per epoch) increases sublinearly with the number of nodes.This result is also more intuitively represented by relative times.For instance, increasing the sensor network by a factor of 10 (from 5 to 50 nodes) results in a time increase of 6.83×.As for memory consumption, results for absolute memory (expressed in MBs) show a negligible increase as the number of nodes grows.In relative terms, increasing the sensor network size by a factor of 2 (from 5 to 10 nodes) only results in a 1% memory increase.In the largest case, where the network size is increased by factor of 10 (from 5 to 50 nodes) the memory increase observed is just 31%.This result is expected, since the core components and the number of model parameters remain constant with the addition of new nodes, with the exception of the adjacency matrix and attention heads, which are extended, leading to a logarithmic increase in space.These results show that GAP-LSTM makes an efficient use of computational resources, resulting in a minimal overhead with the increase of network size, which is a typical scenario in

V. CONCLUSION
This article proposes GAP-LSTM, a novel method for geo-distributed forecasting that focuses on exploiting spatio-temporal autocorrelation in multivariate data generated by multiple nodes.Existing approaches for this challenging task either do not simultaneously take into account the spatial and the temporal dimensions of data, or do not preserve the learned spatio-temporal patterns throughout the entire downstream forecasting task.Our method leverages the synergic interaction of graph convolution, attention-based LSTM, 2-D convolution, and latent memory states to overcome these limitations.
An extensive evaluation involving real-world datasets on traffic, energy, and pollution domains shows that GAP-LSTM outperforms state-of-the-art methods.An ablation study shows that all components bring a positive contribution to this outcome.The method also provides a visualization that allows domain experts to gather additional insights about predictions and supports them in their decisions.In future work, we aim to explore the adaptability of our method in applications with multiple correlated nodes and without an explicit geodistributed network.Moreover, we will investigate the explicit treatment of specific forms of spatio-temporal autocorrelation in the loss function of the model.Finally, we aim to assess our model's accuracy in other domains where very short-term and long-term forecasting horizons are required.

Fig. 3 .
Fig. 3. Graphical workflow of the geo-distributed forecasting task.Historical time series and the graph structure defined by geographical coordinates of the nodes (latitude, longitude) are provided as input to train the GAP-LSTM model, which extracts forecasts for the next p time points.

Fig. 4 .
Fig. 4. Spatial autocorrelation according to the global Moran's I statistic for different models.(a) CNN-LSTM, (b) GCN-LSTM, (c) ESG, and (d) GAP-LSTM.The plot shows one line for each training sequence s k , where the first value is calculated on the original raw input data at timestep 0, and the following values are obtained at each model's encoder hidden states for timesteps 1, 2, . . ., 19, corresponding to observations from 2:00 A.M. to 8:00 P.M. in the PV Italy dataset.For ESG, the x-axis labels differ from the other methods and denote intervals rather than timesteps due to the temporal dilation performed by the method.

Fig. 5 .
Fig. 5. Global multihead attention mechanism in GAP-LSTM.This diagram represents the computations for a single attention head head n .

Fig. 6 .
Fig. 6.Interpretable output generated by GAP-LSTM for a given prediction horizon P. The figure illustrates the relevance of encoder hidden states during the decoding stage (prediction) in the attention mechanism by means of p heatmaps.Each heatmap (N × T ) shows which nodes (1, . . ., N , 5 in each heatmap) and timesteps (1, . . ., T , 24 in each heatmap) are more relevant (light color) and less relevant (dark color) for the sequence time point subject to prediction.Technically, the figure shows (a) difference between attention weights for each decoding step and the average attention weights for the whole sequence, (b) difference between attention weights for a decoding step and the average attention weights computed for all prediction sequences, and (c) raw attention weights for each decoding step for this sequence.Each row of heatmaps is associated with a single forecasting timestep ŷ(k,T + p) .

Fig. 7 .
Fig. 7. Real photographs for the different datasets considered in our experiments: solar plant arrays in the south of Italy (PV Italy), 3-MW wind turbines (Wind NREL), traffic lanes in the San Francisco Bay Area (PEMS SF Weather), and Beijing's Municipal Environmental Monitoring Center (Beijing Air Quality).

Fig. 8 .
Fig. 8. Predicted (red) versus actual (blue) curves for different nodes on one selected test day in different datasets.(a) PV Italy.(b) Lightsource.(c) Beijing Air Quality.(d) Wind NREL.Each line is the time series for a single node.The bolded line represents the average across all nodes.

Fig. 9 .
Fig. 9. Visualization of the decoder hidden states extracted with t-SNE for a randomly sampled predicted sequence of the Lightsource dataset at different hours.(a) 4 A.M., (b) 9 A.M., (c) 2 P.M., and (d) 7 P.M., and (e) corresponding closeness heatmap for all nodes.Distances between nodes in (a)-(d) resemble physical closeness relationships among nodes in (e).

Fig. 9 (
Fig. 9(e): lighter colors correspond to pairs of nodes with higher values of closeness.For instance, nodes 2 and 3 present a value close to 0.8 in Fig. 9(e), and appear systematically close to each other (red and green points) in Fig. 9(a)-(d) (RQ3).

TABLE II DATASETS
ANALYZED IN OUR EXPERIMENTS

TABLE III EXPERIMENTAL
RESULTS FOR ALL THE METHODS AND DATASETS UNDER EVALUATION IN TERMS OF MAE, SMAPE, AND RMSE, WITH THE RESPECTIVE VARIANCE IN PARENTHESIS.THE BEST-PERFORMING METHOD FOR EACH DATASET ACCORDING TO THE RMSE METRIC IS MARKED IN BOLD

TABLE V ABLATION
STUDY.THE COMPLETE MODEL IS COMPARED WITH THREE VARIANTS WHERE THE 2-D CONVOLUTION, THE ATTENTION MECHANISM, AND THE SECOND PART OF THE GCN-LSTM CELL ARE MUTUALLY EXCLUDED.THE BEST-PERFORMING METHOD FOR EACH DATASET ACCORDING TO THE RMSE METRIC IS MARKED IN BOLD.THE

TABLE VI EXECUTION
TIMES FOR THE WHOLE TRAINING PROCESS FOR ALL THE METHODS AND DATASETS.BAQ DENOTES THE BEIJING AIR QUALITY DATASET

TABLE VII EMPIRICAL
ANALYSIS ON GAP-LSTM SPACE AND TIME COMPLEXITY, INCLUDING THE AMOUNT OF MEMORY, TIME, RELATIVE TIME (WITH RESPECT TO THE SMALLEST EXPERIMENT WITH FIVE NODES), AND TIME-PER-NODE (TPN) WITH AN INCREASING NUMBER OF NODES IN THE GRAPH

TABLE VIII NUMBER
OF TRAINABLE PARAMETERS FOR EACH MODEL, MEASURED ON LIGHTSOURCE WITH A BATCH SIZE OF 16 FOR ALL MODELS.THE NUMBER OF PARAMETERS PRESENTS MINIMAL VARIATIONS ACROSS DATASETS real-world sensor networks.Additional information pertaining to model complexity is shown by the number of trainable parameters, as shown in Table VIII.Results highlight that GAP-LSTM is competitive in terms of model complexity (in terms of the number of parameters) when compared to its direct competitors.