Graph Neural Network for Air Quality Prediction: A Case Study in Madrid

Air quality monitoring, modelling and forecasting are considered pressing and challenging topics for citizens and decision-makers, including the government. The tools used to achieve the above goals vary depending on the opportunities provided by technological development. Much attention is currently being paid to machine learning and deep learning methods, which, compared to domain knowledge methods, often perform better in terms of capturing, computing and processing multidimensional information and complex dependencies. The technique introduced in this work is an Attention Temporal Graph Convolutional Network based on a combination of Attention, a Gated Recurrent Unit and a Graph Convolutional Network. In the framework of the current study, it is initially suggested to use the presented approach in the domain of air quality prediction. The proposed method was tested using air quality, meteorological and traffic data obtained from the city of Madrid for the periods January-June 2019 and January-June 2022. The evaluation metrics, including Root Mean Square Error, Mean Absolute Error and Pearson Correlation Coefficient, confirmed the proposed model’s advantages compared with the reference models (Temporal Graph Convolutional Network, Long Short-Term Memory and Gated Recurrent Unit).


I. INTRODUCTION
Air pollution's consequences seriously impact the world's population's health and the ecosystem by affecting the single element and components of them. Air pollution is the fourth biggest global risk factor for human health [1]. It is responsible for about 16% of all deaths worldwide [2], in particular, 1.6 million death in China [3]. The World Health Organization (WHO) air quality guidelines report that about 90% of the world's citizens live in areas where air pollution exceeds established thresholds [4]. Monitoring and improving air quality are considered one of today's biggest challenges. Advanced knowledge of air quality concentrations can enable decision-makers to take appropriate action in order to reduce air pollution and its detrimental impacts. At its core, The associate editor coordinating the review of this manuscript and approving it for publication was Giacomo Fiumara . air quality is extremely complex and influenced by many factors.
Several approaches are employed to model and forecast air quality. These approaches can be divided into two categories, namely, domain knowledge-based and data-driven. Studies have indicated that the first group has limitations in capturing non-linear dependencies. They mainly simplify the existing relationship between concentration and affected factors [5]. The second category, which includes machine learning techniques, has proven to be efficient in collecting and processing complex dependencies across scales from high-dimensional datasets, including interactions and non-linear relationships and intrinsic features that control and form pollution [6], [7]. Air pollution has spatial and temporal dependencies, i.e. the concentration depends on many factors, including local climatic conditions and air pollutants, which fluctuate over time. Thus, it is vital to conduct a spatiotemporal analysis in order to capture and process all the above dependencies. VOLUME 11, 2023 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ The main objective of the current study is to conduct spatiotemporal prediction of nitrogen dioxide (NO 2 ) in the city of Madrid using air quality, meteorological and traffic data. Some of our previous studies were also devoted to solving the above tasks [8], [9]. However, the approaches implemented in those works require the input to have a Euclidean or grid-like structure. Reconstructing input data as a grid is a tedious and time-consuming task. Also, depending on the distribution of the data, the resulting grid may contain many empty cells without any data, which is likely to degrade the performance of the predictive analysis. One of the most advanced methods proposed in this research is an Attention Temporal Graph Convolutional Network (A3T-GCN), which processes non-Euclidean structured data as input. A3T-GCN allows efficient collection of spatiotemporal information through a model architecture based on a combination of Attention, Gated Recurrent Unit (GRU) and Graph Convolutional Network (GCN). The first two blocks are responsible for collecting temporal information and the GCN is responsible for spatial dependencies. Furthermore, the flexibility of a graph neural network enables it to be applied in many areas, especially those where the data is at a non-Euclidean distance. The following are the significant contributions addressed within the scope of this work: • We conducted spatiotemporal prediction of NO 2 using a graph neural network, namely A3T-GCN.
• We compared the proposed method with reference methods (Temporal Graph Convolutional Network (TGCN), Long Short-Term Memory (LSTM) and GRU) in terms of defined evaluation metrics (Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Pearson Correlation Coefficient (R)). The remainder of the paper is structured as follows. Section II introduces the latest studies in air quality prediction from a spatiotemporal perspective. The case study along with the data employed is described in Section III. Section IV defines the concept of a graph neural network and provides details of the construction of the proposed method. A thorough explanation of the experimental analyses and the results obtained can be found in Section V. Finally, Section VI summarises the results and proposes further research directions.

II. RELATED WORK
Many methods have been used to implement air quality prediction in the spatiotemporal dimension. For example, in the following study [7] LSTM integrated with Multi-Output and Multi-Index of Supervised Learning was proposed. It was evaluated using the datasets for Beijing from 1 January 2016 to 31 December 2017. Zhao et al. [10] proposed the spatiotemporal method called Spatiotemporal Convolutional Neural Network combined with LSTM, which was implemented on data generated by a three-dimensional structure called Relevance Data Cube using a clustering algorithm, time sliding windows and correlation analysis of factors. The authors, in their analysis, followed the concept of the regional air quality prediction problem. Abirami and Chitra [11] implemented a hierarchical deep learning model, DL-Air, which consisted of three blocks, including encoder, Spatiotemporal Association Analysis LSTM and decoder components. Another model, named Ensemble LSTM [12] was applied to predict particulate matter with a diameter of less than 2.5 micrometres (PM 2.5 ). It consists of Ensemble Empirical Mode Decomposition (EEMD) and LSTM. EEMD is responsible for multi-modal feature extraction and estimated integration, and LSTM is responsible for multi-modal feature learning. Huang et al. [13] to predict PM 2.5 implemented a method consisting of Empirical Mode Decomposition (EMD) and GRU.
In addition, many authors have recently been drawn to the advantages of a graph neural network in areas such as traffic flow prediction [14], [15], [16], [17], parking availability prediction [18], pedestrian trajectories prediction [19], [20], urban vehicle emission prediction [21], wind speed prediction [22], weather prediction [23] and solar irradiance prediction [24]. The air quality domain, among others, has also benefited from these advantages, and various authors have used graph neural networks to forecast air quality. Han et al. [25] proposed the Self-Supervised Hierarchical Graph Neural Network based on cities→functional zones→regions hierarchical graph network to perform fine-grained air quality prediction implemented on datasets for the Beijing-Tianjin-Hebei and the Pearl River Delta urban agglomerations. Ram et al. [26] proposed a Dual GCN (DGCN) and LSTM network combined with a wireless sensor network and Internet of Things (IoT) to perform Air Quality Index (AQI) predictions; in particular, DGCN helps to process the data from the sensors that were later learned by the graph LSTM. Xiao et al. [27] offered a Dual-path Dynamic Directed GCN (DP-DDGCN) based on the combination of dual-path dynamic directed graph blocks and GRU. Ouyang et al. [28] proposed a Spatiotemporal Dynamic GCN (ST-DGCN) based on a time-varying dynamic adjacency matrix to predict PM 2.5 . Ge et al. [29] offered a Multi-scale Spatiotemporal GCN (MST-GCN) including a multi-scale block, several spatiotemporal blocks and a fusion block to forecast air quality. Wang et al. [30] used Attentive Temporal GCN to model inter-station relationships (spatial adjacency, functional similarity and temporal pattern similarity) to predict air quality. To forecast nationwide city air quality, Chen et al. [31] proposed the group-aware graph neural network using the Chinese city air quality dataset. Xu et al. [32] performed air quality forecasting based on a hierarchical graph neural network; in particular, city-level and station-level graphs were constructed using the Yangtze River Delta city group's dataset. The authors developed two strategies, upper delivery and lower updating, to implement the inter-level interactions and introduce a messagepassing mechanism to implement the intra-level interactions. Another study is devoted to comparing graph-based and non-graph-based models for PM 2.5 prediction under distribution shift [33]. Le [34] for efficient learning spatiotemporal characteristics of air quality values, and related factors Spatiotemporal Graph Convolutional Recurrent Neural Network was used.
Zhao et al. [35] introduced a novel model based on the combination of air quality spatiotemporal network and GCN for PM 2.5 prediction. Gao and Li [36] proposed a graph-based LSTM model to perform spatiotemporal prediction of PM 2.5 concentration. Zhang et al. [37] used a Temporal Attention network with domain-specific graph regularisation in order to improve PM 2.5 prediction. Wang et al. [38] developed a new model called PM 2.5 -GNN to capture fine-grained and long-term influences in the PM 2.5 process. Zhao and Zettsu [39] proposed Multi-Attention Spatiotemporal Graph Networks to predict the concentration of PM 2.5 , ground-level ozone (O 3 ) and particulate matter with a diameter of less than 10 micrometres (PM 10 ). Qi et al. [40] implemented spectral GCN combined with LSTM using historical data for the last 24 h to forecast the PM 2.5 concentration for the next 1 h, 2 h, 4 h, 8 h, 12 h, 24 h, 48 h and 72 h. Huang et al. [41] implemented a Spatio-Attention embedded Recurrent Neural Network to predict air quality using Beijing's air quality datasets; to capture spatial patterns, a self-loop-normalised adjacency matrix was used. Lin et al. [42] proposed the Geo-context based Diffusion Convolutional Recurrent Neural Network to predict PM 2.5 . The geo-context segment was implemented by building a graph that allowed information to be collected in the spatial dimension, and a Diffusion Convolutional Recurrent Neural Network was responsible for collecting information in the temporal dimension.
The overall picture of the publications related to the implementation of a graph neural network for air quality prediction can be seen in Table 1 Year: year of publication of the works. As can be seen, interest in the topic began quite recently, since 2018, in particular, the main peak came in 2021, when ten out of eighteen extracted works were published in 2021.
Method: implemented methods to perform the prediction. As shown, most studies involve GCN combined with Recurrent Neural Network (RNN) such as GRU or LSTM. Recently, the integration of the attention-based network is also increasing.
Edge Weight, Dynamic/Static, Directed/Undirected: to find out more information about the structure of the graph, information about the edge weight, dynamics and direction was extracted. Figure 1 shows the distribution of publications for each feature. It is noticeable that most of the papers used graphs consisting of weighted edges (seventeen out of eighteen). In terms of dynamic status, most of them are static (fourteen out of eighteen), and in terms of direction, most studies used undirected graphs (twelve out of eighteen). Target: are predictable pollutants. The following pollutants were taken into account: PM 2.5 (fourteen papers), PM 10 (four papers), AQI (three papers), NO 2 (two papers), O 3 (two papers), carbon monoxide (CO (one paper)). The most used pollutant was PM 2.5 . Figure 2 shows the distribution of the prediction target over time. It can be seen that PM 2.5 has been in use since 2018. From 2020, additional targets are included, and in 2022, studies attempt to predict all the aforementioned prediction targets. Dataset: datasets used for predictive analysis. The following datasets are used: air quality (eighteen papers), spatial (the location of air quality monitoring stations; (eighteen papers)), meteorological (seventeen papers), points of interest (five papers), traffic (two papers), road network (two papers) and geographic data (land uses, roads, water areas, buildings; (one paper)). Air quality datasets and spatial datasets were most commonly used, which is logical, since the main task was air TABLE 1. Features of the papers devoted to the implementation of graph neural networks for air quality prediction (*). VOLUME 11, 2023 quality forecasting, and since the main focus of the work was on the implementation of a graph neural network, the location of monitoring stations is the main base for constructing the graph. The next most commonly used dataset is meteorological data, due to the strong correlation between air quality and meteorological data.

2732
It is also very interesting to see the distribution of the datasets in chronological order. Air quality, spatial and meteorological data are included for all years ( Figure 3). In recent years, in particular 2021 and 2022, the analysis began to include also points of interest, traffic and road network data. Evaluation Metric: is a metric for measuring model performance. The following metrics were used, including RMSE (seventeen papers), MAE (sixteen papers), Coefficient of Determination (three papers), False Alarm Rate (three papers), Accuracy (two papers), Mean Absolute Percentage Error (two papers), Index of Agreement (two papers), Critical Success Index (two papers), Probability of Detection (two papers), Symmetric Mean Absolute Percentage Error (two papers), train loss (two papers), test loss (two papers), validation loss (two papers), Spatiotemporal RMSE (one paper), Mean Square Error (one paper), and Recall Rate (one paper).
Compared to the above studies, the following are the main contributions made by the current work: 1) Including traffic, air quality and meteorological data; 2) Conducting spatiotemporal prediction of NO 2 in different time intervals, including 1-12 h, 12-24 h, 24-36 h and 36-48 h; 3) Comparing the proposed method with reference methods in terms of evaluation metrics that measure model performance (RMSE, MAE, R). Furthermore, since the suggested A3T-GCN method is primarily used to predict traffic speed [43], the current research can be considered a somewhat groundbreaking contribution to the domain of air quality prediction.

III. DATA
The dataset used in this study consists of air quality, meteorological and traffic data from the period January-June 2019 (training set) and January-June 2022 (testing set), and the location of air quality and meteorological monitoring stations and traffic measurement points of the city of Madrid. The data were obtained from the Open Data portal of Madrid City Council. 1 In addition, there are twenty-four air quality control stations, twenty-six meteorological control stations and more than 4,000 traffic measurement points. Note that the location of the traffic measurement points is not fixed, and the Open Data portal of Madrid City Council provides the location files for each month. The files for January-June 2019 and January-June 2022 were used in the current analysis. In Figure 4, for illustration purposes, only locations for January 2019 are used, which is also indicated in the caption.
The datasets used in the current work are presented below ( Table 2 shows summary statistics of each type of data for the periods used in the analyses): • Air Quality Data -NO 2 (µg/m 3 ).
• Traffic Data -since the attributes of the traffic data can be specific to a certain area, the selected traffic attributes are shown below with their definitions for the city of Madrid.
-Intensity -the intensity of the measurement point in a period of 15 minutes (vehicles/hour). parameter represents an estimate of the degree of congestion, calculated from an algorithm that uses intensity and occupancy as variables, with certain correction factors. It establishes the degree of road use in a range from 0 (empty) to 100 (collapse). -Average traffic speed -an average speed of the vehicles in a period of 15 minutes (km/h). Only for M30 intercity measuring points. Although the traffic data is recorded every 15 minutes, since the NO 2 and meteorological data are at hourly rates, the traffic data was filtered, and only hourly records were selected (for example, with entries at 13:00, 13:15, 13:30, 13:45 and 14:00, we selected the entries at 13:00 and 14:00 and the same logic was applied for the entire period).
It should be noted that the original datasets also contained ultraviolet radiation (Mw/m 2 ) and precipitation (l/m 2 ), which were removed after exploratory data analysis. Regarding wind direction, it was converted to the following categories: north, east, south, west, southwest, northeast, southeast, northwest, and then underwent the implementation of One Hot Encoder 2 [44].   3 The grid was created with the help of ArcPy package, 4 specifically with the CreateFishnet function. 5 Altogether, there are 340 cells (20 by 17) that cover 56.27% of the area of the city of Madrid (Figure 4). The reason for selecting this area was to have a minimum extent to encompass all air quality monitoring stations. The value of each cell consists of the values of NO 2 , meteorological and traffic attributes obtained from assigned stations covered by that cell at a certain time. Cells with no stations or measurement points were assigned a value of zero, whereas cells with more than one station or measurement point were assigned the mean value of all existing values in the given cells. It can be observed that only twenty-four cells out of 340 have NO 2 data. After combining all the features, the twenty-four cells mentioned above were selected for further analysis, particularly for use as input for the proposed method working with non-Euclidean distances. As a result, the input data has the dimensions mentioned in Table 3. Regarding NO 2 , Figure 5 shows the time series of NO 2 during January-June 2019 and January-June 2022. It can be seen that it decreases over time, which might be attributed to domestic heating use throughout the winter. Moreover, the overall concentration during 2022 is lower than that for the same period in 2019, which can be explained by the constraints enforced to control the spread of coronavirus disease 2019 (COVID-19). From Figure 6 it can be seen that maximum values during January-June 2019 are detected around 300 µg/m 3 and the highest concentration was detected at the station with id 72 (328 µg/m 3 at the following time: 2019-01-14 19:00:00). During the 2022 period, the greatest value was identified in the station with id 47 (625 µg/m 3 at the following time: 2022-04-26 14:00:00), which most probably is an outlier.
Another exploratory analysis was performed to detect the correlation between the time series of NO 2 at the stations 3 Projected coordinate system: https://epsg.io/25830 4 ArcPy package: https://bit.ly/3UPYKjy 5 Create Fishnet (Data Management): https://bit.ly/3Rn62Zj  ( Figure 7). During the 2019 period, it can be seen that the stations are correlated, except for the station with id 323, which has a lower correlation than the others. This can be explained by the station's location, which is relatively remote from the others. Furthermore, during 2022, in addition to the station with id 323, the station with id 141 was also found to be less correlated. Looking at Figure 8, which shows the time series of NO 2 concentration at station 141, it can be seen that there is no data for the end of January, as well as for the entire period of March and April, which may be due to a sensor malfunction.

IV. METHODOLOGY A. GRAPH NEURAL NETWORK
A graph can be designated as G = (V , E), where |V | = N is the number of nodes and |E| = N e is the number of edges. A ∈ R N ×N is the adjacency matrix [45]. The graphs are categorised into directed and undirected, based on the connection of the edges. In the case of directed graphs, edges are directed from one node to another, while edges in undirected  graphs are connected between nodes without specifying the direction. Undirected graphs are sometimes considered to be two directed edges.
Another classification divides them into weighted and unweighted graphs. An unweighted graph only shows whether two nodes are connected or not (in the case of being connected, it is assigned 1, otherwise 0). In contrast, the weighted graph provides additional information about the connected edges; for example, in this work, the distance between stations is assigned as weights.
Regarding learning tasks, there are three categories: nodelevel (node classification, node regression, node clustering), edge-level (edge classification, link prediction) and graphlevel (graph classification, graph regression, graph matching). The graph considered in the scope of this study is an undirected weighted graph, and the learning task is a node regression since the main objective of this work is to predict the concentration of NO 2 in each station in a given time interval.

B. ATTENTION TEMPORAL GRAPH CONVOLUTIONAL NETWORK
The model proposed in the current work is the A3T-GCN, which is the combination of the Attention, GRU and GCN methods ( Figure 9) [43]. The GRU and Attention mechanism are responsible for temporal aggregation and GCN deals with spatial aggregation.
Graph Convolutional Networks: There are two types of GCNs: Spatial GCN and Spectral GCN [46]. To learn graphs, spatial GCN uses spatial features. It defines convolutions on spatially close neighbours. It generates v i node's representation by aggregating its own features X i and neighbours' features X j . As an aggregation function is used mean, sum or max functions. Afterwards, a non-linear transformation is applied to the outputs. While in the case of spectral GCN, it defines graph convolutions using filters from the perspective of graph signal processing. Spectral GCN is a combination of the following steps: 1) converting the graph into the spectral domain with the help of eigendecomposition, 2) applying eigendecomposition to the specified kernel, 3) multiplying spectral graph and spectral kernel, and 4) returning the results in the original spatial domain.
The one used in this work was Spectral GCN, which can be defined as the multiplication of a filter g θ with signal x in the Fourier domain.
where θ is a model parameter, L is the graph Laplacian matrix (Eq. 2) and U is the eigenvector of the normalised where I N ∈ R N ×N is the identity matrix, D ∈ R N ×N is the diagonal degree matrix and λ is the diagonal matrix of the eigenvalues of the Laplacian matrix. Gated Recurrent Unit: GRU is a type of recurrent neural network introduced by Cho et al. [47]. It can be defined with the following equations: where x t is the input vector at the current time step, z t is the update gate, r t is the reset gate, h t is the current memory content, h t−1 is the hidden state at the previous time step, h t is the hidden state at the current time step and is the Hadamard product.
Attention: The Attention model focuses on a few relevant things in the complex input while ignoring others in networks. Bahdanau et al. [48] proposed the attention mechanism in order to overcome the drawbacks of RNN, in particular, the inability to remember longer sequences. The equation defining the attention model is shown below.
where c i is the context vector, a ij is the weights and h j is the hidden state. a ij can be calculated with the following equations.
where e ij is the output score of a feedforward neural network (alignment scores).
There are several categories of the Attention mechanism. Soft Attention is the type used in this study (soft Attention refers to the function that varies smoothly over its domain and, as a result, can be differentiated).

C. EVALUATION METRICS
To evaluate the performance of the models, the following metrics were used in the ongoing work, including RMSE, MAE and R. RMSE measures the geometric difference between the estimated and actual values, and is susceptible to significant errors; MAE measures the average magnitude of the errors, and R measures the strength and the direction of a linear relationship between two variables.
where n is the number of instances and E i and A i are the estimated and actual values. Regarding RMSE and MAE, the lower the value is, the better the prediction will be. In terms of R, the values range from −1 to +1, where −1 means a perfectly linear negative correlation, +1 means a perfectly linear absolute correlation, and 0 means no correlation.

D. GRAPH CONSTRUCTION
Following the definition of the graph structure, in this work, air quality stations will be considered graph nodes. All stations are interconnected, forming graph edges, and the distances between them will be considered edge weights. VOLUME 11, 2023 The distance between nodes was calculated using the arcpy.analysis.GenerateNearTable 6 function. It should be mentioned that to create the adjacency matrix, the original distance between two nodes was converted to 1/distance (Eq.13), so if the distance is large, the division will be smaller, and this will give little weight to a certain edge, which matches the graph logic since closer nodes have more influence on each other than remote nodes.
where d ij is the distance between i and j stations.
Regarding node features, all variables associated with each station will be considered node features; in our case, for each time t the node features can be assigned as X t ∈ R N ×M , where N is the number of nodes and M is the features. Figure 10 shows the graph constructed on the basis of air quality stations. It consists of 24 nodes and 276 edges (connecting each pair of nodes). The numbers on the nodes are the identifier of each cell of the grid that was initially given, which contains a certain station. Algorithm 1 shows the procedure for creating a graph network on the map. return Draw a Network between each pair of points with all combinations using the arcpy.management.XYToLine 8 function end function Output: Figure 10 The prediction of NO 2 was performed on the basis of different time granularities, in particular, using the previous 12 hours to predict the concentration in the next T hours. The following time intervals have been defined as the value of T : 1-12 h, 12-24 h, 24-36 h and 36-48 h. In the mathematical expression, the procedure above can be defined as a function of the air quality stations network G and the feature 6 Generate Near  matrix X (Eq. 14).
[X a t+1 , . . . , X a t+T ] = f (G; (X t−n , . . . , X t−1 , X t )) (14) where T is the next hours, n is the previous hours, X a t is the concentration of NO 2 at time t, and X t is a combination of NO 2 , meteorological and traffic data. Each sample of the input has the following structure: where 24 is the number of nodes, 18 is the number of features of each node, T is equal to 12, [2,552] from edge_index refers to the fact that every edge was considered twice (276*2=552). Algorithm 2 shows the procedure of input data preparation for the graph neural network.
The parameters and settings applied in the analysis are summarised in Table 4. Regarding the structure of the proposed model, it consists of three graph convolutional layers (the output of the layers is used in GRU: update gate, reset gate and hidden state) followed by learnable transformations.
Regarding the reference models, we used a TGCN, LSTM and GRU. Below is demonstrated the architecture of each of them. for each hour ∈ Period do 3: Create grid with CreateFishnet function (ArcPy library) 4: Add field to the created grid 5: for each item i ∈ Data do 6: i spatial join with grid 7: input the mean of the values of each corresponding cell to the field 8:  16: calculate distance between each pair of air quality stations using arcpy.analysis.GenerateNearTable 17: return adjacency matrix filled with inverse distance (1/distance) between each pair of nodes 18: end function 19: function NORMALISE DATA(data) 20: return normalised data using Z-Score method 21 [24, 18, T], edge_index= [2,552], edge_attr= [552], y= [24, T], batch=[64])) 24: end function Output:Dataset with 8687 samples (each sample: Data(x= [24, 18, T], edge_index= [2,552], edge_attr= [552], y= [24, T], batch=[64])) TGCN [49]: is a combination of GCN and GRU. The architecture is composed of three graph convolutional layers (the output of the layers is used in GRU: update gate, reset gate and hidden state) followed by learnable transformations. Regarding the hidden unit, it was assigned to 256.
LSTM [50]: It consists of the fully connected layer with 432 units (24*18), followed by the LSTM layer with 512 units, and then the models were finalised with another fully connected layer with 24 units (representing NO 2 for all stations).
GRU [47]: It consists of the fully connected layer with 432 units (24*18), followed by the GRU layer with 512 units, and then the models were finalised with another fully connected layer with 24 units (representing NO 2 for all stations).
It should be mentioned that the analysis was performed in the Google Colab cloud service 9 using the PyTorch Geometric Temporal library [51] (the source code can be accessed at the GitHub repository 10 ).

V. EXPERIMENTAL ANALYSIS AND RESULTS
This section presents a detailed explanation of the experimental analysis and results obtained. The workflow is shown in Figure 11. In the first stage, the data were integrated with the spatiotemporal dimension in a defined area, followed by the graph construction, which has already been introduced in the previous chapter. The next block is feature engineering, consisting of transformation and scaling steps.
Transformation: this step refers to generating data based on the defined time interval. The defined time interval is 1-12 h, 12-24 h, 24-36 h and 36-48 h.
Scaling: before converting the data into a graph construction, the input data were standardised (the standardisation is also called Z-score; Eq. 15).
where µ is the mean, and σ is the standard deviation. The next block is the architecture of the proposed model that returns the results of the experimental analysis demonstrated in the next part.
This part illustrates the output of the analysis. The results of the analysis are shown in Table 5 (the best results are  indicated in bold). Algorithm 3 provides the pseudo-code of the NO 2 prediction procedure.
Looking at the results in Table 5, first of all, it can be seen that for the A3T-GCN, TGCN and LSTM models, the time interval of 1-12 hours is superior to other time intervals in terms of all three evaluation metrics, and in the case of GRU, the leading time interval is 12-24 hours in terms of RMSE and MAE, and 1-12 hours in terms of R.
Regarding individual model performance, the A3T-GCN outperformed all three reference models. Especially, in terms of RMSE, the proposed method ( return error estimated with evaluation metrics (RMSE, MAE, R) end function Output: RMSE, MAE, R for A3T-GCN, TGCN, LSTM and GRU (Table 5) with the unit of the target variable (NO 2 : µg/m 3 ). Therefore, based on the results obtained (RMSE-19.14 µg/m 3 , MAE-15.33 µg/m 3 ), the proposed method can be considered sufficient compared with the mean values of NO 2 (36.69 and 27.96 for the period 2019 and 2022, respectively).
It is important to mention that when comparing only reference methods between them, it can be noticed that TGCN outperforms the other two methods (LSTM and GRU). Especially in terms of RMSE, TGCN (21.48 µg/m 3 ) outperformed LSTM (22.33 µg/m 3 ) by 3.81%, and GRU (22.29 µg/m 3 ) by 3.63%. In terms of MAE, TGCN (16.24 µg/m 3 ) outperformed LSTM (16.70 µg/m 3 ) by 2.75%, and GRU (16.97 µg/m 3 ) by 4.3%. Since TGCN is also a graphbased method, based on these findings, the advantage of a graph-based method with the ability to capture spatial dependencies in addition to temporal dependencies can be highlighted.

VI. CONCLUSION AND FUTURE WORK
The spatiotemporal prediction of air quality has attracted the attention of many authors. Having access to a large amount of data creates challenges for their proper integration. New technologies with constant development offer new approaches to conducting spatiotemporal analysis. The approach proposed in this study is A3T-GCN. The main goal is to predict NO 2 using data from Madrid air quality monitoring stations combined with meteorological and traffic data from January 2019 to June 2019 (training set) and January 2022 to June 2022 (testing set). RMSE, MAE and R were applied as evaluation metrics. The results highlighted the superiority of the proposed method over the reference methods (TGCN, LSTM and GRU). In terms of RMSE, the A3T-GCN outperformed TGCN by 10.89%, LSTM by 14.29%, and GRU by 14.13%. In terms of MAE, the A3T-GCN outperformed TGCN by 5.6%, LSTM by 8.2%, and GRU by 9.7%. In terms of R, the A3T-GCN outperformed TGCN by 16.95%, LSTM by 3.39%, and GRU by 5.08%.
Regarding future work, it would be interesting to implement this approach in other regions and see if it can be generalised in terms of spatial dimension. The final performance is likely to be affected by the spatial peculiarities of different regions, as well as the distance between stations, the number of stations, and the available features. Another extension can be related to the approach to graph construction. Considering that an undirected graph was used in this work, it would be advantageous to use a directed graph, since the importance of the node V i on V j is different from that of V j on V i . It would also be preferable to consider the topology, buildings and infrastructure connecting the two nodes in relation to the weighted edges that were created by the inverse distance between the two nodes. The constraints related to the construction of the graph can be addressed in future work. Additionally, it is advisable to change the architecture of the proposed method and stack more layers, which will probably improve the results. Another improvement could be to include data for a more extended period.
Regarding the limitations, the proposed method requires a relatively long time for the data to converge. As regards experimental analysis, it is worth mentioning the limitation associated with Google Colab Pro as it has limitations in terms of resources 11 .