Short-Term Weather Forecasting Using Spatial Feature Attention Based LSTM Model

Weather prediction and meteorological analysis contribute significantly towards sustainable development by reducing the damage from extreme events, which could otherwise set back development progress by years. The change in surface temperature is one of the important indicators for detecting climate change. In this research, we propose a novel deep learning model named Spatial Feature Attention Long Short-Term Memory (SFA-LSTM) to capture accurate spatial and temporal relations among multiple meteorological features to forecast temperature. Spatial feature and temporal interpretations of historical data, aligned directly with the output feature, help the model forecast accurately. The spatial feature attention captures the mutual influence of the input features on the target feature. The model is built on an encoder-decoder architecture, where temporal dependencies in the data are learnt using LSTM layers in the encoder phase and spatial feature relations in the decoder phase. SFA-LSTM forecasts temperature by simultaneously learning the most important time steps and weather variables. Compared with baseline models, SFA-LSTM maintains state-of-the-art prediction accuracy while offering the benefit of spatial feature interpretability. The learned spatial feature attention weights are validated against the magnitude of correlation with the target feature obtained from the dataset.


I. INTRODUCTION
Artificial Intelligence plays an important role not only in achieving sustainable development goals with respect to the economy and society but also in achieving sustainable environmental goals: protecting and preserving biodiversity, studying climate change, predicting extreme climatic conditions [1], evaluating ocean health [2], weather forecasting [3], [4], [5] and preventing the spread of diseases [6], [7]. Now more than ever, environmental sustainability is becoming extremely crucial. The provisional World Meteorological Organization (WMO) State of the Global Climate 2021 report draws on recent evidence to show how our earth is changing before our eyes. Weather prediction and meteorological analysis contribute significantly towards sustainable development by reducing the damage from extreme weather events and decreasing weather-related losses, including the protection of habitat, livelihood and economy, which could otherwise set back development progress by years. (The associate editor coordinating the review of this manuscript and approving it for publication was Li He.)
Weather forecasting is the prediction of weather conditions for a given location and time through the application of science, technology and the principles of physics. Meteorological features such as atmospheric pressure, temperature, humidity, wind speed and precipitation of a given location, collected over a time frame, provide quantitative data describing the state of the atmosphere at that point in time, which is used to understand atmospheric processes and to forecast the future atmospheric state. Weather forecasting helps to plan for the outcomes and influence of future weather conditions in our day-to-day activities. The ability to detect impending snow, rain, heat waves and floods helps the public and government to plan for and prevent their dreadful consequences.
The information about future weather conditions helps to maintain commercial, economic, environmental and social interests. For example, weather forecasts help farmers plan their harvests and workload, utility companies purchase sufficient supplies of power and natural gas, inventories and stores match the demand and supply of resources, the public plan their outdoor activities, and the government communicate weather warnings to the general public in sufficient time to protect life and property.
In recent years, the number of climate monitoring systems has increased, providing large amounts of hourly, daily, weekly, monthly and yearly weather-related information as open, transparent data. This data is stored and shared so that other departments can utilize it by efficiently analyzing weather forecasts. This research is aimed at developing a machine learning platform for predictive modeling in the context of sustainable environmental management.
The proposed work aims to accelerate the discovery of new knowledge and optimize decision-making in sustainable environmental management. For that purpose, we design and implement a machine learning (ML) pipeline that incorporates the necessary modules for data-driven, accurate and effective weather forecasting. For effective forecasting, it is necessary to identify the interactions between meteorological features that indirectly contribute to climate change. An emphasis is placed on temperature forecasting and on building a deep neural network model that forecasts weather while simultaneously learning the interactions of different predictor variables. Therefore, in this paper, we propose a model for successful weather forecasting that considers the mutual influence of various meteorological features on the target weather feature to be forecasted.
The major research contributions of our work are as follows: • The proposed SFA-LSTM model is novel for multiple-input-single-output predictions in the context of spatial feature interpretability in time series prediction. To the best of our knowledge, this is the first spatial feature time series prediction model where the spatial feature attention weight is aligned directly with the output feature.
• The model is trained to capture temporal patterns across multiple time steps and spatial feature interactions across multiple predictors to forecast the target variable. The target feature learns from both temporal and spatial feature contributions.
• The spatial feature attention mechanism is designed to grasp the quantitative mutual influence of the input features on the target feature.
• The proposed model provides meaningful spatial feature interpretations, which are verified using domain knowledge.

II. RELATED WORK
The weather is a dynamic, continuous, multi-dimensional and chaotic process [8]. Numerous methods have been developed to predict the weather. This section focuses on the work that has been done in the field of weather forecasting using machine learning and deep learning techniques, with special interest in temperature forecasting. Many researchers have tried to solve the weather forecasting problem using different machine learning techniques [4], [5], [9], [10], [11] with successful results. Holmstrom et al. [9] proposed linear and functional regression models that forecast weather by searching for the historical weather patterns most similar to the current pattern, and Rasel et al. [5] performed a comparative study between Support Vector Regression [12] and Artificial Neural Networks [13] for temperature and rainfall prediction. Studies on deep neural networks [14], [15] and deep belief networks [16], [17], [18] provide promising results owing to their ''deep'' architecture and higher learning ability in comparison to ''shallow'' machine learning models [14]. In the last decade, Recurrent Neural Networks (RNNs) have gained widespread attention and developed rapidly due to their powerful and effective modeling capabilities [19]. However, the traditional RNN suffers from short-term memory and vanishing gradient problems [20], [21], [22], which make it difficult to capture long-term dependencies, an important factor in retaining relevant historical data over long time series to accurately predict future weather. Among RNNs, the Long Short-Term Memory (LSTM) based RNN overcomes the drawbacks of the traditional RNN and formulates long-term dependencies between training samples [2], [23], [24], [25]. Shi et al. [26] proposed the ConvLSTM network for precipitation nowcasting, which uses convolutional structures in both input-to-state and state-to-state transitions and captures spatiotemporal relationships better than a fully connected LSTM network.
A lightweight temporal convolutional neural network (TCN) has been developed [27] for short-to-medium range weather forecasting which is limited to regional forecasting and two weather parameters.
Karevan [24] proposed the transductive LSTM (T-LSTM), a localized version of LSTM where samples in the vicinity of the test point have a higher impact on model fitting; it is computationally expensive and not suitable for multivariate time series prediction. The drawback of transductive learning is the number of models that need to be trained, since the parameters of the model depend on individual test points. Kreuzer [28] proposed a new ConvLSTM model for local temperature forecasting which uses six convolutional layers connected to an LSTM layer and a dense layer. A multi-stacked sequence-to-sequence LSTM model [29] was proposed to forecast temperature, wind speed and relative humidity, and it could forecast weather with high accuracy. A similar approach was taken by Park [30] to restore missing temperature data using a four-layered LSTM model, which outperformed a deep neural network (DNN). Three DNNs, Multi-Layer Perceptron (MLP), LSTM and CNN+LSTM, were used by Roy [31] to forecast the air temperature of a weather station, and the results indicated that prediction accuracy increases with model complexity.
Several other models have been proposed based on the LSTM-RNN, but they are ineffective at forecasting weather accurately when there is a change in the weather pattern. A shift in weather often depends on changes observed in subsequent, mutually related weather variables. Using multivariate weather variables to forecast a single target weather feature makes it possible to determine the mutual influence and attention weight (spatial influence) of multiple weather variables with respect to the target variable. The attention mechanism [32] can be used to assign different weights to input variables by determining which part of the input data the model needs to focus on. An attention-aware LSTM model with multi-feature and temporal attention was proposed [33] to forecast soil moisture and soil temperature. The model produces an average R2 of 0.908 and 0.715 and RMSE of 1.665 and 2.756 for soil temperature and soil moisture, respectively. Shi et al. [34] demonstrated a self-attention joint spatiotemporal ConvLSTM model for temperature prediction, which introduces a unified memory to define the spatial and temporal models. However, the variance explained by these models relative to their complexity is comparatively low. Table 1 summarizes the existing LSTM-based temperature forecasting models with their limitations.

A. RESEARCH GAP
The identified research gap is to accurately forecast weather when there is a sudden change in weather patterns. The major limitation described in table 1 is that the baseline and derived models forecast temperature inaccurately when a change is observed in the weather over the learned time sequence. Meteorological studies suggest that a shift in weather often depends on changes observed in subsequent, mutually related weather variables. This interaction of mutually correlated weather features can be learned during forecasting to accurately predict a weather feature when a sudden change in weather is observed. Thus, we aim to develop a spatial feature attention mechanism that simultaneously learns input feature interactions over long sequences to predict the target feature accurately.

III. LONG SHORT-TERM MEMORY
The traditional RNN is a general form of feed-forward neural network with an internal memory. The output decision is made from the current input together with what has been learned from the previous inputs, so the output is connected to the earlier inputs of the sequence. It is recurrent in nature because it computes the output using the same function for every input, while the output depends on previous calculations. RNNs use an internal state memory to process input sequences. Fig. 1 depicts a simple RNN where X_0 to X_n are the inputs at every step and H_0 to H_n are the corresponding outputs produced for every step. Here, we can clearly see that all the inputs are related to each other, where A denotes a single RNN cell. The current state, activation function and output state are described in (1), (2) and (3) respectively:

H_t = f(H_{t-1}, X_t)                     (1)
H_t = tanh(W_{h-1} H_{t-1} + W_h X_t)     (2)
Y_t = W_y H_t                             (3)

where H is the single hidden vector, W is a weight, W_{h-1} is the weight of the previous hidden state, W_h is the weight of the current input state, W_y is the weight at the output state, Y_t is the output state, and tanh is the activation function which regulates the values to the range [−1, 1].
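The recurrence in (1)-(3) can be sketched in a few lines of NumPy. The dimensions below (7 input features, 32 hidden units) mirror the setting used later in the paper but are otherwise illustrative assumptions:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_h, W_hm1, W_y):
    """One step of a vanilla RNN, following (1)-(3): the new hidden
    state mixes the previous hidden state and the current input through
    tanh, and the output is a linear map of the hidden state."""
    h_t = np.tanh(W_hm1 @ h_prev + W_h @ x_t)   # (2): current state
    y_t = W_y @ h_t                             # (3): output state
    return h_t, y_t

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 7, 32, 1                # e.g. 7 weather features
W_h   = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_hm1 = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_y   = rng.normal(scale=0.1, size=(n_out, n_hidden))

# Unroll over a short input sequence, reusing the same weights each step;
# this weight sharing is what makes the network "recurrent".
h = np.zeros(n_hidden)
for t in range(5):
    h, y = rnn_step(rng.normal(size=n_in), h, W_h, W_hm1, W_y)
```

Note that the same three weight matrices are applied at every time step; only the hidden state carries information forward.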
LSTM [38], an artificial RNN architecture, was proposed by S. Hochreiter and J. Schmidhuber in 1997. It uses a gated mechanism (input gate, output gate and forget gate) to control the flow, storage and dependency of information over time [39], making it well suited for training on long sequential data. LSTM was a solution to the long-term dependency and the vanishing and exploding gradient problems of traditional RNNs. Fig. 2 depicts a gated LSTM network.
Here, X_t and H_t denote the input and output of a particular cell, respectively. In the input gate, the sigmoid function regulates the information (4) and decides on the values to be remembered using H_{t-1} and X_t. The tanh function (5) assigns weights to the values passed and produces a vector V_t containing values ranging from −1 to 1:

i_t = σ(W_i · [H_{t-1}, X_t] + b_i)       (4)
V_t = tanh(W_v · [H_{t-1}, X_t] + b_v)    (5)

The output produced at the input gate is the element-wise product of V_t and the regulated values i_t, yielding the useful information. The forget gate is responsible for discarding information that is no longer useful. Its inputs H_{t-1} and X_t are multiplied with the weight matrix W_f and passed through the sigmoid activation function, which assigns a value between 0 and 1 to discard or retain the information accordingly (6):

f_t = σ(W_f · [H_{t-1}, X_t] + b_f)       (6)

The cell state is then updated as C_t = f_t ⊙ C_{t-1} + i_t ⊙ V_t. The cell state C_t, the memory of the block, is used for extracting the useful information at the output gate. The tanh function weights the values, which are multiplied with the regulated values O_t obtained from the sigmoid function (7), and the resultant vector H_t (8) is the output of the cell, which acts as the input to the next cell:

O_t = σ(W_o · [H_{t-1}, X_t] + b_o)       (7)
H_t = O_t ⊙ tanh(C_t)                     (8)

Since the proposal of the original LSTM architecture, several variations and approaches have been proposed to enhance the performance of the model, such as bidirectional LSTM [40], encoder-decoder based LSTM [41] and many more [12], [42].
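The gate computations (4)-(8) can be illustrated with a minimal NumPy sketch of a single LSTM step. The fused weight matrix W (mapping the concatenated [H_{t-1}, X_t] to all four gate pre-activations at once) and the dimensions are implementation conveniences assumed here, not the paper's exact code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following (4)-(8). W maps the concatenated
    [h_prev, x_t] to the four gate pre-activations at once."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    i_raw, f_raw, o_raw, g_raw = np.split(z, 4)
    i_t = sigmoid(i_raw)          # (4) input gate: which values to update
    v_t = np.tanh(g_raw)          # (5) candidate values V_t in [-1, 1]
    f_t = sigmoid(f_raw)          # (6) forget gate: what to discard
    o_t = sigmoid(o_raw)          # (7) output gate
    c_t = f_t * c_prev + i_t * v_t   # cell state update (block memory)
    h_t = o_t * np.tanh(c_t)      # (8) cell output, fed to the next cell
    return h_t, c_t

rng = np.random.default_rng(1)
n_in, n_hidden = 7, 32
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden + n_in))
b = np.zeros(4 * n_hidden)

h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for t in range(24):               # e.g. a 24-step input sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
```

The cell state c accumulates information additively, which is why gradients survive over many more steps than in the vanilla RNN.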

IV. PROPOSED WORK
We propose a novel deep learning Spatial Feature Attention-based LSTM (SFA-LSTM) model to capture accurate spatial and temporal relations of multiple weather variables to forecast a weather feature. Significant spatial feature interpretations of historical data, aligned directly with the output feature, help the model to forecast accurately. The model is built on an encoder-decoder architecture, where temporal dependencies in the data are learnt using LSTM layers in the encoder phase and spatial feature relations in the decoder phase.
The proposed model: • provides meaningful spatial feature interpretations, which are verified using domain knowledge • has a spatial attention module built into the decoder phase to explicitly capture spatial feature correlations that align directly with the output • is computationally inexpensive, scalable and dependent only on past historical data • consists of spatial feature attention and long-term temporal dependency mechanisms coordinated in a unified architecture to forecast accurately while offering precise spatial feature interpretability. In this section, we describe the SFA-LSTM model and investigate its computational complexity. Contrary to previous works [47], [48], [49], in SFA-LSTM the spatial attention is designed in the decoder layer to simultaneously learn through relevant time steps and significant variables. Our model consists of two major divisions, the encoder and the decoder. We are given a multivariate time series with N features denoted by X = [x_1, x_2, x_3, . . . , x_N]^T ∈ R^{N×T_in}, where T_in is the total length of the input sequence and x_i ∈ R^{T_in} is the time series associated with the i-th input feature. In compact form, X = [x_1, x_2, . . . , x_{T_in}] collects the vectors x_t ∈ R^N of all input features at each time step t ∈ [1, T_in]. Analogously, the output univariate time series for T_out time steps is denoted by y ∈ R^{T_out}, where y_j ∈ R is the output at time step j.
In our model, temporal dependencies and spatial feature attention are calculated for every time step t. The input to the encoder at time step t is x_t. The spatial feature attention module is built in the decoder, parallel to the temporal layer, to capture spatial feature correlations while attending to the most relevant time steps, as it aligns directly with the output feature, as shown in fig. 3. Spatial feature embeddings are generated independently using a feed-forward neural network and are input to the spatial feature attention module. The feed-forward network used to compute the spatial feature embeddings performs a series of computations in which the previous hidden state of the decoder is concatenated with the input features and acted upon by a soft-max activation function to assign weights in the decoder LSTM. The spatial feature embeddings do not have any feedback connections, i.e., for each feature, the spatial embeddings for all features are computed from X = [x_1, x_2, x_3, . . . , x_N]^T and denoted as S = [s_1, s_2, s_3, . . . , s_N]^T. The spatial attention weights are calculated in a feed-forward aligned manner in the decoder layer, where h_f and c_f are the hidden state and cell state of the spatial feature attention. β_{i,j} is the spatial attention weight of the i-th feature calculated at output time step j using h_{f,j−1}, the previous hidden state of the spatial feature attention at the decoder, and s_i, the spatial feature embedding of the i-th feature. W_α ∈ R^{P+Q} is the learning parameter, and the tanh activation function weights the values passed (10).
We then use the spatial feature attention weights to calculate the spatial feature context vector f_j, which is distinct at each time step. f_j is further optimized and its dimension reduced using a feed-forward neural network with tanh activation to produce r_{f,j}, which is then concatenated with the output of the previous time step O_{j−1}. This produces an updated spatial feature context vector r̂_{f,j}, which is the input to LSTM_f, the final spatial feature attention LSTM layer.
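A minimal sketch of how such feature-wise attention weights and a context vector can be computed is given below. This is a generic additive-attention stand-in under our own assumptions (weight shapes, embedding sizes, and the scoring function are hypothetical), not the exact SFA-LSTM formulation:

```python
import numpy as np

def spatial_feature_attention(S, h_prev, W, v):
    """Score each spatial feature embedding s_i against the previous
    decoder hidden state, soft-max the scores into weights beta, and
    form the spatial context vector f_j as the weighted sum of
    embeddings. All shapes are illustrative assumptions."""
    scores = np.array([v @ np.tanh(W @ np.concatenate([h_prev, s_i]))
                       for s_i in S])
    beta = np.exp(scores - scores.max())    # numerically stable soft-max
    beta = beta / beta.sum()                # attention weights sum to 1
    f_j = beta @ S                          # context vector for this step
    return beta, f_j

rng = np.random.default_rng(2)
n_feat, d_emb, d_hid = 7, 16, 32            # 7 weather features
S = rng.normal(size=(n_feat, d_emb))        # spatial feature embeddings
h_prev = rng.normal(size=d_hid)             # previous decoder hidden state
W = rng.normal(scale=0.1, size=(24, d_hid + d_emb))
v = rng.normal(scale=0.1, size=24)

beta, f_j = spatial_feature_attention(S, h_prev, W, v)
```

Because beta is a proper probability distribution over the input features, it can be read off directly as a per-feature contribution, which is the basis of the interpretability analysis later in the paper.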
The final step of SFA-LSTM is to concatenate the hidden states of LSTM_t and LSTM_f, [h_{t,j}, h_{f,j}], which gives the output O_j, appended to the output list of predictions. Fig. 4 describes the detailed modelling workflow. We start our implementation from the data collection and data preprocessing phase, which is described in detail in the next section. The next phase includes model setting, training, comparison and evaluation. The final step of our experimentation is to compare the performance of the trained models and verify the obtained spatial feature attention weights against domain knowledge.

A. DATASET DESCRIPTION AND DATA PREPROCESSING
In this study, we use real meteorological data from the weather station at Saskatoon John G. Diefenbaker Intl. Airport (latitude 52.14, longitude -106.69) collected from the weatherstats website. The dataset is an hourly time series of 87672 data points from 2012-01-01 00:00:00 CST to 2021-12-31 23:00:00, containing the weather variables temperature, dew point, windchill, relative humidity, station pressure, sea pressure and wind speed. Temperature is recorded on the Celsius scale. Dew point, also on the Celsius scale, gives the average temperature below which water droplets begin to condense. Relative humidity gives the fraction of water vapour present in the air. Wind speed is measured in m/s, expressing the velocity of the wind, and surface pressure is measured in Pascals (Pa). These meteorological features are selected because together they describe the state of the weather for a given location and time. All seven meteorological features are used as input features to forecast temperature [24], [33], [35], [36], [37]. The experiment was performed on real data and thus included some necessary preprocessing steps to reflect true model performance. The missing values in the data were imputed using linear interpolation in the forward direction, which estimates missing values in increasing order from the surrounding known values. Smoothing the data using a simple moving average with an appropriate window length is an effective technique in time series forecasting, as it removes noise and random variations from the data without neglecting the weather variations over time. For our study, we apply a simple moving average with window length 5; data smoothed over a larger window might not represent the actual nature of the weather. In the final stage of data preprocessing, we normalize our data using the min-max scaling technique.
Since the proposed model is of multi-input, single-output form and the multiple input time series are in different units and ranges, we normalize each series to the range 0 to 1 using

x' = (x − x_min) / (x_max − x_min)        (13)

The data is split into a training set and a testing set in proportions 0.9 and 0.1, respectively. The training set contains 78877 rows and the testing set contains 8743 rows. Both sets are processed using a moving-window algorithm to obtain the input and output sequences. The input sequence contains seven features, i.e., temperature, dew point, windchill, relative humidity, station pressure, sea pressure and wind speed, and the output sequence contains temperature values. We compare the performance of SFA-LSTM with several baseline and derived models, which are discussed in the next sections.
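The preprocessing steps above (forward linear interpolation of gaps, a moving average of window length 5, min-max scaling as in (13), and moving-window sequence extraction) can be sketched as follows. The toy series and helper names are illustrative only:

```python
import numpy as np

def preprocess(series, window=5):
    """Fill missing values by linear interpolation, smooth with a
    simple moving average (window = 5), then min-max scale to [0, 1]
    as in (13)."""
    x = series.astype(float).copy()
    idx = np.arange(len(x))
    mask = np.isnan(x)
    x[mask] = np.interp(idx[mask], idx[~mask], x[~mask])   # fill gaps
    x = np.convolve(x, np.ones(window) / window, mode="valid")  # smooth
    return (x - x.min()) / (x.max() - x.min())             # (13)

def sliding_windows(x, t_in=24, t_out=1):
    """Moving-window split into input/output sequences (e.g. 24h in,
    1h ahead out), matching the setup used for training."""
    X, y = [], []
    for i in range(len(x) - t_in - t_out + 1):
        X.append(x[i:i + t_in])
        y.append(x[i + t_in:i + t_in + t_out])
    return np.array(X), np.array(y)

raw = np.array([1.0, np.nan, 3.0, 4.0, np.nan] * 20)   # toy hourly series
clean = preprocess(raw)
X, y = sliding_windows(clean, t_in=24, t_out=1)
```

In the paper each of the seven features would be processed this way independently before the windows are stacked into the multivariate input tensor.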

B. MODEL SETTING AND TRAINING
We applied the processed data containing the seven weather variables described above to predict temperature, using a TensorFlow backend in our experiments. The input variables to SFA-LSTM and the other studied models are temperature, dew point, windchill, relative humidity, station pressure, sea pressure and wind speed, and the output (target) variable is temperature. LSTM is an artificial RNN with feedback connections which enable it to process long sequences. Hyperparameters are values that need to be chosen or predefined before training; they are not the parameters of the model that are learned during training. The hyperparameters of the LSTM include the learning rate, hidden states, batch size, epochs and optimizer. The hyperparameter tuning mechanism is depicted in Fig. 5. We chose a Bayesian optimizer to tune the hyperparameters of SFA-LSTM; it keeps track of past evaluation results, which are used in its probabilistic algorithm. The learning rate decides how fast the model converges or diverges, or in other words, how quickly the learning parameters of the model are updated. If a higher learning rate is set, the model may not converge and may produce biased results, and if a lower learning rate is set, learning slows drastically. We trained our model three times for learning rates 0.01, 0.001 and 0.0001, and a learning rate of 0.0001 results in the minimum loss of 8.413637260673568e-06. The number of hidden states decides the capacity of a deep learning model to learn; it is the main measure of learning capacity. A rule of thumb is that the more complex the model, the more hidden units/states it requires. We trained our model for 16, 32, 64 and 128 hidden states using the Bayesian optimizer and chose 32 for the final model training.
The batch size defines the number of resources allocated for model training and the speed of the model. A higher batch size is computationally expensive, while a smaller batch size induces noise in the model. Thus, we trained our model for batch sizes 128, 256 and 512; the Bayesian optimizer favoured a batch size of 128. The number of epochs decides the number of complete passes over the data; the value can grow without bound, and the optimal value determines how well the model fits the data. Too few epochs result in a higher error loss and too many may result in overfitting. We trained our model for 1 to 50 epochs and the results are shown in fig. 6. The model reaches a low MSE in the range of e-05 after 15 epochs, and we set the number of epochs to 20. The hyperparameters of SFA-LSTM determined using the Bayesian optimization technique are: learning rate 0.001, number of epochs 50, optimizer Adam and activation function tanh.
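The paper tunes these hyperparameters with a Bayesian optimizer. Purely to illustrate the bookkeeping of a tuning loop, the sketch below sweeps the same candidate grid with a mock objective that simply favours the configuration reported in the text; in a real run, run_trial would train SFA-LSTM and return its validation loss, and a Bayesian optimizer would propose candidates from a probabilistic model of past trials rather than enumerating the grid:

```python
import itertools

def run_trial(learning_rate, hidden_units, batch_size):
    """Hypothetical stand-in for one training run. A real trial would
    train the model with these hyperparameters and return the
    validation loss; this mock scores candidates by their distance
    from the configuration favoured in the text."""
    favoured = (0.0001, 32, 128)
    cand = (learning_rate, hidden_units, batch_size)
    return sum(abs(a - b) / max(abs(b), 1e-9)
               for a, b in zip(cand, favoured))

# Candidate grids taken from the values tried in the text.
space = {
    "learning_rate": [0.01, 0.001, 0.0001],
    "hidden_units": [16, 32, 64, 128],
    "batch_size": [128, 256, 512],
}

trials = []                                  # record of past evaluations
for lr, hu, bs in itertools.product(*space.values()):
    trials.append(((lr, hu, bs), run_trial(lr, hu, bs)))

best_config, best_loss = min(trials, key=lambda t: t[1])
```

The trials list plays the role of the evaluation history that a Bayesian optimizer conditions on when proposing the next candidate.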

C. MODEL EVALUATION
We use three metrics, i.e., Mean Square Error (MSE), Mean Absolute Error (MAE) and R^2, to evaluate the performance of SFA-LSTM and other state-of-the-art predictive models. MSE is the squared error loss corresponding to the expected value, MAE is the average absolute error loss in a set of predictions, and R^2 describes the proportion of variance explained by the predictive model. The performance metrics are calculated as follows:

MSE = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)^2
MAE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|
R^2 = 1 − Σ_{i=1}^{N} (y_i − ŷ_i)^2 / Σ_{i=1}^{N} (y_i − y_avg)^2

where y_i is the actual temperature value at time step i, ŷ_i is the predicted temperature value at the i-th time step, y_avg is the mean of the actual temperature values and N is the sample size. These error scores are used as common performance metrics for regression models [50], [51], [52]. The hyperparameters of the implemented models are described in table 2. GRU is another variation of the RNN, developed in 2014 [41]. Its performance in learning long sequences is similar to LSTM, and it is computationally less expensive than LSTM because of its fewer gates. GRU is widely used in weather prediction modelling [43], [44], [45]. We also compare the performance of SFA-LSTM with the original LSTM model (vanilla LSTM). We implemented an integrated LSTM-BiLSTM model, proposed by Maddu et al. [35] to forecast soil temperature with multivariate input variables. The sequence-to-sequence LSTM (seq2seq LSTM) model was proposed by Zaytar et al. [29] to forecast temperature with temperature, wind speed and relative humidity as input features. STAM-LSTM is a novel state-of-the-art spatiotemporal attention-based LSTM model proposed by Gangopadhyay et al. [46] for multivariate time series prediction. We use the Keras Self-Attention package to implement the attention mechanism in the LSTM model, which considers the context of each time step. Additionally, we build a custom temporal attention-based LSTM model (att-LSTM) to compare its performance with the proposed SFA-LSTM model. The performance of SFA-LSTM is also compared with the existing models from the literature.
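The three metrics can be computed directly; a small NumPy sketch with a toy example of actual and predicted temperatures:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Square Error: mean of squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean Absolute Error: mean of absolute residuals."""
    return np.mean(np.abs(y_true - y_pred))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus the ratio of residual
    sum of squares to total sum of squares about the mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy example: one prediction off by 1 degree.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 5.0])
# mse -> 0.25, mae -> 0.25, r2 -> 0.8
```

Lower MSE/MAE and an R^2 closer to 1 indicate a better fit, which is how the models in tables 4-6 are ranked.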

VI. RESULTS
A novel deep learning model, SFA-LSTM, for short-term weather forecasting has been proposed in this research. The proposed model is evaluated using statistical error metrics, i.e., MAE, MSE and R^2, and its performance is compared with baseline, derived and existing models surveyed in the literature. The results also include an analysis of the spatial feature interpretability and of the spatial feature attention weights obtained during model learning, with their verification using domain knowledge. Table 4 contains the quantitative findings and prediction performance of the proposed algorithms listed in table 2. These models were trained and developed by us using the hyperparameters described in Table 2 with an input sequence of 24hr and an output sequence of 1hr; that is, the 1hr-ahead temperature is predicted from the past 24hr of meteorological values. The performance of SFA-LSTM, which outperforms the other proposed models, is also compared with the results of existing models from the literature (feature comparison with our proposed SFA-LSTM in Table 3), and the same is documented in Table 5.

B. SHORT TERM TEMPERATURE PREDICTION FOR DIFFERENT INPUT SEQUENCES AND OUTPUT SEQUENCES
The performance of SFA-LSTM for various input and output sequence lengths is documented in Table 6, where temperature is predicted with 24hr, 48hr and 72hr input sequences for 1hr, 2hr and 3hr ahead. On comparison, we can safely say that SFA-LSTM has better prediction accuracy than the other models for the different input and output sequence lengths. Table 7 provides the correlation between the input features used for temperature prediction. Correlation is a statistical measure of the amount of linear dependency between two variables. Using this information in temperature prediction helps us to understand the spatial feature interpretability between the input features and the target feature. The spatial feature attention weights obtained from training the SFA-LSTM model are depicted in fig. 11.

C. SPATIAL FEATURE INTERPRETABILITY
Clearly, temperature contributes the most towards forecasting future temperature values, i.e., up to 20% of the total feature contribution. Dew point contributes up to 19% towards temperature prediction and is linearly correlated to a great extent. Wind speed contributes the least, with only 0.85% of the total contribution, and is correlated with temperature only to a very small extent.
We observe that the spatial feature attention weights obtained from SFA-LSTM are consistent with, and verified against, domain knowledge. The spatial feature attention mechanism helps to forecast weather accurately when there is a change in the weather values over the sequence.
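Attention weights can be read as percentage contributions by normalizing them to sum to 100, which is how per-feature shares such as those above are reported. The weight values below are hypothetical placeholders, not the values learned by SFA-LSTM in fig. 11:

```python
import numpy as np

features = ["temperature", "dew point", "windchill", "relative humidity",
            "station pressure", "sea pressure", "wind speed"]

# Hypothetical learned attention weights over the seven input features.
beta = np.array([0.30, 0.28, 0.15, 0.10, 0.08, 0.06, 0.03])

percent = 100.0 * beta / beta.sum()        # per-feature share of attention
ranking = sorted(zip(features, percent), key=lambda p: -p[1])
```

The resulting ranking can then be compared against the feature-target correlations in Table 7 to check the weights against domain knowledge.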

VII. CONCLUSION
In this work, the weather forecasting problem is addressed with the vision of accurately forecasting weather when a sudden change in the weather pattern is observed. To address this problem, we used the concept of mutual correlation between meteorological features. In this paper, we proposed our novel SFA-LSTM model with a built-in spatial feature attention mechanism to capture the long-term dependencies and spatial feature correlations of multivariate input time series to predict a single output feature. The spatial feature attention mechanism grasps the quantitative mutual influence of the input features on the target feature, which leads to accurate predictions even when sudden changes in the input sequences are observed.
The magnitude of a shift in a weather feature can be learned from simultaneous shifts observed in subsequent, mutually related weather variables. Using multivariate weather variables to forecast a single target weather feature makes it possible to determine the weight of the spatial feature influence of multiple weather variables on the target variable. Capturing such correlations during model learning helps to predict future weather accurately over long sequences. The proposed model was built using an encoder-decoder architecture, where temporal dependencies in the data are learnt using LSTM layers in the encoder phase and spatial relations in the decoder phase. SFA-LSTM is seen to outperform state-of-the-art models while providing accurate spatial feature interpretability.