A Comparative Analysis of Deep Neural Networks for Hourly Temperature Forecasting

High-resolution temperature forecasting is often challenging for conventional machine learning models because temperature is highly seasonal, varying with the time of year as well as with the hour of the day. In most cases, temperature forecasting methods provide only the daily extremes or mean temperatures. However, with the growing availability of data and the development of deep neural networks (DNNs) capable of detecting complex relationships, high-resolution temperature forecasting is becoming easier. Typically, historical temperature data is used together with data from multiple meteorological sensors, which increases the complexity of the system and makes it harder and costlier to implement physically. In this paper, high-resolution hourly temperature forecasting is performed using only historical temperature data. The paper presents a comparative analysis of four popular DNNs, namely simple recurrent neural network (SRN), gated recurrent unit (GRU), long short-term memory (LSTM) and convolutional neural network (CNN), and two hybrid models, a CNN-LSTM parallel network and a GRU-LSTM parallel network, all trained on a Beijing temperature dataset. Experimental results showed that the GRU-LSTM parallel network obtained the lowest RMSE (1.691°C), whereas CNN had the best computational efficiency while obtaining a slightly worse RMSE (1.759°C). Additionally, a robustness analysis is performed on temperature data from four additional geographically diverse locations (Toronto, Las Vegas, Seattle, and Dallas), which reveals GRU to be the most consistent algorithm. Finally, the paper establishes a correlation between model performance and the dataset, based on the dataset's variance and mean absolute deviation with reference to the training dataset.


I. INTRODUCTION
TEMPERATURE forecasting is one of the most consistent areas of research owing to its direct impact on utility demand, living conditions, agriculture, and various industries. Temperature correlates strongly with electric load demand in particular, and a temperature forecast is therefore a prerequisite for many load forecasting schemes. In many countries these forecasts are provided by weather stations, but they often predict only the daily extremes (maximum and minimum) or average temperatures. Moreover, they do not specify at what time of day the maximum or minimum will occur. The extreme temperatures only help to predict the peak load; however, with the growing availability of data, higher-resolution temperature predictions can be made, which will greatly aid utilities with scheduling, supply operation and preparation for sudden load changes. In this regard, the hourly temperature forecast is an important feature that can further improve the prediction horizon of many other applications.
To better schedule the generation scheme and avoid undergeneration or overgeneration, many utilities require hourly temperature data for short-term load forecasting (STLF) [1]. A significant percentage of electric demand comes from heating, ventilation and air conditioning (HVAC) which consumes more than 40% of a building's power on average [2]. HVAC is highly temperature-dependent, and STLF such as 1 hour ahead (h ahead), 2h ahead and 3h ahead can be crucial for preparing the HVAC systems to adapt to the change of load and enhance the operational safety of the electric network.
Zhao et al. [3] proposed a hybrid PLS-SVM model that takes into account meteorological parameters and historical data to perform up to 3h ahead load forecasting to optimize HVAC operations. The authors showed that accuracy of the hourly temperature forecast directly affects the proposed model. Higher resolution such as 1h ahead, 2h ahead and 3h ahead temperature forecasts yield higher accuracy for the load forecasting model compared to using daily extreme temperature forecasts. Hourly temperature data is also required to analyze test reference years (TRYs) and design summer years (DSYs) for energy use, to calculate plant sizing, and to simulate building performance during hot summers [4].
Shao et al. [5] proposed a model which predicts the hourly road surface temperature and state (wet/ice/dry) using meteorological data from seven countries. This short-term model predicts up to 3h ahead and integrates an hourly temperature forecasting scheme as a prerequisite feature for the next stage of the proposed forecasting model. A similar study by Bogren et al. [6] used hourly air temperature forecasts to predict the road surface temperature. In agriculture, Kim et al. [7] used hourly air temperature forecasts to estimate the duration of leaf hydration retainability. Hourly temperature can even affect biological parameters; for example, the mortality burden of hourly temperature variability has been studied extensively [8]. Another significant application of hourly temperature forecasting is in photovoltaic (PV) generation. For seamless grid integration, predicting hourly fluctuations in PV generation is crucial. Since the output of a PV system is a function of temperature, hourly temperature forecasts are a prerequisite in the solar industry.
So, there are a plethora of applications for hourly temperature forecasting. After addressing the necessity of high resolution hourly forecasts, the discussion proceeds to assess the hourly forecast techniques that have been used so far as well as the state-of-the-art regarding this topic.

II. TEMPERATURE FORECASTING METHODS
Weather forecasting mainly takes one of three routes: traditional physics-based models, statistical models, and NN or DNN models. This section briefly explores the different techniques along with their advantages and drawbacks.

A. PHYSICS BASED MODELS
Physics-based weather forecasting is the traditional method and is still used by a number of public weather forecast providers. These methods mainly take into account physical parameters like solar irradiance, wind speed, humidity, precipitation, cloud covers, etc. and use theoretical formulae to calculate the future temperature. Zhao et al. [3] presented a purely physics-based temperature forecasting model to determine the temperature which is a prerequisite for the load forecasting part of their study. The study used a heat conduction equation that assessed parameters such as heat capacity, conductivity, current temperature, surface albedo, solar irradiance, net longwave irradiance, ground conductive heat flux density, sensible and latent heat flux densities to derive the road surface temperature. Physics-based models require sensor measurements from multiple sources to compute the temperature; moreover, these values vary significantly across different locations. These models tend to work better for daily temperature forecasting rather than short horizon predictions.

B. STATISTICAL MODELS
Mathematical models started gaining momentum around the 1990s. Since the temperature forecasts at that time only provided the maximum and minimum temperature without specifying at what time of day they would occur, the hourly electric load curve had to be generated through interpolation of the two extremes. Data-driven weather forecasting models are built using different statistical and machine learning algorithms. Such models can significantly decrease the setup cost by relying on more historical data instead of additional sensor data. However, these models may require extensive historical data to yield good accuracy. Recently, with the increased availability of precise data, data-driven models for weather forecasting have gained popularity and are actively being studied. Statistical models such as the autoregressive integrated moving average (ARIMA) use time-series analysis to predict long-term changes in data over daily and monthly time horizons [9]. ARIMA is one of the most common linear statistical techniques and a form of regression analysis used in time series forecasting. The auto-regressive component of ARIMA regresses on lagged values of the data, integration is then performed to make the data stationary, and the moving-average component incorporates preceding error terms from a moving average model applied to lagged observations. One of the biggest drawbacks of ARIMA is that it is negatively affected by seasonality, and temperature is a highly seasonal quantity. If stationarity is not confirmed in a trend, computation throughout the whole process might not be accurate [10]. ARIMA models are therefore not the best choice for temperature forecasting.
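As a compact illustration of the three ARIMA components described above, the model is commonly written with the lag operator as shown below. The notation is the usual textbook one and is not taken from [9] or [10]: $p$, $d$ and $q$ are the autoregressive, differencing and moving-average orders, $L$ is the lag operator ($L y_t = y_{t-1}$), $\phi_i$ and $\theta_j$ are coefficients, $c$ is a constant and $\varepsilon_t$ is the error term.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i L^{i}\right) (1 - L)^{d}\, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j L^{j}\right) \varepsilon_t
```

The factor $(1 - L)^{d}$ is the integration (differencing) step that is meant to remove trends. A plain ARIMA of this form has no term for a repeating daily or yearly cycle, which is consistent with the seasonality drawback noted above.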

C. NEURAL NETWORK MODELS
In recent times, NN models have become increasingly popular, especially for short-term predictions over hourly and daily time horizons, compared to the long-term predictions achieved through statistical models. Existing research mostly focuses on temperature forecasting using consistent time unit data, where both the input and target data are of the same time unit, for example, using daily input data to forecast the day-ahead temperature. However, with the increased availability of high-resolution data and the continued development of processing units, it is now possible to predict a time frame of a different duration from that of the input. Both hourly and daily patterns can be employed to forecast daily temperatures, but as the data is abundant and detailed, it is essential to process it efficiently and accurately. With the correct models, hourly temperature data can even be used to predict the hourly temperature of the next day, up to a limit before the errors become too significant. In this context, NN models have powerful versatility to process large amounts of more detailed data, which this paper aims to present. Existing research on temperature forecasting using statistical models and NNs is tabulated in Table 1. It can be observed that earlier temperature forecasting studies use different statistical models such as MLP, ARIMA or modified ARIMAs. Some of these papers include hourly temperature forecasting as the prerequisite of a load forecasting model [12]. More recent works started adopting NNs and DNNs, which yield higher accuracy compared to statistical models, as discussed in [18]. However, most of these papers use NNs to predict daily extremes and averages [19]. To the best of the authors' knowledge, only one forecasting model predicts at an hourly horizon using a DNN, achieving an hourly average RMSE value of 2.10 with the proposed convLSTM model tested on a temperature dataset from Germany. However, it uses five meteorological parameters as input. This not only increases the computation cost, but requires expensive sensor data as well [18]. Univariate regression using NNs can mitigate this drawback. In addition, temperature patterns differ significantly based on geographical location, so it will be interesting to observe how DNNs trained on a local temperature pattern perform on a different region. It is apparent that a study comparing the performance of the most recent DNNs for hourly temperature forecasting, taking into account spatial diversity (local and geographically diverse) and robustness, is yet to be carried out.

TABLE 1. Existing research on temperature forecasting using statistical models and NNs.
Reference | Model | Forecast resolution | Year | Remarks
Hippert et al. [13] | Hybrid ANN and ARIMA | Hourly | 2000 | When sequential forecasting is performed (e.g., the temperature is predicted 1h ahead at a time, and the predicted value is then used as one of the inputs for forecasting the next hour's temperature), an ANN might yield chaotic output. The proposed hybrid model claims to mitigate this drawback.
Methaprayoon et al. [14] | Multistage ANN | Hourly | 2007 | The back-end hourly temperature forecaster of an STLF model performed poorly for temperature forecasting, and it was stated that the temperature forecast error was the major cause behind higher errors in STLF.
Vamitha et al. [15] | Fuzzy & Markov chain | Daily average | 2012 | Through the fuzzification of the temperature data, four categorical data sequences were obtained and a multivariate Markov chain was applied.
Verhelst et al. [2] | ARX, ARMAX | Hourly | 2017 | These models are multivariate and computationally expensive. ANN does not require input data in stationary format but ARIMAX requires stationary data, which adds to the computation costs during pre-processing.
[18] | convLSTM | Hourly | 2020 | Proposed a convLSTM network using relative humidity, cloud coverage, hourly precipitation, wind speed, relative air pressure and wind direction along with air temperature as input. The increased input parameters yield better accuracy than univariate inputs but also come with a higher computation cost.
Lee et al. [19] | MLP, LSTM, CNN | Daily extreme | 2020 | The latest addition to the DNN-based short-term temperature forecasting studies, which presented a comparative analysis between MLP, LSTM and CNN, but only predicted the extremes and average daily temperature.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
This study intends to address the existing research gap and make the following significant contributions:
• Comparative analysis for hourly temperature forecasting using four of the most popular DNNs (SRN, LSTM, GRU, CNN) and two hybrid DNNs (CNN-LSTM parallel, GRU-LSTM parallel), with univariate time series data.
• Comparison of hour-by-hour prediction and single-run prediction, together with an exploration of the effect of normalizing the input data.
• A robustness analysis to check whether the DNNs perform similarly for four different regions and input patterns, thus grading their ability to generalize.
• Correlation between model performance and different input patterns based on the variance and mean absolute deviation (MAD) of the dataset.
VOLUME 4, 2016
The outcome of this study will be especially helpful in determining which DNN might perform best for applications that require hourly temperature forecasting, particularly load forecasting, along with the other applications mentioned in Section I. The rest of the paper is organized as follows. Section III gives a mathematical and illustrated overview of the DNNs considered in this study. Section IV breaks down the methodology and implementation of the models. Section V presents the outcome of the comparative analysis, and Section VI discusses the robustness analysis of the models along with its correlation to different parameters of a dataset. Finally, Section VII concludes the paper with an indication of future scope.

III. FORECASTING ARCHITECTURE

A. SIMPLE RECURRENT NEURAL NETWORK (SRN)
Conventional feed-forward NNs are ineffective for prediction using sequential data because they assume all the units of the input vector to be independent of time [21]. RNNs differ from conventional feed-forward NNs in that they are sequence-based models that allow the learning of time-based dependencies. RNNs have the ability to create temporal correlation between past data and the present state [22]. An RNN allows the signal to move forward and backward, and can form a loop in the NN. Thus, RNNs work especially well on sequential data, where the decision made at the previous time step ($t-1$) is preserved and utilized in the decision made at the current time step $t$. SRN is the simplest form of RNN: it takes two inputs, the current state $x_t$ at time $t$ and the previous hidden state $h_{t-1}$, and updates the values through a non-linear activation function. The recurrent unit has a single hyperbolic tangent (tanh) layer. In the expression for the repeating module of SRN [23], $h_t$ is the hidden neuron at time $t$, $o_t$ is the output vector and $b$ is the bias value. Figure 1 illustrates a basic SRN unit. The main drawback of SRN is that it sometimes fails to converge to the optimal minimum because of the vanishing gradient problem that might arise during back propagation [24]. Over time, multiple modified versions of RNN have therefore been proposed, some of which, such as LSTM and GRU, have become very popular.
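For reference, the SRN update described above is commonly written in the following form. This is a sketch of the standard formulation rather than a transcription of the equation in [23]: the weight matrices $W_x$, $W_h$ and $W_o$ and the linear read-out for $o_t$ are notational assumptions introduced here, while $x_t$, $h_{t-1}$, $h_t$, $o_t$ and $b$ follow the definitions in the text.

```latex
\begin{aligned}
h_t &= \tanh\left(W_x x_t + W_h h_{t-1} + b\right) \\
o_t &= W_o h_t
\end{aligned}
```

Because $h_t$ depends on $h_{t-1}$ through $W_h$ and a tanh whose derivative is at most 1, a gradient propagated back over many time steps is multiplied by these factors repeatedly and can shrink toward zero, which is one way to see the vanishing gradient problem.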

B. LONG SHORT TERM MEMORY (LSTM)
LSTM is a modified version of RNN, first proposed by Hochreiter and Schmidhuber [25] to mitigate the vanishing gradient problem of SRNs. LSTM can store previous data in its memory unit and add or discard information during the learning process. LSTM has proven to be very effective for sequential data such as signals, protein patterns, text data and time series. Instead of the single hyperbolic tangent layer in the recurrent unit of SRN, LSTM has four layers. The basic components of an LSTM unit are a memory cell and three gating units, namely the input gate ($i_t$), output gate ($o_t$) and forget gate ($f_t$), which are shared by all cells in the block. In total, there are three inputs and two outputs. Each layer receives an input $x_t$, the previous hidden layer state $h_{t-1}$ and the previous cell state $c_{t-1}$. The hidden layer derives a hidden state vector $h_t$ and the output cell state $c_t$. The purpose of $i_t$ is to determine whether a cell $c_t$ should be updated by $x_t$ or not, $f_t$ decides whether the previous cell $c_{t-1}$ should be forgotten, and the output $h_t$ depends on $o_t$, which controls which part of $c_t$ should be used. An activation function normalizes the state of the gates, with 0 indicating no information flow and 1 indicating full flow of information through the gate. The basic structure of an LSTM unit is illustrated in Figure 2. The nodal outputs of an LSTM network are computed following [26], where the input variable at time step $t$ is denoted by $x_t$, and $c_t$ and $h_t$ are the cell state and hidden state, respectively. $\tilde{c}_t$ is referred to as the candidate cell, calculated in Eq. 4, whose output through the tanh function has a value between -1 and 1. $\sigma$ represents the sigmoid activation function and the * symbol denotes the element-wise multiplication operation. $c_t$ stores the state information and is updated by Eq. 7. Eqs. 4 and 7 use a hyperbolic tangent operator to calculate the memory cell and $o_t$. This enables the LSTM network to retain useful information across different timescales. LSTM was designed to avoid the vanishing gradient problem by allowing gradients to flow unchanged. However, LSTM networks are still vulnerable to the exploding gradient problem [27].
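The gate interactions described above can be summarized by the widely used LSTM formulation below. It is a sketch under stated assumptions and not a transcription of the equations in [26]: the weight matrices $W_f$, $W_i$, $W_c$, $W_o$, the biases $b_f$, $b_i$, $b_c$, $b_o$ and the concatenation $[h_{t-1}, x_t]$ are notation introduced here, the equations are left unnumbered, and no peephole connections are assumed.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t * c_{t-1} + i_t * \tilde{c}_t \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t * \tanh\left(c_t\right)
\end{aligned}
```

In this form the cell state $c_t$ is updated additively from $c_{t-1}$, so when $f_t$ is close to 1 the gradient along the cell state passes through almost unchanged.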

C. GATED RECURRENT UNIT (GRU)
Another popular modification of the RNN is the GRU, proposed by Cho et al. [28] with the aim of making the recurrent units adapt to and capture the dependencies of different timescales and sequences. The updated mechanism allows the GRU to capture long-term dependencies. A GRU unit encompasses two gates, the reset gate $r_t$ and the update gate $z_t$. The update gate is similar to the forget gate and input gate in LSTM, as it controls storing or erasing potential features from the previous state that can be useful later. Meanwhile, the reset gate controls the amount of information that should be discarded. The reset gate mechanism helps the GRU use its model capacity efficiently by allowing it to reset features that are detected to no longer be useful. The basic unit of a GRU is illustrated in Figure 3. In the equations for the input and output of a GRU model, $\tilde{h}_t$ and $h_t$ are the candidate activation and hidden state at time $t$, respectively. $W_z$, $W_r$ and $W_h$ are the weight matrices of the update gates, reset gates and hidden states, respectively. The "*" is used to express element-wise multiplication and $\varphi$ is the sigmoid activation function. GRU is an updated version of LSTM that has two gating units that control the flow of information, but it does not have a separate memory cell. As LSTM contains 12 parameters for each separate unit, a fully connected LSTM layer becomes computationally costly to implement. GRUs improve the computational efficiency by combining two LSTM gates (the input and forget gates) into a single update gate [22]. This might compromise performance a little, but the improved training time makes GRU faster than LSTM.
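A common way of writing the GRU update with the symbols defined above is sketched below. The form is the standard one and is an assumption here, not a transcription of the equations in the source: bias terms are omitted, $[\cdot,\cdot]$ denotes concatenation, and some references swap the roles of $z_t$ and $1 - z_t$ in the last line.

```latex
\begin{aligned}
z_t &= \varphi\left(W_z\,[h_{t-1}, x_t]\right) \\
r_t &= \varphi\left(W_r\,[h_{t-1}, x_t]\right) \\
\tilde{h}_t &= \tanh\left(W_h\,[r_t * h_{t-1}, x_t]\right) \\
h_t &= (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
\end{aligned}
```

The last line shows how a single update gate takes over the work of the LSTM forget and input gates: $z_t$ decides how much of the previous state is kept and, at the same time, how much of the candidate $\tilde{h}_t$ is written.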

D. CONVOLUTIONAL NEURAL NETWORK (CNN)
CNN has become a standard, go-to model for computer vision and image classification applications. CNN models are capable of filtering and extracting complex patterns and features from massive visual datasets (with ground-truth labels). A CNN works by automatically learning a large number of filters in parallel, specific to a training dataset, and repeatedly applying the same filter to an input, which results in activations known as a feature map. The fundamental concept utilizes the mathematical operator called convolution to transform two functions into a single function. Convolution can be performed on two functions at a time, but CNN is used for up to 4D spatio-temporal processing [29]. However, there was uncertainty regarding how CNN would perform on 1D time series data, especially when the dataset is not sufficient [30]. In the particular case of a 1D convolutional layer, 1D pooling layers are used to create CNNs for signal analysis as well as time series analysis. The internal structure of CNN encompasses three layers: the convolutional layer, the pooling layer and the dense layer. The convolutional layers perform the convolution operation with the help of linear activation to extract the local features. The forward and back propagation are detailed in [30]; the forward pass is given by
$$x_k^l = b_k^l + \sum_{i} \mathrm{conv1D}\left(w_{ik}^{l-1}, s_i^{l-1}\right)$$
where $w_{ik}^{l-1}$ denotes the kernel between the $i$th neuron at layer $l-1$ and the $k$th neuron at layer $l$, $s_i^{l-1}$ is the output from the $i$th neuron at layer $l-1$, and $x_k^l$ and $b_k^l$ are the input and the bias of the $k$th neuron at layer $l$, respectively. In order to perform 1D convolution without zero padding, the conv1D(., .) function was used. This implies that the dimension of $s_i^{l-1}$ (output arrays) is higher than the dimension of $x_k^l$ (input arrays). The intermediate output $y_k^l$ is obtained by applying an activation function $f(\cdot)$ to the input $x_k^l$, and the output of the neuron then follows from it:
$$y_k^l = f\left(x_k^l\right), \qquad s_k^l = y_k^l \downarrow ss$$
where $\downarrow ss$ denotes a down-sampling operation with a scalar factor $ss$ [30].
Down-sampling of the feature map is performed in the pooling layer, which reduces several values into one value while keeping the integrity of the input data unchanged [19]. The last layer is the dense layer, which receives the flattened data of the pooling stage and turns it into a 1D output sequence. An attractive feature of 1D CNN is that low-cost hardware implementation is possible, as 1D CNNs only perform 1D convolutions, which are basically additions and scalar multiplications. A basic internal structure of CNN is shown in Figure 4.
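The convolution, activation and down-sampling steps described above can be sketched for a single neuron in plain Python. This is an illustration under stated assumptions, not the implementation used in this study: the function names, the ReLU activation, max pooling as the down-sampling operation and the sample values are all hypothetical, and the sliding product is computed without flipping the kernel, as most deep learning libraries do.

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` without zero padding.

    The output has len(signal) - len(kernel) + 1 values, so it is shorter
    than the input, as noted in the text.
    """
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Assumed activation f(.); the source does not fix a specific choice here."""
    return [max(0.0, v) for v in values]


def downsample_max(values, ss):
    """Assumed down-sampling: non-overlapping max pooling with factor `ss`."""
    return [max(values[i:i + ss]) for i in range(0, len(values) - ss + 1, ss)]


def conv_neuron(signal, kernel, bias, ss):
    """x = bias + conv1D(kernel, signal); y = f(x); s = y down-sampled by ss."""
    x = [v + bias for v in conv1d_valid(signal, kernel)]
    y = relu(x)
    return downsample_max(y, ss)


# Hypothetical six-hour temperature snippet and a simple difference kernel.
temps = [20.0, 21.0, 23.0, 22.0, 20.0, 19.0]
feature_map = conv1d_valid(temps, [-1.0, 0.0, 1.0])
pooled = conv_neuron(temps, [-1.0, 0.0, 1.0], bias=0.0, ss=2)
```

With a kernel of length 3 and no padding, the six input values produce a feature map of four values, which pooling with a factor of 2 reduces to two.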

E. CNN-LSTM PARALLEL NETWORK
Hybrid CNN-LSTM networks are often configured in series, where the CNN is used to extract features from the input data and the output of the CNN is subsequently fed into the LSTM as an input. Combining CNN and LSTM can make use of their complementary characteristics: CNN is used for feature extraction that expresses spatial locality, and LSTM is applied to time series data analysis for temporal feature detection. However, an obvious question for series CNN-LSTM configurations is to what extent the accuracy of the CNN model affects the training of the LSTM model. To avoid this issue completely, CNN-LSTM parallel networks can be used, where each NN has its own path without intersecting or affecting the other [31]. The LSTM follows a conventional path and outputs a 1D array. For the CNN path, the convolution layer and pooling layer output a 2D array.
However, it has to be ensured that the vector outputs of the two paths are of the same dimension before they are added. A flatten layer stacks the 2D output of the pooling layer into a 1D array, and the dense layer ensures an equal number of elements from both pathways. One significant drawback of this network is the increased computational cost. The model of the CNN-LSTM parallel network considered in this study is shown in Figure 5.
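The dimension matching described above can be sketched as follows. This is a minimal illustration, not the network of Figure 5: the helper names, the weights and the branch outputs are hypothetical values chosen only to show how a flatten step and a dense layer bring both paths to the same length before the element-wise addition.

```python
def flatten(matrix):
    """Stack a 2D array (list of rows) into a 1D list."""
    return [value for row in matrix for value in row]


def dense(vector, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def merge_parallel(lstm_out, cnn_map, w_lstm, b_lstm, w_cnn, b_cnn):
    """Project both branch outputs to a common size and add them."""
    a = dense(lstm_out, w_lstm, b_lstm)
    b = dense(flatten(cnn_map), w_cnn, b_cnn)
    if len(a) != len(b):
        raise ValueError("both paths must produce vectors of the same size")
    return [p + q for p, q in zip(a, b)]


# Hypothetical branch outputs: a 1D LSTM vector and a 2D CNN feature map.
lstm_out = [1.0, 2.0]
cnn_map = [[1.0, 0.0], [0.0, 1.0]]
merged = merge_parallel(
    lstm_out, cnn_map,
    w_lstm=[[1.0, 0.0], [0.0, 1.0]], b_lstm=[0.0, 0.0],
    w_cnn=[[1.0, 1.0, 1.0, 1.0], [0.5, 0.0, 0.0, 0.5]], b_cnn=[0.0, 0.0],
)
```

The two dense layers here are what make the addition well defined: the LSTM path has two values and the flattened CNN path has four, yet both are mapped to two values before being summed.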

F. GRU-LSTM PARALLEL NETWORK
GRU-LSTM hybrid models have previously been proposed in series configuration. To the best of our knowledge, we are the first to implement a GRU-LSTM parallel network for time series prediction. The series configuration was also trained, but the parallel GRU-LSTM yielded better results, which is why it is considered in this study. The concept is similar to that of CNN-LSTM: in order to avoid the output of one network adding any bias to the output of the other, the series configuration was replaced with a parallel network where each DNN has a separate path for training on the data. GRU and LSTM have a similar working mechanism, with GRU being a little faster than LSTM as it has two gates where LSTM has three. Combining the two models has shown promising results. As with the CNN-LSTM parallel network, it has to be ensured that the vector outputs of the two separate NN paths are of the same size before summing them. The combined output enters three dense layers to prepare the data for prediction. The GRU-LSTM parallel network considered in our study is illustrated in Figure 6.

IV. METHODOLOGY
Two different approaches were taken using the six DNNs to perform regression:
1) Considering each hour of the prediction horizon (6h) as an individual regression problem (hour-by-hour prediction).
2) Considering the total prediction horizon (6h) as a single regression problem (full prediction in a single run).
Both approaches are used to predict the hourly temperature up to 6h ahead, and all six DNN models are evaluated using both. Additionally, the models are trained on data with normalization and compared to the same models trained on data without normalization, both to compare the raw performance of the DNNs and to see how normalization affects DNNs for univariate time series data. Finally, datasets from four other regions are used to perform a robustness analysis of the DNN models.
To increase the resolution of the data, a hopping window with a hop size of 1h is used to divide the data into overlapping blocks of 30h each, for both the train and test sets. From each of these blocks, the first 24h is taken as the input sequence and the remaining 6h is taken as the output sequence. This approach is followed for the full prediction in a single run. For the hour-by-hour case, the 6h prediction horizon is treated as six individual regression problems, while the 24h input is kept the same.
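The windowing scheme described above can be sketched in a few lines of Python. The function names and the toy series are hypothetical and only illustrate the block structure (24h input, 6h output, hop of 1h); they are not taken from the implementation used in this study.

```python
def make_windows(series, n_in=24, n_out=6, hop=1):
    """Cut `series` into overlapping blocks of n_in + n_out samples.

    Returns (inputs, targets): each input holds the first n_in hours of a
    block and each target holds the remaining n_out hours.
    """
    block = n_in + n_out
    inputs, targets = [], []
    for start in range(0, len(series) - block + 1, hop):
        inputs.append(series[start:start + n_in])
        targets.append(series[start + n_in:start + block])
    return inputs, targets


def hourly_targets(targets, hour):
    """Targets for one hour-by-hour regression problem (hour = 1..n_out)."""
    return [target[hour - 1] for target in targets]


# Toy series standing in for 40 consecutive hourly readings.
series = list(range(40))
X, Y = make_windows(series)           # single-run setting: 6 outputs per window
y_1h = hourly_targets(Y, hour=1)      # hour-by-hour setting: 1h-ahead target
y_6h = hourly_targets(Y, hour=6)      # hour-by-hour setting: 6h-ahead target
```

In the single-run setting a model is fitted once on the pairs (X, Y), while in the hour-by-hour setting six separate models would each be fitted on X with one of the six single-column targets.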

A. DATA COLLECTION
The temperature data is collected from a dataset uploaded by Zhang S. et al. titled "Cautionary Tales on Air-Quality Improvement in Beijing" [32]. The original data contained various air quality readings from twelve nationally controlled monitoring sites. From the whole dataset, the Aoti Zhongxin area is taken for its relatively low number of missing values. The Aoti Zhongxin is considered in this study to represent overall Beijing temperature because of the low variation in readings from other centers.
The dataset consists of hourly temperature data from 2013-03-01 00:00:00 to 2017-02-28 23:00:00, giving a total of 35064 hourly readings. The dataset is first sorted by datetime. There were 20 missing temperature values and, because of the relatively small amount of missing data, they are filled using the forward fill method instead of more complex imputation methods. Then, maintaining the order, the first 90% of the data is selected for training (2013-03-01 00:00:00 to 2016-10-04 18:00:00) and the remainder is taken for testing (2016-10-04 19:00:00 to 2017-02-28 23:00:00). The train-test split is visualized in Figure 7.
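The two preprocessing steps above, forward filling and an order-preserving split, can be sketched as follows. The helper names and the ten sample readings are hypothetical; missing values are represented by None, and integer arithmetic is used for the split point so that the boundary does not depend on floating-point rounding.

```python
def forward_fill(values):
    """Replace each missing reading (None) with the last observed value.

    A missing value at the very start has nothing to copy and stays None.
    """
    filled, last = [], None
    for value in values:
        if value is None:
            value = last
        filled.append(value)
        last = value
    return filled


def chronological_split(values, train_percent=90):
    """Keep time order: the first train_percent% is train, the rest is test."""
    cut = len(values) * train_percent // 100
    return values[:cut], values[cut:]


# Hypothetical hourly readings, already sorted by time, with three gaps.
readings = [3.0, None, None, 4.5, 5.0, None, 6.0, 6.5, 7.0, 7.5]
filled = forward_fill(readings)
train, test = chronological_split(filled)
```

Splitting by position rather than at random keeps every test sample later in time than every training sample, which matches the date ranges given above.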
The dataset for the robustness analysis titled "Historical Hourly Weather Data 2012-2017" [33] contains 5 years of high resolution (hourly measurements) temporal data of various weather attributes from January 2012, 12:00:00 to December 2017, 00:00:00, out of which the temperature data is extracted. This data is available for 30 US and Canadian cities. Toronto, Seattle, Dallas and Las Vegas were chosen for the robustness analysis because of their considerably scattered geographical locations so that the temporal data vary as much as possible.

B. MODEL CONSTRUCTION AND HYPERPARAMETER TUNING
Hyperparameter tuning is an important part of NN construction and is usually done through extensive trial and error. A common practice is to use rule-of-thumb parameters or combinations that have previously performed well in other papers. However, we have carefully chosen all the hyperparameters after manually testing a wide range of values. A validation run is conducted for each model to decide the hyperparameters for best performance and fitting before training the final models. The train set is split 90-10 for the validation run. The layer-based hyperparameters determined from this run are provided in Table 2. General parameters such as the optimizer, learning rate and number of epochs are also important for improving the overall performance and speed of the models. Commonly used optimizers include root mean square propagation (RMSprop), stochastic gradient descent (SGD), the adaptive gradient algorithm (AdaGrad), and adaptive moment estimation (Adam). In this paper, after the validation run, the Adam optimizer is chosen, as it is computationally efficient and showed slightly better results during testing. The batch size of all the models is taken as 64. The loss functions considered are mean square error (MSE) and cosine similarity for the full-time single-run prediction, and MSE for the hour-by-hour prediction.
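To make the difference between the two loss functions concrete, the sketch below computes both for a short output vector. The function names and numbers are hypothetical and are not taken from this study; note also that libraries which minimize a loss typically use the negative of the cosine similarity.

```python
import math


def mse(actual, predicted):
    """Mean square error between two equally long sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def cosine_similarity(actual, predicted):
    """Cosine of the angle between the two vectors (1 means same direction)."""
    dot = sum(a * p for a, p in zip(actual, predicted))
    norm_a = math.sqrt(sum(a * a for a in actual))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    return dot / (norm_a * norm_p)


# Hypothetical three-step target and a prediction that is exactly twice as large.
actual = [10.0, 11.0, 12.0]
doubled = [20.0, 22.0, 24.0]
mse_value = mse(actual, doubled)
cos_value = cosine_similarity(actual, doubled)
```

The doubled prediction has a large MSE but a cosine similarity of 1, since cosine similarity only compares the shape of the predicted vector with the target and ignores its scale. For a single-value output the measure reduces to a sign comparison, which is one plausible reason it is listed only for the single-run case.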

V. RESULT ANALYSIS

A. FORECASTING OUTCOMES
The trained models were used to predict hourly temperatures up to 6h ahead. The prediction is carried out on an hour-by-hour basis as well as for the whole time horizon in a single run. The training and testing periods are given in Section IV-A. It is observed that the models trained on unnormalized data perform better than the models trained on normalized data, and so only the prediction graphs of models trained on data without normalization are included in the paper (Figure 9 to Figure 20). The error metric values are tabulated, and RMSE is plotted both with and without normalization.

B. EVALUATION METRICS
The performance of the DNNs is evaluated in terms of three error metrics: the conventional root mean squared error (RMSE) and mean absolute error (MAE) and, additionally, the coefficient of determination $R^2$. The mathematical expressions of these error metrics are given as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(F_t - A_t\right)^2}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|F_t - A_t\right|$$
$$R^2 = 1 - \frac{\sum_{t=1}^{n}\left(A_t - F_t\right)^2}{\sum_{t=1}^{n}\left(A_t - \bar{A}_t\right)^2}$$
where $n$ is the number of forecasted temperature values, $F_t$ is the forecasted hourly temperature and $A_t$ is the actual temperature at instant $t$. For $R^2$, $\bar{A}_t$ is the mean value of the observations. The $R^2$ value indicates how well a model fits the dataset. The maximum value of $R^2$ is 1, and values closer to 1 indicate higher prediction accuracy. RMSE puts more emphasis on larger errors than on smaller ones. Lower values of RMSE and MAE indicate better performance.
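The three metrics can be computed as in the following sketch, which uses the standard definitions of RMSE, MAE and $R^2$. The function names and the four sample values are hypothetical and serve only as a worked example.

```python
import math


def rmse(actual, forecast):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((f - a) ** 2 for a, f in zip(actual, forecast)) / n)


def mae(actual, forecast):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(f - a) for a, f in zip(actual, forecast)) / n


def r_squared(actual, forecast):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and forecasted temperatures for four hours.
actual = [10.0, 12.0, 14.0, 16.0]
forecast = [11.0, 12.0, 13.0, 18.0]
scores = (rmse(actual, forecast), mae(actual, forecast), r_squared(actual, forecast))
```

In this example the errors are 1, 0, -1 and 2, so the MAE is 1.0 while the RMSE is about 1.22: the single 2-degree error weighs more heavily in the RMSE, in line with the remark above that RMSE emphasizes larger errors.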

C. PERFORMANCE ASSESSMENT BASED ON EVALUATION METRICS
The performance of the DNNs is assessed mainly based on their RMSE values. The effect of normalization is also observed. The error metric values for models trained on unnormalized data and models trained on normalized data are tabulated in Table 3 and Table 4, respectively. The following observations can be drawn from the results:

1) Overview of model performance
It can be observed from Figure 21 that SRN had the highest RMSE, followed by GRU and LSTM, which is expected. LSTM is the modified version of SRN and, despite GRU being proposed after LSTM, the main purpose of GRU is to reduce computational cost while retaining as much accuracy as possible. Thus, SRN (1.79 for hour-by-hour, 1.88 for full time) and GRU (1.81, 1.79) showed the poorest performance. LSTM (1.77, 1.77) performed slightly better than SRN and GRU. This also reflects the previously mentioned claim that LSTM is more suitable for detecting long-term dependencies than for high-resolution short-term outputs. CNN performed similarly to LSTM in both the hour-by-hour (1.77) and full-time (1.76) prediction cases.

2) RNNs vs CNN for time series forecasting
In general, RNNs (SRN, LSTM, GRU) are known to work better on text classification, whereas CNN is the standard for image classification. According to the literature, RNNs work well with sequential data, which makes them well suited to predicting values in a sequence (such as a time series), while CNN is excellent at feature extraction. However, it can be observed from Table 3 and Table 4 that the two families perform similarly on univariate time series prediction. A deciding factor in this regard can be the computation time, where CNNs have a large advantage over RNNs. In our study, the CNN model ran 5 times faster than LSTM, 4 times faster than GRU, and twice as fast as SRN.

3) Singular models vs hybrid models
An interesting case is observed for the hybrid models, the CNN-LSTM parallel network and the GRU-LSTM parallel network. Both models outperformed the single models for single-run predictions, yet both showed the worst performance for hour-by-hour predictions. GRU-LSTM exhibited the best RMSE (1.69) for single run, but the second worst hour-by-hour RMSE (1.9) of all the DNNs. Similarly, CNN-LSTM yielded the second best RMSE (1.74) for single run, and the worst RMSE (2.2) for hour-by-hour prediction. The highly inconsistent performance of the hybrid models in hour-by-hour prediction appears as random spikes in Figure 21 and Figure 22. It should also be noted that the hybrid models are computationally more expensive than the single models.

4) Effect of normalization
It is evident from Figure 21 and Figure 22 that the models trained with normalized data performed worse than the models trained without normalization. In Figure 21, only the CNN-LSTM parallel network showed random spikes, but in Figure 22, almost every model, including SRN, LSTM, GRU-LSTM and CNN-LSTM, performed inconsistently. Although normalized data are generally expected to yield good results in time series forecasting with DNNs, they yielded poorer results on the temperature data.
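For reference, a minimal normalization sketch is given below. It assumes min-max scaling with the range fitted on the training data only; the scheme actually used in the paper is not stated in this section and may differ. All names and values are hypothetical. The sketch also shows the inverse transform, which is needed so that errors from normalized models are expressed in °C and can be compared with those of unnormalized models.

```python
from typing import List, Tuple


def fit_min_max(train: List[float]) -> Tuple[float, float]:
    """Learn the scaling range from the training data only."""
    return min(train), max(train)


def normalize(values: List[float], lo: float, hi: float) -> List[float]:
    """Map values from [lo, hi] to [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(values: List[float], lo: float, hi: float) -> List[float]:
    """Map scaled values back to the original unit (degrees Celsius)."""
    return [v * (hi - lo) + lo for v in values]


# Toy training temperatures (illustrative only).
train = [-5.0, 0.0, 10.0, 35.0]
lo, hi = fit_min_max(train)

scaled = normalize([15.0], lo, hi)
restored = denormalize(scaled, lo, hi)
```

Fitting the range on the training data alone avoids leaking information from the test period into the model inputs.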

5) Comparison with existing works
Among the works reviewed in Section II, only one paper was found to predict hourly temperature using a DNN. Its authors proposed a convLSTM model that achieved an hourly average RMSE of 2.1°C on a temperature dataset from Germany. Although all the models included in this paper achieved a better RMSE (<2.1°C) for single-run prediction, the two works cannot be conclusively compared because they are based on different datasets.
To summarize this section, the following conclusions were reached:
1) The GRU-LSTM parallel network shows the best performance of all six models for full-time single-run prediction, both with and without normalization.
2) All DNNs perform better on hourly temperature data without normalization than on normalized data.
3) Full-time single-run predictions are preferable to hour-by-hour predictions, as the hour-by-hour predictions exhibited random spikes. In addition, the hour-by-hour run requires 6 times the computational cost of a single run.
4) In terms of computational cost, CNN is much faster than any other model while sustaining good performance.

VI. ROBUSTNESS ANALYSIS
In Section V, conclusions were drawn about the performance of the models by testing them on the same dataset they were trained on. In this section, the robustness of the models is analyzed by testing them on new datasets from different geographical locations with uncorrelated climatic characteristics. Robustness is a model's ability to generalize trends and deliver satisfactory performance on different or altered datasets. The same three error metrics are compared among the DNNs to assess their robustness in each location.
Four cities from different geographical locations, discussed in Section IV, were chosen for the robustness analysis. The time period considered for the prediction is from 1 March 2013, 00:00:00 to 28 February 2017, 23:00:00 (the same as for the Beijing dataset). The results obtained from the predictions are summarized in Table 5. The models were run both with and without normalization, and with both the hour-by-hour and the single-run approach. As in the previous case, the models without normalization in a single run yielded better results, so the discussion is limited to this configuration. To make the changes easier to see, the comparative RMSE of the DNN models is illustrated in Figure 23. Figure 23 shows that all the models performed satisfactorily on unseen, unrelated datasets from different locations. The RMSE of all the models did increase, but the increase is comparatively small, which indicates the robustness and reliability of the models. From Table 5, it can be observed that GRU achieved the lowest RMSE (2.0042°C), which indicates that GRU is the most robust DNN.
To draw a correlation between a model's performance and the different types of temperature datasets from different regions, various parameters were initially considered, such as the distribution plot, the autocorrelation function (ACF), the partial autocorrelation function (PACF), the variance, and the mean absolute deviation (MAD). Of these, only the variance and the MAD showed an apparent correlation: both correlate negatively with model performance, that is, positively with RMSE. Finally, it is observed that the model performance on different datasets is best explained by the product of the variance and the MAD.
Table 6 indicates that the RMSE value has a positive correlation with the variance and the MAD value of a particular dataset. The MAD value is calculated keeping the Beijing data as a point of reference. Initially, only the variance was considered to draw a correlation: higher variance in the data caused the models to perform poorly. However, Toronto is an exception, where the models performed better despite encountering a very high variance. This can be explained by the second parameter, MAD, since Toronto has the lowest MAD value among the four regions. The best fit for correlating the model performance with the type of regional temperature dataset is therefore considered to be the product of the variance and MAD values. The RMSE of all six models is plotted against MAD × variance in Figure 24, where this relationship holds for every model except SRN, which produced an outlier.
Another important point to note is that, for Seattle, all the models yielded a lower RMSE value than for Beijing, as the Seattle data has a lower variance. This implies that the models are able to achieve a degree of generality. On the other hand, the specificity of the models can be understood from the positive correlation of RMSE with the MAD value. This opens the scope of using transfer learning for datasets that have little correlation with the dataset the models were trained on.
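The dataset descriptor discussed above can be sketched as follows. The exact formula for a MAD taken "with reference to" the Beijing data is not given in this section, so the sketch assumes it is the mean absolute deviation of the new dataset from the mean of the reference (training) dataset; that interpretation, the function names, and the toy values are all assumptions.

```python
from typing import Sequence


def variance(values: Sequence[float]) -> float:
    """Population variance of a temperature series."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def mad_about_reference(values: Sequence[float],
                        reference: Sequence[float]) -> float:
    """Mean absolute deviation of `values` from the reference mean.

    Assumption: the reference point is the mean of the training data.
    """
    ref_mean = sum(reference) / len(reference)
    return sum(abs(v - ref_mean) for v in values) / len(values)


def mad_variance_product(values: Sequence[float],
                         reference: Sequence[float]) -> float:
    """Product of variance and reference-based MAD for one dataset."""
    return variance(values) * mad_about_reference(values, reference)


# Toy series (illustrative only, not data from the paper).
reference = [10.0, 12.0, 14.0]   # stands in for the training city
new_city = [11.0, 13.0, 15.0]    # stands in for an unseen city

score = mad_variance_product(new_city, reference)
```

Under this assumption, a dataset scores high when it is both highly variable and centered far from the training data, which matches the reported Toronto case of high variance offset by a low MAD.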

VII. CONCLUSION
This study has carried out a comparative analysis of six DNN models to determine which performs best for high-resolution hourly temperature forecasting on Beijing temperature data. The study has also presented an in-depth robustness analysis to examine the change in the performance of these DNNs when tested on geographically diverse datasets. The comparative analysis revealed that the GRU-LSTM parallel network provides the best performance on the Beijing data, with an RMSE of 1.691°C. CNN, on the other hand, performs slightly worse at 1.759°C RMSE, ranking third in terms of accuracy, but has by far the best computational time. The study also found that single-run models are better and more consistent for prediction than single-point regression models. The comparative analysis further revealed that the models perform poorly on normalized temperature data, which is unusual, as neural network models generally tend to perform better on normalized data. In short, this study aims to act as a benchmark for high-resolution temperature forecasting from historical temperature data alone, using neural networks that yield sufficient accuracy and are computationally inexpensive.
From the robustness analysis, the study was able to establish a correlation between model performance and the product of the MAD and the variance of the dataset. It was further found that the GRU-based model generalized best over the various geographical locations, although it performed poorly on the Beijing data. This was explained by the high variability of temperature data across the globe: to perform well on the temperature data of a particular location, the models had to trade off robustness for a certain level of specificity. This indicates a direction for future work in which transfer learning is adopted so that models trained on one dataset can perform well on new data that has little correlation with the previous dataset. Moreover, this study can be combined with research on embedded systems equipped with artificial intelligence processing capabilities, with a view to implementing portable, compact devices for on-the-spot temperature forecasting.