Hybrid Time-Series Framework for Daily-Based PM2.5 Forecasting

The impact of fine particulate matter on health has captured attention worldwide. Many studies have shown that fine particulate matter harms the respiratory and cardiovascular systems. To protect people from this harm, many studies on PM2.5 prediction have been conducted in recent years. Accurate PM2.5 forecasting can not only alert people to stay away from areas of high concentration but also inform the government's future environmental policies. In this paper, we propose a hybrid time-series prediction framework for daily-based PM2.5 forecasting. The proposed framework consists of three components: the autoencoder, the dilated convolutional neural network, and the gated recurrent unit. An experimental dataset covering 76 monitoring stations from the Taiwan Environmental Protection Administration is used to compare the baseline and proposed models. The proposed model predicts the PM2.5 concentration not only for a specified city-/county-wide region but also for a particular monitoring station/site. By considering air quality data, meteorological data, and geographical data simultaneously, the proposed model can increase the accuracy of PM2.5 prediction. In addition, the proposed PM2.5 forecasting model can learn the location-centric spatial features and the daily-based temporal features simultaneously. The experimental results show that the prediction accuracy of the proposed model is superior to those of the baseline models.


I. INTRODUCTION
Research on air quality in recent years has revealed that rapid climate change and serious environmental pollution have impacted human health worldwide. Additionally, many studies have shown that air quality prediction, particularly of fine particulate matter (PM2.5), is becoming more important for the living environment. From the viewpoint of the government, warnings of PM2.5 concentration can help in making environmental policies and in reminding citizens to stay away from polluted areas. Hence, PM2.5 forecasting and monitoring are topics of not only national but also international concern.
To live in a healthy environment, many studies have addressed air pollution intensity and air quality forecasting [1], [2]. Additionally, most of the research studies applied either theoretical methods or simulation models to highlight the situation under air pollution [3]. Machine learning has been applied to predict air quality. Dong et al. [4] proposed a method for PM2.5 forecasting by using the hidden semi-Markov model (HSMM). Donnelly et al. [5] proposed real-time air quality forecasting, which is based on integrated parametric and nonparametric regression. Furthermore, weather and climate trends are considerably relevant to air quality; hence, applying traditional machine learning models is insufficient for air quality prediction.
The associate editor coordinating the review of this manuscript and approving it for publication was Turgay Celik.
In recent years, some studies have proposed a novel method that is used for prediction models. The equipment degradation processes possessed long-range dependence and multimode characteristics. The causes of the multimode characteristics include the external environment and operating conditions, as well as the equipment loads throughout its lifetime. Duan et al. [6] developed a multimodal fractional Lévy stable motion degradation model, which is used to predict the product technical life or the remaining useful life of equipment. Liu et al. [7] proposed a prediction model of the remaining useful life based on the generalized Cauchy stochastic process. Reliable gearbox prediction is a complex problem. To overcome the gearbox reliability problem, a hybrid model based on the fractional Lévy stable motion, the gray model and the metabolism method was proposed [8]. In this model, the feature extraction method is used to reveal gearbox degradation and to solve gearbox insensitivity to weak faults. In 2021, a new long-range-dependent degradation model was proposed to predict the remaining useful life of rolling bearings [9]. The degradation model is based on the generalized Cauchy process and can describe the local irregularities and the global correlation characteristics of the time-series data. Linear mathematical models cannot adequately describe wind-speed characteristics. To overcome the limitations of the linearity assumption, a novel model based on the generalized Cauchy process was introduced [10]. In this model, the fractal dimension and Hurst parameter are combined for simulation and forecasting of the wind speed. Furthermore, the prediction model is applicable to describe the local irregularity and global correlation of wind speed.
Many studies have addressed the interrelation between air pollution factors such as PM2.5 and meteorological data. Traditional machine learning lacks data-driven approaches for processing time-series air quality data [11], [12]. In contrast, deep learning models can apply data-driven methods [13], in which the features of air quality data are competently extracted. In the domains of image classification, speech recognition, and natural language processing, deep learning models have made remarkable achievements [14]-[16]. Shallow learning models have been utilized to predict air quality, and deep learning approaches have been applied to predict time-series air quality data [17]-[20]. In [21], a multivariate time-series method was used for air quality prediction. Overall, deep learning models are well suited to data-driven approaches and to handling time-series data.
A hybrid time-series framework for daily-based PM 2.5 forecasting is proposed in this paper. The nationwide and city-/county-wide regions for air quality prediction are all covered. Both the temporal and spatial correlation dependencies of features are learned from air quality time-series data, such as wind speed, PM 2.5 , and coordinates. We summarize the major contributions of this paper in the following paragraphs.
First, we develop a hybrid time-series deep learning model, which is composed of three components. Autoencoder (AE) and dilated convolutional neural network (CNN) learn spatial features through air quality and geographical data. Gated recurrent unit (GRU) extracts temporal features through air quality and meteorological data. Compared with the existing models, such as the ST-DNN model proposed by Soh et al. [22], the proposed model can decrease the average MAE and RMSE values by 16% and 18%, respectively. In addition, our model also shortens the average training time by 12%.
Second, according to the experimental results, the proposed model is not only accurate on a nationwide scale but also adequate for region-wide prediction of time-series data such as air quality. Furthermore, our model can be applied to predict the air quality of the target site. The proposed model is superior to the existing prediction models.
The rest of this paper is organized as follows. Section II describes the related research works. The methodology of the proposed PM 2.5 forecasting framework is presented in Section III. Section IV depicts the experimental setup and the prediction comparison, of which the prediction error and the training time are considered in our work. We show the conclusion and future work in Section V.

II. RELATED WORKS
In the literature on PM2.5 forecasting, almost all existing works addressed the prediction accuracy of air pollutants by using machine learning models and statistical methods [3], such as HSMM [4], regression [5], artificial neural networks [23], and ARIMA [24]. Zhang et al. [1], [2] proposed a real-time air quality prediction approach, which focuses on the analysis of the major research trends and current status, as well as future directions. In Zhou et al. [25], the hidden temporal dependencies regarding PM2.5 were addressed with Lasso-Granger by developing a probabilistic dynamic causal (PDC) model. Zhou et al. [23] proposed a hybrid model, based on the regression neural network and empirical mode decomposition, for the previous 24-hour PM2.5 prediction. In Deleawe et al. [26], a machine learning model that conducts air quality measurements in the urban environment was used to predict the CO2 levels.
In addition, many studies show that deep learning models have been used for air quality prediction. Air quality data possess time series and nonlinear characteristics, and thus, data-driven models are directed to address the topic of urban computing [27]. Moreover, many PM 2.5 forecasting studies are based on big data, which can obtain the predicted results by adopting many historical and multivariate data [28]. Zheng et al. [11] proposed a semisupervised learning model that combines CRF with ANN classifiers. Hsieh et al. [12] developed a semisupervised method to conduct fine-grained and real-time air quality data. An air quality prediction framework in real time that applies data-driven models was presented in Zheng et al. [29].
With regard to the capability of data-driven methods and nonlinear problems, deep learning models have generally been adopted to solve time-series and sequence data problems [6], [30], [31]. Air quality data possess such time-series characteristics. Li et al. [32] presented a novel air quality forecasting model by utilizing spatial-temporal deep learning (STDL), which considers the temporal and spatial correlations. To predict air pollution, Ong et al. [33] developed a deep recurrent neural network (DRNN) by adopting the autoencoder approach. Qi et al. [18] proposed a deep air learning (DAL) model to deal with feature analysis, interpolation and forecasting. Zhang et al. [34] developed a deep residual neural network to extract the features of time-series data and analyzed the congestion of citywide crowds.
In [35], a hybrid deep learning framework was developed, which combined multiple deep neural network models. Hybrid frameworks have also been applied to video classification and face detection [36]. However, hybrid deep learning frameworks have not yet been well adapted to air quality forecasting [37]. Many researchers have shown that hybrid deep learning models perform exceptionally well compared to classic deep learning models [36].
Many works have shown that convolutional neural networks (CNNs) have excellent performance in video processing and image recognition [16]. They are also applied to time-series data prediction [31]. Among deep learning models, CNNs are well suited to time-series data and multivariate data. Recurrent neural networks (RNNs) are applicable for learning the time dependencies and extracting the temporal features of time-series data. Furthermore, to mitigate the vanishing gradient problem of RNNs, Hochreiter et al. [38] developed an RNN variant, long short-term memory (LSTM), which relies on the internal states of its memory cells. RNNs can predict time-series data; moreover, LSTM excels at time-dependent feature extraction.
Du et al. [37] proposed a hybrid deep learning model to predict PM2.5 in Beijing, and the hybrid model was composed of 1D-CNN and Bi-LSTM. In their work, two datasets were applied to PM2.5 forecasting: the Beijing PM2.5 Dataset from the US Embassy in Beijing, and the Urban Air Quality Dataset from the Urban Air Project of Microsoft Research. The experimental results show that 1D-CNN can effectively extract the local trends and spatial features from the air quality time-series data. Soh et al. [22] developed an adaptive deep learning model to forecast PM2.5 in Taiwan and Beijing. In their work, the Taiwan dataset was collected from the Taiwan Environmental Protection Administration, and the Beijing dataset was taken from the Urban Air Quality Dataset from the Urban Air Project of Microsoft Research. The experimental results also showed that CNN can extract the surrounding targets and spatial features from the air quality data.
Du et al. [37] proposed a deep air quality forecasting framework (DAQFF), which is composed of various deep learning models. The main idea of the DAQFF is not only to deal with time-series forecasting issues but also to address spatiotemporal data features. In addition, the DAQFF correlates the multivariate air quality data. The DAQFF can extract the temporal features as well as the spatial features from air quality data. One-dimensional CNN and bidirectional LSTM are two components of DAQFF; the former deals with the spatial data features, and the latter deals with the temporal data features [39], [40].
In [41], Yu and Koltun proposed a convolutional network approach to conduct dense forecasting, and the proposed model applies dilation factors to aggregate multiscale contexts without degrading the resolution. The proposed model introduces dilation factors for the sake of expanding the receptive fields. In addition, dilated convolution can increase the accuracy and efficiency of dense forecasting.
Soh et al. [22] developed an adaptive air quality prediction model that includes multiple deep learning models. The spatial-temporal deep neural network (ST-DNN) considers terrain and meteorological data concurrently, which means that ST-DNN can extract spatial and temporal features from air quality data. ST-DNN combines three deep learning models: the first two are the artificial neural network (ANN) and long short-term memory (LSTM), which extract the temporal features from the air quality data; the last one is the convolutional neural network (CNN), which extracts the spatial features from the air quality data. In summary, ST-DNN can extract the spatial correlations and temporal dependencies of neighboring locations. Zhang et al. [42] combined CNN and LSTM to achieve higher forecasting accuracy of air pollution.

III. METHODOLOGY
In this section, we first present the framework of this paper and then state three main components in the framework. That is, the autoencoder, one-dimensional CNN and GRU are described in order.
In this work, our motivation is to develop a deep learning framework for time-series PM2.5 forecasting. For a fair comparison of model performance, we compare the proposed model against classic and existing deep learning models and also choose a traditional machine learning model for comparison. In this paper, PM2.5 forecasting considers both the location correlation of multiple monitoring stations and the time dependency of a single monitoring station. CNNs can effectively extract the local trends and spatial features of different districts. The GRU possesses a memory mechanism, so it can effectively extract the short- and long-term temporal features of a particular district. Fig. 1 depicts the components and functionalities of our framework. On the left-hand side, the air quality and geographical data are first input into the autoencoder layers and then into the dilated convolution layers. On the right-hand side, the air quality and meteorological data are input into the GRU layers. Afterward, the outputs of the two branches are merged by the concatenate layer for data fusion. One flatten layer is then introduced to feed the following dense layer. Finally, the predicted PM2.5 values are generated.
By considering the interrelated factors of variant data sources, we consider the correlation of geographical areas, meteorological conditions, and air quality time-series data simultaneously. AE and dilated CNN can effectively extract PM 2.5 concentrations from the specified district based on historical air quality, and GRU can effectively extract PM 2.5 concentrations from the seasonal climate based on historical air quality. To predict the PM 2.5 concentration of particular monitoring stations under various climate circumstances, we combine the learning results of AE and dilated CNN with GRU for all the time periods.
The proposed model is named the hybrid time-series framework (HTSFW), which combines unsupervised and supervised models for daily-based PM 2.5 forecasting. The HTSFW model applies AE and dilated CNN to extract the local trends and spatial features. Additionally, HTSFW utilizes GRU to extract the long dependencies and temporal features.
For conducting the spatiotemporal features of the air quality data, we extract the spatial features from the PM 2.5 values correlated with the monitoring station locations, such as the coordinates; in addition, we concurrently retrieve the temporal features from the PM 2.5 records interrelated to the weather and climate factors, such as the wind speed.
The air quality-related dataset is composed of air quality, geographical and meteorological data, such as PM2.5, longitude, latitude, SO2, CO, O3, NO2, PM10, wind speed, temperature and humidity. To increase prediction accuracy, the HTSFW model consolidates the training results of the geographical data and the meteorological data. The missing values of the air quality dataset were padded with zeros, and the same experimental dataset was applied to all the baseline and HTSFW models. The data are recorded by day; hence, PM2.5 forecasting is performed on daily time frames.
The innovation of the HTSFW model is that it not only considers a single monitoring site for time-dependent meteorological factors but also examines wide regions for location-interrelated geographical factors. In addition, the HTSFW model combines unsupervised and supervised models, which yields excellent performance in PM2.5 forecasting. The details of the HTSFW model are discussed in the following subsections.

A. AUTOENCODER
The first model used in the proposed framework for the air quality and geographical data is the autoencoder, an unsupervised deep learning model [10]. The simplest autoencoder network has one hidden layer. The input layer first encodes the high-dimensional data; then, the hidden layer stores the low-dimensional codes as the intended data features. Additionally, the functionality of the output layer is to use the low-dimensional codes to reconstruct the high-dimensional input vectors. The dimensionality of the input layer is five, and the input data are encoded and then stored in the hidden layer. Because the hidden layer has fewer neurons than the input layer, the encoding results in data compression or dimensionality reduction. With the same dimensionality as the input layer, the output layer decodes the hidden representation from the previous layer and reconstructs the original input data.
Furthermore, multiple hidden layers can be constructed to form the stacked autoencoder (SAE) network. In the encoding phase, the dimensionality of the next layer has fewer neurons than the previous layer, which means that each neuron ignores useless data and keeps meaningful data. During the training process, backpropagation can be utilized for fine-tuning the connection weights. In the decoding phase, in contrast to the encoding phase, the dimensionality of the next layer possesses more neurons than the previous layer; each neuron learns the data features and reconstructs the original input.
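As a rough illustration of the encode/decode structure described above, the following Python sketch passes one five-dimensional record through a single hidden layer and back. The hidden size of three, the sigmoid activation, the random untrained weights, and the sample record are all assumptions made for illustration; they are not taken from the HTSFW configuration.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense(x, weights, biases, activation):
    """Apply one fully connected layer: activation(W x + b)."""
    return [
        activation(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def random_layer(n_in, n_out, rng):
    """Return (weights, biases) with small random values (untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)
INPUT_DIM, CODE_DIM = 5, 3  # CODE_DIM = 3 is an assumed hidden size

enc_w, enc_b = random_layer(INPUT_DIM, CODE_DIM, rng)
dec_w, dec_b = random_layer(CODE_DIM, INPUT_DIM, rng)

record = [0.42, 0.10, 0.77, 0.31, 0.58]  # one normalized input vector (made up)
code = dense(record, enc_w, enc_b, sigmoid)          # low-dimensional code
reconstruction = dense(code, dec_w, dec_b, sigmoid)  # same size as the input

# Reconstruction error that training by backpropagation would minimize.
mse = sum((a - b) ** 2 for a, b in zip(record, reconstruction)) / INPUT_DIM
```

A stacked autoencoder would chain several such layers, shrinking the width in the encoder and widening it again in the decoder.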
In addition to dealing with the geographical relationship set of the target stations for PM 2.5 forecasting, we develop an approach to select the target region or site, which is simultaneously applicable for extracting the local trends and long dependencies. Algorithm 1 depicts the proposed method.
Algorithm 1 (excerpt)
    ...
        if d_i = d_target then
            sort s_j contained in d_i by ascending order
            sort t_k recorded in s_j by ascending order
        end if
    end for
    end if
Here, d_i and s_j indicate the particular district and station, respectively, and i and j denote the district and station identifiers, respectively. Moreover, t_k denotes the timestamp of the data record, and k represents the timestamp index.
Algorithm 1 first sorts the district identifiers in ascending order; the HTSFW model defines the target district identifier afterward. Then, the proposed algorithm searches the matching district identifier and sets the matching district to be the target district. After that, the monitoring stations of the target district are sorted by the station identifiers in ascending order. Finally, each target monitoring station is sorted by the timestamps in ascending order.
As shown in Algorithm 1, the target district identifier can refer to a single district or multiple districts. In other words, the regional coverage of PM 2.5 forecasting can be dynamically conducted in accordance with the proposed algorithm.
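The selection and ordering steps described above can be sketched in Python as follows. This is a minimal sketch and not the authors' implementation: the record layout (dictionaries with the hypothetical keys `district`, `station`, and `timestamp`) and the function name are assumptions, and the nested sorts of Algorithm 1 are expressed here as a single sort on a composite key, which yields the same ordering.

```python
def select_target_records(records, target_districts):
    """Return the records of the target district(s), ordered by district,
    station, and timestamp (all ascending).

    `records` is a list of dicts with the hypothetical keys
    'district', 'station', and 'timestamp'.
    """
    targets = set(target_districts)
    selected = [r for r in records if r["district"] in targets]
    # Sorting on the composite key orders districts first, then the stations
    # within each district, then the timestamps within each station.
    return sorted(
        selected,
        key=lambda r: (r["district"], r["station"], r["timestamp"]),
    )


records = [
    {"district": 2, "station": 7, "timestamp": "2014-01-02"},
    {"district": 1, "station": 3, "timestamp": "2014-01-02"},
    {"district": 2, "station": 5, "timestamp": "2014-01-01"},
    {"district": 2, "station": 7, "timestamp": "2014-01-01"},
    {"district": 3, "station": 9, "timestamp": "2014-01-01"},
]
ordered = select_target_records(records, target_districts=[2])
```

Passing several identifiers in `target_districts` covers the multi-district case, so the same routine serves site-level, regional, and nationwide selection.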
In this paper, an experimental dataset with 76 monitoring stations (Jan. 2014 to Jun. 2019) in Taiwan is downloaded, which was provided by the Taiwan Environmental Protection Administration (TWEPA). In the downloaded dataset, the missing values of the data items were recorded to be empty. To deal with the missing values in the dataset, we padded the mentioned data items with zeros.
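The zero-padding step can be illustrated with a short Python sketch. The row layout and the function name are hypothetical, and treating `None` and blank strings as "empty" is an assumption about how missing fields appear in the downloaded files.

```python
def pad_missing_with_zeros(row):
    """Replace empty fields (None or blank strings) with 0.0; parse the rest."""
    cleaned = []
    for value in row:
        if value is None or str(value).strip() == "":
            cleaned.append(0.0)
        else:
            cleaned.append(float(value))
    return cleaned


raw_row = ["23.5", "", "0.31", None, "  ", "4.2"]  # made-up daily record
filled = pad_missing_with_zeros(raw_row)
```

One consequence of this choice is that a padded value cannot be distinguished from a genuine zero reading once the data are loaded.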
SVM was developed in 1992 and is a supervised machine learning model. It has been widely used for data classification and regression. SVM is also a forecasting method based on a statistical learning framework. One of the SVM models, LSSVM, is a least-squares version of SVM, which is to minimize the sum of the squared errors of the objective function. SVR, another version of SVM, was proposed in 1996 and is mainly used for data regression. Du et al. [37] proposed the DAQFF model, which can accurately predict PM2.5 concentrations. To compare the performances of the proposed and baseline models, SVR, ARIMA, LSTM, GRU, CNN and RNN were tested in their work. For the next one-hour prediction of the Beijing PM2.5

B. ONE-DIMENSIONAL DILATED CONVOLUTIONAL NEURAL NETWORK
The next model used in the proposed framework for the air quality and geographical data is a one-dimensional dilated CNN. The main consideration is stated as follows. CNNs are well suited for spatial feature extraction, while one-dimensional CNNs apply to time-series data. The dilated convolution network can extract air quality features from the input time-series data. In addition, using dilation factors can expand the receptive fields and further enhance the training efficiency. Both are discussed in the following subsections.

1) 1D-CNN FOR TIME-SERIES DATA FEATURE EXTRACTION
In the image processing field, convolutional neural networks are commonly adopted [16]. Nevertheless, CNNs are also applied to time-series data prediction. The classical CNN commonly consists of convolutional, activation and pooling layers. Furthermore, the two-dimensional CNN is popularly utilized for image classification [35]. In this work, the HTSFW model utilizes the one-dimensional CNN to predict the PM2.5 concentration. In general, the activation function of the one-dimensional convolutional layer is depicted as follows:

$$h_i^t = f\left(w_i^t * x^t + b_i^t\right) \quad (1)$$

In Equation (1), h_i^t is the layer output, f is the activation function of the convolutional layer, * indicates a convolution operator, w_i^t and b_i^t represent the weights and biases, respectively, and x^t represents the t-th time step of the input data.
The HTSFW model uses two connected one-dimensional convolutional layers to extract the spatial features from the geographical data, and the two connected layers constitute the hierarchy to represent the local trend features. In other words, the two connected one-dimensional CNNs can learn the local trend features of a single monitoring station and can also extract the hidden spatial correlation features from multiple stations.
In Fig. 3, the time-series features of the input air quality data are first filtered through the one-dimensional convolution kernel, and the activation function processes the input features, weights and biases. After that, the extraction process of the activation function generates the output targets.
Since the air quality input dataset contains time-series data items, the one-dimensional CNN is adopted to compress the length of multivariate input data and learn the air quality data features. Due to the local perception and weight sharing of the one-dimensional convolution network, the number of parameters decreases, and the learning efficiency improves.
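To make the weight-sharing idea concrete, the following Python sketch applies a single one-dimensional filter with a ReLU activation to a short series. It is only an illustration under stated assumptions: the kernel values, the sample series, and the zero "same" padding are chosen for the example, and, as in common deep learning libraries, the operation is a cross-correlation (the kernel is not flipped).

```python
def relu(v):
    return max(0.0, v)


def conv1d_same(series, kernel, bias=0.0, activation=relu):
    """One-filter 1D convolution layer with zero 'same' padding.

    The same kernel weights are reused at every time step, so the number
    of parameters does not depend on the series length.
    """
    k = len(kernel)
    left = (k - 1) // 2
    right = k - 1 - left
    padded = [0.0] * left + list(series) + [0.0] * right
    return [
        activation(sum(kernel[j] * padded[t + j] for j in range(k)) + bias)
        for t in range(len(series))
    ]


pm25 = [12.0, 15.0, 30.0, 28.0, 18.0, 10.0]  # made-up daily values
# A difference kernel: responds where the series is rising.
trend = conv1d_same(pm25, kernel=[-1.0, 0.0, 1.0])
```

Here the filter has three weights and one bias no matter how long the input is, which is the parameter saving the paragraph above refers to. The first output value is influenced by the zero padding at the left edge.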

2) DILATION FOR LEARNING PHASE EFFICIENCY
In our model, we applied 1D-CNN to extract the local trend and spatial features from the time-series data. Since air quality data are related to the time sequence, we apply the dilation factors in the hidden layers to expand the receptive fields.
When the air quality data are inputted to 1D-CNN, by using the dilation convolution, our model can speed up the learning process both in the single time step and the multiple time steps.
Yu and Koltun [41] developed a convolutional network module that applies dilation factors to aggregate multiscale contextual information. The dilated convolution module can exponentially expand the receptive fields without losing coverage. The dilated convolution is defined as follows:

$$(F *_l k)(p) = \sum_{s + lt = p} F(s)\,k(t) \quad (2)$$

Equation (2) states the one-dimensional dilated convolution, in which dilation rate l convolves input F with kernel k, where *_l denotes a dilated convolution operator, p = α + l, α = {l−1, 3l−1, 5l−1, 7l−1, ..., Ml−1−l}, and M is the input bucket size.
Fig. 4 represents the dilated convolution operation. First, the one-dimensional dataset is input into the dilated convolution layers. Next, the dilation factors are implemented in the hidden layers; specifically, the dilation factors are set to one, two, four and eight. Afterward, as previously mentioned, the receptive fields of the next layers can expand exponentially. As a result, the one-dimensional dilated CNN can decrease the number of parameters and reduce the training time. For example, Fig. 4 shows three hidden layers with 16 input data. Based on the depiction of the elapsing time sequence in Fig. 3, the indices start from zero and run from left to right.
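The growth of the receptive field with dilation factors of one, two, four and eight can be checked numerically. The sketch below is an illustration only: the kernel size of two, the all-ones kernel, and the causal (left-only) zero padding are assumptions chosen so that four stacked layers cover exactly 16 inputs; they are not stated in the text.

```python
def dilated_conv1d(series, kernel, dilation):
    """Causal dilated 1D convolution with zero padding on the left.

    output[p] = sum over t of kernel[t] * series[p - dilation * t],
    i.e. the sum over all (s, t) with s + dilation * t = p.
    """
    out = []
    for p in range(len(series)):
        total = 0.0
        for t, w in enumerate(kernel):
            s = p - dilation * t
            if s >= 0:
                total += w * series[s]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input steps that influence one output of the stacked layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


x = [float(i) for i in range(16)]  # 16 inputs, indexed from zero
h = x
for d in (1, 2, 4, 8):
    h = dilated_conv1d(h, kernel=[1.0, 1.0], dilation=d)

rf = receptive_field(kernel_size=2, dilations=(1, 2, 4, 8))
```

With the all-ones kernel, the last output `h[15]` equals the sum of all 16 inputs, which confirms that every input position is covered. Four undilated layers with the same kernel would see only five inputs.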
In [43], Zhen et al. presented a dilated CNN approach for sequence prediction. The approach utilizes dilation factors to extend the receptive fields and introduces residual connections to form a deeper network. The experimental video analysis results revealed performance enhancements with fewer parameters and shorter running times. Hence, inspired by the convolutional neural network with dilation factors, this work proposes a one-dimensional dilated CNN to expand the receptive fields and increase the training efficiency for air quality forecasting.
In this paper, two one-dimensional convolution layers are connected, and the dilation rate is set to two for both layers. In the first dilated convolution layer, the number of output filters is 64, and the length of the 1D convolution window is one. In the second dilated convolution layer, the number of output filters is 128, and the length of the 1D convolution window is one. The padding of the two dilated convolution layers is set so that the input and output have the same dimension.

C. GATED RECURRENT UNIT
The exploding and vanishing gradient problems of traditional RNNs are inevitable. To mitigate these gradient problems, the long short-term memory (LSTM) network was developed in 1997 [38]. LSTM relies on the internal states within the memory cells to mitigate the mentioned gradient problems. In 2014, another RNN variant, the gated recurrent unit (GRU), was proposed by Cho et al. [14]. LSTM is composed of three main building blocks: the input gate, forget gate, and output gate. The GRU consists of only two main components: the reset gate and update gate. In other words, GRU has a simpler network than LSTM. For this reason, the GRU training process is more efficient.
The reset gate is responsible for the short-term memory, while the update gate is in charge of the long-term memory. In this work, GRU is used for two reasons: one is the functionality of hidden state handling, which implies that GRU is well suited to time-series data prediction; the other is a simpler network, which results in a faster learning process. Therefore, GRU is feasible for extracting the long-term temporal dependency features. Fig. 5 shows the main components of the GRU building block. The main components are combined to handle the hidden states and retain the temporal features over a period of time. The main components of the single GRU block are represented as follows:

$$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \quad (3)$$
$$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \quad (4)$$
$$\hat{h}_t = \tanh\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right) \quad (5)$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \hat{h}_t \quad (6)$$

As denoted in the above formulas, r_t represents the reset gate that determines the amount of previous information to be ignored, z_t represents the update gate that determines the amount of previous information to be passed, ĥ_t denotes the candidate hidden state, h_{t−1} represents the previous hidden state, and h_t represents the current hidden state. In addition, x_t indicates the input data, W and U are the weights, b is the bias, σ is the sigmoid function, and ⊙ denotes element-wise multiplication.
VOLUME 9, 2021
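One common formulation of a single GRU time step, consistent with the symbol definitions above, can be sketched in plain Python. This is an illustrative sketch rather than the authors' implementation: the convention that the update gate z_t weights the previous hidden state (and 1 − z_t the candidate state) is an assumption that follows the description of z_t above, and the constant weights and two-feature inputs are made up.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + bi
        for W_row, U_row, bi in zip(W, U, b)
    ]


def gru_step(x, h_prev, p):
    """One GRU time step; returns the new hidden state h_t."""
    r = [sigmoid(v) for v in affine(p["W_r"], x, p["U_r"], h_prev, p["b_r"])]
    z = [sigmoid(v) for v in affine(p["W_z"], x, p["U_z"], h_prev, p["b_z"])]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]  # reset gate applied to h_{t-1}
    cand = [math.tanh(v) for v in affine(p["W_h"], x, p["U_h"], gated, p["b_h"])]
    # z keeps the previous state; (1 - z) admits the candidate state.
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h_prev, cand)]


def constant_params(n_in, n_hidden, value=0.1):
    """Build untrained parameters filled with one constant (illustration only)."""
    W = [[value] * n_in for _ in range(n_hidden)]
    U = [[value] * n_hidden for _ in range(n_hidden)]
    b = [0.0] * n_hidden
    return {"W_r": W, "U_r": U, "b_r": b,
            "W_z": W, "U_z": U, "b_z": b,
            "W_h": W, "U_h": U, "b_h": b}


params = constant_params(n_in=2, n_hidden=2)
h = [0.0, 0.0]
for x_t in ([0.5, 0.1], [0.6, 0.2], [0.4, 0.3]):  # made-up two-feature inputs
    h = gru_step(x_t, h, params)
```

Driving the update gate toward one (for example with a large `b_z`) makes the unit copy its previous state, which is how the GRU retains information over long spans.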
In this work, GRU is applied to retrieve the long-term and time-series features of the air quality data. The HTSFW model extracts the local trend features using AE and a one-dimensional dilated CNN; additionally, the long-term spatial-temporal correlation features hidden in multivariate time-series data are extracted by using GRU.
The local trend features and the long-term correlation features are concatenated afterward, followed by the flatten layer, which transforms the features into a vector. In addition, the vector dimensionality is reduced by the fully connected layer. Finally, the output of the network is the PM2.5 forecast.
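The fusion step (concatenate, flatten, fully connected layer) can be sketched as follows. The feature sizes, weights, and helper names are hypothetical and far smaller than in the actual model, and the output activation is omitted for brevity.

```python
def concatenate_features(a, b):
    """Join two (time_steps x features) matrices along the feature axis."""
    if len(a) != len(b):
        raise ValueError("both branches must have the same number of time steps")
    return [row_a + row_b for row_a, row_b in zip(a, b)]


def flatten(matrix):
    """Turn a (time_steps x features) matrix into one flat vector."""
    return [v for row in matrix for v in row]


def dense_scalar(vector, weights, bias):
    """Fully connected layer with a single output neuron (no activation)."""
    return sum(w * v for w, v in zip(weights, vector)) + bias


spatial = [[0.2, 0.4, 0.1]]  # 1 time step x 3 features (AE + dilated CNN branch)
temporal = [[0.7, 0.3]]      # 1 time step x 2 features (GRU branch)

fused = concatenate_features(spatial, temporal)
vector = flatten(fused)
prediction = dense_scalar(vector, weights=[0.5, -0.2, 0.1, 0.3, 0.4], bias=0.05)
```

The dense layer sees the spatial and temporal features side by side, so its weights decide how much each branch contributes to the final value.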

IV. EXPERIMENTS
In this section, we first describe the experimental dataset of this work and then explain the experimental parameters and settings. Moreover, we conduct performance comparisons between the baseline and proposed models and between the ST-DNN [22] and proposed models.

A. DATASET
In this paper, the experimental dataset is sourced from the Taiwan Environmental Protection Administration (TWEPA). The dataset contains air quality-related data that cover all of the cities and counties in Taiwan, and each data record is a daily average. The data features include air quality, geographical and meteorological data such as PM 2.5 , longitude, latitude, SO 2 , CO, O 3 , NO 2 , PM 10 , wind speed, temperature and humidity.
The time of the dataset is from Jan. 2014 to Jun. 2019, and the data interval is 24 hours. There are 76 monitoring stations built around all cities and counties in Taiwan. To deal with the missing values in the dataset, those recorded as empty were padded with zeros.
Many works show that air quality is highly related to meteorological circumstances. For instance, moderate air pollution is due to high wind speed, good air quality can be due to high atmospheric pressure, and high humidity deteriorates the PM 2.5 concentration [11], [29].
To prepare the experimental dataset, we sort the data items by geographical location after downloading the raw data from the TWEPA. That is, the neighboring cities/counties are sorted in order. For data preprocessing, we extract the spatial correlation features between the air quality and the geographical regions.
In the experimental dataset, each monitoring station possesses air quality, geographical and meteorological data features. In addition, each data item contains a recorded timestamp. The 76 monitoring stations are distributed across Taiwan; therefore, this work can predict the PM2.5 concentration in Taiwan and the city-/county-wide regions, as well as at particular sites. The HTSFW model is compared with five deep learning models and one machine learning model. The baseline models include convolutional neural networks (CNNs), recurrent neural networks (RNNs), two RNN variants, namely long short-term memory (LSTM) and gated recurrent units (GRUs), and support vector regression (SVR). The ST-DNN model [22] is also included as a baseline for further comparison.
In our work, the default parameters in Keras are applied to weight initialization. During the training phase, a dropout rate of 0.3 is configured to prevent overfitting. In addition, the lookup size, batch size and number of epochs are 1, 32 and 100, respectively. The activation functions of CNN and RNN (including LSTM and GRU) are ReLU and tanh, respectively. Adam is used as the optimizer with a learning rate of 0.001. The values of beta one and beta two are 0.9 and 0.999, respectively, and epsilon is 1e-7.
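To make the role of the optimizer settings concrete, the following minimal sketch applies one Adam update to a single scalar parameter using the values stated above (learning rate 0.001, beta one 0.9, beta two 0.999, epsilon 1e-7). It follows the textbook Adam formulation and is only an illustration; the function name `adam_step` is hypothetical, and the internal Keras implementation may differ in detail.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One textbook Adam update for a scalar parameter (illustrative only)."""
    m = beta1 * m + (1.0 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Hypothetical starting point: parameter 1.0, gradient 2.0, first step (t = 1).
theta, m, v = adam_step(1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate regardless of the gradient's magnitude.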
For the baseline models, the number of hidden layers is set to one by default, and each hidden layer possesses 128 neurons. In the HTSFW model, we first utilize two AE layers concatenated with two one-dimensional convolution layers with a dilation factor of two for air quality and geographical feature extraction. The number of output filters and the length of the convolution window are set to (64, 1) for the first convolution layer and (128, 1) for the second.

TABLE 1. Prediction error comparison between the baseline and proposed models.
We also adopt two GRU layers, each with 128 hidden neurons, for air quality and meteorological feature extraction. The mean square error (MSE) is used as the loss function in the training process, and the activation function used in the output layer for target prediction is tanh. Furthermore, the input time-series data are normalized to [0, 1] by using the min-max function, while the missing values in the data items are padded with zeros. The dataset is divided into two parts: the first four years of data, and the last 18 months of data. The former part is used for training, and the latter part is used for testing.
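The preprocessing described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: missing values are represented as `None`, the minimum and maximum are taken over the observed values, missing entries are set to zero after scaling, and the split date of 2018-01-01 follows from the four-year/18-month division. The function names and the `(date, value)` record layout are hypothetical.

```python
from datetime import date

def min_max_scale(values):
    """Scale observed values to [0, 1]; missing entries (None) are padded with 0.0."""
    observed = [v for v in values if v is not None]
    lo, hi = min(observed), max(observed)
    span = (hi - lo) or 1.0  # guard against a constant series
    return [0.0 if v is None else (v - lo) / span for v in values]

def chronological_split(records, first_test_day=date(2018, 1, 1)):
    """Split (day, value) records into training and testing parts by date."""
    train = [r for r in records if r[0] < first_test_day]
    test = [r for r in records if r[0] >= first_test_day]
    return train, test

# Hypothetical daily readings with one missing value.
scaled = min_max_scale([10.0, None, 30.0, 20.0])
train, test = chronological_split(
    [(date(2017, 12, 31), 0.4), (date(2018, 1, 1), 0.6)]
)
```

Splitting by date rather than at random keeps every test sample later than every training sample, which matches the forecasting setting.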
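For training process evaluation, two metrics, the mean absolute error (MAE) and the root mean square error (RMSE), are applied to measure the learning performances. The two error indices are defined as follows:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
where n stands for the number of testing data, y i indicates the actual PM 2.5 value, and ŷ i represents the predicted PM 2.5 value.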
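As a minimal sketch, the two error indices can be computed from their standard definitions as shown below. The sample values are hypothetical and serve only to illustrate the calculation.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted values."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted values."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical PM2.5 readings and predictions.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
mae_value = mae(actual, predicted)    # (2 + 2 + 3) / 3
rmse_value = rmse(actual, predicted)  # sqrt((4 + 4 + 9) / 3)
```

Because RMSE squares each error before averaging, it penalizes large individual errors more heavily than MAE, which is why the two indices are usually reported together.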

C. PERFORMANCE COMPARISON OF BASELINE AND PROPOSED MODELS
In this section, the HTSFW model is compared with the baseline models. It is noted that the coverage area of PM 2.5 prediction is all the cities and counties in Taiwan, and the air quality data are from the 76 monitoring stations. In addition, both the single and multiple time steps of PM 2.5 forecasting are shown in Tables 1 and 2.
Since all the data items are 24-hour averages, one step indicates one day later, and multistep represents the specified number of days later. For each prediction size in Tables 1 and 2, the baseline models and the proposed HTSFW model are compared using MAE and RMSE.
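One way to read the single-step and multistep settings is as a pairing of each input window with the value observed a given number of days later. The sketch below illustrates that reading; it assumes an input window of one day and a direct (non-recursive) multistep target, neither of which is confirmed in detail by the text, and the function name is hypothetical.

```python
def make_supervised_pairs(series, horizon, look_back=1):
    """Pair each look-back window with the value `horizon` steps after its last element."""
    pairs = []
    for end in range(look_back, len(series) - horizon + 1):
        window = series[end - look_back:end]
        target = series[end - 1 + horizon]
        pairs.append((window, target))
    return pairs

# Hypothetical daily averages.
daily = [5, 6, 7, 8, 9]
one_day_ahead = make_supervised_pairs(daily, horizon=1)
three_days_ahead = make_supervised_pairs(daily, horizon=3)
```

Larger horizons leave fewer usable pairs, because the last `horizon` days of the series have no target available.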
As shown in Table 1, for PM 2.5 forecasting 24 hours later, the HTSFW model reduces MAE by 26.52%, 25.92% and 40.12% relative to CNN, RNN and LSTM, respectively. It also reduces RMSE by 22.57%, 7.32% and 42.11% relative to the CNN, RNN and LSTM models, respectively.
In Table 2, for the PM 2.5 prediction one day later, the HTSFW model reduces MAE by 36.13%, 66.26% and 53.13% relative to GRU, SVR and ST-DNN, respectively. In addition, the HTSFW model reduces RMSE by 36.24%, 56.46% and 46.22% relative to GRU, SVR and ST-DNN, respectively.
In addition, we highlight the preprocessing of the missing data items in the experimental dataset. Missing values are inevitable, for example, when a monitoring station malfunctions at some point during the five-year recording period. The handling of the missing values is important because it affects how the experiments proceed. In this work, the missing values of the experimental dataset are padded with zeros. The same zero-padding approach is applied to all the baseline models and the proposed model.
As shown in Tables 1 and 2, the prediction performance of the proposed model is compared with four classic deep learning models, one traditional machine learning model and one existing deep learning model. The comparison includes single- and multistep forecasting performances, and the prediction sizes cover ten time steps. In addition, MAE and RMSE are used for performance comparison. The improvement columns show the percentage reduction in prediction error achieved by our model. Based on the improvement values, the proposed model is superior to the six baseline models in the ten time steps in terms of both MAE and RMSE.
The recording period of the experimental dataset is from 01/01/2014 to 06/30/2019, and the data interval is 24 hours. There are 148,399 data samples in total. The training data span 01/01/2014 to 12/31/2017 and contain 107,357 samples (72%); the testing data span 01/01/2018 to 06/30/2019 and contain 41,042 samples (28%). To consider all the seasonal factors affecting the PM 2.5 concentration, we use a forecasting period between 01/01/2018 and 12/31/2018. In terms of the performance comparison of all the forecasting models, the HTSFW model can also extract the time-series data features from the seasonal interrelation.
According to the bar charts of the MAE averages plotted in Fig. 6, the prediction errors of the HTSFW model are clearly lower than those of the baseline models. SVR has the highest MAE average of 10.76, while HTSFW has the lowest MAE average of 7.26, which means that the proposed model can reduce the prediction errors for both short-term and long-term periods.
According to the performance comparisons over ten different time steps in Tables 1 and 2, the HTSFW model is superior to all the baseline models overall.
Based on the experimental results, the contributions of this paper are stated as follows. First, the HTSFW model can accurately predict PM 2.5 concentrations in various regions concurrently, for example, urban and rural areas. In other words, HTSFW can predict PM 2.5 concentrations for monitoring stations located nationwide. Second, the HTSFW model can accurately predict PM 2.5 concentrations over both short and long periods of time, and its forecasting in those periods is superior to that of the existing deep learning models. Therefore, HTSFW is applicable for predicting PM 2.5 concentrations in both single and multiple time steps.

D. PERFORMANCE COMPARISON BETWEEN ST-DNN AND PROPOSED MODELS
In this section, the HTSFW model is further compared to the ST-DNN model [22]. First, we compare the training time for all 76 monitoring stations in ten different time steps. Next, we select four cities/counties in Taiwan and compare the prediction errors of the monitoring stations located at each of the four cities/counties. In addition to comparing the prediction errors, we further select four particular sites that are located in the four cities/counties.
As shown in Table 3, the average training time over the ten time steps is 20.57 seconds for ST-DNN and 18.11 seconds for HTSFW. As a result, HTSFW decreases the average training time by 11.96%.
Fig. 8 shows the geographical map of all the cities/counties in Taiwan. The upper part of the map in green is the northern region, the middle part in blue is the central region, the lower area in yellow is the southern region, and the right side of Taiwan in red is the eastern region. To predict the PM 2.5 concentration of the specified regions, we apply the developed Algorithm 1 to select the target districts. This work selects one city/county from each of the four regions mentioned previously, indicated by the red arrows on the map. The four target districts are Taipei City (TPE), Taichung City (TCH), Tainan City (TAN) and Taitung County (TAT). Our Algorithm 1 can define arbitrary target district identifiers, although we select the four cities/counties in consideration of their representativeness of the four regions in Taiwan. In addition, the four selected target districts possess different weather conditions and climate circumstances.
Table 4 presents the experimental comparison of prediction errors between ST-DNN and HTSFW. We train the two models in five different time steps and compare the forecasting errors of the four target districts. In the testing data of the four cities/counties, the highest maximum PM 2.5 concentration value of 74 is in TAN, while the lowest maximum value of 23 is in TAT. For the PM 2.5 forecasting one day later, the HTSFW model reduces MAE by 35.04% and 64.88% in TAN and TAT, respectively; additionally, HTSFW reduces RMSE by 40.62% and 57.14%.
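The percentage reductions quoted in this section are consistent with the usual relative-change formula, (baseline − proposed) / baseline × 100. As a minimal check, the sketch below applies it to the two average training times reported above; the function name is hypothetical.

```python
def percent_reduction(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return (baseline - proposed) / baseline * 100.0

# Average training times reported for ST-DNN and HTSFW, in seconds.
training_time_reduction = round(percent_reduction(20.57, 18.11), 2)
print(training_time_reduction)  # 11.96
```

The same function applies to the MAE and RMSE improvement values when the baseline error is passed as the first argument.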
With respect to the prediction size of 12, the HTSFW model decreases MAE by 5.45%, 18.75%, 19.82% and 37.67% in TPE, TCH, TAN and TAT, respectively, and reduces RMSE by 9.39%, 9.58%, 19.19% and 31.33%, respectively. According to the prediction comparison in the five different time steps, the HTSFW model is conspicuously better than the ST-DNN model. For both the nationwide coverage and the regional districts, HTSFW performs well in time-series air quality prediction.
In Table 4, the experimental results show that the HTSFW model decreases the forecasting errors of the four regions in the five prediction sizes. ST-DNN is an adaptive deep learning model composed of traditional machine learning and classic deep learning models. Our hybrid model is further compared to the ST-DNN model by the peak error values. The peak error indicates how much the peak predicted PM 2.5 values of each model differ from the observed PM 2.5 values. Based on the experimental results, HTSFW possesses lower peak errors than ST-DNN in the TPE, TCH and TAT regions for all five prediction sizes. For the TAN region, HTSFW has slightly higher peak error values than ST-DNN for prediction sizes three, six and twelve, and lower peak errors for the remaining prediction sizes.
FIGURE 8. Regional city/county for PM 2.5 forecasting in Taiwan. Note: https://eego.epa.gov.tw/english/tour1/index1.asp?Parser=99,10,27,,,,,1.
Fig. 9 depicts the particular monitoring station Yangming in Taipei City, which is located in a mountain area. The maximum PM 2.5 concentration of the testing data of station Yangming is 32 in spring. The black trend represents the ground truth, the red line indicates the PM 2.5 prediction of ST-DNN, and the blue line represents the PM 2.5 prediction of HTSFW. From the plot, the one-year prediction of HTSFW is closer to the ground truth than that of ST-DNN.
Fig. 10 shows the PM 2.5 forecasting of station Fengyuan, which is located in the urban area of Taichung City. The maximum value of the PM 2.5 concentration of the testing data at the Fengyuan station is 52 in the spring. According to the prediction results, the ST-DNN model [22] performs worse than the HTSFW model, especially at the peak points of the trend, as shown in the forecasting figure.
Fig. 11 shows the PM 2.5 prediction of the Tainan station, located in an urban area in Tainan City. The maximum value of the PM 2.5 concentration of the testing data at the Tainan station is 65 in the spring. From the plot, the forecasting trend of the HTSFW model is quite close to the ground truth. In contrast, the PM 2.5 prediction of the ST-DNN model [22] is close to the ground truth only when the values are below 20 and differs evidently from the ground truth when the values are above 20.
Fig. 12 shows the PM 2.5 forecasting of the Guanshan station, which is located in a rural area in Taitung County. Eastern Taiwan is sparsely populated, and most people there, particularly in Taitung, earn a living by farming. The maximum PM 2.5 concentration of the testing data at station Guanshan is 21 in autumn. Due to lower pollution from industry and vehicular traffic, the annual
In this work, to evaluate the proposed model and the developed Algorithm 1, we first compare the baseline models with the proposed model by using the air quality data from the 76 monitoring stations in Taiwan. Furthermore, we compare the training time and the prediction errors of the existing ST-DNN model [22] with those of the proposed HTSFW model. For the comparison of the ST-DNN and HTSFW models, we compare their prediction errors for the four specified target districts and for a selected particular site belonging to each district. According to the experimental results, the HTSFW model outperforms the other models in different coverage regions and prediction periods.
Based on the experimental results, the innovation of the HTSFW model is expressed as follows. First, for air quality and geographical data training, the HTSFW model adopts the dilated CNN for feature extraction. Dilation factors are used to expand the receptive fields, which leads to increased training efficiency. In addition, HTSFW applies GRU to time-series feature extraction for air quality and meteorological data learning. As shown in Table 3, the learning time decreases significantly because the GRU network is simpler than the LSTM network. Second, we develop an innovative algorithm, Algorithm 1, to select the target district for PM 2.5 forecasting. By using the proposed Algorithm 1, the proposed HTSFW model can accurately predict the PM 2.5 concentration of a particular site as well as the specified regions. As shown in Table 4 and Figs. 9-12, HTSFW can accurately predict the PM 2.5 concentration of the four cities/counties in northern, central, southern and eastern Taiwan. Furthermore, for the selected monitoring station in each of the four cities/counties, the PM 2.5 prediction of the HTSFW model is obviously better than that of the ST-DNN model [22].
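The effect of a dilation factor on the receptive field can be illustrated with a small, framework-free sketch. The kernel size of three used here is an assumption chosen purely for illustration and is not the configuration reported for HTSFW; the function names are hypothetical.

```python
def dilated_conv1d(x, kernel, dilation=1):
    """'Valid' 1-D convolution (cross-correlation) with `dilation` spacing between taps."""
    span = (len(kernel) - 1) * dilation  # distance from the first tap to the last
    return [
        sum(w * x[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(len(x) - span)
    ]

def receptive_field(kernel_size, dilations):
    """Number of input positions seen by one output of a stack of dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

signal = [1, 2, 3, 4, 5]
dense_out = dilated_conv1d(signal, [1, 1, 1], dilation=1)    # adjacent taps
dilated_out = dilated_conv1d(signal, [1, 1, 1], dilation=2)  # taps two apart
rf_plain = receptive_field(3, [1, 1])    # two ordinary layers
rf_dilated = receptive_field(3, [2, 2])  # two layers with dilation factor two
```

With the same number of weights per layer, the two dilated layers cover nine input positions instead of five, which is the sense in which dilation widens the receptive field without adding parameters.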

V. CONCLUSION
In this paper, we propose a hybrid time-series deep learning model for daily-based PM 2.5 forecasting. The accuracy of the PM 2.5 prediction is enhanced by considering the air quality data and the meteorological data simultaneously. In addition, the PM 2.5 concentration is also related to geographical location and time frame. For the performance comparison, we first compare the MAE and RMSE of PM 2.5 prediction between the baseline and the proposed models. For ten different time steps, our model is superior to the baseline models. Moreover, to predict the air quality of the four city-/county-regions, the proposed model is further compared to the ST-DNN model [22]. The selected regions are located in northern, central, southern, and eastern Taiwan. Furthermore, we select one monitoring station in each region for the accuracy comparison. From the experimental results, the proposed model is accurate for all the regions and suitable for local regions and specified sites.
In our work, the proposed HTSFW model can accurately predict the PM 2.5 concentration of the specified monitoring station or regional district in Taiwan. In addition, in this COVID era and post-COVID period, both the administration and citizens have to avoid symptomatic infection and spread of the epidemic. For instance, timely announcements of the setup of screening stations and of the confirmed case locations are both critical factors. The government needs to effectively manage and predict the COVID-19 vaccine distribution; hence, the forecasting models need to consider the spatial correlation, temporal dependency, air quality, and traffic transportation simultaneously. In other words, the forecasting models have to effectively extract the data features from various sources and accurately predict the COVID-19 vaccine distribution in time.
In this work, we develop a hybrid deep learning model to increase the accuracy of PM 2.5 forecasting. Such hybrid models are based on the combination of multiple machine learning and/or deep learning models. The adaptive ST-DNN model, which combines machine learning and deep learning models, was proposed to predict the PM 2.5 concentration of a single monitoring station. Our HTSFW model is composed of multiple deep learning models and enhances the prediction accuracy of the existing models. Since 2014, GANs have been widely applied to sequential data, such as in natural language processing [15], [17]. Air quality prediction is highly related to time-series data, and therefore, we will consider utilizing GANs in our future work.
The GAN was designed in 2014 and is a deep learning model [15], [17] in which two neural networks compete with each other based on zero-sum game theory. The generative network generates new data through unsupervised learning, and the discriminative network evaluates the new data through a competitive process. It has been widely used for image, video, and natural language processing. Inspired by GAN applications to sequential data, we will seriously consider including GANs in future work.
For future work, we will further study the height of each monitoring station for PM 2.5 prediction. In addition, we will also develop a new PM 2.5 forecasting model to deal with seasonal climate change to enhance the air quality prediction accuracy.