Comparative Analysis of Deep Learning and Statistical Models for Air Pollutants Prediction in Urban Areas

Rapid growth in urbanization and industrialization leads to an increase in air pollution and poor air quality. Because of its adverse effects on the natural environment and human health, air pollution has been declared a “silent public health emergency”. To deal with this global challenge, accurate prediction of air pollution is important so that stakeholders can take the required actions. In recent years, deep learning-based forecasting models have shown promise for more effective and efficient forecasting of air quality than other approaches. In this study, we present a comparative analysis of deep learning-based single-step forecasting models, namely long short-term memory (LSTM) and gated recurrent unit (GRU), and a statistical model to predict five air pollutants: Nitrogen Dioxide (NO<sub>2</sub>), Ozone (O<sub>3</sub>), Sulphur Dioxide (SO<sub>2</sub>), and Particulate Matter (PM2.5 and PM10). For empirical evaluation, we used a publicly available dataset collected in Northern Ireland by an air quality monitoring station in Belfast city centre, which measures the concentration of air pollutants. The performance of the forecasting models is evaluated using three metrics: (a) root mean square error (RMSE), (b) mean absolute error (MAE) and (c) R-squared (<inline-formula> <tex-math notation="LaTeX">$\text{R}^{2}$ </tex-math></inline-formula>). The results show that the deep learning models consistently achieved a lower RMSE than the statistical model, with a lowest value of 0.59. In addition, a deep learning model achieved the highest <inline-formula> <tex-math notation="LaTeX">$\text{R}^{2}$ </tex-math></inline-formula> score of 0.856.


I. INTRODUCTION
Over the past few years, air pollution has become a major global challenge. Air pollution has a direct impact not only on the environment but also on human health and well-being. It has been observed that air pollution leads to increased mortality and morbidity, including respiratory diseases, impaired cognitive function, cardiovascular diseases, and cancer [1], [2]. Each year, over 3 million deaths are recorded due to air pollution, especially in low and middle income countries [3]. In addition, the United Nations (UN) has defined sustainable development goals (SDGs), such as goals 3, 7 and 11, with targets set for 2030 to reduce deaths, illness and adverse environmental effects in cities by improving air quality and other factors [4]. Similarly, in the United Kingdom (UK), the government has set a target to reduce air pollution by 35% by 2040 [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Kathiravan Srinivasan.
Multiple factors contribute to deteriorating air quality, such as manufacturing, industrial emissions, transportation emissions (land, air and sea), dust, and coal consumption [6]. Air pollution is the introduction of harmful materials and gases into the environment, which is of great concern to humans and other living organisms. These harmful materials (solids, liquids or gases) are called pollutants. When pollutants such as particulate matter (PM2.5) are produced in higher concentrations than usual, they reduce the quality of our environment and cause serious harmful health effects [7], [8], [9].
Education and raising public awareness of this issue require interdisciplinary approaches involving professionals and other stakeholders. Local councils are playing their part and have set up many air quality monitoring stations throughout the country to monitor the concentration of air pollutants. Data collected from such monitoring stations can be used to predict pollutant levels. Prediction of air quality is important for controlling air pollution and for identifying areas that require interventions to reduce air pollution and its impacts. However, modelling air quality accurately is a challenge in its own right and depends on the available data and modelling approaches. The main contributions of this study are: (1) We propose a combination of meteorological parameters (temperature, wind speed and wind direction) with a lagged air quality feature, namely the previous-hour concentration of the pollutant being predicted for the next hour. We also split the datetime index into hour, day and month to obtain additional features that improve prediction accuracy and reduce error. (2) We present a comprehensive study that predicts five major air pollutants and compares deep learning-based models with a statistical model. (3) We provide detailed architectural and parameter information for both the deep learning and statistical models for each pollutant, which can be useful in various smart city applications or serve as a benchmark for more accurate models.
The remainder of the paper is organized as follows: related work is described in Section II, and the considered statistical and deep learning models, with architectural details, are presented in Section III. Section IV describes the dataset. Model training and testing are discussed in Section V. Results and discussion are provided in Section VI and, finally, Section VII concludes the paper.

II. RELATED WORK
In recent years, machine learning (ML) models, particularly deep learning (DL) models, for regression problems have received a great deal of attention for predicting air pollution [10], [11]. In [12], a light gradient boosting machine model is proposed to process high-dimensional, large-scale data collected from 35 air quality monitoring stations in Beijing. The PM2.5 concentration levels are predicted for the next 24 hours. The authors compared the model's performance with adaptive boosting (AdaBoost), gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and deep neural network (DNN) models. Based on symmetric mean absolute percentage error (SMAPE), mean square error (MSE) and mean absolute error (MAE), the results show that their model outperformed the others. Furthermore, the integration of historical data significantly improved the performance of the model. In a similar study [13] to predict PM2.5, the dataset was collected by Taiwan air quality monitoring stations from 2012 to 2017. The data was gathered from 5 major polluted cities of Taiwan and contains meteorological data as well as air pollution data. In pre-processing, methods like Fourier arrangement and spline multinomial approaches are used to impute missing values in the dataset. Although the data was collected on an hourly basis, it was converted to a daily and monthly basis to make monthly predictions. The proposed gradient boosting regression model is compared, in terms of the coefficient of determination ($\text{R}^{2}$), RMSE, MSE and MAE, with models such as linear regression, lasso regression, random forest regression, K-nearest neighbours regression, decision tree regression and many others.
In [14], data were collected from various air quality monitoring stations, including local and nearby industrial areas. PM2.5 is predicted using a modified LSTM model, which is found to be better for predictions up to 8 hours ahead than models such as LSTM, support vector machine-based regression (SVR) and gradient boosted tree regression (GBTR).
A multivariate lag-FLSTM (lag-LSTM-fully connected network) model with Bayesian optimisation is proposed to predict PM2.5 concentration [15]. The dataset is collected from four monitoring stations situated in Wayne County, United States. It contains hourly air pollutant data and meteorological and temporal features over 2 years. Missing values were replaced by the values of the previous day at the same time index. The proposed model was compared with other models including autoregressive integrated moving average (ARIMA), LASSO regression, ridge regression, SVR, artificial neural network (ANN), recurrent neural network (RNN), LSTM and Lag-FLSTM, and evaluated using RMSE, MAE and mean absolute percentage error (MAPE). It was emphasized that combining meteorological and other air pollutant data when training the model can yield better predictions of PM2.5 concentration. In [16], the authors developed forecasting models based on DL methods such as convolutional neural network (CNN), LSTM, CNN-LSTM, spatiotemporal clustering and their combinations. The dataset includes hourly meteorological, air pollutant, spatial and temporal data collected from 12 monitoring stations in Beijing over 2 years. The models' performance is evaluated using RMSE and the index of agreement for predictions 1-6 hours ahead. The results indicated that LSTM is the best model for multi-hour forecasting. However, the difference in performance between LSTM and CNN-LSTM is relatively small.
In [17], a hybrid model is proposed that combines a one-dimensional (1D) CNN with a bi-directional LSTM for single-step and multi-step PM2.5 forecasting over the next 48 hours. Local trends and spatial features are extracted with the 1D-CNN from hourly data collected over 4 years, and the proposed model achieves the lowest RMSE compared with SVR and variants of LSTM, CNN and RNN. An ensemble empirical mode decomposition (EEMD) method is combined with an LSTM model to improve PM2.5 forecasting [18]. The EEMD method improved forecasting performance by introducing white noise into the target data and decomposing the time series into several sub-series. In [19], data from 74 cities in China were used to predict PM2.5 by combining LSTM and GRU models. The proposed model produced better predictions in terms of RMSE than an LSTM model alone. A bi-directional GRU model, a simplified version of LSTM without a separate memory cell, is combined with a 1D-CNN to learn long-term dependencies [20]. Different combinations of features were tested, and the model was found to perform better when using historical pollutant data and meteorological parameters such as temperature, dew point, wind speed and wind direction. The proposed model achieved a significant improvement in accuracy compared with traditional ML models and variants of RNN.
VOLUME 11, 2023
In [21], various ML approaches were implemented to predict the hourly PM2.5 concentration in Beijing using meteorological, temporal and PM2.5 concentration data. The dataset was based on nearby stations, and a gradient-boosted regressor model was proposed. In a similar study using a publicly available dataset for PM2.5 prediction, a hybrid CNN-LSTM model is proposed and compared with other deep learning approaches [22]. A large-scale study incorporating data from 1,085 stations in China predicts PM2.5 for the next 72 hours using a state-of-the-art architecture based on a transformer model [23]. The performance of the proposed architecture is enhanced using a two-stage approach in which, along with spatial and temporal dependencies, a stochastic approach is used to capture the uncertainty of the air quality data. A combination of a graph neural network (GNN) and a GRU model is used to predict PM2.5 for the next 72 hours by incorporating both spatial and temporal domain knowledge [24]. In the construction of the graphs, the nodes and edges are defined by meteorological information for each city, while geographical information is embedded into the graph structure. Although the aforementioned studies mostly investigated PM2.5, in this paper we cover a broader range of pollutants, namely NO<sub>2</sub>, O<sub>3</sub>, SO<sub>2</sub>, PM2.5 and PM10, and provide a prediction model for each pollutant.

III. TIME SERIES FORECASTING MODELS
The details of the considered forecasting models are provided in this section.

A. STATISTICAL MODEL
An autoregressive integrated moving average (ARIMA) model is a classic statistical modelling approach for time series forecasting, also referred to as the Box-Jenkins method [25]. It combines autoregressive (AR) and moving average (MA) terms with differencing, which is needed for non-stationary time series, as is generally the case for air pollution data. To make the data stationary, differencing can be applied to the data x(t), where the order of differencing is defined by the parameter d as given in (1). Here, B is the backward shift operator. Moreover, the model to forecast $x_d(t)$ also requires identification of the parameters p and q, which define the number of AR and MA terms, respectively, as given in (2). Here, w are the weights of the AR and MA terms and n(t) is Gaussian noise with zero mean. The optimum values of the ARIMA parameters p, d and q can be found using the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the data and its differences.
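For reference, the differencing and ARMA relations described above can be written in a standard textbook form. This is a sketch under assumed notation, not necessarily the exact form of (1) and (2) in the published article: the superscripts AR and MA on the weights w are labels introduced here to distinguish the two sets of coefficients.

```latex
\begin{align}
x_d(t) &= (1 - B)^{d}\, x(t), \qquad B\,x(t) = x(t-1) \tag{1}\\
x_d(t) &= \sum_{i=1}^{p} w^{\mathrm{AR}}_{i}\, x_d(t-i)
        + \sum_{j=1}^{q} w^{\mathrm{MA}}_{j}\, n(t-j) + n(t) \tag{2}
\end{align}
```

For example, with d = 1 the first relation reduces to $x_1(t) = x(t) - x(t-1)$, the hour-to-hour change in concentration.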
The deep learning models considered are based on the recurrent neural network (RNN), which operates on sequence or time series data and has been found to improve performance in applications such as natural language processing and speech recognition [26], [27]. The RNN model has memory units to capture and learn dependencies between input and output over the short or long term. However, as the number of network layers and iterations increases, the RNN tends to forget these dependencies and suffers from the vanishing gradient problem. Fig. 1 shows a simplified architecture of an RNN model with M layers, where each layer is composed of cells that process the data.

1) LONG SHORT TERM MEMORY
A variant of the RNN known as long short-term memory (LSTM) solves this issue [28]. An LSTM cell is shown in Fig. 2.
In Fig. 2, each cell has its own state and three gates: a forget gate, an input gate and an output gate. The forget gate output f(t) decides the contribution of the previous cell state c(t − 1) to the current cell state c(t). The input gate output i(t) decides the amount of new information used to produce the current cell state. The output gate generates the output h(t) based on the current cell state, the current input x(t) (also known as new information) and the previous hidden state h(t − 1) of the cell (also known as the previous output or past information). These gates are represented mathematically in (3)-(8).
where σ and tanh are the activation functions and b is the bias of the respective gate in the LSTM cell. The output of each gate is generated by applying the respective activation function to the weighted sum plus the bias; for example, $w_{fx}$ is the weight for the input in the forget gate.
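The gate computations described above can be illustrated with a single forward step of an LSTM cell. The following is a minimal NumPy sketch of the commonly used LSTM formulation, not the authors' implementation: the function name `lstm_step`, the dictionary layout of the weights (`W` for input weights such as $w_{fx}$, `U` for recurrent weights, `b` for biases) and the layer sizes are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation used by the three gates."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (standard formulation, hypothetical parameter layout).

    W[k]: input weights, U[k]: recurrent weights, b[k]: biases,
    for k in 'f' (forget), 'i' (input), 'o' (output), 'g' (candidate state).
    """
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    g_t = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])  # candidate state
    c_t = f_t * c_prev + i_t * g_t                          # new cell state
    h_t = o_t * np.tanh(c_t)                                # new hidden state
    return h_t, c_t

# Illustrative sizes only: 7 input features and 5 cells.
rng = np.random.default_rng(0)
n_in, n_hidden = 7, 5
W = {k: rng.normal(scale=0.1, size=(n_hidden, n_in)) for k in "fiog"}
U = {k: rng.normal(scale=0.1, size=(n_hidden, n_hidden)) for k in "fiog"}
b = {k: np.zeros(n_hidden) for k in "fiog"}

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(4, n_in)):  # four consecutive time steps
    h, c = lstm_step(x_t, h, c, W, U, b)
```

Because the forget gate multiplies the previous cell state directly, information can pass through many time steps without being repeatedly squashed, which is what mitigates the vanishing gradient problem of the plain RNN.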

2) GATED RECURRENT UNIT
The gated recurrent unit (GRU) is another variant of the RNN and a simplified version of the LSTM cell, with fewer parameters, that trains faster than LSTM [29]. The GRU cell is based on two gates, a reset gate and an update gate, as shown in Fig. 3.
The reset gate output r(t) controls how much of the previous output of the cell, h(t − 1), to forget, while the update gate output z(t) controls how much of the previous output of the cell to remember. The output h(t) is produced by combining the gate outputs and the candidate state $\tilde{h}(t)$. These gates are represented mathematically in (9)-(12).
where σ and tanh are the activation functions and b is the bias of the respective gate in the GRU cell. The output of each gate is generated by applying the respective activation function to the weighted sum plus the bias; for example, $w_{rx}$ is the weight for the input in the reset gate.
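A single forward step of a GRU cell can be sketched in the same way. This is a minimal NumPy illustration of a common GRU formulation rather than the authors' code. It assumes the convention in which z(t) weights the previous output, consistent with the description of the update gate above (some references swap the roles of z and 1 − z). The name `gru_step` and the parameter dictionaries are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation used by the reset and update gates."""
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU time step (hypothetical parameter layout).

    W[k]: input weights, U[k]: recurrent weights, b[k]: biases,
    for k in 'r' (reset), 'z' (update), 'h' (candidate state).
    """
    r_t = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])  # reset gate
    z_t = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])  # update gate
    # Candidate state: the reset gate scales how much of h_prev is used.
    h_cand = np.tanh(W["h"] @ x_t + U["h"] @ (r_t * h_prev) + b["h"])
    # Assumed convention: z_t is the share of the previous output that is kept.
    return z_t * h_prev + (1.0 - z_t) * h_cand

# Illustrative sizes only: 7 input features and 5 cells.
rng = np.random.default_rng(0)
n_in, n_hidden = 7, 5
W = {k: rng.normal(scale=0.1, size=(n_hidden, n_in)) for k in "rzh"}
U = {k: rng.normal(scale=0.1, size=(n_hidden, n_hidden)) for k in "rzh"}
b = {k: np.zeros(n_hidden) for k in "rzh"}

h = np.zeros(n_hidden)
for x_t in rng.normal(size=(4, n_in)):  # four consecutive time steps
    h = gru_step(x_t, h, W, U, b)
```

Compared with the LSTM step, the GRU keeps a single state vector and three weight groups instead of four, which is where the reduction in parameters comes from.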

IV. DATASET
The dataset used in this study is a publicly available air quality dataset from Northern Ireland [30]. It includes hourly concentration levels of the air quality parameters Nitrogen Dioxide (NO<sub>2</sub>), Ozone (O<sub>3</sub>), Sulphur Dioxide (SO<sub>2</sub>), and Particulate Matter (PM2.5 and PM10). In addition, it contains meteorological data such as temperature, wind speed and wind direction. The data was measured by an air quality monitoring station in Belfast city centre between 2015 and 2020 and contains over 50,000 samples.
Statistical information on the meteorological data (temperature, wind speed and wind direction) is provided in Table 1, including the total number of samples, mean, standard deviation, minimum and maximum of each parameter. The total number of samples is above 50,000. Across the parameters, the mean ranges from 5.63 to 213.19 and the standard deviation from 2.77 to 84.87. The minimum value of all parameters is 0, and the maximum value ranges from 24 to 360. Table 2 summarises the statistical information of all air quality parameters. The NO<sub>2</sub> concentration ranges from 1 to 203 with a mean and standard deviation of 26.11 and 17.87, respectively, while SO<sub>2</sub> has the lowest mean, with a standard deviation of 1.6.

V. MODEL TRAINING AND TESTING
This section provides details of the data preparation, model training, hyperparameter optimisation, and testing of the single-step forecasting models.

1) DATA PRE-PROCESSING
Generally, any dataset may contain outliers and invalid values, and the data may need to be normalised as required by the forecasting model. Outliers are extreme or odd values that are unlike the other values in the dataset, and their presence may affect the overall distribution of the data. Therefore, outliers must be removed to improve the forecasting model's performance. Likewise, the dataset may have missing values or periodically repeated sequences of values, known as invalid values, that need to be removed or replaced by estimated values before modelling. Finally, normalisation re-scales the data to fall within a predefined range. Normalisation can help the forecasting models perform better on data with a smaller scale and improve the convergence speed of the models. The dataset used in this paper is pre-processed by detecting outliers with the interquartile range (IQR) method and by removing invalid values. To replace the missing values, we group the data by month, day and hour. Each missing value is then replaced by the average of the available concentration values for the same month, day and hour across all years of the dataset. This approach allows for a greater spread of values for the missing data.
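The outlier handling and imputation steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the conventional 1.5 × IQR fences are assumed (the multiplier is not given in the text), values flagged as outliers are treated as missing and then imputed, and the helper names `iqr_bounds` and `clean_series` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, quantiles

def iqr_bounds(values, k=1.5):
    """Lower and upper fences of the IQR rule (k = 1.5 is an assumed default)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clean_series(timestamps, values, k=1.5):
    """Mask outliers as missing, then fill gaps with the mean of the values
    observed at the same (month, day, hour) in the other years."""
    observed = [v for v in values if v is not None]
    lo, hi = iqr_bounds(observed, k)
    masked = [v if v is not None and lo <= v <= hi else None for v in values]

    # Collect the valid values for every (month, day, hour) slot.
    groups = defaultdict(list)
    for ts, v in zip(timestamps, masked):
        if v is not None:
            groups[(ts.month, ts.day, ts.hour)].append(v)

    filled = []
    for ts, v in zip(timestamps, masked):
        key = (ts.month, ts.day, ts.hour)
        if v is None and groups[key]:
            v = mean(groups[key])  # average over all years for this slot
        filled.append(v)
    return filled

# Toy example: 1 January, 09:00, over five years; one gap and one spike.
timestamps = [datetime(year, 1, 1, 9) for year in range(2015, 2020)]
values = [20.0, None, 24.0, 22.0, 500.0]
cleaned = clean_series(timestamps, values)
# cleaned == [20.0, 22.0, 24.0, 22.0, 22.0]
```

A slot with no valid value in any year stays missing in this sketch and would need a fallback, for example interpolation from neighbouring hours.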
The meteorological data is used as additional input features for the forecasting models. Beyond the features available in the dataset, we also created a lagged feature by taking the previous-hour concentration of the pollutant being predicted, and we split the datetime index into hour, day, and month to create additional features. Therefore, for the next-hour prediction of each pollutant, we consider a combination of the previous-hour concentration, meteorological information (temperature, wind speed and wind direction) and temporal information (day, month, hour). The workflow for pre-processing the dataset is shown in Fig. 4. The input features are normalised using Min-Max normalisation as defined by (13), where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the data x. Fig. 5 shows a sample of NO<sub>2</sub> data before and after pre-processing.
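The feature construction and Min-Max scaling described above can be sketched as below. The alignment is an assumption, since the text does not state it explicitly: each row pairs the pollutant concentration at hour t − 1 with the meteorological and temporal values at hour t, and the target is the concentration at hour t. The function names and the toy values are hypothetical.

```python
from datetime import datetime, timedelta

def build_features(timestamps, pollutant, temperature, wind_speed, wind_dir):
    """Return (X, y) with one row per hour t >= 1.

    Row layout (assumed): [lagged concentration, temperature, wind speed,
    wind direction, hour, day, month]; the target is the concentration at t.
    """
    rows, targets = [], []
    for t in range(1, len(timestamps)):
        ts = timestamps[t]
        rows.append([
            pollutant[t - 1],                              # lagged feature
            temperature[t], wind_speed[t], wind_dir[t],    # meteorological
            ts.hour, ts.day, ts.month,                     # temporal
        ])
        targets.append(pollutant[t])
    return rows, targets

def min_max_columns(rows):
    """Scale every column to [0, 1] with (x - x_min) / (x_max - x_min)."""
    scaled_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
        scaled_cols.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]

# Toy data: four consecutive hours (values are made up for illustration).
start = datetime(2015, 1, 1, 0)
timestamps = [start + timedelta(hours=i) for i in range(4)]
no2 = [30.0, 34.0, 28.0, 26.0]
temp = [5.0, 5.5, 6.0, 6.5]
speed = [3.0, 2.0, 4.0, 5.0]
direction = [180.0, 190.0, 200.0, 210.0]

X, y = build_features(timestamps, no2, temp, speed, direction)
X_scaled = min_max_columns(X)
```

In practice the minimum and maximum would normally be computed on the training portion only and reused for the validation and test portions, so that no information from later data leaks into training.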

2) MODEL PARAMETERS AND TUNING
The ARIMA model is trained, and its parameters are optimised and tested, to forecast air pollutants. We trained the model to predict all five air pollutants in the dataset. However, for analysis purposes, only NO<sub>2</sub> is explained as a use case. To find the optimum values of the model parameters p, d and q, the ACF and PACF of the data and its differences must be investigated. As ARIMA works on stationary data, we inspect the ACF of the original data to determine whether it is stationary. Fig. 6 shows that the ACF of NO<sub>2</sub> is positive at all lags and drops gradually, which indicates that the data is non-stationary. Comparing the ACF of the 1st order differencing (Fig. 7) with that of the 2nd order differencing (Fig. 8), we observe that the ACF at lag 1 of the 2nd order differencing is negative, which indicates that the data may be over-differenced. This is not the case for the ACF of the 1st order differencing, which is therefore sufficient to make the data stationary. Thus, the order of differencing is d = 1.
$$x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{13}$$
Similarly, the order of the MA terms can be found from the over-differenced data by locating the cut-off point of the ACF. In our case, the 2nd order differencing is over-differenced because the ACF at lag 1 is negative, and lag 1 is also the cut-off point, which gives q = 1.
The order of the AR terms is found from the first cut-off point, where the PACF at lag 1 is positive. Fig. 9 shows the PACF of the NO<sub>2</sub> data. In our case, as shown in Fig. 10, the PACF of the 1st order differencing is positive at lag 1, which is also the cut-off point, giving p = 1. Hence, we found (1, 1, 1) to be the optimum set of parameters (p, d, q) for NO<sub>2</sub>. The optimum parameters for the rest of the pollutants are given in Table 4. In summary, d and q are both found to be 1 for all the pollutants, whereas p is either 0 or 1 depending on the pollutant.
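The differencing check used above to choose d can be reproduced numerically. The sketch below is illustrative only: it uses a synthetic random walk rather than the NO<sub>2</sub> series, and the helper names `acf` and `difference` are hypothetical. It shows the pattern the selection relies on: a slowly decaying positive ACF for the raw series, a lag-1 ACF near zero after one difference, and a clearly negative lag-1 ACF after a second difference, the sign of over-differencing.

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    lags = [np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)]
    return np.array([1.0] + lags)

def difference(x, d=1):
    """Apply d-th order differencing, i.e. (1 - B)^d x(t)."""
    return np.diff(np.asarray(x, dtype=float), n=d)

# Synthetic non-stationary series: a random walk (not the paper's data).
rng = np.random.default_rng(42)
walk = np.cumsum(rng.normal(size=2000))

acf_raw = acf(walk, 10)                 # decays slowly from 1: non-stationary
acf_d1 = acf(difference(walk, 1), 10)   # lag 1 near zero: d = 1 suffices
acf_d2 = acf(difference(walk, 2), 10)   # lag 1 negative: over-differenced
```

For this synthetic series, one difference yields white noise, so no AR or MA terms would be needed; for a real pollutant series the remaining structure in the ACF and PACF of the differenced data is what determines q and p.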
The dataset is split into training, validation, and testing sets with a ratio of 70%, 20% and 10%, respectively. The split is chronological: the indices in each set are higher than those in the previous set, which avoids shuffling (inappropriate for time series). Fig. 11 shows the workflow of deep learning training and testing with its key components. Fig. 12 shows the bounds and step sizes used for hyperparameter optimisation, in terms of the number of cells per layer, the total number of layers and the dropout rate, for both the LSTM and GRU models. In the DL models, the input layer passes the features to the model. The number of layers is optimised in the range from 1 to 5 with a step size of 1, and the number of cells per layer from 5 to 30 with a step size of 5. The dropout layer randomly drops a number of cells to handle overfitting, and its dropout rate is also treated as an optimisation parameter. Finally, a fully connected dense layer with the ReLU activation function is used. We use the Adam optimiser during training. For hyperparameter tuning, the Hyperband algorithm [31] is used, which tunes the number of units in a layer, the number of layers and the learning rate of the model. The Hyperband algorithm optimises the hyperparameters by minimising the validation loss during model training, while reducing the training time. A summary of the parameters with architectural details of the deep learning models is given in Table 3. After hyperparameter optimisation, the LSTM model used the fewest layers, 2, for SO<sub>2</sub> and the most, 4, for NO<sub>2</sub>. Whereas, the GRU model used the mini-    [20 70] for considered pollutants, respectively. In addition, both DL models used optimised dropout rates of 0 or 0.1, depending on the pollutant.
The optimum set of hyperparameters for all the pollutants is given in Table 4, and these parameters can further be used to evaluate model complexity and computational effort.
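The chronological split and the search bounds stated above can be written down compactly. This is a minimal sketch, not the authors' training code: the function name `chronological_split` is hypothetical, the sample count of 50,000 is a round illustrative figure (the dataset has "over 50,000" samples), and the Hyperband search itself is not shown.

```python
def chronological_split(n_samples, train_pct=70, val_pct=20):
    """Split indices 0..n_samples-1 into consecutive train/validation/test
    ranges without shuffling, so later sets only contain later samples."""
    train_end = n_samples * train_pct // 100
    val_end = n_samples * (train_pct + val_pct) // 100
    return (
        range(0, train_end),
        range(train_end, val_end),
        range(val_end, n_samples),
    )

# Illustrative sample count only.
train_idx, val_idx, test_idx = chronological_split(50_000)

# Search bounds as stated in the text.
layer_options = list(range(1, 6))      # 1 to 5 layers, step 1
cell_options = list(range(5, 31, 5))   # 5 to 30 cells per layer, step 5
```

Keeping the test range strictly after the validation range means the reported errors are measured on data from a later period than anything seen during training or tuning.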

3) PERFORMANCE METRICS
In this work, the performance metrics used to evaluate the effectiveness of the forecasting models are RMSE, $\text{R}^{2}$ and MAE, as defined in (14), (15) and (16), respectively.
where $y_i$ and $\hat{y}_i$ are the target output and the predicted output of the forecasting model at the i-th sample, respectively, N is the total number of test samples used to compute the metrics, and $\bar{y}$ is the mean of the target output samples. RMSE is a standard metric used in supervised learning applications for measuring the quality of predictions. It expresses how far the predicted values fall from the target values, so a lower RMSE represents a smaller forecasting error. $\text{R}^{2}$, also known as the coefficient of determination, is a statistical measure that shows how well the regression model fits the target data. In general, the model fits the data well if the $\text{R}^{2}$ value is close to 1, meaning that the difference between the target and predicted values is small.
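The three metrics can be computed directly from their standard definitions. The sketch below uses only the Python standard library; the function names and the four-sample example are illustrative and do not come from the paper's results.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error: sqrt(mean((y_i - y_hat_i)^2))."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def mae(y_true, y_pred):
    """Mean absolute error: mean(|y_i - y_hat_i|)."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions for illustration.
y = [2.0, 4.0, 6.0, 8.0]
y_hat = [3.0, 4.0, 5.0, 8.0]

print(rmse(y, y_hat))  # sqrt(0.5), about 0.7071
print(mae(y, y_hat))   # 0.5
print(r2(y, y_hat))    # 1 - 2/20, about 0.9
```

RMSE and MAE are expressed in the units of the pollutant concentration, so they are not directly comparable across pollutants with different ranges, whereas $\text{R}^{2}$ is scale-free.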

VI. RESULTS AND DISCUSSION
The results for all single-step forecasting models and all pollutants are presented and discussed in this section. For clarity, the comparative graphs show one week of the test data. Fig. 13     and 17 shows the prediction of all forecasting models over test data for O<sub>3</sub>, SO<sub>2</sub>, PM2.5 and PM10 respectively. For SO<sub>2</sub> and PM10, ARIMA achieved a higher $\text{R}^{2}$ than the DL models but, on the other hand, also the highest error in terms of RMSE and MAE. Compared with the ARIMA model, both DL models achieved similar performance for PM10, but LSTM performed better than GRU in all respects for SO<sub>2</sub>. In the case of PM2.5, both LSTM and ARIMA achieved an $\text{R}^{2}$ of over 83%. However, LSTM attained the lowest error with respect to RMSE and MAE. From the above discussion, we can conclude that the DL models consistently outperformed the statistical model by attaining the lowest RMSE and MAE for all of the air pollutants. In addition, the DL models also achieved the highest $\text{R}^{2}$ score for most of the air pollutants, except for PM10 and SO<sub>2</sub>, where ARIMA outperformed the DL models. The performance comparison of all forecasting models in terms of RMSE, $\text{R}^{2}$ and MAE is presented in Figs. 18, 19 and 20, respectively.

VII. CONCLUSION
Air pollution is a global health challenge, and its accurate prediction is vital to reduce health risks and environmental concerns. This work addresses single-step prediction of major air pollutants (NO<sub>2</sub>, O<sub>3</sub>, SO<sub>2</sub>, PM2.5 and PM10) using forecasting approaches based on DL and statistical models. The performance of the forecasting models is tested using the evaluation metrics RMSE, MAE and $\text{R}^{2}$. Across all forecasting models and pollutants, LSTM achieved the lowest RMSE and MAE, 0.591 and 0.396 respectively, when predicting the SO<sub>2</sub> time series, whereas the highest RMSE and MAE, 9.354 and 6.065 respectively, were obtained for NO<sub>2</sub> by the ARIMA model. In terms of $\text{R}^{2}$, both DL models performed similarly and achieved the highest score, around 86%, when predicting O<sub>3</sub>. At the other end, the GRU model had the lowest predictive accuracy, around 55%, for SO<sub>2</sub>. Overall, the results show that, among all considered forecasting models, the DL models consistently outperform the statistical model by achieving the lowest RMSE and MAE for all the pollutants, and they attain better predictive accuracy in terms of $\text{R}^{2}$ for most of the pollutants. The ARIMA model performed better only in terms of $\text{R}^{2}$ and only for two pollutants (SO<sub>2</sub> and PM10), while having the highest RMSE and MAE for all of the pollutants. In future, we aim to target multi-step prediction and to improve the performance of the DL models using new feature engineering approaches and the related optimisation of model hyperparameters.