Real Time Localized Air Quality Monitoring and Prediction Through Mobile and Fixed IoT Sensing Network

Air pollution and its harm to human health have become a serious problem in many cities around the world. In recent years, research interest in measuring and predicting the quality of the air around people has spiked. The Internet of Things (IoT), which is widely used in different domains to improve people's quality of life by connecting multiple sensors in different places, also makes air pollution monitoring easier than before. The traditional approach of using fixed sensors cannot effectively provide a comprehensive view of air pollution in people's immediate surroundings, since the closest sensors may be miles away. Our research focuses on modeling the air quality pattern in a given region by adopting both fixed and moving IoT sensors, the latter placed on vehicles patrolling the region. With our approach, a full spectrum of how air quality varies in nearby regions can be analyzed. We demonstrate the feasibility of our approach in effectively measuring and predicting air quality using different machine learning algorithms with real-world data. Our evaluation shows promising results for effective air quality monitoring and prediction in a smart city application.


I. INTRODUCTION
Due to rapid urbanization and industrialization, many countries around the world are facing an air pollution crisis. Air pollution has become a threat to public health and a major influence on citizens' daily activities. In metropolitan cities in developing countries that suffer from air pollution, such as Beijing and Delhi, people usually need to wear a mask before going out [1]. In addition, outdoor activities are constrained by the intra-day air quality.
Air pollution is caused by the presence of different air pollutants. The primary air pollutant gases are nitrogen dioxide (NO2), carbon monoxide (CO), ozone (O3) and sulphur dioxide (SO2) [2]. Another type of air pollutant is particulate matter (PM). Among them, PM2.5 and PM10, which refer to atmospheric particulate matter with a diameter of less than 2.5 µm and 10 µm, respectively, are of particular concern. These particles can cause many respiratory or cardiovascular diseases [3]. Thus, many cities have built their own air quality monitoring stations and publish real-time air quality information every hour. As the concern about air pollution increases, it is becoming increasingly critical to measure the air quality around people [4], [5], which informs people about when it is safe to perform outdoor activities and helps them plan better routes to their destinations. Monitoring stations at fixed locations are the conventional approach to atmospheric monitoring over a large geographical district.
The associate editor coordinating the review of this manuscript and approving it for publication was Junaid Arshad.
While it is not difficult to implement such a fixed-sensor-based monitoring system, it faces several challenges. First, a huge investment is needed to build and deploy monitoring units that cover a large area. Also, each station's readings depend heavily on its immediate surroundings and tend to be less accurate for areas farther away. In areas close to roads, even small distances can make a large difference in the measured air quality because of vehicle emissions. Hence, new ways to collect air quality information more cheaply and flexibly, and to provide detailed air quality prediction, are in demand.
To address these issues, one possible solution is to make the sensors mobile using the Internet of Things (IoT). For example, attaching sensors to moving cars or drones has proved to be a feasible method [6]. In this work, we developed IoT devices to monitor air quality. We collected air pollution data by mounting a sensor on a car and driving around the city of Incheon, Republic of Korea. The data is then preprocessed and stored on our server. One major advantage of a mobile sensor is that it provides first-hand air pollution information for an area at the particular time the car moves through it. We can also cover more geographical regions and obtain more accurate localized information with mobile IoT sensors. While a static sensor can provide a continuous feed of information about a particular area, a mobile sensor cannot easily do so. However, this limitation can be mitigated by deploying multiple mobile sensors or assigning a smaller coverage area to each mobile sensor.
In this work, we propose a hybrid approach, in which we deploy multiple static sensors as well as mobile IoT sensors to effectively monitor air quality. The static sensors provide a holistic view through a continuous feed of information. The mobile sensors, on the other hand, provide more accurate data about specific areas to reduce the error from static sensors. In this paper, we build a prediction model that uses the collected data to provide rapid information about the air quality around people. We also developed a visualization tool to better analyze and forecast air quality and provide insights to both professional researchers and ordinary users. The main contributions of our work are summarized as follows:
• We proposed a hybrid approach that integrates fixed and mobile IoT sensors to measure and predict air quality.
• We demonstrated the feasibility and effectiveness of our approach by analyzing the prediction results of different machine learning models.
• We developed a visualization tool that shows the relative distribution of air pollutants, with a focus on PM10 and PM2.5, and provides an intuitive understanding of the air quality around people.
The rest of our paper is organized as follows: Section II presents related work on different air quality measurement and prediction methods. Section III describes the development of the IoT sensors and data processing. Section IV explains our models and algorithms. The experimental setup and results are reported in Section V, and an analysis of the results is provided in Section VI. We summarize our work and offer conclusions in Section VII and Section VIII.

II. RELATED WORK
To measure air quality, several monitoring methods have been proposed and used. Zheng et al. [7] use public and private web services as well as a list of public websites to obtain real-time meteorological data, weather forecasts and air quality data for their forecasting. Alvarado et al. [8] use small unmanned aerial vehicles to monitor PM10 dust particles, from which they can calculate the emission rate of a source. With the development of smart city technologies, IoT devices have been shown to be an effective option for collecting real-time weather, road traffic and pollution information. Thus, IoT devices are also considered for enabling air quality analysis [9].
In addition to fixed sensors, public transportation infrastructure such as buses has been used to collect air quality data [10]. Another project [11] engaged community members in collecting data, an approach also known as crowdsourcing, and developed an online air quality monitoring system based on it. Hasenfratz et al. [12] used sensor nodes to build a thousand models targeting different time periods. All of these methods are either costly or time consuming. In our work, we explore the combined use of fixed and mobile IoT sensors to improve prediction performance, which has not yet been studied much.
To meet the increasing demand for real-time air quality queries and to enable citizens to react promptly to pollution, a large body of work has built connected monitoring sensor networks that share current air quality information with citizens [13]. Garzon et al. [14] presented an air quality alert service that continuously determines the areas where the concentration of a certain pollutant exceeds a preset threshold and notifies users when they enter them. Maag et al. [15] proposed a multi-pollutant monitoring platform using wearable low-cost sensors. Compared with the above methods, our system can serve similar functions to end users with either fewer sensors or less computation.
For prediction, regression models are commonly used in the area of air quality. A multivariate linear regression model for short-term PM2.5 prediction is proposed in Zhao's work [16], which also includes other gaseous pollutants such as SO2, NO2, CO and O3. As deep learning has emerged as an effective method in many applications, network models for air pollution time series have also been extensively studied and developed. Models such as Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) network have proved to be powerful sequential structures for predicting future air quality values [9], [17]. Yi et al. [18] proposed a deep distributed fusion network to learn the characteristics of spatial dispersion and capture all the influential factors that may have a direct or indirect effect on air quality. These technologies fit non-linear models flexibly but usually offer little insight into the underlying mechanism. In addition, they have not been shown to necessarily outperform classical regression models in many scenarios [19]. There is also a large body of research on approaches that model and simulate the pollutants for prediction [20]. Given the small dataset in our project, we decided to use conventional regression models as our baseline methods because they are computationally efficient while yielding favorable results.
VOLUME 8, 2020

III. IMPLEMENTATION
In this section, we first describe the design and implementation of the IoT sensor devices deployed in our research. Our deployment and data collection were performed in Songdo [21], South Korea, which is being developed as a smart city. Next, we explain the preliminary processing of the acquired raw data and describe how we store and transmit the collected and cleaned data. We then present the user interface for checking the collected data. Figure 1 shows the overall architecture of our proposed system.

A. IoT SENSOR INSTRUMENT DESIGN
We assembled two types of sensing devices from off-the-shelf parts, one for fixed locations and the other for moving cars. In total, we developed six IoT sensor devices: three are deployed at three different fixed locations and the other three are mounted on data collection cars. The subsystems of the air quality monitoring modules are presented in Fig. 3, and the functions of the sensors are described as follows:
• Temperature and humidity sensor: A single sensor measures both temperature and humidity. The humidity measurement has an accuracy of 2% and the temperature measurement an accuracy of 0.5 °C. The measurement ranges are 0 ∼ 100% and −40 ∼ 80 °C, respectively.
• Micro dust sensor: [...] health. Thus, our micro dust sensor covers the entire range that is relevant for human health.
• Carbon dioxide sensor: Our carbon dioxide sensor measures CO2 within a range of 0 ∼ 10,000 ppm, with an accuracy of 5 ppm (0 ∼ 2,000 ppm), 10 ppm (2,000 ∼ 5,000 ppm), and 20 ppm (5,000 ∼ 10,000 ppm). Since the natural proportion of CO2 in air is around 0.03% (about 300 ppm), this level of accuracy is sufficient for our purpose.
• Raspberry Pi 3B+: The Raspberry Pi is connected to LTE using a dongle. Its main function is to process the sensor data and send it over the internet to the cloud server.
• Arduino Mega: This implements the protocol for sending data over the VoLTE network.
• GPS sensor: This GPS sensor is connected to the Arduino and provides an accuracy of close to 1 m.
• Battery: We use a power bank with a capacity of 7,000 mAh. The overall current draw of our setup is close to 1 A, so it can run continuously for around 7 hours on a single charge.

B. SOFTWARE DEVELOPMENT AND PRE-PROCESSING OF ACQUIRED DATA
In this section, we describe the software systems that we designed, which run on top of the IoT sensors and transmit the collected data back to our system. We also present the other software and database components that our system needs to preprocess the acquired data.
• Communication Software: We constructed a wireless communication and GPS system to transmit the acquired data back to the databases for analysis. The geo-tagged data stored on the Raspberry Pi is transmitted over Voice over Long-Term Evolution (VoLTE) once per second to our central server in the Songdo area.
• Database: We designed the database to store the real-time sensor values collected from both fixed and mobile IoT sensors. The data fields are: 1) time, 2) GPS location, 3) temperature, 4) humidity, 5) CO2, 6) PM10, and 7) PM2.5, and all the collected values are stored in the database as shown in Fig. 5(a). In areas with weak GPS signals, such as indoors and in tunnels, we approximate the location according to the latest neighboring data. Further, we discard out-of-range data during preprocessing.
• Cloud Server and Data Mapping: We use a cloud server for our system, which manages the data and provides an interface for analysts to check [...] Fig. 5(b). All the cars followed different random paths but tried to cover as much of the area as possible. All the stored data can be downloaded as an Excel spreadsheet for later analysis.
• User Interface (UI): In addition, we developed a user interface (UI) app so that users can log in with their own account and immediately check the air quality data around them. An example of the user interface is provided in Fig. 4, where the app displays real-time air quality measurements.
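The storage and cleaning rules described in the Database item can be sketched as follows. This is a minimal illustration, not the deployed server code: `Reading`, `VALID_RANGES` and `clean` are hypothetical names, the valid ranges are taken from the sensor measurement ranges in Section III-A, and reusing the most recent valid GPS fix is only one assumed reading of "the latest neighboring data".

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Reading:
    """One stored record with the seven fields listed in the Database item."""
    time: str                                 # timestamp string
    location: Optional[Tuple[float, float]]   # (latitude, longitude); None without a GPS fix
    temperature: float                        # degrees Celsius
    humidity: float                           # percent
    co2: float                                # ppm
    pm10: float                               # micrograms per cubic metre
    pm25: float                               # micrograms per cubic metre


# Assumed valid ranges, copied from the sensor measurement ranges in Section III-A.
VALID_RANGES = {
    "temperature": (-40.0, 80.0),
    "humidity": (0.0, 100.0),
    "co2": (0.0, 10000.0),
}


def clean(readings: List[Reading]) -> List[Reading]:
    """Drop out-of-range records and fill missing GPS fixes from the last valid one."""
    cleaned: List[Reading] = []
    last_fix: Optional[Tuple[float, float]] = None
    for r in readings:
        in_range = all(
            lo <= getattr(r, name) <= hi for name, (lo, hi) in VALID_RANGES.items()
        )
        if not in_range:
            continue  # discard out-of-range data
        if r.location is None:
            if last_fix is None:
                continue  # nothing to approximate the location from yet
            r = replace(r, location=last_fix)
        else:
            last_fix = r.location
        cleaned.append(r)
    return cleaned
```

A record without a GPS fix inherits the location of the previous valid record, which is reasonable only while the gap is short (for example, a tunnel); longer gaps would need a different policy.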

C. PRE-PROCESSING OF ACQUIRED DATA
Since the acquired data contain noise, missing values, and so on, we need to pre-process them to develop a robust prediction model. We employ the following techniques:
• Outlier detection: Since a sudden change in the collected data usually indicates an outlier, we calculated the discrete differences of the measured sensor values along the timeline to detect outliers. Measured samples with a discrete difference outside the interval [−0.5, 2] are removed from our dataset.
• Interpolation: We choose Gaussian Process Regression (GPR) [22] as our interpolation method because it yields the best prediction accuracy in our experiments. The effect of different interpolation methods is discussed in Section VI.
• Data normalization: Since the data are measured on different scales, we normalize the sensor measurements to between 0 and 1 using Eq. 1, and use the normalized dataset for developing the air quality prediction models:
$$x^{*} = \frac{x - \min}{\max - \min} \tag{1}$$
where max and min are the maximum and minimum values of the whole dataset and $x^{*}$ is the data value after normalization.
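The outlier rule and the min-max normalization above can be illustrated with a short sketch. The function names are hypothetical, and two details are assumptions because the text does not fix them: differences are taken between consecutive raw samples (not recomputed after a removal), and the interval [−0.5, 2] is applied to whatever unit the series is stored in.

```python
from typing import List


def remove_outliers(values: List[float], low: float = -0.5, high: float = 2.0) -> List[float]:
    """Keep a sample only if its difference from the previous raw sample is in [low, high]."""
    if not values:
        return []
    kept = [values[0]]  # the first sample has no predecessor and is kept
    for prev, cur in zip(values, values[1:]):
        if low <= cur - prev <= high:
            kept.append(cur)
    return kept


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values to [0, 1] with x* = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant series: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]
```

Note that under this literal reading a spike removes two samples: the spike itself and the sample that follows it, since the drop back to normal is also a large difference.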

IV. PREDICTION ALGORITHMS AND MODEL DEVELOPMENT
In this section, we introduce our prediction model and briefly discuss the algorithms we used. Since random forest (RF) [23], support vector machine (SVM) [24], and gradient boosting machine (GBM) [25] are commonly recognized as among the most powerful algorithms in many machine learning applications [26], [27], we deployed the random forest regressor (RFR) [28], support vector regressor (SVR) [29] and gradient boosting regressor (GBR) [30] for predicting air quality. We explain each approach in more detail in the following subsections.

A. SUPPORT VECTOR REGRESSOR (SVR)
The objective of SVR is to find a hyper-plane in the higher-dimensional feature space obtained by mapping the training data from its original space, such that the deviation of all sample points from the hyper-plane is minimized. Consider the training dataset $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, where $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$, and $m$ is the number of training samples. The regression problem can then be formulated as:
$$\min_{w,b} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{m} l\big(f(x_i) - y_i\big) \tag{2}$$
Here $C$ is a constant, $f(x)$ is the hyper-plane represented as $f(x) = w \cdot x + b$, and $l$ is the cost function that is minimized in Eq. 2:
$$l(z) = \begin{cases} 0, & \text{if } |z| \le \epsilon \\ |z| - \epsilon, & \text{otherwise} \end{cases} \tag{3}$$
where $\epsilon$ is the maximum deviation that we tolerate. In effect, the equations build an interval zone of width $2\epsilon$ centered on $f(x)$. In our research, the feature vector x consists of the time, longitude, and latitude information collected from the sensors, and y represents a collected value from the air pollutant set CO2, PM2.5 and PM10.
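To make the role of the tolerance and the constant C concrete, the sketch below evaluates the standard ε-insensitive cost and the regularized linear SVR objective for a given hyper-plane. It assumes the textbook formulation (the extracted text does not preserve the equations in full) and only evaluates the objective; it does not solve the optimization problem. The function names are illustrative.

```python
from typing import List, Sequence


def epsilon_insensitive(z: float, eps: float) -> float:
    """Cost of a residual z: zero inside the tube of half-width eps, linear outside."""
    return max(0.0, abs(z) - eps)


def svr_objective(
    w: Sequence[float],
    b: float,
    C: float,
    eps: float,
    X: List[Sequence[float]],
    y: Sequence[float],
) -> float:
    """0.5 * ||w||^2 + C * sum_i l(f(x_i) - y_i) with f(x) = w . x + b."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    total_cost = 0.0
    for xi, yi in zip(X, y):
        f_xi = sum(wj * xj for wj, xj in zip(w, xi)) + b
        total_cost += epsilon_insensitive(f_xi - yi, eps)
    return regularizer + C * total_cost
```

Residuals smaller than ε contribute nothing, so only samples outside the tube influence the fitted hyper-plane; a larger C penalizes those samples more heavily relative to the flatness term.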

B. RANDOM FOREST REGRESSOR (RFR)
RFR is fast to train and can handle a large number of input variables while still yielding high accuracy. RFR randomly draws samples from the original dataset with replacement (bootstrap sampling), grows an unpruned regression tree for each of the samples, and then averages the unweighted outputs of the decision trees to obtain the final result:
$$\bar{h}(x) = \frac{1}{K} \sum_{k=1}^{K} h(x; \theta_k) \tag{4}$$
where $h(x; \theta_k)$, $k = 1, \ldots, K$, is the collection of tree predictors, $\theta_k$ is a random vector that characterizes the $k$th tree, and x represents the observed input, which is assumed to be independently drawn from the joint distribution of (x, y).
As in SVR, x represents the time, longitude, and latitude information collected from the sensors, and y represents a collected value from the air pollutant set CO2, PM2.5 and PM10.

C. GRADIENT BOOSTING REGRESSOR (GBR)
Gradient descent minimizes a function by moving in the direction opposite to the gradient, and it is a fundamental optimization algorithm in machine learning. Boosting is an ensemble method that can improve the prediction performance of classification or regression [27]. Gradient boosting constructs an additive regression model by iteratively adding basis functions that further reduce the chosen cost function:
$$F(x) = \sum_{m=1}^{M} \beta_m h(x; a_m) \tag{5}$$
where the functions $h(x; a_m)$ are the basis functions, usually chosen to be simple functions of x with parameters $a = \{a_1, a_2, \ldots\}$, and $\beta_m$ are the expansion coefficients, with $m = 1, 2, \ldots, M$. Regression trees are used as the basis functions in our model. With our dataset, the features in x and the target y are the same as described for the previous models.
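The additive construction can be illustrated with a small, self-contained gradient boosting sketch. It is a simplification under stated assumptions, not the configuration used in the experiments: the loss is squared error (so the negative gradient is the residual), each basis function is a one-split regression tree on a single feature, and every expansion coefficient is fixed to a constant learning rate. All names are hypothetical.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]         # (threshold, left value, right value)
Model = Tuple[float, float, List[Stump]]   # (initial prediction, learning rate, stumps)


def fit_stump(xs: List[float], targets: List[float]) -> Stump:
    """Least-squares fit of a one-split regression tree on a single feature."""
    mean = sum(targets) / len(targets)
    best_sse = sum((t - mean) ** 2 for t in targets)
    best: Stump = (float("inf"), mean, mean)  # fallback: no split, predict the mean
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        thr = (lo + hi) / 2  # candidate threshold between two distinct values
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
        if sse < best_sse:
            best_sse, best = sse, (thr, lv, rv)
    return best


def stump_predict(stump: Stump, x: float) -> float:
    thr, lv, rv = stump
    return lv if x <= thr else rv


def fit_gbr(xs: List[float], ys: List[float],
            n_rounds: int = 50, learning_rate: float = 0.1) -> Model:
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    f0 = sum(ys) / len(ys)
    preds = [f0] * len(ys)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of squared loss
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return f0, learning_rate, stumps


def gbr_predict(model: Model, x: float) -> float:
    f0, lr, stumps = model
    return f0 + lr * sum(stump_predict(s, x) for s in stumps)
```

Each round shrinks the remaining residual by a factor tied to the learning rate, which is why such a model can follow abrupt level changes once enough rounds have been added.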

V. EXPERIMENT
We chose the Songdo district of Incheon, Korea as the location for our experimental study; Songdo has been developed as one of the smart cities in South Korea. In the experiment, the Songdo region is spatially segmented into 100 zones (a 10 × 10 grid), as shown in Fig. 6, covering the longitude range 126.616° to 126.700° and the latitude range 37.348° to 37.401°, where the red dots represent the data collection points of the mobile and fixed sensors. With more sensors operating in the future, we can divide the area into more grids, which enables a higher-resolution service to the public. The three fixed sensors are each marked with a yellow star on the map. As we can observe, the density of data collection points is higher at the fixed sensors' positions. To cover as much of the Songdo area as possible, three cars carrying our mobile IoT box drove the roads from Dec. 10th to Dec. 14th, 2018 and from Dec. 17th to Dec. 19th, 2018. Each day, all the sensors are calibrated both before and after deployment. Details such as the time intervals and the number of collected data instances are provided in Table 1, and we use the names Car0, Car1 and Car2 to differentiate the three mobile IoT sensors.
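Assigning a geo-tagged sample to one of the 100 zones is a simple binning step. The sketch below takes 126.616° to 126.700° as the longitude range and 37.348° to 37.401° as the latitude range (the physically valid assignment for Songdo), and assumes row-major numbering starting from the south-west corner; the numbering scheme and the function name are illustrative choices, not taken from the paper.

```python
from typing import Optional

LON_MIN, LON_MAX = 126.616, 126.700
LAT_MIN, LAT_MAX = 37.348, 37.401
GRID_SIZE = 10  # 10 x 10 = 100 zones


def grid_index(lat: float, lon: float, n: int = GRID_SIZE) -> Optional[int]:
    """Return the zone index in [0, n*n) for a point, or None if it lies outside the area."""
    if not (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX):
        return None
    # Evenly spaced cells; points on the upper boundary fall into the last cell.
    col = min(int((lon - LON_MIN) / (LON_MAX - LON_MIN) * n), n - 1)
    row = min(int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * n), n - 1)
    return row * n + col
```

Increasing `n` gives the finer grid mentioned above, at the cost of fewer samples per zone.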

A. DATASET
Both the fixed and mobile sensors collect data in the same format. The fixed sensors collect air quality data every minute at the three chosen locations in the Songdo area, shown as yellow stars in Fig. 6. For each fixed sensor, data collection spans the whole day, from morning to night. The mobile sensors, however, collect air pollution data for only a few hours per day, although the dataset as a whole covers all hours of the day. The geographical locations of these sensors are presented in Fig. 6, where each icon stands for a sensor. The horizontal and vertical grid lines follow latitude and longitude and are evenly spaced so that all grids have the same size. Each collected data instance consists of the sensor box's longitude and latitude, timestamp, temperature, humidity, and the concentrations of CO2, PM2.5 and PM10.
The time series of PM2.5 and PM10 collected from the moving sensors over the entire region are depicted in Fig. 7. We averaged the data collected from all the moving sensors at each moment. The X axis is the timeline and the Y axis represents the observed pollutant values; the unit for PM10 and PM2.5 is µg/m³.

B. PERFORMANCE METRIC
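Based on the previous day's ground truth $y_i$ from the mobile sensors, we evaluate the prediction $\hat{y}_i$ and the model's performance according to the Root Mean Square Error (RMSE), which is adopted as the error criterion and defined by Eq. 6 as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^{2}} \tag{6}$$
where $N$ is the number of evaluated samples.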
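For reference, the RMSE criterion can be computed in a few lines. This is a generic sketch of the standard definition; the function name and the error handling are illustrative.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between ground-truth and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))
```

Because the errors are squared before averaging, a few large misses on heavily polluted days raise the RMSE far more than many small misses on clean days.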

VI. RESULTS
In total, three fixed sensors and three mobile sensors generated 13,128 measurements from Dec. 12th, 2018 to Dec. 19th, 2018. The entire dataset is divided into two non-overlapping parts for training and testing, and the time intervals of the training and test datasets vary from task to task.

A. OVERALL PERFORMANCE COMPARISONS WITH DIFFERENT PREDICTION ALGORITHMS
In this section, we used RFR, SVR and GBR to validate the overall performance of our proposed air quality prediction model. We split the dataset into 8 training and test pairs: each individual day from Dec. 12th to Dec. 19th serves as a test set, and all the data from prior dates form the corresponding training set, so the training and test data in a pair do not overlap. Table 2 presents the overall performance of the regression algorithms across the test days. Values in bold indicate the best prediction on a specific test day. In general, GBR achieved the highest prediction accuracy, as shown in Table 2, while RFR and SVR perform marginally better on one or two days. We provide sample prediction results in Fig. 8 across different time periods. A few trends are visible in the results. First, the values of PM10 are greater than those of PM2.5, as shown in Fig. 7. This is expected, since there is usually less PM2.5 than PM10 in the environment. Second, predictions for days with good air quality are much better than for polluted days. For example, the actual fine particle values in Fig. 8(c) and Fig. 8(d) are much higher than on other days, and the RMSE values of the same day, Dec. 16th, are also higher than on other days: 21.6 for PM10 and 15.8 for PM2.5, as shown in Table 2. Finally, we find that gradient boosting is the most responsive to sudden changes in patterns. While SVR and RFR are effective in finding the overall trends, they do not provide good predictions in the short term.
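The day-by-day evaluation protocol, in which each test day is predicted from all earlier days, can be generated as in the sketch below. The function name is hypothetical, and the list of available days is an input: the sketch makes no claim about which calendar days actually contain data.

```python
from datetime import date
from typing import List, Tuple


def walk_forward_splits(
    days: List[date], first_test_day: date
) -> List[Tuple[List[date], date]]:
    """For each test day on or after first_test_day, train on all strictly earlier days."""
    ordered = sorted(days)
    splits = []
    for test_day in ordered:
        if test_day < first_test_day:
            continue
        train_days = [d for d in ordered if d < test_day]
        if train_days:  # skip a test day that has no history to train on
            splits.append((train_days, test_day))
    return splits
```

With ten consecutive days starting on Dec. 10th and the first test day set to Dec. 12th, this yields eight pairs, and the training window grows by one day per pair.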

B. ACCURACY PERFORMANCE WITH DIFFERENT NUMBER OF GRIDS
For evaluation, we select Dec. 19th, 2018 as the test data and the data from Dec. 10th, 2018 to Dec. 18th, 2018 as the training data. Since PM2.5 and PM10 are our major interest, we focus on comparing prediction accuracy for PM2.5 and PM10, and we chose GBR as our prediction algorithm because it generally outperforms the other two methods in the previous evaluation. As shown in Table 3, we counted the number of samples in each of the 100 grids and divided the sample counts into 6 intervals based on their distribution: 0 ∼ 10, 11 ∼ 20, 20 ∼ 50, 50 ∼ 100, 100 ∼ 200 and above 200. Then, we counted the grids in each category and computed the prediction RMSE of each of these grids. Finally, we averaged the RMSE over all the grids in each category.
We observe that an increase in the number of training samples in a grid leads to a lower RMSE and thus higher prediction accuracy. This supports the validity of our method and indicates that air quality prediction can be improved by collecting more data in the future. We observed a similar result for carbon dioxide as well.
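The grouping of grids by sample count and the per-category averaging can be sketched as follows. Since the stated intervals share boundary values, the sketch assumes inclusive upper edges (0 to 10, 11 to 20, 21 to 50, 51 to 100, 101 to 200, above 200); this boundary convention, the labels and the function names are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

UPPER_EDGES = [10, 20, 50, 100, 200]  # inclusive upper bound of each category (assumed)
LABELS = ["0-10", "11-20", "21-50", "51-100", "101-200", ">200"]


def category(n_samples: int) -> str:
    """Map a grid's sample count to its category label."""
    return LABELS[bisect_left(UPPER_EDGES, n_samples)]


def mean_rmse_by_category(grid_stats: Iterable[Tuple[int, float]]) -> Dict[str, float]:
    """grid_stats holds one (sample count, RMSE) pair per grid."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for n_samples, grid_rmse in grid_stats:
        buckets[category(n_samples)].append(grid_rmse)
    return {label: sum(vals) / len(vals) for label, vals in buckets.items()}
```

Categories with no grids are simply absent from the result, which avoids reporting an average over an empty set.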

C. PERFORMANCE WITH DIFFERENT INTERPOLATION METHODS
As discussed before, the collected data are very sparse over the geographical grids at any specific time point, and the dispersion characteristics of fine particles are complex to model. Therefore, we examine different interpolation techniques to fill in the missing air pollution data in the other grids. To check whether interpolation strengthens our prediction, we compare three interpolation methods with our baseline (no interpolation). Since the conventional interpolation method Kriging [4], [31] shares the same mean value and confidence interval with Gaussian Process Regression (GPR), we choose linear interpolation and GPR with different kernels (Gaussian and Cauchy) for our investigation. We used the same training and test datasets as described in the previous section. Table 4 presents the overall prediction results comparing the interpolation methods (linear interpolation, GPR + Cauchy kernel, and GPR + Gaussian kernel) with the baseline without interpolation. A clear improvement in accuracy can be observed in Table 4 across all training time intervals. GPR + Gaussian kernel outperforms both the Cauchy kernel and linear interpolation in the final results for PM10 and PM2.5.

D. PERFORMANCE ON INTEGRATING MOVING IoT SENSORS
To demonstrate the effectiveness of our hybrid approach in air quality prediction, we compared the performance of using 1) fixed sensors only and 2) both fixed and mobile sensors (our approach). In this evaluation, we again use GBR as the prediction algorithm and test on 4 different days chosen from the entire dataset. In each test, the data collected during the two days before the test date are used as the training data. We calculated the prediction RMSE over all the grids for both PM2.5 and PM10 using GBR on each of the four test days and averaged it to obtain the final RMSE shown in Table 5.
Details of the training and test splits and the final results are presented in Table 5, where the prediction with hybrid fixed and mobile sensors outperformed the one with only fixed sensors on all the test days. With the values of the hybrid method marked in bold, it is clear that the hybrid method improves the overall prediction accuracy compared with using only fixed sensors, by 7.0% for PM10 and 6.5% for PM2.5 on average. Thus, our proposed method enhances the performance of air quality prediction.

E. VISUALIZATION
It is challenging to visualize air quality data because they come from multiple sensors, some of which are moving. A common method for visualizing air quality data [32] is to overlay a contour map on the geographical map. The pattern in such a contour map is simple: only a limited number of polluted locations are identified and presented as point sources. As a result, the air quality of the surrounding area is only roughly estimated, without considering the combined impact of different pollution sources. We studied the relative distribution of the pollutants in the Songdo area and drew a heatmap to visualize the hidden relationship of air quality across the whole area.
The map of our experimental area, shown in Fig. 9, is very close to a rectangle in shape. Thus, we defined the heatmap as a 1,000 × 1,000 pixel image. Since each pixel in the generated visualization corresponds to a geographical position on the map in Fig. 9, we assign a color value to each pixel according to the air pollution value at that location. This task is implemented in three steps. First, we obtain the air quality prediction for each grid through our proposed prediction method, using the ground truth data from the fixed sensors. Then, linear regression is used to calculate the air quality value of each pixel in the 1,000 × 1,000 pixel image. Lastly, each pixel is assigned a color by mapping its air quality value to the preset color range. Our visualization highlights the variability across different regions rather than the absolute values: the colors on the map represent relative values and let us easily and directly understand the surrounding air quality conditions. Figure 9(b) is an example of our visualization, showing the distribution of PM2.5 at 19:00:00 on Dec. 19th, 2018 over our 100 grids. The color bar on the right-hand side represents the value range on the map. The star, round face and triangle marks on the graph indicate where the fixed sensors are installed. Observing the visualization results, we find that the upper right area has a higher concentration of air pollutants and the center part is generally less polluted. This is because the upper area is closer to a factory area, whereas the center region has several green parks and residential areas.
FIGURE 8. Our prediction using Random Forest Regressor (RFR), Gradient Boosting (GB) and Support Vector Regressor (SVR) against ground truths. We show the predictions with different ranges of granularity. We find that GB performs the best among the prediction methods.
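The last step of the visualization procedure in this subsection, mapping values to colors on a relative scale, can be sketched as below. Only this coloring step is shown; the per-pixel value estimation is not reproduced. The blue-to-red ramp and the function name are assumptions chosen for illustration, since the actual preset color range is not specified in the text.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def relative_colors(values: List[List[float]]) -> List[List[RGB]]:
    """Color each cell by its position between the minimum and maximum of the whole map."""
    flat = [v for row in values for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    colored = []
    for row in values:
        colored_row = []
        for v in row:
            t = 0.0 if span == 0 else (v - lo) / span  # 0 = cleanest cell, 1 = most polluted
            colored_row.append((int(255 * t), 0, int(255 * (1 - t))))  # blue -> red
        colored.append(colored_row)
    return colored
```

Because the scale is relative to the current map, the same color can correspond to different absolute concentrations at different times, which matches the stated focus on variability across regions.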

VII. DISCUSSIONS AND LIMITATIONS
It is interesting to observe that the errors are much higher in the last column of Tables 4 and 5. The reason is that the data from Dec. 10th to Dec. 14th consist of weekdays whereas Dec. 15th is a Saturday, and the air pollution patterns in the selected area differ between weekdays and weekends. A similar pattern can also be observed on Dec. 16th in Table 2, which is a Sunday. The ground-truth data show that weekends generally have heavier air pollution. Therefore, whether a day is a weekday or a weekend is an important factor to consider in designing a better air pollution prediction model. Deep learning techniques are now widely used for classification and regression tasks. However, our initial results show that deep learning models did not perform well because of the small amount of data, and simple classical models performed better. In future work, after collecting more data, we plan to experiment extensively with deep learning algorithms and incorporate additional features to improve prediction performance.

VIII. CONCLUSION
In this paper, we explored a new way to predict the immediate air quality around people by combining fixed and mobile sensors. Our experimental results show that our proposed hybrid system of distributed fixed and mobile IoT sensors is effective in predicting the air quality around people. In addition, our proposed system can be realized in practice by equipping public transportation, such as buses and taxis, with IoT sensor devices to measure different areas. The predicted air quality data from our system can be used in various scenarios, such as planning outdoor activities.