Intelligent Air Pollution Sensors Calibration for Extreme Events and Drifts Monitoring

Air quality low-cost sensors (LCSs) are affordable and can be deployed at massive scale to provide high-resolution spatio-temporal air pollution information. However, they often suffer from poor sensing accuracy, in particular when they are used for capturing extreme events. We propose an intelligent sensor calibration method that corrects LCS measurements accurately and detects the calibrators' drift. The proposed calibration method uses a Bayesian framework to establish white-box and black-box calibrators. We evaluate the method in a controlled experiment under different types of smoking events. The calibration results show that the method accurately estimates the aerosol mass concentration during the smoking events. We show that black-box calibrators are more accurate than white-box calibrators. However, black-box calibrators may drift easily when a new smoking event occurs, while white-box calibrators remain robust. Therefore, we implement both calibrators in parallel to exploit their respective strengths and to enable drift monitoring for the calibration models. We also discuss how our method can be applied to other types of LCSs that suffer from poor sensing accuracy.


I. INTRODUCTION
INDOOR air quality has a direct impact on overall human health and significantly affects human work productivity. According to the United States Environmental Protection Agency (EPA), humans spend about 80%-90% of their time indoors. The levels of indoor air pollution are also often two to five times higher than outdoor levels. In some cases, indoor pollution levels may exceed outdoor levels of the same pollutants by a factor of 100. Excessive levels of indoor air pollutants can lead to immediate harmful effects. For example, incidental propane leaks in industrial plants [1] or excessive carbon monoxide (CO) in vehicles [2] can cause sudden death.
According to the World Health Organization (WHO), particulate matter (PM) is a common indicator for air pollution and affects human health more than any other pollutant. Indoor PM can originate from outdoor sources or be generated through human activities, such as cooking, burning candles, using kerosene heaters, and smoking. Therefore, accurate indoor air quality measurement enables estimating health and safety risks in work and living environments.
However, air quality varies between the rooms and spaces of a building. This may require installing multiple sensors in different rooms. Fortunately, low-cost sensors (LCSs) can be utilized for such purposes [3]. LCSs can then raise an alert when pollutants reach a particular health threshold. Indeed, LCSs are affordable and relatively easy to install, so they can be massively deployed in buildings [4].
Although LCSs are usually laboratory calibrated, they often suffer from low accuracy and low robustness when they are deployed in the field [5]. These issues usually occur due to sensor designs [6], sensor drifts, changes in environmental conditions, background changes, and fabrication variances [7]. For example, LCSs generally do not include a heater or dryer at their inlets, so changes in temperature and relative humidity have a significant impact on the performance of low-cost PM sensors [8]. As a result, LCSs often fail to measure air pollutants accurately at very low and very high concentration levels [9], [10]. To overcome the challenges of LCS measurement accuracy and robustness, many studies propose solutions in terms of sensor deployment and sensor calibration, as presented in the review studies [5] and [11]. Moreover, thanks to the advancement of computing technologies, data-driven and machine-learning (ML) based approaches have recently emerged as a potential solution for these challenges [5], [12].
The state of the art of indoor LCSs was reviewed comprehensively in [13] and [14]. Based on these studies, there is an immense need for research on indoor LCS-based measurements and calibrations. These studies highlight that most research activities for indoor environments have focused only on the sensors' data analytics (e.g., more than 60% of their reviewed papers), neglecting the evaluation of sensors' performance indoors through developing calibration methods. Another concern when deploying LCSs relates to sensor drifts and calibrator drifts (also known as concept drift). Sensor drift indicates the aging of the sensor hardware over time [15], which makes the sensor readings deviate from the actual values [16], whereas calibrator drift refers to the situation where the performance of calibration models degrades due to changes in environmental conditions [17]. To the best of our knowledge, none of the papers reviewed in the aforementioned articles proposes a method that combines sensor calibration and drift detection, especially for indoor environments, where reference instruments are usually not accessible and remote sensing cannot penetrate indoors for sensor validation.
In this article, we contribute by proposing a novel sensor calibration method and a calibrators' drift detection method, which are evaluated in an indoor environment. The novelties of our study include: 1) performing controlled experiments to define scenarios for indoor extreme events (presented in Sections II and III), 2) deploying white-box and black-box calibrators in parallel for correcting LCSs measurements and detecting calibrators' drift (explained in Sections IV and V), and 3) discussing potential industrial applications extended from the proposed methods (discussed in Section VI-C).

II. EXPERIMENT: INDOOR POLLUTANT MEASUREMENTS
In the experiments, we use two types of reference instruments (R) and two generations of LCSs (labeled as L), where R refers to any high-precision sensing instrument such as DustTrak and SidePak (as shown in Fig. 1). The measurements of R can be used as ground truth data for sensor calibration and validation purposes. In addition, LCSs are known to be affordable devices (i.e., they cost less than $2500 per unit [18]), which have evolved as efficient solutions for indoor and outdoor air pollution monitoring [3]. In this study, the LCS generation indicates improvements in the LCSs' hardware and software (i.e., a different LCS version). Both R and LCSs used in this study are shown in Fig. 1, part ❶, with labels R 1, R 2, L 1, and L 2.

A. Reference Instruments
DustTrak DRX 8534 (TSI Inc.), labeled as R 1, is capable of simultaneously measuring size-segregated mass fraction aerosol concentrations in the range from 0.001 to 150 mg/m³, corresponding to PM 1, PM 2.5, PM 4 (Respirable), PM 10, and total PM size fractions. Therefore, the instrument can measure contaminants such as dusts, smoke, fumes, and mists. The sensing technology of the instrument is based on light-scattering laser photometers. The instrument is battery operated, and data logging can be done between −20 °C and 60 °C with an operational humidity between 0% and 95%.
SidePak™ Personal Aerosol Monitor AM520 (TSI Inc.), labeled as R 2, is capable of measuring aerosol mass concentrations in the range from 0.001 to 100 mg/m³, corresponding to PM 1, PM 2.5, PM 4 (Respirable), PM 5 (China Respirable), PM 10, and 0.8 μm diesel particulate matter (DPM). Thus, the instrument provides real-time aerosol mass concentration readings of dusts, fumes, mists, smoke, and fog. The instrument is portable and battery operated. The sensing technology of the instrument is based on light-scattering laser photometers. It can also operate between −20 °C and 60 °C with an operational humidity between 0% and 95%.

B. Low-Cost Sensors (LCSs)
In Fig. 1, the LCS units refer to sensor generation I (labeled as L 1) and sensor generation II (labeled as L 2). The devices measure the mass concentration of PM with diameter smaller than 2.5 μm (PM 2.5). A thermal resistor in the sensor drives an air flow induced by the temperature gradient. The sensor devices have an air inlet, a light sensor, and an infrared light source. Measurement starts when air enters the sensor's air inlet and the light source is focused on the sensing point. These sensor devices use light-scattering particle (LSP) sensing for monitoring PM 2.5. LSP sensors are well-known low-cost solutions for particle concentration measurement and monitoring. These portable sensor devices are used to perform real-time and spatial PM 2.5 measurements and monitoring [19]. In addition to the features of L 1, sensor generation II (i.e., L 2) is equipped with a case to reduce the effect of air turbulence in the inlet. Sensor L 2 is also equipped with meteorological sensors measuring relative humidity (RH), temperature (Temp), and pressure (P). Moreover, an algorithm embedded in L 2 filters the raw measured data to remove spikes before data recording and monitoring.

C. Experiment
We carried out the experiments in two different time intervals. The first measurement was performed continuously between 6 and 8 Feb 2020, and the second measurement was performed between 14 and 22 Feb 2020. During the measurements, R 1 and R 2 were placed side by side with the LCSs, i.e., one unit of L 1 and two units of L 2 (L 2a and L 2b), in a confined space, i.e., a room where the ventilation system was sealed off. The experimental setup is illustrated in Fig. 1, part ❶. The inlets of all instruments were placed directly next to each other to ensure they sample the same aerosol mass concentrations. Four types of smoke were generated using tobacco, electric cigarette, incense, and shisha; the measurements are depicted in Fig. 2. There were in total 12 experimental events for smoke measurements. Tobacco was smoked at events 1, 2, 3, 6, 7, and 11; electric cigarettes were smoked at events 4 and 5; incense was lit at events 8, 9, and 10; and shisha was blown at event 12. The experimental events were held by blowing the smoke next to the inlets of the experimental setup. During the experiment, we continuously recorded the measurements of PM 2.5 concentration. The data collected from the reference instruments and LCSs have different time resolutions by default; thus, the data need to be synchronized. The time resolution of L 1 varies between 40 s and 1 min, whereas L 2 has a fixed time resolution of 1 min. Both R 1 and R 2 have a consistent measurement interval of 1 min. Hence, for our data analysis, we aggregate the data to 1 min resolution. Note that there is an experimental gap between 8 and 14 Feb 2020 (about a week).
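The synchronization step can be illustrated with a minimal sketch. It assumes that readings are averaged within each calendar minute and that only minutes present in both series are kept; the source does not state the aggregation function, so the mean is an assumption, and the function names and sample values below are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_to_minute(samples):
    """Average (timestamp, value) samples into 1-minute bins.

    Returns a dict mapping the start of each minute to the mean of the
    values recorded within that minute (mean aggregation is an assumption).
    """
    bins = defaultdict(list)
    for ts, value in samples:
        bins[ts.replace(second=0, microsecond=0)].append(value)
    return {minute: mean(values) for minute, values in sorted(bins.items())}


def align(series_a, series_b):
    """Keep only the minutes that appear in both aggregated series."""
    common = sorted(set(series_a) & set(series_b))
    return [(m, series_a[m], series_b[m]) for m in common]


# Hypothetical readings: an LCS sampling every ~40 s and a reference at 1 min.
lcs = [
    (datetime(2020, 2, 6, 12, 0, 5), 10.0),
    (datetime(2020, 2, 6, 12, 0, 45), 14.0),
    (datetime(2020, 2, 6, 12, 1, 25), 20.0),
]
ref = [
    (datetime(2020, 2, 6, 12, 0, 0), 11.0),
    (datetime(2020, 2, 6, 12, 1, 0), 19.0),
]
paired = align(aggregate_to_minute(lcs), aggregate_to_minute(ref))
```

Each element of `paired` holds one minute together with the LCS and reference values for that minute, which is the form needed for the validation and calibration steps.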
2) Smoking Events Characteristics: In this article, the whole experiment comprises smoke and normal events. The median PM 2.5 concentration for the whole experiment is 27.2 μg/m³. A normal event is usually assumed if the PM 2.5 concentration is below this median level. However, as shown in Fig. 2, the smoke does not dissipate quickly, since the ventilation system is OFF. In addition, another smoking event takes place before the PM 2.5 concentration returns to the median level. Therefore, we assume that a smoking event happens when the PM 2.5 concentration crosses the 75% quantile, which is at 144.76 μg/m³. As shown in Fig. 2, the experiment highlights the gap between the measurements of R and LCSs, indicating that LCSs suffer from poor measurement accuracy, which is the main concern in this article. Hence, to validate the measurements of LCSs, we use data collected from DustTrak (R 1) as the ground truth data. The instrument's performance has been demonstrated in many scientific experiments [20].
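The quantile-based event definition can be sketched as follows. This is an illustration only: the function name and the synthetic readings are hypothetical, and the interpolation rule used for the quantile (the "inclusive" method of the Python standard library) is an assumption, since the source does not specify one.

```python
from statistics import quantiles


def label_events(pm25, q_percent=75):
    """Label readings above the q-th percentile as smoking events.

    q_percent must be a whole percentile between 1 and 99.
    """
    # quantiles(..., n=100) returns the 99 percentile cut points.
    threshold = quantiles(pm25, n=100, method="inclusive")[q_percent - 1]
    labels = ["smoking" if x > threshold else "normal" for x in pm25]
    return threshold, labels


# Synthetic readings 0, 1, ..., 100 (not the experimental data).
readings = list(range(101))
threshold, labels = label_events(readings)
```

On the experimental data reported above, the same rule would use the 75% quantile of 144.76 μg/m³ as the threshold.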
3) Performance Metrics: We use the Pearson correlation coefficient (R), mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean squared error (RMSE) as performance metrics for sensor and method validation. The metrics are described in Appendix A.
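For reference, the four metrics can be computed as in the sketch below. It uses the standard textbook definitions, which are assumed to match Appendix A; MAPE is expressed as a fraction rather than a percentage, in line with the values reported later (e.g., around 0.4).

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in y_true))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in y_pred))
    return cov / (std_t * std_p)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (undefined if a true value is 0)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical values: the "sensor" reads exactly twice the reference.
reference = [1.0, 2.0, 4.0]
sensor = [2.0, 4.0, 8.0]
scores = {
    "R": pearson_r(reference, sensor),
    "MAE": mae(reference, sensor),
    "MAPE": mape(reference, sensor),
    "RMSE": rmse(reference, sensor),
}
```

The example shows why R alone is not sufficient: a sensor that reads exactly twice the reference is perfectly correlated with it while still having large MAE, MAPE, and RMSE.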

III. SENSORS PERFORMANCE
In this section, we perform sensor validation using consistency and accuracy tests to evaluate the performance of the sensors (as shown in Fig. 1, part ❶). The term consistency refers to the similarity between the measurements of two LCS units, whereas the term accuracy indicates how similar the measurements of the LCS units are to the measurements of a reference instrument.

A. Meteorological Variables: Consistency Test
The L 2 units are equipped with meteorological sensors measuring the variables Temp, RH, and P. To show how consistent the measurements of the meteorological variables are, we perform a consistency test between L 2a and L 2b using the metrics R, MAPE, MAE, and RMSE. The consistency test results are shown in Table I.
These results show that the meteorological measurements are almost identical and demonstrate consistent performance when they are compared between each other. In Table I

B. Aerosol Sensors: Consistency and Accuracy Tests
We validate aerosol LCS measurements using the reference instruments. This validation is known as the accuracy test, whereas the comparison between devices of the same type is known as the consistency test. Fig. 3 shows the sensor validation heatmap matrix between the reference instruments (R 1 and R 2) and the LCSs (L 1, L 2a, and L 2b). The figure consists of two performance metrics: the lower part illustrates MAPE, whereas the upper part shows the Pearson correlation coefficient (R). The colors represent the level of the R and MAPE values. When the color is closer to dark red, R between two devices is strong and MAPE is low. Conversely, when the color is closer to dark blue, R between two devices is low and MAPE is high.
The consistency tests between the reference instruments show high agreement (i.e., high R value and small MAPE value). This indicates that both reference instruments provide similar performance, and hence either of them can be used as ground truth. Since the performance of R 1 has been demonstrated in many scientific experiments [20], we select R 1 as the ground truth sensing instrument for validating sensors and developing calibrators. Likewise, the consistency tests between the two second-generation LCS units demonstrate high R and very low MAPE values. This indicates that they are identical in terms of electronics and consistent in terms of performance. In addition, L 1 and L 2 show only a negligible performance difference in terms of R and MAPE, allowing us to apply the same types of calibrators to the two generations of LCSs. The accuracy test between the LCSs and the reference instruments shows that the correlation coefficients (R) are low, at approximately 0.6 (i.e., yellow color indicator), while the MAPE values are around 0.4 (light blue). These results indicate that LCSs do not meet the performance of reference instruments. Fig. 4 shows scatter plots of PM 2.5 between R 1 and L 1 and L 2a. The scatter plot of L 2b is not shown in the figure, as it would demonstrate a similar pattern. In the figure, the normal event is illustrated in blue, whereas the smoking events are shown in other colors. Each color shows a different deviation path that interestingly forms a cluster for each type of smoke. It can be seen that the relationship between R 1 and the LCSs is nonlinear for the concentration distribution within each smoke type (cluster). The figure also presents the values of R and MAE for the whole, normal, and smoking event scenarios. During the normal event, R values for both LCSs are still high (≈ 0.8) and MAE values are low (< 23 μg/m³).
These results indicate that the performance of LCSs is similar to the reference instrument in normal conditions. However, during smoking events, the measurement error between the reference instrument and the LCSs becomes larger as the PM 2.5 concentration increases (such that R < 0.5 and MAE > 900 μg/m³).
In practice, since LCSs are incapable of measuring high levels of PM 2.5 concentration and extreme events, relying on their measurements for these smoking events would be harmful. As a result, the LCSs need to be calibrated to improve their PM 2.5 measurements. In the next section, we explain our proposed sensor calibration method.

A. Calibration Process
In Fig. 1 (part ❷), we illustrate the development of sensor calibrators, which consists of two calibrator models, called white-box (W) and black-box (V) calibrators. In general, there are two approaches for developing W. The first approach relies on physics-based models, and the second approach uses statistical models, where the relationship between the inputs and the outputs is visible and transparent [20]. A white-box calibrator (W) is therefore usually suitable if the measurements of LCSs and reference instruments exhibit regular patterns. For example, in our case as illustrated in Fig. 4, the relationship between the reference instrument and the LCSs presents exponential shapes. The black-box calibrator V provides little explanatory insight into the relative influence of the independent variables (e.g., input variables) on the prediction (e.g., outputs), but such calibrators are often effective in dealing with air quality and environmental data, which are nonlinear [21]. For example, neural networks are known as general approximators that can deal relatively well with most nonlinear problems, such as sensor calibration and virtual sensors [9].
Both calibrators (W and V) are then trained independently using the datasets obtained from the experiments (Fig. 1, part ❶). Even though our sensor calibration process (see Fig. 1) allows flexibility in the choice of models for V and W, in our study we select a Bayesian linear model (BLM) as W 2 and a Bayesian neural network (BNN) as V 2. We select the Bayesian framework for two reasons. First, Bayesian models are robust against overfitting due to the presence of regularization. Second, Bayesian inference yields probability distributions over the model coefficients and a predictive distribution, which enables analyzing them statistically [20]. For comparison with W 2 and V 2, we also redevelop the most popular calibration methods mentioned in [5]. These calibration methods include multivariate linear regression (MLR) and artificial neural network (ANN), representing white-box (W 1) and black-box (V 1) models, respectively.
Next, we deploy both trained calibrators (W 2 and V 2) in parallel so that they complement each other's strengths and weaknesses (see Fig. 1, part ❸). We further compute the residual R [see (3)] between V 2 and W 2 to monitor the calibrators' drift (see Fig. 1, part ❹). Finally, the outputs from the calibrators provide accurate PM 2.5 concentration information for users (see Fig. 1, part ❺). In addition, as described in Section VI-C, various industrial applications can benefit from the calibrated PM 2.5 measurements.

B. Calibration Models
In the calibrator development phase (see Fig. 1, part ❷), the W and V calibrators can be expressed mathematically as

$$y_1 = W(X, \beta) + \varepsilon \tag{1}$$
$$y_2 = V(X, \omega) + \varepsilon \tag{2}$$

where W and V are the white-box and black-box calibration functions, respectively, and y 1 and y 2 are the outputs of calibrators W and V, respectively. It is worth noting that y 1 and y 2 are the calibration outputs during the training process. The symbol β represents the model coefficients of W and the symbol ω represents the weights of V. In both calibrators, ε refers to errors that follow a Gaussian distribution with zero mean and noise variance σ 2, given by ε ∼ N(0, σ 2). The inputs X for both calibrators are obtained from the LCS measurements, including PM 2.5 concentration and meteorological variables. As described in Section IV-A, the calibrator functions W 2 and V 2 are selected to be a BLM and a BNN, respectively. Therefore, the optimization of the models' coefficients is performed using Bayesian inference. In the calibrator deployment phase (see Fig. 1, part ❸), y * 1 and y * 2 are the calibrators' outputs during the testing process, which are in the form of Gaussian predictive distributions symbolized by p(y * 1 |X *, X, y 1) and p(y * 2 |X *, X, y 2) for W 2 and V 2, respectively. In both calibrators, X * denotes the test data obtained from LCS measurements. The derivation of both calibrators is described in [20] and also briefly presented in Appendices B and C.
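To make the Gaussian predictive distribution of the white-box calibrator concrete, the block below states the standard textbook result for a Bayesian linear model. It is a sketch under stated assumptions, not necessarily the exact form derived in Appendix B: the prior on the coefficients is taken to be zero-mean isotropic Gaussian with precision α (a symbol introduced here for illustration), the noise variance σ² is treated as known, and x* denotes a single test input (one row of X*).

```latex
% Assumed prior: \beta \sim \mathcal{N}(0, \alpha^{-1} I); known noise variance \sigma^2.
\begin{align}
  S_N^{-1} &= \alpha I + \sigma^{-2} X^{\top} X, \\
  m_N      &= \sigma^{-2} S_N X^{\top} y_1, \\
  p(y^{*}_{1} \mid x^{*}, X, y_1)
           &= \mathcal{N}\!\left(y^{*}_{1} \,\middle|\, m_N^{\top} x^{*},\;
              \sigma^{2} + x^{*\top} S_N\, x^{*}\right).
\end{align}
```

Under these assumptions the predictive variance has two parts: the noise variance σ² and a term that grows when x* lies far from the training inputs, which is what allows the predictive distribution to be analyzed statistically.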

C. Drift Monitoring Methods
In real deployment, due to various hardware and environmental reasons, the calibration models become less effective over time. In this article, we call this phenomenon calibrator drift, and we propose two methods for monitoring it (see Fig. 1, part ❹): 1) monitoring the residual between the outputs of calibrators W and V, and 2) monitoring one of the key variables that may affect the calibrators' effectiveness.
The first method computes the predictive distribution of the calibrator residual (R) between the two deployed calibrators (W and V), shown as the red dashed lines in Fig. 1. In our case, since the predictive distributions of W 2 and V 2 are Gaussian (as explained in Section IV-B), the drift monitoring residual (R) is also Gaussian:

$$R \sim \mathcal{N}\left(\mu^{*}_{y_2} - \mu^{*}_{y_1},\; \Sigma^{*}_{y_2} + \Sigma^{*}_{y_1}\right) \tag{3}$$

where μ * y 2 and μ * y 1 represent the means of the predictive Gaussian distributions of V 2 and W 2, respectively, whereas Σ * y 2 and Σ * y 1 denote the variances of the predictive Gaussian distributions of V 2 and W 2, respectively. The derivation is described in Appendix D.
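A minimal numerical sketch of the residual distribution is given below. It assumes that R is defined as the output of V 2 minus the output of W 2 and that the two predictive distributions are independent, so that the means subtract and the variances add. The function names and the numbers are hypothetical.

```python
import math


def residual_distribution(mu_v, var_v, mu_w, var_w):
    """Mean and variance of R = y2* - y1* for independent Gaussian outputs."""
    return mu_v - mu_w, var_v + var_w


def prob_residual_exceeds(mu_r, var_r, threshold):
    """P(|R| > threshold) for R ~ N(mu_r, var_r)."""
    sd = math.sqrt(var_r)

    def cdf(x):
        # Gaussian cumulative distribution function via the error function.
        return 0.5 * (1.0 + math.erf((x - mu_r) / (sd * math.sqrt(2.0))))

    return 1.0 - (cdf(threshold) - cdf(-threshold))


# Hypothetical predictive means (μg/m³) and variances for V 2 and W 2.
mu_r, var_r = residual_distribution(250.0, 400.0, 130.0, 225.0)
p_large = prob_residual_exceeds(mu_r, var_r, threshold=100.0)
```

Working with the full distribution rather than only the point difference makes it possible to express how likely the residual is to exceed a given limit, instead of only whether it did.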
The second method complements the first by monitoring the updates of one of the key variables measured by LCSs. This is shown as the blue and brown dashed lines in Fig. 1. Owing to the simplicity and transparency of the white-box model, the key variables affecting the calibrators can be identified by analyzing the model coefficients of calibrator W. For example, in our case PM 2.5 is the key variable affecting the calibration. If the calibrators were trained on normal events, they may drift when LCSs are deployed during smoking events. Recall that a normal event refers to scenarios where there is no smoking and the PM 2.5 concentration is generally low, while a smoking event refers to scenarios where the PM 2.5 concentration measured by LCSs is high.
To enable drift detection, an outlier limit (L) is computed as an upper quantile (q) of the training data. For example, L can be set to the qth quantile of the training PM 2.5 concentrations obtained from LCSs (X PM 2.5). Whenever a new PM 2.5 measurement (X * PM 2.5) is larger than L, it is considered an outlier in the test data. Outliers in the test data are one of the indicators of drift occurrence. To this end, the number of outliers in the test data is counted; we denote this count by C. Finally, the percentage of outliers X * PM 2.5 (denoted as P) is computed as P = (C / l) × 100%, where l is the number of X PM 2.5 data points.
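The outlier limit and the outlier percentage can be sketched as follows. Two points are assumptions of this sketch rather than statements from the source: the quantile uses the "inclusive" interpolation rule of the Python standard library, and l is taken as the number of new measurements in the monitored window. The function names and data are hypothetical.

```python
from statistics import quantiles


def outlier_limit(train_pm25, q=0.99):
    """Upper limit L: the q-th quantile of the training PM2.5 values.

    Only whole percentiles (0.01, 0.02, ..., 0.99) are supported here.
    """
    cuts = quantiles(train_pm25, n=100, method="inclusive")
    return cuts[round(q * 100) - 1]


def outlier_percentage(test_pm25, limit):
    """Count C of test values above the limit and the percentage P."""
    count = sum(1 for x in test_pm25 if x > limit)
    return count, 100.0 * count / len(test_pm25)


# Synthetic data (not the experimental measurements).
train = list(range(101))             # training PM2.5 values 0..100
new = [10, 50, 99, 120, 150, 200]    # new LCS measurements
limit = outlier_limit(train)
count, percent = outlier_percentage(new, limit)
```

In this synthetic case half of the new measurements lie above the training range, which would signal that the calibrators are being applied outside the conditions they were trained on.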
Algorithm 1 presents our proposed parallel calibration deployment and drift detection (P). In lines 1 to 3, the algorithm sets three thresholds: the maximum accepted residual (T 1), the maximum accepted percentage of X * PM 2.5 outliers (T 2), and the outlier quantile (q). The algorithm performs the computations of the two methods explained earlier in lines 4 to 23 (while LCSs are deployed and perform measurements). The first method (lines 6 to 8) computes both calibrators (W and V) and the residual R. In line 9, using the available training data (X PM 2.5), the second method computes the outlier limit (L), where in our case we select q = 0.99. Lines 10 and 11 compute the number of outliers in the test data (C) and the percentage of X * PM 2.5 outliers (P), respectively. In lines 12 and 13, if mean(R) < T 1, the calibration is executed using V, which is known to be more accurate. In our study, since a residual of 100 μg/m³ between the two calibrators already indicates drift in calibrator V, we select T 1 = 100. In lines 14 to 22, if R crosses the defined threshold (T 1), this indicates that V, which is known to be less robust, begins to drift. Hence, the calibration switches to calibrator W (line 16). In lines 17 and 18, when P crosses the threshold T 2 (in our case, 25%), calibrator drift is declared. This means that both calibrators V and W do not function properly (line 19). Therefore, a mitigation such as recalibration is required (line 20), as explained in Section VI-B.
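The decision logic of Algorithm 1 described above can be sketched in a few lines. This is a simplified illustration rather than the authors' implementation: the calibrator outputs are assumed to be already available as a list of residuals, l is taken as the number of new measurements, the case mean(R) = T 1 is handled by the second branch, and the function name and return format are hypothetical.

```python
from statistics import mean, quantiles


def select_calibrator(residuals, train_pm25, test_pm25, t1=100.0, t2=25.0, q=0.99):
    """One pass of the parallel-deployment logic.

    Returns (calibrator, drift_declared), where calibrator is "V" or "W".
    """
    # Second method: outlier limit from training data, then outlier percentage.
    limit = quantiles(train_pm25, n=100, method="inclusive")[round(q * 100) - 1]
    count = sum(1 for x in test_pm25 if x > limit)
    percent = 100.0 * count / len(test_pm25)

    # First method: compare the mean residual with T1 (as written in the source).
    if mean(residuals) < t1:
        return "V", False          # black-box calibrator is trusted
    # Residual at or above T1: V is assumed to drift, so switch to W.
    if percent > t2:
        return "W", True           # both calibrators suspect: mitigation needed
    return "W", False


train = list(range(101))           # synthetic training PM2.5, limit = 99.0
case_ok = select_calibrator([10, 20, 30], train, [5, 10])
case_switch = select_calibrator([150, 200], train, [10, 20, 30, 120])
case_drift = select_calibrator([150, 200], train, [10, 120, 130, 140])
```

The three calls illustrate the three outcomes: use V when the calibrators agree, fall back to W when they disagree, and declare drift when, in addition, many new measurements lie above the training range.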

A. Calibration Performance
In order to evaluate the performance of calibrators W and V, we design 12 different scenarios within five groups. As shown in Fig. 5, the groups are labeled by G 1 − G 5 and the scenarios by S 1 − S 12.
The cross-units validation refers to calibrators' performance evaluation when we train the calibrators on one unit and then test them on another unit of the same type. This approach enables evaluating the calibrators' sensitivity and accuracy. In addition, this validation is beneficial for evaluating calibrators' resilience against sensor fabrication variance. The cross-different-units validation aims to investigate the calibrators' performance when we train the calibrators on one unit and then test them on another unit of a different type. We use this approach to evaluate the calibrators' accuracy. The calibrators drift validation aims to investigate the calibrators' drift due to a lack of information in the training data (for example, when calibrators have never experienced smoking events). Finally, benchmark validation evaluates the calibrators' performance using a standard modeling process, which typically uses 70% of the data, selected randomly, for training and the remaining 30% for testing. In our study, we use benchmark validation as a baseline for comparison with the other validation approaches.

Algorithm 1: Parallel calibration deployment and drift detection (P)
1: Set T 1: the maximum accepted residual
2: Set T 2: the maximum accepted percentage of X * PM 2.5 outliers
3: Set q: the outlier quantile
4: while LCSs are deployed and perform measurements do
5:   From LCS measurements, obtain {PM 2.5, Temp, RH, P} to form the input matrix X *
6:   Compute W: y * 1 = W(X *, β *)
7:   Compute V: y * 2 = V(X *, ω *)
8:   Compute R
9:   Compute the outlier limit: L = quantile(X PM 2.5, q)
10:  Count C: the number of occurrences of X * PM 2.5 > L
11:  Compute the accepted percentage of X * PM 2.5 outliers: P = (C / l) × 100% (l is the number of X PM 2.5 data points)
12:  if mean(R) < T 1 then
13:    Calibrate LCS using V
14:  else if mean(R) > T 1 then
15:    V does not function well:
16:    Calibrate LCS using W
17:    if P > T 2 then
18:      Calibrator drift is declared!
19:      V and W do not function well
20:      Mitigation
21:    end if
22:  end if
23: end while
The first group (G 1), which includes scenario S 1, aims to evaluate the calibrators using the benchmark validation approach. The second group (G 2), which includes scenarios S 2 and S 3, is designed to observe the accuracy of the calibrators using the cross-units validation approach. The third group (G 3), which includes scenarios S 4 − S 7, uses the cross-different-units validation approach to investigate the calibrators' accuracy across different types of LCSs. The fourth group (G 4), which includes scenarios S 8 − S 11, is designed to perform the cross-units validation approach in order to observe the sensitivity of the developed calibrators. In the scenarios in G 4, we use all data except one particular smoke type from sensor L 2a for training the calibrators. Then, we test the calibrators on sensor L 2b. For instance, in scenario S 8, we train the calibrators using the whole dataset from L 2a except for tobacco and test them on sensor L 2b. The fifth group (G 5), which consists only of scenario S 12, is designed to perform the calibrators drift validation. In this scenario, we use L 2a to train the calibrators using all of the normal event data, and test the trained calibrators on all of the smoking events. Fig. 5 shows the performance results of the different calibrators, including BLM (W 2), BNN (V 2), and our proposed calibrator (P), for the different scenarios. In the figure, we also include the most popular white-box (W 1) and black-box (V 1) calibration methods in order to compare them with the calibrators V 2, W 2, and P. As presented in the figure, we use the performance metrics R, MAE, and MAPE.
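The way the scenario groups partition the data can be illustrated with the sketch below. The record layout (dictionaries with "event" and "pm25" fields), the function names, and the synthetic rows are hypothetical. Two details are assumptions: in the G 4 style split the test set is taken to be all data of the second unit, and in the G 5 style split the smoking events are taken from the same unit used for training, since the source does not state the test unit for S 12.

```python
import random


def benchmark_split(rows, train_fraction=0.7, seed=0):
    """G 1 style split: random 70% for training, remaining 30% for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def leave_one_smoke_out(train_unit_rows, test_unit_rows, held_out):
    """G 4 style split: drop one smoke type from the training unit only."""
    train = [r for r in train_unit_rows if r["event"] != held_out]
    return train, list(test_unit_rows)


def normal_train_smoke_test(train_unit_rows, test_unit_rows):
    """G 5 style split: train on normal events, test on smoking events."""
    train = [r for r in train_unit_rows if r["event"] == "normal"]
    test = [r for r in test_unit_rows if r["event"] != "normal"]
    return train, test


# Synthetic rows for two units of the same type (not experimental data).
events = ["normal", "tobacco", "e-cigarette", "incense", "shisha"]
unit_a = [{"event": e, "pm25": 10.0 * i} for i, e in enumerate(events * 2)]
unit_b = [{"event": e, "pm25": 12.0 * i} for i, e in enumerate(events * 2)]

train_s1, test_s1 = benchmark_split(unit_a)
train_s8, test_s8 = leave_one_smoke_out(unit_a, unit_b, held_out="tobacco")
train_s12, test_s12 = normal_train_smoke_test(unit_a, unit_a)
```

Holding out one smoke type during training while keeping it in the test unit is what exposes the calibrators to an unseen event, which is the condition under which the sensitivity of the black-box calibrator becomes visible.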
Under the benchmark validation approach (group G 1), the W 2 and V 2 calibrators perform better than W 1 and V 1 on all performance metrics. The regularization inherent in Bayesian inference makes W 2 and V 2 generalize better than W 1 and V 1. In addition, V 2 performs better than W 2 on all performance metrics. Under this approach, our proposed method (P) performs better than the rest of the calibrators, except V 2, which differs from P only marginally. The reason for this minor difference might be that the training data already contain the outliers, while the test data do not.
The cross-units validation approach is evaluated within G 2, which consists of scenarios S 2 and S 3. The values of R for V 2 are consistently higher than for W 2 in both scenarios. This implies that V 2 provides better calibration accuracy. Likewise, the values of MAE and MAPE for V 2 are lower than for W 2, indicating that V 2 is more accurate than W 2 in these scenarios. Both W 2 and V 2 calibrators outperform W 1 and V 1 for the same reasons explained previously for scenario S 1. In scenario S 2, P performs better than all calibrators. In scenario S 3, P also outperforms all of the calibrators, except V 2 by a very minor difference. The reason for the minor difference is explained for S 1. To conclude, the performance metrics confirm that the calibrators function well across units of the same type.
Fig. 6. Scatter plots between the reference instrument and the calibrated LCS using W 2 (left) and V 2 (right) for scenario S 2.
Fig. 7. Time-series plot representing the ground truth (R 1), uncalibrated LCS (L 2b), calibrated LCS (L 2b) using W 2 and V 2, tested on scenario S 2.
For the cross-different-units validation approach, we consider scenarios S 4 − S 7 in group G 3. Similar to group G 2, the performance metrics for the scenarios in group G 3 show that W 2 and V 2 generally perform better than W 1 and V 1, respectively. However, in these scenarios, V 2 does not outperform W 2, indicating that white-box calibrators perform slightly better than black-box calibrators when they are tested on a different unit type. Nevertheless, P still outperforms all other calibrators, indicating that P shows promising results when it is tested on a different unit type. As an outcome of the cross-different-units validation, the performance results demonstrate that all of the calibrators still function well across different units.
Similar to group G 2, group G 4, which includes scenarios S 8 − S 11, also evaluates the cross-units validation approach. In the scenarios of G 4, calibrators W 2 and V 2 still outperform W 1 and V 1. However, in some cases (e.g., S 8 and S 11), the performance metrics show that W 2 performs slightly better than V 2. Therefore, as an outcome of the cross-units validation, V 2 seems to be more sensitive when facing a new smoking event. For example, in S 8, calibrator W 2 works better than V 2, because W 2 is more robust to outliers than V 2. As a result, V 2 does not accurately calibrate the LCS on the tobacco smoking event. Nevertheless, P outperforms all of the calibrators, although the test data contain outliers. This is because the parallel implementation in P enables switching from V 2 to W 2 when the residual R increases due to outliers.
To investigate the calibrator drift validation approach, we consider group G 5 , which includes scenario S 12 . Recall that in this scenario, all smoking event data are excluded from the calibrators' training. The results show that both calibrators V 1 and V 2 clearly drift, presenting small values of the metric R and MAPE values higher than 1. In contrast, W 1 and W 2 maintain performance at an acceptable level, with a value of about 0.7 for the metric R. Indeed, calibrators W 1 and W 2 are more robust than calibrators V 1 and V 2 because white-box calibrators have lower modeling complexity. In this scenario, our proposed method P raises a calibrator drift alert as described in Algorithm 1. This is highlighted by D, i.e., calibrator drift, for scenario S 12 in Fig. 5. The calibrator drift analysis is explained in Section IV-C.
Next, we generate scatter plots (see Fig. 6) and time-series plots (see Fig. 7) to provide further insight into the results presented in Fig. 5. Since most results indicate that V 2 is more accurate than W 2 , we further analyze scenario S 2 as an example, which is also a simpler scenario to understand. Fig. 6 depicts scatter plots between the reference instrument (R 1 ) and the calibrated LCS (L 2b ) for calibrators W 2 (left subfigure) and V 2 (right subfigure). In this figure, the colors indicate the density of data points for the PM 2.5 measurements. The plot shows that the PM 2.5 data points scatter around the red reference lines for both calibrators. The scatter plots indicate that both calibrators perform well by correcting the measurements of L 2b and bringing them close to the measurements of R 1 . In addition, V 2 calibrates PM 2.5 more accurately than W 2 , especially at high PM 2.5 concentrations. Nevertheless, both W 2 and V 2 calibrate PM 2.5 to an acceptable level. This is confirmed by Fig. 7, where both calibrators W 2 and V 2 track the readings of R 1 very well. Fig. 7 also illustrates that the calibrators are able to capture the extreme smoking events effectively. As a result, implementing both calibrators enables detecting and avoiding false negative situations, which may be harmful to humans.
The results of the different scenarios presented in Fig. 5 show that both calibrators have strengths and weaknesses. V 2 tends to drift drastically when a completely new situation emerges (as in scenario S 12 ), whereas W 2 performs adequately with acceptable performance degradation. These facts motivated us to deploy both calibrators in parallel (P), as they have two different characteristics. To highlight the performance of all of the calibrators and P, Table II summarizes the mean of the R values for the scenarios in each group. This table condenses the results presented in Fig. 5 by showing that 1) W 2 and V 2 are generally better than the most popular calibration methods W 1 and V 1 , 2) V 2 is better than W 2 for most scenarios, 3) our proposed approach P outperforms the other calibrators, and 4) P enables calibrator drift detection as shown in scenario S 12 .
It is worth noting that drift detection is important because LCSs and reference instruments are usually not installed near each other. Consequently, it is challenging to detect calibrator drifts in the absence of a reference instrument that provides ground truth data. As described in Section IV-C, deploying two types of calibrators allows cross-checking them. This process, called drift monitoring, aims to ensure that both calibrators perform effectively by enabling the detection of calibrator drifts. The next section provides further analysis of the calibrator drifts.

B. Drift Analysis
As explained in Section IV-C, analyzing the model coefficients of calibrator W provides insight into the variables affecting the LCS measurements. Fig. 8 depicts the model coefficients of calibrators W 2 (obtained using the data from L 2a ) for scenarios S 1 , S 2 , S 4 , and S 8 -S 12 . Since the calibrators W 2 in these scenarios are based on BLMs, their model coefficients (β) take the form of a Gaussian distribution, β ∼ N (μ β , V β ), with mean μ β and variance V β . These model coefficients (β) are depicted in Fig. 8 as ellipsoids, where the center and radius represent the mean and standard deviation of the multivariate Gaussian distribution, respectively.
In the figure, the coefficient β with the largest magnitude indicates the most dominant variable in the LCS measurements. The variables include PM 2.5 , Temp, and RH, which are associated with β 1 , β 2 , and β 3 , respectively. It can be seen that PM 2.5 , associated with β 1 , plays the major role in calibration: its coefficient values range between 0.7 and 0.9, which is one order of magnitude larger than the values of β 2 and β 3 . The variations of the Temp and RH measurements therefore have less influence on the calibrators' performance. In addition, the role of pressure (P) is trivial, with the mean of β 4 close to −0.003 for all scenarios (not included in the figure). Moreover, as illustrated in Fig. 8, the positions of the ellipsoids, which separate into normal (yellow) and drift (dark blue) clusters, are dominated by the magnitude of β 1 . This means that (as described in Algorithm 1) monitoring the changes in the test data PM 2.5 (X * PM 2.5 ) provides an indication of calibrator drift. Fig. 9 illustrates the relationship between the residual (R) and the PM 2.5 measurements gathered during the testing process (X * PM 2.5 ) for S 12 . The blue histogram shows X * PM 2.5 , while the pink histogram shows the PM 2.5 measurements collected during the training process (X PM 2.5 ). In the figure, the x-axis represents the PM 2.5 measurements from the LCS prior to calibration, the left y-axis shows the residual (R) between V 2 and W 2 , and the right y-axis presents the frequency of the histograms.
As described in Algorithm 1, drift detection can be performed by monitoring R between V 2 and W 2 . In the figure, R shows an increasing pattern (with uncertainty) as the LCS PM 2.5 measurement concentration increases. In this case, W 2 maintains the calibration performance at an acceptable level, but both calibrators fail when the LCS PM 2.5 measurements (X * PM 2.5 ) are too large (i.e., mean(R) > T 1 ). In the figure, this is shown where R reaches 100 μg/m 3 on the left y-axis. Furthermore, the outlier limit (L) lies on the edge of the pink histogram's right tail (about 50 μg/m 3 on the x-axis at q = 0.99), and the blue histogram has clearly deviated (expanded) far beyond the pink histogram, indicating that the percentage of X * PM 2.5 beyond the outlier limit exceeds the accepted threshold (i.e., P > T 2 ). Consequently, calibrator drift is declared (according to Algorithm 1), and both calibrators are unable to calibrate the readings of the LCSs.
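The two-layer check described above can be sketched as follows. This is a minimal illustration assuming that drift is declared only when both conditions hold; the function name `detect_drift`, the use of the absolute residual, and the example threshold values are assumptions and are not taken from Algorithm 1.

```python
import numpy as np

def detect_drift(residual, x_train_pm, x_test_pm, t1, t2, q=0.99):
    """Hypothetical two-layer drift check.

    Layer 1: the mean residual between V and W exceeds T1.
    Layer 2: the fraction P of test PM2.5 readings beyond the outlier
             limit L (the q-quantile of the training PM2.5) exceeds T2.
    """
    residual = np.asarray(residual, dtype=float)
    x_test_pm = np.asarray(x_test_pm, dtype=float)

    # Outlier limit L taken from the right tail of the training histogram.
    outlier_limit = float(np.quantile(x_train_pm, q))
    # Fraction P of test readings that fall beyond L.
    frac_outside = float(np.mean(x_test_pm > outlier_limit))
    # Absolute value is an assumption so the sign of V - W does not matter.
    mean_residual = float(np.mean(np.abs(residual)))

    drift = (mean_residual > t1) and (frac_outside > t2)
    return drift, outlier_limit, frac_outside

# Made-up example: training PM2.5 spans 0-50 ug/m3, test data go far beyond it.
x_train = np.linspace(0.0, 50.0, 101)
drift, limit, frac = detect_drift(
    residual=[5.0, 90.0, 120.0, 150.0],
    x_train_pm=x_train,
    x_test_pm=[10.0, 60.0, 80.0, 100.0],
    t1=50.0,
    t2=0.5,
)
```

Requiring both layers to trigger is what would reduce false alarms: a large residual alone, or a shifted input distribution alone, is not enough to declare drift in this sketch.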
Reliable drift monitoring also enables detecting wear in the sensor hardware during real use. Because worn hardware usually produces inconsistent readings, residual evaluation can assist in identifying the sources of errors. Drift monitoring thus helps ensure that the sensor calibrators and hardware function accurately in field deployment. If they do not, maintenance can be performed based on the information provided by drift monitoring.

A. Comparison With the State-of-the-Art
LCSs increasingly use ML-based calibration methods to improve the accuracy of sensor measurements [5]. State-of-the-art studies present specific ML-based calibration methods; in contrast, we propose a generic strategy for applying parallel ML-based calibration models (P). Indeed, most studies in the literature implement either white-box (W) or black-box (V) models to perform calibration. Our proposed method offers flexibility in choosing any ML model to represent the W and V models. In this work, we selected BLM and BNN for our proposed method P.
The studies in the literature use datasets generated in different environments, seasons, and locations, and each dataset has different characteristics. Hence, directly comparing the performance results of the calibration models would be inappropriate. Nevertheless, to show the performance of our proposed method (P), we redeveloped the most popular calibration methods [5], i.e., MLR and ANN, and compared them with our selected calibration methods, BLM and BNN, respectively. As presented in Section V-A, our proposed method (which implements parallel ML models) outperforms the individual selected methods (i.e., BLM and BNN) as well as the most popular methods (i.e., MLR and ANN). Our proposed method thus promotes the use of Bayesian models and parallel deployment for LCS calibration.
Furthermore, as the deployment of sensor networks in smart cities has recently increased, drifts of calibration models during in-field operation have become a challenge. The drifts result from various causes, including clean air policies (e.g., on traffic), changes in human consumption patterns such as fuel and gas [22], or temporal effects such as forest fires and volcanic eruptions [23]. To detect the drifts, state-of-the-art methods use statistical differences in the distributions of air pollution measurements [12]. In contrast, our proposed method uses two layers of detection: first, by computing the residual between W and V, and second, by monitoring the changes in one of the key variables measured by the LCSs, e.g., PM 2.5 . The two-layer implementation would reduce the probability of false positive alarms if the proposed method were applied in the aforementioned drift-causing scenarios (such as policies and temporal effects).
Moreover, to the best of our knowledge, this is the first time that a drift detection method has been tested in indoor environments. We performed comprehensive experiments by testing and evaluating our proposed method in an indoor environment (with various smoking events). In contrast, according to a recent literature survey focused on the use of LCSs indoors [13], the majority of works in the literature neither calibrate nor validate the LCSs used in their studies. For example, based on this survey, approximately 77.5% of works did not include details about the calibration of their LCSs [14]. We also demonstrate how extreme events such as smoking activities can significantly alter the LCS readings, leading to false negatives. It is worth noting that a false negative in sensor readings can be harmful to humans, as no alarm alerts people when the pollution concentration is very high in indoor environments.
In summary, in this paper we propose a generic parallel ML-based calibration method, which (as mentioned earlier) provides several advantages compared with the works in the literature. Calibration and drift detection methods might perform differently in various environments, e.g., under different meteorological conditions. However, the dataset used in our study is limited to one type of indoor environment with specific characteristics such as room size, ventilation, and other influencing factors. Hence, our proposed method requires further evaluation using different and comprehensive datasets obtained under various environmental characteristics. Such comprehensive datasets can also assist in investigating different LCS calibration and drift detection methods.

B. Suggested Solutions for Drifting
Besides sensor recalibration, investigating the causes of drifts helps in understanding the sources of problems and therefore enables improving the calibration models and the LCS hardware design. We envision three methods to minimize the drift in calibrators:
Method 1: Extensive laboratory experiments can be performed to test different scenarios on newly designed LCSs. Different kinds of aerosol particles with varying meteorological variables are introduced into an experimental chamber, where the LCSs are placed. The idea is to mimic as many of the scenarios that the LCSs may encounter in field deployment as possible. For example, if LCSs are designed to be deployed indoors, they should be tested on different indoor scenarios, e.g., smoking and fire sensing. Based on these experiments, effective calibrators can be developed.
Method 2: An adaptive calibration model can be used. The adaptive model can be developed if ground truth data are available, e.g., from a nearby reference instrument or another calibrated LCS that can communicate via the Internet. For example, adaptive calibrators can be developed using federated learning techniques [24].
Method 3: Robust calibrators such as W can be developed, where the calibrators do not drift easily under unexpected circumstances. For example, in our approach, we coupled W and V. Hence, if a drift is detected in some new cases, W still functions within an acceptable limit compared to V before the calibrators are retrained. The most robust calibrators are physics-based models, where the underlying physical relationship between the LCS and the reference instrument can be derived.

C. Industrial Applications
Our proposed method can potentially be extended to various industrial applications that use the calibrated PM 2.5 concentration (as shown in Fig. 1, part ❺). The following are examples of a few potential industrial applications:
1) Personalized Health Device: Accurate measurement of PM 2.5 concentration enables deriving personalized health information from LCS devices [25]. This provides information on the individual deposited dosage [26], which can be integrated via wearable devices [27].
2) Smoking Detector: When smoking indoors, the smoke lingers in the air because the smoke particles are so small that 85% of them are invisible and odorless. 3 Our experiment shows that high PM 2.5 concentrations remain in the room for hours, which can cause prolonged breathing issues for humans. Recent developments in automatic image and video analytics have enabled smoke detection with high accuracy [28]. However, adopting this method is expensive since cameras need to be installed in all rooms. Using our proposed methodology for smoking detection is economically beneficial.
3) Fire Detector: Current indoor fire detectors are based on ionization and photoelectric technologies [29]. However, these technologies might not always be effective in detecting the very small increase in PM concentration triggered by fires in their early stages. Thus, as a complement, applying our proposed method to calibrated PM 2.5 LCSs can contribute to early fire detection.
4) Poisonous Gases Detector and Monitoring: LCSs can also be used for detecting poisonous gases indoors, such as CO [30]. CO is a colorless, tasteless, and odorless gas produced by the incomplete combustion of carbon-containing materials. Similar to PM 2.5 LCSs, LCSs capable of measuring CO require calibration. Extending and embedding our proposed method in low-cost gas sensors, such as CO sensors, enables accurate detection of poisonous gas concentrations.
5) Engineering Assets Monitoring: Accurate LCS deployment can help monitor engineering assets. For example, more affordable and accurate sensors can be deployed on a massive scale to monitor atmospheric corrosion. Different gases such as CO 2 and SO 2 , as well as dust, can accelerate corrosion in various types of metals [31]. Accurate monitoring of such pollutants enables engineers to perform preventive maintenance.
6) Electronic Nose (e-nose): An e-nose is an electronic sensing device intended to detect odors. E-nose devices are widely used in research and development, quality control, process and production, health, and security. Although e-nose devices are currently used in many application areas, they are still considered unreliable solutions [7] because, like air quality LCSs, they suffer from low accuracy. Our proposed method can be adopted to improve e-nose sensing performance.

VII. CONCLUSION
Air quality LCSs suffer from limited sensing accuracy when they are used for measuring extreme events. In this article, we proposed an intelligent sensor calibration process that enables effectively correcting LCS readings as well as identifying the calibrators' drift. To this end, we performed controlled experiments in an indoor environment to define scenarios for extreme events. These scenarios included 12 different indoor smoking activities. We used the data collected from these controlled experiments to obtain insight into smoking events, and we also utilized the data for developing calibrators and investigating their performance. We further used a Bayesian framework for developing white-box (W) and black-box (V) calibrators. Then, we deployed these calibrators in parallel (P) in order to correct LCS measurements and enable the detection of calibrator drift. We evaluated the calibrators in a controlled experiment under different types of smoking events within 12 scenarios. For instance, in scenario 2 (i.e., S 2 ), we trained the calibrator on one LCS and tested it on another LCS of the same type. Another example was scenario 12 (i.e., S 12 ), which was designed to mimic the calibrators' drift. We then evaluated the developed calibrators on all designed scenarios using different performance metrics. The performance results showed that our proposed method accurately estimates the aerosol mass concentration in the different scenarios, except for S 12 , because the calibrators in S 12 were established using only normal data (without extreme events). Nevertheless, we demonstrated that our proposed drift monitoring was able to detect the calibrators' drift for S 12 . Finally, we discussed how our proposed method can be extended to various industrial applications, such as smoking, fire, and poisonous gas detectors, engineering assets monitoring, and health informatics.

A. Performance Metrics
The performance metrics used to validate the sensors and calibrators are presented in Table III. The notations $y$ and $\bar{y}$ denote the measurements of the reference instrument and their mean value, respectively, whereas the notations $\hat{y}$ and $\bar{\hat{y}}$ denote the LCS measurements and their mean value, respectively, either before calibration (for sensor validation) or after calibration (for calibrator validation).
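As an illustration, the three metrics referred to in the results (R, MAE, and MAPE) can be computed as in the sketch below. The definitions used here are the standard textbook ones and are assumed, since Table III is not reproduced in this section; MAPE is returned as a fraction rather than a percentage, which is consistent with the MAPE values "higher than 1" reported for the drift scenario. The function names are hypothetical.

```python
import numpy as np

def pearson_r(y_ref, y_lcs):
    """Pearson correlation coefficient between reference and LCS data."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_lcs = np.asarray(y_lcs, dtype=float)
    d_ref = y_ref - y_ref.mean()
    d_lcs = y_lcs - y_lcs.mean()
    return float(np.sum(d_ref * d_lcs) / np.sqrt(np.sum(d_ref**2) * np.sum(d_lcs**2)))

def mae(y_ref, y_lcs):
    """Mean absolute error, in the units of the measurements."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_lcs = np.asarray(y_lcs, dtype=float)
    return float(np.mean(np.abs(y_ref - y_lcs)))

def mape(y_ref, y_lcs):
    """Mean absolute percentage error as a fraction (assumes y_ref != 0)."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_lcs = np.asarray(y_lcs, dtype=float)
    return float(np.mean(np.abs((y_ref - y_lcs) / y_ref)))

# Made-up reference and LCS readings in ug/m3.
y_reference = [10.0, 20.0, 40.0]
y_sensor = [12.0, 18.0, 44.0]
metrics = (pearson_r(y_reference, y_sensor), mae(y_reference, y_sensor), mape(y_reference, y_sensor))
```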

B. BLM: White-Box Calibrator (W)
A Bayesian linear calibrator, $y_1$, can be modeled as

$$y_1 = W(X, \beta) + \varepsilon_1 \tag{4}$$

where $\varepsilon_1$ is a random error term that follows a Gaussian distribution with zero mean and noise variance $\sigma^2$, i.e., $\varepsilon_1 \sim \mathcal{N}(0, \sigma^2 I)$. The symbol $W$ denotes the white-box BLM function, $W(X, \beta) = \Phi(X)\beta$, where $\Phi(X)$ is an $N \times D$ design matrix for the inputs. In this case, the design matrix is chosen to be $\Phi(X) = [X_1, X_2, \ldots, X_D]$. The calibrator $W$ expressed in (4) can be called a white box because the relationship between the inputs and the output is visible and transparent. To infer the coefficients $\beta$, Bayesian inference uses Bayes' rule: posterior ∝ likelihood × prior.

1) Prior Distributions: The prior of the calibrator coefficients $\beta$ is modeled as a Gaussian distribution, $p(\beta) = \mathcal{N}(\beta \mid \mu_0, \sigma_0^2)$. An informative prior can be determined by applying linear regression to the data. In this way, $\mu_0$ can be estimated, and $\sigma_0^2$ is chosen to be three times larger than its mean value.

2) Likelihood Function: The likelihood for this model is the conditional probability of the observed measurements $y_1$ given the input data $X$ and the model parameters $(\beta, \sigma^2)$. The likelihood also follows a Gaussian distribution and can be written as $p(y_1 \mid X, \beta) = \mathcal{N}(y_1 \mid \Phi(X)\beta, \sigma^2 I)$.

3) Posterior Distributions: The posterior can be computed from the likelihood function and the prior distribution using Bayes' theorem, to give

$$p(\beta \mid y_1, X) = \frac{p(y_1 \mid X, \beta)\, p(\beta)}{p(y_1 \mid X)}.$$

The probabilistic model above is linear, and therefore the posterior distribution can be computed analytically, resulting in another Gaussian distribution,

$$p(\beta \mid y_1, X) = \mathcal{N}(\beta \mid \beta^*, V^*)$$

where $\beta^*$ and $V^*$ are the mean and variance, respectively. For the Gaussian prior and likelihood above, they follow from the standard conjugate update

$$V^* = \left(\Sigma_0^{-1} + \sigma^{-2}\,\Phi(X)^T \Phi(X)\right)^{-1}, \qquad \beta^* = V^* \left(\Sigma_0^{-1}\mu_0 + \sigma^{-2}\,\Phi(X)^T y_1\right)$$

where $\Sigma_0 = \mathrm{diag}(\sigma_0^2)$ is the prior covariance.

4) Predictive Distribution: The predictive distribution is also Gaussian and is symbolized by $p(y_1^* \mid X^*, X, y_1)$. It can also be computed analytically from the posterior distribution, to give

$$p(y_1^* \mid X^*, X, y_1) = \mathcal{N}\!\left(y_1^* \mid \Phi(X^*)\beta^*,\; \Phi(X^*)\, V^*\, \Phi(X^*)^T + \sigma^2 I\right). \tag{10}$$
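The analytic posterior and predictive computations for the white-box calibrator can be sketched in a few lines of NumPy. This is a minimal sketch assuming a known noise variance and the standard conjugate Gaussian update for a linear model; the function names `blm_posterior` and `blm_predict` and the argument `Sigma0` (the prior covariance matrix) are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def blm_posterior(Phi, y, mu0, Sigma0, sigma2):
    """Posterior mean beta* and covariance V* of the linear coefficients.

    Phi    : (N, D) design matrix
    y      : (N,) reference measurements
    mu0    : (D,) prior mean
    Sigma0 : (D, D) prior covariance (assumed form of the prior variance)
    sigma2 : scalar noise variance (assumed known)
    """
    Sigma0_inv = np.linalg.inv(Sigma0)
    V_star = np.linalg.inv(Sigma0_inv + Phi.T @ Phi / sigma2)
    beta_star = V_star @ (Sigma0_inv @ mu0 + Phi.T @ y / sigma2)
    return beta_star, V_star

def blm_predict(Phi_new, beta_star, V_star, sigma2):
    """Predictive mean and covariance for new inputs Phi_new (M, D)."""
    mean = Phi_new @ beta_star
    cov = Phi_new @ V_star @ Phi_new.T + sigma2 * np.eye(Phi_new.shape[0])
    return mean, cov

# Synthetic example: two input variables with made-up true coefficients.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 2))
beta_true = np.array([0.8, 0.1])
y = Phi @ beta_true
beta_star, V_star = blm_posterior(Phi, y, np.zeros(2), 1e6 * np.eye(2), 0.01)
pred_mean, pred_cov = blm_predict(Phi[:3], beta_star, V_star, 0.01)
```

With a very broad prior, as in this example, the posterior mean approaches the ordinary least-squares estimate, while the diagonal of the predictive covariance provides the per-sample uncertainty.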

C. BNN: Black-Box Calibrator (V)
Neural networks are usually considered black-box models (V), since they provide little explanatory insight into the relative influence of the independent variables in the prediction process. The black-box calibrator can then be modeled as

$$y_2 = V(X, \omega) + \varepsilon_2 \tag{11}$$

where $\varepsilon_2$ is a random error term following a Gaussian distribution with zero mean and precision $\gamma$. The symbol $V$ denotes the black-box BNN function and $X$ denotes the input measurement data. A neural network, $V(X, \omega)$, can be viewed as a probabilistic model that follows a Gaussian distribution, given by

$$p(y_2 \mid X, \omega, \gamma) = \mathcal{N}\!\left(y_2 \mid V(X, \omega), \gamma^{-1}\right) \tag{12}$$

where $X$, $\omega$, and $\gamma$ are the inputs, the neural network weights, and the precision of the Gaussian distribution, respectively. Equation (12) is also known as the likelihood function.

In a Bayesian framework, a prior distribution needs to be assigned. In this case, the prior follows a Gaussian distribution with zero mean and precision $\alpha$, given by

$$p(\omega \mid \alpha) = \mathcal{N}\!\left(\omega \mid 0, \alpha^{-1} I\right). \tag{13}$$

Using the prior distribution and the likelihood function, the posterior distribution for the BNN can be computed based on Bayes' theorem:

$$p(\omega \mid y_2, X, \alpha, \gamma) = \frac{p(y_2 \mid X, \omega, \gamma)\, p(\omega \mid \alpha)}{p(y_2 \mid X, \alpha, \gamma)}. \tag{14}$$

The inclusion of the prior distribution leads to regularization, which counters overfitting. Furthermore, the BNN provides a degree of belief in the estimated output, which can be used to assess the quality of the predictions. In our case, the confidence interval is used for drift monitoring, which is described in Appendix D.

Due to the nonlinear dependence of $V(X, \omega)$ on $\omega$, the posterior distribution is intractable. Therefore, the posterior distribution as well as the predictive distribution, $p(y_2^* \mid X^*, X, y_2)$, can be approximated using the Laplace approximation or variational inference, as described in [20].

D. Calibrators Residual: Drifting Monitoring
The calibrators' drift is monitored through the residual of the two predictive distributions, that is, between W and V. Let W and V be independent random variables that are normally distributed,

$$W \sim \mathcal{N}\!\left(y_1^* \mid \mu_{y_1}^*, \Sigma_{y_1}^*\right), \qquad V \sim \mathcal{N}\!\left(y_2^* \mid \mu_{y_2}^*, \Sigma_{y_2}^*\right). \tag{16}$$

Then their residual, $R = V - W$, where $R$ is the residual function for monitoring the calibrators, is also normally distributed. This results in another Gaussian predictive distribution, given by

$$R \sim \mathcal{N}\!\left(r \mid \mu_{y_2}^* - \mu_{y_1}^*,\; \Sigma_{y_2}^* + \Sigma_{y_1}^*\right).$$
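The residual distribution above can be evaluated per sample as in the following sketch, assuming that each calibrator returns a predictive mean and a predictive variance for every test point (i.e., only the diagonals of the predictive covariances are used). The function names and the 1.96 multiplier, which corresponds to an approximate 95% interval, are illustrative assumptions rather than values taken from the original method.

```python
import numpy as np

def residual_distribution(mu_w, var_w, mu_v, var_v):
    """Mean and variance of R = V - W for independent Gaussian predictions."""
    mu_r = np.asarray(mu_v, dtype=float) - np.asarray(mu_w, dtype=float)
    var_r = np.asarray(var_v, dtype=float) + np.asarray(var_w, dtype=float)
    return mu_r, var_r

def residual_excludes_zero(mu_r, var_r, z=1.96):
    """Flag samples whose residual interval (mean +/- z*std) excludes zero."""
    return np.abs(mu_r) > z * np.sqrt(var_r)

# Made-up predictive means (ug/m3) and variances for two test samples.
mu_r, var_r = residual_distribution(
    mu_w=[10.0, 20.0], var_w=[4.0, 4.0],
    mu_v=[12.0, 60.0], var_v=[5.0, 12.0],
)
flags = residual_excludes_zero(mu_r, var_r)
```

In this example, the first sample shows agreement between the two calibrators within their combined uncertainty, while the second shows a disagreement that the predictive uncertainties cannot explain.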