Stochastic Online Calibration of Low-Cost Gas Sensor Networks With Mobile References

In recent years, there has been wide interest in high-resolution air quality monitoring with low-cost gas sensor systems. Such gas sensors, however, suffer from cross-sensitivities, interferences with environmental factors, unit-to-unit variability, aging, and concept drift. Therefore, the reliability and trustworthiness of measurements in the low parts-per-billion (ppb) range remain a concern, particularly over the course of the lifetime of a sensor network in urban environments. In this simulation study, the possibility of continuously recalibrating a wireless sensor network with mobile references and stochastic gradients computed from encounters is explored. Using data collected in field experiments, encounters between static and mobile nodes are modeled as a probabilistic process. Moreover, the influence of a collection of design parameters such as base calibration, initial recalibration, choice of optimization algorithm, and encounter frequency is analyzed and discussed. With an optimized protocol, long-term reliable measurements with absolute errors of about 50 ppb for CO, 3 ppb for NO2, and 4 ppb for O3 are shown to be achievable with a few mobile references in urban environments.


I. INTRODUCTION
Due to the health impact of poor air quality [1], [2], much research on low-cost gas and particulate matter sensors for high-resolution air quality monitoring has been conducted in recent years [3]–[7]. Unit-to-unit variability [8], interferences with other gases and environmental parameters [9], [10], and aging [10], [11] are common problems of such sensors.
For air quality monitoring, however, data quality objectives imposed by legislators must be met [12]. Thus, researchers came up with the idea of combining an array of different sensors into so-called low-cost sensor systems, with the purpose of compensating interfering effects using models obtained from machine learning algorithms [13], [14] (e.g., neural networks [4] or random forests [9]) and field data. Unfortunately, this generally leads to non-representative models followed by concept drift [15], [16]; the environmental conditions vary over time and space, so the calibration parameters need to change frequently. Hence, maintaining reliability and trustworthiness over the course of the lifetime of an air quality sensor network remains a challenge.

The associate editor coordinating the review of this manuscript and approving it for publication was Chan Hwang See.
Traditionally, measurement instruments are recalibrated periodically against references provided by authorities to maintain trustworthiness and to assign measurement uncertainties [17]. For wireless sensor networks, however, such a workflow does not scale, and dedicated network calibration methods have been developed in the last decade. A recently published survey by Maag et al. [18] summarizes proposed sensor network calibration algorithms and their suitability for air quality monitoring applications.
For instance, blind calibration approaches [19]–[22] lack the (legally required) information on the measurement uncertainty and appear not to work for low-cost gas sensors [18]. Multi-hop calibration approaches with static references and mobile sensors [23]–[25] generally lead to the propagation of errors [26], and additional error sources should be avoided at all costs. Moreover, low-cost gas sensors have response times in the range of 30–90 s [27]–[29], which is considerably long for mobile nodes. Alternatively, mobile reference instruments, i.e., reliable mobile devices mounted on vehicles, could continuously monitor and recalibrate static low-cost sensor nodes [30].
Regarding data transmission, Saukh et al. [24] propose that static and mobile nodes submit measurements, indexed by time and location, to a database on the cloud at independent frequencies; they define a ''rendezvous'' as a time interval in which two nodes have time and space distances below certain thresholds. Note that both distances can be minimized if static nodes are placed close to stops where vehicles spend about half a minute.
This definition is particularly useful if recalibration is performed periodically in a batch, so that a database can be queried for all measurements of interest. Nonetheless, it introduces an additional error due to the imprecision of the indexes. In addition, the larger the allowed spatial or temporal distance for a rendezvous, the higher the additional measurement uncertainty. If the recalibration frequency is low, more recent measurements should receive a higher weight [31]. From a metrological point of view, however, a low recalibration frequency leads to less representative parameters.
While performing recalibration in batches (consisting of many encounters) is well established, treating network calibration as an online learning (i.e., streaming data) problem [32] has not been considered so far. In a recently published conference paper [31], the idea of sequentially recalibrating nodes of a sensor network with mobile references and stochastic gradient descent (SGD) [33], [34] was briefly sketched. On the one hand, such a lightweight protocol would account for sensor aging and the rapid changes in atmospheric conditions (i.e., concept drift) yet be robust to single anomalies (e.g., sudden artifacts in the signals).
On the other hand, a traceable measurement uncertainty could be associated with each device so that measurements can be interpreted properly [17], [35]. Furthermore, owing to the success of deep learning in the last decade [36], more advanced update rules such as RMSProp (possibly with additional momentum of the gradient) [33] have been developed, which should be considered as well.
This work builds upon that concept and evaluates such a protocol for gas sensors using field data and simulated encounters under different base calibrations, initial recalibrations, algorithms, and encounter frequencies. The paper is structured as follows. First, the problem and its design parameters are presented formally, and the generalized SGD for online calibration is illustrated. Furthermore, the experimental field data and the performed simulations are described rigorously. Next, the obtained results and possible limitations are presented and discussed. Finally, the paper closes with a conclusion and an outlook on future work.

A. PROBLEM DEFINITION
The calibration process aims to find the set of optimal calibration parameters W ∈ R^((p+1)×q) that map the sensor signal s to the reference data r, i.e., r = sW. In this study, the calibration model was fixed to a linear regression with model input s = (1, s_1, . . . , s_p) ∈ R^(1×(p+1)) and model output r = (r_1, . . . , r_q) ∈ R^(1×q). Note that the model input contains a ''1'' for the intercept. (In principle, any other model that can be trained via SGD is also possible.) The problem of stochastic online calibration with mobile references over a node lifetime T was modeled according to the scheme depicted in Figure 1. During a deployment phase, a low-cost sensor system with a base calibration W_b (i.e., initial calibration parameters) is synchronized with a mobile reference system for a short time period h ≪ T (via some wireless technology standard, treated as a black box in the following) so that it adapts to its new environment.
At time point t, the tuple (s_t, r_t) of sensor and reference signals is collected. With every collected tuple, an online recalibration procedure based on SGD is performed, thereby updating the set of calibration parameters W.
Afterwards, the operational phase begins. Several reference instruments mounted on arbitrary vehicles, e.g., trams or buses, have encounters with static nodes, leading to comparisons of sensor and reference values. At every encounter, the same tuple of sensor and reference signals is collected for a recalibration iteration.
In general, finding a map for continuous output variables can be achieved via least squares optimization. Since the output variables can span different orders of magnitude (i.e., some pollutants are more abundant than others), it might be reasonable to weight their loss contributions by factors e_1, . . . , e_q, stored in a matrix E = diag(e_1, . . . , e_q) ∈ R^(q×q). Hence, with a collection of n measurements, i.e., S ∈ R^(n×(p+1)) and R ∈ R^(n×q), the loss L (in matrix notation, tr refers to the trace) to be minimized is given in (1).
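Equation (1) is not reproduced in this copy; the following is a reconstruction consistent with the dimensions of S, R, and E defined above:

```latex
L(W) \;=\; \frac{1}{n}\,\operatorname{tr}\!\left[(SW - R)\,E\,(SW - R)^{\mathsf{T}}\right]
      \;=\; \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{q} e_j \left[(SW - R)_{ij}\right]^{2}
```

Each residual column j, i.e., each output pollutant, contributes to the loss in proportion to its weight e_j.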
In the case of very large data sets or streaming data, the optimal solution is best found via SGD [32], [33]. The gradient g, i.e., the derivative of the loss with respect to the model parameters, is given in (2).
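Assuming the weighted least-squares loss L = (1/n) tr[(SW − R) E (SW − R)^T], the gradient in (2) would take the form:

```latex
g \;=\; \frac{\partial L}{\partial W} \;=\; \frac{2}{n}\, S^{\mathsf{T}} (SW - R)\, E \;\in\; \mathbb{R}^{(p+1)\times q}
```

For a single encounter (streaming case), n = 1 with S = s_t and R = r_t.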
13902 VOLUME 10, 2022

Algorithm 1 describes a generalized version of SGD for the stochastic online calibration of network nodes, i.e., after one encounter. (⊙ is the Hadamard operator.) In the deep learning literature [33], this procedure is known as RMSProp with momentum (RMSProp-m). The variable v is responsible for memorizing the momentum of the gradient; it is controlled by α ∈ [0, 1] (the momentum parameter). A high value of α accelerates the gradient and speeds up convergence when several consecutive incorrect predictions are made [33]. The variable a stores a discounted moving average of the squared gradient and is controlled by β ∈ [0, 1] (the decay rate). With a high value of β, previous (in)correct predictions are remembered for adapting the individual learning rates. Finally, γ ∈ R^+ is the step size (the global learning rate).
With a ← 1, α = 0, and β = 1, vanilla SGD is obtained (i.e., without momentum and adaptive learning rate). Setting only α = 0 results in RMSProp (i.e., without momentum). Note that the square root operation is applied element-wise (i.e., Hadamard square root). ϵ is a stability constant and should be fixed to a value that is much smaller than the gradients. Therefore, the design parameters of the protocol are W_b, h, E, α, β, γ, and the encounter frequency.
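The update rule described above can be sketched in a few lines of NumPy; this is an illustrative implementation, not the authors' code (the function name and numeric defaults are assumptions, and eps plays the role of the stability constant ϵ):

```python
import numpy as np

def rmsprop_m_update(W, v, a, s, r, E, alpha=0.9, beta=0.99,
                     gamma=1e-3, eps=1e-8):
    """One stochastic online calibration step after an encounter.

    W: (p+1, q) calibration parameters; s: (1, p+1) sensor input with a
    leading 1 for the intercept; r: (1, q) reference values; E: (q, q)
    diagonal loss weights. With a = 1, alpha = 0, beta = 1 the step
    reduces to vanilla SGD; alpha = 0 alone gives RMSProp.
    """
    g = 2.0 * s.T @ (s @ W - r) @ E                 # stochastic gradient of the loss
    a = beta * a + (1.0 - beta) * g * g             # discounted average of squared gradient
    v = alpha * v + gamma * g / (np.sqrt(a) + eps)  # momentum of the scaled gradient
    return W - v, v, a                              # updated parameters and state
```

A node would hold (W, v, a) as its state and call this function once per encounter with the averaged sensor signal and the communicated reference value.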

Algorithm 1 Stochastic Online Calibration
1: procedure Calibrate(W_b, E, α, β, γ, ϵ, T)
2:   W ← W_b, v ← 0, a ← 1
3:   for t ← 1, T do
4:     if (s_t, r_t) collected then #At encounter.
5:       g ← 2 s_t^T (s_t W − r_t) E
6:       a ← β a + (1 − β) (g ⊙ g)
7:       v ← α v + γ g / (√a + ϵ)   (element-wise division)
8:       W ← W − v
9:     end if
10:   end for
11: end procedure

B. DATA SET
As data set, the field study conducted by Zimmerman et al. [9] was used. The data set was collected at an urban background site from August 2016 to February 2017 and consists of quarter-hourly measurements from 19 low-cost sensor systems. Reference values for carbon monoxide (CO), nitrogen dioxide (NO2), and ozone (O3) in the parts-per-billion (ppb) range (Table 1) are available from the second month (i.e., October). Considering that the lifetime of such sensors is roughly six to twelve months [11], the duration is sufficient to make reasonable statements about the efficacy of the protocol. There are even reports of relevant drift after one month of operation [37].
Each low-cost sensor system contained the widely used electrochemical sensors CO-B41 [29], NO 2 -B43F [28], and O x -B431 (NO 2 and O 3 combined) [27] from Alphasense as well as sensors for temperature (T) and relative humidity (RH). According to the study authors, the sensor outputs were measured with a custom-designed electronic circuit board and optimized for signal stability. More precisely, said board comprised custom electronics to operate the device, multiple stages of filtering, and an analog-to-digital converter. In addition, the data were logged at a rate of 4 per minute but downsampled to 4 per hour by averaging.
For low-cost gas sensor systems, these are important requirements to minimize the noise. In practice, a network node would sample raw sensor signals s̃ at a predefined frequency and compute an average in an online fashion, thereby avoiding storing all values. With K required samples, the average µ_s̃ over the raw signals s̃_k (k = 1, . . . , K) is given in (3).
After all samples have been collected, this computation terminates; the most recent average is memorized, serving as sensor input for the next encounter with a mobile reference (i.e., s_t ← µ_s̃, µ_s̃ ← 0).
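The running average in (3) can be maintained without storing individual samples; a minimal sketch (the class name is illustrative):

```python
class OnlineMean:
    """Incremental average of raw sensor signals between encounters,
    avoiding storage of the individual samples."""

    def __init__(self):
        self.mu = 0.0   # current running average
        self.k = 0      # number of samples seen so far

    def update(self, x):
        """Fold one raw sample into the running average."""
        self.k += 1
        self.mu += (x - self.mu) / self.k
        return self.mu

    def reset(self):
        """Hand the average to the calibration step and restart."""
        mu, self.mu, self.k = self.mu, 0.0, 0
        return mu
```

At an encounter, reset() yields s_t and clears the accumulator for the next averaging window.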
Out of the 19 devices, three low-cost sensor systems (#4: device 1, #16: device 2, and #17: device 3) have the fewest missing values. In order to keep bias as low as possible, only these three systems were considered in the analysis. Specifically, the optimal protocol was developed solely with device 1 and validated on the data from the remaining two. The measurements from devices 2 and 3 stop one month earlier (i.e., in January). Missing values between start and stop were imputed with forward filling, since this was seen as an opportunity to simulate potential erroneous events during operation. An overview of the data set is shown in Figure 7 (Appendix A).
In their original paper [9], the authors provided two population calibrations for each sensor type (Table 2), i.e., parameters obtained from a collection of sensors that fit well on average but not necessarily for every unit: a simple (s) laboratory calibration (CO: 0-1600 ppb, NO2: 0-50 ppb, in 3-4 points) and an extended (e) field calibration including parameters for the interferences with T and RH (calibration distribution described in Table 1).
Note that neither accounts for cross-sensitivities with other pollutants. Unfortunately, no simple calibration was made available for the Ox-B431 sensor, so the values from the NO2-B43F sensor had to be used instead. This can be justified by the fact that the data sheets [27], [28] suggest similar calibration parameters. In the following, both population calibrations were examined as potential base calibrations before deployment.

C. SIMULATIONS
The initial recalibration period during deployment (i.e., h) was fixed to 24 hours. Furthermore, the errors for NO2 and O3 were weighted by a factor of ten, since these pollutants are more relevant [2]. Where not specified otherwise, values for the hyperparameters of the algorithms were taken from the literature (i.e., α = 0.9, β = 0.99) [33] or determined in preliminary experiments (i.e., γ = 10^-7 for SGD, γ = 10^-3 for RMSProp and RMSProp-m).
Encounters between mobile references and static nodes were simulated via Bernoulli trials. At each time point t, there is a probability π for an encounter [31]. For t ≤ h, π is equal to 1. Since the data set consists of roughly 100 data points per day, a value of π = 1 corresponds to an average encounter rate of 100 per day. Different encounter rates were modeled in this manner. Since this is a stochastic process, 100 such simulations were performed to yield average results, thereby removing the influence of single encounters.
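This encounter process can be sketched as follows; an illustrative implementation under the assumptions above (roughly 100 quarter-hourly points per day, π = 1 during the 24-hour deployment period), with function name and defaults chosen here for illustration:

```python
import numpy as np

def simulate_encounters(T, encounters_per_day, points_per_day=100,
                        h=100, seed=0):
    """Boolean encounter mask over T time points.

    During the deployment period t < h the encounter probability is 1;
    afterwards, each time point is an independent Bernoulli trial with
    pi = encounters_per_day / points_per_day.
    """
    rng = np.random.default_rng(seed)
    pi = encounters_per_day / points_per_day
    mask = rng.random(T) < pi      # Bernoulli trials during operation
    mask[:h] = True                # continuous recalibration during deployment
    return mask
```

Averaging the resulting absolute errors over many seeds corresponds to the 100 repeated simulations described above.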
As metric for the performance of a low-cost sensor system, the absolute error for every pollutant i ∈ {CO, NO2, O3} was computed. It is defined as the average absolute deviation of the predicted concentration from the actual reference value over the period T, ignoring missing values (4).

III. RESULTS AND DISCUSSION
Figure 2 illustrates the performance of device 1 with the three algorithms under different encounter frequencies using a simple base calibration, whereas Figure 3 shows the same result with an extended base calibration. The dotted line is the base case error; it is the performance obtained if no further adjustment is made (i.e., no recalibration during deployment or operation). The offset in the absolute error at low encounter frequencies results from the recalibration during deployment. This observation suggests that a recalibration for a few hours during deployment is actually beneficial, as the absolute error generally decreases. This improvement has two reasons. On the one hand, the base calibration is a population calibration that does not necessarily fit a specific unit. On the other hand, no corrections for interferences with T and RH are included in the simple base calibration model. Specifically, the inclusion of T and RH corrections significantly lowers the base case error (dotted line) for NO2 and O3 in comparison with the simple calibration. In general, there is a large consensus that such compensations are necessary [3]. Figure 8 shows how such corrections are introduced during deployment. However, the sensitivities and intercepts of the extended calibration are also different (Table 2), so they might better represent these sensors.
The subplots in both figures illustrate that RMSProp generally leads to the largest decrease in absolute error for NO2 and O3, whereas RMSProp-m leads to the largest decrease for CO. Further performance gains can only be achieved with several encounters per day. The reason is that either the deployment duration is too short, so that the calibration parameters are not yet optimal, or the parameters need adjustment due to aging and concept drift. For example, Figure 9 illustrates how the model parameters continuously evolve over time.
With RMSProp-m, the performance even decreases for NO2 and O3 at low to moderate encounter frequencies. In this case, it appears that momentum only benefits adjusting the calibration parameters of CO. Furthermore, since erroneous signals are possible (e.g., due to imputed missing values), they could also increase the absolute error at moderate encounter rates. Nonetheless, the methodology seems to be quite robust to such events, since the performance still improves at the highest encounter frequencies.
The error could probably be even lower without these events; some encounters surely contain more information than others. Hence, it might be a good idea to filter out erroneous instances before performing gradient descent. Completely faulty nodes, on the other hand, could be identified by monitoring the absolute error over time.
For CO, however, the highest concentrations are not predicted correctly, as shown in Figure 10. The explanation for the higher error of the CO measurements might be that high concentrations were not properly covered by the base calibrations; because the upper limit of the simple calibration is higher than that of the extended calibration, it performs slightly better for CO (Figures 2 and 3). More precisely, support is lacking for high concentrations, and without these upper levels, the hyperplane might be improperly oriented, resulting in predictions of low quality.
The question arises whether better base calibrations could be obtained from experiments with orthogonal variables [15]. In field experiments, calibration ranges cannot be chosen, and all factors of interest are usually correlated. With calibration models obtained from field data, it is even possible to ''measure'' any pollutant with any sensor if correlations are very strong [15]. Thus, it is challenging to find representative calibration parameters using field data.
Since SGD and RMSProp-m bring no additional benefit over RMSProp for the two relevant pollutants NO2 and O3, these two algorithms were not considered in the further analysis. Moreover, the hyperparameters of the algorithm had not been optimized thus far. Figure 4 illustrates the error landscape (in ppb) for the three pollutants under different combinations of the hyperparameters.
It shows that the situation can be improved with pollutant-specific hyperparameters. By decreasing β and increasing γ, the performance can be improved for NO2 and O3. Predicting CO with low error, in contrast, requires high values of β and γ. In this case, the optimal hyperparameters were fixed to β = 0.8 and γ = 0.005 for NO2/O3 as well as β = 0.999 and γ = 0.01 for CO.
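A hyperparameter search like the one behind Figure 4 can be mimicked with a simple grid evaluation. The sketch below uses RMSProp without momentum on whatever data are supplied; it is an illustration only (the loss weights E are omitted for brevity, and all names and values are assumptions rather than the study's actual procedure):

```python
import itertools
import numpy as np

def mae(W, S, R):
    """Mean absolute error of the linear calibration model."""
    return np.abs(S @ W - R).mean()

def grid_search(S, R, betas, gammas, eps=1e-8):
    """Evaluate RMSProp (no momentum) over a (beta, gamma) grid and
    return (error, beta, gamma) for the lowest mean absolute error."""
    best = None
    for beta, gamma in itertools.product(betas, gammas):
        W = np.zeros((S.shape[1], R.shape[1]))   # start from a zero calibration
        a = np.ones_like(W)                      # squared-gradient accumulator
        for s, r in zip(S, R):
            s, r = s[None, :], r[None, :]
            g = 2.0 * s.T @ (s @ W - r)          # per-sample gradient
            a = beta * a + (1.0 - beta) * g * g
            W = W - gamma * g / (np.sqrt(a) + eps)
        err = mae(W, S, R)
        if best is None or err < best[0]:
            best = (err, beta, gamma)
    return best
```

In the study, each grid cell would instead be scored by the simulated absolute error of a device over its lifetime.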
Because the hyperplane might be improperly oriented initially, large consecutive gradient updates are required to reorient it, as the highest concentrations occur only rarely. Consequently, there is an interaction between the inadequate base calibration and the choice of the optimization algorithm. Alternatively, a ''better'' base calibration covering the full range might allow the same hyperparameters for all pollutants. Figure 5 illustrates the agreement between sensor and reference measurements for device 1 using RMSProp with the set of optimal hyperparameters and ten encounters per day, starting with an extended calibration. The agreement is exceptionally good, considering that the measurements come from low-cost gas sensors.
The final absolute errors from device 1, corresponding to the standard uncertainties in metrology [35], are about 50 ppb for CO, 3 ppb for NO2, and 4 ppb for O3. Relative to the medians, these uncertainties are 26% for CO, 33% for NO2, and 19% for O3. Due to uncertainty propagation, the uncertainty of the reference instrument adds to that of the low-cost sensor system [35]; a sample calculation is given in Appendix B. To put this into legal context [12], the maximum allowed expanded measurement uncertainties are 25% for CO and NO2 as well as 30% for O3. The higher the pollution levels, the easier it is to meet these requirements.
Finally, since an optimal protocol has been found, it can be validated on the remaining data from devices 2 and 3. Figure 6 evaluates the absolute error and its reduction when switching from a simple base calibration without any further recalibration to an extended base calibration with an initial recalibration during deployment as well as recalibration during operation using RMSProp (with optimal hyperparameters) at ten encounters per day. It shows that this reduction is of similar magnitude for the other two devices, hence supporting the theoretical concept.
In the original study [9], the authors developed several different calibration models from the collected field data.
They claimed that they could reduce the absolute errors to 8 ppb for CO and even below 1 ppb for NO2 and O3 with random forest models. With these calibration models, however, the absolute errors increased to values of about 49 ppb for CO, 5 ppb for NO2, and 3 ppb for O3 during an independent test at another location; thus, their absolute errors were much higher at test time. Moreover, with purely linear models, the errors were generally higher.
On the one hand, this observation suggests that their machine learning models captured the local atmosphere of the location at which the models were developed [15], since the performance decreased at the new location. The underlying problem is that the relationships between the pollutants and/or environmental factors are different at other locations or time points. Therefore, less reliable measurements can be expected upon relocation of field-calibrated low-cost sensor systems [15]. On the other hand, the low error that they achieved with random forest models advocates basis expansion (i.e., the introduction of power and interaction terms). In particular, random forests (and neural networks) are great at capturing non-linear behavior without explicit basis expansion. Consequently, describing potential non-linearity should further decrease the absolute error. An example of a non-linearity would be the interaction between sensitivity and temperature that is even reported in the data sheets [27]–[29]. (This also motivated calibrating low-cost gas sensors with machine learning algorithms in the first place [3].)

With these results in mind, the question arises to what extent several encounters per day are realistic. In the case of the city of Basel in Switzerland, for example, trams circulate for about 20 hours per day, and there are about eight relevant lines. A tram associated with a line takes about 45 to 60 minutes from one terminal to the other. Therefore, an encounter frequency of up to 20 per day can be expected if one tram per line were equipped with a mobile reference. Thus, operating in this regime with eight such references would not be too far-fetched, though the question arises as to what kind of reference instruments could possibly be used, as such devices should be affordable yet reliable [30].
In contrast to using only mobile reference instruments mounted on trams or buses, additional static reference nodes would offer much higher temporal resolution. In this manner, down-sampling can be performed when a high data acquisition rate is available, which essentially lowers the noise in the measurements. In addition, multiple data points per hour can be made available.
The presented outcomes demonstrate that stochastic online calibration with reliable mobile references bears the potential for long-term accurate measurements from low-cost gas sensors, since unit-to-unit variability, aging, and concept drift can be continuously compensated. Although the algorithms were only applied to low-cost gas sensor systems within one data set, it can be expected that the concept generalizes to arbitrary sensor systems and networks, since SGD has been successfully applied in several distinct online learning scenarios before [32].
Despite these promising results, it is also important to point out some shortcomings. In particular, several additional error sources can be expected in real-world scenarios. For instance, the response times of the measurement instruments have been assumed to be zero. Due to the response time of the reference instruments, a spatial carryover equal to s_l = v_m t_r (with vehicle velocity v_m and response time t_r) would result in a real-world scenario. For example, at a vehicle velocity of 30 km/h (about 8.3 m/s) and a response time of 45 s, the carryover would be roughly 375 m. Hence, a short response time can be seen as one requirement for candidate reference devices. A complete list of requirements would help to identify an existing product or guide the development of a new one.
Although every reader intuitively understands the term ''encounter'', it is not absolutely defined. During an encounter, the distance between nodes first decreases and then increases again. In practice, an encounter between nodes requires machine-to-machine communication, and the range depends on the chosen wireless technology standard (e.g., Bluetooth 5 has ranges up to 200 m [38]). Hence, there is also a time window for data exchange. The larger the allowed distance, the less representative a communicated reference measurement is.
In the presented schema (Figure 1), every encounter triggers a computation, thereby consuming a small amount of energy, which might not always be available. Alternatively, before updating the parameters, the loss in (1) could be computed to assess whether a gradient update is even required, e.g., by setting a minimum error threshold. (Intuitively, no gradient step is necessary if predictions and reference values coincide.) Yet another option would be for the stationary node to transmit the sensor data together with the current calibration parameters so that the mobile reference can perform all computations. Nonetheless, transmitting data also requires energy. For an upcoming real-world solution, other aspects such as price and availability of references, network overhead, energy consumption of nodes, or costs of computations on the cloud need to be considered as well.
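The threshold idea can be sketched as a gate in front of the gradient step; a minimal sketch (the function name, threshold handling, and update callback are illustrative assumptions, not from the paper):

```python
import numpy as np

def maybe_update(W, s, r, E, threshold, update_fn):
    """Skip the gradient computation when the prediction already agrees
    with the reference within a tolerance, saving energy on the node.

    W: (p+1, q) parameters; s: (1, p+1) input; r: (1, q) reference;
    E: (q, q) diagonal loss weights; update_fn: performs one gradient step.
    Returns the (possibly updated) parameters and whether a step was taken.
    """
    residual = s @ W - r
    loss = (residual @ E @ residual.T).item()   # single-encounter weighted loss
    if loss < threshold:
        return W, False                         # no update required
    return update_fn(W), True
```

In practice, update_fn would be one RMSProp iteration of Algorithm 1; here it is left abstract.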

IV. CONCLUSION AND OUTLOOK
This work expanded the concept of sequentially recalibrating nodes of a low-cost gas sensor network for air quality monitoring with mobile references and stochastic gradients. With such mobile references and the proposed algorithm, network nodes could be monitored and continuously recalibrated, thereby helping to maintain trust in the measurements.
A proper base calibration is crucial for the success of the approach. Thus, characterizing a small population of low-cost gas sensor systems to obtain an adequate base calibration might be an effort worth taking. As of now, this is mostly achieved in lengthy field campaigns, but establishing it as an efficient and inexpensive service in a laboratory setting with orthogonal variables is planned [15].
It could be shown that an initial recalibration during deployment is beneficial for a low-cost gas sensor system, as every device needs unit-specific calibration parameters. Moreover, a calibration model should also include compensations for interfering variables and cover the range of interest.
If the calibration parameters change over time due to aging and concept drift, they are adjusted accordingly. Moreover, it could be demonstrated that the choice of gradient update rule matters, since RMSProp performed better than vanilla SGD. With optimal hyperparameters and an encounter frequency of up to 20 per day, the absolute error can be reduced to about 50 ppb for CO, 3 ppb for NO 2 , and 4 ppb for O 3 by performing gradient descent updates after encounters.
Finally, future work should focus on filtering out erroneous instances to further increase the performance as well as defining requirements for adequate mobile reference systems. If no commercially available devices meet the requirements, novel ones could be developed. Once potential references have been determined, the proposed protocol needs to be validated in field studies.

APPENDIX B UNCERTAINTY PROPAGATION
With K standard uncertainty sources u j , j ∈ {1, . . . , K }, the expanded uncertainty U of a measurement instrument is defined in (5) [35].
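Equation (5) is not reproduced in this copy; it presumably combines the sources in quadrature and scales by a coverage factor k, as is standard practice [35] (k = 2 corresponds to approximately 95 % coverage):

```latex
U \;=\; k \sqrt{\sum_{j=1}^{K} u_j^{2}}
```

For example, a sensor-node standard uncertainty of 3 ppb combined with a reference standard uncertainty of 1 ppb would give U = 2 · √(3² + 1²) ≈ 6.3 ppb.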

AUTHOR CONTRIBUTIONS
Georgi Tancev performed the analysis and wrote the manuscript. Federico Grasso Toro reviewed the manuscript.