Dealing With Inaccurate Sensor Data in the Context of Mobile Crowdsensing and mHealth

The technological capabilities and ubiquity of smart mobile devices favor the combined utilization of Ecological Momentary Assessments (EMA) and Mobile Crowdsensing (MCS). In the healthcare domain, this combination particularly enables the collection of ecologically valid and longitudinal data. Furthermore, the context in which these data are collected can be captured through the use of smartphone sensors as well as externally connected sensors. The TrackYourTinnitus (TYT) mobile platform uses these concepts to collect the user's individual subjective perception of tinnitus as well as an objective environmental sound level. However, the sound level data in the TYT database are subject to several possible sensor errors and therefore do not allow a meaningful interpretation in terms of correlation with tinnitus symptoms. To this end, a data-centric approach based on Principal Component Analysis (PCA) is proposed in this paper to cleanse MCS mHealth data sets from erroneous sensor data. To further improve the approach, additional information (i.e., responses to the EMA questionnaire) is considered in the PCA and a prior check for constant values is performed. To demonstrate the practical feasibility of the approach, in addition to TYT data, where it is generally unknown which sensor measurements are actually erroneous, a simulation with generated data was designed and performed to evaluate the performance of the approach with different parameters based on different quality metrics. The results obtained show that the approach is able to detect an average of 29.02% of the errors, with an average false-positive rate of 14.11%, yielding an overall error reduction of 22.74%.

I. INTRODUCTION
The ubiquity of smart mobile devices, combined with the technical capabilities of these devices, predestines them for two concepts: Ecological Momentary Assessments (EMA) and Mobile Crowdsensing (MCS). In health care, a combination of these concepts can be used to collect ecologically valid and qualitative longitudinal data while simultaneously capturing the context in which these data are collected by using the sensors of smartphones and connected devices (e.g., wearables) [1]. For example, environmental data (e.g., noise [2], [3]) could be measured while a user is completing an EMA questionnaire. In this way, questionnaire data can be correlated with environmental data to provide new insights about users (e.g., patients). However, to provide meaningful information, these sensor measurements must meet several aspects of data quality, such as accuracy, objectivity, believability, and reputation [4]. For non-standardized smartphone sensors such as the microphone (i.e., different manufacturers, different mobile operating systems, different scales), achieving these properties can be particularly challenging.
The mobile platform TrackYourTinnitus (TYT) uses EMA and MCS to track a user's individual tinnitus. Tinnitus is the perception of a sound in the ears in the absence of a corresponding external acoustic stimulus. Tinnitus symptoms are subjective and vary over time. Therefore, TYT was created to monitor and assess the variability of these symptoms in the daily lives of tinnitus patients or interested users [5]. The platform has been running since 2014 and consists of a website for general information and user registration (https://www.trackyourtinnitus.org/), a mobile app available for both iOS and Android, and a central backend for data storage. Users of the mobile app are asked to complete EMA questionnaires at different times of the day to assess their individual tinnitus perception (e.g., tinnitus loudness and distress) [6]. In addition, the mobile app records environmental sound levels while completing the daily questionnaire [6]. The process of TYT mobile app [1] as well as the underlying data set, i.e., the structure and findings about the collected data [7], have been described in detail in previous work.
The overall goal of this paper is to investigate the relationships between environmental sound levels and reported tinnitus symptoms. More specifically, we want to investigate whether environmental sound levels have an effect on tinnitus. If the objective sound levels can be related to the subjective data collected in the EMA questionnaires, new insights could emerge, allowing, for example, predictions of tinnitus loudness based on the environmental sound data. In this context, it should also be noted that for tinnitus and many other diseases and disorders, longitudinal studies capable of collecting ecologically valid data over longer periods of time (more than half a decade for TYT) are still very rare. However, the sound data available in the TYT database [7] are not suitable for direct comparison with each other and thus for meaningful interpretation in terms of correlation with tinnitus symptoms [8]. This is because the environmental sound levels were retrieved from many different device models from different manufacturers and thus potentially many different microphones and other audio components. In addition, the sound level is stored as a relative amplitude value (Android) or uncalibrated decibel value (iOS) as retrieved from the mobile system APIs, as opposed to a comparable Sound Pressure Level (SPL) or weighted decibel value (e.g., dB(A)). In addition, sound level values are stored as-is, and no error detection or plausibility checks are performed during or after measurements, which can result in many erroneous values (e.g., zero values) being stored in the database. Furthermore, the mobile app does not indicate on the user interface that the sound measurement is running. Therefore, the user could interfere with the measurement (e.g., if the microphone is covered or the smartphone collides with an object) and generate erroneous values again. These erroneous sensor readings are a common and recognized problem in sensor network applications [9].
To address these problems, we take several approaches. On the one hand, we focus on smartphone calibration, app adaptations to retrieve more meaningful values, and in-app error detection mechanisms [8]. On the other hand, we are working on approaches to cleanse the existing data in the TYT database from a data perspective. This work proposes such a data-centric approach that aims to detect sensor errors in mHealth mobile crowdsensing data sets and subsequently exclude erroneous sensor values from all further analyses.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Based on the characteristics of the given environment, the following requirements for the approach were defined:
1) The approach should be able to detect and identify sensor faults in mHealth data sets.
2) The approach should be able to handle data sets for which it is not known a priori which values are affected by sensor faults.
3) The approach should be computationally efficient, so that it is possible to apply it to very large data sets.
The main challenge for the algorithm in the proposed approach is that it is generally not known which sensor measurements are erroneous and which are not. For this reason, a simulation using generated data with similar characteristics to the TYT data set was designed and run to assess the performance of the algorithm with different parameters based on different quality metrics. Furthermore, since little information is available about the sound measurement itself that would allow detection of sensor errors, the algorithm takes into account additional information (i.e., responses to the EMA questionnaire). The core of the proposed approach is a Principal Component Analysis (PCA) complemented by a constant value check. In addition, since conventional PCAs are not able to handle missing values, an upstream procedure is used to impute data.
The remainder of this paper is organized as follows. Section II considers related work. Section III presents in detail the materials and methods used for the proposed approach. Section IV presents the results of the simulation and the application of the approach to the TYT data set. A discussion of the results in terms of the practical relevance and limitations of the approach is given in Section V, while Section VI concludes the paper with a summary and an outlook on future work.

II. RELATED WORK
Much of the previous work on anomaly detection for sensor signals has been done in the field of Wireless Sensor Networks (WSNs). Teh et al. [10] identified two main types of solutions in this context: statistical methods (e.g., Principal Component Analysis (PCA) [11], [12]) and classification-based methods (e.g., Artificial Neural Networks (ANNs) [13]). For WSNs, Statistical Process Control (SPC) [14] approaches are mostly used for the surveillance of chemical processes [10], for which scenarios with highly correlated data from distributed sensors are prevalent. PCA analyses, in turn, are based on multivariate detection mechanisms, such as Squared Prediction Errors (SPEs). Body Sensor Networks (BSNs) [15], as derivatives of WSNs, make use of sensors applied to patients in the form of implantable sensors or wearable devices. Note that for both WSNs and BSNs, all of the above-mentioned analysis techniques are commonly applied as they generate multivariate data. In contrast to this, univariate data can also be analyzed, for example, by the use of Contribution Plots (CPs) [14] instead of SPEs.
Classification-based techniques rely on a set of pre-labeled training data that allows a prediction model to be built for further detection. Safaei et al. [15] stated that ANNs are one of the most popular methods to efficiently identify outliers. During a training phase, a classifier uses pre-labeled data to learn to separate outliers from normal data. This approach of combining neurons in a dynamic network in order to describe logical problems [13] allows the development and use of very accurate and fast detection algorithms. However, [16] and [17] point out the difficulty of defining the most suitable ANN parameters that enable correct error detection. These parameters include, for example, the type of neural network, the number of hidden layers, and their number of neurons. In addition, training data sets have to be available and precise to create a useful detection mechanism. The authors developed an approach that is suitable for automated fault detection, but due to the challenges mentioned above, it is only able to detect 50% of the defined error types. The authors also pointed out a wide gap in the explainability of the given input and the retrieved results. In the mHealth context, the large variety of devices with unknown technical peculiarities creates an additional challenge for this classification. Therefore, despite the flexibility of this approach, we did not consider it suitable for our purposes.
Error detection approaches that need no prior knowledge and can be applied to data of arbitrary form can be found in the field of statistics. PCA is recognized as one of the most popular [10] tools for sensor error detection in the WSN context. The authors of [18] propose the use of robust PCA in combination with SPE to detect multivariate anomalies within a set of physiological measurements. For this purpose, a central computational device gathers measurements of distributed sensors and normalizes the data during pre-processing. In a first step, the dimensionality of the data is reduced. This is followed by the calculation of the SPE, which is compared with an adaptive threshold [19] to detect multivariate anomalies. Afterwards, the authors apply CPs to each physiological value in order to determine the contribution of each value to the detected anomaly. Finally, the PCA model is updated if the parameters change.
To validate their approach, the authors of [18] compared different (online and offline) PCA implementations in a case study with regard to their performance. They found different robust offline PCA implementations to perform much better than the classic implementation. They investigated the degree of variability that is retained after dimension reduction for a number of l principal components: l = 3 describes 98.2%, compared to l = 4 for 94.9% or l = 5 for 99.8%. This is reflected in smaller proportions of unexplained variability and gives better dimension reductions.
Miller et al. [14] developed a quality metric that addresses the question of the origin of an observed fault in SPC. In this context, a variable that can be computed with values from the PCA model is given by the Hotelling's $T^2$ score, which we will use as a control variable in the approach of the work at hand. In contrast to classification-based approaches, PCA-based approaches in combination with the mechanisms mentioned above seem to fit better into our application context. This is not only due to certain similarities of WSNs and mHealth applications, but also due to the very sound methodology and the independence from prior knowledge about the correctness of the data.
Other efficient outlier detection approaches were created, for example, by Saneja and Rani [20]. The authors propose three different phases that are scalable for Big Data applications to identify anomalies in highly correlated parameters. Initially, a correlation coefficient matrix is constructed. Then, correlating sensors are used for dynamic predictions using linear support vector machines, sliding windows and dynamic thresholds. The last stage is constituted by a parallel computing phase. Aderibigbe and Chi [21] propose a mechanism to identify temporal outliers, using median absolute deviation. After exceeding a static threshold, the majority voting allows the detection of false-positives. Due to dynamically changing conditions, dynamic thresholds in particular have a big validity for sound measurements. A Big Data approach may be out of scope for this initial evaluation of methods, but might nevertheless find application in future developments of our approach.
[20] and other statistical-based approaches in the eHealth or mHealth context rely on physiological databases such as PhysioNet to evaluate their performance. Databases such as PhysioNet (formerly Physiobank) or MHEALTH consist of measurements that were gathered within various studies and contain arbitrary anomalies. To better control performance, synthetic data generation is required. Although several different simulators exist [22], most of them are strongly tied to WSNs and tend to focus on validating their protocols. Approaches like [23] therefore manually select 10,000 data points that they consider error-free. Afterwards, the authors inject errors at random, specified by their probability. In our approach, we also apply random errors of defined form to the data set. However, due to the large effort of manual data selection, the approach in the work at hand instead relies entirely on synthetic data.
In summary, none of the aforementioned approaches is able to handle a univariate data set with possibly missing data points, which is the case for TrackYourTinnitus. The contribution of the work at hand is therefore to provide such an outlier detection approach and, in addition, to provide a simulation tool to evaluate the performance of this approach with different parameters suitable for our TYT mHealth data purposes. This is in line with [24], as the authors have stated that quality of imputed data is essential and can therefore influence medical conclusions drawn from it. Our proposed approach is validated against a generated data set that contains realistic anomalies, allowing these anomalies to be tracked and thus fine-granular performance measurements to be made. The aim of this synthesis is to provide the possibility of dealing with verified error-free data points and errors of defined shape. We therefore hope to be able to define algorithm parameters in a more accurate way than other approaches.

III. MATERIALS AND METHODS
In the following, the materials and methods used in the present work are described. In this context, the data set to which the approach is to be applied is outlined in more detail. Furthermore, the chosen approach, including the different algorithm phases, is explained in detail.

A. Materials
The analyzed data set was extracted from the TYT database on the 23rd of December, 2021. It contains a total of 107,231 entries. The structure of the TYT database was described in [7]. The mobile apps capture the individual tinnitus perception by asking users to fill out tinnitus assessment EMA questionnaires at random times of the day. The process was described in detail in [1]. Each entry in the database contains the answers to the eight questions of the TYT EMA questionnaire [25], along with a timestamp and the user agent that contributed the answers. In addition, a sound level value is stored in each entry that is retrieved by measuring the environmental sound level during the first 15 seconds of the user completing the EMA questionnaire by utilizing the microphone of the mobile device [6]. This value is denoted as sensor value in the following, while the user agents that contributed the EMA answers and sensor measurements are denoted as sensors.

B. Methods
The variables used in the following are described in Table I. As mentioned above, the TYT data set contains not only sensor data, but also EMA answers. Since these answers are either of binary form or retrieved from visual analogue scales (VAS), they can easily be mapped to the range [0, 1] and are therefore completely uniform. Each sensor s has m measurements as well as m properties p. Each device is characterized by s ≥ 1 sensors and p ≥ 0 properties that are associated with it. In addition, it was assumed that the clocks of the devices are synchronized and thus the time intervals between different measurements are approximately the same. Although TYT data was collected at arbitrary points in time, the algorithm outcome should not be influenced by this fact.
PCA was chosen as core of the proposed approach because it is a well-established statistical method for detecting sensor errors in the context of WSNs (as outlined in Section II), which have similar conditions to (mHealth) mobile crowdsensing applications. Since constant value readings, as a severe form of sensor fault that is quite common in the TYT data set (see Section III-B3), could not be reliably eliminated by the PCA in initial tests, the PCA step was complemented by a constant check based on the approach proposed by [9].
1) Algorithm Overview: The overall process for the developed algorithm is illustrated in Fig. 1 (the step numbers given in parentheses refer to the labels in that figure). In a first step, sensor measurements and EMA properties are treated separately. The algorithm takes sensors (step 4) and properties (step 3) as input and returns a set of marked sensors as output (step 10). The actual implementation handles devices, as properties are treated alongside sensor measurements due to their uniform character.
Due to prerequisites of the main stage of the algorithm, the elimination stage, pre-processing (step 1 in Fig. 1) was defined as a necessary intermediate stage. It includes data selection (step 5), as many data points were considered unusable, e.g., due to too few sensor measurements. As the PCA-based algorithm is not able to handle missing values, a data imputation step (step 6) was added. This is done by applying a k-Nearest Neighbor (kNN) interpolation for missing data points.
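The imputation step can be illustrated with a minimal sketch. It assumes that scikit-learn's `KNNImputer` is an acceptable stand-in for the imputation routine; the text does not state which routine was actually used, and the toy matrix, `n_neighbors=2`, and the distance weighting are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Rows are samples; columns are one sensor value plus two EMA properties,
# all scaled to [0, 1]. np.nan marks a missing data point (toy values).
X = np.array([
    [0.20, 0.10, 0.30],
    [0.25, np.nan, 0.35],
    [0.80, 0.90, 0.70],
    [0.22, 0.12, 0.31],
])

# Two nearest neighbors, weighted by inverse distance. KNNImputer measures
# distance with a NaN-aware Euclidean metric: missing coordinates are skipped
# and the result is rescaled by the share of coordinates that were present.
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_imputed = imputer.fit_transform(X)

print(X_imputed[1, 1])  # weighted average of the neighbors' second column
```

In this toy example the two nearest rows have the values 0.10 and 0.12 in the missing column, so the imputed value lies between them; observed entries are left unchanged.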
Since constant values are a complete failure form of sensor faults [26], a constant check (step 7 in Fig. 1) forms the first step in the elimination stage (step 2). Afterwards, the proposed PCA-based approach (step 8) is performed, using Hotelling's $T^2$ score and Contribution Plots (CPs). The algorithm finishes with an output (step 10) of all erroneous and correct sensors. In the following, the different steps are described in detail.
2) Algorithm Stages: Python, as a common language for the domains of scientific computing and data science, was used to implement the proposed architecture from Fig. 1. As a general convention, input and output files as well as configurations use the JSON file format. The libraries SCIKIT-LEARN and NUMPY provided some helpful routines, for instance SCIKIT-LEARN's PCA or different numeric algorithms from the NUMPY library.
The following describes the development of Algorithm 1 as proposed in Section III-B1.
Data Selection (step 5 in Fig. 1): Prior to the actual analysis of data points, a set of appropriate devices has to be chosen. Real-world data sometimes contains too few measurements per device. This had to be considered during algorithm development and led to an exclusion of devices with m < 30 measurements. This filtering, taking place prior to Algorithm 1, will not be specified in detail here.

Algorithm 1: Pseudo Code of the Elimination Process of Fig. 1. A Description of the Used Variables can be Found in Table I.
 1: function IDENTIFYFAULTS(devices, m, k, l, ε, δ)
 2:   IMPUTE(devices, k)
 3:   for device in devices do
 4:     if SD(SMOOTHEN(device.sensor.data)) < ε then
 5:       device.sensor.filtered ← true
 6:     end if
 7:   end for
 8:   t = 1
 9:   while t ≤ m do
10:     X ← BUILDDATAMATRIX(devices)
11:     ...

Missing Data Imputation (step 6): Handling erroneous sensor measurements by completely removing them and applying interpolation methods did not bring the desired success. Either it was not clear whether the remaining data points could still be analyzed in a significant way, or the interpolation polynomials were of too high order, unnecessarily increasing complexity. As the main error detection stage (step 8) is not able to handle missing data, a k-Nearest Neighbor (kNN) [27] interpolation was chosen as the imputation method. The implementation, as seen in Line 2 of Algorithm 1, replaces missing values with a weighted average, calculated from the k nearest neighbors with respect to their distance to the missing data point. The distance is calculated using the Euclidean norm of two points $A = (a_1, a_2, \ldots, a_n)$ and $B = (b_1, b_2, \ldots, b_n)$, which is defined as:

$$d(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \qquad (1)$$

In order to handle missing values in the neighbor data set, a method was chosen as proposed by Dixon [28]: Throughout the calculation of the Euclidean distance, empty values are summed up as zero values. Afterwards, the result is weighted, using the number of missing values. This step concludes the pre-processing stage (step 1), resulting in data that is suitable for further analyses.
Constant Check (step 7): This first step in the elimination stage was designed to overcome constant value errors by utilizing a simple constant check [9]. Realized in Lines 3-7 of Algorithm 1, the standard deviation for a set of sensor measurements is calculated and compared to a threshold ε. For standard deviations below this threshold, values are assumed to be constant and the corresponding sensors are marked with a flag. Before standard deviations can be calculated, a linear Kalman filter [29] is applied. This noise reduction technique allows the estimation of proper sensor values in real time and serves to reduce the number of falsely detected constant values.
PCA Stage (step 8): The underlying formal description was provided by Teh et al. [10]: Let $X \in \mathbb{R}^{m \times n}$ be a matrix, consisting of m rows and n columns. Columns n include all measurements of a sensor at a given time stamp. Rows m contain the measurements of all sensors at a time frame. A loadings matrix P contains the l first eigenvectors $[\lambda_1, \ldots, \lambda_l]$ of the covariance matrix $X^T X$. Prior to this, one needs to choose the number of principal components to keep: $l \leq \min\{m, n\}$. For l < n, the dimension of the data set can be reduced. Principal components can be obtained in a score matrix T with the linear transformation $T = XP \in \mathbb{R}^{m \times l}$, where P is defined as:

$$P = [\lambda_1, \lambda_2, \ldots, \lambda_l] \qquad (2)$$

consisting of the above-mentioned eigenvectors. This formal description is implemented in Lines 10-13 of Algorithm 1. Samples, containing measured sensor values and the corresponding properties, are used in chronological order to build the data matrix X (Line 10). Line 11 takes matrix X as well as the algorithm parameter l, which describes the number of PCA components. Next, it is described how various variables can be derived from a calculated PCA model, which are later combined in the $T^2$ score. Different from the original BESI algorithm, only a single PCA model has to be calculated per iteration, compared to n − 1 for n sensors.
As the described complexity reduction is not lossless, Hotelling's $T^2$ score (Line 13) acts as a control variable [14]. Originally, the BESI algorithm includes Q-statistics for this purpose. However, as [30] describes, this approach is far too sensitive for our purpose of an mHealth application. In contrast to delicate chemical processes, which were the original field of application for this algorithm, the $T^2$ score constitutes a sufficient method for error detection here. Let $x \in \mathbb{R}^{1 \times n}$ be a vector representing a sample. Hotelling's $T^2$ score is defined as:

$$T^2 = x P D^{-1} P^T x^T \qquad (3)$$

with $D = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_l)$ as the diagonal matrix of the sorted eigenvalues of the covariance matrix. The underlying error detection mechanism relies on the idea of iteratively eliminating faulty sensors (step 11 in Fig. 1) if $T^2$ exceeds δ (step 9), until all faulty sensors and their associated properties are eliminated. As described in Lines 14-16, Hotelling's $T^2$ score is used as a measure for abnormal behavior. For $T^2 < \delta$, a sensor is considered to be working correctly, no further processing is necessary, and the next sensor value is analyzed.
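The two checks of the elimination stage can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the implementation behind Algorithm 1: a moving average replaces the linear Kalman filter, the samples are mean-centered before the PCA, and the names `is_constant`, `hotelling_t2`, and `window` as well as all toy values are hypothetical.

```python
import numpy as np

def is_constant(values, eps, window=3):
    """Flag a sensor whose smoothed readings have a standard deviation below eps.

    Assumption: a moving average stands in for the linear Kalman filter.
    """
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(values, kernel, mode="valid")
    return bool(np.std(smoothed) < eps)

def hotelling_t2(X, l):
    """Return one T^2 score per row of X, using the first l principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # center each column (assumption)
    cov = np.cov(Xc, rowvar=False)             # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:l]      # indices of the l largest
    P = eigvecs[:, order]                      # loadings matrix, n x l
    D_inv = np.diag(1.0 / eigvals[order])      # inverse of diag(lambda_1..lambda_l)
    T = Xc @ P                                 # score matrix, m x l
    # Row-wise x P D^-1 P^T x^T, evaluated through the scores t = x P.
    return np.einsum("ij,jk,ik->i", T, D_inv, T)

# Toy data: three correlated signals with one injected anomaly at sample 50.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[1.0, 0.8, 0.6]]) + 0.05 * rng.normal(size=(200, 3))
X[50] = [3.0, -3.0, 3.0]

t2 = hotelling_t2(X, l=2)
suspect = int(np.argmax(t2))                   # sample with the largest T^2 score
stuck = is_constant(np.zeros(40), eps=1e-3)    # a stuck-at-zero sensor
```

Samples whose score exceeds a chosen threshold δ would then be passed on to the contribution analysis; choosing δ is outside the scope of this sketch.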
If the limit is exceeded, Lines 17 and 18 describe the calculation of the sensor contribution to the overall fault. Detecting the highest contribution is done by solving the linear equation P c = T T t and finding the maximum argument. Afterwards, the sensor is marked as faulty in Lines 19 and 20.

TABLE II: CONCRETE VALUES THAT WERE CHOSEN FOR THE VARIABLES FROM TABLE I, FOR THE SIMULATION

Algorithm 1 takes an array of potentially erroneous devices, containing their sensors and measurements as well as properties p. For statistical reasons, non-analyzable devices are removed from the data set prior to the application of the algorithm. Sensors that are detected as faulty throughout the detection process are only marked as erroneous, not eliminated. Therefore, the output (step 10 in Fig. 1) allows a distinction between 1) erroneous sensors and 2) those considered correct. Additionally, 3) sensors that were excluded from the analysis can be obtained. Depending on the use case, these three groups might be combined for analyses.
3) Simulation: As mentioned in Section II, a common source of data to validate the error detection approach is stated by physiological databases in combination with manual selection. To overcome limitations of not being able to generate physiological data synthetically as well as injecting errors of defined form, the following paragraph describes our proposed simulation approach.
A first requirement of the developed simulation is to generate a data set which can be used as input for Algorithm 1. This includes the generation of a defined number of devices, their sensors, measurements and properties. Devices should be identical concerning their number of generated values. As second vital requirement, errors should be generated and injected according to a configuration. For validation purposes, meta data of erroneous sensors should be available throughout the whole simulation and validation process. In addition, the simulation should be portable, running on most major operating systems, and in order to create reproducible results, it should be deterministic. As one main assumption of Algorithm 1 is that there exists a correlation between sensor measurements and their properties, these values have to be generated under this assumption.
The input parameters taken by the algorithm were used to fine-tune the accuracy of the developed approach. The variables are shown in Table II and were chosen to be optimal with respect to the selected quality metrics. To this end, 1) the parameters k, l, ε, and δ were varied based on a binary search process, 2) 50 simulation runs were performed with each set of parameters, 3) the results were recorded, and 4) the quality metrics (see below) were calculated. Finally, the parameters with the optimal quality metrics and the corresponding results were used for the further analyses.
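How one parameter set could be scored over repeated runs is sketched below. The sketch computes the confusion-matrix metrics that are defined later in this section (PPV, FDR, FPR, F-measure) and averages them over seeded runs. The callback `run_once` and both function names are hypothetical, and the binary search over the parameters is not shown.

```python
import numpy as np

def quality_metrics(injected, flagged):
    """Compute PPV, FDR, FPR and F-measure from two boolean masks.

    injected[i] is True if sensor i carries an injected error (ground truth);
    flagged[i] is True if the detection marked sensor i as erroneous.
    """
    injected = np.asarray(injected, dtype=bool)
    flagged = np.asarray(flagged, dtype=bool)
    tp = int(np.sum(injected & flagged))
    fp = int(np.sum(~injected & flagged))
    tn = int(np.sum(~injected & ~flagged))
    fn = int(np.sum(injected & ~flagged))
    # Undefined ratios (empty denominators) are reported as 0.0 in this sketch.
    ppv = tp / (tp + fp) if tp + fp else 0.0
    fdr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f = 2 * ppv * fdr / (ppv + fdr) if ppv + fdr else 0.0
    return {"PPV": ppv, "FDR": fdr, "FPR": fpr, "F": f}

def average_over_runs(run_once, n_runs=50):
    """Average each metric over n_runs seeded simulation runs.

    run_once(seed) is a hypothetical callback returning (injected, flagged).
    """
    results = [quality_metrics(*run_once(seed)) for seed in range(n_runs)]
    return {key: float(np.mean([r[key] for r in results])) for key in results[0]}

# Toy example: 3 faulty and 5 correct sensors, one hit and one false alarm.
injected = [True, True, True, False, False, False, False, False]
flagged = [True, False, False, True, False, False, False, False]
metrics = quality_metrics(injected, flagged)
```

For the toy masks, the counts are TP = 1, FP = 1, TN = 4, and FN = 2.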
The implementation of the developed simulation was created in Python, utilizing SCIKIT-LEARN and NUMPY. The whole simulator is configurable and can be used to generate a defined number of devices, their sensors, and the associated data sets. For convenience, the generated data is saved in JSON file format and has a form that suits the algorithm input format. Via configurations, an occurrence percentage can be defined for each of the following error types:
1) Missing data: The absence of data points may have several reasons. Due to unstable network connections, packet loss might occur, for example [31]. In addition, an empty device battery, for example, stops the recording and transmission of sensor data.
2) Outliers: The authors of [32] define outliers as "an observation or subset of observations that appear to be inconsistent with the rest of the data set". In the context of the present work, the device microphone may produce outliers, for example, if the smartphone collides with a surface or object during measurement [8].
3) Bias: Described as a constant shift over time in comparison to the normal sensor readings. According to [33], this error might be caused by overheating of system components, which manifests itself, for example, in the microphone output although there is no perceivable sound source.
4) Drift: This error is similar to a bias, with the difference that the shift increases over time and does not depend on the sensor's true value [33]. For example, microvibrations during smartphone usage throughout the day may cause gyroscope drifts [34].
5) Noise: As another additive fault, noise is described as "small variations in the data set" [10]. According to [35], a common model for this fault is white noise, a phenomenon that, e.g., microphones suffer from due to environmental influences.
6) Gain: Unlike the previous types of sensor faults, gain is non-additive. The original sensor reading is multiplied by a constant value and therefore might deviate substantially from the original value [26]. In the context of the present work, gain might occur, for example, if the hardware sensor (i.e., the device's microphone) is corrupted or if device software system components, such as the system APIs or the processing algorithms in the TYT app, are not operating correctly.
7) Constant values: As one of the most severe error types, sensors that only provide constant or stuck-at-zero values are unusable for almost all purposes [26]. For example, missing permissions to access the device's microphone may lead to a constant value of zero being recorded by the TYT app.
The generator script accepts the configuration described above as well as a random number generator seed. The latter parameter allows the simulation to work deterministically by making random choices using the seed. In order to first generate error-free random, but correlated numbers, a random matrix $C = (c_{ij})$ is created for each device. If i is a property and j is a sensor measurement, $c_{ij}$ is uniformly chosen from [...]. Otherwise, it is uniformly chosen from [−1, 1].
Matrix C was designed to resemble a covariance matrix with dimensions equal to the sum of the number of sensor measurements and properties. Therefore, it is symmetric, but not necessarily positive semi-definite. For this reason, it is converted to the nearest covariance matrix by converting it to a correlation matrix, clipping the eigenvalues, and then converting it back to a covariance matrix. NUMPY's multivariate_normal routine takes the resulting matrix to generate random but correlated numbers. As the outcome suffers from oscillations, a real periodic smooth random function [36] is applied, with C providing the function coefficients. Afterwards, the result is mapped to the interval [0, 1].
The different types of errors (as described above) are then applied to the set of data points that were correct up to that point in time. Within the provided simulation configuration, probabilities of occurrence are specified for each type and generation is controlled accordingly. The probabilities used for the configuration in the scope of the present work are shown in Table III. Meta information about the injected errors is stored in place with the data points. This meta information is not visible to the detection algorithm, but serves analysis and validation purposes.
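The generation and injection steps can be sketched with NumPy as shown below. Several simplifications are assumptions of this sketch rather than properties of the simulator: the diagonal of C is set to 1, the periodic smooth random function and the final interval mapping are omitted, and the function names, the `magnitude` parameter, and the stuck-at-zero constant fault are hypothetical choices.

```python
import numpy as np

def nearest_covariance(C, floor=1e-6):
    """Turn a symmetric matrix with positive diagonal into a valid covariance
    matrix: covariance -> correlation, clip eigenvalues, -> covariance."""
    std = np.sqrt(np.diag(C))
    corr = C / np.outer(std, std)
    eigvals, eigvecs = np.linalg.eigh(corr)
    clipped = np.clip(eigvals, floor, None)        # remove negative eigenvalues
    corr_psd = eigvecs @ np.diag(clipped) @ eigvecs.T
    d = np.sqrt(np.diag(corr_psd))
    corr_psd = corr_psd / np.outer(d, d)           # restore the unit diagonal
    return corr_psd * np.outer(std, std)

def generate_device(n_signals, m, seed):
    """Generate m error-free, correlated samples for n_signals signals."""
    rng = np.random.default_rng(seed)              # seed makes the run repeatable
    upper = np.triu(rng.uniform(-1.0, 1.0, size=(n_signals, n_signals)), k=1)
    C = upper + upper.T + np.eye(n_signals)        # symmetric, unit diagonal (assumption)
    cov = nearest_covariance(C)
    data = rng.multivariate_normal(np.zeros(n_signals), cov, size=m)
    return data, cov

def inject_error(values, kind, rng, magnitude=0.2):
    """Return a copy of values with one additive or multiplicative fault applied."""
    values = np.asarray(values, dtype=float).copy()
    steps = np.arange(values.size)
    if kind == "bias":
        values += magnitude                        # constant shift
    elif kind == "drift":
        values += magnitude * steps / max(values.size - 1, 1)  # growing shift
    elif kind == "noise":
        values += rng.normal(0.0, magnitude, size=values.size)
    elif kind == "gain":
        values *= 1.0 + magnitude                  # multiplicative fault
    elif kind == "constant":
        values[:] = 0.0                            # stuck-at-zero sensor
    else:
        raise ValueError(f"unknown error type: {kind}")
    return values

data, cov = generate_device(n_signals=4, m=100, seed=42)
biased = inject_error(data[:, 0], "bias", np.random.default_rng(1))
```

Calling `generate_device` twice with the same seed returns identical data, which mirrors the determinism requirement stated for the simulation.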
As an example, outliers are created according to the definition in [37]: Let $p$ be a sensor value and let $p_1, \ldots, p_q, \ldots$ be the remaining data points, sorted in ascending order by their distance to $p$. The point $p$ is called an $O(d, q)$ outlier if $|p_q - p| \geq d$ holds. Accordingly, we try to mimic an $O(d, 3)$ outlier by calculating the mean $\mu$ of a sensor reading and its two direct neighbors and by using the specified standard deviation $\sigma$.
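The $O(d, q)$ definition can be checked directly in a few lines. The following sketch only tests whether a given value is such an outlier. It does not reproduce how the generator injects outliers, and the function name and sample readings are made up for illustration.

```python
def is_outlier(p, others, d, q):
    """Return True if p is an O(d, q) outlier with respect to `others`.

    The distances from p to all other points are sorted in ascending order;
    p is an outlier when the q-th smallest distance is at least d.
    """
    distances = sorted(abs(x - p) for x in others)
    if len(distances) < q:
        raise ValueError("need at least q other data points")
    return distances[q - 1] >= d


readings = [0.50, 0.52, 0.49, 0.51, 0.95, 0.50]   # made-up sensor values
flags = [
    is_outlier(p, readings[:i] + readings[i + 1:], d=0.2, q=3)
    for i, p in enumerate(readings)
]
```

With d = 0.2 and q = 3, only the reading 0.95 is flagged, because its third-nearest neighbor is more than 0.2 away.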
In addition to the actual simulation, a set of quality metrics was defined that allows the performance of the approach to be validated. These metrics are calculated by comparing the generated sensors with injected errors to the cleansed sensors after applying the proposed approach. The following metrics, based on the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) as defined by [38], are used:
1) Positive Predictive Value (also referred to as precision or confidence) [38]
$$PPV = \frac{TP}{TP + FP} \tag{4}$$
2) Fault Detection Rate (also referred to as recall or true-positive rate) [38]
$$FDR = \frac{TP}{TP + FN} \tag{5}$$
3) False-positive Rate [38]
$$FPR = \frac{FP}{FP + TN} \tag{6}$$
4) F-measure [39]
$$F\text{-}measure = \frac{2 \cdot PPV \cdot FDR}{PPV + FDR} \tag{7}$$
5) Average Error Reduction (ER), computed from $Error_{before}$ and $Error_{after}$, the average error in the data set before and after the application of the proposed approach. The error for a single sensor is defined as the Euclidean distance between the initially correct sensor values and the faulty sensor values after the error injection.
Table IV: Concrete values chosen for the variables from Table I for the application to the TYT data set.
4) Data Application: A real-world scenario to validate the proposed approach was given by its application to the TYT data set. Because the structure of the data differs from the input accepted by the algorithm, the data set was first converted into a usable form. This includes removing unnecessary information and selecting user data with a number of m ≥ 30 sensor samples. As described in Section III-B2, a lower number of measurements and properties was considered non-analyzable. Sound level measurements were scaled into the target domain [0, 1] in order to adjust the weighting of the scalar questionnaire answer data. Missing answers were subsequently imputed using kNN imputation.
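Two of the preparation steps, the selection of users with m ≥ 30 samples and the scaling of sound levels into [0, 1], can be illustrated as follows. This is a sketch under assumptions: the text does not state the scaling method, so a per-user min-max scaling is assumed here, and the function names and data are hypothetical. The kNN imputation is omitted.

```python
MIN_SAMPLES = 30  # threshold m >= 30 from the text


def min_max_scale(values):
    """Scale values linearly into [0, 1]; a constant series maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def prepare(sound_levels_by_user):
    """Keep users with enough samples and scale their sound levels."""
    return {
        user: min_max_scale(levels)
        for user, levels in sound_levels_by_user.items()
        if len(levels) >= MIN_SAMPLES
    }


# Made-up data: user "a" has 40 samples, user "b" only 5.
raw = {"a": [float(i) for i in range(40)], "b": [1.0, 2.0, 3.0, 4.0, 5.0]}
prepared = prepare(raw)
```

Only user "a" remains after the selection, and the scaled values span the full interval from 0 to 1.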
The parameters chosen for the application of the approach to the TYT data set are shown in Table IV. Since the data did not seem to suffer from noise, an impression supported by a manual review of the data set, the optional Kalman filter was skipped. In addition, a threshold of 0.001 was considered sufficient for an adequate constant value detection, because constant value readings in the TYT data set are hardly affected by noise (i.e., they are mainly stuck-at-zero values). Parameters k, l, and δ were chosen based on the results of several iterations of the algorithm with different parameters obtained through a binary search process (similar to the parameter selection during the simulation process in Section III-B3).
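A constant value check with a tolerance of 0.001 might look like the following sketch. The exact criterion is an assumption made for illustration: a sensor is treated as constant when the range of its readings does not exceed the threshold. The function name and the sample readings are hypothetical.

```python
def is_constant(values, threshold=0.001):
    """Return True if all readings lie within `threshold` of each other."""
    return max(values) - min(values) <= threshold


stuck_at_zero = [0.0] * 30                 # e.g., missing microphone permission
normal = [0.31, 0.42, 0.38, 0.55, 0.47]    # made-up scaled sound levels
```

A small threshold is enough when the constant readings are mostly exact zeros. Noisier constant readings would require a larger tolerance.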

IV. RESULTS
The results from the simulation as well as the application of the approach to the TYT data set are described in the following.

A. Simulation and Performance
The characteristics of the simulation are shown in Table V. In total, 50 runs of the simulation were performed, resulting in 25,000 simulated sensors with different types of sensor faults according to the predefined error probabilities (see Table III). Examples of the values of these sensors with different injected error types are shown in Fig. 2.
In total, 6,465 (25.86%) of these sensors were considered faulty. After the application of the approach, a total of 4,491 (17.96%) sensors were eliminated, with 1,877 of these sensors actually being faulty (PPV = 41.79%). The average fault detection rate (FDR) over all runs was 29.02%. The fault detection rate of each individual run is shown in Fig. 3. In total, 2,614 correct sensors were eliminated by the approach, yielding an average false-positive rate (FPR) of 14.11%. Fig. 4 shows the false-positive rate for each individual run. Finally, the average error reduction (ER) across all runs was 22.74% (see Fig. 5).
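As a plausibility check, the metrics can be recomputed from the pooled counts reported above. The function name is chosen for illustration. Note that pooling all 50 runs is not the same as averaging the per-run metrics, so the pooled FDR and FPR (about 29.0% and 14.1%) differ slightly from the reported averages.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute PPV, FDR, FPR and F-measure from confusion-matrix counts."""
    ppv = tp / (tp + fp)
    fdr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    f_measure = 2 * ppv * fdr / (ppv + fdr)
    return {"PPV": ppv, "FDR": fdr, "FPR": fpr, "F-measure": f_measure}


total, faulty, eliminated, tp = 25_000, 6_465, 4_491, 1_877
fp = eliminated - tp            # correct sensors that were eliminated
fn = faulty - tp                # faulty sensors that were missed
tn = total - faulty - fp        # correct sensors that were kept

pooled = detection_metrics(tp, fp, tn, fn)
```

The pooled PPV of about 41.8% and the pooled F-measure of about 34.3% agree with the values reported for the simulation.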

B. Application to the TYT Data Set
The results of the application of the elimination approach to the TYT data set are shown in Table VI. The 107,231 samples in the data set were contributed by 3,255 unique sensors (i.e., unique users with unique devices). Of these sensors, 566 were eligible for further processing because they held at least 30 samples. The kNN imputation excluded a further 24 sensors because they contained too many missing data points, leaving 542 sensors for the elimination stage. The distribution of manufacturers of these sensors (i.e., devices) is shown in Fig. 6(a). During the elimination stage, the constant check eliminated 85 sensors. In the next step, another 12 sensors were eliminated by the PCA, resulting in a total of 97 (17.89%) eliminated sensors. The eliminated sensors and their contributed samples were then removed from the original TYT data set, so that 14,819 (13.82%) samples were removed. The distribution of sensor manufacturers after the elimination is shown in Fig. 6(b), while the distribution for the eliminated sensors is shown in Fig. 6(c).

V. DISCUSSION
In the following, the results of the simulation and the application to the TYT data set are discussed. Furthermore, limitations of the proposed approach are considered.

A. Interpretation of the Results
First, the results of the simulation are discussed in terms of the performance metrics. As shown in Section IV-A, the average positive predictive value (PPV) of the proposed approach is 41.79%, while the average fault detection rate (FDR) is 29.02%. At the same time, the average false-positive rate (FPR) is 14.11%. In other words, about every third faulty sensor is detected and eliminated by the proposed approach, while about every seventh correct sensor, i.e., a sensor not affected by any faults other than missing data, is eliminated as well. The quality of the proposed approach can now be assessed by comparing these metrics. The first quality metric we can examine is the tradeoff between PPV and FDR, which is captured by the F-measure as the harmonic mean of both values [38], [39]. Reducing the number of false positives increases the PPV. In practice, this increase often comes with a decrease of the FDR, i.e., an increase in the number of false negatives [38]. The goal is to maximize both values simultaneously, i.e., to maximize the F-measure. For the results of the simulation, both the PPV and the FDR are reasonably high, leading to an F-measure of 34.26%. In addition, we can assess the ability of the approach to discriminate between faults and non-faults by comparing FDR and FPR [38]. For an approach that classifies sensors as faulty purely at random, FDR would be equal to FPR, i.e., the chance of a true- or false-positive prediction would be equal [38]. For the proposed approach, however, FDR is about twice as high as FPR, indicating a fair ability to discriminate between faults and non-faults. Moreover, the approach reduces the error by 22.74% on average (ER). Overall, one can conclude a good performance of the proposed approach for the data sets generated in the simulation.
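The random baseline mentioned above can be illustrated with a small simulation. The sketch flags sensors independently of whether they are faulty, using the fault share (25.86%) and the elimination share (17.96%) from Section IV-A. The function name, the sample size, and the seed are arbitrary choices for the example.

```python
import random


def random_baseline(n_sensors, fault_share, flag_share, seed=0):
    """Flag sensors at random and return (FDR, FPR) of that classifier."""
    rng = random.Random(seed)
    tp = fp = tn = fn = 0
    for _ in range(n_sensors):
        faulty = rng.random() < fault_share
        flagged = rng.random() < flag_share   # independent of the fault
        if faulty and flagged:
            tp += 1
        elif faulty:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


fdr, fpr = random_baseline(200_000, fault_share=0.2586, flag_share=0.1796)
```

Both rates come out close to the flagging share of roughly 18%, so their ratio is close to one. The ratio of about two observed for the proposed approach therefore indicates a detection ability above chance level.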
Second, the results of the application of the proposed approach to the TYT data set are considered. As shown in Section IV-B, only 542 of the 3,255 devices (16.66%) are eligible for the elimination process of the approach. This number seems low at first glance, but Table VI shows that these devices account for 89,111 (83.10%) of the total samples in the data set. In the elimination stage, 85 devices are removed by the constant check, suggesting many constant value readings within the data set (as also found in [8]). A manual inspection of these values and a review of the code of the TYT mobile apps indicate that most of these constant values are zero values that occur because the Android app has not been granted permission to access the device's microphone. Nevertheless, this high number shows that the constant check is a valuable step in detecting sensor faults. Together with the second step of the elimination stage, the PCA, a total of 97 devices can be eliminated from the data set. As shown in Fig. 6, the manufacturer distribution before [see Fig. 6(a)] and after [see Fig. 6(b)] the elimination is similar. This is because the manufacturer distribution of the eliminated devices [see Fig. 6(c)] roughly corresponds to that of the original data set. In other words, the devices classified as faulty by the proposed approach are distributed about evenly across device manufacturers. However, comparatively fewer miscellaneous devices are detected by the approach. This could be due to the heterogeneity of device manufacturers (e.g., LG, Motorola, HTC, OnePlus, Huawei, Xiaomi, and others) in this group. The eliminated devices account for 3.0% of the devices and 13.82% of the samples in the total data set, as extracted from the TYT database.
The remaining and cleansed data set could then be used as a baseline for further analyses of the environmental sound level measurements, for example, to investigate the interrelationships with tinnitus symptoms.

B. Limitations
The approach described in the scope of this work is subject to several limitations, which are discussed in the following paragraphs.
Issues in the ecological and mHealth context: Some of the limitations of the proposed approach result from error factors that occur in particular in the context of mHealth. Cho et al. [40] group these error factors into technical-related, user-associated, and data-governance factors. As the TYT study operates on data gathered during everyday life, there is neither a specified environmental setting nor a limit on the number of devices used. Hardware- and software-related issues are accompanied by the "black box" behavior of smartphones concerning their processing of collected data [41]. This lack of explainability, together with the missing comparability between different operating systems and device manufacturers, constitutes a major quality issue. Standardization and calibration protocols [8] could provide a solution at least for the latter problem.
Assumption of over-determination: A main assumption of the algorithm in the proposed approach is an over-determination of the system. In this context, we included EMA questionnaire answers, as the devices and their sensors alone do not meet this requirement. It needs to be further investigated under which circumstances, for example, constant questionnaire answers might influence the outcome of the algorithm. In addition, our approach is a priori not able to handle data sets with missing values or too few samples (m < 30, see Section III-B2). The authors of [20] emphasize the ability of artificial intelligence-based approaches to generalize even from noisy or incomplete data. It needs to be evaluated whether, for example, the generated simulation data set would be sufficient to train accurate models and how the detection mechanism could be improved.
Lack of data distribution knowledge: Ayadi et al. [42] identify a lack of knowledge about the data distribution prior to the analysis as a major challenge for parameter-based statistical outlier detection approaches. Without prior knowledge of the form of the data and a corresponding manual selection, choosing suitable parameters and thresholds for the developed approach is challenging. Further research must evaluate whether approaches using dynamic thresholds, such as [20], can be applied to our context.
PCA approach: The core detection mechanism of our algorithm, the PCA stage, relies mainly on an initial model training. This model should be trained using data that are largely unaffected by errors, because the later detection mechanism depends on it. As it is still unclear how large this training set has to be, further improvements to the PCA algorithm are difficult to apply. These improvements could include, for example, the implementation of a robust PCA, as it is less sensitive to anomalies than classical PCA [18], or of an online PCA algorithm [18], [19], which incrementally adapts an initially trained PCA model and would therefore increase performance.
False-positive rate: The proposed approach still shows a rather high false-positive rate. The application of the Boruta algorithm [43] might overcome this problem. This machine learning algorithm tries to select the features that are relevant for the results. In our context, properties are treated as features and sensors as results. However, in a first attempt, the inclusion of BORUTA_PY as an implementation of this algorithm strongly increased the computational complexity without bringing the desired results.

VI. SUMMARY AND OUTLOOK
In this work, we proposed an approach to cleanse Mobile Crowdsensing (MCS) mHealth data sets from erroneous sensor data. Specifically, the overall goal was to harness environmental sound level data provided via the TrackYourTinnitus (TYT) database to enable meaningful interpretation regarding the correlation of these objective sensor readings with subjective tinnitus symptoms. To this end, the existing data were analyzed and the problem formalized. The proposed approach includes several pre-processing steps and uses Principal Component Analysis (PCA), which considers the responses to the Ecological Momentary Assessment (EMA) questionnaire of the TYT app as additional information, and a preceding constant value check (constant check) to detect sensor errors. In addition, a simulation was designed and run using generated data with similar characteristics to the TYT data set to assess the performance of the approach with different parameters. The simulation was then used to evaluate the performance of the proposed approach with different quality metrics and to determine appropriate parameters for application to the TYT data. Finally, the approach was applied to the TYT data set, eliminating all devices that were found to be erroneous. The results show that the proposed approach performs well for data sets similar to the simulated data and therefore could be used for real-world scenarios.
Future work should further explore the aspects discussed in Section V. The proposed approach is subject to several limitations and uncertainties that need to be addressed. In addition, there are other approaches in the literature that could lead to even better results. For mHealth data sets such as TYT, additional information about the devices and the context of the sensor measurements would enable more accurate analyses. The TYT platform could therefore be extended to provide such information.
In conclusion, it has been shown that sensor errors in MCS mHealth data sets can be detected using a data-centric approach. However, there are many challenges and uncertainties when the distribution of sensor errors is not known a priori and when little to no additional information is available about the data quality of sensor measurements. Therefore, further work is needed to enable a reliable interpretation of MCS mHealth data.