Private True Data Mining: Differential Privacy Featuring Errors to Manage Internet-of-Things Data

Available data may differ from true data in many cases due to sensing errors, especially in the Internet of Things (IoT). Although privacy-preserving data mining has been widely studied during the last decade, little attention has been paid to data values containing errors. Differential privacy, which is the de facto standard privacy metric, can be achieved by adding noise to a target value that must be protected. However, if the target value already contains errors, there is no reason to add extra noise. In this paper, a novel privacy model called true-value-based differential privacy (TDP) is proposed. This model applies traditional differential privacy to the “true value,” unknown to the data owner or anonymizer, rather than to the “measured value” containing errors. Based on TDP, our solution reduces the amount of noise added by differential privacy techniques by approximately 20%. As a result, the error of generated histograms is reduced by 40.4% and 29.6% on average according to the mean square error and Jensen–Shannon divergence, respectively. We validate these results on synthetic and five real data sets. Moreover, we prove that the privacy protection level does not decrease as long as the measurement error is not overestimated.


I. INTRODUCTION
Significant amounts of IoT data are generated every day by many different sensors, such as thermal cameras, home appliance sensors, automotive sensors, and smartphone-equipped sensors. These IoT data can be used for health monitoring [1], context-aware recommendation (or recommender) systems [2], navigation [3], and other applications. However, sensing people or their surrounding environment might involve information that identifies an individual [4]. Thus, private information is at risk of leakage. By anonymizing data based on ε-differential privacy [5], [6], which is the de facto standard privacy metric (ε represents the privacy budget), privacy leakage can be controlled. Differential privacy has been used in many studies, such as [7]–[9], as it is one of the most critical privacy metrics [10]. It is considered an important concept for data analysis [11], [12].
The associate editor coordinating the review of this manuscript and approving it for publication was Zhan Bu .
Local differential privacy is a specialization of differential privacy for data collection from each person; in this paper, ''differential privacy'' refers to ''local differential privacy.'' For numerical values, a differentially private value can be obtained by adding Laplace noise to the target value that must be protected [5]. For categorical values, a differentially private category ID can be obtained by disguising the sensed category ID with a certain probability [13], [14]. These methods are widely used to achieve ε-differential privacy. However, they do not consider errors in the values.
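As background, the two standard mechanisms just mentioned can be sketched as follows. This is a minimal illustration of the Laplace mechanism and the randomized response, not code from this paper; the function and parameter names are ours.

```python
import math
import random

def laplace_mechanism(value, delta, eps):
    """Add Laplace noise with scale delta/eps to a numerical value,
    using inverse-CDF sampling of the Laplace distribution."""
    b = delta / eps
    u = random.random() - 0.5
    return value - b * math.copysign(math.log(1 - 2 * abs(u)), u)

def randomized_response(category, num_categories, eps):
    """Keep the category ID with probability e^eps / (e^eps + M - 1);
    otherwise report one of the other IDs uniformly at random."""
    p_keep = math.exp(eps) / (math.exp(eps) + num_categories - 1)
    if random.random() < p_keep:
        return category
    others = [c for c in range(num_categories) if c != category]
    return random.choice(others)
```

For the Laplace mechanism, the mean absolute noise equals the scale delta/eps, which is why a smaller privacy budget eps implies more noise.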
In this paper, an original value with no error is referred to as a ''true'' value; the owner or anonymizer might not know these values. In contrast, sensed values that might contain errors are referred to as ''measured'' values. Existing studies on differential privacy consider only measured values, not true values. Our study aims to determine whether additional noise must be added to protect privacy if the target value already contains errors. This research proposes a new privacy model that protects the true value as opposed to the measured value. Since the data owner might not know the true value, the true data is assumed to follow a specific probability distribution, such as a normal distribution. This probability distribution is based on the data owner's or anonymizer's knowledge or the theory of errors [15], [16]. The difference between the traditional approach and the proposed true-value-based differential privacy (TDP) is illustrated in Fig. 1. Under the TDP concept, the amount of noise added to the measured value can be reduced.

FIGURE 1. Concept of true-value-based differential privacy (TDP). Traditional differential privacy adds differential privacy noise to the measured value. However, TDP tries to add noise to the true value.
We assume that the anonymizer can estimate the distribution of measurement errors to some extent. TDP can then be achieved even if this estimate is incorrect, as long as the anonymizer does not overestimate the amount of sensing error. The relationship between the anonymizer's error distribution and TDP is listed in Table 1. Therefore, if the anonymizer cannot be certain about the error distribution, they can guarantee TDP by conservatively estimating the amount of error. If the error amount is predicted to be zero, the result is the same as traditional differential privacy. Thus, TDP can reduce the amount of noise introduced compared to traditional differential privacy while still achieving the desired privacy protection level specified by ε.
If we have no information about the error distribution, we cannot use the proposed method in this paper. However, we believe that there are many situations where it is possible to make estimates under the condition that we can underestimate the amount of error. An additional discussion on this point is given in Section VI.
In Section IV, we conduct experiments using three synthetic datasets and three real datasets. We compare our method with three other methods for numerical datasets and four methods for categorical datasets. The experimental results show that our solution reduces the amount of noise by approximately 20% and reduces the error of generated histograms by 20% to 40% on average.
In recent years, several methods for local differential privacy (LDP) have been proposed. These LDP methods primarily target estimating the distribution underlying the user data and federated learning [17]–[20]. They add noise to the values to satisfy differential privacy or randomly perturb the category values, and they propose various techniques to reduce the amount of differential privacy noise so as to have as little negative impact on statistical analysis as possible. However, they do not consider sensing errors in IoT environments.
Authentication protocols for IoT are organized in detail in a survey by Ferrag et al. [21]. This survey categorized IoT environments into Machine-to-Machine Communications, the Internet of Vehicles, the Internet of Energy, and the Internet of Sensors, and summarized the advantages and disadvantages of authentication protocols in each environment. One of the main objectives of an authentication protocol is to deliver each datum to the correct entity. Authentication alone, however, cannot support statistically correct analysis without disclosing each individual's exact data to some party. Note that this objective is not unique to our study but is common to existing studies that collect data using differential privacy.
Badun et al. surveyed security and privacy issues on IoT platforms [22]. They state that most IoT platforms do not inform users about the type of information they collect and where it is shared. In response, some IoT platforms use new technologies that keep the data stored in the cloud under the user's control. Such techniques can protect user data; however, they do not provide a mechanism for statistically analyzing the data of many users across the board.
Husnoo et al. organized the techniques of differential privacy in IoT environments [23]. They categorized two major usage scenarios for differential privacy: (1) a scenario where a trusted server holds the true data and shares only the statistical analysis results with third parties, and (2) a scenario where data is collected on an untrusted server. Our study and several studies in the literature focus on the second scenario. The disadvantage of the latter scenario is that the cumulative amount of noise added to each datum becomes larger and affects the usefulness of the data [23]. Using our proposed method, which considers measurement errors, the cumulative amount of noise can be reduced, as shown in our experimental results. Ma et al. proposed an algorithm based on stochastic influence perturbation to satisfy differential privacy [24]. They assumed that a trusted central server holds the whole raw dataset, and their aim was to generate a private version of the dataset. They proposed a framework for network traffic tensor data privacy protection using multiple-strategy differential privacy. Their mechanism also assumed a central trusted server, although they partly used local differential privacy.
Onesimu et al. proposed a novel privacy-preserving data collection scheme for IoT-based healthcare service systems [25]. Their method uses a clustering-based anonymous model to develop an efficient privacy protection scheme that satisfies privacy requirements and protects a healthcare IoT from various privacy attacks. The scheme can efficiently deal with privacy attacks such as attribute, identity, and membership disclosure, as well as sensitivity, similarity, and skewness attacks. However, their method does not ensure differential privacy. In general, techniques for security and privacy in IoT environments rarely consider sensing noise. Our work is positioned as an important and pioneering effort to consider sensing noise in addressing privacy.
The motivation, research gap, and contributions of this study are summarized below.

A. MOTIVATION
This study aims to estimate the distribution of personal data sensed in IoT environments while protecting user data by differential privacy. We assume the sensed data contains sensing noise.

B. RESEARCH GAP
Existing methods do not consider sensing noise. Therefore, they add a large amount of unnecessary privacy noise to the sensed data.

C. CONTRIBUTION
First, we propose true-value-based differential privacy (TDP), a novel concept of differential privacy that considers sensing noise. Second, we propose anonymizing algorithms for numerical and categorical data that satisfy TDP. Third, we prove that the proposed algorithms ensure TDP. Fourth, we show that the proposed algorithms can reduce the amount of differential privacy noise using synthetic and real datasets. Fifth, we show that the proposed algorithms can reduce errors in the estimated distribution of personal data using the same datasets.
The rest of the paper is structured as follows. First, the application and assumptions, along with the definition of privacy, are presented in Section II. Then, the proposed design and its mechanisms are introduced in Section III. Next, the simulation results using synthetic and five real data sets are presented in Section IV. The related methods are discussed in Section V, and several design issues of the proposed method are mentioned in Section VI. Finally, the conclusions of this work are presented in Section VII.

A. APPLICATION MODEL
Currently, IoT devices can collect and estimate people's attribute information, such as location, heartbeat, health condition, age, and moving behavior. Based on these attribute data, people can use various services such as recommender systems. In addition, the data collector can also serve as a data anonymizer, anonymizing the obtained data and sending it to the data receiver (see Fig. 2).
Two kinds of attribute data are considered: the first is a numerical attribute, such as heartbeats per minute, while the second is a categorical attribute, such as a disease name.
The collected attribute data usually have some sensing errors since it is difficult to sense and estimate people's attributes with complete accuracy. In the worst-case scenario, some attribute data cannot be collected at all. Missing data can be estimated through multiple imputation or predictions based on regression models [26]; such estimated values exhibit large errors.

FIGURE 2. Application scenario. The data receiver, an attacker, collects user data from the data owner and the data collector, who use differential privacy techniques.

B. ASSUMPTIONS
Anonymizers may not know the true attribute values, but they can estimate them. However, these estimated values might contain errors. Anonymizers can estimate the error distribution of numerical attribute values. A normal distribution is considered an error model for numerical attributes since the measurement errors follow normal distributions in many cases [27]. A normal distribution is characterized by the parameter σ , which represents the standard deviation. Section VI-A contains further discussion about assuming a normal distribution. However, please note that the concept of TDP can be applied to other error models.
For categorical attributes, the misclassification probability p_i→j is considered. This probability signifies that the true category ID is i, but the anonymizer, unaware of the true category ID, observes the category ID as j.
In this paper, the parameters σ and p_i→j for all i, j are referred to as ''error parameters.'' Three scenarios are assumed.

1) SCENARIO I
The anonymizer knows the exact error parameters.

2) SCENARIO II
The anonymizer does not know the exact error parameters. The estimated parameters might differ from the actual parameters; however, they are not pessimistic about the degree of error. The mathematical definitions for numerical attributes are given in Section III-A, and those for categorical attributes in Section III-B.

3) SCENARIO III
The anonymizer does not know the exact error parameters and has no estimate for them.
In this paper, we do not target Scenario III. Instead, we mainly target Scenario II because Scenario I is relatively unrealistic.

C. ATTACK MODEL
The receiver of anonymized data is considered an attacker, and the attacker is considered a semi-honest entity; that is, the attacker follows the given protocol of anonymized IoT data collection. However, the attacker might try to extract individual information from each anonymized datum. Furthermore, each anonymized datum contains errors based on original sensing errors and intentionally added noise; therefore, the attacker cannot accurately estimate people's true data, but the attacker can estimate them as a particular probability distribution.

D. PRIVACY METRIC
In the privacy-preserving data mining community, differential privacy [5] is considered the most important privacy metric. Although differential privacy was originally used in a query-response database, recent studies have used it for anonymized data collection.
Suppose that a person has an attribute value and that the person or the anonymizer who collects it anonymizes the value. Let ε be a positive real number. ε-differential privacy is defined as follows.

Definition 1 (ε-Differential Privacy): Let D and D′ be databases differing in at most one record. A randomized mechanism A satisfies ε-differential privacy if and only if, for all Y ⊆ Range(A), the following holds:

Pr[A(D) ∈ Y] ≤ exp(ε) · Pr[A(D′) ∈ Y].

Kasiviswanathan et al. [28] established that this definition can be applied to anonymized data collection.
Definition 2 (Local Privacy): Let x and x′ be databases of size n = 1, and let ε be a positive real number. A randomized mechanism A satisfies ε-differential privacy if and only if, for any output y, the following holds:

1/exp(ε) ≤ Pr[A(x) = y] / Pr[A(x′) = y] ≤ exp(ε).

In this paper, it is considered that the value of x might contain sensing errors. Therefore, the focus must be placed on the true value of x, which is unknown even to the data owner and the anonymizer. TDP is proposed to handle the privacy of such unknown values.

III. TRUE-VALUE-BASED DIFFERENTIAL PRIVACY (TDP)
Existing studies define x and x′ in Definition 2 as measured values. In this paper, they are defined as true values. The anonymization mechanisms for both numerical and categorical attributes are described below.

A. NUMERICAL VALUES ANONYMIZATION
The Laplace mechanism [5], which adds noise drawn from the Laplace distribution, can be used for numerical attributes. However, the Laplace mechanism does not take sensing errors into account. In the baseline approach for numerical attributes, noise based on the normal distribution is first added to a true value as a sensing error, and additional noise based on the Laplace mechanism is then added to the noisy value; this is the traditional approach, which always adds the Laplace noise. The resulting probability density function, representing the probability of the distance between the final noisy value and the true value, can be calculated by convolving the normal and Laplace distributions.
Let N(x; σ²) and L(x; b) represent the probability density functions of the normal distribution with standard deviation σ and the Laplace distribution with scale b, respectively. Without loss of generality, only zero-centered distributions are considered.
The convolution of the normal distribution with standard deviation σ and the Laplace distribution with scale b is represented by

U(x; σ², b) = (1/(4b)) exp(σ²/(2b²)) [ exp(−x/b) erfc( (σ² − bx)/(√2 σ b) ) + exp(x/b) erfc( (σ² + bx)/(√2 σ b) ) ],

where erfc is the complementary error function,

erfc(x) = (2/√π) ∫_x^∞ exp(−t²) dt.

FIGURE 3. Ratio of probability density function values whose distance is Δ with respect to the normal distribution, the Laplace distribution, and the convolution of the two distributions (σ = Δ = ε = 1).
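As a sanity check on this closed form, the following sketch (ours, not from the paper) compares a direct numerical convolution of the two densities against the closed-form expression for U; `U_numeric` and its grid parameters are our own choices.

```python
import math

def normal_pdf(x, sigma):
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def laplace_pdf(x, b):
    return math.exp(-abs(x) / b) / (2 * b)

def U(x, sigma, b):
    """Closed-form convolution of N(0, sigma^2) and Laplace(0, b)."""
    c = math.exp(sigma ** 2 / (2 * b ** 2)) / (4 * b)
    a1 = (sigma ** 2 - b * x) / (math.sqrt(2) * sigma * b)
    a2 = (sigma ** 2 + b * x) / (math.sqrt(2) * sigma * b)
    return c * (math.exp(-x / b) * math.erfc(a1) + math.exp(x / b) * math.erfc(a2))

def U_numeric(x, sigma, b, lo=-15.0, hi=15.0, n=30000):
    """Midpoint Riemann-sum approximation of the convolution integral."""
    h = (hi - lo) / n
    return sum(normal_pdf(t, sigma) * laplace_pdf(x - t, b)
               for t in (lo + (i + 0.5) * h for i in range(n))) * h
```

In the two limits, U recovers the pure Laplace density (σ → 0) and the pure normal density (b → ∞), as expected of a convolution.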
It is noted that, for Scenario II, the value of σ can be wrong as long as it is not pessimistic. Let σ_t and σ represent the true standard deviation and the standard deviation the anonymizer believes, respectively; here, pessimistic means that the anonymizer overestimates the error, i.e., σ > σ_t (formalized in Equation 5). The functions exp(ε) and 1/exp(ε), together with the ratios of probability density function values whose distance is Δ with respect to N(x; σ²), L(x; Δ/ε), and U(x; σ², Δ/ε), where Δ, ε, and σ are set to one, are presented in Fig. 3. The ratio of probability density function values whose distance is Δ with respect to the normal distribution is calculated by

R_N(x;σ²) = N(x + Δ/2; σ²) / N(x − Δ/2; σ²) = exp(−xΔ/σ²). (6)

Equation 6 shows that R_N(x;σ²) approaches ∞ as x approaches −∞. Therefore, even if σ is very large, extra noise needs to be added to achieve ε-differential privacy. Similarly, in Fig. 3, R_L(x;Δ,ε) and R_U(x;σ²,Δ,ε) are defined as the ratios of the probability density function values whose distance is Δ with respect to L(x; Δ/ε) and U(x; σ², Δ/ε), respectively.
The ratio of the probability density function values whose distance is Δ should lie between exp(ε) and 1/exp(ε), according to the definition of ε-differential privacy. Fig. 3 shows that R_L(x;Δ,ε) and R_U(x;σ²,Δ,ε) satisfy this condition; therefore, the L(x; Δ/ε) and U(x; σ², Δ/ε) mechanisms achieve ε-differential privacy (here σ = Δ = ε = 1). Although R_U(x;σ²,Δ,ε) approaches exp(ε) (or 1/exp(ε)) when |x| is large, its convergence to exp(ε) (or 1/exp(ε)) is slower than that of R_L(x;Δ,ε). Consequently, the U mechanism adds much more noise than required.
The algorithm proposed in this paper is simple but effective: Laplace noise is not added when the magnitude of the generated Laplace noise is smaller than a predefined threshold w. Thus, the total loss is expected to become smaller (i.e., the ratio of probability density function values whose distance is Δ is expected to approach exp(ε) and 1/exp(ε) faster).
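A minimal sketch (ours, not the paper's implementation) of this noise-skipping rule. For a Laplace(Δ/ε) draw, the probability that the noise is skipped is the Laplace CDF mass on (−w, w), namely 1 − exp(−wε/Δ).

```python
import math
import random

def sample_laplace(b):
    """Inverse-CDF sample from a zero-centered Laplace distribution."""
    u = random.random() - 0.5
    return -b * math.copysign(math.log(1 - 2 * abs(u)), u)

def thresholded_laplace(measured, delta, eps, w):
    """Add Laplace noise only when its magnitude is at least w;
    otherwise output the measured value unchanged."""
    l = sample_laplace(delta / eps)
    return measured if abs(l) < w else measured + l

def skip_ratio(delta, eps, w):
    """Probability that the Laplace draw is skipped: P(|l| < w)."""
    return 1.0 - math.exp(-w * eps / delta)
```

Choosing w is the delicate part, as discussed next: too large a threshold breaks the privacy guarantee, too small a threshold wastes noise.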
However, determining an appropriate value for w is complex. If the threshold w is too large, the resulting value achieves neither traditional ε-differential privacy nor TDP. Conversely, if the threshold w is too small, the resulting value contains unnecessary noise.
The probability density function of the added noise, where the Laplace noise is applied only when it satisfies abs(x) ≥ w, is represented by

W(x; Δ/ε, w) = (1 − exp(−wε/Δ)) δ(x) + 1[abs(x) ≥ w] · L(x; Δ/ε), (7)

where δ is the Dirac delta function. Therefore, the probability density function obtained from the original sensing error and the Laplace noise defined in Equation 7 can be represented by the convolution

V(x; σ², Δ, ε, w) = (1 − exp(−wε/Δ)) N(x; σ²) + ∫_{abs(l) ≥ w} L(l; Δ/ε) N(x − l; σ²) dl.

The ratio of probability density function values for the proposed algorithm whose distance is Δ is represented by

R_V(x; σ², Δ, ε, w) = V(x + Δ/2; σ², Δ, ε, w) / V(x − Δ/2; σ², Δ, ε, w).

The aim is to find an appropriate value of w such that R_V approaches exp(ε) but never exceeds exp(ε) or falls below 1/exp(ε).
The following theorem is considered (see Fig. 4); Theorem 3: If w is near ∞, the value of R V approaches the value of R N . If w is near zero, the value of R V approaches the value of R U .
The ratio between the density values at x + Δ/2 and x − Δ/2 is considered in this study; therefore, the range −w − Δ/2 < x < 0 can be checked to determine whether the maximum ratio is greater than exp(ε). It is noted that only the range x < 0 needs to be checked because the ratio curve of V is symmetric with respect to the point (x, y) = (0, 1), where y represents the ratio of probability density function values whose distance is Δ.
Algorithm 1 describes the method that yields the anonymized value. In Algorithm 1, the value of w is calculated in Lines 1-16. erfc(x) can be computed using approximate equations, such as the one in [29] for x ≥ 0. Note that an approximate value of erfc(x) for x < 0 can be obtained from the property erfc(−x) = 2 − erfc(x). After checking the approximate values, precise values need to be calculated. Mathematical tools such as Maxima, a popular free software program, can be employed.

B. CATEGORICAL VALUES ANONYMIZATION
The randomized response mechanism [13], [14] can be used for categorical attributes. First, a sensed value is categorized into one of the predefined categories. The category is then replaced by another category with a certain probability, and the resulting category ID is sent to the data receiver. The randomized response is referred to as the baseline approach for categorical attributes. The retention probability of keeping the category ID unchanged is p_α, and the probability of each of the other IDs is (1 − p_α)/(M − 1), where M is the number of categories. To satisfy ε-differential privacy,

p_α / ((1 − p_α)/(M − 1)) ≤ exp(ε)

should hold. Therefore, it is set to

p_α = exp(ε) / (exp(ε) + M − 1).

Since exp(ε) > 1, p_α > 1/M is obtained. Let p_i→j represent the probability that the true category ID C_i is (mis-)classified as C_j due to sensing errors. It is assumed that the retention probability is greater than any other probability; that is,

p_i→i > p_i→j for all j ≠ i.

It is assumed that the values of p_i→j for all i, j can be estimated. For Scenario II, these values can be wrong, as long as they are not pessimistic.

Algorithm 1 (fragment, Lines 6-22):
6: if r > 0 then
7: w_max ← w
8: else
9: if w − w_min is sufficiently small then
10: w ← w
11: Break
12: else
13: w_min ← w
14: end if
15: end if
16: end while
17: Generate Laplace noise l based on L(0, Δ/ε).
18: if abs(l) < w then
19: Return v_s
20: else
21: Return v_s + l
22: end if
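With the retention probability set to p_α = exp(ε)/(exp(ε) + M − 1), the ratio between keeping the ID and reporting any specific other ID is exactly exp(ε). A quick check (our own sketch, with our own function names):

```python
import math

def retention_probability(eps, m):
    """Retention probability of the randomized response baseline."""
    return math.exp(eps) / (math.exp(eps) + m - 1)

def privacy_ratio(eps, m):
    """Ratio between keeping the ID and reporting a specific other ID."""
    p = retention_probability(eps, m)
    return p / ((1 - p) / (m - 1))

# For every M >= 2 this ratio equals exp(eps), so the
# epsilon-differential-privacy bound is met with equality.
```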
Let p_i→j,t and p_i→j represent the true probability and the probability that the anonymizer believes, respectively. Here, pessimistic estimation means that the anonymizer overestimates the misclassification probabilities (i.e., p_i→i < p_i→i,t).

First, consider the situation where the expression of Equation 17 is satisfied. In this case, TDP clearly holds: the randomized mechanism A in Definition 3 does not need to do anything, and TDP can be satisfied by outputting the measured values as they are.

Algorithm 2 (fragment):
7: Randomly select j, each with probability x_i→j, and return j.
8: end if

If Equation 17 is not satisfied, the following simultaneous equations with respect to x_i→j for all i and j are solved, where · represents the scalar product of two vectors. The value of x_i→i could be greater than one, and the value of x_i→j could be less than zero. Therefore, the obtained values are normalized (Equation 20). Finally, when the measured category ID is C_i, the anonymizer generates the anonymized version C_j with probability x_i→j.
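The clip-and-renormalize step can be sketched as follows. This is a hypothetical rendering of a normalization of the kind described above, not the paper's exact Equation 20:

```python
def normalize_probabilities(row):
    """Clip negative entries to zero and rescale so the row sums to one.
    Hypothetical rendering of a clip-and-renormalize step; the paper's
    exact normalization (Equation 20) may differ."""
    clipped = [max(0.0, p) for p in row]
    total = sum(clipped)
    if total == 0.0:
        # Degenerate case: fall back to a uniform distribution.
        return [1.0 / len(row)] * len(row)
    return [p / total for p in clipped]
```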
Algorithm 2 shows the method that yields the anonymized category ID.

C. PROOF OF ACHIEVING TRUE-VALUE-BASED DIFFERENTIAL PRIVACY
Next, it is proved that the proposed algorithms (for Scenarios I and II) realize TDP.

1) NUMERICAL ATTRIBUTES
Initially, Scenario I is considered. Since Algorithm 1 ensures that 1/exp(ε) ≤ R_V(x;σ²,Δ,ε,w) ≤ exp(ε) for the true value if σ is correct, it achieves TDP based on Definition 3.
Next, Scenario II is considered. It is assumed that the anonymizer's knowledge about sensing errors is not correct, but their knowledge about measurement errors is not pessimistic. The concept ''pessimistic'' is defined in Equation 5 regarding numerical attributes.
Let the ratio of the probability density function values whose distance is Δ with respect to N(x; σ²) be R_N(x;σ²). Differentiating R_N(x;σ²) with respect to σ shows that, when x is less than zero, the derivative is always negative. Therefore, if σ becomes larger, the value of R_N(x;σ²) becomes smaller. It can be concluded that R_V(x;σ²,Δ,ε,w) also becomes smaller when σ becomes larger, since the proposed probability density function V(x;σ²,Δ,ε,w) is the convolution of N(x;σ²) with the noise density of Equation 7, which does not depend on σ. Therefore, if the anonymizer's knowledge about the measurement errors is not pessimistic (σ ≤ σ_t), then R_V(x;σ_t²,Δ,ε,w) ≤ R_V(x;σ²,Δ,ε,w) for x ≤ 0. If the anonymizer sets the error parameters conservatively (i.e., sets σ to a small value), the amount of noise added by the proposed mechanism is larger than necessary. Although the usefulness of the proposed algorithm becomes worse in this case, the ratio of the anonymization probabilities generated by the proposed mechanism from two neighboring databases remains within the range between exp(ε) and 1/exp(ε), with some margin to spare. Even in this case, however, the total loss of the proposed mechanism is less than that of the baseline approach. For x > 0, the discussion is similar, and R_V(x;σ_t²,Δ,ε,w) ≥ R_V(x;σ²,Δ,ε,w). Since Algorithm 1 guarantees 1/exp(ε) ≤ R_V(x;σ²,Δ,ε,w) ≤ exp(ε) for the believed σ², it follows that 1/exp(ε) ≤ R_V(x;σ_t²,Δ,ε,w) ≤ exp(ε) for the true σ_t². Therefore, Definition 3 holds.

2) CATEGORICAL ATTRIBUTES
Initially, Scenario I is considered. It is assumed that the attacker obtains a category ID γ as the anonymized version of a categorical attribute. Let P(v_a = γ | v_t = i) represent the probability that the anonymized category ID is γ when the true category ID is i. The proposed mechanism ensures the condition of Equation 22 when the process of Equation 20 is ignored. The ratio of the two expressions in Equation 22 is exp(ε) or 1/exp(ε); therefore, Definition 3 holds. Based on the post-processing property of differential privacy, the values resulting from the process of Equation 20 also satisfy TDP.

Next, Scenario II is considered. It is assumed that the anonymizer's knowledge about the sensing errors is not correct but is not pessimistic. Let x_i→j,t and x_i→j represent the disguising probabilities based on the true error parameters and the believed error parameters, respectively. If the error parameters are not pessimistic, then Equation 23 holds, and Equation 24 follows. From Equations 14 and 24, it is concluded that Definition 3 holds.

D. ANALYSIS

1) NUMERICAL ATTRIBUTES
The proposed mechanism avoids adding Laplace noise if the magnitude of the generated Laplace noise l is less than the threshold w. The avoidance (or skipping) ratio can be calculated by

S(w) = ∫_{−w}^{w} L(x; Δ/ε) dx = 1 − exp(−wε/Δ). (25)

Let η_U and η_V represent the expected absolute values of the added Laplace noise with respect to the baseline approach and the proposed mechanism, respectively. The value of η_U can be calculated by

η_U = Δ/ε, (26)

and the value of η_V can be calculated by

η_V = (w + Δ/ε) exp(−wε/Δ). (27)

Theorem 4: Let R_V(x;σ²,Δ,ε,w) represent the ratio of probability density function values whose distance is Δ for the proposed mechanism. It approaches exp(ε) as x approaches −∞.

Proof: The convergence of erfc(x) to zero as x → ∞ is more rapid than the growth of exp(x), and by l'Hopital's rule (the derivative of erfc(x) is −2e^{−x²}/√π), the limit of the ratio follows from Equations 30, 31, and 33.
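Under a Laplace(Δ/ε) draw with skipping threshold w, the expected magnitude of the added noise has the closed form (w + Δ/ε)·exp(−wε/Δ). The following sketch (ours, not the paper's code) cross-checks this by Monte Carlo simulation:

```python
import math
import random

def eta_baseline(delta, eps):
    """Expected |Laplace noise| for the baseline: the scale delta/eps."""
    return delta / eps

def eta_proposed(delta, eps, w):
    """Expected |added noise| when draws with |l| < w are skipped."""
    b = delta / eps
    return (w + b) * math.exp(-w / b)

def eta_monte_carlo(delta, eps, w, n=200000):
    """Empirical estimate of the same quantity by sampling."""
    b = delta / eps
    total = 0.0
    for _ in range(n):
        u = random.random() - 0.5
        l = -b * math.copysign(math.log(1 - 2 * abs(u)), u)
        if abs(l) >= w:
            total += abs(l)
    return total / n
```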

2) CATEGORICAL ATTRIBUTES
Let ζ_U and ζ_V represent the probabilities that the true category ID is equivalent to the anonymized category ID for the baseline approach and the proposed mechanism, respectively. The baseline approach always adds the Laplace noise for numerical attributes and uses the randomized response method for categorical attributes, as described in Sections III-A and III-B. Assuming that the true category ID is i, ζ_U and ζ_V can then be calculated.

IV. EVALUATION

A. PARAMETER SETTINGS
The value of ε and the error parameters σ and p_i for all i need to be set to realistic values.

1) VALUE OF ε
Apple's deployment ensures that ε is equal to 1 or 2 per datum [30]. Apple's differential privacy team set ε = 2, 4, 8 for their evaluations [31]. In the paper that proposed RAPPOR [32], which was developed by Google, ε = ln(3) is used as the main setting. Based on these settings, ε is set in the range 1-10. For this range, when Δ is equal to 100, the absolute value of the average noise added by the Laplace mechanism is in the range 5-50. In this case, privacy can be considered to be sufficiently protected. It is noted that if Δ is multiplied by a, the average noise is also multiplied by a. For categorical attributes, when the number of categories M is 2, the retention probability ranges between 73.11% and 99.996% if ε is set in the range 1-10. An ε of 10 means that, when the value of M is small, the privacy protection level is very low. Therefore, for categorical attributes, performance evaluation at small values of ε is especially important.

2) VALUE OF σ
When the standard deviation is σ, the average sensing error ASE(σ) is described as

ASE(σ) = √(2/π) · σ.

When σ is set to 1/40 of the value of Δ (the range of possible values), ASE(σ) is 2.0% of the value of Δ. In this case, the IoT device sensing the attribute value is considered to have high accuracy. Similarly, when σ is set to 1/10, 1/4, and 1/2 of the value of Δ, the values of ASE(σ) become 8.0% (relatively high accuracy), 20% (relatively low accuracy), and 40% (low accuracy) of the value of Δ, respectively. Based on this analysis, σ is set in the range 1/40 to 1/2 of the value of Δ.
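The average sensing error here is the mean absolute value of a zero-mean normal variable, σ√(2/π). A quick check of the quoted percentages (our own sketch, with Δ normalized to 1):

```python
import math

def average_sensing_error(sigma):
    """Mean absolute value of a zero-mean normal: sigma * sqrt(2/pi)."""
    return sigma * math.sqrt(2.0 / math.pi)

# With sigma = Delta/40, Delta/10, Delta/4, and Delta/2, the error is
# about 2.0%, 8.0%, 20%, and 40% of the range Delta, respectively.
delta = 1.0
ratios = [average_sensing_error(delta / d) / delta for d in (40, 10, 4, 2)]
```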

3) VALUE OF p_i→j
p_i→i for all i is set to the same value, which is referred to as τ. τ is set in the range 0.3-0.9. This means that the IoT device senses a person's attributes and correctly judges the attribute category with a probability from 0.3 (low accuracy) to 0.9 (high accuracy). p_i→j for all i, j (i ≠ j) is set to another value, that is, (1 − τ)/(M − 1).

B. UTILITY METRIC
The data receiver aims to use the anonymized value for several services. Therefore, the estimated value should be close to the true value. Let N represent the number of people whose attribute values are collected, and let v_i and v'_i represent the true value of person i and the anonymized one, respectively. The utility U_n is defined with respect to numerical attributes, while the utility U_c is defined with respect to categorical attributes in terms of the Kronecker delta δ_{i,j}. Both metrics are considered superior if their values are large.
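The categorical utility follows directly from the Kronecker-delta description above; for the numerical utility, whose exact equation is not reproduced here, we sketch an assumed mean-absolute-deviation form (our own hypothetical rendering, not the paper's definition):

```python
def utility_categorical(true_ids, anon_ids):
    """Fraction of people whose anonymized category equals the true one
    (the Kronecker-delta average described in the text)."""
    n = len(true_ids)
    return sum(1 for t, a in zip(true_ids, anon_ids) if t == a) / n

def utility_numerical(true_vals, anon_vals, delta):
    """Assumed form: one minus the mean absolute deviation normalized by
    the value range delta (the paper's exact formula may differ)."""
    n = len(true_vals)
    mad = sum(abs(t - a) for t, a in zip(true_vals, anon_vals)) / n
    return 1.0 - mad / delta
```

Both sketches are larger-is-better, matching the text's statement that the metrics are superior when their values are large.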
Some methods can estimate statistical values (e.g., averages) or generate cross-tabulations from the collected data. If the aim is to generate a cross-tabulation, a total loss that compares the true cross-tabulation with the generated one should be used. However, in this paper, the focus is mainly on individual data; that is, the aim is not statistical analysis but the use of each person's attribute value, because IoT-related services such as health monitoring, context-aware recommender systems, and navigation, described in Section I, need to analyze an individual's attribute value.

C. NUMERICAL VALUE RESULTS
N is set in the range 10-1,000, ε is set in the range 1-10, and σ is set in the range 1/40 of the value of Δ to 1/2 of the value of Δ. The number of times the proposed mechanism avoided adding Laplace noise to a measured value was evaluated, as well as how much the proposed mechanism reduced the average amount of Laplace noise. Results with N equal to 10 are shown in Fig. 5. Computed results based on Equations 25, 26, and 27 are also presented in Fig. 5. Results with N = 100 and N = 1000 are almost the same as those in Fig. 5; therefore, they are not shown.
Computed results based on Equations 25, 26, and 27 are in close agreement with simulation results for all parameter settings. The proposed mechanism reduced the number of times Laplace noise is added and the corresponding average Laplace noise. Large values of σ or ε result in a large reduction rate. A large value of σ means that a large sensing-error noise is already added to the true value, while a large value of ε means that the privacy protection level is not high; that is, a large amount of noise is not needed. Therefore, the proposed mechanism can reduce the additional Laplace noise, especially when σ and ε are large. From Equation 6, it follows that the addition of noise cannot be completely avoided; however, Fig. 5 indicates that the noise-skipping ratio approaches one. U_n was evaluated using Equation 39 with the same values for Δ and ε as above (Fig. 6). Since a large σ results in a low U_n (i.e., a high total loss) even if no privacy protection mechanism is applied, the difference between the proposed mechanism and the baseline approach is small. This holds not only when σ is very small but also when σ is very large. However, if σ is set to a medium value, the proposed mechanism can reduce the total loss U_n by 25%–40% compared with the baseline approach. When ε is set to one, the difference between the proposed mechanism and the baseline approach is minor. However, when ε is equal to one, the average absolute value of the Laplace noise to be added is about 50 when Δ is equal to 100. This amount of noise is very large; therefore, in the usual case, the value of ε should be larger.
The actual ratio of probability density function values for inputs at distance Δ was determined by simulation. The true values to be protected were set to −Δ/2 and Δ/2. Normally distributed noise was randomly and independently added to the true values. The noise-added values were then anonymized by the proposed mechanism and the baseline approach, respectively. Histograms were created for the range −3Δ to 3Δ with 200 bins. This simulation was repeated 2^31 times. Fig. 7 presents an example of the average result with ε = 2, Δ = 100, and σ = 25. The ratio of the probability density function values of the normal distribution and of the Laplace distribution, along with the functions exp(ε) and 1/exp(ε), are also shown as references. The results for both the proposed and the baseline approach lie within the range 1/exp(ε) to exp(ε); therefore, both mechanisms (for Scenarios I and II) achieve TDP. The ratio of the probability density function values of the Laplace distribution equals exp(ε) and 1/exp(ε) in the ranges x < −Δ/2 and Δ/2 < x, respectively; therefore, the Laplace mechanism is the best if the measured values have no errors. For the proposed mechanism, the ratio of the probability density function values reaches exp(ε) and 1/exp(ε) at about x = −Δ/2 and x = Δ/2; however, this ratio is somewhat far from exp(ε) and 1/exp(ε) at x = −40 and x = 40. In contrast, the ratio of the probability density function values in the baseline approach reaches exp(ε) and 1/exp(ε) at about x = −30 and x = 30. Note that the probability density function values are large when x is near zero; therefore, high utility can be achieved if the ratio is near exp(ε) and 1/exp(ε) when x is near zero. Hence, the proposed mechanism achieves higher utility (i.e., lower total loss) than the baseline approach.
Additional simulations were conducted with other parameter settings. It was confirmed that the ratio of the probability density function values of the proposed mechanism lies within the range 1/exp(ε) to exp(ε), except for results with considerable variation due to the number of samples in each bin being too small.
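The density-ratio bound discussed above can be verified analytically for the plain Laplace mechanism (a sketch of the baseline without measurement noise; ε = 2 and Δ = 100 follow the example of Fig. 7):

```python
import math

def laplace_pdf(x, mu, b):
    """Density of the Laplace distribution with location mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)

eps, delta = 2.0, 100.0
b = delta / eps  # Laplace scale for two inputs at distance delta

# For true values at -delta/2 and delta/2, the two output densities must
# differ by at most a factor exp(eps) everywhere; the bound is attained
# with equality outside [-delta/2, delta/2].
for i in range(-60, 61):
    x = i * 5.0
    ratio = laplace_pdf(x, -delta / 2, b) / laplace_pdf(x, delta / 2, b)
    assert math.exp(-eps) - 1e-9 <= ratio <= math.exp(eps) + 1e-9
```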

D. CATEGORICAL VALUE RESULTS
The value of ε was set in the range 1–10, the value of M was set in the range 5–100, and the value of τ was set in the range 0.3–0.9. A true category ID was set to a random integer, and the category ID was changed with probability 1 − τ. Then, the category ID was randomized by the baseline mechanism and by the proposed mechanism. This simulation was repeated 2^31 times. Results with ε equal to one are shown in Fig. 8, together with computed results calculated from Equations 35 and 36. A close agreement is observed between the simulated and computed results. The values of U_c obtained by the proposed method are larger than or equal to those obtained using the baseline approach for all parameter settings. When M is large or τ is small, the values of U_c are small for both mechanisms, since it is difficult to maintain high accuracy in such cases. However, in the other cases, the proposed mechanism reduces the total loss compared with the baseline approach, especially when ε is small, i.e., when the privacy protection level is high. When ε is large, the experimental results of the proposed method are similar to those of the other methods, because the noise added to achieve differential privacy is then very small; this is why there is no difference in accuracy between the methods. Therefore, experiments with small values of ε are the more important ones.
Next, a true category ID was set to 1, and M was set to 10. The number of times each category ID was selected as the randomized category ID was counted. Let c_max and c_min represent the maximum and minimum of these counts, respectively. Simulation results for the ratio c_max/c_min are shown in Fig. 9, with exp(ε) shown as a reference. Since the results of both the proposed method and the baseline approach are equal to or less than exp(ε), both mechanisms achieve TDP for true data in Scenarios I and II. Compared with the proposed mechanism, the result of the baseline approach is farther from the exp(ε) line; therefore, the proposed mechanism achieves higher utility (i.e., lower total loss).
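As a hedged sketch, the standard M-ary randomized response (not necessarily the paper's exact baseline mechanism) attains exactly the exp(ε) bound on c_max/c_min discussed above:

```python
import math
import random

def m_ary_randomized_response(true_id, M, eps, rng=random):
    """General randomized response over M categories, satisfying eps-DP:
    keep the true category with probability e^eps / (e^eps + M - 1),
    otherwise report one of the other M - 1 categories uniformly."""
    p_keep = math.exp(eps) / (math.exp(eps) + M - 1)
    if rng.random() < p_keep:
        return true_id
    return rng.choice([c for c in range(M) if c != true_id])

# The ratio of the most likely to the least likely output probability
# equals exp(eps), matching the bound shown in Fig. 9.
M, eps = 10, 1.0
p_keep = math.exp(eps) / (math.exp(eps) + M - 1)
p_other = (1.0 - p_keep) / (M - 1)
assert abs(p_keep / p_other - math.exp(eps)) < 1e-9
```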

E. REAL DATA SET RESULTS
Simulations were conducted using a real data set called the Adult data set [33], which is a widely used benchmark in the research area of privacy-preserving data mining. It consists of six numerical attributes and nine categorical attributes and has 30,162 records after eliminating unknown values.
FIGURE 9. Simulation results: the ratio of probability distributions for categorical attributes (M = 10, τ = 0.6). Since both Proposal and Baseline are smaller than exp(ε), they both satisfy the requirement of differential privacy. Furthermore, since Proposal is closer to exp(ε) than Baseline, it adds more appropriate noise in terms of data utility.
It was assumed that each value of the Adult data set was true. It was also assumed that IoT devices estimated age, sex, race, and native country using estimation methods [38]–[40]. For numerical attributes, σ was set to 0.1Δ, and ε was set to 8. For categorical attributes, τ was set to 0.6, and ε was set to 2.
Simulation results are shown in Table 3, along with the names of the attributes and the values of Δ and M. The proposed mechanism was able to increase U_n from approximately 85% to approximately 92% for all numerical attributes and to increase U_c by up to 20% for categorical attributes compared with the baseline approach. These results show that the proposed mechanism can increase the utility (i.e., reduce the total loss) for real data sets. Finally, simulations were conducted using other real data sets with the same parameter settings as above.
A data set of activities based on multisensor data fusion (AReM data set) [34] was used for the numerical attributes. It consists of 42,239 instances of six numerical attributes.
A data set of daily living activity recognition using binary sensors (ADL data set) [35], a data set of healthy older people's activities using a batteryless wearable sensor (RFID data set) [36], and a data set of localization of people's activity (Localization data set) [37] were used for the categorical attributes. The numbers of instances are 741, 75,128, and 164,860, respectively.
Simulation results are shown in Table 4. These results show that the proposed mechanism outperforms the baseline approach for all data sets used in this study.

F. HISTOGRAM GENERATION
Several studies on local differential privacy have been conducted to generate accurate histograms of attribute values. In this section, we compare the accuracy of histograms generated by our proposed method with those of state-of-the-art methods. Li et al. proposed the square wave (SW) method for numerical attributes [19], which uses the expectation-maximization algorithm and repeats the E-step and M-step many times; in this paper, we set the number of these iterations to 100,000. Gu et al. [18] proposed IDUE based on Google's RAPPOR [32] (IDUE(R)) and IDUE based on OUE [41] (IDUE(O)) for categorical attributes. Sei and Ohsuga [42] proposed an algorithm for both numerical and categorical attributes, referred to as the NuRR method. Murakami and Kawamoto proposed a utility-optimized RAPPOR (uRAP) technique for categorical attributes [17]. uRAP assumes that non-sensitive data exist in personal data and does not protect them; it ensures differential privacy for sensitive data and can thereby achieve high utility. This paper, like most prior studies, assumes that all data should be protected by differential privacy; nonetheless, uRAP can be used in such situations. Zhao et al. proposed several strategies for differentially private data collection for numerical attributes [20]. For generating histograms, PM-SUB and PM-OPT can be used; since PM-SUB is a simplified version of PM-OPT, PM-OPT is used in this evaluation. Zhao et al. also proposed a Three-Output mechanism with only three discrete output possibilities: regardless of the input value, Three-Output outputs −C, 0, or C, where the value of C is determined based on ε. Therefore, generating an accurate histogram with it is challenging, although Three-Output performs very well for obtaining average values from differentially private data.
In summary, IDUE(R), IDUE(O), NuRR, and uRAP were compared with our proposed method for categorical attributes; SW, NuRR, and PM-OPT were compared with our proposed method for numerical attributes.
In detail, we measured the mean square error (MSE) between an original histogram and the one generated from anonymized values. We generated one histogram per attribute for the evaluation. Note that generating a histogram of multidimensional attributes can easily be achieved by targeting the power set of attribute values. Since the real datasets have multiple attributes, the MSE is calculated by averaging the MSE of each attribute.
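The evaluation metric can be sketched as follows (equal-width bins; normalizing the squared error by the number of bins is an assumption consistent with the worked example in this section):

```python
def histogram(values, bins, lo, hi):
    """Counts of values in `bins` equal-width bins over [lo, hi)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp v == hi into last bin
        counts[idx] += 1
    return counts

def histogram_mse(true_vals, est_vals, bins, lo, hi):
    """Mean square error between the original histogram and the one
    built from anonymized (estimated) values."""
    h_true = histogram(true_vals, bins, lo, hi)
    h_est = histogram(est_vals, bins, lo, hi)
    return sum((a - b) ** 2 for a, b in zip(h_true, h_est)) / bins
```

For a dataset with several attributes, the per-attribute MSEs would simply be averaged, as described above.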
First, we conducted experiments on numerical attributes. The number of bins of the histogram was set to 100. Three synthetic datasets (uniform, peak, and normal) and two real datasets (AReM and Adult) were used. For the synthetic datasets, the number of records was set to 10,000 and Δ was set to 1.0. In the uniform dataset, the value of each record was randomly generated in [0,1]. In the peak dataset, the values of all records were set to 0.5. For the normal dataset, each value was sampled as an independent and identically distributed random variable from a normal distribution with a mean of 0.5. The value of ε was varied from 1 to 10, and the sigma ratio was varied within the set {0.025, 0.1, 0.25, 0.5}; the default values were set to 7 and 0.25 for ε and the sigma ratio, respectively. The results of varying the sigma ratio and ε are shown in Figs. 10 and 11, respectively. In general, the MSE of every method is largest on the peak dataset, since MSE is based on the squared difference between the original and estimated histograms: the larger the differences between the bin values of the original histogram, the larger the MSE. The bin value corresponding to 0.5 in the peak dataset is 10,000; if the estimated value is 9,000, the MSE is 1,000^2/100 = 10,000. In contrast, for the uniform dataset, the true value of each bin in the original histogram is 100; if the estimated value for each bin is 90, the MSE is 100 × 10^2/100 = 100. Thus, the more variation there is in the distribution of values, the larger the MSE tends to be.
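The worked numbers above can be checked directly:

```python
bins = 100

# Peak dataset: the single occupied bin holds all 10,000 records.
# Estimating 9,000 for that bin alone contributes (10,000 - 9,000)^2 / bins.
peak_mse = (10_000 - 9_000) ** 2 / bins
assert peak_mse == 10_000

# Uniform dataset: every bin holds 100 records. Estimating 90 in each
# bin gives bins * (100 - 90)^2 / bins = 100.
uniform_mse = bins * (100 - 90) ** 2 / bins
assert uniform_mse == 100
```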
The larger the sigma ratio, the larger the MSE, which is a natural result since a large sigma ratio means a large observation error (Fig. 10). The MSEs of the proposed method do not vary much on the AReM dataset. Although this result was unexpected, we consider that the reduction of observation noise was successful.
The larger the ε, the smaller the MSEs (Fig. 11). This is because, in the proposed and NuRR methods, a larger ε implies smaller Laplace noise. Similarly, for SW, the larger the ε, the smaller the difference between the privatized and original values.
The results show that the MSE of the proposed method is the smallest across all conditions and datasets. The MSEs for SW are large because SW does not consider observation noise. However, with small observation noise (i.e., small sigma ratios), the MSEs of SW can be smaller than those of the proposed method (in the AReM dataset). The authors of [19] highlighted that the accuracy of SW depends on the characteristics of the datasets: SW's accuracy can be worse if the dataset has large spikes in the distribution, and the AReM dataset has larger spikes than the Adult dataset.
Next, we conducted experiments using categorical datasets. The synthetic numerical datasets were converted into categorical datasets with 10 categories. Moreover, we used the ADL and Adult datasets. The results of varying τ are shown in Fig. 12; the value of ε was set to 7. Similar to the experimental results on the numerical datasets, the MSE of the proposed method is the smallest for every parameter setting. For the uniform dataset, each measured value may differ from the true value; however, the true distribution remains uniform, implying that the frequencies do not change much. Hence, the MSEs of the proposed method are similar to those of IDUE(O) (Fig. 12a). Results have shown that IDUE(O) achieves higher accuracy than IDUE(R) [18]. The effect of a measured value differing from the true value is more significant in the peak dataset, because the frequency of each value can change considerably; therefore, the MSEs of the proposed method are smaller than those of the other methods (Fig. 12b). The degree of variation in the frequency of each value in the normal dataset lies between those of the peak and uniform datasets. Hence, the difference between the MSEs of the proposed method and those of the other methods in the normal dataset is larger than in the uniform dataset and smaller than in the peak dataset (Fig. 12c). For the categorical attributes, the distribution of the values of the ADL dataset is gentler than that of the Adult dataset; this characteristic is reflected in their MSEs (Figs. 12d and 12e).
The experimental results of uRAP and IDUE(O) are comparable because both depend on the same OUE mechanism. PM-OPT is mainly designed for federated learning; therefore, its accuracy in estimating the distribution of user data is not very high.
The results of varying ε are shown in Fig. 13. In the peak dataset, the MSEs of the proposed method are similar to those of the other methods when ε is large (Fig. 13a). However, the MSEs of the proposed method are the smallest for all the other datasets (Figs. 13b–13e). Similar to the results in Fig. 12, the greater the variation in the dataset, the more pronounced the effectiveness of the proposed method, because the effect of the measurement error is much larger when the variation in the dataset is large.
VOLUME 10, 2022
MSE is well suited to evaluating errors in large histogram values, since it is highly sensitive to them; it therefore fits the scenario where the analyzer wants to know the rough distribution of the data. However, there is another scenario in which the finer details of the data distribution are also of interest. In this case, Jensen–Shannon (JS) divergence [43] is suitable as a utility metric because it can evaluate the errors in small values of a histogram [44]. The results are shown in Figs. 14–17. The JS divergence results follow the same pattern as the MSE results: they also show that our proposed method outperforms the existing methods in terms of accuracy, although the advantage of the proposed method is marginally smaller. In particular, the error of the resulting histogram can be decreased by 40.4% on average in MSE and by 29.6% on average in JS divergence. Therefore, the proposed method is effective both when a data analyst wants to know the underlying distribution of user data broadly and when the analyst wants to know it in a fine-grained way.
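A minimal sketch of the JS divergence on (normalized) histograms, using base-2 logarithms so the value lies in [0, 1]:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two histograms.

    The histograms are normalized to probability distributions first;
    with base-2 logs the result lies in [0, 1].
    """
    sp, sq = float(sum(p)), float(sum(q))
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; 0 * log(0) is treated as 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

assert abs(js_divergence([1, 0], [0, 1]) - 1.0) < 1e-12  # disjoint supports
assert js_divergence([3, 1], [3, 1]) == 0.0              # identical histograms
```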
To summarize the findings, the proposed method outperforms IDUE(R), IDUE(O), NuRR, uRAP, SW, and PM-OPT in terms of accuracy. This tendency is particularly noticeable when there is a lot of sensor noise. The proposed technique takes sensing noise into account, and Algorithms 1 and 2 efficiently limit the amount of noise added while maintaining the level of privacy protection.

G. CALCULATION COST
The anonymizing algorithms described in Algorithms 1 and 2 run on local devices. We tested them on a laptop computer with 8 GB of memory and a Core i5-10210U CPU; the method was executed in less than a second for each data set. The estimation of the distribution of user data runs on a server. We tested it on a workstation with 128 GB of memory and an Intel Xeon W-2295 CPU; it took 7.8 seconds on average for the AReM dataset with 42,239 values. The computation time of the other methods is shorter because their algorithms are simpler. Since the calculation time of our method is proportional to the number of users and the number of attributes, it will take more time as the number of users increases. However, since collecting user data itself takes a certain amount of time, even if estimating the data distribution takes a few minutes, the method is still practical.

V. RELATED RESEARCH WORK
A large number of research studies on anonymized data collection have been carried out. Wang et al. [45] proposed a method to identify the top-k most frequent new terms by collecting term usage data from each person under differential privacy. Kim et al. [46] derived population statistics by collecting differentially private indoor positioning data. Anonymized data collection can also be realized based on encryption approaches [47], [48]. These methods obtain aggregate values; they do not aim to obtain each person's value. Moreover, they do not consider errors in the values. In contrast, in the proposed scenario, the aim is to obtain each person's value with as high accuracy as possible, since services such as recommender systems need individuals' attribute values.
Abul et al. [49] and Sei and Ohsuga [50] proposed location anonymization methods that take location errors into account. These methods achieve k-anonymity, which is a basic privacy metric; however, they cannot be applied to ε-differential privacy.
Ge et al. [51] and Krishnan et al. [52] proposed methods to clean "dirty data" privately. They used differential privacy as a privacy metric and focused on data cleaning for resolving inconsistent attributes of a large database containing several people's true data. They assumed that each database value was true and used the Laplace mechanism without considering errors in the values.
Several studies proposed machine learning methods, such as deep neural networks (deep learning), for IoT sensing values with differential privacy. Shi et al. [53] proposed a reinforcement learning technique for transportation network companies using passengers' data. Xu et al. [54] focused on mobile data analysis in edge computing. Guan et al. [55] applied machine learning to the Internet of medical things. Although these studies use differential privacy as a privacy metric, they do not consider the proposed TDP. By applying TDP, it is believed that the accuracy of their methods would increase while maintaining their privacy protection levels.

A. ERROR PROBABILITY DISTRIBUTION
Because our proposal for numerical attributes assumes that the error follows a normal distribution, the anonymizer must check whether the error follows a normal distribution and obtain that distribution's standard deviation. Notably, the proposed concept can be applied not only to the normal distribution but also to other distributions.
Many studies are based on the assumption that GPS location measurement errors follow a normal distribution, simulating these errors by generating noise that follows a bivariate normal distribution [56]- [58]. In addition to GPS measurements, many studies assume that measurement errors by sensors follow a normal distribution, and many studies have confirmed that measurement errors follow a normal distribution using actual data. Ferreira et al. measured various data from smart meters and generated error values for voltage and reactive power using a normal distribution in their experiments [59]. Sun et al. proposed a method to infer user intentions using spatio-temporal information and user behavior [60]. In their experiments, the measurement noises were drawn from a normal distribution. Xiao et al. proposed an RFID-based localization and tracking system, measuring and analyzing the error of the radar antenna, then describing the measurement error as following a normal distribution [61].
Many researchers, such as [42], [62]–[64], also assume that sensing errors in IoT devices follow a normal distribution. Moreover, several researchers confirmed that actual sensing data follows a normal distribution. For example, Devon et al. collected 29,000 pieces of GPS data and illustrated that a normal distribution fits the data [65]. Wang et al. [66] observed that the pose tracking accuracy of the Microsoft Kinect 2, which can perform real-time gesture recognition, fits a normal distribution. Gao et al. [67] generated sensing samples based on a normal distribution in their experiments. Nguyen et al. [68] discussed how the measurement errors of sensing locations could affect mobile, robotic, and wireless sensor networks. In their proposed algorithm, the location error is modeled to follow a normal distribution. Using a real dataset with sensing errors, they showed that their algorithm achieves high performance.
Therefore, we can assume that the error probability distribution fits a normal distribution in many cases. Usually, sensor vendors provide a data sheet for each sensor product that contains information about the sensor's accuracy. There are several ways to express accuracy. Sometimes it is expressed as a standard deviation; in this case, the anonymizer uses it directly as the standard deviation of the normal distribution. If the accuracy is expressed using the average absolute error (let m be this average), we can obtain the standard deviation of the normal distribution using σ = m√(π/2), since the mean absolute value of a zero-mean normal variable is σ√(2/π).
In IoT systems, machine learning techniques that include deep neural networks have also been used. Estimated values from deep neural networks might include estimation errors, and several researchers, such as [69]–[71], reported that the estimation errors followed a normal distribution. If an anonymizer can obtain several training samples, the anonymizer can estimate the error probability distribution and calculate its standard deviation. The Anderson–Darling test [72], which tests whether samples come from a normal distribution, can be used to check whether the error probability distribution follows a normal distribution.
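The conversion from an average absolute error m to the standard deviation of zero-mean normal noise can be sketched and sanity-checked by Monte Carlo:

```python
import math
import random

def sigma_from_mean_abs_error(m):
    """For zero-mean normal noise, E|X| = sigma * sqrt(2/pi),
    hence sigma = m * sqrt(pi/2)."""
    return m * math.sqrt(math.pi / 2.0)

# Monte Carlo sanity check of the identity.
random.seed(42)
sigma_true = 5.0
n = 200_000
mean_abs = sum(abs(random.gauss(0.0, sigma_true)) for _ in range(n)) / n
assert abs(sigma_from_mean_abs_error(mean_abs) - sigma_true) < 0.05
```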
The Android OS provides APIs for location, speed, and bearing. These APIs return the measurement values and their accuracy at 68% confidence. In a normal distribution, 68% of the data falls within one standard deviation of the mean.
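The one-sigma interpretation of the 68% figure follows directly from the normal CDF:

```python
import math

# Fraction of a normal distribution within one standard deviation of the
# mean: Phi(1) - Phi(-1) = erf(1/sqrt(2)) ~ 0.6827, so an accuracy
# reported at 68% confidence corresponds to one standard deviation.
frac_within_one_sigma = math.erf(1.0 / math.sqrt(2.0))
assert abs(frac_within_one_sigma - 0.6827) < 0.0005
```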
Errors always exist in the measurements regardless of how carefully and scientifically the measurements are performed. However, error analysis allows scientists to evaluate the degree of uncertainty. It has been proven that if the measurements have many small random error sources and negligible systematic errors, the measurements are normally distributed [16].
Although not all measurement errors follow a normal distribution, many measurement errors are considered to follow one, as discussed above. The method proposed here targets the case where measurement errors are considered to follow a normal distribution. On the other hand, our proposed novel concept, TDP, can be applied to any other error model. We hope that this paper is the first step toward error-aware differential privacy.

B. MACHINE LEARNING WITH NOISY IoT DATA
Several studies generated machine learning models from noisy IoT data, and the models could achieve high accuracy [73], [74]. In contrast, studies on differential privacy showed that if the value of the privacy budget is small, i.e., if the noise added using differential privacy techniques is large, the accuracy of machine learning models will deteriorate significantly [53]- [55].
In this paper, we have shown that by considering the observation noise of IoT data, the amount of noise needed to protect privacy can be reduced. Our proposed concept can be used with existing differentially private techniques; therefore, the accuracy of a machine learning model can be improved while maintaining the same privacy protection level. Mechanisms other than the Laplace mechanism and the randomized response mechanism can also achieve differential privacy. Geng and Viswanath [75] proposed the staircase mechanism for numerical values and proved that it performs better when ε is very large. Andrés et al. [76] proposed the geo-indistinguishability mechanism, especially for location information. The authors of this paper believe that TDP can be applied to other privacy metrics that manage probability distributions of attribute values. These privacy metrics include pk-anonymity [77], [78], which is a probabilistic extension of the k-anonymity [79] model, and t-closeness [80], which is a refinement of k-anonymity.
Future work includes applying TDP to other mechanisms to achieve differential privacy and other privacy metrics.
In this paper, the normal distribution is considered as the error distribution. However, other error distributions can also be handled: by replacing N(x; σ^2) with another error distribution, new algorithms for that error distribution can be derived.
In this paper, the target scenario is the collection of anonymized data from each person individually. Privacy-preserving data publishing, in which a data holder possesses a large amount of personal data and anonymizes and publishes it, is another important scenario in the privacy research area; ε-differential privacy can be applied in this case, and TDP can also be applied to it. A concrete discussion of these concepts is left for future work.

VII. CONCLUSION
Differential privacy can protect user privacy by adding noise to a target value that must be protected. Sensing values in IoT environments involve some errors; however, existing solutions have not taken sensing noise into account. In other words, existing systems attempt to protect the measured value in the presence of sensing noise. In contrast, our research aims at protecting the true value. Our technique adjusts the amount of added noise based on the sensor noise model, whereas existing systems do not. Our strategy can lower the amount of noise introduced by the differential privacy technique by roughly 20%. As a result, the mean square error and JS divergence of the resulting histogram can be lowered by 40.4% and 29.6% on average, respectively. A novel privacy model called TDP is introduced and applied to differential privacy, since the data owner or anonymizer might not know the true value. We validated this result on synthetic data and five real data sets. This is the first research work that proposes and applies TDP, and the authors expect many studies based on TDP in the near future.