Associated Attribute-Aware Differentially Private Data Publishing via Microaggregation

Releasing raw data sets that contain sensitive personal information leaks privacy. Therefore, various differential privacy methods have been proposed for efficient data sharing while preserving privacy. However, they focus on noise processing of all quasi-identifier attributes, which results in high time-space complexity and low data utility. In this paper, we propose a Differential Privacy Protection model that considers the Correlations between Attributes, denoted DPPCA. DPPCA first computes the degree of correlation between the quasi-identifier attributes and the sensitive attributes and determines the pair of attributes with the maximal degree of correlation. Then, based on the attributes with the maximal degree of correlation, it uses microaggregation to partition the data set into clusters of size <inline-formula> <tex-math notation="LaTeX">$k$ </tex-math></inline-formula> (<inline-formula> <tex-math notation="LaTeX">$k\geq 2$ </tex-math></inline-formula>) according to three types of attributes, i.e., numerical, non-numerical, and hybrid attributes, such that there are <inline-formula> <tex-math notation="LaTeX">$l$ </tex-math></inline-formula> (<inline-formula> <tex-math notation="LaTeX">$l < k$ </tex-math></inline-formula>) values of sensitive attributes in a cluster. Finally, noise is added to each cluster separately such that each cluster satisfies <inline-formula> <tex-math notation="LaTeX">$\varepsilon $ </tex-math></inline-formula>-differential privacy. While providing the same degree of privacy preservation, our experimental results demonstrate that DPPCA substantially reduces the amount of added noise to 11% for the Census data set and the Adult data set. Therefore, DPPCA greatly improves the data utility while reaching the same degree of differential privacy.


I. INTRODUCTION
Data-sharing mechanisms make cooperation and research among various organizations convenient but substantially increase the risk of privacy disclosure [1]-[3]. However, data owners (e.g., government agencies and hospitals) must outsource data to third parties to obtain better services. If the original data are released without being processed, the privacy of individuals or organizations can easily be disclosed. For example, the Centers for Disease Control and Prevention must collect cases from various medical institutions, and these data often contain a large amount of sensitive information on patients. If the Centers for Disease Control and Prevention releases this original information, then the patients' private information will be disclosed. Therefore, privacy-preserving data release has become desirable. However, conventional privacy protection models, such as k-anonymity and its extensions [4], have various disadvantages. For example, these models always assume that the attacker has background knowledge; however, the background knowledge that is possessed by the attacker is difficult to predetermine. In addition, the anonymized privacy protection model cannot provide an effective and rigorous proof: once the model parameters have been changed, it is impossible to quantitatively analyze the privacy protection level, which severely weakens the reliability of privacy protection. Hence, differential privacy was proposed.
Differential privacy is a research hotspot [5], [6], and it can overcome the disadvantages of conventional privacy protection without considering the background knowledge that is possessed by the attacker. However, differential privacy requires the addition of a substantial amount of noise to the query results, thereby resulting in a limited application scope. Typically, the shortcomings are manifested in two ways: the noise that is added to the output is substantially increased, and the applicability is mainly limited to the release of query results [7]-[9].
It is challenging to prevent the disclosure of private data while improving the utility of the published data. To solve these problems, the noninteractive protection framework [10] converts or compresses the original data and subsequently adds noise to the query results to satisfy ε-differential privacy. Unlike conventional ε-differential privacy methods, microaggregation methods [11], [12] aggregate the original records into clusters of at least k (k ≥ 2) items and add noise to each cluster. For example, Soria-Comas et al. [12] proposed an insensitive microaggregation algorithm for increasing the within-cluster homogeneity; they bounded the sensitivity of the clustering to a change in a single input record. Since much less noise is added to each cluster than to each record, the total added noise is smaller. To the best of our knowledge, the available methods focus on all quasi-identifiers when performing microaggregation operations, and they assume that the data attributes are independent or fully correlated instead of considering partially correlated attributes in the input data set, which is likely to reduce data utility while protecting privacy.
To balance data utility and the protection of sensitive information, we use microaggregation to realize differential privacy. Our method does not process all quasi-identifiers; we process only the attributes that are most relevant to the sensitive attributes, while the other quasi-identifier attributes are anonymized. The objective is to reduce the loss of data utility. We calculate the mutual information between the quasi-identifier attributes and the sensitive attributes; the greater the mutual information is, the higher the correlation. Hence, we identify the most dependent attribute pairs, microaggregate them, and ensure that the result satisfies differential privacy. The main contributions are as follows.
We propose a new DPPCA model for differentially private data publishing, which is designed to reduce the influence of the added noise. The model microaggregates the attribute pair with the largest dependency and makes the resulting clusters satisfy differential privacy. Since the amount of noise depends on the sensitivity of the data set, the noise added to a cluster is much less than the noise added to each record in the data set. To make DPPCA adapt to various types of data and improve its generalization performance, we discuss numerical attributes, categorical attributes, and hybrid attributes, and we design a suitable algorithm for each data type. In addition, we analyze each algorithm theoretically and calculate its time and space complexity. We use the sum of squared errors (SSE) and the record linkage (RL) to evaluate the utility of the proposed model; that is, we select numerical attributes, categorical attributes, and hybrid attributes from the Census and Adult data sets for the evaluation of SSE and RL, respectively. In addition, we elaborate on the explanations for each experimental result.

This paper is organized as follows: Section 2 reviews related work, Sections 3 and 4 present the preliminaries and the proposed microaggregation-based differential privacy protection model DPPCA, Section 5 validates the effectiveness of DPPCA, and Section 6 concludes the paper.

II. RELATED WORK
The ε-differential privacy method does not impose any assumption regarding the attacker's background knowledge. There are two scenarios for implementing differential privacy: interactive scenarios and noninteractive scenarios. In interactive scenarios, data analysis is limited because only a limited number of queries can be answered; in noninteractive data publishing, all queries can be answered at once. According to the characteristics of the two scenarios, noninteractive scenarios provide higher data usage flexibility [13]. A typical noninteractive approach is histogram publishing, whose objectives are to approximate the data distribution by partitioning the data domain and to calculate the number of records in each partitioned set. The accuracy of a histogram query depends mainly on the number of records that are included: if the number of records is small, the relative error is large, i.e., the accuracy of the histogram query cannot be guaranteed. To overcome this problem, Zhang et al. [14] considered the data utility of differential privacy and constructed a Bayesian network from the data prior to adding noise; in addition, they theoretically analyzed the privacy and provided a utility guarantee. Cormode et al. [15], [16] analyzed in detail the connections, advantages, and disadvantages of using microaggregation to realize differential privacy. In short, the methods for improving the utility of differentially private data publishing differ in terms of their limitations. Compared with conventional differential privacy methods, microaggregation methods can improve the data utility and do not impose excessive restrictions on the query type. For example, Soria-Comas et al. [11] exploited the synergy between differential privacy and k-anonymity when publishing anonymous data. However, their method did not sufficiently consider the utility of numerical attributes. Martínez et al. [17] studied categorical attributes and proposed a semantic adaptive maximum-distance-to-average-record algorithm. However, this algorithm is not efficient, since the size of each cluster cannot be determined in advance, which causes a large information loss during the data conversion of clusters. To reduce the information loss of microaggregation, Soria-Comas et al. [12] proposed the use of an insensitive microaggregation method to obtain an anonymized data set; they add random noise to every quasi-identifier attribute except the sensitive attributes. Sánchez et al. [18] used the microaggregation of individual rankings to reduce the scale parameters of the added noise in most cases. Masooma [19] proposed a stable microaggregation framework that incorporates microaggregation and differential privacy into the data dissemination process. In addition, many researchers have proposed solutions for large-scale graph data release [20] and privacy-preserving big data publishing [21]. According to the above analysis, the available microaggregation-based differentially private data publishing models do not consider the association between the quasi-identifier attributes and the sensitive attributes. This paper proposes the realization of a differentially private data release via microaggregation over the associated attributes; namely, we investigate how to effectively reduce the sensitivity of the data set and improve the signal-to-noise ratio while realizing differential privacy.

III. PRELIMINARIES
Prior to presenting our model formally, we briefly define the symbols that are used in this article. Suppose there are n records in the original data set D and the data attributes consist of the identifier (ID) attribute, the quasi-identifier (QI) attributes, and the sensitive (SA) attributes. The identifier attribute is removed before the data are released; hence, we do not discuss it here. The QI attributes and SA attributes are denoted $\{A_i \mid i = 1, \ldots, m\}$ and $\{S_j \mid j = 1, \ldots, s\}$, respectively, where m and s are the numbers of QI attributes and SA attributes, respectively. For simplicity, we discuss only a single sensitive attribute, namely, $\{S_j\}$ with j = 1. Multiple sensitive attributes can be generalized according to the processing method of the QI attributes, which is not described in detail here.

A. ε-DIFFERENTIAL PRIVACY
The ε-differential privacy model does not make any assumptions regarding the attacker's background knowledge and has a rigorous mathematical basis. Under this model, the presence or absence of a single record has a negligible effect on the output over the data set. Thus, this model has attracted the attention of many researchers in computer science and statistics. The definition of ε-differential privacy is as follows:

Definition 1 (ε-Differential Privacy): For any data sets $D_1$ and $D_2$ that differ by one record, a randomized algorithm A provides ε-differential privacy protection if and only if the following is satisfied:

$$\Pr[A(D_1) \in S] \leq e^{\varepsilon} \cdot \Pr[A(D_2) \in S]$$

where S is any subset of the output range of algorithm A. Definition 1 implies that even if the attacker knows most of the records in the original data set, she/he still cannot accurately determine whether a record is in $D_1$ or $D_2$. To realize differential privacy, Dwork et al. [22] proposed the Laplace mechanism: for a query function F, the differential privacy mechanism generates the actual result as middleware and adds Laplacian noise to the response of the query. The amount of added noise depends on the sensitivity, which is the maximum change in the query result after the removal of a record from the data set or the addition of a record to the data set, i.e., the added noise level is closely related to the global sensitivity.

Definition 2 ($L_1$-Sensitivity [23]): For any two adjacent data sets $D_1$ and $D_2$ and a function $f: D \rightarrow R^d$, the $l_1$-sensitivity of the function f is defined as

$$\Delta f = \max_{D_1, D_2} \|f(D_1) - f(D_2)\|_1$$

where R represents the mapped real space and d represents the query dimension of the function f.
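As a concrete illustration of Definitions 1 and 2 (not an algorithm from this paper), consider a counting query: adding or removing one record changes the count by at most 1, so its $l_1$-sensitivity is 1, and the Laplace mechanism adds noise of scale $\Delta f / \varepsilon$. A minimal Python sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(data, predicate, epsilon):
    """Laplace mechanism on a counting query: adding or removing one
    record changes the count by at most 1, so Delta_f = 1 and the
    noise scale is Delta_f / epsilon."""
    true_count = sum(1 for r in data if predicate(r))
    return true_count + rng.laplace(0.0, 1.0 / epsilon)

# Smaller epsilon (stronger privacy) yields noisier answers.
noisy = laplace_count(range(100), lambda r: r < 40, epsilon=0.5)
```

The scale parameter of the Laplace distribution grows as ε shrinks, which is the trade-off between privacy and accuracy discussed above.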

B. MICROAGGREGATION
To effectively reduce the amount of added noise when publishing differentially private data, we microaggregate the original data; namely, we divide the data set into several clusters, where each cluster contains at least k records. The records within a cluster are as similar as possible, and records in different clusters are as different as possible. Microaggregation is realized in two steps:
1) Partitioning: The original data set is divided into clusters of size at least k, and the records in each cluster are as similar as possible to each other.
2) Aggregation: After the data set has been divided into clusters, each original record is replaced by the mean or median of its cluster, i.e., every record is replaced by a representative record of the cluster.
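The two steps can be sketched for a single numerical attribute as follows. This is a toy illustration, not the insensitive MDAV used later: records are sorted, cut into groups of at least k, and each record is replaced by its group mean.

```python
import numpy as np

def microaggregate_1d(values, k):
    """Minimal univariate microaggregation: sort, partition into groups
    of at least k records, then replace each record by its group mean."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    out = np.empty(len(values))
    i = 0
    while i < len(order):
        # the last group absorbs the remainder so every cluster has >= k records
        j = len(order) if len(order) - i < 2 * k else i + k
        idx = order[i:j]
        out[idx] = values[idx].mean()  # aggregation step
        i = j
    return out

microaggregate_1d([1, 2, 3, 10, 11, 12], k=3)  # -> [2, 2, 2, 11, 11, 11]
```

The published values reveal only cluster means, which is what later allows the noise to be calibrated per cluster instead of per record.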

C. MUTUAL INFORMATION
In probability theory and information theory, the mutual information is a measure of the mutual dependence between two random variables. The mutual information between two discrete random variables X and Y is defined as

$$I(X; Y) = \sum_{x}\sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}$$

where p(x, y) is the joint probability distribution function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y, respectively. For continuous random variables,

$$I(X; Y) = \int\!\!\int p(x, y) \log \frac{p(x, y)}{p(x)p(y)} \, dx \, dy$$

where p(x, y) is the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y, respectively. The larger the mutual information value is, the higher the correlation. For the QI attribute $Q_i$ and the sensitive attribute S in data set D, the mutual information is

$$I(Q_i; S) = \sum_{j=1}^{d_i} \sum_{k=1}^{d_s} \Pr(\varphi_i^j, \varphi_s^k) \log \frac{\Pr(\varphi_i^j, \varphi_s^k)}{\Pr(\varphi_i^j)\Pr(\varphi_s^k)} \quad (5)$$

where $\varphi_i^j$ and $\varphi_s^k$ range over the attribute domains, $\Pr(\varphi_i^j)$ is the probability of each categorical attribute value over its attribute domain, $\Pr(\varphi_i^j, \varphi_s^k)$ is the joint probability of the categorical attribute values, and $d_i$ and $d_s$ represent the numbers of attribute values of the QI attribute and the SA attribute, respectively.
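The attribute-pair selection can be sketched with an empirical estimator of the discrete formula above; the attribute names A1, A2 and the toy data below are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) of two discrete attribute
    columns (in nats), following the discrete formula above."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Pick the QI attribute most dependent on the sensitive attribute SA.
qi = {"A1": [0, 0, 1, 1], "A2": [0, 1, 0, 1]}
sa = [0, 0, 1, 1]
best = max(qi, key=lambda a: mutual_information(qi[a], sa))  # "A1"
```

Here A1 is perfectly dependent on SA (I = ln 2) while A2 is independent of it (I = 0), so A1 is selected for microaggregation.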

IV. ATTRIBUTE-RELATED DIFFERENTIALLY PRIVATE DATA PUBLISHING ALGORITHM
To reduce the amount of added noise and to improve the utility of data publishing, we determine the correlation between the attribute pairs of the original dataset and add noise to each cluster after microaggregation.

A. DIFFERENTIALLY PRIVATE DATA PUBLISHING FOR NUMERICAL ATTRIBUTES
The dataset contains numerical attributes and categorical attributes, and each type of attribute is processed differently. We will discuss them separately below. This section mainly discusses the microaggregation of numerical attributes and the process of implementing differential privacy.

1) MICROAGGREGATION OF NUMERICAL ATTRIBUTES
Microaggregation has been applied in many scenarios in which data must be partitioned. For example, in [12], the microaggregation method is used to realize differential privacy based on k-anonymity. However, microaggregation alone is not sufficient to protect the data set directly. After comprehensively comparing these microaggregation methods, we decided to modify MDAV [24] to realize differentially private data release. In Algorithm 1, we present an improved MDAV method for numerical attributes.
Since the sensitivity has a substantial impact on the results of microaggregation and the general MDAV does not bound the sensitivity of the clustering to changes in the input, we use the insensitive microaggregation algorithm [12], for which deleting or adding a record does not substantially impact the entire clustering. For example, consider a data set D that contains two attributes and 15 records. Figure 1 presents the original D and the insensitive MDAV clustering result after the modification of a record, where the cluster size is k = 5 and × represents the cluster core. According to Figure 1, changing only one record under insensitive microaggregation does not affect the entire clustering result. Such clustering has lower sensitivity and can also reduce the privacy budget of the subsequent noise addition.

2) DIFFERENTIALLY PRIVATE PROTECTION OF NUMERICAL ATTRIBUTES
Suppose Ir(*) is a function that returns the attribute value of the r-th record of D. To realize differential privacy, we perform an insensitive microaggregation operation on the original data set D to obtain a data set D′, and then we add Laplacian noise to D′ to obtain a differentially private data set $D_\varepsilon$.

Algorithm 1 Attribute-Dependent Microaggregation Algorithm for Numerical Attributes
Input: Original data set D = {QI, SA}, the cluster size k
Output: Microaggregated data set D′ with attribute dependency
1: Identify the attribute pair that has the maximum mutual information $I(A_i, SA)_{max}$ via formula (5);
2: while (|D| >= 3k) do
3: x ← average record of D; // select the mean as the initial cluster center
4: r ← the most distant record from x in D;
5: s ← the most distant record from r in D;
6: Construct a cluster with r and its k − 1 closest neighbors;
7: Make the SA attribute in the cluster have l (2 < l < k) distinct values;
8: Remove these records from D;
9: Construct a cluster with s and its k − 1 closest neighbors;
10: Make the SA attribute in the cluster have l (2 < l < k) distinct values;
11: Remove these records from D;
12: end while
13: if (|D| >= 2k) then
14: x ← average of the remaining records of D;
15: r ← the most distant record from x in D;
16: Construct a cluster with r and its k − 1 closest neighbors;
17: Remove these records from D;
18: end if
19: Construct another cluster with the remaining records of D;
20: Replace the records in each cluster with the cluster centroid $x_i$;
21: return D′
Theorem 1: For any function $f: D \rightarrow R^d$, if the output of algorithm A satisfies

$$A(D) = f(D) + \left(\mathrm{Lap}_1(\Delta f/\varepsilon), \ldots, \mathrm{Lap}_d(\Delta f/\varepsilon)\right)$$

then A satisfies ε-differential privacy, where $\mathrm{Lap}_i(\Delta f/\varepsilon)$ denotes independent Laplacian noise with scale $b = \Delta f/\varepsilon$.

Proof: From the Laplace distribution function $p(x) = \frac{1}{2b}\exp(-\frac{|x|}{b})$, for adjacent data sets D and D′ it follows that

$$\frac{\Pr[A(D) = t]}{\Pr[A(D') = t]} = \prod_{i=1}^{d} \frac{\exp(-\varepsilon |f(D)_i - t_i|/\Delta f)}{\exp(-\varepsilon |f(D')_i - t_i|/\Delta f)} \leq \exp\left(\frac{\varepsilon \|f(D) - f(D')\|_1}{\Delta f}\right)$$

where the bound follows from the absolute value (triangle) inequality. Since $\|f(D) - f(D')\|_1 \leq \Delta f$, the ratio is at most $e^{\varepsilon}$, so $\mathrm{Lap}_i(\Delta f/\varepsilon)$ satisfies the definition of differential privacy. This concludes the proof.

For the attribute $Q_{max}$ that has the highest correlation with SA, we must calculate Ir(D) = |maximum attribute value − minimum attribute value|. Here, the sensitivity is $\Delta f = \frac{n}{k} \times \frac{Ir(X)}{k}$, where n is the number of records in the data set and k is the size of the cluster. A probability $p(x) \in (0, 1)$ is randomly generated; from $p(x) = \frac{1}{2b}\exp(-\frac{|x|}{b})$, we know that $x = -b\ln(2b\,p(x))$ is the noise to be added. The amount of noise is proportional to $\Delta f$ and inversely proportional to ε.
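The per-cluster noise addition can be sketched as follows. This is a hedged illustration rather than the paper's exact mechanism: each cluster centroid is a mean of k records, so the sketch perturbs it with Laplace noise whose scale is the within-cluster range divided by k·ε, showing how microaggregation shrinks the noise relative to record-level perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_cluster_laplace(clusters, epsilon):
    """Hedged sketch of per-cluster perturbation: each cluster centroid
    (a mean of k records) is released with Laplace noise whose scale is
    the within-cluster range divided by (k * epsilon), so the noise
    shrinks as the cluster size k grows."""
    noisy = []
    for c in clusters:
        c = np.asarray(c, dtype=float)
        spread = c.max() - c.min()          # attribute range inside the cluster
        centroid = c.mean() + rng.laplace(0.0, spread / (len(c) * epsilon))
        noisy.append([centroid] * len(c))   # every record gets the noisy centroid
    return noisy
```

Larger k means each centroid averages more records and tolerates less change per record, hence less noise for the same ε.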

B. CATEGORICAL ATTRIBUTE DIFFERENTIALLY PRIVATE DATA PUBLISHING METHOD
For categorical attributes, we cannot add Laplacian noise as we do for numerical attributes; instead, we use the exponential mechanism [25]. The exponential mechanism traverses the entire data domain, and we finally obtain a data set that satisfies differential privacy. The data set responds to all queries regarding the category concepts with the specified accuracy under the premise of privacy protection.

1) MICROAGGREGATION OF CATEGORICAL ATTRIBUTES
Considering the utility of the anonymized data, we use a semantic distance to measure categorical attributes. Formally, given the distance function d of the data space, the centroid of a set $x_1, x_2, \ldots, x_n$ can be defined as

$$\mathrm{centroid}(x_1, \ldots, x_n) = \arg\min_{c} \sum_{i=1}^{n} d(x_i, c)$$

where $x_1, x_2, \ldots, x_n$ are the n tuples of a cluster and c ranges over a centroid candidate set. In this paper, we use a taxonomy [26] to estimate the similarity between two categorical attribute values (such as a and b) [25]; the shortest path length between the connected elements is calculated via the taxonomic relationships.

$$\mathrm{dis}(a, b) = \log_2\left(1 + \frac{|T(a) \cup T(b)| - |T(a) \cap T(b)|}{|T(a) \cup T(b)|}\right)$$

Here, T(i) is the set of taxonomic ancestors of element i, including itself. The longer the path is, the larger the distance of the attribute-value pair. Let $path(a, b) = \langle l_1, \ldots, l_k \rangle$ be a path that connects attribute values a and b; |path(a, b)| = k is the length of this path. Considering all possible paths from a to b, we use $\mathrm{dis\_sem}_o(a, b)$ to represent the distance of the element pair (a, b), which can be defined as

$$\mathrm{dis\_sem}_o(a, b) = \min_{path(a, b)} |path(a, b)|$$

Definition 3 (Weighted Distance of Categorical Attributes):
Consider a centroid candidate $c_j$ and single-variable attributes $V = \langle v_1, \omega_1 \rangle, \ldots, \langle v_n, \omega_n \rangle$, where $v_i$ is an attribute value tag and $\omega_i$ is the number of times it appears in the data set. The weighted distance of the categorical attributes $sd_o$ from all elements in V is defined as the sum of the distances between $c_j$ and each $v_i \in V$, weighted by the frequency $\omega_i$ of $v_i$ in the data set:

$$sd_o(c_j, V) = \sum_{i=1}^{n} \omega_i \cdot \mathrm{dis}(c_j, v_i) \quad (11)$$

An input data set with univariate attributes can be expressed as $V = \langle v_1, \omega_1 \rangle, \ldots, \langle v_n, \omega_n \rangle$. The steps are as follows: (I) map the values in V to the concepts in the taxonomic tree O; (II) extract semantically related concepts from O; (III) analyze the taxonomic relationships; and (IV) retrieve new concepts (taxonomic ancestors), which become the centroid candidates for the values in V. An example of solving for the distance between attribute values is presented as follows.
Example 2: Consider a univariate patient data set in which each record corresponds to an individual. The following set of attribute values can be extracted: $V_1$ = {<colic, 1>, <lumbago, 3>, <appendicitis, 1>, <pain, 1>, <migraine, 2>, <gastritis, 1>}. These values are mapped to concepts that are found in the WordNet dictionary, and the minimum hierarchical tree is extracted. Fig. 2 illustrates the hierarchical tree structure [27]. The process of constructing the centroid and microaggregating the categorical attributes is presented in Algorithm 2. We first extract a set of attribute values from V, where $V = \langle v_1, \omega_1 \rangle, \ldots, \langle v_i, \omega_i \rangle$, in which $v_i$ denotes the attribute tag value and $\omega_i$ denotes the number of occurrences of $v_i$, and then we map the attribute values to the WordNet model and extract the minimum hierarchical structure $H_{WN}$.
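The set-based distance and frequency-weighted centroid selection can be sketched as follows; the three-concept taxonomy is a made-up toy, not the WordNet tree of Fig. 2.

```python
import math

def dis(Ta, Tb):
    """Set-based taxonomic distance between two concepts a and b, given
    their ancestor sets T(a) and T(b) (each including the concept itself)."""
    union = Ta | Tb
    return math.log2(1 + (len(union) - len(Ta & Tb)) / len(union))

def weighted_distance(Tc, V):
    """Weighted categorical distance sd_o of a centroid candidate (with
    ancestor set Tc) to all values <v_i, w_i>, weighted by frequency w_i."""
    return sum(w * dis(Tc, Tv) for Tv, w in V)

# Toy taxonomy: "pain" subsumes "colic" and "lumbago".
T = {
    "pain":    {"pain"},
    "colic":   {"pain", "colic"},
    "lumbago": {"pain", "lumbago"},
}
V = [(T["colic"], 1), (T["lumbago"], 3)]
best = min(T, key=lambda c: weighted_distance(T[c], V))  # "lumbago"
```

Because "lumbago" occurs three times and "colic" once, the frequency weighting pulls the centroid toward the more common value rather than the shared ancestor "pain".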

2) DIFFERENTIALLY PRIVATE PROTECTION OF CATEGORICAL ATTRIBUTES
For categorical attributes, we add exponential noise to realize a differentially private data release. The exponential mechanism is defined as follows.

Theorem 2 [28]: Given a scoring function $u: (D \times O) \rightarrow R$, if algorithm A satisfies

$$\Pr[A(D) = r] \propto \exp\left(\frac{\varepsilon\, u(D, r)}{2\Delta u}\right)$$

then A satisfies ε-differential privacy, where $\Delta u = \max_{\forall r, D_1, D_2} |u(D_1, r) - u(D_2, r)|$ is the global sensitivity of the score function u(*). According to this formula, the higher the score is, the higher the probability that the output is selected.

The clustering steps of Algorithm 2 (continuing from the centroid construction described above) are as follows:
3: Calculate the shortest path of each attribute tag value according to $H_{WN}$, and use formula (11) to find the weighted distance of the categorical attributes $sd_o$;
4: $v_r$ ← the most distant record value from $v_0$ in V;
5: Form a cluster with $v_r$ and its closest k − 1 record values in D;
6: Remove the clustered records from D;
7: $v_s$ ← the most distant record value from $v_r$ in V;
8: Form a cluster with $v_s$ and its closest k − 1 record values in D;
9: Remove the clustered records from D;
10: end while
11: if (|D| >= 2k) then
12: $v_0$ ← the centroid of the remaining records in the data set;
13: $v_r$ ← the most distant record value from $v_0$ in V;
14: Form a cluster with $v_r$ and its closest k − 1 record values in D;
15: Remove the clustered records from D;
16: end if

A method for realizing differential privacy is to select the centroid probabilistically, with a probability that is proportional to the expected level ε of differential privacy.
Consider a function with a discrete output t. The exponential mechanism selects a near-optimal output based on the input data D and the availability function q(D, t) while preserving privacy. Each output is associated with a selection probability Pr(t), which increases exponentially with the quality criterion:

$$\Pr(t) \propto \exp\left(\frac{\varepsilon\, q(D, t)}{2\Delta q}\right) \quad (13)$$

Via this approach, the exponential mechanism is more likely to choose the best output. The detailed process is presented in Algorithm 3.
Proof: The result is output with a certain probability, rather than by directly adding noise to numerical data, so the quality scores must be normalized and expressed as probabilities:

$$\Pr[A(x) = T_i] = \frac{\exp(\varepsilon q(x, T_i)/2\Delta q)}{\sum_j \exp(\varepsilon q(x, T_j)/2\Delta q)}$$

For adjacent inputs x and x′, the first term of the resulting ratio satisfies

$$\frac{\exp(\varepsilon q(x, T_i)/2\Delta q)}{\exp(\varepsilon q(x', T_i)/2\Delta q)} \leq \exp\left(\frac{\varepsilon}{2}\right)$$

since $|q(x, T_i) - q(x', T_i)| \leq \Delta q$. The second term satisfies

$$\frac{\sum_i \exp(\varepsilon q(x', T_i)/2\Delta q)}{\sum_i \exp(\varepsilon q(x, T_i)/2\Delta q)} \leq \exp\left(\frac{\varepsilon}{2}\right)$$

From the above formulas, we obtain

$$\frac{\Pr[A(x) = T_i]}{\Pr[A(x') = T_i]} \leq \exp(\varepsilon)$$

Hence, Theorem 2 also follows in this case.

Algorithm 3 ε-Differential Privacy of Categorical Attributes
Input: Original data set D with n records, clusters $c_i$, $|c_i| \geq k$, $i = 1, \ldots, n$
Output: Data set D′ with ε-differential privacy
1: for each record of $C_i$ do
2: Let $A_j$ be the attribute of cluster $C_i$ and $|A_j| = t$ be the number of attribute values;
3: Count the frequency of occurrence of each $A_j^t$;
4: Apply the quality standard function q(*, *) for each centroid candidate $a_j^t$ in $\tau(A_j)$, where $\tau(A_j)$ is the domain of $A_j$;
5: Calculate the attribute value after noise has been added according to formula (13), and extract the value of $S(A_i)$ randomly;
6: The calculated probability value of each attribute is denoted $p(A_j^t)$ and satisfies $\arg\min \sum_{t=1}^{i} p(A_j^t) > pr(a_j^t)$;
7: $pr(a_j^t) \propto \exp\left(\frac{\varepsilon \times q(S(A_j), a_j^t)}{2\frac{n}{k}(\Delta q(A_j))}\right)$ satisfies the exponential mechanism;
8: end for
9: return D′
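The selection step at the core of Algorithm 3 rests on the generic exponential mechanism, which can be sketched as follows; the candidate names and quality scores are hypothetical.

```python
import math
import random

random.seed(0)

def exponential_mechanism(candidates, quality, sensitivity, epsilon):
    """Generic exponential mechanism: sample candidate r with probability
    proportional to exp(epsilon * u(D, r) / (2 * Delta_u))."""
    weights = [math.exp(epsilon * quality[r] / (2.0 * sensitivity))
               for r in candidates]
    total = sum(weights)
    return random.choices(candidates, weights=[w / total for w in weights])[0]

# Hypothetical centroid candidates with quality scores: higher-quality
# candidates are exponentially more likely to be released.
q = {"pain": 3.0, "colic": 1.0, "migraine": 0.5}
pick = exponential_mechanism(list(q), q, sensitivity=1.0, epsilon=2.0)
```

Unlike the Laplace mechanism, no noise is added to any value; the randomness lies entirely in which candidate is chosen, which is why it applies to categorical outputs.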

C. DIFFERENTIALLY PRIVATE DATA RELEASE FOR HYBRID ATTRIBUTES
In general, in addition to data sets with purely numerical attributes, data sets with hybrid attributes are more common. Such data sets contain both numerical and non-numerical attributes. Before these data sets are released, noise must be added to satisfy ε-differential privacy. For a data set D with hybrid attributes [24], the record distance can be defined by combining the respective attribute values:

$$dist(x_1, x_2) = \sqrt{\sum_{i=1}^{m} \left(\frac{dist(a_1^i, a_2^i)}{dist(a_b^i, a_t^i)}\right)^2} \quad (16)$$

where $a_j^i$, $i = 1, \ldots, m$, are the coordinates of $x_j$ (j = 1, 2); $dist(a_1^i, a_2^i)$ is the distance between the values of the i-th attribute $A_i$ in $x_1$ and $x_2$; and $dist(a_b^i, a_t^i)$ is the distance between the boundaries of the domain $Dom(A_i)$, which is used to eliminate the impact of the attribute scale.
Similarly, we can use the weighted distance to microaggregate the attributes. During hybrid attribute clustering, we determine the attribute type and apply Algorithm 1 or Algorithm 2 accordingly. In summary, noise is added per attribute type so that each attribute satisfies Definition 1. With this calculation method for hybrid attributes, we can easily formulate a hybrid-attribute differential privacy algorithm from Algorithm 1 and Algorithm 2.
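Formula (16) can be sketched as follows; `cat_dist` stands in for the categorical (semantic) distance of the previous subsection, and the example records are hypothetical.

```python
import math

def hybrid_distance(x1, x2, kinds, ranges, cat_dist):
    """Sketch of the normalized mixed-attribute record distance of
    formula (16): each attribute's distance is divided by the spread of
    its domain so that large-range attributes do not dominate."""
    total = 0.0
    for a1, a2, kind, spread in zip(x1, x2, kinds, ranges):
        d = abs(a1 - a2) if kind == "num" else cat_dist(a1, a2)
        total += (d / spread) ** 2
    return math.sqrt(total)

# Hypothetical records with one numerical and one categorical attribute;
# a trivial 0/1 categorical distance is assumed for illustration.
d = hybrid_distance((30, "US"), (40, "UK"), ("num", "cat"), (50.0, 1.0),
                    lambda a, b: 0.0 if a == b else 1.0)
```

The division by the domain spread is what lets a 10-year age gap and a categorical mismatch contribute on comparable scales.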

D. TIME COMPLEXITY ANALYSIS
The time complexity of an algorithm reflects how the program execution time grows with the input scale and can be used to compare algorithms. For the above algorithms, the input scales are all n. For the algorithms for numerical and categorical attributes, including the cost of calculating the mutual information, the time complexities of Algorithm 2 and Algorithm 3 are $O(n^2)$.

V. EXPERIMENTAL EVALUATION

A. EXPERIMENTAL SETUP
To evaluate the performance of the model, we use the Census and Adult data sets. The Census data set contains 1080 records with only numeric attributes, and the Adult data set contains 30162 records. For the Census data set, we select the Fica, Fedtax, Intval and Pothval attributes as the QI attributes and Ptotval as the SA attribute. For the Adult data set, we select categorical attributes Education-Level and Native_Country as the QI attributes and Occupation as the SA attribute, and these attributes are denoted by Adult-Categorical. To evaluate the performance of our method on hybrid data sets, we select Occupation, Native_Country, Age, and Hours_per_week as the QI attributes and Occupation as the SA attribute from the Adult data sets, and these attributes are denoted Adult-Hybrid. Considering that attributes with a large value difference will have a greater impact on the results, we use (0,1) normalization to normalize the attributes. The experimental environment is an Intel Core 2.20 GHz CPU with 8 GB of RAM, the operating system is Windows 10 Ultimate, and the programming language is Java.

B. EXPERIMENTAL METRICS
We use the sum of squared errors (SSE) and the record linkage (RL) to measure the data utility and the leakage risk of the models. Here, SSE is defined as the squared sum of the attribute distances between the original data set X and its anonymized version X′:

$$SSE = \sum_{j=1}^{n} \sum_{i} \left(a_j^i - (a_j^i)'\right)^2$$

where $a_j^i$ is the i-th attribute value of the j-th original record and $(a_j^i)'$ is its perturbed version. RL is the percentage of the original records that are correctly matched to the anonymized data set X′:

$$RL = \frac{\sum_{j=1}^{n} \Pr(x_j)}{n}$$

where n is the number of original records. RL measures the actual privacy from the natural perspective of privacy attacks.
The probability $\Pr(x_j)$ is calculated as follows:

$$\Pr(x_j) = \begin{cases} 1/|G|, & x_j \in G \\ 0, & \text{otherwise} \end{cases}$$

where G is the set of original records with the minimum distance from $x_j$. If the original record $x_j$ is in G, $\Pr(x_j)$ is 1/|G|; otherwise, $\Pr(x_j)$ is 0.
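Both metrics can be sketched as a direct transcription of the definitions above, with Euclidean attribute distance assumed for the linkage step.

```python
import numpy as np

def sse(X, Xp):
    """Sum of squared attribute distances between the original data set
    X and its anonymized version X'."""
    return float(np.sum((np.asarray(X, float) - np.asarray(Xp, float)) ** 2))

def record_linkage(X, Xp):
    """RL: average over anonymized records of Pr(x_j), where Pr(x_j) is
    1/|G| if the true source x_j lies in the set G of nearest original
    records (Euclidean distance assumed), and 0 otherwise."""
    X, Xp = np.asarray(X, float), np.asarray(Xp, float)
    total = 0.0
    for j, xp in enumerate(Xp):
        dists = np.linalg.norm(X - xp, axis=1)
        G = np.flatnonzero(np.isclose(dists, dists.min()))
        total += 1.0 / len(G) if j in G else 0.0
    return total / len(X)
```

A lightly perturbed release keeps SSE small but leaves RL high (records remain linkable); heavier perturbation trades SSE for a lower RL, which is exactly the utility-privacy trade-off the experiments examine.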

C. RESULTS AND ANALYSIS
Before the experiments, we preprocessed and normalized the entire data set. To eliminate the effects of random errors, we execute each algorithm 10 times and average the results. DPPCA denotes our model, and MDAV_QI_DP denotes the MDAV [12] in which noise is added to the quasi-identifier attributes to satisfy ε-differential privacy. We first apply Algorithm 1 and Algorithm 2 and add noise that satisfies ε-differential privacy for numerical attributes and categorical attributes, respectively. Figure 3 presents the SSE and RL for the numerical attributes in the Census data set as functions of the parameters k and ε; Figure 3(a) and (b) compare DPPCA with MDAV, and Figure 3(c) and (d) compare the performance of DPPCA with conventional ε-differential privacy. According to Figure 3(a), the overall data utility ranking is as follows: MDAV > DPPCA > MDAV_QI_DP. However, the MDAV algorithm is only a clustering algorithm without noise, and its privacy protection is limited. For k ≈ √n or larger, the SSE gradually decreases until it reaches a steady state, i.e., the information loss is minimized. According to Figure 3(b), the RL of the MDAV model fluctuates substantially, and the likelihood of identity leakage is high. As k increases, the RL of DPPCA remains stable at approximately 0.1, i.e., the RL value of DPPCA is relatively small, and the privacy of the anonymous output is improved. According to Figure 3(c), the SSE of DPPCA decreases regardless of the value of ε. When ε is 10, the SSE is minimal, and it remains stable after k > 33. From Figure 3(d), the overall RL is lower than 0.6%, although the RL of DPPCA fluctuates substantially as k increases, thereby indicating that the risk of privacy leakage of DPPCA is low. In summary, DPPCA improves the data availability on the Census data set while demonstrating satisfactory data privacy. Even though the prior microaggregation step also causes a loss of information, the loss is small compared to the noise to be added, and it has a negligible effect on the results of DPPCA.
After evaluating the performance of our proposed model on numerical attributes, we evaluate it on categorical attributes. Figure 4(a) and (b) compare DPPCA with the general microaggregation algorithm MDAV, and Figure 4(c) and (d) compare DPPCA with conventional ε-differential privacy. According to Figure 4(a), the SSE of MDAV fluctuates substantially since MDAV is not designed for non-numerical attributes, whereas the SSE of DPPCA gradually decreases until it stabilizes. In Figure 4(a), regardless of the value of ε, the SSE of DPPCA is significantly lower than those of MDAV_QI_DP and MDAV; hence, the information loss of DPPCA is small, and DPPCA also has an advantage with respect to the risk of privacy leakage. Figure 4(b) shows that the record linkage percentages of MDAV are very high when ε = 0.01 and 0.1, and the RL of DPPCA is slightly larger. When ε = 1 and 10, the RL values of both models are substantially reduced. According to Figure 4(c), the smaller the value of ε, the larger the SSE, i.e., the larger the information loss. According to Figure 4(d), the larger the value of ε, the larger the value of RL. In summary, the smaller the value of RL, the lower the risk of identity leakage and the higher the privacy of the anonymized output. From this experiment, the balance between data utility and privacy can be achieved when ε = 1. DPPCA outperforms the conventional differential privacy method in terms of both information loss and leakage risk, and its design is more reasonable.
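The monotone behavior of SSE in ε observed above is a direct property of the Laplace mechanism: the noise scale is sensitivity/ε, so the expected absolute noise grows as ε shrinks. A minimal numerical check (the sensitivity value 1.0 is illustrative, not taken from the paper):

```python
def laplace_scale(sensitivity, epsilon):
    # Scale parameter b of Laplace(0, b); the expected absolute noise equals b,
    # so a smaller epsilon means larger expected distortion (larger SSE).
    return sensitivity / epsilon

# The four privacy budgets used in the experiments.
scales = {eps: laplace_scale(1.0, eps) for eps in (0.01, 0.1, 1.0, 10.0)}
# scales[0.01] > scales[0.1] > scales[1.0] > scales[10.0]
```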
After evaluating the performance of our model on numerical and categorical attributes, we evaluate its performance on hybrid attributes. Figure 5(a) and (b) compare DPPCA with MDAV, and Figure 5(c) and (d) compare DPPCA with conventional ε-differential privacy. According to Figure 5(a), the SSE decreases as k increases, and DPPCA and MDAV_QI_DP have similar SSE curves. When ε = 1 and k > 174, the SSE is relatively stable and relatively small. When ε = 10 and k > 66, the SSE remains unchanged. In Figure 5(b), the RL of DPPCA is very close to zero, so the privacy performance of the model is not easy to evaluate. Since the hybrid attributes differ in their characteristics and in their sensitivities to the added noise, the changes in the curves are not readily observed. In summary, we must consider the balance between data utility and privacy.
In Figure 5(c) and (d), as the value of k increases, the SSE and RL of DPPCA remain approximately constant and close to zero. For hybrid attributes, under either DPPCA or conventional ε-differential privacy, the values of k and ε have only a minimal effect on the data; hence, the changes in the curves are small.
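The record linkage (RL) risk reported in these experiments can be approximated with a nearest-record matching test. The definition below is our own simplified, index-aligned variant for a single numerical attribute; the paper's exact linkage procedure may differ:

```python
def record_linkage(original, published):
    """Fraction of published records whose nearest original record
    is the record they were actually derived from (same index)."""
    linked = 0
    for i, p in enumerate(published):
        nearest = min(range(len(original)), key=lambda j: abs(original[j] - p))
        if nearest == i:
            linked += 1
    return linked / len(published)
```

A lower RL means an attacker re-identifies fewer records: for example, publishing every record of a cluster as one shared centroid makes most records within that cluster unlinkable, which is why the RL of DPPCA stays near zero.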

VI. CONCLUSION
We propose an associated attribute-aware differentially private data publishing model based on microaggregation, referred to as DPPCA. This model fully considers the dependency relations between quasi-identifier attributes and sensitive attributes. Microaggregation, which reduces the data sensitivity, is performed on the most relevant attribute pairs so that each cluster contains at least k (k ≥ 2) records. By considering only the most relevant attribute pairs, the time and space complexity of the algorithm is reduced. Then, we add noise to each cluster to realize ε-differentially private data release. The experimental results demonstrate that the model substantially improves data utility and reduces the amount of noise added to the data set. However, many challenging problems remain. For example, in the processing of categorical and hybrid attributes, the selection of a suitable value of the parameter ε for multiple sensitive attributes requires further research.