Machine Learning Model Generation With Copula-Based Synthetic Dataset for Local Differentially Private Numerical Data

With the development of IoT technology, personal data are being collected in many places. These data can be used to create new services, but the individual's privacy must be taken into account. By applying differential privacy, we can safely collect personal data while adding noise. However, because such data are very noisy, the accuracy of machine learning models trained on them decreases greatly. In this study, our objective is to build a highly accurate machine learning model from these data. We focus on the decision tree algorithm and, instead of applying it as is, use a preprocessing technique in which pseudodata are generated with a copula while the effect of the noise added by differential privacy is removed. In detail, the proposed novel protocol consists of three steps: generating a covariance matrix from the differentially private numerical data, generating a discrete cumulative distribution function from the differentially private numerical data, and generating copula-based numerical samples. Simulation results on synthetic and real datasets verify the utility of the proposed method not only for the decision tree algorithm but also for other machine learning algorithms such as deep neural networks. This method will help create machine learning models, such as recommendation systems, from differentially private data.

Table 1 presents the main notations used in this paper.

Our contributions in this paper are summarized as follows. First, we introduce the relationship between the variance and covariance of differentially private numerical data and those of the original data. Second, we develop an algorithm to generate a copula model based on the estimated variance and covariance. Third, we develop an algorithm to convert the discrete cumulative distribution function into a continuous cumulative distribution function in the copula space to generate a high-quality machine learning model. Finally, we evaluate the performance of the proposed method using synthetic and real datasets.

The remainder of this paper is organized as follows. The assumptions of this study are described in Section II. Differential privacy, decision trees, and related research are discussed in Section III. Section IV analyzes the effect of differential privacy on decision trees. The proposed solution is introduced in Section V. The evaluation conducted is presented in Section VI, and the evaluation results are discussed in Section VII. Finally, we conclude this paper in Section VIII.

A. TARGET SCENARIO

We call the organization generating the machine learning model the model generator. Many techniques can be employed for differentially private machine learning model generation. These techniques can be divided into three categories. In the first category, the model generator is assumed to store the original (i.e., non-privatized) personal data. The model generator is a trusted entity, and the generated models are shared with untrusted third parties. Many studies on differentially private decision trees fall into this category [17], [18], [19], [20]. We assume that the model generator is a semi-honest entity in this study; therefore, it cannot have direct access to the original personal data. The second and third categories also make this assumption. The difference between the second and third categories hinges on whether or not the model generator has indirect access to the original personal data.

• The organization stores the data for fairness auditing: The problem of biased output of machine learning models with respect to sensitive personal attributes such as race and gender has been widely recognized as a fairness problem in machine learning. An analysis of the training data is required [27], [28], [29] to audit fairness or generate fair machine learning models. Moreover, what constitutes a bias depends on the attitudes of people and may therefore change in the future. It is thus necessary to store the training data to cope with future changes.

In the third category, it is assumed that the original personal data can be accessed indirectly when machine learning models are generated. In this category, the model generator trains models without direct access to the original data. Regarding the interpretability of deep neural networks, many issues still need to be resolved [30]. In contrast, the decision tree algorithm, which is one of the most popular machine learning algorithms, has high human interpretability.
The main drawbacks of this algorithm are its tendency to overfit the data and its instability when small changes occur in the data; however, these can be minimized by limiting the depth of the tree, pruning unreliable leaf nodes, building ensembles instead of a single tree, etc. [17]. Although our proposed method can be applied to any machine learning algorithm, it is most effective for the decision tree algorithm. This is because differentially private data contain large noise, and decision trees overfit such noise. However, we show the results of applying the proposed method to DNN, kNN, and SVM in Section VI to demonstrate its adaptability to other machine learning algorithms.

C. TARGET DATA TYPE

We focus on numerical data in this paper because numerical data can easily be converted into categorical data; in that sense, numerical data are more useful than categorical data. For time-series data, the proposed method can be applied at each point in time. Image data can be treated by applying the proposed method to each pixel of each image; however, in that case, the utility would decrease significantly because one image is composed of numerous pixels. Applying and evaluating the method on other data types is left for future work.

Differential privacy is considered the most important privacy metric [31]. In machine learning algorithms such as deep neural networks and decision trees, differential privacy has been studied extensively in the past decade [9], [17]. Differential privacy is used for the central model, i.e., the anonymizer holds all original data (the first category introduced in Section II). In contrast, local differential privacy assumes a local model, i.e., each person privatizes their values locally. In this paper, ''differential privacy'' refers to ''local differential privacy.''

Let X, Y, and M represent a domain of personal data, a set of privatized data, and a privacy mechanism that takes x ∈ X and outputs y ∈ Y, respectively. The mechanism M satisfies ε-local differential privacy if, for all x, x′ ∈ X and all y ∈ Y,

Pr[M(x) = y] ≤ e^ε · Pr[M(x′) = y]. (1)

Many differential privacy methods use the Laplace mechanism for numerical attributes [9]. Here, X represents numerical values. Let Δ represent the difference between the maximum and minimum values of X. The Laplace mechanism adds noise drawn from the Laplace distribution with a mean of zero and scale Δ/ε.
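As a concrete illustration, the Laplace mechanism above can be sketched in a few lines; the function name and domain parameters here are ours, not from the paper:

```python
import numpy as np

def laplace_mechanism(x, epsilon, domain_min, domain_max):
    """Privatize one numerical value under epsilon-local differential privacy.

    The sensitivity delta is the width of the value domain; the noise is
    drawn from a Laplace distribution with mean zero and scale delta/epsilon.
    """
    delta = domain_max - domain_min
    noise = np.random.laplace(loc=0.0, scale=delta / epsilon)
    return x + noise

# Each user privatizes their own value locally before reporting it.
reports = [laplace_mechanism(v, epsilon=1.0, domain_min=0.0, domain_max=100.0)
           for v in (23.0, 57.0, 88.0)]
```

Because the noise has zero mean, aggregates such as the sample mean remain approximately unbiased even though each individual report is very noisy.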

Much of the work on local differential privacy, such as [32], [33], [34], is primarily aimed at generating histograms of attribute values. If the generated histogram achieves sufficiently high accuracy, then it can predict the output value from the input value as well as a machine learning model can. In Section VI, we compare the proposed method with the state-of-the-art methods [32].

Many researchers assume that the model generator has original (non-privatized) data, and they propose algorithms that generate differentially private machine learning models using these original data [35], [36], [37], [38], [39], [40], [41]. In this study, we assume that the model generator is not an honest entity, and therefore other methods are required.

In existing studies on differentially private synthetic data generation, the server is assumed to have the original data samples, and the goal is to generate and share a synthetic dataset that is similar to those samples. In contrast, we assume that the server does not have the original data samples, and we aim to generate a machine learning model from differentially private data samples.

We summarize the various perspectives of each type of machine learning model generation with differential privacy and differentially private synthetic data generation in Table 2. Our target is the second category, where the model generator does not have original data but owns differentially private data.

A covariance matrix of all attributes and a cumulative distribution function F_j of each attribute Q_j are used to generate a copula model. First, samples are drawn from a multivariate normal distribution with the covariance matrix. Then, we divide each value s_{i,j} of the samples by the standard deviation σ_j of the corresponding attribute. That is, for all i, j,

s′_{i,j} = s_{i,j} / σ_j. (2)

Next, the value of the cumulative distribution function Φ of the standard normal distribution is calculated for each value of the samples:

t_{i,j} = Φ(s′_{i,j}). (3)

Then, we obtain the corresponding value of each attribute from t_{i,j}. More specifically, for all i, j, we calculate v_{i,j} = F_j^{-1}(t_{i,j}), where F_j^{-1} is an inverse function of F_j.
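The sampling steps above can be sketched as follows; the helper names and the use of a caller-supplied inverse CDF are illustrative assumptions:

```python
import numpy as np
from math import erf, sqrt

def std_normal_cdf(x):
    # Phi(x), the standard normal cumulative distribution function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def copula_samples(cov, inv_cdfs, num):
    """Draw `num` synthetic rows from a Gaussian copula.

    cov      : g x g covariance matrix of the attributes
    inv_cdfs : list of g callables, each an inverse CDF F_j^{-1}
    """
    g = cov.shape[0]
    # Draw samples from a g-dimensional multivariate normal distribution.
    s = np.random.multivariate_normal(np.zeros(g), cov, size=num)
    # Eq. (2): divide each column by its standard deviation sigma_j.
    s = s / np.sqrt(np.diag(cov))
    # Eq. (3): map through the standard normal CDF to get uniform values.
    t = np.vectorize(std_normal_cdf)(s)
    # v_{i,j} = F_j^{-1}(t_{i,j}): push uniforms through each inverse CDF.
    return np.column_stack([inv_cdfs[j](t[:, j]) for j in range(g)])
```

For example, `inv_cdfs[j]` can be an empirical quantile function such as `lambda u: np.quantile(data_j, u)`; the correlation structure of the normal samples is carried over to the attribute values through the uniform scale.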
The authors of [60] proposed a method to predict the number of people with a certain combination of attribute values in a population from a small sample using a copula. They use mutual information to compute the copula; however, they do not use differential privacy or other privacy measures, i.e., they assume that they have the original (non-privatized) data. Moreover, they do not predict the value of an attribute. Copula-based data synthesis has also been studied to generate perturbed data while preserving the surrounding distribution of the data, which can be used to train machine learning models.

A decision tree is a method for analyzing data using a tree structure. Each internal node represents a rule for data splitting.
where X_{i,j} represents the jth value of X_i, and X_i represents the ith attribute (cf. the original distribution in Fig. 2a). Thus, it becomes very difficult to determine the correct split point when the noise of differential privacy is added to all data samples. This leads to difficulties in generating an accurate decision tree model under differential privacy.

On the other hand, Figs. 2d and 2e show the results for the pseudodata generated by the proposed method. Because our proposed method generates pseudodata that preserve the statistical trend of each attribute, the shapes in Figs. 2d and 2e are similar to the original shape in Fig. 2a. The splitting points are eight and three, respectively, which are close to the optimal splitting point of five.

Moreover, the correlation information of the attributes is destroyed in the differentially private data. Therefore, when creating a decision tree from differentially private data, the deeper the node is, the more significant the effect of the error becomes. In contrast, the pseudo dataset based on the proposed method reconstructs the correlation information of the attributes. Therefore, even when the nodes are deeper, the deterioration of the accuracy of the decision tree can be suppressed.

Let L(x; µ, s) represent the Laplace probability density function with mean µ and scale s for a random variable x ∈ X, i.e., L(x; µ, s) = (1/2s) exp(−|x − µ|/s). When the mean µ is zero, we write L(x; s).

Copula-based data synthesis has been researched to produce perturbed data that incorporate rich statistical information. The proposed protocol consists of three steps: 1) generate a covariance matrix from the differentially private data (Section V-B), 2) generate a cumulative distribution function (Section V-C), and 3) generate copula samples (Section V-D). The generated copula samples are used to train the machine learning model. The algorithms used in all three steps were developed in this study. An overview of the proposed method is shown in Fig. 3.

In the first step, we introduce the relationship between the variance and covariance of differentially private data and those of the original data. The model generator cannot access the original data; however, the proposed method can estimate the variance and covariance of the original data from the differentially private data.

Let X_i denote the random variable of the ith attribute of the personal data, Z_i denote a random variable with Laplace distribution L(x; 1/ε_i), and X̃_i denote the summation of X_i and Z_i, i.e., X̃_i = X_i + Z_i. Let E[·] denote the expected value of a random variable. From the linearity of expectation, E[X̃_i] = E[X_i] + E[Z_i] = E[X_i], because the mean of Z_i is zero.

Let σ²_{X_i} represent the variance of X_i. Because X_i and Z_i are independent, and the variance of the Laplace distribution with scale 1/ε_i is 2/ε_i², the variance of X̃_i is σ²_{X̃_i} = σ²_{X_i} + 2/ε_i². Thus,

σ²_{X_i} = max(σ²_{X̃_i} − 2/ε_i², 0), (11)

where we ensure that the variance is greater than or equal to zero.

Let σ_{X_i,X_j} represent the covariance of X_i and X_j. The following equation is obtained because Z_i and Z_j are independent of the other random variables and E[Z_i] = E[Z_j] = 0:

σ_{X̃_i,X̃_j} = σ_{X_i,X_j}. (13)

Let Σ be the covariance matrix calculated based on Equations 11 and 13. It may be invalid for a normal distribution because the generated covariance matrix may contain some errors; that is, it may not be a positive definite matrix. We use the eigenvalue decomposition technique to create a valid covariance matrix, using the fact that a matrix is positive definite if and only if all of its eigenvalues are positive.
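A minimal sketch of this estimation step follows; the function and variable names are ours, and the small eigenvalue floor of 1e-9 is an arbitrary choice to keep the repaired matrix usable for sampling:

```python
import numpy as np

def mitigated_covariance(priv, eps):
    """Estimate the covariance matrix of the original data from
    differentially private samples.

    priv : n x g array of privatized values (Laplace noise, scale 1/eps_i)
    eps  : length-g array of per-attribute privacy budgets
    """
    cov = np.cov(priv, rowvar=False)
    # Eq. 11: subtract the Laplace noise variance 2/eps_i^2 on the
    # diagonal, clipping at zero.
    cov[np.diag_indices_from(cov)] = np.maximum(
        np.diag(cov) - 2.0 / np.asarray(eps) ** 2, 0.0)
    # Eq. 13: off-diagonal entries are kept as-is because the noise
    # terms are independent with zero mean.
    # Repair: eigenvalue decomposition, clipping negative eigenvalues
    # so the matrix becomes positive (semi)definite.
    w, v = np.linalg.eigh(cov)
    return (v * np.maximum(w, 1e-9)) @ v.T
```

The repair step projects the noisy estimate onto the cone of valid covariance matrices, so it can be fed directly into multivariate normal sampling.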

In the same way,

Let w represent the width of each bin, and let b^pri represent the number of bins in the output domain. Let L̄(x; µ, s) represent the cumulative distribution function of the Laplace distribution with mean µ and scale s. The probability that a true value categorized in bin b_i is privatized into another bin b^pri_j is calculated from this cumulative distribution function.

A copula model is created from the noise-mitigated covariance matrix (Section V-B) and the noise-mitigated cumulative distribution functions F_j (j = 1, . . . , g) (Section V-C). Then, copula samples can be generated using the copula model as described in Section III-F. However, Section III-F assumes that the random variable of a cumulative distribution function is continuous, whereas the random variable of the cumulative distribution function obtained in Section V-C is discrete.
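The Laplace CDF and the resulting bin-to-bin transition probability can be sketched as follows; approximating the true value by its bin center is our simplifying assumption for this sketch, not necessarily the paper's exact formula:

```python
import math

def laplace_cdf(x, mu, s):
    """Cumulative distribution function of the Laplace(mu, s) distribution."""
    if x < mu:
        return 0.5 * math.exp((x - mu) / s)
    return 1.0 - 0.5 * math.exp(-(x - mu) / s)

def transition_prob(center_i, lo_j, hi_j, scale):
    """Probability that a value in bin b_i (approximated by its center,
    an assumption of this sketch) is privatized into the output bin
    [lo_j, hi_j)."""
    return laplace_cdf(hi_j, center_i, scale) - laplace_cdf(lo_j, center_i, scale)
```

Evaluating `transition_prob` for every pair of input and output bins yields the matrix of probabilities P_{i,j} used to un-distort the observed histogram.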

Let F_j(k) represent the probability that the random variable of the jth attribute is less than or equal to k. The overall procedure of the proposed method is shown in Algorithm 1.
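One simple way to turn the discrete (binned) cumulative distribution function F_j into a continuous, invertible one is linear interpolation between bin edges; this interpolation is an illustrative assumption of ours:

```python
import numpy as np

def continuous_inverse_cdf(bin_edges, bin_probs):
    """Build a continuous inverse CDF from a binned distribution by
    linear interpolation between bin edges (an illustrative choice).

    bin_edges : array of length b+1, the bin boundaries
    bin_probs : array of length b, the probability mass of each bin
    """
    cdf = np.concatenate([[0.0], np.cumsum(bin_probs)])
    cdf /= cdf[-1]  # guard against rounding drift

    def inverse(u):
        # Map a uniform value u in [0, 1] back to the attribute domain.
        return np.interp(u, cdf, bin_edges)

    return inverse
```

The returned function can be plugged in as F_j^{-1} when converting copula samples into attribute values.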

We evaluated the effectiveness of our proposed method using synthetic and real datasets.

Algorithm 1 proceeds as follows: for each attribute Q_i = {v_{j,i} | j = 1, . . . , n}, the noise-mitigated variances σ²_{X_i} and covariances σ_{X_i,X_j} are estimated and assembled into a covariance matrix; each cumulative distribution function F_j is estimated by expectation-maximization using the transition probabilities P_{k,l} (k, l = 1, . . . , b^pri) and Q_j; finally, num samples S are generated from the g-dimensional multivariate normal distribution and converted into attribute values.

The accuracy of IDUE(R) and IDUE(O) also improves as the value of n increases. In general, methods that generate histograms from differentially private data require a large amount of data, and their accuracy is expected to be much better when large datasets are available. In contrast, the accuracy of the other methods did not improve as the value of n increased. The accuracy of a machine learning model is not expected to improve under the large influence of differential privacy noise, even when a large amount of data with large errors is available.

To evaluate the variability of the MSE of the proposed method, the results are shown in Fig. 7, where the standard deviation is represented as an error bar. When the size of a dataset is small, the value of the standard deviation is relatively large, but it decreases as the size of the dataset increases. Overall, the standard deviation is not very large compared with the value of the MSE. In addition, all of the training accuracies (and their standard deviations) were almost 0.0.

We used four real datasets for the evaluation. A description of each dataset is provided below.

In the real datasets of Boston, !Kung, Diabetes, and Adult, the numbers of attributes are 14, 4, 11, and 7, respectively. These datasets are publicly available. Moreover, our research targets the convergence of privacy and machine learning technologies; therefore, we selected datasets that are well known in the privacy and machine learning areas, respectively. The most important reason for using the Adult dataset is that it is often used as a benchmark in the field of privacy-preserving data analysis. The !Kung dataset is also often used to evaluate differential privacy techniques. The Boston and Diabetes datasets are famous in machine learning because they are included in scikit-learn, one of the foremost machine learning frameworks. Each dataset is detailed below.

• Boston dataset
The Boston dataset is considered a baseline dataset for machine learning algorithms [69], [70]. The well-known scikit-learn framework contains these data. The Boston dataset comprises data on housing in Boston in the late 1970s. It contains 506 records with attributes such as the crime rate of each city and the percentage of the low-income population. Furthermore, this dataset has been used in many studies on privacy-preserving data mining [76], [77].

• !Kung dataset
The !Kung dataset [78], [79] is a small census dataset that is widely used in experiments on data mining under differential privacy, such as in [37] and [80]. The !Kung dataset contains 287 records. Following [37], we set

• Diabetes dataset
Many studies have used this dataset to evaluate data mining techniques [82], [83].

• Adult dataset
The Adult dataset [84] has been used in many studies on privacy-preserving data mining, such as [85], [86].

Finally, we conducted experiments on DNN, SVM, and kNN to determine whether the proposed method can be applied to machine learning algorithms other than decision trees. The results are depicted in Fig. 9, which shows the increase ratio of the MSE of each machine learning algorithm. For a decision tree, let α be the MSE of Proposal+DT, and let β be the MSE of DT. In this case, the increase ratio is calculated as (α − β)/β. Therefore, the increase ratio becomes negative if the MSE of Proposal+DT is less than that of DT. We calculated the increase ratio for the other algorithms in the same way. For kNN and DT, the proposed method is clearly effective. For DNN, the proposed method improves the accuracy on the Boston, !Kung, and Diabetes datasets but not on the Adult dataset, which has a large amount of data; for the Adult dataset, the proposed method does not deteriorate the accuracy, which is almost the same as that of the DNN alone. However, the proposed method is not effective for SVM.

In the previous section, we compared the proposed method with the copula method, the histogram generation methods (IDUE(R) and IDUE(O)), and data augmentation. Experimental results show that the proposed method has the highest accuracy. On the other hand, the computational complexity of the proposed method is higher than that of the copula method because the proposed method uses an expectation-maximization-based algorithm in addition to a copula algorithm. In contrast, data augmentation has a very small computational cost but also poor accuracy.
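The increase ratio used in Fig. 9 is straightforward to compute; the MSE values below are hypothetical, for illustration only:

```python
def increase_ratio(mse_proposal, mse_baseline):
    """Increase ratio (alpha - beta) / beta; a negative value means the
    proposed method lowered the MSE relative to the baseline."""
    return (mse_proposal - mse_baseline) / mse_baseline

# Hypothetical values: Proposal+DT with MSE 0.8 vs. plain DT with MSE 1.0
ratio = increase_ratio(0.8, 1.0)  # negative: the proposal reduced the MSE
```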

If histogram analysis rather than machine learning model generation is the goal, then histogram generation methods are suitable. Table 3 summarizes the accuracy of the generated machine learning models and the complexity of the methods.

We found that the proposed method is especially effective when ε is in the range 0.01–8. Here, we analyze the amount of noise imparted to confirm that it is within a range that can be applied in many practical scenarios. The noise added by differential privacy is generated from L(x; Δ/ε). The objective attribute of the Boston dataset is the median value of owner-occupied homes. Fig. 11 also shows the correlation values of the differentially private data and the data generated by the proposed method. When the value of ε is small, the correlation values of the differentially private data and the data generated by the proposed method approach zero. Because each data sample has large noise, the information about the correlation is lost.

The reason why the proposed method works well is as follows. In DNN, parameters are updated using stochastic gradient descent or its variants. If too much noise is added to this process, the model is often trained in the wrong direction. However, increasing the batch size increases the robustness to noisy data because, within a single batch, gradient updates from randomly sampled noisy data nearly cancel out [87]. Nevertheless, there is a limit to the ability to cancel out noise. The experimental results show that the accuracy of DNN is better when using the proposed method.

The SVM for regression is also called support vector regression. On the other hand, kNN is known to be very sensitive to noisy data [89]. Therefore, the proposed method also works well for kNN, as shown by the experimental results.

Handling high-dimensional data as it is with our method is difficult. To treat high-dimensional data, dimension reduction techniques such as principal component analysis (PCA) can be used. Several studies have shown that reducing the dimensions improves the accuracy of machine learning models [90], [91]. To perform PCA with differentially private data, the algorithm proposed by Wang and Xu [92] can be used.
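As a non-private baseline, plain PCA by eigendecomposition of the covariance matrix looks as follows; this sketch is ours, and the differentially private PCA of [92] would replace the covariance estimate with a privacy-aware one:

```python
import numpy as np

def pca_reduce(X, k):
    """Project an n x g data matrix onto its top-k principal components
    (plain, non-private PCA; illustrative baseline only)."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    # Eigenvectors of the covariance matrix, sorted by eigenvalue.
    w, v = np.linalg.eigh(cov)
    top = v[:, np.argsort(w)[::-1][:k]]
    return Xc @ top
```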

For DNN, many models use high-dimensional data. However, several studies have generated highly accurate DNN models using PCA or other dimension reduction techniques, such as [93]. This study is concerned with data with relatively few attributes. Therefore, for high-dimensional data, it has not been verified that the proposed method works effectively without dimensionality reduction. Verifying how the proposed method works with and without dimensionality reduction is a future issue.

One reason to focus on decision trees in this paper is their high human interpretability. On the other hand, many studies have aimed to interpret the behavior of DNNs. For example, Nascita et al. proposed an algorithm that provides global interpretation for DNNs [94]. Interpreting model behavior when DNN models are constructed using our proposed method is also an issue to be addressed in the future.

General preprocessing techniques include data cleaning, dimension reduction, and so on [95]. They do not consider differentially private numerical data, which are very noisy but for which the probability distribution of the noise is the Laplace distribution. Our proposed method generates a copula-based synthetic dataset that reduces the noise due to differential privacy. Therefore, these techniques (e.g., data cleaning and dimension reduction) could be applied to the copula-based synthetic dataset generated by the proposed method. Data augmentation is another preprocessing technique, used to increase the training data. This technique also does not consider differentially private data; therefore, it makes little contribution to improving the accuracy of machine learning. In the experiment section, we showed that our method outperforms other techniques, including a data augmentation technique.

Personal data with noise added for differential privacy are widely collected to protect privacy. In this paper, we proposed a method for generating highly accurate machine learning models, especially decision tree models, from datasets with differential privacy noise. Experimental results show that, for a range of practical ε values, the proposed method improves the accuracy of machine learning models, not only for the decision tree algorithm but also for kNN and DNN with relatively few attributes, compared with the conventional copula method and the state-of-the-art IDUE(R) and IDUE(O).

In future work, we plan to extend the proposed method to other types of datasets to which differential privacy is applicable, such as time-series data, image data, and data with graph structures.