Privacy-Preserving Collaborative Data Collection and Analysis With Many Missing Values

Privacy-preserving data mining techniques are useful for analyzing various types of information, such as Internet of Things data and COVID-19-related patient data. However, collecting a large amount of sensitive personal information is a challenging task. In addition, such information may contain missing values, which existing methods for privacy-preserving data collection do not consider. Failure to account for missing values reduces the accuracy of the data analysis. In this article, we propose a method for privacy-preserving data collection that considers many missing values. The patient data are anonymized and sent to a data collection server. The data collection server creates a generative model and a contingency table suitable for multi-attribute analysis based on expectation-maximization and Gaussian copula methods. Using differential privacy (the de facto standard) as the privacy metric, we conduct experiments on synthetic and real data, including COVID-19-related data. The results are 50-80% more accurate than those of existing methods that do not consider missing values.

To control a pandemic such as the coronavirus disease 2019 (COVID-19), we require the age, gender, family structure, and medical history of infected individuals [1], [2]. Although such data may be provided to medical institutions by the patients themselves, the information is highly sensitive. If this information is anonymized, it can be shared among researchers worldwide without identifying the patients, which would help elucidate the state of the pandemic and predict its course with greater accuracy.
Even when anonymized, a large amount of sensitive personal information is difficult to acquire. Moreover, this information may have missing values, because fewer individuals are willing to provide all of their confidential information than are willing to provide incomplete information. Researchers have proposed several methods that collect personal information while ensuring data privacy [3], [4], [5], [6]. In most of these methods, the privacy model is ε-differential privacy [7], the de facto standard of privacy assurance [8]. Although these methods achieve differentially private data collection, they do not consider missing values. Consequently, the accuracy of the data analysis is significantly reduced, especially in multi-attribute analysis involving many missing values.
In this paper, we propose a method for privacy-preserving data collection that considers many missing values. The patient data are anonymized on the patient's device and/or computer in authorized hospitals, and are sent to a data collection server. Each patient can select which data to share or not share. The data collection server creates a generative model and contingency table suitable for multi-attribute analysis based on expectation-maximization and Gaussian copula methods.
We considered that if the value distribution of one or two attributes can be restored, the error in each attribute can be limited even when there are many missing values. A copula enables data generation when certain information (such as correlation and mutual information) is available for each pair of attributes. We thus combined the features of the copula with those of data recovery under differential privacy. To our knowledge, this idea is novel in privacy-preserving data collection. Applying a copula model to differentially private data collection with many missing values is our first contribution. The main technical contribution is as follows. Generating a copula model requires the value distribution of each attribute and the mutual information of all attribute pairs. However, the server cannot collect original data; it collects only noised data. Consequently, if the server generated the value distributions and mutual information directly from the collected data, the resulting copula model would be distorted. A technique to mitigate the differentially private noise is therefore necessary (Sections 4.2.1 and 4.2.2).

The remainder of this paper is organized as follows. Section 2 presents a motivating example and the assumptions of this study. Section 3 describes related work, and Section 4 discusses the proposed method in detail. In Sections 5 and 6, we evaluate the results and discuss several practical considerations, respectively. Conclusions are presented in Section 7.

Fig. 1 presents a typical application of the proposed privacy-preserving data collection method. The data of each patient, such as age, gender, and medical history, are provided to authorized entities such as hospitals. The patients decide which information is provided to the data collection server and thereby shared with researchers worldwide. Patients can also provide information not given to hospitals, such as family structure and salary.

Motivating Example
If the patient data are sufficiently detailed, researchers can identify patients from the information provided even when all identifiers are removed. However, if the information is insufficiently detailed, the effectiveness of the data analysis is significantly reduced. To solve this problem, we propose a differential privacy model.
Hospitals also apply a differential privacy mechanism to the information provided by each patient. The differential privacy mechanism can process any additional information provided by the patient. The differentially private information, including the additional differentially private information, is sent to the data collection server, which collects the differentially private information from several hospitals. The server then constructs a generative model that should be similar to a generative model created using true information, which is unknown to the server. Applying the generative model, researchers can construct a contingency table, mine the association rules, and perform machine learning with a suitable model such as a deep neural network.

Assumptions
In the COVID-19 scenario, we assume that all patients provide their information to the data collection server through authorized entities such as hospitals. The same assumption is made in COVID-19 contact tracing applications [9]. For example, the Ministry of Health, Labor, and Welfare in Japan launched a smartphone application called COVID-19 Contact-Confirming Application (COCOA).¹ When a COCOA user is confirmed to be infected with COVID-19, an authorized health center issues a code that the user can enter into COCOA. Only users with valid codes can register their infection.
Our proposed method can be used in other scenarios, such as crowd-sensing applications. In these applications, participants provide information such as their location and accelerometer data. Because smartphones can be used for health monitoring and cognitive function assessment [10], they can provide a medical-information portal to the data collection server. When the involvement of authorized entities is difficult, incentive and trustworthiness mechanisms such as those proposed in [11], [12] can be used.
We also assume many missing values in the collected data. As reported in the literature, many individuals hesitate to provide all their information [13], [14]. The rate of missing values ranges between 25% and 55% and may even be higher [13]. We also assume that the data collection server is honest-but-curious. That is, the server honestly follows the proposed scheme but attempts to reveal as much personal data as possible. Furthermore, we assume that the data collection server constructs a generative model and a contingency table. For this purpose, the server requires categorical attribute values. If the original value is a numerical value, it is classified into a predefined category in advance.
Several privacy-protection studies assume that users want to receive services from a service provider based on their attribute values. In such cases, the service provider requires the precise information on each user's attribute values [15]. However, in our scenario, all individuals voluntarily provide anonymized values to the data collection server and do not expect services from the data collection server based on their attribute values, although the server may provide various incentives such as financial rewards. The data collection server aims to create a dataset that can be statistically analyzed without precise information about each individual's attribute values.

Differential Privacy
Differential privacy models [7] have been actively studied in the data mining field [16], [17]. Differential privacy ensures that the output of the anonymization algorithm does not depend heavily on the data of any particular individual. Each individual can thus make a well-informed decision about whether to provide the data, and the risk of information leakage is controlled by the privacy budget ε.

1. https://play.google.com/store/apps/details?id=jp.go.mhlw.covid19radar&hl=en_US (accessed June 26, 2020)
A randomized algorithm A satisfies ε-differential privacy if and only if, for all pairs of individual values s_1 and s_2 and for all R ⊆ Range(A), the following holds:

P(A(s_1) ∈ R) ≤ e^ε P(A(s_2) ∈ R).   (1)

Our research targets privacy-preserving data collection from each person. In this scenario, each person's datum can be considered a database with one record, and the privacy model is then also known as ε-local differential privacy, which is the model our research targets. In this paper, "ε-differential privacy" refers to "ε-local differential privacy."
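As a minimal illustration of inequality (1), consider the classic binary randomized response mechanism. This is a simpler mechanism than the one proposed in this paper, and the parameterization below is our own illustrative choice, not part of the proposed method:

```python
import math
import random

def randomized_response(true_bit, epsilon, rng):
    """Report the true bit with probability e^eps / (e^eps + 1); otherwise flip it."""
    p_true = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return true_bit if rng.random() < p_true else 1 - true_bit

# Check inequality (1) analytically for this mechanism: for every output r
# and every pair of inputs s1, s2, P(A(s1) = r) <= e^eps * P(A(s2) = r).
epsilon = 1.0
p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
for r in (0, 1):
    prob = {s: (p if s == r else 1.0 - p) for s in (0, 1)}
    assert prob[0] <= math.exp(epsilon) * prob[1] + 1e-12
    assert prob[1] <= math.exp(epsilon) * prob[0] + 1e-12

out = randomized_response(1, epsilon, random.Random(0))
assert out in (0, 1)
```

For this choice of p, the worst-case probability ratio equals e^ε exactly, so the privacy budget is fully used.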

Anonymized Data Analysis With Differential Privacy
Mobile crowd-sensing is one application of anonymized data collection, and privacy-preserving techniques can incentivize participants [18]. Erlingsson et al. [20] proposed a mechanism in which attribute values are changed at random to ensure differential privacy. However, as these methods do not consider missing values, all records with missing values must be removed before applying them. Sei et al. [4] proposed S2Mb, which enhances the randomized response scheme [21], and proposed a method for estimating true counts from values containing errors [5]. They assumed a single attribute without missing values; if there were multiple attributes without missing values, they were converted into one attribute in advance. Several other studies on privacy-preserving data collection have been published [22], [23], [24], [25], [26]. However, the accuracy of all methods for differentially private anonymized data collection depends heavily on the number of records in the database. Thus, when the number of records is small, the accuracy of data analysis by these methods is significantly reduced.
Wang et al. proposed a differentially private deep neural network platform for sensitive crowd-sourced data [27]. This platform develops a deep neural network model using the sensitive data and publishes the trained model. However, attackers can use model inversion attacks [28], [29] and membership inference attacks [30], [31] to infer the sensitive raw data from the trained model. To protect these data, the platform adds noise in the training phase. Nevertheless, Wang et al. assumed that the platform is a trusted entity that can collect true information about sensitive data.

Missing Value Imputation
Wei et al. analyzed and compared the imputation accuracies of eight imputation methods [32]. The best-performing models were random forest and quantile regression imputation of left-censored data. In another study, Deb and Liew proposed an imputation method applicable to traffic accident data [33]. Their approach identifies a set of correlated records using a decision tree. The missing values are imputed from the correlation between the missing and nonmissing attributes. Their method also samples several potential imputed values with high similarity.
Many imputation methods use fuzzy clustering algorithms. For example, Rahman et al. proposed a missing value imputation framework based on fuzzy expectation-maximization and fuzzy clustering [34]. This method searches for and uses the records with the highest similarity to the record with missing values. The search is performed by a general fuzzy c-means clustering algorithm. Based on the membership degrees of all clusters, the missing values are then imputed by a fuzzy expectation-maximization algorithm [35], which is a modification of the regular expectation-maximization algorithm. Meanwhile, Sefidian and Daneshpour proposed the Gray-based fuzzy c-means and mutual information feature selection imputation method [36]. While executing the clustering algorithm, the distance between records is calculated using the gray relational grade, and the highly related attributes (in terms of mutual information) are selected. Raja and Thangavel proposed a rough k-means centroid-based imputation method [37] that can handle inconsistencies and uncertainties in datasets. They reported that their proposed method outperforms the simple k-means and fuzzy c-means clustering methods.
All of the aforementioned methods assume that the obtained values represent the true values. However, the present study assumes that the server obtains disguised values because the true values are changed by differential privacy techniques. The structure of the disguised values depends on the applied differential privacy technique; it may be a Bloom filter [3] or a set of dummy values [4]. Therefore, the existing missing value imputation methods are inapplicable to differential privacy scenarios.

Differentially Private Synthetic Datasets Generation
The literature includes several studies on differentially private data synthesis, such as [38], [39], [40], [41], [42]. These studies attempt to generate differentially private synthetic datasets from original (non-privatized) datasets. An example scenario is as follows. Assume a company holds an original (non-privatized) dataset that it wants to share with another organization. Because the original dataset contains sensitive personal information, the company should privatize the dataset. In this scenario, the company can use a differentially private data synthesis method before sharing the dataset.
Zhang et al. [38] proposed PrivSyn, an automatic synthetic data generation method that calculates correlations among non-privatized attribute values and computes multiple differentially private marginals to capture the characteristics of the non-privatized dataset. From these marginals, PrivSyn generates a differentially private synthetic dataset.
Vietri et al. [39] proposed two algorithms, FEM and sep-FEM. Their goal is to create, from a non-privatized dataset, a differentially private synthetic dataset that largely preserves the answers to a large number of statistical queries. Although these algorithms follow the same dynamics as DualQuery [43] and MWEM [44], they achieve significantly better accuracy. Similarly, Harder et al. [41] and Cai et al. [42] assume that a trusted data server holds non-private personal information. In the framework proposed by Li et al. [40], the server can be honest-but-curious, but it needs true aggregated information about personal attribute values. These methods can release differentially private synthetic datasets with high accuracy; however, their assumptions and objectives differ from those of our study (see Fig. 2). We assume that the server does not hold any original personal data; rather, it holds only differentially private data. This assumption and objective are very common in differentially private data collection research [3], [4], [5], [19], [45]. However, it is important to note that previous studies do not consider missing values.
The techniques of differentially private synthetic dataset generation are clearly more accurate than those of differentially private data collection when the data collection server holds the original values. In this study, however, that assumption does not hold, and hence a different method is needed.

PROPOSED METHOD
Based on differential privacy, we anonymize the patient personal data at the client side. The server collects the anonymized data and reconstructs the distributions of each attribute and all combinations of two attributes. From the two-attribute distributions, the mutual information of all pairs of attributes is calculated. Next, the generative model of the patient personal data is calculated from the mutual information using a Gaussian copula [46], [47]. Because our proposed method requires only the information about the combination of every attribute pair, it is robust to missing values. Finally, to visualize the generative model, we construct a contingency table from the generative model and the distribution of each attribute. The notations used in this study are listed in Table 1.
In the proposed method, to analyze the collected differentially private data, the server constructs a copula model while mitigating the noise added by the differentially private technique. Constructing a copula model requires the value distribution of each attribute and the mutual information of all attribute pairs, as described in Section 4.2.3. Therefore, the proposed method first estimates the single-attribute distributions (Section 4.2.1) and then the attribute-pair distributions (Section 4.2.2). Generation of the copula model is described in Section 4.2.3. The copula model can generate an arbitrary number of data samples without missing values, from which a contingency table is constructed (Sections 4.2.4 and 4.2.5).

Anonymization at the Client Side
Let s_ij represent the value of attribute A_j of patient i. The number of attributes is g; that is, patient i has attribute values s_i1, ..., s_ig. Some values s_ij may be missing. Let f_j be the number of categories of A_j.
We anonymize each non-missing value s_ij. Let V_j represent the domain of A_j and V_jk represent the kth value of V_j. For example, assume that A_1 represents a disease attribute with domain {COVID-19, flu, cancer}. In this case, f_1 = 3, and V_11, V_12, and V_13 are COVID-19, flu, and cancer, respectively.
Based on a previous method [4], we create a value set R_ij for each attribute A_j as follows:

R_ij = {s_ij} ∪ Ran(V_j \ {s_ij}, h_j - 1)   with probability p_j,
R_ij = Ran(V_j \ {s_ij}, h_j)                with probability 1 - p_j,   (2)

where Ran(S, h) represents a function that randomly selects h elements without duplication from set S. For example, assume that S = {A, B, C} and h = 2. In this case, Ran(S, h) outputs {A, B}, {B, C}, or {A, C}. To satisfy ε-differential privacy, the parameters h_j and p_j are determined as in (3), following [4]. As there are g attributes in our scenario, each R_ij should satisfy ε/g-differential privacy [48]. Algorithm 1 presents the anonymization algorithm on the client side.
The privacy budget allocated to each attribute is ε/g. Even if all attributes are identical, i.e., the correlations among attributes are 1, we can satisfy ε-differential privacy owing to the composition property of differential privacy [48].

Algorithm 1. Anonymization Algorithm on the Client Side
Input: Attribute values {s_i1, ..., s_ig}, each domain V_j
Output: Anonymized version of {s_i1, ..., s_ig}
1: for j = 1, ..., g do
2: f_j ← |V_j|
3: Based on (3), determine p_j and h_j by substituting ε/g for ε
4: Based on (2), obtain R_ij from s_ij and V_j
5: end for
6: return R_i = {R_i1, ..., R_ig}
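The randomization step of Eq. (2) can be sketched in Python as follows. The domain and the values of p_j and h_j below are illustrative only; in the proposed method they are determined by Eq. (3):

```python
import random

def ran(s, h, rng):
    """Ran(S, h): h distinct elements drawn uniformly at random from set S."""
    return set(rng.sample(sorted(s), h))

def anonymize_value(s_ij, domain, p_j, h_j, rng):
    """Eq. (2): with probability p_j, send the true value plus h_j - 1 dummies;
    otherwise send h_j dummy values only (the true value is withheld)."""
    others = set(domain) - {s_ij}
    if rng.random() < p_j:
        return {s_ij} | ran(others, h_j - 1, rng)
    return ran(others, h_j, rng)

rng = random.Random(0)
domain = ["COVID-19", "flu", "cancer", "none"]
report = anonymize_value("flu", domain, 0.6, 2, rng)
assert len(report) == 2 and report <= set(domain)
```

Because the true value is included only in the first branch, it appears in the report with probability exactly p_j, which is the property used by the server-side estimation.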

Estimation at the Server Side
The data collection server first estimates the value distribution of each attribute as described in Section 4.2.1. It then estimates the value distribution of each attribute pair as described in Section 4.2.2. Using these estimated value distributions, the server creates a generative model (a Gaussian copula; see Section 4.2.3). Finally, it generates n complete data records and creates a contingency table of target attributes, which is specified by a data analyzer (Sections 4.2.4 and 4.2.5). Fig. 3 presents the overall structure of the proposed estimation scheme.

Separated Estimation: Estimation of a Value Distribution of Each Attribute
Each client sends its true value together with (h_j - 1) randomly selected other values with probability p_j, and sends h_j randomly selected other values with probability (1 - p_j) for attribute A_j, as shown in Algorithm 1. As a result, the probability that the true value is sent is p_j, and the probability that a specific other value is sent is

q_j = (p_j (h_j - 1) + (1 - p_j) h_j) / (f_j - 1) = (h_j - p_j) / (f_j - 1)

for attribute A_j. Here, because a total of h_j values are sent, p_j + (f_j - 1) q_j = h_j holds. Let w_jk represent the number of occurrences of V_jk in {R_1, ..., R_n}, and let u_jk represent the true number of occurrences of V_jk. Thus, we have

E[(w_j1, ..., w_jf_j)^T] = M (u_j1, ..., u_jf_j)^T,

where M is the matrix whose diagonal elements are p_j and whose other elements are q_j. The symbol z_jk represents the estimated number of occurrences of V_jk. We can easily estimate these values by calculating

(z_j1, ..., z_jf_j)^T = M^{-1} (w_j1, ..., w_jf_j)^T,

where M^{-1} represents the inverse matrix of M. However, the estimation accuracy of this approach is very low [3]. Moreover, calculating the inverse requires significant computation time, particularly for a large matrix. To overcome these limitations, we selected an expectation-maximization (EM)-based algorithm. If we knew the values of u_jk, we could calculate each expected value of w_jk. In our problem setting, we know the actual values of w_jk; however, we do not know u_jk. Therefore, considering u_jk as unobserved latent variables, the EM-based algorithm can provide a maximum a posteriori estimate: it finds the unobserved latent variables that best explain the observed values. Moreover, the EM-based algorithm guarantees that the likelihood increases with each iteration [49], [50].
The symbol ñ_j represents the number of records in which a value exists for attribute A_j. Let z_jk represent the estimated number of occurrences of V_jk in A_j. Using the expectation-maximization-based algorithm of [4], we obtain z_jk by repeating the following substitution:

z_jk ← Σ_{i : s_ij exists} (z_jk x_ijk) / (Σ_{k'=1}^{f_j} z_jk' x_ijk'),

where x_ijk = p_j / C(f_j - 1, h_j - 1) if V_jk ∈ R_ij and x_ijk = (1 - p_j) / C(f_j - 1, h_j) otherwise, and where z_jk is initialized to ñ_j / f_j.
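The EM iteration described above can be sketched as follows. The per-report weights w_in and w_out are our reconstruction of the likelihoods implied by the mechanism of Eq. (2) (the probability of a report that does or does not contain a given category); they are not copied from [4]:

```python
import math

def em_estimate(reports, domain, p, h, iters=100):
    """EM estimation of true category counts from anonymized value sets.
    A report containing category k has likelihood p / C(f-1, h-1);
    a report not containing it has likelihood (1-p) / C(f-1, h)."""
    f = len(domain)
    w_in = p / math.comb(f - 1, h - 1)
    w_out = (1.0 - p) / math.comb(f - 1, h)
    z = {v: len(reports) / f for v in domain}  # uniform initialization
    for _ in range(iters):
        new_z = {v: 0.0 for v in domain}
        for r in reports:
            like = {v: z[v] * (w_in if v in r else w_out) for v in domain}
            total = sum(like.values())
            for v in domain:
                new_z[v] += like[v] / total  # posterior responsibility
        z = new_z
    return z
```

Each iteration distributes every report across the categories in proportion to the current estimate times the report likelihood, which is the standard E-step/M-step pair for a finite mixture with known component likelihoods.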

Separated Estimation: Estimation of the Value Distribution of Every Two-Attribute Combination
Let V_jj' be the set of combinations of the elements of attributes A_j and A_j'. Let w_jj'kk' represent the number of simultaneous occurrences of V_jk and V_j'k' in the records of {R_1, ..., R_n}. The symbol ñ_jj' represents the number of records in which a value exists for both attributes A_j and A_j'. As an example, assume that Table 2 was created by the privacy-preserving data collection. Then ñ_1, ñ_2, and ñ_3 are 4, 2, and 3, respectively, because attribute A_1 has four values, attribute A_2 has two values, and attribute A_3 has three values. ñ_{1,2} is 2 because two records (the first and fourth records) contain values in both A_1 and A_2 (the values are [39, 40, 58, 35.2, 35.5] and [33, 34, 88, 37.5, 37.6]). Similarly, ñ_{1,3} and ñ_{2,3} are 3 and 1, respectively. As in Section 4.2.1, we estimate the occurrences of each combination of V_jk and V_j'k' of attributes A_j and A_j' for the n patients. By calculating these values for all combinations of A_j and A_j', we can estimate the value distributions of all attribute pairs.
After estimating the attribute-pair distributions, the mutual information is calculated as follows. The mutual information of attributes A_j and A_j' is

I(A_j; A_j') = Σ_k Σ_k' p(k, k') log( p(k, k') / (p(k) p(k')) ),

where p(k, k') represents the joint probability that V_jk and V_j'k' occur, and p(k) represents the probability that V_jk occurs.
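The mutual-information computation over an estimated pair distribution is straightforward; a sketch follows (the toy co-occurrence tables are ours, not data from the paper):

```python
import math

def mutual_information(joint):
    """I(A; B) from a dict {(k, k2): count} of estimated co-occurrence counts."""
    n = sum(joint.values())
    pa, pb = {}, {}
    for (a, b), c in joint.items():
        pa[a] = pa.get(a, 0.0) + c / n
        pb[b] = pb.get(b, 0.0) + c / n
    mi = 0.0
    for (a, b), c in joint.items():
        if c > 0:
            pab = c / n
            mi += pab * math.log(pab / (pa[a] * pb[b]))
    return mi

# Independent attributes -> MI ~ 0; perfectly dependent attributes -> MI = log 2.
assert abs(mutual_information({(0, 0): 25, (0, 1): 25, (1, 0): 25, (1, 1): 25})) < 1e-12
assert abs(mutual_information({(0, 0): 50, (1, 1): 50}) - math.log(2)) < 1e-12
```

Cells with a zero estimated count are simply skipped, matching the convention 0 log 0 = 0.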

Generative Model Construction: Constructing a Generative Model as the Gaussian Copula
Let X_1, ..., X_g be random variables and let F(x_1, ..., x_g) represent the joint probability distribution function of X_1, ..., X_g. The marginal distribution functions F_1, ..., F_g and the joint probability distribution function are related as follows.

Theorem 4.1 (Sklar's Theorem [51]). A function C uniquely satisfies the following expression:

F(x_1, ..., x_g) = C(F_1(x_1), ..., F_g(x_g)).

From Sklar's Theorem, we have

C(u_1, ..., u_g) = F(F_1^{-1}(u_1), ..., F_g^{-1}(u_g))

for arbitrary u = (u_1, ..., u_g) (u_i ∈ [0, 1]). Based on Sklar's Theorem, we also have

F(x_1, ..., x_g) = Φ_g(Φ^{-1}(F_1(x_1)), ..., Φ^{-1}(F_g(x_g)); Σ),   (17)

where Φ(·) represents the cumulative distribution function of the standard Gaussian distribution, and Φ_g(x_1, ..., x_g; Σ) represents the cumulative distribution function of a g-dimensional Gaussian distribution with random variables X_1, ..., X_g and covariance matrix Σ.
From (17), the cumulative distribution of the Gaussian copula can be expressed as

C(u_1, ..., u_g) = Φ_g(Φ^{-1}(u_1), ..., Φ^{-1}(u_g); Σ).

The Gaussian copula C is a joint cumulative distribution function whose marginal distributions are each uniform on [0, 1]. The probability density function of the Gaussian copula, c(u_1, ..., u_g; Σ), satisfies

c(u_1, ..., u_g; Σ) = f_g(Φ^{-1}(u_1), ..., Φ^{-1}(u_g); Σ) / Π_{i=1}^{g} φ(Φ^{-1}(u_i)),

where φ(·) represents the probability density function of the standard Gaussian distribution, i.e., φ(x) = (1/√(2π)) e^{-x²/2}, and f_g(·; Σ) is the density of the g-dimensional Gaussian distribution. Therefore, we have

log c(u; Σ) = -(1/2) log |Σ| - (1/2) v^T (Σ^{-1} - I) v,   (21)

where v = Φ^{-1}(u), applied element-wise. Σ must be estimated from the collected data. Let u_i and v_i represent the ith u and ith v, respectively. From (21), the log-likelihood function of the Gaussian copula is given by

l(Σ) = Σ_{i=1}^{n} log c(u_i; Σ) = -(n/2) log |Σ| - (1/2) Σ_{i=1}^{n} v_i^T (Σ^{-1} - I) v_i,

where v_i = Φ^{-1}(u_i). Therefore, the maximum likelihood estimator Σ̂ is

Σ̂ = argmax_Σ l(Σ).   (24)

To alleviate the high computational cost of (24), we estimate Σ using a suboptimal approach [47]. First, we calculate the mutual information of every pair of attributes using the reconstructed data of Section 4.2.2. We then determine each element of Σ that minimizes the distance between the mutual information implied by the estimated joint distribution and that calculated from the reconstructed data (see Section 4.2.2).
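The suboptimal fitting step can be illustrated for a single pair of binary attributes: grid-search the copula correlation ρ whose implied mutual information best matches a target value. The Monte-Carlo evaluation, the grid, and the binary marginals below are our simplifications, not the paper's exact procedure:

```python
import math
import random

def norm_cdf(x):
    """Standard Gaussian CDF, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def simulate_joint(rho, m1, m2, n, rng):
    """Monte-Carlo joint distribution of two binary attributes whose dependence
    follows a Gaussian copula with correlation rho; P(A=0)=m1, P(B=0)=m2."""
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1, x2 = z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        a = 0 if norm_cdf(x1) < m1 else 1
        b = 0 if norm_cdf(x2) < m2 else 1
        counts[(a, b)] += 1
    return counts

def mi(counts):
    """Mutual information of a 2x2 contingency table."""
    n = sum(counts.values())
    pa = {a: sum(c for (x, _), c in counts.items() if x == a) / n for a in (0, 1)}
    pb = {b: sum(c for (_, y), c in counts.items() if y == b) / n for b in (0, 1)}
    return sum((c / n) * math.log((c / n) / (pa[a] * pb[b]))
               for (a, b), c in counts.items() if c > 0)

def fit_rho(target_mi, m1, m2, rng, n=8000):
    """Grid-search the correlation whose implied MI is closest to the target."""
    best_rho, best_dist = 0.0, float("inf")
    for step in range(20):  # nonnegative grid 0.00 .. 0.95
        rho = step / 20.0
        dist = abs(mi(simulate_joint(rho, m1, m2, n, rng)) - target_mi)
        if dist < best_dist:
            best_rho, best_dist = rho, dist
    return best_rho
```

Because mutual information is monotone in |ρ| for fixed marginals, this one-dimensional search per attribute pair avoids the full g-dimensional maximization of (24).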
Although copula construction normally assumes access to original (non-privatized) data, a copula can construct a generative model from a small amount of data. There are several types of copulas, such as the Gaussian and Student-t copulas. Garcia-Jorcano and Benito demonstrated experimentally that the Gaussian and Student-t copulas performed best among the Gaussian, Student-t, Clayton, Gumbel, and Frank copulas [53]. Lasmar and Berthoumieu preferred the Gaussian copula to the Student-t copula because the Gaussian copula achieves high accuracy and its parameters are easily estimated [54]. Based on these previous studies, the Gaussian copula was selected in this study. However, several other studies have pointed out shortcomings of the Gaussian copula [55]; we intend to consider other types of copulas in future work.

Contingency Table Construction: Generation of Records Based on the Generative Model
We generate n complete data records from the Gaussian copula C and the reconstructed data of Section 4.2.1. The n values of each attribute A_j are determined based on the attribute distribution estimated in Section 4.2.1. We first generate random values x_1, ..., x_g from a g-dimensional Gaussian distribution with covariance matrix Σ̂. We then obtain u_j = Φ(x_j) for all j = 1, ..., g. From the reconstructed data of Section 4.2.1, we finally obtain each attribute value as F_j^{-1}(u_j), where F_j represents the marginal distribution of attribute A_j.
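The record-generation step (correlated Gaussian draw, Φ transform, inverse marginal CDF) can be sketched for two attributes as follows; the category names, marginals, and correlation are illustrative assumptions:

```python
import math
import random
from collections import Counter

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def inverse_marginal(u, categories, probs):
    """F_j^{-1}(u): map a uniform value to a category using the estimated
    marginal distribution of attribute A_j."""
    acc = 0.0
    for cat, prob in zip(categories, probs):
        acc += prob
        if u <= acc:
            return cat
    return categories[-1]  # guard against floating-point round-off

def generate_records(n, marginals, rho, rng):
    """Draw n two-attribute records: correlated normals -> uniforms via Phi,
    then the inverse-CDF step per attribute."""
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs = (z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        records.append(tuple(
            inverse_marginal(norm_cdf(x), cats, probs)
            for x, (cats, probs) in zip(xs, marginals)))
    return records

marginals = [(["young", "old"], [0.3, 0.7]), (["neg", "pos"], [0.8, 0.2])]
records = generate_records(10000, marginals, 0.5, random.Random(7))
table = Counter(records)
```

Counting the generated tuples, as in the final Counter, is exactly the contingency-table construction of Section 4.2.5.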

Contingency Table Construction: Counting Each Combination of the Target Attributes

After the above process, we have n complete data records with g attributes. If a contingency table covers many attributes, it loses its primary value [56], [57]. Therefore, data analyzers generally select several attributes. The target contingency table is then constructed by simply counting the occurrences of each combination of attribute values in the n generated complete data records. Algorithm 2 presents the server algorithm.

Security Analysis
The proposed algorithm satisfies ε-differential privacy, as proven below. Because the estimation algorithm at the server side uses only the anonymized data generated at the client side, it suffices to prove the safety of the anonymization algorithm at the client side.

Algorithm 2. Algorithm for Generating the Gaussian Copula and Contingency Table
Input: Privacy parameter ε, anonymization parameters p_j and h_j, anonymized dataset {R_1, ..., R_n}, set of target attributes for the contingency table
...
6: if j ≠ j' then
7: Z_jj' ← estimated value distribution of the combination of A_j and A_j', calculated by Equations (12), (13), and (8) based on R_ij and R_ij' (i = 1, ..., n)
8: end if
9: end for
10: end for
11: for j = 1, ..., g do
12: Construct cumulative distribution function F_j based on Z_j
13: end for
14: for j = 1, ..., g do
15: for j' = 1, ..., g do
16: if j ≠ j' then
17: repeat until the change of Σ converges:
18: Create a temporary copula with F_j, F_j', and a temporary Σ
19: The jth row and j'th column of Σ are calculated based on the mutual information of Z_jj'
...

Theorem 4.2. The anonymization algorithm at the client side satisfies ε-differential privacy.
Proof. First, we prove that the anonymization of each attribute satisfies ε/g-differential privacy. The probability that R_ij contains s_ij together with h_j - 1 specified elements is

p_j / C(f_j - 1, h_j - 1),

and the probability that R_ij does not contain s_ij but contains h_j specified elements is

(1 - p_j) / C(f_j - 1, h_j).

By Equation (1), the constraints on the values of p_j and h_j imposed by ε/g-differential privacy are

e^{-ε/g} ≤ [ p_j / C(f_j - 1, h_j - 1) ] / [ (1 - p_j) / C(f_j - 1, h_j) ] ≤ e^{ε/g}.   (27)

Substituting Equation (3) into Expression (27), the ratio attains at most max(e^{ε/g}, e^{-ε/g}). Because e^{ε/g} is greater than 1 when ε and g are greater than 0, Expression (27) is satisfied. As shown above, each attribute value is protected by ε/g-differential privacy. Because there are g attributes, each satisfying ε/g-differential privacy, the final output of the anonymization algorithm satisfies ε-differential privacy by the composition property [48]. □

Time Complexity Analysis
The proposed system involves five steps, as shown in Fig. 3. The relationship between each step and its time complexity is described in Table 3. The total time complexity is O(ng + g² + Π_{j=1}^{c} f'_j), where f'_j represents the domain size of the jth attribute targeted for analysis. Note that the last term, Π_{j=1}^{c} f'_j, is common to all existing methods because the purpose is to generate histograms with Π_{j=1}^{c} f'_j bins. Because the task "Construct a generative model of all attributes" is more complex than the task "Generate records based on the generative model," the impact of O(g²) is larger than that of O(ng). Therefore, when there are many attributes, the calculation cost increases greatly. However, even with 500 attributes and 10,000 people, the calculation time was 310 min in our experiments. All experiments were conducted on an Intel Xeon W-2295 PC with 64 GB RAM.
Note that, as described in Section 3.4, the techniques of differentially private synthetic dataset generation cannot be evaluated in this study: they require the original values, whereas the data collection server in our experiments never accesses the original values of individuals.
We also show the experimental results of a simple combination of the differentially private technique at the client side and the copula technique at the server side. This method is referred to as DF+Copula.
Each method was judged to generate its model well if the estimated contingency table it produced was similar to the table generated from the true data, which is unknown to the data collection server.
In this study, a contingency table is considered as a probability distribution of attribute values. To measure the difference between probability distributions, we applied the Jensen-Shannon (JS) divergence rather than the usual Kullback-Leibler (KL) divergence, because the KL divergence requires all probabilities to be non-zero. If any probability is zero, the KL divergence fails due to a division-by-zero error. The JS divergence is based on the KL divergence but does not impose the non-zero constraint.
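A sketch of the JS divergence used as the error metric follows (computed here with the natural logarithm):

```python
import math

def kl(p, q):
    """KL(P||Q) with the convention 0 * log(0/q) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """JS(P||Q) = 0.5 KL(P||M) + 0.5 KL(Q||M) with M = (P+Q)/2; finite even
    when some probabilities are zero, unlike the raw KL divergence."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

assert js_divergence([1.0, 0.0], [1.0, 0.0]) == 0.0
assert abs(js_divergence([1.0, 0.0], [0.0, 1.0]) - math.log(2)) < 1e-12
```

Because M mixes the two distributions, a zero probability in one distribution never produces a zero denominator, which is exactly why the JS divergence suits sparse contingency tables.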
In the Apple implementation, ε equals 1 or 2 per datum [58]. In evaluations by the Apple differential privacy team, ε was set to 2, 4, and 8 [59]. Microsoft described their differentially private framework and, in their paper, set ε from 0.1 to 10 [60]. The paper that proposed RAPPOR [19], which was developed by Google, uses ε = ln(3) as the main setting. Hsu et al. showed that, in the literature, ε ranges from 0.01 to 10 [61]. Based on these settings, we varied ε from 0.01 to 10.
We varied the missing value rate m from 0.3 to 0.8 and the number of attributes c in the analysis from 1 to 5. The reported results are the averages of 100 experiments for each parameter setting. As the default parameters, we set m = 0.5, c = 3, and ε = 5.
Note that the missing value rate m is used only in the experiments; the proposed algorithm does not require this information. The number of attributes targeted for analysis, c, can be freely determined by the data analyst according to the purpose of the analysis.

Experiments on Synthetic Data
We first evaluated the JS divergence on synthetic datasets, changing the number of attributes g from 10 to 100. The attribute values of each individual were determined randomly, the number of patients n was set to 10,000, and f_j = 2 for all j.
Several existing studies target a binary (i.e., f_j = 2) scenario. For example, Kairouz et al. [3] first proposed a technique targeting f_j = 2; they then extended the technique and proposed O-RAPPOR. We conducted this experiment to verify the effectiveness of the proposed method in such a basic setting. (To evaluate the practical applicability of the proposed method, we also conducted experiments with four real datasets, presented in the next subsection.) Figs. 4a, 4b, 4c and 4d, 4e, 4f present the results for g = 10 and g = 100, respectively. Under almost all parameter settings, the JS divergence was lower in the proposed method than in the existing methods. As the number of attributes g increased, the results in Figs. 4d, 4e, and 4f were higher than those in Figs. 4a, 4b, and 4c. Meanwhile, increasing the privacy budget ε lowers the privacy-protection level; therefore, when ε was large, the JS divergence decreased in all methods.
As the missing value rate m and the number of target attributes c increased, the JS divergence increased in the O-RAPPOR, S2Mb, MDN, and PDE/ETE methods, but remained low in the proposed method and DF+Copula. Note that PDE/ETE does not transform numerical values into categorical values.
The results demonstrate that, in terms of accuracy, the proposed method outperforms the simple combination of a differentially private technique and a copula technique (DF+Copula). This observation confirms the robustness of the proposed method to missing values.
Then, we varied the number of attributes g from 10 to 200. Fig. 5 shows the results. The fewer the attributes, the fewer the bins in the histogram, and accordingly, the JS divergence tends to be smaller. Conversely, when there are many attributes, most of the values in the histogram will be zero; therefore, by predicting the value of most bins to be 0, we can expect a small JS divergence in this case as well. Because of these two characteristics, for many methods, the figure shows that the JS divergence first increases as the number of attributes increases to some extent, and then decreases as the number of attributes increases further. For all settings, the proposed method exhibited the smallest JS divergence among all the tested methods.
Although several previous studies used a dataset with more than 200 attributes [41], many studies use datasets with fewer than 100 attributes [38], [39], [40], [42]. Because our proposed method calculates attribute-pair distributions, the processing time for datasets with many attributes is relatively long. In future work, we will address ways to reduce the processing time and will conduct additional experiments with the proposed method on datasets with a large number of attributes.

Experiments on Real Data
In the real-data experiments, we first investigated the Adult dataset [62], which is widely used in evaluations of privacy-preserving data mining techniques (for example, see [63], [64], [65]). The Adult dataset consists of 15 attributes (e.g., age, income) in 32,561 records. The number of categories in these experiments was set from 2 to 9 per attribute.
Figs. 6a, 6b, and 6c present the experimental results. When the missing value rate was small or ε was large, the JS divergence of the proposed method was similar to those of S2Mb, PDE/ETE, and O-RAPPOR. Similarly, when ε was small, the JS divergence of the proposed method was similar to those of S2Mb, PDE/ETE, and DF+Copula.
However, at high missing value rates, the proposed method outperformed the other methods, achieving a high level of privacy protection.
To determine whether the proposed method is applicable to small datasets, we randomly sampled 10% of the 32,561 records in the Adult dataset and measured the JS divergence. Figs. 6d, 6e, and 6f present the results. Owing to data sparsity, the estimation task was more difficult than in the other experiments, and the JS divergence of all methods was higher for the 3,256 records than for the 32,561 records. However, the proposed method was robust to the small dataset. On the larger dataset with an insignificant missing value rate, the JS divergence was higher in the proposed method than in the existing methods; nevertheless, regardless of the missing value rate, the proposed method outperformed the other methods on the smaller dataset.
We then used the Communities and Crime Unnormalized dataset [66] (hereafter referred to as the Community dataset). This dataset contains 124 predictive attributes such as the percentage of individuals aged 25 and over with a bachelor's or higher degree, which may be considered as private information in some communities.
After removing the 22 attributes with more than 80% missing values, we obtained 102 attributes for analysis. Fig. 7 presents the experimental results on the Community dataset. The results are similar to those obtained for the synthetic data and the Adult dataset: for almost all parameter settings, the proposed method outperformed the other methods. As the number of participants n was smaller than in the previous experiments, increasing the missing value rate increased the JS divergence of the proposed method; however, the increase was not considerable.
We next used a default dataset containing 21,985 records with the following attributes: sex, job, income, number of loans from other companies, number of delayed payments, and a default flag (0 or 1). Here, the word default means that a debtor failed to pay off a loan. The results for this dataset, which was generated from authentic default data, are plotted in Fig. 8. As shown in Fig. 8a, the proposed method accurately reconstructed the contingency tables even when the missing value ratio m increased to 0.8. In contrast, the accuracies of the existing methods greatly decreased as the missing value ratio increased. Increasing the number of attributes used for generating contingency tables, c, also increased the reconstruction error (Fig. 8b); however, the proposed method was more resistant to increasing c than the other methods. Fig. 8c shows the effect of ε on the reconstruction error of the five methods. When ε was sufficiently large, the accuracies of all methods were very similar, but when ε was small, the reconstruction error was clearly lowest in the proposed method.
Finally, we applied a dataset related to the coronavirus disease 2019 (COVID-19) called Patient Medical Data for Novel Coronavirus COVID-19 (https://datarepository.wolframcloud.com/resources/Patient-Medical-Data-for-Novel-Coronavirus-COVID-19/, accessed June 20, 2020). Hereafter, we refer to this dataset as the COVID-19 dataset. This dataset contains 427,036 records with 23 attributes. More than 90% of the values are missing for 12 attributes, and approximately 27% are missing even for basic attributes such as age and sex. From the COVID-19 dataset, we extracted the Japanese medical data and analyzed the attributes with few missing values (namely, age, sex, administrative division, date of confirmation, and chronic disease status). The date of confirmation was categorized by month, and the number of categories in each attribute ranged from 2 to 29. Fig. 9 presents the results on the COVID-19 dataset. Under all parameter settings, the JS divergence was lower in the proposed method than in the other methods. As the rate of missing values in the original COVID-19 dataset was 68.7%, we conclude that the proposed method effectively handles real datasets with missing values.
As examples, Figs. 10 and 11 respectively display histograms generated by the compared methods for combinations of the race, native country, and salary attributes in the Adult dataset and for combinations of the age, gender, and month of confirmation attributes in the COVID-19 dataset. Shown are the true histograms and those generated by O-RAPPOR, S2Mb, MDN, PDE/ETE and DF+Copula. On the Adult dataset, the generated histograms of O-RAPPOR and S2Mb did not differ greatly from the original histogram, but several values differed considerably from the true values. In contrast, the proposed method determined the true distribution almost perfectly, although several values contained errors. The proposed method also reconstructed the histogram of the COVID-19 dataset with high accuracy, whereas the histograms reconstructed by the existing methods noticeably differed from the true histogram.
The execution time of the proposed system was 61 seconds on the Adult dataset (with 15 attributes and 32,561 records) and 1,399 seconds on an artificial dataset with 100 attributes and 10,000 records. As the proposed system calculates the occurrence frequencies of the attribute values for all pairs of attributes, its runtime naturally increased with the number of attributes; nevertheless, the execution time on the artificial dataset was short enough for practical use.

Discussion of the Experimental Results
Existing private data collection methods (O-RAPPOR, S2Mb, PDE/ETE, and MDN) do not consider missing values. For example, if the server wants to analyze the relationship between attributes A_1, A_2, and A_3, such methods use only the data samples that have all three attribute values. If the missing value rate of every attribute is 0.5, the probability that all three attributes have values is (1 − 0.5)^3 = 0.125; in other words, only 12.5% of the data samples can be used to analyze these attributes. Unlike the existing methods, the proposed method can reduce the effect of missing values because it estimates the attribute value distribution while mitigating the noise added for differential privacy. The proposed method estimates single-attribute and pair-attribute distributions; that is, the relationships among three or more attributes need not be calculated to construct a copula model. Nevertheless, the constructed copula model can generate data samples that represent the true relationships among all attributes because the model is built to reproduce the characteristics of all pair-attribute distributions.
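The complete-case calculation above can be expressed as a one-line helper (the function name is ours, for illustration only):

```python
# Fraction of records usable for a c-attribute analysis when each
# attribute value is missing independently with rate m (MCAR).
def complete_case_fraction(m: float, c: int) -> float:
    return (1.0 - m) ** c

# With m = 0.5 and c = 3 attributes, only 12.5% of the records
# have all three values present.
print(complete_case_fraction(0.5, 3))  # 0.125
```

The fraction shrinks geometrically in c, which is why complete-case methods degrade so sharply in multi-attribute analyses.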
DF+Copula uses the differentially private data as-is and constructs a copula model from them. Although each data sample contains a significant amount of noise, the true information is not completely lost. Therefore, DF+Copula achieved relatively high accuracy, particularly when m was large. However, DF+Copula does not mitigate the noise, and its accuracy is lower than that of the proposed method.
In these experiments, we set the value of ε between 0.01 and 10. An ε of 0.01 is a very strict setting, whereas a value of 10 is a less privacy-preserving setting. Therefore, depending on the dataset, the accuracies of the proposed method and several other methods are similar when ε is 0.01 or 10. However, for all ε values, the accuracy of the proposed method is better than or comparable to that of the other methods. Regardless of the value of ε, it is evident that the proposed noise reduction technique has a positive impact.

Kinds of Missing Values
In general, missing data can be categorized into the following three types:
- Missing completely at random (MCAR): the probability of an attribute value being observed or missing does not depend on any attribute.
- Missing at random (MAR): the probability of an attribute value being observed or missing depends on other attributes but not on the missing attribute value itself.
- Missing not at random (MNAR): the probability of an attribute value being observed or missing depends on the missing attribute value itself.
In our experiments, we generated MCAR values. To our knowledge, we introduce the first privacy-preserving method for anonymized data collection with many missing values. The proposed method creates a Gaussian copula using the estimated value distributions of each attribute and of each pair of attributes. By combining this estimation with existing methods for imputing MAR and MNAR data (e.g., [67], [68]), the proposed method could be extended to MAR and MNAR data.
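A minimal sketch of how MCAR missingness, as used in our experiments, can be simulated: each entry is masked independently with probability m, so the missingness depends on no attribute value. The helper name `apply_mcar` is ours, for illustration.

```python
import numpy as np

def apply_mcar(data: np.ndarray, m: float, rng=None) -> np.ndarray:
    """Mask each entry independently with probability m (MCAR)."""
    rng = np.random.default_rng() if rng is None else rng
    out = data.astype(float).copy()
    out[rng.random(data.shape) < m] = np.nan  # NaN marks a missing value
    return out

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(10_000, 10))  # n people, g binary attributes
masked = apply_mcar(data, m=0.5, rng=rng)
print(np.isnan(masked).mean())  # close to 0.5
```

Simulating MAR or MNAR instead would require making the masking probability depend on other attributes or on the masked value itself, respectively.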
If all attributes are independent of each other, a multidimensional analysis is not required; we therefore assumed that at least some attributes depend on each other. However, the proposed algorithm can also be used when all attributes are independent, in which case it will perform similarly to the existing methods.

Treating Continuous Attributes
Continuous attributes can be handled by our method in two ways. In the first approach, the continuous attribute values are discretized into several categories, and the proposed algorithm is applied to the categorized values. For example, suppose that the domain of an attribute is [0, 10). When the domain is discretized into [0, 0.1), [0.1, 0.2), ..., [9.9, 10), the attribute is divided into 100 categories. In the second approach, a continuous attribute is not discretized; instead, differential privacy is achieved by adding Laplace noise to each continuous attribute value [69]. The value distributions of each attribute and each pair of attributes can be reconstructed from a set of noise-added values generated with a certain probability distribution [5], [70]. These techniques can be used to determine the reconstructed value distributions in our method, which are needed for applying the Gaussian copula. Although the proposed method is applicable to continuous attributes, the present study considers discrete attributes to emphasize our approach.
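Both approaches can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function names are ours, and the Laplace scale assumes the domain width [0, 10) as the sensitivity of a single report.

```python
import numpy as np

# Approach 1: discretize a continuous attribute in [0, 10) into 100 bins,
# then treat the bin index as a categorical value.
def discretize(x: float, lo: float = 0.0, hi: float = 10.0,
               n_bins: int = 100) -> int:
    width = (hi - lo) / n_bins
    return min(int((x - lo) / width), n_bins - 1)

# Approach 2: keep the value continuous and add Laplace noise with scale
# sensitivity / epsilon to satisfy epsilon-differential privacy [69].
def laplace_perturb(x: float, epsilon: float, lo: float = 0.0,
                    hi: float = 10.0, rng=None) -> float:
    rng = np.random.default_rng() if rng is None else rng
    sensitivity = hi - lo  # worst-case change of one value
    return x + rng.laplace(0.0, sensitivity / epsilon)

print(discretize(3.14))                  # bin 31, i.e., [3.1, 3.2)
print(laplace_perturb(3.14, epsilon=1.0))
```

With the second approach, the marginal and pairwise distributions must then be reconstructed from the noisy values before fitting the copula, as in [5], [70].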

Treating Massive Attributes
In cases with massive numbers of attributes, the privacy budget per attribute becomes limited. This challenge is faced not only by the proposed method but by all methods based on differential privacy. In our research, we allocated the same privacy budget to all attributes for technical convenience. However, reasonable privacy-budget allocation techniques such as [71], which can be used in the existing methods, can also be implemented in the proposed method.

Treating Data Streams
Here, we consider applying our method to data streams. Let t be the number of times that a differentially private value is reported by a user to the data collection server. In this case, the privacy budget of each report can be computed as ε/t. This naive solution decreases the utility at the data collection server.
More sophisticated approaches such as those in [72], [73] could be applied to data streaming in our method.
The following sampling technique can also be used: even if the number of rounds is t, each person can send their differentially private value only t′ (< t) times because, as shown in the experimental results, our proposed method is robust to missing values. In this case, the privacy budget of each report is ε/t′, which is larger than ε/t.
When a set of new data comes in, we can reapply the proposed method based only on the new data. We acknowledge that it may be possible to reduce the time required to generate a new copula model by reusing a copula model that has already been generated. We intend to investigate this possibility in future work.
Such techniques can be used to reconstruct the value distribution at each time stamp in our method, which is needed for applying the Gaussian copula. In this way, our proposed method can be applied to data streams. Of course, these techniques can only partially prevent the degradation of data utility.
Dwork et al. [74] recommended that a system clarify the value of epsilon used per datum, the lifetime of the data, the cumulative privacy loss incurred before the data are retired, etc. In our research, these values can be freely determined by agreement between the data collector and each person. Let ε_0 represent the value of epsilon used per datum. Each person sends g attribute values to the data collector; hence, the total privacy loss is gε_0 for each report. If each person sends attribute values under (gε_0)-differential privacy h times during the lifetime of the data, the cumulative privacy loss incurred before the data are retired is hgε_0. In our experiments, we assumed h = 1 and set ε = gε_0. For example, when ε was set to 0.01 and g was 10, ε_0 was set to 0.01/10. Further, we can obtain the result for h > 1: when h is set to 100, the value of ε_0 is reduced by a factor of 100. Therefore, the results obtained for g = 10, h = 1, and ε = 0.01 can be considered equivalent to those obtained for g = 10, h = 100, and ε = 1.0, because ε_0 is set to 0.001 in both cases.
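The budget accounting above reduces to simple arithmetic, which can be checked numerically (the helper name is ours, for illustration):

```python
# Per-datum budget eps_0 when a person reports g attribute values h times
# over the data lifetime with cumulative privacy loss eps: the cumulative
# loss is h * g * eps_0, so eps_0 = eps / (h * g).
def per_datum_budget(eps: float, g: int, h: int) -> float:
    return eps / (h * g)

# The two settings discussed in the text yield the same per-datum budget,
# which is why their results can be considered equivalent.
a = per_datum_budget(0.01, g=10, h=1)    # 0.001
b = per_datum_budget(1.0, g=10, h=100)   # 0.001
print(a, b)
```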
If a person wants to keep the total privacy loss during the lifetime of the data below ε_target and sends their attribute values h times during that lifetime, each datum of their attribute values should be protected under (ε_target/(hg))-differential privacy. In other words, if each person sends their differentially private values t′ (< t) times under ε_0-differential privacy for each attribute value, the value of t′ should be set to ε_target/(ε_0 g), where ε = ε_0 g.
Therefore, we focus on ε-differential privacy (and ε-local differential privacy) in the present study. In future work, (ε, δ)-differential privacy and other extensions of differential privacy will be considered.

Several Issues on Differential Privacy
In this study, it is assumed that each individual provides data according to the protocol of the proposed method. However, there exists the threat of a manipulation attack, in which an individual deviates from the protocol with the goal of forcing the data collection server to draw false conclusions. Non-interactive local protocols, including ours, are not robust to manipulation attacks, and the effectiveness of such attacks increases as the privacy guarantees are strengthened. Cheu et al. stated that multiparty computation or shuffling might solve this problem [86]. Multiparty computation is a technology that requires long computation times but can process data confidentially [87]. Shuffling assumes that a trusted shuffler exists; the shuffler anonymizes the origins of individuals' messages [88], [89]. Using techniques for detecting malicious input data in machine-learning models, such as [90], is another option. Verifying whether these countermeasures against manipulation attacks also work for our proposed method is left for future work.
Determining the level of privacy protection in relation to business requirements is not an easy task. Dandekar et al. proposed an algorithm that accounts for the privacy-utility tradeoff and minimizes the compensation budget [91].
Ding et al. stated that several differential privacy algorithms have bugs and do not actually ensure differential privacy [92]. Although the method proposed in this paper is proven to ensure differential privacy, it is important to recognize that such problems exist.

CONCLUSION
Patient information is required for monitoring the status of infectious diseases such as COVID-19; for this purpose, it is often collected and shared with researchers. Although privacy protection is a significant concern and necessitates privacy-protection techniques, excessive emphasis on privacy-protection processing stifles data analysis. In addition, many patients elect not to provide personal information, and some patients provide only some of their personal attribute values due to privacy concerns. In such cases, the collected privacy-protected data have many missing values. Several methods have been proposed for privacy-preserving data mining, but these methods do not consider missing values. Consequently, the accuracy of data analysis is significantly reduced when the missing values are numerous.
In this paper, we inferred the value distributions of single attributes and of combinations of two attributes, and generated a Gaussian copula. Because it uses the information on combinations of two attributes, the proposed method is robust to missing values. The generated Gaussian copula utilizes the information from all combinations of two attributes, which enhances its data reproducibility. On the real COVID-19 dataset, we demonstrated that the proposed method attains a significantly lower JS divergence than the existing methods. In this study, we evaluated the proposed method on public data; in future work, we expect to collect more sensitive attribute values using the proposed method.
Hiroshi Okumura received the PhD degree in economics from Kobe University, Hyogo, Japan, in 2012. He is with Mitsubishi Research Institute, Tokyo, Japan. His research interests include statistics, econometrics, and statistical machine learning. He is a member of the Japan Statistical Society and the Information Processing Society of Japan (IPSJ).
Akihiko Ohsuga (Member, IEEE) received the PhD degree in computer science from Waseda University, in 1995. From 1981 to 2007 he was with Toshiba Corporation. He joined the University of Electro-Communications, in 2007. He is currently a professor with the Graduate School of Informatics and Engineering. He is also a visiting professor with the National Institute of Informatics. His research interests include agent technologies, web intelligence, and software engineering. He is a member of IEEE Computer Society (IEEE CS), Information Processing Society of Japan (IPSJ), Institute of Electronics, Information and Communication Engineers (IEICE), Japanese Society for Artificial Intelligence (JSAI), Japan Society for Software Science and Technology (JSSST), and Institute of Electrical Engineers of Japan (IEEJ). He has been a fellow of IPSJ since 2017. He served as a chair of IEEE CS Japan Chapter, a member of JSAI Board of Directors, a member of JSSST Board of Directors, and a member of JSSST Councilor. He received IPSJ Best Paper Awards, in 1987 and 2017.