Rapid Re-Identification Risk Assessment for Anonymous Data Set in Mobile Multimedia Scene

Ubiquitous mobile multimedia applications bring great convenience to users. However, when enjoying mobile multimedia services, users provide personal data to service platforms. Although the service platforms always claim that the collected personal data are de-identified, the risk of re-identifying users through linkage attacks still exists and is incalculable. This paper proposes a rapid prediction model for the overall re-identification risk based on the statistics of data sets (i.e., the number of individuals, number of attributes, distribution of attribute values, and attribute dependency). Our proposed model reveals the impact of these statistics on the overall re-identification risk and adopts random sampling and semi-random sampling methods to predict the overall re-identification risk of data sets without and with strong dependency ordered attribute pairs, respectively. Experimental results show that for data sets without strong dependency ordered attribute pairs, the random sampling method has a high prediction accuracy (the prediction error is less than 0.05). For data sets with strong dependency ordered attribute pairs, the semi-random sampling method has a high prediction accuracy (the prediction error is less than 0.09). Using our model, governments and individuals can quickly assess the privacy leakage risk of their data sets, given only the statistics of the data sets. In addition, the model can evaluate the privacy risk of data collection schemes in advance according to historical statistics, and identify suspected services.

multimedia application providers (e.g., Facebook and Twitter) through various smart terminals and IoT devices [4]-[6]. The combination of these quasi-identifiers is often used by attackers to re-identify anonymous users. Famous attacks include the re-identification of anonymous Massachusetts hospital records by linking them to the public voter database [7], and the de-anonymization of anonymous subscribers in a large sparse data set (i.e., the Netflix Prize data set), for which the required background knowledge (as few as 5-10 attributes) can be obtained from the Internet Movie Database [8].
In order to protect citizens' privacy, many governments have promulgated privacy protection regulations or personal information protection laws, such as the General Data Protection Regulation (GDPR) in the European Union and the Data Protection Act (DPA) in the United Kingdom, which require that each person in a data set be anonymous. The GDPR defines a higher standard for anonymization: personal data should not contain obvious identifiers and should not be re-identifiable. However, the contradiction between data sharing and privacy protection still exists, and the scale of privacy protection is still difficult to define. Due to the lack of effective privacy risk assessment methods, striking a balance between privacy protection and data sharing remains a hard problem.
The re-identification risk of an anonymous user, which is defined as the inverse of the number of records in the data set matching the user's attribute group, is the key indicator in privacy risk assessment. If only one record in the data set matches the user's attribute group, the probability of re-identification is 1. If another 3 records match the same attribute group, the probability drops to 1/4. The well-known privacy protection model k-anonymity requires each anonymous record in a data set to share the same attribute group with at least k − 1 other records [9]. In the real world, however, user records are highly unique. Rocher et al. found that 99.98% of Americans would be correctly re-identified using 15 demographic attributes [10]. The study shows that, even in a huge data set, almost all users can be re-identified given enough attributes. Besides, from a qualitative perspective, the number of individuals, the distribution of attribute values, and attribute dependency may also affect the re-identification risk. But there is no simple method to quickly assess the re-identification risk based on the statistics of a data set.
We denote the average re-identification risk of all users in a data set as the overall re-identification risk (ORR). ORR is an important indicator for assessing the privacy disclosure risk of a data set. For governments, it is an important tool for defining the scale of privacy protection. For users, it is related to the security of sensitive personal information. For data collectors, it indicates the privacy risk of publishing an anonymous data set. For attackers, it reveals the probability of a successful attack. Although data collectors can easily calculate the ORR of a data set, they are unwilling to disclose it, for commercial or security reasons. Fortunately, for the purpose of data sharing, some data collectors publish incomplete information about the collected data, such as statistics or samples of the original complete data set, and this incomplete information may contain some knowledge about the ORR. Therefore, how to predict the ORR of a complete data set when only partial information is available is still an important and challenging problem in the field of privacy protection.
In summary, this paper proposes a rapid re-identification risk assessment (R3A) model for anonymous data sets in the mobile multimedia scene. The main contributions of this paper are as follows:
• We reveal the relationship between re-identification risk and the statistics of a data set, and are the first to propose a rapid re-identification risk prediction method based on statistics.
• We propose the information gain ratio and frequent tuples to describe attribute dependency. Random sampling and semi-random sampling methods are proposed for different degrees of attribute dependency, achieving high prediction accuracy.
The rest of the paper is organized as follows. Related work is reviewed and summarized in Section II, and Section III presents the R3A model. Experimental results and analysis are given in Section IV, and four rules about entropy and ORR are discussed in Section V. Finally, we conclude the paper and propose some future work in Section VI.

II. RELATED WORK
Current research on privacy protection technology focuses on the means of privacy protection. Sweeney [9] proposed the k-anonymity model, in which, through generalization and suppression, each record shares the same quasi-identifier attribute group with at least k − 1 other records in the data set, so that the probability of a successful linkage attack drops to 1/k. Because the values of the sensitive attributes associated with a quasi-identifier group may be similar, the k-anonymity model does not fully preserve privacy. Subsequently, l-diversity, t-closeness and other privacy protection models were proposed [11], [12]. But these models always need to be refined for new types of attacks, and they are not suitable for modern high-dimensional data sets. In 2006, Dwork [13] proposed differential privacy, which provides a stringent mathematical underpinning and reliable privacy performance evaluation, and can resist various attacks even when attackers have maximum background knowledge. Recently, privacy protection models and technologies combined with artificial intelligence or blockchain technology have become a new research hot spot [14]-[17]. However, the above privacy models focus on maximum protection against various attacks, and are not concerned with the re-identification risk of anonymous data sets.
Studies on the re-identification risk of data sets are common in the fields of statistics and medicine. The re-identification risk of a user is often defined as the product of the membership probability and the success linkage probability [18]. The membership probability is the probability that the target user appears in the data set, which depends on the attacker's background knowledge. The success linkage probability is determined by the number of records in the data set matching the target user's quasi-identifier attribute group. If there are k records in the data set matching the user's quasi-identifier attribute group, the success linkage probability is 1/k. Since the membership probability is determined by the attacker's background knowledge, which is difficult to estimate, most researchers set the membership probability to 1, so that the re-identification risk of a user equals the user's success linkage probability. Because a user with a unique record, whose success linkage probability is 100%, is certain to be re-identified, some studies also equate the uniqueness probability with the re-identification risk [19].
El Emam et al. [20] emphasized that uniqueness decreases as the population size grows. They managed re-identification risk by controlling the population size of the data set. Sweeney found that, in 1990, 87% of the U.S. population could be uniquely identified by birthdate, gender, and ZIP code [21], while Golle found that in 2000 the ratio had dropped to 63% [22]. Because the data collection methods of these two studies are uncertain, we do not know the real reason for the decline in the uniqueness of the American population. However, we note that, compared with 1990, the U.S. population had grown by 13% by 2000, and growth in population size generally leads to a decrease in uniqueness. In [10], Rocher et al. proposed a generative copula-based model to accurately predict the uniqueness of data records by random sampling from the complete data set. The study shows that the uniqueness of a data set can be predicted from partial information about the complete data set (e.g., extremely incomplete sampled data sets). But the study was not concerned with the effect of statistics on uniqueness.
For modern high-dimensional and sparse micro-data, Narayanan and Shmatikov [8] presented a new class of statistical de-anonymization (namely re-identification) attacks, which can easily identify anonymous subscribers using only 5-10 known attributes and uncover their potentially sensitive information. Merener [23] extended the study and established a mathematical theory describing the de-anonymization results that can be achieved by an adversary under general and realistic assumptions. He also found that when the auxiliary information includes a rare attribute of D, the size of the auxiliary information can be reduced by about 50%. The theory and algorithm were applied to the Joint Canada/United States Survey of Health 2004, which is less sparse than the Netflix database, and achieved a satisfactory success rate in an empirical linkage attack.
The trajectory data set is a special kind of data set, and trajectory uniqueness is a commonly highlighted research problem. Y.-A. de Montjoye et al. asserted in [24] that the trajectories of 95% of users can be uniquely determined by four spatio-temporal points, and in [25] that four spatio-temporal points can also uniquely determine the trajectories of 90% of credit card holders. Both studies emphasized that trajectory uniqueness grows dramatically as the number of time slots increases. Although two enormous data sets covering millions of users were analyzed in these two studies, trajectory uniqueness and its probability evaluation are scenario dependent and may be inapplicable to other trajectory data sets, which has encouraged vigorous discussion [19]. Tu et al. [26] proposed an attack system that recovers user trajectories with an accuracy of 73% to 91% from aggregated mobile data sets (i.e., the number of users covered by a cellular tower at a specific time stamp). Although the study did not reveal any statistical correlation between uniqueness and aggregated data, it hinted at an association between them.
The above studies imply that the statistics of a data set, such as the number of individuals, the number of attributes, and the distribution of attribute values, may affect the uniqueness or re-identification risk of the data set. However, no study has investigated the deeper relationship between statistics and re-identification risk. To the best of our knowledge, we are the first to propose a rapid re-identification risk assessment method based on statistics.

III. R3A MODEL
From a statistical perspective, we assume that data sets with the same statistics (i.e., number of individuals, distribution of attribute values, attribute dependency) have similar ORRs (this assumption is verified by simulation in Section IV). Based on this assumption, we propose the R3A model, in which the ORR of a target data set is predicted by the average ORR of random data sets with the same statistics. Considering that data owners may not disclose the attribute dependency, the R3A model provides two prediction methods, namely, full random sampling without knowledge of attribute dependency and semi-random sampling that takes attribute dependency into account.

A. DEFINITION
This section defines the terms used in the paper. We use the attribute value frequency matrix (AVFM) to describe the number of individuals, the number of attributes, and the distribution of attribute values. The attribute dependency is quantified by the information gain ratio.
Definition 1: AVFM. We consider a data set D containing n records. Each row is a user record with d quasi-identifier attributes. The set of the j-th attribute values of all users is denoted as $q^{(j)}$, which contains $l_j$ elements. The element $d_{ij}$ represents the j-th attribute value of the user $x^{(i)}$. For example, $d_{25}$ = male means that the fifth attribute value of $x^{(2)}$ is male.
We denote the frequency of the i-th element of $q^{(j)}$ as $m_{ij}$; these frequencies form the AVFM m of data set D, in which each column sums to n. The generation of the AVFM is described in detail below. As shown in Table 1, there are 5 records in data set D, each of which has three attributes. The attribute value ratios of the three attributes are 1:2:1, 3:2 and 2:3. The AVFM m of data set D is as follows.
Obviously, if the AVFM of the data set D is known, we can easily calculate the number of individuals, the number of attributes and the distribution of attribute values in data set D.
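As a minimal sketch of how an AVFM could be built, the following Python function counts attribute values column by column. The toy records are hypothetical (they are not the contents of Table 1), and two choices are assumptions made for illustration: values are ordered by sorting, and shorter columns are padded with zeros so that the matrix is rectangular.

```python
from collections import Counter

def avfm(records):
    """Return the AVFM as a list of rows.

    Entry [i][j] is the count of the i-th distinct value (in sorted order)
    of attribute j. Zero-padding of short columns is an assumption.
    """
    n_attrs = len(records[0])
    columns = []
    for j in range(n_attrs):
        counts = Counter(rec[j] for rec in records)
        columns.append([counts[v] for v in sorted(counts)])
    n_rows = max(len(col) for col in columns)
    return [[col[i] if i < len(col) else 0 for col in columns]
            for i in range(n_rows)]

# Hypothetical toy data set: 5 records, 3 attributes.
toy = [
    ("20s", "male", "yes"),
    ("30s", "male", "no"),
    ("30s", "female", "no"),
    ("40s", "male", "yes"),
    ("40s", "female", "no"),
]
m = avfm(toy)
# m == [[1, 2, 3], [2, 3, 2], [2, 0, 0]]
```

The number of attributes is the number of columns of `m`, and the number of individuals is the sum of any column.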
Definition 2: ORR. If David's record $x^{(m)}$ is unique in data set D, then he is always correctly re-identified. If another two users share the same record with David, then the re-identification probability of David is 1/3. We consider the number of poten- According to [10], the user with record x can be correctly re-identified with probability $h_x$, which is defined as follows.
The ORR of data set D is equal to the average re-identification probability of all users in data set D, which is defined as follows.
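A short Python sketch of this definition, assuming (as in the Introduction) that the re-identification probability of a record is the inverse of the number of records identical to it, and that the membership probability is 1. The function name and the toy records are illustrative only.

```python
from collections import Counter

def overall_reidentification_risk(records):
    """Average of h_x = 1 / (number of records identical to x)."""
    counts = Counter(records)  # records must be hashable, e.g. tuples
    return sum(1.0 / counts[rec] for rec in records) / len(records)

# Three users share one record and one user is unique:
# h = 1/3, 1/3, 1/3, 1, so the average is 2/4.
toy = [("a", 1), ("a", 1), ("a", 1), ("b", 2)]
toy_orr = overall_reidentification_risk(toy)
```

Under this definition the sum of the individual risks equals the number of distinct records, so the ORR can also be read as the ratio of distinct records to total records.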
Definition 3: Information gain ratio. We denote the set of all possible values of attribute A as $l_A$. For an element a of $l_A$, $f_a$ denotes the frequency of a in data set D. The entropy of attribute A is defined as follows.
For a tuple (a, b) that is an element of $l_A \times l_B$, $f_{a \wedge b}$ denotes the frequency of (a, b) in data set D. The mutual information of attribute A and attribute B is defined as follows.
The information gain ratio of B on A is defined as follows.
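The quantities in Definition 3 can be sketched in Python as follows. This is an illustration under stated assumptions: frequencies are treated as relative frequencies, logarithms are base 2, and the gain ratio is taken to be the mutual information divided by the entropy of the dependent attribute, which is consistent with a value of 1 meaning complete dependence. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(a_values, b_values):
    """Mutual information (base 2) of two equally long attribute columns."""
    n = len(a_values)
    fa, fb = Counter(a_values), Counter(b_values)
    fab = Counter(zip(a_values, b_values))
    return sum((c / n) * math.log2((c / n) / ((fa[a] / n) * (fb[b] / n)))
               for (a, b), c in fab.items())

def gain_ratio(a_values, b_values):
    """Dependency of A on B, assumed here to be I(A;B) / H(A)."""
    h = entropy(a_values)
    return mutual_information(a_values, b_values) / h if h > 0 else 0.0

# A is fully determined by B, but B is not determined by A.
a_col = ["x", "x", "y", "y"]
b_col = [1, 2, 3, 4]
g_ab = gain_ratio(a_col, b_col)  # dependency of A on B
g_ba = gain_ratio(b_col, a_col)  # dependency of B on A
```

The two calls give different values for the same pair of columns, which illustrates why the attribute pairs are treated as ordered.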
Definition 4: Support and confidence. T and S are attribute groups of data set D, and T ∩ S = ∅. Tuple t is a value of T, and tuple s is a value of S. The support of t with respect to D is defined as the proportion of users in the data set that contain the item t.
The confidence of a rule t ⇒ s with respect to data set D is the proportion of the users containing t that also contain s.
The confidence of t ⇒ s is defined as
\[ \mathrm{conf}(t \Rightarrow s) = \frac{\mathrm{supp}(t \cup s)}{\mathrm{supp}(t)} \]
where supp(t ∪ s) is the support of the union of the items t and s. For example, if the rule {smoking} ⇒ {male} has a confidence of 1.0 in a data set, then the rule is correct for 100% of the smokers (100% of the smokers are male).
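The following Python sketch computes support and confidence on a hypothetical four-record data set. Records are represented as dictionaries from attribute name to value; this representation and the function names are assumptions made for illustration.

```python
def support(records, items):
    """Proportion of records containing every (attribute, value) pair in items."""
    hits = sum(1 for rec in records
               if all(rec.get(k) == v for k, v in items.items()))
    return hits / len(records)

def confidence(records, t, s):
    """conf(t => s) = supp(t U s) / supp(t); undefined when supp(t) is 0."""
    supp_t = support(records, t)
    if supp_t == 0:
        raise ValueError("the antecedent t does not occur in the data set")
    return support(records, {**t, **s}) / supp_t

# Hypothetical data set in which every smoker is male.
toy = [
    {"smoking": "yes", "gender": "male"},
    {"smoking": "yes", "gender": "male"},
    {"smoking": "no", "gender": "male"},
    {"smoking": "no", "gender": "female"},
]
supp_smoking = support(toy, {"smoking": "yes"})
conf_rule = confidence(toy, {"smoking": "yes"}, {"gender": "male"})
```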
Definition 5: Attribute dependency. We consider a tuple t that is a value of attribute group T, with frequency $f_t$ in data set D. The entropy of attribute group T is defined as follows.
The attribute group S is dependent on the attribute group T if knowledge of T can reduce the uncertainty (i.e., the entropy) of S. Attribute dependency is asymmetric, so we use the information gain ratio g(S, T) to quantify the dependency of S on T. When g(S, T) = 1, S is completely dependent on T; when g(S, T) = 0, S is completely independent of T; when 0 < g(S, T) < 1, S is partially dependent on T. We say that S is weakly dependent on T when 0 < g(S, T) < 0.5, and strongly dependent on T when 0.5 ≤ g(S, T) < 1.
The relationship among dependency, information gain ratio and support is shown in Table 2.
Definition 6: Experience entropy. We consider a data set D with n records consisting of d attributes, which are denoted as $A_1$ to $A_d$. We define the tuple $m = (a_1, \dots, a_d)$ as an element of M, and the frequencies of $a_1$ to $a_d$ in data set D are $f_1$ to $f_d$. The entropy of data set D is defined as follows.
If every tuple in D is unique, the entropy is at its maximum:
\[ \text{max entropy} = \log_2 n \tag{14} \]
We consider the probability of tuple m to be $p_m = \frac{f_1 \times \dots \times f_d}{n^d}$, and the experience entropy of data set D is defined as follows.

B. RANDOM SAMPLING METHOD
The AVFM of the data set contains all the statistical characteristics required by the random sampling method. Let $D_m$ denote the set of all data sets with the same AVFM m. The set of overall re-identification risks of every data set in $D_m$ is the population $R_m$. Because $R_m$ is extremely large, we adopt random sampling to analyze its statistical properties. Each sample has a capacity of 1, and samples are selected as follows. First, the standard data set is generated based on m. Second, the elements of each column of the standard data set are randomly reordered to generate a new data set, whose overall re-identification risk is a new sample randomly selected from $R_m$. Then, step 2 is repeated to obtain more random samples. Because $R_m$ is extremely large, this sampling method is equivalent to sampling without replacement. Let the AVFM m of data set D be as shown in formula 16; the standard data set based on m is $D_m$. The process of random sampling is described in detail below.
Let the AVFM m of the target data set D be an l × d matrix in which each column sums to n. The random sampling method is shown in Algorithm 1.
The dependency among the attributes of a real-world data set affects the prediction accuracy, so we use background knowledge of attribute dependency to correct the predicted results for real-world data sets. Because dependencies among three or more attributes are difficult to obtain, the experiment only considers dependencies between two attributes.

Algorithm 1 Random Sampling Method
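The random sampling procedure described above can be sketched in Python as follows. This is a minimal illustration, not the authors' listing: the function names, the default number of samples and the fixed seed are assumptions, the AVFM is passed as a list of rows, and attribute values are represented by their row indices in the AVFM.

```python
import random
from collections import Counter

def standard_data_set(m):
    """Expand an AVFM (list of rows) into one column of value indices per attribute."""
    columns = []
    for j in range(len(m[0])):
        col = []
        for i, row in enumerate(m):
            col.extend([i] * row[j])  # value index i repeated m[i][j] times
        columns.append(col)
    return columns

def orr(columns):
    """ORR of the data set whose attributes are given column-wise."""
    records = list(zip(*columns))
    # The mean of 1/k_x over all records equals distinct records / n.
    return len(Counter(records)) / len(records)

def predict_orr(m, n_samples=50, seed=0):
    """Mean ORR of n_samples random data sets sharing the AVFM m."""
    rng = random.Random(seed)
    columns = standard_data_set(m)
    total = 0.0
    for _ in range(n_samples):
        for col in columns:
            rng.shuffle(col)  # reorder each attribute column independently
        total += orr(columns)
    return total / n_samples

# Hypothetical AVFM: 5 records, 3 attributes.
prediction = predict_orr([[1, 2, 3], [2, 3, 2], [2, 0, 0]])
```

Only the AVFM is needed as input, which is the point of the method: no record of the target data set is consulted.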
The semi-random sampling method is described in detail below. If attribute A is strongly dependent on B, the tuples of the strong dependency ordered attribute pair (A, B) whose confidence or frequency exceeds a certain threshold are considered frequent tuples. For example, if (A, B) is a strong dependency ordered attribute pair and the confidence threshold is 0.8, then the tuple (a, b) with conf(b ⇒ a) = 0.9 is a frequent tuple. All frequent tuples in (A, B) and their confidences are called the dependency background knowledge about (A, B). The semi-random sampling method is similar to the random sampling method, except that it maintains the attribute dependencies of the target data set to some extent. For example, if $D_m$ is a data set generated by semi-random sampling and (a, b) is a frequent tuple with conf(b ⇒ a) = 0.9 in the target data set, then conf(b ⇒ a) in $D_m$ is also 0.9.
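A rough Python sketch of the idea for a single ordered attribute pair (A, B) is given below. It is an illustration under assumptions rather than the authors' algorithm: the function name and arguments are hypothetical, the frequent tuples are pinned first according to their confidences, and the remaining values of A are then assigned at random. The random remainder may add further co-occurrences of a frequent tuple, so the target confidence is only approximately preserved in general.

```python
import random
from collections import Counter

def semi_random_pair(counts_a, counts_b, frequent, seed=0):
    """Generate (a, b) records that keep the value counts of A and B and
    pin each frequent tuple (a, b) with confidence conf(b => a)."""
    if sum(counts_a.values()) != sum(counts_b.values()):
        raise ValueError("both attributes must cover the same number of records")
    rng = random.Random(seed)
    b_col = [b for b, c in counts_b.items() for _ in range(c)]
    a_col = [None] * len(b_col)
    pool = Counter(counts_a)  # values of A that are still unassigned
    rows_by_b = {}
    for idx, b in enumerate(b_col):
        rows_by_b.setdefault(b, []).append(idx)
    for (a, b), conf in frequent.items():
        free = [i for i in rows_by_b.get(b, []) if a_col[i] is None]
        k = min(round(conf * counts_b.get(b, 0)), pool[a], len(free))
        for i in rng.sample(free, k):  # pin k rows with B = b to A = a
            a_col[i] = a
        pool[a] -= k
    rest = list(pool.elements())
    rng.shuffle(rest)  # the remaining values of A are placed at random
    for i, a in enumerate(a_col):
        if a is None:
            a_col[i] = rest.pop()
    return list(zip(a_col, b_col))

# Hypothetical counts: conf(b => a) = 0.9 pins 9 of the 10 rows with B = b.
records = semi_random_pair({"a": 9, "x": 11}, {"b": 10, "y": 10},
                           {("a", "b"): 0.9})
```

In this toy case all nine occurrences of a are consumed by the pinned tuple, so conf(b ⇒ a) in the generated data set is exactly 0.9.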
Let the confidence of frequent tuple (a, b) in the target data set be b_a; the semi-random sampling method is shown in Algorithm 2.
We selected 20 representative AVFMs for which the ORRs of the corresponding random data sets are approximately equally spaced between 0 and 1. Then we randomly selected 100 samples from the ORR population $R_m$ corresponding to each AVFM. The capacity of each sample was 50. We used the sample mean to predict the ORR of the target random data set. For simplicity, we set the target ORR equal to the average sample mean. The absolute prediction errors are shown in Figure 1. The x axis shows the average sample mean of each AVFM. The y axis shows the absolute prediction error (i.e., the difference between the target ORR and the sample mean). As shown in Figure 1, the absolute error for each AVFM is close to zero. This means that the sample means are concentrated around the average sample mean; abnormal sample means occasionally occur, but the deviation between an outlier and the average sample mean is no more than 0.001. The results show that random data sets with the same AVFM have highly consistent ORRs. That is, random data sets with the same statistics (i.e., number of individuals, distribution of attribute values) have similar ORRs. Our assumption was thus verified on the random data sets.
Through further research, we found that none of the 100,000 random data sets contained strong dependency ordered attribute pairs. Compared with the total number of random data sets with the same AVFM, both the number of random data sets used in the simulation and the number of data sets containing strong dependency ordered attribute pairs are negligible. Because both are randomly distributed in the population of random data sets, the probability that the two sets intersect is extremely small. Since real-world data sets usually contain strong dependency ordered attribute pairs, we also tested the prediction performance of the random sampling method on real data sets.
TABLE 5. Attribute dependencies between any two non-sensitive attributes of SPD data set.

2) PREDICTING ORR OF REAL-WORLD DATA SETS
Since it was difficult to obtain a large number of real-world data sets, we generated 20 target data sets by selecting random rows and certain columns from a large real-world data set. The original real-world data set used in this study was the SPD data set, with a capacity of 3985166 records and 10 attributes [19]. Table 3 describes the considered attributes in detail, Table 4 shows the selected attributes of each target data set, and Table 5 provides the attribute dependencies between any two non-sensitive attributes of the SPD data set. As shown in Table 5, most ordered attribute pairs are weakly dependent, and only three of them are strongly dependent. These three ordered attribute pairs are (patnty, oshpd_id), (patnty, patzip) and (oshpd_id, patzip). It is understandable that, in the real world, a patient's hospital, county and ZIP code are highly correlated, and the dependencies among them are easily available from public information.
We used the random sampling method to predict the ORRs of the real-world data sets, and the absolute prediction errors are shown in Figure 2. The x axis shows the average sample mean corresponding to the AVFM of each target data set. The y axis shows the absolute prediction error (i.e., the difference between the sample mean and the target ORR). The prediction errors of groups 6, 13-16 and 20 were above 0.2, while the errors of the other data sets were all below 0.05.
Through further research, we found that the attribute dependencies of the target data sets were close to those of the SPD data set. All target data sets with high prediction errors contained strong dependency ordered attribute pairs, while all data sets with low prediction errors did not. This shows that strong dependency ordered attribute pairs heavily degrade the prediction accuracy of the random sampling method.

B. SEMI-RANDOM SAMPLING METHOD
Unlike the random sampling method, the semi-random sampling method takes into account background knowledge of the attribute dependencies in the target data set. Considering that, in reality, the statistical characteristics of a large population are easier to obtain than those of a specific small population, we used knowledge of the attribute dependencies in the SPD data set to constrain the random data set. Because the records of the target data set are randomly sampled from the SPD data set, the following two situations need to be considered: (1) the target data set does not contain some frequent tuples of the SPD data set; (2) the confidence of a frequent tuple in the target data set is theoretically lower than the corresponding confidence in the SPD data set. In situation one, nothing needs to be done. In situation two, we need to ensure that the confidence of the frequent tuple in the random data set is equal to the smaller of the theoretical value for the target data set and the background value from the SPD data set. For example, if tuple (a, b) is a frequent tuple in the SPD data set with conf(b ⇒ a) = 0.9, then in a random data set containing the same frequent tuple, the confidence of (a, b) is equal to min(0.9, supp(a)/supp(b)), where supp(a) and supp(b) are the supports of elements a and b in the target data set.
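The cap in this example follows from conf(b ⇒ a) = supp(a ∧ b)/supp(b) together with supp(a ∧ b) ≤ supp(a). A one-function Python sketch, with a hypothetical function name and hypothetical support values:

```python
def capped_confidence(background_conf, supp_a, supp_b):
    """Confidence to enforce for a frequent tuple (a, b): the background
    value, capped by the largest value feasible in the target data set."""
    return min(background_conf, supp_a / supp_b)

# a is rarer than b in the target: at most half of the b rows can carry a.
limited = capped_confidence(0.9, supp_a=0.2, supp_b=0.4)
# a is common enough: the background confidence is used unchanged.
unchanged = capped_confidence(0.9, supp_a=0.5, supp_b=0.4)
```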
We considered as frequent tuples those tuples of strong dependency ordered attribute pairs with confidence greater than 0.9, or with confidence between 0.5 and 0.9 and frequency exceeding 398. The absolute error of the semi-random sampling method is shown in Figure 3. With the background knowledge of attribute dependency, the ORR prediction accuracy was greatly improved, and the absolute prediction error was limited to 0.09. The background knowledge we considered was incomplete; if more background knowledge were obtained, the absolute prediction error would be further reduced. The results show that data sets with the same statistics (i.e., number of individuals, number of attributes, distribution of attribute values, attribute dependency) have highly consistent overall re-identification risks. This means that our assumption also holds for real-world data sets.

V. DISCUSSION
Here, we discuss the relationship between entropy and ORR. We obtained 40 testing data sets by randomly selecting 30,000 and 300,000 records from the SPD data set according to the attribute combinations shown in Table 4. The max entropy (ME), the experience entropy (EE), the entropy of the random data set with the same AVFM (RDE), and the entropy of the target data set (TDE) of the forty data sets are shown in Figure 4. The solid line represents the data set capacity of 30,000, and the dashed line represents the data set capacity of 300,000. Based on information theory, statistical knowledge and experimental results, we have summarized the following four rules.
Rule 1: The ME is completely determined by the capacity of the data set: the larger the capacity, the greater the maximum entropy. If the TDE is equal to the ME, which means that each tuple of the target data set is unique, then the ORR of the target data set is 1. EE must be greater than or equal to TDE and RDE.
Rule 2: When the same combination of attributes is selected, the EEs of the sampled data sets with 30,000 records and 300,000 records are very close, because both are drawn from the SPD data set and have similar proportions of attribute values for each attribute. If the capacity of the data set changes but the combination of attributes and the proportion of each attribute value remain the same, EE will be greater than ME when the capacity of the data set is small enough. This is because, when the capacity is too small, the number of tuples in the data set is far lower than the capacity of the tuple space T, so the ME of the data set is lower than the EE calculated from the probability distribution. For example, consider a data set with 1000 records and 4 attributes, where each attribute has 10 attribute values and the frequency of each attribute value is 100. Then the ME of the data set is 9.9658, which is lower than the EE of the data set, 13.2877.
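The two numbers in this example can be checked with a few lines of Python. The sketch assumes that the EE is the Shannon entropy (base 2) of the product distribution $p_m$ from Definition 6, which equals the sum of the per-attribute entropies; the function names are illustrative.

```python
import math

def max_entropy(n):
    """ME of a data set with n records: log2(n)."""
    return math.log2(n)

def experience_entropy(attribute_counts, n):
    """Entropy of the product distribution p_m = f_1 * ... * f_d / n^d,
    computed as the sum of the per-attribute entropies."""
    return sum(-sum((f / n) * math.log2(f / n) for f in counts if f > 0)
               for counts in attribute_counts)

n = 1000
counts = [[100] * 10 for _ in range(4)]  # 4 attributes, 10 values of frequency 100
me = max_entropy(n)
ee = experience_entropy(counts, n)
print(round(me, 4), round(ee, 4))  # prints: 9.9658 13.2877
```

Here the tuple space holds 10^4 possible tuples but the data set holds only 1000 records, which is why the EE exceeds the ME.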
Rule 3: RDE is less than or equal to EE. When the capacity of the tuple space T remains the same, RDE is close to EE if the data set capacity is large enough. The larger the RDE, the larger the ORR. For data sets with the same AVFM proportions, the larger the capacity, the larger the RDE, but the lower the ORR.
Rule 4: TDE is less than or equal to RDE, because the attribute dependencies in a real-world data set reduce the uncertainty of the data, whereas random data sets destroy the attribute dependencies and maximize the entropy of the data set. When there are no strong dependency ordered attribute pairs in the target data set, the TDE is very close to the RDE, and the ORR of the target data set is very close to that of the random data set. When there are strong dependency ordered attribute pairs in the target data set, the TDE deviates greatly from the RDE, and the ORR of the target data set is much lower than that of the random data set. In general, for data sets with the same capacity, the ORR of a data set with a significantly larger TDE is greater than that of a data set with a smaller TDE.
In short, when the data set capacity is large enough, ME ≥ EE ≥ RDE ≥ TDE. The ORR of the data set is highly correlated with the TDE; the dependencies among the attributes of the data set make the TDE deviate from the RDE, and the ORR of the target data set is then much lower than that of the random data set. This means that, of two data sets with the same AVFM, the one with stronger attribute dependencies has the lower ORR. The attribute dependencies (e.g., the dependency of beer on diapers) are exactly where the value of the data set lies. This suggests that data privacy and value are not always contradictory. Differential privacy preserves user privacy by adding random noise, but random noise destroys attribute dependency and reduces data availability. If attribute dependency can be maintained while noise is added, the availability of the data will be preserved without increasing the privacy risk.
R3A is not suitable for modern sparse data sets with thousands of attributes, in which each user has far fewer non-null attributes. However, because they record too many user attributes, modern high-dimensional sparse data sets carry a high privacy risk. From the perspective of privacy protection, high-dimensional data sets should be divided into low-dimensional data sets, for which the R3A model predicts well. From the perspective of privacy attack prevention, attackers can only acquire knowledge of a few attributes, and the R3A model is capable of predicting the ORR of the data set composed of these attributes.

VI. CONCLUSION
In this paper, we propose the R3A model to rapidly predict the ORR of a data set. Our model has high prediction accuracy when the background knowledge of attribute dependency (i.e., the confidence of all frequent tuples) is taken into account. Fortunately, in the real world, this background knowledge can be easily obtained from public data, which opens a wide space for the application of our model. For example, the R3A model can be used to rapidly assess privacy disclosure risk, providing a reference for government policy making and personal privacy estimation.