Protecting Privacy in Knowledge Graphs With Personalized Anonymization

Knowledge graphs (KGs) are emerging data models that allow data providers to share data. This data sharing might bring new knowledge and collaborations, with evident benefits for providers. However, since KGs might contain sensitive information about users, it is of utmost importance to anonymize KGs before publishing them. Recently, some proposals have addressed the problem of KG anonymization based on the $k$-anonymity principle. These techniques anonymize the whole dataset with the same anonymization level. However, in a context where data are collected from different users, it is crucial to also consider users' preferences on the anonymization level to adopt for their data. To cope with this requirement, this paper presents the Personalized $k$-Attribute Degree (p-$k$-ad) principle. It allows users to specify their anonymity levels (the $k$ values) while preventing adversaries from re-identifying them with a confidence higher than $\frac{1}{k}$ with their specified $k$. Moreover, we design the Personalized Cluster-Based Knowledge Graph Anonymization Algorithm (PCKGA) to generate anonymized KGs satisfying p-$k$-ad. We conduct experiments on four real-life datasets and show that PCKGA greatly improves the quality of anonymized KGs compared to previous algorithms.

The problem of graph anonymization has been well investigated in the past by using two main techniques: k-anonymity and differential privacy (DP) [1]. k-anonymity follows the non-interactive setting, where data providers modify users' relationships in their graphs such that adversaries cannot re-identify a user in the modified graphs with a confidence higher than $1/k$, even by exploiting background knowledge on the target victim. For instance, k-degree and k-neighborhood [1] modify the original graph such that the degree or neighborhood structure of any user in the anonymized graph is indistinguishable from that of $k - 1$ other users. Unfortunately, it has been shown in [2] that the above approaches are not enough for KGs, since adversaries can exploit both attribute values and relationships.
On the other hand, DP anonymizes graphs under two settings: interactive and non-interactive. The former [3] creates a system that interactively answers statistical queries on graphs while preventing adversaries from inferring the existence of a user (i.e., a node) by looking at the extracted statistics. The latter publishes graphs' statistics once while ensuring the same privacy protection as the former. For example, [3] introduces an algorithm to generate statistics of graphs and their nodes' attributes (e.g., the number and size of communities, the distribution of attributes' values in the communities) while protecting users' identities. [4] uses DP to inject noise into data transferred between clients in order to protect users' privacy in federated learning settings. Although DP overcomes some of k-anonymity's limitations by not relying on assumptions about what information adversaries can exploit, data providers must design a different DP analysis algorithm for each type of statistics that data recipients require. Having separate statistics on the same KG might represent a limitation for KG analyses, as several of them exploit the correlation between KGs' attributes and relationships (e.g., drug design, tax calculations, or financial reporting). As an example, in [3], the extracted statistics on graphs' attributes (e.g., the distribution of communities' attributes) are not correlated to those of users' relationships (e.g., the size of communities). Moreover, many KG analyses are still not supported by DP (e.g., tax calculations). Therefore, to enhance users' privacy protection in the current developments of KG analyses without sacrificing the utility of shared KGs, we focus on k-anonymity approaches.
Recently, some proposals (e.g., [2], [5]) have extended k-anonymity to KGs. [5] introduces an anatomy approach that generates groups of at least k users and returns the frequency of each sensitive value in these groups. However, this approach restricts the utility of KGs, since data recipients cannot use standard analytical tools (e.g., deep learning) to analyze them. k-Attribute Degree (k-ad) [2] prevents users from being re-identified with a confidence higher than $1/k$ even if adversaries exploit both attribute values and relationship out-/in-degrees. To this end, k-ad requires adding/removing edges such that the information of any user in the anonymized KG is indistinguishable from that of $k - 1$ other users in the KG. However, several studies in academia [6] and industry, for instance those based on Westin's privacy types [7], suggest that online users can be classified into groups (typically 4) with different levels of privacy concerns. Therefore, we believe that users' privacy types should also be considered when data are anonymized, by allowing each user to set his/her own privacy level.
A few k-anonymity and DP proposals for relational data [8], [9], [10] and undirected graphs [11], [12] allow users to specify personalized preferences for their data anonymization. Liu et al. [8] allow users to specify two thresholds. The former prevents anonymized data from being more detailed than the threshold. The latter is specified for each sensitive value to ensure that a user cannot be associated with any sensitive value with a confidence higher than his/her specified threshold for that value. In [11], users can specify their preferences in anonymizing: (1) only their attribute values, (2) their attribute values and relationships, (3) their values, relationships, and neighbors. [12] presents a similar approach, but its supported preferences are: (high) anonymizing both users' sensitive values and neighbors, (medium) anonymizing only neighbors, and (low) no anonymization. [9], [10] are DP-based approaches allowing users to specify their own privacy parameters' values (i.e., $\epsilon$), and data providers must consider all of these values when they generate statistics from the users' attributes in relational data. However, we are not aware of any proposals for KGs allowing users to set their personalized preferences (e.g., k). All of the state-of-the-art privacy protection proposals for KGs [2], [5] apply the same k value to all users. A naive solution that lets each user specify his/her k value is to select the maximum k among all specified values, but this might result in poor data quality.
Therefore, in this paper, we present the Personalized k-Attribute Degree (p-k-ad) principle, to protect users' identities even if multiple k values are specified. p-k-ad requires that, for every user $u$ in an anonymized KG, his/her attribute values and relationship out-/in-degrees are indistinguishable from those of at least $k_u - 1$ other users in the KG, where $k_u$ is the k value set by $u$. Thus, adversaries cannot re-identify a user $u$ with a confidence higher than $1/k_u$. To generate anonymized KGs satisfying the p-k-ad principle, following the approach presented in [2], we design the Personalized Cluster-Based Knowledge Graph Anonymization Algorithm (PCKGA), which generates anonymized KGs in two steps: Clusters Generation and Knowledge Graph Generalization. The former allows data providers to specify their own clustering algorithm (e.g., k-medoids [13], HDBSCAN [14]) to generate clusters. The latter generates anonymized KGs such that the anonymized attribute values and relationships' out-/in-degrees of users in the same cluster are identical. We formally prove that if the clusters' sizes are greater than or equal to the maximum k values of their users, the generated anonymized KGs satisfy the p-k-ad principle. Here, the anonymized attribute values of users in a cluster are the union of their attribute values. The out-/in-degrees of the anonymized relationships of the users are the maximum out-/in-degrees of the users' relationships. To optimize the quality of anonymized KGs, we must minimize (1) the difference between attribute values and relationships' out-/in-degrees of users in the same cluster, and (2) the disparity among their k values. Condition (2) guarantees that users with smaller k values are not anonymized together with those with excessively high k values, which would lead to high information loss. Since these operations are done in the Clusters Generation step, in this paper, we focus on designing this step, and use the Knowledge Graph Generalization step that we proposed in [2].
Although allowing providers to choose their clustering algorithms increases flexibility, standard clustering algorithms do not fit the scenario of personalized anonymization, since they only minimize the distance among users without considering their k values. This might produce clusters with very similar users but with very different k values, thus resulting in high information loss. To address this challenge, we propose a new clustering algorithm, called the Varying-Anonymity Clustering Algorithm (VAC), that can be used in the first step of PCKGA instead of current clustering algorithms. VAC measures the anonymization distance between two users by using not only their distance w.r.t. their attributes and degrees, but also their k values. VAC generates clusters such that the anonymization distances among users in the same cluster are minimized. Moreover, it removes users whose k values or distances to other users are too high. To ensure that the sizes of the obtained clusters are greater than or equal to the maximum among the k values of their users, we develop the Merge-Split Algorithm (MS). MS modifies the clusters generated by the clustering algorithm by removing invalid clusters, i.e., those whose size is less than the required k. Then, MS adds users belonging to the invalid clusters to their nearest valid clusters such that all clusters remain valid. We formally prove that no matter what clustering algorithm data providers exploit, by using MS, we always generate anonymized KGs satisfying the p-k-ad principle.
We conduct experiments by using four real-life datasets and compare the quality of the anonymized KGs generated by executing PCKGA with various settings: VAC, state-of-the-art clustering algorithms (i.e., k-medoids [13], HDBSCAN [14]) with appropriate parameters' values (i.e., max/min/mean users' k values), and a simple anonymization approach using HDBSCAN (i.e., HDB*). HDB* gathers users whose k values are equal into groups and anonymizes users in the same group with their common k value. The experimental results show that executing PCKGA with VAC results in anonymized KGs whose information loss is 210% lower than that of those generated by k-medoids and HDBSCAN and 26% lower than that of HDB*. Furthermore, our results indicate that PCKGA's performance is good enough to be used in practice. Our experiments also show that PCKGA outperforms the previous anonymization algorithms for KGs [2] and relational data [15].
Fig. 1. Knowledge graphs (dashed lines denote added fake edges). The k values of Ken ($u_0$), Mary ($u_1$), Henry ($u_2$), Tom ($u_3$), and Jane ($u_4$) are 2, 2, 2, 1, and 4, respectively. The anonymized KG is generated from two clusters $\{u_0, u_1\}$ and $\{u_2, u_3\}$, while $u_4$ is removed due to its high k value.
The remainder of this paper is organized as follows. Section II explains the adversary knowledge and our protection model. Anonymization algorithms are presented in Section III, while Section IV explains their details. Privacy guarantees of our proposal are analyzed in Section V. We illustrate the experimental results in Section VI and conclude our paper in Section VII. Due to the lack of space, the summary of notations, the complexity analysis, proofs of theorems in Section V, and some experimental results are included in the supplementary material section, available online.

II. PERSONALIZED ANONYMIZATION OF KGS
In this section, we introduce a new privacy principle to support KG anonymization with personalized k values. Before that, we briefly introduce KGs. We refer the reader to the supplementary material section, available online, for a summary of the notations used in this section and throughout the paper.

A. Knowledge Graph
We model a KG as a graph $G(V, E, R)$, where $V$, $E$, $R$ are the set of nodes, the set of edges connecting these nodes, and the set of relationship types of these edges, respectively. Since this work focuses on protecting users' privacy in KGs, we categorize the set of nodes $V$ into the set of users $V_U$ and the set of attribute values $V_A$, where $V_U \cup V_A = V$. Relationship types in $R$ are categorized into two subsets: user-to-user relationship types $R_{UU}$, representing users' relationships (e.g., follows), and user-to-attribute relationship types $R_{UA}$, modelling users' attributes (e.g., age). Thus, $R = R_{UU} \cup R_{UA}$. We model each edge $e \in E$ as a tuple $(u, r, v)$, where $u, v \in V$ and $r \in R$. We denote with $E_{UA} \subseteq E$ those $e = (u, r_a, v_a)$ such that $r_a \in R_{UA}$, and with $E_{UU} \subseteq E$ those $e = (u, r_r, v_r)$ such that $r_r \in R_{UU}$. Fig. 1(a) illustrates an example of a KG.
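As a minimal sketch of this model, the snippet below stores a toy KG as a list of $(u, r, v)$ tuples and derives, for one user, the attribute values and the per-type out-/in-degrees that the rest of the paper works with. The users, relationship types, values, and the helper names `attribute_values` and `degrees` are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Hypothetical toy KG: every edge is a (source, relationship type, target) tuple.
R_UU = {"follows"}        # user-to-user relationship types
R_UA = {"age", "job"}     # user-to-attribute relationship types
E = [
    ("u0", "age", 30), ("u0", "job", "nurse"),
    ("u1", "age", 30), ("u1", "job", "teacher"),
    ("u0", "follows", "u1"), ("u1", "follows", "u0"), ("u2", "follows", "u0"),
]

def attribute_values(edges, user):
    """Map each attribute type to the set of values of `user` (edges in E_UA)."""
    values = defaultdict(set)
    for u, r, v in edges:
        if u == user and r in R_UA:
            values[r].add(v)
    return dict(values)

def degrees(edges, user):
    """Return (out-degrees, in-degrees) of `user` per user-to-user type (edges in E_UU)."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for u, r, v in edges:
        if r in R_UU:
            if u == user:
                out_deg[r] += 1
            if v == user:
                in_deg[r] += 1
    return dict(out_deg), dict(in_deg)
```

Keeping edges as plain tuples mirrors the $(u, r, v)$ notation directly; a production system would more likely rely on a graph store or an RDF library.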

B. Personalized K-Attribute Degree Principle
This section introduces the Personalized k-Attribute Degree principle. We start by characterizing the knowledge that an adversary can exploit to perform re-identification attacks.
Let $G(V, E, R)$ be a KG, and $\overline{G}(\overline{V}, \overline{E}, R)$ be its anonymized version created by modifying $G$. Given a target user $u \in V_U$, the adversary's goal is to re-identify $u$ by using the background knowledge he/she has on $u$ and the information he/she can extract from $\overline{G}$. We formally define the adversary knowledge as follows.
The Personalized k-Attribute Degree principle is defined as follows.
Definition 2 (Personalized k-Attribute Degree): Let $\overline{G}(\overline{V}, \overline{E}, R)$ be an anonymized KG. $\overline{G}$ satisfies the Personalized k-Attribute Degree (p-k-ad) principle if and only if, for every user $u$ in $V_U$, there is a set of users $C(\overline{G}, u)$ that have the same attribute values and relationship out-/in-degrees as $u$ and such that $|C(\overline{G}, u)| \geq k_u$, where $k_u$ is a positive integer specified by $u$.
Fig. 1(b) illustrates the anonymized version of $G$ (Fig. 1(a)) satisfying p-k-ad. In this work, similarly to the state-of-the-art personalized k-anonymity [8], [11], [12] and DP proposals [9], [10], we assume that users' personalized k values are public. However, as pointed out in [10], the privacy parameters could be used to infer personal information. For instance, a user might select high parameter values based on his/her job or function (e.g., politician), so attackers could infer the user's job from the selected values. To address this concern, we follow [10] and assume that there is no relationship between the selected privacy values and any sensitive values embedded in anonymized KGs. Therefore, even if adversaries exploit the public k values of target users, they cannot increase the confidence of re-identifying users. Privacy protection guarantees of p-k-ad will be described in Section V.
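The requirement can be checked mechanically once each user's anonymized information is summarized. The sketch below assumes that every user comes with a hashable "signature" standing for his/her anonymized attribute values and out-/in-degrees; the signature encoding and the function name `satisfies_pkad` are illustrative assumptions, not part of the paper.

```python
from collections import Counter

def satisfies_pkad(signatures, k_values):
    """Check that each user shares its signature with at least k_u users (itself included).

    signatures: user -> hashable summary of anonymized attribute values and degrees
    k_values:   user -> personalized k value (k_u)
    """
    counts = Counter(signatures.values())
    return all(counts[signatures[u]] >= k_values[u] for u in signatures)

# Hypothetical anonymized KG with two groups of indistinguishable users.
signatures = {"u0": "A", "u1": "A", "u2": "B", "u3": "B"}
k_values = {"u0": 2, "u1": 2, "u2": 2, "u3": 1}
```

With these values every user is hidden among at least $k_u$ users; raising $k_{u_2}$ to 3 would violate the principle, because only two users share signature "B".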

III. PERSONALIZED CLUSTER-BASED KNOWLEDGE GRAPH ANONYMIZATION
This section introduces the overall idea for the generation of anonymized KGs satisfying p-k-ad (see Section II-B). We propose the Personalized Cluster-Based Knowledge Graph Anonymization Algorithm (PCKGA) (Algorithm 1), which groups into clusters similar users, that is, users whose profile data are not distant according to some distance metric. Then, the k-anonymity principle is satisfied by anonymizing users belonging to the same cluster with the same generalization (e.g., same attribute values and relationship out-/in-degrees). Therefore, PCKGA contains two main steps: Clusters Generation (lines 1-4) and Knowledge Graph Generalization (line 5). The first step generates clusters by calling the clustering algorithm $A$ received as input with the anonymization distance matrix $D_a$ and its parameters $P$ (lines 1-3). Subsequently, these clusters are adjusted to ensure their validity, which requires that the number of users in each cluster is greater than or equal to the maximum k value of its users (line 4). Then, PCKGA calls the Knowledge Graph Generalization Algorithm (KGG) [2] to generalize attributes and relationships (line 5). Finally, it returns $\overline{G}$ (line 6).
Clusters Generation: The literature offers several clustering algorithms (e.g., k-Medoids [13], HDBSCAN [14]), proposed for the k-anonymization of both relational and non-relational data, that can be used by our PCKGA algorithm. Clusters generation relies on the definition of the anonymity level of a cluster, formally defined in what follows.
Definition 3 (Cluster anonymity level): Let $c$ be a cluster of users, where each user $u \in c$ has specified an anonymity level, denoted as $k_u \in \mathbb{N}$. The anonymity level of cluster $c$, denoted as $k_c$, is defined as the maximum value among the anonymity levels specified by users in $c$, i.e., $k_c = \max_{u \in c} k_u$.
PCKGA's clusters generation step creates only valid clusters, that is, clusters $c$ whose number of users, i.e., $|c|$, is greater than or equal to their anonymity level $k_c$. This is done by the Merge-Split Algorithm (MS) (Algorithm 3), which possibly refines the output of the provided clustering algorithm $A$ so that all the generated clusters are valid.
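Definition 3 and the validity condition translate into two one-line functions. The sketch below applies them to the k values listed in the caption of Fig. 1; the function names are hypothetical.

```python
def anonymity_level(cluster, k_values):
    """k_c: the maximum k value among the users of the cluster (Definition 3)."""
    return max(k_values[u] for u in cluster)

def is_valid(cluster, k_values):
    """A cluster is valid when its size is at least its anonymity level."""
    return len(cluster) >= anonymity_level(cluster, k_values)

# k values of the users in Fig. 1.
k_values = {"u0": 2, "u1": 2, "u2": 2, "u3": 1, "u4": 4}
```

For instance, $\{u_0, u_1\}$ and $\{u_2, u_3\}$ are valid, whereas $\{u_2, u_4\}$ is not: it has two users but an anonymity level of 4.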
Although PCKGA can work with any clustering algorithm, previously defined algorithms do not fit our scenario of personalized anonymization well, since they minimize the distance among users without considering the k values they select. This could result in clusters with very similar users but with very different k values, therefore bringing the risk of big clusters. Indeed, even if only a few users specify high k values, the cluster has to be enlarged so that its cardinality is greater than or equal to the maximum among all k values specified by its users. To cope with this issue, we propose a new clustering algorithm for PCKGA, called the Varying-Anonymity Clustering Algorithm (VAC) (Algorithm 2). VAC generates clusters useful for personalized k-anonymization.
Knowledge Graph Generalization: The previous step generates only clusters whose cardinality is always greater than or equal to the maximum value among the anonymity levels specified by their users. This satisfies only partially the requirements imposed by the p-k-ad principle introduced by Definition 2. In particular, it ensures that, for each user $u$ in $\overline{G}$, there exists a set of users $C(\overline{G}, u)$, i.e., a cluster, such that $|C(\overline{G}, u)| \geq k_u$. To fully satisfy the p-k-ad principle, we also have to ensure that all users in $C(\overline{G}, u)$ have identical attribute values and the same out-/in-degree for all relationship types.
For this purpose, we exploit the Knowledge Graph Generalization Algorithm (KGG) presented in [2], which has been designed to satisfy the k-Attribute Degree (k-ad) principle [2]. This principle requires that each user within the anonymized graph has a minimum of $k - 1$ other users with the same attribute values and out-/in-degree for all relationship types. To this end, KGG first extracts the union of the attribute values of all users in the target cluster. Subsequently, it generalizes the attribute values of users in the cluster by adding fake user-to-attribute edges to make the users' values identical to those in the union. To generalize the user-to-user relationships of users in a cluster, KGG finds the maximum out-/in-degree of all relationship types of the users belonging to the cluster. It then increases the users' out-/in-degrees to match the maximum degrees by adding user-to-user edges. If it is impossible to add edges, it reduces the maximum out-/in-degree of the relationship types by removing edges of the users whose degrees are equal to the maximum ones. It then continues adding user-to-user edges.
However, KGG works under the assumption that k is unique for the whole graph. Applying KGG in the context of PCKGA implies executing it on each cluster $c$ generated by PCKGA's first step with k set to $k_c$.
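The targets that KGG steers each cluster towards (the union of attribute values and the maximum out-/in-degrees) can be illustrated as follows. This is only a sketch of the target computation and of the fake user-to-attribute edges it implies; it does not reproduce KGG's edge addition/removal procedure from [2], and the data layout and function names are assumptions.

```python
def generalization_targets(cluster, attrs, out_deg, in_deg):
    """Return (union of attribute values, max out-degrees, max in-degrees) of a cluster.

    attrs[u][r_a] is a set of values; out_deg[u][r_r] and in_deg[u][r_r] are integers.
    """
    union, max_out, max_in = {}, {}, {}
    for u in cluster:
        for r, vals in attrs[u].items():
            union.setdefault(r, set()).update(vals)
        for r, d in out_deg[u].items():
            max_out[r] = max(max_out.get(r, 0), d)
        for r, d in in_deg[u].items():
            max_in[r] = max(max_in.get(r, 0), d)
    return union, max_out, max_in

def missing_attribute_edges(u, attrs, union):
    """Fake user-to-attribute edges needed so that u's values equal the union."""
    return sorted((u, r, v) for r, vals in union.items()
                  for v in vals - attrs[u].get(r, set()))

# Hypothetical two-user cluster.
attrs = {"u0": {"job": {"nurse"}}, "u1": {"job": {"teacher"}}}
out_deg = {"u0": {"follows": 1}, "u1": {"follows": 2}}
in_deg = {"u0": {"follows": 2}, "u1": {"follows": 0}}
union, max_out, max_in = generalization_targets(["u0", "u1"], attrs, out_deg, in_deg)
```

Here both users end up with the job values {nurse, teacher}, and $u_0$ needs one additional outgoing "follows" edge to reach the maximum out-degree of 2.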
In the following section, we will focus on the clusters generation step, which is the one impacted by supporting personalized k values. We refer interested readers to [2] for more details on KGG.

IV. CLUSTERS GENERATION
The Clusters Generation step generates a set of valid clusters. It first calculates the parameters' values $P$ used to execute the clustering algorithm $A$ (line 2, Algorithm 1) and then executes $A$ with the calculated parameters $P$ (line 3, Algorithm 1). For instance, in the case of VAC, $P$ includes $K$, the k values of users in $G$. We explain how we set parameters' values for other clustering algorithms (i.e., k-Medoids, HDBSCAN) in Section VI-B. This step creates a preliminary set of clusters $C$. Then, we run the Merge-Split algorithm (MS), which merges and splits clusters in $C$ to generate the set of valid clusters $\overline{C}$ (line 4, Algorithm 1).
We recall that even though data providers can use any clustering algorithm as $A$, we propose a new clustering algorithm, namely VAC, which takes into account users' k values. The main advantage of VAC is that providers only need to specify an anonymization distance matrix and users' k values. In contrast, k-medoids requires specifying the number of generated clusters, whereas HDBSCAN needs the minimum size for all generated clusters. Even though we present VAC as the clustering algorithm $A$ that generates clusters for PCKGA, providers may have different needs and want to develop their own clustering algorithms. In such cases, they can still use their own algorithms with PCKGA without any modification.
VAC aims to generate clusters while minimizing: (1) the maximum distance between users in the same cluster, and (2) the differences among these users' k values. The key idea is the definition of a novel distance metric that considers both (1) and (2). In the rest of the section, we first present the proposed distance metric, then VAC and MS.

A. Distance Measure
The proposed distance measure aims to estimate the anonymization cost of a cluster by considering its users' k values. Given a user $u$ and his/her k value, $k_u$, we first estimate the cost of finding the $k_u - 1$ users that are most similar to $u$. Hereafter, we refer to these users as $u$'s nearest neighbors, denoted as $N(u)$. To measure the similarity between two users and thus compute $N(u)$, we use the Attribute and Degree Information Loss (ADM) Distance [2], denoted as $d_{adm}$. This measure incorporates three components: the attributes' distance ($d_{am}$), the distance of the target users' out-degrees ($d^{o}_{dm}$), and the distance of their in-degrees ($d^{i}_{dm}$). $d_{am}$ estimates the information loss of generalizing the attribute values of user $u$ (denoted as $AL(u, v)$) and those of user $v$ (denoted as $AL(v, u)$) to make the attribute values of $u$ and $v$ equal. Given an attribute $r_a$, the information loss of generalizing $r_a$'s values for a user $u$ such that they are equal to those of a user $v$ is the difference between $u$'s generalized attribute values (denoted in what follows as $GV(G, u, v, r_a)$) and $u$'s original values, where $I_a(G, r_a, u)$ and $I_a(G, r_a, v)$ denote the original values of $u$ and $v$, respectively. If $r_a$ is a categorical attribute, the difference is the number of values added by the generalization step (i.e., $|GV(G, u, v, r_a) \setminus I_a(G, r_a, u)|$). If $r_a$ is a numerical attribute, the difference is given by the changes of the minimum and maximum values after generalizing $r_a$'s values (i.e., $\min GV(G, u, v, r_a) - \min I_a(G, r_a, u)$ and $\max GV(G, u, v, r_a) - \max I_a(G, r_a, u)$). $d_{am}$ then takes the average of the information loss of $u$ and $v$.
$d_{am}$ is formally defined as follows.
Definition 4 (Attribute Information Loss Distance [2]): Let $u$, $v$ be two users in a KG $G$. The Attribute Information Loss Distance ($d_{am}$), measuring the information loss of making $u$'s and $v$'s attribute values identical, is the average of the two users' information losses, $d_{am}(u, v) = \frac{AL(u, v) + AL(v, u)}{2}$, where $AL(u, v)$ is computed from the categorical and numerical differences described above and $M$, used for numerical attributes, is either the max or min function.
In contrast, $d^{o}_{dm}$ and $d^{i}_{dm}$ measure the information loss of user $u$ and user $v$ when generalizing their relationships. The information loss of the target users on a given relationship type $r_r$ is computed by finding the differences between the users' generalized out-/in-degrees (denoted as $GD_o(G, u, v, r_r)$ and $GD_i(G, u, v, r_r)$) and their original ones.
Definition 5 (Out-Degree Information Loss Distance [2]): Let $u$, $v$ be two users in a KG $G$. The Out-Degree Information Loss Distance ($d^{o}_{dm}$) of making $u$ and $v$ have the same out-degrees on all relationship types is computed from the differences between the generalized out-degrees $GD_o(G, u, v, r_r)$ and the users' original out-degrees. The in-degree distance $d^{i}_{dm}$ is obtained analogously from $GD_i(G, u, v, r_r)$.
By combining the above defined distances, $d_{adm}$ estimates the information we lose on generalizing the attributes and relationships of two users $u$ and $v$. The higher the distance between two users is, the less similar they are. $d_{adm}$ is formally defined as follows.
Definition 6 (Attribute and Degree Information Loss Distance [2]): Let $u$, $v$ be two users in a KG $G$. The Attribute and Degree Information Loss Distance ($d_{adm}$) of making $u$ and $v$ have the same values on all attributes, and the same out-/in-degree on all relationship types, is computed by combining $d_{am}$, $d^{o}_{dm}$, and $d^{i}_{dm}$, the attribute, out-degree, and in-degree information loss distances between $u$ and $v$ formally defined in Definitions 4-5.
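The per-attribute differences that feed the attribute distance can be sketched as below. The sketch assumes that the generalized values $GV(G, u, v, r_a)$ are the union of the two users' values, which is consistent with how KGG generalizes a cluster but is not stated for this definition; the normalization and aggregation over attributes used in [2] are not reproduced, and the function names are hypothetical.

```python
def categorical_loss(own, other):
    """Number of values added to `own` when generalizing to the union (assumed GV)."""
    return len((own | other) - own)

def numerical_loss(own, other):
    """Shifts of the minimum and of the maximum after generalizing to the union (assumed GV)."""
    gv = own | other
    return abs(min(gv) - min(own)), abs(max(gv) - max(own))
```

For example, a user whose only job value is "nurse" gains one value when paired with a user holding {"teacher", "nurse"}, and a user aged 30 paired with values {25, 40} sees the minimum move by 5 and the maximum by 10.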
Thus, the cost of finding $k_u - 1$ users similar to $u$ is defined as $cost(u) = \max_{v \in N(u)} d_{adm}(u, v)$. We refer to this cost as $u$'s core distance.
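The core distance can be computed directly from pairwise ADM distances. In the sketch below the distances are hypothetical numbers stored in a nested dictionary, and the core distance of a user with an empty $N(u)$ (i.e., $k_u = 1$) is taken to be 0, which is an assumption about how the maximum over an empty set is treated.

```python
def nearest_neighbors(u, d_adm, k_values):
    """N(u): the k_u - 1 users closest to u according to d_adm."""
    ranked = sorted(d_adm[u], key=d_adm[u].get)
    return ranked[: k_values[u] - 1]

def core_distance(u, d_adm, k_values):
    """cost(u): the largest d_adm between u and its nearest neighbors (0 if none)."""
    return max((d_adm[u][v] for v in nearest_neighbors(u, d_adm, k_values)), default=0.0)

# Hypothetical symmetric ADM distances and k values.
d_adm = {
    "u0": {"u1": 0.2, "u2": 0.5},
    "u1": {"u0": 0.2, "u2": 0.4},
    "u2": {"u0": 0.5, "u1": 0.4},
}
k_values = {"u0": 2, "u1": 3, "u2": 1}
```

With these numbers, $u_0$ needs one neighbor ($u_1$, at distance 0.2), $u_1$ needs two (core distance 0.4), and $u_2$ needs none.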
Example 3: Considering Fig. 1(a), the nearest neighbors of each user are determined by the user's k value; in particular, $N(u_3)$ is empty since $u_3$ does not need any neighbors to be in a valid cluster ($k_{u_3} = 1$).
Given a user $u$, his/her core distance allows us to determine the minimum information loss that a cluster containing $u$ will incur to accommodate enough similar users to satisfy the $k_u$ constraint. By using the user core distance, we can now define the anonymization distance between two users $u$ and $v$ as the cost of generating the smallest valid cluster around them. To build a valid cluster, we have to insert a number of users equal to the maximum between $k_u$ and $k_v$. Moreover, to minimize the cost, we can insert $u$'s and $v$'s nearest neighbors (i.e., $N(u)$, $N(v)$). Thus, the cost estimation of the obtained cluster depends on the maximum of the distances between its users. This can be seen as the maximum among: the core distance of $u$ (i.e., the maximum distance between $u$ and its $N(u)$), the core distance of $v$ (i.e., the maximum distance between $v$ and its $N(v)$), and their ADM distance (i.e., $d_{adm}(u, v)$). The total cost is given by the maximum distance of its users multiplied by the number of its users, the latter set as the maximum of $k_u$ and $k_v$. This multiplication allows us to measure not only the anonymity level of the valid cluster containing both $u$ and $v$ but also the distances between the cluster's users. The higher either their distances or their anonymity levels are, the higher their anonymization distance is.
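Read literally, the description above amounts to $d_a(u, v) = \max(k_u, k_v) \cdot \max(cost(u), cost(v), d_{adm}(u, v))$. The sketch below implements this reading; the formula is reconstructed from the prose rather than copied from a formal definition, and the core distances, ADM distances, and k values are hypothetical.

```python
def anonymization_distance(u, v, d_adm, core, k_values):
    """Cost of the smallest valid cluster around u and v (reading of the prose above)."""
    size = max(k_values[u], k_values[v])            # number of users the cluster needs
    spread = max(core[u], core[v], d_adm[u][v])     # largest distance inside the cluster
    return size * spread

# Hypothetical inputs: pairwise ADM distances, precomputed core distances, k values.
d_adm = {
    "u0": {"u1": 0.2, "u2": 0.5},
    "u1": {"u0": 0.2, "u2": 0.4},
    "u2": {"u0": 0.5, "u1": 0.4},
}
core = {"u0": 0.2, "u1": 0.4, "u2": 0.0}
k_values = {"u0": 2, "u1": 3, "u2": 1}
```

Note how $u_0$ and $u_1$ are close in ADM terms (0.2) yet obtain a larger anonymization distance than $u_0$ and $u_2$ (0.5 apart), because $u_1$'s k value of 3 forces a bigger cluster.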
We exploit the anonymization distance to compute the cost of $u$'s anonymization. Intuitively, each user $u$ must be added to a valid cluster that has at least $k_u - 1$ other users. To minimize the information loss of anonymizing user $u$, his/her cluster must include $u$ and his/her $k_u - 1$ nearest neighbors. So, the minimum anonymization cost of user $u$ can be considered as the maximum anonymization distance among users in the cluster. We define the cost as follows.

B. The Varying-Anonymity Clustering Algorithm
The Varying-Anonymity Clustering Algorithm (VAC) (see Algorithm 2) receives as input the anonymization distances computed on each pair of users in $G$, $D_a$, and the set of users' k values, $K$, and returns a set of clusters built by leveraging the user anonymization cost previously defined. Distances are modelled as a matrix, called distance matrix, $D_a \in \mathbb{R}^{|V_U| \times |V_U|}$, whose element $D_a[i, j]$ contains $d_a(u_i, u_j)$, with $u_i, u_j \in V_U$. The usage of the distance matrix allows our algorithm to be compatible with state-of-the-art clustering libraries. In case the original KG is too big and it is impractical to calculate the matrix, VAC's scalability can easily be improved by calculating distances at run-time and caching them in a database.
To minimize the distances between users, VAC prioritizes the creation of clusters for users with lower anonymization costs over those with higher costs. More precisely, the algorithm first sorts all users in $V_U$ in ascending order by their anonymization cost and stores the sorted users in a sorted set $U_{asc}$ (line 1). A sorted set ensures the uniqueness of users and the efficiency of users' addition/removal as well as of the retrieval of the user with the smallest anonymization cost (i.e., $U_{asc}[0]$). The set of clusters $C$ is initialized as the empty set (line 2). While $U_{asc}$'s size is higher than or equal to the k value of the first user in $U_{asc}$, the algorithm generates clusters (lines 3-7). It removes the first element in $U_{asc}$ and stores it in $u$ (line 4). $u$ is passed to function find_best_cluster() to create its best valid cluster $c$, which is inserted into $C$ (lines 5-6). Once the cluster has been created, the function removes from $U_{asc}$ all users in $c$, so that a new user is considered in the next while iteration. After the while cycle ends, if $U_{asc}$ still contains some users, they will not be included in any cluster, because the cluster containing them would be invalid (i.e., $|U_{asc}| < k_{U_{asc}[0]}$). In Section VI, we will show how these removed users impact the obtained information loss. Finally, the algorithm returns the set of generated clusters $C$ (line 8).
Function find_best_cluster(): Given a user $u$, the anonymization distance matrix $D_a$, the set of users $U_{asc}$, and the k values of users in $G$, $K$, this function finds a cluster for $u$ such that the maximum anonymization distance between pairs of users in the found cluster is minimized. First, it sorts the users in $U_{asc}$ in ascending order by their anonymization distance to $u$ and stores them in the array $U_c$ (line 1). Then, it initializes a cluster $c$ with only user $u$, whereas the cluster anonymity level $k_c$ is set to $u$'s anonymity level (lines 2-3). It initializes $i$ as 0 (line 4). While $c$ is not valid (i.e., $k_c - |c| > 0$) and $U_c$ still has users (i.e., $i < |U_c|$), it keeps finding $u$'s nearest users to add into $c$ (lines 5-10). In each iteration, the function adds the $i$th user in $U_c$ (i.e., $U_c[i]$) to $c$ (line 6). Since $U_c$ is sorted in ascending order by the anonymization distance to $u$, the added user is the remaining one closest to $u$. Then, it updates $k_c$ (line 7), removes the user from $U_{asc}$ (line 8), and increases $i$ by 1 (line 9). Finally, it returns cluster $c$ (line 11).
Then, VAC considers $u_0$. The cluster for $u_0$ is $\{u_0, u_1\}$, since $u_1$ is $u_0$'s closest user. VAC removes these users from $U_{asc}$, and creates the last cluster $\{u_2, u_4\}$ for the remaining users. Then, all these clusters are returned. Note that the last cluster is invalid, since it contains two users while $k_{u_4} = 4$ (see Fig. 1).
The complexity of VAC is $O(n^2 \log n)$, where $n$ is the number of users in $V_U$.
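The two routines described in this section can be sketched in a few lines of Python. The sketch follows the textual description rather than the exact pseudocode of Algorithm 2, uses plain lists instead of a sorted set, and takes the per-user anonymization costs as a precomputed input. The k values are those of Fig. 1, while the positions used to derive distances and the cost values are hypothetical, chosen so that the run yields the clusters of the running example.

```python
def find_best_cluster(u, dist, remaining, k_values):
    """Grow a cluster around u with its closest remaining users until it is valid."""
    candidates = sorted(remaining, key=lambda v: dist[u][v])   # U_c
    cluster, k_c, i = [u], k_values[u], 0
    while k_c - len(cluster) > 0 and i < len(candidates):
        v = candidates[i]
        cluster.append(v)
        k_c = max(k_c, k_values[v])      # the anonymity level may grow with v
        remaining.remove(v)              # v is no longer available in U_asc
        i += 1
    return cluster

def vac(dist, k_values, cost):
    """Return (clusters, users left without a cluster)."""
    remaining = sorted(k_values, key=lambda u: cost[u])        # U_asc
    clusters = []
    while remaining and len(remaining) >= k_values[remaining[0]]:
        u = remaining.pop(0)
        clusters.append(find_best_cluster(u, dist, remaining, k_values))
    return clusters, remaining

# Hypothetical data: users placed on a line, distance = gap between positions.
pos = {"u0": 0, "u1": 1, "u2": 5, "u3": 3, "u4": 9}
dist = {u: {v: abs(pos[u] - pos[v]) for v in pos} for u in pos}
k_values = {"u0": 2, "u1": 2, "u2": 2, "u3": 1, "u4": 4}
cost = {"u3": 0.0, "u0": 1.0, "u1": 1.2, "u2": 2.0, "u4": 8.0}
clusters, leftover = vac(dist, k_values, cost)
```

The last cluster returned here, `["u2", "u4"]`, is invalid on purpose: find_best_cluster() runs out of candidates before reaching the anonymity level of 4, which is the situation the Merge-Split Algorithm repairs.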

C. The Merge-Split Algorithm
The set of clusters $C$ generated by the provided clustering algorithm $A$ may be modified so that only valid clusters are passed to the generalization phase. This is done by the Merge-Split Algorithm (MS) (see Algorithm 3). This algorithm takes as input the anonymization distance matrix $D_a$, a set of clusters $C$, the $k$ values $K$ of users in $G$, and a threshold $\tau$. It returns a set of clusters $C$ such that, for each $c \in C$, the number of its users satisfies $k_c \le |c| < 2 k_c$, where $k_c$ is the anonymity level of $c$. The second condition ($|c| < 2 k_c$) ensures that MS does not generate clusters that are too large. This allows MS to reduce the magnitude of the generalized attributes and of the out-/in-degrees of relationships. Thus, the number of fake edges needed for generalization also decreases.
Algorithm 3 first selects from $C$ only the valid clusters (i.e., $C_{valid}$), while the users in the other clusters are collected into $U$ (lines 1-9). Then, it calls Procedure assign_valid_clusters() to assign these users to valid clusters in $C_{valid}$ (line 10). As detailed later, each user $u \in U$ is inserted into a valid cluster $c$ where the maximum anonymization distance between $u$ and the other users in $c$ is less than or equal to the maximum distance calculated from threshold $\tau$. Then, Algorithm 3 refines each cluster $c$ in $C_{valid}$ by splitting it if it is too large, that is, if $c$'s size is greater than or equal to twice its anonymity level (lines 12-21). If this is the case, Algorithm 3 calls Function split_big_cluster() to split $c$ into a set of smaller clusters whose sizes are greater than or equal to $k_c$ and less than $|c|$ (line 16). The obtained set of clusters $C_c$ is then added to $C_{valid}$ (line 17). Otherwise, Algorithm 3 adds $c$ to $C$ (line 19). Finally, it returns $C$ (line 22).
Procedure assign_valid_clusters(): This procedure first calculates the maximum anonymization distance $\tau_d$, based on the maximum and minimum anonymization distances of users and threshold $\tau$ (lines 1-2). Then, for each user $u$ in $U$, it searches for a cluster $c \in C_{valid}$ where the maximum anonymization distance between $u$ and the other users in $c$ is less than or equal to $\tau_d$ and $c$ is still valid after adding $u$. If such a cluster exists, $u$ is assigned to it; otherwise, $u$ is removed. In particular, for every user $u$ in $U$, it initializes $min_d$ to $+\infty$ (line 4) and $c_{min}$ to the empty set (line 5). Then, for each $c$ in $C_{valid}$, it calculates the maximum distance $d$ between the users in $c$ and $u$ (lines 7-10). If $d$ is less than $min_d$, the number of users in $c$ after adding $u$ is greater than or equal to the maximum anonymity level of $c$ and $u$, and $d$ is less than or equal to $\tau_d$, it updates $min_d$ (line 13) and $c_{min}$ (line 14). If there is a cluster $c_{min}$ in $C_{valid}$ satisfying the above conditions, it adds $u$ to $c_{min}$ (line 18). Then, it continues with the next user in $U$ (line 3).
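The assignment step can be illustrated with the following Python sketch. It is not the authors' code: the text only states that $\tau_d$ is derived from the minimum and maximum anonymization distances and $\tau$, so the linear interpolation used below is an assumption, as are the function signature and the toy data.

```python
import math


def assign_valid_clusters(dist, valid_clusters, users, k, tau):
    """Add each user to the closest admissible valid cluster; return removed users."""
    pairwise = [dist[a][b] for a in dist for b in dist[a] if a != b]
    d_min, d_max = min(pairwise), max(pairwise)
    # Assumption: tau_d interpolates linearly between d_min and d_max.
    tau_d = d_min + tau * (d_max - d_min)
    removed = []
    for u in users:
        best_d, best_c = math.inf, None
        for c in valid_clusters:
            d = max(dist[u][v] for v in c)               # farthest member of c
            k_new = max(k[u], max(k[v] for v in c))      # level after adding u
            if d < best_d and len(c) + 1 >= k_new and d <= tau_d:
                best_d, best_c = d, c
        if best_c is None:
            removed.append(u)      # no admissible cluster: u is dropped
        else:
            best_c.append(u)
    return removed


# Hypothetical toy data: users on a line, distance = absolute difference.
pos = {"a": 0, "b": 1, "c": 2, "d": 10, "e": 11}
dist = {u: {v: abs(pos[u] - pos[v]) for v in pos} for u in pos}
k = {"a": 2, "b": 2, "c": 2, "d": 2, "e": 3}
clusters = [["a", "b"], ["c", "d"]]
removed = assign_valid_clusters(dist, clusters, ["e"], k, tau=1.0)
print(clusters, removed)
```

With `tau=1.0` every distance is admissible, so the leftover user joins the cluster whose farthest member is nearest; with a smaller threshold the same user would be removed instead.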
Example 7: Let {u 3 }, {u 0 , u 1 }, {u 2 , u 4 } be the clusters generated in Example 6 and suppose τ has been set to 0.5. is less than τ d (i.e., 0.575) and the resulting cluster is still valid.However, it cannot find any cluster for u 4 as adding it to any cluster will make the cluster invalid.MS returns clusters: {u 0 , u 1 } and {u 2 , u 3 }.
Function split_big_clusters(): This function receives the anonymization distance matrix $D_a$, a cluster $c$, and the users' $k$ values $K$. First, it calculates the number of clusters to be generated, $n_c$ (line 1), and extracts the distance matrix of the users in $c$, $D_a^c$ (line 2). Then, it calls BanditPAM [16], an efficient variant of k-Medoids, to find the set $M_c$ of users who are the centers of the clusters to be generated (line 3). Next, it finds the set $c_r$ of users who are not in $M_c$ (line 4) and initializes the set of resulting clusters $C_c$ as the empty set (line 5). It then creates $n_c$ clusters of users in $c$ and adds them to $C_c$ (lines 6-14). For every $i$ from 0 to $n_c - 1$, if $i$ is less than $n_c - 1$, it calls Function find_nearest_elements() to find the $k_c - 1$ users in $c_r$ who are closest to the $i$th medoid user ($M_c[i]$), denoted $c_c$ (line 8). Otherwise (i.e., $i = n_c - 1$), it assigns all the users remaining in $c_r$ to $c_c$ (line 10). Then, it removes the users in $c_c$ from $c_r$ (line 12) and adds the cluster containing the medoid user and $c_c$'s users to $C_c$ (line 13). Finally, the function returns $C_c$ (line 15).
The complexity of MS is $O(m \times n^2 \log n)$, where $m$ is the number of clusters in $C$ and $n$ is the number of users in $V_U$. We include the detailed time complexity analysis in the supplementary material, available online.
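A rough Python sketch of the splitting step is given below. Two parts are assumptions rather than details taken from the text: the number of sub-clusters is taken to be $\lfloor |c| / k_c \rfloor$, and BanditPAM is replaced by a naive stand-in that picks the most central user first and then repeatedly the user farthest from the medoids chosen so far. The function and variable names are illustrative.

```python
def split_big_cluster(dist, cluster, k):
    """Split a too-large cluster into sub-clusters of at least k_c users each."""
    k_c = max(k[u] for u in cluster)
    n_c = len(cluster) // k_c          # assumption: number of sub-clusters
    # Naive stand-in for BanditPAM: most central user, then farthest-first picks.
    medoids = [min(cluster, key=lambda u: sum(dist[u][v] for v in cluster))]
    while len(medoids) < n_c:
        candidates = [u for u in cluster if u not in medoids]
        medoids.append(max(candidates,
                           key=lambda u: min(dist[u][m] for m in medoids)))
    rest = [u for u in cluster if u not in medoids]
    result = []
    for i, m in enumerate(medoids):
        if i < n_c - 1:
            # The k_c - 1 remaining users closest to the i-th medoid.
            members = sorted(rest, key=lambda u: dist[m][u])[:k_c - 1]
        else:
            members = list(rest)       # last sub-cluster takes everyone left
        rest = [u for u in rest if u not in members]
        result.append([m] + members)
    return result


# Hypothetical toy data: users on a line, distance = absolute difference.
pos = {"a": 0, "b": 1, "d": 10, "e": 11}
dist = {u: {v: abs(pos[u] - pos[v]) for v in pos} for u in pos}
k = {"a": 2, "b": 2, "d": 2, "e": 2}
parts = split_big_cluster(dist, ["a", "b", "d", "e"], k)
print(parts)
```

Because the last sub-cluster absorbs all remaining users, every returned sub-cluster has at least $k_c$ members under the assumed choice of $n_c$.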

V. PRIVACY ANALYSIS
In this section, we analyze how the p-k-ad principle and the PCKGA algorithm protect user privacy. The following theorem states that, if an anonymized KG $\overline{G}$ satisfies p-k-ad, any user $u$ in $\overline{G}$ cannot be re-identified with a confidence higher than $\frac{1}{k_u}$, where $k_u$ is specified by $u$.
Theorem 1: Let $\overline{G}$ be an anonymized KG. If $\overline{G}$ satisfies p-k-ad, for every user $u \in V_U$, an adversary cannot exploit $AK(u)$ to re-identify $u$ with a confidence higher than $\frac{1}{k_u}$.
The privacy protection of PCKGA relies on MS, which generates valid clusters, and KGG [2], which generalizes users' attributes and relationships. As the following theorem proves, MS always generates a set of valid clusters.
Theorem 2: Let G be a KG and C be a set of clusters generated by Algorithm 3 when its input is G. Every c in C is valid.
By applying KGG [2] on the set of valid clusters generated from MS, PCKGA always generates anonymized KGs satisfying p-k-ad.
Theorem 3: Let $G$ be a KG, $C$ be the set of clusters generated by Algorithm 3 over $G$, and $\overline{G}$ be the anonymized version of $G$ created by the KGG algorithm executed with $G$ and $C$. Then $\overline{G}$ satisfies p-k-ad.
The proofs of the theorems are included in the supplementary material, available online.
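The validity property that Theorem 2 refers to can be checked mechanically. The snippet below is a small illustrative sketch, assuming that a cluster's anonymity level $k_c$ is the largest $k$ value among its members; the helper name and the sample values are hypothetical.

```python
def is_valid(cluster, k):
    """A cluster is valid when it has at least k_c users (k_c - |c| <= 0)."""
    k_c = max(k[u] for u in cluster)   # assumption: k_c is the largest member k
    return len(cluster) >= k_c


# Hypothetical k values chosen by five users.
k = {"a": 2, "b": 2, "c": 2, "d": 2, "e": 3}
checks = [is_valid(["a", "b"], k), is_valid(["d", "e"], k)]
print(checks)  # the second cluster is too small for user "e", who asked for k = 3
```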

VI. EVALUATION
We conduct experiments on four real-life datasets to evaluate the proposed algorithms (i.e., VAC and MS). The first experiment aims to show the effectiveness of VAC over state-of-the-art clustering algorithms (i.e., k-Medoids [13] and HDBSCAN [14]). The second experiment is designed to evaluate the impact of MS in improving the quality of anonymized KGs. The third experiment shows the performance of VAC and MS. In the final experiment, we compare the anonymized KGs produced by our algorithms with those generated by k-ad's algorithm [2]. Since KGs can represent relational data, we also compare our algorithm against a cluster-based anonymization algorithm designed for this type of data [15].

A. Datasets
We use four popular real-life datasets, namely Freebase [17], Yago [18], Credit [20], and Coil [19]. Freebase and Yago are selected since they are the most widely used in state-of-the-art deep learning publications on KGs. Freebase and Yago store attributes' values (e.g., nationality, location) and relationships (e.g., spouse, parent) of famous people (e.g., the film director Anthony Asquith) derived from Wikipedia, WordNet, and other data sources. Even though they contain semantically similar attributes/relationships, they have different types of attributes/relationships and different sizes. Credit stores properties of real bad credits in Germany, and Coil contains information on customers of an insurance company. For each dataset, we manually categorize its nodes into the set of users (i.e., $V_U$) and the set of attributes' values (i.e., $V_A$). Since Freebase has 5000 users and 4016 values, it has 9016 nodes. Yago has 13303 nodes. These datasets represent a valid benchmark to test our algorithms as they contain different numbers of users, attributes, and relationships (see Table I for datasets' properties). However, they do not contain users' anonymity levels, which we generated synthetically following two main strategies. As the first strategy, we adopt the approach used to generate anonymity levels to evaluate state-of-the-art personalized k-anonymity methods for location data [21], [22], [23]. Here, anonymity levels are generated using a Zipf distribution, with the α parameter set to 2.
As the second strategy, we consider that individuals do not set their anonymity levels randomly. To model users' behavior, we assume that users who share more information are more willing to protect their data and thus require a higher anonymity level. This assumption comes from a recent survey run by CISCO 9 that pointed out that users who actively set their privacy settings are those that exploit more online services and thus those that expose more information. According to this second strategy, the anonymity level of a user $u$ is determined by the number of $u$'s edges. The greater the number of edges is, the greater the associated $k$ value is.
In both strategies, we assume that anonymity levels are taken from fixed intervals. State-of-the-art personalized anonymization techniques for location data specified these levels to be from 2 to 5 [21], from 3 to 15 [22], and from 10 to 50 [23]. In this paper, we therefore use two intervals containing these levels. The first contains levels between 2 and 5 with an increment of 1 (denoted hereafter as 2,5,1); the second contains levels between 5 and 50 with an increment of 5 (denoted as 5,50,5 in what follows). Thus, the maximum anonymity levels of users are 5 and 50 for the two considered intervals, respectively.
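The two generation strategies can be sketched in Python as below. The text does not spell out how levels are drawn or how edge counts map to levels, so both mappings are assumptions made for illustration: the Zipf-like draw gives the $r$-th smallest level a probability proportional to $1/r^{\alpha}$, and the edge-based strategy bins edge counts linearly into the available levels. All function names are hypothetical.

```python
import random


def levels(lo, hi, step):
    """Anonymity levels of an interval, e.g. 2,5,1 -> [2, 3, 4, 5]."""
    return list(range(lo, hi + 1, step))


def zipf_levels(users, lo, hi, step, alpha=2.0, seed=0):
    """Assumption: the r-th smallest level is drawn with weight 1 / r**alpha."""
    lv = levels(lo, hi, step)
    weights = [1 / (rank ** alpha) for rank in range(1, len(lv) + 1)]
    rng = random.Random(seed)
    return {u: rng.choices(lv, weights=weights)[0] for u in users}


def edge_based_levels(edge_counts, lo, hi, step):
    """Assumption: edge counts are binned linearly, more edges -> higher level."""
    lv = levels(lo, hi, step)
    e_min = min(edge_counts.values())
    span = max(max(edge_counts.values()) - e_min, 1)
    return {u: lv[(n - e_min) * len(lv) // (span + 1)]
            for u, n in edge_counts.items()}


# Hypothetical users and edge counts.
by_edges = edge_based_levels({"u0": 1, "u1": 4, "u2": 10}, 2, 5, 1)
by_zipf = zipf_levels(["u0", "u1", "u2"], 5, 50, 5)
print(by_edges, by_zipf)
```

Any monotone mapping from edge counts to levels would fit the description of the second strategy; the linear binning is only the simplest choice.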

B. Clustering Settings
The first experiment aims to show the effectiveness of VAC over two state-of-the-art clustering algorithms: k-Medoids [13] and HDBSCAN [14]. However, these two algorithms do not support personalized $k$ values. k-Medoids receives as input only the number of clusters to be generated, say $\kappa$, whereas HDBSCAN receives the minimum size of clusters to be generated, say $k_{unique}$. This last parameter represents the k-anonymity level that has to be applied to the whole dataset, i.e., to all the clusters. In order to exploit them in PCKGA as an alternative to VAC, we have to specify $\kappa$ for k-Medoids and $k_{unique}$ for HDBSCAN, which are included in the parameters' values $P$ (line 2, Algorithm 1).
9 https://www.cisco.com/c/dam/global/en_uk/products/collateral/security/cybersecurity-series-2019-cps.pdf
In particular, to define $k_{unique}$, we adopt three strategies, according to which $k_{unique}$ is set to the maximum, minimum, or average of all anonymity levels of users in the considered dataset. As such, we run HDBSCAN (k-Medoids, resp.) with $k$ fixed as the maximum, denoted as hdbscan#max (km#max, resp.); the minimum, denoted as hdbscan#min (km#min, resp.); and the average, denoted as hdbscan#mean (km#mean, resp.). $\kappa$ is defined as the number of users in the dataset divided by the minimum number of users in each cluster, that is, the adopted $k_{unique}$. Moreover, since HDBSCAN ensures the minimum size of its generated clusters, we develop a basic extension of HDBSCAN, namely HDB*, to generate clusters with multiple $k_{unique}$ values and compare it with VAC. In particular, we first gather users into groups such that users in the same group have the same $k$ value. Then, we execute HDBSCAN on each of these groups with $k_{unique}$ equal to its users' $k$ value. Therefore, each user is only anonymized with users sharing the same $k$ value.
Due to the lack of space, this section only presents the results for k-Medoids, HDB*, and VAC, whereas the results for HDBSCAN's settings are reported in the Supplementary Material, available online.
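The parameter derivation for the baselines can be made concrete with a short Python sketch. It only computes the settings and the per-level grouping used by HDB*; it does not call any clustering library. Rounding the average level and using integer division for $\kappa$ are assumptions, and the function names and sample levels are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def baseline_parameters(k):
    """Derive k_unique (max/min/mean) and the matching number of clusters kappa."""
    n = len(k)
    k_unique = {
        "max": max(k.values()),
        "min": min(k.values()),
        "mean": round(mean(k.values())),   # assumption: average is rounded
    }
    # kappa = number of users divided by the minimum cluster size k_unique.
    return {name: {"k_unique": ku, "kappa": n // ku}
            for name, ku in k_unique.items()}


def group_by_level(k):
    """HDB* preprocessing: users with the same k value end up in the same group."""
    groups = defaultdict(list)
    for user, level in k.items():
        groups[level].append(user)
    return dict(groups)


# Hypothetical levels for twelve users: four users each at k = 2, 3, and 5.
k = {f"u{i}": level for i, level in enumerate([2] * 4 + [3] * 4 + [5] * 4)}
params = baseline_parameters(k)
groups = group_by_level(k)
print(params, groups)
```

Each group returned by `group_by_level` would then be clustered separately with a minimum cluster size equal to the group's level.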

C. Metrics
To evaluate the quality of generated KGs, we use two metrics: the average information loss (AIL) and the ratio of removed users (RRU). The AIL of a user is estimated as the difference between his/her attributes' values and out-/in-degrees in the anonymized KG $\overline{G}$ and his/her original ones. If the user is removed, his/her information loss is 1.
where AL and DL are $u$'s information loss on his/her attributes and relationships. We define AL as follows: where $d_o(G, r_r, u)$ and $d_i(G, r_r, u)$ ($d_o(\overline{G}, r_r, u)$ and $d_i(\overline{G}, r_r, u)$, resp.) are the out- and in-degree of relationship $r_r$ of $u$ in the original KG $G$ (anonymized KG $\overline{G}$, resp.). Since removed users have a high impact on AIL, we use RRU to analyze these impacts in detail. RRU measures the percentage of users in the original KG $G$ that are not included in its anonymized version $\overline{G}$.
When the dataset uses anonymity levels generated with zipf, we generate the levels three times and run our experiments separately. The metrics' results are given as the average of the three executions.
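How removed users enter the two metrics can be illustrated with a small Python sketch. The per-user loss of retained users (the combination of AL and DL) is taken as a given input, since only the convention that removed users count as 1 is used here; the function names and numbers are hypothetical.

```python
def ratio_removed_users(original_users, anonymized_users):
    """RRU: share of original users missing from the anonymized KG."""
    removed = set(original_users) - set(anonymized_users)
    return len(removed) / len(original_users)


def average_information_loss(original_users, loss):
    """AIL: mean per-user loss, where removed users (absent from loss) count as 1."""
    return sum(loss.get(u, 1.0) for u in original_users) / len(original_users)


# Hypothetical example: five users, "u4" was removed during anonymization.
users = ["u0", "u1", "u2", "u3", "u4"]
loss = {"u0": 0.0, "u1": 0.25, "u2": 0.5, "u3": 0.25}
rru = ratio_removed_users(users, loss.keys())
ail = average_information_loss(users, loss)
print(rru, ail)
```

In this toy case a single removed user contributes half of the total loss, which is why the ratio of removed users is reported alongside AIL.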

D. VAC Algorithm
This experiment compares the quality of anonymized KGs generated by running PCKGA with VAC, k-Medoids [13], and our extension of HDBSCAN (HDB*). To evaluate the impact of the clustering algorithms, we execute VAC (k-Medoids, HDB*, resp.) without the Merge-Split Algorithm. Among the obtained clusters, we only keep those that are valid to generate anonymized KGs. Table II illustrates the average information loss of users and the ratio of removed users in anonymized KGs for both considered datasets. VAC always helps PCKGA generate higher-quality KGs than k-Medoids and HDB*, even though VAC does not require any additional input parameter. With the setting te#5,50,5 for Yago, the best KGs generated by k-Medoids have an information loss of 0.117, while the information loss of those generated by VAC is 0.0022. The main reason is that VAC removes outliers. In this setting, VAC only removes a small fraction of users (0.08%), while the lowest ratios of removed users among all executions of k-Medoids and HDB* are 0% and 2.4%, respectively.
Table II also shows the impact of the adopted set of anonymity levels. In general, higher values of $k$ (i.e., stronger privacy protection) result in lower quality of the anonymized KGs. The information loss of KGs generated for settings te#5,50,5 and zipf#5,50,5, whose maximum $k$ value is 50, is higher than that of those generated for settings te#2,5,1 and zipf#2,5,1, whose maximum $k$ value is 5, in most clustering settings (i.e., VAC, km#min, hdb*). This is not the case for the other clustering settings (i.e., km#min, km#mean), because k-Medoids generates many invalid clusters whose users are removed from anonymized KGs. With clustering setting km#mean, the ratio of removed users in Yago is 46.7% (21.7%, resp.) for setting te#2,5,1 (te#5,50,5, resp.). This large number of removed users makes the anonymized KGs lose much information. Moreover, it is relevant to note that anonymized KGs generated using levels generated according to te have higher quality than those generated with zipf. We recall that, according to te, users who have a similar number of edges have similar anonymity levels. Then, since users in the same cluster might have a similar number of edges, they also have similar anonymity levels. This prevents users with small anonymity levels from being anonymized with high anonymity levels. As a result, they do not lose too much information.
As VAC considers different $k$ values, it generates high-quality anonymized KGs in both the real-life scenario (i.e., te) and the random simulation scenario (i.e., zipf) for the highest range of $k$ values (i.e., 5,50,5). The advantages of VAC over HDB* indicate that a simple customization of current clustering algorithms does not fit the scenario of personalized anonymization.

E. MS Algorithm
This experiment aims to evaluate the effectiveness of MS. We consider the zipf#5,50,5 setting since, with this setting, k-Medoids, HDB*, and VAC generate the KGs with the highest information loss. We recall that MS removes invalid clusters and adds their users to valid ones. This is done only if the anonymization distances between these users and valid clusters are less than or equal to the maximum distance determined by τ. We also run MS with varying τ to measure its impact on the quality of the generated anonymized KGs. Table III illustrates MS' effectiveness on Freebase and Yago.
By increasing τ, MS decreases the number of removed users, which leads to a decrease in users' information loss, since the AIL of a removed user is 1. MS decreases the information loss of users in KGs generated by k-Medoids executed with all the settings. As an example, in the case of km#min on Yago, increasing τ from 0 to 1 decreases the information loss of the generated KGs from 0.368 to 0.016, whereas the ratio of removed users decreases from 36.2% to 0%. MS also brings the quality of KGs generated with HDB* closer to that of those generated with VAC. On Freebase, increasing τ from 0 to 1 decreases the information loss of HDB* (VAC, resp.) from 0.017 to 0.011 (from 0.0168 to 0.0087, resp.). Consequently, the information loss of users decreases as τ increases. Even though VAC does not generate big clusters, whose cardinality is greater than or equal to twice their anonymity level, MS can still add some removed users to valid clusters and increase the quality of the generated clusters. As a result, the information loss of KGs using VAC also decreases from 0.0070 to 0.0048 when increasing τ from 0 to 1 on Yago. Moreover, the ratio of removed users obtained with MS never exceeds the ratio obtained by simply removing invalid clusters. Therefore, MS is effective enough to improve the quality of anonymized KGs if data providers decide to choose clustering algorithms that do not support personalized anonymization (e.g., k-Medoids, HDB*).

F. Overall Performance
We measure the performance of VAC by monitoring its execution time when running it with varying generation strategies for the $k$ values. MS' performance is measured by tracking its execution time when running it on clusters generated by VAC with the strategy zipf#5,50,5. Both algorithms have been implemented in Python 3 and run on a Debian 4 server with 128 GB of RAM and an Intel Xeon CPU with 128 cores. Table V illustrates the performance on the two datasets. The performance of VAC depends on the size of the datasets and the users' $k$ values. The execution times of VAC running on Yago with zipf#5,50,5 and te#5,50,5 settings (523.93 and 469.86 seconds, respectively) are higher than those of VAC running on Freebase (150.14 and 183.21 seconds, respectively). The execution time of VAC running with zipf#5,50,5 (150.14 seconds) is higher than that of VAC running with zipf#2,5,1 (110.10 seconds). However, the number of users in the datasets has a higher impact on VAC's performance than the users' $k$ values. Even though VAC has a higher execution time than k-Medoids, HDBSCAN, and HDB* (on Yago with zipf#5,50,5, they take 32, 36, and 54 seconds, resp., whereas VAC takes about 524 seconds), the information loss with k-Medoids, HDBSCAN, and HDB* (0.267, 0.334, and 0.222, resp., with the zipf#5,50,5 setting) is much higher than the information loss of VAC (0.0168). On the other hand, MS' performance mostly depends on the number of users in the dataset rather than on its parameter τ. On Freebase, increasing τ from 0 to 0.25 decreases the execution time from 32.19 to 31.75 seconds. However, when τ reaches 0.5, the execution time increases to 32.81 seconds. The standard deviations of execution times over varying values of τ are 0.71 seconds on Freebase and 1.92 seconds on Yago, which are small compared to the execution times. Nevertheless, changing from Freebase to Yago increases the execution time. The average execution time on Freebase (i.e., 32.01 seconds) is smaller than the one on Yago (i.e., 99.52 seconds).
The complexities of VAC and MS are $O(n^2 \log n)$ and $O(m \times n^2 \log n)$, respectively (see Appendix 2, available online). In practice, data providers do not need to anonymize their KGs in real time. Instead, they anonymize their KGs once and publish the anonymized versions. Therefore, MS and VAC are feasible for real-life applications of KGs.

G. Comparison With the Simple Knowledge Graph Personalized Anonymization
This experiment compares the quality of anonymized KGs generated by PCKGA (i.e., using VAC/MS and KGG) with that of state-of-the-art anonymization algorithms in common usage scenarios. First, we compare PCKGA with the Cluster-Based Knowledge Graph Anonymization Algorithm (CKGA) [2], a cluster-based algorithm for anonymizing KGs. Second, we compare it with Primule [15], a cluster-based anonymization algorithm for relational data, which is used to anonymize users' profiles. Unlike PCKGA, CKGA and Primule impose a unique anonymity level for all users. We use the most [...]
Since PCKGA and CKGA [2] support similar parameters (i.e., clustering algorithm $A$ and τ), we evaluate the impact of these parameters on the information loss of anonymized KGs generated from Freebase and Yago. Fig. 2 and Table IV illustrate the quality of anonymized KGs generated by our algorithm and CKGA. Across all values of τ, PCKGA generates higher-quality KGs. As τ increases from 0 to 1, the average information loss of anonymized KGs generated by PCKGA remains lower than that of CKGA. For example, Fig. 2(a) shows that the anonymized KGs of Freebase generated by CKGA exhibit a minimum and maximum average information loss of 0.027 and 0.244, respectively. In contrast, PCKGA achieves lower values: 0.009 and 0.013, respectively. Table IV further demonstrates that the minimum and maximum ratios of removed users in Freebase's KGs generated by CKGA are 0 and 0.223, respectively, while PCKGA achieves significantly lower values of 0 and 0.008. Similar trends are observed in the experimental results obtained from Yago. The reason is that CKGA does not consider the different users' anonymity levels and applies the maximum one to all users. Therefore, PCKGA can generate better-quality anonymized KGs in the scenario of personalized anonymization.
Since Primule [15] is designed to anonymize relational data, we use two real-life relational datasets (Credit [20] and Coil [19]) to compare our work with Primule. Primule does not remove any outliers from the datasets; to ensure that PCKGA also does not remove outliers, we executed it with τ = 1. On both datasets, the information loss of PCKGA's anonymized KGs (0.137 on Credit and 0.067 on Coil) is only 40.41% and 31%, respectively, of that of Primule (0.339 on Credit and 0.213 on Coil). This difference is attributed to the fact that Primule creates large clusters that exceed the required $k$ value by at least two times and applies the maximum $k$ value to all users, while PCKGA considers all values.

VII. CONCLUSION
In this paper, we proposed the Personalized k-Attribute Degree principle to allow users to specify their own protection level ($k$), and PCKGA to generate anonymized KGs satisfying the proposed principle. We conducted experiments using two real-life datasets to show that running PCKGA with our clustering algorithm (VAC) generates higher-quality anonymized KGs than running it with state-of-the-art clustering algorithms. PCKGA performs well enough to be used in practice and outperforms the previous anonymization algorithms for KGs [2] and relational data [15]. Our work can be extended in various directions. We plan to design a risk assessment module to recommend anonymity levels to users, so that it is easier for non-expert users to specify their anonymity levels. Additionally, we plan to investigate novel generalization techniques that use fuzzy logic to represent users' relationships. These techniques allow for the representation of user relationships within the same clusters using the probability of their presence, rather than the traditional method of adding and removing relationships. Another extension is to protect users' privacy from inference attacks in which attackers exploit the public protection levels. The protection can rely on a risk assessment algorithm that predicts the probability that the protection level of a user is associated with the sensitivity level of his/her sensitive values. If the risk is too high, users' data are removed from anonymized KGs.

Algorithm 2: VAC($D_a$, $K$).
Input: $D_a$: the anonymization distance matrix of users in the original KG $G$; $K$: $k$ values of users in $G$.
Output: The set of clusters $C$.
1: Let $U_{asc}$ be the sorted set containing users in $V_U$ sorted in ascending order by their anonymization cost
2: $C \leftarrow \emptyset$
3: while $|U_{asc}| \ge k_{U_{asc}[0]}$ do
4: $u \leftarrow$ get_and_remove_at_index($U_{asc}$, 0)
5: $c \leftarrow$ find_best_cluster($u$, $D_a$, $U_{asc}$, $K$)
6: add($C$, $c$)
7: end while
8: return $C$

Function 1: find_best_cluster($u$, $D_a$, $U_{asc}$, $K$).
Let us consider the anonymized KG $\overline{G}$ shown in Fig. 1(b). Since $u_0$ has one outgoing relationship of type follows with $u_2$ and no incoming relationships of the same type,

Algorithm 3: MS($D_a$, $C$, $K$, $\tau$).
Input: $D_a$: the anonymization distance matrix of users in $G$; $C$: the set of clusters returned from the cluster generation phase; $K$: $k$ values of users in $G$; and $\tau$: a threshold.

TABLE I: PROPERTIES OF DATASETS USED FOR EXPERIMENTS

Yago's 13303 nodes include 8917 users and 4386 values. Coil has 5863 nodes (5822 users and 41 values), while Credit has 2046 nodes (1000 users and 1046 values).

TABLE II: ANONYMIZED KGS' QUALITY GENERATED WITH k-MEDOIDS, HDB*, AND VAC

TABLE III: ANONYMIZED KGS' QUALITY GENERATED BY MS EXECUTED WITH CLUSTERS GENERATED BY k-MEDOIDS, HDB*, AND VAC

TABLE IV: RATIO OF REMOVED USERS IN KGS FROM PCKGA AND CKGA EXECUTED WITH HDBSCAN (HDB.) AND k-MEDOIDS (KM.)