Publishing Node Strength Distribution With Node Differential Privacy

The challenge of graph data publishing under node-differential privacy mainly comes from the high sensitivity of the query. Compared with edge-differential privacy, which can only protect the relationships between people, node-differential privacy can protect both the relationships between people and personal information. Therefore, node-differential privacy must pay attention to the protection of personal information. This paper studies the release of the node strength distribution under node-differential privacy by reducing sensitivity. We propose two algorithms to publish the node strength distribution: Alternate-histogram (ALT-histogram) and Density-histogram (DEN-histogram). Experimental results show that, compared with the existing node strength histogram publishing algorithms, our proposed algorithms have advantages in L1-error and KS-distance, making the noisy node strength distribution closer to that of the original graph. We also conduct an introspective analysis to understand the influence of the projection algorithm on the node strength distribution; the experiments show that the projection algorithm plays an important role in the published node strength distribution.


I. INTRODUCTION
As a general data structure, the graph can not only directly describe the relationships between things but also reflect various statistical characteristics. Much private data in daily life can be abstracted into a graph, such as social networks, recommendation systems, and communication patterns. However, even when graph anonymization [1], [2] is applied, the direct publication of graph data still has a high probability of disclosing sensitive personal information. Therefore, how to protect the privacy of sensitive personal information in the graph has gradually attracted the attention of researchers. Reference [3] is the first paper that proposes to study distributions under node-differential privacy. Since then, the problem of publishing information on a graph under differential privacy [4], [5] has been studied by researchers. Unlike other privacy protection technologies, differential privacy does not rely on the attacker's specific background knowledge and can provide quantifiable privacy guarantees. An algorithm that satisfies differential privacy guarantees that the output distribution does not change significantly from one dataset to a neighboring dataset.
The associate editor coordinating the review of this manuscript and approving it for publication was Hao Ji.
Node-differential privacy [10]-[14] and edge-differential privacy [3], [6]-[9], [15] are variants of differential privacy applied to graphs. In edge-differential privacy, two neighboring graphs differ by a single edge, while in node-differential privacy, two neighboring graphs differ by one node and all edges connected to this node. For a graph G = (V, E) with |V| = n nodes (where V is the set of all nodes and E is the set of all edges), deleting one edge only changes the degrees of the two nodes on this edge, while deleting one node causes up to n − 1 edges to be deleted in the worst case. Therefore, node-differential privacy in graph data is more difficult to satisfy than edge-differential privacy, but it can provide stronger privacy protection.
Most of the existing differential privacy research on graphs [3], [9], [11], [14], [15] focuses on the degree distribution of unweighted graphs [3], [11], [14] and ignores weighted graphs. However, as noted in [15], the importance of nodes in a weighted graph does not entirely depend on the degree distribution of the weighted graph. For example, Bitcoin OTC [20] is a graph of ''who trusts who'' composed of people who use Bitcoin for transactions on the platform. In this graph, the node degree cannot represent the user's reputation; it can only represent the number of user transactions. In this case, the node strength can be used to represent the user's reputation, where the node strength is the sum of the weights of all edges connected to the node. This paper studies the publishing of the node strength histogram of a weighted graph under node-differential privacy. Given a graph G = (V, E, W), the goal is to publish a histogram of node strength that satisfies node-differential privacy and is as close as possible to the true distribution of G. The main contributions of this paper are as follows: (1) As far as we know, the problem of publishing the node strength histogram of a weighted graph under node-differential privacy is studied for the first time. (2) Based on the sensitivity determined by the projection algorithm, we propose two algorithms to publish the node strength histogram under node-differential privacy, ALT-histogram and DEN-histogram, and prove that both publishing mechanisms satisfy the definition of node-differential privacy. (3) Extensive experiments are carried out on four real datasets, and the proposed algorithms are compared with other existing algorithms. The experimental results show that the proposed algorithms clearly improve on the baseline algorithms. We also conduct an introspective analysis to examine the influence of the projection algorithm.
The rest of this paper is organized as follows. In Section II, we discuss the related work. In Section III, we define the problem and discuss differential privacy and its application to graphs. In Section IV, we introduce our algorithms. The experimental results are given in Section V, and the conclusion in Section VI.

II. RELATED WORK
The application of differential privacy to graph data has been extensively studied. Kasiviswanathan et al. [10] used differential privacy technology to study graph data for the first time. They showed how to estimate the number of triangles in social networks while satisfying edge-differential privacy and showed how to calibrate the noise of subgraph counts accordingly. Day et al. [14] proposed two algorithms to publish the degree distribution of graphs under node-differential privacy. The θ-Cumulative Histogram algorithm first transforms the histogram into a cumulative histogram, then adds noise, and finally extracts the noisy histogram. The advantage of this algorithm is that it can significantly reduce the sensitivity, but it is only suitable for the degree distribution. If it is extended to the node strength distribution, the sensitivity is too high. The (θ, Ω)-Histogram algorithm groups the histogram according to a geometric series, motivated by the properties of social networks. This algorithm cannot adapt to different datasets because of its fixed grouping. Ding et al. [16] proposed a node-differentially private algorithm that protects the 3-node subgraph counts of each node, and extended the algorithm to the local clustering coefficient. Qian et al. [15] proposed, for the first time, two algorithms to publish the node strength histogram of graphs under edge-differential privacy. Of these, the sequence-aware node strength histogram publishing algorithm has high accuracy but high time overhead, and the privacy protection provided by edge-differential privacy is weaker than that of node-differential privacy.
A key technique to satisfy node-differential privacy is ''projection'', which transforms a graph into a θ-degree-bounded graph to reduce sensitivity. Kasiviswanathan et al. [10] proposed to reduce the sensitivity by ''truncation'' projection, which deletes all nodes whose degree is greater than the threshold value. Although this algorithm can reduce the sensitivity, it deletes many edges unnecessarily and loses a lot of useful information in the original graph. Blocki et al. [12] proposed an ''edge removal'' projection algorithm to reduce the sensitivity: it fixes an arbitrary order of edges, traverses the edges in this order, and deletes edges incident to nodes whose degree is greater than the threshold. Compared with the ''truncation'' algorithm, this algorithm retains more useful information, but it is still not ideal. Chen and Zhou [13], based on the exponential mechanism, used a flow graph algorithm to map the degree distribution of the graph to a histogram. The disadvantage of this algorithm is that its sensitivity is far greater than that of the first two algorithms. Day et al. [14] proposed an ''edge adding'' projection algorithm. First, the edge set of the graph is set to be empty, and the edges are traversed according to a stable edge ordering. If the degrees of the nodes at both ends of an edge are less than the threshold value, the edge is added; otherwise, it is not added. Because of the uncertainty of the edge order, this algorithm still cannot preserve the edge information of the original graph to the greatest extent. Qian et al. [15] proposed an algorithm to project the weights of the original graph. This algorithm traverses each edge and, for every edge whose weight is greater than the threshold value, sets the weight of the edge to the threshold value. This algorithm is the first weight projection algorithm.
To solve the problem of node strength publishing under differential privacy, the existing studies [10]-[12] reduce the sensitivity by projection, but they lose a lot of useful information in the original graph. Moreover, the research on node strength mainly focuses on edge-differential privacy, and there is no research on node strength publication under node-differential privacy. As mentioned above, node-differential privacy provides stronger protection than edge-differential privacy. In the structure of the graph, a node usually represents a person, and node-differential privacy can protect not only the relationships between people but also personal information. Therefore, it is necessary to realize node-differential privacy to protect user data.

A weighted and finite graph G = (V, E, W) consists of a node set V, an edge set E, and a weight set W, where w_ij ∈ W represents the connection weight of the edge e_ij. In this paper, for simplicity of presentation, we assume that n = |V|, the number of nodes of the input graph G, is publicly known.
We generalize the degree of a node v_i to the sum of the weights of the edges incident to v_i, called the node strength, which is given as follows:
$$s_i = \sum_{j \in N(i)} w_{ij} \tag{1}$$
where N(i) is the set of adjacent nodes of node v_i.
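As a small illustration of the node strength definition above, the following sketch computes both degree and strength from an undirected weighted edge list. The function names and the toy edge list are illustrative assumptions, not part of the paper.

```python
from collections import defaultdict


def node_strengths(weighted_edges):
    """Return {node: strength}: the sum of weights of edges incident to each node."""
    strength = defaultdict(float)
    for u, v, w in weighted_edges:
        strength[u] += w
        strength[v] += w
    return dict(strength)


def node_degrees(weighted_edges):
    """Return {node: degree}: the number of edges incident to each node."""
    degree = defaultdict(int)
    for u, v, _ in weighted_edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


# Hypothetical toy graph: every node has degree 2, yet the strengths differ.
edges = [("a", "b", 3), ("a", "c", 1), ("b", "c", 5)]
strengths = node_strengths(edges)
degrees = node_degrees(edges)
```

In this toy graph all three nodes have the same degree but different strengths, which is the situation the Bitcoin OTC example in the introduction describes.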

B. DIFFERENTIAL PRIVACY
The notion of ε-differential privacy [4] is defined based on the concept of neighboring databases. Two datasets G and G′ are defined as neighboring datasets if they differ in only one record.
Definition 1 (ε-Differential Privacy): A randomized algorithm A satisfies ε-differential privacy when, for any two neighboring databases D and D′ and any S ⊆ range(A),
$$\Pr[A(D) \in S] \le e^{\varepsilon} \cdot \Pr[A(D') \in S],$$
where ε is a parameter for the privacy level.
Definition 2 (Global Sensitivity): For a query f: G → R^d and neighboring databases G and G′, the L1 global sensitivity of f is defined as
$$\Delta f = \max_{G, G'} \lVert f(G) - f(G') \rVert_1,$$
where the L1 distance ||x||_1 is the sum of the absolute values of the elements of the vector x.
Laplace Mechanism [4]: For a query f: G → R^d over a database G, the Laplace mechanism satisfies ε-differential privacy by adding noise drawn from Lap(Δf/ε) to the output of the query.
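The Laplace mechanism described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, independent noise is added to every component of the answer, and the noise is sampled as the difference of two exponential variates, which follows a Laplace distribution with the same scale.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Add independent Lap(sensitivity / epsilon) noise to each component of the answer."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [x + laplace_noise(scale, rng) for x in true_answer]


# Hypothetical usage: a 3-bucket histogram, sensitivity 1, privacy budget 0.5.
noisy = laplace_mechanism([12, 7, 3], sensitivity=1.0, epsilon=0.5)
```

A smaller ε or a larger sensitivity increases the noise scale, which is why the rest of the paper concentrates on bounding the sensitivity.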
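Exponential Mechanism [17]: Given a quality score u(G, H_i) for outputting H_i on input G and a randomized algorithm A, we have
$$\Pr[A(G) = H_i] = \frac{\exp\left(\frac{\varepsilon \, u(G, H_i)}{2 \Delta u}\right)}{\sum_{H_j \in H} \exp\left(\frac{\varepsilon \, u(G, H_j)}{2 \Delta u}\right)},$$
where Δu is the global sensitivity of the quality function. The sampling probability for each H_i ∈ H is determined based on a user-specified quality function u.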
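A minimal sketch of exponential-mechanism sampling follows. The function name and arguments are illustrative assumptions; the best score is subtracted before exponentiating, which leaves the sampling probabilities unchanged and only avoids numerical overflow.

```python
import math
import random


def exponential_mechanism(candidates, quality, sensitivity, epsilon, rng=None):
    """Sample one candidate with probability proportional to exp(eps * u / (2 * sensitivity))."""
    rng = rng or random.Random()
    scores = [quality(c) for c in candidates]
    best = max(scores)
    # Shifting every score by `best` rescales all weights by the same factor.
    weights = [math.exp(epsilon * (s - best) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical usage: prefer candidates close to 2.
choice = exponential_mechanism([1, 2, 3], lambda c: -abs(c - 2), sensitivity=1.0, epsilon=1.0)
```

With a large ε the mechanism almost always returns the highest-quality candidate; with a small ε the choice approaches uniform sampling.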

C. COMPOSITION PROPERTIES
Differential privacy satisfies sequential composition and transformation invariance [18], [19]. In short, sequential composition guarantees that for any sequence of computations A_1, A_2, ..., A_k, if each A_i satisfies ε_i-differential privacy, then publishing the results of all of A_1, A_2, ..., A_k satisfies (ε_1 + ε_2 + ... + ε_k)-differential privacy. The transformation invariance property says that given A_1 that satisfies ε-differential privacy and any algorithm A_2, the new algorithm A(·) = A_2(A_1(·)) satisfies ε-differential privacy. In other words, performing post-processing on the output of an algorithm that satisfies differential privacy does not affect the privacy guarantee.

D. UTILITY METRICS
To evaluate the performance of different node strength distribution algorithms, the L1 error used in [10] and the KS-distance used in [3] are used as quantitative indicators.
The L1 error measures the cumulative error between two node strength histograms. For two strength distributions D and D′ with length M, the L1 error between them is
$$\lVert D - D' \rVert_1 = \sum_{i=1}^{M} \lvert D(i) - D'(i) \rvert,$$
where a strength distribution whose length is smaller than M is padded with zeros to length M for convenience.
The KS-distance reflects the proximity between two node strength histograms. For node strength distributions D and D′, the KS-distance between them is defined as
$$KS(D, D') = \max_{i} \lvert CDF_{D}(i) - CDF_{D'}(i) \rvert,$$
where CDF_D denotes the cumulative distribution function of D.
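The two utility metrics can be sketched as follows. The zero padding follows the text; normalizing each histogram by its own total before taking cumulative sums is an assumption of this sketch, as are the function names.

```python
def _pad(hist, length):
    """Right-pad a histogram with zeros up to `length` buckets."""
    return list(hist) + [0] * (length - len(hist))


def l1_error(d1, d2):
    """Sum of absolute bucket-wise differences after zero padding."""
    m = max(len(d1), len(d2))
    return sum(abs(x - y) for x, y in zip(_pad(d1, m), _pad(d2, m)))


def ks_distance(d1, d2):
    """Largest gap between the two normalized cumulative distributions."""
    m = max(len(d1), len(d2))
    a, b = _pad(d1, m), _pad(d2, m)
    total_a, total_b = sum(a), sum(b)
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both histograms need a positive total count")
    cdf_a = cdf_b = 0.0
    worst = 0.0
    for x, y in zip(a, b):
        cdf_a += x / total_a
        cdf_b += y / total_b
        worst = max(worst, abs(cdf_a - cdf_b))
    return worst
```

The L1 error is sensitive to the absolute number of misplaced nodes, while the KS-distance only reflects the shape of the distribution, so the two metrics complement each other.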

IV. PROPOSED ALGORITHM
A. PROPOSED PROJECTION ALGORITHM
Although current graph projection algorithms reduce the sensitivity, they lose a lot of important information. Moreover, the existing algorithms project either the node degree or the edge weights and cannot project both at the same time. For the node strength distribution under node-differential privacy, both node degree and edge weight are directly related to node strength. Therefore, to project node degree and edge weight at the same time, we propose a new projection algorithm that constrains both node degree and edge weight, and addresses the high sensitivity of node-differential privacy and the sparse distribution of real node strengths. As shown in Algorithm 1, our algorithm requires all edges in the input graph G to be sorted from smallest to largest weight; this ordering is represented by (G).
The projection algorithm first constructs a graph that contains all nodes of G but no edges, and then traverses the edges in the (G) order. If the degrees of both endpoints of an edge are less than the degree threshold and the edge weight is less than or equal to the weight threshold, the edge is added to the edge set. If the degrees of both endpoints are less than the degree threshold and the edge weight is greater than the weight threshold, the edge weight is set to the weight threshold and the edge is then added to the edge set. If either endpoint has a degree greater than or equal to the degree threshold, the edge is not added.
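The projection steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's Algorithm 1: the graph is an undirected edge list of (u, v, weight) tuples, the function name is hypothetical, and ties between equal weights are broken by the endpoint labels so that the ordering is deterministic.

```python
from collections import defaultdict


def project(edges, theta, lam):
    """Return an edge list with every degree <= theta and every weight <= lam."""
    # Ascending weight; the endpoint tie-break is an assumption of this sketch.
    ordered = sorted(edges, key=lambda e: (e[2], e[0], e[1]))
    degree = defaultdict(int)
    kept = []
    for u, v, w in ordered:
        # Add the edge only if both endpoints are still below the degree bound.
        if degree[u] < theta and degree[v] < theta:
            kept.append((u, v, min(w, lam)))  # clip the weight to the weight bound
            degree[u] += 1
            degree[v] += 1
    return kept


# Hypothetical toy graph with degree bound 2 and weight bound 5.
toy_edges = [("a", "b", 1), ("a", "c", 2), ("a", "d", 9), ("b", "c", 7)]
projected = project(toy_edges, theta=2, lam=5)
```

Because every kept node has at most θ edges and every kept edge has weight at most λ, no node strength in the projected graph exceeds θλ, which bounds the length of the node strength histogram.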
According to the projected graph, we can obtain the node strength histogram hist(G^(λ,θ)) = {h_1, h_2, h_3, ...}, where h_i represents the number of nodes whose strength is i. By Lemma 1, the global sensitivity of hist(G^(λ,θ)) is 2θ + 1.

Algorithm 1 Proj_(λ,θ): Projection by Strength-Bounded
Input: An input graph G = (V, E, W), an edge ordering (G) = <e_1, e_2, e_3, ..., e_n>, a degree bound θ, a weight bound λ
Output: An output (θ, λ)-bounded graph G^(λ,θ)

Lemma 1: For any neighboring graphs G and G′ that differ in one node, we have
$$\lVert hist(G^{(\lambda,\theta)}) - hist(G'^{(\lambda,\theta)}) \rVert_1 \le 2\theta + 1.$$
Proof: Assume, without loss of generality, that compared with G = (V, E), the graph G′ = (V′, E′) has an additional node v+, that is, V′ = V ∪ {v+} and E′ = E ∪ E+, where E+ is the set of all edges connected to v+ in G′. In the worst case, v+ is connected to θ nodes, so the existence of v+ and its incident edges affects the strength of θ nodes other than v+. Each node whose strength changes causes two changes in the histogram, and v+ itself causes one further change. Therefore, deleting a node and its incident edges causes at most 2θ + 1 changes.

B. PARAMETER SELECTION
The simplest way to satisfy differential privacy is to add Laplace noise directly to every bucket, but the noise accumulated over range queries then becomes too large, which increases the error and ultimately makes the data unusable. One way to improve the accuracy is to aggregate the histogram buckets into groups. The smaller the number of groups k, the less noise is added to the histogram, but the more aggregation error is introduced by replacing the true values with the group average. Similarly, larger θ and λ retain more information, but the number of buckets in the node strength histogram increases and more noise is added.
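The trade-off between noise and aggregation error can be made concrete with a small sketch of a grouped release. It assumes that the grouping is fixed before looking at the bucket values (or has itself been chosen privately) and that `sensitivity` is the L1 sensitivity of the whole histogram; adding Lap(sensitivity / ε) to each group sum and dividing by the group size is then equivalent to the noisy mean used below. All names are illustrative.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def grouped_release(hist, groups, sensitivity, epsilon, rng=None):
    """Replace each bucket with its group's noisy mean.

    `groups` is a list of lists of bucket indices that partition range(len(hist)).
    """
    rng = rng or random.Random()
    released = [0.0] * len(hist)
    for group in groups:
        mean = sum(hist[i] for i in group) / len(group)
        # Noise on the group sum is Lap(sensitivity / epsilon); dividing by the
        # group size shrinks the per-bucket noise for larger groups.
        noisy_mean = mean + laplace_noise(sensitivity / (epsilon * len(group)), rng)
        for i in group:
            released[i] = noisy_mean
    return released


# Hypothetical usage: four buckets merged into two groups of two.
out = grouped_release([4, 2, 0, 6], [[0, 1], [2, 3]], sensitivity=5.0, epsilon=1.0)
```

Larger groups reduce the per-bucket noise but replace more true values by an average, which is exactly the balance that the parameter selection step tries to strike.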
Because the parameters of the algorithm directly affect the results, choosing appropriate parameters can reduce the error of the algorithm. In this section, we select the parameters through the exponential mechanism, which balances the projection error caused by projection against the aggregation error caused by data release. For this purpose, a quality function with low sensitivity is designed to select the optimal group centers, node degree threshold θ, and edge weight threshold λ.
Our quality function consists of two parts. The first part captures the error caused by the projection algorithm. Because the projection algorithm changes some nodes and edges, two buckets in the histogram change whenever the node strength of one node changes. The node strength is determined by the degree of the node and the weights of the edges connected to it. The projection error is denoted by l_proj; in its definition, value(e) represents the weight of edge e in the graph, E(v) contains all edges adjacent to node v, and deg(v) represents the degree of node v. The second part captures the error caused by the data publishing algorithm. The aggregation error of our proposed data publishing algorithm comes from two sources: the error caused by replacing the true value of each bucket with the group average after grouping the histogram, and the error caused by adding Laplace noise. Therefore, the aggregation error, denoted by l_hist, can be expressed using the formula in [14], in which c_d represents the true value of a bin, B_n represents the result of the grouping, and g_k represents the k-th group. Combining these two components gives the proposed quality function q_p. To apply the exponential mechanism to select the parameters (θ, λ, B), we need to determine the sensitivity of the quality function. Lemma 2 proves that Δq_p ≤ 6θ + 4.
Lemma 2: For any neighboring graphs G and G′ that differ in one node, we have Δq_p ≤ 6θ + 4.
Proof: Based on Lemma 1, the strength of at most θ + 1 nodes changes between G^(λ,θ) and G′^(λ,θ), so we obtain |l_proj(G, θ_i, λ_i) − l_proj(G′, θ_i, λ_i)| ≤ 2(θ + 1).
According to [14], |l_hist(G, ·) − l_hist(G′, ·)| ≤ 2(2θ + 1) for any fixed choice of the remaining parameters. Thus, we have
$$\Delta q_p \le 2(\theta + 1) + 2(2\theta + 1) = 6\theta + 4.$$
Finally, we can conclude that for any neighboring graphs G and G′ that differ in one node, we have Δq_p ≤ 6θ + 4.
VOLUME 8, 2020

C. ALT-HISTOGRAM ALGORITHM
In this part, we introduce the ALT-histogram algorithm, which randomly selects k buckets in the histogram as group centers, and then tries to replace a group center with a bucket from its group in the partition result that is not a group center. If the replacement reduces the error, it is accepted; otherwise, the next non-center bucket is tried, until the group centers no longer change. Next, the true value of each bucket in a group is replaced by the average of the group. Finally, noise is added and the histogram is released.

Algorithm 2 ALT-Histogram Algorithm
Input: A weighted graph G = (V, E, W), privacy budgets ε_1 and ε_2, candidate sets O, T, and K
Output: A histogram H satisfying differential privacy

18: return H
As shown in Algorithm 2, the algorithm takes a weighted graph G, privacy budgets ε_1 and ε_2, a node degree threshold candidate set O, an edge weight threshold candidate set T, and a group number candidate set K as input. The output is a noisy node strength histogram satisfying differential privacy.
First, k buckets are randomly selected as the group centers C = {c_1, ..., c_k} (line 4). Then the algorithm traverses the group centers and uses the formula in [15] to divide all the buckets into the corresponding groups.
Here p and q represent the positions of c_l and c_r in the original histogram, c_l represents the group center on the left side of the bucket, and c_r represents the group center on the right side of the bucket. If dist(c_l, h_i) < dist(h_i, c_r), then all buckets between c_l and h_i are assigned to b_l. Otherwise, the buckets between h_i and c_r are assigned to b_r. The result of this step is B = {b_1, ..., b_k} (line 7). The algorithm then traverses the partition result b_i corresponding to the current group center c_i. A new set of group centers C′ is obtained by replacing c_i with a bucket h_i in b_i, and the cost(C′) caused by C′ is calculated. If it is less than cost(C), C′ is used instead of C (lines 8-11). Because this algorithm uses the average value of the group to replace the true value of each bucket in the group, the error caused by grouping is measured by cost(C), in which b_{i,average} represents the average value of b_i. The algorithm then returns to line 6 and traverses the group centers again until the group centers do not change (lines 12-15). Next, the true value of each bucket in a group is replaced by the average of the group. Finally, noise is added and the histogram is released (lines 16-18).
Lemma 3: ALT-histogram of Algorithm 2 satisfies (ε_1 + ε_2)-node-differential privacy.
Proof: In Algorithm 2, line 2 uses the exponential mechanism with a privacy budget of ε_1. Line 17 uses the Laplace mechanism with a privacy budget of ε_2. All other steps are independent and performed on the noisy output. By the composition theorem, the algorithm satisfies (ε_1 + ε_2)-node-differential privacy.
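The center-replacement loop of ALT-histogram can be sketched as a simple local search. Two parts of this sketch are assumptions, because the corresponding formulas are not reproduced in the text: buckets are assigned to the nearest center by position (ties go to the left center), and the grouping cost is the sum of absolute deviations of each bucket from its group average. Noise addition and parameter selection are omitted, and all names are illustrative.

```python
import random


def partition(num_buckets, centers):
    """Map each center to the bucket indices nearest to it (assumed rule; ties go left)."""
    groups = {c: [] for c in centers}
    for i in range(num_buckets):
        nearest = min(centers, key=lambda c: (abs(c - i), c))
        groups[nearest].append(i)
    return groups


def cost(hist, centers):
    """Assumed grouping cost: total absolute deviation from each group's average."""
    total = 0.0
    for members in partition(len(hist), centers).values():
        mean = sum(hist[i] for i in members) / len(members)
        total += sum(abs(hist[i] - mean) for i in members)
    return total


def alt_group_centers(hist, k, rng=None):
    """Start from k random centers and swap in non-center buckets while the cost drops."""
    rng = rng or random.Random()
    centers = sorted(rng.sample(range(len(hist)), k))
    improved = True
    while improved:
        improved = False
        for c in list(centers):
            for candidate in partition(len(hist), centers)[c]:
                if candidate in centers:
                    continue
                trial = sorted((set(centers) - {c}) | {candidate})
                if cost(hist, trial) < cost(hist, centers):
                    centers = trial  # accept the swap and move to the next center
                    improved = True
                    break
    return centers


# Hypothetical usage on a histogram with two flat regions.
final_centers = alt_group_centers([10, 10, 10, 0, 0, 0], k=2, rng=random.Random(1))
```

Each accepted swap strictly lowers the cost, so the loop terminates, but like any local search it may stop at a grouping that is not globally optimal.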

D. DEN-HISTOGRAM ALGORITHM
In this section, we discuss another algorithm, which is based on the observation that the values of the buckets adjacent to a group center fluctuate little and that the distance between group centers is large. The central idea is to find high-density buckets that are separated by low-density buckets. In short, if the density of the buckets around one bucket is lower than its own density, and the bucket is far away from the other group centers, then this bucket is suitable as a group center.
As shown in Algorithm 3, the algorithm takes a weighted graph G, privacy budgets ε_1, ε_2, and ε_3, a node degree threshold candidate set O, an edge weight threshold candidate set T, and a group number candidate set K as input. The output is a noisy node strength histogram satisfying differential privacy.
First, the distance between each pair of buckets in the original histogram is calculated and noise is added (lines 4-7).
Second, the local density of each bucket in the original histogram is calculated as
$$\rho_i = \sum_{j \ne i} X\big(distance(i, j) - disThreshold\big), \qquad X(x) = \begin{cases} 1, & x \le 0 \\ 0, & x > 0 \end{cases}$$
where disThreshold represents the truncation threshold and j ranges over the other buckets. This formula counts the buckets whose distance from the i-th bucket does not exceed the truncation distance and regards this count as the local density of the i-th bucket (lines 8-10).

19: return H
Next, the distance δ_i between each bucket and the bucket with higher local density is calculated. To select buckets with both a high local density and a large distance from other high-density buckets as group centers, we sort the buckets according to γ_i = ρ_i δ_i, select the top k as group centers, and then divide the buckets according to formula (14) (lines 11-16). Finally, the true value of each bucket in a group is replaced by the average of the group, and noise is added before release (lines 17-19). It should be noted that the local density cannot be calculated from the true distance between two buckets, because this would violate differential privacy, so we add Laplace noise to the distance. Lemma 4 below proves that the sensitivity of the computation of distance(i, j) is at most 4θ + 2.
Lemma 4: For any neighboring graphs G and G′ that differ in one node, let distance(i, j) denote the distance from the i-th bucket to the j-th bucket in hist(G^(λ,θ)), and let distance′(i, j) denote the distance from the i-th bucket to the j-th bucket in hist(G′^(λ,θ)). Then
$$\lvert distance(i, j) - distance'(i, j) \rvert \le 4\theta + 2.$$
Proof: According to Lemma 1, the existence of a node causes at most 2θ + 1 changes in the histogram, so we can conclude that |hist(i) − hist′(i)| ≤ 2θ + 1 and, similarly, |hist(j) − hist′(j)| ≤ 2θ + 1. So, we obtain
$$\lvert distance(i, j) - distance'(i, j) \rvert \le (2\theta + 1) + (2\theta + 1) = 4\theta + 2.$$
Lemma 5: DEN-histogram of Algorithm 3 satisfies (ε_1 + ε_2 + ε_3)-node-differential privacy.
Proof: In Algorithm 3, line 2 uses the exponential mechanism with a privacy budget of ε_1. Line 6 uses the Laplace mechanism with a privacy budget of ε_2 to add noise to the distance between buckets. Line 18 uses the Laplace mechanism with a privacy budget of ε_3. All other steps are independent and performed on the noisy output. By the composition theorem, the algorithm satisfies (ε_1 + ε_2 + ε_3)-node-differential privacy.
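The center-selection step of DEN-histogram can be sketched as follows. The sketch takes an already computed (and, in the algorithm, already noised) symmetric distance matrix as input, so it makes no claim about how distance(i, j) is defined. Two conventions are assumptions: δ_i is taken as the smallest distance to any bucket with strictly higher density, and a bucket with no denser bucket receives its largest distance to any bucket. All names are illustrative.

```python
def local_density(dist, threshold):
    """rho_i: number of other buckets whose distance to bucket i is at most `threshold`."""
    n = len(dist)
    return [
        sum(1 for j in range(n) if j != i and dist[i][j] - threshold <= 0)
        for i in range(n)
    ]


def separation(dist, rho):
    """delta_i: distance to the closest denser bucket (assumed convention, see lead-in)."""
    n = len(dist)
    delta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(denser) if denser else max(dist[i]))
    return delta


def den_group_centers(dist, threshold, k):
    """Pick the k buckets with the largest gamma_i = rho_i * delta_i."""
    rho = local_density(dist, threshold)
    delta = separation(dist, rho)
    gamma = [r * d for r, d in zip(rho, delta)]
    ranked = sorted(range(len(dist)), key=lambda i: gamma[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical usage: two well-separated clusters of three buckets each.
positions = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(p - q) for q in positions] for p in positions]
den_centers = den_group_centers(toy_dist, threshold=1.5, k=2)
```

In the toy example the middle bucket of each cluster has the highest density and is far from any denser bucket, so the two middle buckets are selected.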

V. EXPERIMENT
In this section, we compare the experimental results of the proposed algorithms with those of the (θ, Ω)-Histogram algorithm [14] and the θ-Cumulative Histogram algorithm [14], and analyze how different aspects of our proposed algorithms affect their effectiveness.

A. DATASETS AND SETTINGS
Our experiments are based on four real datasets downloaded from [20]. These datasets come from different fields: (1) Wiki-vote is a Wikipedia voting network. (2) Facebook is a social network dataset in which we add a random weight to each edge of the unweighted graph. (3) The Bitcoin OTC trust network is a ''who trusts who'' network composed of people who use Bitcoin to trade on the Bitcoin OTC platform. This is the first explicit weighted signed directed network that can be used for research. (4) The Enron dataset is an e-mail network obtained from about 1 million e-mails. Table 1 describes the attributes of the datasets, such as the maximum edge weight Weight_max and the maximum node strength ns_max. Following [14], [15], all experiments are performed with privacy budget ε ∈ [0.1, 2.0].
critical points because all four algorithms carry out the projection process, which restricts the maximum strength of nodes. (4) The performance of θ-Cumulative Histogram is the worst, and its distance from the original distribution is the largest. This is mainly because the sensitivity of this algorithm is so high that suitable parameters cannot be selected when selecting parameters through the exponential mechanism.

C. INTROSPECTIVE ANALYSIS
In this section, we perform an introspective analysis to see how our proposed projection algorithm affects the performance of ALT-histogram and DEN-histogram. Noproj-ALT-histogram is the ALT-histogram algorithm without projection, and Noproj-DEN-histogram is the DEN-histogram algorithm without projection. Fig. 3 shows the results. The upper part shows the KS distance and the lower part shows the L1 error. It can be seen that, in terms of both L1 error and KS distance, ALT-histogram and DEN-histogram perform better than Noproj-ALT-histogram and Noproj-DEN-histogram. The main reason is that projection reduces the sensitivity under node-differential privacy, which reduces the added noise and improves the accuracy of the data. The results also show that it is important to constrain the edge weight and node degree when publishing the node strength histogram. Without these constraints, the published results suffer from serious sparsity and outliers, so excessive noise must be added, which destroys the original information.

VI. CONCLUSION
This paper focuses on publishing the histogram of the node strength distribution of a graph under node-differential privacy. First, we propose a projection algorithm that preserves as much information of the graph as possible while limiting the weight and degree, and thereby bounds the sensitivity of the node strength histogram published on the projected graph. Then, we propose two algorithms, ALT-histogram and DEN-histogram, and prove that they satisfy differential privacy. The experimental results show that, compared with the existing node strength histogram publishing algorithms, the proposed algorithms have advantages in L1-error and KS-distance, which makes the noisy node strength distribution closer to the node strength distribution of the original graph. In the future, we plan to extend the algorithms to triangle coefficients and local clustering coefficients.
GANGHONG LIU is currently pursuing the master's degree with the School of Computer (Software), Inner Mongolia University. Her research interests include social networking, differential privacy, and machine learning.
XUEBIN MA is currently an Associate Professor with Inner Mongolia University. His main research interests include privacy protection technology, wireless network routing technology, and delay tolerant networks.
WUYUNGERILE LI received the Ph.D. degree from Shizuoka University, Hamamatsu, Japan, in 2013. She is currently an Associate Professor with Inner Mongolia University. Her research interests include WSNs, crowdsourcing, and mobile computing.