Link User Identities Across Social Networks Based on Contact Graph and User Social Behavior

With the rapid development of Social Networking Services (SNSs), linking online user IDs is becoming increasingly important to internet service providers. Existing methods can match adjacent IDs between different services, where adjacent IDs are IDs that post message records at the same physical location. In practice, however, nonadjacent IDs also need to be matched, which remains a key challenge. In this paper, a new method based on users' social behaviors and the contact graph is put forward to link IDs across domains. The method can match both adjacent and nonadjacent IDs. Specifically, all IDs are mapped to a contact graph. A set matching algorithm based on the contact graph then finds the set of candidate IDs and generates a confidence score that is used to select the most appropriate match. Our experimental results show that the algorithm can identify not only the set of adjacent IDs that belong to the same user but also the set of nonadjacent IDs that belong to the same user.


I. INTRODUCTION
Various online services play an important role in our daily life. It is common for an ordinary user to have two or more online IDs on different servers. For instance, a user can log in to Twitter, Foursquare and Facebook simultaneously [21], [22]. In addition, a user may have several online IDs on the same server, with different IDs playing different roles. As user IDs offer abundant data, service providers are highly motivated to mine customer data and optimize user experience. To understand user behavior more comprehensively, it is increasingly important to link user IDs among different services so as to merge separated data [1], [2]. Therefore, linking online user IDs has a profound influence on service providers. To keep pace with the development of SNSs, methods for linking IDs have evolved from linking IDs on the same platform to linking IDs across domains. Methods for linking IDs on the same platform mainly rely on the user data of specific services, such as user profiles and the social graph [3], [4]. However, such methods have difficulty matching IDs on more than one server. Methods for linking IDs across domains mainly consist of a method based on trajectory data and a method based on the contact graph. Specifically, the method based on trajectory data is applicable to matching IDs in a one-to-one manner between different servers [10]. However, this method still faces a key challenge when a user has multiple IDs on the same platform. The method based on the contact graph can match IDs in a one-to-many manner between different servers, with the set of candidate IDs for the object ID taken by default to be the neighbors of the object ID [15].
The associate editor coordinating the review of this manuscript and approving it for publication was Khin Wee Lai.
However, we found that non-neighbors of the object ID may also belong to the same user as the object ID [14], [22], [25]. More precisely, an indirect neighbor of the object ID may also belong to the same user as the object ID; see Fig. 1. Figure 1 shows not only that John and John's neighbors may belong to the same user, but also that John and John's indirect neighbors Mark and Tommy belong to the same user. Statistics on the real data set give the share of all matching pairs that are indirectly adjacent to the object ID and belong to the same user, as shown in Figure 2 [22]. In Figure 2, Different user indicates the proportion of matching pairs with a matching probability of 0 in the matching result, Neighbor indicates the proportion of matching pairs that are adjacent to the object ID and belong to the same user, and Indirect neighbor indicates the proportion of matching pairs that are indirectly adjacent to the object ID and belong to the same user. No existing method is capable of capturing IDs that are indirectly adjacent to the object ID and belong to the same user as the object ID. The goal of this paper is to link both the neighbors and the indirect neighbors that belong to the same user as the object ID in different services. To this end, a method based on the contact graph and user social behaviors is proposed, which can identify not only the set of adjacent IDs that belong to the same user but also the set of nonadjacent IDs that belong to the same user. First, all IDs are mapped to a large graph using the contact graph model, in which nodes are user IDs and an edge between two nodes means that the two IDs have accessed the same physical location. Second, some IDs that are indirect neighbors of the object ID are linked and integrated with the neighbors of the object ID to form the set of candidate IDs.
To obtain the IDs that are indirect neighbors of the object ID, the indirect neighbors are preprocessed through link prediction [25]. Meanwhile, a general model of users' social behaviors is built. Finally, a set matching algorithm based on the contact graph and the general model processes the set of candidate IDs, and the most appropriate match is selected according to the confidence score produced by the algorithm.
The performance of the algorithm is measured on two real data sets. The first is the Twitter-Foursquare dataset. The second is the Gowalla-Brightkite dataset, which comes from the public SNAP website. Two state-of-the-art algorithms, SIMP and CN, are used as baselines. Our experimental results show that our algorithm can identify not only the set of adjacent IDs that belong to the same user but also the set of nonadjacent IDs that belong to the same user. In summary, we make the following contributions: (i) A new method based on users' social behaviors and the contact graph is put forward to link IDs across domains. (ii) Our algorithm can identify not only the set of adjacent IDs that belong to the same user but also the set of nonadjacent IDs that belong to the same user.

II. RELATED WORK
This section introduces the related studies, which cover methods for linking IDs on the same platform and methods for linking IDs across domains.

A. LINKING IDS ON THE SAME PLATFORM
As the mainstream approach, methods for linking IDs on the same platform were extensively researched in the early days, when the internet was not yet widely popularized. Goga et al. [3] used user profile attributes (such as user name and profile photo) and certain similarity features (such as posting timestamp and writing style) to link IDs, which can match user identities largely reliably in practice. Korula et al. [4] used the social graph to link user IDs. Theirs was the first work to formalize the user identity linking problem, and they designed an effective, partial, and simple parallel algorithm to solve it. Goga et al. [5] connected users based on the similarity of movie ratings on Netflix and IMDB. Zafarani et al. [6] proposed the MOBIUS algorithm, which uses behavior modeling to link IDs and maps individual identities across media websites. D. Liu et al. used the CT method to reduce the energy loss during data transmission [39]. Narayanan et al. [7] proposed a new de-anonymization attack method, which performs well in the case of uneven data and lack of background. Mu et al. [19] observed the real identity correspondence in social networks more naturally through the temporal and spatial location of potential users. Nevertheless, these methods, which are applicable only to linking IDs on the same platform, cannot keep pace with the development of SNSs.

B. LINKING IDS ACROSS DOMAINS
The method for linking IDs across domains has a far-reaching influence on service providers. Existing methods can match IDs between different servers in a one-to-one or a one-to-many manner. In terms of one-to-one matching, Riederer and Rossi link IDs on the basis of LBSs [35], [36]. Specifically, Riederer et al. [10] proposed an effective and general method to solve the coordination problem based on location data sets, which uses any pair of sporadic location data sets to determine the most likely match. Rossi et al. [11] proposed a linking method based on trajectory data, which exploits the spatio-temporal data created by the user over time and the frequency of the places visited. At the same time, Jiang Hongbo and others not only applied LBS technology to mobile computing, but also solved related problems of indoor navigation [37], [38]. In addition, R. Zafarani et al. [8] linked the same users across different platforms so as to fully understand user behavior and provide better recommendations. C. Li et al. [24] matched user identities between two servers through a mapping relationship. Although these methods can match IDs between two servers in a one-to-one manner, they face great difficulties in terms of ID diversity. In terms of one-to-many ID matching between two servers, X. Han et al. [12] used the location data generated by users on social media platforms to link IDs and proposed a framework based on copolymerization. Seglem et al. [13] integrated the profiles of different services through the user's spatio-temporal location information. However, that study did not attempt to preserve user privacy, and was even considered an attempt to infringe on it. Vosecky et al. [9] proposed a user identification method based on web profile matching, combined with the user's friend network to further extend its effectiveness, but it also depends on the characteristics of specific services (such as social graphs). Huandong Wang et al. [15] proposed the SIMP algorithm based on the contact graph and the temporal and spatial locality of user activities. The SIMP algorithm can associate users with actual locations and times even when users access online services at will. However, the SIMP algorithm cannot capture IDs that are not adjacent yet belong to the same user.
In sum, to keep pace with the development of SNSs, methods for linking user IDs have kept improving, evolving from linking IDs on the same platform to linking IDs across domains. For linking IDs across domains, Huandong Wang et al. proposed the SIMP algorithm based on the contact graph and the temporal and spatial locality of user activities. SIMP can match IDs in a one-to-many manner between different servers, with the set of candidate IDs for the object ID taken by default to be the neighbors of the object ID [15]. However, a non-neighbor of the object ID may also belong to the same user as the object ID [14], [22], [25]. This paper proposes a new ID matching algorithm based on Bayesian theory. Our method can identify not only adjacent IDs belonging to the same user, but also non-adjacent IDs belonging to the same user.

III. METHODOLOGY
In this section, we first present the overall flow of our method in Figure 3. Second, some basic concepts and the candidate ID sets are introduced, and a probabilistic model describing the social behavior of users is presented. Finally, we clarify the goal of this paper and propose an algorithm for matching IDs.

A. PROBLEM DEFINITION
In this study, $S$ represents the set of online ID types and $B$ denotes the set of online IDs. For each online ID $x$, its ID type is $s(x)$. For every $s \in S$, $B_s$ represents the set of all IDs of type $s$.
For an arbitrary online ID $y \in B$, its mobile record is expressed as $r(y) = \{(l_1, t_1), (l_2, t_2), \ldots\}$, where $(l_i, t_i)$ represents a mobile record at location $l_i$ and time $t_i$. In addition, the set of all time bins is denoted as $T$, and the set of all regions is denoted as $L$.
Furthermore, the movement record of an ID set $\delta$ is defined as $r(\delta) = \{r(z) \mid z \in \delta\}$. For each pair of online IDs $y, x \in B$, whether the two IDs belong to the same user is represented as a binary variable $SU(y, x)$. That is,

$$SU(y, x) = \begin{cases} 1, & \text{if } y, x \text{ belong to the same user} \\ 0, & \text{otherwise.} \end{cases}$$

We can also apply this to a more general situation. For an online ID set $\delta$, whether its IDs belong to the same user is represented as the variable $SU(\delta) = \prod_{y, x \in \delta} SU(y, x)$. In a similar fashion, for an online ID $x$ and a set of IDs $\delta$, whether they belong to the same user is represented as the variable $SU(\delta, x) = \prod_{y \in \delta} SU(y, x)$.
Definition 1 (Contact Graph): The contact graph of IDs is expressed as $G = (B, E)$. For a pair of online IDs $x, y \in B$, if $x$ and $y$ have posted message records at the same location, then there is an edge between $x$ and $y$ in $E$, i.e., $\exists l \in L$ such that $(l, t_1) \in r(x)$ and $(l, t_2) \in r(y)$ hold for some $t_1, t_2 \in T$.
For two online IDs $x, y \in B$, if $x$ and $y$ have posted message records at the same location, then $x$ is a neighbor of $y$. For three online IDs $x, y, z \in B$, if $x$ is a neighbor of $y$ and $y$ is a neighbor of $z$, then $x$ is an indirect neighbor of $z$. As shown in Figure 1, John is not only Paul's neighbor, but also an indirect neighbor of Mark and Tommy.
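The definitions of the contact graph, neighbors and indirect neighbors can be illustrated with a minimal Python sketch. The function names, the toy locations ("cafe", "gym") and the choice to exclude direct neighbors from the indirect-neighbor set are assumptions made for illustration; only the names John, Paul, Mark and Tommy come from the Figure 1 description.

```python
from collections import defaultdict
from itertools import combinations


def build_contact_graph(records):
    """Build the contact graph from mobility records.

    records: dict mapping an ID to an iterable of (location, time_bin) pairs.
    Returns a dict mapping each ID to the set of its neighbors, i.e. IDs
    that posted at least one record at a shared location (at any time).
    """
    ids_at_location = defaultdict(set)
    for node, recs in records.items():
        for location, _time_bin in recs:
            ids_at_location[location].add(node)

    graph = {node: set() for node in records}
    for ids in ids_at_location.values():
        for a, b in combinations(ids, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def indirect_neighbors(graph, x):
    """IDs reachable from x in exactly two hops.

    Assumption: direct neighbors and x itself are excluded.
    """
    two_hop = set()
    for y in graph[x]:
        two_hop |= graph[y]
    return two_hop - graph[x] - {x}


# Hypothetical toy records (locations and time bins are made up).
toy_records = {
    "John": [("cafe", 1)],
    "Paul": [("cafe", 2), ("gym", 3)],
    "Mark": [("gym", 4)],
    "Tommy": [("gym", 5)],
}
toy_graph = build_contact_graph(toy_records)
print(sorted(toy_graph["John"]))                      # direct neighbors of John
print(sorted(indirect_neighbors(toy_graph, "John")))  # indirect neighbors of John
```

In this toy data, John shares a location only with Paul, while Mark and Tommy are reached through Paul, mirroring the relationship described for Figure 1.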

B. FILTER THE CANDIDATE SET OF OBJECT ID
Because the candidate set of the object ID is based on the contact graph, it is important to understand what the graph represents. To integrate the information of different services, all IDs, which have different relationships with each other, are mapped to the graph. For linking IDs across domains, Huandong Wang et al. assumed that the candidate set of the object ID is limited to its neighbors, and proposed the SIMP algorithm on that basis. Under this assumption, as shown in Fig. 4, the candidate set of object ID $x$ is $\{y_1, y_2\}$. However, inspired by Wei Chen et al. [22], we found that the object ID and its indirect neighbors might belong to the same user, and we did identify such cases in real data sets. Therefore, we take the candidate set of the object ID to be its neighbors and its indirect neighbors. As shown in Fig. 4, the candidate set of object ID $x$ then becomes $\{y_1, y_2, y_3, y_4\}$. For any $x \in B$, the candidate set of object ID $x$ is divided into two parts: the neighbors $N(x) = \{b \mid b \in B, (b, x) \in E\}$ of the object ID and the indirect neighbors $I(x)$ of the object ID. To obtain more accurate results, the neighbors of the object ID are restricted to $N_s(x) = N(x) \cap B_s$. Likewise, in combination with the link prediction method [25], the indirect neighbors are restricted to $I_s^q(x) = I^q(x) \cap B_s$. Thus, our candidate ID set is $C(x) = N_s(x) \cup I_s^q(x)$.
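A minimal Python sketch of assembling the candidate set as the union of type-filtered neighbors and type-filtered indirect neighbors is given below. The link-prediction step of [25] is not reproduced: the `keep_indirect` predicate is a hypothetical stand-in for it, and the toy graph is an assumed structure, not the actual graph of Fig. 4.

```python
def candidate_set(graph, id_type, x, target_type, keep_indirect=lambda x, b: True):
    """Candidate IDs of x: neighbors and indirect neighbors of a given type.

    graph:         dict mapping an ID to the set of its neighbors.
    id_type:       dict mapping an ID to its service type s(b).
    target_type:   the service type s that candidates must have.
    keep_indirect: hypothetical placeholder for the link-prediction filter
                   applied to indirect neighbors (assumption, not from [25]).
    """
    neighbors = graph[x]
    # Two-hop IDs, excluding x itself and its direct neighbors.
    two_hop = set().union(*(graph[y] for y in neighbors)) - neighbors - {x}

    typed_neighbors = {b for b in neighbors if id_type[b] == target_type}
    typed_indirect = {
        b for b in two_hop
        if id_type[b] == target_type and keep_indirect(x, b)
    }
    return typed_neighbors | typed_indirect


# Hypothetical toy graph: y1, y2 are neighbors of x; y3, y4 are two hops away.
toy_graph = {
    "x": {"y1", "y2"},
    "y1": {"x", "y3"},
    "y2": {"x", "y4"},
    "y3": {"y1"},
    "y4": {"y2"},
}
toy_types = {"x": "s0", "y1": "s1", "y2": "s1", "y3": "s1", "y4": "s1"}
print(sorted(candidate_set(toy_graph, toy_types, "x", "s1")))
```

Passing a stricter `keep_indirect` predicate shrinks the indirect part of the set while leaving the direct neighbors untouched.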

C. GENERAL MODEL OF INDIVIDUAL BEHAVIOR
To describe the problem of linking IDs across domains clearly, this study constructs a general model of user behavior, built on the spatio-temporal locality of user activities. This model describes how users generate records at different locations.
On the one hand, to model the various behaviors of users when accessing the server, we binarize the number of records in each discrete time bin and simplify the model on this basis [10]. We assume that a user's visit to location $l$ in each discrete time bin follows a Bernoulli distribution with probability $q_l$. On the other hand, when the user visits location $l$, whether a record with ID type $s$ is generated follows a Bernoulli distribution with probability $q_s$. To sum up, for an ID $x \in B$, the probability of generating the observed records is as follows: where $F_x(l, t)$ is the indicator function of whether the mobile record $(l, t)$ exists in $r(x)$. We can estimate $q_l$ and $q_s$ from a probabilistic perspective. In addition, the number of records accessed by ID $x$ in server $s$ and location $l$ is expressed as $N_l^s$. Therefore, the total number of records of all users accessing server $s$ is expressed as, For any location $l$, the expected number of records is expressed as: By combining equations (1) and (2), we obtain the estimates of $q_l$ and $q_s$. Furthermore, for a set of online IDs $\delta \subseteq B$, their observation records are generated with the following probability under the condition that they belong to the same user: where $F_\delta(l, t)$ is the indicator function of whether the mobile record $(l, t)$ exists in $r(\delta)$.
VOLUME 10, 2022
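One reading of the Bernoulli assumptions for a single ID can be sketched in Python. The sketch assumes that every (location, time bin) cell is independent and that a record appears in a cell with probability q_l · q_s; this factorization and the names `log_likelihood`, `q_loc` and `q_type` are illustrative assumptions, and the paper's own expression may differ.

```python
import math


def log_likelihood(records, q_loc, q_type, time_bins):
    """Log-probability of one ID's binarized records under the assumed model.

    records:   set of (location, time_bin) cells in which the ID has a record.
    q_loc:     dict mapping each location l to the visit probability q_l.
    q_type:    probability q_s that a visit produces a record of this ID type.
    time_bins: iterable of all discrete time bins.

    Assumes 0 < q_l * q_s < 1 for every location, and independent cells.
    """
    time_bins = list(time_bins)
    total = 0.0
    for location, q_l in q_loc.items():
        p = q_l * q_type  # probability of observing a record in one cell
        for t in time_bins:
            if (location, t) in records:
                total += math.log(p)
            else:
                total += math.log(1.0 - p)
    return total


# Hypothetical numbers, for illustration only.
example = log_likelihood(
    records={("a", 0)},
    q_loc={"a": 0.5, "b": 0.2},
    q_type=0.5,
    time_bins=[0, 1],
)
print(example)
```

Working in log space avoids numerical underflow when the number of locations and time bins is large.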

D. ID SET MATCHING PROBLEM
For any object ID $x \in B$, our goal is to find the set of online IDs that belong to the same user as $x$. This goal is formally defined as the User Identity Set Matching Problem (UISMP).
Given: The object ID $x$ and its movement record $r(x)$; the candidate ID sets $\delta_1, \ldots, \delta_N \subseteq B$ of the object ID and their movement records $r(\delta_i)$, where $i = 1, \ldots, N$.
Problem: We must get a ranking function $\varphi : \{\delta_1, \ldots, \delta_N\} \to \{1, \ldots, N\}$. This ranking function should rank the ID sets belonging to the same user as $x$ as high as possible, and the ranking function is expressed as: The goal of our algorithm is to get the rank $\varphi(\delta_k)$ of each candidate ID set $\delta_k$. To be specific, based on the movement records of the IDs in $\delta_k$, we get the joint probability $P(SU(\delta_k, x) = 1 \mid r(W))$, where $W$ is the set of candidate IDs, and the candidate ID sets are ranked on the basis of this joint probability. In addition, $W$ is set to $C(x)$. We describe in detail how to calculate $P(SU(\delta_k, x) = 1 \mid r(W))$ later.
We first consider the one-to-one matching problem, which is to link the ID pairs belonging to the same user in two services. In the case of pairwise matching, each user can have at most one ID in each service. Assuming that the server of the object ID $x$ is $s_0$ and the server of the candidate ID is $s_1$, we get $W = C_{s_1}(x) \cup \{x\}$, as shown in Figure 1. Then, according to the probabilistic model of user social behavior in Section III-C, we obtain the probability $P(SU(y, x) = 1 \mid r(W))$ that $y$ and $x$ belong to the same user, which solves the one-to-one matching problem [15].
So far, the one-to-one matching problem has been solved. However, under normal circumstances, there may be multiple IDs in $C_{s_1}(x)$ that belong to the same user as the object ID $x$, as shown in Figure 2. On the basis of the one-to-one matching problem, this section further studies the one-to-many matching problem.
When we consider multiple ID matching, for example $y_1, y_2 \in C(x)$, the variables $SU(y_1, x)$ and $SU(y_2, x)$ are not independent of each other. This means that we cannot directly use the product of the individual ID probabilities as the joint probability. For an ID set $\delta \subseteq W$, to calculate $P(SU(\delta, x) = 1 \mid r(W))$, we should first obtain:

$$P(SU(\delta, x) = 1 \mid r(W)) = \frac{P(r(W), SU(\delta, x) = 1)}{P(r(W))}.$$

This equation is then developed further. Simplifying the above formula through the total probability formula over all partitions of $W$, we acquire: where $P(h)$ represents the prior probability of partition $h$, and the concept of partition is mentioned in H. Wang's research [20]. More precisely, if $\delta$ and $x$ are divided into one set in $h$, then $P(SU(\delta, x) = 1 \mid h) = 1$; otherwise, $P(SU(\delta, x) = 1 \mid h) = 0$. Therefore, we collect all the IDs within the set $\delta \cup \{x\}$ and denote the set of all partitions in which they are divided into one set as $\mathcal{P}(B, \delta \cup \{x\})$. Based on the Bayesian theory, we can obtain equation (4) by combining the relationship between $P(r(W), SU(\delta, x) = 1)$ and $P(SU(\delta, x) = 1 \mid r(W))$, Besides, for any partition $h \in \mathcal{P}(W)$, we use $M(h)$ to approximate $P(r(W) \mid h)P(h)$, which is expressed as follows: Putting it into (4), we obtain: There is another, contrasting situation that we need to consider: $SU(\delta, x) = 0$ corresponds to the other partitions, namely those in which the IDs of $\delta \cup \{x\}$ are not divided into one set. Therefore, we also obtain: By combining (5) and (6), we have: So far, this paper has obtained the probability of ID matching through a series of derivations, which solves the UISMP problem.
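The partition-based reasoning can be sketched in Python as a ratio of summed partition weights: the weights of partitions that place all IDs of δ ∪ {x} in one set, divided by the weights of all partitions of W. This is a sketch of the idea under stated assumptions, not the paper's equation (7): the `weight` argument is a hypothetical stand-in for M(h), whose exact form is not reproduced here, and the function names are illustrative.

```python
def set_partitions(items):
    """Yield every partition of a list as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Insert `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give `first` a block of its own.
        yield [[first]] + partition


def match_probability(W, target, weight):
    """Approximate P(SU(delta, x) = 1 | r(W)).

    W:      list of all IDs under consideration (including x).
    target: set delta ∪ {x} that must fall into a single block.
    weight: callable h -> non-negative float, a stand-in for M(h) (assumption).
    """
    together = 0.0
    total = 0.0
    for h in set_partitions(list(W)):
        m = weight(h)
        total += m
        if any(target <= set(block) for block in h):
            together += m
    return together / total


# With a uniform (hypothetical) weight, 2 of the 5 partitions of
# {x, y1, y2} keep x and y1 in the same block.
print(match_probability(["x", "y1", "y2"], {"x", "y1"}, lambda h: 1.0))
```

Because the joint event is evaluated on whole partitions, the dependence between SU(y1, x) and SU(y2, x) is handled without multiplying per-ID probabilities.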

E. SIMILARITY ALGORITHM
According to Section III-D, we have solved the UISMP problem. However, a computational problem remains: the computational complexity of equation (7) is high. To address this problem, we adopt the following three methods:
• Ignore some non-adjacent IDs: If two IDs are neither adjacent nor indirectly adjacent, then the two IDs do not belong to the same user. Therefore, the candidate ID set of object ID $x$ is limited to its neighbors and indirect neighbors, and this candidate ID set can be expressed as $C(x)$. As a result, the size of $W$ is greatly reduced.
• Reduce feasible partitions: To reduce the computational complexity, we restrict the feasible region to $\mathcal{P}(B, \delta \cup \{x\})$. Specifically, $|W|$ is reduced to $N_{max}$. If $|W| \geq N_{max}$, all IDs in $W \setminus (\{x\} \cup \delta)$ are assumed to belong to different users.
Based on the similarity method and equation (7), we propose an algorithm for calculating the confidence level, as described in Algorithm 1. After three approximation methods, the complexity of our algorithm is reduced. Given the object ID $x$ and its candidate set $C(x)$, if $|C(x)| \leq N_{max}$, then we can calculate the confidence score by traversing all the partitions in $\mathcal{P}(W)$ according to equation (7). Otherwise, we only use two methods to reduce computational complexity.
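To see why traversing every partition becomes expensive, note that the number of partitions of a set with n elements is the Bell number B_n. The short Python sketch below computes Bell numbers with the Bell triangle; the function name `bell_numbers` is illustrative and not part of the paper.

```python
def bell_numbers(n_max):
    """Return [B_0, B_1, ..., B_n_max] computed with the Bell triangle."""
    bells = [1]   # B_0 = 1
    row = [1]     # first row of the Bell triangle
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above-left in the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


for size, count in enumerate(bell_numbers(10)):
    print(f"|W| = {size:2d}: {count} partitions")
```

The count grows faster than exponentially, which is why capping the number of IDs whose partitions are enumerated (the role played by $N_{max}$ in the text) keeps the traversal tractable.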

IV. PERFORMANCE EVALUATION
First, this section briefly introduces the two data sets used in the experiment, namely the Gowalla-Brightkite data set and the Twitter-Foursquare data set. Second, the two baseline algorithms are explained. Third, the evaluation metrics and the analysis of the experimental results are presented.

A. DATASETS
The performance of the algorithm is evaluated on two real data sets, which come from SNAP and from the work of Huandong Wang et al., respectively.

1) GOWALLA-BRIGHTKITE
Users have relevant dynamic information on both the Gowalla and Brightkite platforms, from which we obtain the real mapping between Brightkite accounts and Gowalla accounts. This dataset can be retrieved from SNAP. The dataset has a total of 58,228 users and 441,143 check-in locations. However, only part of the original data set, namely the movement trajectories, was used in the experiment [27], [30], [31], [34].

2) TWITTER-FOURSQUARE
As we all know, Foursquare and Twitter are two social networks with a large number of users worldwide. It is worth noting that users can publish data related to time and space on these two social network platforms. To evaluate the performance of our algorithm, our model is trained on Foursquare and Twitter data sets. However, only part of the original data set is used in training, namely their movement trajectories [16], [26], [32].

B. BASELINE ALGORITHMS
The two baseline algorithms are elaborated as follows:

1) SIMP
To describe the daily behavior of users more accurately, Huandong Wang et al. established a contact graph model, which maps all accounts to a large graph, and proposed the SIMP algorithm on the basis of this model. To prove the optimality of the algorithm, they used a Bayes-based method to calculate the confidence probability of the algorithm. The algorithm addresses the problems of inconsistent data quality and ID diversity when linking across services [15], [16].

2) COMMON NEIGHBOR (CN)
The common neighbor algorithm proposed by Dashun Wang and others can effectively solve some link problems [17], [18]. Specifically, it uses the number of common neighbors between two nodes to determine the similarity between the two nodes. Therefore, we regard each ID as a node, regard IDs that have posted message records at the same physical location as neighbors, and measure the similarity of two IDs by the number of common access records between them.
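The common-neighbor idea can be sketched in a few lines of Python. The sketch scores a pair of IDs by the number of neighbors they share in the contact graph; the function names, the tie-breaking rule and the toy graph are assumptions for illustration, and the baseline implementation used in the experiments may differ in detail.

```python
def common_neighbor_score(graph, a, b):
    """Number of neighbors shared by IDs a and b in the contact graph."""
    return len(graph[a] & graph[b])


def rank_candidates(graph, x, candidates):
    """Rank candidate IDs for x by descending common-neighbor score.

    Ties are broken alphabetically (an arbitrary choice for determinism).
    """
    return sorted(
        candidates,
        key=lambda c: (-common_neighbor_score(graph, x, c), str(c)),
    )


# Hypothetical toy contact graph: ID -> set of neighbor IDs.
toy_graph = {
    "x": {"u", "v", "y1"},
    "y1": {"u", "v", "x"},
    "y2": {"u"},
    "u": {"x", "y1", "y2"},
    "v": {"x", "y1"},
}
print(rank_candidates(toy_graph, "x", ["y2", "y1"]))
```

In the toy graph, y1 shares two neighbors with x and y2 shares one, so y1 is ranked first.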

C. EVALUATION METRICS
We choose three evaluation indicators: recall, precision and AUC. These three standard indicators are used to evaluate our system performance. More specifically, we first use an algorithm to generate a set list for each target ID: $[x_1, x_2, \ldots, x_k]$, where $x_i$ represents the $i$-th ID and $k$ represents the number of matching IDs. Second, we calculate the three performance indicators of the list. The detailed calculation process is as follows:

1) ID LIST EVALUATION
After setting the set size to 1 (|Y i | = 1), our set list is transformed into ID list. Therefore, for each set of set lists [Y 1 , Y 2 , . . . , Y k ], we combine the highest-ranked IDs in each set into a new ID list, and use AUC, precision, and recall to evaluate the list.
Precision & Recall: Precision is defined as the proportion of returned user pairs that are correctly linked. Recall is defined as the proportion of actually linked user pairs that are included in the returned results [29], [33]:

$$\text{Precision} = \frac{\alpha}{\beta}, \qquad \text{Recall} = \frac{\alpha}{\gamma},$$

where $\gamma$ is the number of ID pairs that are actually linked in the real data, $\beta$ is the number of ID pairs that are returned, and $\alpha$ is the number of ID pairs that are actually linked in the returned result.
AUC: In the machine learning literature, AUC means the area under the ROC curve [23], [28]. The curve represents the relationship between the true positive rate (TPR) and the false positive rate (FPR). The value of AUC equals the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance. It mainly evaluates the accuracy of the ranking function, namely:

$$AUC = \frac{\sum_{i=1}^{m_0} r_i - m_0 (m_0 + 1)/2}{m_0 m_1},$$

where $m_0$ and $m_1$ respectively represent the number of positive and negative instances, and $r_i$ represents the rank of the $i$-th positive instance, with ranks assigned in ascending order of score. Here, a positive instance is one that is correct according to the real data. We set the length of the ID list to $k = 10$, so $m_0 + m_1 = k = 10$.
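The three metrics can be computed with a short Python sketch. Precision and recall follow the α, β, γ counts defined in the text; the AUC uses the standard rank-sum form, under the assumption that ranks are counted so that the lowest-ranked list entry has rank 1. Function names and the example data are illustrative.

```python
def precision_recall(returned_pairs, true_pairs):
    """Precision = alpha / beta and recall = alpha / gamma.

    alpha: returned pairs that are truly linked,
    beta:  all returned pairs,
    gamma: all truly linked pairs in the ground truth.
    """
    returned, truth = set(returned_pairs), set(true_pairs)
    alpha = len(returned & truth)
    precision = alpha / len(returned) if returned else 0.0
    recall = alpha / len(truth) if truth else 0.0
    return precision, recall


def auc_from_ranking(labels_best_first):
    """Rank-sum AUC for a ranked list of 1 (positive) / 0 (negative) labels.

    The list is ordered best-first; the last entry gets rank 1 and the
    first entry gets rank k (assumed convention for the rank-sum formula).
    """
    k = len(labels_best_first)
    m0 = sum(labels_best_first)   # number of positive instances
    m1 = k - m0                   # number of negative instances
    if m0 == 0 or m1 == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum = sum(
        k - position
        for position, label in enumerate(labels_best_first)
        if label == 1
    )
    return (rank_sum - m0 * (m0 + 1) / 2) / (m0 * m1)


# Hypothetical example: two returned pairs, three truly linked pairs.
print(precision_recall([("a", "b"), ("c", "d")],
                       [("a", "b"), ("e", "f"), ("g", "h")]))
print(auc_from_ranking([1, 0, 1, 0]))
```

For the list [1, 0, 1, 0], three of the four positive-negative pairs are ordered correctly, which matches the rank-sum result of 0.75.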

D. EXPERIMENT AND RESULTS
We conduct three sets of experiments on Twitter-Foursquare and Gowalla-Brightkite, and evaluate our algorithm by comparing it with the baselines, as shown in Figures 5 to 8. AUC is used to evaluate our system, while precision and recall are used to evaluate its accuracy. First, we run UISMP, SIMP and CN on the Twitter-Foursquare data set. The experimental results are shown in Figure 5 and Figure 6. The AUC of UISMP is slightly better than that of the other algorithms, as shown in Figure 5, and the precision of UISMP is slightly higher than that of the other algorithms, as shown in Figure 6.
Second, we run UISMP, SIMP and CN on the Gowalla-Brightkite data set. The experimental results are shown in Figure 7 and Figure 8. The AUC of UISMP is slightly better than that of the other algorithms, as shown in Figure 7, and the precision of UISMP is slightly higher than that of the other algorithms, as shown in Figure 8.
Third, through the comparison of these groups of experiments, we found that our algorithm UISMP is superior to the two baseline algorithms in terms of AUC. In addition, at the same recall, the precision of our algorithm UISMP is generally higher than that of the baseline algorithms.

V. CONCLUSION
In this paper, a method based on the contact graph and user social behaviors is proposed to link IDs, which can identify not only the set of adjacent IDs that belong to the same user but also the set of nonadjacent IDs that belong to the same user. First, all IDs are mapped to a large graph using the contact graph model, in which nodes are user IDs and an edge between two nodes means that the two IDs have accessed the same physical location. Second, some IDs that are indirect neighbors of the object ID are linked and integrated with the neighbors of the object ID to form the set of candidate IDs. To obtain the IDs that are indirect neighbors of the object ID, the indirect neighbors are preprocessed through link prediction. Meanwhile, a general model of users' social behaviors is built. Finally, a set matching algorithm based on the contact graph and the general model processes the set of candidate IDs, and the most appropriate match is selected according to the confidence score produced by the algorithm.