The Deep Fusion of Topological Structure and Attribute Information for Link Prediction

Link prediction seeks missing or future links in a network and has become a hot research topic. A network generally contains two types of information: the topological structure formed by the connections between nodes, and the attribute information of the nodes. However, existing topology-based link prediction algorithms make little use of attribute information. In this paper, a novel algorithm called Network Embedding with Attribute Deep Fusion for Link Prediction (NEADF-LP) is proposed. We obtain embedded vectors carrying topological structure and attribute information from a structure encoder and an attribute encoder, respectively, and fuse the two vectors deeply. Compared with mainstream baselines on the CiteSeer and Cora datasets, the results show that the deep fusion of topological structure and attribute information effectively improves the accuracy of link prediction.


I. INTRODUCTION
Complex networks in the real world can usually be modeled with nodes and links, in which nodes represent entities and links represent the connections between these entities [1]. Link prediction estimates the likelihood that two nodes in a network are linked by utilizing the information of the nodes and the topological structure of the network, and it can be applied in various fields, such as social networks, bio-networks, recommendation systems, and stock market forecasts [2]. Link prediction has become a research focus due to its scientific and commercial value.
A large amount of work has been done on link prediction, especially on defining structural similarity criteria based on the topological structure and on modeling attribute knowledge.
The topological structure of the network is more versatile than the attribute information of nodes. Using the same structural similarity metrics enables high prediction accuracy on different networks with similar structures. The mainstream link prediction methods that are based on the topological structure similarity of the network can be divided into three categories [2]. The first class is built on the global structure of the network, such as the Katz Index [3], which calculates the topological similarity of nodes from global structural information. The second class is called local indices. Traditional local indices measure similarity by counting the number of common neighbors (CN) [4], or by penalizing large-degree nodes through a penalizing parameter, such as the Hub Promoted Index [5]. The Adamic-Adar Index (AA) [6] and the Resource Allocation Index (RA) [7] penalize large-degree common neighbors on top of the CN index. Significant Influence (SI) [2] models significant influence by distinguishing strong influence from weak influence. The third class concentrates on quasi-local structures of the network to reach a compromise between complexity and performance. The Local Path Index (LP) [8] considers short paths within two or three hops but ignores longer paths. In recent years, several methods based on these traditional indices have been proposed to improve prediction precision. Liu et al. [9] filter out the redundant links in the network to improve the accuracy of the k-shell method from the perspective of spreading dynamics. Ahmed et al. [10] present an algorithm based on random walks in temporal networks. Path-based similarity metrics utilize more topological information of the network for link prediction. However, traversing more nodes of the network leads to higher computational complexity.
(The associate editor coordinating the review of this manuscript and approving it for publication was Yichuan Jiang.)
Zhang and Chen [11] study a heuristic learning paradigm for link prediction. They propose a method to learn heuristics from local subgraphs using a graph neural network based on the γ-decaying heuristic theory.
Similarity indices on unstructured information can be defined from the attributes of nodes. Compared with the structural information of the network, the attribute information of the nodes is difficult to express in a structured form. Attribute-based methods mainly use attributes to calculate the similarity between a pair of nodes and predict whether the two nodes will establish a link based on that value. The more identical attributes two nodes share, the more similar the two nodes are [10]. For example, in a citation network, Trouillon et al. [12] predict mutual references between authors by using information about the authors, such as schools, countries, and educational backgrounds. Wang et al. [13] first extract topological information with a latent-feature representation model, then establish the connections between topological and non-topological information with a logistic model, and finally calculate the probability that existing users and cold-start users are connected. Xie et al. [14] propose a joint prediction feature model that takes both the characteristics of users and the network structure into account, and they transform the link prediction problem into a classification problem. Sheikh et al. [15] introduce the GAT2VEC framework, which generates structural contexts from structural information and attribute contexts from attributes, and employs a shallow neural network model to learn a joint representation from them. Most attribute-based methods are more efficient in terms of computation time [16]. However, as there are many different types of attribute information, it is difficult to give a general solution. Therefore, current attribute-based prediction methods are limited to specific attribute fields, attributes cannot be represented uniformly, and algorithm performance varies significantly across networks.
There are two main ways to improve the accuracy of link prediction: representing the network structure more reasonably and making full use of the node attribute information. With the rapid development of deep learning, word2vec [17], an extraordinary word embedding framework, was proposed. The embedding vectors derived from the model preserve the syntactic and semantic relations between words under simple linear operations [18]. Word embedding provides a new concept for the study of social networks because it meets the representation requirements of large-scale social networks well [19].
In recent years, many researchers have focused on network topological structure and presented various structure preserved graph embedding methods. These methods have significant effects on the representation of network topological structure. Graph embedding encodes the original graph into a low dimensional embedding space [20], and embedding vectors can be easily exploited by link prediction [21].
Applying random walks, DeepWalk [22] first generates node sequences on the graph and then learns the node vectors with the help of the Skip-Gram [23] model. LINE [24] preserves both the first-order and second-order proximities of nodes: it calculates node vectors with the two proximities separately and concatenates them directly. Node2vec [25] obtains node sequences via a biased random walk, which can perform BFS-like or DFS-like walks to explore structures in different types of graphs. At present, there are many methods based on deep learning. SDNE [26], a semi-supervised deep model, exploits the first-order and second-order proximities to characterize the local and global network structure. Xie et al. [27] present a network embedding framework, Sim2vec, which encodes more comprehensive similarities among different nodes of the network into unified latent spaces.
Although current link prediction algorithms achieve tolerable prediction accuracy, they combine topological structure and attribute information insufficiently, so it is worth seeking a link prediction model that yields better accuracy. Besides the basic topological structure, attribute information is abundant in networks, such as user profile information in social networks and author information of articles in citation networks. As the analysis in [28] shows, the performance of an algorithm can be effectively enhanced by taking external information into account, such as the attributes of nodes. The more attributes two people have in common, such as age, gender, education, or job, the more likely they are to share the same interests and tastes. Therefore, attribute-based methods are preferable in some aspects. Song et al. [29] present a combined approach based on discriminative feature combinations that guides link prediction by the attribute information of nodes and the topological structure. However, it is difficult to predetermine the weights of structure and attributes [30]. Bu et al. [31] propose Graph K-means, which formulates the downstream task as a multi-objective optimization problem in a discrete-time dynamical system and does not need to predetermine the weights of structure and attributes.
In this paper, we propose a novel link prediction method that considers both the topological structure and the attribute information of nodes, namely Network Embedding with Attribute Deep Fusion for Link Prediction, abbreviated as NEADF-LP. To verify the performance of NEADF-LP, a series of experiments comparing it with state-of-the-art methods has been completed on two typical citation network datasets, Cora and CiteSeer. The results show that our method performs better than the state-of-the-art methods. Moreover, NEADF-LP remains effective when part of the links are missing.
Our contributions are summarized as follows:
• We embed the attributes (discrete attributes and continuous attributes) of nodes in the network into low-dimensional vectors for subsequent fusion.
• We build a tower-structure model, in which the number of neurons in the current layer is halved from the previous layer, so that nonlinear fusion between structural features and attribute features can be achieved.
• We propose NEADF-LP (Network Embedding with Attribute Deep Fusion for Link Prediction), a novel link prediction method that takes both the topological structure and the attribute information of nodes into consideration.
The rest of this paper is organized as follows. Section II gives the problem description and the metric for evaluating the result. The novel approach NEADF-LP is presented in Section III. Experimental results are given and discussed in Section IV using standard datasets and metrics. The conclusion is drawn in Section V.
VOLUME 8, 2020

II. PROBLEM DESCRIPTION AND EVALUATION METRIC
Consider an undirected network $G = (V, E)$, where $V$ represents the node set and $E$ represents the link set. Let $U$ denote the universal set, which contains all $|V|(|V| - 1)/2$ possible node pairs in network $G$, where $|V|$ represents the number of nodes in $V$. The purpose of link prediction is to find the missing or future links in $U - E$, where $U - E$ denotes the set of non-observed or nonexistent links.
To evaluate the capability of a link prediction method, the link set $E$ is randomly partitioned into two parts: a training set $E_{\mathrm{train}}$ and a test set $E_{\mathrm{test}}$. We use AUC, a standard metric, to evaluate the prediction performance of our proposed method. AUC is the probability that the score of a link randomly chosen from $E_{\mathrm{test}}$ is higher than that of a link randomly chosen from $U - E$. That is, each time we randomly select a link from the test set $E_{\mathrm{test}}$ and another link from the nonexistent link set $U - E$. If the score of the link from $E_{\mathrm{test}}$ is higher than that of the link from $U - E$, we add 1. If the two scores are equal, we add 0.5. After $n$ independent comparisons, if there are $n'$ times that the link from $E_{\mathrm{test}}$ has the higher score and $n''$ times that the two scores are equal [2], the AUC is given by:

$$\mathrm{AUC} = \frac{n' + 0.5\,n''}{n} \tag{1}$$
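The sampling procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used in the experiments; the function name `sampled_auc`, the `score` callback, and the fixed random seed are assumptions introduced here.

```python
import random


def sampled_auc(score, test_edges, non_edges, n=10000, seed=0):
    """Estimate AUC from n independent comparisons.

    score      -- hypothetical callback mapping a node pair (a, b) to a score
    test_edges -- list of held-out links
    non_edges  -- list of nonexistent links (node pairs in U - E)
    """
    rng = random.Random(seed)
    higher = 0  # comparisons in which the test link scores strictly higher
    equal = 0   # comparisons in which the two scores are equal
    for _ in range(n):
        pos = rng.choice(test_edges)
        neg = rng.choice(non_edges)
        s_pos, s_neg = score(*pos), score(*neg)
        if s_pos > s_neg:
            higher += 1
        elif s_pos == s_neg:
            equal += 1
    # Each win counts 1, each tie counts 0.5, averaged over n comparisons.
    return (higher + 0.5 * equal) / n
```

A scorer that ranks every test link above every nonexistent link yields 1.0, while a scorer that cannot distinguish the two sets yields a value close to 0.5.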

III. PROPOSED METHODOLOGY
A. FRAMEWORK
The framework of NEADF-LP is shown in Fig. 1. Our method consists of three essential components, which we introduce in detail in the following parts:
• Structure Encoder: the structure encoder represents each node as an embedded vector carrying topological structure, obtained by DeepWalk;
• Attribute Encoder: the attribute encoder encodes the discrete and continuous attributes of a node into a uniform, real-valued attribute feature vector of the node;
• Deep Fusion: the deep fusion component uses the tower-structure model to deeply fuse the topological structure vector and the attribute information vector of the network to perform link prediction.

B. STRUCTURE ENCODER
Considering the operational efficiency on large-scale datasets and the parallelism and scalability of the algorithms, we use the DeepWalk algorithm as the structure encoder to represent each node as an embedded vector carrying topological structure. DeepWalk learns the latent vector representation of a node by truncated random walks [22]. In comparison with the word embedding model Word2Vec in the NLP domain, the basic processing element of word embedding is a word, whereas that of network embedding is a node; word embedding analyzes the sequence of words constituting a sentence, whereas network embedding analyzes the path of nodes visited by a truncated random walk.
The algorithm mainly consists of three parts, as shown in Fig. 2:
• Node sequence sampling. The random walker samples node sequences in the network: it selects a node $v_i$ from $G = (V, E)$ as the root node, walks randomly from the root node, and at each step randomly selects a neighbor of the last visited node. When the maximum path length $t$ is reached, the path traversed by the random walker is taken as a sample sequence $W_{v_i}$, with $|W_{v_i}| = t$. The sampling is repeated $\lambda$ times at node $v_i$.
• Node mapping to a vector representation. The sampled node sequences are treated as sentences and input to the Skip-Gram model for training. In the network representation learning task, a window of size $w$ is set on the sampled node sequence $W_{v_i}$. Each node $v_j$ is mapped to its current vector representation $\Phi(v_j)$, and, given this representation, the probability of the nodes appearing within the window $w$ around $v_j$ in the node sequence $W_{v_i}$ is maximized, following DeepWalk [22]:

$$\min_{\Phi} \; -\log \Pr\left(\{v_{j-w}, \dots, v_{j+w}\} \setminus v_j \mid \Phi(v_j)\right) \tag{2}$$

• Reducing computational complexity by hierarchical Softmax. Given a node $v_k \in V$, calculating the conditional probability $\Pr(v_k \mid \Phi(v_j))$ directly requires a large amount of computation, so this conditional probability is calculated by the hierarchical Softmax. The nodes are placed at the leaves of a binary tree. For each leaf node, there is always a unique path from the root node to the leaf node. Based on this unique path, the probability of occurrence of a leaf node is estimated. Record the path to node $v_k$ as $(b_0, b_1, \dots, b_{\lceil \log |V| \rceil})$, where $b_0$ is the root and $b_{\lceil \log |V| \rceil} = v_k$. The conditional probability can be represented as:

$$\Pr(v_k \mid \Phi(v_j)) = \prod_{l=1}^{\lceil \log |V| \rceil} \Pr(b_l \mid \Phi(v_j)) \tag{3}$$

A binary classifier can be built on the parent node of node $b_l$ to estimate $\Pr(b_l \mid \Phi(v_j))$:

$$\Pr(b_l \mid \Phi(v_j)) = \frac{1}{1 + e^{-\Phi(v_j) \cdot \Psi(b_l)}} \tag{4}$$

where $\Psi(b_l)$ denotes the representation assigned to the parent of node $b_l$. As a result, the time complexity of computing $\Pr(v_k \mid \Phi(v_j))$ is reduced from $O(|V|)$ to $O(\log |V|)$. After iterative training, the weight matrix of the hidden layer of the extended Skip-Gram model is used as the latent vector representation of the nodes.
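The sampling and windowing steps can be sketched as follows. This is a minimal sketch of truncated random walks and of the (center, context) pairs that a Skip-Gram model would be trained on; the adjacency-list input format and the helper names `truncated_random_walks` and `context_pairs` are assumptions made for illustration, and the Skip-Gram training with hierarchical Softmax itself is not shown.

```python
import random


def truncated_random_walks(adj, walk_length, walks_per_node, seed=0):
    """Sample walks_per_node truncated random walks rooted at every node.

    adj maps each node to a list of its neighbors (assumed input format).
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        roots = list(adj)
        rng.shuffle(roots)  # visit the roots in a random order on each pass
        for root in roots:
            walk = [root]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:  # isolated node: the walk cannot continue
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


def context_pairs(walk, window):
    """Return the (center, context) pairs inside a window around each node."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs
```

In practice the resulting walks would be passed as sentences to an existing Skip-Gram implementation instead of enumerating the pairs by hand.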

C. ATTRIBUTE ENCODER
Many real social network data contain a wealth of attribute information, which may be varied. In order to facilitate the combination of attribute information and topological structure, we propose an attribute encoder, which encodes the discrete attributes and continuous attributes of a node to obtain a uniform and real-valued attribute feature vector.

1) DISCRETE ATTRIBUTE
Discrete attributes, more commonly called categorical attributes, such as gender and age, can be one-hot encoded into a vector representation. This method performs one-hot encoding of all discrete attributes of the node. Finally, the codes of all attribute types are concatenated to obtain the final discrete attribute feature vector. For example, the gender attribute has

2) CONTINUOUS ATTRIBUTE
Continuous attributes are ubiquitous in social networks, such as a blog with text, images, and video. These types of attributes cannot be directly compared with each other. However, with manual processing, information such as text and pictures can also be converted into continuous vector representations using existing methods. TF-IDF is a weighting method that represents text data as comparable values. Term Frequency (TF) refers to the number of times a given word appears in a document. Inverse Document Frequency (IDF) is a measure of the general importance of a word. TF-IDF is the product of the two, which filters out common words and retains the important words in the document. In the field of text information processing, Bag-of-words is usually used to obtain a vector representation of text, and TF-IDF is used to reduce noise and obtain a real-valued text vector representation. We process text attributes by TF-IDF in this work. The main steps to calculate text features using TF-IDF are summarized as follows:
• Calculate TF. To facilitate the comparison of texts of different lengths, normalize TF;
• Calculate IDF;
• Calculate TF-IDF as the product of TF and IDF;
• Concatenate the TF-IDF values of the words in the document to form the feature vector of the text attribute.
After the continuous attributes are represented as real-valued vectors, all attribute feature vectors of a node are concatenated to obtain the final attribute feature vector of the node in the network. Next, the attribute information of each node is processed into feature vectors with the same dimensions. For a node that lacks a certain type of attribute, the corresponding positions of the attribute feature vector are filled with 0. Fig. 3 illustrates the attribute vector.
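The encoding steps of this subsection can be sketched as below. This is an illustrative sketch under stated assumptions: TF is normalized by document length, IDF is taken as log(N / df) (one common variant; the text does not fix the exact formula), and the helper names `one_hot` and `tfidf_vectors` are introduced here.

```python
import math
from collections import Counter


def one_hot(value, categories):
    """One-hot encode a categorical value; a missing value gives all zeros."""
    return [1.0 if value == c else 0.0 for c in categories]


def tfidf_vectors(docs):
    """Compute TF-IDF vectors for tokenized documents.

    docs is a list of token lists. Returns (vocab, vectors), where every
    vector has one entry per vocabulary word.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / doc_freq[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Length-normalized TF times IDF; an empty document stays all zeros.
        vectors.append([
            (counts[tok] / total) * idf[tok] if total else 0.0
            for tok in vocab
        ])
    return vocab, vectors
```

The final attribute vector of a node is then the concatenation of its one-hot codes and its TF-IDF vector, for example `one_hot(field, fields) + vectors[i]` for a hypothetical categorical attribute `field`.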

D. DEEP FUSION
As shown in Fig. 4, we build a tower-structure model to achieve nonlinear fusion between the structural features and the attribute features. The model is oriented toward the link prediction objective, and a neural network is established to fuse the features. We first define $f$ as a function of the similarity between $v_i$ and $v_j$, and the probability that $v_i$ and $v_j$ are connected is computed by:

$$p(v_j \mid v_i) = \frac{\exp\left(f(v_i, v_j)\right)}{\sum_{j'=1}^{|V|} \exp\left(f(v_i, v_{j'})\right)} \tag{5}$$

The degree of structural similarity between nodes needs to be extracted from the neighbors. Based on the assumption of conditional independence, the probability between node $v_i$ and the set of its neighbor nodes $N_i$ is computed by:

$$p(N_i \mid v_i) = \prod_{j \in N_i} p(v_j \mid v_i) \tag{6}$$

Then the likelihood function $L$ of the model can be defined as:

$$L = \prod_{i=1}^{|V|} p(N_i \mid v_i) = \prod_{i=1}^{|V|} \prod_{j \in N_i} p(v_j \mid v_i) \tag{7}$$

Next, the structure of the model is introduced in detail:

Input layer
The structure vector representation of the node, denoted as $u$, contains the structural information of the node; the attribute vector representation of the node, denoted as $u'$, contains the attribute information of the node.

Hidden layers
The structural feature vector $u$ and the attribute feature vector $u'$ are weighted and input into a multi-layer perceptron, and the representations of the hidden layers are denoted as $h^{(0)}, h^{(1)}, \dots, h^{(n)}$. The representation of the hidden layers is defined as follows:

$$h^{(0)} = \begin{bmatrix} u \\ \gamma u' \end{bmatrix}, \qquad h^{(k)} = \delta_k\left(W^{(k)} h^{(k-1)} + b^{(k)}\right), \quad k = 1, 2, \dots, n \tag{8}$$

where $\gamma$ represents the weight of the attribute feature, $W^{(k)}$ and $b^{(k)}$ are the weight matrix and bias of the $k$-th hidden layer, $\delta_k$ represents the activation function, and $n$ represents the number of hidden layers. Stacking multiple nonlinear hidden layers has been shown to learn better data representations. We build the tower-structure model shown in Fig. 4 in the hidden layers: the number of neurons in the current layer is halved from the previous layer. This kind of tower structure can learn more abstract features and has achieved good results in recommendation tasks. $u$ and $u'$ need to be adjusted by the weight coefficient $\gamma$ when being input to the hidden layers.
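A forward pass through such a tower can be sketched in plain Python. Several details are assumptions not fixed by the text: the input is formed by concatenating the structure vector with the attribute vector scaled by γ, the first hidden layer is already half as wide as the input, tanh is used as the activation, and the weights are drawn uniformly at random. The function names are hypothetical.

```python
import math
import random


def tower_widths(input_dim, n_layers):
    """Layer widths in which each hidden layer halves the previous width."""
    widths = [input_dim]
    for _ in range(n_layers):
        widths.append(max(1, widths[-1] // 2))
    return widths


def init_layers(widths, seed=0):
    """Create one (weight matrix, bias vector) pair per hidden layer."""
    rng = random.Random(seed)
    layers = []
    for fan_in, fan_out in zip(widths, widths[1:]):
        W = [[rng.uniform(-0.1, 0.1) for _ in range(fan_in)]
             for _ in range(fan_out)]
        b = [0.0] * fan_out
        layers.append((W, b))
    return layers


def fuse(u, u_attr, gamma, layers):
    """Fuse a structure vector and an attribute vector through the tower."""
    # Assumed input layer: structure vector followed by gamma-scaled attributes.
    h = list(u) + [gamma * x for x in u_attr]
    for W, b in layers:
        # tanh is an assumed activation; the text leaves the choice open.
        h = [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
             for row, bias in zip(W, b)]
    return h
```

With two 128-dimensional inputs and three hidden layers, `tower_widths(256, 3)` gives the widths 256, 128, 64, 32. Setting `gamma` to 0 removes the attribute vector from the fused representation entirely.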

Output layers
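The output of the last hidden layer is converted into a probability vector $o$, which contains the probabilities that the input node $v_i$ is connected to the other nodes, as follows:

$$o = \left[\, p(v_1 \mid v_i),\; p(v_2 \mid v_i),\; \dots,\; p(v_{|V|} \mid v_i) \,\right] \tag{9}$$

The corresponding row in the weight matrix of the hidden layer is taken as an abstract representation of node $v_j$, denoted as $\tilde{u}_j$. Let $h_i^{(n)}$ denote the output of the last hidden layer for the input node $v_i$. The similarity function of nodes $v_i$ and $v_j$ is defined as:

$$f(v_i, v_j) = \tilde{u}_j \cdot h_i^{(n)} \tag{10}$$

Substituting (10) into (5), the connection probability in the vector $o$ is calculated by:

$$p(v_j \mid v_i) = \frac{\exp\left(\tilde{u}_j \cdot h_i^{(n)}\right)}{\sum_{j'=1}^{|V|} \exp\left(\tilde{u}_{j'} \cdot h_i^{(n)}\right)} \tag{11}$$

Then the optimization objective function of the model is recorded as:

$$\varphi^{*} = \arg\max_{\varphi} \sum_{i=1}^{|V|} \sum_{j \in N_i} \log p(v_j \mid v_i) \tag{12}$$

Substituting (11) into (12) gives:

$$\varphi^{*} = \arg\max_{\varphi} \sum_{i=1}^{|V|} \sum_{j \in N_i} \log \frac{\exp\left(\tilde{u}_j \cdot h_i^{(n)}\right)}{\sum_{j'=1}^{|V|} \exp\left(\tilde{u}_{j'} \cdot h_i^{(n)}\right)} \tag{13}$$

The purpose of this optimization objective function is to make the similarity between neighbor nodes larger and the similarity between non-neighbor nodes smaller.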
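Consider the gradient calculation in (12), which mainly consists of two parts:

$$\nabla \log p(v_j \mid v_i) = \nabla f(v_i, v_j) - \sum_{j'=1}^{|V|} p(v_{j'} \mid v_i)\, \nabla f(v_i, v_{j'}) \tag{14}$$

Since the second half of (14) sums over all nodes and is computationally intensive, we approximate it by negative sampling, that is, by sampling only a part of the nodes.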
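The cost difference between the full normalization and a sampled approximation can be illustrated as follows. The sketch contrasts the exact softmax log-probability, which touches every node, with a word2vec-style negative-sampling loss, which touches only the target and a few sampled nodes. The text does not specify which sampled estimator is used, so the log-sigmoid form and the uniform choice of negatives are assumptions, as are all function names.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def full_softmax_log_prob(h, out_vecs, j):
    """Exact log-probability of node j; costs one dot product per node."""
    scores = [dot(v, h) for v in out_vecs]
    m = max(scores)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[j] - log_z


def negative_sampling_loss(h, out_vecs, j, num_neg, rng):
    """Sampled surrogate loss using the target and num_neg random nodes.

    Negatives are drawn uniformly from all nodes except the target
    (assumed sampling scheme).
    """
    candidates = [k for k in range(len(out_vecs)) if k != j]
    negatives = [rng.choice(candidates) for _ in range(num_neg)]
    loss = -log_sigmoid(dot(out_vecs[j], h))      # pull the neighbor closer
    for k in negatives:
        loss -= log_sigmoid(-dot(out_vecs[k], h))  # push sampled nodes away
    return loss
```

Here `h` plays the role of the fused representation of the input node and `out_vecs[j]` the role of the output-side representation of node $v_j$.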
We summarize our algorithm in Algorithm 1.

Algorithm 1 NEADF-LP
Input: network G = (V, E), attribute set A, embedding dimension d, learning rate η, number of negative samples L, attribute weight γ.
Output: the link probability with other nodes for each node.
1: Initialize all the model parameters ϕ.
2: Generate the structural embedding vector u for each node v ∈ V using DeepWalk.
3: Generate the attribute vector u′ from the attribute information in A for each node.
4: while not converged do
5:   for each (u, u′) do
6:     Calculate the probability vector o using (8)
7:     Sample L negative samples and calculate the objective function using (12)

IV. EXPERIMENT AND RESULT
A. DATASET AND EXPERIMENTAL ENVIRONMENT
This work uses two network datasets with attribute information, CiteSeer¹ and Cora², to study the performance of our approach. CiteSeer is a multi-disciplinary dataset consisting of papers from 10 research fields. Cora is a dataset based on citations between scientific papers. Detailed information about these datasets is given in Table 1, in which V and E are the numbers of nodes and links, respectively; A represents the dimension of the preprocessed attribute vector; AD is the average degree; and ACC is the average clustering coefficient.
The experiments are conducted on the two datasets using a single Windows server with an Intel(R) Core(TM) i5-4200U CPU @2.3GHz, 512G RAM, and an NVIDIA GeForce 840M. The code of our proposed model is implemented with TensorFlow 1.12 in Python 3.6.

B. PROCEDURE
The steps of our experiment are as follows:
Dataset dividing: remove 5%, 15%, 25%, 35%, and 45% of the existing edges in the network as the positive samples of the test set, and select the same number of non-existing edges as the negative samples of the test set. Train the model with the remaining existing edges as the training set.
Representation of structural feature vectors: obtain the representation of a node with topological features using the algorithm described in Section III-B.
¹ http://citeseerx.ist.psu.edu/
² https://linqs.soe.ucsc.edu/data
Representation of attribute feature vectors: the discrete attribute and the continuous attribute of the node are respectively encoded using the algorithm described in Section III-C, and they are stitched together as the attribute feature vector of the node.
Training of the deep fusion model: the structural feature vectors and the attribute feature vectors are input into the deep neural network for training, and the hidden layers of the tower structure are used to fuse the structural features and the attribute features nonlinearly. The connection probabilities between one node and the other nodes are given by the output layer.
Link prediction: calculate the AUC value using the probabilistic method according to the connection result in the previous step.
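The dataset dividing step can be sketched as follows. This is a minimal sketch, not the preprocessing code used in the experiments: the function name `split_edges` and the fixed seed are assumptions, and the enumeration of all non-existing node pairs is quadratic in the number of nodes, which is acceptable only for small graphs.

```python
import random
from itertools import combinations


def split_edges(nodes, edges, test_fraction, seed=0):
    """Hold out a fraction of edges and sample as many non-edges.

    Returns (train_edges, test_positives, test_negatives); every pair is
    stored as a sorted tuple because the network is undirected.
    """
    rng = random.Random(seed)
    edge_set = {tuple(sorted(e)) for e in edges}
    ordered = sorted(edge_set)
    rng.shuffle(ordered)
    n_test = int(test_fraction * len(ordered))
    test_pos = ordered[:n_test]   # removed edges: positive test samples
    train = ordered[n_test:]      # remaining edges: training set
    # Enumerate every node pair that is not an edge (fine for small graphs).
    non_edges = [p for p in combinations(sorted(nodes), 2)
                 if p not in edge_set]
    test_neg = rng.sample(non_edges, n_test)  # same number of negatives
    return train, test_pos, test_neg
```

Calling the function with `test_fraction` set to 0.05, 0.15, 0.25, 0.35, and 0.45 would reproduce the five split sizes listed above.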

C. HYPER-PARAMETERS
According to preliminary experiments, the context window size k, the walk length l, and the number of walks per node t are set to 10, 80, and 16, respectively, and the vector dimension d is 128. After data preprocessing, the initial attribute dimensions of CiteSeer and Cora are 3707 and 1433, respectively. To facilitate the input of attribute feature vectors into the deep learning model, the attribute vectors are also processed into 128-dimensional feature vectors. Besides, the weight γ of the attribute feature vector is set to 0.2 on CiteSeer and 0.4 on Cora.

D. BASELINE METHODS
Five link prediction algorithms from the recent literature, namely Common Neighbor (CN) [4], Local Path criterion (LP) [8], DeepWalk [22], Node2vec [25], and GraphGAN [32], are used to demonstrate the effectiveness and robustness of our proposed method. All the compared baseline methods and the proposed NEADF-LP method are evaluated with the same criteria.
Common Neighbor (CN) [4] formulates the similarity between any two nodes by counting the number of their common friends.
Local Path criterion (LP) [8] considers the effects of second-order neighbors and third-order neighbors of nodes to develop the concept of resource allocation, where parameter α in LP is set as 0.01.
DeepWalk [22] is an approach to represent a node of the network as a d-dimension vector with the help of the truncated random walks over the graph. After that, it processes the sequences by the Skip-Gram model with hierarchical softmax. In the experiment, we set the embedding dimension as 128 and set window size w as 10, random walk length as 80 and walks per node as 10.
Node2vec [25] is a generalized version of DeepWalk, which uses biased random walks based on BFS and DFS sampling. By introducing two hyper-parameters p and q, it exploits structural equivalence. The parameter setting is the same as that of DeepWalk.
GraphGAN [32] combines both generative thinking and discriminative thinking into network embedding. As a result, we can get the low-dimensional representation vectors of nodes. We set all parameters in the experiment to be optimal.
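For reference, the two index-based baselines can be written directly in terms of the adjacency matrix A: the number of common neighbors of every node pair is given by A², and the Local Path index is commonly defined as A² + αA³. The sketch below uses plain Python lists so that it stays self-contained; the function names are introduced here for illustration and do not come from the compared implementations.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def common_neighbors(A):
    """CN scores: entry (x, y) counts the common neighbors of x and y."""
    return matmul(A, A)


def local_path(A, alpha=0.01):
    """LP scores: paths of length two plus alpha times paths of length three."""
    A2 = matmul(A, A)
    A3 = matmul(A2, A)
    return [[a2 + alpha * a3 for a2, a3 in zip(r2, r3)]
            for r2, r3 in zip(A2, A3)]
```

On a path graph 0-1-2-3, CN gives the pair (0, 3) a score of 0 because the two nodes share no neighbor, whereas LP still assigns a small positive score through the single three-hop path.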

E. RESULTS AND DISCUSSION
As shown in Fig. 5, our proposed method is compared with five baseline methods on the CiteSeer and Cora real-world networks. The x-axis represents the five proportions of the remaining training set, and the AUC value is shown on the y-axis. From Fig. 5, Table 2, and Table 3, we can see that NEADF-LP significantly outperforms all baselines, which indicates that a model combining topological structure and attribute information has much better predictive power on unobserved links. In addition, even when the proportion of the training set is 0.55, the AUC value of NEADF-LP is about 0.8. This is mainly because the attribute information of nodes helps link prediction when topological structure information is missing.
CN performs poorly because it only focuses on first-order proximities. CN performs better on the Cora dataset, owing to the relatively higher average degree and clustering coefficient, which indicate that the relationships between nodes are relatively close. As the proportion of the training set increases to a certain extent (0.75 on CiteSeer, 0.65 on Cora), the performance of the LP method increases dramatically. The performance of DeepWalk and Node2vec is not as good as we expected, especially on the CiteSeer dataset, because network representations learned through a multitude of random walks and the shallow Skip-Gram model are unable to capture highly nonlinear features in the underlying network. Besides, Node2vec performs better than DeepWalk. That is because DeepWalk samples nodes randomly, which may result in different learned feature representations. In contrast, Node2vec designs a flexible objective that is not tied to a particular sampling strategy and provides parameters to tune the explored search space, which solves the problem of inflexible node sampling. Although Node2vec can capture more general information from the graph, it cannot capture the structural similarities of links. Among the baselines, GraphGAN performs best in most instances (all proportions of the training set on Cora and proportions above 0.75 on CiteSeer). This might be because GraphGAN explicitly optimizes the connectivity distribution on the edges through its generator and discriminator, although it does not explicitly encode the accessibility similarity.
The link prediction methods based on topological structure have limited space for improvement. We fuse the attribute information of the node deeply on the basis of making full use of the topological structure. Experimental results show that it could improve the link prediction accuracy significantly.

F. PARAMETER SENSITIVITY
In this section, we study how the weight of the attribute feature vector affects the performance of NEADF-LP. The effect of different values of the parameter γ is shown in Fig. 6 (with 80% of the edges kept as the training set). As shown in Fig. 6, the best weight of the attribute feature vector is 0.2 on CiteSeer and 0.4 on Cora. When the parameter is set to 0, that is, without attribute information, the AUC value is lowest on both CiteSeer and Cora, which indicates that the attribute information helps link prediction considerably. Besides, because attributes differ greatly between networks, the weight of the attribute information should be determined for each dataset by more specific experiments.

V. CONCLUSION AND FUTURE WORK
Link prediction has attracted increasing attention in various areas. Existing link prediction algorithms mainly consider the structure information but ignore the attributes of the nodes. To improve the accuracy of link prediction, we proposed a novel method that combines the topological structure and the attribute information by deeply fusing them. First, we use the DeepWalk algorithm as the structure encoder to represent each node as an embedded vector carrying topological structure. Second, we encode the discrete and continuous attributes of the node to obtain a uniform, real-valued attribute feature vector. Finally, we propose the tower-structure model to deeply fuse the vector carrying topological structure and the vector carrying attribute information to perform link prediction. The experimental results demonstrate that the proposed method has better predictive power than the common benchmark algorithms and that attribute information is helpful for link prediction.
There are still some challenges ahead. Our algorithm considers an unweighted network. However, the types of nodes in the real world are more variable, and the relationships between nodes are more changeable. It makes sense to extend this work to heterogeneous networks. Besides, the random walk method is used to sample node sequences when learning the topology of the network. Sampling the node sequences in a more specific way and in parallel might greatly improve the effectiveness and efficiency of the algorithm.