Deep Attributed Network Embedding via Weisfeiler-Lehman and Autoencoder

Network embedding plays a critical role in many applications, such as node classification, link prediction, and network visualization. Attributed network embedding aims to learn low-dimensional representations of network nodes by integrating network architecture and attribute information. The network architectures of many real-world applications are complex, and the relations between network architectures and their attributed nodes are opaque. Thus, shallow models fail to capture deep nonlinear information when an attributed network is embedded, leading to unreliable embeddings. In the present paper, Deep Attributed Network Embedding via Weisfeiler-Lehman and Autoencoder (DANE-WLA) is proposed to capture high nonlinearity and preserve the multiple proximities carried by node attributes and network structure. A Weisfeiler-Lehman proximity schema is used to capture the dependency between node edges and node attributes based on information sequences. Then, a deep autoencoder is applied to exploit complex nonlinear information. Extensive experiments were conducted on benchmark datasets to verify that DANE-WLA is computationally efficient for various tasks requiring network embedding. The experimental results show that our model outperforms state-of-the-art network embedding models.


I. INTRODUCTION
Complex networks describe objects (nodes) and their dependencies (edges) [1], which makes them useful for modeling many real-world computing systems [2][3][4], for example, social networks [5,6], citation networks [7,8], and protein-protein interaction networks [9]. Such network complexity is a major challenge for numerous machine-learning tasks, including node classification [8,10], link prediction [11,12], and community detection [13,14]. To handle these problems, several network embedding methods [15][16][17][18] have been proposed to encode the network architecture and represent each node as a vector in a low-dimensional space.
Traditional network embedding methods, such as Laplacian Eigenmaps (LE) [19] and IsoMAP [20], represent the network as a graph structure. These methods attempt to capture local proximity among nodes with common links using graph theory. Because of the sparsity and continual expansion of real-world networks, many potential links between nodes can be lost or ignored. Consequently, building an embedding model on local proximity between nodes alone is unsatisfactory because it neglects the overall network structure.
In recent years, many researchers have shown the importance of considering the overall structure of the network as well as first-order local proximity [8,15,21,22] when developing a network embedding model. For example, DeepWalk [15] and node2vec [21] seek to preserve high-order proximity between nodes through random-walk node sequences over the graph, while SDNE [16] and LINE [8] consider first-order and second-order proximity between nodes. This type of embedding is called plain network embedding, whereby only the links of the nodes are used, without any ancillary node-attribute data. Hence, such methods cannot capture the attribute data of each node or take advantage of both network structure and node attributes.
Attributed networks are widely used to represent networks in which links of nodes and attributes are important in the analysis process [23][24][25]. For example, in social networks, users associate with their friends and share their profiles as attributes. The embedding process becomes more challenging in attributed networks because it considers network structure and attributed information in the embedding process.
To overcome this challenge, several researchers have proposed network embedding models that learn jointly from node links and attribute information on top of previous embedding models [2,[26][27][28]. For example, Yang et al. [2] used textual features to supervise random walks on networks and proposed the Text-Associated DeepWalk (TADW) model. Huang et al. [28] used node links and attributes interchangeably as labels to supervise the learning of each other.
All the embedding methods mentioned above are based on shallow network embedding models. They involve the attribute data in the modeling process to reduce the data-sparsity problem. However, attributed network patterns are inherently nonlinear [29], which makes it difficult to capture complex latent information using shallow embedding models. Therefore, many studies have proposed deep learning models to capture this latent information [23,[29][30][31].
On the other hand, the Weisfeiler-Lehman graph kernels [32] are a well-established methodology for combining scalability with the capacity to handle node attributes. For example, H. Yang et al. [33,34] defined a Weisfeiler-Lehman matrix to represent the interrelationship between a node's attributes and its structure, and Xuewei Ma et al. [35] relied on the ability of the Weisfeiler-Lehman graph kernels algorithm to integrate node attributes with the neighborhood to build their own model. To address these challenges, we propose a deep learning-based model to embed attributed networks, motivated by the great achievements of deep learning methods in the field of representation learning [40].
We summarize the contributions of this work as follows:
 A deep model (DANE-WLA) was proposed for network embedding. The model consists of two stages. In the first stage, a new Weisfeiler-Lehman schema is used to capture the integration of node links and attributes. In the second stage, a deep autoencoder is applied to the proximity matrix to exploit nonlinear complex patterns.
 The applicability of the proposed model to real-world networking tasks was evaluated by conducting experiments on four datasets used in common tasks, including node classification, link prediction, and visualization. The results showed that DANE-WLA significantly and consistently outperformed many other recent network embedding methods, even under variations in the experimental variables.
This paper is organized as follows: the related work is reviewed in Section 2, and the technical details of DANE-WLA are presented in Section 3. Section 4 presents the experiments and results. Finally, Section 5 presents our conclusion.

II. RELATED WORK
Below is a summary of the related work, which is categorized into three sections: 1) models of plain network embedding, 2) models of Attributed Network Embedding (ANE), and 3) models of deep neural networks (DNN). TABLE 1 shows a summary of these methods.

A. PLAIN NETWORK EMBEDDING
The earlier methods of network embedding, such as Laplacian Eigenmaps [19] and locality preserving projection [36], are related to graph embedding methods. These methods aim to preserve the local data structure and represent it in a lower dimension. Such methods may not be viable for large-scale network embedding, as all of them have a time complexity of O(n^3), where n is the number of nodes. Recently, interest has focused on learning graph embeddings with methods inspired by the skip-gram model [41]. DeepWalk [15] was the first to apply this idea to network embedding; it learns representations from short random walks by modeling the co-occurrence frequencies of node pairs. Node2Vec [21] is a generalized version of DeepWalk in which the depth and breadth of the random walks are tuned. The authors in [8] proposed an embedding model that benefits from first- and second-order proximity while node representations are learned. Another method, HOPE [37], was proposed to capture asymmetric transitivity and higher-order proximity in directed graphs. However, these methods utilize only the network structure and ignore the helpful attributes of nodes.
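As a concrete illustration of the random-walk idea behind DeepWalk and Node2Vec, the following minimal Python sketch generates truncated random walks from an adjacency list. The function and parameter names are ours, not from any of the cited papers; in a full pipeline the resulting walks would be fed to a skip-gram model to learn node embeddings.

```python
import random

def random_walks(adj, num_walks=10, walk_length=40, seed=0):
    """Generate DeepWalk-style truncated random walks.

    adj: dict mapping node -> list of neighbor nodes.
    Returns a list of walks (lists of nodes); each walk would normally
    be treated as a "sentence" for skip-gram training.
    """
    rng = random.Random(seed)
    walks = []
    nodes = list(adj)
    for _ in range(num_walks):
        rng.shuffle(nodes)              # start walks in random node order
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = adj[walk[-1]]
                if not nbrs:            # dead end: stop the walk early
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks
```

Biased second-order walks, as in Node2Vec, would replace the uniform `rng.choice` with probabilities depending on the previous step.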

B. ATTRIBUTED NETWORK EMBEDDING (ANE)
Recently, ANE has attracted attention. The TADW model [2] is a typical ANE model, which amalgamates DeepWalk and related text features into a matrix factorization framework. In another study, Hong Yang et al. proposed the binarized ANE (BANE) [24] model. They defined the Weisfeiler-Lehman proximity matrix to capture data dependency between node links and attributes, and then used cyclic coordinate descent (CCD) to learn the binary node representations [42]. The above works depend on shallow models.

C. DEEP NEURAL NETWORKS
DNNs are developed to learn complex, nonlinear object characteristics automatically, and they have been successfully used in many fields, such as image classification [43] and natural language processing [44]. Of all deep learning models, only the autoencoder and its variants provide unsupervised methods for feature engineering, and they have been embraced in several network analysis tasks. Zhai and Zhang [45] proposed an approach based on an autoencoder to learn representations of nodes in sparse networks. DANE [23] was proposed with a separate double autoencoder that reconstructs the network structure and node attributes. Richang et al. proposed DANE [29], which consists of three steps that deal with the data discrepancy, the integration of graph topology and features, and the nonlinear patterns of ANE in an integrated framework. These frameworks may not be applicable to large-scale network embedding, as they have a time complexity of O(n^3), where n is the number of nodes. Cong Li et al. proposed the DANRL-AEN method [39], which adopts the autoencoder concept; to capture different orders of proximity, they extended the decoder into three branches. The complexity of DANRL-AEN is O(n^2), where n is the number of nodes.

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3181120

III. THE PROPOSED METHOD

A. PROBLEM DEFINITION
The attributed network G = (V, E, X) is a network with |V| nodes and |E| edges, where each node has m attributes. Here, X ∈ R^{|V|×m} is the node-attribute matrix. ANE aims to build trained models that represent the nodes in a low-dimensional space on the basis of the link matrix E and the attribute matrix X, so that the learned representation maintains the proximities existing in both the network structure and the node attributes.

Definition 1:
Given an attributed network G = (V, E, X), the task of ANE is to learn a function f: G → R^{|V|×d}, f(G) = H, that produces a low-dimensional vector representation of each node v ∈ V such that the structural proximity in E and the attribute proximity in X are preserved; the i-th row of H is called the embedding of node v_i.
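To make the notation in Definition 1 concrete, the following toy Python/numpy sketch (with made-up values) shows the objects involved: a link matrix standing in for E, an attribute matrix X, and an embedding matrix H with one d-dimensional row per node. The random H is only a placeholder for the output of a trained model.

```python
import numpy as np

# A toy attributed network with |V| = 4 nodes and m = 3 attributes.
A = np.array([[0, 1, 1, 0],          # adjacency (link) matrix for E
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0, 0.5],       # node-attribute matrix, X in R^{|V| x m}
              [0.9, 0.1, 0.4],
              [0.2, 0.8, 0.1],
              [0.1, 0.9, 0.0]])

# ANE seeks f(G) = H in R^{|V| x d}; row i is the embedding of node v_i.
d = 2
rng = np.random.default_rng(0)
H = rng.standard_normal((A.shape[0], d))  # placeholder for a learned embedding
assert H.shape == (4, 2)
```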

B. PROPOSED DANE-WLA FRAMEWORK
The architecture of the DANE-WLA consists of two main stages as shown in FIGURE 1.

Stage 1 (Information aggregation):
In this stage, the Weisfeiler-Lehman proximity schema is defined based on the Weisfeiler-Lehman graph kernels to generate a new proximity matrix P. This schema integrates the node attributes in X with the links of the network structure.

Stage 2 (Deep Embedding):
Because the hidden information in attributed networks is mostly complex and nonlinear, a deep autoencoder model is used to embed the proximity matrix P produced in the first stage. The above two stages are summarized in Algorithm 1, and their details are presented next.

1) INFORMATION AGGREGATION
Information aggregation from neighboring network nodes to a target node is built on the Weisfeiler-Lehman graph kernels [32]. A parameter k controls how many layers of neighboring nodes join the information aggregation process.
The Weisfeiler-Lehman scheme augments node labels with the labels of their neighboring nodes. Subsequently, the augmented labels are compressed into new labels.
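The label-refinement step described above can be sketched as follows. The function `wl_relabel` and its hashing-based label compression are illustrative choices of ours, not the paper's exact implementation; any injective compression of augmented labels would serve.

```python
import hashlib

def wl_relabel(adj, labels, k=2):
    """Weisfeiler-Lehman label refinement for k iterations.

    adj: dict node -> list of neighbors; labels: dict node -> initial label.
    In each round, a node's label is augmented with the sorted multiset of
    its neighbors' labels, then compressed into a new (hashed) label.
    """
    labels = dict(labels)
    for _ in range(k):
        new_labels = {}
        for v in adj:
            augmented = (labels[v], tuple(sorted(labels[u] for u in adj[v])))
            # Compress the augmented label into a short new label.
            new_labels[v] = hashlib.sha1(repr(augmented).encode()).hexdigest()[:8]
        labels = new_labels
    return labels
```

On a path 1-2-3 with identical initial labels, the endpoints receive the same refined label while the middle node receives a different one, reflecting their different neighborhoods.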

Here, d denotes the embedding dimension, and N(v) denotes the set of neighbors of node v.
The basic idea for graphs with continuous attributes X is to construct a propagation scheme that captures the features of the current node and updates them by combining them with the features of neighboring nodes in balanced proportions. Suppose that we have a continuous attribute a^(0)(v) = x(v) ∈ R^m for each node v ∈ V. Then, we recursively define

a^(k)(v) = (1/2) [ a^(k-1)(v) + Σ_{u∈N(v)} a^(k-1)(u) / √(d(v)·d(u)) ],

where d(v) = |N(v)| denotes the degree of node v, and d(u) is the degree of node u. The matrix Q = D + A is called the signless Laplacian matrix of G. For an undirected graph G, the normalized signless Laplacian matrix is defined as [46]

Q̂ = D^(-1/2) (D + A) D^(-1/2).

The elements of Q̂ are given by: Q̂_{vu} = 1 if u = v; Q̂_{vu} = 1/√(d(v)·d(u)) if (v, u) ∈ E; and Q̂_{vu} = 0 otherwise.

Using the recursive procedure described above, we propose a WL-based feature extraction scheme that generates node features from the node attributes of the graphs.

Definition 4 (WL features): Let G = (V, E, X) be an attributed network, and let K be the number of iterations. Then, for every k ∈ {0, ..., K}, we define the WL features as

X^(k) = (1/2) Q̂ X^(k-1), with X^(0) = X.

So, we can define the proximity matrix as P = X^(K) = ((1/2) Q̂)^K X.

2) DEEP NETWORK EMBEDDING

In this stage, low-dimensional deep embedding vectors are generated from the proximity matrix P using deep learning models. We apply the autoencoder to capture the nonlinear complex patterns resulting from the previous stage because it is a dominant unsupervised deep model for feature learning [23]. The model first learns deep representations of the input through nonlinear transformations; the outputs are then reconstructed from these deep representations. In the DANE-WLA deep autoencoder, given the proximity matrix P, we aim to learn a mapping function f: P → H ∈ R^{|V|×d} as defined in Eq. (7). The autoencoder architecture contains an encoder and a decoder, which can be defined as follows.

Encoder: This transforms the input vector p_i to a hidden-layer representation h_i, where p_i ∈ R^{|V|} is the i-th row of the proximity matrix P. It is defined as:

h_i = σ(W^(1) p_i + b),    (7)

where {W^(1), b} are the encoder parameters, W^(1) is a weight matrix, b is a bias vector, and σ(·) denotes the nonlinear activation function.
Decoder: The autoencoder maps h_i back to a reconstruction p̃_i of the same shape as p_i. It is formulated as:

p̃_i = φ(W^(2) h_i + b'),    (8)

where {W^(2), b'} are the decoder parameters, W^(2) is a weight matrix, b' is a bias vector, and φ(·) denotes the nonlinear activation function.
All of the parameters can be learned by solving the following optimization problem, i.e., by minimizing the reconstruction error:

min Σ_{i=1}^{|V|} ℒ(p_i, p̃_i; W^(1), b, W^(2), b').

To capture the high nonlinearity in the proximity matrix, the deep autoencoder stacks L layers in the encoder:

h_i^(1) = σ(W^(1) p_i + b^(1)),
h_i^(l) = σ(W^(l) h_i^(l-1) + b^(l)),  l = 2, ..., L.

Reciprocally, there are L layers in the decoder, and h_i^(L) is the required low-dimensional representation of the i-th node.
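As a rough illustration, the two stages can be sketched in numpy as follows. The symmetric signless-Laplacian normalization, the function names, the layer sizes, and the omission of bias terms are simplifying assumptions of ours; the weights are random rather than trained, so this shows the data flow, not the optimized model.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def wl_proximity(A, X, K=4):
    """Stage 1: aggregate node attributes over K WL iterations.

    A: |V| x |V| adjacency matrix, X: |V| x m attribute matrix.
    Uses Q_hat = D^{-1/2}(D + A)D^{-1/2}; each iteration mixes a node's
    features with its neighbors' features in balanced proportions.
    """
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    Q_hat = d_inv_sqrt @ (np.diag(deg) + A) @ d_inv_sqrt
    P = X.copy()
    for _ in range(K):
        P = 0.5 * Q_hat @ P
    return P

def autoencode(P, dims=(16, 8)):
    """Stage 2: one forward pass of a deep autoencoder.

    The last encoder layer Z is the node embedding; the decoder mirrors
    the encoder to reconstruct P. Bias terms are omitted for brevity;
    training would minimize the reconstruction loss computed below.
    """
    H, enc = P, []
    for out_dim in dims:                       # stacked encoder layers
        W = rng.standard_normal((H.shape[1], out_dim)) * 0.1
        H = relu(H @ W)
        enc.append(W)
    Z = H                                      # low-dimensional embeddings
    for W in reversed(enc):                    # mirrored decoder
        H = relu(H @ W.T)
    loss = np.mean((P - H) ** 2)               # reconstruction error
    return Z, loss
```

A useful sanity check: on a regular graph with constant attributes, each aggregation step leaves the attributes unchanged, since a node's own features and its neighbors' average coincide.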

C. MODEL ANALYSIS AND DISCUSSION
In this subsection, we analyze the competitive characteristics and time complexity of the proposed method and compare them with those of other models.

1) TIME COMPLEXITY:
Our proposed DANE-WLA framework consists of two parts: information aggregation and deep network embedding. The first part uses the Weisfeiler-Lehman scheme, whose time complexity is O(k|E|) [32]; since k is a small constant, this part has O(|E|) time complexity. The second part uses an autoencoder whose time complexity is O(d|V|), where d represents the largest hidden-layer dimension. Accordingly, the overall DANE-WLA time complexity is O(|E| + d|V|) (see TABLE 2).

2) COMPARISON BASED ON THE DESIDERATA
A node embedding technique must have specific properties to produce useful node features and to scale [47,48].
 Generic features: Prevalent properties of nodes, such as social network user profiles, are encoded when learning a node embedding. Generic features are used for the universal contextualization of node locations in the embedding space.
 Non-linear: The node-node proximity score in the target model is a non-linear function of the node embedding vectors, which helps to encode superlinear and sublinear proximity scores.
 Implicit: The decomposed target matrix is computed implicitly, which saves time and space complexity when applying the embedding model in practical large-scale settings.
 Inductive: The node representation technique maps unseen nodes, which are unconnected to the training network, into the representation space. In DANE-WLA, we can represent any node by applying Algorithm 2.
 Higher-order: The representation encodes information about nodes that are far from a given node. For instance, in WL-based contextualization, two nodes are similar in generation k+1 if they are similar in generation k and all their neighbors are similar.
 Handling missing data: When missing data in node attributes are ignored, the results of the network analysis will be highly skewed.
We summarize the evaluation of our proposed embedding model against previous embedding methods in TABLE 2, where a check mark (✔) indicates that a method has the desired property.

IV. EXPERIMENTS
We applied the proposed model to real-world networks to verify its efficiency and effectiveness in ANE. Then, we compared its performance with that of many well-known models in this field.

A. BENCHMARK DATASETS
To evaluate the proposed model, we used four social networks in our experiments, which are freely available on SNAP [49]. The descriptions of these datasets are as follows.
 Facebook: The nodes of this network represent Facebook pages, and likes are expressed by the edges linking the nodes. Node attributes are extracted from the page descriptions, and the category is one of four classes (Politician, Government, Company, and TV Show).
 Twitch social networks: Twitch users represent the nodes of this network, while friendships between them are expressed by the edges linking the nodes. Node attributes are derived from streaming habits, and the binary category variable describes whether the user streams explicit content.
 Wikipedia graphs: In this network, Wikipedia pages represent the nodes, and the edges are the mutual links between them. The node attributes represent the occurrence of nouns in the article, and the binary category variable reflects the volume of traffic on the page.
 GitHub: The nodes of this network represent GitHub developers, and the links represent mutual follower relationships. Attributes are determined based on location, metadata, and biography. The category classifies each developer as either a machine learning developer or a web developer.
TABLE 3 summarizes the details of these dataset networks.

B. BASELINES
The proposed model DANE-WLA is compared with seven common models, whose details are as follows:
 HOPE [37]: A plain network embedding method that attempts to capture asymmetric transitivity and higher-order proximity in directed graphs using generalized singular value decomposition instead of the adjacency matrix.
 DeepWalk [15]: It learns embeddings by predicting the local neighborhoods of nodes, sampled from random walks on the network, and uses skip-gram to model the walk sequences.
 Node2vec [21]: A generalized version of DeepWalk with two basic strategies to create a path between two nodes in the graph, Depth-First Search (DFS) and Breadth-First Search (BFS). The authors proposed a random walk that blends these two strategies.
 TADW [2]: TADW uses textual features to supervise random walks on graphs and proposes the Text-Associated DeepWalk to incorporate node content features.
 BANE [24]: This model was proposed to embed the attributed network in a binary space. To capture the dependence between the network structure and the attribute data, a proximity matrix based on the Weisfeiler-Lehman kernel is defined. In addition, cyclic coordinate descent (CCD) is utilized to learn the binary node representations under the binary constraint.
 TENE [50]: This method formulates the attributed network embedding issue as a joint non-negative matrix factorization problem, in which the node embeddings acquired from the sub-tasks are used to regularize each other.

 MUSAE [47]: MUSAE learns attributed node representations by explicitly factorizing the product of an attribute matrix and a normalized adjacency matrix power.

C. PARAMETER SETTINGS
Considering the Weisfeiler-Lehman kernel (with depth parameter K ∈ [5]), the graphlet sampling (GS) kernel [51] (with graphlet size k ∈ {3, 4, 5}), and the relaxed Weisfeiler-Lehman kernel [52] (with depth parameter K ∈ [5]), we conducted an experiment on our Weisfeiler-Lehman kernel with a depth parameter K ∈ [6]. We found that the Weisfeiler-Lehman kernel gives the best results with depth parameter K = 4 for all datasets. In addition, after comparing the Sigmoid, Tanh, and ReLU functions in the proposed model, ReLU was adopted as the nonlinear activation function of the autoencoder. The sizes and dimensions of the autoencoder layers are listed in Table 4. For all baselines, we used the public source code provided by Karate Club [49] and tuned the parameters to ensure that each model achieved its optimal performance on all datasets and experiments.

D. EXPERIMENTAL RESULTS
This section presents the results of comparative evaluation experiments of the proposed model against all baselines. To compare the efficiency of the node embedding models, we used node classification, link prediction, and visualization tasks.

1) NODE CLASSIFICATION RESULTS
Classifying nodes is a significant task that aims to predict the labels of nodes using the known information of the network. It is a widespread task for determining the efficiency of network embedding. In the experiment, we first learned the node representations with the different models. These node representations were then used as features to classify each node into a set of classes, with some nodes randomly sampled for training and the others for testing. We randomly selected 90%, 70%, 50%, 30%, and 10% of the nodes as the testing set and used the remainder as the training set, respectively. Classification performance was evaluated by training an SVM as the classifier, with Macro-F1 and Micro-F1 as the evaluation metrics. If there are C classes in the test set, and TP_i, FP_i, and FN_i are the true positives, false positives, and false negatives of class i, then Micro-F1 and Macro-F1 are computed as follows:

Micro-F1 = 2 Σ_i TP_i / (2 Σ_i TP_i + Σ_i FP_i + Σ_i FN_i),

Macro-F1 = (1/C) Σ_i [2 TP_i / (2 TP_i + FP_i + FN_i)].

The node classification results are presented in Table 5. For each setting, the best and second-best models are highlighted in boldface and underlined, respectively. From these results, the following conclusions can be drawn:
 Plain vs. Plain: DeepWalk shows similar or better results than Node2vec on the different datasets. Compared to HOPE, the methods based on random walks work better, mainly because they can explore diverse information about the network structure through biased random walks.
 Plain vs. Attribute: The results show that the methods that integrate node attributes achieve better experimental results than those that focus only on the structure. This shows that integrating structural information with attribute semantics is advantageous for learning the vectors.
 Attribute vs. Attribute: DANE-WLA achieved the best performance in terms of the Micro-F1 and Macro-F1 values of all methods for most configurations.
Its results are higher than those of the remaining reference methods by about 3% on the FACEBOOK dataset. The results show that DANE-WLA is efficient and robust in handling both structure and attribute information.
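The Micro-F1 and Macro-F1 metrics used above can be computed directly from per-class counts. This short Python sketch (function name ours) illustrates the difference: Micro-F1 pools counts over classes before computing F1, while Macro-F1 averages per-class F1 scores.

```python
import numpy as np

def micro_macro_f1(y_true, y_pred, num_classes):
    """Micro- and Macro-F1 from per-class TP/FP/FN counts."""
    tp = np.zeros(num_classes)
    fp = np.zeros(num_classes)
    fn = np.zeros(num_classes)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1      # predicted class p, but it was wrong
            fn[t] += 1      # true class t was missed
    # Micro-F1: pool the counts over all classes, then compute F1.
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    # Macro-F1: average the per-class F1 scores.
    per_class = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-12)
    macro = per_class.mean()
    return micro, macro
```

For single-label multiclass data, Micro-F1 coincides with accuracy, while Macro-F1 weights rare and frequent classes equally.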

2) LINK PREDICTION
Link prediction is a prominent application of network embedding that aims to infer the probability of the presence of edges between nodes that are not connected in the input network. We created the test set by combining 20% of randomly selected edges with an equal number of nonexistent edges. Link prediction was measured by two prevalent evaluation metrics: AUC and precision [53].
 AUC is the area under the receiver operating characteristic curve. It is a stable measurement that reflects the average accuracy over the entire range of test values. The trivial AUC of a random estimator is 0.5; in short, larger values indicate better performance.
 Precision represents the precision of link prediction over the top-ranked candidates. After sorting the set of unobserved edges in descending score order, the first L positions are selected; if L_t of these top-L candidates are true edges in the test set, the precision is

Precision = L_t / L.

The link prediction results are presented in Table 6. The best AUC and precision values are boldfaced, and the second-best performers are underlined.
The results show that our proposed model DANE-WLA achieves either the best or the second-best results in terms of the performance measures across the different datasets. For the precision metric, DANE-WLA provides high results, which shows that a large percentage of the edges held out before the embedding stage were identified. This means that the DANE-WLA model can identify future edges accurately and with high efficiency.
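Both evaluation metrics can be sketched in a few lines of Python; `auc_precision_at_l` and its argument names are illustrative choices of ours, not code from the paper.

```python
def auc_precision_at_l(scores_pos, scores_neg, L=None):
    """AUC and precision@L for link prediction.

    scores_pos: predicted scores of true (held-out) edges;
    scores_neg: predicted scores of sampled nonexistent edges.
    """
    # AUC: probability that a true edge outscores a non-edge (ties count 0.5).
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    auc = wins / (len(scores_pos) * len(scores_neg))
    # Precision@L: fraction of true edges among the top-L ranked candidates.
    if L is None:
        L = len(scores_pos)
    ranked = sorted([(s, 1) for s in scores_pos] + [(s, 0) for s in scores_neg],
                    key=lambda t: -t[0])
    precision = sum(label for _, label in ranked[:L]) / L
    return auc, precision
```

The quadratic pairwise AUC loop is fine for illustration; a sort-based O(n log n) computation would be used at scale.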

3) VISUALIZATION
Visualization is an application of network analysis and an effective way to measure the quality and efficiency of an embedding model. It clarifies the global characteristics of an embedding by showing whether nodes of the same type lie sufficiently close to each other in the embedding space. First, we represent the nodes of the Facebook dataset in a low-dimensional embedding space using the various models. These representations are then used as input to the t-SNE tool [54], so each node is mapped to a 2D vector and all nodes are displayed in a 2D space, with each color representing a node label. When points of the same color form close clusters, the performance of the embedding model is good. The network visualization with the different embedding methods is shown in FIGURE 2. The points are distributed in four different colors representing four separate classes. Our model achieves better compactness and separation of clusters compared with the other reference models: points of similar color produced by the DANE-WLA model appear together in the same region, whereas in the other models the clusters do not separate cleanly from each other. Therefore, our model may work best for both supervised and unsupervised tasks.
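As a rough sketch of this visualization pipeline, the following Python snippet projects synthetic stand-in embeddings (the four cluster means and sizes are made up, not the paper's trained output) to 2D with scikit-learn's t-SNE.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for trained 64-dimensional node embeddings of 60 nodes,
# drawn as four synthetic clusters with labels 0..3.
rng = np.random.default_rng(0)
H = np.vstack([rng.normal(loc=c, scale=0.3, size=(15, 64)) for c in range(4)])
labels = np.repeat(np.arange(4), 15)

# Project to 2D with t-SNE; same-label points should form tight clusters.
emb2d = TSNE(n_components=2, perplexity=10, init="pca",
             random_state=0).fit_transform(H)
assert emb2d.shape == (60, 2)
# Each row of emb2d can now be scattered, colored by `labels`.
```

In the paper's setting, `H` would be the learned DANE-WLA embeddings of the Facebook nodes, and the scatter plot colored by page category would reproduce the qualitative comparison in FIGURE 2.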

4) PARAMETERS ANALYSIS
In this evaluation, we first analyzed the impact of the depth parameter K, varied from 1 to 6, on the performance of our DANE-WLA model. We found that DANE-WLA provided the best results with a depth parameter K=4 for all datasets. FIGURE 3 shows the Micro-F1 and Macro-F1 results obtained by varying the K parameter with the training ratio set to 80% and the embedding dimension set to 64. Second, as shown in FIGURE 4, the proposed model provided stable results once the training data percentage reached 30% of the total data. Finally, we tested the Micro-F1 and Macro-F1 scores of DANE-WLA with embedding dimensions varying from 32 to 256, as shown in FIGURE 5. We observe that the proposed model yields good results with different embedding sizes and that performance increases with the embedding dimension.

V. CONCLUSION
In this paper, a new deep attributed network embedding method called DANE-WLA was proposed. We argued that plain methods ignore the integration between the network structure and the attribute information, and that most ANE methods are shallow and fail to capture deep nonlinear information. In the same context, existing ANE methods are mainly designed for static networks and have superlinear time complexity. Accordingly, we proposed the DANE-WLA method with linear complexity, which tackles implicit data, unseen nodes, missing data, structure-attribute integration, and the nonlinear patterns of ANE in an integrated framework. We found that the Weisfeiler-Lehman embedding is efficient in capturing dependencies among node links and attributes. In addition, using an autoencoder is significant for investigating the nonlinear and complex patterns in the produced Weisfeiler-Lehman matrix. Experiments with link prediction, node classification, and visualization tasks showed that the DANE-WLA model outperformed the other network embedding models across the four real datasets. In addition, the model was evaluated through several parameter sensitivity analyses, and the results confirmed the superiority of DANE-WLA.