Adversarial Attention-Based Variational Graph Autoencoder

Autoencoders have been successfully used for graph embedding, and many variants have been proven to effectively express graph data and conduct graph analysis in low-dimensional space. However, previous methods either ignore the structure and properties of the reconstructed graph or do not consider the potential data distribution in the graph, which typically leads to unsatisfactory graph embedding performance. In this paper, we propose the adversarial attention variational graph autoencoder (AAVGA), a novel framework that incorporates attention networks into the encoder and uses an adversarial mechanism in embedding training. The encoder involves node neighbors in the representation of nodes by stacking attention layers, which further improves the graph embedding performance of the encoder. At the same time, due to the adversarial mechanism, the distribution of the latent features that are generated by the encoder is closer to the actual distribution of the original graph data; thus, the decoder generates a graph that is closer to the original graph. Experimental results prove that AAVGA performs competitively with state-of-the-art popular graph encoders on three citation datasets.


I. INTRODUCTION
Compared with Euclidean spatial data, such as image, text, and voice data, non-Euclidean data, such as graph data, are difficult to process. Therefore, graph embedding algorithms have become a research hotspot. Graph research focuses on node classification [1], link prediction [2], graph classification [3], and graph generation [4], and three families of graph embedding algorithms have been established: those based on factorization, random walks, and graph neural networks.
Graph embedding algorithms that are based on factorization, for example, graph factorization [5], aim at identifying a balance point between the adjacency matrix and the regularization term such that the generated vector retains most of the information of the adjacency matrix. LINE [6] utilizes this strategy and attempts to embed the adjacency matrix, and the second-order similarity of the nodes is maintained in the vector. HOPE [7] introduced a higher-order similarity matrix and retained the high-order similarity via the generalized singular value decomposition.
A representative algorithm that is based on the random walk is DeepWalk [8]. This method performs truncated random walks from each node to generate equal-length node sequences, which serve as the neighborhood context of the nodes and are used for skip-gram model training. Node2Vec [2] builds on this by weighting the random walk toward the DFS and BFS directions so that the generated sequences more accurately reflect the structural information of the nodes.
(The associate editor coordinating the review of this manuscript and approving it for publication was Mu-Yen Chen.)
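The truncated-random-walk idea behind DeepWalk can be sketched as follows. This is a minimal illustration, not the authors' code; the `random_walks` helper and its adjacency-dict input are hypothetical:

```python
import random

def random_walks(adj, walk_len=5, walks_per_node=2, seed=0):
    """Truncated random walks in the style of DeepWalk: each walk is a
    node sequence later treated as a 'sentence' for skip-gram training.
    `adj` maps each node to a list of its neighbours."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            # extend the walk until the length cap or a dead end is reached
            while len(walk) < walk_len and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks
```

Each walk is then fed to a skip-gram model exactly as a sentence of words would be; Node2Vec differs only in biasing the `rng.choice` step toward BFS- or DFS-like transitions.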
Recently, neural networks have been used in reliable intelligent path-following control [9] and control system stability [10], and graph embedding algorithms have gradually transitioned to the era of neural networks. Kipf et al. [1] simplified the definition of frequency-domain convolution and proposed GCN, which conducts the convolution operation in the spatial domain and significantly increases the computational efficiency of graph convolution models. Since then, many variants of GCN have been proposed [11]-[15]. GraphSAGE [16] does not restrict the sampled information to the topology of the nodes; instead, it also uses the inherent characteristics of the nodes. It abandons the diffusion mechanism, which involves a massive number of parameters, to realize distributed training on large-scale graph data and inductive learning. Graph attention networks (GAT) [17] conduct aggregation operations on neighboring nodes using an attention mechanism to adaptively allocate neighbor weights. These methods are supervised graph embedding methods. In recent years, the application of graph data has become increasingly widespread, and graph structures have become more complex. In practical application scenarios, data labels often have high acquisition costs. Therefore, it is of substantial value to investigate methods for efficient unsupervised representation learning on graph data. The graph autoencoder, which is based on a reconstruction loss, is a typical unsupervised learning approach. GAE and VGAE [18] use an encoder to obtain latent vectors and a decoder that uses the latent variables to reconstruct the graph structure.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Due to the high-dimensional and complex distributional characteristics of the graph data, the distribution of the feature vector that is encoded by the encoder deviates from the actual distribution. To solve the problem regarding the distribution of the encoded data, DVNE [19] directly embeds the nodes according to a Gaussian distribution and uses the Wasserstein distance as the similarity measure between the distributions, thereby effectively modelling the uncertainty of the nodes in the network. ARGA and ARVGA [20] further introduced the adversarial mechanism [21], which forced the encoder to generate potential vectors that were closer to the data in terms of distribution via adversarial training. CAAE [22] employs Bayesian personalized ranking (BPR) as a discriminative model, which improves the discriminative performance of the discriminator. Although these autoencoders have yielded reasonable results, they do not consider the difference in node importance. Since the importance may differ substantially among neighbor nodes, to learn a robust and stable node embedding, the weights of neighbor nodes should be changed during the training process.
In this study, we focus on the differential representation of neighbor nodes and propose a novel adversarial attention-based variational graph autoencoder (AAVGA). The objectives are to differentiate structural information and to jointly apply adversarial regularization mechanisms to increase the accuracy of graph embedding results. Our encoder utilizes neighbors that differ in terms of weight in the feature representation of the nodes by stacking graph attention layers to generate potential feature vectors, and we add multiple sets of independent attention mechanisms so that the encoder can allocate attention to multiple related features between the central node and neighbor nodes. The adversarial mechanism supervises the encoder and forces its generated vectors to obey the prior distribution of the graph data. Our key contributions are summarized as follows:
• For unsupervised weight learning, we propose an attention-based graph variational autoencoder that infers the feature weights of nodes. The introduction of attention improves the graph embedding learning performance of the graph variational autoencoder, minimizes the reconstruction errors of the graph structure, and captures the highly non-linear network structure.
• To regularize the distribution of the encoded data, we introduce an adversarial component into the attention-based variational graph autoencoder. This component can determine whether the input is from the low-dimensional representation of the graph network or the true distribution of the samples. The discriminator encourages the encoder to generate low-dimensional variables with distributions that are closer to the true distribution of the data and to learn an effective representation of the graph.
• We developed a variation of the graph autoencoder, namely, the adversarial attention-based variational graph autoencoder (AAVGA), for learning a graph representation. We use AAVGA for unsupervised graph link prediction and graph clustering visualization. Experimental results prove that our method performs competitively with other methods.

II. RELATED WORK
A. GRAPH REPRESENTATION LEARNING
The graph representation learning methods can be divided into three categories: factorization-based methods, random-walk-based methods, and deep-learning-based (GNN-based) methods. The factorization-based methods decompose the adjacency matrix, Laplacian matrix, or node similarity matrix to convert the node information into a low-dimensional vector while preserving the structural similarity. These methods, which include GF [5], GraRep [23], and HOPE [7], rely on the decomposition of the correlation matrix; hence, they have high time and space complexities. In recent years, inspired by the word vector method, various methods treat sequences that are generated by random walks in the graph as sentences and nodes as words, such as DeepWalk [8] and Node2Vec [2], which realize large-scale graph embedding learning. However, these methods have shortcomings such as insufficient utilization of graph structure information and failure to fuse graph attribute information. Graph signal processing (GSP) [24] migrates the basic concepts of signal processing, such as the Fourier transform and filtering, to graphs to realize the compression, transformation, and reconstruction of the graph signal structure; this laid the foundation for graph embedding learning. Inspired by the definition of graph signal convolution filtering in GSP, a set of neural network theories that are based on graph convolution operations have been developed in recent years. Bruna et al. [25] introduced convolution into graph neural networks and developed a graph convolution network model that was based on the concept of frequency-domain convolution operations. Since then, researchers have continuously proposed improvements and expansions of this neural network model that are based on frequency-domain graph convolution [26], [27].
With strong end-to-end learning capabilities, GNNs can be combined with corresponding unsupervised loss functions to realize unsupervised graph representation learning. According to their design, the loss functions can be divided into two categories: contrast-based loss functions and reconstruction-based loss functions. Based on the contrast loss, GraphSAGE [16] uses neighbors as context.
[FIGURE 1. The main structure of AAVGA. The upper layer is an attention-based autoencoder, which assigns exclusive weights to nodes during the graph embedding process; the next layer is the adversarial layer, which forces Z to obey the prior distribution of the graph data by distinguishing the latent vector Z that is generated by the encoder from the real data.]
Recently, Weihua Hu et al. used subgraphs as the context for contrast learning [28], and DGI [29] regards full graphs as the context. An autoencoder that is based on the reconstruction loss, namely, VGAE [18], uses a GNN to encode and learn graph data based on VAE [30]. In ARVGA [20], an adversarial mechanism was added to train a discriminator and force the latent vector of the graph data that is generated by the encoder to obey the prior distribution. However, these frameworks do not consider the differences among neighbor nodes during encoding, which may cause the original graph structure to be destroyed. We introduce the attention mechanism to solve this problem.

B. ATTENTION MODELS
Our method is inspired by the graph attention network (GAT) [17]. It uses an attention mechanism to conduct aggregation operations on neighboring nodes and realizes the adaptive allocation of neighbor weights. Similar to GraphSAGE [16], the graph attention model retains complete locality information; it can also conduct inductive learning. Based on this, gated attention networks (GaANs) were developed [31]. Furthermore, we combine GAT [17], VGAE [18], and GAN [21], and while ensuring the prior distribution of the data, the expressive ability of the graph encoder is improved.

C. ADVERSARIAL MODELS
The adversarial strategy in our method is derived from GAN [21]: a generator and a discriminator are constructed to optimize each other in a minimax game. GAN can enhance the generalization performance of a graph embedding model. GraphGAN [32] was the first network to utilize the adversarial strategy in graph learning. ANE [33] regards embedding vectors as the generated result and uses GAN [21] as an additional regularization term in available network embedding methods such as DeepWalk [8] by imposing the distribution of the real data as a prior distribution. ARVGA [20] used the above strategies in variational autoencoders but did not consider the difference in attention between nodes. Our method improves this.

III. AAVGA ARCHITECTURE
In this section, we will introduce our proposed AAVGA. The main structure of the framework is illustrated in Fig. 1. It is divided into three main modules: an encoder, a decoder, and a discriminator.

A. ATTENTION-BASED VARIATIONAL GRAPH AUTOENCODER
1) ENCODER
VGAE [18] and ARVGA [20] use the original two-layer GCN [1] as the encoder. Our proposed AAVGA combines the strategy of GAT [17] and replaces the two-layer graph convolutional network in the general encoder with a single-layer graph attention network to generate the latent representation of the graph data. The formal definition is as follows: let the feature vector that corresponds to node $v_i$ in layer $l$ be $h_i \in \mathbb{R}^{d^{(l)}}$, where $d^{(l)}$ represents the feature length of the node. After an aggregation operation with the attention mechanism at its core, the output is the new feature vector $h'_i \in \mathbb{R}^{d^{(l+1)}}$ of each node, where $d^{(l+1)}$ represents the length of the output feature vector. We refer to this aggregation operation as a graph attention layer. Assuming that the center node is $v_i$, we set the weight coefficient of neighbor node $v_j$ with respect to $v_i$ as

$$e_{ij} = a\left(W h_i, W h_j\right), \quad j \in \mathcal{N}_i, \tag{1}$$

where $W \in \mathbb{R}^{d^{(l+1)} \times d^{(l)}}$ is the weight parameter of the node feature transformation of this layer and $a(\cdot)$ is a function that calculates the correlation between two nodes. In principle, we could calculate the weight parameter from any node in the graph to node $v_i$; however, to simplify the calculation, we limit it to the first-order neighbors $\mathcal{N}_i$. The function $a$ can use the inner product of the vectors, $\langle W h_i, W h_j \rangle$, to define a parameter-free correlation calculation. Alternatively, it can be defined as a neural network layer with parameters, as long as $a: \mathbb{R}^{d^{(l+1)}} \times \mathbb{R}^{d^{(l+1)}} \to \mathbb{R}$ is satisfied, i.e., a scalar value that represents the correlation between the two vectors is output. We select a single fully connected layer with weight parameter $\mathbf{a} \in \mathbb{R}^{2d^{(l+1)}}$ and the LeakyReLU activation function:

$$e_{ij} = \mathrm{LeakyReLU}\left(\mathbf{a}^{\top}\left[W h_i \,\|\, W h_j\right]\right). \tag{2}$$

To better assign weights, we normalize the calculated correlations over all neighbors via softmax normalization:

$$\alpha_{ij} = \mathrm{softmax}_j\left(e_{ij}\right) = \frac{\exp\left(e_{ij}\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(e_{ik}\right)}, \tag{3}$$

where $\alpha_{ij}$ is the weight coefficient.
The processing of (3) ensures that the sum of the weight coefficients over all neighbors is 1. Combining (2) and (3) yields the complete weight coefficient formula:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{\top}\left[W h_i \,\|\, W h_j\right]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{\top}\left[W h_i \,\|\, W h_k\right]\right)\right)}. \tag{4}$$

After calculating the above weight coefficients, according to the weighted-summation strategy of the attention mechanism, the new feature vector of node $v_i$ is

$$h'_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\right). \tag{5}$$

To further improve the expressive ability of the attention layer, we have also incorporated a multi-head attention mechanism into AAVGA, in which (5) is used to form $K$ groups of independent attention mechanisms, in contrast to GAT [17].
To reduce the dimension of the output latent feature vector, we replaced GAT's concatenation operation with an averaging operation:

$$h'_i = \sigma\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} h_j\right). \tag{6}$$

By averaging multiple sets of independent attention mechanisms, the multi-head attention mechanism can impose the attention distribution on multiple relevant features between the central node and neighbor nodes, which enhances the encoder's representation ability.
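The attention layer described above — Eqs. (2) and (3) for the weights and Eq. (5) for the weighted sum, with head averaging instead of concatenation — can be sketched in NumPy. This is a minimal illustration under assumed shapes, not the paper's implementation; the function names are ours, and the output nonlinearity is omitted for brevity:

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_layer(H, A, W, a):
    """One graph attention layer (single head).
    H : (N, d_in) node features; A : (N, N) adjacency with self-loops;
    W : (d_in, d_out) shared transform; a : (2*d_out,) attention vector."""
    N = H.shape[0]
    WH = H @ W                                    # transformed features W h_j
    H_out = np.zeros((N, W.shape[1]))
    for i in range(N):
        nbrs = np.nonzero(A[i])[0]                # first-order neighbours only
        # e_ij = LeakyReLU(a^T [W h_i || W h_j])  (Eq. 2)
        e = np.array([leaky_relu(a @ np.concatenate([WH[i], WH[j]]))
                      for j in nbrs])
        alpha = softmax(e)                        # Eq. 3: weights sum to 1
        H_out[i] = alpha @ WH[nbrs]               # Eq. 5, pre-activation
    return H_out

def multi_head_gat(H, A, Ws, atts):
    """K independent heads, averaged as in Eq. (6) instead of concatenated."""
    return np.mean([gat_layer(H, A, W, a) for W, a in zip(Ws, atts)], axis=0)
```

Averaging keeps the output dimension at `d_out` regardless of the number of heads, which is why the paper prefers it over GAT's concatenation for the latent layer.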
We use an encoder that is based on the attention mechanism to fit $\mu$ and $\sigma$:

$$\mu = \mathrm{GNN}_{\mu}(X, A), \qquad \log \sigma = \mathrm{GNN}_{\sigma}(X, A),$$

where $\mu$ is the matrix of mean vectors $\mu_i$, and $\log \sigma$ shares the weights $W$ with $\mu$ in the attention layer. The variational graph encoder is defined by an inference model:

$$q(Z \mid X, A) = \prod_{i=1}^{N} q(z_i \mid X, A), \qquad q(z_i \mid X, A) = \mathcal{N}\left(z_i \mid \mu_i, \mathrm{diag}(\sigma_i^2)\right).$$

2) DECODER
For the decoder, we follow the strategies of VGAE [18] and ARVGA [20], with the objective of reconstructing the graph adjacency matrix $A$ from the encoder output. The decoder uses the inner product of the latent vectors of two nodes to model their adjacency relationship, thereby predicting whether any two nodes in the graph are linked:

$$p(A \mid Z) = \prod_{i=1}^{N} \prod_{j=1}^{N} p\left(A_{ij} \mid z_i, z_j\right), \qquad p\left(A_{ij} = 1 \mid z_i, z_j\right) = \sigma\left(z_i^{\top} z_j\right),$$

where $\sigma(\cdot)$ here denotes the logistic sigmoid function.

3) LOSS FUNCTION
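The sampling step of the inference model and the inner-product decoder can be sketched as follows. This is a NumPy illustration with our own helper names, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_latent(mu, log_sigma, rng):
    """Reparameterization trick: z_i = mu_i + sigma_i * eps with eps ~ N(0, I),
    so z_i ~ N(mu_i, diag(sigma_i^2)) while gradients can flow through mu, sigma."""
    return mu + np.exp(log_sigma) * rng.normal(size=mu.shape)

def inner_product_decoder(Z):
    """p(A_ij = 1 | z_i, z_j) = sigmoid(z_i^T z_j): the reconstructed adjacency
    is the matrix of pairwise inner products pushed through a sigmoid."""
    return sigmoid(Z @ Z.T)
```

The decoder has no parameters of its own; all modelling capacity sits in the attention-based encoder that produces `mu` and `log_sigma`.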
Following common practice, the standard normal distribution is selected as the prior distribution of the latent variable $z$:

$$p(Z) = \prod_{i=1}^{N} p(z_i) = \prod_{i=1}^{N} \mathcal{N}\left(z_i \mid 0, I\right).$$

The loss of the entire encoder is defined as follows:

$$\mathcal{L} = \mathcal{L}_{recon} + \mathrm{KL}\left[q(Z \mid X, A) \,\|\, p(Z)\right], \tag{16}$$

where $\mathcal{L}_{recon} = -\mathbb{E}_{q(Z \mid X, A)}\left[\log p(A \mid Z)\right]$ is the reconstruction loss.
If only $\mathcal{L}_{recon}$ is used as the loss function, then to optimize the reconstruction, the model will learn to drive the variance toward zero: the sampling then degenerates to a fixed value, which reduces the differences between the generated samples and the real samples but destroys the variational nature of the model. However, our objective is to optimize the variational autoencoder. To realize this objective, we add to the loss function the KL divergence between each independent normal distribution and the standard normal distribution, forcing each posterior normal distribution to approach the standard normal distribution.
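The loss in (16) can be sketched as below. This is a NumPy illustration with hypothetical function names; the binary-cross-entropy form of the reconstruction term is assumed:

```python
import numpy as np

def kl_to_standard_normal(mu, log_sigma):
    """KL[N(mu, diag(sigma^2)) || N(0, I)], summed over all nodes and dims;
    it is zero exactly when mu = 0 and sigma = 1."""
    return 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1.0 - 2 * log_sigma)

def vgae_loss(A, A_hat, mu, log_sigma):
    """Reconstruction term plus the KL regularizer of Eq. (16)."""
    eps = 1e-10  # numerical guard for log(0)
    recon = -np.sum(A * np.log(A_hat + eps) + (1 - A) * np.log(1 - A_hat + eps))
    return recon + kl_to_standard_normal(mu, log_sigma)
```

Dropping the KL term lets the optimizer shrink `log_sigma` without penalty, which is exactly the variance-collapse failure mode described above.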

B. ADVERSARIAL MODEL
Our adversarial model is composed of two main parts. The encoder in the graph attention autoencoder acts as the generator of the adversarial network. The generator attempts to deceive the discriminator by generating fake data, where fake data refers to the latent variables that are generated from the graph data by the encoder. The loss of the generator is as formulated in (16). The discriminator attempts to distinguish whether samples come from the real prior or from the generator: it judges data drawn from the prior distribution $p_z$ as positive and the latent variables $z$ output by the encoder as negative, and its loss is as follows:

$$\mathcal{L}_D = -\mathbb{E}_{z \sim p_z}\left[\log D(z)\right] - \mathbb{E}_{X}\left[\log\left(1 - D\left(G(X, A)\right)\right)\right]. \tag{17}$$

We use the Gaussian distribution as the prior distribution of the graph data. We posit that the latent vector $z$ that is generated by the encoder does not initially satisfy the prior distribution of the data in Euclidean space; hence, we use a standard multilayer perceptron as the discriminator. In the process of embedding learning, the adversarial regularization constraint is imposed to reduce the deviation of the data distribution during training. The main objective of the adversarial model is to jointly train the encoder and the discriminator so that the latent variables generated by the encoder become indistinguishable from samples of the prior distribution.
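Eq. (17) and the corresponding generator objective can be sketched as below. This is a NumPy illustration with hypothetical names: `d_real` and `d_fake` stand for the discriminator's outputs on prior samples and on encoder latents, respectively:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Eq. (17): -E_{z~p_z}[log D(z)] - E[log(1 - D(G(X, A)))].
    d_real = D(prior samples), d_fake = D(encoder latents), both in (0, 1)."""
    eps = 1e-10  # numerical guard for log(0)
    return float(-np.mean(np.log(d_real + eps))
                 - np.mean(np.log(1 - d_fake + eps)))

def generator_loss(d_fake):
    """The encoder (acting as generator) is rewarded when the discriminator
    labels its latent vectors as real."""
    eps = 1e-10
    return float(-np.mean(np.log(d_fake + eps)))
```

A confident, correct discriminator drives its own loss toward zero, while the generator loss then pressures the encoder to move its latent distribution toward the prior.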

C. ALGORITHM
Algorithm 1 is our proposed framework. In step 2, the attention weight coefficient of each node is obtained, and the corresponding feature vector is obtained in step 3; the feature vectors compose the latent variable matrix Z. Then, in step 5 and step 6, we draw the same number of samples from the generated result Z and from the real data distribution $p_z$; these samples are used to train the discriminator. In step 7, we attempt to use the encoder to fool the discriminator, which can also be understood as training the encoder against the discriminator so that the distribution of the data that are generated by the encoder is closer to the true distribution. Finally, in step 9, we use the total loss to update the parameters.
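The alternating schedule of Algorithm 1 can be sketched as a toy loop. Fixed random weights stand in for the trained modules, and the parameter updates of steps 6 and 9 are replaced by comments, so this only illustrates the data flow, not actual training:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d = 6, 4, 3
X = rng.normal(size=(N, d_in))                       # toy node features
A = (rng.random((N, N)) < 0.4).astype(float)         # toy adjacency

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# fixed random stand-ins for the trained modules; a real implementation
# would update them with Adam after each loss computation
W_mu = rng.normal(size=(d_in, d)) * 0.1
W_sig = rng.normal(size=(d_in, d)) * 0.1
w_disc = rng.normal(size=(d,)) * 0.1

losses = []
for epoch in range(3):
    mu, log_sigma = X @ W_mu, X @ W_sig              # steps 2-3: encode nodes
    Z = mu + np.exp(log_sigma) * rng.normal(size=mu.shape)
    z_real = rng.normal(size=Z.shape)                # step 5: draw from p_z
    d_real, d_fake = sigmoid(z_real @ w_disc), sigmoid(Z @ w_disc)
    d_loss = (-np.mean(np.log(d_real + 1e-10))
              - np.mean(np.log(1 - d_fake + 1e-10)))
    # step 6: update the discriminator on d_loss (omitted here)
    g_loss = -np.mean(np.log(d_fake + 1e-10))        # step 7: fool D
    A_hat = sigmoid(Z @ Z.T)                         # inner-product decoder
    recon = -np.sum(A * np.log(A_hat + 1e-10)
                    + (1 - A) * np.log(1 - A_hat + 1e-10))
    kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1 - 2 * log_sigma)
    total = recon + kl + g_loss                      # step 9: update encoder
    losses.append(float(total))
```

The essential point is the ordering: the discriminator sees both sample sets first, and only afterwards is the encoder pushed both to reconstruct the graph and to fool the discriminator.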

IV. EXPERIMENTS
In this section, we will evaluate AAVGA on three benchmark citation network datasets. The satisfactory performance of the framework is proved on the link prediction task in unsupervised graph learning.

A. DATASETS
We use the three most popular citation datasets in graph embedding learning (Cora, Citeseer, and Pubmed) to evaluate our model. The datasets' structures are described in Table 1. The nodes correspond to papers, the features are the characteristics of each paper, and the edges represent the citation links between papers. Consider the Cora dataset as an example: it is composed of machine learning papers and has been highly popular in recent years for graph deep learning. Each paper is classified into one of the following seven categories: case-based, genetic algorithms, neural networks, probabilistic methods, reinforcement learning, rule learning, and theory. The selection criterion is that a paper is cited by at least one other paper in the final corpus; there are 2708 papers in the corpus. The dataset's vocabulary was built by stemming the words of all papers and deleting all words with a document frequency of less than 10, leaving 1433 unique words; hence, each feature vector is 1433-dimensional, with the absence and presence of a word in a paper represented by 0 and 1, respectively. The Citeseer and Pubmed datasets have structures similar to that of the Cora dataset.
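The 0/1 word-presence features described above can be sketched as below. This is a minimal illustration; `binary_bow` is a hypothetical helper, not part of the datasets' released preprocessing:

```python
def binary_bow(docs, vocab):
    """0/1 word-presence features in the style of the Cora preprocessing:
    feature j of paper i is 1 iff vocabulary word j occurs in that paper."""
    index = {w: j for j, w in enumerate(vocab)}
    X = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for word in doc.split():
            if word in index:          # out-of-vocabulary words are dropped
                X[i][index[word]] = 1
    return X
```

For Cora, `vocab` would hold the 1433 retained stems, giving each paper a 1433-dimensional binary feature vector.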

B. METRICS
We regard the area under the ROC curve as a comprehensive measurement index, which is denoted as AUC. Another evaluation indicator is AP, the area under the precision-recall (PR) curve. These two indices are the primary evaluation indicators for link prediction tasks. We divide each dataset into a training set, a validation set, and a test set: the validation set contains 5% of the edges and is used for hyperparameter optimization, and the test set contains 10% of the edges and is used to evaluate performance. To ensure reliability, we conducted 10 experiments on each dataset and report the average and standard error of the experimental results.
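For reference, both metrics can be computed from scratch as below. This is a sketch; real experiments would typically use a library implementation such as scikit-learn's:

```python
import numpy as np

def auc_score(y_true, scores):
    """Area under the ROC curve via the pairwise-rank statistic: the
    probability that a random positive edge outranks a random negative one."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())  # ties count half
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Area under the PR curve (AP), accumulated as the precision at each
    positive example when edges are ranked by predicted score."""
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(y_true)[order]
    hits = np.cumsum(y)
    precision_at_k = hits / np.arange(1, len(y) + 1)
    return float(np.sum(precision_at_k * y) / y.sum())
```

Both functions take the true edge labels and the decoder's predicted link probabilities, so they apply directly to the held-out validation and test edges.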

C. BASELINES
To prove the satisfactory performance of our proposed AAVGA framework, we compare it with six popular link prediction algorithms:
• Spectral Clustering [34]: This is a clustering method that is based on graph theory. The weighted undirected graph is divided into two or more optimal subgraphs such that the interiors of the subgraphs are as similar as possible and the distance between the subgraphs is as large as possible, thereby realizing the common clustering objective.
• DeepWalk [8]: Learning a social representation through truncated random walks can yield better results if the network has few vertices, and the method is also scalable and can adapt to changes in the network.
• GAE [18]: This is an artificial neural network that can learn an efficient representation of input graph data through encoding and decoding via unsupervised learning.
• VGAE [18]: The encoder does not learn a low-dimensional vector representation of the samples; it learns the distribution of the low-dimensional vector representation. Suppose this representation follows a normal distribution. Then, the distribution that is represented by the low-dimensional vector is sampled to obtain the low-dimensional vector representation.
• ARGA [20]: An adversarial mechanism is added based on the graph autoencoder to ensure the consistency of the data distribution during the training process.
• ARVGA [20]: Adversarial regularization is incorporated based on a variational graph autoencoder. For the above benchmark methods, we follow the parameter settings in the corresponding papers.

D. EXPERIMENTAL SETUP
In contrast to the autoencoders in the benchmarks, which use a double-layer GCN to fit the data, for the Cora and Citeseer datasets we use a graph attention layer with 64 neurons and a multi-head attention mechanism with K = 3 heads when training our framework, namely, AAVGA. We use the Adam algorithm to optimize the encoder and the discriminator, and the learning rates of both are set to 0.001. Training is conducted for 300 iterations. Since the data scale of the Pubmed dataset is much larger than those of the previous two datasets, to increase the expressive power of the encoder, for the Pubmed dataset we increased the number of neurons to 128, set the number of attention heads to K = 6, and set the number of epochs to 1000. The remaining parameters remain unchanged.

E. EXPERIMENTAL RESULTS
The results of the link prediction experiment are presented in Table 2. Our method, namely, AAVGA, has yielded excellent results on all three datasets. Except for the AUC on Cora, all AUC and AP indicators exceeded 94%. Compared with the other benchmarks, its performance on the Citeseer dataset is the strongest: compared with VGAE, AUC and AP improved by 3.2% and 2.6%, respectively, and compared with ARVGA, by approximately 1.6%. In comparison with non-encoder graph embedding methods, the performance improvement of our method is pronounced. On the Citeseer dataset, the AUC of AAVGA is 13.5% higher than those of spectral clustering and DeepWalk, and the AP increased by 9.6% and 11%, respectively. The experimental results demonstrate that competitive performance can be realized by combining the attention mechanism and the adversarial mechanism in the graph encoder.

F. IN-DEPTH ANALYSIS
In this section, we will investigate the roles of the attention and adversarial components in the AAVGA framework.
In our experiments, we use the following variants of our architecture: • AAVGA: The full version of our proposed autoencoder, which includes all components.
• AVGA: A variant of our architecture that includes all components except the adversarial components.
• ARVGA: A variant in which adversarial regularization that is based on a variational graph autoencoder has been incorporated.
We compare AVGA with the full versions of AAVGA and ARVGA on unsupervised link prediction. The relevant experimental settings are as described in Section IV-D. Fig. 2 presents the average AUC and AP on the test nodes after ten repeated training runs of the three frameworks.
AAVGA outperforms AVGA and ARVGA on all datasets; therefore, each component contributes to the overall performance of our architecture. Moreover, AVGA outperforms ARVGA, which indicates that the self-attention component contributes more to our architecture than the adversarial component.
We now discuss the effect of the multi-head attention layer in the encoder on the experimental results. In contrast to previous encoders that used double-layer GCN networks to fit graph data, we used a single-layer graph attention network because we found that double-layer networks are outperformed by single-layer networks, which may be related to our multi-head attention mechanism. As shown in Fig. 3, we gradually increased the number of attention heads K from 1 to 9. The link prediction performance of AAVGA gradually improved and was maximized at K = 3. After that, as K increases, the performance fluctuates slightly, but the overall trend is decreasing. After analysis, we conclude that the optimal number of groups for multi-head attention is related to the sparseness of the adjacency matrix A: when the sparseness of the graph is low, oversmoothing will quickly occur.

G. GRAPH VISUALIZATION
To present our results more intuitively and to prove the versatility of the AAVGA framework, we will visualize the graph data in this section. First, we use AAVGA to embed the three processed citation datasets (Cora, Citeseer, and Pubmed). After the training has been completed, the data are encoded by an autoencoder to obtain feature vectors. To realize the two-dimensional visualization of the graph data, we use PCA [35] to reduce the feature vector's dimension. Finally, k-means++ [36] is used to cluster the dimensionality-reduced data. The visualization results are presented in Fig. 4. According to Fig. 4, in which each color represents a category, the citation categories in each dataset are well divided. This proves the satisfactory performance of the AAVGA framework.
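The PCA-then-k-means++ pipeline described above can be sketched as below. This is a NumPy illustration with our own helper names; it mirrors, but is not, the PCA [35] and k-means++ [36] implementations used in the paper:

```python
import numpy as np

def pca_2d(Z):
    """Project embeddings to two dimensions via SVD of the centred matrix."""
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:2].T          # coordinates along the top two components

def kmeans_pp(Z2, k, iters=50, seed=0):
    """Lloyd's k-means with k-means++ seeding: each new centre is drawn with
    probability proportional to its squared distance to the nearest centre."""
    rng = np.random.default_rng(seed)
    centres = [Z2[rng.integers(len(Z2))]]
    for _ in range(1, k):
        d2 = np.min(((Z2[:, None] - np.array(centres)[None]) ** 2).sum(-1),
                    axis=1)
        centres.append(Z2[rng.choice(len(Z2), p=d2 / d2.sum())])
    centres = np.array(centres)
    for _ in range(iters):
        labels = ((Z2[:, None] - centres[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centres[j] = Z2[labels == j].mean(0)
    return labels
```

In the actual pipeline, `Z` would be the encoder's latent vectors and `k` the number of citation categories; the returned labels drive the colouring in Fig. 4.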

V. CONCLUSION
In this paper, we proposed a novel adversarial attention-based variational graph autoencoder (AAVGA). Most encoders use GCN to fit graph data; we believe that the expressive power of GCN is insufficient for representing the original graph data. Therefore, we replace the GCN in the encoder with a graph attention network that considers the differences among the links between nodes in the graph data, so that a more suitable weight is assigned to each node. This improves the representation performance of the encoder. At the same time, the adversarial mechanism ensures the consistency between the distribution of the latent variables obtained by the encoder and the prior distribution, which increases the robustness of the whole framework. The results of the link prediction experiments demonstrate that our method, namely, AAVGA, outperforms the baselines.
Future work can focus on four aspects: 1. The inner-product decoder can be replaced with a more complex generative neural network. 2. The selection of a Gaussian distribution as the prior distribution of the data is not optimal; the exploration of a more reasonable prior distribution is a new direction. 3. In GAEs, many differing similarity measures have been adopted; hence, the selection of a suitable similarity measure for a given task and model system requires additional investigation. 4. We can combine GAEs with GNNs to render GAEs more powerful, which is a promising future direction.
ZIQIANG WENG received the B.S. degree from the Department of Information and Computing Science, Qilu University of Technology (Shandong Academy of Sciences), in 2019. He is currently pursuing the master's degree with the School of Computer Science and Engineering, Qilu University of Technology (Shandong Academy of Sciences). His research interests include machine learning, data mining, and graph neural networks.
WEIYU ZHANG received the Ph.D. degree in computer science from the Beijing University of Posts and Telecommunications, in 2016. He is currently an Associate Professor with the Qilu University of Technology (Shandong Academy of Sciences). Until now, he has published more than ten papers in conferences and journals, such as Neurocomputing, Acta Electronica Sinica, and so on. His current research is sponsored by the Natural Science Foundation of Shandong Province, the Natural Science Foundation of China, and the National Key Research and Development Program. His research interests include machine learning, data mining, social network analysis, and graph neural networks.
WEI DOU received the B.S. degree from the Department of Information and Computing Science, Lvliang University, in 2018. He is currently pursuing the master's degree with the School of Computer Science and Engineering, Qilu University of Technology (Shandong Academy of Sciences). His research interests include machine learning and network representation learning.