A Deep Graph Structured Clustering Network

Graph clustering is a fundamental task in data analysis and has attracted considerable attention in recommendation systems, knowledge-domain mapping, and biological science. Because graph convolution is effective at combining the feature information and topology information of graph data, several graph clustering methods based on graph convolution have achieved superior performance. However, current methods give insufficient consideration to structured information and to the graph convolution process itself. Specifically, most existing methods ignore the implicit interaction between topology information and feature information, and stacking only a small number of graph convolutional layers leads to insufficient learning of complex information. Inspired by graph convolutional networks and auto-encoders, we propose a deep graph structured clustering network that applies a deep clustering method to graph structured data. Deep graph convolution is employed in the backbone network, and the result of each iteration is evaluated with node feature and topology information. To optimize the network without supervision, a triple self-supervised module is designed to update the parameters of the overall network. Our model exploits all the information in the graph structured data and performs self-supervised learning. Furthermore, the improved graph convolution layers significantly alleviate the degradation of clustering performance caused by over-smoothing. Our model is designed to perform both on datasets with native graph structure and on datasets whose graphs must be constructed from raw features, and experimental results demonstrate that it achieves superior performance over state-of-the-art models.


I. INTRODUCTION
Typically, graph structured data can be represented as G = (V, E), where V represents node information and E represents edge information. Graph data contains abundant information, including node features and topology. Owing to the expressive power of graph-based models, many applications in the real world can be represented as graphs, such as user-item interactions in recommendation systems, reference relations in knowledge-domain mapping, and protein molecular structures in biological science. Recent research and applications have made considerable progress in graph data processing.
Graph convolution has been proposed, and many graph clustering methods based on it achieve state-of-the-art performance. Graph clustering is achieved by learning the distribution features of graph data. The graph convolutional network (GCN) [1] is very important for extracting the feature and topology information of a graph. Existing deep graph clustering methods mainly rely on the restored topology structure of the graph data to construct the objective function. Some representative graph clustering methods have been proposed. Diederik P. Kingma et al. [2] extract the feature information of graph data to learn representations of node features, and the results of the model are expected to be as similar as possible to the node features of the original graph. Thomas Kipf et al. [3] adopt deep learning and graph convolution to perform the clustering task, using the model's learned results to reconstruct the topology of the original graph during self-supervised training. Shirui Pan et al. [4] combine the graph convolution operation with a generative adversarial network (GAN); the GAN is used to optimize the distribution of clustering results to improve performance. Xiaotong Zhang et al. [5] implement the clustering task efficiently by optimizing the classical graph convolution operation, but the method can only be used on a few datasets. All of the above methods reconstruct the original topology or node features to optimize the learning results and accomplish the clustering task, but they ignore the implicit interaction between feature information and topology information.
(The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano.)
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Moreover, existing models stack only a small number of GCN layers, so sufficient features are not learned. The graph clustering task, however, requires comprehensive consideration of graph structured data and full learning of its structured information.
Some clustering-distribution optimization methods based on deep learning are widely applied. Xifeng Guo et al. [6] optimize the label distribution of clustering results while preserving the local structure information of the data. Shirui Pan et al. [4] propose optimizing the clustering distribution by taking the output results as the input of a GAN. These methods are used to construct the self-supervised module of the unsupervised graph clustering task.
Many clustering methods based on auto-encoders achieve state-of-the-art performance. For non-graph data, the auto-encoder based on a deep neural network (DNN) [2] learns high-dimensional representations of the original data features to complete the clustering task; because graph data contains more information, such an auto-encoder is not suitable for processing it directly. As mentioned above, the auto-encoder based on GCN [3] cannot comprehensively consider node feature information when processing graph structured data. Because of the limitations of the GCN-based and DNN-based auto-encoders, we propose a combination of the two to process graph structured data.
In this article, we propose two auto-encoder modules based on GCN and DNN to fully consider the structured information in graph data, and we design deep GCN layers to learn the structured information of the graph. However, as the number of GCN layers increases, an over-smoothing problem occurs, which degrades clustering performance. Our model uses deep GCN layers with several techniques that alleviate the impact of over-smoothing, and experimental results demonstrate its superior performance. Inspired by clustering-distribution optimization methods based on deep learning, we design a triple self-supervised module to perform end-to-end training of the model.
The main contributions of this article are threefold: (1) We propose a deep graph structured clustering network (DGSCN) that applies a deep clustering method to graph structured data processing. Both graph node feature information and topology information are considered.
(2) A method with deep GCN layers for the clustering task is proposed. It improves clustering performance and alleviates the influence of over-smoothing. The experiments reported in this article can serve as a reference.
(3) We design a triple self-supervised module that combines the backbone network, the GCN auto-encoder, and the DNN auto-encoder to complete end-to-end self-supervised learning. In our experiments, the proposed DGSCN significantly outperforms state-of-the-art methods.
II. RELATED WORK
Recently, research on deep clustering has made considerable progress. Mathilde Caron et al. [12] adopt clustering results as pseudo-labels in order to train deep neural networks on big data. The deep clustering method of [6] combines the clustering loss of the auto-encoder with the reconstruction loss to jointly optimize cluster label assignments; its learning function is suitable for clustering with local-structure retention and helps the auto-encoder learn better data representations. The auto-encoder in [2] uses a neural network to perform representation learning, capturing high-dimensional interpretable features of the original data. However, all of these methods focus on learning representations of non-graph data features.
In the past few years, various methods have been proposed for graph clustering. For data without an explicit graph structure, a method of mining topology information [18] has been proposed: a K-nearest-neighbor (KNN) graph is calculated from the original data to capture the constraint relationships between samples in an adjacency matrix. This method provides the basis of structured clustering for such data.
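As a concrete illustration (our own sketch, not necessarily the exact procedure of [18]), a symmetric KNN adjacency matrix can be built from raw features as follows:

```python
import numpy as np

def knn_graph(X, k):
    """Build a symmetric KNN adjacency matrix from raw feature rows."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                         # a node is not its own neighbour
    n = len(X)
    A = np.zeros((n, n))
    nn = np.argsort(d2, axis=1)[:, :k]                   # k nearest neighbours per node
    A[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    return np.maximum(A, A.T)                            # edge if either node picks the other

# two well-separated groups of 1-d points
X = np.array([[0.0], [0.1], [5.0], [5.1]])
A = knn_graph(X, k=1)
```

The symmetrization step matters: a raw KNN relation is directed, while the clustering methods above assume an undirected adjacency matrix.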
Recently, some graph clustering methods based on shallow learning paradigms have achieved state-of-the-art performance on network communities and attributed graphs. For network communities, a novel manifold regularized stochastic block model (MrSBM) [19] was proposed, a probabilistic modeling method based on stochastic block models (SBMs). Besides modeling edges within or connecting blocks, MrSBM also models vertex features using the probabilities of vertex-cluster preference and feature-cluster contribution. To consider both edge structure and node attributes, a latent factor model for community identification and summarization (LFCIS) [20] was proposed; LFCIS formulates an objective function that evaluates overall clustering quality while taking both edge topology and node features into consideration. For attributed graphs, FSPGA [21] was proposed to discover fuzzy structural patterns for graph analytics, taking both the graph topology and attribute values into account. Competitive experimental results validate the effectiveness of these methods, which put forward new ways of processing graph structured data.
On the other hand, some graph clustering methods based on deep models have made great progress on original graph datasets.
FIGURE 1. Overall framework of the proposed model: P is calculated from Q; P and Q form the triple self-supervised module, and their distributions constitute the loss function used to update the parameters of the entire network.
In order to exploit information that considers both the data content and the spatial structure of the graph in specific tasks, a co-clustering method [22] and a content-dissemination method [23] have been proposed. These methods do not take efficient account of the topology information in the graph structure if applied directly to sparse raw graphs. In order to learn the fundamental structure information, some GCN-based clustering methods have been proposed. The unsupervised method for latent graph-structure feature representations [3] obtains node embeddings through an encoder-decoder structure to support downstream tasks such as link prediction. Chun Wang et al. [24] adopt attention networks to capture the importance of neighboring nodes and use a KL-divergence loss to supervise the training process of graph clustering. Shirui Pan et al. [4] combine GCN with GAN to optimize the clustering distribution. All of the above GCN-based clustering methods aim to perform clustering by learning the topology relations of the data, but they ignore the implicit interaction between feature information and topology information. Xiaotong Zhang et al. [5] optimize convolution operations to better suit clustering tasks and outperform on some specific datasets, but the method does not generalize. Deyu Bo et al. [25] consider the features of the data itself and add latent representations of those features to the GCN layers for integrated learning, but ignore the spatial topology information of the graph structure itself. Furthermore, in all the above clustering methods the GCN layers are only a simple stack, which leads to insufficient learning of complex structured information. Some progress has been made with deep GCN layers: Guohao Li et al. [26] and Yu Rong et al. [27] increase the number of GCN layers so the network can learn more high-dimensional features.
These works alleviate the over-smoothing problem caused by deep GCN layers and demonstrate superior clustering performance. Their limitation lies in higher computational cost; in our method, we simplify the model as much as possible to reduce the extra computation.
In our work, we propose a specially weighted network with a deep graph convolution module. It contains a topology information embedding module, a feature information embedding module, a deep graph convolutional network module, and a triple self-supervised module. The deep graph clustering method aims to combine deep representation learning with graph-structure representation learning in clustering tasks. On this basis, the GCN layers are optimized to alleviate the influence of over-smoothing and improve clustering performance.

III. THE PROPOSED MODEL
In this section, we introduce the proposed deep graph structured clustering network (DGSCN). Its overall framework is shown in Fig. 1, and the mathematical notations used in the paper are listed in Table 1. Our model can be divided into four modules, and we introduce the process of building it. To accomplish the clustering task well, the model must consider the graph structured data comprehensively and use deep graph convolutional networks to learn it sufficiently. Graph data contains abundant structured information, including node feature information and topology information. Two auto-encoder modules based on DNN and GCN are proposed to learn node feature information and topology information, respectively; together they learn the graph structured information. The deep GCN module uses the learning results of the auto-encoders to generate the clustering distribution through deep graph convolution, and the triple self-supervised module combines the above three modules to complete the network's self-supervised training.
For the two auto-encoder modules, the objective function is based on the reconstruction of the adjacency matrix and the feature matrix. The result of their high-dimensional encoding is used to generate the clustering result distribution Q, and Student's t distribution [28] is then used as the kernel to generate the target distribution P. We expect the distributions P and Q to be as similar as possible; they have the same meaning as P and Q in Fig. 1. In the learning stage of the auto-encoders, the learning effect of the backbone network layers is evaluated in real time. GCN methods are used in the first few layers, in which we use the first-order Chebyshev polynomial approximation as the convolution kernel to perform the graph convolution operation [3]. DenseGCN [26] constructs dense links between layers, which greatly enriches the information flow. It is necessary for the GCN to learn the structured features of the data while alleviating the impact of the over-smoothing problem. After the deep GCN extracts the structured features, the predicted clustering result is generated at the last layer.
The KL divergence is calculated so that the predicted distribution Z and the target distribution P become as similar as possible. Finally, end-to-end self-supervised training is achieved through the proposed triple self-supervised module.

A. GCN MODEL FOR AUTO-ENCODER
Through the KNN method [18], graph structure representations are obtained for original data without topology relations. We use the GCN module to learn an interpretable embedding of the topology information. The inputs of the network are the feature matrix X of the graph nodes and the adjacency matrix A, which represents the topology relations between nodes. The graph auto-encoder consists of GCN layers, which aim to learn high-dimensional representations of the structural features in graph space. A GCN layer can be expressed by the following formula:

H^(l+1) = φ( D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l) )   (1)

where Ã = A + I_N is the adjacency matrix of the self-connected undirected graph, I_N is the identity matrix, N represents the number of nodes in the graph, D̃ is the degree matrix of Ã with D̃_ii = Σ_j Ã_ij, φ(·) is a nonlinear activation function, W^(l) is the trainable weight matrix of layer l, and D denotes the feature dimension.
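For illustration, a single propagation step of Eq. 1 can be sketched in NumPy as follows (the toy graph, features, and weights are our own assumptions, with φ = ReLU):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: phi(D^-1/2 (A + I) D^-1/2 H W), with phi = ReLU."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                     # self-connections: A~ = A + I_N
    d = A_tilde.sum(axis=1)                     # degrees of the self-looped graph
    D_inv_sqrt = np.diag(d ** -0.5)
    A_norm = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)      # ReLU activation

rng = np.random.RandomState(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-node path graph
H = rng.randn(3, 4)   # node features (N = 3, feature dim 4)
W = rng.randn(4, 2)   # layer weights mapping 4 -> 2 dimensions
H_next = gcn_layer(A, H, W)
```

Each output row mixes a node's own features with those of its neighbours, which is exactly the first-order approximation used in the paper.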
The GCN module is used to construct an auto-encoder that learns high-dimensional representations of spatial features in the encoding stage. The overall process can be summarized by the following formula:

Z^(l+1) = GCN( Z^(l), A; W_ge^(l) )

where GCN(·) is the graph convolutional operation of Eq. 1 and W_ge^(l) is the weight matrix of layer l of the GCN auto-encoder. The adjacency matrix A of the original data and the feature matrix X are the model inputs. The GCN module generates representations of the specified dimension, and the final output of the iterative process is Z ∈ R^(N×f), where f is the dimension of the expected representation. After the high-dimensional feature representations are obtained, the adjacency matrix of the original data is reconstructed from them, and an objective function is constructed to evaluate the quality of the graph embedding.
The probability that two nodes are connected can be predicted by p(Â|Z). In other words, we train a connection prediction layer based on the graph embedding:

p(Â_ij = 1 | z_i, z_j) = sigmoid(z_i^T z_j)

Here, the original data is treated as an undirected graph, and the adjacency matrix is a symmetric matrix; therefore the efficient inner product is used as the decoder. This can be expressed as:

Â = sigmoid(Z Z^T)

The adjacency matrix determines the structure of the graph. We set Y = q(Z|X, A) and Ȳ = p(Â|Z); both Y and Ȳ are adjacency matrices. We expect the reconstructed adjacency matrix to be similar to the original one, so the objective function is defined as follows:

L_adj = (1 / n²) Σ_i Σ_j E(Y_ij, Ȳ_ij)

where q(Z|X, A) represents the adjacency matrix of the input graph data, n is the number of nodes in the dataset, and E(·, ·) is the cross-entropy loss function.
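The inner-product decoder and the cross-entropy reconstruction objective can be sketched as follows (a minimal illustration with randomly chosen embeddings, not the exact training code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inner_product_decode(Z):
    """Reconstruct the adjacency: A_hat = sigmoid(Z Z^T)."""
    return sigmoid(Z @ Z.T)

def adjacency_loss(A, A_hat, eps=1e-9):
    """Mean binary cross-entropy between original and reconstructed adjacency entries."""
    return float(-np.mean(A * np.log(A_hat + eps)
                          + (1.0 - A) * np.log(1.0 - A_hat + eps)))

rng = np.random.RandomState(0)
Z = rng.randn(4, 2)                     # node embeddings from the GCN encoder
A_hat = inner_product_decode(Z)         # symmetric by construction
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
loss = adjacency_loss(A, A_hat)
```

Because Z Z^T is symmetric, the decoded Â is automatically a valid undirected adjacency estimate, matching the undirected-graph assumption in the text.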

B. DNN MODEL FOR AUTO-ENCODER
Embedding topology information alone is not sufficient to perform the clustering task. We propose a DNN auto-encoder to learn high-dimensional representations of each node and weight them into the backbone graph convolution module for iteration. There are several unsupervised learning methods for embedding feature information in high dimensions. The dependency relationships between nodes are temporarily ignored, and only the node features are considered. The DNN auto-encoder efficiently learns the feature information, which can be abstractly represented as high-dimensional features. First, each node feature is split out to generate the feature matrix X_f; this feature matrix is the input of a fully connected neural network, from which we obtain the high-dimensional embedding X_e of the original data. After the high-dimensional embedding is obtained, a fully connected network decodes it and finally computes the decoding result X̂_f. The network structures of the encoder and decoder layers are perfectly symmetrical; therefore the final X_f and X̂_f have the same dimensionality, and their values should be as similar as possible. The above process can be represented by the following equations:

X_e = E(X_f),  X̂_f = D(X_e)

where E(·) represents the encoder of the multi-layer fully connected neural network, D(·) represents the decoder, φ(·) represents the nonlinear activation function applied in each layer, such as ReLU(·), and W^(l) denotes the weight matrix of layer l.
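A minimal sketch of the symmetric encoder-decoder is given below (the layer sizes here are our own toy assumptions; the paper's actual dimensions are given in the parameter settings):

```python
import numpy as np

rng = np.random.RandomState(0)
relu = lambda x: np.maximum(x, 0.0)

enc_dims = [8, 16, 4]                  # encoder: 8 -> 16 -> 4 (embedding)
dec_dims = enc_dims[::-1]              # decoder mirrors the encoder: 4 -> 16 -> 8
W_enc = [rng.randn(a, b) * 0.1 for a, b in zip(enc_dims[:-1], enc_dims[1:])]
W_dec = [rng.randn(a, b) * 0.1 for a, b in zip(dec_dims[:-1], dec_dims[1:])]

def autoencode(X_f):
    """E(.): encode to X_e, then D(.): decode back to X_hat_f."""
    h = X_f
    for W in W_enc:
        h = relu(h @ W)
    X_e = h                             # embedding fed to the backbone network
    for W in W_dec[:-1]:
        h = relu(h @ W)
    X_hat = h @ W_dec[-1]               # linear output layer for reconstruction
    return X_e, X_hat

X_f = rng.randn(5, 8)
X_e, X_hat = autoencode(X_f)
recon = float(((X_f - X_hat) ** 2).mean())  # reconstruction objective to minimize
```

The symmetric layout guarantees that X̂_f has the same dimensionality as X_f, as required by the reconstruction objective.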

C. DenseGCN
DenseNet [30] is a deep-layer method for convolutional neural networks (CNNs) that uses close links between layers to improve information flow in the network, allowing results to be reused across layers. We apply a similar idea to GCN and construct dense links from each GCN layer to exploit the information at every layer; this method is applied to the graph clustering model. The DenseGCN structure is shown in Fig. 2. The framework can be expressed by the following formula:

G_{l+1} = H(G_l, W^(l)) = T( F(G_l, W^(l)), G_l )

where H is the DenseGCN convolution operation, which can be divided into F and T. F is the GCN layer function shown in Eq. 1, and W^(l) is the weight matrix of layer l of the backbone GCN network. T is a vertex cascading function that densely fuses the input graph G_0 with all intermediate GCN layers, so G_{l+1} contains the outputs of all previous GCN layers. For example, if F produces a D-dimensional feature matrix and the dimension of the input graph G_0 is D_0, then the dimension of each node feature in G_{l+1} is D_0 + D × (l + 1). DenseGCN and GCN comprise our backbone network. Here G_l = (V_l, E_l) represents the information of the data at GCN layer l: G_l contains node information V_l and edge information E_l. The node information is concretely represented as a feature matrix, which in our model is denoted H^(l); V_l represents the feature information of the data itself, and E_l represents the dependency relations between data points. We evaluate the results learned by the DNN auto-encoder and the GCN auto-encoder and apply them to the GCN layers and the DenseGCN layers to perform the clustering task. Specifically:

H^(l) = α G^(l) + β H_f^(l) + γ H_adj^(l)

where H^(l) is the feature input of the backbone network, G^(l) is the natural iteration result of the backbone GCN layer, H_f^(l) and H_adj^(l) are the learning results of the DNN and GCN auto-encoders, and α, β, and γ are the weight coefficients of each variable, whose domains are [0, 1].
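The weighted combination of the backbone iteration result with the two auto-encoder outputs can be sketched as follows (variable names are ours; the exact fusion rule is an assumption based on the description above):

```python
import numpy as np

def fuse_inputs(G_l, H_f, H_adj, alpha=1.0, beta=1.0, gamma=1.0):
    """Weight the backbone result and both auto-encoder outputs into one input."""
    return alpha * G_l + beta * H_f + gamma * H_adj

rng = np.random.RandomState(0)
G_l   = rng.randn(4, 3)   # natural iteration result of the backbone GCN layer
H_f   = rng.randn(4, 3)   # DNN auto-encoder representation (node features)
H_adj = rng.randn(4, 3)   # GCN auto-encoder representation (topology)
H_in = fuse_inputs(G_l, H_f, H_adj)   # feature input of the next backbone layer
```

With α = β = γ = 1 (the setting used in the experiments), the fusion is a plain sum of the three sources.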
Specifically, the content of iterative learning in the backbone network is obtained by weighting the learning results of the two auto-encoders and the original input. For classical GCN layers, the convolution formula is Eq. 1. The inspiration for DenseGCN is the close connection between layers: when computing the graph convolution, the outputs of the previous layers should be taken into account. To increase the information flow when constructing deep GCN layers, the result of DenseGCN, H^(l+1), can be computed as:

H^(l+1) = φ( D̃^(-1/2) Ã D̃^(-1/2) η(H^(0), H^(1), . . . , H^(l)) W^(l) )

where η(·) represents the splice (concatenation) function over the output vectors of the previous layers, and l denotes the l-th layer of the backbone network. In our model, H^(l+1) represents the clustering-dimension result.
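The dense concatenation η(·) and the resulting feature growth D_0 + D × (l + 1) can be illustrated with a two-layer toy sketch (dimensions are our own assumptions):

```python
import numpy as np

def normalize(A):
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def dense_gcn(A, H0, weights):
    """Each layer convolves eta(H0, ..., Hl): the concatenation of all previous outputs."""
    A_norm = normalize(A)
    outputs = [H0]
    for W in weights:
        H_cat = np.concatenate(outputs, axis=1)          # eta(.): splice previous outputs
        outputs.append(np.maximum(A_norm @ H_cat @ W, 0.0))
    return np.concatenate(outputs, axis=1)

rng = np.random.RandomState(0)
A = np.array([[0, 1], [1, 0]], dtype=float)
H0 = rng.randn(2, 3)                          # D0 = 3
weights = [rng.randn(3, 2), rng.randn(5, 2)]  # layer l consumes D0 + 2*l dims, emits D = 2
G_out = dense_gcn(A, H0, weights)             # final node dim: 3 + 2 + 2 = 7
```

Note how the input width of each weight matrix grows with depth, which is the source of DenseGCN's extra cost discussed earlier.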
The clustering distribution Z of the GCN module in the backbone network can be defined by the following formula:

Z = softmax( D̃^(-1/2) Ã D̃^(-1/2) H^(L) W^(L) )

where H^(L) represents the output of the last layer of the backbone GCN network.
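The final softmax step can be sketched as follows (our illustration with placeholder inputs; a row of Z is one node's soft assignment over clusters):

```python
import numpy as np

def cluster_distribution(A_norm, H_last, W_last):
    """Z = softmax(A_norm H W), computed row-wise and numerically stabilized."""
    logits = A_norm @ H_last @ W_last
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.RandomState(0)
A_norm = np.eye(3)          # stand-in for the normalized adjacency
H_last = rng.randn(3, 4)    # output of the last backbone layer
W_last = rng.randn(4, 2)    # maps features to K = 2 cluster logits
Z = cluster_distribution(A_norm, H_last, W_last)
```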

D. TRIPLE SELF-SUPERVISED MODEL
Through the two auto-encoders, our model obtains data representations of the feature and topology information of the graph. However, the DNN and GCN models are not directly suitable for unsupervised training. Considering the characteristics of the graph clustering task and the proposed model, we propose a triple self-supervised module based on optimizing the distribution of clustering results. The triple self-supervised module computes the KL divergence between the clustering result distribution and the target distribution as part of the loss function. It combines the backbone GCN network, the DNN auto-encoder, and the GCN auto-encoder in a unified module to realize end-to-end self-supervised training. The clustering result is generated from the learning results of the two auto-encoders, and the similarity between each representation and the cluster centers is then calculated. Specifically, for the i-th sample and the j-th cluster, we use Student's t distribution [28] as the kernel to measure the similarity between the data representation and the cluster center:

q_ij = (1 + ||h_i − μ_j||² / v)^(−(v+1)/2) / Σ_{j'} (1 + ||h_i − μ_{j'}||² / v)^(−(v+1)/2),  h_i = ε h_i^f + (1 − ε) h_i^adj

where h_i^f is the output of the DNN auto-encoder for sample i, h_i^adj is the output of the GCN auto-encoder for sample i, μ_j is initialized by K-means on the representations learned by the auto-encoders before training, v is the degree of freedom of Student's t distribution [28], q_ij is regarded as the probability of assigning sample i to cluster j, and ε represents the weighting of the final result distributions of the two auto-encoders. We use Q = [q_ij] as the clustering result distribution, and f_j = Σ_i q_ij is the soft cluster frequency. After obtaining Q, we aim to improve cluster purity and normalize the loss contribution of each centroid to prevent large clusters from distorting the hidden feature space.
In the target distribution P, the squared entries of each distribution in Q are normalized by the soft cluster frequencies. In this way, the spatial feature information of the graph structure and the features of the data itself are both considered. To sharpen the distribution toward higher confidence, the target distribution is derived as:

p_ij = (q_ij² / f_j) / Σ_{j'} (q_ij'² / f_j')

Here the target distribution P is obtained. Next, we use P to train the cluster distribution Z output by the backbone network, with the objective:

L_gcn = KL(P ‖ Z) = Σ_i Σ_j p_ij log(p_ij / z_ij)

Now the result distribution of the backbone network and the learning results of the two auto-encoders constitute our objective function. Since the P distribution is initially obtained by evaluating the results of the DNN and GCN auto-encoder modules, the DNN and GCN modules benefit from the target distribution P in low-dimensional feature representation by minimizing the KL-divergence loss, and P is used to train the Z generated by the backbone GCN network. The overall objective function is computed as:

L = w₁ KL(P ‖ Q) + w₂ KL(P ‖ Z) + w₃ L_adj

where w₁ ∈ [0, 1] is the weight of the KL divergence between the target distribution P and the clustering result distribution Q (P contains the learning results of the GCN and DNN auto-encoders), and w₂ ∈ [0, 1] is the weight of the KL divergence between P and the clustering distribution Z output by the backbone GCN. w₃ ∈ [0, 1], except that w₃ = 0 when the data does not come from a native graph dataset, because graph structure generated by KNN [18] cannot describe the true dependencies. L_adj represents the learning effect of the GCN auto-encoder, i.e., the difference between the reconstructed spatial structure and that of the original data. We call this the triple self-supervised module.
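The soft assignment Q, the target distribution P, and one KL term of the objective can be sketched as follows (a NumPy illustration under our assumptions; in the model, μ would be K-means initialized as described above):

```python
import numpy as np

def soft_assign(H, mu, v=1.0):
    """Student's t kernel: q_ij proportional to (1 + ||h_i - mu_j||^2 / v)^(-(v+1)/2)."""
    d2 = ((H[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / v) ** (-(v + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(Q):
    """p_ij = (q_ij^2 / f_j) normalized per row, with soft frequency f_j = sum_i q_ij."""
    w = Q ** 2 / Q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_div(P, Q, eps=1e-12):
    """KL(P || Q) summed over samples and clusters."""
    return float((P * np.log((P + eps) / (Q + eps))).sum())

rng = np.random.RandomState(0)
H = rng.randn(6, 4)        # fused embeddings h_i of 6 samples
mu = rng.randn(3, 4)       # 3 cluster centers
Q = soft_assign(H, mu, v=1.0)
P = target_distribution(Q)
loss = kl_div(P, Q)        # one term of the triple self-supervised objective
```

Dividing by the soft frequency f_j is what normalizes each centroid's loss contribution and keeps large clusters from dominating, as stated above.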

E. ALGORITHM OVERVIEW AND COMPLEXITY ANALYSIS
Algorithm 1 summarizes the DGSCN workflow. We set d as the dimension of the input data and denote the dimensions of the successive auto-encoder layers by d_1, d_2, . . . , d_l.

Algorithm 1 DGSCN Training
1: Initialize the auto-encoder parameters with the pre-trained DNN auto-encoder, and W^(l) with the GCN backbone network;
2: Initialize (μ_1, μ_2, . . . , μ_K) with K-means on the representations learned by the pre-trained auto-encoder;
3: for iter = 0, 1, . . . , N do
4:   for l = 0, 1, . . . , L do
5:     perform forward propagation and update the parameters;

The per-iteration cost of the auto-encoders is governed by the product of the layer dimensions (d_1 d_2 . . . d_l). Finally, considering the nature of the clustering task itself, we assume the number of clusters in the final task is K; the time complexity of the clustering step can be obtained from the result in [31].

IV. EXPERIMENT
A. DATASETS
The statistics of the datasets used in our experiments are summarized in Table 2.
(1) Wiki [32]: A webpage network where nodes are webpages, connected if one links to the other. It is a typical dataset with a small number of nodes but an abundant spatial topology relationship between nodes.
(2) Reuters [33]: Reuters is a text dataset containing approximately 810,000 English news stories labeled with a category tree. We use 4 root categories as labels and cluster a random subset of 10,000 examples.
(3) Citeseer : This dataset contains the sparse word bag feature vectors of each document and a list of citation links between the documents. The access to the dataset: http://citeseerx.ist.psu.edu/index.
(4) DBLP : DBLP is an integrated database of computer-science research results centered on authors. We mainly focus on the cooperation among authors: if two authors are co-authors, there is an edge between them, and the authors are divided into four areas. The dataset labels each author's research area based on the conferences they submitted to. The access to the dataset: https://dblp.uni-trier.de.
(5) ACM : ACM is a publication relationship dataset: if two papers are written by the same author, there is an edge between them. Paper keywords are used as features. We choose papers published at KDD, SIGMOD, SIGCOMM, and MobiCOMM, and divide them into three categories according to research field. The access to the dataset: http://dl.acm.org/. To investigate the clustering task further, we run experiments on Wiki and ACM; these two datasets have few nodes but abundant spatial topology information.

B. BASELINES
We compare our proposed model with nine other models. The details are as follows: (1) K-means [34]: K-means is a classical machine learning method. It randomly selects K seed points, computes the distance between every point and the seed points, and assigns each point to its nearest seed point's group. After all points are assigned, each seed point is moved to the center of its group, and the steps are iterated until the seed points no longer move.
(2) DeepWalk [35]: This model has two parts: random walks and feature-vector generation. First, the random-walk algorithm extracts vertex sequences from the graph; second, the natural language processing tool word2vec produces a representation of each vertex. DeepWalk is a graph embedding method.
(3) AE [36]: The auto-encoder has two parts, an encoder and a decoder. The node features of the original data are encoded, and the original data is restored by the decoder. K-means is then applied to the learned high-dimensional features to obtain the clustering result. This approach completely ignores topology information.
(4) IDEC [6]: IDEC jointly optimizes cluster label assignment and learns features suitable for clustering, using local structure preservation to improve the auto-encoder's learning performance.
(5) GAE [3]: This is an unsupervised graph embedding method that adopts GCN to learn data representations and performs the task by clustering the high-dimensional features. GAE only considers the reconstruction of graph topology information.
(6) ARGE & ARVGE [4]: Based on the classical graph clustering algorithm GAE, an adversarial training mechanism is proposed. The clustering result generated by GAE is used as the input of a generative adversarial network (GAN) to learn the distribution of clustering results, and clustering is accomplished by combining the two modules. The algorithm fails to fully consider the graph structure information.
(7) DAEGC [24]: It uses attention networks to learn node representations and a clustering loss to supervise graph clustering. Unlike GCN-based methods, it adopts a graph attention mechanism in place of the graph convolution operation for iteration. DAEGC only considers the reconstruction of graph topology.
(8) SDCN [25]: It uses GCN and DNN auto-encoders to jointly construct a clustering network. And the self-supervised module is proposed to update the parameters of graph convolution module and DNN auto-encoder module. There are some limitations in the processing of topology information.
(9) AGC [5]: AGC optimizes graph convolution in the frequency domain to make it more suitable for clustering tasks. It performs well on individual datasets but lacks generalization ability.
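For reference, the K-means baseline (1) above can be sketched with Lloyd's iterations as follows (the initial centers are passed explicitly here for a deterministic demo, whereas the baseline selects them randomly):

```python
import numpy as np

def kmeans(X, centers, iters=50):
    """Lloyd's iterations: assign each point to its nearest center, then recenter."""
    mu = centers.copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)                        # nearest-center assignment
        new_mu = np.array([X[labels == k].mean(axis=0) if (labels == k).any() else mu[k]
                           for k in range(len(mu))])
        if np.allclose(new_mu, mu):                       # centers stopped moving
            break
        mu = new_mu
    return labels, mu

# two obvious groups of 2-d points
X = np.vstack([np.zeros((5, 2)), 10.0 * np.ones((5, 2))])
labels, mu = kmeans(X, centers=X[[0, 9]])
```

The same routine is also what initializes the cluster centers μ_j in the proposed triple self-supervised module.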

C. EVALUATION INDICATORS
We adopt four popular clustering evaluation indicators. Accuracy (ACC) measures the accuracy of the clustering result. Normalized Mutual Information (NMI) measures the amount of correct clustering information contained in our clustering results. Adjusted Rand Index (ARI) evaluates clustering results by calculating the similarity between two partitions, and is an improvement on the Rand Index (RI).
Macro F1-score (F1) is the evaluation index obtained by combining the precision and recall of the clustering results. For each evaluation indicator, a higher value indicates better clustering performance.
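Computing ACC requires a one-to-one matching between predicted cluster ids and ground-truth labels; a common implementation (a sketch under our assumptions, using SciPy's Hungarian solver) is:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one mapping between cluster ids and labels (Hungarian method)."""
    K = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((K, K), dtype=int)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                          # co-occurrence of prediction p and label t
    row, col = linear_sum_assignment(-count)      # maximize total matched samples
    return count[row, col].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])             # same partition, permuted cluster ids
acc = clustering_accuracy(y_true, y_pred)         # perfect partition -> 1.0
```

The matching step is essential because cluster ids are arbitrary: a perfect partition with permuted ids must still score ACC = 1.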

D. PARAMETER SETTINGS
For fair comparison, all methods are run on the same hardware platform with a GTX 2080Ti GPU. For the DNN-based (AE + K-means, DEC, IDEC, SDCN) and GCN-based clustering methods, a pre-trained auto-encoder is adopted. We train the DNN auto-encoder end-to-end on all data points for 30 epochs with a learning rate of 10^-3. To be consistent with previous methods, the size of the auto-encoder is d-500-500-2000-10, where d is the dimension of the input data and the remaining numbers are the dimensions of the successive layers, as chosen in previous work.
For the GCN-based methods, the dimensions of GAE and VGAE are d-256-16, trained for 30 epochs on all datasets. For DAEGC, the settings of [24] are used. For SDCN, the hyperparameter settings proposed in [25] are adopted, and the best result is listed. AGC uses an adaptive number of graph convolution iterations, so it need not be set; the rest of its parameter settings can be found in [5].
The parameters of our model DGSCN are set as follows. For datasets with an original graph, the prior knowledge and spatial structure are significant, so the corresponding term is kept in the objective function; for data whose graph is constructed by KNN, the spatial structure feature is insufficient, so this term in the loss function is ignored. For the five datasets, the layer number n of DenseGCN is 36. The weights of the loss function are uniformly set to w1 = 0.1, w2 = 0.01, and w3 = 1 in the experiments. The weights of the feature vectors participating in the backbone network iteration are α = 1, β = 1, and γ = 1. The weight used when generating the target distribution P of the clustering results is set to 0.1, and for all experiments we let v = 1 in Eq. 16. The K-means algorithm is used to generate the clustering results of all models.
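The soft assignment with v = 1 in Eq. 16 is the Student's t kernel commonly used in deep clustering (as in DEC), and the target distribution P sharpens it. A sketch under that assumption, with illustrative variable names:

```python
import numpy as np

def soft_assignment(z, centers, v=1.0):
    """Q: Student's t kernel similarity between embeddings and cluster centers."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / v) ** (-(v + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)       # rows sum to 1

def target_distribution(q):
    """P: sharpen Q by squaring and normalizing by soft cluster frequency."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
z = rng.random((8, 10))        # 8 node embeddings of dimension 10
centers = rng.random((3, 10))  # 3 cluster centers
q = target_q = soft_assignment(z, centers)
p = target_distribution(q)     # self-supervised target for the KL loss
```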
From Table 3 we draw two conclusions: (1) Our method has an obvious advantage when the dataset itself has an original adjacency matrix representation. Owing to full consideration of the feature and topology information of the original graph, superior results are obtained on DBLP and Citeseer. The reconstruction of the graph dependencies is added to the self-supervised module for learning, and then the two auto-encoder modules are used to update the parameters.
(2) If a dataset does not have an original graph structure, KNN [18] is used to generate the graph structure. However, the adjacency matrix of a graph constructed by KNN is not reliable, because it does not fully reflect the dependencies in the data itself; if the topology information is generated by KNN, our model may not work well. Moreover, the backbone network module uses deep GCN layers, which amplifies the influence of the topology information to a certain extent, so our model cannot highlight its advantages in this setting.
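For reference, constructing a KNN graph from raw features typically looks like the following sketch (a brute-force distance computation with symmetrization; the parameter k and the distance metric are illustrative choices, not the paper's exact setting):

```python
import numpy as np

def knn_adjacency(x, k=5):
    """Build a symmetric KNN adjacency matrix from raw feature vectors."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)  # squared distances
    np.fill_diagonal(d2, np.inf)          # a node is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]    # k nearest neighbors per node
    a = np.zeros_like(d2)
    rows = np.repeat(np.arange(len(x)), k)
    a[rows, nn.ravel()] = 1.0
    return np.maximum(a, a.T)             # symmetrize the directed KNN edges

rng = np.random.default_rng(0)
x = rng.random((20, 8))     # 20 samples with 8 features each
A = knn_adjacency(x, k=3)
```

As the text notes, such an adjacency matrix only encodes feature-space proximity, so it cannot recover dependencies that are not reflected in the raw features.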
One conclusion can be drawn from Table 4: if the topology information between nodes is abundant, our model shows obvious superiority, because the topology enhances the representations. The topology learning results are added to the backbone GCN layers for iteration, and we achieve excellent experimental results on the Wiki dataset.
We fully consider the weight of the data itself and the dependencies represented by the adjacency matrix, so our approach is superior to methods that only consider the graph structure for clustering, such as GAE, VGAE, and DAEGC. Because the auto-encoder module for graph reconstruction is added, our method is also superior to SDCN, which considers only data features. Furthermore, compared with the spectral method AGC, our model has wider applicability.

F. ANALYSIS OF DenseGCN LAYERS
Novel deep GCN layers are proposed in the model. Their effect on the overall clustering task is analyzed as follows. To show the experimental results in detail, we report results on four of the experimental datasets.
We fix the other parameters: the loss function weights are set to w1 = 0.1, w2 = 0.01, w3 = 1; the weight parameters of the features in the iteration are α = 1, β = 1, and γ = 1; and the weight used when generating the target distribution P of the clustering results is 0.1. From the experimental results in Fig. 3, we draw the following conclusions: (1) As the number of layers increases, the clustering accuracy first increases and then decreases after a peak. For different datasets, the information content determines the peak accuracy of the clustering task. Nevertheless, we can conclude that deep GCN layers have a positive effect on clustering accuracy; the most appropriate number of GCN layers needs to be tuned for each dataset.
(2) Curve analysis of the four datasets reveals that DBLP has the smallest node feature dimension. The DBLP curve peaks earliest, and as the number of layers increases, clustering accuracy declines due to over-smoothing. Reuters is the largest dataset in our experiments and contains rich feature information; its curve shows that the deep-layer strategy is more suitable for datasets with large information content, while for smaller datasets clustering performance decreases. Over-smoothing usually makes naively stacking plain GCN layers infeasible, so we propose DenseGCN to alleviate its influence.
(3) Deep GCN layers improve clustering performance, but the strategy should be tailored to the features of the dataset itself. For clustering tasks that already perform well, however, the accuracy improvement brought by deep GCN layers is not obvious.
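The idea behind dense GCN layers can be sketched as follows: each layer receives the concatenation of all previous layer outputs (DenseNet-style skip connections), which preserves early representations as depth grows. This is an illustrative NumPy sketch of the propagation rule, not the paper's implementation; dimensions and weights are toy values:

```python
import numpy as np

def normalize_adj(a):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    a = a + np.eye(len(a))
    d_inv_sqrt = np.diag(a.sum(1) ** -0.5)
    return d_inv_sqrt @ a @ d_inv_sqrt

def dense_gcn_forward(a_hat, h0, weights):
    """Each layer takes the concatenation of all previous layer outputs."""
    feats = [h0]
    for w in weights:
        h = a_hat @ (np.concatenate(feats, axis=1) @ w)  # dense skip input
        feats.append(np.maximum(h, 0.0))                 # ReLU
    return feats[-1]

n, d, hid = 6, 4, 8
rng = np.random.default_rng(0)
m = (rng.random((n, n)) > 0.5).astype(float)
np.fill_diagonal(m, 0.0)
a_hat = normalize_adj(np.maximum(m, m.T))
h0 = rng.standard_normal((n, d))
# Input width of layer l grows as d + l * hid due to the concatenation.
weights = [rng.standard_normal((d + i * hid, hid)) * 0.1 for i in range(3)]
out = dense_gcn_forward(a_hat, h0, weights)   # shape (6, 8)
```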

G. ANALYSIS OF WEIGHTS BASED ON GCN AUTO-ENCODER
In order to exploit the overall information of the graph data, we add the learning results of graph reconstruction to the backbone network. In our experiment, γ = 0 means that only the DNN auto-encoder module performs the weighting operation. The results are shown in Fig. 4, and the conclusions are as follows: (1) Since we use the deep GCN layer strategy, our experimental results are superior to SDCN [25] even when γ = 0. The results indicate that when the value of γ is small, the graph reconstruction feature is insufficient; moreover, since the graph reconstruction loss is part of the objective function, the performance becomes worse. As the weight coefficient increases, the result tends to stabilize.
(2) When our model is applied to datasets with superior average performance, weight method of graph reconstruction has no obvious superiority, especially for ACM dataset.
(3) When the dataset is complicated, such as the Wiki dataset, and especially when the dataset has abundant topology information, the graph reconstruction weight brings a clear improvement.

H. ANALYSIS OF OVER-SMOOTHING
Over-smoothing is one of the most important challenges in graph convolutional networks (GCNs). If the over-smoothing problem is not addressed, the number of training iterations increases, and the clustering accuracy and related evaluation indicators decline. Therefore, the deep GCN layers in our proposed model should not only perform the clustering task efficiently but also alleviate over-smoothing. Here we compare our proposed DGSCN with SDCN [25] and GAE [3] on the Citeseer and Wiki datasets; the results are shown in Fig. 6. The number of iterations is 200, and we examine the 150 iterations after each model reaches its peak. The conclusions are as follows: (1) Because our model is more complicated and has more parameters, SDCN and GAE need fewer training iterations to achieve peak performance. However, after reaching the peak, our model significantly alleviates the over-smoothing influence, and its clustering accuracy only slightly decreases.
(2) SDCN and GAE have fewer model parameters, so they reach their best performance faster. However, as the number of training iterations increases, the instability of these models leads to performance fluctuation; their clustering accuracy shows a downward trend and the models become unstable.
(3) The Wiki dataset has 17 clustering categories. Over-smoothing does not have much impact on the poorly performing methods, because they cannot learn the data features effectively and the large number of categories also affects them.
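The over-smoothing effect itself is easy to reproduce: repeatedly applying plain (non-dense) GCN propagation drives node representations toward the same direction, so nodes become indistinguishable. A toy NumPy demonstration on a random graph, measuring the mean pairwise cosine similarity of node features before and after 50 propagation steps:

```python
import numpy as np

def normalize_adj(a):
    """D^-1/2 (A+I) D^-1/2 normalization with self-loops."""
    a = a + np.eye(len(a))
    d_inv_sqrt = np.diag(a.sum(1) ** -0.5)
    return d_inv_sqrt @ a @ d_inv_sqrt

def mean_cosine(h):
    """Average pairwise cosine similarity between node feature rows."""
    hn = h / np.linalg.norm(h, axis=1, keepdims=True)
    s = hn @ hn.T
    n = len(h)
    return (s.sum() - n) / (n * (n - 1))

rng = np.random.default_rng(1)
n = 30
m = (rng.random((n, n)) > 0.7).astype(float)
np.fill_diagonal(m, 0.0)
a_hat = normalize_adj(np.maximum(m, m.T))   # symmetric random graph

h = rng.standard_normal((n, 16))
before = mean_cosine(h)                     # near 0 for random features
for _ in range(50):
    h = a_hat @ h                           # plain propagation, no weights
after = mean_cosine(h)                      # close to 1: over-smoothed
```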

V. CONCLUSION
In this paper, we propose a deep graph structured clustering network (DGSCN) for processing graph structured data with a deep clustering method. The model considers both node features and topology information to improve performance. It uses novel deep GCN layers for clustering tasks, and a triple self-supervised module performs end-to-end unsupervised training. Experimental results demonstrate that our method achieves superior performance over state-of-the-art models.
Our proposed method also has some limitations. Deep layers incur more computational cost, so we are committed to finding more efficient deep GCN methods. Moreover, different datasets have different graph structures; many datasets have fewer node features but richer relational features. We hope to find an adaptive processing method, which is one of our future research focuses.