Semi-Supervised Node Classification With Discriminable Squeeze Excitation Graph Convolutional Networks

In recent years, Graph Convolutional Networks (GCNs) have been widely used for graph data representation and semi-supervised learning. GCNs can reveal and mine irregular data with spatial topological structure. However, in the task of node classification, most models suffer from over-smoothing (indistinguishable representations of nodes in different classes) after stacking multiple GCN layers. To alleviate the over-smoothing issue, we propose a novel Discriminable Squeeze and Excitation Graph Convolutional Network (D-SEGCN) based on a feature-wise attention mechanism for semi-supervised node classification. In the proposed D-SEGCN model, a Squeeze and Excitation (SE) module is fused with the Graph Convolutional Network to form SEGCN, which can selectively enhance significant features and realize adaptive calibration of the feature dimensions, thus enhancing node features. The feature representation obtained through a hierarchical structure is then put into a discriminable capsule layer, which judges the similarities between node features to obtain a new feature. Finally, the feature representation and the new feature are fused into the final feature to strengthen the discriminability of nodes. Furthermore, we demonstrate that D-SEGCN significantly outperforms state-of-the-art methods on three citation network datasets.


I. INTRODUCTION
A graph is a non-linear data structure. Many real-life examples, such as transportation networks, subway networks, social networks, and state execution (automata) in computers, can be abstracted into graphs. Classification of graph-structured data is fundamental in many fields, and node classification is an important and basic task in graph data mining. GCNs are often used to learn a representation of each node and then classify it. In reality, due to the tremendous amount of graph data and the high labeling cost, we usually face a semi-supervised node classification scenario. For example, in semi-supervised node classification on a citation network, where nodes represent articles and edges represent citation relationships, the task is to predict the label of each article given only a few labeled articles.
The associate editor coordinating the review of this manuscript and approving it for publication was Hazrat Ali.
In recent years, as a new graph learning technology, Graph Neural Networks (GNNs) have attracted much attention. The concept of GNNs was first proposed by Gori et al. [1] and further elaborated by Scarselli et al. [2]. Early GNNs mainly solved graph-theoretic problems, such as the classification of molecular structures. In fact, many common scenes in Euclidean space (such as images) or sequences (such as text) can also be converted into graphs and then modeled with GNN techniques. Based on graph signal processing, Bruna et al. [3] proposed convolutional neural networks on graphs in both the spectral domain and the spatial domain.
The idea of spectral-domain graph convolutional networks is to implement convolution operations on topological graphs via the graph Fourier transform. Spectral-domain graph convolutional networks have been steadily improved and extended, including Spectral CNN [3], Chebyshev Spectral CNN (ChebNet) [4], 1stChebNet [5], [6], and [7]. Notably, Kipf and Welling [5] determined the convolutional network structure through a local first-order approximation of spectral graph convolution; node embeddings are then generated by combining neighborhood information under this scheme.
Since spectral methods usually have difficulty handling large-scale graphs, spatial-domain graph convolutional networks have developed rapidly in recent years. Spatial convolution can be compared to convolution performed directly on image pixels: its operation is defined directly on the connection relationships of each node, which is more similar to traditional Convolutional Neural Networks. Micheli et al. [8] proposed the Neural Network for Graphs (NN4G), the first spatial-domain GCN. NN4G realized graph convolution by directly accumulating the neighborhood information of each node, and memorized the information of each layer through residual connections. Based on adjacency matrix exploration, Li et al. [9] proposed a model that learns unknown hidden structural relationships through the adjacency matrix and constructs a residual graph adjacency matrix using a reasonable distance function between two nodes. Zhuang et al. [10] proposed the Dual Graph Convolutional Network (DGCN) model, in which two convolutional neural networks are devised to embed local-consistency-based and global-consistency-based knowledge, respectively. In addition, Velicković et al. [11] proposed Graph Attention Networks (GAT), which use an aggregation function to fuse adjacent nodes, random walks, and candidate models in the graph to learn new representations. Thekumparampil et al. [12] proposed a semi-supervised GCN based on an attention mechanism, and pointed out that the key to the extended model is the communication layer, not the perception layer. The attention mechanism tends to select neighbor nodes of the same category as the center node, and embedding it into the model gives the model strong correlation and interpretability. The Diffusion Convolutional Neural Network (DCNN) was proposed by Atwood et al.
[13]; it regarded graph convolution as a diffusion process, assuming that information transfers from one node to another with a certain transition probability, so that the information distribution reaches equilibrium after several rounds. Peng et al. [15] proposed a novel clustering method that minimizes the discrepancy between pairwise sample assignments for each data point. As explained in [9], each GCN layer acts in essence as a form of Laplacian smoothing, which makes the features of nodes in the same connected component similar. The GraphSAGE model proposed by Hamilton et al. [14] modified GCN in two respects. On the one hand, the neighbor-sampling strategy transformed GCN from a full-graph training model into a node-centric mini-batch training model; on the other hand, the algorithm extended the neighbor aggregation operation. Peng et al. [16] proposed a novel subspace clustering approach by introducing a new deep model, the Structured AutoEncoder (StructAE). The StructAE learns a set of explicit transformations to progressively map input data points into nonlinear latent spaces while preserving the local and global subspace structure. Consequently, adding too many convolutional layers makes the output features over-smoothed and renders them indistinguishable. Some recent methods proposed by Chen et al. [17], Xu et al. [18], and Ying et al. [19] try to obtain global information through deeper models; some of them are unsupervised and some need many training examples.
However, these GCN-based models rarely pay attention to the over-smoothing issue. In the node classification task, over-smoothing means that the representations of nodes in different classes become similar. The cause of over-smoothing is the stacking of multiple GCN layers: the more GCN layers are stacked, the more neighbors each node aggregates. The interaction between higher-order neighbors brings noise and dilutes useful information, so the final representations of all nodes tend to be consistent and difficult to distinguish.
To alleviate the over-smoothing issue, in this paper we propose a novel architecture of Discriminable Squeeze and Excitation Graph Convolutional Networks, D-SEGCN for brevity, for node classification on graphs. Inspired by the success of applying deep architectures and the pooling mechanism to image classification tasks, we design a model with a feature-dimension attention mechanism to enhance the node features. The model is a deep hierarchical network; the main idea of D-SEGCN is to weight feature dimensions with an attention mechanism, which can highlight the important features of nodes. As illustrated in Fig. 1, D-SEGCN mainly consists of eight SEGCN modules (each containing two SEGCN layers), four coarsening layers, four refining layers, a capsule network layer, and a softmax classifier layer. The adjacency matrix A and the feature matrix X of graph G serve as the network inputs.
The main advantages of the proposed D-SEGCN for semi-supervised node classification tasks are summarized as follows.
Firstly, compared with the previous work proposed by Hu et al. [18], the proposed model fuses an SE module into the GCN. By incorporating the Squeeze and Excitation module into the GCN, the network can use global information to selectively enhance significant features and suppress useless ones. The proposed model realizes adaptive calibration of the feature dimensions and enhances the node features. These operations can alleviate over-smoothing during multi-level neighborhood aggregation.
Secondly, the feature representation obtained through the hierarchical structure and the new feature from the discriminable capsule layer are fused; this both remembers the rich information in the graph structure and judges the similarities among features, strengthening the discriminability of nodes and thus effectively alleviating the over-smoothing problem.
Thirdly, the performance of the proposed model is tested on three citation network datasets. The results show that the D-SEGCN model is advanced and effective compared to state-of-the-art models.

II. PRELIMINARIES AND RELATED WORK
Let G = (V, E) be the input graph (weighted or unweighted), where V and E are the node set and edge set, respectively. Let A be the |V| × |V| adjacency matrix of the graph, with each entry A_ij denoting the weight of the edge between nodes i and j. Let X = (x_1, . . . , x_n) ∈ R^{n×p} be the node feature matrix, where p is the dimension of the attributive features. We use edge weights to indicate the connection strength between nodes.
In this paper, the input is an undirected graph G_1 = (V_1, E_1), where V_1 and E_1 are the sets of n_1 nodes and e_1 edges, respectively, and A_1 ∈ R^{n_1×n_1} is the adjacency matrix describing the edge weights. For the D-SEGCN model, the graph fed to the j-th layer is denoted Q_j with n_j nodes. The adjacency matrix and hidden representation matrix of Q_j are denoted A_j ∈ R^{n_j×n_j} and P_j ∈ R^{n_j×p_j}, respectively. Given the labeled node set V_L containing m nodes (far fewer than n_1), where each node v_j ∈ V_L is associated with a label y_j ∈ Y, our objective is to predict the labels of V \ V_L.

A. GRAPH CONVOLUTIONAL NETWORKS
In the past few years, there has been a surge of applying convolutions on graphs. A GCN contains one input layer, several propagation (hidden) layers, and one final perceptron layer. Given an input feature matrix X^{(0)} = X and adjacency matrix A, GCN conducts the following layer-wise propagation in the hidden layers as formula (1):

X^{(k+1)} = σ( D̂^{-1/2} Â D̂^{-1/2} X^{(k)} W^{(k)} ),   (1)

where Â = A + I is the adjacency matrix with self-loops, D̂ is the degree matrix of Â, and W^{(k)} is a layer-specific weight matrix that needs to be trained. We denote by σ(·) an activation function, such as ReLU(·) = max(0, ·), and by X^{(k+1)} ∈ R^{n_{k+1}×d_{k+1}} the output activations of the k-th layer. For semi-supervised classification, GCN [5] defines the final perceptron layer as formula (2):

Z = softmax( D̂^{-1/2} Â D̂^{-1/2} X^{(K)} W^{(K)} ),   (2)

where W^{(K)} ∈ R^{d_K×o} and o denotes the number of classes. The final output Z ∈ R^{n×o} denotes the label predictions for all data X, in which each row Z_i is the label prediction for the i-th node.
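The propagation rule of formula (1) can be sketched in a few lines of NumPy. This is a minimal illustration of the symmetric normalization and ReLU activation described above, not the authors' implementation; the toy path graph and all-ones weight matrix are placeholders.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN propagation step: X' = ReLU(D^-1/2 (A+I) D^-1/2 X W).
    A: (n, n) adjacency matrix, X: (n, d) features, W: (d, d') weights."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1)                   # degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # D^-1/2
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return np.maximum(0, A_norm @ X @ W)      # ReLU activation

# toy graph: 3 nodes on a path 0-1-2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.eye(3)          # one-hot node features
W = np.ones((3, 2))    # toy weight matrix
H = gcn_layer(A, X, W)
```

Stacking many such layers is precisely what causes the over-smoothing discussed in the introduction, since each application mixes a node's representation with its neighbors'.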

B. SQUEEZE AND EXCITATION NETWORKS
Hu et al. [21] proposed the Squeeze and Excitation Network (SENet), a structure that learns better feature representations by applying a learned branch structure to the input. Structurally, branches are used to learn how to evaluate the correlation between channels, and the result is then applied to the original feature map to recalibrate the input. To let the network measure channel association from global information, global pooling is used to capture the global information, and two fully connected layers are then used to recalibrate the input, which enables the network to learn a better representation. The Squeeze and Excitation operations are described below.

C. SQUEEZE OPERATION
The squeeze operation first compresses a feature map of dimension H × W × C to features of size 1 × 1 × C using global pooling to obtain global information. The specific operations are given in formulas (3) and (4):

u_c = v_c * X = Σ_s v_c^s * x^s,   (3)

z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j),   (4)

where v_c is the c-th convolution kernel, x^s is the s-th input channel, u ∈ R^{W×H×C}, u_c is the c-th H × W matrix in u, and the subscript c indexes the channels. The squeeze operation F_sq(·) is a global average pooling; therefore, it converts the H × W × C input into a 1 × 1 × C output.
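The squeeze step of formula (4) is just a global average pool over the spatial dimensions; a minimal NumPy sketch (the toy feature map is a placeholder):

```python
import numpy as np

def squeeze(U):
    """F_sq: global average pooling over the spatial dimensions.
    U: feature map of shape (H, W, C) -> descriptor z of shape (C,)."""
    return U.mean(axis=(0, 1))

U = np.arange(24, dtype=float).reshape(2, 3, 4)  # H=2, W=3, C=4
z = squeeze(U)
# z[c] is the mean of the 6 spatial entries of channel c
```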

D. EXCITATION OPERATION
Specifically, the excitation operation automatically learns the importance of each feature channel and then, according to this importance, enhances significant features and suppresses useless features for the current task, similar to an attention mechanism. In general, the network can use global information to selectively enhance significant feature channels and suppress useless ones, realizing adaptive calibration of the feature channels. The specific operation is given in formula (5):

s = F_ex(z, W) = σ(W_2 δ(W_1 z)),   (5)

where F_ex is the excitation function, i.e. the function of adaptive recalibration, δ is ReLU, and σ is the sigmoid function.
Here, W_1 is the parameter of the dimension-reduction layer and W_2 is the parameter of the dimension-increasing layer; z is the result of the previous squeeze operation, and W_1 z is a fully connected layer operation with W_1 ∈ R^{(C/r)×C}, where r is a scaling parameter. The purpose of this parameter is to reduce the number of channels and thus the amount of computation. Here z ∈ R^{1×1×C} and W_1 z ∈ R^{1×1×(C/r)}; after passing through a ReLU layer, the output dimension is unchanged. Then W_2 has dimension C × (C/r), so the output has dimension 1 × 1 × C. Finally, the sigmoid function is used to obtain S.

III. THE PROPOSED MODEL

A. NETWORK ARCHITECTURE
At layer j, the SE operation is performed as formulated in formulas (6), (7), and (8); then the GCN operation is performed using formula (9); finally, nodes with similar structure are aggregated into super-nodes to generate a coarser graph G_{j+1} with fewer nodes and node embedding matrix P_{j+1}. As illustrated in Fig. 1, D-SEGCN mainly consists of eight SEGCN modules (each containing two SEGCN layers), four coarsening layers, four refining layers, a capsule network layer, and a softmax classifier layer. The adjacency matrix A and the feature matrix X of graph G serve as the network inputs. The SEGCN layer is first applied to learn node representations; owing to the attention mechanism over feature dimensions, significant node features are strengthened. A coarsening operation is then performed to aggregate structurally similar nodes into super-nodes. After coarsening, each super-node represents native structure of the original graph, which makes it easy to exploit the graph's global structure. Following the coarsening layers, symmetric graph refining layers are applied to restore the original graph structure for the node classification task. Finally, in the discriminable capsule layer, a dynamic routing operation is performed on the feature matrix to form a new feature matrix.
The acquired features and the final new features are fused to enhance the distinguishability of the nodes, thus alleviating the over-smoothing issue caused by the continuous merging of nodes. It is worth noting that ⊕ is an addition operation. The graph G_{j+1}, with a p-dimensional latent representation for each of its n_{j+1} super-nodes, is generated in the j-th coarsening layer; the refining layer does just the opposite. The corresponding adjacency matrix A_{j+1} and representation matrix P_{j+1} are sent to the (j + 1)-th layer. Symmetrically, before the graph refining operation is executed, we first conduct the SE operation and then refine the coarsened graph to restore the finer graph structure. To ease optimization in deeper networks, we add shortcut connections [22] spanning each coarsened graph and its corresponding refined part.

B. SQUEEZE AND EXCITATION GRAPH CONVOLUTIONAL NETWORKS
Graph Convolutional Networks have achieved promising generalization in various tasks, and our work is based on GCNs. We combine squeeze and excitation networks with Graph Convolutional Networks to produce SEGCN. Taking the feature matrix as the network input and the feature dimension as the channel, the new feature matrix is put into the Graph Convolutional Network; see Fig. 2 for details. Every SEGCN layer consists of two steps, i.e., the squeeze and excitation operation and the graph convolution.

FIGURE 2. The workflow of the SEGCN module. Taking the feature matrix X ∈ R^{n×d} as the input and the feature dimension as the channel, through the F_sq(·) squeeze operation and the F_ex(·) excitation operation we get a 1 × d tensor S; we then multiply X and S via the F_scale operation to obtain the new feature matrix X̃, which is put into the Graph Convolutional Network.

1) SQUEEZE OPERATION
The squeeze operation uses global pooling: it first compresses the n × d feature matrix to 1 × d, so as to obtain information with a global receptive field. The specific operation is formula (6):

Z_j = F_sq(X_j) = (1 / N_j) Σ_{i=1}^{N_j} X_i,   (6)

where X_j ∈ R^{n_j×d_j}, d_j is the dimension of the attributive features in the j-th layer, X_i is the i-th row of X_j (the feature vector of the i-th node), N_j is the number of nodes in the j-th layer, and Z_j ∈ R^{1×d_j}.

2) EXCITATION OPERATION
Next comes the excitation operation, which is akin to a gate mechanism in a recurrent neural network, giving each part a learned relevance weight, similar to the attention mechanism; the specific operation is formula (7). After getting the squeezed feature map, we excite it with the sigmoid function. Firstly, the dimension is reduced by a fully connected layer, i.e., W_1 Z in formula (7); after passing through a ReLU function we obtain δ(W_1 Z), with the output dimension unchanged. Then a fully connected layer raises the dimension again, i.e., W_2 δ(W_1 Z), and finally the sigmoid function is used to obtain S:

S = F_ex(Z, W) = σ(W_2 δ(W_1 Z)).   (7)
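The gate of formula (7) is two small fully connected layers around a ReLU, finished with a sigmoid; a minimal NumPy sketch (the weights and the reduction ratio r = 2 are toy placeholders):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def excitation(z, W1, W2):
    """F_ex: S = sigmoid(W2 @ relu(W1 @ z)).
    z: (d,) squeezed descriptor; W1: (d//r, d) reduces the dimension
    by ratio r; W2: (d, d//r) restores it."""
    return sigmoid(W2 @ np.maximum(0, W1 @ z))

d, r = 4, 2
rng = np.random.default_rng(0)
z = rng.normal(size=d)                 # output of the squeeze step
W1 = rng.normal(size=(d // r, d))
W2 = rng.normal(size=(d, d // r))
S = excitation(z, W1, W2)              # per-dimension gates in (0, 1)
```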

3) REWEIGHT OPERATION
We regard the output weights of the excitation operation as the importance of each feature dimension after feature selection, and weight the previous feature X dimension by dimension through multiplication to complete the recalibration of the original features along the feature dimensions (see Fig. 2 for details). The specific operation is formula (8):

X̃_j = F_scale(X_j, S) = X_j ⊙ S,   (8)

where ⊙ denotes multiplying each row of X_j element-wise by S.
In this paper, at layer j we set X_j = P_j; taking the graph adjacency matrix A_j and feature matrix P_j as input, each SEGCN layer outputs a hidden representation matrix Q_j ∈ R^{n_j×d_{j+1}}, described by formula (9):

Q_j = ReLU( D̂_j^{-1/2} Â_j D̂_j^{-1/2} X̃_j W_j ),   (9)

where P_1 = X̃^{(0)}, ReLU(·) = max(0, ·), Â_j = A_j + I is the adjacency matrix with self-loops, D̂_j is the degree matrix of Â_j, and W_j ∈ R^{d_j×d_{j+1}} is a trainable weight matrix.
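Putting the three SE steps and the graph convolution together, one SEGCN layer can be sketched in NumPy as below. This is a minimal illustration of our reading of formulas (6)-(9); the ring-graph adjacency, random features, and weight shapes are toy placeholders, not the paper's trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def segcn_layer(A, X, W1, W2, W):
    """One SEGCN layer: squeeze (6), excite (7), reweight (8), GCN (9)."""
    n = A.shape[0]
    Z = X.mean(axis=0)                        # (6) squeeze: n x d -> 1 x d
    S = sigmoid(W2 @ np.maximum(0, W1 @ Z))   # (7) excitation gates
    X_tilde = X * S                           # (8) reweight each dimension
    A_hat = A + np.eye(n)                     # self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ X_tilde @ W)

rng = np.random.default_rng(1)
n, d, r, d_out = 5, 8, 2, 3
# toy 5-node ring graph
A = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(d // r, d))
W2 = rng.normal(size=(d, d // r))
W = rng.normal(size=(d, d_out))
Q = segcn_layer(A, X, W1, W2, W)
```

Because S is shared across all nodes, the SE branch rescales feature dimensions globally before the convolution mixes neighbors, which is how the layer highlights significant features.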

C. GRAPH COARSENING
In this phase, the input graph G (G_0) is repeatedly coarsened into a series of smaller graphs G_0, G_1, G_2, . . . , G_m such that each graph has fewer nodes than its predecessor. For the graph coarsening procedure, we design the following two clustering strategies to assign nodes with similar structures to a super-node in the coarsened graph: we first conduct structural equivalence clustering, followed by normalized heavy edge clustering. The number of coarsening layers is q. Some prior work has addressed learning hierarchical information on graphs: Liang et al. [23] and Hu et al. [20] use a coarsening procedure to build a coarsened graph of small size.

1) STRUCTURAL EQUIVALENCE CLUSTERING(SEC)
If two nodes share the same set of neighbors, they are considered structurally equivalent, and we assign them to a super-node. For example, as shown in Fig. 3, nodes B and E are structurally equivalent, so these two nodes are merged into a super-node. We mark all structurally equivalent nodes and leave the other nodes unmarked.

2) NORMALIZED HEAVY EDGE CLUSTERING(NHEC)
Heavy edge matching is a popular matching method for graph coarsening [24]. For an unmarked node u_k, its heavy edge clustering is a pair of nodes (u_k, u_m) such that the weight of the edge between u_k and u_m is the largest. We propose normalizing the edge weight between the unmarked node pair (u_k, u_m) used for heavy edge clustering, as in formula (10):

S(u_k, u_m) = A_km / (D(u_k) · D(u_m)),   (10)

where A_km is the edge weight between u_k and u_m, and D(·) is the node weight. We iteratively extract an unmarked node u_k and calculate its normalized connection strength with all unmarked neighbors. For the example in Fig. 3, node A is equally likely to be merged with node C or node D without edge weight normalization. With normalization, node A is merged with D, which is the better merging since A is more structurally similar to D. The node pair (A, D) is merged into a super-node. After that, since only node C remains unmarked, it becomes a super-node by itself.
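A small NumPy sketch of the normalized score, assuming the product-of-node-weights normalization of formula (10) as written above; the edge and node weights below are hypothetical values chosen to reproduce the Fig. 3 discussion (A prefers D over C once normalization is applied):

```python
import numpy as np

def nhec_score(A, D, k, m):
    """Normalized connection strength between nodes k and m, per our
    reading of formula (10): edge weight divided by both node weights."""
    return A[k, m] / (D[k] * D[m])

# nodes 0=A, 1=C, 2=D; A-C and A-D have equal edge weight
A = np.zeros((3, 3))
A[0, 1] = A[1, 0] = 1.0   # edge A-C
A[0, 2] = A[2, 0] = 1.0   # edge A-D
D = np.array([1.0, 3.0, 2.0])  # hypothetical node weights

score_AC = nhec_score(A, D, 0, 1)
score_AD = nhec_score(A, D, 0, 2)  # larger, so A merges with D
```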

3) A HYBRID CLUSTERING METHOD
In this paper, we use a hybrid of the two clustering methods above for graph coarsening. To construct G_{i+1} from G_i, we first find all structural equivalence clusterings (SEC) η_1, where G_i is treated as an unweighted graph. This is followed by the search for the normalized heavy edge clusterings (NHEC) η_2 of G_i. The nodes in each clustering are then merged into a super-node in G_{i+1}. Note that some nodes might not be merged at all; they are directly copied to G_{i+1}. In this way we obtain all the super-nodes. Fig. 3 provides a toy example illustrating the process of applying Structural Equivalence Clustering (SEC) and Normalized Heavy Edge Clustering (NHEC) for graph coarsening. In the SEC operation, node B and node E share the same neighbors, so they are merged into a super-node. In the NHEC operation, node A and node D are merged because they have the largest normalized connection weight. Node C constitutes a super-node by itself since it remains unmarked.
For a super-node v_i, its edge weight to v_j is the summation of the edge weights between v_j and the nodes contained in v_i. The updated node weights and edge weights are used in formula (10) in the next coarsening layer. To help restore the coarsened graph to the original graph, we preserve the matching relationship between nodes and their corresponding super-nodes in a matrix M_j ∈ R^{n_j×n_{j+1}}. Formally, at layer j, entry m_km of the matching matrix M_j is calculated as formula (11):

m_km = 1 if node u_k in G_j is merged into super-node v_m in G_{j+1}, and m_km = 0 otherwise.   (11)

Note that m_33 = 1 in this illustration, since node C constitutes its super-node by itself. Next, the hidden node feature matrix is determined as formula (12):

P_{j+1} = M_j^T P_j.   (12)

Then we can generate a coarsened graph G_{j+1} whose adjacency matrix is calculated as formula (13):

A_{j+1} = M_j^T A_j M_j.   (13)

The coarsened graph G_{j+1}, along with the resulting representation matrix P_{j+1}, is fed into the next layer as input. The graph coarsening procedure is summarized in Algorithm 1.
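Formulas (12) and (13) reduce to two matrix products with the binary matching matrix; a minimal NumPy sketch with a toy 5-node graph and a hypothetical merge assignment ({0,1}, {2}, {3,4}):

```python
import numpy as np

def coarsen(A, P, M):
    """Coarsen per formulas (12) and (13): P' = M^T P, A' = M^T A M.
    M: (n_j, n_{j+1}) binary matching matrix with M[k, m] = 1 iff
    node k is merged into super-node m."""
    return M.T @ A @ M, M.T @ P

# 5 nodes merged into 3 super-nodes: {0, 1}, {2}, {3, 4}
M = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [1, 0, 0, 1, 0],
              [0, 1, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
P = np.eye(5)
A_c, P_c = coarsen(A, P, M)  # coarsened adjacency and features
```

Because A is symmetric, M^T A M stays symmetric, so the coarsened graph is again undirected, and summed edge weights between super-nodes fall out of the product automatically.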

D. THE GRAPH REFINING LAYER
To restore the original topology of the graph and further facilitate node classification, we stack the same number of graph refining layers as coarsening layers. As in the coarsening process, each refining layer consists of two steps: generating the node embedding vectors and restoring the node representations.
Since we have saved the matching matrices during the coarsening process, we use M_{l−j} to restore the refined node representation matrix of layer j. We further employ residual connections between the two corresponding coarsening and refining layers.

Algorithm 1 Graph Coarsening
Input: input graph G_j, node feature matrix P_j, adjacency matrix A_j, and number of coarsening levels q
Output: coarsened graph G_{j+1}, node representation P_{j+1}, and adjacency matrix A_{j+1}, j ∈ [0, q − 1]
1 for j = 0, 1, . . . , q − 1 do
2   η_1 ← all structural equivalence clusterings of G_j
3   mark the nodes in η_1 as marked
4   η_2 = ∅
5   sort all unmarked nodes in ascending order of their number of neighbors
6   for each unmarked node u_k do
7     for each unmarked node u_m adjacent to u_k do
8       calculate S(u_k, u_m) according to formula (10)
9     η_2 = η_2 ∪ (u_k, u_m) and mark both as matched
10    update node weights and edge weights
11  construct the matching matrix M_j according to formula (11), based on η_1 and η_2
12  calculate node representation P_{j+1} according to formula (12)
13 return coarsened graph G_{j+1}, adjacency matrix A_{j+1} according to formula (13), and node representation P_{j+1}

In summary, node representations in the refining layers are calculated by formula (14).

E. DISCRIMINABLE CAPSULE LAYER
Geoffrey Hinton is one of the founders of deep learning and the inventor of classic neural network algorithms such as back propagation. To address CNN's shortcomings, Hinton et al. [25] proposed a more effective network for image processing, the Capsule Network, which combines the advantages of CNNs while considering the relative position, angle, and other information that CNNs miss, thereby improving recognition. A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity, such as an object or an object part. Compared with a conventional convolutional network, the advantages of the capsule network are as follows. Firstly, the output of a convolutional neural network is a scalar, while a capsule outputs a vector with direction. Secondly, each layer of a convolutional neural network performs the same convolution operation, so it needs a large amount of data to train, which is time-consuming, inefficient, and expensive; a capsule network instead requires the model to learn the feature variants inside the capsules so as to retain as much valuable information as possible, so it can use less training data to infer the possible variants and achieve the effect expected of a CNN. Thirdly, convolutional neural networks cannot handle ambiguity well, because repeated pooling loses much important feature information, making them insensitive to small changes; to complete complex tasks such as semantic segmentation, complex architectures must therefore be built to compensate for the information loss. The Capsule Network is different: each capsule carries a large amount of information, such as target location, rotation, thickness, and tilt. This information and other details are preserved and passed to the upper capsules, so a simple, consistent architecture can cope with different visual tasks.

Algorithm 2 Routing Algorithm
Input: prediction vectors u_1, u_2, . . . , u_n and the number of routing iterations t
1 for all input capsules v_i and output capsules r: b_ri ← 0
2 for t iterations do
3   for all i, r: c_ri ← softmax of b_ri over r (formula (18))
4   for all r: s_r ← Σ_i c_ri u_i (formula (17))
5   for all r: a_r ← squash(s_r) (formula (16))
6   for all i, r: b_ri ← b_ri + u_i · a_r
7 return a_1, a_2, . . . , a_{n−1}, a_n

We propose a discriminable capsule network; the network is adept at judging the similarities between features and is used to remember the rich information in the graph structure. In layer l − 2, we use a SEGCN on P_{l−2} to output P_{l−2} ∈ R^{n×o} (as in formula (15)), where o is the number of classes and n is the number of nodes. We feed the n node feature vectors v_1, v_2, . . . , v_n into the discriminable capsule network as n capsules, as shown in Fig. 4: the n rows of the feature matrix P_{l−2} (v_1, v_2, . . . , v_n) are input into the network as n capsules. They are multiplied by n weight matrices W_1, W_2, . . . , W_n, respectively, giving u_1, u_2, . . . , u_n. Through the dynamic routing procedure we obtain the outputs a_1, a_2, . . . , a_{n−1}, a_n. These new capsules (feature vectors) form the matrix P_{l−1} ∈ R^{n×o} (o is the number of classes). The dynamic routing procedure is summarized in Algorithm 2. We want the length of a capsule's output vector to represent the probability that the entity represented by the capsule appears in the current input; therefore, we use a nonlinear squashing function (see formula (16)) to ensure that short vectors are compressed to almost zero length and long vectors to a length slightly less than 1.
a_r = ( ||s_r||^2 / (1 + ||s_r||^2) ) · ( s_r / ||s_r|| ),   (16)

where a_r is the vector output of capsule r and s_r is its total input (see formula (17)). For all but the first layer of capsules, the total input to a capsule, s_r, is a weighted sum over all prediction vectors u_i from the capsules in the layer below, where u_i is produced by multiplying the input v_i of a capsule in the layer below by a weight matrix W_i:

s_r = Σ_i c_ri u_i,   u_i = W_i v_i.   (17)
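The squashing nonlinearity of formula (16) can be written directly; a minimal NumPy sketch showing the two limiting behaviors (short vectors shrink toward zero, long vectors approach unit length):

```python
import numpy as np

def squash(s):
    """Formula (16): scale s so its length lies in [0, 1)."""
    norm_sq = np.dot(s, s)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq)

short = squash(np.array([0.01, 0.0]))   # length ~ 1e-4
long_ = squash(np.array([100.0, 0.0]))  # length just below 1
```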
Here the c_ri (as in formula (18)) are coupling coefficients determined by the iterative dynamic routing process:

c_ri = exp(b_ri) / Σ_{r'} exp(b_{r'i}),   (18)

so that the coupling coefficients between capsule v_i and all capsules in the layer above sum to 1, as determined by this routing softmax; the initial logits b_ri are the log prior probabilities that capsule v_i should be coupled to capsule v_r.
The log priors can be learned discriminatively at the same time as all the other weights. The initial coupling coefficients are then refined iteratively by measuring the agreement between the current output a_r of each capsule and the prediction u_i made by capsule v_i. The agreement is simply the scalar product t_ir = a_r · u_i. This agreement is treated as a log likelihood and added to the initial logit b_ri, after which the new values of all coupling coefficients connecting capsule v_i to the higher-level capsules are computed.
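The routing loop described above (formulas (16)-(18)) can be sketched as follows. This is a minimal illustration assuming zero-initialized logits and three iterations; the capsule counts and dimension are toy placeholders.

```python
import numpy as np

def squash(s):
    """Formula (16), applied row-wise to a batch of capsule inputs."""
    n2 = (s * s).sum(axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2)

def routing(u, iterations=3):
    """Dynamic routing over prediction vectors u of shape
    (n_in, n_out, d): logits b start at zero and the agreement
    u . a refines the coupling coefficients each iteration."""
    n_in, n_out, d = u.shape
    b = np.zeros((n_in, n_out))
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # (18) softmax over outputs
        s = (c[:, :, None] * u).sum(axis=0)                   # (17) weighted sum
        a = squash(s)                                         # (16) squash
        b = b + (u * a[None, :, :]).sum(axis=-1)              # agreement update
    return a

rng = np.random.default_rng(2)
u = rng.normal(size=(4, 3, 5))  # 4 input capsules, 3 outputs, dim 5
a = routing(u)
```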
We fuse the feature P_{l−2} with the matrix P_{l−1} obtained from the discriminable capsule layer to get the matrix P̃_{l−1}, where ⊕ is the addition operation, as in formula (19):

P̃_{l−1} = P_{l−2} ⊕ P_{l−1}.   (19)

F. THE OUTPUT LAYER
Finally, in the output layer l, we use a softmax classifier on P̃_{l−1} to output the class probabilities of the nodes, as in formula (20):

P_l = softmax( P̃_{l−1} W_l ),   (20)

where W_l ∈ R^{d×o} is a trainable weight matrix and P_l ∈ R^{n×o} denotes the probability of each node belonging to each class. The loss function is defined as the cross-entropy of the predictions over the labeled nodes (see formula (21)):

L = − Σ_{v_i ∈ V_L} Σ_{c=1}^{o} I(y_i = c) ln P(p_i, c),   (21)

where I(·) is the indicator function, y_i is the true label of v_i, p_i is the prediction for labeled node v_i, P(p_i, c) is the predicted probability that v_i is of class c, and o is the number of classes.
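Formulas (20) and (21) amount to a row-wise softmax followed by a negative log-likelihood summed over the labeled nodes only; a small NumPy sketch (the logits, labels, and labeled set are toy values):

```python
import numpy as np

def softmax(X):
    """Row-wise softmax with the usual max-subtraction for stability."""
    e = np.exp(X - X.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def masked_cross_entropy(P, y, labeled):
    """Cross-entropy of formula (21), summed over labeled nodes only.
    P: (n, o) class probabilities, y: (n,) integer labels,
    labeled: indices of the labeled node set V_L."""
    return -np.log(P[labeled, y[labeled]]).sum()

logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [0.0, 0.0, 2.0]])
P = softmax(logits)                 # formula (20)
y = np.array([0, 1, 2])
loss = masked_cross_entropy(P, y, labeled=np.array([0, 1]))
```

Restricting the sum to V_L is what makes the training semi-supervised: unlabeled nodes influence the loss only through the propagation layers, not directly.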

IV. EXPERIMENTS AND ANALYSIS

A. DATA SETS
To verify the effectiveness and benefit of the proposed D-SEGCN on semi-supervised learning tasks, we test it on three citation networks (Cora, Citeseer, and Pubmed). A citation network is a network composed of papers and their relationships, including citation relationships, co-authorships, etc. It has a natural graph structure, and the tasks for these data sets are the classification of papers and the prediction of connections. The details of these data sets and their usage in our experiments are introduced below. The statistics of these data sets are summarized in Table 1.
Cora. This data set contains 2708 nodes and 5429 edges. Each node has a 1433-dimensional feature descriptor, and all nodes fall into seven classes.
Citeseer. This network contains 3327 nodes and 4732 edges. The nodes fall into six classes, and each node has a 3703-dimensional feature descriptor.
Pubmed. This data set contains 19717 nodes and 44338 edges. Each node has a 500-dimensional feature descriptor, and all nodes are divided into three classes.

B. EXPERIMENTAL SETTING
For the Cora, Citeseer, and Pubmed datasets, we follow the experimental setup of previous works by Kipf and Welling [5] and Velicković et al. [11]. During training, only 20 labels per class are used for each citation network. Besides, 500 nodes in each data set are selected randomly as the validation set.

C. PARAMETER SETTINGS
We train the D-SEGCN model for a maximum of 1500 epochs (training iterations) using the ADAM optimizer [31] with a learning rate of 0.03. Dropout is applied to all feature vectors with a rate of 0.9. All the network weights W are initialized using Glorot initialization [32]. Considering the different scales of the data sets, we set the total number of layers l to eighteen for the citation networks.
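The hyperparameters above can be collected in a configuration sketch; the dictionary keys are our own names, and the Glorot initializer below is a standard NumPy rendering of [32], not the authors' implementation.

```python
import numpy as np

# Training configuration taken from the text; key names are illustrative.
CONFIG = {
    "max_epochs": 1500,
    "optimizer": "Adam",
    "learning_rate": 0.03,
    "dropout": 0.9,
    "num_layers": 18,
}

def glorot_init(fan_in, fan_out, rng=None):
    # Glorot (Xavier) uniform initialization:
    # W ~ U(-a, a) with a = sqrt(6 / (fan_in + fan_out)).
    rng = rng or np.random.default_rng(0)
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))
```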

D. COMPARISON WITH STATE-OF-THE-ART METHODS
Baseline Methods. To evaluate the performance of D-SEGCN, we compare our method with the following representative methods. The specific classification accuracies are shown in Table 2.
ManiReg. Belkin et al. [26] proposed a learning algorithm based on a new form of regularization that exploits the geometry of the marginal distribution. It focuses on a semi-supervised framework that incorporates both labeled and unlabeled data in general-purpose learners.
DeepWalk. Perozzi et al. [27] generated node embeddings through unsupervised random walks and then fed the embedding vectors into an SVM classifier for node classification.
LP. Zhu et al. [28] proposed a traditional graph-based semi-supervised learning method built on a Gaussian random field model.
Planetoid. Yang et al. [29] proposed a semi-supervised learning framework based on graph embedding. Given a graph between instances, the framework trains an embedding for each instance to jointly predict the class label and the neighborhood context in the graph.
Chebyshev. Defferrard et al. [4] proposed a convolutional network that extends traditional CNNs to non-Euclidean spaces using Chebyshev polynomial spectral filters.
GCN. Kipf and Welling [5] produced node embedding vectors by truncating the Chebyshev polynomial to a first-order approximation over immediate neighborhoods.
DGCN. Zhuang and Ma [10] used the adjacency matrix and the positive pointwise mutual information matrix of the graph to encode both local consistency and global consistency.
GAT. Veličković et al. [11] generated node embedding vectors by using self-attention to weight the contributions of a node's one-hop neighbors.
H-GCN. Hu et al. [20] proposed a novel deep hierarchical graph convolutional network for semi-supervised node classification that enlarges the receptive field.
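For concreteness, the first-order graph convolution used by the GCN baseline [5] can be sketched in NumPy as follows; dense matrices are used for clarity, whereas practical implementations use sparse operations.

```python
import numpy as np

def normalized_adjacency(A):
    # Renormalized adjacency of Kipf & Welling:
    # A_hat = D^{-1/2} (A + I) D^{-1/2}.
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(A_norm, H, W):
    # One first-order graph convolution: H' = ReLU(A_hat H W).
    return np.maximum(A_norm @ H @ W, 0.0)
```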
Results. Table 1 summarizes the basic statistics of the three citation network benchmark data sets (Cora, Citeseer, and Pubmed [30]). Table 2 summarizes the comparison results on the three data sets, with the best results marked in bold. Overall, we can note that D-SEGCN performs better than the other graph-based semi-supervised learning methods, such as ManiReg, DeepWalk, and LP, which further demonstrates the effectiveness of D-SEGCN on graph-based semi-supervised classification tasks.

E. ABLATION STUDY
To verify the effectiveness of the proposed SEGCN module and the discriminable capsule layer in alleviating the over-smoothing problem, an ablation study is conducted in this section. The results are shown in Table 3 and Table 4.
We first verify that the proposed D-SEGCN can use global information to selectively enhance significant features and suppress useless ones, thereby strengthening the node features. Without the squeeze-and-excitation operation fused to the GCN, the classification accuracies on the three benchmark graph data sets (Cora, Citeseer, and Pubmed) are 84.5%, 72.4%, and 79.8%, respectively (see Table 3). Incorporating the squeeze-and-excitation operation into only the first GCN layer yields accuracies of 84.2 ± 0.4%, 72.4%, and 80.2%, respectively. Fusing the squeeze-and-excitation operation into every GCN layer yields 84.6%, 73.1%, and 80.6%, respectively.

The second ablation verifies that the discriminable capsule layer is adept at judging the similarities among features, can memorize the rich information in the graph structure, and enhances the discriminability of nodes, thus alleviating the over-smoothing problem. Adding the discriminable capsule layer allows the newly acquired features to be fused with the previous features. As can be seen from Table 4, the classification accuracies of the model on the three data sets Cora, Citeseer, and Pubmed increase by 0.2%, 0.9%, and 0.6%, respectively.
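The squeeze-and-excitation recalibration examined in this ablation can be sketched as follows; the weight shapes, reduction ratio, and function names are illustrative assumptions, not the trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_recalibrate(H, W1, W2):
    # H: (n_nodes, d) node features; W1: (d, d//r); W2: (d//r, d),
    # where r is an assumed reduction ratio.
    z = H.mean(axis=0)                        # squeeze: global average per feature dimension
    s = sigmoid(np.maximum(z @ W1, 0) @ W2)   # excitation: FC-ReLU-FC-sigmoid gate in (0, 1)
    return H * s                              # channel-wise recalibration of the features
```

Each feature dimension is scaled by a learned gate, which is how significant features are selectively enhanced and useless ones suppressed.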

V. CONCLUSION
In this paper, we have introduced a novel graph convolutional network for semi-supervised classification on graph-structured data. The D-SEGCN model is composed of SEGCN layers, coarsening layers, symmetrical refining layers, a discriminable capsule layer, and a softmax classifier layer. Compared with previous work, the proposed model can selectively enhance significant features and suppress useless ones by adding squeeze-and-excitation modules. The proposed model also realizes the adaptive calibration of feature dimensions and enhances the discriminability of nodes. The final features are fused with the features obtained from the discriminable capsule layer, which can well judge the similarities among features and memorize the rich information in the graph structure. Therefore, over-smoothing can be effectively alleviated by aggregating the multi-level neighborhood. Comprehensive experiments have confirmed that the proposed method consistently outperforms other state-of-the-art methods. In particular, the proposed approach achieves more significant gains when labeled data are extremely scarce.

At present, he is mainly engaged in image understanding and interpretation, computer vision, intelligent algorithm system implementation, and other research work. He has presided over and participated in more than ten national projects and many provincial and ministerial projects, published more than 20 SCI-indexed articles in mainstream international journals, and obtained more than 30 national invention patents. He won a Shaanxi Science and Technology Award in 2012 and a Shaanxi Teaching Achievement Award in 2018.
YANG ZHANG received the B.S. degree in intelligence science and technology from the School of Artificial Intelligence, Xidian University, Xi'an, China, in 2018, where he is currently pursuing the M.S. degree in circuits and systems. His current research interests include machine learning, deep learning, and computer vision, with a focus on visual object tracking.
FENGGE WANG received the B.S. degree in electronic information science and technology from the School of Science, Xi'an University of Architecture and Technology, Xi'an, China, in 2018. She is currently pursuing the M.S. degree in circuits and systems with the School of Artificial Intelligence, Xidian University, Xi'an. Her current research interests include machine learning, deep learning, and computer vision, with a focus on image classification.