Graph Convolutional Network based on Multihead Pooling for Short Text Classification

Short texts, sparse features, and the lack of training data remain key bottlenecks that restrict the successful application of traditional text classification methods. To address these problems, we propose a Multi-head-Pooling-based Graph Convolutional Network (MP-GCN) for semi-supervised short text classification and introduce its three architectures, which focus on node representation learning over 1-order isomorphic graphs, 1&2-order isomorphic graphs, and 1-order heterogeneous graphs, respectively. MP-GCN focuses only on the structural information of the text graph and does not need pre-trained word embeddings as initial node features. A graph pooling based on self-attention is introduced to evaluate and select important nodes, and a multi-head method is used to provide multiple representation subspaces for pooling without adding trainable parameters. Experimental results demonstrate that, without using pre-trained embeddings, MP-GCN outperforms state-of-the-art models across five benchmark datasets.


I. INTRODUCTION
Text classification is a classical problem in natural language processing (NLP), which aims to assign labels or tags to textual units such as sentences, queries, paragraphs, and documents. In the past few years, scholars have proposed a series of deep learning models for this task, such as models based on recurrent neural networks (RNNs), convolutional neural networks (CNNs), transformers, and capsule networks, which surpass traditional machine learning methods in various text classification tasks.
In recent years, some scholars have begun to study semi-supervised graph convolutional networks (GCNs) for text classification [1] [2], mainly because they can be applied to many practical scenarios. First, GCNs can handle short texts better by adding more relationships between word nodes, and can therefore be applied to scenes with sparse and ambiguous semantics and a lack of context [2]. Second, they are also suitable for scenes with limited labeled training data, which usually lead to poor performance of traditional neural networks [3]. As a consequence, there is a pressing need to study semi-supervised GCNs for text classification.
The application of semi-supervised GCNs also faces challenges. First, depending on the scenario, pre-trained word vectors may fail to improve the classification effect while increasing the difficulty of graph building. Second, a graph is usually built for the whole corpus so that more links or dependencies can be added between nodes (e.g., using heuristics). Therefore, both graph building and feature extraction need to account for computation and memory consumption.
In this study, we present a Multi-head-Pooling-based Graph Convolutional Network (MP-GCN) for semi-supervised short text classification. MP-GCN can evaluate and select important nodes from multiple perspectives through multi-head pooling, and achieve strong classification performance with lower computational cost. Our source code is available at https://github.com/shanzhonglujie/MP-GCN. To summarize, our contributions are as follows: (1) We propose a novel graph convolutional network (MP-GCN) for short text classification and introduce its three architectures. MP-GCN mainly focuses on the structural information of the text graph and enhances the representation learning of important nodes. Finally, it can achieve strong classification performance without combining any prior information or pre-trained embedding.
(2) Our model does not add new trainable parameters to the convolution calculation and can be applied to the feature extraction of isomorphic and heterogeneous graphs respectively. Moreover, it can output the predictive graph embeddings of words or documents to downstream tasks.

II. RELATED WORK

A. GRAPH CONVOLUTIONAL NETWORKS
Graph convolutional networks (GCNs) are mainly divided into spectral methods and spatial methods [4].
In recent years, spectral methods have received growing attention. Bruna et al. [4] first put forward the concept of GCNs in 2014, defining graph convolution in the spectral domain, with high complexity in space and time. Defferrard et al. [5] used k-order Chebyshev polynomials as the convolution kernel to extract features from k-order neighborhood nodes, which improved calculation efficiency. Kipf et al. [1] simplified the Chebyshev network and proposed a simple, efficient graph convolutional network (GCN) with 1-order message passing.
In addition, spatial methods have also developed rapidly. Hamilton et al. [6] proposed a graph sampling and aggregation network (GraphSAGE), which used methods such as limiting the number of sampled neighbors and mini-batch training to extend GCNs into inductive learning networks and solve the problem of large-scale data processing. Veličković et al. [7] proposed a graph attention network (GAT) and defined the aggregation function through the attention mechanism, assigning different weights to each related node and selecting more similar nodes for aggregation.

B. GCNs FOR TEXT CLASSIFICATION
The above GCNs can be applied to several text-based tasks. Among them, text classification is an important and classical problem in natural language processing (NLP) [8]. Yao et al. [2] proposed Text-GCN for semi-supervised text classification: PMI and TF-IDF were used to build a topological graph containing word and document nodes, and a two-layer GCN was applied. Huang et al. [9] proposed a new graph convolutional network, which solved the problems in [2] of not supporting online testing and of high memory consumption, though its computational complexity is relatively high. Zhang et al. [10] proposed to employ GCNs on the dependency tree of sentences to capture syntactic information and lexical dependence; on this basis, they proposed a framework for aspect-specific sentiment classification. Hu et al. [3] built a heterogeneous topic-text-entity graph and proposed a heterogeneous graph attention network (HGAT) based on a two-level attention mechanism for short text classification; this network can learn the importance of different adjacent nodes and of different types of node information to the current node. Liu et al. [11] built three kinds of heterogeneous graphs to describe semantic, syntactic, and sequential contextual information and proposed a tensor graph convolutional network (TensorGCN) to harmonize and integrate the heterogeneous information from the three types of graphs.
In practical applications, some methods either view a document or a sentence as a graph of word nodes or rely on document citation relations to build the graph, but they do not utilize much text information [8] [12].

C. DEEP LEARNING FOR TEXT CLASSIFICATION
Deep neural networks, which automatically represent texts as embeddings, have been widely used for NLP tasks. Two typical deep neural networks, CNNs and RNNs, have shown their power in text classification [13] [14]. In short texts, the context dependence between words is usually weak, and CNN-based models usually perform better. Considering efficiency, CNN-based models are also more suitable for real-time applications than some large models, such as Bert [15]. However, these deep learning networks cannot solve the problem of insufficient training data, which prevents their successful practical application.

III. METHOD

A. GCN
Graph convolutional networks (GCNs) are mainly used to process data with a generalized topological graph structure and to explore its characteristics and laws deeply [1]. GCNs can be divided into two categories: spectral methods and spatial methods. With the deepening of research, spectral methods have become more efficient and more practical. Among them, GCN [1] transforms the spectral convolution operation in the time domain into a matrix multiplication operation in the frequency domain:

$$g_\theta \star x = g_\theta\,\big(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}\big)\,x, \tag{1}$$

where g_θ is the convolution kernel in the time domain, x is the input signal, and ⋆ is the convolution calculation. Ã is the adjacency matrix with self-loops, which reflects the interconnection of nodes in the graph. Ã can be decomposed as Ã = A₀ + λI_N, where I_N is the identity matrix and A₀ is the adjacency matrix without self-loops. D̃ is the degree matrix of Ã, with D̃_ii = Σ_j Ã_ij. λ is a scalar that finally degenerates to 1. D̃^{-1/2} Ã D̃^{-1/2} is the normalized form of the adjacency matrix, which prevents gradients from vanishing or exploding when a multi-layer network is optimized. Equation (1) represents the convolution calculation of a single-layer GCN (called GCNConv), which acts only on the 1-order sub-graph of each node, i.e., in each layer, only the 1-order neighbors of each node participate in the calculation.
After integration, the single-layer convolution formula of GCN is:

$$H = \sigma\big(\hat{A} X W\big), \tag{2}$$

where Â = D̃^{-1/2} Ã D̃^{-1/2} ∈ ℝ^{V×V} is the normalized form of Ã ∈ ℝ^{V×V}, and σ is the activation function. W ∈ ℝ^{C×F} is the parameter to be trained, which performs an affine transformation of X. X ∈ ℝ^{V×C} is the input node feature and H ∈ ℝ^{V×F} is the output. V is the number of nodes in the graph, C is the initial dimension of the node feature, and F is the output dimension. Compared with previous work, GCN reduces the number of parameters to be trained and decreases the risk of over-fitting [1]. Because GCN is mainly based on matrix multiplication, it achieves efficient feature extraction.
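To make the propagation rule concrete, here is a minimal pure-Python sketch of one GCNConv layer (function names and the toy weights are our own; a production implementation would use sparse tensor libraries):

```python
import math

def gcn_layer(A, X, W):
    """One GCN layer, H = ReLU(D~^{-1/2} (A + I) D~^{-1/2} X W) -- a
    pure-Python sketch of Eq. (2); real code would use sparse matrices."""
    V = len(A)
    # Add self-loops: A_tilde = A + I
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(V)]
               for i in range(V)]
    # Degrees of A_tilde (always >= 1 thanks to the self-loop)
    d = [sum(row) for row in A_tilde]
    # Symmetric normalization: A_hat[i][j] = A_tilde[i][j] / sqrt(d_i * d_j)
    A_hat = [[A_tilde[i][j] / math.sqrt(d[i] * d[j]) for j in range(V)]
             for i in range(V)]

    def matmul(P, Q):
        return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
                 for j in range(len(Q[0]))] for i in range(len(P))]

    H = matmul(matmul(A_hat, X), W)          # A_hat @ X @ W
    return [[max(0.0, h) for h in row] for row in H]  # ReLU
```

For a 3-node path graph with identity features, the layer mixes each node's feature with those of its 1-order neighbors, which is exactly the receptive field described above.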

B. GCN BASED ON MULTI-HEAD POOLING (MP-GCN)
The purpose of pooling methods is to reduce the number of parameters by node selection [16] [17] [18] [19] (e.g., down-sampling) to generate smaller representations. However, instead of down-sampling, our pooling method selects important nodes to enhance their representation learning and does not abandon non-important nodes.
According to the different combinations of the pooling layer and the graph convolution layer, we define three architectures of MP-GCN, which are illustrated in Figure 1. In practice, like GCN, MP-GCN cannot use too many convolution layers in series to obtain a wider receptive field: when the layer (order) number l>2, the effect is not significantly improved. Therefore, MP-GCN only extracts features of 1- and 2-order neighborhoods. Because our pooling method focuses only on the structural information of nodes, we call it S-pool. In Figure 1, all three architectures use two graph convolution layers (GCNConv by default) to extract features. From top to bottom, the first GCNConv aggregates the information of the (1-order) nodes immediately adjacent to the central node, and the second GCNConv aggregates the information of the 2-order adjacent nodes. Besides, GCNConv can be replaced by other graph convolution layers, such as ChebConv [5] and GATConv [7]. Multi-head S-pool is our proposed pooling layer, which is used to evaluate and select important nodes. Because the information of unselected nodes may be lost during pooling, a residual connection (Add) is used to recover their information. Weight is used to weight the nodes according to their scores.

1) S-POOL
The purpose of the S-pool is to select important nodes correctly and reduce the influence of non-important nodes. It introduces the self-attention mechanism [20] to score the nodes. Because our short text classification work only uses node structural information, the attention method for calculating element similarity [21] is not suitable. References [22] [23] [24] use a learnable projection vector p to calculate node scores, which can be expressed as follows:

$$y_i = \frac{x_i\, p}{\lVert p \rVert}, \tag{3}$$

where x_i is the feature vector of node i, and y_i is the node score, representing the amount of information of node i that is retained when projected onto the direction of p. Equation (3) is a self-attention method to evaluate the importance of each node.
In our study, the importance can be calculated through a 1-dimensional projection:

$$S = \mathrm{TopK}\big(\sigma(H^{(l)} p)\big), \tag{4}$$

where H^{(l)} ∈ ℝ^{V×d} is the hidden state of layer l, and d is the node feature dimension. When l=0, H^{(0)} = X = I (identity matrix). p is a trainable parameter of size d×1, generated from a uniform distribution. TopK keeps the weights of the top K (selected) nodes with the highest scores and sets the weights of the other (unselected) nodes to zero. The activation function σ is Tanh, used for nonlinear stretching. Equation (4) scores all nodes and realizes the node selection operation of S-pool.
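The scoring-and-selection step of Eq. (4) can be sketched as follows (names are illustrative; in the model, p is trained jointly with the network):

```python
import math

def s_pool_scores(H, p, k):
    """S-pool node scoring sketch: s_i = Tanh(h_i . p); keep the top-k
    scores and zero out the rest, as TopK does in Eq. (4)."""
    scores = [math.tanh(sum(h * w for h, w in zip(row, p))) for row in H]
    # Indices of the k highest-scoring nodes
    top = set(sorted(range(len(scores)),
                     key=lambda i: scores[i], reverse=True)[:k])
    return [scores[i] if i in top else 0.0 for i in range(len(scores))]
```

Unselected nodes get a zero weight here, but they are not discarded: the residual connection restores their representations later.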
Because the representation of an unlabeled target node can, to some extent, be determined by the selected adjacent nodes, the parameter setting of TopK has a certain impact on the performance of the model. To adapt to changes in the number of nodes, we set q as the ratio of selected nodes to all nodes, i.e., q = K/V. Since the number of word nodes is usually greater than the number of document nodes, K can also refer to the vocabulary, i.e., K = qV₁, where V₁ is the vocabulary size. Fortunately, the significant structures in a graph do not change during computation, so only one q-value is set for each layer. Through experiments, we notice that the selection of q is mainly influenced by the graph structure.
Through S-pool, MP-GCN can aggregate node information with significant structural characteristics by highlighting important nodes. Besides, it has reasonable complexity and ensures that the whole model is still based on matrix calculation.

2) MULTI-HEAD FOR S-POOL
The S-pool focuses on important but limited nodes. Due to the randomness of the initial parameters, some hidden important nodes may be lost. Therefore, the multi-head method [25] [26] [27] is introduced to form multiple subspaces, enabling the model to select nodes from different aspects.
One multi-head method is to convert the size of the parameter p in (4) to d×N, so that each column of p can be used to calculate a group of node scores, where N is the head number. For the stability of the model, we use a more simplified method:

$$S_A = \mathrm{RS}\big(f(H^{(l)} W),\; N\big), \tag{5}$$

$$p \;\Rightarrow\; W, \tag{6}$$

where S_A ∈ ℝ^{V×N} is the attention score matrix, RS is the N-dimensional random sampling operation, f is the nonlinear transformation function, and W is the parameter in the convolution calculation (GCNConv). Equation (5) realizes random sampling of the node evaluation (node feature projection) results. What we simplify is the trainable parameters; see the parameter replacement in (6). The multi-head evaluation method based on (5) is not as diverse as (4), but its performance is more stable and it does not introduce new trainable parameters. This multi-head method enables MP-GCN to learn node representations from different subspaces, which improves the objectivity and stability of node selection.
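Under our reading of Eq. (5), the multi-head trick samples head scores from the convolution output instead of training a separate projection per head. A sketch (the sampling scheme and all names are our assumptions):

```python
import random

def multihead_scores(H_conv, n_heads, seed=0):
    """Randomly sample N columns of the first-layer convolution output
    (sigma(A_hat X W)) and treat each column as one head's node scores;
    no new trainable parameters are introduced."""
    rng = random.Random(seed)
    cols = rng.sample(range(len(H_conv[0])), n_heads)  # RS: column sampling
    S_A = [[row[c] for c in cols] for row in H_conv]   # V x N score matrix
    s_mean = [sum(r) / n_heads for r in S_A]           # per-node mean over heads
    s_max = [max(r) for r in S_A]                      # per-node max over heads
    return s_mean, s_max
```

Each sampled column plays the role of one head's projection vector, so head diversity comes from the convolution parameters themselves.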

3) MP-GCN-1
MP-GCN-1 and MP-GCN-2 (MP-GCN-1/2 for short) are both used to process isomorphic (text) graphs, and their calculations are similar. We take MP-GCN-1 as an instance to illustrate the calculation of MP-GCN:

$$H^{(1)} = \mathrm{ReLU}\big(\hat{A} X W^{(0)}\big), \tag{7}$$

$$S_A = \mathrm{RS}\big(\mathrm{Tanh}(\hat{A} X W^{(0)}),\; N\big), \tag{8}$$

$$S_{mean} = \mathrm{mean}(S_A), \qquad S_{max} = \mathrm{max}(S_A), \tag{9}$$

$$Z^{(1)} = H^{(1)} \odot \mathrm{TopK}\big(S_{mean} \odot S_{max}\big) + H^{(1)}, \tag{10}$$

$$Z^{(2)} = \mathrm{softmax}\big(\hat{A} Z^{(1)} W^{(1)}\big), \tag{11}$$

where some variables have been explained earlier. The input node feature X ∈ ℝ^{V×C} is initialized to the identity matrix (C=V). ⊙ is the element-wise product and ReLU is the activation function. W^{(0)} ∈ ℝ^{C×F} and W^{(1)} ∈ ℝ^{F×E} are trainable parameters, and F and E are the output dimensions. Equations (7) and (11) are the convolution calculations of the first layer and the second layer, respectively, and H^{(1)} and Z^{(2)} are their outputs. Equations (8), (9), and (10) compute the multi-head S-pool: Equation (8) obtains the attention score matrix; in (9), S_mean ∈ ℝ^{V×1} and S_max ∈ ℝ^{V×1} are the mean and maximum of the N groups of attention scores; Equation (10) performs the weighting operation and the residual calculation. An example of the multi-head S-pool implementation is shown in Figure 2.
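Once the per-node scores have been pooled across heads and passed through TopK, the weighting-plus-residual step of Equation (10) amounts to the following sketch (names are illustrative; a score of zero leaves the node's original representation intact):

```python
def weight_and_residual(H, s):
    """Weighting plus residual: Z = H * s + H (row-wise).
    Selected nodes (s > 0) are amplified; unselected nodes (s = 0)
    keep their original features through the residual term."""
    return [[h * si + h for h in row] for row, si in zip(H, s)]
```

This is why the worst-case behavior of the model stays close to a plain two-layer GCN: if pooling fails entirely, the residual term still passes the unmodified features forward.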
Multi-head S-pool strengthens the representation learning of important nodes. The S-pool highlights the important nodes of the whole graph and weakens some secondary nodes. Introducing the multi-head method can make our network pay attention to the nodes from many aspects.
MP-GCN uses the classical cross-entropy to define the loss function. It does not use an L2 regularization term to constrain the model parameters.

FIGURE 2. An example of multi-head S-pool implementation. ⊙ is the element-wise product. We assume that a graph has three nodes, and each node has three features (Â and X). We first obtain the attention score matrix (S_A). The mean and maximum of the 3 groups of attention scores (S_mean and S_max) are used to highlight the important nodes and realize node selection. Finally, the enhanced nodes (Z^(1)) bring more accurate classification.

4) GRAPH BUILDING
Graph building affects the performance of the model: graphs with an accurate structure and rich information greatly help our task. Common tricks for graph building are as follows: (1) Adding more relationships (edges) to nodes: finding more semantic relationships, such as word co-occurrence [2], can alleviate the semantic sparsity of short texts. In addition, external dependencies can provide more relationships for documents [30], such as shared topics or entities [3].
(2) Adding multiple types of information to nodes: Building the heterogeneous graph can provide various types of information for nodes, so as to enhance the representation of nodes [29] [31] [32].
(3) Adding context information to nodes: it is difficult for the graph structure to represent the semantic and syntactic information in a long continuous word sequence. Combining dependency-based models to obtain context information is a better way to solve this problem.
(4) Combining pre-training model: Bert [15], Glove [33], etc. can provide pre-training word embeddings for node initialization, which can improve the performance of the model in many NLP tasks.
In our study, we propose MP-GCN-1* and apply it to process heterogeneous (text) graphs. It can realize text classification by learning multiple representations with different types of nodes without increasing model parameters.
First, inspired by [2], we build a heterogeneous graph for the corpus and turn the text classification problem into the node classification problem. The nodes of this graph are composed of documents and words. Because the documents can be represented by the sum of word embedding vectors, the method of processing the isomorphic graph can be directly used in this graph.
This graph focuses on global word co-occurrence information in a corpus [2]. Its corresponding adjacency matrix A is defined as:

$$A_{ij} =
\begin{cases}
\mathrm{PMI}(i, j), & i, j \text{ are words and } \mathrm{PMI}(i, j) > 0, \\
\mathrm{TF\text{-}IDF}_{ij}, & i \text{ is a document and } j \text{ is a word}, \\
1, & i = j, \\
0, & \text{otherwise},
\end{cases}$$

where A_ij denotes the connection relationship between nodes i and j. The edge weights between word nodes are calculated by PMI [2], which explicitly models word co-occurrence. The edge weights between document (sentence) nodes and word nodes are calculated by TF-IDF, which models the importance of words in documents. The diagonal elements of A are set to 1. In the following experiments, A is the input of MP-GCN-1/2. A has two types of edges; although it can be treated as an isomorphic graph [2], it is essentially a heterogeneous graph.
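As an illustration of the PMI edge weights, here is a simplified sketch (Text-GCN estimates these counts over fixed-size sliding windows across the corpus; all names here are our own):

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(docs, window=2):
    """Word-word edge weights: PMI(i, j) = log(p(i, j) / (p(i) p(j))),
    estimated from sliding windows. Only positive PMI values become
    edges, as in the Text-GCN graph."""
    windows = []
    for doc in docs:
        toks = doc.split()
        if len(toks) <= window:
            windows.append(toks)
        else:
            windows += [toks[k:k + window]
                        for k in range(len(toks) - window + 1)]
    n = len(windows)
    word_cnt, pair_cnt = Counter(), Counter()
    for w in windows:
        uniq = set(w)
        word_cnt.update(uniq)
        pair_cnt.update(frozenset(p) for p in combinations(sorted(uniq), 2))
    edges = {}
    for pair, c in pair_cnt.items():
        i, j = sorted(pair)
        pmi = math.log((c / n) / ((word_cnt[i] / n) * (word_cnt[j] / n)))
        if pmi > 0:
            edges[(i, j)] = pmi
    return edges
```

Words that co-occur more often than their independent frequencies predict get a positive weight; incidental co-occurrences are dropped.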
However, treating a heterogeneous graph as an isomorphic graph may not be optimal, so different types of edges should be treated separately. Specifically, we set the adjacency matrix containing only the edges between word nodes as A_P, and the adjacency matrix containing only the edges between word nodes and document nodes as A_T. A_P and A_T can be obtained from A by a masking operation. A, A_P, and A_T are the inputs of MP-GCN-1*.
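The masking operation that derives the two single-edge-type matrices might be sketched as follows (the function name and the boolean node-type flag are illustrative assumptions):

```python
def split_adjacency(A, is_word):
    """Split the corpus adjacency A into A_P (word-word PMI edges only)
    and A_T (word-document TF-IDF edges only) by masking.
    is_word[i] is True when node i is a word node; self-loops are kept
    in both matrices."""
    V = len(A)
    A_P = [[A[i][j] if (is_word[i] and is_word[j]) or i == j else 0.0
            for j in range(V)] for i in range(V)]
    A_T = [[A[i][j] if (is_word[i] != is_word[j]) or i == j else 0.0
            for j in range(V)] for i in range(V)]
    return A_P, A_T
```

Each masked matrix then feeds its own convolution branch, so node importance is evaluated per edge type.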

$$Z^{(2)} = Z_P^{(2)} + Z_T^{(2)}, \tag{17}$$

where Z_P^{(2)} and Z_T^{(2)} are the second-layer outputs computed from A_P and A_T, respectively. Compared to MP-GCN-1/2, MP-GCN-1* still does not increase the number of trainable parameters. It evaluates the importance of nodes according to the different types of edges, which makes the model more stable and accurate.

IV. EXPERIMENTS

A. DATASETS
We ran our experiments on five widely used benchmark corpora, including 20-Newsgroups (20NG), Ohsumed, R52 and R8 of the Reuters dataset, and Movie Review (MR) [2]. These datasets were processed by cleaning the data, tokenization, removing stop words, and removing words whose frequency is less than 5. Important statistics of the datasets are listed in Table I.
To conduct a fair comparison, none of the models used pre-trained word embeddings, and all used the same settings as in [2] [9] [11].

C. SETTINGS
The main parameter settings of MP-GCN are: the head number N=12, the learning rate e=0.005, and the Top-K setting K = qV₁ (see Table I).

D. RESULTS ANALYSIS
Performance. Table II shows that the traditional model achieves better performance than the dependency-based models. The word embedding models are superior to both because of their joint representation of word context. Benefiting from the rich representation of the graph, graph-based models surpass the above models; among them, the performance of Text-GCN is outstanding. Without pre-trained word embeddings, MP-GCN significantly outperforms all baselines and achieves state-of-the-art results on these benchmark datasets. Among the three architectures of MP-GCN, MP-GCN-1/2 perform similarly, while MP-GCN-1* outperforms MP-GCN-1/2 on four datasets. The reasons why MP-GCN works well include: 1) the advantages of the graph structure, whose rich links effectively associate the nodes (including sparse ones) so that they can be effectively represented; and 2) MP-GCN enhances the representation learning of important nodes, and these selected and enhanced nodes contain more distinctive features, which makes classification more accurate.
Besides, the effect of our model on longer texts is limited (see 20NG in Table II), and the benefit relative to the increased computational consumption is low, because it is difficult to select a small number of important nodes as the semantic representation of a whole long document.
Ablation analysis. Table III shows that the performance of our models is greatly reduced without either the multi-head method or TopK; to achieve better results, our models must combine the two methods for pooling. Table IV shows that different types of edges bring different improvements to the model. As reference information for evaluating the importance of nodes, the edges from A_P are more informative than the edges from A_T. These experimental results also confirm the effectiveness of the model for heterogeneous graph processing.
Through the above two ablation experiments, we find that the worst performance of our models is close to that of Text-GCN. This is because the residual connection in the model architecture makes the model more stable, so performance does not drop significantly when one of the tricks fails.
Parameter Sensitivity. Figures 3a and 3b show that MP-GCN can achieve better performance by adjusting the q-value. The selection of q is mainly related to the graph structure, and its setting within the same layer of MP-GCN is constant. Figures 3c and 3d show that excess heads may reduce efficiency and effect, whereas with insufficient heads, MP-GCN focuses on fewer subspaces, which reduces the accuracy and objectivity of node selection. Repeated experiments show that MP-GCN performs most stably when the head number N=12.

Effects of the Size of Labeled Data. Since our model is semi-supervised, we also tested its performance under different proportions of training data. Figure 4 reports test accuracies with 2.5%, 5%, 10%, 30%, and 50% of the MR training set. Compared with some baseline models (see https://github.com/shanzhonglujie/MP-GCN/tree/main/test), our three models and Text-GCN achieve higher test accuracy with limited labeled documents, because these GCN-based models can propagate document label information to the entire graph well [2].
Time consumption. With the introduction of the multi-head S-pool, the computational complexity of MP-GCN inevitably increases compared with GCN, because it does not down-sample like traditional pooling methods but enhances the representation learning of selected nodes without deleting any nodes.
Taking the main equations in MP-GCN-1 as an instance, the computational complexities of (7), (8), and (11) are O(|E|F), O(|E|N), and O(|E|FE), respectively, where E is the set of edges and |E| is the number of edges. After integration, the maximum computational complexity of MP-GCN-1 is close to that of a two-layer GCN [1], i.e., O(|E|FE) or O(2|E|F + |E|E) (C = V; F, E ≪ |E|). Compared with a two-layer GCN, MP-GCN-1 mainly adds the acceptable computational cost of (8), (9), and (10). The above analysis shows that the time consumption of MP-GCN is still mainly related to the number of edges. Besides, we build a graph based on the whole corpus, and the memory consumption is O(|E|).

Visualization. MP-GCN can output better graph embeddings of words and documents. We visualized them and compared the node representation learning of Text-GCN [2] and MP-GCN. Figures 5a and 5b show that the document node embeddings of MP-GCN-1 have a higher degree of vector aggregation, i.e., MP-GCN-1 learns more discriminative document embeddings. Figures 5c and 5d show that the node embeddings of MP-GCN-1 contain more words with the same labels in close positions (similar semantics), which means most of them are more closely related to certain document classes [2]. These results illustrate that the output embeddings of MP-GCN-1 are more discriminative than those of Text-GCN; the other MP-GCN models achieve the same effect.

Comparison with pre-trained models. Text classification is a regular NLP task in which Bert has achieved a dominant position. In our study, we introduced the Bert module into MP-GCN and compared it with the Bert-based models. The experimental results are shown in Table V.
BertGCN (https://github.com/MorningForest/BertGCN) is a text classification model combining Bert and GCN, and it achieves SOTA performance. Details of the baseline models and dataset can be found in [39]. In this experiment, the head number of our model was set to 1. The source code of our Bert-based MP-GCN is available at https://github.com/shanzhonglujie/MP-GCN/tree/main/bert-based.
TABLE V. Test accuracy comparison with the Bert-based models. We ran the models 10 times and report the mean test accuracy.

Combined with the two baselines, the Bert-based MP-GCN performs better; the proposed pooling method plays a guiding role in training. Compared with BertGCN, our model achieves good performance with fewer iterations and converges more easily (see Figure 6); in this experiment, our model obtains good results within 10 iterations. Combined with a pre-trained model (such as Bert), the performance of our model can be improved to a certain extent. However, because the grammar and semantics of the pre-training text must be consistent with the downstream task, pre-trained models cannot be applied to all scenarios, especially specific scenes. In contrast, MP-GCN can build a graph to solve the text classification problem without pre-trained word embeddings, i.e., it builds a new language model. Therefore, our model can be applied in most specific scenarios.

Discussion. MP-GCN is an innovation in network structure based on GCN and has a stronger ability to cover all the data. When applied to short text classification, it can provide a certain degree of attention to long-tailed (sparse) words.
Why does MP-GCN mainly focus on the first-order nodes? Through experiments, we found that the structural information of first-order nodes extracted by GCN is more important than that of second-order nodes. Similar observations can be found in [40].
Why does MP-GCN not pool the input of the second graph convolution layer? If pooling were used there, the classification effect would not necessarily increase, but the computational consumption would.

V. CONCLUSION
In this study, we propose MP-GCN for short text classification. This network introduces multi-head pooling to enhance the representation learning of important nodes. We introduce three architectures of MP-GCN, which focus on node representation learning over 1-order isomorphic graphs, 1&2-order isomorphic graphs, and 1-order heterogeneous graphs, respectively. Experimental results demonstrate that, without using pre-trained embeddings, MP-GCN can outperform state-of-the-art models across five benchmark datasets.