Joint Learning with BERT-GCN and Multi-Attention for Event Text Classification and Event Assignment

The government hotline is closely tied to people's lives and plays an important role in solving social problems and maintaining social stability in China. However, hotline event texts vary in length and often have unclear elements, so it is challenging for operators to manually complete the assignment of hotline events. To address these problems, we propose a joint learning method for event text classification and event assignment for the Chinese government hotline. Firstly, a graph convolutional network (GCN) and BERT are used to process the event text separately to obtain the corresponding representation vectors; the two vectors are then fused by a dynamic fusion gate, and the fused vector is classified by the text classification module. Secondly, we use a multi-attention mechanism to process the GCN result vector, the BERT result vector and the "sanding" vector to obtain an attentive "event-sanding" representation vector and calculate the corresponding department probability distribution. Finally, a reordering model based on historical prior knowledge sorts the results of the "event-sanding" matching module and outputs the optimal assignment department for the hotline event. Experimental results show that our method achieves better performance than several baseline approaches, and ablation experiments demonstrate the validity of each proposed module.


I. INTRODUCTION
The government hotline is an important channel for the government to connect with the public and an effective way for citizens to participate in social governance in China [1]. Through the hotline, the government can learn about people's livelihood issues in a timely manner, effectively resolve social conflicts, and improve governance capacity and efficiency. In recent years, several local governments in China have made progress in the digital reform of hotlines, but problems remain in the event assignment process. Because a hotline operator cannot deeply understand the rights and responsibilities of every department, it is still difficult to select the correct responsible department from dozens of candidates after fully understanding a citizen's needs, resulting in low accuracy and efficiency of event assignment. It is therefore of great significance to develop an automatic assignment method that can accurately determine the corresponding responsible department.
Different from traditional natural language texts, citizens' telephone texts vary in length, and most are narrative descriptions with unclear wording and incomplete elements [2]. It is therefore difficult to assign events accurately using the hotline text alone. Combining it with public data from government departments (such as department profiles and "sanding" data) can effectively support the assignment of various events. Note that the "sanding" covers an organization's specifications, main responsibilities, internal organizations and their specific responsibilities, staffing and leading positions. How to effectively use the "sanding" information to improve the accuracy of event assignment has thus become an important research focus of the event assignment task.
How to effectively process the event text and obtain its semantic encoding is the core task of event assignment. Traditional word embedding methods such as Word2vec [3] and GloVe [4] consider only simple contextual information and struggle with polysemy in natural language text. Pre-trained language models (BERT [5], XLNet [6], RoBERTa [7], etc.) can effectively solve this problem, but they limit the input text length and therefore handle long texts poorly: when processing a long event text, the input is truncated, which causes information loss. This problem can be mitigated by constructing an event graph and applying a GCN [8] to the nodes and edges of the constructed graph. At the same time, an attention mechanism [9] can improve assignment accuracy by strengthening the association between hotline events and the "sanding" data of government departments. In addition, the event text type provides prior knowledge that further improves assignment accuracy. Based on the above, we propose a joint learning method for event text classification and event assignment. Our contributions are as follows: (1) We jointly learn event text classification and event assignment to reduce model parameters and improve model accuracy. The proposed joint learning framework employs the event question type classification task to guide the event text encoding, and then leverages the encoding to conduct event assignment. (2) To address the varying length and unclear elements of event texts, we propose a BERT-GCN model that uses a gate mechanism based on prior knowledge to realize dynamic feature fusion of BERT and GCN, effectively extracting event text features for both event text classification and event assignment.
(3) We propose an "event-sanding" matching model based on multi-attention, which effectively matches "event-sanding" pairs by applying separate attentions to the GCN-"sanding", GCN-BERT and BERT-"sanding" interactions.

II. RELATED WORK

A. GRAPH CONVOLUTIONAL NETWORKS
In recent years, owing to the ubiquity of graph data, graph neural networks (GNNs) have attracted increasing attention from researchers [10,11]. A number of researchers have applied existing neural network models to graph structures to achieve end-to-end graph modeling [8,12,13], and the graph convolutional neural network is the most active line of work. In pioneering work, Bruna et al. [14] proposed the first graph convolutional neural network in 2013, defining graph convolution in the spectral domain based on graph theory and the convolution theorem. The original spectral method suffers from high space-time complexity; Kipf and Welling [8] presented a simplified model, the graph convolutional network (GCN), which greatly reduces this complexity by parameterizing the spectral convolution kernel. Similarly, Defferrard et al. [15] proposed a CNN algorithm based on spectral graph theory that generalizes convolutional neural networks from low-dimensional regular grids to high-dimensional irregular domains, also with greatly reduced space-time complexity. Beyond this work, GCNs have been applied in several NLP tasks such as semantic role labeling [16], relation classification [17] and machine translation [18], where the GCN encodes the syntactic structure of sentences. For text classification, there is also much work on graph neural networks [8,12,13,19,20]. However, these approaches use only words as nodes when constructing the graph of a document or sentence, which loses part of the text's semantic information. We therefore focus on combining a language model with a GCN through a dynamic fusion gate, which can effectively utilize textual prior knowledge.

B. TEXT CLASSIFICATION
Text classification refers to extracting features from raw text data and predicting text categories based on these features. It underlies many natural language processing (NLP) tasks, such as sentiment analysis, topic tagging, question answering, and dialog act classification. Among machine learning models, Naive Bayes (NB) was the first used for text classification; subsequently, KNN, SVM, random forest (RF), XGBoost [21], LightGBM [22] and other machine learning algorithms have been widely applied to the task.
For deep learning models, Kim proposed TextCNN [23], the first to use a CNN for sentence classification. Tang et al. [24] used a CNN or LSTM to extract single-sentence representations, and then a gated RNN to encode the internal and semantic relationships between sentences, obtaining a document-level encoding for text classification. However, a CNN can effectively extract only local text features and struggles to obtain contextual semantics, while LSTM and GRU models capture contextual semantics but extract local sentence features poorly. To exploit the respective advantages of CNNs and RNNs, more and more researchers combine the two [25][26][27]. Meanwhile, to further improve representation quality, attention mechanisms have been introduced as an integral part of text classification models [28,29]. Although CNN-RNN combinations have achieved good results, their essential limitations are hard to overcome. With the emergence and development of pre-trained language models, these problems have been substantially alleviated: pre-trained language models effectively learn global semantic representations and significantly boost NLP tasks, including text classification. However, they restrict the input text length, making it difficult to extract the syntactic structure features of long texts. GNN-based models, by contrast, can extract such features by encoding text graph nodes, so several researchers have studied GNNs for text classification. Hao et al. [19] proposed a graph-CNN based model that converts text to a graph-of-words and then applies graph convolution operations to the word graph, with the advantage of learning different levels of semantics. Yao et al. [30] proposed the text graph convolutional network (TextGCN) for text classification, which builds a single text graph for a whole dataset based on word co-occurrence information and document-word relations.

C. ANSWER SELECTION
Answer selection, a sub-task of automatic question answering, has attracted wide attention from researchers, and neural network-based models have proved effective for it. In 2015, Severyn et al. [31] used CNNs to obtain feature maps of the query and the document, adding a similarity matrix to enhance the prediction. Also in 2015, Wang et al. [32] proposed an RNN-based approach that uses a bidirectional Long Short-Term Memory (BiLSTM) network to sequentially read words from question and answer sentences and output relevance scores. Instead of learning the vector representations of questions and answers separately, recent studies have introduced attention mechanisms to highlight the interaction between questions and answers and better focus on their relevant parts. Wen et al. [33] proposed a hybrid attention mechanism for automatic answer selection that aligns the most informative parts of the question-answer pair. Yang et al. [34] proposed a hierarchical attention network that fully fuses the input documents and a knowledge base by exploiting the semantic characteristics of the input sequences. With the widespread use of pre-trained language models, answer selection accuracy has also improved: Laskar et al. [35] used BERT for answer selection, effectively leveraging the context of each word in a sentence and improving model performance.

III. METHODOLOGY

A. PROBLEM DEFINITION
We aim to jointly conduct two tasks, event text classification and event assignment, which can be regarded as a domain-specific text classification problem and a text matching problem. Given an event description text $T_i$, the goal is to simultaneously classify the correct type from a candidate set $C = \{c_1, c_2, \ldots, c_n\}$ and select the correct department from a candidate set $D = \{d_1, d_2, \ldots, d_m\}$ through an answer selection task: select the set of correct answers from the candidates and then rank them to obtain the best one.
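To make this input-output contract concrete, the following minimal Python sketch states the joint task interface; the names and data structures are our own illustration, not part of the model.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class JointPrediction:
    event_type: str                               # predicted type c from C
    ranked_departments: List[Tuple[str, float]]   # candidates from D with scores, best first

def assign(event_text: str, types: List[str], departments: List[str]) -> JointPrediction:
    """Jointly classify the event type and rank candidate departments."""
    raise NotImplementedError  # realized by the model described in Section B
```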

B. MODEL
We now introduce the proposed joint learning model for event text classification and event assignment. As depicted in Fig. 1, the overall framework consists of three components: (1) a GCN and BERT based event text classification module, (2) a multi-attention based "event-sanding" matching module, and (3) a reordering module. Firstly, module (1) encodes the input event text to obtain the vectors $V_{gcn}$ and $V_{lm}$, which are then fused through a specially designed gate to conduct text classification. Secondly, the attentive "event-sanding" representation vector is calculated in module (2) from $V_{gcn}$ and $V_{lm}$ together with the "sanding" vector $V_{sanding}$ through a multi-attention layer; this representation is then used to calculate the "event-sanding" matching probability. Finally, module (3) reorders all the calculated matching probabilities to obtain the final assigned department.
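The data flow through the three components can be sketched as follows. This is a schematic only; module names such as gcn_encode, fuse_and_classify, match and reorder are placeholders for the components described below, not the authors' code.

```python
import torch

def forward(event_text, sanding_texts, model):
    # Module (1): encode the event with GCN and BERT, fuse, and classify.
    v_gcn = model.gcn_encode(event_text)   # graph-based representation V_gcn
    v_lm = model.bert_encode(event_text)   # contextual representation V_lm
    v_fuse, p_type = model.fuse_and_classify(v_gcn, v_lm)

    # Module (2): multi-attention matching against every "sanding" entry,
    # yielding one matching probability per entry.
    p_match = torch.stack(
        [model.match(v_gcn, v_lm, model.bert_encode(s)) for s in sanding_texts]
    )

    # Module (3): reorder the matching probabilities using historical priors
    # and pick the best department.
    department = model.reorder(p_match)
    return p_type, department
```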

1) GCN AND BERT BASED EVENT TEXT CLASSIFICATION
The proposed event classification model includes four modules: (i) graph construction; (ii) graph convolutional network; (iii) BERT layer; (iv) dynamic fusion layer.
Graph Construction. In this section, we introduce how to construct the event key-information interaction graph from an event text. Event texts vary in length, and some of their sentences are not relevant to the event topic, so we extract keywords from the event text to serve as its topics. In the following steps, these keywords serve as graph nodes representing the key information of the event graph. Since keyword extraction is not the focus of this paper, we use existing tools for it. We adopt two methods to construct the event graph, between sentences and within sentences respectively. Algorithm 1 shows the construction of the event graph between sentences.
Algorithm 1 Construction of the event graph between sentences
Input: The event text D
Output: Event graph between sentences
1. Do word segmentation, named entity recognition and keyword extraction on D to get the keywords K
2. for sentence s in D do
3.   if a keyword k in K appears in s then
4.     assign s to vertex vk
5.   else
6.     assign s to vertex vempty
7.   end if
8. end for
9. for vertices vi and vj do
10.   if vi and vj are in the same sentence then
11.     add an edge between vi and vj
12.   end if
13. end for
Given an event text, we first use existing tools to perform word segmentation, named entity recognition and keyword extraction on it. After obtaining the event keywords, we use them to construct a coarse graph. As shown in Algorithm 1, for the k extracted keywords, we adopt a simple strategy that assigns a sentence s to the keyword k if k appears in the sentence. Note that one sentence can be associated with multiple keywords, which implicitly indicates a connection between them, so we add an edge between such keywords. Sentences that contain none of the keywords are put into a special node called "empty".
Algorithm 2 Construction of the event graph within a sentence
Input: The event sentence s, the given keyword k in s, and the distance threshold L
Output: Event graph within the sentence
1. Do dependency parsing on sentence s to obtain the syntactic dependency graph
2. for word wi in sentence s do
3.   if the distance between wi and k in the dependency graph is less than L then
4.     add an edge between wi and k
5.     set the edge weight to the distance between wi and k
6.   end if
7. end for
After this, we have constructed a coarse graph between the sentences of an event, but it lacks nodes and relations within a sentence, so we next construct the event graph within sentences. As shown in Algorithm 2, for the given keyword and its sentence, we first use a dependency parsing tool (such as HIT LTP) to analyze the sentence and obtain the corresponding syntactic dependency graph. Then, all words whose distance from the given keyword in the syntactic dependency graph is less than L are retrieved as nodes inside the sentence, and an edge is added between each retrieved word and the keyword, weighted by the distance between them. Together with the coarse graph constructed by Algorithm 1, this completes the event graph of the given event text.
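The two algorithms can be combined into a single construction routine. The sketch below is our illustration: networkx is an implementation convenience rather than the authors' tooling, and dep_distance stands in for whatever distance the dependency parser provides.

```python
import itertools
import networkx as nx

def build_event_graph(sentences, keywords, dep_distance, L=3):
    """Build the event graph of Algorithms 1-2.

    sentences    -- tokenized sentences (lists of words)
    keywords     -- keywords extracted from the event text
    dep_distance -- callable (sentence, word, keyword) -> distance in the
                    sentence's syntactic dependency graph (parser-dependent)
    """
    g = nx.Graph()
    g.add_nodes_from(keywords)
    g.add_node("empty")                      # collects keyword-free sentences

    for sent in sentences:
        hits = [k for k in keywords if k in sent]
        if not hits:
            continue                         # sentence is assigned to "empty"
        # Algorithm 1: keywords sharing a sentence get connected.
        for ki, kj in itertools.combinations(hits, 2):
            g.add_edge(ki, kj)
        # Algorithm 2: words within dependency distance L of a keyword become
        # intra-sentence nodes; the edge weight is that distance.
        for k in hits:
            for w in sent:
                d = dep_distance(sent, w, k)
                if w != k and d < L:
                    g.add_edge(w, k, weight=d)
    return g
```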
Graph Convolutional Network. For the event text D processed by word segmentation, we first apply a word embedding operation to each token in D to obtain the word embedding sequence and the final embedding set E. For E, we use a self-attention mechanism to obtain the keyword representation vector $v_i$ that contains context information, as shown in (1)-(4), where $W_k$, $b_k$, $W_w$ and $b_w$ are learnable parameters and $u_w$ is the context vector:

$k_t = \tanh(W_k e_t + b_k)$  (1)
$u_t = \tanh(W_w k_t + b_w)$  (2)
$\alpha_t = \exp(u_t^\top u_w) / \sum_j \exp(u_j^\top u_w)$  (3)
$v_i = \sum_t \alpha_t k_t$  (4)

After obtaining the vertex representation vector set $V=\{v_1, v_2, \ldots, v_i, \ldots\}$ of the graph, we feed it to a GCN to make use of the structure of the constructed graph. We use a GCN implementation similar to that of Kipf and Welling [8], which follows the layer-wise propagation rule

$H^{(l+1)} = \sigma\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\big)$  (5)
$\tilde{A} = A + I_N$  (6)
$\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$  (7)

where A is the adjacency matrix of the undirected graph, $I_N$ is the identity matrix, $W^{(l)}$ is a layer-specific trainable weight matrix, and $\sigma(\cdot)$ denotes an activation function (such as ReLU). For the input vertex representation set V, we calculate the vector representation $V_{gcn}$ of the event text with a two-layer GCN according to formulas (8)-(9), writing $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$:

$H = \mathrm{ReLU}(\hat{A} V W^{(0)})$  (8)
$V_{gcn} = \mathrm{ReLU}(\hat{A} H W^{(1)})$  (9)

where $W^{(0)}$ and $W^{(1)}$ are the input-to-hidden and hidden-to-output weight matrices respectively.
BERT layer. For the given event text D, BERT is used as the information extraction model to obtain the event context vector $V_{lm}$:

$V_{lm} = \mathrm{BERT}(D)$  (10)

Dynamic Fusion layer. To handle the varying length of event texts, we use a dynamic fusion method based on a gating mechanism to fuse the GCN information $V_{gcn}$ and the language model information $V_{lm}$ into a fused contextual semantic representation $V_{fuse}$:

$V_{fuse} = g_1 \odot V_{gcn} + g_2 \odot V_{lm}$  (11)

where $g_1$ and $g_2$ are the fusion weights of the GCN information and the language model information respectively, obtained by combining the prior knowledge of the event text:

$g_1 = \sigma(W_g [V_{gcn}; V_{lm}] + b_g)$  (12)
$g_2 = 1 - g_1$  (13)

where $W_g$ is a learnable weight matrix, $b_g$ is a bias vector, and $\odot$ denotes element-wise multiplication. The probability distribution of event text types is calculated as

$p_{type} = \mathrm{softmax}(W_{fuse} V_{fuse} + b_{fuse})$  (14)

where $W_{fuse}$ is a learnable weight matrix, $b_{fuse}$ is a bias vector, and $p_{type}$ is the event category probability.
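A compact PyTorch rendering of equations (5)-(14) might look as follows. This is a sketch under our assumptions: mean pooling over node vectors to obtain $V_{gcn}$, bias-free GCN weights, and all dimensions fixed at 768; it is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionClassifier(nn.Module):
    """Two-layer GCN encoder plus the dynamic fusion gate (eqs. 8-14)."""

    def __init__(self, dim=768, n_types=10):
        super().__init__()
        self.w0 = nn.Linear(dim, dim, bias=False)   # input-to-hidden W(0)
        self.w1 = nn.Linear(dim, dim, bias=False)   # hidden-to-output W(1)
        self.gate = nn.Linear(2 * dim, dim)         # W_g, b_g of eq. (12)
        self.cls = nn.Linear(dim, n_types)          # W_fuse, b_fuse of eq. (14)

    def gcn(self, a_hat, v):
        # a_hat: normalized adjacency (N, N); v: node vectors (N, dim).
        h = F.relu(a_hat @ self.w0(v))              # eq. (8)
        z = F.relu(a_hat @ self.w1(h))              # eq. (9)
        return z.mean(dim=0)                        # pool nodes into V_gcn

    def forward(self, a_hat, v_nodes, v_lm):
        v_gcn = self.gcn(a_hat, v_nodes)
        g1 = torch.sigmoid(self.gate(torch.cat([v_gcn, v_lm], dim=-1)))
        v_fuse = g1 * v_gcn + (1 - g1) * v_lm       # eqs. (11)-(13)
        p_type = F.softmax(self.cls(v_fuse), dim=-1)
        return v_fuse, p_type
```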

2) MULTI-ATTENTION BASED "EVENT-SANDING" MATCHING
In order to enhance the interaction between the event and the "sanding", we propose a multi-attention mechanism to fetch the important information from the GCN event encoding, the BERT event encoding and the BERT "sanding" encoding.
GCN-event and BERT-"sanding" attention. For the given "sanding" text $D_{sanding}$, we use the BERT model to extract its context information and obtain the corresponding representation vector $V_{sanding}$:

$V_{sanding} = \mathrm{BERT}(D_{sanding})$  (15)

We use the two-way attention mechanism [36] to model the interaction between the GCN event encoding and the BERT "sanding" encoding,

$M = \tanh(V_{gcn}^\top U V_{sanding})$  (16)

where U is a learnable interaction matrix, and row-wise and column-wise softmax over M produce the attention weights that summarize each side. The self-attention mechanism is then used to process the concatenated representation vector.
GCN-event and BERT-event attention. The two-way attention between the GCN encoding and the BERT encoding of the same event is computed in the same way.
BERT-event and BERT-"sanding" attention. Similar to the GCN-event and BERT-"sanding" attention, the two-way attention mechanism is used to model the interaction between the BERT event encoding and the BERT "sanding" encoding.
To make use of the contribution of each attention, we concatenate all of the attention representation vectors; the concatenation constitutes the attentive "event-sanding" representation vector.
"Event-sanding" matching loss. In the multi-attention based "event-sanding" matching module, the matching task is trained to minimize the cross-entropy loss

$\mathcal{L}_{match} = -\sum_i \big[\, y_i \log p_i + (1-y_i) \log(1-p_i) \,\big]$  (17)

where $p_i$ is the "event-sanding" matching probability and $y_i$ is the binary classification label (0 or 1) of the "event-sanding" pair.
Overall loss function. We use a joint loss strategy to optimize the whole network; the final objective minimizes the weighted sum of the classification loss $\mathcal{L}_{type}$ and the matching loss,

$\mathcal{L} = \lambda_1 \mathcal{L}_{type} + \lambda_2 \mathcal{L}_{match}$  (18)

where $\lambda_1$ and $\lambda_2$ are hyper-parameters that balance the two losses.
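The following PyTorch sketch illustrates one plausible realization of a two-way attention block and the joint objective. The exact attention form follows [36]; the max-pooling of the alignment matrix and the tensor shapes are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoWayAttention(nn.Module):
    """Attentive-pooling style two-way attention over two vector sequences."""

    def __init__(self, dim=768):
        super().__init__()
        self.u = nn.Parameter(torch.randn(dim, dim))   # interaction matrix U

    def forward(self, x, y):
        # x: (m, dim) event-side vectors; y: (n, dim) "sanding"-side vectors.
        m = torch.tanh(x @ self.u @ y.t())             # (m, n) alignment matrix
        ax = F.softmax(m.max(dim=1).values, dim=0)     # attention over x rows
        ay = F.softmax(m.max(dim=0).values, dim=0)     # attention over y columns
        return ax @ x, ay @ y                          # attended summaries

def joint_loss(p_type, y_type, p_match, y_match, lam1=1.0, lam2=1.0):
    """Weighted sum of classification and matching losses, as in eq. (18).
    p_type: (B, n_types) class probabilities; p_match: (N,) probabilities."""
    l_type = F.nll_loss(torch.log(p_type), y_type)
    l_match = F.binary_cross_entropy(p_match, y_match.float())
    return lam1 * l_type + lam2 * l_match
```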

3) PRIOR KNOWLEDGE BASED REORDERING
After the government hotline events are processed by the model, a list of probabilities is obtained by matching every "sanding" entry against one event. Because a department has multiple responsibilities, the matching results must be reordered per department. The reordering module reorders the "event-sanding" probabilities using a weighted-average strategy and then selects the department. The weighting strategy is acquired from prior knowledge of the events: since both historical and new event assignments follow the same assignment distribution, the probability distribution of "event-sanding" matching is highly similar between new and historical events. We therefore calculate the assignment weight of each "sanding" entry from the historical assignment statistics of that "sanding". With these weights, the reordering module computes a weighted average of the matching values to obtain a total score for each department, and the best department is finally selected based on this score.
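A minimal sketch of this weighted-average reordering follows; the data structures are illustrative, and prior_weights stands for the weights derived from the historical assignment statistics of each "sanding" entry.

```python
from collections import defaultdict

def reorder(match_probs, sanding_to_dept, prior_weights):
    """Aggregate "event-sanding" matching probabilities per department.

    match_probs     -- {sanding_id: matching probability from the model}
    sanding_to_dept -- {sanding_id: department owning that responsibility}
    prior_weights   -- {sanding_id: weight from historical assignment statistics}
    """
    num, den = defaultdict(float), defaultdict(float)
    for sid, p in match_probs.items():
        d, w = sanding_to_dept[sid], prior_weights.get(sid, 1.0)
        num[d] += w * p                          # weighted sum of probabilities
        den[d] += w
    scores = {d: num[d] / den[d] for d in num}   # weighted average per department
    return max(scores, key=scores.get)           # best-scoring department
```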

IV. EXPERIMENTS

A. DATASETS
We use an experimental dataset based on real event assignment cases from the Wuhu government hotline, comprising two parts: "event-department" and "event-sanding". The "event-department" dataset is constructed from the real assignment results of the hotline and contains 30,000 events covering 30 municipal departments. The "event-sanding" dataset is manually annotated by service personnel based on the actual processing results of the 30,000 "event-department" records, and includes 30,000 positive samples (matched) and 60,000 negative samples (unmatched). The datasets are described in Table 1.

B. EXPERIMENT SETTING
PyTorch 1.7.1 is used to build the network, and experiments are carried out on an NVIDIA GeForce RTX 3090 under Ubuntu 18.04 LTS. We use the BERT-wwm [5] model as our language model for the semantic representation of events and "sanding"; the language model embedding dimension is 768, the vocabulary size is 30,000 and the input sequence length is 512. The GCN embedding size is set to 768. The Adam optimizer with a learning rate of $10^{-5}$ is used to optimize the model, and we train in batches of size 16.
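For reference, the reported setting can be reproduced roughly as follows. "hfl/chinese-bert-wwm" is a public BERT-wwm checkpoint and is our assumption for the weights; the example sentence is likewise illustrative.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm")
bert = BertModel.from_pretrained("hfl/chinese-bert-wwm")      # hidden size 768

batch = tokenizer(["小区楼下餐馆油烟扰民，希望相关部门处理。"],
                  truncation=True, max_length=512,            # 512-token input limit
                  return_tensors="pt")
optimizer = torch.optim.Adam(bert.parameters(), lr=1e-5)      # Adam, lr = 1e-5
# Training proceeds in mini-batches of 16 on a single RTX 3090.
```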

C. EVENT TEXT CLASSIFICATION RESULT
We first compare the proposed method with several state-of-the-art methods on the text classification task, including HAN [28], TextGCN [30], XLNet [6] and BertGCN [37]. Micro-F1 and Weighted-F1 are adopted as evaluation metrics. Table 2 summarizes the experimental results of the different methods on event text classification and shows that the joint learning model achieves state-of-the-art performance on the event classification task. Several observations stand out. (i) HAN uses only a BiLSTM network and a multi-level attention mechanism to extract text features, so it can capture contextual information but is poor at extracting local features and is limited by the input text length. (ii) When a GCN is used to encode the text graph, it effectively captures the syntactic structure of the text and handles the text length problem well, but it is weak at extracting contextual information. (iii) A language model encodes contextual information effectively but is also limited by the input text length. (iv) Our model combines the language model with the GCN through a gating fusion mechanism specifically designed for the task, which effectively extracts the useful information of the event text, so it achieves the best results. The experimental results in Table 3 further confirm this analysis and the validity of the proposed model.

D. EVENT ASSIGNMENT RESULT
To evaluate event assignment, we also compare the proposed method with the following state-of-the-art baselines: Siamese-BiLSTM-based [38], ABCNN-based [39], BERT-BiGRU-based [5] and ELECTRA-BiGRU-based [40]. Precision at the top five results (P@5), mean average precision (MAP) and mean reciprocal rank (MRR) are used to measure the overall effect of the answer selection network in the "event-sanding" matching module, while Precision, Recall and F1 evaluate the performance of event assignment. The results are shown in Table 3. Our method outperforms all baselines on every metric, with a 3% to 5% improvement in overall event assignment performance; in the "event-sanding" matching module, its P@5, MAP and MRR improve by about 1% to 3%. Compared with the Siamese-BiLSTM-based and ABCNN-based models (which use a BiLSTM network and a CNN as the basic feature extractors respectively), we use a language model for feature extraction, which better captures the contextual semantics of the text. Compared with the BERT-BiGRU-based and ELECTRA-BiGRU-based models (which combine a language model with a BiGRU network for event assignment), we combine the language model with a GCN, which effectively addresses the input text length problem. In addition, the multi-attention mechanism handles the matching of "event-sanding" pairs and effectively highlights the correlation within each pair. As a result, our model achieves the best assignment performance among all compared models.
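For clarity, the ranking metrics can be computed as below. This is our implementation of the standard definitions; for P@5 we take the fraction of relevant "sanding" entries among the top five, one common reading of "first five results accuracy".

```python
def p_at_k(ranked_labels, k=5):
    """Fraction of relevant items among the top-k ranked candidates.
    ranked_labels: 1 (matched) / 0 (unmatched), best-scored first."""
    top = ranked_labels[:k]
    return sum(top) / len(top)          # e.g. p_at_k([1, 0, 1, 0, 0]) -> 0.4

def average_precision(ranked_labels):
    """AP for one query: mean of precision values at each relevant rank."""
    hits, score = 0, 0.0
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            score += hits / i
    return score / max(hits, 1)

def mrr(list_of_ranked_labels):
    """Mean reciprocal rank of the first relevant item over all queries."""
    total = 0.0
    for labels in list_of_ranked_labels:
        rank = next((i for i, rel in enumerate(labels, start=1) if rel), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(list_of_ranked_labels)
```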

E. ABLATION ANALYSIS
To validate the effectiveness of the event text classification module and the "event-sanding" matching module in the proposed method, we conducted ablation experiments on these two modules.

1) EVENT TEXT CLASSIFICATION MODULE
Our model takes advantage of both the GCN and the language model, which allows it to handle input event texts of varying length. We therefore conduct two ablation experiments to demonstrate the validity of the two components. For the GCN ablation, the GCN is removed and only the language model is used to extract event text features, with the other parts unchanged. For the language model ablation, BERT is removed and only the GCN is used to extract the syntactic structure features of the text, with the other parts unchanged. Table 4 shows the ablation results of the event text classification module, where "no-GCN" denotes the GCN ablation and "no-LM" the language model ablation. Our full method outperforms both ablations on all evaluation metrics, by more than 3 percentage points on average. For the "no-GCN" model, the language model's input length limitation makes it difficult to extract the syntactic structure information of longer texts, so its results are inferior to those of the full model. For the "no-LM" model, the GCN's insufficient ability to encode textual context makes it hard to extract contextual information efficiently and accurately, so its results are also worse than those of the full model. These ablation results demonstrate the effectiveness of the proposed event text classification module for hotline event assignment.

2) "EVENT-SANDING" MATCHING MODULE
The ablation of the "event-sanding" matching module is the ablation of its three attention modules:
① remove the GCN-event and BERT-"sanding" attention, concatenate only the other two attention result vectors, and keep the rest unchanged.
② remove the GCN-event and BERT-event attention, concatenate only the other two attention result vectors, and keep the rest unchanged.
③ remove the BERT-event and BERT-"sanding" attention, concatenate only the other two attention result vectors, and keep the rest unchanged.
Table 5 shows the ablation results of the "event-sanding" matching module, where ①, ② and ③ denote the three ablations above. As the values in Table 5 show, the models rank as follows: ours > ② > ① > ③. The language model can effectively process most input event texts, since most of them fit within its input length limit (512), so removing the BERT-event and BERT-"sanding" attention (③) hurts the most. The GCN used in this paper, by contrast, is less effective on short texts and has difficulty extracting contextual information; moreover, ablation ② removes only the attention between two encodings of the same event, so it is second only to the full model. Ablation ① removes the attention between the GCN encoding and the "sanding" encoding, which makes it difficult to exploit the syntactic structure information of long event texts, so it performs worse than ②. These experiments show that the proposed "event-sanding" matching module effectively improves the accuracy of event assignment.

V. CONCLUSION
In this paper we propose a joint learning method with BERT-GCN and multi-attention for event text classification and event assignment for the Chinese government hotline. Firstly, a graph construction algorithm processes the event text to obtain the event graph; the syntactic structure information of the event text is then extracted by the GCN while the contextual information is extracted by the BERT model, and the two kinds of information are fused by the dynamic fusion gate into a fusion vector used for text classification. Secondly, a multi-attention mechanism processes the GCN result vector, the BERT result vector and the "sanding" vector to obtain five attentive "event-sanding" representation vectors, which are concatenated to calculate the corresponding department probability distribution. Finally, a reordering model built from the historical prior knowledge of event distribution sorts the results of the "event-sanding" matching module and outputs the optimal assigned department for the hotline event. Experimental results show that our model achieves state-of-the-art results on many metrics, and the ablation experiments demonstrate the effectiveness of each proposed module.