Leveraging Pre-Trained Language Model for Summary Generation on Short Text

Bidirectional Encoder Representations from Transformers (BERT) is the latest incarnation of pre-trained language models, which have achieved satisfactory results in text summarization tasks. However, it has not achieved good results in generating Chinese short text summaries. In this work, we propose a novel short text summary generation model based on keyword templates, which uses templates found in the training data to extract keywords that guide summary generation. Experimental results on the LCSTS data set show that our model performs better than the baseline models. Analysis shows that the methods used in our model can generate high-quality summaries.


I. INTRODUCTION
In deep learning research, when training data for the target task is scarce, pre-training followed by fine-tuning can usually achieve outstanding results [1]. In recent years, we have witnessed the impressive results of pre-trained language models on several sub-tasks in the field of natural language processing (NLP) [2], such as dialogue systems, machine translation, and named entity recognition. Representative models include ELMo [3], GPT [4], BERT [5], and ALBERT [6]. Among these pre-training methods, BERT [5] has the most outstanding performance. The model uses masked language modelling and next sentence prediction for pre-training on large corpora; it can then be fine-tuned to the specific requirements of downstream tasks and achieve satisfactory results.
In this article, we explore how to better apply the pre-trained language model to short text summary generation. The purpose of summary generation is to automatically produce a coherent summary from a given document by rewriting or extraction, shortening the document or paragraph to relieve users of the reading pressure caused by information overload. Text summary generation methods are usually divided into two types: extractive methods and abstractive methods [7]. The extractive method directly selects salient, minimally redundant sentences or phrases from the text to form a summary. In contrast, the abstractive method is more complex but can generate novel words that do not appear in the source document; this method is closer to the way humans write summaries. Compared with previous work, summary generation models using pre-trained language models have made significant progress. However, these works are based on long documents [8]. Experiments show that the BERT pre-training model does not perform well when generating Chinese short text summaries. To solve this problem, we propose a short text summary generation model based on keyword templates.
Template-based summarization is an effective method in traditional summary generation: domain experts manually create a number of hard templates, which are then used to guide summary generation [9]. However, creating all templates manually is unrealistic, as it requires many experts and is labour-intensive. Unlike previous work, the templates used in our model all come from the training set, with no need for experts to create new ones.
In this article, we introduce a short text summary generation model based on keyword templates, which makes the BERT model better suited to Chinese short text summarization tasks. On top of BERT's original sentence division method, we add a keyword-based sentence division method. In the training phase, we extract keywords from the reference summary and divide the input text; in the testing phase, we use similarity calculation to find the most similar text in the training set and extract keywords from its reference summary for sentence division. Experiments show that our proposed method achieves good results with the abstractive model and generates higher-quality summaries.
The contributions of this article are as follows:
1. We introduce a short text summary generation model based on keyword templates and improve the data preprocessing method for Chinese short text summarization.
2. We show how to apply the pre-trained language model to short text summary generation efficiently, and verify it through the abstractive method.
3. Our model can serve as a stepping stone to improve summary quality and make the pre-trained language model more effective in short text summary generation.

II. RELATED WORK
In this section, we introduce related research on pre-trained language models, extractive models, and abstractive models.

A. PRETRAINED LANGUAGE MODELS
Research on pre-trained language models is mainly aimed at language understanding tasks; these models can usually be classified into feature-based models and fine-tuning-based models [1]. Feature-based methods mainly use pre-trained models to provide language representations and features for downstream tasks [10]. ELMo [3] used a bidirectional LSTM [11] language model to obtain context-sensitive pre-trained representations; in supervised tasks, these are concatenated with the word vector input or the top-level representation of the model as features. GPT [4] used the Transformer [12] network instead of an LSTM [11] as the language model to better capture long-distance language structure. When applied to downstream tasks, GPT [4] does not need a new task-specific model structure, and it can effectively improve the generalization ability of supervised models and accelerate convergence. Fine-tuning methods first pre-train the model on a language modelling objective and then fine-tune it on downstream tasks with supervised data. The BERT model uses ''masked language modelling'' and ''next sentence prediction'' to train on a large-scale corpus; it is widely used because it can be applied to multiple downstream NLP tasks through fine-tuning and achieves outstanding results. Unlike the BERT pre-training model, ALBERT [6] is faster to train and uses less memory: it employs embedding factorization and cross-layer parameter sharing to reduce model parameters and effectively improve training speed.
In past research, pre-trained language models were usually applied to natural language understanding tasks to improve their performance. Recently, many scholars have applied pre-trained language models to generation tasks; for example, BERT's parameters can be fine-tuned jointly with task-specific generation parameters. In this article, we apply the BERT model to the text summary generation task.

B. EXTRACTIVE MODELS
The extractive summary first scores all sentences in the document according to their importance, then sorts the sentences by score, and finally selects the highest-scoring sentences to form a summary. Early summary generation models mainly relied on hand-crafted feature engineering; common methods include context matching [13] and graph models [14], but their results were not very satisfactory. With the advancement of artificial intelligence, deep neural networks are now widely used in summary generation tasks. Yin et al. [15] applied neural networks to extractive summarization, mapping sentences into vectors and selecting vital sentences to form summaries. Yasunaga et al. [16] combined a recurrent neural network with a graph convolutional network to calculate the importance of each sentence and then selected sentences to form the summary. Narayan et al. [17] globally optimized the ROUGE [18] metric through reinforcement learning, conceptualized extractive single-document summarization as a sentence ranking task, and proposed a novel training algorithm; this addresses the problem that training with a cross-entropy loss can yield overly long summaries with excessive redundant information. The SUMO [19] model proposed an end-to-end extractive summary generation method that regards single-document extractive summarization as a tree induction problem: subtrees correspond to sentences in the original document that are related to, or explain, the summary, which improves the relevance between the generated summary and the text. Recently, pre-trained language models have been shown to be useful for improving text summarization tasks. Liu [20] applied BERT [5] to extractive summarization for the first time; the experimental results show that pre-trained models are also suitable for text summarization. Wang et al. [21] tried three different pre-training strategies, Mask, Replace, and Switch, and used self-supervised methods to capture the global features of the document to grasp the main content of the article more accurately. The HIBERT [22] model treated the task as sequence labelling, marking whether each sentence of the original text appears in the summary. Zhong et al. [23] proposed a completely new method that transforms the extractive summarization task into a semantic matching problem; the advantage is that a candidate summary (several sentences) can be extracted directly rather than sentence by sentence.

C. ABSTRACTIVE MODELS
Compared with the extractive method, the abstractive method is closer to the way humans write summaries, and research on it has grown rapidly in recent years. The abstractive method regards summary generation as a sequence-to-sequence problem and uses neural networks to solve it. Nallapati et al. [24] and Chopra et al. [25] used RNNs to replace the traditional encoder and decoder and achieved good results. Lin et al. [26] proposed a global encoding framework based on context information to improve the model's control of global details; the model uses a gated convolution unit to retain core information and filter out redundant information. Chen et al. [11] proposed an accurate and fast summary model that first selects crucial sentences and then rewrites them; it connects the sentence extraction network and the summary generation network with a novel sentence-level policy gradient method, obtaining not only the best experimental results at the time but also a significant speedup in decoding and training. Zhang et al. [27] applied the BERT model to abstractive summarization for the first time: the model uses BERT as the encoder to extract the input document's features, decodes with a Transformer to generate an initial result, masks the output, and then refines the prediction with another BERT. Lebanoff et al. [28] proposed mapping single sentences and sentence pairs into a unified space for ranking; according to this ranking, the single and paired sentences important for the summary are selected, and the summary is finally generated by compressing the single sentences and fusing the paired sentences.
Different from previous work, we propose a short text summary generation model based on keyword templates. Our model uses a different data processing method, which better adapts the pre-trained language model to Chinese short text summarization and produces high-quality summaries.

III. MODEL
In this section, we introduce the overall structure of the model in two parts: data preprocessing and model architecture.

A. DATA PREPROCESSING
Unlike the BERT data preprocessing method, we add a keyword-based sentence division on top of the original sentence division. Experimental results show that simple sentence division alone does not adapt BERT well to short text summary generation. Therefore, on top of the original sentence division, we extract reference summary keywords from the training set and divide the text a second time.
In the model training stage, we extract keywords from the reference summary and divide the input text a second time. In the model testing stage, we use the similarity calculation tool (macropodus) [29] to find the training-set document most similar to the test document, extract the keywords from that document's reference summary, and use them to divide the test document a second time.
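The two-pass division described above can be sketched as follows. This is a minimal illustration in plain Python, under our own assumptions: the first pass splits on sentence punctuation, and the second pass treats the first occurrence of each keyword in a segment as an additional boundary. The function names and the exact boundary scheme are illustrative, not the paper's; the real pipeline uses macropodus for the similarity search.

```python
import re

def split_by_punct(text):
    """First pass: split on common Chinese/ASCII sentence punctuation."""
    parts = re.split(r"[。！？；.!?;]", text)
    return [p.strip() for p in parts if p.strip()]

def split_by_keywords(segments, keywords):
    """Second pass: cut each segment at the first occurrence of each
    keyword, keeping the keyword itself as a separate phrase."""
    out = []
    for seg in segments:
        pieces = [seg]
        for kw in keywords:
            nxt = []
            for p in pieces:
                if kw in p:
                    before, _, after = p.partition(kw)
                    nxt.extend(x for x in (before, kw, after) if x)
                else:
                    nxt.append(p)
            pieces = nxt
        out.extend(pieces)
    return out

segments = split_by_punct("the summary generation is studied. we propose a model!")
phrases = split_by_keywords(segments, ["summary generation"])
```

After the second pass, the keyword becomes its own phrase, so a [CLS] token can later be inserted in front of it like any other sentence unit.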
As shown in the left half of Figure 1, we divide the text according to the sentence structure. In the right half of the figure, we first extract ''the summary generation'' as a keyword, and divide the text again based on the original sentence division.
We insert [CLS] before each sentence (or phrase) obtained after preprocessing and insert [SEP] at the end of each sentence (or phrase), and use the inserted [CLS] symbols to collect sentence features. The document is represented as a sequence of tokens X = [w_1, w_2, ..., w_n]. Each token w_i consists of a token embedding, a segment embedding, and a position embedding, as shown in Figure 1. The token embedding converts each word into a fixed-dimensional vector; the position embedding represents the position of each token in the text; and the segment embedding assigns each sentence E_A or E_B according to whether its index i is even or odd [20]. For example, if the text contains six sentences [sent_0, sent_1, sent_2, sent_3, sent_4, sent_5], we assign the segment embeddings [E_A, E_B, E_A, E_B, E_A, E_B].
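A minimal sketch of this input construction, assuming whitespace tokenization for readability (the actual model uses BERT's subword tokenizer); `build_bert_input` is a hypothetical helper name, not part of the paper's code:

```python
def build_bert_input(sentences):
    """Wrap each sentence with [CLS] ... [SEP] and assign interval
    segment ids (0 for E_A, 1 for E_B) by sentence parity."""
    tokens, segment_ids, cls_positions = [], [], []
    for i, sent in enumerate(sentences):
        cls_positions.append(len(tokens))      # where this sentence's [CLS] lands
        seg = 0 if i % 2 == 0 else 1           # E_A for even i, E_B for odd i
        piece = ["[CLS]"] + sent.split() + ["[SEP]"]
        tokens.extend(piece)
        segment_ids.extend([seg] * len(piece))
    return tokens, segment_ids, cls_positions

tokens, segs, cls_pos = build_bert_input(["sent one", "sent two", "sent three"])
```

The `cls_positions` list records where each sentence's [CLS] token sits, which is where the sentence-level feature vectors t_i are later collected.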

B. MODEL ARCHITECTURE
To verify the effectiveness of our proposed method in generating short text summaries, we use the traditional encoder-decoder structure and fine-tune BERT. The encoder is initialized with BERT, and the decoder is a stack of eight randomly initialized Transformer layers [8]. We name our model BSA; when the model uses keyword templates to divide sentences, it is named BSA*.
We represent the preprocessed data as [sent_1, sent_2, sent_3, ..., sent_n] and input it into the BERT layer. The vector t_i is the feature vector collected by the i-th [CLS] symbol output by the BERT layer, which represents sent_i. After the encoder, we use the decoder stacked from Transformers to decode the BERT output:

h̃^l = LN(h^(l-1) + MHAtt(h^(l-1)))
h^l = LN(h̃^l + FFN(h̃^l))

where LN refers to the normalization layer; MHAtt is the multi-headed attention composed of multiple connected self-attention mechanisms [20]; the symbol l represents the depth of the model stack; and FFN is the position-wise feedforward network in each layer.
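The residual-plus-normalization wiring of each decoder layer can be illustrated with a toy pure-Python sketch. Identity Q/K/V projections and an identity feed-forward weight stand in for the learned parameters here, so this only demonstrates the LN / attention / FFN structure, not the trained model:

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def attend(seq):
    """Toy single-head self-attention with identity projections
    (a stand-in for MHAtt, which uses learned multi-head weights)."""
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in seq]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        out.append([sum(w * v[d] for w, v in zip(weights, seq)) for d in range(len(q))])
    return out

def ffn(x):
    """Toy position-wise feed-forward: ReLU with identity weights."""
    return [max(0.0, v) for v in x]

def decoder_layer(h):
    # h_tilde = LN(h + MHAtt(h))
    att = attend(h)
    h_tilde = [layer_norm([a + b for a, b in zip(hv, av)]) for hv, av in zip(h, att)]
    # h_out = LN(h_tilde + FFN(h_tilde))
    return [layer_norm([a + b for a, b in zip(hv, ffn(hv))]) for hv in h_tilde]

out = decoder_layer([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

Stacking this layer l times, as the model does with eight Transformer layers, simply feeds each layer's output into the next.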
In formula (1), pos indicates the position and i refers to the dimension:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) (1)

Using Transformers as the feature extractor can effectively extract document-level features and generate higher-quality, more coherent summaries. The decoder focuses on generating the summary from the encoder outputs. However, since the encoder uses a pre-trained model while the decoder is randomly initialized and trained from scratch, there is a mismatch between them. To solve this problem, we use different optimization strategies for the encoder and decoder [8]:

Encoder: Adam optimizer, β1 = 0.9, warmup = 20000, lr = lr_1 · min(step^-0.5, step · warmup^-1.5) (5)

Decoder: Adam optimizer, β2 = 0.999, warmup = 10000, lr = lr_2 · min(step^-0.5, step · warmup^-1.5) (6)

where lr_1 = 2e-3 is the learning rate of the encoder and lr_2 = 0.1 is the learning rate of the decoder. In the initial stage of training, the encoder side uses a smaller learning rate and a smoothing strategy to fine-tune the pre-trained model [8]. As training continues, the decoder side stabilizes, and better accuracy can be obtained. The BSA model architecture is shown in Figure 2.
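The two schedules in formulas (5) and (6) share one shape, linear warmup followed by inverse-square-root decay, differing only in peak rate and warmup length. A minimal sketch:

```python
def noam_lr(step, peak_lr, warmup):
    """lr = peak_lr * min(step^-0.5, step * warmup^-1.5), as in Eqs. (5)-(6):
    rises linearly for `warmup` steps, then decays as step^-0.5."""
    return peak_lr * min(step ** -0.5, step * warmup ** -1.5)

# Encoder uses lr_1 = 2e-3 with warmup = 20000; decoder lr_2 = 0.1 with warmup = 10000.
enc_lr = [noam_lr(s, 2e-3, 20000) for s in (100, 20000, 100000)]
dec_lr = [noam_lr(s, 0.1, 10000) for s in (100, 10000, 100000)]
```

Both branches of the `min` meet exactly at `step == warmup`, which is where each learning rate peaks.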

IV. EXPERIMENTAL DETAILS
In this section, we introduce the data set used in the experiments, the model parameter settings, and the baseline models used for comparison.

A. EXPERIMENTAL CORPUS
In this study, we mainly explore how to better apply the pre-trained model to Chinese short text summary generation, so only the LCSTS data set is used. LCSTS is a public Chinese short text summarization data set constructed by Hu et al. in 2015 [30], which contains 2.4 million real Chinese short texts, each paired with a summary given by the text's author. The data set has three parts: 2.4M text-summary pairs for training, 8K pairs for validation, and the remaining 0.7K pairs for testing.

B. EXPERIMENT SETTINGS
We implemented our experiments in PyTorch on 8 GPUs (RTX 2080 Ti). Both the source texts and the summaries were tokenized with BERT's subword tokenizer.
In our model, we apply dropout before all linear layers with probability 0.2 and use label smoothing with a smoothing factor of 0.1. On the decoder side, we set the hidden size of each Transformer layer to 768 and the feed-forward size of each layer to 2048. For the learning rates, lr_1 = 2e-3 for the encoder and lr_2 = 0.1 for the decoder [8]. For quality evaluation, we follow previous research and use ROUGE, an automatic summary evaluation method proposed by Chin-Yew Lin in 2004 [18]. We report the standard F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L between the reference summaries and the generated summaries.
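For illustration, ROUGE-N F1 reduces to n-gram overlap between the candidate and reference token lists (character-level tokens are typical for Chinese). The sketch below is a simplified re-implementation for intuition only; actual scores should come from the standard ROUGE toolkit [18]:

```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=1):
    """F1 over n-gram overlap between candidate and reference token lists."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    c, r = ngrams(candidate), ngrams(reference)
    overlap = sum((c & r).values())  # clipped n-gram matches
    if not c or not r or overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

score = rouge_n_f1("the model generates summaries".split(),
                   "the model writes summaries".split())
```

Here three of four unigrams overlap, so precision = recall = 0.75 and F1 = 0.75.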

C. BASELINE MODELS
We compare the experimental results with seven baseline models on the LCSTS dataset. RNN [11] and Bi-MulRnn [31] are RNN-based seq2seq models, without and with the attention mechanism, respectively. CopyNet [13] is a seq2seq model that uses both the copy mechanism and the attention mechanism. SRB [32] adds a cosine similarity term to the traditional seq2seq model. TSNHG [33] is an attention-based seq2seq model with gated recurrent units, which improves results under different topic data by classifying the training set by topic. CGU [34] is a seq2seq model based on the convolutional gated unit and the attention mechanism. RTCS [35] is a conventional seq2seq model with a convolutional neural network.
In addition, we demonstrate the effectiveness of our proposed method through ablation experiments.

V. EXPERIMENT ANALYSIS
In this section, we verify the experimental results of the BSA and BSA* models on short text summarization and conduct a comparative analysis. We also show two summary examples to illustrate that our model can generate high-quality summaries.

A. EXPERIMENTAL RESULTS
Compared with the models that use an RNN as the feature extractor (RNN, Bi-MulRnn), our model uses Transformers as the feature extractor to capture the document's deep features more efficiently. In addition, unlike the other baseline models, our model uses the pre-trained BERT model, which effectively improves performance and yields higher-quality summaries.
The comparative analysis of the ablation experiments shows that using keywords to subdivide the text effectively improves the quality of the generated summaries and better adapts the pre-trained language model to Chinese short text summarization.

B. DISCUSSION
We show two examples of summaries generated by our model and compare them with the RNN model and the reference summaries. At the same time, we also show the summaries generated by our model under different data processing methods. As shown in the summary examples in Table 2, compared with the RNN model, the summary generated by our model better highlights the main content of the text. In contrast, the BSA

VI. CONCLUSION
To improve the application of the pre-trained language model in Chinese short text summary generation, this article proposes a novel Chinese short text summary generation model based on keyword templates. In our model, we divide the text twice by extracting keywords. Experimental results show that our document sentence division method effectively improves the quality of the generated summaries. We also show how to efficiently apply the pre-trained language model to short text summary generation.
However, this method has only been verified on the abstractive model. In future work, we will verify whether the short text summary generation model based on keyword templates can also improve the quality of extractive summary generation, and verify it on other pre-trained language models.

SHUAI ZHAO is currently pursuing the bachelor's degree, majoring in computer science, with the School of Information Engineering, Beijing Institute of Graphic Communication. His primary research fields are natural language processing, deep learning, and so on. He has participated in many key projects of the Beijing Natural Science Foundation, scientific research projects of the Beijing Municipal Education Commission, and school-level key projects. He has published three academic papers in national and foreign publications, including three SCI/EI-indexed papers. He has been granted two national invention patents and one utility model patent, has one publication and one compilation, and won the first-class scholarship in 2019.
FUCHENG YOU is a Tutor, Doctor, and Professor of Computer Science with the School of Information Engineering, Beijing Institute of Graphic Communication. He is currently the Director of the ''Digital Image Processing'' Research Office of the Beijing Key Laboratory of ''High-End Printing Equipment Signal and Information Processing'' and the Director of the Research and Recruitment Office of the School of Information Engineering. His main courses include database principles, digital image processing, and image processing and analysis. His main research directions are natural language processing, digital image processing, and machine vision applications, among others. He has presided over many key projects of the Beijing Natural Science Foundation, scientific research projects of the Beijing Municipal Education Commission, and school-level key projects. He has published more than 80 academic papers in national and foreign publications, more than 50 of which are SCI/EI-indexed. He has been granted 12 national invention patents and five utility model patents, has published two books, and won the Yachang Education Award in 2016.
ZENG YUAN LIU is currently pursuing the bachelor's degree, majoring in computer science, with the School of Information Engineering, Beijing Institute of Graphic Communication. His primary research fields are natural language processing and deep learning, among others. He has participated in many key projects of the Beijing Natural Science Foundation, scientific research projects of the Beijing Municipal Education Commission, and school-level key projects. He won the Second-Class Scholarship in 2019.