Topic Information Fusion and Semantic Relevance for Text Summarization

With the continuous development of deep learning, pre-trained models have achieved strong results in natural language processing. However, text summarization research still falls short of expectations, especially for abstractive summarization. A high-quality summarization system needs to focus on the topic content of the document and on the similarity between the summary and the source document. In this paper, we propose a topic information fusion and semantic relevance model for text summarization based on fine-tuning BERT (TIF-SR). First, considering the critical role of topic information in summary generation, we extract topic keywords and fuse them with the source document as part of the input. Second, we bring the summary closer to the source document by computing the semantic similarity between the generated summary and the source document, which improves the quality of the summary. The experimental data indicate that both the ROUGE scores and the readability of this model improve, suggesting that the proposed method is effective.


I. INTRODUCTION
Text summarization aims to make a text easier to read and understand, especially for readers with limited reading ability. There are two kinds of summary generation: extractive and abstractive. Extractive summarization models select important words or sentences from the original documents. Compared with abstractive methods, extraction is relatively mature and straightforward, but it is limited to the wording of the original text, and the coherence of the resulting summary is weak. Abstractive summarization systems are not limited to the original text; extractive summarization is easier, but abstractive summarization is closer to the way humans process text. In this paper, we focus on abstractive summarization of single documents.
In previous work, deep neural network approaches have achieved favorable results on various natural language processing (NLP) tasks such as image captioning [1], machine translation [2]-[4], and speech recognition [5]. In particular, the attention-based sequence-to-sequence (seq2seq) model with recurrent neural networks (RNNs) [6]-[8] is widely used in abstractive summarization. The model compresses the original document into dense vectors with the encoder, and the decoder generates the summary from the compressed vectors. While promising, most current research on summary generation focuses on short texts, because these methods fail to provide satisfactory performance on long input sequences.
In this work, we focus on the key factors that affect the quality of the generated summary. The sequence-to-sequence framework consists of two parts: an encoder layer that processes the input and a decoder layer that generates the output. There has been a great deal of work on the design and alignment of the encoder and the decoder, such as the choice of neural network structure and the specific design of the attention mechanism [9]-[11]. In contrast, the topic information [12], [13], which directly affects the results of the summarization task, has received relatively little attention.
At present, almost all attention mechanisms distribute attention evenly over the entire content of the text. This has many advantages, but it ignores the influence of the text's topic on summary generation, as shown in Table 1. To address this problem, we extract the topic keywords of the source text and fuse this topic information into the input. Another common weakness of previous work is that sequence-to-sequence models tend to produce grammatical and coherent summaries regardless of their semantic relevance to the source text: although the generated summary may resemble the source text literally, the two can have low semantic relevance. In this work, our goal is also to improve the semantic relevance [14] between the source text and the generated summary. In our model, the encoder compresses the source text into vectors, and the decoder decodes these vectors into the summary. The encoder layer produces the representation of the original document, and the decoder layer produces the representation of the generated summary. A similarity evaluation component is introduced to measure the relevance of the original text and the generated summary. During training, the model maximizes the similarity score to encourage high semantic relevance between the source text and the generated summary.
The contributions of this work are as follows:
1. We argue that the BERT [15] model, with its pre-training on a huge dataset and its powerful architecture for learning complex features, can further boost the performance of abstractive summarization. We fine-tune the BERT model to generate text summaries.
2. Standard attention mechanisms distribute attention evenly over the entire text, ignoring the influence of the text's topic on summary generation. We extract the topic keywords from the source text and fuse this topic information into the input.
3. Sequence-to-sequence models tend to generate summaries with low semantic relevance to the source text. In this work, we improve the semantic relevance between the source text and the generated summary.

II. RELATED WORK
At present, methods for summary generation are mainly divided into extractive and abstractive approaches. On the one hand, early extractive summarization systems mostly relied on statistical and linguistic features, such as word frequency and term frequency [16]. Classical approaches include graph-based methods [17], integer linear programming [18], LDA [19], conditional random fields [20], and classifier-based methods [21], [22]. On the other hand, limited by the technology of the time, early research on abstractive systems is relatively scarce. Most of that work focused on sentence compression, using methods such as syntactic tree pruning [23] and machine translation [24].
In recent years, deep learning techniques have been widely investigated for both extractive and abstractive summarization tasks. For neural extractive summarization systems, the sequence-to-sequence model achieves impressive performance, and improvements on automatic metrics such as ROUGE [25] have reached satisfactory levels, as in Dong [26], Narayan [27], and Zhang [28].
Meanwhile, sequence-to-sequence neural network models have provided a feasible new approach for abstractive summarization systems. Rush [7] was the first to apply neural networks with an attention mechanism to abstractive text summarization. Experimental data showed that the attention mechanism achieved good results; the model includes an encoder and a decoder, in which the encoder layer uses a convolutional neural network (CNN) and the decoder layer uses a neural language model. Chopra [6] replaced the decoder with a recurrent neural network (RNN), and Nallapati [8] improved performance with an RNN encoder-decoder model. Hou [29] used topic information as part of the encoder input to improve the quality of the generated summary. Ma [14] used an attention-based RNN model to calculate the similarity between the generated summary and the source document and achieved excellent results. However, feature extractors with RNN or CNN cores often have problems that cannot be ignored, such as limited parallel computing power and an inability to capture long-distance features. Therefore, Ming et al. proposed using the Transformer [30] as the feature extractor to generate summaries and achieved excellent results. NEUSUM [31] adopts an end-to-end extractive text summarization model, which solves the split between sentence scoring and sentence selection in previous models. Chen [32] combined reinforcement learning to train an extractive model and an abstractive model separately, with a sentence-level policy gradient method connecting the two networks. Recently, language model pre-training has been shown to be effective for improving many natural language processing tasks; a prominent example is BERT [15], which stands for Bidirectional Encoder Representations from Transformers.
BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. In addition, there are improved methods based on the attention mechanism, optimization methods, and text information embedding. Inspired by BERT's pre-training and fine-tuning, Song [33] proposed the masked sequence-to-sequence (MASS) model to generate text in an encoder-decoder framework. Zhang [34] used BERT to extract sentence features on the encoder side and, on the decoder side, used a Transformer and BERT respectively to make predictions. Wang [35] used BERT to encode each sentence, tried three training strategies (Mask, Replace, and Switch), and demonstrated their effectiveness. HIBERT [36] uses an unsupervised method to pre-train on a large corpus to better adapt to downstream tasks.
Compared with previous work, we propose a topic information fusion and semantic relevance model for text summarization based on fine-tuning BERT (TIF-SR). Our method focuses on topic information and semantic relevance using the BERT model. Previous models limited their attention to the source documents and ignored important topic information in the original text, and sequence-to-sequence models disregard the semantic relevance between the summary and the source text. We extract the topic keywords as part of the input and improve the semantic relevance between the source text and the generated summary. Unlike previous models, which only fuse topic information or only calculate semantic similarity, our model consists of three components: the topic information is used as part of the encoder input, and the input is compressed into semantic vectors using BERT; the decoder uses a Transformer to produce the semantic vectors of the generated summaries; finally, a similarity function evaluates the relevance between the semantic vectors of the source text and the generated summary.

III. MODEL

A. MODEL INPUT
The topic of a document is a concise summary of its main content, usually composed of multiple topic keywords. Topic keywords are the smallest units that represent the content of the document and reflect its main content. In this model, we first preprocess the document, extract 10 topic keywords from the article using the TextRank algorithm, and use these keywords as part of the model input.
The core of the TextRank [37] algorithm is to divide the text into several constituent units (words or sentences) and build a graph model. It uses a voting mechanism to score the important components of the text and then takes the words with the highest scores as the topic keywords.
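As a sketch of the voting mechanism described above, consider a deliberately simplified TextRank over words: an unweighted co-occurrence graph built with a small sliding window, scored by PageRank-style power iteration with the usual damping factor d = 0.85. The actual implementation (window size, weighting, filtering by part of speech) may differ.

```python
from collections import defaultdict

def textrank_keywords(words, window=2, top_k=3, d=0.85, iters=50):
    # Build an undirected co-occurrence graph over a sliding window.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    # Power iteration of the PageRank-style "voting" update.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        new = {}
        for w in neighbors:
            new[w] = (1 - d) + d * sum(
                score[u] / len(neighbors[u]) for u in neighbors[w])
        score = new
    # Words with the highest scores become the topic keywords.
    return [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:top_k]]
```

Words that co-occur with many well-connected words accumulate the most votes and surface as keywords.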
As illustrated in Figure 1, we insert a [CLS] token before the first sentence and a [SEP] token after each sentence. The [CLS] token is used as a symbol to aggregate features from one sentence or a pair of sentences. In particular, we add a topic token embedding as part of the input. To make the generated summary fully reflect the topic information, our model first extracts the topic keywords from the source document and takes them as part of the input. The input of our model is similar to that of the BERT model, with the topic information added; it mainly includes the input document, token embeddings, topic token embeddings, interval segment embeddings, and position embeddings.
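The input layout can be sketched as follows. The exact placement of the topic tokens is our assumption (here they are simply prepended before [CLS]); the interval segment ids alternate 0/1 per sentence as in BERT-style segment embeddings.

```python
def build_input(sentences, topic_keywords):
    # Topic tokens prepended before [CLS] (placement is an assumption).
    tokens = list(topic_keywords) + ["[CLS]"]
    segments = [0] * len(tokens)
    for i, sent in enumerate(sentences):
        # [SEP] after each sentence, per the description above.
        sent_tokens = sent.split() + ["[SEP]"]
        tokens += sent_tokens
        segments += [i % 2] * len(sent_tokens)  # interval segment ids
    positions = list(range(len(tokens)))        # position embedding ids
    return tokens, segments, positions
```

Each of the three parallel id sequences would then index its own embedding table, and the embeddings are summed to form the encoder input.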

B. MODEL ARCHITECTURE
Our model follows the typical encoder-decoder framework: after preprocessing, the original document is fed to the encoder, and the decoder generates the sentences of the summary. The main idea behind this architecture is to use transfer learning from pre-trained BERT, a masked language model. We replace the encoder with BERT, and the decoder uses a Transformer [30], as illustrated in Figure 2.
First, we feed the constructed word embeddings into the encoder layer, which is the BERT layer (BERT-Base, Chinese Simplified and Traditional: 12 layers, 768 hidden units, 12 heads, 110M parameters). The BERT layer uses the encoder part of the Transformer and stacks multiple bidirectional Transformer blocks as the language model. With random masking, 15% of the words in each sentence are predicted from their context, and next sentence prediction is used to better capture sentence meaning and inter-sentence relationships. The encoder layer compresses the input sequence into semantic vectors.
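The random masking step can be illustrated with a minimal sketch. BERT's full 80/10/10 split (replace with [MASK] / replace with a random token / keep unchanged) is omitted here for brevity; this simplification always substitutes [MASK].

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Select ~15% of positions and replace them with [MASK]; the model
    is trained to predict the original token at each masked position."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)     # token the model must predict
        else:
            masked.append(tok)
            targets.append(None)    # position not scored in the loss
    return masked, targets
```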
After obtaining the semantic vectors from BERT, we build a Transformer layer on top of the BERT layer to capture the document features and generate the corresponding summary. One advantage of using Transformer networks is that training is much faster than with LSTM [38] based models, since Transformer models eliminate sequential computation. The layer is built from many multi-head self-attention blocks. Through self-attention, the Transformer offers high parallel computing speed and the ability to capture long-distance features, making it a useful feature extractor apart from RNNs and CNNs.
The traditional attention mechanism takes a set of values and a query vector and computes a weighted sum of the values based on the query. In the seq2seq model, the encoder compresses the input into a fixed-length semantic vector; as the length of the input increases, this fixed-length vector cannot fully express all the information in the text, which degrades the decoder and, in turn, the final output sequence. Self-attention, by contrast, encourages the model to learn long-term dependencies without adding much computational complexity. The attention is applied to a set of queries simultaneously, packed together into a matrix Q; the keys and values are likewise packed into matrices K and V. We compute the matrix of outputs as

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V,

where the representations are computed through the attention mechanism with itself and packed into a matrix. In addition, the Transformer stacks multiple self-attention heads to effectively capture information from different positions and representation subspaces:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).

Besides the attention layers, the Transformer adds a fully connected feed-forward network to each encoder and decoder layer, applied to each position separately and identically.
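The scaled dot-product attention described above can be written as a short NumPy sketch (single head, no learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights
```

Each row of `weights` sums to 1 and defines how much each query position attends to every key position.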
The Transformer uses sine and cosine functions of different frequencies as positional encodings and adds them to the input embeddings:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
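A minimal NumPy sketch of the sinusoidal positional encoding (assuming an even model dimension):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe
```

The resulting matrix is simply added element-wise to the token embeddings before the first layer.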
Transformer-based models generate grammatically more correct and coherent sentences. The decoder performs the summarization task from the BERT outputs:

h~^l = LN(h^{l-1} + MHAtt(h^{l-1})), h^l = LN(h~^l + FFN(h~^l)),

where h^0 = PosEmb(T) and T are the sentence vectors output by BERT; LN is the layer normalization operation; MHAtt is the multi-head attention operation; and the superscript l indicates the depth of the stacked layer. The final output layer uses a softmax function to obtain the predicted score:

P(y_t | y_{<t}, x) = softmax(W_o h^L).
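One decoder block of this kind can be sketched as follows. This is a simplified single-head version: the learned layer-norm gain/bias and the attention projection matrices are omitted, so it illustrates only the residual-plus-normalization recurrence, not a full multi-head layer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN: normalize each position's vector to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_attention(h):
    # Single-head self-attention without projections (illustrative only).
    d = h.shape[-1]
    scores = h @ h.T / np.sqrt(d)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = e / e.sum(axis=-1, keepdims=True)
    return w @ h

def decoder_layer(h, W1, W2):
    # h~ = LN(h + Att(h));  h' = LN(h~ + FFN(h~))
    h_tilde = layer_norm(h + self_attention(h))
    ffn = np.maximum(0, h_tilde @ W1) @ W2   # position-wise feed-forward (ReLU)
    return layer_norm(h_tilde + ffn)
```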

C. SEMANTIC RELEVANCE
In this section, our purpose is to compute the semantic relevance of the source document and the generated summary, given the source semantic vector V1 and the generated summary's semantic vector V2, as illustrated in Figure 3. There are many ways to perform semantic similarity calculation, for example the inner product, the Dice coefficient, and the Jaccard coefficient. Because the source document and the generated summary are in the same language, we assume their semantic vectors are distributed in the same space, and cosine similarity [39] works well for measuring the distance between two vectors in the same space. The critical problem in using a semantic relevance metric is how to obtain the semantic vectors V1 and V2. In our model, V1 is represented by the output of the BERT layer, which compresses and encodes the source document. The semantic vector produced on the encoder side is then fed into the Transformer layer as the initial state, and the Transformer layer generates V2 by decoding. We use cosine similarity to calculate semantic relevance; the closer the cosine value is to 1, the more similar the two vectors are:

cos(V1, V2) = (V1 · V2) / (||V1|| ||V2||).

FIGURE 3. The overview architecture of the TIF-SR model.
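Cosine similarity between the two semantic vectors is straightforward to compute:

```python
import numpy as np

def cosine_similarity(v1, v2):
    # cos(V1, V2) = (V1 . V2) / (||V1|| * ||V2||), in [-1, 1]
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```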

D. TRAINING
Let the input document be x and the model parameters be θ. The model generates the summary y and the semantic vectors V1 and V2. The objective is to minimize the loss function

L(θ) = -log p(y|x; θ) - λ cos(V1, V2),

where p(y|x; θ) is the conditional probability of the summary given the source document, computed by our model, and cos(V1, V2) is the cosine similarity of the semantic vectors V1 and V2. The second term tries to maximize the semantic relevance between the source text and the generated summary, with λ balancing the two terms.
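A sketch of this objective, assuming a weighting hyperparameter `lam` that the text does not specify (hypothetical, for illustration only):

```python
import numpy as np

def tif_sr_loss(token_log_probs, v1, v2, lam=1.0):
    """L = -log p(y|x; theta) - lam * cos(V1, V2).
    token_log_probs: per-token log-probabilities of the reference summary;
    v1, v2: semantic vectors of source and generated summary;
    lam: hypothetical balance weight (not given in the paper)."""
    nll = -float(np.sum(token_log_probs))              # -log p(y|x)
    cos = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return nll - lam * cos                             # minimizing raises cos
```

Minimizing this loss jointly maximizes the likelihood of the summary and the semantic relevance between source and summary.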

IV. ANALYSIS OF EXPERIMENTAL RESULTS
In this section, we present the evaluation of our model and show its performance on Chinese corpora. In addition, we perform a case study to explain the semantic relevance between the generated summary and the source text.

A. EVALUATION METRIC
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [25] is a similarity measure based on recall and, like BLEU, is a commonly used evaluation standard in automatic summarization. The ROUGE criteria include ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-4, and ROUGE-L. ROUGE-1 reflects the amount of information contained in the automatic summary; ROUGE-2, ROUGE-3, and ROUGE-4 reflect its fluency; and ROUGE-L reflects how much of the original text's information the summary covers. ROUGE-N is computed as

ROUGE-N = ( Σ_{S ∈ Re} Σ_{gram_n ∈ S} Count_match(gram_n) ) / ( Σ_{S ∈ Re} Σ_{gram_n ∈ S} Count(gram_n) ),

where Re is the set of manually written reference summaries, gram_n is an n-gram, and Count_match is the maximum number of n-grams co-occurring in the model-generated summary and the reference summary. The basic idea is that experts manually write summaries to form a reference set, and the summary produced by the model is compared against these references; the quality of the model's summary is evaluated by the number of basic units the two share.
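The ROUGE-N recall described above can be sketched as follows (a simplified single-reference version, without stemming or multi-reference aggregation):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: clipped n-gram overlap between candidate and
    reference, divided by the number of n-grams in the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(c, ref[g]) for g, c in cand.items() if g in ref)
    total = sum(ref.values())
    return overlap / total if total else 0.0
```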

B. EXPERIMENTAL CORPUS
Two corpora are used in this study: LCSTS [40] and NLPCC2017 [29]. LCSTS was constructed by Hu and Chen. The dataset consists of more than 2.4 million text-summary pairs drawn from the well-known Chinese social media website Sina Weibo. It is split into three parts, with 2,400,591 pairs in PART I, 10,666 pairs in PART II, and 1,106 pairs in PART III. All the text-summary pairs in PART II and PART III are manually annotated with relevance scores ranging from 1 to 5, and we only keep pairs with scores of at least 3. Following previous work, we use PART I as the training set, PART II as the development set, and PART III as the test set. In addition, NLPCC2017 is a public dataset provided for the headline-generation task, which mainly contains 50,000 news-summary pairs; the length of each article ranges from 10 to 10,000 Chinese characters. In this experiment, 49,500 texts are used as the training and validation sets, and the remaining 500 texts as the test set.

C. MODEL COMPARISON
To evaluate the performance of the proposed model in summary generation, eleven typical models were selected for comparison, where RNN and RNN-context are the benchmark models for the dataset.
RNN [40]: With an RNN as the encoder, the input of the decoder is the last hidden state of the encoder, and no context is used during decoding.
RNN-context [40]: Based on the previous model, context is used during decoding, and the input to the decoder is a combination of all hidden states of the encoder.
Bi-MulRnn [41]: The encoder uses a bidirectional recurrent neural network; the decoder uses a multi-layer recurrent neural network with an attention mechanism.
SRB [42]: This model introduces a semantic-relevance neural model: the encoder compresses the original text into a semantic vector, and the decoder generates both the summary and the summary's semantic vector. A similarity function then evaluates the relevance between the semantic vectors of the original text and the generated summary.
Seq2seq [43]: The encoder is a bidirectional LSTM that captures context in both directions; the decoder is a unidirectional LSTM that reads the input and generates a summary, mapping the target vocabulary into a high-dimensional space.
CopyNet [44]: A copy mechanism is introduced. The encoder is a bidirectional RNN; the decoder combines a generate mode and a copy mode: the generate mode selects words from a preset vocabulary, and the copy mode selects words from the input sequence.
CGU [45]: A global encoding framework is introduced to improve the representation of source information.
MEAD [46]: This method scores sentence features based on the centroid, position, common subsequences, and keywords of a sentence.
Submodular [47]: This model uses submodular functions to select important sentences by calculating their diminishing returns and to generate the summary.
NLP-ONE [48]: This method combines an input-sequence attention mechanism, an output-sequence attention mechanism, and a fusion of the two for summary generation.
AS-TKF [29]: This model uses topic information as part of the encoder input to improve the quality of the generated summary.

V. ANALYSIS
In this section, we present the results of our experiments and analyze the performance of our model. We also provide an example showing that the generated summary is closer to the topic and has higher semantic similarity with the source document.

A. RESULTS ON LCSTS DATASET
In the experiments on the LCSTS dataset, we compare our TIF-SR with the baseline systems listed above, including RNN, RNN-context, Bi-MulRnn, SRB, Seq2seq, CopyNet, and CGU. Our model outperforms the baselines on ROUGE, and the advantage on LCSTS is significant. Table 2 shows the results of TIF-SR and the baseline systems on LCSTS. Compared with the CGU model, ROUGE-1 increases by 2.7, ROUGE-2 by 1.7, and ROUGE-L by 0.9. The experimental results are shown in Table 2 and Figure 4.

B. RESULTS ON NLPCC2017 DATASET
In the experiments on the NLPCC2017 dataset, we compare our TIF-SR with the baseline systems listed above, including MEAD, Submodular, NLP-ONE, and AS-TKF. The comparison shows that the model proposed in this paper, which fuses topic information and computes semantic similarity, outperforms the baselines and achieves the best results. This indicates that the fusion of topic information and the calculation of semantic similarity play a decisive role in generating high-quality summaries and can better guide summary generation. The experimental results are shown in Table 3 and Figure 5.

We also present the summary generated by TIF-SR and compare it with the reference summary and the seq2seq model, as illustrated in Table 4. The original text reports that express companies were sorting parcels violently, but only one express company admitted it and apologized, while the other express companies remained silent. The seq2seq model only states that the express company apologized, missing the text's theme of emphasizing the silence of the other companies. Our model can filter out the trivial details that are irrelevant to the core meaning of the source text and focus on the information that contributes most to the main idea. In comparison, the summaries generated by our model have better fluency and sentence coherence, and contain more information.
Because our model incorporates topic keyword information, it can better grasp the topic of the original document, leading to generated summaries that are closer to the topic. We also calculate semantic similarity to increase the similarity between the generated summary and the original document, making the generated summary closer to the source.

VI. CONCLUSION
Seq2seq models are widely used in text summarization research. However, traditional models often ignore topic information and generate summaries with low similarity to the source. In this paper, we propose a topic information fusion and semantic relevance model for text summarization based on fine-tuning BERT. First, we take into account the impact of topic information on summary quality and give the topic information higher weight by extracting topic keywords as part of the input. Second, to obtain summaries with higher semantic similarity, we calculate the semantic similarity between the generated summary and the original document and maximize the similarity score during training. The experimental data show that our model achieves good results on the LCSTS dataset, and the generated summaries are more readable and coherent.
FUCHENG YOU is currently a Tutor, a Doctor, and a Professor of computer science with the School of Information Engineering, Beijing Institute of Graphic Communication. He is also the Director of Digital Image Processing Research Office with the Beijing Key Laboratory of High-End Printing Equipment Signal and Information Processing and the Director of Research and Recruitment Office with the School of Information Engineering. His main courses include database principle, digital image processing, image processing and analysis, and so on. He presided over many key projects with the Beijing Natural Science Foundation, the scientific research projects with the Beijing Municipal Education Commission, and the school-level key projects. More than 80 academic articles have been published in national and foreign publications, of which SCI/EI has searched more than 50. He holds over 12 national invention patents, five utility model patents, and two books. His main research interests include natural language processing, digital image processing, machine vision application, and so on. He received the Yachang Education Award, in 2016.
SHUAI ZHAO is currently pursuing the degree in computer science with the School of Information Engineering, Beijing Institute of Graphic Communication. He has participated in many critical projects with the Beijing Natural Science Foundation, the scientific research projects with the Beijing Municipal Education Commission, and the school-level vital projects. He has published three academic articles in national and foreign publications, including three SCI/EI searches. He holds over two national invention patents, one utility model patent, and one publication and compilation. His research interests include natural language processing, deep learning, and so on. He received the First-Class Scholarship, in 2019.
JINGJING CHEN is currently pursuing the degree in computer science with the School of Information Engineering, Beijing Institute of Graphic Communication. She has participated in many critical projects with the Beijing Natural Science Foundation, the scientific research projects with the Beijing Municipal Education Commission, and the school-level vital projects. She has published one academic article in EI searches. She holds one national patent for utility model and participated in publishing and compiling one. Her research interests include natural language processing, deep learning, and so on. She received the First-Class Scholarship, in 2019.