Multi-Layer Transformer Aggregation Encoder for Answer Generation

Answer generation is one of the most important tasks in natural language processing, and deep learning-based methods have shown their strength over traditional machine learning methods. However, most previous deep learning-based answer generation models were built on recurrent neural networks or convolutional neural networks. The former cannot fully exploit the contextual correlation preserved in paragraphs because of their inherent computational complexity. For the latter, since the size of the convolutional kernel is fixed, the model cannot extract complete semantic features. To alleviate these problems, we propose an end-to-end answer generation model (AG-MTA) based on a multi-layer Transformer aggregation encoder. AG-MTA consists of a multi-layer attention Transformer unit and a multi-layer attention Transformer aggregation encoder (MTA). It can focus on information representations at different positions and aggregate nodes at the same layer to combine context information. It thereby fuses semantic information from the base layer to the top layer, enhancing the information representation of the encoder. Furthermore, a novel position encoding method based on trigonometric functions is proposed. Experiments are conducted on the public SQuAD dataset. AG-MTA reaches state-of-the-art performance, with an EM score of 71.1 and an F1 score of 80.3.


I. INTRODUCTION
A question answering (Q&A) system is built on an understanding of the question. It generates answers by searching existing knowledge sources such as knowledge graphs, databases, or even the internet, making knowledge acquisition more direct, efficient, and accurate. With the continuous development of question answering systems, many novel methods have been proposed. One of the most notable is the Match-LSTM [1] framework. Subsequently, QANet [2] improved the speed and accuracy of answer generation by combining a Convolutional Neural Network (CNN) with self-attention, and achieved reliable results on the SQuAD [3] dataset. A recent method [4] combines a CNN with an attention mechanism for Chinese question classification, which improves answer generation.
Most current research relies on typical neural networks to handle tasks such as intent classification and answer generation. However, these methods are at a disadvantage in utilizing contextual correlation. In this paper, in order to enhance the relevance of contextual information, we propose a novel multi-layer attention Transformer aggregation encoder (MTA) and a novel answer generation network based on the MTA encoder (AG-MTA). The main contributions of this paper are as follows:
1. A multi-layer attention Transformer aggregation encoder (MTA) is proposed to utilize contextual information at different layers to model sequences.
2. Multi-layer attention and a feedforward layer are designed to attend to information in different subspaces, based on the Transformer unit structure.
3. A novel position encoding method is proposed that makes use of both absolute and relative position information by encoding the position of each word.
4. Multi-layer attention Transformer units are proposed to enhance the context representation and address the problem of information loss.
The associate editor coordinating the review of this manuscript and approving it for publication was Arianna D'Ulizia.
Related work is discussed in Section II. Section III presents the AG-MTA model. Section IV presents the evaluation of the AG-MTA-based answer generation system and discusses the experimental results. Finally, we draw conclusions in Section V.

II. RELATED WORK
Along with recent advances in question answering methods, much progress has been made on answer generation. Yu et al. [5] proposed a method that matches questions and answers by considering the semantic encoding of questions. At the same time, the application of LSTM has brought much progress to Q&A systems. Tan et al. [6] enhanced the composite representation of the model by connecting an LSTM network with a convolutional neural network. Liu et al. [7] applied dynamic LSTM networks to address the long-range dependence problem of RNNs. Lende and Raghuwanshi [8] proposed a closed-domain Q&A system for processing documents about education acts, which improves the accuracy of retrieved answers by using NLP techniques. In particular, the remarkable improvement in reading comprehension of long text has also led to better answer generation methods. Relying on efficient neural network models, these methods perform well on the answer generation task.
Wang and Nyberg [9] proposed a method for the answer selection problem that mainly uses a bidirectional Long Short-Term Memory network, without any external knowledge resources. However, this model requires long training and may lose information. Wang and Jiang [1] proposed a network structure called Match-LSTM, which is mainly used to answer questions whose answers are contiguous spans of words in the article. However, this method has difficulty predicting longer answers.
Recently, attention mechanisms have also been introduced to answer generation. Seo et al. [10] proposed a complex network model based on Bi-Directional Attention Flow (BiDAF). The model contains a Query2Context module, similar to Context2Query, which can perform attention calculation on the query using context information. Dhingra et al. [11] proposed a new attention model, the Gated-Attention Reader, which uses attention mechanisms to connect query and paragraph information, thereby enhancing the information representation of each dimension of the word embedding. Vaswani et al. [12] proposed a new self-attention encoder-decoder model that replaces LSTM and CNN models. The experimental results demonstrate the effectiveness of the method, which provides new ideas and solutions for the NLP field.
In addition, other studies have proposed different machine learning methods and answer generation architectures. Aiming at the problem of gradient explosion when a neural network updates a large number of word vectors, Liu et al. [13] proposed an algorithm for accelerating the convergence of neural network parameters based on stochastic conjugate gradients. Wang et al. [14] proposed an end-to-end model called $R^3$ that uses a reinforcement learning framework to combine a phrase ranking method with an answer generation module, whereas the traditional approach ranks the documents first and then generates the answer. Yang et al. [15] used a semi-supervised learning method to generate questions from unlabeled text. This method not only increases the amount of training data but also achieves satisfactory results. Ghaeini et al. [16] designed a question answering network based on a gating mechanism; to improve the accuracy of the answer, this method establishes the interdependence between documents and queries. In order to make full use of various types of knowledge, Zhong et al. [17] proposed a graph algorithm to enhance the accuracy of the question answering system.
As presented above, existing encoder models use only the top-layer output of the network, losing the information available in other layers. Related research shows that different network layers capture different levels of semantic information in the sequence. Therefore, it is necessary to add useful sequence information from the base layers into the encoding result [18]-[20].
On the other hand, some methods use convolutional neural network models or simple attention mechanisms to extract text information, which can significantly shorten training time while performing roughly on par with RNN-based models. Bell and Penchas [21] replaced the RNN in a reading comprehension model with a fully convolutional network, which captures local dependencies well. Zhou et al. [22] proposed a method to capture both semantic information and semantic correlations between questions and answers. In addition, Tay et al. [23] designed multi-cast attention networks to improve training performance, which can be used in many tasks in the Q&A field. Dong et al. [24] proposed multi-column convolutional neural networks, which extract features between questions and answers at different layers and capture information well. He and Golub [25] showed that a character-level encoder-decoder framework can be applied to Q&A systems.
In summary, most of the above methods use LSTM and CNN networks to generate answers directly, and fail to exploit the context information and the relations existing among the whole article and queries. To solve this problem, the AG-MTA model is proposed, which combines the context information with the different levels of semantic information.

III. METHOD

A. PROBLEM DEFINITION
For the answer generation task in a Q&A system, we formally define the problem as follows. A context paragraph is defined as CTX:

$$CTX = \{c_1, c_2, \ldots, c_n\}$$

where n is the number of words in the paragraph. The question is defined as:

$$Q = \{q_1, q_2, \ldots, q_m\}$$

where m is the number of words in the question. Given the question Q, the model outputs a sub-sequence S of the paragraph CTX; S is the answer generated by the model from the question and the paragraph. We define the answer S as:

$$S = \{c_i, c_{i+1}, \ldots, c_k\}$$

where i and k are the starting and ending positions of the answer in the paragraph. Fig. 2 describes the process of answer generation.

B. ANSWER GENERATION NETWORK BASED ON MTA
The architecture of AG-MTA is shown in Fig. 1. It mainly consists of positional encoding, an embedding layer, the multi-layer Transformer aggregation encoder, and context-query attention modules. As shown in Fig. 1, the first part converts the article into a corresponding matrix representation through the character embedding layer and the word embedding layer. The word embedding layer uses pre-trained GloVe [26] word vectors of dimension $p_1$, and the dimension of the character embedding layer is set to $p_2$. The word vector corresponding to a word w is $x_w$, and each character vector is denoted $x_c$. We randomly initialize the character vectors $x_c$ and add them to the model.
In the meantime, each word can be seen as a concatenation of its character vectors. We fix the length of each word to a constant j. Thus, the word w can also be represented as a $p_2 \times j$ matrix formed from its character vectors. The final word vector $[x_w; x_c] \in \mathbb{R}^{p_1 + p_2}$ for the word w is then obtained by concatenating $x_w$ and $x_c$.
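As a rough illustration of this concatenation, the following NumPy sketch builds $[x_w; x_c]$ for a single word. The sizes (p1 = 8, p2 = 4, j = 5), the random lookup tables and the max-over-characters pooling step are assumptions made for illustration only; the text does not state how the $p_2 \times j$ character matrix is reduced to the vector $x_c$.

```python
import numpy as np

rng = np.random.default_rng(0)

P1, P2, J = 8, 4, 5          # hypothetical sizes: word dim, char dim, fixed word length
CHARS = "abcdefghijklmnopqrstuvwxyz"

# Stand-in for a pre-trained word table (GloVe in the paper) and a randomly
# initialised character table. Row 0 of the character table is padding.
word_table = {"vase": rng.normal(size=P1), "table": rng.normal(size=P1)}
char_table = rng.normal(size=(len(CHARS) + 1, P2))

def char_matrix(word):
    """Return the p2 x j matrix of character vectors, padded or truncated to length J."""
    ids = [CHARS.index(c) + 1 for c in word.lower() if c in CHARS][:J]
    ids += [0] * (J - len(ids))
    return char_table[ids].T                  # shape (P2, J)

def embed(word):
    x_w = word_table[word]                    # shape (P1,)
    x_c = char_matrix(word).max(axis=1)       # assumed pooling over characters, shape (P2,)
    return np.concatenate([x_w, x_c])         # [x_w; x_c], shape (P1 + P2,)

x = embed("vase")
print(x.shape)
```

Any other reduction of the character matrix (for example a convolution followed by pooling) would fit the same interface; only the final concatenation is taken from the text.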
Finally, the method adds the result to the positional encoding vector to obtain the final input sequence information.
Position information is especially important for attention mechanisms. For example, the sentences "Tom broke the vase on the table" and "The vase broke the Tom on the table" are almost the same to an attention mechanism, but their meanings are entirely different. Therefore, we introduce a novel position encoding that numbers the position of each word. Using the parity of trigonometric functions, position information is introduced for each word by combining the position vector with the word vector. With this information, the attention mechanism can distinguish words at different positions.
In the position encoding, pos is the position of the word, i indexes the dimensions of the word vector, and d is the dimension of the word vector. In addition to expressing the absolute position in the sequence, the encoding can also express relative position relationships, as the following argument shows.
Let p and q be two positions with q = p + k, where k is the distance from p to q. According to formula (6), $\sin(q) = \sin(p + k) = \sin(p)\cos(k) + \cos(p)\sin(k)$. Therefore, the position vector of q can be expressed as a linear transformation of the position vector of p, which represents the relative position information.
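The relative-position argument can be checked numerically. The sketch below assumes the standard sinusoidal form from the Transformer [12] (sine on even dimensions, cosine on odd dimensions, base 10000, even d); the exact equation used by AG-MTA is not reproduced here, so this illustrates the property rather than the authors' formula. The helper `shift` is hypothetical.

```python
import math

def positional_encoding(pos, d):
    """Sinusoidal encoding of one position: even dims use sin, odd dims use cos."""
    pe = []
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift(pe_p, k, d):
    """Obtain PE(p + k) from PE(p) by rotating each (sin, cos) pair by the angle of k."""
    out = []
    for i in range(d // 2):
        w = k / (10000 ** (2 * i / d))
        s, c = pe_p[2 * i], pe_p[2 * i + 1]
        out.extend([s * math.cos(w) + c * math.sin(w),    # sin(a + b)
                    c * math.cos(w) - s * math.sin(w)])   # cos(a + b)
    return out

d, p, k = 8, 5, 3
direct = positional_encoding(p + k, d)
shifted = shift(positional_encoding(p, d), k, d)
# Largest difference between the two vectors; zero up to floating-point error.
print(max(abs(a - b) for a, b in zip(direct, shifted)))
```

The rotation applied in `shift` depends only on the offset k, not on p, which is what allows the encoding of q = p + k to be written as a linear transformation of the encoding of p.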
In the second part, the question Q, the final word vectors $[x_w; x_c]$ and the position vectors PE are used as inputs to the MTA module. By using multi-layer attention to learn information from different layers and capture different types of semantic information, the model obtains high-level semantic information about the whole sequence. After that, we send the question encoding Q (query) and the article context encoding C (context) produced by the MTA module to the context-query attention layer to learn the question and answer information. Inspired by QANet [2], this module effectively learns the associations between context and query and obtains keywords that describe their relationship. The module contains two calculation schemes: context-to-query attention A and query-to-context attention B. Using these two schemes, we obtain the similarity matrix of query and context, which enhances the relevance between them. In the formal expression, SM (of size $n \times m$) is the similarity matrix between context and query, n is the length of the context, and m is the length of the query. In the definition of SM, $W_0$ is a trainable variable and $\odot$ denotes the element-wise product.
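A minimal sketch of the two attention schemes follows, assuming the formulation used by QANet [2], which this module is said to be inspired by: a trilinear similarity $W_0 [c; q; c \odot q]$, row-wise and column-wise softmax normalisation, and the products shown in the code. The function names, the sizes and the random inputs are hypothetical, and the authors' exact equations may differ.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def similarity(C, Q, w0):
    """SM[i, j] = w0 . [c_i ; q_j ; c_i * q_j], returned with shape (n, m)."""
    n, m = C.shape[0], Q.shape[0]
    c = np.repeat(C[:, None, :], m, axis=1)          # (n, m, d)
    q = np.repeat(Q[None, :, :], n, axis=0)          # (n, m, d)
    feats = np.concatenate([c, q, c * q], axis=-1)   # (n, m, 3d)
    return feats @ w0                                # (n, m)

def context_query_attention(C, Q, w0):
    S = similarity(C, Q, w0)
    S_row = softmax(S, axis=1)      # normalise over query words
    S_col = softmax(S, axis=0)      # normalise over context words
    A = S_row @ Q                   # context-to-query attention, shape (n, d)
    B = S_row @ S_col.T @ C         # query-to-context attention, shape (n, d)
    return A, B

rng = np.random.default_rng(0)
n, m, d = 6, 4, 8                   # hypothetical context length, query length, hidden size
C = rng.normal(size=(n, d))         # stand-in for the context encoding from the MTA module
Q = rng.normal(size=(m, d))         # stand-in for the query encoding from the MTA module
w0 = rng.normal(size=3 * d)         # trainable W_0, random here
A, B = context_query_attention(C, Q, w0)
print(A.shape, B.shape)
```

Both outputs have one row per context word, so they can be combined with C position by position before being passed to the following encoding layer.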
Then, we send the result to the encoding layer, which consists of three MTA modules, to learn the relationship between context and query from a global perspective. The three MTAs output $M_0$, $M_1$ and $M_2$, respectively. Finally, the result is sent to two softmax functions to obtain the start position and end position of the target answer in the article paragraph. In the model's loss function, $y_i^{start}$ and $y_i^{end}$ represent the start and end positions of the answer in the context.
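The output step can be sketched as follows. Pairing $M_0$ with $M_1$ for the start position and $M_0$ with $M_2$ for the end position, and using the negative log-likelihood of the true indices as the loss, are assumptions borrowed from QANet [2]; the text only states that two softmax functions are applied. The weight vectors `w_start` and `w_end` and all sizes are hypothetical.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def span_log_probs(M0, M1, M2, w_start, w_end):
    """Log-probability of each context position being the start / end of the answer."""
    start_logits = np.concatenate([M0, M1], axis=-1) @ w_start   # shape (n,)
    end_logits = np.concatenate([M0, M2], axis=-1) @ w_end       # shape (n,)
    return log_softmax(start_logits), log_softmax(end_logits)

def span_loss(log_p_start, log_p_end, y_start, y_end):
    """Negative log-likelihood of the true start and end indices for one example."""
    return -(log_p_start[y_start] + log_p_end[y_end])

rng = np.random.default_rng(0)
n, d = 10, 8                                           # hypothetical context length, hidden size
M0, M1, M2 = (rng.normal(size=(n, d)) for _ in range(3))
w_start, w_end = rng.normal(size=2 * d), rng.normal(size=2 * d)

log_p_start, log_p_end = span_log_probs(M0, M1, M2, w_start, w_end)
loss = span_loss(log_p_start, log_p_end, y_start=2, y_end=4)
pred_start, pred_end = int(log_p_start.argmax()), int(log_p_end.argmax())
print(loss, pred_start, pred_end)
```

In training, this per-example loss would be averaged over the batch. The greedy argmax used here does not enforce start ≤ end; a practical decoder would search over valid (start, end) pairs.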

C. MULTI-LAYER ATTENTION TRANSFORMER UNIT
In order to understand and make full use of the output information of each layer of the network, we add a multi-layer attention method based on Transformer [12]. As shown in Fig. 3, the architecture uses Transformer structure as a base network. It uses a combination of multi-head attention mechanism and feedforward neural network to model sequences. Our method can use the multi-layer attention to learn different layer information and capture the semantic information of different levels in the sequence.
For the basic Transformer building block, which contains a self-attention mechanism and a feedforward network, LayerNorm() is the layer normalization function, Attention() is the self-attention function, and FFN() is a feedforward neural network with ReLU as the activation function. $Q^{l-1}$, $K^{l-1}$ and $V^{l-1}$ are the query, key and value vectors transformed from the output $T^{l-1}$ of the previous layer, and they serve as the inputs of Attention().

FIGURE 3. Multi-layer attention transformer unit. We changed the single-layer self-attention to a multi-layer attention in transformer and connected the layers in a fully connected manner. We also add the sequence information output from each layer to the next attention layer, which strengthens the utilization of the network at different layers and reduces the loss of information.

We first reconstruct the basic Transformer unit structure. In order to obtain key information of query and context, we change the single-layer self-attention mechanism to a multi-layer attention mechanism and fully interconnect all layers. In the formal definition of the multi-layer attention mechanism, $A^{l-k}$ is the result calculated by the attention function of layer l-k, and Aggregation() is an aggregate function that unifies the results of each layer. It is calculated as follows: we first concatenate $x_1, x_2, \ldots, x_k$, then send them to a feedforward neural network with sigmoid as the activation function, and accumulate all the inputs. Finally, we apply the layer normalization function to get the result. We use fully connected layers for the multi-layer attention instead of a residual connection for the following reasons:
1. A fully connected layer propagates the loss directly to the base layer, which makes training easier.
2. The encoding of each layer is an aggregation of all previous layers and retains key information from all of them.
3. The final encoding relies on representations from all layers, including both sophisticated and simple features.
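The Aggregation() procedure described above (concatenate, feed forward with a sigmoid activation, accumulate the inputs, then layer-normalise) can be sketched as below. The two-layer shape of the feedforward network, the hidden size, and the layer normalisation without learnable gain and bias are assumptions; the weights are random placeholders.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregation(xs, W1, b1, W2, b2):
    """xs: list of k arrays of shape (seq_len, d). Returns an array of shape (seq_len, d)."""
    concat = np.concatenate(xs, axis=-1)       # (seq_len, k * d)
    hidden = sigmoid(concat @ W1 + b1)         # feedforward layer with sigmoid activation
    projected = hidden @ W2 + b2               # project back to (seq_len, d)
    return layer_norm(projected + sum(xs))     # accumulate the inputs, then normalise

rng = np.random.default_rng(0)
seq_len, d, k, hidden_size = 5, 8, 3, 16       # hypothetical sizes
xs = [rng.normal(size=(seq_len, d)) for _ in range(k)]
W1, b1 = rng.normal(size=(k * d, hidden_size)) * 0.1, np.zeros(hidden_size)
W2, b2 = rng.normal(size=(hidden_size, d)) * 0.1, np.zeros(d)

out = aggregation(xs, W1, b1, W2, b2)
print(out.shape)
```

Because the inputs are summed rather than concatenated on the way out, the result keeps the per-layer size d regardless of how many layers k are aggregated.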
By using the multi-head attention mechanism, the model can attend to the representation information of different subspaces at different positions. Based on the Transformer structure, we replace the single multi-head attention layer with a combination of multiple multi-head attention layers. As shown in Fig. 1, the model aggregates the information of each attention layer and sends it to the next layer to make full use of the information of each layer.
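For reference, a single multi-head attention layer can be sketched as follows. This is the standard scaled dot-product formulation from the Transformer [12], not necessarily the exact variant used in AG-MTA; the attention dimension of 128 and the 8 heads follow the settings reported in the experiments, while the sequence length and the random weights are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    heads = scaled_dot_product_attention(split(X @ Wq), split(X @ Wk), split(X @ Wv))
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)   # re-join the heads
    return concat @ Wo

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 128, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4))

out = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads)
print(out.shape)
```

Each head attends within its own 16-dimensional subspace, which is what lets different heads focus on different positions at the same time.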

D. MULTI-LAYER ATTENTION TRANSFORMER AGGREGATION ENCODER
Based on the Transformer structural model, we use layer aggregation techniques to integrate the information of each layer better. The structure of the MTA is shown in Figure 4.
By aggregating nodes, we can better utilize the information between each unit to analyze the sequence information from multiple aspects and ensure the efficient utilization of information.
The multi-layer attention Transformer units are aggregated with the aggregate function Aggregation() defined in formula (17). We aggregate the nodes of the same layer into one node and then send the result back to the linear backbone network as the input of the next layer. All aggregation steps replace the layer-combining operation with an addition operation, so that the computational complexity is reduced while the size of each layer is maintained.

A. DATASET
FIGURE 4. Structure of the MTA. We aggregate each multi-layer attention transformer unit between aggregation nodes and transmit the aggregated information to the backbone network to further enhance the utilization of information. In addition, since each layer transmits information in parallel, the computational efficiency of the model is improved.

In the experiments, we used the SQuAD [4] dataset. Table 1 shows an example from the SQuAD dataset. SQuAD [4] contains more than 100,000 question-answer pairs extracted from hundreds of Wikipedia articles through crowdsourcing. Compared with other datasets such as MCTest [27], Algebra [28], Science [29] and WikiQA [30], we chose SQuAD because it contains far more questions. On the other hand, the CNN/Daily Mail [31] and CBT [32] datasets are also relatively large, but both are cloze-style datasets rather than real question answering data.

B. NETWORK PARAMETER SETTINGS
Some of the hyper-parameters used in the neural network are shown in Table 2. We use the ADAM optimization algorithm [33] to train the model, with $\beta_1 = 0.8$, $\beta_2 = 0.999$ and $\epsilon = 10^{-7}$. For the learning rate lr, we use a warm-up scheme that gradually increases it from 0.0 to 0.001 over the first 2000 training steps and then keeps it constant for the rest of training.
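The warm-up scheme can be written as a small schedule function. The linear shape of the ramp is an assumption, since the text only states that the rate rises gradually from 0.0 to 0.001 over the first 2000 steps; the dictionary of optimizer settings simply restates the values given above.

```python
# Optimizer settings as reported in the text.
ADAM_SETTINGS = {"beta1": 0.8, "beta2": 0.999, "epsilon": 1e-7}

def learning_rate(step, peak_lr=0.001, warmup_steps=2000):
    """Linear warm-up (assumed shape) to peak_lr, then a constant rate."""
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps

for step in (0, 500, 1000, 2000, 30000):
    print(step, learning_rate(step))
```

A different ramp (for example logarithmic) would only change the branch for `step < warmup_steps`; the constant phase after step 2000 is taken directly from the text.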

C. EXPERIMENTAL RESULTS AND ANALYSIS
In the answer generation task, we mainly evaluate the performance of the model with the EM and F1 scores. EM is the exact match score, which requires the model's prediction to be exactly the same as the answer in the dataset. The F1 score measures the degree of fuzzy matching between the model's prediction and the answer. It takes into account both the precision and the recall of the prediction, so the evaluation results are more objective.
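The two metrics can be illustrated with a simplified, single-reference version of SQuAD-style scoring. The normalisation steps (lower-casing, removing punctuation and English articles) mirror the common SQuAD convention and are an assumption here; the official evaluation additionally takes the maximum score over several reference answers.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lower-case, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    return float(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)   # share of predicted tokens that are correct
    recall = overlap / len(true_tokens)      # share of reference tokens that were found
    return 2 * precision * recall / (precision + recall)

prediction, truth = "the Eiffel Tower in Paris", "Eiffel Tower"
print(exact_match(prediction, truth), f1_score(prediction, truth))
```

In this example the prediction contains the full reference plus two extra tokens, so EM is 0 while precision is 0.5, recall is 1.0, and F1 is 2/3.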
As shown in plot (a) of Fig. 5, the loss of the model is relatively large during training. When the number of attention heads is set to 1 and the number of attention layers to 3, the network loses a lot of information during the feedforward process and has difficulty capturing key information, so the model is difficult to converge. In plot (b), we increase the training steps from 30000 to 50000, the attention dimension from 96 to 128, and the attention layers from 3 to 4. The EM score increases from 67.3 to 68.9 and the F1 score from 76.2 to 78. With more training steps and attention layers, the model can capture and take full advantage of the semantic information of the sentence. In plot (c), we set the number of attention heads to 8. By using the multi-head attention mechanism, the model can find the key words in the sentence, so that the meaning of the context can be clearly expressed. The EM score and the F1 score reach 71.1 and 80.3, respectively. It can be seen that the accuracy of our model increases significantly with the multi-head attention mechanism and the MTA module.
In order to test the validity of the model, we use the test set to evaluate its accuracy and its ability to generate answers. We conducted three sets of experiments, and the results are shown in Table 3. We chose three typical paragraphs and questions, where the colored words in the paragraphs are the answers to the questions. As can be seen from the table, our model performs well.

D. ABLATION STUDIES AND COMPARISONS WITH PRIOR METHODS
To demonstrate whether AG-MTA can effectively generate the correct answer, Table 4 compares AG-MTA under different parameter settings. As shown in Table 4, comparing Model 1 and Model 3 shows that the more training steps are used, the better the model fits. Model 1 and Model 2 are also compared in Table 4.
In order to measure the performance of our model, we compared it with other representative methods. As shown in Table 5, Dev denotes a model's score on the development set, and Test denotes its score on the test set. Compared with the other methods, AG-MTA shows a clear improvement on both Dev and Test. In contrast to models that use only an LSTM network (such as the LR Baseline [4]) or only an attention mechanism (such as BiDAF [10]), AG-MTA combines context information and extracts key semantic information by using the MTA module, position encoding, and the multi-head attention mechanism. Most importantly, since the encoding part of our model uses a pure attention scheme and data-parallel computation, the model can achieve better performance with more data.
In addition, to help qualitatively evaluate our MTA module, position encoding, and multi-head attention mechanism, we conduct extensive ablation experiments. As shown in Table 6, model (A), which includes all modules, achieves the best performance. Comparing the results of (A), (B) and (C), we observe that with the help of position encoding and the multi-head attention mechanism, model (A) can use logical semantic information to express the relationships among words. Comparing the results of (A) and (D), it can be seen that the MTA module significantly improves the performance of model (A) by fusing semantic information at different positions from the base layer to the top layer.
We also made a formal comparison of the results, as shown in Figure 6. As the network is progressively improved, its performance keeps increasing, and the EM and F1 scores of our model reach 71.1 and 80.3, respectively.

E. DISCUSSION
Compared with other answer generation methods, our answer generation model yields a significant improvement. By testing AG-MTA with the MTA module under different parameters, we improve EM from 67.3% to 71.1% and F1 from 76.2% to 80.3%.
We have also studied the impact of the number of attention layers and the number of multi-layer attention Transformer units on the EM and F1 scores. Based on Model 6 in Table 4, we set Train Steps = 80000, Attention Dimension = 128 and Attention Heads = 8, and vary the number of Attention Layers and Unit Numbers accordingly. The experimental results are shown in Figure 7, where A is Attention Layers and U is Unit Numbers. Figure 7 shows that as the number of Attention Layers and Unit Numbers increases, the growth rate of the EM and F1 scores gradually decreases, while the computing resources consumed by the model increase exponentially. As the model's complexity increases, it also becomes more difficult to fit. Therefore, we set Attention Layers = 4 and Unit Numbers = 6 to avoid consuming too many computing resources.
As shown in Table 5, the experimental results indicate that AG-MTA can exploit the contextual correlation preserved in paragraphs well. Nevertheless, the lack of similarity matching between questions and paragraphs leads to poor logical reasoning ability. Compared with BERT [40], AG-MTA does not achieve the best performance. However, BERT places high demands on hardware. Therefore, in terms of practicality, our model offers strong performance with general applicability.

V. CONCLUSION
In this paper, we propose an end-to-end model for answer generation based on a multi-layer Transformer aggregation encoder. The model enhances contextual correlation and improves the accuracy of answer generation. We propose MTA to focus on information representations at different levels and to aggregate the nodes of the same layer to combine context information. Furthermore, a novel position encoding method that makes full use of the absolute and relative position information of each word is designed to strengthen the relationships between words. Experiments on the SQuAD dataset verify that our model achieves a significant improvement over the state-of-the-art method. Moreover, ablation studies on the multi-head attention mechanism and position encoding have been carried out to demonstrate the effectiveness of each component.
In the experiments, we found that the time cost increases with the complexity of the network structure. Hence, in future work, we will pay more attention to network architecture optimization and to reducing the number of model parameters, so that these methods become more widely applicable.