Neural Network With Hierarchical Attention Mechanism for Contextual Topic Dialogue Generation

The encoder-decoder model has achieved remarkable results in natural language generation. However, dialogue generation work often ignores the influence of dialogue context and topic information, so the generated replies either drift away from the context or, lacking topic information, collapse into generic responses. In this work, we study multi-turn dialogue generation on large corpora and exploit both the context information and the topic information of the conversation to generate more coherent, context-sensitive responses. We improve upon existing models and attention mechanisms and propose a new hierarchical model, the HAT model, to better address the dialogue context problem. This method enables the model to obtain more contextual information during processing and improves its contextual relevance, yielding higher-quality responses. In addition, to address the absence of topics in the responses, we pre-train an LDA (Latent Dirichlet Allocation) topic model to extract topic words from the dialogue content and retain as much of the dialogue's topic information as possible. Our model is extensively tested on several corpora, and the experiments show that it is superior to most hierarchical and non-hierarchical models with respect to multiple evaluation metrics.


I. INTRODUCTION
Conversation systems are widely used in a range of applications, from technical support services to language learning tools and entertainment, such as Microsoft's XiaoIce, Apple's Siri and Google's Google Assistant. Dialogue systems can be categorized as single-turn or multi-turn. The generation of single-turn dialogue has made great progress in recent years [1]. A single-turn conversation is characterized by considering only the current utterance to generate a response. For example, the classic Seq2Seq model and subsequent work that improved the encoder-decoder structure have achieved good results in single-turn dialogue. However, the single-turn format is not consistent with the habits of daily human communication: human communication typically consists of multiple turns, so multi-turn dialogue generation requires further investigation. Compared with single-turn dialogue, multi-turn dialogue faces several challenges. Instinctively, humans tend to adapt conversations to their interlocutor not only by looking at the last utterance but also by considering information and concepts covered in the conversation history [2]. Multi-turn dialogue generation therefore should not be limited to the current utterance but should consider the overall contextual information of the dialogue to obtain more semantic information when generating a reply. Current solutions to this problem fall into two categories: (1) improving the Seq2Seq model to generate diversified replies [3]; (2) adding additional information, such as structured world knowledge, personality or emotion [4], to the training corpus to enrich the content of the generated dialogue.
The associate editor coordinating the review of this manuscript and approving it for publication was Kathiravan Srinivasan.
In this paper, we study multi-turn response generation for open-domain sessions in a conversation system and attempt to train the response generation model on the response and its context. The context refers to the current sentence and the previous dialogue information. Serban et al. introduced the HRED model to dialogue systems and improved the classical Seq2Seq model [5]. HRED uses a hierarchical structure to model multiple turns of dialogue. An encoder RNN encodes the input sentences; its hidden vector at the last time step is regarded as the encoding vector of the input sentence and serves as the input vector of the next RNN layer. A context RNN in the middle layer encodes dialogue-level information, such as the state and intention of the overall dialogue, while the first-layer RNN encodes the sentence-level information of a sentence. At each step, the sentence representation vector output by the first layer is fed into the middle layer. Therefore, the hidden vector of the context RNN can remember the previous dialogue information as a context vector. Finally, the vector encoding the previous dialogue information is used as the input vector of the decoder RNN, so the decoding process includes the dialogue context information in addition to the information of the answer sentence itself.
Serban et al. subsequently improved the HRED model and proposed the VHRED model, which introduces the idea of variational coding on the basis of HRED [6] and adds a Gaussian random variable to the context RNN link to enhance diversity. Tang et al. captured structural features with a sequential variational autoencoder component and used a topic modeling component based on a Gaussian distribution to enhance the recognition of text semantics [7]. Although previous work has begun to consider the influence of the context on generation, the effect obtained is not very obvious. Judging from the generated results, the context information is not fully utilized. How to obtain a better contextual representation is therefore currently the main problem. Some researchers use cosine similarity to define contextual relevance. Xing et al. introduced the traditional attention mechanism to the hierarchical model [8]. Compared with the previous model, the new model improved the contextual relevance. Kong et al. proposed the HSAN, which uses a hierarchical encoder to update the word and utterance representations with their position information, respectively [9]. We tested the HAN and HSAN models on multiple datasets and output the attention weights for most sentences. Higher attention weights were found to be assigned to words or sentences that are not important in context, which is not ideal for the use of context information and results in considerable loss of the main information of the sentence. Therefore, in this work, we redesigned the hierarchical attention model to model the context. At the same time, we took into account the problem of missing topics in the generated responses: we adopt a multi-task learning approach that uses the LDA model to model the context topic, which effectively improves multi-turn dialogue generation.
The contributions of this paper are as follows:
(1) The importance of context in multi-turn dialogue generation is illustrated, and an improved solution is proposed.
(2) The current hierarchical model is improved, and a more complete attention mechanism is designed so that the model learns more fine-grained information and maintains contextual relevance.
(3) The importance of dialogue topic information is demonstrated, and a multi-task method is adopted in which a pre-trained LDA model extracts topics from the content of multiple turns of dialogue to retain more topic information when generating replies.

II. RELATED WORK
With the maturation of single-turn dialogue systems, multi-turn dialogue generation has attracted increased attention. Chen H. used adversarial training to generate dialogue [10]. Multi-turn dialogue generation poses more challenges and problems than single-turn dialogue generation, but it also offers more directions and room for innovation. In multi-turn dialogue generation, not only the current utterance but also the previous dialogue context must be considered to generate responses, and the context and themes of the overall dialogue must be taken into account. Multi-turn dialogue therefore raises difficulties that remain to be addressed. Serban et al. proposed a hierarchical encoder-decoder model (HRED) to model the context and then proposed variants of the HRED model, named VHRED and MrRNN [11]. On the basis of the HRED model, Gaussian random variables were introduced to enhance the generation of responses. HVMN adds a memory network to VHRED to improve the quality of dialogue responses, but the appropriateness of the answers is weakened due to the lack of long-term memory [12]. To address this problem, some researchers have attempted to define the relevance of a context using a similarity measure, such as cosine similarity [13]. However, these similarities are defined at either the word or the sentence level and cannot adequately address the lack of topics in multi-turn dialogue generation. Moreover, attention mechanisms have been widely used in natural language processing [14]. Attention can be used to process a single sentence to capture the keywords, but the structure and coherence of the entire dialogue cannot be captured. Therefore, the intention network was used to capture the high-level structure and coherence of real dialogue information [15]. Tian et al. proposed the WSeq model and added a weighted attention mechanism to HRED [16].
This line of work is mainly concerned with how to use a hierarchical method to integrate the representations of the individual utterances into a single representation. Then, the Deep Utterance Aggregation (DUA) model was proposed to solve the problems of noise and redundancy that arise when past conversations are directly concatenated as contextual information in multi-turn dialogue. DUA uses an attention mechanism to mine key information from the dialogue and the replies, highlighting key information and ignoring redundant information, to obtain a matching score for the utterances and the response [17]. In theory, DUA can match more contextual information, but it requires a huge corpus to achieve the desired effect. Zhou et al. proposed a two-hop graph attention mechanism to enhance the semantic representation of the context so as to better extract context information [18]. Zeng et al. proposed a multi-task learning method that labels conversation and text to improve the quality of generation [19]. Tang et al. used a Gaussian distribution to enhance the recognition of text semantics [7]. Zhang et al. proposed a large and tunable neural dialogue generation model named DialoGPT to ensure the richness of generated content [20]. Lewis et al. presented BART, a denoising autoencoder for pretraining sequence-to-sequence models [21]. Bao et al. introduced discrete latent variables to solve the inherent one-to-many mapping problem in response generation [22], and Devlin et al. proposed BERT, which is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers [23]. Feng et al. used a GAN to provide informativeness and coherence through two complementary evaluation viewpoints [24]. Although these works have made great efforts in multi-turn dialogue generation, they have not been able to solve the problem of the lack of topics in dialogue generation. Zhang et al. proposed a topic drift method to simulate topic transfer in daily human conversations [25], and Sevegnani et al. made the context coherent by predicting and generating a ''bridging'' utterance connecting the new topic to the topic of the previous conversation turn [26]. However, current dialogue generation models still suffer from several challenging problems: (1) Commonly repeated replies. Despite the neural network being trained on a large corpus, repeated meaningless replies, such as ''I don't know'' and ''how are you?'', occur. (2) Context independence. Even if the context and topic information are given, the generated response may be irrelevant to the context. (3) Loss of contextual topic information. Although considerable work has improved the content generated in multi-turn dialogue, the retention of topic information across multiple turns remains a problem to be solved.
VOLUME 10, 2022
In this work, we study multi-turn dialogue generation while improving the existing attention mechanism so that the model can retain more fine-grained information to assist response generation without deviating from the context. At the same time, we use the pre-trained LDA topic model to control the generation of conversation topics through multi-task learning. Attention mechanisms were first proposed in the field of machine translation to enhance the relevance of the context [16], [27]. Seo et al. proposed a hierarchical attention network [28] to accurately focus on objects of different scales and shapes in an image. Li et al. proposed a deep reinforcement learning method to generate meaningful and diverse responses and to increase the length of the generated dialogue [4]. Dhingra et al. proposed an end-to-end dialogue system for information access [29]. The algorithm, called KB-InfoBot, is based on reinforcement learning over a knowledge base (KB). Building on these works, we exploit the advantages of the current mainstream models and remodel the discourse and context. In addition to focusing on fine-grained information, we consider the loss of contextual information during propagation. To overcome the lack of topic information in the generated response, we adopt the pre-trained LDA model to assign topics to the conversation history and thereby improve the topic relevance of the model's responses. In summary, we design a more complete attention mechanism to model the contextual information and verify its effectiveness through a large number of experiments, and we adopt a multi-task learning method that uses the LDA model to capture the topic information in the dialogue for use in the final generation.

III. THE PROPOSED MODEL
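The proposed method is based on the encoder-decoder framework. Suppose that the dataset can be described as $D = \{(C_i, Y_i)\}$. For any turn of dialogue, $(C_i, Y_i)$ contains the context $C_i$ and the corresponding target response $Y_i$; $C_i$ is composed of $(c_{i,1}, \ldots, c_{i,m})$, where $c_{i,m}$ is the message and $c_{i,1}, \ldots, c_{i,m-1}$ are the utterances from the previous turns of dialogue. Given the contextual dialogue corpus $C = (c_1, c_2, \ldots, c_m)$, we use a bidirectional gated recurrent unit network (BiGRU) [16] and encode every $c_i$ as hidden vectors $(h_{i,1}, \ldots, h_{i,T_i})$, where $h_{i,t} = [\overrightarrow{h}_{i,t}; \overleftarrow{h}_{i,t}]$ concatenates the $t$-th hidden state of a forward GRU, $\overrightarrow{h}_{i,t}$, with the $t$-th hidden state of a backward GRU, $\overleftarrow{h}_{i,t}$. $h_{i,0}$ is initialized with an isotropic Gaussian distribution, $w_{i,t}$ is the $t$-th of the $T_i$ words in the sentence, $e_{i,t}$ is the embedding of $w_{i,t}$, and $z_t$ and $r_t$ are an update gate and a reset gate, respectively. The calculation is conducted as follows:

$$z_t = \sigma(W_z e_{i,t} + V_z h_{i,t-1})$$
$$r_t = \sigma(W_r e_{i,t} + V_r h_{i,t-1})$$
$$s_t = \tanh(W_s e_{i,t} + V_s (h_{i,t-1} \circ r_t))$$
$$h_{i,t} = (1 - z_t) \circ h_{i,t-1} + z_t \circ s_t$$

where $\sigma(\cdot)$ is the sigmoid function, $\circ$ denotes element-wise multiplication, and $W_z$, $W_r$, $W_s$, $V_z$, $V_r$, and $V_s$ are parameters.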
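The gated update described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`gru_step`, `bigru_encode`), the dictionary layout of the parameters, and the zero initial state (the text uses a Gaussian initial state) are assumptions made for readability and determinism.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gru_step(e_t, h_prev, p):
    """One GRU update. p maps 'Wz','Wr','Ws' (input) and 'Vz','Vr','Vs' (recurrent) to matrices."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], e_t), matvec(p["Vz"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], e_t), matvec(p["Vr"], h_prev))]
    gated = [h * ri for h, ri in zip(h_prev, r)]  # reset gate applied to the previous state
    s = [math.tanh(a + b) for a, b in zip(matvec(p["Ws"], e_t), matvec(p["Vs"], gated))]
    # The update gate interpolates between the previous state and the candidate state.
    return [(1.0 - zi) * h + zi * si for zi, h, si in zip(z, h_prev, s)]

def bigru_encode(embeddings, fwd, bwd, hidden_size):
    """Encode one utterance; returns one concatenated [forward; backward] state per word."""
    h = [0.0] * hidden_size  # assumption: zero start instead of a Gaussian draw
    forward = []
    for e in embeddings:
        h = gru_step(e, h, fwd)
        forward.append(h)
    h = [0.0] * hidden_size
    backward = []
    for e in reversed(embeddings):
        h = gru_step(e, h, bwd)
        backward.append(h)
    backward.reverse()  # realign backward states with word positions
    return [f + b for f, b in zip(forward, backward)]
```

With all parameter matrices set to zero, both gates equal 0.5 and the candidate state is zero, so each step simply halves the previous state; this is a convenient sanity check for the update rule.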

A. WORD-LEVEL ATTENTION
For the input hidden vectors $h_{i,j}$, we calculate the attention weight of each word as $\alpha_{i,t,1}, \ldots, \alpha_{i,t,T_i}$. Although this method requires considerable training time to compute the attention over the sentence, it retains fine-grained information. Utterance $c_i$ is converted into a vector $r_{i,t}$; for all $i \in \{1, \ldots, m\}$, $r_{i,t}$ can be obtained by the following formulas, where $\alpha_{i,t,j}$ is the calculated weight value:

$$r_{i,t} = \sum_{j=1}^{T_i} \alpha_{i,t,j} h_{i,j}$$
$$\alpha_{i,t,j} = \frac{\exp(e_{i,t,j})}{\sum_{k=1}^{T_i} \exp(e_{i,t,k})}, \qquad e_{i,t,j} = \eta(s_{t-1}, l_{i+1,t}, h_{i,j})$$

where $l_{m+1,t}$ is initialized from a Gaussian distribution, $s_{t-1}$ is the hidden state of the decoder at step $t-1$, and $\eta(\cdot)$ is a multilayer perceptron with tanh activation function. Then, $\{r_{i,t}\}_{i=1}^{m}$ is fed into a sentence-level encoder and transformed into the context representation vectors $l_{1,t}, \ldots, l_{m,t}$. Figure 1 illustrates the word-level model.
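As a rough illustration of the word-level step, the sketch below normalizes per-word scores with a softmax and forms the utterance vector as the weighted sum of word states. The scoring function is passed in as `score_fn`, a hypothetical stand-in for the perceptron η with the decoder state and the neighbouring sentence state already fixed; its internals are not specified here.

```python
import math

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def word_attention(word_states, score_fn):
    """Return (weights, utterance_vector) for one utterance.

    word_states: list of word hidden vectors of equal length.
    score_fn: maps a word hidden vector to a scalar relevance score (stand-in for eta).
    """
    weights = softmax([score_fn(h) for h in word_states])
    dim = len(word_states[0])
    utterance = [sum(w * h[d] for w, h in zip(weights, word_states)) for d in range(dim)]
    return weights, utterance

# Toy usage: a scoring function that favours states with a large first component.
weights, r = word_attention([[2.0, 0.0], [0.0, 1.0]], lambda h: math.tanh(h[0]))
```

When every word receives the same score, the weights are uniform and the utterance vector reduces to the plain average of the word states.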

B. CONTEXT-LEVEL ATTENTION
We initialize a GRU unit to encode each sentence, and for each utterance, we use word-level attention to calculate the attention weight of each word. A new representation of the sentence, $l_{m,t}$, is obtained after the weight of each word is calculated. The hierarchical attention mechanism proposed by the HRAN model has achieved relatively good results in context modelling. Inspired by it and by other recent work, we retain the hierarchical attention and further improve the sentence-level (context-level) attention mechanism to obtain a better contextual representation. For each sentence, we calculate the context-level attention in two stages. First, we calculate the importance of each input sentence and obtain a new representation of each sentence; the resulting attention value is a constant that is not dynamically updated during the decoding process. The weight based on the hidden states used for decoding is calculated as follows:

$$e_i = V^{\top} \tanh(W h_i + C h_s), \qquad \alpha_i = \frac{\exp(e_i)}{\sum_{k} \exp(e_k)}$$

where $h_i$ and $h_s$ represent the hidden states of the $i$-th sentence and the last sentence, respectively, and $V$, $W$, and $C$ are parameters. After the weight $\alpha_i$ of each sentence is calculated, it does not change throughout the rest of the decoding process. In the decoding process, the hidden state $s_t$ at the $t$-th step is calculated as follows:

$$s_t = f(y_{t-1}, s_{t-1}, d), \qquad d = \sum_{i} \alpha_i h_i$$

where $y_{t-1}$ is the output of decoding step $t-1$, $s_{t-1}$ is the hidden state of decoding step $t-1$, and $f$ is a GRU. Second, we recalculate the attention weights and update the weight of each sentence at every decoding step, as follows:

$$e_{i,t} = V^{\top} \tanh(W h_i + C s_{t-1}), \qquad \alpha_{i,t} = \frac{\exp(e_{i,t})}{\sum_{k} \exp(e_{k,t})}$$

where $V$, $W$, and $C$ are parameters independent of those of the former attention mechanism. $e_{i,t}$ and $\alpha_{i,t}$ are calculated during the $t$-th step of decoding, and the hidden state at the $t$-th step is calculated by the following formula, where $\top$ denotes the transposition of $V$ and $s_{t-1}$ is the hidden state at time step $t-1$ of decoding:

$$s_t = f(y_{t-1}, s_{t-1}, d_t), \qquad d_t = \sum_{i} \alpha_{i,t} h_i$$
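The difference between the fixed and the per-step sentence weights can be sketched as follows. This is an illustrative toy under stated assumptions: the additive scoring form (a vector `V` applied to a tanh of two projections), the function names, and the reuse of one parameter set for both stages are choices made for brevity, whereas the text states that the two stages use independent parameters.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def attention_weights(sent_states, query, V, W, C):
    """Additive attention: score_i = V . tanh(W h_i + C query), normalized over sentences."""
    Cq = matvec(C, query)
    scores = []
    for h in sent_states:
        Wh = matvec(W, h)
        scores.append(sum(v * math.tanh(a + b) for v, a, b in zip(V, Wh, Cq)))
    return softmax(scores)

def weighted_sum(sent_states, weights):
    dim = len(sent_states[0])
    return [sum(w * h[d] for w, h in zip(weights, sent_states)) for d in range(dim)]

# Toy sentence states and parameters (hypothetical values).
states = [[1.0, 0.0], [0.0, 0.0], [0.5, 0.5]]
V = [1.0, 1.0]
W = [[1.0, 0.0], [0.0, 1.0]]
C = [[1.0, 0.0], [0.0, 1.0]]

# Stage 1: the query is the last sentence state; weights are computed once and kept fixed.
static_alpha = attention_weights(states, states[-1], V, W, C)
static_context = weighted_sum(states, static_alpha)

# Stage 2: the query is the previous decoder state; weights are recomputed at every step.
dynamic_contexts = []
for s_prev in ([0.0, 0.0], [1.0, -1.0]):  # stand-ins for successive decoder states
    alpha_t = attention_weights(states, s_prev, V, W, C)
    dynamic_contexts.append(weighted_sum(states, alpha_t))
```

The only change between the two stages is the query vector, which is why the first set of weights can be cached while the second must be refreshed during decoding.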

C. TOPIC ATTENTION
To improve the topical relevance of the responses, we obtain the topic of the conversation history and extract topic words from it. We use a pre-trained LDA model to assign a topic T to the conversation context. LDA is a document-driven probabilistic model [30]. In our work, we treat the conversation history as a document, so its most likely topic is sufficient to model the conversation. After the dialogue history is entered and its topic is obtained, we select the n words with the highest probability under T (n = 100 in our experiments). We linearly combine the topic words $\{t_1, t_2, \ldots, t_n\}$ into a fixed-length vector $k$. The attention value of each topic word $t_i$, $i \in \{1, \ldots, n\}$, is calculated from $o_{n,t}$, the last hidden state of the context encoding, and $s_{t-1}$, the hidden state of the decoder at step $t-1$. At the same time, we take into account the transfer of dialogue information, changes in dialogue state, and the adverse effects of generating multiple topic words; we therefore use the last hidden state $o_{n,t}$ of the context encoding for topic assignment.
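The selection of topic words can be illustrated with a short sketch: pick the most probable topic for the conversation history, then keep the n most probable words under that topic. The data layout (a list of topic probabilities and one word-probability dictionary per topic) and the function names are assumptions; they do not correspond to any particular LDA library.

```python
def most_likely_topic(topic_probs):
    """Index of the highest-probability topic for one document (conversation history)."""
    return max(range(len(topic_probs)), key=lambda t: topic_probs[t])

def top_topic_words(word_probs, n=100):
    """The n words with the highest probability under a topic; ties are broken alphabetically."""
    ranked = sorted(word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

# Toy example with two topics (hypothetical values).
doc_topics = [0.2, 0.8]
topic_words = [
    {"price": 0.5, "refund": 0.3, "store": 0.2},
    {"music": 0.6, "guitar": 0.3, "band": 0.1},
]
T = most_likely_topic(doc_topics)
selected = top_topic_words(topic_words[T], n=2)
```

Using a single most likely topic keeps the topic vector compact, which matches the assumption in the text that one topic is sufficient to model a conversation.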

D. DECODER
The main function of the decoder is to decode a response based on the context result and topic word assignment described above. Our goal is to estimate a generation probability $p(y_1, \ldots, y_T \mid C)$ from dataset $D$. Then, for a given dialogue context $C$, we can generate a reply $Y = (y_1, \ldots, y_T)$. We define $p(y_i)$ as $p(y_i) = p_v(y_i) + p_k(y_i)$, where $p_v(y_i)$ is the probability of generating $y_i$ as a reply word and $p_k(y_i)$ is the probability of generating it as a topic word. Both terms are computed from the decoder state $S_i = \mathrm{GRU}(y_{i-1}, s_{i-1}, d_i, k)$, where $V$ denotes reply words, $K$ denotes topic words, $\sigma$ is the tanh activation function, and $N$ is the normalizer. Figure 2 shows the framework of the proposed hierarchical attention topic model.
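One way to read the sum of a reply-word term and a topic-word term is as two exponentiated scores that share a single normalizer, so that topic words receive extra probability mass. The sketch below follows that reading. It is an assumption in the style of topic-aware sequence-to-sequence decoders, not the paper's exact definition, and the score dictionaries are hypothetical inputs that a decoder would produce.

```python
import math

def generation_probs(vocab_scores, topic_scores):
    """Combine reply-word and topic-word scores under one shared normalizer.

    vocab_scores: word -> score for every word in the vocabulary.
    topic_scores: word -> score for topic words only (assumed to be a subset of the vocabulary).
    """
    normalizer = (sum(math.exp(s) for s in vocab_scores.values())
                  + sum(math.exp(s) for s in topic_scores.values()))
    probs = {}
    for word, score in vocab_scores.items():
        p_v = math.exp(score) / normalizer
        p_k = math.exp(topic_scores[word]) / normalizer if word in topic_scores else 0.0
        probs[word] = p_v + p_k  # topic words get an additional contribution
    return probs

probs = generation_probs({"the": 0.0, "music": 0.0, "ok": 0.0}, {"music": 0.0})
```

In this toy call all scores are equal, yet the topic word ends up twice as likely as the other words, which shows how the additive term biases generation toward the topic vocabulary.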

IV. EXPERIMENT
A. DATASETS

1) DAILYDIALOG
[31] DailyDialog is a high-quality open-domain multi-turn dialogue corpus that covers ten major topics in daily life, reflects the human dialogue style, has a definite dialogue mode, and contains a wealth of emotional information. The dataset contains 13k dialogue sequences, each with an average of eight turns. In recent years, many studies have used this dataset to evaluate new models. Because this small chat dataset resembles daily conversation, it is not particularly prominent in terms of topics. As a result, models such as HRAN do not focus on the topic of sentences. To avoid this problem, we performed some data augmentation on this dataset to expand the number of dialogue sequences.

2) EmpChat
[32] Rashkin et al. proposed a new empathetic dialogue generation dataset composed of 25k dialogue sequences. Many corpora used to train language models are obtained from text scraping, social media dialogue or independent books with little curation. Models trained in this way often produce offensive and cold responses in dialogue. The EmpChat dataset improves on this. Every conversation is grounded in a specific situation, which the participant describes together with a specified emotion. The speaker describes the situation at the beginning of the dialogue, and the listener responds empathetically after seeing the description.

3) PersonaChat
[33] The PersonaChat dataset comes from real crowdsourced conversations between people. The crowdworkers are randomly paired and asked to act according to a given persona. The paired crowdworkers have a natural conversation and try to get to know each other. The PersonaChat data are designed to simulate a conversation in which two interlocutors meet for the first time and get to know each other. The goal is to engage in the dialogue as much as possible, understand each other's interests, discuss one's own interests and find common ground.

4) REDDIT
To train the LDA topic model, we use a separate corpus rather than the dialogue training corpora described above. This approach indirectly introduces prior knowledge from other data sources to increase the diversity and informativeness of the dialogue. We selected the Reddit dataset, which consists of posts and comments. Each comment has rich metadata (such as author, number of replies, and karma for user comments). To obtain the dataset, we selected 95 English subreddits from approximately 1.1 million public subreddits. Our selection is based on the top-ranked subreddits. Topics discussed in these subreddits include news, education, business, politics, and sports. Furthermore, we partitioned the Reddit dataset according to the number of dialogue turns: we trained LDA models on dialogues of three, four, and five turns and selected the best LDA model for topic word extraction. Table 1 shows the size and specifications of the working dataset.

B. BASELINE
To verify the effectiveness of our proposed model, we used six baseline models for a comparison experiment.

1) HRED
[5] The model proposed by Serban et al. was the first to use a hierarchical encoder-decoder structure to model multi-turn dialogue generation.

2) VHRED
[6] Serban et al. improved the HRED model and proposed VHRED to overcome the difficulty that RNNLM and HRED have in producing meaningful, high-quality responses. VHRED introduces the idea of the VAE: noise is introduced in the middle layer to reconstruct the input data, so the samples from the auto-encoder are more global in character.

3) HRAN
[8] Xing et al. used traditional attention mechanisms to learn the importance of context. HRAN includes two levels of attention, called the word-level attention mechanism and the discourse-level attention mechanism, to make full use of context information. HRAN uses word-level encoders to encode the information of each utterance in the context into latent vectors. Then, when each word is generated, the hierarchical attention mechanism uses word-level attention and sentence-level attention to learn the deep relationships between sentences.

4) DSHRED
[34] DSHRED uses dynamic and static attention mechanisms to generate more context-sensitive responses. The query of the dynamic attention is the hidden state of the decoder, and the query of the static attention is the hidden state of the last sentence in the conversation.

5) ReCoSa
[35] ReCoSa uses multi-head self-attention to detect multiple relevant utterances in the context, achieving state-of-the-art performance. First, the initial representation of the context of each turn of dialogue is obtained through a word-level encoder; then, the updated context representation and the masked representation of the response are used simultaneously. Finally, the attention weight between each context representation and the response representation is calculated and used in the process of further decoding the response.

6) HSAN
[9] HSAN uses a hierarchical self-attention network, which attends to the important words and utterances in the context simultaneously. First, it uses a hierarchical encoder to update the word and utterance representations with their position information, respectively. Second, the response representations are updated by the masked self-attention module in the decoder. Finally, the relevance between the utterances and the response is computed by another self-attention module and used in the next response decoding step.

C. TOPIC MODEL
We pre-trained the LDA model on the Reddit dataset. The size of the training data is 1M, and we set the number of topics to 150. The value of α is set to 1/150, and the value of ϒ is set to 0.01. Finally, we remove the 1000 most frequently generated words from the generated topic words to avoid generating them repeatedly when assigning topics to sentences with high probability. Figure 3 shows the LDA model. First, the topic distribution $\theta_i$ of document $i$ is sampled from the Dirichlet distribution with parameter α; then, the topic $z_{i,j}$ of the $j$-th word of document $i$ is sampled from the multinomial topic distribution $\theta_i$; next, the word distribution $\varphi_{z_{i,j}}$ corresponding to topic $z_{i,j}$ is sampled from the Dirichlet distribution with parameter β; finally, the word $w_{i,j}$ is sampled from the multinomial word distribution $\varphi_{z_{i,j}}$.
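The generative story of LDA summarized above can be simulated in a few lines using only the standard library. The sketch is illustrative: it draws Dirichlet samples by normalizing gamma variates, uses toy hyperparameters rather than the settings reported here, and its function names are hypothetical. It generates a document; it does not perform inference or training.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    if total == 0.0:  # guard against underflow with very small concentration values
        return [1.0 / len(alphas)] * len(alphas)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index according to the given probabilities."""
    u = rng.random()
    acc = 0.0
    for idx, p in enumerate(probs):
        acc += p
        if u < acc:
            return idx
    return len(probs) - 1

def generate_document(num_words, num_topics, vocab, alpha, beta, rng):
    """Simulate the LDA generative process for one document."""
    # One word distribution per topic, drawn from a symmetric Dirichlet(beta).
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    # The document's topic distribution, drawn from a symmetric Dirichlet(alpha).
    theta = sample_dirichlet([alpha] * num_topics, rng)
    topics, words = [], []
    for _ in range(num_words):
        z = sample_categorical(theta, rng)   # topic of this word
        w = sample_categorical(phi[z], rng)  # word drawn from that topic
        topics.append(z)
        words.append(vocab[w])
    return theta, topics, words

rng = random.Random(0)  # fixed seed so the toy run is repeatable
theta, topics, words = generate_document(20, 3, ["car", "music", "book", "store"], 0.5, 0.5, rng)
```

Small concentration values such as 1/150 push each document toward a few dominant topics, which is consistent with assigning a single most likely topic to a conversation.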

D. EVALUATION METRICS

1) PERPLEXITY
Perplexity measures the accuracy of a model's prediction of a human response. Perplexity can also be thought of as the average branching factor, that is, how many choices are available when predicting the next word. A lower perplexity usually indicates better generation performance.
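As a brief illustration of the branching-factor interpretation, perplexity can be computed as the exponential of the average negative log-probability that the model assigns to the reference tokens. The function name and the input format (a flat list of per-token probabilities) are assumptions for this sketch.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each reference token."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

uniform_four = perplexity([0.25, 0.25, 0.25, 0.25])
```

A model that spreads its probability evenly over four candidates at every step has a perplexity of four, matching the reading of perplexity as the effective number of choices per word.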

2) ROUGE
ROUGE evaluates text content based on the co-occurrence of n-grams in the text and is oriented to the recall of n-grams. The quality of the text is evaluated by counting the number of overlapping basic units (n-grams, word sequences, and word pairs). The calculation formula is shown in equation 22.
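A minimal sketch of recall-oriented n-gram overlap is given below. It follows the common ROUGE-N definition (clipped matches divided by the number of reference n-grams) for a single reference; whether this matches the exact variant used in the experiments is an assumption, and the function names are illustrative.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=1):
    """Recall of reference n-grams that also occur in the candidate (counts are clipped)."""
    ref = ngram_counts(reference, n)
    cand = ngram_counts(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

reference = "the cat sat on the mat".split()
candidate = "the cat on the mat".split()
r1 = rouge_n(reference, candidate, n=1)
r2 = rouge_n(reference, candidate, n=2)
```

Because the denominator counts reference n-grams, a short candidate that omits reference content is penalized even if everything it does contain is correct.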

3) BLEU
BLEU is used as an evaluation metric in the field of machine translation and is also commonly used to evaluate dialogue systems. The metric counts the n-grams that occur in both the generated response and the real response over the entire corpus. Although the use of BLEU for evaluating dialogue generation is controversial, the compared models report it in their papers, so we include this metric for the fairness of the comparison experiment.
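For orientation, a sentence-level version of the n-gram precision idea can be sketched as follows: clipped n-gram precisions are combined by a geometric mean and multiplied by a brevity penalty. This simplified form uses one reference and no smoothing; the evaluation scripts used in the experiments may differ, so the details are assumptions.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, candidate, max_n=2):
    """Unsmoothed sentence-level BLEU with a single reference."""
    if not candidate:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives a zero score
        log_precision += math.log(clipped / total) / max_n
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - len(reference) / len(candidate))
    return penalty * math.exp(log_precision)

score = bleu("the cat sat on the mat".split(), "the cat on the mat".split())
```

The strictness of exact n-gram matching is one reason the metric is debated for dialogue, where many different replies can be equally appropriate.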

4) EMBEDDING
[36] The embedding-based metrics measure performance by calculating the similarity between the embeddings of the generated answers and those of the real responses. In this article, we use embedding average, vector extrema and greedy matching as three word-vector evaluation indicators. Moreover, we use the Word2Vec tool to train word embeddings on the Google News Corpus for evaluation.
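Two of the three indicators can be sketched compactly: embedding average takes the mean of the word vectors in a sentence, and vector extrema keeps, per dimension, the value with the largest magnitude; sentences are then compared by cosine similarity. The tiny embedding table below is made up for illustration, and out-of-vocabulary words are simply skipped, which is an assumption.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def _vectors(tokens, emb):
    return [emb[t] for t in tokens if t in emb]  # skip out-of-vocabulary words

def embedding_average(tokens, emb):
    vecs = _vectors(tokens, emb)
    dim = len(next(iter(emb.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def vector_extrema(tokens, emb):
    vecs = _vectors(tokens, emb)
    dim = len(next(iter(emb.values())))
    if not vecs:
        return [0.0] * dim
    # Per dimension, keep the component with the largest absolute value.
    return [max((v[d] for v in vecs), key=abs) for d in range(dim)]

# Hypothetical two-dimensional embeddings.
emb = {"good": [1.0, 0.0], "fine": [0.8, 0.6], "bad": [-1.0, 0.0]}
similarity = cosine(embedding_average(["good"], emb), embedding_average(["fine"], emb))
```

Unlike n-gram overlap, these scores can reward a reply that uses different words with similar meaning, which is why they are often reported alongside BLEU for dialogue.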

5) DISTINCT
[37] Diversity is a major problem in dialogue generation. As in Li et al., we measure diversity as the ratio of distinct unigrams (dist1) and distinct bigrams (dist2) to the total number of unigrams and bigrams in all generated responses. The value of the indicator reflects whether the generated responses are dull or meaningless.
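The ratio described above can be computed directly, as in the following sketch. It pools all generated responses before counting, which is one common convention; the function name and the tokenized input format are assumptions.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over a list of tokenized responses."""
    unique = set()
    total = 0
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

generated = [["i", "do", "not", "know"], ["i", "do", "not", "care"]]
dist1 = distinct_n(generated, 1)
dist2 = distinct_n(generated, 2)
```

A model that keeps producing the same generic reply yields many repeated n-grams and therefore a low ratio, while varied replies push the value toward one.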

6) HUMAN ANNOTATION
In addition to the automatic metrics above, we recruited human annotators to judge the quality of the responses generated by the different models. Five experienced labelers were invited to perform the evaluation. Responses generated by the different models were pooled and randomly shuffled for each labeler. Labelers referred to the test messages and judged the quality of the responses according to the following criteria:

+2: The response is not only relevant and natural but also contains the context message and topic information.
+1: The response can be used as a reply to the message, but it is too universal, such as ''Yes or No'', ''Hello'' and ''I don't know''.
0: The response cannot be used as a reply to the message.

Agreement among labelers was calculated with Fleiss' kappa [38].

E. PARAMETER SETTINGS
All models are implemented in PyTorch. The hyperparameters of the model are shown in Table 2.

F. EVALUATION RESULTS
Tables 3-5 present the results of the automatic evaluation. Our model achieves clear improvements on the three corpora. Our hierarchical attention topic model generates richer response content and produces better responses. On the DailyDialog corpus, the perplexity (PPL) of our method is approximately 2.8% lower than that of HAN, and it is better than that of the other models. In terms of the BLEU, embedding and distinct evaluation metrics, our model outperforms the other models to varying degrees. In terms of Distinct-1, our model and the VHRED model have the same results (VHRED introduces Gaussian variables to enhance the diversity of responses), and in terms of Distinct-2, our model achieves an improvement of approximately 13% over VHRED, the second most effective model. Compared with the latest ReCoSa model for multi-turn dialogue, our model achieves an improvement of approximately 33%. The results confirm the validity of introducing topic information to enrich the content of dialogue generation.
In terms of the embedding metrics, HRED uses a hierarchical structure but no attention mechanism to model the context, resulting in a considerable loss of original text information when generating responses. The HAN mechanism greatly improves this metric, and our HAT model improves the context-level attention compared with HRAN's hierarchical attention. We then compare the performance of each model on two major evaluation criteria: diversity and relevance. Figure 4 and Figure 5 show the model comparison. Our approach outperforms HRAN, indicating that including the hierarchical attention mechanism in the modelling of contextual information leads to better results.
In the human evaluation, our model makes better use of context and topic information (responses marked as +2) and produces fewer general responses, achieving the best performance. Compared with HRED, it increases the ''+2'' responses by 12.9% and decreases the ''0'' responses by 15.7%. This means that, compared with the traditional attention model, the hierarchical attention model is considerably stronger. Compared with HRAN, the ''+2'' responses of HAT increase by 4.2%, and the ''0'' responses decrease by 7.2%. The Kappa value of the HAT model is also the highest among all models. This is a good indication that HAT can better generate sentences containing topic information.

A. CASE STUDY
We used a sample from the test set of each of the three corpora to compare the effectiveness of our model with that of the baseline models. Given seven turns of context, each model predicts the response; the examples are shown in Tables 6-8. The generated content indicates that our model is comparable to the baseline models. The topic words extracted from the context can be applied to the generated replies at prediction time so that more information is retained for response generation. On the DailyDialog corpus, the content of the conversation indicates that it took place in a store and covered product after-sales issues. HAT can understand the content of the conversation and knows that salesperson B apologized for his product and proposed a solution. The response given by HAT indicates that the solution was accepted. For the prediction of the topic word, HAT predicts ''accept'' based on ''apologize'' in the context and then further predicts the word ''shopping'' according to the content. On the PersonaChat corpus, the scene in this case is two friends introducing themselves, exchanging personal information and some basic preferences. From the generated results, we can see that our model also takes into account the contextual information of the dialogue, including topic words such as ''music'', ''book'' and ''hobbies'' that appear earlier in the conversation, and the final generated results effectively cover the contextual information and themes. Topic words such as ''softball'', ''volleyball'', ''music'', and ''together'' appear in the generated sentences, which increases the coherence and quality of the responses. On the EmpChat corpus, the model also achieves relatively good results. The sample shows that the generated response contains contextual information and topic words such as ''car'' and ''kind'', which enriches the generated dialogue.

B. MODEL ABLATION
The experimental results in Tables 3-5 show that our model achieves substantial improvements with respect to multiple metrics and datasets, which we attribute to the hierarchical attention model and the topic structure. To verify the impact of the topic information structure and the hierarchical attention mechanism on the generated results, we performed an ablation analysis on the three corpora: we removed the hierarchical attention mechanism and the topic structure in turn to create two new models, named No-Att and No-Topic, respectively. The results in Tables 10-12 show that after removing either the topic prediction module or the hierarchical attention mechanism, the performance of our model declines to different degrees on all three corpora, which confirms that the topic information and the hierarchical attention mechanism both improve response generation. In Tables 13-15, we test the model after removing both the topic prediction module and the hierarchical attention mechanism, named No-topic/Att. Compared with the original model, the results again decline to different degrees, which verifies that the HAT model improves contextual information association and topic prediction.
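The four model variants compared in the ablation can be summarized as a small configuration grid. The sketch below is illustrative only: the class `AblationConfig`, its field names, and the helper `relative_drop` are hypothetical, and only the variant names (HAT, No-Att, No-Topic, No-topic/Att) come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AblationConfig:
    """Which components are enabled in a model variant (hypothetical fields)."""
    name: str
    use_hier_attention: bool
    use_topic_module: bool


def ablation_variants() -> List[AblationConfig]:
    """The full model plus the three ablated variants named in the text."""
    return [
        AblationConfig("HAT", True, True),
        AblationConfig("No-Att", False, True),         # hierarchical attention removed
        AblationConfig("No-Topic", True, False),       # topic module removed
        AblationConfig("No-topic/Att", False, False),  # both removed
    ]


def relative_drop(full_score: float, ablated_score: float) -> float:
    """Relative decline of an ablated variant with respect to the full model."""
    if full_score == 0:
        raise ValueError("full_score must be non-zero")
    return (full_score - ablated_score) / full_score
```

For a metric where higher is better, `relative_drop(full, ablated)` is positive when removing a component hurts performance. The actual scores should be read from Tables 10-15; none are reproduced here.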

C. ERROR ANALYSIS
We conducted a statistical analysis of the responses generated by HAT on the DailyDialog corpus. We randomly selected 100 rounds of dialogue from the test set responses generated by HAT. The statistical results showed that approximately 20% of the responses lacked correct topic information. Potential reasons for this issue include the following: (1) The DailyDialog corpus itself has an influence, because it contains daily dialogue and does not have strong topicality. In addition, the topic of the conversation changes frequently, so the model cannot learn the context and obtain topic and contextual information effectively.
(2) The model predicts multiple topic words when generating replies. However, as the number of turns in a dialogue increases, the topics discussed at first and the topics discussed later may be completely unrelated, which can cause deviation in the model predictions. These problems represent directions for future work. Potential ways to address these issues include (1) labelling the topics of the corpus, (2) optimizing the current topic model to improve the contextual topic logic and (3) improving the model to support more tasks.
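The text does not state how responses were judged to lack topic information. As one possible automatic proxy, the sketch below draws a fixed-size random sample and counts responses that contain none of the topic words associated with their context. The function names, the word-level matching rule, and the fixed random seed are all assumptions made for illustration; they are not the authors' procedure.

```python
import random
import re
from typing import List, Sequence, Tuple


def sample_dialogues(dialogues: Sequence, k: int = 100, seed: int = 0) -> list:
    """Draw up to k items without replacement, reproducibly via a fixed seed."""
    rng = random.Random(seed)
    return rng.sample(list(dialogues), min(k, len(dialogues)))


def lacks_topic(response: str, topic_words: List[str]) -> bool:
    """True if no topic word occurs as a whole word in the response."""
    tokens = set(re.findall(r"[a-z']+", response.lower()))
    return not any(word.lower() in tokens for word in topic_words)


def missing_topic_rate(samples: List[Tuple[str, List[str]]]) -> float:
    """Fraction of (response, topic_words) pairs with no topic word present."""
    if not samples:
        return 0.0
    misses = sum(lacks_topic(response, words) for response, words in samples)
    return misses / len(samples)
```

Exact word matching is a crude criterion: it misses synonyms and inflected forms, so a manual check of the flagged responses would still be needed before drawing conclusions.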

VI. CONCLUSION
In this paper, we proposed a topic dialogue generation model based on a hierarchical attention mechanism to address the problem of generating context-relevant responses in multi-turn dialogue generation. In addition, our pre-trained topic model can assign topic words during generation through multi-task learning, which improves the topic relevance of the dialogue. Our model improves considerably on the baseline model.
Moreover, in the ablation experiment, we demonstrated the importance of introducing topics and the hierarchical attention mechanism in dialogue generation and provided a better reference for multi-turn dialogue generation. In future work, we will further consider the utilization of contextual information and the controllability of generated topics, to improve performance on conversational data in other fields and to make the model suitable for more tasks.