SEQ2SEQ++: A Multitasking-Based Seq2seq Model to Generate Meaningful and Relevant Answers

Question-answering chatbots have tremendous potential to complement humans in various fields. They are implemented using either rule-based or machine learning-based systems. Unlike the former, machine learning-based chatbots are more scalable. Sequence-to-sequence (Seq2Seq) learning is one of the most popular approaches in machine learning-based chatbots and has shown remarkable progress since its introduction in 2014. However, chatbots based on Seq2Seq learning exhibit a weakness: they tend to generate answers that are generic or inconsistent with the questions, rendering the answers meaningless and thereby potentially lowering the chatbot adoption rate. This weakness can be attributed to three issues: question encoder overfit, answer generation overfit, and language model influence. Several recent methods utilize multitask learning (MTL) to address this weakness. However, the existing MTL models show very little improvement over single-task learning, as they still generate generic and inconsistent answers. This paper presents a novel approach to MTL for the Seq2Seq learning model called SEQ2SEQ++, which comprises a multifunctional encoder, an answer decoder, an answer encoder, and a ternary classifier. Additionally, SEQ2SEQ++ utilizes a dynamic tasks loss weight mechanism for MTL loss calculation and a novel attention mechanism called the comprehensive attention mechanism. Experiments on the NarrativeQA and SQuAD datasets were conducted to gauge the performance of the proposed model in comparison with two recently proposed models. The experimental results show that SEQ2SEQ++ yields noteworthy improvements over both models on the bilingual evaluation understudy (BLEU), word error rate (WER), and Distinct-2 metrics.


FIGURE 1. Chatbot types.
Answer extraction refers to the process of extracting the answer from a text for a question. Answer selection refers to the selection of the best answer for a question from a list of answers. Answer generation refers to the task of generating the answer to a question. The generated answer can be a new answer, which is not seen in the existing training dataset.
There are two popular approaches in conversational chatbot modeling, namely Transformer network-based models such as [13]-[15] and recurrent neural network (RNN)-based sequence-to-sequence learning (Seq2Seq) models such as [16]-[22]. The Transformer network is based on the feed-forward network [11], wherein sentences are processed as a whole rather than word by word by utilizing a self-attention mechanism, which can be highly parallelized. However, Transformer-based models need to handle sequence processing with positional embeddings to encode information related to a specific position, and this incurs very high computational and memory costs because of their quadratic O(N^2) memory usage and time complexity, where N is the sequence length [10], [14], [23]-[27].
The other popular and more suitable method for sequence processing tasks such as question answering is the RNN-based Seq2Seq method [1], [28]. RNNs view the input as a chain structure whereby the next output depends on the previous hidden state (sequential by design), thus enabling RNN-based models to process sequences of variable length. RNN-based models cannot be parallelized and thus can be slow while processing long sequences [11]. However, unlike the Transformer network, RNN-based models require only a linear number of operations, O(N), where N is the sequence length, and thus do not require high computational and memory costs. This shows that the RNN-based Seq2Seq method is still worthwhile to investigate and has great potential. It is to be noted that the Seq2Seq learning approach has been extensively researched for both single-turn and multi-turn conversations [1], [16]-[22], [28].
We focused on single-turn (non-hierarchical) question answering (answer generation) using RNN-based Seq2Seq learning under the multitask learning (MTL) framework, similar to what is defined in our key benchmark paper [18]. MTL is a machine learning approach in which multiple tasks are learned together to improve answer generation quality. MTL has shown success in many applications of machine learning, including NLP [29], [30]. MTL in the Seq2Seq model was first explored for machine translation problems [29], [31], and this success lent an impetus to other researchers [18], [20], [32]-[36] to explore it for chatbot answer generation to address the issue of generic and inconsistent answers by the Seq2Seq model. Seq2Seq learning under MTL has considerable potential because the MTL framework provides an efficient mechanism to integrate multiple enhancements, as discussed further in this paper. [18] is of particular interest to us because it is the only paper known to us that utilizes the MTL framework for answer generation without requiring a secondary dataset. All other MTL frameworks, such as [20], [32]-[36], require an additional dataset, which may not be available in all scenarios.
The Seq2Seq method utilizes an encoder and decoder architecture [37], [38]. The encoder, which consists of an RNN, aims to represent the meaning of the question sentence by encoding the question sentence into dense vectors called hidden states. Subsequently, the decoder, which is another neural network, aims to generate an answer sentence based on the encoder's hidden states. Figure 2 shows a typical Seq2Seq model.
However, the findings of a number of studies [18]-[22] have demonstrated that the Seq2Seq method [37] tends to generate frequently occurring words in the answer, thereby compromising the quality of the generated answer. Generated answers may be meaningless or irrelevant to the question; as a result, conversations with chatbots can become meaningless, be abruptly terminated by users, and eventually lower the chatbot adoption rate. This weakness can be attributed to three (3) key issues: language model influence, answer generation overfit, and question encoder overfit. These issues are described in further detail in the following sub-sections.

A. LANGUAGE MODEL INFLUENCE
The decoder in a Seq2Seq model, typically an RNN, also behaves like a language model. A language model refers to the ability to generate the next word on the basis of the previously generated word or words. As the decoding progresses, the language model influence of the decoder becomes stronger than the influence of the question. Consequently, the decoder may generate answers that are irrelevant to the question.

B. ANSWER GENERATION OVERFIT
Seq2Seq learning occurs by optimizing the cross-entropy loss function to find the best sequence of words that forms the answer. The goal of training is to minimize this loss. However, uneven word frequency in the data causes the model to generate frequently occurring words to minimize the loss during training, which results in an answer generation overfit: the model produces frequently occurring words learned from the dataset.

C. QUESTION ENCODER OVERFIT
Question-answering models are typically trained with datasets in a specific domain, such as customer support logs or question-answer pairs of an academic subject. However, the availability of data in these specific domains may be limited. Question encoder overfit refers to the situation in which the question encoder becomes overfitted owing to limited training data. Overfit occurs because the model generates question encodings to minimize loss during training. Question encoder overfit causes the model to struggle when handling unseen questions.
In addition, most of the existing work found in the literature focuses only on addressing either the overfitting issue or language model influence issue, but not both, thereby leaving a gap and a great opportunity to address the weakness in a more holistic manner.
In this paper, we study how to address all three issues in an MTL setting. We introduce SEQ2SEQ++, an MTL-based Seq2Seq model using RNNs, which consists of a multifunctional encoder (MFE), an answer decoder (AD), an answer encoder (AE), and a ternary classifier (TC). The MFE performs question encoding, first-word prediction, and last-word prediction. The AD performs answer generation, and the AE performs answer encoding. The TC performs a three-class classification of answers for a given question. Additionally, our method utilizes a dynamic task loss weight scheme for MTL loss calculation and a novel attention mechanism called the comprehensive attention mechanism (CAM) for answer generation.
The following works were considered for performance benchmarking, where two datasets, namely NarrativeQA [39] and SQuAD [40], are utilized: i) MTL-BC: An MTL model that utilizes a binary classifier as the auxiliary task and fixed task loss weighting, as proposed in [18]. This is our key benchmark paper. ii) STL: A single-task baseline Seq2Seq learning model with the global attention mechanism [41]. This single-task learning (STL) model is used as the control method. iii) MTL-LTS: A sequential MTL model that utilizes a separate network to predict the first word, as proposed by [42]. This benchmark is used to compare against the effects of the parallel MTL training proposed in our work. The key contributions of this paper are: i) a new MTL-based Seq2Seq model, called SEQ2SEQ++, for question answering; ii) a dynamic task loss weight mechanism for MTL: a new computation method to calculate the task loss weights automatically and dynamically; and iii) CAM: a novel attention mechanism for answer generation.

II. RELATED WORK
In this section, we review the methods found in the extant literature and identify the gaps.

A. LANGUAGE MODEL INFLUENCE ISSUE
A typical approach to address the language model influence of the decoder is the attention mechanism. An attention mechanism is a method that allows the decoder to focus on certain parts of a question to decode the answer to that question. The most prevalent and extensively utilized attention mechanism is the global attention mechanism (Figure 3, Table 1), which computes the attention weights in accordance with the encoder's hidden states and the decoder's last hidden state [41]. These computed attention weights (ATT_t) are then used to compute the context vector (C_t), which is then used for subsequent computation to eventually generate the answer. Several other attention mechanisms have been proposed, such as the local attention mechanism [43], hybrid attention mechanism [44], and weighted attention mechanism [45]. However, similar to the global attention mechanism [41], these attention mechanisms [43]-[45] also focus only on the encoder's hidden states and the decoder's final hidden state at each decoding step. Although the decoder's final hidden state represents all hidden states of previous time steps, the representation of the answer words generated in earlier time steps becomes diluted as decoding progresses, which increases the decoder's language model influence and its tendency to generate irrelevant answers. This leaves a gap in identifying an attention mechanism that can consider all hidden states of the decoder during decoding to address the language model influence more effectively.
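To make the global attention computation concrete, it can be sketched in a few lines of NumPy. This is a minimal sketch, not the exact formulation of [41]: the dot-product score is one common scoring choice and is an assumption here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(encoder_states, decoder_last_state):
    """Global attention sketch: score each encoder hidden state against the
    decoder's LAST hidden state only, then build the context vector C_t as
    the attention-weighted average of the encoder states."""
    scores = encoder_states @ decoder_last_state  # dot-product score (an assumption)
    att = softmax(scores)                         # attention weights ATT_t
    context = att @ encoder_states                # context vector C_t
    return att, context

# Toy example: 3 encoder hidden states of dimension 4
enc = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]])
dec_last = np.array([1.0, 0.0, 0.0, 0.0])
att, ctx = global_attention(enc, dec_last)
```

Because the query is only the decoder's final hidden state, earlier decoder states play no direct role, which is exactly the dilution the text describes.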

B. ANSWER GENERATION OVERFIT ISSUE
The answer-generation overfit can be addressed by adding one or more regularization terms to the cross-entropy loss function to compute a new loss to be backpropagated.
One existing approach is to train the Seq2Seq method in a reinforcement learning or adversarial framework. However, reinforcement or adversarial learning requires custom reward functions or human interactions [46], which render this approach less practicable for application in cross-domain problems.
Another more practical approach to address overfitting is training the Seq2Seq method in a ''multi-task learning,'' i.e., MTL, framework. MTL is a machine learning approach in which multiple tasks are learned in parallel to improve the main task. For question-answering, the main task to be improved is the answer-generation task. In MTL-based Seq2Seq learning, in addition to the answer generation task (main task), other tasks (usually referred to as auxiliary tasks) are introduced during model training. The losses from the auxiliary tasks are used as regularization terms to reduce the overfitting of the main task of answer generation. Equation (1) shows a generic form of multitask loss calculation, where L_MTL is the total model loss, L_ag is the answer-generation task loss, L_n refers to the loss of the n-th auxiliary task, and α refers to the task loss weight for each task in MTL. The task loss weights determine the extent to which each task influences learning and the extent to which the task will be learned. Therefore, it is vital to identify suitable auxiliary tasks.

Moreover, in all existing MTL works for Seq2Seq learning, a fixed-weight scheme is used to calculate the MTL loss. In this fixed-weight scheme, the auxiliary task is typically assigned a small value, such as 0.001, 0.01, or 0.1 [18]. However, determining the weight for the auxiliary task loss is not an easy task, given that there are no specific rules or formulas to determine the actual value to be used. Arbitrary values must be assigned and tested before the final weight for a task can be identified. Researchers have to perform numerous experiments on the task loss weights on a trial-and-error basis before settling on a specific value. In addition, training on different datasets may require different values; the task loss weight for one dataset may not be effective for another. This trial-and-error approach is time-consuming and inefficient.
It becomes even worse or nearly impossible if there are more than two tasks. This leaves a gap in identifying a more efficient and effective approach to determine the auxiliary task loss weights for an MTL.

L_MTL = α_ag L_ag + Σ_n α_n L_n    (1)
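To make Eq. (1) concrete, the fixed-weight scheme described above can be sketched in a few lines of Python. This is a minimal sketch; the auxiliary weight of 0.1 follows one of the example values quoted from [18].

```python
def mtl_loss_fixed(l_ag, aux_losses, alpha_ag=1.0, alpha_aux=0.1):
    """Eq. (1) under the fixed-weight scheme: the main answer-generation
    loss plus each auxiliary loss scaled by a small, hand-picked constant."""
    return alpha_ag * l_ag + sum(alpha_aux * l for l in aux_losses)

# Answer-generation loss of 2.0 and one auxiliary (classification) loss of 1.5
total = mtl_loss_fixed(2.0, [1.5])  # 2.0 + 0.1 * 1.5
```

The sketch makes the drawback visible: `alpha_aux` is a constant that must be chosen by trial and error and never adapts during training.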

C. QUESTION ENCODER OVERFIT ISSUE
There are three approaches for addressing the question encoder overfit issue. The first approach is to provide additional embeddings or additional encodings of supporting information, such as emotions, topics, or facts, that accompany the question [19]-[22], [47], [48]. The underlying idea for this approach is to reduce the overfitting of the encoder so that it can generate a richer representation of the question to be passed to the decoder to furnish relevant answers. However, these approaches are skewed to a specific goal and are also dependent on additional inputs such as facts, topics, or emotions, which may not be available for all question-answering scenarios or datasets. This leaves a gap in identifying a method that can reduce the question encoder overfitting without depending on any additional input. The second approach is the MTL approach, wherein a Seq2Seq model is trained to perform answer generation and another task such as answer classification [18], [20]. The underlying idea for this approach is to share the encoding of the question encoder between both tasks so that the question encoder overfit can be reduced. For example, [6] utilized binary question-answer classification as an auxiliary task (Figure 4). Binary classification refers to a classification in which an answer is classified as either correct or incorrect. However, classifying an answer as only right or wrong (binary classification) may not be a natural classification method. In NLP, a generated answer can also be categorized as partially correct.
Referring to Table 2 as an example, the question is ''What did you have for breakfast?'' and the exact answer is ''I had half-boiled egg and toasted bread for breakfast.'' This is a fully correct answer. Suppose ''I had toasted bread for breakfast'' is given as an answer. It is not fully correct, but at the same time, it is not completely wrong. In this scenario, the given answer can be considered a partially correct answer. However, in binary classification, this answer is classified as incorrect. In line with this argument, binary classification is not a ''natural'' classification of answers, which leaves a gap in identifying an auxiliary task with a more natural answer classification to be used in MTL-based answer-generation model training.
The third approach to address the question encoder overfit is to train the Seq2Seq model using a two-phased (sequential) training approach [42]. During phase 1, this model learns to perform first-word prediction, whereas during phase 2, the model is trained to predict the answers, except for the first word. The idea of first-word prediction as an auxiliary task in MTL is a good one for reducing model overfit. However, sequential MTL suffers from an issue referred to as negative transfer: a situation in which learning the first task may negatively affect the learning of the second task. This leaves a gap in identifying a more suitable approach to add first-word prediction as a task in the MTL framework. Table 3 summarizes the gaps identified in the related works, as discussed.

III. PROPOSED METHODS AND MODEL
In this section, we describe our proposed model, SEQ2SEQ++ (Figure 5, Table 4), to address the issues and gaps discussed in the previous section. SEQ2SEQ++ integrates four newly proposed methods in a single model: the CAM, the dynamic tasks loss weight (DL) scheme, the TC, and the MFE.
i) CAM is an attention computation method proposed to address the language model influence issue. ii) DL is a dynamic tasks loss weight computation method applied during SEQ2SEQ++ training and proposed to address the answer-generation overfit issue. iii) MFE and TC are proposed to address the question encoder overfit issue. By integrating all these methods, SEQ2SEQ++ performs four tasks in parallel: answer generation, ternary classification, first-word prediction, and last-word prediction.
The details of each method are discussed in the following sections.

A. COMPREHENSIVE ATTENTION MECHANISM
In this work, a new attention mechanism called the ''comprehensive attention mechanism'' or CAM, which considers all the decoder's previous hidden states during attention weight computation, is proposed to address the language model influence more effectively.
In CAM, the attention weights are computed in accordance with all the encoder's hidden states and the sum of all the previous hidden states of the decoder. These computed attention weights are then used to compute the context vector. Subsequently, the decoder utilizes the context vector to generate an answer. The CAM-based decoding steps are the same as a typical decoding process to generate the answer, except for the computation of the attention weights.
i) First, the attention weights (ATT_t) are computed from all the encoder's hidden states and the sum of all the decoder's previous hidden states (Eq. 2), where h_1, h_2, ..., h_T are the hidden states of the encoder, t is the decoding time step, and S = (s_1, s_2, ..., s_{t-1}) is the sequence of the decoder's previous hidden states. ii) Second, the context vector for decoding at time step t is computed (Eq. 3), where C_t is a weighted average context vector for answer prediction. iii) Next, the output context vector (O_t) and the decoder's hidden state at time step t (s_t) are computed based on the concatenation of the generated context vector (C_t) and the supplied ground truth (y_{t-1}) using the GRU transformation (f_GRU); O_t is used for the prediction of the next word, and s_t is utilized for subsequent decoding steps (Eq. 4), where Embed_{t-1} is the embedding of the supplied ground-truth word or the previously predicted word (y_{t-1}). iv) Finally, the conditional probability of the next token, p(y_t), is computed in accordance with the output vector (Eq. 5), where W_x are matrices (learnable parameters) trained with the model.
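The key difference from global attention, the query built from all previous decoder states, can be sketched as follows. This is a minimal NumPy sketch, not the paper's exact Eq. 2-3: the dot-product scoring and the plain summation of previous decoder states are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cam_attention(encoder_states, prev_decoder_states):
    """CAM sketch: the attention query is the SUM of all the decoder's
    previous hidden states s_1..s_{t-1}, rather than the last hidden
    state alone as in global attention."""
    query = prev_decoder_states.sum(axis=0)  # sum of s_1..s_{t-1}
    scores = encoder_states @ query          # dot-product score (an assumption)
    att = softmax(scores)                    # attention weights ATT_t
    context = att @ encoder_states           # context vector C_t
    return att, context

# Toy example: 3 encoder states and 2 previous decoder states (dimension 4)
enc = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]])
prev = np.array([[0.5, 0.0, 0.0, 0.0],
                 [0.5, 0.0, 0.0, 0.0]])
att, ctx = cam_attention(enc, prev)
```

Because every earlier decoder state contributes to the query, the representation of earlier generated words is not diluted at later decoding steps.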

B. MULTIFUNCTIONAL ENCODER
MTL provides the means to add auxiliary tasks that can be trained simultaneously with the answer-generation task. Utilizing auxiliary tasks that share the question encoder forces the question encoder to balance and fine-tune the encodings for all receiving tasks. This ensures that all receiving tasks can mitigate their respective prediction losses, such as the answer generation loss or the answer classification loss. Because each task in an MTL contributes to the overall model loss, mitigating each task's loss is very important. Thus, by sharing the question encoder, the question encoder overfit is eventually reduced. This can ensure that the answer generation bias toward high-frequency words is reduced, and thus a more meaningful answer can be generated. The MFE is proposed to take advantage of the parallel MTL. The question encoder is shared with two additional tasks: first- and last-word prediction. The MFE performs its tasks in two stages.
i) First, it takes in the question embedding and generates the question encoding using the GRU transformation [37], denoted as f_GRU in Figure 5 and Eq. 6. ii) Next, additional computations on the question encoding (hidden states) are performed to make the first-word prediction (shown as f_FW in Figure 5), as shown in (Eq. 7) to (Eq. 9), and the last-word prediction (shown as f_LW in Figure 5), as shown in (Eq. 10) to (Eq. 12). The following computations are performed during the first-word prediction (f_FW): i) First, the self-attention weights are computed (Eq. 7). ii) Next, the question context vector for first-word prediction (C_FW) is computed as the weighted average of the question encoder hidden states (Eq. 8). iii) Finally, the probabilities of each word in the vocabulary being the first word are computed (Eq. 9).
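The first-word prediction path can be sketched as follows. This is a minimal NumPy sketch: the weight matrices `w_att` and `w_vocab` are hypothetical learnable parameters, and dot-product self-attention scoring is an assumption in place of the exact forms in Eq. 7-9.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def first_word_prediction(question_states, w_att, w_vocab):
    """f_FW sketch: self-attention over the question encodings (Eq. 7),
    weighted-average context vector C_FW (Eq. 8), then a softmax over
    the vocabulary (Eq. 9)."""
    att = softmax(question_states @ w_att)  # self-attention weights
    c_fw = att @ question_states            # context vector C_FW
    return softmax(c_fw @ w_vocab)          # P(first word) over the vocabulary

# Toy example: 2 question hidden states (dim 2), vocabulary of size 3
q_states = np.array([[1.0, 0.0], [0.0, 1.0]])
w_att = np.array([1.0, 0.0])                       # hypothetical attention weights
w_vocab = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # hypothetical projection
p_first = first_word_prediction(q_states, w_att, w_vocab)
```

The last-word prediction (f_LW, Eq. 10-12) follows the same structure with its own parameters.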
Similar computations are performed to predict the final word, denoted as the f_LW function in Figure 5; the computations are shown in Eq. 10-12.

C. TERNARY CLASSIFIER
Similar to MFE, a TC is proposed to take advantage of parallel MTL to reduce the question encoder overfit. The TC classifies a question-answer pair as ''correct,'' ''partially correct,'' or ''wrong.'' As illustrated in Figure 5, the TC performs the question-answer classification task based on the question encodings generated by the MFE and the answer encodings generated by the answer encoder (AE). Eq. 13 shows the answer encoding, where Embed_A can be the embedding for correct, partially correct, or wrong answers. Answer-Hidden-States can be one of the following: i) a_1, a_2, ..., a_U, the hidden states for the correct answer, where U is the correct answer length; ii) b_1, b_2, ..., b_V, the hidden states for the partially correct answer, where V is the partially correct answer length; iii) c_1, c_2, ..., c_W, the hidden states for the wrong answer, where W is the wrong answer length. The question and answer context vectors are computed as the weighted average of the question and answer encodings, respectively (Eq. 14) to (Eq. 21). The probability for each classification class is then computed on the concatenated question and answer vectors (Eq. 22) to (Eq. 24).

where ATT_Q, ATT_C, and ATT_P are the self-attention weights; C_Q is the computed question context vector at time step t; a_1, a_2, ..., a_U are the hidden states of the decoder for correct answers, and U is the correct answer length; C_C is the computed correct-answer context vector at time step t; b_1, b_2, ..., b_V are the hidden states of the decoder for partially correct answers, and V is the partially correct answer length; C_P is the computed partially-correct-answer context vector at time step t; c_1, c_2, ..., c_W are the hidden states for wrong answers, and W is the wrong answer length; and C_W is the computed wrong-answer context vector at time step t.
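The classification path can be sketched as follows. This is a minimal NumPy sketch: all weight vectors and matrices here are hypothetical learnable parameters, and dot-product self-attention scoring stands in for the exact forms of Eq. 14-24.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attended_context(states, w):
    """Self-attention weights followed by a weighted-average context vector,
    as in Eq. 14-21 (dot-product scoring assumed)."""
    att = softmax(states @ w)
    return att @ states

def ternary_classify(question_states, answer_states, w_q, w_a, w_cls):
    """TC sketch: question and answer context vectors are concatenated and
    projected to three classes, then softmaxed (Eq. 22-24)."""
    c_q = attended_context(question_states, w_q)
    c_a = attended_context(answer_states, w_a)
    logits = np.concatenate([c_q, c_a]) @ w_cls
    return softmax(logits)  # P(correct / partially correct / wrong)

# Toy example: 2 question states and 2 answer states of dimension 2
q = np.array([[1.0, 0.0], [0.0, 1.0]])
a = np.array([[1.0, 1.0], [0.0, 1.0]])
w_q = np.array([1.0, 0.0])       # hypothetical parameters
w_a = np.array([0.0, 1.0])
w_cls = np.ones((4, 3))
probs = ternary_classify(q, a, w_q, w_a, w_cls)
```

During training, the cross-entropy of these three-class probabilities against the correct/partially-correct/wrong labels would give the ternary classification loss L_tc.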

D. DYNAMIC TASKS LOSS WEIGHT SCHEME
In this work, a task loss weight scheme called the ''dynamic tasks loss weight scheme'' is proposed. In this scheme, during model training, the task loss weights for each task (answer generation, answer classification, first-word prediction, and last-word prediction) are automatically recalculated from the second epoch onward. The new weights are based on the relative loss of each task in comparison with the total loss of all tasks during each epoch (Eq. 25), where α_n is the task loss weight, L_Tn is the loss of task n, and N is the total number of tasks. The weights represent the percentage contribution of each task to the overall MTL loss. Therefore, the sum of all weights is always 1, which represents 100%.

α_n = L_Tn / Σ_{i=1}^{N} L_Ti    (25)
Algorithm 1 delineates the steps to perform the SEQ2SEQ++ model training, which utilizes the DL scheme. First, the shared question encoder performs question encoding, and subsequently the first- and last-word predictions. The model then computes the loss for each prediction (L_fw and L_lw). The question encoding is then passed to the answer decoder to generate an answer using CAM. The model then computes the answer-generation loss (L_ag). In addition, the answer encoder performs the answer encoding. Both the question encoding and the answer encoding are then passed to the TC to perform question-answer classification and subsequently compute the ternary classification loss (L_tc). The model then computes the MTL loss (L_MTL) and updates its weights (i.e., parameters). The formula for the MTL loss calculation is shown in step 1.7 of Algorithm 1. Finally, the model computes the new task loss weights to be used for the next epoch. These steps are performed for each batch of the question-answer pairs for the total epoch count, as defined for training.
The variables α_ag, α_tc, α_fw, and α_lw represent the task loss weights for the answer generation, ternary classification, first-word prediction, and last-word prediction tasks, respectively. At the start of the training, each weight is initialized to 0.25. During each epoch, all the task loss weights (α_ag, α_tc, α_fw, and α_lw) are recalculated and updated for use in the next epoch (step 3 of Algorithm 1). Table 5 presents a sample calculation. The total MTL loss is 4, and the answer generation loss is 2. Therefore, the new task loss weight for answer generation (α_ag) is 2/4 = 0.5. The same calculation is performed to obtain the new task loss weights for all tasks. The new task loss weight for each task is thus proportional to its share of the total model loss. This means that the influence of each task in each epoch is dynamically determined by the task loss in the previous epoch. Thus, model overfitting can be reduced more effectively using this dynamic proportional task loss weight during training than with a fixed task loss weight approach, as in typical MTL learning.
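The recalculation illustrated in Table 5 can be sketched directly in Python. This is a minimal sketch; the task names are shorthand for the four tasks, and the loss values reproduce the sample calculation from the text (total loss 4, answer-generation loss 2).

```python
def dynamic_task_weights(task_losses):
    """Eq. (25): each task's new weight is its share of the total MTL loss
    from the previous epoch, so the weights always sum to 1 (100%)."""
    total = sum(task_losses.values())
    return {task: loss / total for task, loss in task_losses.items()}

# Reproducing the Table 5 sample: total MTL loss 4, answer-generation loss 2
weights = dynamic_task_weights({"ag": 2.0, "tc": 1.0, "fw": 0.5, "lw": 0.5})
```

A task that dominated the loss in the previous epoch automatically receives a larger weight in the next epoch, with no manual tuning.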

IV. EXPERIMENT AND DISCUSSION
We developed eight models, and trained and tested each of them on two datasets, NarrativeQA [39] and SQuAD [40], to gauge the effectiveness of our proposed methods in generating meaningful and relevant answers for each dataset. All except the STL model are MTL-based models. The STL model is a single-task method, which has the answer-generation task only and is used as a control method. The results of the experiments are described in detail in this section.

A. MODELS
The models (summarized in Table 6) utilized for these experiments are: i) MTL-BC: An MTL-based Seq2Seq model that utilizes the global attention mechanism [41] during decoding and a binary classifier as the auxiliary task [18]. This model is trained using Algorithm 2. It utilizes a fixed task loss weight scheme for multitask loss calculation during training, whereby the value of 0.1 is assigned as the binary classification task loss weight. The architecture of the MTL-BC model is shown in Figure 6, and the notations are summarized in Table 7. This model is the second and key benchmark model. ii) STL: A single-task Seq2Seq learning model as proposed by Bahdanau et al. [41] that utilizes the global attention mechanism [41] and is trained using Algorithm 2. This model is the first benchmark model. The architecture of the STL model is shown in Figure 7, and the notations are summarized in Table 8. iii) MTL-LTS: The third benchmark for this research work. It is a sequential MTL model with a global attention mechanism that utilizes a separate network as the auxiliary task to predict the first word, as proposed in [42]. This model is trained using Algorithm 4. The architecture of the MTL-LTS model is shown in Figure 8, and the notations are summarized in Table 9. iv) MTL-BC-CAM: A modified version of MTL-BC that utilizes CAM during decoding. The architecture of the MTL-BC-CAM model is illustrated in Figure 9, and the notations are encapsulated in Table 10. This interim model is proposed to study the effectiveness of CAM in reducing the language model influence. v) MTL-BC-DL: A modified version of MTL-BC that utilizes the dynamic tasks loss weight scheme (architecture as in Figure 6). This interim model is proposed to study the effectiveness of the dynamic tasks loss weight scheme in reducing answer generation overfit. vi) MTL-TC: A modified version of MTL-BC that utilizes a TC as the auxiliary task. This model is trained using Algorithm 2. It utilizes a fixed task loss weight scheme for multitask loss calculation during training, whereby the value of 0.1 is assigned as the ternary classification task loss weight. The architecture of the MTL-TC model is shown in Figure 10, and the notations are summarized in Table 11.
This interim model is proposed to study the effectiveness of ternary classification in reducing question encoding overfit. vii) MTL-MFE: This model utilizes MFE, the global attention mechanism during answer decoding, and dynamic tasks loss weights for multitask loss calculation. This model is trained using Algorithm 6. The MTL-MFE architecture is illustrated in Figure 11, and the notations are summarized in Table 12. This interim model is proposed to study the effectiveness of MFE, which performs first-word and last-word predictions in parallel, in reducing question encoding overfit. viii) SEQ2SEQ++: This is the ultimate model proposed to address all three issues described in the problem statement (the language model influence, answer generation overfit, and question encoder overfit issues). SEQ2SEQ++ integrates all the newly proposed methods, namely, CAM, DL, TC, and MFE (Figure 5, Table 4). This model is trained using Algorithm 1.

B. DATASETS
All models were trained and tested on two datasets. The first dataset, NarrativeQA [39], is a fiction-based dataset proposed for reading comprehension evaluation; however, we used only the questions and the first correct answer in our experiments because only that part is needed. The second dataset, SQuAD [40], is a question-answer dataset based on Wikipedia articles. For our experiments, we selected from SQuAD the questions that have answers, and we only took the first answer. Table 13 presents the details of both datasets used in this experiment.
During training, the pairs of {question, answer} are used by all models to learn to generate answers. For example (Table 14), for the question ''How are plants different from animals?'' the model will learn to produce the answer ''Primary cell wall composed of the polysaccharides cellulose.'' In addition to the answer-generation task, MTL-BC, MTL-BC-CAM, MTL-BC-DL, MTL-TC, and SEQ2SEQ++ require additional data to perform the answer classification.
MTL-BC, MTL-BC-CAM, and MTL-BC-DL require triplets of {question, correct answer, wrong answer} for answer-classification training. Table 15 shows the sample training data for the MTL-BC, MTL-BC-CAM, and MTL-BC-DL. The training dataset was generated as described in [18].
MTL-TC and SEQ2SEQ++ require quadruples of {question, correct answer, partially correct answer, wrong answer} for answer classification training. Table 16 summarizes the sample training data for MTL-TC and SEQ2SEQ++. The training dataset was generated using the following approach: i) The original (gold) answer from the dataset is labeled ''Correct.'' ii) The original answer is then manipulated by removing or adding one or more words to obtain a BLEU score of more than 0 and less than 1 as compared to the gold answer, and is labeled ''Partially Correct.'' iii) Then, to represent the third class of wrong (negative) answers, an answer that belongs to another question in the dataset and has a BLEU score of 0 as compared to the gold answer is randomly selected and labeled ''Wrong.''
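The labeling rule above can be sketched as follows. This is a minimal Python sketch: the `bigram_overlap` scorer is a crude stand-in for the BLEU comparison used by the authors, not real BLEU, and the example sentences reuse the breakfast answer from Table 2.

```python
def bigram_overlap(candidate, reference):
    """Crude stand-in for BLEU: fraction of the candidate's bigrams that
    appear in the gold answer (NOT real BLEU; for illustration only)."""
    def bigrams(tokens):
        return set(zip(tokens, tokens[1:]))
    cand, ref = bigrams(candidate.split()), bigrams(reference.split())
    return len(cand & ref) / len(cand) if cand else 0.0

def label_answer(candidate, gold):
    """Ternary labeling rule from the text: score of 1 -> Correct,
    strictly between 0 and 1 -> Partially Correct, 0 -> Wrong."""
    score = bigram_overlap(candidate, gold)
    if score >= 1.0:
        return "Correct"
    return "Partially Correct" if score > 0.0 else "Wrong"

gold = "I had half-boiled egg and toasted bread for breakfast"
```

With this rule, the gold answer itself is ''Correct,'' a shortened variant such as ''I had toasted bread for breakfast'' is ''Partially Correct,'' and an unrelated answer is ''Wrong.''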

C. EVALUATION METRICS
A total of three (3) evaluation metrics were used in this study to measure the performance of each model from various angles. They are as follows:

1) BILINGUAL EVALUATION UNDERSTUDY (BLEU)
The BLEU metric [49] is the most popular metric utilized in answer-generation works, including [18], [22], [34], [50], and [51]; it is also used in [18], one of our key reference works. BLEU was originally designed to measure the quality of machine translation against human translation. It calculates an N-gram precision between the two sequences and imposes a commensurate penalty for a machine sequence that is shorter than the human sequence. Beyond its original purpose, BLEU is used extensively for other text-generation evaluations, including natural answer generation (NAG), and is reported to have a high correlation with human judgments of quality [49]. Here, BLEU-2, the bi-gram version of the metric, is used to evaluate the generated answer against the gold answer. BLEU measures the correctness of the systems being evaluated and gives a score between 0 and 1; the higher the BLEU score, the better the model. In other words, a higher BLEU score indicates the generation of answers that are relevant to the question.
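A minimal sentence-level BLEU-2 can be sketched as follows: the geometric mean of the modified 1-gram and 2-gram precisions, multiplied by the brevity penalty for short candidates. This is a simplified sketch without smoothing; the experiments presumably rely on a standard implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu2(candidate, reference):
    # Geometric mean of modified 1-gram and 2-gram precisions,
    # multiplied by a brevity penalty for short candidates.
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in (1, 2):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        total = sum(cand_counts.values())
        clipped = sum((cand_counts & ref_counts).values())  # clipped matches
        if total == 0 or clipped == 0:
            return 0.0
        precisions.append(clipped / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 2)
```

A perfect match scores 1.0; a correct but truncated answer is pulled below 1.0 by the brevity penalty, which is the behavior the metric description above relies on.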

2) WORD ERROR RATE
Word error rate (WER) [52] measures the ''mistakes'' of the model: the rate of wrongly generated words against the overall generated words. It gives a score between 0 and 1, where a lower score indicates fewer errors. In other words, a low WER score indicates the generation of fewer high-frequency and generic words. The formula is given as (26):

WER = WC_wrong / WC_total    (26)

where WC_wrong is the wrongly generated word count and WC_total is the total generated word count. WER complements BLEU; we added it to obtain a more holistic measurement of each model's performance.
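One possible reading of Eq. (26) can be sketched as follows, where a generated word is counted as wrong when it cannot be matched against a remaining word of the gold answer. This bag-of-words matching is our assumption; the paper's exact word-matching rule is not spelled out here.

```python
from collections import Counter

def wer(generated, gold):
    # Eq. (26): WER = WC_wrong / WC_total. A generated word is counted as
    # wrong when it cannot be matched against a remaining word of the gold
    # answer (a bag-of-words reading of the formula -- an assumption).
    gen = generated.split()
    if not gen:
        return 0.0
    remaining = Counter(gold.split())
    wrong = 0
    for word in gen:
        if remaining[word] > 0:
            remaining[word] -= 1   # consume one gold occurrence
        else:
            wrong += 1             # no gold word left to match
    return wrong / len(gen)
```

For example, an answer with one wrong word out of three scores 1/3, and a fully correct answer scores 0.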

3) DISTINCT-2
In addition to evaluating the correctness of the models using BLEU and their error rate using WER, the diversity of the answers generated by each model was measured using a technique proposed in [53] and utilized in our key reference work [18]. Here, the Distinct-2 metric is used to measure diversity. Distinct-2 is the number of distinct bigrams divided by the total number of bigrams generated by the model, and it has a value between 0 and 1. The higher the score, the more diverse the answers. In other words, more diverse answers indicate the generation of fewer high-frequency and generic answers. The formula is given as (27):

Distinct-2 = distinct bigram count / total bigram count    (27)
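The definition above translates directly into code: collect every bigram across all generated answers, then divide the number of distinct bigrams by the total count.

```python
def distinct_2(answers):
    # Distinct-2 (Eq. 27): number of distinct bigrams divided by the total
    # number of bigrams across all generated answers.
    all_bigrams = []
    for answer in answers:
        tokens = answer.split()
        all_bigrams.extend(zip(tokens, tokens[1:]))  # consecutive word pairs
    if not all_bigrams:
        return 0.0
    return len(set(all_bigrams)) / len(all_bigrams)
```

A model that repeats the same generic phrases yields many duplicate bigrams and thus a low score, which is exactly why the metric detects generic answers.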

D. EXPERIMENTAL SETTINGS
All models were implemented using Python version 3.6 and TensorFlow [54] version 1.15.0, and were developed on a serverless Jupyter notebook environment with GPUs for interactive development [55]. Each model was run for a maximum of 250 epochs with batches of 32 training pairs. The checkpoint with the lowest validation loss was used for testing (the experiments). For all models, the diverse beam search technique proposed in [56] was implemented during testing. Several combinations of group size and beam size were tried, and the best outcome for each model is taken for performance comparison. Table 17 summarizes these settings. Additionally, model-specific settings were used for the experiments. They are as follows: i) MTL-BC [18] performs answer generation using the global attention mechanism and binary answer classification; fixed weights are used to calculate the MTL loss during training. ii) STL [41] is a single-task model that performs only the answer-generation task, utilizing the global attention mechanism. iii) MTL-LTS [42] performs first-word prediction and then answer generation using the global attention mechanism. iv) MTL-BC-CAM performs answer generation using the CAM method and binary answer classification; fixed weights are used to calculate the MTL loss during training. v) MTL-BC-DL performs answer generation using the global attention mechanism and binary answer classification. It utilizes dynamic tasks loss weight calculations during model training. vi) MTL-TC performs answer generation using the global attention mechanism and ternary answer classification; fixed weights are used to calculate the MTL loss during training. vii) MTL-MFE performs answer generation using the global attention mechanism and first- and last-word predictions. It utilizes dynamic tasks loss weight calculations during model training. viii) SEQ2SEQ++ performs answer generation using CAM, first- and last-word predictions, and ternary answer classification.
It utilizes dynamic tasks loss weight calculations during model training. Table 18 presents a succinct summary of the model-specific settings.

E. EXPERIMENTS RESULTS AND ANALYSIS
This section presents the experiment results and analysis based on (i) the comparison of the interim models (MTL-BC-CAM, MTL-BC-DL, MTL-TC, and MTL-MFE) against the benchmark models (STL and MTL-BC), and (ii) SEQ2SEQ++ against the benchmark models (STL and MTL-BC). The NarrativeQA and SQuAD datasets and the BLEU, WER, and Distinct-2 metrics are used. We also report the analysis, significance testing, and a case study for each issue.

1) INTERIM MODELS VERSUS BENCHMARK MODELS
This section presents and analyzes the experimental results of our interim models against specific benchmark models for each issue (language model influence, answer generation overfit, question encoder overfit, and MTL-MFE versus MTL-LTS). These interim models were developed and tested to gauge the effectiveness of each of our proposed methods individually in addressing a specific issue, as discussed in more detail in the following sections.

a: LANGUAGE MODEL INFLUENCE

NarrativeQA DATASET
On the NarrativeQA dataset, MTL-BC-CAM performed better than MTL-BC in the BLEU and WER metrics, meaning it can produce answers with higher correctness (represented by a higher BLEU score) and lower errors (represented by a lower WER score). However, in terms of diversity, MTL-BC performed better than MTL-BC-CAM, which is represented by a higher Distinct-2 score.

SQuAD DATASET
For the BLEU metric, MTL-BC-CAM scored 0.6888, which is 1.8% higher than MTL-BC's score of 0.6769. For WER, MTL-BC-CAM scored 0.2622, which is 9.4% lower than MTL-BC's score of 0.2893. For the Distinct-2 metric, MTL-BC-CAM scored 0.8956, which is 0.6% higher than MTL-BC's score of 0.8907. This result on the SQuAD dataset shows that MTL-BC-CAM performed better than MTL-BC in all the metrics. This means MTL-BC-CAM can produce answers with higher correctness (represented by a higher BLEU score), lower errors (represented by a lower WER score), and higher diversity (represented by a higher Distinct-2 score).

ANALYSIS
MTL-BC-CAM scored higher than MTL-BC in two (2) metrics for the NarrativeQA dataset (BLEU: 11.9% and WER: 18.7%) and in all the metrics for the SQuAD dataset (BLEU: 1.8%, WER: 9.4%, and Distinct-2: 0.6%). This outcome shows that by utilizing CAM, MTL-BC-CAM can capture the representation of the answer more precisely, without losing important information, namely the question and the answer words generated at each decoding step. Thus, CAM is a more effective attention mechanism than the global attention mechanism for addressing the language model influence issue. This demonstrates that answer generation is more effective when all of the decoder's previously generated hidden states, which represent the decoder's generated words, are utilized.

SIGNIFICANCE
To evaluate whether the difference between the MTL-BC-CAM and MTL-BC scores is statistically significant, a paired Student's t-test was performed on the BLEU and WER scores for both datasets. The results (Table 21) indicate that MTL-BC-CAM performed significantly better than MTL-BC on three (3) measurements: BLEU-NarrativeQA, WER-NarrativeQA, and WER-SQuAD. The BLEU-SQuAD measurement shows an insignificant difference. The t-test could not be performed for Distinct-2, as Distinct-2 is an overall score of model diversity and is not based on individual generated answers. Table 22 shows two (2) sample questions and the corresponding answers generated by each of the experimented models. The column ''BLEU score'' shows the BLEU score of the respective answers generated by each model, and the column ''Frequency of term/phrase'' shows the frequency of selected words in the respective datasets.
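The paired test above pairs the two models' per-answer scores on the same test questions. A minimal sketch of the paired t-statistic (standard library only; the illustrative scores below are made up, not taken from the experiments):

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    # Per-answer metric scores (e.g. BLEU) for the SAME test questions,
    # so the test is paired; degrees of freedom = n - 1.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```

In practice `scipy.stats.ttest_rel` returns both the statistic and the p-value used to decide significance; the sketch above only shows where the pairing enters the computation.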

CASE STUDY
The Sample 1 output shows that MTL-BC failed to generate the correct answer because, halfway through the answer generation, it generated the word ''plans,'' which has a frequency of 36, instead of ''patient,'' which has a frequency of 18. For Sample 2, MTL-BC generated the end token ('' end '') too soon; the end token has a very high frequency of 24,819 compared to ''of,'' whose frequency is 4,078.
The result shows that MTL-BC-CAM can generate correct answers for both questions even though MTL-BC failed to generate the correct answers.

b: ANSWER GENERATION OVERFIT
To understand the effectiveness of the dynamic tasks loss weights scheme versus the fixed tasks loss weights scheme in reducing answer generation overfit, we compare MTL-BC-DL with MTL-BC. Tables 23 and 24 show the experiment results for MTL-BC-DL versus MTL-BC using the NarrativeQA and SQuAD datasets.

NarrativeQA DATASET
For the BLEU metric, MTL-BC-DL scored 0.5979, which is 4.7% higher than MTL-BC's score of 0.5709. For WER, MTL-BC-DL scored 0.3183, which is 8.9% lower than MTL-BC's score of 0.3492. For the Distinct-2 metric, MTL-BC scored slightly higher (0.6%) than MTL-BC-DL: MTL-BC's score was 0.8078 against MTL-BC-DL's 0.8029. This result on the NarrativeQA dataset confirms that MTL-BC-DL performed better than MTL-BC in the BLEU and WER metrics, while MTL-BC fared better in the Distinct-2 metric. This means MTL-BC-DL can produce answers with higher correctness (represented by a higher BLEU score) and lower errors (represented by a lower WER score), whereas MTL-BC generated higher diversity (represented by a higher Distinct-2 score).

SQuAD DATASET
For the BLEU metric, MTL-BC-DL scored 0.6878, which is 1.6% higher than MTL-BC's score of 0.6769. For the WER metric, MTL-BC-DL scored 0.2807, which is 3.0% lower than MTL-BC's score of 0.2893. For the Distinct-2 metric, MTL-BC scored slightly higher (0.7%) than MTL-BC-DL: MTL-BC's score was 0.8907 against MTL-BC-DL's 0.8842. These results on the SQuAD dataset show that MTL-BC-DL performed better than MTL-BC in the BLEU and WER metrics, while MTL-BC fared better in the Distinct-2 metric, a result similar to that on the NarrativeQA dataset. The result shows that MTL-BC-DL can produce answers with higher correctness (represented by a higher BLEU score) and lower errors (represented by a lower WER score), whereas MTL-BC produced higher diversity (represented by a higher Distinct-2 score).

ANALYSIS
MTL-BC-DL scored higher than MTL-BC in two (2) metrics for both the NarrativeQA (BLEU: 4.7% and WER: 8.9%) and SQuAD (BLEU: 1.6% and WER: 3.0%) datasets. This is because MTL-BC-DL utilizes the dynamic tasks loss weights scheme (DL), which recalculates each task's loss weight during each epoch and assigns the new values to be used for the next epoch. A task's loss represents the difference between the predicted value and the actual value: a higher loss means the task is performing relatively poorly and has much more learning to do to improve its predictions. A relatively higher loss weight for a task means the task contributes more to the MTL loss. A higher loss ensures that model learning continues and does not stop early, as there will be correspondingly larger updates to the neural network weights so that the model can predict better in future epochs. Thus, the answer generation overfit issue, which occurs under the fixed tasks loss weight scheme, can be avoided or reduced effectively by utilizing DL.
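The per-epoch reweighting described above can be sketched as follows. This is a minimal sketch under an assumed update rule: each weight is set proportional to the task's share of the total loss in the previous epoch, so a lagging task contributes more to the MTL loss. The paper's exact DL update formula may differ, and the task names are illustrative.

```python
def dynamic_task_weights(prev_epoch_losses):
    # Re-derive each task's loss weight from its share of the total loss in
    # the previous epoch: a lagging task (higher loss) gets a larger weight
    # and therefore contributes more to the MTL loss (assumed update rule).
    total = sum(prev_epoch_losses.values())
    return {task: loss / total for task, loss in prev_epoch_losses.items()}

def mtl_loss(task_losses, weights):
    # Weighted sum of the per-task losses for the current training step.
    return sum(weights[task] * loss for task, loss in task_losses.items())
```

With fixed weights, a task that has converged keeps contributing the same share of the loss; under this scheme its weight shrinks automatically, keeping the gradient signal focused on the tasks that still need learning.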

SIGNIFICANCE
To evaluate whether the difference between the MTL-BC-DL and MTL-BC scores is statistically significant, a paired Student's t-test was performed on the BLEU and WER scores for both datasets. The results (Table 25) indicate that MTL-BC-DL performed significantly better than MTL-BC on the WER-NarrativeQA measurement; the other measurements show insignificant differences. Table 26 shows two (2) sample questions and the corresponding answers generated by each of the experimented models. The column ''BLEU score'' shows the BLEU score of the respective answers generated by each model, and the column ''Frequency of term/phrase'' shows the frequency of selected words in the respective datasets.

CASE STUDY
The Sample 1 output shows that MTL-BC generated the word ''father,'' which has a higher frequency of 359, instead of the word ''cousins,'' with a frequency of 31. For Sample 2, MTL-BC generated the common word ''the,'' which has a very high frequency of 7,479; the correct word that should have been generated is ''battle.'' In both cases, MTL-BC-DL generated the correct answer. This outcome shows that by utilizing the dynamic tasks loss weight scheme, the performance of an MTL model can be further improved.

c: QUESTION ENCODER OVERFIT
To understand the effectiveness of our proposed ternary classification method in reducing the question encoder overfit against the benchmark binary classification, we compare MTL-TC with MTL-BC.

NarrativeQA DATASET
For the BLEU metric, MTL-TC scored 20.5% higher than MTL-BC: MTL-TC and MTL-BC scored 0.6880 and 0.5709, respectively. For the WER metric, MTL-TC scored 0.2482, which is 28.9% lower than MTL-BC's score of 0.3492. For the Distinct-2 metric, MTL-TC scored higher than MTL-BC: MTL-TC's score was 0.8150 against MTL-BC's 0.8078. This result on the NarrativeQA dataset shows that MTL-TC performed better than MTL-BC in all the evaluation metrics. This means MTL-TC can produce answers with higher correctness (represented by a higher BLEU score), lower errors (represented by a lower WER score), and higher diversity (represented by a higher Distinct-2 score) than MTL-BC.

SQuAD DATASET
For the BLEU metric, MTL-TC scored 0.6954, which is 2.7% higher than MTL-BC's score of 0.6769. For the WER metric, MTL-TC scored 0.2720, which is 6% lower than MTL-BC's score of 0.2893. For the Distinct-2 metric, MTL-BC's and MTL-TC's scores were almost the same, with only a very marginal difference of 0.01%. This result on the SQuAD dataset shows that MTL-TC performed better than MTL-BC in the BLEU and WER metrics. This means MTL-TC can produce answers with higher correctness (represented by a higher BLEU score) and lower errors (represented by a lower WER score) than MTL-BC.

ANALYSIS
MTL-TC scored higher than MTL-BC in all the metrics for the NarrativeQA dataset (BLEU: 20.5%, WER: 28.9%, and Distinct-2: 0.9%) and in two (2) metrics for the SQuAD dataset (BLEU: 2.7% and WER: 6.0%). This outcome shows that by utilizing a slightly more complex task (ternary classification) as compared to binary classification, the question encoder is forced to fine-tune its encoding to ensure the receiving networks can perform their tasks well.

SIGNIFICANCE
To evaluate whether the difference between the MTL-TC and MTL-BC scores is statistically significant, a paired Student's t-test was performed on the BLEU and WER scores for both datasets. The results (Table 29) indicate that MTL-TC performed significantly better than MTL-BC in two (2) measurements: BLEU-NarrativeQA and WER-NarrativeQA. The other measurements showed insignificant differences.
In this case, the question encoding is passed to both the answer decoder and the ternary classifier. This demonstrates that answer generation is more effective when question encoder overfit is reduced by utilizing a slightly more complex question-answer classification method. Table 30 shows two (2) sample questions and the corresponding answers generated by each of the experimented models. The column ''BLEU score'' shows the BLEU score of the respective answers generated by each model, and the column ''Frequency of term/phrase'' shows the frequency of selected words in the respective dataset.

CASE STUDY
In Sample 1, an incorrect answer was generated by MTL-BC because it generated the word ''women,'' which occurs 61 times in the dataset, instead of the correct word ''atmosphere,'' which occurs only 5 times. The Sample 2 output shows that MTL-BC generated the word ''and,'' which has a very high frequency of 3,867 in the dataset, instead of the correct word ''technology,'' which has a frequency of only 34. In both samples, MTL-TC generated the correct answers. MTL-TC, which uses a ternary classification task during training, can generate correct answers where a binary classification task cannot. This shows that training an MTL model with ternary classification can reduce question encoder overfit and thus reduce the occurrence of frequently occurring words in answers.

d: MTL-MFE VERSUS MTL-LTS

ANALYSIS
MTL-MFE scored higher than MTL-LTS in the metrics for the NarrativeQA (WER: 48.7% and Distinct-2: 0.4%) and SQuAD (BLEU: 45.8%, WER: 52.5%, and Distinct-2: 1.2%) datasets. This outcome shows that the MTL-MFE model, which is trained in parallel mode, is more effective at reducing question encoder overfit than MTL-LTS, whose training is based on sequential mode. By training in parallel mode, the question encoder needs to fine-tune its encoding to ensure all the receiving networks can perform their tasks well. In this case, the question encoding is passed to the answer decoder, the first-word predictor, and the last-word predictor. This demonstrates that answer generation is more effective when question encoder overfit is reduced by utilizing a multifunctional encoder and performing training in parallel mode.

SIGNIFICANCE
To evaluate whether the difference between the MTL-MFE and MTL-LTS scores is statistically significant, a paired Student's t-test was performed on the BLEU and WER scores for both datasets. The results (Table 33) indicate that MTL-MFE performed significantly better than MTL-LTS in all the measurements. Table 34 shows two (2) sample questions and the corresponding answers generated by each of the experimented models. The column ''BLEU score'' shows the BLEU score of the respective answers generated by each model, and the column ''Frequency of term/phrase'' shows the frequency of selected words in the respective dataset.

CASE STUDY
The Sample 1 output shows that MTL-LTS generated the word ''her,'' which has a very high frequency count of 58,556 in the dataset, instead of the word ''Christianity,'' which has a frequency of only 11. Similarly, for Sample 2, a shorter and incorrect answer was generated by MTL-LTS because it generated the end token ('' end '') too early; the end token has a very high frequency of 24,819 compared to the word ''public,'' whose frequency is only 153. MTL-MFE generated correct answers where MTL-LTS could not. This shows that by training the auxiliary tasks in parallel, question encoder overfit can be reduced.

2) SEQ2SEQ++ VERSUS BENCHMARK MODELS
This section presents the analysis of improvements by SEQ2SEQ++ against the benchmark models (MTL-BC [18], STL [41], and MTL-LTS [42]) for each dataset utilized in this study. The results are presented in Tables 35 and 36.

a: EXPERIMENT ON NarrativeQA DATASET
For the BLEU metric (Figure 12), SEQ2SEQ++ scored the highest at 0.8245. This score is much higher than the benchmark models STL, MTL-BC, and MTL-LTS, which only scored 0.5399, 0.5709, and 0.5665, respectively. SEQ2SEQ++ scored 44.4% higher than MTL-BC, which is the next best model in terms of BLEU score for NarrativeQA.
As for the WER metric ( Figure 13), SEQ2SEQ++ had the lowest score of 0.1368. This score is much lower than the benchmark models STL, MTL-BC, and MTL-LTS, which scored 0.3701, 0.3492, and 0.3323, respectively. SEQ2SEQ++ scored 58.9% lower than MTL-LTS, which is the next best model in terms of the WER score for NarrativeQA.
For the Distinct-2 metric (Figure 14), SEQ2SEQ++ scored the highest with 0.8264, followed by MTL-LTS with 0.8247, MTL-BC with 0.8078, and STL with 0.7838. For this metric, the score of the SEQ2SEQ++ model is only slightly higher (0.2%) than the MTL-LTS score.

b: EXPERIMENT ON SQuAD DATASET
For the BLEU metric (Figure 12), SEQ2SEQ++ scored the highest at 0.7941. This score is much higher than those of the benchmark models STL, MTL-BC, and MTL-LTS, which only scored 0.5087, 0.6769, and 0.5288, respectively. SEQ2SEQ++ scored 17.3% higher than MTL-BC, which is the next best model in terms of BLEU score for the SQuAD dataset.
As for the WER metric (Figure 13), SEQ2SEQ++ had the lowest score of 0.1815. This score is much lower than those of the benchmark models STL, MTL-BC, and MTL-LTS, which scored 0.4145, 0.2893, and 0.4353, respectively. SEQ2SEQ++ scored 37.3% lower than MTL-BC, which is the next best model in terms of the WER score for SQuAD.
For the Distinct-2 metric (Figure 14), SEQ2SEQ++ scored the highest with 0.8972, followed by MTL-BC with 0.8907, MTL-LTS with 0.8837, and STL with 0.8090. For this metric, the score of the SEQ2SEQ++ model is only slightly higher (0.7%) than MTL-BC's score.

SEQ2SEQ++ achieved the best performance (Figures 12, 13, and 14) for all the evaluation metrics and for both datasets as compared to all the benchmark models (STL, MTL-BC, and MTL-LTS). This outcome indicates that SEQ2SEQ++ addresses all three issues (language model influence, answer generation overfit, and question encoder overfit) more effectively than the benchmark models. It can generate answers with higher quality (highest BLEU score), a lower error rate (lowest WER score), and higher diversity (highest Distinct-2 score) in comparison with all the other benchmark models.

3) SIGNIFICANCE
To evaluate whether the SEQ2SEQ++ model's scores against the second-best benchmark model are statistically significant, Student's t-tests were performed on the BLEU and WER scores. The results (Table 37) indicate that the differences in the BLEU and WER scores of the SEQ2SEQ++ model against the next best models are statistically significant. The t-test could not be performed for Distinct-2, as Distinct-2 is an overall score of model diversity and is not based on individual generated answers.

4) CASE STUDY
This section studies samples of answers generated by each model as a case study to highlight the advantages of each of the novel methods and of the SEQ2SEQ++ model in addressing the issues in Seq2Seq learning. One (1) sample is shown from each dataset. The column ''Frequency of term/phrase'' provides the frequency of selected words in the dataset. Table 38 shows four (4) samples of answers generated by SEQ2SEQ++ and the benchmark models (STL, MTL-LTS, and MTL-BC). The SEQ2SEQ++ model reduces overall model overfitting by not generating frequently occurring words as part of the answers, as the other models did, and thus improves the answer-generation quality.
For example, in Sample 1, incorrect answers were generated by the benchmark models. STL generated the word ''tree,'' which has a higher frequency of 99, instead of the word ''guard.'' MTL-BC performed slightly better than STL because it generated the word ''guard'' correctly but then incorrectly generated the end token ('' end '') too early.
The end token has a frequency of 24,000 for NarrativeQA. MTL-LTS generated the wrong word ''death'' instead of the word ''guard.'' Similar outcomes were observed in the other samples.
This experiment shows that, while the other models generated frequently occurring words or tokens such as '' end ,'' the comma '','' ''a,'' ''in,'' and ''the,'' SEQ2SEQ++ avoided them and generated the correct words/tokens for the respective questions.

V. CONCLUSION AND FUTURE WORK
Question-answering chatbots that provide concise answers to specific user questions and queries are rapidly gaining popularity in many domains, such as customer support and education. Neural network-based chatbot models equipped with domain knowledge can scale much faster than humans and can be utilized around the clock. They can also be continuously trained with additional data to stay updated with the latest knowledge served to users.
The Seq2Seq natural answer generation method is one of the most popular methods for implementing question-answering chatbots. In this method, a question or answer is treated as a sequence of words or tokens. During training, the model learns to generate a sequence of words as the answer, given the question, which is also a sequence of words. Although Seq2Seq-based chatbots can provide answers to most questions, they tend to generate frequently occurring words in the answer; hence, the generated answer may not be relevant to the question. Some generated answers also end abruptly, meaning they are not complete answers. Consequently, the generated answers may be meaningless or unsatisfactory to the user. This weakness of the Seq2Seq method can be attributed to three key issues: language model influence, answer generation overfit, and question encoder overfit.
The existing methods exhibit some gaps. First, existing attention methods only use the final hidden state of the decoder for decoding, which may not be adequate to address the language model influence issue. Second, existing MTL models rely on fixed task loss weights for auxiliary tasks, which are not sufficient to reduce answer generation overfitting. Third, the binary classification utilized by MTL-BC, which classifies an answer as positive or negative, is not natural; a more natural classification of an answer given a question is correct, partially correct, or incorrect. Finally, MTL-LTS, which is a sequential MTL model, could not fully harness the power of MTL as compared to the parallel MTL approach in reducing the question encoder overfit issue.
We propose four new methods to address these gaps in Seq2Seq learning. CAM is a new attention mechanism utilized for decoding: it considers all previous hidden states of the decoder during decoding, which allows it to balance language model influence against question influence. By creating this balance, the occurrence of high-frequency words can be reduced, and thus the model can generate a correct and meaningful answer. MFE performs first- and last-word prediction tasks in parallel with the answer-generation task. MFE, which is based on a parallel learning approach, has shown significant improvement over the sequential approach. Similarly, by utilizing TC, which performs ternary classification of the answer given a question, the question encoder overfit can also be reduced. Models that utilize MFE and TC can improve the answer-generation quality by reducing the incorrect generation of high-frequency words. CAM, MFE, and TC are integrated into a new MTL model called SEQ2SEQ++, which utilizes the DL weight mechanism: it calculates the task loss weights for each of the tasks in the MTL framework during an epoch and automatically uses them for the next epoch. This ensures that each task contributes accordingly to the overall model learning and that model learning does not end prematurely.
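The core idea of CAM summarized above can be sketched as follows: at each decoding step, the current decoder query attends over both the encoder (question) hidden states and all previously generated decoder hidden states, rather than the encoder states alone. Simple dot-product scoring over plain Python lists is assumed here; the paper's exact scoring function and network details differ, so this is an illustration of the memory layout only.

```python
import math

def cam_context(encoder_states, decoder_history, query):
    # Memory = question (encoder) hidden states plus ALL previously
    # generated decoder hidden states; each state is a list of floats.
    memory = list(encoder_states) + list(decoder_history)
    # Dot-product score of the current decoder query against each slot
    # (an assumed scoring function, for illustration).
    scores = [sum(m_j * q_j for m_j, q_j in zip(m, query)) for m in memory]
    # Softmax over the scores (max-subtracted for numerical stability).
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum over the whole memory.
    return [sum(weights[i] * memory[i][j] for i in range(len(memory)))
            for j in range(len(query))]
```

Because the decoder's own history sits in the same attention memory as the question, the generated-so-far words can pull the context back toward the question when the language model starts to dominate, which is the balancing effect described above.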
We developed eight models and trained and tested each of them on two published research datasets to gauge the effectiveness of our proposed methods (CAM, DL, MFE, TC) and also our final model SEQ2SEQ++, which combines all our proposed methods to generate meaningful and relevant answers.
The results showed that our proposed methods (CAM, DL, MFE, TC) achieved better results than the benchmark models. The final experiment result showed that SEQ2SEQ++ achieved the best performance compared to the benchmark models in natural answer generation. SEQ2SEQ++ can produce answers with higher correctness (highest BLEU scores), lower errors (lowest WER scores), and higher diversity (highest Distinct-2 scores) on both datasets.
The significance of this research is three-fold: i) First, a comprehensive attention mechanism is proposed. CAM is generic and does not require additional input, so it can be used by researchers in other Seq2Seq-based tasks such as caption generation or question generation. ii) Second, a DL weights scheme and a new DL-based MTL training algorithm are proposed. In an MTL model, determining each task's loss weight is not easy.
Multiple arbitrary values must be assigned and tested before the final weight for a task can be identified. This trial-and-error approach is time-consuming and inefficient, and it becomes even worse, or nearly impossible, when there are more than two tasks. Utilizing a dynamic task loss weight scheme is not only efficient but also highly effective in producing a better learning approach, as shown in this study. Moreover, this finding can encourage other researchers to further improve existing MTL models with more than one (1) auxiliary task. The DL-based MTL training algorithm can be readily adapted to any other parallel MTL framework. iii) Third, this study confirms how additional tasks, such as answer classification and first-word and last-word prediction, can be combined in a parallel MTL setting to improve the performance of Seq2Seq learning-based answer generation. MFE shows how the question encoder overfit issue can be directly addressed by performing additional tasks on the question encoding alone, without any need for additional data. TC shows how the answer can be classified in a more natural manner, which is effective in reducing question encoder overfit and is also valuable for natural language generation tasks. SEQ2SEQ++ is both a model and a framework. As a model, researchers can utilize SEQ2SEQ++ to train their question-answer systems in other domains or datasets; they can also perform Seq2Seq-based NLP research in other areas such as question generation and translation. As a framework, SEQ2SEQ++ is flexible: the existing auxiliary tasks (question-answer classification, first-word prediction, and last-word prediction) can be replaced with other tasks if needed, and CAM can likewise be replaced with another, possibly newly developed, attention mechanism. This work also provides all the algorithms and formulas utilized for all the models implemented in this study.
Researchers can replicate and implement them for benchmarking and further investigation. In the future, we plan to investigate SEQ2SEQ++ for multi-turn conversations and other natural language tasks, such as question generation and summarization. We may also investigate how to further improve our model to achieve a more significant improvement in diversity compared with other models. We are also interested in exploring pretrained language models and embeddings, such as BERT [23], GPT-3 [10], Word2Vec [57], and GloVe [58], for integration into SEQ2SEQ++.
KULOTHUNKAN PALASUNDRAM graduated from Universiti Kebangsaan Malaysia. He received the B.S. degree in computer science (Hons.) and the master's degree in IT, in 1995 and 1998, respectively. He is currently pursuing the Ph.D. degree in intelligent computing with Universiti Putra Malaysia. His research interests include artificial intelligence, deep learning, big data, natural language processing, and dialog generation.
NURFADHLINA MOHD SHAREF is currently an Associate Professor at the Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Malaysia. Her research interests include text mining, recommendation systems, and data science. Besides chatbot, her current projects are multi-objective deep reinforcement learning, multi-task deep learning for multi-class tweets classification, and deep-tensor factorization model for recommendation systems.
KHAIRUL AZHAR KASMIRAN received the Ph.D. degree from The University of Sydney, Australia, in 2012. He is currently a Senior Lecturer at the Department of Computer Science, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Malaysia. His interests include deep learning, reinforcement learning, performance engineering, formal verification, and software development.
AZREEN AZMAN (Member, IEEE) received the Diploma degree in software engineering from the Institute of Telecommunication and Information Technology, in 1997, the Bachelor of Information Technology degree in information systems engineering from Multimedia University, Malaysia, in 1999, and the Ph.D. degree in computing science specializing in information retrieval from the University of Glasgow, Scotland, in September 2007. Before pursuing his Ph.D. degree, he served in the industry for a few years. He is currently an Associate Professor at Universiti Putra Malaysia. His current research interests include information retrieval, text mining, natural language processing, and intelligent systems. He serves as a Committee Member for the Malaysian Society of Information Retrieval and Knowledge Management (PECAMP) and the Malaysian Information Technology Society (MITS). VOLUME 9, 2021