Answer Category-Aware Answer Selection for Question Answering

As a key problem in artificial intelligence, question answering (QA) has long been a topic of intensive research. Most existing methods cast question answering as an answer selection task. Because the candidate answer pool is usually very large, it is difficult to accurately select the correct answer from it. One solution is to narrow the candidate answer pool based on the category labels of the answers. However, real-world QA tasks usually provide only the category label of the question, not that of the answer. Based on this observation, we propose an Answer Category-Aware Answer Selection system (ACAAS), which jointly leverages unlabelled answer data and labelled question category data to generate answer category pseudo-labels in a joint embedding space. Experimental results on three public QA datasets demonstrate the effectiveness of the proposed method.


I. INTRODUCTION
Question answering refers to making the machine understand natural language questions and deliver answers automatically [1]. As one of the core tasks of artificial intelligence, question answering technology has received more and more attention due to its wide application in natural language processing and information retrieval. There are many existing machine learning-based question answering methods. Most question answering systems convert the question answering task into an answer selection task, such as [2], [3]. That is, given a question and a set of candidate answers, the system automatically selects the correct answer from the pool of candidate answers [4]. In answer selection-based question answering systems, a popular method is to encode the question and each candidate answer as a continuous vector, calculate the matching score of the two, and finally select the answer with the highest matching score as the answer to the question [5]-[7]. One of the main challenges in the answer selection task is that the candidate answer pool is usually very large, so it is difficult to accurately select the correct answer from it. Intuitively, if we know the category of the answer, then in general we can quickly locate the target category, thereby reducing the size of the answer pool. However, in most cases, the answer category information is not available, and there is no existing method for answer categorization. On the other hand, we note that the category information of the question is usually available [8]. Moreover, even if the category information of the question is not available, existing methods can accurately categorize questions [9].
Based on the above observations, we propose the Answer Category-Aware Answer Selection System (ACAAS). The system improves the accuracy of question answering through the fusion of question category information and answer data. Specifically, the system first uses a two-way attention mechanism to encode the target question and answer so that the information of the question and answer can interact during representation computation. Then, we design the Attention-Based Shared Label Embedding Network (AB-SLEN), which generates pseudo-labels in a latent space based on question categories and unlabeled answers. The pseudo-label of the answer helps the system locate the correct answer more accurately. Finally, ACAAS selects answers based on the encoded questions, answers, and pseudo-labels. We evaluate the performance of ACAAS on three public QA datasets, and the experimental results demonstrate the effectiveness of our model.

II. RELATED WORK
A. ANSWER SELECTION
After being proposed by [1], answer selection became popular as a task. An important component of this task is to encode question texts and answer texts as vector representations and to calculate matching scores based on these representations. Convolutional neural networks (CNNs) [10]-[13] and recurrent models such as the long short-term memory (LSTM) [11], [12] can be used to encode the input sentences. Many different neural networks have been proposed to model question and answer sentence pairs, e.g., attentive pooling with a BiLSTM or CNN network [12]; dual Bi-LSTM models with a holographic composition approach [14]; a model using hyperbolic representations instead of Euclidean space [15]; an LSTM-based model using temporal gates so that question and answer pairs jointly influence the learned representations in a pairwise manner [16]; and a knowledge-based model using a knowledge module to learn a more general representation [17]. In addition, [13] employed a bigram CNN model with average pooling, and [10] developed an LSTM model with a noise-contrastive estimation approach.
In NLP tasks, many recent works leverage external information to enrich the sentence representation learning of deep learning models [8], [18]. Reference [6] used a positional attention-based RNN model to incorporate the positional context of questions into the answers' representations. Reference [19] used external knowledge bases (KBs) to learn continuous representations, which can enhance the learning of RNN models for machine reading. Reference [20] used a novel attention mechanism over external information sources to enrich the representations learned by transformer networks in conversational search. Similar to these works, our work jointly leverages the unlabelled candidate answers and question category label information for the answer selection task.

B. ATTENTION-BASED MODELS
Attention-based models have shown great performance on a range of tasks, such as computer vision and speech recognition [21], [22]. For computer vision tasks, [23] applied attention in RNN models to integrate information over time from an image or video via a sequential decision process. [24] combined a novel spatial attention mechanism with a sequential variational auto-encoding framework for image generation. For speech recognition tasks, [25] used an attention mechanism in a bi-directional RNN to align the input and output sequences. [26] extended the attention mechanism with features needed for speech recognition.
Attention-based models have been applied to NLP tasks after their success in computer vision and speech recognition.
To introduce extra sources of information for NLP tasks, most prior works apply an attention mechanism on top of a CNN or RNN model to guide the extraction of sentence embeddings. Reference [27] first introduced the attention mechanism into question answering with RNN models. Reference [28] took the lead in exploring attention mechanisms in CNNs. Reference [29] performed soft-attention alignment by first measuring the matching scores between each word in the question and the answer.
Most of the aforementioned methods create an attention vector for each hidden state during the recurrent iteration, which leads the models to focus on lexical correlations between a certain word and its previous words. Different from them, we propose an attention mechanism that is performed only once, on top of an LSTM layer in our model. It focuses more on the semantics that each word contributes to the whole sentence and less on relations between words.

C. QUESTION CLASSIFICATION
Question classification is the task of analyzing a question and labeling it based on the expected category of its answer [30]. In the answer selection task, we expect it not only to provide a semantic constraint on the answer, which allows further processing to precisely locate and verify the answer, but also to provide information for downstream processes in determining answer selection strategies, which may be answer-category specific rather than uniform [31]. Recent works on question classification focus mainly on machine learning approaches. Reference [32] presented a feature-based machine learning model for question classification. Reference [9] employed a deep learning model and used question category information to enhance the datasets by highlighting entities in them via a question classification API. To date, studies on question classification are mainly based on text classification techniques. Many efforts have approached text classification using statistical models based on bag-of-words (BOW) features, such as [33], [34]. But because BOW features lack semantic meaning, neural networks began to be applied to learn word-level representations, such as word2vec. There are also related works that use character-level features for language processing: [35] proposed an alternative low-level representation based on character n-grams, and [36], [37] incorporated character-level features into CNNs. In our article, we explore using word-level, character-level, and other kinds of text representations to introduce more helpful information.

III. PROBLEM STATEMENT
Given a question q, its category c, and a set of candidate answers A = {a_1, a_2, ..., a_n}, the goal of our system is to select the correct answer to q from A. Concretely, for each question-answer pair (q, a_i), the system takes q, c, and a_i as input and outputs a scalar indicating the matching score of the question-answer pair. We learn a function f(q, a_i, c) to compute the matching scores, which are then used to rank the list of candidates. The higher the matching score, the more likely the corresponding answer is the correct answer to the question.
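As a minimal illustration of this setup, the following sketch ranks candidates by matching score. Here `toy_score` is a hypothetical stand-in for the learned function f(q, a_i, c): it simply computes cosine similarity and ignores the category, whereas the actual model learns f jointly from data.

```python
import numpy as np

# Hypothetical stand-in for the learned scoring function f(q, a_i, c).
# Uses cosine similarity and ignores the category c for simplicity.
def toy_score(q_vec, a_vec, c_id):
    return float(q_vec @ a_vec /
                 (np.linalg.norm(q_vec) * np.linalg.norm(a_vec)))

def select_answer(q_vec, answer_vecs, c_id):
    # Score every candidate; the highest-scoring one is predicted correct.
    scores = [toy_score(q_vec, a, c_id) for a in answer_vecs]
    return int(np.argmax(scores)), scores
```

The same ranking logic applies regardless of how the scoring function is realized; the rest of the paper is about learning a better f.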

IV. METHODOLOGY
In the proposed method, we compute a matching score for each question-answer pair for answer selection. The question category information is used to increase the scores of positive question-answer pairs and reduce the scores of negative ones. Since answers are unlabelled in answer selection datasets, we leverage the predictions of a question classification module to estimate a category label for each answer. Concretely, we first use an embedding layer to convert questions and answers into vector representations (Section IV-A). Then we adopt a bidirectional long short-term memory (BiLSTM) network and a two-way attention mechanism to learn context-based sentence representations of questions and answers in the QA encoding layer (Section IV-B). Afterwards, an Attention-Based Shared Label Embedding Network (AB-SLEN) learns category-based attentive sentence representations by embedding labels in a latent space and providing pseudo-labels for unlabeled answer sentences (Section IV-C). Finally, we compute the bilinear matching score between the category-based attentive question and answer vectors and pass it into a fully connected hidden layer followed by a 2-class softmax layer in the output layer (Section IV-D). The overview of our model is shown in Fig. 1. In this section, we describe our answer category-aware answer selection system (ACAAS) layer by layer. Note that we use the subscripts q and a to indicate whether a parameter belongs to the question or the answer, respectively.

A. EMBEDDING LAYER
Given a question q and a set of candidate answers A = {a_1, a_2, ..., a_n}, we first use an embedding layer to convert them into vector representations, then feed these embedding vectors into our answer selection system. A given sentence is represented by the embeddings of the words it consists of. The character feature and position feature are also transformed into real-valued representations. The question representation is given by

E = E_word ⊕ E_char ⊕ E_pos,

where ⊕ indicates the concatenation operation, E ∈ R^{m×d}, m denotes the length of the padded sequence, and d is the dimension of the embedding.
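A sketch of this concatenation in NumPy with toy values: the 300-dimensional word vectors and 30-dimensional character features match the settings reported later, while the position-feature dimension d_p is our assumption for illustration.

```python
import numpy as np

m = 30                        # padded sequence length
d_w, d_c, d_p = 300, 30, 16   # word/char dims as in the paper; d_p assumed

rng = np.random.default_rng(0)
word_emb = rng.normal(size=(m, d_w))  # pretrained word embeddings (e.g. GloVe)
char_emb = rng.normal(size=(m, d_c))  # character-feature embeddings
pos_emb  = rng.normal(size=(m, d_p))  # position-feature embeddings

# E = word ⊕ char ⊕ pos, so E ∈ R^{m×d} with d = d_w + d_c + d_p
E = np.concatenate([word_emb, char_emb, pos_emb], axis=-1)
assert E.shape == (m, d_w + d_c + d_p)
```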

B. QA ENCODING LAYER
In the QA encoding layer, we adopt a bidirectional long short-term memory (BiLSTM) network as a shared encoder to encode questions and answers. A BiLSTM not only has access to future contexts but also captures information from past contexts. The input of the BiLSTM layer is a sequence of word embeddings E. The output for the i-th word is represented by h_i = [h→_i ⊕ h←_i], where h→_i and h←_i are the outputs of the forward network and the backward network, respectively. Accordingly, the initial contextual sentence representation H = [h_1, h_2, ..., h_m] ∈ R^{2u×m} is obtained for both the question and the answer. Based on these initial contextual sentence representations, we employ a two-way attention mechanism to interactively learn attentive representations of the question and answer. With the question sentence representation H_q and the answer sentence representation H_a, we compute the correlation matrix F ∈ R^{m×m} by introducing an attentive matrix U ∈ R^{2u×2u} as follows:

F = tanh(H_q^T U H_a).

Afterwards, we conduct max pooling along the rows and columns of F to generate context-based attention vectors f_q and f_a for the question and answer separately, where f_q, f_a ∈ R^m. The co-attentive context-based representations of the question and answer are then computed as

r_q = H_q softmax(f_q),  r_a = H_a softmax(f_a).

C. ATTENTION-BASED SHARED LABEL EMBEDDING NETWORK (AB-SLEN)
The attention module uses a self-attention mechanism to enhance the representations of questions and answers while avoiding the computation of an annotation vector at every time step, so that it focuses more on the semantics of the whole sentence. The attention mechanism takes the initial contextual sentence representations H_q, H_a as input and outputs matrices of weights:

A_q = softmax(W_{c2} tanh(W_q H_q)),  A_a = softmax(W_{c2} tanh(W_a H_a)),

where W_q, W_a ∈ R^{d_a×2u} are weight matrices, u is the hidden unit size, and W_{c2} ∈ R^{r×d_a} is a weight matrix of parameters, where d_a and r are hyperparameters we can set arbitrarily.
In the conventional attention mechanism, W_{c2} is usually a vector in R^{d_a}. Therefore, the final representation tends to focus on specific components of the sentence, such as a special set of related words or phrases. Considering that a sentence may contain multiple semantic components that make up the whole sentence, we introduce W_{c2} with extended r. Accordingly, the attentive question representation Q_att = A_q H_q^T and answer representation A_att = A_a H_a^T are used, instead of H, as the input of the similarity calculation in (14).
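The two attention steps above (two-way attention and the multi-row self-attention) can be sketched as follows, in NumPy with toy dimensions. The tanh placements follow standard attentive-pooling and structured self-attention formulations and are our reading of the garbled equations, not a verbatim reproduction.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

u, m, d_a, r = 4, 6, 10, 3   # toy sizes: hidden units, seq length, d_a, r
rng = np.random.default_rng(1)
H_q = rng.normal(size=(2 * u, m))   # contextual question representation
H_a = rng.normal(size=(2 * u, m))   # contextual answer representation

# Two-way attention: correlation matrix F = tanh(H_q^T U H_a) ∈ R^{m×m}
U = rng.normal(size=(2 * u, 2 * u))
F = np.tanh(H_q.T @ U @ H_a)
f_q = F.max(axis=1)                 # max-pool over answer positions
f_a = F.max(axis=0)                 # max-pool over question positions
r_q = H_q @ softmax(f_q)            # co-attentive question vector
r_a = H_a @ softmax(f_a)            # co-attentive answer vector

# Self-attention with r rows, so multiple semantic components are captured
W_q  = rng.normal(size=(d_a, 2 * u))
W_c2 = rng.normal(size=(r, d_a))
A_q = softmax(W_c2 @ np.tanh(W_q @ H_q), axis=-1)   # ∈ R^{r×m}
Q_att = A_q @ H_q.T                                  # ∈ R^{r×2u}
```

Using r > 1 attention rows lets each row attend to a different part of the sentence, which is exactly the motivation given above for extending W_{c2} from a vector to a matrix.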
Then we use a fully connected layer to obtain the attentive representations h_q ∈ R^{d_l} and h_a ∈ R^{d_l} from flat(Q_att) and flat(A_att), where flat indicates the flatten operation.

The Shared Label Embedding Network uses the question category label as the training signal to find a joint space of category labels and answers, and then projects the representation of the unlabelled answer into the joint space to generate the answer category pseudo-label. Concretely, we first use a trainable matrix L ∈ R^{t×d_l} to represent the question category labels, where t is the number of labels and d_l is the dimension of the label embedding. To let the question representation h_q and the answer representation h_a interact with L, we introduce a similarity calculation that measures the similarity between each embedded label and the input representation, similar to the Universal Schema Latent Feature Model introduced by [38]:

g_q = L · h_q,

where (·) is the dot product.
Then we use a softmax function to obtain a probability distribution over labels:

p_q = softmax(g_q).

The pseudo-labelled embeddings of the question and answer are computed as probability-weighted sums of the label embeddings:

o_q = p_q^T L,  o_a = p_a^T L.

The loss function of AB-SLEN is the squared error between the pseudo-labelled embedding o_q and the embedding l_c of the question category in the label embedding matrix L:

L_pseudo = ||o_q − l_c||^2.

By replacing the question text with the answer text as input, we also obtain the pseudo-labelled embedding of the answer as in (18). To make full use of the label information, we feed these label embeddings into the matching score computation in (22) and concatenate them with the existing vectors as detailed in (20), (21). The category label information of the question and answer is thus included in the representations, which helps reduce the search space of potential answers, making the discovery of answers more effective and accurate.
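A sketch of the pseudo-label generation and the AB-SLEN loss, under our reading of the equations above (toy random inputs; t = 7 matches the six categories plus UNK used later, and d_l = 100 matches the reported label embedding size):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

t, d_l = 7, 100                 # number of labels (incl. UNK), label dim
rng = np.random.default_rng(2)
L = rng.normal(size=(t, d_l))   # trainable label embedding matrix
h_q = rng.normal(size=(d_l,))   # attentive question representation

g_q = L @ h_q                   # dot-product similarity with each label
p_q = softmax(g_q)              # probability distribution over labels
o_q = p_q @ L                   # pseudo-labelled embedding ∈ R^{d_l}

c = 3                           # index of the gold question category
loss_pseudo = float(np.sum((o_q - L[c]) ** 2))   # squared-error loss
```

Replacing h_q with the attentive answer representation h_a yields the answer's pseudo-labelled embedding o_a in exactly the same way, which is how unlabeled answers receive category information.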

D. OUTPUT LAYER AND OPTIMIZATION
Finally, we compute the bilinear matching score between the category-based attentive question and answer vectors:

Score(q_o, a_o) = q_o^T W_s a_o,

where W_s ∈ R^{2u×2u} is a matching matrix to be learned. Note that ACAAS without AB-SLEN takes q_out and a_out instead of q_o and a_o as the input of the matching score computation. Given the matching score Score(q_o; a_o) and the category-based representations q_o, a_o, we concatenate them together as X = [q_o : Score(q_o; a_o) : a_o] and use X as the input of the hidden layer. The output of the hidden layer then goes through a softmax layer for binary classification:

p = softmax(W X + b),

where W and b are the parameters to be learned. The model is trained to minimize the overall loss function:

L = L_pseudo + L_a + λ ||θ||_2^2.

As mentioned above, the first optimization objective L_pseudo denotes the squared error in (19). The second optimization objective L_a denotes the cross-entropy loss function, where p_{a_i}, the output of the softmax layer, is the probability assigned to the gold label of the i-th pair. θ contains all the parameters of the network and λ ||θ||_2^2 is the L2 regularization term. The training loss measures how predictive the model is on the training data, while the regularization term penalizes the complexity of the model, which helps avoid overfitting. The parameters of the model are updated with the Adam optimizer.
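The output layer can be sketched as follows (toy dimensions; for brevity the fully connected hidden layer is collapsed into a single linear map before the softmax, which is a simplification of the architecture described above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8                           # toy dimension of q_o / a_o
rng = np.random.default_rng(3)
q_o = rng.normal(size=(d,))
a_o = rng.normal(size=(d,))
W_s = rng.normal(size=(d, d))   # bilinear matching matrix

score = float(q_o @ W_s @ a_o)  # bilinear matching score

# X = [q_o : score : a_o], then a 2-class softmax (correct / incorrect)
X = np.concatenate([q_o, [score], a_o])
W = rng.normal(size=(2, X.size)) * 0.1
b = np.zeros(2)
p = softmax(W @ X + b)          # probability over the two classes
```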

V. EXPERIMENTS
A. DATASETS AND METRICS
We evaluate the performance of ACAAS on three popular public datasets: TREC QA [39], Yahoo QA [14], and Wiki QA [13]. The statistics of the answer selection datasets used are described in Table 1. Note that our proposed method needs question category information, but only TREC QA has question category labels, as in [9]. For the remaining datasets, we use the same question classification method as [9] to categorize the questions. The questions are classified into ABBR, DESC, ENTY, HUM, LOC, and NUM. We add an ''UNK'' label to signify a category that is not in the above list.
• TREC QA [39] is a widely adopted QA ranking benchmark obtained from TREC QA Tracks 8-13. TREC QA is a factoid question answering dataset. Following previous works [16], [17], we experiment on the raw TREC QA dataset. We adopt the official evaluation metrics of mean average precision (MAP) and mean reciprocal rank (MRR).
Since the evaluation metrics are commonplace in ranking tasks, we omit any further details for the sake of brevity.
• Yahoo QA [14] is an open-domain community-based dataset collected from a CQA platform, which can be used to train and validate answer selection models. Yahoo QA is a moderately large dataset containing 253K QA pairs. For this dataset, we use the same evaluation metrics as [14], [16], including Precision@1 (P@1) and mean reciprocal rank (MRR).
• Wiki QA [13] is an open-domain factoid answer selection benchmark. Each question was sampled from real Bing queries without editorial revision. The candidate sentences were chosen directly from the summary paragraphs of relevant Wikipedia pages. We adopt the official evaluation metrics of MAP and MRR.

B. EXPERIMENTAL SETTINGS
For question and answer sentences, we use 300-dimensional GloVe embeddings to initialize the word embeddings, which are concatenated with character embeddings and position embeddings to generate the input of our model. We use position embeddings generated by sinusoids of varying frequency as in [40], and character embeddings are randomly initialized with size 30. The dimension of the label embedding d_l is set to 100. The maximum length of a sentence is set to 30.
For the baseline models we compared with, we followed exactly the same parameter settings as in their original papers. For our model, the hidden layer size of the BiLSTM and the final hidden layer size are both set to 200. The hyperparameter d_a is set to 400 and r is set to values between 10 and 30. The dropout rate is set to 0.5 and the L2 regularization weight to 0.0001. Our model is trained in batches of size 64 with a learning rate of 0.005. All other parameters are randomly initialized from [-0.1, 0.1].
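For reference, the hyperparameters stated above can be gathered into one configuration sketch (the field names are our own, not identifiers from the authors' code):

```python
# Hyperparameters as stated in this section; field names are illustrative.
config = {
    "word_emb_dim": 300,      # GloVe
    "char_emb_dim": 30,
    "label_emb_dim": 100,     # d_l
    "max_sentence_len": 30,
    "bilstm_hidden": 200,
    "final_hidden": 200,
    "d_a": 400,
    "r_range": (10, 30),
    "dropout": 0.5,
    "l2_reg": 1e-4,
    "batch_size": 64,
    "learning_rate": 0.005,
    "init_range": (-0.1, 0.1),
}
```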

C. EXPERIMENTAL RESULTS
For TREC QA and Yahoo QA, we compare ACAAS against 5 baseline methods: (1) attentive pooling with a BiLSTM or CNN network [12]; (2) dual Bi-LSTM models with a holographic composition approach [14]; (3) a model using hyperbolic representations instead of Euclidean space [15]; (4) an LSTM-based model using temporal gates so that question and answer pairs jointly influence the learned representations in a pairwise manner [16]; (5) a knowledge-based model using a knowledge module to learn a more general representation [17]. The experimental results are shown in Table 2.
For the Wiki QA dataset, we compare our method against three baselines: (1) average pooling with a bigram CNN model [13]; (2) attentive pooling with a BiLSTM or CNN network [12]; (3) an LSTM-based model with a noise-contrastive estimation approach [10]. The experimental results are shown in Table 3. Tables 2 and 3 report the experimental results of different models on TREC QA, Yahoo QA, and Wiki QA. The baseline models achieve state-of-the-art performance. For all reported results, the best result is in boldface and the second best is underlined. There are several interesting observations from Table 2: (1) Our answer category-aware method ACAAS outperforms a myriad of complex neural architectures. Notably, it outperforms the state-of-the-art results on the TREC QA and Yahoo QA datasets by at least about 2% and 11%, respectively. We obtain a clear performance gain of 2%-3% in terms of MAP/MRR on TREC QA and 7%-11% in terms of P@1/MRR on Yahoo QA against the second-best result of KAN (AP-BiLSTM).
(2) ACAAS significantly outperforms the baselines that do not leverage external information. Among them, AP-BiLSTM performs much worse than KAN (AP-BiLSTM) and ACAAS, which demonstrates the effectiveness of incorporating external information into our model. In general, our model shows promise by incorporating attentive and category information.
(3) We observe that ACAAS outperforms KAN (AP-BiLSTM), demonstrating that ACAAS introduces more accurate external information (category information). This is within our expectation, since ACAAS introduces category information to precisely improve the computation of the matching score, while KAN (AP-BiLSTM) learns representations by using a knowledge base to enable knowledge transfer from the source domain to the target domain. If the target dataset is from a completely different domain than the source dataset, this will interfere with the training of the knowledge module and hurt performance.
(4) These results show that ACAAS is more robust to larger input texts than the other models. The relative performance of ACAAS is significantly better on large datasets, e.g., Yahoo QA (253K training pairs), as opposed to smaller ones such as TREC QA (53K training pairs). We believe this is because our representation in Eq. 1 introduces more information as the input texts grow, enriching the overall sentence representations.
From Table 3, we observe similar results. The proposed method shows a competitive result on Wiki QA. For these baselines, incorporating external information into the overall architecture yields a significant performance boost. From Table 4, we observe more detailed experimental results of incorporating AB-SLEN into the overall architecture. The number in parentheses indicates the accuracy increase over ACAAS without AB-SLEN on data of each category. A/Q indicates the average number of answers per question. The performance increases by at least about 2% on each category. We observe that ACAAS performs well on data belonging to LOC and DESC, since the proportion of correct answers is higher and the average number of answers per question is smaller than for the other categories. Moreover, the accuracy increase over ACAAS without AB-SLEN on LOC and DESC data is higher than on the other categories, which demonstrates that the smaller the search space of potential answers, the more effective the incorporation of external information is.

D. ABLATION STUDY
For a thorough comparison, we report an ablation test on TREC QA, Yahoo QA, and Wiki QA to analyze the improvement contributed by each part of our model. The results are shown in Table 5, where (i) ACAAS without AB-SLEN stands for the answer selection model without AB-SLEN; (ii) ACAAS with SLEN is the SLEN-based model of the answer selection task, without the attention component; and (iii) ACAAS with AB-SLEN is the full model conducted with AB-SLEN. Generally, both factors contribute, and integrating category information yields the larger performance boost. Even ACAAS with SLEN achieves competitive results against the strong baselines in Table 2, which demonstrates the effectiveness of leveraging external category information for the answer selection task. This is within our expectation, since the answer category-aware representation helps reduce the search space of potential answers, making the discovery of answers more effective and accurate.

E. CASE STUDY OF OUR ATTENTION
The category label information used in ACAAS can reweight the attention weights from Equations (20) and (21) in the answer selection task. Due to limited space, we randomly choose two question-answer pairs and visualize their attention weights. The color depth indicates the importance of the words: the darker, the more important.
The question in Fig. 2, ''how do i start using asp.net?'', is categorised under ''DESC'' (DESCRIPTION). From Fig. 2, we can observe that AP-BiLSTM pays much attention to words that are similar to the question, such as ''asp.net'', whereas our attention mechanism allows ACAAS to alleviate this limitation with more context and semantic information.
The question in Fig. 3, ''how much wood would a woodchuck chuck, if a woodchuck chuck wood?'', is categorized under ''NUM''. From Fig. 3, we can observe that AP-BiLSTM pays much attention to words that are not related to numbers, such as ''nothing is free''. This limitation can be alleviated by the category-aware attention mechanism, since there is a strong correlation between ''NUM'' and ''none'' in question classification.
The results indicate that leveraging question category label information can aid in attending more valuable information in the answer selection task.

VI. CONCLUSION
In this paper, we propose an answer category-aware answer selection method for question answering. Specifically, we jointly leverage question category labels and unlabeled answers by using an embedding space between the disparate label spaces and learning transfer functions between the label embeddings. Experimental results on three public datasets show the effectiveness of our proposed method. In the future, we will explore hidden relations beyond the context by leveraging external knowledge from a text corpus to enrich the representation learning of questions and answers.