HSM-QA: Question Answering System Based on Hierarchical Semantic Matching

In recent years, Question Answering (QA) systems have gained popularity as a means of acquiring knowledge. However, the prevalent approach of matching question-answer pairs still suffers from low precision and efficiency due to the inherent ambiguity of natural language descriptions. To address these issues, we propose a novel QA approach based on hierarchical semantic matching, termed HSM-QA. Specifically, HSM-QA is decomposed into two main steps, i.e., query-question and query-answer matching. For query-question matching, a Siamese network calculates the similarity between query-question pairs and recalls the most similar questions and their corresponding answers as candidates. For query-answer matching, we adopt the idea of the pairwise algorithm and propose a single-stream structure to calculate the relevance between query and answer, based on which the best-matching candidates are ranked and returned. After training, these two steps are combined into an efficient QA scheme for different languages, e.g., English and Chinese. Furthermore, to address the lack of Chinese QA datasets, we collect a massive amount of text data from Chinese social media and generate a new dataset via a pre-trained language model. Extensive experiments are conducted on six QA datasets to validate our HSM-QA. The experimental results demonstrate the superior performance and efficiency of our method compared with a set of baseline methods.


I. INTRODUCTION
The emerging technologies of social media are changing the way that information passes all around the world [1], [2], [3]. People can exchange ideas and messages on the Internet rapidly, regardless of distance or time. This makes it easier to collect massive amounts of data but challenging to navigate and extract meaningful information from this collection. Consequently, people demand higher efficiency and accuracy in acquiring information, leading to the popularity of Question Answering (QA) systems in the field of information retrieval [4], [5], [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Agostino Forestiero.
A Question Answering (QA) system is an advanced form of information retrieval [7], [8] and has been studied for decades [4], [5], [7], [9], [10], [11], [12], [13]. In the early years, question answering was regarded as a database query task [14], [15], and researchers performed QA retrieval by pattern matching. Later, QA systems focused more on semantic and syntactic analysis of the user's input [16], [17]; most of them translate a natural language description into a logical expression before converting it into a SQL query, or calculate the statistical and semantic similarity between query and question. Despite good performance, these methods still have drawbacks when facing open data sources, because they have to structure the collected data in a fixed form, which greatly limits their real-world applications.
To conquer this problem, two types of QA systems have emerged in recent advances, namely Community Question Answering (CQA for short) [18], [19], [20] and Frequently Asked Questions (FAQ for short) [17], [21], [22], [23]. CQA is a web-based application where users can post or answer questions according to their own experience. When a new query is submitted, the search engine returns relevant question-answer pages to the user. FAQ provides consultation services by organizing common questions and their corresponding answers in advance and listing them on websites, in articles, online forums, and so on. This kind of QA system is often designed for companies, whose customers can quickly search the FAQ website and obtain the target QA pair by finding the most similar question. However, a major challenge for both kinds of QA systems, i.e., CQA and FAQ, is that question-answer pairs in the database may be organized in a fixed text format, while users form their queries with different linguistic patterns. Furthermore, people may use colloquialisms and slang when asking questions, so the queries input to the system are also diverse. Quickly and accurately matching the user's query with the question-answer pairs has therefore become an essential step in building advanced QA systems.
Recent advances often resort to semantic matching or text classification for effective question answering. These methods can be divided into two main categories. The first one only uses query-Question (q-Q) or query-Answer (q-A) relations. For instance, the works in [24], [25], [26], [27], [28], [29], [30], [31], and [32] perform QA pair retrieval by searching for the questions most similar to the new query. We refer to such methods as the q-Q matching based model. Other researchers [33], [34], [35], [36] retrieve and rank the QA pairs by calculating the matching degree between the newly entered query and existing answers, which is often termed the q-A matching based model. The structures of these two models are shown in Figure 1(a) and Figure 1(b), respectively. The query-Question matching based model is developed on top of the assumption that if a new query is similar to some posted questions, then the answers to those posted questions can be regarded as the target results for the newly entered query.
The key to this methodology is to improve the text matching or classification performance of the QA system. As for the query-Answer matching based model, the most important component is to construct an effective semantic space that can bridge the lexical gap between new queries and existing answers.
Another category takes both q-Q relations and q-A relations into account. The structure is shown in Figure 1(c). For instance, the methods proposed in [37], [38], and [39] extract features of both posted questions and their corresponding answers simultaneously. Then, for each QA pair, a score that indicates the matching degree of both query-Question and query-Answer is calculated by a learned model and used as the ranking indicator. We name this the query-Question-Answer matching based model. Overall, the query-Question matching based model only considers the question features while ignoring the relation between the query and answer, which is prone to incorrect answering when applied to CQA, as people may not answer the posted question well. The query-Answer matching based method can directly measure the matching degree of the new query and existing answers. However, it is more difficult to determine the semantic relevance, as q-A data is less homogeneous: models must be carefully designed, and more complex data pre-processing and feature engineering must be performed before training the final models. As for the query-Question-Answer matching based model, it takes both the question and answer into consideration but needs to recall and rank all examples at once, which comes at the expense of efficiency. The pros and cons of these methods are listed in Table 1 for a clear comparison.
To tackle these problems, we propose a novel QA system based on hierarchical semantic matching, dubbed HSM-QA, whose pipeline is illustrated in Figure 1(d). The proposed method has two main objectives: • To further enhance the accuracy of the retrieval results, HSM-QA considers modeling both query-Question and query-Answer relations.
• To achieve high inference efficiency, HSM-QA employs a hierarchical approach to semantic matching and filters out the most irrelevant data in the initial step. Specifically, HSM-QA decomposes the QA retrieval process into two main steps. First, pre-trained language models integrated in a Siamese structure are used to calculate the similarity between query-question pairs and recall the most similar questions and their corresponding answers as candidates. Afterwards, a network trained with a pairwise loss is used to calculate the relevance between each query-answer pair and rank the candidates. We name these two steps q-Q matching and q-A matching, respectively. After training, the final overall question answering system is constructed for knowledge retrieval. In practice, different pre-trained models are applied as sentence encoders to extract sentence features, and we adapt this system to both Chinese and English question answering.
Furthermore, to address the lack of Chinese QA datasets, we collect Chinese text data from the Internet, use a pre-trained language model to generate question-answer pairs automatically, and then label the data manually. The experimental results demonstrate the distinct performance gains of our HSM-QA over the compared methods on both English and Chinese benchmarks.
To sum up, the contributions of this paper are three-fold: • We propose a novel QA system, termed HSM-QA, which can hierarchically measure the relation between user's query and QA pairs. We believe that HSM-QA also presents a better trade-off between performance and efficiency than existing QA methods.
• To address the lack of Chinese QA benchmarks, we propose a new QA dataset in this paper. Its data is collected from the Internet and automatically transformed into QA pairs via a sequential language model.
• The proposed HSM-QA achieves outstanding performance on multiple QA benchmarks and is also superior in model efficiency. The remainder of this paper is organized as follows: Section II introduces related methods and algorithms in the field of QA systems. Section III presents the proposed HSM-QA approach in detail. Section IV illustrates the experimental results and analyzes the effectiveness of our proposed method. Finally, Section V draws a conclusion and highlights potential avenues for future study.

II. RELATED WORK
This section presents an overview of relevant and recent works in QA systems, including existing QA systems, language model pre-training and pairwise ranking algorithm.

A. QUESTION ANSWERING SYSTEM
A Question Answering (QA) system returns the correct answer in response to an input query. Research efforts have been conducted for decades to improve the ranking performance of QA systems. The methods proposed in [25] and [28] present learning-to-rank models based on SVMRank [44] to rank question paraphrases, where the paraphrases are collected from query logs. Surdeanu et al. [34] formulate a ranking score function combining different text features for answer modeling, while Zhou et al. [33] propose to learn semantic representations for q-A pairs, which are then incorporated into a learning-to-rank framework.
77828 VOLUME 11, 2023. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
As the task of ranking QA pairs can be transformed into text matching or text classification, methods to improve the matching precision have been widely studied [26], [27], [29], [32], [45], [46], [47]. For example, both [26] and [27] train a skip-gram model to obtain continuous word embeddings for q-Q pairs, and the similarity of the processed embeddings can be regarded as the matching degree. Dutta et al. [29] propose a model called TI-S2S that extracts keywords for each question, based on which similarity scores are calculated to rank the candidate questions. Li et al. [32] adopt a two-step method that combines a TF-IDF search technique at the first stage with a deep neural network at the second to fully explore multi-grained query-question matching. To better guide the search for the best answers, metadata such as category information is often combined with the text features, and a classification model is trained to determine the semantic categorization of a new query [24], [48].
Both the learning-to-rank and text-matching models mentioned above only model the relationship of q-Q pairs or q-A pairs; they do not exploit the characteristics of both questions and answers in the database simultaneously. Some methods [37], [38], [39] apply unsupervised matching models for query-question matching and additionally adopt a deep neural network to calculate the matching degree between the query and answers, from which a weighted score is obtained as a ranking indicator. However, this one-step retrieval scheme results in long inference times and consumes large amounts of compute and storage resources.
The structure of our proposed method is similar to [32], but we further explore the relation between the user's query and the answer. Specifically, a pre-trained language model is adopted in a Siamese structure [49], [50] at the first step to roughly recall a set of candidates. With the language encoder of this Siamese network, the representations of all questions in the QA dataset can be computed and stored in the database. Then, for each input query, we only need to use the model to obtain the corresponding representation and calculate its similarity with the questions from the database, which greatly reduces the inference time. At the second step, we apply another deep network that outputs a value indicating the matching degree between the user's input and the answers. This two-step retrieval process makes the proposed HSM-QA applicable to more real-world scenarios and enhances both efficiency and accuracy.

B. LANGUAGE MODEL PRE-TRAINING
Recently, large-scale pre-training has become a major trend in natural language processing [51], [52], [53], [54]. Language model pre-training aims to learn general text representations from large-scale unlabeled text corpora. BERT [54] adopts masked language modeling and next sentence prediction for pre-training and proposes a bi-directional language model, which advances the state of the art for many NLP tasks [55], [56], [57]. Since then, more variants of BERT have been proposed to further promote performance [58], [59], [60] or reduce model size [61], [62], [63], [64]. In addition, some models add different pre-training objectives in response to different language tasks [65]. Due to the complexity of the Chinese language, ERNIE [66] modifies the masking strategy and employs phrase-level and entity-level masks to implicitly learn both syntactic and semantic information. ERNIE-gram [67] applies an n-gram model to ERNIE so as to learn language representations with multiple granularities. These models greatly promote Chinese text analysis. In this paper, we use different pre-trained models for the different matching networks and languages.
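As a toy illustration of the masked language modeling objective described above, the sketch below randomly corrupts an input token sequence. The 15% mask rate and the `mask_tokens` helper are illustrative assumptions, and the real BERT procedure additionally keeps or randomly replaces some of the selected tokens, which is omitted here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random fraction of tokens with [MASK], in the spirit of
    masked language modeling; returns the corrupted sequence and the
    positions/original tokens the model would have to predict."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    # Guarantee at least one masked position for very short inputs.
    n_mask = max(1, int(round(mask_rate * len(tokens))))
    for pos in rng.sample(range(len(tokens)), n_mask):
        targets[pos] = corrupted[pos]
        corrupted[pos] = MASK
    return corrupted, targets

corrupted, targets = mask_tokens("the cat sat on the mat".split())
```

A pre-training objective would then score the model's predictions at the recorded positions against `targets`.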

C. PAIRWISE RANKING ALGORITHM
Ranking algorithms can be categorized into three kinds [68]: pointwise, pairwise, and listwise. Considering that the pointwise method tends to ignore the relations among candidate documents and the listwise method has stricter demands on the labeled data, we prefer the pairwise method to sort the candidate answers. The input of a typical pairwise method is the features of a pair of documents. Most pairwise methods train a binary classifier to mimic the process of document ranking. A pair of documents, denoted doc+ and doc−, is fed into a binary classifier, where doc+ is relevant to the input query and doc− is irrelevant. The goal of this binary classifier is to output a binary value indicating the partial order of the two documents. In addition, some works [69], [70], [71], [72] resort to decision trees to obtain the partial order relation between the user's query and two related documents. However, these methods require careful design of handcrafted features, which reduces their training efficiency.
In this paper, we resort to a pre-trained model to extract features automatically and further measure the relevance between query and answer by applying a pairwise loss. During inference, the candidate answers are ranked and returned according to their scores.

III. METHODOLOGY
This section outlines the proposed HSM-QA approach. First, we provide an overview of this two-step method, followed by a detailed description of each matching model, including the architecture design and training objectives.

A. OVERVIEW
The detailed structure of the proposed HSM-QA is illustrated in Figure 2. Given a user query q, there are N candidate questions, each with M corresponding answers. Note that M = 1 in an FAQ system and M > 1 otherwise. The retrieval task is to find the most appropriate QA pair for the user query q. As shown in Figure 2, HSM-QA takes into account the characteristics of the data and measures the matching degree in two aspects: query-question similarity and query-answer relevance.
Specifically, we decouple the QA process into two steps, i.e., query-question (q-Q) and query-answer (q-A) matching. For q-Q matching, we first apply a Siamese network to calculate the semantic similarity between query-question pairs and obtain the k_1 candidate questions {Q_i}_{i=1}^{k_1} by ranking the similarity scores. The candidate answers {A*_i}_{i=1}^{k_1} are then obtained from the corresponding candidate questions.
To perform q-A matching, we deploy a single-stream structure to calculate the relevance between the user query and the candidate answers. These finer-grained relevance scores direct us to rerank the results of q-Q matching, and the best-matching k_2 QA pairs are returned to the user.
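The two-step retrieval just described can be sketched as follows. Here `hsm_qa_retrieve`, the embedding dictionaries, and the `relevance` callable are hypothetical stand-ins: in the actual system, question embeddings come from the trained Siamese encoder and `relevance` is the trained q-A matching network.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def hsm_qa_retrieve(query_vec, questions, answers, relevance, k1=3, k2=1):
    """questions: {qid: embedding}; answers: {qid: [answer, ...]};
    relevance(query_vec, answer) -> float, standing in for the q-A network."""
    # Step 1 (q-Q matching): recall the k1 most similar questions.
    recalled = sorted(questions,
                      key=lambda qid: cosine(query_vec, questions[qid]),
                      reverse=True)[:k1]
    # Step 2 (q-A matching): rerank the recalled candidates' answers.
    candidates = [(qid, ans) for qid in recalled for ans in answers[qid]]
    candidates.sort(key=lambda pair: relevance(query_vec, pair[1]), reverse=True)
    return candidates[:k2]
```

Filtering to k1 candidates before the more expensive q-A scoring is what keeps inference cheap: the relevance network is evaluated only on k1·M answers rather than on all N·M.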

B. MEASUREMENT OF Q-Q SIMILARITY MATCHING
The underlying idea behind the q-Q matching model is that similar questions have similar answers, which is also reflected in [25] and [28]. However, instead of training a learning-to-rank model, we construct a high-quality semantic space between queries and questions using a Siamese network, as shown in Figure 3. The output similarity indicates the degree of semantic matching. Following the outstanding work of Sentence-BERT [50], we apply a Siamese structure to calculate the similarity of query-question pairs. There are two reasons for using a Siamese network. First, the two sub-networks have the same structure and share weights; we only need to learn the parameters of one network, which greatly reduces the training cost of the model. Second, the two input sentences do not require any interaction before the final similarity matching. In practice, we encode and store the questions in advance. During inference, we reuse the stored sentence embeddings of the questions and only need to obtain the query representation online, which greatly reduces the forward computation and memory occupation.
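The offline/online split described above can be sketched as follows; `encode` is a deliberately trivial bag-of-characters stand-in for the shared Siamese encoder, used only to make the example self-contained.

```python
import math

def encode(text):
    # Hypothetical stand-in for the shared Siamese encoder: an L2-normalized
    # bag-of-characters vector, used only to make the sketch runnable.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Offline: encode every stored question once and cache the embeddings.
question_bank = {q: encode(q) for q in
                 ["how to reset my password", "what is the refund policy"]}

def most_similar(query, k=1):
    # Online: encode only the query, then reuse the cached embeddings.
    qv = encode(query)
    sims = {q: sum(a * b for a, b in zip(qv, v))
            for q, v in question_bank.items()}
    return sorted(sims, key=sims.get, reverse=True)[:k]
```

Because the question embeddings never change between queries, only one forward pass (for the query) is needed per request, which is the efficiency argument made above.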
To further improve the model performance, we apply a contrastive loss, as in Equation (1), to pull similar sentences closer and push dissimilar ones apart.
To alleviate the anisotropy of the embeddings produced by the Transformer architecture [53], we follow [73] and [74] to whiten the sentence embeddings. The process of whitening a sentence embedding s_i can be formulated as Equation (6).
where µ and Σ are the mean vector and covariance matrix of the embeddings, and s̃_i is the whitened sentence embedding. As mentioned in BERT-whitening [74], a large number of samples is needed to obtain accurate estimates of µ and the whitening matrix W. Therefore, we only apply whitening to large-scale datasets in the following experiments; more details about the whitening operation are given in Section IV-C.
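As a rough illustration of this post-processing idea, the sketch below standardizes each embedding dimension to zero mean and unit variance. Note this is a simplified diagonal variant assumed for brevity; BERT-whitening [74] additionally decorrelates the dimensions with an SVD-derived transformation W, which is omitted here.

```python
import math

def whiten(embeddings):
    """Simplified whitening sketch: center each dimension at zero mean and
    scale it to unit variance. Full BERT-whitening also decorrelates the
    dimensions via an SVD of the covariance matrix, not shown here."""
    dim = len(embeddings[0])
    n = len(embeddings)
    mu = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    var = [sum((e[d] - mu[d]) ** 2 for e in embeddings) / n for d in range(dim)]
    # Guard against zero variance in constant dimensions.
    std = [math.sqrt(v) or 1.0 for v in var]
    return [[(e[d] - mu[d]) / std[d] for d in range(dim)] for e in embeddings]
```

The estimates of µ and the variances become more reliable as the number of samples grows, which is why the paper applies whitening only to large-scale datasets.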

C. MEASUREMENT OF Q-A RELEVANCE MATCHING
As shown in Figure 4, we explore a pairwise method for q-A matching. Given a query and its matching answer, we concatenate them to construct a positive sample. Similarly, we concatenate the same query with irrelevant answers to obtain the negative ones. The q-A matching network is responsible for extracting the features of each sample and ranking the positive sample ahead of the negative one. The training objective is formulated as Equation (7): loss = max(0, γ − (score+ − score−)).
where score+ and score− are the corresponding outputs for the positive and negative samples. The goal of this loss function is to widen the gap between positive and negative samples: if the gap is larger than γ, the loss is set to zero; otherwise, optimization is performed. The inference process is shown in Figure 4(b). During inference, we directly input the concatenation of the query and each candidate answer to the q-A matching network, and the output is regarded as the relevance score between query and answer.
Figure 4. The architecture of q-A matching. During training, positive and negative samples are concatenated with the query as input to widen the output gap. During inference, the output of the q-A matching network is regarded as the relevance score.
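A minimal sketch of this margin objective, which is zero once the positive sample outscores the negative one by at least γ:

```python
def pairwise_margin_loss(score_pos, score_neg, gamma=1.0):
    """Hinge-style pairwise loss: zero once the positive sample outscores
    the negative one by at least the margin gamma, positive otherwise."""
    return max(0.0, gamma - (score_pos - score_neg))
```

During training, gradients flow only through pairs whose gap is still smaller than γ, so the network concentrates on the pairs it has not yet ordered confidently.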

IV. EXPERIMENT
This section introduces the experimental setup, including the evaluation datasets and metrics. We then provide more details about model implementation and report the results of each experiment. We also analyze and investigate the factors that impact the performance of the QA system.

A. DATASET
To validate the proposed HSM-QA, we conduct extensive experiments on several popular datasets, namely Quora Duplicate Question, QNLI, the CQA dataset for SemEval 2017 Task 3, LCQMC, the NLPCC 2016 DBQA dataset, and our Chinese QA pair dataset, and we compare it with a set of state-of-the-art methods [40], [41], [43], [75]. Quora Duplicate Question [76] is a dataset with 40,428 potential duplicate question pairs, where each pair has a binary label indicating whether the two questions are true duplicates. This dataset is used for training and evaluating the English q-Q matching network.
QNLI [55] is a dataset for natural language understanding, which contains 104,734 samples for training and 5,463 samples for validation. This dataset is used for training and evaluating the English q-A matching network.
The CQA dataset for SemEval 2017 Task 3 [20] contains all queries and QA pairs posted on Qatarliving, a web forum. This dataset was proposed for the SemEval 2017 Task 3 competition, which has three sub-tasks: • Given a query q and a list of ten questions from the forum (Q_1, ..., Q_10), rank the candidate question set by similarity to q; • Given a question Q and a list of ten answers (A_1, ..., A_10) in its question thread, rank the candidate answers; • Given a query q and the ten questions (Q_1, ..., Q_10), where each Q_i is associated with ten potentially relevant answers (A_i^1, ..., A_i^10) in its question thread, return the top 10 QA pairs according to their relevance scores. To fully evaluate the ranking results, the relevance scores of all answers are further used to calculate standard classification measures. We use this dataset to fine-tune the English HSM-QA model and evaluate its performance on these three sub-tasks.
LCQMC [77] is a colloquial Chinese similar question matching dataset, whose data comes from multiple fields of Baidu QA, a popular web forum in China. It contains 238,766 training samples, 8,802 validation samples and 12,500 test samples. This dataset is used for training and evaluating the performance of Chinese q-Q matching network.
The NLPCC 2016 DBQA dataset [78] has a similar data format to QNLI. It contains 172,680 samples for training, 40,996 samples for evaluation, and 81,537 samples for testing. Each sample has one question, one potential answer, and one binary label indicating whether the text answers the question. This dataset is used for training and evaluating the Chinese q-A matching network.
The Chinese QA pair dataset is collected by ourselves to address the lack of Chinese QA datasets. We resort to the text generation model UniLM [79] and the seq2seq principle [80] to automatically generate question-answer pairs. Furthermore, we expand the dataset by using RoFormer-sim [81] to generate multiple sentences with the same semantics as the previously obtained ones. The proposed dataset has 300 anchor samples serving as user queries. Each query has 10 questions with potentially identical semantics acting as candidate questions, and each question has 10 potentially relevant answers. Referring to the SemEval 2017 CQA dataset, we organize the data in the same format and borrow the three sub-tasks for evaluation after fine-tuning.

B. METRICS
Several metrics are used across the different tasks to measure the performance of the proposed HSM-QA. The q-Q matching network is used for recalling potential duplicate questions, so we use recall as the main indicator. AUC is used to evaluate the ranking of the selected candidate answers in q-A matching. Under the settings of SemEval 2017 CQA Task 3, we also use mean Average Precision (mAP for short), calculated over the top-10 results, to measure the retrieval performance. Classification measures such as precision and recall are also reported in the results.
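The top-10 mAP mentioned above can be computed as follows with binary relevance labels. Note this sketch assumes one common convention, averaging precision at the ranks of relevant items found within the top-k list; other definitions normalize by the total number of relevant documents instead.

```python
def average_precision(relevances, k=10):
    """AP over the top-k ranked results; `relevances` holds the binary
    relevance of each returned item, in ranked order."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(queries, k=10):
    """mAP: the mean of per-query AP values."""
    return sum(average_precision(r, k) for r in queries) / len(queries)
```

For example, a ranking whose first and third items are relevant gets AP = (1/1 + 2/3) / 2 = 5/6, rewarding systems that place relevant answers near the top.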

C. EXPERIMENTAL SETTINGS
To make HSM-QA practical in real applications, we use different pre-trained models as language encoders to balance inference speed and model performance. For the English version of HSM-QA, we choose DistilBERT and RoBERTa as encoders, while for the Chinese version, we use ERNIE and ERNIE-gram. In fact, any pre-trained language model can serve as the sentence encoder in our two matching networks. After training the q-Q and q-A matching networks, we integrate them into a unified model and perform complete QA pair retrieval. The AdamW optimizer is used, and the batch size is set to 32. All models are trained on one GTX 1080Ti GPU. The hyperparameters of the models are listed in Table 2. We only perform the whitening operation when training the q-Q matching network on Quora Duplicate Question and LCQMC, as their training samples are abundant enough to calculate precise µ and W following Equation (2) - Equation (6).

D. QUANTITATIVE EXPERIMENT
We first evaluate the performance of each matching network, followed by experiments with the entire HSM-QA performing QA pair retrieval. The experimental results demonstrate the excellent matching performance of our proposed method. We also provide an in-depth analysis of the results to further highlight the strengths of HSM-QA.

1) EXPERIMENTAL RESULTS OF Q-Q MATCHING
The experimental results of English query-Question matching are shown in Table 3. We apply the whitening operation following the instructions of [73] and [74] to refine the sentence vectors, and we observe that the performance on each metric is slightly improved. In terms of recall, the value reaches nearly 83% on the validation set and 81.77% on the test set. The performance on Chinese query-Question matching is also impressive, though it declines slightly when the whitening operation is applied.
BERT-whitening addresses the issue of anisotropy in BERT embeddings [82] by utilizing the traditional whitening method, resulting in improved performance on English datasets. However, as shown in [83] and Table 3 of our paper, this approach does not perform well on Chinese datasets. We think that this post-processing method for sentence embedding is not entirely suitable for Chinese. Unlike English, Chinese is a complex language, and the meaning of a sentence can be challenging to comprehend without its context. The whitening operation, in particular, may remove crucial contextual information contained within a Chinese sentence, resulting in performance degradation in our experiments. To enhance the sentence embeddings of Chinese, it may be more effective to learn embeddings at the paragraph level or refine the embedding based on its context, instead of relying solely on post-processing embeddings for individual sentences.

2) EXPERIMENTAL RESULTS OF Q-A MATCHING
The final results of the English q-A matching network are shown in Table 4. The recall score, 97.41%, is higher than the accuracy. This indicates that most positive samples can be recalled in q-A matching, but negative samples tend to be recognized as positive. We further measure the AUC of the model and observe that this score is also competitive, i.e., 95.60%. These results demonstrate that our model consistently places positive samples ahead of negative samples, which greatly benefits the ranking process.
As for Chinese, the q-A matching network achieves high accuracy on the validation and test sets of NLPCC 2016 DBQA, both exceeding 97%. The recall also reaches around 90%, though the precision is low. Similarly, the final AUC of the model on this dataset exceeds 97%, indicating that q-A matching is capable of ranking samples and that the relevance scores of positive samples are consistently higher than those of negative samples.

3) EXPERIMENTAL RESULTS OF HSM-QA FOR COMPLETE RETRIEVAL
After fine-tuning the two matching networks on their corresponding datasets, we combine them to perform the two-step QA retrieval task. The results for the two languages are shown in Table 5 and Table 6.
For the English version, we validate our HSM-QA on the SemEval 2017 Task 3 CQA dataset. To perform subtask 1, which aims to retrieve similar questions given a user's query, the Siamese network for q-Q matching is used to calculate the semantic similarities, with DistilBERT as the sentence encoder, and the ten candidate questions in each query-Question thread are ranked according to the output value. Our Siamese-DistilBERT outperforms all the models proposed in the competition in mAP and recall; in particular, the mAP improves by about 30 percentage points. The models proposed in 2017 mainly use traditional methods to extract text features and focus only on analyzing the keywords of queries and questions. They measure the similarity between two sentences by keywords, which makes semantic understanding difficult [40], [43]. For a fair comparison, we also test RankNet [85] on this subtask. RankNet is a pairwise method that scores the candidate documents using a neural network; here, the same pre-trained language encoder is used to obtain the sentence embeddings as input features. From Table 5, we notice that models pre-trained on large-scale corpora can generate high-quality text representations, as both RankNet and Sentence-DistilBERT outperform the other methods in this similarity ranking task. Additionally, building a high-quality semantic space may be more effective for retrieving similar sentences than learning-to-rank models. The core idea of subtask 1 is to figure out the relatedness of q-Q pairs, and our Sentence-DistilBERT achieves this via a contrastive loss that generates similar features for similar sentences.
Although other methods also construct feature engineering and extract multi-level features, their ultimate goal is to learn a ranking order rather than to identify the similarity between sentences. Therefore, Sentence-DistilBERT achieves significant gains in this q-Q matching task.
For subtask 2, compared with the models that participated in the 2017 competition, our pairwise matching network, which uses RoBERTa as the encoder, could not achieve the best performance. There are two possible reasons: (1) The pairwise method only focuses on the partial order between positive and negative samples during training. However, in the SemEval 2017 Task 3 CQA dataset, the number of candidate answers is larger, so finer-grained annotation information may be lost during training. (2) The models proposed in this competition perform more detailed data analysis on questions and answers, searching for potentially useful information in the question domain and entity-word information between query and answer, and construct complex feature engineering [41], [42]. In contrast, we directly input the entire sentence to the network without any pre-processing of the data, and we do not pay much attention to mining relationships between features; the resulting noise in the original data reduces the performance of the model.
In subtask3, we retrieve the top ten answers in the following steps: First, we select the three questions most similar to the query using Sentence-DistilBERT; then we obtain the 10 answers to each of these 3 questions as the candidate answers for the query and input them to pairwise-RoBERTa to calculate the relevance score between the query and each candidate answer; finally, we return the top 10 answers in order of relevance. Following the construction of [20], we calculate mAP over the returned ten ranked answers, while the other evaluation metrics are calculated over the full list of results, where the relatedness scores between answers and queries are computed by the trained pairwise-RoBERTa. From Table 5, we can observe that our proposed HSM-QA achieves nearly 20% gains in mAP and 18% gains in F1 score.
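The three retrieval steps above can be sketched as follows, with pluggable scoring functions standing in for Sentence-DistilBERT (`sim_fn`) and pairwise-RoBERTa (`rel_fn`); all names and the toy overlap scorer in the usage example are illustrative assumptions.

```python
def retrieve_answers(query, question_bank, answers_of,
                     sim_fn, rel_fn, top_q=3, top_a=10):
    """Two-step retrieval: (1) recall the top_q questions most similar to
    the query, (2) pool their answers as candidates, (3) rank candidates
    by query-answer relevance and return the top_a."""
    similar_qs = sorted(question_bank,
                        key=lambda q: sim_fn(query, q),
                        reverse=True)[:top_q]
    candidates = [a for q in similar_qs for a in answers_of[q]]
    candidates.sort(key=lambda a: rel_fn(query, a), reverse=True)
    return candidates[:top_a]
```

A toy usage with word-overlap scoring in place of the trained networks:

```python
overlap = lambda x, y: len(set(x.split()) & set(y.split()))
bank = ["how to cook rice", "weather in paris", "cook pasta fast"]
answers = {"how to cook rice": ["boil it for 15 minutes"],
           "weather in paris": ["it is sunny"],
           "cook pasta fast": ["use more salt"]}
top = retrieve_answers("how do i cook rice", bank, answers,
                       overlap, overlap, top_q=2, top_a=2)
```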
We further analyze the reasons behind HSM-QA's strong performance in subtask3. We count the number of answers of each type and uncover a prior of the dataset itself, shown in Figure 5: the number of relevant answers is much larger under a similar question thread. It is therefore crucial to detect the relatedness of questions, as more relevant answers can be found under a similar question. Bunji [75] only takes the query and answer features into account and ignores the relations between each query and its candidate questions. Both IIT-UHH [41] and KeLP [42] train a classifier whose input feature vectors contain similarity metrics computed between the original query and the entire thread of questions, concatenating each question with its answers. Their ranking score can be regarded as a combination of a q-Q matching score and a q-A matching score. However, their performance on query-Question matching is not as good as ours, which hurts their performance in answer ranking and classification. To validate the effectiveness of the two-step retrieval process, we also combine RankNet with pairwise-RoBERTa to perform subtask3 in the same way as HSM-QA. We find that although this combination outperforms the other previous methods, it still falls short of our HSM-QA, as the q-Q matching step in our method is better than RankNet. Note that since HSM-QA and RankNet+pairwise-RoBERTa share the same q-A matching network, they achieve the same classification performance. In summary, we attribute the impressive performance of HSM-QA to its two-step structure and its outstanding semantic textual similarity matching ability.
We further validate the Chinese version by fine-tuning the model on the Chinese QA pair dataset. We also test RankNet and RankNet+pairwise-ERNIE-gram as baselines; the results are provided in Table 6. As the mAP and recall scores increase during training, our model's ability to recall semantically similar questions in subtask1 gradually improves, as does RankNet's. Additionally, the performance of our Sentence-ERNIE is slightly superior to that of RankNet. In subtask2, the benefit of our model is also evident, especially in recall and F1 score. We evaluate HSM-QA and the baseline on subtask3 following the same procedure as for the English model. After fine-tuning, the performance improves on both the validation and test datasets. The final mAP score of HSM-QA is 81.79%, which is also higher than the baseline. This improvement demonstrates the model's feasibility for Chinese question answering and retrieval tasks.

E. ABLATION EXPERIMENT
We also conduct a series of ablations to explore the proposed HSM-QA in terms of its structure and training strategy. We put forward subtask4 to evaluate QA pair retrieval performance, defining this new subtask as obtaining three candidate QA pairs given a user's query. A returned QA pair is considered a match only when its question is similar to the query and its answer is relevant to it.
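The subtask4 matching criterion can be sketched as a joint filter over both matching scores. The thresholds, the additive score combination, and all function names here are illustrative assumptions rather than the paper's exact formulation.

```python
def retrieve_qa_pairs(query, qa_pairs, sim_fn, rel_fn,
                      k=3, sim_thresh=0.5, rel_thresh=0.5):
    """Return the k best (question, answer) pairs for the query.  A pair
    qualifies only if its question is similar to the query AND its answer
    is relevant to it; qualifying pairs are ranked by the sum of the two
    scores."""
    qualified = [(q, a) for q, a in qa_pairs
                 if sim_fn(query, q) >= sim_thresh
                 and rel_fn(query, a) >= rel_thresh]
    qualified.sort(key=lambda p: sim_fn(query, p[0]) + rel_fn(query, p[1]),
                   reverse=True)
    return qualified[:k]
```

The conjunction in the filter reflects the criterion stated above: a pair with a similar question but an irrelevant answer (or vice versa) does not count as a match.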

1) EFFECTS OF THE HIERARCHICAL STRUCTURE
By adding or removing each step, we examine how important a role each step plays in the entire model. The experimental results are shown in Table 7. We observe that the precision scores of both the English and Chinese versions improve when hierarchical semantic matching is applied. It is also interesting to note that the two matching networks are equally important in the English version, but q-A has a greater impact than q-Q in the Chinese version.

2) EFFECTS OF THE TRAINING STRATEGY
In previous sections, the weights of each matching model are initialized from the pre-trained models and fine-tuned on the corresponding query-Question or query-Answer matching dataset before being fine-tuned on the QA pair dataset. To analyze the impact of fine-tuning, we conduct further ablations. The experimental results on the English dataset are shown in Table 8. We notice that the metrics are significantly improved in the English version, which facilitates the adaptation of the model to the question-answer matching task. For the Chinese version, fine-tuning the q-Q matching also improves the performance of the model. However, fine-tuning the q-A matching degrades performance because the gap in language style between the datasets is large: LCQMC is a colloquial question set whose language style may be similar to our constructed data, while the data in NLPCC 2016 DBQA are all excerpted from long texts and combined into question-and-answer pairs through word and sentence segmentation. The many meaningless sentences this produces may introduce considerable noise, resulting in worse retrieval performance.
To validate the generalization and robustness of the proposed HSM-QA, we perform subtask4 when HSM-QA is not fine-tuned on the target QA pair dataset. We observe that fine-tuning the matching networks on other common datasets can still significantly improve the performance of our model: as can be seen from Table 8, a model fine-tuned on other datasets generally outperforms the un-fine-tuned model even though neither is fine-tuned on the target dataset. Therefore, in real-world applications, if data in the model's specific domain is scarce, fine-tuning the two matching networks on other, easily collected data can still yield good performance, making the method highly applicable.

V. CONCLUSION
Modern QA systems still face limitations in precision and efficiency due to the inherent ambiguity of natural language descriptions. In this paper, we present a question answering system based on hierarchical semantic matching to achieve question-answer-pair-based retrieval. Our proposed model adopts a Siamese network and a pairwise method to split the retrieval of question-answer pairs into two steps: question-answer pairs are first recalled based on query-question similarity and then ranked according to query-answer correlation. For query-Question similarity recall, we use the Siamese network to improve accuracy and training efficiency. For query-Answer relevance ranking, we use a pairwise scheme to train the matching network with a simple loss function. To evaluate our proposed model, we conduct experiments on multiple datasets to verify the performance of the hierarchical structure. The results demonstrate that our model achieves the best performance on multiple metrics and is practical for real applications.
In our upcoming study, we aim to enhance the generalization of our HSM-QA system to more open-world scenarios by incorporating external knowledge sources, such as knowledge graphs. This will enable the system to tackle more complex and diverse question-answering tasks and ultimately improve its overall performance and usefulness.
JINLU ZHANG received the bachelor's degree in science and technology of artificial intelligence from Xiamen University, China, in 2022, where she is currently pursuing the master's degree with the Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China. Her research interests include multimodal machine learning and image generation.