Adaptable Closed-Domain Question Answering Using Contextualized CNN-Attention Models and Question Expansion

In closed-domain Question Answering (QA), the goal is to retrieve answers to questions within a specific domain. The main challenge of closed-domain QA is to develop a model that only requires small datasets for training since large-scale corpora may not be available. One approach is a flexible QA model that can adapt to different closed domains and train on their corpora. In this paper, we present a novel versatile reading comprehension style approach for closed-domain QA (called CA-AcdQA). The approach is based on pre-trained contextualized language models, Convolutional Neural Network (CNN), and a self-attention mechanism. The model captures the relevance between the question and context sentences at different levels of granularity by exploring the dependencies between the features extracted by the CNN. Moreover, we include candidate answer identification and question expansion techniques for context reduction and rewriting ambiguous questions. The model can be tuned to different domains with a small training dataset for sentence-level QA. The approach is tested on four publicly-available closed-domain QA datasets: Tesla (person), California (region), EU-law (system), and COVID-QA (biomedical) against nine other QA approaches. Results show that the ALBERT model variant outperforms all approaches on all datasets with a significant increase in Exact Match and F1 score. Furthermore, for the COVID-19 QA in which the text is complicated and specialized, the model is improved considerably with additional biomedical training resources (an F1 increase of 15.9 over the next highest baseline).


I. INTRODUCTION
In automated Question Answering (QA), the goal is to retrieve answer(s) to a particular question expressed as a natural language text [1]. In closed-domain QA, the focus is on a particular domain of interest where the goal is to retrieve answers to questions within that domain. Machine reading comprehension (MRC) is the core task for textual QA, which aims to infer the answer for a question given the related context [2]. The answers could be sentences, paragraphs, or even n-grams. In practice, sentences are a good size to present a user with a detailed answer. For instance, given the question ''Why is the Pfizer vaccine better than Sinovac?'', one would expect the answer in one or two sentences rather than a single phrase. The task is more challenging compared to others in information retrieval (IR) [3], where the goal is to retrieve a ranked list of relevant documents. (The associate editor coordinating the review of this manuscript and approving it for publication was Weipeng Jing.)
The focus of this paper is closed-domain sentence-level QA [4]- [7]. This is an important and challenging field to study because many problems could be addressed by building domain-specific QA systems. For example, technology companies building systems for their call agents to answer user queries would benefit from a system that uses their internal call records so their call agents could efficiently get an answer to the questions of their clients. In a further example, students studying a particular subject would benefit from closed-domain QA systems to help them answer questions surrounding their syllabus, rather than using a general open-domain QA system that might retrieve irrelevant answers due to the diversity of topics covered.
Developing methods to improve closed-domain QA is a crucial problem to address so that we can build systems that answer domain-specific user questions effectively. It is challenging because there are a variety of domains, each with its own vocabulary, language syntax, and semantics. Ideally the same computational model would be applied in different domains with minimal human supervision to avoid needing tailor-made models for every domain, which would be time-consuming and expensive. In automated QA, there has been significant progress, concentrated largely on open-domain QA systems. There are systems in both open- and closed-domain QA that have used popular pre-trained neural contextual language encoders such as Bidirectional Encoder Representations from Transformers (BERT) [8] and other variants [9], [10]. These language models have achieved near-human, or even better, performance on popular open-domain QA tasks such as SQuAD [11]. Despite this progress in open-domain QA, existing models for closed-domain QA [4]- [7], [12] are comparatively less effective, and open-domain QA models do not perform as expected for domain-specific questions. Our goal in this paper is to develop a closed-domain QA system that can be easily adapted to different domains with only a small training dataset.
We propose an adaptable closed-domain MRC-style QA system based on a Convolutional Neural Network (CNN) and self-attention mechanism, with several characteristics that tackle contextual understanding for closed-domain QA. To enable the model to focus on question-relevant sentences, we apply an unsupervised filtering technique to remove those sentences which do not contain an answer for the question. The model also attempts to rewrite some questions (the need for which is determined by a tuned parameter) to make them less ambiguous. Closed-domain QA does not typically have large-scale datasets that could help develop a statistical model and, as a result, many strong open-domain QA models will struggle in closed domains. Applying statistical learning models on small datasets also introduces the problem of reliable generalization; thus, we divide the fine-tuning process into two steps: 1) transfer to the task; and 2) adapt to the target domain. The two-step fine-tuning process addresses the data scarcity problem for closed-domain QA. The first fine-tuning step only needs to be done once, but the second step is required each time we adapt the model to a new domain.
To the best of our knowledge, there is no existing flexible system that can effectively adapt to different closed QA domains. Our contributions are as follows: 1) A novel hierarchical CNN attention network for reading comprehension style QA, which aims to answer questions at sentence level for a given context in a specific domain. The CNN-attention model extracts local and mutual interactions among different words and phrase-level correlations to comprehend context sentences and questions. 2) A candidate answer identifier module and a question expansion module for selecting question-relevant sentences and rewriting ambiguous questions.

II. RELATED WORK

A. OPEN-DOMAIN QA
Lin et al. [13] developed a distantly supervised open-domain QA model that utilises an information retrieval-based paragraph selector to filter out noisy paragraphs and a paragraph reader to extract the correct answer using a multi-layer long short-term memory network. Yang et al. [14] demonstrated an end-to-end question answering system that integrates a BERT-based reader with the open-source Anserini information retrieval (IR) toolkit to identify answers from a large corpus of Wikipedia articles in an end-to-end fashion. Karpukhin et al. [15] focused on establishing the optimal training procedure utilising a sparse set of question and passage pairs. They designed retrieval solely through dense representations, with embeddings learnt from a modest number of questions and passages using a simple dual-encoder system. Seo et al. [16] introduced Dense-Sparse Phrase Index (DENSPI), an indexable query-agnostic phrase representation model for real-time open-domain QA on SQuAD. In their model, phrase representation combines dense and sparse vectors based on BERT and term-frequency-based encoding, respectively. Qu et al. [17] proposed an open-retrieval conversational QA (ORConvQA) model containing a retriever, reranker, and a reader that are all based on fine-tuned BERT and ALBERT based encoders and decoders.
They evaluated their model on the OR-QuAC dataset they created for conversational QA.

B. CLOSED-DOMAIN QA
Lende and Raghuwanshi [6] proposed a system for closed-domain QA for user queries related to education. An index term dictionary was created for the keywords extracted from a corpus created for the education domain.
To obtain the relevant answer, they applied Part-of-Speech (POS) tagging to all the filtered documents to find the suitable answer that conveys the same sense as the query. Sarkar et al. [7] developed a knowledge-based QA system, which only understands predefined insurance-related queries.
In the first step, the Apache OpenNLP tool is used to detect the query's subject-to-predicate triplets, and then relevant content was retrieved and ranked using matching criteria (query sentence similarity, sentence length, relative word importance, etc.). Badugu and Manivannan [5] created a closed-domain question answering framework for ''Hyderabad Tourism'' based on rule-based classification and similarity measures. The corpus is preprocessed, divided into sentences, and then grouped into various inquiry types such as What, Where, Who, and When. Sentence retrieval is conducted based on the question category, and sentence vectors are generated based on the term frequency and the inverse document frequency of the term. The Jaccard similarity score determines the final answer for each question. A BERT-based clinical question answering system was proposed by Rawat et al. [4], using BERT fine-tuned on medical corpora. Entity-level clinical concepts were integrated into the BERT architecture using the Enhanced Language Representation with Informative Entities (ERNIE) framework. ERNIE extracts contextualized token embeddings using BERT and generates entity embeddings using a multi-head attention model. Godavarthi and Sowjanya [18] built a closed-domain QA system that answers queries from the COVID-19 open research dataset. They fine-tuned ALBERT (A Lite BERT for self-supervised learning of language representations) [19] for retrieving the COVID-related information relevant to the query. Cai et al. [20] proposed an integrated framework for answering Chinese questions in restricted domains by modeling the question pair, comparing the input question to the existing question, and then identifying the answer output.

C. READING COMPREHENSION APPROACHES
Reading approaches can be classified into two categories based on whether the retrieved documents are processed independently or jointly for answer extraction. This subsection summarizes recent reading comprehension approaches (readers) in different QA models. With the use of BERT Reader, Dense Passage Retrieval [15] estimates the likelihood that a passage contains the answer and the probability that the token is the beginning and end of an answer span. It then selects the most probable answer based on what it calculates. Readers are often developed as graph-based systems to extract answer spans from passages [21], [22]. For example, in Graph Reader [22], the graph is used as input, and Graph Convolution Networks [23] are primarily used to learn the passage representation before pulling the answers from the most probable span. In DrQA [24], various features, such as POS, named entities (NE), and term frequencies (TF), are extracted from the context. The multilayer Bi-LSTM then predicts the span of the answer based upon the inputs, the question, and the paragraphs. As part of this process, argmax is applied across all answer spans to get a final average of answer scores across paragraphs using an un-normalized exponential function. BERTserini [14] provides a reader that works on BERT by removing the softmax layer, which allows for comparison and aggregation across different paragraphs. A Shared Normalization mechanism modifies the objective function and normalizes the start and end scores across all paragraphs to achieve consistent performance gains [13]. This mechanism eliminates the problem of unnormalized scores (e.g., exponential scores or logit scores) for all answer spans.

D. QUERY EXPANSION
This subsection explains recent question expansion (reformulation) approaches proposed for QA systems. According to GOLDEN Retriever [25], the query reformulation task can be recast as an MRC task because they both take a question and some context documents as inputs and aim to generate natural language strings as outputs. The query expansion module in GAR [26] is built using a pre-trained Seq2Seq model BART [27] to take the original query as input and generate new queries. The model is trained with various generation targets: the answer, the sentence containing the answer, and the passage title. Some other works generate dense representations to be used for searching in a latent space. For example, Multi-step Reasoner [28] employed a Gated Recurrent Unit (GRU) [29], taking token-level hidden representations from MRC and the question as input to generate a new query vector. The new query vector is then trained using Reinforcement Learning (RL) by comparing the extracted answer to the ground truth. Xiong et al. [30] use a pre-trained masked language model (such as RoBERTa [31]) as their encoder, which concatenates all previous passages and the question representation to encode a dense query.

III. ADAPTABLE CLOSED-DOMAIN QA MODEL
In this section, we describe our novel reading comprehension model for sentence-level closed-domain QA that can be tuned with a small training dataset for various domains. We introduce a candidate answer identifier (CAI) module based on syntactic and linguistic rules to reduce the context to the sentences that could contain the answer to the given question (candidate answer sentences). We designed a neural network based on a CNN and a self-attention mechanism to analyze and score the candidate answer sentences. The novelty lies in obtaining different levels of contextual understanding of the context sentences and the question by extracting important semantic features and their correlations. The CNN-attention layer assigns a relevance score to each candidate answer sentence selected by the CAI. We have also introduced a question expansion (QE) module for rewriting ambiguous questions (shown in Fig. 3), which, in spirit, is close to the query expansion technique in Information Retrieval. The key advantage of this module is that it rewrites the question and produces synonym versions to help the system select the answer sentence with more confidence. The overall framework of our model is shown in Fig. 1.

A. CANDIDATE ANSWER IDENTIFIER (CAI)
Unlike previous approaches, we filter out irrelevant content from the context P to help improve our results. We analyze the linguistic features of each sentence to determine its capability for answering different question categories (When, Where, Who, What, Why, How). We developed a strategy to classify the context sentences into question categories to facilitate the answer selection. To this end, we use a popular tool called Giveme5W1H [32], an open-source system that uses syntactic rules to automatically extract the relevant phrases from English news articles for answering the 5W1H questions. The advantage of this tool is that it can be customized towards one's needs. Since our main goal is identifying candidate answers for each question category, we have customized different components of Giveme5W1H. We have designed new methods and rules to improve and adapt Giveme5W1H for candidate answer selection, since Giveme5W1H does not cover all the syntactic rules for ''Why'', ''What'', and ''Where''. Also, we have used a new parser function for finding all types of date-time named entities (NEs) for ''When''. Additional methods are added to Giveme5W1H to support all types of ''How'' questions such as ''How many'', ''How much'', and ''How''. We perform six independent identification functions to retrieve the candidate answers for the six (5W1H) categories. The candidate answer identifier module uses the Giveme5W1H preprocessing steps, gets the context as input, and splits it into sentences to process them separately. After checking all the rules and methods for each sentence, the sentence is added to the corresponding categories, and in the end, a list of candidate answers for each question category is prepared. More details on the different rules and how they have been incorporated into our candidate answer identifier module are given below and depicted in Fig. 2.
• When: For detecting all types of temporal NEs, including all formats of date-time, duration, etc., we have added the dateparser Python package, in addition to SUTime [33], which is used in Giveme5W1H.
• Who: In Giveme5W1H, sentences that have a subject are considered as ''who'' candidate answers. The first noun phrase (NP) that is a direct child of the sentence in the parse tree and has a verb phrase (VP) as its next right sibling is the sentence subject. We also considered sentences containing Person or Organization NEs as ''who'' candidate answers, which the original Giveme5W1H misses.
• What: In Giveme5W1H, the ''who'' candidates for which a VP is the next right sibling in the parse tree are considered as ''what'' candidates. We have extended this function because the original function does not work reliably on many ''what'' candidate answers. We added an extra function that selects sentences as candidate answers for ''what'' by matching the POS pattern (Noun + Verb* + Preposition* + Adjective*), where the order of the tags is not strict.
The candidate answer sentence list for questions that do not belong to the 5W1H categories contains all the context sentences.
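To make the categorization concrete, here is a minimal, self-contained sketch of the idea behind the CAI module. It is not the actual Giveme5W1H rule set: the regexes, cue words, and the caller-supplied named-entity list below are illustrative stand-ins for the syntactic rules, SUTime/dateparser, and NER used in the real module.

```python
import re

# Toy date/time pattern standing in for SUTime + dateparser.
DATE_RE = re.compile(
    r"\b(\d{4}|January|February|March|April|May|June|July|August|"
    r"September|October|November|December|yesterday|today)\b", re.I)

def classify_sentence(sentence, named_entities=()):
    """Return the 5W1H categories a sentence may be able to answer.

    named_entities: iterable of (text, label) pairs from any NER tool;
    PERSON/ORG labels mark 'Who' candidates, LOC/GPE mark 'Where'.
    """
    labels = {label for _, label in named_entities}
    cats = set()
    if DATE_RE.search(sentence):
        cats.add("When")                      # temporal expression found
    if labels & {"PERSON", "ORG"}:
        cats.add("Who")                       # subject-like entity found
    if labels & {"LOC", "GPE"}:
        cats.add("Where")                     # location entity found
    if re.search(r"\b(because|due to|in order to)\b", sentence, re.I):
        cats.add("Why")                       # causal cue words
    if re.search(r"\d+(\.\d+)?", sentence):
        cats.add("How")                       # quantities for How many/much
    return cats
```

A sentence can land in several category lists at once, matching the module's behavior of adding each sentence to every category whose rules it satisfies.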

B. CNN-ATTENTION BASED ANSWER SELECTOR
Given the question q, represented as a sentence, there are K possible candidate answers CA_1, CA_2, ..., CA_K present in the accompanying context P associated with q. The question q with m tokens (q = q_1, q_2, ..., q_m) and a candidate answer sentence CA_i with n tokens (CA_i = c_1, c_2, ..., c_n) are combined into a single sequence, separated by the special token [SEP], as the input of the CNN-attention layer. We derive the semantic representation of q and CA_i using a pre-trained contextual language model such as BERT or ALBERT for the embedding layer; the output of BERT is taken only for the first token [CLS], which is used as the aggregate representation of the sequence. The advantage is that we derive high-quality representations, which cannot be obtained using methods such as static word embeddings [34], [35]. Our goal is to obtain the most plausible answer CA_j to the question q in P. BERT uses a multi-layer bidirectional transformer [36] network to encode contextualized language representations. Building on BERT, the ALBERT model introduces two parameter-reduction techniques to lower memory consumption and increase training speed. To calculate the scores for candidate answer sentences, we fine-tune the BERT and ALBERT pre-trained models with untrained layers of CNN, pooling, and attention. The CNN and self-attention mechanism focus the model on the most important features and their correlations when constructing the question and sentence representations.
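The pairing-and-selection step can be sketched as follows; the scorer here is a toy word-overlap function standing in for the fine-tuned CNN-attention model, and only the sequence construction mirrors the text above.

```python
def build_pair(question, candidate):
    # BERT-style single sequence: [CLS] question [SEP] candidate [SEP]
    return f"[CLS] {question} [SEP] {candidate} [SEP]"

def select_answer(question, candidates, score_fn):
    # score_fn stands in for the fine-tuned model that maps the [CLS]
    # representation of a pair to a relevance score.
    scored = [(score_fn(build_pair(question, ca)), ca) for ca in candidates]
    return max(scored)[1]
```

In the real system, `score_fn` is the CNN-attention head over contextual embeddings; here any callable on the paired string works.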

1) CONVOLUTIONAL NEURAL NETWORK
The CNN extracts salient n-gram features from the input sentence to create an informative latent semantic representation of the sentence for downstream tasks [37]. For each sentence, let e_i ∈ R^d represent the word embedding for the i-th word in the sentence, where d is the dimension of the word embedding, and the given sentence has n words. Convolution is then performed on this input embedding layer. It produces a new feature by applying a filter K ∈ R^(h·d) of size h on a window of h words.
For example, a feature c_i is generated from the window of words e_{i:i+h-1}, as shown in (1).
Here, f is a non-linear activation function, for example, the hyperbolic tangent, and b ∈ R is the bias term. The filter (also called a kernel) K is applied to all possible windows (sliding over the entire sentence embedding matrix) using the same weights to create the feature map. We divide the sentence of length n into {e_{1:h}, e_{2:h+1}, ..., e_{i:i+h-1}, ..., e_{n-h+1:n}} and apply the filter to each component. The feature map obtained by the filter is shown in (2).
A convolution layer is usually followed by a pooling strategy on each filter to provide a fixed-length output and reduce the output's dimension while retaining the most salient features. In this paper, maximum pooling is applied on each feature map, which gives us low-dimensional dominant features, as shown in (3).
The feature ĉ is obtained by one convolution filter together with the maximum pooling layer, and the feature sequence obtained with t convolution filters is shown in (4).
In this stage, important n-gram features of the candidate answer sentence and question are extracted by the CNN, and the generated feature vectors are concatenated to form the new global feature matrix Y as the input to the self-attention layer.
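The feature-extraction step described by (1)-(4) can be sketched in NumPy as follows; the filter weights and embeddings are placeholders supplied by the caller, and tanh is used as the activation f:

```python
import numpy as np

def cnn_features(E, filters, biases, h):
    """CNN feature extraction over one sentence, mirroring Eqs. (1)-(4).

    E: (n, d) sentence embedding matrix.
    filters: (t, h*d) matrix of t flattened convolution filters of size h.
    biases: (t,) bias terms.
    Returns the t-dimensional vector of max-pooled features.
    """
    n, d = E.shape
    # All windows e_{i:i+h-1}, each flattened to length h*d
    windows = np.stack([E[i:i + h].reshape(-1) for i in range(n - h + 1)])
    # Eqs. (1)-(2): c_i = tanh(K . e_{i:i+h-1} + b) for every window/filter
    feature_maps = np.tanh(windows @ filters.T + biases)   # (n-h+1, t)
    # Eqs. (3)-(4): max pooling over each feature map, one value per filter
    return feature_maps.max(axis=0)                        # (t,)
```

The per-sentence outputs for the question and each candidate answer would then be concatenated into the global feature matrix Y.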

2) SELF-ATTENTION LAYER
The self-attention mechanism primarily focuses on the internal dependence of the input [38]. In our model, the self-attention layer calculates the semantic association between the features extracted from the question and the candidate answer sentence to determine the candidate answer's relevance score. In each self-attention mechanism, there is a query matrix (Q), a key matrix (K), and a value matrix (V). The output of the CNN layer, matrix Y, is the initial value of the query matrix (Q), key matrix (K), and value matrix (V), as shown in (5).
Scaled Dot-product Attention (SDA) is the main concept of the self-attention mechanism. It first computes the similarity as the dot product of Q and K, then divides by √d_k (where d_k is the dimension of K) to prevent the dot product from becoming too large. The result is then normalized using the softmax function before being multiplied by the matrix V to obtain the attention output. The SDA operation is depicted in (6).
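A minimal NumPy sketch of (5)-(6), with Q = K = V initialized from the CNN output Y:

```python
import numpy as np

def scaled_dot_product_attention(Y):
    """Self-attention over the CNN feature matrix Y.

    Eq. (5): Q = K = V = Y.
    Eq. (6): softmax(Q K^T / sqrt(d_k)) V.
    """
    Q = K = V = Y
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax, numerically stabilised by subtracting the row max
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V
```

Each output row is a convex combination of the rows of V, weighted by the normalized similarities.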
We perform average pooling on the output matrix of the self-attention layer to obtain the feature vector f for the integrated CA_i and q. We feed f through the fully connected layer to the final softmax layer. In the answer selection task, there are two classifier labels (similar = 1, dissimilar = 0). We modified the final layer to get the predicted Score(CA_i) for the similar label, as shown in (7) and (8).
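The equation bodies for (7) and (8) did not survive extraction; a plausible reconstruction, based on the surrounding description (a fully connected layer over f followed by a softmax over the two labels, with the score taken as the probability of the similar label), is:

```latex
P(C \mid CA_i, q) = \mathrm{softmax}(w_c f + b_c) \tag{7}
\mathrm{Score}(CA_i) = P(C = 1 \mid CA_i, q) \tag{8}
```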
where w_c is the weight matrix, b_c is the bias, and C is the label. We rank all the candidate answer sentences based on the obtained scores, and the candidate answer sentence with the highest score is selected as the answer sentence for the question q. This method prevents more than one sentence from receiving label 1.

C. QUESTION EXPANSION
Some questions are more ambiguous or convey less domain-related information than others [39]. Inspired by research in Information Retrieval, where query terms are expanded with relevant keywords from the vocabulary, we developed a strategy to use more appropriate terms if the question does not convey much information to our model. We introduce a parameter θ, where 0 < θ < 1, which is automatically tuned from the data and helps us to assess whether question expansion is needed. If the selected answer sentence score is less than θ, the question expansion module generates question synonyms until a candidate answer achieves a score greater than θ. We have designed a lightweight hybrid question expansion method based on contextualized embeddings and lexical resources (WordNet) that replaces some question keywords with domain-related synonyms. We extract the question keywords by POS tagging the question and removing the symbols, stopwords, and NEs to keep the words most important to the question. After selecting the keywords, expansion terms are extracted from WordNet considering the keyword's role in the question (for example, if the keyword is an adjective, adjective synonyms are selected accordingly). Thereafter, ranking and filtering functions are applied to choose the most appropriate expansion terms for each keyword. The expansion terms that do not exist in the domain vocabulary are eliminated from the list, and the remaining ones are considered for calculating their relevance to the question.
We further train the pre-trained BERT model on the domain-specific corpus to generate domain-specific embedding vectors for the expansion terms and questions. The semantic similarity between the question and expansion-term embedding vectors is then calculated, the expansion terms are ranked by their relatedness to the whole question, and those more semantically related to the question are retained.
After finalizing the expansion list, each expansion term is transformed to the appropriate form to get the same POS tag as the keyword (for example, if the keyword is plural Noun(NNS), its expansion term should be the same). Then, each keyword is replaced with one of the expansion terms to form a synonym version of the question that conveys the same context. The generated synonym versions have the same structure as the original question since only some keywords are replaced with their synonyms. As a result, there is no need to do grammar checking for the generated versions.
For example, ''What are main steps for mitigating the COVID-19 transmission during transport of suspected and confirmed patients?'' is a question from the COVID-QA dataset that needs expansion because its answer sentence score is less than θ. The first step is keyword selection and the extraction of candidate synonyms, with terms absent from the domain vocabulary discarded. The second step of the filtering is to measure the expansion terms' semantic relevance to the question. We filtered out the terms with lower relevance (lower semantic similarity) to the question for keywords with more than one synonym. The average semantic relevance to the question is calculated over all of a keyword's synonyms, and those whose semantic relevance exceeds this average value (α) are retained for the keyword. After this step, the final list of expansion terms with higher semantic relevance to the question remains: {main: major, primary - transmission: infection - transport: transfer - confirmed: corroborated, affirmed}. We automatically transform the synonyms to the appropriate form to get the same POS tag as the keyword; ''confirmed'' has the ''VBN: past participle'' POS tag, so its synonyms are converted to past-participle form.
The synonym versions of the question are generated by replacing the keywords with their synonyms. One of the expanded versions of our example question is ''What are major steps for mitigating the COVID-19 infection during transfer of suspected and corroborated patients?''.
Replacing the keywords (one adjective, two nouns, and one verb) with domain-relevant and question-related synonyms generates other versions of the question with the same meaning. We analyze the candidate answers for the generated synonym versions of the question to find the answer sentence more accurately. The final selected answer sentence is ''HCWs who handle the transport of COVID-19 patients must consider the following principles: firstly, early recognition of the deteriorating patient; secondly, HCW safety; thirdly, bystander safety; fourthly, contingency plans for medical emergencies during transport; fifthly, post-transport decontamination.'' with the score 0.72. If the scores for the answers selected by the synonymized versions are all lower than θ, the answer sentence with the highest score (among all the selected answers) will be chosen. The question expansion module described in Algorithm 1 takes a question as input and generates synonym versions of the question in four steps: 1) keyword detection; 2) expansion terms (synonyms) extraction; 3) filtering inappropriate synonyms; and 4) preparing expansion terms and replacing keywords with their corresponding synonyms to generate various synonym versions of the question.
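A toy sketch of the expansion-and-filtering logic described above; the hard-coded synonym table and the deterministic toy embedding stand in for WordNet and the domain-tuned BERT embeddings:

```python
import numpy as np

# Toy synonym table standing in for WordNet lookups (already restricted
# to the domain vocabulary, i.e. after the first filtering step).
SYNONYMS = {"main": ["major", "primary"], "transport": ["transfer"]}

def expand(question_tokens, keywords, embed):
    """Replace each keyword with a synonym whose similarity to the whole
    question exceeds the mean similarity of that keyword's synonyms
    (the alpha filter described in the text)."""
    q_vec = np.mean([embed(t) for t in question_tokens], axis=0)
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    out = []
    for tok in question_tokens:
        if tok in keywords and tok in SYNONYMS:
            cands = SYNONYMS[tok]
            sims = [cos(embed(s), q_vec) for s in cands]
            kept = [s for s, v in zip(cands, sims) if v >= np.mean(sims)]
            out.append(kept[0] if kept else tok)   # pick a surviving synonym
        else:
            out.append(tok)
    return out
```

In the full module this loop would run inside the θ-gated retry: new synonym versions are generated until a candidate answer scores above θ.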

IV. EXPERIMENTS

A. EXPERIMENTAL DATASET
We have used four closed-domain datasets to verify the performance of the proposed model. Three datasets were derived from the popular SQuAD collection [11] due to the limited number of closed-domain QA datasets that are publicly available. The datasets are from three domains with different concepts and different sizes: Tesla (person); California (region); and European-Union-law (system), referred to as EU-law in our results. COVID-QA [40], a SQuAD-style Question Answering dataset, was added as the fourth closed-domain dataset for our experiments. The datasets consist of Context-Answer-Question triples. The Tesla, California, EU-law, and COVID-QA datasets consist of 565, 746, 315, and 2019 questions, respectively, along with annotated answers and context (see Table 1). We used Stanford CoreNLP [41] for sentence splitting, tokenization, full parsing, POS-tagging, preprocessing, and preparing the context in the CAI module for candidate answer selection.

B. IMPLEMENTATION DETAILS
We used two-step training for the contextualized CNN-attention answer selector model: 1) transfer to the task; and 2) adaptation to the target domain. Performing a single fine-tuning would require a large dataset for the target domain, which is impractical due to the difficulty and cost of collecting training data specific to that domain. Thus, the first step transfers the model to the target task, and the second step adapts the model to the target domain with a small training dataset. We have utilized the Natural Questions (NQ) dataset [42], consisting of 300,000 naturally occurring questions along with human-annotated answers from Wikipedia pages, for the first step of training. This dataset provides a whole Wikipedia page for each question, which is significantly longer compared to MRC datasets (e.g., SQuAD). Following Liu et al.
[43], we generate multiple document spans by splitting the Wikipedia page using a sliding window with a size of 512 tokens and a stride of 192 tokens to generate the negative (i.e., no answer) and positive (i.e., has answer) spans. Then, we only preserve the positive spans (those containing the annotated short answer) as the context, and the negative ones are discarded. For both the first and second steps of training, the question-sentence pairs were generated by CAI. After generating the candidate answer sentences for question categories with CAI, the candidate answer sentence and question pairs were generated for training the CNN-attention-based answer selector. The candidate answer sentence which contains the annotated answer gets label 1, and the other candidate sentences get label 0. The first fine-tuning step is done only once, and the second step is performed each time we adapt the model to a new domain. We used the pre-trained BERT-base and ALBERT-base models for token embeddings, consisting of 12 Transformer blocks with 12 self-attention heads and a hidden size of 768. There is no analytical formula to calculate appropriate hyperparameter values that yield the optimal model parameters. Therefore, we used tools to automatically tune the model hyperparameters. We performed hyperparameter optimization using the Ray Tune Python library (https://docs.ray.io/en/latest/tune/index.html) with the Hyperopt algorithm [44]. The filter size, number of filters, learning rate, batch size, and θ (QE threshold) hyperparameters were optimized for each domain, as shown in Table 2. The search spaces are (0, 1), {2, 3, 4, 5}, {10, 20, 30}, {1e-7, 2e-7, 1e-8, 2e-8, 5e-8}, and {4, 8, 16} for θ, filter size, number of filters, learning rate, and batch size, respectively. The optimal combination of hyperparameter values that maximizes the model performance is discovered by the Hyperopt algorithm each time the model is tuned for a new domain.
The Hyperopt algorithm utilizes a form of Bayesian optimization and requires the search space, the loss function, the optimization algorithm, and a database for recording the hyperparameter tuning history (score, configuration). We set the maximum sequence length for BERT and ALBERT to 128 tokens. We utilized the Adam optimization algorithm [45] for the parameter updates. We used the cross-entropy loss function to calculate the loss. The optimal values for filter size, number of filters, learning rate, and batch size for the first step of training are {2, 3, 4}, 100, 2e-5, and 64, respectively. We applied early stopping on the development-set loss for both training stages. We set the maximum number of epochs to 9 and 3 for the transfer and adapt steps, respectively. For the QE module, we used domain-specific corpora (the concatenation of contexts for one domain) for tuning the pre-trained BERT to generate domain-specific embeddings. We automatically prepared the domain-specific corpus for ''masked language model'' and ''next sentence prediction'' to generate the data for pre-training on each domain. We utilized the NodeBox English library, which has been succeeded by the Pattern Python library, for analyzing the keyword's role and for expansion term transformation.
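The span-generation step described above (512-token windows, 192-token stride, keeping only spans that contain the annotated answer) can be sketched as follows; tokenization itself is abstracted away:

```python
def contains(window, answer):
    """True if the answer token sequence occurs contiguously in window."""
    n = len(answer)
    return any(window[i:i + n] == answer for i in range(len(window) - n + 1))

def positive_spans(tokens, answer, size=512, stride=192):
    """Slide a fixed-size window over a long token list and keep only
    the spans that contain the annotated answer (positive spans)."""
    spans = []
    for start in range(0, max(1, len(tokens) - size + stride), stride):
        window = tokens[start:start + size]
        if contains(window, answer):
            spans.append(window)
    return spans
```

Each retained span then serves as a context for one training pair; negative spans are simply never appended.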

C. EVALUATION METRICS
We adopt two metrics, Exact Match (EM) and F1 score, to evaluate our model. The EM score is the percentage of predictions that exactly match the ground-truth answer, and the F1 score measures the average overlap between the prediction and the ground-truth answer.
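These metrics follow the standard SQuAD-style definitions; a common per-example implementation (our sketch, with the usual answer normalization simplified to lowercasing and whitespace splitting) is:

```python
from collections import Counter

def exact_match(prediction, ground_truth):
    """1 if the normalized prediction equals the normalized gold answer, else 0."""
    return int(prediction.lower().split() == ground_truth.lower().split())

def f1_score(prediction, ground_truth):
    """Token-level F1: harmonic mean of precision and recall over the
    bag-of-tokens overlap between prediction and gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

The dataset-level scores are the averages of these per-example values over all questions.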

D. COMPARATIVE METHODS
To demonstrate the effectiveness of our proposed model, we compare against several other comparative approaches:
• KPOS-QA [6] is a closed-domain QA system (their dataset is not publicly available). We simulated their approach for sentence-level QA based on the details provided in their paper (ranking and selecting the answer using keywords and POS tags extracted from the query and context).
• R-TFIDF [5] is another closed-domain QA system (their dataset is also not publicly available). We simulated their approach for sentence-level QA based on the details provided in their paper (rule-based sentence classification and cosine similarity between TF-IDF vectors of the question and the sentences).
• AttReader [46] presents BiLSTM networks with an attention mechanism and the GloVe language model for reading comprehension in QA.
• QANET [47] is an MRC model for open-domain QA based on convolutions, global self-attention, and the GloVe language model.
• cdQA is an end-to-end closed-domain QA system built on top of pre-trained BERT.
• Retro-Reader [48] is an open-domain MRC model that ranks 5th on the SQuAD2.0 leaderboard. It proposes two reading modules (a sketchy reading module and an intensive reading module) to find the answer span and detect unanswerable questions. In the intensive reading module, two question-aware matching mechanisms based on the Transformer and multi-head attention are introduced for predicting the answer.
• EtoE-Covid-QA [50] fine-tuned RoBERTa-large on SQuAD2.0 and NQ, and proposed both language modeling on the CORD-19 collection and an example-generation model for MRC training for Covid-19 QA.
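The R-TFIDF baseline above ranks candidate sentences by the cosine similarity of their TF-IDF vectors to the question's. A self-contained sketch of that scoring (our reconstruction from the description; the rule-based classification stage is omitted) is:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors (as sparse dicts) for tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] / len(doc) * idf[t] for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def rank_sentences(question, sentences):
    """Return sentences ordered by TF-IDF cosine similarity to the question."""
    vecs = tfidf_vectors([question] + sentences)
    q_vec, s_vecs = vecs[0], vecs[1:]
    scored = sorted(zip(sentences, s_vecs),
                    key=lambda pair: cosine(q_vec, pair[1]), reverse=True)
    return [s for s, _ in scored]
```

The top-ranked sentence is returned as the answer; such purely lexical matching is exactly what the contextualized baselines below improve on.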

V. RESULTS AND DISCUSSION
We present the results obtained by our model and the others on the development set in Table 3. We report results for two variants of our model (CA-AcdQA): 1) pre-trained BERT; and 2) pre-trained ALBERT. For AttReader, QANET, cdQA, and Retro-Reader, we used their public code to apply the models to the datasets and generate results for closed-domain sentence-level QA. We used the same pre-trained language models as reported in their respective papers and fine-tuned the models with the two training stages described in IV-B. The ZCovid-QA, EtoE-Covid-QA, and OCovid-QA baselines are designed only for Covid-19 QA, and their results are reported in their respective papers. We categorized the comparative models into two groups: 1) those based on conventional language models (KPOS-QA, R-TFIDF, AttReader, QANET); and 2) those based on contextualized language models (cdQA, Retro-Reader, ZCovid-QA, EtoE-Covid-QA, OCovid-QA). We observe that KPOS-QA, which is based on context and question keyword extraction using POS tags, achieves the worst results; this indicates a strong need for high-quality vector representations of the context and question. R-TFIDF improves the F1 score by 3%-11% by applying traditional TF-IDF vectors and sentence classification. QANET and AttReader, in which the pre-trained GloVe language model encodes the context and question, outperformed KPOS-QA and R-TFIDF. QANET outperformed AttReader because it does not rely on a recurrent structure, unlike AttReader, which is based on BiLSTMs. cdQA outperformed QANET and AttReader and improved the EM score due to its BERT-based reader architecture. Retro-Reader outperformed the other baseline methods on all datasets since it employs a pre-trained Transformer-based language model and an attention mechanism for reading comprehension.
Evaluating Retro-Reader for closed-domain reading comprehension shows that its performance degrades slightly relative to open-domain QA (where it achieves an F1 score of 91.3 and an EM of 88.8).
Our model outperforms all baseline models on all datasets because it explores the association between the features extracted from the question and the candidate answer sentences by applying CNN and self-attention to their joint representation. The contribution of the CAI and QE modules, which select appropriate sentences from the context and rewrite vague questions, should also not be disregarded. All baseline methods perform worse on the COVID-QA dataset because Biomedical QA (BQA) is more challenging than the other domains: more reasoning is needed over the question and the biomedical text. Another challenge is clinical term ambiguity, caused by variation in clinical terminology and the frequent use of abbreviations and esoteric medical terms. BQA evaluation is also difficult because most evaluation metrics do not account for the rich synonym relationships in biomedicine. Since biomedicine is a highly specialized domain, understanding complex biomedical knowledge is required, and contextualized language models pre-trained only on open-domain corpora are insufficient. We therefore also evaluated our approach using BERT pre-trained on the biomedical domain, as shown in Table 3.
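The core scoring idea our model relies on, 1-D convolution over the joint question-sentence token representation followed by scaled dot-product self-attention over the resulting local features, can be sketched in plain Python as below. The toy dimensions, single filter size, random weights, and mean-pooled score are illustrative assumptions, not the paper's exact architecture (which uses BERT/ALBERT's 768-dimensional embeddings and tuned filter settings).

```python
import math
import random

def conv1d(x, filters):
    """x: seq_len x dim token embeddings (lists of lists); filters: list of
    k x dim filter matrices. Valid 1-D convolution over token windows, ReLU."""
    k = len(filters[0])
    out = []
    for i in range(len(x) - k + 1):
        window = x[i:i + k]
        out.append([max(sum(f[a][b] * window[a][b]
                            for a in range(k) for b in range(len(x[0]))), 0.0)
                    for f in filters])
    return out

def self_attention(h):
    """Scaled dot-product self-attention over feature rows h (m x d),
    capturing dependencies between the CNN-extracted local features."""
    d = len(h[0])
    out = []
    for q in h:
        scores = [sum(qa * ka for qa, ka in zip(q, key)) / math.sqrt(d)
                  for key in h]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]  # softmax over positions
        out.append([sum(w * row[j] for w, row in zip(weights, h))
                    for j in range(d)])
    return out

random.seed(0)
seq_len, dim, k, n_filters = 8, 6, 3, 4  # toy sizes; the paper uses dim 768
x = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
filters = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(k)]
           for _ in range(n_filters)]
features = conv1d(x, filters)        # (seq_len - k + 1) x n_filters n-gram features
attended = self_attention(features)  # features re-weighted by their dependencies
```

The attended features would then be pooled and fed to a classifier that labels the candidate sentence as answer (1) or not (0), as in the training setup of Section IV-B.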

VI. ABLATION STUDY
We investigate the effect of the question expansion component and the CNN-Attention layer individually to understand the role each plays in our model. In Table 4, we present results without the CNN-Attention module. Fine-tuning the QA pipeline without the CNN-Attention layer, with either pre-trained BERT or ALBERT, reduced performance significantly: the CNN-Attention layer captures the semantic connections between sentence and question features, boosting performance by 7%-11% in both EM and F1 score. Table 5 shows the model's performance with and without the question expansion (QE) component. Additionally, we examined the BART and T5 pre-trained question paraphrasers for rewriting vague questions. T5 caused a slight performance degradation compared to ''CA-AcdQA (ALBERT) without QE'' since it is not tuned with any domain-specific training data for question paraphrasing. The BART paraphraser outperformed T5, although it did not improve the model performance significantly. T5 and BART require fine-tuning on a domain-specific paraphrase dataset for better performance, which is not feasible for every domain. We conclude that rewriting the question without considering the domain terminology misleads the model by generating domain-irrelevant questions. The proposed QE module, by contrast, generates in-domain question paraphrases (synonymized questions) without needing training data for every domain, and it improved the EM and F1 scores by 1%-2% for all domains. We also conducted experiments to study the role played by each question type, i.e., ''What'', ''Where'', ''When'', ''Why'', ''Who'', and ''How'', and to show that the customizations we made to Giveme5W1H are useful to our framework.
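The QE decision described above, substituting a question keyword with an in-domain expansion term only when the similarity of their domain-tuned embeddings exceeds the threshold θ, can be sketched as follows. The toy embedding table in the usage below and the greedy one-for-one substitution are illustrative assumptions; the actual module uses BERT embeddings tuned on the domain corpus and the Pattern-based role analysis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_question(tokens, embeddings, theta):
    """For each keyword, substitute its most similar in-domain term when the
    cosine similarity of their (domain-tuned) embeddings exceeds theta;
    otherwise keep the original token."""
    expanded = []
    for tok in tokens:
        best, best_sim = tok, theta
        tok_vec = embeddings.get(tok)
        if tok_vec is not None:
            for cand, vec in embeddings.items():
                if cand == tok:
                    continue
                sim = cosine(tok_vec, vec)
                if sim > best_sim:
                    best, best_sim = cand, sim
        expanded.append(best)
    return expanded
```

Because θ gates every substitution, tuning it per domain (as in the hyperparameter search of Section IV-B) trades off paraphrase coverage against the risk of introducing domain-irrelevant terms.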
We calculated the F1 score for each question type individually to analyze our model's performance on different question categories (see Table 6, which compares the F1 of our models for each question category with the proposed candidate answer identifier, i.e., customized Giveme5W1H, against the original Giveme5W1H candidate answer identifier). We also report the number of instances in each question category for the four datasets in Table 1. One observation is that performance is not affected by the number of instances in a category; we believe this is because the proposed framework does not rely heavily on statistical information, which makes it reliable even in low-resource situations. The numbers of questions that do not belong to the 5W1H categories are 15, 76, 59, and 200 for the Tesla, California, EU-law, and COVID-QA datasets, respectively. Furthermore, the candidate answer identifier automatically selects the appropriate sentences in each question category based on the linguistic rules in III-A. An advantage our model gains from the CAI component is the reduced number of candidate answers, which significantly improves the model's effectiveness for long contexts by excluding question-irrelevant sentences. Fig. 4 displays the average F1 score across all datasets for the model with the original Giveme5W1H and with the proposed CAI on each question category. Adding linguistic rules and functions to Giveme5W1H improved the model's performance in every question category, with ''What'', ''Why'', and ''When'' improving the most.
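The per-category breakdown reported in Table 6 and Fig. 4 amounts to grouping questions by their 5W1H type and averaging the per-example F1 within each group; a minimal sketch of that aggregation (the pair-based input format is an illustrative assumption):

```python
from collections import defaultdict

def f1_by_question_type(examples):
    """examples: iterable of (question_type, f1) pairs, e.g. ("What", 0.82).
    Returns the mean F1 per question category (5W1H or otherwise)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for q_type, f1 in examples:
        sums[q_type] += f1
        counts[q_type] += 1
    return {t: sums[t] / counts[t] for t in sums}
```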

VII. CONCLUSION
Our proposed closed-domain QA model improves upon state-of-the-art models across different closed-domain datasets: Tesla (person); California (region); EU-law (system); and COVID-QA (biomedical). We presented a novel approach that exploits CNN and the self-attention mechanism to address the generalization problem of training on small datasets. Our model computes the semantic association between the local features extracted from context sentences and the question by employing CNN and the self-attention mechanism. Furthermore, components such as the candidate answer identifier and question expansion assist the model by limiting the choice to relevant sentences for each question category and by removing ambiguity from questions through keyword replacement. Experimental results show that our proposed model outperforms the other models across different domains without any knowledge base. In the future, our goal is to unify this model with a retriever to further improve its QA performance.