cpgQA: A Benchmark Dataset for Machine Reading Comprehension Tasks on Clinical Practice Guidelines and a Case Study Using Transfer Learning

Biomedical machine reading comprehension (bio-MRC), a crucial task in natural language processing, is a vital application of computer-assisted clinical decision support systems. It can help clinicians extract critical information effortlessly for clinical decision-making by comprehending and answering questions from biomedical text data. While recent advances in bio-MRC consider text data from resources such as clinical notes and scholarly articles, clinical practice guidelines (CPGs) remain unexplored in this regard. CPGs are a pivotal component of clinical decision-making at the point of care, as they provide recommendations for patient care based on the most up-to-date information available. Although CPGs are inherently terse compared to a multitude of articles, clinicians often find them lengthy and complicated to use. In this paper, we define a new problem domain - bio-MRC on CPGs - where the ultimate goal is to assist clinicians in efficiently interpreting clinical practice guidelines using MRC systems. To that end, we develop a manually annotated, subject-matter-expert-validated benchmark dataset for the bio-MRC task on CPGs - cpgQA. This dataset aims to evaluate intelligent systems performing MRC tasks on CPGs. Hence, we employ state-of-the-art MRC models to present a case study illustrating an extensive evaluation of the proposed dataset. We address the lack of training data in this newly defined domain by applying transfer learning. The results show that while the current state-of-the-art models perform well, with a 78% exact match score on the dataset, there is still room for improvement, warranting further research in this problem domain. We release the dataset at https://github.com/mmahbub/cpgQA.


I. INTRODUCTION
A computer-assisted clinical decision support system, or CDSS, aims at assisting healthcare professionals in making valuable patient-specific clinical decisions [1]. Some applications of CDSS are predicting mortality, monitoring drug abuse, and delivering knowledge [1], [2]. Among them, delivering knowledge by answering user-defined questions from complex biomedical narratives is one of the crucial applications of CDSS that incorporates natural language processing (NLP) [1]. For better patient care, the paradigm of Evidence-Based Medicine (EBM) requires clinicians to be acquainted with up-to-date information regarding diagnosis, treatment, prognosis, recommendations, and treatment risks and benefits [3]. Due to time limitations, it is difficult for clinicians to search the available resources for information that is relevant, high-quality, and reliable. Biomedical machine reading comprehension (bio-MRC), an important task in biomedical natural language processing (bio-NLP), aims at efficiently tackling this task of retrieving information from complex biomedical text documents and excerpts. The alternatives - information retrieval (IR) systems and search engines - have several disadvantages when retrieving information. When queried, IR systems provide a list of documents to be perused by the user, which is time-consuming [4], whereas the ranked search results provided by search engines (e.g., Google), while faster than IR systems, do not satisfy the ideals of evidence-based medicine (EBM) [5]. On the contrary, bio-MRC uses intelligent systems to comprehend complex biomedical documents and provide exact answers to users' queries in seconds instead of returning lists of documents, and is thus more time-efficient and intuitive. Depending on the types of users, types of contents, and motivations for queries, the applications of MRC in the biomedical domain can be divided into several sub-domains - MRC with (i) scientific literature, where the goal is to learn about cutting-edge scientific advances and get professional-level answers, (ii) clinical notes, where the goal is to make patient-specific clinical decisions and get professional-level answers, (iii) consumer health queries raised on search engines by the general public, where the goal is to seek advice or knowledge about one's own health conditions, and (iv) medical licensing examination questions, where the goal is to test the biomedical knowledge of medical professionals [6].
In this work, we introduce another sub-domain of bio-MRC by defining a new problem - the MRC task on Clinical Practice Guidelines (CPGs). CPGs are recommendations made by systematically reviewing the most recent available research evidence. These recommendations cover patient-specific care based on the best available research evidence, value judgments on risks and benefits, alternative care options, patient management, and practice experience - aiming to assist clinicians in delivering the best practice and care for their patients [7]. Despite their conciseness compared to reviewing multiple resources, clinicians often find CPGs lengthy, complex, and time-consuming to use [8]. The goal of the bio-MRC task on CPGs is to save clinicians' valuable time and effort by assisting them in comprehending the complex narratives in CPGs while providing targeted information to support their clinical practice.
In this context, a benchmark dataset is a critical requirement for evaluating the ability of intelligent systems to read and comprehend the narratives in CPGs and then answer queries from them. To the best of our knowledge, there is no existing benchmark MRC dataset that focuses on or includes CPGs. In this paper, we present a manually built, Subject-Matter Expert (SME)-validated benchmark MRC dataset - cpgQA - with 1,097 samples, built from a clinical practice guideline. cpgQA can be viewed as the pioneering dataset for bio-MRC on CPGs that can extend existing MRC models to enable efficient and accurate interpretation of clinical practice guidelines. In this work, we also present an extensive case study using the state-of-the-art (SOTA) technique for the low-resource bio-MRC task - transfer learning with transformer-based pre-trained language models (PLMs).
Transformer-based PLMs are SOTA language models that have achieved human-level performance on MRC tasks for domains such as Wikipedia and web search results [9], [10], [11], [12]. Nonetheless, the performance of these PLMs is highly dependent on large-scale, high-quality labeled training datasets [10]. In real-world applications, MRC tasks on new problem domains such as CPGs suffer from a lack of high-quality and/or large-scale labeled training datasets, which is a bottleneck for the high performance of these PLMs. Moreover, acquiring such datasets in the biomedical field requires subject-matter expertise, making the process expensive and time-consuming. In scenarios where large-scale datasets are unavailable for training, transfer learning - a technique that transfers knowledge from a high-resource domain to a low-resource one - comes into play.
In this paper, we further experimentally demonstrate the necessity for a benchmark MRC dataset on CPGs. We also perform a thorough error analysis that depicts the strengths and weaknesses of the SOTA approach, in light of the MRC task on CPGs. Last but not least, we explain the limitations of this dataset and delineate the scope of improvements and future research directions.
The primary contributions of this paper can be outlined as follows: (i) We introduce a new and important problem domain in biomedical MRC - Clinical Practice Guidelines. (ii) We present a benchmark MRC dataset for this new problem domain - cpgQA - which we annotated manually and validated with assistance from subject-matter experts. (iii) We demonstrate the applicability of transfer learning with transformer-based PLMs in this new problem domain. (iv) Through comprehensive analyses, we demonstrate the capabilities and limitations of the SOTA approach and identify the scope of further improvements in light of the cpgQA dataset.

II. BACKGROUND AND RELATED WORK
Our work focuses on clinical practice guidelines as a new problem domain in biomedical MRC and, as such, lies at the convergence of two research areas - (i) biomedical MRC datasets and (ii) biomedical MRC modeling with the help of transfer learning in the absence of sufficient training data. In this section, we provide a brief description of the relevant background and literature in these areas.

A. BIOMEDICAL MRC DATASETS
In this work, we focus on the MRC tasks and datasets where the questions are in natural form, i.e., interrogative sentences, and the answers are span-based, i.e., text spans extracted verbatim from the contexts. Over the recent years, researchers have made significant progress in the field of machine reading comprehension in NLP, following the release of the first large-scale MRC dataset, SQuAD, in 2016 [10]. The contexts in the SQuAD dataset consist of passages from Wikipedia articles and question-answer pairs which were manually generated by crowd-workers [10]. Following SQuAD, researchers have developed several large-scale MRC datasets on domains such as news articles (NewsQA [13]), web search log (MS MARCO [14]), etc.
The biomedical domain, on the other hand, suffers from a scarcity of high-quality large-scale datasets because generating QA pairs from biomedical narratives requires domain expertise [6], and automation can often hurt the quality of a dataset [15]. Biomedical MRC datasets can be categorized into four sub-domains: scientific biomedical literature, clinical notes, consumer health, and medical examination [6]. Among these sub-domains, literature and clinical notes currently have MRC datasets that consist of natural-form questions and span-based answers. Past work [16] has presented the BioASQ dataset, which is the outcome of the yearly BioASQ challenges. 1 BioASQ addresses the problem of effortless knowledge extraction from biomedical literature. It is the largest domain-expert-annotated MRC dataset on biomedical literature [6], with 4,234 question-answer pairs (according to the latest release) on various PubMed abstracts [16]. Another work [17] has generated COVID-QA, a published dataset built on scientific articles related to COVID-19, with 2,019 question-answer pairs annotated by volunteer SMEs.
To address the scarcity of MRC datasets in the sub-domain of clinical notes, authors in [18] have presented emrQA, an MRC dataset on unstructured electronic medical records (EMRs). This dataset consists of templates for patient-specific questions that could be asked by healthcare providers. The question-answer pairs in emrQA have been automatically generated instead of being annotated by experts - leading to incompleteness in the answers, unanswerable questions, and a lack of diversity [15].

B. BIOMEDICAL MRC MODELING USING TRANSFER LEARNING
The availability of computing resources has popularized neural network-based models for MRC tasks. In recent years, transformer-based deep learning models have become the most popular choice among researchers for biomedical MRC tasks because of their unbeatable performance [19], [20], [21], [22], [23], [24]. These models are usually pre-trained on large-scale corpora - general-purpose or domain-specific - for a pre-training task and are used as trainable encoding modules for downstream tasks such as machine reading comprehension [19], [20], [21], [22]. During pre-training, the model parameters are initialized either randomly from scratch [22] or from the parameters of another pre-trained model [25]. Provided question-context pairs, these encoding modules transform discrete texts into continuous high-dimensional vector representations. Then, to perform the MRC task, an MRC module is added on top of the encoding module. The MRC module usually consists of a few task-specific layers and is trained along with the encoding module on an MRC dataset [19], [23], [26]. These layers are commonly fully-connected feed-forward neural network (FFNN) layers.
To learn better representations of data instances and perform well on a task, deep learning models such as transformer-based models require sufficiently large training data [27], [28], as well as training and testing data from the same underlying distribution [28], [29]. In real-world scenarios, new applications of deep learning models, such as ours, suffer from limited or no training data. Transfer learning, a learning paradigm, can address this issue by transferring knowledge acquired from a widely-explored domain (namely, a source domain with large-scale labeled training data) to a less-explored domain (namely, a target domain with limited or non-existent labeled training data) [30].
The BioASQ dataset has popularized transfer learning in biomedical MRC tasks [6]. Authors in [23], [24], [31], [32], and [33] have used sequential learning - a common choice of transfer learning among researchers. In this setting, the same model is sequentially trained on one or more large-scale source-domain datasets and a single small-scale target-domain dataset [23], [24]. Authors in [23] have used sequential transfer learning to transfer knowledge from the MRC task on the general-purpose SQuAD dataset, as well as the Natural Language Inference (NLI) task on the MNLI 2 dataset, to the biomedical MRC task.
In a real-world scenario, oftentimes the primary obstacle in applying MRC models to a new problem domain is the complete absence of labeled training data in that particular domain [28]. Authors in [34], [35], [36], [37], [38], and [39] have addressed this challenge by utilizing unlabeled data from the target domain and labeled data from the source domain. Authors in [35] and [34] have used synthetic QA pairs in the target domains. However, while generating synthetic QA pairs can improve MRC performance in domains such as news articles, web search logs, or Wikipedia [34], it hurts performance in the biomedical domain [36]. Additionally, authors in [34] have used adversarial learning to reduce domain shift between non-biomedical source and target domains - Wikipedia, news articles, and web search logs. Applying a model trained on a source domain directly to the target domain, in the absence of training data in the target domain, often hurts the performance of the model [40]. This occurs because differences in the topic distributions between the source and target domains lead to discrepancies in the feature representations, and as such, the learning paradigm of MRC models fails to satisfy one of the two assumptions of machine learning [28] - that training and testing data have the same underlying distribution. In the adversarial learning approach for MRC, two adversaries - an MRC model and a discriminator - are usually trained jointly against one another to motivate the encoding module in the MRC model to reduce the domain shift between the target and source domains [34], [36]. In this way, an MRC model that already performs well in the source domain can also achieve good performance in the target domain and thus achieve generalizability over multiple domains. Authors in [36] have used adversarial learning with a domain similarity discriminator to bring the source-domain and target-domain (biomedical) instance representations adjacent to each other in the embedding space. They have also used an auxiliary task layer in the MRC framework to stabilize the adversarial learning process. Furthermore, authors in [38] and [39] have proposed a multi-task learning approach that simultaneously performs two tasks: language modeling and MRC.

III. MATERIAL AND METHODS
The primary goal of this research is three-fold: (i) introduce a new problem domain -Clinical Practice Guidelines (CPGs) -for biomedical-MRC, (ii) publish a reliable benchmark dataset -cpgQA -to validate the capability of an MRC system for comprehending the CPGs to answer questions from them, (iii) present a thorough case study using transfer learning with state-of-the-art machine reading comprehension approaches and cpgQA. In this section, we describe the aforementioned parts of this study in detail.

A. cpgQA DATASET
1) CLINICAL PRACTICE GUIDELINES (CPGs)
CPGs are ''systematically developed statements to assist practitioner and patient decisions about appropriate health care for specific clinical circumstances'' [41]. The foundation of CPGs is a systematic review of the available research evidence, targeted to answer specific clinical questions on a certain condition, with an emphasis on the strength of the evidence used for clinical decision-making [7]. CPG developers consider a variety of questions when they review multiple sources to compile a guideline - ''the identification of risk factors for conditions; diagnostic criteria for conditions; prognostic factors with and without treatment; the benefits and harms of different treatment options; the resources associated with different diagnostic or treatment options; and patients' experiences of healthcare interventions'' [42]. These characteristics make CPGs different from other unstructured biomedical text data such as scientific literature or clinical notes.

2) DATASET CONSTRUCTION
The proposed cpgQA dataset consists of 1,097 questions with answers and contexts taken verbatim from the VA/DoD CPG on Opioid Therapy for Chronic Pain [43] (available at https://www.healthquality.va.gov/guidelines/Pain/cot/VADoDOTCPGPocketCard022817.pdf). Similar to other CPGs, this guideline collates the most current information from multiple relevant resources into one concise document and follows the standard CPG structure [43].
cpgQA is manually annotated and released in collaboration with subject-matter experts (SMEs) as the first standard benchmark MRC dataset in the biomedical sub-domain of CPGs. Each context in the dataset is a paragraph from the CPG. To ensure high quality, the questions and answers are created from the contexts with the help of SMEs by manually reading through the document paragraph by paragraph, focusing on the five primary components of the CPG listed below. The numbers in parentheses indicate the count of data instances in cpgQA for each component:
• Introductory information on the guideline (44 instances)
• Background information on the subject matter (233 instances)
• Features and overview of the guideline (280 instances)
• Algorithm that accommodates the ''understanding of the clinical pathway and decision making process'' (129 instances)
• Recommendations under the consideration of ''confidence in the quality of the evidence, balance of desirable and undesirable outcomes (i.e., benefits and harms), patient or provider values and preferences, and other implications, as appropriate (e.g., resource use, equity, acceptability)'' (411 instances)
Figure 1 shows an example from the cpgQA dataset.

3) DATASET STATISTICS
The cpgQA dataset has 190 unique contexts and 1,097 question-answer pairs. We further analyze the characteristics of the cpgQA dataset based on two linguistic aspects: (i) the distribution of question types (based on interrogative words/phrases) and (ii) the distribution of the number of words in the questions, answers, and contexts. Figure 2 shows that there are eight types of questions in the dataset, among which approximately 61% are ''What''-type questions, dominating the dataset, while one-third of the dataset is shared roughly equally by the ''When'', ''Which'', ''How'', and ''Who''-type questions.
The dominance of the ''What'' question is also shown in Figure 2b, which details the distribution of the question types for each of the five components of the CPG: introductory information, background information, features and overview, algorithm, and recommendations. 75% of the questions and answers in the cpgQA dataset consist of fewer than 18 and 17 words, respectively (Figure 2b). On the other hand, 75% of the contexts consist of 247 words or fewer. In Table 1, we also show a comparison between cpgQA and the test sets of two other bio-MRC datasets - BioASQ [44] and emrQA [18] - in terms of context, question, and answer lengths.

4) COMPARISON WITH OTHER BIOMEDICAL SUB-DOMAINS
To explain why we need a dataset for CPGs, we compare the CPG sub-domain with two other biomedical sub-domains that have MRC datasets - scholarly articles and clinical notes/EMRs. For scholarly articles and EMRs, we choose the question-context pairs from the test sets of BioASQ-9b (Factoid) [44] and emrQA (Relation subset) [18], respectively. BioASQ-9b (Factoid) is an SME-annotated MRC dataset published in the BioASQ challenge 2021, where the contexts are snippets extracted from PubMed/MEDLINE articles. 3 emrQA is an automatically-annotated MRC dataset where the contexts are from the ''longitudinal EMRs'' of patients [18]. Among the four subsets of emrQA, we choose the relation subset, based on the experiments performed in [15].
We demonstrate the domain difference by plotting the vector representations of the question-context pairs from these domains in 2D (Figure 3), adapting the approach described in [45]. In this approach, given a single question-context pair, we use the last hidden state of the BERT (Bidirectional Encoder Representations from Transformers) encoder [9] to create a 768-dimensional vector representation for each of the tokens in that pair. BERT is one of the state-of-the-art bidirectional multi-layer transformer networks for modeling language representations and is pre-trained on Wikipedia articles and BookCorpus [9]. Then, we calculate the average over these token representations, which results in a single 768-dimensional vector representation for the question-context pair. For visualization, we then perform dimensionality reduction with 2-component PCA over the vector representations of all question-context pairs. For a fair comparison, we use the original BERT model [9] and do not fine-tune it on any of the biomedical sub-domains. Figure 3 clearly shows that CPGs are linguistically very different from scholarly articles and clinical notes, which necessitates the introduction of CPGs as a distinct sub-domain in biomedical MRC.
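The embedding-and-projection procedure can be sketched as follows; the checkpoint name, helper function, and toy pairs are illustrative assumptions rather than the authors' exact code.

```python
# A minimal sketch of the visualization approach described above: mean-pooled
# last-hidden-state BERT representations per question-context pair, projected
# to 2D with PCA. The example pairs are placeholders.
import torch
from sklearn.decomposition import PCA
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # no fine-tuning on any biomedical sub-domain

def pair_embedding(question: str, context: str) -> torch.Tensor:
    """Average the 768-d last-hidden-state token vectors of a QA pair."""
    inputs = tokenizer(question, context, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        last_hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return last_hidden.mean(dim=1).squeeze(0)            # (768,)

pairs = [
    ("What does the guideline recommend?", "The guideline recommends ..."),
    ("Who should be assessed for risk?", "All patients on long-term ..."),
]
embeddings = torch.stack([pair_embedding(q, c) for q, c in pairs]).numpy()
coords_2d = PCA(n_components=2).fit_transform(embeddings)  # points for a Figure 3-style plot
```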

B. MACHINE READING COMPREHENSION MODELS
As explained in Section II, MRC models aim to understand biomedical narratives in order to correctly answer questions posed by users (such as clinicians, patients, or the general public, depending on the application) from these narratives. The input to an MRC model is a question-context pair, and the output is the positions of the start and end tokens of the answer span taken from the context.
More formally, given a question $Q$ with $n_q$ tokens (words) and a context $C$ with $n_c$ tokens, an MRC model predicts the start token position $a_s$ and the end token position $a_e$ of the answer span $A_{a_s}^{a_e} \subseteq C$, such that there exists a unique answer span composed of $n_a$ (with $n_a \leq n_c$) consecutive tokens in the context.
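To make this input-output contract concrete, here is a minimal usage sketch with an off-the-shelf extractive-QA pipeline; the checkpoint name and the toy question-context pair are illustrative assumptions, not the models evaluated in this paper.

```python
# Illustrative input-output example: a question-context pair goes in, and an
# answer span (text plus character offsets) comes out. The checkpoint is a
# generic public SQuAD model, not one of the encoders compared below.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")
result = qa(question="How often should patients be re-evaluated?",
            context="Patients on long-term opioid therapy should be "
                    "re-evaluated at least every three months.")
print(result["answer"], result["start"], result["end"], result["score"])
```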
We provide the specifications of these models in Table 2. All nine encoder models have 12 layers, 12 attention heads per layer, 768 hidden nodes, and 3,072 FFNN inner hidden nodes. For these models, the embedding dimension of each token is 768. A maximum of 512 tokens can be provided as one input sequence in these models, except RoBERTa, which can accept 514 tokens per input.
As the task-specific layer, we use a single-layer FFNN with 768 hidden nodes. The layer calculates the probability distributions for the start and end token positions using the softmax function, following Equation 1:

$$p_i^s = \frac{\exp(W_s \cdot h_i)}{\sum_{j=1}^{n_l} \exp(W_s \cdot h_j)}, \qquad p_i^e = \frac{\exp(W_e \cdot h_i)}{\sum_{j=1}^{n_l} \exp(W_e \cdot h_j)} \tag{1}$$

Here, $n_l$ is the sequence length of the input, $h_i \in \mathbb{R}^H$ is the hidden representation vector of the $i$th token, $W_s, W_e \in \mathbb{R}^H$ are two trainable weight vectors, and $p_i^s$ and $p_i^e$ denote the probabilities of the $i$th token being predicted as the start and end of the answer span, respectively. To optimize the MRC model, we employ the cross-entropy (CE) loss $L$ on the predicted answer positions. For each sample, we average the cross-entropy losses of the two predicted outputs for the start and end positions, following Equation 2:

$$L = -\frac{1}{2}\left(\log p_{y_s}^s + \log p_{y_e}^e\right) \tag{2}$$

Here, $y_s$ and $y_e$ respectively denote the ground truth answer's start and end token positions. In the test phase, the answer is predicted by selecting the sequence of tokens in the interval defined by the two highest probabilities from the distributions $p_{k \in [1, n_l]}^s$ and $p_{k \in [1, n_l]}^e$ [49], [50].
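The span-prediction head and loss described by Equations 1 and 2 can be sketched in PyTorch as follows; this is our minimal reconstruction for illustration, not the authors' released code.

```python
# A minimal PyTorch reconstruction of the task-specific layer and loss in
# Equations 1-2: linear projections score every token as a start/end position,
# and the two cross-entropy losses are averaged per sample.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanHead(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # Two scores per token, corresponding to W_s and W_e in Equation 1.
        self.qa_outputs = nn.Linear(hidden_size, 2)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, n_l, H) from the encoder.
        start_logits, end_logits = self.qa_outputs(hidden_states).unbind(-1)
        return start_logits, end_logits  # softmax over n_l yields p^s, p^e

def span_loss(start_logits, end_logits, y_s, y_e):
    # Equation 2: average of the start- and end-position CE losses.
    return 0.5 * (F.cross_entropy(start_logits, y_s)
                  + F.cross_entropy(end_logits, y_e))
```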
C. TRANSFER LEARNING
To address the scarcity of labeled training data in the CPG domain, we evaluate the MRC models in two settings: an unsupervised setting, in which a model trained only on a source-domain dataset is evaluated directly on cpgQA, and a supervised setting with two transfer learning setups. As source domains, we consider three domains with large-scale labeled MRC datasets: biomedical scholarly articles, clinical notes, and Wikipedia.
(i) Sequential learning: In this setup, the MRC model is first trained on the source-domain dataset and then fine-tuned on the target-domain dataset, ensuring that the model is exposed to a large-scale dataset during training.
(ii) Simultaneous learning: In this setup, the MRC model is simultaneously trained on the source and target domains. Similar to sequential learning, simultaneous learning also ensures that the model is exposed to a large-scale dataset during training.
To evaluate the model on the cpgQA dataset in the supervised setting, we perform a 5-fold cross-validation as follows. We divide the cpgQA dataset into five disjoint subsets (folds) based on the 190 unique contexts to avoid data leakage into the test sets. For sequential learning, we sequentially train the MRC model on the source-domain dataset and on all but one of the folds of the target-domain dataset. At the end of training, the remaining target-domain fold is used as the test set. We repeat this five times, each time with a different fold excluded from training and reserved for testing. Table 3 shows the number of target-domain training and testing samples (i.e., question-answer pairs) in each fold. For simultaneous learning, we follow the same process, except that we train the MRC model simultaneously on the source-domain and target-domain samples.
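The context-grouped splitting can be illustrated with scikit-learn's GroupKFold; the sample construction below is a placeholder, and the authors' actual fold assignment may differ.

```python
# Placeholder illustration of context-grouped 5-fold cross-validation:
# grouping QA pairs by context keeps every context's questions in exactly
# one fold, preventing leakage between training and test sets.
from sklearn.model_selection import GroupKFold

qa_pairs = [(f"q{i}", f"a{i}", i % 190) for i in range(1097)]  # (question, answer, context_id)
groups = [context_id for _, _, context_id in qa_pairs]

for fold, (train_idx, test_idx) in enumerate(
        GroupKFold(n_splits=5).split(qa_pairs, groups=groups)):
    train_samples = [qa_pairs[i] for i in train_idx]  # plus source-domain data
    test_samples = [qa_pairs[i] for i in test_idx]    # held-out contexts only
```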
As a baseline, and to show the effect of transfer learning, we also perform 5-fold cross-validation using only the target-domain dataset. Additionally, for both of these settings, we experiment with the SOTA approach for biomedical MRC using transfer learning - BioADAPT-MRC [36]. As explained in Section II, using a model trained on the source domain for inference in the target domain in the aforementioned unsupervised setting often hurts the performance of the model. We demonstrate this phenomenon in Section IV-E1.
BioADAPT-MRC addresses this issue and improves performance by using a deep learning framework with adversarial learning that employs both the source domain with a large-scale labeled dataset and the target domain with an unlabeled or limited labeled dataset for learning. The three main components of this framework are an encoding module, an MRC module, and a domain similarity discriminator. Even in the absence of labeled target-domain training data, BioADAPT-MRC has been able to outperform the approaches that used labeled target-domain training data and achieved SOTA performance in the MRC task on biomedical scholarly articles. Hence, in this work, we experimentally explore the potential of BioADAPT-MRC in performing the MRC task on CPGs.

IV. RESULTS AND DISCUSSION
In this section, we describe the specifications of the source-domain datasets used in the experiments, the metrics used to measure the performance of the MRC models, and the experimental setup. We further report the experimental results of the case study on MRC modeling of cpgQA using transfer learning. We then report a thorough error analysis to demonstrate the strengths and weaknesses of the current SOTA approach in performing the MRC task on cpgQA.

A. SOURCE-DOMAIN DATASETS
As mentioned in Section III-C, we consider three source domains for this study: (i) biomedical scholarly articles, (ii) clinical notes, and (iii) Wikipedia. As datasets from these domains, we use BioASQ-9b (Factoid) [44], emrQA (Relation subset) [18], and SQuAD-1.1 [10], respectively. While BioASQ and emrQA are biomedical domain-specific datasets, SQuAD is a general-purpose dataset where the contexts are from Wikipedia articles and the question-answer pairs were developed by crowd-workers [10]. Table 4 provides the number of contexts and question-answer pairs in the training and test sets of these datasets.
The SQuAD-1.1 dataset can be found in the Wolfram Data Repository. 4 For BioASQ-9b, we use the training and test sets from the BioASQ challenge website. 5 The BioASQ dataset includes four types of questions - yes/no, factoid, list, and summary. Among them, the factoid question-answering task most closely relates to extractive MRC. Hence, we pre-process the training and test sets to keep only the factoid questions. In place of text passages, the contexts in the BioASQ training set consist of PMIDs. Therefore, we further pre-process the BioASQ training data by retrieving the full abstracts from PubMed using the PMIDs and use these abstracts as the contexts in the training set. We also remove the entries (from the training and test sets) that do not have an answer in the context. For emrQA (relation subset), 6 we pre-process the training and development sets following [15]. For both SQuAD and emrQA, we use the development sets as the test sets.
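As an illustration of the abstract-retrieval step, the sketch below queries the public NCBI E-utilities efetch endpoint by PMID; the function name and the example PMID are ours, and production code would add batching, rate limiting, and error handling.

```python
# A hedged sketch of retrieving PubMed abstracts by PMID through the public
# NCBI E-utilities endpoint; batching and rate limiting are omitted.
import requests

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fetch_abstract(pmid: str) -> str:
    """Fetch the plain-text abstract for one PubMed ID."""
    params = {"db": "pubmed", "id": pmid, "rettype": "abstract",
              "retmode": "text"}
    resp = requests.get(EFETCH, params=params, timeout=30)
    resp.raise_for_status()
    return resp.text.strip()

abstract = fetch_abstract("31452104")  # example PMID; the text replaces the PMID context
```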

B. TARGET-DOMAIN TRAINING DATASET FOR BioADAPT-MRC
As explained in Section III-C, BioADAPT-MRC can utilize unlabeled training datasets from the target domain. An unlabeled MRC dataset is one that contains only the contexts, without the labels, i.e., the question-answer pairs. Hence, as the unlabeled target-domain training data, we use 10,987 paragraphs, automatically extracted using regular expressions from 21 other VA/DoD CPGs. 7 Along with the benchmark dataset cpgQA, we will also release this unlabeled training set in our GitHub repository 8 for reproducibility.
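Since the exact expressions are not reproduced here, the following is only a minimal sketch of such paragraph extraction, assuming paragraphs in the raw guideline text are separated by blank lines.

```python
# A minimal sketch, assuming paragraphs in the extracted guideline text are
# separated by blank lines and that very short fragments (headings, page
# numbers) should be discarded; the authors' actual expressions may differ.
import re

def extract_paragraphs(raw_text: str, min_words: int = 20) -> list:
    chunks = re.split(r"\n\s*\n", raw_text)                      # blank-line split
    paragraphs = [re.sub(r"\s+", " ", c).strip() for c in chunks]
    return [p for p in paragraphs if len(p.split()) >= min_words]
```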

C. EXPERIMENTAL SETUP AND TRAINING CONFIGURATIONS
We implement the MRC models using PyTorch [51] and the Hugging Face API [52].
Following the hyperparameter choices for MRC tasks provided in [9], [19], [20], [21], [22], [25], [46], [47], and [48], we select the following values for all the experiments except the ones with BioADAPT-MRC: for tokenization - 384 as the maximum sequence length, 64 as the maximum query length, and 128 as the document stride; for training - 3e-5 as the learning rate, 24 as the batch size, and 3 as the number of training epochs. For the maximum answer length, we choose 200, given the distribution of answer lengths in cpgQA (Figure 2c).
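For illustration, the listed hyperparameters map onto a Hugging Face configuration roughly as follows; the encoder checkpoint and the placeholder inputs are assumptions, and the 64-token maximum query length is enforced separately during question truncation in standard SQuAD-style preprocessing.

```python
# Rough mapping of the listed hyperparameters onto a Hugging Face setup;
# the checkpoint and the toy inputs are placeholders.
from transformers import AutoTokenizer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
questions = ["What does the guideline recommend?"]  # truncated to <= 64 tokens upstream
contexts = ["The guideline recommends that clinicians ..."]

encoded = tokenizer(
    questions, contexts,
    max_length=384,                 # maximum sequence length
    truncation="only_second",       # truncate the context, never the question
    stride=128,                     # document stride for overlapping windows
    return_overflowing_tokens=True,
)

args = TrainingArguments(
    output_dir="cpgqa-mrc",
    learning_rate=3e-5,
    per_device_train_batch_size=24,
    num_train_epochs=3,
)
```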
To implement the BioADAPT-MRC model, we follow the default implementation process and hyperparameter setting provided in [36]. We performed all experiments on a Linux virtual machine with a single Tesla V100-SXM2-16GB GPU and Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz.

D. EVALUATION METRICS
For measuring the performance of the MRC models on cpgQA, we use two metrics widely used in the MRC tasks: Exact Match (EM) and F1-Score (F1).

1) EXACT MATCH
Each question in the cpgQA dataset corresponds to exactly one correct answer, which can be a word, a phrase, or one or more sentences. For each question-answer pair, if the predicted answer matches the ground truth answer strictly, character for character, then EM for that data instance is 1; otherwise, it is 0. Thus, being off by a single character in the prediction results in an EM score of zero. For the whole dataset, the EM score is calculated following Equation 3:

$$EM = \frac{1}{N} \sum_{i=1}^{N} em_i \tag{3}$$

Here, $N$ is the size of the dataset, i.e., the total number of QA pairs in the dataset, and $em_i$ is the exact match score for the $i$th pair.

2) F1-SCORE
The F1-score is a well-known classification metric used in cases where precision and recall should be given equal importance. The basis of the F1-score in MRC is the number of shared words between the ground truth and the predicted answers. For each question-answer pair, the F1-score can be calculated using Equations 4 and 5:

$$f1 = \frac{2 \times precision \times recall}{precision + recall} \tag{4}$$

$$precision = \frac{TP}{TP + FP}, \qquad recall = \frac{TP}{TP + FN} \tag{5}$$

Here, TP is the number of tokens shared between the predicted answer and the ground truth, FP is the number of tokens in the predicted answer but not in the ground truth, and FN is the number of tokens in the ground truth answer but not in the predicted one. The F1-score for the whole dataset is calculated following Equation 6:

$$F1 = \frac{1}{N} \sum_{i=1}^{N} f1_i \tag{6}$$

Here, $f1_i$ is the F1-score calculated using Equation 4 for the $i$th QA pair.
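A minimal implementation sketch of the two per-pair metrics, assuming plain whitespace tokenization for F1 and no answer normalization beyond what is described above:

```python
# Per-pair EM and F1 as described above: strict string comparison for EM and
# whitespace-token overlap for F1 (any extra normalization is omitted).
from collections import Counter

def exact_match(prediction: str, ground_truth: str) -> int:
    return int(prediction == ground_truth)  # one character off -> 0

def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens, gt_tokens = prediction.split(), ground_truth.split()
    tp = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())  # shared tokens
    if tp == 0:
        return 0.0
    precision = tp / len(pred_tokens)  # TP / (TP + FP)
    recall = tp / len(gt_tokens)       # TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)  # Equation 4

# Dataset-level EM and F1 (Equations 3 and 6) are the means over all QA pairs.
```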

3) CROSS-VALIDATION SCORE
We perform 5-fold cross-validation (CV) for experiments in the supervised setting. Since the folds are not evenly sized (Table 3), we calculate the weighted CV mean (for both EM and F1, calculated by Equations 3 and 6) across all five folds following Equation 7. We also report the variability of the EM and F1 scores using the weighted standard deviation, following Equation 8:

$$\bar{s} = \frac{\sum_{i=1}^{M} f_i s_i}{\sum_{i=1}^{M} f_i} \tag{7}$$

$$\sigma_w = \sqrt{\frac{\sum_{i=1}^{M} f_i (s_i - \bar{s})^2}{\sum_{i=1}^{M} f_i}} \tag{8}$$

Here, $M$ is the number of folds; for 5-fold CV, $M = 5$. $s_i$ is either the EM or the F1 score calculated over the samples in fold $i$, and $f_i$ is the number of samples in fold $i$.
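The weighted CV statistics of Equations 7 and 8, as reconstructed above, can be computed as follows; the per-fold scores and fold sizes shown are illustrative placeholders, not results from the paper.

```python
# Weighted CV mean and standard deviation (Equations 7-8 as reconstructed);
# the scores and fold sizes below are illustrative placeholders only.
import numpy as np

def weighted_cv_stats(scores, fold_sizes):
    s = np.asarray(scores, dtype=float)
    f = np.asarray(fold_sizes, dtype=float)
    mean = np.sum(f * s) / np.sum(f)                        # Equation 7
    std = np.sqrt(np.sum(f * (s - mean) ** 2) / np.sum(f))  # Equation 8
    return mean, std

mean_f1, std_f1 = weighted_cv_stats([0.82, 0.80, 0.84, 0.81, 0.83],
                                    [220, 219, 220, 219, 219])
```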

E. EXPERIMENTAL RESULTS
In this section, we discuss the experimental results of the MRC models on cpgQA.

1) UNSUPERVISED SETTING
Table 5 and Figure 4 show the experimental results of the MRC models on cpgQA in the unsupervised setting. As depicted by the low EM and F1 scores on cpgQA in Table 5 and the drop in performance from the source to the target domain in Figure 4, the models perform poorly when we transfer knowledge directly from biomedical scholarly articles (BioASQ) and clinical notes (emrQA). This indicates that CPGs are linguistically different from articles and clinical notes, and these results reaffirm the need for the new biomedical sub-domain - CPGs. We also notice that when we use a more general-purpose domain such as Wikipedia (SQuAD) as the source domain, the performance of the MRC models on cpgQA is higher. This may be because, compared to BioASQ and emrQA, SQuAD consists of more diverse documents and does not focus on a narrow domain. Moreover, as shown in Table 4, while emrQA is the largest of the three datasets with more than 600k samples in the training set, the diversity in its contexts is approximately 64 times less than that of the contexts in the SQuAD dataset. Consequently, MRC models trained on SQuAD may generate comparatively more generalizable feature representations.

TABLE 5. Test scores on cpgQA in the unsupervised setting with nine pre-trained language models and three source domains. The highest and the second highest scores are highlighted in bold and italic, respectively.

Table 6 shows the results from the BioADAPT-MRC model and the best-performing model in Table 5 - BioLinkBERT trained on SQuAD. In the rest of the paper, we denote each model by its original name hyphenated with the name of its training data to avoid redundancy. As shown, BioADAPT-MRC achieves a 62% EM score (95% confidence interval (CI): 59%-65%) and an 82% F1-score (95% CI: 81%-84%) in the target domain while retaining high performance scores in the source domain. This is because the unlabeled target-domain dataset and adversarial learning enable BioADAPT-MRC to generate features that reduce domain shift and thus narrow the gap between the performance of MRC models in the source and target domains.

2) SUPERVISED SETTING
Table 7 shows the cross-validation scores with 95% CI of the MRC models on cpgQA in the supervised setting. We also show the trends of the EM and F1 scores across the five folds of cpgQA used in the experiments (Figure 5). As shown, when trained with only the target-domain dataset, the BioLinkBERT-cpgQA model achieves the worst performance. This may be because the training dataset is too small to exploit the full potential of the model. Nonetheless, when we train the MRC model on SQuAD and cpgQA (BioLinkBERT-SQuAD-cpgQA), it achieves higher EM and F1 scores. Comparing the scores reported in Table 7, we can say that adding the source-domain dataset, SQuAD, to the training process helps improve the learning of the MRC model. We also notice that sequential training is better than simultaneous training - indicating a learning pattern in the transformer-based MRC model. We also show the experimental results using BioADAPT-MRC (Table 7). Even in the default setting without any hyperparameter optimization, BioADAPT-MRC outperforms BioLinkBERT-cpgQA and BioLinkBERT-SQuAD-cpgQA with a higher CV mean and lower CV standard deviation, achieving higher stability in performance across the five folds (Figure 5). This shows the potential of this SOTA approach for performing MRC tasks on CPGs.

F. ERROR ANALYSIS
In this section, we present a comprehensive three-fold error analysis of the best-performing MRC model from our experiments by considering the following aspects of the cpgQA dataset: (i) components of the clinical practice guideline, (ii) types of questions, (iii) length of answers.
According to our experimental results presented in Section IV-E, we choose the BioADAPT-MRC model (in the supervised setting) for performance analysis.

1) COMPONENTS
As mentioned in Section III-A, there are five components in the guideline that was used to build the dataset: (1) Introductory information, (2) Background information, (3) Features and overview, (4) Algorithm, and (5) Recommendations.
To analyze which parts of the guideline are correctly comprehended and interpreted by the model, we divide the dataset into five subsets according to these components and then calculate the EM and F1 scores for each subset. An ideal MRC model that can understand CPGs should be able to answer questions correctly from each of these parts of the CPG and achieve an exact match score of 1 for each of them. Figure 6 shows that the model has been able to answer approximately 80% of the questions from each subset with an exact match to the ground truth answers. Nonetheless, to make the MRC system reliable enough for future deployment in a healthcare setting, further performance improvement is required for each of the components of the guideline.

2) QUESTION TYPES
The cpgQA dataset consists of eight types of questions based on eight different interrogative words/phrases: ''What, When, Which, How, Who, Why, Where, Is there''. Figure 6b shows the capability of the MRC model in answering different types of questions present in the dataset. According to the EM scores, while the model can correctly answer most of the questions with the interrogative word ''Who'', it struggles the most with questions with the interrogative word ''Where''. There is also plenty of room for improvement for other question types such as ''Is there'', ''Why'', ''Which'', etc.

3) ANSWER LENGTH
The cpgQA dataset consists of answer spans that range from 1 to 194 consecutive words. An ideal MRC model should be able to capture all ranges of answer spans. To show the influence of answer length on the performance of the model, in this experiment we divide the dataset based on binned answer length. As the dividing criterion, we use 15-percentile steps for granularity, grouping the dataset into seven disjoint bins. Figure 6c shows that the model does remarkably well with smaller answer spans, i.e., short answers. As the answer length increases, the model struggles to predict start and end token positions (of the answer span) that exactly match the ground truth.
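A small sketch of the binning procedure, assuming 15-percentile steps over whitespace-token answer lengths; the placeholder answers stand in for the 1,097 actual spans.

```python
# Percentile-based binning sketch: 15-percentile steps over the answer-length
# distribution produce seven disjoint bins. The answers list is a placeholder.
import numpy as np

answers = ["short answer", "a somewhat longer answer span",
           "an even longer multi-word answer span for illustration"]
answer_lengths = np.array([len(a.split()) for a in answers])

edges = np.percentile(answer_lengths, [0, 15, 30, 45, 60, 75, 90, 100])
bin_ids = np.digitize(answer_lengths, edges[1:-1])  # bin index 0..6 per sample
```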

4) MODEL'S CAPABILITY TO IDENTIFY GROUND TRUTH ANSWER LOCATIONS
Last but not least, we examine whether the model can identify the location of the ground truth answer in cases where it is unable to find the exact match. We do this by calculating the percentage of test samples that do not have an exact match but do have an overlap between the words in the predicted and ground truth answers. Figure 7 shows that approximately 6.7% of the mispredicted answer spans have 100% overlap with the ground truth, whereas only 1.3% of the mispredicted samples have no overlap at all. This indicates that while the model struggles to find the exact answer spans for all the questions in the test set, it is able to identify their location most of the time.
The overall error analysis indicates that while, in this study, the best-performing model performs well in various scenarios, there is still a lot of room for potential improvement.

V. LIMITATIONS OF cpgQA
While the cpgQA dataset contains most parts of the guideline, it omits the tables embedded in the appendix of the guideline. These tables contain additional information on diagnosis, treatment, recommendations, etc. Thus, a future research direction that can stem from this work is incorporating the tabular data into the text-based cpgQA dataset to generate a multi-modal dataset, similar to the Finance dataset presented in [53]. Furthermore, we used only one guideline to build the benchmark cpgQA dataset due to resource constraints, which resulted in a smaller dataset. Additionally, the dataset, in its current state, does not contain cases where no answer can be found in the provided context. While cpgQA provides a well-informed baseline for the MRC task on CPGs, including more guidelines and sample cases with no answers will enlarge and diversify the dataset and help us increase the reliability of the MRC models.
Disclaimer: The sole purpose of cpgQA is to evaluate state-of-the-art machine reading comprehension models and pioneer research in the sub-domain - clinical practice guidelines - in biomedical machine reading comprehension. The dataset is not intended as a resource for patient care and should not be used as such.

VI. CONCLUSION
Biomedical machine reading comprehension is a task in bio-NLP and one of the applications of CDSS that helps efficiently extract information from intricate biomedical narratives. Clinical practice guidelines, or CPGs, are such narratives that serve as crucial resources at the point of care, as they provide the most up-to-date and authoritative recommendations needed for a consistent and well-defined clinical decision-making process. While several research works over the past few years have focused on the bio-MRC task for resources such as scholarly articles and clinical notes, clinical practice guidelines have remained unexplored. In this work, we explore CPGs for the bio-MRC task and identify them as a new problem domain for this task. We present a benchmark dataset - cpgQA - manually annotated using a guideline with assistance from subject-matter experts. We then evaluate the dataset by presenting a thorough case study on transfer learning with state-of-the-art transformer-based language models. Finally, we investigate the shortcomings of the state-of-the-art approach in performing the MRC task on cpgQA and identify possible future research directions by performing a three-fold error analysis.
Future research directions that can originate from this work are as follows: (i) Incorporating tabular data with the text in the CPG to extend the cpgQA from a text-only to a multimodal dataset. (ii) Developing an MRC system that can handle the multi-modal cpgQA dataset. (iii) Extending the MRC models to address the weaknesses unveiled in this study. (iv) Expanding the cpgQA dataset by including more guidelines and data samples with no answers to diversify the dataset and consequently increase the reliability of the MRC models for CPGs.
We hope that the proposed dataset will foster research in machine reading comprehension systems for intelligent and efficient interpretation of the clinical practice guidelines used in healthcare by clinicians.

SUSANA MARTINS is currently a Senior Data Architect with the Department of Veterans Affairs, Office of Mental Health and Suicide Prevention. She works with interdisciplinary teams of experts to develop and implement predictive models for suicide prevention and national reports to support clinical decision-making in the VA. Integral to her current work is creating and optimizing the architecture required to integrate large and complex datasets derived from distinct electronic health records, such as Oracle-Cerner and VistA, as well as the extraction of relevant clinical concepts from structured and unstructured data in the medical record for use in predictive modeling, analytics, and clinical decision support. She has a 20-year research career in clinical informatics and has published extensively on health informatics topics across a range of diseases and conditions. Specifically, her focus has been on knowledge modeling and creating evidence-based, patient-specific clinical recommendations delivered at the point of care for clinical decision-making.

SUZANNE TAMANG is currently an Assistant Professor with the Stanford University School of Medicine and also a Computer Scientist with the Department of Veterans Affairs, Office of Mental Health and Suicide Prevention. She works with interdisciplinary teams of experts on population health problems of public interest, with a focus on chronic disease, disability, and mental health. Integral to her work is the analysis of large and complex population-based datasets using techniques from natural language processing, machine learning, and deep learning. She brings extensive experience with U.S. and Danish population-based registries, electronic medical records from various vendors, administrative claims, and other types of observational health and demographic data sources in the U.S. and internationally, as well as with constructing, populating, and applying knowledge bases for automated reasoning. She has developed open-source tools for the extraction of health information from unstructured free-text patient notes and has licensed machine learning prediction models to Silicon Valley health analytics startups. In addition to her more traditional research activities, she also serves as a Faculty Mentor for the Stanford Community Working Group Stats for Good.