Generating Scientific Question Answering Corpora from Q&A forums

Question Answering (QA) is a natural language processing task that aims at retrieving relevant answers to user questions. While much progress has been made in this area, biomedical questions are still a challenge to most QA approaches, due to the complexity of the domain and the limited availability of training sets. We present a method to automatically extract question-article pairs from Q&A web forums, which can be used for document retrieval and QA tasks. The proposed framework extracts questions from selected forums as well as answers that contain citations that can be mapped to a unique entry of a digital library. This way, QA systems based on document retrieval can be developed and evaluated using the question-article pairs annotated by users of these forums. We generated the SciQA corpus by applying our framework to three forums, obtaining 5,432 questions and 10,208 question-article pairs. We evaluated how the number of articles associated with each question and the number of votes on each answer affect the performance of baseline document retrieval approaches. We also trained a state-of-the-art deep learning model that obtained higher scores in most test batches than a model trained only on a dataset manually annotated by experts. The framework described in this paper can be used to update the SciQA corpus from the same forums as new posts are made, and from other forums that support their answers with documents.


INTRODUCTION
Question Answering (QA) consists of the automatic retrieval of information that can directly answer user questions. This task is relevant for the biomedical domain, due to the large quantity and variety of information that must be integrated to fully understand biomedical problems. QA systems can assist the study of biological systems by automatically retrieving information relevant to the queries made by a researcher. This is possible because of all the biomedical information that is available in text format on publicly available digital libraries, in the form of scientific literature. However, these systems require advanced techniques when dealing with large document repositories.
In recent years, many developments have been made in QA systems using deep learning techniques [10,16]. QA systems are evaluated using gold standards, where automatic answers are compared to answers manually annotated by domain experts. One of the most relevant gold standards for biomedical QA is BioASQ [20], which is also a community challenge where, every year, a new dataset is released to train and evaluate QA systems. One limitation of this type of evaluation is that the questions are not obtained from real users. The BioASQ datasets are developed by experts, who create the questions and the answers. Therefore, they lack the variability that is inherent to real-world scenarios.
Question-and-Answer (Q&A) forums, such as Stack Exchange and Quora, are websites where users can post questions and answer them. In recent years, search engines have improved their ability to answer natural language queries, instead of relying just on keywords as input. It is more intuitive for users to retrieve information using natural language than through a selection of keywords. QA systems could, therefore, benefit from Q&A forums where users post questions to which other users provide and vote on the best answers.
Although most of these forums have no selection process to restrict the participants, the users can vote on each answer, and, in some cases, the user who asked the question can give their approval to one of the answers. One significant example of Q&A forums is Stack Exchange, which is a network of websites where users can ask questions about specific subjects, such as computer science, mathematics, art, and various languages. Reddit is another forum where users can post questions and get answers from communities focused on specific subjects, such as nutrition. Our work focused on Stack Exchange and Reddit since both these forums contain questions relevant to the scientific domain, and provide an API to retrieve their contents.
For scientific questions, it is essential to provide proper references that corroborate the given answers. Otherwise, the correctness of the answer may be questioned. Usually, Q&A forums that are related to scientific subjects, including the ones used in this work, encourage users to support their posts with scientific articles. A significant advantage of open-access digital libraries such as PubMed is that anyone can analyze their documents to better understand scientific problems. For this reason, improving scientific document retrieval is of interest to the general community.
With this article, we present a framework to generate a gold standard for document retrieval QA, with questions and relevant articles retrieved automatically from Q&A forums. We demonstrated our approach on three forums that are relevant to biomedical sciences: Biology from Stack Exchange, Medical Sciences also from Stack Exchange, and Nutrition from Reddit.
We also present the SciQA corpus of question-article pairs, which can be used to develop and evaluate document retrieval and QA systems, as we demonstrate in this article. Unlike existing corpora, our corpus is based on real user questions and can evolve over time. The proposed framework could also be applied to other relevant forums, in any language, and update the dataset with new questions.

RELATED WORK

User-based QA
Recently, there has been an increased focus on QA using real user questions. For example, the Natural Questions dataset released by Google uses real queries made by users of the Google search engine [13]. While the questions of other QA datasets are formulated by annotators whose function is to read a paragraph and write relevant questions [17], this dataset contains real queries from users seeking information. The authors then annotated each question with answers taken from the English Wikipedia. The objective of this dataset is to develop systems that can extract the correct answer to each question from a specific Wikipedia page.
Another example is the TREC LiveQA track [2], which promotes the development of new approaches to answer questions retrieved from various sources, such as Yahoo Answers and the U.S. National Library of Medicine. The results of the teams that have participated in this track highlight the low scores obtained in comparison with human performance, particularly for consumer health questions [3].

Datasets based on web forums
Other authors have also explored Q&A forums to generate datasets. Cong et al. [5] presented an algorithm to extract Q&A pairs from forum threads, where it may not be clear which posts are questions and which are answers. Shah and Pomerantz [19] generated a dataset based on Yahoo Answers and used crowdsourcing to rank the answers given to the same question. However, since Stack Exchange and Reddit users can vote on answers for any question (not just the ones they asked), our approach does not require that extra annotation step. Dalip et al. [7] explored Stack Overflow to develop a learning-to-rank method to evaluate the quality of the answers, based on the scores given by the users. The authors extracted several textual and user-based features to classify each answer, including the number of links to external sources, which they found to be a good predictor of the quality of the answer. Nevertheless, to the best of our knowledge, we are the first to extract references to articles from the answers to automatically generate a QA dataset.

Biomedical QA
The development of systems for biomedical QA requires datasets to both train and evaluate those systems. The biomedical domain poses a challenge to this type of system since the questions may be more challenging to answer, and the datasets available are more limited. BioASQ is one of the most significant community efforts toward advancing biomedical QA systems. It is a series of evaluations where the participants compete on various QA tasks, using a gold standard provided by the organization. Each edition of BioASQ has had at least two subtasks: the first consists of indexing PubMed abstracts with MeSH terms, while the second consists of answering biomedical questions. Each year the organizers provide new annotated questions that are curated by biomedical experts. In 2019 (BioASQ 7), task B phase A consisted of retrieving relevant documents, concepts, and snippets from designated article repositories to answer biomedical questions. The best Mean Average Precision (MAP) scores ranged between 0.1218 and 0.2898 on five test batches.
This dataset is similar to ours because PubMed abstracts are used to answer natural language questions. However, while the BioASQ questions are created and answered by experts, our dataset is made of questions and answers created by users of online forums. The consequence of this difference is that our questions are taken directly from real users, and the answers are given by non-experts.
There have been other efforts in developing datasets for biomedical QA tasks. MediQA [4] is another relevant challenge, which focuses on clinical data. This dataset consists of questions and candidate answers, which the participants had to classify and rank according to their relevance to the question. While this dataset was developed manually to include both correct and incorrect answers, the organizers also provided other datasets derived from online resources which contain only the correct answers [1]. The dataset released for this task contained 383 questions, while the answers were retrieved using a QA system and ranked by experts. emrQA is another dataset that also focuses on the clinical domain [15]. It consists of 6,686 questions and explores the i2b2 dataset to generate QA pairs. Recently, Jin et al. [12] released the PubMedQA dataset, where they derive questions based on article titles and answer them with their respective abstracts.

CORPUS GENERATION
Figure 1 summarizes our framework to generate the SciQA corpus from Q&A forums. For each post of a forum, we retrieved its title, body text, and the number of votes. If the forum is not specifically for Q&A, such as Reddit, we selected only posts with question marks in the title or body. For each post, we retrieved all of its first-level replies, i.e., we ignored comments made on the question and answers because these usually are not direct answers to the question. We also did not take into account user data, although other authors have pointed out that this information could be relevant to rank answers [7]. However, we saved the original IDs of every post and reply so that additional information could be retrieved later.
Afterward, we parsed each answer to obtain the hyperlinks mentioned throughout the answer text. Whenever possible, we linked various types of URLs to PubMed: from PMC, doi.org, ScienceDirect, and ResearchGate. More types of URLs could be handled in the future using more complex mapping rules, or using machine learning. Other users provided the full formatted citation without a hyperlink, which would require more complex methods to link to PubMed automatically.
Example 3.1 shows a question from our corpus, and a document used to answer it. The answer has some text written by the user as well as the passage from the article that contained the answer, which we omitted. This is a simple example where the user admits that they are not an expert and simply used academic search engines. Indeed, if we search the question title text on PubMed, the linked article can be found in the results list. Example 3.2 provides another example where domain knowledge is necessary to retrieve the articles to answer the question. In this case, the question text should be processed to remove stop words, and it is necessary to have some understanding of the relationship between cholesterol levels and consumption of eggs.
We did not filter the answers by their score; however, in our analysis, we tested various thresholds for the minimum number of votes and their effect on document retrieval engines (Section 5.2). When an answer has more votes, it means that more users agreed that the answer is correct or at least relevant. While we could have applied stricter criteria for answer selection, we observed some answers supported by citations but without additional votes (every answer starts with one vote). This is possibly due to the limited size of the forums' user base. However, there were cases of negative scores, which can indicate that the answer is incorrect or unrelated to the question.
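The post selection and link extraction steps described above can be illustrated with a minimal Python sketch. It is not the framework's actual implementation: the function names, the URL patterns, and the restriction to PubMed, PMC, and doi.org links are assumptions made for illustration, and the final resolution of a PMC identifier or DOI to a PMID (which needs an external lookup) is deliberately left out.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


def looks_like_question(title, body):
    """Heuristic for general-purpose forums: keep posts with a question mark."""
    return "?" in title or "?" in body


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(answer_html):
    """Return all hyperlinks found in the HTML of an answer."""
    parser = LinkCollector()
    parser.feed(answer_html)
    return parser.links


def classify_citation(url):
    """Return (kind, identifier) for recognized citation URLs, else None.

    The URL shapes below are assumptions; other publishers would need
    their own mapping rules.
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parsed.path
    if host == "pubmed.ncbi.nlm.nih.gov":
        match = re.match(r"^/(\d+)/?$", path)
        if match:
            return ("pmid", match.group(1))
    if host == "ncbi.nlm.nih.gov":
        match = re.match(r"^/pubmed/(\d+)/?$", path)
        if match:
            return ("pmid", match.group(1))
        match = re.match(r"^/pmc/articles/(PMC\d+)/?", path)
        if match:
            return ("pmcid", match.group(1))
    if host in ("doi.org", "dx.doi.org"):
        doi = path.lstrip("/")
        if doi.startswith("10."):
            return ("doi", doi)
    return None
```

Links that return `None` (blog posts, Wikipedia pages, unformatted citations) are simply skipped in this sketch, which mirrors the fact that only answers with automatically mappable citations contribute question-article pairs.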

CORPUS VALIDATION
We validated the SciQA corpus using two strategies. First, we compared the results obtained with three search methods on the questions of the dataset. This way, we could compare the references obtained through user answers with the ones obtained directly using search engines. Second, we validated the dataset by applying it to a biomedical question answering task. The objective was to determine whether, with our dataset, we can obtain performance similar to that of a dataset developed by domain experts.

Search engine validation
In order to understand the potential of our dataset for document retrieval, we evaluated three different baseline methods. We considered the articles obtained from the forum answers as the list of relevant documents for each question. The three methods were: i) the NCBI Eutils API, which provides access to the PubMed search engine, ii) a query likelihood scorer with Dirichlet term smoothing calculated by Galago [6] on our local version of PubMed, and iii) a BM25 scorer [18], also calculated by Galago. The PubMed search engine also uses BM25 at a first stage, with additional improvements [8].
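For readers unfamiliar with the two local scorers, the following Python sketch shows the textbook form of each. It is an illustration only and is not Galago's code: the function names, the parameter values (mu, k1, b), and the particular IDF variant are assumptions, and Galago's implementation may differ in such details.

```python
import math
from collections import Counter


def dirichlet_query_likelihood(query_terms, doc_terms, collection_tf,
                               collection_len, mu=1500.0):
    """Log query likelihood with Dirichlet smoothing.

    Each query term contributes log((tf + mu * P(t|C)) / (|d| + mu)),
    where P(t|C) is the term's relative frequency in the collection.
    mu=1500 is an assumed, commonly used value.
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_collection = collection_tf.get(term, 0) / collection_len
        if p_collection == 0.0:
            continue  # terms unseen in the collection are skipped in this sketch
        score += math.log((tf[term] + mu * p_collection) / (doc_len + mu))
    return score


def bm25(query_terms, doc_terms, doc_freq, n_docs, avg_doc_len,
         k1=1.2, b=0.75):
    """Textbook BM25 with a non-negative IDF variant (assumed here)."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
    return score
```

Both functions score one document against one query; ranking a collection amounts to scoring every candidate document and sorting by descending score.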
We tested querying each method with the title and body text of the questions, as well as with the answer text, since the articles were usually more related to the answer text than to the question text (Figure 2). Although in a QA task the answer text would not be available, we still wanted to study how much it could help the search engines retrieve the right articles.
Our framework first tokenized the text using spaCy and then removed HTML tags, stopwords, punctuation, and spaces. We used the "OR" operator with the PubMed search since otherwise the articles would have to contain all the words of the query, and we would get very few results. We retrieved a maximum of 100 articles per question, ordered by the relevance score provided by the PubMed API, and by the Dirichlet term smoothing score and BM25 score calculated by Galago. We retrieved 100 articles because this was the same number of articles retrieved by Nentidis et al. [14] (see Section 4.2). Then we compared these with the articles given by the users that answered each question, which we considered as correct if they had more than a given number of votes. If the correct articles appear at the top of the search results, it means that those articles could have been obtained through a search query. However, if the correct articles appear at the bottom or do not appear at all, it means that the document retrieval process should be improved.
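A minimal sketch of this query construction is given below. It departs from the pipeline described above in two labeled ways: it uses a regular expression instead of spaCy for tokenization (so that it runs with the standard library only) and it strips HTML before tokenizing; the stopword list is a tiny illustrative sample, and the function names are hypothetical.

```python
import re
from html import unescape

# Tiny illustrative stopword list; a real pipeline would use a full one.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "do", "does", "for", "in", "is", "it",
    "of", "on", "or", "the", "to", "what", "why", "how",
})


def strip_html(text):
    """Remove tags with a simple regex and decode HTML entities."""
    return unescape(re.sub(r"<[^>]+>", " ", text))


def query_terms(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", strip_html(text).lower())
    return [token for token in tokens if token not in STOPWORDS]


def build_or_query(text):
    """Join the remaining terms with the OR operator."""
    return " OR ".join(query_terms(text))
```

For instance, `build_or_query("<p>Do eggs raise cholesterol?</p>")` yields a disjunctive query over the three content words, so an article only needs to match one of them to be returned.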

BioASQ6b validation
Document retrieval is a significant step of a QA pipeline since the answers can be derived from the retrieved documents. For this reason, we used our dataset with a biomedical QA system that performed document retrieval tasks to evaluate the effect on its performance. We chose the AUEB system developed for BioASQ6 [14] since it obtained the highest MAP scores in that competition for the document retrieval tasks. The BioASQ6 dataset, on which the AUEB system was evaluated, consists of a training set of 2,151 questions, a development set of 100 questions, and five testing batches of 100 questions each. On this dataset, the results of each testing batch are reported separately. The AUEB team explored various deep learning models for BioASQ6b. We selected their extension of the Position-Aware Convolutional Recurrent Relevance (PACRR) model [11], dubbed TERM-PACRR, as it obtained better results in most test batches for document retrieval. This model computes a query-document similarity matrix and uses a convolutional neural network and a multi-layer perceptron to compute the relevance score of each query term.
We added the SciQA corpus to the BioASQ corpus to determine if the additional questions would improve the performance of the AUEB system. To accomplish this, we first executed the AUEB system with the original BioASQ6 data to reproduce the results. Then we reran it using the full SciQA dataset, in addition to the BioASQ data. The AUEB system first retrieves 100 documents for each question using Galago with the BM25 score and then reranks these documents. Therefore, we did the same for the questions of the SciQA corpus, using our local version of PubMed indexed by Galago. At each epoch, the model was evaluated on the development set, and, at the end of each training run, the model that obtained the best performance on the development set was selected. We followed the procedure described by the original AUEB submission, which consisted of training ten versions of each model using different random seeds and combining the results of the ten models on each test batch. The purpose of this procedure is to reduce the variability of the results of different training runs.
Finally, we also experimented with using only the SciQA corpus to train a model with the AUEB system. We followed the same procedure to evaluate this version of the model.
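The text does not specify how the results of the ten runs were combined. As one plausible illustration only, the sketch below averages the relevance score each run assigns to a document and reranks by that mean; the function name, the input format (one dictionary of document scores per run), and the choice of averaging are all assumptions and should not be read as the AUEB procedure.

```python
from collections import defaultdict


def combine_runs(runs, top_k=10):
    """Rank documents by their mean score over several training runs.

    runs: list of dicts mapping document ID to relevance score, one per seed.
    A document missing from a run contributes a score of 0 for that run
    (an assumption of this sketch).
    """
    if not runs:
        return []
    totals = defaultdict(float)
    for run in runs:
        for doc_id, score in run.items():
            totals[doc_id] += score
    n_runs = len(runs)
    # Sort by descending mean score; ties are broken by document ID.
    ranked = sorted(totals.items(), key=lambda item: (-item[1] / n_runs, item[0]))
    return [doc_id for doc_id, _ in ranked[:top_k]]
```

Averaging over seeds smooths out run-to-run differences in the ranking, which is the stated purpose of training several versions of the model.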

Evaluation measures
We calculated the Mean Average Precision (MAP) on each subset of the SciQA corpus and for each validation method; MAP is also the metric used in BioASQ Task B Phase A. This way, we can indirectly compare the scores obtained on our dataset with those obtained on the BioASQ datasets. To calculate the MAP score, we first calculate the average precision (AP) obtained for each question, and then the macro-average over all questions. The average precision is given by:

$$AP = \frac{\sum_{r=1}^{|L|} P(r) \cdot rel(r)}{|L_R|}$$

where $|L|$ is the number of retrieved articles, $|L_R|$ is the number of relevant articles obtained from the answers, $P(r)$ is the precision obtained with the first $r$ retrieved articles, and $rel(r)$ is 1 if the $r$-th retrieved article is relevant, and 0 otherwise. Considering Example 3.1, since it only has one relevant article, its AP would be equal to $1/n$, where $n$ is the position of the relevant article in the retrieved list. For the BioASQ6b validation, we used a variation of this measure where $|L_R|$ is always equal to 10, which is how it is implemented in the evaluation tool. Then we calculate MAP as:

$$MAP = \frac{1}{|Q|} \sum_{q_i \in Q} AP(q_i)$$

where $q_i \in Q$ and $Q$ is the list of all questions.
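The definitions above translate directly into code. The following Python sketch is an illustration under stated assumptions: the function names and the `fixed_denominator` parameter (used to mimic the variant where the number of relevant articles is fixed at 10) are hypothetical, and each retrieved list is assumed to contain no duplicate article IDs.

```python
def average_precision(retrieved, relevant, fixed_denominator=None):
    """Average precision of one ranked list.

    retrieved: ranked list of article IDs (assumed free of duplicates).
    relevant: collection of article IDs considered correct.
    fixed_denominator: if given, replaces the number of relevant articles
        in the denominator (e.g. 10 for the variant described above).
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, article_id in enumerate(retrieved, start=1):
        if article_id in relevant:
            hits += 1
            precision_sum += hits / rank  # P(r) at each relevant position
    denominator = len(relevant) if fixed_denominator is None else fixed_denominator
    return precision_sum / denominator


def mean_average_precision(runs, fixed_denominator=None):
    """Macro-average of AP over (retrieved, relevant) pairs, one per question."""
    if not runs:
        return 0.0
    total = sum(
        average_precision(retrieved, relevant, fixed_denominator)
        for retrieved, relevant in runs
    )
    return total / len(runs)
```

With a single relevant article found at position n, `average_precision` returns 1/n, matching the case discussed for Example 3.1.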

SCIQA CORPUS ANALYSIS

Data sources
We focused on three online forums to extract QA data: Biology and Medical Sciences from Stack Exchange (we refer to them as Biology.SE and MS.SE) and Nutrition from Reddit (r/nutrition). We assumed that the users of these communities would give more value to answers supported by scientific articles, instead of accepting answers that could be based on personal experiences, for example. On Reddit, only an account is necessary to create and answer posts, as well as to vote on posts and replies. However, r/nutrition does not allow posts from new accounts or accounts with negative karma (balance of positive and negative votes). On Stack Exchange, it is possible to answer as a guest without creating an account, but it has a reputation system so that only established users can vote on other answers. Both forums have volunteer moderators who define community guidelines and can delete posts. Table 1 provides some statistics on these forums, obtained at the time of writing.

Corpus statistics
The SciQA corpus consists of a collection of three datasets, one obtained from each of the forums previously described. We show the number of questions and question-article pairs of each dataset in Table 2. We could only use questions with answers that contained citations that we could automatically extract and map to PMIDs, so we had to discard a large number of questions. As we did not normalize the number of votes, we can see a relation between the average number of votes and the size of the community: r/nutrition has a larger community than Biology.SE and MS.SE; hence its average number of votes is higher.
Each answer has a number of votes associated with it, as well as a list of PMIDs, retrieved at the timestamp shown in Table 2. Figure 3 shows the distribution of each dataset in terms of the number of votes on the answers, while Figure 4 shows the distribution of the number of PMIDs per answer. We can see a distribution of votes similar to the one reported by Dalip et al. [7]. The MS.SE subset is the smallest of the three; hence it also does not have answers with as many votes as the other two. However, it has a considerable number of answers with many PMIDs, as can be seen in Figure 4. The r/nutrition subset has a lower number of PMIDs per answer. However, many answers had links to other sources of information, such as blog posts and Wikipedia pages. We also observed that these communities did not always value answers with a large number of citations because more succinct answers are easier to understand.

SEARCH ENGINE EVALUATION
As a baseline evaluation of the dataset, we attempted to retrieve the same PubMed documents that were referenced by the answers, using various search engines. This way, we can observe whether a search engine could obtain the same type of answers as those provided by the users.
Since we do not have a score for each PMID but only for the answer as a whole (which may contain several PMIDs), we cannot correctly rank lists of PMIDs from the same answer. Nevertheless, the number of votes can provide a measure of the relevance of all PMIDs. On the other hand, it is also difficult to evaluate questions for which we only have a small number of PMIDs, regardless of the number of votes. For this reason, we experimented with limiting both the minimum number of votes (Table 3) and the number of PMIDs associated with the answers (Table 4) while performing a baseline evaluation.
As explained in Section 4.1, we retrieved the top 100 articles given by the query likelihood with Dirichlet term smoothing, which is the default scorer of Galago. Overall, we can see that increasing the minimum number of PMIDs can lead to higher MAP scores, although this results in a smaller corpus. The maximum MAP score is achieved when selecting only questions with at least 10 PMIDs for Biology.SE, 5 PMIDs for MS.SE, and 10 PMIDs for r/nutrition. However, on the r/nutrition dataset, this highest MAP score corresponds to a corpus of just 4 questions, and for larger corpus sizes the MAP scores are much lower. We were also able to obtain higher MAP scores by filtering by the number of votes, on the MS.SE and r/nutrition datasets. Ideally, a balance should be found between the minimum number of votes and PMIDs and the size of the corpus.
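The thresholding on votes and PMIDs can be sketched as a simple filter. This is one possible reading of the procedure, not the exact one used for Tables 3 and 4: the record fields (`question_id`, `votes`, `pmids`) are hypothetical, the vote threshold is treated as inclusive, and the PMID threshold is applied to the PMIDs pooled over all retained answers of a question.

```python
from collections import defaultdict


def filter_corpus(answers, min_votes=1, min_pmids=1):
    """Select questions by answer votes and number of linked PMIDs.

    answers: iterable of dicts with the (hypothetical) keys
        "question_id", "votes", and "pmids".
    Returns a dict mapping each retained question ID to its sorted PMIDs.
    """
    pooled = defaultdict(set)
    for answer in answers:
        if answer["votes"] >= min_votes:
            pooled[answer["question_id"]].update(answer["pmids"])
    return {
        question_id: sorted(pmids)
        for question_id, pmids in pooled.items()
        if len(pmids) >= min_pmids
    }
```

Raising either threshold shrinks the returned dictionary, which makes the trade-off between corpus size and the reliability of the relevance judgments explicit.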
Finally, we studied two other variables in the baseline evaluation: the document retrieval method and the text used as a search query (title, body, or answer). Table 5 shows the results of this study, using the full SciQA corpus. PubMed search obtained lower MAP scores, even though its method is an improvement over the BM25 scoring method. This search engine is tuned for user queries, so we assume that questions easily answered by a PubMed search are less likely to be posted on Q&A forums. Furthermore, in every case, using the body text results in lower scores, and the best scores are achieved using the answer text.

EVALUATION ON THE BIOASQ6B DATASET
We applied the full SciQA corpus to a deep learning QA system to study its impact on this type of system, merging the questions of the three subsets into a single corpus, without any restrictions on the number of votes or PMIDs. We explored two scenarios: using SciQA in conjunction with the BioASQ6b training set and using just the SciQA corpus to train the model. The results of both experiments, as well as the results we obtained using only the BioASQ6b training data, are shown in Table 6, for each test batch. We obtained higher MAP scores by adding the SciQA corpus to the BioASQ6b training set on 4 out of 5 batches. Although these were small improvements, we can see that even when training only with the SciQA corpus, the MAP score obtained is not much lower than the BioASQ-only score.

DISCUSSION
Comparing Tables 3 and 4, we can see that, for the Biology.SE dataset, the number of PMIDs per answer is a more efficient way to obtain higher MAP scores, since for similar corpus sizes, the MAP score is higher using a PMID threshold. Even for higher threshold values, which result in small corpus sizes, the Biology.SE dataset obtains higher MAP scores. We did not show results for higher values given the small number of documents that remained. We also did not include answers with fewer than one vote since there were only 394 answers in this situation and they did not change the results of this analysis. We expected MS.SE and r/nutrition to have the same behavior as Biology.SE in Table 4, in terms of how the MAP score increases with the minimum number of PMIDs. Since this was not the case, those two datasets proved to be more challenging and may require better query processing and natural language understanding in order to retrieve the correct documents.
We compare the MAP scores obtained on the SciQA corpus with the ones obtained on the latest BioASQ challenge because it has a subtask that consists of document retrieval for QA (Task B Phase A). BioASQ is organized yearly, and every year there are new batches of questions. The participating teams are evaluated on each batch separately, obtaining different scores on each one. While the best score achieved in the 2019 edition was 0.2898 on test batch 3, on test batch 5 the best score was 0.1218. We also obtained a range of MAP scores on our corpus, in most cases lower than on the BioASQ test batches. Furthermore, our corpus is closer to a real-world scenario since we are using user-submitted questions, which require natural language understanding to be answered correctly. One possibility is that users had access to a search engine before posting a question and did not find an answer, therefore increasing the complexity of the questions. Novel systems tuned for this type of data can obtain better scores, especially considering that more data is made available each day as users post questions and answers on those forums.
We were able to train a deep learning model with our corpus that obtained MAP scores similar to a model trained on a corpus annotated by experts. One possible explanation for these results is that our corpus has more questions, and these questions were retrieved from various sources. Furthermore, adding the questions from our corpus to the model trained on the BioASQ6b training data led to slightly better MAP scores on four test batches. These results highlight the importance of having more datasets for biomedical QA in order to improve the existing systems.
Our approach has some limitations when compared to expert annotations. Since Q&A forums are anonymous, there is no accountability of the answers, and personal biases may be more apparent than within a group of expert annotators.
This approach is also dependent on how engaged the users are within the community: for example, r/nutrition has more users and a higher maximum number of votes than the other two, but it has fewer answers that we could link to PubMed articles, resulting in a lower average number of PMIDs. Another limitation is that there is no context for each citation used in an answer. In some cases, an article is used as an example and not necessarily to answer the question. Our framework treats every article associated with an answer the same way, so it cannot distinguish which ones are more relevant.
We selected the biomedical domain given its complexity and because it is not usually the focus of QA systems. Although some biomedical QA systems have been proposed [9], these are limited in comparison to general ones. Biomedical-specific approaches are essential due to the relevance of this topic to society. On the web, there are frequent health-related questions that are not easily answered but could be clarified through an efficient retrieval of the relevant literature. Questions from the SciQA corpus that were answered with several citations were related to vaccines, genetically modified food products, and meat consumption, for example, topics for which much information can be found in the scientific literature.
Our framework could potentially be applied to other communities where biomedical questions are asked, for example, Quora, or social networks such as Twitter and Facebook. We selected PubMed as the reference library since it is widely used by the biomedical community and can be easily accessed.

CONCLUSION
This manuscript presents an effective framework to generate corpora for QA document retrieval using Q&A forums, demonstrated in the form of the SciQA corpus. These forums are a useful resource for QA systems since the questions are real-world examples of user questions, and the answers are crowdsourced from multiple users. To the best of our knowledge, this is the first attempt at generating this type of dataset for QA. The task of document retrieval is a significant step of a QA system, so these systems must optimize this step according to the user needs. We demonstrated the feasibility and performance of the framework and the SciQA corpus, namely their relevance for training and evaluating QA systems, particularly in the biomedical domain, for which resources are scarcer. The SciQA corpus is composed of 5,433 questions and 10,204 question-article pairs obtained from three different Q&A forums. The results obtained with our baseline methods indicate that higher MAP scores can be obtained on questions with a higher number of PMIDs, and more advanced methods are necessary to retrieve articles for all questions correctly.