Recent Trends in Deep Learning Based Open-Domain Textual Question Answering Systems

Open-domain textual question answering (QA), which aims to answer questions from large data sources like Wikipedia or the web, has gained wide attention in recent years. Recent advancements in open-domain textual QA are mainly due to significant developments in deep learning techniques, especially machine reading comprehension and neural-network-based information retrieval, which allow models to continuously refresh state-of-the-art performance. However, a comprehensive review of existing approaches and recent trends has been lacking in this field. To address this issue, we present a thorough survey that explicitly delineates the task scope of open-domain textual QA, overviews recent key advancements in deep learning based open-domain textual QA, illustrates the models and acceleration methods in detail, and introduces open-domain textual QA datasets and evaluation metrics. Finally, we summarize the models, and discuss the limitations of existing works and potential future research directions.


I. INTRODUCTION
A. BACKGROUND
Question answering (QA) systems have long drawn attention from both academia and industry [1]-[3]; the concept of the QA system can be traced back to the emergence of artificial intelligence, namely the famous Turing test [4]. Technologies related to QA have been evolving for almost 60 years in the field of Natural Language Processing (NLP) [5]. Early works on QA mainly relied on manually-designed syntactic rules to answer simple questions due to constrained computing resources [6]; examples include Baseball in 1961, Lunar in 1977, and Janus in 1989 [5]. Around 2000, several conferences such as TREC QA [1] and QA@CLEF [7] greatly promoted the development of QA, and a large number of systems that utilize information retrieval (IR) techniques were proposed at that time. Then, around 2007, with the development of knowledge bases (KBs) such as Freebase [8] and DBpedia [9], and especially with the emergence of open-domain datasets such as WebQuestions [10] and SimpleQuestions [11], KBQA technologies evolved quickly. In 2011, IBM Watson [12] won the Jeopardy! game show, which received a great deal of attention. Recently, due to the release of several large-scale benchmark datasets [13]-[15] and the fast development of deep learning techniques, large advancements have been made in the QA field. In particular, recent years have witnessed a research renaissance in deep learning based open-domain textual QA, an important QA branch that focuses on answering questions from large knowledge sources like Wikipedia and the web.

B. MOTIVATION
Despite the flourishing research on open-domain textual QA, there remains a lack of a comprehensive survey that summarizes existing approaches and datasets, as well as a systematic analysis of the trends behind these successes. Although several surveys [16]-[19] have discussed the broad picture of QA, none of them focused on the specific deep learning based open-domain textual QA branch. Moreover, several surveys [20]-[23] illustrate recent advancements in machine reading comprehension (MRC) by introducing classic neural MRC models. However, they only reported approaches in closed-domain single-paragraph settings, and failed to present the latest achievements in open-domain scenarios. We therefore write this paper to summarize the recent literature on deep learning based open-domain textual QA for researchers, practitioners, and educators who are interested in this area.

C. TASK SCOPE
In this paper, we conduct a thorough literature review of recent progress in open-domain textual QA. To achieve this goal, we first categorize previous works based on the five characteristics described below, then give an exact definition of open-domain textual QA that explicitly constrains its scope.
1) Source: According to the data source, QA systems can be classified into structured, semi-structured, and unstructured categories. On the one hand, structured data are mainly organized in the form of a knowledge graph (KG) [9], [24], [25], while semi-structured data are usually viewed as lists or tables [26]-[28]. On the other hand, unstructured data are typically plain text composed of natural language.
2) Question: The question type is defined as a certain semantic category characterized by some common properties. The major types include factoid, list, definition, hypothetical, causal, relationship, procedural, and confirmation questions [17]. Typically, a factoid question starts with a Wh-interrogative word (What, When, Where, etc.) and requires an answer expressed as a fact in the text [17]. The form of a question can be a full question [14], a keyword/phrase [15], or an (item, property, answer) triple [29].
3) Answer: Based on how the answer is produced, QA systems can be roughly classified into extractive-based QA and generative-based QA. Extractive-based QA selects a span of text [13], [15], [30], a word [31], [32], or an entity [10], [11] as the answer. Generative-based QA may rewrite the answer if it does not (i) include proper grammar to make it a full sentence, (ii) make sense without the context of either the query or the passage, or (iii) have a high overlap with exact portions of the context [33], [34].
4) Domain: A closed-domain QA system deals with questions in a specific field [35], [36] (e.g., law, education, and medicine), and can exploit domain-specific knowledge frequently formalized in ontologies. Besides, closed-domain QA usually refers to a situation where only a limited type of question is asked and a small amount of context is provided.
An open-domain QA system, on the other hand, deals with questions from a broad range of domains and can only draw on general world knowledge.
The remainder of this paper is organized as follows. We first give an overview of open-domain textual QA systems (Section II). We then illustrate the models in detail, covering the paragraph index&ranking module, answer extraction, and answer selection, and summarize recent trends in acceleration techniques as well as public datasets and metrics (Section III). Last, we conclude the work with discussions on the limitations of existing works and some future research directions (Section IV).

II. OVERVIEW OF OPEN-DOMAIN TEXTUAL QA SYSTEMS
Before diving into the details of this survey, we start with an introduction to the history of open-domain textual QA, the reasons why deep learning based methods have emerged, and the general architecture of deep learning based open-domain textual QA systems.

A. HISTORY OF OPEN-DOMAIN TEXTUAL QA
In 1993, START became the first knowledge-based question-answering system on the Web [43], and it has since answered millions of questions from Web users all over the world.
In 1999, the 8th TREC competition [44] began to run the QA track. In the following year, at the 38th ACL conference, a special discussion topic ''Open-domain Question Answering'' was opened up. Since then, the open-domain QA system has become a hot topic in the research community. With the development of structured KBs like Freebase [8], many works proposed to construct QA systems over KBs, driven by benchmarks such as WebQuestions [10] and SimpleQuestions [11]. These approaches usually achieve high precision and nearly solve the task for simple questions [45], but their scope is limited to the ontology of the KBs. There are also some pipelined QA approaches that use a large number of data resources, including unstructured text collections and structured KBs; the landmark approaches are ASKMSR [3], DEEPQA [12], and YODAQA [2]. A landmark event in this field is the success of IBM Watson [12], which won the Jeopardy! game show in 2011. This complicated system adopted a hybrid scheme including technologies from IR, NLP, and KBs. In recent years, with the development of deep learning, neural QA systems have emerged that can directly carry out end-to-end processing of unstructured text sequences at the semantic level through neural network models [46]. Specifically, DrQA [37] was the first neural-network-based model for the task of open-domain textual QA. Based on this framework, several end-to-end textual QA models have been proposed, such as R³ [47], DS-QA [48], DocumentQA [49], and RE³QA [38].

B. WHY DEEP LEARNING FOR OPEN-DOMAIN TEXTUAL QA
It is beneficial to understand the motivation behind these approaches for open-domain textual QA. Specifically, why do we need deep learning techniques to build open-domain textual QA systems? What are the advantages of neural-network-based architectures? In this section, we answer the above questions by listing the strengths of deep learning based QA models:
1) Automatically learning complex representations: Using neural networks to learn representations has two advantages. (1) It reduces the effort spent on hand-crafted feature design. Feature engineering is labor-intensive work, while deep learning enables automatic feature learning from raw data in unsupervised or supervised ways [50].
(2) Contrary to linear models, neural networks are capable of modeling non-linearity in data with activation functions such as ReLU, Sigmoid, Tanh, etc. This property makes it possible to capture complex and intricate interaction patterns in the data [50].
2) End-to-end processing: Many early QA systems relied heavily on question and answer templates, which were mostly manually constructed and time-consuming. Later, most QA research adopted a pipeline of conventional linguistically-based NLP techniques, such as semantic parsing, part-of-speech tagging, and coreference resolution, which could cause error propagation through the entire process.
On the other hand, neural networks have the advantage that multiple building blocks can be composed into a single (gigantic) differentiable function and trained end-to-end. Besides, models of different stages can share learned representations and benefit from multi-task learning [51].
3) Data-driven paradigm: Deep learning is essentially a science based on statistics, and one of its intrinsic properties is that it follows a data-driven paradigm. That is, neural networks can learn the statistical distributions of features from massive data, and the performance of a model can be constantly improved as more data are used [52].
Most recent deep learning based systems follow the framework of DrQA [37], which splits the task into two subtasks: paragraph retrieval and answer extraction. The paragraph retrieval module selects and ranks the candidate paragraphs according to the relevance between paragraph and question, while the answer extraction module predicts the start and end positions of candidate answers in the context. Later, Clark and Gardner [49] proposed a shared-normalization mechanism to deal with the distant-supervision problem in open-domain textual QA. Wang et al. [47] adopted reinforcement learning to jointly train the ranker and the answer-extraction reader. Based on this work, Wang et al. [53] further proposed evidence aggregation for answer re-ranking. Recently, Hu et al. [38] presented an end-to-end open-domain textual QA architecture to jointly perform context retrieval, reading comprehension, and answer re-ranking.
To summarize these works, we propose a general technical architecture for open-domain textual QA systems in Fig. 1. The architecture mainly consists of three modules: paragraph index&ranking, candidate answer extraction, and final answer selection. Specifically, the paragraph index&ranking module first retrieves the top-k paragraphs related to the question. These paragraphs are then sent to the answer extraction module, which locates multiple candidate answers. Finally, the answer selection module predicts the final answer. Moreover, in order to improve the efficiency of QA systems, some acceleration techniques, such as jump reading [54] and skim reading [55], can be applied in the system.
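To make this architecture concrete, the following is a minimal, hypothetical Python sketch of the overall pipeline; the retrieve and extract callables stand in for any of the concrete models surveyed below, and all names here are ours rather than from any particular system.

    # A minimal sketch of the generic three-module pipeline in Fig. 1.
    # retrieve() and extract() are placeholders for concrete models.

    from typing import Callable, List, Tuple

    def answer_question(
        question: str,
        corpus: List[str],
        retrieve: Callable[[str, List[str], int], List[str]],    # paragraph index&ranking
        extract: Callable[[str, str], List[Tuple[str, float]]],  # candidate answer extraction
        k: int = 5,
    ) -> str:
        # 1) Paragraph index & ranking: keep only the top-k relevant paragraphs.
        paragraphs = retrieve(question, corpus, k)
        # 2) Candidate answer extraction: each paragraph yields (span, score) candidates.
        candidates = [c for p in paragraphs for c in extract(question, p)]
        # 3) Final answer selection: here, simply the highest-scoring candidate.
        return max(candidates, key=lambda c: c[1])[0]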

III. MODELS AND HOT TOPICS
In this section, we illustrate the individual components of the generalized open-domain textual QA system described in Fig. 1. Specifically, we introduce: (i) the paragraph index&ranking module in subsection III-A, (ii) the candidate answer extraction module in subsection III-B, (iii) the final answer selection module in subsection III-C, and (iv) acceleration techniques in subsection III-D. Finally, we give a brief introduction to recent open-domain textual QA datasets in subsection III-E, as well as experimental evaluation and model performance in subsection III-F.

A. PARAGRAPH INDEX AND RANKING
The first step of open-domain textual QA is to retrieve several top-ranked paragraphs that are relevant to the question. There are two sub-stages here: retrieving documents through indexing, and ranking the context fragments (paragraphs) in these documents. The paragraph-index module builds a light-weight index for the original documents. During processing, the index dictionary is loaded into memory, while the original documents are stored in file systems. This method can effectively reduce memory overhead and accelerate the retrieval process. The paragraph-ranking module analyzes the relevance between the query and paragraphs and selects top-ranked paragraphs to feed into the reading comprehension module. In recent years, along with the development of information retrieval and NLP, a large number of new technologies regarding indexing and ranking have been proposed. Here we mainly focus on deep learning based approaches.

1) PARAGRAPH INDEX
Paragraph indices can be classified into query-dependent and query-independent indices. Query-dependent indexing mainly includes the dependence model and pseudo relevance feedback (PRF) [56], [57], which consider the approximation between query and document terms. However, because the index depends on queries, the corresponding ranking models are difficult to scale and generalize. Query-independent indexing mainly includes TF-IDF, BM25, and language modeling [56], [57], which use relatively simple index features and have low computational complexity for matching. IBM Watson adopted a search method that combines the query-dependent similarity score with the query-independent score to determine the overall search score for each passage [58]. Although these index features are relatively efficient and scalable, they are mainly based on terms without contextual semantic information.
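As a concrete reference, the classic query-independent BM25 score can be sketched in a few lines of Python; the whitespace tokenization and the default k1 and b values are illustrative assumptions, and a real system would precompute an inverted index rather than scan the corpus.

    # Minimal sketch of query-independent BM25 scoring over a small corpus.
    # k1 and b are the usual BM25 hyperparameters; illustrative, not production code.

    import math
    from collections import Counter

    def bm25_scores(query, docs, k1=1.5, b=0.75):
        tokenized = [d.lower().split() for d in docs]
        avg_len = sum(len(d) for d in tokenized) / len(tokenized)
        n = len(docs)
        scores = []
        for doc in tokenized:
            tf = Counter(doc)
            score = 0.0
            for term in query.lower().split():
                df = sum(1 for d in tokenized if term in d)  # document frequency
                if df == 0:
                    continue
                idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
                denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
                score += idf * tf[term] * (k1 + 1) / denom
            scores.append(score)
        return scores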
Recently, several deep learning based methods have been proposed. These approaches usually embed terms or phrases into dense vectors and use them as indices. Kato et al. [59] constructed a demo to compare the efficiency and effectiveness of LSTM and BM25. Seo et al. proposed Phrase-Indexed Question Answering (PIQA) [60], which employed bi-directional LSTMs and a self-attention mechanism to obtain representation vectors for both the query and the paragraph. Lee et al. leveraged a BERT encoder [61] to pre-train the retrieval module [62]; unlike previous works that simply retrieve candidate paragraphs, it treated the evidence passage retrieved from Wikipedia as a latent variable.
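The dense-index idea behind these approaches can be sketched as follows, assuming a trained encode function (LSTM- or BERT-based) that maps text to a fixed-size vector; brute-force inner product stands in for the aLSH or Faiss search used in practice.

    # Sketch of dense paragraph retrieval: paragraphs are embedded once
    # (a query-independent index) and searched by inner product at query time.
    # encode() stands in for any trained encoder; a real system would replace
    # the brute-force matmul with an ANN library such as Faiss.

    import numpy as np

    def build_index(paragraphs, encode):
        return np.stack([encode(p) for p in paragraphs])  # shape: (N, d)

    def search(index, query_vec, k=5):
        scores = index @ query_vec      # inner-product relevance scores
        return np.argsort(-scores)[:k]  # indices of the top-k paragraphs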

2) PARAGRAPH RANKING
Traditional ranking technologies are based on manually-designed features [63], but in recent years, learning-to-rank (L2R) approaches have become a hot spot. L2R refers to ranking methods based on supervised learning, and it can be classified into Pointwise, Pairwise, and Listwise approaches [64]. Pointwise methods (e.g., McRank [65], Prank [66]) convert each document into a feature vector, then output relevance scores according to a classification or regression function learned from the training data, which determine the ranking results. Pointwise methods focus on the relevance between the query and individual documents, ignoring the interactions among documents. Hence, Pairwise methods (e.g., RankNet [67], FRank [68]) estimate whether the order of a document pair is reasonable. However, the number of relevant documents varies greatly across queries, so the generalization ability of Pairwise methods is difficult to estimate. Unlike the above two methods, Listwise methods (e.g., LambdaRank [69], SoftRank [70]) optimize a scoring function with the full list of search results for each query as a training sample. Since the aim of paragraph ranking is to filter out irrelevant paragraphs, Pointwise methods seem adequate in most cases. However, the scores between queries and paragraphs can also be helpful for predicting the final answer, as we discuss in subsection III-C. Consequently, Listwise ranking methods are also important for the open-domain textual QA task.
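To make the Pointwise/Pairwise distinction concrete, here is a hedged PyTorch sketch of the two training objectives for an arbitrary scoring model; a Listwise method would instead define the loss over the entire ranked list of a query.

    # Sketch contrasting Pointwise and Pairwise L2R objectives in PyTorch.
    # The scores are outputs of any differentiable relevance model f(query, doc).

    import torch.nn.functional as F

    def pointwise_loss(score_qd, label):
        # Treat relevance as binary classification of a single (query, doc) pair;
        # label is a float tensor of 0/1 relevance judgments.
        return F.binary_cross_entropy_with_logits(score_qd, label)

    def pairwise_loss(score_pos, score_neg, margin=1.0):
        # Require the relevant doc to outscore the irrelevant one by a margin.
        return F.relu(margin - score_pos + score_neg).mean()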
Moreover, paragraph ranking models trained with deep neural networks mainly fall into four categories [56]: (i) learning the ranking model through manual features, and only using the neural network to match the query and document; (ii) estimating relevance based on exact query-document matching patterns; (iii) learning embedded representations of queries and documents, and evaluating them with a simple function, such as cosine similarity or dot product; (iv) conducting query expansion with neural network embeddings, and calculating the query expectation.
Similar to (ii), Wang et al. [47] proposed the Reinforced Ranker-Reader (R³) model, which is also a kind of Pointwise method. It consisted of: (1) a Ranker to select the paragraph most relevant to the query, and (2) a Reader to extract the answer from the paragraph selected by the Ranker. The deep learning based Ranker was trained with reinforcement learning, where the accuracy of the answer extracted by the Reader determined the reward. Both the Ranker and the Reader leveraged the Match-LSTM [71] model to match the query and passages. Similar to (iii), Tan et al. [72] studied several representation learning models and found that attentive LSTMs can be very effective in the Pairwise training mode. PIQA [60] employed approximate nearest-neighbor search to retrieve the indexed phrase vector nearest to the query vector, via asymmetric locality-sensitive hashing (aLSH) [73] or Faiss [74].
There are also combinations of the above categories. Htut et al. [75] combined (i) and (iii), taking embedded representations to train the ranking model, and proposed two kinds of ranking models: an InferSent ranker and a Relation-Networks ranker. Both rankers leveraged the Listwise ranking method and were trained by minimizing the margin ranking loss:

\mathcal{L}_{rank} = \sum_{i=1}^{k} \max(0,\, 1 - f(q, p_{pos}) + f(q, p_{neg_i}))

Here f is the scoring function, p_{pos} is a paragraph that contains the ground-truth answer, p_{neg_i} is a negative paragraph that does not contain the ground-truth answer, and k is the number of negative samples. The InferSent ranker leveraged sentence-embedded representations [76] to evaluate semantic similarity for ranking, employing a feed-forward neural network as the scoring function:

z = W^{(1)} x_{classifier} + b^{(1)}, \quad score = W^{(2)} \mathrm{ReLU}(z) + b^{(2)}

The Relation-Networks ranker focused on measuring the relevance between words in the question and words in the paragraph, where the word pairs are the inputs of a Relation Network:

score = f_{\phi}\Big(\sum_{i,j} g_{\theta}([E(q_i); E(p_j)])\Big)

Here E(·) is a 300-dimensional GloVe embedding [77], and f_{\phi} and g_{\theta} are 3-layer feed-forward neural networks with ReLU activation functions. The experimental results showed that the performance of this QA system [75] even exceeded that of the reinforcement-learning-based ranking model [47].
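Under the formulation above, the InferSent-style scorer and its margin ranking loss can be sketched in PyTorch as follows; the input dimensions and margin are illustrative assumptions rather than the values used in [75].

    # Sketch of an InferSent-style feed-forward scorer trained with the margin
    # ranking loss over k negative paragraphs, following the equations above.
    # x_classifier is the joint question-paragraph feature vector; dims are illustrative.

    import torch
    import torch.nn as nn

    class FeedForwardScorer(nn.Module):
        def __init__(self, in_dim=4096, hidden=512):
            super().__init__()
            self.w1 = nn.Linear(in_dim, hidden)  # z = W1 x + b1
            self.w2 = nn.Linear(hidden, 1)       # score = W2 ReLU(z) + b2

        def forward(self, x_classifier):
            return self.w2(torch.relu(self.w1(x_classifier))).squeeze(-1)

    def margin_ranking_loss(f, x_pos, x_negs, margin=1.0):
        # x_pos: features of the positive paragraph; x_negs: (k, in_dim) negatives.
        pos = f(x_pos)
        negs = f(x_negs)
        return torch.relu(margin - pos + negs).sum()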

B. CANDIDATE ANSWER EXTRACTION
Given the candidate paragraphs filtered by the index&ranking module, QA systems can locate candidate answers (the start and end positions of answer spans in the document or paragraph) through a reading comprehension model. With the release of datasets and test standards [13]-[15], [30], many works have been proposed in the past three years, attracting great attention from academia and industry.
In this subsection, we illustrate the extractive reading model at three levels: (i) word embeddings and pre-trained models for feature encoding in subsection III-B1, (ii) interaction between questions and paragraphs using attention mechanisms in subsection III-B2, and (iii) feature aggregation for predicting the candidate answers in subsection III-B3.

1) FEATURE ENCODING LAYER
In this layer, the original text tokens are transformed into vectors that can be processed by deep neural networks, through word embeddings or manual features. Word embeddings can be obtained from a dictionary or by fine-tuning pre-trained language models, while manual textual features are usually derived from part-of-speech (POS) tagging and named entity recognition (NER). Manual features can be constructed with tools such as CoreNLP [78], AllenNLP [79], and NLTK [80]. Generally, the features mentioned above are fused with embedding vectors. Embedding vectors can be constructed by pre-trained language models. GloVe [77] transfers word-level information to word vectors through a co-occurrence matrix, but cannot distinguish polysemous words. ELMo [81] leveraged a deep bi-directional language model, concatenated from two unidirectional language models, to yield word embeddings that vary with the context sentence. OpenAI GPT [82] used a left-to-right Transformer decoder [83], whereas BERT [61] used a bi-directional Transformer encoder [83] for pre-training; both of them adapt to downstream tasks through fine-tuning. Fig. 2 shows the differences between ELMo, GPT, and BERT. Specifically, the pre-trained BERT model has proven to be a powerful context-dependent representation and has brought significant improvements on open-domain textual QA tasks; works based on BERT, such as RE³QA [38], ORQA [62], and DFGN [84], have achieved state-of-the-art results.
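As an illustration of contextual feature encoding, the snippet below obtains BERT representations for a question-paragraph pair; the Hugging Face transformers library is our implementation choice here, not one prescribed by the surveyed papers.

    # Sketch of contextual feature encoding with a pre-trained BERT encoder.

    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")

    # Question and paragraph are encoded jointly as one packed sequence.
    batch = tokenizer("Who wrote Hamlet?",
                      "Hamlet is a tragedy written by William Shakespeare.",
                      return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state  # (1, seq_len, 768) contextual vectors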

2) INTERACTIVE ATTENTION LAYER
The interactive attention layer constructs representations over the original features of the question and paragraph using attention mechanisms. It can be mainly divided into two types. (i) Interactive alignment between the question and paragraph, namely co-attention, which allows the model to focus on the most relevant question features with respect to paragraph words, and breaks through the limited encoding ability of a single model. Wang and Jiang [71] leveraged a textual entailment model, Match-LSTM [85], to construct the attention processing. Xiong et al. [86] used a co-attention encoder to build co-dependent representations of the question and the document, and a dynamic pointer decoder to predict the answer span. Seo et al. proposed a six-layer model, BiDAF [87], with a memory-less attention mechanism that yields representations of the context paragraph at the character, word, and contextual levels. Gong and Bowman [88] added a multi-hop attention mechanism to BiDAF to address the limitation that a single-pass model cannot reflect on the text it has already read.
(ii) Self alignment inside the paragraph to generate self-aware features, namely self-attention, which allows non-adjacent words in the same paragraph to attend to each other, thus alleviating the long-term dependency problem.
For example, Wang et al. [89] proposed a self-attention mechanism to refine the question-aware passage representation by matching the passage against itself.
We can find two trends in recent works: (1) the combination of co-attention and self-attention, e.g., DCN+ [90] improved DCN by extending the deep residual co-attention encoder with self-attention, and Yu et al. leveraged a combination of convolutions and self-attention in the embedding and modeling encoders, with a context-query attention layer after the embedding encoder [91]; (2) the fusion of features at different levels, e.g., Huang et al. adopted a three-layer fully-aware attention mechanism to further enhance the feature representation ability of models [92], Wang et al. combined co-attention and self-attention mechanisms and applied a fusion function to incorporate different levels of features [93], and Hu et al. proposed a re-attending mechanism inside a multi-layer attention architecture, where prior co-attention and self-attention are both considered to refine the current attention [94].
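The BiDAF-style co-attention described in (i) can be sketched as follows; for brevity, the trilinear similarity function of [87] is replaced by a plain dot product, so this is an illustrative simplification rather than the original model.

    # Sketch of BiDAF-style context-query (co-)attention in PyTorch.
    # C: (B, n, d) context encodings, Q: (B, m, d) question encodings.

    import torch
    import torch.nn.functional as F

    def bidaf_attention(C, Q):
        S = torch.bmm(C, Q.transpose(1, 2))        # (B, n, m) similarity matrix
        # Context-to-query: each context word attends over all question words.
        c2q = torch.bmm(F.softmax(S, dim=2), Q)    # (B, n, d)
        # Query-to-context: attend to the context words most relevant to the query.
        b = F.softmax(S.max(dim=2).values, dim=1)  # (B, n)
        q2c = torch.bmm(b.unsqueeze(1), C)         # (B, 1, d)
        q2c = q2c.expand(-1, C.size(1), -1)        # tile across context positions
        # Merged, query-aware context representation, as in BiDAF's G layer.
        return torch.cat([C, c2q, C * c2q, C * q2c], dim=2)  # (B, n, 4d)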

3) AGGREGATION PREDICTION LAYER
In this layer, aggregation vectors are generated to predict candidate answers. We mainly focus on the following parts.
• Aggregation strategies. Aggregation strategies vary across network frameworks. BiDAF [87] and Multi-Perspective Matching [95] leveraged Bi-LSTMs for semantic information aggregation. FastQAExt [96] adopted two feed-forward neural networks to generate the probability distributions of the start and end positions of answers, then used beam search to determine the answer span.
• Iterative prediction strategies. DCN [86] consisted of a co-attentive encoder and a dynamic pointing decoder, which adopted a multi-round iteration mechanism. In each iteration, the decoder estimates the start and end of the answer span; based on the prediction of the previous iteration, an LSTM and a Highway Maxout Network update the prediction of the answer span in the next iteration. ReasoNet [97] and the Mnemonic Reader [94] used memory network frameworks for iterative prediction. DCN+ [90] and the Reinforced Mnemonic Reader [94] iteratively predicted the start and end positions with reinforcement learning.
• Interference discarding strategies. Dynamically discarding interference items during the prediction process can improve the accuracy and generalization of models, e.g., DSDR [98] and SAN [99].
• Loss function. Given an extracted answer span, the loss function is generally defined as the sum of the negative log-probabilities of the start and end positions of the gold answers [49], which can be formulated as:

\mathcal{L} = -\log\frac{e^{s_a}}{\sum_{j=1}^{n} e^{s_j}} - \log\frac{e^{g_b}}{\sum_{j=1}^{n} e^{g_j}}

Here s_j and g_j are the scores for the start and end bounds produced by the model for token j, and a and b are the gold start and end tokens. In multi-paragraph reading comprehension tasks, the reading comprehension model is applied to both negative and positive paragraphs, so a no-answer prediction term needs to be added to the loss function, as in [49], [100]:

\mathcal{L} = -\log\frac{\delta\, e^{s_a + g_b} + (1 - \delta)\, e^{z}}{e^{z} + \sum_{i=1}^{n}\sum_{j=1}^{n} e^{s_i + g_j}}

Here δ is 1 if an answer exists and 0 otherwise, and z represents the score given to the ''no-answer'' possibility.
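The two objectives above can be sketched directly in PyTorch; z is modeled here as a learnable scalar tensor, which is one plausible reading of the formulation rather than a detail fixed by [49], [100].

    # Sketch of the extractive span loss above, including the shared-normalization
    # "no-answer" variant. s, g: (n,) start/end scores; a, b: gold indices.

    import torch
    import torch.nn.functional as F

    def span_loss(s, g, a, b):
        return -(F.log_softmax(s, dim=0)[a] + F.log_softmax(g, dim=0)[b])

    def span_loss_with_no_answer(s, g, a, b, z, has_answer):
        # z: learnable scalar no-answer score (a 0-dim tensor).
        pair_scores = (s.unsqueeze(1) + g.unsqueeze(0)).view(-1)  # all s_i + g_j
        log_norm = torch.logsumexp(torch.cat([z.view(1), pair_scores]), dim=0)
        if has_answer:
            return log_norm - (s[a] + g[b])  # -log prob of the gold answer span
        return log_norm - z                  # -log prob of the no-answer option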

C. FINAL ANSWER SELECTION
Final answer selection picks the final answer from multiple candidate answers using feature aggregation. Aggregation methods can be divided into several types, chiefly multi-stage aggregation and evidence aggregation [53].

D. ACCELERATION METHODS
Although current open-domain textual QA systems have achieved significant advancements, these models have become slow and cumbersome [110], with multi-layer [111] and multi-stage [53], [102] architectures along with various features [81], [87], [137]. Moreover, ensemble models are employed to further improve performance, which requires a large amount of computational resources. Open-domain textual QA systems, however, are required to be fast in paragraph index&ranking as well as accurate in answer extraction. Therefore, we discuss some hot topics regarding acceleration methods in this section.

1) MODEL ACCELERATION
Because deep learning models are complex and computationally expensive, automated machine learning (AutoML) technologies have attracted widespread attention for hyperparameter optimization and neural architecture search [112]-[114]. However, there is little research on AutoML acceleration for open-domain textual QA systems. To reduce complexity while preserving quality, many models have been proposed to accelerate the reading process, namely model acceleration. Hu et al. [115] proposed a knowledge distillation method, which transfers knowledge from an ensemble model to a single model with little loss in performance. In addition, it is known that LSTMs, which are widely used in open-domain textual QA systems [110], are difficult to parallelize and scale due to their sequential nature. Consequently, some researchers have replaced the recurrent structures [110] or the attention layer [96] with more efficient modules, such as the Transformer [83] and SRU [111], or limited the range of co-attention [116].
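As an illustration of the distillation idea, the sketch below trains a student reader's start/end distributions toward a teacher's soft targets; the temperature-scaled KL formulation is a common recipe and is not necessarily the exact objective of [115].

    # Sketch of knowledge distillation for a span-extraction reader: the student's
    # start/end score distributions are pulled toward the teacher's softened outputs.

    import torch.nn.functional as F

    def distill_loss(student_start, student_end, teacher_start, teacher_end, T=2.0):
        # KL divergence between temperature-softened teacher and student outputs;
        # the T*T factor keeps gradient magnitudes comparable across temperatures.
        def kl(s, t):
            return F.kl_div(F.log_softmax(s / T, dim=-1),
                            F.softmax(t / T, dim=-1),
                            reduction="batchmean") * T * T
        return kl(student_start, teacher_start) + kl(student_end, teacher_end)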

2) ACTION ACCELERATION
Some works boost sequence reading speed while maintaining performance, namely action acceleration. These approaches dynamically employ actions to speed up reading, such as jumping, skipping, skimming, and early stopping. We illustrate the details from the following perspectives.
• Jump reading determines, from the current word, how many words should be skipped before the next reading step. For example, Yu and Liu [54] proposed LSTM-Jump, built upon an LSTM network and reinforcement learning, to determine the number of tokens or sentences to jump. As shown in Fig. 3, a softmax gives a distribution over jumping steps between 1 and the maximum jump size. This method can greatly improve reading efficiency, but the decision action can only jump forward, which may be ineffective in complex reasoning tasks. Therefore, Yu et al. [117] proposed an approach that decides whether to skip tokens, re-read the current sentence, or stop reading and feed back the answer, and LSTM-Shuttle [118] proposed a method that can either read forward or read back to increase accuracy during speed reading.
• Skim reading determines whether to skim a token according to the current word before reading further. Unlike previous methods that use reinforcement learning to make action decisions, Skip-RNN [119] adjusted the RNN module to determine whether each step's input is skipped, directly copying the previous hidden state when it is. However, these earlier methods mainly targeted sequence reading and classification tasks, with experiments mainly on the cloze-style QA task [31]. Skim-RNN [55] then conducted comparative experiments on reading comprehension tasks: a small RNN updates only the first few dimensions of the hidden state, trading off the amount of computation against the discard rate (a minimal sketch of this idea appears after this list). Moreover, Hansen et al. [120] proposed the first speed-reading model that includes both jump and skip actions.
• Other speed reading applications: JUMPER [36] provided fast reading feedback for legal texts; Johansen and Socher [121] focused on sentiment classification tasks; and Choi et al. [122] tackled long-document QA tasks with CNN-based sentence selection and reading.
Hu et al. [38] proposed an early-stopping mechanism to efficiently terminate the encoding process of unrelated paragraphs.
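The skim-reading idea can be sketched as a two-cell recurrent module; the hard argmax decision shown here is for clarity, whereas Skim-RNN [55] trains the decision with a continuous relaxation (Gumbel-softmax), and all sizes are illustrative.

    # Sketch of Skim-RNN-style reading: a tiny policy decides, per token, whether
    # to run the full GRU cell or a small cell that updates only the first
    # d_small dimensions of the hidden state (cheap "skimming").

    import torch
    import torch.nn as nn

    class SkimRNN(nn.Module):
        def __init__(self, in_dim, hidden, small=16):
            super().__init__()
            self.big = nn.GRUCell(in_dim, hidden)
            self.small = nn.GRUCell(in_dim, small)
            self.policy = nn.Linear(in_dim + hidden, 2)  # read fully vs. skim
            self.d_small = small

        def forward(self, x_seq):  # x_seq: (T, batch, in_dim)
            h = x_seq.new_zeros(x_seq.size(1), self.big.hidden_size)
            for x in x_seq:
                choice = self.policy(torch.cat([x, h], dim=-1)).argmax(-1, keepdim=True)
                h_big = self.big(x, h)
                h_small = h.clone()  # skim: most state dimensions are carried over
                h_small[:, :self.d_small] = self.small(x, h[:, :self.d_small])
                h = torch.where(choice.bool(), h_small, h_big)
            return h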

E. DATASETS
In this subsection, we introduce several datasets related to open-domain textual QA. Owing to the release of these datasets, the development of open-domain textual QA systems has made great progress in recent years. Table 2 shows statistics of the following datasets.
• SQuAD-Open [37] is the open-domain version of SQuAD [13]. Only question-answer pairs are given, while the evidence documents come from the whole set of Wikipedia articles.
• SearchQA [15] contains 140k question-answer pairs crawled from J! Archive. It uses the Google search engine to collect the top 50 web page snippets as context fragments for each question.
• TriviaQA [14] consists of 650K context-query-answer triples and contains three settings: web domain, Wikipedia domain, and unfiltered domain. The questions come from 14 trivia and quiz-league websites and need cross-sentence reasoning to obtain the ground-truth answer. The evidence documents of TriviaQA-web and TriviaQA-Wikipedia are retrospectively crawled from Web search or Wikipedia, respectively. TriviaQA-unfiltered is the open-domain setting of TriviaQA, which includes 110,495 QA pairs and 740K documents.
• Quasar-T [30] mainly consists of 43k open-domain trivia questions and their answers obtained from various Internet sources. For each question-answer pair, 100 paragraphs have been collected for processing. There are two sub-sets according to the length of candidate paragraphs: the short sub-set consists of paragraphs with fewer than 10 sentences, while the long one consists of paragraphs with an average of 20 sentences. The dataset was constructed in two steps: retrieving the top-100 documents and adding the top-N unique documents to the context.

F. EVALUATION
For extractive textual QA tasks, two evaluation metrics are usually adopted to evaluate the predicted answer [13], measuring the exact match and the partially overlapped score, respectively.
• Exact Match. EM measures whether the predicted answer exactly matches one of the ground-truth answers; it assigns 1.0 if an exact match occurs and 0.0 otherwise.
• F1 Score. The F1 score computes the average word overlap between the predicted and ground-truth answers, balancing precision and recall at the same time. It is calculated as:

F1 = \frac{2 \times precision \times recall}{precision + recall}

We summarize the performance of current state-of-the-art models on different open-domain textual QA datasets in Table 3. As we can see, MemoReader [124] has achieved promising results on the TriviaQA-web dataset, while DynSAN [125] is the top-tier model for SearchQA. On the other hand, RE³QA [38] has achieved state-of-the-art results on the remaining three datasets, likely due to the use of pre-trained language models such as BERT [61].
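For reference, the EM and F1 metrics above can be sketched in Python as follows, closely following the standard SQuAD-style normalization (lowercasing, stripping articles and punctuation); treat this as a sketch rather than the official evaluation script.

    # Sketch of SQuAD-style EM and token-level F1 over normalized token bags.

    import re
    from collections import Counter

    def normalize(text):
        text = re.sub(r"\b(a|an|the)\b", " ", text.lower())  # drop articles
        return re.sub(r"[^a-z0-9 ]", "", text).split()       # drop punctuation

    def exact_match(pred, gold):
        return float(normalize(pred) == normalize(gold))

    def f1_score(pred, gold):
        p, g = normalize(pred), normalize(gold)
        common = sum((Counter(p) & Counter(g)).values())  # overlapping tokens
        if common == 0:
            return 0.0
        precision, recall = common / len(p), common / len(g)
        return 2 * precision * recall / (precision + recall)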

IV. DISCUSSION
In this paper, we have introduced recent approaches in open-domain textual QA. Although these works have established a solid foundation for deep learning based open-domain textual QA research, there remains ample room for further improvement. In this section, we first summarize the structure of some typical models, then present the challenges and limitations of recent approaches, and finally outline several promising research directions, which we believe are critical to the present state of the field.

A. SUMMARY OF MODELS
We summarize current hot topics in Fig. 4 and categorize the structures of some typical models. In the paragraph ranking stage, some recent works leverage BERT [61] to select paragraphs. In the extractive reading stage, most works utilize GloVe embeddings [77], while recent models tend to use pre-trained language models such as ELMo [81] or BERT [61] for text feature encoding. As for the attention mechanism, most works adopt either co-attention or self-attention, or combine both of them to better exchange information between questions and documents. For aggregation prediction, most works adopt RNN-based approaches (LSTM or GRU), while some recent works leverage BERT [61]. In final answer selection, multi-stage aggregation is the main solution, while few works adopt the evidence aggregation strategy.

B. CHALLENGES AND LIMITATIONS
We first present the challenges and limitations that open-domain textual QA systems inherit from deep learning techniques. Several common limitations of deep learning [126] also affect deep learning based open-domain textual QA systems.
• Interpretability. It is well known that a deep learning model works like a black box. Due to the activation functions and backward derivation, it is hard to analyze the learned network function, which makes the final answer theoretically unpredictable.
• Data Hungriness. As mentioned in subsection II-B, deep learning is data-driven, which also brings some challenges [126]. This fact is also visible in subsection III-E, where each dataset contains more than 10k samples.
• Scalability. Although leveraging BERT [61] or other self-attention pre-trained models [82] to extract text features improves accuracy, running these models over hundreds of paragraphs is computationally costly, since these models are usually large and consist of numerous layers. Moreover, using indexable query-agnostic phrase representations can reduce the computational cost while retaining accuracy in reading comprehension, whereas the accuracy is still low in open-domain textual QA [128].
• Machine Reading Comprehension. Existing extractive reading technology has made great progress, and several reading comprehension models even surpass human performance. However, these MRC models are complex and lack interpretability, which makes it difficult to evaluate the performance and analyze the generalization ability of each neural module. As performance improves along with increasing model size, the energy consumed by running these models also becomes a problem [129]. Moreover, existing models are vulnerable to adversarial attacks [130], making them difficult to deploy in real-world QA applications.
• Aggregation Prediction. Existing predictive reasoning usually assumes that the answer span appears in a single paragraph, or that the answer text is short [37]. In the real world, however, the answer span often spans several paragraphs or even requires multi-hop inference. How to aggregate evidence across multiple mentioned text snippets to find the answer remains a great challenge.

C. RECENT TRENDS
We summarize several recent trends regarding open-domain textual QA as follows.
1) Complex Reasoning. As datasets get larger and reasoning becomes more complex, the open-domain textual QA task has spawned a great number of challenging subtasks, for example, multi-hop QA tasks that require multiple evidence inferences across documents [104], [107], [131], symbolic reasoning such as numeric calculation, counting, and sorting [132], and extraction-based generation [33], [34]. Combining complex reasoning modules, such as graph-based reasoning [84], [133], [134], numerical reasoning [135], and logical reasoning [84], [131], with existing paragraph ranking and extractive reading comprehension models is a new trend in open-domain textual QA.
2) Complexity Improvement. Accurate QA requires a deep understanding of documents and queries. As a result, most recently proposed models have become extremely complex and large [124], [136], resulting in low efficiency. It is nontrivial to speed up the whole computation, especially for RNN-based models [83], [111]. Since AutoML [112], [113] technologies can automatically search for optimal parameters or network structures, applying them in open-domain textual QA may be a good approach to find a light-weight network structure and improve efficiency.
3) Technology Integration. Technology integration refers to the combination of multiple technologies from different fields, which is a typical trend in recent deep learning works. For example, semantic paragraph ranking approaches [60], [75] may use technologies from the fields of information retrieval and natural language processing. As for the answer selection module, knowledge base QA and natural language processing technologies are combined to improve the overall QA performance [106], [109]. Moreover, many machine learning technologies, such as transfer learning [61], reinforcement learning [47], and meta-learning [51], have been introduced into open-domain textual QA.