Enhancing Lexical-Based Approach With External Knowledge for Vietnamese Multiple-Choice Machine Reading Comprehension

Although Vietnamese is the $17^{th}$ most widely spoken native language in the world, there are few research studies on Vietnamese machine reading comprehension (MRC), the task of understanding a text and answering questions about it. One of the reasons is the lack of high-quality benchmark datasets for this task. In this work, we construct a dataset consisting of 2,783 pairs of multiple-choice questions and answers based on 417 Vietnamese texts commonly used for teaching reading comprehension to elementary school pupils. In addition, we propose a lexical-based MRC method that utilizes semantic similarity measures and external knowledge sources to analyze questions and extract answers from the given text. We compare the performance of the proposed model with several baseline lexical-based and neural network-based models. Our proposed method achieves 61.81% accuracy, which is 5.51% higher than the best baseline model. We also measure human performance on our dataset and find a large gap between machine and human performance, which indicates that significant progress can still be made on this task. The dataset is freely available on our website for research purposes.


I. INTRODUCTION
A primary goal of computational linguistics and natural language processing is to make computers able to understand natural language texts as well as human beings do. One of the standard tests of natural language understanding requires computers to read documents and answer questions related to their contents, resulting in different research problem settings of machine reading comprehension [1]-[5]. MRC can also be viewed as an extension of question answering (QA). There are many studies on QA [6]-[9], which also form the foundation for the development of MRC. Findings of this research field are implemented in various artificial intelligence applications such as next-generation search engines, AI agents, chatbots, and robots.
One common method for evaluating someone's understanding of texts is giving them a multiple-choice reading comprehension test. This type of test can measure abilities such as causal or counterfactual reasoning, inference among relations, or basic understanding of the world described in a set of reading texts. In the past ten years, there have been many studies [10]-[16] in this field. In addition to research on MRC in individual languages, one of the current trends in MRC is cross-lingual studies such as [17], [18]. Hence, the first important step is the contribution of MRC datasets in each language. Besides, there have been research results in lexical-based approaches [10], [12] and machine-learning-based approaches [4], [12], [19]-[22]. Depending on the characteristics and size of the dataset, we propose appropriate methods to achieve better performance.
English and Chinese are regarded as resource-rich languages in terms of the availability of language resources and processing tools. Many other languages are still resource-poor, and Vietnamese is one of them. Vietnamese is the national language of Vietnam and is used by over 97 million people.¹ Machine reading comprehension is as important for Vietnamese as for other languages because it allows machines to understand questions and extract answers from documents written in this language. However, the challenge of machine reading comprehension for Vietnamese has not yet been fully explored despite the language's extensive use; therefore, in this article, the primary focus is Vietnamese.
The integration of external sources has proven effective in a range of previous studies [23], [24], including recent success in leveraging external knowledge to generate answers in a neural QA model [23]. WordNet and word embeddings are two useful external sources for a range of natural language applications. Multiple deep learning-based approaches [12], [19]-[21], [25], [26] have worked well when using word embeddings in multiple-choice machine reading comprehension. Because our dataset is limited in the number of questions, we aim to find solutions based on the lexical-based method while leveraging external sources for multiple-choice machine reading comprehension. Our proposed method is presented in Section IV, with experiments and result analysis in Section V and Section VI.
In this article, we have three main contributions described as follows.
• We propose a benchmark dataset for evaluating the Vietnamese multiple-choice reading comprehension task. Our dataset is the first dataset for Vietnamese multiple-choice machine reading comprehension. The number of questions in our dataset is larger than that of MCTest [10], the first English dataset published to motivate MRC studies. The dataset is freely available for the research community and is expected to contribute to the development of Vietnamese machine reading comprehension research. We also provide this dataset for cross-lingual research with other similar datasets such as MCTest [10], RACE [12], and C3 [27].
• We propose a lexical-based method utilizing semantic similarity and external knowledge sources for multiple-choice reading comprehension. This model achieves better accuracy than the baseline models. We also compare it with different baseline lexical-based and neural network-based models.
• To gain an in-depth understanding of our proposed model, we analyze and compare its performance and that of other models across different linguistic properties through quantitative analysis, and we visualize their effects. Through these empirical observations, researchers gain more insight into and a better understanding of the behavior of our proposed method on our dataset.
The rest of this paper is structured as follows. Section II reviews related datasets and methods. Section III introduces the creation process and analysis of the ViMMRC dataset. Section IV presents our proposed method for Vietnamese multiple-choice machine reading comprehension. Section V shows experiments and results on the dataset. Section VI describes the analysis of these experimental results. Finally, Section VII concludes the paper and discusses future work.
¹https://www.worldometers.info/world-population/vietnam-population/

II. LITERATURE REVIEW
In this section, we aim to review recent datasets and techniques in machine reading comprehension. In particular, the typical MRC datasets and methods are described as follows.

A. MRC DATASETS
In the last decade, we have witnessed fast growth of research interest in machine reading comprehension (MRC) and an explosion of MRC datasets for popular languages such as English [1], [2], [28]-[35] and Chinese [36]-[38].
In terms of answer types, MRC datasets are divided into three categories: extractive, abstractive, and multiple-choice.
• Extractive MRC requires computers to locate the segment of a provided reading text that answers a specific question about that text. Recently, there has been a significant increase in the construction of extractive MRC datasets with formal written texts, such as SQuAD [2], CNN/Daily Mail [1], CBT [28], NewsQA [29], TriviaQA [31], WIKI-HOP [32], DRCD [37], and CMRC2018 [38]. There are also datasets whose reading texts are spoken language, such as ODSQA [33] and Spoken SQuAD [34], as well as conversation-based datasets [30], [35].
• In contrast to extractive MRC, abstractive MRC requires computers to generate answers or synthetic summaries because answers to such questions in abstractive MRC are usually not spans in the reading text. Datasets for abstractive MRC include MS MARCO [39], SearchQA [40], NarrativeQA [41], and DuReader [36].
• Multiple-choice MRC includes both extractive and abstractive questions; however, the correct answer options are primarily abstractive. Most multiple-choice MRC datasets are created using crowdsourcing in the major steps of dataset construction: generating questions, correct answer options, and distractors. MCTest [10], ROCStories [11], MultiRC [13], MCScript [14], and COSMOS QA [42] are typical datasets of this type. The crowd workers also assign to each question the reasoning mechanism needed to figure out the answer. Apart from the basic reasoning mechanism, the matching type, a large number of questions require complex reasoning mechanisms that are based on multiple sentences and require external knowledge. Other datasets are collected from examinations designed by educational experts, such as QALD [43], NTCIR-11 QA-Lab [44], a dataset from TOEFL exams [45], a dataset from the NY Regents 4th Grade Science exams [46], and RACE [12], which aim to evaluate learners.
Until now, no dataset has been available for Vietnamese machine reading comprehension, which is one of the primary reasons that we collect and build a dataset for the Vietnamese language processing community.

B. MRC METHODS
In this paper, we focus on two main types of MRC methods: lexical-based approaches and neural network-based approaches. We review previous studies of these methodologies as follows.

1) LEXICAL-BASED APPROACHES
The first method applied to multiple-choice reading comprehension is the Sliding Window algorithm, a lexical-based approach developed by [10], which we use as our first baseline model. This method has also been used as a baseline in other studies [2], [12], [14]. Sliding Window finds an answer based on simple lexical information. Motivated by TF-IDF, this algorithm uses the inverse word count as the weight of each lexical unit, and maximizes the bag-of-words similarity between an answer option and the lexical units of the reading text within a window of a given size.
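As a rough illustration (not the authors' implementation), the Sliding Window idea can be sketched in Python. The inverse-count weighting follows the description in [10]; the function and variable names are our own:

```python
from collections import Counter
import math

def sliding_window_score(passage_words, option_words, window_size=None):
    """Sketch of the Sliding Window score from [10].

    Each passage word is weighted by its inverse count (rarer words matter
    more, analogous to IDF); the score is the best bag-of-words overlap
    between the option and any passage window.
    """
    counts = Counter(passage_words)
    # log(1 + 1/count): a word occurring once weighs more than a frequent one.
    def weight(w):
        return math.log(1.0 + 1.0 / counts[w]) if w in counts else 0.0

    target = set(option_words)
    size = window_size or len(target)
    best = 0.0
    for start in range(max(1, len(passage_words) - size + 1)):
        window = passage_words[start:start + size]
        best = max(best, sum(weight(w) for w in window if w in target))
    return best
```

An option sharing rare words with the passage scores higher than one sharing none, which is the behavior the baseline relies on.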

2) NEURAL NETWORK-BASED APPROACHES
With the popularity of neural network approaches, end-to-end models such as Stanford AR [19], GA Reader [20], HAF [21], and Co-Match [47] have produced promising results on multiple-choice MRC. Recently, pre-trained language models have also been applied [48], [49]. These models do not rely on the complex manually devised features of traditional machine learning approaches, yet are able to outperform them. In this paper, we employ an end-to-end model called Co-match [47] with different pre-trained word embeddings as another baseline model.
Regarding Vietnamese language processing, there are quite a number of research works on other tasks such as parsing [50]-[52], part-of-speech tagging [53], [54], named entity recognition [55]-[57], sentiment analysis [58]-[60], and question answering [61]-[63]. However, to the best of our knowledge, there are no research publications on Vietnamese multiple-choice machine reading comprehension. Therefore, we decided to build a new dataset of Vietnamese multiple-choice reading comprehension for the research community and to evaluate state-of-the-art MRC models on it.

C. SEMANTIC SIMILARITY MEASUREMENT AND WORD EMBEDDINGS
Recently, semantic similarity measures between texts have been studied in many natural language processing applications, and a range of researchers have used them to improve their systems [64]-[66]. The methods proposed for estimating the similarity between two documents fall into three types: lexical matching, linguistic analysis, and semantic features. Lexical matching is not sufficiently robust, and linguistic analysis also has limitations. In semantic feature approaches, a word is represented by a vector encoding its meaning before similarity is estimated. The studies [67], [68] utilized external knowledge sources to estimate the similarity of two texts. These approaches are effective only when external knowledge sources such as WordNet, word embeddings, or other datasets are available for the tested domain or application.
Word embeddings also play a significant role in machine reading comprehension. Rumelhart et al. (1986) [69] proposed word embedding, a technique that maps each word to a vector space and can accurately capture a large proportion of syntactic and semantic relationships in text. Using pre-trained word embeddings [70], [71], there are two common ways to represent words in machine reading comprehension models: word-level embedding and character-level embedding. However, these methods can be insufficient because they simply concatenate word-level and character-level embeddings; the generated vectors stay the same in different contexts. To tackle this problem, Peters et al. (2018) [72] proposed deep contextualized word representations called ELMo, which are pre-trained with a language model first and then fine-tuned for the learning task. Devlin et al. (2018) [49] introduced BERT, which utilizes a bidirectional transformer to encode both left and right contexts into the representations. In this article, we take advantage of semantic similarity and word embeddings as external knowledge to improve the performance of multiple-choice reading comprehension in Vietnamese.

III. THE ViMMRC DATASET

A. DATASET CREATION
The process of constructing the ViMMRC dataset includes three phases: reading-text collection, multiple-choice question creation, and dataset validation. These phases are described in detail as follows.

1) READING-TEXT COLLECTION
We decide to focus on reading comprehension at the primary-school level because it requires only general knowledge rather than specialized knowledge. We collect Vietnamese reading texts suitable for 1st to 5th graders from the school subject named Vietnamese. In addition, we collect reading comprehension tests from two reliable websites where all reading comprehension tests for the 1st to 5th grades are publicly available free of charge. As a result, 417 reading texts are gathered.

2) MULTIPLE-CHOICE QUESTION COLLECTION
Questions, answer options, and correct answers are created by primary-school teachers. The questions are intended to test the reading comprehension ability of elementary learners. The teachers are asked to create at least five questions per text, each accompanied by four answer options, of which only one is correct. For texts with fewer questions or answer options, additional ones are created to meet these conditions. Spelling errors are corrected. At the end of this phase, we obtain the ViMMRC dataset.

3) VALIDATION
During this phase, primary-school teachers review the multiple-choice questions, their answer options, and their correct answers again to ensure there are no mistakes. Finally, we obtain a high-quality dataset for research on machine multiple-choice reading comprehension. Table 1 shows some examples of Vietnamese multiple-choice MRC questions. In the following section, we analyze the characteristics of the dataset.

B. DATASET ANALYSIS
We randomly divide our dataset into train, development, and test sets of 292 (70%), 42 (10%), and 83 (20%) texts, respectively. The statistics of the training, development, and test sets are summarized in Table 2. The table also lists the number of questions, the average lengths (in words) of the reading texts, questions, answer options, and correct answers, and the vocabulary sizes.
In this section, we present an analysis of our dataset from different aspects. Table 3 shows statistics of our dataset by grade. Vocabulary size, text length, question length, answer-option length, and correct-answer length are measured in words; we use the word segmentation tool pyvi.² We find that the number of reading texts for the 1st grade is small, which is expected because the 1st grade focuses on developing basic language skills rather than reading comprehension. We can observe that the vocabulary size increases with the grade, which suggests that vocabulary size is correlated with the difficulty level of the reading comprehension task.
The types of reasoning required to solve the multiple-choice machine reading comprehension (MMRC) task directly influence the performance of MMRC models. In this paper, we classify the questions in our dataset following the same reasoning types used in the analysis of the well-known RACE dataset [12]. These types are listed below in ascending order of difficulty:
• Word matching (WM): Important tokens in the question exactly match tokens in the reading text. Thus, a simple keyword-search algorithm can find the correct answer to such a question in the reading text.
• Paraphrasing (PP): The question is paraphrased from a single sentence in the reading text, for instance using synonymy or world knowledge.
• Single-sentence reasoning (SSR): The answer is inferred from a single sentence in the reading text. Such answers could be created by extracting incomplete information or conceptual overlap.
• Multi-sentence reasoning (MSR): The answer is inferred from multiple sentences in the reading text by information synthesis techniques.
• Ambiguous or insufficient (AoI): The question has multiple valid answers, or the answer cannot be found in the reading text.
We manually annotate all questions in our dataset according to these types. Examples and percentages of these types are listed in Table 8. It can be seen from the table that single-sentence reasoning and ambiguous-or-insufficient make up the lowest proportions in our dataset (7.35% and 6.12%, respectively). Meanwhile, the word matching and multi-sentence reasoning types account for the largest percentages, at 25.85% and 36.73%, respectively. This demonstrates that ViMMRC is a challenging dataset for evaluating reading comprehension models for the Vietnamese language.

C. COMPARISON WITH THE MCTest DATASET
In this section, we compare our dataset with the MCTest dataset, whose size is approximately the same as ours. Table 4 shows the differences between the two datasets. As can be seen from the table, although the number of reading texts in our dataset is smaller than in MCTest, the number of questions is greater. Besides, the average numbers of words per reading text, per question, and per answer in our dataset are also higher than those of MCTest.

IV. METHODOLOGY
In this section, we introduce our proposed approach for the Vietnamese multiple-choice machine reading comprehension corpus. Because deep learning methods require large datasets, we focus on the development of lexical-based methods for our dataset. Fig. 1 presents our proposed model, which integrates semantic similarity and external knowledge sources into the lexical-based approach. Briefly, we first pre-process the texts. Next, we calculate the sliding window scores, the distance scores, and the external knowledge scores. Finally, we combine these three scores into a final score, which is used to predict the correct answer. We implement this system through the algorithms described in detail below.

A. PRE-PROCESSING TECHNIQUES
Pre-processing techniques play an important role in many NLP applications; they help to remove meaningless and confusing tokens. We clean the data following the steps shown in Algorithm 1 and Algorithm 2. In particular, Algorithm 1 pre-processes a single sentence and is applied to the sentences of the reading text, the questions, and the answer options.
Algorithm 1 Pre-Processing a Raw Vietnamese Sentence S
Input: A raw Vietnamese sentence S.
Output: A list of Vietnamese words L after pre-processing.
procedure Pre-processing a Vietnamese sentence
  X = tokenizing S into a list of tokens
  Removing punctuation in X
  Removing Vietnamese stop words in X
  S = converting X into a lower-case sentence
  L = segmenting S into a list of words with the Vietnamese word segmentation tool
  return L
Algorithm 2 Pre-Processing a Vietnamese Reading Text T
Input: A Vietnamese reading text T.
Output: A pre-processed reading text T.
procedure Pre-processing a Vietnamese reading text
  L = splitting T into a list of single sentences
  for i = 1 to len(L) do
    L_i = Pre-processing a raw Vietnamese sentence (L_i)
  T = a pre-processed reading text converted from the list L
  return T
In Algorithm 1, we first use a tokenizer to break a sentence into a list of Vietnamese tokens X. Next, we remove punctuation marks, stop words, and noise words (short vowels) from the list X. After that, we convert the list X into a lower-case sentence S. Lastly, we use the Vietnamese word segmentation tool to parse the sentence S into the list of Vietnamese words L, which is the output of this algorithm. We apply Algorithm 1 to the questions and answer options as well. We use the tool pyvi³ for word segmentation in this algorithm.
In Algorithm 2, we first split an input reading text into a list of sentences L. Then, we run the pre-processing function (Algorithm 1) on each sentence in L. The output of this algorithm is the pre-processed reading text T converted from the list L. Algorithms 1 and 2 are applied to the reading texts and multiple-choice questions in all MMRC models.
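Algorithms 1 and 2 can be sketched in Python as follows. This is an illustrative reconstruction: a whitespace splitter and a tiny stop-word set stand in for the pyvi segmenter and the full Vietnamese stop-word list used in the paper:

```python
import re

# Illustrative subset only; the real Vietnamese stop-word list is much larger.
STOP_WORDS = {"là", "và", "của"}

def preprocess_sentence(sentence, segment=None):
    """Sketch of Algorithm 1: tokenize, strip punctuation and stop words,
    lower-case, then word-segment. `segment` stands in for the pyvi
    word-segmentation tool."""
    tokens = re.findall(r"\w+", sentence)               # tokenize, dropping punctuation
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    lowered = " ".join(tokens).lower()
    segment = segment or (lambda s: s.split())          # placeholder segmenter
    return segment(lowered)

def preprocess_text(text, segment=None):
    """Sketch of Algorithm 2: split the reading text into sentences and
    run Algorithm 1 on each of them."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [preprocess_sentence(s, segment) for s in sentences if s]
```

A real pipeline would pass pyvi's segmenter as `segment`; the structure of the two functions mirrors the two algorithms above.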

B. SLIDING WINDOW AND DISTANCE SCORES
We present how to calculate the sliding window scores (see Algorithm 3) and the distance scores (see Algorithm 4) of the original sliding window algorithm (SW), a lexical-based approach developed by [10]. This approach matches a bag of words, constructed from a question Q and an answer option O_i, with the given reading text, and calculates a TF-IDF-style matching score for each answer option. The two algorithms are important components of our proposed model. To understand this method, we start with a formal definition of the Vietnamese multiple-choice reading comprehension task. Let T denote the reading text, Q the question text, and O_{1..4} the texts of the four answer options. The aim of the task is to predict the correct one among the four answer options O_{1..4} with regard to the question Q and the given reading text T. We also adapt Vietnamese textual structures to the sliding window algorithm (SW) as the first baseline models on our proposed dataset. The results of these models are presented in Section V.

C. EXTERNAL KNOWLEDGE INTEGRATION
In addition to the lexical-based approach, we add one more component to enrich world knowledge, using semantic similarity and external knowledge sources such as word embeddings.

Algorithm 4 Calculating the Distance Scores
Input: Text T, set of reading-text words TW, set of words in question Q, and set of words in answer options O_{1..4}.
Output: The distance score for each answer option of the question.
procedure Calculating distance scores
  Initialize a list d of distance scores for the answer options
  d_i is based on the minimum number of words between an occurrence of a question word q and an occurrence of an answer word a in T, increased by 1
  return d

In particular, we add a boosted score (denoted by web_i) to the final score of each answer option. Algorithm 5 presents how to calculate the boosted score.

Algorithm 5 Calculating the Boosted Scores
procedure Calculating the boosted scores
  Initialize a list web of boosted scores for the answer options
  web_i = the maximum cosine similarity between V_{O_i} and the same-length spans X in V_T
  return web

To understand Algorithm 5, we introduce two notations, V_T and V_{O_i}, to denote the ordered sets of words in the reading text T and in the answer option O_i, respectively. We calculate web_i as the maximum cosine similarity between V_{O_i} and spans of words X of the same length in V_T. Fig. 2 shows the semantic similarity architecture estimating the boosted score between an answer option and a span in T. The semantic similarity of the two word sets V_{O_i} and X is formulated as follows:

web_i = max_X cos(v(V_{O_i}), v(X))  (1)

where X ranges over the spans of V_T whose length equals that of V_{O_i}, v(·) denotes the average of the word embeddings of the lexical units in its argument, and cos is the cosine similarity.
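The boosted score web_i can be sketched as follows. This is an illustrative reconstruction, not the authors' code; `emb` stands in for a pre-trained embedding lookup (word to vector):

```python
import numpy as np

def boosted_score(option_words, passage_words, emb):
    """Sketch of the boosted score web_i: the maximum cosine similarity
    between the averaged embedding of the answer option and the averaged
    embedding of each same-length span of the passage."""
    def avg_vec(words):
        vecs = [emb[w] for w in words if w in emb]
        return np.mean(vecs, axis=0) if vecs else None

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    v_opt = avg_vec(option_words)
    if v_opt is None:                     # no option word has an embedding
        return 0.0
    n = len(option_words)
    best = 0.0
    for start in range(max(1, len(passage_words) - n + 1)):
        v_span = avg_vec(passage_words[start:start + n])
        if v_span is not None:
            best = max(best, cos(v_opt, v_span))
    return best
```

The span length follows the option length, matching the description of V_{O_i} and X above.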
In this model, we use word embeddings as external knowledge sources. To explore their effectiveness, we evaluate the performance of our proposed model with several word embeddings, including Word2vec [73], Word2vec + Character2vec [74], fastText [75], ELMo [72], BERT [49], and MULTI [76]. In particular, we use the embeddings pre-trained on Vietnamese Wikipedia by [76] in all experiments of our proposed method.
Based on the above algorithms, let sw_i be the sliding window score (see Algorithm 3) and d_i the distance score (see Algorithm 4) defined in the original sliding window approach. The final distance-based sliding window score of O_i [10] can then be formulated as follows:

SWD_i = sw_i − d_i  (2)
Because a large proportion of questions cannot be solved by lexical-based approaches alone, we also incorporate external sources as general world knowledge into our lexical-based method. We calculate the boosted score web_i for the answer options of the question as presented in Algorithm 5. To make the final answer-option prediction, our lexical-based method combines the sliding window score sw_i, the distance score d_i, and the boosted score web_i (see Algorithm 5) as follows:

final_i = sw_i − d_i + web_i  (3)
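Given per-option lists of the three scores, the prediction step can be sketched as follows; the additive combination is our reading of the text, and the function name is illustrative:

```python
def predict_answer(sw, d, web):
    """Combine the per-option sliding window scores (sw), distance scores (d),
    and boosted scores (web) into final scores, then predict the option with
    the highest final score: final_i = sw_i - d_i + web_i."""
    finals = [s - di + w for s, di, w in zip(sw, d, web)]
    return max(range(len(finals)), key=finals.__getitem__)
```

The distance score acts as a penalty, so options whose matched words lie close to the question words in the text are preferred.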

V. EMPIRICAL EVALUATION
In this section, we compare the performance of our proposed model with the baseline models and with humans on our dataset.

A. BASELINE MODELS
We evaluate the random baseline, lexical-based approaches (Sliding Window and Distance-based Sliding Window [10]), and a neural network-based model (Co-match [47]) as baseline models. Sliding Window and Distance-based Sliding Window were used as baseline methods for various datasets such as MCTest [10], SQuAD [2], and RACE [12]. Co-match is a strong neural network-based model for multiple-choice machine reading comprehension; it achieves positive results on RACE [12] and was also chosen as a first baseline model for COSMOS QA [42] and C3 [27]. Despite the limited size of our dataset, we evaluate how well the Co-match method performs and then analyze the need for increasing training data for neural network-based models, as presented in Sub-section VI-E.

B. EVALUATION METRIC AND EXPERIMENTAL SETTINGS
We use accuracy as the primary evaluation metric, computed as follows:

Accuracy = (Number of questions correctly answered) / (Total number of questions)  (4)

In all experiments, we use the word segmentation tool pyvi⁴ and six different pre-trained word embeddings proposed by [76]. The training, development, and test sets are divided as shown in Table 2. Besides, we implement three baseline models on our dataset: Random, Sliding Window, and Distance-based Sliding Window.
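Equation (4) corresponds to the following one-liner (an illustrative helper, not from the paper):

```python
def accuracy(predictions, gold):
    """Accuracy as in Eq. (4): correctly answered questions / total questions."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```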
For the Co-match model, we tune the model parameters for the Vietnamese multiple-choice MRC task. In particular, we use a mini-batch size of 32 and a hidden memory size of 10. The number of epochs is set to 30. The Adamax optimizer is used with an initial learning rate of 0.002. In this model, we also test, in turn, the same word embeddings used in our proposed model.

C. HUMAN PERFORMANCE
We randomly take 100 questions from the test set and 100 questions from the development set and conduct the tests with ten students. As a result, human performance reaches 91.20% accuracy on the development set and 91.10% on the test set. These results are much higher than those of our best model. Surpassing human performance remains a challenge that motivates the exploration of new machine reading comprehension models suitable for this dataset in the future.

D. MODEL PERFORMANCE
We report the performances of the baseline models and our proposed model in Table 5. Sliding Window and Distance-based Sliding Window achieve different performances on the development set, 58.50% and 60.55%, but the same accuracy of 56.30% on the test set. Our proposed method achieves accuracies over 61% on the development set and over 60% on the test set. Specifically, with the ELMo word embedding, it achieves the highest results on both sets: 65.99% on the development set and 61.81% on the test set. This proves that our proposed method is currently more effective than the baseline methods for the Vietnamese MMRC task, with improvements of 5.45% and 5.51% on the development and test sets, respectively. However, these results are still far below human performance, with a gap of 29.29% on the test set. This is a great challenge in the study of Vietnamese multiple-choice machine reading comprehension.
Comparing the experimental results of the Co-match model with different word embeddings, we can see that ELMo achieves the best accuracies, 45.58% and 44.94% on the development and test sets, respectively. ELMo is thus the best word embedding for both the lexical-based and neural-based approaches. In addition, the best performance of the Co-match model on the test set is 16.87% lower than that of our proposed model, and far below human performance, with a gap of 46.16%. Because the data size is not large enough, we evaluate this model on different sizes of training data in Sub-section VI-E, which helps us decide whether to continue increasing the data size in future work.

VI. RESULT ANALYSIS
To gain insights into the best model (our proposed method with the ELMo embedding), we analyze the experimental results from different aspects: question length, reading-text level, reasoning type, and word embedding. Besides, we evaluate the impact of the size of the training set on the neural network-based method.

A. EFFECTS OF THE QUESTION LENGTH
To verify whether question length is a reason for the poor performance of our best model, we measure its performance according to question length. In particular, we divide the development set into five groups by question length: ≤ 10, 11−15, 16−20, 21−25, and > 25 words. We find that our method is less effective on short questions, perhaps because short questions contain less information useful for searching for the correct answer. In particular, the performances on shorter questions (64.15% for the ≤ 10-word questions and 65.18% for the 11−15-word questions) are lower than the performances on longer questions, which are over 66% in accuracy. Fig. 4 compares the best baseline and our proposed method across the question-length groups. The accuracy of our proposed model improves on all question lengths (except questions over 25 words), with significant increases for the three groups 11−15, 16−20, and 21−25.

B. EFFECTS OF THE READING-TEXT LEVEL
Fig. 5 shows the accuracies of the best model according to the level of the reading text, from the first to the fifth grade. We can observe that the difficulty of the reading comprehension task increases with the level of the reading text. The system answers questions of the 2nd grade well, with over 78% accuracy, while it is more challenging to predict correct answers for questions of the 3rd to 5th grades (less than 68%). The performance on 1st-grade questions is not as high as on 2nd-grade questions because there are far fewer 1st-grade questions than questions of the other grades. Fig. 6 compares the best baseline and our proposed method across reading-text levels. The accuracy of our proposed model improves at all reading-text levels (except the first grade), with significant increases for the 2nd, 4th, and 5th grades.

C. EFFECTS OF THE REASONING TYPE
We also analyze how the reasoning types influence the best MRC model. Fig. 7 shows the analysis results. We found that the system determines answers more effectively for the word matching (WM) and paraphrasing (PP) reasoning types, reaching 92.11% and 82.93% in accuracy, respectively. In contrast, complex forms of reasoning result in lower performance; these include single-sentence reasoning (SSR), multi-sentence reasoning (MSR), and ambiguous-or-insufficient (AoI). Fig. 8 shows a performance comparison between the best baseline and our proposed method for the different reasoning types. The accuracy of our proposed model improves on all types of reasoning, of which three types have significant increases: word matching, paraphrasing, and ambiguous or insufficient, while the complex reasoning types (SSR and MSR) show slight improvements.

D. EFFECTS OF THE WORD EMBEDDINGS
Table 5 shows the experimental results when external knowledge sources are used as pre-trained word embeddings. The results of the methods are clearly influenced by these word embeddings. In particular, our lexical-based approach achieves better results when using word embeddings, approximately 5% higher, and the experimental results show that ELMo is the best among the word embeddings examined.
In addition, we conduct a detailed analysis of the effect of the external knowledge integrated into our proposed model, compared with the best baseline model, according to different aspects, namely the question length and the reasoning type.
In particular, Table 6 shows statistics of the performance and improvement of our proposed model for the different question-length groups. Our model improves the results on short questions (≤ 10 words) with an accuracy increase of 1.98%, and on average-length questions with improvements of 5.78% for 11 − 15-word questions and 13.04% for 16 − 20-word questions. For longer questions, the model does not improve performance; however, this is not significant because long questions account for a small proportion of the dataset. Table 7 shows statistics of the performance and improvement of our proposed model for the different types of reasoning. We found that our proposed model is well suited to three types of reasoning, word matching, paraphrasing, and ambiguous or insufficient, increasing the number of solved questions by 7.90%, 12.20%, and 11.11%, respectively. The improvements on word matching and paraphrasing questions are especially notable because these types account for a high proportion of the dataset.

E. EFFECTS OF THE TRAINING DATA SIZE
To verify whether the size of the training data is a reason for the poor accuracy of the machine model, we evaluate Co-match [47], a neural network-based model, on different sizes of training data: 508, 1,010, and 1,975 human-created questions. In this experiment, we implement Co-match with different pre-trained word embeddings [76].
Experimental results (in accuracy) on the test set are presented in Fig. 9. The figure shows that model performance improves as the amount of training data increases. This observation suggests that enlarging the training set would improve accuracy, which is also a future direction for addressing this problem.
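The learning-curve experiment described above can be sketched as follows. This is an illustrative outline, not the paper's training code; `train_and_eval` is a hypothetical stand-in for training Co-match on a subset and returning test accuracy.

```python
# Train a model on increasing random subsets of the training data and
# record the test-set score for each size, producing a learning curve.
import random

def learning_curve(train_set, test_set, sizes, train_and_eval, seed=42):
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = train_set[:]
    rng.shuffle(shuffled)
    results = {}
    for n in sizes:                    # e.g. sizes = [508, 1010, 1975]
        subset = shuffled[:n]          # nested subsets: smaller ⊂ larger
        results[n] = train_and_eval(subset, test_set)
    return results
```

Using nested subsets (each smaller set contained in the larger ones) keeps the comparison between sizes consistent, since only the amount of data changes between runs.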

VII. CONCLUSION AND FUTURE WORK
In this paper, we propose a lexical-based approach utilizing semantic similarity and external knowledge sources, and perform experiments comparing this method with baseline lexical-based and neural network-based methods. The experimental results show that our proposed method is effective for Vietnamese multiple-choice reading comprehension, with the best performance reaching 61.81% in accuracy on our dataset. However, there is still a large gap between human performance and the best model (a significant difference of 29.29%). We also analyze the best models from different linguistic aspects to gain in-depth insights into the dataset. These analysis results illustrate that our corpus is a challenging dataset and needs further study. We also contribute this dataset for studies of the multiple-choice machine reading comprehension task for the Vietnamese language. The dataset includes 2,783 multiple-choice questions based on a set of 417 Vietnamese reading texts. It encourages further advances in machine reading comprehension and guides the development of artificial intelligence for the Vietnamese language.
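The general idea behind a lexical-based selector of this kind, scoring each candidate answer by its embedding similarity to the passage sentence most relevant to the question, can be illustrated with the following sketch. This is not the paper's exact algorithm; the toy `embeddings` lookup stands in for a pre-trained word embedding such as ELMo, and out-of-vocabulary words are simply skipped.

```python
# Illustrative lexical-based answer selection using averaged word
# vectors and cosine similarity (a sketch, not the paper's method).
import math

def avg_vector(text, embeddings, dim=3):
    # Average the embeddings of the in-vocabulary words of `text`.
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def choose_answer(passage_sents, question, options, embeddings):
    q_vec = avg_vector(question, embeddings)
    # 1) Find the passage sentence most similar to the question.
    best_sent = max(passage_sents,
                    key=lambda s: cosine(avg_vector(s, embeddings), q_vec))
    s_vec = avg_vector(best_sent, embeddings)
    # 2) Return the index of the option most similar to that sentence.
    return max(range(len(options)),
               key=lambda i: cosine(avg_vector(options[i], embeddings), s_vec))
```

Swapping the toy lookup for contextual embeddings is what the "external knowledge" component amounts to in this sketch: richer vectors make the similarity scores more discriminative without changing the selection logic.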
In the future, we plan to increase the quantity and quality of the dataset in terms of the number of reading texts. The analysis results also suggest that we should focus on methods that improve performance on long questions and difficult reasoning types. When the dataset is large enough, we will further study state-of-the-art methodologies such as deep neural networks and transfer learning to explore suitable models for Vietnamese multiple-choice MRC. In addition, we can use the classification of question difficulty to conduct experiments with curriculum learning [77].
Table 8 shows the ratio of each reasoning type in the development set. These types of reasoning are described in Section III. We also provide an example for each reasoning type; the Vietnamese examples are translated into English.

TABLE 8. Statistics of different reasoning types.
KIET VAN NGUYEN received the B.S. and M.S. degrees from the University of Information Technology, Ho Chi Minh City, Vietnam, in 2012 and 2017, respectively. He is currently a Lecturer with the Faculty of Information Science and Engineering, University of Information Technology, Vietnam National University, Ho Chi Minh City. His research interests include natural language processing, machine reading comprehension, and deep learning.
KHIEM VINH TRAN is currently a junior student with the University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam. His research interests include text processing, machine reading comprehension, and sentiment analysis. He participated in the WNUT-2020 Shared Task 2 and ranked third in this competition.

SON T. LUU received the B.S. degree from the University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam, in 2019, where he is currently pursuing the master's degree. He is currently a Research Assistant with the University of Information Technology. His research interests include machine reading comprehension, toxic comment detection, sentiment analysis, and knowledge representation.

NGAN LUU-THUY NGUYEN received the Ph.D. degree in information science and technology from the University of Tokyo, Japan. She was a Postdoctoral Researcher with the National Institute of Informatics, Japan, from 2012 to 2013. She is currently a Scientist with the University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam. Her research interests include natural language processing and data analysis.

VOLUME 8, 2020