PersianQuAD: The Native Question Answering Dataset for the Persian Language

Developing Question Answering (QA) systems is one of the main goals in Artificial Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have witnessed significant advances. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Many annotated datasets have been built for the QA task; most of them are exclusively in English. In order to address the need for a high-quality QA dataset in the Persian language, we present PersianQuAD, the native QA dataset for the Persian language. We create PersianQuAD in four steps: (1) Wikipedia article selection, (2) question-answer collection, (3) three-candidates test set preparation, and (4) data quality monitoring. PersianQuAD consists of approximately 20,000 questions and answers made by native annotators on a set of Persian Wikipedia articles. The answer to each question is a segment of the corresponding article text. To better understand PersianQuAD and ensure its representativeness, we analyze it and show that it contains questions of varying types and difficulties. We also present three versions of a deep-learning-based QA system trained with PersianQuAD. Our best system achieves an F1 score of 82.97%, which is comparable to that of QA systems on the English SQuAD dataset created by Stanford University. This shows that PersianQuAD performs well for training deep-learning-based QA systems. Human performance on PersianQuAD is significantly better (96.49%), demonstrating that PersianQuAD is challenging enough and there is still plenty of room for future improvement. PersianQuAD is freely available and can be downloaded from here. All the QA systems implemented in this paper are also available here.


I. INTRODUCTION
Developing open-domain Question Answering (QA) systems is one of the main goals in Artificial Intelligence. QA systems receive users' questions in natural language and respond to them with precise answers. They deploy natural language understanding and information retrieval to understand the users' questions and find the appropriate answers. Classic search engines receive the users' queries and return a list of relevant web pages, which the user must read to obtain the required information. QA systems, in contrast, receive the users' questions and find the final answer to those questions. Nowadays, modern search engines such as Google, Yahoo, and Bing deploy QA techniques to provide precise responses to some types of questions. For example, if someone searches for the question "what is the normal body temperature?" in Google, it responds with the precise answer. QA systems mainly focus on factoid questions, i.e., questions that can be answered with facts expressed in a few words [1]. Table 1 shows three examples of factoid questions, along with their answers and answer types.
With the advent of Deep Learning (DL) techniques, NLP tasks such as machine translation, sentiment analysis, and QA have witnessed significant advances. Although DL performs very well on NLP tasks, it requires a considerable amount of annotated data for training. In the QA task, the data are semantically annotated and take the form "question, paragraph, answer", where the question is expressed in natural language, the paragraph is a text that contains the answer, and the answer is a span of the paragraph. Given the question and the paragraph, the QA system should select the span of the paragraph that constitutes the answer.
Many annotated datasets have been built for the QA task; most of them are exclusively in English. The most famous QA dataset in English is SQuAD [2]. It contains 100,000+ questions created by crowdworkers on a set of Wikipedia articles. The answer to each question is specified in the article text. In order to build QA systems for other languages, some works have automatically translated English QA datasets into the target language using machine translation tools. Even though translating English QA datasets is a fast and straightforward way to prepare labeled data for low-resourced languages, automatic translation is not perfect and cannot produce high-quality annotated data. In addition, the process of translation (even human translation) may produce faulty outputs in the target language [3]. The result is that the quality of QA systems trained on purely native datasets is substantially better than the quality of QA systems trained on translated datasets [3]-[6].
In order to address the need for a high-quality QA dataset for the Persian language, we propose a model for creating datasets for deep-learning-based QA systems. We use the proposed model to create PersianQuAD: a Question Answering Dataset for the Persian language. PersianQuAD contains 20,000 "question, paragraph, answer" triplets and is, to the best of our knowledge, the first large-scale native QA dataset for the Persian language. PersianQuAD is freely available for public use here. To evaluate the quality of the QA dataset created through the proposed model, we implemented a set of state-of-the-art deep-learning-based QA systems and used the created dataset for training these systems. Our best model achieves an F1 score of 82.97% and an Exact Match of 78.7%, which are comparable with those of English QA systems trained on the English SQuAD, created by Stanford University.
The remainder of this paper is organized as follows. Section II reviews the related work and puts our work in its proper context. Sections III, IV, and V present in detail the proposed model for preparing QA datasets, including the problem definition, the annotation tool used, the dataset collection process, and data quality monitoring. This is followed by more in-depth analyses of the resulting dataset in Section VI. Section VII contains the experiments carried out to evaluate the quality of the resulting dataset by using it to train three deep-learning-based QA systems. Finally, we outline conclusions in Section VIII.

II. RELATED WORK
Many QA datasets have been produced in the past decade, most of them exclusively in English. In recent years, several QA datasets have been built for other languages, such as Arabic, French, etc. In this section, we present a brief review of the research on creating QA datasets.

A. ENGLISH
The Stanford Question Answering Dataset (SQuAD) [2] can be considered the most famous QA dataset in English. It consists of 100,000+ questions posed by crowdworkers on Wikipedia articles. A set of Wikipedia articles was presented to the annotators, who were asked to pose questions on each paragraph and specify the corresponding answers. The second version of SQuAD [7] contains over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
The WikiQA [8] and MS Marco [9] datasets are built by sampling questions searched by users in the Bing search engine. For each question, the top-ten documents returned by the Bing engine are presented to the annotators, who are asked to find the answer to the question in the documents or to state that the documents do not contain the answer. WikiQA contains about 3,000 questions and their corresponding answer sentences on Wikipedia pages. MS Marco contains 100,000 questions with free-form answers. The Natural Questions (NQ) dataset [10] is created by sampling questions issued to the Google search engine and consists of over 300,000 examples. To create the NQ dataset, a set of questions searched in Google, along with the top-five results returned by Google, are presented to the annotators. The annotators are then asked to specify the answer on the pages or mark null if the answer is not inside the returned pages.
The QuAC [11] and CoQA [12] are conversational QA datasets that contain dialogues between a questioner and an answerer. CoQA [12] contains over 127,000 question-answer pairs. The questions are collected by asking two crowdworkers to chat about a passage. During the conversation, one of the crowdworkers poses some questions about the passage, and the other tries to answer them. The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage. NewsQA [13] is a QA dataset based on the CNN news articles, with over 100,000 question-answer pairs. NewsQA is created in three stages, and the crowdworkers are divided into three groups: (1) questioners, (2) answerers, and (3) validators. In the first stage, the questioners see only the article highlights and headlines and pose some questions. In the second stage, the answerers see the crowdsourced questions and the full article and determine the answer in the article. In the third stage, the validators see the article, the question, and a set of unique answers to that question selected by the answerers. The validators then choose the best answer from the candidate set or reject all of them. Some QA systems translate the users' question from natural language into a query formulated through a specific data query language that is compliant with the underlying knowledge base. MQALD [14] provides a dataset for evaluating the performance of the QA systems in translating the questions from natural language into a specific data query language.
Some questions cannot be answered by reading a single paragraph or document; several pieces of information in different documents must be considered to find the answer. These questions need multi-step (or multi-hop) reasoning. Several QA datasets have been developed to address multi-step QA. The main challenge in multi-step QA datasets is to answer the questions by reasoning over different documents. QAngaroo [15], HotpotQA [16], ComplexWebQuestions [17], and R4C [18] are examples of multi-hop datasets. ComplexWebQuestions and QAngaroo are built by incorporating a knowledge base with the Web and Wikipedia documents. HotpotQA and R4C are created by crowdsourcing. HotpotQA consists of 113,000 questions on Wikipedia articles. Answering each question in the dataset requires finding multiple documents and reasoning over them. In multi-hop datasets, in addition to answering questions, QA systems typically should also identify the paragraphs that have been used to derive the answer.

B. OTHER LANGUAGES
There are two main approaches for building QA datasets in languages other than English: (1) translating English QA datasets into the target language, typically using machine translation, and (2) building the dataset from scratch (native QA dataset). In the first approach, an English QA dataset's training set is first translated into the target language. Then the QA system for the target language is trained on the translated dataset. Carrino et al. [19] proposed a new method to automatically translate SQuAD to Spanish and used the translated dataset to fine-tune a Spanish QA model. Mozannar et al. [20] translated 48,000+ SQuAD instances into Arabic using machine translation and built an Arabic QA system. Lee et al. [21] translated SQuAD into Korean and built k-QuAD using machine translation. Croce et al. [22] proposed a semi-automatic translation method and translated SQuAD into Italian. While the translation approach is a fast and relatively easy way of building QA datasets for low-resourced languages, Clark et al. [3] argue against this approach. They argue that the translation process, even human translation, tends to produce problematic artifacts in the output language, such as preserving the word order of the source language when translating to a target language with free word order, or using more formal language in the target language. As a consequence, the text obtained from human or automatic translation may be significantly different from original native text [6], [23]. In addition, speakers of various languages and of different nationalities may have questions about different topics [3], asked in different ways. For example, a Persian speaker may ask about the recipe of "Ghormeh sabzi", a traditional Iranian food. These types of questions may never appear in translated datasets. These issues encourage following the second approach, i.e., building native QA datasets from scratch with native annotators.
In this work, we follow the second approach for creating the PersianQuAD dataset.
Native QA datasets are mostly constructed in a similar way to SQuAD. SberQuAD [24] is a Russian native QA dataset and contains over 50,000 samples. DRCD [25] is a native Chinese QA dataset, consisting of 30,000+ questions posed by annotators on 10,014 paragraphs extracted from 2,108 Wikipedia articles. KorQuAD [26] is a Korean QA dataset and PIAF [27] is a French QA dataset, consisting of 70,000+ and 3,835 question-answer pairs, respectively. Since large-scale QA datasets in languages other than English rarely exist and building native QA datasets is time- and cost-consuming, developing QA systems for these languages is challenging. Cross-lingual QA datasets have been developed to address this challenge. These datasets are typically used to train a QA model on one language and transfer the model to another language. It has been shown that the resulting models perform well in the zero-shot setting [28]. MLQA [29] is a cross-lingual QA dataset developed for seven languages: English, Arabic, German, Spanish, Hindi, Vietnamese, and Chinese. It consists of over 12,000 samples in English and 5,000 samples in the other languages. MMQA [30] is a parallel QA dataset in Hindi and English, containing 5,000+ parallel instances. BiPar [31] is another parallel QA dataset in English and Chinese. XQA [32] consists of a training set in English and development and test sets in nine languages: English, French, German, Portuguese, Polish, Chinese, Russian, Ukrainian, and Tamil. XQuAD contains 1,190 instances from SQuAD, along with their translations in 10 languages.
There are only a few works on building open-domain QA datasets for the Persian language. Abadani et al. [33] automatically translated SQuAD into Persian and built a translated Persian QA dataset called ParSQuAD. They created two versions of ParSQuAD: ParSQuAD(manual) and ParSQuAD(automatic), with 25,000 and 70,000 instances, respectively. To create ParSQuAD(manual), some of the translation errors were corrected manually after the automatic translation of SQuAD. ParSQuAD(automatic) is the result of automatically translating a part of SQuAD, without any manual correction. As discussed earlier, the translation approach for creating a QA dataset has a number of limitations. Hence, in this paper, we create a native QA dataset for the Persian language. Khashabi et al. [34] created a Persian QA dataset containing 1,300 instances and trained a QA system using this dataset. To the best of our knowledge, there is currently no native large-scale QA dataset for answering Persian questions, neither as a monolingual nor as a cross-lingual dataset. In this paper, we present the first large-scale and native dataset for the Persian language, called PersianQuAD.

III. PROBLEM DEFINITION
In QA systems, given a question and a paragraph containing the answer, the model needs to find the answer to that question in the paragraph text. The answer is determined by a "start" and an "end" token, which mark the start and end of the answer in the paragraph text, respectively. To define the QA problem formally, assume that the question Q = {q_1, q_2, ..., q_n} and the paragraph P = {p_1, p_2, ..., p_m} are given, where q_i and p_j are the i-th and j-th words of the question and paragraph text, respectively. The QA system needs to find the answer A = {a_1, ..., a_k} to the question Q in the paragraph P. Since the paragraph P contains the answer, A = {p_start, ..., p_end}, where start and end are the positions of the start and end tokens of the answer in the paragraph text. The QA system finds the answer to the question by predicting the start and end tokens, on the condition that start ≤ end.
To be more precise, QA systems specify the answer by estimating Pr(start, end|P, Q), where Pr(start, end|P, Q) is the probability that the answer starts at token start and ends at token end, given the paragraph P and the question Q. Modern QA systems employ DL techniques to solve the stated problem, i.e., to predict the start and end tokens of the answer in the corresponding paragraph text. DL algorithms require a large amount of annotated data. As mentioned earlier, for the QA task, the annotated data are in the form (Q, P, A), where Q, P, and A are the question, the paragraph, and the answer, respectively. Usually, QA datasets are stored in JSON format. Figure 1 shows an instance of the SQuAD dataset in JSON format.
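Concretely, a single (Q, P, A) instance in this JSON layout can be sketched as follows. The field names follow the SQuAD format; the English example text and the id value are illustrative placeholders, not taken from PersianQuAD:

```python
# A minimal (Q, P, A) instance in SQuAD-style JSON. The English example
# text is illustrative only; PersianQuAD stores Persian text in the same fields.
instance = {
    "title": "Oxygen",
    "paragraphs": [{
        "context": "Oxygen is a chemical element with symbol O and atomic number 8.",
        "qas": [{
            "id": "1",  # hypothetical id
            "question": "What is the atomic number of oxygen?",
            "answers": [{"text": "8", "answer_start": 61}],
        }],
    }],
}

# The answer span can be recovered directly from the character offset.
ans = instance["paragraphs"][0]["qas"][0]["answers"][0]
context = instance["paragraphs"][0]["context"]
recovered = context[ans["answer_start"]:ans["answer_start"] + len(ans["text"])]
print(recovered)  # → 8
```

Storing the character offset alongside the answer text is what allows a trained system's predicted span to be checked against the gold span directly.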
In order to build high-quality QA systems using DL techniques, a large labeled dataset is required. While such datasets are available for a limited number of languages, there is no large-scale native QA dataset for most non-English languages, including Persian. In this paper, we build the first native Persian question answering dataset. In order to create the dataset, we first implement SAJAD, an effective annotation tool for gathering QA datasets (Section IV). Then we collect the data through a participatory approach on Wikipedia articles (Section V). We build three QA systems and employ PersianQuAD for training these systems in Section VII. In the rest of the paper, we explain the mentioned steps in detail.

IV. SAJAD ANNOTATION TOOL
In Figure 2, the Document Title shows the title of the Wikipedia page from which the paragraph is extracted. The Paragraph shows the paragraph on which the annotator should pose a question. The annotator poses a question on the paragraph and types it in the Question field. The annotator then specifies the answer to the question by highlighting it within the paragraph text; the highlighted text is automatically placed in the Answer field. The annotator adds the question to the JSON file of PersianQuAD with the Add button. In the case that the annotator cannot pose any question on a paragraph, s/he can go to the next paragraph by clicking the Skip button.
The following features characterize SAJAD:
• Web-based and Mobile-friendly: SAJAD is a web-based application, so it does not need to be installed on the participants' systems. It can also be accessed easily from any device and browser.
• Multi-Account: SAJAD is a multi-user tool. It enables the system administrator to create multiple accounts. Each participant is provided with an account, and the participant's ID is stored with the questions s/he poses. This enables the system administrators to actively evaluate the quality of the questions that the annotators pose.
• SQuAD Format: To enable the community to use the dataset easily and quickly, the SAJAD input and output formats are the same as SQuAD's. Each paragraph can have many questions, and each question can have multiple answers.

V. DATASET COLLECTION
In line with recent works on QA tasks, we follow the same format as SQuAD and other QA datasets for our dataset. The data in QA datasets take the form (Q, P, A), where Q is the question, P is the paragraph that contains the answer, and A is the answer to the question. As described in Section IV, we develop SAJAD and use it for gathering the QA data. Our proposed model for collecting QA datasets consists of four steps: (1) Wikipedia article selection, (2) question-answer collection, (3) three-candidates test set preparation, and (4) data quality monitoring.

A. WIKIPEDIA ARTICLE SELECTION
In this step, we selected a set of high-quality Persian Wikipedia articles to use in the dataset collection process. We used Project Nayuki's computation of Wikipedia's internal PageRanks to retrieve the 10,000 articles of Persian Wikipedia with the highest PageRank. We extracted the individual paragraphs from the selected articles and kept only the paragraphs with more than 500 characters. Ultimately, we got 26,417 paragraphs on different topics and converted them to the JSON input format to be used as the SAJAD input. Figure 3 shows the article selection process.
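The filtering step above can be sketched as follows. This is a minimal illustration, not the exact script used by the authors; the function name and the assumption that paragraphs are separated by blank lines are ours:

```python
def select_paragraphs(articles, min_chars=500):
    """Split each article into paragraphs and keep only those longer than
    min_chars characters, as in the article-selection step.
    `articles` maps an article title to its full text; paragraph breaks
    are assumed (in this sketch) to be blank lines."""
    kept = []
    for title, text in articles.items():
        for para in text.split("\n\n"):
            para = para.strip()
            if len(para) > min_chars:
                kept.append({"title": title, "context": para})
    return kept

# Toy example: one long paragraph survives, one short paragraph is dropped.
articles = {"Example": ("a" * 600) + "\n\n" + ("b" * 100)}
paragraphs = select_paragraphs(articles)
print(len(paragraphs))  # → 1
```

The surviving paragraphs would then be serialized to the SQuAD-style JSON input that SAJAD expects.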

B. QUESTION-ANSWER COLLECTION
We employed several participants to make questions. Most of them are graduates or fourth-year students in linguistics and computer science, and all of them are native Persian speakers. We first provided the annotators with a set of written guidelines and oral explanations about the annotation process, the level of complexity that the questions should have, the answer spans, and the requirement to select the shortest span. The annotators were also shown a set of example paragraphs, with examples of good and bad questions and answers on each paragraph. Then the annotators were asked to make 50 questions, and only those who posed at least 45 questions correctly could participate in the annotation process.
We used the SAJAD tool for annotation, as explained in Section IV. In the annotation process, the annotators were shown a random paragraph from the selected articles. They were then asked to pose some questions on the paragraph and highlight the answer within the paragraph text. The participants were asked to spend one minute making each question and to pose at least three questions on each paragraph. They were also asked to make questions without copying the paragraph's text, expressing them in their own words. In the case that a participant could not make any question on a specific paragraph, s/he could skip it. Figure 4 shows the process of question-answer collection. In this way, we collected about 20,000 question-answer pairs and stored them in JSON format.

C. THREE-CANDIDATES TEST SET PREPARATION
As explained earlier, the annotators were asked to specify the shortest span as the answer. However, in some cases, there is disagreement between the annotators about the shortest span. For example, consider the paragraph and the corresponding question in Table 3. For that question, "Mohammad-Reza Shajarian" and "Shajarian" are both correct answers. If we specify only "Mohammad-Reza Shajarian" as the correct answer, a QA system that selects "Shajarian" as the answer will be wrongly penalized. Hence, to make the evaluation more accurate and reliable, we followed the SQuAD protocol and specified two extra answers for the questions in the test set.
In order to prepare three-candidates test instances, we show each annotator the questions in the test set along with the corresponding paragraphs, and ask the annotators to specify the shortest answer to each question in the paragraph's text. In this way, we obtain three answer candidates for each question in the test set. Figure 5 shows the process of preparing the three-candidates test set.

D. DATA QUALITY MONITORING
In SQuAD's data collection protocol, the annotators were required to have a high HIT acceptance rate to participate in the process. However, no constant supervision was exercised over the questions that the annotators made. Since SQuAD is a huge dataset with 100,000 question-answer pairs, some wrong or inappropriate instances made by the annotators might be tolerable. PersianQuAD contains 20,000 instances, and we compensate for the lack of quantity by improving our dataset's quality. To this end, we actively check the quality of the instances.
As mentioned earlier, the annotators were asked to perform an annotation task, and if they posed at least 45 questions correctly, they could participate in the data collection process. In addition, to ensure the quality of the dataset, all of the questions were constantly checked by three supervisors (all native Persian speakers; one of them is a linguist and the others are experts in NLP). The entire data collection process took approximately two months to complete. At the end of each day, the supervisors sampled a number of questions from each annotator and verified their quality based on the following criteria: (1) Is the question fluent Persian? (2) Does the answer exist in the paragraph text? (3) Is the specified answer correct? (4) Is the specified answer the shortest one? The questions that failed to satisfy these criteria were removed from the dataset. The supervisors also checked the type and variety of the questions that each annotator made and prevented the dataset from being biased toward specific types of questions.

VI. DATASET ANALYSIS

A. DATASET STATISTICS
We split PersianQuAD into train and test sets containing 18,567 and 1,000 instances, respectively. The test set was selected randomly from the dataset, with the remaining part of the dataset used for training. Table 4 gives the statistics of the training and test sets of PersianQuAD. As mentioned in Section V-C, we specified three candidate answers for the questions in the test set.

B. QUESTION TYPE ANALYSIS
In order to evaluate the proposed model for creating QA datasets and to ensure the representativeness of the resulting data, we analyze PersianQuAD by extracting the type and number of its questions. To compare the question type distribution of our dataset with that of other datasets, such as SQuAD and TyDi QA, we use English question types as the basis of our analysis. To this end, we mapped each of the English question words to its corresponding question words in Persian. Table 5 shows the question type distribution over the PersianQuAD, SQuAD, and TyDi QA datasets. As Table 5 shows, PersianQuAD contains a more balanced distribution of question words in comparison with SQuAD and even TyDi QA. In all datasets, "What" and "Why" questions have the highest and the lowest frequency, respectively. Table 6 shows the statistics of the question types over the training and test sets of PersianQuAD. As this table shows, the frequency of different question types over the test set is similar to that of the training set, confirming that the test set is a good representative of the whole dataset.
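The question-type analysis can be sketched as a simple count over leading question words. The helper below is hypothetical and uses English question words for readability; the actual analysis mapped the Persian question words to these English types:

```python
from collections import Counter

QUESTION_WORDS = ["what", "who", "when", "where", "why", "which", "how"]

def question_type(question):
    """Return the question word a question starts with, or 'other'.
    English words are used here for readability; the analysis in the
    paper maps Persian question words to these English types."""
    first = question.strip().lower().split()[0]
    return first if first in QUESTION_WORDS else "other"

# Toy questions standing in for the dataset's real (Persian) questions.
questions = ["What is oxygen?", "Who discovered oxygen?", "What is its symbol?"]
distribution = Counter(question_type(q) for q in questions)
print(distribution)  # → Counter({'what': 2, 'who': 1})
```

Comparing such counts between the training and test splits is what supports the claim that the test set is representative.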

C. QUESTION DIFFICULTY
We take the lexical similarity between a question and the corresponding answer sentence as an indicator of the question's difficulty. The less similar a question is to its answer sentence, the more difficult it is for a QA system to answer that question. We measure the lexical similarity between the question and the answer sentence using the Jaccard Coefficient [35]. Assume that the question and the answer sentence are represented as Q = {q_1, q_2, ..., q_n} and A = {a_1, a_2, ..., a_m}, respectively. The Jaccard Coefficient measures the similarity between the question Q and the answer sentence A as shown in Equation 1.

Jaccard(Q, A) = |Q ∩ A| / |Q ∪ A|   (1)
As Equation 1 shows, the Jaccard Coefficient measures similarity as the ratio of the number of common words between the question and the answer sentence to the total number of distinct words in the question and the answer sentence. The Jaccard Coefficient varies between 0 and 1, where 0 indicates no lexical similarity between the question and the sentence, while 1 indicates that the sentence contains all the question words. Figure 6 shows the lexical similarity between the questions and the corresponding answer sentences over the PersianQuAD test set. As Figure 6 shows, for almost 90% of the questions, the similarity between the question and its answer sentence is less than 0.3 in terms of the Jaccard Coefficient. This demonstrates that the lexical overlap between the questions and the answer sentences is very low, and hence, the PersianQuAD questions present an adequate level of complexity.
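Equation 1 can be computed with a few lines of set arithmetic. This is a minimal sketch assuming whitespace tokenization; the paper does not specify the tokenizer used for this analysis:

```python
def jaccard(question, answer_sentence):
    """Jaccard Coefficient of Equation 1: the ratio of shared words to
    all distinct words in the question and the answer sentence.
    Whitespace tokenization is an assumption of this sketch."""
    q = set(question.lower().split())
    a = set(answer_sentence.lower().split())
    if not q and not a:
        return 0.0
    return len(q & a) / len(q | a)

# Toy English example; the actual analysis is over Persian text.
q = "when was oxygen discovered"
a = "oxygen was discovered in 1774"
print(round(jaccard(q, a), 3))  # → 0.5
```

Here the two sentences share 3 of 6 distinct words, giving 3/6 = 0.5; most PersianQuAD question-answer pairs score below 0.3.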

VII. EXPERIMENTS

A. METHODS
We design and implement three versions of a deep-learning-based QA system and deploy PersianQuAD as the training set of these QA systems. In line with the state-of-the-art research on QA tasks [36], we used three pre-trained language models in our QA systems: MBERT [37], ParsBERT [38], and ALBERT-FA [39]. MBERT (Multilingual Bidirectional Encoder Representations from Transformers) is a deep bidirectional language model developed by Google. It has been pre-trained on the Wikipedia articles of 104 languages, including Persian. MBERT has shown great performance on a wide range of NLP tasks such as named entity recognition, question answering, part-of-speech tagging, etc. ALBERT is a lite version of BERT with fewer parameters and, hence, faster training. It also models inter-sentence coherence using a self-supervised loss. ALBERT shows better performance than BERT on some NLP tasks with multi-sentence inputs, such as English QA [39]. ALBERT-FA is the version of ALBERT trained on Persian (Farsi) texts. ParsBERT is a monolingual BERT trained on the Persian language. Figure 7 shows the general architecture of the proposed QA systems. As Figure 7 shows, the QA system first tokenizes the paragraph text and the question sentence using the BERT tokenizer. It then passes the generated tokens to the BERT language model. Finally, the BERT language model predicts the answer's start and end tokens within the paragraph text, and the answer generation component generates the final answer to the question. All the implemented QA systems are also available here.

B. ALGORITHM
Algorithm 1 shows the algorithm of the QA system in Figure 7. In this algorithm, the Initialize function first inserts a special token [CLS] at the beginning of the question sentence. Likewise, it adds a special token [SEP] between the question and the paragraph and a token [SEP] at the end of the paragraph. Equation 2 shows this function.

[CLS] question [SEP] paragraph [SEP] = Initialize(question, paragraph)   (2)

The packed sequence is then tokenized using the BERT tokenizer. As shown in Equation 3, q_i denotes the i-th token of the question sentence and p_j denotes the j-th token of the paragraph:

[CLS], q_1, ..., q_n, [SEP], p_1, ..., p_m, [SEP] = Tokenize([CLS] question [SEP] paragraph [SEP])   (3)

To implement the BERT language model, transformer encoders [40] are employed. In transformer encoders, self-attention layers, rather than recurrent neural networks, are used to represent each token. In each transformer encoder, the inputs are passed to a set of self-attention layers. In the i-th self-attention layer, three vectors, Query (Q_i), Key (K_i), and Value (V_i), are generated for each embedding vector emd_j. To generate these vectors in the i-th self-attention layer, emd_j is multiplied by W_Qi ∈ R^(|emd_j|×|Q_i|), W_Ki ∈ R^(|emd_j|×|K_i|), and W_Vi ∈ R^(|emd_j|×|V_i|), respectively. These matrices are learned during model training. Finally, the vector Z_i is generated as the output of the i-th self-attention layer. Equations (5) to (8) show these operations, where σ in Equation 8 denotes the softmax function:

Q_i = emd_j × W_Qi   (5)
K_i = emd_j × W_Ki   (6)
V_i = emd_j × W_Vi   (7)
Z_i = σ(Q_i × K_i^T / √|K_i|) × V_i   (8)
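Equations (5) to (8) can be illustrated with a small NumPy sketch of a single self-attention layer. The toy dimensions and random weights are ours; in the real model these matrices are learned:

```python
import numpy as np

def softmax(x):
    # σ in Equation 8, applied row-wise, shifted for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(emd, W_Q, W_K, W_V):
    """One self-attention layer as in Equations (5) to (8): project the
    embeddings to Query, Key, and Value vectors, then weight the Values
    by the softmax of the scaled Query-Key scores."""
    Q = emd @ W_Q                                 # Equation 5
    K = emd @ W_K                                 # Equation 6
    V = emd @ W_V                                 # Equation 7
    d_k = K.shape[-1]
    Z = softmax(Q @ K.T / np.sqrt(d_k)) @ V       # Equation 8
    return Z

rng = np.random.default_rng(0)
emd = rng.normal(size=(4, 8))      # 4 tokens, embedding size 8 (toy sizes)
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
Z = self_attention(emd, W_Q, W_K, W_V)
print(Z.shape)  # → (4, 8)
```

Each row of Z is a weighted mixture of the Value vectors, with weights given by that token's attention distribution over all tokens.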
By concatenating the Z_i vectors of all self-attention layers, the vector Z_(1..|SelfAttentions|) is generated. This vector is multiplied by the matrix W_O ∈ R^(|Z_(1..|SelfAttentions|)|×768), and the vector Z is produced. W_O is a learnable matrix. Equation 9 shows this multiplication:

Z = Z_(1..|SelfAttentions|) × W_O   (9)
The Z vector is then passed to a fully connected network, and a new vector emd_new_j is generated. In this network, W_F ∈ R^(768×|emd_j|) and b_F ∈ R^(|emd_j|) are learnable parameters. This fully connected network is shown in Equation 10:

emd_new_j = Z × W_F + b_F   (10)
All of the emd_new_(1..|inputs|) vectors are then passed to the next encoder. This operation is repeated for each of the encoders.
As shown in Equation 11, the output vectors of the last encoder are passed to a fully connected network, and start logits (s ∈ R^|inputs|) and end logits (e ∈ R^|inputs|) are produced. In this equation, W_qa ∈ R^(|emd_j|×2) and b_qa ∈ R^(|inputs|×2):

[s, e] = emd_(1..|inputs|) × W_qa + b_qa   (11)

The start logit and end logit indicate the start score and end score of the answer span, respectively. Afterwards, based on Equation 12, the system finds the i-th and j-th tokens of the paragraph such that the sum of their logits is maximum and i ≤ j. This gives the best span of the paragraph, which represents the exact answer:

(a_1, a_f) = argmax_(i ≤ j) (s_i + e_j)   (12)
Finally, the answer tokens p_a1, p_a2, ..., p_af are detokenized, and the final answer is generated and returned to the user. Equation 13 shows this process.
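The span selection of Equation 12 can be sketched as follows: given per-token start and end logits, find the pair i ≤ j that maximizes s_i + e_j. The max_answer_len cap and the toy logits are our own illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

def best_span(start_logits, end_logits, max_answer_len=30):
    """Find the span (i, j), i <= j, maximizing start_logits[i] + end_logits[j]."""
    best_score, best = -np.inf, (0, 0)
    n = len(start_logits)
    for i in range(n):
        for j in range(i, min(i + max_answer_len, n)):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best

# toy logits for a 6-token paragraph
s = np.array([0.1, 2.0, 0.3, 0.2, 0.1, 0.0])
e = np.array([0.0, 0.1, 0.2, 1.5, 0.3, 0.1])
print(best_span(s, e))  # (1, 3)
```

The tokens p_a1..p_af of the returned span would then be detokenized to produce the answer string.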

C. EVALUATION METRICS
Two evaluation metrics are commonly used for evaluating QA systems: Exact Match and F1 [2]. We use the same metrics in this research.
• Exact Match: This metric measures the percentage of predicted answers that exactly match any of the gold candidate answers.
• (Macro-averaged) F1: This metric measures the average overlap between the predicted and the gold candidate answers. To compute the overlap, both the predicted and the candidate answers are represented as bags of words, and hence, the order of words is ignored. The F1 of each question is the maximum F1 over all of its candidate answers. The (macro-averaged) F1 is the average of these F1 scores over all of the questions.
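A minimal sketch of these two metrics for a single question, assuming simple whitespace tokenization (the official SQuAD evaluation script additionally lowercases and strips punctuation and articles before comparing):

```python
from collections import Counter

def exact_match(prediction, gold_answers):
    """1.0 if the prediction exactly matches any gold candidate answer."""
    return float(any(prediction == g for g in gold_answers))

def f1_score(prediction, gold_answers):
    """Max bag-of-words F1 of the prediction over all gold candidate answers."""
    def f1(pred, gold):
        pred_toks, gold_toks = pred.split(), gold.split()
        common = Counter(pred_toks) & Counter(gold_toks)  # bag intersection
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_toks)
        recall = overlap / len(gold_toks)
        return 2 * precision * recall / (precision + recall)
    return max(f1(prediction, g) for g in gold_answers)

print(f1_score("the Persian language", ["Persian language", "Persian"]))  # 0.8
```

The macro-averaged F1 reported for a test set is then the mean of these per-question scores.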

D. HUMAN PERFORMANCE
As explained in Section V-C, we prepared a three-candidates test set for PersianQuAD. In order to measure human performance on the PersianQuAD test set, as in SQuAD, we take the third answer to each question as the human answer and keep the other two answers as gold ones. Human performance on the PersianQuAD test set is 95.0% and 96.49% in terms of the Exact Match and F1 metrics, respectively.

E. SETUP
In order to implement QA systems with MBERT, ParsBERT, and ALBERT-FA, we used Python as our programming language and PyTorch as our deep learning framework. The models were fine-tuned and tested on Google Colab, with an NVIDIA Tesla P100 GPU and 12 GB of RAM. We used the built-in tokenizer of each model to tokenize the paragraph and answer text (the WordPiece tokenizer for MBERT and ParsBERT, and the SentencePiece tokenizer for ALBERT-FA). All of the models were fine-tuned with a batch size of 12, a learning rate of 3e-5, gradient accumulation steps of 1, and a weight decay of 0. We fine-tuned each model for two epochs and used the AdamW optimizer [41]. All of the models were tested with a batch size of 8.
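For clarity, the fine-tuning hyperparameters above can be collected into a single configuration. The step-count helper and the 18,000-example training-set size are our own illustrative additions, not figures from the paper.

```python
# Fine-tuning hyperparameters reported in this section, in one place.
config = {
    "batch_size_train": 12,
    "batch_size_eval": 8,
    "learning_rate": 3e-5,
    "gradient_accumulation_steps": 1,
    "weight_decay": 0.0,
    "epochs": 2,
    "optimizer": "AdamW",
}

def total_update_steps(num_train_examples, cfg):
    """Optimizer updates performed over all epochs (ceiling division)."""
    examples_per_step = cfg["batch_size_train"] * cfg["gradient_accumulation_steps"]
    steps_per_epoch = -(-num_train_examples // examples_per_step)
    return steps_per_epoch * cfg["epochs"]

# 18,000 is a hypothetical training-set size, not the paper's exact split.
print(total_update_steps(18000, config))  # 3000
```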

F. RESULTS AND ANALYSIS
We built three QA systems based on the pre-trained language models examined (MBERT, ALBERT-FA, ParsBERT). We trained each of the QA systems using the training part of PersianQuAD and evaluated them using the test part. We evaluate each of the QA systems according to the two widely used automatic evaluation metrics, Exact Match and F1, described in Section VII-C. To better understand the obtained performance, we compare it with the performance of QA systems trained for other languages. Table 7 shows the performance of the QA systems and human performance on PersianQuAD, along with the performance of QA systems on datasets in other languages. We derive the following observations from the results:
• For PersianQuAD, the best performance in terms of both evaluation metrics is obtained using MBERT; ALBERT-FA and ParsBERT are in the next ranks, respectively.
• ParsBERT is pre-trained on a more massive amount of Persian data than MBERT. However, this does not have a positive effect on the performance of the Persian QA system, and MBERT performs better than ParsBERT.
• While ALBERT performs better than BERT on the English SQuAD, this is not the case for the Persian PersianQuAD, and MBERT shows higher performance than ALBERT-FA in terms of both evaluation metrics.
• The performance gap between humans and models on PersianQuAD (17% in Exact Match and 14% in F1) shows that there is still plenty of room for improving the QA models on PersianQuAD.
• Our best model achieves an F1 score of 82.97% and an Exact Match of 78.7%, which are comparable with those of English QA systems trained on SQuAD. This shows that PersianQuAD performs well for training deep-learning-based QA systems.
• The performance of the QA systems trained on PersianQuAD is better than that of QA systems trained on ParSQuAD and ParsiNLU, indicating that PersianQuAD works well for training QA systems for the Persian language.
• The performance of the QA systems trained on PersianQuAD is better than that of QA systems trained on SQuAD-es (Spanish), SberQuAD (Russian), KorQuAD (Korean) and PIAF (French), except for SberQuAD and KorQuAD, with a slight decrease in terms of Exact Match.
Table 8 shows the F1 score of the models on answering different types of questions in PersianQuAD. Here we observe:
• All models have their best performance on Why, How and When question types. We hypothesize that this can be attributed to the fact that the answer to a "Why" question in PersianQuAD usually appears after a sentence starting with "because", and hence, finding the answer is straightforward. How question types include How much and How many. The answers to How much, How many and When questions are usually quantities, and finding quantifiers as answers is relatively straightforward for the models.
• All models show their worst performance on What question types. This is because What questions in English are mapped into several types of questions in Persian, and finding the answers to them is relatively complicated.
• The ranking of question types, based on the model performance, is the same for MBERT and ParsBERT (1-What, 2-Where, 3-Which, 4-Who, 5-How, 6-When, 7-Why), while for ALBERT-FA the ranking is slightly different (1-What, 2-Who, 3-Where, 4-Which, 5-When, 6-How, 7-Why). We hypothesize that this is because MBERT and ParsBERT use the same architecture, while ALBERT-FA uses a different one.

VIII. CONCLUSION
This paper presented a model for developing Persian datasets for deep-learning-based QA systems. The proposed model consists of four steps: (1) Wikipedia article selection, (2) question-answer collection, (3) three-candidates test set preparation, and (4) Data Quality Monitoring. We deployed the proposed model to create PersianQuAD, the first native question answering dataset for the Persian language. PersianQuAD consists of approximately 20,000 (question, paragraph, answer) triplets on Persian Wikipedia articles and is created by native annotators. We analysed PersianQuAD and showed that it contains questions of varying types and difficulties and, hence, is a good representative of real-world questions in the Persian language. We built three QA systems using MBERT, ALBERT-FA and ParsBERT. The best system uses MBERT and achieves an F1 score of 82.97% and an Exact Match of 78.7%. The results show that the resulting dataset performs well for training deep-learning-based QA systems. We have made our dataset and QA models freely available and hope that they encourage the development of new QA datasets and systems for different languages, and lead to further advances in machine comprehension.