Toward Zero-Shot and Zero-Resource Multilingual Question Answering

In recent years, multilingual question answering has become an emerging research topic and has attracted much attention. Although systems for English and other rich-resource languages that rely on advanced deep learning techniques are highly developed, most systems for low-resource languages remain impractical due to data insufficiency. Accordingly, many studies have attempted to improve performance on low-resource languages in a zero-shot or few-shot manner based on multilingual bidirectional encoder representations from transformers (mBERT) by transferring knowledge learned from rich-resource languages. Most of these methods still require either a large amount of unlabeled data or a small set of labeled data for the low-resource languages. In Wikipedia, 169 languages have fewer than 10,000 articles, and 48 languages have fewer than 1,000 articles. This observation motivates us to tackle zero-shot multilingual question answering under a zero-resource scenario. Thus, this study proposes a framework that fine-tunes the original mBERT using data from rich-resource languages only, and the resulting model can be applied to low-resource languages in a zero-shot and zero-resource manner. Compared with several baseline systems that require millions of unlabeled examples for low-resource languages, the proposed framework is not only highly competitive but also performs better for the languages used in training.

mBERT has been widely adopted in downstream tasks due to its ease of use and state-of-the-art performance on various multilingual tasks. Although pretraining on large-scale multilingual corpora allows mBERT to be directly applied to text in more than one hundred languages, its performance on a low-resource language is inevitably worse than on rich-resource languages [7]. The scarcity of data makes it impractical to break this bottleneck merely by collecting more data. A simple but common solution is to utilize knowledge learned from a rich-resource language, such as English, to improve performance on low-resource languages. Several studies have shown the potential of this approach through experiments in few-shot or zero-shot settings [8], [9], [10].

Because assistant applications are frequently installed on a variety of mobile phones and home devices around the world, multilingual question answering has become an emergent challenge in recent years. Furthermore, the global popularization of multimedia technology, video/audio sharing websites, and social networks has led to significant growth in multilingual content. This has also increased the demand for machine reading comprehension of multilingual content, a typical case of multilingual question answering. In machine reading comprehension, given a question and a text passage, the machine is asked to predict a pair of start and end indices that determine a short text span from the passage as the answer to the question.

Before the multilingual variant of BERT was released, most studies focused on building multilingual word embeddings via unsupervised methods. Multilingual Unsupervised and Supervised Embeddings (MUSE) [31] is one of the most representative and powerful methods. It learns bilingual word mappings with relative criteria across multiple languages so that multiple sets of monolingual word embeddings can be smoothly merged into a single set of multilingual word embeddings. However, even a robust method like MUSE does not benefit from a large-scale multilingual corpus. As a result, the development of multilingual models is slowed down, causing a performance bottleneck.

Soon after the release of BERT went viral in the natural language processing (NLP) community, the multilingual variant of BERT was released the following year. The multilingual BERT (mBERT) model easily outperforms MUSE on several multilingual datasets owing to its pretraining on a large-scale multilingual corpus crawled from Wikipedia, including articles in more than 100 languages. Afterward, a method named Language-Agnostic SEntence Representations (LASER) [32] was proposed, which forces word embeddings to be language-agnostic via training on machine translation tasks. Although the architecture of LASER is based on bidirectional long short-term memory, which is usually considered weaker than transformer-based architectures, LASER proved the importance of language agnosticism with its success.

The cross-lingual language model (XLM) [33] and its variant XLM-RoBERTa (XLM-R) [34] were thus proposed to combine the advantages of mBERT and LASER. Inspired by the masked language modeling (MLM) task, a translation language modeling (TLM) task was proposed for XLM. This task asks the model to perform classic MLM with its input being the concatenation of a pair of parallel sentences in two languages. Owing to the knowledge learned from machine translation data and a pretraining corpus even larger than that of mBERT, XLM-R easily outperforms mBERT on most multilingual datasets. However, while XLM-R adopts a SentencePiece [36] tokenizer, mBERT adopts a WordPiece tokenizer. Some researchers might therefore prefer mBERT over XLM-R when developing a baseline system for word-level NLP tasks due to the extra technical difficulty of fixing word misalignment caused by the SentencePiece algorithm.
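To make the translation language modeling setup concrete, the following is a minimal sketch of how a TLM training example can be constructed. The helper name build_tlm_example, the special-token names, and the masking rate are illustrative assumptions rather than the exact XLM recipe.

```python
# A minimal sketch of translation language modeling (TLM) input construction,
# assuming a generic WordPiece-style vocabulary; names and masking rate are
# illustrative, not the exact XLM implementation.
import random

def build_tlm_example(src_tokens, tgt_tokens, mask_token="[MASK]", mask_prob=0.15):
    """Concatenate a parallel sentence pair and randomly mask tokens in both
    languages, so the model can recover a masked word from either its own
    context or its translation."""
    tokens = ["[CLS]"] + src_tokens + ["[SEP]"] + tgt_tokens + ["[SEP]"]
    inputs, labels = [], []
    for tok in tokens:
        if tok not in ("[CLS]", "[SEP]") and random.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)      # the model must predict the original token here
        else:
            inputs.append(tok)
            labels.append(None)     # position excluded from the MLM loss
    return inputs, labels

# Example: an English sentence paired with its German translation.
en = ["the", "cat", "sleeps"]
de = ["die", "katze", "schläft"]
print(build_tlm_example(en, de))
```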

To further enhance sentence-level language agnosticism for XLM-R, the information-theoretic framework for cross-lingual language models [37] proposed a cross-lingual contrastive learning method that maximizes the mutual information between the representations of parallel sentences. Moreover, inspired by the adversarial approach of ELECTRA [38], XLM-E [39] recasts the translation language modeling task into an ELECTRA-style pretraining task, translation replaced token detection (TRTD). In the TRTD task, the model first performs a classic translation language modeling prediction to recover the masked tokens in the parallel sentence pair, and a discriminator is then trained to detect which tokens have been replaced.

Xia et al. [15] proposed MetaXL, a framework that utilizes meta-learning to train a representation transformation network (RTN) layer that maps representations of a rich-resource language onto the distribution of a low-resource language. The RTN layer then transforms training examples of the rich-resource language before fine-tuning.

The application of multilingual BERT has been widely studied due to its state-of-the-art performance on several multilingual NLP tasks. The attributes of the transformer-based architecture make it a powerful encoder for learning knowledge from multiple languages at once. When mBERT is applied to the multilingual question answering task, a naïve approach is to employ it to encode a concatenated token (i.e., WordPiece) sequence of a passage and a question. Classification objectives are then introduced to indicate a pair of start and end indices so that a text span can be extracted from the concatenated token sequence as the answer to the question. Formally, a passage $p = w^{p}_{1}, \ldots, w^{p}_{m}$ and a question $q = w^{q}_{1}, \ldots, w^{q}_{n}$ are concatenated into a single token sequence separated by a special token. Next, the pretrained mBERT model extracts a hidden vector for each token in the concatenated sequence. Two fully connected feed-forward neural networks are adopted to individually translate the collection of hidden vectors into two sets of scores, corresponding to the start and end indices. As usual, a softmax function is then used to translate the scores into probability distributions. The training objective is optimized to minimize the negative log-likelihood of the proper start and end indices for each training example:
$\mathcal{L}_{\text{QA}} = -\log P^{\text{start}}_{\text{gold}} - \log P^{\text{end}}_{\text{gold}}$,
where $P^{\text{start}}_{\text{gold}}$ and $P^{\text{end}}_{\text{gold}}$ are the predicted probabilities of the ground-truth starting and ending positions, respectively.

This study aims to develop a zero-shot and zero-resource multilingual question answering framework that requires no labeled or unlabeled data for target low-resource languages. In other words, the multilingual system is constructed using rich-resource languages only, and the resulting model can be directly deployed on low-resource languages. To this end, a novel data augmentation method is introduced. In particular, the mean vector is not needed for inference, so the resulting model can be used for any language in a zero-resource manner. Figure 1(b) provides a simple example of the workflow.

We propose an auxiliary training objective to further encourage mBERT to output language-agnostic representations. A concatenated token sequence of a training example is first encoded and augmented so that a pair of hidden outputs $H_{k}$ and $\tilde{H}_{k}$ after the $k$-th transformer layer is obtained.
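For reference, the following is a minimal sketch of this standard span-extraction setup in PyTorch with the Hugging Face mBERT checkpoint. The class and function names (SpanQAHead, qa_loss), the single linear layer per head, and the dummy gold indices are our simplifying assumptions, not the paper's exact implementation.

```python
# A minimal sketch of span-extraction question answering with mBERT,
# assuming the Hugging Face "bert-base-multilingual-cased" checkpoint.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

class SpanQAHead(nn.Module):
    """mBERT encoder followed by two linear layers that score every token
    position as the start or end of the answer span."""
    def __init__(self, model_name="bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.start_head = nn.Linear(hidden, 1)
        self.end_head = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        start_logits = self.start_head(h).squeeze(-1)   # (batch, seq_len)
        end_logits = self.end_head(h).squeeze(-1)       # (batch, seq_len)
        return start_logits, end_logits

def qa_loss(start_logits, end_logits, start_gold, end_gold):
    """Negative log-likelihood of the ground-truth start/end indices."""
    return F.cross_entropy(start_logits, start_gold) + F.cross_entropy(end_logits, end_gold)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
enc = tokenizer("Who wrote it?", "The book was written by Alice.", return_tensors="pt")
model = SpanQAHead()
start_logits, end_logits = model(enc["input_ids"], enc["attention_mask"])
loss = qa_loss(start_logits, end_logits, torch.tensor([7]), torch.tensor([7]))  # dummy gold indices
```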
Then, $H_{k}$ and $\tilde{H}_{k}$ are passed back to mBERT at its $(k+1)$-th transformer layer, leading to two pairs of probability distributions, $(P^{\text{start}}, P^{\text{end}})$ and $(\tilde{P}^{\text{start}}, \tilde{P}^{\text{end}})$. $P^{\text{start}}$ and $P^{\text{end}}$ are derived from $H_{k}$ and denote the probability distributions of each token position being chosen as the start and end index of the answer span, respectively. Similarly, $\tilde{P}^{\text{start}}$ and $\tilde{P}^{\text{end}}$ are obtained from $\tilde{H}_{k}$. An auxiliary Kullback-Leibler divergence objective $\mathcal{L}_{\text{KL}}$ is further introduced to guide the fine-tuning of mBERT:
$\mathcal{L}_{\text{KL}} = D_{\text{KL}}\bigl(\tilde{P}^{\text{start}} \,\|\, P^{\text{start}}\bigr) + D_{\text{KL}}\bigl(\tilde{P}^{\text{end}} \,\|\, P^{\text{end}}\bigr)$,
where $P^{\text{start}}$ and $P^{\text{end}}$ are used as the references in the computation of the Kullback-Leibler divergences. The auxiliary objective is designed to reduce the intrinsic distribution differences between the two sets of representations (i.e., $H_{k}$ and $\tilde{H}_{k}$).
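A minimal sketch of this auxiliary objective follows. It assumes one reasonable reading of "reference": the original-branch distributions act as the target of the divergence, while gradients flow through the augmented branch. The function name and the batchmean reduction are illustrative choices.

```python
# A minimal sketch of the auxiliary KL objective, assuming the original-branch
# start/end distributions serve as the reference (target) distributions.
import torch.nn.functional as F

def kl_auxiliary_loss(start_logits, end_logits, aug_start_logits, aug_end_logits):
    """Pull the start/end distributions of the augmented (zero-meaned) branch
    toward those of the original branch, computed for both span boundaries."""
    def kl(aug_logits, ref_logits):
        # F.kl_div expects log-probabilities as input and probabilities as target.
        return F.kl_div(F.log_softmax(aug_logits, dim=-1),
                        F.softmax(ref_logits, dim=-1),
                        reduction="batchmean")
    return kl(aug_start_logits, start_logits) + kl(aug_end_logits, end_logits)
```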
where $y_{i}$ is 1 if $h_{i}$ is derived from a zero-meaned hidden state, and $\log P(1 \mid h_{i})$ is the log-probability, computed by the discriminator, that $h_{i}$ is derived from a zero-meaned hidden state. When the ZMTD objective is added to model training, we randomly select 10% of the input tokens and subtract the mean vector from their hidden states (Figure 2).

To create a zero-shot and zero-resource multilingual question answering model for low-resource languages, we introduce a data augmentation method, an auxiliary Kullback-Leibler divergence objective, an information compensation training strategy, and an auxiliary ZMTD objective. Empirically, a simple regularization is usually helpful for preventing overfitting, stabilizing the training process, and achieving good performance on test data. In our implementation, we apply L2 regularization to the token embeddings at the input of mBERT. Ultimately, an enhanced mBERT model that combines all the methods by summing their corresponding training objectives is formulated:
$\mathcal{L} = \mathcal{L}_{\text{QA}} + \mathcal{L}_{\text{KL}} + \mathcal{L}_{\text{ZMTD}} + \mathcal{L}_{\text{L2}}$.
By leveraging all the proposed methods, we expect to obtain an enhanced multilingual BERT-based question answering (emBERTqa) model that can answer questions in multiple languages. In particular, a major contribution of this work is that the proposed emBERTqa model is trained using rich-resource language data, and the resulting model can be used for low-resource languages without their unlabeled or labeled data. Consequently, to the best of our knowledge, this is the first attempt to perform multilingual question answering for low-resource languages in both a zero-shot and zero-resource manner.

In the first set of experiments, we compared the baseline systems on the MLQA dataset. XLM-R and mBERT are two classic multilingual pretrained language models, so we directly employed them for the question answering task without any additional adaptation for the target languages. In contrast, the adversarial learning method, MDS, and the zero-mean method all require a large amount of unlabeled data for each low-resource language. Thus, XLM-R and mBERT perform multilingual question answering for low-resource languages in a zero-shot (i.e., without using labeled data) and zero-resource (i.e., without using unlabeled data) manner, while the other baseline systems operate only in a zero-shot way. Although XLM-R generally performs better than mBERT on most sentence-level NLP tasks, such as text classification, mBERT can perform better than XLM-R on word-level tasks, such as question answering, due to the difference between their tokenizers. The WordPiece tokenizer used by mBERT is based on a word-level tokenization algorithm, whereas the SentencePiece tokenizer used by XLM-R is based on a sentence-level algorithm, which might cause token misalignment when locating the start and end indices as training labels for the question answering task (cf. Section II-A). Therefore, as shown in Table 2, XLM-R performs worse than mBERT on the MLQA dataset.
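To recap the training objectives described above, the following is a minimal sketch of the zero-meaned token detection (ZMTD) discriminator, the 10% zero-mean augmentation, and the summed loss. The discriminator architecture, the precomputed per-language mean vector lang_mean, the unweighted sum of the loss terms, and all helper names are assumptions for illustration rather than the paper's exact implementation.

```python
# A minimal sketch of the ZMTD objective and the combined training loss,
# under the assumptions stated in the lead-in text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZMTDDiscriminator(nn.Module):
    """Predicts, for every token position, whether its hidden state had the
    language mean vector subtracted (label 1) or was left unchanged (label 0)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states):                      # (batch, seq_len, hidden)
        return self.classifier(hidden_states).squeeze(-1)  # (batch, seq_len) logits

def zero_mean_augment(hidden_states, lang_mean, select_prob=0.10):
    """Randomly pick ~10% of token positions and subtract the (assumed
    precomputed) language mean vector from their hidden states."""
    mask = (torch.rand(hidden_states.shape[:2], device=hidden_states.device)
            < select_prob).float()
    augmented = hidden_states - mask.unsqueeze(-1) * lang_mean
    return augmented, mask                                  # mask doubles as ZMTD labels

def total_loss(l_qa, l_kl, zmtd_logits, zmtd_labels, token_embeddings):
    """Unweighted sum of the QA, KL, ZMTD, and L2-regularization objectives."""
    l_zmtd = F.binary_cross_entropy_with_logits(zmtd_logits, zmtd_labels)
    l_reg = token_embeddings.pow(2).mean()                  # L2 penalty on token embeddings
    return l_qa + l_kl + l_zmtd + l_reg
```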

C. PROPOSED FRAMEWORK
We compared the proposed emBERTqa with the baseline models in the second set of experiments. All the results are listed in Table 2. The proposed emBERTqa generally outperformed the vanilla mBERT method and the other baseline systems. Again, the adversarial learning method, MDS, and the zero-mean method all require a large amount of unlabeled data for the target languages in the MLQA dataset, whereas the proposed emBERTqa does not. Consequently, the major contribution of the proposed framework is that it requires neither labeled nor unlabeled data for low-resource languages. Moreover, emBERTqa can even perform better than the baseline models that need unlabeled data. These results indicate that the proposed framework takes a step toward creating a set of language-agnostic representations and a shared semantic space that benefits from training data in various languages.

The information compensation training strategy is inspired by the translation language modeling task. Accordingly, we leveraged this idea to make the model learn the relationship between the original statistics of a language and its zero-meaned statistics. We also propose the ZMTD task, which employs a discriminator to detect which token representations are zero-meaned so that the model is forced to generate more language-agnostic representations. As shown in Table 2, the performance gains reveal that both methods provide the expected benefits.

In addition to the zero-shot and zero-resource scenario for low-resource languages, we also studied the potential impact on the rich-resource languages used in training; the results are listed in Table 3. Comparing the results in Tables 2 and 3 shows that the proposed framework also benefits the languages used in training.

This study proposed emBERTqa, an enhanced mBERT-based question answering model for low-resource languages in a zero-shot and zero-resource scenario. When emBERTqa was evaluated on the MLQA and TyDiQA-GoldP datasets, it outperformed several baselines that require a large amount of unlabeled data. A series of analyses demonstrated that emBERTqa retrieves better language-agnostic representations, improving its cross-lingual generalization capability. In summary, the proposed emBERTqa offers a promising direction for low-resource languages in the multilingual question answering task. Hence, the framework can be generalized to other languages not covered by the original mBERT, and we leave this extension for future investigation. We also plan to leverage the framework for other NLP-related tasks, such as multilingual document retrieval and summarization.