DbAPE: Denoising-based APE System for Improving English-Myanmar NMT

Automatic post-editing (APE) research aims to investigate methods for correcting systematic errors in machine translation (MT) results. Recent work has shown successful practices of APE for improving MT output quality; however, their effectiveness strongly relies on the availability of large-scale human-created APE triplets. The high production cost of human post-edited data has led to the absence of APE triplets for most language pairs, including English-Myanmar, which has become a limiting factor for the applicability of the APE task. This work investigates how to conduct the APE task on the English-Myanmar MT where human-edited APE triplets are unavailable. We build an APE system using only the monolingual and parallel MT corpora. The system takes the source sentence (src) and the MT output (mt) as inputs and produces the post-edited mt as output by operating the three processes together, including word alignment extraction, enriching mt using the extracted word alignment information, and denoising the enriched-version of mt. We conduct extensive experiments by applying our APE system as a post-processor to the raw output of the existing English-Myanmar MT systems and show that it significantly improves the quality of baseline translation results in terms of TER and BLEU scores. In addition, we perform word alignment experiments with four types of alignment methods and demonstrate that the proposed multilingual word aligner can achieve robust performance over previous state-of-the-art models.


I. INTRODUCTION
The output of machine translation (MT) is plausibly called "pre-translation", as it is not always correct and may need revision by human experts to correct systematic errors. The goal of an APE system is to automatically fix these errors in machine-translated text by learning from human post-edited samples. Early APE research adopted phrase-based statistical machine translation (PBSMT) models to train the APE system as a monolingual rewriting task without considering the source sentence [1], [2]. However, PBSMT-based APE models are only applicable to fixing errors in the output of rule-based MT systems. When PBSMT is used for both the first-stage MT and the second-stage APE, there are no or only modest improvements without additional source context modelling and thresholding [3]. The majority of recent APE approaches adopt a dual-source (or multi-source) sequence-to-sequence structure that extends the Transformer [4] in a supervised learning setting [5], [6].
Generally, building an APE system requires a training set comprising triplets (source text, MT output, human post-edit), denoted as ⟨src, mt, pe⟩. The source sentence (src) and its corresponding MT output (mt) are taken together as inputs to the APE model, and the associated human post-edited sentence (pe) is used as the target. Owing to the high production cost of the target data (pe), the quantity of available APE triplets is still insufficient to train deep and complex APE models. Currently, strong APE models have failed to show any notable improvement in the refinement of neural machine translation (NMT) output when trained on similar-sized human post-edited data [5]-[7].
Open APE triplets are available only for very few language pairs such as English-German and English-Chinese 1. Most language pairs, including English-Myanmar, lack APE triplets, which hinders the applicability of the APE task. To make APE more widely applicable to the many language pairs for which APE triplets are unavailable, this work investigates an alternative solution that conducts the APE task without access to human-edited APE triplets.
We introduce an easy and effective APE system that uses only monolingual and parallel MT corpora, without any human-edited APE triplets. Our APE system takes the MT output (mt) and its original source sentence (src) as inputs, and outputs a high-quality target sentence (post-edited mt) by performing the following three steps:
1. Extracting word alignment information between mt and src using our proposed word aligner,
2. Enriching mt by removing unaligned target words and adding missing source-side information into the target words, based on alignment information and bilingual dictionaries, to maximize the semantic similarity between mt and its source sentence, and
3. Denoising the enriched mt (from Step 2) to generate a high-quality target sentence with our proposed denoiser.
Our word aligner exploits LaBSE [8] to obtain cross-lingual word embeddings for a given sentence pair. For the bilingual dictionaries used in the sentence enrichment step, we create two types of dictionaries, from (1) the source and target monolingual corpora and (2) the parallel MT corpus. For the denoisers, we use Transformer [4] models and train them on target monolingual data.
The main contributions of this paper are:
• We develop a new word aligner for English-Myanmar using the pre-trained language-agnostic sentence embedding model LaBSE [8], which we leverage to extract alignments from cross-lingual word embeddings. Our word aligner achieves state-of-the-art performance even in the absence of explicit training on a parallel corpus.
• We introduce a simple yet effective method to enrich raw translated text using bilingual dictionaries extracted from existing monolingual and parallel corpora. Our method takes missing source-side information and context into account in lexical choices.
• We propose a postprocessor for APE systems that can generate well-formed output in the target language using the denoising autoencoder, handling multi-aligned words and local reordering.
• We verify that cross-lingual embedding on subword units performs poorly in the word alignment task.
• We empirically show that an APE system built by combining the above three modules is effective and makes good use of the existing monolingual corpora, parallel corpus, and pre-trained model; it can be the best learning approach in a low-resource setting where APE triplets are unavailable.
Our proposed APE system can be used effectively as a postprocessor on the raw output of an existing NMT system for most language pairs, without using any human-edited APE triplets. The analyses provided in this work give a better understanding of the pre-trained model and its use in the APE task to generate contextualized word embeddings for extracting word alignment information and enriching translated sentences. Moreover, this work demonstrates that the denoising autoencoder, usually used as a language model in various downstream Natural Language Processing (NLP) tasks, can also be applied as a monolingual sentence rewriter in an APE system. Altogether, we show that in a low-resource setting with only monolingual and limited parallel data available, not only does the proposed multilingual word aligner outperform the existing state-of-the-art models on the word alignment extraction task, but our denoising-based APE system can also help to revise the raw translated texts of existing English-Myanmar MT systems to meet an agreed level of quality. As a result of our experiments, this work suggests the optimal research direction in APE for most of the language pairs where human-edited APE triplets are unavailable.

II. MODEL ARCHITECTURE
Our denoising-based APE system (DbAPE) is a pipeline consisting of three main modules. Fig 1 (a) depicts the first module, which retrieves word alignment information from an input sentence pair of a source sentence (src) and a machine-translated target sentence (mt), utilizing cross-lingual word embeddings. Fig 1 (b) depicts the second module, which removes the typical errors in mt and minimizes the semantic gap between mt and src by enriching mt with the missing source-side information. We call this operation target sentence enrichment, as it turns mt into a better version by correcting the errors and adding missing information. Fig 1 (c) depicts the final denoising module, where we take the enriched sentence (enriched-mt) as input and clean it by removing all possible noise and ordering the words and phrases into an acceptable target style.

A. WORD ALIGNMENT INFORMATION RETRIEVAL
Given a source sentence $x = (x_1, x_2, \dots, x_n)$ of length $n$ and its corresponding parallel target sentence $y = (y_1, y_2, \dots, y_m)$ of length $m$, the task of the word aligner A is to find a set of pairs of source and target words that are semantically similar to each other within the context of the sentence.

1) EXTRACTING ALIGNMENTS FROM EMBEDDINGS
Pre-trained word embedding models such as BERT [9] and RoBERTa [10] represent words using continuous vectors calculated in context and have achieved impressive performance in a variety of NLP tasks. Multilingually trained sentence embedding models such as language-agnostic BERT, called LaBSE [8], adapt multilingual BERT (mBERT) [9] to produce language-agnostic cross-lingual sentence embeddings for 109 languages, giving state-of-the-art results on the parallel text (bi-text) retrieval task. LaBSE was originally proposed for bi-text retrieval, i.e., finding translation pairs in multiple languages. However, this work uses LaBSE for the word alignment extraction task, which finds and extracts semantically similar source-target word pairs in a given parallel sentence pair. While prior work has relied on parallel training data to obtain word alignments, here we propose a simpler and more effective approach that is particularly suitable for low-resource languages lacking the parallel data needed to train a word aligner. We propose an unsupervised word alignment model that aligns words using LaBSE-based cross-lingual word embeddings. We treat this alignment extraction process as a semantic search task.

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.
In the remainder of the paper, we denote the list of word alignment pairs by $A_{s-t}$ and the lists of aligned source and target words by $A_s$ and $A_t$, respectively. Finally, we denote the list of unaligned source words by $U_s$ and the list of unaligned target words by $U_t$. The details of our word alignment procedure are described in Algorithm 1, where $t$ is a user-defined word-pair similarity threshold. Cosine similarity between two word vectors can in general be negative, but only scores between 0 and 1 indicate related words; we therefore set the threshold $t$ to 0.5, the halfway mark of that range.
As shown in the algorithm, the word alignment information retrieval task proceeds as follows. Given a source sentence (src) and its corresponding MT output (mt), we extract the most similar src word for each mt word based on the similarity score computed by the cosine similarity function on their LaBSE-based cross-lingual word embeddings. Among the extracted most similar pairs, the pairs with a similarity score higher than the threshold $t$ are taken as the final word-aligned pairs $A_{s-t}$. Meanwhile, we record the unaligned source words $U_s$ and the unaligned target words $U_t$ in src and mt, respectively.
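The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Algorithm 1 itself: the `embed` dictionary is a hypothetical stand-in for LaBSE-based word embeddings (here, any mapping from a word to a vector), and the function and variable names are ours.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(
    src_words: List[str],
    mt_words: List[str],
    embed: Dict[str, Vector],  # hypothetical stand-in for LaBSE word embeddings
    t: float = 0.5,            # similarity threshold from the text
) -> Tuple[List[Tuple[str, str, float]], List[str], List[str]]:
    """Return (aligned pairs, unaligned source words, unaligned target words)."""
    if not src_words:
        return [], [], list(mt_words)
    pairs = []
    for m in mt_words:
        # Most similar source word for this MT word.
        best = max(src_words, key=lambda s: cosine(embed[s], embed[m]))
        score = cosine(embed[best], embed[m])
        # Keep the pair only if it clears the threshold.
        if score > t:
            pairs.append((best, m, score))
    aligned_src = {s for s, _, _ in pairs}
    aligned_mt = {m for _, m, _ in pairs}
    unaligned_src = [s for s in src_words if s not in aligned_src]
    unaligned_mt = [m for m in mt_words if m not in aligned_mt]
    return pairs, unaligned_src, unaligned_mt
```

In this sketch an MT word whose best match scores at or below the threshold ends up in the unaligned target list, and any source word never chosen as a best match ends up in the unaligned source list.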

B. TARGET SENTENCE ENRICHMENT
This module enriches the MT output by removing errors and adding missing information based on word alignment information and bilingual dictionaries. Given the monolingual and parallel corpora, we first build two bilingual dictionaries (cf. Figure 1): a monolingual corpus-based dictionary (MD) and a parallel corpus-based dictionary (PD), by extracting potential source-target translation word pairs with similar vectors.
Using the source and target monolingual texts, we build a bilingual MD as follows:
• We create the source and target vocab files, which contain the lists of source and target words,
• We feed these two files as input into our word aligner (Algorithm 1), and
• We store all extracted source-target word alignment pairs in the bilingual MD.

Citation information: DOI 10.1109/ACCESS.2022.3185415

Using the parallel corpus, we build a bilingual PD. From each pair of source and target sentences in the parallel corpus, the potential translated word pairs are extracted as follows:
• We create forward-aligned word pairs $A_{forward}$ and backward-aligned word pairs $A_{backward}$ by running the proposed word alignment information retrieval module (Algorithm 1) in the source-to-target and target-to-source directions, respectively.
• We find the aligned word pairs $\langle w_s, w_t \rangle$ common to both lists $A_{forward}$ and $A_{backward}$ and store them in the bilingual PD, i.e., $PD = A_{forward} \cap A_{backward}$.

Since the bilingual PD is built in a supervised setting with the guidance of parallel aligned sentence pairs, it should be more accurate than the MD built in an unsupervised setting, and our experiments confirm this hypothesis. Given the raw translated sentence (mt), the unaligned source words ($U_s$), the unaligned target words ($U_t$), and the bilingual dictionaries (PD and MD), we first delete the unaligned target words in mt according to $U_t$. Then, we extract the most similar target words for $U_s$ from the bilingual dictionaries and append them to mt to get the enriched version of mt (enriched-mt). For each unaligned source word $u$ in $U_s$, the process of extracting its most similar target word is as follows:
• If $u$ is in the source-side words of the bilingual PD, we extract its aligned target-side word from PD.
• Else if $u$ is not in the bilingual PD but is in MD, we extract its aligned target word from MD.
• Else, we first find the most similar source word of $u$ in PD and extract its aligned target word from PD.

If the source word is aligned to more than one target word in the bilingual dictionary, we extract only the target word that has the highest similarity. Figure 2 illustrates our approach.
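The dictionary construction and lookup order above can be illustrated with the following minimal sketch. It makes several simplifying assumptions that are ours rather than the paper's: each dictionary maps a source word to a single (already highest-similarity) target word, backward pairs have been re-oriented as (source, target), unaligned target words are removed by word identity rather than by position, and `nearest_pd_source` is a hypothetical callback that returns the PD source word most similar to a given word.

```python
from typing import Callable, Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (source word, target word)


def build_pd(forward: Set[Pair], backward: Set[Pair]) -> Dict[str, str]:
    """PD keeps only the pairs found in both alignment directions.

    Assumes backward pairs are already stored as (source, target) and that
    each source word has one target word after the intersection.
    """
    return dict(forward & backward)


def enrich(
    mt_words: List[str],
    unaligned_src: List[str],
    unaligned_mt: List[str],
    pd: Dict[str, str],
    md: Dict[str, str],
    nearest_pd_source: Callable[[str], str],  # hypothetical similarity lookup
) -> List[str]:
    """Delete unaligned target words, then append translations of unaligned source words."""
    drop = set(unaligned_mt)
    kept = [w for w in mt_words if w not in drop]  # simplification: drop by word identity
    appended = []
    for u in unaligned_src:
        if u in pd:            # 1) parallel-corpus dictionary first
            appended.append(pd[u])
        elif u in md:          # 2) then the monolingual-corpus dictionary
            appended.append(md[u])
        else:                  # 3) otherwise back off to the closest PD source word
            appended.append(pd[nearest_pd_source(u)])
    return kept + appended
```

The appended words are placed at the end of the sentence in source order, which is why a later denoising step is needed to restore a natural word order.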

C. TARGET SENTENCE DENOISING
Although the target sentence enrichment module has removed mistranslated or extra words and added missing source-side information, the enriched version of the MT output (enriched-mt) is still far from being an acceptable translation. Its word order and grammar still need to be corrected. Moreover, in the appended part of enriched-mt, the substitution of unaligned source words with similar target words always outputs a target word for every position. There are plenty of cases in which some of the substituted (appended) words should be removed to produce fluent output. In other cases, we need to add extra common words, e.g. prepositions or articles, to obtain the correct sentence structure. For example, a sequence of Myanmar source words "နှ စ် ယ ောက် စလ ို ုံး သူ တ ို ို့ က ို " would be substituted word-by-word with the sequence of similar target words "both them to"; however, it must be "both of them" in English. In this case, we consider the substituted word "to" as an insertion noise that needs to be removed from the sentence and the extra word "of" as a deletion noise that must be added to the sentence.
To remove the potential noise in enriched-mt, we design a sequence-to-sequence Transformer [4] model that takes a noisy (unstructured) sentence as input and generates a clean (denoised) sentence as output, both in the same (target) language. As shown in Fig 1 (c), we feed the noisy input, enriched-mt, into the denoising model so that it transforms the input into a clean target sentence, the post-edited mt. To assess how much the denoising mechanism improves the quality of the final output, we conduct experiments on the denoising task with the following two models.

1) DENOISING AUTOENCODER (DA)
For training the denoising autoencoder, training label sequences would be the target monolingual sentences. Given a clean target sentence, the noisy input should be ideally the unstructured version of the corresponding source sentence. To create the noisy versions, we inject artificial noise into a clean sentence to simulate the noise of our enriched-version sentence.
Firstly, for each sentence in a given monolingual corpus, we remove 20 to 30 percent of out-of-vocabulary (OOV) words and append the deleted words to the end of the sentence. Then, we insert the following artificial noises into the source side:
a) Insertion of random frequent tokens, where the model learns to remove extra/redundant words:
1. For each position $i$, a probability $p_i \sim \text{Uniform}(0, 1)$ is first sampled,
2. Let $p_{ins}$ be a probability threshold of the insertion. If $p_i < p_{ins}$, we sample a word from the $V_{ins}$ most frequent target words and then insert it before position $i$.
The inserted words are limited by $V_{ins}$ because target insertion occurs mostly with common words, e.g. prepositions or articles. We threshold the value $p_i$ with $p_{ins}$ to decide whether to insert a word.
b) Deletion of tokens, which helps the model learn to predict and add potential words for fluency:
1. For each position $i$, a probability $p_i \sim \text{Uniform}(0, 1)$ is first sampled,
2. Let $p_{del}$ be a probability threshold of the deletion. If $p_i < p_{del}$, we drop the word at position $i$.
We threshold the value $p_i$ with $p_{del}$ to decide whether to delete a word.
c) Permutation of tokens with a limited distance, which is applied to stimulate the learned model to modify the word order into a correct target structure:
1. Let $d_{per}$ be the degree of the permutation. For each position $i$, an integer $\delta_i \in [0, d_{per}]$ is sampled,
2. We add $\delta_i$ to index $i$ and sort the incremented indices $i + \delta_i$ in increasing order,
3. The words are rearranged into the new positions to which their original indices have moved in Step 2.
For the insertion, deletion, and reordering noises, we adopt the designs and settings of the previous work in [11]. In our target sentence denoising module, we consider a vocabulary size of 32,000 words, and words outside this vocabulary are called OOV words.
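The three noise types (a) to (c) can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names are ours, the caller is assumed to pass in the list of most frequent target words, and `random.Random` stands in for whatever sampler the actual toolkit uses.

```python
import random
from typing import List, Sequence


def insert_noise(words: Sequence[str], frequent_words: Sequence[str],
                 p_ins: float, rng: random.Random) -> List[str]:
    """(a) Before each position, insert a random frequent word with probability p_ins."""
    out = []
    for w in words:
        if rng.random() < p_ins:          # sampled value below the insertion threshold
            out.append(rng.choice(frequent_words))
        out.append(w)
    return out


def delete_noise(words: Sequence[str], p_del: float, rng: random.Random) -> List[str]:
    """(b) Drop each word with probability p_del."""
    return [w for w in words if rng.random() >= p_del]


def permute_noise(words: Sequence[str], d_per: int, rng: random.Random) -> List[str]:
    """(c) Local permutation: add an integer offset in [0, d_per] to each index, then sort."""
    keys = [i + rng.randint(0, d_per) for i in range(len(words))]
    # sorted() is stable, so words with equal keys keep their original order.
    order = sorted(range(len(words)), key=lambda i: keys[i])
    return [words[i] for i in order]
```

Because every offset is bounded by the permutation degree, a word can only move a limited distance from its original position, which matches the "limited distance" reordering described in (c).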

2) DENOISING REWRITER (DW)
We design a Transformer-based target-to-target rewriting model and train it to generate the clean target sentence from its noisy version. For training the rewriting model, we build noisy training data from the target monolingual corpus $M$. Firstly, we delete 20 to 30 percent of the OOV words from a given sentence $y \in M$, and then append the deleted words to the end of $y$. Next, to create the insertion and deletion noise types, we randomly drop/add some words (up to three words for sentences longer than ten words). Finally, we randomly swap contiguous words with a probability $p_{swap}$ to introduce some noise and obtain the noisy version $y'$. Note that $p_{swap} = 0.2$ is set. We treat $y'$ as the input and $y$ as the output to train the model. For model inference, we feed the enriched version of the MT output (enriched-mt) into the trained model and generate the clean target sentence, the post-edited mt.
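Two of the noising steps for the rewriter, moving OOV words to the end of the sentence and swapping contiguous words, can be sketched as below. The helper names, the way the 20 to 30 percent ratio is rounded to a word count, and the choice to skip past a swapped pair are all our assumptions for illustration.

```python
import random
from typing import List, Sequence, Set


def move_oov_to_end(words: Sequence[str], vocab: Set[str],
                    ratio: float, rng: random.Random) -> List[str]:
    """Remove a fraction `ratio` of the OOV words and append them at the end."""
    oov_positions = [i for i, w in enumerate(words) if w not in vocab]
    k = round(ratio * len(oov_positions))        # assumption: simple rounding
    moved = set(rng.sample(oov_positions, k))
    kept = [w for i, w in enumerate(words) if i not in moved]
    tail = [w for i, w in enumerate(words) if i in moved]
    return kept + tail


def swap_adjacent(words: Sequence[str], p_swap: float, rng: random.Random) -> List[str]:
    """Swap neighbouring words with probability p_swap (0.2 in the text)."""
    out = list(words)
    i = 0
    while i < len(out) - 1:
        if rng.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2                                # assumption: do not re-swap the same word
        else:
            i += 1
    return out
```

Applying `move_oov_to_end` and then `swap_adjacent` to a clean sentence gives a noisy input whose clean counterpart is the original sentence, which is the input/output pairing the rewriter is trained on.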

A. EVALUATION METRIC
For evaluating the performance of our APE system, we use two standard evaluation metrics: BLEU 2 which measures the degree of n-gram match between the model hypotheses and its target; TER 3 which measures the number of edits required to change a system output into one of the references. We evaluate the performance of the alignment models using Alignment Error Rate (AER) [12].
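For reference, AER is commonly computed as $1 - (|A \cap S| + |A \cap P|) / (|A| + |S|)$, where $A$ is the set of predicted alignment links, $S$ the sure gold links, and $P$ the possible gold links (with $S \subseteq P$). The sketch below is a minimal illustration of that formula; whether the gold alignments used here distinguish sure from possible links is not stated, so when no possible links are given the function assumes $P = S$.

```python
from typing import Iterable, Optional, Set, Tuple

Link = Tuple[int, int]  # (source index, target index)


def aer(predicted: Iterable[Link], sure: Iterable[Link],
        possible: Optional[Iterable[Link]] = None) -> float:
    """Alignment Error Rate; lower is better, 0.0 is a perfect match."""
    a: Set[Link] = set(predicted)
    s: Set[Link] = set(sure)
    # Possible links always include the sure links; default to P = S.
    p: Set[Link] = s | set(possible or ())
    if not a and not s:
        return 0.0
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```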

B. DATASETS
As monolingual data for training our denoisers and creating the bilingual dictionary, we used eight million Myanmar sentences gathered from various sources, including textbooks, Myanmar local news, Myanmar Wikipedia, ALT train data [13], and CC100-Burmese dataset [14]. For the English monolingual corpus, we used ten million sentences which combined ALT train data and randomly extracted sentences from WMT monolingual News Crawl datasets 4 . We used Moses tokenizer to tokenize English sentences. For Myanmar, we used the UCSYNLP segmenter 5 .
Parallel data is used to build the baseline NMT systems and the bilingual dictionary, and to train/fine-tune the word aligners. We collected around 224 thousand parallel sentences. Data statistics are shown in Table 1.

C. MODEL CONFIGURATION
The next subsections provide details about the architecture and training procedure of baseline systems and our models.

1) BASELINE MT SYSTEMS
The performance of our proposed APE system was evaluated on three different test sets, generated by a simple Transformer-based NMT system, a fine-tuned mT5 system, and Google Translate. For training the NMT system, we used the PyTorch version of the OpenNMT project, an open-source (MIT) neural machine translation framework [15]. The Transformer experiments were run on an NVIDIA Tesla P100 GPU with the settings listed in Table 2. For the mT5 system, we were constrained by computational resources to mT5-base, which has 580M parameters. We initialized the pre-trained mT5-base model using Hugging Face's AutoModelForSeq2SeqLM 6 . We used the AdamW optimizer [30] with a learning rate of 5e−4 and the Transformers library's get_linear_schedule_with_warmup 7 scheduler, and fine-tuned the model for 8 epochs with a batch size of 16 and 1000 training iterations between checkpoints. The parallel datasets shown in Table 1 were tokenized into sub-word units using SentencePiece 8 and used to train/fine-tune and validate the baseline NMT and mT5 systems.

2) DENOISING MODELS
For denoisers, we used 6-layer Transformer encoder/decoder [4]. Denoising autoencoder 9 is trained using Sockeye [11], [21]. For training the denoising rewriter, we use the same tool and settings as used in the baseline NMT. We used the target-side monolingual data to train the denoising models and treat ALT dev set as the validation data.

3) WORD ALIGNMENT MODELS
We apply the pre-trained LaBSE [8] model to get the cross-lingual word embeddings for the word alignment information extraction task. We compared our word alignment model with the following baselines:
1) fast_align [16] is a simple, fast, unsupervised word aligner based on a reparameterization of IBM Model 2.
2) GIZA++ [17], [18] is an implementation of the IBM models. We used five iterations each for Model 1, the HMM model, Model 3, and Model 4 to train GIZA++, following the previous work of [19].
3) AWESoME [20] is a neural word aligner based on multilingual BERT that can extract word alignments from contextualized word embeddings with and without fine-tuning on parallel data.

IV. EXPERIMENTAL RESULTS
In this section, we first describe the main results of our APE model based on our two denoising strategies, DA and DW, on the output of the three baseline MT systems: NMT, mT5, and Google Translate. Then, we evaluate our alignment model and compare its performance with state-of-the-art work. Additionally, we conduct a qualitative analysis and a series of ablation studies on the baseline NMT output to further validate the reliability of our proposed models and to better understand the importance of data preprocessing in the word alignment extraction task.

8 https://github.com/google/sentencepiece
9 https://github.com/yunsukim86/sockeye-noise

A. MAIN RESULTS
The overall results of our APE model are reported in Table 3.
There are two methods of training the proposed APE system, as described in Section II.C: DA and DW. The performance of the models is evaluated with the BLEU and TER metrics. Our experiments demonstrate that both versions of the APE model improve the quality of the texts generated by the baseline NMT. Our APE model trained with DW gave at least +4% BLEU and -16% TER. When we trained the APE system with DA instead of DW, we obtained an additional gain of around +1% BLEU and -2% TER.
To further validate the effectiveness of our APE systems on the state-of-the-art MT systems, we also conduct APE tasks on the output generated by mT5 and Google Translate. In these cases, our APE system trained with DA can still improve their output quality in both directions. However, APE system with DW fails to improve the quality of mT5 and Google Translate texts in the English-to-Myanmar direction.
Both denoisers are built using the same Transformer architecture but are trained on different noisy datasets. Overall, the insertion, deletion, and reordering noise types used to train the denoising autoencoder (DA) yield promising performance.

B. WORD ALIGNMENT RESULTS
A multilingual sentence embedding model is a powerful tool that encodes text from different languages into a shared embedding space, enabling it to be applied to a range of downstream NLP tasks, such as clustering and text classification, while also leveraging semantic information for language understanding. The existing approaches for generating such embeddings, like MUSE 10 or LASER 11 , require parallel data for training, in order to map a sentence from one language directly into another language and encourage consistency between the sentence embeddings.
The pre-trained LaBSE model leverages recent advances in language model pre-training, using both masked language modeling (MLM) and translation language modeling (TLM) objectives on a BERT-like architecture, and is fine-tuned on a translation ranking task. This results in a state-of-the-art model that encodes text from different languages into a shared embedding space. In this work, we apply the pre-trained LaBSE model to encode source and target words with similar meanings into a shared embedding space. Given a sentence pair, we first encoded all words in each sentence using LaBSE word embeddings. Then, we extracted all possible parallel source-target word pairs from their embeddings with our word aligner. We set the threshold value for word similarity to 0.5, as described in Section II.A. The extracted pairs with similarity scores higher than the threshold value were taken as the word-aligned pairs. We evaluated our model using the AER metric. Table 4 shows the alignment error rates (AERs) of our models and popular word aligners on the ALT test data of the English-Myanmar language pair. The results show that our LaBSE-based word aligner achieves consistent improvements over the state-of-the-art baseline models, demonstrating the effectiveness of our proposed method. The best score is in bold. Surprisingly, the alignments extracted directly from LaBSE without fine-tuning on parallel data (i.e., the w/o fine-tuning setting) already outperform the popular statistical word aligners fast_align and GIZA++. To further investigate the performance in the bilingual setting, we trained/fine-tuned the model using the parallel data shown in Table 1. In the bilingual setting, our word aligner achieves the best performance among all models.

C. QUALITATIVE ANALYSIS
Our main results reveal that automatic post-editing using the denoising autoencoder (DA) is better than the target-to-target rewriting-based denoising model (DW). In this section, we additionally conduct a qualitative analysis to verify our proposed framework more reliably. We analyzed the actual post-editing results of the two APE models, DA and DW, which were trained on our noisy datasets. We present some examples from the DA-based and DW-based APE models tested on the output of the NMT system in Table 5 and Table 6, respectively. In the tables, the TER scores in the mt rows are calculated with respect to tgt; boldface words in mt indicate words that need to be corrected to match tgt, the human-translated target-side reference sentence. We found that the output of the baseline English-Myanmar NMT, mt, requires an excessive number of corrections, whereas mt post-edited by our APE models requires fewer. Of the two models, post-editing with DA requires the fewest corrections and turns mt into a more accurate and fluent sentence, which is in turn closer to the reference sentence.

D. ABLATION STUDY: DENOISING AUTOENCODER
We tuned each parameter of the noise and combined them incrementally to investigate the effect of each noise type in the denoising autoencoder on the baseline NMT output, as shown in Table 7. Firstly, we applied the reordering noise with different values of the permutation degree $d_{per}$. A significant improvement was achieved from $d_{per} = 5$, since local reordering usually involves a sequence of 5 to 6 words. We also tried training with $d_{per} > 5$ and found that it shuffles too many consecutive words together and thus cannot handle long-range reordering, yielding no further improvement.
Secondly, for the deletion noise, a deletion threshold of $p_{del} = 0.1$ gave +1.16% BLEU, but performance immediately degraded with a larger value; it was hard to observe one-to-many cases in the similar target word substitution more than once in each sentence pair. Finally, for the insertion noise, we observed the best performance (+1.92% BLEU) when sampling from the $V_{ins} = 10$ most frequent words. Generally, increasing $V_{ins}$ was not helpful since it provided too many variations in the inserted word, which might not be related to its neighboring words.

E. ABLATION STUDY: WORD ALIGNMENTS
In this part, we examined the performance of two different pre-trained embedding models, mBERT and LaBSE, on the supervised word alignment extraction task with our word aligner. mBERT is a Transformer model pre-trained on a large multilingual Wikipedia corpus using a masked language modeling (MLM) objective. We used the word embeddings of the 8th layer of mBERT, following [20], and of the 12th layer of LaBSE. We also examined how the word alignment performance varies with different levels of cross-lingual word embeddings.
As shown in Table 8, LaBSE outperforms mBERT by a large margin on the English-Myanmar language pair. Both mBERT and LaBSE support English and Myanmar in a single model, but the embedding vector spaces of mBERT are not aligned across languages, i.e., text with the same content in different languages is mapped to different locations in the vector space. This work shows that LaBSE, trained on both monolingual sentences and bilingual sentence pairs using MLM and translation language modeling (TLM) with the primary purpose of parallel sentence retrieval, can result in a model that is effective on word alignment extraction even for low-resource languages for which there is no data available during training.
We further investigated the performance at the sub-word level. For that, we tokenized the source and target sentences into sub-word units using SentencePiece. Our experiment shows that sub-word-level embeddings performed worse than word-level embeddings for both the mBERT and LaBSE models. For short sub-word tokens, the contexts they potentially met during embedding training are much more varied than those of a complete word, and a direct translation of such a token to a sub-word token of another language would be very ambiguous. This means that word-to-word similarity calculation with cross-lingual embeddings depends highly on frequent word mappings, and learning the mapping between rare words does not have a positive effect. Based on this result, we decided to adopt LaBSE-based word embeddings without the sub-word level in our APE system.

V. RELATED WORK
Most recent APE studies primarily focus on techniques to alleviate the data sparsity problem in APE. While recent advances have reported that automatic generation of synthetic APE triplets ⟨src, mt, pe⟩ from parallel corpora based on various noising schemes [22] and the addition of synthetic data to genuine data to expand the APE training data [23]-[25] can mitigate the data scarcity, other studies have highlighted several open challenges [26]. A major challenge is the quality of the generated synthetic data. These recent synthetic data generation works do not comply with the minimum-editing criterion, under which pe should be created by minimally editing mt while maintaining the meaning of src. Therefore, the correction patterns detected in this synthetic data may differ from those occurring in genuine APE data, and possibly limit the APE performance. Moreover, when generating the APE triplets from the existing parallel data, training the baseline MT and APE models on the same data size will not be effective [5]-[7]. There is also the issue that pe should not be a reference translation (target text translated by a human) in the APE task, since this would defeat the purpose of learning editing patterns for the MT output [27]. In this work, considering the limitations of APE triplet generation and the absence of APE triplets that hinders the applicability of the APE task to English-Myanmar NMT, we pursue an alternative solution: an APE model designed using only the available monolingual and parallel data, without any human-edited APE triplets. Primarily, APE systems are employed for improving MT output by exploiting information unavailable to the decoder and coping with systematic errors, including adequacy and fluency errors, of an MT system whose decoding process is not accessible.
For this purpose, previous work on English-French APE [3] tried to maintain the connection between the MT output and the source sentence using word alignment information in order to improve adequacy. The authors created a new intermediate sentence by concatenating each word in the MT output with "#" and its aligned source word, and then trained their APE model to rewrite the intermediate sentence into the reference target sentence. However, their APE pipelines failed to improve on the MT baseline for the English-to-French direction and achieved only a small BLEU increase of 0.65 absolute over the baseline for the French-to-English direction. Parton et al. [28] also tackled specific linguistic adequacy errors, correcting them by either replacing words in the hypothesis or inserting words into it. Their system fixed only certain word-choice errors (e.g., numbers, names, and named entities) using three resources: the phrase table, dictionaries, and a background MT corpus. Our work has the same purpose, but we take an alternative approach that remains useful even in the low-resource setting where APE triplets are unavailable. We design a simple and effective word aligner for extracting word alignment information between the MT output and its original source sentence. This alignment information allows a deeper analysis of the text: systematic errors in the MT output, such as extra information (unaligned target words) and missing source information (unaligned source words), can be identified from it. Based on this analysis, we design a sentence enrichment module that enriches the MT output by removing the errors and adding the missing information. Moreover, to handle the remaining noise and to transform the enriched MT output into a more accurate and fluent sentence, we propose denoisers that clean up the errors and the reordering noise.
We show that APE systems can adapt the output of a general-purpose MT system to the lexicon/style requested in a specific application domain.
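The alignment-based analysis described above can be sketched in a few lines. The example below is illustrative only: the helper names and the toy English-French sentence pair are hypothetical, the alignment is given by hand as (source index, MT index) pairs, and leaving unaligned MT words without a "#" suffix in the intermediate sentence is our assumption, since the treatment of such words in [3] is not restated here.

```python
def unaligned_words(src_tokens, mt_tokens, alignment):
    """Split the alignment gaps into missing source words and extra MT words.

    alignment: iterable of (src_index, mt_index) pairs.
    Returns (missing, extra), where `missing` lists unaligned source tokens
    and `extra` lists unaligned MT tokens.
    """
    aligned_src = {i for i, _ in alignment}
    aligned_mt = {j for _, j in alignment}
    missing = [tok for i, tok in enumerate(src_tokens) if i not in aligned_src]
    extra = [tok for j, tok in enumerate(mt_tokens) if j not in aligned_mt]
    return missing, extra


def intermediate_sentence(src_tokens, mt_tokens, alignment):
    """Join each MT word with '#' and its aligned source word.

    Assumption: an MT word with no aligned source word is kept bare, and when
    an MT word has several aligned source words only the first is used.
    """
    mt_to_src = {}
    for i, j in alignment:
        mt_to_src.setdefault(j, i)
    parts = []
    for j, tok in enumerate(mt_tokens):
        if j in mt_to_src:
            parts.append(f"{tok}#{src_tokens[mt_to_src[j]]}")
        else:
            parts.append(tok)
    return " ".join(parts)


# Hypothetical toy example.
src = ["the", "black", "cat", "sleeps"]
mt = ["le", "chat", "dort", "ici"]
alignment = [(0, 0), (2, 1), (3, 2)]

missing, extra = unaligned_words(src, mt, alignment)
print(missing, extra)
print(intermediate_sentence(src, mt, alignment))
```

Here "black" is source information missing from the MT output and "ici" is extra information, which are the two error types that the sentence enrichment module targets.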
A large body of literature has studied the use of pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) for extracting word alignment information. In the field of neural word alignment, Sabet et al. [29] proposed methods to align words using multilingual contextualized embeddings and achieved competitive results even in the absence of explicit training on parallel data. More recently, Dou et al. [20] proposed a neural word aligner that leverages pre-trained mBERT and fine-tunes the embeddings on a parallel corpus for better alignment results. Although mBERT has shown a reasonable capability for zero-shot cross-lingual transfer when fine-tuned on downstream NLP tasks, it is not pre-trained with explicit cross-lingual supervision, and thus the transferred performance can be further improved by aligning mBERT with a cross-lingual signal. Instead of mBERT, we use LaBSE embeddings in our word alignment extraction task. LaBSE encodes text from different languages into a shared embedding space and has set a new state of the art on multiple parallel sentence retrieval tasks. It is effective even on low-resource languages for which there is no data available during training. The experimental results show that LaBSE word embeddings are superior to mBERT embeddings in our proposed word alignment extraction task.
The word-by-word translation output of an unsupervised MT system trained only on monolingual corpora can be improved with a denoising autoencoder (Kim et al., 2019). A denoising autoencoder is a sequence-to-sequence neural network model that takes a noisy sentence as input and produces a clean sentence as output, both in the same language. In our APE system, the target sentence enrichment module enriches the raw MT output by deleting unaligned target words and appending the most similar target word for each unaligned source word. Substituting each unaligned source word with its closest target word can be regarded as a form of word-by-word translation. Following the same idea, we use the denoising autoencoder in our APE system to transform the enriched version of the MT output into a clean and fluent version. As an alternative to the denoising autoencoder, we further design a denoising rewriter, a target-to-target rewriting model, and train it with different noise settings. Our experiments show that the proposed APE system with the denoising autoencoder (DA) can improve the quality of the texts generated by state-of-the-art MT systems.
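The enrichment step described above (delete unaligned target words, then append the most similar target word for each unaligned source word) can be sketched as follows. This is a simplified illustration, not the module used in our experiments: the function names are hypothetical, the candidate pool is assumed to be a small target vocabulary whose vectors share a space with the source word vectors, and all vectors are toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def enrich(src_tokens, mt_tokens, alignment, src_vecs, tgt_vocab):
    """Return the enriched MT token list.

    alignment: (src_index, mt_index) pairs.
    src_vecs:  one vector per source token.
    tgt_vocab: dict mapping candidate target words to vectors (assumed to be
               in the same cross-lingual space as src_vecs).
    """
    aligned_src = {i for i, _ in alignment}
    aligned_mt = {j for _, j in alignment}
    # Step 1: drop MT words that align to nothing in the source.
    enriched = [tok for j, tok in enumerate(mt_tokens) if j in aligned_mt]
    # Step 2: for each uncovered source word, append its nearest target word.
    for i in range(len(src_tokens)):
        if i not in aligned_src and tgt_vocab:
            best = max(tgt_vocab, key=lambda w: cosine(src_vecs[i], tgt_vocab[w]))
            enriched.append(best)
    return enriched


# Hypothetical toy example.
src = ["the", "black", "cat", "sleeps"]
mt = ["le", "chat", "dort", "ici"]
alignment = [(0, 0), (2, 1), (3, 2)]
src_vecs = [[0.5, 0.5, 0.5], [0.1, 0.9, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
tgt_vocab = {"chat": [1.0, 0.0, 0.0], "noir": [0.0, 1.0, 0.0], "dort": [0.0, 0.0, 1.0]}

enriched = enrich(src, mt, alignment, src_vecs, tgt_vocab)
print(enriched)
```

Because the retrieved words are appended at the end of the sentence, the enriched output is typically misordered and may be disfluent, which is why a denoising step follows enrichment.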

VI. CONCLUSION
In this paper, we propose a simple yet effective APE pipeline that corrects errors in the translation results of current English-Myanmar NMT systems and greatly improves the quality of their translated texts in both directions. We identify three principles (namely, word alignment, sentence enrichment, and sentence denoising) and show how to apply them to build an APE system without APE triplets. First, we introduce a neural word aligner that extracts alignment information from LaBSE-based contextualized cross-lingual word embeddings. Using the extracted alignment information, we analyze the gap between the MT output and its corresponding source sentence. Based on this analysis, we design a target sentence enrichment module that improves the raw MT text by removing extra information and inserting missing information. Finally, the enriched version of the MT text is denoised by our proposed denoisers, whose final output is a clean and fluent target sentence. Ablation studies show that our APE model, which integrates the three principles, achieves promising performance even in the absence of APE triplets.
To the best of our knowledge, this is the first attempt to adapt contextualized cross-lingual word embeddings and denoising mechanisms to the APE task on low-resource language pairs such as English-Myanmar. We believe that our findings can encourage further research in this direction. The proposed word aligner and denoisers can be applied not only to the APE task but also to other MT-related work. These models can easily be trained in both low- and rich-resource settings.