An Automatic Post Editing With Efficient and Simple Data Generation Method

Automatic post-editing (APE) research considers methods for correcting translation results produced by machine translation systems. Training APE models generally requires triplets consisting of a source sentence ($src$), a machine translation sentence ($mt$), and a post-edited sentence ($pe$). As considerable expert-level human labor is required to create $pe$, APE research has encountered difficulty in constructing suitable datasets for most language pairs. This has led to the absence of APE data for most language pairs, such as Korean-English, and has limited sustainable APE research. Motivated by this problem, we propose a method that can generate APE triplets using only a parallel corpus, without human labor. Our proposal comprises three noise generation techniques: random noise, part-of-speech (POS) based noise, and semantic level noise. The effectiveness of these methods is verified by quantitative and qualitative experiments on Korean-English APE tasks. In our experiments, we find that POS based noise yields the best APE performance. The proposed method obviates the expert human labor generally required in APE data construction and enables sustainable APE research for the many language pairs where human-edited APE triplets are unavailable.


I. INTRODUCTION
Automatic post-editing (APE) is a sub-field of machine translation focusing on the automated correction of errors produced by machine translation systems. APE has attracted considerable attention in that it alleviates the need for human efforts to correct machine-generated translations to human levels [1] and can contribute to domain specialized translations [2], [3]. APE is currently being actively studied as a shared task in the Conference on Machine Translation (WMT) [4].
However, a chronic problem remains in APE research with respect to data generation. APE models require triplet data consisting of a source sentence (src), a corresponding machine translation sentence (mt), and an associated post-edited sentence (pe), which is directly post-processed by human experts. As substantial human revisions are essential in correcting errors in mt, considerable expert-level human labor is required in APE data generation.
The associate editor coordinating the review of this manuscript and approving it for publication was Zijian Zhang.
The need for expert human labor in data generation creates significant difficulties for most language pairs. Currently, open APE triplets are available for only a few language pairs, such as English-German [4], [5], while open data suitable for APE has not been released for most language pairs, such as Korean-English. Accordingly, APE research may become concentrated on the specific language pairs for which appropriate data has already been released.
In this study, we relieve the high dependency of APE research on human-generated data and propose a method to conduct APE studies on language pairs without human-edited APE triplets. In particular, we introduce several methods for automatically generating APE triplets from parallel corpora without human labor, and evaluate them by training an APE model with each approach. We propose three APE triplet generation methods based on different noising schemes: random noise, POS based noise, and semantic level noise. The effectiveness of our proposed method is validated by applying it to the Korean-English language pair.
These methods all regard the source and target sentences of a parallel corpus as the src and pe of APE triplets, respectively. The proposed noising schemes generate a before-editing sentence, which is regarded as the mt of the APE triplet. The methods proposed in the present work were inspired by [6], which suggested applying noising schemes to parallel corpora to generate APE triplets for the English-German (En-De) language pair. We define random noise based APE triplet generation as a method that generates noise from a word list retrieved from the training corpus. APE triplet generation with POS based noise and semantic level noise creates mt by imposing noise with reference to the POS tags of the words and to semantic information retrieved from WordNet [7], respectively.
Through experiments, we quantitatively and qualitatively verified the effectiveness of these methods and evaluated their performance on APE tasks. Furthermore, we additionally leveraged translation system based APE triplets and confirmed that they yield further improvement. Overall, the main contributions of this paper are as follows:
• We propose a method to construct APE models with only parallel corpora, which substantially alleviates the need for expert human labor in APE data generation.
• Through our proposal, we enable sustainable APE research for the many language pairs for which APE data has not been released.
• Through comparative analyses of several noising schemes and training strategies, we derive the optimal strategy for training an APE model with only a parallel corpus.

II. RELATED WORK AND BACKGROUND
Recent studies on APE primarily focus on techniques to alleviate the data sparsity problem. Most notably, data augmentation methods utilizing parallel corpora have been widely adopted. In these approaches, the source and target sentences of a parallel corpus are generally regarded as the src and pe of an APE triplet, respectively, and mt is generated from the parallel corpus. In these works, mt denotes the before-editing sentence that should be revised by the APE model. One major approach to generating mt from a parallel corpus is to leverage a machine translation system. Reference [8] proposed a method to generate mt by translating src through a translation system. Recent studies have demonstrated significant improvements in APE models by utilizing this method [9], [10]. However, this method requires a translation system for the generation of mt. Therefore, such methods may not be suitable for low resource languages (LRLs), for which it is difficult to construct high-performance translation systems owing to insufficient parallel data [11]. That is, when relatively few parallel sentences are available, it is difficult to ensure that an equivalent performance improvement can be achieved. Moreover, because the mt generated by the translation system is created independently of pe, it cannot be assumed that this mt contains information on the errors that need to be corrected by humans [12]. This may steer the APE model toward an objective different from the original purpose of APE: generating pe by correcting the errors that reside in mt.
Considering the limitations of such translation system based APE triplet generation, a noising scheme based APE triplet generation method was proposed [6]. In this method, mt is generated by intentionally imposing defects on pe, which was originally the target sentence in the parallel corpus. Four noising schemes were proposed: adding new tokens (insertion), deleting tokens (deletion), replacing tokens with other tokens (substitution), and reordering tokens (shifting). An advantage of such methods is that they can also be applied to LRLs, because they generate mt by adding noise to target sentences without requiring a translation system.

A. AUTOMATIC NOISE GENERATION FOR LRL APE TRIPLET
We propose a method to generate pseudo-triplets $T = \{(X^{(i)}, \hat{Y}^{(i)}, Y^{(i)})\}_{i=1}^{d}$ for APE by applying a noising scheme to a parallel corpus $P = \{(X^{(i)}, Y^{(i)})\}_{i=1}^{d}$. In this notation, $X^{(i)}$, $\hat{Y}^{(i)}$, and $Y^{(i)}$ indicate src, mt, and pe, respectively. Similar to [6], the present work utilizes noising schemes to generate pseudo-triplets for APE. $T$ and $P$ share the same $X^{(i)}$ and $Y^{(i)}$, and each $\hat{Y}^{(i)}$ in $T$ is generated by imposing noise on $Y^{(i)}$. As part of this study, we conducted a comparative analysis of three noising schemes: random noise, POS based noise, and semantic level noise.

1) RANDOM NOISE
Random noise refers to a noising scheme that generates $\hat{Y}^{(i)}$ from $Y^{(i)}$ by replacing some words in $Y^{(i)}$ with random words. Prior to the noising process, we construct a word list $L$ from the $Y^{(i)}$ in $P$ for later use. $L$ is defined by equation (1).
In this equation, $n_i$ is the token length of $Y^{(i)}$, which is segmented by the NLTK tokenizer [13]. $L$ combines all the segmented tokens of every $Y^{(i)}$ in $P$, without duplicates.
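The body of equation (1) is not reproduced in this text. A form consistent with the description above, offered as a reconstruction rather than the authors' exact notation, writes $L$ as the union of the token sets of all target sentences, where $y^{(i)}_j$ denotes the $j$-th token of $Y^{(i)}$ and $d$ is the number of sentence pairs in $P$:

```latex
L = \bigcup_{i=1}^{d} \left\{\, y^{(i)}_{j} \;\middle|\; 1 \le j \le n_i \,\right\}
```

Because $L$ is a set, a token that occurs in several sentences appears in it only once, which matches the statement that tokens are combined without duplicates.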
In this noising process, $\hat{Y}^{(i)}$ is generated by replacing some words in $Y^{(i)}$ with random words selected from $L$, without any consideration of contextual information. In addition to replacing words in $Y^{(i)}$ with random words, we utilize insertion, deletion, replacing, and shifting noise, which denote adding new words, deleting original words, changing an original word into a different one, and changing the positions of words in $Y^{(i)}$, respectively. These noising schemes were proposed by [6], and we generate $\hat{Y}^{(i)}$ by combining them.
VOLUME 10, 2022
The ratio of noise imposed on $Y^{(i)}$ is determined by a probability. In the noising process, each word in $Y^{(i)}$ is judged to be noised or not based on the probability $p$, which is sampled from the uniform distribution over $[0, 1)$. The noising probability applied to each $Y^{(i)}$ varies throughout the training process, which enables the model to acquire a robust error-revising capacity. The detailed procedure for generating $\hat{Y}^{(i)}$ can be formalized as Equation (2). The noising schemes in Equation (2) are, in order from the top, the insertion, deletion, replacing, shifting, and skipping noise. Within the total probability $p$, the noise schemes are selected with equal probability.
When imposing noise schemes on $Y^{(i)}$, the token sequence $y^{(i)}$ is generated by segmenting each $Y^{(i)}$ with the NLTK tokenizer. Then $\hat{Y}^{(i)}$ is obtained by imposing the noise schemes with probability $p$, using a random variable $r$ sampled from the uniform distribution over $[0, 1)$. For the insertion and replacing noise, a random token $t$ is drawn from the word list $L$, and for the shifting noise, $\hat{y}$
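The random noising procedure described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, whitespace splitting stands in for the NLTK tokenizer, the interval $[0, p)$ is assumed to be split into four equal parts for the four noise types, and the shifting noise is assumed to re-insert the token at a random position among the tokens emitted so far, since the exact shifting rule is not recoverable from the text.

```python
import random


def build_word_list(target_sentences):
    """Collect the distinct tokens of all target sentences (the word list L)."""
    vocab = set()
    for sentence in target_sentences:
        vocab.update(sentence.split())  # assumption: stand-in for the NLTK tokenizer
    return sorted(vocab)


def add_random_noise(tokens, word_list, p, rng):
    """Return a noised copy of `tokens`.

    For each token a value r is drawn from [0, 1). The interval [0, p) is
    split into four equal parts for insertion, deletion, replacing and
    shifting; r >= p leaves the token unchanged (skipping).
    """
    noised = []
    for token in tokens:
        r = rng.random()
        if r < p / 4:            # insertion: keep the token and add a random one
            noised.extend([token, rng.choice(word_list)])
        elif r < p / 2:          # deletion: drop the token
            continue
        elif r < 3 * p / 4:      # replacing: swap in a random token from L
            noised.append(rng.choice(word_list))
        elif r < p:              # shifting (assumed rule): re-insert at a random position
            noised.insert(rng.randrange(len(noised) + 1), token)
        else:                    # skipping: token is kept as it is
            noised.append(token)
    return noised


rng = random.Random(0)
corpus_targets = [
    "the committee has approved the new budget",
    "we will help you find it",
]
word_list = build_word_list(corpus_targets)
pe = corpus_targets[0].split()
p = rng.random()                 # per-sentence noising probability from [0, 1)
mt = add_random_noise(pe, word_list, p, rng)
```

Drawing a fresh `p` for every sentence mirrors the statement that the noising probability varies across the training data, so the model sees both lightly and heavily corrupted inputs.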

2) POS BASED NOISE
In applying POS based noise, a noising process similar to that of random noise is used, but unlike random noise, a part-of-speech (POS) tagging-based word list $L_{pos}$ is used to determine the tokens for replacement. To construct $L_{pos}$, all the $y^{(i)}_j$ whose POS tag is $pos$ are accumulated, as shown in equation (3).
In this equation, $pos$ refers to the POS tag of $y^{(i)}_j$, the token to be replaced in $Y^{(i)}$. During the noising process, each replacement word is selected randomly from $L_{pos}$. Specifically, a noising process similar to equation (2) is carried out, but $t$ is selected from $L_{pos}$, not from $L$. Through this process, a verb in $Y^{(i)}$ such as ''help'' is replaced with another verb such as ''find'', not with a word carrying another POS tag, such as a noun or an adjective.
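The body of equation (3) is likewise not reproduced in this text. A form consistent with the description, given here as a reconstruction under the assumption that $\mathrm{POS}(\cdot)$ denotes the tag assigned by the POS tagger, is:

```latex
L_{pos} = \left\{\, y^{(i)}_{j} \;\middle|\; \mathrm{POS}\!\left(y^{(i)}_{j}\right) = pos,\; 1 \le i \le d,\; 1 \le j \le n_i \,\right\}
```

Under this reading there is one word list per POS tag, and the lists together cover the same tokens as $L$.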
This noising process follows the traditional NLP pipeline [14] when imposing noise on each sentence. We can thus expect higher performance than with random noise, which applies no linguistic criterion when selecting replacement words. We utilized the NLTK toolkit [13] to perform POS tagging.
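As an illustration of the POS based replacement, the sketch below groups tokens by tag and replaces a token only with another token of the same tag. The function names are hypothetical, only the replacing noise is shown, and the input is assumed to be already tagged as (token, tag) pairs, which is the shape returned by NLTK's `pos_tag`; the tags in the example are Penn Treebank tags written by hand.

```python
import random
from collections import defaultdict


def build_pos_word_lists(tagged_sentences):
    """Group the distinct target-side tokens by POS tag (one list per tag)."""
    by_tag = defaultdict(set)
    for sentence in tagged_sentences:
        for token, tag in sentence:
            by_tag[tag].add(token)
    return {tag: sorted(tokens) for tag, tokens in by_tag.items()}


def pos_replace(tagged_tokens, pos_lists, p, rng):
    """Replace each token with probability p by a random token with the same POS tag."""
    noised = []
    for token, tag in tagged_tokens:
        if rng.random() < p:
            # Fall back to the token itself if the tag was never seen.
            noised.append(rng.choice(pos_lists.get(tag, [token])))
        else:
            noised.append(token)
    return noised


# Hand-tagged toy corpus; in practice the pairs would come from a POS tagger.
tagged_corpus = [
    [("we", "PRP"), ("help", "VBP"), ("you", "PRP")],
    [("they", "PRP"), ("find", "VBP"), ("it", "PRP")],
]
pos_lists = build_pos_word_lists(tagged_corpus)
noised = pos_replace(tagged_corpus[0], pos_lists, 0.5, random.Random(0))
```

Because every replacement shares the tag of the original token, the noised sentence keeps the syntactic skeleton of pe, which is the structural consistency the method relies on.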

3) SEMANTIC LEVEL NOISE
In semantic level noise, WordNet [7] is utilized in imposing noise. As in the previous noising schemes, $\hat{Y}^{(i)}$ is generated from $Y^{(i)}$ by replacing some words in $Y^{(i)}$ with others. Once a word $y^{(i)}_j$ is selected to be noised, its WordNet information is retrieved, in particular its list of synonyms. These synonyms are regarded as replacement candidates, and during the noising process one of them is randomly selected as the replacement. Specifically, in applying equation (2), the replacement token $t$ is selected from the synonym list of $y^{(i)}_j$, not from $L$. This can be viewed as a noising scheme for training a type of human-level editing. Representative errors in translation results that must be edited by human experts include the wrong choice among synonyms, which have the same meaning but play different roles in a given domain or context. For instance, ''help'' and ''assist'' share the meaning ''give help or assistance, or be of service.'' However, considering formality and context, these two words should be treated differently. By imposing this type of error in the noising process, a model can be trained to mimic human-level editing.
A brief structure for the application of each noising scheme is shown in Figure 1, which depicts replacing noise as a representative case. During the generation of $\hat{Y}^{(i)}$ in each noising scheme, some words in $Y^{(i)}$ are selected randomly and replaced with other words according to the corresponding noising scheme. In random noise, replacement words are drawn randomly from the word list constructed by combining the words of $Y^{(i)}$ across the whole corpus $P$, without any consideration of contextual or semantic information. In POS based noise, replacement words are selected from the word list corresponding to the POS tag of the word being replaced. In semantic level noise, the synonym lists of the words to be replaced in $Y^{(i)}$ are retrieved, and random synonyms are selected from those lists.
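The semantic level replacement can be sketched in the same style. The example below is an assumption-laden illustration: the function names and the toy synonym table are hypothetical, and the synonym lookup is passed in as a function so that the sketch runs without the WordNet data. With NLTK and its WordNet corpus installed, the candidates could instead be gathered from `nltk.corpus.wordnet.synsets(token)` through each synset's `lemma_names()`.

```python
import random


def semantic_replace(tokens, synonyms_of, p, rng):
    """Replace each token with probability p by a randomly chosen synonym.

    `synonyms_of` maps a token to a list of candidate synonyms; tokens
    without candidates are left unchanged.
    """
    noised = []
    for token in tokens:
        candidates = [s for s in synonyms_of(token) if s != token]
        if candidates and rng.random() < p:
            noised.append(rng.choice(candidates))
        else:
            noised.append(token)
    return noised


# Hypothetical stand-in for a WordNet synonym lookup.
TOY_SYNONYMS = {"help": ["assist", "aid"], "big": ["large"]}


def toy_synonyms_of(token):
    return TOY_SYNONYMS.get(token, [])


noised = semantic_replace(["please", "help", "me"], toy_synonyms_of, 1.0, random.Random(0))
```

Unlike the word lists $L$ and $L_{pos}$, the candidates here come from an external resource rather than from the training corpus, so a replacement may fall outside the domain vocabulary of the parallel data.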

B. EFFICIENT TRAINING OF APE MODEL
In this study, we construct an APE model by fine-tuning a pretrained vanilla transformer [15] based NMT model on the APE task. The transfer learning strategy [16] of fine-tuning a pretrained NMT model on the APE task has been shown to be effective in improving APE performance [17], [18]. As this approach does not require a pretrained language model with a large number of parameters trained on a large amount of data, such as XLM [19], it is also effective in terms of training efficiency.
In the APE fine tuning process, we utilized a bottleneck adapter layer (BAL) [20] structure for more efficient training [17], [21]. The BAL comprises two dense layers with one activation function. We adopted ReLU as the activation function. The forward processing through a single BAL structure can be described as equation (4).
In equation (4), $x$ is an input embedding vector with $x \in \mathbb{R}^{d_{model} \times len}$, where $len$ indicates the max token length.
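The body of equation (4) is not reproduced in this text. One form consistent with the description of two dense layers with a ReLU between them is given below as a reconstruction. The weight shapes are assumptions chosen so that the output has the same dimension as $x$; bias terms and the residual connection that adapter layers often include are left out because the text does not state them.

```latex
\mathrm{BAL}(x) = W_2 \, \mathrm{ReLU}\!\left(W_1 x\right),
\qquad W_1 \in \mathbb{R}^{d_{bal} \times d_{model}},
\quad W_2 \in \mathbb{R}^{d_{model} \times d_{bal}}
```

Here $d_{bal}$ is the bottleneck dimension, so $W_1$ projects each column of $x$ down to $d_{bal}$ dimensions and $W_2$ projects it back up to $d_{model}$.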
By setting the output dimension of $W_2$ equal to $d_{model}$, the dimensionality of $\mathrm{BAL}(x)$ is made the same as that of $x$. The BAL structure is applied to the pretrained transformer based NMT model. In each transformer layer, two BAL structures are added, one after the self-attention structure and one after the feed-forward network structure. During fine-tuning of the BAL-added transformer model, we freeze the parameters of the pre-existing transformer model and train only the BAL structures. Adopting the BAL structure in the fine-tuning process noticeably improves training efficiency. In particular, the number of parameters trained during fine-tuning is significantly reduced compared with a naive fine-tuning strategy [20]. In this work, the number of trainable parameters in APE fine-tuning is 1.6M, which is about 1.6% of the total number of parameters in the whole model. This enables the model to reach considerable performance by training only a relatively small number of parameters, and accordingly reduces the computing power and training time required.
The APE fine-tuning process with our proposed noising schemes can be described as follows. First, we generate the pseudo-triplet APE data $T$ by following our proposal. Then, utilizing $T$, we train the APE model $\theta$ with the objective of maximizing the sequence-to-sequence probability in equation (5).
In this equation, $z^{(i)}_k$ refers to the $k$-th token of $Y^{(i)}$, which is segmented by our SentencePiece tokenizer. We can denote it as $Y^{(i)} = (z^{(i)}_1, \ldots, z^{(i)}_{m_i})$, where $m_i$ is the max token length of $Y^{(i)}$. Equation (5) indicates that in the fine-tuning process, the concatenation of $X^{(i)}$ and $\hat{Y}^{(i)}$ is used as the input. Then, through sequence-to-sequence [22] training, the model is trained to generate $Y^{(i)}$. This fine-tuning process was suggested by [17], and we adapted it appropriately for our experiments.
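The body of equation (5) is not reproduced in this text. A standard sequence-to-sequence log-likelihood that matches the description is given below as a reconstruction. The bracket notation $[X^{(i)}; \hat{Y}^{(i)}]$ for the concatenated input and the shorthand $z^{(i)}_{<k}$ for the tokens generated before position $k$ are assumptions introduced here.

```latex
\max_{\theta} \; \sum_{i=1}^{d} \sum_{k=1}^{m_i}
\log P\!\left(z^{(i)}_{k} \,\middle|\, z^{(i)}_{<k},\; [X^{(i)}; \hat{Y}^{(i)}];\; \theta\right)
```

Each target token of pe is thus predicted from the source sentence, the noised sentence, and the previously generated target tokens.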

A. DATASET DETAILS
To verify the effectiveness of our proposed approach, we adopted a Korean-English parallel corpus released by AIhub [23]. This data, comprising 1.6 million sentence pairs, was generated by an NMT model and then inspected by human experts. The corpus has been adopted in many Korean-English translation studies [24], [25]. As precise human inspection was involved in the data generation, the data quality can be regarded as sufficiently high. We filtered out sentences with fewer than two words or more than 200 words for consistent training [24], [26]. We extracted 120,000 sentence pairs from these data and utilized them to generate APE triplets, while the others were utilized to train the NMT models. From the 120,000 extracted sentence pairs, we arbitrarily selected 12,000 sentence pairs each as validation and test datasets.
We used translation edit rate (TER) [27] and BLEU [28] as our evaluation metrics. To measure them, we employed the TER measurement software known as tercom for the TER score, and mteval13.pl, provided with Moses, for the BLEU score. BLEU is the most representative metric for measuring the performance of machine translation systems, based on n-gram similarity with a length penalty [29], and is adopted as the auxiliary evaluation metric in APE research [4]. TER is adopted as the major evaluation metric for APE systems; it measures the minimum number of edits required to revise a translated sentence into the reference [30]. The performance evaluation of our proposed model was conducted on three different test sets, which were generated by the Google, Amazon, and Microsoft translation systems. The statistics of the training, validation, and test datasets used in this paper are shown in Table 1.
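To make the idea behind an edit-rate metric concrete, the sketch below computes a word-level edit distance divided by the reference length. It is a simplified illustration only and is not the tercom implementation used in the experiments: full TER also counts block shifts as single edits, which this version omits, and the function name is hypothetical.

```python
def word_edit_rate(hypothesis, reference):
    """Word-level insertions, deletions and substitutions per reference word.

    Simplification: block shifts, which TER counts as one edit, are not modelled.
    """
    hyp, ref = hypothesis.split(), reference.split()
    # prev[j] holds the edit distance between the hypothesis words processed
    # so far and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1] / max(len(ref), 1)


rate = word_edit_rate("the cat sat", "the cat sat down")
```

A lower value means fewer edits are needed to reach the reference, which is why a reduction in TER after post-editing indicates an improvement.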
As shown in Table 1, Google Translate had the best MT quality, while Microsoft and Amazon showed relatively lower MT quality. By conducting evaluations on these test sets that show different qualities, we verified the effectiveness of our proposal.

B. MODEL DETAILS
In our experiment, we constructed an NMT model with a vanilla transformer structure and then fine-tuned that model on the APE task. The adopted transformer-based model consisted of six encoder-decoder layers with a hidden size of 512. The vocabulary size of the model was set to 50,000, and we utilized the SentencePiece [31] unigram model as the tokenizer. Translation training was conducted using fairseq [32]. Specifically, training ran for 200K steps with max tokens set to 16,384, with early stopping based on the validation BLEU score. We followed the training instructions given by fairseq, where the learning rate is empirically set to 5e-4 with an inverse square root learning rate scheduler.
Prior to fine-tuning on the APE task, we added the BAL structure to the pretrained NMT model. We set the BAL size $d_{bal}$ to 64, which is 1/8 of the pretrained model's hidden size $d_{model}$, to prevent overfitting and make learning more efficient. The model parameter settings are based on [21], where $d_{bal}$ is set smaller than $d_{model}$. The total number of parameters in our NMT model is 96.4M, and the number of parameters added by the BAL structure is 1.6M. This indicates that during APE fine-tuning, only 1.6M of the 98.0M parameters of the whole model are trained. We utilized Huggingface [33] for training the BAL-added model. For the training process, we adopted the Adam optimizer [34] and a cosine annealing scheduler [35] with a learning rate of 3e-5, selected empirically. One RTX A6000 was used for training, and every training run finished within a day. Early stopping was applied based on the validation BLEU score.
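The reported figure of 1.6M added parameters can be checked with a short calculation. The sketch below rests on assumptions that the text does not state explicitly: each BAL consists of two dense layers with bias terms, two BALs are added to each of six encoder layers and six decoder layers, and the function name is hypothetical.

```python
def bal_parameters(d_model, d_bal, with_bias=True):
    """Parameters of one bottleneck adapter: down-projection plus up-projection."""
    weights = d_model * d_bal + d_bal * d_model
    biases = (d_bal + d_model) if with_bias else 0
    return weights + biases


d_model, d_bal = 512, 64
num_layers = 6 + 6            # assumption: six encoder plus six decoder layers
bals_per_layer = 2            # one after self-attention, one after the feed-forward network
added = num_layers * bals_per_layer * bal_parameters(d_model, d_bal)

nmt_parameters = 96_400_000   # size of the pretrained NMT model reported above
share = added / (nmt_parameters + added)

print(f"added parameters: {added:,}")
print(f"trainable share:  {share:.2%}")
```

Under these assumptions the adapters add 1,586,688 parameters, which rounds to the reported 1.6M and to about 1.6% of the 98.0M parameters of the whole model.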

1) MAIN RESULTS
In this section, we present the experimental results that verify the effectiveness of our proposed method. To this end, we generated APE training data by applying the proposed noising schemes (random, POS based, and semantic level noise) and trained three different APE models, one on each dataset. The effectiveness of each noising scheme is measured by the performance of its corresponding model. Our results are shown in Table 2.
As shown in Table 2, the APE triplets generated by each of the three proposed noising schemes made it possible to create an APE model that effectively corrects the errors in mt. Among our three noising schemes, the APE triplets utilizing POS based noise demonstrated the best performance. This shows that POS information is useful when imposing noise, and that more elaborate noising schemes may further improve APE performance. Comparing these results with those of random noise, we also find that maintaining structural consistency when injecting noise may lead to better performance, since in generating mt, tokens in pe are substituted with tokens of the same POS.
We also observe that the APE triplets generated by semantic level noise achieved relatively poor performance. We infer that, because the surrounding context is not fully considered in the noising process, the selected synonym may not match the original meaning of the word in pe, and adequate training was therefore not achieved.
Additionally, the APE model trained on the APE triplets generated by semantic level noise demonstrates even worse performance than the model utilizing random noise. One interpretation is that the word list used in the noising process, which was extracted from WordNet synsets, led to a slight degradation in APE performance because the replacement words were derived independently of the pe corpus. This suggests that the model is sensitive to domain specific vocabulary. Note that in semantic level noise, replacement words are selected from the synonym lists in WordNet, which are retrieved independently of the domain of the training data.
The performance difference between the model utilizing POS based noise and the model utilizing semantic level noise also shows the importance of domain consistency in APE data generation. POS based noise may yield the best performance because it considers contextual information and also reflects the domain specificity of the training data when generating APE data.
Though our three methods yielded slightly different results, the overall performance of the APE models utilizing the corresponding APE triplets is strong. For instance, our APE model trained with POS based noise improves the quality of the translation results obtained from Amazon translation by 14.620 TER and 20.10 BLEU. As our proposal is a data generation method that automatically generates APE triplets from a parallel corpus without the need for human experts to perform editing, it is expected to be especially beneficial for LRLs, for which human-edited APE triplets have not been released.
Furthermore, as it removes the need for expert-level human labor in APE data generation, which has been a major obstacle to active APE studies across language pairs, this work contributes to sustainable APE research.

2) QUALITATIVE ANALYSIS
To perform a more reliable verification of our proposal, we additionally conducted a qualitative analysis. We analyzed the actual editing results of the APE models, which were trained through APE triplets generated by our proposed three noising schemes. The results are shown in Table 3.
Based on the results, we observe considerable differences between the edited results of the models, especially with respect to the order of words within each sentence and the overall meaning of each sentence. The best qualitative results are obtained by the model that leveraged POS based noise. The model trained on semantic noise based APE triplets ignored the information on ''the light and dark effect'' and lost its tense, such as ''has been'', and with random noise, a word from the source sentence that should be translated as ''Western'' is omitted from the edited sentence.
However, for the APE model that leverages POS based noise, these errors were properly corrected, and the edited sentences expressed the overall meaning of the post-edit sentences properly. These results indicate that POS based noise is of particular benefit in APE training.

3) MUTUAL SUPPLEMENTATION EFFECT WITH TRANSLATION SYSTEM
Referring to recent studies on APE, we can identify two different methods of generating pseudo-APE triplets from parallel corpora. The first is to generate mt by translating src with a machine translation system, as in eSCAPE [8], and the second is to utilize noising schemes, as in the present work. For simplicity, we denote APE data augmented by a machine translation system as $d_{MT}$, and APE data generated by applying a noising scheme as $d_{noise}$.
In a previous study, [6] showed that utilizing both $d_{MT}$ and $d_{noise}$ can substantially improve APE performance. Inspired by this, we investigate the optimal approach to using both APE triplet generation methods to improve APE performance. If the performance of an APE model trained with both kinds of triplets improves over that of the models trained on each kind alone, we call this a mutual supplementation effect.
Through these experiments, we investigate how this mutual supplementation effect can be obtained and examine the optimal strategy for utilizing both $d_{MT}$ and $d_{noise}$ in the training process with respect to the batch configuration. Specifically, we extend the previous study [6], in which the ratio of $d_{noise}$ in the batch configuration is fixed, by experimenting with various batch configurations.
We denote the ratio of $d_{noise}$ in a batch of the training process where $d_{MT}$ and $d_{noise}$ are utilized together as the mixing ratio $m$. We experimented with various mixing ratios to determine the optimal $m$ for achieving the mutual supplementation effect. Our results are shown in Table 4.
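One plausible reading of the mixing ratio is that each training batch draws a fraction $m$ of its examples from the noise based triplets and the remainder from the translation system based triplets. The sketch below illustrates that reading only; the function name, the batch size, and the sampling strategy are assumptions and are not taken from the experimental setup.

```python
import random


def mixed_batch(d_mt, d_noise, batch_size, m, rng):
    """Build one batch in which a fraction m of the examples comes from d_noise."""
    n_noise = round(m * batch_size)
    batch = rng.sample(d_noise, n_noise) + rng.sample(d_mt, batch_size - n_noise)
    rng.shuffle(batch)  # avoid a fixed ordering of the two sources inside the batch
    return batch


# Toy triplet pools, labelled by their origin for illustration.
d_mt = [("mt", i) for i in range(100)]
d_noise = [("noise", i) for i in range(100)]

batch = mixed_batch(d_mt, d_noise, batch_size=8, m=0.25, rng=random.Random(0))
noise_count = sum(1 for source, _ in batch if source == "noise")
```

With `m=0.25` and a batch size of 8, two examples in the batch come from the noise based pool and six from the translation system based pool.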
The results demonstrate that the proposed model trained with $d_{MT}$ and $d_{noise}$ together showed improved performance compared with the models trained on $d_{MT}$ or $d_{noise}$ alone, for all three of our noising schemes. We also found that a low $m$ generally leads to higher performance; when the mixing ratio was set lower than 0.25, the models largely obtained higher performance than the model trained on $d_{MT}$ alone. These results indicate that the mutual supplementation effect can be achieved by strategically combining $d_{MT}$ and $d_{noise}$ in the batch configuration. That is, by adopting $d_{MT}$ along with our proposed method, we can further improve APE model performance. As $d_{MT}$ can also be generated without expert-level human labor, it can substantially assist our proposed noising schemes without undermining the original intention of our proposal: the sustainability of APE data generation and APE research.

V. CONCLUSION AND FUTURE WORK
In this paper, we proposed a method to automatically generate APE triplets from parallel corpora without human labor. Three noising schemes were proposed, namely random noise, POS based noise, and semantic level noise, and the performance of each was evaluated using an APE model trained on the APE triplets generated by the corresponding scheme. We confirmed that a decent APE model can be constructed with our proposal, without APE data generated by expert-level human labor. We also found that by additionally utilizing a translation system, even higher APE performance can be obtained. Our proposal enables sustainable APE research even for language pairs for which appropriate APE data has not been released. In the future, we plan to study APE for industrial services based on a data-centric methodology.
SUGYEONG EO received the B.S. degree in linguistics and cognitive science from the Hankuk University of Foreign Studies, Yongin-si, South Korea, in 2020. She is currently pursuing the Ph.D. degree with the Natural Language Processing and Artificial Intelligence Laboratory, Korea University, Seoul, South Korea, under an integrated master's and Ph.D. course. Her research interests include neural machine translation and quality estimation, where she tries to predict machine translation quality that minimizes human labor.
HEUISEOK LIM received the B.S., M.S., and Ph.D. degrees in computer science and engineering from Korea University, Seoul, South Korea, in 1992, 1994, and 1997, respectively. He is currently a Professor with the Department of Computer Science and Engineering, Korea University. His research interests include natural language processing, machine learning, and artificial intelligence.