RESHAPE: Reverse-Edited Synthetic Hypotheses for Automatic Post-Editing

Synthetic training data have been used extensively to train Automatic Post-Editing (APE) models in many recent studies because the quantity of human-created data has been considered insufficient. However, the most widely used synthetic APE dataset, eSCAPE, does not respect the minimal-editing property of genuine data, and this defect may have been a limiting factor for the performance of APE models. This article adapts back-translation to APE to constrain the edit distance, while using stochastic sampling in decoding to maintain the diversity of outputs, to create a new synthetic APE dataset, RESHAPE. Our experiments show that (1) RESHAPE contains more samples resembling genuine APE data than eSCAPE does, and (2) using RESHAPE as new training data improves APE models' performance substantially over using eSCAPE.


I. INTRODUCTION
Machine Translation (MT) has been developed to produce high-quality translations and is now being used in various areas. Nevertheless, MT output is often inferior to human translation; i.e., MT outputs may contain translation errors such as errors in lexical choice and word order, and these errors require post-editing to improve the original MT outputs. In this regard, Automatic Post-Editing (APE) has been proposed to improve the quality of given MT outputs by correcting errors or by tailoring the output style to a specific domain [1], [2]. Besides diminishing humans' post-editing effort, APE is particularly useful when the MT system is given as a black box, because APE can revise the MT output on the fly without accessing the MT system's internal structure.
Many recent APE studies adopt neural multi-source sequence-to-sequence model architectures with supervised learning [2]-[4]. These models typically take a source text (src) and its MT output (mt) simultaneously as their inputs and take the post-edited text (pe) as their target. Thus, training those models requires triplet data, also called an APE triplet, of the form ⟨src, mt, pe⟩ (Fig. 1). Furthermore, APE data should satisfy the minimum-editing criterion: pe should be created by minimally editing mt while maintaining the meaning of src. However, the quantity of currently available APE data is insufficient to train deep and complex APE models due to the high production cost of human post-edited data; this scarcity has become a limiting factor for the performance of APE models.

FIGURE 1. An example of an APE triplet from the English-to-German WMT APE dataset [2]: src "Manipulates the shape of an item.", mt "Bearbeitet die Form eines Elements an.", pe "Verändert die Form eines Elements." Boldface words are either incorrect words in mt or post-edited words in pe.
To mitigate the data scarcity, adding synthetic data to genuine data to expand the training data [6]-[8] has emerged as a possible solution. In particular, eSCAPE [7], a synthetic APE dataset made from parallel corpora, has been used extensively in many studies [2]-[4], [9], [10]. eSCAPE uses parallel corpora composed of bitexts, i.e., pairs of a source (src) and a reference (ref), to make a set of synthetic APE triplets ⟨src, mt, ref⟩, in which mt is the MT output of src, and ref serves as pe.
However, synthetic data usually lack certain qualities that genuine data retain. In the same vein, although eSCAPE can be an effective way to obtain a large amount of training data, it neglects to comply with the minimum-editing criterion that genuine APE data should follow. We note that ref is created independently of mt and is therefore not guaranteed to have been minimally edited from mt. Consequently, the correction patterns observed in this synthetic data may differ from those occurring in genuine data, and this violation results in a significant discrepancy in the distribution of edit distance between eSCAPE and genuine data (Fig. 2), possibly limiting the APE performance.
To solve this problem, we propose a new synthetic APE data-generation scheme that uses parallel corpora. We introduce back-APE, (src, pe) → mt, which can be seen as an adaptation of back-translation [12] to APE. Back-APE learns to predict the mt that exhibits the most likely error patterns for given src and pe. We expect back-APE to produce erroneous hypotheses mt from parallel corpora based on the learnt error patterns. We then use these mt and the bitexts to compile a set of new synthetic triplets ⟨src, mt, ref⟩, named "Reverse-Edited Synthetic Hypotheses for Automatic Post-Editing" (RESHAPE).
We further examine several decoding strategies for back-APE to identify which strategy yields synthetic data that lead to the largest enhancement of the APE performance. Decoding methods that maximize the model's output probabilities, such as beam search and greedy search, are the primary methods for sequence generation; although accurate, their outputs tend to be rather short and/or conservative [13]-[16]. We speculate that those decoding methods could confine the back-APE outputs mt to certain error patterns, and that the resulting training data could thereby impede the model's learning. Thus, we suggest injecting randomness, weighted by the model probability, into the decoding process of back-APE by adopting stochastic sampling methods, to encourage the resulting synthetic samples to be diverse.
In our experiments, to make a fair comparison with eSCAPE, we construct RESHAPE from the same parallel corpora that were used to construct eSCAPE. For evaluation, we use the English-to-German WMT APE data [2], the de-facto standard benchmark. Experimental results demonstrate that, compared to eSCAPE, our method not only improves the APE performance but also produces more samples with characteristics similar to genuine data.

II. PRELIMINARY: AUTOMATIC POST-EDITING
Fundamentally, APE has been recognized as a sequence-to-sequence learning problem and implemented with the sequence-to-sequence structure, in which the encoder produces latent representations of a given source sequence and the decoder autoregressively produces the target sequence by taking the encoded representations. Because APE produces pe by revising mt while maintaining the meaning of src, this structure has been extended to a dual-source (or multi-source) sequence-to-sequence structure, (src, mt) → pe, that accommodates two input sequences, in which src is treated as an auxiliary input providing contextual information, and mt serves as the primary input to be corrected.
The majority of recently proposed APE models adopt this dual-source structure that extends Transformer [17], and several variants of this structure have been proposed [18], [19]. Following [19], which compared those variants to each other to identify the optimal architecture, we choose the dual-source architecture (Fig. 3) that achieved the best performance with fewer model parameters than the others as the underlying architecture shared by APE and back-APE in our experiments. Schematically, this APE model performs the following operations. For notational convenience, we write $x_{[t]} = \{x_1, x_2, \ldots, x_t\}$ for the first $t$ elements of a sequence. First, let $D = \{\langle x^{src}, x^{mt}, y \rangle\}^{n}$ denote a set of training data, which is a collection of $n$ APE triplets. For any training sample $\langle x^{src}, x^{mt}, y \rangle \in D$, where $x^{src} = x^{src}_{[T_{src}]}$, $x^{mt} = x^{mt}_{[T_{mt}]}$, and $y = y_{[T_{pe}]}$, in which $T$ denotes the corresponding sequence length, the src encoder first reads $x^{src}$ to produce a sequence of representations
$$h^{src} = \mathrm{Enc}_{src}(x^{src}), \qquad (1)$$
where $h^{src} = h^{src}_{[T_{src}]}$. Then, the mt encoder takes $x^{mt}$ together with $h^{src}$ to produce a sequence of contextualized representations
$$h^{mt} = \mathrm{Enc}_{mt}(x^{mt}, h^{src}), \qquad (2)$$
where $h^{mt} = h^{mt}_{[T_{mt}]}$. For every pe token $y_i \in y$, the decoder autoregressively produces its conditional probability as
$$P(y_i \mid y_{[i-1]}, x^{src}, x^{mt}) = \operatorname{softmax}(W_{pe}^{\top} h^{pe}_i), \qquad (3)$$
where $h^{pe}_i$ is the decoder's hidden state at step $i$ and $W_{pe} \in \mathbb{R}^{d \times |V|}$, in which $d$ and $|V|$ denote the sizes of the hidden dimension and the vocabulary, respectively. Finally, the model is trained to minimize the negative log-likelihood of the output probabilities with the following objective function:
$$\mathcal{L}_{APE}(D) = - \sum_{\langle x^{src}, x^{mt}, y \rangle \in D} \sum_{i=1}^{T_{pe}} \log P(y_i \mid y_{[i-1]}, x^{src}, x^{mt}). \qquad (4)$$
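To make the data flow concrete, the following is a minimal PyTorch sketch of a dual-source forward pass and one teacher-forced NLL training step. It illustrates the general (src, mt) → pe structure only, not the exact architecture of [19]; the module layout (realizing the mt encoder as a decoder stack so it can cross-attend to h^src), the omission of positional encodings, and all names and toy values are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class DualSourceAPE(nn.Module):
    """Sketch of a dual-source APE model: src encoder -> mt encoder -> pe decoder."""
    def __init__(self, vocab_size, d_model=512, nhead=8, ff=2048, layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # src encoder: self-attention over x_src
        self.src_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, ff, batch_first=True),
            num_layers=layers)
        # mt "encoder": a decoder stack, so x_mt can cross-attend to h_src
        self.mt_encoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, ff, batch_first=True),
            num_layers=layers)
        # pe decoder: autoregressive, cross-attends to h_mt
        self.pe_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, ff, batch_first=True),
            num_layers=layers)
        self.out_proj = nn.Linear(d_model, vocab_size)  # plays the role of W_pe

    def forward(self, x_src, x_mt, y_in):
        # (positional encodings omitted for brevity)
        h_src = self.src_encoder(self.embed(x_src))        # h^src
        h_mt = self.mt_encoder(self.embed(x_mt), h_src)    # h^mt, contextualized on h^src
        T = y_in.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h_pe = self.pe_decoder(self.embed(y_in), h_mt, tgt_mask=causal)
        return self.out_proj(h_pe)                          # logits over the vocabulary

# One teacher-forced training step minimizing the negative log-likelihood (Eq. (4)).
model = DualSourceAPE(vocab_size=32000)
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
x_src = torch.randint(1, 32000, (2, 20))   # toy batch of src token ids
x_mt = torch.randint(1, 32000, (2, 22))    # toy batch of mt token ids
y = torch.randint(1, 32000, (2, 25))       # toy batch of pe token ids
logits = model(x_src, x_mt, y[:, :-1])     # predict y_1..y_T from y_0..y_{T-1}
loss = loss_fn(logits.reshape(-1, logits.size(-1)), y[:, 1:].reshape(-1))
loss.backward()
```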

III. ADAPTATION OF BACK-TRANSLATION TO APE
Back-translation [12] is a method widely used in MT to obtain additional parallel resources by leveraging monolingual texts. Back-translation first trains an MT system in the reverse direction, from the target to the source, and then uses the trained MT system to generate synthetic source-side texts from the target-side monolingual texts. Motivated by that method, we introduce back-APE: we apply the back-translation idea to an APE model and use it to create new synthetic APE triplets by leveraging parallel corpora. The back-APE method reverses the original APE process by swapping the positions of mt and pe, thus aiming to learn (src, pe) → mt. (Note that this modeling differs from our previous work, in which back-APE was constructed in the form (src, ref) → mt. Assuming that back-APE (src, ref) → mt is trained until convergence, its outputs are expected to be very similar to the original mt and thus may not be meaningful, since they may not reduce the edit distance from ref drastically. Thus, our previous study relied on empirical observations to determine its training stop point, leading to inefficient learning.) Consequently, back-APE is trained to minimize the following objective function with a given dataset $D = \{\langle x^{src}, y, x^{mt} \rangle\}^{n}$ (cf. Equation (4)):
$$\mathcal{L}_{\text{back-APE}}(D) = - \sum_{\langle x^{src}, y, x^{mt} \rangle \in D} \sum_{i=1}^{T_{mt}} \log P(x^{mt}_i \mid x^{mt}_{[i-1]}, x^{src}, y). \qquad (5)$$
Thus, whereas APE outputs a minimally 'corrected' text (pe) from mt while considering the meaning of src, back-APE outputs a minimally 'corrupted' text (mt) from pe conditioned on src (Fig. 4); i.e., back-APE can be interpreted as learning to produce mt that is likely to correspond with the given src and pe. Accordingly, when supplying a bitext (src, ref) to back-APE at inference time, we expect back-APE to produce mt whose error patterns (distribution) are influenced by genuine APE data, and we also expect that mt and ref follow the minimum-editing criterion. Subsequently, by using back-APE's output (mt) and its input bitext, we construct a new set of synthetic APE triplets ⟨src, mt, ref⟩, named RESHAPE (Fig. 5). Note that letting src and ref serve as src and pe, respectively, is reasonable because the two sides of a parallel corpus are supposed to be error-free and semantically equivalent to each other; likewise, src and pe of a genuine APE triplet have the same relation.
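As a rough illustration of how the pieces fit together, the following Python sketch assembles RESHAPE-style triplets from a parallel corpus once a back-APE model is available; `backape_generate` is a placeholder for the trained model's decoding routine (Section IV), not a real API.

```python
def build_reshape(bitexts, backape_generate):
    """bitexts: iterable of (src, ref) pairs from a parallel corpus."""
    triplets = []
    for src, ref in bitexts:
        # ref plays the role of pe: back-APE "corrupts" it into a plausible mt
        mt = backape_generate(src=src, pe=ref)
        triplets.append({"src": src, "mt": mt, "pe": ref})
    return triplets
```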

IV. DECODING STRATEGIES FOR BACK-APE
A. MAXIMUM-A-POSTERIORI DECODING
Besides the back-APE training, we consider several decoding strategies to discover which produces synthetic samples that provide the most significant learning effect. First, we can typically consider maximum-a-posteriori (MAP) decoding strategies, which find the most likely sequence by maximizing the model probability, i.e., MAP yields
$$\hat{x}^{mt} = \operatorname*{arg\,max}_{x^{mt}} P(x^{mt} \mid x^{src}, y).$$
However, complete MAP decoding is almost intractable due to its inevitably vast search space $O(|V|^{m})$, where $V$ is the vocabulary set and $m$ is the maximum sequence length. Therefore, MAP is usually approximated by beam search or greedy search. Beam search is a best-first search algorithm that traces only the $B$ most likely prefixes $X^{mt}_{t-1}$, where $B$ and $t$ denote the beam size and the decoding time step, respectively. In detail, at every time step $t$, every possible token $x^{mt}_t \in V$ is appended to each prefix in $X^{mt}_{t-1}$, advancing to $\tilde{X}^{mt}_{t} = \{x \circ x^{mt}_t \mid x \in X^{mt}_{t-1},\; x^{mt}_t \in V\}$, and then the top-$B$ prefixes $X^{mt}_{t}$ for the next time step are selected from $\tilde{X}^{mt}_{t}$ according to their model probabilities, reducing the search space to $O(|V| \cdot B \cdot m)$. Meanwhile, greedy search simply picks the $t$-th token with the highest model probability,
$$\hat{x}^{mt}_t = \operatorname*{arg\,max}_{x \in V} P(x \mid \hat{x}^{mt}_{[t-1]}, x^{src}, y),$$
at every time step $t$. Under the assumption that the probability assigned by the model increases as the quality of the output increases, these methods are commonly used for many sequence-generation problems; thus, we can consider adopting them for back-APE as well. Nevertheless, inadequacies of MAP decoding have also been reported [13]-[15]. Because MAP determines its output by taking just the modes of the model distribution, MAP may not reproduce various other statistics of the training data, resulting in conservative, generic, and/or relatively uninformative outputs. In addition, by favoring high sentence scores, MAP tends to underestimate the sentence length. Although such phenomena are known to be particularly problematic for tasks that aim to generate human-like texts, such as dialogue generation [13] and story generation [14]-[16], we speculate that these downsides could also be problematic for back-APE because favoring only the top-scoring texts may result in highly homogeneous outputs and thus fail to cover the diverse error patterns (statistics) found in genuine APE data. Consequently, the lack of diversity in synthetic data can lead to biased learning effects.
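For concreteness, the following is a small sketch of the beam-search pruning rule described above, keeping only the B highest-scoring prefixes at each step; the tensor shapes and names are illustrative, not taken from any particular implementation.

```python
import torch

def beam_step(prefix_scores: torch.Tensor, log_probs: torch.Tensor, B: int):
    """One pruning step of beam search.
    prefix_scores: (B,) log-scores of the current prefixes X^mt_{t-1}
    log_probs:     (B, |V|) next-token log-probabilities for each prefix
    """
    cand = prefix_scores.unsqueeze(1) + log_probs         # scores of all extensions, (B, |V|)
    top_scores, flat_idx = torch.topk(cand.view(-1), B)   # keep only the top-B prefixes
    prefix_idx = torch.div(flat_idx, log_probs.size(1), rounding_mode="floor")
    token_idx = flat_idx % log_probs.size(1)              # token that extends each kept prefix
    return top_scores, prefix_idx, token_idx
```

Greedy search corresponds to the special case B = 1, where only the single highest-probability token is kept at each step.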

B. SAMPLING METHOD
Sampling methods, in which each output token is determined stochastically according to the model distribution, are known to better reflect the various statistics of the training data in the output; sampling methods preclude the model from taking only the modes of its distribution and estimate the targeted distribution more completely than MAP methods. When sampling is adopted in the decoding stage of back-APE, the model selects the $t$-th token randomly at every decoding time step, weighted by the model distribution, i.e.,
$$\hat{x}^{mt}_t \sim P(\cdot \mid \hat{x}^{mt}_{[t-1]}, x^{src}, y).$$
For back-APE, it appears promising to use sampling methods to yield synthetic training data that reflect various statistics of genuine APE data. Sampling methods are expected to generate mt by considering all possible error patterns that are likely to appear in the given input bitext based on the learnt statistics, so the generated samples will provide diverse learning patterns to APE models. In some cases, however, sampling methods may produce arbitrarily poor outputs that are inconsistent and/or context-independent if tokens are frequently drawn from the tails of the model distribution, where tokens are assigned relatively low, but non-zero, probabilities. Therefore, certain measures that help bound the error of the model outputs are required.
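A one-line sketch of this sampling step, assuming `probs` is the model's normalized next-token distribution at the current time step:

```python
import torch

def sample_step(probs: torch.Tensor) -> int:
    # probs: (|V|,) model distribution P(. | x^mt_<t, src, pe), summing to 1
    return int(torch.multinomial(probs, num_samples=1))
```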

C. RESTRICTED SAMPLING METHOD
Restricted sampling methods [14], [15] have been proposed to rectify the unboundedness problem that arises from sampling methods by excluding the unreliable tails of the model distribution. In particular, top-k sampling [14], a straightforward but powerful scheme, has recently become a popular method to obtain high-quality texts in many human-like text-generation tasks [14]-[16]. Top-k sampling samples from among the k most likely tokens at each decoding step, presuming that the tokens not ranked in the top-k are unreliable; top-k sampling can thus be regarded as a compromise between MAP and pure sampling. At every time step, the model distribution is trimmed to contain only the probabilities of the top-k tokens.
Provided that we apply top-k sampling to back-APE, the original model distribution is re-scaled to $\tilde{P}(\cdot)$ at every decoding step $t$ with respect to the top-$k$ vocabulary $V^{k}_{t} \subset V$. Specifically, $V^{k}_{t}$ contains the $k$ tokens with the highest probabilities under $P(\cdot \mid \hat{x}^{mt}_{[t-1]}, x^{src}, y)$, and the $t$-th token is subsequently sampled from the following re-normalized distribution:
$$\tilde{P}(x \mid \hat{x}^{mt}_{[t-1]}, x^{src}, y) =
\begin{cases}
P(x \mid \hat{x}^{mt}_{[t-1]}, x^{src}, y)\,/\,Z_t & \text{if } x \in V^{k}_{t}, \\
0 & \text{otherwise},
\end{cases}
\qquad \text{where } Z_t = \sum_{x' \in V^{k}_{t}} P(x' \mid \hat{x}^{mt}_{[t-1]}, x^{src}, y).$$
We expect that applying top-k sampling to back-APE will reduce the noise of pure sampling yet still allow the outputs to reflect the properties of the error distribution of the training data.
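The corresponding top-k step can be sketched as follows, with the re-normalization performed over the k retained probabilities; `probs` and `k` are illustrative inputs, not the authors' code.

```python
import torch

def top_k_sample_step(probs: torch.Tensor, k: int) -> int:
    # probs: (|V|,) model distribution over the full vocabulary
    top_p, top_idx = torch.topk(probs, k)   # probabilities of the top-k vocabulary V^k_t
    top_p = top_p / top_p.sum()             # re-normalized distribution P~
    choice = torch.multinomial(top_p, num_samples=1)
    return int(top_idx[choice])             # map back to the original vocabulary id
```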

V. EXPERIMENTAL SETUP
A. SETTINGS
a: Evaluation Metric
As in the WMT APE shared task [2], [3], we used the de-facto standard evaluation metrics: TER [5] (computed with tercom, https://github.com/jhclark/tercom), the primary metric, which measures the edit distance from the model hypotheses to their targets; and BLEU [20] (computed with the Moses scripts, https://github.com/moses-smt/mosesdecoder), a secondary metric, which measures the degree of n-gram match between the model hypotheses and their targets. All of our evaluations are case-sensitive.
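The article computes these metrics with the tercom and Moses tools linked above; purely as an illustration, comparable corpus-level TER and BLEU scores can also be obtained with the sacrebleu library, whose default settings may not exactly reproduce the shared-task numbers.

```python
# Hypothetical scoring snippet with sacrebleu (not the toolchain used in the article).
from sacrebleu.metrics import BLEU, TER

hyps = ["Verändert die Form eines Elements ."]    # APE model outputs
refs = [["Verändert die Form eines Elements ."]]  # gold post-edits (pe)

print(TER().corpus_score(hyps, refs))    # primary metric: lower is better
print(BLEU().corpus_score(hyps, refs))   # secondary metric: higher is better
```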

b: Datasets
We started with the WMT English-German (EN-DE) APE dataset, which is the de-facto standard APE benchmark and comes from the IT domain [2], [3]. The WMT data were used to train both back-APE and APE, and to evaluate the APE performance. We also used EN-DE eSCAPE (1) to train back-APE, due to the small size of the WMT dataset; (2) to construct RESHAPE, by feeding its (src, ref) into the trained back-APE model, for a fair comparison; and (3) to train one of the baseline APE models. All of our APE data are also categorized by subtask: whether mt has been produced by a phrase-based statistical MT (PBSMT) system or by a neural MT (NMT) system; detailed data statistics are presented in Table 1. We tokenized words into subword units by using SentencePiece (https://github.com/google/sentencepiece).

c: Models and Training
We used OpenNMT-py (https://github.com/OpenNMT/OpenNMT-py) to implement the aforementioned dual-source architecture (§II). We adopted the "base Transformer" settings [17] for all the models: specifically, 512 for all hidden dimensions including the embedding dimensions, 2048 for the feed-forward layers, 6 layers, 8 attention heads, a dropout probability of 10%, a label-smoothing value of 0.1, the Adam [21] optimizer with β = (0.9, 0.998), and 6,000 warm-up steps followed by an inverse-square-root decay of the learning rate. For the back-APE and APE training, we pre-trained the models with a batch size of 48K tokens on the combination of the synthetic data (either eSCAPE or RESHAPE) and the WMT data, and subsequently fine-tuned them with a batch size of 1,024 tokens by using only the WMT data. To evaluate the APE models, we used beam search with a beam size of 6 to obtain their hypotheses.
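As an illustration of the subword preprocessing step, a SentencePiece model can be trained and applied roughly as follows; the file names and vocabulary size here are assumptions, not the authors' actual configuration.

```python
import sentencepiece as spm

# Train a subword model on the training text (one sentence per line); "train.txt"
# and the vocabulary size are placeholder choices for this sketch.
spm.SentencePieceTrainer.train(
    input="train.txt", model_prefix="ape_spm", vocab_size=32000)

sp = spm.SentencePieceProcessor(model_file="ape_spm.model")
pieces = sp.encode("Manipulates the shape of an item.", out_type=str)
print(pieces)             # subword units fed to the APE model
print(sp.decode(pieces))  # detokenize back to the original text
```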

B. BACK-APE TRAINING AND RESHAPE CONSTRUCTION
We used eSCAPE both to pre-train back-APE and to construct RESHAPE, so we applied an 'n-fold cross-generation' technique (adapting n-fold cross-validation) to back-APE to avoid obtaining biased outputs for inputs that had already been used in training. Specifically, we split eSCAPE into n folds; for each fold, back-APE is trained on the remaining n-1 folds and then generates mt only for the bitexts in the held-out fold, so that no bitext is decoded by a model that saw it during training.
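A minimal sketch of this cross-generation procedure, assuming placeholder `train_backape` and `generate` routines (neither is a real API):

```python
def cross_generate(bitexts, n_folds, train_backape, generate):
    """Train back-APE on n-1 folds and decode only the held-out fold."""
    folds = [bitexts[i::n_folds] for i in range(n_folds)]  # simple round-robin split
    triplets = []
    for held_out in range(n_folds):
        train_data = [b for i, f in enumerate(folds) if i != held_out for b in f]
        model = train_backape(train_data)                   # back-APE on the other folds
        for src, ref in folds[held_out]:                    # decode only unseen bitexts
            triplets.append({"src": src, "mt": generate(model, src, ref), "pe": ref})
    return triplets
```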

VI. RESULTS
A. EXAMINATION OF BACK-APE DECODING SCHEMES
We prepared four RESHAPE datasets, each by using one of the back-APE decoding schemes (beam search, greedy search, pure sampling, and top-k sampling). To compare their learning effects, we trained an APE model on each RESHAPE variant and then evaluated their performance (Table 2); in the case of top-k sampling, among the various values of k (Fig. 6), we report only the best one in Table 2.
First, we observed that using sampling methods (either pure sampling or top-k sampling) results in consistent improvements in the APE performance over MAP decoding methods. Among the sampling methods, top-k sampling led to better APE performance than pure sampling in most cases. However, we found that pure sampling is competitive with top-k sampling in the NMT task, although top-k sampling consistently yields better TER than pure sampling. Our speculation on this phenomenon is that the edit-distance distributions of the two tasks have different skewnesses. In contrast with the PBSMT dataset, the edit-distance distribution of the NMT dataset is drastically skewed (Fig. 2): most of the samples contain only a small number of errors, and therefore the output probabilities of back-APE are likely to be concentrated on just a few candidates, so the distributions used by pure sampling and top-k sampling may not differ much.

B. QUANTITATIVE APE EVALUATION
Following the previous subsection, we adopted the RESHAPE created by applying top-k sampling to back-APE as our final new synthetic APE training data. To evaluate the APE performance, we considered three baselines:
• NO EDIT: a standard baseline formed by evaluating the raw mt in the test datasets, which has not yet been post-edited, indicating the initial margin for improvement to be achieved by APE.
• WMT only: an APE model trained solely on the WMT data.
• eSCAPE: an APE model trained on the WMT data augmented with eSCAPE, which can be regarded as our main baseline.
FIGURE 6. The effect of k in top-k sampling. The y-axis represents the TER of the APE model trained on the RESHAPE generated with top-k sampling for each value of k. The evaluation was conducted on the WMT test sets (the three TER results were averaged for the PBSMT task). The k that records the best performance is marked with a color (k = 40 for the PBSMT subtask and k = 30 for the NMT subtask).
The evaluation results (Table 3) demonstrate that our approach surpasses the first two baselines (i.e., NO EDIT and WMT only) by a substantial margin, indicating that augmenting gold APE data (the WMT data) with RESHAPE is remarkably beneficial. More importantly, we observed that our approach outperformed the eSCAPE baseline in all the test cases, achieving a statistically significant gain in most cases. This result suggests that RESHAPE is more advantageous than eSCAPE for APE learning. Our approach also outperformed the state-of-the-art APE models CopyNet-APE [9] and BERT-APE [10]. Both are Transformer-based APE models trained on eSCAPE; CopyNet-APE adds CopyNet [22] layers, and BERT-APE adopts BERT [23] weights for the APE architecture. It is particularly notable that our APE model trained on RESHAPE even outperformed BERT-APE, which contains a larger number of model parameters and was pre-trained on tens of millions of monolingual sentences.

VII. DISCUSSION
Our quantitative results reveal that RESHAPE is better than eSCAPE. Beyond that, we further discuss and analyze our results from various perspectives to examine whether RESHAPE is more similar to human-made APE data than eSCAPE is.
We pointed out at the beginning (§I) that eSCAPE and the WMT data differ greatly in terms of their edit-distance distributions (Fig. 2), so we first examined whether RESHAPE helps reduce this discrepancy. For this purpose, we compiled the TER statistics for the WMT data, eSCAPE, and RESHAPE in the same manner as in Fig. 2, and then measured the KL divergence of the TER distributions of eSCAPE and RESHAPE against that of the WMT data. As a result, RESHAPE showed a smaller divergence than eSCAPE (Fig. 7); this result indicates that the edit-distance distribution of RESHAPE is more similar to that of the WMT data than eSCAPE's distribution is.

Apart from the edit distance, we verified how similar RESHAPE is to the WMT data in various other aspects. We empirically designed 14 features to characterize an APE triplet, as follows:
• The TER score and the rates of the four editing operations (insertion, deletion, substitution, and shift) in the TER calculation.
• The lengths of each triplet element: $T_{src}$, $T_{mt}$, $T_{pe}$.
• The length ratios between two elements that have an 'input-output' relation: $T_{mt}/T_{src}$, $T_{pe}/T_{src}$, $T_{pe}/T_{mt}$.
• Language-model scores of each triplet element. Each element was scored by its corresponding language model ($LM_{src}$, $LM_{mt}$, $LM_{pe}$), each of which is a 5-gram Kneser-Ney language model [24] trained on the corresponding elements of the WMT data.
We then represented all triplets included in RESHAPE, eSCAPE, and the WMT data as 14-dimensional vectors of these features, excluding triplets that are represented by the same feature vector in both RESHAPE and eSCAPE, as no comparison is possible for them. Finally, for each WMT triplet, we used the k-nearest-neighbor algorithm (with Euclidean distance, after normalization of each feature) to search for the k closest synthetic triplets, in either RESHAPE or eSCAPE, and then counted how many come from each. We found that the nearest neighbors belonging to RESHAPE outnumber those belonging to eSCAPE (Table 4); this result supports our speculation that RESHAPE captures various characteristics of genuine APE data better than eSCAPE does.
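As an illustration of the edit-distance comparison above (the Fig. 7 analysis), the KL divergence between binned TER distributions can be computed roughly as follows; the bin edges and smoothing constant are illustrative choices, not taken from the article.

```python
import numpy as np

def ter_kl_divergence(wmt_ter, synth_ter, bins=np.arange(0, 105, 5), eps=1e-9):
    """D(P || Q) with P = genuine (WMT) TER distribution, Q = synthetic one."""
    p, _ = np.histogram(wmt_ter, bins=bins)
    q, _ = np.histogram(synth_ter, bins=bins)
    p = p / p.sum()
    q = (q + eps) / (q + eps).sum()   # smooth empty bins to avoid division by zero
    mask = p > 0                       # terms with p = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))  # natural log, as in Fig. 7
```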
Finally, we present some examples from RESHAPE and eSCAPE (Table 5). We found that mt in eSCAPE tends to undergo an excessive number of corrections, far more than is observed in genuine APE data.

FIGURE 7. KL divergence D(P ∥ Q) (with natural logarithms), where P and Q are edit-distance distributions; P is for the WMT data, and Q is for either eSCAPE or RESHAPE.

VIII. CONCLUSION
In this article, we introduce a new synthetic APE dataset, RESHAPE, derived from parallel corpora. We summarize our research findings as follows:
1) We propose the back-APE method, which adapts the back-translation technique to APE, to produce synthetic data that capture inherent characteristics of genuine APE data; this method is a refined version of our previous study [11].
2) We investigate several decoding methods for back-APE and conclude that sampling methods, especially top-k sampling, are superior to MAP decoding methods.
3) Through our quantitative and qualitative evaluations, we observed that, compared to the existing method, eSCAPE, our method produces more synthetic samples with characteristics similar to genuine APE data, including the edit-distance properties, which in turn contribute to improving the APE performance.
4) Finally, we believe that our findings demonstrate the importance of reflecting the characteristics of genuine data in synthetic data.
Considering that this is the first attempt at adapting back-translation to APE, we believe that our findings suggest further research directions. First, because this work simply employs one existing APE architecture without any modification for back-APE, extensions such as modifying the training objective function or the architecture itself would be meaningful future studies. Second, we found that although back-APE helps produce triplets similar to gold APE data, the edit-distance distributions of RESHAPE and the WMT data (Fig. 7) still differ. Therefore, a useful study would be to explore ways to control the quantity of errors to be injected so as to follow the error distribution of genuine data.