Langevin Cooling for Unsupervised Domain Translation

Domain translation is the task of finding correspondence between two domains. Several deep neural network (DNN) models, e.g., CycleGAN and cross-lingual language models, have shown remarkable successes on this task under the unsupervised setting—the mappings between the domains are learned from two independent sets of training data in both domains (without paired samples). However, those methods typically do not perform well on a significant proportion of test samples. In this article, we hypothesize that many such unsuccessful samples lie at the fringe—relatively low-density areas—of the data distribution, where the DNN was not trained very well, and propose to perform Langevin dynamics to bring such fringe samples toward high-density areas. We demonstrate qualitatively and quantitatively that our strategy, called Langevin cooling (L-Cool), enhances state-of-the-art methods in image translation and language translation tasks.

In some domain translation (DT) applications, labeled samples, i.e., paired samples in the two domains, can be collected cheaply. For example, in super-resolution, a paired low-resolution image can be created by artificially blurring and downsampling a high-resolution image. However, in many other applications, including image translation and language translation, collecting paired samples requires significant human effort, and thus, only a limited amount of paired data is available.
Unsupervised DT methods eliminate the necessity of paired data for supervision and only require independent sets of training samples in both domains. In computer vision, CycleGAN, an extension of generative adversarial networks (GANs) [31], showed its capability of unsupervised DT with impressive results in image translation tasks [32]-[34]. It learns the mappings between the two domains by matching the source training distribution transferred to the target domain with the target training distribution under the cycle-consistency constraint. Similar ideas were applied to natural language processing (NLP): dual learning [18], [35] and cross-lingual language models (XLMs) [19], which are trained on unpaired monolingual data and achieve high performance in language translation.
Despite their remarkable successes, existing unsupervised DT methods are known to fail on a significant proportion of test samples [32], [36]-[38]. In this article, we hypothesize that some of the unsuccessful samples are at the fringe of the data distribution, i.e., they lie slightly off the data manifold, and therefore, the DNN was not trained very well for translating those samples. This hypothesis leads to our proposal to bring fringe samples toward the high-density data manifold, where the DNN is well trained, by cooling down the test distribution. Specifically, our proposed method, called L-Cool, performs the Metropolis-adjusted Langevin algorithm (MALA) to lower the temperature of test samples before applying the base DT method. The gradient of the log probability, which MALA requires, is estimated by a denoising autoencoder (DAE) [39].
L-Cool is generic and can be used for enhancing any DT method. We demonstrate its effectiveness in image translation and language translation tasks, where L-Cool exhibits consistent performance gains. Fig. 1 and Table I show a few intuitive exemplar results. The main contributions of this article include the following.¹

1) Proposal of a novel Langevin cooling (L-Cool) method that enhances DT performance by cooling down test samples toward the high-density data manifold.
2) Qualitative evaluation of L-Cool in comparison with state-of-the-art methods as well as image processing techniques (as baseline projection methods) on image translation tasks, including horse2zebra, zebra2horse, apple2orange, and orange2apple, which visualizes the effectiveness of L-Cool.

¹This article is an extended version of our preliminary conference publication [40] with additional contributions and extended analyses. The conference publication [40] contains the first three contributions listed in the following, and the last four contributions have been newly added in this journal version. However, the first three contributions have also been refined with additional baselines and analyses. Specifically, all experimental results, which were obtained with L-Cool-Cycle in the conference version, have been replaced with the results obtained with L-Cool (with DAE), since drawbacks of L-Cool-Cycle have been found.

TABLE I EXAMPLES OF FRENCH-TO-ENGLISH TRANSLATION BY XLM [19] AND L-COOL. L-COOL MAKES THE TRANSLATION CLOSER TO THE GROUND TRUTH
3) Quantitative evaluation on image translation tasks (horse2zebra and sat2map) based on classification accuracy by pretrained classifiers as well as paired data.
Experiments with fringe detection support our hypotheses and show significant gains by L-Cool when applied to fringe samples.
4) Comparison between the gradient estimator by DAE and that by the cycle structure of CycleGAN (L-Cool-Cycle) on a synthetic toy dataset. Our investigation reveals drawbacks of L-Cool-Cycle.
5) Evaluation in language translation (English ↔ French and English ↔ German) on the NewsCrawl dataset, which revealed quantitative performance gain by L-Cool in terms of the BLEU score [41].
6) Identification of the feature space (L-Cool-Feature) as a more reliable place than the input space (L-Cool-Input) for applying Langevin dynamics in language translation models.
7) Analysis of hyperparameter dependence on image as well as language translation tasks.
A. Related Work

1) Unsupervised Image Translation: CycleGAN [32] and its concurrent works [33], [34] have eliminated the necessity of supervision for image translation [23], [42] by using a loss inspired by GAN [31] along with the cycle-consistency loss. The consistency requirement forces the translation to retain the contents of source images so that they can be translated back. Liu et al. [43] proposed a variant that shares the latent space between the two domains, which works as additional regularization for alleviating the highly ill-posed nature of unsupervised DT.
Huang et al. [44] and Lee et al. [45] tackled the general issue of unimodality in sample generation by splitting the latent space into two: a content space and a style space. The content space is shared between the two domains, but the style space is unique to each domain. The style space is modeled with a Gaussian prior, which helps in generating diverse images at test time. Mejjati et al. [37] and Kim et al. [46] showed that attention maps can boost the performance by making the model focus on relevant regions in the image. Alternatives to cycle consistency include geometry-consistent GANs (GcGAN) [47] and contrastive unpaired translation (CUT) [48]. GcGAN tries to maintain the distance between the inputs in the output space, while CUT employs patch-based contrastive learning for improving the DT performance. Despite the many new ideas proposed for improving image translation performance, CycleGAN [32] is still considered to be the state of the art in many transformation tasks.
2) Unsupervised Language Translation: Language translation has been tackled with DNNs with encoder-decoder architectures, where text in the source language is fed to the encoder and the decoder generates its translation in the target language [49]. Unsupervised language translation methods have enabled learning from a large pool of monolingual data [18], [50], which can be cheaply collected through the Internet without any human labeling effort.
Transformers [35] with attention mechanisms have shown excellent performance in unsupervised language translation, as well as many other NLP tasks, including language modeling, understanding, and sentence classification. It was shown that generative pretraining strategies such as masked language modeling (MLM) (which masks a portion of the words in the input sentence and forces the model to predict the masked words) are effective in making transformers better at language understanding [51]-[54]. Back translation has also enhanced performance by serving as a source of data augmentation while maintaining the cycle-consistency constraint [20], [55], [56]. XLM [19] has shown state-of-the-art results in unsupervised language translation, outperforming the generative pretrained transformer (GPT) [51], bidirectional encoder representations from transformers (BERT) [53], and other previous methods [55], [57].
3) Temperature Control: Changing distributions by controlling the temperature has been used in Bayesian learning and sample generation. Heek and Kalchbrenner [58] and Wenzel et al. [59] reported that sampling weights from a cooled posterior distribution improves the predictive performance in Bayesian learning. Higher-quality images were generated from a reduced-temperature model in [60]-[62]. Dahl et al. [61] used a tempered softmax for super-resolution. In contrast to previous works that cool down estimated distributions (Bayes posterior or predictive distributions), our approach cools down the input test distribution to make fringe samples more typical for unsupervised DT.

II. THEORETICAL BACKGROUND
Here, we introduce two basic tools, on which our proposed method relies.

A. Metropolis-Adjusted Langevin Algorithm
The MALA is an efficient Markov chain Monte Carlo (MCMC) sampling method that uses the gradient of the energy (the negative log probability E(x) = − log p(x)). Sampling is performed sequentially by

x_{t+1} = x_t − α ∇_x E(x_t) + ν,  (1)

where α is the step size and ν is a random perturbation subject to N_L(0, δ² I_L). Here, N_L(μ, Σ) denotes the L-dimensional Gaussian distribution with mean μ and covariance Σ, and I_L denotes the L × L identity matrix. By appropriately controlling the step size α and the noise variance δ², the sequence is known to converge to the distribution p(x).³ Nguyen et al. [63] successfully generated high-resolution, realistic, and diverse artificial images by MALA.

³For convergence, a rejection step after applying (1) is required. However, it was observed that a variant, called MALA-approx [63], without the rejection step gives reasonable sequences for moderate step sizes. We use MALA-approx in our proposed method.
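As a minimal sketch (our illustration, not the authors' code), MALA-approx is plain gradient ascent on log p(x) with Gaussian perturbations; here the score function ∇_x log p(x) = −∇_x E(x) is assumed to be given in closed form:

```python
import numpy as np

def mala_approx(x, score, alpha, delta2, n_steps, rng):
    """MALA-approx (no rejection step): repeat
    x <- x + alpha * grad log p(x) + noise, noise ~ N(0, delta2 * I)."""
    for _ in range(n_steps):
        x = x + alpha * score(x) + rng.normal(0.0, np.sqrt(delta2), size=x.shape)
    return x

# Example: sampling from a standard Gaussian, whose score is score(x) = -x.
# With delta2 = 2 * alpha, the chain approximates Langevin dynamics at
# temperature 1, so the samples approach N(0, 1).
rng = np.random.default_rng(0)
samples = mala_approx(np.full(10000, 5.0), lambda x: -x,
                      alpha=0.1, delta2=0.2, n_steps=200, rng=rng)
```

Even though every chain starts far from the mode (at x = 5), the samples end up distributed approximately as N(0, 1).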

B. Denoising Autoencoders
A DAE [64], [65] is trained so that data samples contaminated with artificial noise are cleaned. Specifically, an estimator r minimizing the following reconstruction error is sought:

min_r E_p[ ‖r(x + ε) − x‖² ],  (2)

where E_p[·] denotes the expectation over the distribution p, x ∈ R^L ∼ p(x) is a data sample, and ε ∼ p(ε) = N_L(0, σ²I) is artificial Gaussian noise. Alain and Bengio [39] discussed the relation between DAEs and contractive autoencoders (CAEs) and proved the following useful property of DAEs.
Proposition 1 [39]: Under the assumption that r(x) = x + o(1),⁴ the minimizer of the DAE objective (2) satisfies

r(x) − x = σ² ∇_x log p(x) + o(σ²)  (3)

as σ² → 0.

Proposition 1 states that a DAE trained with a small σ² can be used to estimate the gradient of the log probability, i.e.,

∇_x log p(x) ≈ (r(x) − x) / σ².  (4)
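Proposition 1 can be illustrated numerically without training a network: for x ∼ N(0, s²) under N(0, σ²) noise, the Bayes-optimal denoiser is linear and known in closed form, and (r(x) − x)/σ² recovers the score −x/s² up to O(σ²). The sketch below uses this closed-form denoiser as a stand-in for a trained DAE:

```python
import numpy as np

s2, sigma2 = 4.0, 0.01  # data variance and (small) DAE noise variance

def optimal_dae(x_noisy):
    # Bayes-optimal denoiser for x ~ N(0, s2) observed with N(0, sigma2) noise
    return (s2 / (s2 + sigma2)) * x_noisy

def score_estimate(x):
    # Proposition 1: (r(x) - x) / sigma^2 approximates grad_x log p(x)
    return (optimal_dae(x) - x) / sigma2

x = np.array([1.0, -2.0, 3.0])
true_score = -x / s2  # gradient of log N(0, s2)
est = score_estimate(x)
```

Here `est` agrees with `true_score` to within roughly σ²/s² relative error, matching the o(σ²) remainder in (3).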

III. PROPOSED METHOD

A. Langevin Dynamics With Lower Temperature
As discussed in Section I, we hypothesize that DT methods can work poorly on test samples lying at the fringe of the data distribution. We therefore propose to drive such fringe samples toward the high-density area, where the DNN is better trained. Specifically, we apply MALA (1) to each test sample with the step size α and the variance δ² of the random perturbation satisfying the following inequality:

2α > δ².  (5)

If 2α = δ², MALA can be seen as a discrete approximation to the (continuous) Langevin dynamics

dx = −∇_x E(x) dt + √2 dw,  (6)

where w is the standard Brownian motion. The dynamics (6) is known to converge to p(x) as the equilibrium distribution [66], [67]. By setting the step size and the perturbation variance so that inequality (5) holds, we can approximately draw samples from the distribution with lower temperature, as shown in the following.
By seeing the negative log probability as the energy E(x) = − log p(x), we can see p(x) as the Boltzmann distribution with the inverse temperature equal to β = 1, where

p_β(x) ∝ exp(−β E(x)).  (7)

The following theorem holds.

Theorem 1: In the limit where α, δ² → 0 with their ratio α/δ² kept constant, the sequence of MALA (1) converges to p_β(x) for

β = 2α/δ².  (8)

Proof: As α and δ² go to 0 (with each MALA step regarded as a time increment of δ²/2), MALA (1) converges to the following dynamics:

dx = −(2α/δ²) ∇_x E(x) dt + √2 dw,  (9)

which is equivalent to

dx = −∇_x (β E(x)) dt + √2 dw.

Equation (9) can be rewritten with the Boltzmann distribution (7) with the inverse temperature specified by (8) as

dx = ∇_x log p_β(x) dt + √2 dw.

Comparing this equation with (6), we find that this dynamics converges to the equilibrium distribution p_β(x). □

Theorem 1 states that the ratio between α and δ² effectively controls the temperature. Specifically, we can see MALA (1) as a discrete approximation to the Langevin dynamics converging to the distribution p_β(x) ∝ p(x)^β given by (7), of which the probability mass is more concentrated than p(x) if inequality (5) holds.
Our proposed L-Cool strategy uses the DAE for estimating the gradient and applies MALA with β > 1 to cool down test samples before DT is performed. As shown in Fig. 2, this yields a small move of the test sample toward high-density areas in the source domain. Since the DNN for DT is expected to be well trained on the high-density areas, such a small move can result in a significant improvement of the translated image in the target domain and thus enhances the DT performance. We show the qualitative and quantitative performance gain by L-Cool in Sections IV-VI.
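The cooling step can be sketched as follows (our illustration, not the authors' implementation): choosing δ² = 2α/β with β > 1 runs MALA at inverse temperature β, which, for a standard Gaussian, shrinks the equilibrium variance to roughly 1/β, since N(0, 1)^β ∝ N(0, 1/β):

```python
import numpy as np

def l_cool(x, score, alpha, beta, n_steps, rng):
    """Cool samples by MALA-approx at inverse temperature beta = 2*alpha/delta^2
    (Theorem 1); inequality (5) holds whenever beta > 1."""
    delta2 = 2.0 * alpha / beta
    for _ in range(n_steps):
        x = x + alpha * score(x) + rng.normal(0.0, np.sqrt(delta2), size=x.shape)
    return x

# p(x) = N(0, 1), so the cooled distribution p_beta(x) is N(0, 1/beta).
rng = np.random.default_rng(1)
cooled = l_cool(rng.normal(size=20000), lambda x: -x,
                alpha=0.05, beta=4.0, n_steps=300, rng=rng)
```

Starting from unit-variance samples, the chain concentrates them near the mode: the empirical variance of `cooled` approaches 1/β = 0.25.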

B. Extensions
We can choose two options for L-Cool, depending on the application and computational resources.
1) Fringe Detection: We can apply fringe detection in the same way as adversary detection [68]. Namely, assuming that the gradient of log p(x) is large at the fringe of the data distribution, we identify a sample x as fringe if

‖∇_x log p(x)‖ > ξ  (10)

for a threshold ξ > 0 and apply MALA only to those samples. This prevents nonfringe samples already lying in high-density areas from being perturbed by the Langevin dynamics.
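A minimal sketch of the detector (a hypothetical helper of our own; here the threshold ξ is set as a percentile of the score norms over the test set, matching how %fringes is controlled in the experiments):

```python
import numpy as np

def fringe_mask(samples, score, percentile=80.0):
    """Flag samples whose score norm exceeds the given percentile of the
    test set as fringe, cf. criterion (10)."""
    norms = np.linalg.norm(score(samples), axis=1)
    xi = np.percentile(norms, percentile)
    return norms > xi, norms

# For N(0, I) data the score is -x, so fringe samples are simply the
# points farthest from the origin.
rng = np.random.default_rng(2)
x = rng.normal(size=(1000, 2))
mask, norms = fringe_mask(x, lambda s: -s, percentile=80.0)
```

By construction, every flagged sample has a larger score norm than every unflagged one, and about 20% of the test set is flagged.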

2) Gradient Estimation by Cycle:
Another option is to omit training the DAE and to estimate the gradient by a cycle structure that the DNN for DT already possesses. This idea follows the argument in [63], where MALA is successfully used to generate high-resolution, realistic, and diverse artificial images. The authors argued that the DAE for estimating the gradient can be replaced with any cycle (autoencoding) structure in their application. In our image translation experiment, we use CycleGAN as the base method, and therefore, we can estimate the gradient by

∇_x log p(x) ≈ γ (F(G(x)) − x)  (11)

for some γ > 0, where G corresponds to the mapping of the CycleGAN from the source domain to the target domain and F to its inversion. We call this option L-Cool-Cycle, which eliminates the necessity of training a DAE. However, one should use this option with care; we found that L-Cool-Cycle tends to exacerbate artifacts created by CycleGAN, which will be discussed in detail in Section V-E.
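A sketch of the cycle-based estimator (11); γ is a tuning constant playing the role of the inverse noise variance in the DAE estimator:

```python
import numpy as np

def cycle_score_estimate(x, G, F, gamma=1.0):
    """L-Cool-Cycle: estimate grad_x log p(x) by gamma * (F(G(x)) - x),
    treating the cycle F(G(.)) as an autoencoder in place of the DAE."""
    return gamma * (F(G(x)) - x)

# If F inverts G exactly (a perfect cycle), the gradient estimate is zero,
# so the Langevin dynamics leaves such samples untouched.
x = np.array([0.5, -1.5])
est = cycle_score_estimate(x, G=lambda v: 2.0 * v, F=lambda v: v / 2.0)
```

In practice the cycle is only approximately consistent, and the residual F(G(x)) − x need not point toward the data manifold, which is the drawback examined in Section V-E.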

IV. DEMONSTRATION WITH TOY DATA
We first show the basic behavior of L-Cool on toy data. We generated 1000 training samples each in the source and the target domains from noisy parametric curves, where t, t′ ∼ Uniform(0, 1), ε ∼ Uniform(0, 0.2), and ε′ ∼ Uniform(0, 0.1). Then, a CycleGAN [32] with two-layer feedforward networks, G(x) → x′ and F(x′) → x, was trained to learn the forward and the inverse mappings between the two domains. A DAE having the same architecture as G, with a two-layer feedforward network, was also trained on the samples in the source domain.
Blue dots in Fig. 3 show training samples, from which we can see the high-density areas both in the source (right) and the target (left) domains. Now, we feed three off-manifold test samples x_test^1, x_test^2, and x_test^3, shown as red, yellow, and magenta squares in the left graph, to the forward (source to target) translator G. As expected, the translated samples G(x_test^1), G(x_test^2), and G(x_test^3), shown as red, yellow, and magenta squares in the right graph, are not in the high-density area (not typical target samples), because G was not trained for those off-manifold samples. As shown as trails of circles, L-Cool drives the off-manifold samples into the data manifold in the source domain, which also drives the translated samples into the data manifold in the target domain. In this way, L-Cool helps CycleGAN generate typical samples in the target domain by making source samples more typical.

V. IMAGE TRANSLATION EXPERIMENTS
Next, we demonstrate the performance of L-Cool in several image translation tasks. We use CycleGAN as the base translation method, and L-Cool is performed in the source image space before translation (Fig. 4).

A. Translation Tasks and Model Architectures
We used pretrained CycleGAN models, along with the training and the test datasets, publicly available in the official github repository of CycleGAN [32]. Experiments were conducted on the horse2zebra, apple2orange, and sat2map tasks. For the first two tasks, we also conducted experiments on the inverse tasks, i.e., zebra2horse and orange2apple. The validation images were used for hyperparameter tuning for L-Cool (see Section V-D).
The CycleGAN model consists of a forward mapping G and a reverse mapping F. Both G and F have the same architecture, including two downsampling layers followed by nine resnet generator blocks and two upsampling layers. Each resnet generator block consists of convolution, instance normalization [69], and relu layers, with residual connections added between the blocks. While training with a batch size of 1 using instance normalization is equivalent to using batch normalization [70], instance normalization has been shown to help improve the results for image stylization [69] as well as for image translation [32], [47], [48]. For reproducing the results of CycleGAN, we utilize the pretrained models provided by the authors in their official github repository. The network architecture and training strategies of CycleGAN are also shared by CUT and GcGAN. For CUT and GcGAN, we use the code provided in their respective official github repositories for training the models.
For the DAE, we adapted a tiramisu model [71] consisting of 67 layers in total. The PyTorch [72] code for tiramisu was obtained from a publicly available github repository. The tiramisu consists of five downsampling layers followed by a bottleneck layer and five upsampling layers. Each downsampling as well as upsampling layer consists of dense blocks with a growth rate of 16. Each dense block consists of batch normalization [70], relu, and convolution layers with dense connections [73]. We trained the DAE on the training images in the source domain for 200 epochs by the Adam optimizer with the learning rate set to 0.0002.

B. Qualitative Evaluation
Fig. 5 shows some example results of the horse2zebra, zebra2horse, apple2orange, and orange2apple tasks. For each example, we compare L-Cool with (the plain) CycleGAN [32], CUT [48], and GcGAN [47], which are state of the art in these tasks. As other baselines, we also evaluated two edge-preserving smoothing techniques, median filter [74] and total variation denoising [75], which are applied before translation by CycleGAN. These baselines are supposed to move test samples toward the high-density areas to some extent. Below each input image (in the first column), we report, as a fringeness measure, the percentile ρ of the norm of the score function [see (10)] among the whole test set. Higher values of ρ imply more fringeness.
We see in Fig. 5 that, while, in some cases, median filter and total variation denoising show improvement in target domain attributes (e.g., increased zebra stripes for the task of horse2zebra), in many other cases, they give no improvement (e.g., for the task of apple2orange) or worsen the output image quality (e.g., for the task of zebra2horse). Although smoothing methods can be broadly considered as a projection toward the high-density region, the destination can still be outside the training data manifold. L-Cool, on the other hand, is a multistep projection toward the data distribution, whose dynamics can be controlled by tuning the hyperparameters: the number of steps N, the step size α, and the temperature β⁻¹. Thus, in our experiments, we find that the projection by L-Cool results in better translation performance than the edge-preserving smoothing methods.
We also find in Fig. 5 that, for the task of horse2zebra, L-Cool provides an increase in the zebra stripes while reducing artifacts, whereas CycleGAN, CUT, and GcGAN produce fewer or no zebra stripes or suffer from severe artifacts. This pattern is also observed for the task of zebra2horse, where the results of L-Cool show increased brown color of the horse compared to all other methods. L-Cool significantly increases the orange color of the output image for the task of apple2orange. Similarly, the apple images in the results for the task of orange2apple show reduced artifacts and significantly better target domain attributes by L-Cool than any other method.
In general, our qualitative experiment shows that L-Cool performs better than or comparably to the baseline methods. However, we also observe its side effect: the translated image can be oversaturated in some cases. This is because of imperfect training of the score function estimator; Langevin dynamics with a normally trained DAE does not converge to the training distribution when the number of steps is large [76].
A recent study has overcome this issue by convergent learning applied to energy-based models (EBMs), with which Langevin dynamics converges to the target distribution and generated samples are not oversaturated [77]. Unfortunately, the current techniques for convergent learning of EBMs are applicable only to small networks. We expect that scaling up convergent learning will remedy this side effect of L-Cool in the near future.

C. Quantitative Evaluation
In order to confirm that L-Cool generally improves the image translation performance, we conducted two experiments that quantitatively evaluate the performance.
1) Likeness Evaluation by Pretrained Classifiers: Focusing on horse2zebra, we evaluated the likeness of the translated images to zebra images by using state-of-the-art classifiers, including VGG16 [78], InceptionV3 [79], Resnet50 [80], Resnet101 [81], and Densenet169 [73], pretrained on the ImageNet dataset [82]. Specifically, we evaluated and compared the probability outputs (i.e., after softmax) of the classifiers for the translated images by plain CycleGAN and those by L-Cool. We applied fringe detection [see (10)] with the threshold ξ adjusted so that specified proportions (20%, 40%, 60%, 80%, and 100%) of the test samples are identified as fringe. Note that 100% fringe samples correspond to the whole test set, while 20% fringe samples correspond to the 20% of samples that are farthest from the data manifold as identified by the fringe detector (10). Our strategy of L-Cool with the fringe detector is to apply L-Cool only to the fringe samples.
Let us first check whether one of our hypotheses (unsuccessful translation tends to happen for fringe samples) holds. Fig. 6 shows the relation between the fringeness and the unsuccessfulness of image translation in the horse2zebra task. Here, the unsuccessfulness is measured by the proportion of the samples for which the probability output p(y = zebra|x) of the classifier for the translated image is smaller than 0.1. We consistently observe a clear correlation between the fringeness and the unsuccessfulness of image translation by the plain CycleGAN, which supports our first hypothesis.
Next, we show that the translation performance can be improved by applying L-Cool to fringe samples (our second hypothesis). Fig. 7 shows the scatter plots of likeness to zebra images, i.e., the probability p(y = zebra|x) evaluated by the pretrained classifiers. The five panels plot the 20%, 40%, 60%, 80%, and 100% fringe samples. In each plot, the horizontal axis corresponds to the output probabilities of the transferred images by CycleGAN, while the vertical axis corresponds to the output probabilities of the transferred images by L-Cool. The dashed line indicates equal probability, i.e., points above the dashed line imply improvement by L-Cool.
We observe that all classifiers tend to give a higher probability to the images translated after L-Cool is applied. We emphasize that L-Cool uses no information on the target domain: the DAE is trained purely on samples in the source domain, and MALA drives samples toward high-density areas in the source domain, independently of the translation task. The hyperparameters for the Langevin dynamics were set to α = 0.005, β⁻¹ = 0.001, and N = 40, which were found optimal on the validation set (see Section V-D).
Table II shows the average of the output probabilities over the fringe samples and the five classifiers for plain CycleGAN (second column) and L-Cool (third column). Here, again, we use the fringe detector (10) to identify a proportion, indicated by %fringes, of the test samples as fringe. We observe in Table II that, for smaller proportions of fringe samples (i.e., the most outlying samples), the performance of the plain CycleGAN is worse, and the performance gain, i.e., the difference between L-Cool and CycleGAN, is larger. These observations empirically support our hypothesis that CycleGAN does not perform well on fringe samples, and cooling down those samples can improve the translation performance.

2) Evaluation on Paired Data: As mentioned in Section V-A, the sat2map dataset consists of pairs of satellite images and the corresponding map images and therefore allows us to directly evaluate the image translation performance. We applied the pretrained CycleGAN to the test satellite images with and without L-Cool and compared the transferred map images with the corresponding ground-truth map images. Following the evaluation procedure in [43], we counted pixels as correct if the color mismatch (i.e., the Euclidean distance between the transferred map and the ground-truth map in the RGB color space) is below 16.
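The pixelwise accuracy protocol for paired data can be sketched as follows (our re-implementation of the description above, not the original evaluation code):

```python
import numpy as np

def pixel_accuracy(pred_rgb, gt_rgb, threshold=16.0):
    """Fraction of pixels whose Euclidean RGB distance to the ground
    truth is below the threshold (16, following [43])."""
    diff = pred_rgb.astype(np.float64) - gt_rgb.astype(np.float64)
    dist = np.linalg.norm(diff, axis=-1)
    return float((dist < threshold).mean())

# Tiny illustration: a uniform offset of 5 per channel gives a distance of
# sqrt(3 * 25) ~ 8.7 (< 16, correct); an offset of 20 gives ~34.6 (incorrect).
gt = np.zeros((2, 2, 3), dtype=np.uint8)
near = gt + 5
far = gt + 20
```

The cast to float before subtraction avoids uint8 wrap-around when the prediction is darker than the ground truth.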
Table III shows the average pixelwise accuracy, where we observed a similar tendency to the likeness evaluation in Section V-C1; for smaller proportions of fringe samples, the translation performance of the plain CycleGAN is worse, and the performance gain by L-Cool is larger. This again supports our hypothesis that driving the fringe samples toward the data manifold is beneficial for improving the performance of the base DT method. Fig. 8 shows an exemplar case where L-Cool improves the translation performance.

D. Hyperparameter Setting
L-Cool has several hyperparameters. For DAE training, we set the training noise to σ = 0.3 for all tasks, which approximately follows the recommendation (10% of the mean pixel values) in [63]. We visually inspected the performance dependence on the remaining hyperparameters, i.e., the temperature β⁻¹, the step size α, and the number of steps N. Roughly speaking, the product of α and N determines how far the resulting image can move from the original point, and similar results are obtained if α · N takes similar values, as long as the step size α is sufficiently small. Fig. 9 shows exemplar translated images in the orange2apple task, where the dependence on the temperature β⁻¹ and the step size α is shown for the number of steps fixed to N = 100. We observed that, as the step size α increases, the translated image gets more attributes of the target domain (increased red color on the apple), and artifacts are reduced. However, if α is too large, the image gets blurred. We also observed that too high a temperature β⁻¹ gives a noisy result. The visually best result was obtained when β⁻¹ = 0.001, α = 0.005, and N = 100 (marked with a green box and plotted on the rightmost in Fig. 9). A similar tendency was observed in other test samples and other tasks.

E. Investigation on the L-Cool-Cycle
L-Cool requires a trained DAE for gradient estimation. However, a variant, introduced in Section III-B2 as an option called L-Cool-Cycle, eliminates the necessity of DAE training and estimates the gradient by using the autoencoding structure of CycleGAN. This option empirically showed good performance in image generation [63] as well as in our preliminary experiments in image translation [40].
Suboptimality of L-Cool-Cycle can already be seen in the toy data experiment. Fig. 10 shows the same demonstration as in Fig. 3 and compares the trails by L-Cool and L-Cool-Cycle. We see that L-Cool (red) drives the off-manifold samples directly toward the closest points in the data manifold, which are expected to be semantically more similar than the farther points. On the other hand, L-Cool-Cycle (green) does not always do so. This implies that the cycle estimator (11) is not a very good gradient estimator. Although L-Cool-Cycle is an option when training a DAE is hard or time-consuming, it should be used with care; the resulting samples should be checked by a human.

Fig. 9. Translated images by L-Cool with different hyperparameter settings. We found that the setting β⁻¹ = 0.001, α = 0.005, and N = 100 (marked with a green bounding box) best removes artifacts and increases the target domain attributes.

Fig. 10. Similar toy data demonstration as in Fig. 3, comparing L-Cool (red) and L-Cool-Cycle (green). L-Cool moves samples toward the closest points, which are expected to be semantically more similar than the farther points. In contrast, L-Cool-Cycle does not move samples directly toward the high-density region in the source domain, implying that the cycle gradient estimator is not a very good substitute for the DAE gradient estimator.

VI. LANGUAGE TRANSLATION EXPERIMENTS
In this section, we demonstrate the performance of our proposed L-Cool in language translation tasks with XLM [19], [55], a state-of-the-art method for unsupervised language translation, as the base method.

A. Translation Tasks and Model Architectures
We conducted experiments on four language translation tasks, EN-FR, FR-EN, EN-DE, and DE-EN, based on the NewsCrawl dataset under the default setting defined in the github repository page: for each pair of languages, we used the first 5M sentences for training, 3000 sentences for validation, and 3000 sentences for test.
The main idea of XLM is to share a subword vocabulary between the source and the target languages, created through byte pair encoding (BPE). MLM is performed as pretraining, similar to BERT [53]; 15% of the BPE tokens from the text stream are selected, of which 80% are masked, 10% are replaced by a random token, and 10% are kept unchanged. The encoder is pretrained with the MLM objective, whose weights are then used as initialization for both the encoder and the decoder. This pretraining strategy was shown to give the best results [19].
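The 15%/80%/10%/10% masking scheme can be sketched as follows (a simplified, hypothetical stand-in for the XLM preprocessing; the real implementation operates on batched BPE token ids):

```python
import random

def mlm_mask(tokens, vocab, mask_token="[MASK]", p_select=0.15, seed=0):
    """Select ~15% of tokens as prediction targets; of the selected tokens,
    replace 80% with the mask token, 10% with a random vocabulary token,
    and keep 10% unchanged."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < p_select:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_token
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)
            # else: keep the token unchanged (but it remains a target)
    return inputs, labels

inputs, labels = mlm_mask(["the"] * 10000, vocab=["the", "cat", "sat"], seed=0)
```

Keeping 10% of the targets unchanged forces the model to produce useful representations even for unmasked positions, since it cannot tell which visible tokens are targets.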
The transformer consists of six encoders and six decoders. The architectures of the encoders and decoders are similar, and each consists of a multihead attention layer followed by layer normalization [83], two fully connected layers with gelu activations [84], and another layer normalization. While the first fully connected layer projects the input with a dimensionality of 1024 to a latent dimension of 4096, the second fully connected layer projects it back to 1024. Each encoder and decoder layer also contains a residual connection. For the XLM implementation, we use the code publicly available at the github page. We train the model by using the Adam optimizer along with linear warm-up and linear learning rates. We warm start with the model weights obtained after the MLM stage and further train the weights on the training sentences.

B. L-Cool Variants
We tested two variants of L-Cool (see Fig. 11).

1) L-Cool-Input: MALA is performed in the input word embedding space (the position embeddings are unaffected), similar to the image translation experiments in Sections V-B and V-C.
2) L-Cool-Feature: MALA is performed in the intermediate feature (code) space.

In both variants, a DAE with the same architecture as the encoder of the transformer was trained in the corresponding space on the training sentences of NewsCrawl. Hyperparameters were tuned on the validation sentences (see Section VI-D).
L-Cool-Feature was motivated by our preliminary observation that L-Cool-Input rarely improves the language-translation performance, as will be shown in the subsequent sections. We hypothesized that this is because of the discrete nature of the input space: the input is the word embedding, which depends only on discrete occurrences of words, and therefore, a single step of MALA in any direction can bring the sample to a point where the base transformer is less trained than at the original point. This issue might be remedied by applying the Langevin dynamics in the feature space, where the mapped distribution is already smoothed. Note that L-Cool-Feature would not be suitable when paired data are not available for hyperparameter tuning (such as in our image translation experiments except Section V-C2). This is because driving samples in the feature space can drastically change the corresponding input, and thus, the translated result can become unrelated to the original input unless the hyperparameters are tuned with paired data.

C. Quantitative Evaluation
Table IV shows the BLEU scores [41] of plain XLM, L-Cool-Input, and L-Cool-Feature. We find that L-Cool-Feature shows consistent performance gains and outperforms XLM on all four language translation tasks. On the other hand, L-Cool-Input does not improve the performance over XLM, except for the FR-EN task.
Focusing on L-Cool-Feature in the EN-FR task, we evaluated the translation performance with the fringe detector. Similar to the image translation experiments in Section V-C, we applied the fringe detector (10) with the threshold ξ controlling the proportion of fringe samples (%fringes) in the test set. Table V shows the BLEU scores in the EN-FR translation task by XLM and by L-Cool-Feature on the fringe samples with different %fringes. We observe a similar tendency to the image translation experiments: for smaller %fringes (hence for sample sets with higher fringeness), the translation performance of plain XLM is worse and the performance gain, i.e., the difference between L-Cool-Feature and XLM, is larger, which empirically supports our hypothesis also in this language application.
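The thresholding behavior of the fringe detector can be sketched as follows. This is an illustration in the spirit of detector (10), under the assumption (stated in the figure captions) that fringeness is measured by the norm of the estimated score function and that ξ is set as a percentile of those norms over the test set; the exact form of (10) is in the main text.

```python
import numpy as np

def fringe_mask(scores, percent_fringes):
    """Flag a sample as 'fringe' when the norm of its estimated score
    exceeds a threshold xi chosen so that percent_fringes percent of
    the set is flagged. A large score norm indicates a low-density
    region, i.e., a point far from the data manifold."""
    norms = np.linalg.norm(scores, axis=1)
    xi = np.percentile(norms, 100 - percent_fringes)
    return norms >= xi
```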
The destination reached by L-Cool must be close enough to the original point in order not to change the semantics of the original sentence. This is achieved by tuning the hyperparameters on the validation set. In addition, a user can stop the sequence generation (e.g., with a rejection step) if the L_p distance between the input and a sample from a MALA step exceeds a predetermined threshold.

VIII. CONCLUSION

Developing unsupervised, as well as self-supervised, learning methods is one of the recent hot topics in the machine learning community for computer vision [85]-[89] and NLP [52], [53], [90]-[92]. It is challenging but highly attractive since eliminating the necessity of labeled data may enable us to keep improving learning machines from data streams automatically without any human intervention. The success of deep learning in unsupervised DT was a milestone in this exciting research area.
Our work contributes to this area with a simple idea: L-Cool performs MALA on test samples in the source domain and drives them toward the high-density manifold, where the base DNN is well trained. Our qualitative and quantitative evaluations showed improvements by L-Cool in image and language translation tasks, and the evaluations of L-Cool with fringe detection, i.e., applying L-Cool only to the detected fringe samples, supported our hypothesis that a proportion of test samples fail to be translated because they lie at the fringe of the data distribution and therefore can be improved by L-Cool.
L-Cool is generic and can be used to improve any DT method. Future work is therefore to apply L-Cool to other base DT methods and other DT tasks. We will also try to improve the gradient estimator for L-Cool by using other types of generative models such as normalizing flows [93]. Explanation methods, such as layerwise relevance propagation (e.g., [94]-[96]), might help identify the reasons for successes and failures [97] of DT, suggesting possible ways to improve the performance.

Fig. 1. Example of the orange2apple task. The baseline CycleGAN transfers an orange image to an apple image (left column). Our proposed L-Cool makes a slight change in the original orange image, which significantly improves the quality of the transferred apple image (right column): the green artifacts surrounding the apple were removed almost completely and the texture and the color of the apple were improved, although slight blurriness along the edges of the apple was introduced.

Fig. 2. L-Cool drives the test sample in the source (horse) domain slightly toward the center of the data manifold, which has a significant impact on the translated sample in the target (zebra) domain.

Fig. 4. Schematics of (the plain) CycleGAN (top) and L-Cool (bottom). In CycleGAN, an encoder, y = G(x), translates a source sample into a target sample, while a decoder, x = F(y), translates the target sample back to the source sample. In L-Cool, a source sample is cooled down by MALA before being translated by CycleGAN.

Fig. 5. Example results of image translation tasks. For each example (row), the leftmost figure shows the input test image with its fringeness ρ, i.e., the percentile of the norm of the score function as shown in (10) (higher ρ indicates that the image is further from the manifold). The six right columns show translated images in the target domain. The median filter, TVD, and L-Cool are applied to the input image before translation by CycleGAN. In each task and each image, we find that the translations provided by L-Cool better represent the target domain attributes.

Fig. 6. Fringeness versus unsuccessfulness of image translation by plain CycleGAN in the horse2zebra task. Here, %fringes indicates the proportion of the samples identified as fringe by the detector (10) (and therefore a lower %fringes indicates higher fringeness of the evaluated set of samples). The unsuccessfulness is measured by the proportion of the samples for which the probability output p(y = zebra|x) of the classifier for the translated image is smaller than 0.1. With all five different pretrained classifiers, we consistently observe a clear correlation between the fringeness and the unsuccessfulness of image translation by CycleGAN.

Fig. 7. Likeness to zebra images evaluated by the probability output p(y = zebra|x) of pretrained classifiers for the images translated by CycleGAN (horizontal axis) and by L-Cool (vertical axis). Each panel plots the fringe samples detected by (10) with the threshold ξ controlling the proportion of the test samples identified as fringe. Points above the dashed equal-likeness line imply improvement by L-Cool compared to CycleGAN. We can see that, consistently for all classifiers (shown in different colors), points tend to be above the dashed equal-likeness line, implying improvement by L-Cool. (a) %fringes: 20. (b) %fringes: 40. (c) %fringes: 60. (d) %fringes: 80. (e) %fringes: 100.

Fig. 8. Example of a sat2map image translation result. The green regions are increased in the output of L-Cool (bottom right) compared to that of CycleGAN (bottom left). As a result, the output of L-Cool is closer to the ground-truth map (top right).

Fig. 11. Schematics of XLM (left), L-Cool-Input (middle), and L-Cool-Feature (right). L-Cool-Input performs MALA in the input space, whereas L-Cool-Feature performs MALA in the feature (code) space between the encoder and the decoder.

Fig. 12 shows the performance dependence on the hyperparameters for L-Cool-Input (left) and L-Cool-Feature (right) in the EN-FR translation task, where the best performance was obtained with β⁻¹ = 10⁻⁴, α = 10⁻⁵, and N = 25 for L-Cool-Input and with β⁻¹ = 10⁻³, α = 10⁻², and N = 25 for L-Cool-Feature.

VII. COMPUTATION TIME

L-Cool requires additional computation cost both in training and at test time. Training the DAE can typically be done much faster

Fig. 12. Language translation performance (BLEU score) dependence on hyperparameters in the EN-FR task with L-Cool-Input (left) and L-Cool-Feature (right) on the validation set. The dashed line in each graph indicates the baseline performance of plain XLM. (a) L-Cool-Input. (b) L-Cool-Feature.

1) horse2zebra: Translation from horse images to zebra images and vice versa. The training set consists of 1067 horse images and 1334 zebra images, subsampled from ImageNet. Dividing the test set, we prepared 60 and 70 validation images and 60 and 70 test images for horse and zebra, respectively.
2) apple2orange: Translation from apple images to orange images and vice versa. The training set consists of 995 apple images and 1019 orange images, subsampled from ImageNet. Dividing the test set, we prepared 133 validation images and 133 test images for each of apple and orange.
3) sat2map: Translation from satellite images to map images. The training set consists of 1096 satellite images and 1096 map images, subsampled from Google Maps, and 1098 images of each kind are provided for testing. Dividing the test set, we prepared 250 validation images and 848 test images. Although CycleGAN was pretrained in the unsupervised setting, the dataset is actually paired, i.e., the ground-truth map image for each satellite image is available, which allows quantitative evaluation.

TABLE II. Average likeness to zebra images over the fringe samples and the classifiers (shown in the legend in Fig. 7). The fringe samples are detected by (10) with the threshold ξ controlling %fringes. For each row, we mark in bold the best method and the methods that are not significantly outperformed by the best, according to the Wilcoxon signed rank test for p = 0.05.

TABLE III. Average pixelwise accuracy in the sat2map task. The fringe samples are detected by (10) with the threshold ξ controlling %fringes. For each row, we mark in bold the best method and the methods that are not significantly outperformed by the best, according to the Wilcoxon signed rank test for p = 0.05.

TABLE IV. BLEU scores in language translation tasks on the test set. For each task (column), we mark in bold the best method and the methods that are not significantly outperformed by the best, according to the Wilcoxon signed rank test for p = 0.05.

TABLE V. BLEU scores in the EN-FR translation task with fringe detection. Different proportions of fringe samples are identified by the fringe detector (10) with an adjusted threshold ξ. In each row, we mark in bold the best method and the methods that are not significantly outperformed by the best, according to the Wilcoxon signed rank test for p = 0.05.