Pre-Trained Word Embedding and Language Model Improve Multimodal Machine Translation: A Case Study in Multi30K

Multimodal machine translation (MMT) is an attractive application of neural machine translation (NMT) that is commonly incorporated with image information. However, the MMT models proposed thus far have only comparable or slightly better performance than their text-only counterparts. One potential cause of this infeasibility is a lack of large-scale data. Most previous studies mitigate this limitation by employing large-scale textual parallel corpora, which are more accessible than multimodal parallel corpora, in various ways. However, these corpora are still available on only a limited scale in low-resource language pairs or domains. In this study, we leveraged monolingual (or multimodal monolingual) corpora, which are available at scale in most languages and domains, to improve MMT models. Our approach follows that of previous unimodal works that use monolingual corpora to train the word embedding or language model and incorporate them into NMT systems. While these methods demonstrated the advantage of using pre-trained representations, there is still room for MMT models to improve. To this end, our system employs debiasing procedures for the word embedding and multimodal extension of the language model (visual-language model, VLM) to make better use of the pre-trained knowledge in the MMT task. The results of evaluations conducted on the de facto MMT dataset for the English–German translation indicate that the improvement obtained using well-tailored word embedding and VLM is approximately +1.84 BLEU and +1.63 BLEU, respectively. The evaluation on multiple language pairs reveals their adoptability across the languages. Beyond the success of our system, we also conducted an extensive analysis on VLM manipulation and showed promising areas for developing better MMT models by exploiting VLM; some benefits brought by either modality are missing, and MMT with VLM generates less fluent translations. 
Our code is available at https://github.com/toshohirasawa/mmt-with-monolingual-data.


I. INTRODUCTION
In multimodal machine translation (MMT), a target sentence is translated from a source sentence together with related nonlinguistic information, such as images. Since the development of a multimodal parallel corpus, namely, Multi30K [1], most research in this area has focused on incorporating static images into encoder-decoder neural machine translation (NMT) systems. In an image-guided machine translation task, the multimodal NMT models are expected to disambiguate lexical ambiguity in the source and target languages or correct inaccurate expressions in the source sentence [2]. These models should also resolve language phenomena that exist only in the target languages. The success of MMT has encouraged its application in translating subtitles of movies, utterances in conferences, and descriptions of paintings.
The associate editor coordinating the review of this manuscript and approving it for publication was Arianna Dulizia.
Following the proposal of the encoder-decoder NMT model [3], large-scale parallel corpora have been built to train better NMT systems. For news translation, an English-Czech NMT system acquired 170k data and achieved quality on par with that of human translators [4]. More recently, [5] employed over 200M data to train an English-German and English-Japanese NMT model that won first place in an international competition [6]. However, the Multi30K [1], a well-established corpus for image-guided machine translation tasks, is annotated based on an image-captioning dataset (Flickr30K [7]), and comprises 30k tuples of image, English sentence, and German sentence. Compared to the datasets used for news translation or other text-only translation tasks, the Multi30K dataset is considered a low-resource dataset. Owing to a lack of available large-scale datasets, training high-quality MMT models is challenging, and has resulted in only "modest" improvement being achieved from using image information [8]. To address this problem, previous studies employed large-scale textual parallel corpora. Reference [8] trained a text-only NMT model on a textual parallel corpus (OpenSubtitles [9]) and translated a multimodal monolingual corpus (MS-COCO [10]) to augment the data for training MMT models. More recently, [11] also employed the same strategy to augment the training data. They trained a visual translation language model (VTLM), an extension of the translation language model (TLM) [12], and transferred the weight of the VTLM to initialize MMT models. Despite the success of these approaches, the limitations on the use of large-scale "parallel" corpora circumscribe their availability in low-resource languages.
In this study, we focus on exploiting knowledge obtained from either textual or multimodal monolingual corpora, which is more feasible in low-resource domains. Specifically, we demonstrate the usefulness of the pre-trained word embedding and language model for MMT models. Although these approaches were developed for text-only MT, their application to MMT has room for improvement and is worthy of consideration.
The main contributions of this study are as follows: 1) We propose two approaches to incorporate a monolingual (or multimodal monolingual) corpus into MMT models; one uses a pre-trained word embedding using a debiasing procedure and monolingual-corpus-based subword tokenization; the other uses VLM. 2) We demonstrate that both approaches achieve substantial improvement over their baselines; in particular, the MMT model fused with VLM consistently outperforms its text-only counterparts across a range of target languages.

A. WORD EMBEDDING
Pre-trained word embedding is considered an important part of neural network models in many natural language processing (NLP) tasks. In the context of NMT, pre-trained word embedding has proved useful in low-resource domains [13], in which FastText [14] embedding is used to initialize the encoder and decoder of the NMT model. They also indicate that the improvement achieved using the pretrained word embedding decreases as the training data increases. Reference [15] introduced an MMT model with embedding prediction that provided substantial performance improvement. However, many studies have proven that the vectors in a pre-trained word embedding distribute unevenly in a narrow conical subspace rather than evenly in the entire space. This highly localized geometry of pre-trained word embeddings harms the isotropy of word embedding and consequently their advantage in downstream tasks. In these embedding spaces, some words appear frequently in the nearest neighbors of other words [16], [17]. This is called the hubness problem in the general machine learning domain [18], and it impairs the utility of pre-trained word embedding. To address this problem, several post-process debiasing methods have been proposed with respect to different bias scopes, such as local bias [19] and global bias [20]. Recently, Kaneko and Bollegala [21] proposed a debiasing method that uses an autoencoder. Extending [22], which was proposed to debias pre-trained word embeddings prior to integrating them into MMT models for English-German and English-French translations, we examined the latest debiasing techniques in more language pairs. Moreover, despite the application of subword-level tokenization over NLP tasks, its impact on pre-trained word embeddings and downstream tasks has been less quantified. Specifically, reusing subwords learned during pre-training in downstream tasks is a well-established strategy in recent emergent language models [23]. 
Thus, we hypothesize that reusing subwords also benefits tasks using pre-trained word embeddings. To this end, we conducted experiments in which the training data for pre-training word embeddings were tokenized at either the word or subword level.
Our main findings on the use of pre-trained word embeddings are as follows: 1) Integrating pre-trained word embeddings into MMT models improves the translation performance, and applying a debiasing method further boosts the gains among a wide range of language pairs and MMT models. 2) Models that reuse the subwords of pre-trained word embeddings consistently outperform their counterpart, showing the best translation performance when debiasing is applied.

B. LANGUAGE MODEL
The pre-trained language model (LM) improves the performance of a target model for different NLP tasks, such as text summarization [24], grammatical error correction [25], and machine translation [26]. The bidirectional encoder representations from Transformers (BERT) [23] comprises the encoder of the transformer [27] architecture, and it learns the general LM through pre-training on large-scale corpora. This pre-trained BERT is then fine-tuned on downstream tasks [23] or serves as a feature extractor [24]-[26]. Inspired by pre-trained LMs, many pre-trained LMs of vision-language modalities have been proposed. For example, the LXMERT [28] model incorporates object-level visual features into a BERT-like architecture, and it achieves state-of-the-art results in visual question-answering tasks. Several researchers have incorporated VLM into MMT models. Reference [11] proposed to use a pre-trained MT model to translate the captions of images and train a VLM on the pseudo multimodal parallel data, followed by fine-tuning on an MMT dataset. Reference [29] pre-trained an MMT-specific visual feature extractor and incorporated the extracted visual feature into MMT models along with the BERT feature. Those approaches achieved substantial improvement; however, they required at least one expensive parallel corpus or substantial computational resources, which is unaffordable when training a model in a low-resource language or domain.
To address this problem, we propose a multimodal transformer model fused with a general-purpose pre-trained LXMERT system that utilizes the multimodal feature beyond its visual and textual features. The proposed model utilizes both textual and visual knowledge from the multimodal feature to improve the translation performance, which is unattainable solely with textual modality. The results obtained indicate the following: 1) The LXMERT-fused model improves the translation quality, especially under a limited language context. 2) An extensive analysis of our model indicates that a model fused with both LXMERT and BERT features further benefits the translation quality and still has room to fully exploit LXMERT and BERT features. The remainder of this paper is organized as follows. Section II briefly discusses related work. Sections III and IV describe conventional MMT models with a pre-trained word embedding and the proposed MMT model fused with a visual-language LM, respectively. Sections V and VI respectively describe the relevant experiments conducted and analyze the results obtained in detail. Finally, concluding remarks are presented and future work is outlined.

II. RELATED WORK

A. MMT WITH DATA AUGMENTATION
Data augmentation is widely studied in MMT owing to the lack of available large-scale corpora. Most previous studies adopted textual parallel corpora as external data for Multi30K [1], a well-established multimodal parallel corpus. Reference [8] proposed to pre-train an NMT model on a large-scale textual parallel corpus (OpenSubtitles corpus [9]) and translated a multimodal monolingual corpus (MS-COCO [10]) to augment the data for training MMT models. Although their system trained using the augmented training data achieved state-of-the-art results in both English-French and English-German translation, the improvement made by images is moderate or even negative. Reference [30] combined four different pseudo-parallel and parallel corpora as additional data to train MMT models: back-translation of Multi30K Task 2 data and in-domain monolingual corpora of target language, in-domain data extracted from general domain parallel corpora, and general domain parallel data. Recently, [11] proposed a VTLM to aggregate the recent development of TLM and VLM. As training VTLM requires multimodal parallel data, they employed a publicly available NMT model trained on multiple large-scale parallel corpora. Reference [29] acquired approximately six million sets of image-caption data that they used to pre-train an object recognition model, followed by fine-tuning on Multi30K data concurrently with training an MMT model.
In most studies on data augmentation in the MMT domain, either large-scale out-of-domain textual parallel corpora [8], [30] or pre-trained NMT models [11] of good quality for back-translation [31] are mandatory. Some studies require a large computational resource [29], which is considered excessive for training an MMT model. Under this constraint, these approaches cannot be adopted in low-resource domains. Thus, we explore the use of either textual or multimodal monolingual data, which tend to be more readily available.

B. MONOLINGUAL CORPORA AUGMENTED NMT
Knowledge learned from monolingual corpora has been widely exploited in NMT. Using word embedding or a language model is a common method to incorporate monolingual corpora in NMT systems.
Reference [13] made the first attempt to utilize pre-trained word embedding in low-resource NMT. They revealed that using a pre-trained word embedding to initialize embedding layers in an MT system improves the translation quality. Further, they stated that their approach is more effective for more similar language pairs. Recently, [32] proposed a search-based NMT model that predicts the embedding of the output word, rather than the distribution over the vocabulary. Their approach achieves not only faster training convergence without decreasing translation performance, but also a more accurate translation of rare words. This technique has been extended to an MMT model by [15], with resulting improvement in translation quality.
BERT [23] and its derivations [33], [34] employ a Transformer [27] architecture in a self-supervised learning manner and have achieved new state-of-the-art results on several natural language processing tasks. Moreover, several studies that leverage BERT for NMT have been published. Previous studies have revealed that simply initializing a transformer encoder using pre-trained BERT parameters does not improve translation quality [26], [35]. To address this problem, [35] proposed two-stage curriculum learning, in which model parameters initialized using pre-trained BERT are frozen until convergence in the first stage; all model parameters are then fine-tuned in the second stage. Meanwhile, [26] proposed a BERT-fused model that incorporates the representations from pre-trained BERT into a transformer model by feeding it into all the encoder and decoder layers. In both studies, models with BERT features outperformed naive transformer models to a substantial degree. Table 1 shows the dataset used to train each model. Recently, many studies on self-supervised learning for vision-language tasks have been conducted. Similar to BERT, the models in these studies are first pre-trained on a large-scale text-image dataset, and then they are fine-tuned on downstream tasks. Table 2 shows the performance for the downstream tasks of each model that has publicly available pre-trained models. Inspired by the progress in vision-language LM, we explore a Transformer-based MMT model incorporating vision-language LMs.
Table 2 note: "Images" and "Sentences" denote the size of the data for pre-training LMs. "VQA" shows the overall accuracy on the test-dev split in VQA v2 [36].

III. MULTIMODAL MACHINE TRANSLATION WITH PRE-TRAINED WORD EMBEDDING
In this section, we present our proposal for exploiting pre-trained word embedding for conventional MMT models. Although pre-trained word embedding has been widely used in NMT tasks after the emergence of [13], there is still room to improve their effectiveness by alleviating the following problems: (a) Learning word embedding of good quality is challenging for rare words. (b) Some words frequently appear in the nearest neighbors of other words irrespective of their similarity. To this end, our proposed approach comprises five steps:
1) Tokenizing monolingual data at either the word or subword level
2) Pre-training word embedding using a model
3) Applying the debiasing process to remove hubness from pre-trained word embedding
4) Initializing the embedding of MMT encoder and decoder to the pre-trained word embedding
5) Training the MMT model on the Multi30K dataset
To show that our approach is invariant with the architecture of MMT models, we employed four different MMT models: decoder initialization [42] ("DECINIT"), IMAGINATION [43] ("IMAG+"), hierarchical-attention NMT [44] ("HA-NMT"), and visual attention grounding NMT [45] ("VAG-NMT").

A. CONVENTIONAL MMT MODELS
Conventional MMT models are based on the attentive recurrent neural network. All of these models handle machine translation as a sequence-to-sequence learning problem in which a neural model is trained to translate a source sentence of $N$ tokens $x = \{x_1, x_2, \cdots, x_N\}$ into a target sentence of $M$ tokens $y = \{y_1, y_2, \cdots, y_M\}$ along with the global visual feature $v_g$ and/or the local visual feature $v_l$.
The underlying text-only NMT model of all MMT models comprises a bidirectional gated recurrent unit (GRU) [46] encoder and a unidirectional GRU decoder. The encoder first maps the source sentence $x$ into the encoder hidden state $h = h_0, \cdots, h_N$, which is a concatenation of outputs from the forward GRU encoder and the backward GRU encoder.
Thereafter, the decoder computes a hidden state proposal $s_j$ for each time step $j \in [1, M]$:
$$s_j = \mathrm{GRU}_1\left(\hat{s}_{j-1}, e_{dec}(\hat{y}_{j-1})\right) \quad (1)$$
where $\hat{s}_{j-1}$ is the previous hidden state and $e_{dec}(\hat{y}_{j-1})$ is the embedding for the previous output word $\hat{y}_{j-1}$. The initial state $\hat{s}_0$ is set to a zero vector. The textual context vector is computed using an attention mechanism, given $s_j$ as the query and $h_i$ as the key and value. Technically, in each time step $j$ while decoding, a feedforward layer is used to calculate a normalized soft alignment $\alpha_{j,i}$ with each source hidden state $h_i$, and the textual context vector $c^t_j$ is computed as the weighted sum of the source hidden states:
$$z_{j,i} = (v^t)^\top \tanh\left(U^t_\alpha s_j + W^t_\alpha h_i\right) \quad (2)$$
$$\alpha_{j,i} = \frac{\exp(z_{j,i})}{\sum_{i'} \exp(z_{j,i'})} \quad (3)$$
$$c^t_j = \sum_{i} \alpha_{j,i} h_i \quad (4)$$
where $z_{j,i}$ is the unnormalized alignment score and $v^t$, $U^t_\alpha$, and $W^t_\alpha$ are model parameters. The decoder employs another GRU unit to compute the final hidden state $\hat{s}_j$ from the hidden state proposal $s_j$, textual context $c^t_j$, and visual context $c^v_j$. The system output at time step $j$ is obtained using the current hidden state, previous word embedding, textual context, and visual context, with $L_s$, $L_w$, and $L_t$ as model parameters.
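The attention step described in this subsection can be illustrated with a minimal NumPy sketch. It assumes the common additive (feedforward) scoring form; the function name `additive_attention`, the argument order, and all shapes are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def additive_attention(s_j, h, U, W, v):
    """Return (alpha, context) for one decoding step.

    s_j: (d_s,) hidden state proposal used as the query
    h:   (N, d_h) encoder hidden states used as keys and values
    U:   (d_a, d_s), W: (d_a, d_h), v: (d_a,) attention parameters
    """
    # Feedforward score for every source position.
    scores = np.tanh(s_j @ U.T + h @ W.T) @ v      # shape (N,)
    # Numerically stable softmax gives the normalized soft alignment.
    scores = scores - scores.max()
    alpha = np.exp(scores) / np.exp(scores).sum()
    # Context vector: alignment-weighted sum of the encoder states.
    context = alpha @ h                             # shape (d_h,)
    return alpha, context
```

The returned `alpha` sums to one over the source positions, so `context` is a convex combination of the encoder hidden states.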

1) DECODER INITIALIZATION
This MMT model uses a projected global visual feature $v_g$ as the initial decoder state $\hat{s}_0$, rather than a zero vector; $W_0$ and $b_0$ are the model parameters of the projection.

2) HIERARCHICAL-ATTENTION NMT
Hierarchical-attention NMT [44] incorporates the local visual feature in an attentive manner. The model first computes the textual context vector $c^t_j$ and the visual context vector $c^v_j$ using two individual attention units, as described by (2)-(4). The concatenation of two context vectors $\{c^t_j, c^v_j\}$ is then fed into another attention as the key and value, where the hidden state proposal $s_j$ is used as the query. The second GRU in (5) takes the obtained multimodal context vector, rather than the textual context vector $c^t_j$.

3) IMAGINATION
IMAGINATION [43] is a multitask model that jointly learns machine translation and visual latent space models. The MT model is the vanilla NMT model that does not involve any visual features; this model does not use images during inference. The latent space model is a feed-forward model that calculates the average vector over the hidden states $h_i$ in the encoder and maps it to the final vector $\hat{v}$ in the latent space:
$$\hat{v} = \tanh\left(W_v \cdot \frac{1}{N} \sum_{i} h_i\right)$$
where $W_v$ is a model parameter.
We employ the max-margin loss to train the latent space model to ensure that the model maps the encoder hidden states closer to the global visual feature:
$$J = \sum_{v' \neq v_g} \max\left\{0, \alpha - d(\hat{v}, v_g) + d(\hat{v}, v')\right\}$$
where $v'$ ranges over contrastive visual features, $d$ is a cosine similarity function, and $\alpha$ is the margin (we use $\alpha = 0.1$ in our experiment). The final loss is the sum of the losses for MT and latent space learning.
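A minimal NumPy sketch of a max-margin objective of this kind follows. It assumes the common hinge form max(0, margin − d(v̂, v) + d(v̂, v′)) summed over contrastive image features v′, with d as cosine similarity. The function names and the way negatives are supplied are illustrative assumptions, not the authors' code.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_margin_loss(v_hat, v_pos, v_negs, margin=0.1):
    """Hinge loss over contrastive examples.

    v_hat:  predicted latent vector for a sentence
    v_pos:  visual feature of the paired image
    v_negs: iterable of visual features of other (contrastive) images
    The loss is zero once v_hat is more similar to v_pos than to every
    negative by at least `margin`.
    """
    pos = cosine(v_hat, v_pos)
    return sum(max(0.0, margin - pos + cosine(v_hat, v_neg))
               for v_neg in v_negs)
```

In practice the negatives are often the other images in the same mini-batch; that choice is an assumption here as well.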

4) VISUAL ATTENTION GROUNDING NMT
Visual attention grounding NMT (VAG-NMT) [45] is another multitask model comprising an MMT model and a latent space model. The model first computes the sentence representation $t$ from the global visual feature $v_g$ and encoder hidden states $h$. Thereafter, VAG-NMT utilizes the sentence representation to initialize the decoder hidden state, where $W_{init}$ is a model parameter and $\rho$ is a hyperparameter for determining the text representation ratio in the decoder initial state. Further, the model learns to map the sentence representation $t$ and the global visual feature $v_g$ closer in a latent space. The loss for latent space learning is the max-margin loss with negative sampling, where $d$ is the cosine similarity function; $k$ and $p$ are the indexes for sentences and images, respectively; $t_{k \neq p}$ are the negative samples, for which all examples in the same batch as the target example are selected; and $\gamma$ is the margin that adjusts the sparseness of each item in the latent space. The final loss is the weighted sum of losses for MT and latent space learning.

B. TOKENIZATION
The distribution of word occurrence follows Zipf's law and is long-tailed: a few common tokens dominate, and most tokens appear only a few times. Consequently, the obtained word embedding for tokens lying on the long tail may have poor quality. Improving the word embedding for rare words intuitively requires relaxing the slope of the distribution. To this end, we propose learning the word embedding from sentences that are tokenized at the subword level. Technically, we employ Byte Pair Encoding [47] (BPE) to tokenize words into subwords.
For training the MMT models, we adopt the same subword vocabulary used in pre-training word embedding to tokenize Multi30K sentences.
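To make the subword step concrete, the following is a small, self-contained sketch of how BPE merge operations can be learned from a word-frequency table, in the spirit of [47]. The function names and the toy vocabulary are illustrative assumptions; the experiments themselves rely on an existing BPE tool rather than this code.

```python
import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    merged = "".join(pair)
    # Match the pair only on whole-symbol boundaries.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda m: merged, word): freq
            for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily learn up to `num_merges` merge operations."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Toy vocabulary: words are space-separated symbols ending in "</w>".
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, segmented = learn_bpe(toy_vocab, 3)
```

Applying the same learned merges to both the pre-training corpus and the downstream sentences is what keeps the two subword vocabularies aligned.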

1) LOCALIZED CENTERING
Localized centering shifts each word based on its local bias. The local centroid for each word $x$ is computed and subtracted from the original word $x$ to obtain the new embedding $\hat{x}$:
$$\hat{x} = x - \frac{1}{k} \sum_{x' \in \mathrm{kNN}(x)} x'$$
where $k$ is a hyperparameter called the local segment size and $\mathrm{kNN}(x)$ returns the $k$-nearest neighbors of the word $x$.
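A brute-force NumPy sketch of this shift is given below. It assumes that neighbors are ranked by cosine similarity and that a word is not counted among its own neighbors; both choices, and the function name, are assumptions made for illustration.

```python
import numpy as np

def localized_centering(E, k):
    """Subtract from each embedding the centroid of its k nearest neighbors.

    E: (V, d) matrix of word embeddings, one row per word.
    Returns a (V, d) matrix of shifted embeddings.
    """
    # Cosine similarity between all pairs of words (O(V^2) memory).
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)            # exclude the word itself
    # Indices of the k most similar words for every row.
    nn = np.argsort(-sim, axis=1)[:, :k]      # shape (V, k)
    centroids = E[nn].mean(axis=1)            # shape (V, d)
    return E - centroids
```

For a realistic vocabulary the full similarity matrix would not fit in memory, so an approximate nearest-neighbor index or batched computation would be needed.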

2) ALL-BUT-THE-TOP
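All-but-the-Top uses the global bias of the entire vocabulary to shift the embedding of each word. The All-but-the-Top algorithm comprises three steps: subtract the centroid of all words from each word $x$, compute the PCA components for the centered space, and subtract the top $D$ PCA components from each centered word to obtain the final word $\hat{x}$:
$$\tilde{x} = x - \mu, \qquad u_1, \cdots, u_D = \mathrm{PCA}(\{\tilde{x}\}), \qquad \hat{x} = \tilde{x} - \sum_{i=1}^{D} (u_i^\top \tilde{x})\, u_i$$
where $\mu$ is the centroid of all word vectors, $u_i$ is the $i$-th principal component, and $D$ is a hyperparameter used to determine how many principal components of the pre-trained word embedding are ignored.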
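The three steps can be sketched in NumPy as follows, obtaining the principal directions from a singular value decomposition of the centered matrix. The function name and the use of SVD instead of an explicit PCA routine are illustrative choices.

```python
import numpy as np

def all_but_the_top(E, D):
    """Remove the mean and the top-D principal components from embeddings.

    E: (V, d) matrix of word embeddings, one row per word.
    D: number of leading principal components to discard.
    """
    # Step 1: subtract the centroid of all words.
    centered = E - E.mean(axis=0, keepdims=True)
    # Step 2: principal directions are the rows of Vt, ordered by
    # decreasing singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    top = Vt[:D]                                  # shape (D, d)
    # Step 3: subtract the projection onto the top-D directions.
    return centered - centered @ top.T @ top
```

After this transformation the embeddings have zero mean and no variance along the discarded directions.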

3) AUTOENCODER
Reference [21] showed that applying centering and PCA is the same as applying an autoencoder. Following their approach, we first train an autoencoder upon a pre-trained word embedding using the L 2 reconstruction loss. Subsequently, we extract the hidden state of each word embedding in the autoencoder as the revised word embedding. The centering and PCA effects of the autoencoder result in the obtained word embeddings being both debiased and isotropic.

IV. MULTIMODAL MACHINE TRANSLATION WITH LM
In this study, we use a variant of the BERT-fused model [26], namely, the LXMERT-fused model, in which the LXMERT [28] system is used as the feature extractor instead of the BERT system. As the performance of the LXMERT system is the best among available pre-trained systems (Table 2), we employ LXMERT as our feature extractor. This model tackles MMT as a sequence-to-sequence learning problem, in which a neural network model is trained to translate a source sentence of $N$ tokens $x = \{x_1, \cdots, x_N\}$ and the corresponding image into the target sentence of $M$ tokens $y = \{y_1, \cdots, y_M\}$.

A. LXMERT
The LXMERT model is a pre-trained VLM that represents cross-modal connections, as well as each modality. It takes both a sentence $x$ and an image as its inputs and generates two different features: language output $H_L$ and vision output $H_R$.
More specifically, LXMERT first encodes each modality using individual single-modality encoders, and then it encodes cross-modality connections using another individual encoder with a cross-modality attention mechanism. The LXMERT model is pre-trained with four tasks: masked cross-modality LM, masked object prediction (feature vector regression and detected-label classification), cross-modality matching, and image question answering.

1) SINGLE-MODALITY LANGUAGE ENCODER
The single-modality language encoder is a BERT-like encoder. It is composed of a stack of nine layers, with each layer containing a self-attention sub-layer and a feedforward sub-layer. Two special tokens, CLS and SEP, are added before and after the input sentence words, respectively. The resulting input sequence $\{\mathrm{CLS}, x_1, \cdots, x_N, \mathrm{SEP}\}$ is fed into the encoder to generate the language-modality representation $h^0$.

2) SINGLE-MODALITY VISION ENCODER
The single-modality vision encoder is also a BERT-like encoder. However, it is composed of a stack of five layers and takes visual features as inputs. Rather than using a raw image, LXMERT uses a bag-of-objects representation of $K$ objects $\{(p_1, f_1), \cdots, (p_K, f_K)\}$ that a Faster R-CNN [48] system detects from the image. Here, $p_j$ is the position feature, and $f_j$ is the region-of-interest (RoI) feature for $j \in [1, K]$. The position-aware embedding $o_j$ for each object is derived from the position and RoI features:
$$\hat{f}_j = \mathrm{LayerNorm}(W_F f_j + b_F), \qquad \hat{p}_j = \mathrm{LayerNorm}(W_P p_j + b_P), \qquad o_j = \left(\hat{f}_j + \hat{p}_j\right) / 2$$
where LayerNorm is the layer-normalization function and $W_F$, $b_F$, $W_P$, and $b_P$ are model parameters.
The position-aware embeddings $o = \{o_1, \cdots, o_K\}$ are fed into the single-modality encoder to deliver the vision-modality representation $v^0$.
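A minimal NumPy sketch of the position-aware object embedding is shown below. It assumes the averaging form given in the LXMERT paper [28], where the two layer-normalized projections are averaged, and it omits the learned gain and bias of layer normalization for brevity. The function names and shapes are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-12):
    """Layer normalization over the last axis (no learned gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def position_aware_embedding(f, p, W_F, b_F, W_P, b_P):
    """Combine RoI features and position features of K detected objects.

    f: (K, d_f) RoI features;      W_F: (d, d_f), b_F: (d,)
    p: (K, d_p) position features; W_P: (d, d_p), b_P: (d,)
    Returns a (K, d) matrix with one embedding per object.
    """
    f_hat = layer_norm(f @ W_F.T + b_F)
    p_hat = layer_norm(p @ W_P.T + b_P)
    # Average the two normalized projections so that both carry equal weight.
    return (f_hat + p_hat) / 2.0
```

Normalizing each projection before averaging keeps the RoI features from dominating the low-dimensional position features purely because of scale.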

3) CROSS-MODALITY ENCODER
The cross-modality encoder is composed of a stack of five identical layers. Each cross-modality layer consists of two unidirectional cross-attention, two self-attention, and two feed-forward sub-layers. In the $k$-th layer, two unidirectional cross-attention sub-layers are first applied: one from language to vision and the other from vision to language. The query and context vectors are the outputs of the $(k-1)$-th layer:
$$\hat{h}^k = \mathrm{Head}^k_{h \to v}\left(h^{k-1}, v^{k-1}\right), \qquad \hat{v}^k = \mathrm{Head}^k_{v \to h}\left(v^{k-1}, h^{k-1}\right)$$
where the first argument is the query and the second is the context, and $\mathrm{Head}^k_{h \to v}$ and $\mathrm{Head}^k_{v \to h}$ are two different multi-headed attention modules. Subsequently, we apply the self-attention sub-layers to the output of the cross-attention sub-layers, which is followed by the feed-forward sub-layers to obtain the $k$-th layer outputs $h^k$ and $v^k$.

4) OUTPUT REPRESENTATIONS
The corresponding outputs of the last cross-modality encoder are denoted $H_L$ for language and $H_R$ for objects. Technically, we use $h^5$ as $H_L$ and $v^5$ as $H_R$.

B. LXMERT-FUSED MODEL
The LXMERT-fused model takes the LXMERT representations as the embedding of the images. In addition, we use a concatenation of $H_L$ and $H_R$ as the LXMERT representations, $H_{LR} = [H_L; H_R]$, to ensure that the model can exploit both language-to-vision and vision-to-language cross-modality information.

1) ENCODER
The encoder is composed of a stack comprising one embedding layer and six encoder layers. Each encoder layer contains one fusion sub-layer and one feed-forward sub-layer. A residual connection is applied around each of the two sub-layers, and then layer normalization proceeds.
The encoder first projects tokens in a sentence $x = \{x_1, \cdots, x_N\}$ to vectors via the embedding layer, followed by a tanh activation. It then injects positional encoding into the input embedding and applies layer normalization to obtain the position-aware word embeddings $H^0_E = \{H^0_{E,1}, \cdots, H^0_{E,N}\}$. Here, $i \in [1, N]$ denotes each position in a source sentence, $e_{enc}(x_i)$ is the embedding representation for a word $x_i$, and $\mathrm{PE}(i)$ is the positional embedding for a position $i$.
In the $l$-th encoder layer, the fusion sub-layer is first applied, where two context vectors are computed using two different attention modules: self-attention and encoder-to-LXMERT attention. The fusion sub-layer then interpolates two context vectors and obtains the final context vector $\tilde{H}^l_E$:
$$\tilde{H}^l_E = \lambda_E \, \mathrm{Head}^l_{E \to E}\left(H^{l-1}_E, H^{l-1}_E, H^{l-1}_E\right) + \lambda_{LR} \, \mathrm{Head}^l_{E \to LR}\left(H^{l-1}_E, H_{LR}, H_{LR}\right)$$
where the arguments of each attention module are the query, key, and value; $\lambda_E$ and $\lambda_{LR}$ are interpolation coefficients with $\lambda_E + \lambda_{LR} = 1$; and $\mathrm{Head}^l_{E \to E}$ and $\mathrm{Head}^l_{E \to LR}$ are multi-head attention modules with different parameters. Dropnet [26] is applied for all fusion sub-layers.
The context vectors are then processed by the position-wise feed-forward sub-layers, and the output of the $l$-th layer $H^l_E$ is derived using a ReLU activation, with $W^l_1$, $W^l_2$, $b^l_1$, and $b^l_2$ as the model parameters.
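The interpolation performed by the fusion sub-layer can be sketched in NumPy as follows. This is a deliberately simplified illustration: it uses single-head scaled dot-product attention without learned projections, assumes the LXMERT features already have the same dimensionality as the encoder states, and leaves out the residual connection, layer normalization, and drop-net. The function names and the default coefficient of 0.5 are assumptions.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention (projections omitted)."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def fusion_sublayer(H_prev, H_LR, lam_E=0.5):
    """Interpolate self-attention and encoder-to-LXMERT attention.

    H_prev: (N, d) outputs of the previous encoder layer (queries)
    H_LR:   (K, d) LXMERT representations (keys and values)
    lam_E:  weight of the self-attention branch; the LXMERT branch
            receives 1 - lam_E so that the two weights sum to one.
    """
    lam_LR = 1.0 - lam_E
    self_ctx = attention(H_prev, H_prev, H_prev)
    lxmert_ctx = attention(H_prev, H_LR, H_LR)
    return lam_E * self_ctx + lam_LR * lxmert_ctx
```

Setting `lam_E=1.0` reduces the sub-layer to plain self-attention, which makes the contribution of the LXMERT branch easy to ablate.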

2) DECODER
The decoder is also composed of a stack comprising one embedding layer and six decoder layers. Each decoder layer contains one self-attention sub-layer, one fusion sub-layer, and one feed-forward sub-layer. The residual connection and layer normalization are applied between sub-layers.
In each position $t$, while decoding, the decoder first computes the position-aware word embeddings $H^0_{D,<t} = \{H^0_{D,1}, \cdots, H^0_{D,t-1}\}$ from the predicted tokens $\hat{y}_{<t} = \{\hat{y}_1, \cdots, \hat{y}_{t-1}\}$, where $j \in [1, t-1]$ denotes each position in the predicted tokens, and $e_{dec}(\hat{y}_j)$ is the embedding representation for a word $\hat{y}_j$. In the $l$-th decoder layer, the output of the $(l-1)$-th layer is fed to a self-attention module $\mathrm{Head}_{D \to D}$ to generate the intermediate representation $\hat{H}^l_{D,j}$. The fusion layer in the decoder works similarly to that in the encoder; however, it uses decoder-to-encoder attention (rather than self-attention) to generate the final representation, where $\rho_E$ and $\rho_{LR}$ are the interpolation coefficients and $\rho_E + \rho_{LR} = 1$. Further, $\mathrm{Head}^l_{D \to E}$ and $\mathrm{Head}^l_{D \to LR}$ are multi-head attention modules with different parameters. Dropnet is also applied for all fusion sub-layers in the decoder.
The context vectors are then processed by the position-wise feed-forward sub-layers, and the output of the $l$-th layer $H^l_D$ is derived, with $W^l_3$, $W^l_4$, $b^l_3$, and $b^l_4$ as the model parameters. The output of the last decoder layer $H^6_D$ is fed into the projection layer to generate the output distribution $p(\hat{y}_t \mid \hat{y}_{<t})$, where $W_5$ and $b_5$ are the model parameters. In particular, $W_5$ is a projection matrix that maps the decoder state into vocabulary space.

C. TRAINING
A preliminary study, reported by [26], has indicated that training LXMERT-fused models from scratch does not lead to good model performance, which is also the case when using the BERT feature.
To address this problem, we employ a two-step procedure to train LXMERT-fused models. We first train an LXMERT-fused model with $\rho_E = 0$, where only the language part of the training data is included. After the model has converged, we then set $\rho_E$ to a specific value to fine-tune the model on both language and vision data. During fine-tuning, the learning rate and batch size are set to smaller values than those in the first step.

A. WORD EMBEDDING
In our experiments, we used three different well-established word embedding models: word2vec [49], GloVe [50], and FastText [14]. The publicly available pre-trained word embeddings use different training corpora; however, we trained the word embeddings of different models using an identical monolingual corpus for fair comparison.

1) TRAINING CORPUS
We downloaded Wikidump for English, German, French, and Czech and extracted the article pages. All the extracted sentences were preprocessed by lower-casing, tokenizing, and normalizing the punctuation using a Moses script. For the subword-level experiments, we used Byte Pair Encoding to split words into subwords. We used subword-nmt to process the sentences. The number of merge operations was 30,000, and the vocabulary threshold was set to zero. Table 3 shows the statistics of the preprocessed Wikipedia corpus for each language.
We applied each debiasing method to the obtained word embeddings with the same options as in its original paper.

2) TRAINING SETTING
All word embeddings were trained with a dimension of 300. The specific training options were as follows (default values were used for all other options).
We trained the word2vec model 12 using the CBOW algorithm (with window size of 10, negative sampling of 10, and minimum count of 10), the GloVe model 13 (with window size of 10 and minimum count of 10), and the FastText model 14 using the CBOW algorithm (with maximum character n-gram of 5, window size of 5, and negative sampling of 10).

3) UNKNOWN WORDS
Unknown words are of two types: words that are part of a pre-trained word embedding but are not included in a vocabulary (out-of-vocabulary (OOV) words), and words that are part of a vocabulary but are not included in the pre-trained word embedding (OOV words for embedding). OOV words for embedding exist only when using word-level embeddings (word2vec and GloVe); in FastText, the embedding of such a word is calculated as the mean embedding of the character n-grams that make up the word.
The embeddings for both types of OOV words were calculated as the average embedding over the words that were part of the pre-trained word embedding but were not included in the vocabularies, and they were updated individually during training.
Tables 4 and 5 show the hyperparameters of the conventional and LXMERT-fused MMT models in our experiments, respectively. Note that each conventional MMT model has an encoder size of 320; therefore, the bidirectional GRU has a size of 640. All models were implemented using the nmtpytorch toolkit v4.0.0 [51].
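The averaging scheme for the two types of OOV words described above can be sketched as follows. This is a minimal illustration in plain Python, not the nmtpytorch implementation; the function names, the `<unk>` token, and the error raised when no pre-trained word lies outside the vocabulary are assumptions.

```python
from typing import Dict, Iterable, List


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Column-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_embedding_table(vocab: Iterable[str],
                          pretrained: Dict[str, List[float]],
                          unk_token: str = "<unk>") -> Dict[str, List[float]]:
    """Initialize one embedding row per vocabulary word plus an unknown token."""
    vocab = list(vocab)
    vocab_set = set(vocab)
    # Words covered by the pre-trained embedding but absent from the vocabulary.
    outside = [vec for word, vec in pretrained.items() if word not in vocab_set]
    if not outside:
        raise ValueError("no pre-trained word lies outside the vocabulary")
    oov_vector = mean_vector(outside)

    table: Dict[str, List[float]] = {}
    for word in vocab:
        # Copy the row so that each word can be updated individually in training.
        table[word] = list(pretrained.get(word, oov_vector))
    table[unk_token] = list(oov_vector)
    return table


toy_pretrained = {"dog": [1.0, 0.0], "cat": [0.0, 1.0],
                  "zebra": [2.0, 2.0], "yak": [4.0, 0.0]}
table = build_embedding_table(["dog", "cat", "selfie"], toy_pretrained)
print(table["selfie"], table["<unk>"])
```

In the toy example, "selfie" is in the vocabulary but not in the pre-trained embedding, so it starts from the same average as the unknown token while keeping its own row.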

1) GLOBAL AND LOCAL VISUAL FEATURE
We encoded each image using pre-trained ResNet-50 [52] and selected the hidden state of the pool5 layer (2,048 dimensions) as its global visual feature and that of the res4f layer (1,024 dimensions) as its local visual feature.

2) LXMERT FEATURE
We used a publicly available LXMERT model 15 in our experiment. We first employed the alternative pre-trained Faster R-CNN model 16 to encode all images in the Multi30K dataset and selected 36 RoI features of 2,048 dimensions and 36 positional features of four dimensions for each image in the same manner as [28]. Finally, the pre-trained LXMERT model processed the selected visual features and the corresponding source sentence to obtain LXMERT features of 768 dimensions.

C. MULTI30K DATASET
We used the Multi30K [1] dataset for all translation directions and the 2017 test set [53] for English-German and English-French translations. The training, validation, 2016 test, and 2017 test sets have 29,000, 1,014, 1,000, and 1,000 instances, respectively. We selected English as the source language and German/French/Czech as the target languages. All sentences in English/German/French/Czech were preprocessed by lower-casing, tokenizing, and normalizing the punctuation using the same scripts described in Section V-A. For the subword-level experiments, we applied BPE using the subwords obtained from Wikipedia; consequently, no OOV tokens appeared in the training and other sets. Table 7 shows the results of the preliminary experiments. Considering the results, we decided to perform our experiments only on the BiGRU-based MMT models and omit BPE-to-word translation.
We also evaluated the LXMERT-fused models on a degraded version of Multi30K (2016 N ), where the first noun of each noun phrase is masked. Note that the models for the 2016 N test set were trained on the degraded version of the training set.

D. EVALUATION
We used BLEU [54] and METEOR [55] as our evaluation metrics. BLEU evaluates the hard matches on unigrams, bigrams, trigrams, and 4-grams between the system output and the reference. METEOR is a BLEU-like metric that employs WordNet [56] to relax the hard alignment between the prediction and reference, which allows the metric to take semantics into greater account. Note that METEOR is only available for German, French, and Czech; we did not evaluate Japanese translation using METEOR. We trained each model three times using different seeds and averaged the scores.
VOLUME 10, 2022
TABLE 8. Corpus-level BLEU / METEOR on the 2016 test set for English-German translation using different tokenization strategies, pre-trained word embeddings, and MMT models. "+ debias" shows the best score of the three models using different debiasing methods. The underlined scores are higher than the score of randomly initialized models. The bold score is the best score of each model. "†" indicates the statistical significance of the improvement over randomly initialized models.
Table 8 shows the BLEU and METEOR scores across the "text-only" model and MMT models for English-German translation. First, we observe that applying the Wikipedia-based BPE to both English sentences and German translations results in a substantial improvement (+1.01 BLEU on average) for all models. Note that applying BPE to English sentences also boosts the model performance, which is contrary to the report by [57] that applying Multi30K-based BPE to source sentences is not beneficial. Second, debiasing the pre-trained word embedding further improves the model performance. Given the use of BPE on both sides, models using debiased word embedding have a higher BLEU score than their counterparts that use vanilla word embedding.
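The hard n-gram matching behind the BLEU scores reported in this section can be sketched as follows. This is a simplified single-reference, unsmoothed corpus-level BLEU written for illustration; it is not the scoring script used in the experiments, and the function names are assumptions.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[List[str]], references: List[List[str]],
                max_n: int = 4) -> float:
    """Corpus-level BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing: any empty n-gram order gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for outputs that are shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


hyp = "the cat sat on the mat".split()
ref = "the cat sat on a mat".split()
print(round(corpus_bleu([hyp], [ref]), 2))
```

Averaging over the three seeds then amounts to computing this score once per trained model and taking the mean.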

1) MMT WITH PRE-TRAINED WORD EMBEDDING
We observed a slightly different trend for other translation pairs. Table 9 shows the BLEU scores of three models for English-French, English-Czech, and English-Japanese translation. Using debiased word embedding still results in improvements over randomly initialized models. However, Wikipedia-based BPE no longer benefits model performance (−0.37, +0.01, and −0.52 BLEU for English-French, English-Czech, and English-Japanese on average, respectively).

2) LXMERT-FUSED MMT
TABLE 9. Corpus-level BLEU on the 2016 test set for English-French, English-Czech, and English-Japanese translation. "+ debias" shows the best score of the three models using different debiasing methods. The bold score is the best score of each model. "†" indicates the statistical significance of the improvement over randomly initialized models.
We trained all the models three times with different seeds and averaged the scores. Table 10 shows the BLEU and METEOR scores across three test sets in Multi30K. In addition to the text-only Transformer and LXMERT-fused Transformer, we also conducted experiments on models fused with the BERT feature ("BERT"), ResNet-50 local visual feature ("ResNet-50"), and RoI feature ("Faster R-CNN"). We observed that the MMT models incorporating BERT (or the all-inclusive feature) outperformed other models on the 2016 and 2017 test sets. This suggests that, when the input sentence is complete, the textual modality is more important than the visual modality.
However, in the 2016 N test set, the benefit of using the BERT feature is smaller than that in the 2016 and 2017 test sets. This suggests that, when the textual context is limited, visual features are more beneficial than the textual feature. We can further observe a significant improvement resulting from using most visual features; the LXMERT feature is generally more beneficial than the other visual features (ResNet-50 and Faster R-CNN). Moreover, in many translation directions, the model achieves the best score with the all-inclusive feature, which is a concatenation of BERT and LXMERT features. We may conclude that the LXMERT-fused MMT model is not only capable of utilizing visual features but is also good at working with both a strong textual LM and a visual-language LM.
Furthermore, these properties are consistent among translation directions, which is different from what we observed when using pre-trained word embedding. In all translation directions, the models fused with BERT, LXMERT, or the all-inclusive feature perform the best. We provide a detailed model comparison for English-German translation in Section VI-C.

VI. DISCUSSION
In this section, we first examine the effectiveness of each debiasing method. Subsequently, we conduct an extensive quantitative analysis of the LXMERT-fused model.

A. DEBIASING METHOD
Table 11 reports the average BLEU and METEOR scores of each model using different word embeddings and debiasing methods over different tokenization strategies. We can observe that All-but-the-Top ("AbtT") achieves the best scores for nine out of 15 combinations of MMT models and word embeddings. This is followed by localized centering ("LC"), which achieves the best scores for two. Conversely, the autoencoder ("AE") seems less effective with pre-trained embeddings in the MMT scenario.
More interestingly, whereas the debiasing procedures improve word2vec on only six out of 10 benchmarks, they improve GloVe and FastText on all benchmarks. This difference may be caused by how each embedding learns the global property of the training corpus. In contrast to word2vec, which learns from the local context window of each word, GloVe learns based on the global co-occurrence matrix of the training data. FastText composes each word embedding from its subword embeddings, which results in a generalized embedding rather than a localized embedding. Consequently, GloVe and FastText capture the global property of the training corpus more than word2vec does, which makes them more compatible with debiasing methods based on the global bias of the entire vocabulary. This result is consistent with the report by [21], which stated that word2vec with the autoencoder-based debiasing procedure is less effective than GloVe and FastText on word disambiguation tasks.
TABLE 11. Detailed comparison of pre-trained word embeddings and debiasing methods across conventional MMT models for English-German translation. The score in boldface is the best score among the vanilla and debiased embeddings for each embedding.

B. FEATURE ABLATION
Selecting the appropriate feature is essential for leveraging the visual information for NMT. To reveal which part of the LXMERT feature contributes the most, we conducted experiments with various features extracted from LXMERT: (1) object-level visual features as defined in (26) (Objects); (2) features before single-modality encoders (Embedding); (3) output of single-modality encoders (Single-M); and (4) output of cross-modality encoders (Cross-M). Table 12 reports the results of ablation experiments conducted on the 2016, 2017, and 2016 N test sets. Although the model exploiting the cross-modal feature ("Cross-M") is not the best model on most of the test sets, it achieves almost the best performance. Interestingly, the best feature for every test set is either a cross-modality feature or a concatenated single-modality feature. This suggests that a multimodal feature is more useful to the model than single-modality features. Moreover, we need to select features from different layers to best fit the model to different test sets. This is likely because the pre-training tasks of LXMERT are not optimized for NMT models and may introduce misleading information into the LXMERT representations. The observation also suggests that selecting appropriate pre-training tasks would further boost translation quality.

C. ALL-INCLUSIVE FEATURE
A key finding of this study is that the fuse-based model can utilize both LXMERT and BERT features in degraded scenarios. Table 13 shows the statistics of sentence sets that benefit from either LXMERT, BERT, or all-inclusive features.
Evidently, the largest contribution is made by 214 sentences (2016 test set) and 166 sentences (2016 N test set) that are improved by all features. In the 2016 N test set, the difference in the improvement made by the LXMERT feature and the all-inclusive feature is substantial (+0.83 BLEU). However, almost no additional improvement (+0.05 BLEU) is made by the all-inclusive features for Multi30K. Based on these results, we can conclude that the fuse-based model can utilize both LXMERT and BERT features when the input sentences are incomplete.
Moreover, by using the all-inclusive features, our model improved 47 samples (2016 test set) and 46 samples (2016 N test set) that are not improved by using either the LXMERT or BERT features. These samples support the assertion that the fuse-based model with the LXMERT feature is capable of not only selectively using the better features from the BERT features or LXMERT features but also extracting novel information that is imperceptible in the underlying features.
However, our model failed to improve 33 samples (2016 test set) and 34 samples (2016 N test set) that were improved by using a single feature, but not by the all-inclusive feature. This demonstrates that the model failed to utilize two promising features for some samples. Further, the pairwise comparison (Table 14) of the models supports this idea, where the LXMERT feature contributes more than BERT and all-inclusive features in sentences that need images for translation. 18 Therefore, the model still has room for further improvement, especially in exploiting multiple features simultaneously without losing the individual benefits of its component features. We will explore this issue in future work.
TABLE 13. Statistics of the sentence subsets in the 2016 test set for English-German translation that benefit from the features with up arrow (green) and do not benefit from the features with down arrow (red). "L," "B," and "A" denote the models with the LXMERT feature, BERT feature, and all-inclusive feature, respectively. "Samples" shows the number of samples in each set. "Avg. BLEU" shows the gain (or loss) of each feature from the text-only baseline.
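The grouping of sentences by which features improve them can be sketched as follows. The sketch assumes that each system has one sentence-level score per test sentence and that "improved" means scoring strictly higher than the text-only baseline; how sentence-level gains were actually measured is not specified here, and the function name and toy scores are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def improvement_patterns(baseline: List[float],
                         systems: Dict[str, List[float]]) -> Counter:
    """Count sentences by the set of systems that beat the text-only baseline."""
    patterns: Counter = Counter()
    for i, base_score in enumerate(baseline):
        improved = frozenset(name for name, scores in systems.items()
                             if scores[i] > base_score)
        patterns[improved] += 1
    return patterns


# Toy sentence-level scores; "L", "B", and "A" follow the labels of the table.
toy_baseline = [10.0, 20.0, 30.0, 40.0]
toy_systems = {
    "L": [12.0, 20.0, 35.0, 39.0],
    "B": [11.0, 25.0, 30.0, 38.0],
    "A": [15.0, 19.0, 36.0, 50.0],
}
for subset, count in improvement_patterns(toy_baseline, toy_systems).items():
    print(sorted(subset), count)
```

Under this grouping, the subset containing only "A" corresponds to sentences improved by the all-inclusive feature but by neither single feature, and subsets without "A" correspond to sentences where the combined feature loses a benefit.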

D. VISUAL AWARENESS
To determine whether the LXMERT-fused model is aware of the visual context, we performed an adversarial evaluation [58] on the 2016 and 2016 N test sets. In adversarial evaluation, we measure how a system performs when it is presented with the correct text data and either the correct image data (congruent) or incorrect image data (incongruent). To this end, we reversed the order of the 1,000 images in each test set to obtain incongruent text-image pairs. Because we assume that the input sentences are always correct, the incongruent LXMERT features were extracted from the correct sentences paired with incongruent images.
Table 15 shows the corpus-level BLEU scores for each model in the adversarial evaluation. A large difference is observed between the congruent and incongruent settings in the 2016 N test set, but almost no difference is observed in the original Multi30K. This observation is consistent with the assertion made in [59] that the source text in Multi30K is sufficient to perform the translation, which prevents the visual features from affecting the model.
18 Some translations in the 2016 test set were modified during the post-edit process with the presence of images, indicating that images are mandatory for these samples. We determined post-edited sentences by extracting sentences in WMT17 that differ from those in WMT16, obtaining 150 samples.
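The construction of incongruent pairs by reversing the image order can be sketched as follows. The function name and the toy identifiers are hypothetical; the only step taken from the text is the reversal itself.

```python
from typing import List, Sequence, Tuple


def make_incongruent(sentences: Sequence[str],
                     images: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair the i-th sentence with the image of the (n - 1 - i)-th sentence."""
    if len(sentences) != len(images):
        raise ValueError("each sentence needs exactly one image")
    return list(zip(sentences, reversed(images)))


toy_sentences = ["s0", "s1", "s2", "s3"]
toy_images = ["img0", "img1", "img2", "img3"]
print(make_incongruent(toy_sentences, toy_images))
```

With an even number of items, such as the 1,000 images of each test set, reversal guarantees that no sentence keeps its own image; with an odd number, the middle sentence would stay congruent.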

E. HUMAN EVALUATION
To investigate the characteristics of our models for human users, we asked human judges to rank the systems from best to worst for each source text. The 2016 test set, which consists of 1,000 input sentences, serves as the evaluation data. For each input sentence, we sampled an output for each model from the three translations generated by the three trained systems. Ties were allowed, as multiple systems may generate the same translation for an input sentence. Finally, we turned the absolute ranks into pairwise comparisons of two selected systems.
Furthermore, we observed a small yet notable gap between the LXMERT-fused and BERT-fused models, which contrasts with the fact that the LXMERT-fused model has a higher BLEU score than the BERT-fused model. We consider that the perplexity of each model accounts for this gap; the perplexity 19 of the BERT-fused model (8.59) is slightly lower than that of the LXMERT-fused model (8.96). As BERT is pre-trained on larger data than those used for LXMERT, the BERT-fused model might generate more fluent translations than the LXMERT-fused model.
Table 17 shows the English-German translations generated with different features. In this sample, the word "pyramid" is not translated by the text-only model or by the models with either BERT or RoI features. However, the LXMERT feature successfully guides the model to generate the German word "pyramide." This sample demonstrates a good interaction between the language and vision modalities. Specifically, the LXMERT feature guides the model to construct the sentence structure by leveraging the language modality, which is also observed when using the BERT feature, and then completes uncertain words by leveraging the vision modality.
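The conversion from absolute ranks to pairwise comparisons used in the human evaluation can be sketched as follows. The data layout (one dictionary of ranks per source sentence, with rank 1 as the best), the function name, and the toy rankings are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def pairwise_counts(rankings: List[Dict[str, int]],
                    sys_a: str, sys_b: str) -> Tuple[int, int, int]:
    """Return (wins of sys_a, ties, wins of sys_b); a lower rank is better."""
    wins_a = ties = wins_b = 0
    for ranks in rankings:
        if ranks[sys_a] < ranks[sys_b]:
            wins_a += 1
        elif ranks[sys_a] > ranks[sys_b]:
            wins_b += 1
        else:
            ties += 1  # e.g., both systems produced the same translation
    return wins_a, ties, wins_b


# Toy rankings for three source sentences.
toy_rankings = [
    {"LXMERT": 1, "BERT": 2, "text-only": 3},
    {"LXMERT": 2, "BERT": 1, "text-only": 2},
    {"LXMERT": 1, "BERT": 1, "text-only": 2},
]
print(pairwise_counts(toy_rankings, "LXMERT", "BERT"))
```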

VII. CONCLUSION
In this paper, we introduced two approaches to incorporate a monolingual corpus to improve MMT models. We showed that pre-trained word embeddings improve the translation performance along with the debiasing procedure and/or monolingual-corpus-based subword tokenization. Pre-trained VLMs were also shown to boost the translation quality. The results on multiple language pairs support the usefulness of monolingual data. Compared to the approaches based on parallel corpora, our proposed approach requires less expensive annotations and is, therefore, more applicable to low-resource languages. Although we conducted experiments on various target languages to show the applicability across languages, the utility may deteriorate if our approaches are applied to a language with a culture that is distant from that of LXMERT. In future work, we would like to inspect the impact of this cultural gap for culturally distant language pairs (e.g., English-Arabic).
19 We employed bert-base-multilingual-uncased to compute the perplexity.
Even after exploiting the knowledge obtained from monolingual corpora, conventional MMT models still outperformed Transformer-based MMT models in some language pairs. However, through extensive analysis, we identified focus areas for developing better MMT models fused with pre-trained VLMs. In future work, we will examine training tasks for pre-trained VLMs that are more appropriate for multimodal NMT. Further, we will investigate models fused with multiple features that preserve every benefit provided by their underlying features.
TOSHO HIRASAWA received the B.S. degree from Kyoto University, in 2009, and the master's degree in information science from Tokyo Metropolitan University, in 2021, where he is currently pursuing the Ph.D. degree with the Graduate School of System Design. His research interests include machine translation and multimodal natural language processing.
She is currently a Data Scientist and an Engineer with CogSmart. She is also a Visiting Researcher in Tokyo Metropolitan University.

MASAHIRO KANEKO
MAMORU KOMACHI received the M.Eng. and Ph.D. degrees from the Nara Institute of Science and Technology (NAIST), in 2007 and 2010, respectively. He was an Assistant Professor at NAIST. He is currently a Professor at Tokyo Metropolitan University. His research interests include semantics, information extraction, and educational applications of natural language processing.