Multimodal Machine Translation

In recent years, neural network machine translation, especially in the multimodal setting, has developed rapidly. It has been widely used in natural language processing tasks such as event detection and sentiment classification. Existing multimodal neural network machine translation models are mostly based on an attention-based encoder-decoder framework that additionally integrates spatial visual features. However, owing to the scarcity of multimodal corpora and the lack of semantic interaction between modalities, the quality of machine translation is difficult to guarantee. Therefore, this paper proposes a multi-modal machine translation model that integrates external linguistic knowledge. Specifically, on the encoder side, we adopt the pre-trained BERT model as an additional encoder and integrate it with the original text encoder and picture encoder. Working together, the three encoders generate better text and picture representations at the source end. The decoder then generates the translation from the image and text representations of the source. To sum up, this paper studies the visual-text semantic interaction on the encoder side and on the decoder side, and further improves the quality of translation by introducing external linguistic knowledge. We compare the multimodal neural network machine translation model with pre-trained BERT against baseline models on the English-German translation task of the Multi30k dataset. The results show that the model can significantly improve the quality of multimodal neural network machine translation, which also verifies the importance of integrating external knowledge and visual-text semantic interaction.


I. INTRODUCTION
THE real world that human beings live in is a space where text, sound, image, and video coexist. For thousands of years, humans have exchanged information with each other in a variety of ways, such as language, text, and images, and using multiple modalities at the same time clearly conveys information more fully and accurately. Multi-information fusion is an important research trend. However, most existing machine translation models use only text data for translation. How to integrate text, image, and video information to improve the quality of translation is a topic worthy of study. Since Kalchbrenner et al. proposed the concept of neural network machine translation in 2013, it has achieved results comparable to, or even better than, traditional statistical machine translation, and it has gradually become a research hotspot.
At present, the mainstream models can be divided into three categories. The first type of model uses only the text attention mechanism, and the image serves only as auxiliary information to improve the generation of the text representation. Yang et al. (2020) jointly trained source-to-target and target-to-source translation models, and encouraged these models to share visual information when generating semantically equivalent visual words. However, these models only use visual information to optimize the semantic representation of text, ignoring the strong semantic association between text and image.
In this paper, according to the characteristics of the multimodal neural network machine translation model, the BERT model is introduced as an additional encoder for the input sentence and is then used for decoding together with the original visual encoder and text encoder of the multi-modal neural network model, so that the visual context vector and the text context vector can learn more external linguistic knowledge and improve their representation ability. In summary, this article makes the following contributions:
i) To address the lack of semantic interaction in existing multimodal neural network machine translation, the visual-text semantic interaction on the encoder side and the visual-text semantic interaction on the decoder side are studied separately.
ii) To address the scarcity of multimodal machine translation corpora, external linguistic knowledge is introduced to further improve the quality of translation.
iii) Experiments were performed on multiple language pairs on the Multi30k dataset, and the results all show that the model in this paper can significantly improve the quality of multimodal neural network machine translation.

II. RELATED WORK
A. BERT
Pre-training technology has a long history in the field of machine learning and natural language processing, and its related applications can be traced back to (Erhan et al., 2010). Practice has shown that in natural language understanding tasks, making good use of this type of pre-trained model can effectively improve the performance of the model. BERT and its variants have been widely used in tasks such as natural language understanding and text classification, and have greatly promoted the development of the corresponding fields. Multimodal neural network machine translation aims to use source language sentences and the corresponding visual information simultaneously to obtain high-quality target language translations. In this process, the input sentence and image are first encoded and then decoded to generate the target language sentence. BERT has been shown to improve model performance on many natural language processing tasks, so it is of great significance to study its application to multimodal machine translation. Limited by equipment and computing power, the cost of retraining BERT is prohibitive. Therefore, this article focuses on introducing the pre-trained BERT model so that the multi-modal machine translation model can learn more external linguistic knowledge and improve its translation quality.

III. METHODOLOGY
In this section, we design a multi-modal neural network machine translation model incorporating pre-trained BERT, as shown in Figure 1, in which BERT encodes the input text sequence. Because the vocabulary size and sub-word segmentation of the BERT model may differ from those of the existing multi-modal model, we introduce two mapping matrices to align the embedding dimension and the sentence length. Then, the hidden layer on the decoder side attends to the hidden layer sequence encoded by BERT to obtain an additional context vector; this context vector and the original text context vector are merged through a gate to generate a text context vector that integrates linguistic knowledge. The model designed in this paper includes a visual encoder, an RNN text encoder, an additional pre-trained BERT text encoder, and a decoder. We introduce these modules in detail below.
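The alignment step described above can be pictured with a small sketch. It is an illustration under assumptions, not the implementation used in this paper: one matrix (`w_len`) is assumed to map the BERT subword length to the encoder's sentence length, and a second matrix (`w_dim`) is assumed to map BERT's hidden size to the model's hidden size. All names, the random initialisation and the toy sizes are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    assert len(a[0]) == len(b), "inner dimensions must agree"
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for a learned projection."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def align_bert_states(q, w_len, w_dim):
    """Map BERT states of shape (L_bert x d_bert) to shape (N x d_model)."""
    return matmul(matmul(w_len, q), w_dim)


rng = random.Random(0)
L_BERT, D_BERT = 7, 12   # toy subword length and BERT hidden size
N, D_MODEL = 5, 8        # toy encoder length and model hidden size

q = random_matrix(L_BERT, D_BERT, rng)       # stand-in for the BERT output Q
w_len = random_matrix(N, L_BERT, rng)        # hypothetical length mapping
w_dim = random_matrix(D_BERT, D_MODEL, rng)  # hypothetical dimension mapping

q_aligned = align_bert_states(q, w_len, w_dim)
print(len(q_aligned), len(q_aligned[0]))     # 5 8
```

After this mapping, the BERT states have the same shape as the RNN encoder states, so the decoder can attend to both sequences in the same vector space.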

A. VISUAL ENCODER
Given the picture $I$ and its source language description sentence $X = (x_1, x_2, \dots, x_N)$, where $N$ is the length of the source language sentence, and the corresponding target language translation $Y = (y_1, y_2, \dots, y_M)$, where $M$ is the length of the target language sentence, the goal of multimodal neural network machine translation is to construct an end-to-end neural network model of the conditional probability $P(Y \mid X, I)$.

In practice, researchers often use the gated recurrent unit (GRU) (Cho et al., 2014) as the implementation of the recurrent neural network. Specifically, the network uses a forward encoder $\overrightarrow{\Phi}_{enc}$ and a reverse encoder $\overleftarrow{\Phi}_{enc}$ to encode the input sentence from both directions, generating the forward hidden layer sequence $(\overrightarrow{h}_1, \overrightarrow{h}_2, \dots, \overrightarrow{h}_N)$ and the reverse hidden layer sequence $(\overleftarrow{h}_1, \overleftarrow{h}_2, \dots, \overleftarrow{h}_N)$. The specific generation process is shown in formulas 2 and 3:

$$\overrightarrow{h}_i = \overrightarrow{\Phi}_{enc}\left(E_x[x_i], \overrightarrow{h}_{i-1}\right) \tag{2}$$

$$\overleftarrow{h}_i = \overleftarrow{\Phi}_{enc}\left(E_x[x_i], \overleftarrow{h}_{i+1}\right) \tag{3}$$

where $\overrightarrow{\Phi}_{enc}$ and $\overleftarrow{\Phi}_{enc}$ are the GRU activation functions in the two directions, and $E_x[x_i]$ is the word vector of the source word $x_i$. The final hidden layer vector at a given time step is the concatenation of the forward and reverse hidden layer vectors, $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$. Based on this, we can use the hidden layer vector sequence $(h_1, h_2, \dots, h_N)$ to represent the source sentence.

As shown in Figure 2, the visual encoder is a pre-trained convolutional neural network whose parameters are not updated during training. Specifically, the encoder is a 50-layer residual network (ResNet-50) (He et al., 2016) that encodes the visual semantic information into a matrix $A = (a_1, a_2, \dots, a_{196})$, $a_i \in \mathbb{R}^{1024}$, in which each row is a 1024-dimensional feature vector encoding a specific image region. Since the visual representation is used to initialize the hidden layer state of the decoder (a vector of dimension 256), a two-layer fully connected network transforms the dimension of the visual representation. In addition, a dropout layer is added to the network to improve the robustness of the model and strengthen its generalization ability.
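The shapes involved in initializing the decoder from the image can be made concrete with a short sketch. It assumes details the text does not state: the 196 region vectors are mean-pooled before projection, the intermediate width is 512, and tanh is the activation. The feature values and weights are random placeholders, and all function names are hypothetical.

```python
import math
import random


def mean_pool(regions):
    """Average the region vectors (rows of A) into one global vector."""
    n = len(regions)
    return [sum(col) / n for col in zip(*regions)]


def linear(x, w, b):
    """Affine map y = W x + b, with W stored as a list of output rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def dropout(x, p, rng, training):
    """Inverted dropout: zero units with probability p while training."""
    if not training or p == 0.0:
        return list(x)
    return [0.0 if rng.random() < p else xi / (1.0 - p) for xi in x]


def init_decoder_state(regions, params, rng, training=False, p=0.5):
    """Project pooled image features to the decoder's initial hidden state."""
    w1, b1, w2, b2 = params
    h = [math.tanh(v) for v in linear(mean_pool(regions), w1, b1)]
    h = dropout(h, p, rng, training)
    return [math.tanh(v) for v in linear(h, w2, b2)]


rng = random.Random(0)
REGIONS, FEAT, HIDDEN, STATE = 196, 1024, 512, 256  # HIDDEN is an assumption


def rand_layer(n_out, n_in):
    """Random weights and zero bias standing in for trained parameters."""
    w = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


A = [[rng.random() for _ in range(FEAT)] for _ in range(REGIONS)]  # fake features
w1, b1 = rand_layer(HIDDEN, FEAT)
w2, b2 = rand_layer(STATE, HIDDEN)
s0 = init_decoder_state(A, (w1, b1, w2, b2), rng)
print(len(s0))  # 256
```

Dropout is disabled at inference time (`training=False`), which is the usual convention; during training the surviving units are rescaled so the expected activation is unchanged.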
After the bidirectional recurrent neural network text encoder and the visual encoder generate the text hidden layer sequence and the visual representation, a bidirectional attention mechanism realizes fine-grained semantic interaction between text and vision; the improved text hidden layer sequence and visual representation are denoted $C$ and $\bar{A}$, respectively.

IV. PRE-TRAINING BERT
As shown in Figure 3, the BERT model is mainly composed of bidirectional Transformers. In the BERT model, both the left context and the right context are considered when generating the representation at each layer. Given the input source language sentence as a sequence of subword units, the pre-trained BERT model encodes the input sequence into the hidden layer sequence $Q$ according to formula 4.

1) Decoder
The decoder is a conditional gated recurrent unit (cGRU) with four independent attention mechanisms, three of which process text information and one of which processes visual information. Specifically, the cGRU consists of two stacked GRU activation units, $REC_1$ and $REC_2$. At time $t$, $REC_1$ uses the hidden layer vector $s_{t-1}$ of the previous time step and the previous target word $y_{t-1}$ to generate the target word $y_t$ by formula 5, where $s_{t-1}$ is the hidden layer state of GRU unit $REC_1$ at the previous time step and $S_t$ is the new memory of GRU unit $REC_1$. $z_t$ is the update gate of $REC_2$, which determines how the newly generated memory is fused with the hidden state of $REC_1$. $r_t$ is the reset gate of $REC_2$, which determines how much the hidden state of $REC_2$ contributes to the new memory. $W_r$, $U_r$, $W_z$, $U_z$ are the parameters used to generate the reset gate $r_t$ and the update gate $z_t$ of $REC_2$.
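For reference, a GRU step with a reset gate $r_t$ and an update gate $z_t$ is usually written as follows (Cho et al., 2014). This is the textbook form, given as a sketch rather than as this model's exact parameterization: the input $E_y[y_{t-1}]$ (embedding of the previous target word), the candidate state $\tilde{s}_t$, and the matrices $W$ and $U$ are notational assumptions, and some implementations swap the roles of $z_t$ and $1 - z_t$.

```latex
\begin{aligned}
r_t &= \sigma\left(W_r E_y[y_{t-1}] + U_r s_{t-1}\right) \\
z_t &= \sigma\left(W_z E_y[y_{t-1}] + U_z s_{t-1}\right) \\
\tilde{s}_t &= \tanh\left(W E_y[y_{t-1}] + U \left(r_t \odot s_{t-1}\right)\right) \\
s_t &= z_t \odot s_{t-1} + (1 - z_t) \odot \tilde{s}_t
\end{aligned}
```

Here $\sigma$ is the logistic sigmoid and $\odot$ is element-wise multiplication. The reset gate controls how much of the previous state enters the candidate, and the update gate interpolates between the previous state and the candidate.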
During the above process, based on the temporary hidden layer vector $s_t$ and the improved text hidden layer state sequence $C$, the text attention mechanism generates a time-dependent temporary text context vector $c_{t\text{-temp}}$.

Meanwhile, based on the temporary hidden layer vector $s_t$ and the hidden layer state sequence $Q$, the text attention mechanism uses formula 7 to generate a time-dependent temporary text context vector $c_{t\text{-BERT}}$. Then, a gate $g(\cdot)$ is computed from the temporary hidden layer vector by a feed-forward neural network. The gate fuses the temporary context vector $c_{t\text{-temp}}$, generated from the bidirectional recurrent neural network encoder, with the temporary context vector $c_{t\text{-BERT}}$, generated from the pre-trained BERT encoder, to obtain the text context vector $c_t$. The calculation process is shown in formula 8.

At the same time, the visual attention mechanism uses the temporary hidden layer vector $s_t$ and the visual feature matrix $A$ to generate a time-dependent visual context vector $i_t$ by formula 9. Then, under a collaborative attention mechanism, the temporary hidden layer vector $s_t$, the text context vector $c_t$ and the visual context vector $i_t$ realize high-level semantic interaction between text and vision and generate the hidden layer vector $S_t$ of the current time step. Finally, the output probability $p$ is computed from the hidden layer state, the previous target word $y_{t-1}$, the text context vector $c_t$ and the visual context vector $i_t$, where $L_o$, $L_s$, $L_w$, $L_{cs}$, $L_{ci}$ are the corresponding parameters of the model.
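The two text attention steps and the gate can be sketched in a few lines. This is a simplified illustration, not the formulation of formulas 6 to 8: it swaps in plain dot-product attention where the model may use a learned scoring function, it uses a scalar gate computed by a single linear layer, and it assumes $C$ and $Q$ have already been mapped to the same dimension. The names `attend`, `gated_fusion`, `w_g` and `b_g`, and the choice of which vector the gate weights, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, states):
    """Dot-product attention: weighted average of `states` given `query`."""
    weights = softmax([dot(query, h) for h in states])
    dim = len(states[0])
    return [sum(w * h[j] for w, h in zip(weights, states)) for j in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(s_t, c_temp, c_bert, w_g, b_g):
    """Return c_t = g * c_temp + (1 - g) * c_bert, g = sigmoid(w_g . s_t + b_g)."""
    g = sigmoid(dot(w_g, s_t) + b_g)
    return [g * a + (1.0 - g) * b for a, b in zip(c_temp, c_bert)], g


s_t = [0.2, -0.1, 0.4]                      # temporary decoder state (toy)
C = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # RNN encoder states (toy)
Q = [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]      # aligned BERT states (toy)

c_temp = attend(s_t, C)
c_bert = attend(s_t, Q)
# With zero gate weights the gate is exactly 0.5, so c_t is the plain average.
c_t, g = gated_fusion(s_t, c_temp, c_bert, w_g=[0.0, 0.0, 0.0], b_g=0.0)
print(g)
```

A gate near 1 makes the decoder rely on the RNN encoder context, while a gate near 0 makes it rely on the BERT context; in training, the gate parameters would be learned jointly with the rest of the model.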

V. EXPERIMENTS
A. EXPERIMENTAL SETTING
The experiment in this paper also uses the M30k dataset, and the model parameter settings are the same as those of the multimodal neural network machine translation model based on deep semantic interaction described above. Each instance in M30KC is a triple consisting of one image, five English descriptions and five German descriptions, where the English and German descriptions are independent of each other. For the experiments, the dataset was divided into a training set of 29,000 triples, a validation set of 1,014 triples and a test set of 1,000 triples.
For visual information, this paper uses a pre-trained 50-layer residual neural network to extract the local features of the image. As shown in Figure 4, this paper performs a series of preprocessing operations on the text data before the model is trained. It is worth noting that this paper mainly studies English-German translation on the public M30k dataset. The proposed model uses the encoders to represent the context information of sentences and uses the decoder to produce output in a low-resource language while integrating external semantic information, thereby establishing an end-to-end sequence mapping from the source language to the target language. Therefore, the model can also be applied to other languages, such as Slavic ones, as long as a corresponding corpus is available for training.

B. BASELINE
Next, to verify the effectiveness of introducing external linguistic knowledge into multi-modal neural network machine translation, this section designs multiple sets of comparative experiments. To verify the advantages of the deep semantic interaction-based MNMT model designed in this paper, we compare it with the following mainstream baseline models:
Parallel RCNN [37]: this model uses an encoder that contains multiple encoding threads; the long short-term memory units [38] in each encoding thread share parameters.
MNMT [39]: this model introduces two separate attention mechanisms that use image features and text sequences to decode and generate translations.
IMG [40]: this model uses image features as additional input to initialise the hidden units of the decoder.
Soft-Attention [41]: this model uses an encoder-decoder framework that not only considers the sequence of source-side text representations when generating the context vector on the decoder side, but also introduces an additional local attention mechanism over image features to assist in generating better context vectors.
Hard-Attention [41]: this model uses two separate attention mechanisms for generating image and text context vectors, one of which weights all text representations while the other considers only one image feature at each time step.

C. VISUAL ANALYSIS
Since the translation quality of neural network machine translation models is closely related to sentence length, the sentence lengths in the dataset were visually analysed to guide the experimental setup (e.g., length penalty terms), as shown in Figures 5 and 6. The best scores are achieved through multiple rounds of parameter optimization. The length of the source sentences also affects translation quality: overly long sentences may weaken or even lose the dependencies between distant words, while overly short sentences may prevent the model from learning an effective sentence representation, so that translation degenerates into phrase translation.
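A length analysis of this kind can be reproduced with a few lines of code. The sketch below is only illustrative: it counts whitespace-separated tokens, groups lengths into bins of five, and uses three made-up sentences in place of the training set; the function name and bin width are hypothetical choices.

```python
from collections import Counter


def length_histogram(sentences, bin_width=5):
    """Count sentences per length bin; lengths are whitespace token counts."""
    bins = Counter()
    for sentence in sentences:
        n_tokens = len(sentence.split())
        start = (n_tokens // bin_width) * bin_width
        bins[(start, start + bin_width - 1)] += 1
    return dict(sorted(bins.items()))


# Placeholder sentences; a real analysis would read the training corpus.
sample = [
    "a man claps his hands",
    "two dogs run across a grassy field near a river",
    "a child smiles",
]

hist = length_histogram(sample)
for (lo, hi), count in hist.items():
    print(f"{lo:>2}-{hi:<2} {'#' * count} ({count})")
```

The resulting counts can be plotted directly or used to pick a maximum sentence length and a length penalty for decoding.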
Figure 7 shows the translation results generated by each model for a sample English->German multimodal translation. It is worth noting that the German word "klatscht" in blue means "claps" in the source language. In order to explore how the multimodal neural network machine translation model based on deep semantic interaction designed

VI. CONCLUSION AND FUTURE WORK
This paper designs a multi-modal neural network machine translation model that incorporates pre-trained BERT. By introducing external linguistic knowledge, the neural network machine translation model can generate better translations. It also introduces the four main components of the model: the visual encoder, the text encoder, the BERT pre-training model, and the decoder. Next, this article briefly introduces the multi-modal experimental dataset and experimental settings. Since the sentence length is closely related to the model translation quality, this article visually analyzes the

VOLUME 4, 2016

Huang et al. (2016) by integrating the semantic representation of the image as an additional input into the encoder. Calixto et al. (2017b) further studied how to use the semantic representation of the image to initialize the hidden state of the decoder. In the framework of multi-task learning, Elliott and Kadar (2017) decompose multi-modal translation into learning translation models and visual representations. In this way, multi-modal models can be trained on parallel text or external datasets describing images, making it possible to use existing resources. Qian et al. (2018) proposed a new algorithm based on the advantage actor-critic algorithm (Bahdanau et al., 2017) to study the effectiveness of reinforcement learning in multi-modal NMT.

The second type of model holds that both text and image information are crucial in multimodal neural network translation. Therefore, two attention mechanisms are used simultaneously to capture text and image contexts for translation. In this regard, Caglayan et al. (2016a, b) first proposed an end-to-end attention-based multi-modal NMT model, which effectively integrates text and image information into the existing machine translation framework by sharing parameters. In addition, Calixto et al. (2017a) introduced two independent attention mechanisms for text and image information. Delbrouck et al. (2017) empirically studied the effectiveness of enhanced visual and textual representations for improving the quality of multimodal neural machine translation.

The third type of model uses semantic interaction to refine the learned image semantics. Delbrouck and Dupont (2017b) apply a multimodal compact bilinear pooling operation to remove the noise in the image representation based on the text representation. Recently, Yin et al. (2020) proposed a graph-based multi-modal fusion encoder, which is based on a unified graph representing various semantic relations between multi-modal semantic units. Lin et al. (2020) introduced a capsule network to extract image features for translation more dynamically. Yang et al. ( , 2010).

Since then, Mikolov et al. (2013) and Pennington et al. (2014) pioneered word embedding representations, and this pre-training technique was widely used at that time. Dai & Le (2015) trained an autoencoder on unlabeled data and then used the model for downstream tasks. As data scales grow and deep neural network models become widely used, pre-training technology has been widely adopted, has achieved remarkable results, and has received more and more attention. Peters et al. (2018) designed ELMo based on bidirectional long short-term memory units, and fed the pre-trained ELMo as global information into downstream tasks. In 2018, Radford et al. designed the language model GPT based on the Transformer, which is pre-trained on unlabeled data and fine-tuned on specific downstream tasks. Drawing on the design of the Transformer encoder, Devlin et al. designed the BERT model in 2019, which is widely used for the initialization of downstream task models. On the basis of BERT, many variant models have been derived, such as the multilingual pre-training model XLM (Lample & Conneau, 2019), RoBERTa (Liu et al., 2019), which introduces more unlabeled data and removes the next sentence prediction (NSP) module, and XLNet, which is based on a permutation modeling method (Yang et al., 2019b).

In recent years, a large number of pre-training techniques and models, such as ELMo (Peters et al., 2018), GPT/GPT-2 (Radford et al., 2018), BERT (Devlin et al., 2019), the cross-lingual language model XLM (Lample & Conneau, 2019), XLNet (Yang et al., 2019b) and RoBERTa (Liu et al., 2019), have repeatedly set new performance records in their fields, and pre-training technology has attracted widespread attention from the machine learning and natural language processing communities. These models are pre-trained on a large amount of unlabeled data to better learn the representation of the model input. They are then used to provide context-aware word embedding representations of the input sequence for downstream tasks (Peters et al., 2018) or to initialize model parameters for downstream tasks.

FIGURE 1. The overall framework, which consists of an image feature extraction module, an attention-based LSTM module and a joint leading re-position relation network for figure question answering.

The model targets the conditional probability $P(Y \mid X, I)$. In this model, a pre-trained BERT model is added to the multimodal translation model. The source language sentences are encoded by the original encoder of the MNMT model and by the pre-trained BERT model, yielding the hidden layer sequence $C$ and the hidden layer sequence $Q$, respectively. In addition, the visual encoder encodes the picture into the representation $A$. Then, the decoder decodes the encoded text sequence representations $C$ and $Q$ and the encoded visual representation $A$ according to the conditional probability formula 5:

$$\log p(Y \mid X, I) = \sum_{t=1}^{M} \log p\left(y_t \mid y_{<t}, C, A, Q\right)$$
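The factorization above means that the log-probability of a whole translation is the sum of the per-token log-probabilities produced by the decoder. A minimal sketch, using made-up step probabilities in place of real decoder outputs (the function name is hypothetical):

```python
import math


def sequence_log_prob(step_probs):
    """Sum of log p(y_t | y_<t, C, A, Q) over the M target positions."""
    if any(p <= 0.0 or p > 1.0 for p in step_probs):
        raise ValueError("each step probability must lie in (0, 1]")
    return sum(math.log(p) for p in step_probs)


# Hypothetical probabilities assigned to the reference token at each step.
step_probs = [0.5, 0.25, 0.5]
log_p = sequence_log_prob(step_probs)
print(log_p)
```

Working in log space avoids numerical underflow for long sentences, and the negative of this sum is the usual per-sentence training loss.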

FIGURE 2. The overall framework, which consists of an image feature extraction module, an attention-based LSTM module and a joint leading re-position relation network for figure question answering.

FIGURE 3. The overall framework, which consists of an image feature extraction module, an attention-based LSTM module and a joint leading re-position relation network for figure question answering.

FIGURE 4. The overall framework, which consists of an image feature extraction module, an attention-based LSTM module and a joint leading re-position relation network for figure question answering.

FIGURE 5. Distribution of sentence lengths in the training set (English).