Neural Machine Translation Transfer Model Based on Mutual Domain Guidance

The neural machine translation (NMT) model is data-hungry and domain-sensitive, yet it is often impossible to obtain large amounts of labeled data for training it in a target domain, which calls for a domain transfer strategy. To address the problem of domain data mismatch, this paper proposes an NMT transfer model based on mutual domain guidance and establishes continuous guidance through a mutual-guidance framework. At the same time, self-ensemble and self-knowledge-distillation are applied within each domain so that the model does not drift too far from that domain. Furthermore, the batching of domain data is scheduled so that the model trains more effectively. The framework mainly uses out-of-domain pretraining, distillation of existing in-domain models, and data selection during training to guide the in-domain model. These components are unified in a single training framework so that model training can be continuously and effectively guided both in and out of the domain. In this study, three typical experimental scenarios were comprehensively tested and our model was compared with many conventional methods. The experimental results showed that the proposed inter-domain transfer training and curriculum scheduling agent were effective and robust. The most important finding is that this comprehensively guided training framework (intra-domain and inter-domain) is suitable for domain transfer in different scenarios and does not increase the decoding cost.

P_s(x)/P_s(y|x) represent the source domain/task, whereas P_t(x)/P_t(y|x) represent the target domain/task. The transfer of learned information between domains is called domain adaptation. It can be applied to training and/or testing corpora from different domains. The transfer of learned information between different tasks is called multitask learning or system combination.

Neural machine translation (NMT) models have developed rapidly. Many translation frameworks have been reported in the relevant literature, including models using recurrent neural networks (RNN) [5], convolutional neural networks (CNN) [6], and transformers [7], as well as hybrid frameworks [8], [9] and compact frameworks for small devices (including neural architecture search [10] and knowledge distillation (KD) [11]). From the perspective of translation style, existing methods can be divided into multilingual translation systems (including multiway translation such as one-to-many [12], many-to-one [13], and many-to-many [14], and multisource translation).

Similarly, model enhancement methods may be broadly divided into training-based (changing training objectives and processes), structure-based, and decoding-based enhancement. Regarding training objectives, the basic concept is to weight the training data. Shafiq et al. [...]

The overall architecture is shown in Figure 1. The input sequence is represented by token embeddings and position embeddings. The encoder is composed of N layers mixing add (residual) layers, multihead attention layers, and feedforward layers, while the decoder is composed of N layers mixing add layers, multihead attention layers, feedforward layers, and masked multihead attention layers. Unlike the multihead attention layer, the masked multihead attention layer cannot output the text sequence all at once, and the not-yet-generated text is masked.

Specifically, a source input H_0 with position embedding is first transformed into a query matrix Q_0, key matrix K_0, and value matrix V_0. Multihead attention is then applied to Q_0, K_0, and V_0 as follows:

Q_1 = MultiHead(Q_0, K_0, V_0) + Q_0,

where Q_1 is a source representation with global feature information and Q_0 is added to implement residual connections to overcome gradient vanishing.

Similarly, this processing sequence is formally expressed as a function f_SAN, which is used to learn the source representation Q_1.

The self-attention network uses a stack of such layers to learn the source representation.

Here, [···]_N (n ∈ {1, 2, ..., N}) indicates that N identical encoder layers are stacked together. The output Q_N of the N-th attention network layer is the final source representation that is sent to the decoder to learn a translation context vector used to predict the target word. The only difference between the decoder and the encoder is the masked attention layer, because the target output is generated dynamically.
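To make the stacked encoder computation concrete, the following is a minimal PyTorch-style sketch of f_SAN applied N times, with multihead attention and residual connections as described above. It is an illustrative reconstruction under assumed layer names and dimensions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """One encoder layer (f_SAN): multihead attention with a residual connection,
    i.e. Q_n = MultiHead(Q_{n-1}, K_{n-1}, V_{n-1}) + Q_{n-1}, plus a feedforward sub-layer."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, q_prev: torch.Tensor) -> torch.Tensor:
        # Multihead attention over the previous representation; the input is
        # added back as a residual to counter gradient vanishing.
        attn_out, _ = self.attn(q_prev, q_prev, q_prev)
        q = self.norm1(attn_out + q_prev)
        return self.norm2(self.ffn(q) + q)   # feedforward sub-layer with residual

class Encoder(nn.Module):
    """N identical layers stacked; the output Q_N is sent to the decoder."""
    def __init__(self, n_layers: int = 6, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            SelfAttentionLayer(d_model, n_heads) for _ in range(n_layers))

    def forward(self, h0: torch.Tensor) -> torch.Tensor:
        q = h0                      # H_0: token embedding + position embedding
        for layer in self.layers:   # [f_SAN]_N
            q = layer(q)
        return q                    # Q_N, the final source representation
```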

The decoder output computes the probability p(y_j | y_<j, x; θ) of each word using the softmax function, where θ denotes the parameter set of the encoder and decoder.
Here, q(y_j | y_<j, x; θ̄_t) and θ̄_t are the distribution and parameter set of the teacher model, and p(y_j | y_<j, x; θ_t) and θ_t are the distribution and parameter set of the student model, respectively.
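The self-knowledge-distillation objective can be illustrated with a short sketch: the student is trained on a mixture of the usual NLL loss against the reference and a KL term toward the teacher distribution q(y_j | y_<j, x; θ̄_t). This is a generic formulation; the mixing weight alpha and the padding handling are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def self_kd_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 target_ids: torch.Tensor,
                 alpha: float = 0.5,
                 pad_id: int = 0) -> torch.Tensor:
    """Mix the NLL loss on the reference translation with a KL term that pulls
    the student distribution p(y_j | y_<j, x; theta_t) toward the teacher
    distribution q(y_j | y_<j, x; theta_bar_t).

    Shapes: logits are (batch, seq_len, vocab); target_ids are (batch, seq_len).
    """
    vocab = student_logits.size(-1)
    # Negative log-likelihood of the reference (padding positions ignored).
    nll = F.cross_entropy(student_logits.reshape(-1, vocab),
                          target_ids.reshape(-1),
                          ignore_index=pad_id)
    # KL divergence toward the frozen teacher distribution, per target position.
    log_p = F.log_softmax(student_logits, dim=-1)
    q = F.softmax(teacher_logits.detach(), dim=-1)
    kd = F.kl_div(log_p, q, reduction="batchmean")
    return (1.0 - alpha) * nll + alpha * kd
```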

Inspired by Zeng et al. [53], a maximum strategy θ̄_t = θ*_t (where θ*_t is the model parameter set yielding the best performance in previous rounds), an average strategy (in which θ̄_t averages the parameter sets from previous rounds), or a weighted average strategy can be adopted for the self-ensemble model. In the average and weighted average strategies, the student model is typically more robust because it gathers information from the previous iterations of the teacher model.
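The three self-ensemble choices for the teacher parameters θ̄_t can be sketched as follows; the checkpoint bookkeeping and the weighting scheme are illustrative assumptions rather than the paper's exact procedure.

```python
from typing import Dict, List, Optional, Sequence
import torch

def teacher_params(checkpoints: List[Dict[str, torch.Tensor]],
                   dev_scores: Sequence[float],
                   strategy: str = "average",
                   weights: Optional[Sequence[float]] = None) -> Dict[str, torch.Tensor]:
    """Build the teacher parameter set theta_bar_t from previous checkpoints.

    "max":      take the checkpoint with the best dev score (theta*_t).
    "average":  average the parameters of the previous checkpoints.
    "weighted": weighted average, with weights summing to 1 (e.g. favoring
                more recent checkpoints).
    """
    if strategy == "max":
        best = max(range(len(checkpoints)), key=lambda i: dev_scores[i])
        return checkpoints[best]

    if strategy == "average" or weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)

    return {name: sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints))
            for name in checkpoints[0]}
```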

Inspired by the concept of co-teaching [55], the two models in the proposed approach, i.e., the out-of-domain and in-domain models, perform transfer learning through pretraining. Figure 2 illustrates the alternating training processes for in-domain and out-of-domain data. Each round of out-of-domain parameters is used to initialize the following round of in-domain parameters and vice versa. These processes are repeated to complete the mutual transmission of information. Through this alternating, iterative training process, each domain can absorb knowledge beneficial to it. Therefore, in-domain and out-of-domain features can be transferred to each other in a model-level transfer that better retains the knowledge shared between the models and provides better transfer performance. Here, we need to evaluate the quality of the model. The source and target domain data C_s, C_t are divided into training sets C^tr_s, C^tr_t and development sets C^val_s, C^val_t that are used to train and evaluate the model, respectively.
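A compact sketch of this alternating, mutually guided transfer loop is given below. The routines train_model, trans_model, and evaluate stand in for TrainModel(·), TransModel(·), and dev-set evaluation in Algorithm 1; their signatures, the number of rounds, and the best-model bookkeeping are assumptions for illustration.

```python
from typing import Any, Callable

def mutual_domain_transfer(train_model: Callable[..., Any],
                           trans_model: Callable[..., Any],
                           evaluate: Callable[..., float],
                           train_out, train_in, dev_out, dev_in,
                           rounds: int = 5):
    """Alternately train the out-of-domain and in-domain models; in each round one
    model is initialized from the other so that knowledge flows in both directions."""
    # Initialization phase: plain (non-distillation) training in each domain.
    theta_out = train_model(train_out)   # out-of-domain parameters, trained on C_s^tr
    theta_in = train_model(train_in)     # in-domain parameters, trained on C_t^tr

    best_in, best_theta_in = float("-inf"), theta_in

    # Iteration phase: mutual transfer of information between the two models.
    for _ in range(rounds):
        # In-domain round: initialize from the out-of-domain model and
        # transfer-train with the self-knowledge-distillation objective.
        theta_in = trans_model(init=theta_out, data=train_in, teacher=theta_in)
        score_in = evaluate(theta_in, dev_in)          # evaluated on C_t^val
        if score_in > best_in:
            best_in, best_theta_in = score_in, theta_in

        # Out-of-domain round: initialize from the refreshed in-domain model.
        theta_out = trans_model(init=theta_in, data=train_out, teacher=theta_out)

    return best_theta_in, theta_out
```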

Algorithm 1 is the domain transfer algorithm for the proposed NMT model, and it is mainly divided into two stages.

(1) In the initialization phase, the main task is to complete the initialization of the in-domain and out-of-domain model parameters.

• The TrainModel(·) function is used to train the model. The non-distillation objective function is used on the training set C^tr_t, and the initial in-domain model parameter set is obtained. The same process is performed for the source domain.

(2) In the iteration phase, the main task is to complete the information transfer between the in-domain and out-of-domain models.

• The TransModel(·) function is used for model transfer. The objective function combining L_NLL(θ^(k)_t) with the self-knowledge-distillation term is used for in-domain transfer training and evaluation, and the out-of-domain model is transfer-trained and evaluated in the same way.

• Reward R. Given a state and an action, the scheduling agent provides an immediate reward r(s, a) according to the current training scenario.

• Discount Rate γ. γ ∈ [0, 1] is a discount factor that measures the current value of long-term rewards.
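As a small illustration of the reward and discount definitions (the per-step reward is the validation perplexity difference used later in Algorithm 2, and γ discounts future rewards), the following sketch can be considered; the sign convention of the reward and the helper names are assumptions.

```python
from typing import Sequence

def step_reward(ppl_before: float, ppl_after: float) -> float:
    """Immediate reward r(s, a): the drop in validation-set perplexity after
    updating the NMT model on the sample selected by the agent (Algorithm 2, line 9)."""
    return ppl_before - ppl_after

def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Cumulative reward the agent maximizes: sum over k of gamma**k * r_{t+k}."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total
```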

The entire reinforcement process can be described as follows. In each time step, the scheduling agent selects an action a ∈ A according to the current data and model state s ∈ S and obtains the corresponding reward r(s, a). The original state is updated to a new state s' according to the state transition probability p(s'|s, a). The goal of the scheduling agent is to identify the optimal policy µ_φ : S × A → [0, 1] that maximizes the expected cumulative reward. In the proposed approach, we adopt the classic deep deterministic policy gradient (DDPG) algorithm. The DDPG algorithm utilizes an actor-critic framework that can model continuous behavior. Compared with a model based solely on an actor, such as the REINFORCE algorithm [56], the critic reduces the update variance and accelerates the convergence of the model.

The framework consists of online actor-critic networks (a_t = µ_φ(s_t) and Q_ω(s_t, a_t)), target actor-critic networks (µ_φ'(s_t) and Q_ω'(s_t, a_t)), and an experience replay memory.

• In the second stage, samples are drawn from the experience replay memory to update the actor and critic networks under the two objective functions. According to the temporal-difference learning method [58], the critic network is updated with the objective of Equation (11); according to the deterministic policy gradient theorem [59], the actor network is updated with the objective of Equation (13).

Algorithm 2 Training Algorithm for the Curriculum Scheduling Agent
1: Initialize the actor µ_φ and the critic Q_ω with parameters φ and ω
2: Initialize the target networks µ_φ' and Q_ω' with weights φ' ← φ and ω' ← ω
3: Initialize the experience replay memory M and the soft update parameter τ
4: for k = 1 ... K do
5:   for number of RL training iterations do
6:     Observe the state s_t
7:     Obtain the action a_t according to the policy µ_φ with ε-greedy exploration
8:     Update p(y|x; θ^(k)) with the selected sample to obtain p(y|x; θ^(k'))
9:     Calculate the perplexity difference on the validation set C^val between p(y|x; θ^(k')) and p(y|x; θ^(k)) as r_t
10:    Observe the new state s_{t+1} and store the transition (s_t, a_t, r_t, s_{t+1}) in M
11:    Sample a mini-batch of N* transitions (s_i, a_i, r_i, s_{i+1}) from M using the prioritized experience replay sampling technique
12:    Update the critic network through Equation (11)
13:    Update the actor network through Equation (13)
14:    Update the parameters in the SRM with backpropagation from the update signal of the actor
15:    Update the target networks: φ' ← τφ + (1 − τ)φ', ω' ← τω + (1 − τ)ω'
16:  Select data Ĉ^tr from C^tr using µ_φ
17:  Update p(y|x; θ^(k)) with Ĉ^tr to obtain p(y|x; θ^(k+1))
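Since the exact forms of Equations (11) and (13) are not reproduced here, the following sketch uses the standard DDPG updates (TD-error critic loss, deterministic policy gradient for the actor, and the soft target-network update of line 15). It should be read as a generic illustration rather than the paper's exact objective; the network modules, optimizers, and batch layout are assumptions.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma: float = 0.99, tau: float = 0.1):
    """One DDPG update step on a mini-batch sampled from the replay memory M."""
    s, a, r, s_next = batch   # tensors: states, actions, rewards, next states

    # Critic: minimize the TD error against the target networks (Eq. (11) analogue).
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient, i.e. maximize Q(s, mu(s)) (Eq. (13) analogue).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks (Algorithm 2, line 15):
    # phi' <- tau*phi + (1 - tau)*phi',  omega' <- tau*omega + (1 - tau)*omega'.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, p_targ in zip(net.parameters(), target.parameters()):
            p_targ.data.mul_(1.0 - tau).add_(tau * p.data)
```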

For the general-domain scenario, the development and testing sets (2k in total) are denoted by the tag "GEN." The in-domain data were the same as those in the specific-domain setting.

All parameters were initialized using a uniform distribution over [−0.1, 0.1]. Our model used the Adam algorithm as the optimizer, and the initial learning rate was set to 0.0005. When the performance on the development set did not exceed that of the previous eight rounds of checkpoints, the learning rate was set to 0.8 times its original value. When the performance on the development set did not improve within 20 rounds of checkpoints, the training process was terminated (one checkpoint was equivalent to 1000 updates).
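The optimization schedule described above (Adam with an initial learning rate of 0.0005, a 0.8× learning-rate decay once the dev score has stagnated for eight checkpoints, early stopping after 20 stagnant checkpoints, one checkpoint per 1000 updates) can be sketched as follows; the model and dev-evaluation hooks are placeholders, and applying the decay at every subsequent stagnant checkpoint is an assumption.

```python
import torch

def train_with_schedule(model, train_batches, evaluate_dev,
                        init_lr: float = 5e-4, ckpt_every: int = 1000,
                        decay_patience: int = 8, stop_patience: int = 20):
    """Adam training with the checkpoint-based LR decay and early stopping above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=init_lr)
    best_score, stagnant = float("-inf"), 0
    for step, batch in enumerate(train_batches, start=1):
        loss = model(batch)                 # assumed to return the training loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % ckpt_every == 0:          # one checkpoint = 1000 updates
            score = evaluate_dev(model)     # e.g. BLEU or -perplexity on the dev set
            if score > best_score:
                best_score, stagnant = score, 0
            else:
                stagnant += 1
            if stagnant >= stop_patience:   # no improvement for 20 checkpoints: stop
                break
            if stagnant >= decay_patience:  # no improvement for 8 checkpoints: decay
                for group in optimizer.param_groups:
                    group["lr"] *= 0.8
```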

For the curriculum scheduling agent, the actor-critic framework was adopted and the system was constructed based on reference [64]. The experience replay memory size was set to 2500 and a warm-up phase of 500 steps was performed. The mixing factor of the target and online networks was τ = 0.1. The target network was updated every 100 steps. The discount factor was γ = 0.99.

The performance of the specific domain is better than that of the general domain, which can be attributed to the fact that the general domain contains information from multiple domains, making the data relatively noisy. From the perspective of the in-domain and out-of-domain experimental scenarios, the transfer effect of the news corpus and the general domain is better. One possible reason for this is that the news corpus has a stronger generalization ability in the domain. From the perspective of the specific-domain experimental scenarios, we can make the same observation.
Table 2 presents the performance for the low-resource experimental scenario. NEWS-50 refers to the use of 50k in-domain samples, and the other results are named similarly. Because of the mismatch between the amounts of out-of-domain and low-resource in-domain data, the fine-tuning model, the KD method, and our method adopt only a single approach, namely, using out-of-domain data for initialization, and they do not iteratively improve each other. Additionally, the in-domain data are oversampled in the discriminant and domain-label models. Because of this in-domain-oriented training method, this section only discusses in-domain performance.

For the overall and classification baseline models (various systems are compared in Table 2), the conclusions are similar.

From the commercial point of view, this framework only affects the training stage and has no impact on the decoding stage. It is therefore suitable for offline training and deployment to an online system. In the future, we hope to develop a transfer training framework suitable for more domains and to further reduce the training cost. In addition to cross-domain transfer, we hope that this method can also be applied to other similar tasks, such as cross-lingual transfer.