Low-Resource Neural Machine Translation: A Systematic Literature Review

In this study, a systematic literature review was conducted to examine significant works in the literature on low-resource neural machine translation. Three research questions were defined for this purpose, and 45 studies were selected for review according to the inclusion and exclusion criteria. The first research question identifies the study directions and language pairs used in low-resource neural machine translation. The second identifies which deep learning methods are used in low-resource neural machine translation and which metrics are used to evaluate them. The third determines the bilingual and monolingual corpora used in the studies and the preferred development environments. In addition, the studies with the most commonly used language pairs were analyzed, and recommendations for future studies were made.


I. INTRODUCTION
Machine translation (MT) is a concept proposed in 1949 by Warren Weaver, who thought that computers could be used to automatically translate one language into other languages [1]. MT has received great attention in recent years, as it shares goals with natural language processing (NLP) and machine learning (ML). Apart from its scientific importance, MT also has great potential in the field of communication [2]. Before deep learning approaches were applied to MT, rule-based and statistical machine translation methods were generally used.
Rule-based machine translation was the first MT concept. It is based on the assumption that all languages contain words that have the same meaning, and it was a popular method before the 2000s [3]. In this method, translation can be considered as placing the words of the source sentence in the appropriate place in the target language. Since the meaning of a sentence may be represented by different word orders in different languages, such a word substitution method must comply with the syntax rules of the languages to be translated. In such methods, rules must be designed for source language analysis, translation from the source language to the target language, and target sentence generation. However, since a language has so many syntactic rules, writing grammar rules in this way is very difficult and requires a lot of effort. Although rule-based methods look good in theory, they lag far behind in practice because the defined rules cannot capture the implicit rules of a language. The main disadvantage of rule-based methods is that they ignore the need for contextual information in the translation process, which makes machine translation unreliable.
The associate editor coordinating the review of this manuscript and approving it for publication was Alba Amato.
Statistical machine translation (SMT) methods, proposed in the 1990s, are systems that can learn translation rules between words or phrases using probabilistic models [4]. SMT has achieved success in industry, especially at large companies such as Google and Microsoft. Unlike the rule-based approach, SMT models consider the translation process from a statistical perspective. SMT models find words or phrases with the same meaning through bilingual parallel corpora. The most widely used form of SMT is phrase-based SMT [5], which roughly includes preprocessing, word alignment, sentence alignment, and language model (LM) training. The basis of this model is a vocabulary that matches phrases between the source and target languages. Unlike rule-based methods, the translation model can use contextual information in the sentence. Although SMT gives better results than rule-based methods, the components that must be designed manually, such as the language model and the reordering model, prevent SMT from taking full advantage of the parallel corpora, and the translation quality remains far from the desired level [6].
Traditional machine learning techniques rely primarily on human-generated features derived from linguistic intuition, which is a trial-and-error process and is frequently far less accurate at capturing the core of the original data. SMT techniques have done well in the MT area in recent years; however, certain fundamental shortcomings still need to be resolved. First, since SMT methods create the translation by splitting the source sentence into several phrases and translating these phrases, they ignore long-term dependencies in long sentences, which causes inconsistencies in the translation results. Second, existing systems often have many complex sub-components, such as the language model and the reordering model. It becomes increasingly challenging to tune and combine these sub-components to produce a stable output as their number rises. These circumstances have become an obstacle to the advancement of SMT architectures. This problem is mainly due to the LM component. An LM provides important information, such as the probability of a specific word (or phrase) occurring given the prior words. Therefore, creating an effective LM greatly affects translation performance.
While research on LM components based on statistical methods has become almost static, neural language models (NLMs), which use a neural network to model text data directly, have emerged. Owing to the distributed representation of words, NLMs reduce sample sparsity in comparison to classic LMs, because words share statistical weights rather than being treated as independent variables. However, LMs created using feed-forward networks have some problems that stem from the network structure. The most important one is the long-term dependency problem in sentences. Language models using recurrent neural network (RNN) structures have been put forth as a solution to this issue [7]. This method processes one word per time step, so the whole sentence is modeled. Thus, the real conditional probability can be modeled without the limit of a context window [8]. Language models built using RNNs can process inputs of any size and use information from previous steps, but the computations are slow as a result of the numerous parameters.
The use of neural networks for MT (neural machine translation, NMT) took many years to mature because of the low performance of the models and hardware limitations on the computations. The first studies aimed to build NLMs for the target language [9] and to apply statistical models [10]. These ideas have been taken further, including systems that score sentence pairs with a feed-forward network [11] and work that adds a source context window to neural language models [12], [13]. The use of deep learning approaches for MT started with studies in the last 10-12 years. With the spread of deep learning in 2010, the field of NLP has shown great progress. Likewise, the use of deep neural networks for MT has become widespread. Deep learning-based approaches, which are a completely new approach to MT, were first introduced in 2013 [14], [15]. Compared to other models, NMT models require less explicit grammatical knowledge and produce results at least as good as other methods [16]. Numerous studies have shown that NMT outperforms traditional SMT models and is industrially applicable to a greater extent [17].
With the increasing success of deep learning in the field of NLP, NMT models are nowadays designed for end-to-end learning. That is, a sequence of words in the source language is directly mapped to a sequence of words in the target language. The purpose of the learning process is to obtain the target sentence by treating the mapping between the two sentences as a high-dimensional classification problem in a semantic space. Encoding and decoding are the two components that make up this process in contemporary NMT models. An example visualization of the basic encoder-decoder structure is given in Figure 1. Encoder-decoder models generate the target sentence T = (t_1, t_2, ..., t_m) by maximizing the conditional probability given the source sentence S = (s_1, s_2, ..., s_n). In doing so, the model uses both the previously predicted words and information from the source sentence. This is therefore a process of creating a recurrent neural language model (RNLM). The encoder network processes the source sentence sequentially, word by word, compressing the variable-length sequence into a fixed-length vector. The target sentence is subsequently generated by the decoder using the encoder's final hidden state. This is referred to as end-to-end translation because the encoder-decoder structure translates directly from the source data to the target result, i.e., there is no explicit output in the intermediate step.
The idea behind the encoder-decoder structure is to map the source sentence to the target sentence using an intermediate vector in a semantic space. The semantic meaning of both languages can be represented by this intermediate vector. RNN-based NMT models differ from one another in three key ways: (a) the way the sentence is given to the model; (b) the type of neural network used (SimpleRNN, LSTM, GRU); and (c) the depth of the RNN layer [18], [19]. Some models use a CNN structure instead of RNN units in the encoder-decoder framework [20], [21]. There are various benefits to using convolution rather than recurrence in NMT models. The hierarchical structure of CNNs connects distant words in the sentence more quickly than sequential structures, requires fewer sequential calculations, and is easier to parallelize [22]. Because of these advantages, CNN-based models can facilitate the learning process. However, the models become deeper and more challenging to train when numerous convolution layers are stacked for translating long sentences [22].
The biggest problem in encoder-decoder structures is that all the information in the source sentence must be compressed into a fixed-size vector. As a result, the performance of the models decreases as the length of the sentence to be translated increases. To solve this problem, models with attention mechanisms that perform alignment and translation at the same time have been proposed. The first example of the attention mechanism was proposed in 2014 and has come to the fore as a very important development in the field of NLP [23]. When generating each word, a translation model with an attention mechanism looks for the few places in the source sentence where the key information is concentrated. Then, using the context vectors connected to these source sentence positions and the words predicted in earlier time steps, the model predicts a new word. The most important feature of this structure is that the source sentence does not need to be encoded into a fixed-size vector. Instead, the input is encoded as a sequence of vectors, and a subset of these vectors is used in the decoding step. In this way, translation performance increases for long sentences. The decoder uses this method to determine which elements of the source sentence should be given significance.
The attention mechanism has undergone many changes since it was first proposed and has been used in different ways. In NMT, however, it is most frequently used as an interface between the encoder and the decoder. A significant refinement of the attention mechanism is the self-attention mechanism proposed in 2017 [24]. In the proposed structure, named Transformer, RNN units are removed and a structure that relies entirely on attention is created. Self-attention calculates the word dependencies within the sentence sequence and thus obtains a stronger attention-based sequence representation. In the computational steps, self-attention first derives three vectors from the original embedding, each serving a different purpose. These are the Query (Q), Key (K), and Value (V) vectors. Self-attention, which can be thought of as a mapping from Q, K, and V to an output, is the Transformer's central element. Scaled dot-product attention and multi-head attention, two crucial attention mechanisms, are used to achieve this in the original Transformer study. These two key components of the Transformer model are depicted in Figure 2.
In scaled dot-product attention, the dot product of the queries Q and the keys K (of size d_k) is calculated, and the outcome is scaled by dividing by √d_k. The results are then put through the softmax function to produce weights, and the attention output is calculated by multiplying these weights by V. In practice, the attention computation is carried out concurrently over a set of queries packed into a matrix Q [24]. The keys and values are likewise packed into the matrices K and V. The formulation of this method is as follows [24]:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

FIGURE 2. A: Scaled dot-product attention, B: Multi-head attention [24].

The main idea in the Transformer structure is to perform as many attention operations as there are attention heads (H, with H = 8 in the original Transformer structure) instead of performing a single operation on the sentence. For each attention head, the query, key, and value vectors are linear transformations of Q, K, and V. The attention output is produced on each head using scaled dot-product attention. The result of multi-head attention is the combination of the outputs from all self-attention heads. This is formulated as follows [24]:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$

Here, all of the W matrices are parameters that can be learned. In this way, a much stronger representation is created and operations can be performed in parallel. The dimension of each attention head is usually the model dimension divided by H, to avoid increasing the number of parameters. Multi-head attention can thus be viewed as multiple sub-networks running in parallel, each with a different view of the key-value set, that project the representation into various sub-spaces.
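The scaled dot-product attention step described above can be illustrated with a minimal sketch in plain Python. The function names and the toy matrices are assumptions made for illustration; a single head is shown, without the learned projection matrices of multi-head attention.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for list-of-lists matrices."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Attention output: weighted sum of the value vectors.
        context = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example (made-up numbers): 2 queries, 3 key-value pairs, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to one, and each row of `outputs` is a convex combination of the value vectors. In multi-head attention, the same routine would be applied H times to differently projected copies of Q, K, and V, and the results concatenated.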
The Transformer model is displayed in Figure 3. Like earlier NMT models that have been successful in the literature, the Transformer model is built on the encoder-decoder structure. One of the difficulties encountered in self-attention-based models is that attention itself has no concept of order [22]. Key-value pairs are accessed only based on the correspondence between the key and the query, not based on the location of the key in memory. Since queries, keys, and values in recurrent NMT are obtained from RNN states and the RNN structure provides a strong sequential signal, this does not present a significant challenge there [22], [24]. The Transformer model does not use recurrence; hence, handling the order of the words in the input sequences requires knowledge of the relative or absolute position of the tokens in the sequence. To overcome this, a method called positional embedding (PE), which uses sine and cosine functions, is applied after the input and output embedding layers. By adding these values to the input and output word embeddings, the embeddings become position-aware. This process is carried out as follows [24]:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Here, pos is the position of the token in the sequence, i indexes the dimension pairs, and d_model is the embedding dimension.

After PE, the resulting output is sent to the encoder. The Transformer encoder is a stack of N = 6 identical layers. Each layer is made up of two sub-layers. The first sub-layer is a multi-head self-attention layer, while the second is a fully connected feed-forward layer. Each of these sub-layers has a residual connection surrounding it, which is followed by a layer normalization operation. The Transformer decoder is also a stack of N = 6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder features a third sub-layer that performs multi-head attention on the output of the encoder stack. The outputs are produced using residual connections and layer normalization, just as in the encoder. In addition, the first multi-head attention sub-layer in the decoder is used as masked multi-head attention. To stop the model from attending to later tokens, subsequent embeddings are masked in this sub-layer. This ensures that at location t the model can only use information from outputs generated at locations before t. Once the output is obtained from the decoder stack, it moves to the inference stage, where a softmax layer is used to generate the target sentence.
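The sinusoidal positional embedding and the decoder masking described above can be sketched as follows. This is a minimal illustration in plain Python; the function names and the small sizes are assumptions. The mask assumes the usual setup in which the decoder input is shifted right by one position, so allowing position t to see positions s ≤ t of the shifted input corresponds to using only outputs generated before t.

```python
import math

def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for j in range(d_model):
            i = j // 2  # index of the sine/cosine pair this dimension belongs to
            angle = pos / (10000 ** (2 * i / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def causal_mask(length):
    """mask[t][s] is True when position t may attend to position s (s <= t)."""
    return [[s <= t for s in range(length)] for t in range(length)]

pe = positional_encoding(num_positions=4, d_model=8)
mask = causal_mask(4)
```

Each row of `pe` would be added element-wise to the word embedding at that position, and positions where `mask` is False would have their attention scores suppressed before the softmax.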
Since 2013, neural networks using the encoder-decoder structure have become mainstream in MT studies. Today, the Transformer architecture, which is built on this structure, stands out as the most widely used technique in NMT studies. Unlike other NLP tasks, MT involves two languages. Therefore, the success of an MT model on a language pair is highly dependent on the number of parallel sentences available between the two languages. For NMT systems to achieve fluent results, large amounts of parallel data are needed. High-resource languages (English, German, French, etc.) have no problem with parallel data. However, this is not the case for low-resource languages, and this is considered a major challenge for the NMT field. As a result, NMT research on low-resource languages has significantly increased in recent years. In NLP, the low-resource problem mainly arises from the scarcity of resources for the languages considered or for the domains studied [25], [26]. Whether a language is low-resource or high-resource can be determined based on the size of the available data and the NLP tools that can be used [25], [26], [27]. Additionally, a language is regarded as low-resource for NMT when it has only a small parallel corpus with another language, even if a large amount of monolingual data is available.
The main purpose of this study is to perform a systematic literature review (SLR) on the NLP and deep learning methods used in low-resource NMT. Although there are many research studies on low-resource NMT, to the best of our knowledge there are very few systematic reviews on this subject. Research articles for our study were carefully selected to examine the deep learning techniques and the NLP methods used for low-resource NMT.
The remainder of this study is organized as follows: Section II gives a literature review. Section III describes the methodology by which the reviewed studies were obtained. Findings and evaluations are shared in Section IV. Section V includes the discussion and conclusion.

II. LITERATURE REVIEW
When the studies in the field of low-resource NMT are analyzed, it is seen that the methods used exploit monolingual data and auxiliary-language data in addition to the limited parallel corpus available. This section examines the most widely used methods in low-resource NMT in general terms.

A. USE OF MONOLINGUAL DATA
Low-resource language pairs often perform poorly on the MT task due to the lack of parallel bilingual data. To address this issue, the use of monolingual data is recognized as an effective strategy for improving translation quality in low-resource scenarios. Monolingual data is especially helpful because it is more abundant and easier to obtain than bilingual parallel data and provides a wealth of linguistic and contextual information. Many studies have made extensive use of monolingual data in NMT systems, and these uses can be categorized in several ways.
131778 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
One of the most widely used methods for exploiting monolingual data is back-translation (BT). Back-translation is the reverse translation of monolingual sentences from the target side into the source language using a translation system, in order to create pseudo-parallel sentence pairs [28], [29]. It has been shown that translating target sentences into source sentences in this way usually yields better results [26]. An essential limitation of the BT method is the assumption that an NMT system exists in the BT direction, and the quality of that NMT system affects the model to be created. In addition, the synthetic data generated using BT contains more noise than the original data. Building on this method, an iterative BT method that improves the success of NMT has been proposed in the literature [30]. In the iterative method, the source and target data are translated using NMT models in opposite directions. This translation process continues until there is no improvement on either side. There are many different back-translation methods in the literature, and studies have shown that this method provides performance gains in NMT systems [29], [31].
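The basic back-translation procedure described above can be sketched as follows. The function names, the toy lexicon standing in for a trained target-to-source NMT system, and the example sentences are all hypothetical and serve only to show the data flow.

```python
def back_translate(target_monolingual, translate_tgt_to_src):
    """Build pseudo-parallel (synthetic source, real target) sentence pairs."""
    pairs = []
    for tgt in target_monolingual:
        synthetic_src = translate_tgt_to_src(tgt)  # reverse-direction translation
        pairs.append((synthetic_src, tgt))
    return pairs

def toy_reverse_model(sentence):
    """Hypothetical stand-in for a trained target-to-source NMT system."""
    lexicon = {"hallo": "hello", "welt": "world"}
    return " ".join(lexicon.get(word, word) for word in sentence.split())

# Small real parallel corpus plus target-side monolingual data (invented examples).
real_parallel = [("good morning", "guten morgen")]
monolingual_target = ["hallo welt"]

synthetic = back_translate(monolingual_target, toy_reverse_model)
training_data = real_parallel + synthetic
```

The forward (source-to-target) model would then be trained on `training_data`. In the iterative variant, the improved forward model is used in turn to generate synthetic data for the reverse model, and the two steps alternate until neither direction improves.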
Utilizing monolingual data and pre-trained models is helpful for a variety of language generation and understanding tasks [32], [33]. Since NMT requires both language understanding (encoder) and generation (decoder) capability, pre-training can be extremely beneficial, especially in low-resource scenarios [34]. Depending on how the encoder and decoder of the NMT model are treated, studies on language model pre-training can be categorized as separate or joint pre-training. Some studies pre-train the encoder and decoder separately. In [35], the authors experimented with initializing the encoder and decoder with different models, including BERT, GPT-2, RoBERTa, and random initialization. An LM can be added to the target side of the NMT model to increase the fluency of the output text. This process is known as LM fusion and is classified into shallow fusion and deep fusion [32], [36]. In shallow fusion, the LM is used to score words produced by the NMT system's decoder at inference time or during training [36]. In deep fusion, the NMT architecture is modified to integrate the LM with the decoder, which improves performance [36]. One drawback of separate pre-training of the encoder and decoder is that the attention connecting them is not pre-trained, although it is crucial for linking source and target representations in the NMT model. To improve translation accuracy, some studies suggest pre-training the encoder, decoder, and attention jointly [37], [38].
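Shallow fusion at inference time can be illustrated with a small sketch. It assumes a common log-linear formulation in which the decoder's log-probability is combined with a weighted LM log-probability; the weight `lm_weight`, the probability floor, and the toy distributions are assumptions made for illustration and are not taken from the reviewed studies.

```python
import math

def shallow_fusion_step(nmt_probs, lm_probs, lm_weight=0.3):
    """Pick the next token by log p_NMT + lm_weight * log p_LM."""
    best_token, best_score = None, -math.inf
    for token, p_nmt in nmt_probs.items():
        p_lm = lm_probs.get(token, 1e-9)  # small floor for tokens the LM has not seen
        score = math.log(p_nmt) + lm_weight * math.log(p_lm)
        if score > best_score:
            best_token, best_score = token, score
    return best_token, best_score

# Hypothetical next-token distributions from the NMT decoder and the target-side LM.
nmt_probs = {"house": 0.40, "home": 0.38, "hose": 0.22}
lm_probs = {"house": 0.10, "home": 0.60, "hose": 0.30}

token_without_lm, _ = shallow_fusion_step(nmt_probs, lm_probs, lm_weight=0.0)
token_with_lm, _ = shallow_fusion_step(nmt_probs, lm_probs, lm_weight=0.3)
```

With the weight set to zero the decoder's own top choice is kept, while a positive weight lets the LM shift the decision toward the token it finds more fluent. Deep fusion differs in that the LM is integrated into the decoder architecture rather than combined only at scoring time.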
Recently, unsupervised models that use adversarial training frameworks based on Generative Adversarial Network (GAN) structures, together with monolingual corpora and cross-lingual embeddings, have become popular. In this adversarial framework, initial translation models are usually created for both the forward and backward directions, and iterative BT is then performed to improve translation performance jointly [26]. Thanks to adversarial training, the neural network can learn a reliable translation mapping. The translation task is therefore framed by a generator and a discriminator in a GAN architecture. The reconstruction loss results from reconstructing forward and backward noisy translations [26]. The discrimination loss comes from a binary classifier that distinguishes between the translated and original target texts in order to distinguish between the source language and the target language [26]. An adversarial loss function balances the reconstruction loss of the back-translation against the discrimination loss of the classifier. This process produces better and more fluent translations for low-resource languages (LRLs). Existing approaches in the literature on unsupervised NMT change the adversarial framework by incorporating extra adversarial phases or extra loss functions during the optimization step [39].

B. USING DATA IN THE AUXILIARY LANGUAGE
Human languages share similarities in several ways: languages in the same or a similar language family, or of a similar type, can share similar writing style, vocabulary, and grammar; languages can affect one another, and a word from one language may be adopted as is in another [34]. In addition, translating between a low-resource language pair can be aided by a corpus of related languages [47]. The methods of utilizing data from different languages in low-resource NMT can be categorized as multi-lingual translation, transfer learning, and pivot translation.
The significant advantage of multi-lingual training is that multiple language pairs can be trained in a single model through parameter sharing. Compared to training multiple separate models, the cost of model training and maintenance can be significantly decreased, and information can be learned collectively from multiple languages to help LRLs [34], [58]. Low-resource language pairs can benefit from high-resource language pairs through joint training. When the languages in the model are related and the number of languages is relatively small, better results can be obtained than with bilingual models [26], [59]. Multi-lingual methods are more practical than building bilingual models because they include many languages. A review of the literature shows that multi-lingual methods can be modeled as one-to-many (from one source language to many target languages), many-to-one (from many source languages to one target language), or many-to-many (from many source languages to many target languages) [26], [43]. These methods are built by applying a single encoder-decoder, multiple encoders with a single decoder, a single encoder with multiple decoders, or multiple encoder-decoder models. Finally, multi-lingual NMT allows translation between language pairs not seen during training, so-called zero-shot translation (ZST) [60].
Transfer learning (TL) can be defined as the application of knowledge gained from solving one machine learning problem to a different but related problem. TL is one of the most popular methods for low-resource NMT. Initially, an NMT model is trained as a "parent" on resource-rich language pairs, typically with ample training data. Subsequently, this parent model is fine-tuned on a low-resource language pair, where training data is limited, to obtain the "child" model [61]. Fine-tuning is required to transfer information from the parent model to the child model. There are different approaches to fine-tuning, although it is unclear which is better. These approaches are: (i) completely transferring the parent to the child, (ii) fine-tuning the entire child model, and (iii) fine-tuning specific layers of the encoder-decoder model. The simplest way to fine-tune is to train the model on a high-resource language pair and then adjust the parameters using the low-resource language pair [62]. During fine-tuning, certain parameters can be kept fixed. This choice is purely a matter of model design. Furthermore, besides a bilingual parent model, using a multilingual parent model is another option [43]. Because a multilingual model has restricted capacity, fine-tuning can drive the model to focus on the desired low-resource languages, boosting performance. As a result, a low-resource language pair can benefit from several auxiliary languages.
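The parent-child scheme with optionally fixed parameters can be sketched as follows. This is a toy illustration in plain Python: the parameter names, the prefix-based freezing rule, the plain gradient-descent update, and all numbers are assumptions, not the setup of any reviewed study.

```python
import copy

def init_child_from_parent(parent_params, frozen_prefixes=()):
    """Copy parent parameters into a child and mark which ones stay fixed."""
    child_params = copy.deepcopy(parent_params)
    trainable = {
        name: not any(name.startswith(prefix) for prefix in frozen_prefixes)
        for name in child_params
    }
    return child_params, trainable

def fine_tune_step(params, grads, trainable, lr=0.1):
    """One plain gradient-descent update that skips frozen parameters."""
    for name, grad in grads.items():
        if trainable[name]:
            params[name] = [w - lr * g for w, g in zip(params[name], grad)]
    return params

# Hypothetical parent model trained on a high-resource pair.
parent = {"encoder.layer0": [0.5, -0.2], "decoder.layer0": [0.1, 0.3]}

# Child model: start from the parent, keep the encoder fixed, fine-tune the decoder.
child, trainable = init_child_from_parent(parent, frozen_prefixes=("encoder.",))
grads = {"encoder.layer0": [1.0, 1.0], "decoder.layer0": [1.0, -1.0]}
child = fine_tune_step(child, grads, trainable)
```

Passing an empty `frozen_prefixes` corresponds to fine-tuning the entire child model, while freezing every parameter corresponds to transferring the parent unchanged.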
Typically, a high-resource language is selected as a bridge in pivot-based techniques. The source-target translation can then be constructed using the source-pivot and pivot-target corpora and models. Often, source-pivot and pivot-target models are trained and then combined into a source-pivot-target model [63]. Training the source-target model using pseudo-parallel data generated with the pivot language is another frequently used technique. Utilizing the parameters of the source-pivot and pivot-target models is a further way of using the pivot language [64]. In pivot translation, the choice of pivot language has a substantial impact on translation quality. A pivot language is typically selected based on prior information.
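The simplest form of pivot translation, in which a source-pivot model and a pivot-target model are chained, can be sketched as follows. The word-by-word lexicon models and the example sentence are hypothetical placeholders for trained NMT systems.

```python
def pivot_translate(sentence, source_to_pivot, pivot_to_target):
    """Translate source -> pivot -> target with two separate models."""
    pivot_sentence = source_to_pivot(sentence)
    return pivot_to_target(pivot_sentence)

def make_lexicon_model(lexicon):
    """Hypothetical word-by-word translator used only for illustration."""
    def translate(sentence):
        return " ".join(lexicon.get(word, word) for word in sentence.split())
    return translate

# Toy source-pivot and pivot-target models with English as the pivot.
src_to_pivot = make_lexicon_model({"iyi": "good", "geceler": "night"})
pivot_to_tgt = make_lexicon_model({"good": "gute", "night": "nacht"})

result = pivot_translate("iyi geceler", src_to_pivot, pivot_to_tgt)
```

Because errors made in the first step are passed on to the second, the quality of the cascade depends on both models. The same chain can also be run over a source corpus to produce pseudo-parallel source-target data for training a direct model.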
Apart from the techniques employed in the literature, large language models (LLMs) have gained popularity lately. Although mixed-language training data is used to train many LLMs, English remains the dominant language [65]. Multilingual data is used to enable LLMs to process inputs and generate responses in multiple languages. LLMs are capable of translating effectively even when they are not specifically trained for this task. There are studies in the literature in which well-known LLMs such as ChatGPT and GPT-4 are used for MT tasks [66], [67]. Some studies have found that, when LLMs are used to translate low-resource languages, they underperform the best models reported so far [66]. In addition, studies show that LLMs achieve impressive results when translating in the XX-English direction but relatively poor results in the English-XX direction [65], [66]. Even though LLMs work effectively on a variety of translation tasks, low-resource languages and the English-XX translation direction still need work.

C. OTHER SURVEY STUDIES
This section reviews other survey/review studies in the literature. Some of the studies carried out to date, together with their characteristics, are given in Table 1. When these studies are examined, it is seen that the reviews generally cover the field of NMT regardless of the scenario (low- or high-resource), and that the number of survey/review studies on low-resource scenarios has started to increase, especially in recent years. To the best of our knowledge, there is no systematic literature review on low-resource NMT. Unlike most survey/review studies, our study covers only the low-resource NMT area. The differences between our review and other studies are as follows:
• As the study is a systematic review, how the reviewed studies were obtained is shared.
• The reviewed studies were categorized in terms of the areas they focused on.
• The most preferred methods in the studies were identified.
• The language pairs most frequently studied in the low-resource NMT literature were examined.
• Bilingual and monolingual corpora used in the studies were examined.
• The metrics used to measure the success of the studies were analyzed.
• The development tools used in low-resource NMT studies were examined.

III. METHODOLOGY
In this study, a systematic literature review was conducted for the field of low-resource NMT. The process was divided into several stages, which are given in Figure 4. In the rest of this section, these steps are explained in detail.
• With the emergence of the research questions, the searches were focused on the keywords "neural machine translation" and "low resource".
• The keyword "low resource" is used together with the keyword "neural machine translation" to focus on low-resource scenarios. Additional keywords such as "transfer learning", "pivot translation", "pre-training", and "multilingual translation" were used in order to reach studies that apply the different methods used in the field of low-resource NMT.
• Logical operators were used to search databases. "OR" operators were used for synonym keywords, and "AND" operators were used to combine keywords.
Table 2 shows the queries used to search the databases. The query for the ScienceDirect database is shorter than the others because its queries are limited to eight logical operators. On other databases such as Google Scholar and Springer, the queries could not be entered as written in Table 2. Even when they could be entered, meaningful results could not be obtained (too many irrelevant results, too many studies to be analyzed). For this reason, only the seven databases from which results could be obtained were used for the study. As a result of these queries, for the period between 2018 and 2023, 94 studies were found in IEEE Xplore, 298 in ScienceDirect, 821 in Scopus, 47 in Taylor & Francis, 409 in Web of Science, 47 in Wiley Online Library, and 542 in ACM Digital Library (as of the search date). Table 3 gives numerical information about these databases.
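The way keywords are combined with logical operators, and the total number of hits across the databases, can be illustrated with a short sketch. The query built here is a hypothetical example assembled from the keywords named in this section; it is not one of the actual queries in Table 2, and the helper names are assumptions.

```python
import re

def build_query(core_terms, method_terms):
    """AND the core terms together, then AND them with an OR-group of method terms."""
    core = " AND ".join(f'"{term}"' for term in core_terms)
    methods = " OR ".join(f'"{term}"' for term in method_terms)
    return f"{core} AND ({methods})"

def count_logical_operators(query):
    """Count AND/OR operators, e.g. to check a database's operator limit."""
    return len(re.findall(r"\b(?:AND|OR)\b", query))

query = build_query(
    ["neural machine translation", "low resource"],
    ["transfer learning", "pivot translation", "pre-training", "multilingual translation"],
)
operator_count = count_logical_operators(query)

# Hits per database as reported in the text (2018-2023, as of the search date).
hits = {
    "IEEE Xplore": 94,
    "ScienceDirect": 298,
    "Scopus": 821,
    "Taylor & Francis": 47,
    "Web of Science": 409,
    "Wiley Online Library": 47,
    "ACM Digital Library": 542,
}
total_hits = sum(hits.values())
```

The example query uses five logical operators, which would fit within an eight-operator limit, and the per-database counts sum to 2258 records before duplicate removal and screening.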

C. STUDY SELECTION
Although 2258 studies were found as a result of the searches, most of them were out of scope. In some of the databases, many results are unrelated to the subject because the search was run over all studies. In addition, some studies appear in more than one database in the search results. Before starting the study selection process, such duplicates were consolidated so that each study was taken from a single database. Subsequently, inclusion and exclusion criteria were determined in order to include studies that are appropriate for the purpose of this study. Table 4 shows these inclusion and exclusion criteria.
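The duplicate-removal step mentioned above can be sketched as follows. The study does not state how duplicates were detected; this sketch assumes matching on a normalized title, and the records are invented purely for illustration.

```python
def normalize_title(title):
    """Lowercase and keep only letters and digits so formatting differences vanish."""
    return "".join(ch for ch in title.lower() if ch.isalnum())

def deduplicate(records):
    """Keep the first record seen for each normalized title."""
    seen = {}
    for record in records:
        key = normalize_title(record["title"])
        seen.setdefault(key, record)  # later duplicates are ignored
    return list(seen.values())

# Hypothetical search results exported from three databases.
records = [
    {"title": "Low-Resource NMT with Transfer Learning", "database": "Scopus"},
    {"title": "Low-resource NMT with transfer learning.", "database": "Web of Science"},
    {"title": "Pivot Translation for Related Languages", "database": "IEEE Xplore"},
]
unique = deduplicate(records)
```

Each study is thereby attributed to a single database, after which the inclusion and exclusion criteria can be applied to the remaining records.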
In the preliminary examination phase, summaries and general outlines of the studies were mainly analyzed.In this section, firstly, it was examined whether the study was in the field of NMT and on low-resource languages.Studies that did not include a low-resource setting were excluded from the review.Subsequently, 45 studies were selected to be examined according to the criteria determined from the remaining studies.Information about this selection process is given in Fig 5 .Table 5 shows the studies selected for review due to the above steps.As can be seen in Table 5, all of the selected studies were published between 2018-2023(July).From this point of view, the review we have conducted is up-to-date.

IV. OBTAINED FINDINGS AND RESULTS
In this part of the study, the studies selected for review are briefly summarized. Subsequently, the answers to the research questions are shared.

A. STUDY SUMMARIES
In [68], phrase-based statistical MT (PBSMT) and NMT are investigated for English-to-Mizo translation. The proposed model is a three-stage process for obtaining translation predictions: data preprocessing, system training, and testing. The NMT system consists of a one-way (unidirectional) LSTM encoder-decoder that uses the attention mechanism for translation. The findings of the study for the NMT model can be summarized as follows: (1) the NMT system attaches importance to the accuracy of the syntactic structure of the predicted translation, as it aims to produce fluent translations; (2) the NMT system pays little attention to the precision of named entities, which usually results in only partly adequate translations; (3) translations predicted by the NMT system are shorter; (4) translations predicted by the NMT system are of lower quality.
In [69], inspired by humans' ability to learn languages, a new hierarchical TL architecture is proposed to take full advantage of auxiliary languages by adding an intermediate layer for low-resource languages that have only a single parallel corpus. During the training process, the three-layer architecture transfers parameters layer by layer, and fine-tuning is done at each layer. The study was carried out on the Uyghur-Chinese language pair, and Turkish was used as the intermediate language. The study combines the advantages of the data size of the high-resource language with the syntactic information and linguistic similarity of the intermediate language. In terms of training time and efficiency, the model is first trained for several steps on a high-resource language pair (English-Chinese), and the parameters are transferred to the intermediate model in the first layer. In the second layer, the model is trained using a language pair (Turkish-English) that contains an intermediate language similar to Uyghur in terms of syntax, and the parameters are fine-tuned until they converge. Finally, to initialize the low-resource model, the parameters of the model trained with the intermediate language pair are transferred to the child model, and the model is trained on the low-resource language pair (Uyghur-Chinese) until it converges. The framework of the NMT model is not changed; instead of randomly initializing the next model, its parameters are transferred from the parent model. The Transformer model was used in the study. In addition, experiments were conducted on the generalization of the hierarchical TL architecture to Turkish-English. The results confirmed that the proposed method converges faster and can initialize parameters for the low-resource language pair more successfully than random initialization.
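The layer-by-layer parameter transfer can be illustrated with a deliberately tiny toy problem. The sketch below is not the architecture of [69]: it replaces the Transformer with a single scalar weight fitted by gradient descent, and the three datasets are invented stand-ins for the high-resource, intermediate, and low-resource pairs. It only shows the idea of initializing each stage from its parent instead of from scratch.

```python
def train(w_init, data, lr=0.1, tol=1e-6, max_steps=10_000):
    """Fit y = w * x by gradient descent; return (w, steps_used)."""
    w = w_init
    for step in range(max_steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        if abs(grad) < tol:
            return w, step
        w -= lr * grad
    return w, max_steps


# Toy stand-ins for the three stages; optima are close but not identical.
high_resource = [(1.0, 2.0)] * 100  # plays the role of the high-resource pair
intermediate = [(1.0, 2.1)] * 20    # plays the role of the intermediate pair
low_resource = [(1.0, 2.2)] * 5     # plays the role of the low-resource pair

# Hierarchical transfer: each stage starts from the previous stage's weight.
w_parent, _ = train(0.0, high_resource)
w_mid, _ = train(w_parent, intermediate)
w_child, steps_transfer = train(w_mid, low_resource)

# Baseline: train the low-resource model from a fixed "untrained" start.
_, steps_scratch = train(0.0, low_resource)

print("steps with transfer:", steps_transfer)
print("steps from scratch: ", steps_scratch)
```

In this toy setting the transferred initialization starts closer to the low-resource optimum and therefore needs fewer update steps, which mirrors the faster convergence reported in the study.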
In [70], the problems encountered in building a high-quality Korean-Vietnamese NMT system are identified, and solutions are proposed to address these problems. In order to create the NMT system, a parallel Korean-Vietnamese corpus containing 454,751 sentence pairs was created. The NMT systems are built on an attention-based seq-to-seq architecture. The experimental findings demonstrate numerous advantages over current MT systems that employ statistical and neural techniques. In addition, a Korean word sense disambiguation (WSD) method was proposed based on UWordMap, a manually constructed lexical semantic network (LSN) for particular features of Korean. WSD and morphological analysis are applied to the Korean texts in the corpus. Morphological analysis segments each Korean word into morphemes and recovers their original forms. Morpheme segmentation increases the number of tokens, and recovery of the original forms reduces the number of distinct words. The WSD operation increased the vocabulary by labeling the same word form with different meaning codes. The Vietnamese texts in the corpus were word-segmented with RDRsegmenter. RDRsegmenter reduces the number of tokens and expands the vocabulary by combining tokens into a single word. In the study, the encoder-decoder structure of the NMT systems is constructed using deep multi-layer LSTM networks. The extracted linguistic features are used in the models both individually and in combination. The best results were obtained when UTagger and RDRsegmenter were used together.
In [71], a system combination model is proposed based on the idea that the accuracy of translation systems can be increased by combining the outputs of SMT and NMT systems. To improve accuracy, several machine translation outputs are combined with the system combination model (SCM). Some components of the output produced by one system may be better than the corresponding components produced by another system, so an SCM can be used to obtain the benefits of both systems. An SCM can be classified as either statistical-based (SBSC) or neural-network-based (NBSC), and both methods have their own advantages and disadvantages. In this study, in order to combine the outputs of different SCMs, a coupling-based hybrid architecture consisting of both statistical and neural-network-based coupling techniques is proposed. The aim of the study is to gain the advantages of various MT systems and system combination models by using their translations without knowing their detailed architectures. The proposed architecture works in three layers. First, n candidate translations are generated from N systems whose internal structure is unknown. In the second layer, the statistical and the neural-network-based approaches are used to merge the results from the first layer. Finally, the suggested hybrid model for system combination chooses the best sentences produced by the different SCMs. In the study, the BiLSTM-attention model was used in the neural-network-based approach. The outputs of four systems (phrase-based, Hiero, NMT, Google) from existing studies in the literature were combined. These models were each trained on different corpora. The complexity of the hybrid SCM is the sum of the complexities of its separate elements. It has been observed that the method reduces the overall working speed but yields significant improvements in translation quality. The best outcomes were obtained when phrase-based, Hiero, and Google were used in combination.
In [72], a new Teacher-Free Knowledge Distillation framework is proposed. The goal of knowledge distillation (KD) is to transfer knowledge from one neural network (the teacher) to a different one (the student). In the study, a target distribution that acts as a virtual teacher model is constructed manually. The target distribution depends on how similar the terms in the target vocabulary are to each other. The loss function used in MT model training is augmented, and the diversity in the vocabulary is modeled more accurately. To enhance model training, an additional Kullback-Leibler divergence loss is applied on top of maximum likelihood estimation. Two probability distributions are compared to determine the additional loss: the model's prediction during training provides the first distribution, while the distribution obtained through word similarity provides the second. The vector representations of the tokens (words/subwords) in the vocabulary were obtained from large monolingual data. Cosine similarity is computed on pre-trained embeddings to rank the tokens. FastText and the CCMT2020 corpus are used for the pre-trained embeddings. The proposed method is compared with sequence-level knowledge distillation and the Transformer model and achieves better results. In addition, the proposed system is tested together with the back-translation method. Although back-translation has an additional training cost, it is found that it can further improve the effectiveness of the NMT model.
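The combination of the maximum likelihood loss with a similarity-based KL term can be sketched as below. This is a minimal sketch under stated assumptions, not the formulation of [72]: building the virtual-teacher distribution as a softmax over cosine similarities, the temperature, the direction of the KL divergence, and the weight `alpha` are all illustrative choices, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def virtual_teacher(gold_index, embeddings, temperature=0.1):
    """Assumed target distribution: softmax over similarity to the gold token."""
    sims = [cosine(embeddings[gold_index], e) for e in embeddings]
    return softmax(sims, temperature)


def kl_divergence(p, q):
    """KL(p || q); assumes q is strictly positive wherever p is positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def combined_loss(model_probs, gold_index, embeddings, alpha=0.5):
    """Negative log-likelihood plus a weighted KL term (alpha is an assumption)."""
    nll = -math.log(model_probs[gold_index])
    teacher = virtual_teacher(gold_index, embeddings)
    return nll + alpha * kl_divergence(teacher, model_probs)


# Toy vocabulary of three tokens with 2-d embeddings (invented values).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
model_probs = [0.7, 0.2, 0.1]
print(combined_loss(model_probs, gold_index=0, embeddings=embeddings))
```

Because the teacher distribution places some mass on tokens similar to the gold token, the extra term penalizes the model less for confusing near-synonyms than for predicting unrelated tokens.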
In [73], an NMT model for the Sanskrit-Hindi language pair is proposed by combining RNNs with a rule-based linguistic approach. In order to train and test the proposed NMT system, several models, activation functions, training data sets, and sentence lengths were used. The suggested technique uses a pipeline design in which each stage accepts input from the earlier stages, performs its calculations, and forwards the result to the following stage. The rule-based pipeline design for the NMT system has 10 modules. To train the system more effectively, each module provides a distinct output as linguistic attributes to the encoder-decoder system. The encoder-decoder with attention is implemented as a stack of Bi-GRU layers. The proposed framework integrates features from the rule-based pipeline architecture to train the RNN. Each feature has a separate word embedding. To combine all these word embeddings, a feature embedding matrix is created whose size is the sum of all feature embedding sizes. As the lengths match, these embeddings are subsequently added to the overall embedding size. The retrieved linguistic features are multiplied by the input vectors. Only this update to the encoder is made; all other functions and parameters remain unchanged. Initially, a small parallel corpus was used to train the NMT system. In this setting, the system achieved low accuracy, and the output was not intelligible. Therefore, data augmentation techniques are included in the system to overcome this problem.
In [74], the problem of inadequate translation is addressed by imposing sentence alignment constraints on NMT. The alignment score between the source and target sentences is predicted using a discriminator (D) based on sentence alignment. A gated self-attention based encoder is used in D to capture evidence of semantic alignment in the input data. In order to avoid over-penalizing translations that are correct but not human-generated, the N-pair loss is used in the training process of D. Then, an adversarial training and alignment-based decoding strategy is applied to integrate the sentence alignment constraint into NMT. A basic NMT model is trained using adversarial training to create accurate translations that outperform those produced by the generator (G) and discriminator (D). D guides the NMT model toward alignment-sensitive decoding by integrating the alignment score with the decoding probabilities when generating a translation. The proposed encoder learns to focus on the lexical information that is important for sentence alignment and to increase the contribution of keywords. This semantic and lexical information is transferred to the NMT model through the suggested training and decoding processes. The alignment-sensitive decoding structure allows the decoder to consider both the adequacy and the fluency of translations. These features incentivize the NMT model to generate translations that match the semantic information the discriminator learns for sentence alignment. In the study, the LSTM-attention and Transformer models are implemented. Uyghur-Chinese was the low-resource language pair employed in the study.
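One simple way to picture alignment-sensitive decoding is as a rescoring of candidate translations that mixes the decoder's log-probability with the discriminator's alignment score. The sketch below is an assumption-laden illustration: the log-linear combination, the `weight` parameter, and the candidate scores are hypothetical and are not taken from [74].

```python
import math


def combined_score(log_prob, alignment_score, weight=1.0):
    """Log-linear mix of decoder log-probability and alignment score in (0, 1]."""
    return log_prob + weight * math.log(alignment_score)


def pick_translation(candidates, weight=1.0):
    """candidates: list of (text, decoder_log_prob, alignment_score) tuples."""
    best = max(candidates, key=lambda c: combined_score(c[1], c[2], weight))
    return best[0]


# Invented candidates: the first is more probable under the decoder but is
# judged poorly aligned with the source; the second is slightly less probable
# but well aligned.
candidates = [
    ("fluent but incomplete", -1.0, 0.2),
    ("adequate", -1.3, 0.9),
]
print(pick_translation(candidates, weight=0.0))  # decoder probability only
print(pick_translation(candidates, weight=1.0))  # with alignment score
```

Setting `weight` to zero recovers ordinary decoding by model probability, while a positive weight lets the alignment score favor more adequate candidates.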
In [75], it is shown that substantial performance improvements can be achieved by pre-training an auto-regressive model with an objective that adds noise to texts in several languages and reconstructs the original texts. In this study, mBART, a multi-lingual seq-to-seq denoising autoencoder, is presented. mBART is trained by applying the BART objective to sizable monolingual corpora across many languages. Noise is created by masking parts of the input texts and replacing words. A single Transformer model is trained to recover the original texts. Unlike other NMT pre-training methods, mBART pre-trains a full auto-regressive seq-to-seq model. Without any task- or language-specific modifications or initialization procedures, mBART is trained once across all languages, producing a set of parameters that can be fine-tuned for each language pair in both supervised and unsupervised settings. Although BART was pre-trained only for English, the influence of pre-training on several language pairs has been systematically studied here. To assess the effects of various levels of multi-lingualism during pre-training more accurately, models pre-trained on all languages and models pre-trained on fewer languages were created. The training data was divided into high-, medium-, and low-resource scenarios, and various experiments were conducted. mBART pre-training has been shown to provide consistent improvements in performance in low- and medium-resource settings and to outperform other existing pre-training schemes as well as bilingual models and BT. It has been found that mBART can boost performance even for languages that are not included in the pre-training corpora.
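The masking part of such a denoising objective can be sketched as follows. This is a simplified illustration rather than the noising procedure of [75]: the span length is fixed instead of being sampled, only a single span is masked, and the `<mask>` token string and function name are assumptions.

```python
import random

MASK = "<mask>"


def mask_span(tokens, span_len, rng):
    """Replace one contiguous span with a single mask token.

    Returns (noised_input, reconstruction_target); the model would be trained
    to regenerate the full original sequence from the noised input.
    """
    if not 0 < span_len <= len(tokens):
        raise ValueError("span_len must be between 1 and len(tokens)")
    start = rng.randrange(len(tokens) - span_len + 1)
    noised = tokens[:start] + [MASK] + tokens[start + span_len:]
    return noised, list(tokens)


rng = random.Random(0)  # fixed seed so the example is repeatable
tokens = "the cat sat on the mat".split()
noised, target = mask_span(tokens, span_len=2, rng=rng)
print("input :", " ".join(noised))
print("target:", " ".join(target))
```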
In [76], a method is proposed to enhance NMT systems for languages or domains with limited resources. In the study, phrases taken from an SMT system are used as training data for NMT. The basic idea of this method is to supply more detail about the correspondence between source and target expressions. A sentence pair does not contain any explicit information regarding the mapping between the source and target expressions when it passes through an encoder-decoder. The model learns translation mappings implicitly by predicting and correcting errors across vast parallel data. However, the model cannot capture the relationship between expressions when the amount of data is small. Therefore, in addition to feeding sentence pairs to the network, phrase pairs were also fed as training examples. To implement this mechanism, phrase pairs were extracted from the original training data using the Moses SMT system and added to the original training data as parallel sentence pairs. Experiments were conducted with two models: attention-based GRU and Transformer. The results of the proposed method were better than those of the basic models, with the Transformer model performing best. In addition, the proposed approach was compared with other NMT techniques, such as sub-word level NMT and back-translation, and achieved better results.
In [77], the effects of BT on NMT were investigated using language pairs that not only use distinct writing systems but also belong to different language families, which makes MT with limited resources more challenging. With models trained on extremely low-resource corpora, SMT and NMT experiments were carried out in character-based and word-based settings, offering comparisons for the Chinese-Vietnamese and Vietnamese-Chinese directions. Additional analyses, including N-gram F1 score, error rate, and linguistic analysis, were also performed to obtain new results. The study also examined the impact of synthetic data size on model performance. Although the results varied, NMT models generally achieved better results when a large amount of synthetic data was used. When word-based SMT and word-based NMT outputs were examined, it was found that the NMT outputs are better in two ways: a) the number of untranslated Vietnamese words, including named entities, is much lower than in SMT; b) in the NMT outputs, the word order and general syntactic structures are more precise and comprehensible. The study concludes that in the two translation directions of Chinese-Vietnamese, the addition of artificial data positively affects the performance of character-based and word-based models. For bidirectional Chinese-Vietnamese translation, SMT outperforms NMT in most cases.
In [78], a parallel corpus called UPC, consisting of two large parallel corpora, was first created to train Korean-English and Korean-Vietnamese MT models. Data was gathered on subjects pertinent to everyday life, such as economy, education, and religion, for a variety of audiences. Ambiguous words (homographs), which have the same spelling but different meanings, harm both SMT and NMT performance. This ambiguity forces NMT systems to choose from several candidate translations representing different meanings of a word. To solve this problem, a hybrid approach is proposed that combines knowledge-based methods with a sub-word conditional probability to determine the appropriate senses of homographs and to annotate them with the corresponding sense codes. Using this approach, a fast and accurate WSD system called UTagger has been developed. WSD was then applied to the original Korean sentences in the UPC. The SMT and NMT systems were both trained using these corpora. Experiments were carried out with the original version of the corpus and with the version processed by UTagger. Better results were obtained when using UTagger. In addition, rare words produce a large number of out-of-vocabulary (OOV) words, which is a problem for MT. The morphological analysis used in UTagger produced clear word boundaries and reduced the number of OOV words. As a result, the use of WSD enhanced MT system performance. Additionally, the NMT model achieved better results than the SMT model on these corpora.
In [79], it is proposed to add a pseudo-parallel corpus to the training data by repeatedly using the Transformer model for BT. A round-trip approach is analyzed with sentence alignment metrics for pre- and post-translation filtering. If the target sentence and the round-trip translation are parallel, the synthetic source sentence is considered a possible match to the monolingual target sentence. Therefore, this sentence can be added to the synthetic parallel sentence corpus to increase the training efficiency. The proposed framework is composed of two modules. The first module, back-translation, consists of four steps. First, an iterative Transformer is trained on the source-target pseudo-parallel corpus with different parameter settings to acquire the synthetic data. Then, at each epoch during training, the translation is analyzed using the source synthetic data, and the model with the best BLEU score is selected. The Transformer model is tuned with different iteration patterns and layer sizes to reduce training variance. Then, sentence parallelism between the monolingual target sentences and the synthetic source sentences is estimated. Finally, Cohen's Kappa measures the agreement between the monolingual target data and the synthetic source data to avoid duplicate sentences in the corpus. Sentences with low Cohen's Kappa scores are removed from the corpus, as they are considered detrimental to model performance. The second module in the proposed framework is the round-trip approach used to acquire the target data. First, this module uses a round-trip translation of the synthetic source sentences in the source-target direction to obtain synthetic target sentences. Then, the similarity between the monolingual and synthetic target sentences is computed, and low-scoring sentences are filtered out. Finally, the filtered synthetic and monolingual target sentences are combined to extend the training data. The proposed method is compared with some works in the literature and achieves better results in both high- and low-resource scenarios in different language pairs.
In [80], an NMT model called BERT-JAM, which stands for BERT-fused Joint-Attention, was proposed. Three techniques are used to maximize the use of BERT for NMT. First, a fusion module is included in each encoder/decoder layer so that the representations of BERT can be used as a combined representation. Weights are shared between different samples to combine the multi-layer representations. Second, it is proposed to integrate the fused BERT representation with the encoder and decoder layers by combining the self-attention and cross-attention modules into one joint attention module. The joint attention module carries out the tasks of multiple attention modules at once, dynamically allocating attention between the BERT representation and the encoder-decoder representation. A joint attention module termed BERT-encoder joint attention is employed at each encoder layer to attend to both the BERT and encoder representations at the same time. Each decoder layer contains two joint attention modules. The first, called BERT-decoder joint attention, combines the BERT representation with the decoder representation. The second, called encoder-decoder joint attention, attends to both the encoder and decoder representations. Third, to address the issue of catastrophic forgetting, the proposed model is trained using a three-step optimization technique that gradually unfreezes the various model components. By employing this method, the model is able to benefit from the improved performance that BERT fine-tuning provides.
In [81], an easy-to-use but powerful constrained sampling technique is proposed for data augmentation (DA) in NMT. Constrained sampling methods that make use of edit distance calculation are considered more effective than methods that choose words at random from the original text. The proposed approach is broadly similar to GANs and can be expressed in three steps. First, both positive and negative samples are needed to train the discriminator sub-model; therefore, negative samples are created from the original data using a negative sampling technique. Second, the evaluation sub-model is trained using the original data and the generated negative data. The evaluation sub-model is designed to select high-quality data after generation and is intended to reject, to some extent, sequences containing semantic or syntactic errors. Third, samples are augmented using the edit distance sampling technique on the original data distribution, and low-quality augmented data are discarded by the discriminator sub-model. The proposed method stands out because it is language-independent, so such a sampling method can be incorporated into NMT systems for different languages. According to the experimental findings, the proposed strategy performs noticeably better than the approaches in the literature.
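The notion of sampling under an edit distance constraint can be illustrated with a small sketch. It uses the standard Levenshtein dynamic program over token sequences and a substitution-only sampler; the sampler, its parameters, and the toy vocabulary are assumptions for illustration and do not reproduce the sampling scheme or the evaluation sub-model of [81].

```python
import random


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def sample_within_distance(tokens, vocab, max_edits, rng):
    """Substitute tokens at up to max_edits distinct positions (assumed scheme)."""
    edits = rng.randint(0, max_edits)
    out = list(tokens)
    for pos in rng.sample(range(len(out)), min(edits, len(out))):
        out[pos] = rng.choice(vocab)
    return out


rng = random.Random(0)  # fixed seed so the example is repeatable
tokens = "the cat sat on the mat".split()
vocab = ["dog", "rug", "stood", "a", "under"]  # toy vocabulary
augmented = sample_within_distance(tokens, vocab, max_edits=2, rng=rng)
print(" ".join(augmented), "| distance:", levenshtein(tokens, augmented))
```

Because substitutions are applied at distinct positions, the augmented sentence is guaranteed to stay within `max_edits` of the original; in the reviewed approach, a separate sub-model would then filter out low-quality samples.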
In [82], it is investigated which knowledge a model gains from pre-training and which information from the pre-trained model enables highly accurate unsupervised NMT. To this end, it is analyzed which layers of the unsupervised NMT system store what kind of information and whether features of the cross-attentions, such as word order, differ across languages. The cross-attentions of an encoder-decoder architecture are analyzed using a novel technique that takes into account the different features of the source and target sequences. A language generation model is pre-trained using the Masked Sequence-to-Sequence (MASS) method with two monolingual corpora. The pre-trained model is then fine-tuned on the same corpus for the unsupervised NMT task with a back-translation loss. The Transformer is used for the encoder-decoder architecture. An input sentence containing a randomly masked fragment is provided to the encoder, and the decoder attempts to predict this masked fragment. In this work, a BT approach is adopted to build the unsupervised NMT system because BT can be easily implemented by using a typical encoder-decoder for both languages with the MASS method. Both strategies rely on the creation of cross-lingual word embeddings across the two languages before an unsupervised NMT system is trained. The results show that pre-trained models are helpful in improving the performance of an unsupervised NMT system.
In [83], the benefits of adding linguistic annotation to sentences used as input for MT are investigated. A model for Korean-Vietnamese NMT is proposed that combines a Transformer model with a pre-trained Viet-BERT model. The Korean-Vietnamese bilingual corpus undergoes a number of pre-processing steps before being fed into the NMT systems in order to enhance the quality of NMT.
In [84], a study on adversarial learning is presented. To obtain correct translations in complex systems, an enhanced feature extraction method is examined for training on a small number of sentence pairs. The suggested model additionally makes use of TL to further improve NMT performance. The whole GAN system consists of a discriminator D and a generator G. The parameters of G and D are optimized using two adversarial losses. G uses fake examples to confuse the discriminator D. Conversely, D seeks to identify the fake examples produced by G and adjusts its parameters as necessary for this. The adversarial losses of a GAN model are included in the NMT system because they help LRL translation. In the proposed model, RNNSearch is used in the generator part, and a residual connected convolutional neural network (CNN) is designed for the discriminator part to classify the input pairs according to their hierarchical features. Mixture, Res, and Feature are the three basic components that make up the discriminator. The two different embeddings in the input pair are independently sent through an exclusive convolutional layer and merged in a mixture block. This block contains an ordered convolutional layer to thoroughly fuse dense exponential linear units (ELU) and their embeddings. Res block combines the same number of layers under its predecessor faster. The efficacy of the suggested model has been confirmed by comparing it with other models, such as RNNSearch, BERT, and ALBERT. Next, models with a transferred (pre-trained) generator, a transferred discriminator, a transferred generator and discriminator, and no transfer were compared. That is, the proposed model is analyzed in terms of TL with a separate generator, a separate discriminator, and both a generator and a discriminator. The best results are obtained when only the discriminator is transferred.
In [85], it is recommended to use a pre-trained BERT model on both the encoder and decoder sides to make better use of the information obtained through pre-training and to improve performance. The study used lightweight neural network components called adapters to incorporate the BERT model into the seq-to-seq framework. Two pre-trained BERT models are added on the source and target sides, and these serve as the encoder and decoder. The advantage of adapter modules in the model is that they are parameter-efficient and robust. The suggested system also doubles the decoding speed through parallel decoding. Each element in the proposed structure can be thought of as a plug-in unit, which makes the model very flexible and task-independent. Since pre-trained BERT models are deep models, this study investigated whether it is necessary to add adapters to each BERT layer. A probabilistic learning process with latent variables is utilized to determine whether to use an adapter in each layer. Variational inference optimizes the latent variables, and an additional loss function regulates the number of adapter layers. In this way, the parameter scale of the adapters is automatically pruned, and the adapter layers are directly fine-tuned on a pre-trained model, which significantly lowers the model's decoding delay during inference. Based on the fact that some layers in the Transformer model can be pruned without seriously harming model effectiveness, it is also assumed that some adapter layers can be pruned as well, as not all adapters play essential roles during fine-tuning. The probabilistic approach thus enables the model to choose and employ adapter layers automatically. The proposed model performed better than some of the existing approaches in the literature.
In [86], a syntax-graph guided self-attention (SGSA) technique is proposed, which combines the syntax of source sentences with stacked multi-head self-attention layers, aiming to improve the Transformer by using syntax more directly. The syntax is converted to a graph in order to combine it effectively with the NMT model. The syntax-sensitive approach is suitable for sub-word units, and it resolves the issues caused by extensive vocabularies and sparse words. The source-side syntactic dependency is used as a guide, and a syntactically directed self-attention mechanism is used without additional parameters. To perform this process, which the authors call dynamic multiple syntax-aware self-attention representations (DMSR), the syntactic graph is adaptively tuned, and the effect of different fusion methods on the performance of the model is investigated. To address the absence of syntactic information and maintain the parallel computation ability of self-attention networks, the syntactic relationships of each source token are represented as vectors and applied to the self-attention components Query (Q) and Key (K). Different strategies have been tried to integrate the DMSRs and obtain the final representation. These strategies include average-pooling, highway networks, and linear networks, all used as fusion methods. The analysis showed that adding syntax information in the first three layers of the Transformer decoder yielded better results, while adding it in the deeper layers did not result in significant improvement. When syntactic information is integrated into a single layer, performance often degrades with increasing layer depth. In addition, the proposed model was compared with other models in terms of inference time and parameter count. The proposed method has fewer parameters in all translation directions.
In [87], an approach is proposed that strengthens the relationship between languages to improve translation quality and addresses domain adaptation issues through reward-based learning.
In [88], a novel technique for producing synthetic data in low-resource scenarios is proposed and compared to BT in experiments. The suggested approach is quick, reliable, and does not need any additional outside resources, such as dictionaries, pre-trained models, rules, or language models. No additional resources are used for data augmentation except for a small amount of available bilingual corpora. It is suggested that artificial translation units (ATUs) be used for data augmentation while maintaining the sentence word order and context. ATUs refer to tags produced by standard translation modules using solely monolingual vocabulary. Based on the ATUs, equivalent artificial target sentences are created for an original sentence pair. The original source sentence is then paired with these pseudo-target sentences to produce artificial data. The artificial corpus is mixed with the original corpus to train the NMT systems. To further enhance MT effectiveness, the method can be used in conjunction with BT. In addition to data augmentation, the effectiveness of low-resource translation is investigated in terms of combining Chinese and Japanese texts on the source side (combined training) in a translation task into Vietnamese. The translation tasks capitalize on the translation units shared between the two languages. Furthermore, the BERT model is incorporated into the NMT systems. In contrast to previous studies, the BERT model is trained using a mixture of Chinese and Japanese texts (with grammatically inverted patterns). The goal is to look at how effective combined training systems are in high-resource scenarios.
In [89], low-resource MT is addressed, specifically for Manipuri as well as other Indian languages. This is achieved through a multi-lingual technique and involves direct translation scenarios between Indian languages and Manipuri in a zero-shot manner. It is noted that there is a capacity bottleneck issue in a single shared MNMT model. In order to address this, a comprehensive analysis is performed on the multi-lingual cross-lingual word embedding (MCLWE), which precedes the MNMT model. It has been shown that this increases the generalizability of the model. In addition, the effects of using such an embedding on zero-shot translation are examined. In the proposed method, embedding training is first performed for all languages separately. The process of multi-lingual alignment is then carried out by mapping each language embedding to the shared language embedding space. In the study, a single common MNMT model is used in a many-to-many manner with shared encoders and decoders. The encoder and decoder are initialized, and the multi-lingual model is trained jointly over N language pairs. The proposed method is compared with bilingual, multi-lingual, and pivot translation models. In the tests, it was observed that, with the inclusion of MCLWE in the system, more improvement was achieved in the English-XX direction than in the XX-English direction for all language pairs. Finally, the model was compared with pivot-based methods using zero-shot translation; the results were competitive but not better than the pivot-based cases. Overall, the proposed method can handle repeating words better than the bilingual and multi-lingual base models and enhances the quality of NMT for the low-resource Manipuri language.
In [90], a study was conducted to reduce corpus requirements and improve context learning in extremely low-resource languages (LRLs). A new method is proposed that jointly embeds textual and phonetic information of languages into GAN-NMT by leveraging an optimized attention network based on deep RL. In the proposed architecture, a pre-trained NMT model is used as the generator, creating translations from the source sentences, while a different network is employed as the discriminator, determining whether the translations are genuine or not. A new GAN model consisting of Deep-RL-Guided-Attention as the generator and a Convolutional Neural Network (CNN) as the discriminator is used to obtain better attention weights. The generator is a modified Transformer model. The discriminator is a CNN-based classification model that distinguishes between real sentences and sentences produced by the generator. Instead of the conventional word embedding, the GAN model is trained using a new joint embedding. By substituting deep RL-guided attention for the attention originally used in the generator, the suggested design enhances the GAN model and raises the probability of deceiving the discriminator. By combining information from textual and phonological representations, the enhanced GAN model can learn additional phonetic context that is missing from other approaches. The proposed method is compared with different approaches in the literature and achieves better results. In addition, the proposed framework has been tested on high-resource German-English translation and outperforms some of the compared models.
In [91], a method called Group-Transformer (GTRANS) is proposed, which separates multiple layers into different groups to take full advantage of the low-level and high-level features in both the encoder and the decoder. Each encoder group is composed of a certain number of contiguous layers, and only the last hidden states from each encoder group are included in the combined representation. Similarly, the decoder layers are split into distinct decoder groups whose representations are then integrated as a whole. The target words are predicted from the word probabilities generated by combining the representations of all decoder groups, allowing low-level information to influence the predictions directly. Experiments were carried out on 5 different corpora, but the only corpus that can be called low-resource is IWSLT-14 English-German. The proposed approach was compared with several approaches from the literature. The technique provides BLEU score improvements of +0.78 and +1.73 in the De-En and En-De directions, respectively, indicating that the proposed method can take advantage of multi-layered features to improve translation quality significantly. In addition, the IWSLT-17 corpus was used for multi-lingual experiments. The authors state that the different layer representations provided by the proposed method are suitable for the multi-lingual translation task and provide consistent improvements in all translation directions of the model. Compared with the Transformer model, the proposed method has no additional parameters and a similar inference speed.
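The grouping step described for [91] can be illustrated with a minimal sketch. It only shows how the last hidden state of each contiguous layer group might be selected; how GTRANS actually fuses the selected states is not detailed here, and the function name `group_last_states` and the string placeholders for hidden states are hypothetical.

```python
def group_last_states(layer_states, group_size):
    """Split per-layer states into contiguous groups and keep the last state of each group.

    layer_states: list of per-layer hidden states, ordered bottom to top
                  (placeholders here; real states would be tensors).
    group_size:   number of contiguous layers per group (assumed hyperparameter).
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    n = len(layer_states)
    # The last group may be shorter if n is not divisible by group_size.
    return [layer_states[min(start + group_size, n) - 1]
            for start in range(0, n, group_size)]


# Six encoder layers split into groups of two: states of layers 2, 4, and 6 are kept.
states = ["h1", "h2", "h3", "h4", "h5", "h6"]
selected = group_last_states(states, group_size=2)
```

The selected states would then be combined into a single representation, so that both lower and upper layers contribute to the prediction.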
In [92], an NMT model trained on a sizable corpus covering all Arabic dialects was created. The goal is for this NMT model to be able to translate a particular dialect using a small corpus. A transductive TL strategy is proposed to address the issue of data scarcity in Arabic dialect translation. The transductive TL strategy was used with two NMT models: LSTM seq-to-seq and attention seq-to-seq (Luong attention). The corpora used in the proposed framework are MADAR (25 Arabic dialects-Modern Standard Arabic (MSA)) and the target PADIC (Algerian dialect-MSA). The suggested strategy consists of two key steps: (1) learning step: the LSTM seq-to-seq and attention seq-to-seq models are trained using the 25 Arabic dialects of the MADAR corpora; (2) TL step: using TL, the LSTM seq-to-seq and attention seq-to-seq models are trained again using the Algerian Arabic dialects in the PADIC corpora. The parent model is built using the massively parallel corpus MADAR. The parent model is then retrained using the PADIC corpus, which results in a child model. The information from the parent model is passed down to the child model through the reuse of its parameters. Model performance started dropping for sentences with more than 20 words for the LSTM model and more than 25 words for the attention model. In addition, the study compared its results with studies using different TL methods, but this comparison is not valid since the corpora in those studies are different.
In [93], the proposed model overcomes the drawback that TL does not take into account the vocabulary properties shared by the parent and child models when fine-tuning the child model. Accordingly, the study proposes a method that uses vocabulary embedding and vocabulary information in the child model. Training the parent model on an integrated corpus with a shared parent and child vocabulary yields a much better translation performance. To strengthen the model, the data used in the child model is added to the parent model. In this way, the parent model has prior knowledge of the child model, and a more robust TL approach is therefore expected to emerge. The basic idea is to share vocabulary properties between the parent and child models before fine-tuning. In the suggested structure, the parent model is first trained using a high-resource language pair. Afterward, an integrated corpus is prepared for the hybrid model. The oversampling method is used to create the integrated corpus: a larger mixed corpus is produced by bringing the size of the child corpus to parity with that of the parent corpus. A joint vocabulary is created over this corpus, and the hybrid model is trained. The child model is then fine-tuned from this hybrid model using a low-resource language pair. The languages used for the parent models are Arabic, Persian, and Turkish. The suggested model is built upon the Transformer. The proposed strategy performed better than several TL techniques in the literature. Additionally, an experiment was conducted using a low-resource language (Uyghur) in the parent model, and the results were seen to improve.
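The corpus integration step described for [93] can be sketched as follows. This is a minimal illustration under stated assumptions: oversampling is done by cyclic repetition, the vocabulary is built over whitespace tokens, and the helper names `oversample` and `build_joint_vocab` as well as the toy sentence pairs are hypothetical rather than taken from the study.

```python
from collections import Counter
from itertools import cycle, islice


def oversample(child_pairs, target_size):
    """Repeat child sentence pairs cyclically until the corpus reaches target_size."""
    if not child_pairs:
        raise ValueError("child corpus is empty")
    size = max(target_size, len(child_pairs))
    return list(islice(cycle(child_pairs), size))


def build_joint_vocab(pairs, min_count=1):
    """Collect one vocabulary over the source and target sides of all pairs."""
    counts = Counter()
    for src, tgt in pairs:
        counts.update(src.split())
        counts.update(tgt.split())
    return sorted(tok for tok, c in counts.items() if c >= min_count)


# Toy corpora: four parent pairs (high-resource) and two child pairs (low-resource).
parent = [("a b", "x y"), ("b c", "y z"), ("c d", "z w"), ("a d", "x w")]
child = [("e f", "u v"), ("f a", "v x")]

child_up = oversample(child, len(parent))   # child brought to parity with parent
mixed = parent + child_up                   # integrated corpus for the hybrid model
vocab = build_joint_vocab(mixed)            # joint vocabulary shared by parent and child
```

In practice a subword vocabulary would likely replace the whitespace tokens used here; the point of the sketch is only the ordering of the steps: oversample, mix, then build one shared vocabulary.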
In [94], a study was conducted on how to handle out-of-vocabulary (OOV) words and multi-word expressions (MWEs) in an NMT system. NMT systems use the softmax function in their output layers. The softmax function has a high computational complexity, and NMT systems are therefore used with limited vocabulary sizes. This triggers the OOV problem. MWEs are constructions that contain multiple words but behave as a single word. NMT systems may fail to learn, remember, and reproduce MWEs because they represent the entire sentence in a high-dimensional vector. The Punjabi-English language pair is analyzed in this study, and existing systems in the literature are examined. In addition, a corpus of MWEs and named entities was created. The encoder-attention-decoder structure was used with LSTM, and a total of 4 different models were analyzed within this structure using word embeddings and different corpora. Word vectors obtained from FastText were used in the study. In addition, a pre-processing module was created for sentences. It was observed that the models are generally more successful on short sentences and that performance decreases for sentences longer than 15 words.
In [95], Transformer, multi-source Transformer, and shared multi-source Transformer models with additional grammatical features are used for low-resource NMT. The major goal of this study is to enhance the translation quality of low-resource languages by including extra linguistic features in NMT models. In the experiments, POS taggers were employed to assign the correct POS tag to every word in the corpus, using the labeling format Word|POS. For each translation model, POS tags are first added on the source side only, then on the target side only, and finally to every word on both the source and target sides. The multi-source Transformer and shared multi-source Transformer models use two inputs (i.e., sequence data and POS-tagged sequence data), and the models' outputs are either sequence data or POS-tagged sequence data. The baseline is the Transformer model, which uses only word vectors. The multi-source Transformer model is an enhanced version of the Transformer. This enhancement involves incorporating an extra encoder and adding an additional target-source multi-head attention component on top of the existing one. This modification allows two inputs to be used at the same time. The architecture includes two encoders, one for the words and the other for the linguistic features. Although the two independent encoders use the same parameters, their outputs differ, and they merge at different points throughout the decoder. The shared multi-source Transformer and multi-source Transformer models have many similarities. In contrast, the parameters of the shared multi-source Transformer model are shared during training. In addition, the authors proposed a POS tagging method and included it in the experiments. With this method, competitive results were obtained with fewer labels. Generally, the best results were obtained with the shared multi-source Transformer method.
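The Word|POS labeling format mentioned for [95] can be illustrated with a short sketch. The POS tags are assumed to come from an external tagger; the tag names below and the helper names `tag_sentence` and `strip_tags` are hypothetical.

```python
def tag_sentence(tokens, pos_tags, sep="|"):
    """Attach a POS tag to every token in the Word|POS format."""
    if len(tokens) != len(pos_tags):
        raise ValueError("tokens and pos_tags must have the same length")
    return " ".join(f"{word}{sep}{tag}" for word, tag in zip(tokens, pos_tags))


def strip_tags(tagged_sentence, sep="|"):
    """Recover the plain word sequence from a Word|POS sentence."""
    return " ".join(tok.rsplit(sep, 1)[0] for tok in tagged_sentence.split())


tokens = ["the", "cat", "sleeps"]
tags = ["DET", "NOUN", "VERB"]          # illustrative tags from an assumed tagger
tagged = tag_sentence(tokens, tags)     # "the|DET cat|NOUN sleeps|VERB"
plain = strip_tags(tagged)              # "the cat sleeps"
```

The tagged string can serve as the second input stream of a multi-source model, while `strip_tags` recovers plain text when the model output is POS-tagged.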
In [96], the objective is to improve translation quality on the Turkish-English language pair by combining two Transformer-based structures with the shallow fusion method. First, a Turkish-English corpus was created for the study. Transformer and SciBERT models are used together in the proposed structure. To maximize the effectiveness of these two architectures, the shallow fusion technique is used. The outputs from the decoder and the LM are combined in another neural network (NN). The additional NN structure used in the study is a fully attentional network. In the fusion process, the combined output goes through token, positional, and segment embedding. After these processes, the output is given as input to the Transformer encoder. Meanwhile, the weights of the LM are frozen. The proposed model was compared with Google Translate, LSTM, and a convolution-based Transformer and achieved better results. Additionally, the proposed model was tested on the WMT'17 and WMT'18 corpora in a zero-shot manner. BLEU scores of 20.12 and 20.56, respectively, were achieved in the Turkish-English direction; although these results are not very high, they are considered competitive.
In [97], a fully synchronized inference technique is proposed for multi-lingual NMT, which can simultaneously and interactively produce multiple target sentences in several languages. In the inference phase, when predicting the next word, the model uses the words predicted in other languages in addition to the source and the previously predicted words. A module called cross-lingual attention has been developed that can dynamically choose the most pertinent part from the target sentences of more than one language, in order to use the supplementary information of different target languages during generation and to guide the generation of the language of interest. This allows the approach to generate translations in multiple languages at the same time and enables mutual enhancement between target languages. The study is built upon the Transformer model. In the proposed model, the encoder component is identical to that of the original Transformer. In the decoder, the proposed cross-lingual attention module is used. This method generates a simultaneous representation for each language. In this way, the attention calculation for a language establishes a relationship not only within itself but also with other languages. For this, the attention between each pair of languages is calculated first. Afterward, these pairwise attentions are merged using a fusion function to form the final representation. Three different fusion methods (linear, non-linear, and attention-based) were used in the study. While the linear and non-linear methods treat all target languages equally, the attention-based method is designed to select relevant information dynamically from all languages. The beam-search algorithm has been modified to perform inference in more than one language so that it is suitable for synchronous inference. This enables interaction between different languages throughout the decoding process. Model training was conducted in the form of multi-task learning to take advantage of existing large bilingual parallel data. The corpus considered low-resource in the study is IWSLT'14. In total, 5 different settings were tested: the standard bilingual Transformer, the fully parameter-sharing multi-lingual Transformer, and the three fusion methods. Compared to the multi-lingual shared model, better results were obtained by adding a small number of parameters. The results showed that Chinese is the language that contributes the most to translation among the languages in the cross-lingual attention. In addition, the proposed method was tested using different languages, and better results were obtained compared to the multi-lingual Transformer model.
In [98], the authors mention two problems with the Transformer model. The first is that when positional encoding is applied to the corpus, the location information is lost, and the model entirely disregards the sequence order. The second is that there is often an over/under-translation problem, and the model does not capture the correlation between words well. To overcome these problems, the Transformer fast gradient with relative positional embedding (TF-RPE) method is proposed together with adversarial training. The proposed method can capture local and global interdependencies in the text by replacing absolute positional encoding with a relative one. To improve the training of word vectors in the multi-head attention part of the Transformer, the fast gradient method (FGM) adversarial training algorithm is added. In the proposed structure, words are first converted into word vectors in the embedding layer. Then, to obtain the desired positional embedding information, the location information formula is used to add the location code to the word vectors at each position using RPE. The results are passed to the encoder and decoder of the Transformer model for training, and the FGM adjusts the training data for the encoder layer. The FGM adversarial training algorithm is added to the attention module to enable the model to identify more adversarial examples and reduce over/under-translation problems. By adjusting parameters, adversarial training generates noise to enhance robustness and generalization. To obtain a better adversarial sample, FGM typically uses a perturbation value that scales with the gradient. The loss in the multi-head attention of the encoder layer, along with the gradient value, the embedding layer gradient, and the norm value, is used to obtain a new loss and its gradient. The parameters are updated for improved model convergence using a combination of the initial and adversarial gradients. Using relative position embedding and adversarial training ensures that words are positioned correctly during translation and that the Transformer makes use of semantic information. Chinese-English is employed as a low-resource language pair in this study. The proposed model was compared with a CNN-based Transformer, BERT-Fused, and other approaches in the literature and achieved better results.
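The two ingredients summarized for [98] can be sketched in a few lines. The sketch assumes the commonly used FGM form, in which the perturbation is the gradient rescaled to a fixed L2 norm, and a clipped relative-distance matrix as one common way to index relative positions; neither is confirmed as the exact formulation used in the study, and all function names are hypothetical.

```python
import math


def fgm_perturbation(grad, epsilon=1.0):
    """Return epsilon * g / ||g||_2, the usual FGM perturbation for an embedding gradient g."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)   # no direction to perturb along
    return [epsilon * g / norm for g in grad]


def apply_perturbation(embedding, grad, epsilon=1.0):
    """Add the FGM perturbation to an embedding vector to form an adversarial input."""
    delta = fgm_perturbation(grad, epsilon)
    return [e + d for e, d in zip(embedding, delta)]


def relative_positions(length, max_distance):
    """Matrix of relative offsets j - i, clipped to [-max_distance, max_distance]."""
    return [[max(-max_distance, min(max_distance, j - i)) for j in range(length)]
            for i in range(length)]


delta = fgm_perturbation([3.0, 4.0], epsilon=0.5)        # gradient norm is 5
adv = apply_perturbation([1.0, 1.0], [3.0, 4.0], 0.5)    # perturbed embedding
rel = relative_positions(3, max_distance=1)
```

In a training loop, the loss would be computed once on the clean embeddings and once on the perturbed ones, and both gradients would contribute to the parameter update, in line with the combination of initial and adversarial gradients described above.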
The study in [99] concentrates on examining the efficacy of unsupervised and semi-supervised methods for English-Manipuri MT using the available monolingual data. It uses self-training (ST) and BT to augment the limited parallel data with monolingual data on the source and target sides in order to overcome the low-resource problem with a semi-supervised system. In order to increase the amount of original parallel data, ST uses a source-target MT model to translate monolingual data in the source language. Similarly, BT creates synthetic data from target monolingual data using a trained target-source MT model. A thorough analysis of three supervised candidate structures (SMT, LSTM, and Transformer) was carried out to choose the base architecture for the suggested semi-supervised MT model. Since the Transformer gave the best results among the trained models, it was used in the proposed model. To deal with the lack of parallel data, a semi-supervised MT system that includes ST, forward-translation (FT), and BT is proposed. The artificial data produced during BT and FT is noisy, and this noise follows its own distribution. Therefore, to randomize this built-in noise, perturbations in the form of word shuffling, word dropout, and word spacing are added to alter the initial noise distribution of the artificial data. In addition, cases where only BT, only ST, and both were used were compared. Using the two methods together gave much better results. Increasing the amount of artificial data is advantageous only to some extent, and performance suffers when more synthetic data is added because of the additional noise. To examine how the BLEU score of all models varies with sentence length, test sentences were grouped according to the length of the reference sentences. It was observed that performance decreased as the reference sentence length increased. The suggested semi-supervised approach was compared with alternative supervised, unsupervised, and pre-trained mBART techniques, and better results were obtained with the proposed method.
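The perturbations applied to synthetic data in [99] can be sketched as below. Only word dropout and word shuffling are shown; the local-shuffle scheme (sorting by index plus a bounded random offset) is a common choice in the noising literature and is an assumption here, as are the function names and probabilities.

```python
import random


def word_dropout(tokens, p, rng):
    """Drop each token with probability p; keep at least one token."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else tokens[:1]


def local_shuffle(tokens, k, rng):
    """Shuffle tokens locally: each token moves at most k positions.

    Each position i gets the sort key i + u with u drawn uniformly from [0, k + 1).
    """
    keys = [i + rng.uniform(0, k + 1) for i in range(len(tokens))]
    order = sorted(range(len(tokens)), key=lambda i: keys[i])
    return [tokens[i] for i in order]


rng = random.Random(13)  # fixed seed so the sketch is reproducible
sentence = "this is a synthetic sentence from back translation".split()
noisy = local_shuffle(word_dropout(sentence, p=0.1, rng=rng), k=3, rng=rng)
```

Applying such noise to the synthetic side only, and leaving the original parallel data untouched, is one way to keep the model from fitting the systematic errors of the translation model that produced the artificial data.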
In [100], it was shown that NMT systems are able to benefit from additional morphological information for English-Slovene translation. To provide a more thorough understanding of the practicability of morphosyntactic description (MSD) tags and of how to integrate them, experiments were performed using various training corpus sizes and methodologies. The concept behind the proposed technique is to emphasize data preparation rather than the design of the NMT system. Labeled and lemmatized corpora were used to create five different formats from each corpus using different methods. The best results in the English-Slovene direction were obtained when words and MSD tags were used as distinct tokens in both languages. The best results in the Slovene-English direction were obtained when lemmas and MSD tags were used as independent tokens on the source side and only surface words were used on the target side. The NMT models used in the study are Transformer and LSTM. However, the study does not clearly state which results were obtained with which model.
In [101], the authors proposed a new method by extending their previous work called "regressing word embeddings (ReWE)". During training, ReWE is incorporated as a module into the decoder of the seq-to-seq model. As a result, the model is trained to predict the next word in the translation as well as its pre-trained word embedding. This approach has shown that pre-trained word embeddings can take advantage of embedded contextual information, especially with low/medium-size corpora. The idea used in the previous study is extended to sentence embedding regression (ReSE). ReSE applies self-attention to every input sentence in order to learn a single, fixed-size vector at the output. Throughout the training phase, the model is trained to regress this vector towards the pre-trained sentence embedding of the reference sentence. Specifically, it is proposed to jointly regress word and sentence embeddings as a unified training objective, and the suggested method is named ReWE+ReSE. In order to promote model regularization, the proposed ReWE model incorporates information from the word vectors into the loss function. A ReWE block is added to the NMT model to produce continuous vector representations as output. The ReWE block receives the hidden vector from the decoder at each decoding step and outputs a second vector of the same size as the pre-trained word embeddings. The model is trained to regress this predicted vector towards the correct word embedding. This is accomplished by employing a loss function that computes the distance between the two vectors. The authors state that they used cosine similarity for this loss in the previous study. ReSE and ReWE differ primarily in that ReSE predicts one regressed vector per sentence as opposed to one regressed vector per word. The proposed approach makes use of the LSTM and Transformer models. Additionally, pre-trained FastText embeddings are used to initialize the word embeddings in both models. Pre-trained USE and SBERT sentence embeddings were used for corpora where English is the target language, as they can be used as monolingual encoders. Among the models used, the LSTM model achieved better results than the Transformer model.
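The regression objective described for [101] can be sketched as an auxiliary loss added to the usual translation loss. The sketch assumes a cosine-based distance and a single weighting factor `lam`; the weighting scheme, the toy vectors, and the function names are hypothetical and not taken from the study.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors (1.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def regression_augmented_loss(nll, predicted_vec, target_embedding, lam=1.0):
    """Translation loss (negative log-likelihood) plus a weighted embedding regression term."""
    return nll + lam * cosine_distance(predicted_vec, target_embedding)


# Toy values: the decoder's regressed vector versus the pre-trained embedding of the reference word.
predicted = [0.0, 1.0]
reference = [1.0, 0.0]
loss = regression_augmented_loss(nll=2.0, predicted_vec=predicted,
                                 target_embedding=reference, lam=0.5)
```

The same function would apply at the sentence level for a ReSE-style term, with one regressed vector and one pre-trained sentence embedding per sentence instead of per word.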
In [102], a data augmentation method using a BT technique for NMT is proposed, together with a neural network-based data evaluator called EvalNet. EvalNet determines weights for the training data and the augmented data. In a gradient descent step, the loss values can be modified using these weights. EvalNet is trained to assign greater weights to actual training data than to artificial data, and higher weights to artificial data than to noisy data. As a result, EvalNet enables data augmentation while maintaining NMT performance. EvalNet uses three characteristics to assess the quality of parallel data. The first is the loss value. The second is the similarity in meaning between the source sentence and the target sentence. The third is the cross-attention map between the encoder and decoder components of the Transformer model. The cross-attention map in MT indirectly denotes the relationship between a source and a target sentence, so noisy parallel sentences might not have the same cross-attention as normal ones. Fully connected layers transform the semantic similarity and the loss value into feature vectors, and LSTM layers transform the cross-attention maps into feature vectors. The semantic similarity, cross-attention map, and loss value are thus the three inputs of EvalNet, while the output is an estimate of the data weights. EvalNet provides evaluation weights that help train NMT systems efficiently and effectively from noisy and normal data. Several experiments have shown that EvalNet outperforms previous work as a data evaluator. For use in training NMT systems, the two sides of artificial parallel sentences should have the same meaning regardless of how they were gathered or created. As a result, one of the main characteristics used by EvalNet is the semantic similarity of sentence pairs. A sentence must first be represented as an embedding vector before the semantic similarity of two sentences can be measured. In this work, language-independent BERT sentence embedding is used for embedding a sentence.
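The way per-example weights can modify the loss in a gradient step, as described for [102], can be sketched as a weighted average of example losses. The normalization by the weight sum and the toy weights are assumptions for illustration; in the study the weights come from the trained EvalNet, which is not reproduced here.

```python
def weighted_batch_loss(losses, weights):
    """Weighted mean of per-example losses; weights reflect estimated data quality."""
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(l * w for l, w in zip(losses, weights)) / total


# One genuine pair, one back-translated pair, one noisy pair (hypothetical weights).
example_losses = [1.0, 2.0, 4.0]
example_weights = [1.0, 0.5, 0.0]
batch_loss = weighted_batch_loss(example_losses, example_weights)
```

With a weight of zero, the noisy pair contributes nothing to the update, while the artificial pair contributes half as much as the genuine one.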
In [103], a Transfer Learning Based Semi-Supervised Pseudo Corpus Generation (TLSPG) method is proposed for the translation of zero-resource languages using semi-supervised learning, in order to address zero-shot translation issues and take advantage of similarities among low- and zero-resource language pairs. The suggested TLSPG method is based on a hybrid architecture that combines SMT and NMT models. TLSPG uses the relationship between low-resource and zero-resource language pairs to create a pseudo corpus, and TL is used to learn the context of sentences in a semi-supervised manner. As opposed to the multi-lingual ZST scenario, where both HRLs and LRLs are considered, the approach here focuses on using a single LRL parallel corpus to develop an MT system for languages with zero available resources. The proposed method consists of three components: Transformer-based semi-supervised learning (TSL), Moses-based semi-supervised learning (MSL), and TL-based creation of pseudo-corpora. The model for zero-resource translation was pre-trained with semi-supervised learning using the TSL and MSL components. The TL-based pseudo-corpus generation component creates a parallel aligned corpus for zero-resource language pairs via the pre-trained TSL and MSL modules. Then, after training the MT model using the Transformer or Moses systems, a synthetic parallel corpus is formed by mixing the parallel corpus of the relevant languages with the pseudo-corpus. TLSPG initially applies the pre-trained TSL or MSL model to monolingual data from the target side of zero-resource language pairs. It then generates monolingual sentences on the source side. The generated source-side monolingual sentences and the target-side monolingual zero-resource language sentences are then aligned in parallel in the source-target direction. To produce a synthetic source-target parallel corpus for language pairs with zero resources, TLSPG integrates the generated aligned parallel data with a parallel corpus of relevant language pairs. Two models were created specifically for NMT: Transformer models trained on data generated from TSL and on data generated from MSL. The mBART model was used as a comparison for the proposed model. In general, the SMT approach gives better results than NMT.
The study in [104] addresses the problems of domain mismatch in low-resource translation and the lack of low-resource corpora. The Transformer is used as the primary model. Subsequently, a lexically constrained mechanism is applied to the Transformer encoder. In addition, a TL approach is used to overcome corpus limitations. In the pre-processing stage, the authors used an approach called dynamic dictionary. The primary contributions of this research include: a) examining the best data processing strategy to enhance neural network performance; b) gathering 60,000 pairs of sentences in English and Vietnamese from the fields of politics, business, the arts, and sports in order to create a parallel corpus using BT; c) proposing a new method for low-resource MT through TL based on a lexically constrained model. Token, positional, and segment embedding layers are added to the Transformer model to constrain specific words from references. To investigate the performance of the proposed approach, the model is first trained only on the English-Vietnamese corpus. Second, a TL technique is applied to the model to make use of the high-resource language pair. Comparisons with models in different translation directions are made to analyze the decoding speed of the proposed model. It was found that the proposed approach is slightly slower. In addition, the behavior of the model with different beam sizes was studied, and it was observed that the results do not improve once the beam size exceeds 30.
In [105], the Dual-lEvel-bAck-tRanslation (DEAR) scheme was proposed. As an extension of NMT, multi-modal NMT uses images or videos as auxiliary information. As a model of NMT, back-translation improves the reducibility of languages. The proposed method is a dual-level back-translation method using multi-pattern joint learning. It is designed to perform back-translation at the sentence and concept levels. In sentence-level back-translation, the target sentence is taken as the input of the model to reconstruct the source sentence through a translation model. The model used in the study is the Transformer. Concept-level BT is applied to the video through the dynamic visual concept. When a video is given, the first k keyframes are obtained. Each keyframe is then re-encoded, together with the following 32 frames, as a new action segment. Thus, action detection and concept labels are obtained. Then, the sentence-based concept attribute is created to synchronize the sentence and the action. Sentences and action concepts are combined using the joint attention method. For this, a technique called multi-pattern joint learning has been introduced. This method relies on two corpora that share the Transformer parameters during translation in the source-target and target-source directions. This makes it easier to restrict the input language by combining parameters. Thus, translation at the sentence and concept levels is naturally learned jointly. For action capture, the pre-trained ImageNet model was used with fine-tuning on the Kinetics400 dataset. The suggested approach performed better than several techniques in the literature.
In [106], a more advanced embedding method is proposed that allows the updated results of word embeddings to be shared during the optimization of neural networks. The main idea is that the original embedding matrix is replaced by the inner product of two matrices, R and S. Matrix R holds the prior information about the relationships between words, which can be acquired through pre-training or self-iterative training. Matrix S, which maintains the adaptiveness of conventional embedding, is acquired through iterative training within the limitations of the translation task. In self-iterative training, the relation embedding and the translation system are initialized and updated at the same time. Pre-training, on the other hand, involves training the relation embedding first, using other systems such as an LM. By iteratively updating each embedding during the training phase, the new word embedding matrix contains the relationships between words, which are fully reflected in the entire embedding matrix. The original embedding, which takes up 50% of the size of the Transformer model, is replaced by the relation embedding. The authors state that the proposed method does not lead to any performance loss in most cases and that the 85% of the relation embedding elements that are equal to 0 can be safely removed, thus reducing the model parameters by at least 40%. The proposed method was tested on the Transformer model in several scenarios. A Transformer-XL-based language model and a BERT model are used for traditional word embedding pre-training and relation embedding training. Traditional, relation, and shared embeddings were pre-trained in Malagasy, Czech, Spanish, Russian, Lithuanian, and English for low-resource translation tasks. The LM was used to pre-train the first four languages independently, and the LM and BERT models were then used for the final two. Data usage is greatly improved in the study: even though the training data is minimal, the method is able to capture the key features of the language better and thus achieve higher performance improvements. The proposed model outperformed the Transformer model in all translation tasks with fewer parameters.
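The factorized embedding described for [106] can be illustrated with a small sketch in which a word's embedding is a row of the product R·S and the zero entries of R are not stored. The shapes (R as a vocabulary-by-vocabulary relation matrix, S as a vocabulary-by-dimension matrix), the toy values, and the function names are assumptions for illustration.

```python
def sparsify(R, threshold=0.0):
    """Store each row of R as {column: value}, dropping entries with |value| <= threshold."""
    return [{j: v for j, v in enumerate(row) if abs(v) > threshold} for row in R]


def embed(word_id, R_sparse, S):
    """Embedding of word_id, i.e. row word_id of the product R @ S, using only nonzero R entries."""
    dim = len(S[0])
    out = [0.0] * dim
    for j, r in R_sparse[word_id].items():
        for d in range(dim):
            out[d] += r * S[j][d]
    return out


# Toy vocabulary of three words with two-dimensional embeddings.
R = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.0],
     [0.5, 0.0, 1.0]]
S = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0]]

R_sparse = sparsify(R)
stored = sum(len(row) for row in R_sparse)   # number of R entries actually kept
e0 = embed(0, R_sparse, S)                   # word 0 mixes its own row of S with word 2's
```

Because related words share rows of S through R, an update to one row of S changes the embeddings of all words related to it, which is one way to read the sharing of updated results described above.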
In [107], an effective method for improving NMT performance in low-resource languages by using monolingual data is proposed for the Wolaytta-English pair. The study has two primary objectives: a) training a model on the existing Wolaytta-English parallel corpora (base model) and self-learning; b) training the base model on a combination of the original and synthetic corpus using a fine-tuning approach. The following questions are addressed in this study: Does the performance of NMT for a low-resource language pair improve when only source-side monolingual data is used? Does the performance of NMT with English as the source language improve when monolingual data from a low-resource language is used? Three main experiments were carried out in the study. LSTM encoder-decoder, bi-LSTM encoder-decoder, and Transformer models were used for the baseline experiments. The Transformer model, which performed best among the three, was chosen to build an artificial English corpus from the Wolaytta monolingual data. A self-learning technique was applied to the Transformer model by merging the pseudo-parallel training data with the original parallel data. To create the final NMT model, TL was used to fine-tune the self-trained NMT model on the Wolaytta-English data. Self-trained NMT models with both in-domain and mixed validation sets were used during fine-tuning. When artificial and original parallel data were combined for training and the original corpora were used for validation and testing, using only source-side monolingual data was found to improve NMT performance in both translation directions for Wolaytta as a low-resource language. Using the original parallel corpora to fine-tune NMT models that were trained on both artificial and original data also improved NMT performance in both translation directions for the Wolaytta-English pair.
In [108], an electrical engineering corpus was used for model training, and two issues were investigated: the MT model losing core information and the varying emphasis on multi-layer information. Various methods were used to fuse the output vectors of every layer in the encoder. On this basis, a vector fusion-based multi-attention mechanism translation model is developed, and the decoder component is improved. Thus, the enhanced model gains a more thorough domain knowledge of the source language at the encoder side and enables this knowledge to be better exploited in the decoding phase to enhance the translation success of the model. This study uses Transformer as the basic model. In the multi-layered structure of the encoder units, each layer contributes differently to the syntactic and lexical information in the output vector. When layers are repeatedly stacked on top of one another, the output vectors of the units closest to the upper layers concentrate more on grammar, and the output vectors of the units closest to the lower layers concentrate more on the lexical meaning of the source language. As a result, various vector fusion techniques are employed in this work to combine the output vectors from various encoder units before passing them to the decoder. This leads to an enhanced source language representation, consequently improving the translation efficiency of the model. Four techniques (average, additive, weight, and gated fusion) were used in the system fusion experiment and the encoder internal vector fusion experiment, both of which used Transformer as the basic model. Among the fusion methods, the weight-fusion method achieved the best results, and subsequently, a vector fusion-based NMT model with different attention mechanisms is suggested.
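The four fusion techniques named above can be illustrated with a minimal sketch. This is not the implementation from [108]: the function names, the softmax normalization of the layer weights, and the element-wise sigmoid gate are assumptions made for illustration, and in a real model the weights and gate values would be learned parameters rather than fixed inputs.

```python
import math

def average_fusion(layers):
    """Element-wise mean of the per-layer encoder output vectors."""
    n = len(layers)
    return [sum(vals) / n for vals in zip(*layers)]

def additive_fusion(layers):
    """Element-wise sum of the per-layer encoder output vectors."""
    return [sum(vals) for vals in zip(*layers)]

def weight_fusion(layers, layer_scores):
    """Weighted sum of layers; scores are softmax-normalized (assumption)."""
    exps = [math.exp(s) for s in layer_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v for w, v in zip(weights, vals)) for vals in zip(*layers)]

def gated_fusion(lower, upper, gate_logits):
    """Per-dimension gate g = sigmoid(logit): g * lower + (1 - g) * upper (assumption)."""
    fused = []
    for lo, up, z in zip(lower, upper, gate_logits):
        g = 1.0 / (1.0 + math.exp(-z))
        fused.append(g * lo + (1.0 - g) * up)
    return fused

# Hypothetical 2-dimensional outputs of three encoder layers.
layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(average_fusion(layers))
print(additive_fusion(layers))
print(weight_fusion(layers, [0.0, 0.0, 0.0]))
print(gated_fusion(layers[0], layers[2], [0.0, 0.0]))
```

With equal layer scores, weight fusion reduces to average fusion; unequal scores let the model emphasize lower (lexical) or upper (syntactic) layers.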
In [109], the creation of superior NMT models for the resource-poor Kazakh language is explored. First, existing methods for expanding data sizes for low-resource languages, such as forward translation, BT, and TL, are explored. The most common seq-to-seq NMT designs, RNN, Bi-RNN, and Transformer, are described in detail, along with their features, characteristics, and schematics. Then, the ways of creating a Kazakh-English parallel corpus and the training methods of NMT models are explained systematically. For this, a large corpus of 308.000 Kazakh-English sentences was created by combining 205.000 monolingual Kazakh sentences from scientific papers translated with the Promt MT system with 175.000 parallel sentences collected from official government online sources. LSTM, Bi-LSTM, and Transformer architectures of the OpenNMT framework are used to train advanced NMT models, and the results are shared. The best results were obtained with the Bi-LSTM encoder-decoder architecture.
In [110], an attempt is made to overcome the issue that current data augmentation techniques cannot be used in both high- and low-resource settings at the same time. The features and constraints of existing data augmentation methods are analyzed, and a scenario-independent data augmentation method for NMT is proposed. To further improve the training corpus, the approach combines BT with low-frequency word substitution. Substituting uncommon words for more frequent ones increases the variety of the training data, allows the model to learn to translate a wider variety of words and phrases, and enhances translation accuracy. This prevents the model from overfitting the limited training data. The proposed framework uses additional language model, word frequency modification, and syntax error correction modules. The existing limited-size bilingual corpus and a sizeable target monolingual corpus are first used to build a BT model in the target-to-source direction. In order to make the final generated corpus parallel to the original one, this paper employs word substitution and uses the grammar error correction module to remove grammatical errors. Subsequently, the generated corpus is combined with the bilingual corpus. The WMT2015 English-German corpora are used in the study, and the high- and low-resource settings are compared in two aspects. First, an experiment that compares several networks is carried out. Second, a comparative experiment is carried out against relevant data augmentation studies. The proposed method is compared with the RNNSearch, ConvS2S, and Transformer models and the way they are used with different data augmentation methods existing in the literature. The Transformer model outperforms the RNNSearch and ConvS2S models regarding overall translation performance, regardless of whether it is a high- or low-resource scenario.
In [111], linguistic attributes are used to create a bidirectional NMT system between the Sanskrit-Malayalam languages. To enhance translation efficiency, the text-based MT system uses linguistic features such as morphological features, POS-tags, and word sense disambiguation (WSD). The Transformer-based Sanskrit-Malayalam translation model comprises six distinct modules, incorporating linguistic features. In this study, in addition to text data, manually created audio data is also used. The corpus to be used for the [...] produced by the encoder network. The decoder employs this altered context vector to generate the correct translation. Methods called max/min/average fusion are used to conduct fusion. The WT, SMRT&GMRT and average fusion approach produced the greatest outcomes in both translation directions.
In [112], the potential benefit of pre-alignment and pre-trained LMs for enhancing the English-Assamese NMT system is studied. In this work, guided alignment is used together with the Transformer model, and the parallel corpora of EnAsCorp1.0 and Samanantar are utilized to improve translation success in both directions. The FastAlign tool and the idea of guided alignment in Transformer-based NMT were utilized to obtain token alignment knowledge from source-target parallel sentences. In addition, the alignment information obtained by FastAlign and SimAlign is implemented in Transformer-based NMT. It was found that the SimAlign technique with Transformer-based NMT provided better translation in both directions than the FastAlign technique. An enhancement in both translation directions was also observed with the pre-trained language model. In addition, English-Spanish and English-Bengali language pairs were used for comparative analysis using pre-alignment. When the findings were examined, higher translation accuracy was obtained with the SimAlign technique based on pre-trained multi-lingual contextual embeddings, with or without previous alignment information based on FastAlign. In addition, when the pre-trained LM is used, translation accuracy is even higher for longer sentences.

B. RQ1: WHAT IS THE FOCUS IN LOW-RESOURCE NMT AND ON WHICH LANGUAGE PAIRS ARE STUDIES CONDUCTED?
With this research question, it is aimed to examine which applications/directions are emphasized in the field of low-resource NMT and which language pairs are used and how often.
When the 45 studies selected for review are investigated, it is seen that different applications are made to increase the success of translation in the field of low-resource NMT. In order to classify the studies examined in certain aspects, the focus areas of each study were described with a maximum of 4 keywords. Some studies achieved successful results for the language pair studied but, unlike the others, do not focus on any particular point. In other words, these studies only apply NMT systems that already exist in the literature to one or more language pairs, so they do not have a specific focus on any application. In Table 6, the direction of the studies and the low-resource pairs used in the studies, or other language pairs used in the low-resource setting, are shared.
In Table 7, information about the number of studies in each focus area is shared. When Table 7 is examined, it is seen that a total of 20 different aspects are focused on for low-resource NMT. The most commonly used aspects are data augmentation with 13 studies, the use of pre-trained models with 10 studies, and the use of transfer learning with 9 studies.
When looking at the most commonly used methods, it is seen that they generally try to cope with the lack of corpora in the low-resource NMT field. Data augmentation includes methods used to increase corpus size, such as back-translation and forward-translation. When looking at pre-trained models, using the BERT model comes to the fore. It is seen from the studies examined that the use of models such as BERT with a language-specific pre-training provides higher translation performance. The transfer learning method, on the other hand, is frequently used in the field of machine learning in general and is frequently used here to cope with the lack of corpus. It is often preferred to train a model with a high-resource language pair and then apply this model to a low-resource language pair. In the field of low-resource NMT, it is seen in some studies that it is aimed to increase the success of translation by including various grammatical features and morphological information in the models. However, the use of these features varies according to the extent to which the languages to be studied have the tools to provide these features. In addition, the inclusion of these features in the models may cause additional costs if there are no ready tools for the language pairs studied. Looking at some of the reviewed studies, it is seen that changes have been made in the embedding and attention layers of NMT models. Various techniques are employed to fuse output vectors from specific units within the encoder or decoder. This is done to tackle the issues of information loss in the translation model and varying levels of emphasis on multi-layer information. Models built using these methods can improve translation by combining information from different layers. Since embeddings contain direct representations of words or sentences, changes made in this section can achieve successful results if they provide better representations. It is seen that the changes made in
the attention mechanism are generally made in order to include the additional features used in the models or to provide a stronger representation. When we look at the reviewed studies, it is seen that there are five studies using unsupervised approaches. The GAN architecture and the adversarial training approach used in these structures have recently been applied in the field of NMT, and there are four studies using this method among the studies examined. In unsupervised architectures, back and forward translation models are usually created first, and then these models are used jointly. When the GAN architecture is the basis of an NMT study, the model is built on a generator and a discriminator. Reconstruction loss occurs when noisy translations are reconstructed in the forward and backward directions, and discrimination loss occurs when the model attempts to separate the translated text from the original text. Using this method can provide a higher-quality and more fluent translation for LRLs. These methods are used by modifying the adversarial framework, either by including additional adversarial steps or by adding extra loss functions in the optimization step. Among the reviewed studies, the one that has an unsupervised architecture and does not use a GAN structure is a study on the use of the pre-trained MASS model in the NMT model.
Multi-lingual NMT approaches, which deal with translation between multiple language pairs, build a single model for translation across several languages using accessible linguistic resources in numerous languages [26]. With this approach, knowledge from multiple languages can be learned collectively and applied to help low-resource languages. Multi-lingual NMT techniques use data from various languages to build models. Information from high-resource languages can be utilized to improve the success on low-resource languages because such systems attempt to represent many languages in the same vector space [43]. When the reviewed studies are examined, it is seen that multi-lingual translation is used in only three studies. The methods used in these studies are multi-lingual pre-training, multi-lingual embedding, and multi-lingual implementation of the attention mechanism. Since input from many languages may be employed at once, multi-lingual approaches are generally significantly more effective than training models on language pairs separately. In addition, one of the most essential features of multi-lingual models is that the model can produce a translation for a language pair not included in the training data, enabling zero-shot translation. Finally, multimodal structures focus on how to utilize non-text data to improve translation quality.
Information about the language pairs used in the studies examined and how many studies they were used in is given in Table 8. A total of 72 different language pairs were used in these studies. When the language pairs used in the studies are examined, it is seen that the English language is used to a great extent together with a low-resource language, and a total of 49 different uses are made in this way. This is more than half the number of language pairs used, which indicates that studies generally focus on the English language. In addition, it was also observed that most of the translation processes used a high-resource language on one side or the other. The number of cases where languages known to have high resources, such as English, Chinese, French, Italian, Spanish, German, and Arabic, are used in any translation direction is 53 out of 72. After English, the most used languages in any direction are Hindi, with 11 different uses, and Chinese, with 6 different uses. The studies show that while low-resource languages are used in any direction, high-resource languages are often used in low-resource settings. Translation between two low-resource languages is less common than the other cases, with 15 different translation directions of this kind. The most common language pairs used in the studies were English-Turkish with six studies, and English-German, English-Hindi, and English-Chinese with five studies each. Turkish and Hindi are known to be low-resource languages, but German and Chinese are high-resource languages and were used in low-resource settings in these studies (<1M data). In addition, Table 9 provides information on the language families of the languages used in the studies analyzed. Based on this information, it can be said that in the field of low-resource NMT, studies are generally conducted on languages in the Indo-European language family. Information about which language families the languages belong to is taken from the Glottolog website.

C. RQ2: WHICH DEEP LEARNING METHODS ARE PREFERRED IN LOW-RESOURCE NMT AND WHICH EVALUATION CRITERIA ARE USED?
This research question investigated which methods are most preferred for model building in the field of low-resource NMT and which evaluation criteria are used to assess the results of the models built with these methods.
Table 10 provides information about the deep learning methods used in the reviewed studies and the studies in which they were used. It is seen that all of the NMT models in the reviewed studies were created within the encoder-decoder framework. The most commonly used method among the models is the Transformer model, which has been used in 35 studies. The attention mechanism, known to increase translation success for pre-Transformer encoder-decoder frameworks, was used in 17 studies. Among these, the Luong-style attention method was used in 10 studies, and the Bahdanau-style attention method was used in 7 studies. Apart from these, one of the methods used in the models created is the Transformer-based BERT model, which was used in 5 different studies. In addition, it is seen that the CNN structure is not used much in the field of NMT and has been used in a total of 4 studies. The CNN structure is generally preferred when GAN-like structures are applied in the unsupervised NMT domain. The number of studies that do not use any attention mechanism is 1. It is seen that LSTM and GRU structures are generally preferred in encoder-decoder models other than the Transformer model. The LSTM method stands out as the most widely used among these methods. In other words, if the Transformer structure is not used in an NMT model, LSTM is generally preferred. As can be seen in Table 10, since it was proposed in 2017, the Transformer model has been getting the best results in NMT studies and has been used as the mainstream NMT method. In addition, some studies use more than one model and provide comparisons between them.
Machine translation studies proposed in the literature are evaluated with metrics that assess translation quality and allow comparisons. In the reviewed studies, 13 different metrics, namely BLEU, METEOR, TER, WER, RIBES, ChrF, F-measure, ROUGE, Perplexity, Adequacy, Fluency, Overall and BLEURT, were used to evaluate NMT systems. Information about these metrics and their usage is given in Figure 6. Due to the diversity and variability of languages in the world, there may be more than one translation of a sentence, and which one is correct may vary regionally. There is not yet a standardized approach to evaluate the success of NMT systems [113], [114]. When the reviewed studies are examined, it is seen that 2 types of evaluation criteria are used to evaluate the translation success of the models. These are automatic evaluation and human evaluation metrics. Although the costs of automatic evaluation are lower than those of human evaluation, the quality of human evaluation is much better. In addition, from the point of view of the studies reviewed, since the proposed models are usually compared with different models, automatic evaluation stands out in terms of speed and provides convenience to researchers. Human evaluation is more costly than automatic evaluation in terms of time and human effort. Unlike automatic evaluation, the possibility of inconsistent results should not be ignored since there is a human factor. For these reasons, automatic evaluation criteria are generally preferred in studies. Automatic evaluation is performed by comparing the translations produced by the models with the reference translation, i.e., the correct translation [114]. However, automatic evaluation only captures lexical similarities, and no content or grammar checks are performed. Therefore, sentence structure cannot be properly checked in this way. Table 11 provides information about the evaluation metrics used by the reviewed studies. When the reviewed studies are examined, it is seen
that a total of 13 different criteria were used, 10 as automatic evaluation criteria and 3 as human evaluation criteria. Within the scope of this study, the most commonly used automatic evaluation methods (BLEU, TER, ChrF, and METEOR) and the human evaluation criteria are detailed. The BLEURT metric is not included in Table 11 since it is used only in [104].
Since automatic evaluation criteria only consider some aspects of translation quality, the results may be inaccurate. These methods usually use positional information of words to evaluate machine translation. On the other hand, human evaluation assesses machine translation based on adequacy, fluency, and overall rating [68], [113]. Adequacy measures how much of the meaning of the reference sentence is preserved in the machine translation. The fluency metric measures how well-formed the machine translation is in the target language, without considering its relevance to the reference sentence. The overall rating is the average of the adequacy and fluency values of the machine translation. A translation with high adequacy and fluency values is considered high quality and achieves a high score. All three metrics mentioned above are scored between 1 and 5, with higher values indicating better results.
When the reviewed studies were examined, it was seen that the most commonly used metric was the Bilingual Evaluation Understudy (BLEU) metric, which was used in all 45 studies. BLEU, an automatic evaluation metric, is utilized to assess how closely the translation generated by the MT model resembles the reference translation [115]. Similar to human evaluation, the BLEU metric is intended to reflect the translation's adequacy and fluency. The BLEU score is computed from three key components: first, the precision of n-gram matches between the machine translation and the reference translation; second, a brevity penalty (BP) that counteracts the bias toward short outputs; and third, clipping, which limits how often a repeated n-gram can be counted. Precision is computed by dividing the number of matched n-grams by the total number of n-grams in the machine translation. To prevent the same n-gram from being counted more than once, the match count of each n-gram is clipped to the maximum number of times it occurs in any single reference sentence. Since BLEU does not use recall, a very short translation could otherwise obtain a high precision; therefore, when the generated sentence is shorter than the reference sentence, BP is applied to reduce the score, and short sentences are penalized heavily. Since there is almost no human involvement in the evaluation, it is a simple and useful method to assess the quality of the generated translation. However, the BLEU score only uses the n-grams in the sentence, and the results may vary depending on the number of reference sentences. This method only considers exact word matches, making it difficult to evaluate translations for morphologically rich languages.
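The interaction of clipped n-gram precision and the brevity penalty can be made concrete with a short sentence-level sketch. It follows the standard BLEU definition (geometric mean of clipped 1- to 4-gram precisions multiplied by BP) but omits smoothing and corpus-level aggregation, so it is an illustration rather than a replacement for an established scoring tool; the function names are chosen here for readability.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: matched n-grams / candidate n-grams."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Each candidate n-gram counts at most as often as in any one reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def brevity_penalty(candidate, references):
    """BP = 1 if the candidate is longer than the closest reference, else exp(1 - r/c)."""
    c = len(candidate)
    if c == 0:
        return 0.0
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU; returns 0.0 if any precision is zero."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_avg)

reference = "the cat is on the mat".split()
repeated = "the the the the the the the".split()
# Clipping: "the" occurs twice in the reference, so only 2 of 7 matches count.
print(modified_precision(repeated, [reference], 1))
```

Without clipping, the repeated candidate would receive a unigram precision of 7/7, which is the failure case that clipping is designed to prevent.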
The second most commonly used evaluation metric in the reviewed studies is Translation Edit Rate (TER). TER is an automatic evaluation metric utilized to assess the quality of a machine translation [116]. This assessment is conducted by contrasting the machine translation with a reference sentence. It is obtained by calculating the minimum number of edits required to make the generated translation match the reference sentence. Editing operations include substituting, deleting, inserting, and shifting words. TER is calculated by dividing the total number of edits by the average word count of the reference sentences; because it is an error rate, lower values indicate better translations. It is a frequently used metric but has some shortcomings. TER focuses only on word-level matches and does not utilize the semantic similarity between the machine translation and the reference sentence. This means that grammatically incorrect translations can still receive a favorable score. Conversely, even if a translation is semantically correct, it may receive a poor score when the words in the sentences do not match exactly. Since TER only looks at word-level matches, it does not measure the fluency of the machine translation.
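A simplified sketch of the edit-rate computation is given below. It is deliberately not full TER: the shift operation is omitted, so the count reduces to a word-level edit distance (substitution, deletion, insertion), which can only overestimate the true TER edit count. The function names are hypothetical.

```python
def edit_distance(hypothesis, reference):
    """Word-level Levenshtein distance (substitute, delete, insert)."""
    prev = list(range(len(reference) + 1))
    for i, hyp_word in enumerate(hypothesis, start=1):
        curr = [i]
        for j, ref_word in enumerate(reference, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def ter_without_shifts(hypothesis, references):
    """Minimum edits over references divided by the average reference length."""
    avg_len = sum(len(ref) for ref in references) / len(references)
    edits = min(edit_distance(hypothesis, ref) for ref in references)
    return edits / avg_len

reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on mat".split()
# One insertion ("the") is needed; the reference has six words.
print(ter_without_shifts(hypothesis, [reference]))
```

A score of 0 means the hypothesis already matches a reference; larger values mean more editing is required.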
Another commonly used metric is the character n-gram F-score (ChrF). Unlike the BLEU score, ChrF is calculated by measuring character n-gram overlap instead of word n-gram overlap [117]. This method uses the F-score, which combines character-based n-gram precision and n-gram recall values. N-gram precision represents the percentage of character n-grams in the machine translation that also occur in the reference sentence, while n-gram recall represents the percentage of character n-grams in the reference sentence that also occur in the machine translation. Using these two values, the F-score is calculated, so the overlap between the machine translation and the reference sentence is measured on a per-character basis. Therefore, it gives better results for character-based languages.
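The character-level computation can be sketched as follows. The maximum n-gram order of 6, the recall weight beta = 2, and the removal of spaces before counting are assumptions that mirror common practice; they are not taken from the reviewed studies, and the function names are illustrative.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams after removing spaces (assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """F-beta over character n-gram precision and recall averaged across orders."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# An inflected form still shares most character n-grams with its stem,
# whereas a word-level metric would count it as a complete mismatch.
print(chrf("kitaplar", "kitap"))
```

This partial credit for shared stems is why character-level overlap is attractive for morphologically rich languages.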
Another metric for measuring translation quality is the Metric for Evaluation of Translation with Explicit ORdering (METEOR). This metric is designed to overcome the limitations of the BLEU score. Unlike BLEU, it calculates a weighted F-score using both precision and recall values. The method first aligns the machine translation and the reference sentence to find the largest set of matching words. Words with the same meaning are treated as the same word during this alignment. Precision and recall are calculated based on the number of individually matched words. A penalty is then calculated from how the matched words group into runs of adjacent words: many short, scattered matches are penalized, while longer contiguous matches are rewarded. Its consideration of word stems and words with the same meaning gives it an edge over the BLEU score. This allows the METEOR metric to capture semantic similarity better.
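The weighted F-score and the fragmentation penalty can be sketched in a reduced form. The constants below (recall weighted nine times as heavily as precision, penalty 0.5 · (chunks / matches)³) are those of the original METEOR formulation; the greedy alignment and the restriction to exact word matches are simplifications assumed here, since stem and synonym matching require external linguistic resources.

```python
def align(hypothesis, reference):
    """Greedy exact-match alignment; returns (hyp_index, ref_index) pairs."""
    used = set()
    pairs = []
    for i, word in enumerate(hypothesis):
        candidates = [j for j, ref_word in enumerate(reference)
                      if ref_word == word and j not in used]
        if not candidates:
            continue
        # Prefer continuing the current run of adjacent matches when possible.
        preferred = pairs[-1][1] + 1 if pairs and pairs[-1][0] == i - 1 else None
        j = preferred if preferred in candidates else candidates[0]
        used.add(j)
        pairs.append((i, j))
    return pairs

def count_chunks(pairs):
    """Number of runs that are contiguous in both hypothesis and reference."""
    chunks = 0
    prev = None
    for i, j in pairs:
        if prev is None or i != prev[0] + 1 or j != prev[1] + 1:
            chunks += 1
        prev = (i, j)
    return chunks

def meteor_exact(hypothesis, reference):
    """Simplified METEOR: exact matches only, no stemming or synonyms."""
    pairs = align(hypothesis, reference)
    matches = len(pairs)
    if matches == 0:
        return 0.0
    precision = matches / len(hypothesis)
    recall = matches / len(reference)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (count_chunks(pairs) / matches) ** 3
    return f_mean * (1 - penalty)

reference = "the cat sat on the mat".split()
reordered = "on the mat the cat sat".split()
# Same words, but split into two chunks, so the penalty is larger.
print(meteor_exact(reference, reference), meteor_exact(reordered, reference))
```

Both hypotheses match every reference word, yet the reordered one scores lower because its matches fall into two chunks instead of one.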
Due to their ease of use and speed, word-based assessment criteria have become popular in recent years for evaluating the quality of NMT systems. However, because these techniques are unable to accurately assess the overall meaning and fluency of machine translation, they are unable to evaluate translation quality effectively [113]. The studies under consideration demonstrate that the BLEU measure is the most often applied technique to assess the quality of MT systems. The BLEU metric has limitations, though. The total number of reference sentences may affect the translation outcome because this method only uses n-gram precision [115]. Additionally, there are drawbacks to evaluating MT system translation performance on the basis of precision alone [118]. The BLEU metric does not take word stems or synonyms into account; only exact word matches are considered. Additionally, it does not accurately reflect the meaning and sentence structure of the translation result. Although the BLEU score is frequently employed, these restrictions have motivated the development of other metrics. One of the advantages of METEOR, which is one of the most commonly used metrics in the analyzed studies, is that it considers stems, synonyms, and word inflections, which gives it an edge over the BLEU score. As a result, the METEOR metric allows for a considerably better capture of the semantic similarity between the machine translation and the reference translation. Since METEOR employs an F-score and a penalty function that take recall and precision values as inputs, it also addresses the way the BLEU metric punishes short translations. According to several studies in the literature, the METEOR metric produces results that are significantly closer to those of human evaluation than the BLEU score [113], [119]. Despite being frequently used, the BLEU metric has trouble capturing the similarity in semantic content between texts. When the reviewed studies are looked at, it becomes clear that methods like TER, WER, ChrF, and METEOR are utilized alongside this method. In [113], tests were conducted to determine the relationship between some automatic evaluation criteria in the literature and human evaluation using sentences that are semantically equivalent but have different structures and words. In this study, some metrics used to evaluate MT results were examined by performing a correlation analysis. As a result of this analysis, it is reported that the BLEU score has the lowest correlation score. In contrast, it was reported that among the word-based metrics, the METEOR metric had the highest correlation score. This is due to the fact that the METEOR metric uses precision and recall values at the same time and takes into account stems of words and synonyms [113], [120]. Additionally, it was noted that the character-based ChrF measure had the highest correlation value in the analysis. This is assumed to be because the ChrF metric places more emphasis on characters than words [113]. According to these findings, if an NMT system is to be evaluated using automatic evaluation criteria, metrics that address translation quality from many angles, such as METEOR, TER, ChrF, and human review, should be evaluated alongside the BLEU score. This is because the BLEU score does not capture all aspects of translation quality, and therefore the use of additional metrics is important to better understand the quality of the proposed NMT system.
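The kind of correlation analysis described above can be illustrated with a Pearson coefficient between metric scores and human ratings. The numbers below are hypothetical toy values invented solely to exercise the code; they are not results from [113] or from any reviewed study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-sentence scores (illustrative only, not measured data).
human_ratings = [4.5, 2.0, 3.5, 5.0, 1.5]
metric_scores = [0.62, 0.31, 0.40, 0.71, 0.28]
print(pearson(metric_scores, human_ratings))
```

A coefficient close to 1 would indicate that the metric ranks translations much as human raters do, which is the property compared across metrics in such an analysis.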

D. RQ3: WHAT ARE THE CORPORA AND DEVELOPMENT TOOLS USED IN THE STUDIES?
This research question aims to provide information about the corpora used in low-resource NMT and the tools used to build deep learning models in this field. Since many language pairs are used in the reviewed studies, it is seen that many different corpora are used. Information about which corpus was used for which language pair in which study and, if used, information about monolingual corpora is given in Table 6. When the studies are examined, it is seen that some studies did not provide complete information about the corpus they used. Therefore, only the corpora for which information was available are shared in this table. When the corpora used in the studies were examined, it was seen that 47 different corpora were used in total. The most commonly used corpora are IWSLT, with 11 corpora from various years; WMT, with nine corpora from various years; and the TDIL corpus, which was used six times. In addition to these, in 8 studies, the authors created their own corpora and did not use any other corpus. Figure 7 shows information about the corpora used and their usage numbers.
In addition to the bilingual corpora used in the studies, some studies use monolingual data due to their methods. Information about the languages used in this way and the monolingual corpora used for these languages is shared in Table 12.
While creating NMT models in the literature, libraries that provide tools to simplify the creation of models are generally used. In the second part of this research question, it was examined which development environments were preferred for the creation of the models in the studies examined. The information that was shared in the studies and was accessible is given in Figure 8. When the studies are analyzed, it is seen that the most used libraries are OpenNMT (py-tf) [121] with 12 uses, Fairseq [122] with 10 uses, and TensorFlow with 6 uses. In general, PyTorch-based libraries are preferred for NMT models.

E. EVALUATION
This section aims to provide information about the analysis and future studies in the field of low-resource NMT after the research questions have been answered. In addition, all the results obtained by the analyzed studies are given in Table 13. The reviewed studies cannot be directly compared as they usually have different corpora, many different languages, and focus on different aspects. Accordingly, taking the RQ1 results and Table 6 into account, an analysis of the studies with the most commonly used language pairs is presented first.

1) ENGLISH-TURKISH (EN-TR)
In [75], the English-Turkish language pair was used in both directions. A BLEU score of 17.8 was obtained in the En-Tr direction and 22.5 in the Tr-En direction. The results obtained in this study seem relatively low, but the study is multi-lingual, and it can be seen as a successful model regarding the method used. In [81], a BLEU score of 26.66 was obtained in the Tr-En direction. This study is based on the GAN structure, and a data augmentation method is applied. In another study [84], a BLEU score of 37.9 was obtained in the same translation direction, also using the GAN structure. In [86], a BLEU score of 16.98 was obtained in the En-Tr direction. Although this result seems relatively low compared to other studies, the study focused on using the models more efficiently by using fewer parameters. In [96], a BLEU score of 45.10 was obtained in the Tr-En direction, showing the effect of the pre-trained BERT model on this language pair. The score is high, but this study used a corpus of academic articles, so the results on different domains need to be analyzed. Finally, in [105], a BLEU score of 44.39 was obtained in the Tr-En direction and 36.87 in the En-Tr direction. Unlike other studies, this study is multi-modal in design.

2) ENGLISH-GERMAN (EN-DE)
In [80], BLEU scores of 31.20 in the En-De direction and 38.66 in the De-En direction were obtained. The study focuses on pre-training, transfer learning, and the attention mechanism. In [81], a BLEU score of 35.14 was obtained in the De-En direction using the GAN structure. In [86], a BLEU score of 26.12 was obtained in the En-De direction with the NC11 corpus. In the De-En direction, BLEU scores of 28.46 and 35.79 were obtained using the NC11 and IWSLT'14 corpora, respectively. Bearing in mind that the test sets used in the studies were not analyzed here, this suggests that the IWSLT corpus is more effective for model training. In [91], using the vector fusion method, BLEU scores of 35.68 and 35.32 were obtained in the En-De and De-En directions, respectively. In [100], a BLEU score of 18.91 was obtained in the En-De direction by focusing on data augmentation.

3) ENGLISH-HINDI (EN-HI)
Reference [71] stands out as the only study among those reviewed that uses an ensemble learning architecture. BLEU scores of 19.97 and 21.81 were obtained in the En-Hi and Hi-En directions, respectively. In [76], a hybrid structure was used to analyze the models using corpora from different domains separately. The best result in the En-Hi direction is a BLEU score of 25.99 in the judicial domain, while the best result in the Hi-En direction is 25.70 in the health domain. In [81], a BLEU score of 34.68 in the Hi-En direction was obtained using the GAN structure. In [86], a BLEU score of 19.3 in the Hi-En direction was obtained using the GAN structure. A significant difference is observed between the results of these two studies; however, the corpus used in [86] is extremely small. In [89], on the other hand, a multilingual study is performed, with BLEU scores of 31.9 and 22.1 in the Hi-En and En-Hi directions, respectively.

TABLE 13. Results obtained in the reviewed studies. X-Y: the source language is X, the target language is Y. Only BLEU scores are shared according to RQ2 results.

131808 VOLUME 11, 2023. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

4) ENGLISH-CHINESE (EN-ZH)
In [80], a BLEU score of 27.9 was obtained in the En-Zh direction. In [97], a BLEU score of 12.7 was obtained in the En-Zh direction with a multi-lingual structure; note that several different languages are used in this model. In [98], adversarial training is applied with a focus on the embedding layer, yielding BLEU scores of 22.38 in the En-Zh direction and 19.59 in the Zh-En direction. In [105], a multi-modal structure is used, yielding BLEU scores of 35.70 in the En-Zh direction and 29.13 in the Zh-En direction. In [108], the authors trained a model on their own corpus and obtained a BLEU score of 37.60 in the Zh-En direction.
Considering the results of [75] and [105], relatively better results are obtained for the English-Turkish language pair when the source language is Turkish. Similarly, higher BLEU scores are obtained for the English-German and English-Hindi language pairs when English is the target language. Unlike these pairs, the English-Chinese pair shows lower results when English is on the target side. In addition to the above analysis, studies on the Korean-Vietnamese and Hindi-Nepali language pairs, which are used more often than other language pairs but do not involve a high-resource language in either direction, were also examined.
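The directional comparison above can be checked with a short script. The following is a minimal sketch that uses only the BLEU scores quoted in this section for studies reporting both translation directions; the variable names are illustrative. Because the studies use different corpora and test sets, the resulting gaps are indicative only and are not a controlled comparison.

```python
from collections import defaultdict

# (study, pair, BLEU for English -> X, BLEU for X -> English), as quoted in the text.
scores = [
    ("[75]", "En-Tr", 17.8, 22.5),
    ("[105]", "En-Tr", 36.87, 44.39),
    ("[80]", "En-De", 31.20, 38.66),
    ("[86]", "En-De", 26.12, 28.46),  # NC11 corpus in both directions
    ("[91]", "En-De", 35.68, 35.32),
    ("[71]", "En-Hi", 19.97, 21.81),
    ("[89]", "En-Hi", 22.1, 31.9),
    ("[98]", "En-Zh", 22.38, 19.59),
    ("[105]", "En-Zh", 35.70, 29.13),
]

# Collect the per-study gap (X -> English minus English -> X) for each pair.
gaps = defaultdict(list)
for study, pair, from_english, into_english in scores:
    gaps[pair].append(into_english - from_english)

# Average the gaps per language pair; a positive value favors English as target.
mean_gap = {pair: sum(values) / len(values) for pair, values in gaps.items()}
for pair, gap in mean_gap.items():
    print(f"{pair}: mean (X->En minus En->X) = {gap:+.2f} BLEU")
```

On these figures the mean gap is positive for En-Tr, En-De, and En-Hi and negative for En-Zh, in line with the observation above. Individual studies can still deviate: in [91], the En-De score is slightly higher than the De-En score.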

5) KOREAN-VIETNAMESE (KO-VI)
Reference [70] examined the use of additional grammatical features and morphological information for this language pair. BLEU scores of 27.79 for Ko-Vi and 25.44 for Vi-Ko were obtained. The same method was followed in [78], which obtained BLEU scores of 27.81 in the Ko-Vi direction and 25.62 in the Vi-Ko direction. In [83], the same corpora were used as in [78], and a BLEU score of 28.22 was obtained in the Vi-Ko direction. This study used a pre-trained language model in addition to the additional grammar features. Comparing the two studies shows the positive effect of a pre-trained model on performance for LRLs. In addition, all studies using this language pair use additional grammar features and morphological information. These features are POS tags, WSD, morphological analysis, and word segmentation.

6) HINDI-NEPALI (HI-NE)
In [87], domain adaptation in the agriculture and entertainment domains was studied using reinforcement learning, a rarely used method in NMT. The best result in the Hi-Ne direction is a BLEU score of 52.50 in agriculture, and the best result in the Ne-Hi direction is a BLEU score of 36.91 in entertainment. In [90], the reinforcement learning approach was combined with the GAN structure. BLEU scores of 34.2 in the Hi-Ne direction and 32.1 in the Ne-Hi direction were obtained.
Finally, the future work sections of the reviewed studies are examined in order to determine the directions that can be pursued in low-resource NMT. The future work analysis of the reviewed studies is given in Table 14. In line with the reviews, future studies are summarized as follows:
• The studies show that using additional grammar features in NMT models has a beneficial effect on model success. The most commonly used of these features are POS tags, WSD, and morphological analysis. However, these features are used in morphologically rich languages. The usage areas of these features can be expanded in future studies. However, it is essential to avoid additional costs in terms of model complexity when using these features.
• Using pre-trained models increases the success of NMT models regardless of the data size. Pre-trained models such as BERT, BART, and MASS, or language models trained on monolingual corpora for low-resource settings, will increase success in low-resource scenarios.
• GAN-like models and the adversarial training approach used in these structures have recently been successful in low-resource NMT. Using these structures in combination with data augmentation and reinforcement learning methods is worth investigating.
• Transfer learning approaches are often preferred in machine learning when there is insufficient data. Since the lack of corpora is the biggest problem in low-resource NMT, transfer learning approaches such as hierarchical and pivot-based methods can overcome this shortcoming.
• Better utilization of multi-lingual models in low-resource NMT is a good research topic. Multi-lingual models can be more efficient than bilingual ones as they provide information in multiple languages. In addition, the possibility of zero-shot translation that arises with multi-lingual models is worth investigating, especially for extreme LRLs.
• LLMs work successfully on many NLP tasks and have recently gained a lot of popularity. The effective incorporation of LLMs into low-resource NMT systems and their effectiveness in the English-XX translation direction need to be investigated in more depth.
• In multi-modal NMT, when and how to use different models remains an open problem. A good research topic is finding a suitable method for cases where data such as images, video, or audio are indispensable during translation.
• Domain adaptation has always been a remarkable research topic and has attracted the attention of many researchers. Unlike the techniques used in SMT, domain adaptation in NMT is frequently tied closely to parameter fine-tuning. It is still difficult to solve the issue of unidentified test and out-of-domain translations.
• Models built on the Transformer structure in particular have too many parameters. Working on models that can achieve competitive results with fewer parameters is a good research topic.
• Automatic evaluation metrics are generally preferred when evaluating studies. However, due to their structure, these metrics cannot address all aspects of translation quality. Therefore, human evaluation can be used as a supplement to these metrics. In addition, new metrics that address more aspects of translation quality can be developed.
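The last point in the list above, that automatic metrics cannot capture every aspect of translation quality, can be illustrated with BLEU, the metric reported throughout this section. The following is a minimal sketch of a simplified, unsmoothed, single-reference sentence-level BLEU; the function names and example sentences are illustrative assumptions, and the reviewed studies may use different BLEU implementations and tokenization.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified single-reference sentence BLEU on a 0-100 scale.

    Geometric mean of clipped n-gram precisions (n = 1..max_n) times a
    brevity penalty. No smoothing: any zero precision gives a score of 0.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precision_sum += math.log(matched / total)
    # Penalize hypotheses that are shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100.0 * brevity * math.exp(log_precision_sum / max_n)


reference = "the cat is sitting on the mat"
exact = simple_bleu("the cat is sitting on the mat", reference)
paraphrase = simple_bleu("a cat sits on the rug", reference)
print(exact, paraphrase)
```

In this sketch the exact match receives the maximum score, while the paraphrase, which a human reader would likely judge as close in meaning, scores zero because it shares no trigram with the reference. This is the kind of gap that human evaluation or complementary metrics are meant to cover.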

V. CONCLUSION
In this study, an SLR was carried out to examine the methods used in the field of low-resource NMT. According to the inclusion and exclusion criteria determined in the early stages of the study, 45 studies were selected for review. Three research questions were then answered on the basis of these studies. The first research question was to identify the areas of focus and language pairs used in low-resource NMT. The studies covered a total of 20 different areas of focus, the most common being data augmentation, the use of pre-trained models, and transfer learning. The most studied language pairs are English-Turkish, English-German, English-Chinese, and English-Hindi. The second research question was to determine which deep learning methods are used in low-resource NMT and which metrics are used to evaluate them. The Transformer was the most frequently used method in the models created. Apart from the Transformer, Luong attention is mostly used in the LSTM seq-to-seq architecture. A total of 13 different metrics were used for evaluation, with the BLEU score being the most used. The last research question was to identify the bilingual and monolingual corpora used in the studies and the preferred development environments. The most used corpora are the IWSLT and WMT corpora from various years and the TDIL corpus. Finally, the most commonly used tools for model creation are OpenNMT and Fairseq. In addition, the studies using the most common language pairs were analyzed, and suggestions were made for future studies.
BİLGE KAĞAN YAZAR received the bachelor's degree in computer engineering from Ankara University, Ankara, in 2017, and the master's degree in computer engineering from Ondokuz Mayıs University, Samsun, in 2020, where he is currently pursuing the Ph.D. degree in computational sciences. His research interests include machine learning, deep learning, natural language processing, and machine translation.
DURMUŞ ÖZKAN ŞAHİN received the bachelor's degree in computer engineering from Süleyman Demirel University, Isparta, in 2013, and the master's degree in computer engineering and the Ph.D. degree in computational sciences from Ondokuz Mayıs University, Samsun, in 2016 and 2022, respectively. His research interests include machine learning, data mining, text mining, information retrieval, and Android malware analysis.
ERDAL KILIÇ received the bachelor's degree in electrical electronic engineering from Karadeniz Technical University, Trabzon, in 1991, the master's degree in electrical electronic engineering from Karadeniz Technical University, in 1996, and the Ph.D. degree in electrical and electronic engineering from Middle East Technical University, Ankara, in 2005. He is currently a Full Professor with the Department of Computer Engineering, Ondokuz Mayıs University. His research interests include neural networks, machine learning, and data mining.

FIGURE 1. Example of basic encoder-decoder structure. The vector shown in red represents the encoding of the source sentence into a fixed-size vector.

FIGURE 2. Attention mechanisms used in the Transformer model. A: Scaled dot-product attention, B: Multi-head attention [24].

FIGURE 4. Steps followed for SLR study.

FIGURE 5. Number of studies to be reviewed for SLR study.
In particular, POS tags have been added to Vietnamese sentences, and morphological analysis (MA) and WSD have been applied to Korean sentences. A BERT-based model for Vietnamese sentences was applied to the encoder layer to create an embedding of each token of the given input. To compare the effectiveness of the proposed NMT technique, different MT systems for Vietnamese to Korean have been created with varying input formats. The most important is the BERT-fused Transformer model, in which the BERT-based VietBERT model is used. BERT can identify what a word means based on context and produce relevant embeddings for different contexts, in contrast to context-free methods like word2vec. Additionally, the outputs of the BERT model are maximally utilized. This improved representation is then connected to each layer of the NMT model via the attention mechanism. Consequently, due to this input, the decoder of the NMT model produces more appropriate target sentences. Using BERT also improved the results of POS tagging, which is used to annotate the Vietnamese data. As a result, combining VietBERT and NMT increases the success of Vietnamese-Korean MT. The other models created are Bi-RNN encoder-decoder-based NMT models.
To address the domain adaptation problem, a Reinforce-based Sentence Selection and Weighting (RSSW) method is proposed that chooses data based on the rewards received. After training the NMT model on out-of-domain data using RSSW, the NMT model is fine-tuned on the in-domain corpus using maximum likelihood estimation and minimum risk training. The three modules that make up RSSW are translation model training, policy network, and language model. The initial weight of each training sentence is determined using the LM. Then, the sentence weight modification module adjusts the sentence weights in accordance with the action values, while the reinforcement learning (RL) agent assists in generating change actions according to the environment states through the policy network. Finally, NMT training is carried out using weighted out-of-domain training sentence pairs, and the model is fine-tuned on the original in-domain data. Both the LM and the NMT model are trained using a Transformer architecture. The proposed model was compared with approaches from the literature (in-domain only, in-domain and out-of-domain together, and in-domain training with out-of-domain fine-tuning) and achieved better results. In addition, an LSTM architecture was tried instead of the Transformer architecture; RSSW showed a ∼1 BLEU score improvement in the LSTM model. In addition to the original target languages, German, English, and French were used in the experiments in low-resource scenarios, and the model performance improved by ∼1 BLEU point.
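The weighting step described above can be sketched in a few lines. This is a minimal illustration only: the text does not give the actual update rule, so the action set {-1, 0, +1}, the multiplicative step size, the renormalization, and all function names are assumptions, and the policy network and reward computation are omitted entirely.

```python
def initial_weights(lm_scores):
    """Normalize hypothetical LM relevance scores into weights that sum to 1."""
    total = sum(lm_scores)
    return [s / total for s in lm_scores]


def apply_actions(weights, actions, step=0.1):
    """Adjust each weight by an assumed policy action in {-1, 0, +1}.

    +1 raises the weight, -1 lowers it, 0 leaves it unchanged; the result
    is renormalized so that the weights again sum to 1.
    """
    adjusted = [max(w * (1.0 + step * a), 0.0) for w, a in zip(weights, actions)]
    total = sum(adjusted)
    return [w / total for w in adjusted]


def weighted_loss(sentence_losses, weights):
    """Weighted training loss over a batch of out-of-domain sentence pairs."""
    return sum(w * loss for w, loss in zip(weights, sentence_losses))


lm_scores = [0.9, 0.5, 0.1]        # hypothetical LM scores (higher = closer to in-domain)
sentence_losses = [2.0, 3.0, 4.0]  # hypothetical per-sentence NMT losses
weights = initial_weights(lm_scores)
new_weights = apply_actions(weights, actions=[+1, 0, -1])
print(weighted_loss(sentence_losses, weights), weighted_loss(sentence_losses, new_weights))
```

In this toy setting, raising the weight of the sentence the LM rates closest to the in-domain data and lowering the weight of the least relevant one shifts the training loss toward the more relevant pairs. In the method described above, a learned policy would choose these actions based on rewards rather than having them fixed by hand.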
NMT system was recorded by vocalization and used in a multimodal structure.The Transformer model comprises a feature extraction block and a multi-mode feature-level fusion module.This model uses a Sanskrit-Malayalam corpus and speech data.Text in Sanskrit and Malayalam is provided as input to the Transformer model.Speech signals are sent to the Wavelet Transform (WT), Sequential Mapped True Transformation (SMRT), and GCB-based True Transformation (GMRT) modules for obtaining features.Various feature-level multimodal fusion techniques are used to merge the features from the WT/SMRT/GMRT module with the content vector

FIGURE 6. Metrics used for the evaluation of models in the studies and the number of uses.

FIGURE 8. Development environments/libraries used in the studies.

TABLE 1. Comparison of other survey/review studies (Y: Yes, N: No, P: Partially).

TABLE 2. Queries used in databases for SLR work.

TABLE 3. Number of studies obtained as a result of searches in databases.

TABLE 5. Studies selected for review after applying the inclusion and exclusion criteria.

TABLE 6. Details of low-resource NMT studies. This table contains general information about the studies. It was used for RQ1 and RQ3 answers. The language pairs in the table do not refer to translation directions.

TABLE 7. Focused directions in the field of low-resource NMT and the number of studies using these directions.

TABLE 8. Language pairs used in the studies and their usage numbers (The language pairs in the table do not refer to translation direction).

TABLE 9. Language families to which the languages used in the reviewed studies belong.

TABLE 10. The methods used in the reviewed studies and the studies in which these methods were used.

TABLE 11. Evaluation metrics used in the studies.

TABLE 12. Monolingual corpora and languages used in the studies.

TABLE 14. Future directions in the reviewed studies.