Iterative Translation-Based Data Augmentation Method for Text Classification Tasks

A well-known limitation of existing rule-based text augmentation is that it cannot be applied to other languages because it depends on the grammatical and structural characteristics of a single language. Moreover, most text Generative Adversarial Networks (GANs) are unstable in training due to inefficient generator optimization and rely on maximum likelihood pre-training. This paper addresses the above problems by proposing a novel augmentation method with a Sentence Generator (SG) and Sentence Discriminator (SD) for Iterative Translation-based Data Augmentation (ITDA). This paper makes three original contributions. First, the ITDA SG is designed to provide universal multiple-language support by generating comprehensive augmented sentences through serial and parallel iterations of an existing translator, such as Google Translate. Second, given that the quality of the generated sentences varies depending on the translation combination and the type of sentence, the ITDA uses a discriminator based on a text classifier to select high-quality augmented data. Third, the ITDA can perform sentence augmentation for 109 different languages using discriminators based on text classifiers trained for a specific language or type of data set. Extensive experiments are conducted to evaluate the efficacy of the ITDA using a Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM), CNN-BiLSTM, and self-attention. The results demonstrate that when the ITDA is applied to 480 sentence classification tasks, the average accuracy increases by 4.24%.


I. INTRODUCTION
Many organizations collect increasing amounts of data to build complex data analytics, machine learning, and Artificial Intelligence models [1]. Machine learning models trained on smaller data sets often do not perform well enough. Data augmentation methods can help train more robust models, even when training over smaller data sets [2], and are commonly used in computer vision [3] and speech [4]. Image data augmentation has been used in many studies because it is universal and fast. In contrast, text data augmentation methods exhibit strong language-dependent characteristics because text data are highly sensitive to added random noise or minor changes. Some previous studies address this limitation in text augmentation by focusing on rule-based methods that are highly language dependent [5,6,7].
Consequently, such data augmentation methods have limited utility for other languages. For example, these rule-based methods are intensively studied for English and lack general-purpose utility. Data augmentation is most effective for small data sets, and there is strong demand in other languages for augmentation methods that improve language model training efficiency with small data sets.
However, universal data augmentation methods in Natural Language Processing (NLP) have not been fully explored because it is challenging to establish generalized rules for other languages [5]. Moreover, most text Generative Adversarial Networks (GANs) [8] are unstable in training (due to inefficient optimization of the generator) and rely on maximum likelihood pre-training [9]. This paper proposes a novel Iterative Translation-based Data Augmentation (ITDA) method that can be applied to multiple languages. Like a GAN, the ITDA consists of a Sentence Generator (SG) and Sentence Discriminator (SD), but they are trained independently. The ITDA SG is designed to provide universal multiple-language support by generating comprehensive augmented sentences through serial and parallel iterations of an existing translator such as Google Translate.
We address the problem that the SG may not always produce high-quality transformed sentences by complementing the SG with an SD based on a text classifier, which learns to filter out low-quality sentences. For our first prototype, we develop a deep learning model using a Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) as the ITDA SD. The SD is trained on the labelled source sentences and predicts the label of each transformed sentence generated by the SG. If the predicted label of a transformed sentence differs from the label of the source sentence used to generate it, the transformed sentence is removed.
This method is effective in discriminating low-quality transformed sentences. The main advantage of this independent discriminator is that it can use any classifier model depending on the characteristics of the corresponding language. By intelligently combining an SG and SD, the ITDA can augment sentences in 109 languages supported by Google Translate using a text classifier best suited to the specific language and data set.
We verify the efficacy of the ITDA by conducting extensive sets of experiments using CNN, BiLSTM, CNN-BiLSTM, and self-attention on five real-world data sets in English and Korean. The results demonstrate that when the ITDA is applied to the 480 sentence classification tasks, the average accuracy increases by 4.24%. Furthermore, by changing the SG parameters (i, n), the ITDA improves performance in most cases and can be applied to languages with different grammar rules.

II. RELATED WORK
Easy Data Augmentation (EDA) [5] proposed four methods to augment sentence data using the grammatical features of English: Synonym Replacement (SR), Random Insertion (RI), Random Swap (RS), and Random Deletion (RD). The authors conducted three experiments. The first experiment augments 500, 2,000, and 5,000 original data sets with the four EDA methods and compares the accuracy of Recurrent Neural Network (RNN) [10] and CNN [11] models. The second experiment uses five data sets and observes the performance improvement of each model using 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100% of the data. In the third experiment, the performance is compared using only one of the four methods at a time. In all experiments, the smaller the amount of data, the larger the performance improvement.
EDA has been used in many studies where data are scarce due to its fast processing time and simple implementation. However, some of the four methods may not function well when applied to languages with different grammar rules. Other studies have used data noising as smoothing [12] and predictive language models for SR [13]. Hierarchical Data Augmentation (HDA) augments data by extracting the most relevant parts from sentences and connecting them in order using a hierarchical attention network model [14]. Most sentence data augmentation research focuses on one language. The proposed ITDA is a data augmentation method based on iterative translation for supporting multiple languages.
Translation-based text augmentation methods predominantly use Neural Machine Translation (NMT) [15]. One study explores strategies for learning from monolingual data without changing the encoder-decoder NMT architecture by pairing monolingual training data with automatic back-translation [16]. Back-translation has been proven effective by extensive experiments conducted in [17]. QANet improved the performance of question-answering tasks by generating new data through back-translation that translates English to French and back to English [18]. Unsupervised Data Augmentation (UDA) is a semi-supervised method of learning through data augmentation (including back-translation) that achieves state-of-the-art results on a wide variety of language and vision tasks [19].
We propose an SG using iterative translations based on serial and parallel back-translation. Then, the SD in the proposed ITDA selects high-quality transformed sentences generated from the SG.

III. BACKGROUND
Google's Neural Machine Translation (GNMT) [20] is a deep learning model used by Google Translate that currently supports translations for 109 languages. GNMT follows the general sequence-to-sequence learning framework [21] with attention [22]. Figure 1 illustrates GNMT's structure. GNMT has three components: an encoder network, a decoder network, and an attention network. The encoder transforms a source sentence into a list of vectors, one vector per input symbol. Given this list of vectors, the decoder produces one symbol at a time until the special end-of-sentence symbol is produced. The encoder and decoder are connected through an attention module, enabling the decoder to focus on different regions of the source sentence during decoding.
Let X = x1, x2, …, xM be the sequence of M symbols in the source sentence and Y = y1, y2, …, yN be the sequence of N symbols in the target sentence. The encoder is simply a function of the following form:

x̃1, x̃2, …, x̃M = EncoderRNN(x1, x2, …, xM)

In this equation, x̃1, x̃2, …, x̃M is a list of fixed-size vectors. The number of members in the list is the same as the number of symbols in the source sentence. With the chain rule, the conditional probability of the sequence P(Y | X) can be decomposed as:

P(Y | X) = P(Y | x̃1, x̃2, …, x̃M) = ∏_{i=1}^{N} P(yi | y1, …, yi−1; x̃1, x̃2, …, x̃M)

The decoder is implemented as a combination of an RNN network and a softmax layer. The decoder RNN network produces a hidden state yi for the next symbol to be predicted, which is then processed through the softmax layer to generate a probability distribution over candidate output symbols. Like [23], GNMT uses deep-stacked Long Short-Term Memory (LSTM) [24] networks for both the encoder and decoder RNNs.
The attention module is similar to that in [22]. Let yi−1 be the decoder-RNN output from the past decoding time step. The attention context ai for the current time step is computed as follows:

ai = AttentionFunction(yi−1, x̃1, x̃2, …, x̃M)

where the AttentionFunction in this implementation is a feed-forward network with one hidden layer.

IV. ITERATIVE TRANSLATION-BASED DATA AUGMENTATION METHOD

A. SENTENCE GENERATOR
The SG generates a sentence by transforming the source sentence using Google Translate [20], which supports 109 languages. As depicted in Figure 2, the source sentence is translated into other languages i times serially using Google Translate, where each "other language" is randomly selected from the 108 other languages supported by Google Translate. After completing the i serial translations, the sentence is finally translated back into the language of the source sentence. By iterating the above operation n times in parallel, n transformed sentences can be generated. This serial and parallel sentence generation method has the advantage that it can easily be applied to the 109 languages supported by Google Translate regardless of language characteristics. However, the SG cannot always generate high-quality transformed sentences. We first remove a transformed sentence if it is identical to the source sentence. However, the proportion of such duplicate sentences may be high. This problem can be mitigated by adjusting i. If we increase i, the serial SG is less likely to generate duplicate sentences because the strength of the transformation increases. However, a high i value involves a trade-off: it generates sentences that differ significantly from the source sentences. If these over-translated sentences are used as training data, the performance of the model decreases. We address this problem by using an SD to select and remove transformed sentences that are unsuitable for training.
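As a minimal sketch, the serial-and-parallel generation loop described above can be written as follows. The `translate(text, src, dst)` function is a stand-in for a call to a translation service such as Google Translate; its name and signature are assumptions for illustration, not part of the paper's implementation.

```python
import random

def make_generator(translate, languages, source_lang):
    """Build an SG-style generator.

    translate(text, src, dst) -> str : assumed translator interface
    languages  : codes of the supported languages
    source_lang: language of the source sentence
    """
    def generate(sentence, i, n):
        transformed = []
        for _ in range(n):                       # n parallel iterations
            text, lang = sentence, source_lang
            for _ in range(i):                   # i serial translations
                next_lang = random.choice(
                    [l for l in languages if l != lang])
                text = translate(text, lang, next_lang)
                lang = next_lang
            # finally translate back into the source language
            text = translate(text, lang, source_lang)
            if text != sentence:                 # drop copies of the source
                transformed.append(text)
        return transformed
    return generate
```

With this structure, raising `i` strengthens the transformation (fewer duplicates) at the cost of more over-translation, exactly the trade-off discussed above.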

B. SENTENCE DISCRIMINATOR
The SD is based on a classifier suitable for the data set and is trained only on the source sentences. After training, the SD predicts the label of each transformed sentence generated by the SG. If the predicted label of a transformed sentence differs from the label of its source sentence, it is discriminated as an incorrectly transformed sentence. The structure of the SD is a parallel structure merging a CNN and BiLSTM [25], as depicted in Figure 3. We select a BiLSTM (used primarily for NLP) and a CNN because they exhibit high performance on various classification problems. The embeddings used for the input of the BiLSTM layer of the SD are GloVe [26] for English and FastText [27] for Korean.
The input of the convolution layer is the input of the BiLSTM layer with one added dimension. The input of the CNN-BiLSTM model consists of one or two sentences, depending on the task. When using one sentence as input, the model has the structure depicted in Figure 3. When using two sentences as input, the same layers are duplicated before the concatenate layer in Figure 3 so that the features from the two sentences can be concatenated. The output is the label classification of the input sentence. The SD predicts the label of each transformed sentence generated by the SG and removes the transformed sentences whose predicted labels differ from the labels of their source sentences.
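The SD's filtering rule is independent of the particular classifier, so it can be sketched with the classifier as a pluggable function. The names below are illustrative; any model trained on the labelled source sentences (such as the CNN-BiLSTM above) can play the classifier role.

```python
def discriminate(classifier, pairs):
    """Split transformed sentences into kept and removed sets.

    classifier(sentence) -> predicted label, from a model trained
                            only on the labelled source sentences
    pairs: iterable of (transformed_sentence, source_label)
    """
    kept, removed = [], []
    for sentence, source_label in pairs:
        # keep a transformed sentence only if its predicted label
        # matches the label of the source sentence it came from
        if classifier(sentence) == source_label:
            kept.append((sentence, source_label))
        else:
            removed.append((sentence, source_label))
    return kept, removed
```

Because the discriminator is just a classifier behind this interface, it can be swapped for whatever model best suits the language and data set.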

C. STRUCTURE OF ITERATIVE TRANSLATION-BASED DATA AUGMENTATION METHOD
The ITDA is based on the SG and SD, as depicted in Figure 4. As described in Section IV-A, the SG translates one source sentence serially i times and iterates this in parallel n times to generate n transformed sentences. If a transformed sentence is the same as the source sentence or another transformed sentence, it is removed. Then, as described in Section IV-B, the SD discriminates incorrectly transformed sentences by predicting the labels of the transformed sentences with a classifier trained only on the source sentences. Finally, the discriminated sentences are removed, and the remaining transformed sentences are used as the augmented sentences.
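The end-to-end augmentation step can be sketched as below; `generate` and `predict_label` stand in for the SG of Section IV-A and the SD classifier of Section IV-B, and their names and signatures are assumptions for illustration.

```python
def itda_augment(generate, predict_label, dataset, i, n):
    """ITDA pipeline sketch.

    generate(sentence, i, n) -> list of transformed sentences (the SG)
    predict_label(sentence)  -> label (the SD, trained on source data only)
    dataset: list of (sentence, label) pairs
    """
    augmented = []
    for sentence, label in dataset:
        seen = {sentence}                  # drop copies of the source sentence
        for candidate in generate(sentence, i, n):
            if candidate in seen:          # drop duplicate transformed sentences
                continue
            seen.add(candidate)
            # SD step: keep only candidates whose predicted label
            # matches the label of their source sentence
            if predict_label(candidate) == label:
                augmented.append((candidate, label))
    return dataset + augmented             # train on source + augmented data
```

A kept candidate inherits the label of its source sentence, so the augmented set can be fed directly to the downstream classifier's training loop.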

V. EXPERIMENTS
In this section, we evaluate the effectiveness of the ITDA by conducting experiments on five data sets: four English data sets and one Korean data set. We evaluate the proposed ITDA on three tasks: (1) a single-sentence task using two English data sets, (2) an inference task using two English data sets, and (3) a Korean single-sentence task using one Korean data set. The single-sentence and inference tasks fix the SG parameters (i = 3, n = 5) and test the results as the amount of train data changes. In the Korean single-sentence task, we test how the amount of train data and variations in the SG parameters (i, n) affect the results. In all cases, the test data are held fixed for each data set. Table 1 shows the sizes of the train and test data sets and the types of tasks. We report the results of all experiments as the average accuracy over 10 trials. The performance of data augmentation is measured as the difference in accuracy with and without data augmentation.

A. SINGLE-SENTENCE TASK
Corpus of Linguistic Acceptability (CoLA). The CoLA [28] consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is grammatically correct. We use this as a task to classify whether an input is a valid English sentence. Table 2 illustrates the accuracy according to the number of train data with and without data augmentation. Each parameter of EDA is set to 0.25, and five augmented sentences are created. In Table 2, ITDA-NSD denotes the ITDA model without the SD; this model conducts data augmentation using only the SG with the parameter set i = 3 and n = 5. As shown in Table 2, EDA increases the mean accuracy by 0.54%, while ITDA increases it by 1.68%. In all cases, ITDA outperforms ITDA-NSD, which shows the benefit of the SD. Specifically, the average accuracy of ITDA is 3.74% higher than that of ITDA-NSD.

Stanford Sentiment Treebank v2 (SST-2). The SST-2 [29] consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. We use the positive/negative class split and only sentence-level labels. Table 3 presents the accuracy according to the number of train data with and without data augmentation. EDA, ITDA-NSD, and ITDA decrease the average accuracy of the four base deep learning models by 0.53%, 2.25%, and 0.95%, respectively. In the case of the SST-2 data set, data augmentation is not effective because of the large amount of data (67k train sentences). Since SST-2 consists of sentences about movie reviews, it contains many proper nouns. These proper nouns negatively affect the performance of ITDA because they cause over-translation. For this reason, ITDA performs worse than EDA in the SST-2 experiment.
We also observe that the more incorrectly transformed sentences the SG generates, the less completely the SD can discriminate them.

B. INFERENCE TASK
Recognizing Textual Entailment (RTE). The RTE [30] data sets are obtained from a series of annual textual entailment challenges. RTE is a combination of data from RTE1 [31], RTE2 [32], RTE3 [33], and RTE5 [34], which are constructed based on news and Wikipedia text. Therefore, many sentences in RTE include proper nouns. In addition, the RTE data set has an overfitting problem because the train data set is smaller than the test data set. RTE consists of pairs of sentence-1 and sentence-2, so data augmentation can be applied in three ways: augmenting sentence-1 (ITDA-1), augmenting sentence-2 (ITDA-2), and matching Aug-1 and Aug-2 (ITDA-M). As shown in Table 4, the performance of the CNN and BiLSTM models at data rates from 40% to 60% decreases when ITDA is applied, due to over-translation caused by proper nouns. However, the overall improvement in average accuracy is 0.24%. The main reason for this improvement is that the positive effect of ITDA in preventing overfitting on the RTE data set dominates the negative effect of over-translation.

Winograd Natural Language Inference (WNLI). WNLI [30] is created from the Winograd Schema Challenge [35] data set. The task is to predict whether sentence-2, with the pronoun substituted, is entailed by the original sentence (sentence-1). As shown in Table 5, the accuracy decreases as the data rate increases due to serious overfitting. The main reasons for this overfitting on the WNLI data set are that WNLI is relatively small (634 sentence pairs) and imbalanced between the two classes (65% not entailment).
Since WNLI consists of sentence pairs like RTE, the three augmentation methods (ITDA-1, ITDA-2, and ITDA-M) can also be applied. In each sentence pair, sentence-1 is the original sentence, and sentence-2 is sentence-1 with the ambiguous pronoun replaced by a possible referent. The average accuracy of ITDA-2, which applies the ITDA only to sentence-2, increases by 14.9% on average in these experiments. This is because over-translation occurs relatively less often for sentence-2.

C. KOREAN SINGLE-SENTENCE TASK
Intonation-aided intention identification for Korean (3i4K). 3i4K [36] is used to classify Korean sentences into one of seven categories: fragment, statement, question, command, rhetorical question, rhetorical command, or intonation-dependent utterance. Two sets of experiments are conducted: (1) changing the number of translation iterations (i) and (2) changing the number of transformed sentences (n).

1) EXPERIMENTS CHANGING NUMBER OF TRANSLATION ITERATIONS (i)
As described in Section IV-A, the SG uses Google Translate to translate the source sentence into another language i times serially to generate one transformed sentence, and repeating this operation n times in parallel generates n transformed sentences. This experiment measures the amount of augmented data generated when varying the parameter i. The parameter n is held constant at 15, and 7,000 3i4K sentences (1,000 for each label) are used. Duplicate sentences among the generated transformed sentences are removed as described in Section IV-A. Figure 5 illustrates how many transformed sentences, discriminated sentences, and augmented sentences are generated as i varies. If there were no duplicate sentences, 105,000 transformed sentences would be generated. Because the number of duplicate sentences decreases as i increases, the number of transformed sentences increases. However, the growth declines and is likely to converge at approximately 90,000. Figure 5 also illustrates the number of discriminated sentences when the SD is applied to the transformed sentences. As i increases, the number of transformed sentences discriminated as interfering with learning also increases because more transformed sentences damage the meaning of the source sentences. The augmented sentences are the transformed sentences minus the discriminated sentences. If i exceeds 3, more than half of the transformed sentences are discriminated and removed, with less than half remaining as the final augmented sentences. Figure 6 illustrates the results of training the model with the final augmented data after removing the discriminated data and measuring the accuracy. As shown in Figure 6, ITDA performs best when i = 3. When i is smaller than 3, performance is poor because the transformed sentences from the SG are very similar to the source sentences.
On the other hand, when i is larger than 3, the performance of ITDA decreases due to over-translation.

2) EXPERIMENTS CHANGING NUMBER OF TRANSFORMED SENTENCES (n)
Figure 7 illustrates the results of experiments conducted by changing n to 5, 10, 15, and 20 with i held constant at 3. The number of data is increased by 100 for each of the seven classes. "Original" in Figure 7 denotes the case where the ITDA is not applied. When the ITDA is applied, the average accuracy increases by 1.63%. When the amount of data is small, a large n (n = 20) yields higher accuracy than a small n (n = 5), as shown in Figure 7 (a-d). This means that a large n is necessary for a small data set to prevent overfitting due to the small size of the data set. In contrast, a small n is effective for a large data set because the quality of the augmented sentences becomes more important in this case.

VI. CONCLUSION
We propose a novel text data augmentation method, the ITDA, with three original contributions. First, the ITDA combines an SG and SD to provide general-purpose text data augmentation. Second, the ITDA SG generates transformed sentences through serial and parallel iterations of an existing translator. Third, the ITDA SD uses a classifier trained on the source sentences to discriminate and remove incorrectly transformed sentences. The ITDA is designed for comprehensive data augmentation across multiple languages. With Google Translate as the SG's translator, we can generate transformed sentences for 109 languages, and we can use any classifier available for a given language as the SD. We conduct extensive experiments using four base deep learning models on five data sets. We observe through the experiments that proper nouns in sentences negatively affect the SG because they cause over-translation. In contrast, when our ITDA is applied to sentences in which parts of the source sentence are replaced with pronouns, the average accuracy increases by 14.9%. Our experiments on 480 sentence classification tasks with the ITDA demonstrate that accuracy improves by 4.24% on average.
Sangwon Lee received a B.S. degree in Information and Communication Engineering and an M.S. degree in Electrical and Computer Engineering from Inha University, Incheon, Republic of Korea, in 2019 and 2021, respectively. He is currently pursuing a Ph.D. in Electrical and Computer Engineering at Inha University under the guidance of Professor Choi. His research interests are deep learning, natural language processing, time-series data processing, and data intelligence.
Ling Liu is a Professor in the School of Computer Science at Georgia Institute of Technology. She directs the research programs in the Distributed Data Intensive Systems Lab (DiSL). Prof. Liu is an IEEE Fellow and a recipient of IEEE Computer Society Technical Achievement Award in 2012. She has published over 300 international journal and conference articles and is a recipient of the best paper awards from numerous top venues. Prof. Liu served as the EIC of IEEE Transactions on Service Computing (2013-2016), and serves on the editorial board of half a dozen international journals.
Wonik Choi is a Professor in the School of Information and Communication Engineering at Inha University, where he runs the Data Intelligence Lab. He received his PhD in Computer Engineering from Seoul National University, Korea. His research interests include spatio-temporal databases, sensor network topology, telematics, and GIS/LBS. He was a visiting scholar in the School of Computer Science at Georgia Institute of Technology in 2012.