LiDA: Language-Independent Data Augmentation for Text Classification

Developing a high-performance text classification model in a low-resource language is challenging due to the lack of labeled data. Meanwhile, collecting large amounts of labeled data is cost-inefficient. One approach to increase the amount of labeled data is to create synthetic data using data augmentation techniques. However, most of the available data augmentation techniques work on English data and are highly language-dependent as they perform at the word and sentence level, such as replacing some words or paraphrasing a sentence. We present Language-independent Data Augmentation (LiDA), a technique that utilizes a multilingual language model to create synthetic data from the available training dataset. Unlike other methods, our approach worked on the sentence embedding level independent of any particular language. We evaluated LiDA in three languages on various fractions of the dataset, and the result showed improved performance in both the LSTM and BERT models. Furthermore, we conducted an ablation study to determine the impact of the components in our method on overall performance. The source code of LiDA is available at https://github.com/yest/LiDA.


I. INTRODUCTION
Text classification is one of the most widely known tasks in Natural Language Processing (NLP) due to its various applications across many domains. Spam detection, sentiment analysis, emotion detection, and topic detection are a few examples of text classification applications. Nowadays, text classification models achieve impressive performance thanks to deep learning algorithms. However, to achieve such high performance, deep learning algorithms require enormous amounts of labeled data.
In a low-resource language such as Indonesian, creating a high-performance text classification model is challenging due to insufficient labeled data. Moreover, collecting large amounts of labeled data is difficult and costly. One approach to overcome this problem is to create synthetic data by using data augmentation techniques. Data augmentation is a method to create synthetic data from original data. On textual data, this technique aims to transform original sentences into synthetic sentences, and is generally done at the word or sentence level.
For the text classification task, several data augmentation techniques can be used to create synthetic data. At the word level, the simplest strategy is through random word replacements. This approach replaces random words in the sentence with synonyms [1], [2], [3], or the closest words in the word embedding space [4], or a predicted word from the language model [5], [6]. Another technique is changing the sentence structure by deleting some words, inserting a word in a random place, or by swapping some words within the sentence [3]. At the sentence level, the most popular technique is back-translation. This technique produces a sentence that has the same meaning as the original sentence using machine translation models [7], [8], [9], [10]. Another technique at the sentence level is the generative method. This technique creates synthetic sentences using a text generation model where the model generates tokens sequentially based on the probability of word occurrences formulated from previous word sequences [11], [12], [13], [14].
Unfortunately, all of the above-mentioned techniques are language-dependent, meaning they can only be used for certain languages, such as English. For example, randomly replacing some words with synonyms is highly dependent on the presence of a WordNet in the same language. Meanwhile, replacing some words with the closest word in the word embedding space requires a language-specific pre-trained word embedding model. Similarly, the language model prediction and generative methods require a specific pre-trained language model.

FIGURE 1. Difference between the previous methods (a) and our method (b). The previous methods create synthetic data by transforming the original sentence and then encoding it into a sentence embedding. Our method encodes the sentence into a sentence embedding first and then transforms it to create synthetic sentence embeddings.
In this paper, we introduce LiDA: a Language-independent Data Augmentation technique for text classification. Our approach works at the sentence embedding level, unlike previous methods that perform at the word and sentence level. Our approach was inspired by data augmentation techniques in computer vision, where synthetic images are created by transforming the original image with functions such as flipping, shifting, rotating, and zooming. Similarly, our approach transforms a sentence embedding with several functions to create new synthetic sentence embeddings. Figure 1 shows the difference between previous methods and our method. Furthermore, to prove that our approach is language-independent, we evaluated our technique on English, Chinese, and Indonesian datasets and on various fractions of the datasets to simulate the low-resource language scenario. The results show that LiDA increased the model's performance in both the LSTM and BERT models.
Our contributions are as follows: (1) a data augmentation technique for text classification that performs at the sentence embedding level and is independent of language, (2) our technique does not require any language-specific features, (3) our technique performs well on both small and large training datasets, making it suitable for low-resource languages.

II. RELATED WORK
A. WORD LEVEL DATA AUGMENTATION
Data augmentation at the word level is the most straightforward technique. This technique randomly replaces, deletes, inserts, or swaps one or more words in the text to create new synthetic data. Some studies such as [1], [2], and [3] replace words with their synonyms obtained from WordNet. Another study used pre-trained word embeddings such as Word2Vec and GloVe for word replacement [4]. The words are replaced with the closest words in the word embedding space; in other words, they are replaced with semantically similar words. The BERT masked language model, which is trained to predict masked words in a sentence, has also been used for word replacement [5], [6]. The words to be changed are replaced with MASK tokens, and the language model then replaces the MASK tokens with new words based on the context. Furthermore, in their approach, Wei and Zou [3] also delete, insert, and swap words in the text to create synthetic text.
Although these techniques are easy to use, their application is limited to languages with the required resources, such as English. For example, word replacement techniques highly depend on the English WordNet, while WordNets for languages other than English are of lower quality or not available at all [15]. Similarly, the technique of deleting, inserting, and swapping words in sentences cannot be directly applied to languages that do not use spaces as word separators, such as Chinese and Japanese. Furthermore, the word embedding technique highly depends on a pre-trained word embedding model being available in the same language.

B. SENTENCE LEVEL DATA AUGMENTATION
Unlike at the word level, data augmentation at the sentence level changes complete sentences to create synthetic data, making implementation more difficult than at the word level. The use of machine translation models is a commonly used technique [7], [8], [9], [10]. Text data is translated from the source language to another language and then translated back to the source language, commonly referred to as back-translation. With this approach, the synthetic text data will still have the same meaning even though the sentences are different. The most advanced technique is to use a generative model [11], [12], [13], [14]. The model creates new data word by word based on the probability of the previous sequence of words. This technique allows the synthetic data to have a sentence structure that is entirely different from the original data, which further enriches the training data.
These strategies, however, have some drawbacks. Back-translation, for example, necessitates a good translation model so that the generated synthetic data has the same meaning as the original data. Meanwhile, the generative models require a large language model and are typically trained in a single language. As a result, these two strategies are challenging to employ in languages with limited resources.

III. PROPOSED METHOD
LiDA relies on the Sentence-BERT (SBERT) multilingual language model [16], which is optimized for semantic textual similarity. SBERT is a multilingual pre-trained language model modified from the BERT network to create semantically meaningful sentence embeddings. With SBERT, two semantically similar sentences will have similar sentence embeddings with a high cosine similarity score, even in different languages. Similar sentence embeddings mean that the embeddings are close together in the vector space.
For each sentence in the training set, we first encoded the sentence into sentence embedding by using the SBERT multilingual model. Next, we transformed the sentence embedding by using three functions: (1) linear transformation, (2) autoencoder model, and (3) denoising autoencoder model. This process would create three synthetic sentence embeddings. Then, we concatenated the three synthetic sentence embeddings with the original sentence embedding as the output from LiDA. Finally, we used the output from LiDA as input for the classifier model.

A. MOTIVATION
Before going into further detail about LiDA, we will first describe the motivation behind our approach. We chose SBERT for constructing sentence embeddings because it outperformed other multilingual models in creating similar sentence embeddings for semantically similar sentences, especially in different languages. Table 1 shows the difference in cosine similarity scores between SBERT and other multilingual models. We can see that the sentence embeddings produced by SBERT have a high similarity score for similar sentences, whether in the same or different languages. Additionally, SBERT produces different sentence embeddings for two sentences with different meanings, as we can see in the last two rows. In contrast, the mBERT [17] and XLM-RoBERTa [18] models cannot produce sentence embeddings that reflect sentence similarity, which causes the similarity scores to be inconsistent. To emphasize our argument, we also compared the sentence embeddings from each model on a clustering task. Table 2 shows that sentence embeddings from SBERT are perfectly clustered, while the sentence embeddings from mBERT and XLM-RoBERTa tend to cluster randomly.
Based on these findings, the SBERT model can generate similar sentence embeddings from sentences with similar meanings even if they are from different languages. In other words, two sentence embeddings with high cosine similarity mean that the embeddings come from two sentences with the same meaning and vice versa. Therefore, we can create synthetic sentence embeddings by slightly transforming the embedding of the original sentence, as long as the degree of similarity between the initial sentence embedding and the new sentence embedding remains high. We propose three approaches to transform the original sentence embedding into new synthetic embeddings: (1) linear transformation, (2) autoencoder model, and (3) denoising autoencoder model.
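To make this intuition concrete, the following minimal sketch checks the behavior described above using the sentence-transformers library; the checkpoint name and example sentences are our own illustrative assumptions, not the exact setup used in the paper.

```python
# Illustrative check: semantically similar sentences (even across languages)
# should receive a high cosine similarity under multilingual SBERT, while
# unrelated sentences should receive a low one.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")  # assumed checkpoint

sentences = [
    "I love reading books.",            # English
    "Saya suka membaca buku.",          # Indonesian paraphrase
    "The weather is terrible today.",   # unrelated meaning
]
emb = model.encode(sentences, convert_to_tensor=True)

print(util.cos_sim(emb[0], emb[1]).item())  # expected: high (cross-lingual paraphrase)
print(util.cos_sim(emb[0], emb[2]).item())  # expected: low (different meaning)
```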

B. LiDA
Given a sentence S from the training set, first, we tokenized the sentence using the SentencePiece tokenizer. Next, we encoded the tokens into a 768-dimensional sentence embedding V using the multilingual SBERT model. Then, the sentence embedding was passed to the linear transformation, the pre-trained autoencoder model, and the pre-trained denoising autoencoder model to create new synthetic sentence embeddings Vs. Finally, we concatenated the new synthetic sentence embeddings with the original sentence embedding to construct a new dataset for the classifier model (Figure 2).
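As a rough illustration of this pipeline, the sketch below encodes a batch of sentences with SBERT and expands the training set with the three synthetic embeddings; the checkpoint name and the signatures of the three transform functions are assumptions, and the functions themselves are placeholders for the components sketched in the following subsections.

```python
# Minimal LiDA-style augmentation sketch (assumed checkpoint and interfaces).
import torch
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

def lida_augment(sentences, labels, linear_transform, autoencoder, denoising_autoencoder):
    # 1. Encode each sentence into a 768-dimensional embedding V.
    v = sbert.encode(sentences, convert_to_tensor=True)   # (N, 768)
    # 2. Create three synthetic embeddings per sentence.
    with torch.no_grad():
        w_lin = linear_transform(v)
        w_ae = autoencoder(v)
        w_dae = denoising_autoencoder(v)
    # 3. Concatenate the original and synthetic embeddings into a larger
    #    training set; each label is repeated for its three synthetic copies.
    x = torch.cat([v, w_lin, w_ae, w_dae], dim=0)          # (4N, 768)
    y = torch.as_tensor(labels).repeat(4)                  # (4N,)
    return x, y
```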

1) LINEAR TRANSFORMATION
The purpose of the linear transformation is to create a synthetic embedding that is similar to the original sentence embedding. To achieve this, we shifted the sentence embedding slightly by adding small numbers.
Given a sentence embedding V, the linear transformation T transforms V into a synthetic embedding W:

W = T(V) = V + r

where T is the linear transformation, V is the original sentence embedding, r is a small random number, and W is the synthetic sentence embedding.
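A minimal sketch of this step is shown below; the noise scale is an illustrative assumption, as the paper does not state the exact magnitude of the random numbers.

```python
import torch

def linear_transform(v: torch.Tensor, scale: float = 0.01) -> torch.Tensor:
    """Shift the sentence embedding slightly: W = T(V) = V + r."""
    r = scale * torch.randn_like(v)  # small random numbers (assumed scale)
    return v + r
```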

2) AUTOENCODER
The autoencoder model learns to generate slightly varied synthetic embeddings from similar sentences; thus, the classifier model can learn from various sentence embeddings, which prevents overfitting during training. We used the Quora paraphrase dataset,1 which consists of over 400,000 potential question duplicate pairs, where the label indicates whether a question pair is duplicate or not. In other words, duplicate pairs are questions that are semantically similar. Table 3 shows samples of the dataset. For the autoencoder training, we only used the duplicate pairs, with a total of 134,336 rows for the training set and 14,927 rows for the validation set. First, both sentences were tokenized using the SentencePiece tokenizer. Next, both were encoded into sentence embeddings V and W using SBERT. Then, we trained the autoencoder model to map from the input embedding V (sentence 1) to the output embedding W (sentence 2). Finally, we used the pre-trained autoencoder to create new synthetic data from the training dataset (Figure 3).
1 https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs
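The sketch below illustrates one way to implement and train such an autoencoder over 768-dimensional SBERT embeddings; the layer sizes, bottleneck dimension, and optimizer settings are our assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class EmbeddingAutoencoder(nn.Module):
    """Fully connected autoencoder over 768-dim sentence embeddings (assumed sizes)."""
    def __init__(self, dim: int = 768, bottleneck: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU())
        self.decoder = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def ae_train_step(model, optimizer, v_batch, w_batch):
    # Map the embedding of sentence 1 (V) to the embedding of its duplicate,
    # sentence 2 (W), using a mean squared error reconstruction loss.
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(v_batch), w_batch)
    loss.backward()
    optimizer.step()
    return loss.item()

model = EmbeddingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed learning rate
```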

3) DENOISING AUTOENCODER
The objective of the denoising autoencoder model is to learn to generate a synthetic embedding from a noisy sentence embedding. We used the same Quora paraphrase dataset to train the denoising autoencoder model, but we only used question 1 of each pair. First, we tokenized the sentence using the SentencePiece tokenizer. Next, we encoded the tokens into a sentence embedding V using SBERT. We then added Gaussian noise to the sentence embedding to form a noisy embedding VN. Then, we trained the autoencoder model with VN as the input and V as the output. Finally, we used the pre-trained autoencoder to create new synthetic data from the training dataset (Figure 4).
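A corresponding training step for the denoising variant is sketched below, reusing the EmbeddingAutoencoder from the previous sketch; the noise standard deviation is an assumed value.

```python
import torch
import torch.nn as nn

def dae_train_step(model, optimizer, v_batch, noise_std: float = 0.1):
    # Corrupt the clean embedding V with Gaussian noise to obtain V_N,
    # then train the model to reconstruct the clean V from V_N.
    v_noisy = v_batch + noise_std * torch.randn_like(v_batch)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(v_noisy), v_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```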
The following are the models used for the experiments:
• LSTM [22]: this model consists of two layers of LSTM, where the input size is 768, the output size is 2 for the English and Chinese datasets and 3 for the Indonesian dataset, the hidden layer size is 500, the dropout rate is 0.5, the learning rate is 1e-5, the batch size is 32, the number of training epochs is 100, the optimizer is Adam, and cross-entropy is the loss function. A minimal sketch of this classifier configuration is given at the end of this subsection.
• SBERT [16]: this model uses the same architecture as the XLM-RoBERTa model [18], which is pre-trained on 50+ languages. The input size of this model is 768, the output size is 2 for the English and Chinese datasets and 3 for the Indonesian dataset, the maximum sequence length is 128, the dropout rate is 0.5, the learning rate is 1e-6, the batch size is 8, the number of training epochs is 10, the optimizer is Adam, and cross-entropy is the loss function.
For the performance metric, we used the Matthews Correlation Coefficient (MCC). We conducted experiments with the LSTM and BERT models with and without LiDA on various dataset sizes. As shown in Figure 5, LiDA improved the model performance. Tables 5 and 6 show the highest and lowest performance improvements as well as the averages for the LSTM and BERT models. This performance improvement is due to the significant increase in the amount of training data and the quality of the synthetic data produced by LiDA.
The experimental results also showed that LiDA works well in all languages without requiring specific language features or language-specific adjustments. This is because LiDA creates synthetic data at the embedding level, where sentences are represented in vector form. Once a sentence is encoded into an embedding, various transformation functions can be applied to the embedding without knowing the source language or its language features. These results demonstrate that LiDA is not language-dependent. In addition, the performance improvement observed on small fractions of the datasets shows that LiDA is suitable for text classification in low-resource languages.
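As noted above, here is a minimal sketch of the LSTM classifier configuration listed in the bullet point; treating each 768-dimensional sentence embedding as a single-step input sequence is our assumption, since the paper does not describe how the embeddings are fed to the LSTM.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, input_size=768, hidden_size=500, num_classes=2, dropout=0.5):
        super().__init__()
        # Two stacked LSTM layers; dropout is applied between them.
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=2,
                            dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                  # x: (batch, seq_len, 768)
        _, (h_n, _) = self.lstm(x)
        return self.fc(h_n[-1])            # class logits

model = LSTMClassifier(num_classes=2)      # 3 classes for the Indonesian dataset
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()          # batch size 32, 100 epochs in the paper
```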

2) COMPARISON
LiDA was designed to be language-independent, whereas most previous data augmentation techniques are language-dependent and limited to commonly used languages such as English. Back-translation is one of the few techniques that can be directly used for multiple languages. For this reason, we compared LiDA with the back-translation method.
We used the MarianMT model [23] for translation and the LSTM model for text classification. For the English dataset, we translated the data to French and then translated it back to English. As for Indonesian and Chinese, we translated the data into English before translating it back into the original language. Table 7 shows that LiDA performs better than the back-translation technique for all of the languages tested. In addition, LiDA was also faster and computationally cheaper than back-translation. For comparison, it took 30 seconds to translate and back-translate 100 samples, while LiDA only took 5 seconds to encode the data and create the synthetic data.
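For reference, the back-translation baseline can be reproduced along the lines of the sketch below, using public MarianMT checkpoints from the transformers library; the exact checkpoints used in the paper are not stated, so the model names here are assumptions.

```python
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(g, skip_special_tokens=True) for g in generated]

def back_translate_en(texts):
    # English -> French -> English (assumed Helsinki-NLP checkpoints).
    french = translate(texts, "Helsinki-NLP/opus-mt-en-fr")
    return translate(french, "Helsinki-NLP/opus-mt-fr-en")

print(back_translate_en(["The movie was surprisingly good."]))
```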

3) LEARNING RATE EFFECT
The learning rate is an essential hyperparameter that must be adjusted when training a neural network. To train a new model with synthetic data from LiDA, we would need to adapt the learning rate to obtain maximum performance. We experimented with the English model on different learning rates to determine the importance of adjusting the learning rate when using synthetic data from LiDA. As shown in Figure 6, a slight adjustment of the learning rate made the model work better than with the original learning rate. As a recommendation, it is better to set the learning rate slightly higher or lower than the original learning rate.

4) LIMITATIONS
Although LiDA is language-independent and can be used for multiple languages, the languages supported by LiDA are still limited. The supported languages follow the pre-trained multilingual SBERT model, which currently supports about 50 languages. To add support for a new language, SBERT could be trained on new language data through the knowledge distillation technique [24].

E. ABLATION STUDY
LiDA consists of three components that can generate synthetic data. To explore the contribution of each component, we conducted an ablation study and compared the results with LiDA. We ran the experiment with each component for all the data fractions and averaged the results. From Figure 7, we can see that each component improves the model's performance, but by combining the three components, LiDA works better than each component alone. These results support our hypothesis that synthetic sentence embeddings can be created by slightly transforming the original sentence embedding, since each of the three components increases the model's performance.

F. CONCLUSION
We introduced LiDA, a data augmentation technique for text classification. LiDA does not depend on a particular language and does not require any language-specific features; it is therefore suitable for low-resource languages. The experimental results showed that LiDA can improve the performance of multilingual text classification models without the need for language-specific adjustments. We hope that LiDA can promote the development of universal data augmentation techniques.