I. Introduction
With the rapid growth of textual data produced by the Web and its interactions, sentiment analysis plays an important role in the effective application of AI models. Sentiments and emotions introduce a degree of subjectivity that is essential in human-to-human interactions. Sentiment analysis, the field devoted to identifying and understanding these subjectivities and nuances, is therefore crucial for human-to-machine interactions. Sentiment-analysis applications span commercial and academic tools used by companies of all sizes, and they have great potential as subcomponents of other technologies [1]. These techniques enable the automated analysis of large volumes of data [2] and the extraction of knowledge and insights from raw, unstructured data [3], [4]. Although most research relies on deep learning methods [5], recent work has combined the bottom-up approach of learning language features via deep learning with a top-down approach of modeling commonsense knowledge [6]. However, sentiment-analysis models require vast amounts of training data to learn these patterns effectively. Low-quality datasets are common when developing such systems, with issues including data scarcity and a lack of labeled samples, which may degrade model performance in real-world scenarios [7]. The scarcity of linguistic and textual resources has been a recurring issue in many NLP tasks [8]. Furthermore, a lack of data can compromise sample quality and distort the data distribution. The resulting imbalance violates the relatively balanced class distribution assumed by most learning algorithms, which can significantly decrease classification performance. Real-world datasets often suffer from data scarcity, which may also lead to overfitting.
When classification models are trained with few samples, they tend to memorize features of the training set instead of learning the underlying feature distribution, resulting in poor generalization [9]. Beyond data augmentation, several approaches have been proposed to handle data scarcity and imbalance in real-world scenarios. Neural network regularization [10], dropout [11], batch normalization [12], and transfer learning [13], [14] are among the most widely adopted techniques, especially for deep learning methods. One-shot and zero-shot learning are more recent paradigms for building models from minimal data that can deliver promising results [15]. Text data augmentation methods mitigate data scarcity by performing class-preserving manipulations on the original data source [16]. These methods are common strategies for avoiding overfitting, particularly on small datasets and in situations where labeled examples are expensive. However, unlike simple image transformations such as rotation and translation, preserving the original label after text perturbations can be more complex. Different methods have therefore been proposed in recent years, ranging from simple text transformations, such as dictionary-based synonym replacement, to more complex methods involving large language models and transfer learning; each technique has its own advantages and disadvantages. For example, simpler methods may fall short in linguistically diverse settings such as social media, whereas more complex methods may add significant overhead to the pipeline and increase training time. Experiments evaluating these methods in diverse scenarios are thus required to establish their benefits and drawbacks.
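To make the simpler end of this spectrum concrete, the following is a minimal sketch of dictionary-based synonym replacement, one of the basic class-preserving text transformations discussed above. The small SYNONYMS dictionary is a toy stand-in for a real lexical resource such as WordNet, and the function names are illustrative, not taken from any particular library.

```python
import random

# Toy synonym dictionary standing in for a lexical resource such as WordNet.
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film", "picture"],
    "slow": ["sluggish", "plodding"],
}

def synonym_replacement(sentence, n=1, rng=None):
    """Replace up to n eligible words with a randomly chosen synonym.

    The perturbation is intended to preserve the sentence's sentiment
    label while producing a new training sample.
    """
    rng = rng or random.Random(0)
    words = sentence.split()
    # Indices of words that have an entry in the synonym dictionary.
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)
```

For instance, `synonym_replacement("a good movie", n=1)` swaps either "good" or "movie" for a synonym, yielding an augmented sample with the same (positive) label. The limitation noted above is visible even here: a fixed dictionary ignores context, so replacements can be inappropriate in linguistically diverse text.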
To understand the effects of text data augmentation and how classification methods handle data quality drawbacks, we systematically studied how sentiment analysis with different algorithms is affected by data augmentation. We performed our experiments using easy data augmentation (EDA), back-translation (BT), pretrained data augmentor (PREDATOR), and BART-based augmentation together with recent classification algorithms. We augmented seven original sentiment-analysis datasets and classified them with highly accurate methods: long short-term memory (LSTM), convolutional neural network (CNN), bidirectional encoder representations from transformers (BERT), support vector machine (SVM), gated recurrent units (GRUs), random forest (RF), and enhanced language representation with informative entities (ERNIE). Three scenarios, representing imbalanced datasets, small datasets, and different sample availabilities, support our discussion. We found that, although these augmentation methods often yield better performance than the original datasets, in particular scenarios they may perform similarly to the original dataset depending on the classification method. We also introduce a taxonomy for text data augmentation that covers the most recent methods under both embedding and sentence categories. The main contributions of this article are as follows.
A taxonomy devoted to text data augmentation, incorporating the latest methods and their categories.
Investigation of augmentation methods under different scenarios (imbalanced data, small datasets, and different availability scales).
Examination of the advantages and disadvantages of modern classification methods and their relationship with augmented datasets.