
Toward Text Data Augmentation for Sentiment Analysis


Impact Statement:
Data augmentation methods have substantially improved data-driven predictive models. However, text data augmentation methods have so far been explored only naively, particularly for sentiment-analysis problems. As a result, this field lacks discussion, analysis, and understanding of the phenomena related to augmented samples and their impact on current classification methods. Here, we propose a new organization of categories and methods to shed light on this topic. Furthermore, we present the advantages, drawbacks, and particularities of augmenting sentiment-analysis datasets by combining the most prominent augmentation methods with several classification methods.

Abstract:

A significant part of natural language processing (NLP) techniques for sentiment analysis is based on supervised methods, which are affected by the quality of data. Therefore, sentiment analysis needs to be prepared for data quality issues, such as imbalance and lack of labeled data. Data augmentation methods, widely adopted in image classification tasks, include data-space solutions to tackle the problem of limited data and enhance the size and quality of training datasets to provide better models. In this work, we study the advantages and drawbacks of text augmentation methods (easy data augmentation, back-translation, BART, and pretrained data augmentor) with recent classification algorithms (long short-term memory, convolutional neural network, bidirectional encoder representations from transformers, support vector machine, gated recurrent units, random forests, and enhanced language representation with informative entities) that have attracted sentiment-analysis researchers a...
Published in: IEEE Transactions on Artificial Intelligence ( Volume: 3, Issue: 5, October 2022)
Page(s): 657 - 668
Date of Publication: 21 September 2021
Electronic ISSN: 2691-4581

I. Introduction

With the rapid growth of textual data produced by the Web and its interactions, sentiment analysis plays an important role in the effective application of AI models. Sentiments and emotions carry a degree of subjectivity that is essential in human-to-human interactions. Sentiment analysis is the field devoted to identifying and understanding these subjectivities and nuances, and it is crucial for human-to-machine interactions. Sentiment-analysis applications range from commercial and academic tools to large and small companies, and they have great potential as subcomponents of other technologies [1]. These techniques enable the automated analysis of large amounts of data [2] and the extraction of knowledge and insights from raw unstructured data [3], [4]. Although most research relies on deep learning methods [5], recent work has achieved advances by combining the bottom-up approach of learning language features through deep learning with a top-down approach of modeling commonsense knowledge [6]. However, sentiment-analysis models require a vast amount of training data to learn these patterns effectively. Low-quality datasets are common when developing such systems, with issues including data scarcity and a lack of labeled samples, which may degrade model performance in real-world scenarios [7]. The scarcity of linguistic and textual resources has been a recurring issue in many NLP tasks [8]. Furthermore, a lack of data can compromise sample quality and distort the data distribution. The resulting class imbalance violates the assumption of a relatively balanced class distribution made by most learning algorithms, which can significantly decrease classification performance. Real-world datasets often suffer from data scarcity, which may lead to overfitting.
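The class-imbalance issue described above can be quantified with a simple majority-to-minority count ratio. The sketch below is a minimal illustration with toy sentiment labels; the function name and labels are my own, not from the article:

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the majority-class count to the minority-class count.

    A ratio of 1.0 indicates a perfectly balanced dataset; larger
    values indicate increasingly skewed label distributions, which
    violate the balanced-distribution assumption of many classifiers.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy example: 8 positive vs. 2 negative samples.
labels = ["pos"] * 8 + ["neg"] * 2
print(imbalance_ratio(labels))  # 4.0
```

In practice, an augmentation method would be applied predominantly to the minority class to drive this ratio back toward 1.0.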
When classification models are trained on few samples, they tend to memorize features of the training set instead of learning the underlying feature distribution, resulting in inadequate generalization capacity [9]. In addition to data augmentation, different approaches have been proposed to handle data scarcity and imbalance in real-world scenarios. Neural network regularization [10], dropout regularization [11], batch normalization [12], and transfer learning [13], [14] are among the most widely adopted techniques, especially for deep learning methods. One-shot and zero-shot learning are more recent paradigms for building models with minimal data that can deliver promising results [15]. Text data augmentation methods have been proposed to mitigate data scarcity by performing class-preserving manipulations on the original data source [16]. These methods are common strategies to avoid overfitting the training data, mainly on small datasets and in situations where labeled examples are expensive. However, unlike simple image transformations, such as rotation and translation, text perturbations make preserving the original label more complex. Different methods have therefore been proposed in recent years to address this problem. Ranging from simple text transformations, such as dictionary-based synonym replacement, to more complex methods involving large language models and transfer learning, each technique has its own advantages and disadvantages. For example, simpler methods may fall short in linguistically diverse scenarios such as social media. By contrast, more complex methods may add significant overhead to the pipeline, increasing training time. Thus, experiments evaluating these methods in diverse scenarios are required to recognize their benefits and drawbacks.
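Dictionary-based synonym replacement, the simplest operation in EDA-style augmentation, can be sketched as follows. This is an illustrative toy, not the authors' implementation: real EDA implementations typically draw synonyms from WordNet, whereas the tiny dictionary below is an assumption made to keep the example self-contained:

```python
import random

# Toy synonym dictionary (illustrative; WordNet is the usual source).
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film", "picture"],
    "boring": ["dull", "tedious"],
}

def synonym_replace(sentence, n=1, seed=None):
    """Replace up to n words that have dictionary synonyms.

    A class-preserving lexical substitution: the sentence changes
    on the surface while its sentiment label stays the same.
    """
    rng = random.Random(seed)
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

print(synonym_replace("the movie was boring", n=2, seed=0))
```

Fixing the random seed makes the augmentation reproducible, which is convenient when comparing classifiers on identical augmented datasets.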
To understand the effects of text data augmentation and how classification methods handle data quality drawbacks, we systematically studied how sentiment analysis with different algorithms is affected by data augmentation methods. We performed our experiments using the easy data augmentation (EDA), back-translation (BT), pretrained data augmentor (PREDATOR), and BART augmentation methods with recent classification algorithms. We augmented seven original datasets for sentiment-analysis tasks and evaluated them with highly accurate classification methods: long short-term memory (LSTM), convolutional neural network (CNN), bidirectional encoder representations from transformers (BERT), support vector machine (SVM), gated recurrent units (GRUs), random forests (RF), and enhanced language representation with informative entities (ERNIE). Three scenarios representing imbalanced datasets, small datasets, and different sample availabilities support our discussion. We discovered that, while these augmentation methods often contribute to better performance than the original datasets, in particular scenarios they can perform similarly to the original dataset, depending on the classification method. We introduced a taxonomy for text data augmentation considering the most recent methods under both embedding and sentence categories. The main contributions of this article include the following.

A taxonomy devoted to text data augmentation, incorporating the latest methods and their categories

Investigation of augmentation methods under different scenarios (imbalanced data, small datasets, and different availability scales)

Examination of the advantages and disadvantages of modern classification methods and their relationship with augmented datasets
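Among the methods in the experimental setup, back-translation paraphrases a sentence by translating it into a pivot language and back. The sketch below demonstrates only the round-trip idea: real pipelines use neural machine translation models, and the word-level lookup tables here are toy stand-ins of my own so the example is self-contained:

```python
# Back-translation (BT) sketch: en -> pivot language -> en yields a
# paraphrase with the same sentiment label. Toy lookup tables stand in
# for the neural MT models used in practice (illustrative assumption).
EN_TO_DE = {"the": "der", "film": "Film", "was": "war", "great": "großartig"}
DE_TO_EN = {"der": "the", "Film": "movie", "war": "was", "großartig": "excellent"}

def back_translate(sentence):
    """Round-trip a sentence through a pivot 'language'."""
    pivot = [EN_TO_DE.get(w, w) for w in sentence.split()]
    return " ".join(DE_TO_EN.get(w, w) for w in pivot)

print(back_translate("the film was great"))  # "the movie was excellent"
```

The asymmetry between the two directions is what produces lexical variety ("film" returns as "movie"), while the positive sentiment is preserved.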

