Introduction
Although deep learning has achieved remarkable results in the field of natural language processing (NLP), including sentiment analysis, relation extraction, and machine translation [1]–[3], a few recent studies have pointed out that adding small modifications to text inputs can fool deep classifiers into incorrect classifications [4], [5]. A similar phenomenon exists in image classification, where adding tiny and often imperceptible perturbations to images can fool deep classifiers. This naturally raises concerns about the robustness of deep learning systems, considering that deep learning has become a core component of many security-sensitive applications, such as text-based spam detection.
Formally, for a given classifier F and an input sample x, an adversarial sample x′ is defined as \begin{equation*}
\text{x}^{\prime}=\text{x}+\triangle \text{x},\quad \Vert\triangle \text{x}\Vert_{p} < \epsilon,\quad \text{x}^{\prime}\in \mathbb{X}\\
F(\text{x})\neq F(\text{x}^{\prime})\ \text{or}\ F(\text{x}^{\prime})=t
\tag{1}
\end{equation*}
Here we denote a machine learning classifier as F: X → Y, where X is the sample space and Y is the set of output classes; Δx denotes the perturbation added to x, and ε bounds its allowed magnitude. The choice of condition in Eq. (1) distinguishes untargeted attacks, which only require F(x) ≠ F(x′), from targeted attacks, which require x′ to be classified as a specific target class t.
Formally, the perturbation Δx is measured by an L_p norm: \begin{equation*}
\Vert\triangle \text{x}\Vert_{p}=\sqrt[p]{\sum_{i=1}^{n}\vert x_{i}^{\prime}-x_{i}\vert^{p}}
\tag{2}
\end{equation*}
where n is the number of input features.
The choice of p varies across attacks; L_0, L_2, and L_∞ norms are all commonly used to constrain the perturbation.
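For reference (these are the standard norm definitions, not results from this paper), the most common choices of p correspond to the following special cases of Eq. (2):
\begin{equation*}
\Vert\triangle \text{x}\Vert_{0}=\#\{i: x_{i}^{\prime}\neq x_{i}\},\qquad
\Vert\triangle \text{x}\Vert_{2}=\sqrt{\sum_{i=1}^{n}\vert x_{i}^{\prime}-x_{i}\vert^{2}},\qquad
\Vert\triangle \text{x}\Vert_{\infty}=\max_{i}\vert x_{i}^{\prime}-x_{i}\vert
\end{equation*}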
In addition to the targeted/untargeted distinction and the choice of L_p norm, attacks also differ in the attacker's knowledge of the target model: white-box attacks assume full access to the model structure and parameters, whereas black-box attacks can only query the model for its output predictions.
Recent studies have focused on image classification and typically created imperceptible modifications to pixel values through an optimization procedure [4]–[7]. Szegedy et al. [4] first observed that DNN models are vulnerable to adversarial perturbations (limiting the modification with an L_2 norm constraint) and used the L-BFGS algorithm to generate adversarial examples.
Figure 1. An example of a DeepWordBug-generated adversarial sequence. Part (1) shows an original text sample and part (2) shows the adversarial sequence generated from it. From part (1) to part (2), only a few characters are modified; however, this fools the deep classifier into a wrong classification.
Generating adversarial perturbations for text is fundamentally different from doing so for images. Images can be naturally represented as points in a continuous space, where small changes to pixel values remain valid inputs. Text tokens, in contrast, are categorical features: imperceptible perturbations measured by L_p norms make sense on continuous pixel values, but not on letters or words, which are discrete symbols. Moreover, each text sample is a linearly-ordered sequence of words whose length varies from sample to sample. Due to the above reasons, the original definition of adversarial modifications in Eq. (1), which relies on L_p-bounded perturbations, does not directly apply to text inputs.
A few recent studies [11], [12] defined adversarial perturbations on RNN-based text classifiers. [11] first chose a word at a random position in a text input, then used a projected Fast Gradient Sign Method to perturb the word's embedding vector. The perturbed vector is projected to the nearest word vector in the word embedding space, resulting in an adversarial sequence (the text analogue of an adversarial example). This procedure may, however, replace words in an input sequence with totally irrelevant ones, since there is no hard guarantee that words close in the embedding space are semantically similar. [12] used the “saliency map” of input words and complex linguistic strategies to generate adversarial sequences that are semantically meaningful to a human. However, this strategy is difficult to perform automatically.
We instead design scoring functions to generate adversarial sequences by making small edit operations to a text sequence, such that a human would consider it similar to the original sequence. That is, the small changes should produce adversarial words that are barely distinguishable from the original words. We do this by first targeting the important tokens in the sequence and then executing a modification on those tokens (defined in Section II) that effectively forces a deep classifier to make a wrong decision. An example of the adversarial sequence we define is shown in Figure 1.
The original text input is correctly classified as positive sentiment by a deep RNN model. However, by changing only a few characters, the generated adversarial sequence misleads the deep classifier into a wrong classification (negative sentiment in this case).
Contributions
This paper presents an effective algorithm, DeepWordBug (or WordBug for short), that can generate adversarial sequences for natural language inputs to evade deep-learning classifiers. Our novel algorithm has the following properties:
Black-box: Previous methods require knowledge of the model structure and parameters of the word embedding layer, while our method can work in a black-box setting.
Effective: Using several novel scoring functions, on two real-world text classification tasks WordBug fools two different deep RNN models more successfully than the state-of-the-art baseline.
Simple: WordBug uses simple character-level transformations to generate adversarial sequences, in contrast to previous works that use projected gradient or multiple linguistic-driven steps.
Small perturbations to human observers: WordBug generates adversarial sequences that look quite similar to the seed sequences.
DeepWordBug
For the rest of the paper, we denote samples in the form of a pair (x, y), where x = x_1x_2…x_n is an input text sequence of n tokens and y is its ground-truth label.
A. Recurrent Neural Networks
Recurrent neural networks (RNN) [13] are a group of neural networks that include a recurrent structure to capture the sequential dependency among items of a sequence.
RNNs have been widely used and proven effective on various NLP tasks, including sentiment analysis [14], parsing [15] and translation [16]. Due to their recursive nature, RNNs can model inputs of variable length and capture dependencies among all items being modeled, such as all positions in a text sample.
To handle the “vanishing gradient” issue of training basic RNNs, Hochreiter et al. [17] proposed an RNN variant called the Long Short-term Memory (LSTM) network, which achieves better performance compared to vanilla RNNs on tasks with long-term dependencies.
B. Word Based Modification for Adversarial Sequences
In typical adversarial generation scenarios, gradients are used to guide the change from an original sample to an adversarial sample. However, in the black-box setting, gradients cannot be calculated since the model parameters are not observable.
Therefore we need to change the words of an input directly, without the guidance of gradients. Considering the vast search space of possible changes (among all words and all possible character changes), we propose to first determine the important words to change, and then modify them slightly while controlling the edit distance to the original sample. More specifically, we need a scoring function to evaluate which words are important and should be changed to create an adversarial sample, and a method to change those words while keeping the edit distance small.
To find critical words for the model's prediction in a black-box setting, we introduce a temporal score (TS) and a temporal tail score (TTS). These two scoring functions are used to determine the importance of any word to the final prediction.
We assume the perturbation happens directly on the input words (i.e., not on embeddings or at the “semantic” level).
We assume the perturbation approximately minimizes the edit distance to the seed sample. We find an efficient strategy that changes a word only slightly yet is sufficient for creating adversarial text sequences.
In summary, the process of generating word-based adversarial samples on NLP data in the black-box setting is a 2-step approach: (1) use a scoring function to determine the importance of every word to the classification result, and rank the words based on their scores, and (2) use a transformation algorithm to change the selected words.
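To make the two-step process concrete, below is a minimal Python sketch of the attack loop. The helper names (score_words and transform) are ours, standing in for the scoring functions and the character-level transformer described in the following subsections; this is an illustration, not the paper's actual implementation.
\begin{verbatim}
def generate_adversarial(tokens, score_words, transform, m=20):
    """Two-step black-box attack: rank words by importance, then perturb the top m."""
    # Step 1: score every word using black-box queries only, then rank positions.
    scores = score_words(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)

    # Step 2: apply a small character-level edit to the m highest-scoring words.
    adv_tokens = list(tokens)
    for pos in ranked[:m]:
        adv_tokens[pos] = transform(adv_tokens[pos])
    return adv_tokens
\end{verbatim}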
C. Step 1: Token Scoring Function and Ranking
First, we construct scoring functions to determine which words are important for the final prediction. The proposed scoring functions have the following properties:
1. Our scoring functions are able to correctly reflect the importance of words for the prediction.
2. Our scoring functions calculate word scores without knowledge of the parameters and structure of the classification model.
3. Our scoring functions are efficient to calculate.
In the following, we explain the three scoring functions we propose: the temporal score, the temporal tail score, and a combination of the two.
1) Temporal Score (TS)
Suppose the input sequence is x = x_1x_2…x_n, where x_i denotes the token at the i-th position. In the continuous case (e.g., an image), suppose a small perturbation changes the i-th feature from x_i to x′_i; its influence on the output F(x) can be approximated by the first-order term \begin{equation*}
\triangle_{i}F(\text{x})=(x_{i}^{\prime}-x_{i})\nabla_{x_{i}}F(\text{x})
\end{equation*}
However, in a black-box setting, the gradient ∇_{x_i}F(x) is not available. Therefore, we directly measure the effect of word x_i by comparing the model's output on the prefix that includes the word with its output on the prefix that excludes it: \begin{equation*}
\text{TS}(x_{i})=F(x_{1},x_{2},\ldots,x_{i-1},x_{i})-F(x_{1},x_{2},\ldots,x_{i-1})
\end{equation*}
where F denotes the classifier's output score for the original class.
2) Temporal Tail Score (TTS)
The problem with the temporal score is that it scores a word based only on its preceding words. However, the words following a word are often also important for classification. Therefore we define the Temporal Tail Score as the complement of the temporal score. It compares two trailing parts of a sequence, the one containing a certain word versus the one that does not; the difference reflects whether the word influences the final prediction when coupled with the words after it. The Temporal Tail Score (TTS) of word x_i is \begin{equation*}
\text{TTS}(x_{i})=F(x_{i},x_{i+1},\ldots,x_{n})-F(x_{i+1},\ldots,x_{n})
\end{equation*}
3) Combined Score
Since the temporal score and the temporal tail score model the importance of a word from the two opposite directions of a text sequence, we can combine them. We calculate the combined scoring function as \begin{equation*}
\text{Combined Score}(x_{i})=\text{TS}(x_{i})+\lambda\cdot \text{TTS}(x_{i})
\end{equation*}
where λ is a hyperparameter balancing the two terms.
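As an illustration, the three scoring functions can be sketched in Python as follows, assuming predict_prob(tokens) is a black-box query returning the classifier's score for the original class (the function names and the default λ are ours, and empty prefixes/suffixes are assumed to be handled by the query, e.g., as an all-padding input):
\begin{verbatim}
def temporal_score(tokens, predict_prob):
    """TS(x_i): score change when word i is appended to its preceding prefix."""
    scores = []
    for i in range(len(tokens)):
        with_word = predict_prob(tokens[:i + 1])     # F(x_1, ..., x_i)
        without_word = predict_prob(tokens[:i])      # F(x_1, ..., x_{i-1})
        scores.append(with_word - without_word)
    return scores

def temporal_tail_score(tokens, predict_prob):
    """TTS(x_i): score change when word i is prepended to its trailing suffix."""
    scores = []
    for i in range(len(tokens)):
        with_word = predict_prob(tokens[i:])         # F(x_i, ..., x_n)
        without_word = predict_prob(tokens[i + 1:])  # F(x_{i+1}, ..., x_n)
        scores.append(with_word - without_word)
    return scores

def combined_score(tokens, predict_prob, lam=1.0):
    """Combined score: TS + lambda * TTS, with lambda a hyperparameter."""
    ts = temporal_score(tokens, predict_prob)
    tts = temporal_tail_score(tokens, predict_prob)
    return [a + lam * b for a, b in zip(ts, tts)]
\end{verbatim}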
Once we calculate the importance score of each word in an input, we select the top m words (those with the highest scores) for modification, where m is the maximum number of words we allow to be changed.
D. Step 2: Token Transformer
Previous approaches (summarized in Table V) change words following the gradient direction (gradient of the target adversarial class w.r.t the word), or following some perturbation guided by the gradient. However, in our case there is no gradient direction available. Therefore, we propose an efficient method to modify a word, and we do this by deliberately creating misspelled words.
The key observation is that words are symbolic, and learning-based classification programs handle NLP words through a dictionary that represents a finite set of possible words. The size of a typical NLP dictionary is much smaller than the number of possible character combinations of similar length (e.g., about 26^n possible strings of length n in the English case).
This means if we deliberately create misspelled words on important words, we can easily convert those important words to “unknown” (i.e., words not in the dictionary). The unknown words are mapped to the “unknown” embedding vector in deep-learning modeling. Our results (Section III) strongly indicate that this simple strategy can effectively force RNN models to make a wrong classification.
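As a toy illustration of this observation (not the paper's preprocessing code): a typical dictionary lookup maps any out-of-vocabulary word, including a deliberately misspelled one, to the single unknown-token index before the embedding layer.
\begin{verbatim}
# Toy vocabulary; index 0 is reserved for out-of-dictionary ("unknown") words.
vocab = {"<unk>": 0, "this": 1, "movie": 2, "is": 3, "terrible": 4}

def to_indices(tokens, vocab):
    """Map tokens to dictionary indices; misspelled words fall back to <unk>."""
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

print(to_indices("this movie is terrible".split(), vocab))  # [1, 2, 3, 4]
print(to_indices("this movie is terrble".split(), vocab))   # [1, 2, 3, 0] -- key word lost
\end{verbatim}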
To create such a misspelling, many strategies can be used.
However, we prefer small changes to the original word, as we want the generated adversarial sequence and its seed input to appear similar (visually and morphologically) to human observers. Therefore, we prefer methods with a small edit distance and use the Levenshtein distance [18], a metric measuring the similarity between sequences. We propose four similar methods: (1) substitute a letter in the word with a random letter, (2) delete a random letter from the word, (3) insert a random letter into the word, and (4) swap two adjacent letters in the word. The edit distance of the substitution, deletion and insertion operations is 1, and that of the swap operation is 2.
These methods do not guarantee that the original word is changed to a misspelled word; it is possible for a word to “collide” with another valid word after the transformation. However, the probability of collision is very small, as there are far more possible character combinations than words in the dictionary, so a random edit is unlikely to produce another dictionary word.
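A minimal Python sketch of the four character-level transformations described above (our own illustrative implementation; the lowercase character set and random choices are assumptions):
\begin{verbatim}
import random
import string

def substitute(word):
    """Replace a randomly chosen letter with a random letter (edit distance 1)."""
    i = random.randrange(len(word))
    return word[:i] + random.choice(string.ascii_lowercase) + word[i + 1:]

def delete(word):
    """Delete a randomly chosen letter (edit distance 1)."""
    i = random.randrange(len(word))
    return word[:i] + word[i + 1:]

def insert(word):
    """Insert a random letter at a random position (edit distance 1)."""
    i = random.randrange(len(word) + 1)
    return word[:i] + random.choice(string.ascii_lowercase) + word[i:]

def swap(word):
    """Swap two adjacent letters (edit distance 2)."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]
\end{verbatim}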
Experiments on Effectiveness of Adversarial Sequences
We evaluate the effectiveness of our algorithm by conducting experiments on different RNN models across two real-world NLP datasets. In particular, we want to answer the following research questions: (1) Does the accuracy of deep learning models decrease when they are fed the adversarial samples? (2) Do the adversarial samples generated by our method transfer between models?
A. Experimental Setup
Datasets
In our experiments, we use the Large Movie Review Dataset (IMDB Dataset) [19] and the Enron Spam Dataset [20].
The IMDB Movie Review Dataset contains 50,000 highly polarized movie reviews, 25,000 for training and 25,000 for testing. We train an RNN model to classify the movie reviews into two classes: positive and negative.
The Enron Spam Dataset is a subset of the original Enron Email Dataset. The goal is to train a spam filter that can determine whether a certain message is spam or not. We use a subset containing 3,672 ham (i.e. not spam) emails, and 1,500 spam emails.
Details of the datasets are listed in Table II.
Target deep models
To show that our method is effective, we perform our experiments on both uni- and bi-directional LSTMs.
The first model contains a random embedding layer, a uni-directional LSTM with 100 hidden nodes and a fully connected layer for the classification. Without adversarial examples, this model achieves 84% accuracy on the IMDB Dataset and 99% accuracy on the Enron Spam Dataset. The second model is the same as the first, except with a bi-directional LSTM (also with 100 hidden nodes) instead of uni-directional. Without adversarial examples, it achieves 86% accuracy on the IMDB Dataset and 98% accuracy on the Enron Spam Dataset.
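For concreteness, the first target model can be sketched in Keras roughly as follows; the vocabulary size and embedding dimension are placeholder values of ours, not settings reported in the paper, and the second model would simply wrap the LSTM in a Bidirectional layer.
\begin{verbatim}
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE = 20000  # placeholder dictionary size
EMBED_DIM = 100     # placeholder embedding dimension

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM),  # randomly initialized embedding layer
    LSTM(100),                         # uni-directional LSTM with 100 hidden nodes
    Dense(1, activation="sigmoid"),    # fully connected layer for the binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
\end{verbatim}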
Baselines
We implemented the following attacking algorithms to generate adversarial samples:
Projected FGSM: the projected L_∞ Fast Gradient Sign Method attack from [11]. In our implementation, we use the Fast Gradient Sign Method code from Cleverhans [21], a library developed by the original authors. As we discussed, this method is not black-box.
Random + DeepWordBug Transformer: This technique randomly selects words to change and uses our transformer to change them.
Our method
We use our scoring functions to better choose which words to mutate. In our implementation, we evaluate different scoring functions: the replace-1 score, the temporal score, the temporal tail score and the combined score. After scoring, we use our transformer to change the selected words.
Platform
We train the target deep-learning models and implement attacking methods using Keras with Tensorflow as back-end. We use Nvidia GTX Titan cards.
Performance
Performance of the attacking methods is measured by the accuracy of the deep-learning models on the generated adversarial sequences: the lower the accuracy, the more effective the attacking method, since it indicates that the adversarial samples successfully fool the deep-learning classifier. The number of words allowed to be modified is a hyperparameter.
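This evaluation protocol can be sketched as follows (illustrative code with assumed helper names: predict_label wraps the target model as a black box, and attack is, e.g., the generation loop sketched in Section II):
\begin{verbatim}
def adversarial_accuracy(predict_label, attack, test_samples, test_labels, m=20):
    """Model accuracy on adversarial sequences; lower accuracy means a stronger attack."""
    correct = 0
    for tokens, label in zip(test_samples, test_labels):
        adv_tokens = attack(tokens, m)  # perturb at most m words
        correct += int(predict_label(adv_tokens) == label)
    return correct / len(test_samples)
\end{verbatim}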
B. Experimental Results on Classification
We analyze the effectiveness of the attacks on two deep models (uni- and bi-directional LSTMs). The results of model accuracy are summarized in Table III. Detailed experimental results at different numbers of allowed word modifications are presented in Figure 3. The results of the uni-directional LSTM are in Figure 3 (a)(b), and the results of the bi-directional LSTM are in Figure 3 (c)(d).
From Figure 3, we first see that the model has a significantly lower accuracy when classifying the adversarial samples generated by our method on both datasets, compared to its accuracy on the original test samples. On the IMDB Dataset, changing 20 words per review using WordBug-Combined reduced the model accuracy from 86% to around 41%. As the movie reviews have an average length of 215 words, we consider the 20-word modification effective. On the Enron Spam Dataset, changing 20 words following WordBug-Combined reduced the model accuracy from 99% to around 44%. For the bidirectional model, changing 20 words in every sequence reduces the model accuracy from 86% to around 26% on the IMDB Dataset and from 99% to around 40% on the Enron Spam Dataset. We also see that randomly choosing words to change (i.e., Random in Table III) has little influence on the final result.
Surprisingly, our method achieves better results than the projected FGSM, which is a white-box attack. The improvement is most likely because the selection of words matters more than how the words are changed: since the projected FGSM selects words randomly, it does not perform as well as our method.
It is also interesting to compare different score functions that we proposed. On both the IMDB and Enron datasets, the combined score performs notably better than the temporal score and the tail temporal score. It utilizes more information compared to other score functions. The Replace-1 score does not perform well in these datasets, presumably because it does not consider the temporal relationship among words.
C. Transferability of the Adversarial Sequences
Next, we evaluate the transferability of the adversarial sequences generated by our methods. Previous studies have found that transferability is an important property of adversarial image samples: adversarial images generated for one DNN model can successfully fool another DNN model trained for the same task, i.e., they transfer to the other model.
We use the combined score and the substitution transformer to generate adversarial samples. The number of words we change is 20. The results in Table IV are acquired by feeding adversarial sequences generated by one RNN model to another RNN model on the same task.
Figure 3. Experimental results. The X axis represents the number of modified words, and the Y axis is the test accuracy on adversarial samples generated using the respective attacking methods. (a) Uni-directional LSTM on the IMDB Dataset; (b) uni-directional LSTM on the Enron Spam Dataset; (c) bi-directional LSTM on the IMDB Dataset; (d) bi-directional LSTM on the Enron Spam Dataset.
From the table, we see that most adversarial samples can successfully transfer to other models, even to those models with different word embeddings. This experiment demonstrates that our method can successfully find those words that are important for classification and the transformation is effective across multiple models.
Connecting to Previous Studies
Compared to studies of adversarial examples on images, little attention has been paid to generating adversarial sequences on text. We compare the two most relevant studies and ours in Table V. (1) Papernot et al. applied gradient-based adversarial modifications directly to NLP inputs, targeting RNN-based classifiers, in [11]. The resulting samples are called “adversarial sequences,” and we adopt the name in this paper. The study proposed a white-box adversarial attack called the projected Fast Gradient Sign Method and applied it repeatedly to modify an input text until the generated sequence is misclassified. It first randomly picks a word, and then uses the gradient to generate a perturbation of the corresponding word vector. Then it maps the perturbed word vector to the nearest word, based on Euclidean distance in the word embedding space. If the sequence is not yet misclassified, the algorithm randomly picks another position in the input. (2) Recently, [12] used the embedding gradient to determine important words. The technique then uses heuristic-driven rules together with hand-crafted synonyms and typos. Different from ours, this study is a white-box attack because it accesses the gradient of the model. (3) Another paper [22] measures the importance of each word to a specific class using the word frequency in that class's training data. It then uses heuristic-driven techniques to generate adversarial samples by adding, modifying or removing important words. Unlike ours, this method needs access to a large set of labeled samples.
In summary, previous approaches do not apply to black-box settings. Besides, previous approaches mostly used heuristic-driven and complicated modifications. We summarize the differences between our method and the previous studies on generating adversarial text samples in Table V. Our method is black-box, while previous approaches all used the stronger white-box assumption. Our method uses the edit distance in the sequence input space to search for the adversarial perturbations. Also, our modification algorithm is simpler than previous approaches.
Conclusion
In this paper we identify a vulnerability of deep learning models for text classification. We present a novel framework, DeepWordBug, to generate adversarial text sequences that mislead deep learning models by exploiting this vulnerability. Our method has the following advantages:
Black-box: DeepWordBug generates adversarial samples in a black-box manner.
Performance: While approximately minimizing the edit distance of the perturbation, DeepWordBug achieves better performance compared to baseline methods on two NLP datasets across multiple deep learning architectures.
Our experimental results indicate that DeepWordBug causes about a 70% decrease from the original classification accuracy for two state-of-the-art word-level LSTM models across two different datasets. We also demonstrate that the adversarial samples generated on one model can be successfully transferred to other models, reducing the target model accuracy from around 90% to 30–60%.