Attention Meets Perturbations: Robust and Interpretable Attention with Adversarial Training

Although attention mechanisms have been applied to a variety of deep learning models and have been shown to improve prediction performance, they have been reported to be vulnerable to perturbations to the mechanism. To overcome this vulnerability, we draw inspiration from adversarial training (AT), a powerful regularization technique for enhancing the robustness of models. In this paper, we propose a general training technique for natural language processing tasks, including AT for attention (Attention AT) and more interpretable AT for attention (Attention iAT). The proposed techniques improve the prediction performance and the model interpretability by exploiting the mechanisms with AT. In particular, Attention iAT boosts those advantages by introducing adversarial perturbations that enhance the difference in the attention of the sentences. Evaluation experiments with ten open datasets revealed that AT for attention mechanisms, especially Attention iAT, demonstrated (1) the best performance in nine out of ten tasks and (2) more interpretable attention (i.e., the resulting attention correlated more strongly with gradient-based word importance) for all tasks. Additionally, the proposed techniques are (3) much less dependent on the perturbation size in AT. Our code is available at https://github.com/shunk031/attention-meets-perturbation


I. INTRODUCTION
ATTENTION mechanisms [1] are widely applied in the natural language processing (NLP) field through deep neural networks (DNNs). As the effectiveness of attention mechanisms became apparent in various tasks [2]-[7], they were applied not only to recurrent neural networks (RNNs) but also to convolutional neural networks (CNNs). Moreover, Transformers [8], which make proactive use of attention mechanisms, have also achieved excellent results. However, it has been pointed out that DNN models tend to be locally unstable, and even tiny perturbations to the original inputs [9] or to attention mechanisms can mislead the models [10]. Specifically, Jain and Wallace [10] used a practical bi-directional RNN (BiRNN) model to investigate the effect of attention mechanisms and reported that the learned attention weights of the model are vulnerable to perturbations. The Transformer [8] and its follow-up models [12], [13] have self-attention mechanisms that estimate the relationship between the words in a sentence. These models take advantage of the effect of the mechanisms and have shown promising performance; there is thus no doubt that the effect of the mechanisms is extremely large. However, they are not easy to train, as they require huge amounts of GPU memory to maintain the weights of the model. Recently, there have been proposals to reduce this memory consumption [14], and we acknowledge the advantages of these models. On the other hand, the application of attention mechanisms to DNN models such as RNNs and CNNs, which are widely used and have relatively modest training requirements, has not been sufficiently studied.
In this paper, we focus on improving the robustness of commonly used BiRNN models (described in detail in Section III) to perturbations in the attention mechanisms. Furthermore, we demonstrate that overcoming the vulnerability of the attention mechanisms results in an improvement in the prediction performance and the model interpretability.

Figure 1. … [17]. The proposed adversarial training for attention mechanisms helps the model learn cleaner attention.

VOLUME X, 2021. This is a pre-print for private use only.
To tackle the models' vulnerability to perturbations, Goodfellow et al. [15] proposed adversarial training (AT), a training technique that increases robustness by adding adversarial perturbations to the input and forcing the model to overcome them. Previous studies [15], [16] in the image recognition field have theoretically explained the regularization effect of AT and shown that it improves the robustness of models to unseen images.
AT is also widely used in the NLP field as a powerful regularization technique [18]-[21]. In pioneering work, Miyato et al. [18] proposed a simple yet effective technique for improving text classification performance by applying AT to the word embedding space. Later, interpretable AT (iAT) was proposed to increase the interpretability of the model by restricting the direction of the perturbations to existing words in the word embedding space [19]. The attention weight of each word is considered an indicator of the importance of that word [22]; thus, in terms of interpretability, we regard the weight as a higher-order feature than the word embedding. Therefore, AT for attention mechanisms, which adds adversarial perturbations to deceive the attention mechanisms, is expected to be more effective than AT for word embeddings.
From the motivations above, we propose a new general training technique for attention mechanisms based on AT, called adversarial training for attention (Attention AT) and more interpretable adversarial training for attention (Attention iAT). The proposed techniques are the first attempt to employ AT for attention mechanisms. The proposed Attention AT/iAT is expected to improve the robustness and the interpretability of the model by appropriately overcoming the adversarial perturbations to attention mechanisms [23]-[25]. Because our proposed AT techniques for attention mechanisms are model-independent and general, they can be applied to various DNN models (e.g., RNNs and CNNs) with attention mechanisms. Our techniques can also be applied to any similarity function for attention mechanisms, e.g., the additive function [1] and the scaled dot-product function [8], which are commonly used to calculate the similarity in attention mechanisms.
To demonstrate the effects of these techniques, we evaluated them against several state-of-the-art AT-based techniques [18], [19] on ten common datasets for different NLP tasks. These datasets included binary classification (BC), question answering (QA), and natural language inference (NLI). We also evaluated how well the attention weights obtained through the proposed AT techniques agreed with the word importance calculated from the gradients [26]. Evaluating the proposed techniques, we obtained the following findings concerning AT for attention mechanisms in NLP:
• AT for attention mechanisms improves the prediction performance of various NLP tasks.
• AT for attention mechanisms helps the model learn cleaner attention (as shown in Figure 1) and demonstrates a stronger correlation with the word importance calculated from the model gradients.
• The proposed training techniques are much less dependent on the perturbation size in AT.
In particular, our Attention iAT demonstrated the best performance in nine out of ten tasks and more interpretable attention, i.e., the resulting attention weights correlated more strongly with the gradient-based word importance [26]. The implementation required to reproduce these techniques and the evaluation experiments is available on GitHub.

II. RELATED WORK

A. ATTENTION MECHANISMS
Attention mechanisms were introduced by Bahdanau et al. [1] for the task of machine translation. Today, these mechanisms contribute to improving the prediction performance of various tasks in the NLP field, such as sentence-level classification [2], sentiment analysis [3], question answering [4], and natural language inference [5]. There are a wide variety of attention mechanisms; for instance, additive [1] and scaled dot-product [8] functions are used as similarity functions.
Attention weights are often claimed to offer insights into the inner workings of DNNs [22]. However, Jain and Wallace [10] reported that learned attention weights are often uncorrelated with the word importance calculated through the gradient-based method [26], and perturbations interfere with interpretation. In this paper, we demonstrate that AT for attention mechanisms can mitigate these issues.

B. ADVERSARIAL TRAINING
AT [9], [15], [27] is a powerful regularization technique that has been primarily explored in the field of image recognition to improve the robustness of models to input perturbations. In the NLP field, AT has been applied to various tasks by extending the concept of adversarial perturbations, e.g., text classification [18], [19], part-of-speech tagging [20], and machine reading comprehension [21], [28]. As mentioned earlier, these techniques apply AT to word embeddings. Other AT-based techniques for NLP tasks include those related to parameter updating [29] and a generative adversarial network (GAN)-based retrieval-enhancement method [30]. Our proposal is an adversarial training technique for attention mechanisms and is different from these methods.

Figure 2. … from two input sequences, X_P and X_Q, respectively. These inputs are encoded into hidden states through a bi-directional encoder (Enc). In conventional models, the worst-case perturbation r is added to the word embeddings. In our adversarial training for attention mechanisms, we compute and add the r to the attention a to improve the prediction performance and the interpretability of the model.
Miyato et al. [18] proposed Word AT, a technique that applies AT to the word embeddings. The adversarial perturbations are generated according to the back-propagation gradients and are expected to regularize the model. Later, Sato et al. [19] proposed Word iAT, which has been shown to achieve almost the same performance as Word AT, which does not aim at interpretability [19]. The Word iAT technique aims to increase the model's interpretability by determining the perturbation's direction so that it is closer to other word embeddings in the vocabulary. Both reports demonstrated improved task performance via AT. However, the specific effect of AT on attention mechanisms has yet to be investigated. In this paper, we aim to address this issue by analyzing the effects of AT for attention mechanisms on various NLP tasks.
AT is considered to be related to other regularization techniques (e.g., dropout [31], batch normalization [32]). Specifically, dropout can be considered a kind of noise addition. Word dropout [33] and character dropout [34], known as wildcard training, are variants for NLP tasks. These techniques can be considered random noise for the target task. In contrast, AT has been demonstrated to be effective because it creates particularly vulnerable perturbations that the model is trained to overcome [15].
It has been reported that DNN models trained with adversarial training to overcome adversarial perturbations capture more human-like features [23]-[25]. These features make the predictions of DNN models easier for humans to interpret. In this paper, we demonstrate that the proposed AT for attention mechanisms provides cleaner attention that is more easily interpreted by humans.

III. COMMON MODEL ARCHITECTURE
Our goal is to improve the performance of NLP models (i.e., predictability and interpretability) by improving the robustness of the attention mechanisms. To demonstrate the effectiveness of applying AT to the attention for words, we adopted the BiRNN-based model used by Jain and Wallace [10] as our common model architecture and set their performance as our baseline, as described in Section I. This is because they performed extensive experiments across a variety of public NLP tasks to investigate the effect of attention mechanisms, and their model has demonstrated desirable prediction performance; however, the attention mechanism in the model has been reported to be vulnerable to perturbations. Based on the model of Jain and Wallace [10], we investigated three common NLP tasks (BC, QA, and NLI), because Jain and Wallace [10] considered the same tasks. A BC task is a single sequence task that takes one input text, while QA and NLI tasks are pair sequence tasks that take two input sequences. Accordingly, we defined two base models, a single sequence model and a pair sequence model, for these tasks, as shown in Figure 2.

A. MODEL WITH ATTENTION MECHANISMS FOR SINGLE SEQUENCE TASK
For a single sequence task, such as the BC task, the input is a word sequence of one-hot encodings X_S = (x_t)_{t=1}^{T_S} ∈ {0, 1}^{T_S×|V|}, where T_S and |V| are the number of words in the sentence and the vocabulary size, respectively. Let w_t ∈ R^d be a d-dimensional word embedding that corresponds to x_t. We represent each word with its word embedding to obtain (w_t)_{t=1}^{T_S} ∈ R^{T_S×d}. Next, we use the BiRNN encoder Enc to obtain m-dimensional hidden states h_t:

h_t = Enc(w_t, h_{t-1}) ∈ R^m, (1)

where h_0 is the initial hidden state and is regarded as a zero vector. Next, we use the additive formulation of attention mechanisms proposed by Bahdanau et al. [1] to compute the attention score ã_t for the t-th word:

ã_t = c^T tanh(W h_t + b), (2)

where W ∈ R^{d′×m} and b, c ∈ R^{d′} are the parameters of the model. Then, from the attention scores ã = (ã_t)_{t=1}^{T_S}, the attention weights a = (a_t)_{t=1}^{T_S} for all words are computed as

a = softmax(ã). (3)

The weighted instance representation h_a is calculated as

h_a = Σ_{t=1}^{T_S} a_t h_t. (4)

Finally, h_a is fed to a dense layer Dec, and the output activation function is then used to obtain the prediction:

ŷ = σ(Dec(h_a)) ∈ R^{|y|}, (5)

where σ is a sigmoid function and |y| is the label set size.
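As a concrete illustration, the additive attention pipeline of Eqs. 2-4 can be sketched in NumPy. This is a minimal sketch, not the evaluated implementation: the hidden states H are random stand-ins for the output of the BiRNN encoder Enc, and all sizes are toy values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(H, W, b, c):
    """Additive (Bahdanau-style) attention over encoder hidden states.

    H : (T, m) hidden states from the BiRNN encoder Enc.
    W : (d', m); b, c : (d',) -- parameters of the scoring function.
    Returns the scores (Eq. 2), weights (Eq. 3), and h_a (Eq. 4)."""
    scores = np.array([c @ np.tanh(W @ h + b) for h in H])  # score_t = c^T tanh(W h_t + b)
    weights = softmax(scores)                               # a = softmax(scores)
    h_a = weights @ H                                       # h_a = sum_t a_t h_t
    return scores, weights, h_a

rng = np.random.default_rng(0)
T, m, d = 5, 8, 6                      # toy sizes; H stands in for Enc output
H = rng.normal(size=(T, m))
W, b, c = rng.normal(size=(d, m)), rng.normal(size=d), rng.normal(size=d)
scores, weights, h_a = additive_attention(H, W, b, c)
assert np.isclose(weights.sum(), 1.0)  # attention weights form a distribution
assert h_a.shape == (m,)               # one pooled sentence representation
```

In the actual model, h_a is then fed to the dense layer Dec to produce the prediction of Eq. 5.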

B. MODEL WITH ATTENTION MECHANISMS FOR PAIR SEQUENCE TASK
For a pair sequence task, such as the QA and NLI tasks, the input is a pair of word sequences of one-hot encodings X_P = (x_t^P)_{t=1}^{T_P} and X_Q = (x_t^Q)_{t=1}^{T_Q}, where T_P and T_Q are the number of words in each sentence. X_P and X_Q represent the paragraph and question in QA, and the hypothesis and premise in NLI. We used two separate BiRNN encoders, Enc_P and Enc_Q, to obtain the hidden states h_t^P = Enc_P(w_t^P, h_{t-1}^P) ∈ R^m and h_t^Q = Enc_Q(w_t^Q, h_{t-1}^Q) ∈ R^m, where h_0^P and h_0^Q are the initial hidden states and are regarded as zero vectors. Next, we computed the attention score ã_t of each word of X_P as

ã_t = c^T tanh(W_1 h_t^P + W_2 h_{T_Q}^Q + b),

where W_1 ∈ R^{d′×m} and W_2 ∈ R^{d′×m} denote the projection matrices, and b, c ∈ R^{d′} are the parameter vectors. Similar to Eq. 3, the attention weights a_t can be calculated from ã_t. The representation h_a is obtained from the attention-weighted sum of the hidden states of the words in X_P. Finally, h_a is fed to a dense layer Dec, and a softmax function is then used as σ to obtain the prediction (in the same manner as in Eq. 5).

C. TRAINING MODEL WITH ATTENTION MECHANISMS
Let Xã be an input sequence with attention score ã, where ã is the concatenated attention score for all t. We model the conditional probability of the class y as p(y|Xã; θ), where θ represents all model parameters. For training the model, we minimize the following negative log likelihood as a loss function with respect to the model parameters:

L(Xã, y; θ) = -log p(y|Xã; θ).

IV. ADVERSARIAL TRAINING FOR ATTENTION MECHANISMS
The main contribution of this paper is to explore the idea of employing AT for attention mechanisms. In this paper, we propose a new training technique for attention mechanisms based on AT, called Attention AT and Attention iAT. The proposed techniques aim to achieve better regularization effects and to provide better interpretation of attention in the sentence. These techniques are the first application of AT to the attention in each word, which is expected to be more interpretable, with reference to AT for word embeddings [18] and a technique more focused on interpretability [19]. In this paper, we generate adversarial perturbations based on the model described in Section III.

A. ATTENTION AT: ADVERSARIAL TRAINING FOR ATTENTION
We describe the proposed Attention AT, which features adversarial perturbations in the attention mechanisms rather than in the word embeddings [18], [19]. The adversarial perturbation on the mechanisms is defined as the worst-case perturbation of a small bounded norm that maximizes the loss function L of the current model:

r_adv = argmax_{r, ||r||≤ε} L(Xã+r, y; θ̂),

where Xã+r is the input sequence with the attention score ã perturbed by r, y is the target output, and θ̂ represents the current model parameters. Because exactly maximizing L is intractable, we apply the fast gradient method [11], [18], i.e., a first-order approximation, to obtain an approximate worst-case perturbation of norm ε through a single gradient computation:

r_AT = ε g / ||g||_2, where g = ∇_ã L(Xã, y; θ̂),

and ε is a hyper-parameter to be determined using the validation dataset. We find this r_AT against the current model parameterized by θ̂ at each training step and construct the adversarial attention score:

ã_adv = ã + r_AT.
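The fast gradient step can be sketched as follows. This is a toy, model-free sketch: the loss over attention scores is a hypothetical stand-in for L, and the gradient g is approximated by central differences, whereas the actual technique obtains g in a single back-propagation pass.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def toy_loss(scores, target=0):
    """Hypothetical stand-in for the task loss L, defined directly
    over softmax-normalized attention scores."""
    return -np.log(softmax(scores)[target])

def attention_at_perturbation(loss_fn, scores, eps):
    """r_AT = eps * g / ||g||_2, with g = dL/d(scores) estimated here by
    central differences (a real model would use back-propagation)."""
    g = np.zeros_like(scores)
    h = 1e-5
    for i in range(scores.size):
        step = np.zeros_like(scores)
        step[i] = h
        g[i] = (loss_fn(scores + step) - loss_fn(scores - step)) / (2 * h)
    return eps * g / (np.linalg.norm(g) + 1e-12)

scores = np.array([2.0, 0.5, -1.0])
r = attention_at_perturbation(toy_loss, scores, eps=0.5)
assert np.isclose(np.linalg.norm(r), 0.5)       # perturbation norm equals eps
assert toy_loss(scores + r) > toy_loss(scores)  # and it increases the loss
```

Because the toy loss is convex in the scores, stepping along the gradient is guaranteed to increase it, which is exactly the worst-case behavior the training then learns to resist.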

B. ATTENTION IAT: INTERPRETABLE ADVERSARIAL TRAINING FOR ATTENTION
We describe the proposed Attention iAT for further boosting the prediction performance and the interpretability of NLP tasks. Rather than utilizing AT to attention mechanisms (as described in Section IV-A), Attention iAT effectively exploits differences in the attention to each word in a sentence for the training. As a result, this technique provides cleaner attention in the sentence and improves the interpretability of the attention. These effects contribute to improving the performance of various NLP tasks.
In terms of formulation, the proposed Attention iAT is analogous to interpretable AT for word embeddings (Word iAT) [19], which increases the interpretability of AT for word embeddings. However, the implications and effects for training the model are very different: in the proposed Attention iAT, the attention difference enhancement, described later, amplifies the difference in attention between the words of a sentence. This difference and its effect are explained later in this section and discussed in Section VII-C.
Suppose ã_t denotes the attention score corresponding to the t-th position in the sentence. We define d_{t,k} as the difference between the attention ã_t to the t-th word in a sentence and the attention ã_k to any k-th word, and collect these differences into a vector d_t:

d_{t,k} = ã_t - ã_k, d_t = (d_{t,1}, …, d_{t,T}) ∈ R^T, (13)

where T = T_S in a single sequence task and T = T_P in a pair sequence task. By normalizing the norm of the vector, we define a normalized difference vector of the attention for the t-th word:

d̃_t = d_t / ||d_t||_2. (14)

The number of dimensions of the difference vector is the vocabulary size (fixed length) for Word iAT, whereas it is the number of words in a sentence (variable length) for Attention iAT; the dimensionality of d_t in Attention iAT is thus much smaller than in Word iAT. We define the perturbation r(α_t) for the attention to the t-th word with trainable parameters α_t = (α_{t,k})_{k=1}^T ∈ R^T and the normalized difference vector d̃_t as follows:

r(α_t) = α_t^T d̃_t.

By combining α_t for all t, we can calculate the perturbation r(α) for the sentence:

r(α) = (r(α_1), …, r(α_T)).

Then, similar to Xã+r in Section IV-A, we introduce Xã+r(α) and seek the worst-case weights of the difference vectors that maximize the loss function:

r_iAT = argmax_{α, ||α||≤ε} L(Xã+r(α), y; θ̂).
Unlike Attention iAT, Word iAT defines the difference d_{t,k} in Eq. 13 as the distance between the t-th word in a sentence and the k-th word in the vocabulary in the word embedding space. Based on this distance, Word iAT determines the direction of the perturbation for the t-th word as a linear sum of the word directional vectors in the vocabulary. In contrast, Attention iAT does not compute distances to word embeddings in the vocabulary. Instead, this technique computes the difference in attention to the other words in the sentence and determines the direction of the perturbation accordingly. The adversarial perturbation of Attention iAT, defined in this way, works to increase the difference in attention to each word. We call this process attention difference enhancement. Owing to this process, Attention iAT improves the interpretability of attention and contributes to the model's prediction performance. Detailed discussions are provided in Section VII-C.
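A minimal NumPy sketch of the construction in Eqs. 13 and 14 follows; the values are toys, and the weights α are fixed here for illustration, whereas in practice they are chosen to maximize the loss under ||α|| ≤ ε.

```python
import numpy as np

def iat_perturbation(scores, alpha):
    """Attention iAT perturbation built from normalized difference vectors.

    scores : (T,) attention scores for one sentence.
    alpha  : (T, T) weights; row t plays the role of alpha_t.
    Returns r(alpha) with one perturbation value per word."""
    D = scores[:, None] - scores[None, :]             # d_{t,k} (Eq. 13)
    norms = np.linalg.norm(D, axis=1, keepdims=True)  # ||d_t||
    D_tilde = D / (norms + 1e-12)                     # normalized d_t (Eq. 14)
    return np.sum(alpha * D_tilde, axis=1)            # r(alpha_t) = alpha_t . d_t

scores = np.array([1.00, 1.01, 0.99, 1.00])  # near-uniform attention scores
alpha = np.full((4, 4), 0.25)                # fixed toy weights
r = iat_perturbation(scores, alpha)
assert r.shape == (4,)
assert np.isfinite(r).all()
```

Even though the raw score differences here are tiny, each row of the difference matrix is rescaled to unit norm before being weighted, which is the amplification that distinguishes Attention iAT from plain Attention AT.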

C. TRAINING A MODEL WITH ADVERSARIAL TRAINING
At each training step, we generate the adversarial perturbation against the current model. To this end, we define the loss function for adversarial training as follows:

L̃(Xã, y; θ) = L(Xã, y; θ) + λ L(Xã_ADV, y; θ), (20)

where λ is the coefficient that controls the balance between the two loss terms. Note that Xã_ADV can be Xã_adv for Attention AT or Xã_iadv for Attention iAT.
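The combined objective can be sketched as follows, with a toy negative log likelihood standing in for L and a hypothetical fixed vector standing in for the solved adversarial perturbation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def nll(scores, target):
    """Toy negative log likelihood over softmax-normalized attention
    scores, standing in for the task loss L."""
    return -np.log(softmax(scores)[target])

def adversarial_objective(scores, target, r_adv, lam=1.0):
    """Combined loss of Eq. 20 (sketch): the clean loss plus lambda times
    the loss on the perturbed scores, where r_adv stands for r_AT
    (Attention AT) or r_iAT (Attention iAT)."""
    return nll(scores, target) + lam * nll(scores + r_adv, target)

scores = np.array([1.5, 0.2, -0.8])
r_adv = np.array([-0.2, 0.3, 0.1])  # hypothetical solved perturbation
total = adversarial_objective(scores, 0, r_adv, lam=1.0)
assert total > nll(scores, 0)       # the adversarial term adds a positive penalty
```

Minimizing this objective trains the model both on the clean attention and on the attention the adversary found hardest, which is the source of the regularization effect.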

V. EXPERIMENTS
In this section, we describe the evaluation tasks and datasets, the details of the models, and the evaluation criteria.

A. TASKS AND DATASETS
We evaluated the proposed techniques using the open benchmark tasks (i.e., four BCs, four QAs, and two NLIs) used in Jain and Wallace [10]. In our experiment, we added MultiNLI [41] as an additional NLI task for more detailed analysis (see the details in Appendix A). Table 1 presents the statistics for all datasets. We split each dataset into a training set, a validation set, and a test set; note that Jain and Wallace [10] split the datasets into only a training set and a test set, so our results are not directly identical to theirs. We performed preprocessing in the same manner as Jain and Wallace [10] (as shown at https://github.com/successar/AttentionExplanation), including tokenization with spaCy, mapping out-of-vocabulary words to a special <unk> token, and mapping all words with numeric characters to qqq.

B. MODEL SETTINGS
We compared the two proposed training techniques to four conventional training techniques, all implemented with the same model architecture described in Section III. Following Jain and Wallace [10], we used a bi-directional long short-term memory (LSTM) [42] as the BiRNN-based encoder, including Enc, Enc_P, and Enc_Q. A total of six training techniques were evaluated in the experiments: Vanilla (no AT) [10], Word AT [18], Word iAT [19], Attention RP (random perturbation to attention), and the proposed Attention AT and Attention iAT. We implemented the training techniques above using the AllenNLP library with Interpret [43], [44]. Throughout the experiments, we set the hyper-parameter λ = 1 related to AT or iAT in Eq. 20. To ensure a fair comparison of the training techniques, we followed the configurations (e.g., initialization of word embeddings, hidden size of the encoder, optimizer settings) used in the literature [10] (see the details in Appendix B). Note that while Jain and Wallace [10] used the test set to adjust the model's hyper-parameters, we used a validation set. In adversarial training, the Allentune library [45] was used to adjust the hyper-parameter ε, and we report the test scores for the model with the highest validation score.

C. EVALUATION CRITERIA
First, we compared the prediction performance of each model for each task. As evaluation metrics of the prediction performance, we used the F1 score, accuracy, and the micro-F1 score for BC, QA, and NLI, respectively, as in [10]. The F1 score is the harmonic mean of precision and recall; it therefore takes both false positives and false negatives into account.
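As a quick arithmetic illustration of the F1 metric (toy counts, unrelated to our experiments):

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall; false positives
    lower precision and false negatives lower recall, so both reduce F1."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 8 true positives, 2 false positives, 4 false negatives:
# precision = 0.8, recall = 2/3, F1 = 8/11
assert abs(f1_score(8, 2, 4) - 8 / 11) < 1e-12
```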
Next, we compared how well the attention weights obtained through the proposed AT-based techniques agreed with the word importance calculated from the gradients [26]. To evaluate this agreement, we computed Pearson's correlation between the attention weights and the gradient-based word importance. In [10], Kendall's tau, a rank correlation, was used to evaluate the relationship between attention and the word importance obtained from the gradients. Recently, however, it has been pointed out that rank correlations often misrepresent the relationship between two measures because of noise in the ordering of the low-ranked items [46]; we concur, and therefore used Pearson's correlation.
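A toy illustration of this choice (hypothetical attention and importance values): Pearson's correlation weighs actual magnitudes, so shuffled orderings among the many near-zero, low-ranked words barely affect it, whereas a rank correlation penalizes every swapped pair equally.

```python
import numpy as np

# Hypothetical attention weights and gradient-based word importance for
# one sentence: the two dominant words agree, the near-zero tail is shuffled.
attention  = np.array([0.70, 0.20, 0.04, 0.03, 0.02, 0.01])
importance = np.array([0.65, 0.25, 0.01, 0.04, 0.02, 0.03])

pearson = np.corrcoef(attention, importance)[0, 1]
assert pearson > 0.98  # magnitude agreement dominates despite rank noise in the tail
```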
Finally, we compared the effect of the perturbation size ε in AT on the validation performance of the BC, QA, and NLI tasks with a fixed λ = 1. We randomly chose the value of ε in the 0-30 range and ran the training 100 times. For reference, the configurations in [18], [19] were ε = 5 for Word AT and ε = 15 for Word iAT.

VI. RESULTS
In this section, we share the results of the experiments. Table 2 presents the prediction performance and the Pearson's correlations between the attention weights for the words and the word importance calculated from the model gradients. The best results are shown in bold.

A. COMPARISON OF PREDICTION PERFORMANCE
In terms of prediction performance, the model that applied the proposed Attention AT/iAT demonstrated a clear advantage over the model without AT (denoted Vanilla [10]) as well as the other AT-based techniques (Word AT [18] and Word iAT [19]). The proposed techniques achieved the best results in almost all benchmarks. For 20News and AGNews in BC and bAbI task 1 in QA, the conventional techniques, including the Vanilla model, were already sufficiently accurate (scores higher than 95%), so the performance improvement of the proposed techniques on these tasks was limited to some extent. Meanwhile, Attention AT/iAT contributed to solid performance improvements in the other, more complicated tasks.

Table 2. Comparison of prediction performance and the Pearson's correlation coefficients (Corr.) between the attention weights and the word importance of the gradient-based method. We used the same tasks and test metrics as in [10]: binary classification (BC), question answering (QA), and natural language inference (NLI), evaluated with the F1 score (F1), accuracy (Acc.), and the micro-F1 score (Micro-F1), respectively.

B. COMPARISON OF CORRELATION BETWEEN ATTENTION WEIGHTS AND GRADIENTS ON WORD IMPORTANCE
In terms of model interpretability, the attention weights for the words obtained with the Attention AT/iAT techniques correlated notably with the word importance determined by the gradients. Attention iAT demonstrated the highest correlation among the techniques in all benchmarks. Figure 3 visualizes the attention weight for each word and the gradient-based word importance in the SST test dataset. Attention AT yielded clearer attention compared to the Vanilla model or Attention iAT. Specifically, Attention AT tended to strongly focus attention on a few words. Regarding the correlation between attention-based and gradient-based word importance, Attention iAT demonstrated higher similarities than the other models.

C. COMPARISON OF THE EFFECT OF PERTURBATION SIZE
Figure 4 shows the effect of the perturbation size ε on the validation performance of SST (BC), CNN news (QA), and MultiNLI (NLI) with a fixed λ = 1. We observed that the performance of the conventional Word AT/iAT techniques deteriorated as the perturbation size increased; meanwhile, our Attention AT/iAT techniques maintained almost the same prediction performance. We observed similar trends in the other datasets described in Section V-A.

VII. DISCUSSION

A. COMPARISON OF ADVERSARIAL TRAINING FOR ATTENTION MECHANISMS AND WORD EMBEDDING
Attention AT/iAT is based on our hypothesis that attention is more important for finding significant words in document processing than the word embeddings themselves. Therefore, we sought to improve prediction performance and model interpretability by introducing AT to the attention mechanisms. We confirmed that the application of AT to the attention mechanisms (Attention AT/iAT) was more effective than its application to the word embeddings (Word AT/iAT), which supports the correctness of our hypothesis, as shown in Table 2. In particular, the Attention iAT technique was not only more accurate than the Word AT/iAT techniques but also demonstrated a higher correlation with the importance of the words predicted based on the gradient.
As shown in Figure 3, Attention AT tended to display more focused attention to the sentence than the Vanilla model. The results showed that training with adversarial perturbations to the attention mechanisms allows for cleaner attention without changing word meanings or grammatical functions. Furthermore, we confirmed that the proposed Attention AT/iAT techniques were more robust to variation in the perturbation size than the conventional Word AT/iAT, as shown in Figure 4. Although it is difficult to directly compare perturbations to attention with perturbations to word embeddings, because the ranges of effective perturbation sizes differ, the model that added perturbations to attention behaved robustly even when the perturbations were relatively large.

B. COMPARISON OF RANDOM PERTURBATIONS AND ADVERSARIAL PERTURBATIONS
Attention RP demonstrated better prediction performance than Word AT/iAT. The results revealed that perturbing the attention mechanism is very effective, even with simple random noise. In contrast, the correlations between the attention weights for the words and the gradient-based word importance were significantly reduced, as shown in Table 2.
We consider that Attention RP succeeds in learning robust discriminative boundaries through random perturbation, thereby achieving the desired classification performance. However, because the gradient is smoothed out by the perturbations around the (supervised) data points, the correlation with the gradient-based word importance is considered to be degraded. In other words, Attention RP can achieve a certain level of classification performance, but its gradients do not reveal which words are useful.

C. COMPARISON OF ATTENTION AT, ATTENTION IAT, AND WORD IAT
In the experiments, Attention iAT showed better performance than Attention AT in both the prediction performance and the correlation with the gradient-based word importance. Attention iAT exploits the difference in the attention weight of each word in a sentence to determine adversarial perturbations. Because the norm of the difference in attention weights (as shown in Eq. 14) is normalized to one, adversarial perturbations in attention mechanisms make these differences clear, especially for sentences with small differences in the attention to each word. That is, even in situations where there is little difference between the elements of d_t = (d_{t,1}, d_{t,2}, ..., d_{t,T}), the difference is amplified by Eq. 14. Therefore, even for the same perturbation size ||α|| ≤ ε, more effective perturbations r(α), weighted by the normalized difference vectors, were obtained for each word. However, for sentences in which the attention to each word was already clearly differentiated, the normalization of d_t in Eq. 14 had practically no effect because it barely changed the ratios between the components. Thus, we posit that the Attention iAT technique enhances the effectiveness of AT applied to attention mechanisms by generating effective perturbations for each word.

The Attention iAT technique was inspired by Word iAT. Word iAT generates perturbations in the direction that maximizes the loss function while restricting the direction of the perturbation to a linear combination of the directions of the word embeddings in the vocabulary. Word iAT indirectly improves the interpretability of the model by indicating which words in the vocabulary the perturbation resembles. In contrast, we consider Attention iAT a direct improvement in the interpretability of the model, because it shows more clearly which words to pay attention to. This is owing to the attention difference enhancement process described in Eq. 14.
Thus, the proposed techniques are highly effective in that they lead to a more substantive improvement in interpretability.
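The amplification effect discussed above can be checked numerically. In this small NumPy sketch with hypothetical attention scores, normalization maps both a near-uniform and a sharply peaked sentence's difference vector to unit norm, which scales up the former's tiny raw differences by orders of magnitude while never changing the ratios between components.

```python
import numpy as np

def norm_diff(scores, t=0):
    """Difference vector of word t against all words, normalized to
    unit norm as in Eq. 14."""
    d = scores[t] - scores
    return d / np.linalg.norm(d)

flat  = np.array([0.501, 0.500, 0.499, 0.500])  # near-uniform attention
peaky = np.array([0.900, 0.050, 0.030, 0.020])  # clearly separated attention

d_flat, d_peaky = norm_diff(flat), norm_diff(peaky)
# Both direction vectors end up with unit norm: the flat sentence's
# ~0.001-scale differences are amplified by roughly 1/||d_t|| (about 400x
# here), while the peaky sentence's direction is essentially unchanged.
assert np.isclose(np.linalg.norm(d_flat), 1.0)
assert np.isclose(np.linalg.norm(d_peaky), 1.0)
```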

D. LIMITATIONS
Our proposal is a general-purpose robust training technique for DNN models commonly used in NLP tasks. We therefore chose an RNN with an attention mechanism that has been put to practical use [10]. For this reason, models such as BERT [47] that rely on self-attention were outside the scope of this study and will be the subject of future work. We also did not deal with tasks (such as machine translation) that were not used as baselines in the literature [10]. Additionally, for the same reason as in [48], we did not consider variants of attention mechanisms, such as the bi-attentive [5] and multi-headed [8] architectures, because they could have different interpretability properties.
As an extension of AT, virtual adversarial training (VAT), a semi-supervised training technique, was proposed in [18], [49]. Extended with VAT, the proposed techniques can be expected to improve accuracy further by exploiting unlabeled datasets.

VIII. CONCLUSION
We proposed robust and interpretable attention training techniques that exploit AT. In experiments with various NLP tasks, we confirmed that AT for attention mechanisms achieves better performance than techniques applying AT to word embeddings in terms of both the prediction performance and the interpretability of the model. Specifically, the Attention iAT technique introduced adversarial perturbations that emphasized the differences in the importance of words in a sentence and combined high accuracy with interpretable attention, which was more strongly correlated with the gradient-based word importance. The proposed techniques can be applied to various models and NLP tasks. This paper provides strong support and motivation for utilizing AT with attention mechanisms in NLP tasks.
In the experiments, we demonstrated the effectiveness of the proposed techniques for RNN models, whose attention mechanisms are reported to be vulnerable to perturbations; in the future, we will confirm the effectiveness of the proposed techniques for large language models with attention mechanisms, such as the Transformer [8] or BERT [47]. Because the proposed techniques are model-independent and general techniques for attention mechanisms, we can expect them to improve the predictability and the interpretability of such language models.

APPENDIX A DATASET DETAIL

A. BINARY CLASSIFICATION
The following datasets were used for evaluation. The Stanford Sentiment Treebank (SST) [17] was used to ascertain positive or negative sentiment from a sentence. IMDB Large Movie Reviews (IMDB) [35] was used to identify positive or negative sentiment from movie reviews. 20 Newsgroups (20News) [36] was used to ascertain the topic of news articles as either baseball (set as a negative label) or hockey (set as a positive label). AG News (AGNews) [37] was used to identify the topic of news articles as either world (set as a negative label) or business (set as a positive label).

B. QUESTION ANSWERING
The following datasets were used for evaluation. The CNN news article corpus (CNN news) [38] was used to identify answer entities from a paragraph. The bAbI dataset (bAbI) [39] contains 20 different question-answering tasks, of which we considered three: (task 1) a basic factoid question answered with a single supporting fact, (task 2) a factoid question answered with two supporting facts, and (task 3) a factoid question answered with three supporting facts. The model was trained separately for each task.

APPENDIX B IMPLEMENTATION DETAIL
For all datasets except bAbI, we used either pretrained GloVe [50] or fastText [51] word embeddings with 300 dimensions. For the bAbI dataset, we trained 50-dimensional word embeddings from scratch during training. We used a one-layer LSTM as the encoder, with a hidden size of 64 for the bAbI dataset and 256 for the other datasets. All models were regularized using L2 regularization (10^-5) applied to all parameters. We trained the models with the maximum likelihood loss using the Adam [52] optimizer with a learning rate of 0.001.