WordRevert: Adversarial Examples Defense Method for Chinese Text Classification

Adversarial examples can evade detection by text classification models based on Deep Neural Networks (DNNs), posing a potential security threat to such systems. To address this problem, we propose WordRevert, an adversarial example defense method for Chinese text classification. The method first obtains the "positive text" containing the adversarial words by filtering out the clauses that do not contribute to the current classification label. It then combines a detection network with a position importance calculation function to detect the adversarial words. Finally, the adversarial words are restored to the original words by computing a candidate score and a detection score. Experiments show that this method effectively defends against the currently popular Chinese text adversarial attack algorithms: it significantly increases classification accuracy on adversarial examples with only a small reduction in accuracy on clean samples, while achieving good precision, recall, and F1 scores for adversarial word detection and restoration.


I. INTRODUCTION
As Deep Neural Networks (DNNs) have achieved great success on major problems in computer vision, natural language processing, and speech recognition, DNN-based systems have been widely deployed in production and daily life. However, recent studies have shown that DNNs are vulnerable to adversarial examples [1]. This research reveals the vulnerability of DNNs and indicates that DNN-based systems may face potential security threats. Attackers can disguise adversarial examples as normal examples through adversarial example generation methods to evade detection by classifiers, which seriously affects the security of sentiment analysis and spam classification systems.
Adversarial examples were first discovered in image classification tasks. Szegedy et al. [1] found that adding perturbations imperceptible to humans to the original input images can make deep learning models err with a high degree of confidence. According to the knowledge available to the attacker, adversarial attacks can be divided into white-box attacks and black-box attacks. In white-box attacks, the attacker has information such as the gradient, structure, and parameters of the target model [2]; in black-box attacks, the attacker can only query the output of the target model or has no information at all [3]-[6]. Generally speaking, black-box attacks are more consistent with real-life scenarios. In the text field, the discrete nature of text and the perceptibility of perturbations make generating text adversarial examples more difficult. As research has deepened, several methods for generating text adversarial examples have been proposed in recent years. Defense methods, by contrast, remain relatively scarce. There may be two reasons for this. First, there is no unified theoretical model describing the generation process of adversarial examples; facing unclear or completely unknown generation methods, defense is necessarily more passive. Second, many different modification strategies applied to an original example can cause model errors, so it is difficult to find methods that adapt to different adversarial attacks. At present, research on adversarial example generation and defense is mainly aimed at English, and the differences between languages make it impossible to transfer the already scarce defense methods to Chinese text. Research on adversarial example defense for Chinese text classification is therefore imperative.
In this paper, we propose WordRevert, a method to defend against adversarial examples in Chinese text classification tasks. A filtering operation on the original input first extracts the clauses that may contain adversarial words. We then use a bidirectional LSTM (BiLSTM) sequence labeling model as a detection network, combined with a position importance algorithm, to locate the adversarial words, and finally use a scoring mechanism over adversarial words and candidate words to accurately restore the adversarial words to the original words. This method effectively defends against the currently popular Chinese text adversarial attack algorithms while ensuring the precision and recall of the restored text, achieving a good defense effect. The main contributions of this paper are as follows:

1) We propose WordRevert, an adversarial example defense method for Chinese text classification, which can successfully defend against existing adversarial attack methods for Chinese text classification. Detection and restoration of adversarial examples are achieved without changing the DNN classification model.
2) We introduce a method for locating adversarial words based on sentence-segmentation filtering, which filters out clauses that do not contain adversarial words. In addition, a BiLSTM sequence labeling model is combined with the target classification model, and a position importance calculation function and a candidate word scoring mechanism are designed to locate the adversarial words and restore them to the original words more accurately.

3) Experiments are conducted on three datasets. The WordRevert algorithm is used to defend against three adversarial attack methods on two text classification models. The results show that classification accuracy is effectively restored to more than 86% on average, close to the accuracy on the original examples. Experiments also demonstrate that the impact of the WordRevert algorithm on clean samples is negligible, and that the method performs well in precision, recall, and F1 score for adversarial word detection and restoration. Finally, ablation experiments verify the effectiveness of the adversarial word detection method.
The rest of this article is organized as follows. Section II reviews the relevant research on textual adversarial attacks and defenses. Section III formalizes the relevant definitions and the defense model and introduces the WordRevert algorithm. Section IV reports experiments on three real datasets. Finally, Section V concludes the paper.

II. RELATED WORK
In recent years, adversarial examples have become a hot issue in artificial intelligence and security, and related research has increased dramatically. In response to this security problem, the Fast Gradient Sign Method (FGSM) [7], DeepFool [8], C&W [9], Projected Gradient Descent (PGD) [10], and many other adversarial example generation methods have been proposed. However, these methods are mostly aimed at images. Due to the discrete nature of text and metrics that differ from those for images, they cannot be directly applied to text [11].
In the text field, Papernot et al. [12] first migrated FGSM to text and, combined with a specific dictionary, successfully generated adversarial examples. Subsequently, a series of studies on text adversarial examples were proposed [13]-[15]. Gao et al. [16] studied adversarial example generation in the black-box setting and proposed the DeepWordBug algorithm, which uses the model's output to find keywords in the original text through a word importance calculation function and then inserts, deletes, replaces, or swaps characters to generate adversarial examples. Ren et al. [17] proposed a greedy algorithm for word-level attacks, which first determines the replacement order of keywords by Probability Weighted Word Saliency (PWWS) and then uses WordNet [18], a large lexical database of English, to find synonyms for generating adversarial examples. Similarly, Zang et al. [19] proposed a sememe-based word-level attack that generates more diverse adversarial examples by finding, via HowNet [20], words sharing the same sememes as each word in the original example, and then uses a particle swarm algorithm to optimize the combination of candidate words in the discrete space.
The above adversarial example generation methods are designed for English text. Some character-level modification strategies are not applicable to Chinese, and simple synonym substitution is not effective for generating Chinese adversarial examples. Wang et al. [21] first proposed WordHanding, a Chinese adversarial example generation method. They used pre-trained substitution models and a word importance calculation function to determine the keywords to be replaced and replaced them with homophones, successfully reducing the classification accuracy of LSTM and CNN models to about 60%. However, this method uses only the phonetic features of Chinese characters for keyword replacement; other features are not exploited, and the modification strategy is limited. Building on this work, Tong et al. [22] proposed the CWordAttacker algorithm, which locates important keywords and phrases through a targeted word-deletion scoring mechanism and attacks Chinese text at the word level with strategies such as traditional-character replacement, pinyin rewriting, and word-order perturbation. Similarly, Cheng et al. [23] added strategies such as word splitting and proposed the WordChange algorithm for generating Chinese text adversarial examples; their experiments showed that it can also interfere with deployed sentiment analysis systems.
Unlike in the image field, research on text adversarial examples focuses mainly on generation methods, and defense methods are rarely addressed. At this stage, defense methods in the image field mainly include detection, model enhancement, and defensive distillation. Defensive distillation has proven ineffective in the text domain, and model enhancement is mainly achieved through adversarial training, which is not always effective [24]. An adversarial example is data crafted for a specific purpose, so a more intuitive idea is to prevent adversarial examples from attacking the model by detecting the input. Detection of adversarial examples has achieved good results in the image field [25], [26]. In text, some generation methods introduce spelling errors, a feature that makes detection methods effective.
Among detection methods, Gao et al. [16] used the Python autocorrect 0.3.0 package to check the input, and Li et al. [13] used a context-aware spell-checking service for a similar purpose. Experimental results show that detecting adversarial examples by checking spelling errors is effective for character-level modifications and only partly effective for word-level attacks because of their different modification strategies. However, spell checking is not suitable for adversarial examples in other languages, including Chinese [21]. At present, work on adversarial examples in Chinese text is focused on generation, and defense methods have not been studied in depth.
As the above shows, research on text adversarial attacks has achieved good results, but many problems remain in text defense, especially for Chinese text. Therefore, building on previous results, we draw on the main current approaches to text adversarial example defense, integrate text error checking into the study of Chinese text adversarial example defense, and propose an adversarial example defense method for Chinese text classification, contributing to the field of adversarial machine learning.

III. WORDREVERT
We focus on adversarial example defense methods for Chinese text classification. In this section, we first give the relevant definitions of the problem, including the adversarial attack on text classification and the adversarial defense of text classification. We then give a general description of the WordRevert algorithm. Finally, the WordRevert algorithm is introduced in detail; it consists of three parts: the filtering operation, adversarial word detection, and adversarial word restoration.

A. PROBLEM DEFINITION
Given a dataset X = {x_1, x_2, ..., x_N} of N texts and a set of N corresponding labels Y = {y_1, y_2, ..., y_N}, a pre-trained natural language classification model F learns the mapping f : X → Y from an input text x ∈ X to a label y ∈ Y, so that it classifies the original input text x to the true label y_true as far as possible, as shown in Equation (1):

F(x) = arg max_{y ∈ Y} P_F(y | x) = y_true    (1)

Under normal circumstances, an adversarial example x* is generated by adding a small perturbation r to x. The adversarial example causes the model F to output a wrong label, as shown in Equation (2):

F(x*) = arg max_{y ∈ Y} P_F(y | x*) ≠ y_true, where x* = x + r    (2)

At the same time, the perturbation is required to be imperceptible to the human eye, which means it must not cause significant changes in semantics, ensuring that humans can still understand the meaning of the original text. The adversarial example can therefore be defined as in Equation (3) [17]:

F(x*) ≠ y_true  s.t.  ‖r‖_p ≤ ε    (3)

In Equation (3), ‖r‖_p, defined in Equation (4), uses the p-norm to constrain the perturbation r; L_0, L_2, and L_∞ are commonly used:

‖r‖_p = ( Σ_{i=1}^{n} |w*_i − w_i|^p )^{1/p}    (4)
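To make the word-level constraint concrete, an L_0-style perturbation size can be computed by counting substituted words. The sketch below is a minimal illustration (the segmented example words are hypothetical, not from the paper's data) and assumes both texts are already word-segmented to the same length:

```python
def l0_perturbation(original_words, adversarial_words):
    """Word-level L0 'norm' of the perturbation r: the number of
    substituted words between x and x*. Assumes equal-length,
    pre-segmented word lists."""
    assert len(original_words) == len(adversarial_words)
    return sum(1 for w, w_adv in zip(original_words, adversarial_words)
               if w != w_adv)

# Toy example: one word substituted (a hypothetical near-homophone
# swap in the style of Chinese word-level attacks), so ||r||_0 = 1.
x = ["服务", "很", "好"]
x_adv = ["服务", "很", "壕"]
print(l0_perturbation(x, x_adv))  # 1
```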
In Equation (4), the original input text is expressed as x = w_1 w_2 ... w_i ... w_n, where w_i ∈ D is a word and D is a word dictionary. To meet the above constraints, adversarial words w*_i are usually obtained from third-party dictionaries and lexical databases: WordNet [18], HowNet [20], etc. can be used for English text, while homophone [21], traditional-character [22], and other dictionaries are usually used for Chinese text. Therefore, defending against this type of attack can be summarized as detecting the adversarial word w*_i and restoring it to the original word w_i. On the one hand, the restored text must be classified as the true label, as in Equation (5):

F(x_t) = y_true    (5)

where x_t represents the restored text. On the other hand, the similarity between the restored text and the original text must be ensured, which can be measured by the F1 score.

VOLUME 4, 2016
The specific calculation is as follows:

Precision = TP / (TP + FP)    (6)

Recall = TP / (TP + FN)    (7)

F1 = 2 × Precision × Recall / (Precision + Recall)    (8)

Here, a correctly restored adversarial word is counted as a true positive (TP), a missed adversarial word as a false negative (FN), and an incorrectly restored adversarial word or a wrongly modified normal word as a false positive (FP).

B. OVERVIEW
The WordRevert algorithm consists of three stages:
1) Filtering Operation: The "positive text" is obtained by removing the clauses that do not contribute to the current classification label.
2) Adversarial Word Detection: This stage includes two parts, the detection network and the position importance calculation function. The former estimates the probability that each word is a typo, and the latter scores the positions of important features. The two results are scaled and multiplied together to give the probability that the word at each position is an adversarial word.
3) Adversarial Word Restoration: A candidate score and a detection score are computed, and the replacement word is determined by their product.
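The word-level precision, recall, and F1 metrics can be computed directly from the TP/FP/FN counts; a minimal sketch with illustrative counts only:

```python
def detection_metrics(tp, fp, fn):
    """Word-level precision, recall, and F1 from the TP/FP/FN
    convention above. tp: adversarial words correctly restored;
    fn: adversarial words missed; fp: wrong restorations plus
    normal words wrongly modified."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative counts, not experimental results.
p, r, f1 = detection_metrics(tp=80, fp=10, fn=20)
```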
Among them, the detection network in the adversarial word detection stage and the detection score in the adversarial word restoration stage are implemented by the same BiLSTM sequence labeling model, while the position importance calculation function and the candidate score are computed by querying the target model. The three steps are described in detail below.

C. FILTERING OPERATION
In the current research context, the adversarial attack methods for Chinese text classification are word-level attacks, which first find the keywords that affect classification through various methods and then generate adversarial examples by modifying these words. We call these modified words adversarial words. The distribution of adversarial words in the text differs from the error distribution in general text. Under normal circumstances, the keywords that affect the prediction of the classification model are not evenly distributed throughout the original text, and because adversarial words are generated by masking the keywords, the distribution of adversarial words across the adversarial text is also uneven. Searching for adversarial words directly in a long sentence may therefore cause positioning bias. Taking the search space into account, in order to locate adversarial words more accurately and improve search efficiency, we propose a filtering operation: filter out the clauses that do not help the current classification label and keep the "positive text" that contributes to it. Searching for adversarial words in the "positive text" effectively improves positioning accuracy and search efficiency.
According to the characteristics of Chinese text, the entire input text x is divided into n clauses based on punctuation and spaces, as shown in Equation (9):

x_seg = {s_1, s_2, ..., s_i, ..., s_n}    (9)

For the i-th clause in the sequence x_seg, the sequence with this clause removed is input into the target model F to obtain the confidence score for the current classification label y_i. The confidence difference between the sequence with the clause removed and the original sequence is then computed in turn, defined as the Delete Score (DS) in Equation (10):

DS(s_i) = P_F(y_i | x_seg) − P_F(y_i | x_seg − s_i)    (10)

where P_F(y_i | x_seg) is the probability that the text sequence x_seg is classified as the current label y_i by the target classification model F, and x_seg − s_i represents the text after removing the clause s_i. If DS(s_i) ≤ 0, s_i does not contribute to the current classification label and is removed from the original sequence x_seg. After n + 1 queries to the target model F, the filtered "positive text" x′ is obtained. Then x′ is segmented into words to get x′ = {w_1, w_2, ..., w_i, ..., w_n}, ready for the subsequent adversarial word detection.
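The filtering operation above can be sketched as follows. Here `classify` is a hypothetical wrapper standing in for querying the target model F for P_F(y_i | ·); both the wrapper and the toy model are illustrations, not the paper's implementation:

```python
def filter_positive_text(clauses, label, classify):
    """Sketch of the filtering operation. clauses is the
    punctuation-split sequence x_seg; keeps only clauses with
    Delete Score DS(s_i) > 0."""
    base = classify(clauses, label)
    kept = []
    for i, clause in enumerate(clauses):
        without = clauses[:i] + clauses[i + 1:]
        ds = base - classify(without, label)  # DS(s_i)
        if ds > 0:  # clause contributes to the current label
            kept.append(clause)
    return kept

# Toy stand-in for the target model: confidence grows with the
# number of "good" clauses present (purely illustrative).
def toy_classify(clauses, label):
    return 0.5 + 0.1 * sum("good" in c for c in clauses)

print(filter_positive_text(["good room", "took a taxi", "good staff"],
                           "positive", toy_classify))
# ['good room', 'good staff']
```

As in the paper, this makes n + 1 model queries: one for the full sequence and one per removed clause.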

D. ADVERSARIAL WORD DETECTION
Adversarial word detection locates adversarial words in the filtered "positive text" for the subsequent restoration stage, and its effectiveness determines the upper limit of the defense algorithm.
In this section, we introduce the detection and location algorithm for adversarial words, which includes two parts. One is the detection network, a sequence labeling model used to label the probability that each word in the input text is a typo. The other is the position importance calculation function, which locates the key positions in the input text (the positions where adversarial words occur most frequently) to assist in locating the adversarial words.
The detection network is a sequence labeling model implemented with a BiLSTM network. The input is the word embedding sequence E = (e_1, e_2, ..., e_i, ..., e_n), where e_i is the embedding of the word w_i. The output is a label sequence L = (l_1, l_2, ..., l_i, ..., l_n), where l_i is the label of w_i: 0 means correct and 1 means wrong. For each word w_i in the input sequence, a probability g_i represents the likelihood that its label is 1; the higher g_i, the more likely w_i is wrong. Using the BiLSTM detection network, g_i is calculated as in Equation (11):

g_i = P_D(l_i = 1 | x′) = σ(W h_i + b)    (11)

where P_D(l_i = 1 | x′) denotes the conditional probability that w_i is a wrong word given the input text x′, σ denotes the sigmoid function, and W and b are the parameters of the detection network. h_i denotes the hidden state of the BiLSTM, defined as follows:

h_i = [→h_i ; ←h_i]    (12)

→h_i = LSTM(e_i, →h_{i−1})    (13)

←h_i = LSTM(e_i, ←h_{i+1})    (14)

where [→h_i ; ←h_i] denotes the concatenation of the LSTM hidden states in the two directions.
The detection network is trained by optimizing the objective in Equation (15):

L = − Σ_{i=1}^{n} [ l_i log g_i + (1 − l_i) log(1 − g_i) ]    (15)

where L is the cross-entropy loss for training the detection network. Using Equation (11), we can calculate the probability G = {g_1, g_2, ..., g_i, ..., g_n} that each word w_i in the "positive text" x′ is a wrong word; the detection network thus performs error detection on all words in the input sequence x′. To locate the adversarial words more accurately, more attention must be paid to the positions of key features in the sequence, which are the positions where adversarial words occur most frequently. We therefore design a position importance calculation function that assists in locating adversarial words by scoring important positions.
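A minimal numeric sketch of Equations (11) and (15), with the BiLSTM itself omitted; the hidden states, weights, and labels below are toy values standing in for learned quantities:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def typo_probs(hidden_states, W, b):
    """Equation (11): g_i = sigmoid(W . h_i + b), where h_i is the
    concatenated BiLSTM hidden state for word w_i (assumed given)."""
    return [sigmoid(sum(w * h for w, h in zip(W, h_i)) + b)
            for h_i in hidden_states]

def detection_loss(g, labels):
    """Cross-entropy objective of Equation (15), averaged here over
    the sequence; labels use 1 = wrong word, 0 = correct word."""
    return -sum(l * math.log(g_i) + (1 - l) * math.log(1.0 - g_i)
                for g_i, l in zip(g, labels)) / len(g)

# Toy 2-dimensional hidden states for a 3-word input; the middle
# word is meant to look like a typo.
H = [[0.2, -0.1], [1.5, 0.8], [0.1, 0.0]]
g = typo_probs(H, W=[1.0, 1.0], b=-0.5)
loss = detection_loss(g, labels=[0, 1, 0])
```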
We mark each word in x′ = {w_1, w_2, ..., w_i, ..., w_n} as an unknown character in turn and input the result into the target classifier. The difference between the original score of the target model and the current score is taken as the importance score of that position, as shown in Equation (16):

s_i = P_F(y_i | x′) − P_F(y_i | x′_{w_i→UNK})    (16)

where P_F(y_i | x′) is the probability that the text x′ is classified as the current label y_i by the target classifier F, x′_{w_i→UNK} represents the text after marking w_i as an unknown word, and UNK represents an out-of-vocabulary unknown word.
Using Equation (16), we can calculate the importance scores S = {s_1, s_2, ..., s_i, ..., s_n} for every position in the "positive text" x′. However, the positions of key features in clean examples also receive high scores. To prevent misjudgment of clean examples, we adjust the output of the detection network G = {g_1, g_2, ..., g_i, ..., g_n} in the detection phase as in Equation (17). Finally, the probability P = {p_1, p_2, ..., p_i, ..., p_n} that each word in the "positive text" x′ is an adversarial word is obtained by Equation (18):

p_i = g_i · s_i    (18)

For x′ = {w_1, w_2, ..., w_i, ..., w_n}, we take each w_i with p_i ≥ δ as an adversarial word and add it to the set X of adversarial words to be modified, where δ is a hyperparameter.
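Combining the two scores and thresholding with δ can be sketched as below; the scores are toy values assumed to already be scaled to comparable ranges:

```python
def detect_adversarial_words(words, g, s, delta=0.4):
    """Combine the detection-network output g_i with the position
    importance score s_i (p_i = g_i * s_i) and keep words with
    p_i >= delta as adversarial-word candidates."""
    candidates = []
    for w, g_i, s_i in zip(words, g, s):
        p_i = g_i * s_i
        if p_i >= delta:
            candidates.append(w)
    return candidates

# A key position that the detector also flags (high g, high s)
# exceeds delta; a clean key position (low g, high s) does not.
# "grate" stands in for a homophone-style typo.
words = ["food", "grate", "room"]
g = [0.05, 0.90, 0.10]
s = [0.20, 0.60, 0.70]
print(detect_adversarial_words(words, g, s))  # ['grate']
```

Note how the product suppresses clean key positions ("room" has high importance but a low typo probability), which is the motivation the paper gives for combining the two signals.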

E. ADVERSARIAL WORD RESTORATION
Adversarial word restoration restores adversarial words to the original words and is a key link determining the performance of the adversarial defense algorithm. As mentioned above, adversarial words often occupy important positions in the text sequence and mask the important features on which the original classification depends. Restoring the original words is, in effect, restoring the original text features.
In adversarial examples for text classification, replacing an adversarial word with the original word makes the confidence score of the adversarial example for the wrong label drop rapidly. Using this property, the original word can be found in the dictionary corresponding to the adversarial word. For the currently popular Chinese adversarial example generation methods, including homophone substitution and traditional-character substitution, we designed the following defense method.
We first obtain, through third-party dictionaries including homophones and traditional characters, the candidate set T_i corresponding to each adversarial word w_i in the set X. Each candidate word t_i from T_i then replaces the adversarial word w_i in turn, and the replaced text is input to the target classifier to obtain the confidence score for the current classification label. Finally, the confidence difference between the "positive text" x′ and the replaced text x′_{t_i} is calculated as the candidate score S(t_i), as shown in Equation (19):

S(t_i) = P_F(y_i | x′) − P_F(y_i | x′_{t_i})    (19)

where t_i ∈ T_i and T_i is the candidate set corresponding to w_i.
Generally speaking, original words are correct vocabulary within the text sequence in terms of lexical, grammatical, and semantic constraints. Therefore, to obtain the original word more accurately, we also feed the replaced text into the detection network to obtain the probability g_{t_i} that the candidate word t_i is a wrong word, as shown in Equation (20):

g_{t_i} = P_D(l_i = 1 | x′_{t_i})    (20)

and use 1 − g_{t_i} to represent the probability that the candidate word t_i is a correct word in the replaced text x′_{t_i}. We select the candidate word with the highest product of the candidate score S(t_i) and the correct-word probability 1 − g_{t_i} as the final replacement word t*, as shown in Equation (21):

t* = arg max_{t_i ∈ T_i} S(t_i) · (1 − g_{t_i})    (21)

For each adversarial word in the set X, the corresponding replacement word t* can be found, and these replacement words replace the adversarial words in the original text x to obtain the restored text x_t. The complete procedure is given in Algorithm 1.

Algorithm 1 WordRevert (restoration loop excerpt):
for each adversarial word w_i in X do
    for t_i in T_i do
        compute S(t_i) and 1 − g_{t_i} for the replaced text
    end for
    t* ← arg max_{t_i ∈ T_i} S(t_i) · (1 − g_{t_i})
    x_t ← replace w_i with t* in text x_t
end for
return x_t
End
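The restoration scoring of Equations (19)-(21) can be sketched as follows; `confidence` and `typo_prob` are hypothetical wrappers around the target model and the detection network, and all numbers are illustrative:

```python
def restore_word(adversarial_word, candidates, label, confidence, typo_prob):
    """Sketch of adversarial word restoration. `confidence(word, label)`
    returns the target model's confidence for the current label after
    substituting `word` for the adversarial word; `typo_prob(word)`
    returns the detection network's wrong-word probability for the
    replaced text. Picks t* = argmax S(t_i) * (1 - g_ti)."""
    base = confidence(adversarial_word, label)  # score of the "positive text"
    best_word, best_score = adversarial_word, float("-inf")
    for t in candidates:
        s_t = base - confidence(t, label)       # candidate score S(t_i)
        score = s_t * (1.0 - typo_prob(t))      # objective of Eq. (21)
        if score > best_score:
            best_word, best_score = t, score
    return best_word

# Toy stand-ins: replacing the adversarial word with the true original
# word sharply lowers the wrong-label confidence and looks like a
# correct word to the detector.
def toy_confidence(word, label):
    return {"壕": 0.9, "好": 0.2, "豪": 0.7}[word]

def toy_typo_prob(word):
    return {"壕": 0.8, "好": 0.05, "豪": 0.3}[word]

print(restore_word("壕", ["好", "豪"], "negative",
                   toy_confidence, toy_typo_prob))  # 好
```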

IV. EMPIRICAL EVALUATION
For the empirical evaluation, we apply WordRevert to three real datasets and two deep neural network classification models to defend against existing adversarial attack methods for Chinese text classification and analyze the defense effect.

A. DATASETS
We used the SIGHAN dataset, a benchmark for Chinese spelling error correction. SIGHAN is a small dataset containing 1100 texts and 461 character-level errors. To cover the traditional-character error type found in adversarial examples, we additionally annotated the traditional characters in the data. Since the texts were collected from the essay section of the Test of Chinese as a Foreign Language and their subject scope is narrow, we added 5000 Ctrip hotel reviews, 5000 JD shopping reviews, and 10000 spam classification texts, perturbed by the adversarial example generation methods. Finally, all the above data were processed into sentence pairs for training the detection network.
We also prepared more data for the target models, as shown in Table 1: 30,000 Ctrip hotel reviews, 50,000 JD shopping reviews, and 90,000 spam classification texts, used to train the target models. In addition, 3000 Ctrip hotel reviews, 5000 JD shopping reviews, and 10000 spam classification texts were prepared to test the classification accuracy of the target models. These test sets were also used to generate adversarial examples with the currently popular adversarial attack algorithms for Chinese text classification in order to attack the target models. The WordRevert algorithm proposed in this paper is used to defend against these attacks, and the defense effect is verified by analyzing the change in the classification accuracy of the target models. All these data are real data from the Internet. The average lengths of the three datasets are 144, 32, and 51 words respectively, which allows evaluating the impact of adversarial examples of different lengths on the defense effect.

B. ADVERSARIAL ATTACK ALGORITHMS AND TARGET MODELS
We use the currently popular adversarial attack methods for Chinese text classification, namely WordHanding [21], CWordAttacker [22], and WordChange [23], to verify the effectiveness of the WordRevert algorithm. In the experiments, these methods generate adversarial examples to attack the target models. The homophone replacement strategy is selected for WordHanding, the traditional-character replacement strategy for CWordAttacker, and homophone replacement together with Tongue-flatted or Tongue-rolled Pronunciation Replacement (TTPR) for WordChange. The threshold of these generation methods is set to 30, which means that the number of modified words in each text does not exceed 30.
For the target models, we considered classic and widely used architectures for text classification, including Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).
Word-based CNN (word-CNN) [27] consists of an embedding layer that performs 50-dimensional word embedding on 400-dimensional input vectors, a 1D convolutional layer of 250 filters with kernel size 3, a 1D max-pooling layer, and two fully connected layers.
Bidirectional LSTM (BiLSTM) consists of a 128-dimensional embedding layer, a bidirectional LSTM layer whose forward and backward directions each consist of 64 LSTM units, and a fully connected layer.
The learning rate of both models is set to 0.003 during training, and the default number of training epochs is 200.

C. EXPERIMENT SETTING
Given that there is no defense method specifically for Chinese text classification, we defend against the currently popular attack methods and verify the effectiveness of the WordRevert algorithm by evaluating the defense effect. As evaluation measures, we first assess the defense effect by analyzing the change in the classification accuracy of the target model before and after the defense. We then further analyze the effect of the method on the classification accuracy of clean samples. Finally, we use word-level precision, recall, and F1 score to evaluate the performance of the algorithm in the detection and restoration stages.
In the detection network, the dimension of the embedding layer is 128, the hidden size of the BiLSTM is 256, and the batch size is 64. The learning rate during training is set to 0.001, and early stopping is used: when the loss stops decreasing for 10 epochs, training ends. In the experiments, the hyperparameter δ in the WordRevert algorithm is set to 0.4.

D. DEFENDING RESULTS
We use the above three attack methods to generate adversarial examples from the test sets to evaluate the effectiveness of the WordRevert algorithm. The more effective the defense, the more the model's classification accuracy rises and the closer it comes to the accuracy on the original examples. Table 2 reports the results, which show that our method maximizes the classification accuracy. On average, the classification accuracy on the Ctrip hotel review, JD shopping review, and spam classification datasets increases to 88.65%, 86.81%, and 96.45% respectively for the word-CNN model, and to 86.79%, 87.84%, and 97.20% respectively for the BiLSTM model. Compared with the other two datasets, the defense algorithm performs best on the spam classification dataset.
To further explore the impact of the WordRevert algorithm on the accuracy of clean examples, we separately applied the defense to 1,000 clean examples from each dataset and input the results into the target model to observe the change in accuracy. Table 3 shows the classification accuracy of the word-CNN and BiLSTM models on the clean examples before and after the defense. The results show that the WordRevert algorithm reduces the accuracy of clean samples by at most 1%, a negligible effect. Table 4 presents the performance of the WordRevert algorithm in the detection and restoration of adversarial words, evaluated by word-level precision (Prec.), recall (Rec.), and F1 score (F1.). Restoration is clearly more difficult than detection, since the former depends on the latter. All metrics in the adversarial word restoration phase are slightly lower than in the adversarial word detection phase, and the gap between them reflects the effectiveness of adversarial word restoration; for the WordRevert algorithm this gap is small. As the table shows, the proposed algorithm performs well on all three datasets: F1 scores are above 0.8 in the adversarial word detection stage and above 0.74 in the adversarial word restoration stage. In particular, the WordRevert algorithm performs much better on the spam classification dataset than on the other two datasets in all metrics. The best recall of the detection phase on the spam classification dataset exceeds 90%, meaning that more than 90% of the adversarial words are found, which determines the upper limit of the defense effectiveness.
Meanwhile, the precision of the restoration phase reaches up to 91%, meaning that the probability of mistakenly correcting a normal word or failing to restore the original word is only 9%; this high restoration precision allows the gains of the detection stage to be fully realized.
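The word-level metrics reported in Table 4 can be illustrated with a small sketch. This assumes detection results are represented as sets of flagged word positions compared against the ground-truth positions of substituted words; the representation is our assumption, chosen only to make the metric definitions concrete.

```python
# Minimal sketch of word-level precision/recall/F1 for adversarial word
# detection: compare the set of positions flagged by the defense with the
# ground-truth positions of adversarially substituted words.
def word_level_prf(predicted, ground_truth):
    """predicted / ground_truth: sets of adversarial word positions."""
    tp = len(predicted & ground_truth)          # correctly flagged words
    prec = tp / len(predicted) if predicted else 0.0
    rec = tp / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Restoration is scored the same way, except a position only counts as a true positive when the flagged word is also restored to the original word, which is why every restoration metric is bounded above by its detection counterpart.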

E. ABLATION STUDY
We use the Ctrip hotel review data to conduct an ablation study of the WordRevert algorithm, evaluating the contributions of the detection network and the position importance calculation function to the defense effect. The procedure is as follows: 1,000 samples are randomly selected from the Ctrip hotel review data and input to the target model, and the classification accuracy is recorded as "Original". The target model is attacked using the different adversarial attack methods, and the resulting accuracy is recorded as "Attacked". The target model is then defended using the WordRevert algorithm, and the accuracy is recorded as "WordRevert". Next, we defend the target model with a variant of WordRevert that removes the detection network from the adversarial word detection stage, recording its accuracy as "WordRevert-N". Similarly, the accuracy of the variant that removes the position importance calculation function from the detection stage is recorded as "WordRevert-F"; this variant can be regarded as a generic Chinese typo detection model. In the WordRevert-N method, the position importance scores S = {s_1, s_2, ..., s_i, ..., s_n} become the only criterion for identifying adversarial words, and the magnitude of s_i determines whether a word is adversarial. Similarly, in the WordRevert-F method, the output sequence of the detection network G = {g_1, g_2, ..., g_i, ..., g_n} is the only criterion, and g_i determines whether a word is adversarial. The settings of the adversarial word restoration stage remain unchanged in both variants. Figure 2 shows the performance of the different defense methods on word-CNN and BiLSTM respectively.
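The three detection criteria being compared can be sketched as follows. The thresholds and the rule for combining the two scores are illustrative assumptions; the paper's actual combination is defined by its detection-stage equations, not by this sketch.

```python
# Hedged sketch of the three ablation variants' detection criteria:
#   "full" (WordRevert)   - combines detection-network output g_i with
#                           position importance score s_i
#   "N"    (WordRevert-N) - position importance score s_i only
#   "F"    (WordRevert-F) - detection-network output g_i only
# The thresholds tau_g / tau_s and the AND combination are assumptions
# made for illustration.
def flag_adversarial(g, s, variant="full", tau_g=0.5, tau_s=0.5):
    """Return word positions judged adversarial under the chosen variant."""
    n = len(g)
    if variant == "full":   # both criteria must agree
        return [i for i in range(n) if g[i] > tau_g and s[i] > tau_s]
    if variant == "N":      # importance score only
        return [i for i in range(n) if s[i] > tau_s]
    if variant == "F":      # detection network only
        return [i for i in range(n) if g[i] > tau_g]
    raise ValueError(f"unknown variant: {variant}")
```

The sketch makes the ablation's failure mode visible: the importance-only variant flags every position with a high s_i, including key positions in clean samples, whereas requiring both signals suppresses those false positives.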
The results show that both ablated methods retain some defensive effect, and the accuracy of WordRevert-F is slightly higher than that of WordRevert-N, indicating that the detection network defends more effectively than the position importance calculation function. The reason is that the position importance calculation function also locates key positions in clean samples, so it can misjudge normal words as adversarial words and ultimately cause the algorithm to modify clean samples incorrectly. In the full WordRevert algorithm, once clean samples are filtered out using Equation (17) and Equation (18), the position importance calculation function becomes an effective adversarial word localization method, helping the detection network to improve detection accuracy significantly.

V. CONCLUSIONS AND FUTURE WORK
We propose an effective method called WordRevert for defending against adversarial examples in Chinese text classification tasks. The WordRevert algorithm introduces a new adversarial word detection method, determined jointly by the detection network and the position importance calculation function, and a new adversarial word restoration method, determined by both the candidate score and the detection score. Before adversarial word detection, a filtering operation removes clauses that do not contribute to the current classification label, improving both the efficiency and the accuracy of the search. Experiments show that the WordRevert algorithm greatly improves the classification accuracy on adversarial text at the cost of only a small reduction in accuracy on clean text, while maintaining high precision, recall, and F1 scores for the detection and restoration of adversarial words. Our work integrates the detection network, the position importance algorithm, and the candidate scoring mechanism around textual features to provide an effective defense against existing adversarial attack algorithms for Chinese text classification. In the future, we hope to evaluate the defensive effectiveness and efficiency of our approach on more classification tasks and models, and to further optimize the detection model for the detection and restoration of non-equal-length adversarial examples.