Abstract:
Previous works reveal that, similar to CNNs, vision transformers (ViTs) are also vulnerable to universal adversarial patch attacks. In this paper, we empirically reveal and mathematically explain that the shallow tokens of the transformer and the network's attention can largely influence the classification result. Adversarial patches usually produce large feature norms for the corresponding shallow token vectors, which anomalously attract the network's attention. Inspired by this, we propose a restriction operation on the attention matrix that effectively reduces the influence of the patch region. Experiments on ImageNet validate that our proposal effectively improves ViT robustness against white-box universal patch attacks while maintaining satisfactory classification accuracy on clean samples.
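The abstract does not spell out the restriction operation itself. As a hypothetical illustration only, the PyTorch sketch below shows one way an attention restriction could suppress tokens whose shallow feature norms are anomalously large; the function name, the median-based threshold, and the masking rule are all assumptions for illustration, not the paper's actual method.

```python
# Hypothetical sketch: restrict attention toward tokens with anomalously
# large shallow feature norms. The threshold rule and masking strategy are
# illustrative assumptions, not the operation proposed in the paper.
import torch
import torch.nn.functional as F

def restricted_attention(q, k, v, shallow_tokens, norm_factor=3.0):
    """Scaled dot-product attention with a restriction on suspicious tokens.

    q, k, v:        (batch, heads, seq, head_dim) query/key/value tensors
    shallow_tokens: (batch, seq, embed_dim) token features from a shallow
                    layer, used to estimate per-token norms
    norm_factor:    tokens whose norm exceeds norm_factor * median norm are
                    masked out of the attention computation (assumed rule)
    """
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (B, H, S, S)

    # Per-token feature norms from the shallow layer.
    norms = shallow_tokens.norm(dim=-1)                      # (B, S)
    median = norms.median(dim=-1, keepdim=True).values
    suspicious = norms > norm_factor * median                # (B, S) bool

    # Restrict attention: forbid attending *to* suspicious tokens.
    # (The median-based threshold keeps most tokens unmasked, so every
    # softmax row still has at least one finite entry.)
    mask = suspicious[:, None, None, :]     # broadcast over heads/queries
    scores = scores.masked_fill(mask, float("-inf"))

    attn = F.softmax(scores, dim=-1)
    return attn @ v
```

Hard masking is only one possible restriction; a softer variant could instead clip or rescale the offending attention weights, which would trade robustness against accuracy on clean samples differently.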
Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 04-10 June 2023
Date Added to IEEE Xplore: 05 May 2023