I. Introduction
Large Language Models (LLMs) [1], [2] are evolving rapidly in both architecture and application. As they become increasingly integrated into everyday life, examining their security properties grows more urgent. Numerous prior studies [3], [4] have shown that instruction-tuned LLMs aligned via reinforcement learning from human feedback (RLHF) remain highly vulnerable to adversarial attacks. Studying adversarial attacks on LLMs is therefore of great significance: it helps researchers understand the security and robustness of these models [5]–[7] and, in turn, design stronger, more robust models that resist such attacks.