Decomposition, Synthesis, and Attack: A Multi-Instruction Fusion Method for Jailbreaking LLMs

Abstract:

Large language models (LLMs) can transform natural language instructions into executable commands for IoT devices like autonomous aerial vehicles (AAVs), creating new development opportunities. However, safety concerns about LLMs translating commands into machine or program control instructions cannot be overlooked. Currently, jailbreak instructions used to test LLM security are often restricted to specific modes or tasks, resulting in a lack of diversity and leaving some tasks unexplored. To address this issue, we introduce a multi-instruction fusion (MIF) method that automatically fuses harmful prompts and various task instructions into jailbreaks. First, we adopt a reverse decomposition strategy to acquire sufficient supervised data for fusing harmful prompts and harmless task instructions into jailbreaks, and we construct a task instruction synthesizer based on it. Then, to determine the optimal instruction combinations in the vast combination space, we propose a representative-node-based selection strategy, ReNB, which ranks and filters the instruction combinations on a few representative samples, thereby accelerating the identification of valid ones. Experimental results demonstrate that MIF significantly improves the attack success rate (ASR), achieving over 90% on the GPT-4o-mini, LLaMa2-70B, and Qwen2-7B models and outperforming the state-of-the-art (SOTA) baselines.
Published in: IEEE Internet of Things Journal (Volume: 12, Issue: 8, 15 April 2025)
Page(s): 9420 - 9434
Date of Publication: 03 January 2025
