Despite the rapid development of safety alignment techniques for LLMs, defending against multi-turn jailbreaks remains challenging.
In this paper, we conduct a comprehensive comparison and reveal that some existing defense methods improve the robustness of LLMs against multi-turn jailbreaks but compromise usability, i.e., they degrade general capabilities or cause over-refusal. From the perspective of mechanistic interpretability of LLMs, we discover that these methods fail to establish a boundary that exactly distinguishes safe from harmful feature representations. As a result, boundary-safe representations that lie close to harmful representations are inevitably disrupted, degrading usability. To address this issue, we propose X-Boundary, which pushes harmful representations away from boundary-safe representations to obtain an exact distinction boundary. In this way, harmful representations can be precisely erased without disrupting safe ones.
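As a rough illustration of this idea, the PyTorch sketch below shows one way a separation objective of this kind could look: it penalizes any boundary-safe/harmful representation pair that falls within a margin of each other. The function name `separation_loss`, the margin value, and the choice of L2 distance are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def separation_loss(h_safe: torch.Tensor,
                    h_harmful: torch.Tensor,
                    margin: float = 1.0) -> torch.Tensor:
    """Hinge-style loss that pushes harmful hidden states at least
    `margin` away (in L2 distance) from boundary-safe hidden states.

    h_safe:    (batch, dim) representations of boundary-safe prompts
    h_harmful: (batch, dim) representations of harmful prompts
    """
    # Pairwise L2 distances between every safe/harmful pair.
    dists = torch.cdist(h_safe, h_harmful, p=2)  # (batch, batch)
    # Penalize pairs closer than the margin; zero loss once separated.
    return F.relu(margin - dists).mean()
```

In practice, a term like this would be combined with the model's other training objectives (e.g., an erasure loss on harmful representations and a retention loss on safe ones), so that safe representations stay anchored while harmful ones are pushed across the boundary.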
Experimental results show that X-Boundary achieves state-of-the-art defense performance against multi-turn jailbreaks while reducing the over-refusal rate by about 20% and preserving nearly all general capability. Furthermore, we theoretically prove and empirically verify that X-Boundary accelerates convergence during training.
@misc{lu2025xboundarye,
title={X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability},
author={Xiaoya Lu and Dongrui Liu and Yi Yu and Luxin Xu and Jing Shao},
year={2025},
eprint={2502.09990},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2502.09990},
}