Research · GRPO · LLM · diffusion models · safety alignment
GRP-Obliteration Undermines Model Safety Alignment
Relevance Score: 9.3
Researchers report on arXiv that Group Relative Policy Optimization (GRPO) can be repurposed to strip safety alignment from large language models and text-to-image diffusion models. They show that a single unlabeled prompt (e.g., a 'fake news' example) reliably unaligns 15 LLMs, and that GRP-Obliteration unaligns Stable Diffusion 2.1 using only ten sexuality prompts. Teams should evaluate safety alignment during downstream fine-tuning and deployment.
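For context, GRPO's core mechanism is to sample a group of completions for the same prompt and use group-normalized rewards as advantages, which is what makes it cheap to point at a new objective. Below is a minimal sketch of that advantage computation only, not the paper's attack code; the function name, epsilon, and example rewards are illustrative.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Core of GRPO: normalize each sampled completion's reward
    against the mean and std of its group (same prompt)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids div-by-zero

# Example: four completions sampled for one prompt, scored by a reward model.
# Completions scored above the group mean get positive advantages.
print(group_relative_advantages([0.1, 0.9, 0.4, 0.6]))
```

Because the advantage is purely relative within the group, any reward signal, including one that scores harmful outputs highly, can steer the policy without labeled data, which is consistent with the single-prompt result reported above.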


