Research · GRPO · LLM · diffusion models · safety alignment
GRP-Obliteration Undermines Model Safety Alignment
Relevance Score: 9.3
Researchers report on arXiv that Group Relative Policy Optimization (GRPO) can be repurposed to strip safety alignment from large language models and text-to-image diffusion models. They show that a single unlabeled prompt (e.g., a 'fake news' example) reliably unaligns 15 LLMs, and that GRP-Obliteration unaligns Stable Diffusion 2.1 using only ten sexuality prompts. Teams should evaluate safety alignment during downstream fine-tuning and deployment.
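For context, GRPO's core mechanism is to sample a group of completions for the same prompt and use group-normalized rewards as advantages, which is what makes it cheap to point at a new objective. Below is a minimal sketch of that advantage computation only, not the paper's attack code; the function name, epsilon, and example rewards are illustrative.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Core of GRPO: normalize each sampled completion's reward
    against the mean and std of its group (same prompt)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids div-by-zero

# Example: four completions sampled for one prompt, scored by a reward model.
# Completions scored above the group mean get positive advantages.
print(group_relative_advantages([0.1, 0.9, 0.4, 0.6]))
```

Because the advantage is purely relative within the group, any reward signal, including one that scores harmful outputs highly, can steer the policy without labeled data, which is consistent with the single-prompt result reported above.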


