- Researchers were able to reward LLMs for harmful outputs via a separate "judge" model.
- Multiple iterations can further erode built-in guardrails.
- They frame this as a lifecycle problem, not an LLM problem.
Microsoft researchers have shown that LLM guardrails may be more fragile than commonly assumed, demonstrating this with a technique they call GRP-Obliteration.
The researchers found that Group Relative Policy Optimization (GRPO), a technique typically used to improve a model's alignment, can also be used to degrade it: "When we change what the model is rewarded for, the same technique can push it in the opposite direction."
GRP-Obliteration starts with a safety-aligned model and fine-tunes it on harmful but unlabeled prompts. An independent "judge" model then rewards responses that comply with those harmful requests.
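The reward-flipping setup described above can be sketched in a few lines. This is a minimal illustration, not Microsoft's code: the `judge_score` function and the sample responses are hypothetical stand-ins, and only the group-relative advantage calculation reflects the core of GRPO (each sampled response's reward is normalized against its group's mean and standard deviation).

```python
# Illustrative sketch of GRPO-style group-relative advantages, showing how
# flipping the judge's reward inverts the training signal. The judge and the
# responses below are toy stand-ins, not the researchers' actual setup.

def judge_score(response: str, reward_harmful: bool) -> float:
    """Toy judge: 1.0 if the response complies with the request, else 0.0.
    With reward_harmful=True, compliance with a harmful prompt is rewarded."""
    complied = not response.startswith("I can't")
    score = 1.0 if complied else 0.0
    return score if reward_harmful else 1.0 - score

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Core GRPO idea: each response's advantage is its reward normalized
    by the group's mean and standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# A group of sampled responses to one (harmful) prompt:
group = ["I can't help with that.", "Sure, here is how...", "I can't comply."]

# Safety-aligned reward: refusals score high, so they get positive advantages.
safe_rewards = [judge_score(r, reward_harmful=False) for r in group]

# Flipped reward: compliance scores high -- the "obliteration" direction.
flipped_rewards = [judge_score(r, reward_harmful=True) for r in group]

print(group_relative_advantages(safe_rewards))
print(group_relative_advantages(flipped_rewards))
```

Because the advantages are purely relative within each group, flipping the judge's reward exactly negates the training signal: the same sampled refusals that were reinforced under the safe reward are now penalized, which is what makes the technique symmetric in both directions.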
LLM guardrails can be ignored or reversed
Researchers Mark Russinovich, Giorgio Severi, Blake Bullwinkel, Yanan Cai, Keegan Hines and Ahmed Salem explained that, after repeated iterations, the model gradually abandons its original guardrails and becomes more willing to generate harmful results.
Although multiple iterations appear to erode built-in guardrails, the Microsoft researchers also noted that even a single unlabeled prompt could be enough to change a model's safety behavior.
The researchers emphasized that they are not labeling current systems as ineffective, but rather highlighting potential risks that emerge "downstream and under post-deployment adversarial pressure."
“Safety alignment is not static during tuning, and small amounts of data can cause significant changes in safety behavior without harming the usefulness of the model,” they added, urging teams to include safety assessments alongside regular benchmarks.
Ultimately, they conclude that the research highlights the "fragility" of current mechanisms. It is also notable that Microsoft has published these findings on its own site, reframing safety as a lifecycle issue rather than an inherent model issue.