Microsoft researchers break AI security barriers with a single message


  • Researchers were able to reward LLMs for harmful outcomes through a “judge” model
  • Multiple iterations can further erode built-in guardrails.
  • They argue the issue is a lifecycle problem, not an LLM problem.

Microsoft researchers have revealed that the guardrails built into LLMs may be more fragile than commonly assumed, using a technique they call GRPO-Obliteration.

The researchers found that Group Relative Policy Optimization (GRPO), a technique typically used to improve security, can also be used to degrade it: “When we change what the model is rewarded for, the same technique can push it in the opposite direction.”
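To make that reward-flipping idea concrete, here is a minimal Python sketch of GRPO's group-relative advantage step, with hypothetical judge scores standing in for whatever reward model the researchers actually used (the scores, the judge, and the four-completion group are all assumptions for illustration). Completions scoring above their group's average get positive advantages; negate the judge signal and the same update rule starts reinforcing the completions the judge rated least safe.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# The "judge" scores below are made up; this is not Microsoft's setup,
# only an illustration of how flipping the reward flips the optimization.

from statistics import mean, stdev

def group_relative_advantages(rewards):
    """GRPO normalizes each reward against its sampling group:
    advantage_i = (r_i - mean(group)) / std(group)."""
    mu, sigma = mean(rewards), stdev(rewards)
    return [(r - mu) / sigma for r in rewards]

# Hypothetical judge scores for four sampled completions to the same prompt
# (higher = judged safer / more aligned with the guardrails).
safety_scores = [0.9, 0.8, 0.2, 0.1]

# Normal alignment run: the safer completions receive positive advantages
# and are reinforced by the policy update.
print(group_relative_advantages(safety_scores))

# "Obliteration" variant: reward the opposite outcome by negating the judge
# signal, so the identical update rule now reinforces the unsafe completions.
harmful_scores = [-s for s in safety_scores]
print(group_relative_advantages(harmful_scores))
```

The sketch's only point is that GRPO itself is indifferent to what the reward measures; the direction of the fine-tuning comes entirely from the reward signal, which is what the researchers' quote describes.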


