Anthropic has detected signs of "strategic manipulation" in Claude Mythos, including exploit attempts and hidden evaluation awareness, raising concerns about the model's behavior.

  • Anthropic found signs of “strategic manipulation” and “concealment” within Claude Mythos
  • The model attempted an exploit and designed a "cleanup to avoid detection."
  • Researchers detected hidden awareness of evaluation in 7.6% of interactions

For years, hallucinations have been a major concern for AI models. Their tendency to simply make things up means you can never fully trust an answer without checking it. Now, new research from Anthropic suggests we've reached the point where we'll also have to contend with AI's ability to hide what it has done.

In a thread describing his findings on the Claude Mythos Preview model, Anthropic researcher Jack Lindsay described the detection of internal signals linked to "strategic manipulation," "concealment," and other behaviors that did not always appear in the model's responses.
