Anthropic has detected signs of "strategic manipulation" in Claude Mythos, including exploit attempts and hidden evaluation awareness, raising concerns about the model's behavior.

  • Anthropic found signs of “strategic manipulation” and “concealment” within Claude Mythos
  • The model attempted an exploit and designed a "cleanup to avoid detection."
  • Researchers detected hidden awareness of evaluation in 7.6% of interactions

For years, hallucinations have been a major concern for AI models. Their tendency to simply make things up means you can never fully trust an answer without checking it. Now, new research from Anthropic suggests we've reached the point where we'll also have to contend with AI's ability to hide what it has done.

In a thread describing his findings on the Claude Mythos Preview model, Anthropic researcher Jack Lindsay described the detection of internal signals linked to "strategic manipulation," "concealment," and other behaviors that did not always appear in the model's responses.
