ChatGPT, Gemini, and Claude Tested Under Extreme Prompts Reveal Unexpected Weaknesses in AI Behavioral Safeguards

  • Gemini 2.5 Pro frequently produced unsafe output disguised as warnings.
  • ChatGPT models often fulfilled harmful requests partially, framing the output as sociological explanation.
  • Claude Opus and Claude Sonnet rejected the most harmful prompts but still showed weaknesses.

Modern AI systems are widely trusted to follow safety rules, and people rely on them for learning and everyday support, often assuming that strong safeguards are in place at all times.

Cybernews researchers ran a structured set of adversarial tests to see whether leading AI tools could be pushed into generating harmful or illegal output.
