- New study tasked AIs with tackling the ‘Stroop’ test
- GPT and Claude performed very poorly compared to humans
- The findings come with caveats, but the researchers maintain that improving this side of AI is crucial to achieving artificial general intelligence
A recently published study has highlighted a limitation of big-name AI models like ChatGPT. It has caused some controversy because the main research uses now-outdated versions of those models, but that does not make the findings irrelevant.
I’ll get to that later, but first, let’s look at the study itself, which was highlighted on Reddit (“New Study Reveals Top AI Models Completely Fail Classic ‘Stroop’ Psychological Attention Test”) and published via Oxford University Press in the journal PNAS Nexus.
The research tests GPT-4o and Claude 3.5 Sonnet for the so-called ‘Stroop effect’. As noted, these are no longer the cutting-edge versions of those AIs (large language models, or LLMs), but they were when the initial study was conducted.
The Stroop effect is the interference people experience when asked to name the ink color of a word that itself spells out a different color (the incongruent condition). If the word “red” is written in blue ink, the response is slower, or possibly wrong, with the viewer accidentally saying “red” instead of the actual ink color, blue.
This happens because the brain is juggling two competing processes (reading the word and recognizing the color), which produces cognitive interference. Overriding the compulsion to read the word and naming the color instead requires “executive control of attention,” and this is what the authors were testing in the AI models. Both color naming and word reading were tested on word lists of increasing length (5, 10, 20, and 40 words).
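To make the setup concrete, here is a minimal Python sketch of how Stroop word lists for the three conditions might be generated and scored. It is an illustration only, not the authors' code: the color palette, the neutral words, the function names, and the per-item scoring rule are all assumptions, and the article does not say how the stimuli were actually presented to the models.

```python
import random

# Assumed palette and neutral (non-color) words; the study's actual stimuli may differ.
COLORS = ["red", "blue", "green", "yellow"]
NEUTRAL_WORDS = ["table", "river", "chair", "cloud"]


def make_trials(n, condition, rng):
    """Return n (word, ink) pairs for one Stroop condition."""
    trials = []
    for _ in range(n):
        ink = rng.choice(COLORS)
        if condition == "congruent":
            word = ink  # word matches its ink color
        elif condition == "incongruent":
            # word names a color other than the ink it is printed in
            word = rng.choice([c for c in COLORS if c != ink])
        elif condition == "neutral":
            word = rng.choice(NEUTRAL_WORDS)  # word is not a color name
        else:
            raise ValueError(f"unknown condition: {condition}")
        trials.append((word, ink))
    return trials


def accuracy(trials, responses, task):
    """Fraction of correct responses (hypothetical per-item scoring).

    For 'color_naming' the correct answer is the ink color;
    for 'word_reading' it is the written word.
    """
    if not trials or len(trials) != len(responses):
        raise ValueError("need one response per trial")
    index = 1 if task == "color_naming" else 0
    correct = sum(
        1 for trial, resp in zip(trials, responses)
        if trial[index] == resp.strip().lower()
    )
    return correct / len(trials)


rng = random.Random(0)
# List lengths reported in the article
for length in (5, 10, 20, 40):
    trials = make_trials(length, "incongruent", rng)
    # A responder that reads the word instead of naming the ink is always wrong here
    word_readers = [word for word, _ink in trials]
    print(length, accuracy(trials, word_readers, "color_naming"))
```

In the incongruent condition, a responder that simply reads each word scores zero on color naming, which is the failure mode the test is designed to expose.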
The study observes: “Like humans, both LLMs [GPT and Claude] showed relatively high accuracy in the word reading task and performed worse in the incongruent condition [where the word doesn’t match the color] than in the congruent and neutral conditions for the color naming task.”
For color naming, humans maintain about 95% accuracy even on very long trials (up to an hour), but LLM accuracy fell very rapidly as the word lists got longer under the incongruent condition (non-matching color and word name). GPT-4o was 91% accurate on a five-word test, but fell to 57% with 10 words and to 22% with 20 words (and was only 15% accurate with 40 words).
Claude 3.5 Sonnet performed better, maintaining 76% accuracy on 20 words, but it too collapsed on the longer 40-word test, falling to 24%.
The authors conclude: “The significant degradation pattern of the two LLMs suggests fundamental limitations compared to human attention.”
Analysis: another necessary step on the path to AGI?
If you have read the Reddit thread, you will no doubt have noticed that, as mentioned at the beginning, commenters have heavily criticized the study for using outdated GPT and Claude models.
In fact, at one point the authors call these older LLMs “state of the art” and, as already noted, they were cutting edge when the main study was conducted. Still, it is an unfortunate phrase that should have been updated, given that the paper has only just been published (after peer review and so on).
However, the researchers did test GPT-5, Claude Opus 4.1 and Gemini 2.5 Pro in September 2025, although this is somewhat buried in the paper. Those more recent tests found that the newer models offered only “slight” improvements over their predecessors, and that they still exhibited “ongoing executive attention deficits, consistent with our comprehensive analysis of previous Transformer models” (as did Gemini 2.5 Pro, which was a new addition here).
Admittedly, a smaller sample size was used, but the researchers still maintain that overall their study reflects a fundamental limitation that is “inherent in the architectural limitations of transformer-based LLMs.”
The authors note one caveat: GPT-5 in ‘Thinking’ mode can write and then execute code to get through the Stroop test without problems, and other LLMs may use similar functionality. However, this is essentially the AI (cleverly) working around its shortcomings; it does not change the way the model functions or reasons more broadly.
The researchers note that innovations in the transformer architecture for LLMs have focused on improving memory capabilities, which does not address the “core limitations of attention mechanisms, specifically the need for sophisticated alerting, orienting, and executive control networks to enable cognitive flexibility.”
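To see why handing the task to code sidesteps the problem, consider the sketch below. It is purely illustrative and is not the code GPT-5 produced: the function name is hypothetical, and it assumes the stimuli are already available as structured (word, ink) pairs.

```python
def name_ink_colors(trials):
    """Return the ink color of each (word, ink) pair, ignoring the word entirely."""
    return [ink for _word, ink in trials]


# Hypothetical stimuli: the first and third items are incongruent
trials = [("red", "blue"), ("green", "green"), ("blue", "yellow")]
print(name_ink_colors(trials))
```

A program like this never looks at the written word, so there is nothing to interfere with the answer, however long the list gets. That is why it counts as a workaround rather than evidence of better attention control.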
The ultimate goal is effective goal-directed behavior, and the study notes: “Future [LLM] development could benefit from the implementation of more sophisticated executive control systems that can handle decision conflicts through structured, goal-directed processing rather than relying solely on enhanced memory capabilities.”
The authors argue that “incorporating executive control mechanisms similar to those of biological attention is crucial to achieving artificial general intelligence [AGI].”