- Microsoft researchers find that current LLMs are unreliable on long-running tasks
- More interactions and less structure significantly reduce benchmark performance
- “Python is the only domain where most models are ready”
New research by a trio of Microsoft researchers has uncovered a fundamental problem that could be holding back effective agentic AI: most AI models simply can't handle long-running workflows reliably.
To quantify their findings, the researchers introduced a new benchmark, DELEGATE-52, which provides metrics across 52 domains, including coding, accounting, science, and more.
Ultimately, the paper concluded that current LLMs “introduce rare but serious errors that silently corrupt documents, aggravating prolonged interaction.”
AI is still not that good at long-duration tasks
The study looks at some of the latest AI models, including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT-5.4. The researchers found that even these models "corrupt an average of 25% of document content at the end of long workflows," and that weaker models are even more error-prone.
The DELEGATE-52 benchmark uses real documents around 15,000 tokens in length and introduces 5-10 complex editing tasks, along with a "round-trip retransmission simulation" that asks the AI to perform a transformation and then reverse it. This lets the researchers measure how faithfully each model can reconstruct a document to its original form.
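The paper's exact harness isn't public, but the round-trip idea is straightforward to sketch. The snippet below is a minimal illustration, assuming a generic edit/revert interface around the model; the function names and the difflib-based similarity score are illustrative stand-ins, not the researchers' actual metric.

```python
import difflib

def round_trip_score(original: str, model_edit, model_revert) -> float:
    """Apply a transformation, ask the model to reverse it, and measure
    how much of the original document survives (1.0 = perfect)."""
    edited = model_edit(original)      # the model performs the edit
    restored = model_revert(edited)    # the model is asked to undo it
    # Character-level similarity as a stand-in for the paper's
    # (unspecified) document-fidelity metric.
    return difflib.SequenceMatcher(None, original, restored).ratio()

# Toy stand-ins for a real model: uppercase the text, then lowercase it back.
doc = "Quarterly revenue rose 12% on strong cloud demand."
print(f"similarity: {round_trip_score(doc, str.upper, str.lower):.2f}")
# Prints ~0.98: the leading capital is lost, so reconstruction is imperfect.
```

In a real evaluation, `model_edit` and `model_revert` would wrap LLM calls, and the score would be averaged over many documents and task types.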
The models performed best in highly structured, programmatic domains, with the Microsoft researchers concluding that "Python is the only domain where most models are ready." In contrast, natural-language workflows, creative tasks, and semi-structured documents gave the models far more trouble.
The paper also shows that the longer a document runs in tokens, the more likely a model is to run into difficulties.
Where the frontier models differed was not in whether they made errors, but in how long they could stave them off. Beyond the headline models, the Microsoft researchers tested several GPT-5 and GPT-4 generations, a range of Claude and Gemini models, and one model each from Mistral, xAI, and Moonshot, for a total of 19 models from six families.
Gemini 3.1 Pro took first place with a DELEGATE-52 score of 80.9% after 20 interactions; Claude 4.6 Opus (73.1%) and GPT-5.4 (71.5%) rounded out the top three, while GPT-5 Nano (10.0%) came in last.
In summary, the paper concludes that current AI models are not yet reliable enough to be trusted with long-running, autonomous workflows, highlighting key areas for model developers to focus on and offering another benchmark for gauging model capability.
Via The Register