The best AI coding assistants fail on one in four tasks, revealing a serious gap between expectations and actual reliability

Report finds AI coding assistants regularly fail on 1 in 4 structured output tasks
Even advanced proprietary models only achieve about 75% accuracy
Open-source AI models perform worse, averaging close to 65% reliability

The promise of artificial intelligence as a tireless coding assistant has hit a major roadblock, with new research claiming such tools fail on a substantial share of structured output tasks.

A recent study from the University of Waterloo found that AI struggles with software development, with even the most advanced models failing on one in four structured output tasks.

The research evaluated 11 large language models across 18 different structured formats and 44 tasks to test how well the systems could follow predefined rules, finding a clear disparity between performance on text-based tasks and results involving multimedia or complex structures.


Benchmarking Reveals Worrying Reliability Gap

While text-related tasks were generally handled with moderate success, tasks requiring image, video, or website generation were much more problematic.

Accuracy in these areas dropped dramatically, raising questions about how these AI tools can be safely integrated into professional workflows.

“With this type of study, we want to measure not only the syntax of the code, that is, whether it follows established rules, but also whether the results produced for various tasks were accurate,” said Dongfu Jiang, a doctoral student and co-first author of the study.

Structured outputs, designed to enforce formatting consistency via JSON, XML, or Markdown, were intended to make AI responses more reliable for developers.

AI companies including OpenAI, Google, and Anthropic introduced structured outputs to force responses into predictable formats.

The Waterloo research suggests that this approach has not yet provided the level of reliability that developers require.
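To illustrate the distinction between output that is merely well-formed and output that is actually correct, here is a minimal Python sketch. It is not the Waterloo benchmark's method; the function name `check_output` and the sample data are hypothetical, chosen only to show that a response can parse cleanly as JSON and still contain the wrong answer.

```python
import json


def check_output(raw: str, expected: dict) -> dict:
    """Return two separate verdicts for a model response (hypothetical helper).

    well_formed: the text parses as JSON (the syntax check).
    correct:     the parsed value equals the expected answer (the accuracy check).
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Broken syntax can never count as a correct answer.
        return {"well_formed": False, "correct": False}
    return {"well_formed": True, "correct": parsed == expected}


# Illustrative data only, not taken from the study.
expected = {"name": "Ada", "age": 36}

print(check_output('{"name": "Ada", "age": 36}', expected))  # valid and right
print(check_output('{"name": "Ada", "age": 63}', expected))  # valid but wrong
print(check_output('{"name": "Ada", "age": 36', expected))   # not valid JSON
```

The second case is the one a format-only check would miss: the response is valid JSON, yet the value is wrong, which is why a developer relying on structured outputs may still need to verify the content.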

The Waterloo benchmark revealed that even the most advanced proprietary models achieved only around 75% accuracy, while open-source alternatives came in at close to 65%.

These results suggest that, despite improvements, AI systems still make important errors that cannot be ignored in professional development settings.

The report emphasized the need for human oversight, noting: “Developers can have these agents working for them, but they still need significant human oversight.”

Although structured output is a step up from free-form natural language responses, errors are still common.

The technology is not yet robust enough to work independently in complex development scenarios.

One might reasonably wonder if the industry’s enthusiasm for AI and vibe coding assistants has outpaced the actual capabilities of the underlying technology.

Even the most advanced models demonstrate a significant failure rate on structured tasks, revealing a wide gap between marketing claims and actual performance.

Therefore, for now, developers should treat these tools as experimental aids rather than autonomous colleagues.
