‘Current LLMs introduce substantial errors when editing work documents’: Microsoft scientists find that most AI models struggle with long-running tasks, so users may not be able to fully trust them yet



  • Microsoft researchers find that current LLMs are unreliable on long-running tasks
  • More interactions and less structure significantly reduce benchmark performance
  • “Python is the only domain where most models are ready”

New research by a trio of Microsoft researchers has uncovered a fundamental problem that could be holding back effective agentic AI: most AI models can’t reliably handle long-running workflows.

To quantify their findings, the researchers introduced a new benchmark, DELEGATE-52, which measures performance across 52 sectors, including coding, accounting, science, and more.
