Claude only defeated GPT-5, Gemini and Grok in workproof of the real world, according to Opensei's own study

Operai has launched GDPval, a new evaluation system to test how AI works in work -related tasks
Claude Opus 4.1 comes to mind, with ‘chatgpt-5 high’ in second place
Tasks include things like sending an email an answer to a unsatisfied customer

We are all familiar with AI’s reference points, which measure performance in certain tasks, but often these tasks do not reflect the real world and how people really use AI, especially at work.

To combat this problem, OpenAI, the chatgpt manufacturer, is introducing GDPval, a new way of measuring the performance of the AI model using real world’s work tasks compared to a real human in 44 occupations, from software and lawyers developers to registered nurses and mechanical engineers.

Surprisingly, the Operai study shows that the best performance model was Claude Opus 4.1 of Anthrope, which surpassed not only Openai GPT-5 but also Gemini and Grok.

GDPval gain rate

(Image credit: OpenAI)

This graph shows the general rate of victories of GDPval (the moments in which the AI was better than an expert in the industry) and shows that Claude Opus 4.1 is at the head with a rate of victories of 47.6, with ‘Chatgpt-5 High’ in second place with 38.8 and ‘Chatgpt O3 High’ to 34.1. ChatGPT-4O gets the lowest, with a gain rate of 12.4, which is significantly behind Grok 4 and Gemini 2.5 Pro.

The study found that Claude was the highest performance in eight of the nine sectors of the industry he tested, including government, medical care and social assistance. The results clearly show that Claude Opus 4.1 leads to a wide range of work -related tasks.

Claude's gain rates per sector

(Image credit: OpenAI)

Examples of the tasks include things such as sending an email an answer to an unsatisfied customer who requests a return, optimization of a table design for a spring suppliers and inconsistencies of audit prices in purchase orders.

What is in a name?

The name used by OpenAI, GDPval, comes from the concept of Gross Domestic Product (GDP) as a key economic indicator. Operai wants GPDval to be widely adopted to help conversations about future improvements in evidence instead of conjecture.

Librating the results that show a competitor in front seems to be an exercise of radical transparency by OpenAi, but that fits perfectly to the company’s philosophy. “Our mission is to ensure that artificial general intelligence benefits all humanity. As part of our mission, we want to transparently communicate the progress of how AI models can help people in the real world,” says an Openai statement.

The document, which is available to read in its entirety online, occurs a week after Operai launched a more consumer -centered article that showed that most Chatgpt users (70%really used it at home, instead of at work.

The study was conducted by the OpenAI economic research team and Harvard David Deming economist for the National Economic Research Office (NBER). The results were surprising for many people, as previously, the focus of the new chatgpt launches has focused greatly on work -related tasks such as coding, presentations and being a good research tool.

The news that Claude Opus 4.1 is better in the work-related real tasks, not only the reference points, that even ‘Chatgpt-5 high’ could mean a renewed OpenAI approach towards its changing user base.

Claude only defeated GPT-5, Gemini and Grok in workproof of the real world, according to Opensei’s own study

Leave a Comment Cancel Reply

Must Read

Leave a Comment Cancel Reply