- The top score on the world's hardest AI benchmark has improved by 183% in just two weeks
- ChatGPT o3-mini now scores up to 13% accuracy, depending on the setting used
- OpenAI Deep Research blows away the competition with an accuracy score of 26.6%
The world's hardest AI exam, Humanity's Last Exam, launched less than two weeks ago, and we have already seen a huge leap in accuracy, with ChatGPT o3-mini and now OpenAI's Deep Research topping the leaderboard.
The AI benchmark, created by experts from around the world, contains some of the hardest reasoning problems and questions known to man: it is so hard that when I wrote about Humanity's Last Exam in my previous article, I couldn't even understand one of the questions, let alone answer it.
At the time of writing that article, the global phenomenon DeepSeek R1 sat at the top of the leaderboard with a 9.4% accuracy score when evaluated on text only (not multimodal). Now, OpenAI's o3-mini, which launched earlier this week, has scored 10.5% accuracy at the o3-mini setting and 13% accuracy at the o3-mini-high setting, which is more intelligent but takes longer to generate answers.
More impressive, however, is the score of OpenAI's new AI agent, Deep Research, on the benchmark: the new tool scored 26.6%, a 183% increase over the previous top score of 9.4% in less than 10 days. It is worth noting that Deep Research has search capabilities, which makes comparisons slightly unfair, as the other AI models do not. The ability to search the web is useful for a test like Humanity's Last Exam, since it includes some general knowledge-based questions.
That said, the accuracy of the models taking Humanity's Last Exam is steadily improving, and it makes you wonder how long we will have to wait to see an AI model come close to completing the benchmark. Realistically, AI shouldn't be able to come close any time soon, but I wouldn't bet against it.
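As a quick check on the 183% figure quoted earlier, here is a minimal Python sketch. It assumes the baseline is DeepSeek R1's 9.4% text-only score, and the function name `relative_increase` is purely illustrative, not part of any leaderboard tooling.

```python
def relative_increase(old: float, new: float) -> float:
    """Return the percentage increase from old to new."""
    return (new - old) / old * 100


# Scores reported in this article, in percent accuracy.
baseline = 9.4        # DeepSeek R1, text only (assumed baseline)
deep_research = 26.6  # OpenAI Deep Research

increase = relative_increase(baseline, deep_research)
print(f"{increase:.0f}% increase")  # prints "183% increase"
```

Note that this is a relative gain: in absolute terms the top score rose by 17.2 percentage points.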
It seems that the latest OpenAI model is doing very well across many topics. My guess is that Deep Research particularly helps with subjects including medicine, classics, and law. pic.twitter.com/x8ilmq1aqs (February 3, 2025)
Better, but 26.6% is still not a passing grade
OpenAI Deep Research is an incredibly impressive tool, and I have been impressed by the examples OpenAI showed off when it announced the AI agent. Deep Research can act as your personal analyst, taking the time to conduct intensive research and produce reports and answers that would otherwise take humans hours and hours to complete.
While a 26.6% score on Humanity's Last Exam is seriously impressive, especially considering how far the benchmark's leaderboard has come in just a couple of weeks, it is still a low score in absolute terms: no one would claim to have passed a test with anything less than 50% in the real world.
Humanity's Last Exam is an excellent benchmark, and one that will prove invaluable as AI models develop, allowing us to gauge just how far they have come. How long will we have to wait to see an AI pass the 50% mark? And which model will be the first to do so?