- OpenAI's latest models, o3 and o4-mini, hallucinate significantly more often than their predecessors
- The models' greater complexity may be leading to more confident inaccuracies
- High error rates raise concerns about AI reliability in real-world applications
Brilliant but unreliable people are a staple of fiction (and history). The same correlation may apply to AI as well, based on an investigation by OpenAI and shared by The New York Times. Hallucinations, fabricated facts, and outright lies have been part of AI chatbots since their creation. Improvements to the models should, in theory, reduce how often they appear.
OpenAI's latest flagship models, o3 and o4-mini, are meant to mimic human logic. Unlike their predecessors, which focused mainly on fluent text generation, OpenAI built o3 and o4-mini to think through problems step by step. OpenAI has boasted that o1 could match or exceed the performance of PhD students in chemistry, biology, and mathematics. But the OpenAI report highlights some harrowing results for anyone who takes ChatGPT responses at face value.
OpenAI found that the o3 model hallucinated on a third of a benchmark test involving public figures. That is double the error rate of last year's o1 model. The more compact o4-mini model did even worse, hallucinating on 48% of similar tasks.
On more general knowledge questions in the SimpleQA benchmark, hallucinations ballooned to 51% of responses for o3 and 79% for o4-mini. That's not just a little noise in the system; that's a full-blown identity crisis. You would think something marketed as a reasoning system would at least double-check its own logic before fabricating an answer, but it simply isn't so.
One theory making the rounds in the AI research community is that the more reasoning a model tries to do, the more chances it has to go off the rails. Unlike simpler models that stick to high-confidence predictions, reasoning models venture into territory where they must weigh multiple possible paths, connect disparate facts, and essentially improvise. And improvising around facts is also known as making things up.
Fictional functioning
Correlation is not causation, and OpenAI told the Times that the increase in hallucinations may not be because reasoning models are inherently worse. Instead, they may simply be more verbose and adventurous in their answers. Because the new models don't just repeat predictable facts but speculate about possibilities, the line between theory and fabricated fact can blur for the AI. Unfortunately, some of those possibilities are entirely untethered from reality.
Even so, more hallucinations are the opposite of what OpenAI or rivals like Google and Anthropic want from their most advanced models. Calling AI chatbots assistants and copilots implies they will be helpful, not hazardous. Lawyers have already gotten into trouble for using ChatGPT and failing to notice fabricated court citations; who knows how many similar errors have caused problems in lower-stakes circumstances?
The opportunities for a hallucination to cause problems for a user are expanding rapidly as AI systems roll out in classrooms, offices, hospitals, and government agencies. Sophisticated AI might help draft job applications, resolve billing issues, or analyze spreadsheets, but the paradox is that the more useful AI becomes, the less room there is for error.
You can't claim to save people time and effort if they have to spend just as long double-checking everything you say. Not that these models aren't impressive. o3 has demonstrated some remarkable feats of coding and logic, and it even outperforms many humans in some ways. The problem is that the moment it decides Abraham Lincoln hosted a podcast or that water boils at 80°F, the illusion of reliability shatters.
Until those problems are solved, take any response from an AI model with a heaping spoonful of salt. Sometimes ChatGPT is a bit like that annoying guy in too many meetings we've all sat through: brimming with confidence in utter nonsense.