The accuracy of OpenAI’s recently released AI models has been found to be lower than that of its previous models, according to a report by the American IT news outlet TechCrunch on the 18th (local time).
OpenAI tested its new models ‘o3’ and ‘o4-mini’ to determine how accurately they answer questions about people. The test used an internal benchmark called ‘PersonQA’, which measures knowledge of basic facts about celebrities and public figures. For instance, questions like “What was the first company founded by Steve Jobs?” fall into this category.
The results were unexpected. The o3 model provided incorrect information for 33% of all questions asked. This equates to one in every three questions receiving a wrong answer. The hallucination rate of o3 was nearly double that of previous models like o1 (16%) and o3-mini (14.8%). The o4-mini model answered nearly half (48%) of the questions incorrectly.
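To illustrate how a rate like this is tallied, here is a minimal sketch of scoring a PersonQA-style evaluation. The grading labels, abstention handling, and sample data are assumptions for illustration only, not OpenAI’s actual harness.

```python
# Hypothetical sketch of scoring a PersonQA-style evaluation.
# Labels ("correct" / "incorrect" / "abstained") are illustrative
# assumptions, not OpenAI's actual grading scheme.

def hallucination_rate(graded_answers):
    """Fraction of attempted answers graded incorrect.

    Answers where the model abstained are excluded, on the assumption
    that a refusal to answer is not a hallucination.
    """
    attempted = [g for g in graded_answers if g != "abstained"]
    if not attempted:
        return 0.0
    wrong = sum(1 for g in attempted if g == "incorrect")
    return wrong / len(attempted)

# Illustrative run: 2 wrong answers out of 6 attempted, i.e. roughly
# one in three, the rate the article reports for o3 on PersonQA.
grades = ["correct", "incorrect", "correct",
          "correct", "incorrect", "correct"]
print(f"{hallucination_rate(grades):.2f}")  # 0.33
```

Under this scheme, the reported figures (o3 at 33%, o4-mini at 48%) would correspond to roughly one wrong answer in every three, or nearly one in two, attempted questions.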
There were also instances where the AI claimed to have performed actions it had not. The non-profit AI research institution Transluce documented a case in which o3 stated, “I executed code on a 2021 MacBook Pro outside of ChatGPT and copied the results,” even though the o3 model cannot execute code directly on a user’s computer; it was claiming to have done something it could not do.
OpenAI has stated that it is unclear why the hallucination rate has increased. The company explained that because the new models attempt to convey more information than before, they produce more correct statements, but also more incorrect ones.
Recently, AI technology has shown noticeable achievements in tasks with straightforward answers like calculations and coding. However, its performance has regressed in fields where accurate factual information is crucial, such as information about people.
OpenAI believes that its web search feature can help reduce hallucination. In fact, the GPT-4o model, when equipped with web search, recorded 90% accuracy on OpenAI’s SimpleQA benchmark of short factual questions. However, there are privacy concerns, since using the feature exposes user prompts to a third-party search provider.
The models are nonetheless being used in professional work. Stanford University adjunct professor Kian Katanforoosh said, “Our team uses o3 for coding tasks and is obtaining better results compared to competing models,” but noted, “The issue of generating nonexistent web addresses occurs repeatedly.”
Sarah Schwettmann, co-founder of Transluce, commented, “With this level of hallucination, it is difficult to use the model in real-world settings.” An OpenAI spokesperson stated, “We are continually working to improve accuracy and reliability.”
While AI technology is increasingly capable of solving more complex problems, it still faces the fundamental challenge of transmitting accurate information.
