OpenAI’s reasoning models show increased hallucination rates

OpenAI’s new reasoning models, o3 and o4-mini, hallucinate more frequently than their predecessors, according to the company’s internal testing. Maxwell Zeff of TechCrunch reports that o3 hallucinated in response to 33% of questions on OpenAI’s PersonQA benchmark, roughly double the rate of earlier models, while o4-mini performed even worse with a 48% hallucination rate. In its technical report, OpenAI acknowledged that “more research is needed” to understand why hallucinations increase as reasoning models scale up. Third-party testing by Transluce, an AI research lab, also found that o3 sometimes fabricates actions it never actually took. Despite these issues, some users, such as Stanford adjunct professor Kian Katanforoosh, note that o3 still outperforms competitors on coding tasks. OpenAI spokesperson Niko Felix said that addressing hallucinations remains “an ongoing area of research” for the company.
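
For context on what those percentages mean: a hallucination rate on a person-focused QA benchmark like PersonQA is essentially the fraction of graded answers that contain at least one fabricated claim. OpenAI’s actual grading pipeline is not public, so the sketch below is only an illustration under that assumption; the names (`GradedAnswer`, `contains_fabrication`) and the toy questions are hypothetical, not part of the benchmark.

```python
from dataclasses import dataclass

@dataclass
class GradedAnswer:
    question: str
    answer: str
    # True if a grader judged the answer to contain at least one fabricated
    # claim about the person. How grading is done is an assumption here,
    # not taken from OpenAI's report.
    contains_fabrication: bool

def hallucination_rate(graded: list[GradedAnswer]) -> float:
    """Fraction of answers flagged as containing a fabrication."""
    if not graded:
        return 0.0
    flagged = sum(1 for g in graded if g.contains_fabrication)
    return flagged / len(graded)

# Toy example: 1 flagged answer out of 3 gives ~33%, the headline
# figure reported for o3 on PersonQA.
sample = [
    GradedAnswer("Where did Ada Lovelace study?", "At a fictional academy.", True),
    GradedAnswer("Who wrote 'On Computable Numbers'?", "Alan Turing.", False),
    GradedAnswer("When was Grace Hopper born?", "1906.", False),
]
print(f"hallucination rate: {hallucination_rate(sample):.0%}")
```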
