Researchers from the Center for AI Safety and Scale AI have unveiled “Humanity’s Last Exam,” which they assert is the most challenging test ever created for artificial intelligence systems. According to reporting by The New York Times, the test consists of approximately 3,000 multiple-choice and short-answer questions spanning fields from analytic philosophy to rocket engineering.
Dan Hendrycks, director of the Center for AI Safety, developed the test in collaboration with Scale AI, where he serves as an advisor. The questions were contributed by field experts, including university professors and accomplished mathematicians, who were compensated between $500 and $5,000 per accepted question.
Questions were vetted through a two-step filtering process. First, candidate questions were tested against leading AI models; those that stumped the systems were then reviewed and refined by human experts, who verified the correct answers.
Current AI models have performed poorly on the test, with OpenAI’s o1 system achieving the highest score at just 8.3 percent. Other tested systems included Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet.
Hendrycks predicts rapid improvement, suggesting that AI systems might exceed 50 percent accuracy by the end of 2025. At such levels, he believes AI systems could become “world-class oracles,” potentially surpassing human experts across various disciplines.
The test’s development was partly inspired by a conversation between Hendrycks and Elon Musk, who criticized existing AI evaluations as too simple. The new examination aims to assess AI capabilities comprehensively across academic disciplines, serving in effect as a measure of general intelligence.
Summer Yue, Scale AI’s director of research, suggests future versions might include questions without known answers, allowing AI systems to contribute to new discoveries in mathematics and science.
The test highlights a peculiar aspect of current AI development: while systems excel at complex tasks like disease diagnosis and competitive programming, they often struggle with basic functions such as simple arithmetic. This inconsistency has made it challenging to accurately measure AI progress.
Kevin Zhou, a theoretical particle physics researcher who contributed to the test, emphasizes that excelling at exam questions doesn’t necessarily translate to real-world research capabilities, noting the significant difference between test-taking and actual scientific practice.