Artificial Analysis has overhauled how the AI industry measures intelligence, replacing traditional benchmarks with tests of whether AI can complete actual work tasks. Michael Nuñez reports for VentureBeat.
The independent benchmarking organization removed three widely cited tests from its Intelligence Index, including MMLU-Pro and AIME 2025. The new version 4.0 introduces 10 evaluations focused on agents, coding, scientific reasoning, and general knowledge. Top AI models now score 50 or below on the new scale, compared to 73 previously.
The most significant addition is GDPval-AA, which tests AI models on real-world tasks across 44 occupations and nine industries. The benchmark measures whether AI can produce actual professional deliverables like documents, spreadsheets, and presentations. OpenAI’s GPT-5.2 with extended reasoning leads with a score of 1442, while Anthropic’s Claude Opus 4.5 follows at 1403.
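The article does not spell out how GDPval-AA arrives at these numbers, but the four-digit scores read like Elo-style ratings derived from pairwise comparisons of deliverables. Purely as an illustration of how such ratings would be interpreted (the rating scheme, function names, and K-factor below are assumptions, not the benchmark's documented method):

```python
# Hypothetical Elo-style scoring for pairwise comparisons of deliverables.
# This is an illustrative sketch only; the article does not describe
# GDPval-AA's actual method.

def expected_win_prob(rating_a: float, rating_b: float) -> float:
    """Probability that A's deliverable is preferred over B's under a standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 16.0) -> tuple[float, float]:
    """Shift both ratings toward the observed comparison outcome."""
    exp_a = expected_win_prob(rating_a, rating_b)
    delta = k * ((1.0 if a_won else 0.0) - exp_a)
    return rating_a + delta, rating_b - delta

# Under this reading, a 39-point gap (1442 vs. 1403) implies the leader's
# deliverables would be preferred roughly 56% of the time.
print(f"{expected_win_prob(1442, 1403):.1%}")
```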
Another new test, CritPT, reveals limitations in scientific reasoning. Its graduate-level physics problems, developed by more than 50 researchers, show that even the best models struggle with deep reasoning: GPT-5.2 scores just 11.5% on these challenges.
The update also introduces AA-Omniscience, which measures factual accuracy while penalizing hallucinations. Results show that high accuracy does not guarantee low hallucination rates: Google's Gemini 3 Pro leads on accuracy with 54% but shows an 88% hallucination rate, while GPT-5.1 posts a lower hallucination rate of 51%.
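The article does not define how the two numbers are computed. One way they can diverge, sketched below purely as an assumption rather than the benchmark's actual rule, is if accuracy is measured over all questions while hallucination rate is measured only over the questions the model did not get right, so that abstaining lowers it; the outcome labels and the 54/40/6 split are invented to mirror the reported figures.

```python
# Hypothetical per-question outcomes; the real AA-Omniscience grading
# pipeline is not described in the article.
CORRECT, WRONG, ABSTAIN = "correct", "wrong", "abstain"

def accuracy(outcomes: list[str]) -> float:
    """Share of all questions answered correctly."""
    return outcomes.count(CORRECT) / len(outcomes)

def hallucination_rate(outcomes: list[str]) -> float:
    """Assumed definition: of the questions the model did not get right,
    the share it answered wrongly instead of declining to answer."""
    wrong, abstained = outcomes.count(WRONG), outcomes.count(ABSTAIN)
    return wrong / (wrong + abstained) if (wrong + abstained) else 0.0

# Illustrative only: a model that almost never abstains can lead on
# accuracy while still posting a high hallucination rate.
outcomes = [CORRECT] * 54 + [WRONG] * 40 + [ABSTAIN] * 6
print(f"accuracy: {accuracy(outcomes):.0%}")                      # 54%
print(f"hallucination rate: {hallucination_rate(outcomes):.0%}")  # ~87%
```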
The changes reflect a fundamental shift in AI evaluation. “Intelligence is being measured less by recall and more by economically useful action,” researcher Aravind Sundar observes on X.