The Arc Prize Foundation has released ARC-AGI-2, a new benchmark designed to measure progress toward artificial general intelligence that has proven extremely difficult for even the most advanced AI systems. This second-generation test specifically evaluates test-time reasoning, requiring AI to adapt to novel, never-before-seen tasks rather than relying on memorization.
The results reveal a stark capability gap: while human panels achieve 60% accuracy on average (and every task was solved by at least two humans), the most sophisticated AI reasoning systems score in the single digits. OpenAI’s o3-low model, which achieved 75.7% on the previous ARC-AGI-1 benchmark, manages only 4% on the new test.
Key challenges for AI systems
ARC-AGI-2 exposes specific reasoning deficiencies in current AI systems:
- Symbolic interpretation: AI struggles to assign semantic meaning to symbols beyond visual patterns
- Compositional reasoning: Systems fail when multiple rules must be applied simultaneously
- Contextual rule application: AI has difficulty applying rules differently based on context
The benchmark consists of visual puzzle tasks in which systems must identify patterns and generate correct answer grids. Every task in ARC-AGI-2 was solved by at least two humans in two attempts or fewer, matching the evaluation criteria applied to AI systems.
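To make the task format concrete, here is a minimal toy sketch in Python. The grids, the `apply_rule` function, and the mirror-image rule are all invented for illustration; real ARC-AGI-2 tasks use harder, compositional rules that the solver must infer from a handful of demonstration pairs before producing the output grid for a held-out test input.

```python
# Toy illustration of the ARC task format (not an actual ARC-AGI-2 task).
# Grids are small 2D arrays of color codes; a task shows a few
# input -> output demonstrations, then asks for the output of a new input.

def apply_rule(grid):
    """Hypothetical rule for this toy task: mirror each row horizontally."""
    return [list(reversed(row)) for row in grid]

# Demonstration pair the solver would study to infer the rule.
train_input = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 2],
]
train_output = apply_rule(train_input)

# Held-out test input: the solver must produce the matching answer grid.
test_input = [
    [0, 3, 0],
    [3, 0, 0],
    [0, 0, 3],
]
predicted = apply_rule(test_input)
print(predicted)  # [[0, 3, 0], [0, 0, 3], [3, 0, 0]]
```

The point of the format is that the rule is never stated: a solver that merely memorized training data gains nothing, since each task's transformation must be inferred fresh from its own demonstrations.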
Measuring efficiency alongside capability
A significant innovation in ARC-AGI-2 is its emphasis on efficiency. The benchmark now explicitly tracks the cost of solving tasks, recognizing that intelligence isn’t just about problem-solving capability but also resource utilization.
“Intelligence is not solely defined by the ability to solve problems or achieve high scores,” writes Arc Prize Foundation co-founder Greg Kamradt. “The efficiency with which those capabilities are acquired and deployed is a crucial, defining component.”
This efficiency metric highlights another gap: human solvers cost approximately $17 per task, while the o3-low model costs around $200 per task and achieves far lower accuracy.
Competition and prizes
Alongside the new benchmark, the Arc Prize Foundation has announced the ARC Prize 2025 competition with $1 million in prizes. The competition challenges developers to reach 85% accuracy on ARC-AGI-2 while spending only $0.42 per task. The contest runs from March 26 through November 3, 2025.
The foundation’s approach reflects a growing industry consensus that new, unsaturated benchmarks are needed to measure AI progress toward general intelligence. By designing tasks that are easy for humans but difficult for AI, ARC-AGI aims to identify the specific capabilities that separate human-level general intelligence from even the most advanced AI systems available today.
Sources: ARC Prize, TechCrunch