Oxford study finds AI performance tests lack scientific rigor

Widely used tests to measure artificial intelligence capabilities may be fundamentally flawed and oversell AI performance, according to a new study from the Oxford Internet Institute. Researchers examined 445 benchmarks that AI developers use to evaluate their models and found significant methodological problems.

Jared Perlo reports for NBC News that roughly half of the benchmarks fail to clearly define what they aim to measure. The tests also frequently reuse data from existing benchmarks and rarely employ reliable statistical methods to compare results between models.

“When we ask AI models to perform certain tasks, we often actually measure completely different concepts or constructs than what we aim to measure,” Adam Mahdi tells NBC News. Andrew Bean adds that claims such as “a model achieves Ph.D. level intelligence” should be taken with a grain of salt.

The study highlights the Grade School Math 8K (GSM8K) benchmark as an example. A high score shows AI models answering grade-school math questions correctly, but it does not necessarily demonstrate genuine mathematical reasoning. Mahdi compares it to a first grader correctly answering that two plus five equals seven without having mastered arithmetic reasoning.

The Oxford researchers provide eight recommendations to improve benchmark reliability, including clearly defining evaluation scope, constructing better task batteries, and using statistical analysis to compare model performance.
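To illustrate the last recommendation, here is a minimal sketch of what a statistical comparison between two models on the same benchmark could look like, rather than reporting a bare accuracy gap. It is not taken from the study; the per-question pass/fail scores and model names are invented, and it uses a simple paired bootstrap confidence interval as one possible method.

```python
# Hypothetical illustration: comparing two models on the same benchmark with a
# paired bootstrap instead of a bare accuracy difference. The per-question
# scores below are invented for the example.
import random

random.seed(0)

# 1 = correct, 0 = incorrect, aligned by question index (paired scores).
model_a = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
model_b = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0]

def accuracy(scores):
    return sum(scores) / len(scores)

def bootstrap_diff_ci(a, b, n_resamples=10_000, alpha=0.05):
    """Bootstrap confidence interval for accuracy(a) - accuracy(b),
    resampling questions with replacement so the pairing stays intact."""
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        idx = [random.randrange(n) for _ in range(n)]
        diffs.append(accuracy([a[i] for i in idx]) - accuracy([b[i] for i in idx]))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

diff = accuracy(model_a) - accuracy(model_b)
lo, hi = bootstrap_diff_ci(model_a, model_b)
print(f"Accuracy difference: {diff:+.2f}, 95% CI [{lo:+.2f}, {hi:+.2f}]")
# If the interval straddles zero, the benchmark result alone does not support
# claiming that one model outperforms the other.
```

The point of the study's recommendation is the general practice, not any particular test: without some measure of uncertainty, small leaderboard gaps between models may reflect noise rather than real capability differences.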

Nikola Jurkovic of the AI research organization METR calls the checklist a starting point for researchers who want to ensure their benchmarks are insightful. OpenAI and other organizations have recently begun developing tests that measure AI performance on real-world tasks tied to specific occupations.
