Google launches new benchmark to test AI models’ factual accuracy

Google has introduced FACTS Grounding, a new benchmark that evaluates how accurately large language models (LLMs) ground their responses in provided source material. The benchmark comprises 1,719 examples across domains including finance, technology, and medicine. It was developed by the FACTS team at Google DeepMind and Google Research and relies on three frontier LLM judges (Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet) to evaluate responses. Each response is assessed both for the accuracy of its information and for how well it addresses the user's request. To make the benchmark harder to game, Google has split the dataset into public and private sets: 860 examples are publicly available, and the remaining 859 are held back. The company has also launched a Kaggle leaderboard to track progress and encourage industry-wide advances in AI factuality.
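As a rough illustration of how verdicts from several judges on two criteria might be combined into a single score, here is a minimal Python sketch. The `Verdict` structure, the function names, and the simple averaging rule are all assumptions made for this example; the article does not describe Google's actual aggregation method.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Verdict:
    """One judge's verdict on one response (hypothetical structure)."""
    addresses_request: bool  # did the response adequately answer the user's request?
    grounded: bool           # is the response supported by the provided source?


def response_score(verdicts: list[Verdict]) -> float:
    """Fraction of judges that accept the response on both criteria (assumed rule)."""
    return mean(
        1.0 if v.addresses_request and v.grounded else 0.0
        for v in verdicts
    )


def benchmark_score(all_verdicts: list[list[Verdict]]) -> float:
    """Average the per-response scores over every example in the set."""
    return mean(response_score(v) for v in all_verdicts)


# Three judges evaluate one response; one of them finds an unsupported claim.
example = [Verdict(True, True), Verdict(True, False), Verdict(True, True)]
print(response_score(example))
```

Requiring both criteria at once means a response cannot score well by being accurate but unhelpful, or helpful but unsupported by the source.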
