Google DeepMind has introduced FACTS Grounding, a benchmark for evaluating the factual accuracy of large language models (LLMs). According to Taryn Plumb’s report in VentureBeat, the benchmark tests how well models generate responses grounded in long-form source documents. A public leaderboard on Kaggle accompanies the benchmark, where Gemini 2.0 Flash currently leads with 83.6% accuracy. The benchmark comprises 1,719 examples spanning fields such as finance, technology, and medicine, with documents up to 32,000 tokens long. Three different LLM judges – Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet – evaluate each response for accuracy and relevance. DeepMind researchers emphasize that the benchmark addresses a gap in evaluating model behaviors related to factuality.
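To make the multi-judge setup concrete, here is a minimal sketch of how per-judge verdicts might be aggregated into a single leaderboard score. The judge names come from the article, but the data shapes, function names, and the averaging rule are illustrative assumptions, not DeepMind's published implementation.

```python
# Hypothetical aggregation sketch for a FACTS-style benchmark.
# Assumption: each judge returns a boolean "grounded" verdict per example,
# and the final score averages pass rates across judges and examples.

JUDGES = ["gemini-1.5-pro", "gpt-4o", "claude-3.5-sonnet"]


def benchmark_score(per_example_verdicts):
    """Average the per-judge pass rate over all examples.

    per_example_verdicts: list of dicts mapping judge name -> bool,
    one dict per benchmark example.
    """
    if not per_example_verdicts:
        raise ValueError("no examples to score")
    total = 0.0
    for verdicts in per_example_verdicts:
        # Fraction of judges that marked this example's response grounded.
        total += sum(verdicts.values()) / len(verdicts)
    return total / len(per_example_verdicts)


if __name__ == "__main__":
    # Two toy examples: one disputed by a judge, one unanimous.
    results = [
        {"gemini-1.5-pro": True, "gpt-4o": True, "claude-3.5-sonnet": False},
        {"gemini-1.5-pro": True, "gpt-4o": True, "claude-3.5-sonnet": True},
    ]
    print(f"score: {benchmark_score(results):.4f}")
```

Averaging across several independent judges is one plausible way to dampen any single judge model's bias toward its own family's outputs.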