Google has introduced FACTS Grounding, a new benchmark designed to evaluate how accurately large language models (LLMs) ground their responses in provided source material. The benchmark comprises 1,719 examples across various domains including finance, technology, and medicine. The FACTS team at Google DeepMind and Google Research developed the system, which uses three frontier LLM judges – Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet – to evaluate responses. Each response is judged on two criteria: whether it adequately addresses the user's request and whether its claims are supported by the supplied source document. To prevent gaming the system, Google has split the dataset into public and private sets, with 860 examples publicly available. The company has also launched a Kaggle leaderboard to track progress in the field and encourage industry-wide advancement in AI factuality.
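To illustrate the multi-judge setup described above, the sketch below shows one plausible way to aggregate per-judge verdicts into a single score: a response only earns credit from a judge if it both addresses the request and is fully grounded, and the example score is the fraction of judges that accept it. This is an illustrative assumption, not Google's actual scoring code; the judge names, `JudgeVerdict` fields, and aggregation rule are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical judge identifiers mirroring the three models named in the article.
JUDGES = ["gemini-1.5-pro", "gpt-4o", "claude-3.5-sonnet"]


@dataclass
class JudgeVerdict:
    judge: str
    addresses_request: bool  # does the response adequately answer the user's request?
    fully_grounded: bool     # is every claim supported by the provided source document?


def score_response(verdicts: list[JudgeVerdict]) -> float:
    """Aggregate per-judge verdicts for one example (assumed scheme).

    A judge accepts a response only if it both addresses the request and is
    fully grounded; the score is the fraction of judges that accept it.
    """
    return mean(
        1.0 if v.addresses_request and v.fully_grounded else 0.0
        for v in verdicts
    )


def benchmark_score(per_example_verdicts: list[list[JudgeVerdict]]) -> float:
    """Average the per-example scores across the whole evaluation set."""
    return mean(score_response(v) for v in per_example_verdicts)


if __name__ == "__main__":
    example = [
        JudgeVerdict("gemini-1.5-pro", True, True),
        JudgeVerdict("gpt-4o", True, False),  # one judge flags an unsupported claim
        JudgeVerdict("claude-3.5-sonnet", True, True),
    ]
    print(f"example score: {score_response(example):.2f}")  # -> 0.67
```

Requiring agreement from multiple judge models, as in this sketch, is one common way to reduce the bias any single evaluator model might have toward its own outputs.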