DeepMind’s Michelangelo tests reasoning in long context windows

DeepMind has introduced the Michelangelo benchmark to evaluate the long-context reasoning capabilities of large language models (LLMs), Ben Dickson reports for VentureBeat. While LLMs can accept increasingly long context windows, research indicates they struggle to reason over the data those windows contain. Existing benchmarks often focus on retrieval tasks, which do not adequately assess a model's reasoning abilities. Michelangelo aims to fill this gap by emphasizing understanding of relationships within the context rather than mere fact retrieval.
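To make the distinction concrete, consider the difference between asking a model to retrieve a single stated fact from a long document and asking it to track a state that evolves across many scattered updates. The sketch below is a toy illustration of the latter style of probe, not DeepMind's actual evaluation harness or data; the function name and parameters are hypothetical.

```python
import random

# Toy illustration (not DeepMind's code): build a long context in which the
# answer depends on tracking the state of a list across scattered updates,
# rather than retrieving a single stated fact ("needle in a haystack").

def build_reasoning_probe(n_updates: int = 20, n_distractors: int = 200, seed: int = 0):
    rng = random.Random(seed)
    state = []   # ground-truth latent state the model must track
    lines = []
    for _ in range(n_updates):
        value = rng.randint(0, 99)
        if state and rng.random() < 0.3:
            state.pop()
            lines.append("Remove the last number from the list.")
        else:
            state.append(value)
            lines.append(f"Append {value} to the list.")
    # Interleave irrelevant filler so the relevant updates are spread out.
    for _ in range(n_distractors):
        lines.insert(rng.randrange(len(lines) + 1),
                     f"Note: unrelated measurement {rng.randint(1000, 9999)}.")
    prompt = "\n".join(lines) + "\nQuestion: what are the final contents of the list?"
    return prompt, state  # `state` is the expected answer


prompt, expected = build_reasoning_probe()
# A retrieval-style benchmark could be answered by copying one line back out;
# here the model must compose every relevant update, in order, to reach `expected`.
```

A retrieval benchmark rewards finding one passage; a probe like this one rewards combining many passages while ignoring the distractors, which is closer to the kind of long-context reasoning Michelangelo is designed to measure.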

Initial evaluations of ten frontier LLMs, including variants of Gemini and GPT, revealed a performance drop as task complexity increased, indicating room for improvement in reasoning capabilities. The findings suggest that in real-world applications, models may struggle with multi-hop reasoning, particularly when irrelevant information is present in the context.
