New AI math benchmark exposes limitations in advanced reasoning

The FrontierMath benchmark, developed by Epoch AI, presents hundreds of challenging math problems that require deep reasoning and creativity to solve. Despite their growing power, even leading AI models like GPT-4o and Gemini 1.5 Pro solve fewer than 2% of these problems, even with extensive support, according to Epoch AI. The benchmark was created … Read more

OpenAI and others exploring new strategies to overcome AI improvement slowdown

OpenAI is reportedly developing new strategies to deal with a slowdown in AI model improvements. According to The Information, OpenAI employees testing the company’s next flagship model, code-named Orion, found a smaller improvement than the jump from GPT-3 to GPT-4, suggesting the rate of progress is diminishing. In response, OpenAI has formed a foundations team … Read more

Chain-of-Thought reasoning no panacea for AI shortfalls

The research paper “Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse” investigates the effectiveness of chain-of-thought (CoT) prompting in large language and multimodal models. While CoT has generally improved model performance on various tasks, the authors explore scenarios where it may actually hinder performance, drawing parallels from … Read more

Entropix: New AI technique improves reasoning by detecting uncertainty

Researchers at XJDR have developed a new technique called Entropix that aims to improve reasoning in language models by making smarter decisions when the model is uncertain, according to a recent blog post by Thariq Shihipar. The method uses adaptive sampling based on two metrics, entropy and varentropy, which measure the uncertainty in the model’s … Read more
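To illustrate the two metrics the post describes, here is a minimal sketch of entropy and varentropy over a token distribution, plus a hypothetical decision rule for switching sampling strategies. The thresholds and the strategy names (`greedy`, `sample`, `branch`) are illustrative assumptions, not Entropix's actual implementation:

```python
import math

def entropy_varentropy(probs):
    """Shannon entropy (mean surprisal) and varentropy (variance of
    surprisal) of a next-token probability distribution."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    v = sum(p * (math.log(p) + h) ** 2 for p in probs if p > 0)
    return h, v

def choose_strategy(probs, h_thresh=1.0, v_thresh=1.0):
    """Hypothetical rule: confident -> pick the top token; uniformly
    uncertain -> sample; uncertainty concentrated on a few tokens
    (high varentropy) -> explore alternatives."""
    h, v = entropy_varentropy(probs)
    if h < h_thresh:
        return "greedy"
    return "sample" if v < v_thresh else "branch"
```

A sharply peaked distribution like `[0.97, 0.01, 0.01, 0.01]` yields low entropy and maps to greedy decoding, while a uniform distribution has high entropy but zero varentropy and would be sampled from.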

LLMs don’t reason logically

A new study from Apple reveals that large language models (LLMs) don’t reason logically but rely on pattern recognition. This finding, published by six AI researchers at Apple, challenges the common understanding of LLMs. The researchers discovered that even small changes, such as swapping names, can alter the models’ results by about 10%. Gary Marcus, … Read more

DeepMind’s Michelangelo tests reasoning in long context windows

DeepMind has introduced the Michelangelo benchmark to evaluate the long-context reasoning capabilities of large language models (LLMs), Ben Dickson reports for VentureBeat. While LLMs can manage extensive context windows, research indicates they struggle with reasoning over complex data structures. Current benchmarks often focus on retrieval tasks, which do not adequately assess a model’s reasoning abilities. … Read more

Google working on AI with advanced reasoning capabilities

Google is developing AI with reasoning abilities inspired by the human brain, similar to OpenAI’s o1 model. Several teams at the company are making progress on AI systems capable of solving complex problems in fields such as mathematics and programming. This was reported by Julia Love and Rachel Metz for Bloomberg. Researchers are using a … Read more

Chain of Thought

Chain of Thought is a concept in artificial intelligence that describes the ability of AI systems to solve complex problems step-by-step, much like humans do. This method allows AI models to explain their thought processes in a way that humans can understand. Instead of just providing a final answer, the AI shows the individual steps … Read more
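A minimal sketch of what this looks like in practice: a standard prompt asks only for the answer, while a chain-of-thought prompt includes a worked example whose intermediate steps the model is expected to imitate. The example question and helper below are illustrative, not from any particular paper:

```python
# A standard few-shot example: question and bare answer only.
standard_example = (
    "Q: A shop sells pens at 3 euros each. How much do 4 pens cost?\n"
    "A: 12 euros."
)

# The chain-of-thought version spells out the intermediate steps.
cot_example = (
    "Q: A shop sells pens at 3 euros each. How much do 4 pens cost?\n"
    "A: Each pen costs 3 euros. 4 pens cost 4 * 3 = 12 euros. "
    "The answer is 12 euros."
)

def build_query(few_shot_example, question):
    """Prepend the worked example so the model imitates its style."""
    return f"{few_shot_example}\n\nQ: {question}\nA:"
```

With `cot_example` as the prefix, the model tends to reproduce the step-by-step pattern before stating its final answer.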

OpenAI o1 impresses with surprisingly strong performance on some tasks

OpenAI has unveiled a new family of AI models called “o1”. Previously known as “Project Strawberry”, the project had sparked all kinds of speculation and high expectations. The first two versions, o1-preview and o1-mini, use a reasoning method known as “chain of thought” to solve complex tasks. This technique allows the models to … Read more

Quiet-STaR helps language models to think

Researchers at Stanford University and Notbad AI aim to teach language models to think before responding to prompts. Using their method, called “Quiet-STaR,” they were able to improve the reasoning skills of the language models they tested.