Anthropic reveals how its multi-agent research system achieves 90.2% better performance

Anthropic has published detailed insights into how it built Claude’s research capabilities, revealing that its multi-agent system outperforms single-agent approaches by 90.2%. The post was written by Jeremy Hadfield, Barry Zhang, Kenneth Lien, Florian Scholz, Jeremy Fox, and Daniel Ford from Anthropic. The research feature allows Claude to search across the web, Google Workspace, and …

Read more

Stanford researchers develop test to measure AI chatbot flattery

Stanford University researchers have created a new benchmark to measure excessive flattery in AI chatbots after OpenAI rolled back updates to GPT-4o due to complaints about overly flattering responses. The research, conducted with Carnegie Mellon University and the University of Oxford, was reported by Emilia David. The team developed “Elephant,” a test that evaluates how much …

Read more

Google introduces fast new AI model using diffusion technology

Google unveiled Gemini Diffusion at its I/O developer conference, marking a significant shift in how AI models generate text. The experimental model uses diffusion technology instead of the traditional autoregressive, token-by-token approach that powers ChatGPT and similar systems. The key advantage is speed. Gemini Diffusion generates text at 857 to 2,000 tokens per second, which is …

Read more

DarkBench framework identifies manipulative behaviors in AI chatbots

AI safety researchers have created the first benchmark specifically designed to detect manipulative behaviors in large language models, following a concerning incident with GPT-4o’s excessive flattery toward users. Leon Yen reported on the development for VentureBeat. The DarkBench framework, developed by Apart Research founder Esben Kran and collaborators, identifies six categories of problematic AI behaviors. …

Read more

Sakana AI introduces Continuous Thought Machines, a novel neural network that mimics brain processes

Sakana AI, co-founded by former Google AI scientists, has unveiled a new neural network architecture called Continuous Thought Machines (CTM). Unlike traditional transformer-based models that process information in parallel, CTMs incorporate a time-based dimension that mimics how biological brains operate, allowing for more flexible and adaptive reasoning. The key innovation in CTMs is their treatment …

Read more

New benchmark reveals leading AI models confidently produce false information

A new benchmark called Phare has revealed that leading large language models (LLMs) frequently generate false information with high confidence, particularly when handling misinformation. The research, conducted by Giskard with partners including Google DeepMind, evaluated top models from eight AI labs across multiple languages. The Phare benchmark focuses on four critical domains: hallucination, bias and …

Read more

Scientists struggle to understand how LLMs work

Researchers building large language models (LLMs) face a major challenge in understanding how these AI systems actually function, according to a recent article in Quanta Magazine by James O’Brien. The development process resembles gardening more than traditional engineering, with scientists having limited control over how models develop. Martin Wattenberg, a language model researcher at Harvard …

Read more

Study finds LM Arena may favor major AI labs in its benchmarking

A new study by researchers from Cohere, Stanford, MIT, and Ai2 alleges that LM Arena, the organization behind the Chatbot Arena AI benchmark, provided preferential treatment to major AI companies. According to Maxwell Zeff’s TechCrunch report, companies like Meta, OpenAI, Google, and Amazon were allowed to privately test multiple model variants and only publish scores …

Read more

AI helps scientists develop new experiments and discoveries

AI systems are increasingly being used to design experiments and drive scientific discoveries, according to research highlighted in Quanta Magazine. Mario Krenn, a quantum physicist who now leads the Artificial Scientist Lab, developed an AI program called Melvin that successfully designed quantum physics experiments when humans were stuck. Gregory Barber, writing for Quanta Magazine, describes …

Read more

Google DeepMind researchers predict “Era of Experience” in AI

Google DeepMind’s David Silver and Richard S. Sutton predict a major shift in artificial intelligence development, which they call the “Era of Experience.” In a preprint paper for MIT Press, the researchers argue that AI will increasingly learn from its own experiences rather than human-generated data. The authors suggest that current AI systems, particularly large …

Read more