The Allen Institute for Artificial Intelligence (Ai2) has released Tulu 3 405B, a new AI language model that, according to the institute’s internal testing, outperforms several leading AI systems, including DeepSeek V3, and matches OpenAI’s GPT-4o on certain benchmarks. The model contains 405 billion parameters and required 256 GPUs running in parallel for training.
The key innovation behind Tulu 3 405B is its use of “reinforcement learning from verifiable rewards” (RLVR), a technique that trains models on tasks with automatically checkable outcomes, such as mathematical problem solving. Ai2 reports that on the PopQA benchmark, which contains 14,000 specialized knowledge questions, and on GSM8K, a test of grade school-level math problems, the model outperformed comparable systems in its class.
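To make the RLVR idea concrete, here is a minimal sketch of what a verifiable reward might look like for a math task: instead of scoring completions with a learned preference model, the reward comes from programmatically checking the model’s final answer. The function name, the answer-extraction heuristic, and the binary 1/0 reward are assumptions for illustration, not Ai2’s actual implementation.

```python
import re

# Hypothetical verifiable reward for a math problem: extract the last
# number in the completion and compare it against the known answer.
def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the completion's final numeric answer matches the
    ground truth, else 0.0 -- a binary, automatically checkable reward."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

# In RLVR training, a check like this would stand in for a learned
# reward model inside a standard RL loop: completions that verify
# correctly earn reward 1.0, all others earn 0.0.
print(verifiable_reward("First add 48 + 24 = 72. The answer is 72.", "72"))  # 1.0
print(verifiable_reward("The answer is 70.", "72"))  # 0.0
```

Because the reward is computed by a deterministic check rather than another model, it cannot be gamed in the way learned reward models sometimes are, which is part of the appeal of training on tasks with verifiable outcomes.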
Unlike some competing models that release only partial code, Ai2 has made Tulu 3 405B fully open source, publishing all components necessary for replication, including training data, infrastructure code, and model weights. The model is available for testing through Ai2’s chatbot web application, and its code can be accessed via GitHub and Hugging Face. According to Hannaneh Hajishirzi, Ai2’s senior director of NLP research, this comprehensive open approach allows users to customize their pipeline at every stage, from data selection through evaluation.
Sources: TechCrunch, VentureBeat