This tiny AI model outperforms Google’s Gemini at understanding videos

The Allen Institute for AI has released Molmo 2, an open-source video model designed to compete with larger proprietary systems in video understanding and analysis. With the launch, the institute aims to demonstrate that smaller open models can serve as practical alternatives for businesses. Emilia David reports for VentureBeat.

Ai2 released three versions of Molmo 2: an 8B model built on Qwen-3 that offers the strongest video grounding and question answering, a more efficient 4B variant, and a 7B version built on Ai2's own Olmo foundation model. The models handle single images, multiple images, and video clips of varying lengths.

The institute states that closing the grounding gap in open models was a core design goal. Grounding refers to an AI’s ability to locate and track specific elements within visual content at the pixel level.

Benchmark tests show Molmo 2 outperforming competitors including Google’s Gemini 3 Pro in video tracking tasks. The 8B version leads all open-weight models in image and multi-image reasoning, with the 4B variant following closely. The models achieved their strongest results in video grounding and video counting.

Ai2 acknowledges that video grounding remains challenging, with no current model reaching 40% accuracy on existing benchmarks. Unlike video generation models such as Google’s Veo 3.1 or OpenAI’s Sora, Molmo 2 focuses specifically on video analysis and understanding rather than content creation.

The institute previously introduced the original Molmo family, focused on image analysis, last year.
