Hugging Face has introduced SmolVLM, a new vision-language AI model that processes both images and text while using significantly less computing power than comparable models. As reported by Michael Nuñez, the model requires only 5.02 GB of GPU RAM, compared with competitors that need up to 13.70 GB. The system uses an aggressive image-compression strategy to encode 384×384-pixel images with just 81 visual tokens, and it has shown unexpected capabilities in video analysis. Released under the Apache 2.0 license, SmolVLM comes in three variants (SmolVLM-Base, SmolVLM-Synthetic, and SmolVLM-Instruct) aimed at different enterprise needs, and it could make advanced AI vision systems more accessible to companies with limited resources. The model builds on the SigLIP image encoder and SmolLM2 for text processing, and was trained on The Cauldron and Docmatix datasets.
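
For teams evaluating the model, the sketch below shows roughly how SmolVLM can be loaded and prompted through the Hugging Face transformers library. It is a minimal illustration, not official guidance: the checkpoint name (`HuggingFaceTB/SmolVLM-Instruct`), the image path, and the generation settings are assumptions that should be checked against the model card.

```python
# Minimal sketch: load SmolVLM-Instruct and describe a local image.
# Assumes a recent transformers release with SmolVLM/Idefics3 support
# and a GPU with roughly 5 GB of free memory (per the reported figures).
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "HuggingFaceTB/SmolVLM-Instruct"  # assumed checkpoint name
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half precision keeps the memory footprint small
).to(DEVICE)

image = Image.open("example.jpg")  # hypothetical local image

# Build a chat-style prompt that interleaves the image with a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

The same pattern applies to the Base and Synthetic variants by swapping the checkpoint name, since all three share the SigLIP-plus-SmolLM2 architecture described above.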