Google has released new versions of its Gemma 3 AI models that can run on consumer-grade graphics cards through a technique called Quantization-Aware Training (QAT). This development makes powerful AI models accessible to users without high-end hardware.
The company says QAT dramatically reduces memory requirements while largely preserving output quality. Gemma 3’s largest 27B model, which normally requires 54GB of VRAM in BFloat16 precision, now needs just 14.1GB in int4 precision, allowing it to run on a single NVIDIA RTX 3090 graphics card.
Quantization reduces the precision of model parameters from 16 bits to as few as 4 bits, shrinking the stored weights by up to 75%. Unlike standard post-training quantization, QAT simulates the low-precision arithmetic during training itself, so the model learns to compensate and accuracy is better preserved.
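As a rough sanity check on those figures, the arithmetic is straightforward: weight memory is simply the parameter count times the bits per parameter. The sketch below is a back-of-the-envelope estimate only; it ignores the KV cache, activations, and the fact that some tensors may stay at higher precision, which is why the published 14.1GB int4 figure is slightly above the naive result.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

params = 27e9  # Gemma 3 27B

bf16_gb = weight_memory_gb(params, 16)  # 54.0 GB in BFloat16
int4_gb = weight_memory_gb(params, 4)   # 13.5 GB in int4 (Google quotes 14.1GB;
                                        # not every tensor is quantized)

print(f"BF16: {bf16_gb:.1f} GB")
print(f"int4: {int4_gb:.1f} GB ({1 - int4_gb / bf16_gb:.0%} smaller)")
```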
Google has made these optimized models available through popular development platforms including Ollama, LM Studio, MLX, Gemma.cpp, and llama.cpp. The company reports that its QAT approach cuts the perplexity drop from quantization by 54% compared to standard post-training quantization.
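Running one of the QAT checkpoints locally takes only a few lines. The sketch below uses the Ollama Python client and assumes a local Ollama server is running; the model tag shown is an assumption, so check Ollama's model library for the exact QAT tag it publishes for Gemma 3.

```python
import ollama  # pip install ollama; requires a running local Ollama server

# NOTE: the tag "gemma3:27b-it-qat" is assumed for illustration;
# verify the published QAT tag with `ollama list` or the Ollama library.
response = ollama.chat(
    model="gemma3:27b-it-qat",
    messages=[
        {"role": "user", "content": "Explain quantization-aware training in two sentences."}
    ],
)
print(response["message"]["content"])
```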
The new optimizations apply across all Gemma 3 models, with even the 12B version now able to run on laptop GPUs with 8GB of VRAM. The smallest models (4B and 1B) require even less memory, potentially enabling AI capabilities on more constrained devices.
This development addresses a common barrier to AI democratization by bringing state-of-the-art model performance to widely available consumer hardware rather than requiring specialized enterprise equipment.