Meta Platforms has released quantized versions of its Llama 3.2 1B and 3B models, which the company says offer reduced memory requirements and faster on-device inference while preserving accuracy and portability. The models were developed in close collaboration with Qualcomm and MediaTek and are available for SoCs with Arm CPUs. According to Meta, average model size has been reduced by 56% and memory consumption by 41% compared to the original BF16 format.
Two techniques were used to quantize the Llama 3.2 1B and 3B models: Quantization-Aware Training with LoRA adapters (QLoRA) and SpinQuant, a post-training quantization method. According to Meta, QLoRA prioritizes accuracy, while SpinQuant prioritizes portability. Inference with both quantized variants is supported in the Llama Stack reference implementation via PyTorch's ExecuTorch framework.
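To make the quantization-aware-training idea concrete, the sketch below shows a minimal PyTorch linear layer that simulates low-bit weight quantization in the forward pass (via a straight-through estimator) while keeping a full-precision LoRA adapter. It is a conceptual illustration of the general QAT-with-LoRA technique only, not Meta's released recipe or the ExecuTorch runtime; the class name, bit width, and rank are illustrative assumptions.

```python
# Conceptual sketch of quantization-aware training with a LoRA adapter.
# Not Meta's actual recipe: layer name, group-free scaling, bit width,
# and rank are illustrative assumptions.
import torch
import torch.nn as nn


def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Simulate low-bit weight quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax + 1e-8
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Forward uses the quantized weights; gradients flow to the
    # full-precision weights unchanged (straight-through estimator).
    return w + (w_q - w).detach()


class QATLoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # LoRA adapter kept in full precision; during quantization-aware
        # fine-tuning it learns to compensate for quantization error.
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = fake_quantize(self.weight)   # quantized base weights
        lora = self.lora_b @ self.lora_a   # low-rank correction term
        return x @ (w_q + lora).t()


if __name__ == "__main__":
    layer = QATLoRALinear(64, 64)
    x = torch.randn(2, 64)
    out = layer(x)
    out.sum().backward()   # gradients reach both base weights and adapter
    print(out.shape)       # torch.Size([2, 64])
```

In this setup the model is trained while "seeing" the quantized weights, so the adapter and base weights adapt to the quantization error, which is why Meta describes the QLoRA variant as prioritizing accuracy over the purely post-training SpinQuant approach.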
In tests on Android OnePlus 12 devices, Meta reports a 2-4x inference speedup and an average 56% reduction in model size compared to the original format.
Sources: Meta, VentureBeat, Silicon Angle