Microsoft has unveiled two new AI models in its Phi series: Phi-4-multimodal with 5.6 billion parameters and Phi-4-mini with 3.8 billion parameters. These small language models (SLMs) deliver exceptional performance while requiring significantly less computing power than larger systems, challenging the notion that bigger AI models are always better.
The Phi-4-multimodal model stands out for its ability to process text, images, and speech simultaneously using a novel “Mixture of LoRAs” technique. This approach enables the model to handle multiple input types without the performance degradation typically associated with multimodal systems. Microsoft reports that the model has claimed the top position on the Hugging Face OpenASR leaderboard with a word error rate of 6.14%, outperforming specialized speech recognition systems.
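Microsoft has not published the internals of its “Mixture of LoRAs” approach in this article, but the general idea behind LoRA is well established: keep a base weight matrix frozen and add small low-rank adapters, here one per modality. The sketch below is a minimal, illustrative NumPy version of that idea, not Microsoft’s implementation; the adapter names, sizes, and the `forward` helper are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, rank = 16, 4                       # illustrative sizes, far smaller than a real model
W = rng.normal(size=(d_model, d_model))     # frozen base projection weight

# One low-rank adapter per modality (names are illustrative, not Microsoft's).
# LoRA convention: B starts at zero so each adapter begins as a no-op.
adapters = {
    "text":   (np.zeros((d_model, rank)), rng.normal(size=(rank, d_model))),
    "vision": (np.zeros((d_model, rank)), rng.normal(size=(rank, d_model))),
    "speech": (np.zeros((d_model, rank)), rng.normal(size=(rank, d_model))),
}

def forward(x, active):
    """Base projection plus the low-rank updates for the active modalities."""
    y = x @ W
    for name in active:
        B, A = adapters[name]
        y = y + (x @ B) @ A                 # rank-r update: x @ (B @ A)
    return y

x = rng.normal(size=(2, d_model))
# With B initialized to zero, the adapted output equals the base projection.
assert np.allclose(forward(x, ["text", "vision", "speech"]), x @ W)
```

Because only the small `B` and `A` matrices are trained per modality, adapters can be combined at inference without retraining the frozen base weights, which is what makes this style of multimodal extension cheap relative to training a full multimodal model.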
Despite its compact size, Phi-4-mini demonstrates remarkable capabilities in text-based tasks, particularly excelling in mathematics and coding. According to Microsoft’s technical report, the model achieved an 88.6% score on the GSM8K math benchmark, outperforming most 8-billion-parameter models, while reaching 64% on the MATH benchmark, substantially higher than similar-sized competitors.
Key features and applications
- Both models use a decoder-only transformer architecture with grouped-query attention (GQA), which shares key/value heads across groups of query heads to speed up inference and shrink the memory footprint of the key/value cache
- Phi-4-multimodal can process visual, audio, and text input in a single model
- The models are designed to run efficiently on standard hardware or directly on devices
- They’re particularly suitable for edge computing scenarios where real-time intelligence is required but cloud connectivity may be limited
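The grouped-query attention mentioned in the list above can be sketched concretely: several query heads share a single key/value head, so far fewer K/V tensors need to be computed and cached. The NumPy example below is a simplified, illustrative single-step version (head counts and dimensions are made up for the demo), not the models’ actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

seq, d_head = 8, 8
n_q_heads, n_kv_heads = 8, 2        # illustrative: 4 query heads share each K/V head
group = n_q_heads // n_kv_heads

Q = rng.normal(size=(n_q_heads, seq, d_head))
K = rng.normal(size=(n_kv_heads, seq, d_head))  # fewer K/V heads -> smaller KV cache
V = rng.normal(size=(n_kv_heads, seq, d_head))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=-1, keepdims=True)

outputs = []
for h in range(n_q_heads):
    kv = h // group                 # each query head uses its group's shared K/V head
    scores = Q[h] @ K[kv].T / np.sqrt(d_head)
    outputs.append(softmax(scores) @ V[kv])

out = np.stack(outputs)             # shape: (n_q_heads, seq, d_head)
```

With standard multi-head attention the K/V cache would hold `n_q_heads` key/value tensors per layer; here it holds only `n_kv_heads`, a 4x reduction in this toy setup, which is the main reason GQA suits memory-constrained edge deployments.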
Weizhu Chen, vice president of generative AI at Microsoft, stated: “These models are designed to empower developers with advanced AI capabilities. Phi-4-multimodal, with its ability to process speech, vision and text simultaneously, opens new possibilities for creating innovative and context-aware applications.”
Real-world applications are already emerging. Capacity, an AI “answer engine,” has leveraged the Phi family to enhance its platform’s efficiency and accuracy, reporting 4.2x cost savings compared to competing workflows while achieving similar or better results.
Both Phi-4 models will be available through Azure AI Foundry, Hugging Face, and the Nvidia API Catalog under an MIT license, which permits commercial use. This accessibility aims to democratize AI capabilities, making advanced intelligence available to developers regardless of their hardware resources.
Sources: VentureBeat, SiliconAngle