ARIA is an open, multimodal-native mixture-of-experts model designed to integrate diverse forms of information for comprehensive understanding, performing competitively with proprietary models across a wide range of tasks. Of its 24.9 billion total parameters, it activates 3.9 billion per visual token and 3.5 billion per text token. The model is pre-trained on 6.4 trillion language tokens and 400 billion multimodal tokens through a four-stage pipeline that progressively builds up its capabilities. Its architecture is built around a fine-grained mixture-of-experts decoder, which enables efficient parameter utilization and strong performance across modalities including text, images, and video.
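To illustrate the sparse-activation idea behind a mixture-of-experts decoder, the following is a minimal sketch of a feed-forward MoE layer with top-k routing in PyTorch. All names, expert counts, and dimensions here are illustrative assumptions and are not taken from the ARIA release; the real model uses its own fine-grained expert design and routing.

```python
# Hypothetical sketch of a mixture-of-experts feed-forward layer with top-k routing.
# Only the top_k experts selected by the router are applied to each token, so the
# number of active parameters per token is a small fraction of the total.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    def __init__(self, d_model=1024, d_ff=2048, n_experts=64, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is a small feed-forward block; only top_k run per token.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (batch, seq, d_model) -> flatten to a stream of tokens
        tokens = x.reshape(-1, x.shape[-1])
        gate_logits = self.router(tokens)                       # (n_tokens, n_experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                    # normalize over chosen experts
        out = torch.zeros_like(tokens)
        # Dispatch each token to its selected experts and sum the weighted outputs.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(tokens[mask])
        return out.reshape_as(x)


# Usage: route a small batch of token embeddings through the sparse layer.
layer = MoEFeedForward()
hidden = torch.randn(2, 16, 1024)
print(layer(hidden).shape)  # torch.Size([2, 16, 1024])
```

The dense double loop over slots and experts is written for clarity; production MoE implementations instead gather tokens per expert and use fused or batched kernels to keep routing overhead low.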
The training process emphasizes multimodal understanding and long-context modeling, extending the context window to 64K tokens. Benchmark results show that ARIA excels at long-context multimodal understanding, outperforming both open-source and proprietary models on tasks such as long-document comprehension and video analysis. ARIA also demonstrates strong instruction-following ability, making it well suited to real-world applications.