Nvidia releases Nemotron 3 Nano Omni, a unified multimodal AI model

Nvidia has launched Nemotron 3 Nano Omni, an open AI model that combines text, vision and audio processing in a single system. Most existing AI agent systems rely on separate models for each modality, which increases latency and cost. Nvidia says its new model eliminates that fragmentation.

The model uses a hybrid mixture-of-experts architecture with 30 billion parameters. By integrating the vision and audio encoders directly into the model, Nvidia says it achieves up to nine times higher throughput than other open omni models at the same level of interactivity.

Gautier Cloix, CEO of H Company, one of the early adopters, said: “To build useful agents, you can’t wait seconds for a model to interpret a screen. By building on Nemotron 3 Nano Omni, our agents can rapidly interpret full HD screen recordings — something that wasn’t practical before.”

The model is designed for agentic workflows, systems in which an AI takes a sequence of actions to complete a task. Practical use cases include document analysis, customer support, and audio-video reasoning.

Nemotron 3 Nano Omni is available with open weights on Hugging Face, OpenRouter and build.nvidia.com. It can run on local hardware such as Nvidia DGX Spark as well as cloud environments. Nvidia reports that the broader Nemotron model family has reached over 50 million downloads in the past year.

Sources: Nvidia, Silicon Angle
