NVIDIA Open-Sources Nemotron 3 Ultra: 550B Parameter Hybrid Mamba-MoE Model with Million-Token Context and Agent Reasoning
In June 2026, NVIDIA open-sourced Nemotron 3 Ultra, a hybrid Mamba-Attention LatentMoE model with 550B total parameters (55B active) and a native 1M-token context, designed for agent reasoning. The model is released on Hugging Face with base, SFT, and NVFP4-quantized weights, along with training datasets, recipes, and inference code.
Model Architecture & Training
- Architecture: 108 layers alternating Mamba2 blocks and sparse Attention blocks, with 512 experts per layer, top-22 expert activation, and a latent dimension of 2048.
- Pretraining: Two-stage on 20T tokens: Stage 1 (15T) focuses on diversity, Stage 2 (5T) increases high-quality data ratio. Uses NVFP4 4-bit pretraining with <0.4% loss gap vs BF16.
- Long-Context Extension: Continual training on 33B tokens, with 92% of sequences at 1M length.
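To make the "512 experts per layer, top-22 activation" figure concrete, here is a minimal sketch of generic top-k expert gating in pure Python. It is not NVIDIA's router: the function name `top_k_route`, the random stand-in logits, and the choice to softmax-normalize over only the selected experts are all assumptions for illustration. Only the counts 512 and 22 come from the text above.

```python
import math
import random


def top_k_route(logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their gate weights."""
    # Indices of the k largest router logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (subtract the max for stability).
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


NUM_EXPERTS = 512  # experts per layer, from the article
TOP_K = 22         # experts activated per token, from the article

rng = random.Random(0)
# Stand-in router scores for one token (hypothetical; a real router computes these).
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routes = top_k_route(router_logits, TOP_K)

print(len(routes))                                 # experts used for this token
print(round(sum(w for _, w in routes), 6))         # gate weights sum to 1
print(f"{TOP_K / NUM_EXPERTS:.1%} of experts active per token")
```

Only the selected experts run for a given token, which is why the active parameter count (55B) is a small fraction of the total (550B).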
Performance
- Inference Throughput: In long-horizon agent scenarios (8K input/64K output), achieves 5.9×, 4.8×, and 1.6× higher throughput than GLM-5.1, Kimi-K2.6, and Qwen-3.5 respectively, at equal accuracy.
- Long Context: Scores 76.83 on the RULER benchmark at 1M length; the competing models have no reported results at that length.
- General Benchmarks: Leads on MMLU-Pro, GPQA, MATH, HumanEval/MBPP, and others.
Post-Training & Agent Capabilities
- Two-Stage SFT: Covers 10+ domains including long context, multi-step reasoning, multilingual safety, agent trajectories.
- Unified RLVR: Based on asynchronous GRPO, optimized for terminal, code, retrieval, math scenarios.
- MOPD Multi-Teacher Distillation: The core innovation. Trains domain-specific teachers and fuses them via online distillation to address signal dilution.
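The RLVR stage above is described as based on GRPO. The step that gives GRPO its name can be sketched as follows: rewards for several rollouts of the same prompt are normalized against that group's own mean and standard deviation, so no separate value model is needed. This is a minimal illustration, not the released recipe. The function name, the example rewards, the use of the population standard deviation, and the `eps` term are assumptions, and the asynchronous scheduling is not shown.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical verifiable rewards for 4 rollouts of one prompt (1.0 = check passed).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # passing rollouts get positive advantages, failing ones negative
```

When every rollout in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal.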
Companion Tool: NeMo AutoModel
NVIDIA also open-sourced NeMo AutoModel, a library optimized for MoE fine-tuning and built on Hugging Face Transformers v5. A single import enables:
- Fine-tuning Speedup: 3.69× throughput increase on Qwen3-30B-A3B (TPS/GPU from 3075 to 11340).
- Memory Reduction: Peak memory reduced by 29%-32%.
- Core Technologies: Expert parallelism (EP), DeepEP communication fusion, TransformerEngine kernel acceleration.
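As a rough picture of the expert parallelism (EP) listed above, the sketch below shards experts across ranks and groups each token's expert assignments by destination rank, which is the bookkeeping that precedes an all-to-all exchange. It is a conceptual toy in pure Python, not the NeMo AutoModel or DeepEP API. The helper names `owner_rank` and `dispatch`, the contiguous sharding scheme, and the 128-expert, 8-rank setup are hypothetical.

```python
from collections import defaultdict


def owner_rank(expert_id, num_experts, ep_size):
    """Rank hosting an expert when experts are split into contiguous, equal shards."""
    assert num_experts % ep_size == 0, "sketch assumes experts divide evenly across ranks"
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank


def dispatch(token_to_experts, num_experts, ep_size):
    """Group (token, expert) pairs by destination rank, as an all-to-all send would."""
    buckets = defaultdict(list)
    for token_id, experts in token_to_experts.items():
        for expert_id in experts:
            rank = owner_rank(expert_id, num_experts, ep_size)
            buckets[rank].append((token_id, expert_id))
    return dict(buckets)


# Hypothetical setup: 128 experts sharded over 8 ranks, 2 experts chosen per token.
routing = {0: [3, 100], 1: [17, 18], 2: [127, 0]}
plan = dispatch(routing, num_experts=128, ep_size=8)
for rank in sorted(plan):
    print(rank, plan[rank])
```

Each rank then runs only its local experts on the tokens it received. Fusing this exchange with computation is the kind of cost that communication libraries such as DeepEP aim to reduce.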
Open Source & Impact
The full-stack open release of Nemotron 3 Ultra (weights, data, recipes, inference code) lowers the barrier to working with large MoE models. Its hybrid architecture and agent optimization provide efficient solutions for long-context, multi-tool scenarios. NeMo AutoModel further simplifies MoE fine-tuning and could help drive community adoption.