NVIDIA Debuts Nemotron 3 Family of Open Models

The AI hype cycle operates on a predictable loop. A tech giant drops a new model, marketing teams scream about benchmark supremacy, and engineers are left to figure out if it actually compiles or if it's just more vaporware.

NVIDIA just dropped the Nemotron 3 family. If you ignore the corporate press release fluff about powering "transparent, efficient and specialized agentic AI," what you actually have is a hardware monopoly aggressively commoditizing the software layer. They want to sell more silicon. Giving away highly optimized, open-weight models is the easiest way to ensure you stay locked into the CUDA ecosystem.

But cynicism aside, the architecture here demands attention. Released in December 2025, Nemotron 3 isn't just another generic Transformer clone. They went all in on a Hybrid Mamba-Transformer Mixture of Experts (MoE) design. For edge deployment and agentic workflows, this is the most interesting technical release of the year.

## The Nemotron 3 Lineup

NVIDIA split the family into three tiers: Nano, Super, and Ultra. As of right now, only Nano is in the wild. Super and Ultra are slated for the first half of 2026. Here is the breakdown of what we are dealing with.

| Model | Total Parameters | Active Parameters | Architecture | Status | Target Hardware |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Nano** | 31.6B | 3.2B (3.6B w/ embeds) | Mamba-Transformer MoE | Released (Dec 2025) | Consumer GPUs / Edge |
| **Super** | Unknown | Unknown | Mamba-Transformer MoE | H1 2026 | Enterprise / Multi-GPU |
| **Ultra** | Unknown | Unknown | Mamba-Transformer MoE | H1 2026 | Data Center Clusters |

The open-weight license is genuinely permissive. You can use it commercially and modify it without jumping through arbitrary legal hoops.

## The Architecture: Why Hybrid Mamba-Transformer MoE Matters

Most of the industry is stuck trying to brute-force the quadratic scaling bottleneck of traditional attention mechanisms.
If you want a massive context window in a standard Transformer, you pay for it in VRAM and compute. NVIDIA bypassed this by hybridizing State Space Models (Mamba) with Transformers, and then wrapping the whole thing in a sparse Mixture of Experts (MoE) routing layer.

### The Mamba Advantage

Mamba operates in linear time complexity relative to sequence length. It achieves this by updating a hidden state sequentially, much like the RNNs of the past, but optimized for modern parallel hardware.

The problem with pure SSMs has always been recall. They tend to "forget" precise needles in massive haystacks because they compress everything into a fixed-size state representation. Nemotron 3 fixes this by interleaving standard self-attention layers with Mamba blocks. You get the linear scaling and massive throughput of Mamba for the bulk of the processing, with the sharp recall of attention where it counts.

### The MoE Efficiency Play

Look at the Nano specs: 31.6B total parameters, but only 3.2B active during forward passes (3.6B if you count the embedding layer). This is a memory bandwidth play.

To run this model at FP16, you need roughly 63GB of VRAM (31.6B parameters at 2 bytes each) just to hold the weights. But to generate a token, the router only activates 3.2B parameters.

Compute is cheap. Memory bandwidth is the real bottleneck in single-stream LLM inference. By dropping the active parameter count to ~10% of the total model size, Nemotron 3 Nano spits out tokens at absurd speeds once the weights are loaded. Each decoded token only has to pull the active weights across the memory bus, so throughput is bounded by the traffic of a ~3B model rather than a 31.6B one.

## Running Nano on Edge Hardware

NVIDIA designed Nano specifically for local deployment on consumer and edge hardware. You don't need an H100 cluster for this. If you have a Mac Studio or a dual RTX 3090 / 4090 setup, you can run this locally right now, though 48GB of combined VRAM means a quantized build rather than full FP16.

Forget the heavy Python wrappers for a second. The cleanest way to serve this is via vLLM.
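
Before picking a serving stack, it helps to sanity-check the memory math from the MoE section. The sketch below is a back-of-the-envelope estimate, not a benchmark. It counts raw weight bytes only (no KV cache, SSM state, activations, or quantization scales), and it assumes every active weight is read from memory exactly once per decoded token. The function names are hypothetical, and `bandwidth_gb_s` is an input you would take from your own GPU's spec sheet.

```python
TOTAL_PARAMS = 31.6e9   # total parameters, from the lineup table
ACTIVE_PARAMS = 3.2e9   # parameters activated per token, excluding embeddings


def weight_gb(params: float, bits_per_param: float) -> float:
    """Raw weight footprint in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    return params * bits_per_param / 8 / 1e9


def decode_ceiling_tok_s(bandwidth_gb_s: float, bits_per_param: float) -> float:
    """Bandwidth-bound upper limit on single-stream decode speed.

    Assumes the active weights are the only memory traffic per token,
    so real-world throughput will land below this number.
    """
    return bandwidth_gb_s / weight_gb(ACTIVE_PARAMS, bits_per_param)


if __name__ == "__main__":
    for label, bits in [("FP16/BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
        total = weight_gb(TOTAL_PARAMS, bits)
        active = weight_gb(ACTIVE_PARAMS, bits)
        print(f"{label:>9}: {total:5.1f} GB total weights, {active:4.1f} GB read per token")
```

At 16 bits this gives about 63 GB of weights, and at 4 bits about 15.8 GB before quantization overhead. Plugging in roughly 1,000 GB/s of memory bandwidth at 4 bits yields a ceiling of about 625 tokens per second, which is a theoretical limit and not a throughput you should expect to hit.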
### Serving with vLLM

Assuming you have a machine with at least 80GB of combined VRAM, you can spin up an OpenAI-compatible API server in one command. (vLLM primarily targets CUDA GPUs, so on Apple Silicon unified memory you will likely need a different runtime.)

```bash
# Start the vLLM server with tensor parallelism across 2 GPUs.
# The model ID is the one used throughout this post; confirm the exact
# repository name on Hugging Face before pulling.
python3 -m vllm.entrypoints.openai.api_server \
  --model nvidia/nemotron-3-nano-31b-instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```

Because of the Mamba architecture, you might need to build vLLM from source if the packaged release you have installed predates the December 2025 launch.

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```

### Quantization for Single-GPU Deployment

If you want to stuff Nano onto a single 24GB RTX 4090, you are going to have to quantize it. EXL2 or AWQ formats are your best bets here. At 4-bit quantization, the 31.6B parameters compress down to roughly 17GB of VRAM.

```bash
# Example using text-generation-inference (TGI) with AWQ.
# The AWQ repository name is illustrative; check which quantized
# variants are actually published before running this.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v "$PWD/data:/data" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id nvidia/nemotron-3-nano-31b-awq \
  --quantize awq \
  --max-input-length 8192 \
  --max-total-tokens 16384
```

At 4-bit, the degradation in reasoning quality is measurable but acceptable for standard agentic routing tasks. The throughput on a 4090 running a 3.2B active parameter MoE is staggering. Expect upwards of 120 tokens per second.

## Building Agentic Workflows

NVIDIA heavily marketed this family for "agentic AI applications." That isn't just a buzzword. It reflects a specific engineering optimization.

Agentic workflows don't rely on zero-shot generation. They rely on massive iteration. An autonomous agent needs to generate a thought, invoke a tool, parse the result, realize it failed, and try again. This requires an architecture with low time-to-first-token (TTFT) and high sustained throughput.
If your agent is sitting around for five seconds waiting for a 70B dense model to finish generating a JSON blob, the loop is dead.

### The Speed vs. Intelligence Tradeoff

Nemotron 3 Nano is not GPT-4. It is not going to write a novel or solve complex math proofs zero-shot. It is a highly specialized engine for fast, structured reasoning. You wrap it in a framework that enforces a strict schema.

Here is a practical Python implementation using the `openai` SDK to talk to your local vLLM instance. We will force the model to output a strict JSON action for a web-scraping agent.

```python
import json

from openai import OpenAI

# Connect to the local vLLM instance started earlier.
# The key is a placeholder; it only matters if the server was started with an API key.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-local-dev",
)


def agent_loop(objective: str, max_iterations: int = 5):
    """Run a JSON-action loop; return the final action dict, or None on failure."""
    system_prompt = """
    You are an autonomous agent. Output ONLY valid JSON in the following format:
    {
        "thought": "Your reasoning here",
        "action": "search" | "click" | "extract" | "finish",
        "target": "URL or query string"
    }
    """

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Objective: {objective}"},
    ]

    for i in range(max_iterations):
        response = client.chat.completions.create(
            model="nvidia/nemotron-3-nano-31b-instruct",
            messages=messages,
            temperature=0.1,  # Low temperature keeps output close to deterministic
            response_format={"type": "json_object"},
        )

        # Guard against an empty completion so json.loads gets a string
        raw_output = response.choices[0].message.content or ""
        print(f"Iteration {i+1} Output:\n{raw_output}\n")

        try:
            action_data = json.loads(raw_output)

            # Route action to your local tools here
            if action_data.get("action") == "finish":
                print("Task Complete.")
                return action_data

            # Simulate tool output
            tool_result = (
                f"Simulated result for {action_data.get('action')} "
                f"on {action_data.get('target')}"
            )
            messages.append({"role": "assistant", "content": raw_output})
            messages.append({"role": "user", "content": f"Tool Output: {tool_result}"})

        except json.JSONDecodeError:
            # Feed the bad output back so the model can correct itself
            print("Model hallucinated invalid JSON. Retrying.")
            messages.append({"role": "assistant", "content": raw_output})
            messages.append({"role": "user", "content": "Error: Invalid JSON. Please fix."})

    print("Max iterations reached. Failing gracefully.")
    return None


# Execute the loop
agent_loop("Find the latest pricing for DigitalOcean GPU droplets.")
```

Because Nano only activates 3.2B parameters, this 5-step retry loop can complete in roughly the time it takes a massive dense model to generate its first two sentences. Speed unlocks reliability through brute-force iteration.

## The 2026 Horizon: Super and Ultra

NVIDIA expects to drop Super and Ultra in the first half of 2026. What follows is speculation, extrapolated from Nano's sizing: Super will likely target the 100B-150B parameter range (with maybe 15B active), designed to fit across a standard 8x H100 node. Ultra will be the flagship, likely pushing past the 500B parameter mark for massive multi-node deployments.

If Nano is the fast routing agent, Super will be the orchestrator. You will run Nano on the edge to handle immediate, low-latency API interactions and tool use, while offloading complex reasoning tasks to Super sitting in your data center.

This tiered approach is standard industry practice now. You don't use a sledgehammer to drive a thumbtack.

## Hardware Optimization and The Underlying Strategy

We have to talk about why NVIDIA is doing this.

DigitalOcean is already publishing tutorials on running Nemotron 3 on their GPU Droplets. Cloud providers are tripping over themselves to offer one-click deployments of these open-weight models.

By pushing a highly efficient Mamba-Transformer MoE architecture, NVIDIA is explicitly increasing the ROI of consumer and edge GPUs. If a small startup can run a highly capable agentic loop on a cluster of RTX 4090s or an Apple Silicon rack instead of renting premium A100 time, the barrier to entry plummets. NVIDIA makes money regardless.
They are seeding the ecosystem with extremely capable, locally deployable software to ensure that the hardware demand remains bottomless. When open-source models match or beat closed APIs for specific agentic tasks, developers stop paying OpenAI and start buying hardware.

## Practical Takeaways

Stop treating Nemotron 3 as a chatbot. It is a programmatic reasoning engine.

1. **Audit your Agent Architecture:** If you are using GPT-4o or Claude 3.5 Sonnet for simple JSON routing, you are burning money and latency. Swap the router for Nemotron 3 Nano.
2. **Standardize on vLLM:** If your infrastructure relies on older inference engines, upgrade. The Mamba/MoE hybrid requires modern kernels to see the actual speed benefits.
3. **Quantize and Deploy Local:** Pull the AWQ or EXL2 variants. Throw them on a 24GB consumer GPU. Measure the tokens/sec. The throughput will fundamentally alter how you design your retry loops.
4. **Prepare for Super:** Build your orchestrator/worker architecture now. Put Nano on the worker nodes. When Super drops in H1 2026, you plug it directly into the orchestrator slot without rewriting your pipelines.

The era of massive, monolithic dense models dominating every workflow is over. Fast, sparse, highly specific MoEs running locally are the baseline. Nemotron 3 just set the new standard for the edge. Build accordingly.
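
For takeaway 3, here is a minimal sketch of how you might measure tokens/sec against an OpenAI-compatible endpoint. The helper name `measure_tokens_per_second` is hypothetical. It times one non-streaming request end to end, so prompt processing is included in the figure, and it assumes the server reports `usage.completion_tokens` in its response. Treat the result as a rough comparison number between configurations, not a rigorous benchmark.

```python
import time


def measure_tokens_per_second(client, model: str, prompt: str, max_tokens: int = 256) -> float:
    """Time one chat completion and return generated tokens per wall-clock second.

    `client` is any object exposing the OpenAI SDK's
    chat.completions.create(...) interface, e.g. openai.OpenAI(base_url=...).
    """
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start

    # Token count as reported by the server, not re-tokenized locally
    return response.usage.completion_tokens / elapsed
```

Run it a few times and discard the first result, since the first request often pays one-time warmup costs.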